# Avatar: The Last Airbender
## *Sentiment Analysis of Characters and Plot over Time*

## Step One: Scrape the Fandom Wiki for Transcript Text

First, we do our imports and define some helper functions.

In [51]:
# import url request library
import urllib.request
# import regular expression library
import re
# import web scraping library
from bs4 import BeautifulSoup

# Define a function to get the HTML content of a URL.
def html_from_url(url):
  # The context manager closes the connection even if read() raises.
  with urllib.request.urlopen(url) as fp:
    return fp.read().decode("utf8")

# Define a function to build the full transcript URL from a link's href attribute.
def make_link(aTag):
  return "https://avatar.fandom.com" + aTag['href']
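
The concatenation in `make_link` works as long as every `href` is site-relative. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves relative hrefs and leaves absolute ones untouched. The names `make_link_safe` and `BASE_URL` are illustrative and not part of the original notebook.

```python
from urllib.parse import urljoin

# Base URL taken from the make_link helper above.
BASE_URL = "https://avatar.fandom.com"

def make_link_safe(a_tag):
    """Build an absolute URL from anything that supports a_tag['href']."""
    # urljoin resolves relative hrefs against BASE_URL and keeps absolute hrefs as they are.
    return urljoin(BASE_URL, a_tag["href"])

# A plain dict stands in for a BeautifulSoup tag here.
print(make_link_safe({"href": "/wiki/Transcript:The_Storm"}))
```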

Next, we fetch the HTML document for the Fandom wiki's transcripts page.

We will use this document to find the links to all the transcripts. 

In [34]:
def get_avatar_fan_wiki_soup():
    wikiURL = "https://avatar.fandom.com/wiki/Avatar_Wiki:Transcripts"
    transcriptSoup = BeautifulSoup(html_from_url(wikiURL), "html.parser")
    return transcriptSoup

avatarTranscriptSoup = get_avatar_fan_wiki_soup()

Now we will scrape that document to get all the transcript links.

First, let's define a function that determines whether a link is a transcript link and whether it belongs to Avatar Books 1-3.

In [88]:
def is_transcript_link(tag):
  isLink = tag.has_attr('href') and tag.has_attr('title')
  if not isLink:
    return False
  
  isTranscript = ('Transcript' in tag['title'])
  if not isTranscript:
    return False

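  # Walk up the tree until an ancestor has an <h3> heading two siblings back
  # (presumably skipping a whitespace text node), then read that heading's id.
  headerId = ''
  for parent in tag.parents:
    prSib = parent.previous_sibling
    prPrSib = prSib.previous_sibling if prSib else None
    if prPrSib and prPrSib.name == 'h3':
      headerId = prPrSib.contents[0]['id']
      break

  # Keep only links listed under the three Avatar: The Last Airbender books.
  result = re.compile('Book_One:_Water|Book_Two:_Earth|Book_Three:_Fire').search(headerId)
  return result is not None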
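
The book filter at the end of `is_transcript_link` can be checked on its own, without fetching any page. The sketch below reuses the same pattern; the helper name `is_atla_book_header` and the non-matching ids in the loop are hypothetical and only there for illustration.

```python
import re

# Same pattern as in is_transcript_link, compiled once at import time.
BOOK_HEADER = re.compile(r"Book_One:_Water|Book_Two:_Earth|Book_Three:_Fire")

def is_atla_book_header(header_id):
    """Return True if a heading id names one of the three Avatar books."""
    return BOOK_HEADER.search(header_id) is not None

# The last two ids are made up to show what gets rejected.
for header_id in ["Book_Two:_Earth", "Book_One:_Air", ""]:
    print(repr(header_id), is_atla_book_header(header_id))
```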

Time to define a function to scrape our html document for desired links. 

Then run it.

In [89]:
def get_transcript_links(transcriptSoup):
  transcriptLinkElements = transcriptSoup.find_all(is_transcript_link)
  transcriptLinks = []

  for elem in transcriptLinkElements:
    transcriptLinks.append(make_link(elem))
  return transcriptLinks

transcriptLinks = get_transcript_links(avatarTranscriptSoup)

Since we should have the links now, let's check our data.

In [90]:
print('\n'.join([str(x) for x in transcriptLinks]))

https://avatar.fandom.com/wiki/Transcript:The_Boy_in_the_Iceberg
https://avatar.fandom.com/wiki/Transcript:The_Avatar_Returns
https://avatar.fandom.com/wiki/Transcript:The_Southern_Air_Temple
https://avatar.fandom.com/wiki/Transcript:The_Warriors_of_Kyoshi
https://avatar.fandom.com/wiki/Transcript:The_King_of_Omashu
https://avatar.fandom.com/wiki/Transcript:Imprisoned
https://avatar.fandom.com/wiki/Transcript:Winter_Solstice,_Part_1:_The_Spirit_World
https://avatar.fandom.com/wiki/Transcript:Winter_Solstice,_Part_2:_Avatar_Roku
https://avatar.fandom.com/wiki/Transcript:The_Waterbending_Scroll
https://avatar.fandom.com/wiki/Transcript:Jet_(episode)
https://avatar.fandom.com/wiki/Transcript:The_Great_Divide
https://avatar.fandom.com/wiki/Transcript:The_Storm
https://avatar.fandom.com/wiki/Transcript:The_Blue_Spirit
https://avatar.fandom.com/wiki/Transcript:The_Fortuneteller
https://avatar.fandom.com/wiki/Transcript:Bato_of_the_Water_Tribe
https://avatar.fandom.com/wiki/Transcript:The_Des

### Now we have the links to all of the Avatar transcripts. Next, we need to get the data from each link.

Let's start by defining an Episode object.

In [115]:
class Episode:
  def __init__(self, title, lines):
    self.title = title
    self.lines = lines

Now define a function to convert a link to an Episode object.

In [136]:
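def link_to_episode(link):
    soup = BeautifulSoup(html_from_url(link), "html.parser")
    # Drop the first 11 characters of the page title, presumably the "Transcript:" prefix.
    title = soup.select('.page-header__title')[0].text[11:]
    lines = []
    for tr in soup.select('.wikitable > tr'):
        # Dialogue rows carry the speaker in <th> and the spoken text in <td>;
        # skip rows that are missing either cell.
        if tr.th and tr.td:
            lines.append([tr.th.text.strip(), tr.td.text.strip()])
    return Episode(title, lines)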
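
The `[11:]` slice in `link_to_episode` relies on the page title starting with an 11-character prefix, which appears to be `Transcript:`. A slightly more defensive sketch, assuming that prefix, removes it only when it is actually present. The helper name `strip_transcript_prefix` is hypothetical.

```python
def strip_transcript_prefix(page_title, prefix="Transcript:"):
    """Remove the assumed 'Transcript:' prefix from a page title, if present."""
    cleaned = page_title.strip()
    if cleaned.startswith(prefix):
        cleaned = cleaned[len(prefix):]
    return cleaned.strip()

print(strip_transcript_prefix("Transcript:The Boy in the Iceberg"))
```

Unlike a fixed slice, this leaves a title unchanged if the wiki ever serves it without the prefix.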

We'll iterate through the list of links and convert them all to Episode objects.

In [None]:
avatarEpisodes = [link_to_episode(link) for link in transcriptLinks]
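
The list comprehension requests every page back to back and stops at the first failure. One possible variant, sketched below, pauses between requests and records failures instead of aborting. The function name `collect_episodes`, its parameters, and the one-second default delay are assumptions for illustration, not part of the original notebook.

```python
import time

def collect_episodes(links, convert, delay_seconds=1.0):
    """Convert each link with `convert`, pausing between requests.

    Returns (episodes, failures); failures is a list of (link, error) pairs.
    """
    episodes, failures = [], []
    for index, link in enumerate(links):
        if index > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        try:
            episodes.append(convert(link))
        except Exception as error:  # broad on purpose: record the failure and continue
            failures.append((link, error))
    return episodes, failures
```

In the notebook this could be called as `avatarEpisodes, failed = collect_episodes(transcriptLinks, link_to_episode)`, after which `failed` lists any links worth retrying.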

Let's see if our data looks good.
Check the length to see if there are enough episodes.
Check the first episode to see if it's formatted correctly.

In [138]:
print(len(avatarEpisodes))
print(avatarEpisodes[0].title)
print(avatarEpisodes[0].lines)

64
The Boy in the Iceberg
[['Katara', "Water. Earth. Fire. Air. My grandmother used to tell me stories about the old days: a time of peace when the Avatar kept balance between the Water Tribes, Earth Kingdom, Fire Nation and Air Nomads. But that all changed when the Fire Nation attacked. Only the Avatar mastered all four elements; only he could stop the ruthless firebenders. But when the world needed him most, he vanished. A hundred years have passed, and the Fire Nation is nearing victory in the war. Two years ago, my father and the men of my tribe journeyed to the Earth Kingdom to help fight against the Fire Nation, leaving me and my brother to look after our tribe. Some people believe that the Avatar was never reborn into the Air Nomads and that the cycle is broken, but I haven't lost hope. I still believe that, somehow, the Avatar will return to save the world."], ['Sokka', "It's not getting away from me this time. [Close-up of the boy as he grins confidently over his shoulder in t
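
Beyond reading the printed output, a quick tally of lines per speaker can help confirm that the `[speaker, text]` pairs were parsed sensibly. This is a minimal sketch; `count_lines_by_speaker` is a hypothetical helper, and the sample data below is a hand-written placeholder rather than scraped output.

```python
from collections import Counter

def count_lines_by_speaker(lines):
    """Count how many [speaker, text] pairs each speaker has."""
    return Counter(speaker for speaker, _text in lines)

# Placeholder rows in the same [speaker, text] shape as Episode.lines.
sample_lines = [
    ["Katara", "placeholder text 1"],
    ["Sokka", "placeholder text 2"],
    ["Katara", "placeholder text 3"],
]
print(count_lines_by_speaker(sample_lines).most_common())
```

On the real data, `count_lines_by_speaker(avatarEpisodes[0].lines)` would give the same tally for the first episode.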

Save the data as a JSON file so we don't have to do the web scraping again.

In [142]:
import json

def save_avatar_episodes_as_json(episodes):
  eps = {
    'episodes': [{"title": e.title, "lines": e.lines} for e in episodes]
  }

  with open('avatar-episodes.json', 'w') as f:
    json.dump(eps, f)

#run the function
save_avatar_episodes_as_json(avatarEpisodes)
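
To reuse the saved file in a later session, the data can be read back with `json.load`. The sketch below assumes the `{"episodes": [...]}` layout written by `save_avatar_episodes_as_json`; the function name `load_avatar_episodes` is hypothetical.

```python
import json

def load_avatar_episodes(path="avatar-episodes.json"):
    """Read the saved file and return the list of episode dicts."""
    with open(path, "r", encoding="utf-8") as f:
        # Each entry is a dict with "title" and "lines" keys, as written by the save step.
        return json.load(f)["episodes"]
```

Each returned entry is a plain dict, so `Episode(e["title"], e["lines"])` would rebuild the objects if they are needed.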



### Now we have all the episodes as Python objects and JSON. Time for some NLP!

## Step Two: Use NLTK to Build an Emotion-to-Line Mapping of Episodes