## Character Affinity in Avatar: The Last Airbender

#### Project Background
I love this show, and unfortunately I don't get to work with this kind of data at my job.

#### Goals
- Play around with a web scraper
- Display a relationship between words using D3

#### D3 Result
The resulting visualization is available [here](http://ameliachu.github.io/avatar_project/).

*Last Updated May 7, 2017.* 

#### Importing Required Libraries for this Notebook

In [7]:
from bs4 import BeautifulSoup
import urllib
import re

import pandas as pd

from nltk import word_tokenize
from collections import Counter

#### Programmatically Assembling a List of Episodes
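It just so happens that every season has 20 episodes except season 3, which has 21, and luckily the site I'm scraping labels its fan-created transcripts by season and episode number (for example `101` or `321`). The extra episode, `321`, is added to the list by hand.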

In [3]:
list_of_episodes = ['321']
seasons = ['1','2','3']
for season in seasons:
    for episode in range(1,21):
        if episode < 10:
            episode_str = season+'0'+str(episode)
        else:
            episode_str = season+str(episode)
        list_of_episodes.append(episode_str)       
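
The same list can be built more compactly with zero-padded string formatting. This is only a sketch of an alternative, written for Python 3, and not the code that produced the results in this notebook; `build_episode_ids` and its parameters are hypothetical names. The extra episode `321` is placed first to mirror the ordering used in the cell above.

```python
def build_episode_ids(seasons=(1, 2, 3), episodes_per_season=20, extras=("321",)):
    """Return episode ids such as '101' ... '320', with any extras first."""
    ids = list(extras)
    for season in seasons:
        for episode in range(1, episodes_per_season + 1):
            # '{:02d}' pads single-digit episode numbers with a leading zero.
            ids.append("{}{:02d}".format(season, episode))
    return ids


list_of_episodes = build_episode_ids()
print(len(list_of_episodes))
```

Keeping the season and episode counts as parameters makes it easier to reuse the helper if the site's numbering scheme turns out to differ.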

#### Importing, Cleaning, and Converting the transcripts to .tsv

In [24]:
for episode_number in list_of_episodes:
    #Defining what the url should be for each episode
    site_url = 'http://atla.avatarspirit.net/transcripts.php?num={episode_number}'.format(episode_number=episode_number)
    #Reading what is on each page
    r = urllib.urlopen(site_url).read()
    soup = BeautifulSoup(r,"lxml")
    #Taking in only the places where the class is 'content'
    content = soup.find_all("div", class_="content")

    episode_page = """"""
    for element in content:
        episode_page += str(element)
    #Splitting into lines on <br/> and dropping empty strings
    stage_1 = episode_page.split("<br/>")
    stage_2 = filter(None,stage_1)
    #Stripping any remaining html tags
    stage_3 = []
    for line in stage_2:
        new_line = re.sub(r'<[^>]*>','',line)
        stage_3.append(new_line)
    #Stripping parenthesised stage directions, then dropping lines left empty
    stage_4 = []
    for line in stage_3:
        new_line = re.sub(r'\([^)]*\)','',line)
        if new_line != '':
            stage_4.append(new_line)
    #Creating a tsv for each episode page
    #Formatting so that each row is the character name then their line
    csv_file = "/users/chuamelia/Downloads/avatar_{episode_number}.tsv".format(episode_number=episode_number)
    content_csv = "character\tline \n"
    for line in stage_4:
        if line != stage_4[0]:
            content_csv += line.replace(':','\t',1) + "\n"
    with open(csv_file, "w") as text_file:
        text_file.write(content_csv)

    process_csv = pd.read_csv(csv_file,sep="\t",header=0)
    process_csv["character"] = process_csv["character"].str.strip()
    df = process_csv.dropna()
    #For Sanity, printing out lines created per page
    print "Processed: " + str(len(df)) + " lines for " + episode_number
    df.to_csv(csv_file, sep="\t",index=False)

Processed: 76 lines for 321
Processed: 174 lines for 101
Processed: 132 lines for 102
Processed: 210 lines for 103
Processed: 213 lines for 104
Processed: 163 lines for 105
Processed: 194 lines for 106
Processed: 170 lines for 107
Processed: 188 lines for 108
Processed: 211 lines for 109
Processed: 261 lines for 110
Processed: 193 lines for 111
Processed: 227 lines for 112
Processed: 109 lines for 113
Processed: 251 lines for 114
Processed: 206 lines for 115
Processed: 219 lines for 116
Processed: 191 lines for 117
Processed: 192 lines for 118
Processed: 162 lines for 119
Processed: 140 lines for 120
Processed: 180 lines for 201
Processed: 198 lines for 202
Processed: 176 lines for 203
Processed: 156 lines for 204
Processed: 189 lines for 205
Processed: 198 lines for 206
Processed: 153 lines for 207
Processed: 194 lines for 208
Processed: 158 lines for 209
Processed: 187 lines for 210
Processed: 186 lines for 211
Processed: 192 lines for 212
Processed: 157 lines for 213
Processed: 204 
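
The cleaning steps in the cell above (split on `<br/>`, strip tags, strip parenthesised stage directions, split the speaker from the line at the first colon) can be tried on a small in-memory sample without hitting the site. The sketch below is written for Python 3 and makes several assumptions: `sample_html` is an invented stand-in for one `content` div, the dialogue in it is made up, and `clean_transcript` is a hypothetical helper rather than the notebook's own code.

```python
import re

# Invented sample standing in for the contents of one 'content' div.
sample_html = (
    '<div class="content"><b>Episode title</b><br/>'
    "<b>Katara</b>: (smiling) Aang, wait for us!<br/>"
    "<b>Sokka</b>: Katara, look! (points at the water)<br/>"
    "<br/>"
    "(Scene cuts to the iceberg.)<br/></div>"
)


def clean_transcript(html_text):
    """Return (character, line) pairs from a transcript-like HTML string."""
    rows = []
    chunks = html_text.split("<br/>")
    # Skip the leading title chunk, roughly as the cell above skips its first line.
    for chunk in chunks[1:]:
        text = re.sub(r"<[^>]*>", "", chunk)   # drop HTML tags
        text = re.sub(r"\([^)]*\)", "", text)  # drop (stage directions)
        if ":" not in text:
            continue  # no speaker label, so not a spoken line
        character, line = text.split(":", 1)
        rows.append((character.strip(), line.strip()))
    return rows


rows = clean_transcript(sample_html)
for character, line in rows:
    print(character + "\t" + line)
```

Chunks without a colon are skipped here, which plays roughly the role of the `dropna()` call in the notebook.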

#### Defining a Unique List of Characters for use later

In [28]:
master_list_of_characters = []
for episode_number in list_of_episodes:
    csv_file = "/users/chuamelia/Downloads/avatar_{episode_number}.tsv".format(episode_number=episode_number)
    process_csv = pd.read_csv(csv_file,sep="\t",header=0)
    #process_csv["character"]= process_csv["character"].str.strip()
    characters = process_csv["character"].unique()
    characters = characters.tolist()
    num_characters = len(characters)
    for character in characters:
        master_list_of_characters.append(character)
    #Print response for sanity
    print """Done with episode {episode_number}, there are {num_characters} characters...""".format(episode_number=episode_number,num_characters=num_characters)
master_list_of_characters = list(set(master_list_of_characters))


Done with episode 321, there are 19 characters...
Done with episode 101, there are 17 characters...
Done with episode 102, there are 16 characters...
Done with episode 103, there are 16 characters...
Done with episode 104, there are 18 characters...
Done with episode 105, there are 14 characters...
Done with episode 106, there are 17 characters...
Done with episode 107, there are 15 characters...
Done with episode 108, there are 18 characters...
Done with episode 109, there are 14 characters...
Done with episode 110, there are 21 characters...
Done with episode 111, there are 12 characters...
Done with episode 112, there are 25 characters...
Done with episode 113, there are 21 characters...
Done with episode 114, there are 22 characters...
Done with episode 115, there are 19 characters...
Done with episode 116, there are 23 characters...
Done with episode 117, there are 20 characters...
Done with episode 118, there are 18 characters...
Done with episode 119, there are 15 characters...


###### How many unique characters are there in this dataset?

In [29]:
len(master_list_of_characters)

407

###### A hacky way of removing entries that are not actually characters (anything longer than 30 characters)

In [62]:
master_list_of_characters = [character for character in master_list_of_characters if len(character) <= 30]
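
The length filter still lets through entries such as `Directed By`, `(Scene`, and names containing a leftover `&gt;` entity, all of which appear in the list this notebook prints. A slightly stricter cleanup could look like the sketch below. It is written for Python 3 (it relies on `html.unescape`), and the `NON_CHARACTERS` set and the `clean_names` helper are assumptions made for illustration, not part of the original pipeline.

```python
import html

# A few raw names copied from the list printed by this notebook.
raw_names = ["Azula", "(Scene", "Directed By", "Written by",
             "Animation By", "Male &gt;Guard", "Aunt Wu"]

# Assumption: these labels are credits or scene markers, not speakers.
NON_CHARACTERS = {"directed by", "written by", "animation by", "(scene"}


def clean_names(names, max_length=30):
    """Unescape HTML entities and drop labels that are not characters."""
    cleaned = []
    for name in names:
        name = html.unescape(name)            # '&gt;' becomes '>'
        name = name.replace(">", "").strip()  # drop the stray '>' that remains
        if len(name) > max_length:
            continue
        if name.lower() in NON_CHARACTERS:
            continue
        cleaned.append(name)
    return cleaned


cleaned_names = clean_names(raw_names)
print(cleaned_names)
```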

In [50]:
master_list_of_characters

['(Scene',
 'Azula',
 'Jin',
 'Zuko and Aang',
 'Sela',
 'Hippo',
 'Crone Teacher #1',
 'Crone Teacher #2',
 'Fire Sage',
 'Mayor Tong',
 'Aunt Wu',
 'Actress &gt;Aang',
 'Avatar Kuruk',
 'Male &gt;Guard',
 'Mask Merchant',
 'Aang and Gyatso',
 'Kya',
 'Zookeeper',
 'Chief Arnook',
 'Doorman',
 'Canyon Guide',
 'Food Merchant',
 'Directed By',
 'Calmn Man',
 'Gyatso',
 'Calm Man',
 'Farmer',
 'Aang, Sokka, and Katara',
 'General',
 'Sozin',
 'General Mung',
 'Circus Trainers',
 'Fire Nation Officer',
 'Actor Bumi',
 'Audience Member',
 'Old Sage',
 "Katara's Mum",
 'Elder 2',
 'Elder 1',
 'Actor Toph',
 'Herald',
 'Spectator',
 'Pakku',
 'Yue',
 'School Headmaster',
 'Actress Yue',
 'Animation By',
 'Blue Spirit',
 'Joo Dees',
 'Koh',
 'Jet',
 'Older Boy',
 'Written by',
 'Jeong Jeong',
 'Earth King',
 'Sokka and Aang',
 'Hue',
 'Star',
 'Male Guard',
 'Female Fire Soldier',
 'June',
 'Chit Sang',
 'Warden',
 'Madame Macmu-Ling',
 'Sokka (from o.c.',
 'Terra Team Leader',
 'Fire Nation

#### Creating a .tsv file for each character for later conversion to json

Desired end result: source = character, target = character spoken to, value = number of times spoken to. "Spoken to" is approximated by counting how often the target's name appears in the source's lines.

1. Create one file per character, listing the other characters they mention.
2. Aggregate the number of mentions per target.
3. Append that to a final file ready for direct conversion to JSON.
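
The steps above can be sketched in memory before any files are involved. This is a minimal Python 3 illustration under stated assumptions: the `rows` and `known_characters` values are invented, `count_mentions` is a hypothetical helper, and a simple regular expression stands in for the `word_tokenize` call the notebook actually uses.

```python
import re
from collections import Counter

# Invented (character, line) rows standing in for the per-episode .tsv files.
rows = [
    ("Katara", "Aang, wait for Sokka!"),
    ("Katara", "Sokka, where is Aang?"),
    ("Sokka", "Katara, I am hungry."),
    ("Aang", "Appa, yip yip!"),
]
known_characters = {"Aang", "Katara", "Sokka", "Appa"}


def count_mentions(rows, known_characters):
    """Map each speaker to a Counter of character names found in their lines."""
    mentions = {}
    for speaker, line in rows:
        # Simple word split; the notebook itself uses nltk's word_tokenize.
        tokens = re.findall(r"[A-Za-z']+", line)
        counter = mentions.setdefault(speaker, Counter())
        counter.update(t for t in tokens if t in known_characters)
    return mentions


links = [
    {"source": speaker, "target": target, "value": value}
    for speaker, counter in sorted(count_mentions(rows, known_characters).items())
    for target, value in sorted(counter.items())
]
for link in links:
    print(link)
```

Because matching is done one token at a time, multi-word names such as `Aunt Wu` can never match, which is worth keeping in mind when reading the final counts.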

In [51]:
for character in master_list_of_characters:
    #Building a filesystem-safe filename from the character name
    filename_1 = character.lower()
    filename =''.join(e for e in filename_1 if e.isalnum())
    character_file = "/users/chuamelia/Downloads/avatar_{character}.tsv".format(character=filename)
    for episode_number in list_of_episodes:
        csv_file = "/users/chuamelia/Downloads/avatar_{episode_number}.tsv".format(episode_number=episode_number)
        process_csv = pd.read_csv(csv_file,sep="\t",header=0)
        #Selecting this character's lines ("line " keeps the trailing space from the header written earlier)
        lines_from_episode = process_csv[process_csv["character"]==character]["line "].reset_index()
        list_length = len(lines_from_episode["line "])
        #Collecting every token that matches a known character name
        character_tokens=[]
        for i in range(list_length):
            tokens = word_tokenize(lines_from_episode["line "][i].decode('utf-8'))
            for token in tokens:
                if token in master_list_of_characters:
                    character_tokens.append(token)
        #Counting mentions per target for this episode
        values = Counter(character_tokens)
        df_2 = pd.DataFrame.from_dict(values, orient='index').reset_index()
        df_3 = df_2.rename(columns={'index':'target', 0:'value'})
        #Writing the counts for the first episode, appending them for every later one
        if episode_number == list_of_episodes[0]:
            df_3.to_csv(character_file, sep="\t",index=False)
        else:
            preprocess_csv = pd.read_csv(character_file,sep="\t",header=0)
            frames = [preprocess_csv, df_3]
            result = pd.concat(frames)
            result.to_csv(character_file, sep="\t",index=False)
        print "processed {episode_number} for {character} sucessfully".format(episode_number=episode_number, character=character)

processed 321 for (Scene sucessfully
processed 101 for (Scene sucessfully
processed 102 for (Scene sucessfully
processed 103 for (Scene sucessfully
processed 104 for (Scene sucessfully
processed 105 for (Scene sucessfully
processed 106 for (Scene sucessfully
processed 107 for (Scene sucessfully
processed 108 for (Scene sucessfully
processed 109 for (Scene sucessfully
processed 110 for (Scene sucessfully
processed 111 for (Scene sucessfully
processed 112 for (Scene sucessfully
processed 113 for (Scene sucessfully
processed 114 for (Scene sucessfully
processed 115 for (Scene sucessfully
processed 116 for (Scene sucessfully
processed 117 for (Scene sucessfully
processed 118 for (Scene sucessfully
processed 119 for (Scene sucessfully
processed 120 for (Scene sucessfully
processed 201 for (Scene sucessfully
processed 202 for (Scene sucessfully
processed 203 for (Scene sucessfully
processed 204 for (Scene sucessfully
processed 205 for (Scene sucessfully
processed 206 for (Scene sucessfully
p

In [137]:
links_list = []
for character in master_list_of_characters:
    filename_1 = character.lower()
    filename =''.join(e for e in filename_1 if e.isalnum())
    #Note: the per-character files are read from characters_preprocess/ here, not from Downloads/ where the previous cell wrote them
    character_file = "/users/chuamelia/Downloads/avatar_project/characters_preprocess/avatar_{character}.tsv".format(character=filename)
    links_file = "/users/chuamelia/Downloads/avatar_project/links/avatar_{character}.tsv".format(character=filename)
    process_csv = pd.read_csv(character_file,sep="\t",header=0)
    #Skipping characters whose file has no rows
    if process_csv.size != 0:
        links_list.append(filename)
        #Summing the per-episode counts for each target
        df = process_csv.groupby("target")['value'].sum()
        df_2 = df.reset_index()
        df_2["source"] = character
        df_3 = df_2[["source","target","value"]]
        df_3.to_csv(links_file, sep="\t",index=False)

In [154]:
final_file = "/users/chuamelia/Downloads/avatar_project/avatar_links.tsv"
for character in master_list_of_characters:
    filename_1 = character.lower()
    filename =''.join(e for e in filename_1 if e.isalnum())
    if filename in links_list:
        links_file = "/users/chuamelia/Downloads/avatar_project/links/avatar_{character}.tsv".format(character=filename)
        process_csv = pd.read_csv(links_file,sep="\t",header=0)
        if process_csv.size != 0:
            if filename == links_list[0]:
                print filename + " this is first"  #sanity checking
                process_csv.to_csv(final_file, sep="\t",index=False)
            else:
                print filename #sanity checking
                final_table = pd.read_csv(final_file,sep="\t",header=0)
                frames = [process_csv, final_table]
                result = pd.concat(frames)
                result.to_csv(final_file, sep="\t",index=False)

lo this is first
li
yu
jin
jet
due
pao
tho
teo
lao
mai
man
sela
june
bato
hama
ursa
zuko
hahn
chan
quon
dock
zhao
chey
guru
hide
ozai
appa
song
roku
bumi
haru
suki
lily
shyu
ying
meng
iroh
ming
tyro
aang
koko
toph
azula
hippo
sozin
pakku
guard
chong
sokka
gansu
oyagi
voice
abbot
scene
gyatso
herald
warden
hakoda
mother
tylee
tylee
azulon
katara
xinfu
shamo
auntwu
yonrha
piandao
oldman
joodee
soldier
tylee
captain
boulder
hamgao
calmman
children
everyone
theduke
governor
students
prisoner
mskwan
firesage
zookeeper
olderboy
chitsang
pipsqueak
kingbumi
student1
youngboy
liandlo
masteryu
grangran
actorjet
messenger
oldwoman
longfeng
mayortong
actortoph
writtenby
earthking
maleguard
actorzuko
writtenby
actorozai
reddragon
littleboy
bureaucrat
smellerbee
shopkeeper
actressyue
jeongjeong
monkgyatso
koalasheep
actorsokka
littlegirl
1stservant
wardenpoon
gurupathik
avatarroku
bullyguard
bluedragon
manyvoices
liamplo
generalhow
shopkeeper
actoruncle
avatarkuruk
chiefarnook
canyonguide
dailiagent

In [177]:
links = links[links["value"] > 2]
#shrinking the number of links to make the data more manageable

In [179]:
len(links)
#number of remaining links

160
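
The aggregate-then-filter idea (sum the per-episode counts for each source and target pair, then keep only pairs with a value above 2) can also be shown without pandas. The following is a plain-Python sketch written for Python 3, not the notebook's method: the `per_episode` records are invented and `aggregate_links` is a hypothetical helper.

```python
from collections import defaultdict

# Invented per-episode counts as (source, target, value) tuples.
per_episode = [
    ("Katara", "Aang", 2),
    ("Katara", "Aang", 3),
    ("Katara", "Sokka", 1),
    ("Sokka", "Katara", 4),
    ("Zuko", "Uncle", 1),
]


def aggregate_links(records, threshold=2):
    """Sum values per (source, target) and keep totals above the threshold."""
    totals = defaultdict(int)
    for source, target, value in records:
        totals[(source, target)] += value
    return [
        {"source": s, "target": t, "value": v}
        for (s, t), v in sorted(totals.items())
        if v > threshold
    ]


strong_links = aggregate_links(per_episode)
print(strong_links)
```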

#### Creating the json files for display in D3

Using [this force-directed graph example](https://bl.ocks.org/mbostock/4062045) as a template.

In [194]:
output_links = "/users/chuamelia/Downloads/avatar_project/avatar_links.json"
links = pd.read_csv(final_file,sep="\t",header=0)
links["value"] = links["value"].astype('int')
#links = links[links["value"] > 2]
links.to_json(output_links,orient='records')

output_nodes ="/users/chuamelia/Downloads/avatar_project/avatar_nodes.json"
ids1 = links["source"].unique()
ids1 = pd.DataFrame(ids1)
ids1 = ids1.rename(columns={0:'id'})
ids2 = links["target"].unique()
ids2 = pd.DataFrame(ids2)
ids2 = ids2.rename(columns={0:'id'})
frames = [ids1, ids2]
ids = pd.concat(frames)
ids = ids["id"].unique()
ids = pd.DataFrame(ids)
ids = ids.rename(columns={0:'id'})
ids["group"] = 1
nodes = ids[['id','group']]
nodes.to_json(output_nodes,orient='records')

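#Reading both json files back in; separate names keep the nodes and links DataFrames intact
with open(output_nodes, "r") as f1:
    nodes_json = f1.read()
with open(output_links, "r") as f2:
    links_json = f2.read()

#Combining both into the single {"nodes": [...], "links": [...]} object used by the D3 template
graph_str = "{\"nodes\": "+ nodes_json + ",\n \"links\": "+ links_json +"}"

final_location = '/Users/chuamelia/Downloads/avatar_project/{brand}_force.json'.format(brand='avatar')
with open(final_location, 'w') as f:
    f.write(graph_str)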
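
As an alternative to concatenating JSON strings by hand, the same `nodes`/`links` structure could be assembled as a dictionary and serialized with the standard `json` module. This is a sketch written for Python 3 under stated assumptions: the `links` records are invented, and `build_graph` is a hypothetical helper rather than the code used to produce the published file.

```python
import json

# Invented links; in the notebook these come from avatar_links.tsv.
links = [
    {"source": "Katara", "target": "Aang", "value": 5},
    {"source": "Sokka", "target": "Katara", "value": 4},
]


def build_graph(links, default_group=1):
    """Build the {'nodes': [...], 'links': [...]} structure from link records."""
    ids = []
    for link in links:
        for name in (link["source"], link["target"]):
            if name not in ids:  # keep first-seen order, no duplicates
                ids.append(name)
    nodes = [{"id": name, "group": default_group} for name in ids]
    return {"nodes": nodes, "links": links}


graph = build_graph(links)
graph_str = json.dumps(graph, indent=2)
print(graph_str)
```

Letting `json.dumps` produce the text avoids quoting mistakes, and the node order differs from the notebook's (sources first, then targets) only cosmetically.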

In [191]:
nodes_file = "/users/chuamelia/Downloads/avatar_project/avatar_nodes.tsv"
nodes.to_csv(nodes_file, sep="\t",index=False)

### Future Direction
- Add labels using the structure set out [here](https://bl.ocks.org/mbostock/950642).
- Assign groups programmatically, then adjust them by hand, to reflect each character's nation affinity.
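
One possible shape for the group assignment is a hand-written lookup applied to the node records, with a fallback group for anyone not yet classified. The sketch below is written for Python 3 and is entirely illustrative: the group numbers, the `CHARACTER_NATION` table (only a few main characters are filled in), and the `assign_groups` helper are all assumptions.

```python
# Assumption: arbitrary group numbers, one per nation.
NATION_GROUPS = {
    "Air Nomads": 1,
    "Water Tribe": 2,
    "Earth Kingdom": 3,
    "Fire Nation": 4,
}
# Assumption: a hand-maintained lookup covering only a few characters.
CHARACTER_NATION = {
    "Aang": "Air Nomads",
    "Katara": "Water Tribe",
    "Sokka": "Water Tribe",
    "Toph": "Earth Kingdom",
    "Zuko": "Fire Nation",
    "Azula": "Fire Nation",
}
UNKNOWN_GROUP = 0  # fallback for characters still to be classified by hand


def assign_groups(nodes):
    """Return copies of the node records with 'group' set from the lookup."""
    result = []
    for node in nodes:
        nation = CHARACTER_NATION.get(node["id"])
        group = NATION_GROUPS.get(nation, UNKNOWN_GROUP)
        result.append({"id": node["id"], "group": group})
    return result


example_nodes = [
    {"id": "Aang", "group": 1},
    {"id": "Zuko", "group": 1},
    {"id": "Cabbage Merchant", "group": 1},
]
grouped_nodes = assign_groups(example_nodes)
print(grouped_nodes)
```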
