# **Webscraping the script of Neon Genesis Evangelion**

In [1]:
import pandas as pd
import bs4 as bs
import requests
import re
import time

In this notebook, I pull the script of each episode of Neon Genesis Evangelion using the BeautifulSoup and Requests packages, then use regular expressions to extract each separate line of dialogue. Two dictionaries track each character's lines, the number of times they speak, and their word count. The first dictionary covers the entire series. The second is a dictionary of dictionaries that holds the same data per episode, keyed first by episode number and then by the characters in that episode.
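
To make the shape of the two dictionaries concrete, here is a minimal sketch built from three lines of dialogue that appear in the output further down. The helper name `record_line` is an assumption for illustration and is not part of the notebook. The word count mirrors the notebook's rule of counting every whitespace-separated token except the leading `Name:` token.

```python
def record_line(tally, name, text):
    """Add one line of dialogue to `tally` under the speaker `name` (hypothetical helper)."""
    entry = tally.setdefault(name, {"Lines": [], "Line Count": 0, "Word Count": 0})
    entry["Lines"].append(text)
    entry["Line Count"] += 1
    # Same rule as the notebook: all tokens minus the "Name:" prefix.
    entry["Word Count"] += len(text.split()) - 1


character_lines = {}          # whole series: {name: {...}}
lines_by_episode = {1: {}}    # per episode: {episode: {name: {...}}}

sample = [
    ("Misato", "Misato: Here goes!"),
    ("Shinji", "Shinji: Ahh"),
    ("Misato", "Misato: Thanks!"),
]
for name, text in sample:
    record_line(character_lines, name, text)
    record_line(lines_by_episode[1], name, text)

# Series-level and episode-level lookups use the same inner keys.
misato_words = character_lines["Misato"]["Word Count"]
shinji_lines_ep1 = lines_by_episode[1]["Shinji"]["Line Count"]
```

Both structures share the same inner layout, so any code that reads one character's entry works on either level.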

The scripts for this analysis are available online at https://www.animanga.com/scripts/anime_scripts_english.html. The episode URLs follow a consistent naming pattern, so I can pull each episode script sequentially with a loop.
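
As a rough sketch of that URL pattern, the snippet below builds the 26 episode URLs and shows one way to pause between downloads. The names `episode_urls` and `fetch_pages` are assumptions for illustration. The `fetch` argument is any callable that takes a URL and returns page text, so the loop can be tried without touching the network.

```python
import time

# URL pattern used in this notebook; the episode number replaces the braces.
BASE_URL = "https://www.animanga.com/scripts/textesgb/eva{}.html"


def episode_urls(first=1, last=26):
    """Map each episode number in [first, last] to its script URL."""
    return {ep: BASE_URL.format(ep) for ep in range(first, last + 1)}


def fetch_pages(urls, fetch, delay=1.0):
    """Download every page with `fetch`, sleeping `delay` seconds between requests."""
    pages = {}
    for n, (ep, url) in enumerate(urls.items()):
        if n:  # no need to wait before the first request
            time.sleep(delay)
        pages[ep] = fetch(url)
    return pages


urls = episode_urls()
```

In the notebook itself, `fetch` could be something like `lambda url: requests.get(url).text`.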

In [127]:
character_lines = {}
lines_by_episode = {}


for i in range(1,27):
    current_ep = {}

    webpage = requests.get(f'https://www.animanga.com/scripts/textesgb/eva{i}.html')
    soup = bs.BeautifulSoup(webpage.text, 'lxml')
    page_text = soup.get_text()

    #------------------------------------------------------------------------

    regex = re.compile(r"(?m).+:[^*#:]+\n (?:(?![#*]).)*") # regular expression to match each separate line of dialogue
    result = re.findall(regex,page_text)

    #------------------------------------------------------------------------

    # iterating through each line of text and recording it in both the series-wide
    # dictionary (character_lines) and the dictionary for this episode (current_ep)
    for text in result:
        character_name = re.findall('[^:#]*',text)[0] # regex to pull the name of the character speaking
        character_name = character_name.strip('!"#$%(*+, -./:;<=>?@[\\]^_`{|}~') # stripping leading + trailing punctuation
        word_count_line = len(text.split())-1 # every token except the leading "Name:"
        # cutting the line at the first divider of 12 or more dashes, then removing line breaks
        text = text.split('------------')[0].replace('\n','').replace('\r','')

        for tally in (character_lines, current_ep):
            # create the entry the first time a character speaks, otherwise add to the existing one
            entry = tally.setdefault(character_name, {'Lines': [], 'Line Count': 0, 'Word Count': 0})
            entry['Lines'].append(text)
            entry['Line Count'] += 1
            entry['Word Count'] += word_count_line

        # ------------------------------------------------------------------------

    lines_by_episode[i] = current_ep

    # there shouldn't be a risk of overloading the server, but added a short wait in between URL requests just in case
    time.sleep(1)

In [125]:
# entire series: dictionary - {character name: {'Lines': [...], 'Line Count': n, 'Word Count': n}}
sorted_lines = dict(sorted(character_lines.items())) 

# individual episodes: dictionary of dictionaries - {episode number: {character name: {'Lines': [...], 'Line Count': n, 'Word Count': n}}}
sorted_eps = {}
for i in range(1,27):
    sorted_eps[i] = dict(sorted(lines_by_episode[i].items())) 

    #------------------------------------------------------------------------

# filtering out common translator notes not in the script
# (named note_markers so the built-in filter() is not shadowed)
note_markers = ['Neon','EVA','Email','E-mail','http','title','episode','Episode','EPISODE','Nadia','Movie','Preview','Trail','0','4']

for i in range(1,27):
    sorted_eps[i] = {k:v for k, v in sorted_eps[i].items() if not any(x in k for x in note_markers)}

sorted_lines = {k:v for k, v in sorted_lines.items() if not any(x in k for x in note_markers)}

sorted_lines['Misato']

{'Lines': ['Misato: Why, of all times, have I missed him at such a time?! ......   What am I going to do?      ',
  "Misato: Hey! It can't be... Are they going to use a N2-mine?! Lie   down!  ",
  'Misato: Here goes!  ',
  "Misato: Sure. Don't worry. He is under my protection at top priority.   Prepare a car train for us. A linear one, please. Yes. I'll bear   the full responsibility for him because it was my idea to meet him   in the first place. Bye. ",
  "Misato: Ah, It's OK. No problem. It is during an emergency... We   can't do anything if the car doesn't run. In the addition, I am an   international officer even if I don't seem like it. ",
  'Misato: Uninteresting boy. You look so calm, very unlike your pretty   face. ',
  "Misato: Hmm, are you angry? Sorry, I'm sorry. It's natural because   you a boy. ",
  'Misato: Yes, the secret organization directly attached to the United   Nations. ',
  'Misato: Thanks!      ',
  'Misato: Then, read this.      ',
  'Misato: I know. You consi

In [126]:
list(sorted_lines.keys())

['',
 '     ',
 '        At 18',
 '        The two pilots, Ikari and Ayanami, will scramble at the cage at 17',
 '(Misato',
 '---B PART',
 'A Boy',
 'A Man',
 'A boy',
 'A man from Nerv',
 'Aircraft',
 'Akagi',
 'Announce',
 'Announce ',
 'Announce(man) ',
 'Announce(woman) ',
 'Announcement',
 'Announcement from a car',
 'Announcement from an airplane',
 'Announcement from the airplane',
 'Announcer',
 "Aska's father ",
 "Aska's grandmother ",
 'Asuka',
 'Asuka ',
 "Asuka's father ",
 'B',
 'Boy',
 'Bus announce ',
 'C',
 'Captain',
 'Chair',
 'Children',
 'Cmdr-in-Chief',
 'Commander A',
 'Commander C',
 'Committee',
 'Committee A',
 'Committee C',
 'Committer',
 'Courier',
 'D',
 'Driver',
 'Engineering Staff',
 'Father',
 'Female Announcement',
 'Female Operater',
 'Female Operator',
 'Female Voice',
 'Female doctor ',
 'Female operator A ',
 'Female operator C ',
 'Female operator D ',
 'Female operator E ',
 'Female operator G ',
 'Female operator H ',
 'Female operator I ',
 'Fe

Looking through the compiled list of characters and lines, there are clearly many errors: misspelled names, translator notes captured as speakers, and whitespace or punctuation that breaks dialogue attribution.

The order these issues will be tackled in:

0. Done in the previous step: leading and trailing punctuation and whitespace were stripped so that variants of a name (ex: 'Shinji' and '  Shinji  ') are counted under the same key in each dictionary.

1. Splitting up lines where multiple characters speak at the same time and attributing the line to each character individually. The translators occasionally denote shared lines in other ways, but splitting on '&' covers the most common case. This is done for each episode and, separately, for the entire series.

2. Removing "names" which only appear due to being in the translator notes on each page.

3. Correcting spelling errors (done in Excel, not in this notebook)
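
The name clean-up in steps 0 and 3 could also be sketched in code. The version below is an illustration only, not what the notebook does (the spelling fixes were made by hand in Excel). `ALIASES`, `normalize_name`, and `fold_counts` are hypothetical names, and the alias table holds a handful of misspellings that show up in this notebook's key lists.

```python
# Hypothetical alias table: misspelling -> canonical name.
ALIASES = {
    "Shiji": "Shinji",
    "Mistato": "Misato",
    "Ritusko": "Ritsuko",
    "Gendou": "Gendo",
    "Female Operater": "Female Operator",
}


def normalize_name(raw):
    """Trim whitespace and stray separators, then map known misspellings."""
    name = raw.strip().strip(":;,.-").strip()
    name = " ".join(name.split())  # collapse runs of internal whitespace
    return ALIASES.get(name, name)


def fold_counts(counts):
    """Sum per-name counts after normalizing every key."""
    folded = {}
    for raw, n in counts.items():
        key = normalize_name(raw)
        folded[key] = folded.get(key, 0) + n
    return folded


merged = fold_counts({"Shinji": 2, "  Shinji  ": 1, "Shiji": 1, "Mistato": 3})
```

Keeping the aliases in one table makes each correction visible and repeatable, at the cost of having to discover every misspelling first.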

In [89]:
# COMBINING NAMES

# had to look through the spreadsheet to find typos

# converting to dataframe and exporting to excel sheet
df = pd.DataFrame(data=sorted_lines).T
df.columns = ['Lines','Linecount','Wordcount']
# df.to_excel('src/NGE_entire_series_lines_v2.xlsx')

df1 = pd.DataFrame(data=sorted_eps)
# df1.to_excel('src/NGE_lines_by_episode_v2.xlsx')

In [142]:
# with pandas, we can very simply combine entire dataframes. for this analysis, I'm just going to compile typos for the
# top 10 characters + a couple other important ones from an old ranking poll

# merges all names in the list into the first name of the list in the df dataframe
def character_merge(list1):
    result = df.loc[list1[0]].copy()
    for name in list1[1:]:
        result += df.loc[name].copy()
    df.loc[list1[0]] = result
        
shinji_list = ['Shinji','Shiji',"Shinji'","Shinji '",'Shinji&Asuka']
asuka_list = ['Asuka','Little Asuka','Shinji&Asuka']
misato_list = ['Misato','Misato (thinking)','Mistato','Phone(Misato)']
ritsuko_list = ['Ritsuko','Ritusko','Ritsukko','Rituko']
ryoji_list = ['Ryoji','Ryouji','Ryouji (voice from the telephone)']
gendo_list = ['Gendo','Gendou','Gendow']
fuyutsuki_list = ['Fuyutsuki','Fuyutsuki (voice)','Fuyutsuki(mono)','Fuyuzuki','Kouzou','Kozo','Kozou']

character_merge(shinji_list)
character_merge(asuka_list)
character_merge(misato_list)
character_merge(ritsuko_list)
character_merge(ryoji_list)
character_merge(gendo_list)
character_merge(fuyutsuki_list)

In [143]:
df.loc['Shinji']

Lines        [Shinji: Out of order ... I shouldn't have com...
Linecount                                                  249
Wordcount                                                 4162
Name: Shinji, dtype: object

### **1. Splitting names and adding values**

In [129]:
# # Before:
# # print(f'Before splitting and adding for Shinji:', character_lines['Shinji'], '- (format: [linecount, wordcount])')

# # for each episode
# # for i in range(1,27):
# #     shared_lines = [[key,val] for key, val in lines_by_episode[i].items() if re.search('&', key)]
    
# #     for item in shared_lines:
# #         item[0] = item[0].split('&')
# #         word_count_line = item[1][1]
# #         for names in item[0]:
# #             if names.strip() in lines_by_episode[i]:
# #                 lines_by_episode[i][names.strip()][0] += 1
# #                 lines_by_episode[i][names.strip()][1] += word_count_line
# #             else:
# #                 lines_by_episode[i][names.strip()] = [1,word_count_line]
                
# # for entire series
# shared_lines = [[key,val] for key, val in character_lines.items() if re.search('&', key)]
# shared_lines

# for item in shared_lines:
#     item[0] = item[0].split('&')
#     print(item[1])
#     item_shared_line = item[1]['Lines']
#     item_shared_linecount = item[1]['Line Count']
#     item_shared_words = item[1]['Word Count']
    
#     for names in item[0]:
# #         if names.strip() in character_lines:
#             character_lines[names.strip()]['Lines'] += item_shared_line
#             character_lines[names.strip()]['Line Count'] += item_shared_linecount
#             character_lines[names.strip()]['Word Count'] += item_shared_words
# #         else:
# #             pass
# #             character_lines[names.strip()] = [item_shared_lines,item_shared_words]

# character_lines['Shinji']
            
# # print(f'After:', character_lines['Shinji'])

{'Lines': ['Shinji&Asuka: This is forced by Misato-san, who insists that Japanese should        begin with form.'], 'Line Count': 1, 'Word Count': 13}


{'Lines': ["Shinji: Out of order ... I shouldn't have come ...      ",
  "Shinji: I may not be able to meet her. I can't help it. I'll go to   the shelter.  ",
  'Shinji: Ahh  ',
  'Shinji: Ahhhhhh! ',
  'Shinji: Ahh!  ',
  'Shinji: Is it OK that you did such a thing...?  ',
  "Shinji: You are childish for your age, aren't you?  ",
  "Shinji: I heard from the teacher that it's an important job for   protecting the human race. ",
  "Shinji: It's about my father's work... Are there anything for me to   do? ",
  "Shinji: I can't say I'm surprised. He can't write to me ... unless he   wants me to do anything. ",
  "Shinji: Ah, great! It's a real geofront!      ",
  "Shinji: Uh,uh, it's pitch dark.  ",
  'Shinji: Do you mean that I should get into it and fight against the   guy which I saw. ',
  "Shinji: No way! What are you saying now?! I have been thinking that   you didn't want me?! ",
  "Shinji: I can't do that. I've neither seen it nor heard it. Why are   you saying that I can do it? "

Now each time a line was spoken by multiple characters (ex: "Shinji & Asuka"), each character receives credit for the spoken line in their original dictionary key.
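
A compact way to express this crediting step is sketched below. The function name `credit_shared_lines` and the `drop_joint` flag are assumptions for illustration. By default the joint key is left in place, and `drop_joint=True` removes it so the shared line does not also appear as a separate speaker.

```python
def credit_shared_lines(tally, sep="&", drop_joint=False):
    """Credit every 'A&B' entry in `tally` to A and to B individually."""
    shared = [key for key in tally if sep in key]  # snapshot before mutating
    for key in shared:
        entry = tally.pop(key) if drop_joint else tally[key]
        for name in key.split(sep):
            name = name.strip()
            if not name:
                continue  # skip empty pieces such as a trailing '&'
            target = tally.setdefault(
                name, {"Lines": [], "Line Count": 0, "Word Count": 0}
            )
            target["Lines"].extend(entry["Lines"])
            target["Line Count"] += entry["Line Count"]
            target["Word Count"] += entry["Word Count"]
    return tally


shared_text = ("Shinji&Asuka: This is forced by Misato-san, who insists "
               "that Japanese should begin with form.")
tally = {
    "Shinji": {"Lines": ["Shinji: Ahh"], "Line Count": 1, "Word Count": 1},
    "Shinji&Asuka": {"Lines": [shared_text], "Line Count": 1, "Word Count": 13},
}
credit_shared_lines(tally)
```

Running the function twice on the same dictionary credits the shared line twice, so it should be applied once per dictionary.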

### **2. Sorting dictionaries and removing extraneous keys**

In [5]:
# entire series: dictionary - {character name: {'Lines': [...], 'Line Count': n, 'Word Count': n}}
sorted_lines = dict(sorted(character_lines.items())) 

# individual episodes: dictionary of dictionaries - {episode number: {character name: {'Lines': [...], 'Line Count': n, 'Word Count': n}}}
sorted_eps = {}
for i in range(1,27):
    sorted_eps[i] = dict(sorted(lines_by_episode[i].items())) 

    #------------------------------------------------------------------------

# filtering out common translator notes not in the script
# (named note_markers so the built-in filter() is not shadowed)
note_markers = ['Neon','EVA','Email','E-mail','http','title','episode','Episode','EPISODE','Nadia','Movie','Preview','Trail','0']

for i in range(1,27):
    sorted_eps[i] = {k:v for k, v in sorted_eps[i].items() if not any(x in k for x in note_markers)}

sorted_lines = {k:v for k, v in sorted_lines.items() if not any(x in k for x in note_markers)}

    #------------------------------------------------------------------------

# converting to dataframe and exporting to excel sheet
df = pd.DataFrame(data=sorted_lines).T
df.columns = ['Lines','Linecount','Wordcount'] # three columns, matching the 'Lines', 'Line Count' and 'Word Count' keys
df.to_excel('src/NGE_entire_series_lines.xlsx')

df1 = pd.DataFrame(data=sorted_eps)
df1.to_excel('src/NGE_lines_by_episode.xlsx')

The dictionaries are now sorted and converted into pandas DataFrames, and exported to Excel files for further analysis.
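
As an alternative sketch of the conversion, `pd.DataFrame.from_dict` with `orient="index"` builds one row per character directly, without a transpose. The sample data below is made up from two lines shown earlier, and `sample` and `summary` are names chosen for illustration. Avoiding the transpose should also keep the count columns numeric rather than object-typed, which helps with sorting and arithmetic.

```python
import pandas as pd

# Small stand-in for sorted_lines, using the same inner keys.
sample = {
    "Misato": {"Lines": ["Misato: Here goes!"], "Line Count": 1, "Word Count": 2},
    "Shinji": {"Lines": ["Shinji: Ahh"], "Line Count": 1, "Word Count": 1},
}

# One row per character, one column per inner key.
df = pd.DataFrame.from_dict(sample, orient="index")
df = df.rename(columns={"Line Count": "Linecount", "Word Count": "Wordcount"})

# Numeric summary, most talkative character first.
summary = df[["Linecount", "Wordcount"]].sort_values("Wordcount", ascending=False)
```

The resulting frames can be written out with `to_excel` exactly as in the cell above.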