# **Webscraping the script of Neon Genesis Evangelion**

In [1]:
import pandas as pd
import bs4 as bs
import requests
import re
import time

In this notebook, I pull the script for each episode of Neon Genesis Evangelion using the Requests and BeautifulSoup packages, then use regular expressions to split each script into separate lines of dialogue. Two dictionaries track how many lines each character speaks and their word count. The first covers the entire series. The second is a dictionary of dictionaries holding the same data per episode, keyed by episode number and then by the characters who speak in that episode.
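
As a rough sketch of the shape of those two dictionaries (not the scraping code itself), the snippet below tallies a few hand-written sample lines. The helper `add_line` and the sample lines are hypothetical and only there for illustration; the keys 'Lines', 'Line Count' and 'Word Count' are the ones used in the scraping loop further down.

```python
# Hypothetical sample data: (episode number, raw dialogue line) pairs.
sample = [
    (1, "Misato: Here goes!"),
    (1, "Shinji: Out of order ..."),
    (2, "Misato: Thanks!"),
]

character_lines = {}   # series-wide: {name: {'Lines', 'Line Count', 'Word Count'}}
lines_by_episode = {}  # per episode: {episode: {name: {...same keys...}}}


def add_line(tally, name, text):
    """Add one line of dialogue to a tally dictionary (hypothetical helper)."""
    entry = tally.setdefault(name, {'Lines': [], 'Line Count': 0, 'Word Count': 0})
    entry['Lines'].append(text)
    entry['Line Count'] += 1
    # Subtract 1 so the leading "Name:" token is not counted as a word.
    entry['Word Count'] += len(text.split()) - 1


for episode, text in sample:
    name = text.split(':')[0].strip()
    add_line(character_lines, name, text)
    add_line(lines_by_episode.setdefault(episode, {}), name, text)

print(character_lines['Misato'])
print(lines_by_episode[1]['Shinji'])
```

The same line is added to both tallies, so the series-wide counts are always the sum of the per-episode counts.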

The scripts for this analysis come from https://www.animanga.com/scripts/anime_scripts_english.html. The episode URLs follow a consistent naming pattern, so I can pull each episode script sequentially with a loop.
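
A minimal sketch of that naming pattern, without making any requests. The names `BASE_URL` and `episode_urls` are introduced here for illustration; the pattern itself is the one requested in the scraping loop.

```python
# URL pattern taken from the scraping loop; episodes are numbered 1 to 26.
BASE_URL = 'https://www.animanga.com/scripts/textesgb/eva{}.html'

# Map each episode number to the URL of its script page.
episode_urls = {i: BASE_URL.format(i) for i in range(1, 27)}

print(len(episode_urls))
print(episode_urls[1])
print(episode_urls[26])
```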

In [127]:
character_lines = {}
lines_by_episode = {}


for i in range(1,27):
    current_ep = {}

    webpage = requests.get(f'https://www.animanga.com/scripts/textesgb/eva{i}.html')
    soup = bs.BeautifulSoup(webpage.text, 'lxml')
    page_text = soup.get_text()

    #------------------------------------------------------------------------

    regex = re.compile(r"(?m).+:[^*#:]+\n (?:(?![#*]).)*") # regular expression to match each separate line of dialogue
    result = re.findall(regex,page_text)

    #------------------------------------------------------------------------

    # iterating through each line of dialogue and updating both tallies:
    # the series-wide character_lines and the per-episode current_ep
    for text in result:
        character_name = re.findall('[^:#]*',text)[0] # regex to pull the name of the character speaking
        character_name = character_name.strip(r'!"#$%(*+, -./:;<=>?@[\]^_`{|}~') # stripping leading + trailing punctuation
        word_count_line = len(text.split())-1 # -1 so the leading "Name:" token is not counted
        # cutting the text at the first dashed separator (12 or more dashes), then removing line breaks
        text = text.split('------------')[0].replace('\n','').replace('\r','')
        for tally in (character_lines, current_ep):
            # creating the entry if the character is not in the tally yet, otherwise adding to the existing entry
            if character_name not in tally:
                tally[character_name] = {'Lines': [], 'Line Count': 0, 'Word Count': 0}
            tally[character_name]['Lines'].append(text)
            tally[character_name]['Line Count'] += 1
            tally[character_name]['Word Count'] += word_count_line

    #------------------------------------------------------------------------

    lines_by_episode[i] = current_ep

    # there shouldn't be a risk of overloading the server, but uncomment the line below
    # to add a short wait in between URL requests just in case
    # time.sleep(1)
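
To see what the dialogue pattern from the loop above picks up, it can be tried on a small hand-written sample. The sample text is hypothetical and assumes the layout the pattern appears to expect: a `Name: text` line, optional indented continuation lines, then a line holding a single space.

```python
import re

# Same pattern as in the scraping loop.
regex = re.compile(r"(?m).+:[^*#:]+\n (?:(?![#*]).)*")

# Hypothetical sample mimicking the assumed page layout.
sample = (
    "Misato: Here goes!\n"
    " \n"
    "Shinji: Out of order ...\n"
    "   I shouldn't have come.\n"
    " \n"
)

matches = regex.findall(sample)

# Speaker name: everything before the first ':' (or '#').
names = [re.findall('[^:#]*', m)[0].strip() for m in matches]

# Dialogue with line breaks removed, as done in the loop.
cleaned = [m.replace('\n', '').replace('\r', '') for m in matches]

print(names)
print(cleaned)
```

Each match runs from the speaker's name to the blank-looking line after the dialogue, so a multi-line speech stays in a single entry.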

Looking through the compiled list of characters and lines, there are clearly a lot of errors: misspelled names, translator notes picked up as dialogue, and stray whitespace or punctuation that splits one character's dialogue across several keys.

Next steps in order:

0. Done in the previous step: leading and trailing punctuation and whitespace were stripped so that variants of a name (ex: 'Shinji' and '  Shinji  ') are counted under the same key in each dictionary.

1. Removing "names" which only appear because they are in the translator notes on each page, and sorting the dictionaries alphabetically.

2. Converting the dictionaries into dataframes.
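
Step 1 can be sketched on a toy dictionary as follows. The tallies are made up, and `note_markers` is a name chosen here for a small subset of the marker substrings; the idea is that any key containing one of the markers is treated as a translator note and dropped.

```python
# Hypothetical tallies; only the keys matter for this sketch.
character_lines = {
    'Shinji': {'Line Count': 2},
    'Email': {'Line Count': 1},        # translator note, not a character
    'Episode 1': {'Line Count': 1},    # translator note, not a character
    'Asuka': {'Line Count': 1},
}

# Substrings that mark a key as a translator note.
note_markers = ['Email', 'Episode']

# Sort alphabetically by name, then drop keys containing any marker.
sorted_lines = dict(sorted(character_lines.items()))
sorted_lines = {
    name: stats
    for name, stats in sorted_lines.items()
    if not any(marker in name for marker in note_markers)
}

print(list(sorted_lines))
```

Because the check is a substring match, a short marker can also remove legitimate names that happen to contain it, so the marker list is worth checking against the output.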

In [147]:
# entire series: dictionary - {character name: {'Lines': [...], 'Line Count': n, 'Word Count': n}}
sorted_lines = dict(sorted(character_lines.items()))

# individual episodes: dictionary of dictionaries - {episode number: {character name: {same three keys}}}
sorted_eps = {}
for i in range(1,27):
    sorted_eps[i] = dict(sorted(lines_by_episode[i].items()))

    #------------------------------------------------------------------------

# filtering out "names" that come from translator notes rather than the script:
# any key containing one of these substrings is dropped
# (renamed from `filter` to avoid shadowing the Python built-in)
note_markers = ['Neon','EVA','Email','E-mail','http','title','episode','Episode','EPISODE','Nadia','Movie','Preview','Trail','0','4']

for i in range(1,27):
    sorted_eps[i] = {k:v for k, v in sorted_eps[i].items() if not any(x in k for x in note_markers)}

sorted_lines = {k:v for k, v in sorted_lines.items() if not any(x in k for x in note_markers)}

sorted_lines['Misato']

{'Lines': ['Misato: Why, of all times, have I missed him at such a time?! ......   What am I going to do?      ',
  "Misato: Hey! It can't be... Are they going to use a N2-mine?! Lie   down!  ",
  'Misato: Here goes!  ',
  "Misato: Sure. Don't worry. He is under my protection at top priority.   Prepare a car train for us. A linear one, please. Yes. I'll bear   the full responsibility for him because it was my idea to meet him   in the first place. Bye. ",
  "Misato: Ah, It's OK. No problem. It is during an emergency... We   can't do anything if the car doesn't run. In the addition, I am an   international officer even if I don't seem like it. ",
  'Misato: Uninteresting boy. You look so calm, very unlike your pretty   face. ',
  "Misato: Hmm, are you angry? Sorry, I'm sorry. It's natural because   you a boy. ",
  'Misato: Yes, the secret organization directly attached to the United   Nations. ',
  'Misato: Thanks!      ',
  'Misato: Then, read this.      ',
  'Misato: I know. You consi

In [185]:
# COMBINING NAMES

# had to look through the spreadsheet to find typos

# converting to dataframes (uncomment the to_excel lines to export a spreadsheet for inspection)
df = pd.DataFrame(data=sorted_lines).T  # rows: characters, columns: lines, line count, word count
df.columns = ['Lines','Linecount','Wordcount']
# df.to_excel('src/NGE_entire_series_lines_v2.xlsx')

df1 = pd.DataFrame(data=sorted_eps)  # rows: characters, columns: episode numbers
# df1.to_excel('src/NGE_lines_by_episode_v2.xlsx')

In [186]:
# with pandas, we can very simply add entire rows together. for this analysis, I'm just going to merge the typos for the
# top 10 characters + a couple of other important ones from an old ranking poll

# merges all names in the list into the first name of the list in the df dataframe
# (the lists of lines are concatenated, the line and word counts are summed)
def character_merge(list1):
    result = df.loc[list1[0]].copy()
    for name in list1[1:]:
        result += df.loc[name].copy()
    df.loc[list1[0]] = result

shinji_list = ['Shinji','Shiji',"Shinji'","Shinji '",'Shinji&Asuka']
asuka_list = ['Asuka','Little Asuka','Shinji&Asuka']
misato_list = ['Misato','Misato (thinking)','Mistato','Phone(Misato)']
ritsuko_list = ['Ritsuko','Ritusko','Ritsukko','Rituko']
ryoji_list = ['Ryoji','Ryouji','Ryouji (voice from the telephone)']
gendo_list = ['Gendo','Gendou','Gendow','Ikari']
fuyutsuki_list = ['Fuyutsuki','Fuyutsuki (voice)','Fuyutsuki(mono)','Fuyuzuki','Kouzou','Kozo','Kozou']

character_merge(shinji_list)
character_merge(asuka_list)
character_merge(misato_list)
character_merge(ritsuko_list)
character_merge(ryoji_list)
character_merge(gendo_list)
character_merge(fuyutsuki_list)

# rows that were merged into another name above and can now be removed
# note: 'Ikari' is merged into 'Gendo' but is not in this list, so it also stays as its own row
merged_names = ['Shiji',"Shinji'","Shinji '",'Shinji&Asuka','Little Asuka','Misato (thinking)','Mistato',
                'Phone(Misato)','Ritusko','Ritsukko','Rituko','Ryouji','Ryouji (voice from the telephone)','Gendou',
                'Gendow','Fuyutsuki (voice)','Fuyutsuki(mono)','Fuyuzuki','Kouzou','Kozo','Kozou']

df.drop(merged_names, inplace=True)

# character_merge only updates df, so in df1 these rows are dropped without being merged
df1.drop(merged_names, inplace=True)
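
The same kind of merge can also be sketched on a plain dictionary, which is one possible way to fold typos together in the per-episode tallies before they become a dataframe. This is an alternative illustration and not the method used above; `merge_aliases` and the sample counts are hypothetical.

```python
def merge_aliases(tally, names):
    """Fold every alias in names[1:] into names[0] and remove the alias keys.

    Hypothetical helper: aliases that are missing from the tally are skipped.
    """
    target = tally.setdefault(names[0], {'Lines': [], 'Line Count': 0, 'Word Count': 0})
    for alias in names[1:]:
        entry = tally.pop(alias, None)
        if entry is None:
            continue
        target['Lines'] = target['Lines'] + entry['Lines']
        target['Line Count'] += entry['Line Count']
        target['Word Count'] += entry['Word Count']


# Made-up counts for illustration only.
episode = {
    'Ritsuko': {'Lines': ['Ritsuko: Yes.'], 'Line Count': 1, 'Word Count': 1},
    'Ritusko': {'Lines': ['Ritusko: I see.'], 'Line Count': 1, 'Word Count': 2},
}

merge_aliases(episode, ['Ritsuko', 'Ritusko', 'Ritsukko', 'Rituko'])
print(episode)
```

Because each alias key is removed as it is merged, a shared key such as 'Shinji&Asuka' would only be credited to the first list processed, unlike `character_merge`, which adds it to both characters.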

In [182]:
df.loc['Shinji']

Lines        [Shinji: Out of order ... I shouldn't have com...
Linecount                                                  237
Wordcount                                                 3816
Name: Shinji, dtype: object

In [187]:
# exporting both dataframes to excel sheets (the src/ folder needs to exist)
df.to_excel('src/NGE_entire_series_lines.xlsx')
df1.to_excel('src/NGE_lines_by_episode.xlsx')