# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Seinfeld Script Generator

Notebook 1: Data Retrieving - Web Scraping

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

### Data Identification & Source

The data for this project are the scripts of all 180 episodes of the 90s hit sitcom <em>Seinfeld</em>.

The scripts are hosted by the Internet Movie Script Database ([IMSDb](https://www.imsdb.com/)), a well-known resource for movie and TV scripts. To make sure that scraping is allowed, I checked the website's robots.txt (saved under the `data` folder).

![](../img/screenshot_imsdb_robots.png)
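
The same robots.txt check can also be done programmatically. The sketch below uses Python's standard `urllib.robotparser`; the rules in `SAMPLE_ROBOTS_TXT` and the helper name `is_allowed` are made up for illustration and are not a copy of IMSDb's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
# It is NOT a copy of IMSDb's actual file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.imsdb.com/TV/Seinfeld.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.imsdb.com/private/page.html"))
```

To check the live file instead of a pasted string, `RobotFileParser` also provides `set_url()` and `read()`, which download and parse the rules in one step.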

### Data Retrieving

Since IMSDb does not offer an API, I built a function to automate the web scraping process. The function has three parts:

1. locate the dedicated <em>Seinfeld</em> page that lists all episodes;
2. locate the dedicated page for each episode;
3. scrape the script on each episode page.

Building this function involved two challenges. First, it took two redirections to get from the original <em>Seinfeld</em> page to each script page, and the URLs differ at each step, which required additional string slicing. Second, the scripts were formatted in a way that is unfriendly to scraping: each character and their corresponding line sit in separate rows, with blank tags that are hard to locate and random spaces inserted between lines. A combination of string slicing and web scraping techniques is applied to make it work. As it is a long function, I print status updates at every step to track the progress of the code.
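
To make the string slicing concrete, here is a minimal sketch of the link rewriting on a single value. The helper name `build_episode_url` and the `sample_href` string are assumptions for illustration: the href is shaped to fit the slice offsets used in the function and is not copied from the site.

```python
BASE_URL = "https://www.imsdb.com/"


def build_episode_url(href):
    """Rewrite an episode-list href into an episode-page URL using fixed slice offsets."""
    # characters 4..23 hold the section and show name; lowercase the capital 'T'
    section = href[4:24].replace("T", "t")
    # skip the ' - ' separator and drop the 12-character ' Script.html' suffix
    title = href[27:-12].replace(" ", "-")
    return BASE_URL + section + "-" + title + ".html"


# Hypothetical href (an assumption about the link format, for illustration only)
sample_href = "/TV Transcripts/Seinfeld - Good News, Bad News Script.html"
print(build_episode_url(sample_href))
```

Because the offsets are fixed, this approach only works while every link has the same prefix length; a different show name would need different offsets.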

In [2]:
def get_script(name):
    
    # accessing the script page
    base_url = 'https://www.imsdb.com/'
    url = base_url + 'TV/' + name + '.html'
    res = requests.get(url)
    if res.status_code == 200:
        print(f'ACCESSING <{name.upper()}> SCRIPT PAGE...')
    else:
        # stop early: without the episode list there is nothing to parse
        print(f'UNABLE TO ACCESS {name.upper()} SCRIPT PAGE...')
        return []
    soup = BeautifulSoup(res.content, 'lxml')
    
    # locate the actual links for scripts
    # (fixed offsets that skip the non-episode links on the page)
    links = soup.find_all('a')[64:-7]
    print(f'{len(links)} SCRIPTS FOUND!')

    # find the right url to each episode webpage
    ep_urls = []
    for link in links:
        branch = link.attrs['href']
        # format the scraped link to match with the actual link
        # (fixed slice offsets, tuned to the Seinfeld links)
        branch_1 = branch[4:24].replace('T', 't')
        branch_2 = branch[27:-12].replace(' ', '-')
        ep_urls.append(base_url + branch_1 + '-' + branch_2 + '.html')
    print('ON OUR WAY TO EPISODES!')
    
    # access each episode page url using BeautifulSoup
    script_collection = []
    for ep_no, sub in enumerate(ep_urls, start=1):
        ep_res = requests.get(sub)
        if ep_res.status_code != 200:
            # skip this episode; otherwise script_new would be undefined or stale
            print('ERROR ACCESSING EPISODE SCRIPTS...')
            continue
        ep_soup = BeautifulSoup(ep_res.content, 'lxml')
        script = ep_soup.find('td', {'class': 'scrtext'})
        script_new = script.find_all('pre')[0]
            
        # scrape script for each episode
        # credit to Dan Wilhelm for helping me out on this code. Thanks so much!
        actor = ''
        lines = []
        scripts = []
        
        for tag in script_new.contents:
            if tag.name == 'b':
                # a non-empty <b> tag marks a new speaker
                if tag.text.strip() != '':
                    if lines:
                        # close the previous speaker's turn
                        # (double spaces are deleted outright, not collapsed to one space)
                        scripts.append((actor, ' '.join(lines).replace('\n', ' ').replace('  ', '')))
                        lines = []
                    actor = tag.text.strip()
            else:
                text = tag.strip()
                if len(text) > 0:
                    lines.append(text)
        # add the final tuple of character and corresponding line for this script
        scripts.append((actor, ' '.join(lines).replace('\n', ' ').replace('  ', '')))
        script_collection.append(scripts)
        print(f'GENERATING SCRIPT OF EP.{ep_no} of {len(ep_urls)}')
    # print before returning; a statement after return would never run
    print('ALL SCRIPTS RETRIEVED!')
    return script_collection
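
The speaker/line pairing inside the function can be hard to follow among the scraping details, so here is a stripped-down sketch of just that step. It assumes the page content has already been reduced to `(is_speaker, text)` pairs; the helper name `group_lines` and the toy `nodes` list are hypothetical and stand in for the `<b>` tags and text nodes of a real page.

```python
def group_lines(nodes):
    """Group (is_speaker, text) nodes into (speaker, joined_line) tuples."""
    scripts, speaker, lines = [], "", []
    for is_speaker, text in nodes:
        text = text.strip()
        if not text:
            continue  # skip blank tags and whitespace-only strings
        if is_speaker:
            if lines:  # a new speaker closes the previous speaker's turn
                scripts.append((speaker, " ".join(lines)))
                lines = []
            speaker = text
        else:
            lines.append(text)
    scripts.append((speaker, " ".join(lines)))  # flush the final turn
    return scripts


# Toy input standing in for parsed page content (illustration only)
nodes = [
    (True, "GEORGE"), (False, "  Are you through?  "),
    (True, "   "),  # blank speaker tag, ignored
    (False, "(kind of irritated)"),
    (True, "JERRY"), (False, "You do of course try on, when you buy?"),
]
print(group_lines(nodes))
```

Text that arrives in several pieces is collected under the most recent speaker, and a turn is only closed when the next non-empty speaker tag appears.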

In [3]:
all_scripts = get_script('Seinfeld')

ACCESSING <SEINFELD> SCRIPT PAGE...
176 SCRIPTS FOUND!
ON OUR WAY TO EPISODES!
GENERATING SCRIPT OF EP.1 of 176
GENERATING SCRIPT OF EP.2 of 176
GENERATING SCRIPT OF EP.3 of 176
GENERATING SCRIPT OF EP.4 of 176
GENERATING SCRIPT OF EP.5 of 176
GENERATING SCRIPT OF EP.6 of 176
GENERATING SCRIPT OF EP.7 of 176
GENERATING SCRIPT OF EP.8 of 176
GENERATING SCRIPT OF EP.9 of 176
GENERATING SCRIPT OF EP.10 of 176
GENERATING SCRIPT OF EP.11 of 176
GENERATING SCRIPT OF EP.12 of 176
GENERATING SCRIPT OF EP.13 of 176
GENERATING SCRIPT OF EP.14 of 176
GENERATING SCRIPT OF EP.15 of 176
GENERATING SCRIPT OF EP.16 of 176
GENERATING SCRIPT OF EP.17 of 176
GENERATING SCRIPT OF EP.18 of 176
GENERATING SCRIPT OF EP.19 of 176
GENERATING SCRIPT OF EP.20 of 176
GENERATING SCRIPT OF EP.21 of 176
GENERATING SCRIPT OF EP.22 of 176
GENERATING SCRIPT OF EP.23 of 176
GENERATING SCRIPT OF EP.24 of 176
GENERATING SCRIPT OF EP.25 of 176
GENERATING SCRIPT OF EP.26 of 176
GENERATING SCRIPT OF EP.27 of 176
GENERATING S

Note that <em>Seinfeld</em> officially has 180 episodes, but only 176 scripts were scraped. I will explore the details at the EDA stage.

Each scraped script is a list of tuples, each holding the character who is speaking and the corresponding line.
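
As a small illustration of this nested structure, the sketch below builds a toy collection with the same shape (a list of scripts, each a list of `(character, line)` tuples) and counts lines per character. The name `toy_scripts` and its contents are placeholders, not scraped data.

```python
from collections import Counter

# Toy data with the same shape as the scraped collection (illustration only)
toy_scripts = [
    [
        ("GEORGE", "Are you through? (kind of irritated)"),
        ("JERRY", "You do of course try on, when you buy?"),
        ("GEORGE", "Yes."),
    ],
]

# first index selects the script, second index selects the tuple
character, line = toy_scripts[0][1]
print(character, "->", line)

# count how many lines each character speaks across all scripts
line_counts = Counter(
    character for script in toy_scripts for character, _ in script
)
print(line_counts)
```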

In [4]:
all_scripts[0][1]

('JERRY',
 'You know, why we\'re here? To be out, this is out...and out is one of the single most enjoyable experiences of life. People...did you ever hear people talking about "We should go out"? This is what they\'re talking about...this whole thing, we\'re all out now, no one is home. Not one person here is home, we\'re all out! There are people tryin\' to find us, they don\'t know where we are. (imitates one of these people "tryin\' to find us"; pretends his hand is a phone) "Did you ring?, I can\'t find him." (imitates other person on phone) "Where did he go?" (the first person again) "He didn\'t tell me where he was going". He must have gone out. You wanna go out: you get ready, you pick out the clothes, right? You take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...There you\'re staring around, whatta you do? You go: "We gotta be getting back". Once you\'re out, you wanna get back! You wanna go to sleep, you wanna get up, you 

### Dataframe Conversion

With all the data scraped, my next step is to turn it into a dataframe, and I built a function to do this. Note that the first and last rows of every script contain the episode name, the writers and "the end"; this information serves as formatting rather than content. I therefore excluded these rows from the dataset that I feed into the model. However, I added the episode name as a column for reference, and I made the function flexible by letting the user choose whether to include the title and ending.

In [7]:
all_scripts[0][0]

('GOOD NEWS, BAD NEWS', 'Written byLarry David & Jerry Seinfeld (Comedy club)')

In [8]:
# converting scraped scripts to dataframe

def convert_df(all_scripts, with_title_end=False):
    columns = ['character', 'line', 'episode']
    frames = []
    for script in all_scripts:
        # keep every row, or exclude the title and the end of each script
        # (script[1:-2] drops the first row and the last two rows)
        rows = script if with_title_end else script[1:-2]
        df_script = pd.DataFrame(rows, columns = ['character', 'line'])
        # the first tuple of each script holds the episode name
        df_script['episode'] = script[0][0]
        frames.append(df_script)
    # concatenate once at the end instead of growing a dataframe inside the loop
    if not frames:
        return pd.DataFrame(columns = columns)
    return pd.concat(frames, axis = 0)
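
To show what the `with_title_end` flag changes, here is a pandas-free sketch of the same row selection on a toy script. The helper name `script_rows` and the placeholder rows are assumptions for illustration; only the slicing mirrors `convert_df`.

```python
def script_rows(script, with_title_end=False):
    """Return (character, line, episode) rows for one script as plain tuples."""
    episode = script[0][0]  # the title lives in the first tuple
    body = script if with_title_end else script[1:-2]
    return [(character, line, episode) for character, line in body]


# Toy script with placeholder rows (illustration only)
toy_script = [
    ("GOOD NEWS, BAD NEWS", "title row"),
    ("JERRY", "first line"),
    ("GEORGE", "second line"),
    ("JERRY", "trailing row A"),
    ("", "trailing row B"),
]

print(script_rows(toy_script))
print(len(script_rows(toy_script, with_title_end=True)))
```

With the default setting only the two middle rows survive, each tagged with the episode name; with `with_title_end=True` all five rows are kept.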

In [9]:
# generate dataframe with all lines

without_title = convert_df(all_scripts)
without_title.head()

| | character | line | episode |
|---|---|---|---|
| 0 | JERRY | You know, why we're here? To be out, this is o... | GOOD NEWS, BAD NEWS |
| 1 | JERRY | Seems to me, that button is in the worst possi... | GOOD NEWS, BAD NEWS |
| 2 | GEORGE | Are you through? (kind of irritated) | GOOD NEWS, BAD NEWS |
| 3 | JERRY | You do of course try on, when you buy? | GOOD NEWS, BAD NEWS |
| 4 | GEORGE | Yes, it was purple, I liked it, I don't actual... | GOOD NEWS, BAD NEWS |


In [10]:
# generate dataframe with all lines and titles & ends for each episode

with_title = convert_df(all_scripts, with_title_end=True)
with_title.head()

| | character | line | episode |
|---|---|---|---|
| 0 | GOOD NEWS, BAD NEWS | Written byLarry David & Jerry Seinfeld (Comedy... | GOOD NEWS, BAD NEWS |
| 1 | JERRY | You know, why we're here? To be out, this is o... | GOOD NEWS, BAD NEWS |
| 2 | JERRY | Seems to me, that button is in the worst possi... | GOOD NEWS, BAD NEWS |
| 3 | GEORGE | Are you through? (kind of irritated) | GOOD NEWS, BAD NEWS |
| 4 | JERRY | You do of course try on, when you buy? | GOOD NEWS, BAD NEWS |


In [11]:
# save dataframes

without_title.to_csv('../data/scripts_no_title.csv', index=False)
with_title.to_csv('../data/scripts_with_title.csv', index=False)