# The Data Wrangling


If we are going to build a multi-class classifier that predicts a screenplay's Bechdel Test rating (and so whether or not it passes), we are going to need some scripts to train and test on! [IMSDB](https://imsdb.com/) comes to our rescue in this instance.

The code below shows how you can scrape screenplays from the website and save them for future use. A lot of it was inspired by the code in this [GitHub repo](https://github.com/j2kun/imsdb_download_all_scripts).


Okay, on second thought, my scripts actually come from [this database on Kaggle](https://www.kaggle.com/parthplc/movie-scripts).


The Kaggle database was incredibly messy and hard to work with. None of the files were named, so I wrote a brief Python script to rename each file to its first line, which was normally the title of the film. This worked for the most part, but it also resulted in some data loss, which decreased my sample size.
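
A minimal sketch of what such a renaming step might look like. This is an illustration under assumptions, not the original script: the function names, the `.txt` extension, and the rule of skipping files with an empty or already-used title are all hypothetical.

```python
import re
import tempfile
from pathlib import Path


def title_from_first_line(path: Path) -> str:
    """Return the first non-empty line of a file, sanitized for use as a filename."""
    with path.open(encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            candidate = line.strip()
            if candidate:
                # Keep word characters, spaces and hyphens only.
                return re.sub(r"[^\w\- ]", "", candidate).strip()
    return ""


def rename_scripts(folder: Path) -> dict:
    """Rename every .txt file in `folder` to '<first line>.txt'.

    Files with no usable title, or whose title is already taken,
    are left untouched so nothing gets overwritten.
    """
    renamed = {}
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(folder.glob("*.txt")):
        title = title_from_first_line(path)
        target = folder / f"{title}.txt"
        if not title or target.exists():
            continue
        path.rename(target)
        renamed[path.name] = target.name
    return renamed


# Small demonstration on throwaway files (the contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "file_1.txt").write_text("ALIEN\n\nFADE IN:\n", encoding="utf-8")
    (demo_dir / "file_2.txt").write_text("\n\n", encoding="utf-8")
    demo_result = rename_scripts(demo_dir)
    demo_files = sorted(p.name for p in demo_dir.iterdir())
```

Skipping a file instead of overwriting an existing one is a deliberate choice in this sketch: two scripts that share a first line would otherwise silently replace each other, which is one way a renaming pass can lose data.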

Enjoy!

In [16]:
# Gotta bring some mates

import pandas as pd
from bs4 import BeautifulSoup
import requests
import os
from urllib.parse import quote
import urllib
import json

In [10]:
imsdb_url = 'https://imsdb.com'
res = requests.get(imsdb_url)
SCRIPTS_DIR = 'scripts'

In [11]:
soup = BeautifulSoup(res.content, 'lxml')

### Building a function to grab and store scripts

Okay, we now have the basic connection to IMSDB working. Next, we are going to create a function that scrapes the site and saves every script on it as a text file in the `scripts` folder (`SCRIPTS_DIR`).

The code below came from this [GitHub repo](https://github.com/j2kun/imsdb_download_all_scripts), but unfortunately it didn't work for me.

This will all likely be removed later on, as it doesn't contribute to my project.

In [8]:
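# Base URL for building absolute links (same site as imsdb_url above).
BASE_URL = 'https://imsdb.com'


def clean_script(text):
    # clean_script is not defined anywhere in this notebook; this minimal
    # stand-in (an assumption) only drops carriage returns and trims whitespace.
    return text.replace('\r', '').strip()


def get_script(relative_link):
    """Fetch one script; return (title, text), or (None, None) if unavailable."""
    tail = relative_link.split('/')[-1]
    print('fetching %s' % tail)
    script_front_url = BASE_URL + quote(relative_link)
    front_page_response = requests.get(script_front_url)
    front_soup = BeautifulSoup(front_page_response.text, "html.parser")

    # Look for the script link in the first centered <p> tag of the front page.
    try:
        script_link = front_soup.find_all('p', align="center")[0].a['href']
    except (IndexError, TypeError):
        print('%s has no script :(' % tail)
        return None, None

    if script_link.endswith('.html'):
        title = script_link.split('/')[-1].split(' Script')[0]
        script_url = BASE_URL + script_link
        script_soup = BeautifulSoup(requests.get(script_url).text, "html.parser")
        # Take the text of the first <td class="scrtext"> cell as the script body.
        script_text = script_soup.find_all('td', {'class': "scrtext"})[0].get_text()
        script_text = clean_script(script_text)
        return title, script_text
    else:
        print('%s is a pdf :(' % tail)
        return None, None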


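if __name__ == "__main__":
    response = requests.get('http://www.imsdb.com/all%20scripts/')
    html = response.text

    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.find_all('p')

    # Make sure the output folder exists before writing to it.
    os.makedirs(SCRIPTS_DIR, exist_ok=True)

    for p in paragraphs:
        # Skip paragraphs that do not contain a link.
        if p.a is None:
            continue
        # Assign the link before printing it; printing first raises
        # "NameError: name 'relative_link' is not defined" on the first iteration.
        relative_link = p.a['href']
        print(relative_link)
        title, script = get_script(relative_link)
        if not script:
            continue

        # Remove the '.html' suffix explicitly; str.strip('.html') would strip
        # any of those characters from both ends of the title.
        if title.endswith('.html'):
            title = title[:-len('.html')]
        with open(os.path.join(SCRIPTS_DIR, title + '.txt'), 'w', encoding='utf-8') as outfile:
            outfile.write(script)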

### www.bechdeltest.com API

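The code below calls the [Bechdel Test website](https://bechdeltest.com/) API and receives a payload covering every movie on the site. Each movie carries a `rating` from 0 to 3, where 3 means it passes all three parts of the test. We can see the information below.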

In [17]:
bechdel_df = pd.read_json('http://bechdeltest.com/api/v1/getAllMovies')

In [18]:
bechdel_df.head()

|   | rating | id | title | year | imdbid |
|---|---|---|---|---|---|
| 0 | 0 | 8040 | Roundhay Garden Scene | 1888 | 392728 |
| 1 | 0 | 5433 | Pauvre Pierrot | 1892 | 3 |
| 2 | 0 | 9583 | Blacksmith Scene | 1893 | 5 |
| 3 | 0 | 6200 | Execution of Mary, Queen of Scots, The | 1895 | 132134 |
| 4 | 0 | 5444 | Tables Turned on the Gardener | 1895 | 14 |
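
Before trusting the payload, it can help to sanity-check it. A minimal sketch, assuming the API returns a JSON array of objects with the five fields shown above. The names `REQUIRED_FIELDS`, `VALID_RATINGS` and `parse_movies` are hypothetical, and the example runs offline on two rows copied from the output above plus one made-up bad row.

```python
import json

# Fields seen in the DataFrame output; treating them as required is an assumption.
REQUIRED_FIELDS = {"rating", "id", "title", "year", "imdbid"}
VALID_RATINGS = {0, 1, 2, 3}


def parse_movies(payload: str) -> list:
    """Parse the JSON payload and keep only well-formed movie records."""
    records = json.loads(payload)
    clean = []
    for record in records:
        has_fields = REQUIRED_FIELDS.issubset(record)
        if has_fields and record["rating"] in VALID_RATINGS:
            clean.append(record)
    return clean


# Offline example: no request is made here.
sample_payload = json.dumps([
    {"rating": 0, "id": 8040, "title": "Roundhay Garden Scene", "year": 1888, "imdbid": 392728},
    {"rating": 0, "id": 5433, "title": "Pauvre Pierrot", "year": 1892, "imdbid": 3},
    {"rating": 7, "id": -1, "title": "Made-up row", "year": 2000, "imdbid": 0},
])
movies = parse_movies(sample_payload)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(movies)` if a DataFrame is wanted.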


In [19]:
bechdel_df.shape

(8930, 5)

In [20]:
bechdel_df.groupby('rating').describe()

`id` by rating:

| rating | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 918.0 | 4833.455338 | 2768.443096 | 15.0 | 2132.5 | 5002.5 | 7127.75 | 9590.0 |
| 1 | 1969.0 | 4678.57034 | 2884.948847 | 3.0 | 2190.0 | 4612.0 | 7254.0 | 9598.0 |
| 2 | 906.0 | 4736.759382 | 2785.521502 | 5.0 | 2173.0 | 4743.0 | 7135.5 | 9596.0 |
| 3 | 5137.0 | 4756.908896 | 2711.08209 | 1.0 | 2460.0 | 4725.0 | 7013.0 | 9600.0 |

`year` by rating:

| rating | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 918.0 | 1983.530501 | 33.46488 | 1888.0 | 1965.25 | 1995.0 | 2010.0 | 2021.0 |
| 1 | 1969.0 | 1996.637887 | 21.220757 | 1906.0 | 1988.0 | 2005.0 | 2012.0 | 2020.0 |
| 2 | 906.0 | 1991.995585 | 26.188817 | 1909.0 | 1979.25 | 2002.0 | 2012.0 | 2021.0 |
| 3 | 5137.0 | 1999.690481 | 20.490556 | 1899.0 | 1994.0 | 2007.0 | 2013.0 | 2021.0 |
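
The `count` column above also tells us how balanced the rating classes are, which matters for a classifier. A rough sketch of that arithmetic, with the counts copied from the summary; treating a rating of 3 as the only "pass" is stated here as an assumption.

```python
# Per-rating counts copied from the `count` column of the summary above.
rating_counts = {0: 918, 1: 1969, 2: 906, 3: 5137}

# The total should match the number of rows reported by bechdel_df.shape.
total = sum(rating_counts.values())

# Share of each rating class, rounded to three decimals.
class_share = {
    rating: round(count / total, 3)
    for rating, count in rating_counts.items()
}

# Collapsing to a binary label, assuming only a rating of 3 counts as a pass.
pass_share = round(rating_counts[3] / total, 3)

print(total)
print(class_share)
print(pass_share)
```

With these counts, rating 3 makes up a little over half of the rows, so a model that always predicts 3 would already be right more often than not. Any accuracy figure should be read against that baseline.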


### Save to CSV To Work in EDA Notebook

We are saving this information so we can explore it further in the next notebook. Come meet me there!

In [22]:
bechdel_df.to_csv('../data/bechdel_test.csv')
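
One thing to keep in mind when loading the file in the next notebook: `to_csv` writes the index as an unnamed first column by default. A small sketch of the round trip, using an in-memory buffer and two rows copied from the `head()` output as a stand-in for `bechdel_df` (the names `demo_df`, `naive` and `restored` are only for illustration):

```python
import io

import pandas as pd

# Two rows copied from the head() output above, standing in for bechdel_df.
demo_df = pd.DataFrame(
    {
        "rating": [0, 0],
        "id": [8040, 5433],
        "title": ["Roundhay Garden Scene", "Pauvre Pierrot"],
        "year": [1888, 1892],
        "imdbid": [392728, 3],
    }
)

# to_csv writes the index as an unnamed first column by default.
buffer = io.StringIO()
demo_df.to_csv(buffer)

# Reading without index_col turns that column into "Unnamed: 0".
buffer.seek(0)
naive = pd.read_csv(buffer)

# Passing index_col=0 restores it as the index instead.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

With the real file, the equivalent call would be `pd.read_csv('../data/bechdel_test.csv', index_col=0)`.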