## Context

XYZ television station is an up-and-coming NBC affiliate that is new to its market. It has the stable of NBC shows behind it, but it is looking for other ways to improve its ratings. Management believes that broadcasting well-rated movies may help.

As an NBC affiliate, XYZ has access to a limited pot of money from which it can produce its own movies. Management thinks that producing movies will improve not only its ratings but also its standing with NBC.

XYZ has access to scripts for the movies, but the movies do not have ratings from customers. Because the money from NBC is limited, XYZ wants to know which movies to produce. That is, XYZ wants to predict how well a movie will rate based on its script, so that it can choose good movies for production and broadcasting.


This notebook imports the ratings from Netflix (technically, the IMDb dataset at https://www.kaggle.com/datasets/ashirwadsangwan/imdb-dataset; Netflix charges for downloads now). There are three files of interest: "title.basics.tsv", which contains the title and year; "title.akas.tsv", which contains the region and language of each title, so we can work with just American titles; and "title.ratings.tsv", which contains the ratings of each film. The files are linked by a unique title code (`tconst`, called `titleId` in "title.akas.tsv").

This notebook also downloads the scripts from the internet (https://www.simplyscripts.com/year/). The files are downloaded, converted to .txt files, read into a pandas dataframe, and then converted to a bag-of-words.
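
As a rough illustration of the final step, a bag-of-words simply counts how often each token occurs in a text. The sketch below is only an assumption about what that step could look like, using the standard library; the `bag_of_words` helper, its tokenizer (lowercase letters and apostrophes), and the sample text are hypothetical and not taken from this notebook.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical snippet of a script, used only for illustration.
sample_script = "INT. KITCHEN - DAY. The dog barks. The dog runs."
counts = bag_of_words(sample_script)

# Show the most frequent tokens and their counts.
print(counts.most_common(3))
```

A real pipeline would probably also remove stop words and scene-heading boilerplate before counting.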

Open Netflix titles



In [1]:
import pandas as pd
from datetime import datetime
import numpy as np

In [2]:
titlesFromNetflix = pd.read_csv('title.basics.tsv', sep='\t')



In [3]:
titlesFromNetflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10388629 entries, 0 to 10388628
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 713.3+ MB
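
Every column above loads as `object`, and the file marks missing values with the literal string `\N` rather than leaving the field empty. One option, sketched here under the assumption that treating `\N` as missing is what is wanted, is to pass `na_values` to `pd.read_csv`. The sample rows are hypothetical stand-ins for "title.basics.tsv".

```python
import io

import pandas as pd

# Tiny stand-in for title.basics.tsv (hypothetical rows, subset of the columns).
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tgenres\n"
    "tt0000001\tshort\tCarmencita\t1894\tDocumentary,Short\n"
    "tt9999998\tmovie\tUntitled Project\t\\N\t\\N\n"
)

sample_titles = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    na_values=["\\N"],  # treat the \N placeholder as a missing value
)

# startYear is now numeric, and the placeholders are counted as missing.
print(sample_titles.dtypes)
print(sample_titles.isna().sum())
```

With this option, `isna()` and `dropna()` see the placeholder rows directly, so no separate string comparison against `\N` is needed.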


In [4]:
print(titlesFromNetflix.head())
print(titlesFromNetflix.shape)

      tconst titleType            primaryTitle           originalTitle  \
0  tt0000001     short              Carmencita              Carmencita   
1  tt0000002     short  Le clown et ses chiens  Le clown et ses chiens   
2  tt0000003     short          Pauvre Pierrot          Pauvre Pierrot   
3  tt0000004     short             Un bon bock             Un bon bock   
4  tt0000005     short        Blacksmith Scene        Blacksmith Scene   

  isAdult startYear endYear runtimeMinutes                    genres  
0       0      1894      \N              1         Documentary,Short  
1       0      1892      \N              5           Animation,Short  
2       0      1892      \N              4  Animation,Comedy,Romance  
3       0      1892      \N             12           Animation,Short  
4       0      1893      \N              1              Comedy,Short  
(10388629, 9)


Now that we have the titles and years, parse startYear as a datetime and keep just the year. We don't need to convert endYear because we are only looking at movies.

In [5]:
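# Parse the year strings; values that cannot be parsed as a year (such as the \N placeholder) become NaN
titlesFromNetflix["startYear"] = pd.to_datetime(titlesFromNetflix["startYear"], format='%Y', errors="coerce").dt.year
# Try to cast to int; with NaNs present the cast fails, and errors="ignore" leaves the column as float64
titlesFromNetflix["startYear"] = titlesFromNetflix["startYear"].astype(int, errors="ignore")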

In [6]:
print(titlesFromNetflix.head())

      tconst titleType            primaryTitle           originalTitle  \
0  tt0000001     short              Carmencita              Carmencita   
1  tt0000002     short  Le clown et ses chiens  Le clown et ses chiens   
2  tt0000003     short          Pauvre Pierrot          Pauvre Pierrot   
3  tt0000004     short             Un bon bock             Un bon bock   
4  tt0000005     short        Blacksmith Scene        Blacksmith Scene   

  isAdult  startYear endYear runtimeMinutes                    genres  
0       0     1894.0      \N              1         Documentary,Short  
1       0     1892.0      \N              5           Animation,Short  
2       0     1892.0      \N              4  Animation,Comedy,Romance  
3       0     1892.0      \N             12           Animation,Short  
4       0     1893.0      \N              1              Comedy,Short  


In [7]:
titlesFromNetflix.dtypes

tconst             object
titleType          object
primaryTitle       object
originalTitle      object
isAdult            object
startYear         float64
endYear            object
runtimeMinutes     object
genres             object
dtype: object
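
The `startYear` column stays `float64` because a plain integer column cannot hold NaN. If whole-number years are preferred, one possibility (an alternative sketch, not what this notebook does) is pandas' nullable `Int64` dtype. The sample series below is hypothetical.

```python
import pandas as pd

# Hypothetical sample of raw startYear strings, including the \N placeholder.
raw_years = pd.Series(["1894", "1892", "\\N", "2015"])

# Non-numeric entries become NaN; the nullable Int64 dtype then stores them as <NA>.
years = pd.to_numeric(raw_years, errors="coerce").astype("Int64")

print(years.dtype)         # Int64
print(years.isna().sum())  # 1
```

Comparisons such as `years > 2023` still work on the nullable column, with missing years propagating as `<NA>`.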

Let's clean up titlesFromNetflix.

First we drop unnecessary columns:

In [8]:
titlesFromNetflix = titlesFromNetflix.drop(['originalTitle', 'isAdult', 'endYear', 'runtimeMinutes'], axis=1)

In [9]:
titlesFromNetflix.head()

,tconst,titleType,primaryTitle,startYear,genres
0,tt0000001,short,Carmencita,1894.0,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,1892.0,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,1892.0,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,1892.0,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,1893.0,"Comedy,Short"


Oops! Looks like some startYears are NaNs. Let's drop those.

In [10]:
titlesFromNetflix_dropped = titlesFromNetflix.drop(titlesFromNetflix[(titlesFromNetflix['startYear'].isna())].index)

In [11]:
titlesFromNetflix_dropped.head()

,tconst,titleType,primaryTitle,startYear,genres
0,tt0000001,short,Carmencita,1894.0,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,1892.0,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,1892.0,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,1892.0,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,1893.0,"Comedy,Short"


In [12]:
titlesFromNetflix_dropped.tail()

,tconst,titleType,primaryTitle,startYear,genres
10388624,tt9916848,tvEpisode,Episode #3.17,2009.0,"Action,Drama,Family"
10388625,tt9916850,tvEpisode,Episode #3.19,2010.0,"Action,Drama,Family"
10388626,tt9916852,tvEpisode,Episode #3.20,2010.0,"Action,Drama,Family"
10388627,tt9916856,short,The Wind,2015.0,Short
10388628,tt9916880,tvEpisode,Horrid Henry Knows It All,2014.0,"Adventure,Animation,Comedy"


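We don't want movies that have an empty genres field, either. Note that `isna()` only matches true NaN values, so any rows that carry the `\N` placeholder string in `genres` remain.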

In [13]:
titlesFromNetflix_dropped = titlesFromNetflix_dropped.drop(titlesFromNetflix_dropped[(titlesFromNetflix_dropped['genres'].isna())].index)

Next, find out what kinds of items are in the database besides movies

In [14]:
titlesFromNetflix_dropped['titleType'].unique()

array(['short', 'movie', 'tvShort', 'tvMovie', 'tvSeries', 'tvEpisode',
       'tvMiniSeries', 'tvSpecial', 'video', 'videoGame', 'tvPilot'],
      dtype=object)

Drop everything that isn't a movie

In [15]:
titlesFromNetflix_dropped = titlesFromNetflix_dropped.drop(titlesFromNetflix_dropped[titlesFromNetflix_dropped['titleType'] != 'movie'].index)

In [16]:
titlesFromNetflix_dropped['titleType'].unique()

array(['movie'], dtype=object)
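
The `drop(...index)` pattern used in the cells above can also be written as boolean-mask selection, which combines several conditions in one step. This is a sketch on a hypothetical miniature of the titles table, not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the titles table.
titles = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "titleType": ["short", "movie", "movie", "tvEpisode"],
    "startYear": [1894.0, np.nan, 2015.0, 2010.0],
    "genres": ["Short", "Drama", "Comedy", None],
})

# Keep rows that are movies with a known year and a non-missing genres field.
keep = (
    titles["startYear"].notna()
    & titles["genres"].notna()
    & (titles["titleType"] == "movie")
)
movies = titles.loc[keep].reset_index(drop=True)

print(movies)
```

Selecting the rows to keep avoids building an index of rows to drop, which can matter on a table with millions of rows.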

In [17]:
titlesFromNetflix_dropped.tail()

,tconst,titleType,primaryTitle,startYear,genres
10388520,tt9916622,movie,Rodolpho Teóphilo - O Legado de um Pioneiro,2015.0,Documentary
10388547,tt9916680,movie,De la ilusión al desconcierto: cine colombiano...,2007.0,Documentary
10388559,tt9916706,movie,Dankyavar Danka,2013.0,Comedy
10388569,tt9916730,movie,6 Gunn,2017.0,Drama
10388579,tt9916754,movie,Chico Albuquerque - Revelações,2013.0,Documentary


How many genres are there?

In [18]:
genres = titlesFromNetflix_dropped['genres'].unique()
len(genres)

1419

There are too many genre combinations to sort here! However, let's drop the titles whose genres field is exactly "Documentary".

In [19]:
titlesFromNetflix_dropped = titlesFromNetflix_dropped.drop(titlesFromNetflix_dropped[titlesFromNetflix_dropped['genres'] == 'Documentary'].index)

In [20]:
titlesFromNetflix_dropped.shape

(480408, 5)
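
The comparison above matches the whole `genres` string, so combinations such as "Documentary,Short" are kept. If the goal were to remove every title that lists Documentary among its genres (an assumption about intent, not something the notebook states), a token-level test is one way to do it. The sample series is hypothetical.

```python
import pandas as pd

# Hypothetical sample of comma-separated genres strings.
sample_genres = pd.Series(
    ["Documentary", "Documentary,Short", "Drama,Romance", "Comedy"]
)

# Exact match: only rows whose entire genres string is "Documentary".
exact = sample_genres == "Documentary"

# Token match: any row that lists Documentary among its comma-separated genres.
any_doc = sample_genres.str.split(",").apply(lambda parts: "Documentary" in parts)

print(exact.tolist())    # [True, False, False, False]
print(any_doc.tolist())  # [True, True, False, False]
```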

In [21]:
titlesFromNetflix_dropped['startYear'].max()

2031.0

Let's drop movies that haven't been made yet

In [22]:
titlesFromNetflix_dropped = titlesFromNetflix_dropped.drop(titlesFromNetflix_dropped[titlesFromNetflix_dropped['startYear'] > 2023].index)

In [23]:
titlesFromNetflix_dropped.shape

(477815, 5)

In [24]:
titlesFromNetflix_dropped['startYear'].max()

2023.0

In [25]:
titlesFromNetflix_dropped['startYear'].min()

1894.0

That's enough cleaning of the titles database for now.

Next, let's import the region and language of each title, because we only want American movies

In [26]:
countriesFromNetflix = pd.read_csv('title.akas.tsv', sep='\t')



In [27]:
countriesFromNetflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38098714 entries, 0 to 38098713
Data columns (total 8 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   titleId          object
 1   ordering         int64 
 2   title            object
 3   region           object
 4   language         object
 5   types            object
 6   attributes       object
 7   isOriginalTitle  object
dtypes: int64(1), object(7)
memory usage: 2.3+ GB


In [28]:
countriesFromNetflix.head()

,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0


Looks like we have multiple versions of movies. Let's take a look at the regions

In [29]:
countries = countriesFromNetflix['region'].unique()


In [30]:
np.info(countries)

class:  ndarray
shape:  (249,)
strides:  (8,)
itemsize:  8
aligned:  True
contiguous:  True
fortran:  True
data pointer: 0x7fa0443ec000
byteorder:  little
byteswap:  False
type: object


In [31]:
countries_str = countries.astype(str)

In [32]:
np.info(countries_str)

class:  ndarray
shape:  (249,)
strides:  (16,)
itemsize:  16
aligned:  True
contiguous:  True
fortran:  True
data pointer: 0x7fa043b80c00
byteorder:  little
byteswap:  False
type: <U4


In [33]:
np.sort(countries_str)

array(['AD', 'AE', 'AF', 'AG', 'AI', 'AL', 'AM', 'AN', 'AO', 'AQ', 'AR',
       'AS', 'AT', 'AU', 'AW', 'AZ', 'BA', 'BB', 'BD', 'BE', 'BF', 'BG',
       'BH', 'BI', 'BJ', 'BM', 'BN', 'BO', 'BR', 'BS', 'BT', 'BUMM', 'BW',
       'BY', 'BZ', 'CA', 'CC', 'CD', 'CF', 'CG', 'CH', 'CI', 'CK', 'CL',
       'CM', 'CN', 'CO', 'CR', 'CSHH', 'CSXX', 'CU', 'CV', 'CW', 'CY',
       'CZ', 'DDDE', 'DE', 'DJ', 'DK', 'DM', 'DO', 'DZ', 'EC', 'EE', 'EG',
       'EH', 'ER', 'ES', 'ET', 'FI', 'FJ', 'FM', 'FO', 'FR', 'GA', 'GB',
       'GD', 'GE', 'GF', 'GH', 'GI', 'GL', 'GM', 'GN', 'GP', 'GQ', 'GR',
       'GT', 'GU', 'GW', 'GY', 'HK', 'HN', 'HR', 'HT', 'HU', 'ID', 'IE',
       'IL', 'IM', 'IN', 'IQ', 'IR', 'IS', 'IT', 'JE', 'JM', 'JO', 'JP',
       'KE', 'KG', 'KH', 'KI', 'KM', 'KN', 'KP', 'KR', 'KW', 'KY', 'KZ',
       'LA', 'LB', 'LC', 'LI', 'LK', 'LR', 'LS', 'LT', 'LU', 'LV', 'LY',
       'MA', 'MC', 'MD', 'ME', 'MG', 'MH', 'MK', 'ML', 'MM', 'MN', 'MO',
       'MP', 'MQ', 'MR', 'MS', 'MT', 'MU', 'MV', 

Look, there's "US"! 

But first, before dropping non-American movies, merge the two databases

In [34]:
filmsCountries = pd.merge(titlesFromNetflix_dropped, countriesFromNetflix, how='left', left_on='tconst', right_on='titleId')

In [35]:
filmsCountries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2613075 entries, 0 to 2613074
Data columns (total 13 columns):
 #   Column           Dtype  
---  ------           -----  
 0   tconst           object 
 1   titleType        object 
 2   primaryTitle     object 
 3   startYear        float64
 4   genres           object 
 5   titleId          object 
 6   ordering         float64
 7   title            object 
 8   region           object 
 9   language         object 
 10  types            object 
 11  attributes       object 
 12  isOriginalTitle  object 
dtypes: float64(2), object(11)
memory usage: 259.2+ MB
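
The merged table has far more rows than the titles table (2,613,075 versus 477,815) because each title can match several rows in "title.akas.tsv". The sketch below, on hypothetical miniature tables, shows how a one-to-many left merge multiplies rows and how the `validate` and `indicator` options of `pd.merge` can make that behavior explicit.

```python
import pandas as pd

# Hypothetical miniature tables: tt1 has two alternate titles, tt3 has none.
titles = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                       "primaryTitle": ["A", "B", "C"]})
akas = pd.DataFrame({"titleId": ["tt1", "tt1", "tt2"],
                     "region": ["US", "DE", "FR"]})

merged = pd.merge(
    titles, akas,
    how="left", left_on="tconst", right_on="titleId",
    validate="one_to_many",  # raises if tconst is duplicated on the left side
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

print(len(merged))  # 4 rows from 3 titles
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are titles with no match in the right-hand table; those are the rows that end up with NaN in `titleId`, `region`, and `language`.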


We have some missing data from the region/language database. We'll deal with that after we drop some unnecessary columns

In [36]:
filmsCountries = filmsCountries.drop(['ordering', 'types', 'attributes', 'isOriginalTitle'], axis=1)

In [37]:
filmsCountries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2613075 entries, 0 to 2613074
Data columns (total 9 columns):
 #   Column        Dtype  
---  ------        -----  
 0   tconst        object 
 1   titleType     object 
 2   primaryTitle  object 
 3   startYear     float64
 4   genres        object 
 5   titleId       object 
 6   title         object 
 7   region        object 
 8   language      object 
dtypes: float64(1), object(8)
memory usage: 179.4+ MB


Let's drop the non-English-language films first, then we'll deal with region. First, let's find the code for English.

In [38]:
filmsCountries['language'].unique()

array(['\\N', 'en', 'sv', 'cs', 'ca', 'ru', 'ja', 'bg', 'es', 'fr', 'tr',
       'qbn', nan, 'nl', 'sr', 'pt', 'cmn', 'uz', 'uk', 'qbp', 'fa', 'hi',
       'ar', 'rn', 'bs', 'yue', 'th', 'yi', 'sl', 'hr', 'ka', 'he', 'sk',
       'de', 'it', 'ga', 'kk', 'bn', 'gsw', 'gl', 'eu', 'az', 'ms', 'pl',
       'id', 'mr', 'qbo', 'mi', 'ta', 'lt', 'be', 'lv', 'af', 'hy', 'ur',
       'la', 'te', 'ro', 'ml', 'tl', 'mk', 'fi', 'el', 'cy', 'et', 'qal',
       'da', 'xh', 'gu', 'kn', 'eka', 'tg', 'gd', 'ko', 'ky', 'wo', 'no',
       'is', 'hu', 'sq', 'vi', 'zh', 'tk', 'pa', 'sd', 'ps', 'lb', 'ku',
       'zu', 'su', 'jv', 'fro', 'haw', 'mn', 'lo', 'my', 'am', 'qac',
       'ne', 'iu', 'st', 'tn'], dtype=object)

We saw some films don't have a language. Let's drop those first.

In [39]:
filmsCountries = filmsCountries.drop(filmsCountries[(filmsCountries['language'].isna())].index)

And now drop the non-English-language films

In [40]:
filmsCountries = filmsCountries.drop(filmsCountries[filmsCountries['language'] != 'en'].index)

In [41]:
filmsCountries.info()

<class 'pandas.core.frame.DataFrame'>
Index: 277946 entries, 15 to 2613066
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   tconst        277946 non-null  object 
 1   titleType     277946 non-null  object 
 2   primaryTitle  277946 non-null  object 
 3   startYear     277946 non-null  float64
 4   genres        277946 non-null  object 
 5   titleId       277946 non-null  object 
 6   title         277946 non-null  object 
 7   region        277946 non-null  object 
 8   language      277946 non-null  object 
dtypes: float64(1), object(8)
memory usage: 21.2+ MB


We don't have any null values now!

Now it's time to get rid of the non-US-region films

In [42]:
filmsCountries['region'].unique()

array(['SG', 'XWW', 'CA', 'IN', 'PH', 'NZ', 'US', 'XEU', 'IL', 'EG', 'BE',
       'CH', 'IE', 'JM', 'TH', 'ZA', 'HK', 'JP', 'MY', 'GB', 'ID', 'PK',
       'BD', 'XAS', 'NG', 'AU', 'IR', 'AF', 'FR', 'SE', 'BR', 'ES', 'XSA',
       'DE', 'CN', 'GR', 'TW', 'RU', 'IT', 'MX', 'YE', 'NO', 'DDDE', 'KR',
       'UA', 'RO', 'TR', 'AT', 'BZ', 'XNA', 'ET', 'LT', 'PT', 'CZ', 'DK'],
      dtype=object)

In [43]:
filmsCountries = filmsCountries.drop(filmsCountries[filmsCountries['region'] != 'US'].index)

In [44]:
filmsCountries.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1288 entries, 4256 to 2612246
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   tconst        1288 non-null   object 
 1   titleType     1288 non-null   object 
 2   primaryTitle  1288 non-null   object 
 3   startYear     1288 non-null   float64
 4   genres        1288 non-null   object 
 5   titleId       1288 non-null   object 
 6   title         1288 non-null   object 
 7   region        1288 non-null   object 
 8   language      1288 non-null   object 
dtypes: float64(1), object(8)
memory usage: 100.6+ KB


In [45]:
filmsCountries.head()

,tconst,titleType,primaryTitle,startYear,genres,titleId,title,region,language
4256,tt0005809,movie,The Governor,1915.0,Drama,tt0005809,The Governor,US,en
9407,tt0008305,movie,Les Misérables,1917.0,Drama,tt0008305,Les Misérables,US,en
14335,tt0009968,movie,Broken Blossoms,1919.0,"Drama,Romance",tt0009968,Broken Blossoms,US,en
51995,tt0018455,movie,Sunrise,1927.0,"Drama,Romance",tt0018455,Sunrise,US,en
66909,tt0020697,movie,The Blue Angel,1930.0,"Drama,Music",tt0020697,The Blue Angel,US,en
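
An alternative ordering, sketched here on hypothetical miniature tables, is to filter the region/language table down to US, English-tagged rows before merging, so that the join only touches the rows that will be kept. Note that a row whose language is the `\N` placeholder fails the `== "en"` test even if its region is US, which may be part of why so few rows remain after the two filters.

```python
import pandas as pd

# Hypothetical miniature tables.
akas = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt2", "tt3"],
    "region": ["US", "DE", "US", "GB"],
    "language": ["en", "\\N", "\\N", "en"],
})
titles = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                       "primaryTitle": ["A", "B", "C"]})

# Select US, English-tagged rows first, then join only those to the titles.
us_en = akas[(akas["region"] == "US") & (akas["language"] == "en")]
films = titles.merge(us_en, how="inner", left_on="tconst", right_on="titleId")

print(films)
```

On the full dataset this should give the same rows as filtering after the merge, while keeping the intermediate table much smaller.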


Okay, we have only English-language movies from the US region. Time to add some ratings! 

Let's import the ratings database

In [46]:
ratingsFromNetflix = pd.read_csv('title.ratings.tsv', sep='\t')

In [47]:
ratingsFromNetflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1378285 entries, 0 to 1378284
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1378285 non-null  object 
 1   averageRating  1378285 non-null  float64
 2   numVotes       1378285 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 31.5+ MB


And merge the ratings with the culled movies

In [48]:
filmsCountriesRatings = pd.merge(filmsCountries, ratingsFromNetflix, how='left', on='tconst')

In [49]:
filmsCountriesRatings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1288 entries, 0 to 1287
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         1288 non-null   object 
 1   titleType      1288 non-null   object 
 2   primaryTitle   1288 non-null   object 
 3   startYear      1288 non-null   float64
 4   genres         1288 non-null   object 
 5   titleId        1288 non-null   object 
 6   title          1288 non-null   object 
 7   region         1288 non-null   object 
 8   language       1288 non-null   object 
 9   averageRating  1154 non-null   float64
 10  numVotes       1154 non-null   float64
dtypes: float64(3), object(8)
memory usage: 110.8+ KB


In [50]:
filmsCountriesRatings.head()

,tconst,titleType,primaryTitle,startYear,genres,titleId,title,region,language,averageRating,numVotes
0,tt0005809,movie,The Governor,1915.0,Drama,tt0005809,The Governor,US,en,6.7,324.0
1,tt0008305,movie,Les Misérables,1917.0,Drama,tt0008305,Les Misérables,US,en,6.7,36.0
2,tt0009968,movie,Broken Blossoms,1919.0,"Drama,Romance",tt0009968,Broken Blossoms,US,en,7.2,10903.0
3,tt0018455,movie,Sunrise,1927.0,"Drama,Romance",tt0018455,Sunrise,US,en,8.1,53184.0
4,tt0020697,movie,The Blue Angel,1930.0,"Drama,Music",tt0020697,The Blue Angel,US,en,7.7,16068.0


Now that we've culled the movies, we can drop the region and language columns. We can also drop the titleId column, as it just repeats tconst

In [51]:
filmsCountriesRatings = filmsCountriesRatings.drop(['titleId', 'region', 'language'], axis=1)

In [52]:
filmsCountriesRatings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1288 entries, 0 to 1287
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         1288 non-null   object 
 1   titleType      1288 non-null   object 
 2   primaryTitle   1288 non-null   object 
 3   startYear      1288 non-null   float64
 4   genres         1288 non-null   object 
 5   title          1288 non-null   object 
 6   averageRating  1154 non-null   float64
 7   numVotes       1154 non-null   float64
dtypes: float64(3), object(5)
memory usage: 80.6+ KB


Looks like some movies didn't get rated. Let's drop those.

In [53]:
filmsCountriesRatings = filmsCountriesRatings.dropna(subset=['averageRating'])

In [54]:
filmsCountriesRatings.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1154 entries, 0 to 1287
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         1154 non-null   object 
 1   titleType      1154 non-null   object 
 2   primaryTitle   1154 non-null   object 
 3   startYear      1154 non-null   float64
 4   genres         1154 non-null   object 
 5   title          1154 non-null   object 
 6   averageRating  1154 non-null   float64
 7   numVotes       1154 non-null   float64
dtypes: float64(3), object(5)
memory usage: 81.1+ KB


So we're left with 1154 rated movies to train our model on!

Save the dataframe.

In [55]:
filmsCountriesRatings.to_csv("filmsCountriesRatings.tsv", sep="\t", index=False)
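
Because "title.akas.tsv" can hold several rows per title, the 1154 rows may include the same `tconst` more than once. A quick check, sketched on hypothetical rows, compares the row count with the number of distinct codes and keeps one row per movie if needed.

```python
import pandas as pd

# Hypothetical rated rows: tt1 appears under two alternate titles.
rated = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt2"],
    "title": ["Sunrise", "Sunrise: A Song of Two Humans", "The Blue Angel"],
    "averageRating": [8.1, 8.1, 7.7],
})

n_rows = len(rated)
n_titles = rated["tconst"].nunique()
print(n_rows, n_titles)  # 3 2

# Keep the first row for each tconst if one row per movie is wanted.
one_per_movie = rated.drop_duplicates(subset="tconst", keep="first")
```

If the two counts already agree on the real data, no deduplication is needed.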

In the following cell, we'll download all the scripts from the web

In [3]:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import os
from urllib.parse import urljoin
import requests
from requests.exceptions import ConnectionError as ce

# Page that indexes the scripts by year
url = "https://www.simplyscripts.com/year/"

# Render the page in headless Chrome so BeautifulSoup sees the full HTML
options = Options()
options.add_argument("--headless")
dr = webdriver.Chrome(options=options)
print("Headless Chrome Initialized")
dr.get(url)

bs = BeautifulSoup(dr.page_source, "lxml")

#https://stackoverflow.com/questions/54616638/download-all-pdf-files-from-a-website-using-python    
# help from Upom Malik
folder_location = r'/Users/tracker/Capstone/scripts'

# Each entry is [local file path, movie title]
list_of_titles = []

for link in bs.find(id="movie_wide").find_all('a'):
    web_link = link.get('href')
    # Only follow links that point to a script file
    if web_link.endswith(('.pdf', '.txt', '.html')):
        # Name each file using the last portion of its link, which is unique in this case
        filename = os.path.join(folder_location, web_link.split('/')[-1])

        print(link.string)

        with open(filename, 'wb') as f:
            try:
                response = requests.get(urljoin(url, web_link))
                # A 404 response leaves the file empty
                if response.status_code != 404:
                    f.write(response.content)
                # Record the file path and the name of the movie (also for 404 responses)
                list_of_titles.append([filename, link.string])
            except ce:
                # Skip links whose server cannot be reached; the empty file stays on disk
                continue

dr.quit()



#https://stackoverflow.com/questions/62846611/how-to-prevent-downloading-an-empty-pdf-file-while-using-get-and-requests-in-pyt

Headless Chrome Initialized
The Great Train Robbery
The Musketeers of Pig Alley
The New York Hat
Fogg's Millions
Spitfire
Satan McAllister's Heir
Witchcraft
Youth's Endearing Charm
The Hand that Rocks the Cradle
Everybody's Girl
The Love Expert
The Mystery of the Jamaica Bar
Pirate Gold
Nosferatu
Red Hot Romance
The Dinkum Bloke
The Sidewalks of New York
Judgment of the Storm
Peter Pan
Battleship Potemkin
The Lost World
The Phantom of the Opera
The Scarlet Letter
The Jazz Singer
London After Midnight
The Passion of Joan of Arc
An Andalusian Dog (Un Chien Andalou)
Un Chien Andalou (An Andalusian Dog)
An American Tragedy
Charlie Chan Carries On
Monkey Business
Platinum Blonde
American Madness
Grand Hotel
Trouble in Paradise
Charlie Chan's Greatest Case
Duck Soup
Duck Soup
King Kong (Kong)
Kong (King Kong)
Charlie Chan's Courage
It Happened One Night
It Happened One Night
The Thin Man
The Thin Man
Transatlantic Merry-Go-Round
Woman in the Dark
Triumph of the Will
Mr. Deeds Goes to Town
Lo

Return of the Jedi
Return of the Jedi (Revenge of the Jedi)
Rock & Rule
Star Wars: Return of the Jedi
Star Wars: Return of the Jedi
Star Wars: Return of the Jedi (Revenge of the Jedi)
Superman III
Under Fire
2010 The Odyssey Continues
The Adventures of Buckaroo Banzai Across the Eighth Dimension
Amadeus
Bachelor Party
Blood Simple
Blood Simple
A Christmas Carol
Dune
Ghostbusters
Gremlins
Indiana Jones and the Temple of Doom
Indiana Jones and the Temple of Doom
Karate Kid
Once Upon A Time In America
Purple Rain
Starman
Supergirl
Terminator
Terminator
Agnes of God
Back To The Future
Back To The Future
Back To The Future
Brazil
The Breakfast Club
Clue (part 1)
Clue (part 2)
Code Of Silence
Commando
Day of the Dead
Fletch
Fletch
Going for the Gold: The Bill Johnson Story
The Goonies
Kiss of the Spider Woman
Legend
Legend of Darkness
Lost in America
Nightmare on Elm Street 2: Freddy's Revenge
Rambo: First Blood II: The Mission
Rambo: First Blood Part II
Real Genius
Silver Bullet
Silver Bull

House of the Damned
Independence Day
The Island of Dr. Moreau
The Island of Dr. Moreau
Jerry Maguire
Kids in the Hall: Brain Candy
Leprechaun 4
Lone Star
Lone Star
Long Kiss Goodnight
Mission Impossible
Return of the Apes
Return of the Apes
The Rock
Romeo and Juliet (1996)
Romeo and Juliet (1996)
Sandman
Scream
The Six Million Dollar Man
Sling Blade
Sling Blade
Steel Sharks
SubUrbia
Swingers
Terminator 2: 3-D Battle Across Time
Tin Cup
Trainspotting
Trainspotting
White Squall
One Eight Seven
The 5th Element
Affliction
Airforce One
Alien 4 Resurrection
Alien 4 Resurrection
An American Werewolf In Paris
An American Werewolf In Paris
Anastasia
The Assignment
Austin Powers: International Man of Mystery
Austin Powers: International Man of Mystery
Batman & Robin
Batman and Robin
Bean
Bloodmoon
Chasing Amy
Mr. Joshua's Screenplay Site
Chasing Amy
Contact
Crow 3: Resurrection
The Devil's Advocate
Donnie Brasco
Edward Ford
Event Horizon
Face/Off
Face/Off
The Fifth Element
G. I. Jane
The Game
Th

Freddy vs. Jason
Freddy vs. Jason
Freddy vs. Jason
Freddy vs. Jason
Freddy vs. Jason
Girl With a Pearl Earring
Gothika
Harry Potter and the Chamber of Secrets
Harry Potter Fan Site
The Hebrew Hammer
Holes
Horror Inc.
Style Commando Entertainment
Identity
Intolerable Cruelty
Italian Job
Jason vs. Freddy
Jeepers Creepers 2
The Last Samurai
The Last Samurai
Leprechaun Back 2 Tha Hood
The Life of David Gale
Lost in Translation
Lost in Translation
Malibu's Most Wanted
Matchstick Men
The Matrix Reloaded
The Matrix Revolutions
Phone Booth
Pirates of the Caribbean
Pirates of the Caribbean
Pirates of the Caribbean
S.W.A.T.
Something's Gotta Give
Special
Timeline
Timeline
Waking up the Day 
Who's Your Daddy?
Willard
X2
X-men 2
2001 Maniacs
Addicted to Murder 5: The Last Vampire
Alfie
Batman: The Frightening
Blade Trinity
Blade Trinity
The Bourne Supremacy
The Butterfly Effect
Catwoman
Cellular
Cellular
Cold Mountain
Collateral
Collateral
Collateral
Crash
Dawn of the Dead 2004
Dead Birds
Feast - 
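
The download loop above opens the output file before it sends the request, so a 404 response or a connection error leaves an empty file on disk. A minimal sketch of one alternative is to request first and write only on success. The `download_file` helper, its `session` parameter, and the 30-second timeout are hypothetical choices for illustration, not part of the notebook.

```python
import requests


def download_file(url, dest_path, session=None, timeout=30):
    """Fetch url and write it to dest_path only if the request succeeds.

    Returns True when a file was written, False otherwise.
    """
    session = session or requests.Session()
    try:
        response = session.get(url, timeout=timeout)
    except requests.exceptions.RequestException:
        # Network-level failure: nothing is written.
        return False
    if response.status_code != 200 or not response.content:
        # Error status or empty body: nothing is written.
        return False
    with open(dest_path, "wb") as f:
        f.write(response.content)
    return True
```

With a helper like this, a title would be appended to `list_of_titles` only when the function returns True, so the list and the files on disk stay in step.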

In [53]:
list_of_titles[0:5]

[['/Users/tracker/Capstone/scripts/musketeers.html',
  'The Musketeers of Pig Alley'],
 ['/Users/tracker/Capstone/scripts/nyhat.html', 'The New York Hat'],
 ['/Users/tracker/Capstone/scripts/fogg.html', "Fogg's Millions"],
 ['/Users/tracker/Capstone/scripts/spitfire.html', 'Spitfire'],
 ['/Users/tracker/Capstone/scripts/satan.html', "Satan McAllister's Heir"]]

In [4]:
len(list_of_titles)

1395

In [5]:
list_of_titles[-1][1]

'Zapper'

From the website, we know that some of the scripts are for unpublished movies. These sit at the end of the list (everything after 'White Jazz'), so we slice them off.

In [6]:
list_of_titles[1363:1365]

[['/Users/tracker/Capstone/scripts/WJ.pdf', 'White Jazz'],
 ['/Users/tracker/Capstone/scripts/android_army.txt', 'Android Army']]

In [7]:
list_of_titles_a = list_of_titles[0:1364]

In [66]:
len(list_of_titles_a)

1364

In [8]:
list_of_titles_a[1363]

['/Users/tracker/Capstone/scripts/WJ.pdf', 'White Jazz']
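
The cutoff index 1364 above is hard-coded, so it would shift if the site added or removed a script. One way to make the slice more robust, assuming the cutoff title occurs only once in the list, is to look the index up by title. The miniature list and the variable names below are hypothetical.

```python
# Hypothetical miniature of list_of_titles: [file path, movie title] pairs.
titles_list = [
    ["scripts/zodiac.html", "Zodiac"],
    ["scripts/WJ.pdf", "White Jazz"],
    ["scripts/android_army.txt", "Android Army"],
    ["scripts/zapper.txt", "Zapper"],
]

last_kept_title = "White Jazz"  # last entry before the scripts to be removed

# Index of the first entry with that title; raises StopIteration if it is absent.
cutoff = next(
    i for i, (_, title) in enumerate(titles_list) if title == last_kept_title
)
pruned = titles_list[: cutoff + 1]

print(len(pruned), pruned[-1][1])  # 2 White Jazz
```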

Save the list of movies, so we don't have to rerun the download cell above

In [11]:
import csv

# newline='' lets the csv module manage line endings itself
with open('allTitleList.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(list_of_titles)

In [12]:
with open('prunedTitleList.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(list_of_titles_a)
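
To pick the list back up in a later session, the saved file can be read with `csv.reader`. The sketch below writes and re-reads a small hypothetical list in a temporary directory; the file name `titles_demo.csv` is made up for the example.

```python
import csv
import os
import tempfile

# Hypothetical [file path, movie title] pairs.
rows = [["scripts/nyhat.html", "The New York Hat"],
        ["scripts/fogg.html", "Fogg's Millions"]]

path = os.path.join(tempfile.mkdtemp(), "titles_demo.csv")

# newline="" lets the csv module control line endings itself.
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# Each row comes back as a list of strings.
with open(path, newline="", encoding="utf-8") as f:
    restored = [row for row in csv.reader(f)]

print(restored == rows)  # True
```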

Separate out the types of files, so we can read each type properly.

Create a function to do this for the three file types: .pdf, .html, and .txt

In [9]:
def list_separate(old_list,new_list,suffix):
    # Append to new_list every [file path, title] entry whose path ends with suffix
    for title in old_list:
        if title[0].endswith(suffix):
            new_list.append(title)

In [13]:
txtlist = []

list_separate(list_of_titles_a,txtlist,'.txt')
        
print(len(txtlist))

307


In [14]:
htmllist = []

list_separate(list_of_titles_a,htmllist,'.html')
        
print(len(htmllist))

592


In [15]:
pdflist = []

list_separate(list_of_titles_a,pdflist,'.pdf')
        
print(len(pdflist))

465
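
The three calls above each scan the full list once. An equivalent single pass, sketched with hypothetical sample pairs, groups the entries by file extension in a dictionary. Unlike `endswith`, this version lowercases the extension, so ".PDF" and ".pdf" would land in the same group; that is a design choice of the sketch, not of the notebook.

```python
import os
from collections import defaultdict

# Hypothetical [file path, movie title] pairs.
pairs = [
    ["scripts/nyhat.html", "The New York Hat"],
    ["scripts/casablanca.pdf", "Casablanca"],
    ["scripts/android_army.txt", "Android Army"],
    ["scripts/WJ.pdf", "White Jazz"],
]

by_ext = defaultdict(list)
for path, title in pairs:
    ext = os.path.splitext(path)[1].lower()  # e.g. ".pdf"
    by_ext[ext].append([path, title])

# Number of files per extension.
print({ext: len(items) for ext, items in by_ext.items()})
```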


Read the .pdf files and convert them to .txt files

In [16]:
#https://www.freecodecamp.org/news/extract-data-from-pdf-files-with-python/
from pdfquery import PDFQuery

pdf_dir = '/Users/tracker/Capstone/scripts'

# n is never incremented, so every failure below prints "except  0"
n=0

# some of the .pdfs are unopenable

# create a new list that just has the openable files
list_of_pdf_to_txt = []

for long_title in pdflist:
    try:
        title = long_title[0]
        print(title)
        # title is already an absolute path, so os.path.join returns it unchanged
        pdf = PDFQuery(os.path.join(pdf_dir, title))
        pdf.load()

        # Use CSS-like selectors to locate the elements
        text_elements = pdf.pq('LTTextLineHorizontal')

        # Extract the text from the elements
        text = [t.text for t in text_elements]
        # connect the elements
        total_text = ' '.join(text)

        # write to file (e.g. casablanca.pdf becomes casablanca.pdf.txt)
        filename1 = title + '.txt'
        with open(filename1, 'w') as file1:
            file1.write(total_text)

        # create the new list that just has the openable pdf names
        list_of_pdf_to_txt.append([filename1, long_title[1]])

    except Exception:
        # the files that were unopenable
        print('except ',n)




/Users/tracker/Capstone/scripts/only_angels_have_wings.pdf
/Users/tracker/Capstone/scripts/torridzone.pdf
except  0
/Users/tracker/Capstone/scripts/Maltese_Falcon.pdf
/Users/tracker/Capstone/scripts/tarzan_s_secret_treasure.pdf
/Users/tracker/Capstone/scripts/casablanca.pdf
/Users/tracker/Capstone/scripts/saboteur.pdf
/Users/tracker/Capstone/scripts/deadofnight.pdf
except  0
/Users/tracker/Capstone/scripts/enchantedcottage.pdf
except  0
/Users/tracker/Capstone/scripts/Big_Sleep.pdf
/Users/tracker/Capstone/scripts/call_northside_777-1.pdf
except  0
/Users/tracker/Capstone/scripts/call_northside_777-2.pdf
except  0
/Users/tracker/Capstone/scripts/portrait.pdf
except  0
/Users/tracker/Capstone/scripts/treasureofthesierramadre.pdf
/Users/tracker/Capstone/scripts/apachedrums.pdf
except  0
/Users/tracker/Capstone/scripts/Strangers_On_A_Train.pdf
/Users/tracker/Capstone/scripts/from_here_to_eternity_(1953).pdf
/Users/tracker/Capstone/scripts/From_Here_To_Eternity.pdf
except  0
/Users/tracker/

/Users/tracker/Capstone/scripts/freddydeadshootingscript.pdf
except  0
/Users/tracker/Capstone/scripts/prince_of_thieves.pdf
/Users/tracker/Capstone/scripts/badlieu.pdf
/Users/tracker/Capstone/scripts/The_Body_Guard.pdf
/Users/tracker/Capstone/scripts/city_of_joy.pdf
/Users/tracker/Capstone/scripts/cryinggame.pdf
/Users/tracker/Capstone/scripts/Distinguished_Gentleman.pdf
/Users/tracker/Capstone/scripts/Man_Trouble.pdf
/Users/tracker/Capstone/scripts/Newsies.pdf
except  0
/Users/tracker/Capstone/scripts/Power_of_One.pdf
/Users/tracker/Capstone/scripts/thunderheart.pdf
except  0
/Users/tracker/Capstone/scripts/Fugitive_EARLY.pdf
/Users/tracker/Capstone/scripts/groundhogday.pdf
/Users/tracker/Capstone/scripts/Last_Action_Hero.pdf
/Users/tracker/Capstone/scripts/Last_Action_Hero_(1st Draft).pdf
except  0
/Users/tracker/Capstone/scripts/A_Perfect_World.pdf
/Users/tracker/Capstone/scripts/This_Boys_Life.pdf
/Users/tracker/Capstone/scripts/three_musketeers.pdf
/Users/tracker/Capstone/scripts

/Users/tracker/Capstone/scripts/SomethingsGottaGive.pdf
/Users/tracker/Capstone/scripts/special.pdf
except  0
/Users/tracker/Capstone/scripts/timeline%20(1-2)2.pdf
/Users/tracker/Capstone/scripts/timeline%20(2-2)1.pdf
/Users/tracker/Capstone/scripts/waking.pdf
/Users/tracker/Capstone/scripts/X-Men_2.pdf
/Users/tracker/Capstone/scripts/X-Men_2.pdf
/Users/tracker/Capstone/scripts/2001_maniacs_(2004).pdf
/Users/tracker/Capstone/scripts/Alfie.pdf
/Users/tracker/Capstone/scripts/bournesupremacy.pdf
/Users/tracker/Capstone/scripts/catwoman.pdf
/Users/tracker/Capstone/scripts/cellular.pdf
/Users/tracker/Capstone/scripts/collateral_101203.pdf
/Users/tracker/Capstone/scripts/Crash.pdf
/Users/tracker/Capstone/scripts/garden-state.pdf
/Users/tracker/Capstone/scripts/GRUDGE,THE-2004.01.26-DOUBLE-BLUE.pdf
/Users/tracker/Capstone/scripts/Harold_Kumar-5-28-03-DOUBLE-WHITE-Final.pdf
/Users/tracker/Capstone/scripts/DEADER.pdf
except  0
/Users/tracker/Capstone/scripts/hr7-020829.pdf
except  0
/Users/tra

/Users/tracker/Capstone/scripts/The-Number-23-script.pdf
/Users/tracker/Capstone/scripts/once.pdf
except  0
/Users/tracker/Capstone/scripts/para.pdf
except  0
/Users/tracker/Capstone/scripts/r&d.pdf
except  0
/Users/tracker/Capstone/scripts/rails-and-ties.pdf
/Users/tracker/Capstone/scripts/rs_script.pdf
except  0
/Users/tracker/Capstone/scripts/savages.pdf
except  0
/Users/tracker/Capstone/scripts/SmokingUNIVERSALscript.pdf
except  0
/Users/tracker/Capstone/scripts/waitress.pdf
except  0
/Users/tracker/Capstone/scripts/WILD-HOGS_Copeland.pdf
/Users/tracker/Capstone/scripts/WC_0223_YellowRevs1.pdf
/Users/tracker/Capstone/scripts/DARK_GAMES-final_draft.pdf
/Users/tracker/Capstone/scripts/obg%20script.pdf
except  0
/Users/tracker/Capstone/scripts/WJ.pdf
except  0


Save the remaining lists: the new list that contains only the .pdf files that could be opened, and the three lists that separate the scripts by file type

In [18]:
with open('pdfToTextTitleList.csv', 'w') as file:
    writer = csv.writer(file)
    writer.writerows(list_of_pdf_to_txt)

In [19]:
with open('txtOnlyPrunedTitleList.csv', 'w') as file:
    writer = csv.writer(file)
    writer.writerows(txtlist)
    
with open('htmlOnlyPrunedTitleList.csv', 'w') as file:
    writer = csv.writer(file)
    writer.writerows(htmllist)
    
with open('pdfOnlyPrunedTitleList.csv', 'w') as file:
    writer = csv.writer(file)
    writer.writerows(pdflist)
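
Each saved list is a sequence of `[file_path, title]` rows. The following is a minimal sketch of a save-and-reload round trip, assuming that row shape. The helper names `save_rows` and `load_rows` and the second example row are hypothetical and do not come from the notebook. Opening the file with `newline=''` follows the recommendation in the `csv` module documentation.

```python
import csv
import os
import tempfile

def save_rows(path, rows):
    """Write [file_path, title] rows to a CSV file."""
    # newline='' lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

def load_rows(path):
    """Read the rows back as a list of lists of strings."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.reader(f)]

example_rows = [
    ["/Users/tracker/Capstone/scripts/musketeers.html", "The Musketeers of Pig Alley"],
    # hypothetical row: a title containing a comma is quoted automatically
    ["/tmp/hypothetical.pdf", "Title, With Comma"],
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "titleList.csv")
    save_rows(csv_path, example_rows)
    restored = load_rows(csv_path)
```

Reloading the rows this way would let a later session resume from the saved lists without repeating the conversion.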

Write the .html files to .txt

In [20]:
htmllist[0]

['/Users/tracker/Capstone/scripts/musketeers.html',
 'The Musketeers of Pig Alley']

In [41]:
from bs4 import BeautifulSoup

list_of_html_to_txt = []

for long_file_name in htmllist:
    # some of the files were empty and were deleted, so wrap the
    # conversion in a try/except clause to get around the missing
    # files
    try:
        
        title = long_file_name[0]  # full path to the .html file
        print(title)
    
        with open(title,'rb') as f:
            soup = BeautifulSoup(f, "html.parser")
        text_elements = soup.get_text()
        # flatten the script onto a single line
        new_string = " ".join(text_elements.splitlines())
        
    
        # write to file
        filename1 = title + '.txt'
        with open(filename1, 'w') as file1:
            file1.write(new_string)
        
        # keep only the .html files that could be opened and converted
        list_of_html_to_txt.append([filename1, long_file_name[1]])
    
    except Exception:
        # skip the missing/unopenable files
        print('except')


/Users/tracker/Capstone/scripts/musketeers.html
/Users/tracker/Capstone/scripts/nyhat.html
/Users/tracker/Capstone/scripts/fogg.html
/Users/tracker/Capstone/scripts/spitfire.html
/Users/tracker/Capstone/scripts/satan.html
except
/Users/tracker/Capstone/scripts/witchcraft.html
/Users/tracker/Capstone/scripts/youthsend.html
/Users/tracker/Capstone/scripts/hand.html
/Users/tracker/Capstone/scripts/everybody.html
/Users/tracker/Capstone/scripts/expert.html
/Users/tracker/Capstone/scripts/jamaica.html
/Users/tracker/Capstone/scripts/pirate.html
except
/Users/tracker/Capstone/scripts/nosferatu.html
/Users/tracker/Capstone/scripts/redhotromance.html
/Users/tracker/Capstone/scripts/dinkum.html
/Users/tracker/Capstone/scripts/sidewalks.html
/Users/tracker/Capstone/scripts/jos.html
/Users/tracker/Capstone/scripts/pan.html
/Users/tracker/Capstone/scripts/lostworld.html
/Users/tracker/Capstone/scripts/poto.html
/Users/tracker/Capstone/scripts/lam.html
/Users/tracker/Capstone/scripts/andalou.html
e

/Users/tracker/Capstone/scripts/fletch.html
/Users/tracker/Capstone/scripts/lia.html
/Users/tracker/Capstone/scripts/firstblood2.html
except
/Users/tracker/Capstone/scripts/rambo_first_blood_2.html
except
/Users/tracker/Capstone/scripts/real_genius.html
/Users/tracker/Capstone/scripts/silverbullet.html
except
/Users/tracker/Capstone/scripts/silverbullet.html
except
/Users/tracker/Capstone/scripts/Aliens_James_Cameron_May_28_1985_first_draft.html
/Users/tracker/Capstone/scripts/aliens.html
except
/Users/tracker/Capstone/scripts/bluevelvet.html
except
/Users/tracker/Capstone/scripts/hannah.html
/Users/tracker/Capstone/scripts/index.html
except
/Users/tracker/Capstone/scripts/manhunter.html
except
/Users/tracker/Capstone/scripts/script.html
/Users/tracker/Capstone/scripts/peggysue.html
/Users/tracker/Capstone/scripts/platoon.html
/Users/tracker/Capstone/scripts/TopGun.html
/Users/tracker/Capstone/scripts/transformers.html
/Users/tracker/Capstone/scripts/broadc_news.html
/Users/tracker/Cap

/Users/tracker/Capstone/scripts/romeo_juliet.html
/Users/tracker/Capstone/scripts/sling_blade.html
/Users/tracker/Capstone/scripts/suburbia.html
/Users/tracker/Capstone/scripts/Swingers.html
/Users/tracker/Capstone/scripts/trainspotting.html
/Users/tracker/Capstone/scripts/trainspotting.html
/Users/tracker/Capstone/scripts/AirForceOne_TXT.html
/Users/tracker/Capstone/scripts/alienresurrection_early.html
except
/Users/tracker/Capstone/scripts/american_werewolf_in_paris.html
except
/Users/tracker/Capstone/scripts/Austin_Powers_IMM.html
/Users/tracker/Capstone/scripts/bean.html
/Users/tracker/Capstone/scripts/bloodmoon.html
except
/Users/tracker/Capstone/scripts/chasing_amy.html
except
/Users/tracker/Capstone/scripts/index.html
except
/Users/tracker/Capstone/scripts/faceoff_early.html
except
/Users/tracker/Capstone/scripts/faceoff_production.html
/Users/tracker/Capstone/scripts/the-game-early.html
except
/Users/tracker/Capstone/scripts/the-game_shooting.html
/Users/tracker/Capstone/script

/Users/tracker/Capstone/scripts/15minutes.html
/Users/tracker/Capstone/scripts/American_Outlaws.html
/Users/tracker/Capstone/scripts/anniversaryparty.html
/Users/tracker/Capstone/scripts/thebeliever.html
/Users/tracker/Capstone/scripts/TheBijou.html
/Users/tracker/Capstone/scripts/blow.html
/Users/tracker/Capstone/scripts/bones.html
except
/Users/tracker/Capstone/scripts/index2.html
except
/Users/tracker/Capstone/scripts/pg1.html
except
/Users/tracker/Capstone/scripts/index.html
except
/Users/tracker/Capstone/scripts/forsaken.html
/Users/tracker/Capstone/scripts/frailty-script.html
except
/Users/tracker/Capstone/scripts/Jason_X_early.html
except
/Users/tracker/Capstone/scripts/jason_x_shooting.html
/Users/tracker/Capstone/scripts/ghostworld.html
/Users/tracker/Capstone/scripts/hannibal_unproduced.html
except
/Users/tracker/Capstone/scripts/hannibal_production.html
/Users/tracker/Capstone/scripts/Jason_X_early.html
except
/Users/tracker/Capstone/scripts/jason_x_shooting.html
/Users/trac
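
The output above reports only that a file failed, not why. As a sketch of one way to keep that information, the hypothetical helper `convert_all` below records the exception type for each failure. The converter `copy_as_single_line` is a stand-in that only flattens line breaks; it is not the BeautifulSoup conversion used in the cell, and all file names here are made up for illustration.

```python
import os
import tempfile

def convert_all(file_list, convert):
    """Apply convert(path) to every [path, title] row.

    Returns (converted, failures): converted holds [new_path, title] rows,
    failures holds (path, reason) tuples for the files that were skipped.
    """
    converted, failures = [], []
    for path, title in file_list:
        try:
            new_path = convert(path)
        except (OSError, UnicodeError) as err:
            # missing, unreadable, or undecodable file: record why and move on
            failures.append((path, f"{type(err).__name__}: {err}"))
            continue
        converted.append([new_path, title])
    return converted, failures

def copy_as_single_line(path):
    """Stand-in converter: write the file's text on one line to path + '.txt'."""
    with open(path, encoding="utf-8") as src:
        text = " ".join(src.read().splitlines())
    new_path = path + ".txt"
    with open(new_path, "w", encoding="utf-8") as dst:
        dst.write(text)
    return new_path

with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.html")
    with open(present, "w", encoding="utf-8") as f:
        f.write("FADE IN:\nINT. ROOM\n")
    missing = os.path.join(tmp, "missing.html")  # never created
    rows = [[present, "Present"], [missing, "Missing"]]
    converted, failures = convert_all(rows, copy_as_single_line)
    failure_kinds = [reason.split(":")[0] for _, reason in failures]
```

Catching only `OSError` and `UnicodeError` is an assumption about which failures occur; any other exception would surface instead of being silently skipped.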

Save the list that excludes the missing files

In [31]:
with open('htmlToTextTitleList.csv', 'w') as file:
    writer = csv.writer(file)
    writer.writerows(list_of_html_to_txt)

Some .txt files were also deleted, so rewrite the remaining .txt files and build a new list of the ones that still exist, as was done for the .html files above

In [42]:
list_of_txt_to_txt = []

for long_file_name in txtlist:
    # some of the files were empty and were deleted, so we need
    # to do a try/except clause to get around the missing
    # files
    try:
        
        title = long_file_name[0] 
        filename1 = title + '.txt'

        # https://stackoverflow.com/questions/15343743/copying-from-one-text-file-to-another-using-python
        # https://stackoverflow.com/questions/13298907/remove-all-newlines-from-inside-a-string
        with open(title) as f:
            print(title)
            with open(filename1, "w") as f1:
                alist = f.read().splitlines()
                jalist = " ".join(alist)
                
                f1.write(jalist)


        # keep only the .txt files that could be opened and rewritten
        list_of_txt_to_txt.append([filename1, long_file_name[1]])
    
    except Exception:
        # skip the files that were deleted
        print('except')



except
except
except
except
/Users/tracker/Capstone/scripts/MrSmithGoesToWashington.txt
/Users/tracker/Capstone/scripts/wizoz.txt
except
except
except
except
except
/Users/tracker/Capstone/scripts/WARWORLDS.txt
except
except
except
except
except
except
except
except
/Users/tracker/Capstone/scripts/strangelove.txt
/Users/tracker/Capstone/scripts/fantasticvoyager.txt
except
/Users/tracker/Capstone/scripts/planetofapes67.txt
except
/Users/tracker/Capstone/scripts/2001.txt
/Users/tracker/Capstone/scripts/POTA_Remake.txt
except
except
/Users/tracker/Capstone/scripts/clockwork.txt
/Users/tracker/Capstone/scripts/escape_pota.txt
/Users/tracker/Capstone/scripts/escape_pota.txt
/Users/tracker/Capstone/scripts/thx1138.txt
except
/Users/tracker/Capstone/scripts/THEGODFATHER.txt
except
except
except
except
except
/Users/tracker/Capstone/scripts/mp-holy.txt
except
except
/Users/tracker/Capstone/scripts/star_wars_4th.txt
except
/Users/tracker/Capstone/scripts/star_wars_1st_7_74.txt
except
/Users/tra

/Users/tracker/Capstone/scripts/MotherDreamsFD.txt
except
except
/Users/tracker/Capstone/scripts/nothill.txt
/Users/tracker/Capstone/scripts/StarWars-Episode1.txt
/Users/tracker/Capstone/scripts/sixth-sense.txt
/Users/tracker/Capstone/scripts/sixth-sense.txt
except
/Users/tracker/Capstone/scripts/StarWars-Episode1.txt
/Users/tracker/Capstone/scripts/wildwest.txt
/Users/tracker/Capstone/scripts/worldisnotenough.txt
except
except
/Users/tracker/Capstone/scripts/Bamboozled.txt
/Users/tracker/Capstone/scripts/bm.txt
/Users/tracker/Capstone/scripts/CastAway.txt
/Users/tracker/Capstone/scripts/cherryfalls.txt
except
/Users/tracker/Capstone/scripts/crouchingtiger.txt
except
/Users/tracker/Capstone/scripts/DarkAngel.txt
except
except
/Users/tracker/Capstone/scripts/gladiator_seconddraft.txt
except
except
except
except
except
except
/Users/tracker/Capstone/scripts/wtchmn.txt
except
/Users/tracker/Capstone/scripts/whatliesbeneath.txt
/Users/tracker/Capstone/scripts/xmenscript.txt
except
except
e
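
The cells above flatten each script with `" ".join(text.splitlines())`, which removes line breaks but keeps indentation and tabs. The sketch below compares that with `" ".join(text.split())`, which collapses every run of whitespace. The sample string is made up for illustration. Both forms yield the same whitespace-separated words, so the choice mainly affects file size and readability.

```python
# hypothetical script fragment with blank lines, indentation and a tab
raw = "FADE IN:\n\n    INT. OFFICE - DAY\n\tSAM\n  Hello.  \n"

# approach used in the notebook: drop line breaks, keep other whitespace
joined_lines = " ".join(raw.splitlines())

# alternative: collapse every run of whitespace into a single space
collapsed = " ".join(raw.split())

print(repr(joined_lines))
print(repr(collapsed))
```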

Save the new list of file names

In [33]:
with open('txtToTextTitleList.csv', 'w') as file:
    writer = csv.writer(file)
    writer.writerows(list_of_txt_to_txt)

Concatenate the three new lists that have only existing files in them.

In [45]:
new_list_of_scripts = list_of_pdf_to_txt + list_of_html_to_txt + list_of_txt_to_txt

In [46]:
len(new_list_of_scripts)

831
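
Several paths appear more than once in the conversion output above (for example `X-Men_2.pdf` and `trainspotting.html`), so the combined list may contain repeated rows. A minimal sketch of removing repeats while keeping the original order follows; the helper `drop_duplicate_rows` and the example rows are hypothetical, and whether duplicates should be dropped at all is a judgment call for the analysis.

```python
def drop_duplicate_rows(rows):
    """Keep the first [file_name, title] row for each file name, in order."""
    seen = set()
    unique = []
    for file_name, title in rows:
        if file_name in seen:
            continue  # this file was already listed
        seen.add(file_name)
        unique.append([file_name, title])
    return unique

# hypothetical rows: the second and third point at the same file
example_rows = [
    ["scripts/a.pdf.txt", "Movie A"],
    ["scripts/b.html.txt", "Movie B"],
    ["scripts/b.html.txt", "Movie B"],
    ["scripts/c.txt.txt", "Movie C"],
]
unique_rows = drop_duplicate_rows(example_rows)
```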

Create a dataframe that has the name of each movie in the first column and the script as the second column. 

In [48]:
import pandas as pd

temp_list_of_scripts = []

for long_file_name in new_list_of_scripts:
    # some of the new files were empty and were deleted, so we need
    # to do a try/except clause to get around the missing
    # files
    try:
        
        file_name = long_file_name[0]
        title = long_file_name[1]
        
        with open(file_name,'r') as f:
            temp_text = f.read()
        
        temp_list_of_scripts.append([title, temp_text])
        
    except Exception:
        # skip the deleted files
        print('except')

df_of_scripts = pd.DataFrame(temp_list_of_scripts)

except
[... "except" was printed 123 times in total, once for each file that could not be read ...]


In [51]:
df_of_scripts.shape

(708, 2)
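
Of the 831 listed files, 708 were read, so 123 were skipped. As a sketch of an alternative to collecting bare lists, the hypothetical helper `read_scripts` below builds one dictionary per script and keeps track of the skipped paths. A list of such dictionaries can be passed to `pd.DataFrame`, which then uses the keys as column names. The file names and script text are made up for illustration.

```python
import os
import tempfile

def read_scripts(rows):
    """Read each [file_name, title] row into a {'Movie_name', 'Script'} record."""
    records, skipped = [], []
    for file_name, title in rows:
        try:
            with open(file_name, encoding="utf-8") as f:
                records.append({"Movie_name": title, "Script": f.read()})
        except (OSError, UnicodeError):
            # deleted or undecodable file: remember it instead of only printing
            skipped.append(file_name)
    return records, skipped

with tempfile.TemporaryDirectory() as tmp:
    kept = os.path.join(tmp, "kept.txt")
    with open(kept, "w", encoding="utf-8") as f:
        f.write("FADE IN: A hypothetical script.")
    deleted = os.path.join(tmp, "deleted.txt")  # never created
    records, skipped = read_scripts([[kept, "Kept Movie"], [deleted, "Deleted Movie"]])
```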

In [52]:
df_of_scripts.head()

Unnamed: 0,0,1
0,The Maltese Falcon,Screen Play by John Huston Based u...
1,Casablanca,Released: 1942 Studio: Warner Bros. Runn...
2,The Big Sleep,William Faulkner Leigh Brackett Jules ...
3,Strangers on a Train,"FINAL DRAFT October 18, 1950 Converted to..."
4,From Here To Eternity,(cid:13) (cid:13) (cid:13) (cid:13) ...


In [54]:
df_of_scripts2 = df_of_scripts.rename(columns={0: 'Movie_name', 1: 'Script'})
df_of_scripts2.head()

Unnamed: 0,Movie_name,Script
0,The Maltese Falcon,Screen Play by John Huston Based u...
1,Casablanca,Released: 1942 Studio: Warner Bros. Runn...
2,The Big Sleep,William Faulkner Leigh Brackett Jules ...
3,Strangers on a Train,"FINAL DRAFT October 18, 1950 Converted to..."
4,From Here To Eternity,(cid:13) (cid:13) (cid:13) (cid:13) ...
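
The row for From Here To Eternity begins with repeated `(cid:13)` markers, which look like residue from the PDF text extraction. With the default tokenizer these would presumably be counted as the tokens `cid` and `13`. A minimal sketch of stripping them before vectorizing follows; the helper `strip_cid_tokens` and the sample text are hypothetical, since the rest of that script is not shown here.

```python
import re

# matches extraction markers such as "(cid:13)"
CID_TOKEN = re.compile(r"\(cid:\d+\)")

def strip_cid_tokens(text):
    """Remove (cid:N) markers and collapse the whitespace left behind."""
    without_tokens = CID_TOKEN.sub(" ", text)
    return " ".join(without_tokens.split())

# made-up sample in the style of the row shown above
sample = "(cid:13) (cid:13) FROM HERE TO ETERNITY (cid:13) FADE IN"
cleaned = strip_cid_tokens(sample)
```

If adopted, the function could be applied to the `Script` column before the bag-of-words step, for example with `df_of_scripts2.Script.map(strip_cid_tokens)`.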


Save the file of all scripts

In [56]:
df_of_scripts2.to_csv('movieTitlesAndScripts.csv')

In [90]:
df_of_scripts2.head()

Unnamed: 0,Movie_name,Script
0,The Maltese Falcon,Screen Play by John Huston Based u...
1,Casablanca,Released: 1942 Studio: Warner Bros. Runn...
2,The Big Sleep,William Faulkner Leigh Brackett Jules ...
3,Strangers on a Train,"FINAL DRAFT October 18, 1950 Converted to..."
4,From Here To Eternity,(cid:13) (cid:13) (cid:13) (cid:13) ...


Convert each script in the dataframe of scripts to a bag of words

In [93]:
from sklearn.feature_extraction.text import CountVectorizer

# drop words that appear in more than 90% of the scripts (max_df)
# or in fewer than 10% of the scripts (min_df)
# don't use a built-in stop-word list; see
# https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words
# for why
cv = CountVectorizer(max_df = .9, min_df = .1)
matrix_of_counts = cv.fit_transform(df_of_scripts2.Script)
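
To illustrate what the `max_df` and `min_df` thresholds do, here is a simplified pure-Python sketch of document-frequency filtering. It is not scikit-learn's implementation: it splits on whitespace rather than using the `CountVectorizer` token pattern, and the function name, thresholds and example documents are made up for illustration.

```python
from collections import Counter

def document_frequency_filter(docs, min_df, max_df):
    """Return the words whose document count lies within the given fractions."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # count each word at most once per document
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        word
        for word, n in doc_freq.items()
        if min_df * n_docs <= n <= max_df * n_docs
    )

docs = [
    "the cat runs",
    "the dog runs",
    "the cat sleeps",
    "the bird sings",
]
# keep words found in at least 50% and at most 75% of the documents
vocabulary = document_frequency_filter(docs, min_df=0.5, max_df=0.75)
print(vocabulary)
```

Here `the` is dropped because it appears in every document, and `dog`, `sleeps`, `bird` and `sings` are dropped because each appears in only one.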

In [94]:
counts = pd.DataFrame(matrix_of_counts.toarray(),
                      columns=cv.get_feature_names_out())

counts.head()

Unnamed: 0,00,000,10,100,101,102,103,104,105,106,...,yours,yourselves,youth,zero,zip,zips,zone,zoo,zoom,zooms
0,0,0,0,0,0,0,0,0,0,0,...,2,0,14,0,0,0,0,0,0,0
1,0,0,0,1,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,0,...,8,0,1,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0,...,5,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,4,1,0,0,0,0,1,0,1,2


Append the movie names to the bag-of-words for each script

In [100]:
movie_name = df_of_scripts2.Movie_name

In [101]:
# copy so that adding movie_name below does not also modify counts
df_of_scripts3 = counts.copy()

In [103]:
df_of_scripts3['movie_name'] = movie_name

In [104]:
df_of_scripts3.head()

Unnamed: 0,00,000,10,100,101,102,103,104,105,106,...,yourselves,youth,zero,zip,zips,zone,zoo,zoom,zooms,movie_name
0,0,0,0,0,0,0,0,0,0,0,...,0,14,0,0,0,0,0,0,0,The Maltese Falcon
1,0,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Casablanca
2,1,1,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,The Big Sleep
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Strangers on a Train
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,1,2,From Here To Eternity


Save the files

In [105]:
counts.to_csv('bagOfWords.csv')
df_of_scripts3.to_csv('bagOfWordsAndMovieTitle.csv')