## Box Office Mojo Web Scrape
An exercise in absolute frequency, weighted frequency and relative frequency, using the all-time domestic box office results scraped from Box Office Mojo. This is modified from a DataCamp exercise.


Background on the metrics to be calculated:
- Absolute frequency - the number of times a value appears. Example: how many times does the word 'star' appear in movie titles?

- Weighted frequency - each occurrence of a value is weighted by a measure attached to the record it appears in, summed over the complete data set. Here the measure is lifetime gross, so the weighted frequency of 'star' is the total lifetime gross of the titles that contain it.

- Relative frequency - usually the absolute frequency of a value divided by the total number of observations. In this exercise the related 'relative value' is the weighted frequency divided by the absolute frequency (total gross of the titles containing 'star' / number of times 'star' appears).
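
To make the three metrics concrete, here is a minimal sketch on a made-up set of titles and grosses. The titles and dollar amounts are illustrative assumptions, not scraped data.

```python
from collections import defaultdict

# Hypothetical toy data: (title, lifetime gross in dollars)
toy_movies = [
    ("Star Wars", 400),
    ("Star Trek", 200),
    ("Love Story", 100),
]

abs_freq = defaultdict(int)  # how many times each word appears
wtd_freq = defaultdict(int)  # summed gross of the titles the word appears in

for title, gross in toy_movies:
    for word in title.lower().split():
        abs_freq[word] += 1
        wtd_freq[word] += gross

# relative value = weighted frequency / absolute frequency
rel_value = {word: wtd_freq[word] / abs_freq[word] for word in abs_freq}

print(abs_freq["star"], wtd_freq["star"], rel_value["star"])  # 2 600 300.0
```

'star' appears twice, in titles grossing 400 and 200, so its weighted frequency is 600 and its relative value is 300.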


### Step One - gather the data


In [1]:
# import pandas, requests and Beautiful Soup

import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
# create the movies list including rank, title, studio, lifetime gross and year.
# prints every 10th page number to show scrape progress. range(1, 156) requests 155 pages
movie_list = []

for i in range(1, 156):
    if not i%10:
        print(i)
    page = 'http://www.boxofficemojo.com/alltime/domestic.htm?page=' + str(i) + '&p=.htm'
    resp = requests.get(page)
    soup = BeautifulSoup(resp.text, 'lxml')
    # trial and error to get the exact positions
    table_data = [x.text for x in soup.select('tr td')[11:511]]  
    # put every 5 values in a row
    temp_list = [table_data[i:i+5] for i in range(0, len(table_data[:-4]), 5)] 
    for temp in temp_list:
        movie_list.append(temp)

10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
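
The "put every 5 values in a row" step in the scraping cell above can be looked at in isolation. This is a minimal sketch, assuming a flat list of cell texts in row order; the helper name `chunk_rows` is hypothetical and not part of the original notebook.

```python
def chunk_rows(cells, width=5):
    """Split a flat list of table cells into rows of `width` values.

    Trailing cells that do not fill a complete row are dropped.
    """
    usable = len(cells) - len(cells) % width
    return [cells[start:start + width] for start in range(0, usable, width)]


flat = ['1', 'Star Wars: The Force Awakens', 'BV', '$936,662,225', '2015',
        '2', 'Avatar', 'Fox', '$760,507,625', '2009^']
rows = chunk_rows(flat)
print(rows[1])  # ['2', 'Avatar', 'Fox', '$760,507,625', '2009^']
```

Dropping incomplete rows guards against a page whose table has fewer cells than expected.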


In [3]:
# review final list to ensure accuracy
movie_list[0:10]

[['1', 'Star Wars: The Force Awakens', 'BV', '$936,662,225', '2015'],
 ['2', 'Avatar', 'Fox', '$760,507,625', '2009^'],
 ['3', 'Black Panther', 'BV', '$700,059,566', '2018'],
 ['4', 'Avengers: Infinity War', 'BV', '$678,690,536', '2018'],
 ['5', 'Titanic', 'Par.', '$659,363,944', '1997^'],
 ['6', 'Jurassic World', 'Uni.', '$652,270,625', '2015'],
 ['7', "Marvel's The Avengers", 'BV', '$623,357,910', '2012'],
 ['8', 'Star Wars: The Last Jedi', 'BV', '$620,181,382', '2017'],
 ['9', 'Incredibles 2', 'BV', '$597,247,671', '2018'],
 ['10', 'The Dark Knight', 'WB', '$534,858,444', '2008^']]

In [4]:
# convert list to a Dataframe
df = pd.DataFrame.from_records(movie_list)
print(df.shape)
df.head(10)

(15500, 5)


Unnamed: 0,0,1,2,3,4
0,1,Star Wars: The Force Awakens,BV,"$936,662,225",2015
1,2,Avatar,Fox,"$760,507,625",2009^
2,3,Black Panther,BV,"$700,059,566",2018
3,4,Avengers: Infinity War,BV,"$678,690,536",2018
4,5,Titanic,Par.,"$659,363,944",1997^
5,6,Jurassic World,Uni.,"$652,270,625",2015
6,7,Marvel's The Avengers,BV,"$623,357,910",2012
7,8,Star Wars: The Last Jedi,BV,"$620,181,382",2017
8,9,Incredibles 2,BV,"$597,247,671",2018
9,10,The Dark Knight,WB,"$534,858,444",2008^


### Step Two - Review & Clean the data

In [5]:
# rename the columns
df.columns = ['rank', 'title', 'studio', 'lifetime_gross', 'year']

In [6]:
df.head()

Unnamed: 0,rank,title,studio,lifetime_gross,year
0,1,Star Wars: The Force Awakens,BV,"$936,662,225",2015
1,2,Avatar,Fox,"$760,507,625",2009^
2,3,Black Panther,BV,"$700,059,566",2018
3,4,Avengers: Infinity War,BV,"$678,690,536",2018
4,5,Titanic,Par.,"$659,363,944",1997^


In [7]:
# we have some characters to remove to 'clean' up: '$', ',' '^', ':'
# review the year column
df['year'].unique()

array(['2015', '2009^', '2018', '1997^', '2012', '2017', '2008^', '2016',
       '1999^', '1977^', '2004', '1982^', '2013', '2006', '1994^', '2010',
       '2002', '1993^', '2009', '2011', '2003^', '2005^', '2004^', '2014',
       '2002^', '2007', '2008', '2001', '2001^', '1983^', '1996^', '2003',
       '1999', '2005', '1980^', '1990', '2000', '1975', '1989', '1997',
       '1981^', '1984^', '1996', '1984', '1973^', '1993', '1991^', '1992',
       '1998', '1985', '1939^', '1995^', '2017^', '1978^', '1937^',
       '1995', '1986^', '1982', '1986', '1988', '1987', '1991', '1965',
       '1973', '2015^', '1994', '1961^', '1967^', '2010^', '1972^',
       '1978', '2013^', '1977', '1992^', '1974^', '1990^', '1981', '1976',
       '1974', '1989^', '1983', '1970', '1979', '1980', '1969', '1964^',
       '1942^', '1985^', '1955^', '1950^', '1953^', '1972', '1940^',
       '1979^', '1941^', '1988^', '1959', '1964', '1956', '1987^',
       '2000^', '1968^', '1963', '1998^', '1967', '1971', '195

In [8]:
# we have 'n/a' in the data, how many and where?
df[df['year'] == 'n/a']

Unnamed: 0,rank,title,studio,lifetime_gross,year
8088,8089,Warner Bros. 75th Anniversary Film Festival,WB,"$741,855",
8232,8233,Hum Aapke Dil Mein Rahte Hain,Eros,"$668,678",
8282,8283,Purple Moon (Re-issue),Mira.,"$640,945",
10598,10599,Amarcord,Jan.,"$125,493",


In [9]:
# not relevant so we will delete the 'n/a'
# find the index and remove the 'n/a' items that match
na_year_idx =  [i for i, x in enumerate(movie_list) if x[4] == 'n/a'] # get the indexes of the 'n/a' values
na_year_idx = set(na_year_idx)
final_list = [v for i, v in enumerate(movie_list) if i not in na_year_idx]

In [10]:
# double check to ensure we removed 4 items
len(final_list)

15496

In [11]:
# create dataframe
# remove special characters
# rename columns
# change data types where applicable

import re
regex = '|'.join([r'\$', ',', r'\^'])  # special chars to remove; raw strings avoid invalid-escape warnings

columns = ['rank', 'title', 'studio', 'lifetime_gross', 'year']

bom_df = pd.DataFrame({
    'rank': [int(x[0]) for x in final_list],
    'title': [x[1] for x in final_list],
    'studio': [x[2] for x in final_list],
    'lifetime_gross': [int(re.sub(regex, '', x[3])) for x in final_list],  
    'year': [int(re.sub(regex, '', str(x[4]))) for x in final_list],  }) 

print(bom_df.shape)
bom_df.head(10)

(15496, 5)


Unnamed: 0,rank,title,studio,lifetime_gross,year
0,1,Star Wars: The Force Awakens,BV,936662225,2015
1,2,Avatar,Fox,760507625,2009
2,3,Black Panther,BV,700059566,2018
3,4,Avengers: Infinity War,BV,678690536,2018
4,5,Titanic,Par.,659363944,1997
5,6,Jurassic World,Uni.,652270625,2015
6,7,Marvel's The Avengers,BV,623357910,2012
7,8,Star Wars: The Last Jedi,BV,620181382,2017
8,9,Incredibles 2,BV,597247671,2018
9,10,The Dark Knight,WB,534858444,2008
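
The character clean-up in the cell above can also be written as a small helper. This is a sketch only: the name `to_int` is hypothetical, and it assumes every value is a digit string decorated with '$', ',' or '^'.

```python
import re

# Characters seen in the scraped values: '$' and ',' in grosses, '^' after some years
SPECIAL_CHARS = re.compile(r'[$,^]')


def to_int(raw):
    """Strip '$', ',' and '^' from a scraped string and convert it to int."""
    return int(SPECIAL_CHARS.sub('', raw))


print(to_int('$936,662,225'))  # 936662225
print(to_int('2009^'))         # 2009
```

Inside a character class, `$` and a non-leading `^` are literal, so no escaping is needed. A value such as 'n/a' would still raise `ValueError`, which is why those rows are removed first.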


In [12]:
# check datatypes
bom_df.dtypes

rank               int64
title             object
studio            object
lifetime_gross     int64
year               int64
dtype: object

In [13]:
# quick stats
bom_df.describe()

Unnamed: 0,rank,lifetime_gross,year
count,15496.0,15496.0,15496.0
mean,7750.228833,18268790.0,2003.388358
std,4475.123375,45886100.0,11.187391
min,1.0,1808.0,1921.0
25%,3874.75,62182.75,1996.0
50%,7748.5,976593.0,2006.0
75%,11626.25,15449610.0,2012.0
max,15500.0,936662200.0,2018.0


In [14]:
# review the year
print(bom_df['year'].unique())
print()
print(bom_df['year'].max())
print(bom_df['year'].min())

[2015 2009 2018 1997 2012 2017 2008 2016 1999 1977 2004 1982 2013 2006
 1994 2010 2002 1993 2011 2003 2005 2014 2007 2001 1983 1996 1980 1990
 2000 1975 1989 1981 1984 1973 1991 1992 1998 1985 1939 1995 1978 1937
 1986 1988 1987 1965 1961 1967 1972 1974 1976 1970 1979 1969 1964 1942
 1955 1950 1953 1940 1941 1959 1956 1968 1963 1971 1962 1960 1954 1952
 1966 1957 1946 1945 1921 1958 1927 1926 1949 1947 1936 1931]

2018
1921


In [15]:
print(bom_df['lifetime_gross'].max())
print(bom_df['lifetime_gross'].min())

936662225
1808


15,496 movies 

Year ranges from 1921 - 2018

Lifetime gross ranges from \$1,808 to \$936,662,225

### Step Three - Calculate Absolute Frequency, Weighted Frequency and Relative Value of Words in Movie Titles

In [16]:
# to avoid counting case variants separately we lowercase all words: 'The' and 'the' are the same.

from collections import defaultdict

def word_frequency(text_list, num_list, sep=None):
    word_freq = defaultdict(lambda: [0, 0])

    for text, num in zip(text_list, num_list):
        for word in text.split(sep=sep): 
            word_freq[word.lower()][0] += 1 
            word_freq[word.lower()][1] += num

    columns = {0: 'abs_freq', 1: 'wtd_freq'}

    abs_wtd_df = (pd.DataFrame.from_dict(word_freq, orient='index')
                 .rename(columns=columns )
                 .sort_values('abs_freq', ascending=False)
                 .assign(rel_value=lambda df: df['wtd_freq'] / df['abs_freq']).round())

    abs_wtd_df.insert(1, 'abs_perc', value=abs_wtd_df['abs_freq'] / abs_wtd_df['abs_freq'].sum())
    abs_wtd_df.insert(2, 'abs_perc_cum', abs_wtd_df['abs_perc'].cumsum())
    abs_wtd_df.insert(4, 'wtd_freq_perc', abs_wtd_df['wtd_freq'] / abs_wtd_df['wtd_freq'].sum())
    abs_wtd_df.insert(5, 'wtd_freq_perc_cum', abs_wtd_df['wtd_freq_perc'].cumsum())

    return abs_wtd_df
word_freq_df = word_frequency(bom_df['title'], bom_df['lifetime_gross'])
word_freq_df.head(10).style.bar(['abs_freq', 'wtd_freq', 'rel_value'], color='#edc993')

Unnamed: 0,abs_freq,abs_perc,abs_perc_cum,wtd_freq,wtd_freq_perc,wtd_freq_perc_cum,rel_value
the,4370,0.098337,0.098337,101183660820,0.119604,0.119604,23154200.0
of,1409,0.0317064,0.130043,30341141654,0.0358646,0.155468,21533800.0
a,646,0.0145368,0.14458,8415667616,0.00994769,0.165416,13027300.0
and,547,0.012309,0.156889,12418250673,0.0146789,0.180095,22702500.0
in,501,0.0112739,0.168163,6359960043,0.00751775,0.187612,12694500.0
to,359,0.00807849,0.176242,5394198703,0.00637618,0.193989,15025600.0
love,210,0.00472558,0.180967,1605077459,0.00189727,0.195886,7643230.0
man,197,0.00443304,0.1854,3994622020,0.00472182,0.200608,20277300.0
my,191,0.00429803,0.189698,1651905864,0.00195263,0.20256,8648720.0
for,185,0.00416301,0.193861,2010847995,0.00237691,0.204937,10869400.0
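
A note on reading `abs_perc`: it is the share of all word occurrences, not the share of titles that contain the word. The difference shows up on a tiny made-up example (the titles below are illustrative assumptions):

```python
# Hypothetical titles, for illustration only
titles = ["The Dark Knight", "The Lord of the Rings", "Avatar", "Titanic"]

tokens = [word.lower() for title in titles for word in title.split()]

# Share of all word occurrences that are 'the' (what abs_perc measures)
share_of_words = tokens.count('the') / len(tokens)

# Share of titles that contain 'the' at least once
share_of_titles = sum('the' in title.lower().split() for title in titles) / len(titles)

print(share_of_words, share_of_titles)  # 0.3 0.5
```

Here 'the' makes up 3 of 10 words but appears in 2 of 4 titles, so the two percentages answer different questions.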


- So, the top words are stop words. 'the' is the top word and accounts for about 10% of all words in movie titles.

- Let's refine by removing selected stop words and recalculate:
'of', 'in', 'to', 'and', 'a', 'the', 'for', 'on', '&', 'is', 'at', 'it', 'from', 'with'


In [17]:
def word_frequency(text_list, num_list, sep=None, rm_words=('of','in', 'to', 'and', 'a', 'the', 
                                         'for', 'on', '&', 'is', 'at', 'it',
                                         'from', 'with')):  
    word_freq = defaultdict(lambda: [0, 0])

    for text, num in zip(text_list, num_list):
        for word in text.split(sep=sep): 
            # This should take care of ignoring the word if it's in the stop words
            if word.lower() in rm_words:  
                continue                  
            # .lower() makes sure we are not duplicating words
            word_freq[word.lower()][0] += 1  
            word_freq[word.lower()][1] += num

    columns = {0: 'abs_freq', 1: 'wtd_freq'}

    abs_wtd_df = (pd.DataFrame.from_dict(word_freq, orient='index')
                 .rename(columns=columns )
                 .sort_values('abs_freq', ascending=False)
                 .assign(rel_value=lambda df: df['wtd_freq'] / df['abs_freq']).round())

    abs_wtd_df.insert(1, 'abs_perc', value=abs_wtd_df['abs_freq'] / abs_wtd_df['abs_freq'].sum())
    abs_wtd_df.insert(2, 'abs_perc_cum', abs_wtd_df['abs_perc'].cumsum())
    abs_wtd_df.insert(4, 'wtd_freq_perc', abs_wtd_df['wtd_freq'] / abs_wtd_df['wtd_freq'].sum())
    abs_wtd_df.insert(5, 'wtd_freq_perc_cum', abs_wtd_df['wtd_freq_perc'].cumsum())

    abs_wtd_df = abs_wtd_df.reset_index().rename(columns={'index': 'word'})

    return abs_wtd_df
word_freq_df = word_frequency(bom_df['title'], bom_df['lifetime_gross'])
word_freq_df.head(10).style.bar(['abs_freq', 'wtd_freq', 'rel_value'], color='#edc993')

Unnamed: 0,word,abs_freq,abs_perc,abs_perc_cum,wtd_freq,wtd_freq_perc,wtd_freq_perc_cum,rel_value
0,love,210,0.00589705,0.00589705,1605077459,0.00239919,0.00239919,7643230.0
1,man,197,0.005532,0.0114291,3994622020,0.00597097,0.00837016,20277300.0
2,my,191,0.00536351,0.0167926,1651905864,0.00246919,0.0108393,8648720.0
3,i,167,0.00468956,0.0214821,2240841163,0.0033495,0.0141888,13418200.0
4,2,159,0.00446491,0.025947,10062535980,0.015041,0.0292298,63286400.0
5,me,139,0.00390329,0.0298503,2491755301,0.00372455,0.0329544,17926300.0
6,life,135,0.00379096,0.0336413,1587513579,0.00237294,0.0353273,11759400.0
7,last,133,0.0037348,0.0373761,2671544788,0.0039933,0.0393206,20086800.0
8,you,125,0.00351015,0.0408862,1754593285,0.00262268,0.0419433,14036700.0
9,movie,115,0.00322934,0.0441156,3216257864,0.00480751,0.0467508,27967500.0


In [18]:
(word_freq_df.sort_values('wtd_freq', ascending=False)
 .head(10)
    .style.bar(['abs_freq', 'wtd_freq', 'rel_value'],
               color='#edc993'))

Unnamed: 0,word,abs_freq,abs_perc,abs_perc_cum,wtd_freq,wtd_freq_perc,wtd_freq_perc_cum,rel_value
4,2,159,0.00446491,0.025947,10062535980,0.015041,0.0292298,63286400.0
67,star,46,0.00129174,0.146752,5588259962,0.00835306,0.157388,121484000.0
1,man,197,0.005532,0.0114291,3994622020,0.00597097,0.00837016,20277300.0
81,part,41,0.00115133,0.163882,3262579777,0.00487675,0.179455,79575100.0
9,movie,115,0.00322934,0.0441156,3216257864,0.00480751,0.0467508,27967500.0
32,3,64,0.0017972,0.0948583,3199658091,0.00478269,0.0956812,49994700.0
29,ii,67,0.00188144,0.0894106,3077717709,0.00460042,0.0886937,45936100.0
952,wars:,6,0.000168487,0.5033,2757497155,0.00412177,0.534464,459583000.0
7,last,133,0.0037348,0.0373761,2671544788,0.0039933,0.0393206,20086800.0
148,harry,27,0.000758193,0.226054,2611329714,0.00390329,0.235729,96715900.0


In [19]:
(word_freq_df.sort_values('rel_value', ascending=False)
 .head(10)
    .style.bar(['abs_freq', 'wtd_freq', 'rel_value'],
               color='#edc993'))

Unnamed: 0,word,abs_freq,abs_perc,abs_perc_cum,wtd_freq,wtd_freq_perc,wtd_freq_perc_cum,rel_value
6753,awakens,1,2.80812e-05,0.840948,936662225,0.00140008,0.881073,936662000.0
7129,avatar,1,2.80812e-05,0.851507,760507625,0.00113677,0.883144,760508000.0
6905,marvel's,1,2.80812e-05,0.845216,623357910,0.000931765,0.882005,623358000.0
3743,avengers:,2,5.61624e-05,0.743843,1137696404,0.00170057,0.771981,568848000.0
3666,jedi,2,5.61624e-05,0.739519,929487559,0.00138935,0.759554,464744000.0
952,wars:,6,0.000168487,0.5033,2757497155,0.00412177,0.534464,459583000.0
6299,ultron,1,2.80812e-05,0.828199,459005868,0.0006861,0.873391,459006000.0
6087,extra-terrestrial,1,2.80812e-05,0.822246,435110554,0.000650382,0.872702,435111000.0
6076,e.t.:,1,2.80812e-05,0.821937,435110554,0.000650382,0.872051,435111000.0
3901,incredibles,2,5.61624e-05,0.752717,858688763,0.00128353,0.795393,429344000.0


In [20]:
# abs_freq greater than 1
abs_greater_than_1 = word_freq_df[word_freq_df['abs_freq'] > 1]
(abs_greater_than_1.sort_values('rel_value', ascending=False)
 .head(10)
    .style.bar(['abs_freq', 'wtd_freq', 'rel_value'],
               color='#edc993'))

Unnamed: 0,word,abs_freq,abs_perc,abs_perc_cum,wtd_freq,wtd_freq_perc,wtd_freq_perc_cum,rel_value
3743,avengers:,2,5.61624e-05,0.743843,1137696404,0.00170057,0.771981,568848000.0
3666,jedi,2,5.61624e-05,0.739519,929487559,0.00138935,0.759554,464744000.0
952,wars:,6,0.000168487,0.5033,2757497155,0.00412177,0.534464,459583000.0
3901,incredibles,2,5.61624e-05,0.752717,858688763,0.00128353,0.795393,429344000.0
1892,rings:,3,8.42436e-05,0.62256,1035942020,0.00154848,0.678616,345314000.0
3914,deadpool,2,5.61624e-05,0.753447,681473332,0.00101863,0.798757,340737000.0
3944,hallows,2,5.61624e-05,0.755132,676994524,0.00101194,0.804657,338497000.0
3936,deathly,2,5.61624e-05,0.754683,676994524,0.00101194,0.802386,338497000.0
3627,titanic,2,5.61624e-05,0.737328,659429737,0.000985684,0.753623,329715000.0
3680,avengers,2,5.61624e-05,0.740305,646742849,0.00096672,0.762993,323371000.0


In [21]:
abs_greater_than_2 = word_freq_df[word_freq_df['abs_freq'] > 2]
(abs_greater_than_2.sort_values('rel_value', ascending=False)
 .head(10)
    .style.bar(['abs_freq', 'wtd_freq', 'rel_value'],
               color='#edc993'))

Unnamed: 0,word,abs_freq,abs_perc,abs_perc_cum,wtd_freq,wtd_freq_perc,wtd_freq_perc_cum,rel_value
952,wars:,6,0.000168487,0.5033,2757497155,0.00412177,0.534464,459583000.0
1892,rings:,3,8.42436e-05,0.62256,1035942020,0.00154848,0.678616,345314000.0
1656,shrek,4,0.000112325,0.599028,1270347989,0.00189885,0.648129,317587000.0
1075,jurassic,6,0.000168487,0.524023,1878068192,0.00280725,0.559144,313011000.0
1842,despicable,3,8.42436e-05,0.618348,884199550,0.00132166,0.674581,294733000.0
1521,saga:,4,0.000112325,0.583865,1170767255,0.00175001,0.626545,292692000.0
1737,episode,4,0.000112325,0.608127,1165526659,0.00174217,0.661472,291382000.0
1333,caribbean:,5,0.000140406,0.56154,1451780833,0.00217005,0.59615,290356000.0
2407,hobbit:,3,8.42436e-05,0.665946,816490211,0.00122045,0.731139,272163000.0
971,spider-man,6,0.000168487,0.506501,1585340069,0.00236969,0.539359,264223000.0


### Step Four - Analysis and Export


- 'love' is the most used word in movie titles once stop words are removed. Across 15,496 movies, 'love' appears 210 times, followed by 'man' (197), 'my' (191), 'i' (167) and '2' (159).


- 'love' does not rank as high on weighted frequency. In fact, it is not even in the top ten. The number '2' is the top word (I know 2 is not a word but, in this analysis, '2', '3', 'i' and 'ii' are treated as words because they are part of movie titles). Weighted frequency is the sum of lifetime gross revenue over every movie whose title contains the word. On this metric '2' is worth \$10,062,535,980: it appears in the titles of movies that together grossed more than 10 billion. This is almost double the next entry, 'star', at \$5,588,259,962 - thanks Star Wars and Star Trek. 'part' is interesting because it plays the same role as '2', '3' or 'ii': it marks the second and third installments of a movie franchise.


- With relative value, a word with an abs_freq of 1 simply carries the lifetime gross of one movie, e.g. 'awakens' appears only in the number one movie of all time, 'Star Wars: The Force Awakens'. What about abs_freq greater than 1? The findings are similar: the top words come from highly successful franchises or sequels. With abs_freq greater than 2 the picture is again similar, but now 'wars:' has the highest relative value.
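
Tokens such as 'wars:' and 'avengers:' keep their trailing colon because titles are split on whitespace only, so 'avengers:' and 'avengers' are counted as different words in the tables above. One possible refinement, which is not part of the original analysis, is to strip surrounding punctuation before counting. A minimal sketch, with a hypothetical helper `normalize`:

```python
import string
from collections import Counter


def normalize(word):
    """Lowercase a token and strip leading/trailing punctuation."""
    return word.lower().strip(string.punctuation)


titles = ["Star Wars: The Force Awakens", "Star Wars", "Rogue One: A Star Wars Story"]

# Counting as the notebook does: lowercase only
raw_counts = Counter(w.lower() for t in titles for w in t.split())

# Counting after normalization; tokens that become empty (e.g. '&') are skipped
clean_counts = Counter(normalize(w) for t in titles for w in t.split() if normalize(w))

print(raw_counts['wars'], raw_counts['wars:'])  # 2 1
print(clean_counts['wars'])                     # 3
```

Only leading and trailing punctuation is removed, so "marvel's" is left alone while 'e.t.:' becomes 'e.t'. Whether that trade-off is acceptable depends on the question being asked.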

In [22]:
# a way to sense check the words in titles and further explore other top words.
bom_df[bom_df['title'].str.contains('star | star', case=False)].head(10)

Unnamed: 0,rank,title,studio,lifetime_gross,year
0,1,Star Wars: The Force Awakens,BV,936662225,2015
7,8,Star Wars: The Last Jedi,BV,620181382,2017
10,11,Rogue One: A Star Wars Story,BV,532177324,2016
13,14,Star Wars: Episode I - The Phantom Menace,Fox,474544677,1999
14,15,Star Wars,Fox,460998007,1977
36,37,Star Wars: Episode III - Revenge of the Sith,Fox,380270577,2005
69,70,Star Wars: Episode II - Attack of the Clones,Fox,310676740,2002
109,110,Star Trek,Par.,257730019,2009
144,145,Star Trek Into Darkness,Par.,228778661,2013
171,172,Solo: A Star Wars Story,BV,213601143,2018


In [23]:
# what is the gross by studio
total_gross_studio = bom_df.groupby('studio')['lifetime_gross'].sum().sort_values(ascending=False)
total_gross_studio.head()

studio
BV      41352429405
WB      37794700855
Uni.    33836240681
Fox     32855030953
Par.    28055922760
Name: lifetime_gross, dtype: int64

In [24]:
# number of movies by studio
total_movies = bom_df['studio'].value_counts()
total_movies.head()

WB      803
Uni.    691
Fox     617
Par.    591
BV      570
Name: studio, dtype: int64

In [25]:
# create a new dataframe, reset the index, rename the columns
new_df = pd.DataFrame([dict(total_gross_studio), dict(total_movies)]).transpose()
new_df = new_df.rename_axis('studio').reset_index()
new_df.columns = ['studio','lifetime_gross', 'number_movies']

In [26]:
# get the average total gross by studio
new_df.insert(3, 'avg_gross', value=new_df['lifetime_gross'] / new_df['number_movies'])

In [27]:
new_df.sort_values('avg_gross', ascending = False).head(10)

Unnamed: 0,studio,lifetime_gross,number_movies,avg_gross
698,P/DW,4730766291,39,121301700.0
573,MFF,87178599,1,87178600.0
1014,WB (NL),4799255502,59,81343310.0
142,BV,41352429405,570,72548120.0
260,DW,4283727041,60,71395450.0
271,Dis.,1432560292,21,68217160.0
915,Sum.,1663937182,31,53675390.0
32,AAP,53267000,1,53267000.0
364,Fox,32855030953,617,53249640.0
523,LG/S,2031545083,39,52090900.0


BV (Buena Vista) is first in total gross but fifth in total number of movies. WB (Warner Bros.) is second in total gross but first in total number of movies. The Star Wars franchise really boosts BV.

How about the average gross by studio?

1.  P/DW: Paramount (DreamWorks) (Transformers, Shrek, Madagascar, etc.)
2.  MFF: MacGillivray Freeman Films, with a single movie, an IMAX-only documentary on Everest
3.  WB (NL): Warner Bros. (New Line) (The Hobbit, Sex and the City)
4.  BV: Buena Vista, i.e. Disney (Star Wars, Marvel movies, Toy Story)
5.  DW: DreamWorks (see above)

I could apply the same absolute frequency, weighted frequency and relative value metrics to the studios, but I will save that for another day.
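
As a rough indication of what that could look like, the sketch below applies the same three quantities to studios instead of words: number of movies (absolute frequency), total gross (weighted frequency) and average gross (relative value). It uses four rows copied from the table output earlier in this notebook rather than the full data set, so the numbers are not the real studio totals.

```python
from collections import defaultdict

# A few rows from the cleaned data shown earlier: (studio, lifetime_gross)
rows = [
    ('BV', 936662225),    # Star Wars: The Force Awakens
    ('Fox', 760507625),   # Avatar
    ('BV', 700059566),    # Black Panther
    ('Par.', 659363944),  # Titanic
]

stats = defaultdict(lambda: [0, 0])  # studio -> [number of movies, total gross]
for studio, gross in rows:
    stats[studio][0] += 1
    stats[studio][1] += gross

for studio, (count, total) in stats.items():
    # average gross = total gross / number of movies
    print(studio, count, total, total / count)
```

This is the same accumulation pattern as `word_frequency`, with the studio code taking the place of the word.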

Exporting the files

In [28]:
word_freq_df.to_csv('word_freq_df.csv')
bom_df.to_csv('bom_df.csv')