# Package Demo

1. get_data.py
    * get_tmdb_id_list()
    * get_profit()
    * get_info()
    * call_data()
2. clean_data.py
    * clean_director_actor()
    * clean_regression_data()
3. movie_plot.py
    * scatter_plot()
    * box_plot()
4. regression_model.py
    * model_evaluation()
    * save_model()

In [1]:
import os
os.chdir("../movie_badgers/")
import get_data as gd
import clean_data as cdata
import movie_plot as mp
import regression_model as rm

### get_data

This component lets users crawl data from the two data sources based on parameters they define: start year, end year, start page, and end page. The years set the movie release range, and the pages set how many pages of results the crawler goes through.

#### *get_tmdb_id_list* (start year, end year, start page, end page): <br>
This returns a list of TMDB ids that users can use in their own scripts.

In [2]:
id_list = gd.get_tmdb_id_list(2006,2007,1,2)

In [3]:
id_list

[58,
 36557,
 920,
 1422,
 950,
 834,
 591,
 214,
 1124,
 1593,
 350,
 9693,
 1246,
 956,
 7518,
 1402,
 1452,
 1948,
 7551,
 9762]

#### *get_profit* (start year, end year, start page, end page): <br>
This returns a dataframe, df_profit, with movie ids, revenue, and budget.

In [4]:
df_profit = gd.get_profit(2006,2007,1,2)

getting profit on IMDB ID  58
getting profit on IMDB ID  36557
getting profit on IMDB ID  920
getting profit on IMDB ID  1422
getting profit on IMDB ID  950
getting profit on IMDB ID  834
getting profit on IMDB ID  591
getting profit on IMDB ID  214
getting profit on IMDB ID  1124
getting profit on IMDB ID  1593
getting profit on IMDB ID  350
getting profit on IMDB ID  9693
getting profit on IMDB ID  1246
getting profit on IMDB ID  956
getting profit on IMDB ID  7518
getting profit on IMDB ID  1402
getting profit on IMDB ID  1452
getting profit on IMDB ID  1948
getting profit on IMDB ID  7551
getting profit on IMDB ID  9762


In [6]:
df_profit

Unnamed: 0,budget,imdb_id,revenue
0,200000000,tt0383574,1065659812
1,150000000,tt0381061,599045960
2,120000000,tt0317219,461983149
3,90000000,tt0407887,289847354
4,80000000,tt0438097,660940780
5,50000000,tt0401855,111340801
6,125000000,tt0382625,767820459
7,10000000,tt0489270,163876815
8,40000000,tt0482571,109676311
9,110000000,tt0477347,574480841
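
The dataframe holds budget and revenue but no profit column, so a user may want to derive one. The following is a minimal sketch in plain Python (so it runs without pandas) that uses the first three rows of the output above; the helper name `add_profit` and the `revenue_to_budget` field are hypothetical and not part of the package.

```python
import csv
import io

# First three rows of the df_profit output above, as CSV text.
PROFIT_CSV = """budget,imdb_id,revenue
200000000,tt0383574,1065659812
150000000,tt0381061,599045960
120000000,tt0317219,461983149
"""


def add_profit(rows):
    """Return new rows with profit and a revenue-to-budget ratio added.

    Hypothetical helper for illustration; not a function of the package.
    """
    result = []
    for row in rows:
        budget = int(row["budget"])
        revenue = int(row["revenue"])
        result.append({
            "imdb_id": row["imdb_id"],
            "budget": budget,
            "revenue": revenue,
            "profit": revenue - budget,
            "revenue_to_budget": revenue / budget,
        })
    return result


profit_rows = add_profit(csv.DictReader(io.StringIO(PROFIT_CSV)))
for row in profit_rows:
    print(row["imdb_id"], row["profit"], round(row["revenue_to_budget"], 2))
```

With a pandas dataframe, the same idea would be a single column subtraction, `df_profit["revenue"] - df_profit["budget"]`.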


#### *get_info* (start year, end year, start page, end page): <br>
This returns a dataframe, df_info, with movie ids and other relevant information such as release dates, directors, and ratings.

In [7]:
df_info = gd.get_info(2006,2007,1,2)

getting profit on IMDB ID  58
getting profit on IMDB ID  36557
getting profit on IMDB ID  920
getting profit on IMDB ID  1422
getting profit on IMDB ID  950
getting profit on IMDB ID  834
getting profit on IMDB ID  591
getting profit on IMDB ID  214
getting profit on IMDB ID  1124
getting profit on IMDB ID  1593
getting profit on IMDB ID  350
getting profit on IMDB ID  9693
getting profit on IMDB ID  1246
getting profit on IMDB ID  956
getting profit on IMDB ID  7518
getting profit on IMDB ID  1402
getting profit on IMDB ID  1452
getting profit on IMDB ID  1948
getting profit on IMDB ID  7551
getting profit on IMDB ID  9762
Finished:tt0383574
Finished:tt0381061
Finished:tt0317219
Finished:tt0407887
Finished:tt0438097
Finished:tt0401855
Finished:tt0382625
Finished:tt0489270
Finished:tt0482571
Finished:tt0477347
Finished:tt0458352
Finished:tt0206634
Finished:tt0479143
Finished:tt0317919
Finished:tt0327084
Finished:tt0454921
Finished:tt0348150
Finished:tt0479884
Finished:tt0453467
Finishe

In [8]:
df_info

Unnamed: 0,Actors,Country,Director,Genre,IMDB Rating,IMDB Votes,Language,Production,Rated,Released,Runtime,Title,Year,imdbID
0,"Johnny Depp, Orlando Bloom, Keira Knightley, J...",USA,Gore Verbinski,"Action, Adventure, Fantasy",7.3,573167,"English, Turkish, Greek, Mandarin, French",Buena Vista,PG-13,07 Jul 2006,151 min,Pirates of the Caribbean: Dead Man's Chest,2006,tt0383574
1,"Daniel Craig, Eva Green, Mads Mikkelsen, Judi ...","UK, Czech Republic, USA, Germany, Bahamas, Italy",Martin Campbell,"Action, Adventure, Thriller",8.0,506685,"English, French",Sony,PG-13,17 Nov 2006,144 min,Casino Royale,2006,tt0381061
2,"Owen Wilson, Paul Newman, Bonnie Hunt, Larry t...",USA,"John Lasseter, Joe Ranft(co-director)","Animation, Comedy, Family",7.1,294294,"English, Italian, Japanese, Yiddish",Buena Vista,G,09 Jun 2006,117 min,Cars,2006,tt0317219
3,"Leonardo DiCaprio, Matt Damon, Jack Nicholson,...","USA, Hong Kong",Martin Scorsese,"Crime, Drama, Thriller",8.5,973137,"English, Cantonese",Warner Bros. Pictures,R,06 Oct 2006,151 min,The Departed,2006,tt0407887
4,"Ray Romano, John Leguizamo, Denis Leary, Seann...",USA,Carlos Saldanha,"Animation, Action, Adventure",6.8,213930,English,20th Century Fox,PG,31 Mar 2006,91 min,Ice Age: The Meltdown,2006,tt0438097
5,"Kate Beckinsale, Scott Speedman, Tony Curran, ...","USA, Canada, Hungary",Len Wiseman,"Action, Adventure, Fantasy",6.7,170850,"English, French, Hungarian",Sony/Screen Gems,R,20 Jan 2006,106 min,Underworld: Evolution,2006,tt0401855
6,"Tom Hanks, Audrey Tautou, Ian McKellen, Jean Reno","USA, Malta, France, UK",Ron Howard,"Mystery, Thriller",6.6,347239,"English, French, Latin, Spanish",Sony Pictures,PG-13,19 May 2006,149 min,The Da Vinci Code,2006,tt0382625
7,"Tobin Bell, Shawnee Smith, Angus Macfadyen, Ba...","USA, Canada",Darren Lynn Bousman,Horror,6.2,154539,English,Lionsgate Films,R,27 Oct 2006,108 min,Saw III,2006,tt0489270
8,"Hugh Jackman, Christian Bale, Michael Caine, P...","USA, UK",Christopher Nolan,"Drama, Mystery, Sci-Fi",8.5,951885,English,Buena Vista Pictures,PG-13,20 Oct 2006,130 min,The Prestige,2006,tt0482571
9,"Ben Stiller, Carla Gugino, Dick Van Dyke, Mick...","USA, UK",Shawn Levy,"Adventure, Comedy, Family",6.4,260038,"English, Italian, Hebrew",20th Century Fox,PG,22 Dec 2006,108 min,Night at the Museum,2006,tt0477347


#### *call_data* (start year, end year, start page, end page, path): <br>
This returns a complete dataframe with all the information listed above, ready for the user to clean or process further. A CSV is also generated and saved locally in the data directory as "data_raw_user.csv".

In [9]:
path = ('./data/data_raw_user.csv')

In [10]:
df_info_revenue = gd.call_data(2006,2007,1,2,path)

start getting profit
getting profit on IMDB ID  58
getting profit on IMDB ID  36557
getting profit on IMDB ID  920
getting profit on IMDB ID  1422
getting profit on IMDB ID  950
getting profit on IMDB ID  834
getting profit on IMDB ID  591
getting profit on IMDB ID  214
getting profit on IMDB ID  1124
getting profit on IMDB ID  1593
getting profit on IMDB ID  350
getting profit on IMDB ID  9693
getting profit on IMDB ID  1246
getting profit on IMDB ID  956
getting profit on IMDB ID  7518
getting profit on IMDB ID  1402
getting profit on IMDB ID  1452
getting profit on IMDB ID  1948
getting profit on IMDB ID  7551
getting profit on IMDB ID  9762
start getting movie info
getting profit on IMDB ID  58
getting profit on IMDB ID  36557
getting profit on IMDB ID  920
getting profit on IMDB ID  1422
getting profit on IMDB ID  950
getting profit on IMDB ID  834
getting profit on IMDB ID  591
getting profit on IMDB ID  214
getting profit on IMDB ID  1124
getting profit on IMDB ID  1593
getting 

In [11]:
df_info_revenue

Unnamed: 0.1,Unnamed: 0,budget,imdb_id,revenue,Actors,Country,Director,Genre,IMDB Rating,IMDB Votes,Language,Production,Rated,Released,Runtime,Title,Year,imdbID
0,0,200000000,tt0383574,1065659812,"Johnny Depp, Orlando Bloom, Keira Knightley, J...",USA,Gore Verbinski,"Action, Adventure, Fantasy",7.3,573167,"English, Turkish, Greek, Mandarin, French",Buena Vista,PG-13,07 Jul 2006,151 min,Pirates of the Caribbean: Dead Man's Chest,2006,tt0383574
1,1,150000000,tt0381061,599045960,"Daniel Craig, Eva Green, Mads Mikkelsen, Judi ...","UK, Czech Republic, USA, Germany, Bahamas, Italy",Martin Campbell,"Action, Adventure, Thriller",8.0,506685,"English, French",Sony,PG-13,17 Nov 2006,144 min,Casino Royale,2006,tt0381061
2,2,120000000,tt0317219,461983149,"Owen Wilson, Paul Newman, Bonnie Hunt, Larry t...",USA,"John Lasseter, Joe Ranft(co-director)","Animation, Comedy, Family",7.1,294294,"English, Italian, Japanese, Yiddish",Buena Vista,G,09 Jun 2006,117 min,Cars,2006,tt0317219
3,3,90000000,tt0407887,289847354,"Leonardo DiCaprio, Matt Damon, Jack Nicholson,...","USA, Hong Kong",Martin Scorsese,"Crime, Drama, Thriller",8.5,973137,"English, Cantonese",Warner Bros. Pictures,R,06 Oct 2006,151 min,The Departed,2006,tt0407887
4,4,80000000,tt0438097,660940780,"Ray Romano, John Leguizamo, Denis Leary, Seann...",USA,Carlos Saldanha,"Animation, Action, Adventure",6.8,213930,English,20th Century Fox,PG,31 Mar 2006,91 min,Ice Age: The Meltdown,2006,tt0438097
5,5,50000000,tt0401855,111340801,"Kate Beckinsale, Scott Speedman, Tony Curran, ...","USA, Canada, Hungary",Len Wiseman,"Action, Adventure, Fantasy",6.7,170850,"English, French, Hungarian",Sony/Screen Gems,R,20 Jan 2006,106 min,Underworld: Evolution,2006,tt0401855
6,6,125000000,tt0382625,767820459,"Tom Hanks, Audrey Tautou, Ian McKellen, Jean Reno","USA, Malta, France, UK",Ron Howard,"Mystery, Thriller",6.6,347239,"English, French, Latin, Spanish",Sony Pictures,PG-13,19 May 2006,149 min,The Da Vinci Code,2006,tt0382625
7,7,10000000,tt0489270,163876815,"Tobin Bell, Shawnee Smith, Angus Macfadyen, Ba...","USA, Canada",Darren Lynn Bousman,Horror,6.2,154539,English,Lionsgate Films,R,27 Oct 2006,108 min,Saw III,2006,tt0489270
8,8,40000000,tt0482571,109676311,"Hugh Jackman, Christian Bale, Michael Caine, P...","USA, UK",Christopher Nolan,"Drama, Mystery, Sci-Fi",8.5,951885,English,Buena Vista Pictures,PG-13,20 Oct 2006,130 min,The Prestige,2006,tt0482571
9,9,110000000,tt0477347,574480841,"Ben Stiller, Carla Gugino, Dick Van Dyke, Mick...","USA, UK",Shawn Levy,"Adventure, Comedy, Family",6.4,260038,"English, Italian, Hebrew",20th Century Fox,PG,22 Dec 2006,108 min,Night at the Museum,2006,tt0477347
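
The combined output carries both an `imdb_id` column (from the profit data) and an `imdbID` column (from the movie info), which suggests the two tables are joined on the IMDB id. The package's actual merge logic is not shown here, so the following is only a sketch of that idea in plain Python; `merge_on_imdb_id` is a hypothetical helper and the rows are abbreviated from the outputs above.

```python
def merge_on_imdb_id(profit_rows, info_rows):
    """Inner-join profit rows (key 'imdb_id') with info rows (key 'imdbID').

    Hypothetical helper; an assumption about how the two tables relate.
    """
    info_by_id = {row["imdbID"]: row for row in info_rows}
    merged = []
    for profit in profit_rows:
        info = info_by_id.get(profit["imdb_id"])
        if info is not None:  # keep only movies present in both tables
            merged.append({**profit, **info})
    return merged


# Abbreviated rows taken from the df_profit and df_info outputs.
profit_rows = [
    {"imdb_id": "tt0383574", "budget": 200000000, "revenue": 1065659812},
    {"imdb_id": "tt0381061", "budget": 150000000, "revenue": 599045960},
]
info_rows = [
    {"imdbID": "tt0381061", "Title": "Casino Royale",
     "Director": "Martin Campbell"},
    {"imdbID": "tt0383574",
     "Title": "Pirates of the Caribbean: Dead Man's Chest",
     "Director": "Gore Verbinski"},
]

merged_rows = merge_on_imdb_id(profit_rows, info_rows)
for row in merged_rows:
    print(row["imdb_id"], row["Title"], row["revenue"])
```

The order of the result follows the profit rows, and a movie missing from either table is dropped, which matches an inner join.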


### clean_data

#### *clean_director_actor()*: <br>
This gets the popularity ratings of all the listed actors and directors from the database.

In [12]:
df_add_act_dir = cdata.clean_director_actor(path)

In [13]:
df_add_act_dir

Unnamed: 0.1,Unnamed: 0,budget,imdb_id,revenue,Actors,Country,Director,Genre,IMDB Rating,IMDB Votes,Language,Production,Rated,Released,Runtime,Title,Year,imdbID,actor_popularity,director_popularity
0,0,200000000,tt0383574,1065659812,"Johnny Depp, Orlando Bloom, Keira Knightley, J...",USA,Gore Verbinski,"Action, Adventure, Fantasy",7.3,573167,"English, Turkish, Greek, Mandarin, French",Buena Vista,PG-13,07 Jul 2006,151 min,Pirates of the Caribbean: Dead Man's Chest,2006,tt0383574,7.5256,2.821019
1,1,150000000,tt0381061,599045960,"Daniel Craig, Eva Green, Mads Mikkelsen, Judi ...","UK, Czech Republic, USA, Germany, Bahamas, Italy",Martin Campbell,"Action, Adventure, Thriller",8.0,506685,"English, French",Sony,PG-13,17 Nov 2006,144 min,Casino Royale,2006,tt0381061,8.819674,2.261955
2,2,120000000,tt0317219,461983149,"Owen Wilson, Paul Newman, Bonnie Hunt, Larry t...",USA,"John Lasseter, Joe Ranft(co-director)","Animation, Comedy, Family",7.1,294294,"English, Italian, Japanese, Yiddish",Buena Vista,G,09 Jun 2006,117 min,Cars,2006,tt0317219,5.113393,2.772886
3,3,90000000,tt0407887,289847354,"Leonardo DiCaprio, Matt Damon, Jack Nicholson,...","USA, Hong Kong",Martin Scorsese,"Crime, Drama, Thriller",8.5,973137,"English, Cantonese",Warner Bros. Pictures,R,06 Oct 2006,151 min,The Departed,2006,tt0407887,8.127918,5.258641
4,4,80000000,tt0438097,660940780,"Ray Romano, John Leguizamo, Denis Leary, Seann...",USA,Carlos Saldanha,"Animation, Action, Adventure",6.8,213930,English,20th Century Fox,PG,31 Mar 2006,91 min,Ice Age: The Meltdown,2006,tt0438097,4.15674,1.765386
5,5,50000000,tt0401855,111340801,"Kate Beckinsale, Scott Speedman, Tony Curran, ...","USA, Canada, Hungary",Len Wiseman,"Action, Adventure, Fantasy",6.7,170850,"English, French, Hungarian",Sony/Screen Gems,R,20 Jan 2006,106 min,Underworld: Evolution,2006,tt0401855,6.022322,3.084873
6,6,125000000,tt0382625,767820459,"Tom Hanks, Audrey Tautou, Ian McKellen, Jean Reno","USA, Malta, France, UK",Ron Howard,"Mystery, Thriller",6.6,347239,"English, French, Latin, Spanish",Sony Pictures,PG-13,19 May 2006,149 min,The Da Vinci Code,2006,tt0382625,5.483848,4.593478
7,7,10000000,tt0489270,163876815,"Tobin Bell, Shawnee Smith, Angus Macfadyen, Ba...","USA, Canada",Darren Lynn Bousman,Horror,6.2,154539,English,Lionsgate Films,R,27 Oct 2006,108 min,Saw III,2006,tt0489270,2.562011,1.003553
8,8,40000000,tt0482571,109676311,"Hugh Jackman, Christian Bale, Michael Caine, P...","USA, UK",Christopher Nolan,"Drama, Mystery, Sci-Fi",8.5,951885,English,Buena Vista Pictures,PG-13,20 Oct 2006,130 min,The Prestige,2006,tt0482571,7.454158,8.069432
9,9,110000000,tt0477347,574480841,"Ben Stiller, Carla Gugino, Dick Van Dyke, Mick...","USA, UK",Shawn Levy,"Adventure, Comedy, Family",6.4,260038,"English, Italian, Hebrew",20th Century Fox,PG,22 Dec 2006,108 min,Night at the Museum,2006,tt0477347,5.467439,1.474869


#### *clean_regression_data()*: <br>
This returns a transformed dataframe based on the dev team's preloaded data and also saves it as a CSV called "data_for_lr.csv".

In [14]:
input_path = ('./data/data_clean.csv')
output_path = ('./data/data_for_lr.csv')

In [15]:
df_for_model = cdata.clean_regression_data(input_path, output_path)

In [16]:
df_for_model

Unnamed: 0,IMDB.Rating,IMDB.Votes,Language,Runtime,budget,actor_popularity,director_popularity,released_on_weekend,released_not_on_dump_month,Action,...,Romance,Thriller,Other_genre,G,NC-17,PG,PG-13,R,UNRATED,revenue
0,0.782051,0.796139,0.055556,0.250000,0.782555,0.289032,6.857463e-02,1,1,0,...,1,0,0,0,0,0,0,1,0,7.811728
1,0.858974,0.871550,0.444444,0.242857,0.881144,0.178595,7.261486e-02,1,0,0,...,0,1,0,0,0,0,0,1,0,7.844848
2,0.679487,0.723415,0.055556,0.153571,0.890273,0.210875,6.176784e-05,1,1,0,...,0,1,0,0,0,0,0,1,0,7.157572
3,0.064103,0.528331,0.000000,0.178571,0.869598,0.174832,0.000000e+00,0,0,1,...,0,0,0,0,0,1,0,0,0,4.867503
4,0.512821,0.812777,0.000000,0.260714,0.908450,0.085095,1.218422e-01,1,0,1,...,0,1,0,0,0,0,1,0,0,8.359339
5,0.666667,0.511503,0.000000,0.207143,0.714887,0.049863,4.056555e-04,0,1,0,...,0,0,0,0,0,0,0,1,0,5.661085
6,0.782051,0.745081,0.000000,0.271429,0.774774,0.265122,6.545472e-02,1,1,0,...,0,1,0,0,0,0,0,1,0,7.398031
7,0.641026,0.779687,0.222222,0.178571,0.926259,0.272759,1.373763e-01,1,0,1,...,0,0,0,0,0,0,1,0,0,8.411657
8,0.423077,0.558102,0.000000,0.164286,0.845653,0.177585,3.206331e-03,1,0,0,...,0,0,0,0,0,1,0,0,0,7.580763
9,0.769231,0.592969,0.000000,0.221429,0.683074,0.081337,9.215069e-03,1,0,1,...,0,0,0,0,0,0,0,0,1,6.963788
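
Most numeric columns in this output lie between 0 and 1, and the rating appears as 0/1 columns (G, PG, PG-13, R, ...). That pattern is consistent with min-max scaling and one-hot encoding, but the package's actual transformation is not shown here, so treat the following as a sketch under that assumption. `min_max_scale` and `one_hot` are hypothetical helpers; the input values are the first five `Runtime` and `Rated` entries from the df_info output.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:  # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def one_hot(labels, categories):
    """Turn each label into a dict of 0/1 indicator flags, one per category."""
    return [{cat: int(label == cat) for cat in categories} for label in labels]


# First five movies from the df_info output (runtime in minutes, rating).
runtimes = [151, 144, 117, 151, 91]
rated = ["PG-13", "PG-13", "G", "R", "PG"]

scaled_runtime = min_max_scale(runtimes)
rated_flags = one_hot(rated, ["G", "PG", "PG-13", "R"])

for runtime, flags in zip(scaled_runtime, rated_flags):
    print(round(runtime, 3), flags)
```

Scaling to a common range keeps large-valued columns such as budget from dominating penalized models like lasso and ridge, which is one plausible reason for a step of this kind.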


### movie_plot

#### *scatter_plot* (path, x, y): <br>
This function takes the path where "data_for_lr.csv" is stored and two continuous variables as input. It returns a scatter plot of the two variables in the movie data as an HTML file.

In [17]:
path = "./data"

In [18]:
mp.scatter_plot(path, "budget", "revenue")

1

#### *box_plot* (path, x):<br>
This function takes the path where "data_for_lr.csv" is stored and a binary variable as input. It returns a box plot of movie revenue against this binary variable as an HTML file.

In [19]:
mp.box_plot(path, "released_not_on_dump_month")

1
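
A box plot of revenue against a binary variable summarizes each of the two groups by its minimum, quartiles, and maximum. The sketch below computes those numbers directly, using the `released_not_on_dump_month` and `revenue` columns from the ten rows of the df_for_model output. The helper names are hypothetical, and the quartile method (`inclusive`) is a choice made for this sketch; the plotting library used by the package may compute quartiles differently.

```python
import statistics


def five_number_summary(values):
    """Min, quartiles, and max: the numbers a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def summarize_by_flag(rows, flag, target="revenue"):
    """Group the target column by a 0/1 flag and summarize each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[flag], []).append(row[target])
    return {key: five_number_summary(vals) for key, vals in groups.items()}


# (released_not_on_dump_month, revenue) pairs from the df_for_model output.
pairs = [(1, 7.811728), (0, 7.844848), (1, 7.157572), (0, 4.867503),
         (0, 8.359339), (1, 5.661085), (1, 7.398031), (0, 8.411657),
         (0, 7.580763), (0, 6.963788)]
rows = [{"released_not_on_dump_month": flag, "revenue": revenue}
        for flag, revenue in pairs]

summary = summarize_by_flag(rows, "released_not_on_dump_month")
for flag, stats in sorted(summary.items()):
    print(flag, stats)
```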

### regression_model

#### *model_evaluation* (model_name, df, n_fold) : <br>
This function evaluates a regression model using k-fold cross-validation. Four models are available: linear regression, lasso, ridge, and decision tree. Users can also specify the number of folds used to split the dataframe df.

In [20]:
tree_model = rm.model_evaluation("tree",df_for_model,10)

(Regression Model:  tree
Average Cross Validation Score (Mean Absolute Error):  0.472553740814
Average Cross Validation Score (Root Mean Squared Error):  0.665948528551
Average Cross Validation Score (R^2):  0.586957029619
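
To make the three reported scores concrete, here is a minimal sketch of k-fold cross-validation in plain Python. It is not the package's implementation: the folds are contiguous rather than shuffled, and a one-variable least-squares line stands in for the tree model. All function names and the synthetic data are assumptions made for illustration.

```python
import math


def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous, near-equal test folds."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x (stand-in model)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def cross_validate(xs, ys, n_folds):
    """Average MAE, RMSE, and R^2 over the held-out folds."""
    scores = {"mae": [], "rmse": [], "r2": []}
    for test_idx in k_fold_indices(len(xs), n_folds):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        intercept, slope = fit_line(train_x, train_y)
        actual = [ys[i] for i in test_idx]
        errors = [ys[i] - (intercept + slope * xs[i]) for i in test_idx]
        mean_actual = sum(actual) / len(actual)
        ss_res = sum(e ** 2 for e in errors)
        ss_tot = sum((a - mean_actual) ** 2 for a in actual)
        scores["mae"].append(sum(abs(e) for e in errors) / len(errors))
        scores["rmse"].append(math.sqrt(ss_res / len(errors)))
        scores["r2"].append(1 - ss_res / ss_tot if ss_tot else 0.0)
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}


# Synthetic data on an exact line, so the errors should be essentially zero.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
cv_scores = cross_validate(xs, ys, n_folds=5)
print(cv_scores)
```

Each fold is held out once while the model is fit on the rest, and the per-fold scores are averaged, which is what "Average Cross Validation Score" in the output refers to. The sketch assumes there are at least as many samples as folds.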


#### *save_model* (model, file_name, path): <br>
This function takes the model returned by model_evaluation(), a temporary file name, and a path as input. It saves the regression model to the local machine as a .pkl file.

In [21]:
rm.save_model(tree_model, "lasso", "lasso_model.pkl")

'model save complete!'
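
A .pkl file is typically written with Python's standard `pickle` module. The sketch below shows a save and reload round trip under that assumption; `save_model_sketch` and `load_model_sketch` are hypothetical names, and a plain dictionary with made-up settings stands in for a fitted model.

```python
import os
import pickle
import tempfile


def save_model_sketch(model, path):
    """Serialize any picklable object to path (hypothetical helper)."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model_sketch(path):
    """Load an object back from a pickle file (hypothetical helper)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A dictionary stands in for a fitted model; the values are made up.
stand_in_model = {"name": "tree", "max_depth": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "tree_model.pkl")
    save_model_sketch(stand_in_model, model_path)
    restored_model = load_model_sketch(model_path)

print(restored_model == stand_in_model)
```

Pickle files can execute arbitrary code when loaded, so only load model files that come from a trusted source.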

# Jupyter notebook demo complete!