<a href="https://colab.research.google.com/github/brucebra000/DS-Unit-2-Applied-Modeling/blob/master/U2S3A1_applied_modeling_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency > 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
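
For the train/validate/test item above, one common option for a classification target is a stratified random split, which keeps class proportions roughly equal across the three sets. The following is a minimal sketch, not a prescribed method: the toy frame, the column names `feature` and `target`, and the 60/20/20 proportions are all illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for a project dataset (hypothetical values).
toy = pd.DataFrame({
    "feature": range(100),
    "target": ["a"] * 50 + ["b"] * 30 + ["c"] * 20,
})

# Hold out 20% as the test set, stratified on the target.
train_val, test = train_test_split(
    toy, test_size=0.20, stratify=toy["target"], random_state=42
)
# Take 25% of the remainder (20% of the total) as the validation set.
train, val = train_test_split(
    train_val, test_size=0.25, stratify=train_val["target"], random_state=42
)

print(len(train), len(val), len(test))
```

If the observations have a meaningful time order, a time-based split (train on earlier rows, test on later ones) may be the better choice; the random split here is only one of the two options the checklist names.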

# **This dataset contains information about movies from the website Rotten Tomatoes**

https://www.kaggle.com/stefanoleone992/rotten-tomatoes-movies-and-critics-datasets

In [0]:
import pandas as pd
import numpy as np
from google.colab import files
from sklearn.model_selection import train_test_split

In [3]:
upload = files.upload()

Saving rotten_tomatoes_movies.csv to rotten_tomatoes_movies.csv


In [4]:
df = pd.read_csv('rotten_tomatoes_movies.csv')
print(df.shape)
df.head()

(16638, 23)


Unnamed: 0,rotten_tomatoes_link,movie_title,movie_info,critics_consensus,poster_image_url,rating,genre,directors,writers,cast,in_theaters_date,on_streaming_date,runtime_in_minutes,studio_name,tomatometer_status,tomatometer_rating,tomatometer_count,audience_status,audience_rating,audience_count,audience_top_critics_count,audience_fresh_critics_count,audience_rotten_critics_count
0,/m/0814255,Percy Jackson & the Olympians: The Lightning T...,A teenager discovers he's the descendant of a ...,Though it may seem like just another Harry Pot...,https://resizing.flixster.com/p1veUpQ4ktsSHtRu...,PG,"Action & Adventure, Comedy, Drama, Science Fic...",Chris Columbus,Craig Titley,"Logan Lerman, Brandon T. Jackson, Alexandra Da...",2010-02-12,2010-06-29,83.0,20th Century Fox,Rotten,49,144,Spilled,53.0,254287.0,38,71,73
1,/m/0878835,Please Give,Kate has a lot on her mind. There's the ethics...,Nicole Holofcener's newest might seem slight i...,https://resizing.flixster.com/0AbudQ4KsB4BeXSB...,R,Comedy,Nicole Holofcener,Nicole Holofcener,"Catherine Keener, Amanda Peet, Oliver Platt, R...",2010-04-30,2010-10-19,90.0,Sony Pictures Classics,Certified Fresh,86,140,Upright,64.0,11567.0,43,121,19
2,/m/10,10,Blake Edwards' 10 stars Dudley Moore as George...,,https://resizing.flixster.com/mF0dxH6UTa0FdkMs...,R,"Comedy, Romance",Blake Edwards,Blake Edwards,"Dudley Moore, Bo Derek, Julie Andrews, Robert ...",1979-10-05,1997-08-27,118.0,Waner Bros.,Fresh,68,22,Spilled,53.0,14670.0,2,15,7
3,/m/1000013-12_angry_men,12 Angry Men (Twelve Angry Men),"A Puerto Rican youth is on trial for murder, a...",Sidney Lumet's feature debut is a superbly wri...,https://resizing.flixster.com/u-8xAyGaDVvROLiR...,NR,"Classics, Drama",Sidney Lumet,Reginald Rose,"Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",1957-04-13,2001-03-06,95.0,Criterion Collection,Certified Fresh,100,51,Upright,97.0,105000.0,6,51,0
4,/m/1000079-20000_leagues_under_the_sea,"20,000 Leagues Under The Sea","This 1954 Disney version of Jules Verne's 20,0...","One of Disney's finest live-action adventures,...",https://resizing.flixster.com/FKExgYBHu07XLoil...,G,"Action & Adventure, Drama, Kids & Family",Richard Fleischer,Earl Felton,"James Mason, Kirk Douglas, Paul Lukas, Peter L...",1954-01-01,2003-05-20,127.0,Disney,Fresh,89,27,Upright,74.0,68860.0,5,24,3


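Critic ratings on Rotten Tomatoes are classified as Certified Fresh, Fresh, or Rotten, based mainly on the Tomatometer rating.


*   Certified Fresh = Tomatometer rating >= 75%, plus a minimum number of critic reviews
*   Fresh = Tomatometer rating >= 60%, without meeting the Certified Fresh requirements
*   Rotten = Tomatometer rating < 60%

For example, row 4 above has a Tomatometer rating of 89 from 27 reviews and is labeled Fresh rather than Certified Fresh, so the rating alone does not fully determine the label.

This model will try to predict whether a movie is rated Certified Fresh, Fresh, or Rotten (the 'tomatometer_status' column).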
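
As a quick sanity check on how the label relates to the rating, the sketch below uses only the five rows shown in the `df.head()` output above. The `rows` frame is copied by hand from that output, and the helper columns `below_60` and `at_least_75` are names introduced here for illustration.

```python
import pandas as pd

# Ratings, review counts, and statuses copied by hand from df.head() above.
rows = pd.DataFrame({
    "tomatometer_rating": [49, 86, 68, 100, 89],
    "tomatometer_count": [144, 140, 22, 51, 27],
    "tomatometer_status": ["Rotten", "Certified Fresh", "Fresh", "Certified Fresh", "Fresh"],
})

# Flag ratings below the 60% line and at or above the 75% line.
rows["below_60"] = rows["tomatometer_rating"] < 60
rows["at_least_75"] = rows["tomatometer_rating"] >= 75

# Cross-tabulate the 75% flag against the actual label.
print(pd.crosstab(rows["at_least_75"], rows["tomatometer_status"]))
```

On these five rows, every movie below 60 is Rotten, but movies at 75 or above appear under both Fresh and Certified Fresh. Running the same cross-tabulation on the full `df` would show how often that happens across the dataset.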

In [7]:
df['tomatometer_status'].describe()

count      16638
unique         3
top       Rotten
freq        7233
Name: tomatometer_status, dtype: object

In [8]:
df['tomatometer_status'].value_counts(normalize = True)

Rotten             0.434728
Fresh              0.387547
Certified Fresh    0.177726
Name: tomatometer_status, dtype: float64
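
The class proportions above give a majority-class baseline: the accuracy a model would get by always predicting the most frequent class. Below is a minimal sketch. The Rotten count comes from the `describe()` output; the Fresh and Certified Fresh counts are derived here by multiplying the proportions by the 16,638 rows, which is an assumption of this sketch rather than a printed output.

```python
import pandas as pd

# Rotten is from describe(); the other two counts are derived from the
# normalized value_counts() above times 16,638 rows (assumption of this sketch).
counts = pd.Series({"Rotten": 7233, "Fresh": 6448, "Certified Fresh": 2957})

proportions = counts / counts.sum()
majority_class = proportions.idxmax()   # most frequent label
baseline_accuracy = proportions.max()   # accuracy of always guessing that label

print(majority_class, round(baseline_accuracy, 4))
```

Because the majority class falls below the 50% to 70% range named in the assignment, accuracy alone could be misleading here; a per-class metric such as macro-averaged F1, or a confusion matrix, is one possible complement.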

In [9]:
#Let's look further into the dataset
df.describe()

| | runtime_in_minutes | tomatometer_rating | tomatometer_count | audience_rating | audience_count | audience_top_critics_count | audience_fresh_critics_count | audience_rotten_critics_count |
|---|---|---|---|---|---|---|---|---|
| count | 16483.0 | 16638.0 | 16638.0 | 16386.0 | 16386.0 | 16638.0 | 16638.0 | 16638.0 |
| mean | 102.391494 | 60.466522 | 56.607104 | 60.470829 | 152479.7 | 14.594242 | 35.730496 | 20.867592 |
| std | 25.028011 | 28.58723 | 66.3838 | 20.462368 | 1817736.0 | 14.774244 | 50.795198 | 29.995032 |
| min | 1.0 | 0.0 | 5.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 |
| 25% | 90.0 | 38.0 | 12.0 | 45.0 | 864.25 | 3.0 | 6.0 | 3.0 |
| 50% | 99.0 | 66.0 | 28.0 | 62.0 | 4876.5 | 8.0 | 16.0 | 8.0 |
| 75% | 111.0 | 86.0 | 76.0 | 77.0 | 28752.0 | 24.0 | 43.0 | 24.0 |
| max | 2000.0 | 100.0 | 497.0 | 100.0 | 35797640.0 | 64.0 | 470.0 | 296.0 |
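
The summary above suggests `audience_count` is strongly right-skewed: its mean (about 152,000) is far above its median (about 4,900). A log transform is one common way to compress such a tail. The sketch below applies `np.log1p` to the five quantile values copied from the table, only to illustrate the effect; the variable names are introduced here.

```python
import numpy as np

# Min, quartiles, and max of audience_count, copied from the describe() output above.
audience_count_summary = np.array([5.0, 864.25, 4876.5, 28752.0, 35797640.0])

# log1p(x) = log(1 + x): compresses the long right tail and is defined at zero.
logged = np.log1p(audience_count_summary)

print(np.round(logged, 2))
print("raw max/median ratio:", audience_count_summary[-1] / audience_count_summary[2])
print("log max/median ratio:", logged[-1] / logged[2])
```

Whether the transform helps depends on the model: it tends to matter for linear models, while tree-based models are unaffected by monotonic transforms of a feature.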


In [10]:
df.isnull().sum()

rotten_tomatoes_link                0
movie_title                         0
movie_info                         24
critics_consensus                8329
poster_image_url                    0
rating                              0
genre                              17
directors                         114
writers                          1349
cast                              284
in_theaters_date                  815
on_streaming_date                   2
runtime_in_minutes                155
studio_name                       416
tomatometer_status                  0
tomatometer_rating                  0
tomatometer_count                   0
audience_status                   252
audience_rating                   252
audience_count                    252
audience_top_critics_count          0
audience_fresh_critics_count        0
audience_rotten_critics_count       0
dtype: int64
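
Raw null counts are easier to compare as a share of rows. The sketch below recomputes that share from the counts printed above; the 0.5 cutoff is a hypothetical threshold chosen for illustration, not a rule from the assignment.

```python
import pandas as pd

n_rows = 16638  # from df.shape above

# Null counts for the columns that have missing values, copied from the output above.
null_counts = pd.Series({
    "movie_info": 24, "critics_consensus": 8329, "genre": 17, "directors": 114,
    "writers": 1349, "cast": 284, "in_theaters_date": 815, "on_streaming_date": 2,
    "runtime_in_minutes": 155, "studio_name": 416,
    "audience_status": 252, "audience_rating": 252, "audience_count": 252,
})

# Fraction of rows missing per column, largest first.
missing_share = (null_counts / n_rows).sort_values(ascending=False)
print(missing_share.round(4))

# Hypothetical cutoff: flag columns that are more than half empty for a closer look.
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(mostly_missing)
```

On the full frame, the same shares could be obtained directly with `df.isnull().mean()`.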

In [11]:
df.select_dtypes(include = 'number').nunique()

runtime_in_minutes                 201
tomatometer_rating                 101
tomatometer_count                  393
audience_rating                     98
audience_count                   10885
audience_top_critics_count          65
audience_fresh_critics_count       345
audience_rotten_critics_count      203
dtype: int64

In [12]:
df.select_dtypes(exclude = 'number').nunique()

rotten_tomatoes_link    16638
movie_title             16106
movie_info              16613
critics_consensus        8307
poster_image_url        16623
rating                      8
genre                    1080
directors                8314
writers                 12121
cast                    16326
in_theaters_date         5586
on_streaming_date        2260
studio_name              2886
tomatometer_status          3
audience_status             2
dtype: int64
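
The `genre` column has 1,080 unique values, and the `df.head()` output shows each value is a comma-separated list of genres, so much of that cardinality may come from combinations rather than from distinct genres. One possible encoding, sketched below, splits the lists into one indicator column per genre. The sample uses only the untruncated genre strings visible in the `df.head()` output.

```python
import pandas as pd

# Untruncated genre strings copied from the df.head() output above.
genre = pd.Series([
    "Comedy",
    "Comedy, Romance",
    "Classics, Drama",
    "Action & Adventure, Drama, Kids & Family",
])

# Split on ", " and build one 0/1 indicator column per individual genre.
genre_flags = genre.str.get_dummies(sep=", ")

print(genre_flags.columns.tolist())
print(genre_flags)
```

The same approach would not suit columns like `cast` or `directors` as directly, since those contain thousands of distinct names.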

In [18]:
# Clean the dataset: drop identifier, URL, and free-text columns, plus
# tomatometer_rating, which the target is derived from and would leak it.
dropped_features = ['rotten_tomatoes_link', 'movie_title', 'movie_info', 'poster_image_url', 'tomatometer_rating']
clean_df = df.drop(columns=dropped_features)

print(clean_df.shape)
clean_df.head()

(16638, 18)


Unnamed: 0,critics_consensus,rating,genre,directors,writers,cast,in_theaters_date,on_streaming_date,runtime_in_minutes,studio_name,tomatometer_status,tomatometer_count,audience_status,audience_rating,audience_count,audience_top_critics_count,audience_fresh_critics_count,audience_rotten_critics_count
0,Though it may seem like just another Harry Pot...,PG,"Action & Adventure, Comedy, Drama, Science Fic...",Chris Columbus,Craig Titley,"Logan Lerman, Brandon T. Jackson, Alexandra Da...",2010-02-12,2010-06-29,83.0,20th Century Fox,Rotten,144,Spilled,53.0,254287.0,38,71,73
1,Nicole Holofcener's newest might seem slight i...,R,Comedy,Nicole Holofcener,Nicole Holofcener,"Catherine Keener, Amanda Peet, Oliver Platt, R...",2010-04-30,2010-10-19,90.0,Sony Pictures Classics,Certified Fresh,140,Upright,64.0,11567.0,43,121,19
2,,R,"Comedy, Romance",Blake Edwards,Blake Edwards,"Dudley Moore, Bo Derek, Julie Andrews, Robert ...",1979-10-05,1997-08-27,118.0,Waner Bros.,Fresh,22,Spilled,53.0,14670.0,2,15,7
3,Sidney Lumet's feature debut is a superbly wri...,NR,"Classics, Drama",Sidney Lumet,Reginald Rose,"Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",1957-04-13,2001-03-06,95.0,Criterion Collection,Certified Fresh,51,Upright,97.0,105000.0,6,51,0
4,"One of Disney's finest live-action adventures,...",G,"Action & Adventure, Drama, Kids & Family",Richard Fleischer,Earl Felton,"James Mason, Kirk Douglas, Paul Lukas, Peter L...",1954-01-01,2003-05-20,127.0,Disney,Fresh,27,Upright,74.0,68860.0,5,24,3
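
Some of the remaining columns may still leak the target. The sketch below checks, on the five rows shown above only, whether the Tomatometer rating can be rebuilt from the fresh and rotten critic counts. The `rows` frame is copied by hand from the `df.head()` output, and `rebuilt_rating` is a name introduced here.

```python
import pandas as pd

# Columns copied by hand from the five rows of df.head() above.
rows = pd.DataFrame({
    "tomatometer_rating": [49, 86, 68, 100, 89],
    "tomatometer_count": [144, 140, 22, 51, 27],
    "audience_fresh_critics_count": [71, 121, 15, 51, 24],
    "audience_rotten_critics_count": [73, 19, 7, 0, 3],
})

total = rows["audience_fresh_critics_count"] + rows["audience_rotten_critics_count"]
# Share of fresh reviews, as a rounded percentage.
rebuilt_rating = (100 * rows["audience_fresh_critics_count"] / total).round().astype(int)

# Do fresh + rotten add up to the review count, and does the share match the rating?
print((total == rows["tomatometer_count"]).all())
print((rebuilt_rating == rows["tomatometer_rating"]).all())
```

Both checks pass on these five rows, which suggests `audience_fresh_critics_count` and `audience_rotten_critics_count` together reproduce the dropped rating. If that holds on the full dataset, those columns are candidates for exclusion as well.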
