Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

# Uploading Data Set

In [1]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
import pandas as pd
df = pd.read_csv('/Users/bradbrauser/Desktop/Data Science/MoviesOnStreamingPlatforms_updated.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime
0,0,1,Inception,2010,13+,8.8,87%,1,0,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
1,1,2,The Matrix,1999,18+,8.7,87%,1,0,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0
2,2,3,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0
3,3,4,Back to the Future,1985,7+,8.5,96%,1,0,0,0,0,Robert Zemeckis,"Adventure,Comedy,Sci-Fi",United States,English,116.0
4,4,5,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,1,0,1,0,0,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0


In [29]:
df.describe()

Unnamed: 0.1,Unnamed: 0,ID,Year,IMDb,Netflix,Hulu,Prime Video,Disney+,Type,Runtime
count,16744.0,16744.0,16744.0,16173.0,16744.0,16744.0,16744.0,16744.0,16744.0,16152.0
mean,8371.5,8372.5,2003.014035,5.902751,0.212613,0.05393,0.737817,0.033684,0.0,93.413447
std,4833.720789,4833.720789,20.674321,1.347867,0.409169,0.225886,0.439835,0.180419,0.0,28.219222
min,0.0,1.0,1902.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,4185.75,4186.75,2000.0,5.1,0.0,0.0,0.0,0.0,0.0,82.0
50%,8371.5,8372.5,2012.0,6.1,0.0,0.0,1.0,0.0,0.0,92.0
75%,12557.25,12558.25,2016.0,6.9,0.0,0.0,1.0,0.0,0.0,104.0
max,16743.0,16744.0,2020.0,9.3,1.0,1.0,1.0,1.0,0.0,1256.0


In [56]:
def wrangle(df, thresh=350):
    df = df.copy()
    
    # Setting Title as index
    # df.set_index(pd.to_datetime(df['Title']), inplace = True)
    
    # Changing "Rotten Tomatoes" to float
    df['Rotten Tomatoes'] = df['Rotten Tomatoes'].str.rstrip('%')
    df['Rotten Tomatoes'] = pd.to_numeric(df['Rotten Tomatoes'], downcast="float")
    df['Rotten Tomatoes'] = (df['Rotten Tomatoes'] / 10)
    
    # Creating target for model
    # A >= 9.0 / B >= 8.0 and < 9.0 / C >= 7.0 and < 8.0 / D >= 6.0 and < 7.0 / E < 6.0
    df['Rating'] = ((df['IMDb'] + df['Rotten Tomatoes']) / 2)
    df['Rating'] = df['Rating'].astype(float)
    df.loc[df.Rating >= 9.0, "Rating"] = "A"
    df.loc[df.Rating >= 8.0 & < 9.0, "Rating"] = "B"
# #     df.loc[df.Rating >= 7.0, "Rating"] = "C"
#     df.loc[df.Rating >= 6.0, "Rating"] = "D"
#     df.loc[df.Rating < 6.0, "Rating"] = "E"
          
    # Split label and feature matrix
    y = df['Rating']
    df.drop(['Unnamed: 0', 'ID', 'Type'], axis=1, inplace=True)
    
    # Splitting genre column
    df['Genres'] = df['Genres'].str.split(",", n = 3, expand = True)
    
#     df.dropna(subset=['overall'], inplace=True)
#     df['great'] = df['overall'] >= 4
#     df = df.dropna(axis=1, thresh=thresh).drop('Location', axis=1)
#     df.set_index(pd.to_datetime(df['Date']), inplace = True)
#     df.drop('Date', axis=1, inplace=True)
#     df['Burrito'] = df['Burrito'].str.lower()
#     df['Reviewer'] = df['Reviewer'].str.lower()
    
#     # Burrito pseudo OHE
#     burrito_types = ['california', 'asada', 'surf', 'carnitas']
#     for col in burrito_types:
#         df[col] = df['Burrito'].str.contains(col)
    
#     # Reviewers
#     burrito_types = ['scott', 'emily']
#     for col in burrito_types:
#         df[col] = df['Reviewer'].str.contains(col)
    
#     # Split labels from feature matrix
#     y = df['great']
#     df.drop(['Reviewer', 'Burrito', 'overall', 'great'], axis=1, inplace=True)
    
    return df, y

In [57]:
X, y = wrangle(df)

In [58]:
X.head()

Unnamed: 0,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime,Rating
0,Inception,2010,13+,8.8,8.7,1,0,0,0,Christopher Nolan,Action,"United States,United Kingdom","English,Japanese,French",148.0,8.75
1,The Matrix,1999,18+,8.7,8.7,1,0,0,0,"Lana Wachowski,Lilly Wachowski",Action,United States,English,136.0,8.7
2,Avengers: Infinity War,2018,13+,8.5,8.4,1,0,0,0,"Anthony Russo,Joe Russo",Action,United States,English,149.0,8.45
3,Back to the Future,1985,7+,8.5,9.6,1,0,0,0,Robert Zemeckis,Adventure,United States,English,116.0,A
4,"The Good, the Bad and the Ugly",1966,18+,8.8,9.7,1,0,1,0,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0,A


# Target Choice

In [23]:
df['Rotten Tomatoes'] = df['Rotten Tomatoes'].str.strip('%')


In [24]:
df.head()

Unnamed: 0.1,Unnamed: 0,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime
0,0,1,Inception,2010,13+,8.8,87,1,0,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
1,1,2,The Matrix,1999,18+,8.7,87,1,0,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0
2,2,3,Avengers: Infinity War,2018,13+,8.5,84,1,0,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0
3,3,4,Back to the Future,1985,7+,8.5,96,1,0,0,0,0,Robert Zemeckis,"Adventure,Comedy,Sci-Fi",United States,English,116.0
4,4,5,"The Good, the Bad and the Ugly",1966,18+,8.8,97,1,0,1,0,0,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0


In [16]:
df.isnull().sum()

Unnamed: 0             0
ID                     0
Title                  0
Year                   0
Age                 9390
IMDb                 571
Rotten Tomatoes    11586
Netflix                0
Hulu                   0
Prime Video            0
Disney+                0
Type                   0
Directors            726
Genres               275
Country              435
Language             599
Runtime              592
dtype: int64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16744 entries, 0 to 16743
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       16744 non-null  int64  
 1   ID               16744 non-null  int64  
 2   Title            16744 non-null  object 
 3   Year             16744 non-null  int64  
 4   Age              7354 non-null   object 
 5   IMDb             16173 non-null  float64
 6   Rotten Tomatoes  5158 non-null   object 
 7   Netflix          16744 non-null  int64  
 8   Hulu             16744 non-null  int64  
 9   Prime Video      16744 non-null  int64  
 10  Disney+          16744 non-null  int64  
 11  Type             16744 non-null  int64  
 12  Directors        16018 non-null  object 
 13  Genres           16469 non-null  object 
 14  Country          16309 non-null  object 
 15  Language         16145 non-null  object 
 16  Runtime          16152 non-null  float64
dtypes: float64(2