Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone is a reasonable metric. Outside that range, accuracy can be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
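
The distribution and metric checks in the list above can be sketched in a few lines of pandas. This is only an illustration: the toy `target` values and the helper name `majority_class_share` are made up here and are not part of the project data.

```python
import pandas as pd

# Toy target column; the labels and counts are invented for illustration.
target = pd.Series(['E'] * 11 + ['D'] * 4 + ['C'] * 3 + ['B'] * 2)

def majority_class_share(y):
    """Return (most frequent class, its relative frequency)."""
    shares = y.value_counts(normalize=True)  # sorted from most to least frequent
    return shares.index[0], float(shares.iloc[0])

label, share = majority_class_share(target)
n_classes = target.nunique()

# Rule of thumb from the checklist: accuracy alone is reasonable
# only when the majority class share is >= 50% and < 70%.
accuracy_is_reasonable = 0.50 <= share < 0.70

print(f'{n_classes} classes; majority class {label!r} at {share:.0%}')
print('Accuracy alone is reasonable:', accuracy_is_reasonable)
```

For a real project, `target` would be the chosen target column of the dataset rather than a hand-built Series.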

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

# Wrangle ML datasets (from the 231 assignment)

- [ ] Continue to clean and explore your data.
- [ ] For the evaluation metric you chose, what score would you get just by guessing?
- [ ] Can you make a fast, first model that beats guessing?

# Loading the dataset

In [1]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.read_csv('/Users/bradbrauser/Desktop/Data Science/MoviesOnStreamingPlatforms_updated.csv')

In [3]:
df.head()

| | Unnamed: 0 | ID | Title | Year | Age | IMDb | Rotten Tomatoes | Netflix | Hulu | Prime Video | Disney+ | Type | Directors | Genres | Country | Language | Runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Inception | 2010 | 13+ | 8.8 | 87% | 1 | 0 | 0 | 0 | 0 | Christopher Nolan | Action,Adventure,Sci-Fi,Thriller | United States,United Kingdom | English,Japanese,French | 148.0 |
| 1 | 1 | 2 | The Matrix | 1999 | 18+ | 8.7 | 87% | 1 | 0 | 0 | 0 | 0 | Lana Wachowski,Lilly Wachowski | Action,Sci-Fi | United States | English | 136.0 |
| 2 | 2 | 3 | Avengers: Infinity War | 2018 | 13+ | 8.5 | 84% | 1 | 0 | 0 | 0 | 0 | Anthony Russo,Joe Russo | Action,Adventure,Sci-Fi | United States | English | 149.0 |
| 3 | 3 | 4 | Back to the Future | 1985 | 7+ | 8.5 | 96% | 1 | 0 | 0 | 0 | 0 | Robert Zemeckis | Adventure,Comedy,Sci-Fi | United States | English | 116.0 |
| 4 | 4 | 5 | The Good, the Bad and the Ugly | 1966 | 18+ | 8.8 | 97% | 1 | 0 | 1 | 0 | 0 | Sergio Leone | Western | Italy,Spain,West Germany | Italian | 161.0 |


# Which column in your tabular dataset will you predict, and how is your target distributed?

In [4]:
# Time-based split: movies released in 2017 or later are held out as the test set
train = df[df['Year'] < 2017]
test = df[df['Year'] >= 2017]

train.shape, test.shape

((13222, 17), (3522, 17))
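
The cell above holds out 2017 and later as the test set. As a sketch of how the remaining years could also be divided into training and validation windows, the example below uses a toy frame and a hypothetical helper, `time_split`. The cutoffs 2015 and 2017 mirror values that appear in this notebook, but the data is invented.

```python
import pandas as pd

# Toy frame standing in for the movies data; only 'Year' matters here.
movies = pd.DataFrame({
    'Year': [2010, 2013, 2014, 2015, 2016, 2017, 2018, 2020],
    'Runtime': [148, 95, 101, 120, 88, 110, 93, 105],
})

def time_split(frame, val_start, test_start, col='Year'):
    """Split rows into train / validation / test by a time column."""
    train = frame[frame[col] < val_start]
    val = frame[(frame[col] >= val_start) & (frame[col] < test_start)]
    test = frame[frame[col] >= test_start]
    return train, val, test

train_part, val_part, test_part = time_split(movies, val_start=2015, test_start=2017)
print(train_part.shape, val_part.shape, test_part.shape)
```

Because the three windows do not overlap in time, the validation and test scores both reflect how the model handles years it has never seen.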

In [38]:
def wrangle(df):
    df = df.copy()

    # Dropping movies made before 1942
    df.drop(df[df.Year < 1942].index, inplace=True)

    # Setting Year as the index so rows can be split by release year later
    df.set_index(df['Year'], inplace = True)
    
    # Converting "Rotten Tomatoes" from a percent string (e.g. '87%') to a
    # float on the same 0-10 scale as IMDb
    df['Rotten Tomatoes'] = df['Rotten Tomatoes'].str.rstrip('%')
    df['Rotten Tomatoes'] = pd.to_numeric(df['Rotten Tomatoes'], downcast="float")
    df['Rotten Tomatoes'] = (df['Rotten Tomatoes'] / 10)

    # Filling missing values in IMDb with Rotten Tomatoes and vice versa
    # when at least one of the two is non-null
    df['Rotten Tomatoes'] = df['Rotten Tomatoes'].fillna(df['IMDb'])
    df['IMDb'] = df['IMDb'].fillna(df['Rotten Tomatoes'])

    # Dropping rows if there are nulls in both IMDb AND Rotten Tomatoes
    # (dropna is not in-place by default, so the result is assigned back)
    df = df.dropna(subset=['IMDb', 'Rotten Tomatoes'], how='all')
    
    # Beginning to create target for model - getting the average of the
    # IMDb and Rotten Tomatoes ratings
    df['Rating'] = ((df['IMDb'] + df['Rotten Tomatoes']) / 2)
    df['Rating'] = df['Rating'].astype(float)
    
    # Creating conditions for grading scale based on Rating column
    condition = [(df['Rating'] >= 9.0),
              (df['Rating'] >= 8.0) & (df['Rating'] < 9.0),
              (df['Rating'] >= 7.0) & (df['Rating'] < 8.0),
              (df['Rating'] >= 6.0) & (df['Rating'] < 7.0),
              (df['Rating'] < 6.0)]
    
    # Creating grading scale
    values = ['A', 'B', 'C', 'D', 'E']
    
    # Replacing the numeric Rating column with its letter grade
    df['Rating'] = np.select(condition, values)
    
    # Dropping rows with a NaN in any remaining column
    df = df.dropna()

    # Dropping columns that will not be used as features
    df = df.drop(columns=['Unnamed: 0', 'ID', 'Title', 'Type', 'Year', 'IMDb',
                          'Rotten Tomatoes', 'Directors', 'Genres', 'Country', 'Language'])
    
    # Split label and feature matrix
    y = df['Rating']
    df.drop(['Rating'], axis=1, inplace=True)
    
    return df, y
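
The grading step in `wrangle` maps a 0-10 rating to a letter with a list of conditions. As an alternative sketch, not the notebook's method, the same bin edges can be expressed with `pd.cut`. The toy `ratings` values are invented. One difference to be aware of: `pd.cut` leaves a missing rating as `NaN`, whereas `np.select` assigns its default value to rows that match no condition.

```python
import numpy as np
import pandas as pd

# Toy ratings on the 0-10 scale, including bin edges and a missing value.
ratings = pd.Series([9.3, 9.0, 8.99, 7.5, 6.0, 5.9, np.nan])

# right=False makes each bin closed on the left: [6, 7), [7, 8), ...
grades = pd.cut(
    ratings,
    bins=[-np.inf, 6.0, 7.0, 8.0, 9.0, np.inf],
    labels=['E', 'D', 'C', 'B', 'A'],
    right=False,
)
print(grades.tolist())
```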

In [39]:
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

# Wrangling
X, y = wrangle(df)

print(X.shape)
print(y.shape)
X.head()

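# Train/validation split on release year (the index).
# Note: X is built from the full df, so the validation set also
# contains the 2017+ rows that were held out as `test` earlier.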
cutoff = 2015
X_train, y_train = X[X.index < cutoff], y[y.index < cutoff]
X_val, y_val = X[X.index >= cutoff], y[y.index >= cutoff]

# Baseline
y_train.value_counts(normalize=True)

(6994, 6)
(6994,)


E    0.557444
D    0.191276
C    0.164243
B    0.083350
A    0.003686
Name: Rating, dtype: float64
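
The `value_counts` output above gives the baseline on the training labels: always guessing `E` is right about 55.7% of the time. The score a model has to beat on validation is the accuracy of that same constant guess on the validation labels, which can differ if the class mix shifts over time. A minimal sketch with toy labels (the `_toy` Series are invented stand-ins for `y_train` and `y_val`):

```python
import pandas as pd

# Toy labels; in the notebook these would be y_train and y_val.
y_train_toy = pd.Series(['E', 'E', 'E', 'D', 'C', 'E', 'D', 'B'])
y_val_toy = pd.Series(['E', 'D', 'E', 'C', 'D'])

# The "guess" is the most frequent class in the training labels.
majority_class = y_train_toy.mode()[0]

# Baseline accuracy = share of rows where that constant guess is right.
train_baseline = (y_train_toy == majority_class).mean()
val_baseline = (y_val_toy == majority_class).mean()

print('Majority class:', majority_class)
print('Training baseline accuracy:', train_baseline)
print('Validation baseline accuracy:', val_baseline)
```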

In [40]:
from sklearn.ensemble import RandomForestClassifier
from category_encoders import OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
import category_encoders as ce
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

In [41]:
# Building model 1
model1 = Pipeline([
                  ('ohe', OneHotEncoder()),
                  ('impute', SimpleImputer()),
                  ('classifier', RandomForestClassifier())
])

# Fitting the model
model1.fit(X_train, y_train)

print('Training Accuracy:', model1.score(X_train, y_train))
print('Validation Accuracy:', model1.score(X_val, y_val))

Training Accuracy: 0.6936309645709605
Validation Accuracy: 0.42539081004263385
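
Accuracy alone does not show how each grade is handled, and the rarest class here (`A`, under 1% of the training labels) can be ignored entirely without hurting accuracy much. As a sketch of a per-class view, the example below uses `classification_report` (already imported in this notebook) and `balanced_accuracy_score` on invented labels, where the "model" always predicts the majority class.

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             classification_report)

# Toy labels; the predictions always guess the majority class 'E'.
y_true = ['E', 'E', 'E', 'E', 'D', 'D', 'C', 'B']
y_pred = ['E', 'E', 'E', 'E', 'E', 'E', 'E', 'E']

acc = accuracy_score(y_true, y_pred)               # share of correct predictions
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall

print('Accuracy:', acc)
print('Balanced accuracy:', bal_acc)
# zero_division=0 reports 0 (without a warning) for classes never predicted.
print(classification_report(y_true, y_pred, zero_division=0))
```

With the notebook's pipeline, the same calls would take `y_val` and `model1.predict(X_val)` as inputs.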


In [42]:
# Building model 2 (same pipeline as model 1; no random_state is set,
# so the scores vary slightly from run to run)
model2 = Pipeline([
                  ('ohe', OneHotEncoder()),
                  ('impute', SimpleImputer()),
                  ('classifier', RandomForestClassifier())
])

# Fitting the model
model2.fit(X_train, y_train)

print('Training Accuracy:', model2.score(X_train, y_train))
print('Validation Accuracy:', model2.score(X_val, y_val))

Training Accuracy: 0.6936309645709605
Validation Accuracy: 0.414021790620559


In [43]:
# Model 3
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

model3 = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    SimpleImputer(strategy='mean'), 
    StandardScaler(), 
    RandomForestClassifier(n_estimators = 10, min_samples_split = 40))

# Fitting the model
model3.fit(X_train, y_train)

print('Training Accuracy:', model3.score(X_train, y_train))
print('Validation Accuracy:', model3.score(X_val, y_val))

Training Accuracy: 0.594921155027647
Validation Accuracy: 0.46849834201800095


In [35]:
# Model 4
pipeline = make_pipeline(
  OrdinalEncoder(drop_invariant=True),
  SimpleImputer(strategy='median'),
  StandardScaler(),
  RandomForestClassifier(
      criterion='entropy',
      min_samples_split=3,
      max_depth=15,
      n_estimators= 200,
      n_jobs=1)
)
param_distributions = {
    'randomforestclassifier__criterion': ('gini','entropy'),
    #'randomforestclassifier__max_depth' : (11, 12, 13, 14, 15),
#     'randomforestclassifier__max_features': (11,12,13,14,15),
    #'randomforestclassifier__min_samples_split': (1,2,3),
}
search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=30,
    cv=7,
    scoring='accuracy',
    verbose = 30,
    return_train_score=True,
    n_jobs=4,
)
search.fit(X_train, y_train)
print('Cross-validation Best Score:', search.best_score_)
print('Best Estimator:', search.best_params_)
print('Best Model:', search.best_estimator_)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 7 folds for each of 2 candidates, totalling 14 fits


[Parallel(n_jobs=4)]: Done   1 tasks      | elapsed:    0.8s
[Parallel(n_jobs=4)]: Done   2 tasks      | elapsed:    0.8s
[Parallel(n_jobs=4)]: Done   3 tasks      | elapsed:    0.9s
[Parallel(n_jobs=4)]: Done   4 tasks      | elapsed:    0.9s
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    1.7s
[Parallel(n_jobs=4)]: Done   6 tasks      | elapsed:    1.7s
[Parallel(n_jobs=4)]: Done   7 tasks      | elapsed:    1.7s
[Parallel(n_jobs=4)]: Done   8 out of  14 | elapsed:    1.8s remaining:    1.3s
[Parallel(n_jobs=4)]: Done   9 out of  14 | elapsed:    2.6s remaining:    1.4s
[Parallel(n_jobs=4)]: Done  10 out of  14 | elapsed:    2.6s remaining:    1.1s
[Parallel(n_jobs=4)]: Done  11 out of  14 | elapsed:    2.7s remaining:    0.7s
[Parallel(n_jobs=4)]: Done  12 out of  14 | elapsed:    2.7s remaining:    0.4s
[Parallel(n_jobs=4)]: Done  14 out of  14 | elapsed:    3.5s remaining:    0.0s
[Parallel(n_jobs=4)]: Done  14 out of  14 | elapsed:    3.5s finished


Cross-validation Best Score: 0.44397163120567373
Best Estimator: {'randomforestclassifier__criterion': 'gini'}
Best Model: Pipeline(steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['Age'], drop_invariant=True,
                                mapping=[{'col': 'Age', 'data_type': dtype('O'),
                                          'mapping': 13+    1
18+    2
7+     3
all    4
16+    5
NaN   -2
dtype: int64}])),
                ('simpleimputer', SimpleImputer(strategy='median')),
                ('standardscaler', StandardScaler()),
                ('randomforestclassifier',
                 RandomForestClassifier(max_depth=15, min_samples_split=3,
                                        n_estimators=200, n_jobs=1))])
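
The log reports only 2 candidates even though `n_iter=30`, because the only uncommented entry in `param_distributions` is `criterion` with two values, so the search space is exhausted after two draws. As a sketch of a wider search, the example below samples integer ranges with `scipy.stats.randint` on synthetic data. The ranges are illustrative assumptions, not tuned values, and the parameter names have no `randomforestclassifier__` prefix because the estimator here is not wrapped in a pipeline.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X_train / y_train.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0)

param_distributions = {
    'criterion': ['gini', 'entropy'],
    'max_depth': randint(3, 16),          # integers 3..15
    'min_samples_split': randint(2, 11),  # integers 2..10 (must be >= 2)
}

search_toy = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # five sampled candidates
    cv=3,
    scoring='accuracy',
    random_state=0,    # makes the sampled candidates repeatable
)
search_toy.fit(X_toy, y_toy)

print('Best params:', search_toy.best_params_)
print('Best cross-validation score:', search_toy.best_score_)
```

After a search like this, `best_estimator_` can still be scored on the validation set, since the cross-validation score is computed only from training-period rows.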
