
Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
import pandas as pd

df_basic = pd.read_csv('https://datasets.imdbws.com/title.basics.tsv.gz', sep='\t', low_memory=False)
df_rating = pd.read_csv('https://datasets.imdbws.com/title.ratings.tsv.gz', sep='\t', low_memory=False)

In [83]:
#Look at the shape of both dataframes

df_basic.shape, df_rating.shape

((6505830, 9), (1019739, 3))

In [0]:
#Drop TV, video, short, and game title types, since this project focuses on movies

df_basic = df_basic[~df_basic['titleType'].isin(['short', 'tvShort', 'videoGame', 'tvSpecial', 'tvMiniSeries', 'tvMovie', 'tvSeries', 'video', 'tvEpisode'])]

In [0]:
df_basic = df_basic.drop(columns=['originalTitle', 'isAdult', 'runtimeMinutes', 'endYear', 'titleType'])

In [0]:
df_basic['genres'] = df_basic['genres'].str.lower()

In [0]:
#Need to write a function

#X = 'drama'

#def wrangle(X):
#  X = X.copy()
#  X = Y
#  Y = df_basic['genres'].str.contains(X)
#  df_basic.loc[Y, 'genres'] = X
#  return X


In [0]:
drama = df_basic['genres'].str.contains('drama')
comedy = df_basic['genres'].str.contains('comedy')
documentary = df_basic['genres'].str.contains('documentary')
romance = df_basic['genres'].str.contains('romance')
family = df_basic['genres'].str.contains('family')
animation = df_basic['genres'].str.contains('animation')
crime = df_basic['genres'].str.contains('crime')
action = df_basic['genres'].str.contains('action')
adventure = df_basic['genres'].str.contains('adventure')
mystery = df_basic['genres'].str.contains('mystery')
musical = df_basic['genres'].str.contains('musical')
thriller = df_basic['genres'].str.contains('thriller')
horror = df_basic['genres'].str.contains('horror')
sci_fi = df_basic['genres'].str.contains('sci')
fantasy = df_basic['genres'].str.contains('fantasy')
war = df_basic['genres'].str.contains('war')
western = df_basic['genres'].str.contains('western')
film_noir = df_basic['genres'].str.contains('film')
history = df_basic['genres'].str.contains('history')
sport = df_basic['genres'].str.contains('sport')
biography = df_basic['genres'].str.contains('biography')

In [0]:
df_basic.loc[drama, 'genres'] = 'drama'
df_basic.loc[comedy, 'genres'] = 'comedy'
df_basic.loc[documentary, 'genres'] = 'documentary'
df_basic.loc[romance, 'genres'] = 'romance'
df_basic.loc[family, 'genres'] = 'family'
df_basic.loc[animation, 'genres'] = 'animation'
df_basic.loc[crime, 'genres'] = 'crime'
df_basic.loc[action, 'genres'] = 'action'
df_basic.loc[adventure, 'genres'] = 'adventure'
df_basic.loc[thriller, 'genres'] = 'thriller'
df_basic.loc[horror, 'genres'] = 'horror'
df_basic.loc[sci_fi, 'genres'] = 'sci_fi'
df_basic.loc[fantasy, 'genres'] = 'fantasy'
df_basic.loc[war, 'genres'] = 'war'
df_basic.loc[western, 'genres'] = 'western'
df_basic.loc[film_noir, 'genres'] = 'film_noir'
df_basic.loc[mystery, 'genres'] = 'mystery'
df_basic.loc[history, 'genres'] = 'history'
df_basic.loc[sport, 'genres'] = 'sport'
df_basic.loc[biography, 'genres'] = 'biography'
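
The mask-then-assign cells above could be folded into a single helper, which is what the unfinished `wrangle` draft was aiming for. The following is only a sketch: `GENRE_RULES` and `collapse_genres` are hypothetical names, and the rule order is chosen to reproduce the net effect of the assignments above, where the last matching genre wins.

```python
import pandas as pd

# (label, substring) pairs; when a title matches several substrings,
# the pair that appears last in this list determines the final label.
GENRE_RULES = [
    ("drama", "drama"), ("comedy", "comedy"), ("documentary", "documentary"),
    ("romance", "romance"), ("family", "family"), ("animation", "animation"),
    ("crime", "crime"), ("action", "action"), ("adventure", "adventure"),
    ("thriller", "thriller"), ("horror", "horror"), ("sci_fi", "sci"),
    ("fantasy", "fantasy"), ("war", "war"), ("western", "western"),
    ("film_noir", "film"), ("mystery", "mystery"), ("history", "history"),
    ("sport", "sport"), ("biography", "biography"),
]

def collapse_genres(genres, rules=GENRE_RULES):
    """Return a copy of `genres` with each multi-genre string replaced by one label."""
    collapsed = genres.copy()
    for label, substring in rules:
        # Match against the original strings, not the partially relabeled ones
        mask = genres.str.contains(substring, regex=False)
        collapsed[mask] = label
    return collapsed

# Toy data standing in for df_basic['genres']
toy = pd.Series(["comedy,drama", "action,sci-fi", "biography,drama,history", "music"])
print(collapse_genres(toy).tolist())
```

Strings that match no rule (such as `music` here) pass through unchanged, which mirrors how the cells above leave those rows alone.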

In [0]:
df_basic = df_basic[~df_basic['genres'].isin(['music,reality-tv',
                                              'reality-tv,talk-show',
                                              'news,talk-show',
                                              'adult,music',
                                              'music,talk-show',
                                              'news,reality-tv,talk-show',
                                              'game-show,music',
                                              'adult,short',
                                              'music,musical',
                                              'adult,musical',
                                              'music,musical,reality-tv',
                                              'musical,reality-tv',
                                              'short',
                                              'news',
                                              'talk-show',
                                              'reality-tv',
                                              'game-show',
                                              '\\n'
                                              ])]

In [90]:
df_basic['genres'].value_counts()

drama          98729
documentary    81136
comedy         54860
romance        29917
action         26541
thriller       24193
horror         22092
adventure      17396
biography      15347
crime          13747
mystery        13363
fantasy        11359
history         9782
sci_fi          9734
family          9533
war             7055
western         6571
adult           6180
sport           5553
animation       3644
musical         2099
music           1484
film_noir        686
Name: genres, dtype: int64

In [91]:
df_basic.shape

(471001, 4)

Merge the two dataframes on `tconst` (an inner join by default, so titles without a rating are dropped)

In [0]:
df_imdb = pd.merge(df_basic, df_rating, on='tconst')

In [93]:
df_imdb.shape

(233761, 6)
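
To see how many titles an inner join discards, a left merge with `indicator=True` can serve as an audit. The frames below are toy stand-ins for `df_basic` and `df_rating`, not the real data.

```python
import pandas as pd

# Toy stand-ins: tt2 has no rating, and tt9 has a rating but no basics row
basics = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                       "genres": ["drama", "comedy", "war"]})
ratings = pd.DataFrame({"tconst": ["tt1", "tt3", "tt9"],
                        "averageRating": [5.4, 6.1, 7.0]})

# Default merge is an inner join: only keys present in both frames survive
inner = pd.merge(basics, ratings, on="tconst")

# A left join with indicator=True flags rows that found no match
audit = pd.merge(basics, ratings, on="tconst", how="left", indicator=True)
unrated = (audit["_merge"] == "left_only").sum()

print(inner.shape, unrated)
```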

In [94]:
df_imdb.head()

| | tconst | primaryTitle | startYear | genres | averageRating | numVotes |
|---|---|---|---|---|---|---|
| 0 | tt0000009 | Miss Jerry | 1894 | romance | 5.4 | 86 |
| 1 | tt0000147 | The Corbett-Fitzsimmons Fight | 1897 | sport | 5.2 | 323 |
| 2 | tt0000335 | Soldiers of the Cross | 1900 | biography | 6.1 | 40 |
| 3 | tt0000574 | The Story of the Kelly Gang | 1906 | biography | 6.1 | 552 |
| 4 | tt0000615 | Robbery Under Arms | 1907 | drama | 4.8 | 14 |


In [95]:
df_imdb.dtypes

tconst            object
primaryTitle      object
startYear         object
genres            object
averageRating    float64
numVotes           int64
dtype: object

In [96]:
#Encode startYear as integer category codes (the codes are positions in the sorted category list, not the years themselves)

df_imdb['startYear'] = pd.Categorical(df_imdb['startYear'])
df_imdb['startYear'] = df_imdb['startYear'].cat.codes
df_imdb['startYear'].dtypes

dtype('int8')
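
Category codes give the model an ordering but lose the actual year. If the real year is wanted, `pd.to_numeric` is one alternative. The sketch below assumes missing years appear as the string `\N` in the raw file, and uses a toy column instead of `df_imdb['startYear']`.

```python
import pandas as pd

# Toy column mimicking startYear as read from the TSV: strings, with "\N" for missing
raw_years = pd.Series(["1894", "1897", "\\N", "1906"])

# Category codes are positions in the sorted category list, not years
codes = pd.Categorical(raw_years).codes

# to_numeric keeps the real year and turns unparseable entries into NaN
years = pd.to_numeric(raw_years, errors="coerce")

print(list(codes))
print(years.tolist())
```

Note that the placeholder string receives a code of its own in the first approach, while the second turns it into an explicit missing value that still needs handling (for example, imputation) before modeling.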

In [0]:
import numpy as np

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    """Shuffle df and split it into train, validation, and test frames."""
    np.random.seed(seed)
    m = len(df.index)
    #Permute row positions rather than index labels, because .iloc is positional
    perm = np.random.permutation(m)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.iloc[perm[:train_end]]
    val = df.iloc[perm[train_end:validate_end]]
    test = df.iloc[perm[validate_end:]]
    return train, val, test

In [99]:
train_validate_test_split(df_imdb)

(            tconst             primaryTitle  ...  averageRating numVotes
 147452   tt1410281       Rabbit à la Berlin  ...            7.6      729
 162550   tt1866132        The Sleeping Girl  ...            6.7       26
 120540   tt0453150                   Mazhai  ...            4.7      177
 81652    tt0199091  Tis zileias ta kamomata  ...            5.8       95
 709      tt0009142      He Comes Up Smiling  ...            6.1       41
 ...            ...                      ...  ...            ...      ...
 230816   tt8929704         Maniyar Kudumbam  ...            4.8       13
 157849   tt1727373            The Kill Hole  ...            4.1      344
 144692   tt1329345  Forgetful Not Forgotten  ...            8.2        9
 137458  tt11091724            Missed Nuance  ...            9.4        5
 142757   tt1276988            Summer Eleven  ...            6.2      337
 
 [140256 rows x 6 columns],
            tconst          primaryTitle  ...  averageRating numVotes
 46835   tt0

In [0]:
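#Split the data, then arrange it into X features matrix and y target vector

train, val, test = train_validate_test_split(df_imdb, seed=42)

target = 'averageRating'
X_train = train.drop(columns=target)
y_train = train[target]
X_val = val.drop(columns=target)
y_val = val[target]
X_test = test.drop(columns=target)
y_test = test[target]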
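
The assignment asks whether the model beats a baseline. For a continuous target such as `averageRating`, one common baseline is to predict the training mean for every row and measure the mean absolute error on the validation set. The values below are toy numbers standing in for `y_train` and `y_val`.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings standing in for y_train and y_val
y_train_toy = pd.Series([5.4, 5.2, 6.1, 6.1, 4.8])
y_val_toy = pd.Series([7.6, 6.7, 4.7])

# Mean baseline: predict the training-set mean for every validation row
baseline = y_train_toy.mean()
baseline_pred = np.full(len(y_val_toy), baseline)

# Mean absolute error of that constant prediction
baseline_mae = np.mean(np.abs(y_val_toy - baseline_pred))
print(f"Mean baseline: {baseline:.2f}, validation MAE: {baseline_mae:.2f}")
```

Any fitted regressor would need a lower validation MAE than this to count as an improvement.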

In [0]:
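#XGBClassifier expects a categorical target, so the continuous averageRating would need to be binned first

#import category_encoders as ce
#from sklearn.pipeline import make_pipeline
#from xgboost import XGBClassifier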

#pipeline = make_pipeline(
#    ce.OrdinalEncoder(), 
#    XGBClassifier(n_estimators=100, random_state=42, n_jobs=-1)
#)

#pipeline.fit(X_train, y_train)
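
One way to turn ratings into a categorical target is to bin them with `pd.cut`. The bin edges below are hypothetical and chosen only for illustration. An alternative that avoids binning is to keep the continuous target and use a regressor such as `XGBRegressor`.

```python
import pandas as pd

# Toy ratings standing in for y_train
ratings = pd.Series([4.8, 6.1, 9.4, 7.0])

# Hypothetical bin edges; the intervals are (0, 5], (5, 7], (7, 10]
bin_edges = [0, 5, 7, 10]

# labels=False returns integer bin indices (0, 1, 2) instead of interval labels
rating_class = pd.cut(ratings, bins=bin_edges, labels=False)
print(rating_class.tolist())
```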

In [0]:
#from sklearn.metrics import accuracy_score
#y_pred = pipeline.predict(X_val)
#print('Validation Accuracy', accuracy_score(y_val, y_pred))
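
The remaining assignment task is permutation importances. The idea can be sketched from scratch with NumPy: shuffle one column of the validation features at a time and record how much the error grows. Everything below is an illustrative assumption, including the synthetic data, the least-squares stand-in model, and the helper names. In practice a maintained implementation such as `sklearn.inspection.permutation_importance` would be applied to the fitted pipeline.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: the target depends strongly on column 0 and not at all on column 1
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)
X_fit, X_holdout = X[:400], X[400:]
y_fit, y_holdout = y[:400], y[400:]

# Stand-in model: ordinary least squares fitted with NumPy
coef, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)

def predict(features):
    return features @ coef

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def permutation_importances(predict_fn, features, target, n_repeats=5):
    """Average increase in MAE when each column is shuffled in turn."""
    base_error = mae(target, predict_fn(features))
    importances = []
    for col in range(features.shape[1]):
        increases = []
        for _ in range(n_repeats):
            shuffled = features.copy()
            # Shuffling breaks the link between this column and the target
            shuffled[:, col] = rng.permutation(shuffled[:, col])
            increases.append(mae(target, predict_fn(shuffled)) - base_error)
        importances.append(np.mean(increases))
    return np.array(importances)

importances = permutation_importances(predict, X_holdout, y_holdout)
print(importances)
```

A column whose shuffling barely changes the error contributes little to the model's predictions, while a large increase marks a feature the model relies on.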