Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.
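
As a rough illustration of the permutation importance task above: shuffle one column at a time and see how much the score drops. This is a minimal standard-library sketch, not the `eli5` or scikit-learn implementation. The names `permutation_importance_sketch`, `toy_predict`, `X_toy` and `y_toy` are hypothetical, and the toy data is randomly generated.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data (hypothetical): column 0 decides the label, column 1 is pure noise
data_rng = random.Random(42)
X_toy = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y_toy = [int(row[0] > 0.5) for row in X_toy]

def toy_predict(row):
    # Stand-in for a fitted model's predict method
    return int(row[0] > 0.5)

toy_importances = permutation_importance_sketch(toy_predict, X_toy, y_toy)
print(toy_importances)
```

The noise column scores zero because the stand-in model never reads it, while shuffling the informative column costs a large share of the accuracy.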


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

In [0]:
!pip install category_encoders

import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk
import category_encoders as ce
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier



In [0]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/worldwidekatie/Build_Week_2/master/ira_cleaned_data.csv')
df = df.copy()
df = df[['content', 'target']]
df.head()

Unnamed: 0,content,target
0,#adee RT davis1988will: Congratulations for Ma...,1.0
1,RT SSOL getting attention. It's penny play day...,1.0
2,#laup SHOCK VIDEO : Antifa Thugs Break a Latin...,1.0
3,PROOF Melania Has Done FAR MORE for Disaster R...,1.0
4,"An USC professor, Raphael Bostic, named first ...",1.0


In [0]:
df.target.value_counts(normalize=True)

0.0    0.947465
1.0    0.052535
Name: target, dtype: float64
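
With classes this imbalanced, the "does it beat your baseline?" question has a demanding answer. A small sketch of the majority-class baseline, using the shares printed above (these are shares of the whole `df`, so the exact figure on a validation split may differ slightly):

```python
# Class shares copied from the value_counts output above
class_share = {0.0: 0.947465, 1.0: 0.052535}

# A baseline that always predicts the majority class
majority_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_class]

# That baseline never predicts the positive class, so it finds no positives
baseline_recall = 0.0

print(majority_class, baseline_accuracy, baseline_recall)
```

Accuracy alone is therefore a weak yardstick here: a model has to clear roughly 94.7% just to match a predictor that ignores the text entirely.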

In [0]:
train, val = train_test_split(df, random_state=42)
print(train.shape, val.shape)

(151984, 2) (50662, 2)


In [0]:
train, test = train_test_split(train, random_state=42)
print(train.shape, test.shape)

(113988, 2) (37996, 2)
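
The two printed shapes follow from `train_test_split`'s default of holding out 25% when no size is given. A quick arithmetic check, assuming the held-out count is rounded up (`split_sizes` is a hypothetical helper, not a scikit-learn function):

```python
import math

def split_sizes(n_rows, test_size=0.25):
    """Row counts for one split, assuming the held-out share is rounded up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

n_total = 151984 + 50662            # rows in df, from the first printed shapes
train_1, val_n = split_sizes(n_total)
train_2, test_n = split_sizes(train_1)

shares = {name: round(n / n_total, 4)
          for name, n in [("train", train_2), ("val", val_n), ("test", test_n)]}
print(train_2, val_n, test_n, shares)
```

So the model ends up training on a bit over half of the rows, with a quarter for validation and the rest held back as the test set.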


In [0]:
target = 'target'
features = 'content'

X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]
y_test = test[target]

# Today, I tried an XGBClassifier for my model instead of a PassiveAggressiveClassifier

In [0]:
pipeline = make_pipeline(
    TfidfVectorizer(), 
    XGBClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token...
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=0, learning_rate=0.1,
                               max_delta_step=0, max_depth=3,
                        

In [0]:
y_pred = pipeline.predict(X_val)
tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
print('Validation Accuracy', accuracy_score(y_val, y_pred))
print("Precision:", tp /(tp+fp))  
print("Recall:", tp/(tp+fn))
# Precision came out very high, but recall, the metric I'm optimizing for,
# took a real hit.

Validation Accuracy 0.9854526074770045
Precision: 0.9855147439213657
Recall: 0.7287681713848508
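
One common lever when recall matters more than precision is the decision threshold: `predict` cuts at 0.5, but the probabilities from something like `pipeline.predict_proba(X_val)[:, 1]` can be cut lower. The sketch below only illustrates the trade-off on made-up labels and probabilities; `precision_recall` is a hypothetical helper, and none of these numbers come from the model above.

```python
def precision_recall(y_true, y_prob, threshold):
    """Precision and recall when predicting 1 for prob >= threshold."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        pred = int(prob >= threshold)
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and predicted probabilities for ten rows
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob_toy = [0.9, 0.7, 0.4, 0.3, 0.35, 0.2, 0.1, 0.1, 0.05, 0.45]

default_cut = precision_recall(y_true_toy, y_prob_toy, threshold=0.5)
lower_cut = precision_recall(y_true_toy, y_prob_toy, threshold=0.3)
print(default_cut, lower_cut)
```

On this toy data the lower cut recovers every positive at the cost of some false alarms, which is the direction to explore when recall is the priority.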


## Then I practiced pulling pieces out of the pipeline so I could use non-scikit-learn tools, and worked on explainability and feature importances

In [0]:
transformers = make_pipeline(
    TfidfVectorizer(min_df=10)
)

X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.transform(X_val)

model = XGBClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train_transformed, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=-1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [0]:
type(X_val_transformed.toarray())

numpy.ndarray
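
A caution on `.toarray()`: it materializes every zero in the sparse matrix. A back-of-the-envelope estimate, assuming 50662 validation rows, the 7553-term vocabulary that `min_df=10` yields on this training set, and 8-byte `float64` values:

```python
n_rows, n_cols = 50662, 7553        # validation rows, vocabulary size with min_df=10
bytes_per_value = 8                 # float64, the dtype shown in the TfidfVectorizer repr

dense_bytes = n_rows * n_cols * bytes_per_value
print(f"{dense_bytes / 1e9:.2f} GB for the dense array")
```

That is a lot of memory for a matrix that is almost entirely zeros, so keeping the sparse form is usually safer unless a tool strictly needs a dense array.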

## I never quite grasped what to do with this.

In [0]:
X_val_transformed = pd.DataFrame(X_val_transformed.toarray())
X_val_transformed

[Output: a 50662 rows × 7553 columns DataFrame of TF-IDF weights with integer column labels 0 to 7552; every value visible in the truncated preview is 0.0]


In [0]:
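# Grab the fitted vectorizer, then invert token -> column index into column index -> token
vect = transformers.named_steps.tfidfvectorizer
relabeler = dict(map(reversed, vect.vocabulary_.items()))
# Rename the dense frame's integer columns to their tokens
X_val_transformed = X_val_transformed.rename(mapper=relabeler, axis=1)
vect.vocabulary_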

{'disappointed': 1951,
 'with': 7339,
 'the': 6583,
 'tt': 6850,
 'today': 6706,
 'ruined': 5584,
 'by': 1124,
 'hurricane': 3281,
 'doing': 2010,
 'make': 4053,
 'up': 6995,
 'homework': 3187,
 'missing': 4257,
 'school': 5686,
 'all': 343,
 'time': 6677,
 'for': 2596,
 'acting': 225,
 'is': 3456,
 'getting': 2758,
 'little': 3910,
 'annoying': 423,
 'would': 7405,
 'be': 731,
 'so': 6016,
 'much': 4360,
 'easier': 2154,
 'guns': 2948,
 'november': 4561,
 'rain': 5257,
 'new': 4478,
 'proposal': 5169,
 'takes': 6463,
 'care': 1186,
 'of': 4607,
 'border': 945,
 'wall': 7156,
 'funding': 2689,
 'sanctuary': 5638,
 'cities': 1374,
 'https': 3248,
 'co': 1428,
 'xd': 7437,
 'nope': 4543,
 'quite': 5237,
 'some': 6042,
 'people': 4861,
 'can': 1158,
 'come': 1464,
 'we': 7208,
 're': 5294,
 'not': 4551,
 'nice': 4490,
 'missed': 4254,
 'your': 7506,
 'presentation': 5100,
 'due': 2119,
 'to': 6703,
 'airline': 315,
 'issues': 3467,
 'good': 2830,
 'night': 4502,
 'wish': 7332,
 'me': 4142

In [0]:
vect = transformers.named_steps.tfidfvectorizer
relabeler = dict(map(reversed, vect.vocabulary_.items()))
relabeler

{1951: 'disappointed',
 7339: 'with',
 6583: 'the',
 6850: 'tt',
 6706: 'today',
 5584: 'ruined',
 1124: 'by',
 3281: 'hurricane',
 2010: 'doing',
 4053: 'make',
 6995: 'up',
 3187: 'homework',
 4257: 'missing',
 5686: 'school',
 343: 'all',
 6677: 'time',
 2596: 'for',
 225: 'acting',
 3456: 'is',
 2758: 'getting',
 3910: 'little',
 423: 'annoying',
 7405: 'would',
 731: 'be',
 6016: 'so',
 4360: 'much',
 2154: 'easier',
 2948: 'guns',
 4561: 'november',
 5257: 'rain',
 4478: 'new',
 5169: 'proposal',
 6463: 'takes',
 1186: 'care',
 4607: 'of',
 945: 'border',
 7156: 'wall',
 2689: 'funding',
 5638: 'sanctuary',
 1374: 'cities',
 3248: 'https',
 1428: 'co',
 7437: 'xd',
 4543: 'nope',
 5237: 'quite',
 6042: 'some',
 4861: 'people',
 1158: 'can',
 1464: 'come',
 7208: 'we',
 5294: 're',
 4551: 'not',
 4490: 'nice',
 4254: 'missed',
 7506: 'your',
 5100: 'presentation',
 2119: 'due',
 6703: 'to',
 315: 'airline',
 3467: 'issues',
 2830: 'good',
 4502: 'night',
 7332: 'wish',
 4142: 'me'
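
The inverted mapping is what turns column numbers back into words. A small sketch of using it on a single row, with a handful of vocabulary entries copied from the output above; the row weights are made up and `index_to_token` is just an illustrative name:

```python
# A few entries copied from the vocabulary_ output above
vocabulary = {"hurricane": 3281, "border": 945, "wall": 7156, "funding": 2689}

# Same inversion as dict(map(reversed, ...)), written as a comprehension
index_to_token = {index: token for token, index in vocabulary.items()}

# Hypothetical sparse row: column index -> TF-IDF weight (weights are made up)
row = {945: 0.61, 7156: 0.58, 2689: 0.54}

# Translate the row's nonzero columns into tokens, heaviest first
labeled = sorted(((index_to_token[i], w) for i, w in row.items()),
                 key=lambda pair: pair[1], reverse=True)
print(labeled)
```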

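# But I did figure out how to get the IDF values into an actual pandas DataFrame with the feature names, so I could sort features using IDF as a rough proxy for importance.
That was exciting because I spent about 4 hours yesterday trying to figure that out and couldn't. Keep in mind that IDF measures how rare a term is across documents, not how much the model relies on it, so this ranking is only a stand-in.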

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = X_train
vectorizer = TfidfVectorizer(min_df=10)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
features = pd.DataFrame({'Whole_DF': vectorizer.get_feature_names(), 
                         'Importance': idf}).sort_values(by='Importance', ascending=False)

In [0]:
features.shape

(7553, 2)

In [0]:
X_train_top_50 = features.head(50)
X_train_top_50['Whole_DF']  # highest IDF: the rarest terms that still pass min_df=10

Unnamed: 0,Whole_DF,Importance
4400,nada,10.245962
4611,offense,10.245962
1178,captured,10.245962
1180,cara,10.245962
1183,cardboard,10.245962
1184,cardio,10.245962
1191,carlisle,10.245962
1192,carlos,10.245962
1193,carol,10.245962
1195,caroline,10.245962
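
Every row in the output above shares the same value, 10.245962. A quick check suggests why: with scikit-learn's default `smooth_idf=True`, IDF is `ln((1 + n) / (1 + df)) + 1`, and plugging in the training-set size with a document frequency of exactly 10 (the `min_df` floor) lands on that number. `smoothed_idf` is a hypothetical helper written for this check.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF as scikit-learn computes it when smooth_idf=True (the default)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_train = 113988   # rows in X_train, from the printed train shape
min_df = 10        # threshold passed to TfidfVectorizer

idf_at_threshold = smoothed_idf(n_train, min_df)
print(round(idf_at_threshold, 6))
```

In other words, the top of this ranking is a large tie among terms that appear in exactly 10 training documents, so `head(50)` is an arbitrary slice of the rarest terms rather than a list of the 50 most informative ones.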


In [0]:
features.tail(50)  # lowest IDF: the most common terms

Unnamed: 0,Feature,Importance
638,back,4.393498
2666,from,4.389849
6677,time,4.374869
1428,co,4.348808
2852,got,4.333442
7506,your,4.319763
7208,we,4.315165
7255,what,4.304117
7382,work,4.2939
3989,love,4.291067


## Then I compared the IDF rankings of the two classes to try to explain how the model makes its decisions

In [0]:
#Non-Target Importances
Nontarget = train[train['target']==0]
Nontarget.shape

(107926, 2)

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = Nontarget['content']
vectorizer = TfidfVectorizer(min_df=10)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
features = pd.DataFrame({'Non_IRA': vectorizer.get_feature_names(), 
                         'Importance': idf}).sort_values(by='Importance', ascending=False)
features.shape

(6917, 2)

In [0]:
non_target_train = features.head(50)
non_target_train['Non_IRA']

Unnamed: 0,Non_IRA,Importance
5110,sadface,10.191315
4340,owns,10.191315
3439,lasagna,10.191315
6249,trvsbrkr,10.191315
3649,lousy,10.191315
5408,sized,10.191315
5698,stinky,10.191315
4291,oregon,10.191315
2864,highest,10.191315
2865,highlight,10.191315


In [0]:
#Target Importances
Target = train[train['target']==1]
Target.shape

(6062, 2)

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = Target['content']
vectorizer = TfidfVectorizer(min_df=10)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
features = pd.DataFrame({'IRA': vectorizer.get_feature_names(), 
                         'Importance': idf}).sort_values(by='Importance', ascending=False)
features.shape

(1020, 2)

In [0]:
target_train = features.head(50)
target_train['IRA']

Unnamed: 0,IRA,Importance
420,illegals,7.312065
891,todolistbeforechristmas,7.312065
93,beautiful,7.312065
828,stupid,7.312065
647,pittsburgh,7.312065
270,everybody,7.312065
312,football,7.312065
562,move,7.312065
309,followers,7.312065
727,rise,7.312065
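
Because each class's top IDF values are tied at the `min_df` floor, the two lists above are hard to compare directly. One alternative, offered here only as a sketch, is to compare how often each token appears in each class and rank by the log ratio. Everything below is an assumption for illustration: the helper names `doc_freq_share` and `log_ratio`, the `floor` smoothing constant, and the miniature corpora, which are made up rather than taken from the dataset.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # same shape as TfidfVectorizer's default token pattern

def doc_freq_share(docs):
    """Share of documents that contain each token at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(TOKEN.findall(doc.lower())))
    return {token: n / len(docs) for token, n in counts.items()}

def log_ratio(target_docs, other_docs, floor=1e-3):
    """log(share in target / share in other); positive leans toward target."""
    target, other = doc_freq_share(target_docs), doc_freq_share(other_docs)
    tokens = set(target) | set(other)
    return {t: math.log(max(target.get(t, 0), floor) / max(other.get(t, 0), floor))
            for t in tokens}

# Made-up miniature corpora, not rows from the dataset
target_docs = ["border wall funding", "wall funding now", "good night"]
other_docs = ["good night", "missed homework", "good morning homework"]

scores = log_ratio(target_docs, other_docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3], ranked[-3:])
```

On the real data, `Target['content']` and `Nontarget['content']` would stand in for the two toy lists, and a `min_df`-style cutoff would still be worth applying so that one-off tokens do not dominate the extremes.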
