<a href="https://colab.research.google.com/github/fuse999/DS-Unit-2-Applied-Modeling/blob/master/module2/assignment_applied_modeling_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.
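
The xgboost and permutation-importance tasks above can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the assignment solution: the data is randomly generated, the column names `signal` and `noise` are hypothetical, and scikit-learn's `GradientBoostingClassifier` stands in for xgboost so the example runs without extra installs (`xgboost.XGBClassifier` follows the same fit/predict interface and could be swapped in).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): one informative
# column and one column of pure noise.
rng = np.random.default_rng(0)
n_rows = 1000
signal = rng.normal(size=n_rows)
noise = rng.normal(size=n_rows)
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)
feature_names = ['signal', 'noise']  # hypothetical names

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Gradient-boosted trees, used here as a stand-in for xgboost.
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time on the validation set and record
# how much the accuracy drops, averaged over n_repeats shuffles.
result = permutation_importance(
    model, X_val, y_val, scoring='accuracy', n_repeats=5, random_state=0)

importances = dict(zip(feature_names, result.importances_mean))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f'{name}: {value:.3f}')
```

Computing the importances on the validation set rather than the training set measures how much the model relies on each feature for predictions on unseen rows.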


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [0]:
import pandas as pd
df = pd.read_csv('gradcafe.csv')

In [0]:
df.head()

Unnamed: 0,comment,major,gre_quant,degree,season,decision,gre_verbal,date_of_result,term_year,undergrad_gpa,date_added,applicant_status,university,gre_awa
0,I applied because one compatriot teacher there...,"Computer Science,",160.0,Masters,Fall,Accepted,158.0,2019-08-14,2019,,2019-08-23,International without US degree,University Of South Carolina,3.5
1,,"Computer Science,",,PhD,Fall,Accepted,,2019-08-22,2019,,2019-08-21,International without US degree,"University Of Californa, Los Angeles (UCLA)",
2,Admitted to CS in Toronto,"Computer Science,",,PhD,Fall,Accepted,,2019-04-16,2019,,2019-08-17,Unknown,University Of Toronto,
3,,"Computer Science,",,Masters,Fall,Accepted,,2019-05-22,2019,,2019-08-17,Unknown,University Of Toronto,
4,"Hello, I've heard Amherst allows one to defer ...","Computer Science,",168.0,Masters,Spring,Others,159.0,2019-08-12,2020,3.8,2019-08-11,International without US degree,"University Of Massachusetts, Amherst",4.0


In [0]:
df.decision.value_counts()

Accepted    1310840
Rejected    1272526
Others      1120158
Name: decision, dtype: int64

In [0]:
df.columns

Index(['comment', 'major', 'gre_quant', 'degree', 'season', 'decision',
       'gre_verbal', 'date_of_result', 'term_year', 'undergrad_gpa',
       'date_added', 'applicant_status', 'university', 'gre_awa'],
      dtype='object')

In [0]:
# No-op: 'decision' is already a column of df, so this assignment changes nothing
df.decision = df['decision']

In [0]:
# Binary target: True for 'Accepted', False for 'Rejected' and 'Others'
df['y_n'] = df['decision'] == 'Accepted'

In [0]:
# Drop every row that has a missing value in any column, including 'comment'
df = df.dropna()

In [0]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, train_size=0.80,
                              test_size=0.20, stratify=df['y_n'],
                              random_state=59)

train, val = train_test_split(train, train_size=0.80,
                              test_size=0.20, stratify=train['y_n'],
                              random_state=59)
train.shape, val.shape, test.shape

((512415, 15), (128104, 15), (160130, 15))

In [0]:
train.groupby('y_n').size()

y_n
False    257143
True     255272
dtype: int64

In [0]:
# Majority class baseline
from sklearn.metrics import accuracy_score
y_pred = [False] * len(train['y_n'])
accuracy_score(train['y_n'], y_pred)

0.501825668647483

In [0]:
train.select_dtypes(include='number').describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
gre_quant,512415.0,196.626843,134.458813,0.0,163.0,167.0,170.0,900.0
gre_verbal,512415.0,180.351412,105.395005,0.0,154.0,159.0,164.0,801.0
term_year,512415.0,2016.480013,2.150291,2009.0,2015.0,2017.0,2018.0,2020.0
undergrad_gpa,512415.0,3.891407,1.13122,0.9,3.5,3.71,3.9,9.99
gre_awa,512415.0,4.169711,4.936383,0.0,3.5,4.0,4.5,99.99
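
The summary above shows maxima that look implausible (`gre_quant` 900, `gre_verbal` 801, `gre_awa` 99.99, `undergrad_gpa` 9.99) and minima of 0 for the GRE sections. One way to handle such values is to replace them with NaN so the pipeline's imputer can fill them. The sketch below is only an illustration: the ranges in `VALID_RANGES` are assumptions (current GRE scale of 130 to 170, AWA of 0 to 6, a 4.0 GPA scale), the helper `mask_out_of_range` is hypothetical, and scores reported on the older 200 to 800 GRE scale would also be masked by these bounds.

```python
import pandas as pd

# Assumed plausible ranges (hypothetical, adjust for your data).
VALID_RANGES = {
    'gre_quant': (130, 170),
    'gre_verbal': (130, 170),
    'gre_awa': (0, 6),
    'undergrad_gpa': (0, 4),
}

def mask_out_of_range(X, ranges=VALID_RANGES):
    """Return a copy of X with out-of-range values replaced by NaN."""
    X = X.copy()
    for col, (low, high) in ranges.items():
        if col in X.columns:
            # between() is inclusive on both ends; where() keeps
            # values that pass the check and sets the rest to NaN.
            X[col] = X[col].where(X[col].between(low, high))
    return X

# Toy rows that mimic the extremes seen in the summary table.
demo = pd.DataFrame({
    'gre_quant': [165.0, 900.0, 0.0],
    'gre_awa': [4.0, 99.99, 3.5],
    'undergrad_gpa': [3.8, 9.99, 3.2],
})
cleaned = mask_out_of_range(demo)
print(cleaned)
```

A step like this would belong inside `wrangle` so that train, validation, and test sets are treated identically.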


In [0]:
train.select_dtypes(exclude='number').describe().T.sort_values(by='unique')

Unnamed: 0,count,unique,top,freq
degree,512415,2,Masters,305662
season,512415,2,Fall,503352
y_n,512415,2,False,257143
decision,512415,3,Accepted,255272
applicant_status,512415,4,International without US degree,326439
major,512415,349,"Computer Science,",326725
date_of_result,512415,1337,2018-02-08,4271
date_added,512415,1345,2018-02-08,3981
university,512415,1551,Stanford University,10837
comment,512415,8048,:(,1233


In [0]:
train.head()

Unnamed: 0,comment,major,gre_quant,degree,season,decision,gre_verbal,date_of_result,term_year,undergrad_gpa,date_added,applicant_status,university,gre_awa,y_n
1175636,"Two master's degrees, one in CS and one in CE....","Computer Science,",165.0,Masters,Fall,Accepted,170.0,2017-05-10,2017,3.4,2017-05-13,American,McGill University,5.5,True
1624806,still waiting to hear about the joint program ...,"(Computer Science,",162.0,PhD,Fall,Accepted,158.0,2017-02-06,2017,3.58,2017-02-07,American,University Of Maryland - College Park (UMD),4.5,True
2254057,"One research paper co authored, extensive expe...","Computer Science,",167.0,Masters,Fall,Accepted,163.0,2014-02-14,2014,3.88,2014-02-16,American,Brown University,5.0,True
660047,MCS. CS undergrad. 3+ years work experience. S...,"Computer Science,",170.0,Masters,Fall,Accepted,155.0,2019-02-12,2019,3.56,2019-03-07,International without US degree,Arizona State University,3.5,True
2179477,"B.E 67% (MSRIT) , MTech 76% (RVCE), TOEFL 91, ...","Computer Science,",167.0,PhD,Fall,Accepted,151.0,2016-02-26,2016,3.0,2016-03-01,International without US degree,Syracuse University,4.0,True


In [0]:
def wrangle(X):
    """Wrangle the train, validation, and test sets the same way."""

    # Copy to prevent SettingWithCopyWarning (in accordance with the lecture)
    X = X.copy()

    # Number of space-separated tokens in each comment
    X['com_word_cont'] = X['comment'].str.split(" ").str.len()

    # Return the wrangled dataframe
    return X

train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

# Create the target and feature lists
target = 'y_n'
target2 = 'decision'

# Drop both target columns so neither leaks into the features
train_features = train.drop(columns=[target, target2])
numeric_features = train_features.select_dtypes(include='number').columns.tolist()

# Keep only the categorical columns with at most 50 unique values
cardinality = train_features.select_dtypes(exclude='number').nunique()
categorical_features = cardinality[cardinality <= 50].index.tolist()
features = numeric_features + categorical_features
print(features)

['gre_quant', 'gre_verbal', 'term_year', 'undergrad_gpa', 'gre_awa', 'com_word_cont', 'degree', 'season', 'applicant_status']


In [0]:
from sklearn.pipeline import make_pipeline
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

# Create the train, validation, and test feature matrices and target vectors
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]
y_test = test[target]

# Ordinal-encode the categoricals, impute missing values, then fit a random forest
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='most_frequent'),
    RandomForestClassifier(n_estimators=350, random_state=42, n_jobs=3)
)

# Fit on train, score on val
pipeline.fit(X_train, y_train)
print('Validation Accuracy', pipeline.score(X_val, y_val))

Validation Accuracy 0.9885093361643664


In [0]:
y_pred = pipeline.predict(X_test)
# Equivalent: pipeline.score(X_test, y_test)
accuracy_score(y_test, y_pred)

0.9886030100543308

In [0]:
# Something is not right: 0.989 test accuracy against a 0.502 majority-class baseline is suspiciously high
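
One possible explanation, offered as a hypothesis rather than a finding: the training summary lists only 8048 unique comments across 512415 rows, which suggests that many rows are repeated. If identical rows land in both the training and test splits, a random forest can memorize them and the test score stops measuring generalization. The sketch below shows one way to check for this; the helper names `duplicate_fraction` and `overlap_fraction` and the toy frames are hypothetical.

```python
import pandas as pd

def duplicate_fraction(frame, cols):
    """Share of rows that repeat an earlier row on the given columns."""
    return frame.duplicated(subset=cols).mean()

def overlap_fraction(train, other, cols):
    """Share of rows in `other` whose values on `cols` also occur in `train`."""
    # Deduplicate the training keys so the left merge keeps one row
    # per row of `other`.
    train_keys = train[cols].drop_duplicates()
    merged = other[cols].merge(train_keys, on=cols, how='left', indicator=True)
    return (merged['_merge'] == 'both').mean()

# Toy frames standing in for the real train and test splits.
cols = ['gre_quant', 'degree']
toy_train = pd.DataFrame({
    'gre_quant': [165, 165, 170, 160],
    'degree': ['PhD', 'PhD', 'Masters', 'PhD'],
})
toy_test = pd.DataFrame({
    'gre_quant': [165, 150],
    'degree': ['PhD', 'Masters'],
})

train_dupes = duplicate_fraction(toy_train, cols)
test_overlap = overlap_fraction(toy_train, toy_test, cols)
print('Duplicate share in train:', train_dupes)
print('Share of test rows also in train:', test_overlap)
```

On the real data this could be called as `overlap_fraction(train, test, features + ['comment'])`. If the overlap is large, dropping duplicates with `df.drop_duplicates()` before splitting would be one thing to try.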