Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [x] If you haven't completed assignment #1, please do so first.
- [x] Continue to clean and explore your data. Make exploratory visualizations.
- [x] Fit a model. Does it beat your baseline? 
- [x] Try xgboost.
- [x] Get your model's permutation importances.

You should try to complete an initial model today, because for the rest of the week we'll be making model interpretation visualizations.

But if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 999)
PATH = '../data/dota/'

Unnamed: 0,match_id,skill
0,1971358627,1
1,1620277838,2
2,1967152662,1
3,1967118934,1
4,1991580700,1
...,...,...
132447330,2313016744,1
132447331,2313016495,1
132447332,2313016459,1
132447333,2313016395,1


In [6]:
matches_small_chunk = pd.read_csv(PATH + 'matches_small.csv', chunksize=10000)
i = 0
for chunk in matches_small_chunk:
    i += chunk.shape[0]
i

1959515

In [8]:
matches_small

<pandas.io.parsers.TextFileReader at 0x2575709ed08>

In [38]:
%%time

player_matches_small_chunk = pd.read_csv(PATH + 'player_matches_small.csv', chunksize=10000)
i = 0
games = set()
for chunk in player_matches_small_chunk:
    if i == 0:
        i += chunk.shape[0]
        df1 = chunk
        continue
    df2 = chunk
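    # Join each chunk with the previous one so that matches straddling
    # a chunk boundary are seen whole.
    bigdf = pd.concat([df1, df2])
    # Keep the match_ids that have exactly 10 rows in this two-chunk window.
    rows_per_match = bigdf.groupby('match_id').size()
    games.update(rows_per_match[rows_per_match == 10].index)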
    df1 = df2.copy()
    i += chunk.shape[0]
display(games)

{383254533,
 385351685,
 2317877256,
 380633099,
 2317877260,
 2317877261,
 2317877262,
 384827408,
 2317877265,
 378011668,
 2317877268,
 2317877269,
 2317877270,
 2317877271,
 380633120,
 2317877281,
 2317877282,
 2317877286,
 2317877287,
 2317877288,
 2317877289,
 2317877290,
 389546026,
 386924588,
 2317877301,
 2317877302,
 378011703,
 2317877304,
 2317877305,
 2317877306,
 2317877307,
 379584572,
 2317877308,
 381157429,
 382206016,
 2317877313,
 2317877314,
 2317877315,
 2317877316,
 2317877317,
 2317877318,
 2317877319,
 383254598,
 382206025,
 2317877322,
 386924620,
 2317877325,
 2317877326,
 2317877327,
 2317877328,
 389021775,
 2317877330,
 2317877331,
 378536020,
 389546062,
 379584603,
 381157469,
 380108894,
 2317877343,
 2317877344,
 2317877345,
 2317877346,
 379060325,
 2317877349,
 2317877350,
 382730343,
 387448933,
 2317877356,
 2317877357,
 380633196,
 382730350,
 2317877360,
 2317877361,
 2317877362,
 386400373,
 378011766,
 2317877374,
 385351806,
 2317877376,
 2

Wall time: 35min 52s


In [39]:
len(games)

189320
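
The cell above looks at two adjacent chunks at a time and computes the same `groupby` repeatedly, which may explain part of the 35 minute run time. A different approach, sketched here on made-up data, is to keep a running row count per `match_id` across all chunks and filter once at the end. The CSV text, the `player_slot` column, and the `EXPECTED_ROWS` name are assumptions for illustration only.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical miniature of player_matches_small.csv: match 1 has 3 rows,
# match 2 has 2 rows, and match 1 straddles a chunk boundary.
csv_text = "match_id,player_slot\n1,0\n1,1\n2,0\n1,2\n2,1\n"

rows_per_match = Counter()
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # value_counts gives rows per match_id within this chunk only;
    # Counter.update adds them to the running totals.
    rows_per_match.update(chunk['match_id'].value_counts().to_dict())

EXPECTED_ROWS = 3  # the notebook's cell uses 10 for the Dota data
complete = {m for m, n in rows_per_match.items() if n == EXPECTED_ROWS}
print(complete)
```

Unlike the two-chunk window, this counts every row of a match no matter how far apart its rows sit in the file, so the two approaches can give different sets.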

In [41]:
match_skill_chunk = pd.read_csv(PATH + 'match_skill.csv', chunksize=10000)
# Filter each chunk down to the selected games, then concatenate once at the end
# instead of growing the DataFrame inside the loop.
match_skill = pd.concat([chunk[chunk.match_id.isin(games)] for chunk in match_skill_chunk])
match_skill.shape

(61411, 2)

In [42]:
%%time
matches_small_chunk = pd.read_csv(PATH + 'matches_small.csv', chunksize=10000)
# Filter each chunk down to the selected games, then concatenate once at the end
# instead of growing the DataFrame inside the loop.
matches_small = pd.concat([chunk[chunk.match_id.isin(games)] for chunk in matches_small_chunk])
matches_small.shape

Wall time: 53.8 s


(0, 27)

In [44]:
matches_small_chunk = pd.read_csv(PATH + 'matches_small.csv', chunksize=10000)
i = 0
for chunk in matches_small_chunk:
    if i == 0:
        i += chunk.shape[0]
        matches_small = chunk
        break

In [47]:
matches_small.match_id

0       2304340261
1       2304335744
2       2304324185
3       2304339409
4       2304329004
           ...    
9995    2304335268
9996    2304337839
9997    2304333710
9998    2304343390
9999    2041687012
Name: match_id, Length: 10000, dtype: int64
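
The empty `(0, 27)` result earlier means that `isin(games)` matched no rows in any chunk of `matches_small.csv`. A quick diagnostic is to compare the ID range and the set intersection directly. This is a minimal sketch on made-up IDs; `match_ids` and the values in `games` here are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-ins for one chunk's match_id column and the `games` set.
match_ids = pd.Series([101, 102, 103, 104], name='match_id')
games = {7, 8, 103}

# IDs present in both the chunk and the set.
overlap = set(match_ids) & games

print('dtype:', match_ids.dtype)
print('chunk id range:', match_ids.min(), match_ids.max())
print('games id range:', min(games), max(games))
print('ids in common:', len(overlap))
```

If the ranges barely intersect, the two files probably cover different matches; if they do intersect but nothing is in common, a dtype mismatch (for example strings versus integers) is worth ruling out.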

In [46]:
%%time
player_matches_small_chunk = pd.read_csv(PATH + 'player_matches_small.csv', chunksize=10000)
i = 0
for chunk in player_matches_small_chunk:
    if i == 0:
        i += chunk.shape[0]
        player_matches_small = chunk[chunk.match_id.isin(games)]
        continue
    player_matches_small = pd.concat([player_matches_small, chunk[chunk.match_id.isin(games)]])
    i += chunk.shape[0]
player_matches_small.shape

MemoryError: 
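
Calling `pd.concat` inside the loop copies the growing DataFrame on every iteration, which raises peak memory use. One common alternative is to collect the filtered chunks in a list and concatenate once. This is a sketch on made-up data, with a hypothetical `kills` column; it may not be enough if the filtered rows themselves don't fit in memory.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the real CSV and the `games` set.
csv_text = "match_id,kills\n1,3\n2,5\n3,1\n1,4\n4,0\n"
games = {1, 4}

kept = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Keep only the rows whose match_id is in `games`.
    kept.append(chunk[chunk['match_id'].isin(games)])

# A single concat at the end avoids re-copying the growing frame each iteration.
filtered = pd.concat(kept, ignore_index=True)
print(filtered.shape)
```

If memory is still tight, `pd.read_csv` also accepts `usecols` and `dtype`, so loading only the needed columns with narrower types is another option.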

In [20]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 999)

from sklearn.feature_selection import f_regression, chi2, SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from category_encoders import OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import eli5
from eli5.sklearn import PermutationImportance
from xgboost import XGBClassifier

PATH = '../data/waterpumps/'

In [22]:
train_features = pd.read_csv(PATH + 'train_features.csv')
train_labels = pd.read_csv(PATH + 'train_labels.csv')
train = train_features.merge(train_labels, on='id', how='inner')
train.status_group.value_counts(normalize=True)

functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
Name: status_group, dtype: float64
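
The class frequencies above give the majority-class baseline: always predicting `functional` would be right about 54.3% of the time on this data. The same calculation on a toy label column (the values in `y` are made up):

```python
import pandas as pd

# Toy labels standing in for train['status_group'] (hypothetical data).
y = pd.Series(['functional'] * 5
              + ['non functional'] * 4
              + ['functional needs repair'])

# The majority-class baseline predicts the most frequent label for every row,
# so its accuracy equals that label's relative frequency.
majority_class = y.mode()[0]
baseline_accuracy = (y == majority_class).mean()
print(majority_class, baseline_accuracy)
```

Any fitted model's validation accuracy can then be compared against this number.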

In [23]:
def wrangle(data):
    df = data.copy()
    
    weird_mismatches = {
        5: ('Tanga', 4),
        11: ('Shinyanga', 17),
        14: ('Shinyanga', 17),
        17: ('Mwanza', 19),
        18: ('Lindi', 8),
        24: ('Arusha', 2),
        40: ('Pwani', 6),
        60: ('Pwani', 6),
        80: ('Lindi', 8),
        90: ('Mtwara', 9),
        99: ('Mtwara', 9)
    }
    for code in weird_mismatches:
        wrong_region, right_code = weird_mismatches[code]
        df.loc[(df.region_code == code) & (df.region == wrong_region), 'region_code'] = right_code
    
    # About 3% of the time, latitude has small values near zero,
    # outside Tanzania, so we'll treat these values like zero.
    df['latitude'] = df['latitude'].replace(-2e-08, 0)
    
    # When columns have zeros and shouldn't, they are like null values.
    # So we will replace the zeros with nulls, and impute missing values later.
    # Note: missingness itself may be a predictive signal, but no
    # "missing indicator" columns are created here.
    cols_with_zeros = ['longitude', 'latitude', 'construction_year', 
                       'gps_height', 'population', 'amount_tsh']
    for col in cols_with_zeros:
        df[col] = df[col].replace(0, np.nan)
    
    # Convert date_recorded to datetime
    df['date_recorded'] = pd.to_datetime(df['date_recorded'], infer_datetime_format=True)
    
    # Extract components from date_recorded, then drop the original column
    df['year_recorded'] = df['date_recorded'].dt.year
    df['month_recorded'] = df['date_recorded'].dt.month
    df['day_recorded'] = df['date_recorded'].dt.day
    df = df.drop(columns='date_recorded')
    
    # Engineer feature: how many years from construction_year to date_recorded
    df['years'] = df['year_recorded'] - df['construction_year']
    
    
    evil_dimensions = ['extraction_type_group', 'management', 'water_quality',
                       'payment', 'extraction_type', 'waterpoint_type_group',
                       'scheme_management', 'quantity_group', 'source',
                       'source_class', 'recorded_by', 'region']
        
    df = df.drop(evil_dimensions, axis=1)
    return df

train, val = train_test_split(train, random_state=0)
train = wrangle(train)
val = wrangle(val)

target = 'status_group'
X_train = train.drop(target, axis=1)
y_train = train[target]
X_val = val.drop(target, axis=1)
y_val = val[target]
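
`wrangle` replaces implausible zeros with `NaN` so they can be imputed later. An optional extension, not part of the notebook's code, is to record which values were missing before imputation hides that information. This is a minimal sketch on a toy frame; the `_MISSING` suffix is a hypothetical naming convention and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame; column names borrowed from the notebook, values are made up.
df = pd.DataFrame({'population': [0, 120, 0],
                   'gps_height': [1390, 0, 686]})

for col in ['population', 'gps_height']:
    # Treat zeros as missing values.
    df[col] = df[col].replace(0, np.nan)
    # Boolean flag recording which rows were missing (hypothetical column name).
    df[col + '_MISSING'] = df[col].isnull()

print(df)
```

Adding these columns changes the feature set, so the importances reported below would need to be recomputed.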

In [24]:
pipeline = make_pipeline(
    OrdinalEncoder(),
    IterativeImputer(random_state=0, imputation_order='descending'),
    StandardScaler()
)

X_train_transformed = pipeline.fit_transform(X_train)
X_val_transformed = pipeline.transform(X_val)
model = RandomForestClassifier(random_state=0,
                           n_jobs=-1,
                           n_estimators=100)
model.fit(X_train_transformed, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [25]:
permuter = PermutationImportance(
    model,
    scoring='accuracy',
    n_iter=5,
    random_state=42
)

permuter.fit(X_val_transformed, y_val)

PermutationImportance(cv='prefit',
                      estimator=RandomForestClassifier(bootstrap=True,
                                                       class_weight=None,
                                                       criterion='gini',
                                                       max_depth=None,
                                                       max_features='auto',
                                                       max_leaf_nodes=None,
                                                       min_impurity_decrease=0.0,
                                                       min_impurity_split=None,
                                                       min_samples_leaf=1,
                                                       min_samples_split=2,
                                                       min_weight_fraction_leaf=0.0,
                                                       n_estimators=100,
                                                     

In [26]:
feature_names = X_val.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values()

wpt_name                -0.002721
id                      -0.001616
num_private             -0.000202
day_recorded             0.000471
year_recorded            0.000539
district_code            0.000727
basin                    0.000781
amount_tsh               0.000983
month_recorded           0.001104
subvillage               0.001226
region_code              0.001226
ward                     0.001306
management_group         0.001414
public_meeting           0.001522
permit                   0.001522
quality_group            0.001926
funder                   0.002626
installer                0.002747
gps_height               0.003259
years                    0.003488
scheme_name              0.004202
latitude                 0.006869
lga                      0.007057
source_type              0.007084
construction_year        0.008175
population               0.009212
longitude                0.009226
payment_type             0.022101
waterpoint_type          0.031650
extraction_typ

In [27]:
eli5.show_weights(
    permuter,
    top=None,
    feature_names=feature_names
)

Weight,Feature
0.1014  ± 0.0057,quantity
0.0362  ± 0.0032,extraction_type_class
0.0316  ± 0.0036,waterpoint_type
0.0221  ± 0.0019,payment_type
0.0092  ± 0.0015,longitude
0.0092  ± 0.0029,population
0.0082  ± 0.0023,construction_year
0.0071  ± 0.0015,source_type
0.0071  ± 0.0009,lga
0.0069  ± 0.0028,latitude


In [33]:
# Only at this point did I realize I chose the same dataset as the lecture...

model = XGBClassifier(
    n_estimators=750,   # upper bound on trees; early stopping decides the actual number
    max_depth=10,       # try deeper trees because of high cardinality categoricals
    learning_rate=0.75, # try higher learning rate
    n_jobs=-1
)

eval_set = [(X_train_transformed, y_train), 
            (X_val_transformed, y_val)]

model.fit(X_train_transformed, y_train, 
          eval_set=eval_set, 
          eval_metric='merror', 
          early_stopping_rounds=50) # Stop if the score hasn't improved in 50 rounds

XGBoostError: [13:39:47] src/metric/metric.cc:23: Unknown metric function precision

In [32]:
model.score(X_val_transformed, y_val)

# Well that's just worse than my random forest one... I think this requires more than 5 minutes of testing...
# To be revisited in the next assignment

0.788956228956229
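
The XGBoost cell sets a large `n_estimators` and relies on early stopping to pick the actual number of trees. The same idea is sketched below with scikit-learn's `GradientBoostingClassifier` as a stand-in, not XGBoost itself. It stops on an internal validation split rather than an explicit `eval_set`, and the data and parameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (made up for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.1,
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    validation_fraction=0.2,  # internal hold-out carved from the training data
    random_state=0,
)
gb.fit(X_train, y_train)

# n_estimators_ is the number of rounds actually fitted.
print(gb.n_estimators_, gb.score(X_val, y_val))
```

A lower learning rate usually needs more rounds, which is why a high cap paired with early stopping is a common combination. For comparison, the cell above uses 0.75, while XGBoost's default is 0.3.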