Some reminders about xgboost:
* Runs on the CPU by default
* Has more parameters to tune than the scikit-learn models
* Highly performant
* Gradient-boosted trees
* Handles missing values natively, so you don't have to impute them first
* Very resilient at ignoring irrelevant features

Working through the Kickstarter dataset

Best practice: check your test score and validation score once at the beginning to understand the relationship between the two. After that, don't use the test score again until the very end.

## Stratify when doing classification

For classification problems, it's very important to take **stratified** samples when you do your train-test split, especially if the target condition occurs relatively infrequently. This is particularly common in medical diagnosis problems. If you don't stratify, you risk one of the splits ending up with very few (or none) of the "1s".
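A minimal sketch of what stratifying does, assuming a made-up target where only 5% of the rows are "1s". The arrays here are invented for illustration and are not the Kickstarter data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 950 zeros and 50 ones (5% positives)
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With stratify=y, both splits keep roughly the same share of 1s
train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

Without `stratify`, the share of 1s in each split is left to chance, which matters most when the positive class is rare.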

## Adding features using summary statistics 

For example: how different is this observed value from the average observed value? How many standard deviations (SDs) away is it?

**Grouping** is often relevant. For example, for the Kickstarter campaigns, the average value raised in the "Technology" category may be quite far from the average in the "Haberdashery" category, so you'd want to take the average within each group rather than across the whole dataset.

* **Pro-Tip** - when doing the groupings, be careful to compute them only on the training set and not on the whole dataset. Otherwise information from the validation and test rows "leaks" into your training features, which is a common source of data leakage. In class, Jonathan does it by grabbing the ids of the training set rows into a variable, then passing that variable into iloc. This is clever.
* After creating the grouping, we left join it to the ENTIRE main dataset with df.merge(), so every row gets the statistics that were computed from the training rows.
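The steps above can be sketched roughly as follows. This is an illustration on an invented six-row frame, not the code from class: the `train_idx` positions, the choice of the `goal` column, and the new column names `cat_goal_mean`, `cat_goal_std`, and `goal_z` are all assumptions.

```python
import pandas as pd

# Toy stand-in for the Kickstarter data (values are invented)
df = pd.DataFrame({
    "main_category": ["Technology", "Technology", "Technology",
                      "Music", "Music", "Music"],
    "goal": [10000.0, 30000.0, 50000.0, 1000.0, 3000.0, 2000.0],
})
train_idx = [0, 1, 3, 4]  # hypothetical row positions of the training set

# 1. Compute the group statistics on the training rows only
group_stats = (
    df.iloc[train_idx]
      .groupby("main_category")["goal"]
      .agg(cat_goal_mean="mean", cat_goal_std="std")
      .reset_index()
)

# 2. Left join the statistics onto the ENTIRE dataset
df = df.merge(group_stats, on="main_category", how="left")

# 3. How many SDs is each goal from its category's training average?
df["goal_z"] = (df["goal"] - df["cat_goal_mean"]) / df["cat_goal_std"]
print(df)
```

Rows 2 and 5 play the role of held-out rows: they receive the category averages but did not contribute to them.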

Often you'll want the probabilities (0.65, 0.22, ...), instead of just the classification labels (0,1,0,0). 

For that, use **.predict_proba()** instead of just .predict()
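A small sketch of the difference between the two methods. `LogisticRegression` on synthetic data is only a lightweight stand-in here; `XGBClassifier` follows the same scikit-learn interface, so the same calls should apply to it. The 0.3 cutoff is a hypothetical value chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (stand-in data, not the Kickstarter set)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns one column per class: [P(class 0), P(class 1)]
proba = clf.predict_proba(X)
p_one = proba[:, 1]

# predict picks the more probable class, which for two classes
# amounts to cutting P(class 1) at 0.5
labels_default = clf.predict(X)
labels_at_half = (p_one > 0.5).astype(int)

# With the probabilities in hand you can apply your own cutoff instead
threshold = 0.3  # hypothetical cutoff
labels_custom = (p_one >= threshold).astype(int)
print(proba[:3], labels_default[:3], labels_custom[:3])
```

Lowering the cutoff labels more rows as 1, trading extra false positives for fewer missed positives.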

### Classification questions 

1. Where do you set the cutoff value for classification? Is the default 0.5? 
2. I'd like to use the cars dataset 

In [1]:
# do the imports
import pandas as pd
import numpy as np
import category_encoders as ce
from xgboost import XGBClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
# suppress warning messages
import warnings
warnings.filterwarnings("ignore")

In [10]:
df = pd.read_csv(r"https://raw.githubusercontent.com/JonathanBechtel/dat-02-22/main/ClassMaterial/Unit3/data/ks2.csv", parse_dates = ['deadline', 'launched'])

In [11]:
df

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,country,goal
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,2015-08-11 12:12:28,0,GB,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,2017-09-02 04:43:57,0,US,30000.00
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,2013-01-12 00:20:50,0,US,45000.00
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,2012-03-17 03:24:11,0,US,5000.00
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,2015-07-04 08:35:03,0,US,19500.00
...,...,...,...,...,...,...,...,...,...,...
370449,999976400,ChknTruk Nationwide Charity Drive 2014 (Canceled),Documentary,Film & Video,USD,2014-10-17,2014-09-17 02:35:30,0,US,50000.00
370450,999977640,The Tribe,Narrative Film,Film & Video,USD,2011-07-19,2011-06-22 03:35:14,0,US,1500.00
370451,999986353,Walls of Remedy- New lesbian Romantic Comedy f...,Narrative Film,Film & Video,USD,2010-08-16,2010-07-01 19:40:30,0,US,15000.00
370452,999987933,BioDefense Education Kit,Technology,Technology,USD,2016-02-13,2016-01-13 18:13:53,0,US,15000.00


In [3]:
# a variation of what we did previously -- gives us option of getting training / validation / test scores
# in a single function
def get_model_scores(mod, X_train, y_train, X_test, y_test, val_score = True, test_score=False):
    if val_score:
        X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, 
                                                          test_size = 0.2, 
                                                          stratify = y_train, 
                                                          random_state= 42)
 
    mod.fit(X_train, y_train)
    
    results = {}
    
    results['train_score'] = mod.score(X_train, y_train)
    if val_score:
        results['val_score'] = mod.score(X_val, y_val)
        
    if test_score:
        results['test_score'] = mod.score(X_test, y_test)
        
    return results

In [4]:
# helper functions to aid in the process
def split_data(df, split_frac=0.2, random_state=42):
    df = df.drop(['deadline', 'launched'], axis = 1)
    X  = df.drop('state', axis=1)
    y  = df['state']
    # notice the use of 'stratify' -- keeps the class proportions of y the same in train + test
    return train_test_split(X, y, test_size = split_frac, stratify = y, random_state = random_state)

# helper function to pull out feature_importances_ from the last step of the pipeline
def get_feature_importances(pipe, X_train, onehot=False):
    if onehot:
        X_train = pipe[0].transform(X_train)
        X_train = pipe[1].transform(X_train)
    return pd.DataFrame({
        'Col': X_train.columns,
        'Importance': pipe[-1].feature_importances_
    }).sort_values(by='Importance', ascending=False)

In [6]:
split_data(df)

[                ID                                               name  \
 317618   723318017                                   City Reflections   
 181611  1944996418  Nothin' To Lose Entertainment Album & Album Re...   
 195846  2019366381  The Real African History:  As It Relates To Th...   
 37117   1192319792                            Ameritocracy: Card Game   
 95025   1493261104                          $1 + 10K PEOPLE CHALLENGE   
 ...            ...                                                ...   
 134725  1699214470  Making a horror movie about finding the Founta...   
 106564  1553430656                                    The Nightingale   
 35971   1186439102                          Double D string stretcher   
 6711    1034862045     Visions of Fantastic Realms - Calendar Project   
 106672  1553980675                      James for PCT: We Go Together   
 
               category main_category currency country      goal  
 317618     Photography   Photography      

In [7]:
mod = XGBClassifier()

In [None]:
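from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement
# assumes `pipe` is already fitted and X_val, y_val are a held-out validation split
ConfusionMatrixDisplay.from_estimator(pipe, X_val, y_val,
                                      cmap=plt.cm.Blues,
                                      normalize='true');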

In [None]:
# parameter search
estimators = [100, 200, 300, 400]
max_depth  = [3, 4]
sub_sample = [0.8, 0.6] # fraction of the training rows randomly sampled for each boosting round
learning_rate = [0.1, 0.2]
cv_scores  = []
# do a training loop
for estimator in estimators:
    for depth in max_depth:
        for sample in sub_sample:
            for rate in learning_rate:
                print(f"Fitting new training loop for rounds: {estimator}, depth: {depth}, sampling rate: {sample}, rate: {rate}")
                pipe[-1].set_params(n_estimators = estimator,
                                    max_depth = depth,
                                    subsample = sample,
                                    learning_rate = rate)
                scores = get_model_scores(pipe, X_train, y_train, X_test, y_test)
                cv_scores.append((scores['train_score'], scores['val_score'], estimator, depth, sample, rate))
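One possible way to read the results of a loop like this is to load the tuples into a DataFrame and sort by validation score. The rows below are made-up numbers in the same tuple layout as `cv_scores`, and the column names and the `gap` column are assumptions added for illustration.

```python
import pandas as pd

# Made-up rows in the same layout as the tuples appended to cv_scores:
# (train_score, val_score, n_estimators, max_depth, subsample, learning_rate)
cv_scores = [
    (0.71, 0.68, 100, 3, 0.8, 0.1),
    (0.75, 0.69, 200, 4, 0.8, 0.1),
    (0.80, 0.67, 400, 4, 0.6, 0.2),
]

results = pd.DataFrame(
    cv_scores,
    columns=["train_score", "val_score", "n_estimators",
             "max_depth", "subsample", "learning_rate"],
)

# A large train/validation gap is a hint of overfitting
results["gap"] = results["train_score"] - results["val_score"]

# Rank the parameter combinations by validation score, best first
results = results.sort_values("val_score", ascending=False).reset_index(drop=True)
best = results.iloc[0]
print(results)
```

Picking by validation score keeps the test set untouched until the very end.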