# Homework 2 part b
**Instructions:**
- Submit your code to github by the deadline.
- DO NOT change paths (-3 points).
- DO NOT submit data to github (-2 points).

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

## Problem 1: log loss (3 points)
**(a)** You have a classification problem with 3 classes: "cat", "dog", "bird". For your test observation is a "dog". Your model give you the following prediction for that observation (0.1, 0.5, 0.4). What is the accuracy? What is the log loss?

**(b)** Suppose that you are submitting to a Kaggle competition. You are solving a binary classification task being evaluated by log loss metric. You suspect train and test target distributions are different, you submit a constant predition of 0.3 and to the public LB and get a score of 1.01. Mean of target variable in train is 0.44. What is the mean of target variable in public part of test data? Explain how you derive the results.

**(c)** You have a set of images that are tagged with labels. Each image may be tagged with more than one label. You have a total of 4 labels. Here is the soft prediction for one of your test images (0.1, 0.9, 0.8, 0.3). What is the hard prediction if you are using a threshold of 0.7? What is the loss if your real label is (0, 1, 0, 0)?

# here your code and your solution
### a 
The accuracy is 1.

The log loss = $-log(1-0.5) = 0.6931$

#### b
Denote the mean of test data as $m$

then we have $N\cdot m$ 1s and $N\cdot{(1- m)}$ 0s in the test data.

${logloss} = -\frac{1}{N} \cdot{\sum_{y_i=0}}log(1-0.3)-\frac{1}{N} \cdot{\sum_{y_i=1}}log(0.3) = 1.01 $

$-(1-m) \cdot log(0.7) - m \cdot log(0.3) = 1.01$

$m = 0.771067$

Therefore, the distribution of the test data are different from the training data.

#### c
The hard prediction is (0, 1, 1, 1)

The log loss is 

$\frac {log(1-0.1) + log(0.9) + log(1-0.8) + log(1-0.3)}{-4} = 0.544$

## Problem 2: AUC (2 points)
Compute AUC score by hand with the formula explained in class for the following dataset.

In [2]:
d = pd.DataFrame({
        'prediction': [0.1, 0.5, 0.95, 0.99, 0.8, 0.4, 0.03, 0.44, 0.2],
        'y': [1, 0, 1, 1, 1, 1, 0, 0, 0]})
d

Unnamed: 0,prediction,y
0,0.1,1
1,0.5,0
2,0.95,1
3,0.99,1
4,0.8,1
5,0.4,1
6,0.03,0
7,0.44,0
8,0.2,0


In [3]:
# your solution here
d = d.sort_values(by='prediction')
d

Unnamed: 0,prediction,y
6,0.03,0
0,0.1,1
8,0.2,0
5,0.4,1
7,0.44,0
1,0.5,0
4,0.8,1
2,0.95,1
3,0.99,1


There are $4 \times 5 = 20$ pairs, among which there are $3+2=5$ pairs in the incorrect order.

Therefore, AUC = $(20 - 5)/20 = 0.75$

## Problem 3: Regularized mean (target) encoding for Avazu competition (8 points)

For this problem you will implement a version of regularized mean encoding. We will be using the data on this Kaggle [compettion](https://www.kaggle.com/c/avazu-ctr-prediction). You can use the kaggle api (install first). 

`kaggle competitions download -c avazu-ctr-prediction`

**Instructions:**
- Split data (training) into training and validation. Take the lastest (most recent data) of training set as validation.
- Implement regularized mean encoding for the training set using pandas.
- Implement mean encoding for the validation set

In [82]:
## Split train and validation 
# get sample data first
PATH = Path("avazu-ctr-prediction")
!head -100000 $PATH/train > $PATH/train_sample.csv
!head -100000 $PATH/test > $PATH/test_sample.csv
data = pd.read_csv(PATH/"train_sample.csv")
test = pd.read_csv(PATH/"test_sample.csv")

In [83]:
def split_based_hour(data):
    """ Split data based on column hour.
    
    Use 20% of the date for validation.
    Inputs:
       data: dataframe from avazu
    Returns:
       train:
       val: 20% of the largest values of column "hour".
    """
    N = int(0.8*len(data))
    data = data.sort_values(by="hour")
    train = data[:N].copy()
    val = data[N:].copy()
    return train.reset_index(), val.reset_index()
train, val = split_based_hour(data)

### Regularized mean encoding 
Here is how you do mean encoding without regularization.

In [84]:
# Calculate a mapping: {device_type: click_mean}
mean_device_type = train.groupby('device_type').click.mean()
mean_device_type

device_type
0    0.224277
1    0.176116
4    0.069777
5    0.083333
Name: click, dtype: float64

In [85]:
# This is the global click mean
global_mean = train.click.mean()
global_mean

0.17477718471480894

In [86]:
train["device_type_mean_enc"] = train["device_type"].map(mean_device_type)
val["device_type_mean_enc"] = val["device_type"].map(mean_device_type)

In [87]:
train["device_type_mean_enc"].fillna(global_mean, inplace=True)
val["device_type_mean_enc"].fillna(global_mean, inplace=True)

In [88]:
# Print correlation
encoded_feature = val["device_type_mean_enc"].values
print(np.corrcoef(val["click"].values, encoded_feature)[0][1])

0.0530389229998215


To do mean encoding with K-fold regularization you do the following:

* Run a 5-fold split on train data where `mean_device_type` is computed on 4/5 of the data and the encoding is computed on the other 1/5.
* To compute mean encoding on the validation data use the code similar to encoding without regularization. That is compute on all the training data and apply to the validation set.

In [93]:
from sklearn.model_selection import KFold

def reg_target_encoding(train, col = "device_type", splits=5):
    """ Computes regularize mean encoding.
    Inputs:
       train: training dataframe
       
    """
    ### BEGIN SOLUTION
    name_new = col+"_mean_enc"
    kf = KFold(n_splits=splits)
    for train_index, valid_index in kf.split(train):
        X_train, X_valid = train.loc[train_index,:], train.loc[valid_index,:]    
        mean_col_value = X_train.groupby(col).click.mean()
        global_mean = X_train.click.mean()
        # renew the value
        train.loc[valid_index,name_new] = train.loc[valid_index,][col].map(mean_col_value)
        train.loc[valid_index,name_new].fillna(global_mean, inplace=True)
    ### END SOLUTION

In [94]:
reg_target_encoding(train) 
encoded_feature = train["device_type_mean_enc"].values
corr = np.corrcoef(train["click"].values, encoded_feature)[0][1]
print(corr)
assert(np.around(corr, decimals=4) == 0.0551)

0.05508074566359165


In [97]:
def mean_encoding_test(test, train, col = "device_type"):
    """ Computes target enconding for test data.
    
    This is similar to how we do validation
    """
    ### BEGIN SOLUTION
    name_new = col+"_mean_enc"
    mean_col_value = train.groupby(col).click.mean()
    global_mean = train.click.mean()
    test[name_new] = test[col].map(mean_col_value)
    test[name_new].fillna(global_mean, inplace=True)
    ### END SOLUTION

In [98]:
mean_encoding_test(test, train) 
encoded_feature_mean = test["device_type_mean_enc"].values.mean()
assert(np.around(encoded_feature_mean, decimals=4) == 0.177)

## Problem 4: Implement other features and fit a model (7 points)
* Implement a few more features, include:
   * day of the week and hour
   * mean encoding of some other features (at least two)
   * use plots and `value_counts()` to understand the data

* Fit a random forest (to the whole dataset)
   * Do hyperparameter tunning using your validation set
   * Report validation and train log loss

In [12]:
import datetime, time
import feather
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

### EDA with sample data

In [175]:
## Split train and validation 
# get sample data first
PATH = Path("avazu-ctr-prediction")
!head -1000000 $PATH/train > $PATH/train_sample.csv
!head -1000000 $PATH/test > $PATH/test_sample.csv
train0 = pd.read_csv(PATH/"train_sample.csv")
# test0 = pd.read_csv(PATH/"test_sample.csv")

In [161]:
print(train0.columns)
print(train0.hour.value_counts())

Index(['id', 'click', 'hour', 'C1', 'banner_pos', 'site_id', 'site_domain',
       'site_category', 'app_id', 'app_domain', 'app_category', 'device_id',
       'device_ip', 'device_model', 'device_type', 'device_conn_type', 'C14',
       'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21'],
      dtype='object')
14102104    264711
14102102    207471
14102103    193355
14102101    137442
14102100    119006
14102105     78014
Name: hour, dtype: int64


In [107]:
train0.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999999 entries, 0 to 999998
Data columns (total 24 columns):
id                  999999 non-null float64
click               999999 non-null int64
hour                999999 non-null int64
C1                  999999 non-null int64
banner_pos          999999 non-null int64
site_id             999999 non-null object
site_domain         999999 non-null object
site_category       999999 non-null object
app_id              999999 non-null object
app_domain          999999 non-null object
app_category        999999 non-null object
device_id           999999 non-null object
device_ip           999999 non-null object
device_model        999999 non-null object
device_type         999999 non-null int64
device_conn_type    999999 non-null int64
C14                 999999 non-null int64
C15                 999999 non-null int64
C16                 999999 non-null int64
C17                 999999 non-null int64
C18                 999999 non-null in

In [108]:
train0.describe()

Unnamed: 0,id,click,hour,C1,banner_pos,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
count,999999.0,999999.0,999999.0,999999.0,999999.0,999999.0,999999.0,999999.0,999999.0,999999.0,999999.0,999999.0,999999.0,999999.0,999999.0
mean,9.376309e+18,0.160219,14102100.0,1005.088166,0.229922,1.02554,0.22336,18262.199732,318.965807,56.495552,2041.030657,1.45226,190.779412,45505.902746,69.936165
std,5.236908e+18,0.366809,1.493255,1.156928,0.464627,0.453899,0.667159,3510.366393,19.452907,36.546962,441.200951,1.362637,273.439422,49843.814296,38.513837
min,9984920000000.0,0.0,14102100.0,1001.0,0.0,0.0,0.0,375.0,120.0,20.0,112.0,0.0,33.0,-1.0,13.0
25%,4.84666e+18,0.0,14102100.0,1005.0,0.0,1.0,0.0,15707.0,320.0,50.0,1722.0,0.0,35.0,-1.0,43.0
50%,9.834382e+18,0.0,14102100.0,1005.0,0.0,1.0,0.0,19251.0,320.0,50.0,2161.0,1.0,39.0,-1.0,61.0
75%,1.373053e+19,0.0,14102100.0,1005.0,0.0,1.0,0.0,21153.0,320.0,50.0,2420.0,3.0,297.0,100084.0,79.0
max,1.84467e+19,1.0,14102100.0,1012.0,7.0,5.0,5.0,21705.0,1024.0,1024.0,2497.0,3.0,1835.0,100248.0,195.0


In [122]:
# Clean the data.
types = {'id': np.uint32, 'click': np.uint8, 'hour': np.uint32, 'C1': np.uint32, 'banner_pos': np.uint32,
         'site_id': 'category', 'site_domain': 'category', 'site_category': 'category', 'app_id': 'category',
         'app_domain': 'category', 'app_category': 'category', 'device_id': 'category',
         'device_ip': 'category', 'device_model': 'category', 'device_type': np.uint8, 'device_conn_type': np.uint8,
         'C14': np.uint16, 'C15': np.uint16, 'C16': np.uint16, 'C17': np.uint16, 'C18': np.uint16, 'C19': np.uint16,
         'C20': np.uint16, 'C21': np.uint16}

types_test = {'id': np.uint32, 'hour': np.uint32, 'C1': np.uint32, 'banner_pos': np.uint32,
         'site_id': 'category', 'site_domain': 'category', 'site_category': 'category', 'app_id': 'category',
         'app_domain': 'category', 'app_category': 'category', 'device_id': 'category',
         'device_ip': 'category', 'device_model': 'category', 'device_type': np.uint8, 'device_conn_type': np.uint8,
         'C14': np.uint16, 'C15': np.uint16, 'C16': np.uint16, 'C17': np.uint16, 'C18': np.uint16, 'C19': np.uint16,
         'C20': np.uint16, 'C21': np.uint16}

train0 = pd.read_csv(PATH/"train_sample.csv", usecols=types.keys(), dtype=types)
# test0 = pd.read_csv(PATH/"test_sample.csv", usecols=types_test.keys(), dtype=types_test)
print(train0.info(memory_usage='deep'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999999 entries, 0 to 999998
Data columns (total 24 columns):
id                  999999 non-null uint32
click               999999 non-null uint8
hour                999999 non-null uint32
C1                  999999 non-null uint32
banner_pos          999999 non-null uint32
site_id             999999 non-null category
site_domain         999999 non-null category
site_category       999999 non-null category
app_id              999999 non-null category
app_domain          999999 non-null category
app_category        999999 non-null category
device_id           999999 non-null category
device_ip           999999 non-null category
device_model        999999 non-null category
device_type         999999 non-null uint8
device_conn_type    999999 non-null uint8
C14                 999999 non-null uint16
C15                 999999 non-null uint16
C16                 999999 non-null uint16
C17                 999999 non-null uint16
C18           

### Extract dayOfWeek, hour and perform mean encoding

In [196]:
# initialize the toy train/test set.
train1 = train0.copy()
traink, validk = split_based_hour(train1)
# testk = test0.copy()

In [177]:
def extract_dates_and_mean_encode(df, cols=["device_type", "banner_pos", "device_conn_type"]):
    """
    Perform mean encoding for the interested columns.
    Then extract the date information.
    Finally drop the useless columns.
    """
    if 'dayofweek' in df.columns and 'device_type_mean_enc' in df.columns:
        print('All set.')
        return df
    
    for col in cols:
        reg_target_encoding(df, col=col)

    df['datetime'] = pd.to_datetime(df['hour'], format='%y%m%d%H', errors='ignore')
    df['dayofweek'] = df['datetime'].dt.dayofweek
    df['hour'] = df['datetime'].dt.hour
    df = df.drop('datetime', axis=1)
    
    return df

In [204]:
def do_the_same_to_valid(train, valid, cols=["device_type", "banner_pos", "device_conn_type"]):
    if 'dayofweek' in valid.columns and 'device_type_mean_enc' in valid.columns:
        print('All set.')
        return valid
    
    for col in cols:
        mean_encoding_test(valid, train, col)
        
    valid['datetime'] = pd.to_datetime(valid['hour'], format='%y%m%d%H', errors='ignore')
    valid['dayofweek'] = valid['datetime'].dt.dayofweek
    valid['hour'] = valid['datetime'].dt.hour
    valid = valid.drop('datetime', axis=1)
    return valid


from sklearn.preprocessing import OneHotEncoder
hot = OneHotEncoder(handle_unknown='ignore')
def label_encode(train, valid):
    cols_to_label = ['site_id', 'site_domain', 'site_category', 'app_id',
                 'app_domain', 'app_category', 'device_id', 'device_ip', 'device_model']
    for col in cols_to_label:
        X = hot.fit_transform(train[col].values.reshape(-1,1)).toarray()
        dfOneHot = pd.DataFrame(X, columns = [col+"_"+str(int(i)) for i in range(X.shape[1])])
        train = pd.concat([train, dfOneHot], axis=1)
        X = hot.transform(valid[col].values.reshape(-1,1)).toarray()
        dfOneHot = pd.DataFrame(X, columns = [col+"_"+str(int(i)) for i in range(X.shape[1])])
        valid = pd.concat([valid, dfOneHot], axis=1)
    return train, valid

In [208]:
traink = extract_dates_and_mean_encode(traink)

All set.


In [209]:
validk = do_the_same_to_valid(traink, validk)

All set.


In [211]:
#  traink, validk = label_encode(traink, validk)

In [213]:
def featureAndLabel(df):
    cols_to_label = ['site_id', 'site_domain', 'site_category', 'app_id',
                 'app_domain', 'app_category', 'device_id', 'device_ip', 'device_model']
    evil = ['index', 'id']+cols_to_label
    df = df.drop(evil, axis=1)
    y = df['click']
    X = df.drop(['click'], axis=1)
    return X, y

In [214]:
Xt, yt = featureAndLabel(traink)
Xv, yv = featureAndLabel(validk)

In [215]:
Xt

Unnamed: 0,hour,C1,banner_pos,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21,device_type_mean_enc,banner_pos_mean_enc,device_conn_type_mean_enc,dayofweek
0,0,1005,0,1,2,15706,320,50,1722,0,35,-1,79,0.160044,0.151323,0.114499,1
1,0,1010,1,4,0,21665,320,50,2493,3,35,-1,117,0.075583,0.191974,0.165967,1
2,0,1005,0,1,0,15707,320,50,1722,0,35,-1,79,0.160044,0.151323,0.165967,1
3,0,1005,1,1,0,20751,320,50,1895,0,681,100028,101,0.160044,0.191974,0.165967,1
4,0,1005,0,1,0,16920,320,50,1899,0,431,100075,117,0.160044,0.151323,0.165967,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
799994,4,1005,0,1,0,20003,320,50,2282,3,35,-1,117,0.165773,0.157090,0.170598,1
799995,4,1005,0,1,0,21647,320,50,2487,1,547,-1,51,0.165773,0.157090,0.170598,1
799996,4,1005,0,1,0,15708,320,50,1722,0,35,-1,79,0.165773,0.157090,0.170598,1
799997,4,1005,0,1,0,15705,320,50,1722,0,35,-1,79,0.165773,0.157090,0.170598,1


In [230]:
def search_para():
    for max_depth in [4, 5, 6, 8, 10]:
        rf = RandomForestClassifier(n_estimators=100
                                    , max_depth=max_depth
        #                             , class_weight='balanced'
                                    , n_jobs = -1)
        rf.fit(Xt, yt)
        print("Training set accuracy:", rf.score(Xt, yt))
        print("Validation set accuracy:", rf.score(Xv, yv))
        ytpred = rf.predict_proba(Xt)
        yvpred = rf.predict_proba(Xv)
        print("max_depth:", max_depth)
        print("Training set log loss:", log_loss(yt, ytpred))
        print("Validation set log loss:", log_loss(yv, yvpred))
        print("\n")

In [231]:
search_para()

Training set accuracy: 0.8369722962153703
Validation set accuracy: 0.851015
max_depth: 4
Training set log loss: 0.4103056970639195
Validation set log loss: 0.39169623357880096


Training set accuracy: 0.8383572979466224
Validation set accuracy: 0.85154
max_depth: 5
Training set log loss: 0.4080902155158526
Validation set log loss: 0.38967980753399417


Training set accuracy: 0.8388397985497482
Validation set accuracy: 0.851665
max_depth: 6
Training set log loss: 0.4060513064830593
Validation set log loss: 0.3878176749497941


Training set accuracy: 0.8391835489794363
Validation set accuracy: 0.85193
max_depth: 8
Training set log loss: 0.4030321894824807
Validation set log loss: 0.38498688353980237


Training set accuracy: 0.8397510496888121
Validation set accuracy: 0.85214
max_depth: 10
Training set log loss: 0.3998703517577298
Validation set log loss: 0.38229128782356014




We can conclude that higher max_depth of trees give us higher accuracy and lower log loss.

### Do the real stuff

In [234]:
realdata = pd.read_csv(PATH/"train")
trainr, validr = split_based_hour(realdata)
trainr = extract_dates_and_mean_encode(trainr)
validr = do_the_same_to_valid(trainr, validr)
Xtr, ytr = featureAndLabel(trainr)
Xvr, yvr = featureAndLabel(validr)

In [235]:
rf = RandomForestClassifier(n_estimators=100, max_depth=max_depth  # , class_weight='balanced'
                            , n_jobs=-1)
rf.fit(Xt, yt)
print("Training set accuracy:", rf.score(Xt, yt))
print("Validation set accuracy:", rf.score(Xv, yv))
ytpred = rf.predict_proba(Xt)
yvpred = rf.predict_proba(Xv)
print("max_depth:", max_depth)
print("Training set log loss:", log_loss(yt, ytpred))
print("Validation set log loss:", log_loss(yv, yvpred))

Training set accuracy: 0.8397522996903746
Validation set accuracy: 0.85215
Training set log loss: 0.3998168818775622
Validation set log loss: 0.3824445543306857


Training log loss is 0.3998, and validation log logss is 0.3824.

----