# Diplodatos Kaggle Competition

We present this piece of code to create the baseline for the competition, and as an example of how to deal with this kind of problem. The main goals are that you:

1. Learn
1. Try different models and see which one best fits the given data
1. Get a higher score than the one given by the current baseline example
1. Try to get the highest score in the class :)

In [1]:
# Import the required packages
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Read the *original* dataset...

For this competition, you are tasked with categorizing shopping trip types based on the items that customers purchased. To give a few hypothetical examples of trip types: a customer may make a small daily dinner trip, a weekly large grocery trip, a trip to buy gifts for an upcoming holiday, or a seasonal trip to buy clothes.

Walmart has categorized the trips contained in this data into 38 distinct types using a proprietary method applied to an extended set of data. You are challenged to recreate this categorization/clustering with a more limited set of features. This could provide new and more robust ways to categorize trips.

The training set (train.csv) contains a large number of customer visits with the TripType included. You must predict the TripType for each customer visit in the test set (test.csv). Each visit may only have one TripType. You will not be provided with more information than what is given in the data (e.g. what the TripTypes represent or more product information).

The test set file is encrypted. You must complete this brief survey to receive the password.

Data fields:

- TripType: a categorical id representing the type of shopping trip the customer made. This is the ground truth that you are predicting. TripType_999 is an "other" category.
- VisitNumber: an id corresponding to a single trip by a single customer
- Weekday: the weekday of the trip
- Upc: the UPC number of the product purchased
- ScanCount: the number of the given item that was purchased. A negative value indicates a product return.
- DepartmentDescription: a high-level description of the item's department
- FinelineNumber: a more refined category for each of the products, created by Walmart
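
To make the field descriptions concrete, here is a minimal sketch of how the rows relate to visits. The rows are invented for illustration and are not taken from train.csv; only the column names come from the list above.

```python
import pandas as pd

# Hypothetical rows that follow the field descriptions above.
toy = pd.DataFrame({
    "TripType": [999, 30, 30, 30],
    "VisitNumber": [5, 7, 7, 7],
    "Weekday": ["Friday", "Friday", "Friday", "Friday"],
    "Upc": [68113152929, 60538815980, 7410811099, 7410811099],
    "ScanCount": [-1, 1, 2, -1],
    "DepartmentDescription": [
        "FINANCIAL SERVICES", "SHOES", "PERSONAL CARE", "PERSONAL CARE",
    ],
    "FinelineNumber": [1000, 8931, 4504, 4504],
})

# There is one row per scanned product, so a visit can span several rows.
rows_per_visit = toy.groupby("VisitNumber").size()

# Net number of items per visit; returns count negatively.
net_items = toy.groupby("VisitNumber")["ScanCount"].sum()

print(rows_per_visit.to_dict())
print(net_items.to_dict())
```

Because the target is defined per visit, any model input has to be aggregated from these product rows to one row per VisitNumber.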

In [2]:
def transform_data(train_data_fname, test_data_fname):
    df_train = pd.read_csv(train_data_fname)
    df_train['is_train_set'] = 1
    df_test = pd.read_csv(test_data_fname)
    df_test['is_train_set'] = 0

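    # we get the TripType for the train set. Each visit has a single TripType, so
    # after grouping by VisitNumber the mean (or max or min) recovers it.
    # The column is selected first so that non-numeric columns are not aggregated.
    y = df_train.groupby("VisitNumber", as_index=False)["TripType"].mean().TripType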
    
    # we remove the TripType now, and concat training and testing data
    # the concat is done so that we have the same columns for both datasets
    # after one-hot encoding
    df_train = df_train.drop("TripType", axis=1)
    df = pd.concat([df_train, df_test])
    
    # drop the columns this baseline does not use (they may still carry useful signal)
    df = df.drop(["Upc", "FinelineNumber"], axis=1)

    # one-hot encoding for the DepartmentDescription
    df = pd.get_dummies(df, columns=["DepartmentDescription"], dummy_na=True)

    # aggregate to one row per visit; the sum also turns is_train_set into the
    # number of rows in the visit (0 for test visits)
    df = df.groupby(["VisitNumber", "Weekday"], as_index=False).sum()
    
    # finally, we do one-hot encoding for the Weekday
    df = pd.get_dummies(df, columns=["Weekday"], dummy_na=True)
    
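    # we add a feature meant to represent the largest purchase share
    # ("mayor" is Spanish for "largest").
    # Note: max and sum run over every column here, including VisitNumber,
    # ScanCount and is_train_set, not only the department counts.
    df['Mayor_purchase'] = df.max(axis=1) / df.sum(axis=1)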
    
    
    # we add a Weekend feature
    df["Weekend"] = df[['Weekday_Saturday', 'Weekday_Sunday']].sum(axis=1) > 0
    df['Weekend'] = df['Weekend'] * 1.0
    
    # flag visits whose net ScanCount is negative (more items returned than bought)
    df['Only_returns'] = df['ScanCount'] < 0
    df['Only_returns'] = df['Only_returns'] * 1.0
    
    # flag visits whose net ScanCount is exactly zero
    df['Just_looking'] = df['ScanCount'] == 0
    df['Just_looking'] = df['Just_looking'] * 1.0
    

    # get train and test back
    df_train = df[df.is_train_set != 0]
    df_test = df[df.is_train_set == 0]
    
    X = df_train.drop(["is_train_set"], axis=1)
    yy = None  # test labels are unknown; filled in later with predictions
    XX = df_test.drop(["is_train_set"], axis=1)

    return X, y, XX, yy

In [3]:
X, y, XX, yy = transform_data("../data/train.csv", "../data/test.csv")
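
The function concatenates train and test before one-hot encoding so that both end up with the same columns. A minimal sketch of why that matters, using two invented frames in which each split lacks a category the other has:

```python
import pandas as pd

# Hypothetical train/test frames; each split is missing one category.
train = pd.DataFrame({"DepartmentDescription": ["DAIRY", "SHOES"]})
test = pd.DataFrame({"DepartmentDescription": ["DAIRY", "BAKERY"]})

# Encoding the splits separately yields mismatched column sets.
sep_train = pd.get_dummies(train, columns=["DepartmentDescription"])
sep_test = pd.get_dummies(test, columns=["DepartmentDescription"])
print(list(sep_train.columns))
print(list(sep_test.columns))

# Encoding the concatenation, as transform_data does, yields one shared set.
both = pd.concat([train.assign(is_train_set=1), test.assign(is_train_set=0)])
enc = pd.get_dummies(both, columns=["DepartmentDescription"])
enc_train = enc[enc.is_train_set == 1].drop(columns="is_train_set")
enc_test = enc[enc.is_train_set == 0].drop(columns="is_train_set")
print(list(enc_train.columns) == list(enc_test.columns))
```

A model fitted on `sep_train` could not be applied to `sep_test` directly, because the feature columns differ; the concatenated version avoids that.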

In [4]:
XX.head(30)

Unnamed: 0,VisitNumber,ScanCount,DepartmentDescription_1-HR PHOTO,DepartmentDescription_ACCESSORIES,DepartmentDescription_AUTOMOTIVE,DepartmentDescription_BAKERY,DepartmentDescription_BATH AND SHOWER,DepartmentDescription_BEAUTY,DepartmentDescription_BEDDING,DepartmentDescription_BOOKS AND MAGAZINES,...,Weekday_Saturday,Weekday_Sunday,Weekday_Thursday,Weekday_Tuesday,Weekday_Wednesday,Weekday_nan,Mayor_purchase,Weekend,Only_returns,Just_looking
1,7,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.583333,0.0,0.0,0.0
2,8,28,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.466667,0.0,0.0,0.0
7,15,9,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.454545,0.0,0.0,0.0
9,19,9,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0.5,0.0,0.0,0.0
11,23,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.821429,0.0,0.0,0.0
12,25,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.78125,0.0,0.0,0.0
25,47,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.839286,0.0,0.0,0.0
33,57,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0.95,0.0,0.0,0.0
34,61,12,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.743902,0.0,0.0,0.0
35,63,5,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0.851351,0.0,0.0,0.0
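
The `Mayor_purchase` values above come from `df.max(axis=1) / df.sum(axis=1)`, which runs over every column of the frame, `VisitNumber` and `ScanCount` included. The first row is consistent with that reading: its `VisitNumber` is 7 and its value is 0.583333, which equals 7/12. If the intent is the share of a visit that falls in its largest department, one possible sketch restricts the computation to the department columns. The frame below and the name `largest_dept_share` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical per-visit frame using the column-naming pattern that
# pd.get_dummies produces in transform_data; the values are invented.
visits = pd.DataFrame({
    "VisitNumber": [7, 8, 9],
    "ScanCount": [2, 10, 0],
    "DepartmentDescription_DAIRY": [0, 6, 0],
    "DepartmentDescription_SHOES": [1, 0, 0],
    "DepartmentDescription_PERSONAL CARE": [1, 4, 0],
})

# Keep only the one-hot department columns.
dept_cols = [c for c in visits.columns if c.startswith("DepartmentDescription_")]
dept_counts = visits[dept_cols]

# Share of the visit's rows that fall in its most frequent department.
# A zero total is mapped to NaN first so the division cannot produce inf.
total = dept_counts.sum(axis=1).replace(0, np.nan)
visits["largest_dept_share"] = (dept_counts.max(axis=1) / total).fillna(0.0)

print(visits[["VisitNumber", "largest_dept_share"]])
```

With this definition the feature stays between 0 and 1 and does not depend on how large the visit id happens to be.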


In [5]:
y.head()

0    999.0
1      8.0
2      8.0
3     35.0
4     41.0
Name: TripType, dtype: float64
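
The labels come out as floats because they are produced with a mean. Since each visit has a single TripType, a hedged alternative is to check that assumption and take the first value per visit, which keeps the integer dtype. The rows below are invented stand-ins for train.csv.

```python
import pandas as pd

# Hypothetical rows standing in for train.csv.
rows = pd.DataFrame({
    "VisitNumber": [5, 7, 7, 8],
    "TripType": [999, 30, 30, 26],
})

per_visit = rows.groupby("VisitNumber")["TripType"]

# Every visit should carry exactly one distinct label.
assert (per_visit.nunique() == 1).all()

# Taking the first value per visit keeps the integer dtype.
labels = per_visit.first()
print(labels)
```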

In [6]:
X.shape

(67029, 83)

In [7]:
XX.shape

(28645, 83)

In [8]:
y.shape

(67029,)

## Gradient Boosting Classifier

In [9]:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(
    random_state=0,
    learning_rate=0.075,
    max_depth=8,
    validation_fraction=0.1,
    n_iter_no_change=10,
    verbose=1,
)
clf.fit(X, y)


      Iter       Train Loss   Remaining Time 
         1      128705.8676           31.00m
         2      112170.9721           33.13m
         3     4613031.8372           33.69m
         4     6492258.9298           33.78m
         5     6485310.7545           33.69m
         6     6479328.6627           33.40m
         7     6474417.3342           33.18m
         8     6470123.9181           32.86m
         9     6466407.4255           32.57m
        10     6463082.4494           32.22m
        20     6443190.8653           28.83m
        30     6434308.9890           25.09m
        40     6429461.1911           21.43m
        50 3367026362321653215396929452225486920193505837232887470800695909791855056252829696.0000           17.85m
        60 529353536501.8047           14.30m
        70 529353534513.4964           10.74m
        80 529353532927.4039            7.19m


GradientBoostingClassifier(learning_rate=0.075, max_depth=8,
                           n_iter_no_change=10, random_state=0, verbose=1)
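
The train loss in the log above rises sharply instead of decreasing, which suggests the fit ran into numerical trouble. A minimal sketch of how the fitted estimator can be inspected after training, using synthetic data from `make_classification` as a stand-in for the competition data; the sizes and parameters are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the competition data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=8, n_classes=3, random_state=0
)

model = GradientBoostingClassifier(
    n_estimators=200,
    validation_fraction=0.1,
    n_iter_no_change=5,
    random_state=0,
)
model.fit(X_demo, y_demo)

# Number of boosting stages actually kept after early stopping.
print("stages fitted:", model.n_estimators_)

# train_score_[i] is the training loss at stage i; a healthy run trends downward.
losses = model.train_score_
print("first loss:", losses[0], "last loss:", losses[-1])
```

Checking `train_score_` on the real model would show at which stage the loss starts to grow.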

In [11]:
clf.predict(X)

array([999.,   8.,   8., ...,  25.,  22.,   8.])

In [12]:
# accuracy on the training data itself, so this is an optimistic estimate
clf.score(X, y)

0.896999806054096
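
The score above is measured on the same rows the model was trained on, so it says little about how the model will do on the test set. A minimal sketch of a hold-out check, again on synthetic data rather than the competition files; the 80/20 split and the model parameters are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y so the sketch runs without the competition files.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Hold out 20% of the rows, keeping the class proportions in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

model = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)
val_acc = model.score(X_val, y_val)
print(f"train accuracy: {train_acc:.3f}  validation accuracy: {val_acc:.3f}")
```

The gap between the two numbers gives a rough idea of how much the model overfits.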

In [13]:
yy = clf.predict(XX)

In [14]:
yy = yy.astype(int)
yy

array([ 9, 26, 21, ...,  3, 39,  7])

In [15]:
submission = pd.DataFrame(list(zip(XX.VisitNumber, yy)), columns=["VisitNumber", "TripType"])

In [16]:
submission.to_csv("../data/submission_GBC.csv", header=True, index=False)
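
Before uploading, it can help to check the submission frame. A small sketch with made-up visit numbers and predictions standing in for `XX.VisitNumber` and `clf.predict(XX)`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for XX.VisitNumber and clf.predict(XX).
visit_numbers = pd.Series([7, 8, 15], name="VisitNumber")
predictions = np.array([9.0, 26.0, 21.0])

submission_demo = pd.DataFrame({
    "VisitNumber": visit_numbers.to_numpy(),
    "TripType": predictions.astype(int),
})

# One row per visit and integer labels are what the two-column layout implies.
assert submission_demo["VisitNumber"].is_unique
assert submission_demo["TripType"].dtype.kind == "i"

print(submission_demo.to_csv(index=False))
```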

## Alternative Model

In [17]:
def heavy_transform_data(train_data_fname, test_data_fname):
    df_train = pd.read_csv(train_data_fname)
    df_train['is_train_set'] = 1
    df_test = pd.read_csv(test_data_fname)
    df_test['is_train_set'] = 0

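    # we get the TripType for the train set. Each visit has a single TripType, so
    # after grouping by VisitNumber the mean (or max or min) recovers it.
    # The column is selected first so that non-numeric columns are not aggregated.
    y = df_train.groupby("VisitNumber", as_index=False)["TripType"].mean().TripType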
    
    # we remove the TripType now, and concat training and testing data
    # the concat is done so that we have the same columns for both datasets
    # after one-hot encoding
    df_train = df_train.drop("TripType", axis=1)
    df = pd.concat([df_train, df_test])
    
    # drop the column this version does not use (it may still carry useful signal)
    df = df.drop(["Upc"], axis=1)

    # one-hot encoding for the DepartmentDescription
    df = pd.get_dummies(df, columns=["DepartmentDescription"], dummy_na=True)
    
    # one-hot encoding for the FinelineNumber
    df = pd.get_dummies(df, columns=["FinelineNumber"], dummy_na=True)

    # aggregate to one row per visit; the sum also turns is_train_set into the
    # number of rows in the visit (0 for test visits)
    df = df.groupby(["VisitNumber", "Weekday"], as_index=False).sum()
    
    # finally, we do one-hot encoding for the Weekday
    df = pd.get_dummies(df, columns=["Weekday"], dummy_na=True)
    
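    # we add a feature meant to represent the largest purchase share
    # ("mayor" is Spanish for "largest").
    # Note: max and sum run over every column here, including VisitNumber,
    # ScanCount and is_train_set, not only the department counts.
    df['Mayor_purchase'] = df.max(axis=1) / df.sum(axis=1)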
    
    # we add a Weekend feature
    df["Weekend"] = df[['Weekday_Saturday', 'Weekday_Sunday']].sum(axis=1) > 0
    df['Weekend'] = df['Weekend'] * 1.0
    
    # flag visits whose net ScanCount is negative (more items returned than bought)
    df['Only_returns'] = df['ScanCount'] < 0
    df['Only_returns'] = df['Only_returns'] * 1.0
    
    # flag visits whose net ScanCount is exactly zero
    df['Just_looking'] = df['ScanCount'] == 0
    df['Just_looking'] = df['Just_looking'] * 1.0
    

    # get train and test back
    df_train = df[df.is_train_set != 0]
    df_test = df[df.is_train_set == 0]
    
    X = df_train.drop(["is_train_set"], axis=1)
    yy = None  # test labels are unknown; filled in later with predictions
    XX = df_test.drop(["is_train_set"], axis=1)

    return X, y, XX, yy

In [18]:
X, y, XX, yy = heavy_transform_data("../data/train.csv", "../data/test.csv")

In [19]:
X.head(20)

Unnamed: 0,VisitNumber,ScanCount,DepartmentDescription_1-HR PHOTO,DepartmentDescription_ACCESSORIES,DepartmentDescription_AUTOMOTIVE,DepartmentDescription_BAKERY,DepartmentDescription_BATH AND SHOWER,DepartmentDescription_BEAUTY,DepartmentDescription_BEDDING,DepartmentDescription_BOOKS AND MAGAZINES,...,Weekday_Saturday,Weekday_Sunday,Weekday_Thursday,Weekday_Tuesday,Weekday_Wednesday,Weekday_nan,Mayor_purchase,Weekend,Only_returns,Just_looking
0,5,-1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.625,0.0,1.0,0.0
3,9,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.409091,0.0,0.0,0.0
4,10,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.434783,0.0,0.0,0.0
5,11,4,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.392857,0.0,0.0,0.0
6,12,7,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.292683,0.0,0.0,0.0
8,17,4,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.5,0.0,0.0,0.0
10,20,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.666667,0.0,0.0,0.0
13,26,12,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.393939,0.0,0.0,0.0
14,28,8,0,0,0,2,0,0,0,0,...,0,0,0,0,0,0,0.459016,0.0,0.0,0.0
15,29,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.852941,0.0,0.0,0.0


In [20]:
X.shape

(67029, 5279)

In [21]:
XX.shape

(28645, 5279)

In [22]:
y.shape

(67029,)
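
One-hot encoding every FinelineNumber grows the feature matrix from 83 to 5279 columns. One possible way to keep it smaller, sketched here on invented data, is to keep only the most frequent codes and group the rest under a placeholder before encoding. The cutoff `top_k`, the placeholder value -1 (assumed not to be a real code) and the column name `FinelineGrouped` are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical FinelineNumber values.
df_demo = pd.DataFrame({
    "VisitNumber": [1, 1, 2, 2, 3, 3, 3],
    "FinelineNumber": [100, 100, 100, 200, 200, 300, 400],
})

top_k = 2  # assumed cutoff; tune it on validation data
frequent = df_demo["FinelineNumber"].value_counts().nlargest(top_k).index

# Replace infrequent codes with a single placeholder value.
df_demo["FinelineGrouped"] = df_demo["FinelineNumber"].where(
    df_demo["FinelineNumber"].isin(frequent), other=-1
)

# One-hot encode the grouped codes and aggregate to one row per visit.
encoded = pd.get_dummies(df_demo, columns=["FinelineGrouped"])
encoded = encoded.drop(columns="FinelineNumber")
per_visit = encoded.groupby("VisitNumber").sum()

print(per_visit)
```

The encoded width is then at most `top_k + 1` columns for this field, regardless of how many distinct codes the data contains.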

In [23]:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(
    random_state=0,
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.1,
    n_iter_no_change=3,
    verbose=1,
)
clf.fit(X, y)

      Iter       Train Loss   Remaining Time 
         1      141505.9120         1182.17m
         2  1177487665.1904         1181.25m
         3 971192535824977261682053950845901110555705344.0000         1189.19m
         4 971192535824977261682053950845901110555705344.0000         1242.88m
         5 971192535824977261682053950845901110555705344.0000         1331.12m
         6 971192535824977261682053950845901110555705344.0000         1405.03m
         7 971192535824977261682053950845901110555705344.0000         1363.00m
         8 971192535824977261682053950845901110555705344.0000         1329.48m


GradientBoostingClassifier(n_iter_no_change=3, random_state=0, verbose=1)

In [24]:
clf.score(X, y)

0.6199406227155411

In [25]:
yy = clf.predict(XX)

In [26]:
yy = yy.astype(int)
yy

array([ 8, 26, 21, ...,  9, 39, 39])

In [27]:
submission = pd.DataFrame(list(zip(XX.VisitNumber, yy)), columns=["VisitNumber", "TripType"])

In [28]:
submission.to_csv("../data/submission_ramello_GDB_2.csv", header=True, index=False)