# Predicting Customer Device Types

In this project, we predict which type of device customers use to connect to an e-commerce site. We use deep learning to estimate, for each visit, the probability of each of the three device categories (desktop, mobile and tablet).

The data used can be found [here](https://www.kaggle.com/lipann/prepaired-data-of-customer-revenue-prediction?select=test_filtered.csv). Let's take a look at it.

# Reading the data in

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings("ignore")

In [3]:
initial = pd.read_csv("customer_data.csv", low_memory=False)  # see the link in the markdown above for the data used
initial.head()

Unnamed: 0,channelGrouping,date,fullVisitorId,visitNumber,visitStartTime,totals_bounces,totals_hits,totals_newVisits,totals_pageviews,device_browser,...,trafficSource_adwordsClickInfo.adNetworkType,trafficSource_adwordsClickInfo.isVideoAd,trafficSource_adwordsClickInfo.page,trafficSource_adwordsClickInfo.slot,trafficSource_campaign,trafficSource_isTrueDirect,trafficSource_keyword,trafficSource_medium,trafficSource_referralPath,trafficSource_source
0,Organic Search,20171016,6167871330617112363,2,1508151024,0.0,4,0.0,4.0,Chrome,...,@,1,0.0,@,@,1,@,organic,@,google
1,Organic Search,20171016,643697640977915618,1,1508175522,0.0,5,1.0,5.0,Chrome,...,@,1,0.0,@,@,0,@,organic,@,google
2,Organic Search,20171016,6059383810968229466,1,1508143220,0.0,7,1.0,7.0,Chrome,...,@,1,0.0,@,@,0,@,organic,@,google
3,Organic Search,20171016,2376720078563423631,1,1508193530,0.0,8,1.0,4.0,Safari,...,@,1,0.0,@,@,0,@,organic,@,google
4,Organic Search,20171016,2314544520795440038,1,1508217442,0.0,9,1.0,4.0,Safari,...,@,1,0.0,@,@,0,@,organic,@,google


In [4]:
initial.device_deviceCategory.value_counts()

desktop    507100
mobile     262611
tablet      34973
Name: device_deviceCategory, dtype: int64

The data has more than 800k rows. However, as the counts above show, the classes are heavily imbalanced: tablet has roughly 35k rows while desktop has more than 500k. To avoid training on an imbalanced sample, we draw 34,000 rows at random from each of the three categories and combine them.

In [5]:
desktop = initial[initial["device_deviceCategory"] == "desktop"].sample(n=34000, random_state=42)
mobile = initial[initial["device_deviceCategory"] == "mobile"].sample(n=34000, random_state=42)
tablet = initial[initial["device_deviceCategory"] == "tablet"].sample(n=34000, random_state=42)
sample = pd.concat([desktop, mobile, tablet])
sample.head()

Unnamed: 0,channelGrouping,date,fullVisitorId,visitNumber,visitStartTime,totals_bounces,totals_hits,totals_newVisits,totals_pageviews,device_browser,...,trafficSource_adwordsClickInfo.adNetworkType,trafficSource_adwordsClickInfo.isVideoAd,trafficSource_adwordsClickInfo.page,trafficSource_adwordsClickInfo.slot,trafficSource_campaign,trafficSource_isTrueDirect,trafficSource_keyword,trafficSource_medium,trafficSource_referralPath,trafficSource_source
392331,Organic Search,20171201,427495481997939573,1,1512195777,1.0,1,1.0,1.0,Chrome,...,@,1,0.0,@,@,0,@,organic,@,google
446140,Organic Search,20180206,2391255455386289990,1,1517910251,1.0,1,1.0,1.0,Chrome,...,@,1,0.0,@,@,0,@,organic,@,google
604595,Social,20180114,498301535948389503,1,1515987549,1.0,1,1.0,1.0,Chrome,...,@,1,0.0,@,@,0,@,referral,/intl/id/yt/about/copyright/,youtube.com
400759,Organic Search,20171214,5141774161087542464,1,1513262920,0.0,2,1.0,2.0,Chrome,...,@,1,0.0,@,@,0,@,organic,@,google
790780,Organic Search,20180427,6687993334352031777,1,1524855308,0.0,3,1.0,3.0,Edge,...,@,1,0.0,@,@,0,@,organic,@,google
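
The same balanced draw can also be written in a single step. The sketch below is only an illustration on a made-up toy frame (the rows and the `toy` name are assumptions, not the real data), and it relies on `DataFrameGroupBy.sample`, which requires pandas 1.1 or newer.

```python
import pandas as pd

# Toy stand-in for the real data: 6 desktop, 3 mobile and 2 tablet rows.
toy = pd.DataFrame({
    "device_deviceCategory": ["desktop"] * 6 + ["mobile"] * 3 + ["tablet"] * 2,
    "totals_hits": range(11),
})

# Sampling without replacement cannot draw more rows than the smallest class has.
n_per_class = int(toy["device_deviceCategory"].value_counts().min())

# Draw the same number of rows from every class in one call.
balanced = toy.groupby("device_deviceCategory").sample(n=n_per_class, random_state=42)

print(balanced["device_deviceCategory"].value_counts())
```

In the notebook, the smallest class (tablet) has 34,973 rows, which is why 34,000 rows per class is a feasible sample size.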


In [6]:
sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102000 entries, 392331 to 496585
Data columns (total 31 columns):
 #   Column                                        Non-Null Count   Dtype  
---  ------                                        --------------   -----  
 0   channelGrouping                               102000 non-null  object 
 1   date                                          102000 non-null  int64  
 2   fullVisitorId                                 102000 non-null  object 
 3   visitNumber                                   102000 non-null  int64  
 4   visitStartTime                                102000 non-null  int64  
 5   totals_bounces                                102000 non-null  float64
 6   totals_hits                                   102000 non-null  int64  
 7   totals_newVisits                              102000 non-null  float64
 8   totals_pageviews                              102000 non-null  float64
 9   device_browser                             

# Cleaning the data

Although `info()` reports no null values, many cells contain the placeholder "@" where a value is missing. Real-world organizational data is often better structured than this. We will drop the columns dominated by this placeholder so that they do not disrupt the algorithm. Let's count the "@" entries in each column to see which ones are problematic.

In [7]:
how_many_ats = {}
for col in sample.columns:
    if "@" in sample[col].value_counts().index:
        how_many_ats[col] = sample[col].value_counts()["@"]
    else:
        how_many_ats[col] = 0

pd.DataFrame(how_many_ats.items()).sort_values(1, ascending=False)

Unnamed: 0,0,1
24,trafficSource_adwordsClickInfo.slot,89427
21,trafficSource_adwordsClickInfo.adNetworkType,89427
20,trafficSource_adContent,89006
27,trafficSource_keyword,87269
25,trafficSource_campaign,86891
29,trafficSource_referralPath,82832
16,geoNetwork_metro,81526
13,geoNetwork_city,64294
18,geoNetwork_region,63204
17,geoNetwork_networkDomain,47939
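
The per-column loop above can also be expressed as one vectorized comparison. This is a minimal sketch on a made-up toy frame; the `toy` data and the `threshold` value are hypothetical and only chosen to keep the example small.

```python
import pandas as pd

toy = pd.DataFrame({
    "geoNetwork_city": ["@", "Paris", "@", "@"],
    "device_browser": ["Chrome", "Safari", "Chrome", "@"],
    "totals_hits": [4, 5, 7, 8],
})

# Comparing the whole frame to "@" gives a boolean frame;
# summing it counts the placeholder cells in each column.
at_counts = (toy == "@").sum().sort_values(ascending=False)
print(at_counts)

# Columns whose placeholder count exceeds a threshold (hypothetical value for the toy data).
threshold = 2
to_drop = at_counts[at_counts > threshold].index.tolist()
print(to_drop)
```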


As we can see, some columns consist almost entirely of "@" placeholders. We will drop every column that has more than 1,000 "@" entries and, in the columns that remain, replace "@" with the column's most frequent value.

In [8]:
to_be_dropped = [key for key in how_many_ats if how_many_ats[key] > 1000]
clean = sample.drop(columns=to_be_dropped)
clean.info()
to_be_dropped

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102000 entries, 392331 to 496585
Data columns (total 20 columns):
 #   Column                                    Non-Null Count   Dtype  
---  ------                                    --------------   -----  
 0   channelGrouping                           102000 non-null  object 
 1   date                                      102000 non-null  int64  
 2   fullVisitorId                             102000 non-null  object 
 3   visitNumber                               102000 non-null  int64  
 4   visitStartTime                            102000 non-null  int64  
 5   totals_bounces                            102000 non-null  float64
 6   totals_hits                               102000 non-null  int64  
 7   totals_newVisits                          102000 non-null  float64
 8   totals_pageviews                          102000 non-null  float64
 9   device_browser                            102000 non-null  object 
 10  device_deviceCa

['geoNetwork_city',
 'geoNetwork_metro',
 'geoNetwork_networkDomain',
 'geoNetwork_region',
 'trafficSource_adContent',
 'trafficSource_adwordsClickInfo.adNetworkType',
 'trafficSource_adwordsClickInfo.slot',
 'trafficSource_campaign',
 'trafficSource_keyword',
 'trafficSource_medium',
 'trafficSource_referralPath']

In [9]:
ats_left = {}
for col in clean.columns:
    if "@" in clean[col].value_counts().index:
        ats_left[col] = clean[col].value_counts()["@"]
    else:
        ats_left[col] = 0
        
pd.DataFrame(ats_left.items()).sort_values(1, ascending=False)

Unnamed: 0,0,1
12,device_operatingSystem,905
15,geoNetwork_subContinent,115
14,geoNetwork_country,115
13,geoNetwork_continent,115
0,channelGrouping,0
1,date,0
18,trafficSource_isTrueDirect,0
17,trafficSource_adwordsClickInfo.page,0
16,trafficSource_adwordsClickInfo.isVideoAd,0
11,device_isMobile,0


Only a few placeholder values remain. We will fill each of them with the most frequent value of its column.

In [10]:
cols_to_work_on = ["device_operatingSystem", "geoNetwork_subContinent", "geoNetwork_country", "geoNetwork_continent"]
filler = {col: clean[col].value_counts().index[0] for col in cols_to_work_on}

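for col in cols_to_work_on:
    # Assign the result back; inplace=True on a column selection may not update the frame in newer pandas.
    clean[col] = clean[col].replace("@", filler[col])
    print(clean[col].value_counts())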

Android          33814
iOS              32429
Windows          19559
Macintosh        12359
Linux             1908
Chrome OS         1651
Tizen              108
Samsung             78
Windows Phone       50
BlackBerry          26
OS/2                11
Xbox                 5
Nintendo 3DS         1
Nokia                1
Name: device_operatingSystem, dtype: int64
Northern America      50518
Northern Europe        7874
Southern Asia          7313
Western Europe         7054
Eastern Asia           4740
Southeast Asia         4662
Southern Europe        3413
South America          3378
Eastern Europe         3038
Western Asia           2797
Australasia            2031
Central America        1803
Northern Africa        1305
Southern Africa         646
Western Africa          635
Caribbean               301
Eastern Africa          289
Central Asia             96
Middle Africa            78
Micronesian Region       16
Melanesia                 9
Polynesia                 4
Name: geoNetwork_su
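
One edge case is worth keeping in mind with this kind of filling: if "@" happened to be the most frequent value of a column, it would be chosen as its own replacement. That is not the case here, since every remaining column has fewer than 1,000 placeholders, but a slightly more defensive variant could look like the sketch below. The toy frame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"device_operatingSystem": ["Android", "@", "iOS", "Android", "@", "Windows"]}
)
col = "device_operatingSystem"

# Take the mode over real values only, so "@" can never be picked as the filler.
most_frequent = toy.loc[toy[col] != "@", col].mode().iloc[0]

# Assign the replaced column back to the frame.
toy[col] = toy[col].replace("@", most_frequent)

print(toy[col].value_counts())
```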

Now that our dataset is clean, we can start encoding the categorical data.

# Handling Categorical Data

We will be using One-Hot encoding to deal with categorical data.

In [11]:
clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102000 entries, 392331 to 496585
Data columns (total 20 columns):
 #   Column                                    Non-Null Count   Dtype  
---  ------                                    --------------   -----  
 0   channelGrouping                           102000 non-null  object 
 1   date                                      102000 non-null  int64  
 2   fullVisitorId                             102000 non-null  object 
 3   visitNumber                               102000 non-null  int64  
 4   visitStartTime                            102000 non-null  int64  
 5   totals_bounces                            102000 non-null  float64
 6   totals_hits                               102000 non-null  int64  
 7   totals_newVisits                          102000 non-null  float64
 8   totals_pageviews                          102000 non-null  float64
 9   device_browser                            102000 non-null  object 
 10  device_deviceCa

In [12]:
categoricals = ["channelGrouping", "device_browser", "device_operatingSystem", "geoNetwork_continent", "geoNetwork_country", "geoNetwork_subContinent", "trafficSource_source"]
for cat in categoricals:
    print(cat + ": " + str(clean[cat].value_counts().shape[0]))

channelGrouping: 8
device_browser: 38
device_operatingSystem: 14
geoNetwork_continent: 5
geoNetwork_country: 199
geoNetwork_subContinent: 22
trafficSource_source: 136


As we can see, some columns have many more unique values than others. Still, none of the counts is large enough to be a problem, so we will one-hot encode all of them.

In [13]:
final = clean.copy()
final = pd.get_dummies(final, columns=categoricals)

print(final.columns[:20])

Index(['date', 'fullVisitorId', 'visitNumber', 'visitStartTime',
       'totals_bounces', 'totals_hits', 'totals_newVisits', 'totals_pageviews',
       'device_deviceCategory', 'device_isMobile',
       'trafficSource_adwordsClickInfo.isVideoAd',
       'trafficSource_adwordsClickInfo.page', 'trafficSource_isTrueDirect',
       'channelGrouping_(Other)', 'channelGrouping_Affiliates',
       'channelGrouping_Direct', 'channelGrouping_Display',
       'channelGrouping_Organic Search', 'channelGrouping_Paid Search',
       'channelGrouping_Referral'],
      dtype='object')
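
To make the effect of one-hot encoding on the column count concrete, here is a small sketch on a made-up toy frame (the rows are assumptions for illustration). Each encoded column is replaced by one indicator column per distinct value, so the seven categorical columns above should contribute 8 + 38 + 14 + 5 + 199 + 22 + 136 indicator columns in total.

```python
import pandas as pd

toy = pd.DataFrame({
    "visitNumber": [1, 2, 1],
    "channelGrouping": ["Direct", "Social", "Direct"],
    "geoNetwork_continent": ["Europe", "Asia", "Americas"],
})

# Each listed column is replaced by one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=["channelGrouping", "geoNetwork_continent"])
print(encoded.columns.tolist())

# Unique-value counts printed for the seven categorical columns of the notebook.
unique_counts = [8, 38, 14, 5, 199, 22, 136]
expected_dummy_columns = sum(unique_counts)
print(expected_dummy_columns)  # 422
```

The total of 422 agrees with the 422 `uint8` columns that `final.info()` reports further down.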


# Preprocessing

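We will begin by dropping the columns that we will not feed into the model. Note that `device_isMobile` is closely tied to the device category, so keeping it would likely leak the target.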

In [14]:
unused = final[["fullVisitorId", "device_isMobile", "date", "visitStartTime"]].copy()
final.drop(columns=["fullVisitorId", "device_isMobile", "date", "visitStartTime"], inplace=True)
final.head()

Unnamed: 0,visitNumber,totals_bounces,totals_hits,totals_newVisits,totals_pageviews,device_deviceCategory,trafficSource_adwordsClickInfo.isVideoAd,trafficSource_adwordsClickInfo.page,trafficSource_isTrueDirect,channelGrouping_(Other),...,trafficSource_source_support.google.com,trafficSource_source_t.co,trafficSource_source_tpc.googlesyndication.com,trafficSource_source_tw.search.yahoo.com,trafficSource_source_uk.search.yahoo.com,trafficSource_source_vk.com,trafficSource_source_yahoo,trafficSource_source_yandex,trafficSource_source_youtube.com,trafficSource_source_youtube.thinkwithgoogle.com
392331,1,1.0,1,1.0,1.0,desktop,1,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
446140,1,1.0,1,1.0,1.0,desktop,1,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
604595,1,1.0,1,1.0,1.0,desktop,1,0.0,0,0,...,0,0,0,0,0,0,0,0,1,0
400759,1,0.0,2,1.0,2.0,desktop,1,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
790780,1,0.0,3,1.0,3.0,desktop,1,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102000 entries, 392331 to 496585
Columns: 431 entries, visitNumber to trafficSource_source_youtube.thinkwithgoogle.com
dtypes: float64(4), int64(4), object(1), uint8(422)
memory usage: 48.8+ MB


In [16]:
final.rename({"device_deviceCategory": "Y"}, axis=1, inplace=True)
final.Y.value_counts()

tablet     34000
mobile     34000
desktop    34000
Name: Y, dtype: int64

We will be using Label Encoding on our target column.

In [17]:
label_encoder = LabelEncoder()
final["Y"] = label_encoder.fit_transform(final["Y"])
final["Y"].value_counts()

2    34000
1    34000
0    34000
Name: Y, dtype: int64
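
The output above shows only the integer codes, not which device each code stands for. `LabelEncoder` sorts the class labels, so the mapping can be read from its `classes_` attribute. A small sketch on a hand-written list of labels:

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(["tablet", "desktop", "mobile", "desktop"])

# classes_ is sorted, so position i in classes_ is the label behind integer code i.
mapping = {label: code for code, label in enumerate(label_encoder.classes_)}
print(mapping)  # {'desktop': 0, 'mobile': 1, 'tablet': 2}

# inverse_transform converts integer codes back into the original labels.
print(label_encoder.inverse_transform([2, 0]))
```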

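We first one-hot encode the target so that it matches the three-unit softmax output of the model, and then split our data into train, validation and test sets.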

In [18]:
Y = final.pop("Y")
Y = pd.get_dummies(Y)
Y.head()

Unnamed: 0,0,1,2
392331,1,0,0
446140,1,0,0
604595,1,0,0
400759,1,0,0
790780,1,0,0


In [19]:
X_t_v, X_test, Y_t_v, Y_test = train_test_split(final, Y, test_size=0.2, random_state=42)
X_train, X_val, Y_train, Y_val = train_test_split(X_t_v, Y_t_v, test_size=0.2, random_state=42)
X_train.shape

(65280, 430)
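
Applying `test_size=0.2` twice gives a 64% / 16% / 20% split of the 102,000 sampled rows. The sketch below reproduces only the sizes by splitting a plain index array instead of the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 102_000  # size of the balanced sample
idx = np.arange(n_rows)

# First split: hold out 20% of all rows as the test set.
idx_t_v, idx_test = train_test_split(idx, test_size=0.2, random_state=42)
# Second split: hold out 20% of the remaining rows as the validation set.
idx_train, idx_val = train_test_split(idx_t_v, test_size=0.2, random_state=42)

print(len(idx_train), len(idx_val), len(idx_test))  # 65280 16320 20400
```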

# Training & evaluation of the model

In this step, we will be training and evaluating our model. The hidden layers use the Rectified Linear Unit (ReLU) activation, which helps avoid vanishing gradients, and the output layer uses softmax so that the three outputs form a probability distribution. Since we're doing multi-class classification, we will be using categorical cross-entropy as the loss. We will be evaluating our results using ROC AUC.
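
To show what softmax and categorical cross-entropy compute, here is a minimal NumPy sketch. It is an illustration of the formulas on made-up logits, not the Keras implementation; the function names and the `eps` clipping value are assumptions.

```python
import numpy as np

def softmax(logits):
    # Subtracting the row maximum improves numerical stability without changing the result.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_cross_entropy(y_true, y_prob, eps=1e-7):
    # Mean over samples of -sum_k y_k * log(p_k); clipping guards against log(0).
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())

logits = np.array([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]])
y_true = np.array([[1, 0, 0], [0, 0, 1]])  # one-hot targets

probs = softmax(logits)
loss = categorical_cross_entropy(y_true, probs)

print(probs.round(3))
print(round(loss, 4))
```

Each row of `probs` sums to 1, and the loss only looks at the probability assigned to the true class of each row.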

In [20]:
input_size = X_train.shape[1]
output_size = Y.shape[1]

model = tf.keras.Sequential([
                            tf.keras.Input(shape=(input_size,)),
                            tf.keras.layers.Dense(42, activation="relu"),
                            tf.keras.layers.Dense(12, activation="relu"),
                            tf.keras.layers.Dense(output_size, activation="softmax")
                            ])
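
As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name below is hypothetical.

```python
def dense_params(n_inputs, n_units):
    # One weight for every (input, unit) pair plus one bias per unit.
    return n_inputs * n_units + n_units

# 430 input features, hidden layers of 42 and 12 units, 3 output classes.
layer_sizes = [430, 42, 12, 3]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)  # [18102, 516, 39] 18657
```

Calling `model.summary()` should report the same totals.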

In [21]:
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["AUC"])

In [22]:
NUM_EPOCHS = 7
BATCH_SIZE = 100

model.fit(x=X_train, y=Y_train, epochs=NUM_EPOCHS, batch_size=BATCH_SIZE, validation_data=(X_val, Y_val), verbose=2)

Epoch 1/7
653/653 - 2s - loss: 0.4998 - auc: 0.9213 - val_loss: 0.4139 - val_auc: 0.9394
Epoch 2/7
653/653 - 2s - loss: 0.4160 - auc: 0.9388 - val_loss: 0.4088 - val_auc: 0.9411
Epoch 3/7
653/653 - 2s - loss: 0.4089 - auc: 0.9407 - val_loss: 0.4081 - val_auc: 0.9415
Epoch 4/7
653/653 - 2s - loss: 0.4037 - auc: 0.9419 - val_loss: 0.4070 - val_auc: 0.9415
Epoch 5/7
653/653 - 2s - loss: 0.4020 - auc: 0.9426 - val_loss: 0.4068 - val_auc: 0.9415
Epoch 6/7
653/653 - 2s - loss: 0.4016 - auc: 0.9428 - val_loss: 0.4076 - val_auc: 0.9412
Epoch 7/7
653/653 - 2s - loss: 0.3964 - auc: 0.9440 - val_loss: 0.4076 - val_auc: 0.9415


<tensorflow.python.keras.callbacks.History at 0x1e4b4db5f88>

In [23]:
test_loss, test_auc = model.evaluate(X_test, Y_test)

print("\nTest loss: " + str(test_loss) + ". Test AUC: " + str(test_auc*100.) + "%.")


Test loss: 0.41068512201309204. Test AUC: 93.9420223236084%.
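
The second value returned by `model.evaluate` is the Keras `AUC` metric set in `model.compile`. With its default settings, that metric flattens the one-hot targets and the predicted probabilities into a single binary problem, which is close in spirit to a micro-averaged ROC AUC, and it approximates the area over a fixed set of thresholds. The sketch below uses scikit-learn on made-up toy predictions to show the micro and macro variants; it is a way to cross-check the idea, not a reproduction of the Keras number.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# One-hot targets and predicted probabilities for four toy samples.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.8, 0.1],
])

# Micro: pool every (sample, class) cell into one binary problem.
micro_auc = roc_auc_score(y_true, y_prob, average="micro")
# Macro: compute one AUC per class and take the unweighted mean.
macro_auc = roc_auc_score(y_true, y_prob, average="macro")

print(micro_auc, macro_auc)  # 1.0 1.0
```

Both values are 1.0 here because every true class receives a higher score than every wrong class in the toy data.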


In [24]:
Y_pred = pd.DataFrame(model.predict(X_test))
Y_pred.head()

Unnamed: 0,0,1,2
0,0.000858,0.780304,0.218839
1,0.999681,8e-05,0.000238
2,0.003125,0.721027,0.275848
3,0.00102,0.425695,0.573285
4,0.002418,0.749921,0.247662
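
Each row above holds the predicted probabilities for classes 0, 1 and 2. To turn them into device labels, we can take the most probable column and map it back through the label order. The sketch below hard-codes the five displayed rows and assumes the alphabetical class order that `LabelEncoder` produces (desktop, mobile, tablet).

```python
import numpy as np

# Class order assumed from LabelEncoder, which sorts labels alphabetically.
class_names = np.array(["desktop", "mobile", "tablet"])

# The five rows of predicted probabilities displayed above.
y_prob = np.array([
    [0.000858, 0.780304, 0.218839],
    [0.999681, 0.000080, 0.000238],
    [0.003125, 0.721027, 0.275848],
    [0.001020, 0.425695, 0.573285],
    [0.002418, 0.749921, 0.247662],
])

# argmax picks the index of the most probable class in each row.
predicted = class_names[y_prob.argmax(axis=1)]
print(predicted.tolist())  # ['mobile', 'desktop', 'mobile', 'tablet', 'mobile']
```

In the notebook itself, `label_encoder.inverse_transform` applied to the argmax indices would give the same labels.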


# Conclusions

In this project, we predicted the probability that a customer is using each of the three device types. The problem at hand was a multi-class classification problem with 3 classes. The classes were highly imbalanced, so we undersampled them to 34,000 rows each. In the end, our model performed well on the test set, reaching a ROC AUC of about 0.94.