# Machine Learning - Hands on Lab - Session #1

* **Lecturer:** Jonathan DEKHTIAR
* **Date:** 2017-03-13
<br/><br/>
* **Contact:** [contact@jonathandekhtiar.eu](mailto:contact@jonathandekhtiar.eu)
* **Twitter:** [@born2data](https://twitter.com/born2data)
* **LinkedIn:** [JonathanDEKHTIAR](https://fr.linkedin.com/in/jonathandekhtiar)
* **Personal Website:** [JonathanDEKHTIAR](http://www.jonathandekhtiar.eu)
* **RSS Feed:** [FeedCrunch.io](https://www.feedcrunch.io/@dataradar/)
* **Tech. Blog:** [born2data.com](http://www.born2data.com/)
* **Github:** [DEKHTIARJonathan](https://github.com/DEKHTIARJonathan)
<br/><br/>

```
*************************************************************************
**
** 2017 March 13
**
** In place of a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
*************************************************************************
```

## 1. Loading the Python libraries

In [1]:
import os
from datetime import datetime

import numpy as np
import pandas as pd

import sklearn as sk

## 2. Re-Import the Check NaN Function from Part 2

In [2]:
def check_NaN_Values_in_df(df):
    # Search for NaN values in every column
    for col in df:
        nan_count = df[col].isnull().sum()

        if nan_count != 0:
            print (col + " => "+  str(nan_count) + " NaN Values")
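
The same per-column report can also be produced without an explicit loop. The sketch below is one possible variant, not part of the original lab: the helper name `nan_counts` and the toy dataframe are made up for illustration.

```python
import numpy as np
import pandas as pd


def nan_counts(df):
    """Return a Series of NaN counts, keeping only columns that contain NaN."""
    counts = df.isnull().sum()   # one count per column
    return counts[counts != 0]   # drop the columns without NaN


# Toy dataframe (hypothetical data)
toy = pd.DataFrame({
    "age": [25, 31, 47],
    "country_destination": ["US", np.nan, np.nan],
})

report = nan_counts(toy)
for col, nan_count in report.items():
    print(col + " => " + str(nan_count) + " NaN Values")
```

Returning a Series instead of printing makes the result easy to reuse, for instance in an assertion before saving the data.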

## 3. Loading in the Data from Session #2

In [3]:
df_all = pd.read_csv(
    "output/cleaned.csv", 
    dtype={
        'country_destination': str
    }
)

# Convert the date columns back to datetime (the CSV stores them as strings)
df_all['date_account_created'] = pd.to_datetime(df_all['date_account_created'], format='%Y-%m-%d %H:%M:%S')
df_all['timestamp_first_active'] = pd.to_datetime(df_all['timestamp_first_active'], format='%Y-%m-%d %H:%M:%S')

# Check for NaN values. Expected output: country_destination => 62096 NaN Values
check_NaN_Values_in_df(df_all) 

df_all.sample(n=5) # Only display a few lines and not the whole dataframe

country_destination => 62096 NaN Values


| | affiliate_channel | affiliate_provider | age | country_destination | date_account_created | first_affiliate_tracked | first_browser | first_device_type | gender | id | language | signup_app | signup_flow | signup_method | timestamp_first_active |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 55773 | direct | direct | -1 | US | 2013-10-05 | omg | Safari | Mac Desktop | FEMALE | e5uj93n1or | en | Web | 0 | basic | 2013-10-05 00:15:28 |
| 37575 | sem-non-brand | google | 40 | NDF | 2013-08-03 | omg | Chrome | Windows Desktop | FEMALE | bulf1euo01 | en | Web | 0 | facebook | 2013-08-03 02:18:24 |
| 112022 | direct | direct | 29 | US | 2014-04-02 | untracked | Safari | Mac Desktop | MALE | 68wfudy756 | en | Web | 0 | basic | 2014-04-02 02:14:02 |
| 36398 | direct | direct | 36 | US | 2013-07-29 | untracked | Chrome | Windows Desktop | FEMALE | 91nzc3szqc | en | Web | 0 | basic | 2013-07-29 20:01:42 |
| 48670 | sem-brand | google | -1 | NDF | 2013-09-13 | omg | IE | Windows Desktop | -unknown- | te76y9uto9 | en | Web | 0 | basic | 2013-09-13 04:43:46 |


## 4. Transforming Categorical Data

Let's go for some **One Hot Encoding**: each categorical field in the dataset is replaced by several binary columns, one per distinct value of that field.
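
To make the idea concrete, here is a minimal sketch on a toy column. It uses pandas' built-in `pd.get_dummies` rather than the homemade function used in this lab, and the toy dataframe is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"gender": ["MALE", "FEMALE", "-unknown-", "FEMALE"]})

# One binary column per distinct value of "gender"
encoded = pd.get_dummies(toy["gender"], prefix="gender").astype(int)

print(encoded)
```

Each row contains exactly one `1`, in the column that matches its original value.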

In [4]:
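# Homemade One Hot Encoding function
def convert_to_binary(df, column_to_convert):
    # One binary column is created per distinct value of the source column
    categories = list(df[column_to_convert].drop_duplicates())

    for category in categories:
        # Normalise the value into a lowercase suffix without spaces or punctuation
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        # Column name: first 5 characters of the source column + first 10 characters of the value
        col_name = column_to_convert[:5] + '_' + cat_name[:10]
        df[col_name] = 0
        df.loc[(df[column_to_convert] == category), col_name] = 1

    return df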
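
Because `convert_to_binary` keeps only the first 5 characters of the column name and the first 10 characters of the value, two different values can in principle map to the same generated column name. When that happens, the later one silently overwrites the earlier one, since the column is reset by `df[col_name] = 0`. The sketch below reproduces the naming rule and reports such clashes before encoding. The helpers `make_col_name` and `find_name_collisions` are hypothetical, and the toy values are chosen to trigger a clash; they are not claimed to occur in the lab's dataset.

```python
from collections import defaultdict


def make_col_name(column, category):
    """Reproduce the naming rule used by convert_to_binary."""
    cat_name = (
        str(category)
        .replace(" ", "_").replace("(", "").replace(")", "")
        .replace("/", "_").replace("-", "").lower()
    )
    return column[:5] + "_" + cat_name[:10]


def find_name_collisions(values_by_column):
    """Map each generated name to its sources, keeping only the clashes."""
    sources = defaultdict(list)
    for column, values in values_by_column.items():
        for value in values:
            sources[make_col_name(column, value)].append((column, value))
    return {name: src for name, src in sources.items() if len(src) > 1}


# Hypothetical values chosen to trigger a clash
toy_values = {
    "signup_method": ["basic", "facebook"],
    "signup_app": ["Basic", "Web"],
}

collisions = find_name_collisions(toy_values)
print(collisions)
```

An empty result means every (column, value) pair gets its own column name.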

In [5]:
columns_to_convert = [
    'gender', 
    'signup_method', 
    'signup_flow', 
    'language', 
    'affiliate_channel', 
    'affiliate_provider', 
    'first_affiliate_tracked', 
    'signup_app', 
    'first_device_type', 
    'first_browser'
]

# One Hot Encoding
for column in columns_to_convert:
    df_all = convert_to_binary(df=df_all, column_to_convert=column)
    df_all.drop(column, axis=1, inplace=True)
    
df_all.sample(n=5)

Unnamed: 0,age,country_destination,date_account_created,id,timestamp_first_active,gende_other,gende_female,gende_male,gende_unknown,signu_basic,...,first_crazy_brow,first_stainless,first_coolnovo,first_opera_mini,first_googlebot,first_outlook_20,first_icedragon,first_ibrowse,first_nintendo_b,first_uc_browser
142404,-1,NDF,2014-06-07,deep65lgfb,2014-06-07 15:51:39,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
106537,-1,NDF,2014-03-19,54deuw2k8p,2014-03-19 07:51:15,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
152310,44,US,2014-06-26,ce5sk93xqk,2014-06-26 02:45:49,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
136061,22,NDF,2014-05-25,hp5urn796c,2014-05-25 02:47:10,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
209674,-1,,2014-09-18,4nmf77dlpb,2014-09-18 23:30:03,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0


In [6]:
# Add new date related fields
df_all['day_account_created'] = df_all['date_account_created'].dt.weekday
df_all['month_account_created'] = df_all['date_account_created'].dt.month
df_all['quarter_account_created'] = df_all['date_account_created'].dt.quarter
df_all['year_account_created'] = df_all['date_account_created'].dt.year
df_all['hour_first_active'] = df_all['timestamp_first_active'].dt.hour
df_all['day_first_active'] = df_all['timestamp_first_active'].dt.weekday
df_all['month_first_active'] = df_all['timestamp_first_active'].dt.month
df_all['quarter_first_active'] = df_all['timestamp_first_active'].dt.quarter
df_all['year_first_active'] = df_all['timestamp_first_active'].dt.year
df_all['created_less_active'] = (df_all['date_account_created'] - df_all['timestamp_first_active']).dt.days

# Drop unnecessary columns
columns_to_drop = ['date_account_created', 'timestamp_first_active', 'date_first_booking', 'country_destination']
for column in columns_to_drop:
    if column in df_all.columns:
        df_all.drop(column, axis=1, inplace=True)

print ("Dataframe Shape:", df_all.shape)
df_all.sample(n=5)

Dataframe Shape: (216973, 147)


Unnamed: 0,age,id,gende_other,gende_female,gende_male,gende_unknown,signu_basic,signu_facebook,signu_google,signu_weibo,...,day_account_created,month_account_created,quarter_account_created,year_account_created,hour_first_active,day_first_active,month_first_active,quarter_first_active,year_first_active,created_less_active
4754,24,ip1v9qfz6y,0,1,0,0,0,1,0,0,...,0,3,1,2013,1,0,3,1,2013,-1
117075,43,vy4yd3hwjd,0,0,1,0,1,0,0,0,...,0,4,2,2014,19,0,4,2,2014,-1
41724,36,254dbgjuad,0,0,0,1,1,0,0,0,...,6,8,3,2013,5,6,8,3,2013,-1
214395,25,iwb4jwqu8c,0,1,0,0,1,0,0,0,...,4,9,3,2014,20,4,9,3,2014,-1
131841,33,zcomr2n3i7,0,0,1,0,1,0,0,0,...,4,5,2,2014,21,4,5,2,2014,-1
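
Two details of these date features are easy to misread. First, `.dt.weekday` counts from Monday = 0. Second, `.dt.days` rounds a negative time difference down, which explains the -1 values of `created_less_active`: the account-creation date is stored at midnight, while the first-activity timestamp falls later on the same day. A small sketch on a single made-up row, reusing a timestamp that appears in the first sample table:

```python
import pandas as pd

toy = pd.DataFrame({
    "date_account_created": pd.to_datetime(["2013-10-05"]),
    "timestamp_first_active": pd.to_datetime(["2013-10-05 00:15:28"]),
})

weekday = toy["date_account_created"].dt.weekday   # Monday = 0 ... Sunday = 6
quarter = toy["date_account_created"].dt.quarter
hour = toy["timestamp_first_active"].dt.hour

# The difference is about -15 minutes, stored as "-1 days +23:44:32"
delta = toy["date_account_created"] - toy["timestamp_first_active"]
created_less_active = delta.dt.days

print(weekday[0], quarter[0], hour[0], created_less_active[0])
```

If a difference in hours or minutes is more useful than whole days, `delta.dt.total_seconds()` keeps the full resolution.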


## 5. Mid - Conclusion

We had **14 columns**; we now have **147 columns**.

It seems like a lot, but we have mostly just **restructured** the existing information: only the **10 date-related columns** are genuinely new.


## 6. Adding new data

### 6.1. Understanding sessions.csv
Exactly as we did with the training / testing data, we now investigate the session data.

In [7]:
df_sessions = pd.read_csv("data/sessions.csv")
print ("DF Session Shape:", df_sessions.shape)
df_sessions.head(n=5) # Only display a few lines and not the whole dataframe

DF Session Shape: (10567737, 6)


| | user_id | action | action_type | action_detail | device_type | secs_elapsed |
|---|---|---|---|---|---|---|
| 0 | d1mm9tcy42 | lookup | | | Windows Desktop | 319.0 |
| 1 | d1mm9tcy42 | search_results | click | view_search_results | Windows Desktop | 67753.0 |
| 2 | d1mm9tcy42 | lookup | | | Windows Desktop | 301.0 |
| 3 | d1mm9tcy42 | search_results | click | view_search_results | Windows Desktop | 22141.0 |
| 4 | d1mm9tcy42 | lookup | | | Windows Desktop | 435.0 |


### 6.2. Cleaning and Transforming the Data

#### 6.2.1. Extract the primary and secondary devices for each user

The first piece of information we are going to extract is the primary and secondary device for each user. 

How do we determine a user's primary and secondary devices? We look at how much time they spent on each device: the device with the largest total `secs_elapsed` is the primary one, and the next largest is the secondary one.
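
As an illustration of this rule, the sketch below ranks devices per user on a toy log. It is not the lab's implementation, which follows in the next cells: it sorts the per-device totals and numbers them with `cumcount`, so each user gets at most one primary and one secondary device, with ties broken arbitrarily. The toy data are made up.

```python
import pandas as pd

sessions = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "device_type": ["Mac Desktop", "iPhone", "Mac Desktop", "iPad Tablet", "Windows Desktop"],
    "secs_elapsed": [100.0, 250.0, 200.0, 50.0, 400.0],
})

# Total time per (user, device)
totals = sessions.groupby(["user_id", "device_type"], as_index=False)["secs_elapsed"].sum()

# Rank devices within each user, longest total time first
ranked = totals.sort_values(["user_id", "secs_elapsed"], ascending=[True, False])
ranked["rank"] = ranked.groupby("user_id").cumcount()

primary = ranked[ranked["rank"] == 0].set_index("user_id")["device_type"]
secondary = ranked[ranked["rank"] == 1].set_index("user_id")["device_type"]

print(primary.to_dict())
print(secondary.to_dict())
```

Users who only ever used one device simply have no entry in `secondary`.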

In [8]:
# Determine primary device
sessions_device = df_sessions.loc[:, ['user_id', 'device_type', 'secs_elapsed']]
# Total time spent by each user on each device
aggregated_lvl1 = sessions_device.groupby(['user_id', 'device_type'], as_index=False, sort=False).sum()
# Flag, for each user, the device(s) with the largest total time
idx = aggregated_lvl1.groupby(['user_id'], sort=False)['secs_elapsed'].transform('max') == aggregated_lvl1['secs_elapsed']
df_primary = pd.DataFrame(aggregated_lvl1.loc[idx , ['user_id', 'device_type', 'secs_elapsed']])
df_primary.rename(columns = {'device_type':'primary_device', 'secs_elapsed':'primary_secs'}, inplace=True)
df_primary = convert_to_binary(df=df_primary, column_to_convert='primary_device')
df_primary.drop('primary_device', axis=1, inplace=True)

df_primary.sample(n=5)

Unnamed: 0,user_id,primary_secs,prima_windows_de,prima_mac_deskto,prima_iphone,prima_ipad_table,prima_unknown,prima_android_ap,prima_linux_desk,prima_tablet,prima_chromebook,prima_android_ph,prima_ipodtouch,prima_blackberry,prima_windows_ph,prima_opera_phon
95538,p66tzhxm53,205597.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
41319,vazdmbkqe8,1290600.0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1926,mvy42eiv5g,3125968.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
163608,6w7z8r9k5v,574355.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
151785,idzwwxu4iy,191851.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [9]:
# Determine secondary device
# Discard the rows of each user's primary device, then repeat the same selection
remaining = aggregated_lvl1.drop(aggregated_lvl1.index[idx])
idx = remaining.groupby(['user_id'], sort=False)['secs_elapsed'].transform('max') == remaining['secs_elapsed']
df_secondary = pd.DataFrame(remaining.loc[idx , ['user_id', 'device_type', 'secs_elapsed']])
df_secondary.rename(columns = {'device_type':'secondary_device', 'secs_elapsed':'secondary_secs'}, inplace=True)
df_secondary = convert_to_binary(df=df_secondary, column_to_convert='secondary_device')
df_secondary.drop('secondary_device', axis=1, inplace=True)

df_secondary.sample(n=5)

Unnamed: 0,user_id,secondary_secs,secon_unknown,secon_android_ph,secon_ipad_table,secon_android_ap,secon_mac_deskto,secon_iphone,secon_windows_de,secon_linux_desk,secon_tablet,secon_blackberry,secon_windows_ph,secon_chromebook,secon_opera_phon,secon_ipodtouch
30956,wlwif7w7y0,227.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
51574,7ir5ybbf36,372427.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
92738,r3s2x12bta,90286.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
95318,sex52f0sbh,157173.0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
131709,x2izlb17o3,263078.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


#### 6.2.2. Determine Counts of Actions

The next thing we are going to do is count how many times each user took each action. This is a two-step process: first we count the occurrences of every (user, value) pair, then we pivot the result so that each row represents one user and each column one value.

To handle the multiple action columns (`action`, `action_type` and `action_detail`), we repeat these steps for each column individually, effectively creating three separate tables.

Because each row of these tables represents one user, we can then **join the three tables together** on the user id.
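
The two-step process can be sketched on a toy log as follows. This is a compact variant built on `groupby(...).size()` and `unstack`, shown only to illustrate counting and then pivoting. The lab's own `convert_to_counts` function is defined in the next cell, and the toy data are made up.

```python
import pandas as pd

log = pd.DataFrame({
    "user_id": ["a", "a", "a", "b"],
    "action": ["lookup", "search_results", "lookup", "lookup"],
})

# Step 1: count the rows for each (user, action) pair
pair_counts = log.groupby(["user_id", "action"]).size()

# Step 2: pivot the action level into columns, one row per user
counts = pair_counts.unstack(fill_value=0).add_prefix("action_")

print(counts)
```

Pairs that never occur, such as user `b` with `search_results`, are filled with 0 rather than NaN.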

In [10]:
# Count occurrences of value in a column
def convert_to_counts(df, id_col, column_to_convert):
    # Step 1: count the rows for each (id, value) pair
    df_counts = df.loc[:,[id_col, column_to_convert]]
    df_counts['count'] = 1
    df_counts = df_counts.groupby(by=[id_col, column_to_convert], as_index=False, sort=False).sum()
    
    # Step 2: pivot so that each row is one id and each column one value
    new_df = df_counts.pivot(index=id_col, columns=column_to_convert, values='count')
    new_df = new_df.fillna(0)
    
    # Rename Columns
    categories = list(df[column_to_convert].drop_duplicates())
    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        col_name = column_to_convert + '_' + cat_name
        new_df.rename(columns = {category:col_name}, inplace=True)
        
    return new_df

In [11]:
# Aggregate and combine actions taken columns

session_actions = df_sessions.loc[:,['user_id', 'action', 'action_type', 'action_detail']]
columns_to_convert = ['action', 'action_type', 'action_detail']

session_actions = session_actions.fillna('not provided')
first = True

for column in columns_to_convert:
    print("Converting " + column + " column...")
    current_data = convert_to_counts(df=session_actions, id_col='user_id', column_to_convert=column)

    # If first loop, current data becomes existing data, otherwise merge existing and current
    if first:
        first = False
        actions_data = current_data
    else:
        actions_data = pd.concat([actions_data, current_data], axis=1, join='inner')
        
actions_data.sample(n=5)

Converting action column...
Converting action_type column...
Converting action_detail column...


user_id,action_10,action_11,action_12,action_15,action_about_us,action_accept_decline,action_account,action_acculynk_bin_check_failed,action_acculynk_bin_check_success,action_acculynk_load_pin_pad,...,action_detail_view_resolutions,action_detail_view_search_results,action_detail_view_security_checks,action_detail_view_user_real_names,action_detail_wishlist,action_detail_wishlist_content_update,action_detail_wishlist_note,action_detail_your_listings,action_detail_your_reservations,action_detail_your_trips
5rj93jxuae,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,30.0,0.0,0.0,0.0,13.0,0.0,0.0,0.0,24.0
t6nsyzlmnk,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
vd7n1e4loq,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,36.0,0.0,0.0,0.0,32.0,0.0,0.0,0.0,0.0
1fvf8hr8tn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
kk3rgd1e2g,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,59.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


#### 6.2.3. Combine Data Sets

The final steps combine the various datasets we have created into one large dataset:

1. We combine the two device dataframes (`df_primary` and `df_secondary`) to create a device dataframe.
2. We combine the device dataframe with the actions dataframe to create a sessions dataframe with all the features we extracted from `sessions.csv`.
3. Finally, we combine the sessions dataframe with the training and testing dataframe.
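
These merges rely on `pd.concat` with `axis=1`, which aligns rows on the index. A minimal sketch on two made-up tables indexed by user id shows how `join='outer'` keeps every user while `join='inner'` keeps only the users present in both tables:

```python
import pandas as pd

devices = pd.DataFrame({"primary_secs": [300.0, 400.0]}, index=["a", "b"])
actions = pd.DataFrame({"action_lookup": [2, 5]}, index=["b", "c"])

# Outer join: union of the user ids, missing cells become NaN, then 0
outer = pd.concat([devices, actions], axis=1, join="outer").fillna(0)

# Inner join: only the user ids present in both tables
inner = pd.concat([devices, actions], axis=1, join="inner")

print(outer)
print(inner)
```

With an inner join in the last step, users without any session record drop out of the final dataset, so the choice of join directly determines how many rows remain.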

In [12]:
# Merge device datasets
df_primary.set_index('user_id', inplace=True)
df_secondary.set_index('user_id', inplace=True)
device_data = pd.concat([df_primary, df_secondary], axis=1, join="outer")

# Merge device and actions datasets
combined_results = pd.concat([device_data, actions_data], axis=1, join='outer')
df_sessions = combined_results.fillna(0)

# Merge user and session datasets
df_all.set_index('id', inplace=True)
df_all = pd.concat([df_all, df_sessions], axis=1, join='inner')

df_all.sample(n=5)

Unnamed: 0,age,gende_other,gende_female,gende_male,gende_unknown,signu_basic,signu_facebook,signu_google,signu_weibo,signu_24,...,action_detail_view_resolutions,action_detail_view_search_results,action_detail_view_security_checks,action_detail_view_user_real_names,action_detail_wishlist,action_detail_wishlist_content_update,action_detail_wishlist_note,action_detail_your_listings,action_detail_your_reservations,action_detail_your_trips
l4r31vw1qi,-1,0,0,0,1,1,0,0,0,0,...,0.0,65.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
957x12vgva,39,0,1,0,0,0,1,0,0,0,...,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f1gn7yj9am,27,0,1,0,0,0,1,0,0,0,...,0.0,16.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0
90g6d74oid,31,0,1,0,0,1,0,0,0,0,...,0.0,5.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0
o7x0v0sysa,37,0,0,1,0,1,0,0,0,0,...,0.0,12.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0


In [13]:
df_sessions = df_sessions.astype(int)
df_sessions.sample(n=5)

Unnamed: 0,primary_secs,prima_windows_de,prima_mac_deskto,prima_iphone,prima_ipad_table,prima_unknown,prima_android_ap,prima_linux_desk,prima_tablet,prima_chromebook,...,action_detail_view_resolutions,action_detail_view_search_results,action_detail_view_security_checks,action_detail_view_user_real_names,action_detail_wishlist,action_detail_wishlist_content_update,action_detail_wishlist_note,action_detail_your_listings,action_detail_your_reservations,action_detail_your_trips
fb002qrrlh,158970,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
07mba72zco,411779,0,0,0,0,1,0,0,0,0,...,0,3,0,0,0,0,0,0,0,0
z219kr5vlo,363604,1,0,0,0,0,0,0,0,0,...,0,6,0,0,0,5,0,0,0,0
9sdva32ng8,534654,1,0,0,0,0,0,0,0,0,...,0,4,0,0,0,5,0,0,0,0
4cqalco6ed,988704,0,0,0,1,0,0,0,0,0,...,0,41,0,0,0,0,0,0,0,0


In [14]:
# Recheck that no new NaN values appeared before saving our data
# No output is expected: the "country_destination" column has been dropped

check_NaN_Values_in_df(df_all) 

## 7. Saving the DataFrame to csv

In [15]:
# We create the output directory if necessary
if not os.path.exists("output"):
    os.makedirs("output")
    
# We export to csv
df_all.to_csv("output/enriched.csv", sep=',')
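
Since `df_all` is now indexed by user id, `to_csv` writes that index as the first column of the file. The round-trip sketch below assumes a toy dataframe and a temporary directory standing in for `output/` (both made up for illustration). It shows how to create the directory in one call with `exist_ok=True` and how to restore the index on reading with `index_col=0`:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"age": [39, 27]}, index=["957x12vgva", "f1gn7yj9am"])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "output")
    os.makedirs(out_dir, exist_ok=True)   # no error if it already exists

    path = os.path.join(out_dir, "enriched.csv")
    toy.to_csv(path, sep=",")

    # index_col=0 restores the user id as the index
    reloaded = pd.read_csv(path, index_col=0)

print(reloaded)
```

Without `index_col=0`, the user ids would come back as an ordinary column named `Unnamed: 0`.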