# Machine Learning - Hands on Lab - Session #1

* **Lecturer:** Jonathan DEKHTIAR
* **Date:** 2017-03-13
<br/><br/>
* **Contact:** [contact@jonathandekhtiar.eu](mailto:contact@jonathandekhtiar.eu)
* **Twitter:** [@born2data](https://twitter.com/born2data)
* **LinkedIn:** [JonathanDEKHTIAR](https://fr.linkedin.com/in/jonathandekhtiar)
* **Personal Website:** [JonathanDEKHTIAR](http://www.jonathandekhtiar.eu)
* **RSS Feed:** [FeedCrunch.io](https://www.feedcrunch.io/@dataradar/)
* **Tech. Blog:** [born2data.com](http://www.born2data.com/)
* **Github:** [DEKHTIARJonathan](https://github.com/DEKHTIARJonathan)
<br/><br/>

```
*************************************************************************
**
** 2017 March 13
**
** In place of a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
*************************************************************************
```

# 1. Project Initialisation

## 1.1 Importing Libraries

In [1]:
import os
from datetime import datetime

import numpy as np
import pandas as pd

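# NOTE: bokeh.charts was later moved out of Bokeh into the separate bkcharts package,
# so this import assumes the Bokeh release that was current in early 2017.
import bokeh.charts as bk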
import bokeh.plotting as bk_plt
import bokeh.models as bk_md

bk.output_notebook()

# 2. Dataset Loading & Visualisation
## 2.1 What does the expected "output" look like?

### Sample Submission File - sample_submission_NDF.csv

We load the csv file and visualise it straight away

In [2]:
df_sample_submission = pd.read_csv("../input/sample_submission_NDF.csv")

df_sample_submission.head(n=5) # Only display a few lines and not the whole dataframe

Unnamed: 0,id,country
0,5uwns89zht,NDF
1,jtl0dijy2j,NDF
2,xx0ulgorjt,NDF
3,6c6puo6ix0,NDF
4,czqhjk3yfe,NDF


As we can see, it is a fairly simple file with only two columns:
* **user identifier** 
* **country code** of destination

=> This file shows **how we must format our data** for the delivery.

There is *no data* in this file; it is only a **template**.
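
As a minimal sketch of what producing a file in this format could look like, the snippet below builds a two-column DataFrame and serialises it to CSV. The user ids are copied from the preview above, but the predicted countries and the variable names (`predictions`, `df_submission`) are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical predictions: one country code per test user id (illustrative values only)
predictions = {
    "5uwns89zht": "NDF",
    "jtl0dijy2j": "US",
    "xx0ulgorjt": "FR",
}

# Build a DataFrame with the same two columns as the template
df_submission = pd.DataFrame(
    {"id": list(predictions.keys()), "country": list(predictions.values())}
)

# Write to an in-memory buffer here; a real run would pass a file path instead
buffer = io.StringIO()
df_submission.to_csv(buffer, index=False)  # index=False avoids an extra unnamed column
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` matters here: without it, pandas would write the row index as an extra first column that the template does not contain.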

## 2.2 Loading Data Files

### 2.2.1 Test Users - test_users.csv

We load the csv file and visualise it straight away

In [3]:
df_test_users = pd.read_csv("../input/test_users.csv")
df_test_users.head(n=5) # Only display a few lines and not the whole dataframe

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser
0,5uwns89zht,2014-07-01,20140701000006,,FEMALE,35.0,facebook,0,en,direct,direct,untracked,Moweb,iPhone,Mobile Safari
1,jtl0dijy2j,2014-07-01,20140701000051,,-unknown-,,basic,0,en,direct,direct,untracked,Moweb,iPhone,Mobile Safari
2,xx0ulgorjt,2014-07-01,20140701000148,,-unknown-,,basic,0,en,direct,direct,linked,Web,Windows Desktop,Chrome
3,6c6puo6ix0,2014-07-01,20140701000215,,-unknown-,,basic,0,en,direct,direct,linked,Web,Windows Desktop,IE
4,czqhjk3yfe,2014-07-01,20140701000305,,-unknown-,,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Safari


Here is the information available on our **test users**:

* **id**: user ID
* **date_account_created**: date of account creation
* **timestamp_first_active**: timestamp of the first activity, note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
* **date_first_booking**: date of first booking
* **gender**
* **age**
* **signup_method**: the method used to sign up (Facebook, basic, etc.)
* **signup_flow**: the page the user came from to sign up
* **language**: international language preference
* **affiliate_channel**: what kind of paid marketing
* **affiliate_provider**: where the marketing is, e.g. google, craigslist, other
* **first_affiliate_tracked**: the first marketing the user interacted with before signing up
* **signup_app**
* **first_device_type**
* **first_browser**
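
The two date-like columns use different formats: `date_account_created` is an ISO date, while `timestamp_first_active` is a compact `YYYYMMDDHHMMSS` number. The sketch below shows one possible way to parse both, using two rows copied from the preview above. The `df_demo` frame and the derived `days_active_before_signup` column are hypothetical names introduced only for this example.

```python
import pandas as pd

# Two rows copied from the preview; read_csv would load the timestamp as an integer
df_demo = pd.DataFrame({
    "date_account_created": ["2014-07-01", "2014-07-01"],
    "timestamp_first_active": [20140701000006, 20140701000051],
})

# ISO date: YYYY-MM-DD
df_demo["date_account_created"] = pd.to_datetime(
    df_demo["date_account_created"], format="%Y-%m-%d"
)

# Compact timestamp: convert to string first, then parse as YYYYMMDDHHMMSS
df_demo["timestamp_first_active"] = pd.to_datetime(
    df_demo["timestamp_first_active"].astype(str), format="%Y%m%d%H%M%S"
)

# Hypothetical derived feature: whole days between first activity and account creation
df_demo["days_active_before_signup"] = (
    df_demo["date_account_created"] - df_demo["timestamp_first_active"].dt.normalize()
).dt.days

print(df_demo)
```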

### 2.2.2 countries.csv

We load the csv file and visualise it straight away

In [4]:
df_countries = pd.read_csv("../input/countries.csv")
df_countries.head(n=5) # Only display a few lines and not the whole dataframe

Unnamed: 0,country_destination,lat_destination,lng_destination,distance_km,destination_km2,destination_language,language_levenshtein_distance
0,AU,-26.853388,133.27516,15297.744,7741220.0,eng,0.0
1,CA,62.393303,-96.818146,2828.1333,9984670.0,eng,0.0
2,DE,51.165707,10.452764,7879.568,357022.0,deu,72.61
3,ES,39.896027,-2.487694,7730.724,505370.0,spa,92.25
4,FR,46.232193,2.209667,7682.945,643801.0,fra,92.06


We have a file containing a list of countries with the following features:

* country code
* latitude
* longitude
* distance in km
* area in km2
* language
* Levenshtein distance between the destination language and English

=> **General information about countries of destination**

### 2.2.3 age_gender_bkts.csv

We load the csv file and visualise it straight away

In [5]:
df_age_gender_bkts = pd.read_csv("../input/age_gender_bkts.csv")
df_age_gender_bkts.head(n=5) # Only display a few lines and not the whole dataframe

Unnamed: 0,age_bucket,country_destination,gender,population_in_thousands,year
0,100+,AU,male,1.0,2015.0
1,95-99,AU,male,9.0,2015.0
2,90-94,AU,male,47.0,2015.0
3,85-89,AU,male,118.0,2015.0
4,80-84,AU,male,199.0,2015.0


Displayed like this, it is fairly hard to see what this file contains.

Let's sort the file by the first column ("age_bucket") and then by the third one ("gender").

In [6]:
tmp_sorted = df_age_gender_bkts.sort_values(by=['age_bucket', 'gender'])
tmp_sorted.head(n=12) # Only display a few lines and not the whole dataframe

Unnamed: 0,age_bucket,country_destination,gender,population_in_thousands,year
41,0-4,AU,female,781.0,2015.0
80,0-4,CA,female,991.0,2015.0
95,0-4,DE,female,1713.0,2015.0
135,0-4,ES,female,1198.0,2015.0
201,0-4,FR,female,1938.0,2015.0
235,0-4,GB,female,1888.0,2015.0
276,0-4,IT,female,1383.0,2015.0
324,0-4,NL,female,438.0,2015.0
363,0-4,PT,female,225.0,2015.0
408,0-4,US,female,10306.0,2015.0


Now it appears that we have a file containing demographic information by country, by age and by gender.

For each "age-bucket", "country" and gender, we have the following information:
* Population in thousands
* year of the statistic
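
Note that sorting on `age_bucket` is lexicographic, because the buckets are text labels: "0-4" comes first, but "10-14" and "100+" would come before "5-9". If a numeric order is needed, one possible approach is to sort on the lower bound of each bucket. The sketch below assumes a pandas version that supports the `key` argument of `sort_values`; the helper `bucket_lower_bound` and the sample frame are hypothetical.

```python
import pandas as pd

# Illustrative subset of bucket labels, deliberately out of order
df_demo = pd.DataFrame({"age_bucket": ["100+", "5-9", "0-4", "10-14", "95-99"]})

def bucket_lower_bound(bucket):
    """Return the numeric lower bound of a bucket label ("100+" -> 100, "5-9" -> 5)."""
    return int(bucket.rstrip("+").split("-")[0])

# key receives the whole column and must return a column of sortable values
df_sorted = df_demo.sort_values(
    by="age_bucket", key=lambda col: col.map(bucket_lower_bound)
)

print(df_sorted["age_bucket"].tolist())
```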

In [7]:
tmp_sorted.year.unique() # We query the distinct values over the last column "year"

array([ 2015.])

=> We can conclude that we only have data for the year 2015!

### 2.2.4 train_users_2.csv

We load the csv file and visualise it straight away

In [8]:
df_train_users = pd.read_csv("../input/train_users_2.csv")
df_train_users.head(n=5) # Only display a few lines and not the whole dataframe

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,country_destination
0,gxn3p5htnn,2010-06-28,20090319043255,,-unknown-,,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,NDF
1,820tgsjxq7,2011-05-25,20090523174809,,MALE,38.0,facebook,0,en,seo,google,untracked,Web,Mac Desktop,Chrome,NDF
2,4ft3gnwmtx,2010-09-28,20090609231247,2010-08-02,FEMALE,56.0,basic,3,en,direct,direct,untracked,Web,Windows Desktop,IE,US
3,bjjt8pjhuk,2011-12-05,20091031060129,2012-09-08,FEMALE,42.0,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Firefox,other
4,87mebub9p4,2010-09-14,20091208061105,2010-02-18,-unknown-,41.0,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,US


Here is the information available on our **training users**:

* **id**: user ID
* **date_account_created**: date of account creation
* **timestamp_first_active**: timestamp of the first activity, note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
* **date_first_booking**: date of first booking
* **gender**
* **age**
* **signup_method**: the method used to sign up (Facebook, basic, etc.)
* **signup_flow**: the page the user came from to sign up
* **language**: international language preference
* **affiliate_channel**: what kind of paid marketing
* **affiliate_provider**: where the marketing is, e.g. google, craigslist, other
* **first_affiliate_tracked**: the first marketing the user interacted with before signing up
* **signup_app**
* **first_device_type**
* **first_browser**
* <span style="color:red">**country_destination**: this is the **target variable** that we want to predict </span>

**Remark on country_destination:** 12 possible values: **US**, **FR**, **CA**, **GB**, **ES**, **IT**, **PT**, **NL**, **DE**, **AU**, **NDF** (no destination found), and **other**

### 2.2.5 sessions.csv

We load the csv file and visualise it straight away

In [9]:
df_sessions = pd.read_csv("../input/sessions.csv")

print("There are " + str(df_sessions.shape[0])+ " rows in the dataset")

df_sessions.head(n=5) # Only display a few lines and not the whole dataframe

There are 10567737 rows in the dataset


Unnamed: 0,user_id,action,action_type,action_detail,device_type,secs_elapsed
0,d1mm9tcy42,lookup,,,Windows Desktop,319.0
1,d1mm9tcy42,search_results,click,view_search_results,Windows Desktop,67753.0
2,d1mm9tcy42,lookup,,,Windows Desktop,301.0
3,d1mm9tcy42,search_results,click,view_search_results,Windows Desktop,22141.0
4,d1mm9tcy42,lookup,,,Windows Desktop,435.0


## 1.3 Summary - Important Information

### 1.3.1 Train - Test Split Analysis

In [10]:
training_min_date = df_train_users.date_account_created.min()
training_max_date = df_train_users.date_account_created.max()

print ("Training Date Range : [" + training_min_date +", " + training_max_date +"]")

# ===============

testing_min_date = df_test_users.date_account_created.min()
testing_max_date = df_test_users.date_account_created.max()

print ("Testing Date Range : [" + testing_min_date +", " + testing_max_date +"]")

Training Date Range : [2010-01-01, 2014-06-30]
Testing Date Range : [2014-07-01, 2014-09-30]


### 1.3.2 - Summary

- All the users in this dataset are from the USA (as stated on Kaggle)
- 12 possible values for the country of destination
- Data are split by date:
    - _Train data_: [2010, June 2014] 
    - _Test data_: [July 2014, September 2014]
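
Because the train/test split is chronological, a validation set carved out of the training data could follow the same logic. The sketch below is one possible way to do it, assuming a hypothetical cutoff that holds out the last three months of training data (the same length as the test window). The sample dates and the names `df_fit` and `df_holdout` are made up for illustration.

```python
import pandas as pd

# Tiny illustrative sample; the dates are made up but fall inside the training range
df_demo = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "date_account_created": [
        "2012-03-10", "2013-11-02", "2014-02-15", "2014-04-20", "2014-06-30",
    ],
})
df_demo["date_account_created"] = pd.to_datetime(df_demo["date_account_created"])

# Hypothetical cutoff: everything from April 2014 onwards is held out
cutoff = pd.Timestamp("2014-04-01")

df_fit = df_demo[df_demo["date_account_created"] < cutoff]       # used for fitting
df_holdout = df_demo[df_demo["date_account_created"] >= cutoff]  # used for validation

print(len(df_fit), "rows for fitting,", len(df_holdout), "rows held out")
```

Unlike a random split, this keeps every held-out account strictly later than the accounts used for fitting, which is closer to the situation at prediction time.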

# 1.4 Train Data Analysis

Let us focus on training data. 

We transpose the DataFrame for easier reading and to avoid horizontal scrolling.
We take the first 6 examples to get a sense of what is happening.

In [11]:
df_train_users.transpose().loc[:, :5] # .ix was removed from pandas; .loc slices labels inclusively (columns 0 to 5)

Unnamed: 0,0,1,2,3,4,5
id,gxn3p5htnn,820tgsjxq7,4ft3gnwmtx,bjjt8pjhuk,87mebub9p4,osr2jwljor
date_account_created,2010-06-28,2011-05-25,2010-09-28,2011-12-05,2010-09-14,2010-01-01
timestamp_first_active,20090319043255,20090523174809,20090609231247,20091031060129,20091208061105,20100101215619
date_first_booking,,,2010-08-02,2012-09-08,2010-02-18,2010-01-02
gender,-unknown-,MALE,FEMALE,FEMALE,-unknown-,-unknown-
age,,38,56,42,41,
signup_method,facebook,facebook,basic,facebook,basic,basic
signup_flow,0,0,3,0,0,0
language,en,en,en,en,en,en
affiliate_channel,direct,seo,direct,direct,direct,other


## 1.4.1 Feature Analysis

In [12]:
# How many rows in the DataFrame 

row_count = len(df_train_users.index)
print ("Row Count = " + str(row_count))
print("\n")
# ===============================

# Print what's the size of the ID
field_length = df_train_users.id.astype(str).map(len)

id_maxlength = field_length.max()
id_minlength = field_length.min()

if (id_maxlength != id_minlength):
    print ("ID Length = [" + str(id_minlength) + ", " + str(id_maxlength) + "]")
else:
    print ("ID Length = " + str(id_maxlength))

print("\n")

# ===============================

# Count NaN Values for date_first_booking
NaN_Count_date_first_booking = df_train_users.date_first_booking.isnull().sum()
print ("date_first_booking NaN Count = " + str(NaN_Count_date_first_booking) \
+ " && " +  str("%.2f" % (float(NaN_Count_date_first_booking)/row_count*100)) +"%")

print("\n")
# ===============================

# Possible Values for gender

gender_repartition = df_train_users.gender.value_counts()
for gender, count in gender_repartition.items():
    print ("Gender: " + gender + " && Count: " + str(count) + " && " \
           + str("%.2f" % (float(count)/row_count*100)) +"%")

print("\n")
# ===============================

# Count NaN Values for age
NaN_Count_age = df_train_users.age.isnull().sum()
print ("age NaN Count = " + str(NaN_Count_age)  + " && "  \
       +  str("%.2f" % (float(NaN_Count_age)/row_count*100)) +"%")

print("\n")
# ===============================

# Possible Values for signup_method

signup_method_repartition = df_train_users.signup_method.value_counts()
for method, count in signup_method_repartition.items():
    print ("Method: " + method + " && Count: " + str(count) + " && " \
           + str("%.2f" % (float(count)/row_count*100)) +"%")
    
print("\n")
# ===============================

# Possible Values for language

language_repartition = df_train_users.language.value_counts()
for language, count in language_repartition.items():
    print ("language: " + language + " && Count: " + str(count)  \
           + " && " + str("%.2f" % (float(count)/row_count*100)) +"%")
    break
    
print("\n")

# ===============================

# Possible Values for affiliate_channel

affiliate_channel_repartition = df_train_users.affiliate_channel.value_counts()
i = 0
for channel, count in affiliate_channel_repartition.items():
    print ("affiliate_channel: " + channel + " && Count: " + str(count) + " && " \
           + str("%.2f" % (float(count)/row_count*100)) +"%")
    i += 1
    if (i == 3):
        break
print("\n")

# ===============================

# Possible Values for first_affiliate_tracked

first_affiliate_tracked_repartition = df_train_users.first_affiliate_tracked.value_counts()
i = 0
for affiliate, count in first_affiliate_tracked_repartition.items():
    print ("first_affiliate_tracked: " + affiliate + " && Count: " + str(count) + \
    " && " + str("%.2f" % (float(count)/row_count*100)) +"%")
    
    i += 1
    if (i == 3):
        break
print("\n")

# ===============================

# Possible Values for signup_app

signup_app_repartition = df_train_users.signup_app.value_counts()
for signup_app, count in signup_app_repartition.items():
    print ("signup_app: " + signup_app + " && Count: " + str(count) + " && " \
           + str("%.2f" % (float(count)/row_count*100)) +"%")
print("\n")

# ===============================

# Possible Values for first_browser

first_browser_repartition = df_train_users.first_browser.value_counts()
i = 0
for first_browser, count in first_browser_repartition.items():
    print ("first_browser: " + first_browser + " && Count: " + str(count) + \
    " && " + str("%.2f" % (float(count)/row_count*100)) +"%")
    
    i += 1
    if (i == 6):
        break
print("\n")

Row Count = 213451


ID Length = 10


date_first_booking NaN Count = 124543 && 58.35%


Gender: -unknown- && Count: 95688 && 44.83%
Gender: FEMALE && Count: 63041 && 29.53%
Gender: MALE && Count: 54440 && 25.50%
Gender: OTHER && Count: 282 && 0.13%


age NaN Count = 87990 && 41.22%


Method: basic && Count: 152897 && 71.63%
Method: facebook && Count: 60008 && 28.11%
Method: google && Count: 546 && 0.26%


language: en && Count: 206314 && 96.66%


affiliate_channel: direct && Count: 137727 && 64.52%
affiliate_channel: sem-brand && Count: 26045 && 12.20%
affiliate_channel: sem-non-brand && Count: 18844 && 8.83%


first_affiliate_tracked: untracked && Count: 109232 && 51.17%
first_affiliate_tracked: linked && Count: 46287 && 21.69%
first_affiliate_tracked: omg && Count: 43982 && 20.61%


signup_app: Web && Count: 182717 && 85.60%
signup_app: iOS && Count: 19019 && 8.91%
signup_app: Moweb && Count: 6261 && 2.93%
signup_app: Android && Count: 5454 && 2.56%


first_browser: Chrome && Count: 
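
The cell above repeats the same count-and-percentage pattern for each column. As a possible simplification, the sketch below wraps that pattern in a small helper. `category_shares` is a hypothetical function written for this example, not part of pandas, and the sample Series is illustrative rather than taken from the dataset.

```python
import pandas as pd

def category_shares(series, top_n=None):
    """Return the count and percentage of each value, most frequent first."""
    counts = series.value_counts()  # NaN values are excluded by default
    shares = pd.DataFrame({
        "count": counts,
        # Divide by the full length so missing values still weigh on the percentages
        "percent": (counts / len(series) * 100).round(2),
    })
    return shares if top_n is None else shares.head(top_n)

# Illustrative data only
demo = pd.Series(
    ["basic", "basic", "facebook", "basic", "google", "facebook", "basic", "basic"]
)

print(category_shares(demo, top_n=2))
```

With such a helper, each block of the analysis could shrink to a single call such as `category_shares(df_train_users.signup_method)`.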

### 1.4.2 Output Analysis - Is the dataset balanced?

### 1.4.2.1 Dummy Analysis

Let's plot the distribution of bookings by destination in order to see whether the dataset is balanced or not.

In [13]:
dict_output = dict()

country_destination_repartition = df_train_users.country_destination.value_counts()

for country_destination, count in country_destination_repartition.items():
    dict_output[country_destination] = float(count)/row_count*100

df_output = pd.DataFrame(list(dict_output.items()),
                      columns=['Country', 'Repartition'])

df_output.sort_values(by=['Repartition'], ascending=False).head(n=5)

Unnamed: 0,Country,Repartition
11,NDF,58.347349
6,US,29.222632
3,other,4.728954
4,FR,2.353233
2,IT,1.328174


In [14]:
p = bk.Bar(df_output, label=['Country'], values='Repartition', ylabel='Booking Proportion in %')

bk.show(p)

### 1.4.2.2 Same visualisation, with data grouped by account-creation year

In [15]:
df_output_by_year_and_month = df_train_users.loc[:,['country_destination','date_account_created']]

df_output_by_year_and_month.loc[:,'year_account_created'] = df_output_by_year_and_month['date_account_created'].apply(
    lambda x: datetime.strptime(x, "%Y-%m-%d").strftime("%Y")
)

df_output_by_year_and_month.loc[:,'month_account_created'] = df_output_by_year_and_month['date_account_created'].apply(
    lambda x: datetime.strptime(x, "%Y-%m-%d").strftime("%m")
)

df_output_by_year_and_month.loc[:,'year_month_account_created'] = df_output_by_year_and_month['date_account_created'].apply(
    lambda x: datetime.strptime(x, "%Y-%m-%d").strftime("%Y - %m")
)

df_output_by_year_and_month.head(n=3)

Unnamed: 0,country_destination,date_account_created,year_account_created,month_account_created,year_month_account_created
0,NDF,2010-06-28,2010,6,2010 - 06
1,NDF,2011-05-25,2011,5,2011 - 05
2,US,2010-09-28,2010,9,2010 - 09


In [16]:
tmp_df = df_output_by_year_and_month.loc[:,['country_destination','year_account_created']]
rslt = tmp_df.groupby("year_account_created")['country_destination'].value_counts()

list_year = [x[0] for x in rslt.index.values]
list_dest = [x[1] for x in rslt.index.values]
list_count = rslt.values

df_tmp = pd.DataFrame(
    {'Year': list_year,
     'Destination': list_dest,
     'Count': list_count
    }
)

df_tmp.sort_values(by=["Destination", "Year"]).head(n=5)

Unnamed: 0,Count,Destination,Year
10,7,AU,2010
22,38,AU,2011
34,103,AU,2012
46,239,AU,2013
58,152,AU,2014


In [17]:
df_pivot = pd.pivot_table(df_tmp, values='Count', index=['Destination'], columns=['Year'], 
                          aggfunc=np.sum, margins=True).astype(int)

total_by_years = df_pivot.loc[['All']]

for year, col in total_by_years.items():
    subtotal_by_year = col.iloc[0]
    
    df_pivot[year] = df_pivot[year]/subtotal_by_year*100
    
df_pivot = df_pivot.sort_values(by="All", ascending=False).drop("All") # We sort by the column "All" and remove the line "All"
df_pivot

Destination,2010,2011,2012,2013,2014,All
NDF,42.503587,45.367304,55.022553,59.15622,61.76209,58.347349
US,44.045911,38.097665,31.131215,28.85487,26.729527,29.222632
other,2.761836,4.738854,4.85277,4.634764,4.837444,4.728954
FR,4.304161,3.983015,2.840708,2.232401,1.910653,2.353233
IT,1.07604,1.690021,1.459632,1.248795,1.299924,1.328174
GB,1.004304,1.469214,1.310121,1.042671,0.969058,1.088774
ES,1.542324,1.673036,1.191019,0.994455,0.933748,1.053638
CA,1.506456,1.061571,0.737418,0.626808,0.588497,0.669006
DE,0.573888,0.815287,0.734884,0.486982,0.333482,0.49707
NL,0.394548,0.577495,0.375044,0.342334,0.32825,0.356991
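
The same kind of per-year percentage table can also be obtained more directly. The sketch below uses `pd.crosstab` with `normalize="columns"`, which divides each year's column by its own total. It is an alternative written for illustration, not the method used in the cells above, and it runs on a small made-up sample instead of the real data.

```python
import pandas as pd

# Illustrative sample: 4 accounts created in 2013 and 4 in 2014
df_demo = pd.DataFrame({
    "year_account_created": ["2013", "2013", "2013", "2013", "2014", "2014", "2014", "2014"],
    "country_destination": ["NDF", "NDF", "US", "FR", "NDF", "NDF", "NDF", "US"],
})

# Rows are destinations, columns are years; each column sums to 100
shares = pd.crosstab(
    df_demo["country_destination"],
    df_demo["year_account_created"],
    normalize="columns",
) * 100

print(shares)
```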


In [18]:
df_pivot2 = df_pivot.copy()

df_pivot2["Destination"] = df_pivot.index
df_pivot2 = pd.melt(df_pivot2, id_vars="Destination")
df_pivot2 = df_pivot2.drop(df_pivot2[df_pivot2.Year == "All"].index) # We remove the column 'All' from the plotting

df_pivot2.head(n=3)

Unnamed: 0,Destination,Year,value
0,NDF,2010,42.503587
1,US,2010,44.045911
2,other,2010,2.761836


In [19]:
p = bk.Bar(df_pivot2, label=['Destination'], values='value', group="Year", legend='top_right', ylabel='Booking Proportion in %')
bk.show(p)

### 1.4.2.3 Conclusion

Over the years, the distribution of destinations has changed. Since we make predictions for users who signed up from **July 2014** onwards, this shift may require us to focus on the most recent data during training: *it is the most likely to follow patterns similar to the test data*.

## 1.4.3 Account Creation Analysis

We want to evaluate how much data would remain if we only considered the last N months.
We have seen that the "_early users_" are very different from the current ones. Let's see how much data we can remove from the dataset.

In [20]:
tmp_df = df_output_by_year_and_month[['country_destination','year_month_account_created']].sort_values(
    by="year_month_account_created"
)

tmp_df = tmp_df.groupby(["year_month_account_created"]).agg("count")
tmp_df.head(n=4)

year_month_account_created,country_destination
2010 - 01,61
2010 - 02,102
2010 - 03,163
2010 - 04,157


In [21]:
# Plot Data
x_data = [str(x) for x in tmp_df["country_destination"].index.values]
y_data = tmp_df["country_destination"].values

p = bk_plt.figure(plot_width=980, plot_height=400, y_range=[0, y_data.max()+1000], x_range=x_data)

p.line(x_data, y_data, line_width=2)
p.circle(x_data, y_data, fill_color="white", size=8)


# Draw a line when X% of the users of the dataset have signed up
split_1 = 20 # 80% left
split_2 = 30 # 70% left
split_3 = 50 # 50% left
split_4 = 70 # 30% left

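total_users = y_data.sum() # Actual number of users, instead of a hard-coded approximation
thrs_value_1 = total_users * split_1 / 100
thrs_value_2 = total_users * split_2 / 100
thrs_value_3 = total_users * split_3 / 100
thrs_value_4 = total_users * split_4 / 100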

index_1 = 0
index_2 = 0
index_3 = 0
index_4 = 0

tmp_sum = 0
for val in y_data:
    tmp_sum += val
    
    if tmp_sum <= thrs_value_1:
        index_1 += 1     
        index_2 += 1
        index_3 += 1
        index_4 += 1
    elif tmp_sum <= thrs_value_2:    
        index_2 += 1
        index_3 += 1
        index_4 += 1
    elif tmp_sum <= thrs_value_3:
        index_3 += 1
        index_4 += 1
    elif tmp_sum <= thrs_value_4:
        index_4 += 1
    else:
        break
           
        
vline_1 = bk_md.Span(location=index_1, dimension='height', line_color='blue', line_width=3) # 80% Data Left
vline_2 = bk_md.Span(location=index_2, dimension='height', line_color='green', line_width=3) # 70% Data Left
vline_3 = bk_md.Span(location=index_3, dimension='height', line_color='orange', line_width=3) # 50% Data Left
vline_4 = bk_md.Span(location=index_4, dimension='height', line_color='red', line_width=3) # 30% Data Left

p.renderers.extend([vline_1, vline_2, vline_3, vline_4])

# We set label orientation
from math import pi
p.xaxis.major_label_orientation = pi/2

# We display the graph
bk_plt.show(p)
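
The threshold indices computed by the if/elif chain above can also be expressed with a cumulative sum. The sketch below is an equivalent formulation on made-up monthly counts, assuming the counts are in chronological order: for each threshold, it counts the months whose running total is still at or below that threshold.

```python
import numpy as np

# Illustrative monthly sign-up counts (made up), in chronological order
monthly_counts = np.array([10, 20, 30, 40, 100])

cumulative = np.cumsum(monthly_counts)  # running totals: 10, 30, 60, 100, 200

# Thresholds at 20%, 30%, 50% and 70% of all users: 40, 60, 100, 140
thresholds = cumulative[-1] * np.array([20, 30, 50, 70]) / 100

# side="right" returns, for each threshold, how many running totals are <= it
indices = np.searchsorted(cumulative, thresholds, side="right")

print(indices.tolist())
```

On this sample the result is `[2, 3, 4, 4]`, the same values the loop would produce for `index_1` to `index_4`.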

If we only consider 2013 and 2014, we still have more than 70% of the data for training.
This is **important**, because the _session data_ start in **January 2014**.

_"In the sessions dataset, the data only dates back to 1/1/2014, while the users dataset dates back to 2010."_
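
As a rough sketch of how such a restriction could be applied, the snippet below keeps only the accounts created from 2013 onwards and reports the share of rows retained. The sample dates and the names `recent` and `kept_percent` are made up for illustration; the cutoff is the one discussed above.

```python
import pandas as pd

# Illustrative sample of account creation dates
df_demo = pd.DataFrame({
    "date_account_created": [
        "2010-06-28", "2012-09-14", "2013-01-05", "2013-12-31", "2014-03-02",
    ],
})

# ISO dates (YYYY-MM-DD) sort correctly as text, so a string comparison is enough here
recent = df_demo[df_demo["date_account_created"] >= "2013-01-01"]

kept_percent = len(recent) / len(df_demo) * 100
print("Rows kept: " + str(len(recent)) + " (" + "%.2f" % kept_percent + "%)")
```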

## 1.4.4 Reported Ages of Users

In [22]:
df_age = df_train_users[["age"]]

na_count = df_age.isnull().sum().iloc[0]
row_count = df_age.shape[0]
na_percent = "%.2f" % (float(na_count) / row_count * 100)

print ("Count of NaN Values = " + str(na_count) + " => " + str(na_percent) + "%")
print ("Total Row Count = " + str(row_count))

df_age = df_age.dropna().reset_index(drop=True)
df_age.head(n=3)

Count of NaN Values = 87990 => 41.22%
Total Row Count = 213451


Unnamed: 0,age
0,38.0
1,56.0
2,42.0


In [23]:
df_age_bucket = df_age.groupby(pd.cut(df_age['age'], np.arange(0,999999,5))).count()

idx_tmp = df_age_bucket.index.values
bckt_count_tmp = df_age_bucket.age.values

data = dict()

for bck, cnt in zip(idx_tmp, bckt_count_tmp):
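    # str() keeps this working whether pandas labels the buckets as strings or as Interval objects
    start_point = int(str(bck)[1:].partition(",")[0])
    end_point = int(str(bck)[:-1].partition(",")[2])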
    
    range_name = "("+ "%03d" % (start_point) + "," + "%03d" % (end_point) + "]"
    if start_point < 100:
        data[range_name] = cnt
    elif start_point >= 100 and start_point < 1000:
        if not "(100, 999]" in data:
            data["(100, 999]"] = cnt
        else:
            data["(100, 999]"] += cnt
    else:
        if not "+1000" in data:
            data["+1000"] = cnt
        else:
            data["+1000"] += cnt
            
df_age_stats = pd.DataFrame({
    'Bucket': list(data.keys()),
    'Count': list(data.values())
}).sort_values(by="Bucket").reset_index(drop=True)

df_age_stats.head(n=5)

Unnamed: 0,Bucket,Count
0,"(000,005]",57
1,"(005,010]",0
2,"(010,015]",8
3,"(015,020]",2404
4,"(020,025]",12825


In [24]:
x_range = df_age_stats["Bucket"].values

bar2 = bk.Bar(
    df_age_stats, 
    values = 'Count', 
    label  = ['Bucket'],
    title  = "Reported Ages of Users", 
    width  = 980,
    ylabel = "Number of Users by Bucket"
)

bk.show(bar2)