## EDA Lab

## Data understanding and preprocessing - Walkthrough of Traveler dataset




### * Goal: Predict the country that users will make their first booking in, based on some basic user profile data.



### DATA PREPROCESSING (Data Cleaning and Data Transformation on train and test csv)

#### [1] Data Cleaning

#### Some information:

Distinguishing between a cat and a dog is a classic machine learning problem.
To solve it, we need to know what a cat looks like and what a dog looks like.

Data set (Cat and Dog dataset)
|
Data Preprocessing (Rich features are features that are created from existing features. Choosing which attributes to keep is called attribute selection; it can be done with forward selection, backward elimination, or a decision tree)
|
Learning Model (Classification)
|
Performance Evaluation

We can plot the cats as circles and the dogs as squares.
If a given data point falls in the region of cats, it is a cat, and similarly for dogs.

Say we have 15 attributes for 12 countries: the countries then live in a 15-dimensional space. We cannot visualize this, but machines can work with it. This number of attributes might even be too few for a machine.

Before data preprocessing, data cleaning will eliminate some attributes and thus reduce the number passed to the learning model.

Remove the destination country from the features, then feed the remaining information into the learning model to predict the destination country of a user.

This is multiclass classification.
The dog and cat problem is binary (two-class) classification.

This notebook is compressed and we need to perform the analysis of the data ourselves.

### Milestones:
#### 1) Understanding the data
Or rather understanding the literature of the data.
Raise questions such as:
What are the features, the number of records, and the format of the data? How do different features relate to each other?
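
The questions above can be answered with a few pandas calls. This is a minimal sketch on a made-up toy table (the `users` frame and its values are assumptions for illustration, not the real dataset); the same calls would apply to `df_train`.

```python
import pandas as pd

# Toy stand-in for the user table; the values are made up for illustration.
users = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],
    "gender": ["MALE", "FEMALE", "-unknown-", "FEMALE"],
    "age": [38.0, 56.0, None, 42.0],
    "signup_flow": [0, 3, 0, 0],
})

n_records, n_features = users.shape              # number of records and features
dtypes = users.dtypes                            # format (dtype) of each feature
missing = users.isnull().sum()                   # missing values per feature
gender_counts = users["gender"].value_counts()   # frequency of each category
corr = users[["age", "signup_flow"]].corr()      # how numeric features relate

print(n_records, n_features)
print(dtypes)
print(missing)
print(gender_counts)
print(corr)
```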

##### Review the dataset
We need more data beyond the given samples.

A user might visit a site 10 times before making a booking on the 11th visit. How do we record the sessions of a user? This is what sessions.csv contains.
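
One way to use session records is to summarise them per user before joining them to the user table. The sketch below assumes hypothetical column names (`user_id`, `action`, `secs_elapsed`) and made-up rows; check the actual header of sessions.csv before relying on them.

```python
import pandas as pd

# Hypothetical session log: one row per event. Column names are assumptions.
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "action": ["search", "view", "book", "search", "view"],
    "secs_elapsed": [30.0, 120.0, 60.0, 45.0, 15.0],
})

# Collapse the many rows of each user into one row of summary features.
per_user = sessions.groupby("user_id").agg(
    n_events=("action", "count"),
    total_secs=("secs_elapsed", "sum"),
)
print(per_user)
```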





In [1]:
##Exploring Traveler data
import pandas as pd
import matplotlib.pyplot as plt
%pylab inline 

print("Reading data...")
train_file = "./traveler_dataset/train_users_2.csv"
df_train = pd.read_csv(train_file, header = 0,index_col=None)

test_file = "./traveler_dataset/test_users.csv"
df_test = pd.read_csv(test_file, header = 0,index_col=None)

# Combining into one dataset for cleaning
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True, sort=False)
print("Reading data...completed")

# Fixing date formats in Pandas - to_datetime
## Change dates to specific format
print("Fixing timestamps...")
df_all['date_account_created'] = pd.to_datetime(df_all['date_account_created'], format='%Y-%m-%d')
df_all['timestamp_first_active'] = pd.to_datetime(df_all['timestamp_first_active'], format='%Y%m%d%H%M%S')
print("Fixing timestamps...completed")

## Removing date_first_booking column
df_all.drop('date_first_booking', axis = 1, inplace = True)
print("Droped date_first_booking column...")

import numpy as np

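## Remove outliers function - [1]
# Values <= min_val or >= max_val are treated as invalid and set to NaN.
def remove_outliers(df, column, min_val, max_val):
    col_values = df[column].values
    # np.nan is used because the np.NaN alias was removed in NumPy 2.0
    df[column] = np.where(np.logical_or(col_values<=min_val, col_values>=max_val), np.nan, col_values)
    return df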

## Fixing age column - [2]
print("Fixing age column...")
df_all = remove_outliers(df = df_all, column = 'age', min_val = 15, max_val = 90)
df_all['age'] = df_all['age'].fillna(-1)  # assign back; inplace fillna on a selected column is deprecated in recent pandas
print("Fixing age column...completed")

# Other column missing value - Fill first_affiliate_tracked column
print("Filling first_affiliate_tracked column...")
df_all['first_affiliate_tracked'] = df_all['first_affiliate_tracked'].fillna(-1)  # assign back instead of inplace
print("Filling first_affiliate_tracked column...completed")

# df_all.head()
df_all.head().transpose()
# transpose() transposes the dataframe. 

%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib
Reading data...
Reading data...completed
Fixing timestamps...
Fixing timestamps...completed
Droped date_first_booking column...
Fixing age column...
Fixing age column...completed
Filling first_affiliate_tracked column...
Filling first_affiliate_tracked column...completed


| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| id | gxn3p5htnn | 820tgsjxq7 | 4ft3gnwmtx | bjjt8pjhuk | 87mebub9p4 |
| date_account_created | 2010-06-28 00:00:00 | 2011-05-25 00:00:00 | 2010-09-28 00:00:00 | 2011-12-05 00:00:00 | 2010-09-14 00:00:00 |
| timestamp_first_active | 2009-03-19 04:32:55 | 2009-05-23 17:48:09 | 2009-06-09 23:12:47 | 2009-10-31 06:01:29 | 2009-12-08 06:11:05 |
| gender | -unknown- | MALE | FEMALE | FEMALE | -unknown- |
| age | -1.0 | 38.0 | 56.0 | 42.0 | 41.0 |
| signup_method | facebook | facebook | basic | facebook | basic |
| signup_flow | 0 | 0 | 3 | 0 | 0 |
| language | en | en | en | en | en |
| affiliate_channel | direct | seo | direct | direct | direct |
| affiliate_provider | direct | google | direct | direct | direct |


#### [2] Data Transformation and Feature Extraction

Data transformation is the conversion of data from one form to another.

In [2]:
# Own implementation of One Hot Encoding - Data Transformation
def convert_to_binary(df, column_to_convert):
    categories = list(df[column_to_convert].drop_duplicates())

    for category in categories:
        cat_name = str(category).replace(" ", "_").replace("(", "").replace(")", "").replace("/", "_").replace("-", "").lower()
        # Use the full column name as the prefix. Truncating it to 5 characters made
        # affiliate_channel and affiliate_provider both produce 'affil_direct',
        # so the later column silently overwrote the earlier one.
        col_name = column_to_convert + '_' + cat_name
        df[col_name] = 0
        df.loc[(df[column_to_convert] == category), col_name] = 1

    return df

# One Hot Encoding
print("One Hot Encoding categorical data...")
columns_to_convert = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser']

for column in columns_to_convert:
    df_all = convert_to_binary(df=df_all, column_to_convert=column)
    df_all.drop(column, axis=1, inplace=True)
print("One Hot Encoding categorical data...completed")

# Add new date related fields - Creating New Features
print("Adding new fields...")
df_all['day_account_created'] = df_all['date_account_created'].dt.weekday
df_all['month_account_created'] = df_all['date_account_created'].dt.month
df_all['quarter_account_created'] = df_all['date_account_created'].dt.quarter
df_all['year_account_created'] = df_all['date_account_created'].dt.year
df_all['hour_first_active'] = df_all['timestamp_first_active'].dt.hour
df_all['day_first_active'] = df_all['timestamp_first_active'].dt.weekday
df_all['month_first_active'] = df_all['timestamp_first_active'].dt.month
df_all['quarter_first_active'] = df_all['timestamp_first_active'].dt.quarter
df_all['year_first_active'] = df_all['timestamp_first_active'].dt.year
df_all['created_less_active'] = (df_all['date_account_created'] - df_all['timestamp_first_active']).dt.days
print("Adding new fields...completed")


# Drop unnecessary columns
print("Droping fields...")
columns_to_drop = ['date_account_created', 'timestamp_first_active', 'date_first_booking', 'country_destination']
for column in columns_to_drop:
    if column in df_all.columns:
        df_all.drop(column, axis=1, inplace=True)
print("Droping fields...completed")

One Hot Encoding categorical data...
One Hot Encoding categorical data...completed
Adding new fields...
Adding new fields...completed
Droping fields...
Droping fields...completed


There is a lot of data in the NDF and US classes; this is due to human bias. A human will teach a machine discrimination by feeding it biased data.
Data augmentation: increasing the number of samples synthetically.

Consider the distinction between crocodiles and dinosaurs. We can capture crocodile data from anywhere, even live, but for dinosaurs we can only get a small amount of data.
How do we balance this? There might be 10 lakh (1,000,000) samples for crocodiles and only 1,000 for dinosaurs.
We can take, say, only 50k of the crocodile samples.
Even better would be data augmentation.
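
Taking only part of the larger class is usually called undersampling. The sketch below shows the idea on a tiny made-up table; the `undersample` helper, the column names and the cap of 3 are all assumptions chosen for illustration.

```python
import pandas as pd

# Imbalanced toy data: 8 crocodiles, 2 dinosaurs (made-up values).
df = pd.DataFrame({
    "label": ["crocodile"] * 8 + ["dinosaur"] * 2,
    "feature": range(10),
})

def undersample(df, label_col, cap, seed=0):
    """Keep at most `cap` randomly chosen rows per class (hypothetical helper)."""
    parts = []
    for _, group in df.groupby(label_col):
        n = min(len(group), cap)  # small classes are kept whole
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts, ignore_index=True)

balanced = undersample(df, "label", cap=3)
counts = balanced["label"].value_counts()
print(counts)
```

Undersampling throws information away, which is why the notes prefer augmentation where it is possible.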

A computer trained to recognize a human from top to bottom will not recognize a human if shown upside down.
So we can rotate the image: that would be a synthetic data sample.

Ways of data augmentation:
Rotating an image
Scaling an image (for example, decreasing its dimensions)
Translating (shifting) the image
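
The three augmentations listed above can be sketched with plain NumPy on a tiny array standing in for an image. This is only an illustration under simple assumptions (a 4x4 grayscale "image", downscaling by dropping pixels, translation that wraps around); a real pipeline would normally use an image library.

```python
import numpy as np

# A 4x4 array standing in for a grayscale image (made-up pixel values).
image = np.arange(16).reshape(4, 4)

rotated = np.rot90(image, k=2)                 # rotate by 180 degrees (upside down)
downscaled = image[::2, ::2]                   # keep every second pixel: 4x4 -> 2x2
translated = np.roll(image, shift=1, axis=1)   # shift right by one pixel, wrapping around

print(rotated)
print(downscaled)
print(translated)
```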

These ways of data augmentation are good for images, but what about text?
For text, we have data sampling.

##### Analysis for building learning model:
There might be a ton of data from 2014, but the data from, say, 2012 might be more useful for analysis. We might work with the whole dataset, but since most entries are from 2014, our results may be biased. Our model should be unbiased and generalised.

We can plot to see the amount of data from different years.
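
A minimal sketch of that count, using a handful of made-up dates rather than the real `date_account_created` column:

```python
import pandas as pd

# Made-up account creation dates for illustration.
dates = pd.to_datetime(pd.Series([
    "2012-03-01", "2013-07-15", "2014-01-02", "2014-05-20", "2014-06-30",
]))

# Number of records per year, ordered by year.
per_year = dates.dt.year.value_counts().sort_index()
print(per_year)
```

In the notebook, `per_year.plot(kind="bar")` would turn these counts into a bar chart (this needs matplotlib).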

Check each attribute for issues; for example, age is sometimes not reported at all.

Booking through iPhones could mean that those people are travelling domestically, as domestic flights might mostly be booked on phones and international flights on laptops/PCs.

More than 75% of people are using iPhones and Macs. So the number of iPhone and Mac users has increased, while users with unknown devices have decreased.

#### 2) Cleaning the data

Fix timestamp formats
Fill in missing values
Fix erroneous values
Skip attributes (this is risky, as we lose a whole attribute, so take this decision wisely based on the data; mostly we won't skip any attribute or record, we will fix it instead)

Standardizing the categories:
People from the US might fill in USA, US, America, etc. as their country. We need to transform all these forms of the country (this is categorical data) into one standard form that a machine can understand.
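
One simple way to do this is a lookup table applied after trimming and lowercasing. The alias dictionary and the sample values below are assumptions for illustration; a real one has to be built from the spellings that actually occur in the data.

```python
import pandas as pd

raw = pd.Series(["USA", "us", "America", " U.S. ", "France"])

# Hypothetical alias table: every known spelling maps to one standard code.
aliases = {"usa": "US", "us": "US", "u.s.": "US", "america": "US", "france": "FR"}

# Normalise whitespace and case first, then look up the standard form.
cleaned = raw.str.strip().str.lower().map(aliases)
print(cleaned.tolist())
```

Spellings missing from the table come out as NaN, which makes them easy to find and add.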

We usually perform data cleaning before data augmentation without worrying about the bias.

There is no standard order of steps for data preprocessing. If something works, explain and prove how it did.

We can derive new attributes from the existing ones, such as the time between the creation of an account and the first booking.
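
As a rough sketch of such a derived attribute, the example below subtracts two date columns on made-up rows. The new column name `days_to_first_booking` is an assumption; note also that the notebook drops `date_first_booking` before modelling, so this only illustrates the idea.

```python
import pandas as pd

# Made-up rows; the second user has not booked yet (missing date).
df = pd.DataFrame({
    "date_account_created": pd.to_datetime(["2010-06-28", "2011-05-25"]),
    "date_first_booking": pd.to_datetime(["2010-07-05", None]),
})

# Difference of two datetime columns is a timedelta; .dt.days gives whole days.
df["days_to_first_booking"] = (
    df["date_first_booking"] - df["date_account_created"]
).dt.days
print(df)
```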

##### Correcting erroneous values:
Outliers
Missing values

We store a default value such as -1 instead of having to deal with garbage values in attributes. You can replace NaN with -1.

See the np.where function to see how a condition affects the attributes of a given record.

Write a function to remove NaN values ('remove' as in change NaN to -1).
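
One possible version of that function, built on np.where, is sketched below. The name `replace_nan` and the toy `ages` frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def replace_nan(df, column, default=-1):
    """Replace NaN in one column with a default value (hypothetical helper)."""
    values = df[column].to_numpy()
    # np.where(condition, a, b): take a where the condition holds, else b.
    df[column] = np.where(pd.isnull(values), default, values)
    return df

ages = pd.DataFrame({"age": [38.0, np.nan, 56.0]})
ages = replace_nan(ages, "age")
print(ages["age"].tolist())
```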

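We can have one column per class (say 4 classes, so 4 columns) and count the instances of each class. If a given instance belongs to, say, class A, we mark that column as 1, otherwise 0.

Each record is then encoded against the class columns, for example with three classes M, F and O:

| Rx | M | F | O |
|---|---|---|---|
| R1 | 1 | 0 | 0 |
| R2 | 0 | 1 | 0 |
| R3 | 0 | 0 | 1 |
| R4 | 1 | 0 | 0 |

Here, we are using one-hot encoding. Google it.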
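
The table above can be reproduced with pandas' built-in `get_dummies`, as an alternative to a hand-written encoder. The column name `label` is an assumption; the records R1 to R4 and the classes M, F, O come from the table.

```python
import pandas as pd

records = pd.DataFrame(
    {"label": ["M", "F", "O", "M"]},
    index=["R1", "R2", "R3", "R4"],
)

# One 0/1 column per class; astype(int) because newer pandas returns booleans.
encoded = pd.get_dummies(records["label"]).astype(int)
encoded = encoded[["M", "F", "O"]]  # same column order as the table
print(encoded)
```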

##### Data transformation and feature extraction as a concept

A machine will get biased when some features hold very big numbers and others very small ones.

Why data transformation:

Methods of data transformation:
Bucketing/binning
Normalization
Other transformations
One-hot encoding
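
Binning and normalization can be sketched on a small made-up age column. The bin edges, the bucket labels and the choice of min-max scaling are assumptions for illustration; other schemes (for example z-score standardization) are equally valid.

```python
import pandas as pd

ages = pd.Series([18, 25, 40, 67, 89])

# Bucketing/binning: map each age to a range (15, 30], (30, 60], (60, 90].
buckets = pd.cut(ages, bins=[15, 30, 60, 90], labels=["young", "middle", "senior"])

# Min-max normalization: rescale values to the interval [0, 1].
normalized = (ages - ages.min()) / (ages.max() - ages.min())

print(buckets.tolist())
print(normalized.tolist())
```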

### End of Preprocessing Part1