# EDA Lab

## Data Exploration of the Traveler Dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
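# %pylab is deprecated; %matplotlib inline enables inline plots without polluting the namespace
%matplotlib inline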

### Exploration by a Data Analyst
   
   - Are there any mistakes in the data?
   - Does the data show any peculiar behavior?
   - Do I need to fix or remove any of the data to make it more realistic?


In [None]:
# Let's load the train and test datasets
train_users = pd.read_csv('./traveler_dataset/train_users_2.csv')
test_users = pd.read_csv('./traveler_dataset/test_users.csv')

In [None]:
## How many users are in the training set and the test set?
print("We have", train_users.shape[0], "users in the training set and", 
      test_users.shape[0], "in the test set.")
print("In total we have", train_users.shape[0] + test_users.shape[0], "users.")

In [None]:
train_users.head()

In [None]:
test_users.head()

In [None]:
# Merge train and test users
users = pd.concat((train_users, test_users), axis = 0, ignore_index = True, sort = False)
users.head()

In [None]:
# For data exploration, let us remove the IDs
users.drop('id',axis = 1, inplace = True)

users.head()

### Data cleaning

### Missing data - [1] Gender
 - **Viewing the data:** let us start with the gender attribute, where some values are **-unknown-**.
 - Values equal to **-unknown-** should be transformed into **NaN**.
    

In [None]:
users.gender.head()

#### How much data are we missing?





In [None]:
## Compute NaN percentage of each feature (attribute).
users_nan = (users.isnull().sum() / users.shape[0]) * 100
users_nan[users_nan > 0]


#### Analysis: 
- ???

#### Exercise 1:
   - What is the NaN percentage of the **date_first_booking** and **age** attributes in the test_users dataset?
   

In [None]:
### Start code



### End code

### Missing data - [1] Gender (continued)
 - If -unknown-, transform these values into NaN.

#### Exercise 2:
 - Replace -unknown- with NaN
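
As a hint, here is a minimal sketch of the technique on a made-up toy column (the values below are illustrative, not taken from the dataset). It assumes the placeholder is exactly the string `-unknown-`; `Series.replace` swaps it for `np.nan` so that `isnull()` and `dropna()` treat it as missing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the gender column (values are made up for illustration).
toy = pd.DataFrame({"gender": ["MALE", "-unknown-", "FEMALE", "-unknown-", "OTHER"]})

# Map the placeholder string to a real missing value.
toy["gender"] = toy["gender"].replace("-unknown-", np.nan)

# Share of missing values after the replacement, in percent.
missing_pct = toy["gender"].isnull().mean() * 100
print(missing_pct)
```

The same call should carry over to `users.gender`, provided the placeholder spelling matches.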

In [None]:
### Start code


### End code

#### Exercise 3:
 - Plot the figure shown below

In [None]:
## Start code




## End code
sns.despine()

#### Expected graph:
<img src="./eda_images/gender.png" height="400" width="400"/>

#### Exercise 4:
- Are there any **gender** preferences when travelling to a destination country?
- Plot the figure shown below
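
A side-by-side bar chart is one way to compare two groups per country. The sketch below uses made-up percentages in place of the real per-gender series, and the variable names `female_pct` and `male_pct` are hypothetical. It may not match the expected figure exactly. The key step is aligning both series on the same country order, since `value_counts()` orders each series by its own frequencies.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up percentages standing in for the per-gender destination shares.
female_pct = pd.Series({"NDF": 55.0, "US": 30.0, "FR": 10.0, "IT": 5.0})
male_pct = pd.Series({"US": 28.0, "NDF": 60.0, "IT": 5.0, "FR": 7.0})

# Align both series on one country order before plotting.
countries = female_pct.index
male_pct = male_pct.reindex(countries)

width = 0.4                      # bar width
x = np.arange(len(countries))    # one slot per country

fig, ax = plt.subplots()
# Shift each group by half a bar width so the bars sit next to each other.
ax.bar(x - width / 2, male_pct.values, width, label="Male")
ax.bar(x + width / 2, female_pct.values, width, label="Female")
ax.set_xticks(x)
ax.set_xticklabels(countries)
ax.set_xlabel("Destination Country")
ax.set_ylabel("Percentage")
ax.legend()
```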


In [None]:
women = sum(users['gender'] == 'FEMALE')
men = sum(users['gender'] == 'MALE')

female_destinations = users.loc[users['gender'] == 'FEMALE', 
                    'country_destination'].value_counts() / women * 100
male_destinations = users.loc[users['gender'] == 'MALE', 
                    'country_destination'].value_counts() / men * 100

## Plot bar graph
# Bar width
width = 0.4

### Start code










### End code
sns.despine()
plt.show()

#### Expected graph:

<img src="./eda_images/destination_country.png" height="400" width="400"/>

#### Analysis: 
- Summary: ???

### Missing data - [2] Age
 - **Viewing the data:** the age attribute is numeric and contains missing and implausible values.
 - Transform the implausible values into **NaN**.
    

In [None]:
## age
users.age.describe()

#### Analysis:
 - ???

#### Exercise 5:
   - Display the total number of users whose **age** is > 122 or < 18
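
A boolean mask is one way to count such rows. The sketch below runs on a made-up series of ages, so the numbers are illustrative only. The two conditions are combined with `|` (or), since no single age can be both above 122 and below 18.

```python
import numpy as np
import pandas as pd

# Made-up ages, including a missing value and a birth year entered as an age.
ages = pd.Series([25, 17, 2014, 40, np.nan, 5, 105, 123])

# Comparisons with NaN evaluate to False, so missing ages are not counted.
too_old = ages > 122
too_young = ages < 18

# Summing a boolean mask counts its True entries.
n_outliers = (too_old | too_young).sum()
print(n_outliers)
```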

In [None]:
### Start code





### End code

#### Statistical analysis of the age attribute for values > 122 or < 18

In [None]:
users[users.age > 122]['age'].describe()

In [None]:
users[users.age < 18]['age'].describe()

#### Analysis: 
- Summary: ???


#### Exercise 6:
   - Set an acceptable range (16, 95) for **age** and set all values outside it to NaN
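
One possible approach, sketched on made-up ages: `Series.where` keeps values where a condition holds and puts NaN everywhere else. Treating both bounds as inclusive is an assumption here; the exercise does not say whether 16 and 95 themselves count as valid.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration.
ages = pd.Series([25.0, 15.0, 2014.0, 40.0, np.nan, 96.0, 16.0, 95.0])

lower, upper = 16, 95  # assumed inclusive bounds

# between() is inclusive at both ends by default and returns False for NaN,
# so where() leaves NaN for every value outside the range.
cleaned = ages.where(ages.between(lower, upper))
print(cleaned)
```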

In [None]:
### Start code




### End code

#### Age with distribution for better analysis

In [None]:
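# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(users.age.dropna(), kde=True, color='red')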
plt.xlabel('Age')
sns.despine()

#### Analysis: Most travelers are between 20 and 50 years old.
   - How about older people, do they travel in a different way?    

#### Exercise 7:
   - Let's take an arbitrary **age** (e.g. 45), split the users into two groups, *Young* and *Old*, and compare their **country_destination**.
   - Plot the figure shown below.
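
Before plotting, it can help to build the table of percentages per group. The sketch below uses a made-up toy frame and `pd.crosstab` with `normalize="index"`, which makes each row sum to 1. Placing users aged exactly 45 in the *Old* group is an assumption, and `age_threshold` and `group` are hypothetical names.

```python
import pandas as pd

# Made-up users for illustration.
toy = pd.DataFrame({
    "age": [22, 30, 44, 45, 50, 61, 38, 70],
    "country_destination": ["US", "NDF", "US", "FR", "US", "FR", "NDF", "NDF"],
})
age_threshold = 45

# Label each row; rows with a missing age would need to be dropped first.
toy["group"] = toy["age"].lt(age_threshold).map({True: "Young", False: "Old"})

# Percentage of each destination within each age group (rows sum to 100).
pct = pd.crosstab(toy["group"], toy["country_destination"], normalize="index") * 100
print(pct)
```

Transposing the table and calling `.plot(kind="bar")` would then draw one pair of bars per destination.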

In [None]:
age = 45

### Start code 















## End code

## Plot

plt.legend()
plt.xlabel('Destination Country')
plt.ylabel('Percentage')

sns.despine()
plt.show()

#### Expected graph:

<img src="./eda_images/young_old.png" height="400" width="400"/>

#### Analysis: 
    
- ???
        

#### Exercise 8:
   - What about the native language: what percentage of users have 'en'?
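
A share like this can be read off `value_counts(normalize=True)`. The sketch below uses a made-up series of language codes; which column of `users` holds the language is not shown in this notebook, so the column name is left for you to check.

```python
import pandas as pd

# Made-up language codes for illustration.
languages = pd.Series(["en", "en", "fr", "en", "zh", "en", "es", "en", "en", "de"])

# normalize=True returns fractions instead of counts.
share = languages.value_counts(normalize=True) * 100
en_pct = share["en"]
print(en_pct)
```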

In [None]:
### Start code




### End code

### [3] Dates
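
The conversion in this section divides `timestamp_first_active` by 1000000 before parsing it as a date. That suggests the raw value is an integer of the form YYYYMMDDHHMMSS, which is an assumption here, not something the notebook states. The sketch below shows the arithmetic on two made-up values, and an alternative that keeps the time of day.

```python
import pandas as pd

# Hypothetical raw values, assumed to be in YYYYMMDDHHMMSS form.
raw = pd.Series([20090319043255, 20131231235959])

# Integer division drops the last six digits (HHMMSS), leaving YYYYMMDD.
date_part = raw // 1000000
dates = pd.to_datetime(date_part.astype(str), format="%Y%m%d")

# Alternative: parse the full value to keep hours, minutes and seconds.
full = pd.to_datetime(raw.astype(str), format="%Y%m%d%H%M%S")

print(dates)
print(full)
```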

In [None]:
users['date_account_created'] = pd.to_datetime(users['date_account_created'])
users['date_first_booking'] = pd.to_datetime(users['date_first_booking'])
users['date_first_active'] = pd.to_datetime((users.timestamp_first_active)// 1000000, format='%Y%m%d')

In [None]:
users.head()

#### Plot number of user accounts created over time


In [None]:

sns.set_style("whitegrid", {'axes.edgecolor': '0'})
#sns.set_context("poster", font_scale=1.1)
users.date_account_created.value_counts().plot(kind='line', linewidth=1.2, color='red')


#### Analysis: 

- We observe how fast the traveler site has grown over the last few years. 
- Does this correlate with the date when the user was active for the first time?
- **Exercise 9:** It might be similar. Check the data to find out!
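
Besides plotting the second date column, the two series of daily counts can be compared numerically. The sketch below is one possible check on made-up dates: it lines up the counts per day, computes their Pearson correlation, and measures how often both dates fall on the same day. The names `daily` and `same_day_share` are hypothetical.

```python
import pandas as pd

# Made-up dates for illustration.
toy = pd.DataFrame({
    "date_account_created": pd.to_datetime([
        "2013-01-01", "2013-01-01", "2013-01-02", "2013-01-03",
        "2013-01-03", "2013-01-03", "2013-01-03"]),
    "date_first_active": pd.to_datetime([
        "2013-01-01", "2013-01-01", "2013-01-02", "2013-01-02",
        "2013-01-03", "2013-01-03", "2013-01-03"]),
})

# Counts per day for each column, aligned on the union of dates.
daily = pd.DataFrame({
    "created": toy["date_account_created"].value_counts(),
    "first_active": toy["date_first_active"].value_counts(),
}).sort_index().fillna(0)

# Pearson correlation between the two daily count series.
correlation = daily["created"].corr(daily["first_active"])

# Percentage of users whose two dates are identical.
same_day_share = (toy["date_account_created"] == toy["date_first_active"]).mean() * 100
print(correlation, same_day_share)
```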


In [None]:
### Start code




### End code

#### Analysis: 
   - ???
    

#### Plot for year 2013 - date_account_created and date_first_active

In [None]:
## Select 2013 year for date_account_created

# Use >= so that accounts created on 1 January 2013 are included
users_2013_a = users[users['date_account_created'] >= pd.to_datetime(20130101, format='%Y%m%d')]
users_2013_a = users_2013_a[users_2013_a['date_account_created'] < pd.to_datetime(20140101, format='%Y%m%d')]
users_2013_a.date_account_created.value_counts().plot(kind='line', linewidth=1.2, color='red')
plt.show()


In [None]:
## Select 2013 year for date_first_active

# Use >= so that users first active on 1 January 2013 are included
users_2013 = users[users['date_first_active'] >= pd.to_datetime(20130101, format='%Y%m%d')]
users_2013 = users_2013[users_2013['date_first_active'] < pd.to_datetime(20140101, format='%Y%m%d')]
users_2013.date_first_active.value_counts().plot(kind='line', linewidth=1.2, color='red')
plt.show()

#### Analysis:
  - Small patterns: some peaks at Oct. - Nov.
  - **Exercise 10:** Let's look more closely, in particular at the days of the week
    

In [None]:
# Vectorized equivalent of looping over the dates: 0 = Monday, ..., 6 = Sunday
weekdays = users.date_account_created.dt.weekday

In [None]:
sns.barplot(x = weekdays.value_counts().index, y=weekdays.value_counts().values, order=range(0,7))
plt.xlabel('Week Day')
sns.despine()

#### Exercise 11: HW 
- Can you find some distinctions between **date_first_active** and **date_account_created** relating to **country_destination**?


#### Exercise 12: HW 
- Are there more registrations but fewer bookings?
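
A starting point could be the share of registered users who have a booking date at all. The sketch below uses made-up rows and assumes that a missing `date_first_booking` means no booking was made. That assumption may not hold for every row of the merged data, so check it before drawing conclusions.

```python
import pandas as pd

# Made-up users for illustration; None becomes NaT (a missing date).
toy = pd.DataFrame({
    "date_account_created": pd.to_datetime(
        ["2013-01-01", "2013-01-02", "2013-01-02", "2013-01-20", "2013-02-03"]),
    "date_first_booking": pd.to_datetime(
        ["2013-01-05", None, None, "2013-02-01", None]),
})

n_registered = len(toy)
n_booked = toy["date_first_booking"].notna().sum()

# Percentage of registered users with at least one booking.
booking_rate = n_booked / n_registered * 100
print(n_registered, n_booked, booking_rate)
```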

#### Exercise 13: HW 
- Does it make sense to find out where users stay when they book 'US'?

#### Exercise 14: HW 
- Try making plots about **devices** and **signups** for analysis

#### Exercise 15: HW 
- Raise more questions and provide your analysis of the rest of the attributes