<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). Translated by Gleb Filatov (@gleb_filatov). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose. This material is a translated version of the Capstone project (by the same author) from the specialization "Machine learning and data analysis" by Yandex and MIPT. No solutions are shared.

# <center> Project "Alice". User Identification Based on Visited Websites
## <center> Week 3. Visual data analysis and feature engineering
    
This week we will start exploring the data with plots and do some feature engineering. We'll build and explore a few features together so that you can build on them and create features of your own.

**Week 3 roadmap**
- Part 1. Feature engineering
- Part 2. Visual data exploration
- Part 3. Further feature engineering
- Part 4. Validation of prepared features

In this task we will make extensive use of the **seaborn** library (you can install it with the *pip install seaborn* command). It will also be helpful to take a look at the [matplotlib](http://matplotlib.org/users/) and [seaborn](http://seaborn.pydata.org/) docs.

**Your task**
1. Fill in the missing code in the provided notebook
2. Choose the answers in the [form](https://docs.google.com/forms/d/19AKGoSekzO-exZhFWsCHSb3-wR-KT_ATtREKdPWKd24)

## Part 1. Feature engineering

In [1]:
# disable Anaconda warnings
import warnings
warnings.filterwarnings('ignore')
from glob import glob
import os
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 25)
import pickle
# pip install seaborn
import seaborn as sns
from matplotlib import pyplot as plt

In [2]:
# change the path to data if necessary
PATH_TO_DATA = '../../data/capstone_user_identification'

Create a new function *prepare_train_set_with_fe*, based on *prepare_train_set* and *prepare_sparse_train_set_window*, which will produce the following features:

- session_timespan - session duration in seconds (difference between the maximal and minimal website visit times in one session)
- #unique_sites - number of unique sites in the session
- start_hour - hour when the session started (hour of the minimal timestamp of this session)
- day_of_week - day of the week (day of the minimal timestamp of this session)

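The function should return a new DataFrame (just as *prepare_train_set* did), extended with these 4 features. The features are added in the following order: site1, ..., site10, session_timespan, #unique_sites, start_hour, day_of_week, user_id. Column names are taken from *feature_names*; in the toy example below they also include time_diff1, ..., time_diff9, and the user column is named target.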
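
As a minimal illustration of how these four features are defined, the sketch below computes them for a single made-up session using only the standard library. The helper name `session_features`, the toy timestamps, and the conventions that site id 0 marks padding and that Monday is day 0 are assumptions made for this example; it is not the required implementation of *prepare_train_set_with_fe*.

```python
from datetime import datetime


def session_features(timestamps, site_ids):
    """Compute the four session-level features for one session.

    Hypothetical helper: assumes site id 0 is padding and Monday is day 0.
    """
    start, end = min(timestamps), max(timestamps)
    return {
        'session_timespan': int((end - start).total_seconds()),
        '#unique_sites': len({s for s in site_ids if s != 0}),
        'start_hour': start.hour,
        'day_of_week': start.weekday(),
    }


# Made-up session: four visits, padded with zeros to length 10
toy_times = [
    datetime(2013, 11, 15, 12, 40, 35),
    datetime(2013, 11, 15, 12, 40, 37),
    datetime(2013, 11, 15, 12, 40, 40),
    datetime(2013, 11, 15, 12, 41, 35),
]
toy_sites = [3, 1, 1, 1, 0, 0, 0, 0, 0, 0]

features = session_features(toy_times, toy_sites)
print(features)
```

In the real function the same quantities would be computed per sliding window and appended to the row before the user label.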

In [None]:
def prepare_train_set_with_fe(path_to_csv_files, site_freq_path, feature_names,
                                    session_length=10, window_size=10):
    pass
    # your code here

Let's test the function on a toy example.

In [4]:
feature_names = ['site' + str(i) for i in range(1,11)] + \
                ['time_diff' + str(j) for j in range(1,10)] + \
                ['session_timespan', '#unique_sites', 'start_hour', 
                 'day_of_week', 'target']
train_data_toy  = prepare_train_set_with_fe(os.path.join(PATH_TO_DATA, 
                                                         '3users'), 
                  site_freq_path=os.path.join(PATH_TO_DATA, 
                                              'site_freq_3users.pkl'),
                  feature_names=feature_names, session_length=10)


In [5]:
train_data_toy

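| | site1 | site2 | site3 | site4 | site5 | site6 | site7 | site8 | site9 | site10 | time_diff1 | time_diff2 | time_diff3 | time_diff4 | time_diff5 | time_diff6 | time_diff7 | time_diff8 | time_diff9 | session_timespan | #unique_sites | start_hour | day_of_week | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2 | 2 | 8 | 2 | 1 | 10 | 5 | 7 | 9 | 287 | 1184 | 6278 | 186 | 2 | 1 | 2 | 3 | 55 | 7998 | 8 | 9 | 4 | 1 |
| 1 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 55 | 0 | 0 | 0 | 0 | 0 | 0 | 60 | 2 | 12 | 4 | 1 |
| 2 | 3 | 2 | 6 | 6 | 2 | 0 | 0 | 0 | 0 | 0 | 287 | 1184 | 6278 | 186 | 0 | 0 | 0 | 0 | 0 | 7935 | 3 | 9 | 4 | 2 |
| 3 | 4 | 1 | 2 | 1 | 2 | 1 | 1 | 5 | 11 | 4 | 287 | 1184 | 6278 | 186 | 2 | 1 | 2 | 3 | 55 | 7998 | 5 | 9 | 4 | 3 |
| 4 | 4 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 287 | 1184 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1471 | 3 | 12 | 4 | 3 |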


Now apply the function *prepare_train_set_with_fe* to the 10 users data, specifying session_length=10.

In [6]:
# %%time
# # your code here
# train_data_10users = prepare_train_set_with_fe '''your code here'''

In [None]:
# train_data_10users.head()

Apply the function *prepare_train_set_with_fe* to the 150 users data, specifying session_length=10.

In [6]:
# %%time
# # your code here
# train_data_150users = prepare_train_set_with_fe '''your code here'''

In [None]:
# train_data_150users.head()

Save the features session_timespan, #unique_sites, start_hour and day_of_week for 10 and 150 users to pickle files.

In [None]:
# # your code here
# new_features_10users = 
# new_features_150users = 

In [None]:
with open(os.path.join(PATH_TO_DATA, 
                       'new_features_10users.pkl'), 'wb') as new_features_10users_pkl:
    pickle.dump(new_features_10users, new_features_10users_pkl)
with open(os.path.join(PATH_TO_DATA, 
                       'new_features_150users.pkl'), 'wb') as new_features_150users_pkl:
    pickle.dump(new_features_150users, new_features_150users_pkl)
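
After saving, it can be worth reloading a file to confirm that it contains exactly the columns you intended. The sketch below shows such a round trip on a tiny made-up DataFrame written to a temporary directory; the names `toy`, `toy_new_features` and the file name are hypothetical and stand in for your own objects.

```python
import os
import pickle
import tempfile

import pandas as pd

# Tiny made-up frame standing in for the real training data
toy = pd.DataFrame({
    'site1': [3, 4],
    'session_timespan': [60, 1471],
    '#unique_sites': [2, 3],
    'start_hour': [12, 12],
    'day_of_week': [4, 4],
    'target': [1, 3],
})

new_feature_names = ['session_timespan', '#unique_sites',
                     'start_hour', 'day_of_week']
toy_new_features = toy[new_feature_names]

# Write to a temporary directory, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_new_features.pkl')
    with open(path, 'wb') as f:
        pickle.dump(toy_new_features, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

round_trip_ok = restored.equals(toy_new_features)
print(round_trip_ok, list(restored.columns))
```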

**<font color='red'> Question 1. </font> What's the median session timespan for the 10 users data?**

In [None]:
# your code here

**<font color='red'> Question 2. </font> What's the median day of week for the 10 users data?**

In [None]:
# your code here

**<font color='red'> Question 3. </font> What's the median session start hour for the 150 users data?**

In [None]:
# your code here

**<font color='red'> Question 4. </font> What's the median number of unique sites for the 150 users data?**

In [None]:
# your code here

## Part 2. Visual data exploration

Let's assign a name and a color to each user.

In [None]:
id_name_dict = {128: 'Mary-Kate', 39: 'Ashley', 207: 'Lindsey', 127: 'Naomi', 237: 'Avril',
               33: 'Bob', 50: 'Bill', 31: 'John', 100: 'Dick', 241: 'Ed'}
train_data_10users['target'] = train_data_10users['target'].map(id_name_dict)

In [None]:
color_dic = {'Mary-Kate': 'pink', 'Ashley': 'darkviolet', 'Lindsey':'blueviolet', 
             'Naomi': 'hotpink', 'Avril': 'orchid', 
             'Bob': 'firebrick', 'Bill': 'gold', 'John': 'forestgreen', 
             'Dick': 'slategrey', 'Ed':'brown'}

1. Plot a histogram of session length (measured in seconds). Limit the x axis to 200 (the right tail is very heavy). Color the histogram darkviolet and label the axes.

In [None]:
# your code here
train_data_10users['session_timespan'] 
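
To illustrate the mechanics of limiting the x axis and labeling the axes, here is a small sketch on synthetic data. The exponential sample `toy_timespans` is made up and only mimics a heavy right tail; the bin count and axis labels are arbitrary choices, not part of the task.

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic, heavy-tailed "durations" (not the course data)
rng = np.random.default_rng(0)
toy_timespans = rng.exponential(scale=100, size=1000)

fig, ax = plt.subplots(figsize=(8, 4))
# Keep only values inside the displayed range so the bins stay narrow
ax.hist(toy_timespans[toy_timespans <= 200], bins=40, color='darkviolet')
ax.set_xlim(0, 200)
ax.set_xlabel('Session timespan, seconds')
ax.set_ylabel('Number of sessions')
```

Filtering before plotting matters here: if the full sample were binned, a few extreme values would stretch the bins and leave only a couple of bars inside the visible range.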

2. Plot a histogram of the number of unique sites in a session. Color it aqua and label the axes.

In [None]:
# your code here
train_data_10users['#unique_sites'] 

3. Plot a histogram of the number of unique sites for **each user separately**. Use *subplots* to fit all 10 small plots into one big figure. Add a legend with the user's name to each plot. Color each user's histogram with the corresponding color from *color_dic*. Label the axes of each histogram.

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10))

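# just a suggestion; the top-level pd.groupby is not available in recent pandas,
# and the user column is named 'target' in the feature_names used above
for idx, (user, sub_df) in enumerate(train_data_10users.groupby('target')):
    pass
    # your code here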
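
The loop above needs to map a running index to a cell of the subplot grid and to deal with grid cells that have no user (a 3x4 grid has 12 cells for 10 users). The sketch below shows one way to do this on a made-up frame with three users and a 2x2 grid; the names `toy` and `toy_colors` and the user names are hypothetical.

```python
import pandas as pd
from matplotlib import pyplot as plt

# Made-up data: three users, a few sessions each
toy = pd.DataFrame({
    'target': ['Ann', 'Ann', 'Ann', 'Ben', 'Ben', 'Cat', 'Cat', 'Cat'],
    '#unique_sites': [2, 3, 3, 8, 7, 5, 5, 6],
})
toy_colors = {'Ann': 'pink', 'Ben': 'gold', 'Cat': 'brown'}

n_cols = 2
fig, axes = plt.subplots(nrows=2, ncols=n_cols, figsize=(8, 6))

for idx, (user, sub_df) in enumerate(toy.groupby('target')):
    # Convert the running index into (row, column) of the grid
    ax = axes[idx // n_cols, idx % n_cols]
    ax.hist(sub_df['#unique_sites'], bins=range(1, 12),
            color=toy_colors[user], label=user)
    ax.set_xlabel('Number of unique sites')
    ax.set_ylabel('Number of sessions')
    ax.legend()

# Hide the grid cells that did not receive a user
for ax in axes.flat[toy['target'].nunique():]:
    ax.set_visible(False)

fig.tight_layout()
```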

4. Plot a histogram of session start hour. Color the histogram darkgreen and label the axes.

In [None]:
# your code here
train_data_10users['start_hour'] 

5. Plot histograms of the start hour distribution for each of the 10 users separately. Use *subplots* to fit all 10 small plots into one big figure. Add a legend with the user's name to each plot. Color each user's histogram with the corresponding color from color_dic. Label the axes of each histogram.

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10))

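# just a suggestion; the top-level pd.groupby is not available in recent pandas,
# and the user column is named 'target' in the feature_names used above
for idx, (user, sub_df) in enumerate(train_data_10users.groupby('target')):
    pass
    # your code here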

6. Plot a histogram of the day of week distribution. Color it sienna and label the axes.

In [None]:
# your code here
train_data_10users['day_of_week']

7. Plot histograms of the day of week distribution for each of the 10 users separately. Use *subplots* to fit all 10 small plots into one big figure. Change the x axis labels to \['Mon','Tue','Wed','Thu','Fri','Sat','Sun'\] using the *set_xticklabels* method. Add a legend with the user's name to each plot. Color each user's histogram with the corresponding color from color_dic. Label the axes of each histogram.

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10))

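# just a suggestion; the top-level pd.groupby is not available in recent pandas,
# and the user column is named 'target' in the feature_names used above
for idx, (user, sub_df) in enumerate(train_data_10users.groupby('target')):
    pass
    # your code here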
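
When replacing tick labels with day names, it helps to fix the tick positions first so that each label lands on the intended bar. The sketch below does this for a single histogram on made-up data, assuming days are coded 0 to 6 with Monday as 0; the array `toy_days` is hypothetical.

```python
import numpy as np
from matplotlib import pyplot as plt

# Made-up day-of-week codes, assuming Monday = 0, ..., Sunday = 6
toy_days = np.array([0, 0, 1, 2, 2, 2, 4, 4, 6])
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

fig, ax = plt.subplots(figsize=(6, 4))
# Bin edges at -0.5, 0.5, ..., 6.5 centre each bar on an integer day code
ax.hist(toy_days, bins=np.arange(8) - 0.5, color='sienna', rwidth=0.9)
# Fix the tick positions first, then rename them
ax.set_xticks(range(7))
ax.set_xticklabels(day_names)
ax.set_xlabel('Day of week')
ax.set_ylabel('Number of sessions')
```

Calling *set_xticklabels* without fixing the positions relabels whatever ticks matplotlib happened to choose, which may not be the integers 0 to 6.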

8. Draw conclusions about each user from the plots you have just built.

Load the site frequency dictionary for 10 users that you pickled earlier.

In [None]:
# # your code here
# with open 

Find the top-10 most visited sites (top10_sites) and the corresponding numbers of visits (top10_freqs).

In [None]:
# # your code here
# top10_freqs = 
# top10_sites = 
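
The idea of picking the most frequent entries from a frequency dictionary can be sketched on a toy example. The sketch assumes each value is a `(site_id, frequency)` pair; this layout is an assumption, so adapt the sort key if your dictionary stores plain counts. The dictionary `toy_site_freq` and its numbers are made up.

```python
# Made-up frequency dictionary: site -> (site_id, number of visits)
toy_site_freq = {
    'google.com': (1, 120),
    'vk.com': (2, 45),
    'youtube.com': (3, 80),
    'mail.google.com': (4, 15),
}

top_n = 3
# Sort by the frequency stored in the second element of each value
top = sorted(toy_site_freq.items(),
             key=lambda item: item[1][1], reverse=True)[:top_n]

top_sites = [site for site, _ in top]
top_freqs = [freq for _, (_, freq) in top]
print(top_sites, top_freqs)
```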

9. Plot a seaborn barplot that shows the visit frequencies of the top-10 sites. Orient the x-axis tick labels (xticks) vertically, otherwise they do not look good.

In [None]:
# # your code here
# sns.barplot 
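
A minimal sketch of a bar plot with vertical tick labels, on made-up sites and counts, might look like this. The lists `toy_sites` and `toy_freqs`, the bar color and the axis labels are illustrative choices only.

```python
import seaborn as sns
from matplotlib import pyplot as plt

# Made-up sites and visit counts
toy_sites = ['google.com', 'youtube.com', 'vk.com']
toy_freqs = [120, 80, 45]

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(x=toy_sites, y=toy_freqs, color='steelblue', ax=ax)
# Rotate the x tick labels so long site names do not overlap
ax.tick_params(axis='x', labelrotation=90)
ax.set_xlabel('Site')
ax.set_ylabel('Number of visits')
fig.tight_layout()
```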

## Part 3. Further feature engineering

This task will test your creativity. You have to come up with ideas for how else website visit times and other features can be used.

Next week we will use a "bag of sites" to classify sessions by the user they belong to. Additionally, you will use the features you create here, and we'll see whether the model improves. It is wise to calculate them now and save them, as we did in this assignment.

You may go wild here and explore any feature you want; there are no constraints. For example:

- year, month and day of session start
- hour of sessions start (with respect to year, month and day)
- time of day
- average time spent on website (for top-30, for example)
- indicator function for popular site visit (again, top-30)
- facebook visit frequency
- ...
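
Two of the ideas above, time of day and the popular-site indicator, could be sketched as follows. The helper names `time_of_day` and `visited_popular_site`, the bucket boundaries, and the toy ids are all assumptions made for illustration; choose whatever definitions suit your data.

```python
def time_of_day(hour):
    """Map an hour (0-23) to a coarse bucket.

    The boundaries are an arbitrary choice made for this sketch.
    """
    if 6 <= hour < 12:
        return 'morning'
    if 12 <= hour < 18:
        return 'day'
    if 18 <= hour < 24:
        return 'evening'
    return 'night'


def visited_popular_site(site_ids, popular_ids):
    """Return 1 if the session contains at least one popular site, else 0."""
    return int(any(site in popular_ids for site in site_ids))


# Toy usage with made-up ids; 0 is assumed to be padding
toy_popular_ids = {1, 2}
print(time_of_day(9), time_of_day(3))
print(visited_popular_site([3, 1, 0], toy_popular_ids))
print(visited_popular_site([5, 0], toy_popular_ids))
```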

Implement a function that creates the new features and apply it to the initial data: the directories with 10 and 150 users. Do it only for the dataset created with parameters session_length=10 and window_size=10. Serialize the resulting matrices with pickle. The function may return either only the new features or the old ones concatenated with the new. You are free to choose the function signature; there are no constraints here.

In [None]:
def feature_engineering(path_to_csv_files, features, session_length=10):
    pass
    # your code here

In [None]:
# # your code here
# new_features_10users = feature_engineering

In [None]:
# # your code here
# new_features_150users = feature_engineering

10. Plot the new features, explore them and comment on the results.

In [7]:
# your code here

Finally, save to pickle files only those features which, in your opinion, would help to identify a user more precisely. This applies both to the features we created at the beginning (session_timespan, #unique_sites, start_hour, day_of_week) and to your own. You are free to create all these features not only for sessions of length 10, but for any other combination of *session_length* and *window_size*.

In [None]:
# # your code here
# selected_features_10users = 
# selected_features_150users = 

In [None]:
with open(os.path.join(PATH_TO_DATA, 
                       'selected_features_10users.pkl'), 'wb') as selected_features_10users_pkl:
    pickle.dump(selected_features_10users, selected_features_10users_pkl, 
                protocol=2)
with open(os.path.join(PATH_TO_DATA, 
                       'selected_features_150users.pkl'), 'wb') as selected_features_150users_pkl:
    pickle.dump(selected_features_150users, selected_features_150users_pkl, 
                protocol=2)

Next week, we'll finally start training classification models.