## Advanced Python - Individual Assignment
### Bike Sharing Analysis with Dask Structures
Areknaz Khaligian
May 19, 2019

### Introduction
#### Assignment Instructions

The objective is to rewrite the Bike Sharing analysis done in the Python for Statistical Programming subject using Dask data structures and ecosystem instead of plain pandas.

The maximum score of the assignment is 4 points and the grading will be as follows:

Creation of a git repository with a proper README, incremental commits, and some sort of automatic or programmatic download of the data before the analysis (1 point). Notice that the data should not be checked in to the repository. Including data files in git repositories is considered a bad practice.

Use of dask.dataframe and distributed.Client for all the data manipulation (2 points). Remember that calling .compute() on a Dask DataFrame turns it into a pandas DataFrame, which resides in RAM and loses the distributed advantages. The more Dask structures are used, the higher the grade.

Use of Dask-ML for distributed training and model selection https://ml.dask.org/ (1 point). See below for inspiration.
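
For the programmatic-download requirement, one possible approach is a small helper that fetches the CSV only when it is not already on disk. This is a minimal sketch rather than the repository's actual script: the helper name `fetch_data` and the `data/hour.csv` destination are assumptions made for illustration, and the URL is the one read in the Read Data section of this notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL taken from the notebook's Read Data cell
DATA_URL = (
    "https://s3.eu-central-1.amazonaws.com/"
    "ie-mbd-advpython-ml-bikesharing-dask/hour.csv"
)


def fetch_data(url=DATA_URL, dest="data/hour.csv"):
    """Download `url` to `dest` unless the file already exists; return the path.

    `fetch_data` and the default destination are hypothetical names.
    """
    dest = Path(dest)
    if not dest.exists():
        # Create the target folder on first use, then download once
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(dest))
    return dest
```

Listing the download folder in `.gitignore` would keep the file out of version control, and the returned path could then be passed to `dd.read_csv`.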

### Import Libraries

In [1]:
import dask.dataframe as dd

from dask_ml.preprocessing import Categorizer, DummyEncoder, StandardScaler
from dask_ml.linear_model import LogisticRegression
from dask_ml.xgboost import XGBRegressor

from scipy.stats import skew, pearsonr

### Read Data

In [2]:
dataset = dd.read_csv('https://s3.eu-central-1.amazonaws.com/ie-mbd-advpython-ml-bikesharing-dask/hour.csv')

dataset.head()

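| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |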


### Data Cleaning

No null values to clean

In [3]:
# Count missing values over all columns; .compute() evaluates the Dask graph
dataset.isnull().sum().sum().compute()

0

Drop "instant", since it is only a unique row ID

In [None]:
dataset = dataset.drop('instant', axis=1)
dataset.head()

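| | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |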


Convert dteday to datetime datatype

In [None]:
# dd.to_datetime parses the strings lazily, partition by partition
dataset['dteday'] = dd.to_datetime(dataset['dteday'])
dataset['dteday'].head()

0   2011-01-01
1   2011-01-01
2   2011-01-01
3   2011-01-01
4   2011-01-01
Name: dteday, dtype: datetime64[ns]

Convert season, yr, mnth, hr, holiday, weekday, workingday, weathersit to categorical datatype

In [None]:
categories = ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit']
categorizer = Categorizer(columns=categories)
dataset = categorizer.fit_transform(dataset)
dataset.head()

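| | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |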


Check skewness and range for temp, atemp, hum, windspeed, casual, registered, cnt

(Only casual, registered, and cnt are strongly skewed. We plan to scale all the variables before training, which puts them on a common range; note that standard scaling by itself does not change skewness.)

In [None]:
numerics = ['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']
for n in numerics:
    print(n)
    print(skew(dataset[n]))
    print()

temp
-0.00602036366695605

atemp
-0.09042105336080838

hum
-0.1112775438226877

windspeed
0.5748555816221624

casual
2.4990211743609105

registered
1.5577697580511438

cnt
1.2773013463494975
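
To make these numbers easier to read, the sketch below computes the biased sample skewness, g1 = m3 / m2^1.5, which is the quantity `scipy.stats.skew` returns with its default settings. It also evaluates the same statistic after standardizing, as a quick check of what scaling does and does not change. The helper names and the two small samples are hypothetical and are not taken from the bike sharing data.

```python
def sample_skewness(values):
    """Biased sample skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


def standardize(values):
    """Shift to zero mean and rescale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]


symmetric = [1, 2, 3, 4, 5]             # hypothetical, balanced around 3
right_skewed = [0, 0, 1, 1, 2, 3, 40]   # hypothetical, one long right tail

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
print(sample_skewness(standardize(right_skewed)))
```

A value near zero indicates a roughly symmetric distribution, while a clearly positive value, as for casual, registered, and cnt, indicates a long right tail.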



Check correlations between variables

In [None]:
print("casual vs cnt:", pearsonr(dataset['casual'], dataset['cnt']))

print("registered vs cnt:", pearsonr(dataset['registered'], dataset['cnt']))

casual vs cnt: (0.6945640779749493, 0.0)
registered vs cnt: (0.9721507308642992, 0.0)


In [None]:
print("temp vs atemp:", pearsonr(dataset['temp'], dataset['atemp']))

print("temp vs cnt:", pearsonr(dataset['temp'], dataset['cnt']))

print("atemp vs cnt:", pearsonr(dataset['atemp'], dataset['cnt']))

We only want to keep one of casual or registered; since registered is more highly correlated with the target (cnt), we will drop casual.

Also, because temp and atemp are so highly correlated with each other, we will drop atemp, which is slightly less correlated with the target (cnt).

In [None]:
dataset = dataset.drop(['casual', 'atemp'], axis=1)
dataset.head()

### Dummy Encoding

Before training the models, we need to dummy-encode the categorical variables.

In [None]:
de = DummyEncoder(columns=categories)

dataset = de.fit_transform(dataset)

In [None]:
dataset.head()
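
To illustrate what dummy encoding produces, the sketch below one-hot encodes a tiny categorical column. It uses plain pandas `get_dummies` purely for readability; the assumption here is that `DummyEncoder` yields the same kind of column layout on each Dask partition. The toy frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: season is categorical with four known levels
toy = pd.DataFrame({
    "season": pd.Categorical([1, 2, 1, 4], categories=[1, 2, 3, 4]),
    "cnt": [16, 40, 32, 13],
})

# Each category level becomes its own indicator column, including
# levels (here season 3) that do not occur in the rows at hand
encoded = pd.get_dummies(toy, columns=["season"])
print(encoded.columns.tolist())
print(encoded["season_1"].astype(int).tolist())
```

Declaring the categories up front is what lets every partition produce the same set of columns, which is why the categorical conversion comes before the encoding step.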