# Python for Data Science | Spring 2018 -- Resit

***

**The aim of this exam is to study the dataset composed of bike trips on the *Capital Bikeshare* network, based in Washington DC.** 

You are expected to download this notebook and add code and markdown cells to answer the questions below. The first few cells load the dataset as a usable Python object; they are already written for you and only need to be executed for the objects to be available within the notebook. You will then be guided through a number of questions **exclusively** related to the biking dataset.

**Your notebook must be submitted by email to *bashar.dudin@epita.fr* within 2 hours of the start of the examination. The subject must be `[IPDS][RESIT] Spring 2018`.**

Remember that:

- **Late submissions will not be graded**, unless there is a serious and valid reason for the delay.
- **Code cells** that raise errors should **NOT** be handed in.
- **Copying googled code or your friends' code is unlikely to help much.** You have access to the `python`, `numpy`, `pandas`, `matplotlib` and `sklearn` documentation, as well as to the course's notebooks. These are already enough. For the rest, gray matter should suffice.

### Loading Useful Libraries

In [None]:
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt

### Downloading Dataset and Extracting It

In [None]:
import os
import urllib.request
import zipfile

dataset_url = 'https://s3.amazonaws.com/capitalbikeshare-data/2011-capitalbikeshare-tripdata.zip'
downloaded_file = 'Data/bikeshare-dataset-2011.zip'

# Make sure the target directory exists before writing into it.
os.makedirs('Data', exist_ok=True)

# Download the archive and write it to disk.
with urllib.request.urlopen(dataset_url) as response, open(downloaded_file, "wb") as out_file:
    out_file.write(response.read())

In [None]:
# The context manager closes the archive even if extraction fails.
with zipfile.ZipFile(downloaded_file, 'r') as zip_ref:
    zip_ref.extractall('Data/')

### Loading Dataset

In [None]:
data = pd.read_csv("Data/2011-capitalbikeshare-tripdata.csv")

In [None]:
data.head()

***

# Exam

Answer the questions as fully as you can. You are free to give partial answers, as long as they **treat the studied dataset**, **make sense within the context** and **don't raise errors**.

***

## I. Looking into the Data

We start by dropping useless data and cleaning up the rest to prepare for relevant analysis.

**I.1 Drop the columns corresponding to station names and bicycle identifiers**.

**I.2 What does the `mystery` function do?**

The function below is meant to transform two of the dataset columns into reusable data types.

In [None]:
def __mystery(D, hash_t):
    da, t = D.split(" ")
    
    h, m, s = [int(x) for x in t.split(":")]
    t_res = s + 60*m + h*3600
    _, m, d = [int(x) for x in da.split("-")]

    da_res = 0
    for i in range(1, m):
        da_res += hash_t[i]
    da_res += d - 1
    da_res *= 24*3600
    
    return da_res + t_res

In [None]:
def mystery(df, d_feature):
    """This is a baby mystery function."""
    D = df[d_feature].iloc[0]
    da, _ = D.split(" ")
    y = int(da.split("-")[0])
    l = y % 4 == 0 and (y % 100 != 0 or y % 400 == 0)
    hash_t = {i : (30 + (i % 7 % 2 == 1) + (i == 7)) for i in range(1, 13)}
    hash_t[2] = 28 + l
    df[d_feature] = pd.Series([__mystery(D, hash_t) for D in df[d_feature]])

In [None]:
mystery(data, "Start date")

In [None]:
mystery(data, "End date")

**I.3 Recode `Member type` into a feature taking value `0` for `Casual` and `1` for `Member`.**

## II. Building Up a First Naive Model

Before getting into any ML work, you need a number of dummy models against which to compare increasingly complex ones. This section builds such baselines.
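
As an illustration of what such baselines look like, here is a minimal sketch on made-up labels, using `DummyClassifier` from `sklearn.dummy`. The toy arrays below are assumptions chosen for the example and have nothing to do with the bikeshare data; the exam questions still expect you to write your own estimators.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data (hypothetical): 8 positive and 2 negative labels.
# Dummy models ignore the inputs, so X_toy only needs the right length.
X_toy = np.zeros((10, 1))
y_toy = np.array([1] * 8 + [0] * 2)

# Baseline that always predicts the most frequent class seen during fit.
constant_clf = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
constant_score = constant_clf.score(X_toy, y_toy)

# Baseline that picks a class uniformly at random, independently of the input.
uniform_clf = DummyClassifier(strategy="uniform", random_state=0).fit(X_toy, y_toy)
uniform_score = uniform_clf.score(X_toy, y_toy)

print(constant_score, uniform_score)
```

Any model worth keeping should beat both of these scores on the same data.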

***
Our main **aim** is to predict `Member type` out of other relevant features in the dataset. 
***

We shall first work under the assumption that our only inputs are `Duration`, `Start date` and `End date`.

**II.1 Decompose the dataset into input and target, where the input only contains `Duration`, `Start date` and `End date`. Expected types are `numpy` arrays.**

**II.2 Split the dataset into training and testing sets, with a test set proportion of `0.2`.**

### Constant Model

We first build a constant model that answers `Member` independently of the input.

**II.3 Compute the accuracy score of a constant model answering `Member` for every entry.**

**II.4 Following the conventions for writing a class that inherits from `BaseEstimator`, define a model that always outputs `Member`.**

### Random Model

By a random model we mean here a model that randomly answers either `Member` or `Casual`, independently of the input.

**II.5 Following the steps of the previous section, write down and evaluate a random model that uniformly answers either `Member` or `Casual`, independently of the entry.**

**II.6 Using `binomial` from the `random` module of `numpy`, build a model that gets closer to the constant model you built in the previous section (do not expect much).**
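
For reference, here is a minimal sketch of how `numpy.random.binomial` draws biased 0/1 labels. The probability `p_member = 0.8` is an arbitrary value picked for the illustration, not a figure taken from the dataset.

```python
import numpy as np

np.random.seed(0)  # fixed seed so that the draws are reproducible

p_member = 0.8   # hypothetical probability of answering 1
n_draws = 10000

# With n=1, each draw is a single Bernoulli trial: 1 with probability p_member, else 0.
draws = np.random.binomial(1, p_member, size=n_draws)

# The empirical frequency of 1s should be close to p_member.
print(draws.mean())
```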

## III. First ML Models

We're going to try out two models on the dataset as it is now, i.e. only containing `Duration`, `Start date` and `End date`.

### Logistic Regression

Logistic regression is one of the simplest models that is sensitive to its input, unlike the models we have been playing with up to now.

**III.1 Train and evaluate a `logistic regression`.**

### Random Forest

Read about the `random forest` model on the `sklearn` documentation page. It is a classifier that often gives better results than a `logistic regression`.

**III.2 Train and evaluate a `random forest`.**

**III.3 In both previous studies, are all input features useful? You're expected to support your claim, either with code or with an acceptable argument.**

### Learning curves

To study any ML model, you are expected to be able to analyse its learning curves.
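
A learning curve shows how training and validation scores evolve as the training set grows. The sketch below uses `learning_curve` from `sklearn.model_selection` on synthetic data generated with `make_classification`; the data, the choice of `LogisticRegression` and all parameter values are assumptions made for the illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic binary classification problem (hypothetical stand-in for real data).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0, random_state=0
)

# Scores are computed for 5 training set sizes, each with 5-fold cross-validation.
sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X_syn,
    y_syn,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# Average over the folds and plot both curves against the training set size.
fig, ax = plt.subplots()
ax.plot(sizes, train_scores.mean(axis=1), marker="o", label="training score")
ax.plot(sizes, valid_scores.mean(axis=1), marker="o", label="validation score")
ax.set_xlabel("number of training samples")
ax.set_ylabel("accuracy")
ax.legend()
```

Typically, a large and persistent gap between the two curves points to overfitting, whereas two low curves that meet point to underfitting.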

**III.4 Plot the learning curves of both previous models, up to the first `1000` entries of the dataset (this takes time).**

You're allowed to reduce the number of entries for performance reasons.

**III.5 Comment on the results of both learning curves.**

## IV. Taking Stations into Account

One thing our present models do not take into account is the start and end stations. It is enough to look at the numbers of these stations (`Start station number` and `End station number`). Note that these numbers encode a categorical variable, not a numerical one: they have no mathematical significance, so you cannot compare them, add them, etc.

To feed such inputs to our model, a standard strategy is to add extra dimensions to the dataset, each new feature corresponding to one station. A trip leaving from station A then has the feature of station A set to 1 and all other start-station features set to 0.
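
The encoding described above can be sketched on a toy column as follows. The station numbers are made up for the example, and calling `.toarray()` assumes the encoder returns a sparse matrix, which is its default behaviour.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy column of station numbers (hypothetical values), one row per trip.
stations = np.array([[31000], [31005], [31000], [31010]])

encoder = OneHotEncoder()
# The default output is a sparse matrix; convert it to a dense array for display.
encoded = encoder.fit_transform(stations).toarray()

# One new column per distinct station, in sorted order.
print(encoder.categories_[0])
# Each row holds a single 1, in the column of that trip's station.
print(encoded)
```

Each distinct station number becomes its own 0/1 column, so the model never treats station numbers as ordered quantities.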

**IV.1 Extract from `data` the `numpy` `arrays` corresponding to `Start date`, `End date`, `Start station number` and `End station number`.**  

**IV.2 In your own words, explain what the `OneHotEncoder` estimator does.**

**IV.3 Use a `OneHotEncoder` estimator to transform the input dataset in order to re-encode your categorical variables.**

**IV.4 What is the shape of the transformed data?**

**IV.5 Train a `logistic regression` on the transformed data. Are the results better than before? You're invited to comment and support your claims.**

**IV.6 Train and evaluate an `SVC` on the transformed data, using a linear kernel.**

You're allowed to first shrink your dataset, limiting it to `10000` observations or fewer if need be.

**IV.7 Choose a different `kernel` hyperparameter and train a new `SVC` classifier in order to get better results.**

## Any Ideas to Get Better Models?

You're free to try out what you feel would work. I'll comment on it.

## Conclusions

What are the conclusions of your study? You're expected to support your claims, for instance by referring to previously obtained results, drawing partial learning curves, or making your own analysis and plots of errors.

## Documentation Question

**What is `Grid Search`? Support your answer with use cases.**