![](imgs/deepsense_header.png)

# Machine Learning and Big Data

A course by [deepsense.io](http://deepsense.io/).

## Part 3: Logistic Regression

Linear regression analogue for:

* classification
* estimating probabilities

![](imgs/wikipedia_logistic.png)
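
As a rough sketch of the idea (not part of the original course code): logistic regression passes a linear score `w * x + b` through the logistic (sigmoid) function, which turns it into a probability between 0 and 1. The names `sigmoid` and `predict_probability`, and the values of `w` and `b`, are made up for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, w, b):
    """Probability of the positive class for a single feature value x."""
    return sigmoid(w * x + b)


# hypothetical parameters, chosen only for illustration
w, b = 2.0, -1.0

for x in [-2.0, 0.0, 0.5, 3.0]:
    print("x = {:5.1f}  ->  p = {:.3f}".format(x, predict_probability(x, w, b)))
```

The probability crosses 0.5 exactly where the linear score is zero, here at `x = 0.5`.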

In [None]:
%matplotlib inline

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
# again, let us use the Bike Sharing Dataset
df = pd.read_csv("data/Bike-Sharing-Dataset/day.csv")

## Initial exploration

In [None]:
# warning: the season coding in the official description is wrong;
# the mapping below is the corrected one
seasons = {1: "winter", 2: "spring", 3: "summer", 4: "fall"}

In [None]:
# recoding seasons
df['season'] = df['season'].map(seasons)

In [None]:
# group by "season", select the "cnt" column, then take the mean
df.groupby("season")["cnt"].mean()

In [None]:
# let's define some colors we will be using
colors = {"winter": "#5555dd", "spring": "#55dd55", "summer": "#bbbb33", "fall": "#dd5555"}

In [None]:
# temperatures in seasons
for name, df_part in df.groupby("season")["temp"]:
    # kdeplot draws the same density curve as distplot(hist=False), which is deprecated
    sns.kdeplot(df_part, label=name, color=colors[name])
plt.legend()

In [None]:
# biker count by season
for name, df_part in df.groupby("season")["cnt"]:
    # kdeplot draws the same density curve as distplot(hist=False), which is deprecated
    sns.kdeplot(df_part, label=name, color=colors[name])
plt.legend()

### Exercises

* Plot humidity by season.
* Plot casual and (on a separate plot) registered users by season.
* ★ Plot usage by weekday.  

## Logistic Regression for 2 variables

In [None]:
# creating a logistic regression classifier
# parameter C is the inverse of the regularization strength:
# a larger C means weaker regularization, which allows a steeper logistic function
lr = LogisticRegression(C=100)

In [None]:
# input
X = df[["cnt"]]

# output
Y = df["season"] == "winter"

In [None]:
# splitting the dataset into a training set and a held-out test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)

In [None]:
# training Logistic Regression on data
lr.fit(X_train, Y_train)

In [None]:
# score: fraction of correct answers (accuracy)
# on the train set
lr.score(X_train, Y_train)

In [None]:
# score: fraction of correct answers (accuracy)
# on the test set
lr.score(X_test, Y_test)
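
For a classifier, `score` returns the mean accuracy. The sketch below computes the same quantity by hand and compares it with a trivial baseline that always answers "not winter". The labels are invented for illustration and `accuracy` is a hypothetical helper, not a scikit-learn function.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# hypothetical labels: True means "winter"
y_true = [True, False, False, False, True, False, False, False]
y_pred = [True, False, False, True, False, False, False, False]

# a trivial baseline that always answers "not winter"
y_baseline = [False] * len(y_true)

model_accuracy = accuracy(y_true, y_pred)
baseline_accuracy = accuracy(y_true, y_baseline)
print("model:    {:.3f}".format(model_accuracy))
print("baseline: {:.3f}".format(baseline_accuracy))
```

Since only roughly one day in four is a winter day, an accuracy near 0.75 can be reached without learning anything, so it is worth comparing the score against this kind of baseline.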

In [None]:
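for name, df_part in df.groupby("season")["cnt"]:
    # kdeplot draws the same density curve as distplot(hist=False), which is deprecated
    sns.kdeplot(df_part, label=name, color=colors[name])
plt.legend()

# vertical line at the decision boundary of the logistic regression
# (the point where the predicted probability equals 0.5)
plt.vlines(-lr.intercept_[0]/lr.coef_[0,0], 0, 0.00035)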
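
Why the boundary sits at `-intercept / coef`: the predicted probability is 0.5 exactly when the linear score `coef * x + intercept` equals zero. A minimal check, with hypothetical parameters standing in for `lr.coef_[0, 0]` and `lr.intercept_[0]` (these are not values fitted to the bike data):

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


# hypothetical fitted parameters for a single feature
coef = -0.002      # plays the role of lr.coef_[0, 0]
intercept = 6.0    # plays the role of lr.intercept_[0]

# the linear score coef * x + intercept is zero at this x
threshold = -intercept / coef

p_at_threshold = sigmoid(coef * threshold + intercept)
p_below = sigmoid(coef * (threshold - 1000) + intercept)
p_above = sigmoid(coef * (threshold + 1000) + intercept)

print("threshold:", threshold)
print("p at threshold:        {:.3f}".format(p_at_threshold))
print("p at threshold - 1000: {:.3f}".format(p_below))
print("p at threshold + 1000: {:.3f}".format(p_above))
```

With a negative coefficient, counts below the threshold are classified as the positive class and counts above it as the negative class.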

In [None]:
# probability (logistic function) vs binary prediction 
X_grid = np.linspace(-3000, 10000, 100).reshape(100, 1)
logistic_df = pd.DataFrame({"winter prediction": lr.predict(X_grid),
                            "winter probability": lr.predict_proba(X_grid)[:,1]},
                            index=X_grid.reshape(100))

logistic_df.sort_index().plot()
plt.ylim(-0.1, 1.1)

In [None]:
# one more plot - this time with histograms and absolute counts

df.query("season == 'winter'")["cnt"] \
  .hist(label=seasons[1], color=colors['winter'], alpha=0.5, range=(0, 10000), bins=40)

df.query("season != 'winter'")["cnt"] \
  .hist(label="Not winter", color="grey", alpha=0.5, range=(0, 10000), bins=40)

plt.legend()

threshold = -lr.intercept_[0]/lr.coef_[0,0]
width = 1/lr.coef_[0,0]
plt.vlines(threshold, 0, 50)
plt.vlines(threshold - width, 0, 50, linestyles='dashed')
plt.vlines(threshold + width, 0, 50, linestyles='dashed')
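
The dashed lines are drawn one `width = 1 / coef` on either side of the threshold. At those points the linear score equals plus or minus one, so the predicted probability is `sigmoid(1)` or `sigmoid(-1)`, whatever the coefficient is. A small sketch with hypothetical parameters (not fitted to the bike data):

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


# hypothetical parameters for a single feature
coef, intercept = -0.002, 6.0

threshold = -intercept / coef
width = 1 / coef  # negative here, because coef is negative

probabilities = {}
for label, x in [("threshold - width", threshold - width),
                 ("threshold", threshold),
                 ("threshold + width", threshold + width)]:
    probabilities[label] = sigmoid(coef * x + intercept)
    print("{:<18} x = {:7.1f}  p = {:.3f}".format(label, x, probabilities[label]))
```

The band between the dashed lines is therefore the region where the model is least certain, with probabilities between roughly 0.27 and 0.73.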

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
# more detailed score
confusion_matrix(Y_test, lr.predict(X_test))

In [None]:
# confused?
# let's plot it!

df_confusion = pd.DataFrame(confusion_matrix(Y_test, lr.predict(X_test)))

df_confusion.index = ["not winter", "winter"]
df_confusion.index.name = "true value"

df_confusion.columns = ["not winter", "winter"]
df_confusion.columns.name = "predicted value"

sns.heatmap(df_confusion, linewidths=3, annot=True, fmt="d")
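
To make the layout of the matrix concrete, here is a hand-rolled sketch that counts the four cells with rows as true values and columns as predicted values, matching the labels used for the heatmap. The labels are invented, and `confusion_counts` is a hypothetical helper rather than a scikit-learn function.

```python
def confusion_counts(y_true, y_pred):
    """2x2 table: rows are true values, columns are predicted values.

    The order is [negative, positive] on both axes.
    """
    table = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        table[int(t)][int(p)] += 1
    return table


# hypothetical labels: True means "winter"
y_true = [True, False, False, False, True, False, True, False]
y_pred = [True, False, False, True, False, False, True, False]

table = confusion_counts(y_true, y_pred)
(tn, fp), (fn, tp) = table

precision = tp / (tp + fp)  # of the days predicted as winter, how many were winter
recall = tp / (tp + fn)     # of the true winter days, how many were found

print("TN = {}, FP = {}, FN = {}, TP = {}".format(tn, fp, fn, tp))
print("precision = {:.3f}, recall = {:.3f}".format(precision, recall))
```

Unlike a single accuracy number, the table shows which kind of mistake the classifier makes: winter days that were missed or other days wrongly labelled as winter.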

## More variables may help

In [None]:
# sns.pairplot is useful for showing many scatter plots at once
sns.pairplot(df,
             vars=["hum", "temp", "atemp", "casual", "registered"],
             hue="season",
             palette=colors)

In [None]:
lr = LogisticRegression(C=100)

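X = df[["casual", "registered", "weekday", "yr"]].copy()

# a log scale compresses the wide range of casual counts
X['casual'] = np.log10(X['casual'])

# standardize every column to zero mean and unit standard deviation
X = (X - X.mean())/X.std()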


Y = (df["season"] == "winter")

# splitting the dataset into a training set and a held-out test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)

# training
lr.fit(X_train, Y_train)

# score 
print("Train score is: {:.3f}".format(lr.score(X_train, Y_train)))
print("Test score is:  {:.3f}".format(lr.score(X_test, Y_test)))
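
The scaling step `(X - X.mean())/X.std()` is a standardization: each column ends up with mean 0 and standard deviation 1. A plain-Python sketch of the same operation on invented numbers (`standardize` is a hypothetical helper; like the pandas default, it uses the sample standard deviation):

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / std for v in values]


# hypothetical feature values on very different scales
casual = [100.0, 1000.0, 10000.0]
weekday = [0.0, 3.0, 6.0]

casual_scaled = standardize(casual)
weekday_scaled = standardize(weekday)

print(casual_scaled)
print(weekday_scaled)
```

Note that the cell above computes the mean and standard deviation on the whole dataset, test rows included. A stricter variant would compute them on the training rows only and reuse them for the test rows.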

In [None]:
# we used normalized variables, so the coefficient magnitudes are directly comparable
coeffs = pd.Series(lr.coef_[0], index=X.columns)
coeffs.plot(kind="barh")
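
One way to read the bars: in logistic regression, increasing a feature by one unit multiplies the odds of the positive class by `exp(coefficient)`. With standardized features, one unit is one standard deviation. The coefficients below are invented for illustration and are not the values fitted above.

```python
import math

# hypothetical coefficients for standardized features (not fitted values)
example_coeffs = {"casual": -1.5, "registered": -0.5, "weekday": 0.0, "yr": 1.0}

# multiplier applied to the odds of "winter" per one standard deviation increase
odds_multipliers = {name: math.exp(coef) for name, coef in example_coeffs.items()}

for name, coef in example_coeffs.items():
    print("{:<12} coef = {:5.2f}  odds multiplier = {:.3f}".format(
        name, coef, odds_multipliers[name]))
```

A coefficient of zero leaves the odds unchanged, a negative one shrinks them and a positive one grows them.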

In [None]:
# again, a confusion matrix
df_confusion = pd.DataFrame(confusion_matrix(Y_test, lr.predict(X_test)))

df_confusion.index = ["not winter", "winter"]
df_confusion.index.name = "true value"

df_confusion.columns = ["not winter", "winter"]
df_confusion.columns.name = "predicted value"

sns.heatmap(df_confusion, linewidths=3, annot=True, fmt="d")

### Exercises

* Use other variables and their scalings to improve the score.