<a href="https://colab.research.google.com/github/fabioantonini/Applied-Data-Science-Coursera-Capstone/blob/master/C2_W2_Lab_3_Feature_Selection_Bike_Sharing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ungraded Lab: Feature Selection Bike Sharing

Feature selection involves picking the set of features that are most relevant to the target variable. This reduces the complexity of your model and the resources required for training and inference. The effect is greater in production models, where you may be dealing with terabytes of data or serving millions of requests.

In this notebook, you will run through different feature selection techniques on the [Seoul Bike Sharing](https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand) dataset. Most of the modules come from [scikit-learn](https://scikit-learn.org/stable/), one of the most commonly used machine learning libraries. It provides various machine learning algorithms and has built-in implementations of different feature selection methods. Using these, you will be able to compare which method works best for this particular dataset.

## Imports

In [None]:
# for data processing and manipulation
import pandas as pd
import numpy as np

# scikit-learn modules for feature selection and model evaluation
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import math
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# libraries for visualization
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

# standard-library helpers
import os
import datetime as dt

In [None]:
# Report whether the notebook is running inside Google Colab
if 'google.colab' in str(get_ipython()):
  print('Running on Colab')
else:
  print('Not running on Colab')

### Preview the dataset

You will be using the [Seoul Bike Sharing](https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand) dataset.

Rental bikes have been introduced in many cities to improve mobility. It is important to make rental bikes available and accessible to the public at the right time, as this reduces waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern, and the crucial part is predicting the bike count required at each hour.
The dataset contains weather information (temperature, humidity, wind speed, visibility, dew point, solar radiation, snowfall, rainfall), the number of bikes rented per hour, and date information. The label (target) is `Rented Bike Count`.

Here is a description of the features:

* **Date** - year-month-day
* **Rented Bike count** - Count of bikes rented at each hour
* **Hour** - Hour of the day
* **Temperature** - Temperature in Celsius
* **Humidity** - %
* **Windspeed** - m/s
* **Visibility** - 10m
* **Dew point temperature** - Celsius
* **Solar radiation** - MJ/m2
* **Rainfall** - mm
* **Snowfall** - cm
* **Seasons** - Winter, Spring, Summer, Autumn
* **Holiday** - Holiday/No holiday
* **Functioning Day** - Yes (functional hours), No (non-functional hours)

## Load the dataset

The cell below reads the CSV directly from the UCI repository. Run it to load the data into the lab environment and inspect its properties.

In [None]:
# Load the dataset
link="https://archive.ics.uci.edu/ml/machine-learning-databases/00560/SeoulBikeData.csv"
df = pd.read_csv(link, encoding = 'latin1')

# Print datatypes
print(df.dtypes)

# Describe columns
df.describe(include='all')

In [None]:
# Preview the dataset
df.head()

## Remove Unwanted Features

You can remove features that are not needed when making predictions.

In [None]:
# Check if there are null values in any of the columns.
df.isna().sum()

There are no null values. All the features are numerical except `Date`, `Seasons`, `Holiday` and `Functioning Day`, which are stored as strings and are converted in the next sections.
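
As a quick cross-check of which columns still need encoding, `select_dtypes` can list the non-numeric ones. The sketch below runs on a small made-up frame (the values are illustrative and not taken from the CSV); the same two calls could be applied to `df`.

```python
import pandas as pd

# Toy frame with illustrative values; column names mirror the dataset
toy = pd.DataFrame({
    "Date": ["01/12/2017", "02/12/2017"],
    "Hour": [0, 1],
    "Temperature(°C)": [-5.2, -5.5],
    "Seasons": ["Winter", "Winter"],
})

# Columns that are not numeric and therefore need encoding before training
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```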

## Convert the 'Date' feature
The `Date` feature is stored as a string, which Linear Regression cannot use directly. Instead of dropping it, you will convert each date to its ordinal value (an integer day count).

In [None]:
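# Parse the date strings (dayfirst=True assumes the CSV stores dates as DD/MM/YYYY)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# Convert each date to its ordinal (integer day count) so it can be used as a numeric feature
df['Date'] = df['Date'].map(dt.datetime.toordinal)
# Preview the dataset after the change to the date
df.head()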
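
To make the ordinal conversion concrete, here is a minimal sketch on three hand-written dates. It assumes day-first strings (DD/MM/YYYY); the sample values are illustrative only.

```python
import datetime as dt
import pandas as pd

# Illustrative day-first date strings (assumed format: DD/MM/YYYY)
dates = pd.Series(["01/12/2017", "02/12/2017", "30/11/2018"])

# Parse with an explicit format so day and month are not swapped
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

# Map each timestamp to its proleptic Gregorian ordinal (an integer day count)
ordinals = parsed.map(dt.datetime.toordinal)
print(ordinals.tolist())

# Consecutive days differ by exactly 1, so the feature preserves time order
print(ordinals.iloc[1] - ordinals.iloc[0])
```

Because the ordinal grows monotonically with time, a linear model can only use it to capture an overall trend, not seasonal or weekly patterns.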

## Integer Encode Holiday feature

The `Holiday` feature is encoded as a string-type categorical variable with the values 'No Holiday' and 'Holiday'. You need to convert these into integers before training the model. Since there are only two classes, you can use `0` for 'No Holiday' and `1` for 'Holiday'. Let's create a column `Holiday_int` containing this integer representation.

In [None]:
# Integer encode the Holiday feature
df["Holiday_int"] = (df["Holiday"] == 'Holiday').astype('int')

# Drop the previous string column
df.drop(['Holiday'], axis=1, inplace=True)

# Check the new column
df.head()
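
The encoding above relies on a boolean comparison followed by a cast. A minimal sketch on a hand-written series (illustrative values) shows the two steps:

```python
import pandas as pd

# Illustrative values for a two-class string column
holiday = pd.Series(["No Holiday", "Holiday", "No Holiday"])

# The comparison yields booleans; astype('int') turns True/False into 1/0
holiday_int = (holiday == "Holiday").astype("int")
print(holiday_int.tolist())  # [0, 1, 0]
```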

## Integer Encode 'Functioning Day' feature

The feature `Functioning Day` is encoded as a string-type categorical variable with the values 'Yes' and 'No'. You need to convert these into integers before training the model. Since there are only two classes, you can use `0` for 'No' and `1` for 'Yes'. Let's create a column `Functioning_Day_int` containing this integer representation.

In [None]:
# Integer encode the Functioning Day feature
df["Functioning_Day_int"] = (df["Functioning Day"] == 'Yes').astype('int')

# Drop the previous string column
df.drop(['Functioning Day'], axis=1, inplace=True)

# Check the new column
df.head()

## Integer Encode Seasons feature

The `Seasons` feature is encoded as a string-type categorical variable with the values 'Autumn', 'Spring', 'Summer' and 'Winter'. You need to convert these into integers before training the model. The mapping will be:
* 'Spring' -> 0
* 'Summer' -> 1
* 'Autumn' -> 2
* 'Winter' -> 3

In [None]:
# Map each season name to its integer code
df['Seasons'] = df['Seasons'].map({'Spring': 0, 'Summer': 1, 'Autumn': 2, 'Winter': 3})
df.head()
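
An explicit dictionary keeps the season-to-integer mapping visible in one place and makes unexpected labels easy to spot. The sketch below uses made-up series; `season_codes` is a hypothetical helper name, and 'Monsoon' is a deliberately invalid label.

```python
import pandas as pd

# Mapping from the list above
season_codes = {"Spring": 0, "Summer": 1, "Autumn": 2, "Winter": 3}

seasons = pd.Series(["Winter", "Spring", "Autumn", "Summer"])
encoded = seasons.map(season_codes)
print(encoded.tolist())  # [3, 0, 2, 1]

# A label that is missing from the dictionary becomes NaN instead of passing through
unknown_mask = pd.Series(["Winter", "Monsoon"]).map(season_codes).isna()
print(unknown_mask.tolist())  # [False, True]
```

Note that integer codes impose an order (Spring < Summer < Autumn < Winter) that a linear model will treat as meaningful.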

In [None]:
# Print datatypes
print(df.dtypes)

# Describe columns
df.describe(include='all')

## Model Performance

Next, split the dataset into the feature vectors `X` and the target vector `Y` (Rented Bike Count) to fit a [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linearregression#sklearn.linear_model.LinearRegression). You will then compare the performance of each feature selection technique, using [MSE](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html?highlight=mean_squared_error#sklearn.metrics.mean_squared_error) (mean squared error) and [MAE](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html?highlight=mean_absolute_error#sklearn.metrics.mean_absolute_error) (mean absolute error) as evaluation metrics.

In [None]:
# Split feature and target vectors
X = df.drop(columns="Rented Bike Count")
Y = df["Rented Bike Count"]
Y.head()
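
For reference, the two evaluation metrics used in this lab can be computed by hand. The sketch below uses three made-up target values and predictions and checks the manual results against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

# Errors are 1, 0 and -2
errors = y_true - y_pred

# MSE averages the squared errors: (1 + 0 + 4) / 3
mse_manual = np.mean(errors ** 2)

# MAE averages the absolute errors: (1 + 0 + 2) / 3
mae_manual = np.mean(np.abs(errors))

print(mse_manual, mean_squared_error(y_true, y_pred))
print(mae_manual, mean_absolute_error(y_true, y_pred))
```

MSE penalizes large errors more heavily than MAE because each error is squared before averaging.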

### Fit the Model and Calculate Metrics

You will define helper functions to train your model and use the scikit-learn modules to evaluate your results.

In [None]:
def fit_model(X, Y):
    '''Use a linear_model.LinearRegression() for this problem.'''
    
    # define the model to use
    model = linear_model.LinearRegression()
    
    # Train the model
    model.fit(X, Y)
    
    return model

In [None]:
def calculate_metrics(model, X_test_scaled, Y_test):
    '''Get model evaluation metrics on the test set.'''
    
    # Get model predictions
    y_predict_r = model.predict(X_test_scaled)
    
    # Calculate evaluation metrics for assessing performance of the model.
    mse = mean_squared_error(Y_test, y_predict_r)
    mae = mean_absolute_error(Y_test, y_predict_r)

    return mse, mae

In [None]:
def train_and_get_metrics(X, Y):
    '''Train a Linear Regression and get evaluation metrics'''
    
    # Split train and test sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 123)

    # All features are numeric at this point. Standardize them using statistics computed on the train set only.
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Call the fit model function to train the model on the normalized features and the 'Rented Bike Count' values
    model = fit_model(X_train_scaled, Y_train)

    # Make predictions on test dataset and calculate metrics.
    mse, mae = calculate_metrics(model, X_test_scaled, Y_test)
    
    return mse, mae
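
The helper above fits the scaler on the training split only and reuses it for the test split. The following sketch illustrates that pattern on synthetic data (the arrays and the `_demo` names are made up for this example).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features with non-zero mean and non-unit variance
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
y_demo = rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

# Fit on the training split only, then apply the same transform to both splits
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Training columns end up with mean ~0 and standard deviation ~1
print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))

# Test columns are close to, but generally not exactly, mean 0
print(X_te_scaled.mean(axis=0))
```

Fitting the scaler on the training split alone keeps information about the test set out of the preprocessing step.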

In [None]:
def evaluate_model_on_features(X, Y):
    '''Train model and display evaluation metrics.'''
    
    # Train the model, predict values and get metrics
    mse, mae = train_and_get_metrics(X, Y)

    # Construct a dataframe to display metrics.
    display_df = pd.DataFrame([[mse, mae]], columns=['MSE', 'MAE'])
    
    return display_df

Now you can train the model with all features included then calculate the metrics. This will be your baseline and you will compare this to the next outputs when you do feature selection.

In [None]:
# Calculate evaluation metrics
all_features_eval_df = evaluate_model_on_features(X, Y)
all_features_eval_df.index = ['All features']

# Initialize results dataframe
results = all_features_eval_df

# Check the metrics
results.head()

## Correlation Matrix

It is a good idea to calculate and visualize the correlation matrix of a data frame to see which features are highly correlated. You can do that with just a few lines as shown below. The Pandas [corr()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) method computes the Pearson correlation by default, and you will plot it with Matplotlib's pyplot and Seaborn. The darker blue boxes show features with high positive correlation, while white ones indicate high negative correlation. The diagonal will contain 1s because each feature is perfectly correlated with itself.

In [None]:
# Set figure size
plt.figure(figsize=(20,20))

# Calculate correlation matrix
cor = df.corr() 

# Plot the correlation matrix
sns.heatmap(cor, annot=True, cmap=plt.cm.PuBu)
plt.show()
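
To see what the values in the matrix mean, here is a minimal sketch on a toy frame (made-up numbers) with one perfectly positive and one perfectly negative relationship.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # increases linearly with a
    "c": [5, 4, 3, 2, 1],   # decreases linearly with a
})

# Pearson correlation is the default method
cor_toy = toy.corr()
print(cor_toy)
```

Here `a` and `b` have correlation 1, `a` and `c` have correlation -1, and every diagonal entry is 1.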

## Filter Methods

Let's start feature selection with filter methods. This type of feature selection uses statistical methods to rank a given set of features. Moreover, it does this ranking regardless of the model you will be training on (i.e. you only need the feature values). When using these, it is important to note the types of features and target variable you have. Here are a few examples:

* Pearson Correlation (numeric features - numeric target, *exception: when target is 0/1 coded*)
* ANOVA f-test (numeric features - categorical target)
* Chi-squared (categorical features - categorical target)

Let's use some of these in the next cells.
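
scikit-learn also packages filter methods behind a common interface. As a hedged sketch (not part of the original lab), `SelectKBest` with `f_regression` scores each feature with a univariate F-test against a numeric target and keeps the top `k`. The data is synthetic and `k=2` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 2 informative columns followed by 3 pure-noise columns
rng = np.random.default_rng(42)
n = 300
informative = rng.normal(size=(n, 2))
noise = rng.normal(size=(n, 3))
X_demo = np.hstack([informative, noise])
y_demo = (
    3.0 * informative[:, 0]
    - 2.0 * informative[:, 1]
    + rng.normal(scale=0.1, size=n)
)

# Score every feature independently of any model and keep the best two
selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
selected = selector.get_support(indices=True)

print(selector.scores_)  # one F-statistic per feature
print(selected)          # column indices of the retained features
```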

### Correlation with the target variable

Let's start by determining which features are strongly correlated with 'Rented Bike Count' (i.e. the target variable). Since both the features and the target are numeric, we can use Pearson correlation to compute the scores for each feature. This is also categorized as *supervised* feature selection because we're taking into account the relationship of each feature with the target variable. Moreover, since only one variable's relationship to the target is taken at a time, this falls under *univariate feature selection*.

In [None]:
# Get the absolute value of the correlation
cor_target = abs(cor["Rented Bike Count"])

# Select highly correlated features (threshold = 0.2)
relevant_features = cor_target[cor_target>0.2]

# Collect the names of the features
names = relevant_features.index.tolist()

# Drop the target variable from the results
names.remove('Rented Bike Count')

# Display the results
print(names)
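
The same thresholding logic can be tried on synthetic data where the answer is known in advance. In this sketch the column names and coefficients are made up; the 0.2 threshold matches the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
strong = rng.normal(size=n)
weak = rng.normal(size=n)

toy = pd.DataFrame({
    "strong": strong,                                # drives the target
    "weak": weak,                                    # unrelated to the target
    "target": 2.0 * strong + rng.normal(size=n),
})

# Absolute Pearson correlation of every column with the target
cor_target_toy = toy.corr()["target"].abs()

# Keep columns above the threshold, then drop the target itself
selected = cor_target_toy[cor_target_toy > 0.2].index.drop("target").tolist()
print(selected)
```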

Now try training the model again but only with the features in the columns you just gathered. You can observe that there is an improvement in the metrics compared to the model you trained earlier.

In [None]:
# Evaluate the model with new features
strong_features_eval_df = evaluate_model_on_features(df[names], Y)
strong_features_eval_df.index = ['Strong features']

# Append to results and display
results = pd.concat([results, strong_features_eval_df])
results.head()

### Correlation with other features

You will now eliminate features which are highly correlated with each other. This helps remove redundant features thus resulting in a simpler model. Since the scores are calculated regardless of the target variable, this can be categorized under *unsupervised* feature selection.

For this, you will plot the correlation matrix of the features selected previously. Let's first visualize the correlation matrix again.

In [None]:
# Set figure size
plt.figure(figsize=(20,20))

# Calculate the correlation matrix for target relevant features that you previously determined
new_corr = df[names].corr()

# Visualize the correlation matrix
sns.heatmap(new_corr, annot=True, cmap=plt.cm.Blues)
plt.show()

You will see that `Temperature(°C)` is highly correlated with `Dew point temperature(°C)`. You can retain `Temperature(°C)` and remove the feature that is highly correlated with it.

This is a more magnified view of the features that are highly correlated to each other.

In [None]:
# Set figure size
plt.figure(figsize=(12,10))

# Select a subset of features
new_corr = df[['Temperature(°C)', 'Dew point temperature(°C)']].corr()

# Visualize the correlation matrix
sns.heatmap(new_corr, annot=True, cmap=plt.cm.Blues)
plt.show()
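
Instead of reading redundant pairs off the heatmap, they can also be located programmatically. The sketch below is one possible approach on synthetic data: it scans the upper triangle of the absolute correlation matrix and flags any column that is strongly correlated with an earlier one. The 0.9 threshold and the column names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1000
temp = rng.normal(size=n)

toy = pd.DataFrame({
    "temp": temp,
    "dew_point": temp + rng.normal(scale=0.1, size=n),  # near-duplicate of temp
    "humidity": rng.normal(size=n),                     # independent column
})

corr_abs = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once
mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
upper = corr_abs.where(mask)

# Flag columns that are highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```

With this convention the first column of each correlated pair is kept and the later one is flagged, so column order decides which feature survives.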

You will now evaluate the model on the features selected based on your observations. Compare the metrics with the earlier rows of the results table: if they stay close, you get comparable model performance with fewer features. In other words, what you removed was redundant and you only needed the features you retained.

In [None]:
# Remove the feature that is highly correlated with `Temperature(°C)`, which is retained
subset_feature_corr_names = [x for x in names if x not in ['Dew point temperature(°C)']]

# Calculate and check evaluation metrics
subset_feature_eval_df = evaluate_model_on_features(df[subset_feature_corr_names], Y)
subset_feature_eval_df.index = ['Subset features']

# Append to results and display
results = pd.concat([results, subset_feature_eval_df])
results.head(n=10)

## Wrap Up

That's it for this quick rundown of feature selection with filter methods. As shown, you can run quick experiments because convenience modules are already available in libraries like scikit-learn. This preprocessing step is worth doing because not only will you save resources, you may even get better results than when you use all features. Try it out on your previous or upcoming projects and see what results you get!