# Project objective
In this project, we try to identify the best alpha for the lasso machine learning method when predicting the number of social network shares (popularity) of Mashable articles, using information about the articles published over a period of two years.

This is a simple case of hyperparameter optimization. The best alpha is selected by comparing the performance of the models in a cross-validation setting.

Information about the dataset, some technical details about the machine learning method(s) used, and mathematical details of the quantification approaches are provided in the code.

# Packages we work with in this notebook
We are going to use the following libraries and packages:

* **numpy**: NumPy is the fundamental package for scientific computing with Python. (http://www.numpy.org/)
* **sklearn**: Scikit-learn is a machine learning library for the Python programming language. (https://scikit-learn.org/stable/)
* **pandas**: Pandas provides easy-to-use data structures and data analysis tools for Python. (https://pandas.pydata.org/)

We also use **warnings** to suppress warning messages in the notebook.


In [None]:
import numpy as np
import pandas as pd
import sklearn as sk

import warnings
warnings.filterwarnings('ignore')

# Introduction to the dataset

**Name**: Online News Popularity Data Set

**Summary**: "This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the number of shares in social networks (popularity)." (UCI ML)

**Number of features**: 58 predictive features

**Number of data points (instances)**: 39797 as listed on the UCI page (the CSV file loaded in this notebook contains 39644 rows)

**Link to the dataset**: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity




## Importing the dataset
We can import the dataset in multiple ways:

**Colab Notebook**: You can download the dataset file (or files) from the link (if provided), upload it to your Google Drive, and then import the file (or files) as follows:

**Note.** When you run the following cell, it tries to connect Colab with Google Drive. To complete the connection, follow steps 1 to 5 in this link (https://www.marktechpost.com/2019/06/07/how-to-connect-google-colab-with-google-drive/).

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

# This path is common for everybody
# This is the path to your google drive
input_path = '/content/gdrive/My Drive/'
# reading the data (target)
target_dataset = pd.read_csv(input_path + 'OnlineNewsPopularity.csv', index_col=0)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


**Local directory**: If you save the data in a local directory, you need to change "input_path" to the directory where you saved the file (or files).

**GitHub**: If you use my GitHub (or your own GitHub) repo, you need to change "input_path" to the location of the file (or files) in the repo. For example, when I clone ***ml_in_practice*** from my GitHub, I need to change "input_path" to 'data/' as the file (or files) is saved in the data directory of this repository.

**Note.** You can also clone my ***ml_in_practice*** repository (here: https://github.com/alimadani/ml_in_practice) and follow the same process.

## Data preparation
We need to prepare the dataset for machine learning modeling. Here we prepare the data in 2 steps:

1) Selecting the target column (' shares') from the dataframe as the output variable
2) Dropping the ' timedelta' column and the target column to obtain the input features

In [None]:
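# ' shares' is the column that contains the number of shares (the target);
# note the leading space in the column names of this CSV file
output_var = target_dataset[' shares']

# we would like to use all the remaining features as input features of the model,
# so we drop ' timedelta' and the target column ' shares'
input_features = target_dataset.drop([' timedelta', ' shares'], axis=1)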

## Checking the dataset characteristics (number of data points and features)

In [None]:
print('number of features: {}'.format(input_features.shape[1]))
print('number of data points: {}'.format(input_features.shape[0]))

number of features: 58
number of data points: 39644


## Splitting data to training and testing sets

If we do not have a separate dataset for validation and/or testing, we need to split the data into training and test sets to check the generalizability of the model we train.

**test_size**: Traditionally, 30%-40% of the dataset can be used for the test set. If you split the data into train, validation and test sets, you can use 60%, 20% and 20% of the dataset, respectively.

**Note.** We need the validation and test sets to be big enough for checking the generalizability of our model. At the same time, we would like to have as much data as possible in the training set to train a better model.

**random_state**: As the name suggests, this seeds the internal random number generator, which decides how the data are split into train and test indices. Fixing its value makes the split reproducible.


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(input_features, output_var, test_size=0.30, random_state=5)
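
To make the role of **random_state** concrete, here is a minimal sketch of a seeded shuffle-and-split in plain Python. It is not how scikit-learn implements `train_test_split`; the helper name `split_indices` is hypothetical and the rounding of the test size is an assumption made for this illustration. The point is only that the same seed reproduces the same partition.

```python
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle the indices 0..n_samples-1 with a seeded RNG and cut off a test share."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffled order
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]  # (train indices, test indices)

train_a, test_a = split_indices(10, 0.30, seed=5)
train_b, test_b = split_indices(10, 0.30, seed=5)

print(test_a == test_b)           # the two splits are identical
print(len(train_a), len(test_a))  # sizes of the two parts
```

Every index ends up in exactly one of the two parts, so no data point is used for both training and testing.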

## Building the supervised learning model
We want to build a regression model as the output variable is continuous. Here we build multiple lasso models using different hyperparameter values.


### Lasso
Lasso is a sparse learning algorithm that identifies linear relationships between the features and the output variable while trying to discard irrelevant (not a scientific term) features. Lasso solves the following minimization problem:

$$\min_w \frac{1}{2n_{\text{samples}}}||Xw-y||_2^2+\alpha||w||_1$$

where $\alpha$ is the model hyperparameter determining the level of sparsification. Larger $\alpha$ values result in fewer non-zero coefficients in the final model. $||w||_1$ is the $l_1$ norm of the coefficient vector. The added term ($\alpha ||w||_1$) is a penalty term that constrains the coefficient values of the features in the final model.
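
As a rough illustration of how the penalty term shifts the balance towards sparse coefficient vectors, the sketch below evaluates the lasso objective by hand in plain Python. The function name `lasso_objective`, the toy data and the two candidate weight vectors are all made up for this example; it does not fit a model, it only compares objective values.

```python
def lasso_objective(X, y, w, alpha):
    """Evaluate 1/(2n) * ||Xw - y||_2^2 + alpha * ||w||_1 for lists of floats."""
    n_samples = len(X)
    residuals = [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) - y_i
                 for row, y_i in zip(X, y)]
    squared_error = sum(r ** 2 for r in residuals) / (2 * n_samples)
    l1_penalty = alpha * sum(abs(w_j) for w_j in w)
    return squared_error + l1_penalty

# Toy data (hypothetical): y = 2 * x1 + 0.2 * x2, so x2 matters only slightly.
X = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, -0.5]]
y = [2.1, 3.9, 6.1, 7.9]

dense_w = [2.0, 0.2]   # fits the data exactly, uses both features
sparse_w = [2.0, 0.0]  # small fitting error, uses only the first feature

for alpha in (0.0, 0.1, 1.0):
    print(alpha, lasso_objective(X, y, dense_w, alpha), lasso_objective(X, y, sparse_w, alpha))
```

With $\alpha=0$ the dense vector has the lower objective because it fits the data exactly. Once $\alpha$ is large enough, the extra penalty paid for the second coefficient outweighs the small gain in fit, and the sparse vector has the lower objective.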



## Cross-validation and checking generalizability of the model
After training a machine learning model, we need to check its generalizability and make sure it is not only good at predicting the training set but also capable of predicting new data points. We split the data into 2 parts, a training set and a test set. We can go one step further and repeat this splitting across the dataset so that every single data point appears in one of the test (or, more precisely, validation) sets. This process is called k-fold cross-validation. For example, in 5-fold cross-validation the dataset is split into 5 chunks, and the model is trained on 4 of the 5 chunks and tested on the remaining chunk. The test chunk is then rotated so that every chunk is used once for testing the model. We can then average the performance of the model over the tested chunks.

Here we use 5-fold cross-validation.

Note. Poor generalizability of a trained model to new data points, despite good performance on the training set, is called overfitting.
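
The rotation of the validation chunk described above can be sketched in a few lines of plain Python. The generator name `k_fold_indices` is hypothetical, the folds are contiguous and unshuffled, and spreading the remainder over the first folds is a choice made for this illustration rather than a statement about any particular library.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=12, n_folds=5))
for train, validation in folds:
    print(len(train), validation)
```

Each data point appears in exactly one validation chunk, and the model for that fold would be trained on the remaining indices only.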

## Hyperparameter selection
We have parameters and hyperparameters that need to be determined to build a machine learning model. The parameters are determined by the optimization process on the training set (this is what happens when we train a model). The hyperparameters belong to the method itself (like $\alpha$ in lasso) and are set before training rather than learned from the data. They can, however, be optimized for the dataset at hand. Hyperparameter optimization is usually done on a validation (or development) set. In cross-validation, we assess the performance of a model on the different validation sets produced by the folds. Hence, the cross-validation performance of models with different hyperparameter values can be compared to select the best hyperparameters.


In [None]:
from sklearn.model_selection import cross_val_score
from sklearn import linear_model

alpha_hyperparam = np.arange(0.1,1.1,0.1)
scores = []
for alpha_iter in alpha_hyperparam:
  print('alpha: {}'.format(alpha_iter))
  lasso = linear_model.Lasso(alpha=alpha_iter)
  # mean of the negated MSE over the 5 folds, sign-flipped back to a positive error
  # NOTE: the mean is also divided by the approximate fold size (len(y_train)/5), so the
  # stored values are the MSE rescaled by a constant, which does not affect which alpha is lowest
  scores.append(-round(cross_val_score(lasso, X_train, y_train, cv=5, scoring='neg_mean_squared_error').mean()/(len(y_train)/5), 3))

# average performance across all folds
print("Average cross-validation performance (mean squared error) in 5-fold cross validation for alpha values of 0.1 to 1 are {}, respectively.".format(scores))

alpha: 0.1
alpha: 0.2
alpha: 0.30000000000000004
alpha: 0.4
alpha: 0.5
alpha: 0.6
alpha: 0.7000000000000001
alpha: 0.8
alpha: 0.9
alpha: 1.0
Average cross-validation performance (mean squared error) in 5-fold cross validation for alpha values of 0.1 to 1 are [42248.563, 52116.924, 50441.597, 48810.133, 47004.382, 45440.819, 43984.336, 42591.135, 41261.189, 39994.479], respectively.


In [None]:
print('best alpha value corresponding to the lowest MSE: {}'.format(alpha_hyperparam[np.argmin(scores)]))

best alpha value corresponding to the lowest MSE: 1.0
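
The lowest score above was obtained at the largest alpha in the grid. When the best value sits on the edge of the searched range, it may be worth trying a wider grid, since the optimum could lie outside it. The sketch below repeats the selection in plain Python on the scores printed above and flags this situation; the variable names are hypothetical, and building the grid from integers is just one way to avoid float artefacts such as 0.30000000000000004.

```python
# Scores copied from the cross-validation output above, one per alpha.
reported_scores = [42248.563, 52116.924, 50441.597, 48810.133, 47004.382,
                   45440.819, 43984.336, 42591.135, 41261.189, 39994.479]

# Build the grid from integers to avoid accumulated floating-point error.
alpha_grid = [i / 10 for i in range(1, 11)]

# Index of the lowest score, then the alpha at that index.
best_index = min(range(len(reported_scores)), key=reported_scores.__getitem__)
best_alpha = alpha_grid[best_index]

# True when the best alpha is the smallest or the largest value tried.
on_boundary = best_index in (0, len(alpha_grid) - 1)

print(best_alpha, on_boundary)
```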


We identified that $\alpha=1$ results in the best performance in the 5-fold cross-validation setting. Now we use all the training data to refit a lasso model with $\alpha=1$ and then assess the performance of the model on the test set.

In [None]:
# Create the lasso model object with the selected alpha
lasso = linear_model.Lasso(alpha=1)
# Train the model using the training set
lasso.fit(X_train, y_train)
# Make predictions using the testing set
y_pred_lasso = lasso.predict(X_test)

## Evaluating performance of the model
Finally, we need to assess the performance of the model using its predictions on the test set. We use mean squared error to assess the performance of our model. Here is its definition:

**Mean squared error (MSE)**: 

\begin{equation*} MSE = \frac{1}{n}\sum_{i=1}^n (Y_i-\hat{Y}_i)^2 \end{equation*}

Note. By setting squared = False, we get the square root of **MSE** (the root mean squared error, RMSE).
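
The definition can be checked by hand on a few numbers. The sketch below uses plain Python and made-up values that are not taken from the dataset; the helper name `mse_by_hand` is hypothetical.

```python
import math

def mse_by_hand(y_true, y_hat):
    """MSE = (1/n) * sum_i (Y_i - Yhat_i)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / len(y_true)

# Made-up observed and predicted values for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]
y_hat = [110.0, 190.0, 330.0, 400.0]

mse = mse_by_hand(y_true, y_hat)  # (100 + 100 + 900 + 0) / 4
rmse = math.sqrt(mse)             # same units as the target variable
print(mse, rmse)
```

Because the square root brings the error back to the units of the target (here, number of shares), the root of the MSE is often easier to interpret than the MSE itself.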


In [None]:
from sklearn import metrics

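# squared=False returns the square root of the MSE (RMSE);
# newer scikit-learn releases provide metrics.root_mean_squared_error for this
print("root mean squared error of the predictions using lasso with alpha=1:", metrics.mean_squared_error(y_test, y_pred_lasso, squared = False))

root mean squared error of the predictions using lasso with alpha=1: 9042.2006051822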
