# Project objective
This project reviews the linear regression method and its Python implementation using the Facebook metrics dataset.

Information about the dataset, technical details about the machine learning method(s) used, and mathematical details of the quantification approaches are provided alongside the code.

# Packages we work with in this notebook
We are going to use the following libraries and packages:

* **numpy**: NumPy is the fundamental package for scientific computing with Python. (http://www.numpy.org/)
* **sklearn**: Scikit-learn is a machine learning library for the Python programming language. (https://scikit-learn.org/stable/)
* **pandas**: Pandas provides easy-to-use data structures and data analysis tools for Python. (https://pandas.pydata.org/)


In [0]:
import numpy as np
import pandas as pd
import sklearn as sk

# Introduction to the dataset

**Name**: Facebook metrics Data Set

**Summary**: Facebook performance metrics of a renowned cosmetics brand's Facebook page.

**Number of features**: 18

**Note.**:  One of the columns in this structured dataset is for the outcome we would like to predict. 

**Number of data points (instances)**: 500

**Link to the dataset**: http://archive.ics.uci.edu/ml/datasets/Facebook+metrics




## Importing the dataset
We can import the dataset in multiple ways.

**Colab Notebook**: Download the dataset file (or files) from the link (if provided), upload it to your Google Drive, and then import it as follows:

**Note.** Running the following cell connects Colab to Google Drive. Follow steps 1 to 5 in this link (https://www.marktechpost.com/2019/06/07/how-to-connect-google-colab-with-google-drive/) to complete the process.

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

# This path is common for everybody
# This is the path to your google drive
input_path = '/content/gdrive/My Drive/'
file_name = 'dataset_Facebook.csv'
# "sep" must match the column delimiter used in the original file (here, a semicolon)
target_dataset = pd.read_csv(input_path+file_name, sep=';')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


**Local directory**: If you save the data in a local directory, change "input_path" to the directory where you saved the file (or files).

**GitHub**: If you use my GitHub (or your own GitHub) repo, change "input_path" to the location of the file (or files) in the repo. For example, after cloning ***ml_projects*** from my GitHub, I change "input_path" to 'data/', because the file (or files) is saved in the data directory of that repository.

**Note.**: You can also clone my ***ml_projects*** repository (here: https://github.com/alimadani/ml_projects) and follow the same process.

## Checking the dataset characteristics (number of data points and features)

In [0]:
print('number of data points: {}'.format(target_dataset.shape[0]))
print('number of features: {}'.format(target_dataset.shape[1]-1))
# remember that 1 column is the output we want to predict and should not be considered as a feature

number of data points: 500
number of features: 18


## Data preparation
We need to prepare the dataset for machine learning modeling. Here we prepare the data in 2 steps:

1) converting categorical variables to integers

2) replacing infinite values with NaN and then filling NaNs with 0

In [0]:
# converting strings in categorical features to integers
cat_columns = target_dataset.select_dtypes(['object']).columns
target_dataset[cat_columns] = target_dataset[cat_columns].apply(lambda x: pd.factorize(x)[0])

# replace infinite values with NaN,
# then fill NaNs with 0
target_dataset = target_dataset.replace([np.inf, -np.inf], np.nan)
target_dataset = target_dataset.fillna(0)
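
To make the two preparation steps concrete, here is a minimal sketch on a toy dataframe. The column values are hypothetical and only illustrate what `pd.factorize`, `replace`, and `fillna` do; the names `toy`, `codes`, and `uniques` are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one string column, one numeric column
# containing a missing value and an infinite value
toy = pd.DataFrame({
    "Type": ["Photo", "Status", "Photo", "Link"],
    "Paid": [0.0, np.nan, 1.0, np.inf],
})

# pd.factorize assigns integer codes in order of first appearance
codes, uniques = pd.factorize(toy["Type"])
toy["Type"] = codes

print(list(uniques))   # ['Photo', 'Status', 'Link']
print(codes.tolist())  # [0, 1, 0, 2]

# Infinite values become NaN, then every NaN becomes 0
toy = toy.replace([np.inf, -np.inf], np.nan).fillna(0)
print(toy)
```

Note that the integer codes only label the categories; their numeric order (0, 1, 2) reflects the order of first appearance, not any ranking of the categories.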

### Separating features from output variable
The dataframe of the target dataset has one column whose values we would like to predict (the output variable). We need to separate this column from the rest of the dataframe, which contains the features we want to use to build the model.

In [0]:
# output variable
output_var = target_dataset['like']

# input features
input_features = target_dataset.drop(['like'], axis=1)
# shape[1] is the number of columns (features); shape[0] would be the number of rows
print('number of features: {}'.format(input_features.shape[1]))


number of features: 18


## Splitting data into training and testing sets

If we do not have a separate dataset for validation and/or testing, we need to split the data into training and test sets to check the generalizability of the model we train.

**test_size**: Traditionally, 30%-40% of the dataset can be used for the test set. If we split the data into training, validation and test sets, we can use 60%, 20% and 20% of the dataset, respectively.

**Note.**: We need the validation and test sets to be big enough for checking the generalizability of our model. At the same time, we would like to have as much data as possible in the training set to train a better model.

**random_state**: As the name suggests, this is used for initializing the internal random number generator, which decides how the data is split into train and test indices. Fixing it makes the split reproducible.


In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(input_features, output_var, test_size=0.30, random_state=5)
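
The 60%/20%/20% split mentioned above can be sketched with two calls to `train_test_split`. This is only an illustration on synthetic data with the same shape as the dataset (500 rows, 18 features); the names `X_demo`, `y_demo`, `X_tr`, `X_val`, and `X_te` are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 500 rows, 18 features
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(500, 18))
y_demo = rng.normal(size=500)

# First hold out 20% of the data as the test set
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=5)

# Then take 25% of the remaining 80% as the validation set
# (0.25 * 0.80 = 0.20 of the full dataset)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=5)

print(X_tr.shape[0], X_val.shape[0], X_te.shape[0])  # 300 100 100
```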

## Building the supervised learning model
We want to build a regression model because the output variable is continuous. Here we build a simple linear regression model. In linear regression, if we have a set of features $X_1$ to $X_n$, $y$ can be obtained as:
\begin{equation*} y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n \end{equation*}

where $y$ is the predicted value, obtained as a weighted sum of the feature values, $b_0$ is the intercept and $b_1$ to $b_n$ are the coefficients (weights) of the features.

In [0]:
from sklearn.linear_model import LinearRegression

# Create linear regression object
regr = LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
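
The equation above maps directly onto the fitted attributes of `LinearRegression`: the intercept $b_0$ is stored in `intercept_` and the weights $b_1$ to $b_n$ in `coef_`. The following sketch uses synthetic data generated from known, hypothetical coefficients (not the Facebook dataset) to show that the weighted sum reproduces `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from a known linear rule (hypothetical values):
# y = 2 + 3*X1 - 1*X2, with no noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 2.0 + 3.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)

# Weighted sum of the features plus the intercept: b0 + b1*X1 + b2*X2
manual_pred = demo_model.intercept_ + X_demo @ demo_model.coef_

print(demo_model.intercept_, demo_model.coef_)
print(np.allclose(manual_pred, demo_model.predict(X_demo)))  # True
```

Because the synthetic output is an exact linear function of the features, the fitted intercept and coefficients recover 2, 3 and -1 up to floating-point error.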

## Predicting the test (or validation) set
We now use the trained model to predict the output values for the test set features (X_test). These predictions are later compared with the true values (y_test).

In [0]:
# Make predictions using the testing set
y_pred = regr.predict(X_test)

## Evaluating performance of the model
We need to assess the performance of the model using its predictions on the test set. We use mean squared error and mean absolute error for this. Here are their definitions:

**Mean squared error (MSE)**: 

\begin{equation*} MSE = \frac{1}{n}\sum_{i=1}^n (Y_i-\hat{Y}_i)^2 \end{equation*}

**Mean absolute error (MAE)**: 

\begin{equation*} MAE = \frac{1}{n}\sum_{i=1}^n |Y_i-\hat{Y}_i| \end{equation*}

where $n$ is the total number of data points whose output values we predicted, $ Y_i $ is the output value of the $i$th data point and $ \hat{Y}_i $ is the predicted output value of the $i$th data point.

In [0]:
from sklearn import metrics

print("Mean squared error: {}".format(metrics.mean_squared_error(y_test, y_pred)))
print("Mean absolute error: {}".format(metrics.mean_absolute_error(y_test, y_pred)))

Mean squared error: 1.32242961644449e-24
Mean absolute error: 6.490823380455474e-13
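
As a check on the two definitions, the following sketch computes MSE and MAE by hand with NumPy on a small set of hypothetical values and compares the results with the `sklearn.metrics` functions used above. The arrays `y_true_demo` and `y_pred_demo` are illustrative only.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted values
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true_demo - y_pred_demo        # [1, 0, -2, -1]
mse_manual = np.mean(errors ** 2)         # (1 + 0 + 4 + 1) / 4 = 1.5
mae_manual = np.mean(np.abs(errors))      # (1 + 0 + 2 + 1) / 4 = 1.0

print(mse_manual, metrics.mean_squared_error(y_true_demo, y_pred_demo))
print(mae_manual, metrics.mean_absolute_error(y_true_demo, y_pred_demo))
```

MSE penalizes large errors more heavily because each error is squared, while MAE stays in the same units as the output variable.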


## Extracting the coefficients of the model
The trained linear regression model is a linear combination of feature values. Hence, each feature has a coefficient in this linear combination for predicting the output variable.

In [0]:
print('Coefficients: {}'.format(regr.coef_))

Coefficients: [ 8.06626380e-18  8.04427054e-14  2.49042960e-13 -7.25583941e-15
 -1.15422712e-14 -7.40617409e-15  1.58590907e-13  1.14828548e-16
 -7.14846660e-17 -1.05606852e-15  1.19494392e-15 -1.89649566e-16
  6.38825188e-17  6.59320565e-17 -3.44487473e-16 -1.00000000e+00
 -1.00000000e+00  1.00000000e+00]
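
In the output above, the last three coefficients are -1, -1 and 1, while all the others are close to zero. Together with the near-zero errors reported earlier, this pattern is what we would expect if the output variable were an exact linear combination of three features, for example if a total-interactions column were the sum of the comment, like and share counts (an assumption about the columns, not something the notebook confirms). The sketch below reproduces this pattern on synthetic data with hypothetical column names and values, and shows how to pair each coefficient with its feature name using a pandas Series.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical values): the total is defined as an
# exact sum that includes the output column "like"
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "reach": rng.integers(100, 1000, size=200),
    "comment": rng.integers(0, 20, size=200),
    "share": rng.integers(0, 30, size=200),
    "like": rng.integers(0, 300, size=200),
})
demo["Total Interactions"] = demo["comment"] + demo["like"] + demo["share"]

X_leak = demo.drop(["like"], axis=1)
y_leak = demo["like"]
leak_model = LinearRegression().fit(X_leak, y_leak)

# Pair each coefficient with the name of its feature
coef_table = pd.Series(leak_model.coef_, index=X_leak.columns)
print(coef_table.round(6))
```

In this synthetic case the model recovers like = Total Interactions - comment - share, so the coefficient of "reach" is essentially zero. If the same relationship holds in the real dataset, dropping the columns that are derived from the output before training would give a more realistic estimate of predictive performance.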
