# Linear regression

Assignment: Build a linear regression model to predict the log number of shares an article received.

1. Explain briefly in your own words how linear regression works
2. Your pre-processing steps
   - The head() of the resulting data frame
3. Splitting the dataset into a training and test set
4. Training a linear regression model to predict the number of shares, using exactly 5 variables (collections of dummy variables, such as weekday_is_monday, weekday_is_tuesday, etc. count as 1 variable). Report:
   - How you selected the variables
   - An equation of the model (please use Markdown formulas)
   - Plots of the relation of your selected variables with the target
   - Comment on the linearity of those relationships
5. Evaluating the model on the test data
   - Predictive power of the model ($R^2$, RMSE)
   - Investigating the residuals


### 1. Explain briefly in your own words how linear regression works


Linear regression builds a model that uses an independent variable ($x$) to predict a dependent variable ($y$). For example, based on the surface area of a house (independent) you can build a model to predict the price of that house (dependent). The model looks for the straight line that best fits the provided data points.

For linear regression the following equation is used: $y = b_0 + b_1 x + e$.
In this equation $b_0$ is the intercept and $b_1$ is the slope; both are coefficients (constants).
The $e$ stands for the error (residual): the difference between the observed value and the value on the line.
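
As a minimal sketch of what "finding the line" means, the snippet below computes $b_0$ and $b_1$ by ordinary least squares for a single predictor. The house data and the helper name `fit_line` are made up for illustration; they are not part of the assignment data.

```python
# Ordinary least squares for one predictor, written out by hand.
# The numbers below are hypothetical and only illustrate the idea.

def fit_line(x, y):
    """Return (b0, b1) that minimise the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    b0 = mean_y - b1 * mean_x
    return b0, b1

area = [50, 70, 90, 110, 130]       # hypothetical surface area
price = [150, 210, 260, 330, 380]   # hypothetical price

b0, b1 = fit_line(area, price)
# Residuals e = observed y minus the value predicted by the line
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(area, price)]
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
```

With an intercept in the model, the residuals of a least-squares fit sum to zero, which is a quick sanity check on the result.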

### 2. Your pre-processing steps

In [16]:
import pandas as pd
import sklearn as sk
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# # importing stuff
# import math
# import seaborn as sns
# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt

# from sklearn import metrics
# from sklearn.model_selection import train_test_split
# from sklearn.linear_model import LinearRegression
# %matplotlib inline

df = pd.read_csv("dataMashable.csv")
df.head()

Unnamed: 0,id,url,n_tokens_title,n_tokens_content,num_imgs,num_videos,average_token_length,data_channel_is_lifestyle,data_channel_is_entertainment,data_channel_is_bus,...,weekday_is_friday,weekday_is_saturday,weekday_is_sunday,is_weekend,global_subjectivity,global_sentiment_polarity,title_subjectivity,title_sentiment_polarity,shares,shares_log
0,1,http://mashable.com/2013/10/29/ashton-kutcher-...,10,821,12,0,4.518879,0,0,0,...,0,0,0,0,0.439379,0.082486,0.666667,0.375,2100,3.322219
1,2,http://mashable.com/2014/09/08/mashies-sept-19...,9,489,6,0,5.0409,0,0,0,...,0,0,0,0,0.300454,0.027715,0.0,0.0,274,2.437751
2,3,http://mashable.com/2013/02/01/hello-kitty-roc...,11,131,0,1,4.877863,0,0,0,...,1,0,0,0,0.575486,0.25912,0.0,0.0,1500,3.176091
3,4,http://mashable.com/2014/02/06/add-us-on-snapc...,8,556,2,0,4.97482,0,0,1,...,0,0,0,0,0.32722,0.134424,0.0,0.0,2000,3.30103
4,5,http://mashable.com/2014/01/07/lindsey-vonn-wi...,9,880,18,0,4.928409,0,0,0,...,0,0,0,0,0.507709,0.109256,0.0,0.0,6000,3.778151


### Selecting variables for the model


In [3]:
# What are the possible variables
df.columns

Index(['id', 'url', 'n_tokens_title', 'n_tokens_content', 'num_imgs',
       'num_videos', 'average_token_length', 'data_channel_is_lifestyle',
       'data_channel_is_entertainment', 'data_channel_is_bus',
       'data_channel_is_socmed', 'data_channel_is_tech',
       'data_channel_is_world', 'weekday_is_monday', 'weekday_is_tuesday',
       'weekday_is_wednesday', 'weekday_is_thursday', 'weekday_is_friday',
       'weekday_is_saturday', 'weekday_is_sunday', 'is_weekend',
       'global_subjectivity', 'global_sentiment_polarity',
       'title_subjectivity', 'title_sentiment_polarity', 'shares',
       'shares_log'],
      dtype='object')

In [10]:
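# Correlations of the numeric variables with shares_log (non-numeric columns such as url are skipped)
correlations = df[df.columns[1:]].select_dtypes('number').corr()['shares_log']
# Sort ascending: the strongest negative correlations come first, the strongest positive last
correlations.sort_values(ascending=True)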

data_channel_is_world           -0.141932
data_channel_is_entertainment   -0.080967
data_channel_is_bus             -0.033469
weekday_is_wednesday            -0.032125
average_token_length            -0.031730
weekday_is_thursday             -0.030329
weekday_is_tuesday              -0.027738
n_tokens_title                  -0.012247
weekday_is_monday               -0.007843
weekday_is_friday                0.006309
n_tokens_content                 0.032897
data_channel_is_lifestyle        0.036926
num_videos                       0.037147
data_channel_is_tech             0.047729
title_sentiment_polarity         0.050216
global_sentiment_polarity        0.053199
title_subjectivity               0.055161
weekday_is_sunday                0.071864
weekday_is_saturday              0.072644
data_channel_is_socmed           0.085516
num_imgs                         0.085575
global_subjectivity              0.093021
is_weekend                       0.105919
shares                           0

The five variables with the strongest correlation with shares_log (in absolute value) are selected, where a series of dummies counts as one variable:

1. Weekend
   - is_weekend: whether the article was posted in the weekend.
2. Channel
   - data_channel_is_world has the strongest correlation with shares_log in absolute value (-0.14). The variable is part of a series of dummies, therefore all channel dummies are included.
3. Global subjectivity
   - global_subjectivity: subjectivity of the text.
4. Number of images
   - num_imgs: number of images in the article.
5. Title subjectivity
   - title_subjectivity: subjectivity of the title.
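
The selection above was done by reading the sorted correlation output. A small helper could make the ranking by absolute correlation explicit. This is only a sketch: the function name `rank_by_abs_correlation` and the toy frame are hypothetical, and the real data would be passed in as `df`.

```python
import pandas as pd

def rank_by_abs_correlation(df, target, exclude=()):
    """Rank numeric columns by absolute Pearson correlation with `target`."""
    numeric = df.select_dtypes("number").drop(columns=list(exclude), errors="ignore")
    corr = numeric.corr()[target].drop(target)
    # Order by |r|, but keep the signed values so the direction stays visible
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Tiny made-up frame, only to show the call; not the Mashable data.
toy = pd.DataFrame({
    "x_strong_neg": [5, 4, 3, 2, 1],
    "x_weak": [1, 3, 2, 3, 1],
    "y": [1.0, 2.1, 2.9, 4.2, 5.0],
})
ranking = rank_by_abs_correlation(toy, "y")
print(ranking)
```

On the notebook's data the call would presumably look like `rank_by_abs_correlation(df, "shares_log", exclude=["id", "shares"])`. Dummy groups such as the channels still have to be counted as one variable by hand.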

### Creating df and splitting into a training and test set

In [25]:
# Select the chosen variables
df_var = df[['is_weekend', 'data_channel_is_lifestyle', 'data_channel_is_entertainment', 
               'data_channel_is_bus','data_channel_is_socmed', 'data_channel_is_tech', 
               'data_channel_is_world', 'global_subjectivity', 'num_imgs', 'title_subjectivity']]

y = df['shares_log'] # the target (Y-variable): the log number of shares
X = df_var.loc[:,'is_weekend':'title_subjectivity'] # all rows, and the columns from "is_weekend" to "title_subjectivity"
X.index = df['url'] # use the article URLs as the index, so we don't lose track of the articles later
X.head()



url,is_weekend,data_channel_is_lifestyle,data_channel_is_entertainment,data_channel_is_bus,data_channel_is_socmed,data_channel_is_tech,data_channel_is_world,global_subjectivity,num_imgs,title_subjectivity
http://mashable.com/2013/10/29/ashton-kutcher-lenovo/,0,0,0,0,0,1,0,0.439379,12,0.666667
http://mashable.com/2014/09/08/mashies-sept-19-deadline/,0,0,0,0,0,0,1,0.300454,6,0.0
http://mashable.com/2013/02/01/hello-kitty-rocket/,0,0,0,0,0,0,0,0.575486,0,0.0
http://mashable.com/2014/02/06/add-us-on-snapchat/,0,0,0,1,0,0,0,0.32722,2,0.0
http://mashable.com/2014/01/07/lindsey-vonn-withdraws-sochi-olympics-knee-injury/,0,0,0,0,0,1,0,0.507709,18,0.0


In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data, store it into different variables
X_train.head() #The train data


url,is_weekend,data_channel_is_lifestyle,data_channel_is_entertainment,data_channel_is_bus,data_channel_is_socmed,data_channel_is_tech,data_channel_is_world,global_subjectivity,num_imgs,title_subjectivity
http://mashable.com/2013/02/03/puppy-bowl-online/,1,0,0,0,0,0,0,0.451711,0,0.5
http://mashable.com/2014/09/11/bridesmaid-lost-dress-sydney/,0,0,0,0,0,0,0,0.476976,9,0.1
http://mashable.com/2013/03/08/facebook-acquires-storylane/,0,0,0,1,0,0,0,0.488495,1,0.0
http://mashable.com/2014/04/29/yahoo-mail-app-redesign/,0,0,0,0,0,0,0,0.545746,23,0.454545
http://mashable.com/2013/08/28/chef-knife-moves-video/,0,0,0,0,0,0,0,0.507042,13,0.75


##### Train model

In [27]:
lm = LinearRegression() #create the model
model = lm.fit(X_train, y_train) #train the model

##### Calculate coefficients 

In [28]:
coef = pd.DataFrame({'variable': X.columns, 'coefficient': lm.coef_}) # one row per variable, with the fitted coefficient next to its name
coef

,variable,coefficient
0,is_weekend,0.118794
1,data_channel_is_lifestyle,-0.063844
2,data_channel_is_entertainment,-0.20186
3,data_channel_is_bus,-0.139517
4,data_channel_is_socmed,0.005019
5,data_channel_is_tech,-0.084932
6,data_channel_is_world,-0.229058
7,global_subjectivity,0.154307
8,num_imgs,0.002605
9,title_subjectivity,0.029093
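
Written out with the coefficients from the table above (rounded to three decimals), the fitted model would look as follows. The intercept is not shown in the output above, so it is left as the symbol $b_0$ here; in scikit-learn it is available as `lm.intercept_`.

$$
\begin{aligned}
\widehat{\text{shares\_log}} = b_0 &+ 0.119 \cdot \text{is\_weekend} \\
&- 0.064 \cdot \text{data\_channel\_is\_lifestyle} - 0.202 \cdot \text{data\_channel\_is\_entertainment} \\
&- 0.140 \cdot \text{data\_channel\_is\_bus} + 0.005 \cdot \text{data\_channel\_is\_socmed} \\
&- 0.085 \cdot \text{data\_channel\_is\_tech} - 0.229 \cdot \text{data\_channel\_is\_world} \\
&+ 0.154 \cdot \text{global\_subjectivity} + 0.003 \cdot \text{num\_imgs} + 0.029 \cdot \text{title\_subjectivity}
\end{aligned}
$$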



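We can interpret these coefficients as follows. Each effect is on shares_log (the base-10 log of the number of shares), not on the number of shares itself, and holds when the other variables stay the same:

- If an article is posted in the weekend, shares_log increases by 0.12.
- For each one-unit increase in global_subjectivity, shares_log increases by 0.15.
- For each one-unit increase in title_subjectivity, shares_log increases by 0.03.
- For each image in an article, shares_log increases by 0.003.
- Each channel dummy shifts shares_log relative to articles that belong to none of the six channels. For example, an article in the World channel has a shares_log that is 0.23 lower.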
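
Because the target appears to be the base-10 log of the shares (the first row has shares = 2100 and shares_log = 3.322219), a coefficient $b$ can also be read as a multiplication of the number of shares by $10^b$. The sketch below works this out for two of the coefficients. The helper name `multiplicative_effect` is made up for this example.

```python
import math

# Check the log base against the first data row: shares = 2100, shares_log = 3.322219
assert abs(math.log10(2100) - 3.322219) < 1e-6

# Coefficients copied from the model output above
coefficients = {"is_weekend": 0.118794, "num_imgs": 0.002605}

def multiplicative_effect(coef, delta=1.0):
    """Factor by which predicted shares change when a variable rises by `delta`."""
    return 10 ** (coef * delta)

factors = {name: multiplicative_effect(c) for name, c in coefficients.items()}
for name, factor in factors.items():
    print(f"{name}: shares x {factor:.3f} ({(factor - 1) * 100:+.1f}%)")
```

Read this way, a weekend article is predicted to get roughly 30% more shares than a comparable weekday article, while one extra image changes the prediction by well under 1%.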

Finally, let's look at the model performance. We'll generate predictions and calculate the $R^2$ and RMSE.
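
As a sketch of what these two metrics measure, the snippet below computes them by hand on a few made-up values. The function names and the numbers are hypothetical. In the notebook the same quantities would presumably come from `lm.predict(X_test)` together with `metrics.r2_score` and the square root of `metrics.mean_squared_error`.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in units of y."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions explain."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot

# Made-up values on the shares_log scale, only to show the calls
y_true = [3.0, 3.5, 2.5, 4.0]
y_pred = [3.2, 3.3, 2.9, 3.6]

print(f"RMSE = {rmse(y_true, y_pred):.3f}")
print(f"R^2  = {r_squared(y_true, y_pred):.3f}")
```

An $R^2$ close to 0 means the model predicts hardly better than always guessing the mean. The RMSE is on the same scale as the target, so here it is expressed in shares_log units.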