# Linear regression

Assignment: Build a linear regression model to predict the log number of shares an article received.

1. Explain briefly in your own words how linear regression works
2. Your pre-processing steps
   - The head() of the resulting data frame
3. Splitting the dataset into a training and test set
4. Training a linear regression model to predict the number of shares, using exactly 5 variables (collections of dummy variables, such as weekday_is_monday, weekday_is_tuesday, etc., count as 1 variable). Report:
   - How you selected the variables
   - An equation of the model (please use Markdown formulas)
   - Plots of the relation of your selected variables with the target
   - Comment on the linearity of those relationships
5. Evaluating the model on the test data
   - Predictive power of the model ($R^2$, RMSE)
   - Investigating the residuals


### 1. Explain briefly in your own words how linear regression works


Linear regression is used to build a model that predicts a dependent variable ($y$) from an independent variable ($x$). For example, based on the surface area of a house (independent) you can build a model to predict the price of that house (dependent). The model looks for the straight line that best fits the provided data points, which for ordinary least squares means the line with the smallest sum of squared residuals.

For simple linear regression the following equation is used: $y = b_0 + b_1 x + e$.
In this equation $b_0$ is the intercept and $b_1$ is the slope; both are coefficients (constants) estimated from the data.
The $e$ stands for the error (residual): the difference between the observed value and the value predicted by the line.
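
To make the equation concrete, here is a minimal sketch of how $b_0$ and $b_1$ can be estimated with ordinary least squares, using the closed-form estimates $b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $b_0 = \bar{y} - b_1 \bar{x}$. The function name `fit_simple_linear_regression` and the house data are made up for illustration; they are not part of the assignment data.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1) for y = b0 + b1 * x using ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


# Hypothetical, made-up data: surface area (m^2) and price (in thousands)
area = [50, 70, 90, 110, 130]
price = [150, 200, 260, 300, 340]

b0, b1 = fit_simple_linear_regression(area, price)

# Residuals e = observed - predicted; with an intercept they sum to (about) zero
residuals = [p - (b0 + b1 * a) for a, p in zip(area, price)]
print(f"price = {b0:.2f} + {b1:.2f} * area")
```

With more than one independent variable the same idea applies, but the coefficients are usually obtained with a library such as scikit-learn's `LinearRegression` rather than by hand.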

### 2. Your pre-processing steps

In [16]:
# Libraries for data handling and modelling
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Currently unused imports, kept commented out
# import math
# import seaborn as sns
# import numpy as np
# import matplotlib.pyplot as plt
# %matplotlib inline

# Load the Mashable data set and preview the first five rows
df = pd.read_csv("dataMashable.csv")
df.head()

| | id | url | n_tokens_title | n_tokens_content | num_imgs | num_videos | average_token_length | data_channel_is_lifestyle | data_channel_is_entertainment | data_channel_is_bus | ... | weekday_is_friday | weekday_is_saturday | weekday_is_sunday | is_weekend | global_subjectivity | global_sentiment_polarity | title_subjectivity | title_sentiment_polarity | shares | shares_log |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | http://mashable.com/2013/10/29/ashton-kutcher-... | 10 | 821 | 12 | 0 | 4.518879 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.439379 | 0.082486 | 0.666667 | 0.375 | 2100 | 3.322219 |
| 1 | 2 | http://mashable.com/2014/09/08/mashies-sept-19... | 9 | 489 | 6 | 0 | 5.0409 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.300454 | 0.027715 | 0.0 | 0.0 | 274 | 2.437751 |
| 2 | 3 | http://mashable.com/2013/02/01/hello-kitty-roc... | 11 | 131 | 0 | 1 | 4.877863 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0.575486 | 0.25912 | 0.0 | 0.0 | 1500 | 3.176091 |
| 3 | 4 | http://mashable.com/2014/02/06/add-us-on-snapc... | 8 | 556 | 2 | 0 | 4.97482 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0.32722 | 0.134424 | 0.0 | 0.0 | 2000 | 3.30103 |
| 4 | 5 | http://mashable.com/2014/01/07/lindsey-vonn-wi... | 9 | 880 | 18 | 0 | 4.928409 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0.507709 | 0.109256 | 0.0 | 0.0 | 6000 | 3.778151 |


### Selecting variables for the model


In [3]:
# What are the possible variables
df.columns

Index(['id', 'url', 'n_tokens_title', 'n_tokens_content', 'num_imgs',
       'num_videos', 'average_token_length', 'data_channel_is_lifestyle',
       'data_channel_is_entertainment', 'data_channel_is_bus',
       'data_channel_is_socmed', 'data_channel_is_tech',
       'data_channel_is_world', 'weekday_is_monday', 'weekday_is_tuesday',
       'weekday_is_wednesday', 'weekday_is_thursday', 'weekday_is_friday',
       'weekday_is_saturday', 'weekday_is_sunday', 'is_weekend',
       'global_subjectivity', 'global_sentiment_polarity',
       'title_subjectivity', 'title_sentiment_polarity', 'shares',
       'shares_log'],
      dtype='object')

In [10]:
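# Correlations of the numeric variables with shares_log
# (id is dropped and the non-numeric url column is excluded)
correlations = df.select_dtypes(include='number').drop(columns='id').corr()['shares_log']
# Sort ascending: strongest negative correlations first, strongest positive last
correlations.sort_values(ascending=True)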

data_channel_is_world           -0.141932
data_channel_is_entertainment   -0.080967
data_channel_is_bus             -0.033469
weekday_is_wednesday            -0.032125
average_token_length            -0.031730
weekday_is_thursday             -0.030329
weekday_is_tuesday              -0.027738
n_tokens_title                  -0.012247
weekday_is_monday               -0.007843
weekday_is_friday                0.006309
n_tokens_content                 0.032897
data_channel_is_lifestyle        0.036926
num_videos                       0.037147
data_channel_is_tech             0.047729
title_sentiment_polarity         0.050216
global_sentiment_polarity        0.053199
title_subjectivity               0.055161
weekday_is_sunday                0.071864
weekday_is_saturday              0.072644
data_channel_is_socmed           0.085516
num_imgs                         0.085575
global_subjectivity              0.093021
is_weekend                       0.105919
shares                           0

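The variables with the strongest correlation with shares_log (in absolute value) are selected:

1. Weekend
   - is_weekend: whether the article was posted in the weekend (0.106).
2. Channel
   - data_channel_is_world has the strongest correlation with shares_log in absolute terms (-0.142). The variable is part of a series of dummies, therefore all channel dummies are included; together they count as one variable.
3. Global subjectivity
   - global_subjectivity: subjectivity of the text (0.093).
4. Number of images
   - num_imgs (0.086).
5. Title subjectivity
   - title_subjectivity (0.055), the highest correlation among the remaining non-dummy variables.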
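
A strong negative correlation is as useful for prediction as a strong positive one, so it can help to rank the variables by the absolute value of their correlation. The sketch below does this in plain Python for a subset of the correlations reported in the output above; the dictionary is typed in by hand for illustration and is not produced by the notebook.

```python
# Correlations with shares_log, copied from the notebook output (subset)
correlations = {
    "data_channel_is_world": -0.141932,
    "data_channel_is_entertainment": -0.080967,
    "title_subjectivity": 0.055161,
    "weekday_is_saturday": 0.072644,
    "data_channel_is_socmed": 0.085516,
    "num_imgs": 0.085575,
    "global_subjectivity": 0.093021,
    "is_weekend": 0.105919,
}

# Rank by absolute value, strongest relationship first
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)

for name, r in ranked:
    print(f"{name:32s} {r:+.6f}")
```

On the pandas Series itself, the same ranking should be obtainable with `correlations.abs().sort_values(ascending=False)`.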

### Creating df and splitting into a training and test set

In [12]:
# select the chosen variables
df_indep = df[['is_weekend', 'data_channel_is_lifestyle', 'data_channel_is_entertainment', 
               'data_channel_is_bus','data_channel_is_socmed', 'data_channel_is_tech', 
               'data_channel_is_world', 'global_subjectivity', 'num_imgs', 'title_subjectivity']]
df_indep.head()

# # get all the values of the independent and dependent variables
# x = df_indep.values
# y = df['shares_log'].values

# x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1)


| | is_weekend | data_channel_is_lifestyle | data_channel_is_entertainment | data_channel_is_bus | data_channel_is_socmed | data_channel_is_tech | data_channel_is_world | global_subjectivity | num_imgs | title_subjectivity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.439379 | 12 | 0.666667 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.300454 | 6 | 0.0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.575486 | 0 | 0.0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.32722 | 2 | 0.0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.507709 | 18 | 0.0 |


In [17]:
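# Path was missing from the imports, which caused the NameError
from pathlib import Path

# split the data into a train (70%) and test (30%) set
# note: df_indep holds only the predictors; shares_log is not part of these files
train, test = train_test_split(df_indep, test_size=0.3, random_state=1, shuffle=True)

# determine the path where to save the train and test files
# ASSUMPTION: data_dir is not defined in the notebook as shown; 'data' is a placeholder location
data_dir = Path('data')
data_dir.mkdir(parents=True, exist_ok=True)
train_path = Path(data_dir, 'train.tsv')
test_path = Path(data_dir, 'test.tsv')

# save the train and test files
# using the '\t' separator to create tab-separated-values files
train.to_csv(train_path, sep='\t', index=False)
test.to_csv(test_path, sep='\t', index=False)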
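
Conceptually, a 70/30 split like the one above shuffles the row indices with a fixed seed and cuts the shuffled list in two, applying the same indices to the predictors and the target so that each row keeps its own target value. The following is a minimal pure-Python sketch of that idea; it is not scikit-learn's implementation (its rounding and shuffling may differ), and `split_rows` and the sample rows are hypothetical.

```python
import random


def split_rows(features, target, test_size=0.3, seed=1):
    """Shuffle row indices once and split features and target identically."""
    if len(features) != len(target):
        raise ValueError("features and target must have the same number of rows")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split reproducible
    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [target[i] for i in train_idx]
    y_test = [target[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Made-up rows for illustration: (is_weekend, num_imgs) and a shares_log-like target
features = [(0, 12), (0, 6), (0, 0), (0, 2), (0, 18),
            (1, 3), (1, 1), (0, 4), (1, 9), (0, 7)]
target = [3.32, 2.44, 3.18, 3.30, 3.78, 3.50, 3.10, 2.90, 3.60, 3.00]

x_train, x_test, y_train, y_test = split_rows(features, target)
print(len(x_train), "training rows,", len(x_test), "test rows")
```

Splitting predictors and target together in one call, as in the commented-out `train_test_split(x, y, ...)` line earlier in the notebook, gives the same guarantee that rows stay aligned.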