# Linear Regression

It is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X.  <br>

In short, It is a cost function that lets us figure out what is the best line that can fit our data.  <br>

By best fit line means that the Root Mean Square error should be the lowest with this line.

In this tutorial we'll use the titanic data set and see how we can increase the efficiency of our model.

You can find it here: https://www.kaggle.com/c/titanic/data

In [1]:
%load_ext watermark
%watermark -a 'Vaibhav Srivastav' -nmv --packages numpy,pandas,scikit-learn,matplotlib,Seaborn
#Helps knowing what version was used later on

Vaibhav Srivastav Sun Jul 17 2016 

CPython 2.7.11
IPython 4.1.2

numpy 1.10.4
pandas 0.18.0
scikit-learn 0.17.1
matplotlib 1.5.1
Seaborn 0.7.1

compiler   : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
system     : Linux
release    : 4.2.0-42-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit


# About the Dataset

**VARIABLE DESCRIPTIONS**: <br>
survival ==        Survival (0 = No; 1 = Yes) <br>
pclass  ==          Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) <br>
name  ==            Name <br>
sex  ==             Sex <br>
age  ==             Age<br>
sibsp  ==         Number of Siblings/Spouses Aboard <br>
parch  ==           Number of Parents/Children Aboard <br>
ticket  ==          Ticket Number <br>
fare  ==            Passenger Fare <br>
cabin  ==           Cabin <br>
embarked  ==        Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) <br>

**SPECIAL NOTES**:<br>
Pclass is a proxy for socio-economic status (SES) 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower <br>

Age is in Years; Fractional if Age less than One (1) 

With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.

*Sibling*:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic <br>
*Spouse*:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored) <br>
*Parent*:   Mother or Father of Passenger Aboard Titanic <br>
*Child*:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic <br>

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them.  As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.

**It is of prime importance to understand the data first and then preprocess it further in order to get better predictions out of your model**

In [3]:
import pandas as pd

df = pd.read_csv('titanic_train.csv')
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Here, if you see the count row, all other columns except 'Age' have a value of 891 and Age has a value of 714, that means we are dealing with missing values over here!

In [6]:
df['Age'] = df['Age'].fillna(df['Age'].median)
#You can use mean or mode also

In [7]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,SibSp,Parch,Fare
count,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.0,0.0,0.0
25%,223.5,0.0,2.0,0.0,0.0,7.9104
50%,446.0,0.0,3.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,1.0,0.0,31.0
max,891.0,1.0,3.0,8.0,6.0,512.3292


This solves our missing values problem but also shows another one of the most common problems in Data Preparation that only numeric columns are shown in df.describe() table.

The way to do this is to convert your non-numeric columns into numeric (for better predictions)

Non-Numeric columns - Name, Sex, Cabin, Embarked and Ticket

We can use the Sex and Embarked column to better leverage the Dataset.

In [11]:
#Sex can either be Male or Female

df.loc[df["Sex"]== "male", "Sex"] = 0
df.loc[df["Sex"]== "female", "Sex"] = 1

In [12]:
print df["Embarked"].unique()

['S' 'C' 'Q' nan]


Here, again we can see missing values in the "Embarked" column, lets fix this:

In [15]:
df["Embarked"] = df["Embarked"].fillna("S")
print df["Embarked"].unique()

['S' 'C' 'Q']


Now, Let's convert this non-numeric column into numeric!

In [16]:
df.loc[df["Embarked"]=="S", "Embarked"] = 0
df.loc[df["Embarked"]=="C", "Embarked"] = 1
df.loc[df["Embarked"]=="Q", "Embarked"] = 2

In [19]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age             object
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

We can see that even though we converted the columns Sex and Embarked into numeric one's it still shows that they are object type, lets fix this by converting the type of that particular column.

In [27]:
print df["Sex"].unique()
print df["Embarked"].unique()

[0 1]
[0 1 2]


In [26]:
df["Sex"] = df["Sex"].astype(int)
df["Embarked"] = df["Embarked"].astype(int)

In [28]:
print df.describe()

       PassengerId    Survived      Pclass         Sex       SibSp  \
count   891.000000  891.000000  891.000000  891.000000  891.000000   
mean    446.000000    0.383838    2.308642    0.352413    0.523008   
std     257.353842    0.486592    0.836071    0.477990    1.102743   
min       1.000000    0.000000    1.000000    0.000000    0.000000   
25%     223.500000    0.000000    2.000000    0.000000    0.000000   
50%     446.000000    0.000000    3.000000    0.000000    0.000000   
75%     668.500000    1.000000    3.000000    1.000000    1.000000   
max     891.000000    1.000000    3.000000    1.000000    8.000000   

            Parch        Fare    Embarked  
count  891.000000  891.000000  891.000000  
mean     0.381594   32.204208    0.361392  
std      0.806057   49.693429    0.635673  
min      0.000000    0.000000    0.000000  
25%      0.000000    7.910400    0.000000  
50%      0.000000   14.454200    0.000000  
75%      0.000000   31.000000    1.000000  
max      6.000000

Our data is now ready!
Let's dive into machine learning :D

In [37]:
df.dtypes
df_features = df[["Pclass", "Sex", "SibSp", "Parch", "Fare", "Embarked"]]
df_labels = df[["Survived"]]

In [46]:
from sklearn.linear_model import LogisticRegression
from sklearn import cross_validation

model = LogisticRegression(random_state=1)

scores = cross_validation.cross_val_score(model, df_features, df["Survived"], cv=3)
print scores

[ 0.79124579  0.8013468   0.78787879]


In [47]:
from sklearn.linear_model import LinearRegression
from sklearn import cross_validation

model_linear = LinearRegression()

scores = cross_validation.cross_val_score(model_linear, df_features, df["Survived"], cv=3)
print scores

[ 0.30589367  0.38946669  0.36963099]


In [50]:
from sklearn.linear_model import LinearRegression
from sklearn import cross_validation

model_linear = LinearRegression(normalize=True)

scores = cross_validation.cross_val_score(model_linear, df_features, df["Survived"], cv=3)
print scores

[ 0.30589367  0.38946669  0.36963099]
