# Subject: Classical Data Analysis

## Session 1 - Regression

### Individual assignment 1

Develop a regression analysis in Statsmodels (with and without a constant) and SKLearn, based on the Iris dataset that ships with sklearn. This dataset contains sepal and petal measurements (length and width) for three types of iris (Setosa, Versicolour, and Virginica).

See here for more information on this dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set 

Use the field “sepal width (cm)” as independent variable and the field “sepal length (cm)” as dependent variable.

- Interpret and discuss the OLS Regression Results.
- Commit your scripts to your GitHub account. You should export your solution code (.ipynb notebook) and push it to your repository “ClassicalDataAnalysis”.

The following are the tasks that you should complete and synchronize with your repository “ClassicalDataAnalysis” by October 13. Please note that none of these tasks is graded; however, it is important that you understand and complete them correctly so that you do not run into problems with later assignments.

# Linear Regression in Statsmodels

## Load the iris dataset

In [1]:
from sklearn import linear_model
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import pandas as pd


In [2]:
iris = datasets.load_iris()

In [3]:
# Why can't I call iris.head() here?
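
`iris.head()` is not available because `datasets.load_iris()` returns a scikit-learn `Bunch`, a dictionary-like container, rather than a pandas DataFrame. The sketch below shows the difference and wraps the data in a DataFrame, which is what provides `head()`. The variable names are only illustrative.

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()

# The loader returns a dictionary-like Bunch, not a DataFrame.
print(type(iris).__name__)
print(hasattr(iris, "head"))

# Wrapping the raw array in a DataFrame gives access to head().
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(df.head())
```

Newer scikit-learn releases also accept `load_iris(as_frame=True)`, which returns pandas objects directly; that option is not assumed in the rest of this notebook.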

In [4]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

### Regression model with Statsmodels and without a constant:

In [5]:
df = pd.DataFrame(iris.data, columns=iris.feature_names) 

In [6]:
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [7]:
X = df["sepal width (cm)"]
y = df["sepal length (cm)"]

In [8]:
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | sepal length (cm) | R-squared: | 0.957 |
| Model: | OLS | Adj. R-squared: | 0.957 |
| Method: | Least Squares | F-statistic: | 3316.0 |
| Date: | Wed, 18 Oct 2017 | Prob (F-statistic): | 1.04e-103 |
| Time: | 22:07:41 | Log-Likelihood: | -243.13 |
| No. Observations: | 150 | AIC: | 488.3 |
| Df Residuals: | 149 | BIC: | 491.3 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| sepal width (cm) | 1.8717 | 0.033 | 57.585 | 0.000 | 1.807 | 1.936 |

| | | | |
|---|---|---|---|
| Omnibus: | 16.884 | Durbin-Watson: | 0.429 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 7.669 |
| Skew: | -0.336 | Prob(JB): | 0.0216 |
| Kurtosis: | 2.12 | Cond. No. | 1.0 |


### Interpreting the Table 

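The coefficient of 1.8717 means that, in this model forced through the origin, each additional 1 cm of sepal width is associated with a 1.8717 cm increase in predicted sepal length. The R-squared of 0.957 looks very high, but for a model without a constant statsmodels reports the uncentered R-squared, which measures variation around zero rather than around the mean of sepal length. It therefore cannot be compared directly with the R-squared of a model that includes a constant, and it does not by itself show that this is a good model.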

### Regression model with Statsmodels and with a constant:

In [9]:
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | sepal length (cm) | R-squared: | 0.012 |
| Model: | OLS | Adj. R-squared: | 0.005 |
| Method: | Least Squares | F-statistic: | 1.792 |
| Date: | Wed, 18 Oct 2017 | Prob (F-statistic): | 0.183 |
| Time: | 22:07:41 | Log-Likelihood: | -183.14 |
| No. Observations: | 150 | AIC: | 370.3 |
| Df Residuals: | 148 | BIC: | 376.3 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 6.4812 | 0.481 | 13.466 | 0.000 | 5.530 | 7.432 |
| sepal width (cm) | -0.2089 | 0.156 | -1.339 | 0.183 | -0.517 | 0.099 |

| | | | |
|---|---|---|---|
| Omnibus: | 4.455 | Durbin-Watson: | 0.941 |
| Prob(Omnibus): | 0.108 | Jarque-Bera (JB): | 4.252 |
| Skew: | 0.356 | Prob(JB): | 0.119 |
| Kurtosis: | 2.585 | Cond. No. | 24.3 |


### Interpreting the Table 

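With a constant, the model has an intercept of 6.4812, and the slope for sepal width changes from 1.8717 to -0.2089: each additional 1 cm of sepal width is associated with a 0.2089 cm decrease in predicted sepal length. The slope is not statistically significant (p = 0.183, and the 95% confidence interval from -0.517 to 0.099 includes zero). R-squared is 0.012, so sepal width alone explains only about 1% of the variance in sepal length. This value is not directly comparable with the 0.957 of the model without a constant, because that R-squared is computed around zero rather than around the mean.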
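
The t statistic and confidence interval in the coefficient table follow directly from the coefficient and its standard error. The sketch below recomputes them for the sepal width row from the rounded values in the table. The critical value 1.976 is an assumption: it is the approximate 97.5th percentile of a Student t distribution with 148 degrees of freedom, hard-coded here to avoid an extra dependency.

```python
coef = -0.2089   # slope for sepal width, from the summary table
std_err = 0.156  # its standard error, from the summary table
t_crit = 1.976   # assumed: approx. 97.5th percentile of t with 148 df

t_stat = coef / std_err
ci_low = coef - t_crit * std_err
ci_high = coef + t_crit * std_err

print(round(t_stat, 3), round(ci_low, 3), round(ci_high, 3))
# prints: -1.339 -0.517 0.099
```

Because the interval spans zero, the data are compatible with no linear relationship between sepal width and sepal length when all three species are pooled.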

# Linear Regression in SKLearn 

In [10]:
from sklearn import linear_model
from sklearn import datasets

data = datasets.load_iris() 

In [11]:
df = pd.DataFrame(data.data, columns=data.feature_names)
# Select the predictor and the target as single-column DataFrames
df2 = df[["sepal width (cm)"]]
target = df[["sepal length (cm)"]]

In [12]:
X = df2
y = target

In [13]:
lm = linear_model.LinearRegression()
model = lm.fit(X,y)

In [14]:
predictions = lm.predict(X)
print(predictions)

[[ 5.75017718]
 [ 5.85461233]
 [ 5.81283827]
 [ 5.8337253 ]
 [ 5.72929015]
 [ 5.66662906]
 [ 5.77106421]
 [ 5.77106421]
 [ 5.87549936]
 [ 5.8337253 ]
 [ 5.70840312]
 [ 5.77106421]
 [ 5.85461233]
 [ 5.85461233]
 [ 5.64574204]
 [ 5.56219392]
 [ 5.66662906]
 [ 5.75017718]
 [ 5.68751609]
 [ 5.68751609]
 [ 5.77106421]
 [ 5.70840312]
 [ 5.72929015]
 [ 5.79195124]
 [ 5.77106421]
 [ 5.85461233]
 [ 5.77106421]
 [ 5.75017718]
 [ 5.77106421]
 [ 5.81283827]
 [ 5.8337253 ]
 [ 5.77106421]
 [ 5.62485501]
 [ 5.60396798]
 [ 5.8337253 ]
 [ 5.81283827]
 [ 5.75017718]
 [ 5.8337253 ]
 [ 5.85461233]
 [ 5.77106421]
 [ 5.75017718]
 [ 6.00082154]
 [ 5.81283827]
 [ 5.75017718]
 [ 5.68751609]
 [ 5.85461233]
 [ 5.68751609]
 [ 5.81283827]
 [ 5.70840312]
 [ 5.79195124]
 [ 5.81283827]
 [ 5.81283827]
 [ 5.8337253 ]
 [ 6.00082154]
 [ 5.89638639]
 [ 5.89638639]
 [ 5.79195124]
 [ 5.97993451]
 [ 5.87549936]
 [ 5.91727342]
 [ 6.06348262]
 [ 5.85461233]
 [ 6.02170856]
 [ 5.87549936]
 [ 5.87549936]
 [ 5.8337253 ]
 [ 5.85461

In [15]:
lm.score(X,y) 

0.011961632834767699
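
`LinearRegression` fits an intercept by default, which is why this score agrees with the R-squared of the Statsmodels model with a constant. A rough sketch of the SKLearn counterpart of the no-constant model uses `fit_intercept=False`; the variable names below are illustrative, and exact numbers may vary slightly with the scikit-learn version because the bundled Iris data has been revised.

```python
import numpy as np
from sklearn import datasets, linear_model

iris = datasets.load_iris()
X = iris.data[:, [1]]  # sepal width (cm), 2-D as sklearn expects
y = iris.data[:, 0]    # sepal length (cm)

with_const = linear_model.LinearRegression().fit(X, y)
no_const = linear_model.LinearRegression(fit_intercept=False).fit(X, y)

# Closed-form slope of a regression through the origin: sum(x*y) / sum(x^2)
slope_origin = np.sum(X[:, 0] * y) / np.sum(X[:, 0] ** 2)

print("with constant:   ", with_const.intercept_, with_const.coef_[0], with_const.score(X, y))
print("without constant:", no_const.intercept_, no_const.coef_[0], no_const.score(X, y))
print("closed-form slope through the origin:", slope_origin)
```

Note that `score` always uses the centered R-squared, so for the no-constant fit it comes out negative instead of the 0.957 that Statsmodels reports for the same line.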

In [18]:
lm.coef_

array([[-0.20887029]])

In [19]:
lm.intercept_

array([ 6.48122321])
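
The fitted line can be checked by hand: a prediction is the intercept plus the slope times the sepal width. The sketch below plugs in the `lm.intercept_` and `lm.coef_` values printed above for the first three sepal widths from `df.head()`. The helper name `predict_sepal_length` is hypothetical and used only for illustration.

```python
intercept = 6.48122321  # value printed by lm.intercept_
slope = -0.20887029     # value printed by lm.coef_

def predict_sepal_length(sepal_width_cm):
    """Hypothetical helper: evaluate the fitted line at one sepal width."""
    return intercept + slope * sepal_width_cm

# Sepal widths of the first three rows shown by df.head()
for width in (3.5, 3.0, 3.2):
    print(width, predict_sepal_length(width))
```

Up to rounding of the printed coefficients, the results agree with the first three entries of `predictions` (5.75017718, 5.85461233, 5.81283827).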