# Subject: Classical Data Analysis

## Session 1 - Regression

### Individual assignment 1

Develop a regression analysis in Statsmodels (with and without a constant) and SKLearn, based on the Iris dataset from sklearn. This dataset contains petal and sepal measurements (length and width) for 3 species of iris (Setosa, Versicolour, and Virginica).

See here for more information on this dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set 

Use the field “sepal width (cm)” as independent variable and the field “sepal length (cm)” as dependent variable.

- Interpret and discuss the OLS Regression Results.
- Commit your scripts to your GitHub account. You should export your solution code (.ipynb notebook) and push it to your repository “ClassicalDataAnalysis”.

The following are the tasks that you should complete and synchronize with your repository “ClassicalDataAnalysis” by October 13. Please note that none of these tasks is graded; however, it’s important that you understand and complete them correctly so that you don’t run into problems with later assignments.

# Linear Regression in Statsmodels

## Load the iris dataset

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn import datasets
from sklearn import linear_model


In [2]:
data = datasets.load_iris()

In [7]:
print(data.DESCR)

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

In [8]:
print(data.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [10]:
df = pd.DataFrame(data.data, columns=data.feature_names)

In [14]:
x = df['sepal width (cm)'].values
y = df['sepal length (cm)'].values

### Regression model with Statsmodels and without a constant:

In [17]:
# sm.OLS does not add an intercept by default, so this fits a line through the origin
model = sm.OLS(y, x).fit()

In [19]:
model.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.957 |
| Model: | OLS | Adj. R-squared: | 0.957 |
| Method: | Least Squares | F-statistic: | 3316.0 |
| Date: | Sun, 08 Oct 2017 | Prob (F-statistic): | 1.04e-103 |
| Time: | 13:29:59 | Log-Likelihood: | -243.13 |
| No. Observations: | 150 | AIC: | 488.3 |
| Df Residuals: | 149 | BIC: | 491.3 |
| Df Model: | 1 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 1.8717 | 0.033 | 57.585 | 0.000 | 1.807 | 1.936 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 16.884 | Durbin-Watson: | 0.429 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 7.669 |
| Skew: | -0.336 | Prob(JB): | 0.0216 |
| Kurtosis: | 2.12 | Cond. No. | 1.0 |


### Interpreting the Table 

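Our R-squared is 0.957, but this does not mean the model is good at predicting sepal length from sepal width. When the model has no constant, Statsmodels reports the uncentered R-squared, which measures variation around zero instead of around the mean of sepal length, so the value is inflated and cannot be compared with the R-squared of a model that includes a constant.  
The coefficient is 1.8717: with the line forced through the origin, the predicted sepal length is 1.8717 times the sepal width, so it increases by 1.8717 cm for each 1 cm increase in sepal width.  
The P-value of the coefficient is below 0.001, so we reject the null hypothesis that the coefficient is zero. In a model without a constant this mainly reflects that sepal length is far from zero, not that sepal width is a good predictor.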
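
To see why R-squared can be so high in a model without a constant, the following minimal sketch computes both the uncentered and the centered R-squared by hand. The data are a made-up toy example (not the iris values), chosen so that `y` is roughly constant, far from zero, and essentially unrelated to `x`; all variable names are illustrative.

```python
# Hypothetical toy data (NOT the iris measurements): y stays near 5.8
# regardless of x, so x carries almost no information about y.
x = [2.0, 2.5, 3.0, 3.5, 4.0]
y = [5.9, 5.7, 6.0, 5.8, 5.6]

# Least-squares slope for a line forced through the origin
slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Sum of squared residuals of the through-origin fit
ssr = sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))

mean_y = sum(y) / len(y)
uncentered_tss = sum(yi ** 2 for yi in y)            # variation around zero
centered_tss = sum((yi - mean_y) ** 2 for yi in y)   # variation around the mean

r2_uncentered = 1 - ssr / uncentered_tss   # definition used when there is no constant
r2_centered = 1 - ssr / centered_tss       # definition used when a constant is included

print(f"slope         = {slope:.4f}")
print(f"uncentered R2 = {r2_uncentered:.3f}")
print(f"centered R2   = {r2_centered:.3f}")
```

On this toy data the uncentered R-squared is high while the centered one is strongly negative, which means the through-origin line fits worse than simply predicting the mean of `y`.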

### Regression model with Statsmodels and with a constant:

In [20]:
# Prepend a column of ones so that OLS also estimates an intercept (const)
x = sm.add_constant(x)

In [21]:
model = sm.OLS(y, x).fit()

In [22]:
model.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.012 |
| Model: | OLS | Adj. R-squared: | 0.005 |
| Method: | Least Squares | F-statistic: | 1.792 |
| Date: | Sun, 08 Oct 2017 | Prob (F-statistic): | 0.183 |
| Time: | 13:34:45 | Log-Likelihood: | -183.14 |
| No. Observations: | 150 | AIC: | 370.3 |
| Df Residuals: | 148 | BIC: | 376.3 |
| Df Model: | 1 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 6.4812 | 0.481 | 13.466 | 0.000 | 5.530 | 7.432 |
| x1 | -0.2089 | 0.156 | -1.339 | 0.183 | -0.517 | 0.099 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 4.455 | Durbin-Watson: | 0.941 |
| Prob(Omnibus): | 0.108 | Jarque-Bera (JB): | 4.252 |
| Skew: | 0.356 | Prob(JB): | 0.119 |
| Kurtosis: | 2.585 | Cond. No. | 24.3 |


### Interpreting the Table 

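Our R-squared is only 0.012, which means sepal width explains about 1.2% of the variance in sepal length, so the model has almost no predictive power.  
The coefficient of sepal width is -0.2089 with a P-value of 0.183, and its 95% confidence interval (-0.517 to 0.099) includes zero. At the usual 5% significance level we therefore fail to reject the null hypothesis that the coefficient is zero: this model shows no significant linear relationship between sepal width and sepal length.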
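
The t statistic and the confidence interval in the coefficient table can be reproduced from the reported coefficient and standard error. The sketch below uses the rounded values from the table above; the critical value `t_crit` is an assumption (an approximate 97.5th percentile of Student's t distribution with 148 degrees of freedom), hard-coded to keep the example free of extra dependencies.

```python
coef = -0.2089    # x1 coefficient, taken from the summary table
std_err = 0.156   # its standard error, taken from the summary table
t_crit = 1.976    # ASSUMED: approx. 97.5th percentile of t with 148 df

# t statistic: how many standard errors the estimate lies away from zero
t_stat = coef / std_err

# 95% confidence interval: estimate plus/minus t_crit standard errors
ci_low = coef - t_crit * std_err
ci_high = coef + t_crit * std_err

print(f"t = {t_stat:.3f}")
print(f"95% CI = [{ci_low:.3f}, {ci_high:.3f}]")

# The coefficient is significant at the 5% level only if the interval excludes zero
print("interval contains zero:", ci_low < 0 < ci_high)
```

Because the interval contains zero, the data are compatible with a slope of zero, which matches the P-value of 0.183.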

# Linear Regression in SKLearn 

In [27]:
# scikit-learn expects a 2-D feature array of shape (n_samples, n_features)
x = df['sepal width (cm)'].values.reshape(-1,1)
y = df['sepal length (cm)'].values

In [28]:
lr = linear_model.LinearRegression()

In [29]:
model = lr.fit(x,y)

In [32]:
model.coef_

array([-0.20887029])

In [35]:
model.intercept_

6.4812232114596053

In [34]:
model.score(x,y)

0.011961632834767699
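
The SKLearn model above includes an intercept by default, so its coefficient, intercept and score match the Statsmodels model with a constant. As a possible counterpart to the Statsmodels model without a constant, the following sketch fits `LinearRegression` with `fit_intercept=False`. The names `X`, `lr_no_const`, `slope` and `r2` are illustrative and not part of the original notebook, and the iris data bundled with scikit-learn may differ slightly between versions, so the last digits may not match the outputs shown above.

```python
from sklearn import datasets, linear_model

data = datasets.load_iris()
# Columns follow data.feature_names: index 0 is sepal length, index 1 is sepal width
X = data.data[:, [1]]   # sepal width as a 2-D array of shape (150, 1)
y = data.data[:, 0]     # sepal length

# fit_intercept=False forces the fitted line through the origin
lr_no_const = linear_model.LinearRegression(fit_intercept=False).fit(X, y)

slope = lr_no_const.coef_[0]
# score() always returns the centered R-squared, even without an intercept
r2 = lr_no_const.score(X, y)

print("slope:    ", slope)
print("intercept:", lr_no_const.intercept_)
print("R-squared:", r2)
```

The slope should be close to the 1.8717 reported by Statsmodels, but the score is expected to be negative rather than 0.957, because `score` measures the fit against the mean of sepal length instead of against zero.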