# Introduction to statsmodels (Python) & scikit-learn

In this section we are introduced to more advanced tools for analysing data: statsmodels and scikit-learn (imported as sklearn). Both are Python libraries, and we use them by importing the modules they contain.
Let's begin by defining what a Python module and a library are.

Module: a single Python file that groups related functions, classes (with their methods) and variables. A module can be as small as one function, and it can be imported by many scripts.

Library: a collection of modules that provides predefined, often advanced, functions for calculating with and manipulating objects.
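As a rough illustration of the difference, the sketch below uses json from Python's standard library. It is chosen here only as a convenient example and is not part of this tutorial's toolset. json is a package made up of several modules, and json.decoder is one of those modules.

```python
import json          # a package: a collection of modules
import json.decoder  # a single module inside that package

# Python represents both as module objects
print(type(json.decoder).__name__)

# Packages carry a __path__ attribute pointing at the folder that
# holds their modules; plain modules do not have one
is_package = hasattr(json, "__path__")
is_plain_module = not hasattr(json.decoder, "__path__")
print(is_package, is_plain_module)

# The module exposes a class that any script can import and reuse
decoded = json.decoder.JSONDecoder().decode('{"a": 1}')
print(decoded)
```

The same pattern applies to statsmodels.api and sklearn.linear_model: each is a module reached through the library that contains it.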

So far we have only explored ways to manipulate data using objects such as NumPy arrays, lists and pandas DataFrames. These objects let a data scientist read and write data and reshape it into the form a classifier or regressor expects.
But what is a classifier?
And how is the data arranged so that it is available for further analysis?

## statsmodels

This library helps with analysing data and building statistical models.  
To use it well we also need some knowledge of statistics, which we will build up in the sections ahead.



Regression: observing how one variable behaves when another variable changes. In the simplest case we observe how Y changes as X changes on a 2D plane, and describe that relationship with a straight line.

Formula of a line: y = mx + b

    m = slope (the change in y when x increases by 1)
    b = intercept (the value of y when x = 0)
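To make the roles of m and b concrete, here is a minimal sketch that recovers them from a handful of points. The points are made up for illustration and lie exactly on the line y = 2x + 1. NumPy's polyfit is used only as a convenient way to fit a straight line; it is not the method used later in this section.

```python
import numpy as np

# Hypothetical points that lie exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Fit a degree-1 polynomial; polyfit returns [slope, intercept]
m, b = np.polyfit(x, y, 1)
print(m, b)

# Use the fitted line to compute y for a new x
y_new = m * 4.0 + b
print(y_new)
```

Up to floating-point rounding, the fitted slope is 2 and the intercept is 1, so the line gives y = 9 at x = 4.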


In [2]:
# Import the Libraries
import statsmodels.api as sm
import pandas as pd
from patsy import dmatrices # Don't worry about this for now

We import the pandas library for its DataFrame data structure.

As you can see in the first line of the previous cell:
#### import statsmodels.api as sm

This is the conventional way to import statsmodels; the alias sm is then used to reach its functions and classes.

In [3]:
# Fetch the Guerry dataset (from the R package HistData) through statsmodels; this needs an internet connection
df = sm.datasets.get_rdataset("Guerry", "HistData").data
print(df)

    dept Region           Department  Crime_pers  Crime_prop  Literacy  \
0      1      E                  Ain       28870       15890        37   
1      2      N                Aisne       26226        5521        51   
2      3      C               Allier       26747        7925        13   
3      4      E         Basses-Alpes       12935        7289        46   
4      5      E         Hautes-Alpes       17488        8174        69   
5      7      S              Ardeche        9474       10263        27   
6      8      N             Ardennes       35203        8847        67   
7      9      S               Ariege        6173        9597        18   
8     10      E                 Aube       19602        4086        59   
9     11      S                 Aude       15647       10431        34   
10    12      S              Aveyron        8236        6731        31   
11    13      S     Bouches-du-Rhone       13409        5291        38   
12    14      N             Calvados  

The dataset has more columns than we need, so we keep only a few characteristics (columns) for the regression.

In [4]:
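# Keep only the columns used in the regression (the name vars would shadow a Python built-in)
columns = ['Department', 'Lottery', 'Literacy', 'Wealth', 'Region']
df = df[columns]
df[-5:]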

      Department  Lottery  Literacy  Wealth  Region
81        Vienne       40        25      68       W
82  Haute-Vienne       55        13      67       C
83        Vosges       14        62      82       E
84         Yonne       51        47      30       C
85         Corse       83        49      37     NaN


In [6]:
# Drop the rows that contain missing values (NaN),
# because the regression cannot use incomplete rows
df = df.dropna()

In [7]:
df[-5:]

      Department  Lottery  Literacy  Wealth  Region
80        Vendee       68        28      56       W
81        Vienne       40        25      68       W
82  Haute-Vienne       55        13      67       C
83        Vosges       14        62      82       E
84         Yonne       51        47      30       C
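The effect of dropna() can be seen on a miniature version of the table. The sketch below builds a small, hypothetical DataFrame (three rows copied from the output above, with only some of the columns) in which the last row has no Region, and shows that dropna() returns a new DataFrame without that row.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the table above; the last row has no Region
small = pd.DataFrame({
    "Department": ["Vosges", "Yonne", "Corse"],
    "Lottery": [14, 51, 83],
    "Region": ["E", "C", np.nan],
})

# isna() marks missing cells; any(axis=1) flags rows with at least one
rows_with_gaps = small.isna().any(axis=1)
print(rows_with_gaps.tolist())

# dropna() returns a new DataFrame without those rows;
# the original is left unchanged unless the result is assigned back
cleaned = small.dropna()
print(cleaned["Department"].tolist())
```

This is why Corse (index 85) appears in the first listing but not in the second.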


In [8]:
y, X = dmatrices('Lottery ~ Literacy + Wealth + Region', data=df, return_type='dataframe')

dmatrices() is a function from the patsy library. It takes a formula and a DataFrame and builds the two matrices a regression needs: y, the variable we want to explain, and X, the variables we explain it with. In the call above the formula is

'Lottery ~ Literacy + Wealth + Region'

which is equivalent to saying

Lottery = b0 + b1 * Literacy + b2 * Wealth + (a contribution from Region)

where b0, b1 and b2 are constants that the regression will estimate. Literacy, Wealth and Region are the independent variables, whereas Lottery is the dependent variable.
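To see what dmatrices() produces, the sketch below applies a shortened formula with only the two numeric variables to four rows taken from the table above. The small DataFrame and the shortened formula are assumptions made for illustration; Region is left out here on purpose.

```python
import pandas as pd
from patsy import dmatrices

# Four rows copied from the listing above (indices 81 to 84)
small = pd.DataFrame({
    "Lottery": [40, 55, 14, 51],
    "Literacy": [25, 13, 62, 47],
    "Wealth": [68, 67, 82, 30],
})

# Left of ~ is the dependent variable, right of ~ the independent ones
y, X = dmatrices("Lottery ~ Literacy + Wealth", data=small,
                 return_type="dataframe")

print(list(y.columns))  # the dependent variable
print(list(X.columns))  # an Intercept column plus one column per variable
print(X)
```

patsy adds the Intercept column of ones automatically; it plays the role of b0 in the equation.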

But looking at our dataset df, we see that Region differs from the other columns: it holds letters, not numbers.

How do we incorporate it alongside the numeric columns?

In [16]:
# Dependent Variable
print(y[-5:])

    Lottery
80     68.0
81     40.0
82     55.0
83     14.0
84     51.0


In [17]:
# Independent Variables
print(X[-5:])

    Intercept  Region[T.E]  Region[T.N]  Region[T.S]  Region[T.W]  Literacy  \
80        1.0          0.0          0.0          0.0          1.0      28.0   
81        1.0          0.0          0.0          0.0          1.0      25.0   
82        1.0          0.0          0.0          0.0          0.0      13.0   
83        1.0          1.0          0.0          0.0          0.0      62.0   
84        1.0          0.0          0.0          0.0          0.0      47.0   

    Wealth  
80    56.0  
81    68.0  
82    67.0  
83    82.0  
84    30.0  


As you can see, the string variable Region has been incorporated as 4 extra columns: Region[T.E], Region[T.N], Region[T.S] and Region[T.W].
This is the concept of dummy variables: a variable with a finite set of values is replaced by 0/1 columns, and for each data point the column matching its value is set to 1. Region has five values (C, E, N, S and W); C gets no column of its own and serves as the baseline, so a row whose Region columns are all 0 (such as row 82) belongs to region C.

For instance, in row 80 we have T.E = T.N = T.S = 0 and T.W = 1.

This is because row 80 belongs to region W. The equation built by the module has an extra term for each of T.E, T.N, etc., but in this row all of them are multiplied by 0 except T.W, so only the T.W term contributes.

More information on dummy variables can be found at https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.get_dummies.html
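The same idea can be reproduced with pandas.get_dummies, the function described at the link above. This is a sketch on the Region values of rows 80 to 84 only, so just three of the five regions appear. The drop_first=True option is used here to imitate the way patsy leaves out a baseline category. Note that pandas names the columns Region_E and Region_W, not Region[T.E] and Region[T.W].

```python
import pandas as pd

# Region values of rows 80 to 84 from the listing above
regions = pd.Series(["W", "W", "C", "E", "C"], name="Region")

# One 0/1 column per category; drop_first removes the first
# category in sorted order (C), which becomes the baseline
dummies = pd.get_dummies(regions, prefix="Region",
                         drop_first=True, dtype=int)

print(dummies)
```

Rows that belong to region C end up with 0 in every column, exactly like row 82 in the output of dmatrices().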

In [19]:
mod = sm.OLS(y, X) # Ordinary least squares model, a regressor
result = mod.fit() # fit() estimates the coefficients; many model classes offer a method with this name

In [21]:
print(result.summary()) # Shows a summary of the analysis

                            OLS Regression Results                            
Dep. Variable:                Lottery   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     6.636
Date:                Fri, 13 Jul 2018   Prob (F-statistic):           1.07e-05
Time:                        11:29:16   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      38.6517      9.456      4.087      

The summary gives the user a complete picture of the fit: the estimated coefficient for each variable, together with measures of how well the model explains the data, such as R-squared.

## scikit-learn 

This library provides data mining and data analysis tools for tasks such as classification, regression, clustering and preprocessing. It is mainly used for the machine learning side of data science.

To give you an idea of how the library works, we will follow a short tutorial.

In [1]:
# Import the class used to create a linear regression model
from sklearn.linear_model import LinearRegression

In [2]:
# Construct the training data: input rows (train_Data) and their target values (class_Data)
train_Data = [
    [0,0],
    [1,2],
    [2,4]
]
class_Data = [
    0,
    1,
    2
]


As you can see, we have our training data and the matching target values. Each training row holds two parameters, and the pattern is easy to spot: both increase linearly from row to row, the second is always twice the first, and each target value equals the first parameter of its row. In the background, LinearRegression uses this data to find a set of coefficients that reproduces the pattern.

In [3]:
# Create the model and fit it to the data
cls = LinearRegression()
cls.fit(train_Data, class_Data)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

When fit is called, the model analyses the training set and computes one coefficient for each of the two variables in the training data.


In [8]:
print(cls.coef_)

[ 0.2  0.4]


The two coefficients are the constants in the equation (⅕)x + (⅖)y = z, where x and y are the two input variables and z is the predicted value.

In [9]:
print(cls.predict([[4,3]]))

[ 2.]


Hence when you predict a value, the inputs are placed in this equation to give the answer: for [4, 3] that is (⅕)(4) + (⅖)(3) = 2.
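This can be checked by hand. The sketch below refits the same training data and compares predict() with the equation evaluated manually from coef_ and intercept_. The names model, new_point, manual and predicted are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same training data as in the cells above
train_Data = [[0, 0], [1, 2], [2, 4]]
class_Data = [0, 1, 2]

model = LinearRegression().fit(train_Data, class_Data)

# Evaluate the equation by hand: coefficients times inputs, plus intercept
new_point = np.array([4, 3])
manual = float(np.dot(model.coef_, new_point) + model.intercept_)

# Let the model do the same calculation
predicted = float(model.predict([new_point])[0])
print(manual, predicted)

# The fitted equation also reproduces the training targets
train_predictions = model.predict(train_Data)
print(train_predictions)
```

One caveat: because the second column is always exactly twice the first, other coefficient pairs (for example 1 and 0) reproduce the training targets equally well. For a point that breaks the pattern, such as [4, 3], the prediction therefore depends on which pair the solver returns.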
