# Introduction to Statsmodels (Python) & Sklearn
In this section we get a first look at more advanced tools for analysing data. Statsmodels is a Python module and Sklearn is a library.
Let's begin by defining what a Python module and a library are.

Module: A collection of classes (with their methods) and functions. It can be as small as a single function, and it can be imported by many scripts.

Library: A collection of modules that provides predefined, more advanced functions for calculating with and manipulating objects.
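
As a small illustration of the two definitions, the sketch below uses only the standard library. The choice of `math` and `json` is ours, made purely for illustration: `math` is a single module, while `json` is a package that groups several modules such as `json.decoder`.

```Python
import math            # a single module
import json            # a package: a collection of modules
import json.decoder    # one of the modules inside the json package
import types

# Once imported, both names refer to module objects
print(isinstance(math, types.ModuleType))           # True
print(isinstance(json.decoder, types.ModuleType))   # True

# A module exposes functions and classes that any script can reuse
print(math.sqrt(16))                                 # 4.0
print(json.decoder.JSONDecoder().decode("[1, 2]"))   # [1, 2]
```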

So far we have only explored different ways to manipulate data with the help of objects such as NumPy arrays, lists, and pandas DataFrames. These objects allow a data scientist to read and write data and to reshape it into the form that a classifier or regressor expects.
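
As a quick reminder of what that reshaping can look like, here is a minimal sketch. The values and the column names `x1` and `x2` are made up for illustration.

```Python
import numpy as np
import pandas as pd

values = [1, 2, 3, 4, 5, 6]              # a plain Python list
arr = np.array(values).reshape(3, 2)     # the same values as 3 rows and 2 columns
frame = pd.DataFrame(arr, columns=["x1", "x2"])  # hypothetical column names

print(arr.shape)               # (3, 2)
print(frame["x2"].tolist())    # [2, 4, 6]
```

Each row is one observation and each column is one variable, which is the layout that both libraries in this section work with.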

But what is a classifier?

And how is the data put into a form that is available for further analysis?

A classifier is described as a model predictor: it predicts which category an observation belongs to, while a regressor predicts a numeric value.
Assume a data scientist collects data and wants to find patterns or trends in it. They would have to create a model on that dataset.

A model in this case describes how one value (the dependent variable) behaves as the values of other variables (the independent variables) change.

## Statsmodels

### Introduction and Explanation
This module helps with analyzing data and creating statistical models.  
To use this module well, we also need some knowledge of statistics, which we will learn in the sections ahead.

A model here mainly covers descriptive statistics and regression (we will see an example of this).

Regression: Observing how a certain variable reacts or behaves when another variable is changed. In the simplest scenario we choose to observe the behaviour of Y when X is changed on a 2D plane.

Formula of a line: y = mx + b

m = slope  
b = intercept
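
To make the formula concrete, the sketch below estimates m and b from a handful of points using the standard least-squares formulas for a single variable. The function name `fit_line` and the data points are our own, chosen so that the points lie exactly on y = 2x + 1. This is a hand-written illustration, not how Statsmodels computes its fit.

```Python
from statistics import mean

def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    # slope = covariance of x and y divided by the variance of x
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    # the fitted line always passes through the point of means
    b = y_bar - m * x_bar
    return m, b

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]   # made-up points on the line y = 2x + 1

m, b = fit_line(xs, ys)
print(m, b)   # 2.0 1.0
```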


```Python
# Import the Libraries
import statsmodels.api as sm
import pandas as pd
from patsy import dmatrices # Don't worry about this for now

# Download the Guerry dataset (from the R package HistData) through statsmodels
df = sm.datasets.get_rdataset("Guerry", "HistData").data

# Keep only the columns used in this example
columns = ['Department', 'Lottery', 'Literacy', 'Wealth', 'Region']
df = df[columns]

# Remove rows with missing values (NaN), which the model cannot use
df = df.dropna()

# Keep a separate copy; we will use this variable later on
temp_df = df.copy()

# Convert the DataFrame into design matrices with the help of the patsy library
y, X = dmatrices('Lottery ~ Literacy + Wealth + Region', data=df, return_type='dataframe')

mod = sm.OLS(y, X) 
result = mod.fit() 
print(result.summary()) # Shows a summary of the analysis
```
<img src="../../../images/desc_stat_out.png"> <br>

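The result summary gives the user a complete picture of the fit: the estimated coefficient for each variable with its standard error, t-statistic, p-value, and confidence interval, alongside overall measures such as R-squared.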


## Sklearn
This library provides data mining and data analysis tools for classification, regression, clustering, and preprocessing. It is mainly used for the machine learning aspect of data science.

A classifier here helps its user predict behaviour for new values, based on the data it was trained on.

To give you an idea of how the library works, we will follow a short tutorial.

```Python
# Import the class required for instantiating the model
from sklearn.linear_model import LinearRegression

# Construct the training data (inputs) and the target values
train_Data = [
    [0,0],
    [1,2],
    [2,4]
]
class_Data = [
    0,
    1,
    2
]

# Create the model and fit it to the data
cls = LinearRegression()
cls.fit(train_Data, class_Data)

# Fitting analyzes the training set and produces one coefficient for each of the two variables in the training data.
print(cls.coef_)
# Output (approximately):
# [0.2 0.4]

# The two coefficients are the constants in the equation (⅕)x + (⅖)y = z
print(cls.predict([[4,3]]))
# Output (approximately):
# [2.]
```
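
One detail of this tutorial is worth a closer look: in the training data the second column is exactly twice the first, so more than one pair of coefficients reproduces the targets perfectly. The sketch below uses NumPy only and does not reproduce sklearn's internals. It checks three such pairs and then computes the minimum-norm least-squares solution, which happens to match the coefficients printed above.

```Python
import numpy as np

X = np.array([[0, 0], [1, 2], [2, 4]])   # second column is 2 * first column
y = np.array([0, 1, 2])
new_point = np.array([4, 3])             # does not follow the "2 * first" pattern

# Three different coefficient pairs that all reproduce y on the training data
candidates = {
    "first column only": np.array([1.0, 0.0]),
    "second column only": np.array([0.0, 0.5]),
    "smallest coefficients": np.array([0.2, 0.4]),
}
for name, w in candidates.items():
    # fitted training values, then the prediction for the new point
    print(name, X @ w, new_point @ w)

# Minimum-norm least-squares solution (no intercept term in this sketch)
w_min, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_min)
```

All three candidates agree on the training rows but disagree on `[4, 3]`, so the prediction of 2 reflects which solution was picked rather than something the data alone determines. With real data it is usually better to avoid columns that are exact multiples of each other.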


## Exercises

Create a regression model on the dataset that is used in the Statsmodels example.




### Solution

```Python
def training_data():
    # Build 20 rows of the form [i, 2*i]
    trainData = []
    for i in range(20):
        trainData.append([i, 2*i])
    return trainData

def class_data():
    # Build the 20 matching target values
    classData = []
    for i in range(20):
        classData.append(i)
    return classData


# Import libraries
import statsmodels.api as sm
import pandas as pd
from patsy import dmatrices

# Gather the data and convert it to DataFrames
train_Data = training_data()
class_Data = class_data()

train_Data = pd.DataFrame(train_Data)
train_Data = train_Data.rename({
    0:'train_col1',
    1:'train_col2'
}, axis='columns')

class_Data = pd.DataFrame(class_Data)
class_Data = class_Data.rename({
    0:'class_col1'
}, axis='columns')
df = pd.concat([class_Data, train_Data], axis=1)

# Create the target and training matrices, then fit the model
y, X = dmatrices('class_col1 ~ train_col1 + train_col2', data=df, return_type='dataframe')
mod = sm.OLS(y, X)
result = mod.fit()
print(result.summary())
```

```
                            OLS Regression Results                            
Dep. Variable:             class_col1   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.760e+33
Date:                Sat, 25 Aug 2018   Prob (F-statistic):          2.93e-293
Time:                        22:43:32   Log-Likelihood:                 683.13
No. Observations:                  20   AIC:                            -1362.
Df Residuals:                      18   BIC:                            -1360.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -6.661e-16   1.61e-16     -4.136      0.0
```