# Introduction to Statsmodels (Python) & Sklearn

In this section we are introduced to more advanced tools for analysing data. Statsmodels and Sklearn (scikit-learn) are both Python libraries; Statsmodels is usually imported through its statsmodels.api module.
Let's begin by defining a Python module and a library.

Module: a single Python file that groups related functions, classes and their methods. A module can be as small as one function, and it can be imported by many scripts.

Library: a collection of modules that provides predefined, more advanced functions for calculating with and manipulating objects.
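
As a small illustration of importing, the sketch below uses Python's built-in math module. It is chosen here only as an example and is not used elsewhere in this section. A module can be imported as a whole, or single names can be imported from it.

```python
# `math` is a single module from Python's standard library
import math

# Import one function directly from the module
from math import sqrt

# Both names refer to the same function object
print(math.sqrt(16.0))   # call through the module
print(sqrt(16.0))        # call the imported name directly
same_function = math.sqrt is sqrt
```

A library such as Statsmodels bundles many modules of this kind; statsmodels.api is one of them.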

So far we have only explored ways to manipulate data using objects such as NumPy arrays, lists and pandas DataFrames. These objects allow a data scientist to read, write and reshape data so that it is ready for a classifier or regressor.

But what is a classifier?

And how is the data arranged so that it is available for further analysis?

A classifier is a model that makes predictions. Strictly, a classifier predicts a category and a regressor predicts a number; this section uses "classifier" loosely for both.
Suppose a data scientist collects data and wants to find patterns or trends in it; he or she would have to create a model of that dataset.

A model in this case describes how one value (the dependent variable) behaves as the values of other variables (the independent variables) change.

## Statsmodels

### Introduction and Explanation
This library helps with analysing data and creating statistical models.  
To use it well we also need some knowledge of statistics, which we will learn in the sections ahead.

In Statsmodels, a model is mainly used to describe and explain the data statistically (we will see an example of this).

Regression: observing how one variable behaves when another variable is changed. In the simplest scenario we observe the behaviour of Y when X is changed on a 2D plane.

Formula of a line: y = mx + b, where m is the slope and b is the intercept.
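
To make the formula concrete, here is a minimal sketch that recovers m and b from a handful of points using NumPy's polyfit. The points are made up for illustration and lie exactly on the line y = 2x + 1.

```python
import numpy as np

# Hypothetical points that lie exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Fit a polynomial of degree 1 (a straight line);
# polyfit returns the coefficients from highest degree to lowest
m, b = np.polyfit(x, y, 1)

print("slope m =", round(m, 6))
print("intercept b =", round(b, 6))
```

With real data the points do not lie exactly on a line, and the fitted m and b describe the line that comes closest to them in the least-squares sense.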


In [50]:
# Import the Libraries
import statsmodels.api as sm
import pandas as pd
from patsy import dmatrices # Don't worry about this for now

We import the pandas library for its DataFrame data structure.

The first line of the previous cell shows the common way to import Statsmodels:

<font color='green'> import statsmodels.api as sm </font>

The Guerry dataset: a dataset from 1833 that records measures such as crime and literacy for the departments of France.

With this dataset we would like to model one parameter based on a set of other parameters.

In [51]:
# Fetch the Guerry dataset (R package HistData) through the Statsmodels datasets module
df = sm.datasets.get_rdataset("Guerry", "HistData").data
# .head() returns the first 5 rows from the top
print(df.head())

   dept Region    Department  Crime_pers  Crime_prop  Literacy  Donations  \
0     1      E           Ain       28870       15890        37       5098   
1     2      N         Aisne       26226        5521        51       8901   
2     3      C        Allier       26747        7925        13      10973   
3     4      E  Basses-Alpes       12935        7289        46       2733   
4     5      E  Hautes-Alpes       17488        8174        69       6962   

   Infants  Suicides MainCity   ...     Crime_parents  Infanticide  \
0    33120     35039    2:Med   ...                71           60   
1    14572     12831    2:Med   ...                 4           82   
2    17044    114121    2:Med   ...                46           42   
3    23018     14238     1:Sm   ...                70           12   
4    23076     16171     1:Sm   ...                22           23   

   Donation_clergy  Lottery  Desertion  Instruction  Prostitutes  Distance  \
0               69       41         55

The dataset has more columns than we need, so we keep only the few columns (characteristics) required for the regression.

In [52]:
# Named `columns` rather than `vars`, which would shadow a Python built-in
columns = ['Department', 'Lottery', 'Literacy', 'Wealth', 'Region']
df = df[columns]
df[-5:]

      Department  Lottery  Literacy  Wealth Region
81        Vienne       40        25      68      W
82  Haute-Vienne       55        13      67      C
83        Vosges       14        62      82      E
84         Yonne       51        47      30      C
85         Corse       83        49      37    NaN


In [53]:
# Drop the rows with missing values (NaN), which the regression cannot use
# For example, the last row has no Region
df = df.dropna()

In [60]:
temp_df = df.copy()  # Keep a copy of the cleaned data; we will use it later on
df[-5:]

      Department  Lottery  Literacy  Wealth Region
80        Vendee       68        28      56      W
81        Vienne       40        25      68      W
82  Haute-Vienne       55        13      67      C
83        Vosges       14        62      82      E
84         Yonne       51        47      30      C


Here we convert the DataFrame into matrices with the help of the patsy library.

In [55]:
y, X = dmatrices('Lottery ~ Literacy + Wealth + Region', data=df, return_type='dataframe')

dmatrices() is a function from the patsy library. Here it builds the matrices for a linear relationship between the columns, as a regression requires. For instance, in the above syntax you can see

'Lottery ~ Literacy + Wealth + Region'

This is equivalent to saying

Lottery = b0 + b1 * Literacy + b2 * Wealth + (an effect for each Region)

i.e. Literacy, Wealth and Region will be the independent variables, whereas Lottery will be the dependent variable. The coefficients b0, b1 and b2 are what the regression estimates.
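
The formula can be read as an ordinary linear equation. The sketch below builds that equation by hand with NumPy for two numeric columns only (Region is left out because it is not numeric). The numbers and the names literacy, wealth and lottery are invented for illustration and are not taken from the Guerry data.

```python
import numpy as np

# Hypothetical values, generated from lottery = 10 + 0.5*literacy + 0.2*wealth
literacy = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
wealth = np.array([5.0, 35.0, 15.0, 45.0, 25.0])
lottery = 10.0 + 0.5 * literacy + 0.2 * wealth

# Design matrix: a column of ones for the intercept, then one column per variable
X = np.column_stack([np.ones_like(literacy), literacy, wealth])

# Least squares finds the coefficients b0, b1, b2 of
# lottery = b0 + b1*literacy + b2*wealth
coefs, residuals, rank, singular_values = np.linalg.lstsq(X, lottery, rcond=None)
b0, b1, b2 = coefs
print(b0, b1, b2)
```

The column of ones plays the same role as the Intercept column that dmatrices() adds automatically.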

But on observing our dataset df, we see that Region differs from the other columns: it holds letters rather than numbers.

How do we incorporate it with the numeric columns?

In [56]:
# Dependent Variable
print(y[-5:])

    Lottery
80     68.0
81     40.0
82     55.0
83     14.0
84     51.0


In [57]:
# Independent Variables
print(X[-5:])

    Intercept  Region[T.E]  Region[T.N]  Region[T.S]  Region[T.W]  Literacy  \
80        1.0          0.0          0.0          0.0          1.0      28.0   
81        1.0          0.0          0.0          0.0          1.0      25.0   
82        1.0          0.0          0.0          0.0          0.0      13.0   
83        1.0          1.0          0.0          0.0          0.0      62.0   
84        1.0          0.0          0.0          0.0          0.0      47.0   

    Wealth  
80    56.0  
81    68.0  
82    67.0  
83    82.0  
84    30.0  


### Implementation of Dummy Variables
As you can see, the string variable Region has been incorporated, but as 4 extra columns: Region[T.E], Region[T.N], Region[T.S] and Region[T.W].
This is the concept of dummy variables: each category of a finite set gets its own column, which holds 1 when a data point belongs to that category and 0 otherwise.

For instance, in row 80 we have T.E = T.N = T.S = 0 and T.W = 1.

This is because the region of this row is W. The equation created by the model has a term for each of T.E, T.N, T.S and T.W, but here all except T.W are multiplied by 0, so only the T.W term contributes.

More info can be found at https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.get_dummies.html

The pandas function <font color='purple'> .get_dummies() </font> creates the same kind of dummy columns for categorical parameters. Encoding categories this way is a vital step in constructing our regression model.
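
A minimal sketch of get_dummies() on a small, made-up Region column. The DataFrame name toy and its values are hypothetical, and the dtype=int argument is an addition here so that the result shows 0/1 on recent pandas versions, which otherwise show True/False.

```python
import pandas as pd

# Hypothetical mini dataset with one categorical column
toy = pd.DataFrame({'Region': ['E', 'N', 'C', 'E', 'W']})

# One column per category; 1 marks the category of each row
dummies = pd.get_dummies(toy['Region'], dtype=int)
print(dummies)

# Every row has exactly one 1
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())
```

The columns come out in sorted order of the category names, one per distinct value found in the column.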

So, let's explore this a little deeper to understand the concept. Our plan is to explore the Region column. We will first observe the column by looking at its values.

In [70]:
# You may remember that earlier we saved the cleaned dataset (after dropna) in this variable
temp_df['Region'].values

# Here we can see the Region value of every row. But from this list alone it is hard to tell
# how many distinct regions there are.

array(['E', 'N', 'C', 'E', 'E', 'S', 'N', 'S', 'E', 'S', 'S', 'S', 'N',
       'C', 'W', 'W', 'C', 'C', 'E', 'W', 'C', 'W', 'E', 'E', 'N', 'C',
       'W', 'S', 'S', 'S', 'W', 'S', 'W', 'C', 'C', 'E', 'E', 'W', 'C',
       'C', 'C', 'W', 'C', 'S', 'W', 'S', 'W', 'N', 'N', 'E', 'W', 'E',
       'N', 'W', 'N', 'C', 'N', 'N', 'N', 'N', 'C', 'W', 'S', 'S', 'E',
       'E', 'E', 'E', 'E', 'C', 'N', 'N', 'N', 'N', 'W', 'N', 'S', 'S',
       'S', 'S', 'W', 'W', 'C', 'E', 'C'], dtype=object)

In [72]:
# So we run a DataFrame query that returns the distinct Region values, letting us see
# which categories are available across the rows of the data
temp_df.Region.unique()

array(['E', 'N', 'C', 'S', 'W'], dtype=object)
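
If the number of rows in each category is also of interest, the values can be counted. A small sketch using Python's collections.Counter on the first 13 Region values of the longer array printed earlier:

```python
from collections import Counter

# First 13 Region values from the array shown earlier
regions = ['E', 'N', 'C', 'E', 'E', 'S', 'N', 'S', 'E', 'S', 'S', 'S', 'N']

counts = Counter(regions)          # category -> number of rows
n_categories = len(counts)         # number of distinct categories in this slice

print(counts)
print(n_categories)
```

Only four categories appear in this short slice because W first occurs later in the column.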

### Dummy Variable Trap
As we can observe from the above array, the dataset contains 5 types of regions, so why has dmatrices() given only 4 Region columns (E, N, S, W)?

To prevent the Dummy Variable Trap.

With all 5 dummy columns, the dummies in every row would sum to 1, which is exactly the value of the Intercept column (the left-most column of X). This causes perfect multicollinearity, which is when one independent variable can be predicted exactly from the other independent variables. This is a problem because each independent variable should make its own contribution to the model; otherwise the coefficients are not uniquely determined, which affects the results when you use the model for further analysis. This is called the Dummy Variable Trap.

##### Solution: We remove one of the dummy variables, and that solves the problem of collinearity.

And so in our dataset dmatrices() does exactly that, giving us 4 instead of 5 dummy variables. The dropped category (C) becomes the baseline against which the other regions are compared.
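
The trap can be made visible by checking the rank of the matrix, that is, how many of its columns carry independent information. The sketch below uses NumPy with a short, made-up list of regions; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical Region labels, one per observation
regions = np.array(['C', 'E', 'N', 'S', 'W', 'E', 'C', 'W'])
categories = ['C', 'E', 'N', 'S', 'W']

# Intercept column of ones, plus one 0/1 dummy column per category
intercept = np.ones((len(regions), 1))
dummies = np.column_stack([(regions == c).astype(float) for c in categories])

# All 5 dummies: they sum to the intercept column, so one column is redundant
X_trap = np.hstack([intercept, dummies])
rank_trap = np.linalg.matrix_rank(X_trap)

# Drop the first dummy (C): the remaining columns are independent
X_ok = np.hstack([intercept, dummies[:, 1:]])
rank_ok = np.linalg.matrix_rank(X_ok)

print(X_trap.shape[1], rank_trap)   # number of columns vs. rank
print(X_ok.shape[1], rank_ok)
```

When the rank is lower than the number of columns, as in X_trap, the regression has no unique set of coefficients; after dropping one dummy the two numbers agree.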

In [58]:
mod = sm.OLS(y, X) 
result = mod.fit() 
print(result.summary()) # Shows a summary of the analysis

                            OLS Regression Results                            
Dep. Variable:                Lottery   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     6.636
Date:                Mon, 30 Jul 2018   Prob (F-statistic):           1.07e-05
Time:                        00:25:01   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      38.6517      9.456      4.087      

The result summary gives the user a complete picture of the fit: the estimated coefficients with their standard errors, and overall measures such as R-squared.
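
One of the values in the summary, R-squared, can be computed by hand, which may help in reading it. The sketch below uses made-up observed and fitted values (y_obs and y_fit are hypothetical names and are not taken from the regression above).

```python
import numpy as np

# Hypothetical observed values and the values predicted by some fitted model
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
y_fit = np.array([2.5, 5.5, 7.5, 8.5])

# Residual sum of squares: variation the model does not explain
ss_res = np.sum((y_obs - y_fit) ** 2)
# Total sum of squares: variation of the observations around their mean
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)

r_squared = 1.0 - ss_res / ss_tot
print(r_squared)
```

Read this way, the R-squared of 0.338 in the summary says that the model accounts for roughly a third of the variation in Lottery.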

## Sklearn

This library provides data mining and data analysis tools for classification, regression, clustering and preprocessing. It is mainly used for the machine learning side of data science.

A model here helps its user to predict behaviour for new values, based on the data it was trained on.

To give you an idea of how the library works, we will follow a short tutorial.

In [1]:
# Import the class required for instantiating the model
from sklearn.linear_model import LinearRegression

In [2]:
# Construct the training data and the class (target) data
train_Data = [
    [0,0],
    [1,2],
    [2,4]
]
class_Data = [
    0,
    1,
    2
]


As you can observe, we have our training data and our class data. The pattern in the training set is easy to notice: both parameters increase linearly (the second is always twice the first), and each class value increases along with them. In the background, LinearRegression works with this data to find a set of coefficients that explains this behaviour.

In [3]:
# Create the classifier and fit the data
cls = LinearRegression()
cls.fit(train_Data, class_Data)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

After the model is fit, it has analysed the training set and come up with one coefficient for each of the two variables in the training data.


In [8]:
print(cls.coef_)

[ 0.2  0.4]


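The 2 coefficients are the constants in the equation (⅕)x + (⅖)y = z, where x and y are the two training columns and z is the predicted class value. The fitted intercept is approximately 0 for this data, so it does not appear in the equation.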
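
Because the second training column is always twice the first, many coefficient pairs fit this data equally well. The sketch below uses NumPy's lstsq rather than Sklearn (an assumption made to keep the example small). lstsq returns the solution with the smallest norm when several fit equally well, and on the centred data that solution is close to the coefficients printed above.

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])   # training data
z = np.array([0.0, 1.0, 2.0])                        # class data

# Centre the data so that the intercept can be handled separately
X_mean = X.mean(axis=0)
z_mean = z.mean()

# Least squares on the centred data (minimum-norm solution)
coef, _, rank, _ = np.linalg.lstsq(X - X_mean, z - z_mean, rcond=None)
intercept = z_mean - X_mean @ coef

print(coef)        # close to [0.2 0.4]
print(intercept)   # close to 0
print(rank)        # 1: the two columns carry the same information
```

Other pairs, such as 1 and 0, would reproduce the three training rows just as well, which is why collinear columns are best avoided when the coefficients themselves are of interest.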

In [9]:
print(cls.predict([[4,3]]))

[ 2.]


Hence, when you predict values, they are simply substituted into this equation to give the answer: (⅕)(4) + (⅖)(3) = 2.
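
The same substitution written out in plain Python, as a sketch. The coefficients are copied from the printed output, and the intercept is assumed to be zero, which holds only approximately for the fitted model.

```python
# Coefficients printed earlier, and the point that was passed to predict()
coefficients = [0.2, 0.4]
point = [4, 3]
intercept = 0.0  # assumption: the fitted intercept is (approximately) zero here

# Multiply each value by its coefficient and add everything up
prediction = intercept + sum(c * v for c, v in zip(coefficients, point))
print(prediction)
```

The result agrees with the value returned by predict() up to floating-point rounding.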


# Exercises

In this section we have seen 2 kinds of datasets
    1. A custom dataset in the form of a DataFrame
    2. A 2-dimensional array (a list of lists)

To understand how to use these libraries, we will complete an exercise where we perform an analysis using Sklearn and Statsmodels on the opposite datasets.

## Sklearn
#### Instruction: Import Sklearn and the DataFrame dataset, then fit() a linear regression model
##### We will apply Sklearn's LinearRegression to the DataFrame used for Statsmodels

In [27]:
# Import the Libraries
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import pandas as pd

In [74]:
# Importing the dataset and downsizing it for a regression

# Load the dataset
df = sm.datasets.get_rdataset("Guerry", "HistData").data
# Keep the columns that are required for the regression
columns = ['Department', 'Lottery', 'Literacy', 'Wealth', 'Region']
df = df[columns]
print('Data: \n',df[-5:])
# Remove the rows that have missing values, e.g. NaN
df = df.dropna()
print()
print('Readable Data: \n',df[-5:])

Data: 
       Department  Lottery  Literacy  Wealth Region
81        Vienne       40        25      68      W
82  Haute-Vienne       55        13      67      C
83        Vosges       14        62      82      E
84         Yonne       51        47      30      C
85         Corse       83        49      37    NaN

Readable Data: 
       Department  Lottery  Literacy  Wealth Region
80        Vendee       68        28      56      W
81        Vienne       40        25      68      W
82  Haute-Vienne       55        13      67      C
83        Vosges       14        62      82      E
84         Yonne       51        47      30      C


In [73]:
# Creating dummy variables and adding them to the DataFrame

# Create dummy variables for the Region column
# (dtype=int gives the 0/1 values shown below; newer pandas versions default to True/False)
dummy_regions = pd.get_dummies(df['Region'], dtype=int)
print('Dummy Variables\n', dummy_regions.head())
print()
# Append to the DataFrame
df = pd.concat([df, dummy_regions], axis=1)
print(df.head())
print()
# Clean the data by removing unnecessary columns
df = df.drop(['Department', 'Region'], axis=1)
print(df.head())

Dummy Variables
    C  E  N  S  W
0  0  1  0  0  0
1  0  0  1  0  0
2  1  0  0  0  0
3  0  1  0  0  0
4  0  1  0  0  0

     Department  Lottery  Literacy  Wealth Region  C  E  N  S  W
0           Ain       41        37      73      E  0  1  0  0  0
1         Aisne       38        51      22      N  0  0  1  0  0
2        Allier       66        13      61      C  1  0  0  0  0
3  Basses-Alpes       80        46      76      E  0  1  0  0  0
4  Hautes-Alpes       79        69      83      E  0  1  0  0  0

   Lottery  Literacy  Wealth  C  E  N  S  W
0       41        37      73  0  1  0  0  0
1       38        51      22  0  0  1  0  0
2       66        13      61  1  0  0  0  0
3       80        46      76  0  1  0  0  0
4       79        69      83  0  1  0  0  0


In [30]:
# Converting Data into numpy arrays


# Class Data (the target we want to predict)
y = df['Lottery'].values

# Training Data (the features)
df = df.drop(['Lottery'], axis=1)
# Leave out one dummy column (W) to prevent the dummy variable trap
var = ['Literacy', 'Wealth', 'C', 'E', 'N', 'S']
X = df[var].values

print('X - ', X[:5])
print('y - ', y[:5])

X -  [[37 73  0  1  0  0]
 [51 22  0  0  1  0]
 [13 61  1  0  0  0]
 [46 76  0  1  0  0]
 [69 83  0  1  0  0]]
y -  [41 38 66 80 79]


In [33]:
# Fitting the Model

regr = LinearRegression()
regr.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [75]:
# Running test predictions (column order: Literacy, Wealth, C, E, N, S)

test1 = regr.predict([[37, 73,  0,  1,  0,  0]])
test2 = regr.predict([[13, 67,  1,  0,  0,  0]])

print('test1: ',test1)
print('test2: ',test2)

test1:  [ 49.30622039]
test2:  [ 66.48482007]


## Statsmodels

In [38]:
# Import Libraries
import statsmodels.api as sm
import pandas as pd
from patsy import dmatrices

In [39]:
# Gathering the data and converting it to a DataFrame

train_Data = [
    [0,0],
    [1,2],
    [2,4]
]
class_Data = [
    0,
    1,
    2
]
train_Data = pd.DataFrame(train_Data)
train_Data = train_Data.rename({
    0:'train_col1',
    1:'train_col2'
}, axis='columns')

class_Data = pd.DataFrame(class_Data)
class_Data = class_Data.rename({
    0:'class_col1'
}, axis=1)
df = pd.concat([class_Data, train_Data], axis=1)
df

Unnamed: 0,class_col1,train_col1,train_col2
0,0,0,0
1,1,1,2
2,2,2,4


In [40]:
# Creating the class and training matrices and then fitting the model

y, X = dmatrices('class_col1 ~ train_col1 + train_col2', data=df, return_type='dataframe')
mod = sm.OLS(y, X) 
result = mod.fit() 
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:             class_col1   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.667e+30
Date:                Sun, 29 Jul 2018   Prob (F-statistic):           3.32e-16
Time:                        23:44:32   Log-Likelihood:                 101.92
No. Observations:                   3   AIC:                            -199.8
Df Residuals:                       1   BIC:                            -201.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept  -3.886e-16   6.74e-16     -0.576      0.6

