<a href="https://colab.research.google.com/github/anyuanay/info212/blob/main/INFO212_Week9_Lecture2_modeling_libraries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 212: Data Science Programming 1
___

### Week 9: Modeling Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Introduction to statsmodels
statsmodels is a Python library for fitting many kinds of statistical models, performing
statistical tests, and data exploration and visualization. Statsmodels contains more
“classical” frequentist statistical methods, while Bayesian methods and machine learning
models are found in other libraries.
Some kinds of models found in statsmodels include:
- Linear models, generalized linear models, and robust linear models
- Linear mixed effects models
- Analysis of variance (ANOVA) methods
- Time series processes and state space models
- Generalized method of moments

### Estimating Linear Models
There are several kinds of linear regression models in statsmodels, from the more
basic (e.g., ordinary least squares) to more complex (e.g., iteratively reweighted least
squares).

Linear models in statsmodels have two different main interfaces: array-based and
formula-based. These are accessed through these API module imports:

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf


To show how to use these, we generate a linear model from some random data:

In [None]:
def dnorm(mean, variance, size=1):
    if isinstance(size, int):
        size = size,
    return mean + np.sqrt(variance) * np.random.randn(*size)

# For reproducibility
np.random.seed(12345)

N = 100
X = np.c_[dnorm(0, 0.4, size=N),
          dnorm(0, 0.6, size=N),
          dnorm(0, 0.2, size=N)]
eps = dnorm(0, 0.1, size=N)
beta = [0.1, 0.3, 0.5]

y = np.dot(X, beta) + eps

Here, I wrote down the “true” model with known parameters beta. The dnorm helper
function generates normally distributed data with a given mean and variance.
So now we have:

In [None]:
X[:5]
y[:5]

array([ 0.42786349, -0.67348041, -0.09087764, -0.48949442, -0.12894109])
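As a quick sanity check on the helper described above, the sketch below redefines dnorm as in the earlier cell and compares the sample mean and variance of a large draw against the requested values. The sample size of 100,000 and the names sample, sample_mean, and sample_var are choices made here for illustration only.

```python
import numpy as np

def dnorm(mean, variance, size=1):
    # Same helper as in the lecture cell: shift and scale standard normal draws
    if isinstance(size, int):
        size = size,
    return mean + np.sqrt(variance) * np.random.randn(*size)

np.random.seed(12345)

# Illustrative draw; 100_000 is an arbitrary sample size chosen here
sample = dnorm(0, 0.4, size=100_000)
sample_mean = sample.mean()
sample_var = sample.var()

print(sample_mean, sample_var)  # should be close to 0 and 0.4
```

Note that the second argument is a variance, not a standard deviation, which is why the helper takes a square root before scaling.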

A linear model is generally fitted with an intercept term, as we saw before with Patsy.
The sm.add_constant function adds an intercept column (a column of ones) to an existing matrix:

In [None]:
X_model = sm.add_constant(X)
X_model[:5]

array([[ 1.        , -0.12946849, -1.21275292,  0.50422488],
       [ 1.        ,  0.30291036, -0.43574176, -0.25417986],
       [ 1.        , -0.32852189, -0.02530153,  0.13835097],
       [ 1.        , -0.35147471, -0.71960511, -0.25821463],
       [ 1.        ,  1.2432688 , -0.37379916, -0.52262905]])

The sm.OLS class can fit an ordinary least squares linear regression:

In [None]:
model = sm.OLS(y, X)

The model’s fit method returns a regression results object containing estimated
model parameters and other diagnostics:

In [None]:
results = model.fit()
results.params

array([0.17826108, 0.22303962, 0.50095093])
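For intuition about what an ordinary least squares fit computes, the following NumPy-only sketch solves the same kind of problem two ways: with np.linalg.lstsq and with the normal equations. It is an illustration on synthetic data, not a description of how statsmodels works internally; the names X_demo, y_demo, beta_lstsq, and beta_normal are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed

# Synthetic regression problem with no intercept column
X_demo = rng.normal(size=(100, 3))
beta_true = np.array([0.1, 0.3, 0.5])
y_demo = X_demo @ beta_true + rng.normal(scale=0.3, size=100)

# Least-squares solution via NumPy's solver
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# Same solution from the normal equations (X'X) b = X'y
beta_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

print(beta_lstsq)
print(beta_normal)
```

Both approaches should agree to numerical precision, and the estimates should land near the coefficients used to generate the data.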

In [None]:
print(results.summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.430
Model:                            OLS   Adj. R-squared (uncentered):              0.413
Method:                 Least Squares   F-statistic:                              24.42
Date:                Sat, 16 Apr 2022   Prob (F-statistic):                    7.44e-12
Time:                        01:17:49   Log-Likelihood:                         -34.305
No. Observations:                 100   AIC:                                      74.61
Df Residuals:                      97   BIC:                                      82.42
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

The formula-based interface works with a DataFrame and a Patsy formula string.
Suppose the model data are in a DataFrame:

In [None]:
data = pd.DataFrame(X, columns=['col0', 'col1', 'col2'])
data['y'] = y
data[:5]

In [None]:
results = smf.ols('y ~ col0 + col1 + col2', data=data).fit()
results.params
results.tvalues

In [None]:
results.predict(data[:5])
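The formula interface differs from the array interface in two ways that are easy to miss: Patsy adds an intercept automatically, and the fitted parameters come back as a labeled pandas Series. The sketch below shows both on a toy DataFrame. The frame demo, the seed, and the noise scale are assumptions for illustration; the column names match the lecture cell.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)  # illustrative seed

# Toy DataFrame using the same column names as the lecture cell
demo = pd.DataFrame(rng.normal(size=(100, 3)),
                    columns=['col0', 'col1', 'col2'])
demo['y'] = (0.1 * demo['col0'] + 0.3 * demo['col1'] + 0.5 * demo['col2']
             + rng.normal(scale=0.3, size=100))

# Default formula: Patsy adds an intercept automatically
fit_default = smf.ols('y ~ col0 + col1 + col2', data=demo).fit()

# Appending "+ 0" to the formula suppresses the intercept
fit_no_intercept = smf.ols('y ~ col0 + col1 + col2 + 0', data=demo).fit()

print(fit_default.params.index.tolist())
print(fit_no_intercept.params.index.tolist())
```

The labeled index makes it possible to look up a coefficient by name, for example fit_default.params['col1'].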

### Estimating Time Series Processes
Another class of models in statsmodels is for time series analysis. Among these are
autoregressive processes, Kalman filtering and other state space models, and multivariate
autoregressive models.

In [None]:
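# Simulate an AR(2) process: x_t = b0 * x_{t-1} + b1 * x_{t-2} + noise_t
init_x = 4
values = [init_x, init_x]
N = 1000

b0 = 0.8
b1 = -0.4
noise = dnorm(0, 0.1, N)  # Gaussian noise with mean 0 and variance 0.1
for i in range(N):
    new_x = values[-1] * b0 + values[-2] * b1 + noise[i]
    values.append(new_x)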

In [None]:
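from statsmodels.tsa.ar_model import AutoReg

MAXLAGS = 5
# sm.tsa.AR is deprecated and no longer available in recent statsmodels
# releases; AutoReg is its replacement and takes the lag order up front
model = AutoReg(values, lags=MAXLAGS)
results = model.fit()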

In [None]:
results.params

array([-0.00616093,  0.78446347, -0.40847891, -0.01364148,  0.01496872,
        0.01429462])
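The first entry in the parameters above is the intercept and the next two are the lag-1 and lag-2 estimates, which should sit near the simulated values of 0.8 and -0.4. As a rough cross-check that does not depend on statsmodels, the sketch below simulates the same kind of AR(2) process and estimates its coefficients by least squares on a lagged design matrix. The seed and the names series, lagged, target, and coef are assumptions for this illustration, and only two lags are fitted here rather than five.

```python
import numpy as np

rng = np.random.default_rng(3)  # illustrative seed

# Simulate x_t = b0 * x_{t-1} + b1 * x_{t-2} + noise_t, as in the lecture
b0, b1 = 0.8, -0.4
n_steps = 1000
series = [4.0, 4.0]
noise = rng.normal(scale=np.sqrt(0.1), size=n_steps)
for t in range(n_steps):
    series.append(b0 * series[-1] + b1 * series[-2] + noise[t])
series = np.array(series)

# Design matrix: intercept, lag 1, lag 2; target: the current value
target = series[2:]
lagged = np.column_stack([np.ones(len(target)), series[1:-1], series[:-2]])
coef, *_ = np.linalg.lstsq(lagged, target, rcond=None)

print(coef)  # intercept, lag-1, and lag-2 estimates
```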

## Introduction to scikit-learn

scikit-learn is one of the most widely used and trusted general-purpose Python
machine learning toolkits. It contains a broad selection of standard supervised and
unsupervised machine learning methods with tools for model selection and evaluation,
data transformation, data loading, and model persistence. These models can be
used for classification, clustering, prediction, and other common tasks.

There are excellent online and printed resources for learning about machine learning
and how to apply libraries like scikit-learn and TensorFlow to solve real-world problems.

In [None]:
train = pd.read_csv('datasets/titanic/train.csv')
test = pd.read_csv('datasets/titanic/test.csv')
train[:4]

In [None]:
train.isnull().sum()
test.isnull().sum()

In [None]:
impute_value = train['Age'].median()
train['Age'] = train['Age'].fillna(impute_value)
test['Age'] = test['Age'].fillna(impute_value)
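The cell above fills missing ages in both datasets with the median of the training data, so the test set does not influence the preprocessing. A toy sketch of the same pattern follows; the frames toy_train and toy_test and their values are made up for illustration and are not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the Titanic train/test data
toy_train = pd.DataFrame({'Age': [22.0, np.nan, 38.0, 26.0, np.nan]})
toy_test = pd.DataFrame({'Age': [np.nan, 47.0, 30.0]})

# The median is computed from the training set only; pandas skips NaN
toy_impute = toy_train['Age'].median()

toy_train['Age'] = toy_train['Age'].fillna(toy_impute)
toy_test['Age'] = toy_test['Age'].fillna(toy_impute)

print(toy_impute)
print(toy_test['Age'].tolist())
```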

In [None]:
train['IsFemale'] = (train['Sex'] == 'female').astype(int)
test['IsFemale'] = (test['Sex'] == 'female').astype(int)

In [None]:
predictors = ['Pclass', 'IsFemale', 'Age']
X_train = train[predictors].values
X_test = test[predictors].values
y_train = train['Survived'].values
X_train[:5]
y_train[:5]

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

In [None]:
model.fit(X_train, y_train)

In [None]:
y_predict = model.predict(X_test)
y_predict[:10]

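If you had the true values for the test dataset (say, in an array named y_true), you
could compute an accuracy percentage or some other error metric, for example with
(y_true == y_predict).mean().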
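A toy sketch of that accuracy calculation, using made-up label arrays since the lecture's test set has no true labels available. The names y_true_demo and y_pred_demo are hypothetical.

```python
import numpy as np

# Made-up labels for illustration; not derived from the Titanic data
y_true_demo = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Elementwise comparison gives booleans; their mean is the fraction correct
accuracy = (y_true_demo == y_pred_demo).mean()
print(accuracy)  # 6 of 8 predictions match, so 0.75
```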

In [None]:
from sklearn.linear_model import LogisticRegressionCV
# Cs must be passed by keyword in current scikit-learn releases
model_cv = LogisticRegressionCV(Cs=10)
model_cv.fit(X_train, y_train)

In [None]:
from sklearn.model_selection import cross_val_score
model = LogisticRegression(C=10)
scores = cross_val_score(model, X_train, y_train, cv=4)
scores
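To show what cross_val_score returns without needing the Titanic files, here is a minimal sketch on synthetic data. With cv=4 the data is split into four folds and one score per fold comes back; for a classifier the default score is accuracy. The data-generating rule, the seed, and the names X_demo, y_demo, clf, and demo_scores are assumptions for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)  # illustrative seed

# Synthetic binary classification data (not the Titanic dataset)
X_demo = rng.normal(size=(200, 3))
signal = X_demo[:, 0] + 0.5 * X_demo[:, 1]
y_demo = (signal + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression(C=10)
# cv=4 splits the data into four folds and returns one score per fold
demo_scores = cross_val_score(clf, X_demo, y_demo, cv=4)

print(demo_scores)
print(demo_scores.mean())
```

Averaging the fold scores, as in the last line, gives a single summary number that is less sensitive to any one split.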

## Continuing Your Education
- INFO 213: Data Science Programming II
- INFO 323: Cloud Computing and Big Data
- DSCI 371: Recommender Systems
- DSCI 471: Applied Deep Learning