# Introduction to Modeling Libraries in Python

In this book, I have focused on providing a programming foundation for doing data analysis in Python. Since data analysts and scientists often report spending a disproportionate amount of time with data wrangling and preparation, the book's structure reflects the importance of mastering these techniques.

Which library you use for developing models will depend on the application. Many statistical problems can be solved by simpler techniques like ordinary least squares regression, while other problems may call for more advanced machine learning methods. Fortunately, Python has become one of the languages of choice for implementing analytical methods, so there are many tools you can explore after completing this book.

In this chapter, I will review some features of pandas that may be helpful when you're crossing back and forth between data wrangling with pandas and model fitting and scoring. I will then give short introductions to two popular modeling toolkits, statsmodels (http://statsmodels.org) and scikit-learn (http://scikit-learn.org). Since each of these projects is large enough to warrant its own dedicated book, I make no effort to be comprehensive and instead direct you to both projects' online documentation along with some other Python-based books on data science, statistics, and machine learning.

## Interfacing Between pandas and Model Code

To turn a DataFrame into a NumPy array, use the __.values__ property:

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]
})

In [3]:
data

   x0    x1    y
0   1  0.01 -1.5
1   2 -0.01  0.0
2   3  0.25  3.6
3   4 -4.10  1.3
4   5  0.00 -2.0


In [4]:
data.columns

Index(['x0', 'x1', 'y'], dtype='object')

In [5]:
data.values

array([[ 1.  ,  0.01, -1.5 ],
       [ 2.  , -0.01,  0.  ],
       [ 3.  ,  0.25,  3.6 ],
       [ 4.  , -4.1 ,  1.3 ],
       [ 5.  ,  0.  , -2.  ]])
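
Newer pandas releases also offer __DataFrame.to_numpy()__ as a method form of the same conversion. The following is a minimal sketch, assuming pandas 0.24 or later; it rebuilds the example frame so that it runs on its own and checks that both routes give the same float64 array.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]
})

# Method form of the conversion (assumes pandas >= 0.24)
arr = data.to_numpy()

# The integer column x0 is upcast so the whole array shares one dtype
print(arr.dtype)

# Both routes produce the same values for this frame
print(np.array_equal(arr, data.values))
```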

To convert back to a DataFrame, you can pass a two-dimensional ndarray with optional column names:

In [6]:
df2 = pd.DataFrame(data.values, columns=['one', 'two', 'three'])

In [7]:
df2

   one   two  three
0  1.0  0.01   -1.5
1  2.0 -0.01    0.0
2  3.0  0.25    3.6
3  4.0 -4.10    1.3
4  5.0  0.00   -2.0
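
Note that the first column came back as 1.0, 2.0, ... rather than 1, 2, ...: the intermediate array was float64, so the integer dtype was lost on the round trip. As a small sketch (reusing the original column names instead of 'one', 'two', 'three', which is a choice made here for illustration), the original dtypes can be restored with __astype__:

```python
import pandas as pd

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]
})

# Round trip through a NumPy array, keeping the original labels
df2 = pd.DataFrame(data.values, columns=data.columns, index=data.index)
print(df2.dtypes)  # every column is float64 after the round trip

# Cast each column back to the dtype it had in the original frame
restored = df2.astype(data.dtypes.to_dict())
print(restored.dtypes)
```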


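The __.values__ attribute is intended for homogeneous data, for example all numeric types. If the data is heterogeneous, the result is an ndarray of Python objects:

In [8]:
df3 = data.copy()

In [9]:
df3['strings'] = ['a', 'b', 'c', 'd', 'e']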

In [10]:
df3

   x0    x1    y strings
0   1  0.01 -1.5       a
1   2 -0.01  0.0       b
2   3  0.25  3.6       c
3   4 -4.10  1.3       d
4   5  0.00 -2.0       e


In [11]:
df3.values

array([[1, 0.01, -1.5, 'a'],
       [2, -0.01, 0.0, 'b'],
       [3, 0.25, 3.6, 'c'],
       [4, -4.1, 1.3, 'd'],
       [5, 0.0, -2.0, 'e']], dtype=object)
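
An object array like this is rarely what numerical model code expects. One way to guard against it, sketched here under the assumption that the nonnumeric columns can simply be left out, is to filter with __select_dtypes__ before converting:

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.],
    'strings': ['a', 'b', 'c', 'd', 'e']
})

# Mixing numbers and strings forces a generic object array
print(df3.to_numpy().dtype)

# Keep only the numeric columns, then convert
numeric = df3.select_dtypes(include='number')
arr = numeric.to_numpy()
print(list(numeric.columns))
print(arr.dtype)
```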

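For some models, you may want to use only a subset of the columns. One way is to select them with __loc__ and then take __.values__:

In [12]:
model_cols = ['x0', 'x1']

In [13]:
data.loc[:, model_cols].values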

array([[ 1.  ,  0.01],
       [ 2.  , -0.01],
       [ 3.  ,  0.25],
       [ 4.  , -4.1 ],
       [ 5.  ,  0.  ]])
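
Model-fitting code commonly takes a two-dimensional feature array together with a one-dimensional target array. A minimal sketch of that split, where the names __X__ and __y__ are conventions chosen here and treating the 'y' column as the target is an assumption:

```python
import pandas as pd

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]
})

model_cols = ['x0', 'x1']

# Two-dimensional feature matrix: one row per observation
X = data[model_cols].to_numpy()

# One-dimensional target vector
y = data['y'].to_numpy()

print(X.shape, y.shape)
```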

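Suppose the example dataset also had a nonnumeric column, stored here as a pandas Categorical:

In [14]:
data['category'] = pd.Categorical(['a', 'b', 'a', 'a', 'b'],
                                  categories=['a', 'b'])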

In [15]:
data

   x0    x1    y category
0   1  0.01 -1.5        a
1   2 -0.01  0.0        b
2   3  0.25  3.6        a
3   4 -4.10  1.3        a
4   5  0.00 -2.0        b


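To replace the 'category' column with dummy variables, create the dummy variables, drop the 'category' column, and then join the result:

In [16]:
dummies = pd.get_dummies(data.category, prefix='category')

In [17]:
data_with_dummies = data.drop('category', axis=1).join(dummies)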

In [18]:
data_with_dummies

   x0    x1    y  category_a  category_b
0   1  0.01 -1.5           1           0
1   2 -0.01  0.0           0           1
2   3  0.25  3.6           1           0
3   4 -4.10  1.3           1           0
4   5  0.00 -2.0           0           1


## Creating Model Descriptions with Patsy

Patsy (https://patsy.readthedocs.io/) is a Python library for describing statistical models (especially linear models) with a small string-based "formula syntax", which is inspired by (but not exactly the same as) the formula syntax used by the R and S statistical programming languages.

Patsy is well supported for specifying linear models in statsmodels, so I will focus on some of the main features to help you get up and running. Patsy's formulas use a special string syntax that looks like this:

y ~ x0 + x1

The syntax __a + b__ does not mean to add a to b, but rather that these are terms in the design matrix created for the model. The patsy.dmatrices function takes a formula string along with a dataset (which can be a DataFrame or a dict of arrays) and produces design matrices for a linear model:

In [19]:
data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]
})

In [20]:
data

   x0    x1    y
0   1  0.01 -1.5
1   2 -0.01  0.0
2   3  0.25  3.6
3   4 -4.10  1.3
4   5  0.00 -2.0


In [21]:
import patsy

In [22]:
y, X = patsy.dmatrices('y ~ x0 + x1', data)

In [23]:
y

DesignMatrix with shape (5, 1)
     y
  -1.5
   0.0
   3.6
   1.3
  -2.0
  Terms:
    'y' (column 0)

In [24]:
X

DesignMatrix with shape (5, 3)
  Intercept  x0     x1
          1   1   0.01
          1   2  -0.01
          1   3   0.25
          1   4  -4.10
          1   5   0.00
  Terms:
    'Intercept' (column 0)
    'x0' (column 1)
    'x1' (column 2)
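
To make concrete what the formula's terms turn into, the sketch below builds the same three columns (Intercept, x0, x1) by hand with NumPy and runs an ordinary least squares fit on them. It does not call Patsy at all. The hand-built matrix and the use of __numpy.linalg.lstsq__ are choices made here for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'x1': [0.01, -0.01, 0.25, -4.1, 0.],
    'y': [-1.5, 0., 3.6, 1.3, -2.]
})

# Hand-built counterpart of the design matrix for 'y ~ x0 + x1':
# a column of ones for the intercept, then one column per term
X = np.column_stack([np.ones(len(data)), data['x0'], data['x1']])
y = data['y'].to_numpy()

# Ordinary least squares fit of y on the columns of X
coef, resid, rank, sv = np.linalg.lstsq(X, y, rcond=None)

# Label the coefficients with the design matrix column names
coef = pd.Series(coef, index=['Intercept', 'x0', 'x1'])
print(coef)
```

Each term joined by __+__ contributes a column to __X__; nothing is summed elementwise, which is why the fit returns a separate coefficient for x0 and for x1.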

### Data Transformations in Patsy Formulas 

### Categorical Data and Patsy

## Introduction to statsmodels

### Estimating Linear Models

### Estimating Time Series Processes

## Introduction to scikit-learn

## Continuing Your Education