**Patsy**

Patsy is a package that prepares the inputs for fitting linear models in sklearn.

It creates the matrices that sklearn's modeling methods (such as regression) need:

- the matrix of predictor variable columns aka the *design matrix*
- the column of response variable values

It allows us to specify models using *formulas* (as in R) rather than by creating design matrices by hand.
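
To make the contrast concrete, here is a minimal sketch that builds a small design matrix by hand and then with a formula. The data frame `toy` and its columns `y`, `x` and `g` are hypothetical, made up for illustration only.

```python
import numpy as np
import pandas as pd
import patsy as ps

# Hypothetical toy data: one numerical and one two-level categorical predictor.
toy = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0],
                    "x": [0.5, 1.5, 2.5, 3.5],
                    "g": ["a", "b", "a", "b"]})

# By hand: a column of ones, a 0/1 dummy for g == "b", and x itself.
X_hand = pd.DataFrame({"Intercept": np.ones(len(toy)),
                       "g[T.b]": (toy["g"] == "b").astype(float),
                       "x": toy["x"]})

# With a formula: patsy builds the columns from the description "y ~ x + g".
y_patsy, X_patsy = ps.dmatrices("y ~ x + g", toy, return_type="dataframe")

# Compare column by column, matching on patsy's column names.
same = all(np.allclose(X_hand[c], X_patsy[c]) for c in X_hand.columns)
print(same)
```

The formula version stays one line long no matter how many levels the categorical variable has, which is the main convenience.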

To illustrate its use, we return to the traffic prediction problem. 

We'll recode weekdays as strings (object) to illustrate what patsy does with a categorical variable.

In [None]:
import pandas as pd
import numpy as np
import datetime
import os
import mylib as my
from sklearn.linear_model import LinearRegression
df=pd.read_csv("TRAFFIC_VOLUME2.csv")

In [None]:
df.drop(columns=["weather_description","time_diff","snow_1h","holiday"],inplace=True)
df["dayofweek"]=df.weekday.map({0:"Mon",1:"Tue",2:"Wed",3:"Thu",4:"Fri",5:"Sat",6:"Sun"})
N=df.shape[0] # number of rows
perm=np.random.permutation(range(N))
Itrain1=perm[0:int(N/3)]
Itrain2=perm[int(N/3):int(2*N/3)]
Itest=perm[int(2*N/3):N]
dfTrain1=df.loc[Itrain1]
dfTrain2=df.loc[Itrain2]
dfTest=df.loc[Itest]
df.dayofweek

**Simple example**

In the following we tell patsy to create the matrices for predicting traffic_volume 
using a combination of numerical and categorical predictor variables.

Patsy automatically includes an intercept in the model.


In [None]:
import patsy as ps

formula="traffic_volume~dayofweek+dayofyear"
Ytrain1,Xtrain1=ps.dmatrices(formula, dfTrain1)

In [None]:
Ytrain1

In [None]:
print(Xtrain1)

Here we see that patsy created a design matrix with a column of 1's for the intercept parameter, 6 dummy variables (one for each day of the week except Friday, which serves as the reference day), and a column for day of the year (a numerical variable). Friday is the reference because patsy sorts the levels of a string variable and "Fri" comes first alphabetically.

The matrix has a type that is particular to patsy.
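
If it is unclear which level patsy left out, the column names stored with the design matrix can be inspected. The sketch below uses a hypothetical seven-row data frame (not the traffic data) with one row per day.

```python
import pandas as pd
import patsy as ps

# Hypothetical toy data with one observation per day of the week.
toy = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
                    "dayofweek": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]})

y, X = ps.dmatrices("y ~ dayofweek", toy)

# design_info keeps the names of the columns patsy created.
names = X.design_info.column_names
print(names)

# The level that has no dummy column of its own is the reference level.
levels = sorted(toy["dayofweek"].unique())
reference = [lv for lv in levels if f"dayofweek[T.{lv}]" not in names]
print(reference)
```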

In [None]:
type(Xtrain1)

NumPy's `asarray` function converts it to a plain NumPy array.

In [None]:
np.asarray(Xtrain1).shape

In [None]:
np.asarray(Xtrain1)[0:100,1]
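
Converting to a plain array drops the column names. As a possible alternative, `dmatrices` accepts `return_type="dataframe"`, which returns pandas objects with labeled columns. A small sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd
import patsy as ps

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({"y": [1.0, 2.0, 3.0], "x": [10.0, 20.0, 30.0]})

# Default return type: patsy's own DesignMatrix (a NumPy array subclass).
y, X = ps.dmatrices("y ~ x", toy)
X_arr = np.asarray(X)  # plain ndarray; the column names are no longer attached
print(type(X), type(X_arr))

# Alternative: ask patsy for pandas data frames, which keep the column names.
y_df, X_df = ps.dmatrices("y ~ x", toy, return_type="dataframe")
print(list(X_df.columns))
```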

Next, as usual, we can use the X and Y values to fit a model.

In [None]:
Ytrain1.shape

In [None]:
fit=LinearRegression().fit(Xtrain1,Ytrain1)
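
One detail worth keeping in mind: the patsy design matrix already contains an intercept column, and `LinearRegression` fits its own intercept by default. Both ways of fitting should give the same predictions; the sketch below checks this on hypothetical toy data. Passing `fit_intercept=False` is one option if you want all parameters to appear in `coef_`.

```python
import numpy as np
import pandas as pd
import patsy as ps
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({"y": [1.0, 3.1, 4.9, 7.2, 9.0],
                    "x": [0.0, 1.0, 2.0, 3.0, 4.0]})
y, X = ps.dmatrices("y ~ x", toy)
X_arr, y_arr = np.asarray(X), np.asarray(y)

# Default: sklearn adds its own intercept on top of patsy's column of ones.
fit_default = LinearRegression().fit(X_arr, y_arr)

# Alternative: rely only on patsy's intercept column.
fit_no_int = LinearRegression(fit_intercept=False).fit(X_arr, y_arr)

print(fit_default.intercept_, fit_default.coef_)
print(fit_no_int.coef_)

# The two fits describe the same line, so their predictions agree.
same_pred = np.allclose(fit_default.predict(X_arr), fit_no_int.predict(X_arr))
print(same_pred)
```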

**Creating the test matrices**

Important: the design matrix for the test data (or the second training dataset) must be created with the same procedure that was applied to the first training set.

Be careful! The following example shows what can go wrong.

In [None]:
dt1=pd.DataFrame({"day":["Mon","Tue","Wed","Thu","Mon","Tue","Wed","Thu"],
                  "calories":[1400,1800,2100,2000,1900,1800,1500,1800],
                  "wt":[145,147,149,148,147,148,150,149]})
print(dt1)
Y1,X1=ps.dmatrices("wt~day+calories",dt1)
X1
fit=LinearRegression().fit(X1,Y1)
fit.coef_

In [None]:
print(X1)

We see that Monday was made the reference day.
Now consider data we want to make predictions for.

In [None]:
dt2=pd.DataFrame({"day":["Thu","Mon","Thu","Mon","Thu","Mon","Thu","Mon"],
                  "calories":[1500,1300,1700,2000,1900,1800,1800,1900],
                  "wt":[148,147,146,149,147,149,150,151]})
print(dt2)
Y2,X2=ps.dmatrices("wt~day+calories",dt2)
X2

In [None]:
fit.predict(X2)

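This is a problem: dt2 contains only two of the four days, so patsy builds a design matrix with fewer columns than X1, and the columns no longer line up with the fitted coefficients. We need to make sure that the design matrix for the new data is created by the same process as the original one.

We can do that with the following approach, which uses the information about how X1 was created (the formula and the data frame, stored in `X1.design_info`) to build a design matrix for a new data frame.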
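
The mismatch is easiest to see by comparing the column names and shapes of the two design matrices. This sketch reuses the dt1 and dt2 data from the cells above.

```python
import pandas as pd
import patsy as ps

dt1 = pd.DataFrame({"day": ["Mon", "Tue", "Wed", "Thu", "Mon", "Tue", "Wed", "Thu"],
                    "calories": [1400, 1800, 2100, 2000, 1900, 1800, 1500, 1800],
                    "wt": [145, 147, 149, 148, 147, 148, 150, 149]})
dt2 = pd.DataFrame({"day": ["Thu", "Mon", "Thu", "Mon", "Thu", "Mon", "Thu", "Mon"],
                    "calories": [1500, 1300, 1700, 2000, 1900, 1800, 1800, 1900],
                    "wt": [148, 147, 146, 149, 147, 149, 150, 151]})

# Same formula, but each call only sees the days present in its own data frame.
Y1, X1 = ps.dmatrices("wt~day+calories", dt1)
Y2, X2 = ps.dmatrices("wt~day+calories", dt2)

print(X1.design_info.column_names)
print(X2.design_info.column_names)
print(X1.shape, X2.shape)  # different numbers of columns
```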

In [None]:
X2=ps.build_design_matrices([X1.design_info], dt2)[0]

In [None]:
X2

In [None]:
fit.predict(X2)
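
As a check, the rebuilt matrix should have exactly the same columns as X1, with all-zero dummy columns for the days that do not occur in dt2. A self-contained sketch using the same data as above:

```python
import numpy as np
import pandas as pd
import patsy as ps
from sklearn.linear_model import LinearRegression

dt1 = pd.DataFrame({"day": ["Mon", "Tue", "Wed", "Thu", "Mon", "Tue", "Wed", "Thu"],
                    "calories": [1400, 1800, 2100, 2000, 1900, 1800, 1500, 1800],
                    "wt": [145, 147, 149, 148, 147, 148, 150, 149]})
dt2 = pd.DataFrame({"day": ["Thu", "Mon", "Thu", "Mon", "Thu", "Mon", "Thu", "Mon"],
                    "calories": [1500, 1300, 1700, 2000, 1900, 1800, 1800, 1900],
                    "wt": [148, 147, 146, 149, 147, 149, 150, 151]})

Y1, X1 = ps.dmatrices("wt~day+calories", dt1)

# Reuse the recipe stored with X1 instead of calling dmatrices again.
X2 = ps.build_design_matrices([X1.design_info], dt2)[0]

names = X1.design_info.column_names
same_columns = X2.design_info.column_names == names
print(same_columns, X2.shape)

# Tuesday never occurs in dt2, so its dummy column is all zeros.
tue_sum = np.asarray(X2)[:, names.index("day[T.Tue]")].sum()
print(tue_sum)

# Predictions now work because the columns line up with the coefficients.
fit = LinearRegression().fit(np.asarray(X1), np.asarray(Y1))
pred = fit.predict(np.asarray(X2))
print(pred.shape)
```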

**Additional things you can do with patsy formulas**

We can tell patsy not to include an intercept parameter:

In [None]:
formula="traffic_volume~0+day+dayofweek"
Ytrain1,Xtrain1=ps.dmatrices(formula, dfTrain1)
Xtrain1
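
Dropping the intercept also changes how a categorical variable is coded: without a column of ones, patsy can give every level its own dummy column. The sketch below compares the two cases on hypothetical toy data with a three-level variable `g`.

```python
import pandas as pd
import patsy as ps

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                    "g": ["a", "b", "c", "a", "b", "c"],
                    "x": [0.1, 0.4, 0.2, 0.9, 0.5, 0.7]})

# With an intercept: one level of g is dropped as the reference.
_, X_with = ps.dmatrices("y ~ g + x", toy)
names_with = X_with.design_info.column_names

# Without an intercept: every level of g gets its own column.
_, X_without = ps.dmatrices("y ~ 0 + g + x", toy)
names_without = X_without.design_info.column_names

print(names_with)
print(names_without)
```

Both matrices have the same number of columns, so the two models have the same number of parameters; only the interpretation of the coefficients differs.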

We can include transformations of columns.

In [None]:
formula="traffic_volume~day+np.sin(2*np.pi*dayofyear/365)+np.cos(2*np.pi*dayofyear/365)"
Ytrain1,Xtrain1=ps.dmatrices(formula, dfTrain1)
Xtrain1
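
To confirm that patsy simply evaluates the expression written in the formula, the transformed column can be compared with the same quantity computed directly. This sketch uses a hypothetical five-row data frame rather than the traffic data.

```python
import numpy as np
import pandas as pd
import patsy as ps

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({"y": [10.0, 20.0, 15.0, 5.0, 12.0],
                    "dayofyear": [1, 92, 183, 274, 365]})

formula = "y ~ np.sin(2*np.pi*dayofyear/365) + np.cos(2*np.pi*dayofyear/365)"
y, X = ps.dmatrices(formula, toy, return_type="dataframe")
print(list(X.columns))

# Locate the sine column by name and compare it with a direct computation.
sin_col = [c for c in X.columns if "sin" in c][0]
manual = np.sin(2 * np.pi * toy["dayofyear"] / 365)
matches = np.allclose(X[sin_col], manual)
print(matches)
```

Note that `np` must be imported in the namespace where `dmatrices` is called, since patsy looks up the names used in the formula there.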

We can include interactions between categorical variables.

In [None]:
dfTrain1.columns

In [None]:
formula="traffic_volume~day+weather_main*dayofweek"
Ytrain1,Xtrain1=ps.dmatrices(formula, dfTrain1)
Xtrain1
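
In a formula, `a*b` is shorthand for `a + b + a:b`, that is, both main effects plus their interaction. The sketch below checks this on hypothetical toy data; the columns `w` and `d` stand in for variables like weather_main and dayofweek.

```python
import pandas as pd
import patsy as ps

# Hypothetical toy data with two two-level categorical variables.
toy = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                    "w": ["Clear", "Rain", "Clear", "Rain"] * 2,
                    "d": ["Sat", "Sat", "Sun", "Sun"] * 2})

_, X_star = ps.dmatrices("y ~ w*d", toy)
_, X_long = ps.dmatrices("y ~ w + d + w:d", toy)

names_star = X_star.design_info.column_names
names_long = X_long.design_info.column_names

print(names_star)
print(names_star == names_long)
```

Interaction columns are the ones whose names contain a colon; each is the product of the corresponding dummy columns.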

**Normalizing**

For some procedures you are advised to normalize your variables before fitting a model.
Again, whatever normalization you use should first be defined on the training data, and that same normalization should then be applied to the test data.

This is best explained in an example.

Q: If we normalize the rain_1h variable by subtracting the mean and dividing by the standard deviation of that variable, which mean and standard deviation should we use?
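
One way to work through this question, following the rule stated above, is to compute the mean and standard deviation on the training data only and reuse them for the test data. The sketch below uses randomly generated stand-in values for rain_1h, not the actual traffic data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data for rain_1h; the real values come from the traffic data.
rng = np.random.default_rng(0)
train = pd.DataFrame({"rain_1h": rng.gamma(2.0, 1.5, size=200)})
test = pd.DataFrame({"rain_1h": rng.gamma(2.0, 1.5, size=100)})

# Define the normalization on the training data only.
mu = train["rain_1h"].mean()
sd = train["rain_1h"].std()

# Apply the same mean and standard deviation to both sets.
train_z = (train["rain_1h"] - mu) / sd
test_z = (test["rain_1h"] - mu) / sd

print(train_z.mean(), train_z.std())  # approximately 0 and 1 by construction
print(test_z.mean(), test_z.std())    # close to, but generally not exactly, 0 and 1
```

The normalized test values will usually not have mean exactly 0 and standard deviation exactly 1, and that is expected: the test data is being measured on the scale the model was trained on.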