**Patsy**

Patsy is a package for setting up linear models that we then fit in sklearn.

It creates the matrices that sklearn's modeling methods (such as regression) need:

- the matrix of predictor variable columns aka the *design matrix*
- the column of response variable values

It allows us to specify models using *formulas* (as in R) rather than building these matrices by hand.
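
As a minimal sketch of what a formula produces, the cell below builds both matrices from a small made-up data frame. The frame `toy` and its columns `y`, `x` and `g` are hypothetical and are not part of the traffic data.

```python
import pandas as pd
import patsy as ps

# Hypothetical toy data (not the traffic data set).
toy = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "x": [10.0, 20.0, 30.0, 40.0],
    "g": ["a", "b", "a", "b"],
})

# Left of "~" is the response, right of "~" are the predictors.
y, X = ps.dmatrices("y ~ x + g", toy)

print(y.shape)                      # a single response column
print(X.design_info.column_names)   # names of the design matrix columns
```

The string column `g` is turned into a dummy variable automatically, and an intercept column is added.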

To illustrate its use, we return to the traffic prediction problem.

We'll recode weekdays as strings (object) to illustrate what patsy does with a categorical variable.

In [4]:
import pandas as pd
import numpy as np
import datetime
import os
from sklearn.linear_model import LinearRegression
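df=pd.read_csv("TRAFFIC_VOLUME2.csv")
# errors="ignore" skips listed columns that are absent from the CSV (here "Unnamed: 0" and "index")
df.drop(columns=["Unnamed: 0","index","weather_description","time_diff","snow_1h","holiday"],inplace=True,errors="ignore")
# recode the weekday numbers as strings so that patsy treats the column as categorical
df["dayofweek"]=df.weekday.map({0:"Mon",1:"Tue",2:"Wed",3:"Thu",4:"Fri",5:"Sat",6:"Sun"})
N=df.shape[0] # number of rows
perm=np.random.permutation(N) # random ordering of the row positions
# split the rows into three parts of roughly equal size: two training sets and one test set
Itrain1=perm[0:int(N/3)]
Itrain2=perm[int(N/3):int(2*N/3)]
Itest=perm[int(2*N/3):N]
dfTrain1=df.iloc[Itrain1]
dfTrain2=df.iloc[Itrain2]
dfTest=df.iloc[Itest]
df.dayofweek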

**Simple example**

In the following we tell patsy to create the matrices for predicting traffic_volume 
using a combination of numerical and categorical predictor variables.

Patsy automatically includes an intercept in the model.


In [2]:
import patsy as ps

formula="traffic_volume~dayofweek+dayofyear"
Ytrain1,Xtrain1=ps.dmatrices(formula, dfTrain1)

In [3]:
Ytrain1

DesignMatrix with shape (13523, 1)
  traffic_volume
             891
             312
            4322
            4259
            5650
            5466
            5833
            3996
            1740
            1853
            3282
             827
            3038
            2635
            4860
            4882
            4295
            5227
             800
            2176
            3385
            4471
             731
            4238
            5747
            3422
            2943
            2558
            3282
             396
  [13493 rows omitted]
  Terms:
    'traffic_volume' (column 0)
  (to view full data, use np.asarray(this_obj))

In [7]:
print(Xtrain1)

[[  1.   0.   0. ...   1.   0. 253.]
 [  1.   1.   0. ...   0.   0. 120.]
 [  1.   0.   0. ...   0.   0. 348.]
 ...
 [  1.   1.   0. ...   0.   0. 249.]
 [  1.   0.   0. ...   1.   0. 153.]
 [  1.   0.   0. ...   0.   0. 155.]]


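Patsy created 6 dummy variables for the 7 weekdays: one for each day except Friday, which is left out as the reference day because it comes first alphabetically.
We can inspect individual columns by converting the matrix with np.asarray, as the DesignMatrix printout above suggests.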

In [41]:
np.asarray(Xtrain1)[0:100,1]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.,
       0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0.,
       0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.])
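
A bare array does not say which dummy variable a column holds. As a sketch on made-up data (the frame `toy` and its values are hypothetical), the column names can be read from the matrix's `design_info`, or patsy can be asked to return labelled pandas data frames directly.

```python
import pandas as pd
import patsy as ps

# Hypothetical toy data with the same column names as the traffic example.
toy = pd.DataFrame({
    "volume": [100, 200, 300, 400, 500, 600],
    "dayofweek": ["Fri", "Mon", "Sat", "Fri", "Mon", "Sat"],
    "dayofyear": [1, 2, 3, 8, 9, 10],
})
formula = "volume ~ dayofweek + dayofyear"

# Option 1: look up the column names stored with the design matrix.
Y, X = ps.dmatrices(formula, toy)
names = X.design_info.column_names
print(names)

# Option 2: ask patsy for pandas data frames, which carry the names as labels.
Ydf, Xdf = ps.dmatrices(formula, toy, return_type="dataframe")
print(Xdf["dayofweek[T.Mon]"].tolist())
```

Here "Fri" is the reference level, so only the "Mon" and "Sat" dummy columns appear.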

Next, as usual, we can use the X, Y values to fit a model.

In [44]:
Ytrain1.shape

(13523, 1)

In [8]:
fit=LinearRegression().fit(Xtrain1,Ytrain1)
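
One detail to keep in mind: the patsy design matrix already contains an Intercept column, and LinearRegression also fits its own intercept by default. The sketch below, on a small made-up table (the frame `toy` is hypothetical), compares the default fit with one that uses `fit_intercept=False`. It is meant as an illustration of the two options, not as the required way to fit the traffic model.

```python
import numpy as np
import pandas as pd
import patsy as ps
from sklearn.linear_model import LinearRegression

# Hypothetical toy data.
toy = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Mon", "Tue", "Wed", "Thu"],
    "calories": [1400, 1800, 2100, 2000, 1900, 1800, 1500, 1800],
    "wt": [145, 147, 149, 148, 147, 148, 150, 149],
})
Y, X = ps.dmatrices("wt ~ day + calories", toy)
X, Y = np.asarray(X), np.asarray(Y)

# Default: sklearn fits its own intercept, so the coefficient of patsy's
# constant Intercept column carries no information.
default_fit = LinearRegression().fit(X, Y)

# Alternative: let patsy's Intercept column be the only intercept.
single_fit = LinearRegression(fit_intercept=False).fit(X, Y)

pred_default = default_fit.predict(X)
pred_single = single_fit.predict(X)
print(default_fit.coef_[0, 0], default_fit.intercept_[0])
print(single_fit.coef_[0, 0])
```

Both fits give the same predictions. With `fit_intercept=False` the intercept is reported as the first coefficient instead of in `intercept_`.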

**Creating the test matrices**

The design matrix for the test data (or for the second training dataset) must be built in exactly the same way as the design matrix for the first training set.

Be careful! The following example shows what can go wrong.

In [13]:
dt1=pd.DataFrame({"day":["Mon","Tue","Wed","Thu","Mon","Tue","Wed","Thu"],
                  "calories":[1400,1800,2100,2000,1900,1800,1500,1800],
                  "wt":[145,147,149,148,147,148,150,149]})
print(dt1)
Y1,X1=ps.dmatrices("wt~day+calories",dt1)
X1
fit=LinearRegression().fit(X1,Y1)
fit.coef_

   day  calories   wt
0  Mon      1400  145
1  Tue      1800  147
2  Wed      2100  149
3  Thu      2000  148
4  Mon      1900  147
5  Tue      1800  148
6  Wed      1500  150
7  Thu      1800  149


array([[0.00000000e+00, 2.42307692e+00, 1.45384615e+00, 3.45384615e+00,
        3.07692308e-04]])

In [77]:
print(X1)

[[1.0e+00 0.0e+00 0.0e+00 0.0e+00 1.4e+03]
 [1.0e+00 0.0e+00 1.0e+00 0.0e+00 1.8e+03]
 [1.0e+00 0.0e+00 0.0e+00 1.0e+00 2.1e+03]
 [1.0e+00 1.0e+00 0.0e+00 0.0e+00 2.0e+03]
 [1.0e+00 0.0e+00 0.0e+00 0.0e+00 1.9e+03]
 [1.0e+00 0.0e+00 1.0e+00 0.0e+00 1.8e+03]
 [1.0e+00 0.0e+00 0.0e+00 1.0e+00 1.5e+03]
 [1.0e+00 1.0e+00 0.0e+00 0.0e+00 1.8e+03]]


We see that Monday was made the reference day.
Now consider data we want to make predictions for.

In [14]:
dt2=pd.DataFrame({"day":["Thu","Mon","Thu","Mon","Thu","Mon","Thu","Mon"],
                  "calories":[1500,1300,1700,2000,1900,1800,1800,1900],
                  "wt":[148,147,146,149,147,149,150,151]})
print(dt2)
Y2,X2=ps.dmatrices("wt~day+calories",dt2)
X2

   day  calories   wt
0  Thu      1500  148
1  Mon      1300  147
2  Thu      1700  146
3  Mon      2000  149
4  Thu      1900  147
5  Mon      1800  149
6  Thu      1800  150
7  Mon      1900  151


DesignMatrix with shape (8, 3)
  Intercept  day[T.Thu]  calories
          1           1      1500
          1           0      1300
          1           1      1700
          1           0      2000
          1           1      1900
          1           0      1800
          1           1      1800
          1           0      1900
  Terms:
    'Intercept' (column 0)
    'day' (column 1)
    'calories' (column 2)

In [15]:
fit.predict(X2)

ValueError: X has 3 features, but LinearRegression is expecting 5 features as input.

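This is a problem: dt2 contains only Mondays and Thursdays, so patsy built a single dummy column for day instead of the three that the fitted model expects. We want to make sure that the same process is applied to the new data when creating the design matrix.

We can do that by passing the design information stored with X1 (the formula and the levels found in the training data frame) to build_design_matrices, which then builds a matching design matrix for a new data frame.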

In [16]:
X2=ps.build_design_matrices([X1.design_info], dt2)[0]

In [17]:
X2

DesignMatrix with shape (8, 5)
  Intercept  day[T.Thu]  day[T.Tue]  day[T.Wed]  calories
          1           1           0           0      1500
          1           0           0           0      1300
          1           1           0           0      1700
          1           0           0           0      2000
          1           1           0           0      1900
          1           0           0           0      1800
          1           1           0           0      1800
          1           0           0           0      1900
  Terms:
    'Intercept' (column 0)
    'day' (columns 1:4)
    'calories' (column 4)

In [18]:
fit.predict(X2)

array([[148.37692308],
       [145.89230769],
       [148.43846154],
       [146.10769231],
       [148.5       ],
       [146.04615385],
       [148.46923077],
       [146.07692308]])
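
An alternative, sketched here as an assumption rather than as the approach used above, is to fix the set of levels up front by storing the column as a pandas Categorical with an explicit category list. Patsy takes the levels from the categories, so two frames coded this way get the same columns even when one of them is missing some days. The frames `train` and `new` and the list `all_days` are hypothetical.

```python
import pandas as pd
import patsy as ps

# Hypothetical toy frames; `new` contains only two of the four days.
all_days = ["Mon", "Tue", "Wed", "Thu"]
train = pd.DataFrame({"day": ["Mon", "Tue", "Wed", "Thu"]})
new = pd.DataFrame({"day": ["Thu", "Mon", "Thu", "Mon"]})

# Declare the full set of levels on both frames.
for frame in (train, new):
    frame["day"] = pd.Categorical(frame["day"], categories=all_days)

X_train = ps.dmatrix("day", train)
X_new = ps.dmatrix("day", new)

print(X_train.design_info.column_names)
print(X_new.design_info.column_names)
```

Note that the dummy columns now follow the order of `all_days` rather than alphabetical order, so the training matrix and every later matrix must all be built from frames coded this way.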

**Additional things you can do with patsy formulas**

We can tell patsy not to include an intercept parameter by adding 0 to the formula:

In [19]:
formula="traffic_volume~0+day+dayofweek"
Ytrain1,Xtrain1=ps.dmatrices(formula, dfTrain1)
Xtrain1

DesignMatrix with shape (13523, 8)
  Columns:
    ['dayofweek[Fri]',
     'dayofweek[Mon]',
     'dayofweek[Sat]',
     'dayofweek[Sun]',
     'dayofweek[Thu]',
     'dayofweek[Tue]',
     'dayofweek[Wed]',
     'day']
  Terms:
    'dayofweek' (columns 0:7), 'day' (column 7)
  (to view full data, use np.asarray(this_obj))

We can include transformations of columns.

In [20]:
formula="traffic_volume~day+np.sin(2*np.pi*dayofyear/365)+np.cos(2*np.pi*dayofyear/365)"
Ytrain1,Xtrain1=ps.dmatrices(formula, dfTrain1)
Xtrain1

DesignMatrix with shape (13523, 4)
  Columns:
    ['Intercept',
     'day',
     'np.sin(2 * np.pi * dayofyear / 365)',
     'np.cos(2 * np.pi * dayofyear / 365)']
  Terms:
    'Intercept' (column 0)
    'day' (column 1)
    'np.sin(2 * np.pi * dayofyear / 365)' (column 2)
    'np.cos(2 * np.pi * dayofyear / 365)' (column 3)
  (to view full data, use np.asarray(this_obj))
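
To see why a sine/cosine pair is a natural transformation for a day-of-year variable, the sketch below evaluates the same kind of formula on a few made-up days (the frame `toy` and the column `d` are hypothetical). The first and last rows are one full period apart and should receive the same encoding.

```python
import numpy as np
import pandas as pd
import patsy as ps

# Hypothetical toy data: day 0 and day 365 are one full period apart.
toy = pd.DataFrame({"d": [0.0, 100.0, 200.0, 365.0]})

# Any function visible in the calling environment can be used in a formula.
X = ps.dmatrix("np.sin(2*np.pi*d/365) + np.cos(2*np.pi*d/365)", toy)
X_arr = np.asarray(X)

# Compare with the same transformation computed by hand.
sin_by_hand = np.sin(2 * np.pi * toy["d"] / 365)

print(X.design_info.column_names)
print(X_arr[0], X_arr[-1])   # first and last rows
```

A raw dayofyear column would instead place the first and last days of the year as far apart as possible.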

We can include interactions between categorical variables.

In [114]:
dfTrain1.columns

Index(['temp', 'rain_1h', 'clouds_all', 'weather_main', 'date_time',
       'traffic_volume', 'date', 'time', 'year', 'day1ofyear', 'dayofyear',
       'hour', 'weekday', 'day', 'lrain', 'snowind', 'holiday_ind', 'daysin',
       'daycos', 'dayofweek'],
      dtype='object')

In [116]:
formula="traffic_volume~day+weather_main*dayofweek"
Ytrain1,Xtrain1=ps.dmatrices(formula, dfTrain1)
Xtrain1

DesignMatrix with shape (13523, 64)
  Columns:
    ['Intercept',
     'weather_main[T.Clouds]',
     'weather_main[T.Drizzle]',
     'weather_main[T.Haze]',
     'weather_main[T.Mist]',
     'weather_main[T.Other]',
     'weather_main[T.Rain]',
     'weather_main[T.Snow]',
     'weather_main[T.Thunderstorm]',
     'dayofweek[T.Mon]',
     'dayofweek[T.Sat]',
     'dayofweek[T.Sun]',
     'dayofweek[T.Thu]',
     'dayofweek[T.Tue]',
     'dayofweek[T.Wed]',
     'weather_main[T.Clouds]:dayofweek[T.Mon]',
     'weather_main[T.Drizzle]:dayofweek[T.Mon]',
     'weather_main[T.Haze]:dayofweek[T.Mon]',
     'weather_main[T.Mist]:dayofweek[T.Mon]',
     'weather_main[T.Other]:dayofweek[T.Mon]',
     'weather_main[T.Rain]:dayofweek[T.Mon]',
     'weather_main[T.Snow]:dayofweek[T.Mon]',
     'weather_main[T.Thunderstorm]:dayofweek[T.Mon]',
     'weather_main[T.Clouds]:dayofweek[T.Sat]',
     'weather_main[T.Drizzle]:dayofweek[T.Sat]',
     'weather_main[T.Haze]:dayofweek[T.Sat]',
     'weather_
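
The column count follows from how `*` expands: `a*b` stands for `a + b + a:b`, that is, both main effects plus their interaction. With 9 weather levels and 7 weekdays this gives 1 + 8 + 6 + 8*6 = 63 columns, and the numeric day column brings the total to the 64 shown in the shape above. A minimal sketch on made-up data (the frame `toy` and the factors `a` and `b` are hypothetical):

```python
import pandas as pd
import patsy as ps

# Hypothetical toy data: a has 2 levels, b has 3 levels.
toy = pd.DataFrame({
    "a": ["a1", "a2", "a1", "a2", "a1", "a2"],
    "b": ["b1", "b1", "b2", "b2", "b3", "b3"],
})

# a*b is shorthand for a + b + a:b
X_star = ps.dmatrix("a*b", toy)
X_long = ps.dmatrix("a + b + a:b", toy)

print(X_star.design_info.column_names)
# 1 intercept + (2-1) + (3-1) main-effect dummies + (2-1)*(3-1) interaction columns
print(len(X_star.design_info.column_names))
```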

**Normalizing**

For some procedures you are advised to normalize your variables before fitting a model.
Again, whatever normalization you use should be defined on the training data first, and that same normalization should then be applied to the test data.

This is best explained in an example.

Q: If we normalize the rain_1h variable by subtracting its mean and dividing by its standard deviation, which mean and standard deviation should we use?
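
One way to make the question concrete is the sketch below, which follows the rule stated above and takes the mean and standard deviation from the training data only. It shows a manual version and a version using patsy's `standardize()` transform together with `build_design_matrices`. The frames `train` and `test` and their values are hypothetical stand-ins for the rain_1h column.

```python
import numpy as np
import pandas as pd
import patsy as ps

# Hypothetical toy values standing in for the rain_1h column.
train = pd.DataFrame({"rain_1h": [0.0, 1.0, 2.0, 3.0]})
test = pd.DataFrame({"rain_1h": [1.5, 4.0]})

# Manual version: the statistics come from the training data only.
mu = train["rain_1h"].mean()
sigma = train["rain_1h"].std(ddof=0)
test_manual = (test["rain_1h"] - mu) / sigma

# Patsy version: standardize() is a stateful transform, so the design info
# remembers the mean and standard deviation it saw in the training data.
X_train = ps.dmatrix("standardize(rain_1h)", train)
X_test = ps.build_design_matrices([X_train.design_info], test)[0]
test_patsy = np.asarray(X_test)[:, 1]   # column 0 is the intercept

print(test_manual.values)
print(test_patsy)
```

Calling `ps.dmatrix("standardize(rain_1h)", test)` directly would instead standardize with the test set's own mean and standard deviation, which is the mistake to avoid.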