## Resampling Methods
* the two most commonly used resampling methods are:

(1) cross-validation

(2) bootstrap
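
As a rough illustration of the second method: the bootstrap redraws a sample with replacement many times and recomputes a statistic on each redraw. The sketch below is a minimal example on synthetic data (an assumption, not the housing dataset); `n_boot` and the variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# synthetic sample (assumption for illustration; not the hprice3 data)
sample = rng.normal(loc=5.0, scale=2.0, size=200)

n_boot = 1000  # number of bootstrap resamples (hypothetical choice)
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # draw n observations with replacement from the original sample
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[b] = resample.mean()

# bootstrap estimate of the standard error of the sample mean
boot_se = boot_means.std(ddof=1)

# textbook formula s / sqrt(n), for comparison
analytic_se = sample.std(ddof=1) / np.sqrt(sample.size)

print(f"bootstrap SE = {boot_se:.4f}, analytic SE = {analytic_se:.4f}")
```

The two standard errors should come out close to each other. The bootstrap becomes useful when no simple formula like the second one exists.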

In [42]:
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype

# download dataset to use throughout
hprice3 = pd.read_stata('http://fmwww.bc.edu/ec-p/data/wooldridge/hprice3.dta')

# pre-processing dataset
## select the columns of interest (copy so later assignments do not raise SettingWithCopyWarning)
col = ['lprice','lland','larea','nbh','rooms','cbd','y81','ldist','baths','age','agesq']
df = hprice3[col].copy()

## change y81, nbh and rooms to integers
df[['y81','nbh','rooms']] = df[['y81','nbh','rooms']].astype(int)

## change y81, nbh, rooms into categorical
df['y81'] = df['y81'].astype('category')
df['nbh'] = df['nbh'].astype(CategoricalDtype(ordered=False))
df['rooms'] = df['rooms'].astype(CategoricalDtype(ordered=True))

# view columns types of pre-processed dataset
df.dtypes

lprice     float32
lland      float32
larea      float32
nbh       category
rooms     category
cbd        float32
y81       category
ldist      float32
baths      float32
age        float32
agesq      float32
dtype: object

### Cross-Validation

* cross-validation methods are used to do two things:

(1) Model Assessment: process of evaluating a model's performance

(2) Model Selection: process of selecting the proper level of flexibility for a model

* since models are 'trained' on a training data set, their error on that same data tends to understate how well they predict new observations

* since the validation set was not used to fit the model, this set of observations can be used to assess the model's performance; comparing that performance across candidate models is what allows us to do model selection
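
The two points above can be sketched on a toy problem: training error keeps falling as flexibility grows, while validation error is what identifies a reasonable level of flexibility. This is a minimal sketch that assumes synthetic data and uses polynomial degree as the flexibility knob. The helper `mse` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic data (assumption): a quadratic signal plus noise
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)

# hold out a random 25% of the observations as the validation set
idx = rng.permutation(x.size)
val_idx, train_idx = idx[:50], idx[50:]

def mse(coefs, xs, ys):
    """Mean squared error of a fitted polynomial on (xs, ys)."""
    return np.mean((np.polyval(coefs, xs) - ys) ** 2)

results = {}
for degree in range(1, 9):  # higher degree = more flexible model
    coefs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    results[degree] = (
        mse(coefs, x[train_idx], y[train_idx]),  # training MSE
        mse(coefs, x[val_idx], y[val_idx]),      # validation MSE
    )

# model selection: pick the degree with the lowest validation MSE
best_degree = min(results, key=lambda d: results[d][1])

for degree, (train_mse, val_mse) in results.items():
    print(f"degree {degree}: train MSE = {train_mse:.3f}, validation MSE = {val_mse:.3f}")
print("selected degree:", best_degree)
```

Training MSE never increases with the degree, so it cannot be used to choose between these models. The validation MSE can.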

In [53]:
# continue preprocessing: convert categorical variables into dummy variables
df1 = pd.get_dummies(df, columns=['y81', 'nbh','rooms'], drop_first = True)

# get list of columns
df1.columns

Index(['lprice', 'lland', 'larea', 'cbd', 'ldist', 'baths', 'age', 'agesq',
       'y81_1', 'nbh_1', 'nbh_2', 'nbh_3', 'nbh_4', 'nbh_5', 'nbh_6',
       'rooms_5', 'rooms_6', 'rooms_7', 'rooms_8', 'rooms_9', 'rooms_10'],
      dtype='object')

#### Validation Set Approach

![](img\validation_set_approach.png)

* the set of n observations is randomly split into a training set (shown in blue) and a validation set (shown in beige), each containing different observations

* the statistical learning method is fit on the training set

* its performance is evaluated on the validation set

* the random split can be repeated many (say j) times, so that each model assessment measure (RMSE, R2, Cp, BIC, AIC) is calculated j times
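
The repeated-split idea in the last bullet can be sketched as follows. This is a minimal example under stated assumptions: the data are synthetic stand-ins for the housing predictors, `j = 20` is an arbitrary choice, and only RMSE is tracked.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# synthetic regression data (assumption; stands in for the housing predictors)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=300)

j = 20  # number of random splits (hypothetical choice)
rmse = np.empty(j)
for i in range(j):
    # a different random_state gives a different random split each time
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=i
    )
    fit = LinearRegression().fit(X_tr, y_tr)
    rmse[i] = np.sqrt(mean_squared_error(y_val, fit.predict(X_val)))

# the spread across splits shows how much the estimate depends on the split
print(f"mean RMSE = {rmse.mean():.3f}, std across splits = {rmse.std(ddof=1):.3f}")
```

The standard deviation across splits gives a sense of how variable a single validation-set estimate can be.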

In [77]:
from sklearn.model_selection import train_test_split
import patsy
from sklearn.linear_model import LinearRegression

# separate predictors and response while preserving datatypes
# (done with pandas rather than patsy's dmatrices)
X = df1.drop(columns='lprice')
y = df1['lprice']

# do validation set split (80% training, 20% validation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit a linear regression on the training set
# (lprice is continuous; LogisticRegression is a classifier and raises
# "ValueError: Unknown label type: 'continuous'" on such a target)
model = LinearRegression()
model.fit(X_train, y_train)

# R^2 on the validation set, using the regression fitted on the training set
model.score(X_test, y_test)

In [None]:
# convert dataframe into numpy array
df1_array = df1.to_numpy()

# create specification using dataframe column object
predictors = " + ".join(list(df1.columns[1:15]))
f = " ~ ".join(['lprice',predictors])

# predict the held-out set with the model fitted on the training set
predict = model.predict(X_test)
