# Supervised Machine Learning to Predict Price

## Goals

In this Jupyter Notebook, we'll predict price using supervised machine learning algorithms for continuous outcomes. This notebook highlights:

1. Data cleaning 
2. Feature selection
3. Model evaluation and selection

## Data Loading and Cleaning

In [93]:
import pandas as pd
import numpy as np

from sklearn import linear_model # Linear, ridge, and LASSO regression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor  # may need to run brew install libomp on MacOS

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error 

In [55]:
data_path = '/Users/danielchen/Desktop/GitHub/tutorials/Supervised Machine Learning Continuous Outcome/Data/sample.csv'
data = pd.read_csv(data_path)

In [56]:
data.head()

   loc1  loc2  para1  dow  para2    para3  para4   price
0     0     1      1  Mon    662   3000.0    3.8   73.49
1     9    99      1  Thu    340   2760.0    9.2   300.0
2     0     4      0  Mon     16   2700.0    3.0   130.0
3     4    40      1  Mon     17  12320.0    6.4   365.0
4     5    50      1  Thu    610   2117.0   10.8   357.5


For this exercise, we'll need to ensure that all of our data is numeric.

In [57]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   loc1    10000 non-null  object 
 1   loc2    10000 non-null  object 
 2   para1   10000 non-null  int64  
 3   dow     10000 non-null  object 
 4   para2   10000 non-null  int64  
 5   para3   10000 non-null  float64
 6   para4   10000 non-null  float64
 7   price   10000 non-null  float64
dtypes: float64(3), int64(2), object(3)
memory usage: 625.1+ KB


The columns `loc1`, `loc2`, and `dow` (day of week) have dtype `object`, which means they hold non-numeric values. In `loc1` and `loc2` these are stray entries among mostly numeric strings, while `dow` holds day names such as `Mon`.
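To see which values actually block the numeric conversion, one option is to coerce each column with `pd.to_numeric` and list whatever fails to parse. The sketch below runs on a small invented frame (`toy`) rather than the real data, and `non_numeric_values` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data; the bad values are invented
toy = pd.DataFrame({
    "loc1": ["0", "9", "S", "4"],
    "loc2": ["01", "99", "S6", "40"],
    "dow": ["Mon", "Thu", "Mon", "Fri"],
})


def non_numeric_values(series: pd.Series) -> list:
    """Return the unique values of `series` that pd.to_numeric cannot parse."""
    parsed = pd.to_numeric(series, errors="coerce")
    # Keep values that were present originally but turned into NaN after coercion
    return series[parsed.isna() & series.notna()].unique().tolist()


for col in ["loc1", "loc2", "dow"]:
    print(col, non_numeric_values(toy[col]))
```

Running the same helper on the real `data` would show exactly which entries are about to be dropped or coerced.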

In [58]:
# Keep only rows whose loc1 value contains at least one digit
# (.copy() avoids a SettingWithCopyWarning on the assignment below)
data = data[data['loc1'].str.contains(r'\d')].copy()

# Convert loc1 and loc2 from strings to numeric; unparseable values become NaN
to_numeric_cols = ['loc1', 'loc2']
data[to_numeric_cols] = data[to_numeric_cols].apply(pd.to_numeric, errors='coerce')

# Drop any NA values
data.dropna(inplace=True)
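Note that `str.contains` with a digit pattern keeps any value that has a digit somewhere in it, so mixed values survive the filter and are only removed later by the coerce-then-drop step. If a stricter filter is wanted, `str.fullmatch` is one alternative. The values below are invented for illustration.

```python
import pandas as pd

# Invented examples: a clean number, a mixed value, and a pure letter
values = pd.Series(["12", "S6", "S"])

# True wherever the string contains at least one digit
has_digit = values.str.contains(r"\d")

# True only where the whole string consists of digits
all_digits = values.str.fullmatch(r"\d+")

print(pd.DataFrame({"value": values, "has_digit": has_digit, "all_digits": all_digits}))
```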


Let's check how many observations remain and the data types of the `loc` columns.

In [59]:
data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 9993 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   loc1    9993 non-null   float64
 1   loc2    9993 non-null   float64
 2   para1   9993 non-null   int64  
 3   dow     9993 non-null   object 
 4   para2   9993 non-null   int64  
 5   para3   9993 non-null   float64
 6   para4   9993 non-null   float64
 7   price   9993 non-null   float64
dtypes: float64(5), int64(2), object(1)
memory usage: 702.6+ KB


It looks like the `dow` column still contains strings. Let's convert the categories into numeric values.

In [74]:
# Number the categories 1..n in order of first appearance
dow_categories = data['dow'].unique().tolist()
dow_numeric = list(range(1, len(dow_categories) + 1))
replacement_dict = dict(zip(dow_categories, dow_numeric))

data['dow'] = data['dow'].map(replacement_dict)
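The codes above depend on the order in which the days happen to appear in the data, and a linear model will read them as an ordered quantity. Two alternatives are sketched below on a toy column: a fixed calendar-order mapping and one-hot encoding. The three-letter day names and the `weekday_order` list are assumptions based on the `Mon` and `Thu` values visible in `data.head()`.

```python
import pandas as pd

# Toy column; the real `dow` values are assumed to be three-letter day names
dow = pd.Series(["Mon", "Thu", "Mon", "Fri", "Wed"], name="dow")

# Option 1: fixed calendar-order mapping, independent of row order
weekday_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
dow_codes = dow.map({day: i for i, day in enumerate(weekday_order, start=1)})

# Option 2: one-hot encoding, which imposes no ordering at all
dow_dummies = pd.get_dummies(dow, prefix="dow", dtype=int)

print(dow_codes.tolist())
print(dow_dummies)
```

One-hot encoding adds a column per day, which matters little for LASSO but can be wasteful for tree models that handle integer codes well.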

One final check on our data types:

In [75]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9993 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   loc1    9993 non-null   float64
 1   loc2    9993 non-null   float64
 2   para1   9993 non-null   int64  
 3   dow     9993 non-null   int64  
 4   para2   9993 non-null   int64  
 5   para3   9993 non-null   float64
 6   para4   9993 non-null   float64
 7   price   9993 non-null   float64
dtypes: float64(5), int64(3)
memory usage: 702.6 KB
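Instead of reading the `info()` output by eye, the same check can be done programmatically. This is a minimal sketch on an invented frame; `non_numeric_columns` is a hypothetical helper name.

```python
import pandas as pd


def non_numeric_columns(df: pd.DataFrame) -> list:
    """Return the names of columns whose dtype is not numeric."""
    return df.select_dtypes(exclude="number").columns.tolist()


# Invented frames: one fully numeric, one with a leftover string column
clean = pd.DataFrame({"loc1": [0.0, 9.0], "dow": [1, 2], "price": [73.49, 300.0]})
dirty = clean.assign(dow=["Mon", "Thu"])

print(non_numeric_columns(clean))
print(non_numeric_columns(dirty))
```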


## LASSO Regression Draft

In [110]:
grid_search = GridSearchCV(
    estimator=linear_model.Lasso(random_state=42),
    param_grid={'alpha':(10 ** np.linspace(start=-2, stop=2, num=100))},
    cv=10,
    scoring='neg_mean_squared_error'
)
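The `alpha` grid above is 100 values spaced evenly on a log scale between 0.01 and 100. As a quick check, the sketch below shows that NumPy's `np.logspace` produces the same grid in a single call.

```python
import numpy as np

# Grid as written in the notebook
alphas_linspace = 10 ** np.linspace(start=-2, stop=2, num=100)

# Equivalent grid using logspace (base 10 by default)
alphas_logspace = np.logspace(start=-2, stop=2, num=100)

print(np.allclose(alphas_linspace, alphas_logspace))
print(alphas_logspace[0], alphas_logspace[-1])
```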

Fit the `GridSearchCV` instance to our data to find the optimal `alpha`.

In [273]:
# Identify our features and targets - ensure they're numpy arrays
X = data.drop('price', axis=1).values
y = data['price'].values

# Standardize X so the L1 penalty treats every feature on the same scale
X = StandardScaler().fit_transform(X)

results = grid_search.fit(X, y)
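Standardizing all of `X` before cross-validation lets the validation folds influence the scaling, and tuning on the full data lets the test rows influence `alpha`. A common way around both, sketched here under assumptions, is to split first and put the scaler inside a pipeline so it is refit on each training fold. The data below is synthetic (`make_regression`), and the grid is shortened to 20 values and 5 folds to keep the example quick; none of these settings come from the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's X and y
X_demo, y_demo = make_regression(
    n_samples=300, n_features=7, n_informative=4, noise=10.0, random_state=0
)

# Split first so the test rows never touch scaling or tuning
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=23
)

# The scaler is refit on the training portion of every CV fold
pipeline = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))
search = GridSearchCV(
    estimator=pipeline,
    param_grid={"lasso__alpha": np.logspace(-2, 2, 20)},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_tr, y_tr)

print(search.best_params_)
# R-squared of the refit best pipeline on the held-out rows
print(search.best_estimator_.score(X_te, y_te))
```

The `lasso__alpha` key follows `make_pipeline`'s convention of naming each step after its lowercased class name.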

In [274]:
# Split our data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=23
)

In [275]:
# Instantiate LASSO with the alpha chosen by the grid search
lasso = linear_model.Lasso(alpha=results.best_params_['alpha'], random_state=42)

# X was already standardized before the split, so X_train and X_test share one scale.
# Re-fitting a scaler on X_train alone would leave X_test on a different scale.
lasso.fit(X_train, y_train)

# R-squared on training and testing data
train_rsquared = lasso.score(X_train, y_train)
test_rsquared = lasso.score(X_test, y_test)

# Predictions on training and testing data
ytrain_predictions = lasso.predict(X_train)
ytest_predictions = lasso.predict(X_test)

# Mean squared error on training and testing data
train_mse = mean_squared_error(y_train, ytrain_predictions)
test_mse = mean_squared_error(y_test, ytest_predictions)

In [276]:
print(train_mse, test_mse)

print(train_rsquared, test_rsquared)


36021.42807770397 27908.757555969856
0.5461507076194688 0.6016190079030486
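MSE is in squared price units, which makes it hard to read. Taking the square root gives RMSE, which is in the same units as `price`. The sketch below simply applies that to the two MSE values printed above.

```python
import math

# MSE values copied from the printed output above
train_mse = 36021.42807770397
test_mse = 27908.757555969856

# RMSE is the square root of MSE and is expressed in price units
train_rmse = math.sqrt(train_mse)
test_rmse = math.sqrt(test_mse)

print(f"Train RMSE: {train_rmse:.2f}")
print(f"Test RMSE:  {test_rmse:.2f}")
```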


In [255]:
lasso.score(X_train, y_train)


0.5461507076194687

In [None]:
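# Refit LASSO with the tuned alpha. X was standardized earlier, so the
# `normalize` argument (deprecated, then removed from scikit-learn) is not needed.
lasso = linear_model.Lasso(alpha=results.best_params_['alpha'], random_state=42)
lasso.fit(X_train, y_train)

train_rsquared = lasso.score(X_train, y_train)
test_rsquared = lasso.score(X_test, y_test)

ytrain_predictions = lasso.predict(X_train)
ytest_predictions = lasso.predict(X_test)

train_mse = mean_squared_error(y_train, ytrain_predictions)
test_mse = mean_squared_error(y_test, ytest_predictions)

# Collect the metrics in a one-row DataFrame. It is named lasso_results so the
# fitted grid search stored in `results` is not overwritten.
lasso_results = pd.DataFrame({
    'Lasso Train R-Squared': [train_rsquared],
    'Lasso Test R-Squared': [test_rsquared],
    'Lasso Train MSE': [train_mse],
    'Lasso Test MSE': [test_mse],
})
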


In [257]:
r2_score(y_true=y_train, y_pred=ytrain_predictions)

0.5461507076194687

In [258]:
r2_score(y_true=y_test, y_pred=ytest_predictions)

-73358.30404486162
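A test R-squared this far below zero usually means the predictions are on a completely different scale from the targets. One possible cause, which is an assumption and not something the notebook confirms, is scoring a model trained on standardized features against test features that were not transformed with the same scaler. The synthetic example below reproduces that effect.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features with a large mean and spread, plus a linear target
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=1000.0, scale=200.0, size=(500, 3))
y_toy = X_raw @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=5.0, size=500)

X_tr_raw, X_te_raw, y_tr, y_te = train_test_split(
    X_raw, y_toy, test_size=0.3, random_state=23
)

# Fit the scaler on the training rows only, then train on the scaled features
scaler = StandardScaler().fit(X_tr_raw)
model = Lasso(alpha=0.1).fit(scaler.transform(X_tr_raw), y_tr)

# Correct: the test set goes through the same fitted scaler
r2_matched = r2_score(y_te, model.predict(scaler.transform(X_te_raw)))

# Mismatch: the test set is passed in unscaled
r2_mismatched = r2_score(y_te, model.predict(X_te_raw))

print(r2_matched, r2_mismatched)
```

Keeping the scaler and the model in one pipeline object makes this kind of mismatch much harder to introduce.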

In [260]:
lasso.coef_

array([ -0.        ,  -1.17454361,  18.60135783,  -5.24267703,
       142.90073129,   0.        , 137.08772407])
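LASSO shrinks some coefficients to exactly zero, so the array above doubles as a feature selection result. The sketch below pairs each coefficient with a column name, assuming the coefficient order follows the column order of `data.drop('price', axis=1)` as listed by `data.info()`.

```python
import numpy as np

# Column order assumed to match data.drop('price', axis=1)
feature_names = ["loc1", "loc2", "para1", "dow", "para2", "para3", "para4"]

# Coefficients copied from the lasso.coef_ output above
coefficients = np.array([
    -0.0, -1.17454361, 18.60135783, -5.24267703,
    142.90073129, 0.0, 137.08772407,
])

# Features with a non-zero coefficient are the ones LASSO kept
kept = [name for name, coef in zip(feature_names, coefficients) if coef != 0]
dropped = [name for name, coef in zip(feature_names, coefficients) if coef == 0]

print("kept:", kept)
print("dropped:", dropped)
```

Because the features were standardized, the coefficient magnitudes are comparable, which suggests `para2` and `para4` carry most of the signal in this fit.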

{'alpha': 3.1992671377973845,
 'copy_X': True,
 'fit_intercept': True,
 'max_iter': 1000,
 'normalize': 'deprecated',
 'positive': False,
 'precompute': False,
 'random_state': 42,
 'selection': 'cyclic',
 'tol': 0.0001,
 'warm_start': False}

In [272]:
linear_reg = linear_model.LinearRegression()

linear_reg.fit(X_train, y_train)

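# In a notebook cell only the last expression is displayed,
# so the output below is the test R-squared
linear_reg.score(X_train, y_train)
linear_reg.score(X_test, y_test)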

-89022.78635875523
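Since the goal is to compare several models, it can help to compute the same train and test metrics for each one through a single helper, so no score is silently dropped. This is a minimal sketch on synthetic data; `evaluate` is a hypothetical helper name and is not part of the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, y_train, X_test, y_test) -> dict:
    """Fit `model` on the training data and return R-squared and MSE for both splits."""
    model.fit(X_train, y_train)
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    return {
        "train_r2": r2_score(y_train, train_pred),
        "test_r2": r2_score(y_test, test_pred),
        "train_mse": mean_squared_error(y_train, train_pred),
        "test_mse": mean_squared_error(y_test, test_pred),
    }


# Synthetic stand-in for the notebook's data
X_demo, y_demo = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=23)

scores = evaluate(LinearRegression(), X_tr, y_tr, X_te, y_te)
print(scores)
```

The same call works for any estimator with `fit` and `predict`, so the dictionaries can be stacked into one comparison table.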