# Supervised Machine Learning to Predict Price

## Goals

In this Jupyter Notebook, we'll predict price using supervised machine learning algorithms for continuous outcomes. This notebook highlights:

1. Data cleaning 
2. Feature selection
3. Model evaluation and selection

## Data Loading and Cleaning

In [2]:
import pandas as pd
import numpy as np

from sklearn import linear_model # Linear, ridge, and LASSO regression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor  # may need to run `brew install libomp` on macOS

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error 

In [3]:
data_path = '/Users/danielchen/Desktop/GitHub/tutorials/Supervised Machine Learning Continuous Outcome/Data/sample.csv'
data = pd.read_csv(data_path)

In [4]:
data.head()

Unnamed: 0,loc1,loc2,para1,dow,para2,para3,para4,price
0,0,1,1,Mon,662,3000.0,3.8,73.49
1,9,99,1,Thu,340,2760.0,9.2,300.0
2,0,4,0,Mon,16,2700.0,3.0,130.0
3,4,40,1,Mon,17,12320.0,6.4,365.0
4,5,50,1,Thu,610,2117.0,10.8,357.5


For this exercise, we'll need to ensure that all of our data is numeric.

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   loc1    10000 non-null  object 
 1   loc2    10000 non-null  object 
 2   para1   10000 non-null  int64  
 3   dow     10000 non-null  object 
 4   para2   10000 non-null  int64  
 5   para3   10000 non-null  float64
 6   para4   10000 non-null  float64
 7   price   10000 non-null  float64
dtypes: float64(3), int64(2), object(3)
memory usage: 625.1+ KB


The `loc1`, `loc2`, and `dow` (day of week) columns have dtype `object`, which means they hold strings rather than numbers. `dow` is text by design, while `loc1` and `loc2` appear to contain some non-numeric entries.

In [6]:
# We'll only keep rows that have numeric values in the loc1 column.
# The raw string avoids an invalid-escape warning; .copy() avoids a SettingWithCopyWarning below.
data = data[data['loc1'].str.contains(r'\d')].copy()

# Convert loc1 and loc2 from strings to numeric; values that can't be parsed become NaN
to_numeric_cols = ['loc1', 'loc2']
data[to_numeric_cols] = data[to_numeric_cols].apply(pd.to_numeric, errors='coerce', axis=1)

# Drop any rows with NA values
data.dropna(inplace=True)
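
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from `sample.csv`.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking string-typed location columns
toy = pd.DataFrame({
    'loc1': ['0', '9', 'S', '4'],
    'loc2': ['1', '99', '6', '4T'],
})

# Keep rows whose loc1 contains at least one digit (drops the 'S' row)
toy = toy[toy['loc1'].str.contains(r'\d')].copy()

# Strings that can't be parsed, such as '4T', become NaN instead of raising an error
toy[['loc1', 'loc2']] = toy[['loc1', 'loc2']].apply(pd.to_numeric, errors='coerce')

# Dropping NaN rows removes the row that held '4T'
cleaned = toy.dropna()
print(cleaned)
```

Only the first two rows survive: one row is removed by the digit filter and another by `dropna()` after coercion.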


Let's check the number of observations that are left and the data types of the `loc` columns.

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9993 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   loc1    9993 non-null   float64
 1   loc2    9993 non-null   float64
 2   para1   9993 non-null   int64  
 3   dow     9993 non-null   object 
 4   para2   9993 non-null   int64  
 5   para3   9993 non-null   float64
 6   para4   9993 non-null   float64
 7   price   9993 non-null   float64
dtypes: float64(5), int64(2), object(1)
memory usage: 702.6+ KB


It looks like the `dow` column still contains strings. Let's convert the categories into numeric values.

In [8]:
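# Assign each day-of-week label an integer code, in order of first appearance
dow_categories = data['dow'].unique().tolist()
dow_numeric = list(range(1, len(dow_categories) + 1))
replacement_dict = dict(zip(dow_categories, dow_numeric))

# Every label is a key in the dict, so map() yields a fully numeric column
data['dow'] = data['dow'].map(replacement_dict)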
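
As a small illustration of what this encoding produces, the sketch below applies the same idea to a toy series (hypothetical values). Note that the codes follow the order in which labels first appear, not calendar order.

```python
import pandas as pd

# Toy day-of-week column (hypothetical values)
dow = pd.Series(['Mon', 'Thu', 'Mon', 'Wed', 'Thu'])

# unique() preserves order of first appearance
categories = dow.unique().tolist()
codes = dict(zip(categories, range(1, len(categories) + 1)))

encoded = dow.map(codes)
print(codes)             # {'Mon': 1, 'Thu': 2, 'Wed': 3}
print(encoded.tolist())  # [1, 2, 1, 3, 2]
```

Because the integers carry an implied ordering, a linear model will treat `Wed` as "larger" than `Thu` here. That is worth keeping in mind when interpreting coefficients for this column.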

One final check on our data types:

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9993 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   loc1    9993 non-null   float64
 1   loc2    9993 non-null   float64
 2   para1   9993 non-null   int64  
 3   dow     9993 non-null   int64  
 4   para2   9993 non-null   int64  
 5   para3   9993 non-null   float64
 6   para4   9993 non-null   float64
 7   price   9993 non-null   float64
dtypes: float64(5), int64(3)
memory usage: 702.6 KB


## LASSO Regression Draft

In [110]:
grid_search = GridSearchCV(
    estimator=linear_model.Lasso(random_state=42),
    param_grid={'alpha':(10 ** np.linspace(start=-2, stop=2, num=100))},
    cv=10,
    scoring='neg_mean_squared_error'
)

Fit our instance of `GridSearchCV` to our data to find the optimal alpha.

In [273]:
# Identify our features and targets - ensure they're numpy arrays
X = data.drop('price', axis=1).values
y = data['price'].values

# Standardize X so the L1 penalty treats every feature on the same scale
X = StandardScaler().fit_transform(X)

results = grid_search.fit(X, y)

In [274]:
# Split our data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=23
)

In [319]:
# Instantiate LASSO with the alpha selected by the grid search
lasso = linear_model.Lasso(alpha=results.best_params_['alpha'], random_state=42)

# Fit on our training data. X was already standardized before the split,
# so re-scaling X_train alone would put it on a different scale than X_test.
lasso.fit(X_train, y_train)

# R-squared on training and testing data
train_rsquared = lasso.score(X_train, y_train)
test_rsquared = lasso.score(X_test, y_test)

# Predictions on training and testing data
ytrain_predictions = lasso.predict(X_train)
ytest_predictions = lasso.predict(X_test)

# Mean squared error on training and testing data
train_mse = mean_squared_error(y_train, ytrain_predictions)
test_mse = mean_squared_error(y_test, ytest_predictions)
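
The draft above scales the full dataset before splitting, so the scaler sees the test rows. One possible way to avoid that, sketched here under assumptions, is to wrap the scaler and the model in a pipeline so that scaling is re-fit inside each cross-validation fold. The data below is synthetic and the names `X_demo`, `y_demo`, `pipe`, and `search` are illustrative only, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: this is NOT the notebook's sample.csv)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(300, 4))
y_demo = 4.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=1.0, size=300)

# Split first, so the test rows never influence scaling or alpha selection
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=23
)

# The scaler is fit only on the training portion of each CV fold
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))

search = GridSearchCV(
    estimator=pipe,
    # make_pipeline names the step after the lowercased class name: 'lasso'
    param_grid={'lasso__alpha': np.logspace(-2, 2, 20)},
    cv=5,
    scoring='neg_mean_squared_error',
)
search.fit(X_tr, y_tr)

best_alpha = search.best_params_['lasso__alpha']
# best_estimator_ is the pipeline refit on all of X_tr; its score() returns R-squared
test_r2 = search.best_estimator_.score(X_te, y_te)
print(best_alpha, test_r2)
```

With this layout, the same fitted scaler is applied to both training and testing data, which keeps the two on a consistent scale.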

## Identifying Features

For this walkthrough, we'll identify the best features based on their correlation with our outcome variable `price`. This isn't the only method for feature selection: LASSO regression, variance inflation factor (VIF) scores, and subset selection are alternatives. We justify using correlation here because our features should at least be related to our outcome variable.

In [10]:
correlations = (
    data.corr()['price']
    .abs()
    .to_frame()
    .sort_values(by='price', ascending=False)
    .iloc[1:]  # drop the first row: price's correlation with itself
)
correlations


Unnamed: 0,price
para2,0.551222
para4,0.517614
para3,0.356949
para1,0.074555
loc1,0.044079
loc2,0.043543
dow,0.001043


It looks like `para2`, `para4`, and `para3` are the most related to `price`, so we'll use those for our model.

In [11]:
features = correlations.nlargest(n=3, columns='price').index.tolist()
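
For comparison, LASSO was mentioned above as an alternative way to select features. The sketch below shows the general idea on synthetic data: features whose coefficients are shrunk to exactly zero are dropped. The data, the column names, and the choice of `alpha=0.5` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption): only features 'a' and 'b' actually drive the target
rng = np.random.default_rng(0)
names = ['a', 'b', 'c', 'd']
X_demo = rng.normal(size=(500, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

# A fairly strong L1 penalty pushes the irrelevant coefficients to exactly zero
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Keep the features with non-zero coefficients
selected = [name for name, coef in zip(names, lasso_demo.coef_) if coef != 0]
print(selected)
```

In practice the penalty strength would be tuned (for example with cross-validation) rather than fixed by hand.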

## Determining an Algorithm

Now that our data is in the right format (i.e., we've removed all non-numeric and missing data), we need to identify the optimal algorithm to use. Here, optimal means the model that makes the most accurate predictions, i.e., the one that minimizes the mean squared error (the average of the squared differences between the actual values and our predictions).
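
As a quick illustration of the metric, the sketch below computes the mean squared error by hand on three made-up values and compares it with scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted prices, for illustration only
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Squared errors are 100, 100, and 900; their mean is 1100 / 3
manual_mse = np.mean((y_true - y_pred) ** 2)
sklearn_mse = mean_squared_error(y_true, y_pred)

print(manual_mse, sklearn_mse)
```

Because errors are squared, the single 30-unit miss contributes far more to the total than the two 10-unit misses combined.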

In [15]:
# Create a list of tuples specifying the model name and an instance of the model
algorithms = [
    ('Linear Regression', linear_model.LinearRegression()),
    ('Ridge', linear_model.Ridge(random_state=42)),
    ('LASSO', linear_model.Lasso(random_state=42)),
    ('Random Forest', RandomForestRegressor(random_state=42)),
    ('XGBoost', XGBRegressor(random_state=42))
]

# Create a list of measures by which we will evaluate each model's performance
measures = ['r2', 'neg_mean_squared_error']


In [16]:
# Identify features and target
X = data[features].values
y = data['price'].values

# Standardize the features (zero mean, unit variance)
X = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

In [20]:
test_harness_dfs = []
for name, algorithm in algorithms:
    results = cross_validate(
        estimator=algorithm,
        X=X_train,
        y=y_train,
        cv=10,
        scoring=measures
    )
    test_harness_df = pd.DataFrame({
        'algorithm': [name],
        'r2': [results['test_r2'].mean()],
        'mse': [results['test_neg_mean_squared_error'].mean() * -1]
    })
    test_harness_dfs.append(test_harness_df)
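
The loop multiplies the MSE scores by -1 because scikit-learn scorers follow a "higher is better" convention, so MSE is reported as a negative number. A minimal sketch on synthetic data (an assumption, not the notebook's dataset) shows the shape of what `cross_validate` returns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression data, for illustration only
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

cv_results = cross_validate(
    estimator=LinearRegression(),
    X=X_demo,
    y=y_demo,
    cv=10,
    scoring=['r2', 'neg_mean_squared_error'],
)

# Each metric is stored under 'test_<scorer name>' with one value per fold
mean_r2 = cv_results['test_r2'].mean()
# Flip the sign to recover a conventional, positive MSE
mean_mse = -cv_results['test_neg_mean_squared_error'].mean()
print(mean_r2, mean_mse)
```

The result is a dictionary of arrays, which is why the loop above calls `.mean()` on each entry before building the summary row.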


In [21]:
pd.concat(test_harness_dfs)

Unnamed: 0,algorithm,r2,mse
0,Linear Regression,0.553867,33487.520274
0,Ridge,0.553869,33487.497792
0,LASSO,0.554053,33480.361398
0,Random Forest,0.629002,27607.055472
0,XGBoost,0.613304,28841.67726


In [None]:
algorithms = [
    ('Linear Regression', linear_model.LinearRegression()),
    ('Ridge', linear_model.Ridge(random_state=42)),
    ('LASSO', linear_model.Lasso(random_state=42)),
    ('Random Forest', RandomForestRegressor(random_state=42)),
    ('XGBoost', XGBRegressor(random_state=42)),
    ('LGBM', LGBMRegressor(random_state=42))
]
