In [1]:
# Imports 
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_graphviz
from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_absolute_error

import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)


from scipy import stats

import subprocess
import datetime

# The Problem

The problem is: how do I know whether a wine is good?  
Specifically, when I'm in a shop looking for something to buy. 

There are a few ways I could determine this. 

# The Solution(s)?

1. I could do what I did with the other wine-related datasets and train a tree model to find the ~3-5 most important factors and the ranges they should lie in. I think this will work, and I intend to do it, but it tends to end up focussing on one particular ordered set of questions with the same answers (country A, variety B, price C, etc.), which tells me nothing about good wines from country F (and I assume that country F must have some good wine or other). Additionally, several of the features are nominal, but label encoding represents them as ordered integers. A decision tree splits on thresholds over those integers, so a single split can only group categories that happen to sit next to each other in the encoding, and this will limit the usefulness of the trained models regardless of how accurate they are. 

2. I could use some sort of relatively simple model (e.g. linear regression) and try to determine from the coefficients which features are important and which values are better. Reading feature relevance off coefficients is unreliable, though: the size of a coefficient depends on the feature's scale and on its correlation with other features, and a coefficient on a label-encoded nominal feature has no meaningful interpretation. I could use some sort of feature selection to further reduce the number of features I have and make the result easier to understand. Whether that would help is unclear, and I suspect it is unlikely. Then again, [scikit-learn's `SelectFromModel`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html "scikit-learn docs") exists to do just that - automatically - so maybe that's worth a look. This still doesn't actually tell me what _values_ of features mean good wine though, so I probably won't do this. 

3. I could train an accurate, black box model that predicts a wine's score, then put that model into a phone app, and use it to give me a likely score (or range of scores) when I manually enter the features of the wine I am looking at on the shelf. The more data I enter, the more accurate the score. This seems like an approach that will provide an actual, physical solution to my problem, but it will require a lot more work. 
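
The label encoding concern in option 1 can be illustrated with a small, self-contained sketch. The data below is invented for illustration (three countries, nine scores) and is not taken from the wine dataset. It compares the best single threshold split over integer codes with a single "is it country B?" split of the kind a one-hot column would allow, using the mean absolute error around each side's median.

```python
from statistics import median

# Toy data (hypothetical): three countries, label-encoded alphabetically.
# Country "B" (code 1) scores low; "A" (code 0) and "C" (code 2) score high.
codes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
points = [92, 93, 94, 82, 83, 84, 92, 93, 94]


def mae_of_split(left, right):
    """Mean absolute error when each side predicts its own median."""
    total = 0.0
    for side in (left, right):
        if side:
            centre = median(side)
            total += sum(abs(value - centre) for value in side)
    return total / (len(left) + len(right))


# Label encoding: a tree node can only ask "code <= threshold?".
threshold_maes = {}
for threshold in (0.5, 1.5):
    left = [p for c, p in zip(codes, points) if c <= threshold]
    right = [p for c, p in zip(codes, points) if c > threshold]
    threshold_maes[threshold] = mae_of_split(left, right)

# One-hot encoding: a node can ask "is the country B?" directly.
is_b = [p for c, p in zip(codes, points) if c == 1]
not_b = [p for c, p in zip(codes, points) if c != 1]
one_hot_mae = mae_of_split(is_b, not_b)

best_threshold_mae = min(threshold_maes.values())
print(best_threshold_mae, one_hot_mae)
```

In this toy case no single threshold can separate the middle category from its neighbours, so the one-hot split has the lower error. A deeper tree can still isolate "B" with two consecutive threshold splits, so the cost of label encoding is extra depth rather than an absolute barrier.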

# The Plan

The first thing I'm going to do is the easiest - solution 1: decision tree models.  
Of course the _zeroth_ thing I'm going to do is load and preprocess the data. 

### Modifications from the EDA
The EDA was about getting to know the data: finding out what was _feasibly physically usable_ and seeing how it could be prepared, _without prejudice_ about what would later turn out to be useful (otherwise my assumptions and biases could cause me to discard or misinterpret data). Now that I come to the actual task, I must reassess what I've learned and make new decisions based on _what difference they make to the objective_, so that I keep moving towards the value I'm working to create. 

At this point I have 10 features: 
1. country
2. price
3. province
4. region_1
5. region_2
6. taster_name
7. taster_twitter_handle
8. variety
9. winery
10. vintage

I am trying to find a way to determine the best (or a decent choice of) wine to buy when standing in a shop. That means that the only information I have available to me is what's on the bottle - if I was the type to google it or if I had an interest in learning about wine theory, I wouldn't need to be doing this project - so I should only use features that represent data I can get from the bottle. 

6. ~~taster_name~~
7. ~~taster_twitter_handle~~

That means taster_name and taster_twitter_handle are out because - again - if I was going to put in effort to learn wine theory and read reviews, I wouldn't need to be putting my effort towards this project. 

The country is always available, as is the price (though I will have to be mindful to remember currency conversions when not in the USA). The province is usually available, as is the data that makes up region_1 and region_2. The variety, winery, and vintage are also usually available for all but the cheapest of wines (the mixes and generic 'red wine'). 

So I end up with 8 features, which I will now begin working with: 
1. country
2. price
3. province
4. region_1
5. region_2
6. variety
7. winery
8. vintage

# Data Loading and Preparation

In [2]:
# Initial data loading 
data = pd.read_csv('../input/wine-reviews/winemag-data-130k-v2.csv', index_col=0)
target_data = data['points']
feature_data = data.drop('points', axis=1)

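# Creation of new feature 'vintage'
# Step 1: keep the last 19xx/20xx year found in each title.
# Step 2: for titles with no such year, strip letters, other non-digit
#         characters and stray runs of 1-3 digits.
titles = feature_data.title.copy(deep=True)
years = titles.replace(r'.*((19|20)[0-9]{2}).*', r'\1', regex=True).replace(r'[A-Za-z]|[\D]|\s|(?:(?<!\d)\d{1}(?!\d))|(?:(?<!\d)\d{2}(?!\d))|(?:(?<!\d)\d{3}(?!\d))', '', regex=True)
# Step 3: blank out the remaining four-digit numbers that are not vintages
filtered_years = years.replace(r'(1503|1607|1821|1827|1847|1868|1872|1882|1887|2067)', '', regex=True)
vintage = filtered_years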

# Removal of unusable features (I can't find the reviewer's description 
# on the bottle, can I?) and addition of new feature 'vintage'
raw_reduced_features = feature_data.drop(labels=['description', 'designation', 'title', 'taster_name', 'taster_twitter_handle'], axis=1)
raw_reduced_features['vintage'] = vintage.copy(deep=False).replace('', '0').astype(int)
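
The vintage extraction above can also be sketched on plain strings with the standard `re` module. This is a simplified variant, not the exact logic of the cell above: the function name `extract_vintage`, the `max_year` cutoff, and the example titles are all assumptions made for illustration. Like the notebook, it returns 0 when no year is found.

```python
import re

# A plausible vintage: a four-digit number starting with 19 or 20
# that is not part of a longer run of digits.
YEAR_PATTERN = re.compile(r'(?<!\d)((?:19|20)\d{2})(?!\d)')


def extract_vintage(title, max_year=2020):
    """Return the last plausible year in the title, or 0 if there is none.

    max_year is a hypothetical cutoff used to reject future years.
    """
    years = [int(year) for year in YEAR_PATTERN.findall(title)]
    years = [year for year in years if year <= max_year]
    return years[-1] if years else 0


# Invented example titles, not rows from the dataset.
print(extract_vintage('Example Estate 2013 Reserve Red (Some Region)'))
print(extract_vintage('Example Cellars Non-Vintage Brut'))
```

Requiring a 19xx/20xx prefix and a digit boundary means numbers such as 1882 in a winery name never match, which removes the need for a hand-maintained exclusion list in this sketch.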

In [3]:
# Separate out all the features so I can transform them individually, 
# and then put them back together into different datasets
country = raw_reduced_features.country.copy(deep=False)
price = raw_reduced_features.price.copy(deep=False)
province = raw_reduced_features.province.copy(deep=False)
region_1 = raw_reduced_features.region_1.copy(deep=False)
region_2 = raw_reduced_features.region_2.copy(deep=False)
variety = raw_reduced_features.variety.copy(deep=False)
winery = raw_reduced_features.winery.copy(deep=False)
vintage = raw_reduced_features.vintage.copy(deep=False)

## Dealing with null values and preparing features for encoding

In [4]:
# Set all null data to either 0 or the string 'NaN' so that the encoders can process them 
country[country.isna()] = 'NaN'
price[price.isna()] = 0
province[province.isna()] = 'NaN'
region_1[region_1.isna()] = 'NaN'
region_2[region_2.isna()] = 'NaN'
variety[variety.isna()] = 'NaN'
# winery has no missing values
# vintage has already had its missing data replaced with 0

In [5]:
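# NOTE: the per-feature Series above are shallow copies, so filling their nulls
# also filled raw_reduced_features. dropna() therefore finds nothing to drop here
# and this "non-null" dataset keeps every row (see the row count printed below).
reduced_features_non_null_data = raw_reduced_features.dropna(subset=['country', 'price', 'province', 'region_1', 'region_2', 'variety', 'winery'], axis='index', how='any').copy(deep=False)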
print('Number of rows left with no null values after dropping region_2: {}'.format(reduced_features_non_null_data.shape[0]))

Number of rows left with no null values after dropping region_2: 129971
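
The row count above equals the size of the whole dataset, which suggests that no rows were dropped. A likely cause, assuming the pandas version in use lets a shallow copy share data with its parent, is that filling nulls in the `copy(deep=False)` Series also filled `raw_reduced_features`. The sketch below uses a hypothetical three-row frame to show the deep-copy alternative, where the fill leaves the original untouched and `dropna` still has something to remove.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for raw_reduced_features.
frame = pd.DataFrame({'region_2': ['Sonoma', np.nan, np.nan]})

# A deep copy owns its data, so filling it cannot touch `frame`.
region_2_filled = frame['region_2'].copy(deep=True)
region_2_filled[region_2_filled.isna()] = 'NaN'

# The original frame still has its nulls, so dropna removes two rows.
rows_without_nulls = frame.dropna(subset=['region_2']).shape[0]
print(rows_without_nulls)  # prints 1
```

With independent copies, the "with nulls" and "non-null" datasets would actually differ, which is the comparison the two sets of encoders are built for.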


In [6]:
# Separate out all the features so I can transform them individually, 
# and then put them back together into different datasets
country_non_null = reduced_features_non_null_data.country
price_non_null = reduced_features_non_null_data.price
province_non_null = reduced_features_non_null_data.province
region_1_non_null = reduced_features_non_null_data.region_1
region_2_non_null = reduced_features_non_null_data.region_2
variety_non_null = reduced_features_non_null_data.variety
winery_non_null = reduced_features_non_null_data.winery
vintage_non_null = reduced_features_non_null_data.vintage

## Ordinal numerical encoding

In [7]:
# country
# nulls
label_encoder_country = LabelEncoder()
label_encoder_country.fit(country.astype(str))
encoded_countries = label_encoder_country.transform(country.astype(str))

# non-nulls
label_encoder_non_null_country = LabelEncoder()
label_encoder_non_null_country.fit(country_non_null.astype(str))
encoded_non_null_countries = label_encoder_non_null_country.transform(country_non_null.astype(str))

In [8]:
# province
# nulls
label_encoder_province = LabelEncoder()
label_encoder_province.fit(province.astype(str))
encoded_provinces = label_encoder_province.transform(province.astype(str))

# non-nulls
label_encoder_non_null_province = LabelEncoder()
label_encoder_non_null_province.fit(province_non_null.astype(str))
encoded_non_null_provinces = label_encoder_non_null_province.transform(province_non_null.astype(str))

In [9]:
# region_1
# nulls
label_encoder_region_1 = LabelEncoder()
label_encoder_region_1.fit(region_1.astype(str))
encoded_region_1s = label_encoder_region_1.transform(region_1.astype(str))

# non-nulls
label_encoder_non_null_region_1 = LabelEncoder()
label_encoder_non_null_region_1.fit(region_1_non_null.astype(str))
encoded_non_null_region_1s = label_encoder_non_null_region_1.transform(region_1_non_null.astype(str))

In [10]:
# region_2
# nulls
label_encoder_region_2 = LabelEncoder()
label_encoder_region_2.fit(region_2.astype(str))
encoded_region_2s = label_encoder_region_2.transform(region_2.astype(str))

# non-nulls
label_encoder_non_null_region_2 = LabelEncoder()
label_encoder_non_null_region_2.fit(region_2_non_null.astype(str))
encoded_non_null_region_2s = label_encoder_non_null_region_2.transform(region_2_non_null.astype(str))

In [11]:
# variety
# nulls
label_encoder_variety = LabelEncoder()
label_encoder_variety.fit(variety.astype(str))
encoded_varieties = label_encoder_variety.transform(variety.astype(str))

# non-nulls
label_encoder_non_null_variety = LabelEncoder()
label_encoder_non_null_variety.fit(variety_non_null.astype(str))
encoded_non_null_varieties = label_encoder_non_null_variety.transform(variety_non_null.astype(str))

In [12]:
# winery
# nulls
label_encoder_winery = LabelEncoder()
label_encoder_winery.fit(winery.astype(str))
encoded_wineries = label_encoder_winery.transform(winery.astype(str))

# non-nulls
label_encoder_non_null_winery = LabelEncoder()
label_encoder_non_null_winery.fit(winery_non_null.astype(str))
encoded_non_null_wineries = label_encoder_non_null_winery.transform(winery_non_null.astype(str))
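
The per-feature cells above all follow the same pattern, so they could be collapsed into a loop. The following is a minimal sketch on a hypothetical four-row frame. The names `frame`, `encoded` and `encoders` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature frame standing in for raw_reduced_features,
# with nulls already replaced by the string 'NaN'.
frame = pd.DataFrame({
    'country': ['US', 'France', 'NaN', 'US'],
    'variety': ['Pinot Noir', 'Chardonnay', 'Chardonnay', 'NaN'],
})

encoders = {}  # column name -> fitted LabelEncoder, kept for inverse_transform
encoded = frame.copy(deep=True)
for column in ['country', 'variety']:
    encoder = LabelEncoder()
    encoded[column] = encoder.fit_transform(frame[column].astype(str))
    encoders[column] = encoder

print(encoded)
print(list(encoders['country'].classes_))
```

Keeping the fitted encoders in a dictionary makes it possible to map integer codes back to names later, for example when reading split thresholds off an exported tree.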

## One hot encoding

In [13]:
# country
# nulls
one_hot_country = OneHotEncoder(sparse=False)
one_hot_country.fit(country.astype(str).values.reshape(-1,1))
one_hot_countries = one_hot_country.transform(country.astype(str).values.reshape(-1,1))

In [14]:
# region_2
one_hot_region_2 = OneHotEncoder(sparse=False)
one_hot_region_2.fit(region_2.astype(str).values.reshape(-1,1))
one_hot_region_2s = one_hot_region_2.transform(region_2.astype(str).values.reshape(-1,1))

## Creating datasets

In [15]:
# Label encoded feature data that contains nulls
feature_data_label_encoded = raw_reduced_features.copy(deep=False)
feature_data_label_encoded['vintage'] = vintage

feature_data_label_encoded.country = encoded_countries
feature_data_label_encoded.province = encoded_provinces
feature_data_label_encoded.region_1 = encoded_region_1s
feature_data_label_encoded.region_2 = encoded_region_2s
feature_data_label_encoded.variety = encoded_varieties
feature_data_label_encoded.winery = encoded_wineries

In [16]:
# Label encoded feature data that does not contain nulls
feature_data_label_encoded_non_null = reduced_features_non_null_data.copy(deep=False)
feature_data_label_encoded_non_null['vintage'] = vintage

feature_data_label_encoded_non_null.country = encoded_non_null_countries
feature_data_label_encoded_non_null.province = encoded_non_null_provinces
feature_data_label_encoded_non_null.region_1 = encoded_non_null_region_1s
feature_data_label_encoded_non_null.region_2 = encoded_non_null_region_2s
feature_data_label_encoded_non_null.variety = encoded_non_null_varieties
feature_data_label_encoded_non_null.winery = encoded_non_null_wineries

In [17]:
# Mixed encoding feature data
feature_data_mixed_encoding = raw_reduced_features.copy(deep=False)
feature_data_mixed_encoding = feature_data_mixed_encoding.drop(labels=['country', 'region_2'], axis=1)
feature_data_mixed_encoding['vintage'] = vintage

# Label encoding
feature_data_mixed_encoding.province = encoded_provinces
feature_data_mixed_encoding.region_1 = encoded_region_1s
feature_data_mixed_encoding.variety = encoded_varieties
feature_data_mixed_encoding.winery = encoded_wineries

# One hot encoding
# Nulls were filled with the string 'NaN' for both features, so the placeholder
# column is renamed per feature to keep the column names unique
one_hot_countries_df = pd.DataFrame(data=one_hot_countries, columns=one_hot_country.categories_[0])
one_hot_countries_df = one_hot_countries_df.rename(columns={"NaN":"country_nan", "nan":"country_nan"})
feature_data_mixed_encoding = pd.concat([feature_data_mixed_encoding, one_hot_countries_df], axis=1)

one_hot_region_2s_df = pd.DataFrame(data=one_hot_region_2s, columns=one_hot_region_2.categories_[0])
one_hot_region_2s_df = one_hot_region_2s_df.rename(columns={"NaN":"region_2_nan", "nan":"region_2_nan"})
feature_data_mixed_encoding = pd.concat([feature_data_mixed_encoding, one_hot_region_2s_df], axis=1)
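
Because `country` and `region_2` share the same placeholder for missing values, their one-hot column names need a per-feature prefix to stay unique. One way to get that automatically is `pandas.get_dummies`, which is a different tool from the `OneHotEncoder` used above. The sketch below runs on a hypothetical three-row frame.

```python
import pandas as pd

# Hypothetical miniature frame; 'NaN' is the string placeholder for nulls.
frame = pd.DataFrame({
    'country': ['US', 'NaN', 'France'],
    'region_2': ['Sonoma', 'NaN', 'NaN'],
    'price': [20.0, 0.0, 35.0],
})

# Each dummy column is named '<prefix>_<category>', so the two
# placeholder columns cannot collide.
mixed = pd.get_dummies(
    frame,
    columns=['country', 'region_2'],
    prefix=['country', 'region_2'],
    dtype=float,
)
print(list(mixed.columns))
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so it suits one-off dataset preparation like this better than encoding new bottles at prediction time.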

In [18]:
feature_data_label_encoded.isnull().values.any()

False

### Test/train splitting

In [19]:
# Label Encoded full data test train split
X_train_label_encoded, X_test_label_encoded, y_train_label_encoded, y_test_label_encoded = train_test_split(feature_data_label_encoded, target_data, random_state=4)

In [20]:
# Label Encoded non-null data test train split
target_data_non_null = target_data[feature_data_label_encoded_non_null.index]
X_train_label_encoded_non_null, X_test_label_encoded_non_null, y_train_label_encoded_non_null, y_test_label_encoded_non_null = train_test_split(feature_data_label_encoded_non_null, target_data_non_null, random_state=4)

In [21]:
# Mixed encoding full data test train split
X_train_mixed_encoding, X_test_mixed_encoding, y_train_mixed_encoding, y_test_mixed_encoding = train_test_split(feature_data_mixed_encoding, target_data, random_state=4)

### Standardisation of data

In [22]:
# Label Encoded data standardisation
label_encoded_standard_scaler = StandardScaler()
label_encoded_standard_scaler.fit(X_train_label_encoded)

X_train_label_encoded_standardised = label_encoded_standard_scaler.transform(X_train_label_encoded)
X_test_label_encoded_standardised = label_encoded_standard_scaler.transform(X_test_label_encoded)

In [23]:
# Label Encoded non-null data standardisation
label_encoded_non_null_standard_scaler = StandardScaler()
label_encoded_non_null_standard_scaler.fit(X_train_label_encoded_non_null)

X_train_label_encoded_non_null_standardised = label_encoded_non_null_standard_scaler.transform(X_train_label_encoded_non_null)
X_test_label_encoded_non_null_standardised = label_encoded_non_null_standard_scaler.transform(X_test_label_encoded_non_null)

In [24]:
# Mixed encoding full data standardisation
mixed_encoding_standard_scaler = StandardScaler()
mixed_encoding_standard_scaler.fit(X_train_mixed_encoding)

X_train_mixed_encoding_standardised = mixed_encoding_standard_scaler.transform(X_train_mixed_encoding)
X_test_mixed_encoding_standardised = mixed_encoding_standard_scaler.transform(X_test_mixed_encoding)

## Solution 1
Decision trees - let's see what they tell me. 

Since I've already decided on the limited usefulness of decision tree models for this, I'll keep it short and bake in a few assumptions. 
1. The maximum depth of the tree will not greatly exceed the number of features (8). 
2. The maximum leaf nodes will not greatly exceed the number of possible scores, rounded to the nearest integer (21). 
3. I will use Mean Absolute Error as the error metric, because it is easy to interpret (it is in the same units as the score) and directly relevant to how the model's predictions will be used. 
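
If the grid search itself should pick hyperparameters by mean absolute error, `GridSearchCV` needs a `scoring` argument. Without one it ranks candidates by the estimator's default score, which is R² for regressors. Below is a minimal sketch on synthetic data. The data, the small grid and the default split criterion are assumptions made for illustration, not the settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 200 rows, 3 features, scores around 80-100.
rng = np.random.RandomState(4)
X = rng.uniform(0, 10, size=(200, 3))
y = 80 + 2 * X[:, 0] + rng.normal(0, 0.5, size=200)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=4),
    param_grid={'max_depth': [2, 4], 'max_leaf_nodes': [4, 8]},
    scoring='neg_mean_absolute_error',  # higher is better, so MAE is negated
    cv=5,
)
search.fit(X, y)

# Flip the sign to read the cross-validated MAE in points.
best_cv_mae = -search.best_score_
print(search.best_params_, best_cv_mae)
```

With this setting `best_score_` is directly comparable to the test MAE reported after each search, rather than being an R² value on a different scale.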

In [25]:
# Label encoded decision tree regressor
start_time_decision_tree_label_encoded = datetime.datetime.now()
print("Start time: {}".format(start_time_decision_tree_label_encoded.time()))

label_encoded_params = {
    'max_depth': np.arange(3,13,3), 
    'max_leaf_nodes': np.arange(15,42,3)
}

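# Note: no `scoring` argument is given, so GridSearchCV ranks candidates by the
# estimator's default score (R^2 for regressors); best_score_ below is R^2, not MAE
label_encoded_gridsearchcv = GridSearchCV(
    estimator=DecisionTreeRegressor(criterion='mae', random_state=4), 
    param_grid=label_encoded_params, 
    n_jobs=-1, 
    cv=5)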

label_encoded_gridsearchcv.fit(X_train_label_encoded_standardised, y_train_label_encoded)
print('DecisionTreeRegressor GridSearchCV label_encoded best score: {}'.format(label_encoded_gridsearchcv.best_score_))
print('DecisionTreeRegressor GridSearchCV label_encoded best estimator: {}'.format(label_encoded_gridsearchcv.best_estimator_))

print('\n')

print('DecisionTreeRegressor GridSearchCV label_encoded best estimator test mae: {}'.format(mean_absolute_error(y_test_label_encoded, label_encoded_gridsearchcv.best_estimator_.predict(X_test_label_encoded_standardised))))

end_time_decision_tree_label_encoded = datetime.datetime.now()
print("End time: {}".format(end_time_decision_tree_label_encoded.time()))

time_elapsed_decision_tree_label_encoded = end_time_decision_tree_label_encoded - start_time_decision_tree_label_encoded
print("Time elapsed: {}".format(time_elapsed_decision_tree_label_encoded))

Start time: 17:14:35.605230
DecisionTreeRegressor GridSearchCV label_encoded best score: 0.374080668273635
DecisionTreeRegressor GridSearchCV label_encoded best estimator: DecisionTreeRegressor(criterion='mae', max_depth=8, max_features=None,
           max_leaf_nodes=39, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=4, splitter='best')


DecisionTreeRegressor GridSearchCV label_encoded best estimator test mae: 1.8624319084110423
End time: 19:41:56.150086


TypeError: unsupported operand type(s) for -: 'datetime.time' and 'datetime.time'
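
The `TypeError` above is what Python raises when two `datetime.time` values are subtracted. The cell as shown subtracts full `datetime` objects, which is supported, so the recorded error presumably comes from an earlier run that subtracted the `.time()` values. This small sketch shows both behaviours using only the standard library, with a trivial loop standing in for the model fitting.

```python
import datetime

start = datetime.datetime.now()
total = sum(range(100_000))  # stand-in for the grid search
end = datetime.datetime.now()

# Subtracting two datetime objects yields a timedelta.
elapsed = end - start
print("Time elapsed: {}".format(elapsed))

# datetime.time objects carry no date, and Python does not define
# subtraction for them, so this raises TypeError.
try:
    end.time() - start.time()
    time_subtraction_failed = False
except TypeError:
    time_subtraction_failed = True
print(time_subtraction_failed)  # prints True
```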

In [None]:
# Save image of label encoded decision tree regressor to output dir
export_graphviz(label_encoded_gridsearchcv.best_estimator_, 
                out_file='label_encoded_gridsearchcv_decisiontreeregressor.dot', 
                feature_names=feature_data_label_encoded.columns, 
                label='all', 
                filled=True, 
                rounded=True, 
                leaves_parallel=True, 
                impurity=True)
subprocess.run(['dot', 
                '-Tpng', 
                'label_encoded_gridsearchcv_decisiontreeregressor.dot', 
                '-o', 
                'label_encoded_gridsearchcv_decisiontreeregressor.png'])
subprocess.run(['mv', 
                'label_encoded_gridsearchcv_decisiontreeregressor.dot', 
                '../output/label_encoded_gridsearchcv_decisiontreeregressor.dot'])
subprocess.run(['mv', 
                'label_encoded_gridsearchcv_decisiontreeregressor.png', 
                '../output/label_encoded_gridsearchcv_decisiontreeregressor.png'])

In [None]:
# Label encoded non null decision tree regressor
start_time_decision_tree_non_null = datetime.datetime.now()
print("Start time: {}".format(start_time_decision_tree_non_null.time()))

label_encoded_non_null_params = {
    'max_depth': np.arange(3,13,3), 
    'max_leaf_nodes': np.arange(15,42,3)
}

label_encoded_non_null_gridsearchcv = GridSearchCV(
    estimator=DecisionTreeRegressor(criterion='mae', random_state=4), 
    param_grid=label_encoded_non_null_params, 
    n_jobs=-1, 
    cv=5)

label_encoded_non_null_gridsearchcv.fit(X_train_label_encoded_non_null_standardised, y_train_label_encoded_non_null)
print('DecisionTreeRegressor GridSearchCV label_encoded_non_null best score: {}'.format(label_encoded_non_null_gridsearchcv.best_score_))
print('DecisionTreeRegressor GridSearchCV label_encoded_non_null best estimator: {}'.format(label_encoded_non_null_gridsearchcv.best_estimator_))

print('\n')

print('DecisionTreeRegressor GridSearchCV label_encoded_non_null best estimator test mae: {}'.format(mean_absolute_error(y_test_label_encoded_non_null, label_encoded_non_null_gridsearchcv.best_estimator_.predict(X_test_label_encoded_non_null_standardised))))

end_time_decision_tree_non_null = datetime.datetime.now()
print("End time: {}".format(end_time_decision_tree_non_null.time()))

time_elapsed_decision_tree_non_null = end_time_decision_tree_non_null - start_time_decision_tree_non_null
print("Time elapsed: {}".format(time_elapsed_decision_tree_non_null))

In [None]:
# Save image of label encoded non null decision tree regressor to output dir
export_graphviz(label_encoded_non_null_gridsearchcv.best_estimator_, 
                out_file='label_encoded_non_null_gridsearchcv_decisiontreeregressor.dot', 
                feature_names=feature_data_label_encoded_non_null.columns, 
                label='all', 
                filled=True, 
                rounded=True, 
                leaves_parallel=True, 
                impurity=True)
subprocess.run(['dot', 
                '-Tpng', 
                'label_encoded_non_null_gridsearchcv_decisiontreeregressor.dot', 
                '-o', 
                'label_encoded_non_null_gridsearchcv_decisiontreeregressor.png'])
subprocess.run(['mv', 
                'label_encoded_non_null_gridsearchcv_decisiontreeregressor.dot', 
                '../output/label_encoded_non_null_gridsearchcv_decisiontreeregressor.dot'])
subprocess.run(['mv', 
                'label_encoded_non_null_gridsearchcv_decisiontreeregressor.png', 
                '../output/label_encoded_non_null_gridsearchcv_decisiontreeregressor.png'])

In [None]:
# Mixed encoding decision tree regressor
start_time_decision_tree_mixed_encoding = datetime.datetime.now()
print("Start time: {}".format(start_time_decision_tree_mixed_encoding.time()))

mixed_encoding_params = {
    'max_depth': np.arange(3,13,3), 
    'max_leaf_nodes': np.arange(15,42,3)
}

mixed_encoding_gridsearchcv = GridSearchCV(
    estimator=DecisionTreeRegressor(criterion='mae', random_state=4), 
    param_grid=mixed_encoding_params, 
    n_jobs=-1, 
    cv=5)

mixed_encoding_gridsearchcv.fit(X_train_mixed_encoding_standardised, y_train_mixed_encoding)
print('DecisionTreeRegressor GridSearchCV mixed_encoding best score: {}'.format(mixed_encoding_gridsearchcv.best_score_))
print('DecisionTreeRegressor GridSearchCV mixed_encoding best estimator: {}'.format(mixed_encoding_gridsearchcv.best_estimator_))

print('\n')

print('DecisionTreeRegressor GridSearchCV mixed_encoding best estimator test mae: {}'.format(mean_absolute_error(y_test_mixed_encoding, mixed_encoding_gridsearchcv.best_estimator_.predict(X_test_mixed_encoding_standardised))))

end_time_decision_tree_mixed_encoding = datetime.datetime.now()
print("End time: {}".format(end_time_decision_tree_mixed_encoding.time()))

time_elapsed_decision_tree_mixed_encoding = end_time_decision_tree_mixed_encoding - start_time_decision_tree_mixed_encoding
print("Time elapsed: {}".format(time_elapsed_decision_tree_mixed_encoding))

In [None]:
# Save image of mixed encoding decision tree regressor to output dir
export_graphviz(mixed_encoding_gridsearchcv.best_estimator_, 
                out_file='mixed_encoding_gridsearchcv_decisiontreeregressor.dot', 
                feature_names=feature_data_mixed_encoding.columns, 
                label='all', 
                filled=True, 
                rounded=True, 
                leaves_parallel=True, 
                impurity=True)
subprocess.run(['dot', 
                '-Tpng', 
                'mixed_encoding_gridsearchcv_decisiontreeregressor.dot', 
                '-o', 
                'mixed_encoding_gridsearchcv_decisiontreeregressor.png'])
subprocess.run(['mv', 
                'mixed_encoding_gridsearchcv_decisiontreeregressor.dot', 
                '../output/mixed_encoding_gridsearchcv_decisiontreeregressor.dot'])
subprocess.run(['mv', 
                'mixed_encoding_gridsearchcv_decisiontreeregressor.png', 
                '../output/mixed_encoding_gridsearchcv_decisiontreeregressor.png'])

![label_encoded_non_null_gridsearchcv_decisiontreeregressor](../output/label_encoded_non_null_gridsearchcv_decisiontreeregressor.png)

In [None]:
# print(datetime.datetime.now())
# dtr = DecisionTreeRegressor(criterion='mae', max_depth=7, max_features=1, max_leaf_nodes=21, random_state=4)

# dtr.fit(X_train_label_encoded_non_null, y_train_label_encoded_non_null)

# preds = dtr.predict(X_test_label_encoded_non_null)

# mae = mean_absolute_error(y_test_label_encoded_non_null, preds)
# mae
# print(datetime.datetime.now())

In [None]:
# export_graphviz(dtr, out_file='dtr_test.dot', 
#                 feature_names=feature_data_label_encoded_non_null.columns, 
#                 label='all',
#                 filled=True,
#                 rounded=True,
#                 leaves_parallel=True,
#                 impurity=True)
# subprocess.run(['dot','-Tpng', 'dtr_test.dot', '-o', 'dtr_test.png'])

![dtr_test](dtr_test.png)

## Solution 2
Linear regression. 

Again, I'll use mean absolute error. 

In [26]:
# Label encoded linear regression
start_time_linear_regression_label_encoded = datetime.datetime.now()
print("Start time: {}".format(start_time_linear_regression_label_encoded.time()))

# There are no parameters to set, so GridSearchCV is unnecessary
label_encoded_linear_regression = LinearRegression()
label_encoded_linear_regression.fit(X_train_label_encoded_standardised, y_train_label_encoded)

print('Linear regression label_encoded test mae: {}'.format(mean_absolute_error(y_test_label_encoded, label_encoded_linear_regression.predict(X_test_label_encoded_standardised))))

end_time_linear_regression_label_encoded = datetime.datetime.now()
print("End time: {}".format(end_time_linear_regression_label_encoded.time()))

time_elapsed_linear_regression_label_encoded = end_time_linear_regression_label_encoded - start_time_linear_regression_label_encoded
print("Time elapsed: {}".format(time_elapsed_linear_regression_label_encoded))

Start time: 19:54:15.690759
Linear regression label_encoded test mae: 2.2236270134283433
End time: 19:54:15.731479
Time elapsed: 0:00:00.040720


In [27]:
# Label encoded non null linear regression
start_time_linear_regression_label_encoded_non_null = datetime.datetime.now()
print("Start time: {}".format(start_time_linear_regression_label_encoded_non_null.time()))

# There are no parameters to set, so GridSearchCV is unnecessary
label_encoded_non_null_linear_regression = LinearRegression()
label_encoded_non_null_linear_regression.fit(X_train_label_encoded_non_null_standardised, y_train_label_encoded_non_null)

print('Linear regression label_encoded_non_null test mae: {}'.format(mean_absolute_error(y_test_label_encoded_non_null, label_encoded_non_null_linear_regression.predict(X_test_label_encoded_non_null_standardised))))

end_time_linear_regression_label_encoded_non_null = datetime.datetime.now()
print("End time: {}".format(end_time_linear_regression_label_encoded_non_null.time()))

time_elapsed_linear_regression_label_encoded_non_null = end_time_linear_regression_label_encoded_non_null - start_time_linear_regression_label_encoded_non_null
print("Time elapsed: {}".format(time_elapsed_linear_regression_label_encoded_non_null))

Start time: 19:54:18.178757
Linear regression label_encoded_non_null test mae: 2.2236270134283433
End time: 19:54:18.212136
Time elapsed: 0:00:00.033379


In [28]:
# Mixed encoding linear regression
start_time_linear_regression_mixed_encoding = datetime.datetime.now()
print("Start time: {}".format(start_time_linear_regression_mixed_encoding.time()))

# There are no parameters to set, so GridSearchCV is unnecessary
mixed_encoding_linear_regression = LinearRegression()
mixed_encoding_linear_regression.fit(X_train_mixed_encoding_standardised, y_train_mixed_encoding)

print('Linear regression mixed_encoding test mae: {}'.format(mean_absolute_error(y_test_mixed_encoding, mixed_encoding_linear_regression.predict(X_test_mixed_encoding_standardised))))

end_time_linear_regression_mixed_encoding = datetime.datetime.now()
print("End time: {}".format(end_time_linear_regression_mixed_encoding.time()))

time_elapsed_linear_regression_mixed_encoding = end_time_linear_regression_mixed_encoding - start_time_linear_regression_mixed_encoding
print("Time elapsed: {}".format(time_elapsed_linear_regression_mixed_encoding))

Start time: 19:54:20.260816
Linear regression mixed_encoding test mae: 31901654.235031657
End time: 19:54:20.865510
Time elapsed: 0:00:00.604694
