## Introduction ##

This project was provided by Zyfra, a developer of efficiency solutions for heavy industry. The goal is to prototype a machine learning model that predicts the amount of gold recovered from gold ore. The model will help optimize gold production and eliminate unprofitable parameters. We are provided data on gold ore extraction and purification in the following csv files:

- gold_recovery_train.csv
- gold_recovery_test.csv
- gold_recovery_full.csv

The source dataset contains all of the pertinent parameters and outputs related to the technological process for gold ore refinement. The source dataset has already been split into a training and test set. Of note, some of the features are absent in the test set and will be addressed during the project. The project goal is to find the best model according to the symmetric Mean Absolute Percentage Error (sMAPE) metric using the predicted rougher and final concentrate recovery values.
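
For reference, sMAPE is commonly defined as the mean of |y - ŷ| / ((|y| + |ŷ|) / 2), expressed as a percentage. The sketch below assumes that common definition. The helper name `smape` and the handling of 0/0 terms are illustrative assumptions, and the way the rougher and final scores are combined into one number is not covered here.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE in percent (common definition; illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    # Assumption: a term where both values are zero counts as zero error
    ratio = np.divide(np.abs(y_true - y_pred), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return ratio.mean() * 100

# Perfect predictions score 0; larger values mean larger relative error
print(smape([80.0, 90.0], [80.0, 90.0]))
```

Because the denominator averages the actual and predicted magnitudes, the metric treats over- and under-prediction more evenly than plain MAPE does.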

## Imports ##

**Packages, Libraries, and Modules**

In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyRegressor

## Reading and Inspecting the Data ##

**Data info and Sample**

In [2]:
# When cloning this project, ensure directory/file paths are correct with respect to user's operating system
gold_full = pd.read_csv('datasets/gold_recovery_full.csv')
gold_train = pd.read_csv('datasets/gold_recovery_train.csv')
gold_test = pd.read_csv('datasets/gold_recovery_test.csv')

gold_full.info()
display(gold_full.sample(10, random_state=12345))

gold_train.info()
display(gold_train.sample(10, random_state=12345))

gold_test.info()
display(gold_test.sample(10, random_state=12345))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22716 entries, 0 to 22715
Data columns (total 87 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   date                                                22716 non-null  object 
 1   final.output.concentrate_ag                         22627 non-null  float64
 2   final.output.concentrate_pb                         22629 non-null  float64
 3   final.output.concentrate_sol                        22331 non-null  float64
 4   final.output.concentrate_au                         22630 non-null  float64
 5   final.output.recovery                               20753 non-null  float64
 6   final.output.tail_ag                                22633 non-null  float64
 7   final.output.tail_pb                                22516 non-null  float64
 8   final.output.tail_sol                               22445 non-null  float64


Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
14334,2017-09-03 05:59:59,5.402473,10.080398,6.191297,43.548861,52.719593,8.352964,3.458732,4.172927,3.007726,...,11.95648,-499.657728,9.101714,-400.01597,13.999355,-499.433086,10.972613,-499.334612,14.969209,-499.461258
6208,2016-09-29 15:59:59,3.921312,3.187676,5.499611,11.548059,86.577048,8.612864,2.482086,8.482637,4.679322,...,11.979586,-588.104619,9.989417,-500.344378,10.047065,-617.112513,4.991262,-500.227633,19.604098,-501.473032
21513,2018-06-29 08:59:59,4.42486,10.157334,7.1169,45.952564,74.117048,8.231104,2.327185,10.99388,2.448596,...,30.035144,-498.766454,22.997594,-500.070561,19.982851,-499.262008,14.97393,-499.864906,17.967964,-498.399672
2898,2016-05-14 17:59:59,3.905437,8.575671,10.532593,47.020037,67.319742,8.464871,1.689378,13.669049,2.705404,...,12.031857,-500.849465,8.108548,-502.640272,8.904894,-498.520782,5.957944,-500.606375,19.994113,-499.991456
6829,2016-10-25 12:59:59,6.169956,9.422575,11.87291,44.685774,75.645395,10.135578,2.269874,9.165746,3.0423,...,20.028081,-499.240587,16.983768,-499.24921,16.986478,-499.166485,14.014267,-498.108465,25.019014,-498.872206
7694,2016-11-30 13:59:59,5.215999,7.840408,10.839686,40.592282,66.562125,6.085885,1.70308,5.795798,1.893287,...,18.035531,-500.300294,16.047626,-500.253518,16.096515,-501.209685,12.014144,-499.962487,22.00348,-500.523712
21881,2018-07-14 16:59:59,3.848936,10.082536,6.986162,46.598624,74.768757,11.263828,3.088475,7.479572,1.919036,...,17.079889,-501.621512,14.979839,-500.007438,10.918462,-501.729337,10.045543,-500.037191,13.017549,-502.721089
6683,2016-10-19 10:59:59,6.193314,10.192565,10.724995,44.675555,70.432661,10.466835,3.161605,9.335343,3.829962,...,20.028503,-500.920155,16.911465,-500.373799,7.988245,-501.049137,14.04188,-500.615417,25.053361,-501.55252
4892,2016-08-05 19:59:59,5.609689,9.303589,6.835753,45.151787,56.497563,9.72491,2.624843,3.366715,3.863005,...,19.985011,-399.257356,10.01937,-400.222474,9.172945,-399.198435,5.024726,-400.100409,22.997855,-499.947826
17652,2018-01-19 11:59:59,4.331592,8.231944,13.403063,46.723715,68.911917,11.298664,1.669021,5.754017,2.763657,...,19.972262,-500.326348,15.178027,-499.838052,10.950151,-499.833237,8.967812,-499.598364,11.020779,-499.727462


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16860 entries, 0 to 16859
Data columns (total 87 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   date                                                16860 non-null  object 
 1   final.output.concentrate_ag                         16788 non-null  float64
 2   final.output.concentrate_pb                         16788 non-null  float64
 3   final.output.concentrate_sol                        16490 non-null  float64
 4   final.output.concentrate_au                         16789 non-null  float64
 5   final.output.recovery                               15339 non-null  float64
 6   final.output.tail_ag                                16794 non-null  float64
 7   final.output.tail_pb                                16677 non-null  float64
 8   final.output.tail_sol                               16715 non-null  float64


Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
5160,2016-08-16 23:59:59,10.389027,9.518832,12.977982,35.351219,80.293379,9.757327,2.351265,10.258158,2.760782,...,17.952402,-399.814244,9.989155,-400.264371,15.009879,-401.027892,4.927554,-400.378415,22.941726,-500.864047
13366,2018-03-25 21:59:59,4.49988,11.1343,8.469435,45.656652,71.634437,9.096745,2.041674,8.197805,2.229945,...,23.007195,-499.909865,15.068483,-500.521292,18.019546,-500.870804,12.016944,-500.116669,12.988538,-500.086879
7892,2017-04-09 19:59:59,4.48047,11.042749,9.751865,45.285007,67.593795,8.202751,3.032599,11.642006,2.496477,...,25.046496,-401.055055,22.990209,-400.116397,25.98951,-450.508183,24.037303,-449.933354,29.991628,-500.498874
7605,2017-03-28 20:59:59,4.922946,10.434142,10.678296,46.678133,57.551604,9.92527,3.408842,10.649813,3.815938,...,25.013766,-398.868507,22.98402,-398.966602,29.972392,-450.700081,24.012988,-450.222121,25.018402,-500.985296
1933,2016-04-04 13:00:00,,,,,,,,,,...,,,,,,,,,,
12275,2018-02-08 10:59:59,4.912996,8.757087,8.908876,44.463884,76.091062,11.260668,1.45667,12.99627,2.782712,...,19.908923,-498.63248,15.067284,-499.674636,10.971864,-498.948495,9.003349,-499.867722,10.991802,-498.783215
6294,2017-02-02 05:59:59,5.399841,12.850944,16.573839,42.041781,66.863319,10.641776,4.464644,8.901448,3.850627,...,24.964508,-497.678417,22.998162,-500.288763,22.99211,-499.523707,19.897816,-499.815827,25.008903,-599.781952
7169,2017-03-10 16:59:59,6.050264,8.438954,7.758436,46.333466,65.839186,9.289685,3.199672,6.90676,3.704658,...,24.971946,-400.261317,23.020525,-400.094223,21.232995,-449.091525,20.007429,-449.829493,25.00587,-499.640583
12297,2018-02-09 08:59:59,5.739681,9.838668,8.411716,47.397116,76.563755,9.202121,0.896478,12.478896,2.140149,...,20.022976,-499.974627,15.040998,-499.914511,10.993049,-500.366405,9.001779,-499.841581,11.00852,-500.09052
7175,2017-03-10 22:59:59,6.111156,8.824492,7.801447,45.646647,66.244615,9.410112,3.122016,7.175431,3.557226,...,24.979583,-400.168683,22.931727,-400.21023,22.499457,-449.83238,20.000747,-449.837387,25.016873,-500.1884


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5856 entries, 0 to 5855
Data columns (total 53 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   date                                        5856 non-null   object 
 1   primary_cleaner.input.sulfate               5554 non-null   float64
 2   primary_cleaner.input.depressant            5572 non-null   float64
 3   primary_cleaner.input.feed_size             5856 non-null   float64
 4   primary_cleaner.input.xanthate              5690 non-null   float64
 5   primary_cleaner.state.floatbank8_a_air      5840 non-null   float64
 6   primary_cleaner.state.floatbank8_a_level    5840 non-null   float64
 7   primary_cleaner.state.floatbank8_b_air      5840 non-null   float64
 8   primary_cleaner.state.floatbank8_b_level    5840 non-null   float64
 9   primary_cleaner.state.floatbank8_c_air      5840 non-null   float64
 10  primary_clea

Unnamed: 0,date,primary_cleaner.input.sulfate,primary_cleaner.input.depressant,primary_cleaner.input.feed_size,primary_cleaner.input.xanthate,primary_cleaner.state.floatbank8_a_air,primary_cleaner.state.floatbank8_a_level,primary_cleaner.state.floatbank8_b_air,primary_cleaner.state.floatbank8_b_level,primary_cleaner.state.floatbank8_c_air,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level
5012,2017-11-26 20:59:59,198.238357,10.991582,7.08,2.008978,1599.978357,-499.93156,1598.857403,-500.057486,1599.98521,...,16.991206,-501.192218,14.989985,-500.749635,10.91998,-499.662605,8.996301,-499.46696,15.987536,-499.361782
3201,2017-09-12 09:59:59,187.685715,8.020792,7.02,0.892078,1301.752604,-503.955823,1297.673879,-499.634237,1296.851378,...,14.059974,-500.180901,12.284795,-403.746102,9.284352,-500.253898,7.241139,-500.203849,10.241658,-501.22509
4632,2017-11-11 00:59:59,148.685271,9.959004,8.47,1.89218,1598.159719,-503.20483,1601.797695,-502.643701,1478.193261,...,17.028473,-499.606716,13.976562,-500.0065,15.014979,-499.945616,10.986182,-499.708539,15.978529,-499.731177
2333,2016-12-07 05:59:59,222.145769,9.488381,7.15,1.284743,1502.807962,-500.009535,1500.379611,-499.972527,1501.227618,...,17.973835,-500.579666,16.095902,-500.355044,17.488205,-501.207781,12.072397,-500.297369,21.015653,-503.051128
5028,2017-11-27 12:59:59,218.152811,10.976666,7.42,2.192583,1598.790561,-489.345965,1598.655291,-573.576554,1598.207072,...,17.001131,-494.76567,14.962499,-508.047815,10.979592,-497.56319,8.991156,-454.185705,15.961403,-494.668152
2076,2016-11-26 12:59:59,,0.041375,6.86,0.009742,1598.926837,-500.164321,1599.576886,-500.016356,1602.260539,...,18.064714,-499.868945,15.935439,-499.790389,17.773422,-500.257762,11.964392,-500.09676,22.011969,-500.278612
4995,2017-11-26 03:59:59,233.777157,9.993671,7.61,2.491121,1502.199453,-500.583948,1499.2454,-499.72973,1499.893175,...,17.04104,-498.218564,14.92388,-499.5347,10.982981,-498.69715,9.02996,-498.248544,16.009203,-499.723772
4067,2017-10-18 11:59:59,211.355551,6.881219,5.89,2.394932,1604.772829,-474.268242,1481.627385,-556.283461,1600.320423,...,17.976587,-499.049166,15.852059,-399.794671,12.999002,-499.513709,9.999142,-500.084926,14.00205,-498.81578
712,2016-09-30 16:59:59,62.326118,4.036006,6.3,0.567262,1605.145677,-498.899125,1604.133982,-502.666879,1604.916785,...,12.050552,-560.440825,10.006541,-495.463881,10.02075,-498.901036,4.96631,-499.648076,19.969872,-494.611828
2642,2016-12-20 02:59:59,214.711006,12.483614,7.25,1.09679,1499.312387,-500.427849,1498.186569,-500.190446,1500.282923,...,17.045912,-500.533174,14.984849,-473.212022,13.744876,-500.513665,12.036559,-499.629287,20.990126,-500.033005


Per the project description, the source dataset (22,716 entries) combines the training and test sets (16,860 and 5,856 entries respectively). Furthermore, we expect some of the features and both targets in the training set to be absent from the test set. Each entry of the dataset should be uniquely identifiable by the date column (currently stored as an object datatype). Aside from the date, the column names follow the format *[stage].[parameter_type].[parameter_name]*, where the possible values are as follows:

Possible values for *[stage]*:
- *rougher*: flotation
- *primary_cleaner*: primary purification
- *secondary_cleaner*: secondary purification
- *final*: final characteristics

Possible values for *[parameter_type]*:

- *input*: raw material parameters
- *output*: product parameters
- *state*: parameters characterizing the current state of the stage
- *calculation*: calculation characteristics
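
The naming convention above can be split into its three parts programmatically. This is a minimal sketch using a few column names taken from the `info()` output; the variable names `columns` and `parts` are illustrative.

```python
import pandas as pd

# A handful of column names copied from the info() output
columns = pd.Series([
    'final.output.concentrate_ag',
    'final.output.recovery',
    'primary_cleaner.input.sulfate',
    'primary_cleaner.state.floatbank8_a_air',
])

# Split on the first two dots only, so the parameter name stays intact
parts = columns.str.split('.', n=2, expand=True)
parts.columns = ['stage', 'parameter_type', 'parameter_name']
print(parts)
```

A table like this makes it easy to count columns per stage or to select, for example, every *state* parameter at once.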

Each of these columns appears to be numeric with a float64 datatype. Several columns have missing values; we'll address missing values and any duplicates later in the project.

**Recovery Calculations**

Per the technical process, recovery (percent) is calculated as follows:

$\text{Recovery} = {{C \times (F-T)} \over {F \times (C-T)}} \times 100$

- C - share of gold in the concentrate right after flotation (rougher concentrate recovery), or after purification (final concentrate recovery)
- F - share of gold in the feed before flotation (rougher concentrate recovery), or after flotation (final concentrate recovery)
- T - share of gold in the rougher tails right after flotation (rougher concentrate recovery), or after purification (final concentrate recovery)

As such, we'll define the function, <code>calculate_recovery()</code>:

**Parameters**
- *df*: DataFrame object for which recovery calculations are desired
- *C*: the column name for gold share in the concentrate, as described above.
- *F*: the column name for gold share in the feed, as described above.
- *T*: the column name for gold share in the rougher tails, as described above.

**Returns**: Series object containing the calculated recovery as percent.

In [3]:
def calculate_recovery(df, C, F, T):
    return ((df[C] * (df[F] - df[T])) / (df[F] * (df[C] - df[T]))) * 100
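
As a quick sanity check of the formula, here is a worked example on a one-row DataFrame. The gold shares are hypothetical values chosen for easy arithmetic, not values from the dataset, and the short column names are placeholders.

```python
import pandas as pd

def calculate_recovery(df, C, F, T):
    return ((df[C] * (df[F] - df[T])) / (df[F] * (df[C] - df[T]))) * 100

# Hypothetical shares: concentrate 20, feed 8, tails 2
toy = pd.DataFrame({'c': [20.0], 'f': [8.0], 't': [2.0]})

# 20 * (8 - 2) / (8 * (20 - 2)) * 100 = 120 / 144 * 100, roughly 83.33
recovery = calculate_recovery(toy, 'c', 'f', 't')
print(recovery.round(2).iloc[0])
```

Note that the expression is undefined when the feed share is zero or when the concentrate and tail shares are equal, so such rows would yield infinite or missing values.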

Using our function, we'll verify recovery was calculated correctly for the target, *rougher.output.recovery*, in the training set and find the mean absolute error (MAE) between the calculated values and actual values.

In [4]:
calculated_values = calculate_recovery(gold_train.dropna(subset=['rougher.output.recovery']), 'rougher.output.concentrate_au', 'rougher.input.feed_au', 'rougher.output.tail_au')
actual_values = gold_train['rougher.output.recovery'].dropna()

print('Mean Absolute Error: ', mean_absolute_error(actual_values, calculated_values))
print('Calculated values sufficiently close: ', np.allclose(actual_values, calculated_values))

Mean Absolute Error:  9.303415616264301e-15
Calculated values sufficiently close:  True


It appears the training set's rougher concentrate recovery is correct. The calculated values are sufficiently close to the source dataset and the MAE is nearly zero.

**Feature Analysis**

Prior to data preprocessing, we'll inspect the features (in the training set) not available in the test set.

In [5]:
training_columns = gold_train.drop(gold_test.columns, axis=1)
training_columns.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16860 entries, 0 to 16859
Data columns (total 34 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   final.output.concentrate_ag                         16788 non-null  float64
 1   final.output.concentrate_pb                         16788 non-null  float64
 2   final.output.concentrate_sol                        16490 non-null  float64
 3   final.output.concentrate_au                         16789 non-null  float64
 4   final.output.recovery                               15339 non-null  float64
 5   final.output.tail_ag                                16794 non-null  float64
 6   final.output.tail_pb                                16677 non-null  float64
 7   final.output.tail_sol                               16715 non-null  float64
 8   final.output.tail_au                                16794 non-null  float64


The training set has 34 columns that are not available in the test set, including both targets (*rougher.output.recovery* and *final.output.recovery*). All of them are float64. These columns appear to be values taken from different stages of the refining process *after* flotation. Specifically, they record concentrate and tail measurements of various materials and byproducts before and after the purification steps.
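
An alternative way to list the columns present in one frame but not the other is `Index.difference`, which can then be summarized by prefix. The sketch below uses two tiny empty frames as stand-ins for the training and test sets; the names `train`, `test`, and `prefix_counts` are illustrative.

```python
import pandas as pd

# Empty stand-ins that only carry column names
train = pd.DataFrame(columns=[
    'date', 'rougher.input.feed_au', 'rougher.output.recovery',
    'final.output.recovery', 'final.output.tail_au',
])
test = pd.DataFrame(columns=['date', 'rougher.input.feed_au'])

# Columns found in train but not in test
missing = train.columns.difference(test.columns)

# Count missing columns per "[stage].[parameter_type]" prefix
prefix_counts = missing.to_series().str.rsplit('.', n=1).str[0].value_counts()
print(prefix_counts)
```

Applied to the real frames, a summary like this would show which stages and parameter types the absent columns belong to.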

## Data Preprocessing ##

**Date Column**

To facilitate use in other operations, we'll convert the date datatype to datetime. We'll verify the datatype change with <code>info()</code>.

In [6]:
gold_full['date'] = pd.to_datetime(gold_full['date'], format='%Y-%m-%d %H:%M:%S')
gold_train['date'] = pd.to_datetime(gold_train['date'], format='%Y-%m-%d %H:%M:%S')
gold_test['date'] = pd.to_datetime(gold_test['date'], format='%Y-%m-%d %H:%M:%S')

gold_full.info()
gold_train.info()
gold_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22716 entries, 0 to 22715
Data columns (total 87 columns):
 #   Column                                              Non-Null Count  Dtype         
---  ------                                              --------------  -----         
 0   date                                                22716 non-null  datetime64[ns]
 1   final.output.concentrate_ag                         22627 non-null  float64       
 2   final.output.concentrate_pb                         22629 non-null  float64       
 3   final.output.concentrate_sol                        22331 non-null  float64       
 4   final.output.concentrate_au                         22630 non-null  float64       
 5   final.output.recovery                               20753 non-null  float64       
 6   final.output.tail_ag                                22633 non-null  float64       
 7   final.output.tail_pb                                22516 non-null  float64       
 8   final.

**Missing Values**

We'll first determine how many missing values there are in the datasets before we decide how to address them. Of note, the source dataset is the combination of the training and test sets, so rows with missing values in the source also appear in one of these subsets.

In [7]:
print('Number of rows with at least one missing value in source: ', gold_full[gold_full.isna().sum(axis=1) > 0].shape)
print('Number of rows with at least one missing value in training: ', gold_train[gold_train.isna().sum(axis=1) > 0].shape)
print('Number of rows with at least one missing value in test: ', gold_test[gold_test.isna().sum(axis=1) > 0].shape)

Number of rows with at least one missing value in source:  (6622, 87)
Number of rows with at least one missing value in training:  (5843, 87)
Number of rows with at least one missing value in test:  (473, 53)


There are 6,622 rows out of 22,716 total entries that contain at least one missing value. This represents about 29.2% of the data, but we know values close in time should be similar. We'll address the missing values in the following manner:
- Drop rows with missing target values (i.e. rougher.output.recovery and final.output.recovery)
- Group each dataset by calendar day (dt.date) and forward fill missing values within each day
- Drop the rows that still contain missing values

To facilitate this operation, we'll define the function <code>fill_and_drop()</code> to automate the last two steps for each dataframe (rows with missing targets are dropped separately, before the function is called):

**Parameters**
- *df*: DataFrame object containing a project dataset

**Returns**: DataFrame with all missing values replaced/dropped.

We'll verify the changes by repeating the count of rows with missing values used above.

In [8]:
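def fill_and_drop(df):
    """Forward fill gaps within each calendar day, then drop rows still missing values."""
    # Grouping by calendar day keeps values from being carried across days
    df = df.groupby(df['date'].dt.date).ffill()

    return df.dropna()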
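
To illustrate how the grouped forward fill behaves, here is a small sketch on a hypothetical four-row frame that spans two days. The column `x` and its values are made up for the demonstration.

```python
import numpy as np
import pandas as pd

def fill_and_drop(df):
    df = df.groupby(df['date'].dt.date).ffill()
    return df.dropna()

toy = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-01 22:00:00', '2016-01-01 23:00:00',
                            '2016-01-02 00:00:00', '2016-01-02 01:00:00']),
    'x': [1.0, np.nan, np.nan, 4.0],
})

# The 23:00 gap is filled from 22:00 on the same day; the 00:00 gap has no
# earlier value on its own day, so it stays missing and the row is dropped
result = fill_and_drop(toy)
print(result['x'].tolist())
```

A plain `ffill()` without grouping would have carried the value from January 1 into January 2, which is the behavior the grouping avoids.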

In [9]:
nan_target_rows = gold_full[gold_full['rougher.output.recovery'].isna() | gold_full['final.output.recovery'].isna()]

gold_full.drop(nan_target_rows.index, inplace=True)
gold_train.drop((gold_train.query('date in @nan_target_rows.date')).index, inplace=True)
gold_test.drop((gold_test.query('date in @nan_target_rows.date')).index, inplace=True)

gold_full = fill_and_drop(gold_full)
gold_train = fill_and_drop(gold_train)
gold_test = fill_and_drop(gold_test)

print('Number of rows with at least one missing value in source: ', gold_full[gold_full.isna().sum(axis=1) > 0].shape)
print('Number of rows with at least one missing value in training: ', gold_train[gold_train.isna().sum(axis=1) > 0].shape)
print('Number of rows with at least one missing value in test: ', gold_test[gold_test.isna().sum(axis=1) > 0].shape)

Number of rows with at least one missing value in source:  (0, 87)
Number of rows with at least one missing value in training:  (0, 87)
Number of rows with at least one missing value in test:  (0, 53)


**Checking for Duplicate Values**

We'll use the source dataset to determine if there are any duplicates, and drop any applicable entries.

In [10]:
print('Number of duplicates: ', gold_full.duplicated().sum())

Number of duplicates:  0


There are no duplicate rows in the source dataset, so nothing needs to be dropped.
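
Since each entry is expected to be uniquely identified by its date, it can also be useful to check for repeated dates, which a full-row check would miss if any other value differs. A minimal sketch on hypothetical data:

```python
import pandas as pd

toy = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-01 00:00:00', '2016-01-01 01:00:00',
                            '2016-01-01 01:00:00']),
    'x': [1.0, 2.0, 3.0],
})

# Full-row check: the rows differ in x, so nothing is flagged
full_row_dupes = toy.duplicated().sum()

# Key-only check: the second 01:00:00 entry is flagged
date_dupes = toy.duplicated(subset=['date']).sum()

print(full_row_dupes, date_dupes)
```

On the project data, the same key-only check could be run on each of the three frames before relying on the date column as a join key.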

**Adding Missing Columns to Test Set**

As previously identified, the test set lacks 34 columns that are present in the training set, including both targets. Since the model should make predictions at the start of the refining process, it would not have access to the data produced after each phase. As such, our model will only use the features that the training and test sets have in common. We will, however, need to add the target data to the test set so that predictions can be evaluated.

In [11]:
gold_test = gold_test.merge(gold_full[['date', 'rougher.output.recovery', 'final.output.recovery']], on='date', how='inner')
gold_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5220 entries, 0 to 5219
Data columns (total 55 columns):
 #   Column                                      Non-Null Count  Dtype         
---  ------                                      --------------  -----         
 0   date                                        5220 non-null   datetime64[ns]
 1   primary_cleaner.input.sulfate               5220 non-null   float64       
 2   primary_cleaner.input.depressant            5220 non-null   float64       
 3   primary_cleaner.input.feed_size             5220 non-null   float64       
 4   primary_cleaner.input.xanthate              5220 non-null   float64       
 5   primary_cleaner.state.floatbank8_a_air      5220 non-null   float64       
 6   primary_cleaner.state.floatbank8_a_level    5220 non-null   float64       
 7   primary_cleaner.state.floatbank8_b_air      5220 non-null   float64       
 8   primary_cleaner.state.floatbank8_b_level    5220 non-null   float64       
 9   primary_
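
A merge on date silently multiplies rows if a date repeats on either side. One optional safeguard is the `validate` argument of `DataFrame.merge`, which raises an error when the expected key relationship does not hold. The sketch below uses hypothetical frames named `features` and `targets`.

```python
import pandas as pd

features = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03']),
    'feed': [1.0, 2.0, 3.0],
})
targets = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-01', '2017-01-03']),
    'recovery': [80.0, 85.0],
})

# An inner join keeps only dates present in both frames;
# validate raises MergeError if a date is duplicated on either side
merged = features.merge(targets, on='date', how='inner', validate='one_to_one')
print(len(merged))
```

With an inner join, feature rows that have no matching target row are dropped, so the row count after merging is worth comparing against the count before.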

## Analyze the Data ##

**Metal Concentration Changes During Purification**

We'll observe changes in Au, Ag, and Pb concentrate levels after each stage of cleaning.

In [12]:
concentrate_dict = {'Gold': ['primary_cleaner.output.concentrate_au', 'final.output.concentrate_au'],
                    'Silver': ['primary_cleaner.output.concentrate_ag', 'final.output.concentrate_ag'],
                    'Lead': ['primary_cleaner.output.concentrate_pb', 'final.output.concentrate_pb']}

fig1 = go.Figure()
fig2 = go.Figure()

for key, val in concentrate_dict.items():
    fig1.add_trace(go.Box(x=gold_full[val[0]], name=key)) # List index 0 for primary_cleaner.output.concentrate
    fig2.add_trace(go.Box(x=gold_full[val[1]], name=key)) # List index 1 for final.output.concentrate

fig1.update_layout(title='Metal Concentrate After Primary Stage Purification',
                    xaxis_title='Concentrate Level',
                    yaxis_title='Metal')

fig2.update_layout(title='Metal Concentrate After Secondary Stage Purification',
                    xaxis_title='Concentrate Level',
                    yaxis_title='Metal')

fig1.show()
fig2.show()

Between the primary cleaner output and the final output, the gold concentration increases while the silver concentration decreases. The lead concentration remains about the same.
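
Box plots can be complemented with a numeric summary such as the median per stage. The sketch below shows the idea on a made-up frame; the values are placeholders and do not reproduce the project's results.

```python
import pandas as pd

# Hypothetical concentrations, for illustration only
toy = pd.DataFrame({
    'primary_cleaner.output.concentrate_au': [30.0, 32.0, 34.0],
    'final.output.concentrate_au': [43.0, 44.0, 46.0],
    'primary_cleaner.output.concentrate_ag': [8.0, 9.0, 10.0],
    'final.output.concentrate_ag': [4.0, 5.0, 6.0],
})

stages = ['primary_cleaner.output', 'final.output']
metals = ['au', 'ag']

# One row per stage, one column per metal
summary = pd.DataFrame(
    {metal: [toy[f'{stage}.concentrate_{metal}'].median() for stage in stages]
     for metal in metals},
    index=stages,
)
print(summary)
```

Reading down a column of such a table shows directly whether a metal's typical concentration rises or falls between stages.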

**Distribution of Feed Particle Size**

Using a histogram (normalized by density), we'll compare the feed particle size distributions in the training and test sets.

In [13]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=gold_train['rougher.input.feed_size'], histnorm='density', name='Feed Size: Train'))
fig.add_trace(go.Histogram(x=gold_test['rougher.input.feed_size'], histnorm='density', name='Feed Size: Test', opacity=0.7))
fig.update_layout(barmode='overlay',
                title='Feed Particle Size Distribution',
                xaxis_title='Particle Size',
                yaxis_title='Density')
fig.show()

Both datasets appear to have a negative skew, with the majority of their feed particle sizes between about 40 and 100. The two distributions also differ in shape: the test set has a larger share of particles between 40 and 50, whereas the training set's particle sizes are more dispersed overall.
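
A visual comparison can be backed by summary statistics. The sketch below compares quantiles and the skewness coefficient of two samples, where a positive coefficient indicates a longer right tail and a negative one a longer left tail. The samples here are synthetic stand-ins generated with NumPy, not the project's feed size columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)

# Synthetic stand-ins for the two feed size columns (not project data)
train_like = pd.Series(rng.lognormal(mean=4.0, sigma=0.4, size=5000))
test_like = pd.Series(rng.lognormal(mean=3.9, sigma=0.3, size=2000))

# Side-by-side count, mean, std, min, quartiles, and max
comparison = pd.DataFrame({
    'train': train_like.describe(),
    'test': test_like.describe(),
})

# Append the sample skewness of each series as an extra row
comparison.loc['skew'] = [train_like.skew(), test_like.skew()]
print(comparison)
```

Running the same comparison on `rougher.input.feed_size` from each set would quantify both the direction of the skew and how far apart the quartiles are.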

**Total Concentrations of All Substances**

We'll look at the total distribution of the substance concentrations to identify if the dataset has any anomalies.

In [14]:
rougher_input_labels = ['rougher.input.feed_au', 'rougher.input.feed_ag', 'rougher.input.feed_pb']
rougher_output_labels = ['rougher.output.concentrate_au', 'rougher.output.concentrate_ag', 'rougher.output.concentrate_pb']
pri_clean_output_labels = ['primary_cleaner.output.concentrate_au', 'primary_cleaner.output.concentrate_ag', 'primary_cleaner.output.concentrate_pb']
final_output_labels = ['final.output.concentrate_au', 'final.output.concentrate_ag', 'final.output.concentrate_pb']

fig = go.Figure()
fig.add_trace(go.Histogram(x=gold_full[rougher_input_labels].sum(axis=1), name='Rougher Input Feed'))
fig.add_trace(go.Histogram(x=gold_full[rougher_output_labels].sum(axis=1), name='Rougher Output Concentrate', opacity=0.7))
fig.add_trace(go.Histogram(x=gold_full[pri_clean_output_labels].sum(axis=1), name='Primary Cleaner Output Concentrate', opacity=0.7))
fig.add_trace(go.Histogram(x=gold_full[final_output_labels].sum(axis=1), name='Final Output Concentrate', opacity=0.7))
fig.update_layout(barmode='overlay',
                title='Total Distribution of Substances by Phase',
                xaxis_title='Total Concentration',
                yaxis_title='Count')
fig.show()

It appears there are about 600 total instances across the four phases where the combined concentration of all the metals is at or near zero. We'll examine these datapoints to determine whether they are anomalies and, if so, remove them.
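
One way to flag such rows is to compute the total metal concentration per stage and mark any row where at least one stage falls below a cutoff. The sketch below uses a hypothetical three-row frame with two stages; the 0.25 cutoff mirrors the one used in this notebook, and the names `stage_columns`, `near_zero`, and `cleaned` are illustrative.

```python
import pandas as pd

# Hypothetical rows: one with an empty rougher output, one near zero throughout
toy = pd.DataFrame({
    'rougher.input.feed_au': [7.1, 0.01, 6.8],
    'rougher.input.feed_ag': [7.5, 0.01, 6.0],
    'rougher.input.feed_pb': [2.8, 0.01, 2.7],
    'rougher.output.concentrate_au': [0.0, 0.01, 18.5],
    'rougher.output.concentrate_ag': [0.0, 0.01, 10.8],
    'rougher.output.concentrate_pb': [0.0, 0.01, 7.5],
})

stage_columns = {
    'rougher_input': ['rougher.input.feed_au', 'rougher.input.feed_ag',
                      'rougher.input.feed_pb'],
    'rougher_output': ['rougher.output.concentrate_au',
                       'rougher.output.concentrate_ag',
                       'rougher.output.concentrate_pb'],
}
threshold = 0.25

# Total concentration per stage, one column per stage
totals = pd.DataFrame({name: toy[cols].sum(axis=1)
                       for name, cols in stage_columns.items()})

# A row is anomalous if any stage total is below the cutoff
near_zero = (totals < threshold).any(axis=1)
cleaned = toy[~near_zero]
print(cleaned.index.tolist())
```

Building a single boolean mask keeps the filtering step to one line and makes it easy to report how many rows each stage contributes before anything is removed.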

In [15]:
concentrate_cols = ['rougher.input.feed_au', 'rougher.input.feed_ag', 'rougher.input.feed_pb', 'rougher.input.feed_sol',
                'rougher.output.concentrate_au', 'rougher.output.concentrate_ag', 'rougher.output.concentrate_pb', 'rougher.output.concentrate_sol',
                'primary_cleaner.output.concentrate_au', 'primary_cleaner.output.concentrate_ag', 'primary_cleaner.output.concentrate_pb', 'primary_cleaner.output.concentrate_sol',
                'final.output.concentrate_au', 'final.output.concentrate_ag', 'final.output.concentrate_pb', 'final.output.concentrate_sol']

display(gold_full[gold_full[rougher_input_labels].sum(axis=1) < 0.25][concentrate_cols])
display(gold_full[gold_full[rougher_output_labels].sum(axis=1) < 0.25][concentrate_cols])
display(gold_full[gold_full[pri_clean_output_labels].sum(axis=1) < 0.25][concentrate_cols])
display(gold_full[gold_full[final_output_labels].sum(axis=1) < 0.25][concentrate_cols])

Unnamed: 0,rougher.input.feed_au,rougher.input.feed_ag,rougher.input.feed_pb,rougher.input.feed_sol,rougher.output.concentrate_au,rougher.output.concentrate_ag,rougher.output.concentrate_pb,rougher.output.concentrate_sol,primary_cleaner.output.concentrate_au,primary_cleaner.output.concentrate_ag,primary_cleaner.output.concentrate_pb,primary_cleaner.output.concentrate_sol,final.output.concentrate_au,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol
18891,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,36.306431,7.925334,9.014648,10.577148,45.270618,5.413548,9.389648,8.731319
18892,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,36.306431,7.925334,9.014648,10.577148,45.270618,5.413548,9.389648,8.731319
18893,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,36.306431,7.925334,9.014648,10.577148,45.270618,5.413548,9.389648,8.731319
18894,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,36.306431,7.925334,9.014648,10.577148,45.270618,5.413548,9.389648,8.731319
18895,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,28.410152,6.203355,7.05569,8.278268,35.424183,4.238007,7.349108,6.833999


Unnamed: 0,rougher.input.feed_au,rougher.input.feed_ag,rougher.input.feed_pb,rougher.input.feed_sol,rougher.output.concentrate_au,rougher.output.concentrate_ag,rougher.output.concentrate_pb,rougher.output.concentrate_sol,primary_cleaner.output.concentrate_au,primary_cleaner.output.concentrate_ag,primary_cleaner.output.concentrate_pb,primary_cleaner.output.concentrate_sol,final.output.concentrate_au,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol
45,7.114065,7.521974,2.811569,41.154430,0.00,0.00,0.00,0.00,37.633572,8.513177,9.724797,19.120964,46.614456,4.566664,10.406178,4.588698
46,7.651522,7.313187,2.973841,41.983063,0.00,0.00,0.00,0.00,37.718947,8.466262,9.780929,19.220963,46.250638,4.577832,10.691605,5.274175
47,5.587750,7.934791,1.763437,41.983063,0.00,0.00,0.00,0.00,38.624404,8.092185,9.719728,19.077471,46.663335,4.591462,10.628846,5.688961
51,7.039270,7.835670,2.770323,38.010139,0.00,0.00,0.00,0.00,34.822988,9.877097,8.932357,19.111308,45.267942,5.759800,9.537415,3.997737
52,7.321526,8.027547,3.073579,39.417952,0.00,0.00,0.00,0.00,32.413685,10.806964,9.382084,19.502217,43.237493,5.807253,10.439505,6.512951
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18891,0.010000,0.010000,0.010000,0.010000,0.01,0.01,0.01,0.01,36.306431,7.925334,9.014648,10.577148,45.270618,5.413548,9.389648,8.731319
18892,0.010000,0.010000,0.010000,0.010000,0.01,0.01,0.01,0.01,36.306431,7.925334,9.014648,10.577148,45.270618,5.413548,9.389648,8.731319
18893,0.010000,0.010000,0.010000,0.010000,0.01,0.01,0.01,0.01,36.306431,7.925334,9.014648,10.577148,45.270618,5.413548,9.389648,8.731319
18894,0.010000,0.010000,0.010000,0.010000,0.01,0.01,0.01,0.01,36.306431,7.925334,9.014648,10.577148,45.270618,5.413548,9.389648,8.731319


Unnamed: 0,rougher.input.feed_au,rougher.input.feed_ag,rougher.input.feed_pb,rougher.input.feed_sol,rougher.output.concentrate_au,rougher.output.concentrate_ag,rougher.output.concentrate_pb,rougher.output.concentrate_sol,primary_cleaner.output.concentrate_au,primary_cleaner.output.concentrate_ag,primary_cleaner.output.concentrate_pb,primary_cleaner.output.concentrate_sol,final.output.concentrate_au,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol
19,6.801254,6.046063,2.777324,40.381002,18.511902,10.788951,7.537074,26.092838,0.00,0.00,0.00,0.00,42.509402,5.658943,10.436002,5.942418
22,6.610732,5.599324,2.525838,41.302359,18.089134,10.958096,7.267608,25.911055,0.00,0.00,0.00,0.00,41.406172,6.118749,10.483007,6.546983
30,6.524249,5.689557,2.508414,43.042457,17.583602,11.574823,7.384216,25.740506,0.00,0.00,0.00,0.00,44.059908,5.322681,9.577672,4.805490
73,7.120089,6.994296,2.875142,39.637217,18.414367,10.350497,7.456147,24.962090,0.00,0.00,0.00,0.00,45.135616,4.677499,11.304745,4.655438
76,7.103887,7.844640,2.653501,38.767396,18.614079,10.585663,7.076596,24.024872,0.00,0.00,0.00,0.00,43.264258,4.821388,10.235502,1.581066
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22516,10.611729,12.018571,5.802394,36.939986,19.286963,15.310860,8.999071,30.634365,0.01,0.01,0.01,0.01,46.335757,4.442023,9.238301,8.544080
22517,10.538937,11.879119,5.916240,36.748329,19.694488,15.234636,9.126429,31.524008,0.01,0.01,0.01,0.01,44.961086,4.916341,9.644849,8.945859
22518,10.379927,11.407120,6.154015,36.986787,19.411162,14.731266,9.299917,31.440318,0.01,0.01,0.01,0.01,44.519141,5.134035,9.683889,9.203260
22519,7.549261,8.270800,4.492912,25.538641,11.142676,8.349079,5.466212,18.159502,0.01,0.01,0.01,0.01,35.313266,4.503756,8.387537,7.408152


Unnamed: 0,rougher.input.feed_au,rougher.input.feed_ag,rougher.input.feed_pb,rougher.input.feed_sol,rougher.output.concentrate_au,rougher.output.concentrate_ag,rougher.output.concentrate_pb,rougher.output.concentrate_sol,primary_cleaner.output.concentrate_au,primary_cleaner.output.concentrate_ag,primary_cleaner.output.concentrate_pb,primary_cleaner.output.concentrate_sol,final.output.concentrate_au,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol
707,5.598971,7.407737,1.934178,31.762853,15.935190,11.803589,7.143475,21.692525,26.238846,8.986696,6.530607,11.969464,0.00,0.00,0.00,0.00
1354,7.786147,8.098125,2.588661,34.896012,18.682162,10.640168,7.565854,24.793448,34.564456,8.701237,7.234158,10.754321,0.00,0.00,0.00,0.00
1355,7.499248,7.832467,2.435125,34.196525,18.231121,10.415491,7.367937,24.200400,34.399450,8.697952,7.119325,10.194141,0.00,0.00,0.00,0.00
1356,7.032278,7.600277,2.265148,31.994985,18.190655,10.197899,7.364725,22.839147,34.303756,8.921656,6.637383,10.291689,0.00,0.00,0.00,0.00
1357,7.021635,7.436860,2.265435,32.541806,18.194323,9.968103,7.282737,22.918014,33.956378,9.239378,6.249012,10.464946,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16861,3.985905,7.342727,1.116701,35.105592,22.586063,16.139433,6.653230,35.403157,6.357625,0.865060,5.587420,4.361046,0.00,0.00,0.00,0.00
17061,7.275802,8.055075,3.913440,29.234975,20.792589,15.965736,11.492747,33.186725,8.409608,2.342683,11.363321,13.633887,0.00,0.00,0.00,0.00
17084,11.604771,12.464094,5.393215,40.733771,19.442973,14.424928,8.120457,32.367988,34.015777,8.415798,10.210094,11.008945,0.00,0.00,0.00,0.00
17085,11.763586,12.726401,5.630816,39.556916,17.784973,14.028955,8.029784,29.570775,27.937329,11.517548,11.462232,15.168948,0.00,0.00,0.00,0.00


Upon closer inspection, the aforementioned data points are all either 0.0 or 0.01, while the other features for those entries appear to have normal values. These are likely anomalies, so we'll remove them from the training and test sets to keep them from adversely affecting our results. We'll verify the changes by displaying the full dataframe filtered with the previous conditions (the filtered dataframes should be empty) and by comparing the dataset shapes.

In [16]:
date_reference = gold_full[(gold_full[rougher_input_labels].sum(axis=1) < 0.25) |
                        (gold_full[rougher_output_labels].sum(axis=1) < 0.25) |
                        (gold_full[pri_clean_output_labels].sum(axis=1) < 0.25) |
                        (gold_full[final_output_labels].sum(axis=1) < 0.25)]['date']

gold_full = gold_full.drop(gold_full.query('date in @date_reference').index)
gold_train = gold_train.drop(gold_train.query('date in @date_reference').index)
gold_test = gold_test.drop(gold_test.query('date in @date_reference').index)

display(gold_full[gold_full[rougher_input_labels].sum(axis=1) < 0.25])
display(gold_full[gold_full[rougher_output_labels].sum(axis=1) < 0.25])
display(gold_full[gold_full[pri_clean_output_labels].sum(axis=1) < 0.25])
display(gold_full[gold_full[final_output_labels].sum(axis=1) < 0.25])

print('Full dataset shape: ', gold_full.shape)
print('Training set shape: ', gold_train.shape)
print('Test set shape: ', gold_test.shape)

Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level


Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level


Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level


Unnamed: 0,date,final.output.concentrate_ag,final.output.concentrate_pb,final.output.concentrate_sol,final.output.concentrate_au,final.output.recovery,final.output.tail_ag,final.output.tail_pb,final.output.tail_sol,final.output.tail_au,...,secondary_cleaner.state.floatbank4_a_air,secondary_cleaner.state.floatbank4_a_level,secondary_cleaner.state.floatbank4_b_air,secondary_cleaner.state.floatbank4_b_level,secondary_cleaner.state.floatbank5_a_air,secondary_cleaner.state.floatbank5_a_level,secondary_cleaner.state.floatbank5_b_air,secondary_cleaner.state.floatbank5_b_level,secondary_cleaner.state.floatbank6_a_air,secondary_cleaner.state.floatbank6_a_level


Full dataset shape:  (16931, 87)
Training set shape:  (11867, 87)
Test set shape:  (5064, 55)


The datasets have the anomalies removed. We can now move on to building, training, and evaluating the models.
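As a small, self-contained illustration of the removal step, the sketch below applies the same "sum of a stage's concentrations is below 0.25" rule to a toy dataframe. The column names and values are made up for the example and are not taken from the Zyfra data.

```python
import pandas as pd

# Toy frame with two hypothetical stages, each with two concentration columns
toy = pd.DataFrame({
    'date': ['d1', 'd2', 'd3', 'd4'],
    'stage_a.au': [7.1, 0.00, 6.8, 0.01],
    'stage_a.ag': [7.5, 0.00, 6.0, 0.01],
    'stage_b.au': [18.5, 19.1, 0.00, 0.01],
    'stage_b.ag': [10.8, 11.2, 0.00, 0.01],
})
stage_a_labels = ['stage_a.au', 'stage_a.ag']
stage_b_labels = ['stage_b.au', 'stage_b.ag']

# A row is flagged when the summed concentration of any stage is near zero
is_anomaly = ((toy[stage_a_labels].sum(axis=1) < 0.25) |
              (toy[stage_b_labels].sum(axis=1) < 0.25))
anomaly_dates = toy.loc[is_anomaly, 'date']

# Filtering by date lets the same timestamps be removed from other frames too
toy_clean = toy[~toy['date'].isin(anomaly_dates)]
print(toy_clean['date'].tolist())  # ['d1']
```

Only the first row survives, because every other row has at least one stage whose concentrations sum to less than 0.25.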

## Preparing Models ##

**Evaluation Metric**

Before we build and train the models, we'll need a function to calculate the evaluation metric, sMAPE. The sMAPE is calculated according to the following formula:

$\text{sMAPE} = {1 \over n} \displaystyle\sum_{i=1}^{n} {{|y_i - \hat{y_i}|} \over {(|y_i| + |\hat{y_i}|)/2}} \times 100\%$
- $y_i$: target value for observation $i$
- $\hat{y_i}$: predicted target value for observation $i$
- $n$: number of observations in the sample

We'll define the function, <code>calculate_smape()</code>, to calculate the sMAPE of the predicted values using vector operations:

**Parameters**
- *actual*: List or Array containing the actual target values.
- *predict*: List or Array containing the predicted target values.

**Returns**: The sMAPE for the predicted target values.

In [17]:
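def calculate_smape(actual, predict):
    # Convert to arrays so plain Python lists are accepted, as documented above
    actual, predict = np.asarray(actual), np.asarray(predict)
    return 100*(np.abs(actual - predict) / ((np.abs(actual) + np.abs(predict))/2)).mean()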
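To make the formula concrete, here is a short worked example with two made-up observations. The function is restated so the snippet runs on its own, and the inputs are illustrative only. The second call shows an edge case worth keeping in mind: when an actual value and its prediction are both 0, the term is 0/0 and the mean becomes NaN.

```python
import numpy as np

def calculate_smape(actual, predict):
    # Same formula as in the cell above, restated for a standalone example
    return 100*(np.abs(actual - predict) / ((np.abs(actual) + np.abs(predict))/2)).mean()

actual = np.array([100.0, 200.0])
predict = np.array([110.0, 180.0])

# Per-observation terms are 10/105 and 20/190; their mean times 100 is about 10.03
print(round(calculate_smape(actual, predict), 2))  # 10.03

# Edge case: a pair of zeros produces 0/0, so the overall result is NaN
with np.errstate(invalid='ignore'):
    zero_case = calculate_smape(np.array([0.0, 50.0]), np.array([0.0, 50.0]))
print(zero_case)  # nan
```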

To facilitate the calculation of the final metric, we'll define the function <code>calculate_final_smape</code> that is compatible with the *scoring* parameter from *sklearn.model_selection.cross_val_score*, as follows:

**Parameters**
- *estimator*: Trained model to generate predictions from a given array-like of shape (n_samples, n_features).
- *features*: Array-like of shape (n_samples, n_features) containing the validation/test features.
- *target*: Array-like of shape (n_samples) containing the true values of the target variable(s).

**Returns**: The final sMAPE for the predicted targets according to the equation below.
- $\text{Final sMAPE} = 25\% \times \text{sMAPE(rougher)} + 75\% \times \text{sMAPE(final)}$

In [18]:
def calculate_final_smape(estimator, features, target):
    prediction = estimator.predict(features)
    rougher_prediction, final_prediction = prediction[:, 0], prediction[:, 1]

    return 0.25*calculate_smape(target['rougher.output.recovery'].to_list(), rougher_prediction) + 0.75*calculate_smape(target['final.output.recovery'].to_list(), final_prediction)
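The following sketch shows how a scorer with this signature can be passed to `cross_val_score`. It uses synthetic data and a plain `LinearRegression` purely for illustration; the feature names `f1` to `f3` and the generated values are assumptions, and only the two target column names match the project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def calculate_smape(actual, predict):
    actual, predict = np.asarray(actual), np.asarray(predict)
    return 100*(np.abs(actual - predict) / ((np.abs(actual) + np.abs(predict))/2)).mean()

def calculate_final_smape(estimator, features, target):
    prediction = estimator.predict(features)
    rougher_prediction, final_prediction = prediction[:, 0], prediction[:, 1]
    return (0.25*calculate_smape(target['rougher.output.recovery'], rougher_prediction)
            + 0.75*calculate_smape(target['final.output.recovery'], final_prediction))

# Synthetic stand-in data (not the Zyfra dataset)
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(60, 3)), columns=['f1', 'f2', 'f3'])
target = pd.DataFrame({
    'rougher.output.recovery': 80 + 2*features['f1'] + rng.normal(scale=0.5, size=60),
    'final.output.recovery': 65 + 3*features['f2'] + rng.normal(scale=0.5, size=60),
})

# The callable is invoked as scorer(fitted_estimator, X_fold, y_fold) for each fold
scores = cross_val_score(LinearRegression(), features, target, cv=3,
                         scoring=calculate_final_smape)
print(scores, scores.mean())
```

Because the target is a two-column dataframe, each validation fold arrives at the scorer as a dataframe as well, which is why the column lookups inside the function work.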

**Feature and Target Selection**

As noted above, we can only use features available in the test set when training our models. We already have a variable from a previous section, training_columns, that contains the features not available in the test set. Additionally, we'll need to exclude the date column.

In [19]:
features_train = gold_train.drop(training_columns.columns, axis=1)
features_train.drop('date', axis=1, inplace=True)
target_train = gold_train[['rougher.output.recovery', 'final.output.recovery']] # index 0, 1 for rougher recovery and final recovery, respectively

print(features_train.shape)
print(target_train.shape)

(11867, 52)
(11867, 2)
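For readers who do not have the earlier section at hand, the idea behind this selection can be sketched on toy data. How training_columns was actually built is not shown here, so the column-difference approach below is an assumption, and all column names and values are hypothetical.

```python
import pandas as pd

# Toy frames: the training data carries columns that the test data lacks
toy_train = pd.DataFrame({
    'date': ['d1', 'd2'],
    'feed_au': [7.1, 6.8],
    'output_tail_au': [1.2, 1.1],
    'recovery': [85.0, 87.0],
})
toy_test = pd.DataFrame({'date': ['d3'], 'feed_au': [7.4]})

# Columns present in the training data but absent from the test data
missing_in_test = toy_train.columns.difference(toy_test.columns)

# Keep only the shared columns, then drop the non-numeric date column
toy_features = toy_train.drop(columns=missing_in_test).drop(columns='date')
print(list(missing_in_test))       # ['output_tail_au', 'recovery']
print(list(toy_features.columns))  # ['feed_au']
```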


## Model Training, Tuning, and Evaluation ##

We'll train the following regression models and compare the sMAPE metrics using the cross-validation technique:
- Decision Tree Regressor
- Random Forest Regressor
- Linear Regression

In [20]:
tree_model = DecisionTreeRegressor(random_state=12345)
tree_score = cross_val_score(tree_model, features_train, target_train, cv=3, scoring=calculate_final_smape)

print('Decision Tree Regressor Scores: ', tree_score)
print('Mean: ', tree_score.mean())

Decision Tree Regressor Scores:  [14.26200046 10.49294257 19.17443393]
Mean:  14.643125653427758


In [21]:
forest_model = RandomForestRegressor(random_state=12345)
forest_score = cross_val_score(forest_model, features_train, target_train, cv=3, scoring=calculate_final_smape)

print('Random Forest Regressor Scores: ', forest_score)
print('Mean: ', forest_score.mean())

Random Forest Regressor Scores:  [ 8.97016385  6.99243235 13.09600348]
Mean:  9.686199893124433


In [22]:
linear_model = LinearRegression()
linear_score = cross_val_score(linear_model, features_train, target_train, cv=3, scoring=calculate_final_smape)

print('Linear Regression Scores: ', linear_score)
print('Mean: ', linear_score.mean())

Linear Regression Scores:  [ 9.91684498  7.81307797 13.86853805]
Mean:  10.532820335884663


With the default parameters, we now have baseline results for each model. Before we select a model for our final test, we'll tune a few hyperparameters to see if we can improve model quality. Linear Regression is left as is, since it has few hyperparameters worth tuning.

**Decision Tree Regressor Tuning**

For the Decision Tree Regressor, we'll tune the hyperparameters max_depth, min_samples_leaf, and min_samples_split.

- *Note: the best score was obtained with the default values for min_samples_split and min_samples_leaf. As such, only max_depth is tuned in the final code block for the sake of runtime.*

In [23]:
best_tree = tree_model
best_tree_score = tree_score
best_tree_depth = None

for depth in range(1, 30):
    model = DecisionTreeRegressor(random_state=12345, max_depth=depth)
    score = cross_val_score(model, features_train, target_train, cv=3, scoring=calculate_final_smape)
    if score.mean() < best_tree_score.mean():
        best_tree = model
        best_tree_score = score
        best_tree_depth = depth

print('Best score: ', best_tree_score)
print('Mean: ', best_tree_score.mean())
print('Depth of best decision tree: ', best_tree_depth)

Best score:  [9.26427087 8.61888918 9.68944259]
Mean:  9.190867548022487
Depth of best decision tree:  1


The Decision Tree Regressor's sMAPE improved: at the best depth of 1, the mean score fell from 14.64% to 9.19%, an improvement of about 5.5 percentage points.

**Random Forest Regressor Tuning**

When tuning the Random Forest Regressor, we'll first fix the number of estimators at an arbitrary value (10) and iterate through different tree depths. Once we've found the best depth, we'll iterate through different numbers of estimators using that value for max_depth.

In [24]:
best_forest = forest_model
best_forest_score = forest_score
best_forest_depth = None
best_est = 10 # our selected arbitrary start value

for depth in range(1, 30):
    model = RandomForestRegressor(random_state=12345, n_estimators=10, max_depth=depth)
    score = cross_val_score(model, features_train, target_train, cv=3, scoring=calculate_final_smape)
    if score.mean() < best_forest_score.mean():
        best_forest = model
        best_forest_score = score
        best_forest_depth = depth

for est in range(1, 50):
    model = RandomForestRegressor(random_state=12345, max_depth=best_forest_depth, n_estimators=est)
    score = cross_val_score(model, features_train, target_train, cv=3, scoring=calculate_final_smape)
    if score.mean() < best_forest_score.mean():
        best_forest = model
        best_forest_score = score
        best_est = est

print('Best score: ', best_forest_score)
print('Mean: ', best_forest_score.mean())
print('Number of estimators: ', best_est)
print('Best depth: ', best_forest_depth)

Best score:  [9.25988656 8.46654577 9.68933644]
Mean:  9.13858959040576
Number of estimators:  27
Best depth:  1


The Random Forest Regressor's sMAPE improved slightly: the mean score fell from 9.69% to 9.14% with 27 estimators and a maximum depth of 1.

## Model Selection and Test Evaluation ##

Based on the results above, the tuned Random Forest Regressor achieved the lowest mean cross-validated sMAPE (9.14%, against 9.19% for the tuned Decision Tree and 10.53% for Linear Regression; lower is better). We'll evaluate the test set using this model.

In [25]:
features_test = gold_test.drop(['date', 'final.output.recovery', 'rougher.output.recovery'], axis=1)
target_test = gold_test[['rougher.output.recovery', 'final.output.recovery']]
best_model = RandomForestRegressor(random_state=12345, n_estimators=best_est, max_depth=best_forest_depth)
best_model.fit(features_train, target_train)
final_score = calculate_final_smape(best_model, features_test, target_test)

print('Final sMAPE of selected model: ', final_score)

Final sMAPE of selected model:  7.189174439534593


Our final sMAPE score using the test set is 7.19%. As a sanity check for model performance, we'll compare our trained model's score to a baseline using a constant model.

In [26]:
dummy = DummyRegressor(strategy='mean')
dummy_score = cross_val_score(dummy, features_train, target_train, cv=3, scoring=calculate_final_smape)
print('Dummy sMAPE score: ', dummy_score)
print('Mean: ', dummy_score.mean())

Dummy sMAPE score:  [10.2832107   7.77387313 10.73106538]
Mean:  9.596049738034447


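Our constant model resulted in a mean cross-validated sMAPE of 9.6% on the training set. The like-for-like figure for the tuned Random Forest is its cross-validated mean of 9.14%, so our model predicts only slightly better than a model that always outputs the mean of the training targets. The 7.19% test score is not directly comparable, because the baseline was not evaluated on the test set.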
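One way to get a directly comparable baseline is to fit the constant model on the training set and score it on the same hold-out set as the selected model. The sketch below illustrates that pattern on synthetic data with a `LinearRegression` standing in for the trained model; the data, the split, and the feature names are assumptions made for the example, not results from this project.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

def calculate_smape(actual, predict):
    actual, predict = np.asarray(actual), np.asarray(predict)
    return 100*(np.abs(actual - predict) / ((np.abs(actual) + np.abs(predict))/2)).mean()

def calculate_final_smape(estimator, features, target):
    prediction = estimator.predict(features)
    return (0.25*calculate_smape(target['rougher.output.recovery'], prediction[:, 0])
            + 0.75*calculate_smape(target['final.output.recovery'], prediction[:, 1]))

# Synthetic stand-in data (not the Zyfra dataset)
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(120, 2)), columns=['f1', 'f2'])
y = pd.DataFrame({
    'rougher.output.recovery': 80 + 5*X['f1'] + rng.normal(scale=0.5, size=120),
    'final.output.recovery': 65 + 5*X['f2'] + rng.normal(scale=0.5, size=120),
})
X_train, X_test = X.iloc[:90], X.iloc[90:]
y_train, y_test = y.iloc[:90], y.iloc[90:]

# Fit both models on the same training rows and score both on the same hold-out rows
dummy = DummyRegressor(strategy='mean').fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)
dummy_test_score = calculate_final_smape(dummy, X_test, y_test)
model_test_score = calculate_final_smape(model, X_test, y_test)
print('Dummy hold-out sMAPE: ', dummy_test_score)
print('Model hold-out sMAPE: ', model_test_score)
```

Applied to this project, the same pattern would use features_train, target_train, features_test, and target_test in place of the synthetic frames.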

## Conclusion ##

In this project we trained and evaluated three different regression models using the cross-validation technique against a custom metric, sMAPE. Based on our findings, we should use the Random Forest Regressor model to predict the rougher output recovery and final output recovery values to help improve Zyfra's gold extraction strategy.

While the final test sMAPE score of 7.2% may be acceptable at face value, it is difficult to determine whether this metric meets Zyfra's expectations or threshold goal. We can look to further improve model quality by engaging Zyfra on feature selection. Specifically, it could be beneficial to know which features are beyond the control of the refining process, as well as which features are more critical to predicting gold extraction. Additionally, we could potentially improve our model quality and evaluation results by using more exhaustive techniques such as GridSearchCV and by exploring other regression models that incorporate contemporary techniques such as gradient boosting. While this could improve results, it would likely come at the cost of longer runtimes.
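As a rough sketch of what a joint search could look like, the example below runs `GridSearchCV` over max_depth and n_estimators together on synthetic data. The grid values and data are illustrative assumptions. Because `GridSearchCV` treats higher scores as better, the sMAPE scorer is wrapped in a hypothetical helper, `negative_final_smape`, that negates it.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def calculate_smape(actual, predict):
    actual, predict = np.asarray(actual), np.asarray(predict)
    return 100*(np.abs(actual - predict) / ((np.abs(actual) + np.abs(predict))/2)).mean()

def calculate_final_smape(estimator, features, target):
    prediction = estimator.predict(features)
    return (0.25*calculate_smape(target['rougher.output.recovery'], prediction[:, 0])
            + 0.75*calculate_smape(target['final.output.recovery'], prediction[:, 1]))

def negative_final_smape(estimator, features, target):
    # GridSearchCV maximizes the score, so negate sMAPE (where lower is better)
    return -calculate_final_smape(estimator, features, target)

# Synthetic stand-in data (not the Zyfra dataset)
rng = np.random.default_rng(2)
features = pd.DataFrame(rng.normal(size=(90, 3)), columns=['f1', 'f2', 'f3'])
target = pd.DataFrame({
    'rougher.output.recovery': 80 + 4*features['f1'] + rng.normal(scale=0.5, size=90),
    'final.output.recovery': 65 + 4*features['f2'] + rng.normal(scale=0.5, size=90),
})

# Every combination in the grid is cross-validated, unlike the sequential loops above
search = GridSearchCV(
    RandomForestRegressor(random_state=12345),
    param_grid={'max_depth': [1, 3], 'n_estimators': [5, 10]},
    scoring=negative_final_smape,
    cv=3,
)
search.fit(features, target)
print('Best parameters: ', search.best_params_)
print('Best mean sMAPE: ', -search.best_score_)
```

The number of fits grows with the product of the grid sizes, which is the runtime cost mentioned above.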