## Outline
1. Import/review data
2. Preprocessing
    * Preprocessing and feature engineering of individual datasets
        * Eliminate unecessary parameters, change datatypes, feature extraction
    * Join datasets together, final preprocessing of full dataset
        * Imputations, further feature extraction
3. EDA
    * Calculate Pearson correlation coefficients and interpret
    * Reduce multicollinearity
4. Model training
    * Dummy (baseline) model
    * Linear Regression model
    * Random Forest Regression model
    * CatBoost model
5. Final testing of best performing models
6. Interpretation of results and conclusion

### Objective
Predict temperature while processing steel to reduce energy consumption and optimize production costs. 

# 1. Import/review data

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor

In [2]:
electrode_df = pd.read_csv('/datasets/data_arc_en.csv')
bulk_supply_df = pd.read_csv('/datasets/data_bulk_en.csv')
bulk_delivery_df = pd.read_csv('/datasets/data_bulk_time_en.csv')
gas_df = pd.read_csv('/datasets/data_gas_en.csv')
temp_df = pd.read_csv('/datasets/data_temp_en.csv')
wire_vol_df = pd.read_csv('/datasets/data_wire_en.csv')
wire_time_df = pd.read_csv('/datasets/data_wire_time_en.csv')

In [3]:
electrode_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14876 entries, 0 to 14875
Data columns (total 5 columns):
key                  14876 non-null int64
Arc heating start    14876 non-null object
Arc heating end      14876 non-null object
Active power         14876 non-null float64
Reactive power       14876 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 581.2+ KB


In [4]:
electrode_df.head()

Unnamed: 0,key,Arc heating start,Arc heating end,Active power,Reactive power
0,1,2019-05-03 11:02:14,2019-05-03 11:06:02,0.976059,0.687084
1,1,2019-05-03 11:07:28,2019-05-03 11:10:33,0.805607,0.520285
2,1,2019-05-03 11:11:44,2019-05-03 11:14:36,0.744363,0.498805
3,1,2019-05-03 11:18:14,2019-05-03 11:24:19,1.659363,1.062669
4,1,2019-05-03 11:26:09,2019-05-03 11:28:37,0.692755,0.414397


In [5]:
bulk_supply_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3129 entries, 0 to 3128
Data columns (total 16 columns):
key        3129 non-null int64
Bulk 1     252 non-null float64
Bulk 2     22 non-null float64
Bulk 3     1298 non-null float64
Bulk 4     1014 non-null float64
Bulk 5     77 non-null float64
Bulk 6     576 non-null float64
Bulk 7     25 non-null float64
Bulk 8     1 non-null float64
Bulk 9     19 non-null float64
Bulk 10    176 non-null float64
Bulk 11    177 non-null float64
Bulk 12    2450 non-null float64
Bulk 13    18 non-null float64
Bulk 14    2806 non-null float64
Bulk 15    2248 non-null float64
dtypes: float64(15), int64(1)
memory usage: 391.2 KB


In [6]:
bulk_supply_df.head()

Unnamed: 0,key,Bulk 1,Bulk 2,Bulk 3,Bulk 4,Bulk 5,Bulk 6,Bulk 7,Bulk 8,Bulk 9,Bulk 10,Bulk 11,Bulk 12,Bulk 13,Bulk 14,Bulk 15
0,1,,,,43.0,,,,,,,,206.0,,150.0,154.0
1,2,,,,73.0,,,,,,,,206.0,,149.0,154.0
2,3,,,,34.0,,,,,,,,205.0,,152.0,153.0
3,4,,,,81.0,,,,,,,,207.0,,153.0,154.0
4,5,,,,78.0,,,,,,,,203.0,,151.0,152.0


In [7]:
bulk_delivery_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3129 entries, 0 to 3128
Data columns (total 16 columns):
key        3129 non-null int64
Bulk 1     252 non-null object
Bulk 2     22 non-null object
Bulk 3     1298 non-null object
Bulk 4     1014 non-null object
Bulk 5     77 non-null object
Bulk 6     576 non-null object
Bulk 7     25 non-null object
Bulk 8     1 non-null object
Bulk 9     19 non-null object
Bulk 10    176 non-null object
Bulk 11    177 non-null object
Bulk 12    2450 non-null object
Bulk 13    18 non-null object
Bulk 14    2806 non-null object
Bulk 15    2248 non-null object
dtypes: int64(1), object(15)
memory usage: 391.2+ KB


In [8]:
bulk_delivery_df.head()

Unnamed: 0,key,Bulk 1,Bulk 2,Bulk 3,Bulk 4,Bulk 5,Bulk 6,Bulk 7,Bulk 8,Bulk 9,Bulk 10,Bulk 11,Bulk 12,Bulk 13,Bulk 14,Bulk 15
0,1,,,,2019-05-03 11:21:30,,,,,,,,2019-05-03 11:03:52,,2019-05-03 11:03:52,2019-05-03 11:03:52
1,2,,,,2019-05-03 11:46:38,,,,,,,,2019-05-03 11:40:20,,2019-05-03 11:40:20,2019-05-03 11:40:20
2,3,,,,2019-05-03 12:31:06,,,,,,,,2019-05-03 12:09:40,,2019-05-03 12:09:40,2019-05-03 12:09:40
3,4,,,,2019-05-03 12:48:43,,,,,,,,2019-05-03 12:41:24,,2019-05-03 12:41:24,2019-05-03 12:41:24
4,5,,,,2019-05-03 13:18:50,,,,,,,,2019-05-03 13:12:56,,2019-05-03 13:12:56,2019-05-03 13:12:56


In [9]:
gas_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3239 entries, 0 to 3238
Data columns (total 2 columns):
key      3239 non-null int64
Gas 1    3239 non-null float64
dtypes: float64(1), int64(1)
memory usage: 50.7 KB


In [10]:
gas_df.head()

Unnamed: 0,key,Gas 1
0,1,29.749986
1,2,12.555561
2,3,28.554793
3,4,18.841219
4,5,5.413692


In [11]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15907 entries, 0 to 15906
Data columns (total 3 columns):
key              15907 non-null int64
Sampling time    15907 non-null object
Temperature      13006 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 372.9+ KB


In [12]:
temp_df.head()

Unnamed: 0,key,Sampling time,Temperature
0,1,2019-05-03 11:16:18,1571.0
1,1,2019-05-03 11:25:53,1604.0
2,1,2019-05-03 11:29:11,1618.0
3,1,2019-05-03 11:30:01,1601.0
4,1,2019-05-03 11:30:39,1613.0


In [13]:
wire_vol_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3081 entries, 0 to 3080
Data columns (total 10 columns):
key       3081 non-null int64
Wire 1    3055 non-null float64
Wire 2    1079 non-null float64
Wire 3    63 non-null float64
Wire 4    14 non-null float64
Wire 5    1 non-null float64
Wire 6    73 non-null float64
Wire 7    11 non-null float64
Wire 8    19 non-null float64
Wire 9    29 non-null float64
dtypes: float64(9), int64(1)
memory usage: 240.8 KB


In [14]:
wire_vol_df.head()

Unnamed: 0,key,Wire 1,Wire 2,Wire 3,Wire 4,Wire 5,Wire 6,Wire 7,Wire 8,Wire 9
0,1,60.059998,,,,,,,,
1,2,96.052315,,,,,,,,
2,3,91.160157,,,,,,,,
3,4,89.063515,,,,,,,,
4,5,89.238236,9.11456,,,,,,,


In [15]:
wire_time_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3081 entries, 0 to 3080
Data columns (total 10 columns):
key       3081 non-null int64
Wire 1    3055 non-null object
Wire 2    1079 non-null object
Wire 3    63 non-null object
Wire 4    14 non-null object
Wire 5    1 non-null object
Wire 6    73 non-null object
Wire 7    11 non-null object
Wire 8    19 non-null object
Wire 9    29 non-null object
dtypes: int64(1), object(9)
memory usage: 240.8+ KB


In [16]:
wire_time_df.head()

Unnamed: 0,key,Wire 1,Wire 2,Wire 3,Wire 4,Wire 5,Wire 6,Wire 7,Wire 8,Wire 9
0,1,2019-05-03 11:11:41,,,,,,,,
1,2,2019-05-03 11:46:10,,,,,,,,
2,3,2019-05-03 12:13:47,,,,,,,,
3,4,2019-05-03 12:48:05,,,,,,,,
4,5,2019-05-03 13:18:15,2019-05-03 13:32:06,,,,,,,


Initial review of the data shows a large number of NaN values that will need to be dealt with. The dataframes also need to be joined.

# 2. Preprocessing

### Electrode preprocessing

In [17]:
electrode_df['Arc heating start'] = pd.to_datetime(
    electrode_df['Arc heating start'], format='%Y-%m-%d %H:%M:%S')
electrode_df['Arc heating end'] = pd.to_datetime(
    electrode_df['Arc heating end'], format='%Y-%m-%d %H:%M:%S')

In [18]:
electrode_df['time_heating'] = (
    electrode_df['Arc heating end'] - electrode_df['Arc heating start'])

In [19]:
keys = []
total_seconds = []
avg_actives = []
avg_reactives = []

for key in range(1,3242):
    keys.append(key)
    interval = pd.to_timedelta(
        electrode_df[electrode_df['key'] == key]['time_heating']).dt.total_seconds().sum()
    total_seconds.append(interval)
    avg_actives.append(electrode_df[electrode_df['key'] == key]['Active power'].mean())
    avg_reactives.append(electrode_df[electrode_df['key'] == key]['Reactive power'].mean())
    

s1 = pd.Series(keys, name='key')
s2 = pd.Series(total_seconds, name='total_seconds')
s3 = pd.Series(avg_actives, name='avg_active_power')
s4 = pd.Series(avg_reactives, name='avg_reactive_power')

electrode_df_agg = pd.concat([s1, s2, s3, s4], axis=1)

In [20]:
electrode_df_agg.head()

Unnamed: 0,key,total_seconds,avg_active_power,avg_reactive_power
0,1,1098.0,0.975629,0.636648
1,2,811.0,0.76315,0.499528
2,3,655.0,0.505176,0.319815
3,4,741.0,0.802313,0.515074
4,5,869.0,0.836793,0.563161


In [21]:
electrode_df_agg.set_index('key', inplace=True)

To preprocess the electrode readings, I converted the time readings to the datetime datatype. I also needed to find a way to both aggregate the data and make it numerical, so that I could train a regression model on it. To do so, I first found the time elapsed between "Arc heating start" and "Arc heating end" to create a new feature that measured the time spent heating, for each observation. Since I could only have one row/observation for each batch key, and this dataset contained multiple readings for many of the batches, I added readings together if there were more than one. The final result was a feature measuring the total heating time for every batch. To aggregate the average active and reactive power readings, I took their means.

Since I would be joining the datasets together at the end. I turned the `key` value into the index. I did this for all of the following datasets as well. 

### Bulk supply/delivery preprocessing

In [22]:
bulk_delivery_df[[
    'Bulk 3', 'Bulk 4', 'Bulk 12', 
    'Bulk 13', 'Bulk 14', 'Bulk 15']] = bulk_delivery_df[[
                                            'Bulk 3', 'Bulk 4', 'Bulk 12', 
                                            'Bulk 13', 'Bulk 14', 'Bulk 15']].apply(
                                                    pd.to_datetime, format='%Y-%m-%d %H:%M:%S')

In [23]:
bulk_delivery_2 = bulk_delivery_df[[
    'key', 'Bulk 3', 'Bulk 4', 'Bulk 12', 'Bulk 13', 'Bulk 14', 'Bulk 15']].copy()

In [24]:
bulk_supply_2 = bulk_supply_df[[
    'key', 'Bulk 3', 'Bulk 4', 'Bulk 12', 'Bulk 13', 'Bulk 14', 'Bulk 15']].copy()

In [25]:
bulk_delivery_2.set_index('key', inplace=True)
bulk_supply_2.set_index('key', inplace=True)

In [26]:
bulk_delivery_2.columns = [
    'bulk3_delivery', 'bulk4_delivery', 'bulk12_delivery', 'bulk13_delivery',
    'bulk14_delivery', 'bulk15_delivery']
bulk_supply_2.columns = ['bulk3_supply', 'bulk4_supply', 'bulk12_supply', 'bulk13_supply', 
                         'bulk14_supply', 'bulk15_supply']

### Wire preprocessing

In [27]:
wire_time_df[['Wire 1', 'Wire 2']] = wire_time_df[['Wire 1', 'Wire 2']].apply(
                                                    pd.to_datetime, format='%Y-%m-%d %H:%M:%S')

In [28]:
wire_vol_2 = wire_vol_df[['key','Wire 1', 'Wire 2']].copy()

In [29]:
wire_time_2 = wire_time_df[['key','Wire 1', 'Wire 2']].copy() 

In [30]:
wire_vol_2.set_index('key', inplace=True)
wire_time_2.set_index('key', inplace=True)

In [31]:
wire_vol_2.columns = ['wire1_volume', 'wire2_volume']
wire_time_2.columns = ['wire1_time', 'wire2_time']

To preprocess the bulk delivery and wire readings, I converted the appropriate columns to the datetime datatype, eliminated the ones that were almost entirely NaN values, and renamed the columns. 

### Temp preprocessing

In [32]:
temp_df['Sampling time'] = pd.to_datetime(temp_df['Sampling time'], format='%Y-%m-%d %H:%M:%S')

In [33]:
keys = []
first_temp_times = []
first_temp_readings = []
last_temp_times = []
last_temp_readings = []

for key in temp_df['key'].unique():
    keys.append(key)
    batch = temp_df[temp_df['key'] == key]
    
    first_temp_times.append(batch.iloc[0, 1])
    first_temp_readings.append(batch.iloc[0, 2])
    last_temp_times.append(batch.iloc[-1, 1])
    last_temp_readings.append(batch.iloc[-1, 2])
    
s1 = pd.Series(keys, name='key')
s2 = pd.Series(first_temp_times, name='first_reading_time')
s3 = pd.Series(first_temp_readings, name='first_temp')
s4 = pd.Series(last_temp_times, name='last_reading_time')
s5 = pd.Series(last_temp_readings, name='last_temp')

temp_df_2 = pd.concat([s1, s2, s3, s4, s5], axis=1)

In [34]:
temp_df_2.head()

Unnamed: 0,key,first_reading_time,first_temp,last_reading_time,last_temp
0,1,2019-05-03 11:16:18,1571.0,2019-05-03 11:30:39,1613.0
1,2,2019-05-03 11:37:27,1581.0,2019-05-03 11:59:12,1602.0
2,3,2019-05-03 12:13:17,1596.0,2019-05-03 12:34:57,1599.0
3,4,2019-05-03 12:52:57,1601.0,2019-05-03 12:59:25,1625.0
4,5,2019-05-03 13:23:19,1576.0,2019-05-03 13:36:01,1602.0


In [35]:
temp_df_2.set_index(keys='key', inplace=True)

Similar to the electrode dataset, the one for temperature readings had multiple observations for each batch. Since I only needed the first and last readings, I extracted those to create one row with both the time and temp values. 

## Join all datasets

In [36]:
gas_df.set_index('key', inplace=True)

In [37]:
df_full = temp_df_2.join([
    electrode_df_agg, bulk_delivery_2, bulk_supply_2, 
    gas_df, wire_vol_2, wire_time_2])

In [38]:
df_full.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3216 entries, 1 to 3241
Data columns (total 24 columns):
first_reading_time    3216 non-null datetime64[ns]
first_temp            3216 non-null float64
last_reading_time     3216 non-null datetime64[ns]
last_temp             2477 non-null float64
total_seconds         3216 non-null float64
avg_active_power      3214 non-null float64
avg_reactive_power    3214 non-null float64
bulk3_delivery        1298 non-null datetime64[ns]
bulk4_delivery        1014 non-null datetime64[ns]
bulk12_delivery       2450 non-null datetime64[ns]
bulk13_delivery       18 non-null datetime64[ns]
bulk14_delivery       2806 non-null datetime64[ns]
bulk15_delivery       2248 non-null datetime64[ns]
bulk3_supply          1298 non-null float64
bulk4_supply          1014 non-null float64
bulk12_supply         2450 non-null float64
bulk13_supply         18 non-null float64
bulk14_supply         2806 non-null float64
bulk15_supply         2248 non-null float64
Gas 1

In order to do some final preprocessing, and to prepare the data for a model, I joined all of the data into one dataframe. Each row represents one batch. 

## Final preprocessing of joined dataframe

In [39]:
def get_seconds_since_initial_temp_reading(time):
    time_diff = time - df_full['first_reading_time']
    return pd.to_timedelta(time_diff).dt.total_seconds()

In [40]:
time_cols = ['bulk3_delivery', 'bulk4_delivery', 'bulk12_delivery', 'bulk13_delivery',
    'bulk14_delivery', 'bulk15_delivery', 'wire1_time', 'wire2_time']

for col in time_cols:
    df_full[col] = df_full[col].apply(get_seconds_since_initial_temp_reading)

In [41]:
df_full.head(1)

Unnamed: 0_level_0,first_reading_time,first_temp,last_reading_time,last_temp,total_seconds,avg_active_power,avg_reactive_power,bulk3_delivery,bulk4_delivery,bulk12_delivery,...,bulk4_supply,bulk12_supply,bulk13_supply,bulk14_supply,bulk15_supply,Gas 1,wire1_volume,wire2_volume,wire1_time,wire2_time
key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,2019-05-03 11:16:18,1571.0,2019-05-03 11:30:39,1613.0,1098.0,0.975629,0.636648,,312.0,-746.0,...,43.0,206.0,,150.0,154.0,29.749986,60.059998,,-277.0,


Once that the data was joined together, I changed the time values to numerical ones. To do so, I calculated the time since the first temperature reading for each column. Ultimately these features still represent time elapsing - in total seconds since the beginning of processing for each batch.

In [42]:
df_full.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3216 entries, 1 to 3241
Data columns (total 24 columns):
first_reading_time    3216 non-null datetime64[ns]
first_temp            3216 non-null float64
last_reading_time     3216 non-null datetime64[ns]
last_temp             2477 non-null float64
total_seconds         3216 non-null float64
avg_active_power      3214 non-null float64
avg_reactive_power    3214 non-null float64
bulk3_delivery        1298 non-null float64
bulk4_delivery        1014 non-null float64
bulk12_delivery       2450 non-null float64
bulk13_delivery       18 non-null float64
bulk14_delivery       2806 non-null float64
bulk15_delivery       2248 non-null float64
bulk3_supply          1298 non-null float64
bulk4_supply          1014 non-null float64
bulk12_supply         2450 non-null float64
bulk13_supply         18 non-null float64
bulk14_supply         2806 non-null float64
bulk15_supply         2248 non-null float64
Gas 1                 3214 non-null float64
wir

In [43]:
df_full = df_full.drop(['bulk3_delivery', 'bulk4_delivery', 'bulk3_supply', 'bulk4_supply',
             'bulk13_supply', 'bulk13_delivery', 'wire2_volume', 'wire2_time'], axis=1)

In [44]:
df_full = df_full[df_full['last_temp'].notnull()]

In [45]:
na_dict = {'avg_reactive_power' : df_full['avg_reactive_power'].mean(),
          'bulk12_delivery' : df_full['bulk12_delivery'].mean(),
          'bulk14_delivery': df_full['bulk14_delivery'].mean(),
          'bulk15_delivery' : df_full['bulk15_delivery'].mean(),
          'bulk12_supply' : 0,
          'bulk14_supply' : 0,
          'bulk15_supply' : 0,
          'Gas 1' : 0,
          'wire1_volume' : 0,
          'wire1_time' : df_full['wire1_volume'].mean()}

df_full = df_full.fillna(value=na_dict)

In [46]:
df_full.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2477 entries, 1 to 2499
Data columns (total 16 columns):
first_reading_time    2477 non-null datetime64[ns]
first_temp            2477 non-null float64
last_reading_time     2477 non-null datetime64[ns]
last_temp             2477 non-null float64
total_seconds         2477 non-null float64
avg_active_power      2475 non-null float64
avg_reactive_power    2477 non-null float64
bulk12_delivery       2477 non-null float64
bulk14_delivery       2477 non-null float64
bulk15_delivery       2477 non-null float64
bulk12_supply         2477 non-null float64
bulk14_supply         2477 non-null float64
bulk15_supply         2477 non-null float64
Gas 1                 2477 non-null float64
wire1_volume          2477 non-null float64
wire1_time            2477 non-null float64
dtypes: datetime64[ns](2), float64(14)
memory usage: 329.0 KB


Since I still had a huge number of NaN values in my data, I dropped more columns that had the highest ratio of them. For the ones that I didn't drop alltogether, I did some imputation. I also dropped all rows where the last temperature reading was missing. Since this is the target value, it would not be helpful to include data that lacks it. 

For the columns representing the volume of a physical object, I replaced missing values with zero. In my opinon, an absent reading probably means that material wasn't added, so zero is an accurate representation of nothing. For all other columns I imputed the mean. 

# 3. Further EDA

In [47]:
corrs = df_full.corr()

print(corrs.iloc[:, 1])

first_temp            0.376690
last_temp             1.000000
total_seconds         0.207950
avg_active_power      0.315809
avg_reactive_power    0.032415
bulk12_delivery      -0.031387
bulk14_delivery      -0.056612
bulk15_delivery      -0.054435
bulk12_supply         0.161764
bulk14_supply         0.007057
bulk15_supply         0.008312
Gas 1                -0.032639
wire1_volume         -0.065547
wire1_time           -0.248266
Name: last_temp, dtype: float64


For some final EDA I wanted to see how variables correlated. That way I could reduce multicollinearity, and get an idea of what kind of regression model would be best. I found that the target value was most correlated with the first temperature reading, total seconds of electrode power, average active electrode power, and wire 1 time. Additionally, the target has a weak inverse relationship with the bulk deliverys, gas 1, and wire volume/time. This probably reflects that as more new material is introduced, the temperature drops. 

In [48]:
df_full.corr()

Unnamed: 0,first_temp,last_temp,total_seconds,avg_active_power,avg_reactive_power,bulk12_delivery,bulk14_delivery,bulk15_delivery,bulk12_supply,bulk14_supply,bulk15_supply,Gas 1,wire1_volume,wire1_time
first_temp,1.0,0.37669,-0.295217,-0.303837,-0.00587,-0.0116,-0.009961,-0.022572,-0.068612,-0.154577,-0.115536,-0.001322,0.071969,-0.089355
last_temp,0.37669,1.0,0.20795,0.315809,0.032415,-0.031387,-0.056612,-0.054435,0.161764,0.007057,0.008312,-0.032639,-0.065547,-0.248266
total_seconds,-0.295217,0.20795,1.0,0.55291,0.044348,-0.023615,-0.031228,-0.023905,0.473242,0.463047,0.306849,0.400571,0.060031,-0.021364
avg_active_power,-0.303837,0.315809,0.55291,1.0,0.059903,0.018931,0.010927,0.005751,0.379135,0.324592,0.270746,0.077271,-0.171806,-0.070118
avg_reactive_power,-0.00587,0.032415,0.044348,0.059903,1.0,0.00284,-0.025608,0.002033,0.041249,0.000684,0.036608,0.000958,0.001318,-0.027378
bulk12_delivery,-0.0116,-0.031387,-0.023615,0.018931,0.00284,1.0,0.930835,0.929148,0.041664,-0.105196,-0.016255,-0.047776,-0.025694,0.802131
bulk14_delivery,-0.009961,-0.056612,-0.031228,0.010927,-0.025608,0.930835,1.0,0.894393,0.029247,-0.101933,-0.042444,-0.039148,-0.016626,0.861594
bulk15_delivery,-0.022572,-0.054435,-0.023905,0.005751,0.002033,0.929148,0.894393,1.0,0.018468,-0.062981,-0.057174,-0.021206,-0.028978,0.795554
bulk12_supply,-0.068612,0.161764,0.473242,0.379135,0.041249,0.041664,0.029247,0.018468,1.0,0.505997,0.607762,0.22835,0.170573,0.041341
bulk14_supply,-0.154577,0.007057,0.463047,0.324592,0.000684,-0.105196,-0.101933,-0.062981,0.505997,1.0,0.308339,0.281024,0.001551,-0.065101


In [49]:
df_full = df_full.drop(['bulk14_delivery', 'bulk15_delivery', 
                        'bulk14_supply', 'bulk15_supply'], axis=1)

In [50]:
df_full.corr()

Unnamed: 0,first_temp,last_temp,total_seconds,avg_active_power,avg_reactive_power,bulk12_delivery,bulk12_supply,Gas 1,wire1_volume,wire1_time
first_temp,1.0,0.37669,-0.295217,-0.303837,-0.00587,-0.0116,-0.068612,-0.001322,0.071969,-0.089355
last_temp,0.37669,1.0,0.20795,0.315809,0.032415,-0.031387,0.161764,-0.032639,-0.065547,-0.248266
total_seconds,-0.295217,0.20795,1.0,0.55291,0.044348,-0.023615,0.473242,0.400571,0.060031,-0.021364
avg_active_power,-0.303837,0.315809,0.55291,1.0,0.059903,0.018931,0.379135,0.077271,-0.171806,-0.070118
avg_reactive_power,-0.00587,0.032415,0.044348,0.059903,1.0,0.00284,0.041249,0.000958,0.001318,-0.027378
bulk12_delivery,-0.0116,-0.031387,-0.023615,0.018931,0.00284,1.0,0.041664,-0.047776,-0.025694,0.802131
bulk12_supply,-0.068612,0.161764,0.473242,0.379135,0.041249,0.041664,1.0,0.22835,0.170573,0.041341
Gas 1,-0.001322,-0.032639,0.400571,0.077271,0.000958,-0.047776,0.22835,1.0,0.162869,0.009704
wire1_volume,0.071969,-0.065547,0.060031,-0.171806,0.001318,-0.025694,0.170573,0.162869,1.0,0.154952
wire1_time,-0.089355,-0.248266,-0.021364,-0.070118,-0.027378,0.802131,0.041341,0.009704,0.154952,1.0


In [51]:
df_full.isnull().sum().sum()

2

In [52]:
df_full = df_full.dropna()

After reviewing the Pearson correlation coefficients for the entire dataset, I could eliminate the bulk 14 and 15 delivery/supply columns, because they were highly correlated with bulk 12 and wire 1 time. This correlation between independent variables could affect my model quality and I still have plenty of features without them. 

Since there is not a strong dependency between the target and all of the features, only some, the relationship is probably not best represented by a multiple linear regression model. 

# 4. Train models

In [54]:
features.isnull().sum().sum()

0

In [75]:
features = df_full.drop(['first_reading_time', 'last_reading_time', 'last_temp'], axis=1)
target = df_full['last_temp']

In [76]:
features_train, features_test, target_train, target_test = train_test_split(features, target,
                                                            test_size=.25)

In [77]:
features_train, features_valid, target_train, target_valid = train_test_split(features_train, 
                                                                target_train, test_size=.25)

In [78]:
scaler = StandardScaler()

scaler.fit(features_train)
train = scaler.transform(features_train)
valid = scaler.transform(features_valid)
test = scaler.transform(features_test)

In [79]:
dummy = DummyRegressor(strategy='mean')

dummy.fit(features_train, target_train)
predictions = dummy.predict(features_valid)
print('MAE for constant (mean) temperature prediction: ', mean_absolute_error(
                                                            predictions, target_valid))

MAE for constant (mean) temperature prediction:  10.406336702338477


After standardizing and splitting my data up into features/target and train/valid/test, I "trained" one dummy regressor to have a baseline to evaluate other models by. This resulted in an MAE of ~9.7 degrees, which is just under the interquartile range (10 degrees).

In [80]:
linear_model = LinearRegression()

linear_model.fit(features_train, target_train)
predictions = linear_model.predict(features_valid)
print('MAE for Linear Regression predictions: ', mean_absolute_error(predictions, target_valid))

MAE for Linear Regression predictions:  8.055000604741883


In [82]:
params = {'max_depth': [10], 'n_estimators': [10, 20, 30, 40, 50]}

rf_model = GridSearchCV(RandomForestRegressor(random_state=12345), params, 
                        scoring='neg_mean_absolute_error', cv=3)

rf_model.fit(features_train, target_train)
rf_model.best_params_

{'max_depth': 10, 'n_estimators': 50}

In [83]:
predictions = rf_model.predict(features_valid)
print('MAE for Random Forest predictions: ', mean_absolute_error(predictions, target_valid))

MAE for Random Forest predictions:  6.189428387368768


In [85]:
cat_model = CatBoostRegressor(boosting_type='Ordered', verbose=False,
                             depth=10, eval_metric='MAE', random_state=12345)

cat_model.fit(features_train, target_train, use_best_model=True, 
              eval_set=(features_valid, target_valid))
cat_model.get_best_score()

{'learn': {'MAE': 3.4772097177028822, 'RMSE': 4.689021992900412},
 'validation': {'MAE': 6.369791875659766, 'RMSE': 8.748303840341682}}

I trained three models on this data: Linear Regression, Random Forest Regressor, and CatBoost Regressor. As predicted, the linear model performed the worst of the three. The Random Forest model had the lowest validation MAE by ~.19 degrees, with the CatBoost model in a close second.

# 5. Final testing of best models

In [86]:
rf_predictions = rf_model.predict(features_test)

cat_predictions = cat_model.predict(features_test, verbose=False) 

In [87]:
print('MAE for Random Forest test set predictions: ', mean_absolute_error(
                                                        rf_predictions, target_test))
print('MAE for CatBoost test set predictions: ', mean_absolute_error(
                                                        cat_predictions, target_test))

MAE for Random Forest test set predictions:  6.71700110361495
MAE for CatBoost test set predictions:  6.6404293471608


The CatBoost model very slightly outperformed the Random Forest model on the test set, with an MAE that was .08 degrees lower. The Random Forest model seems to suffer more overfitting - the percent increase between its validation and test MAE (\~8.5%) is double that of the Catboost model (~4.25%). 

# 6. Conclusion

The CatBoost model's test MAE was approximately 6.64 degrees. This is an improvement of 3.766 degrees over the baseline of a constant prediction, or an improvement of about 36%. However, when comparing the CatBoost test MAE to the Random Forest model's, it is only a ~1.15% improvement. Since gradient boosting is computationally expensive, it could be argued that the using the CatBoost model is not cost effective given such a small improvment between it and a simpler algorithm. This would have to be weighed against the CatBoost model's greater reliablity when generalized. 

It is also worth mentioning that my method of filling in missing time and electrode power values was a very simple one. A deeper understanding of the context and processing procedure could improve model quality for this project.