After we talked to the TA, she noted that misclassifying a D0 as a D5 should be penalized more heavily than misclassifying a D0 as a D1. We could do this by passing a custom loss function to a classifier, but it is simpler to treat the task as a regression problem and let the default loss function do the work. For this attempt, we will add spatial encoding (the latitude and longitude looked up from each FIPS code) to the raw dataset. We know that it greatly helps the classifier model, and it will likely help the regressor as well.
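
To illustrate why a regression loss already gives the ordinal penalty the TA asked for, here is a minimal sketch. It assumes the drought levels are encoded as the integers 0 through 5, matching the D0 to D5 labels above; that encoding and the helper names are illustrative and not taken from our pipeline. Squared error, which underlies the RMSE loss that CatBoostRegressor uses by default, grows with the distance between levels, whereas a plain 0-1 classification error treats every mistake the same.

```python
def squared_error(true_level, predicted_level):
    """Regression-style penalty: grows with the distance between levels."""
    return (true_level - predicted_level) ** 2


def zero_one_error(true_level, predicted_level):
    """Classification-style penalty: every wrong label costs the same."""
    return int(true_level != predicted_level)


true_level = 0  # assumed integer encoding of D0
for predicted_level in range(6):
    print(
        f"true=D{true_level} predicted=D{predicted_level} "
        f"zero_one={zero_one_error(true_level, predicted_level)} "
        f"squared={squared_error(true_level, predicted_level)}"
    )
```

Under this encoding, predicting D5 for a true D0 costs 25 in squared error while predicting D1 costs 1, yet both cost exactly 1 under the 0-1 error.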

In [1]:
import pandas as pd
from catboost import CatBoostRegressor

raw_train = pd.read_csv('../data/train_with_raw_score.csv')
raw_test = pd.read_csv('../data/test_with_raw_score.csv')

In [4]:
raw_train.head()

Unnamed: 0,fips,date,PRECTOT,PS,QV2M,T2M,T2MDEW,T2MWET,T2M_MAX,T2M_MIN,...,TS,WS10M,WS10M_MAX,WS10M_MIN,WS10M_RANGE,WS50M,WS50M_MAX,WS50M_MIN,WS50M_RANGE,score
0,1001,2000-01-04,15.95,100.29,6.42,11.4,6.09,6.1,18.09,2.16,...,11.31,3.84,5.67,2.08,3.59,6.73,9.31,3.74,5.58,1.0
1,1001,2000-01-11,1.33,100.4,6.63,11.48,7.84,7.84,18.88,5.72,...,10.43,1.76,2.48,1.05,1.43,3.55,6.38,1.71,4.67,2.0
2,1001,2000-01-18,1.11,100.39,9.53,14.28,13.26,13.26,18.04,8.98,...,14.19,2.63,3.6,1.67,1.92,5.19,6.4,3.84,2.55,2.0
3,1001,2000-01-25,0.0,100.11,2.05,-0.78,-7.93,-7.72,5.65,-5.46,...,-0.61,3.35,4.59,2.28,2.32,5.75,8.03,3.96,4.07,2.0
4,1001,2000-02-01,0.0,101.0,3.36,2.06,-1.73,-1.7,11.02,-4.21,...,1.88,2.03,2.74,0.88,1.86,4.18,6.38,1.27,5.11,1.0


In [1]:
from functools import lru_cache
import json

# Load the FIPS -> {'lat': ..., 'long': ...} lookup and close the file afterwards.
with open('../data/fips_map.json') as f:
    fips_map = json.load(f)


@lru_cache(maxsize=10000)
def fips_to_coordinate(fips_code):
    """Return [lat, long] for a FIPS code, or [None, None] if it is unknown."""
    entry = fips_map.get(str(fips_code))
    if entry is None:
        return [None, None]
    return [entry['lat'], entry['long']]

In [5]:
train_coords = raw_train['fips'].apply(fips_to_coordinate)
test_coords = raw_test['fips'].apply(fips_to_coordinate)
train_coords = pd.DataFrame(train_coords.tolist(), columns=['lat', 'long'])
test_coords = pd.DataFrame(test_coords.tolist(), columns=['lat', 'long'])

In [6]:
train_coords.head()

Unnamed: 0,lat,long
0,32.532237,-86.64644
1,32.532237,-86.64644
2,32.532237,-86.64644
3,32.532237,-86.64644
4,32.532237,-86.64644


In [7]:
test_coords.head()

Unnamed: 0,lat,long
0,32.532237,-86.64644
1,32.532237,-86.64644
2,32.532237,-86.64644
3,32.532237,-86.64644
4,32.532237,-86.64644


In [8]:
train = pd.concat([raw_train, train_coords], axis=1)
test = pd.concat([raw_test, test_coords], axis=1)

In [9]:
train = train.drop('fips', axis=1)
test = test.drop('fips', axis=1)

In [12]:
train = train.drop('date',axis=1)
test = test.drop('date',axis=1)

In [19]:
model = CatBoostRegressor()
model.fit(train.drop('score', axis=1), train['score'])

Learning rate set to 0.143141
0:	learn: 1.2020214	total: 325ms	remaining: 5m 24s
1:	learn: 1.1854770	total: 614ms	remaining: 5m 6s
2:	learn: 1.1722514	total: 903ms	remaining: 4m 59s
3:	learn: 1.1618601	total: 1.18s	remaining: 4m 54s
4:	learn: 1.1534296	total: 1.43s	remaining: 4m 44s
5:	learn: 1.1461779	total: 1.7s	remaining: 4m 41s
6:	learn: 1.1402717	total: 1.93s	remaining: 4m 33s
7:	learn: 1.1359029	total: 2.17s	remaining: 4m 29s
8:	learn: 1.1321657	total: 2.41s	remaining: 4m 25s
9:	learn: 1.1288927	total: 2.67s	remaining: 4m 24s
10:	learn: 1.1257840	total: 2.9s	remaining: 4m 20s
11:	learn: 1.1233268	total: 3.13s	remaining: 4m 18s
12:	learn: 1.1213900	total: 3.33s	remaining: 4m 12s
13:	learn: 1.1193193	total: 3.58s	remaining: 4m 11s
14:	learn: 1.1173403	total: 3.83s	remaining: 4m 11s
15:	learn: 1.1159604	total: 4.06s	remaining: 4m 9s
16:	learn: 1.1142209	total: 4.31s	remaining: 4m 9s
17:	learn: 1.1129553	total: 4.56s	remaining: 4m 8s
18:	learn: 1.1114891	total: 4.79s	remaining: 4m 7s

<catboost.core.CatBoostRegressor at 0x7f34bd387e50>

In [20]:
preds = model.predict(test.drop(['score'], axis=1))

In [None]:
# Save the fitted regressor so later cells can reload it without retraining.
import pickle

with open('../models/catboost4_initialRegression.pkl', 'wb') as f:
    pickle.dump(model, f)

I am not yet sure how best to evaluate a regression model (MSE? RMSE? MAPE?), so we will have to ask the TA. For now, I have logged a few metrics below.
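
One possible way to evaluate the regressor, sketched here as an assumption rather than a settled method, is to report RMSE on the raw predictions and also snap each prediction to the nearest valid level so it can be compared with the earlier classifier. The helper names, the 0 to 5 level range, and the toy numbers below are all made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the score."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def to_level(prediction, lowest=0, highest=5):
    """Round half up to the nearest integer level, then clip to [lowest, highest]."""
    rounded = math.floor(prediction + 0.5)
    return min(highest, max(lowest, rounded))


# Toy values, not taken from our test set.
y_true = [0, 1, 2, 5]
y_pred = [-0.3, 1.4, 2.6, 5.7]

levels = [to_level(p) for p in y_pred]
accuracy = sum(t == l for t, l in zip(y_true, levels)) / len(y_true)

print("RMSE:", rmse(y_true, y_pred))
print("Rounded levels:", levels)
print("Accuracy after rounding:", accuracy)
```

Rounding discards the size of each error, so the two numbers answer different questions: RMSE measures how far off the predictions are, and the rounded accuracy measures how often the right level would be reported.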

In [2]:
import pickle
import pandas as pd

raw_train = pd.read_csv('../data/train_with_raw_score.csv')
raw_test = pd.read_csv('../data/test_with_raw_score.csv')
train_coords = raw_train['fips'].apply(fips_to_coordinate)
test_coords = raw_test['fips'].apply(fips_to_coordinate)
train_coords = pd.DataFrame(train_coords.tolist(), columns=['lat', 'long'])
test_coords = pd.DataFrame(test_coords.tolist(), columns=['lat', 'long'])
train = pd.concat([raw_train, train_coords], axis=1)
test = pd.concat([raw_test, test_coords], axis=1)
train = train.drop('fips', axis=1)
test = test.drop('fips', axis=1)
train = train.drop('date',axis=1)
test = test.drop('date',axis=1)
# Reload the regressor saved earlier; the file is closed once loading finishes.
with open('../models/catboost4_initialRegression.pkl', 'rb') as f:
    model = pickle.load(f)
preds = model.predict(test.drop(['score'], axis=1))

In [3]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error, r2_score

print("Mean Squared Error: ", mean_squared_error(test['score'], preds))
print("Mean Absolute Error: ", mean_absolute_error(test['score'], preds))
print("Mean Absolute Percentage Error: ", mean_absolute_percentage_error(test['score'], preds))
print("R2 Score: ", r2_score(test['score'], preds))

Mean Squared Error:  0.7646419217106
Mean Absolute Error:  0.6704698094839872
Mean Absolute Percentage Error:  1810587988628717.0
R2 Score:  -0.0878119558467616


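The R2 score is negative, which means the model does slightly worse on the test set than always predicting the mean test score. This needs further investigation.

Conclusions: Regression shows some promise, since the MSE of about 0.76 is low compared to the range of the data. However, the negative R2 tempers that: because R2 = 1 - MSE / Var(score), the reported values imply a test-score variance of roughly 0.70, so a constant mean prediction would already achieve a lower MSE. We need to work out why, and we may need to try model types beyond the gradient boosting regressor. The enormous MAPE most likely comes from true scores of 0, where the percentage error divides by a near-zero denominator, so MAPE is probably not a useful metric for this target and should be checked further.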
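
As a starting point for that investigation, the sketch below reproduces both odd metrics on toy numbers. The two metric functions are hand-written from the standard definitions; the clamp of the MAPE denominator at machine epsilon mirrors what scikit-learn's `mean_absolute_percentage_error` does as far as I know, and the toy scores are made up for illustration.

```python
import sys


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mape(y_true, y_pred, eps=sys.float_info.epsilon):
    """Mean absolute percentage error with the denominator clamped at eps."""
    ratios = [abs(t - p) / max(abs(t), eps) for t, p in zip(y_true, y_pred)]
    return sum(ratios) / len(ratios)


# Toy scores (made up for illustration); note the zeros.
y_true = [0.0, 0.0, 1.0, 2.0]
mean_baseline = [sum(y_true) / len(y_true)] * len(y_true)
biased_constant = [1.0] * len(y_true)

print("R2 of mean baseline:", r2(y_true, mean_baseline))
print("R2 of biased constant:", r2(y_true, biased_constant))
print("MAPE of mean baseline:", mape(y_true, mean_baseline))

# Variance implied by the metrics logged above: Var = MSE / (1 - R2).
reported_mse = 0.7646419217106
reported_r2 = -0.0878119558467616
print("Implied test-score variance:", reported_mse / (1 - reported_r2))
```

Predicting the mean gives an R2 of zero, any prediction that fits worse than the mean goes negative, and a single true value of 0 is enough to push MAPE into the 1e15 range. Comparing the model against a mean-prediction baseline on the real test set would be a natural next check.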