## Feature Extraction
In this exercise we take the chocolate data from exercise 4 and apply machine learning to it. The goal is to use the 2020 data to predict, for every day of 2021, how much of each chocolate type we will sell.

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

In [2]:
# np.set_printoptions(suppress=True)  # uncomment to avoid the scientific notation numpy often uses

Some helper functions

In [2]:
def dataframe_to_Xy(df, predict_column, feature_columns):
    """Convert the dataframe to a format usable for the ML algorithms"""
    X = df[feature_columns].to_numpy()    # all features, shape (n_samples, n_features)
    y = df[[predict_column]].to_numpy()   # values to predict, shape (n_samples, 1)
    return X, y
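To make the shapes concrete, here is a small sketch of what the helper returns. The toy rows are made up for illustration; only the column names mirror the chocolate data, and the helper is repeated so the snippet runs on its own.

```python
import pandas as pd

def dataframe_to_Xy(df, predict_column, feature_columns):
    """Same idea as the helper above, repeated so this snippet is self-contained."""
    X = df[feature_columns].to_numpy()
    y = df[[predict_column]].to_numpy()
    return X, y

# Made-up toy rows, not taken from the real files
toy = pd.DataFrame({
    "month": [1, 1, 2],
    "weekday": [4, 5, 6],
    "chocolate_normal": [254.0, 428.0, 970.0],
})

X, y = dataframe_to_Xy(toy, "chocolate_normal", ["month", "weekday"])
print(X.shape)  # one row per day, one column per feature
print(y.shape)  # one row per day, a single target column
```

Both arrays are two-dimensional, which is why `score` later indexes `lr.intercept_[0]` and `lr.coef_[0]`.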

In [3]:
def score(df_train, df_predict, predict_column, feature_columns):
    """Train a linear regression on df_train and return its mean absolute error on df_predict"""
    X1, y1 = dataframe_to_Xy(df_train, predict_column, feature_columns)
    X2, y2 = dataframe_to_Xy(df_predict, predict_column, feature_columns)
    lr = LinearRegression().fit(X1, y1)  # fit() returns the fitted estimator itself
    print(f"intercept: {lr.intercept_[0]} coefficients: {lr.coef_[0]}")
    mean_abs_error = np.abs(lr.predict(X2) - y2).mean()
    return mean_abs_error
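The value returned by `score` is the mean absolute error: the average of |prediction - actual| over all days. As a quick sanity check, the sketch below compares the manual computation with `sklearn.metrics.mean_absolute_error` on made-up numbers (the arrays are illustrative, not chocolate data).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted sales, shaped (n_samples, 1) like y in the helpers
y_true = np.array([[100.0], [200.0], [300.0]])
y_pred = np.array([[110.0], [180.0], [330.0]])

manual = np.abs(y_pred - y_true).mean()          # (10 + 20 + 30) / 3
library = mean_absolute_error(y_true, y_pred)    # same metric via sklearn

print(manual, library)
```

Because the error is in the same unit as the target, a score of 400 for `chocolate_normal` can be read directly as "off by about 400 units per day on average".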

In [4]:
def all_possible_feature_columns(df):
    """Return all column names except the date/year columns and the prediction targets"""
    excluded = ['datetime', 'year', 'chocolate_normal', 'chocolate_fancy', 'chocolate_frozen']
    return [s for s in df.columns if s not in excluded]

Read the two files for 2020 and 2021

In [5]:
df = pd.read_csv("chocolate_combined.csv")
df.head()

Unnamed: 0,datetime,year,month,day,weekday,chocolate_normal,chocolate_fancy,chocolate_frozen
0,2020-01-01,2020,1,1,4,254.0,167,61
1,2020-01-02,2020,1,2,5,428.0,202,65
2,2020-01-03,2020,1,3,6,970.0,360,174
3,2020-01-04,2020,1,4,7,1005.0,302,152
4,2020-01-05,2020,1,5,1,722.0,342,145


In [6]:
df_2021 = pd.read_csv("chocolate_2021.csv")
df_2021.head()

Unnamed: 0,datetime,year,month,day,weekday,chocolate_normal,chocolate_fancy,chocolate_frozen
0,2021-01-01,2021,1,1,6,403,170,87
1,2021-01-02,2021,1,2,7,820,396,92
2,2021-01-03,2021,1,3,1,966,473,151
3,2021-01-04,2021,1,4,2,565,393,127
4,2021-01-05,2021,1,5,3,945,363,159


Try our first regression model

In [30]:
score_chocolate_normal = score(df, df_2021, 'chocolate_normal', ['month', 'day', 'weekday'])
score_chocolate_normal

intercept: 893.1823817665359 coefficients: [14.07651485 -3.89773571 36.42940525]


400.73596163760567

In [31]:
score_chocolate_fancy = score(df, df_2021, 'chocolate_fancy', ['month', 'day', 'weekday'])
score_chocolate_fancy

intercept: 400.9376149488954 coefficients: [13.96310598 -1.66197636  2.98526001]


217.78544285365496

In [32]:
score_chocolate_frozen = score(df, df_2021, 'chocolate_frozen', ['month', 'day', 'weekday'])
score_chocolate_frozen

intercept: 417.5123535935806 coefficients: [44.66223397 -0.25762083  1.11905266]


410.4963691519664

## Start Exercise
Now it's up to you: add new columns to the two dataframes (make sure to treat both the same way) to improve the regression model. You might want to start with a one-hot encoding of either month or weekday. Hint: since we are working with pandas dataframes, pd.get_dummies() is probably easier to use here than the OneHotEncoder from sklearn.

Your goal is to add more columns to both dataframes and rerun the score function with the new columns, getting the mean absolute error as low as possible.
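A minimal sketch of one-hot encoding with `pd.get_dummies()` while keeping both dataframes in sync. The frames `train` and `predict` are made-up stand-ins; the `reindex` step is a precaution for the case where a category is missing from one of the frames (the real files may well contain all months in both).

```python
import pandas as pd

# Made-up stand-ins for the training and prediction dataframes
train = pd.DataFrame({"month": [1, 2, 3, 3]})
predict = pd.DataFrame({"month": [1, 3]})  # month 2 never occurs here

train_dummies = pd.get_dummies(train["month"], prefix="month")
predict_dummies = pd.get_dummies(predict["month"], prefix="month")

# Give the prediction frame exactly the training columns, in the same order;
# categories that never occur in it become all-False columns
predict_dummies = predict_dummies.reindex(columns=train_dummies.columns, fill_value=False)

print(list(train_dummies.columns))
print(list(predict_dummies.columns))
```

Without the alignment step the two feature matrices could end up with different widths, and the fitted model could not be evaluated on the second frame.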

In [43]:
# One-hot encoding
df_month = pd.get_dummies(df['month'], prefix='month')
df_2021_month = pd.get_dummies(df_2021['month'], prefix='month')

# rename month columns
rename_dict = {
    'month_1': 'Jan', 
    'month_2': 'Feb', 
    'month_3': 'Mar', 
    'month_4': 'Apr', 
    'month_5': 'May', 
    'month_6': 'Jun', 
    'month_7': 'Jul', 
    'month_8': 'Aug', 
    'month_9': 'Sep', 
    'month_10': 'Oct', 
    'month_11': 'Nov', 
    'month_12': 'Dec'
}
df_month = df_month.rename(columns=rename_dict)
df_2021_month = df_2021_month.rename(columns=rename_dict)

df_1 = df.join(df_month)
df_2021_1 = df_2021.join(df_2021_month)

# mean absolute error per chocolate type, using only the one-hot month columns
month_columns = list(rename_dict.values())
print("chocolate_normal", score(df_1, df_2021_1, 'chocolate_normal', month_columns))
print("chocolate_fancy", score(df_1, df_2021_1, 'chocolate_fancy', month_columns))
print("chocolate_frozen", score(df_1, df_2021_1, 'chocolate_frozen', month_columns))

df_1.head()

intercept: 6091984862946012.0 coefficients: [-6.09198486e+15 -6.09198486e+15 -6.09198486e+15 -6.09198486e+15
 -6.09198486e+15 -6.09198486e+15 -6.09198486e+15 -6.09198486e+15
 -6.09198486e+15 -6.09198486e+15 -6.09198486e+15 -6.09198486e+15]
chocolate_normal 321.2896174863388
intercept: 1408828640056183.8 coefficients: [-1.40882864e+15 -1.40882864e+15 -1.40882864e+15 -1.40882864e+15
 -1.40882864e+15 -1.40882864e+15 -1.40882864e+15 -1.40882864e+15
 -1.40882864e+15 -1.40882864e+15 -1.40882864e+15 -1.40882864e+15]
chocolate_fancy 200.5737704918033
intercept: 569830487816053.8 coefficients: [-5.69830488e+14 -5.69830488e+14 -5.69830488e+14 -5.69830488e+14
 -5.69830488e+14 -5.69830488e+14 -5.69830488e+14 -5.69830488e+14
 -5.69830488e+14 -5.69830488e+14 -5.69830488e+14 -5.69830488e+14]
chocolate_frozen 148.8125


Unnamed: 0,datetime,year,month,day,weekday,chocolate_normal,chocolate_fancy,chocolate_frozen,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
0,2020-01-01,2020,1,1,4,254.0,167,61,True,False,False,False,False,False,False,False,False,False,False,False
1,2020-01-02,2020,1,2,5,428.0,202,65,True,False,False,False,False,False,False,False,False,False,False,False
2,2020-01-03,2020,1,3,6,970.0,360,174,True,False,False,False,False,False,False,False,False,False,False,False
3,2020-01-04,2020,1,4,7,1005.0,302,152,True,False,False,False,False,False,False,False,False,False,False,False
4,2020-01-05,2020,1,5,1,722.0,342,145,True,False,False,False,False,False,False,False,False,False,False,False
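The intercepts and coefficients printed above are on the order of 1e14 to 1e16, which is the typical symptom of the dummy variable trap: the twelve month columns always sum to 1, so together with the intercept they are perfectly collinear and the individual coefficients are not uniquely determined. One common remedy is to drop one dummy column. The sketch below shows this on made-up numbers (not the chocolate data) using the `drop_first=True` option of `pd.get_dummies()`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up sales for three months, two days each
months = pd.Series([1, 1, 2, 2, 3, 3], name="month")
sales = np.array([100.0, 120.0, 200.0, 220.0, 300.0, 320.0])

# drop_first=True removes month_1, which then acts as the baseline category
X = pd.get_dummies(months, prefix="month", drop_first=True).astype(float)
lr = LinearRegression().fit(X, sales)

# The intercept is now the month-1 mean, and each coefficient is the
# difference between that month's mean and the baseline
print(list(X.columns))
print(lr.intercept_, lr.coef_)
```

An alternative under the same assumption is to keep all dummy columns and fit with `LinearRegression(fit_intercept=False)`. Either way the coefficients become interpretable; the predictions themselves should stay essentially the same.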


In [44]:
# further increase the number of features
df_weekday = pd.get_dummies(df['weekday'], prefix='weekday')
df_2021_weekday = pd.get_dummies(df_2021['weekday'], prefix='weekday')

# rename weekday columns
# the data encodes 1 = Sunday (2020-01-05, a Sunday, has weekday 1)
rename_dict = {
    'weekday_1': 'Sun', 
    'weekday_2': 'Mon', 
    'weekday_3': 'Tue', 
    'weekday_4': 'Wed', 
    'weekday_5': 'Thu', 
    'weekday_6': 'Fri', 
    'weekday_7': 'Sat'
}

df_weekday = df_weekday.rename(columns=rename_dict)
df_2021_weekday = df_2021_weekday.rename(columns=rename_dict)

df_2 = df_1.join(df_weekday)
df_2021_2 = df_2021_1.join(df_2021_weekday)

score_chocolate_normal_new = score(df_2, df_2021_2, 'chocolate_normal', ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
score_chocolate_fancy_new = score(df_2, df_2021_2, 'chocolate_fancy', ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
score_chocolate_frozen_new = score(df_2, df_2021_2, 'chocolate_frozen', ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

# mean absolute error per chocolate type, using weekday and month dummies
print("chocolate_normal", score_chocolate_normal_new)
print("chocolate_fancy", score_chocolate_fancy_new)
print("chocolate_frozen", score_chocolate_frozen_new)

print("\n \n Show table head")
print(df_2.head())

intercept: 7694722824814188.0 coefficients: [-1.16623912e+16 -1.16623912e+16 -1.16623912e+16 -1.16623912e+16
 -1.16623912e+16 -1.16623912e+16 -1.16623912e+16  3.96766838e+15
  3.96766838e+15  3.96766838e+15  3.96766838e+15  3.96766838e+15
  3.96766838e+15  3.96766838e+15  3.96766838e+15  3.96766838e+15
  3.96766838e+15  3.96766838e+15  3.96766838e+15]
intercept: 4135586645835690.0 coefficients: [-2.97023916e+15 -2.97023916e+15 -2.97023916e+15 -2.97023916e+15
 -2.97023916e+15 -2.97023916e+15 -2.97023916e+15 -1.16534749e+15
 -1.16534749e+15 -1.16534749e+15 -1.16534749e+15 -1.16534749e+15
 -1.16534749e+15 -1.16534749e+15 -1.16534749e+15 -1.16534749e+15
 -1.16534749e+15 -1.16534749e+15 -1.16534749e+15]
intercept: 853321329933085.2 coefficients: [-1.46105738e+15 -1.46105738e+15 -1.46105738e+15 -1.46105738e+15
 -1.46105738e+15 -1.46105738e+15 -1.46105738e+15  6.07736046e+14
  6.07736046e+14  6.07736046e+14  6.07736046e+14  6.07736046e+14
  6.07736046e+14  6.07736046e+14  6.07736046e+14  6.07

In [36]:
# compare the baseline scores with the scores after one-hot encoding
print(f"score_chocolate_normal: {score_chocolate_normal}, score_chocolate_normal_new: {score_chocolate_normal_new}")
print(f"score_chocolate_frozen: {score_chocolate_frozen}, score_chocolate_frozen_new: {score_chocolate_frozen_new}")
print(f"score_chocolate_fancy: {score_chocolate_fancy}, score_chocolate_fancy_new: {score_chocolate_fancy_new}")

score_chocolate_normal: 400.73596163760567, score_chocolate_normal_new: 311.3524590163934
score_chocolate_frozen: 410.4963691519664, score_chocolate_frozen_new: 146.28415300546447
score_chocolate_fancy: 217.78544285365496, score_chocolate_fancy_new: 199.05464480874318


In [None]:
# new ideas to test
# days to valentine's day
# days to easter
# days to christmas
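A possible starting point for these ideas, sketched under a few assumptions: the helper name `days_until_next` is hypothetical, the toy dates are made up, and Easter is taken from `dateutil.easter.easter` (Western Easter by default). With the real files, the `datetime` column would first need converting with `pd.to_datetime`.

```python
import datetime as dt

import pandas as pd
from dateutil.easter import easter

def days_until_next(dates, holiday_for_year):
    """Days from each date to the next occurrence of a holiday (0 on the day itself).

    holiday_for_year is any callable mapping a year (int) to a datetime.date.
    """
    years = dates.dt.year
    this_year = pd.to_datetime(years.map(lambda y: holiday_for_year(int(y))))
    next_year = pd.to_datetime(years.map(lambda y: holiday_for_year(int(y) + 1)))
    # Use this year's holiday unless it has already passed
    target = this_year.where(this_year >= dates, next_year)
    return (target - dates).dt.days

# Made-up dates for illustration
toy = pd.DataFrame({"datetime": pd.to_datetime(["2020-01-01", "2020-12-24", "2020-12-26"])})

toy["days_to_valentine"] = days_until_next(toy["datetime"], lambda y: dt.date(y, 2, 14))
toy["days_to_easter"] = days_until_next(toy["datetime"], easter)
toy["days_to_christmas"] = days_until_next(toy["datetime"], lambda y: dt.date(y, 12, 25))
print(toy)
```

A linear model can only use such a column as a straight-line trend, so it may be worth also trying derived features such as a flag for "within 14 days of the holiday".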
