## Feature Extraction
In this exercise we take the chocolate data from exercise 4 and train a regression model on it. The goal is to use the data from 2020 to predict, for every day of 2021, how much chocolate of each type we will sell.

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

In [2]:
#np.set_printoptions(suppress=True) # in case you prefer to avoid the scientific notations numpy often uses

Some helper functions

In [2]:
def dataframe_to_Xy(df, predict_column, feature_columns):
    """Convert the dataframe to a format usable for the ML algorithms"""
    # feature_columns must be a list, so X is always 2-D: (n_rows, n_features)
    X = df[feature_columns].to_numpy()  # all features
    # double brackets keep y 2-D as well: (n_rows, 1)
    y = df[[predict_column]].to_numpy()  # values to predict
    return X, y
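
To make the shapes concrete, here is a minimal sketch that runs the helper on a toy dataframe. The helper is repeated in compact form so the cell runs on its own, and the three rows are made up for illustration rather than read from the chocolate files.

```python
import pandas as pd

def dataframe_to_Xy(df, predict_column, feature_columns):
    """Compact equivalent of the helper above, repeated so this sketch is self-contained."""
    X = df[feature_columns].to_numpy()
    y = df[[predict_column]].to_numpy()
    return X, y

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "month": [1, 1, 2],
    "weekday": [4, 5, 6],
    "chocolate_normal": [254.0, 428.0, 970.0],
})

X, y = dataframe_to_Xy(toy, "chocolate_normal", ["month", "weekday"])
print(X.shape)  # one row per day, one column per feature
print(y.shape)  # one target value per day, kept 2-D
```

Both arrays are two-dimensional, which is the layout scikit-learn estimators expect for `X` and which keeps the later subtraction `predict(X2) - y2` element-wise.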

In [3]:
def score(df_train, df_predict, predict_column, feature_columns):
    """Trains a linear regression model and evaluates it on a second dataset"""
    X1, y1 = dataframe_to_Xy(df_train, predict_column, feature_columns)
    X2, y2 = dataframe_to_Xy(df_predict, predict_column, feature_columns)
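    lr = LinearRegression()
    lr.fit(X1, y1)  # train on the first dataset
    # y is 2-D, so intercept_ and coef_ carry a leading axis of length 1
    print(f"intercept: {lr.intercept_[0]} coefficients: {lr.coef_[0]}")
    # mean absolute error on the second dataset
    mean_abs_error = np.abs(lr.predict(X2) - y2).mean()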
    return mean_abs_error
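
The value returned by `score` is the mean absolute error. As a small sanity check, the sketch below compares the expression used in `score` with `mean_absolute_error` from `sklearn.metrics` on made-up numbers; neither the numbers nor the use of `sklearn.metrics` come from the original notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values, shaped (n, 1) like the arrays inside score()
y_true = np.array([[100.0], [200.0], [300.0]])
y_pred = np.array([[110.0], [190.0], [330.0]])

manual = np.abs(y_pred - y_true).mean()         # expression used in score()
library = mean_absolute_error(y_true, y_pred)   # scikit-learn's implementation
print(manual, library)
```

The two values agree because both arrays have the same `(n, 1)` shape. If one of them were one-dimensional, NumPy broadcasting would turn the difference into an `(n, n)` matrix and the manual result would be wrong, which is one reason the helper keeps `y` two-dimensional.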

In [None]:
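def all_possible_feature_columns(df):
    """Return all column names except the date columns and the prediction targets"""
    columns = df.columns.values.tolist()
    excluded = ['datetime', 'year', 'chocolate_normal', 'chocolate_fancy', 'chocolate_frozen']
    return [s for s in columns if s not in excluded]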
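
A quick way to see what this helper returns is to call it on a one-row dataframe with the same column layout as the CSV files. This is only a sketch: the helper is repeated so the cell is self-contained, and the single row is illustrative.

```python
import pandas as pd

def all_possible_feature_columns(df):
    """Same logic as the helper above, repeated so this sketch is self-contained."""
    columns = df.columns.values.tolist()
    excluded = ['datetime', 'year', 'chocolate_normal', 'chocolate_fancy', 'chocolate_frozen']
    return [s for s in columns if s not in excluded]

# One illustrative row with the column layout of the CSV files
toy = pd.DataFrame({
    "Unnamed: 0": [0], "datetime": ["2020-01-01"], "year": [2020],
    "month": [1], "day": [1], "weekday": [4],
    "chocolate_normal": [254.0], "chocolate_fancy": [167], "chocolate_frozen": [61],
})

features = all_possible_feature_columns(toy)
print(features)  # ['Unnamed: 0', 'month', 'day', 'weekday']
```

Note that the `Unnamed: 0` column is not in the exclusion list, so it is returned as a feature too. It looks like a plain row counter, so whether it belongs among the features is worth a deliberate decision.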

Read the two files for 2020 and 2021

In [4]:
df = pd.read_csv("chocolate_combined.csv")
df.head()

Unnamed: 0,datetime,year,month,day,weekday,chocolate_normal,chocolate_fancy,chocolate_frozen
0,2020-01-01,2020,1,1,4,254.0,167,61
1,2020-01-02,2020,1,2,5,428.0,202,65
2,2020-01-03,2020,1,3,6,970.0,360,174
3,2020-01-04,2020,1,4,7,1005.0,302,152
4,2020-01-05,2020,1,5,1,722.0,342,145


In [5]:
df_2021 = pd.read_csv("chocolate_2021.csv")
df_2021.head()

Unnamed: 0,datetime,year,month,day,weekday,chocolate_normal,chocolate_fancy,chocolate_frozen
0,2021-01-01,2021,1,1,6,403,170,87
1,2021-01-02,2021,1,2,7,820,396,92
2,2021-01-03,2021,1,3,1,966,473,151
3,2021-01-04,2021,1,4,2,565,393,127
4,2021-01-05,2021,1,5,3,945,363,159


Try our first regression model

In [6]:
score(df, df_2021, 'chocolate_normal', ['month', 'day', 'weekday'])

intercept: 893.1823817665356 coefficients: [14.07651485 -3.89773571 36.42940525]


400.73596163760567

In [7]:
score(df, df_2021, 'chocolate_fancy', ['month', 'day', 'weekday'])

intercept: 400.9376149488952 coefficients: [13.96310598 -1.66197636  2.98526001]


217.785442853655

In [8]:
score(df, df_2021, 'chocolate_frozen', ['month', 'day', 'weekday'])

intercept: 417.5123535935799 coefficients: [44.66223397 -0.25762083  1.11905266]


410.4963691519663

## Start Exercise
Now it's up to you: add new columns to the two dataframes (make sure to treat both the same way) to improve the regression model. A good starting point is a one-hot encoding of either month or weekday. Hint: since we are working with pandas dataframes, the pd.get_dummies() function is probably easier to use here than the OneHotEncoder from sklearn.

Your goal is to add more columns to both dataframes and then run the score function again with the new columns included, getting the mean absolute error as low as possible.
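
As a starting point, a one-hot encoding of weekday could look like the sketch below. It is one possible approach, not the reference solution: `add_weekday_dummies` is a hypothetical helper name, and the two small dataframes are made-up stand-ins for the 2020 and 2021 data. Declaring the seven weekday categories up front ensures both dataframes get exactly the same columns, even if a weekday happens to be missing from one of them.

```python
import pandas as pd

# Toy stand-ins for the 2020 and 2021 dataframes (values made up)
df_a = pd.DataFrame({"weekday": [1, 2, 7], "chocolate_normal": [722.0, 565.0, 1005.0]})
df_b = pd.DataFrame({"weekday": [1, 2, 3], "chocolate_normal": [966.0, 565.0, 945.0]})

def add_weekday_dummies(df):
    """Hypothetical helper: append indicator columns weekday_1 ... weekday_7."""
    # Fixed categories 1..7 so every dataframe gets the same seven columns
    weekday = df["weekday"].astype(pd.CategoricalDtype(categories=list(range(1, 8))))
    dummies = pd.get_dummies(weekday, prefix="weekday", dtype=int)
    return pd.concat([df, dummies], axis=1)

# Treat both dataframes the same way
out_a = add_weekday_dummies(df_a)
out_b = add_weekday_dummies(df_b)

weekday_cols = [c for c in out_a.columns if c.startswith("weekday_")]
print(weekday_cols)
print(out_a[weekday_cols])
```

With the real data, the resulting column names could then be passed to `score` as `feature_columns`, for example together with `month` and `day`. Because the model also fits an intercept, the seven indicators are redundant by one; passing `drop_first=True` to `pd.get_dummies` is an option if that matters for interpreting the coefficients.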