## 🍺 Beer Consumption Prediction (Regression Task)

Given daily data on *beer consumption in São Paulo over the course of 2015*, let's try to predict the **liters of beer** consumed on a given day.

We will compare several linear regression variants: ordinary least squares, Ridge, Lasso, ElasticNet, SGD, and Huber regression.

Data source: https://www.kaggle.com/datasets/dongeorge/beer-consumption-sao-paulo

### Importing Libraries

In [3]:
import numpy as np
import pandas as pd

import re
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, SGDRegressor, HuberRegressor

In [4]:
data = pd.read_csv('Consumo_cerveja.csv')
data

Unnamed: 0,Data,Temperatura Media (C),Temperatura Minima (C),Temperatura Maxima (C),Precipitacao (mm),Final de Semana,Consumo de cerveja (litros)
0,2015-01-01,"27,3","23,9","32,5",0,0.0,25.461
1,2015-01-02,"27,02","24,5","33,5",0,0.0,28.972
2,2015-01-03,"24,82","22,4","29,9",0,1.0,30.814
3,2015-01-04,"23,98","21,5","28,6","1,2",1.0,29.799
4,2015-01-05,"23,82",21,"28,3",0,0.0,28.900
...,...,...,...,...,...,...,...
936,,,,,,,
937,,,,,,,
938,,,,,,,
939,,,,,,,


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 941 entries, 0 to 940
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Data                         365 non-null    object 
 1   Temperatura Media (C)        365 non-null    object 
 2   Temperatura Minima (C)       365 non-null    object 
 3   Temperatura Maxima (C)       365 non-null    object 
 4   Precipitacao (mm)            365 non-null    object 
 5   Final de Semana              365 non-null    float64
 6   Consumo de cerveja (litros)  365 non-null    float64
dtypes: float64(2), object(5)
memory usage: 51.6+ KB


### Preprocessing

In [8]:
data.isna().mean()

Data                           0.612115
Temperatura Media (C)          0.612115
Temperatura Minima (C)         0.612115
Temperatura Maxima (C)         0.612115
Precipitacao (mm)              0.612115
Final de Semana                0.612115
Consumo de cerveja (litros)    0.612115
dtype: float64

In [9]:
df = data.copy()
df

Unnamed: 0,Data,Temperatura Media (C),Temperatura Minima (C),Temperatura Maxima (C),Precipitacao (mm),Final de Semana,Consumo de cerveja (litros)
0,2015-01-01,"27,3","23,9","32,5",0,0.0,25.461
1,2015-01-02,"27,02","24,5","33,5",0,0.0,28.972
2,2015-01-03,"24,82","22,4","29,9",0,1.0,30.814
3,2015-01-04,"23,98","21,5","28,6","1,2",1.0,29.799
4,2015-01-05,"23,82",21,"28,3",0,0.0,28.900
...,...,...,...,...,...,...,...
936,,,,,,,
937,,,,,,,
938,,,,,,,
939,,,,,,,


In [10]:
# Drop rows with missing values
df = df.dropna(axis=0).reset_index(drop=True)
df

Unnamed: 0,Data,Temperatura Media (C),Temperatura Minima (C),Temperatura Maxima (C),Precipitacao (mm),Final de Semana,Consumo de cerveja (litros)
0,2015-01-01,"27,3","23,9","32,5",0,0.0,25.461
1,2015-01-02,"27,02","24,5","33,5",0,0.0,28.972
2,2015-01-03,"24,82","22,4","29,9",0,1.0,30.814
3,2015-01-04,"23,98","21,5","28,6","1,2",1.0,29.799
4,2015-01-05,"23,82",21,"28,3",0,0.0,28.900
...,...,...,...,...,...,...,...
360,2015-12-27,24,"21,1","28,2","13,6",1.0,32.307
361,2015-12-28,"22,64","21,1","26,7",0,0.0,26.095
362,2015-12-29,"21,68","20,3","24,1","10,3",0.0,22.309
363,2015-12-30,"21,38","19,3","22,4","6,3",0.0,20.467


In [11]:
df.isna().sum()

Data                           0
Temperatura Media (C)          0
Temperatura Minima (C)         0
Temperatura Maxima (C)         0
Precipitacao (mm)              0
Final de Semana                0
Consumo de cerveja (litros)    0
dtype: int64
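
Every column reported the same missing fraction earlier (0.612115, which is 576 of the 941 rows), which suggests the gaps come from entirely blank rows rather than scattered missing cells. Under that assumption, `dropna(how="all")` is a slightly more targeted alternative: it removes only rows where every column is missing. The toy frame below is made up for illustration and is not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative only): two real rows followed by two
# completely empty rows, mimicking blank trailing lines in a CSV.
toy = pd.DataFrame({
    "Data": ["2015-01-01", "2015-01-02", None, None],
    "Final de Semana": [0.0, 0.0, np.nan, np.nan],
    "Consumo de cerveja (litros)": [25.461, 28.972, np.nan, np.nan],
})

# how="all" removes a row only when every column is missing.
cleaned = toy.dropna(how="all").reset_index(drop=True)

print(len(toy), "->", len(cleaned))
```

With the default `how="any"`, a row with a single missing cell would also be dropped, which gives the same result here but could silently discard partial rows in other datasets.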

In [18]:
df.dtypes

Data                            object
Temperatura Media (C)           object
Temperatura Minima (C)          object
Temperatura Maxima (C)          object
Precipitacao (mm)               object
Final de Semana                float64
Consumo de cerveja (litros)    float64
dtype: object

In [20]:
# Convert decimal commas to decimal points and cast the numeric columns to float
for column in ['Temperatura Media (C)', 'Temperatura Minima (C)', 'Temperatura Maxima (C)', 'Precipitacao (mm)']:
    df[column] = df[column].str.replace(',', '.', regex=False).astype(float)
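
As an alternative to converting the columns after loading, `pd.read_csv` accepts a `decimal` argument that can parse decimal commas at read time. The sketch below uses a made-up two-row excerpt held in memory, assuming the raw file stores these values as quoted fields such as `"27,3"`; it does not read the real file.

```python
import io
import pandas as pd

# Hypothetical excerpt (illustrative only): decimal commas inside
# quoted fields, so they are not confused with the column separator.
raw = io.StringIO(
    'Data,Temperatura Media (C),Precipitacao (mm)\n'
    '2015-01-01,"27,3","0"\n'
    '2015-01-04,"23,98","1,2"\n'
)

# decimal="," tells the parser to treat the comma as the decimal mark.
sample = pd.read_csv(raw, decimal=",")

print(sample.dtypes)
```

If this works on the real file, the four weather columns would arrive as `float64` and the manual conversion loop would no longer be needed.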

In [21]:
df

Unnamed: 0,Data,Temperatura Media (C),Temperatura Minima (C),Temperatura Maxima (C),Precipitacao (mm),Final de Semana,Consumo de cerveja (litros)
0,2015-01-01,27.30,23.9,32.5,0.0,0.0,25.461
1,2015-01-02,27.02,24.5,33.5,0.0,0.0,28.972
2,2015-01-03,24.82,22.4,29.9,0.0,1.0,30.814
3,2015-01-04,23.98,21.5,28.6,1.2,1.0,29.799
4,2015-01-05,23.82,21.0,28.3,0.0,0.0,28.900
...,...,...,...,...,...,...,...
360,2015-12-27,24.00,21.1,28.2,13.6,1.0,32.307
361,2015-12-28,22.64,21.1,26.7,0.0,0.0,26.095
362,2015-12-29,21.68,20.3,24.1,10.3,0.0,22.309
363,2015-12-30,21.38,19.3,22.4,6.3,0.0,20.467


In [22]:
df.dtypes

Data                            object
Temperatura Media (C)          float64
Temperatura Minima (C)         float64
Temperatura Maxima (C)         float64
Precipitacao (mm)              float64
Final de Semana                float64
Consumo de cerveja (litros)    float64
dtype: object

In [23]:
# Create date features
df['Data'] = pd.to_datetime(df['Data'])

df['Year'] = df['Data'].dt.year
df['Month'] = df['Data'].dt.month
df['Day'] = df['Data'].dt.day

df = df.drop('Data', axis=1)

df

Unnamed: 0,Temperatura Media (C),Temperatura Minima (C),Temperatura Maxima (C),Precipitacao (mm),Final de Semana,Consumo de cerveja (litros),Year,Month,Day
0,27.30,23.9,32.5,0.0,0.0,25.461,2015,1,1
1,27.02,24.5,33.5,0.0,0.0,28.972,2015,1,2
2,24.82,22.4,29.9,0.0,1.0,30.814,2015,1,3
3,23.98,21.5,28.6,1.2,1.0,29.799,2015,1,4
4,23.82,21.0,28.3,0.0,0.0,28.900,2015,1,5
...,...,...,...,...,...,...,...,...,...
360,24.00,21.1,28.2,13.6,1.0,32.307,2015,12,27
361,22.64,21.1,26.7,0.0,0.0,26.095,2015,12,28
362,21.68,20.3,24.1,10.3,0.0,22.309,2015,12,29
363,21.38,19.3,22.4,6.3,0.0,20.467,2015,12,30


In [24]:
# Split df into X and y
y = df['Consumo de cerveja (litros)'].copy()
X = df.drop('Consumo de cerveja (litros)', axis=1).copy()

In [25]:
# Scale X
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X

Unnamed: 0,Temperatura Media (C),Temperatura Minima (C),Temperatura Maxima (C),Precipitacao (mm),Final de Semana,Year,Month,Day
0,1.912508,2.281333,1.365781,-0.419062,-0.631243,0.0,-1.602745,-1.673503
1,1.824340,2.493924,1.597722,-0.419062,-0.631243,0.0,-1.602745,-1.559818
2,1.131590,1.749853,0.762735,-0.419062,1.584177,0.0,-1.602745,-1.446134
3,0.867085,1.430966,0.461212,-0.322294,1.584177,0.0,-1.602745,-1.332449
4,0.816703,1.253806,0.391630,-0.419062,-0.631243,0.0,-1.602745,-1.218764
...,...,...,...,...,...,...,...,...
360,0.873383,1.289238,0.368436,0.677640,1.584177,0.0,1.587648,1.282303
361,0.445137,1.289238,0.020525,-0.419062,-0.631243,0.0,1.587648,1.395988
362,0.142846,1.005782,-0.582521,0.411528,-0.631243,0.0,1.587648,1.509672
363,0.048380,0.651463,-0.976820,0.088969,-0.631243,0.0,1.587648,1.623357


In [26]:
X['Year'].unique()

array([0.])
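
The single value `0.` is what standardization produces for a constant column: every row is from 2015, so the variance is zero and each value equals the mean. A minimal sketch on a stand-in array (not the real `X`) shows the same behavior:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A constant feature, like Year when every row is from 2015.
year = np.full((5, 1), 2015.0)

scaler = StandardScaler()
scaled = scaler.fit_transform(year)

print(scaler.var_)      # variance of the column
print(scaler.scale_)    # divisor used by the scaler
print(scaled.ravel())   # standardized values
```

For zero-variance features, `StandardScaler` uses a scale of 1 instead of dividing by zero, so the column becomes all zeros and carries no information for the models.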

In [27]:
# Year is constant (every row is from 2015), so it carries no signal; drop it
X = X.drop('Year', axis=1)
X

Unnamed: 0,Temperatura Media (C),Temperatura Minima (C),Temperatura Maxima (C),Precipitacao (mm),Final de Semana,Month,Day
0,1.912508,2.281333,1.365781,-0.419062,-0.631243,-1.602745,-1.673503
1,1.824340,2.493924,1.597722,-0.419062,-0.631243,-1.602745,-1.559818
2,1.131590,1.749853,0.762735,-0.419062,1.584177,-1.602745,-1.446134
3,0.867085,1.430966,0.461212,-0.322294,1.584177,-1.602745,-1.332449
4,0.816703,1.253806,0.391630,-0.419062,-0.631243,-1.602745,-1.218764
...,...,...,...,...,...,...,...
360,0.873383,1.289238,0.368436,0.677640,1.584177,1.587648,1.282303
361,0.445137,1.289238,0.020525,-0.419062,-0.631243,1.587648,1.395988
362,0.142846,1.005782,-0.582521,0.411528,-0.631243,1.587648,1.509672
363,0.048380,0.651463,-0.976820,0.088969,-0.631243,1.587648,1.623357


### Training

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=123)
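
Note that the scaler in this notebook was fit on all rows before the split, so the test rows contributed to the means and standard deviations used for scaling. The effect is often small, but a common way to avoid it is to split first and wrap the scaler and model in a pipeline. The sketch below uses synthetic stand-in data and hypothetical names (`X_demo`, `y_demo`, `pipe`), not the beer dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the beer dataset): 200 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=20.0, scale=5.0, size=(200, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.5, size=200)

# Split first, so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=123
)

# The pipeline fits the scaler on the training rows only, then the model.
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X_tr, y_tr)

print("Test R^2: {:.4f}".format(pipe.score(X_te, y_te)))
```

When `pipe.score` is called, the test rows are transformed with the training-set statistics, which mirrors how the model would be applied to new days.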

In [29]:
models = {
    '    Linear Regression': LinearRegression(),
    '     Ridge Regression': Ridge(),
    '     Lasso Regression': Lasso(),
    'ElasticNet Regression': ElasticNet(),
    '       SGD Regression': SGDRegressor(),
    '     Huber Regression': HuberRegressor()
}
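
For reference, the first four models differ mainly in the penalty added to the least-squares objective. Using scikit-learn's conventions, with $n$ training rows, coefficient vector $w$, penalty strength $\alpha$ (`alpha`), and mixing weight $\rho$ (`l1_ratio`), and omitting the intercept for brevity:

$$
\begin{aligned}
\text{Linear:} \quad & \min_w \; \lVert y - Xw \rVert_2^2 \\
\text{Ridge:} \quad & \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{ElasticNet:} \quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
$$

All models here run with default settings, which means $\alpha = 1.0$ for Ridge, Lasso, and ElasticNet, and $\rho = 0.5$ for ElasticNet. `SGDRegressor` by default minimizes a squared-error loss with a small L2 penalty using stochastic gradient descent, and `HuberRegressor` swaps the squared loss for one that grows linearly on large residuals, which makes it less sensitive to outliers.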

In [30]:
for model in models.values():
    model.fit(X_train, y_train)

### Results

In [31]:
for name, model in models.items():
    print(name + ' R^2 Score: {:.4f}'.format(model.score(X_test, y_test)))

    Linear Regression R^2 Score: 0.6896
     Ridge Regression R^2 Score: 0.6909
     Lasso Regression R^2 Score: 0.5763
ElasticNet Regression R^2 Score: 0.5646
       SGD Regression R^2 Score: 0.6918
     Huber Regression R^2 Score: 0.6699
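
Lasso and ElasticNet trail the other models here. One possible reason, not verified in this notebook, is that both run with the default `alpha=1.0`, which may shrink the coefficients more than this target warrants. A minimal sketch of choosing `alpha` by cross-validation with `LassoCV` follows; it uses synthetic stand-in data and hypothetical names, so the numbers it prints say nothing about the beer dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the beer dataset): 7 standardized features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 7))
true_coef = np.array([3.0, 1.5, 0.8, 0.0, 0.0, 0.5, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=123
)

# Default penalty strength (alpha=1.0), as in the models dictionary.
default_lasso = Lasso().fit(X_tr, y_tr)

# LassoCV picks alpha by 5-fold cross-validation on the training rows.
tuned_lasso = LassoCV(cv=5, random_state=123).fit(X_tr, y_tr)

default_r2 = default_lasso.score(X_te, y_te)
tuned_r2 = tuned_lasso.score(X_te, y_te)
print("Chosen alpha: {:.4f}".format(tuned_lasso.alpha_))
print("Default R^2: {:.4f}, tuned R^2: {:.4f}".format(default_r2, tuned_r2))
```

The same idea applies to the other penalized models through `RidgeCV` and `ElasticNetCV`. Whether tuning closes the gap on the real data would need to be checked by running it there.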
