# Ice Cream Sales Prediction

This notebook aims to predict ice cream sales based on historical weather data. The workflow includes data preprocessing, feature engineering, model training, and evaluation. The steps are as follows:

1. Import necessary libraries and install required packages.
2. Load and inspect historical weather and sales data.
3. Preprocess the data, including handling missing values and converting date formats.
4. Aggregate weather data to daily level and merge with sales data.
5. Train a multi-output regression model to predict sales.
6. Evaluate the model's performance using metrics like Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).
7. Save the trained model for future use.

#### Dependencies:

In [1]:
!pip install pandas
!pip install scikit-learn
!pip install joblib
!pip install numpy



In [2]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
import joblib
import numpy as np
from sklearn.metrics import mean_squared_error

* Load the weather and sales data from the CSV files:

In [3]:
weather = pd.read_csv('data/temp-history.csv')
# The sales export uses ';' as the field separator and ',' as the decimal mark.
sales = pd.read_csv('data/Ajustes de Stock Rio (1).csv', sep=';', decimal=',')
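
To make the effect of `sep=';'` and `decimal=','` concrete, here is a minimal sketch that reads a small in-memory sample. The sample text and the names `sample_csv` and `sample_sales` are made up for illustration; only the layout is meant to mirror the sales file.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as the sales export:
# semicolon-separated fields, comma as the decimal mark, blank for missing.
sample_csv = (
    "Ajuste Fec;Americana;Maracuyá\n"
    "1/1/23;-22,38;-7,61\n"
    "2/1/23;-20,92;\n"
)

sample_sales = pd.read_csv(io.StringIO(sample_csv), sep=';', decimal=',')

print(sample_sales.dtypes)          # the flavor columns are parsed as floats
print(sample_sales.isnull().sum())  # the blank Maracuyá cell is read as NaN
```

Without `decimal=','`, values such as `-22,38` would be kept as strings rather than parsed as numbers.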

In [4]:
weather.head()

Unnamed: 0,dt,dt_iso,timezone,city_name,lat,lon,temp,visibility,dew_point,feels_like,...,wind_gust,rain_1h,rain_3h,snow_1h,snow_3h,clouds_all,weather_id,weather_main,weather_description,weather_icon
0,1262304000,2010-01-01 00:00:00 +0000 UTC,-10800,Rosario,-32.958702,-60.693042,23.74,10000.0,14.99,23.68,...,,,,,,0,800,Clear,sky is clear,01n
1,1262307600,2010-01-01 01:00:00 +0000 UTC,-10800,Rosario,-32.958702,-60.693042,23.5,,17.72,23.73,...,,,,,,15,801,Clouds,few clouds,02n
2,1262311200,2010-01-01 02:00:00 +0000 UTC,-10800,Rosario,-32.958702,-60.693042,22.24,,17.6,22.48,...,,,,,,4,800,Clear,sky is clear,01n
3,1262314800,2010-01-01 03:00:00 +0000 UTC,-10800,Rosario,-32.958702,-60.693042,21.81,,17.81,22.08,...,,,,,,0,800,Clear,sky is clear,01n
4,1262318400,2010-01-01 04:00:00 +0000 UTC,-10800,Rosario,-32.958702,-60.693042,21.51,,17.51,21.75,...,,,,,,0,800,Clear,sky is clear,01n


In [5]:
sales.head()

Unnamed: 0,Ajuste Fec,Americana,Cheesecake de Frambuesa,Chocolate con Almendras,Crema Oreo,Dulce de Leche Granizado,Maracuyá
0,1/1/23,-22.38,-6.09,-25.02,-21.15,-15.96,-7.61
1,2/1/23,-20.92,-17.94,-17.07,-28.1,-44.8,
2,3/1/23,-20.48,-6.41,-17.96,-10.59,-41.16,-14.9
3,4/1/23,,,-17.16,-22.61,-37.28,-14.43
4,5/1/23,-10.14,-12.85,-17.05,-22.44,-31.41,


* Removing the trailing " UTC" from the datetime strings and parsing them as timezone-aware timestamps:


In [6]:
weather['dt_iso'] = pd.to_datetime(
    weather['dt_iso'].str.replace(' UTC', '', regex=False),
    format='%Y-%m-%d %H:%M:%S %z'
)
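
The same parsing step can be tried on two literal strings copied from the `dt_iso` column. This is only a sketch; `raw`, `cleaned`, and `parsed` are illustrative names.

```python
import pandas as pd

raw = pd.Series([
    "2010-01-01 00:00:00 +0000 UTC",
    "2010-01-01 01:00:00 +0000 UTC",
])

# Drop the literal " UTC" suffix so that "%z" can consume the "+0000" offset.
cleaned = raw.str.replace(' UTC', '', regex=False)
parsed = pd.to_datetime(cleaned, format='%Y-%m-%d %H:%M:%S %z')

print(parsed.dt.tz)  # the result is timezone-aware
```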

In [7]:
weather['dt_iso'].head()

0   2010-01-01 00:00:00+00:00
1   2010-01-01 01:00:00+00:00
2   2010-01-01 02:00:00+00:00
3   2010-01-01 03:00:00+00:00
4   2010-01-01 04:00:00+00:00
Name: dt_iso, dtype: datetime64[ns, UTC]

In [8]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136396 entries, 0 to 136395
Data columns (total 28 columns):
 #   Column               Non-Null Count   Dtype              
---  ------               --------------   -----              
 0   dt                   136396 non-null  int64              
 1   dt_iso               136396 non-null  datetime64[ns, UTC]
 2   timezone             136396 non-null  int64              
 3   city_name            136396 non-null  object             
 4   lat                  136396 non-null  float64            
 5   lon                  136396 non-null  float64            
 6   temp                 136396 non-null  float64            
 7   visibility           113825 non-null  float64            
 8   dew_point            136396 non-null  float64            
 9   feels_like           136396 non-null  float64            
 10  temp_min             136396 non-null  float64            
 11  temp_max             136396 non-null  float64            
 12  pr

In [9]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 761 entries, 0 to 760
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Ajuste Fec                761 non-null    object 
 1   Americana                 679 non-null    float64
 2   Cheesecake de Frambuesa   606 non-null    float64
 3   Chocolate con Almendras   725 non-null    float64
 4   Crema Oreo                724 non-null    float64
 5   Dulce de Leche Granizado  749 non-null    float64
 6   Maracuyá                  417 non-null    float64
dtypes: float64(6), object(1)
memory usage: 41.7+ KB


* Checking and handling missing values: 

In [10]:
sales.isnull().sum()

Ajuste Fec                    0
Americana                    82
Cheesecake de Frambuesa     155
Chocolate con Almendras      36
Crema Oreo                   37
Dulce de Leche Granizado     12
Maracuyá                    344
dtype: int64

In [11]:
weather.isnull().sum()

dt                          0
dt_iso                      0
timezone                    0
city_name                   0
lat                         0
lon                         0
temp                        0
visibility              22571
dew_point                   0
feels_like                  0
temp_min                    0
temp_max                    0
pressure                    0
sea_level              136396
grnd_level             136396
humidity                    0
wind_speed                  0
wind_deg                    0
wind_gust              117989
rain_1h                122199
rain_3h                136387
snow_1h                136396
snow_3h                136396
clouds_all                  0
weather_id                  0
weather_main                0
weather_description         0
weather_icon                0
dtype: int64

In [12]:
weather.fillna(0, inplace=True)

In [13]:
sales.fillna(0, inplace=True)
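
Filling every gap with 0 treats a missing reading as a measured zero. That is plausible for rainfall, but for a column such as `visibility` it lowers any average computed later. The sketch below uses made-up hourly values to show the difference between filling all columns and filling only `rain_1h`; whether selective filling suits this dataset is an assumption to check against the data source.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings: visibility is missing for two of four hours.
hourly = pd.DataFrame({
    'visibility': [10000.0, np.nan, np.nan, 10000.0],
    'rain_1h': [np.nan, 0.5, np.nan, np.nan],
})

filled_all = hourly.fillna(0)                    # what the notebook does
filled_rain_only = hourly.fillna({'rain_1h': 0})  # leave visibility gaps as NaN

mean_all = filled_all['visibility'].mean()         # zeros count as readings
mean_skip = filled_rain_only['visibility'].mean()  # NaN is skipped by mean()

print(mean_all, mean_skip)
```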

In [14]:
weather.isnull().sum()

dt                     0
dt_iso                 0
timezone               0
city_name              0
lat                    0
lon                    0
temp                   0
visibility             0
dew_point              0
feels_like             0
temp_min               0
temp_max               0
pressure               0
sea_level              0
grnd_level             0
humidity               0
wind_speed             0
wind_deg               0
wind_gust              0
rain_1h                0
rain_3h                0
snow_1h                0
snow_3h                0
clouds_all             0
weather_id             0
weather_main           0
weather_description    0
weather_icon           0
dtype: int64

In [15]:
sales.isnull().sum()

Ajuste Fec                  0
Americana                   0
Cheesecake de Frambuesa     0
Chocolate con Almendras     0
Crema Oreo                  0
Dulce de Leche Granizado    0
Maracuyá                    0
dtype: int64

* Extracting the calendar date (in UTC) from the parsed timestamps:

In [16]:
weather['dt_iso'] = pd.to_datetime(weather['dt_iso'], errors='coerce')

weather['date'] = weather['dt_iso'].dt.date
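
The date above is taken from the UTC timestamp, while the `timezone` column reports an offset of -10800 seconds (UTC-3). If the sales dates are local calendar days, which is an assumption not confirmed by the data shown, the first hours of each UTC day belong to the previous local day. A minimal sketch of shifting by the offset before taking the date, on two made-up rows:

```python
import pandas as pd

demo = pd.DataFrame({
    'dt_iso': pd.to_datetime(
        ["2010-01-01 01:00:00+00:00", "2010-01-01 04:00:00+00:00"], utc=True
    ),
    'timezone': [-10800, -10800],  # offset from UTC in seconds, as in the weather file
})

# Date as the notebook computes it (UTC calendar day).
demo['date_utc'] = demo['dt_iso'].dt.date

# Shift each timestamp by its offset, then take the local calendar day.
local_time = demo['dt_iso'] + pd.to_timedelta(demo['timezone'], unit='s')
demo['date_local'] = local_time.dt.date

print(demo[['dt_iso', 'date_utc', 'date_local']])
```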

* Aggregating the weather data to daily level:

In [17]:
daily_weather = weather.groupby('date').agg({
    'temp': 'mean',
    'feels_like': 'mean',
    'temp_min': 'mean',
    'temp_max': 'mean',
    'humidity': 'mean',
    'dew_point': 'mean',
    'wind_speed': 'mean',
    'wind_deg': 'mean',
    'clouds_all': 'mean',
    'visibility': 'mean',  
    'rain_1h': 'sum',     
    'rain_3h': 'sum',
    'snow_1h': 'sum',
    'snow_3h': 'sum'
}).reset_index()
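
One caveat about the aggregation above: `wind_deg` is an angle, so an arithmetic mean can mislead (the plain mean of 350° and 10° is 180°, pointing the opposite way). A circular mean is one common alternative. The helper below, `circular_mean_deg`, is a hypothetical name and not part of the notebook; it could be passed to `agg` in place of `'mean'` for that column.

```python
import numpy as np


def circular_mean_deg(degrees):
    """Mean direction of angles given in degrees, returned in [0, 360)."""
    radians = np.deg2rad(np.asarray(degrees, dtype=float))
    # Average the unit vectors, then convert the mean vector back to an angle.
    mean_angle = np.arctan2(np.sin(radians).mean(), np.cos(radians).mean())
    return float(np.rad2deg(mean_angle) % 360.0)


north_ish = circular_mean_deg([350, 10])    # close to north (0 or 360 degrees)
south_east = circular_mean_deg([90, 180])   # halfway between east and south

print(north_ish, south_east)
```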

In [18]:
daily_weather.head()

Unnamed: 0,date,temp,feels_like,temp_min,temp_max,humidity,dew_point,wind_speed,wind_deg,clouds_all,visibility,rain_1h,rain_3h,snow_1h,snow_3h
0,2010-01-01,22.8025,23.119583,22.435,23.179583,77.5,18.4925,2.862083,169.458333,35.041667,4916.666667,0.16,0.0,0.0,0.0
1,2010-01-02,24.720417,25.720833,24.187083,25.185833,75.583333,19.5175,0.974167,48.041667,0.666667,5583.333333,0.0,0.0,0.0,0.0
2,2010-01-03,25.569167,26.76125,25.047917,26.164583,74.5,20.421667,2.819167,69.375,24.75,8333.333333,0.12,0.0,0.0,0.0
3,2010-01-04,26.676,29.1456,26.21,27.1856,76.2,21.9492,2.868,118.04,61.96,5480.0,5.76,0.0,0.0,0.0
4,2010-01-05,28.104167,31.86375,27.54125,28.615,83.333333,24.820417,1.377917,90.958333,22.625,7291.666667,2.56,0.0,0.0,0.0


In [19]:
# Preprocess sales data: convert string dates and rename columns if needed.
sales['Ajuste Fec'] = pd.to_datetime(sales['Ajuste Fec'], dayfirst=True)
sales.rename(columns={'Ajuste Fec': 'date'}, inplace=True)
sales['date'] = sales['date'].dt.date
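
An explicit format string is an alternative to `dayfirst=True` that removes any ambiguity in how `2/1/23` is read. The sketch assumes every date in the file follows the day/month/two-digit-year pattern seen in `sales.head()`; the sample values are illustrative.

```python
import pandas as pd

raw_dates = pd.Series(['1/1/23', '2/1/23', '13/1/23'])

# Day first, as in the sales export: '2/1/23' is 2 January 2023.
day_first = pd.to_datetime(raw_dates, format='%d/%m/%y')

# For contrast, a month-first reading of the same string gives 1 February 2023.
month_first = pd.to_datetime('2/1/23', format='%m/%d/%y')

print(day_first.dt.date.tolist())
print(month_first.date())
```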



* Merging the weather data with sales data:

In [20]:
df = pd.merge(sales, daily_weather, on='date', how='inner')
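
An inner join silently drops sales dates that have no weather row. One way to check for such losses is a left join with `indicator=True`, sketched here on small made-up frames (`sales_demo` and `weather_demo` are illustrative names).

```python
import datetime as dt

import pandas as pd

sales_demo = pd.DataFrame({
    'date': [dt.date(2023, 1, 1), dt.date(2023, 1, 2), dt.date(2023, 1, 3)],
    'Americana': [22.38, 20.92, 20.48],
})
weather_demo = pd.DataFrame({
    'date': [dt.date(2023, 1, 1), dt.date(2023, 1, 2)],
    'temp': [23.46, 23.79],
})

# Keep every sales row and record whether a matching weather row was found.
# validate='one_to_one' raises if either side has duplicate dates.
check = pd.merge(
    sales_demo, weather_demo, on='date', how='left',
    indicator=True, validate='one_to_one',
)
unmatched = check[check['_merge'] == 'left_only']

print(unmatched[['date']])
```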

In [21]:
df.head()

Unnamed: 0,date,Americana,Cheesecake de Frambuesa,Chocolate con Almendras,Crema Oreo,Dulce de Leche Granizado,Maracuyá,temp,feels_like,temp_min,...,humidity,dew_point,wind_speed,wind_deg,clouds_all,visibility,rain_1h,rain_3h,snow_1h,snow_3h
0,2023-01-01,-22.38,-6.09,-25.02,-21.15,-15.96,-7.61,23.45875,23.70125,22.554167,...,69.125,17.01875,4.462083,142.083333,75.833333,10000.0,6.71,0.0,0.0,0.0
1,2023-01-02,-20.92,-17.94,-17.07,-28.1,-44.8,0.0,23.787083,23.27,23.013333,...,58.791667,12.790417,4.440833,217.041667,4.791667,10000.0,3.05,0.0,0.0,0.0
2,2023-01-03,-20.48,-6.41,-17.96,-10.59,-41.16,-14.9,25.024167,24.239583,23.810833,...,46.541667,11.16,3.261667,228.375,17.291667,10000.0,0.16,0.0,0.0,0.0
3,2023-01-04,0.0,0.0,-17.16,-22.61,-37.28,-14.43,25.454583,24.536667,24.215,...,43.041667,10.449583,2.853333,129.583333,2.5,10000.0,0.0,0.0,0.0,0.0
4,2023-01-05,-10.14,-12.85,-17.05,-22.44,-31.41,0.0,26.999583,26.04375,26.0075,...,38.5,10.501667,3.267083,81.25,10.0,10000.0,0.0,0.0,0.0,0.0


* Feature columns:

In [22]:
feature_columns = [
    "temp", "feels_like", "temp_min", "temp_max",
    "humidity", "dew_point",
    "wind_speed", "wind_deg",
    "clouds_all", "visibility",
    "rain_1h", "rain_3h", "snow_1h", "snow_3h"
]

* Defining the feature matrix `X` and the target matrix `y` (no train/test split is made; the model is fit on all merged rows):

In [23]:
X = df[feature_columns]
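# Caution: only 'date', 'temp' and 'humidity' are dropped here, so y holds the six
# flavor columns plus the twelve remaining weather columns from feature_columns.
y = df.drop(columns=['date', 'temp', 'humidity'])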
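
Because `y` above keeps every weather column except `temp` and `humidity`, the model is also asked to predict its own inputs. If the intent is to predict only the flavor columns, one way to select them is to exclude `date` and everything in `feature_columns`. The frame below is a made-up stand-in for `df`, and `target_columns` is a hypothetical name.

```python
import pandas as pd

feature_columns_demo = ["temp", "feels_like", "humidity"]

# Hypothetical merged frame with two flavors and three weather features.
df_demo = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02'],
    'Americana': [-22.38, -20.92],
    'Maracuyá': [-7.61, 0.0],
    'temp': [23.46, 23.79],
    'feels_like': [23.70, 23.27],
    'humidity': [69.1, 58.8],
})

# Targets are whatever is left after removing the date and all feature columns.
target_columns = [
    col for col in df_demo.columns
    if col != 'date' and col not in feature_columns_demo
]

X_demo = df_demo[feature_columns_demo]
y_demo = df_demo[target_columns]

print(target_columns)
```

Scores computed with targets selected this way would differ from the ones recorded in this notebook.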

In [24]:
# The stock adjustments are recorded as negative numbers; use their magnitudes as targets.
y = y.abs()

In [25]:
model = MultiOutputRegressor(RandomForestRegressor(
    n_estimators=200, 
    min_samples_leaf=1,
    random_state=42,
    bootstrap=True,
))
model.fit(X, y)

In [26]:
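def safe_predict(self, X):
    """Predict, then clip negative values to zero."""
    predictions = self.predict(X)
    return np.maximum(predictions, 0)

In [27]:
# Attach the helper to the class at runtime (monkey-patch). The patch lives only in
# this Python session; it is not stored when the model is saved with joblib.
MultiOutputRegressor.safe_predict = safe_predict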
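
A method attached to the class at runtime is not part of the pickled model, so code that loads the saved file in a fresh session would have to re-apply the patch. A plain function that takes the model as an argument avoids that dependency. This is a sketch; `predict_non_negative` and the stand-in `_ConstantModel` are hypothetical names.

```python
import numpy as np


def predict_non_negative(model, X):
    """Call model.predict(X) and clip negative predictions to zero."""
    return np.maximum(model.predict(X), 0)


class _ConstantModel:
    """Stand-in for a fitted regressor, used only to demonstrate the helper."""

    def predict(self, X):
        return np.array([[1.5, -0.3]] * len(X))


clipped = predict_non_negative(_ConstantModel(), [[0.0], [1.0]])
print(clipped)
```

The helper works with any object that exposes `predict`, including a model restored with `joblib.load`.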

In [28]:
joblib.dump(model, 'ice_cream_sales_model.pkl')

['ice_cream_sales_model.pkl']
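
For later use, the file can be read back with `joblib.load`. The sketch below trains a tiny throwaway model on random data and saves it together with the feature column order, which the prediction code needs in order to build inputs consistently. Bundling both in a dict is a design choice assumed here, not something the notebook does.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Throwaway data and model, for demonstration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
y_demo = np.column_stack([X_demo[:, 0] * 2, X_demo[:, 1] + 1])

demo_model = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=10, random_state=42)
)
demo_model.fit(X_demo, y_demo)

# Save the model together with the column order it was trained on.
bundle = {'model': demo_model, 'feature_columns': ['temp', 'humidity']}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(
    demo_model.predict(X_demo), restored['model'].predict(X_demo)
)
print(same_predictions)
```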

In [29]:
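# For regressors, score() returns the coefficient of determination (R^2), not accuracy.
# It is computed here on the same rows the model was trained on.
r2 = model.score(X, y)
print(f'R^2 (training data): {r2}')

R^2 (training data): 0.9542472252673855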
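
The score above is computed on the rows the model was fit on, so it says little about performance on unseen days. For daily data, a simple check is a chronological holdout: train on the earlier rows and score on the later ones. The sketch uses synthetic data and a smaller forest; the 80/20 split point is an arbitrary assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.multioutput import MultiOutputRegressor

# Synthetic stand-in for 200 consecutive days: 3 features, 2 targets.
rng = np.random.default_rng(0)
n_days = 200
X_demo = rng.normal(size=(n_days, 3))
y_demo = np.column_stack([
    5 + 2 * X_demo[:, 0] + rng.normal(scale=0.5, size=n_days),
    3 + X_demo[:, 1] + rng.normal(scale=0.5, size=n_days),
])

# Chronological split: the first 80% of days for training, the rest for testing.
split = int(n_days * 0.8)
X_train, X_test = X_demo[:split], X_demo[split:]
y_train, y_test = y_demo[:split], y_demo[split:]

demo_model = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=50, random_state=42)
)
demo_model.fit(X_train, y_train)

train_r2 = demo_model.score(X_train, y_train)
test_r2 = demo_model.score(X_test, y_test)
test_rmse = mean_squared_error(y_test, demo_model.predict(X_test)) ** 0.5

print(f'train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}, test RMSE: {test_rmse:.3f}')
```

A large gap between the training and test scores would suggest overfitting.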


In [30]:
y_pred = model.safe_predict(X)
mse = mean_squared_error(y, y_pred)
print(f'Mean Squared Error: {mse}')

Mean Squared Error: 25.570142736403287


In [31]:
rmse = mse**0.5
print(f'Root Mean Squared Error: {rmse}')

Root Mean Squared Error: 5.056692865540015
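
The MSE and RMSE above are averaged over all target columns. When targets sit on different scales, a single average is dominated by the largest one, so a per-target breakdown can be more informative. A minimal sketch with made-up numbers, using the `multioutput='raw_values'` option of `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Two hypothetical targets on different scales, three observations each.
y_true = np.array([[10.0, 100.0], [12.0, 110.0], [14.0, 90.0]])
y_hat = np.array([[11.0, 100.0], [12.0, 100.0], [13.0, 100.0]])

# One MSE per target column instead of a single average.
per_target_mse = mean_squared_error(y_true, y_hat, multioutput='raw_values')
per_target_rmse = np.sqrt(per_target_mse)

# The default call averages the per-target values uniformly.
overall_mse = mean_squared_error(y_true, y_hat)

print(per_target_rmse, overall_mse)
```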
