# Forecasting Sticker Sales

**Ashok Medasani**

**Challenge Information:**
For this challenge, you will be predicting multiple years' worth of sales for various Kaggle-branded stickers from different fictitious stores in different (real!) countries. This dataset is completely synthetic, but contains many effects you see in real-world data, e.g., weekend and holiday effects, seasonality, etc.

In [14]:
# Mount Google Drive (only needed when running in Google Colab)

# from google.colab import drive
# drive.mount('/content/drive')


### Load the data

In [15]:
import pandas as pd

test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')
print(test.head())
print("\n")
print(train.head())


       id        date country              store             product
0  230130  2017-01-01  Canada  Discount Stickers   Holographic Goose
1  230131  2017-01-01  Canada  Discount Stickers              Kaggle
2  230132  2017-01-01  Canada  Discount Stickers        Kaggle Tiers
3  230133  2017-01-01  Canada  Discount Stickers            Kerneler
4  230134  2017-01-01  Canada  Discount Stickers  Kerneler Dark Mode


   id        date country              store             product  num_sold
0   0  2010-01-01  Canada  Discount Stickers   Holographic Goose       NaN
1   1  2010-01-01  Canada  Discount Stickers              Kaggle     973.0
2   2  2010-01-01  Canada  Discount Stickers        Kaggle Tiers     906.0
3   3  2010-01-01  Canada  Discount Stickers            Kerneler     423.0
4   4  2010-01-01  Canada  Discount Stickers  Kerneler Dark Mode     491.0


In [16]:
# Display information about the test dataset
print("Test Data Description:")
print(test.info())
print(test.describe())

Test Data Description:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98550 entries, 0 to 98549
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       98550 non-null  int64 
 1   date     98550 non-null  object
 2   country  98550 non-null  object
 3   store    98550 non-null  object
 4   product  98550 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.8+ MB
None
                  id
count   98550.000000
mean   279404.500000
std     28449.078852
min    230130.000000
25%    254767.250000
50%    279404.500000
75%    304041.750000
max    328679.000000


In [17]:

# Display information about the training dataset
print("Train Data Description:")
print(train.info())
print(train.describe())

Train Data Description:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230130 entries, 0 to 230129
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   id        230130 non-null  int64  
 1   date      230130 non-null  object 
 2   country   230130 non-null  object 
 3   store     230130 non-null  object 
 4   product   230130 non-null  object 
 5   num_sold  221259 non-null  float64
dtypes: float64(1), int64(1), object(4)
memory usage: 10.5+ MB
None
                  id       num_sold
count  230130.000000  221259.000000
mean   115064.500000     752.527382
std     66432.953062     690.165445
min         0.000000       5.000000
25%     57532.250000     219.000000
50%    115064.500000     605.000000
75%    172596.750000    1114.000000
max    230129.000000    5939.000000


In [18]:
# Alternative (disabled): fill NaN values in 'num_sold' with the median
#train['num_sold'] = train['num_sold'].fillna(train['num_sold'].median())

### Data Preprocessing 


In [19]:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error


# Preprocessing: Convert date to datetime and extract features
train['date'] = pd.to_datetime(train['date'])
test['date'] = pd.to_datetime(test['date'])

# Extract year, month, day, and day_of_week from date
for df in [train, test]:
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day
    df['day_of_week'] = df['date'].dt.dayofweek

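# Encode categorical features (the encoder is refit for each column, so only
# the 'product' mapping remains available on `label_encoder` afterwards)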
label_encoder = LabelEncoder()
for col in ['country', 'store', 'product']:
    train[col] = label_encoder.fit_transform(train[col])
    test[col] = label_encoder.transform(test[col])
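
The challenge description mentions weekend effects and seasonality, which the integer `month`, `day`, and `day_of_week` columns capture only indirectly. The following is a minimal sketch of two extra calendar features, not part of the original notebook: the helper `add_calendar_features` and the column names `is_weekend`, `doy_sin`, and `doy_cos` are hypothetical. Holiday flags would additionally need a per-country holiday calendar and are not shown.

```python
import numpy as np
import pandas as pd


def add_calendar_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with extra calendar features (hypothetical helper)."""
    out = df.copy()
    dates = pd.to_datetime(out["date"])
    # Saturday (5) and Sunday (6) are flagged as weekend days
    out["is_weekend"] = (dates.dt.dayofweek >= 5).astype(int)
    # Place the day of year on a circle so 31 December sits next to 1 January
    angle = 2 * np.pi * dates.dt.dayofyear / 365.25
    out["doy_sin"] = np.sin(angle)
    out["doy_cos"] = np.cos(angle)
    return out


# Small illustrative frame; 2010-01-01 was a Friday
demo = pd.DataFrame({"date": ["2010-01-01", "2010-01-02", "2010-01-04"]})
demo_features = add_calendar_features(demo)
print(demo_features)
```

Applied to the notebook's frames, the helper would be called on `train` and `test` before the `date` column is dropped. Whether these features improve the model would have to be checked on validation data.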


### Handle Missing Values

In [20]:
# Handle missing values in 'num_sold' using predictive modeling
train_known = train[train['num_sold'].notna()]
train_missing = train[train['num_sold'].isna()]

if not train_missing.empty:
    # Train a model to fill missing 'num_sold'
    X_known = train_known.drop(columns=['num_sold', 'date', 'id'])
    y_known = train_known['num_sold']
    X_missing = train_missing.drop(columns=['num_sold', 'date', 'id'])

    imputer_model = RandomForestRegressor(random_state=42)
    imputer_model.fit(X_known, y_known)
    train.loc[train['num_sold'].isna(), 'num_sold'] = imputer_model.predict(X_missing)
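
Before relying on model-based imputation, it can help to check whether the missing targets are spread evenly or concentrated in a few series, since a series with no observed values gives the imputer nothing of its own to learn from. The sketch below uses a tiny made-up frame in place of `train`. The values are illustrative only and say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for `train` (values are made up)
demo = pd.DataFrame({
    "country": ["Canada", "Canada", "Kenya", "Kenya", "Kenya", "Kenya"],
    "product": ["Kaggle", "Kerneler", "Kaggle", "Kaggle", "Kerneler", "Kerneler"],
    "num_sold": [973.0, 423.0, np.nan, np.nan, 12.0, np.nan],
})

# Share of missing targets per (country, product) group, largest first
missing_share = (
    demo["num_sold"].isna()
    .groupby([demo["country"], demo["product"]])
    .mean()
    .sort_values(ascending=False)
)
print(missing_share)
```

On the real data, the same grouping would be run on `train` before the `fillna`-style assignment above, and `store` could be added as a third grouping key.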



### Model Training

In [None]:
# Define features and target for main model
X_train = train.drop(columns=['num_sold', 'date', 'id'])
y_train = train['num_sold']
X_test = test.drop(columns=['date', 'id'])

# Split training data for validation
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Train the main model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_split, y_train_split)
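
The test rows shown earlier begin on 2017-01-01, while the training rows begin on 2010-01-01, so the task is to forecast forward in time. A random split mixes dates between the two parts, which may make the validation score look better than the model's true forecasting accuracy. A minimal sketch of a date-based holdout follows, assuming a hypothetical helper `time_based_split` and an arbitrary cutoff chosen only for illustration.

```python
import pandas as pd


def time_based_split(df: pd.DataFrame, cutoff: str, date_col: str = "date"):
    """Split df into rows before `cutoff` and rows on or after it (hypothetical helper)."""
    dates = pd.to_datetime(df[date_col])
    cutoff_ts = pd.Timestamp(cutoff)
    return df[dates < cutoff_ts], df[dates >= cutoff_ts]


# Small illustrative frame spanning a year boundary (values are made up)
demo = pd.DataFrame({
    "date": pd.date_range("2015-12-30", periods=4, freq="D"),
    "num_sold": [100.0, 110.0, 90.0, 95.0],
})
fit_part, holdout_part = time_based_split(demo, cutoff="2016-01-01")
print(len(fit_part), len(holdout_part))
```

In the notebook, the split would be applied to `train` while it still has its `date` column, and the feature and target columns would be selected from each part afterwards.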

### Validation using MAE

In [8]:
# Evaluate the model on validation data
y_val_pred = model.predict(X_val)
print("Validation MAE:", mean_absolute_error(y_val, y_val_pred))

Validation MAE: 37.75357523573633
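
The MAE is expressed in units sold, and `num_sold` ranges from 5 to 5939 in the training summary above, so the same absolute error means very different things for small and large series. A relative measure such as the mean absolute percentage error (MAPE) can complement it. The sketch below computes both with NumPy on made-up numbers. It is an illustration, not a result from this notebook.

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

# Mean absolute error, in the same units as the target
mae = np.mean(np.abs(y_true - y_pred))
# Mean absolute percentage error; requires all true values to be non-zero
mape = np.mean(np.abs((y_true - y_pred) / y_true))
print(f"MAE: {mae:.2f}  MAPE: {mape:.2%}")
```

With the notebook's variables, `y_val` and `y_val_pred` would take the place of `y_true` and `y_pred`.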


### Predict and Save Data

In [13]:
# Predict on test data
test['num_sold'] = model.predict(X_test)

# Save the predictions to a CSV
test[['id', 'num_sold']].to_csv('predictions.csv', index=False)
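
Before uploading, a few cheap checks on the output frame can catch a malformed file. The helper `check_submission` below is a hypothetical sketch. The checks (column names, row count, matching ids, no missing or negative predictions) are assumptions about what a valid file looks like, not rules taken from the competition.

```python
import pandas as pd


def check_submission(sub: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise AssertionError if the submission frame looks malformed (hypothetical helper)."""
    assert list(sub.columns) == ["id", "num_sold"], "unexpected columns"
    assert len(sub) == len(expected_ids), "row count differs from test set"
    assert sub["id"].tolist() == expected_ids.tolist(), "ids differ"
    assert sub["num_sold"].notna().all(), "missing predictions"
    assert (sub["num_sold"] >= 0).all(), "negative predictions"


# Illustrative frame with made-up predictions
demo_ids = pd.Series([230130, 230131, 230132])
demo_sub = pd.DataFrame({"id": demo_ids, "num_sold": [310.5, 980.0, 845.2]})
check_submission(demo_sub, demo_ids)
print("submission looks well-formed")
```

In the notebook, the call would be `check_submission(test[['id', 'num_sold']], test['id'])`, placed just before `to_csv`.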