<a href="https://colab.research.google.com/github/flycye/Kabble-Tutorial/blob/main/SalesForecasting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [15]:
# catboost - gradient boosting on decision trees
%pip install -Uq upgini catboost

In [16]:
# 80-20 rule: roughly 80% of the data for training and 20% for testing

from os.path import exists
import pandas as pd       # library dealing with dataframes

# if the zip file has already been downloaded, read the local copy
  # otherwise, read it directly from GitHub
df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path)   # read the csv file at df_path into the df dataframe
df = df.sample(n = 19_000, random_state = 0)  # take a random sample of 19,000 data points

# convert store and item columns into strings
df["store"] = df["store"].astype(str)
df["item"] = df["item"].astype(str)

# convert date into pandas datetime, making sorting easier
df["date"] = pd.to_datetime(df["date"])

# sort the values in-place by date (chronological order)
df.sort_values("date", inplace = True)
df.reset_index(inplace = True, drop = True)
df.head()   # show the first five rows

Unnamed: 0,date,store,item,sales
0,2013-01-01,7,5,5
1,2013-01-01,4,9,19
2,2013-01-01,1,33,37
3,2013-01-01,3,41,14
4,2013-01-01,5,24,26


In [17]:
# put data from 2013-2016 into the training set and data from 2017 into the test set
train = df[df["date"] < "2017-01-01"]
test = df[df["date"] >= "2017-01-01"]
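
A quick sanity check on a date-based split is to confirm that the two parts do not overlap in time and to look at their relative sizes. The sketch below uses a small synthetic frame rather than the real dataset: the column names mirror the ones above, but the rows and the `demo` names are made up for illustration.

```python
import pandas as pd

# Synthetic stand-in for df: one row per day from 2013 through 2017 (made-up sales values).
dates = pd.date_range("2013-01-01", "2017-12-31", freq="D")
demo = pd.DataFrame({"date": dates, "sales": range(len(dates))})

cutoff = pd.Timestamp("2017-01-01")
demo_train = demo[demo["date"] < cutoff]
demo_test = demo[demo["date"] >= cutoff]

# No row is lost or duplicated, and every training date precedes every test date.
assert len(demo_train) + len(demo_test) == len(demo)
assert demo_train["date"].max() < demo_test["date"].min()

train_share = len(demo_train) / len(demo)
print(f"train share: {train_share:.1%}")
```

Splitting by date instead of at random keeps future rows out of the training set, which matters for forecasting tasks.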

In [18]:
# features are the model's inputs; the target (label) is the value we want to predict

train_features = train.drop(columns = ["sales"])
train_target = train["sales"]

test_features = test.drop(columns = ["sales"])
test_target = test["sales"]

In [19]:
# enrich our features with the Upgini library,
  # which searches external data sources for relevant new features
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

enricher = FeaturesEnricher(
    search_keys = {
        "date": SearchKey.DATE,     # match external data to our rows by date
    },

    cv = CVType.time_series         # use time-series cross-validation (rows are ordered by date)
)

enricher.fit(train_features,
             train_target,
             eval_set = [(test_features, test_target)])


Try to add other keys like the COUNTRY, POSTAL_CODE, PHONE NUMBER, EMAIL/HEM, IPv4 to your training dataset
for search through all the available data sources.
See docs https://github.com/upgini/upgini#-total-239-countries-and-up-to-41-years-of-history

Detected task type: ModelTaskType.REGRESSION




Column name,Status,Errors
target,All valid,-
date,All valid,-



Running search request, search_id=be1251bd-2983-4064-91d9-bb25b063a871
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

70 relevant feature(s) found with the search keys: ['date']


Feature name,SHAP value,Coverage %,Value preview,Provider,Source,Updates
f_autofe_mul_34d11bc4,0.0237,100.0,"-0.393, 6.4287, 9.3789",Upgini,"AutoFE: features from Calendar data,Markets data",Daily
f_autofe_div_50507976,0.0208,100.0,"-0.0504, -0.0559, -0.0186",Upgini,"AutoFE: features from Calendar data,Markets data",Daily
f_autofe_div_135dc79d,0.0179,100.0,"0.7927, 1.702, 1.0302",Upgini,AutoFE: features from Markets data,Daily
f_autofe_mul_0cc09d2d,0.0136,100.0,"-72.6808, 60.6836, 16.2907",Upgini,"AutoFE: features from Calendar data,Markets data",Daily
f_autofe_mul_478be4f3,0.0124,100.0,"0.9816, 0.5939, -1.3147",Upgini,"AutoFE: features from Calendar data,Markets data",Daily
f_autofe_div_89c56a5f,0.0111,100.0,"-0.0986, -1.0761, -0.2645",Upgini,"AutoFE: features from Calendar data,Markets data",Daily
f_autofe_div_cf5dd7be,0.0091,100.0,"-0.8731, -0.552, 0.8335",Upgini,"AutoFE: features from Calendar data,Markets data",Daily
f_events_date_year_cos1_9014a856,0.0086,100.0,"0.8521, -0.4289, 0.5724",Upgini,Calendar data,Daily
f_autofe_mul_402f5e6a,0.0075,100.0,"0.779, -0.797, -0.7981",Upgini,"AutoFE: features from Calendar data,Markets data",Daily
f_autofe_div_283a6178,0.007,100.0,"-0.0109, 0.0124, -0.0023",Upgini,"AutoFE: features from Calendar data,Markets data",Daily


Provider,Source,All features SHAP,Number of relevant features
Upgini,"AutoFE: features from Calendar data,Markets data",0.1493,39
Upgini,AutoFE: features from Markets data,0.0292,18
Upgini,Calendar data,0.0089,2
Upgini,AutoFE: features from Calendar data,0.006,9
Upgini,Markets data,0.0003,1
Upgini,AutoFE: feature from Markets data,0.0001,1


Sources,Feature name,Feature 1,Feature 2,Function
"Calendar data,Markets data",f_autofe_mul_34d11bc4,f_events_date_year_cos1_9014a856,f_financial_date_silver_e4e33014,*
"Calendar data,Markets data",f_autofe_div_50507976,f_events_date_year_cos1_9014a856,f_financial_date_silver_e4e33014,/
Markets data,f_autofe_div_135dc79d,f_financial_date_gold_7d_to_1y_ae310379,f_financial_date_natural_gas_7d_to_7d_1y_shift_a5c3c07f,/
"Calendar data,Markets data",f_autofe_mul_0cc09d2d,f_events_date_year_cos1_9014a856,f_financial_date_crude_oil_1f195998,*
"Calendar data,Markets data",f_autofe_mul_478be4f3,f_events_date_year_cos1_9014a856,f_financial_date_natural_gas_7d_to_7d_1y_shift_a5c3c07f,*
"Calendar data,Markets data",f_autofe_div_89c56a5f,f_events_date_year_cos1_9014a856,f_financial_date_natural_gas_7d_to_7d_1y_shift_a5c3c07f,/
"Calendar data,Markets data",f_autofe_div_cf5dd7be,f_events_date_week_sin1_847b5db1,f_financial_date_vix_7d_to_1y_634c77eb,/
"Calendar data,Markets data",f_autofe_mul_402f5e6a,f_events_date_year_cos1_9014a856,f_financial_date_silver_7d_to_7d_1y_shift_55fa8001,*
"Calendar data,Markets data",f_autofe_div_283a6178,f_events_date_week_cos2_b0a07cfc,f_financial_date_usd_7419609a,/
"Calendar data,Markets data",f_autofe_div_3e22df83,f_events_date_week_sin1_847b5db1,f_financial_date_crude_oil_1f195998,/

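The table above lists each AutoFE feature as a binary function (`*` or `/`) of two base features. As a rough illustration, the sketch below rebuilds such a feature by hand. All values are made up, and the yearly cosine encoding is only a guess at what a name like `f_events_date_year_cos1` might represent; Upgini's actual feature definitions are not shown in this notebook.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2013-01-01", "2013-04-01", "2013-07-01", "2013-10-01"])
base = pd.DataFrame({"date": dates})

# Assumed cyclical encoding of the day of year (hypothetical stand-in for a calendar feature).
base["year_cos1"] = np.cos(2 * np.pi * base["date"].dt.dayofyear / 365.25)

# Made-up market series (hypothetical stand-in for a price feature).
base["silver"] = [30.0, 28.0, 19.5, 21.5]

# Combine the two base features with the binary functions listed in the table.
base["autofe_mul"] = base["year_cos1"] * base["silver"]
base["autofe_div"] = base["year_cos1"] / base["silver"]

print(base.round(4))
```

Products and ratios like these let a tree model pick up interactions between seasonality and a market series that neither base feature shows on its own.
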


Examples of outliers with maximum value of target:
84    205
47    196
38    187
Name: target, dtype: int64
Outliers will be excluded during the metrics calculation.
Calculating accuracy uplift after enrichment...

which makes metrics between the train and eval_set incomparable.


Dataset type,Rows,Mean target,Baseline mean_squared_error,Enriched mean_squared_error,Uplift
Train,15213,50.3977,301.2609,197.7269,103.5339
Eval 1,3787,59.2424,485.3569,332.4573,152.8996
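
For an error metric such as mean squared error, the `Uplift` column in the table above matches the baseline error minus the enriched error. A minimal check, using values copied from that table (the relative reduction is an extra figure computed here, not something Upgini reports):

```python
# Values copied from the metrics table above: (baseline MSE, enriched MSE, reported uplift).
reported = {
    "Train": (301.2609, 197.7269, 103.5339),
    "Eval 1": (485.3569, 332.4573, 152.8996),
}

uplift = {}
for name, (baseline, enriched, shown) in reported.items():
    uplift[name] = baseline - enriched
    relative = uplift[name] / baseline
    # Allow for rounding in the displayed four-decimal values.
    assert abs(uplift[name] - shown) < 1e-3
    print(f"{name}: uplift {uplift[name]:.4f} ({relative:.1%} lower MSE than baseline)")
```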


In [20]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric

model = CatBoostRegressor(verbose = False, allow_writing_files = False, random_state = 0)

# calculate metrics before and after using feature enrichment
enricher.calculate_metrics(
    train_features, train_target,
    eval_set = [(test_features, test_target)],
    estimator = model,
    scoring = "mean_absolute_percentage_error"    # metric used to score predictions (MAPE, lower is better)
)

Calculating accuracy uplift after enrichment...
-
which makes metrics between the train and eval_set incomparable.


Unnamed: 0,Dataset type,Rows,Mean target,Baseline mean_absolute_percentage_error,Enriched mean_absolute_percentage_error,Uplift
0,Train,15213,50.3977,0.25285,0.175309,0.077541
1,Eval 1,3787,59.2424,0.264187,0.178654,0.085533
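
The `mean_absolute_percentage_error` score is the average of |actual - predicted| / |actual|, so lower is better, and a value of 0.18 means predictions are off by about 18% of the actual value on average. A minimal sketch with made-up numbers (not taken from the dataset):

```python
import numpy as np

# Made-up actual sales and predictions, for illustration only.
actual = np.array([50.0, 20.0, 40.0, 100.0])
predicted = np.array([45.0, 25.0, 40.0, 80.0])

# MAPE: mean of the absolute error relative to the actual value.
mape = np.mean(np.abs(actual - predicted) / np.abs(actual))
print(round(mape, 4))
```

Because each error is divided by the actual value, rows where the actual value is zero make this plain formula undefined, which is worth checking before relying on MAPE for sparse sales data.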
