# Feature Engineering

## 1. Packages

In [1]:
import numpy as np 
import pandas as pd 
import inflection
import math
import datetime
import matplotlib.pyplot as plt 
import seaborn as sns

sns.set_style('darkgrid')

## 2. Data Reading

In [2]:
data = pd.read_pickle('../Data/data_clean.pkl')

## 3. Hypothesis Creation

### 3.1 Store Hypotheses

**1.** Stores with more employees should sell more.

**2.** Stores with more stock capacity should sell more.

**3.** Stores with bigger sizes should sell more.

**4.** Stores with more assortments should sell more.

**5.** Stores with closer competitors should sell less.

**6.** Stores with older competitors should sell more.

### 3.2 Product Hypotheses

**1.** Stores which invest more in marketing should sell more.

**2.** Stores with more product exposure should sell more. 

**3.** Stores with lower prices should sell more.

**4.** Stores with more aggressive promotions should sell more.

**5.** Stores with longer active promotions should sell more.

**6.** Stores with more promotion days should sell more.

**7.** Stores with more consecutive promotions should sell more.

### 3.3 Time Hypotheses (Seasonality)

**1.** Stores open during Christmas should sell more.

**2.** Stores should sell more over the years.

**3.** Stores should sell more during the second half of the year.

**4.** Stores should sell more after day 10 of each month.

**5.** Stores should sell less on weekends.

**6.** Stores should sell less during School Holidays.

## 4. Final List of Hypotheses

Here, a final list of hypotheses is selected based on the availability of the data needed to test each one.

**1.** Stores with more assortments should sell more.

**2.** Stores with closer competitors should sell less.

**3.** Stores with older competitors should sell more.

**4.** Stores with longer active promotions should sell more.

**5.** Stores with more consecutive promotions should sell more.

**6.** Stores open during Christmas should sell more.

**7.** Stores should sell more over the years.

**8.** Stores should sell more during the second half of the year.

**9.** Stores should sell more after day 10 of each month.

**10.** Stores should sell less on weekends.

**11.** Stores should sell less during School Holidays.

## 5. Feature Engineering

In this section, the feature engineering process is implemented: new features are created to help test the hypotheses above and to improve model performance.

First, some basic calendar features are derived from the original "date" column.

In [5]:
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day'] = data['date'].dt.day
data['week_of_year'] = data['date'].dt.isocalendar().week
data['year_week'] = data['date'].dt.strftime('%Y-%W')
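
The same derivations can be checked on a small toy frame. This is a minimal sketch, assuming a datetime64 `date` column; the three dates are hypothetical and chosen to show that the ISO week number and the `%W` week number follow different conventions.

```python
import pandas as pd

# Toy frame with illustrative dates (not taken from the dataset).
toy = pd.DataFrame({'date': pd.to_datetime(['2015-01-01', '2015-07-31', '2015-12-31'])})

toy['year'] = toy['date'].dt.year
toy['month'] = toy['date'].dt.month
toy['day'] = toy['date'].dt.day
# ISO 8601 week number; pandas returns it with the nullable UInt32 dtype.
toy['week_of_year'] = toy['date'].dt.isocalendar().week
# %W starts weeks on Monday and puts the days before the first Monday in week 00.
toy['year_week'] = toy['date'].dt.strftime('%Y-%W')

print(toy[['date', 'week_of_year', 'year_week']])
```

Because the two conventions differ, `week_of_year` and the week part of `year_week` do not always match for the same row (for example, 2015-01-01 is ISO week 1 but `%W` week 00).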

Next, a "Competition Since" column is derived by combining the original month and year columns into a single date (the first day of that month). A second column holds the time since the competitor opened, in months, approximating each month as 30 days.

In [6]:
data['competition_since'] = data.apply(lambda x: datetime.datetime(day=1, month=x['competition_open_since_month'], year=x['competition_open_since_year']),axis=1)
data['competition_time_month'] = ((data['date'] - data['competition_since'])/30).apply(lambda x: x.days).astype('int64')
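
A row-wise `apply` can be slow on large frames. The sketch below shows one possible vectorized alternative, assuming the month and year columns hold integers; the toy values are illustrative, and `parts` is a helper name introduced only for this example.

```python
import pandas as pd

# Toy rows with illustrative values (hypothetical, for demonstration only).
toy = pd.DataFrame({
    'date': pd.to_datetime(['2015-07-31', '2015-07-31']),
    'competition_open_since_month': [9, 11],
    'competition_open_since_year': [2008, 2007],
})

# pd.to_datetime can assemble dates from a frame with year/month/day columns.
parts = pd.DataFrame({
    'year': toy['competition_open_since_year'],
    'month': toy['competition_open_since_month'],
    'day': 1,
})
toy['competition_since'] = pd.to_datetime(parts)

# Whole days elapsed, floor-divided by 30 as a rough month length.
elapsed_days = (toy['date'] - toy['competition_since']).dt.days
toy['competition_time_month'] = (elapsed_days // 30).astype('int64')

print(toy[['competition_since', 'competition_time_month']])
```

The value is negative when the competitor opened after the observation date, which is the same behavior as the `apply`-based version.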

A similar process is applied to the "Promo Since..." columns: the year and week are combined into a single start date, and a second column holds the time since the promotion started, in weeks.

In [7]:
data['promo_since'] = data['promo2_since_year'].astype( str ) + '-' + data['promo2_since_week'].astype( str )
data['promo_since'] = data['promo_since'].apply( lambda x: datetime.datetime.strptime( x + '-1', '%Y-%W-%w' ) - datetime.timedelta( days=7 ) )
data['promo_time_week'] = ( ( data['date'] - data['promo_since'] ) / 7 ).apply( lambda x: x.days ).astype( int )
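
To make the date parsing concrete, here is a worked example for a single row. It is a sketch with hypothetical inputs (year 2010, week 13, observed on 2015-07-31), and `promo_start` is a helper name introduced only for illustration.

```python
import datetime

def promo_start(year, week):
    """Monday of week `week` (%W numbering), moved back 7 days as in the notebook cell."""
    # The trailing '-1' is parsed by %w as Monday.
    monday = datetime.datetime.strptime(f'{year}-{week}-1', '%Y-%W-%w')
    return monday - datetime.timedelta(days=7)

start = promo_start(2010, 13)
observed = datetime.datetime(2015, 7, 31)
# Whole weeks between the promotion start and the observation date.
promo_time_week = (observed - start).days // 7

print(start.date(), promo_time_week)  # 2010-03-22 279
```

Note that `%W` treats the week containing the first Monday of the year as week 1, which is not the ISO 8601 definition; whether that matches how the dataset numbers its weeks is not stated here.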

For the Assortment and State Holiday columns, the encoded values are replaced with descriptive names, according to the data documentation.

In [8]:
data['assortment'] = data['assortment'].apply(lambda x: 'basic' if x == 'a' else 'extra' if x == 'b' else 'extended')

In [9]:
data['state_holiday'] = data['state_holiday'].apply(lambda x: 'public_holiday' if x == 'a' else 'easter_holiday' if x == 'b' else 'christmas' if x =='c' else 'regular_day')
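
Chained conditionals become hard to read as the number of codes grows. One possible alternative is a dictionary lookup with `Series.map`, using `fillna` to reproduce the final `else` branch. This is a sketch on toy data; the codes `'c'` and `'0'` in the toy frame are assumed placeholders for "any other value".

```python
import pandas as pd

# Explicit code-to-name tables, mirroring the conditions in the cells above.
ASSORTMENT_NAMES = {'a': 'basic', 'b': 'extra'}
STATE_HOLIDAY_NAMES = {'a': 'public_holiday', 'b': 'easter_holiday', 'c': 'christmas'}

# Toy input; 'c' (assortment) and '0' (state_holiday) stand for unlisted codes.
toy = pd.DataFrame({
    'assortment': ['a', 'b', 'c'],
    'state_holiday': ['0', 'a', 'c'],
})

# map() yields NaN for unlisted codes; fillna() supplies the default label.
toy['assortment'] = toy['assortment'].map(ASSORTMENT_NAMES).fillna('extended')
toy['state_holiday'] = toy['state_holiday'].map(STATE_HOLIDAY_NAMES).fillna('regular_day')

print(toy)
```

As with the original lambdas, any unexpected code silently falls into the default label, so it may be worth checking the set of unique values before mapping.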

A summary of the resulting data is presented below.

In [10]:
data.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| store | 1 | 2 | 3 | 4 | 5 |
| day_of_week | 5 | 5 | 5 | 5 | 5 |
| date | 2015-07-31 00:00:00 | 2015-07-31 00:00:00 | 2015-07-31 00:00:00 | 2015-07-31 00:00:00 | 2015-07-31 00:00:00 |
| sales | 5263 | 6064 | 8314 | 13995 | 4822 |
| customers | 555 | 625 | 821 | 1498 | 559 |
| open | 1 | 1 | 1 | 1 | 1 |
| promo | 1 | 1 | 1 | 1 | 1 |
| state_holiday | regular_day | regular_day | regular_day | regular_day | regular_day |
| school_holiday | 1 | 1 | 1 | 1 | 1 |
| store_type | c | a | a | c | a |


## 6. Data Filtering

In this section, the data is filtered by both rows and columns, based on initial knowledge of the dataset.

### 6.1. Row Filtering

For row filtering, rows that correspond to closed stores or that have no sales are removed from the dataset.

In [11]:
data = data[(data['open'] != 0) & (data['sales'] > 0)]
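
The effect of the boolean mask can be seen on a small toy frame. This is a sketch with made-up rows; `mask`, `filtered`, and `removed` are names introduced only for the example.

```python
import pandas as pd

# Toy rows: one closed store, one open store with zero sales, two valid rows.
toy = pd.DataFrame({
    'open': [1, 0, 1, 1],
    'sales': [5263, 0, 0, 6064],
})

# Keep only rows where the store is open AND sales are positive.
mask = (toy['open'] != 0) & (toy['sales'] > 0)
# .copy() makes the result an independent frame, so later column
# assignments do not trigger pandas' SettingWithCopyWarning.
filtered = toy[mask].copy()
removed = len(toy) - len(filtered)

print(f'kept {len(filtered)} rows, removed {removed}')
```

Counting the removed rows before overwriting `data` is a cheap sanity check that the filter drops roughly the expected share of the dataset.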

### 6.2. Column Filtering

For column filtering, two kinds of columns are removed: columns not available at prediction time, such as "Customers", and columns already represented by the newly engineered features.

In [12]:
cols_drop = ['customers','open','promo_interval','month_map']
data = data.drop(cols_drop,axis=1)
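
`DataFrame.drop` raises a `KeyError` if any listed column is absent, for example when the cell is re-run on an already filtered frame. A more tolerant variant is sketched below on toy data; the `missing` list is a hypothetical helper for reporting what was not found.

```python
import pandas as pd

# Toy frame that lacks two of the columns slated for removal.
toy = pd.DataFrame({'sales': [5263], 'customers': [555], 'open': [1]})

cols_drop = ['customers', 'open', 'promo_interval', 'month_map']

# Record which requested columns are absent before dropping.
missing = [col for col in cols_drop if col not in toy.columns]
# errors='ignore' skips absent labels instead of raising KeyError.
toy = toy.drop(columns=cols_drop, errors='ignore')

print('remaining:', toy.columns.tolist(), '| not found:', missing)
```

Whether to ignore missing columns or fail loudly is a design choice: failing loudly catches upstream schema changes, while `errors='ignore'` makes the cell safe to re-run.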

## 7. Data Exporting

Finally, the resulting data is exported as a pickle file for use in subsequent notebooks.

In [13]:
data.to_pickle('../Data/data_engineered.pkl')