In [9]:
import pandas as pd

In [1]:
%store -r full_store_details


In [56]:
full_store_details.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1017209 entries, 0 to 1017208
Data columns (total 21 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   Store                      1017209 non-null  int64  
 1   DayOfWeek                  1017209 non-null  int64  
 2   Date                       1017209 non-null  object 
 3   Sales                      1017209 non-null  int64  
 4   Customers                  1017209 non-null  int64  
 5   Open                       1017209 non-null  int64  
 6   Promo                      1017209 non-null  int64  
 7   StateHoliday               1017209 non-null  object 
 8   SchoolHoliday              1017209 non-null  int64  
 9   StoreType                  1017209 non-null  object 
 10  Assortment                 1017209 non-null  object 
 11  CompetitionDistance        1014567 non-null  float64
 12  CompetitionOpenSinceMonth  693861 non-null   float64
 13  CompetitionO

## Preprocessing 

The data must be processed into a format that a machine learning model can consume, which usually means numeric. This involves:
   1. Converting all non-numeric columns to numeric
   2. Handling NaN values
   3. Generating new features from existing ones. In this case, we have a few datetime columns to generate new features from. We can extract the following from them:
      - weekdays
      - weekends
      - number of days to holidays
      - number of days after a holiday
      - beginning, middle, and end of month
   4. Scaling the data to help with predictions
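
As a rough illustration of step 3, the sketch below derives weekday, weekend, and part-of-month features from a date column. It uses a small toy frame rather than the real data, and the column names `Weekday`, `IsWeekend`, and `MonthPart`, as well as the day-10 and day-20 cut-offs, are assumptions made for the example. The holiday-distance features are not covered here.

```python
import pandas as pd

# Toy frame standing in for the real data; only a `Date` column is assumed.
df = pd.DataFrame({"Date": ["2015-07-01", "2015-07-04", "2015-07-15", "2015-07-31"]})
df["Date"] = pd.to_datetime(df["Date"])

# Day of week as an integer: Monday=0 ... Sunday=6.
df["Weekday"] = df["Date"].dt.dayofweek

# Flag Saturdays and Sundays.
df["IsWeekend"] = (df["Weekday"] >= 5).astype(int)

# Split each month into rough thirds; the cut-offs are an arbitrary choice.
df["MonthPart"] = pd.cut(
    df["Date"].dt.day,
    bins=[0, 10, 20, 31],
    labels=["begin", "mid", "end"],
)

print(df)
```

`MonthPart` is itself categorical, so it would still need encoding before being passed to a model.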

### 1. Converting non-numeric columns to numeric
Most machine learning models in Python raise an error if you pass them non-numeric (string) variables without "encoding" them first.

#### First, find out the data types of the feature columns

In [7]:
full_store_details.dtypes

Store                          int64
DayOfWeek                      int64
Date                          object
Sales                          int64
Customers                      int64
Open                           int64
Promo                          int64
StateHoliday                  object
SchoolHoliday                  int64
StoreType                     object
Assortment                    object
CompetitionDistance          float64
CompetitionOpenSinceMonth    float64
CompetitionOpenSinceYear     float64
Promo2                         int64
Promo2SinceWeek              float64
Promo2SinceYear              float64
PromoInterval                 object
dtype: object
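
Instead of reading the non-numeric columns off the listing by eye, they can also be picked out programmatically. The following is a minimal sketch on a toy frame (the `sample` frame and its values are made up for illustration), using `select_dtypes` to keep everything that is not a number.

```python
import pandas as pd

# Small stand-in frame with a mix of numeric and string columns.
sample = pd.DataFrame({
    "Store": [1, 2],
    "Date": ["2015-07-31", "2015-07-31"],
    "StoreType": ["c", "a"],
    "CompetitionDistance": [1270.0, 570.0],
})

# Exclude all numeric dtypes; what remains still needs converting.
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```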

#### Convert the Assortment, StoreType, and StateHoliday columns first, which are string objects
One-hot encoding is one way to achieve the conversion.

Pandas offers a convenient function called `get_dummies` to get one-hot encodings. Additionally, scikit-learn offers a `OneHotEncoder` class that encodes categorical features as a one-hot numeric array. The cells below instead use scikit-learn's `LabelEncoder`, which maps each category to an integer code.
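
For reference, a minimal sketch of one-hot encoding with `get_dummies` is shown here on a toy frame. The `sample` data is made up, and this is an illustration rather than the approach the notebook goes on to use.

```python
import pandas as pd

# Toy categorical data using the same column names as the real frame.
sample = pd.DataFrame({
    "Assortment": ["a", "c", "a"],
    "StoreType": ["c", "d", "a"],
})

# One indicator column per (column, category) pair, stored as 0/1 integers.
encoded = pd.get_dummies(sample, columns=["Assortment", "StoreType"], dtype=int)
print(encoded)
```

Each original column is replaced by one indicator column per category, so the frame gets wider as the number of distinct categories grows.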

In [12]:
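# Select the categorical columns as a DataFrame (not just a list of names)
cat_features = full_store_details[['Assortment', 'StoreType', 'StateHoliday']]
cat_features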

        Assortment StoreType StateHoliday
0                a         c            0
1                a         c            0
2                a         c            0
3                a         c            0
4                a         c            0
...            ...       ...          ...
1017204          c         d            0
1017205          c         d            0
1017206          c         d            0
1017207          c         d            0


In [55]:
from sklearn.preprocessing import LabelEncoder

# Replace each categorical column with integer codes.
# Values are cast to str first so that any mixed-type column encodes consistently;
# a fresh encoder is fitted for each column.
cat_features = cat_features.apply(lambda col: LabelEncoder().fit_transform(col.astype(str)))
cat_features


         Assortment  StoreType  StateHoliday
0                 0          2             0
1                 0          2             0
2                 0          2             0
3                 0          2             0
4                 0          2             0
...             ...        ...           ...
1017204           2          3             0
1017205           2          3             0
1017206           2          3             0
1017207           2          3             0
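
Integer codes on their own do not show which category each number stands for. The sketch below is one way to recover that mapping, using pandas categoricals as a stand-in for the label encoder (both assign codes in sorted order of the string values). The `col` series is toy data made up for the example.

```python
import pandas as pd

# Toy column; the real data would be a column such as Assortment.
col = pd.Series(["a", "c", "a", "b"], name="Assortment")

# Categories are sorted, so codes follow alphabetical order of the labels.
cat = col.astype(str).astype("category")
codes = cat.cat.codes

# Code -> original label lookup.
mapping = dict(enumerate(cat.cat.categories))
print(codes.tolist(), mapping)
```

Note that integer codes imply an ordering (0 < 1 < 2) that the categories may not actually have, which some models will treat as meaningful.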


@article{scikit-learn,
 title={Scikit-learn: Machine Learning in {P}ython},
 author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
         and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
         and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
         Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
 journal={Journal of Machine Learning Research},
 volume={12},
 pages={2825--2830},
 year={2011}
}