# Rossmann Store Sales Prediction from August to January 2015

## Import necessary modules for exploration

In [23]:
import pandas as pd
from sklearn.pipeline import Pipeline
import numpy as np
from sklearn.impute import KNNImputer

## Load the dataset(s) and view the first rows


In [6]:
df = pd.read_csv('rossmann-store-sales/train.csv')
df_store = pd.read_csv('rossmann-store-sales/store.csv')
full_store_details = df.merge(df_store)
full_store_details.head(70)



Unnamed: 0,Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,5,2015-07-31,5263,555,1,1,0,1,c,a,1270.0,9.0,2008.0,0,,,
1,1,4,2015-07-30,5020,546,1,1,0,1,c,a,1270.0,9.0,2008.0,0,,,
2,1,3,2015-07-29,4782,523,1,1,0,1,c,a,1270.0,9.0,2008.0,0,,,
3,1,2,2015-07-28,5011,560,1,1,0,1,c,a,1270.0,9.0,2008.0,0,,,
4,1,1,2015-07-27,6102,612,1,1,0,1,c,a,1270.0,9.0,2008.0,0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,1,3,2015-05-27,4083,497,1,0,0,0,c,a,1270.0,9.0,2008.0,0,,,
66,1,2,2015-05-26,4211,479,1,0,0,0,c,a,1270.0,9.0,2008.0,0,,,
67,1,1,2015-05-25,0,0,0,0,a,0,c,a,1270.0,9.0,2008.0,0,,,
68,1,7,2015-05-24,0,0,0,0,0,0,c,a,1270.0,9.0,2008.0,0,,,


In a data frame, each row corresponds to one observation (here, one store on one day) and
each column corresponds to one feature (store type, promotions run, assortment, and so on).

For example, the first observation shows that store 1 is of type 'c', carries a basic assortment of goods (denoted 'a'), and is 1270 metres from its nearest competitor, which opened around September 2008. This store does not participate in the ongoing promotion (labelled *Promo2*).


## Get to know your data

In [42]:
full_store_details.dtypes , full_store_details.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1017209 entries, 0 to 1017208
Data columns (total 18 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   Store                      1017209 non-null  int64  
 1   DayOfWeek                  1017209 non-null  int64  
 2   Date                       1017209 non-null  object 
 3   Sales                      1017209 non-null  int64  
 4   Customers                  1017209 non-null  int64  
 5   Open                       1017209 non-null  int64  
 6   Promo                      1017209 non-null  int64  
 7   StateHoliday               1017209 non-null  object 
 8   SchoolHoliday              1017209 non-null  int64  
 9   StoreType                  1017209 non-null  object 
 10  Assortment                 1017209 non-null  object 
 11  CompetitionDistance        1014567 non-null  float64
 12  CompetitionOpenSinceMonth  693861 non-null   float64
 13  CompetitionO

(Store                          int64
 DayOfWeek                      int64
 Date                          object
 Sales                          int64
 Customers                      int64
 Open                           int64
 Promo                          int64
 StateHoliday                  object
 SchoolHoliday                  int64
 StoreType                     object
 Assortment                    object
 CompetitionDistance          float64
 CompetitionOpenSinceMonth    float64
 CompetitionOpenSinceYear     float64
 Promo2                         int64
 Promo2SinceWeek              float64
 Promo2SinceYear              float64
 PromoInterval                 object
 dtype: object,
 None)

The *df_store* dataset contains information about each store: its id, the distance to the nearest competitor and since when that competitor has been open, whether the store participates in the promotion labelled *Promo2*, since when it has participated, and the intervals at which Promo2 runs.

The *df* dataset contains each store's sales and customer counts per day, as well as whether a promotion was running that day.
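The two tables share only the `Store` column, which is what `df.merge(df_store)` joins on when no key is given. The sketch below makes that join explicit on two small made-up frames (`toy_sales` and `toy_stores` are hypothetical stand-ins, not the Rossmann files). The `validate="many_to_one"` argument asks pandas to raise an error if a store id appears more than once in the store table.

```python
import pandas as pd

# Hypothetical miniature versions of the daily sales table and the store table.
toy_sales = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": ["2015-07-31", "2015-07-30", "2015-07-31"],
    "Sales": [5263, 5020, 6064],
})
toy_stores = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["c", "a"],
    "Promo2": [0, 1],
})

# Many daily rows map to one store row, so each sales row gains the store's attributes.
toy_merged = toy_sales.merge(
    toy_stores, on="Store", how="left", validate="many_to_one"
)
print(toy_merged)
```

Naming the key and the join type is optional here, but it documents the intent and guards against silently duplicated rows.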


## Wrangle some data

Stores that do not participate in Promo2 have NaN values in the *Promo2SinceWeek*, *Promo2SinceYear* and *PromoInterval* columns. It is possible to adjust for this without skewing our data by filling in the NaN values using a pipeline.
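One way to confirm that pattern before imputing is to compute the share of missing values per column, grouped by `Promo2`. The sketch below uses a small made-up frame (`toy_promo` is hypothetical). On the real data the same expression would be applied to `full_store_details`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: Promo2 == 0 stores carry no Promo2 details.
toy_promo = pd.DataFrame({
    "Promo2": [0, 1, 0, 1],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan, 40.0],
    "Promo2SinceYear": [np.nan, 2010.0, np.nan, 2014.0],
    "PromoInterval": [None, "Jan,Apr,Jul,Oct", None, "Jan,Apr,Jul,Oct"],
})
promo_columns = ["Promo2SinceWeek", "Promo2SinceYear", "PromoInterval"]

# Fraction of missing values in each column, split by participation flag.
missing_share = toy_promo[promo_columns].isna().groupby(toy_promo["Promo2"]).mean()
print(missing_share)
```

A share of 1.0 for `Promo2 == 0` and 0.0 for `Promo2 == 1` would indicate that the gaps are structural rather than random, which matters when choosing a fill strategy.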

The first step in building the pipeline is to define each transformer type. The convention here is generally to create transformers for the different variable types.

### Drop some columns not pertinent to the exploration

In [14]:
full_store_details.drop(['CompetitionOpenSinceMonth' , 'CompetitionOpenSinceYear'] , axis=1)
full_store_details.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1017209 entries, 0 to 1017208
Data columns (total 18 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   Store                      1017209 non-null  int64  
 1   DayOfWeek                  1017209 non-null  int64  
 2   Date                       1017209 non-null  object 
 3   Sales                      1017209 non-null  int64  
 4   Customers                  1017209 non-null  int64  
 5   Open                       1017209 non-null  int64  
 6   Promo                      1017209 non-null  int64  
 7   StateHoliday               1017209 non-null  object 
 8   SchoolHoliday              1017209 non-null  int64  
 9   StoreType                  1017209 non-null  object 
 10  Assortment                 1017209 non-null  object 
 11  CompetitionDistance        1014567 non-null  float64
 12  CompetitionOpenSinceMonth  693861 non-null   float64
 13  CompetitionO
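
The `info()` output above still reports 18 columns, including `CompetitionOpenSinceMonth`. That is because `DataFrame.drop` returns a new frame and leaves the original untouched unless the result is assigned back. A minimal sketch on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical two-store frame with the two columns we want to remove.
toy = pd.DataFrame({
    "Store": [1, 2],
    "CompetitionOpenSinceMonth": [9.0, 11.0],
    "CompetitionOpenSinceYear": [2008.0, 2007.0],
})
cols_to_drop = ["CompetitionOpenSinceMonth", "CompetitionOpenSinceYear"]

# The returned frame is discarded here, so toy keeps all three columns.
toy.drop(cols_to_drop, axis=1)
columns_after_discarded_drop = list(toy.columns)

# Keeping the returned frame is what actually removes the columns.
toy_trimmed = toy.drop(columns=cols_to_drop)
columns_after_assigned_drop = list(toy_trimmed.columns)

print(columns_after_discarded_drop)
print(columns_after_assigned_drop)
```

In the notebook, the equivalent would be to assign the result of `drop` back to `full_store_details` before calling `info()`.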

Data pipelines transform data from one representation to another through a series of chained steps, where the output of one step is the input to the next. For example, one can fill in missing values, pass the result to cross-validation and grid search, and then fit the model, all within a single chain.

Define the X and y variables used for imputing the missing values
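
As a rough illustration of that chaining, the sketch below builds a three-step scikit-learn `Pipeline` on made-up numbers. The choice of `SimpleImputer`, `StandardScaler` and `LinearRegression` is an assumption made for the example, not the model used in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy features with one missing value (made-up numbers, not Rossmann data).
X_toy = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_toy = np.array([1.0, 2.0, 3.0, 4.0])

# Each step receives the output of the previous one.
toy_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),  # fill the NaN
    ("scale", StandardScaler()),                   # standardise each column
    ("model", LinearRegression()),                 # final estimator
])

toy_pipeline.fit(X_toy, y_toy)
toy_predictions = toy_pipeline.predict(X_toy)
print(toy_predictions)
```

Because the whole chain is a single estimator, it can be handed to utilities such as `GridSearchCV`, so the imputation and scaling are refitted inside each cross-validation fold.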

In [17]:
X = full_store_details['Promo2']
y = full_store_details[['Promo2SinceWeek' , 'Promo2SinceYear']]

### Impute missing values in the y columns using K-nearest neighbors

In [24]:
nan = np.nan

In [None]:
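# Each missing value is filled from the 2 nearest rows, weighted equally
imputer = KNNImputer(n_neighbors=2, weights="uniform")
# fit_transform returns a NumPy array; keep it rather than discarding it
y_imputed = imputer.fit_transform(y)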
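
To see what `KNNImputer` does on a scale that can be checked by hand, here is a sketch on a small made-up matrix (not Rossmann data). Each missing cell is replaced by the mean of that column over the two nearest rows that have a value there, with distances computed only over the coordinates both rows share.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing cells (made-up values).
toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

toy_imputer = KNNImputer(n_neighbors=2, weights="uniform")
toy_filled = toy_imputer.fit_transform(toy)
print(toy_filled)
```

One caveat for the notebook's use case: rows where both `Promo2SinceWeek` and `Promo2SinceYear` are missing have no coordinates to compute a distance from. According to the scikit-learn documentation, the imputer then falls back to the training-set average of the feature, so the result for those rows is closer to mean imputation than to a true nearest-neighbor fill.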

Factors such as promotions, competition, school and state holidays, seasonality, and locality are necessary for predicting sales across the various stores.

Next, we use the ColumnTransformer to apply each transformation to the correct columns in the dataframe. Before building it, I stored lists of the numeric and categorical columns based on their pandas dtypes.
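
A minimal sketch of that setup follows, assuming one transformer per variable type and column lists derived with `select_dtypes`. The frame `toy_frame`, the step names, and the specific transformers (`SimpleImputer`, `StandardScaler`, `OneHotEncoder`) are illustrative assumptions rather than the notebook's final choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical slice of store attributes with one missing distance.
toy_frame = pd.DataFrame({
    "CompetitionDistance": [1270.0, np.nan, 570.0, 14130.0],
    "Promo2": [0, 1, 1, 0],
    "StoreType": ["c", "a", "a", "d"],
    "Assortment": ["a", "a", "c", "c"],
})

# Split column names by dtype: numeric versus everything else.
numeric_columns = toy_frame.select_dtypes(include="number").columns.tolist()
categorical_columns = toy_frame.select_dtypes(exclude="number").columns.tolist()

# One transformer per variable type.
numeric_transformer = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

# Route each column list to its transformer.
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_columns),
    ("cat", categorical_transformer, categorical_columns),
])

toy_features = preprocessor.fit_transform(toy_frame)
print(toy_features.shape)
```

The resulting `preprocessor` can itself be the first step of a larger `Pipeline`, followed by the model.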