# Challenge 
Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are ***tasked with predicting their daily sales for up to six weeks in advance***. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

In their first Kaggle competition, Rossmann is challenging you to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what’s most important to them: their customers and their teams! 

# Predict Store Sales 
***What we learn from this notebook***

1. **Data Collection and Preparation**
   * Gather Data
   * Remove **TOTALLY** unrelated columns from dataset
   * Encode categorical features
   * Handle categorical features with too many unique values (cardinality)
   * Missing value handling 
  
  
2. **Exploratory Data Analysis (EDA)**
   * Remove outliers
   * Understand data distribution (mean, median, mode, standard deviation).
   * Check for correlations & covariance between features.
   
   
3. **Feature Selection & Engineering**
   *  Select relevant features based on correlation, mutual information, feature importance, Lasso, RFE, R-squared, p-values, and VIF.
   * Create new features if necessary.
   
   
4. **Check for Assumptions of Linear Regression**
   * **Linearity**: The relationship between predictors and the target should be linear.
   * **Independence**: Observations should be independent of each other.
   * **Homoscedasticity**: The residuals should have constant variance.
   * **Normality of Residuals**: Residuals should be normally distributed.
   * **No Multicollinearity**: Predictors should not be highly correlated with each other.
   
   
5. **Model Training**
   * Split data into training and testing sets.
   * Train the multiple linear regression model.
   
   
6. **Model Evaluation**
   * Evaluate the model using metrics like R-squared, adjusted R-squared, Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
   * Check residual plots to ensure assumptions are met.
   
   
7. **Predict sales with the trained model**
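
The metrics named in step 6 can be computed directly from their definitions. The sketch below is a minimal illustration on made-up numbers, not this notebook's evaluation code; the function name `regression_metrics` and the sample values are assumptions introduced here.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return R-squared, adjusted R-squared, MSE and RMSE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R-squared penalises each additional predictor
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    mse = ss_res / n
    return {"r2": r2, "adj_r2": adj_r2, "mse": mse, "rmse": np.sqrt(mse)}

# Hypothetical daily sales and predictions, for illustration only
metrics = regression_metrics([5000, 6200, 4800, 7100, 5500],
                             [5100, 6000, 5000, 7000, 5400], n_features=2)
print(metrics)
```

Adjusted R-squared is never larger than R-squared, which is why it is the more useful of the two when comparing models with different numbers of predictors.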
 

## 1. Data Collection and Preparation

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input/'):
    for filename in filenames:
        full_path = os.path.join(dirname, filename)
        print (full_path)
        if "store.csv" in full_path:
            df_store = pd.read_csv(full_path)
        elif "train.csv" in full_path:
            df_train = pd.read_csv(full_path)
        elif "test.csv" in full_path:
            df_test = pd.read_csv(full_path)
                                 
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Data fields
Most of the fields are self-explanatory. The following are descriptions for those that aren't.

**Id** - an Id that represents a (Store, Date) duple within the test set

**Store** - a unique Id for each store

**Sales** - the turnover for any given day (this is what you are predicting)

**Customers** - the number of customers on a given day

**Open** - an indicator for whether the store was open: 0 = closed, 1 = open

**StateHoliday** - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None

**SchoolHoliday** - indicates if the (Store, Date) was affected by the closure of public schools

**StoreType** - differentiates between 4 different store models: a, b, c, d

**Assortment** - describes an assortment level: a = basic, b = extra, c = extended

**CompetitionDistance** - distance in meters to the nearest competitor store

**CompetitionOpenSince[Month/Year]** - gives the approximate year and month of the time the nearest competitor was opened

**Promo** - indicates whether a store is running a promo on that day

**Promo2** - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating

**Promo2Since[Year/Week]** - describes the year and calendar week when the store started participating in Promo2

**PromoInterval** - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
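
To make the `PromoInterval` encoding concrete, the sketch below checks whether a given calendar month starts a new Promo2 round. The helper `starts_promo2_round` is hypothetical, accepting `Sept` as an alternative spelling of September is an assumption about how months may be abbreviated, and treating a missing interval as "no round" is likewise an assumption.

```python
import pandas as pd

# Month abbreviation -> month number; "Sept" is an assumed alternative spelling
MONTH_NUMBER = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
                "Jul": 7, "Aug": 8, "Sep": 9, "Sept": 9, "Oct": 10, "Nov": 11, "Dec": 12}

def starts_promo2_round(promo_interval, month):
    """Return True if `month` (1-12) is one of the months listed in `promo_interval`."""
    if pd.isna(promo_interval):  # assumption: a missing interval means no round starts
        return False
    start_months = {MONTH_NUMBER[name.strip()] for name in promo_interval.split(",")}
    return month in start_months

print(starts_promo2_round("Feb,May,Aug,Nov", 5))   # May is listed
print(starts_promo2_round("Feb,May,Aug,Nov", 6))   # June is not listed
print(starts_promo2_round(float("nan"), 5))        # no interval recorded
```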


In [None]:
df_store.head()

In [None]:
df_train.head()

#### Check the shape of each DataFrame
#### Merge the train and store data on the store ID so that each row carries all of its store's features
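A left merge on `Store` attaches the store attributes to every daily row without changing the number of rows. The toy frames below are made up for illustration, and `validate="many_to_one"` is an optional safeguard added in this sketch rather than something the notebook itself uses.

```python
import pandas as pd

# Toy stand-ins for the train and store tables (values are made up)
toy_train = pd.DataFrame({"Store": [1, 1, 2], "Sales": [5263, 5020, 6064]})
toy_store = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

# validate="many_to_one" raises MergeError if a Store id appears twice in toy_store
toy_merged = toy_train.merge(toy_store, how="left", on="Store", validate="many_to_one")

print(toy_merged.shape)  # same row count as toy_train, one extra column
print(toy_merged)
```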

In [None]:
dfs = [df_store, df_train]
for df in dfs:
    print(df.shape)

In [None]:
df = df_train.merge(df_store, how="left", on="Store")
df.info()  # info() prints its summary directly and returns None, so no print() is needed
df.head()

#### Intuitive Approach to Prediction
##### 1. How did *Promo, StateHoliday, SchoolHoliday, Promo2, Promo2SinceWeek* and *Promo2SinceYear* impact sales in a store?
  * 1.1 What is the impact of question 1 by year, month, and week?
      * 1.2 How does the season of the year impact sales in a store?
      * 1.3 How do weekends impact sales in a store over a year?
      * 1.4 How does being open on a weekend, weekday, StateHoliday, or SchoolHoliday impact sales in a store?
      * 1.5 Was there an impact on sales in stores for a particular month in a given year?
##### 2. How much did *CompetitionOpenSinceMonth* and *CompetitionOpenSinceYear* impact sales in a store?
##### 3. How much did *PromoInterval* impact sales in a store?
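
The calendar questions above (year, month, week, weekend) need features derived from `Date`. A minimal sketch on a made-up frame follows; the column names `Year`, `Month`, `WeekOfYear` and `IsWeekend` are assumptions introduced here, and the weekend is taken to be Saturday and Sunday.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({"Date": ["2015-07-31", "2015-08-01", "2015-08-02"],
                    "Sales": [5263, 4100, 0]})
toy["Date"] = pd.to_datetime(toy["Date"])

toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month
toy["WeekOfYear"] = toy["Date"].dt.isocalendar().week.astype(int)  # ISO week number
toy["IsWeekend"] = toy["Date"].dt.dayofweek >= 5  # Monday=0, so 5 and 6 are Sat/Sun

# Average sales by weekend flag
print(toy.groupby("IsWeekend")["Sales"].mean())
```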

### Fix The Data Types 

In [None]:
df['Date'].isnull().sum()

In [None]:
df['Date'] = pd.to_datetime(df['Date'])

In [None]:
df.head()

## Missing Values Handling

In [None]:
df.isnull().sum()

### Handling missing value for the feature CompetitionDistance

In [None]:
df[df['CompetitionDistance'].isnull()].head()

#### Check whether missingness in CompetitionDistance is related to other features

In [None]:
print (df['CompetitionDistance'].isnull().sum() )
print (df['CompetitionOpenSinceMonth'].isnull().sum() )
print (df['CompetitionOpenSinceYear'].isnull().sum() )

#### There appears to be a close relation between these features
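One way to make such a relation explicit is to cross-tabulate the null indicators of two columns. The sketch below uses a made-up frame, so the counts say nothing about the real data; the variable names `toy` and `co_missing` are assumptions introduced here.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "CompetitionDistance":       [570.0, np.nan, 14130.0, np.nan, 620.0],
    "CompetitionOpenSinceMonth": [11.0,  np.nan, np.nan,  np.nan, 9.0],
})

# Rows: is CompetitionDistance missing; columns: is CompetitionOpenSinceMonth missing
co_missing = pd.crosstab(toy["CompetitionDistance"].isna(),
                         toy["CompetitionOpenSinceMonth"].isna(),
                         rownames=["distance_missing"], colnames=["month_missing"])
print(co_missing)
```

A zero in the cell "distance missing, month present" would indicate that whenever the distance is missing, the opening month is missing too.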

In [None]:
df[df['CompetitionDistance'].isna() & df['CompetitionOpenSinceMonth'].isna() & df['CompetitionOpenSinceYear'].isna()].shape

#### Let's check whether any store has CompetitionDistance present in some rows but missing at random in others

In [None]:
# Group by 'Store' and check for both NaN and non-NaN values in 'CompetitionDistance'
store_groups = df.groupby('Store')['CompetitionDistance'].agg(
    missing_count = lambda x: x.isna().sum(),
    non_missing_count = lambda x: x.notna().sum()
).reset_index()

# Filter stores that have both missing and non-missing values
stores_with_both = store_groups[(store_groups['missing_count'] > 0) & (store_groups['non_missing_count'] > 0)]

# Display the result
print("Stores with both missing and non-missing CompetitionDistance")
stores_with_both

No store has CompetitionDistance present in some rows and null in others.

#### Let's list the stores that have null values in all 3 of these features

In [None]:
nan_df = df[['Store','CompetitionDistance','CompetitionOpenSinceMonth','CompetitionOpenSinceYear']]

In [None]:
# Count rows per store where all 3 competition features are missing
nan_df[nan_df.drop(columns='Store').isna().all(axis=1)].groupby('Store').size()

### From the above it is clear that only 3 stores have missing values in all 3 of these features.
**The rows of these 3 stores make up only 0.25% of the total.**

**This is a *Missing at Random (MAR)* pattern for these 3 stores.**

**Based on the StoreType and Assortment of other stores, I assume these 3 stores have a competitor distance similar to that of stores with the same StoreType and Assortment.**
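
The assumption above amounts to imputing from stores that share the same `StoreType` and `Assortment`. The sketch below does this with a group-wise median on a made-up frame; using the median rather than the mean is a choice made here for illustration, and the result is not the value this notebook ends up using.

```python
import numpy as np
import pandas as pd

# Made-up stores for illustration only
toy = pd.DataFrame({
    "Store":               [1, 2, 3, 4, 5],
    "StoreType":           ["d", "d", "d", "a", "a"],
    "Assortment":          ["a", "a", "a", "c", "c"],
    "CompetitionDistance": [2000.0, 6000.0, np.nan, 500.0, 900.0],
})

# Median distance of the stores sharing the same StoreType and Assortment (NaN is skipped)
group_median = toy.groupby(["StoreType", "Assortment"])["CompetitionDistance"].transform("median")

# Fill only the missing entries; observed distances are left unchanged
toy["CompetitionDistance"] = toy["CompetitionDistance"].fillna(group_median)
print(toy)
```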

In [None]:
df.head()

In [None]:
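# Stores left out of the reference group (presumably the 3 stores with missing CompetitionDistance)
stores_to_exclude = [291, 622, 879]
store_type_291 = ['d']
assortment_291 = ['a']
# Despite its name, mask_291 is the filtered DataFrame: the other stores with StoreType 'd' and Assortment 'a'
mask_291 = df[(~df['Store'].isin(stores_to_exclude)) & (df['StoreType'].isin(store_type_291)) & (df['Assortment'].isin(assortment_291))]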

In [None]:
print(mask_291.shape)
mask_291.info()  # info() prints its summary directly and returns None, so no print() is needed
mask_291.head()

In [None]:
from scipy.stats import skew, kurtosis

# Summary statistics
print("Original Data:")
print("Mean:", mask_291['CompetitionDistance'].mean())
print("Median:", mask_291['CompetitionDistance'].median())
print("Skewness:", skew(mask_291['CompetitionDistance']))
print("Kurtosis:", kurtosis(mask_291['CompetitionDistance']))

#### Now fill the missing CompetitionDistance values for Store 291

In [None]:
# Fill missing values in 'CompetitionDistance' with 4400 where 'Store' is 291
df.loc[(df['Store'] == 291) & (df['CompetitionDistance'].isna()), 'CompetitionDistance'] = 4400
