# Food Demand Forecasting
Predict the number of orders for the upcoming 10 weeks.

## Overview
### 1) Context

### 2) Content

### 3) Used Python Libraries

### 4) Know Dataset Nature

### 5) Light Data Exploration

### 6) Data Normalization

### 7) Feature Selection

### 8) Model Building

### 9) Conclusion

### 10) Applying Algorithm



## Context
The client is a meal delivery company that operates in multiple cities. It has fulfillment centers in these cities that dispatch meal orders to customers. The client wants help forecasting demand at these centers for the upcoming weeks so that each center can plan its stock of raw materials accordingly.

## Content
Most raw materials are replenished weekly, and because they are perishable, procurement planning is of utmost importance. Staffing of the centers is a second area where accurate demand forecasts are really helpful. Given the following information, the task is to predict the demand for the next 10 weeks (weeks 146-155) for the center-meal combinations in the test set.

## Acknowledgements
Analytics Vidhya

## Inspiration
Accurate forecasting could steer the growth of the business in the right direction.


## Used Python Libraries

In [None]:
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from math import sqrt
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

## Know Dataset Nature
1. head(): returns the first 5 rows of the DataFrame (by default).
2. tail(): returns the last 5 rows of the DataFrame (by default).
3. describe(): shows basic summary statistics such as count, mean, std, min, max and percentiles.
4. info(): prints a concise summary of a DataFrame, including the index dtype, column dtypes, non-null counts and memory usage.

In [None]:
train = pd.read_csv('/kaggle/input/food-demand-forecasting/train.csv')
test = pd.read_csv('/kaggle/input/food-demand-forecasting/test.csv')
meal = pd.read_csv('/kaggle/input/food-demand-forecasting/meal_info.csv')
centerinfo = pd.read_csv('/kaggle/input/food-demand-forecasting/fulfilment_center_info.csv')

In [None]:
train.head()

In [None]:
centerinfo.head()

In [None]:
meal.head()

In [None]:
train.describe()

In [None]:
train.info()

## Light Data Exploration
### 1) For numeric data
  * Made histograms to understand distributions
  * Corrplot

### 2) For Categorical Data
   * Made bar charts to understand balance of classes

In [None]:
train_cat = train[['center_id','meal_id','emailer_for_promotion','homepage_featured']]
train_num = train[['week','checkout_price']]


In [None]:
for i in train_num.columns:
    plt.hist(train_num[i])
    plt.title(i)
    plt.show()

In [None]:
sns.heatmap(train_num.corr())

In [None]:
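for i in train_cat.columns:
    counts = train_cat[i].value_counts()
    # keyword arguments: recent seaborn versions reject positional x/y
    sns.barplot(x=counts.index, y=counts.values).set_title(i)
    plt.xticks(rotation=90)   # rotate the labels after the bars are drawn
    plt.show()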
    

## Data Normalization
1. for-loop: box plots of the numeric columns show whether outliers occur. The "checkout_price" column contains outliers.
2. outlinefree(): a custom function that handles outliers in a column. It does not drop rows; it **caps outlier values** at the 1.5 × IQR bounds (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR).
3. for-loop: a second for-loop applies **outlinefree()** to each numeric column and writes the capped values back to the dataset.
4. The columns **center_id** and **meal_id** have many distinct values.
5. To manage these categorical columns, two functions group the IDs into a few range-based sub-categories.
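
The capping rule used for outliers can be checked by hand on a small example. The sketch below uses made-up prices (not taken from the dataset) and NumPy's default linear-interpolation percentiles; `cap_outliers` is a hypothetical name used only for illustration.

```python
import numpy as np

def cap_outliers(values):
    """Clip values to [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; no value is dropped."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return np.clip(values, lower, upper), (lower, upper)

prices = np.array([10.0, 12.0, 14.0, 16.0, 100.0])   # made-up prices; 100 is the outlier
capped, (lower, upper) = cap_outliers(prices)

# Q1 = 12 and Q3 = 16, so IQR = 4 and the bounds are 6 and 22
print(lower, upper)
print(capped)          # the 100 is pulled down to the upper bound, 22
```

Because values are clipped rather than removed, the column keeps its original length, which is why the result can be assigned straight back to the DataFrame.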


In [None]:
for i in train_num.columns:
    sns.boxplot(x=train_num[i])   # pass the column explicitly as x
    plt.title(i)
    plt.show()

In [None]:
def outlinefree(dataCol):
    # Cap values outside the 1.5 * IQR bounds (values are clipped, not dropped).
    # np.percentile does not need sorted input, so no sort is required.
    Q1,Q3 = np.percentile(dataCol,[25,75])   # 25th and 75th percentiles
    IQR = Q3-Q1                              # interquartile range
    LowerRange = Q1-(1.5 * IQR)              # lower bound
    UpperRange = Q3+(1.5 * IQR)              # upper bound

    # values above UpperRange become UpperRange, values below LowerRange become LowerRange
    newlist = np.clip(dataCol,LowerRange,UpperRange).tolist()

    return newlist

In [None]:
for col in train_num.columns:
    train[col] = outlinefree(train[col])   # overwrite the column with the capped values

In [None]:
def center_id(datacol):
    center_id_val_index_n = []
    for i in datacol:
        if i >= 10 and i <= 30:
            center_id_val_index_n.append("10-30")
        elif i >= 31 and i <=50:
            center_id_val_index_n.append("31-50")
        elif i >= 51 and i <=70:
            center_id_val_index_n.append("51-70")  
        elif i >= 71 and i <=90:
            center_id_val_index_n.append("71-90")
        elif i >= 91 and i <=110:
            center_id_val_index_n.append("91-110") 
        elif i >= 111 and i <=130:
            center_id_val_index_n.append("111-130")
        elif i >= 131 and i <=150:
            center_id_val_index_n.append("131-150")          
        else:
            center_id_val_index_n.append("151-190")
    
    return  center_id_val_index_n 
center_id_val_index_n = center_id(train.center_id) 
train.center_id = center_id_val_index_n

In [None]:
def meal_id(datacol):        
    meal_id_val_index_n = []
    for i in datacol:
        if i >= 1000 and i <= 1300:
            meal_id_val_index_n.append("1000-1300")
        elif i >= 1301 and i <=1600:
            meal_id_val_index_n.append("1301-1600")
        elif i >= 1601 and i <=1900:
            meal_id_val_index_n.append("1601-1900")  
        elif i >= 1901 and i <=2200:
            meal_id_val_index_n.append("1901-2200")
        elif i >= 2201 and i <=2500:
            meal_id_val_index_n.append("2201-2500") 
        elif i >= 2501 and i <=2800:
            meal_id_val_index_n.append("2501-2800")          
        else:
            meal_id_val_index_n.append("2801-3000") 
    return  meal_id_val_index_n

meal_id_val_index_n = meal_id(train.meal_id)
train.meal_id = meal_id_val_index_n
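
The two binning functions above can also be expressed with `pandas.cut`. The following is only a sketch of that alternative, not the notebook's method: `bin_ids` and the sample IDs are hypothetical. One difference to keep in mind is that the `else` branches above put every out-of-range ID into the last bucket, whereas `pd.cut` marks such IDs as missing (the string "nan" after the `astype(str)` call).

```python
import pandas as pd

def bin_ids(ids, edges):
    """Label each ID with its 'low-high' range; each edge is an inclusive upper bound."""
    labels = [f"{lo + 1}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]
    return pd.cut(ids, bins=edges, labels=labels).astype(str)

# Same ranges as center_id() and meal_id() above
center_edges = [9, 30, 50, 70, 90, 110, 130, 150, 190]
meal_edges = [999, 1300, 1600, 1900, 2200, 2500, 2800, 3000]

centers = pd.Series([10, 30, 31, 186])    # made-up IDs for illustration
meals = pd.Series([1062, 1301, 2956])

print(bin_ids(centers, center_edges).tolist())
print(bin_ids(meals, meal_edges).tolist())
```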

## Feature Selection
1. seaborn.pairplot(): helps to visualize the pairwise relationships between the features and the label.

In [None]:
sns.pairplot(train)

In [None]:
f_train = train.loc[:,['num_orders','week','center_id','meal_id','checkout_price','base_price','emailer_for_promotion',
                 'homepage_featured']]
final_train = pd.get_dummies(f_train)

In [None]:
features = final_train.iloc[:,1:].values   # every column except num_orders
label = final_train.iloc[:,0].values       # num_orders as a 1-D array, the shape scikit-learn expects for y

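## Model Building
Three algorithms are trained and compared below to see which gives the best result. Each model is evaluated on a different random 80/20 split (a different random_state), so the RMSE values are only roughly comparable. The RMSE obtained for each algorithm: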

1. LinearRegression (RMSE: 334.45162241353864)
2. DecisionTreeRegressor (RMSE:  332.8261160204239)
3. **RandomForestRegressor (RMSE: 331.0142032987282)**
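
RMSE is the square root of the mean squared error, so it is expressed in the same unit as the label (number of orders). A small worked example with made-up values shows the arithmetic; `rmse` is a hypothetical helper that mirrors the `sqrt(mean_squared_error(...))` call used in the cells below.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

actual = [3, 5, 7]        # made-up order counts
predicted = [2, 5, 9]

# squared errors are 1, 0 and 4, so RMSE = sqrt(5 / 3), roughly 1.29
print(rmse(actual, predicted))
```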

In [None]:
#------------------------------------ LinearRegression ---------------------------------------------
X_train,X_test,y_train,y_test = train_test_split(features,label,test_size=0.20,random_state=1705)
model = LinearRegression()
model.fit(X_train,y_train)
y_pred = model.predict(X_test)

In [None]:
print("R2 score  :",r2_score(y_test, y_pred))
print("MSE score  :",mean_squared_error(y_test, y_pred))
print("RMSE: ",sqrt(mean_squared_error(y_test, y_pred)))

In [None]:
#------------------------------------ DecisionTreeRegressor---------------------------------------------
X_train,X_test,y_train,y_test = train_test_split(features,label,test_size=0.20,random_state=1956)
DTRmodel = DecisionTreeRegressor(max_depth=3,random_state=0)
DTRmodel.fit(X_train,y_train)
y_pred = DTRmodel.predict(X_test)

In [None]:
print("R2 score  :",r2_score(y_test, y_pred))
print("MSE score  :",mean_squared_error(y_test, y_pred))
print("RMSE: ",sqrt(mean_squared_error(y_test, y_pred)))

In [None]:
#------------------------------------ RandomForestRegressor ---------------------------------------------
X_train,X_test,y_train,y_test = train_test_split(features,label,test_size=0.20,random_state=33)
RFRmodel = RandomForestRegressor(max_depth=3, random_state=0)
RFRmodel.fit(X_train,y_train)
y_pred = RFRmodel.predict(X_test)

In [None]:
print("R2 score  :",r2_score(y_test, y_pred))
print("MSE score  :",mean_squared_error(y_test, y_pred))
print("RMSE: ",sqrt(mean_squared_error(y_test, y_pred)))
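
The three models above are each scored on a different random split. One way to make the comparison like-for-like is to create a single split and fit every model on it. The sketch below illustrates this on synthetic stand-in data (an assumption made so that the block runs on its own); in the notebook, `X` and `y` would be replaced by `features` and `label`, and the scores would differ from the ones reported above.

```python
import numpy as np
from math import sqrt
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption); not the food demand dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

# One split shared by every model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(max_depth=3, random_state=0),
    "RandomForestRegressor": RandomForestRegressor(max_depth=3, random_state=0),
}

# Fit each model on the same training rows and score it on the same test rows
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = sqrt(mean_squared_error(y_test, model.predict(X_test)))

for name, score in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {score:.3f}")
```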

## Conclusion
**RandomForestRegressor** gives the lowest RMSE of the three models, so it is the algorithm chosen for this dataset.

**RandomForestRegressor score**:

1. **RMSE score : 331.0142032987282** 


## Applying Algorithm
Before applying the algorithm to the test dataset, it must be converted into a fully numeric dataset. The steps are:
1. The columns center_id and meal_id have many distinct values.
2. The same functions used on the training data group them into a few range-based sub-categories.
3. get_dummies() one-hot encodes the categorical columns.
4. The data is then ready for the model.
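
One-hot encoding the test set separately only works if it ends up with the same columns, in the same order, as the data the model was trained on. A minimal sketch with made-up frames shows how a category missing from the test set can be handled with `reindex`; the frame and column values here are illustrative assumptions.

```python
import pandas as pd

# Made-up frames: the test data never contains the "31-50" bucket
train_df = pd.DataFrame({"week": [1, 2, 3], "center_id": ["10-30", "31-50", "10-30"]})
test_df = pd.DataFrame({"week": [4, 5], "center_id": ["10-30", "10-30"]})

train_d = pd.get_dummies(train_df)
test_d = pd.get_dummies(test_df)     # has no center_id_31-50 column

# Align the test columns to the training columns; missing dummies are filled with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns))
print(list(test_aligned.columns))
```

Without this alignment, a model fitted on the training columns would receive a matrix with a different number of features at prediction time.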

In [None]:
center_id_val_index_n = center_id(test.center_id) 
test.center_id = center_id_val_index_n

meal_id_val_index_n = meal_id(test.meal_id)
test.meal_id = meal_id_val_index_n

In [None]:
f_test = test.loc[:,['week','center_id','meal_id','checkout_price','base_price','emailer_for_promotion',
                 'homepage_featured']]
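# one-hot encode, then align to the training feature columns (same names, same order)
final_test = pd.get_dummies(f_test).reindex(columns=final_train.columns[1:], fill_value=0)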

In [None]:
test_predict = RFRmodel.predict(final_test)

In [None]:
test['num_orders'] = test_predict

In [None]:
sample =  test.loc[:,['id','num_orders']]

In [None]:
sample.to_csv('sample_submission.csv',index=False)