<a href="https://www.kaggle.com/code/cristianojoseblanco/foodorderpredictionwithrandomforestclassifier?scriptVersionId=169509701" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/online-food-dataset/onlinefoods.csv


In [2]:
# Importing other libraries
import plotly.io as pio
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

### Loading and Cleaning Data

In [3]:
df = pd.read_csv("/kaggle/input/online-food-dataset/onlinefoods.csv")

In [4]:
df.head()

,Age,Gender,Marital Status,Occupation,Monthly Income,Educational Qualifications,Family size,latitude,longitude,Pin code,Output,Feedback,Unnamed: 12
0,20,Female,Single,Student,No Income,Post Graduate,4,12.9766,77.5993,560001,Yes,Positive,Yes
1,24,Female,Single,Student,Below Rs.10000,Graduate,3,12.977,77.5773,560009,Yes,Positive,Yes
2,22,Male,Single,Student,Below Rs.10000,Post Graduate,3,12.9551,77.6593,560017,Yes,Negative,Yes
3,22,Female,Single,Student,No Income,Graduate,6,12.9473,77.5616,560019,Yes,Positive,Yes
4,22,Male,Single,Student,Below Rs.10000,Post Graduate,4,12.985,77.5533,560010,Yes,Positive,Yes


In [5]:
df.shape

(388, 13)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 388 entries, 0 to 387
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         388 non-null    int64  
 1   Gender                      388 non-null    object 
 2   Marital Status              388 non-null    object 
 3   Occupation                  388 non-null    object 
 4   Monthly Income              388 non-null    object 
 5   Educational Qualifications  388 non-null    object 
 6   Family size                 388 non-null    int64  
 7   latitude                    388 non-null    float64
 8   longitude                   388 non-null    float64
 9   Pin code                    388 non-null    int64  
 10  Output                      388 non-null    object 
 11  Feedback                    388 non-null    object 
 12  Unnamed: 12                 388 non-null    object 
dtypes: float64(2), int64(3), object(8)


In [7]:
df = df.drop(['Unnamed: 12'], axis=1)

### Histograms

In [8]:
pio.templates.default = "plotly_dark"

# In each histogram, grey and green represent 'No' and 'Yes' for Output, respectively.

for i in df.columns:
    fig = px.histogram(df,
                    x=i,
                    color = 'Output',
                    color_discrete_sequence = ['#006600', '#333333'],
                    title='{} Frequency'.format(i)
                    )
    fig.update_layout(bargap=0.1)
    fig.show()

•People aged 22 to 25 order the most; at higher ages the number of orders drops significantly. <br>
•Males order more, although the proportion of 'Yes' outputs is similar for both genders. <br>
•Singles order much more and have a higher share of 'Yes' outputs; the same holds for students and the 'No Income' group. <br>
•Graduates and postgraduates order more. <br>
•Family sizes of 2 or 3 dominate order frequency. <br>
•As expected, positive feedback goes with more orders.
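
The observations above are read off the histograms by eye. A minimal sketch of how the same comparison could be put into numbers is shown here. It uses a small hypothetical `toy` frame (not the real data) and reports, per category, the number of rows and the share of rows with `Output == "Yes"`.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "Marital Status": ["Single", "Single", "Single", "Married", "Married", "Married"],
    "Output": ["Yes", "Yes", "Yes", "Yes", "No", "No"],
})

# Number of rows in each category (how often each group appears).
counts = toy["Marital Status"].value_counts()

# Share of rows with Output == "Yes" inside each category.
yes_rate = (toy["Output"] == "Yes").groupby(toy["Marital Status"]).mean()

# Both series share the category labels as index, so they align row by row.
summary = pd.DataFrame({"rows": counts, "yes_rate": yes_rate})
print(summary)
```

Running the same two lines on `df` for each categorical column, before the encoding step, would give a table to set beside each histogram.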

### Encoding Categorical Data

In [9]:
categorical_cols = ['Gender','Marital Status', 'Occupation', 'Monthly Income', 'Educational Qualifications', 'Feedback']

for i in categorical_cols:
    one_hot = pd.get_dummies(df[i])
    df = pd.concat([df, one_hot], axis=1)
    df = df.drop(i, axis=1)
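
One detail worth watching in this encoding: the generated column names are the raw category values, so any stray whitespace in the data ends up in a column name (the column list in the next cell contains `'Negative '` with a trailing space). A minimal sketch of one way to avoid that, assuming it is acceptable to strip whitespace from the values first, is shown here on a hypothetical `toy` frame.

```python
import pandas as pd

# Hypothetical toy frame; note the trailing space in "Negative ".
toy = pd.DataFrame({
    "Age": [20, 24, 22],
    "Feedback": ["Positive", "Negative ", "Positive"],
})

# Strip surrounding whitespace so the dummy column names come out clean.
toy["Feedback"] = toy["Feedback"].str.strip()

# Same pattern as the loop above: encode, append, drop the original column.
one_hot = pd.get_dummies(toy["Feedback"])
encoded = pd.concat([toy.drop(columns="Feedback"), one_hot], axis=1)

print(list(encoded.columns))
```

If the names are cleaned this way, any later list of column names (such as the model inputs) has to use the cleaned spelling too.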

In [10]:
list(df)

['Age',
 'Family size',
 'latitude',
 'longitude',
 'Pin code',
 'Output',
 'Female',
 'Male',
 'Married',
 'Prefer not to say',
 'Single',
 'Employee',
 'House wife',
 'Self Employeed',
 'Student',
 '10001 to 25000',
 '25001 to 50000',
 'Below Rs.10000',
 'More than 50000',
 'No Income',
 'Graduate',
 'Ph.D',
 'Post Graduate',
 'School',
 'Uneducated',
 'Negative ',
 'Positive']

### Data preprocessing for our model

In [11]:
input_cols = ['Age',
'Family size',
'Female',
 'Male',
 'Married',
 'Prefer not to say',
 'Single',
 'Employee',
 'House wife',
 'Self Employeed',
 'Student',
 '10001 to 25000',
 '25001 to 50000',
 'Below Rs.10000',
 'More than 50000',
 'No Income',
 'Graduate',
 'Ph.D',
 'Post Graduate',
 'School',
 'Uneducated',
 'Negative ',
 'Positive'] 

target_col = 'Output'

In [12]:
inputs = df[input_cols].copy()
target = df[target_col].copy()
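
The input list above is every encoded column except the target and the three location columns (`latitude`, `longitude`, `Pin code`). As a sketch, the same list could be derived by exclusion instead of being typed out, which avoids spelling mismatches with the generated dummy names. The `toy` frame and the names `excluded` and `derived_input_cols` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy frame with a few of the columns present after encoding.
toy = pd.DataFrame({
    "Age": [20, 24],
    "Family size": [4, 3],
    "latitude": [12.9766, 12.977],
    "longitude": [77.5993, 77.5773],
    "Pin code": [560001, 560009],
    "Output": ["Yes", "Yes"],
    "Female": [True, True],
    "Male": [False, False],
})

# Columns left out of the model inputs: the target and the location fields.
excluded = {"latitude", "longitude", "Pin code", "Output"}

# Keep the original column order while filtering.
derived_input_cols = [c for c in toy.columns if c not in excluded]
print(derived_input_cols)
```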

In [13]:
inputs.head()

,Age,Family size,Female,Male,Married,Prefer not to say,Single,Employee,House wife,Self Employeed,...,Below Rs.10000,More than 50000,No Income,Graduate,Ph.D,Post Graduate,School,Uneducated,Negative,Positive
0,20,4,True,False,False,False,True,False,False,False,...,False,False,True,False,False,True,False,False,False,True
1,24,3,True,False,False,False,True,False,False,False,...,True,False,False,True,False,False,False,False,False,True
2,22,3,False,True,False,False,True,False,False,False,...,True,False,False,False,False,True,False,False,True,False
3,22,6,True,False,False,False,True,False,False,False,...,False,False,True,True,False,False,False,False,False,True
4,22,4,False,True,False,False,True,False,False,False,...,True,False,False,False,False,True,False,False,False,True


In [14]:
target.head()

0    Yes
1    Yes
2    Yes
3    Yes
4    Yes
Name: Output, dtype: object

### ML RandomForestClassifier model

In [15]:
input_train, input_test, target_train, target_test = train_test_split(inputs, target, test_size=0.1, random_state=10)
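
With `test_size=0.1` the test set holds only about a tenth of the 388 rows, so the Yes/No mix in it can drift from the mix in the full data. A possible variation, not what the cell above does, is to pass `stratify=` so both splits keep the class proportions. The sketch below shows this on hypothetical toy data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 15 "Yes" rows and 5 "No" rows.
toy_inputs = pd.DataFrame({"Age": range(20, 40)})
toy_target = pd.Series(["Yes"] * 15 + ["No"] * 5, name="Output")

# stratify=toy_target keeps the 3:1 class ratio in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    toy_inputs, toy_target, test_size=0.2, random_state=10, stratify=toy_target
)

print(y_test.value_counts())
```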

In [16]:
clf = RandomForestClassifier()

param_grid = [ 
        {"n_estimators": [10, 100, 200, 1000], "max_depth": [None, 5, 10], "min_samples_split": [2, 3, 4]}
]

grid_search = GridSearchCV(clf, param_grid, cv=3, scoring='accuracy', return_train_score=True)
grid_search.fit(input_train, target_train)
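
After fitting, a `GridSearchCV` object exposes which parameter combination won and its mean cross-validated score through `best_params_` and `best_score_`. The sketch below shows how to read them. It runs on synthetic data from `make_classification` with a deliberately small hypothetical grid (`small_grid`), so the numbers it prints say nothing about the food-order dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the notebook's dataset.
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

# A small grid so the sketch runs quickly (2 x 2 = 4 candidates).
small_grid = {"n_estimators": [10, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),  # fixed seed for repeatable fits
    small_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# Winning parameter combination and its mean accuracy across the 3 folds.
print(search.best_params_)
print(round(search.best_score_, 3))
```

On the notebook's own object, `grid_search.best_params_` and `grid_search.best_score_` give the same information. The cross-validated score is a useful companion to the single test-set score because it averages over several folds.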

In [17]:
best_clf = grid_search.best_estimator_

In [18]:
best_clf

In [19]:
best_clf.score(input_test, target_test)

0.9487179487179487

Our model reached an accuracy of 94.87% on the held-out test set.
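
Accuracy on its own can flatter a model when one class is much more common than the other, so it helps to compare it with the score of always predicting the majority class and to look at the confusion matrix. The sketch below does both on hypothetical labels (`y_true`, `y_pred`). They are made up for illustration and are not the notebook's predictions.

```python
from collections import Counter

from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (illustrative only).
y_true = ["Yes"] * 8 + ["No"] * 2
y_pred = ["Yes"] * 8 + ["Yes", "No"]

# Accuracy of the (hypothetical) model.
acc = accuracy_score(y_true, y_pred)

# Accuracy obtained by always predicting the most frequent class.
baseline = Counter(y_true).most_common(1)[0][1] / len(y_true)

# Rows are true labels, columns are predicted labels, both ordered No, Yes.
cm = confusion_matrix(y_true, y_pred, labels=["No", "Yes"])

print(acc, baseline)
print(cm)
```

For the notebook's model, the equivalent inputs would be `target_test` and `best_clf.predict(input_test)`.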