# Introduction to Python Project: FoodHub Data Analysis

### Problem Statement

The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on those restaurants due to their hectic lifestyles. Online food delivery service is a great option for them. It provides them with good food from their favorite restaurants. A food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.

The app allows restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.

### Objective

The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand of different restaurants which will help them in enhancing their customer experience. Suppose you are hired as a Data Scientist in this company and the Data Science team has shared some of the key questions that need to be answered. Perform the data analysis to find answers to these questions that will help the company improve its business.


### Data Dictionary

The data contains information related to each food order. The detailed data dictionary is given below.

**Data Dictionary:**

<u>order_id</u>: Unique ID of the order \
<u>customer_id</u>: ID of the customer who ordered the food \
<u>restaurant_name</u>: Name of the restaurant \
<u>cuisine_type</u>: Cuisine ordered by the customer \
<u>cost_of_the_order</u>: Price paid per order \
<u>day_of_the_week</u>: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday) \
<u>rating</u>: Rating given by the customer out of 5 \
<u>food_preparation_time</u>: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant's order confirmation and the delivery person's pick-up confirmation. \
<u>delivery_time</u>: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person's pick-up confirmation and drop-off information
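
As a rough illustration of how the two time columns could be derived from timestamps, the sketch below computes them for two made-up orders. The column names `confirmed_at`, `picked_up_at` and `dropped_off_at` and all values are hypothetical; the dataset itself only ships the resulting minute counts.

```python
import pandas as pd

# Hypothetical raw timestamps; these columns are NOT part of foodhub_order.csv
events = pd.DataFrame({
    "confirmed_at": pd.to_datetime(["2023-01-07 12:00", "2023-01-09 19:30"]),
    "picked_up_at": pd.to_datetime(["2023-01-07 12:25", "2023-01-09 19:52"]),
    "dropped_off_at": pd.to_datetime(["2023-01-07 12:45", "2023-01-09 20:20"]),
})

# Preparation time: restaurant confirmation -> pick-up confirmation, in whole minutes
events["food_preparation_time"] = (
    (events["picked_up_at"] - events["confirmed_at"]).dt.total_seconds() // 60
).astype(int)

# Delivery time: pick-up confirmation -> drop-off confirmation, in whole minutes
events["delivery_time"] = (
    (events["dropped_off_at"] - events["picked_up_at"]).dt.total_seconds() // 60
).astype(int)

print(events[["food_preparation_time", "delivery_time"]])
```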

### Let us start by importing the required libraries

In [1]:
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
sns.set()

### Understanding the structure of the data

In [2]:
# Run the following lines only on Google Colab (mounts Google Drive and moves to the project folder)
from google.colab import drive
import os

drive.mount('/content/drive')
os.chdir("/content/drive/MyDrive/Great Learning")

Mounted at /content/drive


In [3]:
# Write your code here to read the data
data = pd.read_csv("foodhub_order.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'foodhub_order.csv'

In [None]:
# Write your code here to view the first 5 rows
data.head(5)

### **Question 1:** How many rows and columns are present in the data? [0.5 mark]

In [None]:
# Write your code here
shape = data.shape
print(f"There are {shape[0]} rows and {shape[1]} columns")

#### Observations:


### **Question 2:** What are the datatypes of the different columns in the dataset? (The info() function can be used) [0.5 mark]

In [None]:
# Write your code here
data.info()

#### Observations:

1. order_id and customer_id are numerical in nature; however, they act as identifiers and can be treated as categorical variables, so these columns could be changed to the object dtype.

2. Cost and time-related columns (cost_of_the_order, food_preparation_time and delivery_time) should be numerical, and they are correctly mapped to numeric dtypes.

3. restaurant_name, cuisine_type and rating are categorical variables and are rightly mapped to object (which may hold strings or category labels such as the rating values).


In [None]:
## Changing Datatype of order_id and customer_id
data['order_id'] = data['order_id'].astype("object")
data['customer_id'] = data['customer_id'].astype("object")

In [None]:
data.info()

### **Question 3:** Are there any missing values in the data? If yes, treat them using an appropriate method. [1 mark]

In [None]:
# Write your code here
data.isna().sum()

#### Observations:

None of the columns have any missing data, so no missing-value treatment is needed.

### **Question 4:** Check the statistical summary of the data. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed? [2 marks]

In [None]:
# Write your code here
data.describe(include="all")

In [None]:
ser = data.food_preparation_time.describe()
ser.loc[["min", "max", "mean"]]

#### Observations:

**Minimum** Time for Food Preparation: **20 minutes** \
**Average** Time for Food Preparation: **27.37 minutes** \
**Maximum** Time for Food Preparation: **35 minutes**


### **Question 5:** How many orders are not rated? [1 mark]

In [None]:
# Write the code here
data.rating.unique()

In [None]:
no_ratings_df = data.loc[data["rating"] == "Not given"]
no_ratings_df.shape

#### Observations:

736 out of 1898 orders have not been given a rating.
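
Because unrated orders are stored as the string "Not given" rather than as missing values, a null check such as `isna()` does not flag them. One possible way to make them visible, shown here on a small made-up sample rather than the real data, is to coerce the column to numeric so that "Not given" becomes `NaN`.

```python
import pandas as pd

# Toy ratings in the same format as the `rating` column (values are made up)
ratings = pd.Series(["5", "Not given", "4", "3", "Not given"], name="rating")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
numeric_ratings = pd.to_numeric(ratings, errors="coerce")

n_unrated = int(numeric_ratings.isna().sum())
mean_rating = numeric_ratings.mean()  # NaN values are skipped by default

print(f"{n_unrated} unrated orders, mean of the rated ones: {mean_rating}")
```

Whether to keep the placeholder string or convert it depends on the question being asked: counting unrated orders works either way, while averaging ratings needs the numeric form.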

### Exploratory Data Analysis (EDA)

### Univariate Analysis

### **Question 6:** Explore all the variables and provide observations on their distributions. (Generally, histograms, boxplots, countplots, etc. are used for univariate exploration.) [9 marks]

**Order ID**

In [None]:
data.order_id.nunique()

Since the number of unique order IDs is equal to the number of entries in the data, we can assume that each row in the data refers to a unique order.

**Customer ID**

In [None]:
customer_order_freq = data.groupby("customer_id")['order_id'].count().sort_values(ascending=False).reset_index().rename(columns= {"order_id":"number of orders placed"})
print(f"""There are {data.customer_id.nunique()} customers in the database.
Let us see how many orders are placed by a customer""")
# Sort by number of orders (the index) so the counts line up with the countplot bars, which are ordered by x value
counts = customer_order_freq["number of orders placed"].value_counts().sort_index().tolist()

fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharex=False)
countplt = sns.countplot(data=customer_order_freq, x="number of orders placed", ax=axes[0]);
boxplt = sns.boxplot(data=customer_order_freq, x="number of orders placed", ax=axes[1],  medianprops={"color": "r"});
countplt.set(ylabel="Count of Customers")
fig.suptitle("Number of customers placing orders on FoodHub");
# Annotate each bar with the number of customers it represents
for i,count_ in enumerate(counts):
  countplt.annotate(str(count_), xy=(i,count_), horizontalalignment="center");

There are 784 customers who have placed only 1 order on the app. Customers who placed 4 or more orders can be considered loyal customers. Very few customers (66) meet this bar.
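
The loyalty count can also be computed directly instead of being read off the plot. The sketch below uses a made-up order log and interprets the loyalty threshold as 4 or more orders, which is an assumption about the intended cut-off.

```python
import pandas as pd

# Toy order log; customer ids are made up
orders = pd.DataFrame({
    "order_id": range(1, 11),
    "customer_id": [101, 101, 101, 101, 202, 202, 303, 404, 404, 404],
})

# Number of orders placed by each customer
orders_per_customer = orders["customer_id"].value_counts()

# Assumed loyalty cut-off: 4 or more orders
LOYAL_MIN_ORDERS = 4
loyal = orders_per_customer[orders_per_customer >= LOYAL_MIN_ORDERS]
single_order = orders_per_customer[orders_per_customer == 1]

print(f"{len(loyal)} loyal customer(s), {len(single_order)} one-time customer(s)")
```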

**Restaurant Name**

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(100, 75), width_ratios=[3,1]);
# Number of orders per restaurant, sorted from most to fewest orders
freq_df = data.groupby("restaurant_name")['order_id'].count().sort_values(ascending=False)
# Plot the bars in the same order as freq_df so that the annotations below line up with them
restaurant_count = sns.countplot(data=data, x="restaurant_name", order=freq_df.index, ax=ax[0]);
# Draw the median on the same axis as the bars so that it shares their scale
ax[0].axhline(freq_df.median(), color="r", label="Median orders per restaurant")
ax[0].legend(loc="upper right")
ax[0].tick_params(axis="x", rotation=90)
for i,count_ in enumerate(freq_df.tolist()):
  restaurant_count.annotate(str(count_), xy=(i,count_), horizontalalignment="center");

sns.boxplot(data=freq_df.reset_index(), ax=ax[1]);

**cuisine_type**

In [None]:
print("Histogram showing how many orders were placed for each cuisine")
fig, ax = plt.subplots(figsize=(100, 75));
# Number of orders per cuisine, sorted from most to fewest orders
freq_df = data.groupby("cuisine_type")['order_id'].count().sort_values(ascending=False).reset_index().rename(columns = {"order_id":"Number of order placed"})
# Plot the bars in the same order as freq_df so that the annotations below line up with them
cuisine_count = sns.countplot(data=data, x="cuisine_type", order=freq_df["cuisine_type"].tolist(), ax=ax);
# Draw the median on the same axis as the bars so that it shares their scale
ax.axhline(freq_df["Number of order placed"].median(), color="r", linewidth=5, label="Median orders per cuisine")
ax.legend(loc="upper right", fontsize=50)
ax.tick_params(axis="x", rotation=45, labelsize=50)
ax.tick_params(axis="y", labelsize=50)
for i,count_ in enumerate(freq_df["Number of order placed"].tolist()):
  cuisine_count.annotate(str(count_), xy=(i,count_), horizontalalignment="center", fontsize=50);

cuisine_count.set_xlabel("Types of Cuisine", size=50)
cuisine_count.set_ylabel("Count of Orders", size=50)
cuisine_count.set_title("Histogram showing how many orders were placed for each cuisine", size=75);

In [None]:
fig, ax = plt.subplots(figsize=(100, 75));
freq_df = data.groupby("cuisine_type")['restaurant_name'].nunique().reset_index().rename(columns = {"restaurant_name":"Number of restaurants"})
# Plot the number of distinct restaurants per cuisine (freq_df), not the number of orders
cuisine_count = sns.barplot(data=freq_df, x="cuisine_type", y="Number of restaurants", ax=ax);

ax.tick_params(axis="x", rotation=45, labelsize=50)
ax.tick_params(axis="y", labelsize=50)

cuisine_count.set_xlabel("Types of Cuisine", size=50)
cuisine_count.set_ylabel("Number of Restaurants", size=50)
cuisine_count.set_title("Number of restaurants catering to each cuisine", size=75);

In [None]:
data.groupby("cuisine_type")['restaurant_name'].nunique().to_dict()

**cost_of_the_order**

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, width_ratios=[1,3], figsize=(100, 75));
fig.suptitle("Cost of the Order", fontsize=100);
cost_box = sns.boxplot(data=data, y="cost_of_the_order", ax=ax[0]);
cost_box.set_ylabel("Cost of Orders", fontsize=50)
# Set the tick label size directly instead of re-assigning the current tick labels
cost_box.tick_params(labelsize=50)

cost_hist = sns.histplot(data=data, x="cost_of_the_order", ax=ax[1], kde=True);
cost_hist.set_xlabel("Cost of the Order", fontsize=50)
cost_hist.set_ylabel("Count of Orders", fontsize=50)
cost_hist.tick_params(labelsize=50)

Orders are most heavily concentrated between 10 and 15 dollars, as is evident in both the boxplot and the histogram. The distribution is also right-skewed (positively skewed): there are fewer orders at higher costs.
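
The visual impression of skewness can be backed by a number. The sketch below, run on made-up costs rather than the FoodHub data, uses the sample skewness from pandas and the mean-versus-median comparison; a positive skewness and a mean above the median both point to a right-skewed distribution.

```python
import pandas as pd

# Made-up order costs with a long right tail (not taken from the dataset)
costs = pd.Series([8.5, 11.0, 12.1, 12.5, 13.0, 14.2, 15.9, 24.0, 29.5, 33.0])

skewness = costs.skew()  # a value above 0 suggests a right (positive) skew
mean_cost, median_cost = costs.mean(), costs.median()

print(f"skew={skewness:.2f}, mean={mean_cost:.2f}, median={median_cost:.2f}")
```

On the real data the same two lines could be applied to `data["cost_of_the_order"]`.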

**day_of_the_week**

In [None]:
hist_day = sns.histplot(data=data, x="day_of_the_week");
hist_day.set_title("Day of the week on which order is placed");
hist_day.set_xlabel("Day of the Week");
hist_day.set_ylabel("Count of Orders");

Most of the orders are placed during the weekends. A lot of people may prefer ordering food on weekends as a way to relax.

**rating**

In [None]:
count_rating = sns.countplot(data=data, x="rating");
count_rating.set_title("Ratings given for the orders placed");
count_rating.set_xlabel("Ratings");
count_rating.set_ylabel("Count of Orders");

Customers might have rated only those orders they enjoyed and left the orders they did not enjoy unrated. Further investigation could also show how ratings are distributed across restaurants and cuisines.

**food_preparation_time**

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, width_ratios=[1,3], figsize=(100, 75));

food_prep_box = sns.boxplot(data=data, y="food_preparation_time", ax=ax[0], showmeans=True, meanprops={'marker':'o','markerfacecolor':'white','markersize':'32'},
                            medianprops={'color':"k", "linewidth":3});
food_prep_box.set_ylabel("Food Preparation Time", fontsize=50)
# Set the tick label size directly instead of re-assigning the current tick labels
food_prep_box.tick_params(labelsize=50)

food_prep_hist = sns.histplot(data=data, x="food_preparation_time", ax=ax[1], kde=True);
food_prep_hist.set_xlabel("Food Preparation Time", fontsize=50)
food_prep_hist.set_ylabel("Count of Orders", fontsize=50)
food_prep_hist.tick_params(labelsize=50)

Food preparation time is distributed similarly across most orders, with only a few exceptions. The median and the mean are also very close to each other in the boxplot.

**delivery_time**

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, width_ratios=[1,3], figsize=(100, 75));
fig.suptitle("Delivery Time", fontsize=100);

delivery_box = sns.boxplot(data=data, y="delivery_time", ax=ax[0], medianprops={"color":"k", "linewidth":3});
delivery_box.set_ylabel("Delivery Time", fontsize=50)
# Set the tick label size directly instead of re-assigning the current tick labels
delivery_box.tick_params(labelsize=60)

delivery_hist = sns.histplot(data=data, x="delivery_time", ax=ax[1], kde=True);
delivery_hist.set_xlabel("Delivery Time", fontsize=50)
delivery_hist.set_ylabel("Count of Orders", fontsize=50)
delivery_hist.tick_params(labelsize=60)

The delivery time ranges from 15 minutes to 33 minutes; however, around half of the orders take 25 minutes or longer to deliver. The left-skewed (negatively skewed) shape of the plots is consistent with this.

### **Question 7**: Which are the top 5 restaurants in terms of the number of orders received? [1 mark]

In [None]:
data["restaurant_name"].value_counts()[:5]

#### Observations:

The top 5 restaurants are found by counting the number of times each restaurant name occurs in the data. This can also be confirmed with the plot of orders per restaurant shown above.


### **Question 8**: Which is the most popular cuisine on weekends? [1 mark]

In [None]:
data[data["day_of_the_week"] == "Weekend"]["cuisine_type"].value_counts()[:5]

#### Observations:

**American** cuisine is the most popular cuisine during the weekends.


### **Question 9**: What percentage of the orders cost more than 20 dollars? [2 marks]

In [None]:
cost_more_than_20 = data[data["cost_of_the_order"] > 20]
len(cost_more_than_20)/data.shape[0] * 100

In [None]:
sns.histplot(data=cost_more_than_20, x="cost_of_the_order", kde=True);

In [None]:
sns.boxplot(data=cost_more_than_20, x="cost_of_the_order", showmeans=True)

#### Observations:

**29.24%** of the orders cost more than 20 dollars. Among these orders, the majority cost between 24 and 26 dollars. The boxplot shows that the data is slightly right-skewed, whereas the histogram suggests a bimodal distribution.
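
An equivalent and slightly shorter way to get such a percentage is to average a boolean mask, since `True` counts as 1 and `False` as 0. The sketch below shows this on made-up costs, not on the FoodHub data.

```python
import pandas as pd

# Made-up order costs in dollars
costs = pd.Series([12.0, 25.5, 8.9, 31.2, 20.0, 16.4, 22.1, 9.5])

# The mean of a boolean mask is the fraction of True values.
# An order costing exactly 20.0 is not counted because the comparison is strict.
pct_above_20 = (costs > 20).mean() * 100

print(f"{pct_above_20:.2f}% of the toy orders cost more than 20 dollars")
```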


### **Question 10**: What is the mean order delivery time? [1 mark]

In [None]:
print(f"The mean order delivery time is {data['delivery_time'].mean():.2f} minutes")

#### Observations:
The mean order delivery time is approximately 24.16 minutes.

### **Question 11:** The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed. [1 mark]

In [None]:
data["customer_id"].value_counts()[:3]

#### Observations:
The top 3 customer ids who have placed the most orders are: \
**52832** - 13 orders placed \
**47440** - 10 orders placed \
**83287** - 9 orders placed

### Multivariate Analysis

### **Question 12**: Perform a multivariate analysis to explore relationships between the important variables in the dataset. (It is a good idea to explore relations between numerical variables as well as relations between numerical and categorical variables) [10 marks]


#### Correlation between Numerical Variables

Plotting the heatmap between the numerical features will give us an idea of the correlation/relationship between the various features.

In [None]:
corr_ = sns.heatmap(data[["cost_of_the_order", "food_preparation_time", "delivery_time"]].corr(), annot=True)
corr_.set_title("Correlation between Numerical Features(Cost of the order, Food Preparation Time, Delivery Time)");
plt.show()
print("")
pairplot_ = sns.pairplot(data[["cost_of_the_order", "food_preparation_time", "delivery_time"]], kind="scatter",corner=True, height=3.5)
pairplot_.fig.suptitle("Relationship between Numerical variables");
plt.show()

There is very weak correlation between the numerical variables. The pairplot also shows no visible relationship between them.
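
To read the strongest and weakest relationships off a correlation matrix without scanning the heatmap by eye, the off-diagonal entries can be listed and ranked. The sketch below does this on randomly generated stand-in data (not the FoodHub orders), so the printed values carry no meaning beyond illustrating the mechanics.

```python
import numpy as np
import pandas as pd

# Random stand-in data with the same column names as the numerical features
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "cost_of_the_order": rng.uniform(5, 35, 200),
    "food_preparation_time": rng.integers(20, 36, 200),
    "delivery_time": rng.integers(15, 34, 200),
})

corr = toy.corr()  # Pearson correlation by default

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
```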

To check the relationship between numerical features within each category of a categorical variable, we can plot a pairplot coloured by that variable. This gives us the distribution of each variable as well as the relationship (correlation) between the numeric features.

In [None]:
cuisine_type_pairplot = sns.pairplot(data = data.drop(columns=["order_id", "customer_id"]), kind="scatter",corner=True, hue="cuisine_type");
cuisine_type_pairplot.fig.suptitle("Relationship between Numerical variables indicated by their cuisine")
plt.show()
print("")

unique_cuisines = data['cuisine_type'].unique()
n_rows = (len(unique_cuisines) + 2) // 3
fig, axes = plt.subplots(n_rows, 3, figsize=(18, 18))
axes = axes.flatten()
for i, type_ in enumerate(data['cuisine_type'].unique()):
    sns.heatmap(
        data=data[data["cuisine_type"] == type_][["cost_of_the_order", "food_preparation_time", "delivery_time"]].corr(),
        ax=axes[i],
        annot=True)
    axes[i].set_title(f"Correlation between variables for {type_}")


plt.tight_layout()
plt.show()

In [None]:
day_of_the_week_pairplot = sns.pairplot(data.drop(columns=["order_id", "customer_id"]), kind="scatter",corner=True, hue="day_of_the_week");
day_of_the_week_pairplot.fig.suptitle("Relationship between Numerical variables indicated by what day the order was placed")
plt.show()
print("")

day_ = data['day_of_the_week'].unique()
num_cuisines = len(day_)
fig, axes = plt.subplots(1, 2, figsize=(18, 6))
for i, type_ in enumerate(data['day_of_the_week'].unique()):
    sns.heatmap(
        data=data[data["day_of_the_week"] == type_][["cost_of_the_order", "food_preparation_time", "delivery_time"]].corr(),
        ax=axes[i],
        annot=True)
    axes[i].set_title(f"Correlation between variables for {type_}")


plt.tight_layout()
plt.show()

In [None]:
rating_pairplot = sns.pairplot(data.drop(columns=["order_id", "customer_id"]), kind="scatter",corner=True, hue="rating");
rating_pairplot.fig.suptitle("Relationship between Numerical variables indicated by the rating given to the order")
plt.show()
print("")

rating_ = data['rating'].unique()
num_rating = len(rating_)
fig, axes = plt.subplots(2, 2, figsize=(18, 6))
axes = axes.flatten()
for i, type_ in enumerate(data['rating'].unique()):
    sns.heatmap(
        data=data[data["rating"] == type_][["cost_of_the_order", "food_preparation_time", "delivery_time"]].corr(),
        ax=axes[i],
        annot=True)
    axes[i].set_title(f"Correlation between variables for {type_}")


plt.tight_layout()
plt.show()

As visible from the heatmaps and the pairplots, we can conclude that none of the numerical variables are meaningfully correlated with each other (even within the categories of the categorical features).

### **Question 13:** The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer. [3 marks]

In [None]:
# Keep only the rated orders; .copy() avoids a SettingWithCopyWarning on the next line
restaurants_grouped = data[data["rating"].isin(["3","4","5"])].copy()
restaurants_grouped["rating"] = restaurants_grouped["rating"].astype(int)
restaurants_having_ratings = restaurants_grouped.groupby("restaurant_name").agg({"order_id": "count","rating": "mean"}).reset_index().rename(columns = {"order_id":"No. of ratings",
                                                                                                                           "rating": "Average Rating"})
# The offer requires strictly more than 50 ratings and an average rating strictly greater than 4
restaurants_having_50ratings = restaurants_having_ratings[(restaurants_having_ratings["No. of ratings"] > 50) & (restaurants_having_ratings["Average Rating"] > 4)]
restaurants_having_50ratings

#### Observations:

The four restaurants eligible for the promotional offer are:

1. Shake Shack
2. The Meatball Shop
3. Blue Ribbon Sushi
4. Blue Ribbon Fried Chicken
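
The filtering logic can be checked on a small example where the answer is easy to verify by hand. The sketch below uses made-up restaurants and ratings, named aggregation for readable column names, and thresholds scaled down to fit the toy data (the question itself uses 50 and 4).

```python
import pandas as pd

# Toy data; restaurant names and ratings are made up
toy = pd.DataFrame({
    "restaurant_name": ["A"] * 4 + ["B"] * 3 + ["C"] * 2,
    "rating": ["5", "4", "5", "Not given", "3", "4", "4", "5", "Not given"],
})

# Drop unrated orders before converting the ratings to integers
rated = toy[toy["rating"] != "Not given"].copy()
rated["rating"] = rated["rating"].astype(int)

summary = rated.groupby("restaurant_name").agg(
    rating_count=("rating", "count"),
    average_rating=("rating", "mean"),
)

# Scaled-down thresholds for the toy data; both comparisons are strict
MIN_COUNT, MIN_AVG = 2, 4
eligible = summary[
    (summary["rating_count"] > MIN_COUNT) & (summary["average_rating"] > MIN_AVG)
]

print(eligible)
```

Here only restaurant A qualifies: B has enough ratings but an average below 4, and C has a single rating.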

### **Question 14:** The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders. [3 marks]

In [None]:
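# 25% on orders costing more than 20 dollars, 15% on orders costing more than 5 (and up to 20) dollars, nothing otherwise
data["Revenue_Generated"] = data["cost_of_the_order"].apply(lambda x: 0.25*x if x > 20 else (0.15*x if x > 5 else 0))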

In [None]:
data["Revenue_Generated"].sum()

#### Observations:

The net revenue generated by the company (FoodHub) is 6166.303 dollars.
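
The tiered commission can also be written in vectorised form. The sketch below is one possible alternative to a row-wise `apply`, using `numpy.select` on made-up costs chosen to include the boundary values 5 and 20; it reads "greater than" as a strict comparison, which is an interpretation of the question's wording.

```python
import numpy as np
import pandas as pd

# Made-up order costs, including the boundary values 5 and 20
costs = pd.Series([4.5, 5.0, 12.0, 20.0, 30.0])

# Conditions are checked in order, so the "> 20" tier takes precedence over "> 5"
commission_rate = np.select(
    [costs > 20, costs > 5],
    [0.25, 0.15],
    default=0.0,
)

revenue = costs * commission_rate
total_revenue = revenue.sum()

print(f"Total revenue on the toy orders: {total_revenue:.2f} dollars")
```

With strict comparisons, an order of exactly 5 dollars earns no commission and an order of exactly 20 dollars falls in the 15% tier.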


### **Question 15:** The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered.) [2 marks]

In [None]:
data["total_time_to_deliver"] = data["food_preparation_time"] + data["delivery_time"]
percent_orders_more_than_60_min = len(data[data["total_time_to_deliver"] > 60])/(len(data))*100
percent_orders_more_than_60_min

#### Observations:

About 10.5% of the orders take more than 60 minutes in total to be prepared and delivered.

### **Question 16:** The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends? [2 marks]

In [None]:
display(data.groupby("day_of_the_week")["delivery_time"].mean())
delivery_boxplot = sns.boxplot(data=data, x="day_of_the_week", y="delivery_time", hue="day_of_the_week");
delivery_boxplot.set_xticklabels(delivery_boxplot.get_xticklabels(), rotation=90);
plt.show()

#### Observations:

The mean delivery time differs by approximately 6 minutes, with weekday deliveries taking longer than weekend deliveries. \
The boxplot also shows a slight skew to the left (negatively skewed data), which indicates that most deliveries take longer than the mean delivery time, with a smaller number of quick deliveries pulling the mean down.


### Conclusion and Recommendations

### **Question 17:** What are your conclusions from the analysis? What recommendations would you like to share to help improve the business? (You can use cuisine type and feedback ratings to drive your business recommendations.) [6 marks]

### Conclusions:
FoodHub offers a variety of cuisines from various restaurants. However, the following is observed:
* Not every order receives a rating, and the ratings that are given range only from 3 to 5.
* The majority of the orders come from only 4 cuisines (American, Japanese, Italian, Chinese). When the number of restaurants was plotted against cuisine, these cuisines also had the highest number of restaurants.
* Order costs sit at the lower end of the spectrum, which might indicate that the restaurants are budget friendly and may partly explain why most people turn to the app during the weekends.
* It is also interesting to note that the majority of orders took a similar time to prepare. The trend is similar across all cuisines, days of the week and ratings.
* There is also no relationship between food preparation time and cost, or between delivery time and cost; in particular, neither pair shows an inverse correlation.

### Recommendations:

* Ratings should be made mandatory for all orders, as this might help in identifying bad orders, and the feedback from those ratings can be used to improve customer service.
* More restaurants from other cuisines should also be onboarded, which can be done by marketing to local and small restaurants.
* FoodHub should try to focus more on restaurants with fewer orders per day, for example by recommending them first when a customer searches for a particular cuisine.