# **EDA_McDonalds_data_Analysis**, **By Amit Kharche**
**Follow me** on [LinkedIn](https://www.linkedin.com/in/amit-kharche) and [Medium](https://medium.com/@amitkharche14) for more insights on **Data Science** and **AI**.

---
# **Table of Contents**
---

**1.** [**Introduction**](#Section1)<br>
**2.** [**Problem Statement**](#Section2)<br>
**3.** [**Installing & Importing Libraries**](#Section3)<br>
**4.** [**Data Acquisition & Description**](#Section4)<br>
**5.** [**Data Pre-profiling**](#Section5)<br>
**6.** [**Data Cleaning**](#Section6)<br>
**7.** [**Data Post-profiling**](#Section7)<br>
**8.** [**Exploratory Data Analysis**](#Section8)<br>
  - **8.1** [**Analysis Based on Outlet Performance Metrics**](#Section81)
  - **8.2** [**Analysis Focused on Nutritional Content**](#Section82)
  - **8.3** [**Evaluation Using Geolocation Data**](#Section83)
  - **8.4** [**Combined Analysis: Outlet Metrics & Nutritional Content**](#Section84)
  - **8.5** [**Combined Analysis: Outlet Metrics & Location Data**](#Section85)
  - **8.6** [**Combined Analysis: Nutrition & Location Data**](#Section86)
  - **8.7** [**Integrated Analysis: Outlet Metrics, Menu Items & Geographic Insights**](#Section87)
  
**9.** [**Summarization**](#Section9)<br>

---
<a name = Section1></a>
# **1. Introduction**
---

- McDonald's Corporation is an **American fast food company**, founded in 1940 as a restaurant operated by **Richard and Maurice McDonald**.

- They **rechristened** their business as a **hamburger stand** and later turned the company into a **franchise**, with the Golden Arches logo.

- It is the world's <a href="https://www.forbes.com/pictures/591c79084bbe6f1b730a5811/2017-global-2000-restaura/?sh=342712856d2a">**largest restaurant chain**</a> by revenue, serving over <a href="https://www.chicagotribune.com/business/chi-mcdonalds-60-years-20150415-story.html">**69 million**</a> customers daily in over 100 countries.


<center><img width="35%" src="https://raw.githubusercontent.com/insaid2018/PGPDSAI/main/03%20Term%203%20-%20EDA%20%26%20Data%20Storytelling/03%20Module%203/img/01%20mcd.gif"></center>

- As of 2018, it is the <a href="https://www.msn.com/en-in/money/photos/the-worlds-30-largest-employers-will-surprise-you/ss-BBKxFrN#image=27">**world's second-largest private employer**</a> with 1.7 million employees (behind Walmart with 2.3 million employees).

- As of 2020, McDonald's has the <a href="https://www.statista.com/statistics/326059/mcdonalds-brand-value/">**ninth-highest global brand valuation**</a>.

---
<a name = Section2></a>
# **2. Problem Statement**
---

- McDonald's wants to **expand** its **business** into developing countries in the **near future**.

- To plan this expansion, they have **hired a team of data scientists**. Let's say it's you...

- Based on **cultural** and **economic similarities and dissimilarities** between markets, what can you recommend to support this expansion?


<center><img width="60%" src="https://raw.githubusercontent.com/insaid2018/PGPDSAI/main/03%20Term%203%20-%20EDA%20%26%20Data%20Storytelling/03%20Module%203/img/02%20mcd.gif"></center>

---
<a name = Section3></a>
# **3. Installing & Importing Libraries**
---

<a name = Section31></a>
### **3.1 Installing Libraries**

In [1]:
!pip install -q datascience                                         # Package that is required by pandas profiling
!pip install -q ydata-profiling                                     # Library to generate basic statistics about data (provides the ydata_profiling module imported in 3.3)


<a name = Section32></a>
### **3.2 Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** so that the upgraded versions are loaded.

- After restarting the runtime, do not execute the installation (3.1) and upgrade (3.2) cells again.

In [2]:
!pip install -q --upgrade datascience                               # Package that is required by pandas profiling
!pip install -q --upgrade ydata-profiling                           # Library to generate basic statistics about data (provides the ydata_profiling module imported in 3.3)

<a name = Section33></a>
### **3.3 Importing Libraries**

In [5]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
from ydata_profiling import ProfileReport                        # To perform data profiling
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # For numerical python operations
#-------------------------------------------------------------------------------------------------------------------------------
import plotly.graph_objs as go                                      # For interactive graphs
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warnings to control runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings

<a name="Section4"></a>  
# **4. Data Acquisition & Wrangling**
---

We will be working with two datasets related to McDonald's to perform our analysis:

- **McDonald's Menu Nutrition Dataset** – Contains nutrition facts for various items on the McDonald’s menu.  
  ✅ *Shown in the left table.*

- **McDonald’s Store Location Dataset** – Includes store location and performance information across India and the US.  
  ✅ *Shown in the right table.*

<br>

<center>  
<img src="https://raw.githubusercontent.com/insaid2018/Term-1/master/Images/Mcd.png">  
</center>

---

### 📊 Dataset Summary

| Dataset | Records | Features | Dataset Size |
|:--:|:--:|:--:|:--:|
| McDonald's Menu (left) | 340 | 13 | 38 KB |
| McDonald's Store Locations (right) | 340 | 18 | 83 KB |

---

### 🔍 Feature Description

| Id | Menu Dataset Features | Description |  | Id | Store Dataset Features | Description |
|:--|:--|:--|:--|:--|:--|:--|
| 01 | **Category** | Category to which an item belongs |  | 01 | **Store_ID** | Unique ID of the store |
| 02 | **Item** | Name of the menu item |  | 02 | **Store_Name** | Name of the store |
| 03 | **Serve_Size** | Weight of one serving (g) |  | 03 | **Ownership_Type** | Ownership type (company owned / licensed / joint venture) |
| 04 | **Energy** | Calories in the item (kcal) |  | 04 | **Street_Address** | Address of the store |
| 05 | **Protein** | Protein content (g) |  | 05 | **City** | City where the store is located |
| 06 | **Total_Fat** | Total fat content (g) |  | 06 | **State** | State where the store is located |
| 07 | **Saturated_Fat** | Saturated fat content (g) |  | 07 | **Country** | Country of the store |
| 08 | **Trans_Fat** | Trans fat content (g) |  | 08 | **Postcode** | Postal code of the store |
| 09 | **Cholestrol** | Cholesterol content (mg) |  | 09 | **Phone_Number** | Contact number of the store |
| 10 | **Carbohydrates** | Carbohydrate content (g) |  | 10 | **Timezone** | Timezone of the store location |
| 11 | **Sugars** | Sugar content (g) |  | 11 | **Longitude** | Longitude coordinate |
| 12 | **Dietary_Fibre** | Fibre content (g) |  | 12 | **Latitude** | Latitude coordinate |
| 13 | **Sodium** | Sodium content (mg) |  | 13 | **Revenue** | Store revenue (in million INR) |
|    |               |                              |  | 14 | **Profits** | Store profits (in million INR) |
|    |               |                              |  | 15 | **Gross_Profit_Margin** | Gross profit margin (in %) |
|    |               |                              |  | 16 | **Number_of_Employees** | Total employees at the store |
|    |               |                              |  | 17 | **Customers** | Monthly customer count |
|    |               |                              |  | 18 | **Best_Selling_Item** | Top-selling menu item |


In [16]:
# Nutrition facts about different items for McDonald's menu.
df_mcd = pd.read_csv('https://raw.githubusercontent.com/amitkharche/exploratory_data_analysis_projects_amit_kharche/main/02.EDA_McDonalds_data_analysis_amit_kharche/df_mcd.csv')

# McDonald's store location information.
df_store = pd.read_csv('https://raw.githubusercontent.com/amitkharche/exploratory_data_analysis_projects_amit_kharche/main/02.EDA_McDonalds_data_analysis_amit_kharche/df_store.csv')

# Display data dimensions
print('McDonalds Menu Shape:', df_mcd.shape)
print('McDonalds Store Location Shape:', df_store.shape)

McDonalds Menu Shape: (340, 13)
McDonalds Store Location Shape: (340, 18)


**Observation:**

- From the data description of the two datasets, we can observe that **Item** (menu) and **Best_Selling_Item** (store) can be used to **merge** the datasets.

- Such a clean matching key is unusual in real-world datasets, but here it **helps** to **avoid** the **formation** of **duplicate** rows.

In [17]:
# Merge data using menu and outlets
data = pd.merge(left=df_store,
                right=df_mcd,
                how='inner',
                left_on='Best_Selling_Item',
                right_on='Item').drop(labels=['Item'], axis=1)

# Display the final data shape
print('Merged Data Shape:', data.shape)

# Output top 3 rows
data.head(3)

Merged Data Shape: (340, 30)


Unnamed: 0,Store_ID,Store_Name,Ownership_Type,Street_Address,City,State,Country,Postcode,Phone_Number,Timezone,Longitude,Latitude,Revenue,Profits,Gross_Profit_Margin,Number_of_Employees,Customers,Best_Selling_Item,Category,Serve_Size,Energy,Protein,Total_Fat,Saturated_Fat,Trans_Fat,Cholestrol,Carbohydrates,Sugars,Dietary_Fibre,Sodium
0,23149-228271,Banjara Hills,Joint Venture,"Lower Ground Floor, GVK One, Road Number 1, Banjara Hills",Hyderabad,AP,IN,500034,,GMT+05:30 Asia/New_Delhi,78.45,17.42,2.117344,0.171584,0.747732,34.311197,3979.583117,Egg & Cheese Muﬃn,Breakfast,112,290,14,13.0,7.0,0.2,244,28,2,2,620
1,23191-228548,Kukatpally,Joint Venture,"Upper Ground Floor, Forum Sujana Mall, Kukatpally",Hyderabad,AP,IN,500072,,GMT+05:30 Asia/New_Delhi,78.39,17.48,1.058504,0.054645,0.442299,25.487533,1156.01062,Sausage McMuﬀm,Breakfast,112,273,16,11.0,5.7,0.2,50,28,2,2,950
2,23193-228546,Madhapur,Joint Venture,"Lower Ground Floor, Inorbit Mall, Madhapur",Hyderabad,AP,IN,500081,,GMT+05:30 Asia/New_Delhi,78.39,17.43,4.50502,0.663867,0.933588,54.208502,10346.720786,Sausage & Egg McMuﬀm,Breakfast,157,355,22,17.0,7.9,0.2,277,29,2,2,1020
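
The concern about duplicate rows can also be checked in code. The sketch below uses two small made-up tables (the store IDs and energy values are hypothetical, not taken from the real files) and relies on the `validate` and `indicator` arguments of `pd.merge`. It is one possible safety check, not the approach used in this notebook.

```python
import pandas as pd

# Toy stand-ins for the store and menu tables (hypothetical values).
stores = pd.DataFrame({
    "Store_ID": ["S1", "S2", "S3"],
    "Best_Selling_Item": ["McDouble", "Fries", "McDouble"],
})
menu = pd.DataFrame({
    "Item": ["McDouble", "Fries", "Apple Pie"],
    "Energy": [380, 230, 250],
})

# validate="many_to_one" raises a MergeError if an Item appears more than once
# in the menu table, which is what would create duplicated store rows.
merged = pd.merge(
    left=stores,
    right=menu,
    how="left",
    left_on="Best_Selling_Item",
    right_on="Item",
    validate="many_to_one",
    indicator=True,
)

# Every store should have found a matching menu item.
unmatched = merged[merged["_merge"] != "both"]
print("Rows before:", len(stores), "rows after:", len(merged))
print("Unmatched stores:", len(unmatched))
```

If the row count after the merge equals the row count of the store table and no store is unmatched, the merge neither duplicated nor lost stores.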


<a name = Section41></a>
### **4.1 Data Description**

- In this section we will get **information about the data** and see some observations.

In [18]:
data.describe()

Unnamed: 0,Postcode,Longitude,Latitude,Revenue,Profits,Gross_Profit_Margin,Number_of_Employees,Customers,Energy,Protein,Total_Fat,Saturated_Fat,Trans_Fat,Cholestrol,Carbohydrates,Sugars,Dietary_Fibre
count,340.0,340.0,340.0,340.0,340.0,340.0,340.0,340.0,340.0,340.0,340.0,340.0,340.0,340.0,340.0,340.0,340.0
mean,290214100.0,-56.761,33.487647,25.853416,4.643606,3.52331,90.556231,13071.990651,337.8,11.694118,13.05,5.595588,0.183235,47.544118,43.770588,26.347059,1.544118
std,364978700.0,76.967267,9.945169,14.47698,3.406981,5.193082,31.352258,6993.912334,231.850796,10.99077,13.639604,5.219422,0.381715,81.815469,26.9364,26.89634,1.663719
min,2134.0,-158.02,12.91,1.001099,0.050085,-4.881901,25.009156,1002.929777,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,90436.25,-99.325,28.5225,15.301218,1.829762,0.069328,64.577738,6877.498703,176.5,2.0,1.375,0.5,0.0,4.0,28.0,4.0,0.0
50%,560079.0,-87.265,35.105,27.056185,3.893049,1.433993,89.164403,13373.546556,299.5,10.0,10.0,4.55,0.0,25.0,41.0,16.0,1.0
75%,660473100.0,-71.1225,40.1,37.930464,6.919813,6.18458,115.51561,19245.185705,460.0,17.0,20.0,9.0,0.1,55.0,56.0,43.25,3.0
max,996698000.0,80.26,61.6,49.680624,13.514181,19.459398,149.087497,24964.842677,1880.0,87.0,118.0,24.1,2.5,575.0,141.0,128.0,9.0


**Observation:**

> **Revenue:**
- On average, a store generates revenue of ₹ 25.85 million across India and the US.
- 25% of the stores generate revenue of ₹ 15.3 million or less.
- 50% of the stores generate revenue of ₹ 27.06 million or less (the median).
- 75% of the stores generate revenue of ₹ 37.93 million or less.

> **Profit:**
- On average, a store makes a profit of ₹ 4.64 million across India and the US.
- 25% of the stores make a profit of ₹ 1.83 million or less.
- 50% of the stores make a profit of ₹ 3.89 million or less (the median).
- 75% of the stores make a profit of ₹ 6.92 million or less.

- Similarly, the remaining statistics can be read from the dataframe above.
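
To make the quartile reading concrete, here is a minimal sketch on a small, made-up revenue series (the numbers are hypothetical). It computes the same 25%, 50% and 75% rows that `describe()` reports and confirms that a quarter of the stores fall at or below the first quartile.

```python
import pandas as pd

# Hypothetical per-store revenue values (in million INR), one value per store.
revenue = pd.Series([5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0], name="Revenue")

# Quartiles: the values below which 25%, 50% and 75% of the stores fall.
quartiles = revenue.quantile([0.25, 0.50, 0.75])

# Share of stores whose revenue is at or below the first quartile.
share_below_q1 = (revenue <= quartiles.loc[0.25]).mean()

print(quartiles)
print(f"Share of stores at or below Q1: {share_below_q1:.0%}")
```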


<a name = Section42></a>
### **4.2 Data Information**

- In this section we will see the **information about the types of features**.

In [20]:
data.info(verbose=True, memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 30 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Store_ID             340 non-null    object 
 1   Store_Name           340 non-null    object 
 2   Ownership_Type       340 non-null    object 
 3   Street_Address       340 non-null    object 
 4   City                 340 non-null    object 
 5   State                340 non-null    object 
 6   Country              340 non-null    object 
 7   Postcode             340 non-null    int64  
 8   Phone_Number         248 non-null    object 
 9   Timezone             340 non-null    object 
 10  Longitude            340 non-null    float64
 11  Latitude             340 non-null    float64
 12  Revenue              340 non-null    float64
 13  Profits              340 non-null    float64
 14  Gross_Profit_Margin  340 non-null    float64
 15  Number_of_Employees  340 non-null    flo

In [21]:
data.head(2)

Unnamed: 0,Store_ID,Store_Name,Ownership_Type,Street_Address,City,State,Country,Postcode,Phone_Number,Timezone,Longitude,Latitude,Revenue,Profits,Gross_Profit_Margin,Number_of_Employees,Customers,Best_Selling_Item,Category,Serve_Size,Energy,Protein,Total_Fat,Saturated_Fat,Trans_Fat,Cholestrol,Carbohydrates,Sugars,Dietary_Fibre,Sodium
0,23149-228271,Banjara Hills,Joint Venture,"Lower Ground Floor, GVK One, Road Number 1, Banjara Hills",Hyderabad,AP,IN,500034,,GMT+05:30 Asia/New_Delhi,78.45,17.42,2.117344,0.171584,0.747732,34.311197,3979.583117,Egg & Cheese Muﬃn,Breakfast,112,290,14,13.0,7.0,0.2,244,28,2,2,620
1,23191-228548,Kukatpally,Joint Venture,"Upper Ground Floor, Forum Sujana Mall, Kukatpally",Hyderabad,AP,IN,500072,,GMT+05:30 Asia/New_Delhi,78.39,17.48,1.058504,0.054645,0.442299,25.487533,1156.01062,Sausage McMuﬀm,Breakfast,112,273,16,11.0,5.7,0.2,50,28,2,2,950


**Observation:**

- **Phone_Number** has only **248 non-null** values out of 340, so **92 values are missing**; the other features listed above have no nulls.

- **Phone_Number** is stored as **object**. This is reasonable, because phone numbers commonly contain separators such as "-" and are identifiers rather than quantities.

- The **Serve_Size** and **Sodium** features have an **inconsistent data type**: they hold numeric quantities but are not stored as numbers.

- This was the high-level information. Next, we will perform pandas profiling to dig deeper.
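
The same checks can be done directly with pandas. The sketch below runs on a toy frame with made-up values that mimic the issues described above; the variable name `non_numeric_text_cols` is only illustrative.

```python
import pandas as pd

# Toy frame mimicking the issues seen above (hypothetical values).
df = pd.DataFrame({
    "Phone_Number": ["405-555-0100", None, "205-555-0199", None],
    "Serve_Size": ["112", "142 g", "64", "173 g"],
    "Sodium": ["620", "-", "0", "950"],
    "Revenue": [2.1, 1.0, 4.5, 3.3],
})

# Count missing cells per column and keep only the columns that have any.
missing = df.isna().sum()
print(missing[missing > 0])

# Among the non-numeric columns, find those with at least one value
# that cannot be parsed as a number.
text_cols = df.select_dtypes(exclude="number").columns
non_numeric_text_cols = [
    col for col in text_cols
    if pd.to_numeric(df[col].dropna(), errors="coerce").isna().any()
]
print(non_numeric_text_cols)
```

Columns such as `Serve_Size` and `Sodium` show up in this list because of a few stray entries, which is a hint that they need cleaning rather than being genuinely textual.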

<a name = Section5></a>

---
# **5. Data Pre-Profiling**
---

- Pandas profiling is very handy for a quick first analysis.

- It generates a profile report from a pandas DataFrame.

- For each column, statistics are presented in an interactive HTML report.

In [22]:
profile = ProfileReport(df=data)
profile.to_file(output_file='Pre Profiling Report.html')
print('Accomplished!')


Accomplished!


**Observations:**

- The report shows a **total** of **30 features**, of which **10** are **categorical**, **17** are **numerical**, and **3** are of **unsupported type**.

- There is **missing data** in **92 cells (0.9%)**, and there are **no duplicate rows**.

- Features such as Store_ID, Store_Name, Street_Address, City, and Best_Selling_Item have high cardinality.

- The **revenue** is **highly correlated** with the **number of employees**.

- The **energy** is **highly correlated** with the **fat** content.

- The **country** is **highly correlated** with **time zone, ownership type,** and **state**.

- The **phone number** has **92 (27.1%) missing values**.

- Features such as **Store_ID, Store_Name, Street_Address, Best_Selling_Item** are found to be **uniformly distributed**.

- Features such as **Store_ID, Store_Name, Profits, Gross_Profit_Margin, Best_Selling_Item** are found to have **unique values**.

- **Phone_Number, Serve_Size, Sodium** are of **unsupported type**. They need cleaning or further analysis.

- **Number_of_Employees** and **Customers** are stored as **float values**, which looks like a data entry error because both are counts; they need to be floored to integers.

- **Note:** The rest of the information can be read from the profile report itself.
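
Two of the report's findings, correlation and cardinality, are easy to reproduce by hand. The following sketch uses a small hypothetical frame (values invented for illustration) and plain pandas calls; the profiling report may use different correlation measures internally, so treat this as a rough cross-check only.

```python
import pandas as pd

# Hypothetical store metrics; real values would come from the merged frame.
df = pd.DataFrame({
    "Revenue": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Number_of_Employees": [30, 55, 82, 110, 140],
    "City": ["Hyderabad", "Mumbai", "Moore", "Eagle", "Mumbai"],
})

# Pearson correlation between two numeric features (1.0 means perfectly linear).
corr = df["Revenue"].corr(df["Number_of_Employees"])
print(f"Revenue vs employees: {corr:.3f}")

# Cardinality: number of distinct values relative to the number of rows.
cardinality = df["City"].nunique() / len(df)
print(f"City cardinality ratio: {cardinality:.2f}")
```

A cardinality ratio close to 1 means almost every row has its own value, which is why identifiers such as store IDs are flagged as highly cardinal.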


<a name = Section6></a>

---
# **6. Data Cleaning**
---

- In this section, we will perform **cleaning** operations on the features using the information from the previous section.

- Before cleaning, we need to **drop unnecessary features** that won't help in the analysis.

- These features are Postcode, Phone_Number, and Timezone.

- Next, we need to **rectify** the **data entry errors** by flooring the employee and customer counts.

In [24]:
# Dropping unnecessary features: Postcode, Phone_Number, Timezone
data.drop(labels=['Postcode', 'Phone_Number', 'Timezone'], axis=1, inplace=True)

# Rectifying data entry error by flooring
data['Number_of_Employees'] = data['Number_of_Employees'].apply(np.floor).astype(int)
data['Customers'] = data['Customers'].apply(np.floor).astype(int)

- The **Serve_Size** and **Sodium** features **contain irregular data** that needs to be cleaned.

In [25]:
# Analyze the Serve_Size and Sodium feature values at random range
print('Serve_Size few values (Random Chosen):', data['Serve_Size'].values[0:10])
print('Sodium few values (Random Chosen):', data['Sodium'].values[45:55])

Serve_Size few values (Random Chosen): ['112' '112' '157' '119' '139' '142 g' '64' '115' '246' '173 g']
Sodium few values (Random Chosen): ['-' '-' '-' '-' '-' '-' '0' '0' '0' '0']


In [27]:
# Remove ' g' from 'Serve_Size' and convert to int
data['Serve_Size'] = data['Serve_Size'].astype(str).str.replace(' g', '', regex=False).astype(int)

# Convert 'Sodium' to numeric; non-numeric entries such as '-' become NaN
data['Sodium'] = pd.to_numeric(data['Sodium'], errors='coerce')

# Fill NaNs with the median value (assign back instead of a chained inplace fillna)
median = int(data['Sodium'].median())
data['Sodium'] = data['Sodium'].fillna(median)

# Convert to int (all values now valid)
data['Sodium'] = data['Sodium'].astype(int)

# Verify some values
print('Serve_Size few values (Random Chosen):', data['Serve_Size'].values[0:10])
print('Sodium few values (Random Chosen):', data['Sodium'].values[45:55])

Serve_Size few values (Random Chosen): [112 112 157 119 139 142  64 115 246 173]
Sodium few values (Random Chosen): [230 230 230 230 230 230   0   0   0   0]
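
As a variation on the cleaning step, the sketch below extracts the digits with a regular expression instead of removing the literal `' g'` suffix, which would also cope with other unit spellings. The input values are hypothetical and the approach is an alternative, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical raw values resembling the two messy columns.
serve_size = pd.Series(["112", "142 g", "64", "173 g"])
sodium = pd.Series(["620", "-", "0", "950", "-"])

# Pull out the leading digits, ignoring any trailing unit text.
serve_size_clean = serve_size.str.extract(r"(\d+)", expand=False).astype(int)

# Coerce placeholders such as '-' to NaN, then impute with the median.
sodium_num = pd.to_numeric(sodium, errors="coerce")
sodium_clean = sodium_num.fillna(sodium_num.median()).astype(int)

print(serve_size_clean.tolist())
print(sodium_clean.tolist())
```

Median imputation keeps the column usable, but it also hides which values were originally missing; keeping a boolean flag column is a common complement when that distinction matters.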


<a name = Section7></a>

---
# **7. Data Post-Profiling**
---

- In this section, we will profile the data again to observe what changed after the cleaning step.

In [29]:
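from google.colab import files                                      # Assumption: the notebook runs on Google Colab

profile = ProfileReport(df=data)
profile.to_file(output_file='Post Profiling Report.html')
files.download('Post Profiling Report.html')                        # Colab-only: download the report to the local machine
print('Accomplished!')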


Accomplished!


**Observations:**

- The report shows a total of **27 features** (30 minus the 3 dropped ones), of which **9** are **categorical** and **18** are **numerical**.

- We have **successfully got rid of the unsupported type** features.

- There are **no missing cells** after data cleaning.

- **Note:** For more detail on the differences, refer to the profiling reports.

<a name = Section8></a>

---
# **8. Exploratory Data Analysis**
---

- In this section, we will analyze the dataset to summarize its main characteristics, often using visuals.

- The primary goal is to retrieve the maximum amount of information from the dataset.

- At the same time, we look for the following information:
  - A list of outliers
  - A good fitting model
  - Estimates for parameters and their uncertainties
  - A ranked list of important factors
  - Conclusions as to whether individual factors are statistically significant
  - A sense of the robustness of conclusions
  - Optimal settings

**Note:**

- In the upcoming sub-sections, we will use an interactive visualization library called Plotly.

- If you haven't used this package before, that is entirely okay.

- You can always visit its documentation to understand how it works.

- It is relatively easy to use and makes exploratory data analysis more impactful.

- However, you can achieve the same objective (in a static fashion) using matplotlib and seaborn.
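
For readers who prefer the static route, here is a minimal matplotlib sketch of a donut chart similar in spirit to the Plotly pie used in the next sub-section. The ownership labels are made-up sample data, and the `Agg` backend line is only there so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ownership labels; in the notebook this would be data['Ownership_Type'].
ownership = pd.Series(["Company Owned", "Licensed", "Joint Venture",
                       "Company Owned", "Licensed", "Company Owned"])
counts = ownership.value_counts()

fig, ax = plt.subplots(figsize=(6, 4))
# A wedge width below 1 leaves a hole in the middle, giving a donut chart.
ax.pie(counts.values, labels=counts.index, autopct="%1.0f%%",
       wedgeprops=dict(width=0.3))
ax.set_title("Proportion of Ownership Type")
fig.tight_layout()
# Outside a notebook, the figure can be written to disk with fig.savefig(...).
```

The static chart loses the hover tooltips, which is the main practical difference from the Plotly version.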

<a name = Section81></a>
### **8.1 Outlet Metrics-based Analysis**

- In this section, we will perform exploratory data analysis based on the outlet metrics of McDonald's.

<a name = Section811></a>
**<h4>Question:** How many stores are owned and run by McDonald's?</h4>

In [30]:
# Extract labels and values of ownership type
labels = data['Ownership_Type'].value_counts().index
values = data['Ownership_Type'].value_counts().values

# Initiate an empty figure
fig = go.Figure()

# Add a trace of pie to the figure
fig.add_trace(trace=go.Pie(labels=labels,
                           values=values,
                           hole=.8))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Proportion of Ownership Type',
                  title_x=0.5)

# Display the figure
fig.show()

**Observation:**

- **~40%** of the outlets are **company owned** and **~36%** are **licensed** outlets.

- **~24%** of the outlets are **joint ventures**. These are mostly in countries where 100% FDI is or was not allowed, so the company could not open fully owned outlets there.
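
The percentages shown on the pie chart can also be obtained as numbers. This short sketch uses an invented ownership column and `value_counts(normalize=True)`; the shares it prints belong to the toy data, not to the real dataset.

```python
import pandas as pd

# Hypothetical ownership column with 10 outlets.
ownership = pd.Series(
    ["Company Owned"] * 5 + ["Licensed"] * 3 + ["Joint Venture"] * 2,
    name="Ownership_Type",
)

# normalize=True returns fractions instead of counts; scale to percent.
share = ownership.value_counts(normalize=True).mul(100).round(1)
print(share)
```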

<a name = Section812></a>
**<h4>Question:** Which top 10 outlets generate maximum revenue for the company?</h4>

In [31]:
# Select the 10 rows with the highest revenue (.loc, because .index holds labels, not positions)
top10outletsrevenue = data.loc[data['Revenue'].sort_values(ascending=False)[:10].index, :]
top10outletsrevenue.head(2)

Unnamed: 0,Store_ID,Store_Name,Ownership_Type,Street_Address,City,State,Country,Longitude,Latitude,Revenue,Profits,Gross_Profit_Margin,Number_of_Employees,Customers,Best_Selling_Item,Category,Serve_Size,Energy,Protein,Total_Fat,Saturated_Fat,Trans_Fat,Cholestrol,Carbohydrates,Sugars,Dietary_Fibre,Sodium
139,10753-102015,19th & Telephone,Company Owned,620 SW 19th Street,Moore,OK,US,-97.5,35.32,49.680624,4.065785,17.456977,149,4056,Premium Grilled Chicken Classic Sandwich,Chicken & Fish,200,350,28,9.0,2.0,0.0,65,42,8,3,820
133,27316-246764,Fort Sill BX,Licensed,1718 Macomb Rd,Fort Sill,OK,US,-98.4,34.67,49.415723,5.617425,13.902593,148,7112,McDouble,Beef & Pork,147,380,22,17.0,8.0,1.0,75,34,7,2,840


In [32]:
# Initiate an empty figure
fig = go.Figure()

# Add a trace of scattergeo to the figure
fig.add_trace(trace=go.Scattergeo(lon=top10outletsrevenue['Longitude'],
                                  lat=top10outletsrevenue['Latitude'],
                                  text=top10outletsrevenue[['Store_Name', 'Revenue']],
                                  marker=dict(size=top10outletsrevenue['Revenue'] / 2,
                                              color='Green')))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Top 10 Outlets Concerning Revenue (in million INR)',
                  title_x=0.5,
                  geo=dict(scope='usa',
                           projection=go.layout.geo.Projection(type='albers usa')))

# Display the figure
fig.show()

**Observations:**

- All the stores in our **Top 10** for **maximum revenue generation** lie in the **US**.

- The **size** of the **marker** is proportional to the **Revenue** of the outlet.

- All the outlets in the **Top 10** have very **similar revenue** (roughly **48 - 49 million INR**).

- The **tooltip** shows the **geographical coordinates, name** and **revenue** of the **outlet**.
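
One way to control exactly what the tooltip displays is to build one text string per outlet before plotting. The sketch below does this on a small hypothetical table using `nlargest`; the resulting Series could then be passed to the `text` argument of a Plotly trace.

```python
import pandas as pd

# Hypothetical outlets with revenue in million INR.
stores = pd.DataFrame({
    "Store_Name": ["A", "B", "C", "D"],
    "Revenue": [49.68, 12.30, 49.42, 30.05],
})

# Top 2 stores by revenue; nlargest keeps the original index labels.
top = stores.nlargest(2, "Revenue")

# One hover string per store, combining the name and the rounded revenue.
hover_text = top["Store_Name"] + ": " + top["Revenue"].round(2).astype(str) + " M INR"
print(hover_text.tolist())
```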

<a name = Section813></a>
**<h4>Question:** Which top 10 outlets generate maximum profit for the company?</h4>

In [33]:
# Select the 10 rows with the highest profit (.loc, because .index holds labels, not positions)
top10outletsprofit = data.loc[data['Profits'].sort_values(ascending=False)[:10].index, :]
top10outletsprofit.head(2)

Unnamed: 0,Store_ID,Store_Name,Ownership_Type,Street_Address,City,State,Country,Longitude,Latitude,Revenue,Profits,Gross_Profit_Margin,Number_of_Employees,Customers,Best_Selling_Item,Category,Serve_Size,Energy,Protein,Total_Fat,Saturated_Fat,Trans_Fat,Cholestrol,Carbohydrates,Sugars,Dietary_Fibre,Sodium
316,76725-102051,Target Kansas City T-2222,Licensed,10900 Stadium Pkwy,Kansas City,KS,US,-94.83,39.13,46.934416,13.514181,-4.788909,141,23842,Strawberry Banana Smoothie (Medium),Smoothies & Shakes,453,250,4,1.0,0.0,0.0,5,58,54,3,60
210,72668-65003,Super Target Tuscaloosa ST-1787,Licensed,1901 13th Ave E,Tuscaloosa,AL,US,-87.51,33.2,48.907285,13.43148,-3.558528,146,22564,1% Low Fat Milk Jug,Cold Beverages,236,100,8,2.5,1.5,0.0,10,12,12,0,125


In [34]:
# Initiate an empty figure
fig = go.Figure()

# Add a trace of scattergeo to the figure
fig.add_trace(trace=go.Scattergeo(lon=top10outletsprofit['Longitude'],
                                  lat=top10outletsprofit['Latitude'],
                                  text=top10outletsprofit[['Store_Name', 'Profits']],
                                  marker=dict(size=top10outletsprofit['Profits'],
                                              color='Blue')))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Top 10 Outlets Concerning Profit (in million INR)',
                  title_x=0.5,
                  geo=dict(scope='usa',
                           projection=go.layout.geo.Projection(type='albers usa')))

# Display the figure
fig.show()

**Observations:**

- All the stores in our **Top 10** for **maximum profits** also lie in the **US**.

- The **size** of the **marker** is proportional to the **Profit** of the outlet.

- The **tooltip** shows the **geographical coordinates, name** and **profits** of the **outlet**.
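
The maps in this section divide each metric by a hand-picked constant to get sensible marker sizes. A possible alternative is to rescale any metric onto a fixed pixel range. The helper below, `scale_marker_sizes`, is a hypothetical function written for this illustration and is not part of Plotly.

```python
import numpy as np

def scale_marker_sizes(values, min_size=6.0, max_size=30.0):
    """Linearly map values onto the range [min_size, max_size] (in pixels)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values are equal: use the midpoint size for every marker.
        return np.full(values.shape, (min_size + max_size) / 2)
    return min_size + (values - values.min()) / span * (max_size - min_size)

# Example with hypothetical customer counts.
sizes = scale_marker_sizes([1000, 13000, 25000])
print(sizes)
```

With this kind of scaling the same call works for revenue, profit, employees, and customers, at the cost of exaggerating small differences when all values are close together.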

<a name = Section814></a>
**<h4>Question:** Which top 10 outlets have the highest number of employees?</h4>

In [35]:
# Select the 10 rows with the most employees (.loc, because .index holds labels, not positions)
top10outletsemp = data.loc[data['Number_of_Employees'].sort_values(ascending=False)[:10].index, :]
top10outletsemp.head(2)

Unnamed: 0,Store_ID,Store_Name,Ownership_Type,Street_Address,City,State,Country,Longitude,Latitude,Revenue,Profits,Gross_Profit_Margin,Number_of_Employees,Customers,Best_Selling_Item,Category,Serve_Size,Energy,Protein,Total_Fat,Saturated_Fat,Trans_Fat,Cholestrol,Carbohydrates,Sugars,Dietary_Fibre,Sodium
139,10753-102015,19th & Telephone,Company Owned,620 SW 19th Street,Moore,OK,US,-97.5,35.32,49.680624,4.065785,17.456977,149,4056,Premium Grilled Chicken Classic Sandwich,Chicken & Fish,200,350,28,9.0,2.0,0.0,65,42,8,3,820
133,27316-246764,Fort Sill BX,Licensed,1718 Macomb Rd,Fort Sill,OK,US,-98.4,34.67,49.415723,5.617425,13.902593,148,7112,McDouble,Beef & Pork,147,380,22,17.0,8.0,1.0,75,34,7,2,840


In [36]:
# Initiate an empty figure
fig = go.Figure()

# Add a trace of scattergeo to the figure
fig.add_trace(trace=go.Scattergeo(lon=top10outletsemp['Longitude'],
                                  lat=top10outletsemp['Latitude'],
                                  text=top10outletsemp[['Store_Name', 'Number_of_Employees']],
                                  marker=dict(size=top10outletsemp['Number_of_Employees'] / 7,
                                              color='Orange')))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Top 10 Outlets Having Highest Number of Employees',
                  title_x=0.5,
                  geo=dict(scope='usa',
                           projection=go.layout.geo.Projection(type='albers usa')))

# Display the figure
fig.show()

**Observations:**

- All the stores in our **Top 10** for **highest number of employees** lie in the **US** as well.

- The **number of employees** in the **top 10** lies in the **range 145-149**.

- The **tooltip** shows the **geographical coordinates, name** and **number of employees** of the **outlet**.

<a name = Section815></a>
**<h4>Question:** Which top 10 outlets have the highest number of customers?</h4>

In [37]:
# Select the 10 rows with the most customers (.loc, because .index holds labels, not positions)
top10outletscust = data.loc[data['Customers'].sort_values(ascending=False)[:10].index, :]
top10outletscust.head(2)

Unnamed: 0,Store_ID,Store_Name,Ownership_Type,Street_Address,City,State,Country,Longitude,Latitude,Revenue,Profits,Gross_Profit_Margin,Number_of_Employees,Customers,Best_Selling_Item,Category,Serve_Size,Energy,Protein,Total_Fat,Saturated_Fat,Trans_Fat,Cholestrol,Carbohydrates,Sugars,Dietary_Fibre,Sodium
281,3236-251306,Hwy 44 and Edgewood - Eagle,Company Owned,1598 E. Riverside Dr.,Eagle,ID,US,-116.33,43.69,38.209784,11.448942,-4.881901,116,24964,Caramel Iced Coffee (Large),Hot Beverages,907,260,2,9.0,6.0,0.0,35,43,42,0,65
57,19530-197407,Bandra East - FIFC,Joint Venture,"First International Financial Centre, Bandra Kurla Complex Road, Bandra (East)",Mumbai,MH,IN,72.87,19.07,9.917875,2.952737,-1.22538,99,24780,McFloat Fanta,Desserts,237,152,2,2.0,1.1,0.1,3,32,31,0,390


In [38]:
# Initiate an empty figure
fig = go.Figure()

# Add a trace of scattergeo to the figure
fig.add_trace(trace=go.Scattergeo(lon=top10outletscust['Longitude'],
                                  lat=top10outletscust['Latitude'],
                                  text=top10outletscust[['Store_Name', 'Customers']],
                                  marker=dict(size=top10outletscust['Customers'] / 2000,
                                              color='Purple')))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Top 10 Outlets Having Highest Number of Customers',
                  title_x=0.5,
                  geo=dict(scope='world',
                           resolution=110,
                           showcoastlines=True,
                           coastlinecolor='White'))

# Display the figure
fig.show()

**Observations:**

- In case of the **Customer count, 6 outlets** are from the **US** and **4** are from **India** in our **Top 10** outlets with **highest number of customers**.

- The **tooltip** shows the **geographical coordinates, name** and **number of customers** of the **outlet**.
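
The country split quoted above can also be checked numerically instead of being read off the map. The following is a minimal sketch on a toy DataFrame; the store names and customer counts are hypothetical, and only the column names `Country` and `Customers` mirror the dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    'Store_Name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Country':    ['US', 'IN', 'US', 'IN', 'US', 'US'],
    'Customers':  [24964, 24780, 100, 24000, 23000, 50],
})

# Keep the 4 rows with the largest customer counts
top4 = toy.nlargest(4, 'Customers')

# Count how many of those rows belong to each country
country_split = top4['Country'].value_counts()
print(country_split)
```

Applied to the real data, the same two calls with `10` in place of `4` should reproduce the split described in the observation.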

<a name = Section82></a>
### **8.2 Nutritional Value-based Analysis**

- In this section, we will perform exploratory data analysis based on the nutritional value of McDonald's items.

<a name = Section821></a>
**<h4>Question:** Which is the most common category on the menu?</h4>

In [39]:
# Extract labels and values of menu category
labels = data['Category'].value_counts().index
values = data['Category'].value_counts().values

# Initiate an empty figure
fig = go.Figure()

# Add a trace of bar to the figure
fig.add_trace(trace=go.Bar(x=values,
                           y=labels,
                           orientation='h'))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Frequency Distribution of Menu Category',
                  title_x=0.5,
                  xaxis_title='Frequency',
                  yaxis_title='Category')

# Display the figure
fig.show()

**Observation:**

- Most items on the **McDonald's menu** belong to the **Hot Beverages** category.

<a name = Section822></a>
**<h4>Question:** On average how many calories are present in each category of the menu?</h4>

In [40]:
# Extract labels and values of menu category
labels = data.groupby(['Category'])['Energy'].mean().sort_values().index
values = data.groupby(['Category'])['Energy'].mean().sort_values().values

# Initiate an empty figure
fig = go.Figure()

# Add a trace of bar to the figure
fig.add_trace(trace=go.Bar(x=values,
                           y=labels,
                           orientation='h',
                           marker=dict(color='rgba(246, 78, 139, 0.6)',
                                       line=dict(color='rgba(246, 78, 139, 1.0)',
                                                 width=3))))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Calories per Category',
                  title_x=0.5,
                  xaxis_title='Average Calories',
                  yaxis_title='Category')

# Display the figure
fig.show()

**Observations:**

- The **Chicken & Fish** category has the **highest average calorie content**, followed by **Sandwiches and Wraps**.

- The **most common category**, **Hot Beverages**, has a **lower calorie content** than the categories mentioned above.

<a name = Section823></a>
**<h4>Question:** Is there any difference in the nutritional value of a grilled vs crispy chicken?</h4>

In [41]:
# Creating new features containing items either grilled or crispy
data['Grilled'] = data['Best_Selling_Item'].str.contains('Grilled')
data['Crispy'] = data['Best_Selling_Item'].str.contains('Crispy')

# Creating two new dataframes having items grilled and crispy
crispy_df = data.loc[data['Crispy'] == True, ['Best_Selling_Item', 'Total_Fat']]
grilled_df = data.loc[data['Grilled'] == True, ['Best_Selling_Item', 'Total_Fat']]

# Reset the index and drop old index values
crispy_df.reset_index(drop=True, inplace=True)
grilled_df.reset_index(drop=True, inplace=True)

# Merge crispy and grilled dataframe
grillcrisp = pd.merge(left=grilled_df, right=crispy_df, how='left', left_index=True, right_index=True)

# Display the dataframe shape
print('DataFrame Shape:', grillcrisp.shape)

# Output the dataframe
grillcrisp

DataFrame Shape: (13, 4)


| | Best_Selling_Item_x | Total_Fat_x | Best_Selling_Item_y | Total_Fat_y |
|---|---|---|---|---|
| 0 | Premium Grilled Chicken Classic Sandwich | 9.0 | Premium Crispy Chicken Classic Sandwich | 22.0 |
| 1 | Premium Grilled Chicken Club Sandwich | 20.0 | Premium Crispy Chicken Club Sandwich | 33.0 |
| 2 | Premium Grilled Chicken Ranch BLT Sandwich | 15.0 | Premium Crispy Chicken Ranch BLT Sandwich | 28.0 |
| 3 | Bacon Clubhouse Grilled Chicken Sandwich | 25.0 | Bacon Clubhouse Crispy Chicken Sandwich | 38.0 |
| 4 | Premium McWrap Chicken & Bacon (Grilled Chicken) | 19.0 | Southern Style Crispy Chicken Sandwich | 19.0 |
| 5 | Premium McWrap Chicken & Ranch (Grilled Chicken) | 18.0 | Premium McWrap Chicken & Bacon (Crispy Chicken) | 32.0 |
| 6 | Premium McWrap Southwest Chicken (Grilled Chicken) | 20.0 | Premium McWrap Chicken & Ranch (Crispy Chicken) | 31.0 |
| 7 | Premium McWrap Chicken Sweet Chili (Grilled Chicken) | 10.0 | Premium McWrap Southwest Chicken (Crispy Chicken) | 33.0 |
| 8 | Premium Bacon Ranch Salad with Grilled Chicken | 8.0 | Premium McWrap Chicken Sweet Chili (Crispy Chicken) | 23.0 |
| 9 | Premium Southwest Salad with Grilled Chicken | 8.0 | Premium Bacon Ranch Salad with Crispy Chicken | 21.0 |


**Observation:**

- We can see that **Ranch Snack Wrap (Grilled Chicken)** and **Southern Style Crispy Chicken Sandwich** don't match with each other.

- This will **affect our analysis** and hence, we will **remove these items from** their **respective dataframes**.
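
Pairing rows by position is what makes the unmatched items shift every row below them. As an alternative sketch (not the approach used in this notebook), the two variants can be paired on a shared key built from the item name. It assumes that grilled and crispy names differ only in the word "Grilled" or "Crispy" and that each item name appears once. The `Key` column and the toy fat values are illustrative.

```python
import pandas as pd

# Toy menu (hypothetical subset; fat values are illustrative)
menu = pd.DataFrame({
    'Best_Selling_Item': [
        'Premium Grilled Chicken Club Sandwich',
        'Ranch Snack Wrap (Grilled Chicken)',
        'Premium Crispy Chicken Club Sandwich',
        'Southern Style Crispy Chicken Sandwich',
    ],
    'Total_Fat': [20.0, 10.0, 33.0, 19.0],
})

# Build a shared key by removing the cooking style from the name
menu['Key'] = menu['Best_Selling_Item'].str.replace('Grilled |Crispy ', '', regex=True)

grilled = menu[menu['Best_Selling_Item'].str.contains('Grilled')]
crispy = menu[menu['Best_Selling_Item'].str.contains('Crispy')]

# An inner merge on the key keeps only items that exist in both styles
pairs = pd.merge(grilled, crispy, on='Key', suffixes=('_Grilled', '_Crispy'))
print(pairs[['Key', 'Total_Fat_Grilled', 'Total_Fat_Crispy']])
```

With this approach, items without a counterpart drop out of the result on their own, so no manual index lookup is needed.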

In [42]:
# Get indexes of grilled and crispy items to be removed
grilled_index = grilled_df[grilled_df['Best_Selling_Item'] == 'Ranch Snack Wrap (Grilled Chicken)'].index
crispy_index = crispy_df[crispy_df['Best_Selling_Item'] == 'Southern Style Crispy Chicken Sandwich'].index

# Drop items using the index values extracted
grilled_df.drop(labels=grilled_index, inplace=True)
crispy_df.drop(labels=crispy_index, inplace=True)

# Reset the index and drop old index values
crispy_df.reset_index(drop=True, inplace=True)
grilled_df.reset_index(drop=True, inplace=True)

# Merge crispy and grilled dataframe
grillcrisp = pd.merge(left=grilled_df, right=crispy_df, how='left', left_index=True, right_index=True)

# Display the dataframe shape
print('DataFrame Shape:', grillcrisp.shape)

# Renaming the merged dataframe features
grillcrisp.columns = ['Items-Grilled', 'Total_Fat-Grilled', 'Items-Crispy', 'Total_Fat-Crispy']

# Output the dataframe
grillcrisp

DataFrame Shape: (12, 4)


| | Items-Grilled | Total_Fat-Grilled | Items-Crispy | Total_Fat-Crispy |
|---|---|---|---|---|
| 0 | Premium Grilled Chicken Classic Sandwich | 9.0 | Premium Crispy Chicken Classic Sandwich | 22.0 |
| 1 | Premium Grilled Chicken Club Sandwich | 20.0 | Premium Crispy Chicken Club Sandwich | 33.0 |
| 2 | Premium Grilled Chicken Ranch BLT Sandwich | 15.0 | Premium Crispy Chicken Ranch BLT Sandwich | 28.0 |
| 3 | Bacon Clubhouse Grilled Chicken Sandwich | 25.0 | Bacon Clubhouse Crispy Chicken Sandwich | 38.0 |
| 4 | Premium McWrap Chicken & Bacon (Grilled Chicken) | 19.0 | Premium McWrap Chicken & Bacon (Crispy Chicken) | 32.0 |
| 5 | Premium McWrap Chicken & Ranch (Grilled Chicken) | 18.0 | Premium McWrap Chicken & Ranch (Crispy Chicken) | 31.0 |
| 6 | Premium McWrap Southwest Chicken (Grilled Chicken) | 20.0 | Premium McWrap Southwest Chicken (Crispy Chicken) | 33.0 |
| 7 | Premium McWrap Chicken Sweet Chili (Grilled Chicken) | 10.0 | Premium McWrap Chicken Sweet Chili (Crispy Chicken) | 23.0 |
| 8 | Premium Bacon Ranch Salad with Grilled Chicken | 8.0 | Premium Bacon Ranch Salad with Crispy Chicken | 21.0 |
| 9 | Premium Southwest Salad with Grilled Chicken | 8.0 | Premium Southwest Salad with Crispy Chicken | 22.0 |


**Observation:**

- Now **each item** in the grilled dataframe **matches its counterpart** in the crispy dataframe.

- Next, we will **drop Items-Crispy**, remove the word "**Grilled**" from the item names, and create a simple dataframe to plot.

In [43]:
# Dropping Items-Crispy
grillcrisp.drop(labels='Items-Crispy', axis=1, inplace=True)

# Replacing Grilled with nothing
grillcrisp['Item'] = grillcrisp['Items-Grilled'].str.replace("Grilled ", "")

# Dropping Item-Grilled
grillcrisp.drop('Items-Grilled', axis=1, inplace=True)

# Set Item as index of the dataframe
# grillcrisp.set_index(keys='Item', inplace=True)

# Output the dataframe
grillcrisp

| | Total_Fat-Grilled | Total_Fat-Crispy | Item |
|---|---|---|---|
| 0 | 9.0 | 22.0 | Premium Chicken Classic Sandwich |
| 1 | 20.0 | 33.0 | Premium Chicken Club Sandwich |
| 2 | 15.0 | 28.0 | Premium Chicken Ranch BLT Sandwich |
| 3 | 25.0 | 38.0 | Bacon Clubhouse Chicken Sandwich |
| 4 | 19.0 | 32.0 | Premium McWrap Chicken & Bacon (Chicken) |
| 5 | 18.0 | 31.0 | Premium McWrap Chicken & Ranch (Chicken) |
| 6 | 20.0 | 33.0 | Premium McWrap Southwest Chicken (Chicken) |
| 7 | 10.0 | 23.0 | Premium McWrap Chicken Sweet Chili (Chicken) |
| 8 | 8.0 | 21.0 | Premium Bacon Ranch Salad with Chicken |
| 9 | 8.0 | 22.0 | Premium Southwest Salad with Chicken |


In [44]:
# Initiate an empty figure
fig = go.Figure()

# Adding two traces of bar to the figure
fig.add_trace(trace=go.Bar(x=grillcrisp['Total_Fat-Crispy'],
                           y=grillcrisp['Item'],
                           orientation='h',
                           name='Total_Fat-Crispy'))

fig.add_trace(trace=go.Bar(x=grillcrisp['Total_Fat-Grilled'],
                           y=grillcrisp['Item'],
                           orientation='h',
                           name='Total_Fat-Grilled'))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Fat-Grilled vs Fat-Crispy',
                  title_x=0.5,
                  xaxis_title='Total Fat',
                  yaxis_title='Item')

# Display the figure
fig.show()

**Observations:**

- It is evident from the chart that **Crispy** food items have a **higher fat content** than **Grilled** food items.

- Chicken and Sandwiches already have a **high calorie content** as shown in the previous chart.

- But now we can distinguish them on the basis of **Fat content** as well.
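
The size of the gap can be quantified as well. The sketch below copies the total fat values from the ten rows displayed in the matched grilled/crispy output earlier in this section and computes the per-item difference; the column name `Difference` is introduced here for illustration.

```python
import pandas as pd

# Total fat values copied from the ten displayed rows of the matched output
fat = pd.DataFrame({
    'Total_Fat-Grilled': [9.0, 20.0, 15.0, 25.0, 19.0, 18.0, 20.0, 10.0, 8.0, 8.0],
    'Total_Fat-Crispy':  [22.0, 33.0, 28.0, 38.0, 32.0, 31.0, 33.0, 23.0, 21.0, 22.0],
})

# Per-item gap between the crispy and the grilled variant
fat['Difference'] = fat['Total_Fat-Crispy'] - fat['Total_Fat-Grilled']

print(fat['Difference'].min(), fat['Difference'].max(), fat['Difference'].mean())
```

For these ten rows the crispy variant carries between 13 and 14 more units of total fat than its grilled counterpart.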

<a name = Section824></a>
**<h4>Question:** On average, how much sugar is present in each category of the menu?</h4>

In [45]:
# Extract labels and values of menu category
labels = data.groupby(['Category'])['Sugars'].mean().sort_values().index
values = data.groupby(['Category'])['Sugars'].mean().sort_values().values

# Initiate an empty figure
fig = go.Figure()

# Add a trace of bar to the figure
fig.add_trace(trace=go.Bar(x=values,
                           y=labels,
                           orientation='h',
                           marker=dict(color='rgba(247, 231, 113, 1.0)',
                                       line=dict(color='rgba(247, 209, 0, 1.0)',
                                                 width=3))))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Sugar per Category',
                  title_x=0.5,
                  xaxis_title='Average Sugars',
                  yaxis_title='Category')

# Display the figure
fig.show()

**Observations:**

- The **sugar content** of **Desserts, Beverages,** and **Smoothies & Shakes** is **much higher** than that of the remaining categories.

- The **most common category** of **Hot Beverages** is only behind **Smoothies and Shakes** in terms of **Sugar content** present in them.

- One must be **cautious** with the **sugar intake** of the shown items because it can **affect health adversely**.

<a name = Section825></a>
**<h4>Question:** In what sort of foods and beverages do manufacturers include fiber?</h4>

In [46]:
# Initiate an empty figure
fig = go.Figure()

# Add a trace of box to the figure
fig.add_trace(trace=go.Box(x=data['Dietary_Fibre'],
                           y=data['Category'],
                           orientation='h',
                           marker=dict(color='rgba(148, 166, 212, 1.0)',
                                       line=dict(color='rgba(81, 64, 191, 1.0)',
                                                 width=3))))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Dietary_Fibre per Category',
                  title_x=0.5,
                  xaxis_title='Dietary_Fibre',
                  yaxis_title='Category')

# Display the figure
fig.show()

**Observations:**

- **Salads** are not only **healthy** but also have a **high fiber content** compared to the other items on the **McDonald's menu**.

- But, McDonald's only sells a **few different items** in the **Salads** category.

- The categories **Condiments, Cold Beverages**, and **Desserts** have **fiber content** near **zero** with some **outliers**.

- The **Hot Beverages** category also has a **very low fiber content** present in its items.

In [47]:
category_count = data.groupby(by='Category').count()
item_count = category_count[['Best_Selling_Item']].sort_values(by='Best_Selling_Item', ascending=False)
item_count.transpose()

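| Category | Hot Beverages | Breakfast | Cold Beverages | Smoothies & Shakes | Chicken & Fish | Desserts | Snacks & Sides | Beef & Pork | Sandwiches and Wraps | Salads | New Products | Condiments | Nuggets | Chicken Wings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Best_Selling_Item | 99 | 49 | 45 | 34 | 26 | 24 | 17 | 15 | 10 | 6 | 5 | 5 | 3 | 2 |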


**Observations:**

- About **29%** (99 of 340 entries) of McDonald's menu is made up of **Hot Beverages**, while only about **2%** (6 of 340) is made up of **Salads**.

- This suggests that **healthier, fibre-rich options** make up only a **small share** of the menu.

- Many of the items served **aren't good for** your **health** in the long run.
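
The shares quoted above follow directly from the counts in the category table. A minimal sketch of that arithmetic, with the counts copied from the table (the variable names are illustrative):

```python
import pandas as pd

# Counts copied from the category table above
counts = pd.Series({
    'Hot Beverages': 99, 'Breakfast': 49, 'Cold Beverages': 45,
    'Smoothies & Shakes': 34, 'Chicken & Fish': 26, 'Desserts': 24,
    'Snacks & Sides': 17, 'Beef & Pork': 15, 'Sandwiches and Wraps': 10,
    'Salads': 6, 'New Products': 5, 'Condiments': 5, 'Nuggets': 3,
    'Chicken Wings': 2,
})

# Share of each category as a percentage of all entries
share = (counts / counts.sum() * 100).round(1)
print(share[['Hot Beverages', 'Salads']])
```

On the full dataframe, `data['Category'].value_counts(normalize=True)` should give the same proportions without typing the counts by hand.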

<a name = Section826></a>
**<h4>Question:** How to select nutritious and non-nutritious food from the menu?</h4>

In [48]:
# Initialize variables containing nutritious and non-nutritious factors
Nutritious = ['Protein', 'Dietary_Fibre']
Non_Nutritious = ['Total_Fat', 'Saturated_Fat', 'Trans_Fat', 'Cholestrol']

# Create new features for nutritious and non-nutritious values
data['Nutritious'] = data['Protein'] + data['Dietary_Fibre']
data['Non-Nutritious'] = data['Total_Fat'] \
                       + data['Saturated_Fat'] \
                       + data['Trans_Fat'] \
                       + data['Cholestrol']

# Extracting data for Breakfast category
breakfast = data[data['Category'] == 'Breakfast']

# Extracting top 10 food items based on nutrition
# (select the numeric columns before summing so text columns are not aggregated)
nutritframe = breakfast.groupby(by=['Best_Selling_Item'])[Nutritious + ['Nutritious']] \
                       .sum() \
                       .sort_values(by='Nutritious', ascending=False)[Nutritious] \
                       .head(10)

In [49]:
# Initiate an empty figure
fig = go.Figure()

# Add a trace of heatmap to the figure
fig.add_trace(trace=go.Heatmap(x=Nutritious,
                               y=nutritframe.index,
                               z=nutritframe))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Food Items Containing High Protein and Fibre Content',
                  title_x=0.5,
                  xaxis_title='Nutritious Factors',
                  yaxis_title='Food Items')

# Display the figure
fig.show()

In [50]:
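# Extracting top 10 food items based on non-nutrition
# (select the numeric columns before summing so text columns are not aggregated)
nonnutritframe = breakfast.groupby(by=['Best_Selling_Item'])[Non_Nutritious + ['Non-Nutritious']] \
                          .sum() \
                          .sort_values(by='Non-Nutritious', ascending=False)[Non_Nutritious] \
                          .head(10)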

# Initiate an empty figure
fig = go.Figure()

# Add a trace of heatmap to the figure
fig.add_trace(trace=go.Heatmap(x=Non_Nutritious,
                               y=nonnutritframe.index,
                               z=nonnutritframe))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Food Items Containing Fat and Cholesterol',
                  title_x=0.5,
                  xaxis_title='Non-Nutritious Factors',
                  yaxis_title='Food Items')

# Display the figure
fig.show()

**Observations:**

- Big **Breakfast** with **Hotcakes and Egg Whites (Large Biscuit)** contains a **high** amount of **Protein** and **Dietary_Fibre**.

- On the other hand, it contains a **lower** amount of **Fat** and **Cholesterol** as seen in the **2nd chart**.

- So it can be a preferable food item for people looking for **higher nutrition** and **lower fats and Cholesterol**.

<a name = Section83></a>
### **8.3 Geographical Information-based Analysis**

- In this section, we will perform exploratory data analysis based on the geographical information of McDonald's.

<a name = Section831></a>
**<h4>Question:** Which top 10 cities have the highest number of McDonald's outlets?</h4>

In [51]:
# Extract labels and values of the top 10 cities by outlet count
labels = data['City'].value_counts().index[0:10]
values = data['City'].value_counts().values[0:10]

# Initiate an empty figure
fig = go.Figure()

# Add a trace of bar to the figure
fig.add_trace(trace=go.Bar(x=values,
                           y=labels,
                           orientation='h',
                           marker=dict(color='rgba(238, 227, 231, 1.0)',
                                       line=dict(color='rgba(148, 42, 43, 1.0)',
                                                 width=3))))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Top 10 Cities having Highest Outlets',
                  title_x=0.5,
                  xaxis_title='Frequency',
                  yaxis_title='City')

# Display the figure
fig.show()

**Observation:**

- **Mumbai** is the city with the **highest** number of **outlets** across both the **US and India**.
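
The chart above ranks cities by raw outlet count. A per-capita ranking would also need population figures, which this dataset does not appear to include. The sketch below shows how the ranking could change once counts are normalized; the city names, outlet counts, and populations are all made up for illustration.

```python
import pandas as pd

# Hypothetical outlet counts and populations (in millions); not real figures
cities = pd.DataFrame({
    'City': ['CityA', 'CityB', 'CityC'],
    'Outlets': [30, 12, 5],
    'Population_Millions': [20.0, 3.0, 0.5],
}).set_index('City')

# Outlets per million residents
cities['Outlets_per_Million'] = cities['Outlets'] / cities['Population_Millions']

print(cities.sort_values('Outlets_per_Million', ascending=False))
```

In this toy example the city with the most outlets has the lowest density, which is why a large city leading a raw count says little about outlets per resident.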

<a name = Section832></a>
**<h4>Question:** Which top 10 states have the highest number of McDonald's outlets?</h4>

In [52]:
# Extract labels and values of the top 10 states by outlet count
labels = data['State'].value_counts().index[0:10]
values = data['State'].value_counts().values[0:10]

# Initiate an empty figure
fig = go.Figure()

# Add a trace of bar to the figure
fig.add_trace(trace=go.Bar(x=values,
                           y=labels,
                           orientation='h',
                           marker=dict(color='rgba(245, 227, 33, 1.0)',
                                       line=dict(color='rgba(66, 5, 84, 1.0)',
                                                 width=3))))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Top 10 States having Highest Outlets',
                  title_x=0.5,
                  xaxis_title='Frequency',
                  yaxis_title='State')

# Display the figure
fig.show()

**Observation:**

- **Maharashtra (MH)** is the state with the **highest** number of **outlets** across both the **US and India**.

- This is because the city with the highest number of stores is **Mumbai**, the **capital** of **Maharashtra (MH)**.

<a name = Section833></a>
**<h4>Question:** What are the outlet locations in India & US?</h4>

In [53]:
# Extracting data for respective countries
countryIN = data[data['Country']=='IN']
countryUS = data[data['Country']=='US']

# Initiate an empty figure
fig = go.Figure()

# Add a trace of scattergeo to the figure for India
fig.add_trace(trace=go.Scattergeo(lon=countryIN['Longitude'],
                                  lat=countryIN['Latitude'],
                                  text=countryIN[['Store_Name']],
                                  name='India',
                                  marker=dict(size=countryIN['Revenue'],
                                              color='Purple')))

# Add a trace of scattergeo to the figure for USA
fig.add_trace(trace=go.Scattergeo(lon=countryUS['Longitude'],
                                  lat=countryUS['Latitude'],
                                  text=countryUS[['Store_Name']],
                                  name='USA',
                                  marker=dict(size=countryUS['Revenue'] / 5,
                                              color='Red')))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Outlet Location in India & USA',
                  title_x=0.5,
                  geo=dict(scope='world',
                           resolution=110,
                           showcoastlines=True,
                           countrycolor='white',
                           coastlinecolor='White'))

# Display the figure
fig.show()

**Observations:**

- It can be seen from the above plot that most of the **McDonald's outlets** in **India** are **concentrated** near **big cities** only.

- Unlike **India**, the **outlets** in the **US** are **spread throughout** the **country** and are not limited to the **big cities**.

- The **size of the marker** reflects the **Revenue generated** by the outlet: the **bigger** the **marker**, the **higher** the **revenue**. Note that the two traces use different scales (`Revenue` for India, `Revenue / 5` for the US), so marker sizes are comparable only **within** a country.

- The **tooltip** shows the **geographical coordinates** and **name** of the **outlet**.
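
Because the India and US traces in the plot above divide `Revenue` by different constants, marker sizes cannot be compared across the two countries. One possible remedy, sketched here under the assumption that a single linear scale is acceptable, is to scale every outlet against the same maximum. The helper `scale_marker_sizes` and the revenue values are hypothetical.

```python
import pandas as pd

def scale_marker_sizes(values: pd.Series, max_size: float = 20.0) -> pd.Series:
    """Scale values linearly so the largest value maps to max_size."""
    return values / values.max() * max_size

# Hypothetical revenues for outlets in both countries
revenue = pd.Series([9.9, 4.0, 38.2, 19.1], index=['IN-1', 'IN-2', 'US-1', 'US-2'])

sizes = scale_marker_sizes(revenue)
print(sizes.round(2))
```

Computing the sizes once on the full dataframe and then slicing them per country would keep both traces on one scale when they are passed to `marker=dict(size=...)`.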

<a name = Section84></a>
### **8.4 Outlet Metrics-based & Nutritional Value-based Analysis**

- In this section, we will perform exploratory data analysis based on the outlet metrics & nutritional value of McDonald's.

<a name = Section841></a>
**<h4>Question:** Which outlets have the most nutritious item as their best selling item?</h4>

In [54]:
# Extracting top 10 rows that contains highest nutritious items
top10nut = data.loc[data['Nutritious'].sort_values(ascending=False)[:10].index, :]

# Initiate an empty figure
fig = go.Figure()

# Add a trace of scattergeo to the figure for nutritious items
fig.add_trace(trace=go.Scattergeo(lon=top10nut['Longitude'],
                                  lat=top10nut['Latitude'],
                                  text=top10nut[['Store_Name', 'Nutritious']],
                                  marker=dict(size=top10nut['Nutritious'] / 4,
                                              color='Red')))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Top 10 Outlets Concerning Nutritional Items',
                  title_x=0.5,
                  geo=dict(scope='usa',
                           showcoastlines=True))

# Display the figure
fig.show()

**Observation:**

- The **marker size** gives the **nutritional content** of the **best selling item** of the outlet.

- The **bigger** the **marker**, the **higher** the **nutritional content** in the item.

<a name = Section842></a>
**<h4>Question:** What is the revenue of the outlet based on the category of its best-selling item?</h4>

In [55]:
# Extract labels and values of menu category
labels = data.groupby(by=['Category'])['Revenue'].mean().sort_values().index
values = data.groupby(by=['Category'])['Revenue'].mean().sort_values().values

# Initiate an empty figure
fig = go.Figure()

# Add a trace of bar to the figure
fig.add_trace(trace=go.Bar(x=values,
                           y=labels,
                           orientation='h',
                           marker=dict(color='rgba(69, 214, 179, 1.0)',
                                       line=dict(color='rgba(66, 5, 84, 1.0)',
                                                 width=3))))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Revenue of Outlet per Category',
                  title_x=0.5,
                  xaxis_title='Revenue (in million INR)',
                  yaxis_title='Category')

# Display the figure
fig.show()

**Observations:**

- Outlets with their **Best Selling Item** belonging to the **Salads** category earn the **largest revenue** followed by the **Beef & Pork** category.

- It implies that **salads** appear as **best-selling items** with a **lower frequency** but bring in **higher revenue** compared to the other categories.

- This might be because many people have become **health conscious** and only try to **eat healthy food**.
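
Since a category mean based on only a few outlets can be unstable, it may help to view the average revenue next to the number of outlets behind it. The following sketch uses a toy dataframe with hypothetical values; the output column names `Mean_Revenue` and `Outlets` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical outlets (values are illustrative only)
toy = pd.DataFrame({
    'Category': ['Salads', 'Hot Beverages', 'Hot Beverages', 'Hot Beverages', 'Salads'],
    'Revenue':  [30.0, 10.0, 20.0, 15.0, 40.0],
})

# Mean revenue and number of outlets per category, side by side
summary = (toy.groupby('Category')['Revenue']
              .agg(Mean_Revenue='mean', Outlets='count')
              .sort_values('Mean_Revenue', ascending=False))
print(summary)
```

A category that ranks first on `Mean_Revenue` with a small `Outlets` count is worth treating as a tentative finding rather than a firm one.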

<a name = Section85></a>
### **8.5 Outlet Metrics-based & Geographical Information-based Analysis**

- In this section, we will perform exploratory data analysis based on the outlet metrics & geographic information of McDonald's.

<a name = Section851></a>
**<h4>Question:** Where did McDonald's enter into Joint Venture to start their Outlets?</h4>

In [56]:
# Extracting data of India and USA
IN_data_labels = data[data['Country'] == 'IN']['Ownership_Type'].value_counts().index
IN_data_values = data[data['Country'] == 'IN']['Ownership_Type'].value_counts().values

US_data_labels = data[data['Country'] == 'US']['Ownership_Type'].value_counts().index
US_data_values = data[data['Country'] == 'US']['Ownership_Type'].value_counts().values

# Initiate an empty figure
fig = go.Figure()

# Adding trace of bar to the figure for India
fig.add_trace(trace=go.Bar(x=IN_data_values,
                           y=IN_data_labels,
                           orientation='h',
                           name='India'))

# Adding trace of bar to the figure for USA
fig.add_trace(trace=go.Bar(x=US_data_values,
                           y=US_data_labels,
                           orientation='h',
                           name='USA'))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Ownership Type vs Number of Outlets per Country',
                  title_x=0.5,
                  barmode='group',
                  xaxis_title='Frequency',
                  yaxis_title='Ownership Type')

# Display the figure
fig.show()

**Observation:**

- McDonald's operates through **Joint Ventures** with other companies in major Asian economies like **India**.

- All the outlets in India are under **Joint Venture between McDonald's, Connaught Plaza Restaurants Limited (CPRL), and Hardcastle Restaurants Pvt. Ltd.** while **American outlets are Company Owned or Licensed**.

- In India, **McDonald's is a 50:50 Joint Venture** company managed by two Indians.

- While Amit Jatia, M.D. of Hardcastle Restaurants Pvt. Ltd., owns and spearheads McDonald’s in West & South India, McDonald’s restaurants in North & East India are owned and managed by Vikram Bakshi’s Connaught Plaza Restaurants Private Limited.
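
The ownership split per country can also be tabulated directly, which makes it easy to confirm that a given ownership type is absent from a country. This is a minimal sketch with hypothetical rows; only the column names `Country` and `Ownership_Type` mirror the dataset.

```python
import pandas as pd

# Hypothetical outlets (illustrative rows only)
toy = pd.DataFrame({
    'Country': ['IN', 'IN', 'US', 'US', 'US'],
    'Ownership_Type': ['Joint Venture', 'Joint Venture',
                       'Company Owned', 'Licensed', 'Company Owned'],
})

# Rows: ownership type, columns: country, cells: number of outlets
ownership = pd.crosstab(toy['Ownership_Type'], toy['Country'])
print(ownership)
```

A zero in a cell means no outlet of that ownership type exists in that country within the data.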

<a name = Section852></a>
**<h4>Question:** What are the top 10 outlets in India and the US based on revenue?</h4>

In [57]:
# Extracting top 10 outlets in IN and US
top10IN = data[data['Country'] == 'IN'].sort_values(by=['Revenue'], ascending=False)[:10]
top10US = data[data['Country'] == 'US'].sort_values(by=['Revenue'], ascending=False)[:10]

# Concatenating top 10 outlets in US and IN in one dataframe
top10outlets = pd.concat(objs=[top10IN, top10US], axis=0)

In [58]:
# Initiate an empty figure
fig = go.Figure()

# Add a trace of scattergeo to the figure
fig.add_trace(trace=go.Scattergeo(lon=top10outlets['Longitude'],
                                  lat=top10outlets['Latitude'],
                                  text=top10outlets[['Store_Name', 'Revenue']],
                                  marker=dict(size=top10outlets['Revenue'] / 2,
                                              color='Green')))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Top 10 Outlets in IN and US based on Revenue (in million INR)',
                  title_x=0.5,
                  geo=dict(scope='world',
                           resolution=110,
                           showcoastlines=True,
                           coastlinecolor='White'))

# Display the figure
fig.show()

**Observations:**

- It can be seen through the **size** of the **points** that the **Revenue** of outlets in **India** is **lower** than that in the **US**.

- The **tooltip** shows the **geographical coordinates** and **name** of the **outlet** along with the **Revenue** of the **outlet**.

<a name = Section853></a>
**<h4>Question:** What is the average revenue of outlets in each US state?</h4>

In [59]:
# Extract average revenue per state in US
avgUSrevenue = data[data['Country'] == 'US'].groupby(by=['State'])['Revenue'].mean()

# Extract state labels of US
statesUS = sorted(data[data['Country'] == 'US']['State'].unique())

In [60]:
# Initiate an empty figure
fig = go.Figure()

# Add a choropleth directly to the figure
fig.add_choropleth(colorscale='Viridis',
                   autocolorscale=False,
                   locations=statesUS,
                   locationmode='USA-states',
                   z=avgUSrevenue,
                   text='Revenue',
                   colorbar=go.choropleth.ColorBar(title='Revenue (in million INR)'),
                   marker=go.choropleth.Marker(line=go.choropleth.marker.Line(color='rgb(255, 255, 255)',
                                                                              width=2)))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Average Revenue of Outlets per State in US',
                  title_x=0.5,
                  geo=dict(scope='usa',
                           resolution=110,
                           showlakes=True,
                           lakecolor='rgb(255, 255, 255)',
                           projection=go.layout.geo.Projection(type='albers usa')))

# Display the figure
fig.show()

**Observation:**

- The **state** with the **highest average revenue** per outlet in the **US** is **Oklahoma (OK)** with **38.603 million INR** in **Revenue**.
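
Rather than reading the top state off the colour scale, it can be looked up from the grouped averages. The sketch below uses hypothetical revenues; it also takes the location labels from the grouped index, which keeps labels and values aligned by construction.

```python
import pandas as pd

# Hypothetical US outlets (revenue values are illustrative)
toy = pd.DataFrame({
    'State': ['OK', 'OK', 'TX', 'TX', 'ID'],
    'Revenue': [40.0, 36.0, 30.0, 20.0, 35.0],
})

# Average revenue per state
avg = toy.groupby('State')['Revenue'].mean()

# State with the highest average and its value
top_state = avg.idxmax()
print(top_state, avg.max())

# The group index can supply the choropleth locations directly
locations = avg.index.tolist()
```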

<a name = Section854></a>
**<h4>Question:** How much is the Gross Profit Margin of each outlet?</h4>

In [61]:
# Initiate an empty figure
fig = go.Figure()

# Add a trace of scattergeo to the figure
fig.add_trace(trace=go.Scattergeo(lon=data['Longitude'],
                                  lat=data['Latitude'],
                                  text=data[['Store_Name', 'Gross_Profit_Margin']],
                                  marker=dict(size=data['Revenue'] / 2,
                                              color='Purple')))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Gross Profit Margin per Outlet (in million INR)',
                  title_x=0.5,
                  geo=dict(scope='world',
                           resolution=110,
                           showcoastlines=True,
                           coastlinecolor='White'))

# Display the figure
fig.show()

**Observations:**

- The **size** of the **points** in this plot is driven by `Revenue / 2`, not by **Gross Profit Margin**, so the smaller points in **India** reflect **lower Revenue** than in the **US**.

- The **tooltip** shows the **geographical coordinates** and **name** of the **outlet** along with the **Gross Profit Margin** of the **outlet**.
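
Marker sizes in the plot above are computed from `Revenue`, so comparing Gross Profit Margin between countries needs a separate summary. A minimal sketch with hypothetical margin values follows; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical outlets (margin values are illustrative only)
toy = pd.DataFrame({
    'Country': ['IN', 'IN', 'US', 'US'],
    'Gross_Profit_Margin': [-5.0, -3.0, -1.5, -0.5],
})

# Mean and median margin per country
margin = toy.groupby('Country')['Gross_Profit_Margin'].agg(['mean', 'median'])
print(margin)
```

Running the same aggregation on the full dataframe would show whether the margin really differs between India and the US.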

<a name = Section86></a>
### **8.6 Nutritional Value-based & Geographical Information-based Analysis**

- In this section, we will perform exploratory data analysis based on the nutritional value & geographic information of McDonald's.

<a name = Section861></a>
**<h4>Question:** What is different in terms of the nutritional content of each category between India and the US?</h4>

In [62]:
# Initialize a list of nutritional factors of food type
nutrifactors = ['Energy', 'Protein', 'Sugars', 'Total_Fat', 'Saturated_Fat',
                'Cholestrol', 'Carbohydrates', 'Dietary_Fibre', 'Sodium']

# Create two dataframes of IN and US based nutritional factors
dataIN = data[data['Country'] == 'IN'].groupby(['Category'])[nutrifactors].mean()
dataUS = data[data['Country'] == 'US'].groupby(['Category'])[nutrifactors].mean()

In [63]:
# Initiating a plotly figure
fig = go.Figure()

# Adding first graph of categories concerning energy
fig.add_bar(x=dataIN['Energy'], y=dataIN.index, orientation='h', name='IN')
fig.add_bar(x=dataUS['Energy'], y=dataUS.index, orientation='h', name='US')

# Adding one button per nutritional factor to switch the plotted feature
button = [dict(method='update',
               args=[{'x': [dataIN[factor], dataUS[factor]],
                      'y': [dataIN.index, dataUS.index],
                      'visible': [True, True]}],
               label=factor) for factor in nutrifactors]

# Updating the layout of the graph
fig.update_layout(title_text='Categorical Nutritional Difference in India & USA',
                  title_x=0.5,
                  width=1000,
                  height=500,
                  xaxis_title='Nutritional Value',
                  yaxis_title='Category',
                  updatemenus=[dict(active=0,
                                    buttons=button,
                                    x=1.2,
                                    y=1,
                                    xanchor='left',
                                    yanchor='top')])

# Adding an annotation above the dropdown (the dropdown selects a nutritional factor)
fig.add_annotation(x=1.2,
                   y=1.08,
                   xref='paper',
                   yref='paper',
                   showarrow=False,
                   xanchor='left',
                   yanchor='top',
                   text='Nutrition')

# Display the graph
fig.show()

**Observations:**

- From the above charts, it is evident that, category by category, the **US** menu shows **higher average nutritional values** than the **Indian** menu.

- The use of **genetically modified (GM) food crops** might be one reason.

- Some **genetically modified foods** are **designed** to **improve nutrition, quality and taste**.

- For example, potatoes are modified to even out the distribution of starches, enhance texture and reduce fat absorption.

- The **United States leads other countries** in **growing genetically modified foods**.

- In 2006, the United States accounted for 53% of the genetically modified crops grown worldwide, according to the Human Genome Project.

- Soybeans, corn and canola are the most common genetically modified crops.
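
To see how large the gap is for each category, the per-country means can be subtracted directly instead of being read off the bars. This is a minimal sketch under stated assumptions: the `sample` frame, its values, and the two-factor `factors` list are hypothetical, and only the `Country`, `Category`, `Energy`, and `Protein` column names are taken from the cells above.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the values are illustrative only
sample = pd.DataFrame({
    'Country': ['IN', 'IN', 'US', 'US'],
    'Category': ['Breakfast', 'Beverages', 'Breakfast', 'Beverages'],
    'Energy': [300.0, 100.0, 450.0, 150.0],
    'Protein': [10.0, 2.0, 15.0, 2.0],
})
factors = ['Energy', 'Protein']

# Mean of each factor per (Country, Category)
means = sample.groupby(['Country', 'Category'])[factors].mean()

# Select each country's block (indexed by Category) and subtract;
# positive values mean the US average is higher for that category
us_minus_in = means.xs('US') - means.xs('IN')

print(us_minus_in)
```

Categories present in only one country would show up as `NaN` in the result, which is a useful check before comparing the bars.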

<a name = Section862></a>
**<h4>Question:** How do Indian menu items compare with US menu items in terms of nutrition?</h4>

In [64]:
# Initiating a plotly figure
fig = go.Figure()

# Adding first graph of Energy vs Country
fig.add_box(x=data['Energy'], y=data['Country'], orientation='h')

# Adding one button per nutritional factor; the figure has a single box trace,
# so 'visible' carries exactly one entry
button = [dict(method='update',
               args=[{'x': [data[k]],
                      'y': [data['Country']],
                      'visible': [True]}],
               label=k) for k in nutrifactors]

# Updating the layout of the graph
fig.update_layout(title_text='Nutritional Factors vs Country',
                  title_x=0.5,
                  width=1000,
                  height=500,
                  xaxis_title='Nutritional Value',
                  yaxis_title='Country',
                  updatemenus=[dict(active=0,
                                    buttons=button,
                                    x=1.15,
                                    y=1,
                                    xanchor='left',
                                    yanchor='top')])

# Adding an annotation alongside the dropdown
fig.add_annotation(x=1.03,
                   y=0.97,
                   xref='paper',
                   yref='paper',
                   showarrow=False,
                   xanchor='left',
                   yanchor = 'top',
                   text='Nutrition')

# Display the graph
fig.show()

**Observations:**

- From the above charts, we can infer the **difference** in **nutritional content** between the **US** and **Indian McDonald's menus**.

- **Scientists** have **identified** two possible **reasons** for **declining nutrient levels in food**.

- First, **intensive agricultural practices** have **stripped the soil of micronutrients**.

- This could well be the case in **India**, where **soils** have been **found deficient in nutrients**.

- Second, **rising levels of carbon dioxide (CO2)** in the **atmosphere** could also be **affecting plant nutrition levels**.

- **High CO2 levels** in the **atmosphere lower** the **nitrogen concentration in plants**, which in turn **reduces** the **protein content** of food.
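
The box plots can be complemented with the quartiles they are drawn from, which makes the country comparison easier to state precisely. The following is a minimal sketch, assuming the `Country` and `Energy` columns used above; the `sample` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the values are illustrative only
sample = pd.DataFrame({
    'Country': ['IN', 'IN', 'IN', 'US', 'US', 'US'],
    'Energy': [100.0, 200.0, 300.0, 200.0, 400.0, 600.0],
})

# Quartiles of Energy per country (the same statistics a box plot summarises)
quartiles = sample.groupby('Country')['Energy'].agg(
    q1=lambda s: s.quantile(0.25),
    median='median',
    q3=lambda s: s.quantile(0.75),
)

# Interquartile range: the width of each box
quartiles['iqr'] = quartiles['q3'] - quartiles['q1']

print(quartiles)
```

The same aggregation can be repeated for any entry of `nutrifactors` by swapping the column name.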

<a name = Section87></a>
### **8.7 Outlet Metric-based, Menu Items & Geographical Information-based Analysis**

- In this section, we will perform exploratory data analysis based on the outlet metrics, menu items and geographic information of McDonald's.

<a name = Section871></a>
**<h4>Question:** How does revenue differ across menu categories in India and the US?</h4>

In [65]:
# Compute mean revenue per (Country, Category) once,
# then flatten the two-level index into readable string labels
revenue = data.groupby(['Country', 'Category'])['Revenue'].mean()
labels = [f'{country} - {category}' for country, category in revenue.index]
values = revenue.values

# Initiate an empty figure
fig = go.Figure()

# Add a trace of pie to the figure
fig.add_trace(trace=go.Pie(labels=labels,
                           values=values,
                           hole=.8))

# Update the layout with some cosmetics
fig.update_layout(height=500,
                  width=1000,
                  title_text='Proportion of Revenue in India vs USA',
                  title_x=0.5)

# Display the figure
fig.show()

**Observations:**

- **McDonald's** generates **more revenue** in the **US** than in **India**.

- One reason could be that **McDonald's** has to enter into **joint ventures** in **India**.

- Also, the **menu** in **India** is less **diverse**: only a **limited number of items** are **sold in India** compared with the **menu** in the **US**.

- The **number of outlets** in **India** is also **lower** than in the **US**.
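
The pie chart plots mean revenue per category, so country-level totals and outlet counts are worth checking separately. This is a minimal sketch, assuming the `Country`, `Store_Name`, and `Revenue` columns used above and assuming each row carries its own revenue figure (if revenue is repeated across menu rows for the same outlet, the rows would need deduplicating first); the `sample` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the values are illustrative only
sample = pd.DataFrame({
    'Country': ['IN', 'IN', 'US', 'US', 'US'],
    'Store_Name': ['A', 'A', 'B', 'C', 'C'],
    'Revenue': [2.0, 3.0, 5.0, 4.0, 6.0],
})

# Total and mean revenue per country, plus the number of distinct outlets
by_country = sample.groupby('Country').agg(
    total_revenue=('Revenue', 'sum'),
    mean_revenue=('Revenue', 'mean'),
    outlets=('Store_Name', 'nunique'),
)

# Each country's share of the overall revenue
by_country['revenue_share'] = (by_country['total_revenue']
                               / by_country['total_revenue'].sum())

print(by_country)
```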

<a name = Section9></a>

---
# **9. Summarization**
---

- **<h4>Conclusion</h4>**

  - Based on the charts and the value ranges obtained, the items in the menu dataset can be categorized as **nutritious** and **non-nutritious** food.

  - These ranges for nutrients such as **Protein, Sugar, Dietary Fibre, Fats, Carbohydrates, Cholesterol, and Sodium** can guide appropriate consumption of menu items.

  - The US food industry has emerged as a high-growth, high-profit sector because of its large potential for value addition, especially within the food processing industry.

  - However, India is still taking its initial steps, and this could be the reason McDonald's India has not been profitable after many years of operation.


-  **<h4>Actionable Insights</h4>**

  - To **increase** **outlet metrics** such as **Revenue** and **Profits** in **Indian outlets**, **McDonald's** needs to **open new outlets** under the **Company Owned** and **Licensed** ownership types **instead of Joint Ventures**.

  - The **nutritional content** of the food items can be **improved** if **good agricultural practices** are adopted in **India**, such as the use of **Genetically Modified Crops (GM Crops)** and **High-Yield-Variety (HYV) seeds**.

  - **McDonald's India** needs to **introduce more food items** with **higher nutritional content**, like those on the US menu, which should eventually help **increase** its **revenue and profits**.