# Mission Dotlas 🌎 [50 points]

> `v2.0` Updated: April 18 2023 (Spring + Summer 2023 Version)

![Dotlas](https://camo.githubusercontent.com/6a3a3a9e55ce6b5c4305badbdc68c0d5f11b360b11e3fa7b93c822d637166090/68747470733a2f2f646f746c61732d776562736974652e73332e65752d776573742d312e616d617a6f6e6177732e636f6d2f696d616765732f6769746875622f62616e6e65722e706e67)

## Section 1: Project Overview ✉️

Welcome to your mission! In this notebook, you will download a dataset containing restaurants' information in the state of California, US. The dataset will then be transformed, processed and prepared in a required format. This clean dataset will then be used to answer some analytical questions and create a few data visualizations in Python.

This is a template notebook that has some code already filled in to help you on your way. Other cells require you to write the Python code that solves a specific problem. Sections that carry points show the tally in their heading.

**Each section of this notebook is largely independent, so if you get stuck on a problem you can always move on to the next one.**

### 1.1 Tools & Technologies 🪛

- This exercise will be carried out using the [Python](https://www.python.org/) programming language and will rely heavily on the [Pandas](https://pandas.pydata.org/) library for data manipulation.
- You may use any of [Matplotlib](https://matplotlib.org/), [Seaborn](https://seaborn.pydata.org/) or [Plotly](https://plotly.com/python/) packages for data visualization.
- We will be using [Jupyter notebooks](https://jupyter.org/) to run Python code in order to view and interact better with our data and visualizations.
- You are free to use [Google Colab](https://colab.research.google.com/) which provides an easy-to-use Jupyter interface.
- When not in Colab, it is recommended to run this Jupyter Notebook within an [Anaconda](https://continuum.io/) environment.
- You can use any other Python packages that you deem fit for this project.

> ⚠ **Ensure that your Python version is 3.9 or higher**

![](https://upload.wikimedia.org/wikipedia/commons/1/1b/Blue_Python_3.9_Shield_Badge.svg)

**Language**

![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)

**Environments & Packages**

![Anaconda](https://img.shields.io/badge/Anaconda-%2344A833.svg?style=for-the-badge&logo=anaconda&logoColor=white)
![Jupyter Notebook](https://img.shields.io/badge/jupyter-%23FA0F00.svg?style=for-the-badge&logo=jupyter&logoColor=white)
![Pandas](https://img.shields.io/badge/pandas-%23150458.svg?style=for-the-badge&logo=pandas&logoColor=white)
![Matplotlib](https://img.shields.io/badge/Matplotlib-%23ffffff.svg?style=for-the-badge&logo=Matplotlib&logoColor=black)
![Plotly](https://img.shields.io/badge/Plotly-%233F4F75.svg?style=for-the-badge&logo=plotly&logoColor=white)

**Data Store**

![AWS](https://img.shields.io/badge/AWS-%23FF9900.svg?style=for-the-badge&logo=amazon-aws&logoColor=white)

---

## Section 2: Data Overview 🔍 [8 points]

### 2.1.1 Read California Dataset 🚰 [1 point]

In this section, we will load the dataset from [AWS](https://googlethatforyou.com?q=amazon%20web%20services) S3, conduct an exploratory data analysis and then clean up the dataset.


- Ensure that pandas and plotly are installed (possibly via pip or poetry)
- The dataset is about 300 MB in size and time-to-download depends on internet speed and availability
- Download the dataset using Python into this notebook and load it into a pandas dataframe (without writing to file)

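One way to respect the "without writing to file" constraint is to read the response into an in-memory buffer and hand that buffer to pandas. The sketch below is only an illustration, not the reference solution: `fetch_json_frame` is a hypothetical helper name, and it assumes the file is a regular JSON document that `pd.read_json` can parse with its default settings (a newline-delimited file would need `lines=True` instead).

```python
import io
import urllib.request

import pandas as pd


def fetch_json_frame(url: str) -> pd.DataFrame:
    """Download a JSON document and parse it in memory (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        # Keep the payload in RAM instead of saving it to disk
        buffer = io.BytesIO(response.read())
    # Assumption: default read_json settings match the file layout
    return pd.read_json(buffer)
```

With the notebook's constant this would be called as `fetch_json_frame(DATA_URL)`. `pd.read_json` also accepts a URL directly, which is shorter but gives less control over the download step.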

In [1]:
from matplotlib import pyplot as plt
%matplotlib inline

import pandas as pd
import plotly.express as px
import numpy as np

CELL_HEIGHT: int = 50

# Initialize helpers to ignore pandas warnings and resize columns and cells
pd.set_option("chained_assignment", None)
pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", 500)
pd.set_option('display.max_colwidth', CELL_HEIGHT)

DATA_URL: str = "https://dotlas-marketing.s3.amazonaws.com/interviews/california_restaurants.json"

In [2]:
%%time
# ✏️ YOUR CODE HERE
# df: pd.DataFrame = ?

CPU times: user 12.6 s, sys: 2.05 s, total: 14.7 s
Wall time: 21.3 s


Create a restaurant ID column to uniquely index each restaurant


In [3]:
df["restaurant_id"] = range(1, len(df) + 1)
df.head(2)

Unnamed: 0,country,subregion,city,brand_name,categories,latitude,longitude,area,address,menu,description,public_transit,cross_street,restaurant_website,phone_number,primary_cuisine,dining_style,executive_chef_name,parking_info,dress_code,entertainment,operating_hours,price_range_id,price_range,payment_options,maximum_days_advance_for_reservation,rating,rating_count,rating_by_feature,rating_distribution,review_count,review_topics,awards,experiences,tags,editorial_lists,checklist,safety_precautions,order_online_link,facebook,menu_url,popular_dishes,daily_reservation_count,restaurant_id
0,United States,California,Los Angeles,Luv2Eat Thai Bistro,[Thai],34.09751,-118.335921,Hollywood,"6660 W Sunset Blvd, Ste P, CA, Los Angeles, 90...","[{'name': 'Main Menu', 'sections': [{'name': '...","Luv2Eat Thai Bistro is located in Los Angeles,...",,,http://www.luv2eatthai.com/,(323) 498-5835,Thai,Casual Dining,,Street Parking,Casual Dress,,Lunch\nDaily 11:00 am–3:30 pm\nDinner\nDaily 4...,2,$30 and under,"[AMEX, Discover, MasterCard, Visa]",90,4.6,136,"{'food': 4.7, 'noise': 2.0, 'value': 4.6, 'ser...","[2, 3, 4, 23, 104]",18,"[Spicy, Casual, Neighborhood Gem]",[],[],"[Delivery, Gluten-free Options, Late Night, No...",[],"{'bar': False, 'counter': False, 'gifting': No...","{'cleanMenus': None, 'limitedSeating': None, '...",,http://www.facebook.com/luv2eatthaibistro/,http://sappclub.com/restaurant.aspx?r=205,[],,1
1,United States,California,Sherman Oaks,Jerry's Famous Deli,[American],34.154596,-118.4487,Sherman Oaks,,"[{'name': 'Sample Menu', 'sections': [{'name':...",<p>Feast on delicious grub at Jerry's Famous D...,,,http://www.jerrysfamousdeli.com/,(818) 905-5774,American,Casual Dining,,,Business Casual,,,2,$30 and under,[],90,0.0,0,"{'food': 0.0, 'noise': 0.0, 'value': 0.0, 'ser...","[0, 0, 0, 0, 0]",0,[],[],[],[],[],"{'bar': None, 'counter': None, 'gifting': None...","{'cleanMenus': None, 'limitedSeating': None, '...",,,,[],,2


### 2.1.2 Basic Exploration 🔎 [7 points]

<img src="https://media.giphy.com/media/42wQXwITfQbDGKqUP7/giphy.gif" height="200px" width="200px" alt="pokemon">

Take a look at the data and answer the following questions in markdown. You can include relevant code cells below to show how you explored the data, but do not include rough work / drafts / scratch work. 

1. **Describe the dataset**. 
    - You don't need to drop columns or rows for this question. 
    - You also don't need to provide a data dictionary. 
    - Explain in a paragraph or two what the dataset is about and what each record represents, and note down some features that look interesting to you.
    
<br>

2. **Craft 4 data-driven questions** rooted in your dataset description.
    - A data-driven question is a tangible business or research question that can be addressed using data (considered as facts), as opposed to anecdotally or by guesswork. 
    - You can optionally use the framework of [Tripartite questions for decision making](https://github.com/dotlas/codeventures/blob/main/CONTRIBUTING.md) to devise question classes. 
    - Your questions may be directly influenced by this dataset, or stem from a hypothetical union of this dataset with another publicly available dataset. For instance, "By overlaying US census geographies onto the restaurant locations, what are the demographic compositions surrounding each eatery?"
    
<br>
    
3. **Create a Business Scenario and Formulate 2 Use-Cases** based on one or more questions you formulated above.
    - You may assume any audience or situation as you please, with the understanding that you'll adopt the role of a data analyst or data scientist. Use the examples provided below as inspiration for crafting your own unique scenario.
        - As a data scientist at a food-delivery company that recently gained access to this restaurant data, the Lead Data Scientist seeks 2 use-cases for incorporating the data in ways that benefit the business side of the organization.
        - As a quantitative analyst at a hedge fund aiming to reconcile various F&B companies' financial reports with their on-ground performance using restaurant data, your goal is to create a custom valuation of their stock prices.
        - As a market research analyst, you're working to compile a list of potential leads to cold-call or email within the restaurant industry.
    
<br>


4. Were you successful in locating your favorite restaurant within the dataset? If not, did you discover a similar establishment nearby?

<br>

> 📝 Your writing will be assessed based on clarity of communication, originality of questions, scenarios, and use-cases. The devised questions should be well-founded and not overly ambitious. Both questions and use-cases ought to demonstrate your problem-solving abilities, including the skill to identify issues initially and envision how they could be developed further down the line in subsequent sections.

##### Answers: ✏️
1. 

2. 

3. 

4. 

In [None]:
# ✏️ YOUR CODE HERE

---

## Section 3: Exploratory Data Analysis & Preprocessing 🕵🏼‍♀️ [15 points]

### 3.1 EDA 📊 [8 points]

In this exercise, you will be conducting your own open-ended exploratory data analysis (EDA) and preprocessing of a dataset to gain insights and prepare the data for further analysis. The EDA will involve understanding the structure of the data, checking for missing values, outliers, and correlations, and identifying trends or patterns. Preprocessing will involve cleaning the data, transforming it into a suitable format, and handling missing values and outliers. These steps are crucial to ensure the quality and reliability of the data before applying statistical learning models. By conducting a thorough EDA and preprocessing, we can better understand the data and extract meaningful insights that will inform our decision-making.

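As a starting point for the checks listed above, missing values and outliers can be quantified with a few lines of pandas. This is a minimal sketch under stated assumptions: `missing_share` and `iqr_outlier_mask` are hypothetical helper names, and the 1.5 × IQR rule is just one common outlier convention, not something this notebook prescribes.

```python
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Fraction of null values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    spread = q3 - q1
    return (series < q1 - k * spread) | (series > q3 + k * spread)
```

Note that `isna` does not treat empty lists or empty strings as missing, so list-valued columns such as `awards` may need a separate check.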
We know how much fun it is to create all sorts of funky visualizations and crunch numbers all day long. But let's not forget why we're doing this - we want to tell a story about our data! So, while it's great to have fun with your data, let's make sure we're doing it in a systematic and purposeful way. Each visualization and exploration should show progress in understanding the data better and contribute to telling the story of our data. So let's put on our exploration hats and approach every chart and graph with a clear question in mind. By doing this, we'll uncover exciting insights and tell an engaging story about our data that even your grandma will want to hear.

> 📝 Your work will be assessed based on the quality of your visualizations, the funnels employed to transform data-driven questions into insights, and your interpretation of the generated results.

### 3.2 Preprocessing & Data Engineering ⚙️ [5 points]

#### 3.2.1 Basic Transformation Exercise 🚚 [2 Points]

This section has 2 sub-questions which can help with preprocessing down the line.

<img src="https://media.giphy.com/media/2f41Z7bhKGvbG/giphy.gif" height="250px" width="250px" alt="harry potter">

##### 3.2.1.1 Safety Precautions 🦺

Transform the entire safety precautions column into a new column based on the following rule:

Convert from `dictionary` to `list`. Include in the list only those keys of the dictionary whose value is `True`.
For example, for safety precautions of the type:

```python
{
    'cleanMenus': True,
    'limitedSeating': False,
    'sealedUtensils': None,
    'prohibitSickStaff': True,
    'requireDinerMasks': True,
    'staffIsVaccinated': None,
    'proofOfVaccinationRequired': False,
    'sanitizerProvidedForCustomers': None
}
```

It should turn into a list of the form:

```python
["Clean Menus", "Prohibit Sick Staff", "Require Diner Masks"]
```
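The rule above can be sketched in plain Python. This is one possible approach rather than the expected answer: `humanize_key` and `active_precautions` are hypothetical names, and treating non-dictionary values (such as nulls) as an empty list is an assumption.

```python
import re
from typing import Any


def humanize_key(key: str) -> str:
    """Turn a camelCase key such as 'cleanMenus' into 'Clean Menus'."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", key)
    return spaced[:1].upper() + spaced[1:]


def active_precautions(precautions: Any) -> list[str]:
    """Keep only the keys whose value is exactly True."""
    if not isinstance(precautions, dict):
        return []  # assumption: nulls and other types become an empty list
    return [humanize_key(key) for key, value in precautions.items() if value is True]
```

Applied to the column, this could look like `df["safety_precautions"].apply(active_precautions)`.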


In [59]:
# ✏️ YOUR CODE HERE

##### 3.2.1.2 Clean up HTML text 🥜

Find the text / string columns that contain HTML markup and remove that markup, keeping only the readable text.

ex:

```html
<p>
  Feast on delicious grub at Jerry's Famous Deli.<br />
  Its retro-style casual setting features comfortable booth seating.
</p>
```

to:

```
Feast on delicious grub at Jerry's Famous Deli. Its retro-style casual setting features comfortable booth seating.
```
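A cleanup like the one above can be done with the standard library alone. The sketch below is illustrative: `strip_html` is a hypothetical helper, and the set of tags treated as word separators is an assumption you may want to extend. Libraries such as BeautifulSoup would work equally well.

```python
from html.parser import HTMLParser

# Assumption: these tags separate words; inline tags like <b> do not
BLOCK_TAGS = {"p", "br", "div", "li", "ul", "ol", "tr", "td", "h1", "h2", "h3"}


class _TextExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(value):
    """Return the visible text of an HTML fragment; pass non-strings through."""
    if not isinstance(value, str):
        return value
    extractor = _TextExtractor()
    extractor.feed(value)
    extractor.close()
    # Collapse the newlines and repeated spaces left behind by the markup
    return " ".join("".join(extractor.parts).split())
```

To find candidate columns, one option is to test each string column for a tag-like pattern first (for example with `str.contains(r"<[^>]+>")`) and apply `strip_html` only where it matches.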


In [None]:
# ✏️ YOUR CODE HERE

#### 3.2.2 Drop and Clean 🖌 [3 points]

During the exploratory data analysis phase, it is crucial to assess the relevance and utility of each feature within the dataset. Dropping certain features (columns) may become necessary if they are irrelevant or uninformative for the problem at hand. This process often involves evaluating the level of missing data, the presence of duplicate information, or the overall impact of each feature on the analysis. Similarly, removing rows from the dataset may be essential because of outliers, incomplete or erroneous data, or other inconsistencies that could impede the analysis. The overarching objective of this section is to refine the dataset, ultimately arriving at a final set of features that you deem important for your specific research question or business goal. Once this selection is made, you can confidently commit to these features, ensuring a more robust and reliable foundation for subsequent data analysis and modeling tasks.

In [None]:
# ✏️ YOUR CODE HERE

### 3.3 Imputing Exercise 📈 [2 points]

Fill in missing values for `rating`, `rating_count` and `review_count` by imputing based on the following columns, in order:

1. `brand_name`
2. `area`
3. `city`

This means that if `rating` is missing for a restaurant (null / 0), but other restaurants of the same brand have ratings, then the median rating of that brand is used. If the brand offers no rating to fall back on, the missing value is filled with the median rating of the area where the restaurant is located, and finally with the median rating of the city.
Each step works on the values produced by the previous one (in the example below, the area median of 2.75 includes the brand-imputed 2.5).

Here's an example:

| restaurant_id | brand_name | area | city | rating | imputed_rating_brand | imputed_rating_area | imputed_rating_city |
| --- | --- | --- | --- | --- | --- | --- | --- |
|1	|X1|	A1|	B1|	3|	3| 3 | 3 |
|2	|X1|	A1|	B1|	2|	2| 2 | 2 |
|3	|X1|	A1|	B1|	| 2.5| 2.5 | 2.5 |
|4	|X2|	A1|	B1|	4|	4 | 4 | 4 |
|5	|X3|	A1|	B1|	|	 | 2.75 | 2.75 |
|6	|X4|	A4|	B2|	|	 | | 3 |
|7	|X5|	A6|	B2|	2|	 2| 2| 2|
|8	|X6|	A7|	B2|	4|	 4| 4| 4 |


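The cascading fill in the example table can be reproduced with grouped medians. The sketch below is one possible reading of the rules, not the official solution: `impute_by_levels` is a hypothetical name, zeros are treated as missing as described above, and each level reuses the values filled by the previous level.

```python
import pandas as pd

# Output suffix -> grouping column, in the order the fallbacks are applied
LEVELS = {"brand": "brand_name", "area": "area", "city": "city"}


def impute_by_levels(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Add imputed_<column>_<level> columns filled with grouped medians."""
    result = frame.copy()
    # Treat both nulls and zeros as missing
    current = result[column].where(result[column] > 0)
    for suffix, group_column in LEVELS.items():
        medians = current.groupby(result[group_column]).transform("median")
        current = current.fillna(medians)
        result[f"imputed_{column}_{suffix}"] = current
    return result
```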
In [None]:
# ✏️ YOUR CODE HERE

---

Remember to hydrate and 

[![Spotify](https://img.shields.io/badge/Spotify-1ED760?style=for-the-badge&logo=spotify&logoColor=white)](https://open.spotify.com/playlist/3d4bU6GAelt3YL2L1X2SOn)

---

## Section 4: Non-Trivial Transformations Exercise 🤺 [17 points]

<img src="https://media.giphy.com/media/hbd8nlok7kqnS/giphy.gif" height="250px" width="250px" alt="simpsons">

### 4.1 Transform Operating Hours 🕰️ [6 points]

Create an operating hours [bitmap](https://en.wikipedia.org/wiki/Bit_array) column from the operating hours text column for all restaurants. The bitmap is a matrix of size 24 x 7 (hours x days) where a 1 or 0 in each cell indicates whether the restaurant is operating on a specific day at a specific hour.

Example: For operating hours text of the form:

```text
Lunch
Daily 11:00 am–3:30 pm
Dinner
Daily 4:30 pm–11:30 pm
```

Create a bitmap of the following form:

```json
{
    "Monday" : [0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,1,1,1,1,1,1,1],
    "Tuesday" : [0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,1,1,1,1,1,1,1],
    .
    .
    .
    "Sunday" : [0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,1,1,1,1,1,1,1],

}
```
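The example above is consistent with marking an hour as 1 when the restaurant is open at the top of that hour (open at 15:00, closed at 16:00, open again at 17:00). The sketch below applies that inferred rule to `Daily` lines only. It is a starting point under stated assumptions: `daily_hours_bitmap` is a hypothetical name, other day patterns in the data are not handled, and a range that runs past midnight is cut off at the end of the day.

```python
import re

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
# Matches ranges such as "11:00 am–3:30 pm" (en dash or hyphen)
TIME_RANGE = re.compile(
    r"(\d{1,2}):(\d{2})\s*([ap]m)\s*[\u2013\-]\s*(\d{1,2}):(\d{2})\s*([ap]m)",
    re.IGNORECASE,
)


def to_minutes(hour: str, minute: str, meridiem: str) -> int:
    """Convert a 12-hour clock reading to minutes after midnight."""
    offset = 12 if meridiem.lower() == "pm" else 0
    return (int(hour) % 12 + offset) * 60 + int(minute)


def daily_hours_bitmap(text: str) -> dict[str, list[int]]:
    hours = [0] * 24
    for line in text.splitlines():
        if not line.strip().lower().startswith("daily"):
            continue  # assumption: only "Daily ..." lines are parsed here
        for match in TIME_RANGE.finditer(line):
            h1, m1, p1, h2, m2, p2 = match.groups()
            start, end = to_minutes(h1, m1, p1), to_minutes(h2, m2, p2)
            if end <= start:
                end = 24 * 60  # simplification: drop the after-midnight part
            for hour in range(24):
                # Inferred rule: open at the top of the hour -> 1
                if start <= hour * 60 < end:
                    hours[hour] = 1
    return {day: list(hours) for day in DAYS}
```

Lines that name specific days or day ranges would need their own parsing branch that writes to only the matching rows of the bitmap.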


In [None]:
# ✏️ YOUR CODE HERE

### 4.2 On my radar 🗺️ [4 points]

For the following restaurant:

- brand_name `Calzone's Pizza Cucina`
- coordinates `37.799068, -122.408226`.

Answer these questions:

- How many restaurants exist within a 100 meter radius of this restaurant?
- What is the most frequent cuisine (`category`) occurrence in this 100m radius across the restaurants that exist in that range?

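Distances between coordinates are commonly computed with the haversine formula. The sketch below shows the idea on plain dictionaries. It is not the expected answer: the helper names are hypothetical, the Earth is treated as a sphere with a mean radius of 6,371 km, and the anchor restaurant itself is counted if it appears in the input, so decide whether to exclude it.

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6_371_000  # assumption: mean Earth radius, spherical model


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(records: list[dict], lat: float, lon: float, radius_m: float = 100.0) -> list[dict]:
    """Records whose latitude/longitude fall inside the radius."""
    return [
        record for record in records
        if haversine_m(lat, lon, record["latitude"], record["longitude"]) <= radius_m
    ]


def most_common_category(records: list[dict]):
    """Most frequent entry across the records' `categories` lists, if any."""
    counts = Counter(c for record in records for c in (record.get("categories") or []))
    return counts.most_common(1)[0][0] if counts else None
```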
In [None]:
# ✏️ YOUR CODE HERE

### 4.3 Menu-Level Table 🧾 [7 points]

<img src="https://media.giphy.com/media/qpLuA97QGOsnK/giphy.gif" height="250px" width="250px" alt="ratatouille">

**Create a menu-level table by parsing out menu items from the `menu` column per restaurant.**

Every restaurant has a `menu` column that contains deeply nested JSON data on the restaurant's menu. The hierarchy is as follows: 

* One restaurant can have multiple menus (morning menu, evening menu, etc.)
    * Each menu can have a description and provider
* Each restaurant menu can have multiple sections (such as Appetizers, Desserts, etc.)
    * Each section has a description
* Each section can have multiple menu items (such as Latte, Apple Pie, Carrot Halwa, etc.)
    * Each menu item has a price, currency and description

You need to parse out the menu data from the JSON in the `menu` column for each restaurant and have a restaurants x menu table as shown below. 

| restaurant_id | menu_name | menu_description | menu_provider | section_name | section_description | item_name          | item_description                                                                                                      | item_price | item_price_currency |
| ------------: | :-------- | :--------------- | ------------: | :----------- | :------------------ | :----------------- | :-------------------------------------------------------------------------------------------------------------------- | ---------: | :------------------ |
|             1 | Main Menu |                  |           nan | Appetizers   |                     | Egg Rolls          | Deep fried mixed veggie egg rolls served with sweet & sour sauce                                                      |          8 | USD                 |
|             1 | Main Menu |                  |           nan | Appetizers   |                     | Fried Tofu         | (Contains Peanut) Deep fried tofu, served with sweet & sour sauce and crushed peanut                                  |          8 | USD                 |
|             1 | Main Menu |                  |           nan | Appetizers   |                     | Fried Meat Balls   | Deep fried fish, pork, beef balls or mixed served with sweet & sour sauce. Meat: Beef $1, Fish, Mixed Meat ball, Pork |        8.5 | USD                 |
|             1 | Main Menu |                  |           nan | Appetizers   |                     | Pork Jerky         | Deep fried marinated pork served with special jaew sauce                                                              |        8.5 | USD                 |
|             1 | Main Menu |                  |           nan | Appetizers   |                     | Thai Isaan Sausage | (Contains Peanut) Thai Style sausage served with fresh vegetables and peanuts                                         |          9 | USD                 |

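Flattening the hierarchy described above amounts to three nested loops that emit one row per menu item. The sketch below is an assumption-heavy illustration: only the `name` and `sections` keys are visible in the data preview, so `description`, `provider`, `items`, `price` and `currency` are hypothetical key names that must be checked against the real JSON, and `iter_menu_rows` is a hypothetical helper.

```python
from typing import Any, Iterator


def iter_menu_rows(restaurant_id: int, menus: Any) -> Iterator[dict]:
    """Yield one flat row per menu item for a single restaurant."""
    if not isinstance(menus, list):
        return  # nulls or unexpected types produce no rows
    for menu in menus:
        for section in menu.get("sections") or []:
            # Assumption: items live under an "items" key
            for item in section.get("items") or []:
                yield {
                    "restaurant_id": restaurant_id,
                    "menu_name": menu.get("name"),
                    "menu_description": menu.get("description"),
                    "menu_provider": menu.get("provider"),
                    "section_name": section.get("name"),
                    "section_description": section.get("description"),
                    "item_name": item.get("name"),
                    "item_description": item.get("description"),
                    "item_price": item.get("price"),
                    "item_price_currency": item.get("currency"),
                }
```

The rows from all restaurants can then be collected into a list and passed to `pd.DataFrame` to build the menu-level table.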

In [None]:
# ✏️ YOUR CODE HERE

---

## Section 5: Analytical Questions ⚗️ [10 points]

**Answer ONLY ONE of the questions below using the data, i.e., choose between `5.1` and `5.2`.**

<img src="https://media.giphy.com/media/3o7TKVSE5isogWqnwk/giphy.gif" height="250px" width="250px" alt="sherlock holmes">

> Note that the analytical questions may sometimes require converting categorical columns that hold strings or lists into numeric columns. For example, "Casual Dining", "Fine Dining", etc. would require you to generate a categorical encoding of 1, 2, etc. For columns that contain lists, like `categories` with its cuisine tags, a one-hot or multi-hot encoding may be required depending on the situation. A numeric encoding is needed for these string- or list-based columns since pandas cannot (usually) compute correlations or clusters directly from text-based categories.

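The two encodings mentioned in the note can be sketched with pandas as follows. This is a minimal illustration on made-up rows, with hypothetical helper names. The integer codes follow order of first appearance, so they carry no ranking unless you define one yourself.

```python
import pandas as pd


def ordinal_codes(series: pd.Series) -> pd.Series:
    """Map each distinct label to 1, 2, ... in order of first appearance."""
    codes, _ = pd.factorize(series)
    return pd.Series(codes + 1, index=series.index)


def multi_hot(series_of_lists: pd.Series) -> pd.DataFrame:
    """One 0/1 column per tag; rows with an empty list become all zeros."""
    exploded = series_of_lists.explode()
    dummies = pd.get_dummies(exploded, dtype=int)
    # Collapse the exploded rows back to one row per original index
    return dummies.groupby(level=0).max()


# Made-up rows for illustration only
sample = pd.DataFrame({
    "dining_style": ["Casual Dining", "Fine Dining", "Casual Dining"],
    "categories": [["Thai"], ["American", "Steak"], []],
})
sample["dining_style_code"] = ordinal_codes(sample["dining_style"])
cuisine_flags = multi_hot(sample["categories"])
```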

### 5.1 Take me out for dinner 🕯️

In your view, which areas have the best restaurants in California, and why? You can define "best" by whatever criteria you wish, as long as it involves measuring more than a single column. For example, you cannot merely claim that the restaurant with the highest rating is the best restaurant.


In [1]:
# ✏️ YOUR CODE HERE

### 5.2 Michelin Approves 🎖️

Which columns seem to play / not play a major role in whether or not a restaurant has an award? Justify your choices.


In [33]:
# Simple dataframe to look at the distribution of awards across California by most awarded titles.
# Restaurants with an empty awards list explode to NaN, so drop those before normalizing.
awards_df: pd.DataFrame = pd.json_normalize(
    df["awards"].dropna().explode().dropna().tolist()
).rename(columns={"name": "award_name", "location": "award_location"})
# Naming the Series before to_frame() keeps the "award_count" label on pandas 1.x and 2.x
awards_df["award_name"].value_counts().head(10).rename("award_count").to_frame().transpose()

Unnamed: 0,Most Booked,Best Ambiance,Best Food,Best Overall,Best Service,Best Value,Special Occasion,Romantic,Fit for Foodies,Vibrant Bar Scene
award_count,414,404,402,402,401,400,398,393,391,389


In [None]:
# ✏️ YOUR CODE HERE

---

Good job!

<img src="https://media.giphy.com/media/qLhxN7Rp3PI8E/giphy.gif" height="250px" width="250px" alt="legend of zelda">