# Mission Dotlas 🌎 [40 points]

> Data Science Assignment

> `v4.1` Updated: Apr 15 2024 (Summer 2024 Version)

<img src="https://camo.githubusercontent.com/6a3a3a9e55ce6b5c4305badbdc68c0d5f11b360b11e3fa7b93c822d637166090/68747470733a2f2f646f746c61732d776562736974652e73332e65752d776573742d312e616d617a6f6e6177732e636f6d2f696d616765732f6769746875622f62616e6e65722e706e67" width="750px" alt="dotlas">

## Project Overview ✉️

Welcome to your mission! In this notebook, you will download a dataset containing restaurants' information in the state of California, US. 
The dataset will then be transformed, processed and prepared in a required format. 
This clean dataset will then be used to answer some analytical questions and create a few data visualizations in Python.

This is a template notebook with some code already filled in to help you on your way. Other cells require you to write the Python code that solves a specific problem. Some sections of the notebook carry a points tally for the code you write.

**Each section of this notebook is largely independent, so if you get stuck on a problem you can always move on to the next one.**

### Tools & Technologies 🪛

- This exercise will be carried out using the [Python](https://www.python.org/) programming language and will rely heavily on the [Pandas](https://pandas.pydata.org/) library for data manipulation.
- You are also free to use Polars, Dask or Spark if you do not want to use Pandas.
- You may use any of [Matplotlib](https://matplotlib.org/), [Seaborn](https://seaborn.pydata.org/) or [Plotly](https://plotly.com/python/) packages for data visualization.
- We will be using [Jupyter notebooks](https://jupyter.org/) to run Python code in order to view and interact better with our data and visualizations.
- You are free to use [Google Colab](https://colab.research.google.com/) which provides an easy-to-use Jupyter interface.
- When not in Colab, it is recommended to run this Jupyter Notebook within an [Anaconda](https://continuum.io/) environment.
- You can use any other Python packages that you deem fit for this project.
- You are also allowed to freely use search engines like Google, or tools like ChatGPT to assist. 

> ⚠ **Ensure that your Python version is 3.9 or higher**

![](https://upload.wikimedia.org/wikipedia/commons/1/1b/Blue_Python_3.9_Shield_Badge.svg)

**Language**

![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)

**Environments & Packages**

![Anaconda](https://img.shields.io/badge/Anaconda-%2344A833.svg?style=for-the-badge&logo=anaconda&logoColor=white)
![Jupyter Notebook](https://img.shields.io/badge/jupyter-%23FA0F00.svg?style=for-the-badge&logo=jupyter&logoColor=white)

![Pandas](https://img.shields.io/badge/pandas-%23150458.svg?style=for-the-badge&logo=pandas&logoColor=white)
![Matplotlib](https://img.shields.io/badge/Matplotlib-%23ffffff.svg?style=for-the-badge&logo=Matplotlib&logoColor=black)
![Plotly](https://img.shields.io/badge/Plotly-%233F4F75.svg?style=for-the-badge&logo=plotly&logoColor=white)

**Data Store**

![AWS](https://img.shields.io/badge/AWS-%23FF9900.svg?style=for-the-badge&logo=amazon-aws&logoColor=white)

## Section 1: Read & Transform Dataset 🚰 [10]
---

In this section, we will load the dataset from [AWS](https://googlethatforyou.com?q=amazon%20web%20services) S3, conduct an exploratory data analysis and then clean up the dataset.


- Ensure that pandas, numpy and matplotlib (all imported in the cell below) are installed, along with plotly if you plan to use it, for example via pip or poetry
- The dataset is about 34.5 MB in size and time-to-download depends on internet speed and availability

In [None]:
import warnings

warnings.filterwarnings("ignore")

from matplotlib import pyplot as plt

%matplotlib inline

import pandas as pd
import numpy as np

CELL_HEIGHT: int = 50

# Initialize helpers to ignore pandas warnings and resize columns and cells
pd.set_option("chained_assignment", None)
pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", 500)
pd.set_option("display.max_info_columns", 1_000)
pd.set_option("display.max_info_rows", 1_000_000)
pd.set_option("display.max_colwidth", CELL_HEIGHT)

DATA_URL: str = (
    "https://dotlas-marketing.s3.amazonaws.com/interviews/california_restaurants_2024.json"
)

### 1.1 Review the Data 📝 [3]

The code cell below reads the data from the `DATA_URL`, and performs some preliminary preprocessing steps. Read through the code.

Briefly describe what each of the steps `Preproc 1 ... 3` is seeking to accomplish, and why it is relevant. (*Hint*: Try viewing the dataset with each of these steps removed in turn)
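One way to follow the hint is to count the different kinds of "empty" cells before and after each step. The helper below is a minimal sketch on toy data: `count_placeholders` and the toy column values are hypothetical and not part of the dataset or the template.

```python
import numpy as np
import pandas as pd


def count_placeholders(frame: pd.DataFrame) -> pd.DataFrame:
    """Count, per column, three ways a cell can be 'empty'."""
    summary = {
        # Cells holding a list with no elements
        "empty_list": frame.apply(
            lambda col: col.map(lambda cell: isinstance(cell, list) and len(cell) == 0).sum()
        ),
        # Cells holding a zero-length string
        "empty_string": frame.apply(
            lambda col: col.map(lambda cell: isinstance(cell, str) and cell == "").sum()
        ),
        # Cells that pandas already treats as missing (None / NaN)
        "null": frame.isna().sum(),
    }
    return pd.DataFrame(summary)


# Toy data with made-up values, standing in for the real dataframe
toy = pd.DataFrame(
    {
        "name": ["A", "", "C"],
        "tags": [["vegan"], [], []],
        "rating": [4.5, np.nan, 3.0],
    }
)
placeholder_counts = count_placeholders(toy)
print(placeholder_counts)
```

Running the same helper on `df` right after `pd.read_json`, and again after each `Preproc` step, should show which kind of placeholder each step targets.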


In [None]:
df: pd.DataFrame = pd.read_json(DATA_URL)

# Preproc 1
for col in df.columns:
    df[col] = df[col].apply(
        lambda cell: None if isinstance(cell, list) and len(cell) == 0 else cell
    )

# Preproc 2
rating_cols: list[str] = [
    "rating",
    "atmosphere_rating",
    "noise_rating",
    "food_rating",
    "service_rating",
    "value_rating",
]

for col in rating_cols:
    df.loc[(df.rating_count == 0) | (df.rating_count.isna()), col] = None

# Preproc 3
df = df.replace([np.nan], [None])
df = df.replace([""], [None])

df.head()


### 1.2 Transformations 🔍 [7]

<img src="https://media.giphy.com/media/l0NgQIwNvU9AUuaY0/giphy.gif"  width="150px" alt="assembly">

Based on a look at the resulting dataset, do you believe there's a need for further preprocessing to prepare the data for analysis? If so, add the preprocessing steps in the subsequent cell.

> 📝 Your evaluation for this section will be based on your ability to identify where additional preprocessing is most necessary, and the steps taken to achieve it. We're not looking for a battery of data cleaning transformations, and instead want you to think surgically about what could be causing the most problems downstream.

In [None]:
# Preproc 4...n
# YOUR CODE HERE

## Section 2: Data-Driven Questions 💬 [15]
---

<img src="https://media.giphy.com/media/fv8KclrYGp5dK/giphy.gif"  width="250px" alt="sherlock">


This section poses several broad questions that require a data-centric approach to answer. You'll need to manipulate and analyze the provided dataset, potentially supplementing it with other public data sources. The focus is on your methodology and reasoning in deducing answers. Be sure to clean and prepare the data as needed. Communicate your findings through original visualizations that you create yourself, such as charts or maps. The primary dataset provided must be the main basis of your analysis.

> 📝 Your evaluation will be based on the investigative techniques employed, the originality of your approach, the selection of variables and metrics, the criteria chosen, the type and quality of the visualization selected, and any additional supporting evidence, if provided.

### 2.1. Cuisine Saturation 🍱 [3]

**Which cuisines are over-saturated in `San Diego`?** Define a metric called `saturation` and use it to calculate the result. When you produce the result, ask yourself if it makes sense, and justify it. (*Hint*: Think about whether cuisines labelled `fast-food` and `burger` can be treated as equivalent)
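As one possible starting point, `saturation` could be defined as a location quotient: a cuisine's share of restaurants in the city divided by its share across the whole state, so values above 1 suggest over-representation. The sketch below runs on toy data and assumes hypothetical columns `city` and `cuisine` with one cuisine label per restaurant; the real dataset may be structured differently, and other definitions (for example restaurants per capita) are equally defensible.

```python
import pandas as pd


def cuisine_saturation(frame: pd.DataFrame, city: str, min_count: int = 2) -> pd.Series:
    """Location quotient: cuisine share in `city` divided by its statewide share."""
    state_share = frame["cuisine"].value_counts(normalize=True)
    city_counts = frame.loc[frame["city"] == city, "cuisine"].value_counts()
    city_share = city_counts / city_counts.sum()
    # Drop cuisines with too few local restaurants to support a conclusion
    city_share = city_share[city_counts >= min_count]
    ratio = city_share / state_share.reindex(city_share.index)
    return ratio.sort_values(ascending=False).rename("saturation")


# Toy data with made-up values
toy = pd.DataFrame(
    {
        "city": ["San Diego"] * 6 + ["Los Angeles"] * 6,
        "cuisine": ["mexican"] * 4 + ["italian"] * 2 + ["mexican"] * 2 + ["italian"] * 4,
    }
)
saturation = cuisine_saturation(toy, "San Diego")
print(saturation)
```

Merging near-equivalent labels before counting (the `fast-food` versus `burger` question in the hint) changes both the numerator and the denominator, so it is worth deciding on that mapping before computing the metric.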

In [None]:
# YOUR CODE HERE

### 2.2. Twinning 🗺 [5]

**Find a pair of `area` values (neighbourhoods) in California that have a similar `cuisine` distribution**. Be sure to consider only neighbourhoods that have a reasonable sample of restaurants to begin with!
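One way to compare cuisine distributions is cosine similarity between each area's vector of cuisine shares, after filtering out areas with too few restaurants. The sketch below uses toy data and assumes one `cuisine` label per row; the function name and the `min_restaurants` threshold are hypothetical choices, and other distance measures (for example Jensen-Shannon divergence) may suit the data better.

```python
import numpy as np
import pandas as pd


def most_similar_areas(frame: pd.DataFrame, min_restaurants: int = 3) -> tuple[str, str, float]:
    """Return the pair of areas whose cuisine shares have the highest cosine similarity."""
    counts = pd.crosstab(frame["area"], frame["cuisine"])
    # Keep only areas with a reasonable sample of restaurants
    counts = counts[counts.sum(axis=1) >= min_restaurants]
    shares = counts.div(counts.sum(axis=1), axis=0).to_numpy()
    unit = shares / np.linalg.norm(shares, axis=1, keepdims=True)
    similarity = unit @ unit.T
    # Exclude each area's similarity with itself
    np.fill_diagonal(similarity, -np.inf)
    i, j = np.unravel_index(np.argmax(similarity), similarity.shape)
    return counts.index[i], counts.index[j], float(similarity[i, j])


# Toy data with made-up values; area "D" is too small and gets filtered out
toy = pd.DataFrame(
    {
        "area": ["A"] * 4 + ["B"] * 6 + ["C"] * 4 + ["D"],
        "cuisine": ["italian"] * 2 + ["mexican"] * 2
        + ["italian"] * 3 + ["mexican"] * 3
        + ["sushi"] * 4
        + ["italian"],
    }
)
area_a, area_b, score = most_similar_areas(toy)
print(area_a, area_b, round(score, 3))
```

The function expects at least two areas to survive the filter; a real analysis would also need to handle ties and restaurants carrying several cuisine labels.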

In [None]:
# YOUR CODE HERE

### 2.3. Competitor Analysis 🍝 [3]

**Which restaurants in California can be considered competitors of the `Mona Lisa Italian Restaurant` in San Francisco?** Use whatever metrics or criteria you wish to define what qualifies as a competitor. Justify your use of these metrics, and be sure to check if the values in the data accurately represent the interpretation you're seeking.
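As an illustration of turning criteria into a filter, the sketch below treats a competitor as a restaurant with the same cuisine, a similar `meal_cost`, and a location within a few kilometres. The columns `name`, `latitude` and `longitude`, the thresholds, and all toy values are assumptions made for this example; check which fields the dataset actually provides and whether their values mean what the filter assumes.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def find_competitors(frame, target_name, radius_km=3.0, cost_tolerance=0.25):
    """Same cuisine, meal cost within +/- tolerance, and within radius_km."""
    target = frame.loc[frame["name"] == target_name].iloc[0]
    distance = haversine_km(
        target["latitude"], target["longitude"], frame["latitude"], frame["longitude"]
    )
    same_cuisine = frame["cuisine"] == target["cuisine"]
    cost_gap = (frame["meal_cost"] - target["meal_cost"]).abs()
    similar_cost = cost_gap <= cost_tolerance * target["meal_cost"]
    mask = same_cuisine & similar_cost & (distance <= radius_km) & (frame["name"] != target_name)
    return frame.loc[mask].assign(distance_km=distance[mask]).sort_values("distance_km")


# Toy data with made-up restaurants and coordinates
toy = pd.DataFrame(
    {
        "name": ["Target Trattoria", "Nearby Pasta", "Far Pasta", "Nearby Sushi", "Pricey Pasta"],
        "cuisine": ["italian", "italian", "italian", "japanese", "italian"],
        "meal_cost": [40.0, 45.0, 40.0, 40.0, 120.0],
        "latitude": [37.7990, 37.8000, 34.0500, 37.7995, 37.7992],
        "longitude": [-122.4080, -122.4100, -118.2400, -122.4085, -122.4082],
    }
)
competitors = find_competitors(toy, "Target Trattoria")
print(competitors[["name", "distance_km"]])
```

Each threshold here is a judgment call that needs justifying; ratings, review volume or shared tags could be added to the mask in the same way.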

In [None]:
# YOUR CODE HERE

### 2.4 Freestyle! 🛼 [4]

Come up with your own data-driven question to ask of the data. It has to be interesting enough to warrant a composite analysis across multiple fields, and not merely looking up or summing up a column. 

Once you've formulated a question, fill in the code to answer it, as in the previous exercises. This puts you in the shoes of the users of your analysis. Think about whether your question and result are meaningful for the business.


*Hint*: Think of yourself as both the restaurant owner asking the question, and the analyst answering the question

In [None]:
# YOUR CODE HERE

Remember to hydrate and  [![Spotify](https://img.shields.io/badge/Spotify-1ED760?style=for-the-badge&logo=spotify&logoColor=white)](https://open.spotify.com/playlist/3d4bU6GAelt3YL2L1X2SOn)


## Section 3: Meal Cost Prediction 💰 [15]

---

<img src="https://media.giphy.com/media/TGcD6N8uzJ9FXuDV3a/giphy.gif" width="250px" alt="sherlock">

Create a supervised machine learning model that can predict the `meal_cost` of a restaurant. You can use `meal_cost`, a field in the dataset that represents the price in `USD` for a meal at the restaurant for 1-2 persons as a training label. You can use any of the other fields available in the dataset as trainable features to predict the target. You do not need to use neural networks or more complex models. Do not create a model zoo. Answer the following questions as you build out your model.

1. What metrics will you use to judge if the model is performing well?
2. What baseline model are you using, and why?
3. What features have you selected to predict the target, and why?
4. Have you looked at the specific rows where your model is performing badly? 
5. Can your trained model be used to predict meal prices for restaurants not present in this dataset? What kind of constraints would you set for users of this model, if you were to ship it?

> 📝 You will be evaluated on the following criteria: feature selection and reasoning, feature engineering, model selection, hyperparameter tuning and general methodology. We're not interested in your final model accuracy scores, but in how systematically you arrived at a *reliable* result, and whether that result is meaningful.
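For questions 1 and 2, it can help to fix a metric and a trivial baseline before training anything. The sketch below uses mean absolute error and compares two naive predictors on a held-out split: the global median `meal_cost` and the median per group. It runs on toy data with pandas and numpy only; the function names, the choice of `cuisine` as the grouping column, and the split fraction are assumptions for illustration. scikit-learn's `DummyRegressor` offers a comparable ready-made baseline if you prefer it.

```python
import numpy as np
import pandas as pd


def mean_absolute_error(y_true: pd.Series, y_pred: pd.Series) -> float:
    """Average absolute gap between actual and predicted values, in USD."""
    return float(np.mean(np.abs(y_true.to_numpy() - y_pred.to_numpy())))


def evaluate_baselines(frame, group_col="cuisine", target="meal_cost",
                       test_fraction=0.25, seed=0):
    """Score a global-median and a per-group-median predictor on a held-out split."""
    shuffled = frame.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled.iloc[:n_test], shuffled.iloc[n_test:]

    # Baselines are fitted on the training rows only
    global_median = train[target].median()
    group_median = train.groupby(group_col)[target].median()

    global_pred = pd.Series(global_median, index=test.index)
    # Groups unseen in training fall back to the global median
    group_pred = test[group_col].map(group_median).fillna(global_median)

    return {
        "global_median_mae": mean_absolute_error(test[target], global_pred),
        "group_median_mae": mean_absolute_error(test[target], group_pred),
    }


# Toy data with made-up values where the group fully determines the cost
toy = pd.DataFrame(
    {
        "cuisine": ["fast-food"] * 8 + ["fine-dining"] * 8,
        "meal_cost": [10.0] * 8 + [100.0] * 8,
    }
)
results = evaluate_baselines(toy)
print(results)
```

Any trained model then has to beat both numbers on the same split to justify its extra complexity, and the rows with the largest absolute errors are a natural place to start for question 4.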

In [None]:
# YOUR CODE HERE

## Final Checklist 🔘

Here's a final checklist to review your notebook, code style and general format. Adhering to these makes your notebook more showcasable, easier to grade and most importantly, your future-self will thank you.

- No individual code cell is longer than 50 lines of code.
- No individual cell output requires excessive scrolling to read.
- The entire notebook can be viewed in under 10-12 scrolls.
- The outputs shown by this notebook are relevant. Hide or remove ad-hoc cells (e.g. where you wanted to quickly preview a dataset or a column).
- Repeatable: The gold standard is for your notebook to produce the same results every time you run all cells. This ensures that no manual steps are plaguing its functionality.
- Display your dataframe (or at least `.info()` or `.shape`) every couple of cells so that a reader looking back can follow the results without having to trace the logic.


Optional: Format your code cells using a tool like `black` or `nbqa`.


Good job!

<img src="https://media.giphy.com/media/qLhxN7Rp3PI8E/giphy.gif" width="250px" alt="legend of zelda">