# Mission Dotlas 🌎 [40 points]

> Data Science Assignment

> `v4.0` Updated: December 27 2023 (Spring 2024 Version)

<img src="https://camo.githubusercontent.com/6a3a3a9e55ce6b5c4305badbdc68c0d5f11b360b11e3fa7b93c822d637166090/68747470733a2f2f646f746c61732d776562736974652e73332e65752d776573742d312e616d617a6f6e6177732e636f6d2f696d616765732f6769746875622f62616e6e65722e706e67" width="750px" alt="dotlas">

## Section 1: Project Overview ✉️

Welcome to your mission! In this notebook, you will download a dataset containing information about restaurants in the state of California, US.
The dataset will then be transformed, processed and prepared in a required format. 
This clean dataset will then be used to answer some analytical questions and create a few data visualizations in Python.

This is a template notebook with some code already filled in to help you on your way. Other cells require you to write the Python code that solves a specific problem. Sections that are graded show their points tally in the heading.

**Each section of this notebook is largely independent, so if you get stuck on a problem you can always move on to the next one.**

### 1.1. Tools & Technologies 🪛

- This exercise will be carried out using the [Python](https://www.python.org/) programming language and will rely heavily on the [Pandas](https://pandas.pydata.org/) library for data manipulation.
- You are also free to use Polars, Dask or Spark if you do not want to use Pandas.
- You may use any of [Matplotlib](https://matplotlib.org/), [Seaborn](https://seaborn.pydata.org/) or [Plotly](https://plotly.com/python/) packages for data visualization.
- We will be using [Jupyter notebooks](https://jupyter.org/) to run Python code in order to view and interact better with our data and visualizations.
- You are free to use [Google Colab](https://colab.research.google.com/) which provides an easy-to-use Jupyter interface.
- When not in Colab, it is recommended to run this Jupyter Notebook within an [Anaconda](https://continuum.io/) environment.
- You can use any other Python packages that you deem fit for this project.
- You are also allowed to freely use search engines like Google, or tools like ChatGPT to assist. 

> ⚠ **Ensure that your Python version is 3.9 or higher**

![](https://upload.wikimedia.org/wikipedia/commons/1/1b/Blue_Python_3.9_Shield_Badge.svg)
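
A quick way to confirm the version requirement before running the rest of the notebook is sketched below. The helper name `python_is_supported` is hypothetical and not part of the notebook; it only compares the interpreter's major and minor version against 3.9.

```python
import sys

# Minimum version stated in the note above.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the given version meets the minimum requirement."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if python_is_supported():
    print("Python version is supported.")
else:
    print("Please upgrade to Python 3.9 or higher.")
```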

**Language**

![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)

**Environments & Packages**

![Anaconda](https://img.shields.io/badge/Anaconda-%2344A833.svg?style=for-the-badge&logo=anaconda&logoColor=white)
![Jupyter Notebook](https://img.shields.io/badge/jupyter-%23FA0F00.svg?style=for-the-badge&logo=jupyter&logoColor=white)

![Pandas](https://img.shields.io/badge/pandas-%23150458.svg?style=for-the-badge&logo=pandas&logoColor=white)
![Matplotlib](https://img.shields.io/badge/Matplotlib-%23ffffff.svg?style=for-the-badge&logo=Matplotlib&logoColor=black)
![Plotly](https://img.shields.io/badge/Plotly-%233F4F75.svg?style=for-the-badge&logo=plotly&logoColor=white)

**Data Store**

![AWS](https://img.shields.io/badge/AWS-%23FF9900.svg?style=for-the-badge&logo=amazon-aws&logoColor=white)

## Section 2: Read California Dataset 🚰
---

In this section, we will load the dataset from [AWS](https://googlethatforyou.com?q=amazon%20web%20services) S3, conduct an exploratory data analysis, and then clean up the dataset.


- Ensure that pandas and plotly are installed (possibly via pip or poetry)
- The dataset is about 34.5 MB in size and time-to-download depends on internet speed and availability

In [0]:
import warnings

warnings.filterwarnings("ignore")

from matplotlib import pyplot as plt

%matplotlib inline

import pandas as pd
import numpy as np

CELL_HEIGHT: int = 50

# Initialize helpers to ignore pandas warnings and resize columns and cells
pd.set_option("chained_assignment", None)
pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", 500)
pd.set_option("display.max_info_columns", 1_000)
pd.set_option("display.max_info_rows", 1_000_000)
pd.set_option("display.max_colwidth", CELL_HEIGHT)

DATA_URL: str = (
    "https://dotlas-marketing.s3.amazonaws.com/interviews/california_restaurants_2024.json"
)

#### 2.1. Read & Preprocess Data 🔍 [7]

<img src="https://media.giphy.com/media/l0NgQIwNvU9AUuaY0/giphy.gif"  width="150px" alt="assembly">

The code cell below reads the data from the `DATA_URL`, and performs some preliminary preprocessing steps. We initially drop a few columns, and you are required to not use these for the remainder of the notebook. Read through the code and answer the following questions.

1. Briefly describe what each of the steps `Preproc 1 ... 4` is seeking to accomplish, and why it is relevant. (*Hint*: Try viewing the dataset with each step removed in turn)
2. Based on a look at the resulting dataset, do you believe there's a need for further preprocessing to prepare the data for analysis? If so, add the preprocessing steps in the subsequent cell.

> 📝 Your evaluation for this section will be based on your ability to identify where additional preprocessing is most necessary, and the steps taken to achieve it. We're not looking for a battery of data cleaning transformations, and instead want you to think surgically about what could be causing the most problems downstream.

In [0]:
df: pd.DataFrame = pd.read_json(DATA_URL)

# Preproc 1
for col in df.columns:
    df[col] = df[col].apply(
        lambda cell: None if isinstance(cell, list) and len(cell) == 0 else cell
    )

# Preproc 2
rating_cols: list[str] = [
    "rating",
    "atmosphere_rating",
    "noise_rating",
    "food_rating",
    "service_rating",
    "value_rating",
]
for col in rating_cols:
    df.loc[(df.rating_count == 0) | (df.rating_count.isna()), col] = None

# Preproc 3
df = df.replace([np.nan], [None])
df = df.replace([""], [None])

# Preproc 4
df["restaurant_id"] = range(1, len(df) + 1)

df.head()

In [0]:
# Preproc 5...n
# YOUR CODE HERE
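
One way to decide where further preprocessing matters most is to profile each column before changing anything. The sketch below is only an illustration on a made-up toy frame: `summarize_columns` is a hypothetical helper, and the toy values are assumptions that do not describe the real dataset. It reports the share of missing values and the Python types found in each column, which tends to surface mixed-type or mostly empty fields.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Report the share of missing values and the value types per column."""
    rows = []
    for col in frame.columns:
        series = frame[col]
        non_null = series.dropna()
        rows.append(
            {
                "column": col,
                "null_share": float(series.isna().mean()),
                "python_types": sorted({type(v).__name__ for v in non_null}),
            }
        )
    summary = pd.DataFrame(rows).set_index("column")
    # Columns with the most missing values come first.
    return summary.sort_values("null_share", ascending=False)


# Toy data (hypothetical values, not taken from the real dataset).
toy = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "meal_cost": [20, "35", None, 50],
        "tags": [["x"], None, None, None],
    }
)
summary = summarize_columns(toy)
print(summary)
```

On the real data, the same call would be `summarize_columns(df)`; columns with several types or a high `null_share` are natural candidates for targeted cleaning.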

## Section 3: Data-Driven Questions 💬 [13]
---

<img src="https://media.giphy.com/media/fv8KclrYGp5dK/giphy.gif"  width="250px" alt="sherlock">


This section poses several broad questions that require a data-centric approach. You will need to manipulate, filter, and aggregate the provided data to derive the results using code. The key objective is to assess your capability to convert a question into a series of methods or transformations that guide you to **systematically deduce the answer**, and to evaluate the reasoning process you employ to reach the conclusion as well as the **criteria you select**. Consider cleaning up additional fields in the data before you use them.

The analysis is not limited to the dataset included in this notebook, and you are encouraged to incorporate **supplementary references** from other publicly available datasets, studies, or statistics. However, the primary dataset supplied with this notebook must remain your main information source.

You must also include a **visualization** of your findings to articulate your response. You are free to select any form of visualization, be it charts, maps, animations, or others, but it must be an original creation and not borrowed from external sources.

> 📝 Your evaluation will be based on the investigative techniques employed, the originality of your approach, the selection of variables and metrics, the criteria chosen, the type and quality of the visualization selected, and any additional supporting evidence, if provided.

#### 3.1. Cuisine Saturation 🍱 [3]

**Which cuisines are over-saturating in `San Diego`?** Define a metric called `saturation` and use it to calculate the result. When you produce the result, ask yourself if it makes sense, and justify your answer.

In [0]:
# YOUR CODE HERE
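
As one possible starting point, `saturation` could be defined as a location quotient: a cuisine's share of restaurants in the city divided by its share statewide, so that values above 1 mean the cuisine is over-represented locally. This is only one of many reasonable definitions, not the expected answer. The column names `city` and `cuisine` and the toy rows below are assumptions for illustration.

```python
import pandas as pd


def saturation_by_cuisine(
    frame: pd.DataFrame,
    city: str,
    city_col: str = "city",  # assumed column name
    cuisine_col: str = "cuisine",  # assumed column name
) -> pd.DataFrame:
    """Compare each cuisine's local share with its statewide share."""
    statewide_share = frame[cuisine_col].value_counts(normalize=True)
    local = frame.loc[frame[city_col] == city, cuisine_col]
    local_share = local.value_counts(normalize=True)
    result = pd.DataFrame(
        {"local_share": local_share, "statewide_share": statewide_share}
    ).dropna(subset=["local_share"])  # keep cuisines present in the city
    result["saturation"] = result["local_share"] / result["statewide_share"]
    return result.sort_values("saturation", ascending=False)


# Toy data (hypothetical rows, one cuisine per restaurant).
toy = pd.DataFrame(
    {
        "city": ["San Diego"] * 4 + ["Los Angeles"] * 4,
        "cuisine": [
            "Mexican", "Mexican", "Mexican", "Italian",
            "Mexican", "Italian", "Italian", "Thai",
        ],
    }
)
saturation = saturation_by_cuisine(toy, "San Diego")
print(saturation)
```

A ratio like this ignores demand, so it may be worth weighing it against population or rating data before calling a cuisine over-saturated.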

#### 3.2. Twinning 🗺 [5]

**Find a pair of `area` (neighbourhoods) in California that have a similar `cuisine` distribution**. Be sure to consider neighbourhoods that have a reasonable sample of restaurants to begin with! (*Hint*: Think about whether you can equate neighbourhoods that have `fast-food` and `burger` cuisines as equivalent)

In [0]:
# YOUR CODE HERE
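
A minimal sketch of one way to compare cuisine distributions follows: build a per-area vector of cuisine shares and pick the pair with the highest cosine similarity. The function name, the `min_restaurants` threshold, and the toy rows are assumptions, and the sketch assumes one cuisine label per restaurant. Grouping related labels (such as `fast-food` and `burger`) would need an extra mapping step that is left out here.

```python
import numpy as np
import pandas as pd


def most_similar_areas(
    frame: pd.DataFrame,
    min_restaurants: int = 3,  # hypothetical sample-size threshold
    area_col: str = "area",
    cuisine_col: str = "cuisine",
):
    """Return the two areas whose cuisine shares are most alike."""
    counts = pd.crosstab(frame[area_col], frame[cuisine_col])
    counts = counts[counts.sum(axis=1) >= min_restaurants]
    if len(counts) < 2:
        raise ValueError("Need at least two areas with enough restaurants.")
    shares = counts.div(counts.sum(axis=1), axis=0)
    matrix = shares.to_numpy(dtype=float)
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    similarity = unit @ unit.T
    np.fill_diagonal(similarity, -np.inf)  # ignore self-comparisons
    i, j = np.unravel_index(np.argmax(similarity), similarity.shape)
    return shares.index[i], shares.index[j], float(similarity[i, j])


# Toy data (hypothetical areas and cuisines).
toy = pd.DataFrame(
    {
        "area": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"],
        "cuisine": [
            "Italian", "Italian", "Mexican",
            "Italian", "Italian", "Mexican",
            "Thai", "Thai", "Thai",
            "Thai",
        ],
    }
)
first, second, score = most_similar_areas(toy)
print(first, second, round(score, 3))
```

Area `D` is dropped by the threshold because a single restaurant says little about a neighbourhood's distribution.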

#### 3.3. Competitor Analysis 🍝 [3]

**Which restaurants in California can be considered competitors of the `Mona Lisa` Italian restaurant in San Francisco?** Use whatever metrics or criteria you wish to define what qualifies as a competitor. Justify your use of these metrics, and be sure to check if the values in the data accurately represent the interpretation you're seeking.

In [0]:
# YOUR CODE HERE
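
One simple way to frame the question is as a filter: same city, same cuisine, and a meal cost within some tolerance of the target restaurant. The sketch below is an assumption-laden illustration, not the expected solution. The column names `name` and `city`, the 25% price tolerance, and every row of toy data are hypothetical.

```python
import pandas as pd


def find_competitors(
    frame: pd.DataFrame,
    target_name: str,
    price_tolerance: float = 0.25,  # hypothetical: within 25% of the target
) -> pd.DataFrame:
    """Return same-city, same-cuisine restaurants with a similar meal cost."""
    matches = frame.loc[frame["name"] == target_name]
    if matches.empty:
        raise ValueError(f"No restaurant named {target_name!r} found.")
    target = matches.iloc[0]
    same_market = (frame["city"] == target["city"]) & (
        frame["cuisine"] == target["cuisine"]
    )
    lower = target["meal_cost"] * (1 - price_tolerance)
    upper = target["meal_cost"] * (1 + price_tolerance)
    similar_price = frame["meal_cost"].between(lower, upper)
    not_self = frame["name"] != target_name
    selected = frame.loc[same_market & similar_price & not_self]
    return selected.sort_values("rating", ascending=False)


# Toy data (hypothetical restaurants and values).
toy = pd.DataFrame(
    {
        "name": ["Mona Lisa", "Trattoria X", "Pasta Y", "Taco Z", "Roma W"],
        "city": ["San Francisco"] * 4 + ["Los Angeles"],
        "cuisine": ["Italian", "Italian", "Italian", "Mexican", "Italian"],
        "meal_cost": [40, 45, 90, 40, 40],
        "rating": [4.5, 4.2, 4.8, 4.0, 4.6],
    }
)
competitors = find_competitors(toy, "Mona Lisa")
print(competitors["name"].tolist())
```

Matching on the exact name is fragile on real data, so checking for duplicate names or multiple branches first would be a sensible addition.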

#### 3.4. Social Pops 🎭 [2]

**Does a restaurant's presence on social media influence its popularity?** Use your own definition of `popularity` and justify the factors used to arrive at it. (*Hint*: Is a restaurant with a high rating or review count always popular?)
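
A possible first pass is to compare a popularity score between restaurants with and without any social media link. In the sketch below, the score `rating * log1p(rating_count)` is just one assumed definition, and the column names `instagram_url` and `facebook_url` are hypothetical stand-ins for whichever social fields the dataset actually contains. A difference between the groups shows association only, not influence.

```python
import numpy as np
import pandas as pd


def popularity_by_social_presence(
    frame: pd.DataFrame,
    social_cols: list,  # hypothetical column names holding social links
) -> pd.DataFrame:
    """Summarize an assumed popularity score by social media presence."""
    scored = frame.copy()
    has_social = scored[social_cols].notna().any(axis=1)
    scored["social_presence"] = np.where(
        has_social, "with_social", "without_social"
    )
    # Assumed definition: rating weighted by the log of the review volume.
    scored["popularity"] = scored["rating"] * np.log1p(scored["rating_count"])
    return scored.groupby("social_presence")["popularity"].agg(
        restaurants="count", median_popularity="median"
    )


# Toy data (hypothetical values).
toy = pd.DataFrame(
    {
        "rating": [4.0, 4.5, 3.0, 5.0],
        "rating_count": [100, 400, 10, 2],
        "instagram_url": ["x", None, None, "y"],
        "facebook_url": [None, "z", None, None],
    }
)
social_summary = popularity_by_social_presence(
    toy, ["instagram_url", "facebook_url"]
)
print(social_summary)
```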

Remember to hydrate and  [![Spotify](https://img.shields.io/badge/Spotify-1ED760?style=for-the-badge&logo=spotify&logoColor=white)](https://open.spotify.com/playlist/3d4bU6GAelt3YL2L1X2SOn)


## Section 4: Meal Cost Prediction 💰 [10]

---

<img src="https://media.giphy.com/media/TGcD6N8uzJ9FXuDV3a/giphy.gif" width="250px" alt="sherlock">

Create a supervised machine learning model that can predict the `meal_cost` of a restaurant. Use `meal_cost`, a field in the dataset that represents the price in `USD` of a meal for 1-2 persons at the restaurant, as the training label. You can use any of the other fields available in the dataset as trainable features to predict the target. You do not need to use neural networks or more complex models. Do not create a model zoo. Answer the following questions as you build out your model.

1. What metrics will you use to judge if the model is performing well?
2. What baseline model are you using, and why?
3. What features have you selected to predict the target, and why?
4. Have you looked at the specific rows where your model is performing badly? 
5. Can your trained model be used to predict meal prices for restaurants not present in this dataset? What kind of constraints would you set for users of this model, if you were to ship it?

> 📝 You will be evaluated on the following criteria: feature selection and reasoning, feature engineering, model selection, hyperparameter tuning, and general methodology. We're not interested in your final model accuracy scores, but in how systematic you were in arriving at a *reliable* result, and whether that result is meaningful.

In [0]:
# YOUR CODE HERE
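
Before training anything, it can help to fix a metric and a baseline that any real model has to beat. The sketch below uses mean absolute error (in USD) and two naive baselines: the global median of `meal_cost`, and the median per `cuisine` with a fallback to the global median for unseen cuisines. The function names, the choice of `cuisine` as the grouping field, and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def mean_absolute_error(actual, predicted) -> float:
    """Average absolute difference between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


def evaluate_baselines(
    train: pd.DataFrame,
    test: pd.DataFrame,
    group_col: str = "cuisine",  # assumed grouping feature
    target: str = "meal_cost",
) -> dict:
    """Score a global-median and a per-group-median baseline on test data."""
    global_median = train[target].median()
    group_medians = train.groupby(group_col)[target].median()
    global_pred = np.full(len(test), global_median)
    # Groups unseen in training fall back to the global median.
    group_pred = test[group_col].map(group_medians).fillna(global_median)
    return {
        "global_median_mae": mean_absolute_error(test[target], global_pred),
        "group_median_mae": mean_absolute_error(test[target], group_pred),
    }


# Toy data (hypothetical values).
train = pd.DataFrame(
    {
        "cuisine": ["Italian", "Italian", "Mexican", "Mexican"],
        "meal_cost": [40, 50, 15, 25],
    }
)
test = pd.DataFrame(
    {"cuisine": ["Italian", "Mexican", "Thai"], "meal_cost": [45, 20, 30]}
)
scores = evaluate_baselines(train, test)
print(scores)
```

Reporting a trained model's error next to these two numbers makes it easier to judge whether the added complexity is earning its keep.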

## Section 5: Create a Data Product 🗃 [10]

In [0]:
# YOUR CODE HERE

---

Good job!

<img src="https://media.giphy.com/media/qLhxN7Rp3PI8E/giphy.gif" width="250px" alt="legend of zelda">