# Mission Dotlas 🌎 [40 points]

> Data Science Assignment

> `v3.0` Updated: August 29 2023 (Fall 2023 Version)

<img src="https://camo.githubusercontent.com/6a3a3a9e55ce6b5c4305badbdc68c0d5f11b360b11e3fa7b93c822d637166090/68747470733a2f2f646f746c61732d776562736974652e73332e65752d776573742d312e616d617a6f6e6177732e636f6d2f696d616765732f6769746875622f62616e6e65722e706e67" width="750px" alt="dotlas">

## Section 1: Project Overview ✉️

Welcome to your mission! In this notebook, you will download a dataset containing restaurants' information in the state of California, US. 
The dataset will then be transformed, processed and prepared in a required format. 
This clean dataset will then be used to answer some analytical questions and create a few data visualizations in Python.

This is a template notebook with some code already filled in to help you on your way. Other cells require you to fill in the Python code to solve specific problems. Some sections of the notebook carry a points tally for the code written.

**Each section of this notebook is largely independent, so if you get stuck on a problem you can always move on to the next one.**

### 1.1. Tools & Technologies 🪛

- This exercise will be carried out using the [Python](https://www.python.org/) programming language and will rely heavily on the [Pandas](https://pandas.pydata.org/) library for data manipulation.
- You are also free to use Polars, Dask or Spark if you do not want to use Pandas.
- You may use any of [Matplotlib](https://matplotlib.org/), [Seaborn](https://seaborn.pydata.org/) or [Plotly](https://plotly.com/python/) packages for data visualization.
- We will be using [Jupyter notebooks](https://jupyter.org/) to run Python code in order to view and interact better with our data and visualizations.
- You are free to use [Google Colab](https://colab.research.google.com/) which provides an easy-to-use Jupyter interface.
- When not in Colab, it is recommended to run this Jupyter Notebook within an [Anaconda](https://continuum.io/) environment.
- You can use any other Python packages that you deem fit for this project.

> ⚠ **Ensure that your Python version is 3.9 or higher**

![](https://upload.wikimedia.org/wikipedia/commons/1/1b/Blue_Python_3.9_Shield_Badge.svg)

**Language**

![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)

**Environments & Packages**

![Anaconda](https://img.shields.io/badge/Anaconda-%2344A833.svg?style=for-the-badge&logo=anaconda&logoColor=white)
![Jupyter Notebook](https://img.shields.io/badge/jupyter-%23FA0F00.svg?style=for-the-badge&logo=jupyter&logoColor=white)

![Pandas](https://img.shields.io/badge/pandas-%23150458.svg?style=for-the-badge&logo=pandas&logoColor=white)
![Matplotlib](https://img.shields.io/badge/Matplotlib-%23ffffff.svg?style=for-the-badge&logo=Matplotlib&logoColor=black)
![Plotly](https://img.shields.io/badge/Plotly-%233F4F75.svg?style=for-the-badge&logo=plotly&logoColor=white)

**Data Store**

![AWS](https://img.shields.io/badge/AWS-%23FF9900.svg?style=for-the-badge&logo=amazon-aws&logoColor=white)

## Section 2: Read California Dataset 🚰 [1]

In this section, we will load the dataset from [AWS](https://googlethatforyou.com?q=amazon%20web%20services) S3, conduct an exploratory data analysis, and then clean up the dataset.


- Ensure that pandas and plotly are installed (possibly via pip or poetry)
- The dataset is about 34.5 MB in size and time-to-download depends on internet speed and availability
- Download the dataset using Python into this notebook and load it into a pandas dataframe (without writing to file)

In [1]:
import warnings

warnings.filterwarnings("ignore")

from matplotlib import pyplot as plt

%matplotlib inline

import pandas as pd
import numpy as np

CELL_HEIGHT: int = 50

# Initialize helpers to ignore pandas warnings and resize columns and cells
pd.set_option("chained_assignment", None)
pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", 500)
pd.set_option("display.max_colwidth", CELL_HEIGHT)

DATA_URL: str = (
    "https://dotlas-marketing.s3.amazonaws.com/interviews/california_restaurants_2023.json"
)

In [None]:
# ✏️ YOUR CODE HERE
# df: pd.DataFrame = ?
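
One way to meet the "without writing to file" requirement is to pass the URL straight to `pandas.read_json`, which accepts URLs as well as file-like objects. The sketch below is a minimal illustration, not the reference solution: the helper name `read_restaurants` and the toy payload are made up for the example, and the layout of the real JSON file is not stated in this notebook, so an extra argument such as `lines=True` may be needed if the file turns out to be line-delimited.

```python
import io

import pandas as pd

DATA_URL = (
    "https://dotlas-marketing.s3.amazonaws.com/interviews/california_restaurants_2023.json"
)


def read_restaurants(source) -> pd.DataFrame:
    """Parse a JSON payload (URL, path, or file-like object) into a DataFrame.

    Hypothetical helper: nothing is written to disk, the payload is parsed in memory.
    """
    return pd.read_json(source)


# In the notebook this would be: df = read_restaurants(DATA_URL)

# Offline illustration with a tiny, made-up payload in "records" layout.
sample = io.StringIO(
    '[{"brand_name": "A", "city": "Davis"}, {"brand_name": "B", "city": "Pasadena"}]'
)
sample_df = read_restaurants(sample)
print(sample_df)
```

Keeping the parsing in a small function makes it easy to try the same call on a local sample before downloading the full 34.5 MB file.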

The following cell creates a `restaurant_id` column to uniquely index each restaurant. Run it as is.

In [3]:
df["restaurant_id"] = range(1, len(df) + 1)
print(df.shape)
df.head()

(11296, 79)


Unnamed: 0,country,subregion,city,brand_name,categories,latitude,longitude,area,address,description,public_transit,cross_street,restaurant_website,phone_number,primary_cuisine,dining_style,executive_chef_name,parking_info,dress_code,entertainment,operating_hours,price_range_id,price_range,payment_options,maximum_days_advance_for_reservation,rating,rating_count,atmosphere_rating,noise_rating,food_rating,service_rating,value_rating,terrible_review_count,poor_review_count,average_review_count,very_good_review_count,excellent_review_count,most_recent_review,review_count,review_topics,tags,has_clean_menus,has_common_area_cleaning,has_common_area_distancing,has_contact_tracing_collected,has_contactless_payment,requires_diner_temperature_check,has_limited_seating,prohibits_sick_staff,has_proof_of_vaccination_outdoor,requires_proof_of_vaccination,requires_diner_masks,requires_wait_staff_masks,has_sanitized_surfaces,provides_sanitizer_for_customers,has_sealed_utensils,has_vaccinated_staff,requires_staff_temp_checks,has_table_layout_with_extra_space,is_permanently_closed,is_waitlist_only,has_waitlist,has_bar,has_counter,has_high_top_seating,has_outdoor_seating,has_priority_seating,has_private_dining,has_takeout,has_delivery_partners,has_pickup,is_network_non_bookable,has_gifting,order_online_link,delivery_partners,facebook,menu_url,daily_reservation_count,restaurant_id
0,United States,California,La Jolla,Fresheria - Be Fresh La Jolla,[Café],32.83937,-117.27654,La Jolla,"627 Pearl St, , CA, La Jolla, 92037, United St...","Freshería - Fresh, balanced and delicious food...",,,http://www.Lafresheria.com/,(858) 551-8786,Café,Casual Dining,,,Casual Dress,,"Mon–Fri 6:30 am–7:00 pm Sat, Sun 8:00 am–7:00 pm",2,$30 and under,"[AMEX, Discover, Mastercard, Visa]",90,0.0,0,0.0,0,0.0,0.0,0.0,0,0,0,0,0,,0,[],"[Cafe, Counter Seating, Delivery, Non-Smoking,...",False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,True,True,False,False,False,False,,[],http://www.facebook.com/befreshsd,http://www.facebook.com/befreshsd/menu/,,1
1,United States,California,Davis,Hikari Sushi & Omakase,"[Sushi, Japanese, Creative Japanese]",38.54269,-121.73945,Davis,"110 F St, , CA, Davis, 95616-4628, United States",Please be advised. Friday and Saturday's dinne...,,,http://hikariomakase.com/,(530) 564-4356,Sushi,Casual Dining,Sithu and Zin,Street Parking,Casual Dress,,"Tue–Thu 5:00 pm–9:00 pm Fri, Sat 4:30 pm–9:30 pm",4,$50 and over,"[AMEX, Discover, MasterCard, Visa]",90,5.0,44,5.0,2,5.0,5.0,4.9,0,0,0,0,44,Great service and gourmet. Presentations of fo...,81,"[Good for special occasions, Charming, Innovat...","[BYO Liquor, BYO Wine, Chef's Table, Counter S...",True,True,False,False,False,True,True,True,False,False,False,True,True,True,False,False,True,False,False,False,True,False,True,False,False,False,False,True,False,False,False,True,,[],,https://www.facebook.com/hikarisushiandomakase...,1.0,2
2,United States,California,San Diego,Grant Grill,"[Californian, Contemporary American, Bar / Lou...",32.71569,-117.16173,Downtown,"326 Broadway, , CA, San Diego, 92101, United S...",A remarkable epicurean experience awaits you a...,,Fourth Ave.,http://www.grantgrill.com/,(619) 744-2077,Californian,Fine Dining,Mark Kropczynski,Valet,Smart Casual,Live entertainment Thursday-Saturday at 8pm in...,"Brunch Sat, Sun 7:00 am–2:00 pm Breakfast Mon–...",3,$31 to $50,"[AMEX, Diners Club, Discover, Mastercard, Visa]",90,4.8,77,4.9,2,4.6,4.9,4.3,0,0,5,6,66,"Our waiter, Jose, is so wonderful! I've been m...",1453,"[Good for special occasions, Good for business...","[Banquet, Bar/Lounge, Beer, BYO Wine, Cocktail...",False,True,True,False,False,False,True,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False,,[],http://www.facebook.com/grantgrill,http://www.grantgrill.com/menu/,12.0,3
3,United States,California,Danville,Yanni's Taverna,"[Greek, Mediterranean]",37.82196,-121.99948,Danville,"120 E Prospect Avenue, , CA, Danville, 94526, ...",<p>Yanni's Taverna is a Mediterranean restaura...,,Hartz Avenue,http://www.yannistaverna.com/,(925) 820-7700,Greek,Casual Dining,,Street Parking,Casual Dress,Live Greek music once every month.,Lunch: Monday - Saturday: 11:00am - 4:00pm Din...,2,$30 and under,"[AMEX, Discover, Mastercard, Visa]",90,4.1,14,3.3,2,4.1,4.1,3.6,0,0,4,5,5,Great find! Mixed grill is fabulous!,7,"[Kid-friendly, Innovative, Good for groups]","[Bar Dining, Beer, Corkage Fee, Counter Seatin...",False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False,,[],,http://www.yannistaverna.com/#!menus/cfvg,,4
4,United States,California,Los Angeles,Proud Bird,"[American, Unspecified]",33.93541,-118.37742,LAX / Westchester,"11022 Aviation Blvd, , CA, Los Angeles, 90045,...","The Proud Bird is a modern, re-imagined food b...",,111th Street,https://www.theproudbird.com/,(310) 670-3093,American,Casual Dining,,,Casual Dress,,"Tue–Thu, Sun 11:00 am–7:00 pm Fri, Sat 11:00 a...",2,$30 and under,"[AMEX, Cash not accepted, Discover, Mastercard...",91,4.0,1399,4.1,2,4.0,3.9,4.0,36,77,246,547,493,The Proud Bird is a terrific restaurant savory...,567,"[Great for scenic views, Good for groups, Good...","[Banquet, Bar/Lounge, Beer, Cocktails, Corkage...",False,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,,[POSTMATES],http://www.facebook.com/proudbirdrestaurant,https://www.theproudbird.com/menus,,5


## Section 3: Data Preprocessing 🕵🏼‍♀️ [9]


<img src="https://media.giphy.com/media/2f41Z7bhKGvbG/giphy.gif"  width="250px" alt="potter">


In this exercise, you will be preprocessing the data. It will involve cleaning the data, transforming it into a suitable format, and handling missing values and outliers. These steps are crucial to ensure the quality and reliability of the data before applying statistical learning models.

> 📝 Your work will be assessed based on how systematic and complete your transformations are, i.e., whether they perform the task generally across all data points and yield the expected output.

### 3.1. Missing Values ❓ [4]

Inspect the data to understand what you are dealing with, including summary statistics as well as its structure and identify potential issues that need fixing. Identify and handle missing values in the dataset. If a value is missing, decide whether to fill it in with an appropriate default value, remove the row, or simply let it be.

In [None]:
# ✏️ YOUR CODE HERE
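
A possible starting point for the inspection step is a per-column summary of missing values, followed by an explicit decision for each affected column. The sketch below assumes a toy frame that borrows a few column names from the dataset; the helper `missing_summary` and the choice of `"Unknown"` as a default are illustrative assumptions, not the expected answer.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame(
        {"missing": counts, "pct": (counts / len(frame) * 100).round(1)}
    )
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data: values are made up, column names mirror the real dataset.
toy = pd.DataFrame(
    {
        "brand_name": ["A", "B", "C", "D"],
        "cross_street": [None, None, "Fourth Ave.", "Hartz Avenue"],
        "daily_reservation_count": [None, 1.0, 12.0, None],
        "parking_info": [None, "Street Parking", "Valet", "Street Parking"],
    }
)
report = missing_summary(toy)
print(report)

# Example decision: fill a categorical column with an explicit placeholder,
# and leave the numeric column as NaN so that averages are not distorted.
cleaned = toy.copy()
cleaned["parking_info"] = cleaned["parking_info"].fillna("Unknown")
```

Whether to fill, drop, or keep a gap depends on how the column is used later, so it helps to record the reasoning next to each decision.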

### 3.2. Phoning it in 📞 [3]

Standardize the format of the `phone_number` column by removing any non-numeric characters and ensuring that all phone numbers have the same length.

In [None]:
# ✏️ YOUR CODE HERE
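
The standardization described above could be sketched as a small function that keeps only digits and enforces a fixed length. The function name `normalize_phone`, the 10-digit target, and the decision to drop a leading US country code `1` are assumptions made for this example; the assignment itself does not prescribe them.

```python
import re
from typing import Optional


def normalize_phone(raw: Optional[str]) -> Optional[str]:
    """Return a 10-digit string, or None when the input cannot be normalized."""
    if not isinstance(raw, str):
        return None  # covers None and NaN (NaN is a float)
    digits = re.sub(r"\D", "", raw)  # remove every non-digit character
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # assumed: drop the US country code
    return digits if len(digits) == 10 else None


print(normalize_phone("(858) 551-8786"))
print(normalize_phone("+1 530-564-4356"))
print(normalize_phone("551-8786"))
```

On the DataFrame this could be applied with `df["phone_number"].map(normalize_phone)`. Returning `None` for numbers that cannot be brought to the common length keeps invalid entries visible instead of silently padding them.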

### 3.3. No more HTML 📄 [2]
Find columns containing HTML tags and replace them with an appropriate plain text equivalent, such as a newline character or space.

For example:

```html
<p>
  Feast on delicious grub at Jerry's Famous Deli.<br />
  Its retro-style casual setting features comfortable booth seating.
</p>
```

to:

```
Feast on delicious grub at Jerry's Famous Deli. Its retro-style casual setting features comfortable booth seating.
```

In [None]:
# ✏️ YOUR CODE HERE
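
A minimal sketch of the tag removal, using only the standard library: tags are replaced with a space, HTML entities are decoded, and runs of whitespace are collapsed. The name `strip_html` is hypothetical, and the regular expression is a simplification that assumes reasonably well-formed markup; it would also remove literal text that happens to sit between `<` and `>`, so a real HTML parser may be the safer choice for messy input.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # anything that looks like <tag ...>
WHITESPACE_RE = re.compile(r"\s+")


def strip_html(text):
    """Replace HTML tags with spaces, decode entities, collapse whitespace."""
    if not isinstance(text, str):
        return text  # leave missing values untouched
    without_tags = TAG_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", html.unescape(without_tags)).strip()


snippet = """<p>
  Feast on delicious grub at Jerry's Famous Deli.<br />
  Its retro-style casual setting features comfortable booth seating.
</p>"""
print(strip_html(snippet))
```

To find the columns that need this treatment, one option is to test each text column with `df[col].str.contains(TAG_RE, na=False)` and apply `strip_html` only where matches occur.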

Remember to hydrate and  [![Spotify](https://img.shields.io/badge/Spotify-1ED760?style=for-the-badge&logo=spotify&logoColor=white)](https://open.spotify.com/playlist/3d4bU6GAelt3YL2L1X2SOn)

## Section 4: Data-Driven Questions 💬 [15]

<img src="https://media.giphy.com/media/fv8KclrYGp5dK/giphy.gif"  width="250px" alt="sherlock">


This section poses several broad questions that call for a data-centric answer. You will need to manipulate, filter, and aggregate the provided data to derive the results using code. The key objective is to assess your ability to convert a question into a series of methods or transformations that lead you to **systematically deduce the answer**, and to evaluate the reasoning you use to reach the conclusion as well as the **criteria you select**. Consider cleaning up additional fields in the data before you use them.

You are encouraged to incorporate **supplementary references** from other publicly available datasets, studies, or statistics, but the primary dataset supplied with this notebook must remain your main source of information.

Each answer must also include a **visualization** of your findings. You are free to choose any form of visualization (charts, maps, animations, or others), but it must be your own original work and not borrowed from external sources.

> 📝 Your evaluation will be based on the investigative techniques employed, the originality of your approach, the selection of variables and metrics, the criteria chosen, the type and quality of the visualization selected, and any additional supporting evidence, if provided.

#### 4.1. Cuisine Saturation 🍱 [1]

**Which cuisines are over-saturated in `Pasadena`?**

In [None]:
# YOUR CODE HERE
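
"Over-saturated" is left for you to define. One possible criterion, shown here purely as an assumption, is a location quotient: the share of a cuisine among a city's restaurants divided by its share statewide, where values well above 1 suggest over-representation. The function name, the `min_count` threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def location_quotients(local, overall, min_count=2):
    """Ratio of local cuisine share to overall cuisine share.

    `overall` is expected to include the local restaurants as well.
    Cuisines with fewer than `min_count` local restaurants are skipped.
    """
    local_counts = Counter(local)
    overall_counts = Counter(overall)
    local_total = sum(local_counts.values())
    overall_total = sum(overall_counts.values())
    quotients = {}
    for cuisine, count in local_counts.items():
        if count < min_count or overall_counts[cuisine] == 0:
            continue
        local_share = count / local_total
        overall_share = overall_counts[cuisine] / overall_total
        quotients[cuisine] = local_share / overall_share
    return quotients


# Toy data: 10 restaurants in the city, 100 in the state (city included).
city = ["Italian"] * 4 + ["American"] * 5 + ["Sushi"]
state = city + ["Italian"] * 6 + ["American"] * 45 + ["Sushi"] * 39
quotients = location_quotients(city, state)
print(quotients)
```

With the real data, `local` could be the `primary_cuisine` values for rows where `city` is `Pasadena` and `overall` the full column, after dropping missing values.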

#### 4.2. Consumer Culinary Spend  💸 [2]

**How does the average spend per person at a restaurant in the `San Francisco Bay Area` compare with the `Los Angeles Metropolitan Area`?**

In [None]:
# YOUR CODE HERE
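
The dataset records spend as text bands in `price_range` (for example `$30 and under`, `$31 to $50`, `$50 and over`), so a numeric proxy is needed before averaging. The sketch below shows one assumed mapping: the midpoint for closed bands and the stated bound for open-ended ones. The function name `estimate_spend` is hypothetical, and the open-ended bands are only rough approximations of what diners actually spend.

```python
import re
from typing import Optional


def estimate_spend(price_range: Optional[str]) -> Optional[float]:
    """Average of the dollar amounts mentioned in a price band, or None."""
    if not isinstance(price_range, str):
        return None
    amounts = [float(value) for value in re.findall(r"\$(\d+(?:\.\d+)?)", price_range)]
    if not amounts:
        return None
    return sum(amounts) / len(amounts)


for band in ["$30 and under", "$31 to $50", "$50 and over"]:
    print(band, "->", estimate_spend(band))
```

Comparing the two regions additionally requires deciding which cities belong to each metropolitan area; that mapping is not part of the dataset and would have to come from an external reference.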

#### 4.3. Safe & Famous  🫧 [3]

**Identify some popular restaurants in California that also have good `hygiene` & `safety` standards.**

> You may define `popular` in whatever way you wish using the variables in the data

In [None]:
# YOUR CODE HERE

#### 4.4. Twinning 🗺 [4]

**Find a pair of `neighbourhoods` in California that have a similar `cuisine` distribution**

> Be sure to consider neighbourhoods that have a reasonable sample of restaurants to begin with!

In [None]:
# YOUR CODE HERE
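
One way to compare cuisine distributions is cosine similarity between per-neighbourhood cuisine counts, which ignores how many restaurants each neighbourhood has and looks only at the mix. This is a sketch under assumptions: the function names, the `min_restaurants` cutoff, and the toy data are invented for illustration, and other measures (for example Jensen-Shannon distance) would be equally defensible.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors keyed by cuisine."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_pair(cuisines_by_area, min_restaurants=3):
    """Return ((area_1, area_2), similarity) for the closest pair, or None."""
    # Ignore neighbourhoods with too few restaurants to be meaningful.
    profiles = {
        area: Counter(cuisines)
        for area, cuisines in cuisines_by_area.items()
        if len(cuisines) >= min_restaurants
    }
    pairs = list(combinations(sorted(profiles), 2))
    if not pairs:
        return None
    scored = [(cosine_similarity(profiles[x], profiles[y]), (x, y)) for x, y in pairs]
    best_score, best_pair = max(scored)
    return best_pair, best_score


# Toy data, made up for the example.
toy_areas = {
    "Downtown": ["Italian", "Italian", "Sushi", "American"],
    "La Jolla": ["Italian", "Sushi", "American", "Italian"] * 2,
    "Sawtelle": ["Sushi", "Sushi", "Sushi", "Japanese"],
    "Tiny": ["Greek"],
}
result = most_similar_pair(toy_areas)
print(result)
```

With the real data, the input dictionary could be built by grouping `primary_cuisine` (or the exploded `categories` list) by `area`.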

#### 4.5. Competitor Analysis 🍝 [5]

**Which restaurants in California can be considered competitors of the `Mona Lisa` Italian restaurant in San Francisco?**

> Use whatever metrics or criteria you wish to define what qualifies as a competitor

In [None]:
# YOUR CODE HERE

## Section 5: Review Analysis 🎙 [15]

> This section is independent of the previous sections and uses a different dataset.

<img src="https://media.giphy.com/media/3o6ggcSS96ZfB13woo/giphy.gif"  width="250px" alt="ramsey">


In this section, we have a dataset of restaurant reviews in Los Angeles, and we're interested in understanding customer sentiment towards three key aspects: Food, Service, and Ambiance.

We will load a reviews dataset from [AWS](https://googlethatforyou.com?q=amazon%20web%20services) S3, to perform the analysis.

In [4]:
REVIEWS_URL: str = (
    "https://dotlas-marketing.s3.amazonaws.com/interviews/los_angeles_reviews_2023.parquet"
)

Below, you've been provided with some seed words to help you analyse each aspect of the restaurant experience. Feel free to modify them as you see fit.

In [None]:
seed_words = {
    'Food': {
        'Positive': ['tasty', 'delicious', 'flavorful', 'fresh', 'juicy', 'crispy', 'savory', 'yummy', 'delectable', 'succulent'],
        'Negative': ['bland', 'stale', 'overcooked', 'undercooked', 'greasy', 'salty', 'burnt', 'sour', 'tasteless', 'dry']
    },
    'Service': {
        'Positive': ['friendly', 'attentive', 'prompt', 'courteous', 'professional', 'helpful', 'accommodating', 'quick', 'efficient', 'warm'],
        'Negative': ['slow', 'rude', 'inattentive', 'disorganized', 'negligent', 'unprofessional', 'unhelpful', 'snobby', 'aloof', 'cold']
    },
    'Ambiance': {
        'Positive': ['cozy', 'inviting', 'elegant', 'trendy', 'clean', 'lively', 'comfortable', 'romantic', 'vibrant', 'charming'],
        'Negative': ['loud', 'crowded', 'dirty', 'dull', 'dark', 'cramped', 'noisy', 'uncomfortable', 'drab', 'claustrophobic']
    }
}

#### Term Identification 🔠 [5]

**Identify mentions of the three aspects (`Food, Service, Ambiance`) in the review text.**

> Use the seed words provided above to identify which aspects each review mentions. Feel free to alter them as you see fit.

In [None]:
# ✏️ YOUR CODE HERE
# reviews_df: pd.DataFrame = ?

In [6]:
print(reviews_df.shape)
reviews_df.head()

(141410, 8)


Unnamed: 0,author_name,creation_date,rating,review_title,review_text,brand_name,city,area
0,Casper L.,2023-08-20 12:57:10-07:00,5,Casper L.'s review,Really good place for some stuffing Japanese f...,Chinchikurin - Sawtelle,Los Angeles,Sawtelle
1,Michelle S.,2021-12-31 16:31:48-07:00,5,Michelle S.'s review,Treats are amazing !!!! Everything you order i...,The Sweet Life,Los Angeles,Vermont Square
2,Elias D.,2023-08-19 18:07:30-07:00,5,Elias D.'s review,Quality from the sauce to the wings and servic...,Pizza King,Los Angeles,
3,Maribel G.,2023-08-14 19:03:18-07:00,4,Maribel G.'s review,They're pizza is delicious!!!! And they are ve...,Pizza King,Los Angeles,
4,Denise W.,2023-07-08 22:59:32-07:00,5,Denise W.'s review,Really nice staff and good pizza for a great p...,Pizza King,Los Angeles,
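
Once the reviews are loaded, aspect mentions can be found by matching seed words against the tokens of each review. The sketch below is one simple, assumed approach: it uses a trimmed copy of the seed dictionary so the example stays short, the function name `find_aspect_terms` is hypothetical, and exact token matching will miss inflected forms such as "tastier" unless stemming or lemmatization is added.

```python
import re

# Trimmed copy of the seed dictionary, for illustration only.
SEED_WORDS = {
    "Food": {"Positive": ["tasty", "delicious", "fresh"], "Negative": ["bland", "stale", "dry"]},
    "Service": {"Positive": ["friendly", "attentive", "quick"], "Negative": ["slow", "rude", "cold"]},
    "Ambiance": {"Positive": ["cozy", "clean", "charming"], "Negative": ["loud", "crowded", "dirty"]},
}


def find_aspect_terms(text, seed_words=SEED_WORDS):
    """Map each aspect mentioned in the text to the sorted seed words found."""
    if not isinstance(text, str):
        return {}
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    found = {}
    for aspect, polarities in seed_words.items():
        hits = sorted(
            word for words in polarities.values() for word in words if word in tokens
        )
        if hits:
            found[aspect] = hits
    return found


print(find_aspect_terms("Delicious pizza but the service was slow and rude."))
```

Applied to the DataFrame, this could look like `reviews_df["review_text"].map(find_aspect_terms)`, which yields one dictionary per review.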


#### Sentiment Analysis 🎭 [5]

**Perform a sentiment scoring (`Negative` or `Positive`) of each mention of the aspects.**

> The seed words dictionary already accounts for the sentiment of each aspect, so feel free to utilize it once more.

In [None]:
# YOUR CODE HERE
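
A simple way to score each aspect is to count positive and negative seed words per review and label the aspect by whichever count is larger. The sketch below makes several assumptions: it uses a trimmed copy of the seed dictionary, the function name `score_aspects` is hypothetical, ties are labelled `Mixed` (a label the task itself does not define), and negations such as "not tasty" are not handled.

```python
import re

# Trimmed copy of the seed dictionary, for illustration only.
SEED_WORDS = {
    "Food": {"Positive": ["tasty", "delicious", "fresh"], "Negative": ["bland", "stale", "dry"]},
    "Service": {"Positive": ["friendly", "attentive", "quick"], "Negative": ["slow", "rude", "cold"]},
    "Ambiance": {"Positive": ["cozy", "clean", "charming"], "Negative": ["loud", "crowded", "dirty"]},
}


def score_aspects(text, seed_words=SEED_WORDS):
    """Label each mentioned aspect as Positive, Negative, or Mixed (tie)."""
    if not isinstance(text, str):
        return {}
    tokens = re.findall(r"[a-z']+", text.lower())
    labels = {}
    for aspect, polarities in seed_words.items():
        positive = sum(tokens.count(word) for word in polarities["Positive"])
        negative = sum(tokens.count(word) for word in polarities["Negative"])
        if positive == 0 and negative == 0:
            continue  # aspect not mentioned in this review
        if positive > negative:
            labels[aspect] = "Positive"
        elif negative > positive:
            labels[aspect] = "Negative"
        else:
            labels[aspect] = "Mixed"
    return labels


print(score_aspects("The food was fresh and tasty, but the staff were rude."))
```

The per-review labels can then be aggregated by `brand_name` or `area` for the final analysis. Comparing them against the star `rating` column is one way to sanity-check the seed words.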

#### Compile 📚 [5]

**Provide an analysis of the aspect and sentiment of restaurants in Los Angeles. Provide at least 1 visualization and summarize your findings.**

In [None]:
# YOUR CODE HERE

---

Good job!

<img src="https://media.giphy.com/media/qLhxN7Rp3PI8E/giphy.gif" width="250px" alt="legend of zelda">