## CS 248 Day 5 - Exploring `requests`

**Author:** Eni Mustafaraj
**Date:** Feb 20, 2025

**Table of Contents**

1. [Part 1: compare results from two requests](#sec1)
2. [Part 2: retrieve and save JSON data](#sec2)
3. [Part 3: use pandas to transform our data](#sec3)

These examples were initially coded live during class time, but then the notebook was edited to add text and structure for future use.

<a id="sec1"></a>

### Part 1: compare results from two requests

In this part, we will send requests for two different URLs, to check whether the pages have the same content.

In [None]:
# This is a very simplistic solution, for speed's sake.
# Generally, we first need to check that the response was successful
# before we actually try to read the text from the response.

import requests

url1 = "https://dish.avifoodsystems.com/wellesley"
url2 = "https://dish.avifoodsystems.com/wellesley/96/148/week"

text1 = requests.get(url1).text
text2 = requests.get(url2).text

# Test if the retrieved text is the same
text1 == text2

This was unexpected. We requested two different URLs, so we expected the content to differ, especially since the corresponding pages look different in a browser. However, Wellesley Fresh AVI seems to use data injection to make its web pages dynamic. Their website is written using Angular, a web application framework.
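The comment at the top of the cell above mentions that we should check whether a response was successful before reading its text. A minimal sketch of that check might look like the following. The helper name `get_text` and the 10-second timeout are our own choices for illustration, not part of the code written in class.

```python
import requests

def get_text(url, timeout=10):
    """Return the body of `url` as text, or raise if the request failed.

    `get_text` is a hypothetical helper; the default timeout is arbitrary.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With such a helper, the comparison could be written as `get_text(url1) == get_text(url2)`, and a failed request would raise an error instead of silently comparing two error pages.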

<a id="sec2"></a>

### Part 2: Getting JSON data

Now we will try the new URL, which is an API call:

In [None]:
url3 = "https://dish.avifoodsystems.com/api/menu-items/week?date=2/20/2025&locationId=96&mealId=148"

# first save the text of the response, as we did before
text = requests.get(url3).text
type(text)

The result we got is a string, since that is what we asked for (`.text` is always a string). It is possible to convert this text to a list of dictionaries (though we will show below that we don't need to, because of the `.json` method).

In [None]:
import json

# parse the JSON-formatted string into a Python list
data1 = json.loads(requests.get(url3).text)
type(data1), type(data1[0]) # see that the outer structure is a list, and its elements are dicts
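To see the conversion in isolation, without a network call, here is a small sketch that parses a hand-written JSON string. The two records are made up for illustration and do not come from the API.

```python
import json

# A made-up JSON string shaped like a list of menu items (not real API data)
sample_text = '[{"id": 1, "name": "Oatmeal"}, {"id": 2, "name": "Scrambled Eggs"}]'

sample_data = json.loads(sample_text)

print(type(sample_text))       # the raw text is a str
print(type(sample_data))       # the parsed result is a list
print(type(sample_data[0]))    # each element is a dict
print(sample_data[1]["name"])  # Scrambled Eggs
```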

Now, we will use the built-in method `.json` that is part of the Response object from the requests library:

In [None]:
data2 = requests.get(url3).json()
type(data2)

We can check if the two lists are the same:

In [None]:
data1 == data2

Let's explore what is in these lists:

In [None]:
len(data2)

Access one item:

In [None]:
data2[0]

Let's use a list comprehension to collect the names of all foods on the menu, and then print them:

In [None]:
meals = [item['name'] for item in data2]
for el in meals:
    print(el)

For the purposes of doing further work, we are going to save the data into a JSON file:

In [None]:
with open("lulu-menu.json", 'w') as outf:
    json.dump(data2, outf)
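One way to confirm that a file was written correctly is to load it back and compare it with the original object. The sketch below uses a small made-up list and a temporary directory, so it does not touch `lulu-menu.json`; the variable and file names are our own.

```python
import json
import os
import tempfile

# Made-up data standing in for the menu list
menu = [{"id": 1, "name": "Oatmeal", "allergens": []}]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample-menu.json")

    # Write the list to disk as JSON
    with open(path, "w") as outf:
        json.dump(menu, outf)

    # Read it back into a Python object
    with open(path) as inf:
        restored = json.load(inf)

print(restored == menu)  # True
```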

<a id="sec3"></a>

### Part 3: use pandas to transform our data

We can use pandas with JSON files, similar to how we use pandas with CSV files.
This makes sense only if we have a list of dictionaries, since each dictionary can be converted into a row for the dataframe.

In [None]:
import pandas as pd
# there is a function read_json, just like there is a function read_csv
df = pd.read_json("lulu-menu.json")
df.head()
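To illustrate why a list of dictionaries maps naturally onto a table, here is a small sketch that builds a dataframe directly from such a list. The records are made up; each dictionary becomes one row and each key becomes one column.

```python
import pandas as pd

# Made-up records (not real menu data)
records = [
    {"id": 1, "name": "Oatmeal", "calories": 150},
    {"id": 2, "name": "Scrambled Eggs", "calories": 200},
]

toy_df = pd.DataFrame(records)

print(toy_df.shape)          # (2, 3): two rows, three columns
print(list(toy_df.columns))  # ['id', 'name', 'calories']
```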

Let's do some of the typical exploration whenever we load a new dataset into a dataframe:

#### Learn about the structure of the dataframe

In [None]:
df.shape # find the size

In [None]:
# look at the column names
df.columns

In [None]:
# check the data types of each column
df.dtypes

In [None]:
# look up only one row
df.iloc[0]

In [None]:
# check to see what is stored in the column "allergens"
df.iloc[0]['allergens']

It is a list of dictionaries. We can also find rows where this list is empty:

In [None]:
df.iloc[1]['allergens'] # second row doesn't have allergens

How would we count how many cells contain an empty list?
Using the `apply` method, we can first flag all the cells that are empty lists, and then count them:

In [None]:
# use pandas methods to check how many rows do not have allergens, that is, are empty lists
df['allergens'].apply(lambda x: type(x)==list and len(x) == 0).sum()

Why does the above code work?

In [None]:
# let's first just apply the lambda function to the whole column and see what the result looks like
result = df['allergens'].apply(lambda x: type(x)==list and len(x) == 0)
result.to_list()

In the code above, the `lambda` function is a **predicate** that returns True or False. Specifically, it returns True for empty lists.
Applying the `sum` method to the Series of True and False values counts the number of True values, since in Python True counts as 1 and False counts as 0.

In [None]:
result.sum()
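The same counting idea can be checked on a tiny hand-made Series, where the answer is easy to verify by eye. The data is made up, and the sketch uses `isinstance` in place of `type(x)==list`, which is our own substitution.

```python
import pandas as pd

# Made-up column: two empty lists and one non-empty list
toy = pd.Series([[], [{"name": "Milk"}], []])

# The predicate flags each cell that is an empty list
is_empty = toy.apply(lambda x: isinstance(x, list) and len(x) == 0)

print(is_empty.to_list())      # [True, False, True]
print(int(True) + int(False))  # 1, because True is 1 and False is 0
print(is_empty.sum())          # 2
```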

#### Cleaning up the "allergens"

Instead of keeping the allergens column as is, we want to replace each list with a string of comma-separated values, and each empty list with an empty string.

We can do that by creating a function `transform`, which will then be applied to every cell of the column "allergens".

In [None]:
def transform(cellLst):
    result = ""
    if cellLst:
        result = ",".join([item['name'] for item in cellLst])
    return result

# test it with one item from the column to see how it works
transform(df.iloc[0].allergens)

In [None]:
# try again with a cell that has an empty list

transform(df.iloc[1].allergens)

As expected, we got an empty string. Now that we know that our function is doing the right thing, we will apply it to the whole column:

In [None]:
df['allergens'] = df['allergens'].apply(transform) # notice, we don't pass any arguments to the function here
df.head()

**Question:** Can we use the function we just created to clean up any other columns in this table?

If so, go ahead and do it below.

In [None]:
# your code here

### Other pandas functions

Let's find all the foods whose description contains a certain word. That is, we want all dishes that contain a certain ingredient.

In [None]:
# find all foods that contain eggs
query = "eggs"
filtered = df[df['description'].str.contains(query, case=False, na=False)]
filtered.shape
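To see what the `case=False` and `na=False` arguments do, here is a small sketch on a made-up dataframe. One description uses uppercase letters and one is missing.

```python
import pandas as pd

# Made-up dishes; the last one has no description
toy = pd.DataFrame({
    "name": ["Omelet", "Toast", "Mystery Dish"],
    "description": ["Made with fresh EGGS and cheese", "Whole wheat bread", None],
})

# case=False ignores upper/lower case; na=False treats missing values as "no match"
mask = toy["description"].str.contains("eggs", case=False, na=False)

print(mask.to_list())               # [True, False, False]
print(toy[mask]["name"].to_list())  # ['Omelet']
```

Without `na=False`, the missing description would produce a missing value in the mask rather than False, which cannot be used directly for filtering.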

In [None]:
# Let's see these dishes
filtered['name']

It looks like there are a lot of repeated rows. Are these the same items?

In [None]:
filtered.head()

We can see that rows 5 and 17 have the same id value, 19874. They differ only in their date; everything else is the same. This means that we would have to drop a lot of rows from this table, since they are not helpful for certain kinds of analysis (though they are valuable when building the daily menu, to know what we will eat that day).

#### Dropping columns

Some of the columns in this dataframe are not particularly useful, for example, image, stationName, price, etc.
We can drop all these columns with a single command:

In [None]:
dfLess = df.drop(columns=['date', 'image', 'stationName', 'stationOrder', 'price'])
dfLess.shape

In [None]:
dfLess.head()
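Note that `drop` returns a new dataframe and leaves the original untouched, which is why the result is stored in a new variable. A small sketch with made-up data shows this:

```python
import pandas as pd

# Made-up dataframe with one column we do not need
toy = pd.DataFrame({"id": [1, 2], "name": ["Oatmeal", "Toast"], "price": [0.0, 0.0]})

smaller = toy.drop(columns=["price"])

print(list(smaller.columns))  # ['id', 'name']
print(list(toy.columns))      # ['id', 'name', 'price'], the original is unchanged
```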

#### Dropping duplicate rows

Let's remove dishes that are repeated, given their ID, which is unique for each dish:

In [None]:
dfFinal = dfLess.drop_duplicates(subset=['id'], keep='first')
dfFinal.shape

There are only 29 unique dishes in the data we got for Lulu & Breakfast.
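To see how `subset` and `keep` behave in `drop_duplicates`, here is a small sketch on a made-up dataframe in which one dish appears on two different dates.

```python
import pandas as pd

# Made-up menu rows; the dish with id 10 appears twice, on different dates
toy = pd.DataFrame({
    "id": [10, 11, 10, 12],
    "name": ["Omelet", "Toast", "Omelet", "Fruit Cup"],
    "date": ["2025-02-17", "2025-02-17", "2025-02-18", "2025-02-18"],
})

# Rows count as duplicates if they share an id; only the first one is kept
unique_dishes = toy.drop_duplicates(subset=["id"], keep="first")

print(unique_dishes.shape)              # (3, 3)
print(unique_dishes["date"].to_list())  # ['2025-02-17', '2025-02-17', '2025-02-18']
```

Because only `id` is compared, the second Omelet row is dropped even though its date differs from the first.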

We will continue working with this data in the assignment for Week 5!