# Homework 1
UIC CS 418, Fall 2025

_According to the **Academic Integrity Policy** of this course, all work submitted for grading must be done individually, unless otherwise specified. While we encourage you to talk to your peers and learn from them, this interaction must be superficial with regard to all work submitted for grading. This means you cannot work in teams, you cannot work side-by-side, and you cannot submit someone else’s work (partial or complete) as your own. In particular, note that you are guilty of academic dishonesty if you extend or receive any kind of unauthorized assistance. Absolutely no transfer of program code between students is permitted (paper or electronic), and you may not solicit code from family, friends, or online forums. Other examples of academic dishonesty include emailing your program to another student, copying and pasting code from the internet, working in a group on a homework assignment, and allowing a tutor, TA, or another individual to write an answer for you. Academic dishonesty is unacceptable, and penalties range from failure to expulsion from the university; cases are handled via the official student conduct process described at https://dos.uic.edu/conductforstudents.shtml._

_This homework may be completed in pairs. You may tag your teammate in Gradescope upon submission._

## Due Date

This assignment is due at 11:59 pm CST on September 17, 2025. All parts of the assignment are due at the same time. If any part of the assignment is submitted late, the late submission policy applies to the whole assignment. Instructions on how to submit it to Gradescope are given at the end of the notebook and should be followed carefully.

## Part 1 (45% of HW1): Data processing with pandas 


In this homework, you will see examples of some commonly used data wrangling tools in Python. In particular, we aim to give you some familiarity with:

* Slicing data frames
* Filtering data
* Grouped counts
* Joining two tables
* NA/Null values



## Part 1: Practice (15%)

This part of the homework is graded manually based on showing the correct outputs after executing each step.

## Setup

You need to execute each step (run each Cell), in order for the next ones to work. First, import necessary libraries:

In [1]:
import pandas as pd
import numpy as np

The code below produces the data frames used in the examples:

In [2]:
heroes = pd.DataFrame(
    data={'color': ['red', 'green', 'black', 
                    'blue', 'black', 'red'],
          'first_seen_on': ['a', 'a', 'f', 'a', 'a', 'f'],
          'first_season': [2, 1, 2, 3, 3, 1]},
    index=['flash', 'arrow', 'vibe', 
           'atom', 'canary', 'firestorm']
)

identities = pd.DataFrame(
    data={'ego': ['barry allen', 'oliver queen', 'cisco ramon',
                  'ray palmer', 'sara lance', 
                  'martin stein', 'ronnie raymond'],
          'alter-ego': ['flash', 'arrow', 'vibe', 'atom',
                        'canary', 'firestorm', 'firestorm']}
)

teams = pd.DataFrame(
    data={'team': ['flash', 'arrow', 'flash', 'legends', 
                   'flash', 'legends', 'arrow'],
          'hero': ['flash', 'arrow', 'vibe', 'atom', 
                   'killer frost', 'firestorm', 'speedy']})

## Pandas and Wrangling

For the examples that follow, we will be using a toy data set containing information about superheroes in the Arrowverse.  In the `first_seen_on` column, `a` stands for Archer and `f`, Flash.

In [3]:
heroes

Unnamed: 0,color,first_seen_on,first_season
flash,red,a,2
arrow,green,a,1
vibe,black,f,2
atom,blue,a,3
canary,black,a,3
firestorm,red,f,1


In [4]:
identities

Unnamed: 0,ego,alter-ego
0,barry allen,flash
1,oliver queen,arrow
2,cisco ramon,vibe
3,ray palmer,atom
4,sara lance,canary
5,martin stein,firestorm
6,ronnie raymond,firestorm


In [5]:
teams

Unnamed: 0,team,hero
0,flash,flash
1,arrow,arrow
2,flash,vibe
3,legends,atom
4,flash,killer frost
5,legends,firestorm
6,arrow,speedy


### Slice and Dice

#### Column selection by label
To select a column of a `DataFrame` by column label, the safest and fastest way is to use the `.loc` indexer. General usage looks like `frame.loc[rowname, colname]`. (Reminder: the colon `:` means "everything".) For example, if we want the `color` column of the `heroes` data frame, we would use:

In [6]:
heroes.loc[:, 'color']

flash          red
arrow        green
vibe         black
atom          blue
canary       black
firestorm      red
Name: color, dtype: object

Selecting multiple columns is easy. You just need to supply a list of column names. Here we select the `color` and `first_season` columns:

In [7]:
heroes.loc[:, ['color', 'first_season']]

Unnamed: 0,color,first_season
flash,red,2
arrow,green,1
vibe,black,2
atom,blue,3
canary,black,3
firestorm,red,1


While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` operator, which takes the form `frame['colname']`.

In [8]:
heroes['first_seen_on']

flash        a
arrow        a
vibe         f
atom         a
canary       a
firestorm    f
Name: first_seen_on, dtype: object
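
As a quick check of the two selection styles above, here is a minimal sketch on a small made-up frame (the name `frame` and its three rows are assumptions for illustration, copied from the first rows of `heroes`). It suggests that `frame.loc[:, 'color']` and `frame['color']` return the same `Series`, while a list of labels returns a `DataFrame`.

```python
import pandas as pd

# Toy frame mirroring the first three rows of `heroes` (illustrative only).
frame = pd.DataFrame(
    {'color': ['red', 'green', 'black'],
     'first_season': [2, 1, 2]},
    index=['flash', 'arrow', 'vibe'],
)

by_loc = frame.loc[:, 'color']   # explicit row and column selectors
by_brackets = frame['color']     # shorthand for a single column

# Both forms return the same Series.
same_series = by_loc.equals(by_brackets)

# A list of labels returns a DataFrame, even when the list has one element.
one_col_frame = frame[['color']]
```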

#### Row Selection by Label

Similarly, if we want to select rows by their labels, we can use the same `.loc` indexer.

In [9]:
heroes.loc[['flash', 'vibe'], :]

Unnamed: 0,color,first_seen_on,first_season
flash,red,a,2
vibe,black,f,2


If we want all the columns returned, we can, for brevity, drop the colon without issue.

In [10]:
heroes.loc[['flash', 'vibe']]

Unnamed: 0,color,first_seen_on,first_season
flash,red,a,2
vibe,black,f,2


#### General Selection by Label

More generally you can slice across both rows and columns at the same time.  For example:

In [11]:
heroes.loc['flash':'atom', :'first_seen_on']

Unnamed: 0,color,first_seen_on
flash,red,a
arrow,green,a
vibe,black,f
atom,blue,a


#### Selection by Integer Index

If you want to select rows and columns by position, the `DataFrame` has an analogous `.iloc` indexer for integer indexing. Remember that Python indexing starts at 0.

In [12]:
heroes.iloc[:4,:2]

Unnamed: 0,color,first_seen_on
flash,red,a
arrow,green,a
vibe,black,f
atom,blue,a
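
One detail worth seeing side by side: label slices with `.loc` include the end label, while position slices with `.iloc` exclude the end position, as with ordinary Python slicing. The sketch below uses a small made-up frame (the name `frame` is an assumption for illustration).

```python
import pandas as pd

frame = pd.DataFrame(
    {'color': ['red', 'green', 'black', 'blue'],
     'first_season': [2, 1, 2, 3]},
    index=['flash', 'arrow', 'vibe', 'atom'],
)

# Label slice: 'flash' through 'vibe', with 'vibe' included.
label_slice = frame.loc['flash':'vibe']

# Position slice: rows 0 and 1 only, because position 2 is excluded.
position_slice = frame.iloc[0:2]

print(list(label_slice.index))     # ['flash', 'arrow', 'vibe']
print(list(position_slice.index))  # ['flash', 'arrow']
```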


### Filtering with boolean arrays

Filtering is the process of removing unwanted material.  In your quest for cleaner data, you will undoubtedly filter your data at some point: whether it be for clearing up cases with missing values, culling out fishy outliers, or analyzing subgroups of your data set.  For example, we may be interested in characters that debuted in season 3 of Archer.  Note that compound expressions have to be grouped with parentheses.

In [13]:
heroes[(heroes['first_season']==3) & (heroes['first_seen_on']=='a')]

Unnamed: 0,color,first_seen_on,first_season
atom,blue,a,3
canary,black,a,3


#### Problem Solving Strategy
We want to highlight the strategy for filtering to answer the question above:

* **Identify the variables of interest**
    * Interested in the debut: `first_season` and `first_seen_on`
* **Translate the question into statements with True/False answers**
    * Did the hero debut on Archer? $\rightarrow$ The hero has `first_seen_on` equal to `a`
    * Did the hero debut in season 3? $\rightarrow$ The hero has `first_season` equal to `3`
* **Translate the statements into boolean statements**
    * The hero has `first_seen_on` equal to `a` $\rightarrow$ `heroes['first_seen_on']=='a'`
    * The hero has `first_season` equal to `3` $\rightarrow$ `heroes['first_season']==3`
* **Use the boolean array to filter the data**

Note that each comparison in a compound expression has to be grouped with parentheses, because `&` and `|` bind more tightly than comparison operators such as `==`.

For your reference, some commonly used comparison operators are given below.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
==   | a == b   | Does a equal b?
<=   | a <= b   | Is a less than or equal to b?
\>=   | a >= b   | Is a greater than or equal to b?
<    | a < b    | Is a less than b?
&#62;    | a &#62; b    | Is a greater than b?
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)

An often-used operation missing from the above table is a test of membership.  The `Series.isin(values)` method returns a boolean series denoting whether each element of the `Series` is in `values`.  We can then use the result to subset our data frame. For example, if we wanted to see which rows of `heroes` had `first_season` values in $\{1,3\}$, we would use:

In [None]:
heroes[heroes['first_season'].isin([1,3])]

Notice that in both examples above, the expression in the brackets evaluates to a boolean series.  The general strategy for filtering data frames, then, is to write an expression of the form `frame[logical statement]`.
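
The same `frame[logical statement]` pattern works with the other operators in the table. Here is a small sketch, on a frame built with the same values as `heroes` (the variable names are assumptions for illustration), that combines `|` with parentheses and negates an `isin` test with `~`.

```python
import pandas as pd

frame = pd.DataFrame(
    {'color': ['red', 'green', 'black', 'blue', 'black', 'red'],
     'first_season': [2, 1, 2, 3, 3, 1]},
    index=['flash', 'arrow', 'vibe', 'atom', 'canary', 'firestorm'],
)

# OR: heroes that are black or debuted in season 1.
black_or_s1 = frame[(frame['color'] == 'black') | (frame['first_season'] == 1)]

# NOT + isin: heroes whose first season is not in {1, 3}.
not_1_or_3 = frame[~frame['first_season'].isin([1, 3])]

print(list(black_or_s1.index))  # ['arrow', 'vibe', 'canary', 'firestorm']
print(list(not_1_or_3.index))   # ['flash', 'vibe']
```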

### Counting Rows

To count the number of instances of a value in a `Series`, we can use the `value_counts` method.  Below we count the number of instances of each color.

In [None]:
heroes['color'].value_counts()

A more sophisticated analysis might involve counting the number of times a tuple appears.  Here we count $(color, first\_season)$ tuples.

In [None]:
heroes.groupby(['color', 'first_season']).size()

This returns a series that has been multi-indexed.  We'll skip this topic for now.  To get a data frame back, we'll use the `reset_index` method, which also allows us to simultaneously name the new column.

In [None]:
heroes.groupby(['color', 'first_season']).size().reset_index(name='count')
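
To make the shape of the result concrete, the sketch below repeats the grouped count on a frame with the same `color` and `first_season` values (the name `frame` is an assumption for illustration) and checks that the counts add up to the number of rows.

```python
import pandas as pd

frame = pd.DataFrame({
    'color': ['red', 'green', 'black', 'blue', 'black', 'red'],
    'first_season': [2, 1, 2, 3, 3, 1],
})

# Single-column counts: 'red' and 'black' each appear twice.
color_counts = frame['color'].value_counts()

# Grouped counts, flattened back into an ordinary data frame.
counts = (
    frame.groupby(['color', 'first_season'])
         .size()
         .reset_index(name='count')
)

# Every (color, first_season) pair is unique in this toy data,
# so each count is 1 and the counts sum to the number of rows.
total = counts['count'].sum()
```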

### Joining Tables on One Column

Suppose we have another table that classifies superheroes into their respective teams.  Note that `canary` is not in this data set and that `killer frost` and `speedy` are additions that aren't in the original `heroes` set.

For simplicity of the example, we'll convert the index of the `heroes` data frame into an explicit column called `hero`.  A careful examination of the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) will reveal that joining on a mixture of the index and columns is possible.

In [None]:
heroes['hero'] = heroes.index
heroes

#### Inner Join

The inner join below returns rows representing the heroes that appear in both data frames.

In [None]:
pd.merge(heroes, teams, how='inner', on='hero')

#### Left and right join
The left join returns rows representing heroes in the `heroes` ("left") data frame, augmented by information found in the `teams` data frame.  Its counterpart, the right join, would return heroes in the `teams` data frame.  Note that the `team` for hero `canary` is a `NaN` value, representing missing data.

In [None]:
pd.merge(heroes, teams, how='left', on='hero')

#### Outer join

An outer join on `hero` will return all heroes found in either the left or the right data frame.  Any missing values are filled in with `NaN`.

In [None]:
pd.merge(heroes, teams, how='outer', on='hero')
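
When comparing join types, it can help to see which table each row came from. The sketch below uses two tiny made-up frames (`left` and `right` are assumptions for illustration) and the `indicator=True` option of `pd.merge`, which adds a `_merge` column.

```python
import pandas as pd

left = pd.DataFrame({'hero': ['flash', 'arrow', 'canary'],
                     'color': ['red', 'green', 'black']})
right = pd.DataFrame({'hero': ['flash', 'arrow', 'speedy'],
                      'team': ['flash', 'arrow', 'arrow']})

merged = pd.merge(left, right, how='outer', on='hero', indicator=True)

# Map each hero to the side(s) it came from.
source = dict(zip(merged['hero'], merged['_merge'].astype(str)))
# 'canary' -> 'left_only', 'speedy' -> 'right_only', the rest -> 'both'
```

Rows marked `both` are the ones an inner join would keep. A left join keeps `both` and `left_only`.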

#### More than one match?

If the values in the columns to be matched don't uniquely identify a row, then a cartesian product is formed in the merge.  For example, notice that `firestorm` has two different egos, so information from `heroes` had to be duplicated in the merge, once for each ego.

In [None]:
pd.merge(heroes, identities, how='inner', 
         left_on='hero', right_on='alter-ego')
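
If duplicated rows would be a surprise in your own analysis, one option is to ask pandas to check the key relationship. The sketch below, on made-up frames named `left` and `right`, counts the duplicated matches and then uses the `validate` argument of `pd.merge`, which raises `pd.errors.MergeError` when the keys are not unique as requested.

```python
import pandas as pd

left = pd.DataFrame({'hero': ['flash', 'firestorm']})
right = pd.DataFrame({'ego': ['barry allen', 'martin stein', 'ronnie raymond'],
                      'alter-ego': ['flash', 'firestorm', 'firestorm']})

merged = pd.merge(left, right, how='inner',
                  left_on='hero', right_on='alter-ego')

# 'firestorm' matches two rows on the right, so it appears twice.
rows_per_hero = merged['hero'].value_counts().to_dict()

# `validate` makes pandas check the key relationship before merging.
try:
    pd.merge(left, right, how='inner', left_on='hero',
             right_on='alter-ego', validate='one_to_one')
    unique_keys = True
except pd.errors.MergeError:
    unique_keys = False
```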

### Missing Values

There are a multitude of reasons why a data set might have missing values.  By default, pandas uses the NumPy `NaN` to represent these null values in floating-point data (older implementations even used `-inf` and `inf`).  Recent versions of pandas also provide an experimental `pd.NA` scalar for some data types, so keep your eyes peeled for changes in updates!  More information can be found in the [missing data user guide](http://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html).

Because of the specialness of missing values, they merit their own set of tools.  Here, we will focus on detection.  For replacement, see the docs.

In [None]:
x = np.nan
y = pd.merge(heroes, teams, how='outer', on='hero')['first_season']
y

To check if a value is null, we use the `isnull()` method for series and data frames.  Alternatively, there is a `pd.isnull()` function as well.

In [None]:
x.isnull() # Will throw an error since x is neither a series nor a data frame

In [None]:
pd.isnull(x)

In [None]:
y.isnull()

In [None]:
pd.isnull(y)

Since filtering out missing data is such a common operation, Pandas also has conveniently included the analogous `notnull()` methods and function for improved human readability.

In [None]:
y.notnull()

In [None]:
y[y.notnull()]
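
The reason dedicated tools are needed is that `NaN` does not compare equal to anything, including itself. The following sketch on a made-up series illustrates this, and also shows that `dropna()` gives the same result as filtering with `notnull()`.

```python
import numpy as np
import pandas as pd

y = pd.Series([2.0, np.nan, 3.0, np.nan])

# NaN is not equal to anything, including itself, so `==` cannot find it.
found_with_eq = (y == np.nan).sum()     # 0
found_with_isnull = y.isnull().sum()    # 2

# dropna() is shorthand for filtering with notnull().
same = y.dropna().equals(y[y.notnull()])
```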

## Part 1: Questions (30%)

The problems below are based on an article that appeared in https://chi.streetsblog.org/2022/04/25/data-analysis-found-cta-has-only-been-running-about-half-its-schedule-blue-line-runs

In short, during 2022, the train service run by CTA was somewhat irregular, especially since they were doing maintenance work. As mentioned in the article, Fabio Göttlicher, a resident of Chicago, decided to record arrival times of the Blue Line at a particular train stop. We will use more recent data to calculate delays at this train stop. This data is given to you as 7 JSON files. If interested, you can see how to get such data by querying the CTA real-time feed in Fabio Göttlicher's repository at https://github.com/FabioCZ/dude-wheres-my-train/tree/main

The first three parts of the starter code below provide the skeleton for processing one JSON file, with which you can calculate the delays.

In the last part, you will have to repeat the calculation for the full week and calculate weekly variance.

In [None]:
# Ignore the warnings.

# load the json file to get arrival and schedule for the day
arrival_data = pd.read_json('20240801.json')

# show the first few rows, by default 5
#print(arrival_data.head() )
print(arrival_data.shape)

#let's look at the columns
print(arrival_data.columns)

#get arrivals and scheduled columns -- all actual arrival and scheduled times are stored in these two dataframe columns
arrivals_info = arrival_data.loc[:,'arrivals']
schedule_info = arrival_data.loc[:,'scheduled']

print(arrivals_info.head())
print(schedule_info.head())


### Question 1.1 (8% credit)
Data Preprocessing to get scheduled and arrival times from JSON file

In [None]:
# 4%credit
# In this cell we will extract schedule as a list

#allDepartures column has all the scheduled information
num_of_scheduled = len(schedule_info.iloc[1]["allDepartures"]) #change 1 to 0 to get schedule of 30111 or a particular direction (northbound)
print(num_of_scheduled)

#to get a particular scheduled time, we can use iloc
print(schedule_info.iloc[1]["allDepartures"][num_of_scheduled-5]) #prints the 5th from last entry
#observe that schedules are already stored in HH:MM:SS format, so no preprocessing required

#insert code to create a list containing all the times when trains are scheduled on the day
scheduled_times_list = [YOUR CODE HERE]
scheduled_times = pd.DataFrame(scheduled_times_list)

print(num_of_scheduled==scheduled_times.shape[0]) #should evaluate to True


In [None]:
# 4%credit
# In this cell we will extract actual arrivals as a list

#it looks like all the arrivals are stored in the first and second rows
num_of_arrivals= len(arrivals_info.iloc[1]) #change 1 to 0 to get arrivals of 30111 or a particular direction (northbound) 
print(num_of_arrivals) # gives number of arrival time in the direction
print(arrivals_info.iloc[1][num_of_arrivals-1]["arrival"])#extract last entry in the series

## insert code to extract time in the string, that is, have to extract substring between 'T' and '-' 
[YOUR CODE HERE]

arrival_time = "23:49:41" 
print(arrival_time == "23:49:41" ) #time of arrival
#output should be in HH:MM:SS format like 23:49:41 for the last entry as above
print(arrival_time)

## insert code to extract the times of all arrivals on a day. Store in a list and convert to series or dataframe called arrival_times
arrival_times_list = []
[YOUR CODE HERE]

arrival_times = pd.DataFrame(arrival_times_list)
print(num_of_arrivals==arrival_times.shape[0]) #should evaluate to True

In [None]:
#So percent run is
print('Efficiency of the train system is:',len(arrivals_info.iloc[1])/len(schedule_info.iloc[1]["allDepartures"]))

### Question 1.2 (6% credit)
Since the departure and arrival times are given in HH:MM:SS format, we will write helper functions to convert them to minutes.  Write two functions, `extract_hour` and `extract_mins`, that convert this format to hours and minutes, respectively. Hint: You may want to use modular arithmetic and integer division. Keep in mind that the data has not been cleaned and you need to check whether the extracted values are valid. Replace all the invalid values with `NaN`. The documentation for `pandas.Series.where` provided [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.where.html) should be helpful.
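
As a warm-up for the tools named in the hint, here is a minimal sketch on an unrelated, made-up quantity (durations in seconds; the series `seconds` and the validity rule are assumptions for illustration, not the homework's data format). It shows integer division, the modulo operator, and how `Series.where` replaces entries that fail a condition with `NaN`.

```python
import pandas as pd

# Hypothetical input: durations in whole seconds, with one invalid negative entry.
seconds = pd.Series([125.0, 59.0, -10.0, 3600.0])

whole_minutes = seconds // 60     # integer (floor) division
leftover_seconds = seconds % 60   # remainder after removing whole minutes

# Series.where keeps values where the condition is True
# and puts NaN everywhere else.
valid = seconds >= 0
whole_minutes = whole_minutes.where(valid)
leftover_seconds = leftover_seconds.where(valid)
```

How validity should be defined for times of day is part of the exercise and is deliberately left open here.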

In [None]:
# 2% credit
def extract_hour(time):
    """
    Extracts hour information from military time
    
    Args: 
        time (float64): series of time given in military format.  
          Takes on values in 0.0-2359.0 due to float64 representation.
    
    Returns:
        array (float64): series of input dimension with hour information.  
          Should only take on integer values in 0-23
    """
    [YOUR CODE HERE]

In [None]:
### write code to test your extract_hour function here and execute it
# HINT: See tests_sample_part1/tests.py
[YOUR CODE HERE]

In [None]:
# 2% credit
def extract_mins(time):
    """
    Extracts minute information from military time
    
    Args: 
        time (float64): series of time given in military format.  
          Takes on values in 0.0-2359.0 due to float64 representation.
    
    Returns:
        array (float64): series of input dimension with minute information.  
          Should only take on integer values in 0-59
    """
    [YOUR CODE HERE]
    

In [None]:
### write code to test your extract_mins function here and execute it
# HINT: See tests_sample_part1/tests.py
[YOUR CODE HERE]

In [None]:
# 2% credit
def convert_to_minofday(time):
    """
    Converts HH:MM:SS time to minute of day
    
    Args:
        time: series of time given as strings in HH:MM:SS format.  
          
    
    Returns:
        array (float64): series of input dimension with minute of day
    
    Example: 13:03:00 is converted to 783.0
    """
    [YOUR CODE HERE]
    
    
# Test your code
ser = pd.Series(['13:03:00', '12:00:00', '24:00:00'])
convert_to_minofday(ser)
# 0    783.0
# 1    720.0
# 2      NaN
# dtype: float64

### Question 1.3 (6% credit)

Before we can calculate delays, we need to write one more important function. Notice that arrival times and scheduled times are given as two lists or data frames. Our first task is to relate them so that each entry in the arrivals is paired with a corresponding scheduled time. We will assume that `num_of_arrivals <= num_of_scheduled`, as is the case in these files (check this yourself!).

In [None]:
# 3%credit
def assigned_scheduled_times(arrival_times, scheduled_times):
    """
    Pairs each actual arrival time with its closest scheduled time
    
    Args:
        arrival_times: series of actual arrival times 
        scheduled_times: series of scheduled times
    
    Returns:
        arrival_scheduled_times: pandas dataframe with two columns viz., arrival times and corresponding scheduled time
    """
    actual = [YOUR CODE HERE]

    # insert code to find the closest scheduled time for each arrival time in arrival_times
    scheduled = [YOUR CODE HERE]
    [YOUR CODE HERE]
    

In [None]:
# 3% credit
def calc_delay(assigned_scheduled_times):
    """
    Calculates delay times (actual arrival time - scheduled time)
    
    Args:
        assigned_scheduled_times: pandas dataframe with two columns viz., arrival times and corresponding scheduled time
    
    Returns: 
        pandas series of input dimension with delay time
    """
    
    scheduled = [YOUR CODE HERE]
    actual = [YOUR CODE HERE]
    
    [YOUR CODE HERE]
    
#Test your code
sched = pd.Series([1303, 1210], dtype='float64')
actual = pd.Series([1304, 1215], dtype='float64')
calc_delay(pd.concat([sched, actual], axis=1))
# 0    1.0
# 1    5.0
# dtype: float64

### Question 1.4 (10% credit)

Once you have figured out the data preprocessing to obtain actual arrival and scheduled times of the train from one json, you can now write a function to extract them for all the 7 days (or json files).

In [None]:
# 3%credit
# Function to extract scheduled and actual arrival data for all 7 json files
[YOUR CODE HERE]

In [None]:
# 3% credit
### write code to test your functions here by calculating delay between `sched_dep_time` and `actual_dep_time` for each direction. 
### your printed results should show the values of the following two variables
[YOUR CODE HERE]


Calculate the average delay for each day of the week and the weekly variance, then submit the results.

In [None]:
#Calculate 
#   1 (2%credit). the average delay for each day  -- 7 numbers for each direction,  
#   2 (2%credit). the variance among the 7 numbers, and  
#   3. store in a dataframe with 8 rows (last row for variance) and 2 columns, and save in a csv file for submission
[YOUR CODE HERE]



## Part 2 (45% of HW 1): Web scraping and data collection 

Here, you will practice collecting and processing data in Python. By the end of this exercise hopefully you should look at the wonderful world wide web without fear, comforted by the fact that anything you can see with your human eyes, a computer can see with its computer eyes. In particular, we aim to give you some familiarity with:

* Using HTTP to fetch the content of a website
* HTTP Requests (and lifecycle)
* RESTful APIs
    * Authentication (OAuth)
    * Pagination
    * Rate limiting
* JSON vs. HTML (and how to parse each)
* HTML traversal (CSS selectors)
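
To give a feel for the last two items in the list, the sketch below parses a made-up JSON string with the standard `json` module and a made-up HTML fragment with Beautiful Soup's CSS selectors (`select` and `select_one`). The payload, the HTML, and the class names are all invented for illustration. Real pages use their own markup, which you will need to inspect in your browser.

```python
import json
from bs4 import BeautifulSoup

# Made-up JSON payload: it parses straight into Python dicts and lists.
payload = '{"items": [{"name": "Track A", "popularity": 71}]}'
first_item = json.loads(payload)['items'][0]

# Made-up HTML fragment: parse it, then walk it with CSS selectors.
html = """
<ul id="chart">
  <li class="movie"><span class="title">Film One</span><span class="rating">9.0</span></li>
  <li class="movie"><span class="title">Film Two</span><span class="rating">8.5</span></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')
movies = [
    {'Title': li.select_one('.title').get_text(),
     'Rating': float(li.select_one('.rating').get_text())}
    for li in soup.select('ul#chart li.movie')
]
```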

Since everyone loves food (presumably), the ultimate end goal of this homework will be to acquire the data to answer some questions and hypotheses about the restaurant scene in Chicago (which we will get to later). We will download __both__ the metadata on restaurants in Chicago from the Yelp API and with this metadata, retrieve the comments/reviews and ratings from users on restaurants.


### Library Documentation

For solving this part, you need to look up online documentation for the Python packages you will use:

* Standard Library: 
    * [io](https://docs.python.org/3/library/io.html)
    * [time](https://docs.python.org/3/library/time.html)
    * [json](https://docs.python.org/3/library/json.html)

* Third Party
    * [requests](http://docs.python-requests.org/en/master/)
    * [Beautiful Soup (version 4)](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)


## Setup

First, import necessary libraries:

In [None]:
import io, time, json
import requests
from bs4 import BeautifulSoup

import base64

## Authentication and working with APIs

There are various authentication schemes that APIs use, listed here in relative order of complexity:

* No authentication
* [HTTP basic authentication](https://en.wikipedia.org/wiki/Basic_access_authentication)
* Cookie based user login
* OAuth (v1.0 & v2.0, see this [post](http://stackoverflow.com/questions/4113934/how-is-oauth-2-different-from-oauth-1) explaining the differences)
* API keys
* Custom Authentication

For the IMDb example below (**Q2.1**), we do not need to authenticate since it is a publicly visible page. HTTP basic authentication isn't too common for consumer sites/applications that have the concept of user accounts (like Facebook, LinkedIn, Twitter, etc.) but is simple to set up quickly, and you often encounter it on individual password-protected pages/sites.

Cookie based user login is what the majority of services use when you login with a browser (i.e. username and password). Once you sign in to a service like Facebook, the response stores a cookie in your browser to remember that you have logged in (HTTP is stateless). Each subsequent request to the same domain (i.e. any page on `facebook.com`) also sends the cookie that contains the authentication information to remind Facebook's servers that you have already logged in.

Many REST APIs, however, use OAuth (authentication using tokens), which can be thought of as a programmatic way to "log in" _another_ user. Using tokens, a user (or application) only needs to send the login credentials once in the initial authentication and, in response, gets a special signed token from the server. This signed token is then sent in future requests to the server (in place of the user credentials).

A similar concept commonly used by many APIs is to assign API Keys to each client that needs access to server resources. The client must then pass the API Key along with _every_ request it makes to the API to authenticate. This is because the server is typically relatively stateless and does not maintain a session between subsequent calls from the same client. Most APIs (including Spotify) allow you to pass the API Key via a special HTTP Header: `Authorization: Bearer <API_KEY>`. Check out the [docs](https://developer.spotify.com/documentation/web-api/concepts/authorization) for more information.
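
To see what "passing the key in a header" looks like in code, here is a minimal sketch that builds a request with `requests` but does not send it. The URL and key are placeholders invented for illustration, not a real endpoint or credential.

```python
import requests

# Placeholder values: not a real endpoint or key.
api_key = 'EXAMPLE_KEY'
url = 'https://api.example.com/v1/items'

# Build and prepare the request without sending anything over the network.
prepared = requests.Request(
    'GET', url, headers={'Authorization': f'Bearer {api_key}'}
).prepare()

auth_value = prepared.headers['Authorization']
print(auth_value)  # Bearer EXAMPLE_KEY
```

In normal use you would pass the same `headers` dictionary to `requests.get`. Preparing the request here just makes the header visible without any network traffic.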


### Question 2.1: Basic HTTP Requests w/o authentication (5%)

First, let's do the "hello world" of making web requests with Python to get a sense for how to programmatically access web pages: an (unauthenticated) HTTP GET to download a web page.

Fill in the function to use `requests` to download and return the raw HTML content of the URL passed in as an argument. As an example, try the following IMDb page to retrieve titles and descriptions of the top 250 movies: [https://www.imdb.com/chart/top/](https://www.imdb.com/chart/top/)

Your function should return the response body as a string: `<text>`.

(Hint: look at the **Library documentation** listed earlier to see how `requests` should work.) 

In [None]:
# 2% credit
def retrieve_html(url):
    """
    Return the raw HTML at the specified URL.

    Args:
        url (string): 

    Returns:
        raw_html (string): the raw HTML content of the response, properly encoded according to the HTTP headers.
    """
    [YOUR CODE HERE]
    

In [None]:
# Example usage
imdb_url = 'https://www.imdb.com/chart/top/'
imdb_data = retrieve_html(imdb_url)
print(imdb_data[:1000])
# <!DOCTYPE html><html lang="en-US" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml"><head><meta charSet="utf-8"/>...

In [None]:
#3% credit
def parse_imdb(imdb_data):
    """
    Return the list of movies parsed from the IMDb top chart HTML.

    Args:
        imdb_data (string): raw HTML of the IMDb top chart page

    Returns:
        movies (list): the list of movies with Title, Description and Rating.
    
        Example:
        movies = [
        {
            'Title': 'The Shawshank Redemption',
            'Description': 'A Maine banker convicted of the murder of his wife and her lover...',
            'Rating': 9.3,
        },
        {
            'Title': 'The Godfather',
            'Description': 'Don Vito Corleone, head of a mafia family, decides to hand over his empire...',
            'Rating': 9.2,

        },
            # ... more
        ]

    
    """

    [YOUR CODE HERE]

In [None]:
# Example usage
if imdb_data:
    movies = parse_imdb(imdb_data)
    for movie in movies[:3]:
        print(f"Title: {movie['Title']}")
        print(f"Description: {movie['Description']}")
        print(f"Rating: {movie['Rating']}")
else:
    print("Failed to retrieve the webpage content.")

# Example outputs
# Title: The Shawshank Redemption
# Description: A banker convicted of uxoricide forms a friendship over a quarter century with a hardened convict, while maintaining his innocence and trying to remain hopeful through simple compassion.
# Rating: 9.3
# Title: The Godfather
# Description: The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.
# Rating: 9.2
# Title: The Dark Knight
# Description: When a menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman, James Gordon and Harvey Dent must work together to put an end to the madness.
# Rating: 9.1

Now while this example might have been fun, we haven't yet done anything more than we could with a web browser. To really see the power of programmatically making web requests we will need to interact with an API. For the rest of this lab we will be working with the [Spotify Web API](https://developer.spotify.com/documentation/web-api). 

## Spotify Web API Access

The reasons for using the Spotify Web API are twofold:

1. Incredibly Rich Dataset:
    * Track and Artist Data: detailed information about tracks and artists.
    * Genre and Popularity Trends: Analyze trends in music genres and track popularity over time.
    * Audio Features: Analyze tracks based on audio features like tempo, key, and more.
    * Personal Relevance: the Spotify API enables you to find data that resonates with your personal interests and preferences, making the analysis more engaging and insightful.
2. Well-Documented API: The Spotify Web API [Documentation](https://developer.spotify.com/documentation/web-api) provides thorough examples and guides to help you get started with API requests, handling responses, and understanding the rich dataset available through Spotify.


To access the Spotify API, you will need to perform a few steps. Like many other platforms, Spotify uses authentication and rate limiting to control access to its data. This ensures fair usage and compliance with Spotify's policies. The first step (even before making any request) is to set up a Spotify Developer account and obtain API credentials.

1. Create a [Spotify](https://accounts.spotify.com/en/login) account (if you do not have one already)
2. Log into the [dashboard](https://developer.spotify.com/dashboard) using your Spotify account. 
3. Generate API keys (if you haven't already). Create an app and select "Web API" for the question asking which APIs you are planning to use. Once you have created your app, you will have access to the app credentials. These will be required for API authorization to obtain an access token.

Now that we have our accounts set up, we can start making requests!


### Question 2.2: Authenticated HTTP Request with the Spotify Web API (12%)

First, store your Spotify credentials in a local file (kept out of version control) which you can read in to authenticate with the API. This file can be any format/structure since you will fill in the function stub below.

For example, you may want to store your key in a file called `spotify_api_key.json` (run in terminal):
```bash
echo '{"client_id": "your_client_id", "client_secret": "your_client_secret_key"}' > spotify_api_key.json
```

**KEEP THE API KEY FILE PRIVATE AND OUT OF VERSION CONTROL (and definitely do not submit them to Gradescope!)**

You can then read from the file using:

In [None]:
# 1% credit
with open('spotify_api_key.json', 'r') as file:
    credentials = json.load(file)
    
    # Extract the credentials
    client_id = credentials['client_id']
    client_secret = credentials['client_secret']
    print(client_id, client_secret)
    # verify your credentials are correct
# DO NOT FORGET TO CLEAR THE OUTPUT TO KEEP YOUR API KEY PRIVATE

In [None]:
# 1% credit
def read_api_key(filepath):
    """
    Read the Spotify API Keys from file.
    
    Args:
        filepath (string): File containing API Keys
    Returns:
        client_id (string): Your client id
        client_secret (string): Your client secret
    """
    
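    # feel free to modify this function if you are storing the API Key differently
    with open(filepath, 'r') as file:
        credentials = json.load(file)
    # return the two values promised in the docstring
    return credentials['client_id'], credentials['client_secret']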

Now authenticate and get an access token for future searches.

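**Hint: read the Authorization Code Flow [docs](https://developer.spotify.com/documentation/web-api/tutorials/code-flow).** Note that the stub below sends `grant_type: client_credentials`, which Spotify's documentation describes under the Client Credentials Flow.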
1. Send a POST request to the /api/token endpoint.
2. Parse the response json to get access_token

In [None]:
# 2% credit
def access_spotify(client_id, client_secret):
    """
    Authenticates the user and retrieves the bearer token required for API requests.
    """
    
    auth_url = 'https://accounts.spotify.com/api/token'
    auth_header = [YOUR CODE HERE]

    headers = {
        'Authorization': f'Basic {auth_header}',
    }
    data = {
        'grant_type': 'client_credentials'
    }
    
    response = [YOUR CODE HERE]
    access_token = [YOUR CODE HERE]
    
    return access_token


# Example usage
# DO NOT FORGET TO CLEAR THE OUTPUT TO KEEP YOUR API KEY PRIVATE

access_token = access_spotify(client_id, client_secret)
print(f"Authenticated with token: {access_token}")


Using the Spotify API, fill in the following function stub to make an authenticated request to the [Search Endpoint](https://developer.spotify.com/documentation/web-api/reference/search). Once you have the access token, you can use it to search for content (Tracks, Artists, or Albums) on Spotify. 

In [None]:
# 4% credit    
def spotify_search_params(client_id, client_secret, **kwargs):
    """
    Construct url, headers and params. Reference API docs (link above) to use the arguments
    """
    # What is the url endpoint for search?
    url = [YOUR CODE HERE]
    # How is Authentication performed? Hint: use access_token from function of access_spotify
    headers = [YOUR CODE HERE]
    # Spaces in a URL are problematic. How should you handle queries with field filters?
    query = [YOUR CODE HERE]
    # Include keyword arguments in params dictionary
    params = [YOUR CODE HERE]
    
    return url, headers, params
    

Hint: `**kwargs` represents keyword arguments that are passed to the function. For example, suppose you called the function as `spotify_search_params(client_id, client_secret, artist="Taylor Swift", track="Lover", type="track", limit=10, offset=0)`. The arguments `client_id` and `client_secret` are called *positional arguments*, and the key-value pair arguments are called *keyword arguments*. Your `kwargs` variable will be a Python dictionary containing those keyword arguments.
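
To make the hint concrete, here is a minimal sketch of how Python collects keyword arguments into a dictionary. The function `describe_call` and the placeholder credential strings are hypothetical and exist only for illustration; they are not part of the assignment or the Spotify API.

```python
def describe_call(client_id, client_secret, **kwargs):
    """Hypothetical helper: return the keyword arguments exactly as Python collected them."""
    # kwargs is an ordinary dict holding every keyword argument that was passed
    return kwargs


# The first two values are positional; everything after them is a keyword argument
captured = describe_call("my_id", "my_secret", artist="Taylor Swift", type="track", limit=10)
print(captured)
# {'artist': 'Taylor Swift', 'type': 'track', 'limit': 10}

# A dictionary can also be unpacked back into keyword arguments with **
same = describe_call("my_id", "my_secret", **captured)
print(same == captured)
# True
```

Note that the positional arguments never appear in `kwargs`; only the named ones do.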

In [None]:
# Example usage
url, headers, params = spotify_search_params(
    client_id, client_secret, 
    artist="Taylor Swift", 
    track="Lover", 
    type="track",
    limit=5,
    offset=0
)
url, headers, params
# ('https://<hidden_url_check_search_endpoint_docs_to_get_answer>',
#  {'Authorization': 'Bearer your_access_token'},
#  {'q': 'artist:Taylor Swift track:Lover', 'type': 'track', 'limit': 5,'offset':0})

Now use `spotify_search_params(client_id, client_secret, **kwargs)` to actually search for albums/tracks/artists with the Spotify API. Most of the code is provided to you.

In [None]:
# 2% credit
def api_get_request(url, headers, params):
    """
    Send an HTTP GET request and return a json response 
    
    Args:
        url (string): API endpoint url
        headers (dict): A python dictionary containing HTTP headers including Authentication to be sent
        params (dict): The parameters (required and optional) supported by endpoint
        
    Returns:
        results (json): response as json
    """
    # See requests.request?
    response = [YOUR CODE HERE]
    return [YOUR CODE HERE]
    

def spotify_search(client_id, client_secret, **kwargs):
    """
    Make an authenticated request to the Spotify API and return search results.

    Args:
        client_id (string): Your Spotify Client ID for Authentication
        client_secret (string): Your Spotify Client Secret for Authentication
        **kwargs: Additional search parameters (e.g., artist, track, album, etc.)

    Returns:
        total (integer): Total number of tracks matching the query
        tracks (list): List of dicts representing each track with name, and popularity
    """
    url, headers, params = spotify_search_params(client_id, client_secret, **kwargs)
    response_json = api_get_request(url, headers, params)
    total = response_json['tracks']['total']
    tracks = []
    # Keep only the name and popularity of each returned track
    for track in response_json['tracks']['items']:
        tracks.append({
            'track_name': track['name'],
            'popularity': track['popularity']
        })

    return total, tracks

# 2% credit
total, tracks = spotify_search(client_id, client_secret,artist="Taylor Swift", track="Lover", type="track", limit=5)
print(total)
#35
print(tracks)
#[{'track_name': 'Lover', 'popularity': 86}, {'track_name': 'Lover (Remix) [feat. Shawn Mendes]', 'popularity': 66}, {'track_name': 'Lover - First Dance Remix', 'popularity': 53}, {'track_name': 'Lover - Live From Paris', 'popularity': 49}, {'track_name': 'Lover', 'popularity': 17}]

Now that we have completed the "hello world" of working with the Spotify API, we are ready to really fly! The rest of the exercise will have a bit less direction since there are a variety of ways to retrieve the requested information but you should have all the component knowledge at this point to work with the API.

## Parameterization and Pagination

Before we can retrieve all of Taylor Swift's tracks, albums, or playlists, we need to understand how to work with the Spotify API's search and pagination system. Spotify's API returns a limited number of results per request to safeguard against returning TOO much data at once (imagine if you were to request 100,000 tracks in one go!). This limitation is common among APIs and helps manage rate limiting while ensuring efficient and fair access to Spotify's vast music database.

> As a thought exercise, consider: If an API has 1,000,000 records, but only returns 10 records per page and limits you to 5 requests per second... how long will it take to acquire ALL of the records contained in the API?

One of the ways that APIs are an improvement over plain web scraping is the ability to make __parameterized__ requests. Just like the Python functions you have been writing have arguments (or parameters) that allow you to customize their behavior/actions (an output) without having to rewrite the function entirely, we can parameterize the queries we make to the Spotify API to filter the results it returns.
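
Returning to the thought exercise above, the arithmetic can be sketched in a few lines. The numbers come straight from the exercise; the variable names are only illustrative, and the estimate assumes every request succeeds and you always run at the maximum allowed rate.

```python
import math

total_records = 1_000_000   # records held by the API
records_per_page = 10       # records returned per request
requests_per_second = 5     # rate limit

# One request per page, rounded up in case the last page is partial
requests_needed = math.ceil(total_records / records_per_page)
seconds_needed = requests_needed / requests_per_second
hours_needed = seconds_needed / 3600

print(f"{requests_needed} requests, about {hours_needed:.1f} hours")
# 100000 requests, about 5.6 hours
```

Even a modest page size and rate limit turn a "download everything" job into hours of waiting, which is why page size and pauses matter in the next question.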

### Question 2.3: Retrieve All Tracks by Taylor Swift on Spotify (10%)

Again using the [API documentation](https://developer.spotify.com/documentation/web-api/reference/search) for the `search` endpoint, fill in the following function to retrieve all of the _Tracks_ for a given query. As before, use the credentials loaded with your `read_api_key()` function to authenticate the requests. You will need to account for __pagination__ and __[rate limiting](https://developer.spotify.com/documentation/web-api/concepts/rate-limits)__ to:

1. Retrieve all of the Track objects (the number of Track objects should equal `total` in the response). **Paginate by querying `limit` tracks per request (50 in the example usage below).**
2. Pause slightly (at least 200 milliseconds) between subsequent requests so as to not overwhelm the API (and get blocked).  

As always with API access, make sure you follow all of the [API's policies](https://developer.spotify.com/documentation/web-api/concepts/rate-limits) and use the API responsibly and respectfully.

**DO NOT MAKE TOO MANY REQUESTS TOO QUICKLY OR YOUR KEY MAY BE BLOCKED**
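
As a generic illustration of offset-based pagination with a pause between requests, consider the sketch below. It runs against an in-memory list rather than a real API; `page_offsets`, `fetch_all`, and `fake_fetch` are hypothetical names used only here, and they are not the graded stubs or part of the Spotify API.

```python
import time


def page_offsets(total, limit):
    """Return the starting offset of each page needed to cover `total` items."""
    return list(range(0, total, limit))


def fetch_all(fetch_page, total, limit, pause_seconds=0.2):
    """Call fetch_page(offset, limit) once per page, pausing after each call."""
    items = []
    for offset in page_offsets(total, limit):
        items.extend(fetch_page(offset, limit))
        time.sleep(pause_seconds)  # be polite to the server between requests
    return items


# Stand-in for a remote data source: 23 records served one slice at a time
fake_records = list(range(23))


def fake_fetch(offset, limit):
    return fake_records[offset:offset + limit]


print(page_offsets(200, 50))
# [0, 50, 100, 150]

result = fetch_all(fake_fetch, total=len(fake_records), limit=10, pause_seconds=0.0)
print(len(result))
# 23
```

The last page may hold fewer than `limit` items, which is why the collected count is compared against `total` rather than against a multiple of `limit`.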

In [None]:
# 4% credit
def paginated_spotify_search_requests(client_id, client_secret, artist_name, total, limit):
    """
    Returns a list of tuples (url, headers, params) for a paginated search of all tracks
    Args:
        client_id, client_secret (string): Your Spotify API credentials for Authentication
        artist_name (string): Artist name
        total (int): Total number of items to be fetched
        limit (int): Number of items to fetch per request (50 in the example below)
    Returns:
        results (list): list of tuple (url, headers, params)
    """
    # HINT: Use total, offset and limit for pagination
    # You can reuse function spotify_search_params(...)
    [YOUR CODE HERE]
    
#     return 

# Example Usage
artist_name = "Taylor Swift"
total=200
limit=50
all_track_requests = paginated_spotify_search_requests(client_id, client_secret, artist_name, total,limit)
all_track_requests

#[('https:<hidden>',
#  {'Authorization': 'Bearer your_access_token'},
#  {'q': 'artist:Taylor Swift', 'type': 'track', 'limit': 50, 'offset': 0}),
# ('https:<hidden>',
#  {'Authorization': 'Bearer your_access_token'},
#  {'q': 'artist:Taylor Swift', 'type': 'track', 'limit': 50, 'offset': 50}),
# ('https:<hidden>',
#  {'Authorization': 'Bearer your_access_token'},
#  {'q': 'artist:Taylor Swift', 'type': 'track', 'limit': 50, 'offset': 100}),
# ('https:<hidden>',
#  {'Authorization': 'Bearer your_access_token'},
#  {'q': 'artist:Taylor Swift', 'type': 'track', 'limit': 50, 'offset': 150})]

In [None]:
# 3% credit
def get_tracks(client_id, client_secret, artist_name):
    """
    Construct the pagination requests for ALL tracks by Given Artist on Spotify.

    Args:
        client_id (string): Your Spotify Client ID for Authentication
        client_secret (string): Your Spotify Client Secret for Authentication
        artist_name (string): Artist name

    Returns:
        results (list): List of dicts representing each track
    """
    total_items = 200
    limit = 50
    
    tracks_request = paginated_spotify_search_requests(client_id, client_secret, artist_name, total_items, limit)
    
    # Use returned list of (url, headers, params) and function api_get_request to retrieve all tracks
    # REMEMBER to pause slightly after each request.
    [YOUR CODE HERE]
#     return 

In [None]:
# 3% credit
artist_name = 'Taylor Swift'
data = get_tracks(client_id, client_secret, artist_name)
print(len(data))
# 200

# Display first 10 tracks with Track name, Album name and Popularity
[YOUR CODE HERE]

# Track: Cruel Summer, Album: Lover, Popularity: 89
# Track: august, Album: folklore, Popularity: 88
# Track: Style, Album: 1989, Popularity: 76
# Track: Don’t Blame Me, Album: reputation, Popularity: 84
# Track: Lover, Album: Lover, Popularity: 86
# Track: cardigan, Album: folklore, Popularity: 85
# Track: Delicate, Album: reputation, Popularity: 83
# Track: Blank Space, Album: 1989, Popularity: 75
# Track: ...Ready For It?, Album: reputation, Popularity: 81
# Track: Fortnight (feat. Post Malone), Album: THE TORTURED POETS DEPARTMENT, Popularity: 83

Now that we have the metadata on 200 Taylor Swift tracks on Spotify, we can retrieve the album cover images. To do so we need to download them from image links, but to find out which links to download we first need to parse the JSON from the API to extract the image URLs.

In general, it is a best practice to separate the act of __downloading__ data and __parsing__ data. This ensures that your data processing pipeline is modular and extensible (and autogradable ;). This decoupling also solves the problem of expensive downloading but cheap parsing (in terms of computation and time).
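
One way to picture this separation is to save the raw response text to disk once and parse it as often as needed afterwards. The sketch below uses a hard-coded string in place of a real download; `save_raw`, `parse_saved`, and the file name are hypothetical and not part of the assignment.

```python
import json
import os
import tempfile


def save_raw(text, path):
    """Downloading step (simulated): store the response body untouched."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def parse_saved(path):
    """Parsing step: read the stored text and turn it into Python objects."""
    with open(path, "r", encoding="utf-8") as f:
        return json.loads(f.read())


raw_text = '{"items": [{"name": "example"}]}'  # stand-in for a downloaded response body
cache_path = os.path.join(tempfile.mkdtemp(), "response.json")

save_raw(raw_text, cache_path)      # expensive in real life, so do it once
parsed = parse_saved(cache_path)    # cheap, so repeat freely while debugging
print(parsed["items"][0]["name"])
# example
```

If the parsing logic turns out to be wrong, only `parse_saved` needs to be rerun; no further requests are sent.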

### Question 2.4: Parse the API Responses and Extract the URLs (7%)

Because we want to separate the __downloading__ from the __parsing__, fill in the following function to parse the album cover image URLs from an API response. As input your function should expect a string of [properly formatted JSON](http://www.json.org/) (which is similar to __BUT__ not the same as a Python dictionary) and as output should return a Python list of strings. Hint: print your `data` to see the JSON-formatted information you have. Each track object in the input JSON will be structured as follows (same as the [sample](https://developer.spotify.com/documentation/web-api/reference/get-track) on the Spotify API page):

```json
{
  "album": {
    "album_type": "compilation",
    "total_tracks": 9,
    "available_markets": ["CA", "BR", "IT"],
    "external_urls": {
      "spotify": "string"
    },
    "href": "string",
    "id": "2up3OPMp9Tb4dAKM2erWXQ",
    "images": [
      {
        "url": "https://i.scdn.co/image/ab67616d00001e02ff9ca10b55ce82ae553c8228",
        "height": 300,
        "width": 300
      }
    ],
    "name": "string",
    "release_date": "1981-12",
    "release_date_precision": "year",
    "restrictions": {
      "reason": "market"
    },
    "type": "album",
    "uri": "spotify:album:2up3OPMp9Tb4dAKM2erWXQ",
    "artists": [
      {
        "external_urls": {
          "spotify": "string"
        },
        "href": "string",
        "id": "string",
        "name": "string",
        "type": "artist",
        "uri": "string"
      }
    ]
  },
  "artists": [
    {
      "external_urls": {
        "spotify": "string"
      },
      "href": "string",
      "id": "string",
      "name": "string",
      "type": "artist",
      "uri": "string"
    }
  ],
  "available_markets": ["string"],
  "disc_number": 0,
  "duration_ms": 0,
  "explicit": false,
  "external_ids": {
    "isrc": "string",
    "ean": "string",
    "upc": "string"
  },
  "external_urls": {
    "spotify": "string"
  },
  "href": "string",
  "id": "string",
  "is_playable": false,
  "linked_from": {
  },
  "restrictions": {
    "reason": "string"
  },
  "name": "string",
  "popularity": 0,
  "preview_url": "string",
  "track_number": 0,
  "type": "track",
  "uri": "string",
  "is_local": false
}
```
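
To see why a JSON string is similar to but not the same as a Python dictionary, here is a small sketch using the standard `json` module. The JSON text is a made-up fragment shaped like part of the sample above, and the `example.com` URL is a placeholder.

```python
import json

# A JSON document is just text until it is parsed
raw = (
    '{"name": "example track", "explicit": false, '
    '"album": {"images": [{"url": "https://example.com/cover.jpg", '
    '"height": 300, "width": 300}]}}'
)

parsed = json.loads(raw)

print(type(raw).__name__, type(parsed).__name__)
# str dict

# JSON's lowercase false becomes Python's False after parsing
print(parsed["explicit"])
# False

# Nested objects and arrays become nested dicts and lists
print(parsed["album"]["images"][0]["url"])
# https://example.com/cover.jpg
```

Indexing the string `raw` directly with a key such as `raw["name"]` would raise a `TypeError`, which is a quick way to tell whether you are holding text or parsed data.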

In [None]:
# 4% credit
def parse_api_response(data):
    """
    Parse Spotify API results to extract cover images URLs.
    
    Args:
        data (string): String of properly formatted JSON.

    Returns:
        (list): list of URLs as strings from the input JSON.
    """
    
    [YOUR CODE HERE]

# 3% credit    
url, headers, params = spotify_search_params(
    client_id, client_secret, 
    artist="Taylor Swift", 
    track="Lover", 
    type="track",
    limit=2,
    offset=0
)
response_text = [YOUR CODE HERE]
parse_api_response(response_text)

#['https://i.scdn.co/image/ab67616d0000b273e787cffec20aa2a396a61647',
# 'https://i.scdn.co/image/ab67616d00001e02e787cffec20aa2a396a61647',
# 'https://i.scdn.co/image/ab67616d00004851e787cffec20aa2a396a61647',
# 'https://i.scdn.co/image/ab67616d0000b27359457bdb1edb5c6417f3baa2',
# 'https://i.scdn.co/image/ab67616d00001e0259457bdb1edb5c6417f3baa2',
# 'https://i.scdn.co/image/ab67616d0000485159457bdb1edb5c6417f3baa2']


As we can see, JSON is quite trivial to parse (which is not the case with HTML, as we will see in a second) and work with programmatically. This is why it is one of the most ubiquitous data serialization formats (especially for ReSTful APIs) and a huge benefit of working with a well-defined API if one exists. But APIs do not always exist or provide the data we might need, and as a last resort we can always scrape web pages...

## Working with Web Pages (and HTML)

Think of APIs as similar to accessing an application's database itself (something you can interactively query and receive structured data back). But the results are usually in a somewhat raw form with no formatting or visual representation (like the results from a database query). This is a benefit _AND_ a drawback depending on the end use case. For data science and _programmatic_ analysis this raw form is quite ideal, but for an end user requesting information from a _graphical interface_ (like a web browser) this is very far from ideal since it takes some cognitive overhead to interpret the raw information. And vice versa, if we have HTML it is quite easy for a human to visually interpret it, but to try to perform some type of programmatic analysis we first need to parse the HTML into a more structured form.

> As a general rule of thumb, if the data you need can be accessed or retrieved in a structured form (either from a bulk download or an API), prefer that first. But if the data you want (and need) is not available that way, as in our case, we need to resort to alternative (messier) means.

We might like to scrape more information about tracks, such as lyrics. However, under Spotify's API policy, we are not able to scrape track pages. Always ensure you're using APIs and accessing data legally and ethically, respecting the terms of service of the platform you're working with.

Going back to the "hello world" example of Question 2.1 with AP News, we will retrieve the HTML of a movie site to get more interesting information as text, such as the storyline and reviews.


### Question 2.5: Parse a Movie Page from IMDb (11%)

Using `BeautifulSoup`, parse the HTML of a single movie page to extract the reviews in a structured form as well as the URL to the next page of reviews (or `None` if it is the last page). Fill in the following function stubs to parse a single page of reviews and return:
* the reviews as a structured Python dictionary
* the HTML element containing the link/url for the next page of reviews (or None).

For each review be sure to structure your Python dictionary as follows (to be graded correctly). The order of the keys doesn't matter, only the keys and the data type of the values:

```python
{
    'Author': str,
    'Rating': float,
    'Date': str,  # formatted as 'dd Month yyyy'
    'Review': str
}

# Example
{
    'Author': 'ap_griffiths',
    'Rating': 5.0,
    'Date': '24 January 2012',
    'Review': "This is not a bad film per se, had it been about 30-40 minutes shorter I would not have been too offended."
}
```

Beautiful Soup can behave differently depending on the parser it uses. For maximum compatibility (and fewest errors), initialize the library with the parser from the Python standard library: `BeautifulSoup(markup, "html.parser")`.

Most of the function has been provided to you:

In [None]:
url_lookup = {"https://www.imdb.com/title/tt1375666/reviews?ref_=tt_urv":"Inception.html"}

def html_fetcher(url):
    """
    Return the raw HTML at the specified URL.
    Args:
        url (string): 

    Returns:
        status_code (integer):
        raw_html (string): the raw HTML content of the response, properly encoded according to the HTTP headers.
    """
    html_file = url_lookup.get(url)
    with open(html_file, 'rb') as file:
        html_text = file.read()
        return 200, html_text


def parse_page(html):
    """
    Parse reviews from an IMDb movie reviews page.

    Args:
        html (string): HTML content of the IMDb reviews page.

    Returns:
        reviews (list): A list of dictionaries, each containing the review's rating, author, date, and content.
    """
    soup = BeautifulSoup(html,'html.parser')
    reviews_list = []

    # Find all review containers on the page
    review_containers = soup.find_all('div', class_='lister-item-content')
    # HINT: print the review containers to see what HTML tag to extract
    [YOUR CODE HERE]
        
    return reviews_list

# Example Usage
code, html = html_fetcher("https://www.imdb.com/title/tt1375666/reviews?ref_=tt_urv") #should load inception movie released in 2010
reviews_list = parse_page(html)
print(len(reviews_list)) # 25

## Part 3 (10% of HW 1): Basic Probability (Refer to the tutorial notebook on Piazza for definitions)

Now we will answer some basic probability questions. You may type the answers using markdown or attach photos of your answers using `![](image.png)`, where the image contains your answers.

1. (5% credit) Let $X$ and $Y$ be random variables with the following joint distribution:

    <center>

    |    $\mathbb{P}(X=x,Y=y)$ | X=1 | X=2|X=3|X=4|
    | -------- | ------- | ------- | ------- | ------- |
    |Y=0|1/18|1/18|1/9|1/9|
    |Y=1|1/12|1/12|1/6|1/15|
    |Y=2|1/15|1/30|1/30|2/15|

    </center>
    Answer the following questions:
    
    1. What are the marginal distributions of $X$ and $Y$?
        
    2. Are $X$ and $Y$ independent? Justify your answer.

    3. What is the conditional distribution of $X$, given that $Y=2$? What is $\mathbb{E}\left[X|Y=2\right]$?

    4. Calculate $\mathbb{E}[Y]$ and $\mathbb{E}[XY]$.

2. (5% credit) A coin with probability of heads $p$ is tossed three times. Consider the following four events:

    A: Heads on the first toss
    
    B: Heads on the second toss
    
    C: All three outcomes the same
    
    D: Exactly two heads

    Which of the following pairs of events are independent? (More than one pair may be independent.) Justify your answer.
    
    1. A and B; 2. A and C; 3. A and D; 4. C and D
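
As a reminder, the standard definitions these questions rely on are summarized below. The tutorial notebook remains the reference for this course's notation; the symbols $E$ and $F$ for generic events are introduced here only for illustration.

$$
\begin{aligned}
\text{Marginal:}\quad & \mathbb{P}(X=x) = \sum_{y} \mathbb{P}(X=x, Y=y) \\
\text{Conditional:}\quad & \mathbb{P}(X=x \mid Y=y) = \frac{\mathbb{P}(X=x, Y=y)}{\mathbb{P}(Y=y)}, \quad \mathbb{P}(Y=y) > 0 \\
\text{Expectation:}\quad & \mathbb{E}[X] = \sum_{x} x \, \mathbb{P}(X=x), \qquad \mathbb{E}[XY] = \sum_{x}\sum_{y} x\,y\, \mathbb{P}(X=x, Y=y) \\
\text{Independent variables:}\quad & \mathbb{P}(X=x, Y=y) = \mathbb{P}(X=x)\,\mathbb{P}(Y=y) \ \text{ for all } x, y \\
\text{Independent events:}\quad & \mathbb{P}(E \cap F) = \mathbb{P}(E)\,\mathbb{P}(F)
\end{aligned}
$$

For independence of random variables, a single pair $(x, y)$ that violates the product rule is enough to show dependence.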


# Submission

You're almost done! 

After executing all commands and completing this notebook, save your *cs418-hw1-F25.ipynb* as a pdf file and upload it to Gradescope under *Homework 1 (written)*. Make sure you check that your pdf file includes all parts of your solution **(including the outputs)**. We recommend using the browser (not jupyter) for saving the pdf. For Chrome on a Mac, this is under *File->Print...->Open PDF in Preview* and when the PDF opens in Preview you can use *Save...* to save it. This part will be graded based on completion (having executed the code and showing the output).

Next, you need to copy the functions from Questions 1.2 and 1.3 into the corresponding functions in *hw1part1.py*. Similarly, you need to copy the functions from Questions 2.1, 2.2, 2.3, 2.4, and 2.5 into the corresponding functions in *hw1part2.py*. Place your files *hw1part1.py*, *hw1part2.py*, and *cs418-hw1-F25.ipynb* in a zip file and upload the zip file to Gradescope under *Homework 1 - (code)*. In order to get full points for this part, you need to pass all test cases that we will run against your *hw1part1.py* and *hw1part2.py* (and not the notebook) on Gradescope. We have provided a sample of the test cases in *tests_sample_part1/tests.py* and *tests_sample_part2/tests.py*. Other tests are hidden on the Gradescope server. To check whether your code runs locally, run the four tests in *tests_sample_part1* from your command line: 

`(cs418-fa25) sathya@Sathyas-MacBook-Pro h1% python run_tests_sample.py part1`

You should see the following output:

```
....
----------------------------------------------------------------------
Ran 4 tests in 0.001s

OK
```

Feel free to add more tests that check all parts of your code.

Similarly, you can run sample tests for part2 as follows:

`(cs418-fa25) sathya@Sathyas-MacBook-Pro h1% python run_tests_sample.py part2`

However, for the part2 tests to work, you need to edit the test case code in tests_sample_part2/tests.py and add your individual client_id and client_secret.

You can submit to Gradescope as many times as you would like. We will only consider your last submission. If your last submission is after the deadline, the late homework policy applies.

After submitting the zip file, the autograder will run. If the screen shown after the autograder finishes reports all tests as correct, then all the tests ran successfully on the server, and you're done! If your tests fail, you can debug your program locally by comparing the input, output and expected output (as shown for the first two test cases).

Make sure `hw1part1.py`, `hw1part2.py` and `cs418-hw1-F25.ipynb` are included on the root of the zip file. **This means you need to zip those files and not the folder containing the files.**