## Imports

In [1]:
# necessary imports
import requests
import re # regex

## Inspection

**NOTE:** *As of 4/16/2023, the Spring 2023 CS61A URL is [https://cs61a.org/academic-interns/](https://cs61a.org/academic-interns/). In future semesters, however, it will most likely change to [https://inst.eecs.berkeley.edu/~cs61a/sp23/academic-interns/](https://inst.eecs.berkeley.edu/~cs61a/sp23/academic-interns/). This project proceeds with the former, but for anyone who wants to duplicate these results (or for future me looking back), do note the change, because the current code will scrape whichever semester of 61A is "current", assuming they still have AIs.*

Like most web scraping ventures, I start by seeing if I can get the contents of the `html` file first. I begin my initial dive with [UC Berkeley's Spring 2023 Semester's CS61A's Academic Interns (AIs)](https://cs61a.org/academic-interns/).

In [2]:
# Scrape the html contents
sp23_url = "https://cs61a.org/academic-interns/"
# API call
sp23_response = requests.get(sp23_url)
# Store response into a string
sp23_response_string = sp23_response.text

print(sp23_response_string[:100])


<!DOCTYPE html>
<html lang="en">
  <head>
    <meta name="description" content ="CS 61A: Structure 


Unfortunately, unlike [my other web scraping project](https://github.com/dnhuang/sf-tri-city-restaurants/blob/main/scraper.ipynb), there doesn't seem to be a JSON response file shown by Google Chrome developer tools. This means that I needed to do this the old-fashioned way: regular expressions.

Thankfully, online tools like [Regex101](https://regex101.com/) exist, which made finding a pattern much easier. But in the process of figuring out patterns, I noticed a lot of inconsistencies and edge cases that were very difficult to capture; all from **ONE** semester of 61A. Here are a few of them:

#### **Names**

I was an AI for 61A in Fall 2021, and I only recall inputting my first and last name, so I thought the pattern would be pretty simple. It turns out some people also listed their middle names, which meant there were at least two spaces. But then came the Indian names, which maybe had like five different words strung together. After accounting for these, I still wasn't getting the correct number of matches (there are a total of $130$ AIs for Spring 2023 and yes, I manually counted them). So that's when I discovered that some people had hyphens (-) in their name. After adding that into the pattern, I was still left disappointed; $129/130$ matches. And that's when I saw this:

<div>
<img src='img/parentheses_in_name.png' width=500>
</div>

Of course, ***nicknames***.
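A quick sanity check of how the name character class had to grow. The names below are made up for illustration (they are not actual AIs), and the two patterns are simplified stand-ins for the ones used later in this notebook:

```python
import re

# Hypothetical names, one per edge case described above (not real AIs).
sample_names = [
    "Ada Lovelace",        # first and last name
    "Ada King Lovelace",   # middle name, so two spaces
    "Anne-Marie Curie",    # hyphen
    "Robert (Bob) Smith",  # nickname in parentheses
]

simple = r"[a-zA-Z ]+"          # letters and spaces only
extended = r"[a-zA-Z\-\(\) ]+"  # also allows hyphens and parentheses

# re.fullmatch only succeeds if the whole name fits the character class.
simple_hits = [n for n in sample_names if re.fullmatch(simple, n)]
extended_hits = [n for n in sample_names if re.fullmatch(extended, n)]

print("simple:  ", simple_hits)
print("extended:", extended_hits)
```

The simple class only accepts the first two names; the extended class accepts all four.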

#### **Pronouns**

Pronouns weren't as bad as names, but that might be because I didn't bother spending time to make the pattern look concise; I'm kind of just over it. An edge case I ran into was this:

<div>
<img src='img/multi_pronoun_sp23.png' width=500>
<img src='img/multi_pronoun_html_sp23.png' width=500>
</div>

And this was the reason I was getting $131$ matches instead of $130$. ***Multi-set pronouns***. WHY.

I decided that I honestly couldn't be bothered to determine how to settle this issue, so I settled on just choosing the first set.
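Here is a minimal sketch of why a pronoun-only pattern overcounts and how anchoring on the name keeps just the first set. The markup is an assumption on my part (I am guessing that a multi-set entry is simply a second badge `div` right after the first), and the names are hypothetical:

```python
import re

# Hypothetical markup: assumes a second "badge" div directly follows the first.
html = '''
<h3 class="staff-name">
  Alex Doe
  <div class="badge badge-info">she/her</div>
  <div class="badge badge-info">they/them</div>
</h3>
<h3 class="staff-name">
  Sam Roe
  <div class="badge badge-info">he/him/his</div>
</h3>
'''

# Matches every badge on the page, so two people yield three matches.
pronoun_only = r'<div class="badge badge-info">([a-zA-Z\/ ]*)<\/div>'

# Anchored on the name heading, so only the badge right after the name is captured.
name_and_first_pronoun = (
    r'<h3 class="staff-name">\s*([a-zA-Z\-\(\) ]+)\s+'
    r'<div class="badge badge-info">([a-zA-Z\/ ]*)<\/div>'
)

all_badges = re.findall(pronoun_only, html)
pairs = re.findall(name_and_first_pronoun, html)

print(len(all_badges), all_badges)
print(len(pairs), pairs)
```

With this toy input, the pronoun-only pattern returns three matches for two people, while the anchored pattern returns one `(name, pronoun)` pair per person.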

#### **Bio**

This was by far the worst one to extract. In the beginning, I thought I had nearly solved it by simply using `(.*)`, as I ended up with $124$ matches (missing $6$), so I just needed to tweak it a little bit further.

But then I started to really struggle because I couldn't for the life of me figure out how to use `.*` within a `[]`, as I wanted to capture everything while accounting for `\s*`, which needed to be utilized within a `[]`.

But from my Google searches, it doesn't seem like `[.*]` does what I wanted: inside a `[]`, `.` and `*` are treated as literal characters. I tried using `[\S\s]*` but this captured everything, like *literally everything*, so I was out of luck there. But why do I care so much about the `\s`? Well, here is the reason:

<div>
<img src='img/multiline_bio_html_sp23.png'>
</div>

For some reason, some bios were not contained in *one line*. There is a newline character in this bio, which required me to somehow capture it in my regex pattern.
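The behaviors described above can be reproduced on a tiny example. The two bios are made up; the last pattern (`.*?` with `re.DOTALL`) is an alternative I am sketching for comparison, not the pattern this notebook ends up using:

```python
import re

# Hypothetical two-entry snippet; the second bio wraps onto a new line.
html = (
    '<li class="section bio">I like cats.</li>\n'
    '<li class="section bio">I like dogs\nand long walks.</li>\n'
)

# `.` does not match a newline by default, so the wrapped bio is missed.
dot_star = re.findall(r'<li class="section bio">(.*)<\/li>', html)

# Inside [], `.` and `*` are literal characters, so this matches nothing here.
literal_class = re.findall(r'<li class="section bio">([.*]*)<\/li>', html)

# [\S\s] matches any character including newlines, but the greedy `*`
# runs all the way to the LAST </li>, swallowing both entries in one match.
greedy_any = re.findall(r'<li class="section bio">([\S\s]*)<\/li>', html)

# Alternative sketch: let `.` match newlines (re.DOTALL) and stop at the
# first </li> by making the quantifier lazy.
lazy_dotall = re.findall(r'<li class="section bio">(.*?)<\/li>', html, flags=re.DOTALL)

print(len(dot_star), len(literal_class), len(greedy_any), len(lazy_dotall))
```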

## Scraping

### Initial Scraping of Spring 2023 CS61A AIs

With those aforementioned edge cases, I start by trying my hand at Spring 2023's CS61A AIs. I implement a few Python functions that perform the extraction and print out checks on the extracted list.

These were the initial patterns that I used:
* names: `([a-zA-Z\-\(\) ]+)`
* pronouns: `'<div class="badge badge-info">([a-zA-Z]*\/*[a-zA-Z]*\/*[a-zA-Z]*\/*)<\/div>'`
* bio: `'<li class="section bio">(.{0,500}\s?.{0,500}\s?.{0,500}\s?.{0,500}\s?)<\/li>'`

Yes, all of them are absolutely horrid, but I do improve upon them after experiencing more edge cases in other semesters. The "final" patterns I used are:
* names and pronouns: `'<h3 class="staff-name">[\s\n]*([a-zA-Z\u00C0-\u00ff\-\(\) ]+)\s+<div class="badge badge-info">([a-zA-Z\/ ]*)<\/div>'`
    * accounts for special characters, hyphens, multi-spaces
    * combined matching pattern with pronouns, simply because I wanted to
    * pronoun matcher accounts for some *very weird* edge cases (listed in later sections)
    * some non-generalized patterns were inevitable
* bio: `'<li class="section bio">([\S\s]{0,500})<\/li>'`
    * mostly stayed the same, which became a constant thorn while I was doing this project, led to the use of some non-generalized patterns

Some non-generalized patterns used in this project:
* `su22_bio_pattern = '<li class="section bio">([\S\s]{0,350})<\/li>'`
    * original pattern was overcapturing, this pattern reduces the character limit
* `sp21_bio_pattern = '<li class="section bio">([\S\s]{0,400})<\/li>'`
    * same overcapturing issue, character limit reduced
* `sp21_name_pronoun_pattern = '<h3>[\s\n]*([a-zA-Z\-\(\) ]+)[\s\n]*\(([a-zA-Z\/ ]*)\)'`
    * inevitable, the `html` file for this semester had a different format from that of other semesters, so instead of creating a mess of a regex pattern to try to generalize for this semester as well, I decided to just write another pattern
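One plausible mechanism for the overcapturing (an assumption on my part, since it depends on the actual page): `{0,500}` is greedy, so when another `</li>` sits within 500 characters of a bio's opening tag, the match runs on to that later `</li>`. The sketch below uses made-up markup to show the effect, the "lower the cap" workaround, and a lazy quantifier as an alternative:

```python
import re

# Hypothetical markup: two short bios with another <li> in between.
html = (
    '<li class="section bio">Short bio.</li>\n'
    '<li class="section other">filler</li>\n'
    '<li class="section bio">Another short bio.</li>\n'
)

# Greedy with a generous cap: everything up to the last </li> within reach.
greedy_500 = re.findall(r'<li class="section bio">([\S\s]{0,500})<\/li>', html)

# Greedy with a cap small enough that the next </li> is out of reach.
# The cap has to be tuned per page and would drop any bio longer than it.
greedy_20 = re.findall(r'<li class="section bio">([\S\s]{0,20})<\/li>', html)

# Lazy quantifier: stops at the first </li>, regardless of the cap.
lazy_500 = re.findall(r'<li class="section bio">([\S\s]{0,500}?)<\/li>', html)

print(len(greedy_500), len(greedy_20), len(lazy_500))
```

The trade-off is that a lower cap is quick to apply but page-specific, which matches why the capped patterns above are non-generalized.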

In [3]:
### DEPRECATED CODE ###
# This was code that I initially wrote for extracting names and pronouns separately. In the end
# I decided to just combine the two patterns; it looks neater in my opinion so I kept it.
# Functionality wise I don't think this changes much; instead of accessing two lists containing single
# elements I would just be using a single list containing tuples with two elements.

# # Extract name
# name_pattern = '<h3 class="staff-name">[\s\n]*([a-zA-Z ]+)\n'
# name_list = re.findall(name_pattern, response_string)
# print(name_list)


# # Extract pronouns
# pronoun_pattern = '<div class="badge badge-info">([a-zA-Z]*\/*[a-zA-Z]*\/*[a-zA-Z]*\/*)<\/div>'
# pronoun_list = re.findall(pronoun_pattern, response_string)
# print(pronoun_list)

In [4]:
# Establish patterns from Inspection section
# (raw strings keep regex escapes such as \s and \/ from being read as string escapes)
name_pronoun_pattern = r'<h3 class="staff-name">[\s\n]*([a-zA-Z\u00C0-\u00ff\-\(\) ]+)\s+<div class="badge badge-info">([a-zA-Z\/ ]*)<\/div>'
bio_pattern = r'<li class="section bio">([\S\s]{0,500})<\/li>'

# Takes a regex pattern `pattern`, an input string `input_string`, and a boolean `print_info` that
# controls whether to print the number of matches for debugging purposes.
# Returns the list of matches found by re.findall.
def extract_pattern(pattern, input_string, print_info=False):
    extracted_list = re.findall(pattern, input_string)
    if print_info:
        print("Length of extracted list: {}".format(len(extracted_list)))
    return extracted_list

With my extraction function implemented, I call it below to extract my list of names, pronouns, and bios from Spring 2023 semester's 61A AIs.

In [5]:
print('sp23_name_pronoun_list:')
sp23_name_pronoun_list = extract_pattern(name_pronoun_pattern, sp23_response_string, True)

print('sp23_bio_list:')
sp23_bio_list = extract_pattern(bio_pattern, sp23_response_string, True)

sp23_name_pronoun_list:
Length of extracted list: 130
sp23_bio_list:
Length of extracted list: 130


It seems to be working as intended, $130/130$ matches. I'm sure I'll run into more issues for other semesters, so I'll take care of it then.
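A matching count doesn't show what the lists actually contain, so here is a small reminder of how `re.findall` shapes its result depending on the number of capture groups. The input string is a toy example built from the semester totals in this notebook, not scraped data:

```python
import re

text = "fa22:152 sp22:125 su22:54"

# No capture group: a list of whole matches.
no_group = re.findall(r"[a-z]{2}\d{2}:\d+", text)

# One capture group: a list of that group's strings.
one_group = re.findall(r"([a-z]{2}\d{2}):\d+", text)

# Two capture groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"([a-z]{2}\d{2}):(\d+)", text)

# Tuples unpack directly in a loop, which is how a combined
# name/pronoun list can be consumed later.
for semester, total in two_groups:
    print(semester, int(total))
```

This is why the combined name and pronoun pattern yields a list of `(name, pronoun)` tuples, while the bio pattern yields a plain list of strings.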

### Scraping the AIs of Other Semesters of 61A

Having been in the AI program for three semesters, I'm aware that AIs aren't always hired every semester, and it was only "rebranded" recently. When I first started university, I think they were just called lab assistants; essentially unpaid TAs that Berkeley needed since there wasn't enough teaching staff, but this is a whole other discussion that I won't get into here.

Knowing this, I took a closer look at the AI pages for each recent semester of 61A (as of 4/15/2023), and these were my findings:

CS61A AI Program:
* Fall Semesters
    * [FA22](https://inst.eecs.berkeley.edu/~cs61a/fa22/academic-interns/) - exists
    * [FA21](https://inst.eecs.berkeley.edu/~cs61a/fa21/academic-interns/) - exists
    * [FA20](https://inst.eecs.berkeley.edu/~cs61a/fa20/) - does not exist
* Spring Semesters
    * [SP23](https://inst.eecs.berkeley.edu/~cs61a/sp23/academic-interns/) - exists (scraped it above)
    * [SP22](https://inst.eecs.berkeley.edu/~cs61a/sp22/academic-interns/) - exists
    * [SP21](https://inst.eecs.berkeley.edu/~cs61a/sp21/academic-interns/) - exists BUT pronouns seem to be joined together with name, need a different regex pattern
    * [SP20](https://inst.eecs.berkeley.edu/~cs61a/sp20/academic-interns.html) - exists BUT no pronouns
* Summer Semesters
    * [SU22](https://inst.eecs.berkeley.edu/~cs61a/su22/academic-interns/) - exists
    * [SU21](https://inst.eecs.berkeley.edu/~cs61a/su21/academic-interns/) - *page* exists BUT actual AIs do not

With these observations, I'm going to scrape the pages that contain AIs first, then I'll look closer into the semesters containing special cases and see what to do from there.

#### Scraping FA22 AIs

**Total AIs:** $152$

In running my function on the [CS61A FA22 AI](https://inst.eecs.berkeley.edu/~cs61a/fa22/academic-interns/) page, I ran into some more of the same edge cases:

* Multi-set pronoun, just like SP23, I accept the first set of pronouns
<div>
<img src='img/multi_pronoun_html_fa22.png' width=500>
</div>


* Multiline bio, this one had a lot more newlines compared to that of SP23, forced me to change my bio extraction pattern
<div>
<img src='img/multiline_bio_html_fa22.png' width=750>
</div>

* But the *most egregious* of all...
<div>
<img src='img/comrade_pronoun.png' width=250>
</div>

Nothing against communism but c'mon Winston, you're making my life so much harder.

With all these edge cases, I update my patterns `name_pronoun_pattern` and `bio_pattern`. Then I perform the extraction.

In [6]:
# API call to get the `html` file
fa22_response_string = requests.get('https://inst.eecs.berkeley.edu/~cs61a/fa22/academic-interns/').text

# Extract the name and pronoun list
fa22_name_pronoun_list = extract_pattern(name_pronoun_pattern, fa22_response_string, True)
# Extract the bio list
fa22_bio_list = extract_pattern(bio_pattern, fa22_response_string, True)

Length of extracted list: 152
Length of extracted list: 152


$152/152$ FA22 AIs extracted.

#### Scraping SP22 AIs

**Total AIs:** $125$

Moving on to the [CS61A SP22 AI](https://inst.eecs.berkeley.edu/~cs61a/sp22/academic-interns/) page, I ran into one edge case:

<div>
<img src='img/special_character_name.png' width=500>
</div>

To capture the special character, I referred to this [Stack Overflow post](https://stackoverflow.com/questions/2013451/test-if-string-contains-only-letters-a-z-%C3%A9-%C3%BC-%C3%B6-%C3%AA-%C3%A5-%C3%B8-etc) and adjusted my `name_pronoun_pattern` accordingly.
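A minimal check of what the added `\u00C0-\u00ff` range does and doesn't cover. The names are hypothetical, and the patterns are only the name portion of the full pattern:

```python
import re

ascii_only = r"[a-zA-Z\-\(\) ]+"
with_accents = r"[a-zA-Z\u00C0-\u00ff\-\(\) ]+"

accented = "Zoë Müller"  # hypothetical name with Latin-1 accented letters
beyond = "Łukasz Nowak"  # hypothetical name; "Ł" is U+0141, outside the range

ascii_ok = re.fullmatch(ascii_only, accented) is not None
accent_ok = re.fullmatch(with_accents, accented) is not None
beyond_ok = re.fullmatch(with_accents, beyond) is not None

# The range is a raw code-point span, so it also lets through the two
# non-letters that live inside it: "×" (U+00D7) and "÷" (U+00F7).
symbols_ok = re.fullmatch(with_accents, "×÷") is not None

print(ascii_ok, accent_ok, beyond_ok, symbols_ok)
```

So the range is enough for Western European accents, but names with characters above U+00FF would still need the class widened.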

In [7]:
# API call to get the `html` file
sp22_response_string = requests.get('https://inst.eecs.berkeley.edu/~cs61a/sp22/academic-interns/').text

# Extract the name and pronoun list
sp22_name_pronoun_list = extract_pattern(name_pronoun_pattern, sp22_response_string, True)
# Extract the bio list
sp22_bio_list = extract_pattern(bio_pattern, sp22_response_string, True)

Length of extracted list: 125
Length of extracted list: 125


Yes, I am aware that I am copy-pasting code. I did consider capturing the extraction process within a function, but I think it would be more trouble than it's worth, since I want to be able to assign my extracted lists to a named variable. I guess I could create a function that returns a tuple of the two lists, but then that would just require more work from me to retrieve those lists and write them into a file. So this time, I'll sacrifice aesthetics for ease of use. $125/125$ extracted.
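For reference, the tuple-returning variant I decided against could look roughly like this. `scrape_semester`, `demo_pages`, and `results` are hypothetical names, and the pages are tiny made-up strings standing in for `requests.get(url).text`:

```python
import re

name_pronoun_pattern = r'<h3 class="staff-name">[\s\n]*([a-zA-Z\u00C0-\u00ff\-\(\) ]+)\s+<div class="badge badge-info">([a-zA-Z\/ ]*)<\/div>'
bio_pattern = r'<li class="section bio">([\S\s]{0,500})<\/li>'

def scrape_semester(html, name_pronoun_pattern, bio_pattern):
    """Return (name_pronoun_list, bio_list) for one semester's HTML string."""
    name_pronoun_list = re.findall(name_pronoun_pattern, html)
    bio_list = re.findall(bio_pattern, html)
    return name_pronoun_list, bio_list

# Hypothetical one-entry pages; real input would be the downloaded HTML.
demo_pages = {
    "fa22": '<h3 class="staff-name">\n  Alex Doe\n  <div class="badge badge-info">they/them</div></h3>\n<li class="section bio">Hi!</li>',
    "sp22": '<h3 class="staff-name">\n  Sam Roe\n  <div class="badge badge-info"></div></h3>\n<li class="section bio">Hello!</li>',
}

# One dictionary keyed by semester replaces the per-semester variables.
results = {
    sem: scrape_semester(html, name_pronoun_pattern, bio_pattern)
    for sem, html in demo_pages.items()
}

fa22_names, fa22_bios = results["fa22"]
print(fa22_names, fa22_bios)
```

The cost is exactly the one mentioned above: every later step has to unpack `results[sem]` instead of using a named list directly.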

#### Scraping SU22 AIs

**Total AIs:** $54$

Moving on to the [CS61A SU22 AI](https://inst.eecs.berkeley.edu/~cs61a/su22/academic-interns/) page:

There was an edge case exclusive to the `html` file of these AIs which resulted in my `bio_pattern` capturing too much:

<div>
<img src='img/bio_cap_length_su22.png' width=750>
</div>

I could change `bio_pattern` again, but I've already modified it inefficiently so many times I think I'm just going to create an exclusive bio pattern, `su22_bio_pattern` that captures less to take care of this edge case. Not the most beautiful solution but, hey, it works.

But the edge case that gave me the biggest headache was this:

<div>
<img src='img/name_hyperlink_su22.png' width=1000>
</div>

Someone *hyperlinked* their name...

I really don't want to complicate my current *working* regex pattern any further, so to take care of this problem, I use [Regex101](https://regex101.com/) to determine the position of the capture. Matches start enumerating from $1$ and according to Regex101, this edge case should be match $31$, which means it should sit in array index position $30$. Thus, I will copy the contents of the extracted list from index $0$ to $29$ (thirty names), append this ~~*cursed*~~ special name at index $30$, then append the remaining tail of the extracted list, specifically the original elements in index $30$ to $52$ (twenty-three names), to the end of my new list. This should result in $30 + 1 + 23 = 54$ names in my final list.
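The index arithmetic above can be sanity-checked on a toy list before touching the real data. The placeholder strings below stand in for the $53$ regex matches; `list.insert` is shown only as an equivalent shortcut:

```python
# Stand-in for the 53 matches returned by the regex (placeholder values).
pre_list = [f"name_{i}" for i in range(53)]

# Slice / append / extend, following the steps described above.
spliced = pre_list[0:30]       # indices 0..29, thirty elements
spliced.append("special")      # lands at index 30
spliced.extend(pre_list[30:])  # original indices 30..52, twenty-three elements

# Equivalent shortcut on a copy of the list.
inserted = list(pre_list)
inserted.insert(30, "special")

print(len(spliced), spliced[30], spliced == inserted)  # 54 special True
```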

In [8]:
# API call to get the `html` file
su22_response_string = requests.get('https://inst.eecs.berkeley.edu/~cs61a/su22/academic-interns/').text

# Extract the name and pronoun list to create a "pre" list that needs to be modified later
su22_name_pronoun_pre_list = extract_pattern(name_pronoun_pattern, su22_response_string, True)
# Extract the bio list
su22_bio_pattern = r'<li class="section bio">([\S\s]{0,350})<\/li>'
su22_bio_list = extract_pattern(su22_bio_pattern, su22_response_string, True)

Length of extracted list: 53
Length of extracted list: 54


Moving on to creating the *correct* `su22_name_pronoun_list`:

In [9]:
# Get the first thirty elements
su22_name_pronoun_list = su22_name_pronoun_pre_list[0:30]
# Use list.append() to add the special name, pronoun pair: (Matthew Lee, he/him/his)
su22_name_pronoun_list.append(('Matthew Lee', 'he/him/his'))
# Use list.extend() to add the remaining elements in the tail of the original extracted list
su22_name_pronoun_list.extend(su22_name_pronoun_pre_list[30:])
# Check length of the final list for accuracy
print("Length of modified list: {}".format(len(su22_name_pronoun_list)))

Length of modified list: 54


I guess that wasn't too bad. $53/54$ extracted with $1$ imputation. Moving on to other semesters.

#### Scraping FA21 AIs


**Total AIs:** $132$

Moving on to the [CS61A FA21 AI](https://inst.eecs.berkeley.edu/~cs61a/fa21/academic-interns/) page, I encountered a very interesting edge case:

<div>
<img src='img/missing_pronoun_html_fa21.png' width=500>
<img src='img/missing_pronoun_html_2_fa21.png' width=575>
</div>

There were *two* entries where the pronouns section *does not* show up in the `html` file. Usually if a student did not list their pronouns, the section would still exist in the `html` file, but it would just be empty.

To take care of this issue, I decided to go the manual input route, the same way I solved the ~~*cursed*~~ hyperlinked name issue. With that said, using [Regex101](https://regex101.com/) again helped me determine their supposed positions. From looking at their names and photos provided on the page, I'm also going to commit the 2023 sin of assuming their genders. I've determined that "Irene Geng, she/her/hers" will be in array index position $55$ and "Wonjae Lee, he/him/his" will be in array index position $126$.
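As a cross-check on positions read off Regex101, the missing entries could also be located in code by comparing a name-only pattern against the combined pattern. This is a sketch under assumptions: the markup and names are made up, `name_only_pattern` is a hypothetical helper pattern, and comparing by name would be ambiguous if two AIs shared the same name:

```python
import re

# Hypothetical markup: the second entry has no pronoun badge at all,
# the third has an empty one.
html = '''
<h3 class="staff-name">
  Alex Doe
  <div class="badge badge-info">they/them</div>
</h3>
<h3 class="staff-name">
  Sam Roe
</h3>
<h3 class="staff-name">
  Kim Poe
  <div class="badge badge-info"></div>
</h3>
'''

name_only_pattern = r'<h3 class="staff-name">\s*([a-zA-Z\u00C0-\u00ff\-\(\) ]+)'
name_pronoun_pattern = (
    r'<h3 class="staff-name">\s*([a-zA-Z\u00C0-\u00ff\-\(\) ]+)\s+'
    r'<div class="badge badge-info">([a-zA-Z\/ ]*)<\/div>'
)

# Every name on the page, in page order.
all_names = [n.strip() for n in re.findall(name_only_pattern, html)]
# Names that the combined pattern managed to capture.
matched = {name.strip() for name, _ in re.findall(name_pronoun_pattern, html)}

# (final index, name) for every entry the combined pattern skipped.
missing = [(i, n) for i, n in enumerate(all_names) if n not in matched]
print(missing)
```

The index reported for each skipped entry is its position in the full list, which is the position a manual insert needs.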

In [10]:
# API call to get the `html` file
fa21_response_string = requests.get('https://inst.eecs.berkeley.edu/~cs61a/fa21/academic-interns/').text

# Extract the name and pronoun list
fa21_name_pronoun_pre_list = extract_pattern(name_pronoun_pattern, fa21_response_string, True)
# Extract the bio list
fa21_bio_list = extract_pattern(bio_pattern, fa21_response_string, True)

Length of extracted list: 130
Length of extracted list: 132


Performing the manual list inputs:

In [11]:
# Get the first fifty-five elements
fa21_name_pronoun_list = fa21_name_pronoun_pre_list[0:55]
# Use list.append() to add the first name, pronoun pair: (Irene Geng, she/her/hers)
fa21_name_pronoun_list.append(('Irene Geng', 'she/her/hers'))
# Use list.extend() to add the middle elements of the original extracted list
fa21_name_pronoun_list.extend(fa21_name_pronoun_pre_list[55:125])
# Use list.append() to add the second name, pronoun pair: (Wonjae Lee, he/him/his)
fa21_name_pronoun_list.append(('Wonjae Lee', 'he/him/his'))
# Use list.extend() to add the remaining tail of the original extracted list
fa21_name_pronoun_list.extend(fa21_name_pronoun_pre_list[125:])

# Printing some checks
# Check length of the final list for accuracy
print("Length of modified list: {}".format(len(fa21_name_pronoun_list)))
# Check entries at position 55 and 126
print("Tuple at index 55: {}".format(str(fa21_name_pronoun_list[55])))
print("Tuple at index 126: {}".format(str(fa21_name_pronoun_list[126])))

Length of modified list: 132
Tuple at index 55: ('Irene Geng', 'she/her/hers')
Tuple at index 126: ('Wonjae Lee', 'he/him/his')


Seems like it worked properly. $130/132$ extracted with $2$ imputations.

#### Scraping SP21 AIs

**Total AIs:** $50$

Moving on to the [CS61A SP21 AI](https://inst.eecs.berkeley.edu/~cs61a/sp21/academic-interns/) page.

The `html` file of this page is different from that of the others:

<div>
<img src='img/html_file_sp21.png'>
</div>

This meant that I needed to change my patterns a little bit. From experimenting around, I ran into a similar issue as SU22 for my `bio_pattern`. Thus, I ended up changing this semester's pattern to: `<li class="section bio">([\S\s]{0,400})<\/li>`

Furthermore, the pattern to capture name and pronouns also needed to be changed. I settled on the following pattern: `<h3>[\s\n]*([a-zA-Z\-\(\) ]+)[\s\n]*\(([a-zA-Z\/ ]*)\)`

Nevertheless, the pattern isn't perfect due to encountering the same hyperlink name edge cases as SU22:

<div>
<img src='img/name_hyperlink_sp21.png' width=450>
<img src='img/name_hyperlink_2_sp21.png' width=500>
</div>

Since I'm already in so deep, I will be manually inputting these two values. Again, using [Regex101](https://regex101.com/), I determined that they belong in index positions $16$ and $43$.

In [12]:
# API call to get the `html` file
sp21_response_string = requests.get('https://inst.eecs.berkeley.edu/~cs61a/sp21/academic-interns/').text

# Extract the name and pronoun list
sp21_name_pronoun_pattern = r'<h3>[\s\n]*([a-zA-Z\-\(\) ]+)[\s\n]*\(([a-zA-Z\/ ]*)\)'
sp21_name_pronoun_pre_list = extract_pattern(sp21_name_pronoun_pattern, sp21_response_string, True)
# Extract the bio list
sp21_bio_pattern = r'<li class="section bio">([\S\s]{0,400})<\/li>'
sp21_bio_list = extract_pattern(sp21_bio_pattern, sp21_response_string, True)

Length of extracted list: 48
Length of extracted list: 50


Following the same procedure as FA21, performing the manual list inputs:

In [13]:
# Get the first sixteen elements
sp21_name_pronoun_list = sp21_name_pronoun_pre_list[0:16]
# Use list.append() to add the first name, pronoun pair: (Devin Sze, he/him/his)
sp21_name_pronoun_list.append(('Devin Sze', 'he/him/his'))
# Use list.extend() to add the middle elements of the original extracted list
sp21_name_pronoun_list.extend(sp21_name_pronoun_pre_list[16:42])
# Use list.append() to add the second name, pronoun pair: (Wonjae Lee, he/him/his)
sp21_name_pronoun_list.append(('Wonjae Lee', 'he/him/his'))
# Use list.extend() to add the remaining tail of the original extracted list
sp21_name_pronoun_list.extend(sp21_name_pronoun_pre_list[42:])

# Printing some checks
# Check length of the final list for accuracy
print("Length of modified list: {}".format(len(sp21_name_pronoun_list)))
# Check entries at position 16 and 43
print("Tuple at index 16: {}".format(str(sp21_name_pronoun_list[16])))
print("Tuple at index 43: {}".format(str(sp21_name_pronoun_list[43])))

Length of modified list: 50
Tuple at index 16: ('Devin Sze', 'he/him/his')
Tuple at index 43: ('Wonjae Lee', 'he/him/his')


Looks good. $48/50$ extracted with $2$ imputations. And with that, we are done with scraping all the data that I will be using.

Next is writing it into a `.csv` file.

## Writing the Extracted Data into a `.csv` File

Finally. With all the data successfully scraped and the missing data imputed (done with some personal liberties), I am ready to write the data into a `.csv` file. For my `.csv` file, I plan on formatting it with the following headers:

`name`, `pronoun`, `bio`, `semester`

Below, I implement a Python function to do just that. 

In [14]:
# Python function that writes all the data scraped into a .csv file. Since the data has already been extracted
# in the earlier parts of the notebook, I plan on using those list variables instead of encapsulating the capture
# within this function. This means less work for me, but it also means the inevitable use of the match-case block
# below, which probably isn't the most aesthetically pleasing.
# Nevertheless, the function takes in nothing but writes the data I desire into a .csv file named `ai_61a_data.csv`.
def write_data():
    # open a destination file to write into
    ai_61a_data = open("ai_61a_data.csv", "w")
    # write the headers
    ai_61a_data.write("name,pronoun,bio,semester\n")

    # instantiate a list of semesters to loop through
    sem_list = ['fa22', 'sp22', 'su22', 'fa21', 'sp21']
    # loop through the list
    for sem in sem_list:

        # Initialize empty lists
        curr_name_pronoun_list = []
        curr_bio_list = []

        # Decided to use a match-case block here, probably not the most elegant solution
        # Needed to determine which semester's list to start working on
        match sem:
            case 'fa22':
                curr_name_pronoun_list = fa22_name_pronoun_list
                curr_bio_list = fa22_bio_list
            case 'sp22':
                curr_name_pronoun_list = sp22_name_pronoun_list
                curr_bio_list = sp22_bio_list
            case 'su22':
                curr_name_pronoun_list = su22_name_pronoun_list
                curr_bio_list = su22_bio_list
            case 'fa21':
                curr_name_pronoun_list = fa21_name_pronoun_list
                curr_bio_list = fa21_bio_list
            case 'sp21':
                curr_name_pronoun_list = sp21_name_pronoun_list
                curr_bio_list = sp21_bio_list
        
        # Zip and loop through the two extracted lists
        for name_pronoun_tuple, bio_element in zip(curr_name_pronoun_list, curr_bio_list):

            # instantiate variables for each feature
            name = name_pronoun_tuple[0] # name is the 0th index of the tuple
            pronoun = name_pronoun_tuple[1] # pronoun is the 1st index of the tuple
            bio = bio_element

            ### SOME CLEANING NEEDED. NEED TO GET RID OF ALL COMMAS AND NEWLINES IN BIO ###
            bio = re.sub(',', '', bio)
            bio = re.sub('\n', '', bio)
            
            # concatenate all features together to form tuple
            curr_tuple = name + ',' + pronoun + ',' + bio + ',' + sem + '\n'
            # write tuple into destination file
            ai_61a_data.write(curr_tuple)

    ai_61a_data.close() # close the destination file
            

With `write_data` successfully implemented, I call the function in the following cell: 

In [15]:
# Write the data
write_data()

It ran instantly! With the data scraping done, I move on to EDA and analysis.
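For anyone duplicating this who would rather keep the commas and newlines in the bios instead of stripping them, the standard library's `csv` module quotes such fields automatically. This is a sketch of that alternative, not what `write_data` does: the rows are hypothetical, and an in-memory buffer stands in for the output file:

```python
import csv
import io

# Hypothetical rows in the same (name, pronoun, bio, semester) layout.
rows = [
    ("Alex Doe", "they/them", "I like hiking, cooking,\nand chess.", "fa22"),
    ("Sam Roe", "", "Hello!", "sp22"),
]

# In-memory stand-in for open("ai_61a_data.csv", "w", newline="").
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["name", "pronoun", "bio", "semester"])
writer.writerows(rows)  # fields containing commas or newlines get quoted

# Reading the data back recovers the original fields unchanged.
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back[1][2] == rows[0][2])
```

The trade-off is that the resulting file has quoted, multi-line fields, so it should be read back with a CSV parser rather than split on commas by hand.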