**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update the relevant sections where you lost those points, and make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Chinmay Bharambe 
- Anshul Govindu 
- Chaela Moraleja 
- Candice Sanchez 
- Praveen Sharma
 

# Research Question

Using UCSD enrollment data since Fall 2022, what combination of course characteristics (fill rate, capacity, time offered) and student factors (class standing, major) best predicts enrollment success rates for undergraduate courses, across all departments, during first- and second-pass registration? 
Can these predictions be used to develop a recommendation tool that optimizes first and second-pass course selection?

## Background and Prior Work

This project attempts to address a major challenge for UCSD students: deciding which classes to enroll in during first and second pass. UCSD’s unique “pass” enrollment system turns course selection into more of an art than a science, often leaving students uncertain about their choices or failing to enroll in certain classes. This process also involves other unusual factors, such as major priority for CSE courses. Overall, there is a definite need for a tool that maximizes students' chances of securing their desired courses.

Upon initial research, we came across a project that collects data on individual classes at different points in time during each term, such as Fall 2022 or Winter 2023; each term’s data is contained within its own repository <a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1). The project involved building a web scraping tool that scrapes WebReg about every 10 minutes and collects near-real-time counts of enrolled, available, and waitlisted spots. This offers not only a tool for collecting our own data in the future, but also a great sample dataset from what has already been collected.

We also found another project that was built using the aforementioned GitHub repositories <a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2). Given a course, the website takes data from specific terms and plots the course availability as a time series across various registration milestones (senior first pass, junior second pass, etc.). This offers a great initial visualization of the enrollment data, and our EDA would likely produce some similar graphs. However, we will have to build upon this with predictive analyses in order to answer our research question.


1. <a name="cite_note-1"></a> [^](#cite_ref-1) https://github.com/UCSD-Historical-Enrollment-Data
2. <a name="cite_note-2"></a> [^](#cite_ref-2) https://www.ucsdregistration.com


# Hypothesis


We predict that the fill rate of a course and the student’s major would be the most influential combination of factors for students deciding which courses to enroll in during first and second passes. Specifically, we predict that a high course fill rate and close relationship between the course and a student’s major would make it more likely to be enrolled in during first pass rather than second pass.

# Setup

In [1]:
import pandas as pd
import numpy as np
import requests
import time
from concurrent.futures import ThreadPoolExecutor
import os
import io

# Data

## Data overview

- Dataset #1
  - Dataset Name: UCSD Historical Enrollment Data
  - Link to the dataset: https://github.com/UCSD-Historical-Enrollment-Data/UCSDHistEnrollData?tab=readme-ov-file
  - Number of observations: 11 quarters of data are recorded; the number of observations per course varies across quarters.
  - Number of variables: 5
    - Time: the date and time the data was recorded
    - Enrolled: number of students enrolled
    - Available: number of seats available
    - Waitlisted: number of students waitlisted
    - Total: total number of seats for the course

This dataset was compiled using an automated web scraper that collected enrollment information from UC San Diego courses, spanning from Fall 2022 through the current quarter. The data is stored in CSV files that are hosted on GitHub. 
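The research question refers to a course's fill rate, which is not one of the five recorded variables but can be derived from them. The sketch below assumes fill rate means enrolled seats divided by total seats; the helper name `add_fill_rate` and the toy values are illustrative only and are not part of the dataset.

```python
import pandas as pd

# Toy rows shaped like the scraped columns (values are made up for illustration)
sample = pd.DataFrame({
    "enrolled": [0, 34, 68],
    "available": [68, 34, 0],
    "waitlisted": [0, 0, 5],
    "total": [68, 68, 68],
})

def add_fill_rate(frame):
    """Return a copy with fill_rate = enrolled / total (NaN when total is 0)."""
    out = frame.copy()
    # Replace a total of 0 with NaN so the division never divides by zero
    total = out["total"].where(out["total"] > 0)
    out["fill_rate"] = out["enrolled"] / total
    return out

sample_with_rate = add_fill_rate(sample)
print(sample_with_rate[["enrolled", "total", "fill_rate"]])
```

A ratio is used rather than the raw enrolled count so that small seminars and large lectures can be compared on the same scale.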

## UCSD Historical Enrollment Data

### Collecting the Data

Initially, collecting data from GitHub seemed straightforward since pd.read_csv() can read web links. However, given that the datasets spanned thousands of subjects, each with many thousands of observations across 11 quarters, using pd.read_csv() alone proved inefficient, with an estimated load time of over 12 hours.

To address this, we first used the chunking mechanism in pd.read_csv() to read the data in smaller segments. This improved speed, but not substantially. We then used the concurrent.futures package to parallelize the data retrieval process, which let us read multiple files simultaneously and significantly improved efficiency<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1). Along the way, we also ran into GitHub API rate limits.
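As a rough illustration of the chunking step, the sketch below reads a CSV in pieces through the `chunksize` parameter of `pd.read_csv()` and stitches the pieces back together. The in-memory CSV text and the helper name `read_in_chunks` are stand-ins invented for this example; the real files come from GitHub.

```python
import io
import pandas as pd

# Stand-in for one course file (ten made-up readings)
csv_text = "time,enrolled,total\n" + "\n".join(
    f"2022-05-18T00:{m:02d}:00,{m},68" for m in range(10)
)

def read_in_chunks(buffer, chunksize=4):
    """Read a CSV piece by piece and concatenate the pieces into one DataFrame."""
    pieces = [chunk for chunk in pd.read_csv(buffer, chunksize=chunksize)]
    return pd.concat(pieces, ignore_index=True)

chunked_df = read_in_chunks(io.StringIO(csv_text))
print(len(chunked_df))
```

Chunking mainly limits how much data is held in memory at once. It does not reduce the number of network requests, which may be why the speed gain was modest.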

To mitigate these API limits, we found that while unauthenticated users are restricted to 60 requests per hour, authenticated users have a much higher limit of 5,000 requests per hour. By adding authentication headers to our requests, we avoided unnecessary restrictions.
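One way to see which limit applies is to read the `X-RateLimit-Limit` and `X-RateLimit-Remaining` headers that GitHub attaches to API responses. The sketch below is a minimal illustration: the helper names `auth_headers` and `remaining_requests` are hypothetical, and the header values passed in are made up rather than taken from a real response.

```python
def auth_headers(token=None):
    """Build request headers; the Authorization header is added only when a token is given."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return headers

def remaining_requests(response_headers):
    """Convert GitHub's X-RateLimit-* response headers into integers."""
    # Header names are case-insensitive, so normalize them before the lookup
    lowered = {key.lower(): value for key, value in response_headers.items()}
    return {
        "limit": int(lowered["x-ratelimit-limit"]),
        "remaining": int(lowered["x-ratelimit-remaining"]),
    }

# Made-up header values, shaped like those on an authenticated response
example = remaining_requests(
    {"X-RateLimit-Limit": "5000", "X-RateLimit-Remaining": "4987"}
)
print(example)
```

With the `requests` library, `response.headers` can be passed straight to `remaining_requests`.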

Another challenge arose when we realized that our API links were fetching directory listings of the CSV files for each quarter, and that GitHub's contents API caps a directory listing at 1,000 files, so we were missing additional CSV files. After some research, we found that the git/trees API gives access to all files within a directory, including those cut off by that cap. Implementing this solution ensured that we retrieved the complete dataset<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2).
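The filtering of a git/trees response can be sketched as below. The response carries a `tree` list of entries and a `truncated` flag, so it may be worth checking that flag before trusting the listing. The helper name `csv_files_in_tree` and the payload are invented for illustration and are not actual API output.

```python
def csv_files_in_tree(tree_payload, folder="overall/"):
    """Return the CSV file names found under `folder` in a git/trees response body."""
    if tree_payload.get("truncated"):
        # GitHub sets this flag when the tree exceeded its own size limit
        print("warning: tree listing was truncated; some files may be missing")
    return [
        item["path"].split("/")[-1]
        for item in tree_payload.get("tree", [])
        if item.get("type") == "blob"
        and item["path"].startswith(folder)
        and item["path"].endswith(".csv")
    ]

# Made-up payload with the same shape as a recursive tree response
fake_payload = {
    "truncated": False,
    "tree": [
        {"path": "overall", "type": "tree"},
        {"path": "overall/AAS 10.csv", "type": "blob"},
        {"path": "overall/CSE 8A.csv", "type": "blob"},
        {"path": "README.md", "type": "blob"},
    ],
}

names = csv_files_in_tree(fake_payload)
print(names)
```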

Finally, to prevent the need to rerun this entire process every time the notebook's kernel restarts, we saved the dataset as enrollment_data.csv. If this file exists in the current working directory, the data-loading process is skipped, thereby optimizing efficiency.
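The caching idea can be sketched as a small load-or-build helper. This is a simplified illustration under assumed names (`load_or_build`, `fake_build`); a temporary directory stands in for the working directory so the example leaves no files behind.

```python
import os
import tempfile
import pandas as pd

def load_or_build(cache_path, build):
    """Read the cached CSV if it exists; otherwise build the data and cache it."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path, parse_dates=["time"])
    frame = build()
    frame.to_csv(cache_path, index=False)
    return frame

calls = []  # records how many times the expensive step actually runs

def fake_build():
    """Stand-in for the slow download step."""
    calls.append(1)
    return pd.DataFrame({"time": pd.to_datetime(["2022-05-18"]), "enrolled": [0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "enrollment_data.csv")
    first = load_or_build(path, fake_build)   # builds and writes the cache
    second = load_or_build(path, fake_build)  # reads the cache, no rebuild

print(len(calls))
```

Writing with `index=False` keeps the row index out of the file, so the cached copy reloads with the same columns it was saved with.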

1. <a name="cite_note-1"></a> [^](#cite_ref-1) https://medium.com/@smrati.katiyar/introduction-to-concurrent-futures-in-python-009fe1d4592c
2. <a name="cite_note-2"></a> [^](#cite_ref-2) https://docs.github.com/en/rest/repos/contents?apiVersion=2022-11-28

In [13]:
# list of all the repo-links that host the data for each quarter in a csv file
repo_links = [
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2022Fall/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2023Winter/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2023Spring/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2023Fall/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2024Winter/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2024Spring/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2024Summer1/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2024Summer2/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2024Summer3/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2024Fall/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2025Winter/contents/overall',
      ]

quarter_names = ['FA 22','WI 23', 'SP 23', 'FA 23', 'WI 24', 'SP 24', 'S1 24', 'S2 24', 'S3 24', 'FA 24', 'WI 25']

In [14]:
# GitHub token for addressing the limit on GitHub API rates.
# Recommended: store the token in an environment variable (assumed here to be named GITHUB_TOKEN) for improved security.
# Alternatively, one can simply replace the placeholder string below with the token.
GITHUB_TOKEN = os.environ.get('GITHUB_TOKEN', 'put in your token')

In [15]:

# check if the enrollment data file already exists
if os.path.exists('enrollment_data.csv'):
    # reuse the cached file written by load_data() instead of downloading everything again
    df = pd.read_csv('enrollment_data.csv', parse_dates=['time'])

else:
    # this function processes the data quarter by quarter
    def process_quarter(repo_link, quarter_name):
        # try-except block to handle errors
        try:
            # extract repo name from the API URL
            repo_name = repo_link.split('/')[5]  
            
            # construct the tree API URL to get all the files, including those cut off in the directory listing
            tree_url = f"https://api.github.com/repos/UCSD-Historical-Enrollment-Data/{repo_name}/git/trees/main?recursive=1"
            
            # add headers to account for GitHub API rate limiting
            headers = {
                'Accept': 'application/vnd.github.v3+json',
                'Authorization': f'token {GITHUB_TOKEN}'
            }
            
            # get the tree
            response = requests.get(tree_url, headers=headers)
            
            # if request was unsuccessful print error message
            if response.status_code != 200:
                print(f"failed to access {tree_url}")
                print(f"Response: {response.text}")
                return None
                
            # get all files from the 'overall' directory
            all_files = [item['path'].split('/')[-1] 
                        for item in response.json()['tree'] 
                        if item['path'].startswith('overall/') and item['path'].endswith('.csv')]
                        
            # process multiple files in parallel to make the process faster and more efficient
            dfs = []
            with ThreadPoolExecutor(max_workers=5) as executor:
                # create a list of futures where each future represents a file being processed
                futures = [executor.submit(process_file, file, repo_name, quarter_name) 
                        for file in all_files]
                
                # loop through completed futures and collect results
                for future in futures:  
                    try:
                        df = future.result()
                        if df is not None:
                            dfs.append(df)
                    except Exception as e:
                        # print error in case of an error
                        print(f"error in future: {str(e)}")
            
            # if dfs is not empty, concatenate all the dfs for that quarter and return that
            if dfs:
                return pd.concat(dfs, ignore_index=True)
            return None
            
        except Exception as e:
            # print error msg if it occurs
            print(f"error processing quarter {quarter_name}: {str(e)}")
            return None

    # function that reads the csv file and makes it into a df
    def process_file(file, repo_name, quarter_name):
        print(file)
        # try-except block to handle errors
        try:
            # convert file name to the format seen in the url
            file_url = file.replace(' ','%20')
            
            # raw csv file link
            raw_url = f"https://raw.githubusercontent.com/UCSD-Historical-Enrollment-Data/{repo_name}/main/overall/{file_url}"
            
            # add authentication headers
            headers = {
                'Accept': 'application/vnd.github.v3+json',
                'Authorization': f'token {GITHUB_TOKEN}'
            }
            
            # Read the csv files with authentication
            response = requests.get(raw_url, headers=headers)
            response.raise_for_status()
            
            # read csv file into a pandas df
            df = pd.read_csv(
                io.StringIO(response.text),
                sep=',',              # the separator
                encoding='utf-8',     # specify the character encoding
                parse_dates=['time'], # parse dates as datetime objects as they are being read to save time
                usecols=['time', 'enrolled', 'available', 'waitlisted', 'total'] # specify column names to improve efficiency
            )
            
            if not df.empty:
                # group df at a frequency of every 12 hrs to get up to 2 readings for each day
                df = df.groupby(pd.Grouper(key='time', freq='12h')).first().reset_index()
                # drop 12-hour windows with no readings (they come back as all-NaN rows)
                df = df.dropna(subset=['enrolled'])
                # add course column that is readable
                df['course'] = file.replace('.csv', '').replace('%20',' ')
                # add a column that stores the quarter name
                df['quarter'] = quarter_name
                return df
            return None
            
        except Exception as e:
            # if there is an error, print it
            print(f"error processing {file}: {str(e)}")
            return None

    def load_data():
        # list that will store the df for every quarter
        all_quarter_dfs = []

        # loop through each quarter and process its data
        for repo_link, quarter_name in zip(repo_links, quarter_names):  
            # delay to avoid hitting GitHub's rate limits
            if all_quarter_dfs:
                time.sleep(5)

            # process the current quarter's data
            quarter_df = process_quarter(repo_link, quarter_name)

            # append data to all_quarter_dfs if df is not empty
            if quarter_df is not None:
                all_quarter_dfs.append(quarter_df)
                
                # save progress after each quarter in case the program crashes            
                temp_df = pd.concat(all_quarter_dfs, ignore_index=True)
                temp_df.to_csv('enrollment_data_temp.csv', 
                            index=False,
                            encoding='utf-8')
        
        # save the final complete dataset
        if all_quarter_dfs:
            combined_df = pd.concat(all_quarter_dfs, ignore_index=True)
            combined_df.to_csv('enrollment_data.csv', 
                            index=False,
                            encoding='utf-8')
            return combined_df
        return None

    # Run the load_data function
    df = load_data()

While reading the data from GitHub, we did not realize that some graduate-level courses were included in this dataset. Therefore, we exclude these courses below. 

In [18]:
# Extract the course number and convert to integer
df['course_number'] = df['course'].str.extract(r'(\d+)', expand=False).astype(int)

# Filter out graduate courses where the number is >=200
df = df[df['course_number'] < 200]

# Drop the temporary course_number column 
df = df.drop('course_number', axis=1)

df

| | time | enrolled | available | waitlisted | total | course | quarter |
|---|---|---|---|---|---|---|---|
| 0 | 2022-05-18 00:00:00 | 0.0 | 68.0 | 0.0 | 68.0 | AAS 10 | FA 22 |
| 1 | 2022-05-18 12:00:00 | 0.0 | 68.0 | 0.0 | 68.0 | AAS 10 | FA 22 |
| 2 | 2022-05-19 00:00:00 | 0.0 | 68.0 | 0.0 | 68.0 | AAS 10 | FA 22 |
| 3 | 2022-05-19 12:00:00 | 0.0 | 68.0 | 0.0 | 68.0 | AAS 10 | FA 22 |
| 4 | 2022-05-20 00:00:00 | 0.0 | 68.0 | 0.0 | 68.0 | AAS 10 | FA 22 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3069658 | 2025-01-25 00:00:00 | 319.0 | 1.0 | 12.0 | 320.0 | WCWP 10B | WI 25 |
| 3069659 | 2025-01-25 12:00:00 | 319.0 | 1.0 | 12.0 | 320.0 | WCWP 10B | WI 25 |
| 3069660 | 2025-01-26 00:00:00 | 319.0 | 1.0 | 12.0 | 320.0 | WCWP 10B | WI 25 |
| 3069661 | 2025-01-31 00:00:00 | 319.0 | 1.0 | 12.0 | 320.0 | WCWP 10B | WI 25 |
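The undergraduate filter can be sanity-checked on a handful of course labels. The sketch below assumes that the first run of digits in a label is the course number and that numbers of 200 and above mark graduate courses, as in the filtering cell; the labels themselves were chosen only for illustration.

```python
import pandas as pd

# Illustrative labels: four undergraduate courses and two graduate ones
courses = pd.Series(
    ["AAS 10", "CSE 8A", "COGS 108", "WCWP 10B", "MATH 200A", "CSE 291"]
)

# Take the first run of digits from each label and convert it to an integer
numbers = courses.str.extract(r"(\d+)", expand=False).astype(int)

# Keep only courses numbered below 200
undergrad = courses[numbers < 200].tolist()
print(undergrad)
```

Letter suffixes such as the "A" in "CSE 8A" are ignored by the pattern, so sequenced courses are kept or dropped according to their number alone.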


# Ethics & Privacy

- Thoughtful discussion of ethical concerns included
- Ethical concerns consider the whole data science process (question asked, data collected, data being used, the bias in data, analysis, post-analysis, etc.)
- How your group handled bias/ethical concerns clearly described

Acknowledge and address any ethics & privacy related issues of your question(s), proposed dataset(s), and/or analyses. Use the information provided in lecture to guide your group discussion and thinking. If you need further guidance, check out [Deon's Ethics Checklist](http://deon.drivendata.org/#data-science-ethics-checklist). In particular:

- Are there any biases/privacy/terms of use issues with the data you proposed?
- Are there potential biases in your dataset(s), in terms of who it composes, and how it was collected, that may be problematic in terms of it allowing for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?)
- How will you set out to detect these specific biases before, during, and after/when communicating your analysis?
- Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?
- How will you handle issues you identified?

# Team Expectations 


Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Team Expectation 1*
* *Team Expectation 2*
* *Team Expectation 3*
* ...

# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor) | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |