**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Chinmay Bharambe 
- Anshul Govindu 
- Chaela Moraleja 
- Candice Sanchez 
- Praveen Sharma
 

# Research Question

Using UCSD enrollment data since Fall 2022, what combination of course characteristics (fill rate, capacity, quarter) and student factors (class standing/units, which determines the enrollment start date), best predict the number of open seats, for undergraduate courses, across all departments, during first and second-pass registrations? 

Can these predictions be used to develop a recommendation tool that optimizes first and second-pass course selection? 

## Background and Prior Work

This project attempts to address a major challenge for UCSD students: deciding which classes to enroll in during first and second pass. UCSD’s unique “pass” enrollment system turns course selection into more of an art than a science, often leaving students uncertain about their choices or failing to enroll in certain classes. This process also involves other unusual factors, such as major priority for CSE courses. Overall, there is a definite need for a tool that maximizes students' chances of securing their desired courses.

Upon initial research, we came across a research paper on using Machine Learning Methods for Course Enrollment Prediction <a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) at San Diego State University. Their primary focus was to predict course enrollment rates based on demographic and academic performance data. Although these variables are not an element of our research, their methodology with student data is applicable. For example, they considered generic variables like major and prerequisites, and incorporated predictive models like classification and regression trees. Therefore, we can build on this analysis with similar ML and statistical approaches that consider more course and student-specific factors, such as the ones mentioned in the research question.

We also found two projects directly related to UCSD enrollment. This project collects data on individual classes at different points in time during each term, such as Fall 2022 or Winter 2023; each term’s data is contained within its own repository <a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2). The project involved building a web scraping tool that scrapes web-reg about every 10 minutes and collects real-time data on information like enrolled, available, and waitlist spots. This not only offers a tool to collect our data in the future but also a great sample dataset from what has already been collected.

The second project was built using the aforementioned GitHub repositories  <a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3). Given a course, the website takes data from specific terms and plots the course availability as a time series across various registration milestones (senior first pass, junior second pass, etc). This offers a great initial visualization of the enrollment data, and our EDA would likely produce some similar graphs. However, our objective is to quantify the relationship between student/course factors and course availability and use our analysis to develop a recommendation system that helps students prioritize courses for first and second pass. Additionally, we plan on conducting our research on data collected across 11 quarters as compared to only one quarter shown in the stated project. We believe this is significant because there may exist quarterly patterns for some classes that a single quarter would fail to grasp.


1. <a name="cite_note-1"></a> [^](#cite_ref-1) https://par.nsf.gov/servlets/purl/10389427 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) https://github.com/UCSD-Historical-Enrollment-Data
3. <a name="cite_note-3"></a> [^](#cite_ref-3) https://www.ucsdregistration.com


# Hypothesis


We predict that a course’s fill rate and student class standing would be the most influential combination of factors for students deciding on courses to enroll in and directly influence the number of open seats remaining during first and second pass. More specifically, we predict that courses with a higher fill rate and later enrollment period (due to higher class standing) would mean there are fewer seats available, and thus more likely to reach full capacity during first pass rather than second pass.

# Setup

The code block below consists of all the libraries and packages we use in this project

In [1]:
import pandas as pd
import numpy as np
import requests
import time
from concurrent.futures import ThreadPoolExecutor
import os
import io
import requests
from bs4 import BeautifulSoup

# Data

## Data overview

- Dataset #1
  - Dataset Name: UCSD Historical Enrollment Data
  - Link to the dataset: https://github.com/UCSD-Historical-Enrollment-Data/UCSDHistEnrollData?tab=readme-ov-file
  - Number of observations: 11 quarters of data is recorded, the number of observations for subjects across the quarters is inconsistent.
  - Number of variables: There are 5 variables recorded: 
    - Time : The date and time the data was recorded
    - Enrolled : Number of students enrolled
    - Available : Number of seats available
    - Waitlisted : number of students waitlisted
    - Total : total seats available for the course 

This dataset was compiled using an automated web scraper that collected enrollment information from UC San Diego courses, spanning from Fall 2022 through to the current quarter (Winter 2025). The data is stored in CSV files that are hosted on GitHub.


### Dataset Structure and Organization

The UCSD Historical Enrollment Data is systematically organized within a GitHub repository, where each academic term is represented by a dedicated repository. This structure facilitates easy access to term-specific enrollment data. The data collection process is such that it captures enrollment statistics at regular intervals, providing a detailed view of how course enrollment evolves throughout the registration period.

The granularity of the data collection, which is approximately every 10 minutes during active enrollment periods, offers in depth insight into enrollment patterns, though for the purpose of our analysis, we will be implementing a more manageable sampling frequency. 

### Data Quality and Completeness

The dataset encompasses all undergraduate courses offered at UC San Diego across eleven quarters, providing a comprehensive view of enrollment patterns. While the number of observations varies between courses and quarters—primarily due to differences in course offerings and enrollment period durations — the consistency in variable recording ensures data compatibility across all terms.

### Data Processing Considerations
For our analysis, several key processing steps will be implemented:
1. Temporal aggregation to reduce unnecessary granularity while maintaining ensuring the trend of the data is captured accurately
2. Consistency of course codes to enable cross-quarter analysis
3. Enrollment phase demarcations (first pass, second pass, waitlist periods)
4. Creation of derived metrics such as fill rates to enhance analysis capabilities
5. Assignment of class standings (First-Year, Sophomore, Junior, Senior) to reflect enrollment priority hierarchies.

This dataset serves as an invaluable resource for understanding UC San Diego's enrollment patterns, offering insights that can inform both administrative decision-making and student course planning strategies.

## UCSD Historical Enrollment Data

### Collecting the Data

The initial approach to data collection from GitHub appeared straightforward, utilizing the pandas' read_csv() method to access the datasets hosted on GitHub. However, the extensive scope of the dataset—encompassing thousands of subjects with multiple observations across eleven academic quarters rendered this method inefficient, with projected data retrieval times exceeding twelve hours.

To enhance performance, we implemented several optimization strategies. First, we employed the chunking mechanism within read_csv() to process data in segments. While this modification yielded some improvement in processing speed, the enhancement was marginal for our requirements. Furthermore, we encountered limitations imposed by GitHub's API rate restrictions.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1). 

To address these constraints, we implemented parallel processing using the concurrent.futures package, enabling simultaneous retrieval of multiple files. This significantly improved operational efficiency<a name="cite_ref-1"></a><sup>1</sup>. Additionally, we found that GitHub's API authentication system offered substantially higher rate limits, 5,000 requests per hour, for authenticated users compared to 60 for unauthenticated users. Therefore we implementated authentication headers in our requests to effectively circumvent the restrictions.

A subsequent challenge emerged regarding data completeness. Our initial API implementation for retrieving directory contents was subject to GitHub's truncation limit of 999 files per directory, resulting in incomplete data collection. Through further research, we identified that the git/trees API provided access to the complete file directory, including previously truncated entries. This solution ensured we collected complete data<a name="cite_ref-2"></a><sup>2</sup>.

To optimize computational efficiency and eliminate redundant processing, we stored the collected data in enrollment_data.csv. The presence of this file in the working directory enables the system to skip the data collection process during subsequent notebook kernel restarts.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) https://medium.com/@smrati.katiyar/introduction-to-concurrent-futures-in-python-009fe1d4592c
2. <a name="cite_note-2"></a> [^](#cite_ref-2) https://docs.github.com/en/rest/repos/contents?apiVersion=2022-11-28

In [2]:
# list of all the repo-links that host the data for each quarter in a csv file
repo_links = [
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2022Fall/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2023Winter/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2023Spring/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2023Fall/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2024Winter/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2024Spring/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2024Summer1/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2024Summer2/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2024Summer3/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2024Fall/contents/overall',
    'https://api.github.com/repos/UCSD-Historical-Enrollment-Data/2025Winter/contents/overall',
      ]

quarter_names = ['FA 22','WI 23', 'SP 23', 'FA 23', 'WI 24', 'SP 24', 'S1 24', 'S2 24', 'S3 24', 'FA 24', 'WI 25']

In [3]:
# github token for adressing the limit on github api rates. 
# recommended to create an environment varaible to store this for improved security. Alternatively, one can simply add the github token below
GITHUB_TOKEN = 'put in your token'  

In [6]:

# check if the new enrollment data file already exists
if os.path.exists('enrollment_data.csv'):
    df = pd.read_csv('enrollment_data.csv') 
else:
    # this function processes the data quarter by quarter
    def process_quarter(repo_link, quarter_name):
        # try-except block to handle errors
        try:
            # extract repo name from the API URL
            repo_name = repo_link.split('/')[5]  
            
            # construct tree API URL to gett all the files that are hidden as well
            tree_url = f"https://api.github.com/repos/UCSD-Historical-Enrollment-Data/{repo_name}/git/trees/main?recursive=1"
            
            # add headers to account for GitHub API rate limiting
            headers = {
                'Accept': 'application/vnd.github.v3+json',
                'Authorization': f'token {GITHUB_TOKEN}'
            }
            
            # get the tree
            response = requests.get(tree_url, headers=headers)
            
            # if request was unsuccessful print error message
            if response.status_code != 200:
                print(f"failed to access {tree_url}")
                print(f"Response: {response.text}")
                return None
                
            # get all files from the 'overall' directory
            all_files = [item['path'].split('/')[-1] 
                        for item in response.json()['tree'] 
                        if item['path'].startswith('overall/') and item['path'].endswith('.csv')]
                        
            # process multiple files in parallel to make the process faster and more efficient
            dfs = []
            with ThreadPoolExecutor(max_workers=5) as executor:
                # create a list of futures where each future represents a file being processed
                futures = [executor.submit(process_file, file, repo_name, quarter_name) 
                        for file in all_files]
                
                # loop through completed futures and collect results
                for future in futures:  
                    try:
                        df = future.result()
                        if df is not None:
                            dfs.append(df)
                    except Exception as e:
                        # print error in case of an error
                        print(f"error in future: {str(e)}")
            
            # if dfs is not empty, concatenate all the dfs for that quarter and return that
            if dfs:
                return pd.concat(dfs, ignore_index=True)
            return None
            
        except Exception as e:
            # print error msg if it occurs
            print(f"error processing quarter {quarter_name}: {str(e)}")
            return None

    # function that reads the csv file and makes it into a df
    def process_file(file, repo_name, quarter_name):
        print(file)
        # try-except block to handle errors
        try:
            # convert file name to the format seen in the url
            file_url = file.replace(' ','%20')
            
            # raw csv file link
            raw_url = f"https://raw.githubusercontent.com/UCSD-Historical-Enrollment-Data/{repo_name}/main/overall/{file_url}"
            
            # add authentication headers
            headers = {
                'Accept': 'application/vnd.github.v3+json',
                'Authorization': f'token {GITHUB_TOKEN}'
            }
            
            # Read the csv files with authentication
            response = requests.get(raw_url, headers=headers)
            response.raise_for_status()
            
            # read csv file into a pandas df
            df = pd.read_csv(
                io.StringIO(response.text),
                sep=',',              # the separator
                encoding='utf-8',     # specify the character encoding
                parse_dates=['time'], # parse dates as datetime objects as they are being read to save time
                usecols=['time', 'enrolled', 'available', 'waitlisted', 'total'] # specify column names to improve efficiency
            )
            
            if not df.empty:
                # add course column that is readable
                df['course'] = file.replace('.csv', '').replace('%20',' ')
                # group df at a frequency of every 12 hrs to get 2 readings for each day
                df = df.groupby(pd.Grouper(key='time', freq='12h')).first().reset_index()
                # add a column that stores the quarter name
                df['quarter'] = quarter_name
                return df
            return None
            
        except Exception as e:
            # if there is an error, print it
            print(f"error processing {file}: {str(e)}")
            return None

    def load_data():
        # list that will store the df for every quarter
        all_quarter_dfs = []

        # loop through each quarter and process its data
        for repo_link, quarter_name in zip(repo_links, quarter_names):  
            # delay to avoid hitting githubs rate limits
            if all_quarter_dfs:
                time.sleep(5)

            # process the current quarters data   
            quarter_df = process_quarter(repo_link, quarter_name)

            # append data to all_quarter_dfs if df is not empty
            if quarter_df is not None:
                all_quarter_dfs.append(quarter_df)
                
                # save progress after each quarter in case the program crashes            
                temp_df = pd.concat(all_quarter_dfs, ignore_index=True)
                temp_df.to_csv('enrollment_data_temp.csv', 
                            index=False,
                            encoding='utf-8')
        
        # save the final complete dataset
        if all_quarter_dfs:
            combined_df = pd.concat(all_quarter_dfs, ignore_index=True)
            combined_df.to_csv('enrollment_data.csv', 
                            index=False,
                            encoding='utf-8')
            return combined_df
        return None

    # run the load_data function
    df = load_data()

The file created by the function load_data() was too large to upload onto GitHub (more than 150 mb) hence we uploaded it to google drive. 

The raw data we collected can be found here: https://drive.google.com/file/d/1Xv0GoHwTJ19rF9oBCIs7jF0etkzkJPhy/view?usp=drive_link




While reading the data from github we did not realise that there were some graduate level courses included in this datasets. Therefore, we will exclude these courses below. 

In [7]:
# extract the course number and convert to integer
df['course_number'] = df['course'].str.extract('(\d+)').astype(int)

# filter out graduate courses where the number is >=200
df = df[df['course_number'] < 200]

# drop the temporary course_number column 
df = df.drop('course_number', axis=1)

df

  df['course_number'] = df['course'].str.extract('(\d+)').astype(int)


Unnamed: 0,time,enrolled,available,waitlisted,total,course,quarter
0,2022-05-18 00:00:00,0.0,68.0,0.0,68.0,AAS 10,FA 22
1,2022-05-18 12:00:00,0.0,68.0,0.0,68.0,AAS 10,FA 22
2,2022-05-19 00:00:00,0.0,68.0,0.0,68.0,AAS 10,FA 22
3,2022-05-19 12:00:00,0.0,68.0,0.0,68.0,AAS 10,FA 22
4,2022-05-20 00:00:00,0.0,68.0,0.0,68.0,AAS 10,FA 22
...,...,...,...,...,...,...,...
3069658,2025-01-25 00:00:00,319.0,1.0,12.0,320.0,WCWP 10B,WI 25
3069659,2025-01-25 12:00:00,319.0,1.0,12.0,320.0,WCWP 10B,WI 25
3069660,2025-01-26 00:00:00,319.0,1.0,12.0,320.0,WCWP 10B,WI 25
3069661,2025-01-31 00:00:00,319.0,1.0,12.0,320.0,WCWP 10B,WI 25


### Data Processing 

Let's first make sure that the time column is consistently formatted.

In [8]:
df['time'] = pd.to_datetime(df['time'])

Then, in order to extract further information, we need to know when registration opens for each quarter.

In order to do this, we will scrape UCSD's publically available yearly "enrollment and registration calendars" to determine the first day of enrollment for each quarter.

In [10]:
calendar_links = [
    'https://blink.ucsd.edu/instructors/courses/enrollment/calendars/2022.html',    # 2022 - 2023
    'https://blink.ucsd.edu/instructors/courses/enrollment/calendars/2023.html',    # 2023 - 2024
    'https://blink.ucsd.edu/instructors/courses/enrollment/calendars/2024.html'     # 2024 - 2025
]

def process_calendar(link, yr):
    r = requests.get(link)
    soup = BeautifulSoup(r.content, 'html.parser')

    # The website has a table with every important date
    table = soup.find('table')

    # Store enrollment start dates w/ same formatting as the 'quarter' column of the df
    dates = {}

    # Iterate through table rows
    for row in table.find_all('tr'): 
        cells = row.find_all('td')
        if cells:
            label = cells[0].get_text(strip=True)
            if 'Enrollment begins' in label:
                dates['FA' + " " + str(yr)] = cells[1].get_text(strip=True) + "/" + str(yr)

                s = str(yr + 1)
                dates['WI' + " " + s] = cells[2].get_text(strip=True) + "/" + s

                if yr < 24: 
                    dates['SP' + " " + s] = cells[3].get_text(strip=True) + "/" + s

                if yr == 23:
                    dates['S1' + " " + s] = cells[4].get_text(strip=True) + "/" + s
                    dates['S2' + " " + s] = cells[4].get_text(strip=True) + "/" + s
                    dates['S3' + " " + s] = cells[4].get_text(strip=True) + "/" + s
                break
    return dates

enrollment_starts = {}
year = 22
for link in calendar_links:
    enrollment_starts.update(process_calendar(link, year))
    year += 1
enrollment_starts

{'FA 22': '5/20/22',
 'WI 23': '11/7/23',
 'SP 23': '2/18/23',
 'FA 23': '5/26/23',
 'WI 24': '11/14/24',
 'SP 24': '2/17/24',
 'S1 24': '4/15/24',
 'S2 24': '4/15/24',
 'S3 24': '4/15/24',
 'FA 24': '5/24/24',
 'WI 25': '11/12/25'}

Convert the month/day/year format into a datetime object for consistent formatting

In [11]:
for key in enrollment_starts:
    enrollment_starts[key] = pd.to_datetime(enrollment_starts[key], format='%m/%d/%y')

enrollment_starts

{'FA 22': Timestamp('2022-05-20 00:00:00'),
 'WI 23': Timestamp('2023-11-07 00:00:00'),
 'SP 23': Timestamp('2023-02-18 00:00:00'),
 'FA 23': Timestamp('2023-05-26 00:00:00'),
 'WI 24': Timestamp('2024-11-14 00:00:00'),
 'SP 24': Timestamp('2024-02-17 00:00:00'),
 'S1 24': Timestamp('2024-04-15 00:00:00'),
 'S2 24': Timestamp('2024-04-15 00:00:00'),
 'S3 24': Timestamp('2024-04-15 00:00:00'),
 'FA 24': Timestamp('2024-05-24 00:00:00'),
 'WI 25': Timestamp('2025-11-12 00:00:00')}

Now we can perform the following task.

- The dataset contains datapoints from before the enrollment period opens. This may skew the data, making it appear as if open slots are available for longer than they actually are.
- We will remove these datapoints.

In [12]:
df = df[df['quarter'].map(enrollment_starts) <= df['time']]
df

Unnamed: 0,time,enrolled,available,waitlisted,total,course,quarter
4,2022-05-20 00:00:00,0.0,68.0,0.0,68.0,AAS 10,FA 22
5,2022-05-20 12:00:00,1.0,67.0,0.0,68.0,AAS 10,FA 22
6,2022-05-21 00:00:00,3.0,65.0,0.0,68.0,AAS 10,FA 22
7,2022-05-21 12:00:00,5.0,63.0,0.0,68.0,AAS 10,FA 22
8,2022-05-22 00:00:00,6.0,62.0,0.0,68.0,AAS 10,FA 22
...,...,...,...,...,...,...,...
2771056,2024-11-22 12:00:00,196.0,3.0,5.0,199.0,WCWP 10B,FA 24
2771057,2024-11-23 00:00:00,196.0,3.0,5.0,199.0,WCWP 10B,FA 24
2771058,2024-11-23 12:00:00,196.0,3.0,5.0,199.0,WCWP 10B,FA 24
2771059,2024-11-24 00:00:00,196.0,3.0,5.0,199.0,WCWP 10B,FA 24


We also need columns that measure enrollment priority (senior, junior...) and pass # (1, 2), as these are integral parts of our research question. These can easily be derived with respect to each quarter's enrollment start date.

In [13]:
# Enrollment start dates for incoming students
fa_fresh_start = {
    'FA 22' : pd.to_datetime("8-17-2022"),
    'FA 23' : pd.to_datetime("8-28-2023"),
    'FA 24' : pd.to_datetime("8-12-2024")
}

def registration_priority(date, quarter):

    # Ensure consistent formatting 
    enrollment_start_date = pd.to_datetime(enrollment_starts[quarter])
    date = pd.to_datetime(date)
    days = int((date - enrollment_start_date).days) # number of days since enrollment has been open
    
    pass_num = 0   
    priority = 0 

    # first week of registration is first pass                               
    if 0 <= days and days < 7:
        pass_num = 1            
        if days == 0:
            priority = 1    # first day of each pass is senior enrollment (1)
        elif days == 1:
            priority = 2    # second day = junior (2)
        elif days == 2: 
            priority = 3    # soph (3)
        else:
            priority = 4    # fresh (4)

    # second week is second pass
    elif 7 <= days and days < 14:
        pass_num = 2 
        if days == 7:
            priority = 1
        elif days == 8:
            priority = 2
        elif days == 9:
            priority = 3
        else:
            priority = 4 

    # afterwards, registration is open to all
    else:
        # these values are just placeholders to represent that anybody can enroll
        pass_num = 3 
        priority = 6        

    # Incoming freshman enrollment is unique
    if 'FA' in quarter:
        days = int((date - fa_fresh_start[quarter]).days)
        
        if 0 <= days and days < 7:
            pass_num = 1
            priority = 5
        elif 7 <= days and days < 14:
            pass_num = 2
            priority = 5
    
    return pass_num, priority

In [14]:
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=pd.errors.SettingWithCopyWarning)
    df[['pass', 'priority']] = df.apply(lambda row: registration_priority(row['time'], row['quarter']), axis=1, result_type='expand')
df

Unnamed: 0,time,enrolled,available,waitlisted,total,course,quarter,pass,priority
4,2022-05-20 00:00:00,0.0,68.0,0.0,68.0,AAS 10,FA 22,1,1
5,2022-05-20 12:00:00,1.0,67.0,0.0,68.0,AAS 10,FA 22,1,1
6,2022-05-21 00:00:00,3.0,65.0,0.0,68.0,AAS 10,FA 22,1,2
7,2022-05-21 12:00:00,5.0,63.0,0.0,68.0,AAS 10,FA 22,1,2
8,2022-05-22 00:00:00,6.0,62.0,0.0,68.0,AAS 10,FA 22,1,3
...,...,...,...,...,...,...,...,...,...
2771056,2024-11-22 12:00:00,196.0,3.0,5.0,199.0,WCWP 10B,FA 24,3,6
2771057,2024-11-23 00:00:00,196.0,3.0,5.0,199.0,WCWP 10B,FA 24,3,6
2771058,2024-11-23 12:00:00,196.0,3.0,5.0,199.0,WCWP 10B,FA 24,3,6
2771059,2024-11-24 00:00:00,196.0,3.0,5.0,199.0,WCWP 10B,FA 24,3,6


In [15]:
df['priority'].value_counts()

priority
6    1279636
4      95847
5      92473
1      24558
3      24212
2      24198
Name: count, dtype: int64

We can see from the value counts that certain subsets are over-represented. It simply doesn't make sense to have so much data outside of first and second pass, when these are the focus of our question.

1. There is a disproportionate number of days of freshman priority (4). This is because we're considering all days after sophomore enrollment, but before the end of the week, to be freshman enrollment.

2. The same is true for incoming freshman enrollment (priorty 5).

3. The same is true for when enrollment is opened to all students (priority 6).

Realistically, we only need to know how many seats are available during the first day of someone's first pass, first day of their second pass, and first day of open enrollment. Anything beyond this will simply skew our predictive model.

In [16]:
# Find the first date for each group
first_dates = df.groupby(['quarter', 'pass', 'priority'])['time'].min().reset_index()

# Merge with the original df to keep all unique (quarter, pass, priority) entries on the first date
df = pd.merge(df, first_dates, on=['quarter', 'pass', 'priority', 'time'], how='inner')

df

Unnamed: 0,time,enrolled,available,waitlisted,total,course,quarter,pass,priority
0,2022-05-20,0.0,68.0,0.0,68.0,AAS 10,FA 22,1,1
1,2022-05-21,3.0,65.0,0.0,68.0,AAS 10,FA 22,1,2
2,2022-05-22,6.0,62.0,0.0,68.0,AAS 10,FA 22,1,3
3,2022-05-23,6.0,62.0,0.0,68.0,AAS 10,FA 22,1,4
4,2022-05-27,32.0,36.0,0.0,68.0,AAS 10,FA 22,2,1
...,...,...,...,...,...,...,...,...,...
62116,2024-06-02,184.0,0.0,1.0,184.0,WCWP 10B,FA 24,2,3
62117,2024-06-03,184.0,0.0,3.0,184.0,WCWP 10B,FA 24,2,4
62118,2024-06-07,184.0,0.0,41.0,184.0,WCWP 10B,FA 24,3,6
62119,2024-08-12,184.0,0.0,46.0,184.0,WCWP 10B,FA 24,1,5


In [17]:
df['priority'].value_counts()

priority
4    12300
3    12284
2    12281
1    12277
5     6598
6     6381
Name: count, dtype: int64

In [18]:
df['pass'].value_counts()

pass
1    28588
2    27152
3     6381
Name: count, dtype: int64

This is much more representative of the analysis we want to do, which prioritizes senior to freshman registration during first and second pass. 

In [21]:
if not os.path.exists('new_enrollment_data.csv'):
    df.to_csv('new_enrollment_data.csv', index=False)

# Ethics & Privacy

While the data is openly accessible, the method of data collection raises ethical considerations regarding automated data retrieval. It is essential to ensure that the scraping process does not access data we are not meant to use. However, the github uses the MIT License, and mentions that anyone is free to use this data. Since this repository is the only thing we scrape, we can address ethical concerns by simply citing this repository as our data source.

Although the data has no personally identifiable information (PII), there are some other privacy considerations. For example, enrollment trends and course availability data may indirectly reflect departmental or institutional scheduling strategies. Care must be taken to ensure that any analysis or publication of findings does not inadvertently expose proprietary or sensitive academic information.

Furthermore, there are potential biases in the datasets that may need to be addressed, particularly concerning data collection and representation. We may be analyzing data that has an overrepresentation of certain majors and class standings which leads to a biased analysis and recommendations. In addition to that, there may be subjective biases present in course and professor evaluations (CAPES) and instructor ratings from Rate My Professor. To identify and mitigate these biases, we will follow the Data Science Ethics Checklist. This includes conducting thorough data validation and exploratory data analysis (EDA) and implementing access controls to ensure data security as well as integrity. We will ensure that our visualizations and reports honestly represent the data and transparently document our analysis process. Any identified issues will be addressed through corrective measures such as weighting adjustments for underrepresented groups and/or incorporating additional data sources. As we complete all these ethical and privacy concerns throughout our project, we will produce fair, unbiased, and equitable recommendations for future use. 


# Team Expectations 

- Communication:
    - We will communicate via Discord, including texting and calling
    - The longest we expect to wait for a response is 24 hours 
    - We will meet at least once a week
    - Most, if not all, meetings will be done virtually
- Tone:
    - Be direct, but polite
        - Ex 1: “I think X is a problem because of Y. Does everyone else see it that way too or am I missing something?”
        - Ex 2: “I disagree with that idea because Z. What do you think if instead we try...”
- Decision Making:
    - Majority vote system for major decisions
    - Smaller decisions can be left to the person who is in charge of the task.
    - If a teammate is unresponsive when a decision has to be made quickly, it will be made without them using a majority vote.
- Tasks:
    - Members should first be assigned according to specialization, then others can oversee it to make sure everything aligns with expectations for that task
    - We will use GitHub issues for specific tasks and assignment deadlines. 
- Task completion issues
    - If you are struggling to deliver something you promised to do and haven’t made any progress on your own for 30+ minutes, let the group know through discord as soon as possible
    - Other group members who have the time outside of their own responsibilities and capability must respond within 24 hours
    - If no other members are available for help, the issue will be brought up during the following meeting time to discuss how to solve the problem and possibly reorganize the timeline to reflect that.



# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Changes the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting | 
|---|---|---|---|
| 2/4  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 2/8  |  1 PM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/8  | 1 PM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/20  | 6 PM  | Compress and merge Data - Chinmay - 16th Feb then Clean and Tidy Data, Add necessary columns - Anshul | Completion of Data wrangling   |
| 2/23  | 12 PM  | Discuss next steps for EDA | Complete project check-in |
| 3/6  | 12 PM  | Praveen and Chaela complete EDA by 3/1  | DiscussAnalysis |
| 3/9  | 12 PM  | Discuss next steps for analysis | Complete project check-in |
| 3/12  | 12 PM  | Complete Draft results/conclusion/discussion | Discuss/edit full project |
| 3/16  | 12 PM  | Finalize draft | Have final submission ready |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |