# COGS 108 - Project Proposal

# Names

- Chinmay Bharambe 
- Anshul Govindu 
- Chaela Moraleja 
- Candice Sanchez 
- Praveen Sharma
 

# Research Question

Using UCSD enrollment data since Fall 2022, what combination of course characteristics (fill rate, capacity, time offered) and student factors (class standing, major) best predicts enrollment success rates for undergraduate courses across all departments during first- and second-pass registration? 
Can these predictions be used to develop a recommendation tool that optimizes first- and second-pass course selection?

## Background and Prior Work

This project attempts to address a major challenge for UCSD students: deciding which classes to enroll in during first and second pass. UCSD’s unique “pass” enrollment system turns course selection into more of an art than a science, often leaving students uncertain about their choices or failing to enroll in certain classes. This process also involves other unusual factors, such as major priority for CSE courses. Overall, there is a definite need for a tool that maximizes students' chances of securing their desired courses.

During our initial research, we came across a project that collects data on individual classes at different points in time during each term, such as Fall 2022 or Winter 2023; each term’s data is contained within its own repository <a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1). The project built a web scraping tool that scrapes WebReg about every 10 minutes and collects real-time data such as the number of enrolled, available, and waitlisted spots. This offers not only a tool for collecting our own data in the future, but also a great sample dataset from what has already been collected.

We also found another project that was built using the aforementioned GitHub repositories <a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2). Given a course, the website takes data from specific terms and plots the course availability as a time series across various registration milestones (senior first pass, junior second pass, etc.). This offers a great initial visualization of the enrollment data, and our EDA would likely produce some similar graphs. However, we will have to build upon this with predictive analyses in order to answer our research question.


1. <a name="cite_note-1"></a> [^](#cite_ref-1) https://github.com/UCSD-Historical-Enrollment-Data
2. <a name="cite_note-2"></a> [^](#cite_ref-2) https://www.ucsdregistration.com


# Hypothesis


We predict that the fill rate of a course and the student’s major will be the most influential combination of factors for students deciding which courses to enroll in during first and second passes. Specifically, we predict that a high fill rate and a close relationship between the course and a student’s major will make a course more likely to be enrolled in during first pass rather than second pass.

# Data

- Variables:
    - Unique identifiers for every course
    - Term in which the course was offered
    - Standing required for a student to enroll in the course
    - Major restrictions for the course
    - Capacity of the course
    - Fill rate of the course during first pass and second pass respectively
    - Waitlist count
    - Course and professor evaluations/ratings

- We believe that the duration across which these variables were observed is more important than the number of observations. Ideally, our data would start on the first day of first pass and end when the waitlist ends. 

- The ideal format for this data would be in CSV files as these are compatible with pandas. 

- A potential real data set is available in this repository: https://github.com/UCSD-Historical-Enrollment-Data. It covers the time period we are interested in for all undergraduate courses and has the building blocks of our ideal data set. However, each course has a separate CSV file for each quarter, so we will have to merge these files into a usable data set while preserving information such as the course name and the term in which it was offered. We will also have to mark when first pass and second pass start and end in each quarter, which our recommendation system needs, and create a fill rate column. Finally, because the data for each course was collected every 10 minutes, each CSV file contains far more observations than we need; keeping this level of granularity would make our final data set unnecessarily large, so we will downsample it. 
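
The wrangling steps described above (downsampling the 10-minute snapshots, adding a fill rate column, and merging per-course data while keeping the course name and term) could look roughly like the following pandas sketch. The column names `time`, `enrolled`, and `total`, the hourly window, and the helper names `tidy_course` and `merge_courses` are assumptions made for illustration; they have not been checked against the repository's actual schema.

```python
import pandas as pd


def tidy_course(raw: pd.DataFrame, course: str, term: str,
                freq: str = "60min") -> pd.DataFrame:
    """Downsample one course's snapshots and add a fill-rate column.

    Assumes (hypothetically) that `raw` has a `time` column plus numeric
    `enrolled` and `total` (capacity) columns.
    """
    df = raw.copy()
    df["time"] = pd.to_datetime(df["time"])
    # Keep only the last snapshot in each window instead of every reading.
    df = (
        df.set_index("time")
          .sort_index()
          .resample(freq)
          .last()
          .dropna(how="all")  # drop windows that contain no snapshots
    )
    # Fill rate = enrolled seats / capacity; left as NaN when capacity is 0.
    capacity = df["total"].where(df["total"] > 0)
    df["fill_rate"] = df["enrolled"] / capacity
    # Preserve course and term so rows stay identifiable after merging.
    df["course"] = course
    df["term"] = term
    return df.reset_index()


def merge_courses(frames) -> pd.DataFrame:
    """Stack the tidied per-course tables into one long table."""
    return pd.concat(frames, ignore_index=True)
```

In practice, each input frame would come from `pd.read_csv` on one course file, with the course name and term taken from the file path. Labeling each row as first pass or second pass would be a further step that needs the registration dates for each quarter.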

# Ethics & Privacy

Working with UCSD’s historical enrollment data comes with a responsibility to prioritize student privacy and ethical considerations. Because this academic project will handle sensitive student information, all personal identifiers, such as student PIDs, will be anonymized to safeguard privacy. Additionally, we will comply with UCSD’s data usage policies so that students whose data was collected can be assured that it will be used exclusively for this academic project and will not be exploited in any manner. This approach helps prevent misuse of and unauthorized access to the data. 

There are potential biases in the datasets that may need to be addressed, particularly concerning data collection and representation. The data may overrepresent certain majors and class standings, which would lead to biased analysis and recommendations. In addition, there may be subjective biases in course and professor evaluations (CAPEs) and instructor ratings from Rate My Professors. To identify and mitigate these biases, we will follow the Data Science Ethics Checklist. This includes conducting thorough data validation and exploratory data analysis (EDA), limiting exposure of personally identifiable information (PII) through anonymization, and implementing access controls to ensure data security and integrity. We will ensure that our visualizations and reports honestly represent the data, and we will transparently document our analysis process. Any identified issues will be addressed through corrective measures such as weighting adjustments for underrepresented groups and/or incorporating additional data sources. By addressing these ethical and privacy concerns throughout the project, we aim to produce fair, unbiased, and equitable recommendations for future use. 


# Team Expectations 

- Communication:
    - We will communicate via Discord, including texting and calling
    - The longest we expect to wait for a response is 24 hours 
    - We will meet at least once a week
    - Most, if not all, meetings will be done virtually
- Tone:
    - Be direct, but polite
        - Ex 1: “I think X is a problem because of Y. Does everyone else see it that way too or am I missing something?”
        - Ex 2: “I disagree with that idea because Z. What do you think if instead we try...”
- Decision Making:
    - Majority vote system for major decisions
    - Smaller decisions can be left to the person who is in charge of the task.
    - If a teammate is unresponsive when a decision has to be made quickly, it will be made without them using a majority vote.
- Tasks:
    - Tasks should first be assigned according to each member's specialization; other members can then review the work to make sure it meets expectations for that task
    - We will use GitHub issues for specific tasks and assignment deadlines. 
- Task completion issues:
    - If you are struggling to deliver something you promised to do and haven’t made any progress on your own for 30+ minutes, let the group know through Discord as soon as possible
    - Other group members who have the time and ability, outside of their own responsibilities, must respond within 24 hours
    - If no other members are available to help, the issue will be brought up at the following meeting to discuss how to solve the problem and, if needed, adjust the timeline accordingly.



# Project Timeline Proposal




| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/4  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 2/8  |  1 PM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/8  | 1 PM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/20  | 6 PM  | Compress and merge data (Chinmay, 2/16); then clean and tidy data and add necessary columns (Anshul) | Completion of data wrangling   |
| 2/23  | 12 PM  | Discuss next steps for EDA | Complete project check-in |
| 3/6  | 12 PM  | Complete EDA | Discuss/edit Analysis |
| 3/9  | 12 PM  | Discuss next steps for analysis | Complete project check-in |
| 3/12  | 12 PM  | Complete Draft results/conclusion/discussion | Discuss/edit full project |
| 3/16  | 12 PM  | Finalize draft | Have final submission ready |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |