## **University of Illinois Chicago**
CS 418 - Fall 2024 Team 5

## **Data-Driven Course Insights: Predicting Grade Trends**

## **Authors:**
| **Name**  | **Email** | **Github Handle** |
|---|---|---|
| Arlette Diaz | adiaz218@uic.edu | adiaz218 |
| Marianne Hernandez | mhern85@uic.edu | marhern19 |
| Nandini Jirobe | njiro2@uic.edu | nandinijirobe |
| Sharadruthi Muppidi | smuppi2@uic.edu | sharadruthi-uic |
| Sonina Mut | smut3@uic.edu | snina22 |
| Yuting Lu | lyuti@uic.edu | yutinglu103 |

**Github Repository Link: https://github.com/cs418-fa24/project-check-in-team-5**

## **Project Description**

This project predicts course grade distributions and popularity rankings for upcoming semesters so that students can make informed decisions when selecting classes. Rather than predicting individual grades, it focuses on overall course outcomes, giving insight into grading trends and demand. It uses clustering to rank courses by student performance and popularity, and topic-based grouping to help students discover courses aligned with their interests, factoring in professor expertise and class attributes. The resulting tool is intended to support both students and academic planning.
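
To make the clustering idea concrete, here is a minimal sketch that groups courses by average GPA and enrollment with k-means. The feature names (`avg_gpa`, `enrollment`), the toy values, and the choice of three clusters are all assumptions for illustration; they are not the project's actual features or results.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy, made-up values for illustration only; this is not project data.
courses = pd.DataFrame({
    "course": ["C1", "C2", "C3", "C4", "C5", "C6"],
    "avg_gpa": [3.6, 3.5, 2.4, 2.5, 3.0, 3.1],
    "enrollment": [200, 180, 40, 50, 120, 110],
})

# Standardize so enrollment (large values) does not dominate GPA (small values).
features = StandardScaler().fit_transform(courses[["avg_gpa", "enrollment"]])

# n_clusters=3 is an arbitrary choice for this sketch.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
courses["cluster"] = kmeans.fit_predict(features)

print(courses.sort_values("cluster"))
```

The cluster labels themselves carry no order, so ranking would still require comparing cluster centers on the features of interest.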

## **Project Update**

### **Import Packages**

In [None]:
import sys
python_loc = sys.executable

!{python_loc} -m pip install pandas
!{python_loc} -m pip install scikit-learn
!{python_loc} -m pip install matplotlib
!{python_loc} -m pip install seaborn
!{python_loc} -m pip install tabulate

In [None]:
# import useful libraries
import pandas as pd
import numpy as np
import requests
import time
from bs4 import BeautifulSoup
from typing import List, Dict
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from urllib3.exceptions import InsecureRequestWarning
import urllib3
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

from tabulate import tabulate  # run 'pip install tabulate' if this library is not installed

### **Part 1: Load Datasets**

In [None]:
# Grade distribution data 
cs_grades = pd.read_csv('uic_GD_CS_14_24.csv')
meie_grades = pd.read_csv('uic_GD_MEIE_14_24.csv')

# Rate My Professor Data
cs_rmp = pd.read_csv('uic_RMP_CS_14_24.csv')
meie_rmp = pd.read_csv('CS418_Team5_DataSet - RMP_MEIE_14_24.csv')

# Google Scholar Data
cs_gs = pd.read_csv('CS418_Team5_DataSet - GS_CS_14_24.csv')
meie_gs = pd.read_csv('CS418_Team5_DataSet - GS_MEIE_14_24.csv')

# Lecture Data
cs_lectures = pd.read_csv('uic_CS_lectures_all_semesters.csv')
me_lectures = pd.read_csv('uic_ME_lectures_all_semesters.csv')
ie_lectures = pd.read_csv('uic_IE_lectures_all_semesters.csv')

# Course Description Data
cs_descrip = pd.read_csv('CS418_Team5_DataSet - CS_Descrip.csv')

In [None]:
cs_lectures.head(5)

# print(cs_lectures['Method'].unique())

### **Part 2: Data Cleaning**

#### **Dataset 1 - Grade Distribution**

In [None]:
# Grade distribution data cleaning
# Drop columns where all values are zero
cs_grades = cs_grades.loc[:, (cs_grades != 0).any(axis=0)]
meie_grades = meie_grades.loc[:, (meie_grades != 0).any(axis=0)]

# Drop rows where CRS TITLE (course title) contains "research" or "seminar" (case-insensitive)
cs_grades = cs_grades[~cs_grades['CRS TITLE'].str.contains("research|seminar", case=False, na=False)]
meie_grades = meie_grades[~meie_grades['CRS TITLE'].str.contains("research|seminar", case=False, na=False)]

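# Convert columns to numeric dtypes where possible; leave non-numeric columns unchanged
# (errors='ignore' is deprecated in recent pandas releases)
def to_numeric_if_possible(column):
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

cs_grades = cs_grades.apply(to_numeric_if_possible)
meie_grades = meie_grades.apply(to_numeric_if_possible)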

# Save the cleaned data (note: this overwrites the original input CSV files)
cs_grades.to_csv("uic_GD_CS_14_24.csv", index=False)
meie_grades.to_csv("uic_GD_MEIE_14_24.csv", index=False)
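
The two filtering steps above can be checked on a small example. This is a sketch on a toy frame; the grade-count column names (`A`, `B`, `AU`) and the course titles are hypothetical stand-ins, not values taken from the real CSV files.

```python
import pandas as pd

# Toy frame; column names and titles are hypothetical stand-ins.
toy = pd.DataFrame({
    "CRS TITLE": ["Intro to Programming", "Research Seminar", "Data Structures"],
    "A": [30, 5, 22],
    "B": [12, 0, 15],
    "AU": [0, 0, 0],  # every entry is zero, so this column should be dropped
})

# (toy != 0) is True wherever a cell is non-zero;
# .any(axis=0) keeps the columns that have at least one True.
toy = toy.loc[:, (toy != 0).any(axis=0)]

# Drop rows whose title mentions "research" or "seminar", ignoring case.
mask = toy["CRS TITLE"].str.contains("research|seminar", case=False, na=False)
toy = toy[~mask]

print(toy)
```

Text columns survive the zero check because a string never compares equal to `0`.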

#### **Dataset 2.1 - Rate My Professor - Computer Science Department**

In [None]:

cs_grades.rename(columns={'Primary Instructor': 'Instructor'}, inplace=True)

# Keep only courses numbered 100-599
cs_grades = cs_grades[cs_grades['CRS NBR'].between(100, 599)]

merged_data = pd.merge(cs_grades, cs_rmp, on='Instructor', how='left')

# Fill missing values with "N/A" for NULL columns
merged_data['Rating'] = merged_data['Rating'].fillna("N/A")
merged_data['Num Reviews'] = merged_data['Num Reviews'].fillna("N/A")
merged_data[['CRS SUBJ CD', 'CRS TITLE', 'Instructor']] = merged_data[['CRS SUBJ CD', 'CRS TITLE', 'Instructor']].fillna("N/A")

# Select relevant columns and sort by course number (CRS NBR)
result_data = merged_data[['CRS SUBJ CD', 'CRS NBR', 'CRS TITLE', 'Instructor', 'Rating', 'Num Reviews']]
result_data = result_data.sort_values(by=['CRS NBR'])

print(tabulate(result_data, headers='keys', tablefmt='fancy_grid', showindex=False))
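
Because the merge is a left join on `Instructor`, any instructor whose name is missing or spelled differently in the Rate My Professor file ends up with "N/A". One way to see how many rows are affected is the `indicator` option of `pd.merge`. The sketch below uses invented instructor names and ratings purely for illustration.

```python
import pandas as pd

# Toy data; instructor names and ratings are invented for illustration.
grades = pd.DataFrame({
    "CRS NBR": [111, 141, 251],
    "Instructor": ["Doe, Jane", "Roe, Richard", "Poe, Alex"],
})
rmp = pd.DataFrame({
    "Instructor": ["Doe, Jane", "Poe, Alex"],
    "Rating": [4.2, 3.7],
})

# indicator=True adds a "_merge" column; for a left join its values
# are "both" (match found) or "left_only" (no match in rmp).
merged = pd.merge(grades, rmp, on="Instructor", how="left", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "Instructor"].tolist()
print(f"{len(unmatched)} of {len(merged)} rows have no rating: {unmatched}")
```

Running the same check on the real data before filling with "N/A" would show whether missing ratings come from absent profiles or from name-format mismatches.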

#### **Dataset 2.2 - Rate My Professor - Mechanical & Industrial Engineering Department**

In [None]:
meie_grades.rename(columns={'Primary Instructor': 'Instructor'}, inplace=True)

# Keep only courses numbered 100-599
meie_grades = meie_grades[meie_grades['CRS NBR'].between(100, 599)]

merged_data = pd.merge(meie_grades, meie_rmp, on='Instructor', how='left')

# Fill missing values with "N/A" for Null columns
merged_data['Rating'] = merged_data['Rating'].fillna("N/A")
merged_data['Num Reviews'] = merged_data['Num Reviews'].fillna("N/A")
merged_data[['CRS SUBJ CD', 'CRS TITLE', 'Instructor']] = merged_data[['CRS SUBJ CD', 'CRS TITLE', 'Instructor']].fillna("N/A")

# Select relevant columns then sort them by course number
result_data = merged_data[['CRS SUBJ CD', 
                           'CRS NBR', 
                           'CRS TITLE', 
                           'Instructor', 
                           'Rating', 
                           'Num Reviews']]
result_data = result_data.sort_values(by=['CRS NBR'])

print(tabulate(result_data, headers='keys', tablefmt='fancy_grid', showindex=False))

#### **Dataset 3 - Class Scheduler Data**

In [None]:
# Disable SSL verification warnings
urllib3.disable_warnings(InsecureRequestWarning)

class LectureScheduleScraper:
    def __init__(self, semester: str, year: int):
        """
        Initialize the scraper with semester info and determine correct URL format
        
        Args:
            semester (str): Semester name
            year (int): Year
        """
        self.semester = semester
        self.year = year
        self.url = self._get_url_for_semester()
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        })

    def _get_url_for_semester(self) -> str:
        """
        Determine the correct URL format based on semester and year
        Returns appropriate URL string
        """
        base_url = "https://webcs7.osss.uic.edu/schedule-of-classes/archive/pre-proof/{}-{}/CS.{}"
        
        # Convert semester/year to a comparable date
        semester_date = pd.Timestamp(year=self.year, 
                                   month={"Spring": 1, "Summer": 6, "Fall": 9}[self.semester],
                                   day=1)
        
        # Cutoff date: Spring 2020
        cutoff_date = pd.Timestamp(year=2020, month=1, day=1)
        
        # Use .html for Spring 2020 and later, .htm for earlier dates
        extension = "html" if semester_date >= cutoff_date else "htm"
        
        return base_url.format(self.semester.lower(), self.year, extension)

    def fetch_page(self, max_retries: int = 3) -> str:
        """
        Fetch the webpage content with retry logic
        """
        for attempt in range(max_retries):
            try:
                response = self.session.get(self.url, verify=False, timeout=10)
                response.raise_for_status()
                return response.text
            except requests.RequestException as e:
                print(f"Attempt {attempt + 1} failed: {str(e)}")
                if attempt < max_retries - 1:
                    time.sleep(2)
                else:
                    raise Exception(f"Failed to fetch page after {max_retries} attempts") from e

    def parse_old_format(self, soup: BeautifulSoup) -> List[Dict]:
        """
        Parse the old .htm format schedule (pre-2020)
        """
        courses = []
        current_course = None
        
        # Find all paragraph and table tags, in document order
        paragraphs = soup.find_all(['p', 'table'])
        
        for element in paragraphs:
            if element.name == 'p':
                # Check if this is a course header: its link text contains the
                # subject code (e.g. "CS   201")
                course_info = element.find('a')
                if course_info and 'CS' in course_info.text:
                    title_parts = element.find_all('b')
                    if len(title_parts) >= 3:
                        # The course code is in the first bold element, the title in the third
                        course_code = title_parts[0].text.strip()
                        course_title = title_parts[2].text.strip()
                        
                        current_course = {
                            'course_code': course_code.replace('\xa0', ' ').strip(),
                            'course_title': course_title,
                            'description': element.text.strip(),
                            'sections': []
                        }
                        courses.append(current_course)
            
            elif element.name == 'table' and current_course:
                rows = element.find_all('tr')
                for row in rows:
                    cols = row.find_all('td')
                    if len(cols) >= 9:  # Make sure we have enough columns
                        course_type = cols[1].text.strip()
                        if course_type.startswith('LEC') or course_type.startswith('LCD'):
                            # Combine start and end times
                            start_time = cols[3].text.strip()
                            end_time = cols[5].text.strip()
                            
                            # Handle arranged times
                            # (meeting_time avoids shadowing the imported time module)
                            if start_time.upper() == "ARRANGED":
                                meeting_time = "ARRANGED"
                            else:
                                meeting_time = f"{start_time} - {end_time}" if end_time else start_time

                            # Clean up CRN and other fields
                            crn = cols[0].text.strip().replace('\xa0', '').replace('strong>', '').strip()

                            section = {
                                'crn': crn,
                                'course_type': course_type,
                                'time': meeting_time,
                                'days': cols[6].text.strip(),
                                'room': cols[7].text.strip(),
                                'building': cols[8].text.strip(),
                                'instructor': cols[9].text.strip() if len(cols) > 9 else '',
                                'method': '',  # Old format doesn't have method
                                'semester': self.semester,
                                'year': self.year
                            }
                            current_course['sections'].append(section)
        
        # Only return courses that have lecture sections
        valid_courses = [course for course in courses if course['sections']]
        
        if not valid_courses:
            print(f"DEBUG: Found {len(courses)} courses but none had valid lecture sections")
            # Print the first few course codes found to help debug
            if courses:
                print("DEBUG: Found these course codes:")
                for course in courses[:5]:
                    print(f"- {course['course_code']}")
        
        return valid_courses


    def parse_schedule(self, html_content: str) -> List[Dict]:
        """
        Parse the schedule and extract only lecture sections
        Handles both old and new formats
        """
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Check if this is the new format (post-2020)
        if soup.find_all('div', class_='row course'):
            return self.parse_new_format(soup)
        else:
            return self.parse_old_format(soup)

    def parse_new_format(self, soup: BeautifulSoup) -> List[Dict]:
        """
        Parse the new .html format schedule (2020 and later)
        """
        courses = []
        course_divs = soup.find_all('div', class_='row course')
        
        for course_div in course_divs:
            course_code = course_div.find('h2').text.strip()
            course_title = course_div.find('h3').text.strip()
            course_description = course_div.find('p').text.strip()
            
            table = course_div.find('table')
            if not table:
                continue
                
            lecture_sections = []
            current_section = None
            
            for row in table.find_all('tr')[1:]:
                cols = row.find_all('td')
                if not cols:
                    continue
                    
                if 'separator' in cols[0].get('class', []):
                    if current_section:
                        current_section['additional_info'] = cols[0].text.strip()
                    continue
                
                course_type = cols[1].text.strip()
                if not (course_type.startswith('LEC') or course_type.startswith('LCD')):
                    continue
                
                current_section = {
                    'crn': cols[0].text.strip().replace('strong>', ''),
                    'course_type': course_type,
                    'time': cols[2].text.strip(),
                    'days': cols[3].text.strip(),
                    'room': cols[4].text.strip(),
                    'building': cols[5].text.strip(),
                    'instructor': cols[6].text.strip(),
                    'method': cols[8].text.strip() if len(cols) > 8 else '',
                    'semester': self.semester,
                    'year': self.year
                }
                lecture_sections.append(current_section)
            
            if lecture_sections:
                course_info = {
                    'course_code': course_code,
                    'course_title': course_title,
                    'description': course_description,
                    'sections': lecture_sections
                }
                courses.append(course_info)
        
        return courses

    def create_dataframe(self, courses: List[Dict]) -> pd.DataFrame:
        """
        Convert course data to DataFrame with improved handling and spacing cleanup
        """
        rows = []
        for course in courses:
            for section in course['sections']:
                # Clean and standardize fields
                crn = section['crn'].replace('\xa0', '').strip()
                course_code = course['course_code'].replace('\xa0', ' ').strip()
                # Compress multiple spaces into single space
                course_code = ' '.join(course_code.split())
                
                # Clean up instructor field - remove extra spaces and quotes
                instructor = section['instructor'].strip().strip('"')
                instructor = ' '.join(instructor.split())  # Compress multiple spaces
                
                # Clean up time field (meeting_time avoids shadowing the imported time module)
                meeting_time = section['time'].strip()
                meeting_time = ' '.join(meeting_time.split())  # Compress multiple spaces

                # Create row with cleaned data
                row = {
                    'Course Code': course_code,
                    'Course Title': course['course_title'].strip(),
                    'CRN': crn,
                    'Section Type': section['course_type'].strip(),
                    'Time': meeting_time,
                    'Days': section['days'].strip(),
                    'Instructor': instructor,
                    'Method': section['method'].strip(),
                    'Semester': section['semester'].strip(),
                    'Year': section['year']
                }
                rows.append(row)
                        
        df = pd.DataFrame(rows)
        
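        try:
            # Sort sections numerically by the digits in the course code
            df['Course Number'] = df['Course Code'].str.extract(r'(\d+)', expand=False).astype(float)
            df = df.sort_values(['Year', 'Semester', 'Course Number'], na_position='last')
            df = df.drop('Course Number', axis=1)
        except (KeyError, ValueError, AttributeError) as e:
            # Sorting is cosmetic, so keep the unsorted frame but report why it failed
            print(f"Could not sort by course number: {e}")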
                
        # Remove any completely empty rows
        df = df.dropna(how='all')
        
        # Convert year to integer
        df['Year'] = df['Year'].astype(int)
        
        return df

def main():
    # Every spring, summer, and fall term from Spring 2014 through Summer 2024
    semesters = [
        (season, year)
        for year in range(2014, 2025)
        for season in ('spring', 'summer', 'fall')
        if (season, year) != ('fall', 2024)
    ]
    
    successful = []
    failed = []
    all_data = []

    
    for season, year in semesters:
        try:
            semester_name = f"{season.capitalize()} {year}"
            
            scraper = LectureScheduleScraper(season.capitalize(), year)
            
            courses = scraper.parse_schedule(scraper.fetch_page())
            if courses:
                df = scraper.create_dataframe(courses)
                all_data.append(df)
                
                successful.append(semester_name)
            else:
                raise Exception("No courses found")
                
            print("-" * 80)
            
        except Exception as e:
            print(f"Error processing {semester_name}")
            print(f"Error details: {str(e)}")
            failed.append((semester_name, str(e)))
            print("Continuing to next semester...")
            print("-" * 80)
            continue
    
    # Combine all DataFrames and save to a single CSV
    if all_data:
        combined_df = pd.concat(all_data, ignore_index=True)
        # For another department, change _CS_ to _ME_ or _IE_ here and update
        # the 'CS' in the scraper's base_url and course-header check to match
        output_file = 'uic_CS_lectures_all_semesters.csv'
        combined_df.to_csv(output_file, index=False)
        print(f"\nSaved combined data to: {output_file}")
    
    if successful:
        print("\nSuccessfully processed semesters:")
        for semester in successful:
            print(f"- {semester}")
    
    if failed:
        print("\nFailed semesters:")
        for semester, error in failed:
            print(f"- {semester}: {error}")

if __name__ == "__main__":
    main()
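
The scraper above hard-codes `CS` in its URL, so collecting the ME and IE schedules means editing the class by hand. A possible alternative is to pass the department in as a parameter. The helper below, `build_schedule_url`, is hypothetical and not part of the notebook; it also assumes the ME and IE archive pages follow the same URL pattern as the CS pages, which is not verified here.

```python
def build_schedule_url(department: str, semester: str, year: int) -> str:
    """Return the archive URL for one department and term (hypothetical helper)."""
    base_url = (
        "https://webcs7.osss.uic.edu/schedule-of-classes/"
        "archive/pre-proof/{}-{}/{}.{}"
    )
    # Order terms within a year: spring < summer < fall.
    term_order = {"spring": 0, "summer": 1, "fall": 2}
    season = semester.lower()
    if season not in term_order:
        raise ValueError(f"Unknown semester: {semester!r}")
    # Mirrors the scraper's rule: Spring 2020 and later use .html,
    # earlier terms use .htm.
    extension = "html" if (year, term_order[season]) >= (2020, 0) else "htm"
    return base_url.format(season, year, department.upper(), extension)


# Build (but do not fetch) the URL for each department.
for dept in ("CS", "ME", "IE"):
    print(build_schedule_url(dept, "Fall", 2019))
```

The course-header check in `parse_old_format` would need the same department parameter, since it currently looks for `'CS'` in the link text.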

#### **Dataset 4 - Google Scholar**

In [None]:
# CODE HERE !!

### **Part 3: Exploratory Data Analysis**

### **Part 4: Data Visualizations**

### **Part 5: Machine Learning Analysis**

## **Reflection**

**What is the hardest part of the project that you’ve encountered so far?**


<br>**What are your initial insights?**


<br>**Are there any concrete results you can show at this point? If not, why not?**


<br>**Going forward, what are the current biggest problems you’re facing?**


<br>**Do you think you are on track with your project? If not, what parts do you need to dedicate more time to?**


<br>**Given your initial exploration of the data, is it worth proceeding with your project, why? If not, how are you going to change your project and why do you think it’s better than your current results?**



## **Roles/Coordination (important)**

**Arlette Diaz:** 
* Text

<br>**Marianne Hernandez:** 
* Text

<br>**Nandini Jirobe:** 
* Collected Rate My Professor ratings for professors who taught Mechanical and Industrial Engineering classes from 2014-2024
* Collected Rate My Professor ratings for professors who taught Computer Science classes from 2014-2024
* Collected Google Scholar research interests of professors who taught Mechanical and Industrial Engineering classes from 2014-2024
* Collected Google Scholar research interests of professors who taught Computer Science classes from 2014-2024
* Collected course description data for Computer Science courses taught at UIC

<br>**Sharadruthi Muppidi:** 
* Text

<br>**Sonina Mut:** 
* Collected UIC grade distribution data for Mechanical and Industrial Engineering classes from 2014-2024
* Collected UIC grade distribution data for Computer Science classes from 2014-2024
* Collected Rate My Professor ratings for professors who taught Mechanical and Industrial Engineering classes from 2014-2024
* Collected Rate My Professor ratings for professors who taught Computer Science classes from 2014-2024

<br>**Yuting Lu:** 
* Text

## **Next Steps**