In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("project.ipynb")

# Project 1 – Gradebook 💯

## DSC 80, Fall 2024

### Checkpoint Due Date: Tuesday, October 8th (Questions 1-7)
### Due Date: Tuesday, October 15th

## Instructions

Welcome to Project 1! Be sure to read the instructions below carefully to understand how projects differ from labs.

### Working on the Project

This Jupyter Notebook contains the statements of the problems and provides code and Markdown cells to display your answers to the problems.

* Like the lab, your coding work will be developed in the accompanying `project.py` file, which will be imported into the current notebook. This code will be autograded.
* There is no manually-graded component to Project 1, so the only thing you will ever submit is `project.py`.
* **For the Checkpoint, which is required, you only need to turn in a `project.py` containing solutions for Questions 1-7!**
    - The "Project 1 Checkpoint" autograder on Gradescope does not thoroughly check your code – it only runs the public tests on Questions 1-7 to make sure that you have completed them. There are no hidden tests for the checkpoint, and you will see your score upon submission. 
    - When you submit the final version of the project, however, we will use hidden tests to check your answers more thoroughly.
    - Note that this means you will ultimately have to submit the project twice – once to the "Project 1 Checkpoint" autograder (Questions 1-7 only), and once to the "Project 1" autograder (once you're fully done).
- **Do not change the function names in the `project.py` file!** The functions in `project.py` are how your assignment is graded, and they are graded by their name. If you changed something you weren't supposed to, you can find the original code in the [course GitHub repository](https://github.com/dsc-courses/dsc80-2024-ss2).
- **To ensure that all of your work to be submitted is in `project.py`, we've included a script named `project-validation.py` in the project folder. You shouldn't edit it, but instead, you should call it from the command line (e.g. the Terminal) to test your work.** More details on its usage are given at the bottom of this notebook.
- You are encouraged to write your own additional helper functions to solve the project, as long as they also end up in `project.py`.

### Working with a Partner

You may work together on projects (and projects only!) with a partner. If you work with a partner, you are both required to actively contribute to all parts of the project. You must both be working on the assignment at the same time together, either physically or virtually on a Zoom call. You are encouraged to follow the pair programming model, in which you work on just a single computer and alternate who writes the code and who thinks about the problems at a high level.

In particular, you **cannot** split up the project and each work on separate parts independently.

Note that if you do work with a partner, you and your partner must submit the Checkpoint together and the whole project together. See [here](https://dsc80.com/syllabus/#projects) for more details.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import pandas as pd
import numpy as np
from pathlib import Path

import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"

In [4]:
from project import *

## About the Assignment

In this project, you'll work with the gradebook for CSD 18, a fictional data science course with 535 students co-taught by Professors Yutian and Dylan. You'll help Professors Yutian and Dylan compute the total course grade for every student in their course and analyze their students' performances throughout the quarter.

---

### Navigating the Project

Click on the links below to navigate to different parts of the project. Note that Part 1 – that is, Questions 1, 2, 3, 4, 5, 6, and 7 – constitutes your Checkpoint submission.

- [Part 1: Initial Calculations 🔢](#part1)
    - [Question 1](#Question-1-(Checkpoint-Question))
    - [Question 2](#Question-2-(Checkpoint-Question))
    - [Question 3](#Question-3-(Checkpoint-Question))
    - [Question 4](#Question-4-(Checkpoint-Question))
    - [Question 5](#Question-5-(Checkpoint-Question))
    - [Question 6](#Question-6-(Checkpoint-Question))
    - [Question 7](#Question-7-(Checkpoint-Question))
- [Part 2: Redemption 🙏](#part2)
    - [Question 8](#Question-8)
    - [Question 9](#Question-9)
    - [Question 10](#Question-10)
- [Part 3: Analysis 🧠](#part3)
    - [Question 11](#Question-11)
    - [Question 12](#Question-12)
    - [Question 13](#Question-13)

<!--     - [✅ Question 1 (Checkpoint Question)](#Question-1-(Checkpoint-Question))
    - [✅ Question 2 (Checkpoint Question)](#Question-2-(Checkpoint-Question))
    - [✅ Question 3 (Checkpoint Question)](#Question-3-(Checkpoint-Question))
    - [✅ Question 4 (Checkpoint Question)](#Question-4-(Checkpoint-Question))
    - [✅ Question 5 (Checkpoint Question)](#Question-5-(Checkpoint-Question))
    - [✅ Question 6 (Checkpoint Question)](#Question-6-(Checkpoint-Question))
    - [✅ Question 7 (Checkpoint Question)](#Question-7-(Checkpoint-Question)) -->
---

### The Syllabus

Professor Yutian has taught this course several times, so the instructors decide to use her syllabus at the start of the quarter. (Note that this syllabus is **not** the same as the course syllabus for DSC 80 in Spring 2024.)

* **Lab assignments (20% total)**
    - Each lab is worth the same amount, regardless of each lab's raw point total.
    - The lowest lab is dropped.
    - Each lab may be revised for up to (and including) one week after the deadline for a 10% penalty, for up to (and including) two weeks after the deadline for a 30% penalty, and beyond that for a 60% penalty. Such revisions are reflected in the `'Lateness'` columns in the gradebook.
    - Labs also have a two hour grace period that needs to be factored in before assigning late penalties.
    - Note that lateness penalties are not assessed for any other type of assignment – that is, students can submit projects, checkpoints, and discussions late without penalty.
* **Projects (30% total)** 
    - Each project consists of an autograded portion, and **possibly** a free response portion.
    - The total points for a single project consist of the sum of the raw scores of the two portions.
    - Each project is worth the same amount, regardless of each project's raw point total.
* **Checkpoints (2.5% total)**
    - Each project checkpoint is worth the same amount, regardless of each project checkpoint's raw point total.
* **Discussions (2.5% total)**
    - Each discussion is worth the same amount, regardless of each discussion's raw point total.
* **Midterm Exam (15%)**
* **Final Exam (30%)**

You will need to refer to this syllabus repeatedly throughout the project, and several questions will link you back to it.
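
The lab lateness rules above can be sketched as a small lookup. This is only an illustration, not the required implementation: the helper names `lateness_to_hours` and `late_multiplier` are hypothetical, and the sketch assumes that the grace period simply means submissions up to two hours late count as on time, with the one-week and two-week cutoffs still measured from the original deadline.

```python
def lateness_to_hours(lateness: str) -> float:
    """Convert an 'H:M:S' lateness string (e.g. '780:01:28') to hours."""
    hours, minutes, seconds = (int(part) for part in lateness.split(":"))
    return hours + minutes / 60 + seconds / 3600


def late_multiplier(hours_late: float) -> float:
    """Fraction of a lab score kept after the late penalty (assumed reading)."""
    if hours_late <= 2:          # within the two hour grace period: no penalty
        return 1.0
    if hours_late <= 7 * 24:     # up to and including one week: 10% penalty
        return 0.9
    if hours_late <= 14 * 24:    # up to and including two weeks: 30% penalty
        return 0.7
    return 0.4                   # beyond two weeks: 60% penalty


print(late_multiplier(lateness_to_hours("00:04:51")))   # 1.0
print(late_multiplier(lateness_to_hours("120:01:11")))  # 0.9
print(late_multiplier(lateness_to_hours("780:01:28")))  # 0.4
```

Keeping the string parsing separate from the penalty lookup makes each piece easy to check by hand on a few lateness values.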

---

### Generalization

Your code only needs to work for courses that follow the syllabus above. That is, you may assume that the DataFrame `grades` looks **like** the given one in `data/grades.csv`.

However, your code should work regardless of:
- The numbers of labs, projects, discussions, and checkpoints in the course.
- The number of students in the course.

For instance, if CSD 18 is taught in a different quarter with more labs, fewer projects, and fewer students, your code should still work on a `grades.csv` from that quarter.

You may assume the course components and the naming conventions are as given in `grades.csv`, and you may assume that the course has no more than 99 of any type of assignment.

---

### Putting Everything Together

Here are a few remarks and tips for approaching Project 1, and projects more generally:

1. If you are having trouble figuring out what a question is asking you to do, look at the big picture and try to understand what the current step is doing to contribute to this big picture. This may clarify what's being asked!
1. These questions intentionally build off of each other and the final result matters! In fact, you can "get a question correct," but only receive partial credit for it because a previous answer was wrong.
    - You will typically receive partial credit for a question based on *how close* your answer is to correct (as well as some credit for a solution in the correct form).
    - You should try to assess your answer to each question based on what you understand of the data. This might involve writing extensive code (that isn't turned in) just to check your work! Suggestions on checking your work are given in the assignment, but you should also think of your own ways of checking your work.
    - As you do this project, think about the data from the perspective of the student (which should be easy to do, since you've used Gradescope before!).
1. To test the correctness of your answers:
    - Once you have implemented a particular function in `project.py`, you should test out your function in the notebook. In particular, you should inspect/analyze the output to assess its correctness.
    - Run your functions on the main dataset (`grades`, and later `grades_combined` and `grades_analysis`) and ask yourself if the output *looks correct.*
    - Run your functions on very small datasets (e.g. 1-5 row DataFrames that you construct by hand), calculate the expected output by hand, and see if the function output matches (this *is* unit-testing your code with data).
    - Run your functions on (large and small) samples of the dataset `grades`. Does your code break, or does it still run as expected?
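
One way to follow the small-dataset tip is to build a tiny DataFrame by hand, work out the answer on paper, and compare. The sketch below is illustrative only: `score_fraction` is a hypothetical helper rather than one of the required project functions, the two-row DataFrame merely imitates the column naming in `grades.csv`, and treating a missing score as 0 is an assumption.

```python
import pandas as pd


def score_fraction(df: pd.DataFrame, name: str) -> pd.Series:
    """Hypothetical helper: raw score divided by max points for one assignment."""
    # Assumption: a missing score means the assignment was not turned in.
    return df[name].fillna(0) / df[f"{name} - Max Points"]


# Two hand-made students: one scored 50/100, one never submitted.
tiny = pd.DataFrame({
    "lab01": [50.0, None],
    "lab01 - Max Points": [100.0, 100.0],
})

expected = [0.5, 0.0]  # worked out by hand
result = score_fraction(tiny, "lab01")
assert result.tolist() == expected
```

If the assertion fails, the hand calculation and the code disagree, and the small size of the input makes it easy to see which one is wrong.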

Run the cell below to load in the aforementioned `grades` dataset.

In [5]:
grades_fp = Path('data') / 'grades.csv'
grades = pd.read_csv(grades_fp)
grades.head()

Unnamed: 0,PID,College,Level,Section,lab01,lab01 - Max Points,lab01 - Lateness (H:M:S),lab02,lab02 - Max Points,lab02 - Lateness (H:M:S),...,discussion07 - Lateness (H:M:S),discussion08,discussion08 - Max Points,discussion08 - Lateness (H:M:S),discussion09,discussion09 - Max Points,discussion09 - Lateness (H:M:S),discussion10,discussion10 - Max Points,discussion10 - Lateness (H:M:S)
0,A99706914,ERC,JR,A22,99.735279,100.0,00:00:00,84.990171,100.0,00:00:00,...,00:00:00,8.895294,10,00:00:00,10.0,10,780:01:28,10.0,10,00:00:00
1,A99237411,Eighth,JR,A29,98.829476,100.0,00:00:00,50.784231,100.0,00:00:00,...,669:12:21,9.022407,10,00:00:00,9.020283,10,00:00:00,9.437368,10,00:00:00
2,A99690544,Revelle,SR,A12,86.513369,100.0,00:00:00,47.80282,100.0,00:00:00,...,00:00:00,3.030538,10,00:04:51,7.613698,10,00:00:00,9.624617,10,00:00:00
3,A99427381,Seventh,JR,A14,100.0,100.0,00:00:00,100.0,100.0,00:00:00,...,00:00:00,10.0,10,00:00:00,9.249126,10,00:00:00,10.0,10,00:00:00
4,A99489712,Sixth,JR,A24,66.506974,100.0,00:00:00,33.422412,100.0,00:00:00,...,00:00:00,4.439606,10,00:00:00,4.485291,10,00:00:00,6.282712,10,00:00:00


***Tip***: The `grades` DataFrame has 101 columns, and you can't see them all right now. To get a feel for what all of the columns represent, you might consider opening `grades.csv` with a spreadsheet application, like Google Sheets or Excel.

<a name='part1'></a>

## Part 1: Initial Calculations 🔢

([return to the outline](#Navigating-the-Project))

In Part 1, you'll compute students' letter grades in CSD 18 using [the syllabus](#The-Syllabus) provided above. As you'll see, this requires many steps. Let's get started!

<!-- ### ✅ Question 1 (Checkpoint Question) -->
### Question 1 


<a name='Question-1-(Checkpoint-Question)'></a>

([return to the outline](#Navigating-the-Project))

Complete the implementation of the function `get_assignment_names`, which takes in a DataFrame like `grades` and returns a dictionary with the following structure:
- The keys are the general areas of [the syllabus](#The-Syllabus): `'lab'`, `'project'`, `'midterm'`, `'final'`, `'disc'`, and `'checkpoint'`.
- The values are **lists** that contain all the assignment names of that type. For example, the lab assignments all have names of the form `'labXX'` where `XX` is a zero-padded two-digit number. If the class has 5 labs, the returned dictionary's value for the `'lab'` key should be `['lab01', 'lab02', 'lab03', 'lab04', 'lab05']`.

***Notes***: 
- Some of the column names in the DataFrame contain the assignment name in the zero-padded fashion requested; you should use this to your advantage when building the dictionary.
- The point of this question is to familiarize you with the names of the columns in `grades`. Try to reuse your `get_assignment_names` function in future questions – if you find yourself never using it again, you may be redoing work unnecessarily.

In [None]:
# I need this to operate on any number of columns, so I probably need to iterate over the column names rather than hard-code df["n_column"].

In [None]:
# Find a contains-style method that searches all the column names
# for this pattern and works for any number of columns.


In [6]:
def get_assignment_names(grades_df):
    """
    Extract assignment names from the grades DataFrame and categorize them.

    Args:
        grades_df (pd.DataFrame): DataFrame containing grade information.

    Returns:
        dict: A dictionary where keys are syllabus assignment categories \
        and values are lists of assignment names.
    """
    categories = {
        'lab': [],
        'project': [],
        'midterm': [],
        'final': [],
        'disc': [],
        'checkpoint': []
    }

    for column in grades_df.columns:
        parts = column.split()
        if len(parts) > 1:
            continue  # Skip columns with additional information (e.g., "Max Points", "Lateness")
        
        assignment = parts[0]
        if assignment.startswith('lab'):
            categories['lab'].append(assignment)
        elif assignment.startswith('project'):
            categories['project'].append(assignment)
        elif assignment.startswith('Midterm'):
            categories['midterm'].append(assignment)
        elif assignment.startswith('Final'):
            categories['final'].append(assignment)
        elif assignment.startswith('discussion'):
            categories['disc'].append(assignment)
        elif assignment.startswith('checkpoint'):
            categories['checkpoint'].append(assignment)

    return categories

In [7]:
assignments_n = get_assignment_names(grades)

In [8]:
(assignments_n)

{'lab': ['lab01',
  'lab02',
  'lab03',
  'lab04',
  'lab05',
  'lab06',
  'lab07',
  'lab08',
  'lab09'],
 'project': ['project01',
  'project01_free_response',
  'project02_checkpoint01',
  'project02_checkpoint02',
  'project02',
  'project02_free_response',
  'project03_checkpoint01',
  'project03',
  'project05_free_response',
  'project04',
  'project05'],
 'midterm': ['Midterm'],
 'final': ['Final'],
 'disc': ['discussion01',
  'discussion02',
  'discussion03',
  'discussion04',
  'discussion05',
  'discussion06',
  'discussion07',
  'discussion08',
  'discussion09',
  'discussion10'],
 'checkpoint': []}
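
In the output above, the `'project'` list also picks up the free-response and checkpoint columns, and `'checkpoint'` is empty, because every name that starts with `project` lands in the first matching branch. A possible alternative, sketched here, matches whole column names with regular expressions. It assumes that names follow the patterns visible in `grades.csv` (`labXX`, `projectXX`, `projectXX_checkpointYY`, `discussionXX`, `Midterm`, `Final`) and that free-response columns are meant to be folded into their parent project rather than listed separately. The names `get_assignment_names_sketch` and `PATTERNS` are illustrative and not part of the starter code.

```python
import re

import pandas as pd

# One full-name pattern per syllabus category (assumed naming conventions).
PATTERNS = {
    "lab": r"lab\d{2}",
    "project": r"project\d{2}",
    "midterm": r"Midterm",
    "final": r"Final",
    "disc": r"discussion\d{2}",
    "checkpoint": r"project\d{2}_checkpoint\d{2}",
}


def get_assignment_names_sketch(grades_df: pd.DataFrame) -> dict:
    """Map each category to the sorted column names that fully match its pattern."""
    return {
        key: sorted(col for col in grades_df.columns if re.fullmatch(pattern, col))
        for key, pattern in PATTERNS.items()
    }


# An empty DataFrame with a few hand-written column names is enough to try it.
demo = pd.DataFrame(columns=[
    "PID", "lab01", "lab01 - Max Points", "project01",
    "project01_free_response", "project02_checkpoint01", "project02",
    "Midterm", "Final", "discussion01",
])
print(get_assignment_names_sketch(demo))
```

Because `re.fullmatch` requires the entire name to match, `'project01_free_response'` and `'lab01 - Max Points'` are skipped without any extra filtering.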

In [9]:
# names_of_assignments = parse_grade_results(assignments_n)

In [10]:
# (names_of_assignments)

In [11]:
# values_paraphrased = (get_unique_words(names_of_assignments))

In [12]:
# (values_paraphrased)

In [13]:
grader.check("q1")

Now you're ready to compute each student's overall grade on the first type of assignment – projects.

<!-- ### ✅ Question 2 (Checkpoint Question) -->
### Question 2


<a name='Question-2-(Checkpoint-Question)'></a>

([return to the outline](#Navigating-the-Project))

Complete the implementation of the function `projects_total`, which takes in a DataFrame like `grades` and returns a Series containing the total project grade for each student for the entire quarter, according to [the syllabus](#The-Syllabus). The output Series should contain values between 0 and 1.

***Notes***:

- If a student didn't turn in a particular project, what should their grade for it be? 
- Some projects have free response components that you need to account for when calculating the total points earned by a student and the max points possible for that project.
    - For instance, let's say Tiffany earned 82 points on the autograded portion of Project 1 and 13 points on the free response portion. This means that her overall Project 1 grade should be:
    $$
        \text{Project 1 Grade} = \frac{82+13}{85+15} = 0.95
    $$
- Per [the syllabus](#The-Syllabus), students may submit projects (and checkpoints and discussions) late without penalty.
- Do not include scores on checkpoint assignments in your calculations.
- To check your work, try:
    1. Calculating the total project scores for a few types of students by hand.
    2. Calculating summary statistics for the whole class' performance on a few projects in particular and ensuring the results seem reasonable.
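
To make the Tiffany example concrete, the snippet below adds earned points and max points across both portions before dividing. It also shows that averaging the two portion percentages gives a different number, which is why the points are summed first. The variable names are illustrative.

```python
# Tiffany's Project 1 scores from the example above.
autograded, autograded_max = 82, 85
free_response, free_response_max = 13, 15

# Sum earned points and max points across both portions, then divide.
project_grade = (autograded + free_response) / (autograded_max + free_response_max)
print(project_grade)  # 0.95

# Averaging the two percentages weights the 15-point portion as heavily
# as the 85-point portion, so it does not reproduce the grade above.
mean_of_ratios = (autograded / autograded_max + free_response / free_response_max) / 2
print(round(mean_of_ratios, 4))  # 0.9157
```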

In [None]:
def projects_total(the_grades):
    temp_grades = final_parse_grades(the_grades,project)
    # initiate the search function, then ...

In [None]:
projects = []
for i in grades.columns:
    if "project" in i:
        projects.append(i)
projects

In [None]:
grade_weights = {"lab": 20, "project": 30, "checkpoint": 2.5, "disc": 2.5, "midterm": 15, "final": 30}
# TODO: what was the policy for a project that was not turned in?
# TODO: compute one-variable summary statistics for the class (mean, standard deviation, etc.)

In [319]:
projects_total(grades)

Unnamed: 0,0
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0
...,...
530,1.0
531,1.0
532,1.0
533,1.0


In [291]:
project_cols = grades.columns.tolist()
project_num = project_cols[col].split()[0]
project_groups = {}

for col in project_num:

    if "project01"  in col:
        project_groups[col] = []
    project_groups[project_num].append(col)
project_groups

TypeError: list indices must be integers or slices, not str
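
The `TypeError` above comes from indexing the list `project_cols` with `col`, which presumably still held a string from an earlier cell; lists accept only integers or slices as indices. A sketch of the grouping that the cell seems to be aiming for follows. It runs on a short, hand-written list of column names that imitates the naming in `grades` rather than on the real DataFrame, and `group_project_columns` is a hypothetical helper name.

```python
from collections import defaultdict


def group_project_columns(columns):
    """Group project-related column names by their 'projectXX' prefix."""
    groups = defaultdict(list)
    for col in columns:
        # Skip non-project columns and project checkpoints.
        if not col.startswith("project") or "checkpoint" in col:
            continue
        # 'project01_free_response - Max Points' -> 'project01'
        key = col.split(" - ")[0].split("_")[0]
        groups[key].append(col)
    return dict(groups)


columns = [
    "PID",
    "project01",
    "project01 - Max Points",
    "project01_free_response",
    "project01_free_response - Max Points",
    "project02_checkpoint01",
    "project03",
    "project03 - Max Points",
]
print(group_project_columns(columns))
```

Stripping the `' - ...'` suffix before splitting on `'_'` keeps a score column and its `Max Points` column in the same group.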

In [321]:
def projects_total(grades):
    # Identify project columns
    project_cols = [col for col in grades.columns if col.startswith('project') and 'checkpoint' not in col]
    
    # Group columns by project
    project_groups = {}
    for col in project_cols:
        project_num = col.split('project')[1].split('_')[0]
        if project_num not in project_groups:
            project_groups[project_num] = []
        project_groups[project_num].append(col)
    
    # Calculate total points for each project
    project_scores = pd.DataFrame()
    for project_num, cols in project_groups.items():
        score_cols = [col for col in cols if 'Max Points' not in col and 'Lateness' not in col]
        max_cols = [col for col in cols if 'Max Points' in col]
        
        project_score = grades[score_cols].sum(axis=1)
        project_max = grades[max_cols].sum(axis=1)
        
        project_scores[f'Project {project_num}'] = project_score / project_max
    
    # Calculate average project score
    total_project_score = project_scores.mean(axis=1)
    
    # Ensure all scores are between 0 and 1
    # total_project_score = total_project_score.clip(0, 1)

    # Without the clip above, the ratio comes out as infinity. The denominator should
    # be the total max points for a project: for example, 15 points on the free
    # response plus 85 points on the autograder section gives a denominator of 100.
    
    return total_project_score

    # actual output : 
    # 0      1.0
    # 1      1.0
    # 2      1.0
    # 3      1.0
    # 4      1.0
    #       ... 
    # 530    1.0
    # 531    1.0
    # 532    1.0
    # 533    1.0
    # 534    1.0
    # Length: 535, dtype: float64

In [322]:
(projects_total(grades))

0      inf
1      inf
2      inf
3      inf
4      inf
      ... 
530    inf
531    inf
532    inf
533    inf
534    inf
Length: 535, dtype: float64
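
A likely reason for the `inf` values: `col.split('project')[1].split('_')[0]` turns `'project03 - Max Points'` into the key `'03 - Max Points'`, so a score column and its max-points column end up in different groups. The group holding only the score column then has an empty `max_cols`, its max-points sum is 0, and pandas returns `inf` for a positive number divided by zero. Clipping to the range 0 to 1 maps `inf` to 1.0, which would explain the all-1.0 output of the clipped variants. The toy example below reproduces that behavior with made-up numbers.

```python
import numpy as np
import pandas as pd

# Made-up scores for two students on one project.
toy = pd.DataFrame({
    "project03": [85.5, 88.2],
    "project03 - Max Points": [100.0, 100.0],
})

# Summing over an empty column selection gives 0 for every row.
empty_max = toy[[]].sum(axis=1)
print(empty_max.tolist())

# A positive score divided by 0 is inf in pandas, not an error.
broken = toy["project03"] / empty_max
print(broken.tolist())

# Clipping hides the problem by turning inf into 1.0.
print(broken.clip(0, 1).tolist())

# Pairing the score with its own max-points column gives the intended fraction.
fixed = toy["project03"] / toy["project03 - Max Points"]
print(fixed.tolist())
```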

In [None]:
# project_cols = [col for col in grades.columns if col.startswith('project') and 'checkpoint' not in col]
    
#     # Group columns by project
# project_groups = {}
# for col in project_cols:
#     project_num = col.split('project')[1].split('_')[0]
#     if project_num not in project_groups:
#         project_groups[project_num] = []
#     project_groups[project_num].append(col)

# project_groups.values()

In [None]:
# project_scores = []
# for project_num, cols in project_groups.items():
#     score_cols = [col for col in cols if 'Max Points' not in col and 'Lateness' not in col]
#     max_cols = [col for col in cols if 'Max Points' in col]

#     print(grades[score_cols].iloc[2])
#     project_score = grades[score_cols].sum(axis=1)
#     project_max = grades[max_cols].sum(axis=1)
    
#     project_grade = project_score / project_max
#     project_scores.append(project_grade)
#     # print()
# # project_scores

In [323]:
def projects_total(grades):
    # Identify project columns
    project_cols = [col for col in grades.columns if col.startswith('project') and 'checkpoint' not in col]
    
    # Group columns by project
    project_groups = {}
    for col in project_cols:
        project_num = col.split('project')[1].split('_')[0]
        if project_num not in project_groups:
            project_groups[project_num] = []
        project_groups[project_num].append(col)
    
    # Calculate total points for each project
    project_scores = []
    for project_num, cols in project_groups.items():
        score_cols = [col for col in cols if 'Max Points' not in col and 'Lateness' not in col]
        max_cols = [col for col in cols if 'Max Points' in col]
        
        project_score = grades[score_cols].sum(axis=1)
        project_max = grades[max_cols].sum(axis=1)
        
        project_grade = project_score / project_max
        project_scores.append(project_grade)
    
    # Calculate average project score
    total_project_score = pd.concat(project_scores, axis=1).mean(axis=1)
    
    # Ensure all scores are between 0 and 1
    total_project_score = total_project_score.clip(0, 1)
    
    return total_project_score.to_frame()

In [324]:
projects_total(grades)

Unnamed: 0,0
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0
...,...
530,1.0
531,1.0
532,1.0
533,1.0


In [11]:
projects_total(grades)

0      0.911497
1      0.720526
2      0.663858
3      0.932194
4      0.683966
         ...   
530    0.913291
531    0.843538
532    0.846640
533    0.802447
534    0.914951
Length: 535, dtype: float64

In [213]:
mani = projects_total(grades)
mani["03"]

['project01', 'project01 - Max Points', 'project01 - Lateness (H:M:S)', 'project01_free_response', 'project01_free_response - Max Points', 'project01_free_response - Lateness (H:M:S)', 'project02', 'project02 - Max Points', 'project02 - Lateness (H:M:S)', 'project02_free_response', 'project02_free_response - Max Points', 'project02_free_response - Lateness (H:M:S)', 'project03', 'project03 - Max Points', 'project03 - Lateness (H:M:S)', 'project05_free_response', 'project05_free_response - Max Points', 'project05_free_response - Lateness (H:M:S)', 'project04', 'project04 - Max Points', 'project04 - Lateness (H:M:S)', 'project05', 'project05 - Max Points', 'project05 - Lateness (H:M:S)']


['project03']

In [None]:
grades[]

In [212]:
# grades["discussion10":]
# grades[grades[5:15]]
grades.loc[:, 'project03':]

Unnamed: 0,project03,project03 - Max Points,project03 - Lateness (H:M:S),Final,Final - Max Points,Final - Lateness (H:M:S),Total Lateness (H:M:S),project05_free_response,project05_free_response - Max Points,project05_free_response - Lateness (H:M:S),...,discussion07 - Lateness (H:M:S),discussion08,discussion08 - Max Points,discussion08 - Lateness (H:M:S),discussion09,discussion09 - Max Points,discussion09 - Lateness (H:M:S),discussion10,discussion10 - Max Points,discussion10 - Lateness (H:M:S)
0,85.519583,100.0,0.0,73.0,87.0,0.0,780:01:28,22.467797,25,0.0,...,0.0,8.895294,10,0.0,10.000000,10,0.0,10.000000,10,0.0
1,88.201035,100.0,0.0,71.0,87.0,0.0,669:12:21,19.729680,25,0.0,...,0.0,9.022407,10,0.0,9.020283,10,0.0,9.437368,10,0.0
2,77.043708,100.0,0.0,74.0,87.0,0.0,828:47:53,13.069564,25,0.0,...,0.0,3.030538,10,0.0,7.613698,10,0.0,9.624617,10,0.0
3,94.299439,100.0,0.0,75.0,87.0,0.0,120:01:11,25.000000,25,0.0,...,0.0,10.000000,10,0.0,9.249126,10,0.0,10.000000,10,0.0
4,90.805754,100.0,0.0,67.0,87.0,0.0,93:16:10,14.889447,25,0.0,...,0.0,4.439606,10,0.0,4.485291,10,0.0,6.282712,10,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
530,98.070705,100.0,0.0,65.0,87.0,0.0,491:24:29,22.167221,25,0.0,...,0.0,10.000000,10,0.0,9.169447,10,0.0,10.000000,10,0.0
531,76.384220,100.0,0.0,64.0,87.0,0.0,47:03:14,21.363680,25,0.0,...,0.0,10.000000,10,0.0,10.000000,10,0.0,10.000000,10,0.0
532,90.107681,100.0,0.0,70.0,87.0,0.0,120:01:11,21.473725,25,0.0,...,0.0,9.878661,10,0.0,8.878946,10,0.0,10.000000,10,0.0
533,88.717789,100.0,0.0,76.0,87.0,0.0,419:06:41,20.315673,25,0.0,...,0.0,7.759434,10,0.0,8.655478,10,0.0,8.102277,10,0.0


In [197]:
grades['project01_free_response - Lateness (H:M:S)'].unique()

array([0.])

In [150]:
reg = 75.28 /85
frq = 1
avg =( reg + frq) / 2
avg

0.9428235294117647

In [151]:
actual_autogrades= grades["project01"].value_counts()
actual_frqs = grades["project01_free_response"].value_counts()

In [200]:
# grades["project01"]

In [10]:

def projects_total(grades):
    # Step 1: Identify project columns (excluding checkpoints and lateness)
    project_cols = [
        col for col in grades.columns 
        if col.startswith('project') and 'checkpoint' not in col and 'Lateness' not in col
    ]

    # Step 2: Group columns by project number
    project_groups = {}
    for col in project_cols:
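        # Extract the project number (e.g., '01'); drop any ' - Max Points' suffix first
        # so that a score column and its max-points column share the same key
        project_num = col.split('project')[1].split(' - ')[0].split('_')[0]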
        if project_num not in project_groups:
            project_groups[project_num] = []
        project_groups[project_num].append(col)

    # Step 3: Calculate total earned and max points for each project
    project_scores = []  # Store normalized scores for each project

    for project_num, cols in project_groups.items():
        # Separate score columns and max points columns
        score_cols = [col for col in cols if 'Max Points' not in col]
        max_cols = [col for col in cols if 'Max Points' in col]

        # Sum earned points and max points, handling missing values as 0
        total_score = grades[score_cols].fillna(0).sum(axis=1)  # Sum of earned points
        total_max = grades[max_cols].fillna(0).sum(axis=1)  # Sum of max points

        # Handle case where max points might not be available (division by zero)
        project_grade = np.where(total_max > 0, total_score / total_max, 0)

        # Store each project grade as a Series
        project_scores.append(pd.Series(project_grade, index=grades.index, name=f'Project {project_num}'))

    # Step 4: Combine all project scores into a DataFrame and calculate the mean score
    all_projects = pd.concat(project_scores, axis=1)

    # Calculate the average project score for each student
    total_project_score = all_projects.mean(axis=1)

    # Ensure all scores are between 0 and 1
    total_project_score = total_project_score.clip(0, 1)

    return total_project_score  # a Series, as Question 2 asks


In [14]:
def projects_total(grades):
    project_weight = 0.3  # 30% for projects
    num_projects = 3
    project_max_points = 100  # Each project has a max of 100 points
    
    # Initialize a DataFrame to hold individual project grades
    project_grades = pd.DataFrame(index=grades['Student'], columns=['Project 1', 'Project 2', 'Project 3'])
    
    for i in range(1, num_projects + 1):
        autograded_col = f'Project {i} Autograded'
        frq_col = f'Project {i} FRQ'
        lateness_col = f'Lateness Project {i}'
        
        # Calculate project grade
        project_grade = grades[autograded_col].fillna(0) + grades[frq_col].fillna(0)
        
        # Set grade to 0 if project was not submitted (lateness > 0)
        project_grade = project_grade.where(grades[lateness_col] == 0, 0)
        
        # Normalize to be between 0 and 1
        project_grades[f'Project {i}'] = project_grade / project_max_points
    
    # Calculate the average project grade
    total_project_grades = project_grades.mean(axis=1) * project_weight
    
    return total_project_grades

In [15]:
projects_total(grades)

KeyError: 'Student'

In [16]:
import numpy as np
import pandas as pd

def projects_total(grades):
    # Step 1: Identify relevant project columns (excluding checkpoints and lateness)
    project_cols = [
        col for col in grades.columns 
        if col.startswith('project') and 'checkpoint' not in col and 'Lateness' not in col
    ]

    # print("Project Columns Identified:", project_cols)  # Debug print

    # Step 2: Group columns by project number
    project_groups = {}
    for col in project_cols:
        # Try different splits to ensure we correctly group the columns
        # Example: project01_max or project01_frq
        project_num = col.split("_")[0]  # Adjusting the split logic based on your suggestion

        print(f"Processing Column: {col}, Extracted Project Number: {project_num}")  # Debug print

        # Initialize the group if it doesn't exist
        if project_num not in project_groups:
            project_groups[project_num] = []
        project_groups[project_num].append(col)

    # print("Grouped Project Columns:", project_groups)  # Debug print

    # Step 3: Calculate total earned and max points for each project
    project_scores = []  # Store normalized scores for each project

    for project_num, cols in project_groups.items():
        # Separate score columns and max points columns
        score_cols = [col for col in cols if 'Max Points' not in col]
        max_cols = [col for col in cols if 'Max Points' in col]

        # print(f"Project {project_num}: Score Columns: {score_cols}, Max Columns: {max_cols}")  # Debug print

        # Sum earned points and max points, handling missing values as 0
        total_score = grades[score_cols].fillna(0).sum(axis=1)  # Earned points
        total_max = grades[max_cols].fillna(0).sum(axis=1)  # Max points

        # Handle cases where max points might not exist (avoid division by zero)
        project_grade = np.where(total_max > 0, total_score / total_max, 0)

        # Store the project grade as a Series
        project_scores.append(pd.Series(project_grade, name=f'Project {project_num}'))

    # Step 4: Combine all project scores into a DataFrame and calculate the mean score
    all_projects = pd.concat(project_scores, axis=1)

    # Calculate the average project score for each student across all projects
    total_project_score = all_projects.mean(axis=1)

    # Ensure all scores are between 0 and 1
    total_project_score = total_project_score.clip(0, 1)

    return total_project_score.to_frame()


In [17]:
projects_total(grades)

Processing Column: project01, Extracted Project Number: project01
Processing Column: project01 - Max Points, Extracted Project Number: project01 - Max Points
Processing Column: project01_free_response, Extracted Project Number: project01
Processing Column: project01_free_response - Max Points, Extracted Project Number: project01
Processing Column: project02, Extracted Project Number: project02
Processing Column: project02 - Max Points, Extracted Project Number: project02 - Max Points
Processing Column: project02_free_response, Extracted Project Number: project02
Processing Column: project02_free_response - Max Points, Extracted Project Number: project02
Processing Column: project03, Extracted Project Number: project03
Processing Column: project03 - Max Points, Extracted Project Number: project03 - Max Points
Processing Column: project05_free_response, Extracted Project Number: project05
Processing Column: project05_free_response - Max Points, Extracted Project Number: project05
Process

Unnamed: 0,0
0,1.000000
1,1.000000
2,0.995787
3,1.000000
4,1.000000
...,...
530,1.000000
531,1.000000
532,1.000000
533,1.000000


In [13]:
import numpy as np
import pandas as pd

def projects_total(grades):
    # Step 1: Identify relevant project columns (excluding checkpoints and lateness)
    project_cols = [
        col for col in grades.columns 
        if col.startswith('project') and 'checkpoint' not in col and 'Lateness' not in col
    ]

    print("Project Columns Identified:", project_cols)  # Debug print

    # Step 2: Group columns by project number, handling various name formats
    project_groups = {}
    for col in project_cols:
        # Extract project number using multiple split strategies
        if " - " in col:  # Handle case with spaces (e.g., "project02 - Max Points")
            project_num = col.split(" - ")[0]
        else:  # Handle case with underscores (e.g., "project02_FRQ")
            project_num = col.split('_')[0]

        print(f"Processing Column: {col}, Extracted Project Number: {project_num}")  # Debug print

        # Initialize the group if it doesn't exist
        if project_num not in project_groups:
            project_groups[project_num] = []
        project_groups[project_num].append(col)

    print("Grouped Project Columns:", project_groups)  # Debug print

    # Step 3: Calculate individual project grades
    project_scores = []  # Store normalized scores for each project

    for project_num, cols in project_groups.items():
        # Separate score columns and max points columns based on naming patterns
        score_cols = [col for col in cols if 'Max Points' not in col]
        max_cols = [col for col in cols if 'Max Points' in col]

        print(f"Project {project_num}: Score Columns: {score_cols}, Max Columns: {max_cols}")  # Debug print

        # Sum earned points and max points, handling missing values as 0
        total_score = grades[score_cols].fillna(0).sum(axis=1)  # Earned points
        total_max = grades[max_cols].fillna(0).sum(axis=1)  # Max points

        # Handle cases where max points might not exist (avoid division by zero)
        project_grade = np.where(total_max > 0, total_score / total_max, 0)

        # Store the project grade as a Series with a meaningful name
        project_scores.append(pd.Series(project_grade, name=f'Project {project_num}'))

    # Step 4: Combine all project grades and calculate the average score
    all_projects = pd.concat(project_scores, axis=1)

    print("All Project Grades Combined:\n", all_projects)  # Debug print

    # Calculate the average project score for each student across all projects
    total_project_score = all_projects.mean(axis=1)

    # Ensure all scores are between 0 and 1
    total_project_score = total_project_score.clip(0, 1)

    return total_project_score.to_frame()


In [14]:
projects_total(grades)

Project Columns Identified: ['project01', 'project01 - Max Points', 'project01_free_response', 'project01_free_response - Max Points', 'project02', 'project02 - Max Points', 'project02_free_response', 'project02_free_response - Max Points', 'project03', 'project03 - Max Points', 'project05_free_response', 'project05_free_response - Max Points', 'project04', 'project04 - Max Points', 'project05', 'project05 - Max Points']
Processing Column: project01, Extracted Project Number: project01
Processing Column: project01 - Max Points, Extracted Project Number: project01
Processing Column: project01_free_response, Extracted Project Number: project01
Processing Column: project01_free_response - Max Points, Extracted Project Number: project01_free_response
Processing Column: project02, Extracted Project Number: project02
Processing Column: project02 - Max Points, Extracted Project Number: project02
Processing Column: project02_free_response, Extracted Project Number: project02
Processing Column:

Unnamed: 0,0
0,0.672287
1,0.561476
2,0.500564
3,0.703480
4,0.540847
...,...
530,0.692205
531,0.637698
532,0.632557
533,0.597411


In [15]:
import numpy as np
import pandas as pd

def projects_total(grades):
    # Step 1: Identify relevant project columns
    project_cols = [
        col for col in grades.columns 
        if col.startswith('project') and 'checkpoint' not in col
    ]

    print("Project Columns Identified:", project_cols)  # Debug print

    # Step 2: Group columns by project number, handling FRQ columns separately
    project_groups = {}
    frq_groups = {}
    
    for col in project_cols:
        if " - Max" in col:  # Max Points columns
            # Extract project number
            project_num = col.split(" - Max")[0]
            if project_num not in project_groups:
                project_groups[project_num] = []
            project_groups[project_num].append(col)  # Add to the project group
            
        elif "_free_response" in col:  # FRQ columns
            project_num = col.split("_free_response")[0]
            if project_num not in frq_groups:
                frq_groups[project_num] = []
            frq_groups[project_num].append(col)  # Add to the FRQ group
            
        else:  # Regular score columns
            project_num = col.split('_')[0]  # Handle both cases
            if project_num not in project_groups:
                project_groups[project_num] = []
            project_groups[project_num].append(col)  # Add to the project group

    print("Grouped Project Columns:", project_groups)  # Debug print
    print("Grouped FRQ Columns:", frq_groups)  # Debug print

    # Step 3: Calculate individual project grades
    project_scores = []  # Store normalized scores for each project

    for project_num, cols in project_groups.items():
        # Separate score columns and max points columns
        score_cols = [col for col in cols if 'Max Points' not in col]
        max_cols = [col for col in cols if 'Max Points' in col]

        print(f"Project {project_num}: Score Columns: {score_cols}, Max Columns: {max_cols}")  # Debug print

        # Sum earned points and max points for the project
        total_score = grades[score_cols].fillna(0).sum(axis=1)  # Earned points
        total_max = grades[max_cols].fillna(0).sum(axis=1)  # Max points

        # Handle cases where max points might not exist (avoid division by zero)
        project_grade = np.where(total_max > 0, total_score / total_max, 0)

        # If there are FRQ columns, include their scores
        if project_num in frq_groups:
            frq_score_cols = frq_groups[project_num]
            total_frq_score = grades[frq_score_cols].fillna(0).sum(axis=1)
            total_frq_max = grades[[col for col in frq_score_cols if 'Max Points' in col]].fillna(0).sum(axis=1)
            
            # Handle FRQ grading separately
            frq_grade = np.where(total_frq_max > 0, total_frq_score / total_frq_max, 0)

            # Combine project grade and FRQ grade
            combined_grade = (project_grade + frq_grade) / 2  # Average both grades
            project_scores.append(pd.Series(combined_grade, name=f'Project {project_num} (Combined)'))
        else:
            # Only the project grade if no FRQ component
            project_scores.append(pd.Series(project_grade, name=f'Project {project_num}'))

    # Step 4: Combine all project grades and calculate the average score
    all_projects = pd.concat(project_scores, axis=1)

    print("All Project Grades Combined:\n", all_projects)  # Debug print

    # Calculate the average project score for each student across all projects
    total_project_score = all_projects.mean(axis=1)

    # Ensure all scores are between 0 and 1
    total_project_score = total_project_score.clip(0, 1)

    return total_project_score.to_frame()


In [16]:
projects_total(grades)

Project Columns Identified: ['project01', 'project01 - Max Points', 'project01 - Lateness (H:M:S)', 'project01_free_response', 'project01_free_response - Max Points', 'project01_free_response - Lateness (H:M:S)', 'project02', 'project02 - Max Points', 'project02 - Lateness (H:M:S)', 'project02_free_response', 'project02_free_response - Max Points', 'project02_free_response - Lateness (H:M:S)', 'project03', 'project03 - Max Points', 'project03 - Lateness (H:M:S)', 'project05_free_response', 'project05_free_response - Max Points', 'project05_free_response - Lateness (H:M:S)', 'project04', 'project04 - Max Points', 'project04 - Lateness (H:M:S)', 'project05', 'project05 - Max Points', 'project05 - Lateness (H:M:S)']
Grouped Project Columns: {'project01': ['project01', 'project01 - Max Points'], 'project01 - Lateness (H:M:S)': ['project01 - Lateness (H:M:S)'], 'project01_free_response': ['project01_free_response - Max Points'], 'project02': ['project02', 'project02 - Max Points'], 'project

TypeError: unsupported operand type(s) for +: 'float' and 'str'

<h1>correct summation series (i believe)</h1>

In [15]:
def projects_total(grades):
    # Identify project columns
    project_cols = [col for col in grades.columns if col.startswith('project') and 'checkpoint' not in col]
    # print(project_cols)
    # Group columns by project
    project_groups = {}

    # Note: the keys here include more than just '01', '02', etc. (e.g. '01 - Max Points')
    for col in project_cols:
        project_num = col.split('project')[1].split('_')[0]
        if project_num not in project_groups:
            project_groups[project_num] = []
        project_groups[project_num].append(col)
    # Calculate total points for each project


    # return project_groups
    project_scores = []
    
    for project_num, cols in project_groups.items():
        score_cols = [col for col in cols if 'Max Points' not in col and 'Lateness' not in col]
        max_cols = [col for col in cols if 'Max Points' in col]
        # print(score_cols)
        # print(max_cols)
        project_score = grades[score_cols].fillna(0).sum(axis=1)
        project_max = grades[max_cols].fillna(0).sum(axis=1)

        # print(project_score)
        # print(project_max)
        # Avoid division by zero
        project_grade = np.where(project_max > 0, project_score / project_max, 0)
        project_scores.append(pd.Series(project_grade, name=f'Project {project_num}'))
        # print(project_num)
        
    
    # print(project_scores)
    
    # Calculate average project score
    total_project_score = pd.concat(project_scores, axis=1).mean(axis=1)
    
    # Ensure all scores are between 0 and 1
    total_project_score = total_project_score.clip(0, 1)
    
    return total_project_score

In [16]:
projects_total(grades)

0      0.911497
1      0.720526
2      0.663858
3      0.932194
4      0.683966
         ...   
530    0.913291
531    0.843538
532    0.846640
533    0.802447
534    0.914951
Length: 535, dtype: float64

<h1>actual grades for each student</h1>

In [14]:
#perplex
import numpy as np
import pandas as pd

def projects_total(grades):
    # Step 1: Identify relevant project columns (excluding checkpoints and lateness)
    project_cols = [
        col for col in grades.columns 
        if col.startswith('project') and 'checkpoint' not in col.lower() and 'lateness' not in col.lower()
    ]

    # print("Project Columns Identified:", project_cols)  # Debug print

    # Step 2: Group columns by project number, handling various name formats
    project_groups = {}
    for col in project_cols:
        # Extract project number using multiple split strategies
        if " - " in col:
            project_num = col.split(" - ")[0].split("_")[0]
        else:
            project_num = col.split('_')[0]

        # print(f"Processing Column: {col}, Extracted Project Number: {project_num}")  # Debug print

        # Initialize the group if it doesn't exist
        if project_num not in project_groups:
            project_groups[project_num] = []
        project_groups[project_num].append(col)

    # print("Grouped Project Columns:", project_groups)  # Debug print

    # Step 3: Calculate individual project grades
    project_scores = []  # Store normalized scores for each project

    for project_num, cols in project_groups.items():
        # Separate score columns and max points columns based on naming patterns
        max_cols = [col for col in cols if any(term in col.lower() for term in ['max point', 'max score', 'total possible'])]
        score_cols = [col for col in cols if col not in max_cols]

        # print(f"Project {project_num}: Score Columns: {score_cols}, Max Columns: {max_cols}")  # Debug print

        # Sum earned points and max points, handling missing values as 0
        total_score = grades[score_cols].fillna(0).sum(axis=1)  # Earned points
        total_max = grades[max_cols].fillna(0).sum(axis=1)  # Max points

        # Handle cases where max points might not exist (avoid division by zero)
        project_grade = np.where(total_max > 0, total_score / total_max, 0)

        # Store the project grade as a Series with a meaningful name
        project_scores.append(pd.Series(project_grade, name=f'Project {project_num}'))

    # Step 4: Combine all project grades and calculate the average score
    all_projects = pd.concat(project_scores, axis=1)

    print("All Project Grades Combined:\n", all_projects)  # Debug print

    # Calculate the average project score for each student across all projects
    total_project_score = all_projects.mean(axis=1)

    # Ensure all scores are between 0 and 1
    total_project_score = total_project_score.clip(0, 1)

    return total_project_score

In [15]:
projects_total(grades)

All Project Grades Combined:
      Project project01  Project project02  Project project03  \
0             0.902826           0.949554           0.855196   
1             0.629166           0.879261           0.882010   
2             0.611228           0.872909           0.770437   
3             0.924086           0.955583           0.942994   
4             0.568237           0.960433           0.908058   
..                 ...                ...                ...   
530           0.926306           0.909326           0.980707   
531           0.756583           0.956447           0.763842   
532           0.812733           0.967382           0.901077   
533           0.767939           0.920880           0.887178   
534           0.925721           0.920477           0.883246   

     Project project05  Project project04  
0             0.963848           0.909746  
1             0.774100           0.665124  
2             0.597845           0.553977  
3             1.000000   

0      0.916234
1      0.765932
2      0.681279
3      0.962581
4      0.737446
         ...   
530    0.949434
531    0.866795
532    0.862050
533    0.813468
534    0.939433
Length: 535, dtype: float64

In [376]:
first_kid_auto = grades["project01"][0]

first_kid_frq = grades["project01_free_response"][0]
(first_kid_auto + first_kid_frq)/100

np.float64(0.9028263198246907)

In [356]:
grades.columns.to_list

<bound method IndexOpsMixin.tolist of Index(['PID', 'College', 'Level', 'Section', 'lab01', 'lab01 - Max Points',
       'lab01 - Lateness (H:M:S)', 'lab02', 'lab02 - Max Points',
       'lab02 - Lateness (H:M:S)',
       ...
       'discussion07 - Lateness (H:M:S)', 'discussion08',
       'discussion08 - Max Points', 'discussion08 - Lateness (H:M:S)',
       'discussion09', 'discussion09 - Max Points',
       'discussion09 - Lateness (H:M:S)', 'discussion10',
       'discussion10 - Max Points', 'discussion10 - Lateness (H:M:S)'],
      dtype='object', length=101)>

In [343]:
[i for i in grades.columns.to_list() if "project04" in i]
for i in grades.columns.to_list():
    print(i)

PID
College
Level
Section
lab01
lab01 - Max Points
lab01 - Lateness (H:M:S)
lab02
lab02 - Max Points
lab02 - Lateness (H:M:S)
project01
project01 - Max Points
project01 - Lateness (H:M:S)
lab03
lab03 - Max Points
lab03 - Lateness (H:M:S)
project01_free_response
project01_free_response - Max Points
project01_free_response - Lateness (H:M:S)
lab04
lab04 - Max Points
lab04 - Lateness (H:M:S)
lab05
lab05 - Max Points
lab05 - Lateness (H:M:S)
project02_checkpoint01
project02_checkpoint01 - Max Points
project02_checkpoint01 - Lateness (H:M:S)
Midterm
Midterm - Max Points
Midterm - Lateness (H:M:S)
lab06
lab06 - Max Points
lab06 - Lateness (H:M:S)
project02_checkpoint02
project02_checkpoint02 - Max Points
project02_checkpoint02 - Lateness (H:M:S)
lab07
lab07 - Max Points
lab07 - Lateness (H:M:S)
project02
project02 - Max Points
project02 - Lateness (H:M:S)
project02_free_response
project02_free_response - Max Points
project02_free_response - Lateness (H:M:S)
lab08
lab08 - Max Points
lab08 - L

In [None]:
grades["project01_free_response - Max Points"]

In [18]:
import numpy as np
import pandas as pd

def projects_total(grades):
    # Step 1: Identify relevant project columns (excluding checkpoints and lateness)
    project_cols = [
        col for col in grades.columns 
        if col.startswith('project') and 'checkpoint' not in col.lower() and 'lateness' not in col.lower()
    ]

    # Step 2: Group columns by project number
    project_groups = {}
    for col in project_cols:
        project_num = col.split('_')[1]  # Assuming format 'project_1_score', 'project_1_max', etc.
        if project_num not in project_groups:
            project_groups[project_num] = []
        project_groups[project_num].append(col)

    # Step 3: Calculate individual project grades
    project_scores = []

    for project_num, cols in project_groups.items():
        score_cols = [col for col in cols if 'score' in col.lower() or 'frq' in col.lower()]
        max_cols = [col for col in cols if 'max' in col.lower()]

        total_score = grades[score_cols].sum(axis=1)
        total_max = grades[max_cols].sum(axis=1)

        project_grade = total_score / total_max
        project_grade = project_grade.fillna(0)  # If a student didn't turn in a project, their grade is 0
        project_scores.append(project_grade)

    # Step 4: Calculate the average project score
    all_projects = pd.concat(project_scores, axis=1)
    total_project_score = all_projects.mean(axis=1)

    # Ensure all scores are between 0 and 1
    total_project_score = total_project_score.clip(0, 1)

    return total_project_score

In [20]:
projects_total(grades)

IndexError: list index out of range

In [28]:
import numpy as np
import pandas as pd

def projects_total(grades):
    # Step 1: Identify relevant project columns (excluding checkpoints and lateness)
    project_cols = [
        col for col in grades.columns 
        if col.startswith('project') and 'checkpoint' not in col.lower() and 'lateness' not in col.lower()
    ]

    # Step 2: Group columns by project number
    project_groups = {}
    for col in project_cols:
        project_num = col.split('project')[1].split()[0]  # Extract project number
        if project_num not in project_groups:
            project_groups[project_num] = []
        project_groups[project_num].append(col)

    # Step 3: Calculate individual project grades
    project_scores = []

    for project_num, cols in project_groups.items():
        score_cols = [col for col in cols if 'max points' not in col.lower() and 'free_response' not in col]
        frq_cols = [col for col in cols if 'free_response' in col and 'max points' not in col.lower()]
        max_cols = [col for col in cols if 'max points' in col.lower()]

        # Calculate total score (including free response if available)
        total_score = grades[score_cols].fillna(0).sum(axis=1) + grades[frq_cols].fillna(0).sum(axis=1)
        total_max = grades[max_cols].fillna(0).sum(axis=1)

        project_grade = np.where(total_max > 0, total_score / total_max, 0)
        project_scores.append(pd.Series(project_grade, name=f'Project {project_num}'))

    # Step 4: Calculate the average project score
    all_projects = pd.concat(project_scores, axis=1)
    total_project_score = all_projects.mean(axis=1)

    # Ensure all scores are between 0 and 1
    total_project_score = total_project_score.clip(0, 1)

    return total_project_score

In [29]:
projects_total(grades)

0      0.916639
1      0.779114
2      0.730404
3      0.961331
4      0.776753
         ...   
530    0.943406
531    0.815442
532    0.879538
533    0.810130
534    0.945605
Length: 535, dtype: float64

In [30]:
import numpy as np
import pandas as pd

def projects_total(grades):
    """
    Calculate the total project score for each student, handling both autograded and 
    free response portions correctly.
    
    Args:
        grades (pd.DataFrame): DataFrame containing all grade information
        
    Returns:
        pd.Series: Series containing the average project score (0-1) for each student
    """
    # Step 1: Identify project columns (excluding checkpoints and lateness)
    project_cols = [
        col for col in grades.columns 
        if col.startswith('project') 
        and 'checkpoint' not in col.lower() 
        and 'lateness' not in col.lower()
    ]
    
    # Step 2: Group columns by project number
    project_groups = {}
    for col in project_cols:
        # Extract project number (e.g., "01" from "project01" or "project01_free_response")
        if '_' in col:  # Handle free response columns
            project_num = col.split('project')[1].split('_')[0]
        else:
            project_num = col.split('project')[1].split()[0]
        
        if project_num not in project_groups:
            project_groups[project_num] = []
        project_groups[project_num].append(col)
    
    # Step 3: Calculate individual project scores
    project_scores = []
    
    for project_num, cols in project_groups.items():
        # Separate columns by type
        auto_cols = [col for col in cols if 'free_response' not in col and 'Max Points' not in col]
        frq_cols = [col for col in cols if 'free_response' in col and 'Max Points' not in col]
        
        # Get corresponding max points columns
        auto_max_cols = [f"{col} - Max Points" for col in auto_cols if f"{col} - Max Points" in grades.columns]
        frq_max_cols = [f"{col} - Max Points" for col in frq_cols if f"{col} - Max Points" in grades.columns]
        
        # Calculate autograded portion
        auto_score = grades[auto_cols].fillna(0).sum(axis=1)
        auto_max = grades[auto_max_cols].fillna(0).sum(axis=1)
        
        # Calculate free response portion if it exists
        frq_score = grades[frq_cols].fillna(0).sum(axis=1)
        frq_max = grades[frq_max_cols].fillna(0).sum(axis=1)
        
        # Combine scores
        total_score = auto_score + frq_score
        total_max = auto_max + frq_max
        
        # Calculate normalized score (0-1)
        project_grade = np.where(total_max > 0, total_score / total_max, 0)
        project_scores.append(pd.Series(project_grade, name=f'Project {project_num}'))
    
    # Step 4: Calculate average score across all projects
    all_projects = pd.concat(project_scores, axis=1)
    total_project_score = all_projects.mean(axis=1)
    
    return total_project_score.clip(0, 1)

In [31]:
projects_total(grades)

0      0.916234
1      0.765932
2      0.681279
3      0.962581
4      0.737446
         ...   
530    0.949434
531    0.866795
532    0.862050
533    0.813468
534    0.939433
Length: 535, dtype: float64

In [32]:
grader.check("q2")

Now that projects are out of the way, you need to clean and process the lab grades. This will involve a bit more work than was necessary for projects. Specifically, you'll:
- identify late submissions (Question 3), 
- compute normalized scores for each lab assignment, factoring in late penalties (Question 4), and 
- drop the lowest lab grade and compute a total lab score for each student (Question 5).
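
As a rough illustration of the last step in the list above, the sketch below drops each student's lowest lab score and averages the rest. The DataFrame `toy_labs` and its values are hypothetical, and treating the "total lab score" as the mean of the remaining labs is an assumption here, not something stated in this part of the notebook.

```python
import pandas as pd

# Hypothetical processed lab scores, already normalized to the range 0-1.
toy_labs = pd.DataFrame({
    "lab01": [1.0, 0.5, 0.0],
    "lab02": [0.8, 0.9, 0.0],
    "lab03": [0.6, 0.7, 0.3],
})

# Assumption: the total is the mean of the labs that remain
# after removing each student's single lowest score.
n_labs = toy_labs.shape[1]
lab_total_sketch = (toy_labs.sum(axis=1) - toy_labs.min(axis=1)) / (n_labs - 1)
print(lab_total_sketch)
```

Subtracting the row minimum from the row sum avoids sorting each row, and it still removes exactly one score when the lowest value is tied.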

<!-- ### ✅ Question 3 (Checkpoint Question) -->
### Question 3 


<a name='Question-3-(Checkpoint-Question)'></a>

([return to the outline](#Navigating-the-Project))

Recall, per [the syllabus](#The-Syllabus), labs are the only assignment category for which late penalties are enforced:

>  Each lab may be revised for up to (and including) one week after the deadline for a 10% penalty, for up to (and including) two weeks after the deadline for a 30% penalty, and beyond that for a 60% penalty. Such revisions are reflected in the `'Lateness'` columns in the gradebook.

For labs, students have a **two hour grace period** after the deadline during which their submissions are counted as on time. The grace period only applies to the original deadline – for instance, if a student submits one week and one hour late, their submission falls into the "up to (and including) two weeks after the deadline" category and they are assessed a 30% penalty.

Your job is to adjust lab grades to penalize **truly** late submissions, factoring in the grace period. To adjust a student's grade, multiply their lab score by `1` (on time, factoring in the grace period), `0.9`, `0.7`, or `0.4`. We'll call these four numbers – `1`, `0.9`, `0.7`, and `0.4` – "lateness multipliers." 

Complete the implementation of the function `lateness_penalty`, which takes in a Series containing information on how late each student turned in a particular lab, such as `grades['lab01 - Lateness (H:M:S)']`, and returns a Series containing each student's lateness multiplier for that lab. The only possible values in the returned Series should be `1.0`, `0.9`, `0.7`, and `0.4`.

**Don't forget to factor in the grace period!** Remember, we will only be enforcing late penalties for labs, not for any other assignment category.

**Note**: There is no grace period for real Gradescope!! Make sure you submit your assignments on time.
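
To make the thresholds above concrete, here is a minimal sketch that maps lateness strings to multipliers on a small, made-up Series. The function name `lateness_multiplier_sketch` and the example values are hypothetical. Counting a submission exactly two hours late as on time is an assumption about how the grace period boundary is treated.

```python
import numpy as np
import pandas as pd

def lateness_multiplier_sketch(lateness):
    # Split 'H:M:S' strings into integer parts and convert to total seconds.
    parts = lateness.str.split(":", expand=True).astype(int)
    seconds = parts[0] * 3600 + parts[1] * 60 + parts[2]

    hour = 3600
    week = 7 * 24 * hour
    # Conditions are checked in order; the first match wins.
    conditions = [seconds <= 2 * hour, seconds <= week, seconds <= 2 * week]
    choices = [1.0, 0.9, 0.7]
    return pd.Series(np.select(conditions, choices, default=0.4), index=lateness.index)

# Made-up lateness values around each boundary.
example = pd.Series(["00:00:00", "01:59:59", "02:00:01", "168:00:00", "169:00:00", "336:00:01"])
result = lateness_multiplier_sketch(example)
print(result.tolist())
```

Working in integer seconds keeps the boundary comparisons exact, so a submission exactly one week late (`'168:00:00'`) still lands in the 10% penalty tier.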

In [None]:
grades

In [None]:
# TODO 
    # find the students who turned in the work late
    # then apply the change to the created grade dict that every student possesses?
    # of whether they met the bounds created SEE SCRATCH paper



In [20]:
import pandas as pd
from datetime import timedelta

def lateness_penalty(col):
    def parse_time(time_str):
        if pd.isna(time_str):
            return timedelta()
        parts = time_str.split(':')
        return timedelta(hours=int(parts[0]), minutes=int(parts[1]), seconds=int(parts[2]))

    grace_period = timedelta(hours=2)
    one_week = timedelta(days=7)
    two_weeks = timedelta(days=14)

    def calculate_multiplier(lateness):
        if lateness <= grace_period:
            return 1.0
        elif lateness <= one_week:
            return 0.9
        elif lateness <= two_weeks:
            return 0.7
        else:
            return 0.4

    return col.apply(lambda x: calculate_multiplier(parse_time(x)))

In [17]:
import pandas as pd
from datetime import timedelta

def lateness_penalty(col):
    def parse_time(time_str):
        # Handle NaN or non-string values gracefully
        if isinstance(time_str, str):
            parts = time_str.split(':')
            return timedelta(hours=int(parts[0]), minutes=int(parts[1]), seconds=int(parts[2]))
        return timedelta()  # Default to zero lateness if the input is not valid

    grace_period = timedelta(hours=2)
    one_week = timedelta(days=7)
    two_weeks = timedelta(days=14)

    def calculate_multiplier(lateness):
        if lateness <= grace_period:
            return 1.0
        elif lateness <= one_week:
            return 0.9
        elif lateness <= two_weeks:
            return 0.7
        else:
            return 0.4

    # Apply the logic to each element in the column
    return col.apply(lambda x: calculate_multiplier(parse_time(x)))


In [19]:
grades["lab01 - Lateness (H:M:S)"].value_counts()

np.int64(535)

In [23]:
grades["lab01 - Lateness (H:M:S)"].values

array(['00:00:00', '00:00:00', '00:00:00', '00:00:00', '00:00:00',
       '00:00:00', '00:00:00', '00:00:00', '00:00:00', '00:00:00',
       '00:00:00', '00:00:00', '00:00:00', '00:00:00', '00:00:00',
       '00:00:00', '00:00:00', '00:00:00', '00:00:00', '00:00:00',
       '96:09:33', '00:00:00', '00:00:00', '00:00:00', '00:00:00',
       '00:00:00', '00:00:00', '00:00:00', '00:00:00', '00:00:00',
       '00:00:00', '00:00:00', '00:00:00', '55:41:56', '50:09:44',
       '00:00:00', '00:00:00', '00:00:00', '00:00:00', '00:00:38',
       '00:00:00', '00:00:00', '00:00:00', '00:00:00', '55:41:56',
       '00:00:00', '00:00:00', '00:00:00', '00:00:00', '00:00:00',
       '00:00:00', '00:00:00', '00:00:00', '00:00:00', '00:00:00',
       '00:00:00', '00:00:00', '00:00:00', '00:00:00', '50:09:44',
       '00:00:00', '47:26:10', '00:00:00', '00:00:00', '00:00:00',
       '00:00:00', '00:00:00', '00:00:00', '00:00:00', '00:00:00',
       '00:00:00', '00:00:00', '00:00:00', '00:00:00', '00:00:

In [3]:
lateness_multipliers = lateness_penalty(grades['lab01 - Lateness (H:M:S)'])
lateness_multipliers[33]

NameError: name 'grades' is not defined

In [40]:
lateness_multipliers.values[530]

np.float64(0.9)

In [20]:
grader.check("q3")

<!-- ### ✅ Question 4 (Checkpoint Question) -->
### Question 4 

<a name='Question-4-(Checkpoint-Question)'></a>

([return to the outline](#Navigating-the-Project))

Complete the implementation of the function `process_labs`, which takes in a DataFrame like `grades` and returns a DataFrame of processed lab scores. The returned DataFrame should:
* have the same index as `grades`,
* have one column for each lab assignment (e.g. `'lab01'`, `'lab02'`,..., `'lab09'`),
* have values representing the final score for each lab assignment, adjusted for lateness and **normalized** to a score between 0 and 1.

Remember to correctly handle the case where a student _doesn't_ turn in a lab.
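
The requirements above can be illustrated on a single lab with a tiny, made-up DataFrame. The frame `toy`, its values, and the hard-coded `toy_multipliers` Series (standing in for the output of `lateness_penalty`) are all hypothetical. Treating a missing score as a 0 is one reasonable reading of "doesn't turn in a lab", not the only possible one.

```python
import pandas as pd

# Hypothetical gradebook slice for one lab; None marks a lab that was not turned in.
toy = pd.DataFrame({
    "lab01": [90.0, None, 50.0],
    "lab01 - Max Points": [100.0, 100.0, 100.0],
})

# Stand-in for lateness_penalty(...): on time, on time, more than one week late.
toy_multipliers = pd.Series([1.0, 1.0, 0.7], index=toy.index)

# Normalize by max points, apply the lateness multiplier, then fill missing labs with 0.
lab01_processed = (toy["lab01"] / toy["lab01 - Max Points"] * toy_multipliers).fillna(0)
print(lab01_processed)
```

Because the arithmetic is done on whole Series, the result keeps the same index as the input, which is one of the requirements listed above.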

In [None]:
# TODO 
    # divide the scores by 100
    # create a copy of grades for lab assignments only


In [46]:
def process_labs(grades):
    """
    Processes the lab grades, applying lateness penalties where necessary.

    Args:
        grades (pd.DataFrame): DataFrame containing grade information.

    Returns:
        pd.DataFrame: DataFrame with processed lab grades, adjusted for lateness.
    """
    lab_cols = [col for col in grades.columns if col.startswith('lab')]
    
    processed_labs = grades.copy()
    for lab in lab_cols:
        if 'Lateness' in lab:
            lab_name = lab.replace(' - Lateness (H:M:S)', '')
            # lateness_penalty expects the whole lateness Series, not one value at a time
            processed_labs[lab_name] *= lateness_penalty(processed_labs[lab])
    
    return processed_labs



In [38]:
grades.columns.to_list()

['PID',
 'College',
 'Level',
 'Section',
 'lab01',
 'lab01 - Max Points',
 'lab01 - Lateness (H:M:S)',
 'lab02',
 'lab02 - Max Points',
 'lab02 - Lateness (H:M:S)',
 'project01',
 'project01 - Max Points',
 'project01 - Lateness (H:M:S)',
 'lab03',
 'lab03 - Max Points',
 'lab03 - Lateness (H:M:S)',
 'project01_free_response',
 'project01_free_response - Max Points',
 'project01_free_response - Lateness (H:M:S)',
 'lab04',
 'lab04 - Max Points',
 'lab04 - Lateness (H:M:S)',
 'lab05',
 'lab05 - Max Points',
 'lab05 - Lateness (H:M:S)',
 'project02_checkpoint01',
 'project02_checkpoint01 - Max Points',
 'project02_checkpoint01 - Lateness (H:M:S)',
 'Midterm',
 'Midterm - Max Points',
 'Midterm - Lateness (H:M:S)',
 'lab06',
 'lab06 - Max Points',
 'lab06 - Lateness (H:M:S)',
 'project02_checkpoint02',
 'project02_checkpoint02 - Max Points',
 'project02_checkpoint02 - Lateness (H:M:S)',
 'lab07',
 'lab07 - Max Points',
 'lab07 - Lateness (H:M:S)',
 'project02',
 'project02 - Max Points',

In [21]:
def process_labs(grades):
    """
    Process lab scores by adjusting for lateness and normalizing to a score between 0 and 1.

    Args:
        grades (pd.DataFrame): DataFrame containing grade information, including lab scores and lateness.

    Returns:
        pd.DataFrame: DataFrame of processed lab scores, adjusted for lateness, with scores normalized between 0 and 1.
    """
    lab_cols = [col for col in grades.columns if col.startswith('lab') and 'Max Points' not in col and 'Lateness' not in col]
    lateness_cols = [col for col in grades.columns if col.startswith('lab') and 'Lateness' in col]
    max_points_cols = [col for col in grades.columns if col.startswith('lab') and 'Max Points' in col]

    processed_labs = pd.DataFrame(index=grades.index)

    for lab_col, lateness_col, max_points_col in zip(lab_cols, lateness_cols, max_points_cols):
        # Apply lateness penalty
        lateness_multiplier = lateness_penalty(grades[lateness_col])

        # Normalize lab score between 0 and 1 based on max points
        normalized_lab_score = grades[lab_col] / grades[max_points_col]

        # Apply lateness multiplier to the normalized score
        adjusted_lab_score = normalized_lab_score * lateness_multiplier

        # Ensure all scores are between 0 and 1, and handle missing values (NaNs)
        adjusted_lab_score = adjusted_lab_score.clip(0, 1).fillna(0)

        # Add the processed lab score to the new DataFrame
        processed_labs[lab_col] = adjusted_lab_score

    return processed_labs


In [22]:
labs_att = process_labs(grades)
labs_att["lab08"].value_counts()
#.38 of a lab?

lab08
0.000000    31
1.000000    27
0.994693     1
0.670815     1
0.925664     1
            ..
0.553551     1
0.713093     1
0.806378     1
0.964217     1
0.370186     1
Name: count, Length: 479, dtype: int64

In [23]:
grades["lab01"][0]

np.float64(99.7352791173247)

In [24]:
labs_att

Unnamed: 0,lab01,lab02,lab03,lab04,lab05,lab06,lab07,lab08,lab09
0,0.997353,0.849902,0.637744,1.000000,1.000000,0.994518,0.389141,0.887917,0.874913
1,0.988295,0.507842,0.714477,0.783672,1.000000,0.393887,0.914061,0.944378,0.902977
2,0.865134,0.478028,0.433667,0.738875,0.927838,0.345076,0.734070,0.718204,0.757840
3,1.000000,1.000000,0.925903,0.950614,0.891614,0.688403,0.985371,0.963307,0.777880
4,0.665070,0.334224,0.706932,0.747915,0.659720,0.731345,0.607859,0.370186,1.000000
...,...,...,...,...,...,...,...,...,...
530,0.900000,0.820228,1.000000,0.792935,1.000000,0.284106,0.770281,0.931245,1.000000
531,1.000000,0.874981,0.809945,0.592866,0.987597,0.759688,0.856178,0.849694,0.582645
532,0.886566,0.903260,1.000000,1.000000,0.941425,0.768909,0.967282,0.877898,1.000000
533,0.837997,0.856369,0.909363,0.955287,0.737854,0.382781,0.769093,0.947450,0.867373


In [25]:
grader.check("q4")

<!-- ### ✅ Question 5 (Checkpoint Question) -->
### Question 5 

<a name='Question-5-(Checkpoint-Question)'></a>

([return to the outline](#Navigating-the-Project))

Complete the implementation of the function `lab_total`, which takes in a DataFrame returned by `process_labs` – that is, a DataFrame that contains each student's score on each lab after lateness penalties – and returns a Series containing the total lab grade for each student according to [the syllabus](#The-Syllabus) (i.e. with the lowest lab dropped). All values in the returned Series should be proportions between 0 and 1. 

For example, if CSD 18 only has 3 labs, and Aritra received lab scores of 20%, 90%, and 100% after lateness penalties, then your output Series should contain the value `0.95` for Aritra. This is because we drop the lowest score, and then compute the average of just 90% and 100%, which is 95%, or 0.95 as a proportion.
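
As a quick sanity check on the example above, here is a minimal sketch of the drop-lowest average on a toy DataFrame. The helper name `drop_lowest_mean`, the second student, and the toy values are hypothetical. It assumes missing labs have already been filled with 0, as `process_labs` does.

```python
import pandas as pd

def drop_lowest_mean(processed):
    # Sum every lab, remove the single lowest score, and average what is left.
    n_labs = processed.shape[1]
    return (processed.sum(axis=1) - processed.min(axis=1)) / (n_labs - 1)

# Toy data: Aritra's three labs from the example above, plus a made-up student.
toy = pd.DataFrame(
    {"lab01": [0.20, 1.00], "lab02": [0.90, 1.00], "lab03": [1.00, 0.50]},
    index=["Aritra", "Other"],
)
toy_totals = drop_lowest_mean(toy)
print(toy_totals)
```

For Aritra this gives (0.2 + 0.9 + 1.0 - 0.2) / 2 = 0.95, matching the example. Working on whole columns also avoids a row-by-row `apply`.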

In [None]:
# TODO: apply the syllabus rules to the labs DataFrame returned by process_labs,
# i.e. drop each student's lowest lab before averaging

In [None]:
def lab_total(processed):
    """
    Calculate the total lab grade after processing lateness penalties and dropping the lowest score.

    Args:
        processed (pd.DataFrame): DataFrame containing processed lab grades.

    Returns:
        pd.Series: Series with the total lab grade for each student.
    """
    lab_scores = [col for col in processed.columns if col.startswith('lab') and 'Lateness' not in col]
    
    # Drop the lowest lab score
    lab_totals = processed[lab_scores].apply(lambda row: row.dropna().sum() - row.min(), axis=1)
    
    return lab_totals

In [26]:
def lab_total(processed):
    """
    Calculate the total lab grade after processing lateness penalties and dropping the lowest score.

    Args:
        processed (pd.DataFrame): DataFrame containing processed lab grades.

    Returns:
        pd.Series: Series with the total lab grade for each student as a proportion between 0 and 1.
    """
    lab_scores = [col for col in processed.columns if col.startswith('lab')]
    
    # Drop the lowest lab score and compute the average of the remaining scores
    lab_totals = processed[lab_scores].apply(lambda row: (row.dropna().sum() - row.min()) / (len(row) - 1), axis=1)
    
    return lab_totals


In [27]:
lab_total(labs_att)

0      0.905293
1      0.844463
2      0.706707
3      0.936836
4      0.686128
         ...   
530    0.901836
531    0.841369
532    0.947054
533    0.860098
534    0.865609
Length: 535, dtype: float64

In [28]:
grader.check("q5")

Now that projects and labs are processed, we're almost ready to compute the letter grade of each student in CSD 18.

### Question 6
<!-- ### ✅ Question 6 (Checkpoint Question) -->

<a name='Question-6-(Checkpoint-Question)'></a>

([return to the outline](#Navigating-the-Project))

First, you need to compute each student's course grade, which results from adding their total grades in each course component according to the weights given in [the syllabus](#The-Syllabus).

Complete the implementation of the function `total_points`, which takes in a DataFrame like `grades` and returns a Series containing each student's course grade. **Course grades should be proportions between 0 and 1.**

***Notes***: 

- Don't repeat yourself when computing the checkpoint and discussion portions of the course.
- Remember, only the lab portion of the course accounts for late assignments; you may assume all assignments in other portions are turned in without penalty.
- Do the work by hand for a few students to check your code!
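
One way to keep the weighting readable is to store the weights in a dictionary and confirm that they sum to 1 before combining. The sketch below uses made-up component proportions for two hypothetical students. The weights are the ones listed in the `total_points` docstring further down in this notebook, so treat them as an assumption and confirm them against the syllabus.

```python
import pandas as pd

# Weights as listed in the total_points docstring in this notebook
# (assumed to match the syllabus).
WEIGHTS = {
    "lab": 0.20,
    "project": 0.30,
    "checkpoint": 0.025,
    "discussion": 0.025,
    "midterm": 0.15,
    "final": 0.30,
}

# Hypothetical per-component proportions (each between 0 and 1).
components = pd.DataFrame(
    {
        "lab": [1.0, 0.8],
        "project": [1.0, 0.9],
        "checkpoint": [1.0, 1.0],
        "discussion": [1.0, 1.0],
        "midterm": [1.0, 0.7],
        "final": [1.0, 0.6],
    },
    index=["student_a", "student_b"],
)

# Guard against a typo in the weights before using them.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9

# Weighted sum of the components, one value per student.
course_grade = sum(components[name] * weight for name, weight in WEIGHTS.items())
print(course_grade)
```

Checking `student_b` by hand: 0.16 + 0.27 + 0.025 + 0.025 + 0.105 + 0.18 = 0.765.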

In [None]:
# TODO: make the functions portable enough that each category is calculated by its own helper,
# then sum the weighted results into a Series

In [95]:
def total_points(grades):
    """
    Calculate each student's total course grade as a proportion between 0 and 1.

    Args:
        grades (pd.DataFrame): DataFrame containing all student grades.

    Returns:
        pd.Series: Series with each student's total course grade.
    """
    # Step 1: Process Labs
    lab_scores = process_labs(grades)  # Use the function you implemented earlier
    lab_totals = lab_total(lab_scores)  # Use the previously defined lab_total function
    lab_weighted = lab_totals * 0.20  # Labs are worth 20%
    
    # Step 2: Process Projects
    project_scores = projects_total(grades)  # Use the previously defined projects_total function
    project_weighted = project_scores * 0.30  # Projects are worth 30%

    # Step 3: Process Checkpoints and Discussions
    def checkpoint_discussion_total(grades, prefix):
        # Get all the columns that start with 'checkpoint' or 'discussion'
        cols = [col for col in grades.columns if col.startswith(prefix)]
        total_score = grades[cols].fillna(0).sum(axis=1)
        max_score = grades[[col + ' - Max Points' for col in cols]].sum(axis=1)
        # Avoid division by zero
        return total_score / max_score

    checkpoint_scores = checkpoint_discussion_total(grades, 'checkpoint') * 0.025  # Checkpoints are 2.5%
    discussion_scores = checkpoint_discussion_total(grades, 'discussion') * 0.025  # Discussions are 2.5%

    # Step 4: Process Midterm and Final Exam
    midterm_score = (grades['Midterm'] / grades['Midterm - Max Points']) * 0.15  # Midterm is 15%
    final_score = (grades['Final'] / grades['Final - Max Points']) * 0.30  # Final is 30%

    # Step 5: Sum all components to get total course grade
    total_course_grade = (
        lab_weighted +
        project_weighted +
        checkpoint_scores +
        discussion_scores +
        midterm_score +
        final_score
    )

    return total_course_grade.clip(0, 1)  # Ensure grades are between 0 and 1


In [96]:
out = total_points(grades)
out

KeyError: "['discussion01 - Max Points - Max Points', 'discussion01 - Lateness (H:M:S) - Max Points', 'discussion02 - Max Points - Max Points', 'discussion02 - Lateness (H:M:S) - Max Points', 'discussion03 - Max Points - Max Points', 'discussion03 - Lateness (H:M:S) - Max Points', 'discussion04 - Max Points - Max Points', 'discussion04 - Lateness (H:M:S) - Max Points', 'discussion05 - Max Points - Max Points', 'discussion05 - Lateness (H:M:S) - Max Points', 'discussion06 - Max Points - Max Points', 'discussion06 - Lateness (H:M:S) - Max Points', 'discussion07 - Max Points - Max Points', 'discussion07 - Lateness (H:M:S) - Max Points', 'discussion08 - Max Points - Max Points', 'discussion08 - Lateness (H:M:S) - Max Points', 'discussion09 - Max Points - Max Points', 'discussion09 - Lateness (H:M:S) - Max Points', 'discussion10 - Max Points - Max Points', 'discussion10 - Lateness (H:M:S) - Max Points'] not in index"
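
The `KeyError` above is consistent with `col.startswith(prefix)` also matching the ` - Max Points` and ` - Lateness (H:M:S)` companion columns, so appending `' - Max Points'` to every match produces names that do not exist. The sketch below reproduces the pitfall on a handful of column names taken from this notebook and shows one possible filter. The helper name `score_columns` is hypothetical, and matching checkpoints by substring is an assumption based on names such as `project02_checkpoint01`.

```python
columns = [
    "discussion01",
    "discussion01 - Max Points",
    "discussion01 - Lateness (H:M:S)",
    "project02_checkpoint01",
    "project02_checkpoint01 - Max Points",
    "project02_checkpoint01 - Lateness (H:M:S)",
]

# Prefix matching alone also returns the companion columns.
naive = [c for c in columns if c.startswith("discussion")]

def score_columns(columns, keyword):
    # Keep only raw score columns: the keyword appears in the name and
    # there is no " - " suffix such as " - Max Points".
    return [c for c in columns if keyword in c and " - " not in c]

discussion_cols = score_columns(columns, "discussion")
checkpoint_cols = score_columns(columns, "checkpoint")

# Building the max-points names from score columns only gives valid names.
max_cols = [c + " - Max Points" for c in discussion_cols]

# Checkpoint columns begin with "project", so a startswith test finds none.
naive_checkpoints = [c for c in columns if c.startswith("checkpoint")]

print(naive)
print(discussion_cols, max_cols)
print(checkpoint_cols, naive_checkpoints)
```

Note that an empty checkpoint selection does not raise an error; it contributes nothing to the total, which is easy to miss.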

In [97]:
def total_points(grades):
    """
    Calculate each student's total course grade as a proportion between 0 and 1.

    Args:
        grades (pd.DataFrame): DataFrame containing all student grades.

    Returns:
        pd.Series: Series with each student's total course grade.
    """
    # Step 1: Process Labs
    lab_scores = process_labs(grades)  # Use the function you implemented earlier
    lab_totals = lab_total(lab_scores)  # Use the previously defined lab_total function
    lab_weighted = lab_totals * 0.20  # Labs are worth 20%
    
    # Step 2: Process Projects
    project_scores = projects_total(grades)  # Use the previously defined projects_total function
    project_weighted = project_scores * 0.30  # Projects are worth 30%

    # Step 3: Process Checkpoints and Discussions
    def checkpoint_discussion_total(grades, prefix):
        # Get all the columns that start with 'checkpoint' or 'discussion'
        cols = [col for col in grades.columns if col.startswith(prefix)]
        # Convert all columns to numeric, setting non-numeric entries to NaN
        total_score = grades[cols].apply(pd.to_numeric, errors='coerce').fillna(0).sum(axis=1)
        max_score = grades[[col + ' - Max Points' for col in cols]].apply(pd.to_numeric, errors='coerce').fillna(0).sum(axis=1)
        # Avoid division by zero
        return total_score / max_score

    checkpoint_scores = checkpoint_discussion_total(grades, 'checkpoint') * 0.025  # Checkpoints are 2.5%
    discussion_scores = checkpoint_discussion_total(grades, 'discussion') * 0.025  # Discussions are 2.5%

    # Step 4: Process Midterm and Final Exam
    midterm_score = (pd.to_numeric(grades['Midterm'], errors='coerce') / pd.to_numeric(grades['Midterm - Max Points'], errors='coerce')) * 0.15  # Midterm is 15%
    final_score = (pd.to_numeric(grades['Final'], errors='coerce') / pd.to_numeric(grades['Final - Max Points'], errors='coerce')) * 0.30  # Final is 30%

    # Step 5: Sum all components to get total course grade
    total_course_grade = (
        lab_weighted +
        project_weighted +
        checkpoint_scores +
        discussion_scores +
        midterm_score +
        final_score
    )

    return total_course_grade.clip(0, 1)  # Ensure grades are between 0 and 1


In [99]:
out = total_points(grades)
out

KeyError: "['discussion01 - Max Points - Max Points', 'discussion01 - Lateness (H:M:S) - Max Points', 'discussion02 - Max Points - Max Points', 'discussion02 - Lateness (H:M:S) - Max Points', 'discussion03 - Max Points - Max Points', 'discussion03 - Lateness (H:M:S) - Max Points', 'discussion04 - Max Points - Max Points', 'discussion04 - Lateness (H:M:S) - Max Points', 'discussion05 - Max Points - Max Points', 'discussion05 - Lateness (H:M:S) - Max Points', 'discussion06 - Max Points - Max Points', 'discussion06 - Lateness (H:M:S) - Max Points', 'discussion07 - Max Points - Max Points', 'discussion07 - Lateness (H:M:S) - Max Points', 'discussion08 - Max Points - Max Points', 'discussion08 - Lateness (H:M:S) - Max Points', 'discussion09 - Max Points - Max Points', 'discussion09 - Lateness (H:M:S) - Max Points', 'discussion10 - Max Points - Max Points', 'discussion10 - Lateness (H:M:S) - Max Points'] not in index"

In [100]:
def total_points(grades):
    # Lab total
    lab_total_score = lab_total(process_labs(grades))
    
    # Project total
    project_total_score = projects_total(grades)
    
    # Discussion total
    discussion_cols = [col for col in grades.columns if col.startswith('discussion')]
    discussion_scores = grades[discussion_cols].fillna(0).sum(axis=1)
    discussion_max = grades[[col for col in grades.columns if col.startswith('discussion') and 'Max Points' in col]].fillna(0).sum(axis=1)
    discussion_total_score = np.where(discussion_max > 0, discussion_scores / discussion_max, 0).clip(0, 1)
    
    # Checkpoint total
    checkpoint_cols = [col for col in grades.columns if col.startswith('checkpoint')]
    checkpoint_scores = grades[checkpoint_cols].fillna(0).sum(axis=1)
    checkpoint_max = grades[[col for col in grades.columns if col.startswith('checkpoint') and 'Max Points' in col]].fillna(0).sum(axis=1)
    checkpoint_total_score = np.where(checkpoint_max > 0, checkpoint_scores / checkpoint_max, 0).clip(0, 1)
    
    # Midterm score
    midterm_col = next((col for col in grades.columns if col.startswith('Midterm')), None)
    midterm_max_col = f"{midterm_col} Max Points"
    midterm_score = grades[midterm_col] / grades[midterm_max_col] if midterm_col and midterm_max_col in grades.columns else 0
    
    # Final exam score
    final_col = next((col for col in grades.columns if col.startswith('Final')), None)
    final_max_col = f"{final_col} Max Points"
    final_score = grades[final_col] / grades[final_max_col] if final_col and final_max_col in grades.columns else 0
    
    # Combine components according to weights
    # Assuming weights: labs (30%), projects (20%), discussions (15%), checkpoints (10%), midterm (10%), final (15%)
    course_grade = (0.3 * lab_total_score +
                    0.2 * project_total_score +
                    0.15 * discussion_total_score +
                    0.1 * checkpoint_total_score +
                    0.1 * midterm_score +
                    0.15 * final_score)
    
    # Ensure course grade is between 0 and 1
    course_grade = course_grade.clip(0, 1)
    
    return course_grade


In [102]:
out = total_points(grades)
out

0      0.618747
1      0.565327
2      0.510872
3      0.617490
4      0.495748
         ...   
530    0.610263
531    0.573588
532    0.603444
533    0.576735
534    0.592673
Length: 535, dtype: float64

In [81]:
def total_points(grades):
    # Convert all relevant columns to numeric
    for col in grades.columns:
        if col.startswith(('lab', 'project', 'discussion', 'checkpoint', 'Midterm', 'Final')):
            grades[col] = pd.to_numeric(grades[col], errors='coerce')

    # Fill any remaining NaN values with 0
    grades.fillna(0, inplace=True)

    # Rest of the function remains the same...
    lab_total_score = lab_total(process_labs(grades))
    
    project_total_score = projects_total(grades)
    
    discussion_cols = [col for col in grades.columns if col.startswith('discussion')]
    discussion_scores = grades[discussion_cols].sum(axis=1)
    discussion_max = grades[[col for col in grades.columns if col.startswith('discussion') and 'Max Points' in col]].sum(axis=1)
    discussion_total_score = np.where(discussion_max > 0, discussion_scores / discussion_max, 0).clip(0, 1)
    
    checkpoint_cols = [col for col in grades.columns if col.startswith('checkpoint')]
    checkpoint_scores = grades[checkpoint_cols].sum(axis=1)
    checkpoint_max = grades[[col for col in grades.columns if col.startswith('checkpoint') and 'Max Points' in col]].sum(axis=1)
    checkpoint_total_score = np.where(checkpoint_max > 0, checkpoint_scores / checkpoint_max, 0).clip(0, 1)
    
    midterm_col = next((col for col in grades.columns if col.startswith('Midterm')), None)
    midterm_max_col = f"{midterm_col} Max Points"
    midterm_score = grades[midterm_col] / grades[midterm_max_col] if midterm_col and midterm_max_col in grades.columns else 0
    
    final_col = next((col for col in grades.columns if col.startswith('Final')), None)
    final_max_col = f"{final_col} Max Points"
    final_score = grades[final_col] / grades[final_max_col] if final_col and final_max_col in grades.columns else 0
    
    course_grade = (0.3 * lab_total_score +
                    0.2 * project_total_score +
                    0.15 * discussion_total_score +
                    0.1 * checkpoint_total_score +
                    0.1 * midterm_score +
                    0.15 * final_score)
    
    course_grade = course_grade.clip(0, 1)
    
    return course_grade


In [91]:
out = total_points(grades)
out

0      0.618747
1      0.565327
2      0.510872
3      0.617490
4      0.495748
         ...   
530    0.610263
531    0.573588
532    0.603444
533    0.576735
534    0.592673
Length: 535, dtype: float64

In [86]:
def total_points(grades):
    # Convert all relevant columns to numeric
    for col in grades.columns:
        if col.startswith(('lab', 'project', 'discussion', 'checkpoint', 'Midterm', 'Final')):
            grades[col] = pd.to_numeric(grades[col], errors='coerce')

    # Fill any remaining NaN values with 0
    grades.fillna(0, inplace=True)

    # Compute lab total score using processed lab data
    lab_total_score = lab_total(process_labs(grades))
    print("Lab Total Score:", lab_total_score.head())

    # Compute project total score
    project_total_score = projects_total(grades)
    print("Project Total Score:", project_total_score.head())

    # Compute discussion total score
    discussion_cols = [col for col in grades.columns if col.startswith('discussion')]
    discussion_scores = grades[discussion_cols].sum(axis=1)
    discussion_max = grades[[col for col in grades.columns if col.startswith('discussion') and 'Max Points' in col]].sum(axis=1)
    discussion_total_score = np.where(discussion_max > 0, discussion_scores / discussion_max, 0).clip(0, 1)
    print("Discussion Total Score:", pd.Series(discussion_total_score).head())  # Convert to Series for .head()

    # Compute checkpoint total score
    checkpoint_cols = [col for col in grades.columns if col.startswith('checkpoint')]
    checkpoint_scores = grades[checkpoint_cols].sum(axis=1)
    checkpoint_max = grades[[col for col in grades.columns if col.startswith('checkpoint') and 'Max Points' in col]].sum(axis=1)
    checkpoint_total_score = np.where(checkpoint_max > 0, checkpoint_scores / checkpoint_max, 0).clip(0, 1)
    # print("Checkpoint Total Score:", checkpoint_total_score.head())

    # Compute midterm score
    midterm_col = next((col for col in grades.columns if col.startswith('Midterm')), None)
    midterm_max_col = f"{midterm_col} Max Points"
    midterm_score = grades[midterm_col] / grades[midterm_max_col] if midterm_col and midterm_max_col in grades.columns else 0
    # print("Midterm Score:", midterm_score.head())

    # Compute final score
    final_col = next((col for col in grades.columns if col.startswith('Final')), None)
    final_max_col = f"{final_col} Max Points"
    final_score = grades[final_col] / grades[final_max_col] if final_col and final_max_col in grades.columns else 0
    # print("Final Score:", final_score.head())

    # Calculate total course grade
    course_grade = (0.3 * lab_total_score +
                    0.2 * project_total_score +
                    0.15 * discussion_total_score +
                    0.1 * checkpoint_total_score +
                    0.1 * midterm_score +
                    0.15 * final_score)
    
    # Ensure course grade is between 0 and 1
    course_grade = course_grade.clip(0, 1)
    print("Final Course Grade:", course_grade.head())
    
    return course_grade


In [90]:
out = total_points(grades)
out.mean()

np.float64(0.5719158174711783)

In [None]:
grades["Midterm - Max Points"]

In [94]:
grades[["discussion01 - Max Points","discussion01","Midterm","Midterm - Max Points"]]

Unnamed: 0,discussion01 - Max Points,discussion01,Midterm,Midterm - Max Points
0,10,10.000000,47.000000,47.0
1,10,7.680708,42.871801,47.0
2,10,5.662351,37.788579,47.0
3,10,10.000000,44.514095,47.0
4,10,6.439116,19.570629,47.0
...,...,...,...,...
530,10,10.000000,41.458979,47.0
531,10,10.000000,25.266710,47.0
532,10,10.000000,41.380923,47.0
533,10,10.000000,47.000000,47.0


In [103]:
def total_points(grades):
    # Convert all relevant columns to numeric
    for col in grades.columns:
        if col.startswith(('lab', 'project', 'discussion', 'checkpoint', 'Midterm', 'Final')):
            grades[col] = pd.to_numeric(grades[col], errors='coerce')

    # Fill any remaining NaN values with 0
    grades.fillna(0, inplace=True)

    # Compute lab total score using processed lab data
    lab_total_score = lab_total(process_labs(grades))
    
    # Compute project total score
    project_total_score = projects_total(grades)
    
    # Compute discussion total score
    discussion_cols = [col for col in grades.columns if col.startswith('discussion')]
    discussion_scores = grades[discussion_cols].sum(axis=1)
    discussion_max = grades[[col for col in grades.columns if col.startswith('discussion') and 'Max Points' in col]].sum(axis=1)
    discussion_total_score = np.where(discussion_max > 0, discussion_scores / discussion_max, 0).clip(0, 1)
    # print(discussion_max)
    
    # Compute checkpoint total score
    checkpoint_cols = [col for col in grades.columns if col.startswith('checkpoint')]
    checkpoint_scores = grades[checkpoint_cols].sum(axis=1)
    checkpoint_max = grades[[col for col in grades.columns if col.startswith('checkpoint') and 'Max Points' in col]].sum(axis=1)
    checkpoint_total_score = np.where(checkpoint_max > 0, checkpoint_scores / checkpoint_max, 0).clip(0, 1)
    # print(checkpoint_max)

    # Compute midterm score
    midterm_col = next((col for col in grades.columns if col.startswith('Midterm')), None)
    midterm_max_col = f"{midterm_col} Max Points"
    midterm_score = grades[midterm_col] / grades[midterm_max_col] if midterm_col and midterm_max_col in grades.columns else 0
    # print(midterm_max_col)
    
    # Compute final score
    final_col = next((col for col in grades.columns if col.startswith('Final')), None)
    final_max_col = f"{final_col} Max Points"
    final_score = grades[final_col] / grades[final_max_col] if final_col and final_max_col in grades.columns else 0
    # print(final_max_col)
    
    # Calculate total course grade
    course_grade = (0.3 * lab_total_score +
                    0.2 * project_total_score +
                    0.15 * discussion_total_score +
                    0.1 * checkpoint_total_score +
                    0.1 * midterm_score +
                    0.15 * final_score)
    
    # Ensure course grade is between 0 and 1
    course_grade = course_grade.clip(0, 1)
    
    return course_grade
    
    # np.float64(0.5719158174711783)

In [42]:
import pandas as pd
import numpy as np

def total_points(grades):
    # Convert columns to numeric, coerce errors to NaN
    grades = grades.apply(pd.to_numeric, errors='coerce')
    
    # Fill NaN values with 0
    grades.fillna(0, inplace=True)
    
    # Helper function to compute normalized score
    def compute_score(scores, max_points):
        return np.where(max_points > 0, scores / max_points, 0).clip(0, 1)

    # Compute discussion total score
    discussion_cols = [col for col in grades.columns if col.startswith('discussion') and 'Max Points' not in col]
    discussion_max_cols = [col for col in grades.columns if col.startswith('discussion') and 'Max Points' in col]
    discussion_scores = grades[discussion_cols].sum(axis=1)
    discussion_max = grades[discussion_max_cols].sum(axis=1)
    discussion_total_score = compute_score(discussion_scores, discussion_max)

    # Compute checkpoint total score
    checkpoint_cols = [col for col in grades.columns if col.startswith('checkpoint') and 'Max Points' not in col]
    checkpoint_max_cols = [col for col in grades.columns if col.startswith('checkpoint') and 'Max Points' in col]
    checkpoint_scores = grades[checkpoint_cols].sum(axis=1)
    checkpoint_max = grades[checkpoint_max_cols].sum(axis=1)
    checkpoint_total_score = compute_score(checkpoint_scores, checkpoint_max)

    # Compute midterm score
    midterm_col = next((col for col in grades.columns if 'Midterm' in col and 'Max Points' not in col), None)
    midterm_max_col = f"{midterm_col} - Max Points" if midterm_col else None
    midterm_score = compute_score(grades[midterm_col], grades[midterm_max_col]) if midterm_col and midterm_max_col in grades else 0

    # Compute final score
    final_col = next((col for col in grades.columns if 'Final' in col and 'Max Points' not in col), None)
    final_max_col = f"{final_col} - Max Points" if final_col else None
    final_score = compute_score(grades[final_col], grades[final_max_col]) if final_col and final_max_col in grades else 0

    # Compute lab and project scores using the functions defined earlier
    lab_total_score = lab_total(process_labs(grades))
    project_total_score = projects_total(grades)

    # Calculate total course grade
    course_grade = (
        0.3 * lab_total_score +
        0.2 * project_total_score +
        0.15 * discussion_total_score +
        0.1 * checkpoint_total_score +
        0.1 * midterm_score +
        0.15 * final_score
    )

    # Ensure course grade is between 0 and 1
    course_grade = course_grade.clip(0, 1)
    
    return course_grade

# Example usage
out = total_points(grades)
print(out.mean())  # Should be between 0.7 and 0.9
print(bool(0.7 < out.mean() < 0.9))


0.7665342320304341
True


In [44]:
def total_points(grades):
    """
    Calculate total course grade as a proportion between 0 and 1 using Professor Yutian's syllabus weights:
    - Labs: 20%
    - Projects: 30%
    - Checkpoints: 2.5%
    - Discussions: 2.5%
    - Midterm: 15%
    - Final: 30%
    
    Args:
        grades (pd.DataFrame): DataFrame containing all grade information
        
    Returns:
        pd.Series: Series containing each student's course grade as a proportion between 0 and 1
    """
    # Convert columns to numeric and fill NaN with 0
    grades = grades.apply(pd.to_numeric, errors='coerce').fillna(0)
    
    def get_component_score(prefix):
        """Helper function to calculate normalized scores for any assignment type"""
        cols = [col for col in grades.columns if col.startswith(prefix) and 'Max Points' not in col]
        max_cols = [col for col in grades.columns if col.startswith(prefix) and 'Max Points' in col]
        
        if not cols or not max_cols:
            return 0
        
        scores = grades[cols].sum(axis=1)
        max_points = grades[max_cols].sum(axis=1)
        return np.where(max_points > 0, scores / max_points, 0).clip(0, 1)
    
    def get_exam_score(exam_name):
        """Helper function to calculate normalized scores for exams"""
        col = next((col for col in grades.columns if exam_name in col and 'Max Points' not in col), None)
        max_col = f"{col} - Max Points" if col else None
        
        if not (col and max_col in grades.columns):
            return 0
            
        return np.where(grades[max_col] > 0, 
                       grades[col] / grades[max_col], 
                       0).clip(0, 1)
    
    # Calculate component scores with corrected weights from Professor Yutian's syllabus
    component_weights = {
        'lab': (lab_total(process_labs(grades)), 0.20),      # 20% for labs
        'project': (projects_total(grades), 0.30),           # 30% for projects
        'discussion': (get_component_score('discussion'), 0.025),  # 2.5% for discussions
        'checkpoint': (get_component_score('checkpoint'), 0.025),  # 2.5% for checkpoints
        'midterm': (get_exam_score('Midterm'), 0.15),       # 15% for midterm
        'final': (get_exam_score('Final'), 0.30)            # 30% for final
    }
    
    # Calculate weighted sum
    course_grade = sum(score * weight for score, weight in component_weights.values())
    
    return course_grade.clip(0, 1)

In [45]:
out = total_points(grades)
out

0      0.892178
1      0.814040
2      0.750310
3      0.900996
4      0.672446
         ...   
530    0.850977
531    0.755412
532    0.844957
533    0.855211
534    0.883745
Length: 535, dtype: float64

In [46]:
grader.check("q6")

<!-- ### ✅ Question 7 (Checkpoint Question) -->
### Question 7

<a name='Question-7-(Checkpoint-Question)'></a>

([return to the outline](#Navigating-the-Project))

How well did the students in CSD 18 do?

#### `final_grades`

Complete the implementation of the function `final_grades`, which takes in a Series of final course grades (as computed by `total_points` in Question 6) and returns a Series of letter grades as determined by the following cutoffs:

| Letter Grade | Cutoff |
|:--- | --- |
| A | grade >= 0.9 |
| B | 0.8 <= grade < 0.9 |
| C | 0.7 <= grade < 0.8 |
| D | 0.6 <= grade < 0.7 |
| F | grade < 0.6 |

***Note***: These cutoffs do not have pluses or minuses. **Do not round** anyone's course grade when determining their letter grade.
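
The cutoff table maps directly onto half-open bins. As an alternative to an `if`/`elif` chain, here is a minimal sketch using `pd.cut` with `right=False`, so each bin includes its lower edge and excludes its upper edge. The function name `to_letters` and the sample grades are made up for illustration.

```python
import numpy as np
import pandas as pd

def to_letters(total):
    # Bin edges follow the cutoff table; right=False makes each bin [low, high).
    bins = [-np.inf, 0.6, 0.7, 0.8, 0.9, np.inf]
    labels = ["F", "D", "C", "B", "A"]
    return pd.cut(total, bins=bins, labels=labels, right=False).astype(str)

# Made-up grades chosen to sit on and around the cutoffs.
sample = pd.Series([0.95, 0.9, 0.8999, 0.8, 0.75, 0.6, 0.59])
letters = to_letters(sample)
print(letters.tolist())
```

Because no rounding happens anywhere, a grade of 0.8999 stays a B. This sketch assumes there are no missing grades; a `NaN` would come out as the string `'nan'` and not as a letter.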

<br>

#### `letter_proportions`

Complete the implementation of the function `letter_proportions`, which takes in a Series of final course grades (as computed by `total_points` in Question 6) and returns a Series containing the proportion of the class that received each letter grade. For instance, this Series might tell us that the proportion of the class receiving B's was 0.45, A's was 0.33, C's was 0.16, D's was 0.05, and F's was 0.01 (though these are made up numbers). The index of this Series should be letters, and the **values should be sorted in decreasing order**.

***Notes***: 

- The values in your returned Series should add up to exactly `1.0`. If you are getting something close such as `0.99999`, that means there is an issue with your code in a function you implemented earlier.
- **Do not round**.
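
`Series.value_counts(normalize=True)` returns proportions that are already sorted in decreasing order, which matches the required output. A minimal sketch on made-up letter grades (not the CSD 18 data):

```python
import pandas as pd

# Hypothetical letter grades for four students.
letters = pd.Series(["B", "A", "B", "C"])

# Proportion of students per letter, largest first.
proportions = letters.value_counts(normalize=True)
print(proportions)
```

Letters that nobody received do not appear in this result. Whether they should be added with a proportion of 0 is a design choice to check against the tests.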

In [60]:
# TODO: find everyone's letter grade, then divide the count of each letter by the total number of students

In [32]:

def final_grades(grades):
    """
    Takes in a Series of final course grades and returns a Series of letter grades
    based on the specified cutoffs.
    """
    def grade_to_letter(grade):
        if grade >= 0.9:
            return 'A'
        elif grade >= 0.8:
            return 'B'
        elif grade >= 0.7:
            return 'C'
        elif grade >= 0.6:
            return 'D'
        else:
            return 'F'
    
    # Apply the function to each grade in the series
    return grades.apply(grade_to_letter)


def letter_proportions(grades):
    """
    Takes in a Series of final course grades and returns a Series with the proportion
    of the class that received each letter grade.
    """
    # Get the counts normalized by the total number of grades to get the proportions
    return grades.value_counts(normalize=True).sort_values(ascending=False)



A    0.000000
B    0.418692
C    0.442991
D    0.102804
F    0.035514
dtype: float64
Sum of proportions: 1.0
Index matches expected: False


In [33]:
total = total_points(grades)
total

0      0.842316
1      0.759717
2      0.677367
3      0.836518
4      0.613394
         ...   
530    0.808281
531    0.732440
532    0.803089
533    0.787394
534    0.810992
Length: 535, dtype: float64

In [35]:
out = letter_proportions(total)
out

0.000000    0.014953
0.842316    0.001869
0.759717    0.001869
0.677367    0.001869
0.836518    0.001869
              ...   
0.808281    0.001869
0.732440    0.001869
0.803089    0.001869
0.787394    0.001869
0.810992    0.001869
Name: proportion, Length: 528, dtype: float64


In [40]:

def final_grades(grades):
    """
    Takes in a Series of final course grades and returns a Series of letter grades
    based on the specified cutoffs.
    """
    def grade_to_letter(grade):
        if grade >= 0.9:
            return 'A'
        elif grade >= 0.8:
            return 'B'
        elif grade >= 0.7:
            return 'C'
        elif grade >= 0.6:
            return 'D'
        else:
            return 'F'
    
    return grades.apply(grade_to_letter)

def letter_proportions(grades):
    """
    Takes in a Series of final course grades and returns a Series with the proportion
    of the class that received each letter grade.
    """
    # Convert numerical grades to letter grades
    letter_grades = final_grades(grades)
    
    # Get the counts normalized by the total number of grades to get the proportions
    proportions = letter_grades.value_counts(normalize=True).sort_values(ascending=False)
    
    # Ensure all letter grades are present, even if some have 0 proportion
    all_letters = pd.Series(index=['A', 'B', 'C', 'D', 'F'], data=0.0)
    proportions = proportions.combine(all_letters, max, fill_value=0)
    
    return proportions

In [46]:
import pandas as pd
import numpy as np

def final_grades(grades):
    """
    Takes in a Series of final course grades and returns a Series of letter grades
    based on the specified cutoffs.
    """
    def grade_to_letter(grade):
        if grade >= 0.9:
            return 'A'
        elif grade >= 0.8:
            return 'B'
        elif grade >= 0.7:
            return 'C'
        elif grade >= 0.6:
            return 'D'
        else:
            return 'F'
    
    return grades.apply(grade_to_letter)

def letter_proportions(grades):
    """
    Takes in a Series of final course grades and returns a Series with the proportion
    of the class that received each letter grade.
    """
    # Convert numerical grades to letter grades
    letter_grades = final_grades(grades)
    
    # Get the counts normalized by the total number of grades to get the proportions
    proportions = letter_grades.value_counts(normalize=True).sort_values(ascending=False)
    
    # Ensure all letter grades are present, even if some have 0 proportion
    all_letters = pd.Series(index=['A', 'B', 'C', 'D', 'F'], data=0.0)
    proportions = proportions.combine(all_letters, max, fill_value=0)
    
    return proportions.to_frame()

In [33]:

def final_grades(grades):
    """
    Takes in a Series of final course grades and returns a Series of letter grades
    based on the specified cutoffs.
    """
    def grade_to_letter(grade):
        if grade >= 0.9:
            return 'A'
        elif grade >= 0.8:
            return 'B'
        elif grade >= 0.7:
            return 'C'
        elif grade >= 0.6:
            return 'D'
        else:
            return 'F'
    
    return grades.apply(grade_to_letter)

def letter_proportions(grades):
    """
    Takes in a Series of final course grades and returns a Series with the proportion
    of the class that received each letter grade, sorted in decreasing order.
    """
    # Convert numerical grades to letter grades
    letter_grades = final_grades(grades)
    
    # Calculate the proportions
    proportions = letter_grades.value_counts(normalize=True)
    
    # Ensure all letter grades are present, even if some have 0 proportion
    all_letters = pd.Series(index=['A', 'B', 'C', 'D', 'F'], data=0.0)
    proportions = proportions.combine(all_letters, max, fill_value=0)
    
    # Sort in decreasing order
    return proportions.sort_values(ascending=False)

In [39]:
def final_grades(grades):
    """
    Takes in a Series of final course grades and returns a Series of letter grades
    based on the specified cutoffs.
    """
    def grade_to_letter(grade):
        if grade >= 0.9:
            return 'A'
        elif grade >= 0.8:
            return 'B'
        elif grade >= 0.7:
            return 'C'
        elif grade >= 0.6:
            return 'D'
        else:
            return 'F'
    
    return grades.apply(grade_to_letter)

def letter_proportions(grades):
    """
    Takes in a Series of final course grades and returns a Series with the proportion
    of the class that received each letter grade, sorted by grade value (B, C, A, D, F).
    """
    # Convert numerical grades to letter grades
    letter_grades = final_grades(grades)
    
    # Calculate the proportions
    proportions = letter_grades.value_counts(normalize=True)
    
    # Create a Series with the expected letter grades in the specific order
    ordered_grades = pd.Series(0.0, index=['B', 'C', 'A', 'D', 'F'])
    
    # Update the ordered grades with actual proportions
    for grade in proportions.index:
        if grade in ordered_grades.index:
            ordered_grades[grade] = proportions[grade]
    
    return ordered_grades

In [40]:
# Assuming we have the 'grades' DataFrame and 'total_points' function from earlier

total = total_points(grades)
letter_props = letter_proportions(total)


letter_props
# print(letter_props)
# print(f"Sum of proportions: {letter_props.sum()}")
# print(f"Index matches expected: {np.all(letter_props.index == ['B', 'C', 'A', 'D', 'F'])}")

B    0.465421
C    0.409346
A    0.000000
D    0.087850
F    0.037383
dtype: float64

In [41]:
grader.check("q7")

<a name='part2'></a>

## Part 2: Redemption 🙏

([return to the outline](#Navigating-the-Project))

The syllabus we've used so far was put together by Professor Yutian, who has taught CSD 18 for several iterations. This was Professor Dylan's first time teaching CSD 18, and towards the end of the quarter he proposed a new idea to reward students for showing an improvement in their understanding of the earlier ideas in the course on the final exam. Specifically, here's what he proposed:

- The instructors will identify the questions on the final exam that contain content that was also covered on the midterm exam. Call these "redemption questions."
- For each student, compute their "raw redemption score", which is the proportion of points available on redemption questions that they earned. If they did not take the final exam, their raw redemption score is 0.
- Convert the class' raw redemption scores to z-scores, i.e. to standard units.
- Convert the class' original midterm exam grades, as proportions, to z-scores.
- If a student's raw redemption z-score is higher than their original midterm exam z-score, replace their original midterm exam score with one that has a z-score equal to their raw redemption z-score. This is done by converting their raw redemption z-score back to a midterm grade proportion using the standard deviation and mean of the midterm exam.
- If not, leave their original midterm exam score as-is. **Note that this policy can only increase a student's midterm exam score (and, hence, their total course grade), not decrease!**

As a refresher from [DSC 10](https://dsc-courses.github.io/dsc10-2022-fa/resources/lectures/lec21/lec21.html#Standard-units), to convert a sequence of numbers to z-scores, or standard units, we use the following formula:

$$z(x_i) = \frac{x_i - \text{mean of } x}{\text{SD of }x}$$

To illustrate this redemption policy, let's look at a concrete example.

- Suppose the final exam was worth 80 points. 55 of these points came from Questions 2, 4, 6, 8, and 9, which were the redemption questions. The class' mean score on just the redemption questions was 0.8, with a standard deviation of 0.15.
- Suppose the midterm exam was worth 70 points. The class' mean score on the midterm exam was 0.6, with a standard deviation of 0.25.
- Jasmine, a student in the course, earned a $\frac{74}{80}$ on the final exam, including a $\frac{51}{55}$ on the redemption questions, and a $\frac{53}{70}$ on the midterm exam. Then:
    - Her raw redemption score is $\frac{51}{55}$, and her redemption z-score is $\frac{\frac{51}{55} - 0.8}{0.15} \approx 0.8485$.
    - Her midterm z-score is $\frac{\frac{53}{70} - 0.6}{0.25} \approx 0.6286$.
    - Since her redemption z-score, $0.8485$, is greater than her midterm z-score, $0.6286$, her midterm exam score of $\frac{53}{70} \approx 0.7571$ will be replaced with:
    
    $$\text{Jasmine's redemption z-score} \cdot \text{class' midterm SD} + \text{class' midterm mean} \approx 0.8485 \cdot 0.25 + 0.6 = \boxed{0.8121}$$

Now, your job will be to implement this redemption policy and recompute each student's total course points. Before proceeding, you should think about _why_ Professor Dylan has chosen to implement the redemption policy in terms of z-scores, rather than in terms of raw scores.

A few more things to consider:
- We rounded in the example above, but you should not round at any point in this part.
- After redemption, midterm exam grades should be capped at 1 (as a proportion), i.e. 100%.
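
The arithmetic in Jasmine's example can be checked with a few lines of plain Python. This is only a sketch of the policy for a single student, using the class means and SDs stated in the example; the variable names are illustrative and not part of `project.py`. Rounding happens only when printing.

```python
# Class statistics taken from the worked example above
redemption_mean, redemption_sd = 0.8, 0.15
midterm_mean, midterm_sd = 0.6, 0.25

# Jasmine's scores as proportions
raw_redemption_score = 51 / 55
midterm_score = 53 / 70

# Convert both scores to standard units
redemption_z = (raw_redemption_score - redemption_mean) / redemption_sd
midterm_z = (midterm_score - midterm_mean) / midterm_sd

# Replace the midterm score only if the redemption z-score is higher,
# converting back with the midterm mean and SD and capping at 1
if redemption_z > midterm_z:
    new_midterm = min(redemption_z * midterm_sd + midterm_mean, 1)
else:
    new_midterm = midterm_score

print(round(redemption_z, 4), round(midterm_z, 4), round(new_midterm, 4))
# 0.8485 0.6286 0.8121
```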

It turns out that CSVs like `grades.csv` don't actually contain all of the information you'll need to implement this policy. For instance, `grades` only contains each student's total final exam grade, but not the number of points they earned on each question.

That information will come from another source. For the students whose grades are in `grades`, the CSV `data/final_exam_breakdown.csv` contains the number of points each student earned on each question of CSD 18's final exam. Run the cell below to load this CSV in as a DataFrame named `final_breakdown`.

In [118]:
final_breakdown_fp = Path('data') / 'final_exam_breakdown.csv'
final_breakdown = pd.read_csv(final_breakdown_fp)
final_breakdown

Unnamed: 0,PID,Question 1 (5.0 pts),Question 2 (6.0 pts),Question 3 (8.0 pts),Question 4 (6.0 pts),Question 5 (10.0 pts),Question 6 (6.0 pts),Question 7 (10.0 pts),Question 8 (6.0 pts),Question 9 (9.0 pts),Question 10 (10.0 pts),Question 11 (4.0 pts),Question 12 (7.0 pts)
0,A99432453,3.0,6.0,8.0,5.0,5.0,6.0,4.0,3.0,4.0,4.0,4.0,4.0
1,A99152420,5.0,6.0,8.0,5.0,10.0,6.0,9.0,6.0,4.0,10.0,4.0,5.0
2,A99892710,3.0,5.0,8.0,4.0,4.0,5.0,3.0,6.0,9.0,9.0,4.0,7.0
3,A99381181,,,,,,,,,,,,
4,A99990217,4.0,6.0,8.0,6.0,7.0,6.0,7.0,5.0,9.0,4.0,4.0,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
530,A99190272,5.0,6.0,8.0,6.0,9.0,6.0,5.0,6.0,9.0,10.0,4.0,6.0
531,A99330622,5.0,6.0,8.0,6.0,9.0,6.0,8.0,6.0,6.0,10.0,4.0,4.0
532,A99152694,3.0,3.0,5.0,6.0,7.0,5.0,9.0,6.0,7.0,10.0,3.0,7.0
533,A99174029,5.0,3.0,5.0,6.0,6.0,1.0,8.0,5.0,7.0,5.0,4.0,1.0


Note that `final_breakdown` has the same number of rows as `grades`, but a different number of columns:

In [None]:
final_breakdown.shape

Also note that student `'A99381181'` has a score of `NaN` for each question because they did not take the final exam:

In [None]:
grades.loc[grades['PID'] == 'A99381181', 'Final']

### Question 8

([return to the outline](#Navigating-the-Project))

Let's get started.

#### `raw_redemption`

Complete the implementation of the function `raw_redemption`, which takes in a DataFrame like `final_breakdown` and a list of integers, corresponding to the question numbers for "redemption questions." The function should return a DataFrame with two columns:
- `'PID'`, the PID for each student in `final_breakdown`.
- `'Raw Redemption Score'`, which is the proportion of points each student earned, when only considering redemption questions.

For example, suppose `example_breakdown` is as follows:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>PID</th>
      <th>Question 1 (6.0 pts)</th>
      <th>Question 2 (3.0 pts)</th>
      <th>Question 3 (1.0 pts)</th>
      <th>Question 4 (4.5 pts)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>A99706914</td>
      <td>6</td>
      <td>3</td>
      <td>1</td>
      <td>4.5</td>
    </tr>
    <tr>
      <th>1</th>
      <td>A99237411</td>
      <td>2</td>
      <td>0</td>
      <td>1</td>
      <td>4.5</td>
    </tr>
    <tr>
      <th>2</th>
      <td>A99489712</td>
      <td>4</td>
      <td>1</td>
      <td>0</td>
      <td>4.0</td>
    </tr>
  </tbody>
</table>

`raw_redemption(example_breakdown, [1, 3])` should return the following DataFrame:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>PID</th>
      <th>Raw Redemption Score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>A99706914</td>
      <td>1.000000</td>
    </tr>
    <tr>
      <th>1</th>
      <td>A99237411</td>
      <td>0.428571</td>
    </tr>
    <tr>
      <th>2</th>
      <td>A99489712</td>
      <td>0.571429</td>
    </tr>
  </tbody>
</table>



***Notes***:
- **Assume that for each question in `final_breakdown`, at least one student received a perfect score.**
- Assume that the input DataFrame will be in the same format as `final_breakdown`, in that the column at position 0 will be labeled `'PID'`, the column at position 1 will contain scores for Question 1, the column at position 2 will contain scores for Question 2, and so on.
- If a student didn't take the final, their raw redemption score should be 0.
- Again, do not round.
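
As a rough illustration of how the notes above fit together, the sketch below rebuilds `example_breakdown` by hand and reproduces the expected scores for questions 1 and 3. It relies on two of the stated assumptions: the column at position `k` holds Question `k`, and each question's maximum observed score equals the points available. It is one possible approach, not the required one, and variable names such as `points_available` are made up for the example.

```python
import pandas as pd

# The example table from above, typed in by hand
example_breakdown = pd.DataFrame({
    'PID': ['A99706914', 'A99237411', 'A99489712'],
    'Question 1 (6.0 pts)': [6, 2, 4],
    'Question 2 (3.0 pts)': [3, 0, 1],
    'Question 3 (1.0 pts)': [1, 1, 0],
    'Question 4 (4.5 pts)': [4.5, 4.5, 4.0],
})
question_numbers = [1, 3]

# Column k holds Question k, so question numbers double as column positions
selected = example_breakdown.iloc[:, question_numbers]

# At least one student earned full credit on each question, so the
# column maximum equals the points available for that question
points_available = selected.max().sum()

# Row sums skip NaN by default, so an all-NaN row would sum to 0
points_earned = selected.sum(axis=1)

example_scores = points_earned / points_available
print(example_scores.round(6).tolist())
# [1.0, 0.428571, 0.571429]
```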

<br>

#### `combine_grades`

Then, complete the implementation of the function `combine_grades`, which takes in a DataFrame like `grades` and a DataFrame like the one returned by `raw_redemption`. The function should return a new DataFrame with all the columns from `grades`, plus a new column labelled `'Raw Redemption Score'` which contains the raw redemption score for each student.

***Hint***: We cannot directly add the `'Raw Redemption Score'` from the redemption DataFrame to the `grades` DataFrame, as the `'PID'` columns in the two DataFrames won't necessarily match.
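
To see why the hint matters, here is a small sketch on made-up data (the PIDs and scores are hypothetical). Assigning a column directly pairs rows by index label, while merging on `'PID'` pairs each student with their own score.

```python
import pandas as pd

# Hypothetical toy data: the same students, listed in a different order
toy_grades = pd.DataFrame({'PID': ['A1', 'A2', 'A3'],
                           'Midterm': [50.0, 60.0, 70.0]})
toy_redemption = pd.DataFrame({'PID': ['A3', 'A1', 'A2'],
                               'Raw Redemption Score': [0.9, 0.5, 0.7]})

# Direct assignment lines rows up by index label, not by PID
misaligned = toy_grades.assign(
    **{'Raw Redemption Score': toy_redemption['Raw Redemption Score']}
)

# Merging on 'PID' matches each student to their own redemption score
aligned = toy_grades.merge(toy_redemption, on='PID', how='left')

print(misaligned['Raw Redemption Score'].tolist())  # [0.9, 0.5, 0.7]
print(aligned['Raw Redemption Score'].tolist())     # [0.5, 0.7, 0.9]
```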

In [None]:
df_like_grades = grades.copy()
df_like_grades

In [121]:

def raw_redemption(final_breakdown, redemption_questions):
    # Extract PID and redemption question columns
    redemption_cols = ['PID'] + [col for col in final_breakdown.columns
                                 if any(col.startswith(f'Question {i} (') for i in redemption_questions)]
    redemption_df = final_breakdown[redemption_cols].copy()
    
    # Extract max points for each question from its column label, e.g. 'Question 1 (5.0 pts)'
    max_points = {}
    for col in redemption_df.columns[1:]:
        max_points[col] = float(col.split('(')[-1].split()[0])
    
    # Calculate total points earned and total possible points
    total_earned = redemption_df.iloc[:, 1:].sum(axis=1)
    total_possible = sum(max_points.values())
    
    # Calculate raw redemption score
    raw_score = total_earned / total_possible
    
    # Create result DataFrame
    result = pd.DataFrame({
        'PID': redemption_df['PID'],
        'Raw Redemption Score': raw_score
    })
    
    # Handle students who didn't take the final
    result.loc[final_breakdown.iloc[:, 1:].isnull().all(axis=1), 'Raw Redemption Score'] = 0
    
    return result

def combine_grades(grades, redemption_scores):
    # Merge grades with redemption scores based on PID
    combined = grades.merge(redemption_scores, on='PID', how='left')
    
    # Fill NaN values in 'Raw Redemption Score' with 0
    # This handles cases where a student in grades doesn't have a redemption score
    combined['Raw Redemption Score'] = combined['Raw Redemption Score'].fillna(0)
    
    return combined

In [124]:
import pandas as pd
import numpy as np

def raw_redemption(final_breakdown, redemption_questions):
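    # Extract PID and redemption question columns
    # Match on 'Question {q} (' so that, e.g., Question 1 does not also pick up Questions 10-12
    redemption_cols = ['PID'] + [col for col in final_breakdown.columns
                                 if any(col.startswith(f'Question {q} (') for q in redemption_questions)]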
    redemption_df = final_breakdown[redemption_cols].copy()
    
    # Extract max points for each question
    max_points = {}
    for col in redemption_df.columns[1:]:
        max_points[col] = float(col.split('(')[-1].split()[0])
    
    # Calculate total points earned and total possible points
    total_earned = redemption_df.iloc[:, 1:].sum(axis=1)
    total_possible = sum(max_points.values())
    
    # Calculate raw redemption score
    raw_score = total_earned / total_possible
    
    # Create result DataFrame
    result = pd.DataFrame({
        'PID': redemption_df['PID'],
        'Raw Redemption Score': raw_score
    })
    
    # Handle students who didn't take the final
    result.loc[final_breakdown.iloc[:, 1:].isnull().all(axis=1), 'Raw Redemption Score'] = 0
    
    return result

def combine_grades(grades, redemption_scores):
    # Merge grades with redemption scores based on PID
    combined = grades.merge(redemption_scores, on='PID', how='left')
    
    # Fill NaN values in 'Raw Redemption Score' with 0
    # This handles cases where a student in grades doesn't have a redemption score
    combined['Raw Redemption Score'] = combined['Raw Redemption Score'].fillna(0)
    
    return combined

In [125]:
grader.check("q8")

For our particular offering of CSD 18, the redemption questions on the final exam were Questions 1, 2, 3, 7, 9, and 12. Run the cell below to define a new DataFrame named `grades_combined` that results from calling the above two functions on grades from this class.

In [129]:
grades_combined = combine_grades(grades, raw_redemption(final_breakdown, [1, 2, 3, 7, 9, 12]))
grades_combined["Raw Redemption Score"].value_counts()

Raw Redemption Score
0.864407    37
0.830508    34
0.881356    32
0.847458    31
0.898305    31
0.796610    30
0.779661    29
0.813559    28
0.762712    25
0.728814    25
1.000000    24
0.915254    21
0.932203    20
0.949153    20
0.745763    19
0.711864    16
0.966102    16
0.694915    15
0.661017    14
0.983051    13
0.677966    10
0.627119     9
0.000000     8
0.610169     8
0.559322     6
0.644068     4
0.576271     4
0.593220     2
0.457627     1
0.525424     1
0.423729     1
0.542373     1
Name: count, dtype: int64

### Question 9

([return to the outline](#Navigating-the-Project))

Now that we have all of our information about each student in one DataFrame, we can compute their z-score on both the original midterm exam and the redemption questions on the final exam.

#### `z_score`

Complete the implementation of the function `z_score`, which takes in a Series of numbers and returns a Series in which all elements are converted to z-scores. As a reminder, to convert a sequence of numbers to z-scores, or standard units, we use the following formula:

$$z(x_i) = \frac{x_i - \text{mean of } x}{\text{SD of }x}$$

***Notes***:

- Make sure to set `ddof=0` in whichever method or function you use to compute the standard deviation, since `numpy` and `pandas` use different default denominators. (`ddof=0` computes the "population" standard deviation and `ddof=1` computes the "sample" standard deviation.)
- Do **not** fill null values – that is, if a value in the input Series is `NaN`, its value in the output Series should also be `NaN`. (Depending on how you implement `z_score`, this may happen automatically.)
    - Address null midterm scores in `add_post_redemption`, not `z_score`.
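
The two notes above can be seen on a small made-up Series. This sketch only illustrates the library defaults and how `NaN` propagates; the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, np.nan])

population_sd = values.std(ddof=0)             # divides by n; NaN is skipped
sample_sd = values.std()                       # pandas default is ddof=1
numpy_sd = np.std(values.dropna().to_numpy())  # numpy default is ddof=0

# Arithmetic with NaN yields NaN, so missing entries stay missing
standardized = (values - values.mean()) / population_sd

print(population_sd, numpy_sd)  # 2.0 2.0
print(sample_sd > population_sd)  # True
print(standardized.tolist())
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0, nan]
```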

<br>

#### `add_post_redemption`

Complete the implementation of the function `add_post_redemption`, which takes in a DataFrame like `grades_combined` and returns a DataFrame with all the columns from `grades_combined` in addition to two new columns:
- `'Midterm Score Pre-Redemption'`, which contains each student's midterm exam score as a proportion between 0 and 1 **before** redemption.
- `'Midterm Score Post-Redemption'`, which contains each student's midterm exam score **after** the redemption policy has been applied, again as a proportion between 0 and 1.

You can use your `z_score` function to compute the z-scores of each student's original midterm exam grades and raw redemption scores. **Note that there are students who didn't take the midterm; such students need to have their `NaN` scores fixed prior to calculating their pre-redemption z-scores**, otherwise, you may end up incorrectly giving them `NaN` post-redemption midterm scores. None of the redemption z-scores should be `NaN`, since you handled null values in your implementation of `raw_redemption`.

If it's not clear, **computing the `'Midterm Score Post-Redemption'` column is the most complicated part of this question**. Make sure you understand how the redemption policy for CSD 18 works before approaching this question. If you need to refresh your understanding, re-read the instructions at the start of [Part 2](#part2).

In [130]:

def z_score(series):
    """
    Convert a series of numbers to z-scores.
    """
    return (series - series.mean()) / series.std(ddof=0)

def add_post_redemption(grades_combined):
    # Pre-redemption midterm scores as proportions of the points available.
    # Assumes a 'Midterm - Max Points' column, named like the other '- Max Points' columns.
    # Students who didn't take the midterm get 0 before any z-scores are computed.
    pre_redemption = (grades_combined['Midterm'] / grades_combined['Midterm - Max Points']).fillna(0)
    redemption_scores = grades_combined['Raw Redemption Score']

    # Calculate z-scores for midterm and redemption
    midterm_z = z_score(pre_redemption)
    redemption_z = z_score(redemption_scores)

    # Convert redemption z-scores back to midterm proportions, capped at 1
    redeemed = (redemption_z * pre_redemption.std(ddof=0) + pre_redemption.mean()).clip(upper=1)

    # Apply redemption policy: replace the midterm score only when the redemption z-score is higher
    post_redemption = pre_redemption.copy()
    mask = redemption_z > midterm_z
    post_redemption[mask] = redeemed[mask]

    # Create new DataFrame with additional columns
    result = grades_combined.copy()
    result['Midterm Score Pre-Redemption'] = pre_redemption
    result['Midterm Score Post-Redemption'] = post_redemption

    return result

In [47]:
def z_score(series):
    """
    Convert a series of numbers to z-scores.
    
    Args:
        series (pd.Series): Series of numerical values
        
    Returns:
        pd.Series: Series of z-scores, maintaining NaN values where present
    """
    # Standard z-score formula: (x - mean) / standard deviation
    # Using ddof=0 for population standard deviation as specified
    return (series - series.mean()) / series.std(ddof=0)

def add_post_redemption(grades_combined):
    """
    Add pre- and post-redemption midterm scores to the grades DataFrame.
    
    Args:
        grades_combined (pd.DataFrame): DataFrame containing grades and raw redemption scores
        
    Returns:
        pd.DataFrame: Original DataFrame with added pre- and post-redemption columns
    """
    # Create copy to avoid modifying original
    result = grades_combined.copy()
    
    # Get midterm scores and raw redemption scores
    midterm_scores = grades_combined['Midterm']
    redemption_scores = grades_combined['Raw Redemption Score']
    
    # Calculate pre-redemption scores as proportions (0-1)
    # Find maximum possible midterm score using the 'Max Points' column
    midterm_max_col = next(col for col in grades_combined.columns 
                          if 'Midterm' in col and 'Max Points' in col)
    # Students who didn't take the midterm get 0 before z-scores are computed
    pre_redemption = (midterm_scores / grades_combined[midterm_max_col]).fillna(0)
    
    # Calculate z-scores for the midterm proportions and the redemption scores
    # (redemption_scores should have 0s for missing, not NaN, from raw_redemption)
    midterm_z = z_score(pre_redemption)
    redemption_z = z_score(redemption_scores)
    
    # Convert each redemption z-score back to a midterm proportion using the
    # midterm mean and SD, capping at 1 (100%)
    redeemed = (redemption_z * pre_redemption.std(ddof=0) + pre_redemption.mean()).clip(upper=1)
    
    # Apply redemption policy:
    # If a student's redemption z-score is higher than their midterm z-score,
    # their midterm score is replaced by the converted redemption score
    post_redemption = pre_redemption.copy()
    mask = redemption_z > midterm_z
    post_redemption[mask] = redeemed[mask]
    
    # Add new columns to result
    result['Midterm Score Pre-Redemption'] = pre_redemption
    result['Midterm Score Post-Redemption'] = post_redemption
    
    return result

In [48]:
grader.check("q9")

### Question 10

([return to the outline](#Navigating-the-Project))

Now, we're equipped to re-compute each student's course grade after the redemption policy.

#### `total_points_post_redemption`

Complete the implementation of the function `total_points_post_redemption`, which takes in a DataFrame like `grades_combined` and returns a Series containing each student's course grade after redemption. As a refresher, **course grades should be proportions between 0 and 1.**

You should not have to repeat any of your calculations for assignments other than the midterm exam – use your output from `total_points` and adjust it. Remember that, per [the syllabus](#The-Syllabus), the midterm exam is worth 15%.
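
Because only the midterm component changes, the adjustment amounts to adding 15% of the change in midterm score to the existing total. A minimal sketch on made-up numbers follows; the column names here are hypothetical and chosen only for the example.

```python
import pandas as pd

MIDTERM_WEIGHT = 0.15  # midterm weight from the syllabus

# Hypothetical totals and midterm proportions for two students
toy = pd.DataFrame({
    'Total Pre': [0.78, 0.91],
    'Midterm Pre': [0.60, 0.85],
    'Midterm Post': [0.80, 0.85],  # the second student's midterm is unchanged
})

# Only the midterm's share of the total moves
toy_total_post = toy['Total Pre'] + MIDTERM_WEIGHT * (toy['Midterm Post'] - toy['Midterm Pre'])

print(toy_total_post.round(4).tolist())
# [0.81, 0.91]
```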

<br>

#### `proportion_improved`

Finally, complete the implementation of the function `proportion_improved`, which takes in a DataFrame like `grades_combined` and returns the **proportion of students in the class whose letter grade increased** due to the redemption policy.

***Hints***:
- If you've implemented everything correctly, `proportion_improved(grades_combined)` should evaluate to a proportion between 0.07 and 0.12.
- Remember, it is impossible for a student's letter grade to decrease due to the redemption policy.
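
One detail that is easy to get backwards: letter grades are strings, and `'A' < 'B' < 'C' < 'D' < 'F'` alphabetically, so a *better* grade is the *smaller* string. The sketch below uses made-up letter grades to show the comparison direction.

```python
import pandas as pd

# Hypothetical letter grades before and after redemption
pre_letters = pd.Series(['B', 'C', 'F', 'A'])
post_letters = pd.Series(['A', 'C', 'D', 'A'])

# An improved grade sorts earlier in the alphabet
improved = post_letters < pre_letters

# The mean of a boolean Series is the proportion of True values
toy_proportion = improved.mean()

print(improved.tolist())  # [True, False, True, False]
print(toy_proportion)     # 0.5
```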

In [132]:
grades_combined

Unnamed: 0,PID,College,Level,Section,lab01,lab01 - Max Points,lab01 - Lateness (H:M:S),lab02,lab02 - Max Points,lab02 - Lateness (H:M:S),...,discussion08,discussion08 - Max Points,discussion08 - Lateness (H:M:S),discussion09,discussion09 - Max Points,discussion09 - Lateness (H:M:S),discussion10,discussion10 - Max Points,discussion10 - Lateness (H:M:S),Raw Redemption Score
0,A99706914,ERC,JR,A22,99.735279,100.0,0.0,84.990171,100.0,0.0,...,8.895294,10,0.0,10.000000,10,0.0,10.000000,10,0.0,0.796610
1,A99237411,Eighth,JR,A29,98.829476,100.0,0.0,50.784231,100.0,0.0,...,9.022407,10,0.0,9.020283,10,0.0,9.437368,10,0.0,0.796610
2,A99690544,Revelle,SR,A12,86.513369,100.0,0.0,47.802820,100.0,0.0,...,3.030538,10,0.0,7.613698,10,0.0,9.624617,10,0.0,0.830508
3,A99427381,Seventh,JR,A14,100.000000,100.0,0.0,100.000000,100.0,0.0,...,10.000000,10,0.0,9.249126,10,0.0,10.000000,10,0.0,0.813559
4,A99489712,Sixth,JR,A24,66.506974,100.0,0.0,33.422412,100.0,0.0,...,4.439606,10,0.0,4.485291,10,0.0,6.282712,10,0.0,0.830508
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
530,A99073025,Warren,JR,A12,100.000000,100.0,0.0,82.022753,100.0,0.0,...,10.000000,10,0.0,9.169447,10,0.0,10.000000,10,0.0,0.745763
531,A99257552,Warren,SO,A02,100.000000,100.0,0.0,87.498073,100.0,0.0,...,10.000000,10,0.0,10.000000,10,0.0,10.000000,10,0.0,0.694915
532,A99592629,Revelle,JR,A15,88.656641,100.0,0.0,90.326041,100.0,0.0,...,9.878661,10,0.0,8.878946,10,0.0,10.000000,10,0.0,0.745763
533,A99033808,Seventh,SR,A13,83.799719,100.0,0.0,85.636947,100.0,0.0,...,7.759434,10,0.0,8.655478,10,0.0,8.102277,10,0.0,0.898305


In [133]:

def total_points_post_redemption(grades_combined):
    # Get the total points before redemption
    pre_redemption_total = total_points(grades_combined)
    
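    # Calculate the difference in midterm scores; the redemption columns are not
    # part of grades_combined, so compute them with add_post_redemption
    with_redemption = add_post_redemption(grades_combined)
    midterm_difference = (with_redemption['Midterm Score Post-Redemption'] - 
                          with_redemption['Midterm Score Pre-Redemption'])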
    
    # Adjust the total points
    # The midterm is worth 15% of the total grade
    post_redemption_total = pre_redemption_total + (midterm_difference * 0.15)
    
    # Ensure the scores are between 0 and 1
    post_redemption_total = post_redemption_total.clip(0, 1)
    
    return post_redemption_total

def proportion_improved(grades_combined):
    # Calculate grades before and after redemption
    pre_redemption_grades = total_points(grades_combined)
    post_redemption_grades = total_points_post_redemption(grades_combined)
    
    # Convert to letter grades
    pre_letters = grade_to_letter(pre_redemption_grades)
    post_letters = grade_to_letter(post_redemption_grades)
    
    # Count how many students improved; letters compare alphabetically,
    # so a better grade ('A') is the smaller string
    improved = (post_letters < pre_letters).sum()
    
    # Calculate the proportion
    proportion = improved / len(grades_combined)
    
    return proportion

def grade_to_letter(scores):
    def assign_letter(score):
        if score >= 0.90: return 'A'
        elif score >= 0.80: return 'B'
        elif score >= 0.70: return 'C'
        elif score >= 0.60: return 'D'
        else: return 'F'
    
    return scores.apply(assign_letter)

In [135]:

def total_points_post_redemption(grades_combined):
    # Get the total points before redemption
    pre_redemption_total = total_points(grades_combined)
    
    # Compute pre- and post-redemption midterm scores using the z-score policy
    with_redemption = add_post_redemption(grades_combined)
    
    # Calculate the difference in midterm scores
    midterm_difference = (with_redemption['Midterm Score Post-Redemption'] - 
                          with_redemption['Midterm Score Pre-Redemption'])
    
    # Adjust the total points
    # The midterm is worth 15% of the total grade
    post_redemption_total = pre_redemption_total + (midterm_difference * 0.15)
    
    # Ensure the scores are between 0 and 1
    post_redemption_total = post_redemption_total.clip(0, 1)
    
    return post_redemption_total

def proportion_improved(grades_combined):
    # Calculate grades before and after redemption
    pre_redemption_grades = total_points(grades_combined)
    post_redemption_grades = total_points_post_redemption(grades_combined)
    
    # Convert to letter grades
    pre_letters = grade_to_letter(pre_redemption_grades)
    post_letters = grade_to_letter(post_redemption_grades)
    
    # Count how many students improved; letters compare alphabetically,
    # so a better grade ('A') is the smaller string
    improved = (post_letters < pre_letters).sum()
    
    # Calculate the proportion
    proportion = improved / len(grades_combined)
    
    return proportion

def grade_to_letter(scores):
    def assign_letter(score):
        if score >= 0.90: return 'A'
        elif score >= 0.80: return 'B'
        elif score >= 0.70: return 'C'
        elif score >= 0.60: return 'D'
        else: return 'F'
    
    return scores.apply(assign_letter)

In [136]:
grader.check("q10")

Great! Thanks to your implementation of the redemption policy, a sizeable fraction of CSD 18 students saw their letter grades improve.

<a name='part3'></a>

## Part 3: Analysis 🧠

([return to the outline](#Navigating-the-Project))


Now that we have students' letter grades before and after redemption, it's time to analyze how the class performed overall. First, because we're going to use them frequently in this part, we'll add a few extra columns to `grades_combined` and call the resulting DataFrame `grades_analysis`.

In [None]:
grades_analysis = grades_combined.assign(**{
    'Total Points Pre-Redemption': total_points(grades_combined),
    'Letter Grade Pre-Redemption': final_grades(total_points(grades_combined)),
    'Total Points Post-Redemption': total_points_post_redemption(grades_combined),
    'Letter Grade Post-Redemption': final_grades(total_points_post_redemption(grades_combined))
})
grades_analysis.head()

You may have noticed that `grades_analysis` has a `'Section'` column that we haven't yet touched. There are 30 unique values in the `'Section'` column – `'A01'`, `'A02'`, ..., `'A30'`, corresponding to the 30 different discussion sections the students in CSD 18 were enrolled in. Discussion sections and discussion assignments have nothing to do with one another, for the purposes of calculating grades, and moving forward, we'll refer to these just as "sections."

In [None]:
grades_analysis['Section'].nunique()

In [None]:
grades_analysis['Section'].unique()

Much of our analysis in this part will pertain to how students in different sections performed in CSD 18.

### Question 11

([return to the outline](#Navigating-the-Project))

#### `section_most_improved`

Complete the implementation of the function `section_most_improved`, which takes in a DataFrame like `grades_analysis` and returns the section in which **the greatest proportion of students had their letter grades increase due to the redemption policy**. For example, if 48\% of students in section `'A25'` had their letter grades increase due to the redemption policy, and no other section had more than 48\% of students increase, then `section_most_improved` should return `'A25'`. 

If there is a tie, return any one of the sections.
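As a rough illustration of the quantity being asked for, here is a minimal sketch on a made-up DataFrame. The values are invented and only the column names mirror `grades_analysis`; this is one possible way to compute a per-section proportion, not necessarily how your `project.py` should be organized.

```python
import pandas as pd

# Toy data with invented values; only the column names mirror `grades_analysis`.
toy = pd.DataFrame({
    'Section': ['A01', 'A01', 'A01', 'A02', 'A02', 'A02', 'A02'],
    'Letter Grade Pre-Redemption':  ['B', 'C', 'A', 'B', 'D', 'C', 'F'],
    'Letter Grade Post-Redemption': ['A', 'C', 'A', 'A', 'C', 'B', 'F'],
})

# Letter grades sort alphabetically from best ('A') to worst ('F'),
# so a grade improved exactly when the post-redemption letter is "smaller".
improved = toy['Letter Grade Post-Redemption'] < toy['Letter Grade Pre-Redemption']

# The mean of a boolean Series is the proportion of True values.
prop_improved = improved.groupby(toy['Section']).mean()

# idxmax returns the label of the largest value (the first one, if tied).
best_section = prop_improved.idxmax()
print(prop_improved)
print(best_section)
```

In this toy data, one of three students in `'A01'` improved and three of four students in `'A02'` improved, so the sketch picks `'A02'`.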

<br>

#### `top_sections`

Complete the implementation of the function `top_sections`, which takes in a DataFrame like `grades_analysis`, a float `t` between 0 and 1, and an integer `n`, and returns **an array containing the sections in which at least `n` students earned a raw score of at least `t` on the final exam**. The section names in the returned array should be sorted in alphanumeric order.

For example, `top_sections(grades_analysis, 0.75, 10)` should return an array of the sections in which at least 10 students scored at least 75% on the final exam.
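The filter-then-count idea behind this description can be sketched on toy data as follows. The column `'Final Score'` and the helper name `sections_meeting_threshold` are hypothetical; in the real data you would first need a column holding each student's raw final exam score as a proportion.

```python
import numpy as np
import pandas as pd

# Toy data; 'Final Score' is a hypothetical column of raw final exam
# scores expressed as proportions between 0 and 1.
toy = pd.DataFrame({
    'Section': ['A02', 'A02', 'A01', 'A01', 'A01', 'A03'],
    'Final Score': [0.80, 0.90, 0.95, 0.50, 0.76, 0.99],
})

def sections_meeting_threshold(df, t, n):
    """Hypothetical helper: sections with at least n students scoring >= t."""
    # Keep only students at or above the threshold, then count per section.
    counts = df.loc[df['Final Score'] >= t, 'Section'].value_counts()
    # Keep sections with enough qualifying students and sort their names.
    qualifying = counts[counts >= n].index
    return np.sort(np.array(qualifying))

result = sections_meeting_threshold(toy, 0.75, 2)
print(result)
```

Here `'A01'` and `'A02'` each have two students at or above 0.75, while `'A03'` has only one, so only the first two sections are returned.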

In [None]:
grader.check("q11")

### Question 12

([return to the outline](#Navigating-the-Project))

Complete the implementation of the function `rank_by_section`, which takes in a DataFrame like `grades_analysis` and returns a DataFrame describing **students' _ranks_ based on total points (post-redemption) for each section**.

Specifically, the DataFrame should have `n` rows that describe the rank – indexed `1`, `2`, ..., `n` (where `n` is the number of students in the largest section), in that order – and 30 columns – `'A01'`, `'A02'`, ..., `'A30'`, in that order. **The entry in row `r` and column `s` should correspond to the PID of the student who had the `r`th most total points in section `s`, after redemption.** For sections that have fewer than `n` students, fill the extra entries in those columns with **empty strings**. There might exist ties for students with total points of 0.

For instance, suppose there were only four sections, and the largest section had five students. The DataFrame returned by `rank_by_section` might look like:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>Section</th>
      <th>A01</th>
      <th>A02</th>
      <th>A03</th>
      <th>A04</th>
    </tr>
    <tr>
      <th>Section Rank</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1</th>
      <td>A99404117</td>
      <td>A99318825</td>
      <td>A99093358</td>
      <td>A99339719</td>
    </tr>
    <tr>
      <th>2</th>
      <td>A99477753</td>
      <td>A99396913</td>
      <td>A99933171</td>
      <td>A99082089</td>
    </tr>
    <tr>
      <th>3</th>
      <td></td>
      <td>A99159214</td>
      <td>A99164028</td>
      <td>A99950565</td>
    </tr>
    <tr>
      <th>4</th>
      <td></td>
      <td>A99322859</td>
      <td></td>
      <td>A99715029</td>
    </tr>
    <tr>
      <th>5</th>
      <td></td>
      <td>A99739120</td>
      <td></td>
      <td></td>
    </tr>
  </tbody>
</table>

Note that the PIDs in your DataFrame will be different from those above. Your DataFrame may also have a different string where the example has `'Section Rank'`, and that's fine.
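To make the target shape concrete, here is a small sketch that builds a table of this form from toy data. The PIDs and point totals are invented, and numbering rows with `groupby(...).cumcount()` is just one possible route to the within-section position, chosen here for brevity.

```python
import pandas as pd

# Toy data with made-up PIDs and point totals.
toy = pd.DataFrame({
    'PID': ['P1', 'P2', 'P3', 'P4', 'P5'],
    'Section': ['A01', 'A01', 'A02', 'A02', 'A02'],
    'Total Points Post-Redemption': [0.70, 0.90, 0.85, 0.95, 0.60],
})

# Sort from most to fewest points, then number the rows within each section.
ordered = toy.sort_values('Total Points Post-Redemption', ascending=False)
ordered = ordered.assign(
    **{'Section Rank': ordered.groupby('Section').cumcount() + 1}
)

# One row per rank, one column per section; smaller sections are
# padded with empty strings instead of NaN.
wide = (
    ordered
    .pivot(index='Section Rank', columns='Section', values='PID')
    .fillna('')
)
print(wide)
```

Section `'A01'` has only two students in this toy data, so its entry in row `3` is an empty string.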

***Hints***: 
- Our solution used `groupby` with a helper function, and then `pivot` on the result. This is a tricky problem – work through it one step at a time.

- Try to use `.sort_values` rather than `.rank` in this question. By default, `.rank` assigns tied values the mean of the ranks they span, so tied students share a rank that may not be a whole number. For more information, please refer to the `.rank` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html).
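The tie behavior mentioned in the hint is easy to see on a tiny Series. The values below are invented for illustration.

```python
import pandas as pd

# Two students are tied at 0 points.
points = pd.Series([0.9, 0.0, 0.0, 0.5], index=['P1', 'P2', 'P3', 'P4'])

# Default ranking (method='average'): the tied students would occupy
# ranks 3 and 4, so each is assigned the mean, 3.5.
avg_ranks = points.rank(ascending=False)
print(avg_ranks)

# Sorting instead gives every student a distinct position 1, 2, ..., n.
ordered = points.sort_values(ascending=False, kind='stable')
positions = pd.Series(range(1, len(ordered) + 1), index=ordered.index)
print(positions)
```

With `.rank`, neither tied student gets rank 3 or rank 4, which would leave gaps in a table indexed `1`, `2`, ..., `n`. Sorting avoids that by assigning each student a distinct integer position.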

In [None]:
grader.check("q12")

### Question 13

([return to the outline](#Navigating-the-Project))

To wrap up, let's _visualize_ how the students in each section of CSD 18 performed.

Complete the implementation of the function `letter_grade_heat_map`, which takes in a DataFrame like `grades_analysis` and returns a `plotly` figure object containing a heatmap describing **the distribution of letter grades (post-redemption) for each section**.

Specifically, the heatmap should have 5 rows – `'A'`, `'B'`, `'C'`, `'D'`, and `'F'`, in that order – and 30 columns – `'A01'`, `'A02'`, ..., `'A30'`, in that order. **The color of the square in row `g` and column `s` should correspond to the proportion of students in section `s` who earned a letter grade (post-redemption) of `g`.**
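For intuition about the numbers behind the heatmap, here is a sketch that builds a grade-by-section table of proportions from toy data. The data is invented, and `pd.crosstab` is one convenient option among several (a `groupby` followed by `value_counts(normalize=True)` and `unstack` would be another).

```python
import pandas as pd

# Toy data with invented sections and letter grades.
toy = pd.DataFrame({
    'Section': ['A01', 'A01', 'A01', 'A01', 'A02', 'A02'],
    'Letter Grade Post-Redemption': ['A', 'A', 'B', 'F', 'C', 'C'],
})

# Rows are letter grades, columns are sections; normalize='columns'
# turns the counts in each column into proportions that sum to 1.
props = pd.crosstab(
    toy['Letter Grade Post-Redemption'],
    toy['Section'],
    normalize='columns',
)

# Make sure all five letters appear as rows, in order, even if no
# student earned a particular grade.
props = props.reindex(['A', 'B', 'C', 'D', 'F'], fill_value=0.0)
print(props)
```

Each column of `props` sums to 1, and the `'D'` row is all zeros because no student in the toy data earned a D. A table of this shape (grades as rows, sections as columns) matches the layout described above.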

To create your figure, you'll use the `px.imshow` function and provide several arguments. This [`plotly` article](https://plotly.com/python/imshow/) will be extremely helpful.

Here are some additional requirements to get full credit for your heatmap:
- Set the color scale to be something other than the default. Note that in this heatmap, you should use a sequential color scheme, which means that the intensity of the color assigned to a square is proportional to the value being plotted for that square (e.g. darker colors should correspond to larger proportions and lighter colors should correspond to smaller proportions, or vice versa). Read more about the theory of sequential and diverging color schemes [here](https://blog.datawrapper.de/diverging-vs-sequential-color-scales/).
- Set the title of the plot to `'Distribution of Letter Grades by Section'`.

An example plot that satisfies all of these conditions is shown below, though we encourage you to customize yours within the confines above. Can you change the font?

<img src="data/heatmap-example.png" width=100%>

It's fine if your x-axis labels are rotated.

Remember to return the figure object itself. That is, somewhere in your code you will have `fig = px.imshow(...)`; make sure to also `return fig`.

***Hint***: Most of the work in this question is creating the DataFrame to call `px.imshow` on.

Run the cell below to see your heatmap.

In [None]:
# Run this cell to see the result, and don't change this cell --- it is needed for the tests.
fig = letter_grade_heat_map(grades_analysis)
fig.show()

In [None]:
grader.check("q13")

## Congratulations, you've finished Project 1! 🎉

As a reminder, all of the work you want to submit needs to be in `project.py` – this notebook should not be uploaded because there are no manually-graded questions in this project.

To ensure that all of the work you want to submit is in `project.py`, we've included a script named `project-validation.py` in the project folder. You shouldn't edit it, but instead, you should call it from the command line (e.g. the Terminal) to test your work.

Once you've finished the project, you should open the command line and run, in the directory for this project:

```
python project-validation.py
```

**This will run all of the `grader.check` cells that you see in this notebook, but only using the code in `project.py` – that is, it doesn't look at any of the code in this notebook. If all of your `grader.check` cells pass in this notebook but not all of them pass in your command line with the above command, then you likely have code in your notebook that isn't in your `project.py`!**

You can also use `project-validation.py` to test individual questions. For instance,

```
python project-validation.py q1 q4 q7 q8
```

will run the `grader.check` cells for Questions 1, 4, 7, and 8 – again, only using the code in `project.py`.

Once `python project-validation.py` shows that you're passing all test cases, you're ready to submit your `project.py` (and only your `project.py`) to Gradescope. After submitting to Gradescope, make sure to stick around until all test cases pass.

There is also a call to `grader.check_all()` below in _this_ notebook, but make sure to also follow the steps above.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()