# PD 12x Course Analysis

I'm hoping that an LR or CV-LLR will work sufficiently well for this analysis. The first goal is simply to analyze students who take the class and persist from Fall to Spring or from Spring to Fall. Since we are analyzing PD courses as a whole, it makes sense to compare them to other subject areas as a whole, at least in the first run. Let's see if that surfaces anything insightful.

In [None]:
import pandas as pd
import numpy as np
import os
from pathlib import Path

In [None]:
%%html
<style>
table {float:left}
</style>

# CrHr and Student Enrollment

I'm using the 20th Day credit hour enrollment parquet file.

In [None]:
# File path
crhr_fp = Path(os.getcwd()).parent / '20th_day_data/Files'

# crhr parquet file
crhr = (pd.read_parquet(crhr_fp / '20th_D CrHr Enrollment 201280 - 202410.parquet')
          .loc[lambda df: df['term'].isin([201980, 202010, 202080, 202110, 202180, 
                                           202210, 202280, 202310, 202380, 202410])]
          .assign(term_id = lambda df: df['term'].astype(str) + df['id'].astype(str))
          .reset_index(drop = True)
       )[['term', 'id', 'term_id', 'purchase rate', 'first semester (join date)',
       'age by semester', 'age', 'age_range', 'totcr', 'status', 'stype',
       'resd_desc', 'degree', 'majr_desc1', 'gender', 'mrtl', 'ethn_desc',
       'cnty_desc1', 'prevhrs', 'pt', 'loc', 'crn', 'sub', 'crs', 'title',
       'cr', 're', 'div', 'crs cd']]

In [None]:
# Isolate the students who were in a PD 12x course
mask = crhr['sub'] == 'PD'
mask2 = crhr['crs'].isin(['124', '123', '121', '127', '125', '129', '122', '126'])
pdx_students = crhr[mask & mask2][['term_id', 'term', 'sub', 'crs']].reset_index(drop = True)

# Isolate the ids of students who were in a PD 12x course
pdx_ids = list(pdx_students['term_id'])

# Binary response variable: students with a PD 12x class in the term vs. those without.
# A vectorized isin avoids scanning the Python list once per row.
crhr['pdx'] = np.where(crhr['term_id'].isin(pdx_ids), 'PD 12x', 'Not PD 12x')


In [None]:
# Sanity check. Is the response variable correctly coded
(crhr[(crhr['sub'] == 'PD') & (crhr['crs'].isin(['124', '123', '121', '127', '125', '129', '122', '126']))]
 [['sub', 'crs', 'pdx']].head(15)
)
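One thing worth keeping in mind about the flag: it is keyed on `term_id`, so every row a student has in a term where they took a PD 12x course is labeled `PD 12x`, not just the PD row itself. A minimal sketch on made-up data (the IDs, subjects, and course numbers below are illustrative only, not from the 20th-Day file):

```python
import numpy as np
import pandas as pd

# Toy enrollment rows: student 1 takes PD 121 and ENG 101 in 201980
toy = pd.DataFrame({
    "term": [201980, 201980, 201980, 202010],
    "id":   [1, 1, 2, 1],
    "sub":  ["PD", "ENG", "MAT", "ENG"],
    "crs":  ["121", "101", "110", "102"],
})
toy["term_id"] = toy["term"].astype(str) + toy["id"].astype(str)

# term_ids with at least one PD 12x row
in_pdx = (toy["sub"] == "PD") & toy["crs"].isin(["121", "122"])
pdx_term_ids = toy.loc[in_pdx, "term_id"].unique()

# Flag every row belonging to those term_ids
toy["pdx"] = np.where(toy["term_id"].isin(pdx_term_ids), "PD 12x", "Not PD 12x")
print(toy[["term_id", "sub", "crs", "pdx"]])
```

Here student 1's ENG 101 row in 201980 is flagged along with the PD row, while the same student's 202010 row is not, because the flag is per student per term.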


## Persistence And Retention Rates

Create two different data sets: one examines persistence rates, the other retention rates. Here we define *persistence* as semester to semester and *retention* as Fall to Fall. In both cases we are examining the proportion of students from the previous term who are enrolled in the current term.

$$\text{Persistence} = \frac{S_{t+1}}{S_t}$$

where $S_t$ represents *all students* enrolled in the previous semester and $S_{t+1}$ represents the students from the previous semester who are also enrolled in the current semester. For instance, if the current semester is Fall, then persistence measures the share of students from the Spring semester (i.e. the previous semester) who are also enrolled in Fall. This ends up being a LEFT JOIN of the previous semester to the current semester.

$$\text{Retention} = \frac{F_{t+1}}{F_t}$$

where $F_t$ represents *all students* enrolled in the previous Fall and $F_{t+1}$ represents the students from the previous Fall who are also enrolled in the current Fall semester.
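As a small worked example of the two ratios, the sketch below uses made-up ID sets (the IDs and the `rate` helper are purely illustrative, not part of the analysis) and computes each rate as the share of the earlier term's students who show up again in the later term.

```python
# Hypothetical enrollments: term code -> set of student IDs (illustrative only)
enrolled = {
    201980: {"a", "b", "c", "d"},   # Fall
    202010: {"a", "b", "c", "e"},   # following Spring
    202080: {"a", "b", "f"},        # following Fall
}

def rate(prev_ids, curr_ids):
    """Share of prev_ids that also appear in curr_ids."""
    return len(prev_ids & curr_ids) / len(prev_ids)

# Persistence: Fall 201980 -> Spring 202010 (3 of 4 students return)
persistence_rate = rate(enrolled[201980], enrolled[202010])

# Retention: Fall 201980 -> Fall 202080 (2 of 4 students return)
retention_rate = rate(enrolled[201980], enrolled[202080])

print(f"persistence: {persistence_rate:.2f}, retention: {retention_rate:.2f}")
```

Students who are new in the later term (`e` and `f` here) do not enter either ratio, which is the same effect the LEFT JOIN has.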

In [None]:
def persistence_retention(df, prev_term, curr_term):
    """
    df (pd.DataFrame): This is the 20th-Day dataframe from IR that has been modified by me (Aaron). The modifications
                       are not used in the code for persistence and retention. That is a basic comparison of the presence
                       of IDs from one semester to the next. 
    prev_term (int): Six digit integer of the previous term. Since this can do both retention and persistence,
                     the previous term will be in relation to what you are studying. Retention is Fall to Fall. 
                     Persistence is the previous Fall or Spring semester.
    curr_term (int): Six digit integer of the current term. For persistence, the current term is the Spring or Fall 
                     immediately following the previous term. For retention, the current term is the Fall term immediately
                     following the previous Fall term.
    returns: A tuple (merged_sems, prop_persisted).
             merged_sems: one row per student in the previous term, with all columns plus a 'persistence' label.
             prop_persisted: the proportion who persisted from one term to the next and the proportion who did not.
             
    """
    # previous
    previous = (df[df['term'] == prev_term]
                   .groupby('id').first()
                   .reset_index()
               )

    # current
    current = (df[df['term'] == curr_term]
                   .groupby('id').first()
                   .reset_index()
              )

    # merge previous with current, prioritizing previous
    merged_sems = (previous.merge(current[['id', 'term']], how = 'left', on = 'id')
                       .reset_index(drop = True)
                       .rename(columns = {'term_x':'term',
                                          'term_y':'enrolled'})
                       .assign(persistence = lambda x: np.where(x['enrolled'].notna(), f'{prev_term} Persisted', f'{prev_term} Not Persisted'))
                       .drop('enrolled', axis = 1)
                  )

    prop_persisted = merged_sems['persistence'].value_counts(normalize = True)
    
    return merged_sems, prop_persisted


### Persistence

The loop below calculates persistence using the `persistence_retention` function defined above. We start at position one in the *terms* list, then cycle through to the end, so the previous term is $t-1$ and the current term is $t$.

In [None]:
# Loop through terms
terms = sorted(crhr['term'].unique())

persistence_perc = []
persistence = []

for i in range(1, len(terms)):
    previous_term = terms[i-1]
    current_term = terms[i]
    # Call the function once and unpack both return values
    temp_persistence, temp_persistence_perc = persistence_retention(crhr, previous_term, current_term)
    persistence_perc.append(temp_persistence_perc)
    persistence.append(temp_persistence)
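The same function can also produce the Fall-to-Fall retention defined earlier, although only the persistence loop is run here. A minimal sketch of how the term pairs could be built, assuming (from the term codes used in this notebook) that Fall terms end in 80 and Spring terms in 10; `fall_pairs` and `example_terms` are hypothetical names, not part of the original analysis:

```python
def fall_pairs(terms):
    """Pair each Fall term with the Fall term that follows it.

    Assumes six-digit term codes where a trailing 80 marks a Fall term.
    """
    falls = sorted(t for t in terms if t % 100 == 80)
    return list(zip(falls[:-1], falls[1:]))

example_terms = [201980, 202010, 202080, 202110, 202180,
                 202210, 202280, 202310, 202380, 202410]

pairs = fall_pairs(example_terms)
print(pairs)

# Each pair could then be passed to the function above, e.g.
# merged, props = persistence_retention(crhr, prev_fall, curr_fall)
```

The labels returned by the function would still read "Persisted" / "Not Persisted" in that case, so they would need to be interpreted as retained / not retained.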

In [None]:
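# View percent persisted for each semester. Remember, these are looking forward, so 201980 shows
# students who persisted from 201980 to 202010: 64.18% of students from 201980 persisted to
# 202010 and 35.82% did not.
# Note: the 'index' and 'persistence' column names used below assume pandas < 2.0. From 2.0 onward,
# value_counts(normalize = True) names its result 'proportion' and keeps 'persistence' as the index name.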
percent_persisted = (pd.concat(persistence_perc)
                       .reset_index())

percent_persisted[['term', 'persisted']] = percent_persisted['index'].str.extract(r'(\d{6})\s*(.*)')

percent_persisted = percent_persisted.drop('index', axis = 1)

# Pivot table to view persisted/not persisted
percent_persisted.pivot_table(index = 'term', columns = 'persisted', values = 'persistence')

In [None]:
# Dataframe of all persistence for each of the semesters considered
all_persistence = (pd.concat(persistence)
                     .reset_index(drop = True)
                  )

# Split the term away from the "Persisted/Not Persisted"
all_persistence[['term_x', 'persistence']] = all_persistence['persistence'].str.extract(r'(\d{6})\s*(.*)')

# Drop extra term column
all_persistence.drop('term_x', axis = 1, inplace = True)


### Special Side Project

I did this for DeAnn due to a report I sent last year that was completely incorrect. I cannot, for the life of me, figure out what I was doing or thinking last year. The below numbers are correct.

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Make stype persistence rate dataframe
stype_persistence = (all_persistence
                         .groupby(['term', 'stype', 'persistence'])
                         .agg({'id':'count'})
                         .reset_index()
                         .rename(columns = {'id':'count'})
                    )

# Add in the persistence percentage
persistence_percent = []

for term in stype_persistence['term'].unique():
    temp_term = stype_persistence[stype_persistence['term'] == term]
    for stype in temp_term['stype'].unique():
        # Copy the slice so the new column is not assigned on a view of temp_term
        temp_stype = temp_term[temp_term['stype'] == stype].copy()
        temp_stype['percent'] = temp_stype['count'] / temp_stype['count'].sum()
        persistence_percent.append(temp_stype)

In [None]:
# View the persistence percentage for each semester for each student type
(pd.concat(persistence_percent)
   .pivot_table(index = ['term', 'stype'], columns = 'persistence', values = ['count', 'percent'])
)
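For reference, the nested loop above could also be written as a single grouped division. This is only a sketch on made-up counts (the `stype` codes and numbers are invented for illustration), using `groupby(...).transform('sum')` to get each group's total aligned to every row:

```python
import pandas as pd

# Toy counts per term / student type / persistence outcome (illustrative only)
counts = pd.DataFrame({
    "term": ["201980"] * 4,
    "stype": ["N", "N", "C", "C"],
    "persistence": ["Persisted", "Not Persisted", "Persisted", "Not Persisted"],
    "count": [30, 10, 45, 5],
})

# Total count within each (term, stype) group, repeated on each row of the group
group_total = counts.groupby(["term", "stype"])["count"].transform("sum")
counts["percent"] = counts["count"] / group_total

print(counts)
```

Because nothing is sliced out of the frame, this version does not trigger the chained-assignment warning that the loop does.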

### End Side Project

### Import GPA Information

This data is downloaded directly from Argos. The EOT grades from the IR database do not give me all the data I have in the Argos dataset. Remember, there are hundreds of students who are total withdrawals by the end of term, which is why some of them do not show up in the grades. Dropping duplicates takes the data from 61343 rows to 60400, so we only lose 943 entries.

In [None]:
# Import grades
grades = (pd.read_parquet('Files/201980 - 202380 Grades Data From Argos.parquet')
            .assign(term_id = lambda df: df['term'].astype(str) + df['id'].astype(str))
         )
grades

In [None]:
# Combine 20th-D crhr enrollment with EOT grades (pulled from Argos)
crhr_grades = (all_persistence.merge(grades, on = 'term_id', how = 'left', indicator = True)
                   .drop(['id_y', 'term_y', 'purchase rate', 'first semester (join date)', 
                          'age by semester', 'status', 'degree', 'majr_desc1', 'mrtl', 'cnty_desc1',
                          'pt', 'loc', 'crn', 'sub', 'crs', 'title', 'cr', 're', 'div',
                          'crs cd'], axis = 1)
                   .rename(columns = {'term_x':'term',
                                      'id_x':'id',
                                      '_merge':'df_origination'})
                   .query("df_origination == 'both'")
                   .reset_index(drop = True)
                   .drop('df_origination', axis = 1)
              )


In [None]:
# Select just the unique term_ids
crhr_grades = crhr_grades.groupby('term_id').first().reset_index()

crhr_grades.to_csv('Crhr_Grades 2019 - 2023.csv')
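A note on the de-duplication step: `groupby('term_id').first()` returns the first *non-null* value of each column within a group, so the resulting row can combine values from different source rows. `drop_duplicates('term_id')` instead keeps the first row exactly as it is. A minimal sketch of the difference on made-up data (the values, including the `GS` standing code, are illustrative only):

```python
import numpy as np
import pandas as pd

# Two rows for the same term_id, each missing a different field
toy = pd.DataFrame({
    "term_id": ["2019801", "2019801", "2019802"],
    "trmgpa":  [np.nan, 3.5, 2.0],
    "acdstd":  ["GS", None, "GS"],
})

# first(): per column, the first non-null value in each group
by_first = toy.groupby("term_id").first().reset_index()

# drop_duplicates(): the first row of each group, nulls included
by_dedup = toy.drop_duplicates("term_id").reset_index(drop=True)

print(by_first)
print(by_dedup)
```

Either behavior may be what is wanted here; the point is only that the two are not interchangeable when duplicate rows carry missing values.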

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Convert all categories in 'acdstd' to string
crhr_grades['acdstd'] = crhr_grades['acdstd'].astype(str)

# Create violin plots using seaborn
plt.figure(figsize = (6, 4))
# Plot from crhr_grades, the dataframe whose 'acdstd' column was converted to string above
sns.violinplot(x = 'acdstd', y = 'trmgpa', data = crhr_grades)

# add titles and labels
plt.title('Violin Plots of Term GPA and Academic Standing')
plt.xlabel('Academic Standing')
plt.ylabel('Term GPA')

# show plot
plt.show()