# Student Attendance Data Cleaning and ISP Data

Students who have been enrolled in any form of ISP show up as duplicate rows in the Aeries attendance query. This code cleans up those duplicates by:
1) Separating them from the non-duplicates.
2) Determining whether each student has completed ISP, is in progress, or is incomplete.
3) Cleaning the data according to the category the student falls under.

Once the ISP students' data has been cleaned, it is rejoined with the rest of the students to produce a master attendance file with no duplicates. The output also includes a sheet with the number of students in each category at every school site, along with lists of the students who completed ISP, are in progress, or are incomplete.

The Aeries query to obtain the data used for this file is (adjust the year in the query for the current school year):

LIST AHS STU STU.SC STU.NM STU.ID STU.GR AHS.YR AHS.EN AHS.PR AHS.AB AHS.AE AHS.AU AHS.TD AHS.TRU AHS.SU AHS.ISS AHS.ISC AHS.ISI IF AHS.YR = 2022-2023

__Make sure to scroll to the bottom to look over the checks on possible special situations that you might have to look into about the data.__

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Set the path where the generated file should be saved
output_final = "C:\\Users\\derek.castleman\\Desktop\\MonthlyDataPull\\November\\Fixed_Attendance.xlsx"

In [None]:
# Paste the file path of the Aeries query export inside the parentheses
attendance = pd.read_excel(r"C:\Users\derek.castleman\Desktop\PrintQueryToExcel_20221205_101047_85335fc.xlsx")
attendance

In [None]:
# Making sure that we are only looking at the current school year
att = attendance[attendance['Year']=='2022-2023']
att

In [None]:
# Check to see if any students have more than two rows of data
more_than_two = att.groupby(['Student ID']).size()
more_than_two = more_than_two.to_frame()
more_than_two = more_than_two.rename(columns = {0:'Number Rows'}, inplace = False)
more_than_two = more_than_two[more_than_two['Number Rows'] > 2]
more_than_two

In [None]:
# Separating duplicated students from the rest and keeping both rows for them
duplicated_students = att[att.duplicated(subset=['Student ID'], keep = False)]
duplicated_students

## Getting Students with Duplicated Rows and ISP Attendance

Each duplicated student has two rows: one for the days they were in ISP and one for their regular attendance at the school. In this section, the students are sorted by Student ID and then by enrollment. The row with the lower enrollment value represents the days in ISP, so that is the row that is selected.

Selection is based on enrollment because some duplicated students are in home hospital and have no ISP days, and this method captures those students as well.
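
As a rough illustration of the selection, the sketch below runs the same sort-then-take-first idea on a few made-up rows. The values are invented for the example; only the column names follow the notebook. Student 102 stands in for a home hospital student with no ISP days.

```python
import pandas as pd

# Made-up rows for illustration; real data comes from the Aeries export.
toy = pd.DataFrame({
    "Student ID": [101, 101, 102, 102],
    "Enrolled": [60, 5, 3, 62],
    "Days of Complete Independent Study": [0, 5, 0, 0],
})

# Sort so the smaller enrollment count comes first within each student,
# then keep only the first row per student.
ordered = toy.sort_values(["Student ID", "Enrolled"])
isp_rows = ordered.groupby("Student ID").head(1)
print(isp_rows)
```

Student 102 is still picked up even though the ISP column is zero on both rows, which is the reason the sort uses enrollment rather than ISP days.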

In [None]:
# Sorting students by ID and enrollment
students_sorted = duplicated_students.sort_values(['Student ID','Enrolled'])
students_sorted

In [None]:
# The lowest value row for each student is selected
lowest_number = students_sorted.groupby(['Student ID']).head(1)
lowest_number

In [None]:
# Looking over the data for any inconsistencies
lowest_number.describe()

## Students Who Did Not Complete ISP or Are in Progress

Students who did not complete their ISP days are filtered out from the duplicates. They are then checked for any days of complete ISP, since students who have both may have to be dealt with manually.

Two groups (incomplete and in progress) are created for use later in the program. Students with a negative value for present have that value changed to zero. Absences are then updated for all of these students by setting the absences column equal to the days of incomplete ISP.

Students who are in progress for ISP are counted as absent until it is completed, since technically they cannot be counted as present until then.
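
The rules above can be sketched on a couple of invented rows. This is only an illustration under assumed values: the `ISP Status` column is a hypothetical label added for the example, and `clip(lower=0)` is used here as a compact stand-in for setting non-positive present values to zero.

```python
import pandas as pd

# Made-up rows: 201 has negative present days, 202 has zero.
toy = pd.DataFrame({
    "Student ID": [201, 202],
    "Present": [-4, 0],
    "Absences": [0, 0],
    "Days of Incomplete Independent Study": [4, 3],
})

# Label each row before Present is overwritten (hypothetical helper column).
toy["ISP Status"] = toy["Present"].apply(
    lambda p: "Incomplete" if p < 0 else "In Progress"
)

# Work on a copy so the original rows stay available for comparison.
fixed = toy.copy()
fixed["Present"] = fixed["Present"].clip(lower=0)  # negatives become 0
fixed["Absences"] = fixed["Days of Incomplete Independent Study"]
print(fixed)
```

The labels are assigned first because once present is set to zero, the incomplete and in-progress students can no longer be told apart by that column.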

In [None]:
# Filtering out students with Incomplete ISP
# (.copy() so the later edits do not modify lowest_number or raise SettingWithCopyWarning)
negative_students = lowest_number.loc[lowest_number['Days of Incomplete Independent Study'] > 0].copy()
negative_students

In [None]:
# Making sure that students do not have any complete days of ISP
complete_check = negative_students.loc[negative_students['Days of Complete Independent Study'] != 0]
complete_check

In [None]:
# Filtering students that are incomplete for ISP
incomplete_isp = negative_students.loc[negative_students['Present'] < 0]
incomplete_isp

In [None]:
# Students in progress of ISP are filtered
inprogress_isp = negative_students.loc[negative_students['Present'] == 0]
inprogress_isp

In [None]:
# Changing present from negative to zero for incomplete students
negative_students.loc[negative_students['Present'] <= 0, 'Present'] = 0
negative_students

In [None]:
# Incomplete independent study days are set equal to absences
negative_students['Absences'] = negative_students['Days of Incomplete Independent Study']
negative_students

## Students who Completed ISP

Students who have completed ISP need their days present updated. This is done by setting the days present equal to the days enrolled. It is done this way so that home hospital students are updated correctly as well, since they do not have days of complete independent study.

Before the update, the days enrolled are checked against the days of complete independent study. Any students for whom the two do not match can be looked up in Aeries to see whether it is because of home hospital or something else.
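
A minimal sketch of the check and the update, using invented rows. The comparison here is on the raw day counts, which is an assumption about what the check is meant to catch; the `mismatch` name is hypothetical.

```python
import pandas as pd

# Made-up rows: 301 matches, 302 is off by a day, 303 has no ISP days
# (as a home hospital student might).
toy = pd.DataFrame({
    "Student ID": [301, 302, 303],
    "Enrolled": [5, 4, 6],
    "Present": [0, 0, 6],
    "Days of Complete Independent Study": [5, 3, 0],
})

# Rows where the two day counts differ are worth a look in Aeries.
mismatch = toy[toy["Days of Complete Independent Study"] != toy["Enrolled"]]

# Set days present equal to days enrolled on a copy of the rows.
fixed = toy.copy()
fixed["Present"] = fixed["Enrolled"]
print(mismatch)
print(fixed)
```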

In [None]:
# Locating students that have completed ISP
# (.copy() so the later edits do not modify lowest_number or raise SettingWithCopyWarning)
positive_students = lowest_number.loc[(lowest_number['Days of Complete Independent Study'] > 0) | (lowest_number['Present'] > 0)].copy()
positive_students

In [None]:
# Flag students where only one of complete ISP days and days enrolled is above zero
# (for example, enrolled days but no complete ISP days, as with home hospital)
no_isp = positive_students.loc[(positive_students['Days of Complete Independent Study'] > 0) != (positive_students['Enrolled'] > 0)]
no_isp

In [None]:
# Students with ISP days completed
isp_completed_students = positive_students.loc[positive_students['Days of Complete Independent Study'] > 0]
isp_completed_students

In [None]:
# Checking to see if complete students do not have incomplete days
incomplete_check = positive_students.loc[positive_students['Days of Incomplete Independent Study'] != 0]
incomplete_check

In [None]:
# Setting days present to days enrolled to correct for missing data
positive_students['Present'] = positive_students['Enrolled']
positive_students

## Combining Students and Fixing Duplicates

The data fixed in the previous sections is combined here. It is then concatenated with the other duplicate rows that were set aside at the beginning, and the two rows for each student are merged so that each student ends up with a single corrected entry.
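
To show what the merge does to one student's pair of rows, here is a small sketch with an invented student and school. The grouping keys match the ones used in the notebook; everything else is made up.

```python
import pandas as pd

# Two made-up rows for the same student: the ISP row and the regular row.
toy = pd.DataFrame({
    "Student Name": ["Doe, Jane", "Doe, Jane"],
    "Grade": [7, 7],
    "Student ID": [401, 401],
    "School": ["Example Middle", "Example Middle"],
    "Year": ["2022-2023", "2022-2023"],
    "Enrolled": [5, 60],
    "Present": [5, 57],
    "Absences": [0, 3],
})

# Grouping on the identifying columns and summing adds the numeric
# columns of the two rows together into one row.
keys = ["Student Name", "Grade", "Student ID", "School", "Year"]
merged = toy.groupby(keys).sum().reset_index()
print(merged)
```

One thing to keep in mind: if a student's two rows differ in any of the grouping columns (for example, a different school or grade on each row), they will not collapse into one row and the student will remain duplicated.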

In [None]:
# Combining the fixed data
combined = pd.concat([positive_students, negative_students])
combined

In [None]:
students_sorted

In [None]:
# Getting the other duplicate rows that were filtered out at first
highest_number = students_sorted.groupby(['Student ID']).tail(1)
highest_number

In [None]:
# Combining data to recreate the duplicated rows for each student
combined_duplicated = pd.concat([highest_number, combined]).sort_values(['Student ID','Enrolled'])
combined_duplicated

In [None]:
# Merging the two rows for each student to create one single entry with the corrected data
Schoolfixed_absent = combined_duplicated.groupby(['Student Name','Grade','Student ID', 'School', 'Year']).sum().reset_index()

In [None]:
Schoolfixed_absent

## Selecting Rows Without Duplicates

Now the students without duplicates need to be filtered from the original dataset so that they can be combined with the fixed, formerly duplicated rows. The result is the final corrected attendance file for use in data analysis.
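
The sketch below, on made-up IDs, shows that `duplicated(keep=False)` and `drop_duplicates(keep=False)` split the data into two pieces that do not overlap, so no rows are lost or counted twice.

```python
import pandas as pd

# Made-up IDs: 101 appears twice, 102 and 103 appear once.
toy = pd.DataFrame({"Student ID": [101, 101, 102, 103]})

# keep=False flags (or drops) every row of a repeated ID, not just the later copies.
dup_rows = toy[toy.duplicated(subset=["Student ID"], keep=False)]
single_rows = toy.drop_duplicates(subset=["Student ID"], keep=False)

# Every original row lands in exactly one of the two pieces.
assert len(dup_rows) + len(single_rows) == len(toy)
print(dup_rows)
print(single_rows)
```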

In [None]:
att

In [None]:
# Drops all rows that have any duplicates
non_isp = att.drop_duplicates(subset = ['Student ID'], keep= False)
non_isp

## Final Combining of Attendance

In [None]:
# Combining non-duplicate rows with the fixed rows
fixed_attendance = pd.concat([Schoolfixed_absent, non_isp])
fixed_attendance

## Calculations

In this section, calculations are made to get summary data on the schools and ISP students. Enrollment for each school site is calculated, followed by the number of students who have taken ISP, completed ISP, are in progress, and did not complete ISP.
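
A compact sketch of the same kind of summary on invented data. The school names, the frames, and the `count_by_school` helper are all hypothetical; the sketch uses `DataFrame.join` rather than `pd.merge` to line the counts up by school, and schools with no students in a category come out as missing values until `fillna(0)` is applied.

```python
import pandas as pd

# Made-up student lists; each row is one student at a school.
all_students = pd.DataFrame({"School": ["A", "A", "A", "B", "B"]})
completed = pd.DataFrame({"School": ["A", "B"]})
in_progress = pd.DataFrame({"School": ["A"]})
incomplete = pd.DataFrame({"School": ["B"]})

def count_by_school(frame, label):
    """Count rows per school and return a one-column frame named `label`."""
    return frame.groupby("School").size().to_frame(label)

# Left joins keep every school from the enrollment table.
summary = (
    count_by_school(all_students, "Enrollment")
    .join(count_by_school(completed, "Complete ISP"))
    .join(count_by_school(in_progress, "In Progress ISP"))
    .join(count_by_school(incomplete, "Incomplete ISP"))
    .fillna(0)
    .astype(int)
)
summary["Total ISP"] = (
    summary["Complete ISP"] + summary["In Progress ISP"] + summary["Incomplete ISP"]
)
print(summary)
```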

In [None]:
# Calculates enrollment for the school
calculations = fixed_attendance.groupby(['School']).size()
calculations = calculations.to_frame()
calculations = calculations.rename(columns = {0:'Enrollment'}, inplace = False)
calculations

In [None]:
# Calculates completed ISP at each site
complete_isp = isp_completed_students.groupby(['School']).size()
complete_isp = complete_isp.to_frame()
complete_isp = complete_isp.rename(columns = {0:'Complete ISP'}, inplace = False)
complete_isp

In [None]:
# Calculates incomplete ISP at each site
incomplete_isp_final = incomplete_isp.groupby(['School']).size()
incomplete_isp_final = incomplete_isp_final.to_frame()
incomplete_isp_final = incomplete_isp_final.rename(columns = {0:'Incomplete ISP'}, inplace = False)
incomplete_isp_final

In [None]:
# Calculates students in progress for ISP
inprogress_isp_final = inprogress_isp.groupby(['School']).size()
inprogress_isp_final = inprogress_isp_final.to_frame()
inprogress_isp_final = inprogress_isp_final.rename(columns = {0:'In Progress ISP'}, inplace = False)
inprogress_isp_final

In [None]:
# Merge Complete with Enrollment
calculations = pd.merge(calculations, complete_isp, how='left', on='School')
calculations

In [None]:
# Merge In Progress ISP
calculations = pd.merge(calculations, inprogress_isp_final, how='left', on='School')
calculations

In [None]:
# Merge Incomplete ISP
calculations = pd.merge(calculations, incomplete_isp_final, how='left', on='School')
calculations

In [None]:
# Create a total ISP student column and reorder columns
calculations = calculations.fillna(0)
calculations['Total ISP'] = calculations['Complete ISP'] + calculations['In Progress ISP'] + calculations['Incomplete ISP']
calculations = calculations[['Enrollment', 'Total ISP', 'Complete ISP', 'In Progress ISP', 'Incomplete ISP']]
calculations

## Data Checks

Several checks are run on the data at the end to see whether any particular students have a unique situation that needs to be looked into.

The checks are:

1) __More than Two__: Seeing if any students have more than just two rows (three or more rows of data).

2) __Complete Check__: Seeing if any students that have incomplete ISP have some days of completion.

3) __Incomplete Check__: Seeing if any students that have complete ISP have some days of incompletion.

4) __Non ISP Students with Two Rows__: Seeing if any students have double rows that are not ISP.

_If check is blank then that means everything is looking good!!!_
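
For a quick overview before reading each check, a small helper along these lines could report how many rows each one flagged. The `summarize_checks` function and the example frames are hypothetical and not part of the original notebook.

```python
import pandas as pd

def summarize_checks(checks):
    """Return one row per check with the number of rows it flagged."""
    rows = [
        {"Check": name, "Rows Flagged": len(frame)}
        for name, frame in checks.items()
    ]
    return pd.DataFrame(rows)

# Made-up check results: the first is blank, the second flags one student.
example_checks = {
    "More than Two": pd.DataFrame({"Number Rows": []}),
    "Complete Check": pd.DataFrame({"Student ID": [501]}),
}
report = summarize_checks(example_checks)
print(report)
```

In the notebook itself, the dictionary would hold `more_than_two`, `complete_check`, `incomplete_check`, and `no_isp`; a count of zero corresponds to a blank check.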

### More than Two

Checks whether any students have more than two rows. These may need to be fixed manually in the original file, after which the Python program should be rerun.

In [None]:
more_than_two

### Complete Check

This checks whether any students categorized as having incomplete ISP are also credited with days of complete ISP. This might require manually fixing the data in the original file and then rerunning the Python program.

In [None]:
complete_check

### Incomplete Check

This checks whether any students in the complete ISP category have days of incomplete ISP. This might require manually fixing the data in the original file and then rerunning the Python program.

In [None]:
incomplete_check

### Non ISP Students with Two Rows

This lists any students that have two rows but do not have ISP. This could be due to home hospital. The reason for the duplicates should be verified in Aeries. These students may not need to be fixed manually if their enrolled and present values are there, since the code handles those values correctly throughout the program.

In [None]:
no_isp

In [None]:
# Write each dataframe to a different worksheet.
# The context manager saves and closes the file on exit
# (writer.save() was removed in pandas 2.0).
with pd.ExcelWriter(output_final) as writer:
    fixed_attendance.to_excel(writer, sheet_name='Student Attendance')
    calculations.to_excel(writer, sheet_name='ISP Data by School')
    isp_completed_students.to_excel(writer, sheet_name='Students Completed ISP')
    inprogress_isp.to_excel(writer, sheet_name='Students In Progress ISP')
    incomplete_isp.to_excel(writer, sheet_name='Students Incomplete ISP')