# Single Semester Loading and Cleaning Script

A note about shortcuts and code syntax: While I have worked hard to make this script accessible to everyone, there are a few things you can do to work with it more effectively:

-If you are working inside of a cell, and wish to come out of that cell, press the `Esc` key

-Pressing `m` while a cell is selected (but not active) will convert that cell into 'Markdown' (i.e., the language used to display this text). Pressing `y` will convert it back to code

-Highlighting code and pressing `Ctrl` + `/` will comment the code out or back in. Python uses `#` to denote comments, which tells the interpreter to ignore that particular line of code. A few cells in this script give you the opportunity to look at the data. Using the keyboard shortcut above, you can make such a line of code active. You can also simply delete the `#` in front of the code

-If you wish to insert a cell, click outside of the cell above/below the location where you wish to place your cell. Press `a` to insert your cell above that cell, and `b` to place your cell below the current cell. If you wish to delete a cell, click outside of the cell you wish to delete, and click the cut (scissors) icon in the toolbar

-Python allows strings to be entered with either `''` or `""`, but the opening and closing quotes of a given string must match. If you need to add a string somewhere, keep this in mind

-`ods.shape` displays the number of rows and columns, in that order; `ods.info()` prints a summary of non-null counts, number of rows and columns, dtypes, column names, etc. (generally the most useful when looking for a high-level overview); `ods.columns` displays the column names; `ods.head(x)`/`ods.tail(x)` will display *x* rows from the top/bottom (*note that x>50 will still generally result in data being cut off when printed to screen*)
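
As a minimal sketch of the inspection helpers in the last bullet, the cell below builds a small stand-in dataframe and calls each one. The column names and values in `demo` are made up for illustration; they are not taken from the ODS data.

```python
import pandas as pd

# Hypothetical stand-in for `ods`; the real dataframe is loaded later in this script
demo = pd.DataFrame({
    'Subject': ['MATH', 'ENGL', 'BIOL'],
    'Rescheduled': [0.0, 1.0, None],
})

n_rows, n_cols = demo.shape      # shape is a (rows, columns) tuple
col_names = list(demo.columns)   # column names as a plain list
first_two = demo.head(2)         # first 2 rows
last_one = demo.tail(1)          # last row
demo.info()                      # prints dtypes and non-null counts per column
```

In a notebook, only the last expression in a cell is displayed automatically, so wrap earlier ones in `print()` if you want to see several at once.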

In [None]:
import pandas as pd
import glob

In [None]:
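# List the files in the current working directory so you can find your pickled file
# (assumes the pickled file was saved in the same folder as this notebook)
filenames = glob.glob('*')
filenames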

*Enter the name of the pickled file you used in the previous script*

In [None]:
# Enter the name of the pickled file
f = input('Enter the name of the pickled file from "Identifiers Removed" script: ')

In [None]:
# Load in pickled file 
ods = pd.read_pickle(f)
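
For readers unfamiliar with pickled files, here is a hedged sketch of the round trip. It assumes the previous script saved its dataframe with `DataFrame.to_pickle`; the data and the file name `example_semester.pkl` are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data and file name, used only to illustrate the round trip
original = pd.DataFrame({'CRN': [10001, 10002], 'Subject': ['MATH', 'ENGL']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_semester.pkl')
    original.to_pickle(path)          # what the earlier script is assumed to have done
    restored = pd.read_pickle(path)   # what this script does with the name you enter

# A pickle preserves values, column names, and dtypes exactly
same = restored.equals(original)
```

Only load pickled files that you created yourself, since unpickling a file from an untrusted source can execute arbitrary code.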

## Dropping Unnecessary Columns

*'Tech Tags' are irrelevant to ODS test center operations. Since most of the values are likely to be null anyway, we can drop this column from the dataframe*

In [None]:
# Drop 'Tech Tags' from the dataframe
ods.drop(columns=['TechTags'], inplace = True) # We set inplace to 'True' so that the operation is performed in place (and not requiring us to declare a new variable) 

*We can also get rid of: 'Barcode', 'Access to speech-to-text software', 'Access to standing desk', 'Assessments administered in two parts', 'Electronic Reader for Online Exams', 'Extra Time 1.5x Calculation-based exams', 'Extra Time 1.5x Writing-based exams', 'Extra time 2.00x Calculation-based exams', 'Extra time 2.0x Writing-based exams', 'Leniency on spelling and grammar when it is not part of the material being tested', 'Live Reader for exams', 'ODS Proctor', 'Paper version of computerized calculation-based exams', 'Permission to bring food/drinks into testing environment', 'Reduced distraction calculation-based exams', 'Scribe for exams', 'Student may alternate between sitting and standing while testing', 'Student may handwrite exam responses'*

*Most of the data in these columns are null and/or have a relatively small impact (if any) on test center operations.*

***Should you wish to add a column back in, delete its name from the list below. You might also have to alter later sections of this script and other scripts. If you edit the list, please ensure that each column name is enclosed in single quotes ('') and that names are separated by commas (,)***

In [None]:
ods.drop(columns=[
    # 'TechTags' was already dropped in the cell above; listing it again would raise a KeyError
    'Barcode', 'Access to speech-to-text software', 'Access to standing desk', 'Assessments administered in two parts', 'Electronic Reader for Online Exams',
    'Extra Time 1.5x Calculation-based exams', 'Extra Time 1.5x Writing-based exams', 'Extra time 2.00x Calculation-based exams',
    'Extra time 2.0x Writing-based exams', 'Leniency on spelling and grammar when it is not part of the material being tested',
    'Live Reader for exams', 'ODS Proctor', 'Paper version of computerized calculation-based exams', 
    'Permission to bring food/drinks into testing environment', 'Reduced distraction calculation-based exams', 'Scribe for exams', 
    'Student may alternate between sitting and standing while testing', 'Student may handwrite exam responses'
], inplace = True)
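
`drop` raises a `KeyError` when any listed column is absent from the dataframe, for example because it was already dropped in an earlier cell. The sketch below shows one forgiving pattern on a made-up dataframe: report which names are missing, then drop with `errors='ignore'`. The names `demo`, `to_drop`, `missing`, and `trimmed` are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the ODS dataframe
demo = pd.DataFrame({'CRN': [10001], 'Barcode': ['x1'], 'Scribe for exams': [None]})

to_drop = ['Barcode', 'Scribe for exams', 'TechTags']  # 'TechTags' is not in demo

# Report names that are not present so that typos do not go unnoticed
missing = [c for c in to_drop if c not in demo.columns]

# errors='ignore' skips absent columns instead of raising a KeyError
trimmed = demo.drop(columns=to_drop, errors='ignore')
```

Checking `missing` first matters because `errors='ignore'` would otherwise hide a misspelled column name.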

*If you wish to see certain characteristics of the dataframe as it stands right now, highlight the code you wish to run and press `Ctrl` + `/` to uncomment it (note, this is optional)*

In [None]:
# ods.shape
# ods.info()
# ods.head(10)
# ods.tail(10)

Since large-font exams are rare compared to the total number of exams that ODS proctors, and it is the instructor's duty to provide the large-font copy, we can drop these columns as well.

In [None]:
ods.drop(columns=['Colored paper for exams and classroom materials', 'Exams and classroom materials in 18 point font or larger', 
                        'Exams and classroom materials in 24 point font or larger'], inplace=True)

In [None]:
ods.drop(columns=['Tags'], inplace=True)

## Checking Dtypes and Cleaning the Data

***Dtypes are the types of data that pandas can work with. Knowing and assigning the appropriate dtype to a column will help in the analysis of test center data***
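
To make the idea concrete, here is a small sketch on made-up data showing why dtypes matter: numbers and dates stored as text cannot be summed or compared as dates until they are converted. The values in `demo` are hypothetical.

```python
import pandas as pd

# Hypothetical data where everything arrives as text
demo = pd.DataFrame({
    'Total Length': ['60', '90'],               # numbers stored as strings
    'Exam Date': ['2021-03-01', '2021-03-02'],  # dates stored as strings
})

# Convert each column to a dtype that matches its meaning
demo['Total Length'] = demo['Total Length'].astype('float64')
demo['Exam Date'] = pd.to_datetime(demo['Exam Date'])

after = demo.dtypes.astype(str).to_dict()   # dtype name per column
total_minutes = demo['Total Length'].sum()  # arithmetic now works as expected
```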

In [None]:
# Create a new column by concatenating 'Exam Date' with 'Start Time'
ods['Start_Date_and_Time'] = ods['Exam Date'] + ' ' + ods['Start Time']

# Convert to datetime; errors='coerce' turns missing or unparseable values into 'NaT' instead of raising an error
ods['Start_Date_and_Time'] = pd.to_datetime(ods['Start_Date_and_Time'], errors='coerce')

# End date and time
ods['End_Date_and_Time'] = ods['Exam Date'] + ' ' + ods['End Time']
ods['End_Date_and_Time'] = pd.to_datetime(ods['End_Date_and_Time'], errors='coerce')

# Actual start date and time
ods['Actual_Date_and_S_Time'] = ods['Exam Date'] + ' ' + ods['Actual Start Time']
ods['Actual_Date_and_S_Time'] = pd.to_datetime(ods['Actual_Date_and_S_Time'], errors='coerce')

# Actual end date and time
ods['Actual_Date_and_E_Time'] = ods['Exam Date'] + ' ' + ods['Actual End Time']
ods['Actual_Date_and_E_Time'] = pd.to_datetime(ods['Actual_Date_and_E_Time'], errors='coerce')
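
The sketch below illustrates, on made-up data, how a missing time propagates through the concatenation and ends up as `NaT`. It uses `pd.to_datetime` with `errors='coerce'`, which is one common way to do the conversion; the dates and times are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the second exam has no recorded start time
demo = pd.DataFrame({
    'Exam Date': ['2021-03-01', '2021-03-02', '2021-03-03'],
    'Start Time': ['09:00', None, '13:30'],
})

# Concatenating with a missing value yields a missing value for that row
combined = demo['Exam Date'] + ' ' + demo['Start Time']

# errors='coerce' converts anything that cannot be parsed into NaT
demo['Start_Date_and_Time'] = pd.to_datetime(combined, errors='coerce')

n_missing = int(demo['Start_Date_and_Time'].isna().sum())
```

Counting `NaT` values after the conversion is a quick way to confirm that only the rows you expect are missing.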

*Now, we need to drop the original date and time columns that we just replaced*

In [None]:
ods.drop(columns=['Exam Date', 'Start Time', 'End Time', 'Actual Start Time', 'Actual End Time'], inplace=True)

*To keep things together, let's create new columns that will hold the data from 'Total Length' and 'Actual Total Length'*

In [None]:
ods['Alloted_Time'] = ods['Total Length']
ods['Actual_Student_Time'] = ods['Actual Total Length']

# Convert 'Alloted_Time' to float64 so that it matches the dtype of the actual time
ods['Alloted_Time'] = ods['Alloted_Time'].astype('float64')

*And drop the previous columns*

In [None]:
# Drop 'Total Length' and 'Actual Total Length'
ods.drop(columns=['Actual Total Length', 'Total Length'], inplace=True)

**We will also drop 'Reduced Distraction Environment' since all students testing at ODS test in an isolated environment regardless of accommodation**

In [None]:
ods.drop(columns=['Reduced Distraction Environment'], inplace=True)

*We need to handle null values. First, we need to replace all null values in categorical columns with "No."*

In [None]:
# Create variable to store the index of columns that we want to work with
cat_cols = ods.select_dtypes(exclude=['number', 'datetime64']).columns

# Drop the remaining columns that don't take 'yes/no' responses
cat_cols = cat_cols.drop(['Subject', 'LocationName','ProctorLastName', 'First Entered'])

*Now, 'cat_cols' holds all of the column names for which we want to fill null values. We can use 'cat_cols' as an indexer. Note that we will still have nulls in other columns that we'll have to deal with later.*

In [None]:
# Use '.fillna()' to fill in the null values with 'No'
ods[cat_cols] = ods[cat_cols].fillna('No')
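
Here is the same select-then-fill pattern on a made-up dataframe, to show that only the chosen columns are touched. The column values are hypothetical, and `yes_no_cols` plays the role of `cat_cols`.

```python
import pandas as pd

# Hypothetical miniature of the ODS dataframe
demo = pd.DataFrame({
    'Subject': ['MATH', None],
    'Breaks during exams': ['Yes', None],
    'Reader for exams': [None, 'Yes'],
    'Rescheduled': [1.0, None],
})

# Keep the non-numeric, non-datetime columns, then remove those that are not yes/no
yes_no_cols = demo.select_dtypes(exclude=['number', 'datetime64']).columns
yes_no_cols = yes_no_cols.drop(['Subject'])

# Fill nulls only in the selected columns
demo[yes_no_cols] = demo[yes_no_cols].fillna('No')
```

After this runs, 'Subject' and 'Rescheduled' still contain their nulls, which is why they are handled separately.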

*Let's fill the null values in 'Rescheduled' with 0.0*

In [None]:
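# Set null values for 'Rescheduled' to be 0.0
# (assigning the result back is more reliable than calling fillna with inplace=True on a single column)
ods['Rescheduled'] = ods['Rescheduled'].fillna(0.0)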

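***Run the cell below. It drops the 'Unnamed: 0' column if it is present; `errors='ignore'` lets the cell run cleanly when the column is absent, so you can simply move on to the next cell***

In [None]:
ods.drop(columns=['Unnamed: 0'], inplace=True, errors='ignore')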

**Let's handle the null values for 'ProctorLastName'. Set these values to be 'Unspecified'**

In [None]:
ods['ProctorLastName'] = ods['ProctorLastName'].fillna('Unspecified')

*Let's move 'No Show' and 'Exam Completed' to the end of the dataframe's columns*

In [None]:
ods['no_show_label'] = ods['No Show']
ods['exam_completed_label'] = ods['Exam Completed']
ods.drop(columns=['No Show', 'Exam Completed'], inplace=True)

*Rows that indicate the exam was completed but lack a start and end time need to be dropped from the dataframe (they shouldn't exist)*

In [None]:
idx = ods.loc[(ods['Actual_Date_and_S_Time'].isna()) & 
            (ods['exam_completed_label']=='Yes') & (ods['no_show_label']=='No')].index

ods.drop(index=idx, inplace = True)
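
The filter above can be read as three conditions combined with `&`. A minimal sketch on made-up rows shows which rows it flags; counting the flagged rows before dropping them is a cheap sanity check. The data in `demo` is hypothetical.

```python
import pandas as pd

# Hypothetical rows: a normal exam, an inconsistent record, and a no-show
demo = pd.DataFrame({
    'Actual_Date_and_S_Time': pd.to_datetime(['2021-03-01 09:00', None, None]),
    'exam_completed_label': ['Yes', 'Yes', 'No'],
    'no_show_label': ['No', 'No', 'Yes'],
})

# A row is inconsistent when it claims completion but has no actual start time
inconsistent = (
    demo['Actual_Date_and_S_Time'].isna()
    & (demo['exam_completed_label'] == 'Yes')
    & (demo['no_show_label'] == 'No')
)

n_flagged = int(inconsistent.sum())  # how many rows would be dropped
cleaned = demo[~inconsistent]        # keep everything that is not flagged
```

The no-show row survives because a missing start time is expected when the student never arrived.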

***Before we continue, we need to dress things up a bit. We will recast some of the dtypes***

In [None]:
ods[['LocationName']] = ods[['LocationName']].astype('string')

In [None]:
# Convert to datetime; values that cannot be parsed become 'NaT'
ods['First Entered'] = pd.to_datetime(ods['First Entered'], errors='coerce')

*Our next step is to work with the datetime columns, but there is one more thing that I want to do to clean up the dataframe before we continue: rename the columns.*

In [None]:
# Map each original column name to its new name
mapperDict = {
    'CRN': 'crn', 'Subject': 'subject',
    'Course': 'course', 'Section': 'section',
    'ProctorLastName': 'proctorLastName',
    'LocationName': 'locationName', 'First Entered': 'firstEntered', 'File Uploaded': 'fileUploaded',
    'Received As Paper Copy': 'received_as_paper_copy',
    'Rescheduled': 'rescheduled', 'Breaks during exams': 'breaks_during_exams',
    'Extra Time 1.50x': 'extra_time_1.50x', 'Extra Time 2.00x': 'extra_time_2.00x',
    'Make-up exams due to disability': 'make_exams_due_to_disability',
    'Medical alter device': 'medical_alert_device', 'Paper version of computerized exams': 'paper_version_of_computerized_exams',
    'Permission to mark on exam - No scantron': 'noScantronExam', 'Reader for exams': 'readerForExams',
    'Use of a calculator for assessments with a calculation component': 'allowed_calculator_with_calc_component',
    'Use of computer to type written exam responses': 'use_of_computer_to_type_responses', 'Start_Date_and_Time': 'startDateTime',
    'End_Date_and_Time': 'endDateTime', 'Actual_Date_and_S_Time': 'actual_start_date_time', 'Actual_Date_and_E_Time': 'actual_end_date_time',
    'Alloted_Time': 'allotedTime', 'Actual_Student_Time': 'actual_time_taken', 'no_show_label': 'no_show', 'exam_completed_label': 'exam_completed',
}

ods.rename(columns=mapperDict, inplace=True)
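
By default, `rename` silently skips mapper keys that do not match any column, so a misspelled key goes unnoticed. The sketch below, on a made-up dataframe with a hypothetical `mapper`, shows one way to check for unmatched keys and for columns that kept their old names.

```python
import pandas as pd

# Hypothetical dataframe and mapper; 'Section' is not a column in demo
demo = pd.DataFrame({'CRN': [10001], 'Subject': ['MATH'], 'Course': ['101']})
mapper = {'CRN': 'crn', 'Subject': 'subject', 'Section': 'section'}

# Mapper keys that do not match any existing column
unmatched_keys = [k for k in mapper if k not in demo.columns]

renamed = demo.rename(columns=mapper)

# Columns that were left untouched because the mapper has no entry for them
not_renamed = [c for c in renamed.columns if c not in mapper.values()]
```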

*Take this time to look at the dataframe before moving on to the next script. Simply uncomment the line you wish to execute and run the cell. If you don't wish to run anything, simply move on to the next cell*

In [None]:
# Delete the '#' from the line you wish to run
# print(ods.info())
# print(ods.head(25))
# print(ods.tail(25))
# print(ods.shape)
# print(ods.columns)


### ***Next, we need to clean up the datetime columns and start doing some analysis. I am going to create a new workbook to do that***