### *Instructions: Execute each cell by clicking 'Run' in the toolbar (you can also do this by pressing 'Shift' + 'Enter'). Some cells require input from you. In order that the script runs correctly, please make sure that the names of files are correct, in the correct directory, and that the contents of this script are not changed.*

***The csv downloaded from AIM contains confidential information; thus this script is not allowed to be run off campus. The csv must be properly deleted after identifiers are removed and the new data exported to a pickled file***

A note about shortcuts and code syntax: While I have worked hard to make this script accessible to everyone, there are a few things you can do to make working with it more powerful:

-If you are working inside of a cell, and wish to come out of that cell, press the `Esc` key

-Pressing `m` while a cell is selected (but not active) will convert that cell into a 'Markdown' cell (i.e., the language used to display this text). Pressing `y` will convert it back to a code cell

-Highlighting code and pressing `Ctrl` + `/` will comment the code out or back in. Python uses `#` to denote comments, which tells the interpreter to ignore that particular line of code. A few cells in this script give you the opportunity to look at the data. Using the keyboard shortcut above, you can make a commented line of code active. You can also simply delete the `#` in front of the code

-If you wish to insert a cell, click outside of the cell above/below the location where you wish to place your cell. Press `a` to insert your cell above that cell, and `b` to place your cell below the current cell. If you wish to delete a cell, click outside of the cell you wish to delete, and click the cut (scissors) icon in the toolbar

-Python allows strings to be written with either `''` or `""`, but the opening and closing quotes of a string must match. If you need to add a string somewhere, keep this in mind

-`ods.shape` displays the number of rows and columns; `ods.info()` prints a summary of null values, number of rows and columns, dtypes, column names, etc. (generally the most useful when looking for a high-level overview); `ods.columns` displays the column names; `ods.head(x)`/`ods.tail(x)` displays the first/last *x* rows (*note that x>50 will still generally result in data being cut off when printed to screen*)
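
As a quick illustration of the inspection tools in the last bullet, here is a minimal sketch on a small made-up dataframe. The `demo` frame and its values are invented for illustration and are not part of the ODS export.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real 'ods' dataframe
demo = pd.DataFrame({
    'Subject': ['EC', 'MC', 'PY'],
    'Total Length': [75, 90, None],
})

n_rows, n_cols = demo.shape        # shape is a (rows, columns) tuple
column_names = list(demo.columns)  # column labels, in order
first_two = demo.head(2)           # first two rows
last_one = demo.tail(1)            # last row
demo.info()                        # prints dtypes and non-null counts
```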

In [1]:
# By convention, we use the alias 'pd' for pandas
import pandas as pd
import glob
import numpy as np

**Place one or more csv files downloaded from the ODS portal in the current working directory (i.e., the folder this script is running from)**

In [2]:
filenames = glob.glob('*.csv')
filenames

['export (1).csv',
 'export (2).csv',
 'export (3).csv',
 'export (4).csv',
 'export.csv']

In [3]:
list_of_dfs = [pd.read_csv(filename) for filename in filenames]
ods = pd.concat(list_of_dfs, ignore_index=True)
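
To see what `ignore_index=True` does when several exports are stacked, the following sketch may help. It uses two tiny in-memory "files" with made-up contents in place of the real csv exports.

```python
import io
import pandas as pd

# Two tiny in-memory "files" standing in for the exported csvs (made-up data)
fake_files = [
    io.StringIO("Subject,Section\nEC,11\nMC,4\n"),
    io.StringIO("Subject,Section\nPY,1\n"),
]

frames = [pd.read_csv(f) for f in fake_files]

# ignore_index=True discards each file's own 0..n index and renumbers the rows
combined = pd.concat(frames, ignore_index=True)
```

Without `ignore_index=True`, the combined frame would keep each file's original row labels, so labels such as 0 would appear more than once.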

*Run the next cell to look at the names of all the columns*

In [4]:
# Call the 'columns' attribute to look at the column names
ods.columns

Index(['SchoolID', 'StudentName', 'CRN', 'Subject', 'Course', 'Section',
       'ClassTitleComplete', 'Exam Date', 'ProctorLastName', 'InstructorName',
       'InstructorEmail', 'LocationName', 'Start Time', 'End Time',
       'Total Length', 'Scheduled By', 'Actual Start Time', 'Actual End Time',
       'Actual Total Length', 'Exam Completed', 'No Show', 'Tags', 'TechTags',
       'Barcode', 'First Entered', 'File Uploaded', 'Received As Paper Copy',
       'Rescheduled', 'StudentLastName', 'StudentFirstName',
       'Access to speech-to-text software', 'Access to standing desk',
       'Assessments administered in two parts', 'Breaks during exams',
       'Colored paper for exams and classroom materials',
       'Electronic Reader for Online Exams',
       'Exams and classroom materials in 18 point font or larger',
       'Exams and classroom materials in 24 point font or larger',
       'Extra Time 1.50x', 'Extra Time 1.5x Calculation-based exams',
       'Extra Time 1.5x Writing-ba

*We need to remove 'SchoolID', 'StudentName', 'CRN', 'Course', 'ClassTitleComplete', 'InstructorName', 'InstructorEmail', 'Scheduled By', 'StudentLastName', and 'StudentFirstName' from the dataframe, since these contain confidential information.*

**Verify that the column names are listed above, then execute the cell below**

In [5]:
ods.drop(columns=['SchoolID', 'StudentName', 'CRN', 'Course', 'ClassTitleComplete', 'InstructorName', 'InstructorEmail', 'Scheduled By',
        'StudentLastName', 'StudentFirstName'], inplace=True)
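
If you would rather check the column names in code than by eye, one possible approach is sketched below. The `demo` frame and the shortened `confidential` list are made up for illustration; this is not part of the original script.

```python
import pandas as pd

# Made-up frame; only some of the confidential columns are present
demo = pd.DataFrame({'SchoolID': [1], 'StudentName': ['A'], 'Subject': ['EC']})

confidential = ['SchoolID', 'StudentName', 'CRN']

# Columns we expect to drop but that are absent from the frame
missing = [col for col in confidential if col not in demo.columns]

# errors='ignore' skips absent labels instead of raising a KeyError
cleaned = demo.drop(columns=confidential, errors='ignore')
```

A non-empty `missing` list is a hint that the export format has changed and the list of confidential columns should be reviewed.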

**The following columns are largely irrelevant to test center operations, or contain so many null values that they are impractical to work with**

In [6]:
ods.drop(columns=[
    'TechTags', 'Barcode', 'Access to speech-to-text software', 'Access to standing desk', 'Assessments administered in two parts', 'Electronic Reader for Online Exams',
    'Extra Time 1.5x Calculation-based exams', 'Extra Time 1.5x Writing-based exams', 'Extra time 2.00x Calculation-based exams',
    'Extra time 2.0x Writing-based exams', 'Leniency on spelling and grammar when it is not part of the material being tested',
    'Live Reader for exams', 'ODS Proctor', 'Paper version of computerized calculation-based exams', 
    'Permission to bring food/drinks into testing environment', 'Reduced distraction calculation-based exams', 'Scribe for exams', 
    'Student may alternate between sitting and standing while testing', 'Student may handwrite exam responses'
], inplace = True)

*If you wish to see certain characteristics of the dataframe as it stands right now, highlight the code you wish to run and press `Ctrl` + `/` (note, this is optional)*

**Drop rows whose 'End Time' contains the "Not Available" HTML markup**

In [7]:
index = ods.loc[ods['End Time']=='<font class=red"><abbr title="Not Available">N/A</abbr></font>"'].index
ods.drop(index=index, inplace=True)
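
The same kind of row filter can also be written as a boolean mask. The sketch below uses made-up data, with the short string `'N/A'` standing in for the longer HTML placeholder.

```python
import pandas as pd

# Made-up data; 'N/A' stands in for the longer HTML placeholder string
placeholder = 'N/A'
demo = pd.DataFrame({'End Time': ['12:05 PM', placeholder, '03:15 PM']})

# Keep only the rows whose 'End Time' is a real value
kept = demo[demo['End Time'] != placeholder]
```

Note that the surviving rows keep their original index labels, which is also the case when rows are removed with `drop`.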

*Use this opportunity to look at the data by 'uncommenting' the line of code you wish to run (note, refer to the beginning of the script for help)*

In [8]:
# ods.shape
# ods.info()
# ods.head(25)
# ods.tail(25)

(11216, 32)

*Since exams requiring large-size font are rare compared to the total number of exams that ODS proctors, and it is the instructor's duty to provide the large-size font, we can drop these columns as well.*

**Drop additional unneeded columns**

In [11]:
ods.drop(columns=['Colored paper for exams and classroom materials', 'Exams and classroom materials in 18 point font or larger', 
                  'Exams and classroom materials in 24 point font or larger', 
                  'Tags', 'Use of a calculator for assessments with a calculation component',
                  'Paper version of computerized exams', 'Use of computer to type written exam responses', 
                  'Medical alert device'],
                  inplace=True)

## Checking Dtypes and Filling Null Values

**Convert the datetime columns to datetime64 dtype**

In [12]:
# pd.to_datetime parses the strings (a unit-less astype('datetime64') raises a TypeError in pandas 2.x)
ods['Exam Date'] = pd.to_datetime(ods['Exam Date'])
ods['First Entered'] = pd.to_datetime(ods['First Entered'])

# For the time columns, keep only the time of day
for col in ['Start Time', 'End Time', 'Actual Start Time', 'Actual End Time']:
    ods[col] = pd.to_datetime(ods[col]).dt.time
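
For reference, here is a minimal sketch of parsing date and time strings with `pd.to_datetime`. The sample values and the `'%H:%M'` format are assumptions made for illustration; the real export may use a different format.

```python
import datetime
import pandas as pd

# Made-up sample values; the second 'Start Time' is deliberately invalid
demo = pd.DataFrame({
    'Exam Date': ['2020-09-01', '2020-09-08'],
    'Start Time': ['10:50', 'not a time'],
})

exam_date = pd.to_datetime(demo['Exam Date'])

# errors='coerce' turns strings that cannot be parsed into NaT instead of raising
start = pd.to_datetime(demo['Start Time'], format='%H:%M', errors='coerce')

# .dt.time keeps only the time-of-day part of each timestamp
start_time = start.dt.time
```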

**Rename the columns**

In [13]:
mapper = {'First Entered': 'first_entered', 'Exam Date': 'exam_date', 'Start Time': 'start_time', 'End Time': 'end_time', 'Actual Start Time': 'actual_start',
         'Actual End Time': 'actual_end'}
ods.rename(columns=mapper, inplace=True)

**Rename the Total and Actual Length columns and change the dtype of the allotted time to float64**

In [14]:
ods['allotted_time'] = ods['Total Length']
ods['actual_time'] = ods['Actual Total Length']

# Convert 'allotted_time' to float64 so that it matches 'actual_time'
ods['allotted_time'] = ods['allotted_time'].astype('float64')


# Drop Total Length and Actual Length
ods.drop(columns=['Actual Total Length', 'Total Length'], inplace=True)

**We will also drop 'Reduced Distraction Environment' since all students testing at ODS test in an isolated environment regardless of accommodation**

In [15]:
ods.drop(columns=['Reduced Distraction Environment'], inplace=True)

**We need to handle null values. Replace all null values in categorical columns with "No."**

In [16]:
# Create variable to store the index of columns that we want to work with
cat_cols = ods.select_dtypes(exclude=['number', 'datetime64']).columns

#Drop the remaining columns that don't take 'yes/no' responses
cat_cols = cat_cols.drop(['Subject', 'LocationName','ProctorLastName', 'actual_start', 'actual_end'])

# Use '.fillna()' to fill in the null values with 'No'
ods[cat_cols] = ods[cat_cols].fillna('No')
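
The column-selection step can be tried on a small made-up frame, as in the sketch below. The column names mirror the real ones, but the values are invented.

```python
import pandas as pd

# Made-up data with nulls in both yes/no columns and other columns
demo = pd.DataFrame({
    'Subject': ['EC', None],
    'Extra Time 1.50x': ['Yes', None],
    'Reader for exams': [None, 'Yes'],
    'Rescheduled': [1.0, None],
})

# Non-numeric columns, minus the ones that are not yes/no flags
flag_cols = demo.select_dtypes(exclude=['number']).columns.drop(['Subject'])

# Only the yes/no columns are filled; 'Subject' and 'Rescheduled' keep their nulls
demo[flag_cols] = demo[flag_cols].fillna('No')
```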

**Let's fill the null values in 'Rescheduled' with 0.0**

In [17]:
# Set null values for rescheduled to be 0.0
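# Assign the result back; an inplace fillna on a selected column is deprecated in recent pandas
ods['Rescheduled'] = ods['Rescheduled'].fillna(0.0)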

**Let's handle the null values for 'ProctorLastName'. Set these values to 'Unspecified'**

In [18]:
ods['ProctorLastName'] = ods['ProctorLastName'].fillna('Unspecified')

**Fill missing ('NaT') actual start and end times with the scheduled start and end times**

In [19]:
ods['actual_start'] = ods['actual_start'].fillna(ods['start_time'])
ods['actual_end'] = ods['actual_end'].fillna(ods['end_time'])

In [20]:
ods.columns

Index(['Subject', 'Section', 'exam_date', 'ProctorLastName', 'LocationName',
       'start_time', 'end_time', 'actual_start', 'actual_end',
       'Exam Completed', 'No Show', 'first_entered', 'File Uploaded',
       'Received As Paper Copy', 'Rescheduled', 'Breaks during exams',
       'Extra Time 1.50x', 'Extra Time 2.00x',
       'Make-up exams due to disability',
       'Permission to mark on exam - No scantron', 'Reader for exams',
       'allotted_time', 'actual_time'],
      dtype='object')

**Create new boolean columns to store 'exam_cancelled' and 'no_show' values**

In [21]:
ods['exam_cancelled'] = ods['Exam Completed']=='No'
ods['no_show'] = ods['No Show'] == 'Yes'
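
Because these comparisons produce boolean (`True`/`False`) columns, later filters should use them directly rather than compare them with `'Yes'` or `'No'`. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up yes/no data in the same shape as the real columns
demo = pd.DataFrame({
    'Exam Completed': ['Yes', 'No', 'No'],
    'No Show': ['No', 'Yes', 'No'],
})

demo['exam_cancelled'] = demo['Exam Completed'] == 'No'
demo['no_show'] = demo['No Show'] == 'Yes'

# Boolean columns can be used directly as masks; '~' negates a mask
cancelled_not_no_show = demo[demo['exam_cancelled'] & ~demo['no_show']]

# Comparing a boolean column with the string 'Yes' never matches
never_matches = (demo['exam_cancelled'] == 'Yes').any()
```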


**Drop 'Exam Completed' and 'No Show'**

In [22]:
ods.drop(['Exam Completed'], axis=1, inplace=True)
ods.drop(['No Show'], axis=1, inplace=True)

In [24]:
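# 'exam_cancelled' and 'no_show' are boolean columns, so use them directly as masks
# (comparing them with 'Yes'/'No' would never match)
index = ods.loc[ods['actual_start'].isna() &
            ods['exam_cancelled'] & ~ods['no_show']].index

ods.drop(index=index, inplace=True)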

**Create a column that stores the number of days before the exam that a request was submitted**

In [25]:
ods['days_requested_submitted_in_advance'] = (ods['exam_date'].dt.date - ods['first_entered'].dt.date)/pd.Timedelta(days=1)
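
An equivalent way to compute whole calendar days, sketched here on two made-up rows, is to set both timestamps to midnight with `normalize()` before subtracting. This is offered as an alternative formulation, not as the script's own method.

```python
import pandas as pd

# Made-up exam dates and request timestamps
demo = pd.DataFrame({
    'exam_date': pd.to_datetime(['2020-09-01', '2020-09-08']),
    'first_entered': pd.to_datetime(['2020-08-22 11:41', '2020-09-03 20:12']),
})

# normalize() sets the time of day to midnight, so only calendar dates are compared
delta = demo['exam_date'].dt.normalize() - demo['first_entered'].dt.normalize()

# Dividing by a one-day Timedelta converts the difference to a float number of days
days_in_advance = delta / pd.Timedelta(days=1)
```

Dropping the time of day matters here: a request entered at 20:12 five calendar days before the exam would otherwise count as fewer than five days.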

**Before we continue, we need to dress things up a bit. Recast the dtype for LocationName to string ('first_entered' was already converted to datetime above)**

In [26]:
ods[['LocationName']] = ods[['LocationName']].astype('string')

# ods['First Entered'] = ods['First Entered'].astype('datetime64', errors='ignore')

**Change the names of the columns to something more concise**

In [27]:
# 'Course' was dropped and 'First Entered' was already renamed, so neither needs an entry here
mapperDict = {'Subject':'subject', 'Section':'section', 'ProctorLastName':'proctor',
                      'LocationName':'room_number', 'File Uploaded':'fileUploaded',
                      'Received As Paper Copy': 'received_as_paper_copy',
                      'Rescheduled':'rescheduled', 'Breaks during exams': 'breaks_during_exams',
                      'Extra Time 1.50x':'extra_time_1.50x', 'Extra Time 2.00x': 'extra_time_2.00x',
                      'Make-up exams due to disability':'makeup_accommodation',
                      'Permission to mark on exam - No scantron':'noScantronExam', 'Reader for exams': 'readerForExams'}

ods.rename(columns = mapperDict, inplace = True)

**Change nulls in 'room_number' to 'Not Specified'**

In [28]:
ods['room_number'] = ods['room_number'].fillna('Not Specified')

In [29]:
# ods['actual_time'].fillna('', inplace=True)

In [30]:
ods

| | subject | section | exam_date | proctor | room_number | start_time | end_time | actual_start | actual_end | first_entered | ... | extra_time_1.50x | extra_time_2.00x | makeup_accommodation | noScantronExam | readerForExams | allotted_time | actual_time | exam_cancelled | no_show | days_requested_submitted_in_advance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EC | 11 | 2020-09-01 | Vanslambrouck | 05 | 10:50:00 | 12:05:00 | 10:51:00 | 12:22:00 | 2020-08-22 11:41:00 | ... | Yes | No | No | No | No | 75.0 | 91.0 | False | False | 10.0 |
| 1 | EC | 11 | 2020-09-01 | Vanslambrouck | 02 | 14:00:00 | 15:40:00 | 14:00:00 | 14:39:00 | 2020-08-27 11:42:00 | ... | No | Yes | No | No | No | 100.0 | 39.0 | False | False | 5.0 |
| 2 | EC | 11 | 2020-09-01 | Vanslambrouck | 01 | 14:00:00 | 15:15:00 | 14:00:00 | 15:15:00 | 2020-08-20 16:54:00 | ... | Yes | No | No | No | No | 75.0 |  | False | True | 12.0 |
| 3 | MC | 4 | 2020-09-08 | Bulls | 01 | 08:00:00 | 09:30:00 | 08:09:00 | 08:50:00 | 2020-09-03 12:58:00 | ... | Yes | No | No | No | No | 90.0 | 41.0 | False | False | 5.0 |
| 4 | MC | 4 | 2020-09-08 | Bulls | 02 | 08:00:00 | 10:00:00 | 08:10:00 | 09:21:00 | 2020-09-03 20:12:00 | ... | No | Yes | No | No | No | 120.0 | 71.0 | False | False | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11235 | PY | 1 | 2021-04-30 | Bulls | 13 | 11:30:00 | 15:15:00 | 11:32:00 | 12:07:00 | 2021-04-19 17:42:00 | ... | Yes | No | No | No | No | 225.0 | 35.0 | False | False | 11.0 |
| 11236 | PY | 1 | 2021-04-30 | Bulls | 10 | 11:30:00 | 15:15:00 | 11:29:00 | 12:08:00 | 2021-04-19 01:16:00 | ... | Yes | No | No | No | No | 225.0 | 39.0 | False | False | 11.0 |
| 11237 | PY | 1 | 2021-04-30 | Unspecified | 01 | 11:30:00 | 15:15:00 | 11:30:00 | 15:15:00 | 2021-01-27 18:21:00 | ... | Yes | No | No | No | No | 225.0 |  | True | False | 93.0 |
| 11238 | CE | 1 | 2021-04-30 | Bulls | 11 | 11:35:00 | 15:20:00 | 11:34:00 | 13:50:00 | 2021-04-26 13:11:00 | ... | Yes | No | No | No | No | 225.0 | 136.0 | False | False | 4.0 |


## Create a New DataFrame to Store Final Exam Values

In [31]:
# ods[(ods['exam_date'] >= '04-26-2021') & (ods['exam_date'] <= '04-30-2021')]
# 
# ods[(ods['exam_date'] >= '12-09-2019') & (ods['exam_date'] <= '12-13-2019')
#                      | (ods['exam_date'] == '11/22/2019') & (ods['subject'] == 'GBA')]

*We do not have finals data for the Spring 20 and Fall 20 semesters due to the outbreak of COVID-19; finals were taken online for those semesters*

In [32]:
# summer_19_finals = ods[(ods['exam_date'] >= '08-01-2019') & (ods['exam_date'] <= '08-02-2019')]
# fall_19_finals = ods[(ods['exam_date'] >= '12-09-2019') & (ods['exam_date'] <= '12-13-2019')]
# spring_21_finals = ods[(ods['exam_date'] >= '04-26-2021') & (ods['exam_date'] <= '04-30-2021')]

In [33]:
# index = ods[(ods['exam_date'] >= '08-01-2019') & (ods['exam_date'] <= '08-02-2019')|
#              (ods['exam_date'] >= '12-09-2019') & (ods['exam_date'] <= '12-13-2019')|
#              (ods['exam_date'] >= '04-26-2021') & (ods['exam_date'] <= '04-30-2021')].index

# ods.drop(index=index, inplace=True)

In [34]:
# fall_19_finals.info()

In [35]:
# odsFinals = pd.concat([summer_19_finals, fall_19_finals, spring_21_finals], ignore_index=True)

**Name the exported file (Do not add the extension)**

In [36]:
reg_file=input("Enter the name of the file you wish to use for regular semester (Don't add the extension): ")
# finals_file = input("Enter the name of the file you wish to use for final exams (Don't add the extension): ")

Enter the name of the file you wish to use for regular semester (Don't add the extension): ods20210713


In [37]:
ods.to_pickle(f'{reg_file}.pkl')
# odsFinals.to_pickle(f'{finals_file}.pkl')
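
To confirm that an exported pickle can be read back unchanged, a round trip such as the one sketched below can be used. The `demo` frame and the temporary file name are made up; the sketch writes to a temporary directory so that nothing is left on disk.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame standing in for the cleaned 'ods' dataframe
demo = pd.DataFrame({'subject': ['EC', 'MC'], 'allotted_time': [75.0, 90.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'demo.pkl'
    demo.to_pickle(path)             # write the frame to disk
    restored = pd.read_pickle(path)  # read it back into a new frame

# equals() checks that values and dtypes match
round_trip_ok = restored.equals(demo)
```

Unlike a csv, a pickled file preserves dtypes such as datetime64 and boolean, so the cleaned data does not need to be converted again when it is reloaded.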