## Programming Assignment #3

**Hogwarts Data Cleaning**

100 points possible.

This assignment asks you to perform specific data cleaning tasks.

# The Setting -- The Return to Hogwarts

Being a wizard can be dangerous. Being a wizard in training can be even more dangerous. The Hogwarts school nurse is a very busy person and records their activity in a log.

You are asked to review a sample of the logs from the nurse's office of Hogwarts and to **clean** these logs for analysis later on.  Please perform these parts **in order** as earlier steps may impact later steps.

Note: in class, we saw that there were a wide variety of issues with the data file.  This assignment asks you to programmatically clean data quality issues on **a specific subset of the issues** identified in class. You do not have to resolve all issues in the file, only those described below.

# Part 0 -- Submission Details


(10 points) Please enter your name and the date below. Submit your answers as a completed notebook by the deadline posted on Canvas.  Late submissions will not get credit for this section.

Name: Duong Hoang

Date: 10/13/2022


# Part 1 -- Create New Test Cases

(10 points) Review the original data and the data quality tasks below.  Design 10 test records (rows of patient visits) to insert into the original data.  You can manually develop these by modifying the original file or use Python in some capacity to generate them. If you modify the original file, please submit your modified file with your submission.

In plain English, explain how your test records will help test the data quality tasks. What makes them good test cases?

**Answer: Each test case contains some kind of bad data intended to trigger the functions below. All of the missing values are intentional, to test whether the functions modify the data as expected. Most of the test cases are based on existing records, in the hope that the functions will correct their bad data.**

In [1]:
# insert code (if applicable)
import pandas as pd
import numpy as np

new_records = [
    ['015685', 'Harry', 'Potter', '9870', '10-17-1994', '0.25 hours', '174', '57', '15.24', '1'],
    ['051627', 'Albus', 'Dumbledore', '9871', '10-17-1994', '20 minutes', '185', '73', '9.52', '2'],
    ['051672', 'Albus', 'dumbledore', '9872', '10-29-1994', '8 minute', '185', '73', '0', '0'],
    ['', 'Harry', 'Potter', '9873', '11-02-1994', '23 minutes', '174', '57', '10', ''],
    ['0145', 'hermione', 'granger', '9874', '11-04-1994', '1 hours', '164', '53', '30.02', '5'],
    ['07619', 'rOn', 'Weasley', '9875', '11-04-1994', '5 minute', '180', '60', '3.99', '1'],
    ['01234', 'Ron', 'Weasley', '9876', '', '20 minutes', '180', '60', '16.23', '4'],
    ['', 'Dobby', '', '9877', '1994-11-10', '0.75 hours', '106', '25', '6.98', '2'],
    ['037493', 'Rubeus', 'Hagrid', '9878', '12-12-1994', '12 minutes', '259', '132', '4.23', '1'],
    ['053422', 'Serverus', 'Snape', '9879', '1994-12-32', '17 minute', '185', '82', '0', '0']]

# Part 2 -- Create

(5 points) Load the original data and your new test cases into the same Python structure (of your choice).



In [2]:
# insert code here
# read data from file
nurse_data = pd.read_csv('nurse-log.csv', dtype=str)
# insert new data to the same column set
new_data = pd.DataFrame(new_records, columns=nurse_data.columns).replace(r'^\s*$', np.nan, regex=True)
nurse_data = pd.concat([nurse_data, new_data], ignore_index=True, sort=False)
# convert height, weight, charge, and number of supplies used to corresponding numeric data
nurse_data['height(cm)'] = pd.to_numeric(nurse_data['height(cm)'], downcast='unsigned', errors='coerce')
nurse_data['weight(kg)'] = pd.to_numeric(nurse_data['weight(kg)'], downcast='unsigned', errors='coerce')
nurse_data['charge'] = pd.to_numeric(nurse_data['charge'], errors='coerce')
nurse_data['supplies_used'] = pd.to_numeric(nurse_data['supplies_used'], downcast='unsigned', errors='coerce')

# Part 3 -- Missing Data for Time Spent

(5 points) Drop all rows with missing values for the time_spent column.  Print your data to verify your changes.


In [3]:
# insert code here
# drop all rows where time_spent entries are na 
nurse_data.dropna(subset=['time_spent'], inplace=True)
nurse_data.reset_index(drop=True, inplace=True)
# display dataset after dropping
print(nurse_data)

   medical_record_number first_name   last_name visit_id        date  \
0                 015685      Harry      Potter     8219  06-05-1994   
1                  07619        Ron     Weasley     7512  01-15-1994   
2                 014593   Hermione     Granger     5896  01-25-1994   
3                 015685      Harry      Potter     1552  1994-02-15   
4                 015685      Harry      Potter     1202  05-19-1994   
5                  08954      Dobby         NaN     1205  03-12-1994   
6                  07619        Ron     Weasley     6895  04-05-1994   
7                 015689      Harry      Potter     6854  10-11-1994   
8                  07619        Ron     Weasley     1265  07-08-1994   
9                    NaN   Serverus       Snape     5454  09-12-1994   
10                015685        Ron     Weasley     2369  09-31-1994   
11                015685      Harry      Potter     9852  10-16-1994   
12                 07619        Ron     Weasley     7512  01-15-

# Part 4 -- Range checking for Height

(10 points) Drop all rows with values larger than 250 for height (slightly larger than the world record for tallest person). Print your data to verify your changes.
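
The range check can be sketched on a toy frame. The names and heights below are made up for illustration and are not taken from `nurse-log.csv`. The point is the boundary behavior: a height of exactly 250 is kept, and a missing height is also kept because a comparison with NaN evaluates to False.

```python
import numpy as np
import pandas as pd

# Toy heights in cm (made up for illustration); one is missing.
df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'height(cm)': [174.0, 259.0, np.nan, 250.0],
})

# NaN > 250 is False, so missing heights survive; so does exactly 250.
too_tall = df['height(cm)'] > 250
kept = df[~too_tall].reset_index(drop=True)
print(kept)
```

Whether rows with a missing height should also be dropped is a separate decision that this task does not ask for.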


In [4]:
# insert code here
# get indices of rows where height entries are greater than 250
outlier_height_ind = nurse_data[(nurse_data['height(cm)'] > 250)].index
# remove those rows from dataset
nurse_data.drop(outlier_height_ind, inplace=True)
nurse_data.reset_index(drop=True, inplace=True)
# display dataset after dropping
print(nurse_data)

   medical_record_number first_name   last_name visit_id        date  \
0                 015685      Harry      Potter     8219  06-05-1994   
1                  07619        Ron     Weasley     7512  01-15-1994   
2                 014593   Hermione     Granger     5896  01-25-1994   
3                 015685      Harry      Potter     1552  1994-02-15   
4                 015685      Harry      Potter     1202  05-19-1994   
5                  08954      Dobby         NaN     1205  03-12-1994   
6                  07619        Ron     Weasley     6895  04-05-1994   
7                 015689      Harry      Potter     6854  10-11-1994   
8                  07619        Ron     Weasley     1265  07-08-1994   
9                    NaN   Serverus       Snape     5454  09-12-1994   
10                015685        Ron     Weasley     2369  09-31-1994   
11                015685      Harry      Potter     9852  10-16-1994   
12                 07619        Ron     Weasley     7512  01-15-

# Part 5 -- Missing Data for Supplies Used

(10 points)  For the supplies used column, replace (1) all missing values, (2) all negative values, and (3) all values over 100 (exclusive). Replace with the **mean** value for supplies used; restrict your mean calculation using non-missing, non-negative values from the *supplies_used* column that are less than or equal to 100. Print your data to verify your changes.
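
As a worked example of this rule, here is a minimal sketch on a toy Series. The values are made up for illustration, and `is_valid`, `valid_mean`, and `cleaned` are hypothetical names. It assumes that "valid" means non-missing and between 0 and 100 inclusive.

```python
import numpy as np
import pandas as pd

# Toy values (made up): one missing, one negative, one over 100.
supplies = pd.Series([2.0, np.nan, -3.0, 150.0, 4.0, 100.0])

# between() is inclusive on both ends and is False for NaN.
is_valid = supplies.between(0, 100)

# Mean over valid values only: (2 + 4 + 100) / 3.
valid_mean = supplies[is_valid].mean()

# Keep valid values and replace everything else with that mean.
cleaned = supplies.where(is_valid, valid_mean)
print(cleaned.tolist())
```

Computing the mean before replacing anything matters: if the invalid values were included, 150 and -3 would shift the replacement value.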


In [5]:
# insert code here
# get indices of all rows where supplies used entries are either na, negative, or greater than 100
bad_supplies_used_ind = nurse_data[nurse_data['supplies_used'].isna() | (nurse_data['supplies_used'] < 0) | (nurse_data['supplies_used'] > 100)].index
# replace all those bad entries with na values so they are excluded from the mean
nurse_data.loc[bad_supplies_used_ind, 'supplies_used'] = np.nan
# calculate the average supplies used excluding na entries
avg_supplies_used = nurse_data['supplies_used'].mean(skipna=True)
# replace all na entries in column with the average
nurse_data['supplies_used'] = nurse_data['supplies_used'].fillna(avg_supplies_used)
# display dataset after bad data replacement
print(nurse_data) 

   medical_record_number first_name   last_name visit_id        date  \
0                 015685      Harry      Potter     8219  06-05-1994   
1                  07619        Ron     Weasley     7512  01-15-1994   
2                 014593   Hermione     Granger     5896  01-25-1994   
3                 015685      Harry      Potter     1552  1994-02-15   
4                 015685      Harry      Potter     1202  05-19-1994   
5                  08954      Dobby         NaN     1205  03-12-1994   
6                  07619        Ron     Weasley     6895  04-05-1994   
7                 015689      Harry      Potter     6854  10-11-1994   
8                  07619        Ron     Weasley     1265  07-08-1994   
9                    NaN   Serverus       Snape     5454  09-12-1994   
10                015685        Ron     Weasley     2369  09-31-1994   
11                015685      Harry      Potter     9852  10-16-1994   
12                 07619        Ron     Weasley     7512  01-15-

# Part 6 -- Normalize Time Spent

(15 points) Add two columns to your data structure by splitting the *time_spent* column. Split this into two columns named *time_spent* and *time_spent_unit*, where the number value is stored into *time_spent* and the description of the unit (minutes, hours, etc) is stored in *time_spent_unit*.  

Furthermore, convert any values with hours as units into minutes as units and correct any typos of minute to minutes. Print your data to verify your changes.
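
One possible vectorized sketch of this split and normalization, on toy values in the formats described above. The variable names are hypothetical, and the regular expression assumes each entry is a number followed by a single unit word.

```python
import pandas as pd

# Toy values (made up for illustration).
raw = pd.Series(['0.25 hours', '20 minutes', '8 minute', '1 hours'])

# Capture the leading number and the trailing unit word.
parts = raw.str.extract(r'^\s*(?P<value>[\d.]+)\s*(?P<unit>[A-Za-z]+)\s*$')
value = pd.to_numeric(parts['value'], errors='coerce')
unit = parts['unit'].str.lower()

# Multiply hour values by 60; leave all other values unchanged.
is_hours = unit.isin(['hour', 'hours'])
value = value.where(~is_hours, value * 60)

# Map every recognized spelling onto the single unit 'minutes'.
unit = unit.replace({'hour': 'minutes', 'hours': 'minutes', 'minute': 'minutes'})

normalized = pd.DataFrame({'time_spent': value, 'time_spent_unit': unit})
print(normalized)
```

Entries that do not match the pattern come out as NaN in both columns, which makes them easy to find afterwards.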



In [6]:
# insert code here
# split time_spent number values and units into 2-column DataFrame
split_time_spent = nurse_data['time_spent'].str.split(' ', n=1, expand=True)
# store number value in time_spent column
nurse_data['time_spent'] = pd.to_numeric(split_time_spent[0])
# insert unit column next to time_spent column
nurse_data.insert(loc=nurse_data.columns.get_loc('time_spent')+1, column='time_spent_unit', value=split_time_spent[1])
# check data quality
for row_ind in range(nurse_data.shape[0]):
    # convert hour value to minute value
    if nurse_data.at[row_ind, 'time_spent_unit'] == 'hours':
        nurse_data.at[row_ind, 'time_spent'] *= 60
        nurse_data.at[row_ind, 'time_spent_unit'] = 'minutes'
    # correct minutes typo
    elif nurse_data.at[row_ind, 'time_spent_unit'] == 'minute':
        nurse_data.at[row_ind, 'time_spent_unit'] = 'minutes'
# display dataset after normalization
print(nurse_data)

   medical_record_number first_name   last_name visit_id        date  \
0                 015685      Harry      Potter     8219  06-05-1994   
1                  07619        Ron     Weasley     7512  01-15-1994   
2                 014593   Hermione     Granger     5896  01-25-1994   
3                 015685      Harry      Potter     1552  1994-02-15   
4                 015685      Harry      Potter     1202  05-19-1994   
5                  08954      Dobby         NaN     1205  03-12-1994   
6                  07619        Ron     Weasley     6895  04-05-1994   
7                 015689      Harry      Potter     6854  10-11-1994   
8                  07619        Ron     Weasley     1265  07-08-1994   
9                    NaN   Serverus       Snape     5454  09-12-1994   
10                015685        Ron     Weasley     2369  09-31-1994   
11                015685      Harry      Potter     9852  10-16-1994   
12                 07619        Ron     Weasley     7512  01-15-

# Part 7 -- Replace Bad Dates

(5 points) Replace any bad dates (missing, impossible dates, poorly formatted, etc.) with a date representing January 1st, 1994. Print your data to verify your changes.
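
A minimal sketch of this replacement on toy values, assuming that `mm-dd-yyyy` is the only accepted layout, so that an ISO-style date such as `1994-02-15` counts as poorly formatted. The sample strings are made up for illustration.

```python
import pandas as pd

# Toy values: valid, wrong layout, impossible day, missing.
dates = pd.Series(['06-05-1994', '1994-02-15', '09-31-1994', None])

# Anything that does not parse as mm-dd-yyyy becomes NaT.
parsed = pd.to_datetime(dates, format='%m-%d-%Y', errors='coerce')

# Fill unparseable entries with January 1st, 1994.
cleaned = parsed.fillna(pd.Timestamp('1994-01-01'))
print(cleaned.dt.strftime('%m-%d-%Y').tolist())
```

Keeping the column as a datetime type, rather than converting back to strings, makes later month-level grouping less error-prone.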


In [7]:
# insert code here
# set all non mm-dd-yyyy entries to NaT
nurse_data['date'] = pd.to_datetime(nurse_data['date'], format='%m-%d-%Y', errors='coerce').dt.strftime('%m-%d-%Y')
# replace all na entries with '01-01-1994'
nurse_data['date'] = nurse_data['date'].fillna(value='01-01-1994')
# display dataset after replacement
print(nurse_data)

   medical_record_number first_name   last_name visit_id        date  \
0                 015685      Harry      Potter     8219  06-05-1994   
1                  07619        Ron     Weasley     7512  01-15-1994   
2                 014593   Hermione     Granger     5896  01-25-1994   
3                 015685      Harry      Potter     1552  01-01-1994   
4                 015685      Harry      Potter     1202  05-19-1994   
5                  08954      Dobby         NaN     1205  03-12-1994   
6                  07619        Ron     Weasley     6895  04-05-1994   
7                 015689      Harry      Potter     6854  10-11-1994   
8                  07619        Ron     Weasley     1265  07-08-1994   
9                    NaN   Serverus       Snape     5454  09-12-1994   
10                015685        Ron     Weasley     2369  01-01-1994   
11                015685      Harry      Potter     9852  10-16-1994   
12                 07619        Ron     Weasley     7512  01-15-

# Part 8 -- Consistency of IDs

(10 points) Replace any inconsistent medical record numbers with the most commonly occurring medical record number for each first/last name combination (ignoring case). For example, 015689 was inconsistent with Potter's other IDs and would become 015685.
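
The idea can be sketched with a grouped `transform` on a toy frame. The records are made up for illustration, and `most_common` is a hypothetical helper. When two IDs are tied, this sketch takes the first value returned by `mode()`, which is one tie-breaking choice among several.

```python
import pandas as pd

# Toy records: one Potter ID is off by a digit, and name case varies.
df = pd.DataFrame({
    'medical_record_number': ['015685', '015685', '015689', '07619', '07619'],
    'first_name': ['Harry', 'harry', 'Harry', 'Ron', 'rOn'],
    'last_name': ['Potter', 'Potter', 'potter', 'Weasley', 'Weasley'],
})

# Case-insensitive grouping keys.
keys = [df['first_name'].str.lower(), df['last_name'].str.lower()]


def most_common(ids: pd.Series):
    """Return the most frequent non-missing ID, or None if the group has none."""
    modes = ids.mode(dropna=True)
    return modes.iloc[0] if not modes.empty else None


# Broadcast each group's most common ID back onto every row of the group.
df['medical_record_number'] = (
    df.groupby(keys)['medical_record_number'].transform(most_common)
)
print(df['medical_record_number'].tolist())
```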

In [8]:
# insert code here
# replace NaN fields in names to empty strings to use as group keys
nurse_data['first_name'] = nurse_data['first_name'].fillna('')
nurse_data['last_name'] = nurse_data['last_name'].fillna('')
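# get unique patient records with their corresponding most common IDs
# (mode() can return several values on a tie or none for an all-NaN group, so take the first value or fall back to NaN)
unique_patients = nurse_data.groupby([nurse_data['first_name'].str.lower(), nurse_data['last_name'].str.lower()], dropna=False)['medical_record_number'].agg(lambda x: x.mode(dropna=True).iloc[0] if x.notna().any() else np.nan)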
# for each record, replace ID with their most common one found above
for row_ind in range(nurse_data.shape[0]):
    # get keys which are lowercase first and last name
    first_name = (nurse_data.at[row_ind, 'first_name'])
    if isinstance(first_name, str): first_name = first_name.lower()
    last_name = (nurse_data.at[row_ind, 'last_name'])
    if isinstance(last_name, str): last_name = last_name.lower()
    # get corresponding common ID
    common_id = unique_patients[first_name][last_name]
    # if current ID doesn't match, replace it with their common ID
    if nurse_data.at[row_ind, 'medical_record_number'] != common_id:
        nurse_data.at[row_ind, 'medical_record_number'] = common_id
    

# Part 9 -- Calculate Aggregates, Part 1

(5 points) Use your cleaned data to calculate the mean time spent (in minutes) for all records. Print this value.



In [9]:
# insert code here
# calculate mean time spent of all records in minutes
mean_time_spent = nurse_data['time_spent'].mean()
# output mean time spent value
print('Mean Time Spent is', mean_time_spent, 'minutes')

Mean Time Spent is 17.586206896551722 minutes


# Part 10 -- Calculate Aggregates, Part 2

(5 points) Use your cleaned data to find the month in 1994 with the largest amount of time spent logged by the nurse. Print this value.  

Leave as a comment any lingering data quality concerns you might have in reporting aggregate monthly values back to Hogwarts administration.
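
A minimal sketch of the monthly aggregate on toy data. The records are made up for illustration and assume dates already cleaned to `mm-dd-yyyy`. Taking the maximum of each column separately would pair the largest total with the largest month label, so the sketch uses `idxmax` to keep the month tied to its own total.

```python
import pandas as pd

# Toy cleaned records.
df = pd.DataFrame({
    'date': ['01-15-1994', '01-25-1994', '11-04-1994', '12-12-1994'],
    'time_spent': [30.0, 45.0, 60.0, 10.0],
})

# Total minutes per month, keyed by month number.
month = pd.to_datetime(df['date'], format='%m-%d-%Y').dt.month
minutes_per_month = df.groupby(month)['time_spent'].sum()

# idxmax returns the index label (the month) of the largest total.
busiest_month = minutes_per_month.idxmax()
print(busiest_month, minutes_per_month.loc[busiest_month])
```

Here January wins with 75 minutes even though December is the largest month number.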

In [10]:
# insert code here
# get total time spent of each month (extracting the first part of date format)
time_spent_per_month = nurse_data.groupby([nurse_data['date'].str[:2]])['time_spent'].sum().reset_index()
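# find the row (month and its total) with the most time spent; idxmax keeps the month paired with its own total
greatest_time_spent_per_month = time_spent_per_month.loc[time_spent_per_month['time_spent'].idxmax()]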
# display result
print("The largest amount of time spent is", greatest_time_spent_per_month['time_spent'], "minutes during the month", greatest_time_spent_per_month['date'])

# lingering concern: bad supplies_used values were replaced with the mean, which is a float, while supplies used should be an integer count

The largest amount of time spent is 182.0 minutes during the month 11


# Part 11 -- Documentation and Correctness
(10 points) Please document your code with human-readable messages explaining what the code is doing; at a minimum, every function and control structure should be documented.  If your response is a 1-liner, explain how it works.

Additionally, please error check your code; partial credit will be given to answers that do not fully address the requirements. For example, if it says write a function, please make sure your code provides a function.

Please make sure your submission has everything completed.

**If you modified the original data file, please submit your updated file too.** If you used Python to generate test cases and simply running your code loads the new test cases, you do not need to upload the original data file.