## Programming Assignment #3

**Hogwarts Data Cleaning**

100 points possible.

This assignment asks you to perform specific data cleaning tasks.

# The Setting -- The Return to Hogwarts

Being a wizard can be dangerous. Being a wizard in training can be even more dangerous. The Hogwarts school nurse is a very busy person and records their activity in a log.

You are asked to review a sample of the logs from the nurse's office of Hogwarts and to **clean** these logs for analysis later on.  Please perform these parts **in order** as earlier steps may impact later steps.

Note: in class, we saw that there were a wide variety of issues with the data file.  This assignment asks you to programmatically clean data quality issues on **a specific subset of the of the issues** identified in class. You do not have to resolve all issues in the file, only those described below.

# Part 0 -- Submission Details


(10 points) Please enter your name and the date below. Submit your answers as a completed notebook by the deadline posted on Canvas.  Late submissions will not get credit for this section.

Name: Duong Hoang

Date: 10/13/2022


# Part 1 -- Create New Test Cases

(10 points) Review the original data and the data quality tasks below.  Design 10 test records (rows of patient visits) to insert into the original data.  You can manually develop these by modifying the original file or use Python in some capacity to generate them. If you modify the original file, please submit your modified file with your submission.

In plain english, explain how your test records will help test the data quality tasks. What makes them good test cases?

**Answer: add discussion here.**

In [23]:
# insert code (if applicable)
import pandas as pd
import numpy as np
#new_records = pd.DataFrame([[12340, 'Sirius', 'Black', ]])

#Part 2 -- Create

(5 points) Load the original data and your new test cases into the same Python structure (of your choice).



In [24]:
# insert code here
nurse_data = pd.read_csv('nurse-log.csv')
print(nurse_data.dtypes)

medical_record_number    float64
first_name                object
last_name                 object
visit_id                   int64
date                      object
time_spent                object
height(cm)               float64
weight(kg)                 int64
charge                   float64
supplies_used            float64
dtype: object


# Part 3 -- Missing Data for Time Spent

(5 points) Drop all rows with missing values for the time_spent column.  Print your data to verify your changes.


In [25]:
# insert code here
# drop all rows where time_spent entries are na 
nurse_data.dropna(subset=['time_spent'], inplace=True)
# display dataset after dropping
print(nurse_data)

    medical_record_number first_name   last_name  visit_id        date  \
0                 15685.0      Harry      Potter      8219  06-05-1994   
1                  7619.0        Ron     Weasley      7512  01-15-1994   
2                 14593.0   Hermione     Granger      5896  01-25-1994   
3                 15685.0      Harry      Potter      1552  1994-02-15   
4                 15685.0      Harry      Potter      1202  05-19-1994   
5                  8954.0      Dobby         NaN      1205  03-12-1994   
6                  7619.0        Ron     Weasley      6895  04-05-1994   
7                 15689.0      Harry      Potter      6854  10-11-1994   
8                  7619.0        Ron     Weasley      1265  07-08-1994   
9                     NaN   Serverus       Snape      5454  09-12-1994   
10                15685.0        Ron     Weasley      2369  09-31-1994   
11                15685.0      Harry      Potter      9852  10-16-1994   
12                 7619.0        Ron  

# Part 4 -- Range checking for Height

(10 points) Drop all rows with values larger than 250 for height (slightly larger than the world record for tallest person). Print your data to verify your changes.


In [26]:
# insert code here
# get indices of rows where height entries are greater than 250
outlier_height_ind = nurse_data[(nurse_data['height(cm)'] > 250)].index
# remove those rows from dataset
nurse_data.drop(outlier_height_ind, inplace=True)
# display dataset after dropping
print(nurse_data)

    medical_record_number first_name   last_name  visit_id        date  \
0                 15685.0      Harry      Potter      8219  06-05-1994   
1                  7619.0        Ron     Weasley      7512  01-15-1994   
2                 14593.0   Hermione     Granger      5896  01-25-1994   
3                 15685.0      Harry      Potter      1552  1994-02-15   
4                 15685.0      Harry      Potter      1202  05-19-1994   
5                  8954.0      Dobby         NaN      1205  03-12-1994   
6                  7619.0        Ron     Weasley      6895  04-05-1994   
7                 15689.0      Harry      Potter      6854  10-11-1994   
8                  7619.0        Ron     Weasley      1265  07-08-1994   
9                     NaN   Serverus       Snape      5454  09-12-1994   
10                15685.0        Ron     Weasley      2369  09-31-1994   
11                15685.0      Harry      Potter      9852  10-16-1994   
12                 7619.0        Ron  

# Part 5 -- Missing Data for Supplies Used

(10 points)  For the supplies used column, replace (1) all missing values, (2) all negative values, and (3) all values over 100 (exclusive). Replace with the **mean** value for supplies used; restrict your mean calculation using non-missing, non-negative values from the *supplies_used* column that are less than or equal to 100. Print your data to verify your changes.


In [27]:
# insert code here
# get indices of rows where supplies used entries are either na, negative, or greater than 100
bad_supplies_used_ind = nurse_data[nurse_data['supplies_used'].isna() | (nurse_data['supplies_used'] < 0) | (nurse_data['supplies_used'] > 100)].index[0]
# replace all those bad entries with na values
nurse_data.loc[[bad_supplies_used_ind], ['supplies_used']] = np.nan
# calculate the average supplied used excluding na entries
avg_supplies_used = nurse_data['supplies_used'].mean(skipna=True)
# replace all na entries in column with the average
nurse_data.fillna(avg_supplies_used, inplace=True)
# display dataset after bad data replacement
print(nurse_data) 

13
    medical_record_number first_name   last_name  visit_id        date  \
0            15685.000000      Harry      Potter      8219  06-05-1994   
1             7619.000000        Ron     Weasley      7512  01-15-1994   
2            14593.000000   Hermione     Granger      5896  01-25-1994   
3            15685.000000      Harry      Potter      1552  1994-02-15   
4            15685.000000      Harry      Potter      1202  05-19-1994   
5             8954.000000      Dobby    4.473684      1205  03-12-1994   
6             7619.000000        Ron     Weasley      6895  04-05-1994   
7            15689.000000      Harry      Potter      6854  10-11-1994   
8             7619.000000        Ron     Weasley      1265  07-08-1994   
9                4.473684   Serverus       Snape      5454  09-12-1994   
10           15685.000000        Ron     Weasley      2369  09-31-1994   
11           15685.000000      Harry      Potter      9852  10-16-1994   
12            7619.000000        Ro

#Part 6 -- Normalize Time Spent

(15 points) Add two columns to your data structure by splitting the *time_spent* column. Split this into two columns named *time_spent* and *time_spent_unit*, where the number value is stored into *time_spent* and the description of the unit (minutes, hours, etc) is stored in *time_spent_unit*.  

Furthermore, convert any values with hours as units into minutes as units and correct any typos of minute to minutes. Print your data to verify your changes.



In [None]:
# insert code here
split_time_spent = nurse_data['supplies_used'].str.split(' ', n=1, expand=True)
nurse_data['time_spent'] = split_time_spent[0]
nurse_data['time_spent_unit'] = split_time_spent[1]




# Part 7 -- Replace Bad Dates

(5 points) Replace any bad dates (missing, impossible dates, poorly formated, etc) with a date representing January 1st, 1994. Print your data to verify your changes.


In [None]:
# insert code here

#Part 8 -- Consistency of IDs

(10 points) Replace any inconsistent medical record numbers with the most commonly occurring medical record number for each first/last name combination (ignoring case). For example, 015689 was inconsistent with Potter's other IDs and would become 015685.

In [None]:
# insert code here

# Part 9 -- Calculate Aggregates, Part 1

(5 points) Use your cleaned data to calculate the mean time spent (in minutes) for all records. Print this value.



In [None]:
# insert code here

# Part 10 -- Calculate Aggregates, Part 2

(5 points) Use your cleaned data to find the month in 1994 with the largest amount of time spent logged by the nurse. Print this value.  

Leave as a comment any lingering data quality concerns you might have in reporting aggregate monthly values back to Hogwarts administration.

In [None]:
# insert code here

#Part 11 -- Documentation and Correctness
(10 points) Please document your code with human-readable messages explaining what the code is doing; at a minimum, every function and control structure should be documented.  If your response is a 1-liner, explain how it works.

Additionally, please error check your code; partial credit will be given to answers that do not fully address the requirements. For example, if it says write a function, please make sure your code provides a function.

Please make sure your submission has everything completed.

**If you modified the original data file, please submit your updated file too.** If you used Python to generate test cases and simply running your code loads the new test cases, you do not need to upload the original data file.