## Programming Assignment #2

**Hogwarts Data Quality**

100 points possible.

This assignment explores how to check for common data quality issues.

# The Setting

Being a wizard can be dangerous. Being a wizard in training can be even more dangerous. The Hogwarts school nurse is a very busy person and records their activity in a log.

You are asked to review a sample of the logs from the nurse's office of Hogwarts and to report your findings.  

Note: in class, we saw a wide variety of issues with the data file. This assignment asks you to programmatically flag **a specific subset of the issues** identified in class.

# Part 0 -- Submission Details


(10 points) Please enter your name and the date below. Submit your answers as a completed notebook by the deadline posted on Canvas.  Late submissions will not get credit for this section.

Name: Duong Hoang

Date: 10/03/2022


# Part 1 -- Create

(10 points) Download the data from Canvas and load it into an appropriate Python structure.  Leave a comment in your code justifying your choice in data structure.  

You do not need to include this file with your submission. Everyone's code will be run with the same input data, so do not modify the format of the file for your program.

In [2]:
# insert code here
import pandas as pd

# read data into pandas DataFrame
data = pd.read_csv("nurse-log.csv")

# I chose a pandas DataFrame because I can refer to the data by column name
# instead of tracking columns by index (so I always know which kind of data
# I am working with), and because the pandas library provides many functions
# that simplify my code

# Part 2 -- Missing Data

(15 points) Write a function that accepts as input your data structure from Part 1 and a "column" of data (a single variable name represented as a string) and reports the number of missing values.  

Test your function by calling it at least three times and show the output; demonstrate that your function works correctly by calling it with columns with and without missing data.

In [3]:
# insert code here
def count_missing_vals(data: pd.DataFrame, col_name: str):
    # count total number of null values in the given column
    return data[col_name].isnull().sum()

# columns with missing data values
print("Number of missing values in column 'medical_record_number':", count_missing_vals(data, "medical_record_number"))
print("Number of missing values in column 'last_name':", count_missing_vals(data, "last_name"))
print("Number of missing values in column 'date':", count_missing_vals(data, "date"))

# columns without missing data values
print("Number of missing values in column 'visit_id':", count_missing_vals(data, "visit_id"))
print("Number of missing values in column 'charge':", count_missing_vals(data, "charge"))

Number of missing values in column 'medical_record_number': 1
Number of missing values in column 'last_name': 1
Number of missing values in column 'date': 1
Number of missing values in column 'visit_id': 0
Number of missing values in column 'charge': 0
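
The counts above depend on what pandas treats as null. As a minimal sketch on made-up data (the `sample` frame and the helper name `count_missing_or_blank` are hypothetical and not part of the assignment), the following shows that `isnull()` counts `None`/`NaN` entries but not whitespace-only strings, and one possible way to count both:

```python
import pandas as pd

# Made-up sample data for illustration only (not the nurse log).
sample = pd.DataFrame({
    "last_name": ["Potter", None, "  ", "Granger"],
    "charge": [10.0, 20.0, 30.0, 40.0],
})

def count_missing_or_blank(df: pd.DataFrame, col_name: str) -> int:
    """Count entries that are null or contain only whitespace."""
    return sum(
        1
        for value in df[col_name]
        if pd.isna(value) or (isinstance(value, str) and value.strip() == "")
    )

# isnull() sees only the None entry; the helper also counts the blank string.
print(int(sample["last_name"].isnull().sum()))
print(count_missing_or_blank(sample, "last_name"))
```

Whether blank strings should count as missing is a judgment call that depends on how the log was recorded.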


# Part 3 -- Bad Dates

(15 points) Write a function that accepts as input your data structure from Part 1 and a "column" of data (a single variable name represented as a string) and reports the number of bad dates (impossible dates, poorly formatted dates, etc.).  For example, September has no 31st day.  If the record is not a date at all, such as the name "Potter", consider it a bad date by default.

Test your function by calling it at least two times and show the output: once with the "date" column from the data and once with any other column.


In [4]:
# insert code here
def count_bad_dates(data: pd.DataFrame, col_name: str):
    # parse the column as '%m-%d-%Y'; any value that fails to parse becomes NaT
    # (missing values also become NaT, so they are counted as bad dates too)
    return pd.to_datetime(data[col_name], format='%m-%d-%Y', errors='coerce').isna().sum()

# test function with date type data
print("Number of bad dates in column 'date':", count_bad_dates(data, 'date'))
# test function with non-date type data
print("Number of bad dates in column 'first_name':", count_bad_dates(data, 'first_name'))

Number of bad dates in column 'date': 4
Number of bad dates in column 'first_name': 22
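
To see which kinds of values a strict `'%m-%d-%Y'` parse rejects, a per-value check can help. This is a small sketch using only the standard library rather than pandas; the helper name `is_valid_date` and the sample values are assumptions made for illustration:

```python
from datetime import datetime

def is_valid_date(value, fmt: str = "%m-%d-%Y") -> bool:
    """Return True only if value is a string naming a real calendar date."""
    if not isinstance(value, str):
        # Missing or non-string entries are treated as bad dates.
        return False
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        # Raised for impossible dates (e.g. September 31) and non-date text.
        return False

samples = ["09-30-2022", "09-31-2022", "Potter", None]
results = [is_valid_date(s) for s in samples]
print(results)
```

Only the first sample is accepted: the second names a day that does not exist, the third is not a date, and the fourth is missing.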


# Part 4 -- Outliers

(15 points) Write a function that accepts as input your data structure from Part 1 and a "column" of data (a single variable represented as a string) and reports the number of outliers.  

Define outliers as any value that is -/+ **X** standard deviations away from the mean value, where **X** is a value you choose.  

Special case consideration: return 0 by default if the input data is not numerical data (because standard deviation must be well-defined for this function to work properly).

Leave as a comment how you chose the value of **X**.

Test your function by calling it at least two times and show the output: once with a column containing only numerical records and once with a column containing at least one non-numerical record.

In [28]:
# insert code here
def count_outliers(data: pd.DataFrame, col_name: str):
    col = data[col_name] # extract column from DataFrame
    # if the given column is not numerical data, return 0
    if not pd.api.types.is_numeric_dtype(col):
        return 0
    # otherwise, count outliers based on data's standard deviation(std) and mean values
    mean = col.mean(axis=0, skipna=True)
    std = col.std(axis=0, skipna=True)
    # X = 1 here: a value is an outlier if it is more than one standard
    # deviation below or above the mean
    return sum((col < mean - std) | (col > mean + std))

# test function with a column that contains only numerical records
print("Number of outliers in column 'height(cm)':", count_outliers(data, 'height(cm)'))

# test function with a column that contains non-numerical records
print("Number of outliers in column 'first_name':", count_outliers(data, 'first_name'))

Number of outliers in column 'height(cm)': 1
Number of outliers in column 'first_name': 0
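
The choice of **X** changes how many values are flagged. The sketch below makes the threshold a parameter so different values can be compared; the function name `count_outliers_x`, the default of 2.0, and the sample heights are assumptions for illustration, not the assignment's data:

```python
import pandas as pd

def count_outliers_x(values: pd.Series, x: float = 2.0) -> int:
    """Count values more than x sample standard deviations from the mean."""
    if not pd.api.types.is_numeric_dtype(values):
        return 0  # standard deviation is not defined for non-numerical data
    distance = (values - values.mean()).abs()
    # Comparisons involving NaN are False, so missing values are never counted.
    return int((distance > x * values.std()).sum())

# Made-up heights with one obviously suspicious entry (250 cm).
heights = pd.Series([150.0, 152.0, 151.0, 149.0, 153.0, 150.0, 151.0, 250.0])

for x in (1.0, 2.0, 3.0):
    print("X =", x, "->", count_outliers_x(heights, x), "outlier(s)")
```

On this small sample the 250 cm entry is flagged at X = 1 and X = 2 but not at X = 3, because a single extreme value also inflates the standard deviation it is measured against. That effect is worth keeping in mind when choosing **X** for a short log.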


# Part 5 -- Consistency Checking Example

(15 points) -- Write a function that checks for consistency in the **time_spent** column.  Assume that the correct unit of measurement should be **minutes**.  Your function should return how many values are not consistent with this chosen standard.

Design your own function parameters.  Leave as a comment why you chose these parameters.

Test your function by calling it at least once and show the output.

In [25]:
# insert code here

# 3 parameters: data (a DataFrame) and col_name (column name, string) specify
# which column to check for unit consistency, and unit (string) is the standard
# unit that every entry in that column is expected to use.
# With these parameters, the same function can check different columns
# against different standard units
def time_inconsistency(data: pd.DataFrame, col_name: str, unit: str):
    # count all the entries that use the given unit as their unit
    correct_val_count = data[col_name].str.contains(unit, regex=False).sum()
    # number of entries that use other units would be 
    # the total entries minus the correct entries count above
    return data[col_name].size - correct_val_count

# test on time_spent column
print("Number of inconsistent time values:", time_inconsistency(data, 'time_spent', 'minutes'))

Number of inconsistent time values: 3
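
A plain substring test treats an entry such as "1 minute" (singular) or "45 Minutes" (capitalized) as inconsistent. If those spellings should count as minutes, a pattern-based check is one option. The sketch below is only an illustration under that assumption; the pattern, the helper name `count_non_minute_values`, and the sample values are hypothetical:

```python
import re
import pandas as pd

# Assumption: a consistent entry is "<number> minute" or "<number> minutes",
# in any letter case, with optional surrounding whitespace.
MINUTES_PATTERN = re.compile(r"\s*\d+(?:\.\d+)?\s*minutes?\s*", re.IGNORECASE)

def count_non_minute_values(values: pd.Series) -> int:
    """Count entries that are not a number followed by the minutes unit."""
    def is_minutes(value) -> bool:
        return isinstance(value, str) and MINUTES_PATTERN.fullmatch(value) is not None
    # Missing entries and entries without a unit are counted as inconsistent.
    return sum(1 for value in values if not is_minutes(value))

# Made-up entries for illustration only.
times = pd.Series(["15 minutes", "1 minute", "2 hours", "30", None, "45 Minutes"])
print(count_non_minute_values(times))
```

Here "2 hours", the bare "30", and the missing entry are counted, while the three minute spellings are accepted.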


# Part 6 -- Reporting

(10 points) -- Write code that generates a report using the functions defined in Parts 2-5.  Your report should summarize any findings **for each appropriate column**.

You may write this to file or just print to standard output.  

Below is an example output (using X,Y,Z in place of actual values). You may deviate from the example output as long as the same information is conveyed; in other words, you have **creative freedom in presentation** but the content needs to summarize the results from Parts 2-5.

```
Column medical_record_number has X missing values.
Column first_name has X missing values.
Column last_name has X missing values.
Column visit has X missing values.
Column date has X missing values.
Column time_spent has X missing values.
Column height(cm) has X missing values.
Column weight(kg) has X missing values.
Column charge has X missing values.
Column supplies_used has X missing values.

Column time_spent has X outliers.
Column height(cm) has X outliers.
Column weight(kg) has X outliers.
Column charge has X outliers.
Column supplies_used has X outliers.

Column date has Y bad dates.
Column time_spent has Z inconsistent values.
```

In [29]:
# insert code here

# report each column's number of missing values
for col in data.columns:
    print("Column", col, "has", count_missing_vals(data, col), "missing values.")
print()
# report each column's number of outliers
for col in data.columns:
    print("Column", col, "has", count_outliers(data, col), "outliers.")
print()
# report number of bad dates
print("Column date has", count_bad_dates(data, 'date'), "bad dates.")
# report number of inconsistent time values
print("Column time_spent has", time_inconsistency(data, 'time_spent', 'minutes'), "inconsistent values.")

Column medical_record_number has 1 missing values.
Column first_name has 1 missing values.
Column last_name has 1 missing values.
Column visit_id has 0 missing values.
Column date has 1 missing values.
Column time_spent has 1 missing values.
Column height(cm) has 1 missing values.
Column weight(kg) has 0 missing values.
Column charge has 0 missing values.
Column supplies_used has 1 missing values.

Column medical_record_number has 2 outliers.
Column first_name has 0 outliers.
Column last_name has 0 outliers.
Column visit_id has 9 outliers.
Column date has 0 outliers.
Column time_spent has 0 outliers.
Column height(cm) has 1 outliers.
Column weight(kg) has 2 outliers.
Column charge has 7 outliers.
Column supplies_used has 1 outliers.

Column date has 4 bad dates.
Column time_spent has 3 inconsistent values.
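
The report above runs the outlier check on every column, including identifiers and text columns, where an outlier count has little meaning. One possible way to restrict it to "appropriate" columns is sketched below. The helper name `measurement_columns`, the choice of which columns count as identifiers, and the sample data are assumptions for illustration:

```python
import pandas as pd

def measurement_columns(df: pd.DataFrame, id_columns: set) -> list:
    """Return numeric columns that are not identifier columns."""
    numeric = df.select_dtypes(include="number").columns
    return [col for col in numeric if col not in id_columns]

# Made-up sample data for illustration only (not the nurse log).
sample = pd.DataFrame({
    "visit_id": [1, 2, 3, 4],
    "first_name": ["Harry", "Ron", None, "Hermione"],
    "charge": [10.0, 12.0, 11.0, 500.0],
})

# Assumption: visit_id is an identifier, so outliers are not meaningful for it.
for col in measurement_columns(sample, id_columns={"visit_id"}):
    print("Column", col, "would be checked for outliers.")
```

In this sample only `charge` is selected: `first_name` is not numeric and `visit_id` is excluded as an identifier.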


# Part 7 -- Documentation and Correctness
(10 points) Please document your code with human-readable messages explaining what the code is doing; at a minimum, every function and control structure should be documented.

Additionally, please error check your code; partial credit will be given to answers that do not fully address the requirements. For example, if it says write a function, please make sure your code provides a function.

Please make sure your submission has everything completed.

You do not need to include the common data file in your submission.