# DSC540 Milestone 2: Cleaning/Formatting Flat File Source

*Milestone Objective:* This milestone aims to perform data transformations and cleansing on a flat file dataset related to U.S. healthcare readmissions. The goal is to create a clean and usable dataset for further analysis by executing at least five specific data transformation steps.

The flat file dataset utilized in this project contains essential information regarding 30-day readmission and mortality rates for hospitals across the United States. It comprises over 64,000 rows, capturing data from various states and providing a comprehensive overview of healthcare performance metrics.

### Transformation Steps and Code Outline

In [1]:
# Step 0: Load the CSV file into a DataFrame

# Description: This step loads the dataset from the specified CSV file and displays the first few rows to understand its initial structure.

import pandas as pd

# Load the CSV file
file_path = 'Readmissions_and_Deaths -Hospital.csv'   
hospital_data = pd.read_csv(file_path)

# Display the first few rows of the dataset to understand its structure
print("Initial dataset:\n", hospital_data.head())

Initial dataset:
    index  Provider ID             Hospital Name                   Address  \
0      0       230100  TAWAS ST JOSEPH HOSPITAL               200 HEMLOCK   
1      1       230121       MEMORIAL HEALTHCARE      826 WEST KING STREET   
2      2       230118      HURON MEDICAL CENTER  1100 SOUTH VAN DYKE ROAD   
3      3       230121       MEMORIAL HEALTHCARE      826 WEST KING STREET   
4      4       230133  OTSEGO MEMORIAL HOSPITAL          825 N CENTER AVE   

         City State  ZIP Code County Name  Phone Number  \
0  TAWAS CITY    MI     48764       IOSCO    9893629301   
1      OWOSSO    MI     48867  SHIAWASSEE    9897235211   
2     BAD AXE    MI     48413       HURON    9892699521   
3      OWOSSO    MI     48867  SHIAWASSEE    9897235211   
4     GAYLORD    MI     49735      OTSEGO    9897312100   

                                        Measure Name          Measure ID  \
0  Rate of readmission after discharge from hospi...  READM_30_HOSP_WIDE   
1     Rate o

In [2]:
# Step 1: Replace headers

# Goal: Ensure that the headers are clean, readable, and consistent. Column headers contain extra spaces and inconsistent capitalization.
# Action: Rename headers to a consistent format (e.g., all lowercase, replacing spaces with underscores).

hospital_data.columns = ['index', 'provider_id', 'hospital_name', 'address', 'city', 'state', 'zip_code', 
              'county_name', 'phone_number', 'measure_name', 'measure_id', 'compared_to_national',
              'denominator', 'score', 'lower_estimate', 'higher_estimate', 'footnote', 
              'measure_start_date', 'measure_end_date', 'location']
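
The hard-coded list in Step 1 only works if the columns arrive in exactly this order. As a hedged alternative, the sketch below derives the same naming style programmatically. `normalize_headers` and the toy frame are hypothetical names introduced for illustration, and the result on the real file may differ from the hand-written list if the raw headers contain punctuation or other characters.

```python
import pandas as pd


def normalize_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with stripped, lowercase, underscore-separated headers."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()                      # remove leading/trailing spaces
        .str.lower()                                 # consistent capitalization
        .str.replace(r"\s+", "_", regex=True)        # whitespace runs -> underscore
    )
    return out


# Toy frame with made-up values, used only to demonstrate the renaming
toy = pd.DataFrame({"Provider ID": [1], " Hospital Name ": ["A"], "ZIP Code": [12345]})
normalized = normalize_headers(toy)
print(list(normalized.columns))
```

Because the function returns a copy, the original frame keeps its raw headers, which makes it easy to compare the two during review.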

In [3]:
# Step 2: Handling missing values

# Goal: Ensure that missing data does not cause errors during the analysis by filling in missing values.
# Action: Replace missing values in the 'score' column with 0, as this column is critical for analysis.
#         Note: 0 is a placeholder here, not a measured rate, so filled rows should be interpreted with care.

hospital_data['score'] = hospital_data['score'].fillna(0)  # Replace missing scores with 0
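
Filling with 0 pulls any average of `score` toward zero. A minimal sketch of one alternative follows, assuming the column may hold text placeholders as well as true missing values. The `"Not Available"` string and the toy values are assumptions for illustration, not taken from the source file. The idea is to coerce the column to numeric, keep the gaps as NaN, and record them in a flag column.

```python
import pandas as pd

# Toy data with made-up values; "Not Available" is an assumed placeholder string
toy = pd.DataFrame({"score": ["15.2", "Not Available", None, "9.8"]})

# Coerce anything non-numeric to NaN instead of replacing it with 0
toy["score_numeric"] = pd.to_numeric(toy["score"], errors="coerce")

# Keep a flag so rows without a usable score stay visible downstream
toy["score_missing"] = toy["score_numeric"].isna()

# Compare the two treatments: NaN is skipped by mean(), zeros are not
mean_ignoring_missing = toy["score_numeric"].mean()
mean_with_zero_fill = toy["score_numeric"].fillna(0).mean()
print(mean_ignoring_missing, mean_with_zero_fill)
```

In this toy case the zero-filled mean is half of the mean computed over the observed values, which shows how strongly the choice of fill value can shift a summary statistic.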

In [4]:
# Step 3: Drop unnecessary columns

# Goal: Simplify the dataset by removing columns that are not needed for the analysis, reducing complexity and improving readability.
# Action: Drop 'hospital_name', 'provider_id', 'phone_number', and 'footnote' because they are not relevant to the current analysis 
#         and won't contribute to meaningful insights.

hospital_data.drop(columns=['hospital_name', 'provider_id', 'phone_number', 'footnote'], inplace=True)

In [5]:
# Step 4: Format Date Columns

# Goal: Ensure that date fields are in a proper format for analysis, which will allow us to perform time-based calculations and comparisons later on.
# Action: Convert 'measure_start_date' and 'measure_end_date' (renamed in Step 1) to the datetime format using pd.to_datetime(). 
#         errors='coerce' turns invalid date entries into NaT instead of raising. After conversion, the data types can be checked 
#         to confirm that the columns have been successfully formatted as dates.

hospital_data['measure_start_date'] = pd.to_datetime(hospital_data['measure_start_date'], errors='coerce')
hospital_data['measure_end_date'] = pd.to_datetime(hospital_data['measure_end_date'], errors='coerce')

# Optional: uncomment to display the data types and confirm the conversion
#print("Data types after date conversion:\n", hospital_data.dtypes)
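
Because `errors='coerce'` silently replaces unparseable entries with NaT, it can help to count how many values were lost in the conversion. The sketch below shows one way to do that on a toy column. The date strings are made up for illustration, and the real file's date format is not confirmed here.

```python
import pandas as pd

# Toy column with made-up values: one valid date, one invalid string, one missing
toy = pd.DataFrame({"measure_start_date": ["07/01/2012", "not a date", None]})

parsed = pd.to_datetime(toy["measure_start_date"], errors="coerce")

# Entries that were present in the raw column but became NaT after coercion
failed_mask = parsed.isna() & toy["measure_start_date"].notna()
failed = toy.loc[failed_mask, "measure_start_date"]

print(f"{len(failed)} value(s) failed to parse: {failed.tolist()}")
```

Separating values that were already missing from values that failed to parse makes it clearer whether the coercion itself discarded information.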

In [6]:
# Step 5: Handle leading/trailing spaces and drop rows with missing state values

# Goal: Ensure the State values are properly formatted by removing any leading or trailing spaces, and drop rows 
#       where the State field is missing. This ensures consistency and accuracy in further analysis and merges with other datasets.
# Action: Strip any leading/trailing spaces from the State column and remove rows with missing State values.

# Remove leading/trailing spaces from the State column
hospital_data['state'] = hospital_data['state'].str.strip()

# Drop rows where the State column has missing values
hospital_data.dropna(subset=['state'], inplace=True)

# Display the shape of the dataset after cleaning state values
print(f"Dataset shape after handling spaces and dropping missing state values: {hospital_data.shape}")

Dataset shape after handling spaces and dropping missing state values: (64764, 16)
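
Since the stated goal includes merging with other datasets, a further check on the state codes may be worthwhile. The following is a sketch only, assuming the column is expected to hold two-letter abbreviations. The toy values and the `is_valid` name are hypothetical.

```python
import pandas as pd

# Toy data with made-up values covering padding, lowercase, a full name, and a gap
toy = pd.DataFrame({"state": [" MI", "mi ", "Michigan", None]})

# Strip whitespace and upper-case so " MI" and "mi " compare equal
state = toy["state"].str.strip().str.upper()

# True only for exactly two letters A-Z; missing values count as invalid
is_valid = state.str.fullmatch(r"[A-Z]{2}", na=False)

print(toy.assign(state_clean=state, is_valid=is_valid))
```

Rows flagged as invalid can then be inspected before deciding whether to correct or drop them.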


In [7]:
# Step 6: Identify and Handle Duplicates

# Goal:   Ensure the dataset does not contain duplicate rows, which could skew analysis results or lead to redundant insights.
# Action: Identify and count any duplicate rows. If duplicates are found, remove them from the dataset to ensure only unique records remain.
#         Finally, display the updated shape of the dataset to confirm that duplicates have been removed.

duplicates = hospital_data.duplicated().sum()
print(f"Number of duplicate rows found: {duplicates}")

# Remove duplicates; this is a no-op when none are found
hospital_data = hospital_data.drop_duplicates()

# Display the shape of the dataset after removing duplicates
print(f"Dataset shape after removing duplicates: {hospital_data.shape}")

Number of duplicate rows found: 0
Dataset shape after removing duplicates: (64764, 16)
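
One caveat about the check above: the dataset still contains the `index` column, which appears to hold a distinct value per row, so `duplicated()` over all columns may report zero even when the remaining fields repeat. The sketch below compares the two counts on a toy frame with made-up values. Whether rows that match outside `index` are true duplicates is a judgment call that depends on the data.

```python
import pandas as pd

# Toy frame with made-up values: rows 0 and 1 match in everything except 'index'
toy = pd.DataFrame({
    "index": [0, 1, 2],
    "state": ["MI", "MI", "MI"],
    "measure_id": ["READM_30_PN", "READM_30_PN", "MORT_30_HF"],
    "score": [17.1, 17.1, 12.0],
})

# Full-row comparison: the unique 'index' value hides the repeated content
dupes_all_columns = int(toy.duplicated().sum())

# Compare on every column except the row identifier
content_cols = [col for col in toy.columns if col != "index"]
dupes_ignoring_index = int(toy.duplicated(subset=content_cols).sum())

print(dupes_all_columns, dupes_ignoring_index)
```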


In [8]:
# Final Step: Display the cleaned and transformed dataset

# Print the first few rows of the cleaned dataset
print("Cleaned dataset preview after all transformations:\n")
print(hospital_data.head())

# Optional: Print dataset information to check final structure and data types
# (info() prints its report directly and returns None, so it is not wrapped in print())
print("\nDataset Info:")
hospital_data.info()

Cleaned dataset preview after all transformations:

   index                   address        city state  zip_code county_name  \
0      0               200 HEMLOCK  TAWAS CITY    MI     48764       IOSCO   
1      1      826 WEST KING STREET      OWOSSO    MI     48867  SHIAWASSEE   
2      2  1100 SOUTH VAN DYKE ROAD     BAD AXE    MI     48413       HURON   
3      3      826 WEST KING STREET      OWOSSO    MI     48867  SHIAWASSEE   
4      4          825 N CENTER AVE     GAYLORD    MI     49735      OTSEGO   

                                        measure_name          measure_id  \
0  Rate of readmission after discharge from hospi...  READM_30_HOSP_WIDE   
1     Rate of readmission after hip/knee replacement   READM_30_HIP_KNEE   
2             Pneumonia (PN) 30-Day Readmission Rate         READM_30_PN   
3            Rate of readmission for stroke patients        READM_30_STK   
4           Heart failure (HF) 30-Day Mortality Rate          MORT_30_HF   

                  comp

### Ethical Implications of Data Wrangling for the U.S. Healthcare Readmissions and Mortality Dataset

During the data cleaning and transformation of the U.S. Healthcare Readmissions and Mortality dataset, several changes were made: headers were renamed for clarity, missing values were filled, unnecessary columns were dropped, date columns were converted to a datetime format, and the state field was standardized. These transformations make the dataset easier to analyze, but they also raise important ethical considerations.

Legal and regulatory guidelines such as HIPAA govern the handling of healthcare data and require strict adherence to privacy standards, especially when personal information is involved. The transformations themselves also introduce risks. Filling in missing values or removing rows can introduce bias and distort how the dataset represents hospitals and patient outcomes. In particular, treating missing 'score' values as zeros may not accurately reflect the quality of patient care, because a zero reads as a rate of zero rather than as missing data.

The dataset was sourced from Kaggle, a reputable platform known for public datasets, which supports its credibility. It is still crucial to ensure that the data is used ethically and does not exploit vulnerable populations or lead to misleading conclusions. To mitigate these risks, it is important to be transparent about the limitations of the data, validate the cleaning methods employed, and adhere to ethical standards when reporting findings.