Notebook Summary

This notebook focuses on cleaning the dataset by handling missing values and duplicates. Below are the key tasks and outputs:

Key Tasks

Identify Missing Values: Count the number of missing values in each column.
Drop Problematic Columns: Remove columns with more than 10% missing values.
Clean Rows: Drop rows that contain any missing values.
Remove Duplicates: Eliminate duplicate rows from the dataset.
Save Clean Data: Export the cleaned dataset to a CSV file located in the data folder.
Display Summary: Print the dataset's dimensions, remaining missing values, duplicate count, and a preview of the first few rows.
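
The cleaning steps listed above can be sketched as one small function. This is a minimal illustration, not the notebook's actual cell: the name `clean_frame`, the parameter `max_missing_frac`, and the toy data are assumptions made up for this example. It expresses the 10% rule as a per-column fraction via `isnull().mean()`, which should select the same columns as comparing counts against `0.1 * df.shape[0]`.

```python
import pandas as pd


def clean_frame(df: pd.DataFrame, max_missing_frac: float = 0.10) -> pd.DataFrame:
    """Drop sparse columns, then incomplete rows, then duplicate rows.

    Hypothetical helper for illustration; the 0.10 default mirrors the
    10% threshold described in the task list.
    """
    # Fraction of missing values per column (mean of a boolean mask).
    missing_frac = df.isnull().mean()
    # Keep only the columns at or below the threshold.
    kept = df.loc[:, missing_frac <= max_missing_frac]
    # Drop rows with any remaining missing value, then exact duplicate rows.
    return kept.dropna().drop_duplicates()


# Toy data (invented for this sketch): column "b" is 50% missing.
toy = pd.DataFrame({
    "a": [1, 1, 2, 3],
    "b": [None, None, 5.0, 6.0],
    "c": ["x", "x", "y", "z"],
})
cleaned = clean_frame(toy)
print(cleaned.shape)
```

The order matters: dropping sparse columns first means the later row-wise `dropna()` removes fewer rows, because it no longer sees the mostly empty columns.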
Output

Dataset Shape: Number of rows and columns after cleaning.
Missing Values Report: Count of missing values per column (should be zero).
Duplicate Count: Number of duplicate records remaining (should be zero).
Data Preview: First few rows of the cleaned dataset.
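
The four outputs above can also be gathered and checked programmatically instead of being read off the printed output. The sketch below is an assumption-laden illustration: `cleaning_report` and `assert_clean` are hypothetical helper names, and the small `example` frame is invented here.

```python
import pandas as pd


def cleaning_report(df: pd.DataFrame) -> dict:
    """Collect shape, per-column missing counts, duplicate count, and a preview."""
    return {
        "shape": df.shape,
        "missing_per_column": df.isnull().sum().to_dict(),
        "duplicate_count": int(df.duplicated().sum()),
        "preview": df.head(),
    }


def assert_clean(df: pd.DataFrame) -> None:
    """Raise AssertionError if any missing values or duplicate rows remain."""
    assert not df.isnull().any().any(), "missing values remain"
    assert not df.duplicated().any(), "duplicate rows remain"


# Invented example data that is already clean.
example = pd.DataFrame({"Year": [2020, 2021], "Mode": ["MB", "HR"]})
report = cleaning_report(example)
assert_clean(example)
print(report["shape"], report["duplicate_count"])
```

Failing loudly when the "should be zero" expectations do not hold is one possible design choice; printing the report, as the notebook does, is equally valid for exploratory work.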

In [2]:
# Library Loading
import pandas as pd

In [3]:
# Data Loading
data_path = "../data/raw_data.csv"
df_full = pd.read_csv(data_path)

In [5]:
# Selected Columns
# Define the list of columns to keep
columns_to_keep = [
    'NTD ID',
    'Primary UZA UACE Code',
    'Rail/Bus/Ferry',
    'Mode Name',
    'Mode',
    'TOS',
    'Year',
    'Event Date',
    'Event Time',
    'Event Type',
    'Event Type Group',
    'Safety/Security',
    'Property Damage',
    'Total Injuries',
    'Total Fatalities',
    'Towed (Y/N)',
    'Number of Transit Vehicles Involved',
    'Number of Non-Transit Vehicles Involved',
    'Number of Cars on Involved Transit Vehicles',
    'Non-Transit Vehicle Type List',
    'Location Type',
    'Latitude',
    'Longitude',
    'Weather',
    'Lighting',
    'Road Configuration',
    'Path Condition',
    'Right of Way Condition',
    'Intersection Control Device',
    'Transit Vehicle Action',
    'Other Transit Vehicle Action Description',
    'Non-Transit Vehicle Action List',
    'Transit (Y/N)',
    'Fuel Type',
    'Vehicle Speed',
    'Transit Vehicle Type',
    'Non-Transit Vehicle Type',
    'Transit Vehicle Manufacturer',
    'Total Serious Injuries'
]

df = df_full[columns_to_keep]
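
Selecting with `df_full[columns_to_keep]` raises a `KeyError` if any listed name is absent from the raw file, for example after a column is renamed upstream. A defensive variant might look like the sketch below; `select_columns` is a hypothetical helper and the two-column `raw` frame is invented for the example, so treat it as one option rather than what the notebook does.

```python
import pandas as pd


def select_columns(df_full: pd.DataFrame, columns_to_keep: list) -> pd.DataFrame:
    """Return a copy of df_full restricted to columns_to_keep.

    Raises KeyError naming every requested column that is missing.
    """
    missing = [c for c in columns_to_keep if c not in df_full.columns]
    if missing:
        raise KeyError(f"Columns not found in the raw data: {missing}")
    # .copy() gives an independent frame, so later edits never touch df_full.
    return df_full[columns_to_keep].copy()


# Invented stand-in for the raw data.
raw = pd.DataFrame({"NTD ID": [1, 2], "Year": [2020, 2021], "Extra": ["a", "b"]})
subset = select_columns(raw, ["NTD ID", "Year"])
print(list(subset.columns))
```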

In [7]:
# count the number of missing values in each column
missing_values = df.isnull().sum()

# drop columns with more than 10% missing values
columns_to_drop = missing_values[missing_values > 0.1 * df.shape[0]].index
df = df.drop(columns=columns_to_drop)

# drop rows with missing values
df = df.dropna()

# drop duplicates
df = df.drop_duplicates()

# save the cleaned data to the data folder as cleaned_data.csv
df.to_csv('../data/cleaned_data.csv', index=False)

# print the number of rows and columns in the cleaned data
print(df.shape)

# print the number of missing values in the cleaned data
print(df.isnull().sum())

# print the number of duplicates in the cleaned data
print(df.duplicated().sum())

# print the first few rows of the cleaned data
print(df.head())

(98611, 20)
NTD ID                                         0
Primary UZA UACE Code                          0
Rail/Bus/Ferry                                 0
Mode Name                                      0
Mode                                           0
TOS                                            0
Year                                           0
Event Date                                     0
Event Time                                     0
Event Type                                     0
Event Type Group                               0
Safety/Security                                0
Total Injuries                                 0
Total Fatalities                               0
Towed (Y/N)                                    0
Number of Transit Vehicles Involved            0
Number of Non-Transit Vehicles Involved        0
Number of Cars on Involved Transit Vehicles    0
Location Type                                  0
Total Serious Injuries                         0
dtype: i