# Data Quality:

#### Dataset 1: MTA Subway Delay-Causing Incidents: Beginning 2020

Accuracy: The dataset is accurate. We checked the Line column, which is the main column we merge on, for syntax errors and found no misspellings and no duplicates. We found no discrepancies with the “ground truth.”

Completeness: The dataset has no missing rows. However, the first row for each train line and day type (weekday or weekend) always has a blank Reporting Category and a 0 in the Incidents column. This is intentional and does not indicate missing data, but it is still something to be aware of.
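A minimal sketch of how these placeholder rows could be told apart from rows that are actually incomplete. The column names follow the `delayed_df` output in this notebook; the sample rows are hypothetical and only stand in for the real file.

```python
import pandas as pd

# Hypothetical sample that mirrors the columns of the delay dataset.
sample = pd.DataFrame({
    "line": ["1", "1", "1", "2", "2"],
    "day_type": [1, 1, 1, 1, 1],
    "incidents": [0, 31, 2, 0, 12],
    "reporting_category": [None, "Crew Availability", "External Factors",
                           None, "Crew Availability"],
})

# Placeholder rows: blank reporting category together with zero incidents.
is_placeholder = sample["reporting_category"].isna() & (sample["incidents"] == 0)

# A blank category with a nonzero count would point to truly missing data.
is_incomplete = sample["reporting_category"].isna() & (sample["incidents"] != 0)

print("placeholder rows:", int(is_placeholder.sum()))
print("incomplete rows:", int(is_incomplete.sum()))
```

If the second count is 0 on the real data, every blank category belongs to a placeholder row.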

Timeliness: According to the Data.NY.Gov website, this dataset was last updated on October 24, 2025, and it appears to be updated frequently. Both the data and the metadata were updated on the same day. The dataset covers each line for each month starting from 01-01-20.

Consistency: The dataset is consistent. The Reporting Category column has only 6 possible values with no other variations, which makes it easy to categorize the types of incidents. These 6 categories help with consistency, but they limit how much detail we have about what each incident actually was. The other columns are consistent as well: the Month column follows the same Year-Month-Day syntax throughout, the Division column contains only A Division or B Division, the Line column contains only the numbers or letters of real MTA train lines, and the Incidents column contains only numbers.
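The consistency checks described above could be sketched as follows. This is an illustration on a few rows copied from the output in this notebook, not the exact code we ran; the variable names are our own.

```python
import pandas as pd

# Small sample; values copied from the delayed_df output in this notebook.
sample = pd.DataFrame({
    "month": ["2024-01-01T00:00:00.000", "2024-01-01T00:00:00.000",
              "2024-12-01T00:00:00.000"],
    "division": ["A DIVISION", "A DIVISION", "B DIVISION"],
    "line": ["1", "1", "S ROCK"],
    "incidents": [31, 2, 16],
    "reporting_category": ["Crew Availability", "External Factors",
                           "External Factors"],
})

# Every month string should parse as a date; errors="coerce" turns failures into NaT.
parsed_month = pd.to_datetime(sample["month"], errors="coerce")
bad_months = int(parsed_month.isna().sum())

# Division should only take the two documented values.
allowed_divisions = {"A DIVISION", "B DIVISION"}
bad_divisions = int((~sample["division"].isin(allowed_divisions)).sum())

# Count the distinct reporting categories (the full dataset should show 6).
n_categories = sample["reporting_category"].dropna().nunique()

print(bad_months, bad_divisions, n_categories)
```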

In [1]:
import pandas as pd

In [8]:
delayed_df = pd.read_csv("../raw_data/delayed.csv")
delayed_df

Unnamed: 0,month,division,line,day_type,incidents,reporting_category
0,2024-01-01T00:00:00.000,A DIVISION,1,1,0,
1,2024-01-01T00:00:00.000,A DIVISION,1,1,31,Crew Availability
2,2024-01-01T00:00:00.000,A DIVISION,1,1,2,External Factors
3,2024-01-01T00:00:00.000,A DIVISION,1,1,92,Infrastructure & Equipment
4,2024-01-01T00:00:00.000,A DIVISION,1,1,86,Operating Conditions
...,...,...,...,...,...,...
3761,2024-12-01T00:00:00.000,B DIVISION,S ROCK,2,16,External Factors
3762,2024-12-01T00:00:00.000,B DIVISION,S ROCK,2,6,Infrastructure & Equipment
3763,2024-12-01T00:00:00.000,B DIVISION,S ROCK,2,1,Operating Conditions
3764,2024-12-01T00:00:00.000,B DIVISION,S ROCK,2,1,Planned ROW Work


In [10]:
# Redo the cleaning from the MTA_Code notebook, because it did not carry over to this notebook.

# Clean the line column for syntax issues: strip leading/trailing whitespace and uppercase.
delayed_df['line'] = delayed_df['line'].str.strip().str.upper()
delayed_df



#### Dataset 2: MTA Subway Customer Journey-Focused Metrics: 2020-2024

Accuracy: The dataset is accurate. We checked the Line column, which is the main column we merge on, for syntax errors and found no misspellings and no duplicates. We found no discrepancies with the “ground truth.”

Completeness: The dataset has no missing rows. 

Timeliness: According to the Data.NY.Gov website, this dataset was last updated on January 24, 2025, and its metadata was updated on July 14, 2025. It is updated less frequently than the other dataset, but that does not matter for us because our research questions only use 2024 data, and this dataset has data for each line for each month.

Consistency: The dataset is consistent. The Month column follows the same Year-Month-Day syntax throughout, the Line column contains only the numbers or letters of real MTA train lines, the Division column contains only A Division or B Division, the Period column contains only Peak or Offpeak, and the Num_passengers column contains only numbers. The Additional Platform Time, Additional Train Time, Total_apt, Total_att, and Over_five_mins columns are all in minutes, and the Over_five_mins_perc and Customer Journey Time Performance columns are percentages.

# Data Cleaning:

#### Dataset 1: MTA Subway Delay-Causing Incidents: Beginning 2020

**Missing values**: We started cleaning by dropping rows with NaN values, which removed 538 rows.
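A minimal sketch of this step, showing how the number of removed rows can be counted. The sample rows are hypothetical; on the real file the same subtraction is what gives the 538 reported above.

```python
import pandas as pd

# Hypothetical sample with one placeholder row that has a blank category.
sample = pd.DataFrame({
    "line": ["1", "1", "2"],
    "incidents": [0, 31, 12],
    "reporting_category": [None, "Crew Availability", "External Factors"],
})

rows_before = len(sample)

# Drop every row that has a NaN in any column, then renumber the index.
cleaned = sample.dropna().reset_index(drop=True)

rows_removed = rows_before - len(cleaned)
print(f"removed {rows_removed} of {rows_before} rows")
```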


**Data Types**: We used the `dtypes` attribute to check the data types. Because every column was initially stored as an object, we converted each one to its appropriate type (datetime, integer, or string) so that we could analyze the columns properly.
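The conversion could look roughly like the sketch below. It assumes every column starts out as text, as described above; the sample rows and the `converted` name are hypothetical, and the exact calls in our cleaning notebook may differ.

```python
import pandas as pd

# Hypothetical sample in which every column is stored as text.
sample = pd.DataFrame({
    "month": ["2024-01-01T00:00:00.000", "2024-02-01T00:00:00.000"],
    "line": ["1", "S ROCK"],
    "incidents": ["31", "16"],
    "reporting_category": ["Crew Availability", "External Factors"],
})

print(sample.dtypes)  # inspect the types before converting

converted = sample.assign(
    month=pd.to_datetime(sample["month"]),                      # datetime
    incidents=pd.to_numeric(sample["incidents"]).astype("int64"),  # integer
    line=sample["line"].astype("string"),                       # string
    reporting_category=sample["reporting_category"].astype("string"),
)

print(converted.dtypes)  # confirm the types after converting
```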


**Duplicate rows**: We checked for duplicate rows across the entire dataset and found none. We did not check for duplicates on specific columns, because rows that repeat a value in one column still carry unique information in the other fields.
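The difference between the two kinds of duplicate check can be illustrated with a short sketch. The sample rows are hypothetical and the choice of `month` and `line` as the subset is only for illustration.

```python
import pandas as pd

# Hypothetical sample: same month and line, but a different category in each row.
sample = pd.DataFrame({
    "month": ["2024-01-01", "2024-01-01", "2024-01-01"],
    "line": ["1", "1", "1"],
    "incidents": [31, 2, 92],
    "reporting_category": ["Crew Availability", "External Factors",
                           "Infrastructure & Equipment"],
})

# Full-row duplicates: every column must match an earlier row.
full_row_dupes = int(sample.duplicated().sum())

# Subset duplicates: month and line repeat even though the other fields differ.
key_dupes = int(sample.duplicated(subset=["month", "line"]).sum())

print(full_row_dupes, key_dupes)
```

Here the subset check flags two rows that are not real duplicates, which is why we only checked full rows.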


**Syntactic Check**: 
- We standardized whitespace and capitalization in all string columns to make sure everything is consistent. 
- We checked that every value in the month column follows the same syntax, so that no data is left out later in our analysis and integration; all values did.
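A sketch of both syntactic checks on a hypothetical sample with deliberately messy values. The list `text_cols` is our own illustrative choice of columns, and the regular expression only checks that each month value begins with a Year-Month-Day pattern.

```python
import pandas as pd

# Hypothetical sample with inconsistent whitespace and capitalization.
sample = pd.DataFrame({
    "month": ["2024-01-01T00:00:00.000", "2024-02-01T00:00:00.000",
              "2024-03-01T00:00:00.000"],
    "division": ["A DIVISION", " a division", "B Division "],
    "line": ["1", " 1 ", "s rock"],
})

# Strip leading/trailing whitespace and uppercase each chosen text column.
text_cols = ["division", "line"]
for col in text_cols:
    sample[col] = sample[col].str.strip().str.upper()

# str.match anchors at the start, so this checks for a YYYY-MM-DD prefix.
month_ok = sample["month"].str.match(r"\d{4}-\d{2}-\d{2}")

print(sample)
print("all months well-formed:", bool(month_ok.all()))
```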


**Semantic Check**: 
- For the "incidents" column, we checked for negative values, since a negative number of incidents is not logically possible; there were none, so no additional cleaning was needed. 
- We checked the "division" and "line" columns against each other, since A Division is associated with numbered subway lines and B Division with lettered subway lines; all entries were correctly matched.
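These two semantic checks could be sketched as below. The first-character rule is a simplifying assumption on our part: it treats any line starting with a digit as numbered and anything else as lettered, so named shuttle lines may need a manual look. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample with the columns used by the semantic checks.
sample = pd.DataFrame({
    "division": ["A DIVISION", "A DIVISION", "B DIVISION", "B DIVISION"],
    "line": ["1", "7", "A", "S ROCK"],
    "incidents": [31, 2, 16, 6],
})

# A negative incident count is not logically possible.
negative_rows = sample[sample["incidents"] < 0]

# Assumed rule: A Division lines start with a digit, B Division lines with a letter.
starts_with_digit = sample["line"].str[0].str.isdigit()
is_a_division = sample["division"] == "A DIVISION"
mismatched = sample[starts_with_digit != is_a_division]

print("negative rows:", len(negative_rows))
print("division/line mismatches:", len(mismatched))
```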


We also used the unique() function to check for any additional syntactic or semantic inconsistencies such as text-only columns containing numbers, or potential misspellings. No additional issues were found, so no further cleaning was necessary.
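A short sketch of this kind of inspection, assuming a hypothetical sample and treating `division` and `reporting_category` as the text-only columns. Printing the distinct values supports a visual scan for misspellings, and the digit check flags numbers inside text-only columns.

```python
import pandas as pd

# Hypothetical sample of two text-only columns.
sample = pd.DataFrame({
    "division": ["A DIVISION", "A DIVISION", "B DIVISION"],
    "reporting_category": ["Crew Availability", "External Factors",
                           "External Factors"],
})

# Print the distinct values of each column for a visual scan.
for col in sample.columns:
    print(col, sorted(sample[col].dropna().unique()))

# Flag text-only columns in which any value contains a digit.
has_digits = {
    col: bool(sample[col].str.contains(r"\d", regex=True).any())
    for col in ["division", "reporting_category"]
}
print(has_digits)
```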


Overall, the only cleaning required was handling missing values, correcting data types, and standardizing whitespace and capitalization (as a precaution).

#### Dataset 2: MTA Subway Customer Journey-Focused Metrics: 2020-2024

**We applied similar cleaning to dataset 2.**

**Missing values**: We started cleaning by dropping rows with NaN values, which *removed 24 rows*.

**Data Types**: As with dataset 1, we used the `dtypes` attribute to check the data types. Because every column was initially stored as an object, we converted each one to its appropriate type (datetime, float, or string).

**Duplicate rows**: We checked for duplicate rows across the entire dataset and found none.

**Syntactic Check**: 
- We standardized whitespace and capitalization in all string columns to make sure everything is consistent.
- We checked that every value in the month column follows the same syntax; all values did.

**Semantic Check**: 
- *For the float columns, we checked for negative values. Although there were 77 rows containing negative values in the "additional_train_time" and "total_att" columns, it makes sense to keep them, since negative numbers indicate that riders actually spent less additional time on the train.*
- We checked the "division" and "line" columns against each other, since A Division is associated with numbered subway lines and B Division with lettered subway lines; all entries were correctly matched.
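The negative-value check on the float columns could be sketched as follows. The sample values are hypothetical, and the column names follow the ones quoted above; the point is to count the negative rows for review rather than drop them.

```python
import pandas as pd

# Hypothetical sample of the journey metrics with one negative row.
sample = pd.DataFrame({
    "line": ["1", "A", "G"],
    "additional_platform_time": [1.2, 1.5, 1.1],
    "additional_train_time": [0.4, -0.1, 0.3],
    "total_att": [1200.0, -350.0, 900.0],
})

# Select only the float columns, then count negatives per column.
float_cols = sample.select_dtypes(include="float64").columns
negative_counts = (sample[float_cols] < 0).sum()

# Count rows that have a negative value in at least one float column.
rows_with_negative = int((sample[float_cols] < 0).any(axis=1).sum())

print(negative_counts)
print("rows with a negative value:", rows_with_negative)
```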


We also used the unique() function to check for any additional syntactic or semantic inconsistencies. No additional issues were found, so no further cleaning was necessary.


As with dataset 1, the only cleaning required was handling missing values, correcting data types, and standardizing whitespace and capitalization (as a precaution).