# Chicago Transit Authority Ridership and Funding EDA


## Summary

Over the last few years, Chicago public transportation has lost ridership and funding as travel shifted toward private car ownership. Ridership fell sharply with the 2020 pandemic, and the public has since questioned the reliability of the service. Ridership is rising again in 2024, but locals find the environment dramatically changed. The most common complaints about the CTA are:
1. Increase in violent crime on the train
2. Reliability and on-time delivery
3. Health and hygiene concerns on the trains related to homelessness or drug addiction

### Objective

(Review with Marcello)

With the limited funding provided, can we use public data to understand where financial support should be focused?


Can we determine whether safety or reliability has the bigger impact on ridership recovery? (Crime vs. late buses/trains)


Which "lines" have the largest unrecovered ridership, and what common factors can be remedied?


## Data Sources

Ridership data is publicly available on the City of Chicago data portal [here](https://data.cityofchicago.org/browse?q=ridership&sortBy=relevance)

Crime data is posted by the City of Chicago [here](https://data.cityofchicago.org/Public-Safety/City-of-Chicago-Crime-Data/v9q9-3dm2)

Community area names are posted by the City of Chicago [here](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6)


## Data Dictionary and Glossary

### Crime

**Summary**
Reported crimes committed within the Chicago area and published by the Chicago Police Department. Data last refreshed on April 4, 2024, at 00:00 (Central Time, Chicago).

**Data Dictionary**

**Case Number** - The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.

**Date** - Date when the incident occurred. This is sometimes a best estimate.

**Block** - The partially redacted address where the incident occurred, placing it on the same block as the actual address.

**IUCR** - The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description. See the list of IUCR codes at https://data.cityofchicago.org/d/c7ck-438e.

**Primary Type** - The primary description of the IUCR code.

**Description** - The secondary description of the IUCR code, a subcategory of the primary description.

**Location Description** - Description of the location where the incident occurred.

**Arrest** - Indicates whether an arrest was made.

**Domestic** - Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.

**Beat** - Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts. See the beats at https://data.cityofchicago.org/d/aerh-rz74.

**District** - Indicates the police district where the incident occurred. See the districts at https://data.cityofchicago.org/d/fthy-xz3r.

**Ward** - The ward (City Council district) where the incident occurred. See the wards at https://data.cityofchicago.org/d/sp34-6z76.

**Community Area** - Indicates the community area where the incident occurred. Chicago has 77 community areas. See the community areas at https://data.cityofchicago.org/d/cauq-8yn6.

**FBI Code** - Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS). See the Chicago Police Department listing of these classifications at https://gis.chicagopolice.org/pages/crime_details.

**X Coordinate** - The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.

**Y Coordinate** - The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.

**Year** - Year the incident occurred.

**Updated On** - Date and time the record was last updated. 

**Latitude** - The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.

**Longitude** - The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.

**Location** - The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.

### L Stop Daily Rides

**Summary**
This list shows daily totals of ridership, by station entry, for each 'L' station dating back to 2001. Dataset shows entries at all turnstiles, combined, for each station. Daytypes are as follows: W=Weekday, A=Saturday, U=Sunday/Holiday

**station_id** - City ID for station

**stationname** - Street Intersection/Name of the station

**date** - Calendar date of the ridership count

**daytype** - W=Weekdays, A=Saturday, U=Sunday/Holiday

**rides** - Total rides by day for each station
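
As a minimal sketch of how the `daytype` codes might be used, the snippet below maps them to readable labels and averages rides per day type. The toy rows and the `DAYTYPE_LABELS` name are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy rows shaped like the L Stop Daily Rides data (ride counts are made up).
rides = pd.DataFrame({
    "station_id": [40010, 40010, 40010],
    "date": ["01/01/2024", "01/02/2024", "01/06/2024"],
    "daytype": ["U", "W", "A"],
    "rides": [100, 450, 220],
})

# Codes as documented in the dataset summary.
DAYTYPE_LABELS = {"W": "Weekday", "A": "Saturday", "U": "Sunday/Holiday"}

# Add a readable label, then average rides within each day type.
rides["daytype_label"] = rides["daytype"].map(DAYTYPE_LABELS)
mean_by_daytype = rides.groupby("daytype_label")["rides"].mean()
print(mean_by_daytype)
```

Comparing like day types (weekday to weekday) avoids mistaking the weekly cycle for a change in ridership.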


### L Stop Names and Locations

**Summary**
This list of 'L' stops provides location and basic service availability information for each place on the CTA system where a train stops, along with formal station names and stop descriptions.

**STOP_ID** - ID assigned by the CTA for reference in their system.

**STOP_NAME** - Name displayed on tickers. The stop name resembles the station name but also includes the direction of the train.

**STATION_NAME** - Official name for the stop. The name is assigned based on the intersection or largest roadway.

**MAP_ID** - This is unknown and may be related to map Geolocation.

**ADA** - Indicates whether the stop complies with the Americans with Disabilities Act (ADA). Refers to requirements such as a lift system as an alternative to stairs.

**RED** - If the Red line has a stop at this station.

**BLUE** - If the Blue line has a stop at this station.

**G** - If the Green line has a stop at this station.

**BRN** - If the Brown line has a stop at this station.

**P** - If the Purple line has a stop at this station.

**Pexp** - If the Purple Line Express serves this station.

**Y** - If the Yellow line has a stop at this location.

**O** - If the Orange line has a stop at this location.

**Location** - The Longitude and Latitude coordinates for this station.
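
The per-line flag columns can be collapsed into a single list of lines per stop. The sketch below assumes the flags load as booleans; the station names, the `LINE_COLUMNS` mapping, and the `lines_served` helper are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Flag column -> readable line name, following the data dictionary above.
LINE_COLUMNS = {
    "RED": "Red", "BLUE": "Blue", "G": "Green", "BRN": "Brown",
    "P": "Purple", "Pexp": "Purple Express", "Y": "Yellow", "O": "Orange",
}

# Two made-up stops; every flag defaults to False unless set below.
stops = pd.DataFrame({"STATION_NAME": ["Example A", "Example B"]})
for col in LINE_COLUMNS:
    stops[col] = False
stops.loc[0, "RED"] = True
stops.loc[1, ["G", "O"]] = True


def lines_served(row) -> list[str]:
    """Return the names of the lines whose flag is set for this stop."""
    return [name for col, name in LINE_COLUMNS.items() if row[col]]


stops["lines"] = [lines_served(row) for _, row in stops.iterrows()]
print(stops[["STATION_NAME", "lines"]])
```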



### Community Area 

**Summary**
This dataset describes how the City of Chicago is divided into community areas (neighborhoods). Data includes the geometric shape of each neighborhood and the related IDs used for governance or reporting.

**Community** - The standard name of the area or neighborhood. This is what the area is officially called and can be found on maps.

**Area Number** - The ID associated with the community. This is a reference for other records such as crime. 

**Area** - Purpose unknown. It is not documented on the official site and does not correlate with the area number.

**Area Num 1** - This is the same value as the Area Number.

**Shape Area** - Appears to be the area of the neighborhood in square feet.

**Geometry** - The official exact coordinates for the neighborhood.

**Shape Len** - Appears to be the *length* of the neighborhood's boundary, in feet, given the official shape.



In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import polars
import json


pd.options.display.max_columns = None

In [2]:
crimes = pd.read_csv("crimes.csv")
l_stop_daily_rides = pd.read_csv("l_stop_daily_rides.csv")




## Community Data

The community data provided by Chicago is available as a CSV; however, at the time of this EDA the file was not written in an easily digestible way, so we use the GeoJSON file of the same data instead. To use it, we load the file and extract the fields we need into a lookup keyed by area number.

The steps below extract the information we want.

In [3]:
with open('bounds.geojson', 'r') as file:
    comm_area_json = json.load(file)


In [4]:

community_dict = {}

# Map each community area number to its community name.
for feature in comm_area_json['features']:
    properties = feature['properties']
    community_dict[int(properties['area_num_1'])] = properties['community']


community_dict[0] = 'UNKNOWN'
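
If a DataFrame is preferred over a dictionary (for example, to merge on the area number), the feature properties can be loaded directly. This is a sketch on a tiny inline stand-in for the GeoJSON file; `toy_geojson` and `community_df` are hypothetical names, and only the `community` and `area_num_1` properties used above are assumed.

```python
import pandas as pd

# Minimal stand-in for the structure of bounds.geojson (geometry omitted).
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": None,
         "properties": {"community": "ROGERS PARK", "area_num_1": "1"}},
        {"type": "Feature", "geometry": None,
         "properties": {"community": "NORTH CENTER", "area_num_1": "5"}},
    ],
}

# One record per feature, keeping only its properties.
records = [feature["properties"] for feature in toy_geojson["features"]]

community_df = (
    pd.DataFrame(records)
    .astype({"area_num_1": "int64"})  # area numbers arrive as strings
    .rename(columns={"area_num_1": "community_area",
                     "community": "community_name"})
)
print(community_df)
```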

# Understanding the Data

A review of the data is necessary to find unnecessary features and data that must be sanitized before it is usable. Below, each dataset is reviewed for feature data types and null/NaN values. Any feature with a high share of nulls will need to be adjusted or dropped entirely.

In [5]:
l_stop_daily_rides.dtypes

station_id      int64
stationname    object
date           object
daytype        object
rides           int64
dtype: object

In [6]:
crimes.dtypes

ID                        int64
Case Number              object
Date                     object
Block                    object
IUCR                     object
Primary Type             object
Description              object
Location Description     object
Arrest                     bool
Domestic                   bool
Beat                      int64
District                float64
Ward                    float64
Community Area          float64
FBI Code                 object
X Coordinate            float64
Y Coordinate            float64
Year                      int64
Updated On               object
Latitude                float64
Longitude               float64
Location                 object
dtype: object
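
The type and null checks can also be combined into one table per dataset. The `summarize` helper below is a hypothetical convenience, not part of the notebook, and is shown on a toy frame.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, null count, and null percentage for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isna().sum(),
        "null_pct": (df.isna().mean() * 100).round(2),
    })


# Toy frame: one complete column and one that is half missing.
toy = pd.DataFrame({
    "rides": [10, 20, 30, 40],
    "ward": [1.0, None, 3.0, None],
})
summary = summarize(toy)
print(summary)
```

Calling `summarize(crimes)` or `summarize(l_stop_daily_rides)` would give the same view for the real data.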

## Updating the Column Names

Below we normalize the column names: spaces are replaced with underscores and all letters are lowercased for easy reference.

In [7]:
c_cols = crimes.columns
l_cols = l_stop_daily_rides.columns


def minimize_cols(cols) -> list[str]:
    """Return the column names lowercased, with spaces replaced by underscores."""
    return [col.lower().replace(" ", "_") for col in cols]



crimes.rename(columns=dict(zip(c_cols,minimize_cols(c_cols))), inplace=True)
l_stop_daily_rides.rename(columns=dict(zip(l_cols,minimize_cols(l_cols))),inplace=True)
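
An alternative sketch of the same normalization uses the vectorized string methods on the column index, shown here on an empty toy frame with a few of the crime column names.

```python
import pandas as pd

# Empty toy frame with a few of the original crime column names.
toy = pd.DataFrame(columns=["Case Number", "Primary Type", "FBI Code", "ID"])

# Lowercase every name and replace spaces with underscores in one pass.
toy.columns = toy.columns.str.lower().str.replace(" ", "_", regex=False)
print(list(toy.columns))
```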




## Cleaning Up the Data

When the files were loaded, most of the columns were read as the object type. Here we update the value types and remove any data that is invalid. Missing values or nulls can be filled depending on the feature and its relevance to the goal.



The CTA ridership data has no missing values. The crime data, however, is missing coordinates (latitude, longitude, and location) in 88,559 records, approximately 1.1% of the total. The ward and community area features have more gaps, over 600,000 records each. Because the vast majority of records are complete, we keep these rows and clean up the gaps instead of dropping them.



In [8]:
crime_nulls = {}
for col in crimes.columns:
    crime_nulls[col] = len(crimes[crimes[col].isna()])
crime_nulls


{'id': 0,
 'case_number': 0,
 'date': 0,
 'block': 0,
 'iucr': 0,
 'primary_type': 0,
 'description': 0,
 'location_description': 12851,
 'arrest': 0,
 'domestic': 0,
 'beat': 0,
 'district': 47,
 'ward': 614851,
 'community_area': 613475,
 'fbi_code': 0,
 'x_coordinate': 88559,
 'y_coordinate': 88559,
 'year': 0,
 'updated_on': 0,
 'latitude': 88559,
 'longitude': 88559,
 'location': 88559}

In [9]:
percent_missing = round(crime_nulls['location'] / len(crimes.index) * 100, 1)
print(f"Percent of locations missing in data set: {percent_missing}%")

Percent of locations missing in data set: 1.1%
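
When reporting a share as a percentage, rounding the ratio first and then multiplying by 100 can reintroduce floating-point noise. The sketch below uses hypothetical counts chosen to give a ratio of 0.011; scaling before rounding, or using the `%` format specifier, keeps the output clean.

```python
# Hypothetical counts chosen so that the ratio is 0.011.
missing, total = 11, 1000
ratio = missing / total

# Rounding first and scaling afterwards leaves float noise in the result.
noisy = round(ratio, 3) * 100

# Scaling first and rounding last gives a clean number.
clean_number = round(ratio * 100, 1)

# The percent format specifier scales and rounds in one step.
clean_text = f"{ratio:.1%}"

print(noisy, clean_number, clean_text)
```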


In [10]:
l_stop_nulls = {}
for col in l_stop_daily_rides.columns:
    l_stop_nulls[col] = len(l_stop_daily_rides[l_stop_daily_rides[col].isna()])
l_stop_nulls

{'station_id': 0, 'stationname': 0, 'date': 0, 'daytype': 0, 'rides': 0}

The date values must be converted to the appropriate datetime data type for both the L stop and crime data sets. 

In [11]:
crimes['date'] = pd.to_datetime(crimes['date'])
l_stop_daily_rides['date'] = pd.to_datetime(l_stop_daily_rides['date'])
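
Passing an explicit format to `pd.to_datetime` can make the parsing stricter and faster on large files. The format string below is an assumption about how the raw crime dates are written (month/day/year with a 12-hour clock) and should be checked against the file before use.

```python
import pandas as pd

# Toy date strings in the assumed raw format, plus one invalid entry.
raw = pd.Series([
    "08/25/2007 09:22:18 AM",
    "05/24/2021 03:06:00 PM",
    "not a date",
])

# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")
print(parsed)
```

Counting `parsed.isna()` afterwards shows how many rows failed to parse.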



### Crime Data

It is helpful to include the community name in the crime data. We use the community dictionary built earlier to create a new column matching the community area number already found in the crime data. The crime data has over 600,000 records with no community area. Because no community uses the number 0, we can safely fill these with 0, convert the feature to an integer, and group the records into an unknown community; they may still be of use for statistics on Chicago as a whole.

We also drop a few features that describe the individual crime record and have no bearing on public transportation.

In [12]:
# Treat missing community areas as 0 (no real community uses 0),
# then cast to a nullable integer type.
crimes['community_area'] = crimes['community_area'].fillna(0).astype('Int64')

# Look up the community name for each area number.
crimes['community_name'] = crimes['community_area'].apply(lambda x: community_dict[x])
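
A dictionary lookup inside `apply` raises a `KeyError` if an area number is missing from the dictionary. As a hedged alternative, `Series.map` leaves unmatched codes as missing, and they can then be labelled. The toy frame and the code 99 below are made up for illustration.

```python
import pandas as pd

# Small stand-in for the community dictionary built from the GeoJSON.
community_dict = {0: "UNKNOWN", 1: "ROGERS PARK", 5: "NORTH CENTER"}

# Toy crime rows: one missing area and one code (99) absent from the dictionary.
toy = pd.DataFrame({"community_area": [1.0, None, 5.0, 99.0]})

# Fill missing areas with 0 and cast to integers before the lookup.
codes = toy["community_area"].fillna(0).astype("int64")

# map() yields NaN for unmatched codes; label those as UNKNOWN as well.
toy["community_name"] = codes.map(community_dict).fillna("UNKNOWN")
print(toy)
```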

In [13]:
crimes.drop(['case_number','x_coordinate','y_coordinate','updated_on','fbi_code'], axis=1, inplace=True)
crimes.set_index('id', inplace=True)

In [14]:
crimes['location'] = crimes['location'].fillna('Not Available').copy()

In [15]:
crimes.head()

| id | date | block | iucr | primary_type | description | location_description | arrest | domestic | beat | district | ward | community_area | year | latitude | longitude | location | community_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5741943 | 2007-08-25 09:22:18 | 074XX N ROGERS AVE | 560 | ASSAULT | SIMPLE | OTHER | False | False | 2422 | 24.0 | 49.0 | 1 | 2007 | NaN | NaN | Not Available | ROGERS PARK |
| 25953 | 2021-05-24 15:06:00 | 020XX N LARAMIE AVE | 110 | HOMICIDE | FIRST DEGREE MURDER | STREET | True | False | 2515 | 25.0 | 36.0 | 19 | 2021 | 41.917838 | -87.755969 | (41.917838056, -87.755968972) | BELMONT CRAGIN |
| 26038 | 2021-06-26 09:24:00 | 062XX N MC CORMICK RD | 110 | HOMICIDE | FIRST DEGREE MURDER | PARKING LOT | True | False | 1711 | 17.0 | 50.0 | 13 | 2021 | 41.995219 | -87.713355 | (41.995219444, -87.713354912) | NORTH PARK |
| 13279676 | 2023-11-09 07:30:00 | 019XX W BYRON ST | 620 | BURGLARY | UNLAWFUL ENTRY | APARTMENT | False | False | 1922 | 19.0 | 47.0 | 5 | 2023 | 41.952345 | -87.677975 | (41.952345086, -87.677975059) | NORTH CENTER |
| 13274752 | 2023-11-12 07:59:00 | 086XX S COTTAGE GROVE AVE | 454 | BATTERY | AGGRAVATED P.O. - HANDS, FISTS, FEET, NO / MIN... | SMALL RETAIL STORE | True | False | 632 | 6.0 | 6.0 | 44 | 2023 | 41.737751 | -87.604856 | (41.737750767, -87.604855911) | CHATHAM |


In [23]:
l_stop_daily_rides['stationname'].unique()


array(['Jefferson Park', 'Cermak-Chinatown', 'Central-Lake',
       'Dempster-Skokie', 'Dempster', 'Lake/State',
       'Oak Park-Forest Park', 'Kedzie-Homan-Forest Park', '35th/Archer',
       'Addison-North Main', 'Main', 'Chicago/State', 'Wellington',
       'Austin-Forest Park', 'Clinton-Lake', 'East 63rd-Cottage Grove',
       'Grand/State', 'Wilson', 'Cicero-Cermak', 'State/Lake', '51st',
       '95th/Dan Ryan', 'Jackson/State', 'Randolph/Wabash',
       'Logan Square', 'Morse', 'Grand/Milwaukee', '69th', 'Paulina',
       'Damen-Brown', 'Washington/Dearborn', 'Kimball', 'Clark/Lake',
       'Lawrence', 'Polk', '47th-Dan Ryan', 'Sedgwick', '54th/Cermak',
       'Ashland/63rd', 'Morgan-Lake', 'Harrison', 'Sheridan', 'Racine',
       'Washington/Wells', 'Quincy/Wells', 'Foster',
       'California/Milwaukee', 'Cermak-McCormick Place',
       'Sox-35th-Dan Ryan', 'Chicago/Milwaukee', "O'Hare Airport",
       'Kedzie-Lake', 'Fullerton', 'Irving Park-Brown',
       'LaSalle/Van Buren'