> The Mexico City subway network serves the fifth-largest city in the world, home to 21 million inhabitants, and is essential for people's commutes to work, school, and daily activities. With so many users, the subway needs to be in prime condition to operate efficiently and on time. The dataset for this EDA reports daily users by line and station, which can be used to identify the most used stations by entry point and date, as well as the growth in users by year or month. These insights tell the government and riders which stations may need more maintenance because they carry more users, and they also show how current usage relates to the subway's capacity. The latter can assist urban developers.
> The dataset in this exploratory data analysis was created by Mexico City's Ministry of Mobility and published on the Mexico City Data Portal, with its last update on January 19, 2024. It measures the daily influx of users in Mexico City's metro from January 2021 to December 2023, broken down by station and type of payment. It includes 640,576 observations and seven variables:
- date
- month 
- year 
- metro line
- metro station
- type of payment 
- user influx


> The data was not clean: the month variable had missing values. This was resolved by creating a new variable that extracts the month from the date column.
>  A second dataset containing more information about the stations was merged into the dataset using the station name as key. This new dataset has the following information for each station:
- name of the public transportation system: SISTEMA
- name of station: NOMBRE
- name of line: LINEA
- consecutive number of the station in the subway line: EST
- station key: CVE_EST
- station key for a metropolitan mobility survey: CVE_EOD17
- type of station (terminal, intermediate or transfer): TIPO
- borough (alcaldía) where the station is located: ALCALDIAS
- year in which the station started operating: AÑO

> A third dataset containing the number of trains per line was added using the line name as key. This dataset was collected from the metro's public website.

In [86]:
import pandas as pd
from simpledbf import Dbf5

stc_users = pd.read_csv("afluenciastc_desglosado_12_2023.csv")
dbf = Dbf5("stcmetro_shp/STC_Metro_estaciones_utm14n.dbf")
stc_location = dbf.to_dataframe()

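# Rebuild the month from the date column, since the original month variable has missing values
stc_users["fecha2"] = pd.to_datetime(stc_users["fecha"])
stc_users["month"] = stc_users["fecha2"].dt.month_name()
stc_users["week_day"] = stc_users["fecha2"].dt.day_name()
# Flag Saturdays and Sundays as 1, all other days as 0
stc_users["weekend"] = stc_users["week_day"].isin(["Saturday", "Sunday"]).astype(int)

# Create the number-of-trains dataset (counts collected from the metro's public website)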
trenes_db = {
    "linea": [
        "Línea 1",
        "Línea 2",
        "Línea 3",
        "Línea 4",
        "Línea 5",
        "Línea 6",
        "Línea 7",
        "Línea 8",
        "Línea 9",
        "Línea A",
        "Línea B",
        "Línea 12",
    ],
    "trenes": [50, 41, 54, 14, 25, 15, 32, 30, 34, 33, 36, 30],
}
trenes_db = pd.DataFrame(trenes_db)

# merge station info dataset
stc_user_location = pd.merge(
    stc_users, stc_location, how="left", left_on="estacion", right_on="NOMBRE"
)

stc_user_location = pd.merge(
    stc_user_location, trenes_db, how="left", left_on="linea", right_on="linea"
)
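
One thing worth checking in the merges above: a transfer station usually appears once per line in a station catalogue, so joining on the station name alone can match one ridership row to several catalogue rows and inflate the totals. The sketch below uses small made-up frames to show how the `validate` argument of `merge` surfaces that situation and how a composite key avoids it. The station rows, the counts, and the `linea_key` column are illustrative assumptions, not the contents of the real files, whose line labels may need normalizing before they can be used as a key.

```python
import pandas as pd

# Toy stand-ins for stc_users and stc_location; every value here is invented.
users = pd.DataFrame({
    "estacion": ["Pantitlán", "Pantitlán", "Zócalo"],
    "linea": ["Línea 1", "Línea 9", "Línea 2"],
    "afluencia": [100, 200, 300],
})
stations = pd.DataFrame({
    "NOMBRE": ["Pantitlán", "Pantitlán", "Zócalo"],
    # Hypothetical column: a line label already normalized to match users["linea"].
    "linea_key": ["Línea 1", "Línea 9", "Línea 2"],
    "TIPO": ["Terminal", "Terminal", "Intermedia"],
})

# Joining on the name alone repeats each row of the transfer station.
by_name = users.merge(stations, how="left", left_on="estacion", right_on="NOMBRE")
rows_added = len(by_name) - len(users)

# validate="many_to_one" raises MergeError when the right-hand key is not unique.
try:
    users.merge(stations, how="left", left_on="estacion", right_on="NOMBRE",
                validate="many_to_one")
    name_key_is_unique = True
except pd.errors.MergeError:
    name_key_is_unique = False

# Joining on station and line keeps exactly one catalogue row per ridership row.
by_name_line = users.merge(
    stations, how="left",
    left_on=["estacion", "linea"], right_on=["NOMBRE", "linea_key"],
    validate="many_to_one",
)
total_preserved = bool(by_name_line["afluencia"].sum() == users["afluencia"].sum())
```

If `rows_added` is greater than zero on the real data, sums of `afluencia` computed after the merge would be overstated for the affected stations.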


3. Plot the data, demonstrating interesting features that you discover. Are there any relationships between variables that were surprising or patterns that emerged? Please exercise creativity and curiosity in your plots. You should have at least 4 plots exploring the data in different ways.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
# graph 1
users_year = (
    stc_user_location.groupby(["anio", "linea"])["afluencia"].sum().reset_index()
)
plt.figure(figsize=(10, 4))
sns.barplot(x="anio", y="afluencia", hue="linea", data=users_year)
plt.xlabel("Year")
plt.ylabel("User influx")
plt.title("Graph 1. User influx by line and year", y=1.02)
plt.legend(bbox_to_anchor=(0.5, -0.15), loc="upper center", ncol=3)

# graph 2
d1 = stc_user_location.groupby(["fecha2", "tipo_pago"])["afluencia"].sum().reset_index()
plt.figure(figsize=(20, 10))
sns.lineplot(x="fecha2", y="afluencia", hue="tipo_pago", data=d1)
plt.xlabel("Date")
plt.ylabel("User influx")
plt.title("Graph 2. Daily user influx by type of payment from 2021 to 2023")
plt.legend(title="Type of payment")
plt.show()


# graph 3
users_2023 = stc_user_location[stc_user_location["anio"] == 2023]
users_2023 = (
    users_2023.groupby(["estacion", "linea", "TIPO"])["afluencia"].sum().reset_index()
)
users_2023_top_50 = users_2023.sort_values(by="afluencia", ascending=False).head(50)


plt.figure(figsize=(12, 8))
sns.barplot(
    x="afluencia",
    y="estacion",
    hue="linea",
    data=users_2023_top_50,
    linewidth=8,
    dodge=False,
)
plt.ylabel("Station")
plt.xlabel("User influx")
plt.title("Graph 3. User influx for the top 50 stations in 2023, highlighting line", y=1.02)
plt.legend(bbox_to_anchor=(0.5, -0.15), loc="upper center", ncol=3)

# graph 4
plt.figure(figsize=(12, 8))
sns.barplot(
    x="afluencia",
    y="estacion",
    hue="TIPO",
    data=users_2023_top_50,
    linewidth=8,
    dodge=False,
)
plt.ylabel("Station")
plt.xlabel("User influx")
plt.title(
    "Graph 4. User influx for the top 50 stations in 2023, highlighting station type",
    y=1.02,
)
plt.legend(bbox_to_anchor=(0.5, -0.15), loc="upper center", ncol=3)

# graph 5
users_2023_line = stc_user_location[stc_user_location["anio"] == 2023]

users_2023_line = (
    users_2023_line.groupby(["linea", "fecha2", "trenes"])["afluencia"]
    .sum()
    .reset_index()
)

users_2023_line_avg = (
    users_2023_line.groupby(["linea", "trenes"])["afluencia"].mean().reset_index()
)

plt.figure(figsize=(10, 6))
groups = users_2023_line_avg.groupby("linea")
for name, group in groups:
    plt.plot(
        group.trenes,
        group.afluencia,
        marker="o",
        linestyle="",
        markersize=12,
        label=name,
    )
for i, label in enumerate(users_2023_line_avg["linea"]):
    plt.text(
        users_2023_line_avg["trenes"][i],
        users_2023_line_avg["afluencia"][i],
        label,
        fontsize=8,
        ha="right",
        va="bottom",
    )

plt.ylabel("Average daily user influx")
plt.xlabel("Trains in service")
plt.title("Graph 5. Average daily user influx and trains in 2023 by line", y=1.02)
plt.show()

4. What insights are you able to take away from exploring the data? Is there a reason why analyzing the dataset you chose is particularly interesting or important? Summarize this for a general audience (imagine you're publishing a blog post online) - boil down your findings in a way that is accessible, but still accurate.


I was born and raised in Mexico City, and my reliance on the public transportation system stems from two reasons: firstly, my aversion to driving, and secondly, the abysmal state of car transit, making the subway a quicker alternative. However, my experiences with the metro have been marked by a noticeable lack of maintenance and severe overcrowding. Furthermore, the newest metro line was recently closed for an entire year due to construction issues, which led to a tragic incident that claimed 27 lives. Recognizing the variations in metro usage across stations and lines is crucial for both authorities and citizens, providing valuable insights for service improvement.

While the city's transport authority strives to publish user-related information, infrastructure details are largely missing. Information on trains only covers the total count per line, leaving out train size, capacity, years in use, and days in service. These limitations hinder accountability, but even with the available data, intriguing insights can be gleaned.

As depicted in Graph 1, yearly totals show consistently high usage on lines 2, 3, 8, and 9. Graph 2 reveals a notable shift towards prepaid cards over tickets in the past three years, with an unexpected decline in single ticket usage in the last quarter of 2023. This decline warrants further exploration to understand its origin and its implications for both citizens and the transportation authority.
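
One way to follow up on the ticket decline is to look at each payment type's share of monthly ridership rather than its raw count, so that overall growth does not mask the shift. The sketch below is a minimal illustration on invented numbers. It assumes a frame shaped like `d1` from Graph 2 (columns `fecha2`, `tipo_pago`, `afluencia`), and the payment labels `Boleto` and `Prepago` are assumptions that should be checked against the actual values of `tipo_pago`.

```python
import pandas as pd

# Invented daily totals shaped like d1; the payment labels are assumed.
d1 = pd.DataFrame({
    "fecha2": pd.to_datetime(
        ["2023-09-01", "2023-09-01", "2023-10-01", "2023-10-01"]
    ),
    "tipo_pago": ["Boleto", "Prepago", "Boleto", "Prepago"],
    "afluencia": [40, 60, 10, 90],
})

# Total users per calendar month and payment type, one column per payment type.
monthly = (
    d1.groupby([d1["fecha2"].dt.to_period("M"), "tipo_pago"])["afluencia"]
    .sum()
    .unstack("tipo_pago")
)

# Divide each row by its monthly total to get shares that sum to 1.
share = monthly.div(monthly.sum(axis=1), axis=0)
```

A falling `Boleto` column in `share` alongside stable totals would point to riders switching payment method rather than leaving the system.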

Graphs 3 and 4 unveil the most frequented stations, providing authorities with valuable information to allocate resources for maintenance. Additionally, by categorizing stations as terminal, intermediate, or transfer, it becomes possible to gauge the maintenance needs and the ramifications of potential closures for service.

The concluding graph illustrates the average daily users per metro line and the corresponding number of operating trains. Notably, the heavily used lines 2, 3, 8, and 9 don't necessarily have the highest train counts. This raises questions for further exploration. Why do more congested lines have fewer trains than less congested ones? Is this linked to train type or capacity? What is the average passenger load compared to capacity?
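
A rough first pass at these questions is to divide average daily users by the number of trains on each line. The sketch below shows the calculation on a frame shaped like `users_2023_line_avg`. The train counts are the ones listed in `trenes_db`, but the ridership figures are placeholders invented for illustration, not results from the dataset. Without train capacity, the ratio is only a proxy for crowding.

```python
import pandas as pd

# Train counts as listed in trenes_db; the afluencia values are placeholders.
line_avg = pd.DataFrame({
    "linea": ["Línea 1", "Línea 2", "Línea 3"],
    "trenes": [50, 41, 54],
    "afluencia": [1000.0, 2050.0, 1080.0],  # invented average daily users
})

# Average daily users carried per train on each line.
line_avg["users_per_train"] = line_avg["afluencia"] / line_avg["trenes"]

# Lines with the highest load per train come first.
ranked = line_avg.sort_values("users_per_train", ascending=False).reset_index(drop=True)
```

Replacing the placeholder frame with `users_2023_line_avg` would give the same ranking for the real 2023 averages.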

This exploratory data analysis of daily users in Mexico City's subway system is a significant step towards comprehending its usage patterns. It serves as an accountability tool, urging authorities to allocate resources for enhancement and maintenance. Regrettably, the data currently available is insufficient to gauge the system's response to demand, as crucial information on the state of trains, stations, and service remains elusive.

References:
- https://www.metro.cdmx.gob.mx/parque-vehicular
- https://datos.cdmx.gob.mx/dataset/afluencia-diaria-del-metro-cdmx/resource/cce544e1-dc6b-42b4-bc27-0d8e6eb3ed72
- https://datos.cdmx.gob.mx/dataset/lineas-y-estaciones-del-metro/resource/288b10dd-4f21-4338-b1ed-239487820512 