<a href="https://colab.research.google.com/github/eamonzhang777-spec/Final-project/blob/main/Pandas_%26_Data_Visualization_with_Plotly_Final_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Uni: yz4892

Name: Yimeng Zhang


# 1. Introduction
## Background
With urbanization and shifts in demographic structure, public health in metropolitan areas has become a growing focus. As one of the most densely populated cities in the United States, and in the world, New York City offers a valuable case for studying the distribution of and trends in causes of death.
Over the past fifteen years, NYC has faced multiple public health challenges, including seasonal influenza, the persistent threat of chronic diseases, and the COVID-19 pandemic beginning in 2020. These events profoundly affected the health and mortality of city residents. Analyzing causes of death makes it possible both to assess the effectiveness of existing public health policies and to provide trend data that supports future resource allocation and intervention strategies.
## Objectives
Understanding the main causes of death is fundamental to effective public health policymaking and efficient allocation of healthcare resources. This project analyzes the top five causes of death in New York City from 2007 to 2021, based on data from [NYC Open Data](https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam/about_data). By identifying trends and patterns in mortality, it aims to help policymakers and health departments identify the leading risks to public health in New York City. The analysis could inform the design of targeted interventions, such as disease prevention programs and public health projects.
## Data Sources
The dataset is provided by the NYC Department of Health and covers a complete 15-year period from 2007 to 2021. The original dataset contains 2,102 records, 843,059 total deaths, and 7 variables: Year, Leading Cause, Sex, Race Ethnicity, Deaths, Death Rate, and Age Adjusted Death Rate [(NYC Open Data, 2025)](https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam/about_data).
# 2. Methodology
## Processing Tools (Pandas + Plotly)
This study relies on two core Python libraries: Pandas and Plotly. Pandas is a powerful data processing library that offers efficient data cleaning, transformation, and aggregation. Plotly is a practical data visualization library that presents time-series data clearly and directly [(Stack Abuse, 2023)](https://stackabuse.com/introduction-to-data-visualization-in-python-with-pandas/).

The name Pandas refers to Python data analysis, and the library is useful for working with both numeric and text data stored in tabular format [(Krisel, 2025)](https://docs.google.com/presentation/d/1DSsHqu4yPCiaG7WzqIM2XVGcArT5P_C23iWNRdBPsmI/edit?slide=id.g20c42a2e460_0_104#slide=id.g20c42a2e460_0_104). In this study, the dataset is panel data: 2,102 observations over multiple time periods (2007-2021) across 7 variables. Pandas handles the key preparation tasks for such data, including standardizing column names, converting data types, and removing missing values. It also groups the data, aggregating death counts by cause and year, which establishes a solid foundation for the subsequent analysis.

Plotly was adopted in this study for interactive visualization. It transforms data into dynamic charts that present trends clearly and intuitively. An interactive line chart with features such as zooming not only illustrates the trends of the five leading causes of death over 15 years but also shows the underlying data details, which strengthens the communication and impact of the analysis. Compared with Matplotlib, Plotly builds this interactive detail into the chart itself, offering stronger support for exploratory analysis [(Towards Data Science, 2024)](https://towardsdatascience.com/seven-key-features-you-should-know-for-creating-professional-visualizations-with-plotly-f89558de5d0c/).
## Data Processing
First, column names were standardized by stripping surrounding whitespace and applying consistent capitalization, which prevents bugs caused by inconsistent casing or stray spaces in headers. Second, the 'Deaths' column was converted to a numeric type, with values that could not be converted set to missing. Because the original data contain missing values, rows without a usable death count were then removed. In the analysis phase, the data were first grouped by cause of death and year, and the total number of deaths was calculated for each cause in each year. The top five leading causes of death were then identified by total number of deaths over the whole period.
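
To make the effect of these steps visible, here is a minimal sketch on a toy table. The rows, the `"."` placeholder, and the helper names `clean` and `top_causes` are all hypothetical and chosen for illustration; only the Pandas calls mirror the steps described above.

```python
import pandas as pd

# Hypothetical toy rows with messy headers and one non-numeric death count.
raw = pd.DataFrame({
    " year ": [2020, 2020, 2021, 2021, 2021],
    "leading cause": ["Heart", "Cancer", "Heart", "Cancer", "Flu"],
    "DEATHS": ["10", "7", "12", ".", "3"],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize headers, coerce Deaths to numbers, drop unusable rows."""
    out = df.copy()
    # " year " -> "Year", "DEATHS" -> "Deaths"
    out.columns = out.columns.str.strip().str.title()
    # Values that cannot be parsed (here ".") become NaN instead of raising.
    out["Deaths"] = pd.to_numeric(out["Deaths"], errors="coerce")
    return out.dropna(subset=["Deaths"])


def top_causes(df: pd.DataFrame, n: int) -> list:
    """Return the n causes with the largest total deaths, largest first."""
    totals = df.groupby("Leading Cause")["Deaths"].sum()
    return totals.nlargest(n).index.tolist()


cleaned = clean(raw)
print(cleaned.columns.tolist())
print(top_causes(cleaned, 2))
```

Working on a copy keeps the raw table available for comparison, which can help when checking how many rows the cleaning step removed.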
## Visualization
To make the trends clear, this study uses an interactive line chart as the visualization format. A line chart shows long-term and short-term trends clearly, making increases and decreases intuitive. It also allows multiple categories to be compared over time, with the different causes of death drawn as separate lines on the same axes. Each colored line represents one cause of death, and data point markers indicate the individual values and improve readability. The chart design follows common data visualization practices, including clear axis labels and a responsive layout.



# 3. Results

In [1]:
# Step 1: Import libraries
import pandas as pd
import plotly.express as px

In [2]:
# Step 2: Load dataset
url = "https://docs.google.com/spreadsheets/d/1Qdc_xh2ZRnuQ0_Z_NX5zSZmDAXvO-5JmdBUSsbfemgA/export?format=csv"
df = pd.read_csv(url)

In [3]:
# Step 3: Inspect data structure
print(df.head())
print(df.columns)



   Year                                   Leading Cause     Sex  \
0  2021  Diseases of Heart (I00-I09, I11, I13, I20-I51)    Male   
1  2021                       Alzheimer's Disease (G30)  Female   
2  2021  Diseases of Heart (I00-I09, I11, I13, I20-I51)  Female   
3  2021           Malignant Neoplasms (Cancer: C00-C97)    Male   
4  2021       Cerebrovascular Disease (Stroke: I60-I69)    Male   

          Race Ethnicity Deaths Death Rate Age Adjusted Death Rate  
0     Not Stated/Unknown    190        NaN                     NaN  
1     Not Stated/Unknown      7        NaN                     NaN  
2     Not Stated/Unknown    113        NaN                     NaN  
3     Not Stated/Unknown     84        NaN                     NaN  
4  Other Race/ Ethnicity     11        NaN                     NaN  
Index(['Year', 'Leading Cause', 'Sex', 'Race Ethnicity', 'Deaths',
       'Death Rate', 'Age Adjusted Death Rate'],
      dtype='object')


In [4]:
rows, cols = df.shape
rows, cols

(2102, 7)

In [5]:
# Step 4: Explore & Clean
# a. Clean column names (strip surrounding whitespace, normalize to Title Case)
df.columns = df.columns.str.strip().str.title()
# b. Convert 'Deaths' column to numeric, coercing errors
df['Deaths'] = pd.to_numeric(df['Deaths'], errors='coerce')
# c. Drop rows where 'Deaths' is NaN after coercion
df.dropna(subset=['Deaths'], inplace=True)

In [6]:
# Step 5: Summarization
# a. Group data by Leading Cause and Year, summing total deaths
grouped_df = df.groupby(['Leading Cause', 'Year'])['Deaths'].sum().reset_index()
# b. Identify top 5 causes of death by total deaths
top5_causes = (
    grouped_df.groupby('Leading Cause')['Deaths']
    .sum()
    .nlargest(5)
    .index
)
# c. Filter for only the top 5 causes
filtered_df = grouped_df[grouped_df['Leading Cause'].isin(top5_causes)]

In [7]:
# Step 6: Visualization (Pandas + Plotly)
# plotly.express was already imported as px in Step 1
fig = px.line(
    filtered_df,
    x='Year',
    y='Deaths',
    color='Leading Cause',
    markers=True,
    title='Top 5 Leading Causes of Death Over Time'
)
fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Total Deaths',
    legend_title='Leading Cause',
    template='plotly_white'
)

fig.show()

# 4. Discussion & Conclusion
## Outputs & Findings
The output suggests that heart disease was the persistent leading cause of death during 2007-2021, averaging over 17,000 deaths per year. Heart disease deaths surged in 2020 before returning to a downward trend in 2021, possibly because of the COVID-19 pandemic. Cancer (approximately 13,000 deaths per year) was the second leading cause of death and began declining around 2016. COVID-19 emerged as a major cause in 2020-2021, causing 21,241 deaths in 2020 and dropping by over 60% in 2021 alongside the vaccination rollout. Influenza/pneumonia deaths remained relatively stable. "All Other Causes" increased by 75% over the period, potentially reflecting an aging population.


First, the persistent dominance of heart disease reflects the complexity of its prevention and control, although the overall trend is encouraging. Second, the steady decline in cancer mortality suggests that cancer prevention and control efforts over the past decade have been effective. Third, the sudden rise in COVID-19 mortality highlights the vulnerability of modern cities to infectious diseases.
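
Summary figures of this kind (annual averages and percent changes) can be derived from a table of yearly totals. The sketch below uses hypothetical numbers and cause names; in the notebook the input would be a frame shaped like `filtered_df`, with the columns Leading Cause, Year, and Deaths.

```python
import pandas as pd

# Hypothetical yearly totals, shaped like the grouped data in the notebook.
yearly = pd.DataFrame({
    "Leading Cause": ["Cause A"] * 3 + ["Cause B"] * 3,
    "Year": [2019, 2020, 2021] * 2,
    "Deaths": [100, 150, 120, 80, 60, 40],
})

# Annual average deaths per cause across the years present.
annual_mean = yearly.groupby("Leading Cause")["Deaths"].mean()

# Wide table: one row per year, one column per cause.
wide = yearly.pivot(index="Year", columns="Leading Cause", values="Deaths")

# Year-over-year percent change (the first year has no predecessor, so NaN).
yoy_pct = wide.diff() / wide.shift() * 100

# Percent change from the first to the last year of the period.
total_pct = (wide.iloc[-1] / wide.iloc[0] - 1) * 100

print(annual_mean.round(1))
print(yoy_pct.round(1))
print(total_pct.round(1))
```

Computing these values in code rather than reading them off the chart makes the quoted percentages reproducible.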
## Limitations and Future Directions
Limitations: The analysis is based on raw death counts that were not age-standardized, so the results are likely influenced by changes in population structure. In addition, this study did not examine the effect of socio-demographic factors, such as race, on mortality risk. Future directions: conducting age-standardized analyses to remove the effects of population structure, and incorporating socio-demographic factors and intervention policies to evaluate the trends.
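
For reference, direct age standardization weights each age-specific death rate by that age group's share of a fixed standard population. The sketch below is illustrative only. The age groups, counts, and standard population are hypothetical, and they are not among the seven variables of the dataset used here, which reports an `Age Adjusted Death Rate` column instead.

```python
import pandas as pd

# Hypothetical age-specific counts for one cause in one year, plus a
# hypothetical standard population used as fixed weights.
age = pd.DataFrame({
    "age_group": ["0-44", "45-64", "65+"],
    "deaths": [10, 50, 240],
    "population": [100_000, 50_000, 30_000],
    "standard_pop": [600_000, 250_000, 150_000],
})

# Crude rate per 100,000: total deaths over total population.
crude = age["deaths"].sum() / age["population"].sum() * 100_000

# Direct standardization: weight each age-specific rate by the share of the
# standard population in that age group, then sum.
age_rate = age["deaths"] / age["population"] * 100_000
weights = age["standard_pop"] / age["standard_pop"].sum()
adjusted = (age_rate * weights).sum()

print(round(crude, 1), round(adjusted, 1))
```

Because the weights stay fixed, adjusted rates from different years can be compared without the comparison being driven by a shifting age distribution.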
## Conclusion
In summary, the structure of causes of death in NYC from 2007 to 2021 shows improvement in heart disease control while revealing the challenge posed by emerging threats such as COVID-19. Heart disease and cancer still require continuous attention. The COVID-19 pandemic highlights that dynamic disease surveillance systems, responsive emergency mechanisms, and efficient vaccine systems should be priorities in future public health development.

