# pandas exploration
In this assignment you will select a data set and do some munging and analysis of it using `pandas`, Jupyter Notebooks, and associated Python-centric data science tools.

The following lines ensure that `numpy` and `pandas` are installed in the notebook environment. Depending on your system, this step may be unnecessary and can be removed.

In [2]:
# %pip installs into the environment of the running kernel
%pip install numpy
%pip install pandas



Import the core data science libraries:

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


## Import the raw data
In this section, you will import the raw data into a `pandas` DataFrame.

In [None]:
# Load the raw CSV into a DataFrame (assumes the file is in the notebook's working directory)

nyc_leading_cause = pd.read_csv('New_York_City_Leading_Causes_of_Death_20240430.csv')


## Data inspection
In this section, you will show enough of your data for a viewer to get a general sense of how the data is structured and what is distinctive about it. Complete each of the indicated tasks in a Code cell, making sure to include a Markdown cell above each Code cell that explains what is being shown by the code.
- Show 5 rows, selected at random, from the data set.
- Show each of the column names and their data types.
- Show any unique features of your chosen data set.

Feel free to add as many additional cells as you need to help explain the raw data.

In [None]:
# 5 random rows
nyc_leading_cause.sample(5)

In [None]:
# Column names and types
nyc_leading_cause.info()


In [None]:
# unique features
print("Data covers years:", nyc_leading_cause['Year'].min(), "to", nyc_leading_cause['Year'].max()) 
print("Leading causes of death:\n", nyc_leading_cause['Leading Cause'].unique())
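
Before munging, it can help to check whether columns that should be numeric contain placeholder strings. The following is a minimal sketch on a small made-up frame, not the actual data set; the helper name `non_numeric_values` and the sample values are hypothetical.

```python
import pandas as pd

def non_numeric_values(df, column):
    """Return the distinct entries in `column` that cannot be parsed as numbers."""
    as_numbers = pd.to_numeric(df[column], errors="coerce")
    # Entries that were present originally but became NaN after coercion
    bad = df.loc[as_numbers.isna() & df[column].notna(), column]
    return sorted(bad.astype(str).unique())

# Made-up sample that mimics a rate column with '.' marking missing values
sample = pd.DataFrame({"Death Rate": ["12.5", ".", "7.1", "."]})
print(non_numeric_values(sample, "Death Rate"))
```

Running the same helper on each text-typed column of the real DataFrame would show which placeholders, if any, need handling in the munging step.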

## Data munging
Place your **data munging** code and documentation within this section.  
- Keep each of your Code cells short and focused on a single task.  
- Include a Markdown cell above each code cell that describes what task the code within the code cell is performing.
- Make as many code cells as you need to complete the munging - a few have been created for you to start with.
- Display 5 sample rows of the modified data after each transformation so a viewer can see how the data has changed.

**Note**: If you believe that your data set does not require any munging, please explain in detail.  Create Markdown cells that explain your thinking and create Code cells that show any specific structures of the data you refer to in your explanation.

In [None]:

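# Normalize column names (spaces to underscores); assumes the raw headers look like 'Leading Cause'
nyc_leading_cause.columns = nyc_leading_cause.columns.str.strip().str.replace(' ', '_')
# Missing rates are recorded as '.'; replace them with 0 and convert both rate columns to float
nyc_leading_cause['Death_Rate'] = nyc_leading_cause['Death_Rate'].replace('.', 0).astype(float)
nyc_leading_cause['Age_Adjusted_Death_Rate'] = nyc_leading_cause['Age_Adjusted_Death_Rate'].replace('.', 0).astype(float)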
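
Replacing `'.'` with 0 treats a missing rate as a true zero, which pulls group means downward. One alternative, sketched here on made-up values, is to convert the placeholder to `NaN` with `pd.to_numeric(..., errors='coerce')`; pandas skips `NaN` by default in `mean()`. The column name and numbers below are illustrative only.

```python
import pandas as pd

# Made-up rate column with '.' marking a missing value
rates = pd.DataFrame({"Death_Rate": ["10.0", ".", "20.0"]})

# Approach used in the cell above: '.' becomes 0 and is counted in the mean
as_zero = rates["Death_Rate"].replace(".", 0).astype(float)

# Alternative: '.' becomes NaN and is skipped by mean()
as_nan = pd.to_numeric(rates["Death_Rate"], errors="coerce")

print(as_zero.mean())  # 10.0
print(as_nan.mean())   # 15.0
```

Which choice is appropriate depends on whether a `'.'` in the source means "no deaths" or "not reported"; that should be checked against the data set's documentation.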



## Data analysis
Place your **data analysis** code and documentation within this section.
- Perform at least 5 different statistical or other analyses of different aspects of the data.
    - Your analyses must be specific and relevant to your chosen data set and show interesting aspects of it.
    - Include at least one analysis that includes grouping rows by a shared attribute and performing some kind of statistical analysis on each group.
    - Sort the data in at least 1 of your analyses. Note that sorting alone does not constitute an analysis.
- Keep each of your Code cells short and focused on a single task.
- Include a Markdown cell above each Code cell that describes what task the code within the Code cell is performing.
- Make as many code cells as you need to complete the analysis - a few have been created for you to start with.

In [None]:
# Leading causes of death overall
print("Leading causes of death overall:")
print(nyc_leading_cause.groupby('Leading_Cause')['Deaths'].sum().sort_values(ascending=False).head(10))


In [None]:

# Death rates by gender
print("\nDeath rates by gender:")
print(nyc_leading_cause.groupby('Sex')['Age_Adjusted_Death_Rate'].mean())


In [None]:
# Age-adjusted death rates over time for top causes
top_causes = nyc_leading_cause.groupby('Leading_Cause')['Deaths'].sum().nlargest(3).index
print(f"\nAge-adjusted death rates over time for: {', '.join(top_causes)}")
# Keep only rows for the top causes, then average the rate per cause and year
top_rows = nyc_leading_cause[nyc_leading_cause['Leading_Cause'].isin(top_causes)]
print(top_rows.groupby(['Leading_Cause', 'Year'])['Age_Adjusted_Death_Rate'].mean().unstack())
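
As a further grouped analysis, one could compute each cause's share of total deaths within a year. The sketch below uses a tiny made-up frame with the same column names as above; the causes and counts are invented for illustration.

```python
import pandas as pd

# Made-up data: two causes over two years
toy = pd.DataFrame({
    "Year": [2010, 2010, 2011, 2011],
    "Leading_Cause": ["A", "B", "A", "B"],
    "Deaths": [30, 10, 20, 20],
})

# Total deaths per (year, cause), then divide by that year's total
by_year_cause = toy.groupby(["Year", "Leading_Cause"])["Deaths"].sum()
share = by_year_cause / by_year_cause.groupby(level="Year").transform("sum")

# Sort so the largest shares appear first
print(share.sort_values(ascending=False))
```

Because the shares are normalized per year, they sum to 1 within each year, which makes years with different total death counts comparable.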


## Data visualization
In this section, you will create a few **visualizations** that show some of the insights you have gathered from this data.
- Create at least 5 different visualizations, where each visualization shows different insights into the data.
- Use at least 3 different visualization types (e.g. bar charts, line charts, stacked area charts, pie charts).
- Create a Markdown cell and a Code cell for each, where you explain and show the visualizations, respectively.
- Create as many additional cells as you need to prepare the data for the visualizations.

In [None]:
# leading causes of death
cause_totals = nyc_leading_cause.groupby('Leading_Cause')['Deaths'].sum().nlargest(10)
cause_totals.plot(kind='bar', figsize=(10,6))
plt.title("Top 10 Leading Causes of Death")
plt.xlabel("Cause")
plt.ylabel("Number of Deaths")
plt.show()

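In [None]:
# heart disease vs. cancer over time
# Match by substring in case the cause names carry extra text such as codes (an assumption)
pattern = 'Diseases of Heart|Malignant Neoplasms'
heart_vs_cancer = nyc_leading_cause[nyc_leading_cause['Leading_Cause'].str.contains(pattern, na=False)]
# Group by Year first so that Year is the x-axis and each cause is drawn as its own line
heart_vs_cancer.groupby(['Year', 'Leading_Cause'])['Age_Adjusted_Death_Rate'].mean().unstack().plot(figsize=(10,6))
plt.title("Heart Disease vs. Cancer Age-Adjusted Death Rates")
plt.xlabel("Year")
plt.ylabel("Age-Adjusted Death Rate")
plt.show()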
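
The two charts above are a bar chart and a line chart. A third chart type could be a stacked area chart; the following is a minimal sketch on made-up numbers, assuming `Year`, `Sex`, and `Deaths` columns like those used earlier. The values are invented and do not come from the data set.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: deaths per year, split by sex
toy = pd.DataFrame({
    "Year": [2010, 2010, 2011, 2011, 2012, 2012],
    "Sex": ["F", "M", "F", "M", "F", "M"],
    "Deaths": [100, 120, 90, 110, 95, 105],
})

# One row per year, one column per sex
deaths_by_sex = toy.pivot_table(index="Year", columns="Sex", values="Deaths", aggfunc="sum")

# Stacked area chart: the top edge shows the yearly total
ax = deaths_by_sex.plot(kind="area", stacked=True, figsize=(10, 6))
ax.set_title("Deaths by Sex Over Time (made-up data)")
ax.set_xlabel("Year")
ax.set_ylabel("Number of Deaths")
plt.show()
```

The same pivot-then-plot pattern should carry over to the real DataFrame once its columns are numeric.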