# Project 1 - Leading causes of death
### Charlotte Bacchetta
### Date: 2024-11-08


We will:
- Load and explore a new dataset
- Calculate mean, median, and mode using both `pandas` and the standard library
- Visualize the data using basic Python output techniques

---

In [1]:
import plotly.io as pio

pio.renderers.default = "vscode+jupyterlab+notebook_connected"

## Step 1: Load the Dataset

Link to the dataset: https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam/about_data

The dataset used in this project should contain at least one numeric column. We will start by loading the dataset and inspecting its contents.

_Instructions:_ Update the `file_path` variable with the path to your dataset.

In [19]:
# Step 1: Import Libraries and Load Dataset
# In this step, we load, explore, and prepare the dataset to ensure it meets project criteria.

# Import necessary libraries
import pandas as pd

# Define the file path for the dataset
file_path = '/Users/charlottebacchetta/Desktop/Columbia/Computing in Context - U6006/Project 1/New_York_City_Leading_Causes_of_Death_20241108.csv'

# Load the dataset into a pandas DataFrame
data = pd.read_csv(file_path)

# Display the first few rows to get an initial look at the data
print("First few rows of the dataset:")
display(data.head())

# Display basic information about the dataset, including column names, data types, and number of entries
print("\nDataset Information before conversion:")
data.info()

# Check for missing values in each column to assess data quality
print("\nMissing values per column before conversion:")
missing_values = data.isnull().sum()
print(missing_values)

# Convert 'Deaths' column to numeric, forcing non-numeric values to NaN
# This step is necessary to ensure we have clean numeric data for analysis
data['Deaths'] = pd.to_numeric(data['Deaths'], errors='coerce')

# Drop rows where 'Deaths' is NaN after conversion to keep only valid numeric data
data = data.dropna(subset=['Deaths'])

# Display updated information about the dataset after conversion
print("\nDataset Information after 'Deaths' column conversion:")
data.info()

# Verify that the dataset meets project requirements:
# - At least one numeric column (confirmed with 'Deaths')
# - Between 1,000 and 1,000,000 rows

# Check for numeric columns (after conversion)
numeric_columns = data.select_dtypes(include=['number']).columns.tolist()
print("\nNumeric columns:", numeric_columns)

# Check the number of rows to confirm dataset size
num_rows = data.shape[0]
print(f"\nNumber of rows after cleanup: {num_rows}")

# Confirm if the dataset meets the row criteria
if 1_000 <= num_rows <= 1_000_000:
    print("The dataset meets the row count criteria (1,000 to 1,000,000 rows).")
else:
    print("The dataset does NOT meet the row count criteria.")

# Explanation:
# - We converted the 'Deaths' column to numeric to ensure data consistency.
# - `errors='coerce'` converts non-numeric values in 'Deaths' to NaN, which we then drop.
# - This prepares the dataset for accurate analysis in subsequent steps.


First few rows of the dataset:


|   | Year | Leading Cause | Sex | Race Ethnicity | Deaths | Death Rate | Age Adjusted Death Rate |
|---|------|---------------|-----|----------------|--------|------------|-------------------------|
| 0 | 2011 | Nephritis, Nephrotic Syndrome and Nephrisis (N... | F | Black Non-Hispanic | 83 | 7.9 | 6.9 |
| 1 | 2009 | Human Immunodeficiency Virus Disease (HIV: B20... | F | Hispanic | 96 | 8.0 | 8.1 |
| 2 | 2009 | Chronic Lower Respiratory Diseases (J40-J47) | F | Hispanic | 155 | 12.9 | 16.0 |
| 3 | 2008 | Diseases of Heart (I00-I09, I11, I13, I20-I51) | F | Hispanic | 1445 | 122.3 | 160.7 |
| 4 | 2009 | Alzheimer's Disease (G30) | F | Asian and Pacific Islander | 14 | 2.5 | 3.6 |



Dataset Information before conversion:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2102 entries, 0 to 2101
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Year                     2102 non-null   int64 
 1   Leading Cause            2102 non-null   object
 2   Sex                      2102 non-null   object
 3   Race Ethnicity           2102 non-null   object
 4   Deaths                   2102 non-null   object
 5   Death Rate               1759 non-null   object
 6   Age Adjusted Death Rate  1759 non-null   object
dtypes: int64(1), object(6)
memory usage: 115.1+ KB

Missing values per column before conversion:
Year                         0
Leading Cause                0
Sex                          0
Race Ethnicity               0
Deaths                       0
Death Rate                 343
Age Adjusted Death Rate    343
dtype: int64

Dataset Information after 'Deaths' column conver
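
As a minimal sketch of what `errors='coerce'` does in the cell above, the example below converts a small hypothetical series. The sample values and the `"."` placeholder are assumptions for illustration, not entries taken from the dataset.

```python
import pandas as pd

# Hypothetical sample: the "." placeholder is an assumption, not a value from the dataset
raw = pd.Series(["83", "96", ".", "1445"], name="Deaths")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
converted = pd.to_numeric(raw, errors="coerce")

# Count the values that could not be converted before dropping them
n_invalid = int(converted.isna().sum())
cleaned = converted.dropna()

print(f"Invalid entries: {n_invalid}")
print(cleaned.tolist())
```

Counting the NaN values before calling `dropna()` shows how many rows the cleanup removes, which is worth checking whenever rows are dropped silently.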

## Step 2: Identify a Numeric Column
We'll select a numeric column from the dataset to analyze. This column will be used to calculate summary statistics.

_Instructions:_ Inspect the dataset above and identify an appropriate numeric column.

In [24]:
# Step 2: Focus on the 'Deaths' Column for Analysis
# In this step, we isolate the 'Deaths' column, which we prepared in Step 1 as a clean numeric column.

# Isolate the 'Deaths' column for statistical analysis
deaths_column = 'Deaths'
data_deaths = data[deaths_column]  # 'data' is already filtered for NaN values in 'Deaths'

# Verify that 'Deaths' is numeric for statistical analysis
if pd.api.types.is_numeric_dtype(data_deaths):
    print(f"The '{deaths_column}' column is confirmed to be numeric.\n")
else:
    print(f"Warning: The '{deaths_column}' column is not numeric. Check data format.\n")

# Display basic statistics to understand the distribution of death counts
print("Basic statistics for the 'Deaths' column:")
print(data_deaths.describe())

# Display a few entries to confirm data integrity
print("\nSample entries in 'Deaths' column:")
print(data_deaths.head(5))

# Explanation:
# - We focus on 'Deaths' as it is a key numeric column, necessary for statistical calculations.
# - This column has been cleaned of non-numeric and missing values, so the statistics below are computed on valid counts only.
# - The `describe()` method provides a summary of the distribution, including mean, min, max, and quartiles.


The 'Deaths' column is confirmed to be numeric.

Basic statistics for the 'Deaths' column:
count    1964.000000
mean      429.256110
std       827.583725
min         1.000000
25%        25.000000
50%       140.000000
75%       317.250000
max      7050.000000
Name: Deaths, dtype: float64

Sample entries in 'Deaths' column:
0      83.0
1      96.0
2     155.0
3    1445.0
4      14.0
Name: Deaths, dtype: float64
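
To make the `describe()` summary easier to read, here is a small sketch on made-up numbers (the values are hypothetical, not from the dataset). It shows that the `50%` row is the median and that a single large value pulls the mean well above it.

```python
import pandas as pd

# Hypothetical values: four small counts and one very large one
sample = pd.Series([1, 2, 3, 4, 100], name="Deaths")

summary = sample.describe()

# The "50%" entry of describe() is the median of the series
median_from_describe = summary["50%"]
mean_from_describe = summary["mean"]

print(f"Median: {median_from_describe}, Mean: {mean_from_describe}")
```

A mean far above the median, as in the output above for the 'Deaths' column, is a sign of a right-skewed distribution.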


## Step 3: Calculate Mean, Median, and Mode Using pandas
Now that we have our numeric data, we will calculate the mean, median, and mode using `pandas`.

_Instructions:_ Run the code below to compute these statistics.

In [25]:
# Step 3: Calculate Mean, Median, and Mode Using pandas
# In this step, we use pandas to calculate key statistics for the 'Deaths' column: mean, median, and mode.

# Calculate the mean of the 'Deaths' column: the average death count per row (one year, cause, sex, and race/ethnicity group)
mean_pandas = data_deaths.mean()

# Calculate the median of the 'Deaths' column to find the middle value in the dataset
median_pandas = data_deaths.median()

# Calculate the mode of the 'Deaths' column to identify the most frequently occurring death count
# mode() returns every tied value in ascending order; .iloc[0] selects the smallest of them
mode_pandas = data_deaths.mode().iloc[0]

# Display the calculated statistics
print("Statistics for 'Deaths' column using pandas:")
print(f"Mean (pandas): {mean_pandas}")
print(f"Median (pandas): {median_pandas}")
print(f"Mode (pandas): {mode_pandas}")

# Explanation:
# - The mean is the average death count per row; it is pulled upward by a few very large values.
# - The median is the midpoint of the data; a median well below the mean indicates a right-skewed distribution.
# - The mode is the most common death count; with ties, we report the smallest tied value.


Statistics for 'Deaths' column using pandas:
Mean (pandas): 429.2561099796334
Median (pandas): 140.0
Mode (pandas): 1.0
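
The sketch below illustrates why the cell above uses `.iloc[0]`: when several values are equally frequent, `Series.mode()` returns all of them. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical values in which 1 and 5 both appear twice
counts = pd.Series([5, 1, 5, 1, 9])

# mode() returns a Series containing every tied value, sorted in ascending order
all_modes = counts.mode()

# Taking the first element therefore selects the smallest tied value
first_mode = all_modes.iloc[0]

print(all_modes.tolist())
print(first_mode)
```

Printing the full result of `mode()` before picking one value makes it clear whether the reported mode is unique.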


## Step 4: Calculate Mean, Median, and Mode Using the Python Standard Library
To reinforce understanding, we will repeat these calculations without using `pandas`.

_Instructions:_ Use basic Python functions to calculate these values.

In [26]:
# Step 4: Calculate Mean, Median, and Mode Using the Standard Library
# In this step, we calculate the mean, median, and mode of the 'Deaths' data using only standard Python functions.

from collections import Counter

# Convert data_deaths to a list to ensure compatibility with standard Python functions
data_list = data_deaths.tolist()

# Mean calculation
# The mean is calculated as the sum of all values divided by the number of entries
mean_stdlib = sum(data_list) / len(data_list)

# Median calculation
# To find the median, we first sort the data and then select the middle value (or average of two middle values if even length)
sorted_data = sorted(data_list)
n = len(sorted_data)
if n % 2 != 0:
    median_stdlib = sorted_data[n // 2]
else:
    median_stdlib = (sorted_data[n // 2 - 1] + sorted_data[n // 2]) / 2

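# Mode calculation
# Using Counter, we count occurrences of each value and find the highest count.
# If several values tie, we take the smallest one, matching pandas' mode().iloc[0]
data_counts = Counter(data_list)
max_count = max(data_counts.values())
mode_stdlib = min(value for value, count in data_counts.items() if count == max_count)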

# Display the results for mean, median, and mode
print("Statistics for 'Deaths' column using standard Python functions:")
print(f"Mean (standard library): {mean_stdlib}")
print(f"Median (standard library): {median_stdlib}")
print(f"Mode (standard library): {mode_stdlib}")

# Explanation:
# - The mean provides the average death count, calculated manually without pandas.
# - The median is the midpoint of sorted data, which helps indicate central tendency.
# - The mode shows the most common death count, calculated using the Counter class.
# - This exercise reinforces fundamental Python skills for basic statistical analysis.



Statistics for 'Deaths' column using standard Python functions:
Mean (standard library): 429.2561099796334
Median (standard library): 140.0
Mode (standard library): 1.0
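
As an alternative sketch, the standard library's `statistics` module provides these calculations directly. This is not the approach used in the cell above, and the sample list is illustrative only.

```python
import statistics

# Illustrative values; the repeated 14.0 is added so that the mode is unique
sample = [83.0, 96.0, 155.0, 1445.0, 14.0, 14.0]

mean_value = statistics.mean(sample)
median_value = statistics.median(sample)

# multimode() (Python 3.8+) returns a list of all values tied for most frequent
modes = statistics.multimode(sample)

print(f"Mean: {mean_value}")
print(f"Median: {median_value}")
print(f"Modes: {modes}")
```

Writing the calculations by hand, as in the cell above, shows how each statistic works; the `statistics` module is a convenient cross-check.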


## Step 5: Data Visualization
We will create a simple horizontal bar chart using text. Each bar represents the total number of deaths recorded for one year.

_Instructions:_ Run the code below to view a text-based bar chart of the data.

In [27]:
# Step 5: Data Visualization - Text-Based Bar Chart for Total Deaths by Year
# In this step, we create a simple text-based bar chart to represent total deaths for each year.
# The chart itself is drawn with built-in Python functions only (pandas is used just to group the data), and it scales with the data.

# Group the data by 'Year' and calculate the total deaths for each year
deaths_by_year = data.groupby('Year')['Deaths'].sum()

# Convert the grouped data to lists for easier handling in text-based visualization
years = deaths_by_year.index.tolist()
totals = deaths_by_year.tolist()

# Title for the visualization
print("Data Visualization (Total Deaths by Year):\n")

# Determine the maximum value in totals to set a scaling factor for the bar width
# The scaling factor allows us to fit bars within a 50-character width for readability on narrow screens
max_value = max(totals)
scale_factor = 50 / max_value  # Adjust scaling based on the maximum value to fit within 50 characters

# Generate and display the bar chart, where each bar's length represents the total deaths for that year
for year, total in zip(years, totals):
    # Create a bar by repeating the block character proportional to the scaled total deaths
    bar = "█" * int(total * scale_factor)
    print(f"{year}: {bar} ({total})")

# Explanation:
# - This text-based bar chart provides a simple way to visualize total deaths by year.
# - The length of each bar is scaled so that the largest total spans 50 characters, which keeps the chart readable in narrow output areas.
# - Each line represents a year, with the bar length indicating the total deaths for that year.
# - This approach meets the project requirements: the bars are drawn with built-in Python functions only, and the chart scales with the data.



Data Visualization (Total Deaths by Year):

2007: ████████████████████████████████ (53996.0)
2008: ████████████████████████████████ (54138.0)
2009: ████████████████████████████████ (52820.0)
2010: ███████████████████████████████ (52505.0)
2011: ████████████████████████████████ (52726.0)
2012: ███████████████████████████████ (52420.0)
2013: ████████████████████████████████ (53387.0)
2014: ████████████████████████████████ (53006.0)
2015: ████████████████████████████████ (54120.0)
2016: █████████████████████████████████ (54280.0)
2017: █████████████████████████████████ (54319.0)
2018: █████████████████████████████████ (55081.0)
2019: █████████████████████████████████ (54559.0)
2020: ██████████████████████████████████████████████████ (82142.0)
2021: ██████████████████████████████████████ (63560.0)
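
The scaling logic can also be wrapped in a small reusable function. The sketch below is one possible way to do this; the name `text_bars` and its parameters are hypothetical, and the sample totals are the 2019 to 2021 values from the output above.

```python
def text_bars(totals, width=50, char="█"):
    """Return one line per label; the largest value gets a bar of `width` characters."""
    max_value = max(totals.values(), default=0)
    lines = []
    for label, value in totals.items():
        # Guard against an empty or all-zero input to avoid dividing by zero
        length = int(value / max_value * width) if max_value > 0 else 0
        lines.append(f"{label}: {char * length} ({value})")
    return lines


# Totals for three years, taken from the chart output above
sample_totals = {2019: 54559, 2020: 82142, 2021: 63560}

for line in text_bars(sample_totals):
    print(line)
```

Dividing by the maximum before multiplying by the width ensures the largest value always gets a full-width bar, and the `width` parameter makes it easy to fit the chart to a different output area.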
