The dataset I selected for this analysis is the New York City Air Quality dataset, published by the NYC Department of Health and Mental Hygiene (DOHMH). It contains air pollution measurements for different geographic areas and time periods, reported as a set of indicators such as ozone (O3), nitrogen dioxide (NO2), and fine particles (PM 2.5).

Data source: https://data.cityofnewyork.us/Environment/Air-Quality/c3uy-2p5r/about_data

In [20]:
# First, import the package needed to load the dataset,
# then load the CSV file we have downloaded
# and display the first five rows
# to get a general understanding of the structure of the imported data.

import pandas as pd
df = pd.read_csv("Air_Quality_20251107.csv")
df.head()

Unnamed: 0,Unique ID,Indicator ID,Name,Measure,Measure Info,Geo Type Name,Geo Join ID,Geo Place Name,Time Period,Start_Date,Data Value,Message
0,878218,386,Ozone (O3),Mean,ppb,UHF42,402,West Queens,Summer 2023,06/01/2023,34.365989,
1,876975,375,Nitrogen dioxide (NO2),Mean,ppb,UHF42,501,Port Richmond,Summer 2023,06/01/2023,11.331992,
2,876900,375,Nitrogen dioxide (NO2),Mean,ppb,UHF42,207,East Flatbush - Flatbush,Summer 2023,06/01/2023,12.020333,
3,877140,375,Nitrogen dioxide (NO2),Mean,ppb,CD,205,Fordham and University Heights (CD5),Summer 2023,06/01/2023,14.123178,
4,874556,365,Fine particles (PM 2.5),Mean,mcg/m3,UHF34,410,Rockaways,Summer 2023,06/01/2023,8.150637,


Compute: the mean, median, and mode using pandas

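In [21]:
# Select the numeric "Data Value" column from the DataFrame,
# then use pandas methods to calculate its mean, median, and mode.
# Series.mode() returns a Series rather than a single number,
# because several values can tie for the mode.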

mean = df["Data Value"].mean()
print("Mean:", mean)
median = df["Data Value"].median()
print("median:", median)
mode = df["Data Value"].mode()
print("mode:", mode)

Mean: 21.05158015629472
median: 14.79
mode: 0    2.0
Name: Data Value, dtype: float64


Compute: the mean, median, and mode using the Python standard library

In [22]:
# Import the CSV file as a list
# First, import Python's csv module (a toolbox for working with CSV files).
import csv

# Create an empty list called 'values' to act as a basket
# for all the numbers we need to work with.
values = []

# Open the CSV file as 'f'.
with open('Air_Quality_20251107.csv', newline='') as f:

    # Create a reader object that takes each line of the CSV
    # and splits it into a list of values.
    reader = csv.reader(f)

    # The CSV file has a header row, so skip it to reach the data directly.
    next(reader)

    # Loop through each remaining row in the CSV.
    # Each row is a list representing one line of data.
    for row in reader:
        try:
            # Grab the 11th column (index 10, "Data Value") from this row,
            # convert it to a float, and store it in the variable 'value'.
            value = float(row[10])

            # Put the number we collected into the basket 'values'.
            values.append(value)

        except (ValueError, IndexError):
            # If the field isn't a number or the row has no 11th column, skip it.
            continue
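
Selecting the column by position (`row[10]`) works only while the file keeps its current column order. As a hedged alternative, the sketch below uses `csv.DictReader` to look the column up by its header name instead. The helper name `read_column` and the in-memory sample are illustrative assumptions; the first two sample rows are copied from the `df.head()` output above, and the third is a made-up row with an empty value to show the skip logic.

```python
import csv
import io

# In-memory stand-in for the real file. The header matches the dataset;
# the last row is hypothetical and has an empty "Data Value" on purpose.
SAMPLE_CSV = """Unique ID,Indicator ID,Name,Measure,Measure Info,Geo Type Name,Geo Join ID,Geo Place Name,Time Period,Start_Date,Data Value,Message
878218,386,Ozone (O3),Mean,ppb,UHF42,402,West Queens,Summer 2023,06/01/2023,34.365989,
876975,375,Nitrogen dioxide (NO2),Mean,ppb,UHF42,501,Port Richmond,Summer 2023,06/01/2023,11.331992,
0,0,Example indicator,Mean,ppb,UHF42,0,Example place,Summer 2023,06/01/2023,,
"""


def read_column(file_obj, column="Data Value"):
    """Return the numeric entries of one named column, skipping bad rows."""
    collected = []
    # DictReader uses the header row as keys, so no manual next(reader) is needed.
    for row in csv.DictReader(file_obj):
        try:
            collected.append(float(row[column]))
        except (KeyError, TypeError, ValueError):
            # Missing column, short row (None), or a non-numeric field: skip it.
            continue
    return collected


sample_values = read_column(io.StringIO(SAMPLE_CSV))
print(sample_values)  # [34.365989, 11.331992]
```

With the real file, the same helper could be called inside `with open('Air_Quality_20251107.csv', newline='') as f:` by passing `f` in place of the `StringIO` object.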

In [23]:
# Calculating the mean
# Use the built-in sum() and len() functions
# to get the total and the count of the collected values,
# then apply the formula mean = sum / count.
mean = sum(values) / len(values)
print("Mean:", mean)

Mean: 21.05158015629472


In [24]:
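# Calculating the median
# Sort the values so the middle of the list can be located.
# With an odd count the median is the middle value;
# with an even count it is the average of the two middle values.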
sorted_values = sorted(values)

n = len(sorted_values)
if n % 2 == 1:
    median = sorted_values[n // 2]
else:
    median = (sorted_values[n//2 - 1] + sorted_values[n//2]) / 2
print("Median:", median)

Median: 14.79
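
To make the odd and even branches easier to see, here is a minimal sketch that wraps the same logic in a function and runs it on two tiny lists. The function name `median_of` and the sample lists are made up for illustration; they are not part of the dataset.

```python
def median_of(numbers):
    """Median of a non-empty sequence, using the same odd/even rule as above."""
    ordered = sorted(numbers)
    n = len(ordered)
    if n == 0:
        raise ValueError("median_of() needs at least one value")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one middle element.
        return ordered[mid]
    # Even count: average the two elements around the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_of([5.0, 1.0, 3.0]))       # 3.0
print(median_of([4.0, 1.0, 3.0, 2.0]))  # 2.5
```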


In [25]:
# Calculating the mode
# Create an empty dictionary 'counts' that maps each value
# to the number of times it appears;
# the most frequent value(s) will be the mode.
counts = {}

# Loop through each value in 'values';
# each time a value appears, increase its count by one
# to keep track of how many times each value occurs.
for v in values:
    counts[v] = counts.get(v, 0) + 1

# Find the highest count, i.e. how often the most frequent value appears.
max_count = max(counts.values())

# Pick out the actual number(s) with that frequency:
# go through each (value, count) pair and keep the value if its count equals max_count.
mode = [k for k, v in counts.items() if v == max_count]
print("Mode:", mode)

Mode: [2.0]
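
As a cross-check on the hand-written calculations, the standard library's `statistics` module provides ready-made functions for all three statistics. The sketch below is run on a small made-up list (not the dataset) so the results are easy to verify by hand; `fmean` and `multimode` require Python 3.8 or newer.

```python
import statistics

# Hypothetical sample chosen so the answers are easy to check by hand.
sample = [2.0, 2.0, 8.0, 11.0, 12.0, 14.0, 35.0]

sample_mean = statistics.fmean(sample)      # sum is 84.0, count is 7
sample_median = statistics.median(sample)   # middle of the 7 sorted values
sample_modes = statistics.multimode(sample) # list of all most-frequent values

print("Mean:", sample_mean)      # Mean: 12.0
print("Median:", sample_median)  # Median: 11.0
print("Mode:", sample_modes)     # Mode: [2.0]
```

Like the dictionary-based approach above, `multimode` returns a list, so ties for the most frequent value are all reported. Passing the `values` list collected earlier in place of `sample` should reproduce the results printed above.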


# Visualization

In [27]:
# Decide how long the longest bar of stars should be.
max_length = 50

# 'mode' is a list because several values can tie;
# take the largest so it is a single number like mean and median.
mode = max(mode)

# Find the largest of the three statistics; it gets the full 50 stars.
max_val = max(mean, median, mode)

# Scale each statistic to its share of the stars, truncating to a whole number.
mean_len = int(mean / max_val * max_length)
median_len = int(median / max_val * max_length)
mode_len = int(mode / max_val * max_length)

# using repeating stars to visualize
print("Visualization of Statistics:")
print(f"Mean  : {'*' * mean_len} ({mean:.2f})")
print(f"Median: {'*' * median_len} ({median:.2f})")
print(f"Mode  : {'*' * mode_len} ({mode:.2f})")

Visualization of Statistics:
Mean  : ************************************************** (21.05)
Median: *********************************** (14.79)
Mode  : **** (2.00)
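
The scaling step can also be packaged as a small reusable function. The sketch below is one possible way to do it, assuming all statistics are positive; the name `star_bars` is hypothetical, and the inputs are the rounded values printed above.

```python
def star_bars(stats, max_length=50):
    """Return one text bar per statistic, scaled so the largest gets max_length stars."""
    largest = max(stats.values())
    lines = []
    for label, value in stats.items():
        # int() truncates, matching the calculation in the cell above.
        bar_len = int(value / largest * max_length) if largest > 0 else 0
        lines.append(f"{label:<6}: {'*' * bar_len} ({value:.2f})")
    return lines


bars = star_bars({"Mean": 21.05, "Median": 14.79, "Mode": 2.0})
for line in bars:
    print(line)
```

Returning the lines instead of printing them directly keeps the function easy to test and lets the caller decide where the output goes.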
