# Project 1 – NYC Yellow Taxi Analysis

## Introduction
In this project, I analyze a dataset of NYC Yellow Taxi trips.  
The dataset includes trip distance, fare amount, passenger counts,
pickup and dropoff times, and other trip-related attributes.

For this assignment, I focus on the numeric column `trip_distance`
and compute the mean, median, and mode using two approaches:

1. Using pandas  
2. Using only the Python standard library (the "hard way")

Finally, I create a simple text-based visualization of trip distances using only
Python's built-in printing capabilities.


In [9]:
import pandas as pd

df = pd.read_csv("yellow_taxi.csv")
df.head()


Unnamed: 0,vendorid,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecodeid,store_and_fwd_flag,pulocationid,dolocationid,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,2,2023-01-01T00:32:10.000,2023-01-01T00:40:36.000,1.0,0.97,1.0,N,161,141,2,9.3,1.0,0.5,0.0,0.0,1.0,14.3,2.5,0.0
1,2,2023-01-01T00:55:08.000,2023-01-01T01:01:27.000,1.0,1.1,1.0,N,43,237,1,7.9,1.0,0.5,4.0,0.0,1.0,16.9,2.5,0.0
2,2,2023-01-01T00:25:04.000,2023-01-01T00:37:49.000,1.0,2.51,1.0,N,48,238,1,14.9,1.0,0.5,15.0,0.0,1.0,34.9,2.5,0.0
3,1,2023-01-01T00:03:48.000,2023-01-01T00:13:25.000,0.0,1.9,1.0,N,138,7,1,12.1,7.25,0.5,0.0,0.0,1.0,20.85,0.0,1.25
4,2,2023-01-01T00:10:29.000,2023-01-01T00:21:19.000,1.0,1.43,1.0,N,107,79,1,11.4,1.0,0.5,3.28,0.0,1.0,19.68,2.5,0.0


## Selecting a Numeric Column

The column `trip_distance` records the distance of each taxi trip in miles.
pandas reads it as `float64`, so it is appropriate for calculating summary statistics.


In [10]:
distances = df["trip_distance"]

distances.head()


0    0.97
1    1.10
2    2.51
3    1.90
4    1.43
Name: trip_distance, dtype: float64

## Summary Statistics using pandas

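Below I compute the mean, median, and mode of the `trip_distance` column
using pandas built-in methods. Because several values can tie for most frequent,
`mode()` returns a Series rather than a single number, and `[0]` selects the first entry.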


In [13]:
mean_dist = float(distances.mean())
median_dist = float(distances.median())
mode_dist = float(distances.mode()[0])

print("Mean trip distance:", round(mean_dist, 2))
print("Median trip distance:", round(median_dist, 2))
print("Mode trip distance:", round(mode_dist, 2))

Mean trip distance: 2.96
Median trip distance: 1.93
Mode trip distance: 1.6


## The Hard Way – Manual Computation Using the Python Standard Library

In this part, I recreate the same summary statistics without using pandas.
I manually read the CSV using Python’s `csv` module and compute the mean, median,
and mode using only basic Python operations. This demonstrates that the same
analysis can be performed with the standard library alone.
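
The reading step assumes that every `trip_distance` cell parses as a float. pandas treats blank cells as missing and skips them in `mean()` and `median()`, while a bare `float("")` raises `ValueError`. The sketch below shows one way to tolerate such cells. The helper name `read_numeric_column` and the in-memory sample CSV are hypothetical and only serve the illustration.

```python
import csv
import io

# Hypothetical in-memory CSV standing in for yellow_taxi.csv
SAMPLE_CSV = """trip_distance,fare_amount
0.97,9.3
,7.9
2.51,14.9
abc,12.1
"""


def read_numeric_column(file_obj, column):
    """Collect one column as floats, skipping blank or non-numeric cells."""
    numbers = []
    skipped = 0
    for row in csv.DictReader(file_obj):
        try:
            numbers.append(float(row[column]))
        except (TypeError, ValueError):
            # TypeError: cell missing from a short row (None); ValueError: blank or text
            skipped += 1
    return numbers, skipped


sample_values, skipped = read_numeric_column(io.StringIO(SAMPLE_CSV), "trip_distance")
print(sample_values, skipped)
```

Counting the skipped rows makes it easy to notice when the manual results could drift from the pandas results because of missing data.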


In [None]:
import csv

values = []

# Read the CSV manually
with open("yellow_taxi.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        values.append(float(row["trip_distance"]))

# ---- Mean ----
manual_mean = sum(values) / len(values)

# ---- Median ----
values_sorted = sorted(values)
n = len(values_sorted)

if n % 2 == 1:
    manual_median = values_sorted[n // 2]
else:
    manual_median = (values_sorted[n // 2 - 1] + values_sorted[n // 2]) / 2

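# ---- Mode ----
# Count how often each distinct value occurs
counts = {}
for v in values:
    counts[v] = counts.get(v, 0) + 1

# On ties, max() keeps the first key in insertion order,
# which may differ from the value pandas returns for mode()[0]
manual_mode = max(counts, key=counts.get)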

print("Manual mean distance:", round(manual_mean, 2))
print("Manual median distance:", round(manual_median, 2))
print("Manual mode distance:", round(manual_mode, 2))

Manual mean distance: 2.96
Manual median distance: 1.93
Manual mode distance: 1.6
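
As a cross-check on the manual formulas, the standard library's `statistics` module offers the same three measures. The sketch below runs on a small hypothetical sample rather than the taxi data, and the helper name `summarize` is my own. It uses `statistics.multimode` (Python 3.8+) to list every tied mode instead of silently picking one.

```python
import statistics

# Hypothetical sample distances in miles, not taken from the dataset
sample = [0.97, 1.1, 2.51, 1.9, 1.43, 1.1, 1.9]


def summarize(numbers):
    """Return the mean, median, and all tied modes of a list of numbers."""
    return {
        "mean": statistics.fmean(numbers),
        "median": statistics.median(numbers),
        "modes": sorted(statistics.multimode(numbers)),
    }


summary = summarize(sample)
print("Median:", summary["median"])
print("Modes:", summary["modes"])
```

Here both 1.1 and 1.9 occur twice, so the sample has two modes. A single-value mode function would report only one of them.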


## Text-Based Visualization of Trip Distances

To visualize the distribution of taxi trip distances, I create a simple text-based bar chart.
I group the `trip_distance` values into 1-mile bins (0–1 miles, 1–2 miles, and so on up to 9–10 miles),
plus a final bin for trips of 10 miles or more. Each bin is drawn as a line of asterisks `*`, scaled so
that one asterisk stands for a fixed number of trips.

This visualization is produced entirely using Python’s standard library and does not rely on
any plotting packages.


In [None]:
# Create bins: 0–1 miles, 1–2 miles, ..., 9–10 miles, 10+ miles
bins = {i: 0 for i in range(0, 11)}

for d in values:  # values was created in the "hard way" section
    bin_index = int(min(d, 10))  # cap at 10+
    bins[bin_index] += 1

# Determine scaling factor so chart doesn't get too wide
max_count = max(bins.values())
scale = max(1, max_count // 50)  # targets about 50 stars; floor division can leave the longest bar wider

print("Trip Distance Distribution (text-based visualization)")
print("----------------------------------------------------")

for b in range(0, 10):
    stars = "*" * (bins[b] // scale)
    print(f"{b:2d}–{b + 1:2d} miles: {stars}")

# 10+ bin
stars_10plus = "*" * (bins[10] // scale)
print(f"10+ miles: {stars_10plus}")


Trip Distance Distribution (text-based visualization)
----------------------------------------------------
 0– 1 miles: *********************************
 1– 2 miles: ****************************************************
 2– 3 miles: ******************************
 3– 4 miles: ****************
 4– 5 miles: **********
 5– 6 miles: *******
 6– 7 miles: ***
 7– 8 miles: **
 8– 9 miles: *
 9–10 miles: *
10+ miles: *******
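
The binning and scaling steps can also be wrapped in a reusable function. The sketch below is one possible variant, not the code that produced the chart above. The function name `text_histogram` and its parameters are hypothetical. It rounds the scale up so the longest bar never exceeds `max_width`, and it assumes negative distances are data errors that should be skipped.

```python
import math


def text_histogram(distances, num_bins=10, max_width=50):
    """Return the lines of a text bar chart with 1-mile bins plus an overflow bin."""
    counts = [0] * (num_bins + 1)
    for d in distances:
        if d < 0:
            continue  # assumption: negative distances are data errors
        counts[min(int(d), num_bins)] += 1

    # Ceiling division keeps the longest bar at or below max_width stars
    scale = max(1, math.ceil(max(counts) / max_width))

    lines = []
    for i, count in enumerate(counts):
        if i < num_bins:
            label = f"{i:2d}–{i + 1:2d} miles"
        else:
            label = f"{num_bins}+ miles".ljust(11)  # pad to match the other labels
        lines.append(f"{label}: {'*' * (count // scale)}")
    return lines


# Hypothetical sample: many short trips, a few long ones, one invalid value
demo = [0.5] * 120 + [1.5] * 60 + [12.0] * 3 + [-1.0]
for line in text_histogram(demo):
    print(line)
```

Padding the overflow label to the same width as the other labels keeps every bar starting in the same column.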


## Additional Visualization – Sparkline

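This sparkline provides a compact, single-line visualization of the relative frequency
of taxi trip distances. It reuses the `bins` counts from the bar chart, so it has 11 characters:
one for each 1-mile bin from 0–1 to 9–10 miles, followed by one for the 10+ bin. The height of
each character is proportional to its bin's count relative to the largest bin.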


In [16]:
spark_chars = "▁▂▃▄▅▆▇█"  # from shortest to tallest

bins_list = [bins[i] for i in range(0, 11)]
max_count = max(bins_list)

sparkline = ""

for count in bins_list:
    height = int((count / max_count) * (len(spark_chars) - 1))
    sparkline += spark_chars[height]

print("Trip Distance Sparkline:")
print(sparkline)
print("Bins: 0–10 miles")


Trip Distance Sparkline:
▅█▅▃▂▁▁▁▁▁▂
Bins: 0–10 miles
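
The same height mapping can be written as a small function. The sketch below is a hypothetical helper (the name `sparkline` and the sample counts are my own) that also guards against empty or all-zero input, where dividing by the largest count would otherwise fail.

```python
SPARK_CHARS = "▁▂▃▄▅▆▇█"  # from shortest to tallest


def sparkline(counts):
    """Map each count to a block character scaled against the largest count."""
    peak = max(counts, default=0)
    if peak == 0:
        # Nothing to scale against: draw a flat baseline
        return SPARK_CHARS[0] * len(counts)
    top = len(SPARK_CHARS) - 1
    return "".join(SPARK_CHARS[int(c / peak * top)] for c in counts)


print(sparkline([0, 1, 2, 4]))
```

Because `int()` truncates, a bin only reaches the tallest character when its count equals the largest count.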
