# Project 1 – NYC Yellow Taxi Analysis

## Introduction
In this project, I take a closer look at a dataset of NYC Yellow Taxi trips — a snapshot of how people move through one of the busiest cities in the world.

Instead of treating it purely as a technical exercise, I approach the dataset the way I would a small data-journalism piece, starting from a few questions:
1. What can trip distances tell us about how New Yorkers actually use taxis?
2. Are most rides quick neighborhood hops, or long airport runs?
3. Does the distribution match what we intuitively expect?

The dataset contains the usual taxi-related features (trip distance, fare amount, passenger counts, pickup and dropoff times), but for this assignment I zoom in on just one numeric field: `trip_distance`.

My goal is to compute its mean, median, and mode using two different approaches:

1. Using pandas, the convenient, modern way.

2. Using only the Python standard library, the “hard way,” to better understand what pandas is doing behind the scenes.

Finally, I build text-based visualizations of trip distances using only Python’s built-in printing capabilities.
Although basic, this step helps illustrate the shape of the data even without a plotting library.

## Source Dataset

2023 Yellow Taxi Trip Data — NYC Open Data  

https://data.cityofnewyork.us/Transportation/2023-Yellow-Taxi-Trip-Data/4b4i-vvec


In [4]:
import pandas as pd

df = pd.read_csv("yellow_taxi.csv")
df.head()


Unnamed: 0,vendorid,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecodeid,store_and_fwd_flag,pulocationid,dolocationid,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,2,2023-01-01T00:32:10.000,2023-01-01T00:40:36.000,1.0,0.97,1.0,N,161,141,2,9.3,1.0,0.5,0.0,0.0,1.0,14.3,2.5,0.0
1,2,2023-01-01T00:55:08.000,2023-01-01T01:01:27.000,1.0,1.1,1.0,N,43,237,1,7.9,1.0,0.5,4.0,0.0,1.0,16.9,2.5,0.0
2,2,2023-01-01T00:25:04.000,2023-01-01T00:37:49.000,1.0,2.51,1.0,N,48,238,1,14.9,1.0,0.5,15.0,0.0,1.0,34.9,2.5,0.0
3,1,2023-01-01T00:03:48.000,2023-01-01T00:13:25.000,0.0,1.9,1.0,N,138,7,1,12.1,7.25,0.5,0.0,0.0,1.0,20.85,0.0,1.25
4,2,2023-01-01T00:10:29.000,2023-01-01T00:21:19.000,1.0,1.43,1.0,N,107,79,1,11.4,1.0,0.5,3.28,0.0,1.0,19.68,2.5,0.0


## Selecting a Numeric Column

The column `trip_distance` records the length of each taxi trip in miles.
pandas reads it as `float64`, so it contains no text values and is suitable for calculating summary statistics.
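
One way to back up the claim that the column holds only numbers, without relying on pandas, is to try converting every raw value to `float` and count the failures. The sketch below is illustrative only: the helper name `count_non_numeric` and the inline sample rows are assumptions, not part of the original notebook.

```python
import csv
import io


def count_non_numeric(lines, column):
    """Return (total, bad): rows read, and rows whose `column` value is not a float."""
    total = 0
    bad = 0
    for row in csv.DictReader(lines):
        total += 1
        try:
            float(row[column])
        except (TypeError, ValueError):
            # Blank cells, missing fields (None), and text such as "n/a" land here.
            bad += 1
    return total, bad


# Hypothetical sample standing in for yellow_taxi.csv.
sample = io.StringIO(
    "trip_distance,fare_amount\n"
    "0.97,9.3\n"
    "1.1,7.9\n"
    ",14.9\n"
    "n/a,12.1\n"
)
total, bad = count_non_numeric(sample, "trip_distance")
print(f"{bad} of {total} values could not be parsed as numbers")
```

To run the same check on the real file, an open file handle can be passed in place of the `StringIO` object, for example inside `with open("yellow_taxi.csv", newline="", encoding="utf-8") as f:`.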


In [5]:
distances = df["trip_distance"]

distances.head()


0    0.97
1    1.10
2    2.51
3    1.90
4    1.43
Name: trip_distance, dtype: float64

## Summary Statistics using pandas

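Below I compute the mean, median, and mode of the `trip_distance` column
using pandas' built-in methods. Because several values can tie for most common,
`mode()` returns a Series rather than a single number; pandas returns the tied
values sorted, so `[0]` selects the smallest one.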


In [6]:
mean_dist = float(distances.mean())
median_dist = float(distances.median())
mode_dist = float(distances.mode()[0])

print("Mean trip distance:", round(mean_dist, 2))
print("Median trip distance:", round(median_dist, 2))
print("Mode trip distance:", round(mode_dist, 2))

Mean trip distance: 2.96
Median trip distance: 1.93
Mode trip distance: 1.6
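
The mean (2.96) sits well above the median (1.93). A gap in that direction is typical of a right-skewed distribution, where a small number of long trips pull the mean up while the median barely moves. The toy example below uses made-up distances (not values from the dataset) to show the effect with the standard library's `statistics` module.

```python
import statistics

# Hypothetical trip distances in miles: mostly short hops.
short_trips = [0.8, 1.1, 1.4, 1.6, 1.9, 2.2, 2.5]

# The same trips plus two long airport-style runs (also made up).
with_long_trips = short_trips + [17.0, 19.5]

for label, trips in [("short only", short_trips), ("with long trips", with_long_trips)]:
    mean = statistics.mean(trips)
    median = statistics.median(trips)
    print(f"{label:>15}: mean={mean:.2f}  median={median:.2f}")
```

In this toy data, adding two long trips more than triples the mean, while the median only moves from 1.6 to 1.9.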


## The Hard Way – Manual Computation Using the Python Standard Library

In this part, I recreate the same summary statistics without using pandas.
I manually read the CSV using Python’s `csv` module and compute the mean, median,
and mode using only basic Python operations. This demonstrates that the same
analysis can be performed with the standard library alone.


In [7]:
import csv

values = []

# Read the CSV manually
with open("yellow_taxi.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        values.append(float(row["trip_distance"]))

# ---- Mean ----
manual_mean = sum(values) / len(values)

# ---- Median ----
values_sorted = sorted(values)
n = len(values_sorted)

if n % 2 == 1:
    manual_median = values_sorted[n // 2]
else:
    manual_median = (values_sorted[n // 2 - 1] + values_sorted[n // 2]) / 2

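# ---- Mode ----
# Count how often each distance occurs.
counts = {}
for v in values:
    counts[v] = counts.get(v, 0) + 1

# If several distances tie for the highest count, take the smallest,
# which matches what distances.mode()[0] returns in pandas.
top_count = max(counts.values())
manual_mode = min(v for v, c in counts.items() if c == top_count)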

print("Manual mean distance:", round(manual_mean, 2))
print("Manual median distance:", round(manual_median, 2))
print("Manual mode distance:", round(manual_mode, 2))

Manual mean distance: 2.96
Manual median distance: 1.93
Manual mode distance: 1.6
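
As a cross-check on the hand-written logic above, the standard library also ships a `statistics` module with `mean`, `median`, and `multimode` (the last one requires Python 3.8 or later). The sketch below compares it with the manual formulas on a small made-up list; in the notebook, the same calls could be applied to `values`. `multimode` returns every value tied for the highest count, which makes tie handling explicit.

```python
import math
import statistics

# Hypothetical sample; the first five values echo the rows shown by df.head().
sample = [0.97, 1.1, 2.51, 1.9, 1.43, 1.1]

# Manual versions, following the same approach as the cell above.
manual_mean = sum(sample) / len(sample)
ordered = sorted(sample)
mid = len(ordered) // 2
if len(ordered) % 2 == 1:
    manual_median = ordered[mid]
else:
    manual_median = (ordered[mid - 1] + ordered[mid]) / 2

# Library versions.
lib_mean = statistics.mean(sample)
lib_median = statistics.median(sample)
lib_modes = statistics.multimode(sample)  # list of all most-common values

# Compare floats with a tolerance rather than ==, since rounding may differ slightly.
print("means agree:  ", math.isclose(manual_mean, lib_mean))
print("medians agree:", math.isclose(manual_median, lib_median))
print("mode(s):      ", lib_modes)
```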


## Text-Based Visualization of Trip Distances

To visualize the distribution of taxi trip distances, I create a simple text-based bar chart.
I group the `trip_distance` values into ten 1-mile bins (0–1 miles, 1–2 miles, and so on up to
9–10 miles) plus a final bin for trips of 10 miles or more. Each bin is drawn as a line of
asterisks `*`, with one asterisk standing for `scale` trips.

This visualization is produced entirely with Python’s standard library and does not rely on
any plotting tools.
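
The chart code that follows reads two names, `bins` and `scale`, whose definitions are not shown in this write-up. One plausible way to build them is sketched below. The helper names `count_bins` and `pick_scale`, the skipping of negative distances, and the 50-character target width are all assumptions rather than the original implementation, and the sketch assumes every distance is a finite number.

```python
def count_bins(distances, n_bins=10):
    """Count trips per 1-mile bin; the extra last slot collects n_bins miles and over."""
    bins = [0] * (n_bins + 1)
    for d in distances:
        if d < 0:
            continue  # assumption: treat negative distances as bad records and skip them
        # int() truncates, so 1.0 <= d < 2.0 falls in bin 1; cap at the overflow bin.
        bins[min(int(d), n_bins)] += 1
    return bins


def pick_scale(bins, max_width=50):
    """Choose trips-per-asterisk so the longest bar is at most max_width characters."""
    # Ceiling division, with a floor of 1 so small counts still get one star per trip.
    return max(1, (max(bins) + max_width - 1) // max_width)


# Small made-up sample to show the edge cases (bin boundaries, overflow, negatives).
sample_bins = count_bins([0.5, 1.0, 1.9, 9.99, 10.0, 25.3, -2.0])
print(sample_bins)
print(pick_scale(sample_bins))
```

With the notebook's data, `bins = count_bins(values)` and `scale = pick_scale(bins)` would supply the two inputs the chart code expects, although the recorded output may have been produced with a different scale.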


In [None]:
# Assumes `bins` and `scale` are already defined: bins[i] holds the trip
# count for bin i (0-1 miles up to 9-10 miles, with bins[10] for 10+ miles),
# and `scale` is the number of trips represented by one asterisk.
print("Trip Distance Distribution (text-based visualization)")
print("----------------------------------------------------")

# Regular 1-mile bins
for b in range(0, 10):
    stars = "*" * (bins[b] // scale)
    left = f"{b:2d}-{b + 1:2d}"  # e.g. " 9-10"
    label = f"{left:<5} miles:"  # left-align the range in a 5-character field
    print(f"{label} {stars}")

# 10+ bin
stars_10plus = "*" * (bins[10] // scale)
label_10plus = f"{'10+':<5} miles:"  # pad "10+" to the same 5-character width
print(f"{label_10plus} {stars_10plus}")

Trip Distance Distribution (text-based visualization)
----------------------------------------------------
 0- 1 miles: *********************************
 1- 2 miles: ****************************************************
 2- 3 miles: ******************************
 3- 4 miles: ****************
 4- 5 miles: **********
 5- 6 miles: *******
 6- 7 miles: ***
 7- 8 miles: **
 8- 9 miles: *
 9-10 miles: *
10+   miles: *******


## Additional Visualization – Sparkline

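This sparkline provides a compact, single-line view of the same eleven bins: one character
per 1-mile bin from 0 to 10 miles, followed by a final character for the 10+ bin. The height
of each character reflects that bin's count relative to the largest bin.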


In [9]:
spark_chars = "▁▂▃▄▅▆▇█"  # from shortest to tallest

bins_list = [bins[i] for i in range(0, 11)]
max_count = max(bins_list)

sparkline = ""

for count in bins_list:
    height = int((count / max_count) * (len(spark_chars) - 1))
    sparkline += spark_chars[height]

print("Trip Distance Sparkline:")
print(sparkline)
print("Bins: 0–1 through 9–10 miles, then 10+")


Trip Distance Sparkline:
▅█▅▃▂▁▁▁▁▁▂
Bins: 0–1 through 9–10 miles, then 10+


## Conclusion

Working through this project helped me better understand both the structure of the NYC Yellow Taxi dataset and the process of analyzing numeric data using Python. By focusing on `trip_distance`, I was able to compute the mean, median, and mode using both pandas and the standard library, which highlighted how much convenience pandas provides compared to manually implementing these calculations.

Creating a text-based visualization also helped me appreciate how even a simple ASCII chart can reveal real patterns in the data. In this case, the distribution shows that most taxi trips in the dataset are short: the 1–2 mile bin is the largest and the median is 1.93 miles, so roughly half of all trips cover less than 2 miles, while longer rides are much less common. Even without a plotting library, the pattern is easy to see.

Overall, this assignment reinforced the importance of understanding what’s happening “under the hood” in data analysis tools. It also showed that meaningful insights don’t always require sophisticated graphs—sometimes a few lines of Python and a row of stars are enough to tell the story.