# Project 1

For this project, I'm using the [Bicycle Counts](https://data.cityofnewyork.us/Transportation/Bicycle-Counts/uczf-rk3c/about_data) dataset from NYC Open Data, which records the number of bicyclists passing a given point (on the East River bridges, for example). I took only the first 100,000 rows.


In [None]:
import pandas as pd

df = pd.read_csv("bicycle_counts_100k.csv", low_memory=False)
df.head()

Index(['countid', 'id', 'date', 'counts', 'status'], dtype='object')


I imported pandas, our data management package, then read in the CSV I downloaded and looked at the first 5 rows. The file has five columns: `countid`, `id`, `date`, `counts`, and `status`.

## First, the easy way using pandas

In [None]:
df["counts"] = pd.to_numeric(df["counts"], errors="coerce")

mean_counts = df["counts"].mean()
median_counts = df["counts"].median()
mode_counts = df["counts"].mode()[0]

print(f"Mean cyclists per observation: {mean_counts:.2f}")
print(f"Median cyclists per observation: {median_counts:.2f}")
print(f"Mode cyclists per observation: {mode_counts}")


Mean cyclists per observation: 30.02
Median cyclists per observation: 20.00
Mode cyclists per observation: 0


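With pandas, computing the mean, median, and mode is simple: each has a Series method of the same name. `pd.to_numeric(..., errors="coerce")` first turns any non-numeric entry into NaN, which these methods skip, and `mode()` returns every value tied for most frequent, so `[0]` takes the first one. The results for the cyclist counts are:
mean = 30.02, median = 20.00, mode = 0 (the most common observation is 0 cyclists)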
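
As a hand-checkable illustration (toy numbers, not taken from the dataset), the sketch below uses the standard library's `statistics` module rather than pandas on a small right-skewed list. Like the bicycle counts, it has mode < median < mean, because a few large values pull the mean upward.

```python
import statistics

# Toy data (an assumption for illustration): mostly small counts, one large one
toy_counts = [0, 0, 10, 20, 70]

toy_mean = statistics.mean(toy_counts)      # (0 + 0 + 10 + 20 + 70) / 5
toy_median = statistics.median(toy_counts)  # middle value of the sorted list
toy_mode = statistics.mode(toy_counts)      # most frequent value

print("Mean:", toy_mean)      # 20
print("Median:", toy_median)  # 10
print("Mode:", toy_mode)      # 0
```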

## Now, the hard way

In [None]:
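counts_dict = {}  # maps each count value to how many times it appears

with open("bicycle_counts_100k.csv", "r") as f:
    next(f)  # skip the header row
    for line in f:
        # 'counts' is the fourth column; drop whitespace and surrounding quotes
        val_str = line.split(",")[3].strip().strip('"')
        if val_str.isdigit():  # keep only non-negative whole numbers
            val = int(val_str)
            counts_dict[val] = counts_dict.get(val, 0) + 1

# Expand the frequency table back into a sorted list of observations
counts = []
for val, freq in counts_dict.items():
    counts.extend([val] * freq)

counts.sort()

n = len(counts)
mean_counts = sum(counts) / n

# Median: the middle value, or the average of the two middle values when n is even
if n % 2 == 1:
    median_counts = counts[n // 2]
else:
    median_counts = (counts[n // 2 - 1] + counts[n // 2]) / 2

# Mode: the value with the highest frequency
mode_counts = max(counts_dict, key=counts_dict.get)

print("Mean:", mean_counts)
print("Median:", median_counts)
print("Mode:", mode_counts)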


Mean: 30.02199
Median: 20.0
Mode: 0


I used basic Python string and list methods (`isdigit`, `split`, `strip`, and `sort`), along with a dictionary and then a list, to compute the mean, median, and mode the hard way, without pandas, `statistics`, or any other library. The results match the pandas values. Because the fourth column is `counts` and every row holds a whole number there, the `isdigit` check is the only validation needed. The simple comma split does assume that no earlier field contains a quoted comma.
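
The sorted list is not strictly needed: the same three statistics can be read straight from a frequency dictionary. The following is a minimal sketch, assuming a `{value: frequency}` dictionary shaped like `counts_dict`. The name `stats_from_frequencies` is hypothetical, and breaking mode ties toward the smallest value is a choice made here for illustration.

```python
def stats_from_frequencies(freq):
    """Return (mean, median, mode) for a {value: frequency} dictionary."""
    n = sum(freq.values())
    if n == 0:
        raise ValueError("no observations")

    mean = sum(value * count for value, count in freq.items()) / n

    # Positions of the two middle observations (identical when n is odd)
    lower_idx, upper_idx = (n - 1) // 2, n // 2
    lower = upper = None
    seen = 0  # number of observations passed so far, in sorted order
    for value in sorted(freq):
        seen += freq[value]
        if lower is None and seen > lower_idx:
            lower = value
        if seen > upper_idx:
            upper = value
            break
    median = (lower + upper) / 2

    # Highest frequency wins; ties go to the smallest value
    mode = min(freq, key=lambda v: (-freq[v], v))
    return mean, median, mode


# Toy data: represents the observations [0, 0, 0, 10, 20, 20]
toy = {0: 3, 10: 1, 20: 2}
mean, median, mode = stats_from_frequencies(toy)
print("Mean:", mean)
print("Median:", median)  # 5.0, the average of the middle pair 0 and 10
print("Mode:", mode)      # 0
```

Working from the dictionary avoids building a 100,000-element list, at the cost of slightly more bookkeeping for the median.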

## Finally, data visualization (using pandas but only standard library for drawing)

First, I created a histogram of the distribution of bicycle counts within these first 100k rows:

In [None]:
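import pandas as pd

df = pd.read_csv("bicycle_counts_100k.csv", low_memory=False)
df["counts"] = pd.to_numeric(df["counts"], errors="coerce")
counts = df["counts"].dropna().astype(int).tolist()

min_val, max_val = min(counts), max(counts)
num_bins = 10
bin_size = (max_val - min_val) // num_bins or 1  # integer bin width, at least 1

print("\nDistribution of Bicycle Counts\n")

for i in range(num_bins):
    low = min_val + i * bin_size
    high = low + bin_size
    if i == num_bins - 1:
        # The last bin also takes everything from `low` up to max_val,
        # so the largest values are not silently dropped
        count_in_bin = sum(x >= low for x in counts)
    else:
        count_in_bin = sum(low <= x < high for x in counts)
    bar = "█" * (count_in_bin // 500)  # one block per 500 observations
    print(f"{low:5d}–{high:5d} | {bar}")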



Distribution of Bicycle Counts

    0–   30 | █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
   30–   60 | ████████████████████████████████████████████
   60–   90 | ██████████████████████
   90–  120 | ████████
  120–  150 | ██
  150–  180 | 
  180–  210 | 
  210–  240 | 
  240–  270 | 
  270–  300 | 
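
The binning step can be sketched as a small function, assuming equal-width bins and clamping the maximum value into the last bin so that no observation is dropped. `text_histogram` and its parameters are hypothetical names for illustration, and the input is toy data rather than the real counts.

```python
def text_histogram(values, num_bins=10, per_block=1):
    """Return (lines, bins): printable histogram rows and the raw bin counts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1  # fall back to 1 if all values are equal
    bins = [0] * num_bins
    for x in values:
        # Clamp so that x == hi lands in the last bin instead of past the end
        idx = min(int((x - lo) / width), num_bins - 1)
        bins[idx] += 1

    lines = []
    for i, count in enumerate(bins):
        left = lo + i * width
        right = left + width
        bar = "█" * (count // per_block)  # one block per `per_block` observations
        lines.append(f"{left:6.1f}–{right:6.1f} | {bar}")
    return lines, bins


lines, bins = text_histogram([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], num_bins=5)
print("\n".join(lines))
print(bins)  # [2, 2, 2, 2, 3]: the maximum, 10, is counted in the last bin
```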


Then, I created a chart of the top 8 location IDs by average bicycle count.

In [None]:
import pandas as pd

df = pd.read_csv("bicycle_counts_100k.csv", low_memory=False)
df["counts"] = pd.to_numeric(df["counts"], errors="coerce")

print("Rows:", len(df), "| non-null counts:", df["counts"].notna().sum())
print("Columns:", list(df.columns))

by_id = pd.Series(dtype=float)
if "id" in df.columns:
    by_id = (
        df.dropna(subset=["counts"])
        .groupby("id", dropna=True)["counts"]
        .mean()
        .dropna()
        .sort_values(ascending=False)
    )

top = by_id.head(8)


def draw_bar_chart(pairs, title, width=40, label=""):
    """Print a horizontal text bar chart for (name, value) pairs."""
    if not pairs:
        print(f"\n{title}\n(no data to display)")
        return
    max_label_len = max(len(k) for k, _ in pairs)
    max_val = max(v for _, v in pairs)
    print(f"\n{title}\n")
    for k, v in pairs:
        # Scale each bar so the largest value spans `width` blocks
        n = int(round((v / max_val) * width)) if max_val > 0 else 0
        bar = "█" * n
        print(f"{k:<{max_label_len}} | {bar} {v:.0f} {label}")


if not top.empty:
    data = [(str(i), float(v)) for i, v in top.items()]
    draw_bar_chart(data, "Average hourly bicycle counts - top IDs")
else:
    counts = df["counts"].dropna().astype(float)
    if counts.empty:
        print("\nNo non-null numeric 'counts' found to visualize.")
    else:
        bins = 8
        cmin, cmax = counts.min(), counts.max()
        if cmax == cmin:
            edges = [cmin, cmax + 1]
            hist = [len(counts)]
        else:
            edges = [cmin + i * (cmax - cmin) / bins for i in range(bins + 1)]
            hist = [0] * bins
            for x in counts:
                idx = min(int((x - cmin) / (cmax - cmin) * bins), bins - 1)
                hist[idx] += 1
        pairs = []
        for i in range(len(hist)):
            lo = edges[i]
            hi = edges[i + 1]
            label = f"[{int(lo)}–{int(hi)}]"
            pairs.append((label, hist[i]))
        draw_bar_chart(
            pairs, "Distribution of hourly bicycle counts (text histogram)", label="obs"
        )


Rows: 100000 | non-null counts: 100000
Columns: ['countid', 'id', 'date', 'counts', 'status']

Average hourly bicycle counts - top IDs

100009427 | ████████████████████████████████████████ 72 
100009428 | █████████████████████████████████████ 66 
100062893 | ██████████████████████████████ 53 
100047029 | ██████████████████████████████ 53 
300028963 | ██████████████████████████ 46 
300020904 | ████████████████████████ 42 
300020241 | ███████████████████████ 42 
100010019 | █████████████████████ 37 
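
Each bar's length comes from scaling its value against the largest one: `round(value / max_value * width)`. A minimal sketch of that arithmetic, using the rounded averages printed above (`bar_length` is a hypothetical helper; the unrounded averages could shift a bar by one block):

```python
def bar_length(value, max_value, width=40):
    """Number of blocks for `value` when `max_value` fills `width` blocks."""
    if max_value <= 0:
        return 0
    return int(round(value / max_value * width))


for value in (72, 66, 37):
    print(value, bar_length(value, 72))
# prints:
# 72 40
# 66 37
# 37 21
```

Because every bar is relative to the top location, the chart shows ranking and proportion rather than absolute counts; the number printed after each bar carries the actual average.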


Finally, I created a day-by-day emoji chart of average counts. Because the first 100k rows cover only June and July 2023, the daily values don't differ much, so this analysis is of limited use. It is still interesting to explore, and it would be more interesting with the full dataset of 7 million rows.

In [None]:
import pandas as pd

df = pd.read_csv("bicycle_counts_100k.csv", low_memory=False)
df["counts"] = pd.to_numeric(df["counts"], errors="coerce")
df["date"] = pd.to_datetime(df["date"], errors="coerce")

by_day = df.groupby(df["date"].dt.date)["counts"].mean().dropna()

print("\nAverage Daily Bicycle Counts (🚲 = 1 count)\n")
for day, val in by_day.items():
    bikes = int(val)
    bar = "🚲" * bikes
    print(f"{day}: {bar}")



Average Daily Bicycle Counts (ðŸš² = 1 count)

2023-06-02: ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²
2023-06-03: ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²
2023-06-04: ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²
2023-06-05: ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²
2023-06-06: ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²
2023-06-07: ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²
2023-06-08: ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²
2023-06-09: ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ðŸš²ð
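
The per-day average can also be computed without pandas, in the spirit of the hard way. The following is a minimal sketch, assuming ISO-formatted timestamps (the real file's date format may differ) and toy rows; `daily_means` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import datetime


def daily_means(rows):
    """rows: iterable of (timestamp_string, count); returns {date: mean count}."""
    totals = defaultdict(lambda: [0, 0])  # date -> [sum of counts, number of rows]
    for stamp, count in rows:
        day = datetime.fromisoformat(stamp).date()
        totals[day][0] += count
        totals[day][1] += 1
    return {day: total / n for day, (total, n) in sorted(totals.items())}


# Toy observations (assumed values, not from the dataset)
rows = [
    ("2023-06-02T00:00:00", 10),
    ("2023-06-02T00:15:00", 30),
    ("2023-06-03T00:00:00", 7),
]

for day, avg in daily_means(rows).items():
    # int() truncates, so each emoji stands for one whole cyclist in the average
    print(f"{day}: {'🚲' * int(avg)} ({avg:.1f})")
```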

## Conclusion

This project explored the NYC DOT *Bicycle Counts* dataset from NYC Open Data, focusing on the `counts` column, which gives the number of cyclists recorded per observation. I included only the first 100k rows, so the analysis was limited.  
Using **pandas**, I computed the mean, median, and mode of these counts to summarize the data.  
I then repeated these calculations using only Python's built-in features, reading and processing the raw CSV manually to calculate each value "the hard way."

Finally, I created simple text-based visualizations, using pandas to aggregate the data and only built-in Python features to draw them. In the last one, each 🚲 emoji represents one cyclist in that day's average count per observation.  
This literal visualization highlights variation in cycling activity across days, showing how code and data can communicate insights without any external plotting libraries.
