<a href="https://colab.research.google.com/github/Yeasung-Kim/MAT451/blob/main/MAT451_Project_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pybaseball
from pybaseball import statcast
import pandas as pd

dfs = []
date_ranges = [
    ("2017-01-01", "2017-01-31"),
    ("2017-02-01", "2017-02-28"),
    ("2017-03-01", "2017-03-31"),
    ("2017-04-01", "2017-04-30"),
    ("2017-05-01", "2017-05-31"),
    ("2017-06-01", "2017-06-30"),
    ("2017-07-01", "2017-07-31"),
    ("2017-08-01", "2017-08-31"),
    ("2017-09-01", "2017-09-30"),
    ("2017-10-01", "2017-10-31"),
    ("2017-11-01", "2017-11-30"),
    ("2017-12-01", "2017-12-31")
]

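for start, end in date_ranges:
    try:
        temp_df = statcast(start_dt=start, end_dt=end)
    except Exception as e:
        print(f"Error fetching data from {start} to {end}: {e}")
        continue
    # Off-season months can come back empty; skip them
    if temp_df is not None and not temp_df.empty:
        dfs.append(temp_df)

# Combine all monthly DataFrames (pd.concat raises on an empty list)
if not dfs:
    raise RuntimeError("No Statcast data was fetched for 2017")
data_2017 = pd.concat(dfs, ignore_index=True)
print("Final combined shape:", data_2017.shape)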


In [None]:
# Proceed with cleaning
columns_of_interest = [
    "game_date", "batter", "player_name",
    "launch_speed", "launch_angle",
    "hc_x", "hc_y", "events"
]

columns_to_use = [col for col in columns_of_interest if col in data_2017.columns]
df_2017 = data_2017[columns_to_use]

# Keep only batted balls with complete measurements and a recorded outcome
df_2017_clean = df_2017.dropna(
    subset=["launch_speed", "launch_angle", "hc_x", "hc_y", "events"]
).reset_index(drop=True)

print("Cleaned dataset shape:", df_2017_clean.shape)
df_2017_clean.sample(10)


In [None]:
import matplotlib.pyplot as plt

# Uses df_2017_clean from the cleaning cell, with the columns
# "launch_speed", "launch_angle", "hc_x", "hc_y", and "events".

# 1) Histogram of Launch Speeds

plt.figure()
plt.hist(df_2017_clean["launch_speed"], bins=30)
plt.title("Distribution of Launch Speed (mph)")
plt.xlabel("Launch Speed (mph)")
plt.ylabel("Count")
plt.show()

# What It Shows
X-axis (Launch Speed): How fast the ball comes off the bat, measured in miles per hour (mph).

Y-axis (Count): The number of batted balls that fell into each speed bin.

# How to Interpret
A single peak marks the typical exit velocity, the speed range where most batted balls cluster.

Launch speeds above roughly 100 mph indicate hard contact, which tends to be more valuable offensively (e.g., more extra-base hits).

If you see a long tail to the right, it means there are a few exceptionally hard-hit balls (e.g., 110–120 mph).

The spread (e.g., from ~50–120 mph) shows the range of contact quality from weakly hit balls to well-hit drives.
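
The shape read off the histogram can also be put into numbers. The sketch below is illustrative only: `summarize_launch_speed` is a hypothetical helper, the sample values are made up rather than taken from the 2017 data, and the 95 mph cutoff is an assumed parameter (it matches the commonly cited Statcast "hard-hit" threshold, but adjust it as needed).

```python
import pandas as pd


def summarize_launch_speed(speeds: pd.Series, hard_hit_mph: float = 95.0) -> dict:
    """Return a few numbers that describe the exit-velocity distribution."""
    speeds = speeds.dropna()
    return {
        "count": int(speeds.size),
        "median_mph": float(speeds.median()),
        "p10_mph": float(speeds.quantile(0.10)),   # weak-contact end
        "p90_mph": float(speeds.quantile(0.90)),   # hard-contact end
        "skew": float(speeds.skew()),              # sign shows which tail is longer
        "hard_hit_share": float((speeds >= hard_hit_mph).mean()),
    }


# Made-up values for illustration; not real Statcast measurements
toy_speeds = pd.Series([62.0, 78.5, 88.0, 91.2, 95.0, 99.4, 103.1, 110.7])
summary = summarize_launch_speed(toy_speeds)
print(summary)
```

On the real data, the same call would be `summarize_launch_speed(df_2017_clean["launch_speed"])`.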

In [None]:
# 2) Scatter Plot: Launch Angle vs. Launch Speed

plt.figure()
plt.scatter(
    df_2017_clean["launch_angle"],
    df_2017_clean["launch_speed"],
    alpha=0.3  # make points semi-transparent for overlap
)
plt.title("Launch Angle vs. Launch Speed")
plt.xlabel("Launch Angle (degrees)")
plt.ylabel("Launch Speed (mph)")
plt.show()

# What It Shows
X-axis (Launch Angle): The vertical angle at which the ball leaves the bat.

Y-axis (Launch Speed): The speed of the ball off the bat, in mph.

# How to Interpret
Each point corresponds to a single batted ball.

High speed combined with a moderate angle (often a “sweet spot” of ~10–30°) is more likely to produce line drives or fly balls that become extra-base hits or home runs.

Very low angles (negative values or close to 0°) at high speeds typically turn into hard grounders.

Very high angles (e.g., > 40°) might be high fly balls or pop-ups, often resulting in outs unless coupled with very high exit velocity.

A visible “cluster” might show the most common combination, e.g., around 85–95 mph and 10–20° angle, indicating typical line drives.

Outliers can indicate extreme hits (either very high or very low angles, extremely high or low exit velocity).
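
To check how much of the scatter falls in the high-speed, moderate-angle region described above, one option is a simple boolean mask. This is a sketch under stated assumptions: `sweet_spot_share` is a hypothetical helper, the 10–30° band is taken from the text, the 95 mph floor is an assumed default, and the sample rows are made up.

```python
import pandas as pd


def sweet_spot_share(
    df: pd.DataFrame,
    angle_lo: float = 10.0,
    angle_hi: float = 30.0,
    min_speed: float = 95.0,
) -> float:
    """Fraction of batted balls hit hard AND inside the launch-angle band."""
    in_band = df["launch_angle"].between(angle_lo, angle_hi)  # inclusive on both ends
    hit_hard = df["launch_speed"] >= min_speed
    return float((in_band & hit_hard).mean())


# Made-up rows for illustration; not real Statcast measurements
toy = pd.DataFrame({
    "launch_angle": [-12.0, 3.0, 15.0, 22.0, 28.0, 45.0],
    "launch_speed": [101.0, 88.0, 97.5, 90.0, 104.2, 99.0],
})
share = sweet_spot_share(toy)
print(f"Share of hard-hit balls in the angle band: {share:.3f}")
```

On the real data, `sweet_spot_share(df_2017_clean)` would give the league-wide share for the cleaned 2017 set.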

In [None]:
# 3) Bar Chart: Event Type Distribution
event_counts = df_2017_clean["events"].value_counts()

plt.figure()
plt.bar(event_counts.index, event_counts.values)
plt.title("Event Counts in 2017 (Cleaned Dataset)")
plt.xlabel("Event Type")
plt.ylabel("Count")
# Right-align the rotated labels so each one lines up with its bar
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

# What It Shows
X-axis (Event Type): Different outcomes recorded in the dataset (e.g., single, double, home_run, field_out, etc.).

Y-axis (Count): The total number of times each outcome occurred in the cleaned dataset.

# How to Interpret
This chart gives you a quick view of which outcomes are most/least frequent.

Typically, field_out (or other “out” variations) will be the highest count, since most batted balls result in outs.

The ratio of singles, doubles, triples, and home runs to outs shows the overall offensive environment.

If an event category seems suspiciously low or high, double-check your data cleaning or event mapping logic: are certain events missing, and is each one categorized properly?
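
Raw counts are easier to compare as proportions. The sketch below normalizes the counts and computes the share of batted balls that went for hits. It is illustrative only: `HIT_EVENTS` and `hit_share` are hypothetical names, the set of hit labels is an assumption based on the event names mentioned above, and the sample events are made up.

```python
import pandas as pd

# Assumed set of event labels that count as hits
HIT_EVENTS = {"single", "double", "triple", "home_run"}


def event_shares(events: pd.Series) -> pd.Series:
    """Proportion of each event type, sorted from most to least frequent."""
    return events.value_counts(normalize=True)


def hit_share(events: pd.Series) -> float:
    """Fraction of batted balls whose event is one of HIT_EVENTS."""
    return float(events.isin(HIT_EVENTS).mean())


# Made-up events for illustration; not real Statcast rows
toy_events = pd.Series([
    "field_out", "single", "field_out", "home_run",
    "double", "field_out", "force_out", "single",
])
shares = event_shares(toy_events)
print(shares)
print("Hit share:", hit_share(toy_events))
```

On the real data, pass `df_2017_clean["events"]` to both functions.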

In [None]:
# 4) Spray Chart: hc_x vs. hc_y
# Visualizes where the ball landed on the field
plt.figure()
plt.scatter(
    df_2017_clean["hc_x"],
    df_2017_clean["hc_y"],
    alpha=0.3
)
# hc_y grows toward home plate, so flip the axis to draw the outfield at the top
plt.gca().invert_yaxis()
plt.title("Spray Chart: hc_x vs. hc_y")
plt.xlabel("hc_x")
plt.ylabel("hc_y")
plt.show()

# What It Shows
X-axis (hc_x) and Y-axis (hc_y): Approximate coordinates of the batted ball in a 2D field diagram. In Statcast, (0,0) is usually the top-left corner of the diagram and hc_y increases downward, so home plate sits near the bottom center (roughly hc_x ≈ 125, hc_y ≈ 200) and the outfield has smaller hc_y values. Exact calibrations vary.

# How to Interpret
A spray chart shows you directional tendencies.

If most of the points are on one side, it might indicate pull tendencies for a particular batter (if you filter by player) or general hitting distributions for all batters.

A symmetrical spread might show a balanced or “spray” approach across the field.

You might notice more clustering in certain outfield areas (e.g., a lot of balls around center field if it’s a league-wide dataset).

If you see data that looks truncated or artificially constrained, check if the coordinate transformation (e.g., home plate offset) is properly handled.
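
The home plate offset mentioned above can be sketched as a small transformation that moves the origin to home plate and flips the y-axis so the outfield points up. This is an assumption-laden example: `add_field_coords` is a hypothetical helper, the offsets (125.42, 198.27) are a commonly used community calibration rather than an official constant, and the sample points are made up.

```python
import numpy as np
import pandas as pd

# Assumed home plate location in hc_x / hc_y units (community convention)
HOME_X, HOME_Y = 125.42, 198.27


def add_field_coords(
    df: pd.DataFrame, home_x: float = HOME_X, home_y: float = HOME_Y
) -> pd.DataFrame:
    """Return a copy with home-plate-centered coordinates and a spray angle."""
    out = df.copy()
    out["field_x"] = out["hc_x"] - home_x   # negative = left-field side
    out["field_y"] = home_y - out["hc_y"]   # positive = away from home plate
    # 0 degrees points to straightaway center; the sign follows field_x
    out["spray_angle_deg"] = np.degrees(np.arctan2(out["field_x"], out["field_y"]))
    return out


# Made-up points for illustration; not real Statcast rows
toy = pd.DataFrame({
    "hc_x": [125.42, 75.42, 175.42],
    "hc_y": [98.27, 148.27, 148.27],
})
centered = add_field_coords(toy)
print(centered[["field_x", "field_y", "spray_angle_deg"]])
```

With the spray angle available, pull tendencies can be checked numerically, for example by comparing the counts of negative and positive angles for a single batter.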