### **<span style="color:navy">Final Project: Visualizing MLB Batting Performance Trends</span>**

#### **1. Recap of Data, Goals, and Tasks**

**Dataset Overview**

- **Source:** [Kaggle (2023 MLB Player Stats)](https://www.kaggle.com/datasets/vivovinco/2023-mlb-player-stats)
- **Focus:** Batting statistics for the 2023 MLB season
- **Key Attributes Used:** Home Runs (`HR`), Batting Average (AVG, stored in the dataset as `BA`), Slugging Percentage (`SLG`), On-Base Percentage (`OBP`), Team (`Tm`)

**Project Goals**

1. Analyze the relationship between Batting Average (AVG) and Home Runs (HR)
2. Compare power-focused teams (HR, SLG) vs. contact-focused teams (AVG, OBP)
3. Identify the top home-run-hitting teams in 2023

**Planned Visualizations**

- **Scatter Plot:** HR vs. AVG (color-coded by team)
- **Violin Plot:** Distribution of HR, SLG, AVG, and OBP by team
- **Bar Chart:** Total HR per team (sorted in descending order)

----

#### **Import Libraries and Load Data**

In [187]:
import pandas as pd
import altair as alt

import warnings
warnings.filterwarnings('ignore')

# Load the data with specified encoding and delimiter
mlb_data = pd.read_csv('2023 MLB Player Stats - Batting.csv', encoding='ISO-8859-1', delimiter=';')

In [188]:
mlb_data.columns

Index(['Rk', 'Name', 'Age', 'Tm', 'Lg', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B',
       'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+',
       'TB', 'GDP', 'HBP', 'SH', 'SF', 'IBB'],
      dtype='object')
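
The column list shows that batting average is stored as `BA`, not `AVG`. A minimal sketch of a sanity check that could follow the load step, assuming the semicolon-delimited layout used above; the helper name `check_batting_columns` and the two-row inline sample are illustrative and not part of the original notebook:

```python
import io
import pandas as pd

# Columns the three planned charts rely on (BA is the dataset's name for AVG).
REQUIRED = ["Name", "Tm", "HR", "BA", "OBP", "SLG"]

def check_batting_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast on missing columns and coerce the numeric stats."""
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    out = df.copy()
    for col in ["HR", "BA", "OBP", "SLG"]:
        # Non-numeric entries become NaN instead of raising.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out

# Made-up sample in the same semicolon-delimited layout.
sample_csv = (
    "Name;Tm;HR;BA;OBP;SLG\n"
    "Player A;HOU;30;0.280;0.360;0.520\n"
    "Player B;ATL;unknown;0.250;0.310;0.400\n"
)
sample = check_batting_columns(pd.read_csv(io.StringIO(sample_csv), delimiter=";"))
```

On the real data the call would be `mlb_data = check_batting_columns(mlb_data)`.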

### **2. Visualization Implementation**

Scatter Plot: Shows the relationship between Home Runs (HR) and Batting Average (AVG), colored by team. This helps analyze whether power hitters maintain high averages.

In [189]:

# Convert categorical columns to strings (Altair prefers this)
mlb_data["Tm"] = mlb_data["Tm"].astype(str)

# Dropdown filter bound to the team column (no team is pre-selected, so all teams show until one is chosen)
team_dropdown = alt.binding_select(options=mlb_data["Tm"].unique().tolist(), name="Select Team: ")
team_selection = alt.selection_single(fields=["Tm"], bind=team_dropdown, name="Team")

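# Shared base: axes plus the team filter, so both layers see the same rows
base = alt.Chart(mlb_data).encode(
    x=alt.X("HR:Q", title="Home Runs"),
    y=alt.Y("BA:Q", title="Batting Average")
).transform_filter(
    team_selection
)

# Scatter layer: one circle per player, colored by team
points = base.mark_circle(size=80).encode(
    color=alt.Color("Tm:N", legend=None),
    tooltip=["Name:N", "Tm:N", "HR:Q", "BA:Q"]
).add_selection(
    team_selection
)

# Regression layer, fitted on the filtered rows. It is built from the base
# rather than from the scatter layer so the selection is registered only once
# and the team color encoding does not override the red line.
regression = base.transform_regression("HR", "BA").mark_line(color="red")

# Layer the two and enable pan/zoom
scatter_plot = (points + regression).properties(
    title="Home Runs vs Batting Average (Interactive)",
    width=600,
    height=400
).interactive()

# Display the scatter plot
scatter_plot.show()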
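
The visual trend can be backed by a single number. The sketch below computes the Pearson correlation between `HR` and `BA`. The `min_pa` cutoff of 100 plate appearances is an assumed, illustrative threshold (it does not come from the original analysis) meant to keep players with very few plate appearances from dominating the result, and the sample rows are made up.

```python
import pandas as pd

def hr_ba_correlation(df: pd.DataFrame, min_pa: int = 100) -> float:
    """Pearson correlation between HR and BA for players with at least min_pa PA."""
    regulars = df[df["PA"] >= min_pa]
    return regulars["HR"].corr(regulars["BA"])

# Made-up rows: three regulars and one player with a tiny sample.
toy = pd.DataFrame({
    "PA": [600, 550, 400, 20],
    "HR": [40, 25, 10, 0],
    "BA": [0.300, 0.270, 0.240, 0.500],
})
r = hr_ba_correlation(toy)
```

With the notebook's data, `hr_ba_correlation(mlb_data)` would give the corresponding value. Comparing it against `hr_ba_correlation(mlb_data, min_pa=0)` shows how sensitive the result is to the cutoff.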


Violin Plot: Displays the distribution of player Home Runs within each team, a first step toward comparing power-focused and contact-focused teams.

In [190]:
import plotly.express as px

# Violin plot of player home runs per team, with an embedded box plot
fig = px.violin(mlb_data, x="Tm", y="HR", box=True, points=False, title="Distribution of Home Runs by Team",
                labels={"Tm": "Team", "HR": "Home Runs"}, color="Tm")


# Show the plot
fig.show()
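
The planned violin plot covers HR, SLG, AVG, and OBP, while the cell above plots only HR. One possible way to extend it, sketched here with made-up rows, is to reshape the table to long form so that each metric can become its own facet. The helper name `to_long_metrics` is hypothetical.

```python
import pandas as pd

METRICS = ["HR", "SLG", "BA", "OBP"]

def to_long_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """One row per (player, metric) pair, ready for a faceted plot."""
    return df.melt(
        id_vars=["Name", "Tm"],
        value_vars=METRICS,
        var_name="Metric",
        value_name="Value",
    )

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Tm": ["HOU", "ATL"],
    "HR": [30, 12],
    "SLG": [0.520, 0.400],
    "BA": [0.280, 0.250],
    "OBP": [0.360, 0.310],
})
long_df = to_long_metrics(toy)
```

With the real data, `px.violin(to_long_metrics(mlb_data), x="Tm", y="Value", facet_row="Metric", box=True)` would draw one row of violins per metric. Because HR is a count and the other three are rates, the y-axes would likely need to be unlinked, for example with `fig.update_yaxes(matches=None)`.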

Bar Chart: Ranks MLB teams by total Home Runs, making it easy to identify the most power-heavy teams.

In [191]:
# Aggregate total home runs by team for the Altair bar chart
hr_by_team = mlb_data.groupby('Tm')['HR'].sum().reset_index()

# Sort teams by total home runs in descending order
hr_by_team = hr_by_team.sort_values(by='HR', ascending=False)

# Create a bar chart
bar_chart = alt.Chart(hr_by_team).mark_bar().encode(
    x=alt.X('Tm:N', sort='-y', title='Team'),
    y=alt.Y('HR:Q', title='Total Home Runs'),
    color=alt.Color('HR:Q', scale=alt.Scale(scheme='blues'), legend=None),
    tooltip=['Tm:N', 'HR:Q']
).properties(
    title='Total Home Runs by Team',
    width=600,
    height=400
).interactive()

bar_chart
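
To compare power-focused and contact-focused teams (goal 2), team-level rates are more reliable when rebuilt from counting stats than when player rates are averaged, since a simple mean weights a 10-at-bat player the same as an everyday starter. The sketch below uses the standard definitions of BA, SLG, and OBP and the counting columns listed in `mlb_data.columns`; the function name and sample numbers are illustrative. It assumes each row is one player's line for one team. If the file also carries combined rows for players who changed teams, those rows would need filtering first.

```python
import pandas as pd

def team_profiles(df: pd.DataFrame) -> pd.DataFrame:
    """Team HR totals plus rate stats recomputed from summed counting stats."""
    cols = ["AB", "H", "TB", "HR", "BB", "HBP", "SF"]
    t = df.groupby("Tm")[cols].sum()
    t["BA"] = t["H"] / t["AB"]                      # hits per at-bat
    t["SLG"] = t["TB"] / t["AB"]                    # total bases per at-bat
    t["OBP"] = (t["H"] + t["BB"] + t["HBP"]) / (
        t["AB"] + t["BB"] + t["HBP"] + t["SF"]
    )
    return t[["HR", "BA", "OBP", "SLG"]].reset_index()

# Made-up rows: two HOU players and one ATL player.
toy = pd.DataFrame({
    "Tm": ["HOU", "HOU", "ATL"],
    "AB": [500, 100, 400], "H": [150, 20, 100], "TB": [250, 30, 200],
    "HR": [30, 2, 35], "BB": [50, 5, 40], "HBP": [5, 0, 5], "SF": [5, 0, 5],
})
profiles = team_profiles(toy)
```

Plotting `SLG` against `BA` from `team_profiles(mlb_data)` would place power-heavy teams high on one axis and contact-heavy teams high on the other.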


### **3. Summary of Key Design Elements**

**Scatter Plot Enhancements:**

- Color-coded by team for quick identification.
- Semi-transparent points (opacity 0.7, the default for Altair circle marks) to reduce the impact of overlapping points.
- Legend hidden to avoid clutter; the tooltip and team dropdown identify each point's team.

**Violin Plot Enhancements:**

- Embedded box plot (`box=True`) marking the median and quartiles to show the distribution clearly.
- Hover tooltips that report each violin's summary statistics.

**Bar Chart Enhancements:**

- Sorted in descending order for easy interpretation.
- Blue gradient (`blues` color scheme), with darker bars marking higher totals.
- Grid lines for readability.

------

### **4. Evaluation Approach**

**Participants:**

- A mix of baseball coaches, fans, and analysts.
- If experts are unavailable, friends, family or classmates with an interest in baseball analytics will be recruited.

**Evaluation Methods:**

1. User Surveys & Questionnaires:
    - Rate clarity, effectiveness, and ease of interpretation.
    - Example question: "On a scale of 1-5, how easy was it to interpret the scatter plot?"

2. Task-Based Testing:
    - Participants will complete tasks such as:
        - "Identify the team with the most home runs."
        - "Determine whether high-HR players generally have high AVG."
    - For deeper insights, combine open-ended and multiple-choice questions. A multiple-choice example:
      *"Which team shows the highest variability in home run distribution?"*  
        a) Yankees  
        b) Dodgers  
        c) Braves  
        d) Astros  

3. Expert Feedback:
    - Baseball coaches will review statistical validity and practical use cases.
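
For the task-based questions, an answer key can be derived from the data rather than judged by eye. A minimal sketch, assuming "variability" is measured as the standard deviation of player HR totals within a team (the interquartile range would be an equally defensible choice); the function name and rows are made up:

```python
import pandas as pd

def answer_key(df: pd.DataFrame) -> dict:
    """Ground-truth answers for two of the evaluation tasks."""
    by_team = df.groupby("Tm")["HR"]
    return {
        "most_home_runs": by_team.sum().idxmax(),
        "highest_hr_variability": by_team.std().idxmax(),
    }

# Made-up rows: HOU hits more in total, ATL is far more spread out.
toy = pd.DataFrame({
    "Tm": ["HOU", "HOU", "HOU", "ATL", "ATL", "ATL"],
    "HR": [20, 21, 19, 40, 5, 0],
})
key = answer_key(toy)
```

Note that the dataset identifies teams by abbreviation in `Tm`, so answer options phrased as team nicknames would need to be mapped to those codes before scoring.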

---



### **5. Findings and Refinements**
- **Success Criteria**:
  - 80% of users should find the visualizations intuitive.
  - 70% of users should correctly identify key insights.
  - 50% of users should interact with the filters or tooltips in the interactive charts.
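
Once responses are collected, the criteria above can be checked mechanically. The sketch below assumes each participant's results have already been reduced to three yes/no flags (for instance, "intuitive" could mean a rating of 4 or 5 on the survey scale). The field names, that mapping, and the five sample responses are all hypothetical.

```python
# Thresholds taken from the success criteria listed above.
THRESHOLDS = {"intuitive": 0.80, "correct_insight": 0.70, "used_interaction": 0.50}

def evaluate(responses: list) -> dict:
    """Share of participants meeting each criterion, and whether the target is hit."""
    n = len(responses)
    report = {}
    for criterion, target in THRESHOLDS.items():
        share = sum(bool(r[criterion]) for r in responses) / n
        report[criterion] = {"share": share, "met": share >= target}
    return report

# Made-up responses from five hypothetical participants.
responses = [
    {"intuitive": True, "correct_insight": True, "used_interaction": True},
    {"intuitive": True, "correct_insight": True, "used_interaction": False},
    {"intuitive": True, "correct_insight": False, "used_interaction": False},
    {"intuitive": True, "correct_insight": True, "used_interaction": True},
    {"intuitive": False, "correct_insight": False, "used_interaction": False},
]
report = evaluate(responses)
```

In this made-up sample only the first criterion is met, which would point the refinement effort at the insight tasks and the interactive controls.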

---

### **Next Steps**
1. **Gather User Feedback** through surveys, usability tests, and expert reviews.
2. **Summarize Evaluation Results** and discuss insights gained.
3. **Refine Visualizations** based on feedback.
4. **Prepare Final Report** including screenshots, links (if interactive), and a summary of findings.
