# Welcome to the Moneyball Project!

In this project, we perform a statistical analysis of a baseball dataset to find the metrics that best predict success and to identify the most undervalued players.



We'll start by understanding our problem, identifying:

- Who are our key stakeholders?
- What do they want to solve?
- What kind of data do they have?

Once we have all of this information, we will take a step back and develop a well-considered plan for designing our solution.

## Milestone 1 : Background
### Moneyball

Moneyball is the approach associated with Billy Beane, who was the general manager of the Oakland Athletics baseball team in the early 2000s. The Moneyball story revolves around Beane's innovative way of assembling a competitive team on a limited budget: using statistical analysis to identify undervalued players and build a winning roster. You'll be doing the same thing in this project!

At the time, baseball teams often relied on traditional scouting methods to evaluate players, such as watching them play in person and forming subjective opinions. Beane believed that statistical analysis could be used to identify players who were undervalued by other teams and who could therefore be acquired at a lower cost.

Despite skepticism from many in the baseball community, Beane's approach proved successful, as the Athletics went on to make the playoffs in several consecutive years despite having one of the lowest payrolls in the league. The Moneyball story has since become a popular example of how data-driven decision making can lead to success in sports and other fields.

Now that we have some overall context for the project, we can start learning about the basics of baseball! Take some time to watch the following video, and make sure you understand how the game works overall!



## Milestone 2 : Initial Data Analysis


In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

teams = pd.read_csv("https://drive.google.com/uc?id=1kbfBhzHNqfg4Af81mdGRTyMKPt1kCGUu")


teams = teams.loc[(teams["yearID"] > 1972)  &
                  (teams["yearID"] != 1981) &
                  (teams["yearID"] != 1994) &
                  (teams["yearID"] != 1995) &
                  (teams["yearID"] != 2020)]

columns_to_drop =  [
                    # team IDs in other datasets
                    'teamIDBR', 'teamIDlahman45', 'teamIDretro',

                    # IDs for franchise, division, league
                    'franchID', 'divID', 'lgID',

                    # other success metrics
                    'DivWin', 'WCWin', 'LgWin', 'WSWin', 'Rank',

                    # Batting/Pitching Park Factor and related features
                    'BPF', 'PPF', 'Ghome', 'park', 'attendance',

                    # more batting stats
                    'SB', 'CS', 'SO',

                    # misc defense/pitching stats
                    'ER', 'E', 'CG', 'SHO', 'SV', 'IPouts', 'DP', 'FP',
                    'ERA', 'RA', 'HA', 'HRA', 'BBA', 'SOA'
                    ]
teams = teams.drop(columns=columns_to_drop)

# Rename columns to be more descriptive
abbrev_map = {
    "R" : "Runs",

    "AB" : "AtBats",

    "H" : "Hits",
    "2B" : "Doubles",
    "3B" : "Triples",
    "HR" : "HomeRuns",

    "BB" : "Walks",
    "HBP" : "HitsByPitch",
    "SF" : "SacrificeFlies",

    "W" : "Wins",
    "L" : "Losses",
    "G" : "Games",
}
teams = teams.rename(columns=abbrev_map)

# Move team name column
teams.insert(teams.columns.get_loc('teamID') + 1, 'Name',
              teams.pop("name"))

teams.reset_index(inplace=True, drop=True)

# Disable numexpr for all eval/query
pd.set_option("compute.use_numexpr", False)

We've loaded the data into the dataframe teams. Here is a quick look at the dataframe and the type of data it holds.

In [None]:
teams.head(10)

Unnamed: 0,yearID,teamID,Name,Games,Wins,Losses,Runs,AtBats,Hits,Doubles,Triples,HomeRuns,Walks,HitsByPitch,SacrificeFlies
0,1973,ATL,Atlanta Braves,162,76,85,799,5631,1497,219,34,206,608,34.0,46.0
1,1973,BAL,Baltimore Orioles,162,97,65,754,5537,1474,229,48,119,648,43.0,49.0
2,1973,BOS,Boston Red Sox,162,89,73,738,5513,1472,235,30,147,581,31.0,44.0
3,1973,CAL,California Angels,162,79,83,629,5505,1395,183,29,93,509,34.0,45.0
4,1973,CHA,Chicago White Sox,162,77,85,652,5475,1400,228,38,111,537,32.0,42.0
5,1973,CHN,Chicago Cubs,161,77,84,614,5363,1322,201,21,117,575,20.0,37.0
6,1973,CIN,Cincinnati Reds,162,99,63,741,5505,1398,232,34,137,639,31.0,51.0
7,1973,CLE,Cleveland Indians,162,71,91,680,5592,1429,205,29,158,471,32.0,43.0
8,1973,DET,Detroit Tigers,162,85,77,642,5508,1400,213,32,157,509,39.0,31.0
9,1973,HOU,Houston Astros,162,82,80,681,5532,1391,216,35,134,469,33.0,36.0


Let's take a closer look at all the data we have to work with! First, let's figure out the range of the years of data. 



In [6]:
print(teams['yearID'].min())
print(teams['yearID'].max())

1973
2022


Now we know the range of years we are working with is from 1973 to 2022.
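
The range has gaps: the loading cell keeps seasons after 1972 and drops 1981, 1994, 1995 and 2020, presumably because those seasons were shortened and their raw totals are not comparable with full seasons. As a minimal sketch on made-up data, the same filter can be written with `isin` instead of chained `!=` comparisons. The `demo` frame and the `SHORTENED_SEASONS` name are illustrative and not part of the project data.

```python
import pandas as pd

# Hypothetical toy data: one row per season, not the real teams table.
demo = pd.DataFrame({"yearID": [1972, 1973, 1981, 1994, 1995, 2019, 2020, 2021]})

# Seasons excluded by the loading cell.
SHORTENED_SEASONS = [1981, 1994, 1995, 2020]

# Same logic as the chained comparisons: after 1972 and not in the excluded list.
kept = demo.loc[(demo["yearID"] > 1972) & ~demo["yearID"].isin(SHORTENED_SEASONS)]

print(kept["yearID"].tolist())  # [1973, 2019, 2021]
```

Keeping the excluded years in one list makes it easier to see, and later change, which seasons are left out.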

#### Batting Average
Batting average is a statistic used in baseball to measure performance at the plate. For a player, it is the number of hits divided by the total number of at-bats; here we apply the same formula to each team's season totals:

Batting Average = Total Hits / Total At-Bats 

Batting average is one of the most well-known and widely used statistics in baseball, and is often used to evaluate a player's overall hitting ability.



We will now use the cell below to create a new column called BattingAverage from a calculation on other columns!

In [8]:
teams["BattingAverage"] = teams["Hits"] / teams["AtBats"]
teams.head()

Unnamed: 0,yearID,teamID,Name,Games,Wins,Losses,Runs,AtBats,Hits,Doubles,Triples,HomeRuns,Walks,HitsByPitch,SacrificeFlies,BattingAverage
0,1973,ATL,Atlanta Braves,162,76,85,799,5631,1497,219,34,206,608,34.0,46.0,0.26585
1,1973,BAL,Baltimore Orioles,162,97,65,754,5537,1474,229,48,119,648,43.0,49.0,0.266209
2,1973,BOS,Boston Red Sox,162,89,73,738,5513,1472,235,30,147,581,31.0,44.0,0.267005
3,1973,CAL,California Angels,162,79,83,629,5505,1395,183,29,93,509,34.0,45.0,0.253406
4,1973,CHA,Chicago White Sox,162,77,85,652,5475,1400,228,38,111,537,32.0,42.0,0.255708


#### Slugging Percentage
Slugging percentage is a statistic used in baseball to measure a player's power at the plate.

It is calculated by dividing the total number of bases a player earns on hits by their total number of at-bats. Unlike batting average, which counts every hit equally, slugging percentage weights each hit by the number of bases the batter reaches on it!

SluggingPercentage = Total # of Bases From Hits / Total At-Bats 

However, while we have columns for doubles, triples, and home runs (hits on which the batter reaches 2nd base, reaches 3rd base, or comes all the way around to score), we don't yet have a column for singles, where the batter only makes it to 1st base!

To add a singles column, we will use the code below and then display some rows.

In [9]:
teams["Singles"] = teams.eval("Hits - (Doubles + Triples + HomeRuns)")
teams.head()

Unnamed: 0,yearID,teamID,Name,Games,Wins,Losses,Runs,AtBats,Hits,Doubles,Triples,HomeRuns,Walks,HitsByPitch,SacrificeFlies,BattingAverage,Singles
0,1973,ATL,Atlanta Braves,162,76,85,799,5631,1497,219,34,206,608,34.0,46.0,0.26585,1038
1,1973,BAL,Baltimore Orioles,162,97,65,754,5537,1474,229,48,119,648,43.0,49.0,0.266209,1078
2,1973,BOS,Boston Red Sox,162,89,73,738,5513,1472,235,30,147,581,31.0,44.0,0.267005,1060
3,1973,CAL,California Angels,162,79,83,629,5505,1395,183,29,93,509,34.0,45.0,0.253406,1090
4,1973,CHA,Chicago White Sox,162,77,85,652,5475,1400,228,38,111,537,32.0,42.0,0.255708,1023


##### Adding the Slugging Percentage Column
Now that we have the 'Singles' column, we can use the code below to create the SluggingPercentage column. 

In [10]:
teams["SluggingPercentage"] = teams.eval("(Singles + 2*Doubles + 3*Triples + 4*HomeRuns) / AtBats")
teams.head()

Unnamed: 0,yearID,teamID,Name,Games,Wins,Losses,Runs,AtBats,Hits,Doubles,Triples,HomeRuns,Walks,HitsByPitch,SacrificeFlies,BattingAverage,Singles,SluggingPercentage
0,1973,ATL,Atlanta Braves,162,76,85,799,5631,1497,219,34,206,608,34.0,46.0,0.26585,1038,0.426567
1,1973,BAL,Baltimore Orioles,162,97,65,754,5537,1474,229,48,119,648,43.0,49.0,0.266209,1078,0.389381
2,1973,BOS,Boston Red Sox,162,89,73,738,5513,1472,235,30,147,581,31.0,44.0,0.267005,1060,0.400508
3,1973,CAL,California Angels,162,79,83,629,5505,1395,183,29,93,509,34.0,45.0,0.253406,1090,0.347866
4,1973,CHA,Chicago White Sox,162,77,85,652,5475,1400,228,38,111,537,32.0,42.0,0.255708,1023,0.372055
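
To see where one of these values comes from, the calculation can be repeated by hand for a single row. The sketch below uses the 1973 Atlanta Braves totals copied from the first row of the table above; the variable names are only illustrative.

```python
# Counting stats for the 1973 Atlanta Braves, copied from the first table row.
hits, doubles, triples, home_runs, at_bats = 1497, 219, 34, 206, 5631

# Singles are the hits that were not doubles, triples, or home runs.
singles = hits - (doubles + triples + home_runs)

# Total bases weight each hit by the number of bases it is worth.
total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs

slugging = total_bases / at_bats
print(singles, total_bases, round(slugging, 6))  # 1038 2402 0.426567
```

The result agrees with the Singles and SluggingPercentage values shown for that row.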


#### On-Base Percentage
Hitting the ball isn't the only way to get on base in baseball! Here we'll introduce the terms walk, hit by pitch and sacrifice fly before discussing On-Base Percentage.

- Hit By Pitch: exactly what it sounds like. If a batter is hit by a pitch, he is awarded first base.
- Sacrifice Fly: occurs when a batter hits a fly ball to the outfield or foul territory that is caught for an out but still allows a runner to score. It does not count as an at-bat, so the out doesn't lower the batter's batting average, since the batter still helped the team score a run.
- Walk: when a batter is awarded first base after receiving four balls.

On-base percentage incorporates these in its formula:

On-Base Percentage = (Hits + Walks + Hits By Pitch) / (At-Bats + Walks + Hits By Pitch + Sacrifice Flies)

Note that this metric includes a batter's ability to hit the ball and get on base, as well as other ways the batter is able to get on base.

Now we will add the OnBasePercentage (OBP) column to our teams dataframe.

In [11]:
teams["OnBasePercentage"] = teams.eval("(Hits + Walks + HitsByPitch) / (AtBats + Walks + HitsByPitch + SacrificeFlies)")
teams.head()

Unnamed: 0,yearID,teamID,Name,Games,Wins,Losses,Runs,AtBats,Hits,Doubles,Triples,HomeRuns,Walks,HitsByPitch,SacrificeFlies,BattingAverage,Singles,SluggingPercentage,OnBasePercentage
0,1973,ATL,Atlanta Braves,162,76,85,799,5631,1497,219,34,206,608,34.0,46.0,0.26585,1038,0.426567,0.338503
1,1973,BAL,Baltimore Orioles,162,97,65,754,5537,1474,229,48,119,648,43.0,49.0,0.266209,1078,0.389381,0.34491
2,1973,BOS,Boston Red Sox,162,89,73,738,5513,1472,235,30,147,581,31.0,44.0,0.267005,1060,0.400508,0.337818
3,1973,CAL,California Angels,162,79,83,629,5505,1395,183,29,93,509,34.0,45.0,0.253406,1090,0.347866,0.31807
4,1973,CHA,Chicago White Sox,162,77,85,652,5475,1400,228,38,111,537,32.0,42.0,0.255708,1023,0.372055,0.323529
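
As a quick check of the formula, the same arithmetic can be done by hand for one row. This sketch again uses the 1973 Atlanta Braves totals from the first row of the table above and compares the result with batting average; the variable names are only illustrative.

```python
# Counting stats for the 1973 Atlanta Braves, copied from the first table row.
hits, walks, hit_by_pitch, sacrifice_flies, at_bats = 1497, 608, 34, 46, 5631

# Numerator: every time a batter reached base by a hit, a walk, or a hit by pitch.
times_on_base = hits + walks + hit_by_pitch

# Denominator: the opportunities counted by the OBP formula.
opportunities = at_bats + walks + hit_by_pitch + sacrifice_flies

on_base_percentage = times_on_base / opportunities
batting_average = hits / at_bats

print(f"{batting_average:.3f} {on_base_percentage:.3f}")  # 0.266 0.339
```

The gap between the two numbers comes from walks and hit-by-pitches, which batting average ignores.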


#### Exploring Team Performance Over Time
Now that we've calculated additional features like Slugging Percentage and On-Base Percentage, we can gain deeper insights into the performance of baseball teams over time. Visualizing these metrics allows us to see how different aspects of a team's performance evolve across seasons. By selecting specific teams and features, we can:

- Analyze Trends: Identify upward or downward trends in team performance metrics such as Wins, Runs, and Batting Average.
- Compare Performance: Compare how different teams fare in various statistical categories, highlighting strengths and weaknesses.
- Understand Impact: Understand the impact of specific metrics on overall team success. For example, see how improvements in Slugging Percentage correlate with Runs scored.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets
from ipywidgets import interact

# Function to plot team progression over time by a given feature
def plot_team_progression(team, feature):
    plt.figure(figsize=(12, 8))
    team_data = teams[teams['teamID'] == team]
    plt.plot(team_data['yearID'], team_data[feature], marker='o', label=team_data['Name'].iloc[0])

    plt.title(f'{team_data["Name"].iloc[0]} Progression Over Time by {feature}')
    plt.xlabel('Year')
    plt.ylabel(feature)
    plt.legend()
    plt.grid(True)
    plt.show()

# Interactive widgets
team_selector = widgets.Dropdown(
    options=sorted(teams['teamID'].unique()),  # plain, alphabetically sorted list of team IDs
    description='Team:'
)

feature_selector = widgets.Dropdown(
    options=['Games', 'Wins', 'Losses', 'Runs', 'AtBats', 'Hits', 'Doubles', 'Triples', 'HomeRuns', 'Walks', 'HitsByPitch', 'SacrificeFlies', 'BattingAverage', 'Singles', 'SluggingPercentage', 'OnBasePercentage'],
    description='Feature:'
)

# Display interactive plot
interact(plot_team_progression, team=team_selector, feature=feature_selector)
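
The widget above shows one team at a time. For the "compare performance" use case, one possible approach is to reshape the data so that each team becomes a column. The sketch below is only an illustration: `compare_teams` is a hypothetical helper, and the `demo` rows and team IDs are made up rather than taken from the teams dataframe.

```python
import pandas as pd

# Hypothetical toy rows shaped like the teams table (IDs and values are made up).
demo = pd.DataFrame({
    "yearID": [2001, 2001, 2002, 2002],
    "teamID": ["AAA", "BBB", "AAA", "BBB"],
    "Wins":   [102, 95, 88, 103],
})

def compare_teams(df, team_ids, feature):
    """Return one row per year and one column per team for the chosen feature."""
    subset = df[df["teamID"].isin(team_ids)]
    return subset.pivot(index="yearID", columns="teamID", values=feature)

wide = compare_teams(demo, ["AAA", "BBB"], "Wins")
print(wide)
```

In a notebook, calling `wide.plot(marker="o")` on the result would draw one line per team on shared axes.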

#### Win Percentage and Team Performance Correlation

In baseball, understanding the relationship between various batting statistics and a team's overall performance is crucial. One way to explore these relationships is by examining the correlation between different batting metrics and the team's success, measured in this dataframe by Wins.

Run the cell below to calculate and display the correlation matrix as a heatmap, which will give you a sense of how each pair of variables is related! This visualization helps in quickly identifying strong or weak relationships between these metrics.

In [None]:
correlationMatrix = teams.corr(numeric_only=True)

plt.figure(figsize=(10, 8))
sns.heatmap(correlationMatrix, annot=True, cmap='coolwarm', linewidths=.5, annot_kws={"size": 6})
plt.title('Correlation Matrix Heatmap')
plt.show()
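
The teams dataframe built so far holds raw Wins and Losses rather than a win-percentage column. As a minimal sketch, assuming win percentage is defined as Wins / (Wins + Losses), such a column could be derived and its correlations ranked as follows. The `demo` rows are made up and the `WinPercentage` column name is an assumption, not something defined earlier in this notebook.

```python
import pandas as pd

# Hypothetical toy rows using the same column names as the teams table (values are made up).
demo = pd.DataFrame({
    "Wins":             [60, 75, 81, 90, 100],
    "Losses":           [102, 87, 81, 72, 62],
    "Runs":             [600, 690, 720, 800, 850],
    "OnBasePercentage": [0.300, 0.315, 0.320, 0.335, 0.345],
})

# Assumed definition: share of decided games that were won.
demo["WinPercentage"] = demo["Wins"] / (demo["Wins"] + demo["Losses"])

# Correlation of every other numeric column with WinPercentage, largest magnitude first.
corr_with_win_pct = (
    demo.corr(numeric_only=True)["WinPercentage"]
    .drop("WinPercentage")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_win_pct)
```

Reading a single sorted column is often easier than scanning a full heatmap when only one target variable matters.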

A correlation coefficient close to +1 indicates a strong positive correlation (as one statistic increases, so does the other), while a coefficient close to -1 indicates a strong negative correlation. A coefficient near 0 suggests little to no linear relationship.

Remember, correlation does not imply causation. These correlations can suggest relationships but do not confirm direct cause-and-effect.

Here is a little more in-depth breakdown on what the different ranges of correlation values actually indicate:

- Very Strong Correlation: |r| >= 0.9
- Strong Correlation: 0.7 <= |r| < 0.9
- Moderate Correlation: 0.5 <= |r| < 0.7
- Weak Correlation: 0.3 <= |r| < 0.5
- Very Weak or No Correlation: |r| < 0.3
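
These ranges can be turned into a small lookup, which is handy when reading many coefficients off the heatmap. The function below is a hypothetical helper written for illustration; the thresholds are exactly the ones listed above.

```python
def correlation_strength(r):
    """Map a correlation coefficient to the labels listed above, using |r|."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    magnitude = abs(r)
    if magnitude >= 0.9:
        return "very strong"
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.3:
        return "weak"
    return "very weak or none"

# The sign only gives the direction; the label depends on the magnitude.
for r in (0.95, -0.72, 0.5, -0.31, 0.1):
    print(r, correlation_strength(r))
```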

That was a good introduction to baseball rules and other basic stats!

In the next two notebooks, we will start using these and other data in models that predict player salaries and pick out the most undervalued players for each team, which is what this project set out to do!

