# An analysis of the *Mendoza Line* in MLB batting statistics.
-----

The *Mendoza Line* is common U.S. slang for the threshold of seriously below-average performance.
The term originated in baseball, referring to the batting average of shortstop Mario Mendoza.
For those unfamiliar with the origin of the term, there is good background in the [Wikipedia entry on the Mendoza Line] and [this column] from the St. Louis Post-Dispatch.

The term has made Mendoza's last name famous since it was first coined in 1979, but we should verify the figure and analyze where this level of performance falls in the spectrum of other major league batters.
In addition, we'll look at how batting averages over time compare to this figure.

The data used in this analysis comes from SeanLahman.com's [baseball database](http://www.seanlahman.com/baseball-archive/statistics/).

[wikipedia entry on the Mendoza Line]: https://en.wikipedia.org/wiki/Mendoza_Line

[this column]: http://www.stltoday.com/sports/baseball/professional/branded-for-life-with-the-mendoza-line/article_cff05af5-032e-5a29-b5a8-ecc9216b0c02.html

### Table of contents:
1.  Set up
2.  Data  
    2.1  Sources  
    2.2  Data wrangling and initial observations  
    2.3  Data quality check
3.  Exploration and analysis  
    3.1  How bad was this average in the years leading up to 1979?  
    3.2  What percent of batters are below the Mendoza Line over time?
4.  Conclusions  
    4.1  Limitations and areas for further investigation

## 1. Set up
-----
Load the required libraries:

In [None]:
import numpy as np
import pandas as pd
import platform
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, percentileofscore
%matplotlib inline

For readers and reviewers, the versions of the major software components are:

In [None]:
print('python version:', platform.python_version())
print(pd.__name__, 'version', pd.__version__)
print(np.__name__, 'version', np.__version__)

## 2. Data
-----

### 2.1 Sources
As noted earlier, the data used comes from SeanLahman.com's baseball database. Specifically, I used this [dataset](http://seanlahman.com/files/database/baseballdatabank-2017.1.zip) which was updated February 26, 2017 with statistics through the 2016 season.

While the zip repository contains 27 different .csv files covering various statistics, we're only going to use a subset:

1. Master.csv --> player names and biographical data
2. Batting.csv --> batting statistics
3. Appearances.csv --> positional info

### 2.2 Data wrangling and initial observations
Import each of the .csv files into a pandas DataFrame object:

In [None]:
directory = 'core/'
master_df = pd.read_csv(directory + 'Master.csv')
batting_df = pd.read_csv(directory + 'Batting.csv')
appearances_df = pd.read_csv(directory + 'Appearances.csv')

Look at the master table to make sure it loaded correctly:

In [None]:
master_df.head()

First, let's see if we can find Mario Mendoza in our database...

In [None]:
mendozas = master_df.loc[master_df['nameLast'] == 'Mendoza']
mendozas

Judging by the first names and the dates played compared to the biographical info in the background reading, it's pretty easy to find our man in the third row, born in Chihuahua, Mexico in 1950. Let's save his player ID in a variable **mendoza_ID** so we can look up his stats.

In [None]:
mendoza_ID = mendozas[mendozas['nameFirst'] == 'Mario']['playerID'].values[0]
mendoza_ID

Now, let's look up Mario Mendoza's batting statistics.  First, let's look at the batting dataframe:

In [None]:
batting_df.head()

The columns in the batting_df dataframe have the following labels:

In [None]:
#playerID       Player ID code
#yearID         Year
#stint          player's stint (order of appearances within a season)
#teamID         Team
#lgID           League
#G              Games
#AB             At Bats
#R              Runs
#H              Hits
#2B             Doubles
#3B             Triples
#HR             Homeruns
#RBI            Runs Batted In
#SB             Stolen Bases
#CS             Caught Stealing
#BB             Base on Balls
#SO             Strikeouts
#IBB            Intentional walks
#HBP            Hit by pitch
#SH             Sacrifice hits
#SF             Sacrifice flies
#GIDP           Grounded into double plays

Let's examine Mendoza's numbers:

In [None]:
appearances_df[appearances_df['yearID'] >= 1975].info()

Similarly, it looks like there are no missing data points in this subset of the data either. Again, it makes sense that the data sets from 1975 forward would be clean as baseball was very popular during this entire period and keeping detailed statistics had long been part of baseball, even pre-dating the period in question.

## 3. Exploration and analysis
-----

### 3.1 How bad was this average in the years leading up to 1979?
In order to quantify how mediocre a performance batting .200 was in 1979 when the phrase was coined, I want to look at typical batting averages in this time period. To do this, I need to adjust the batting_df dataset in a few different ways:
* Look only at data in the 5 year window from 1975 - 1979
* Remove pitchers
* Remove players without at least 50 at bats in a season *(which could be stints with multiple teams in the same season)*

#### First, create a new dataframe with just the batting data from 1975 to 1979 (inclusive)

In [None]:
def stat_window(df, start_year, end_year):
    search = str(start_year) + ' <= yearID <= ' + str(end_year)
    return df.query(search)

start_year = 1975
end_year = 1979
batting_window = stat_window(batting_df, start_year, end_year)
print(len(batting_window), "batting data records from {} - {}".format(start_year,
                                                                            end_year))
batting_window.head()

In [None]:
batting_window.info()

In [None]:
players_set = set(batting_window['playerID'])
print(len(players_set), "unique players with batting records during this period")

#### Next, remove pitchers from the dataset.  
Pitchers are defined as players with more than one appearance as a pitcher during a season. A threshold of one appearance allows for fielders who might occasionally pitch in an extra-innings situation. This could lead to slight errors in edge cases, such as fielders who routinely pitched or players who switched positions during their careers, but such cases would be very rare during the time period being analyzed.

In [None]:
# Create a set of all players with more than one game pitched in a stint or season
min_G_p = 1
all_pitchers = set(appearances_df[appearances_df['G_p'] > min_G_p]['playerID'])

# remove these players from the batting dataframe
batters_set = set(x for x in batting_window['playerID'] if x not in all_pitchers)
print(len(batters_set), "unique non-pitchers in {} - {}".format(start_year, end_year))

In [None]:
def remove_position(df, position):
    # keep only the rows whose playerID is not in the given collection of player IDs
    return df[~df['playerID'].isin(position)]

batting_window = remove_position(batting_window, all_pitchers)
print(len(batting_window), 'batting data records with pitchers removed')
print(len(set(batting_window['playerID'])), 
      "unique players, should match unique non-pitchers in cell above")
batting_window.head()

#### Next, remove players without at least 50 at bats in that year. 
The intent here is to remove "noisy" data points from players who didn't have at least 50 at bats in a season, which might include short-term call-ups from the minor leagues, injured players, etc. However, we must allow for players to reach this minimum through a combination of 'stints' across different teams in the same season.
***To do this, we create a multi-index*** to sum the at bats ('AB') data by playerID and yearID (to aggregate seasons with multiple stints), so that we can look up our data by player and by year:
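
To make the multi-stint issue concrete, here is a minimal sketch in plain Python of the same aggregation idea. The player IDs, the at-bat counts and the helper name `season_totals` are all made up for illustration; the notebook itself does this with pandas.

```python
from collections import defaultdict

# Hypothetical stint records: (playerID, yearID, stint, AB). Values are made up.
stints = [
    ("player_a", 1977, 1, 30),
    ("player_a", 1977, 2, 25),
    ("player_b", 1977, 1, 40),
]

def season_totals(records):
    """Sum at bats over all stints for each (playerID, yearID) pair."""
    totals = defaultdict(int)
    for player_id, year_id, _stint, at_bats in records:
        totals[(player_id, year_id)] += at_bats
    return dict(totals)

totals = season_totals(stints)
min_at_bats = 50
# A season qualifies when its combined at bats meet the minimum
qualifies = {key: ab >= min_at_bats for key, ab in totals.items()}
print(qualifies)
```

In this toy data, neither of `player_a`'s stints reaches 50 at bats alone, but the season total does, which is why the filter has to be applied to the per-season sum rather than to each record.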

In [None]:
def get_player_year_sum(df, field):
    # Sum `field` over all stints for each (playerID, yearID) pair.
    # Selecting the column before summing avoids trying to add up text columns
    # such as teamID; the result is a Series indexed by a (playerID, yearID) MultiIndex.
    return df.groupby(['playerID', 'yearID'])[field].sum()

stat = 'AB'
player_year_stats = get_player_year_sum(batting_window, stat)
player_year_stats.head(10)

Create a boolean array to check for minimum criteria (at bats) in the season:

In [None]:
min_stat = 50
# For each record, look up the player's season total (by column label rather than
# by position) and flag whether it meets the minimum
required_min = [
    player_year_stats[row['playerID'], row['yearID']] >= min_stat
    for _, row in batting_window.iterrows()
]

batting_window = batting_window[required_min]
print(len(batting_window), 'batting data records with minimum of {} {}'.format(min_stat, stat))

#### Now that we've cleaned up this data, we can analyze the distribution of batting averages.  

In [None]:
BAs_window = batting_window['H']/batting_window['AB']
BAs_window.describe()

From the describe() statement above, the mean of the batting averages was .251, with a standard deviation of .047, **so the Mendoza Line of .200 was about one standard deviation below the mean.** We can also graph the distribution of batting averages to get a visual feel for it.

In [None]:
# bin edges from .100 to .400 in steps of .010
BA_bins = [x/1000 for x in range(100,410,10)]
plt.rcParams['figure.figsize'] = 8, 5
# density=True replaces the 'normed' argument, which newer matplotlib versions no longer accept
BAs_window.hist(bins=BA_bins, density=True, edgecolor='black')
plt.title('MLB batting averages: 1975 - 1979')
plt.axvline(x=0.200, color='black')
plt.text(.190, 7 , "Mendoza Line", rotation=90)
plt.xlabel('Batting average')
plt.ylabel('Density')
plt.show()

Calculating some statistics based on a normal distribution...
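
As a rough hand check before running the numbers on the data, the Z score can be approximated from rounded summary figures. This is only a sketch: the mean of .251 and the standard deviation of roughly .047 are rounded values assumed for illustration, and `NormalDist` from the standard library stands in for `scipy.stats.norm`.

```python
from statistics import NormalDist

# Rounded summary figures, assumed here for illustration only
mean_ba = 0.251
std_ba = 0.047
mendoza_line = 0.200

# Z score: how many standard deviations the line sits from the mean
z_score = (mendoza_line - mean_ba) / std_ba

# Share of a standard normal distribution that falls below that Z score
share_below = NormalDist().cdf(z_score)
print("z = {:.2f}, share below = {:.1%}".format(z_score, share_below))
```

The cell that follows performs the same calculation with the unrounded mean and standard deviation of the data.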

In [None]:
mendoza_Z = (MENDOZA_LINE - BAs_window.mean())/BAs_window.std(ddof=0)
print("The Z score of a .200 batting average is {:4.2f}".format(mendoza_Z))
print("Assuming a normal distribution of batting averages, this would place .200 above",
      "only {:3.1f}% of batters".format(100*norm.cdf(mendoza_Z)))

However, the normal distribution is only an approximation of the data. We can look at the actual percentile rankings of the batting averages to calculate precisely what percentage of batters would fall below the Mendoza Line:

In [None]:
BAs_window.quantile([0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1])

Eyeballing the deciles above implies that an average of .200 would fall just north of the tenth percentile (the point below which only 10% of observations fall). This is even worse than what the normal distribution would imply. We can use **percentileofscore** from the scipy.stats module to figure out precisely what percentage of scores were below .200:
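
For intuition, a "strict" percentile of a score is simply the share of values that are strictly less than it. The sketch below reproduces that idea in plain Python on made-up batting averages; `percent_strictly_below` is a hypothetical helper written for this illustration, not part of scipy.

```python
def percent_strictly_below(values, score):
    """Percentage of values that are strictly less than score."""
    values = list(values)
    return 100.0 * sum(v < score for v in values) / len(values)

# Made-up batting averages, for illustration only
sample_averages = [0.180, 0.200, 0.225, 0.250, 0.275,
                   0.300, 0.310, 0.240, 0.260, 0.195]

# Only 0.180 and 0.195 count; the value equal to 0.200 is excluded
print(percent_strictly_below(sample_averages, 0.200))
```

Because the comparison is strict, a batter hitting exactly .200 is not counted as being below the line.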

In [None]:
def mendoza_percentile(series):
    return percentileofscore(series, MENDOZA_LINE, kind="strict")

print("Given the actual distribution, a .200 batting average was above", 
      "only {:3.1f}% of batters".format(mendoza_percentile(BAs_window)))

#### Conclusion: 
1. The term "Mendoza Line" refers to a batting average of approximately .200, as verified by Mario Mendoza's actual batting average in the years before the term was coined.  
2. This level of performance in the 1975-1979 time frame would have placed a batter in only the 10th percentile. Said another way, almost *90% of batters had a higher average* once we removed pitchers and players without a minimum number of at bats. 

### 3.2 What percent of batters are below the "Mendoza Line" over time?

In the 1975-1979 time frame, when the term Mendoza Line was coined, batting .200 put a player in roughly the 10th percentile of eligible batters (those with at least 50 ABs, excluding pitchers). I'd like to know how this level varied over time thereafter (from 1980 onward). 
  
**Specifically, what percent of batters are below .200 each year?**
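
A minimal sketch of the quantity being asked for, using made-up averages for two seasons. The function name `share_below_by_year` and the data layout (a dict of lists keyed by year) are assumptions for illustration, not the notebook's actual implementation.

```python
def share_below_by_year(averages_by_year, line=0.200):
    """Map each year to the percentage of batting averages strictly below line."""
    return {
        year: 100.0 * sum(ba < line for ba in averages) / len(averages)
        for year, averages in averages_by_year.items()
        if averages  # skip years with no qualifying batters
    }

# Made-up averages for two seasons, for illustration only
toy_data = {
    1980: [0.190, 0.210, 0.250, 0.280],
    1981: [0.185, 0.195, 0.240, 0.260, 0.300],
}
print(share_below_by_year(toy_data))
```

Each year is evaluated against its own pool of qualifying batters, so the result is comparable across seasons with different numbers of players.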

#### First, create a dataset with just the figures from 1980 forward

In [None]:
start_year = 1980
end_year = batting_df['yearID'].max()

batting_window = stat_window(batting_df, start_year, end_year)
print(len(batting_window), "batting data records from {} - {}".format(
                            start_year, end_year))
batting_window.head()

#### Again, remove the pitchers

In [None]:
mendoza_batting_df = batting_df[batting_df['playerID'] == mendoza_ID]
mendoza_batting_df

Create a quick summary of Mendoza's hits and at bats per year, and calculate his batting average **('BA')** - note the convention is to round this to three decimal places:

In [None]:
def calculate_BA(batting_df):
    return (batting_df['H']/batting_df['AB']).round(3)

In [None]:
# build the summary from a dict; DataFrame.from_items was removed in pandas 1.0
mendoza_data = pd.DataFrame({'BA': calculate_BA(mendoza_batting_df),
                             'H': mendoza_batting_df['H'],
                             'AB': mendoza_batting_df['AB']})
mendoza_data.index = mendoza_batting_df['yearID']
mendoza_data

Let's look at his typical batting average in the years up through (and including) 1979 when the phrase was coined:

In [None]:
end_year = 1979
start_year = mendoza_data.index.values.min()
print('Average {} - {} batting average: {:4.3f}'.format(start_year, end_year, 
      mendoza_data[(mendoza_data.index) <= end_year]['BA'].mean()))

#### The Mendoza Line quantified and verified: he was a .200 hitter 

Now, this "average of averages" would give equal weighting to his batting averages from each year regardless of the number of at bats.  Let's redo the previous calculation using the actual hits and at bats from each season:
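
To see why the two figures can differ, consider a small worked example with hypothetical numbers (these are not Mendoza's real statistics): one short season with a high average and one long season with a lower one.

```python
# Hypothetical (hits, at bats) per season; these are not Mendoza's real numbers
seasons = [(10, 40), (60, 300)]

# Unweighted: every season counts equally, regardless of at bats
season_averages = [hits / at_bats for hits, at_bats in seasons]
average_of_averages = sum(season_averages) / len(season_averages)

# Weighted: total hits over total at bats
total_hits = sum(hits for hits, _ in seasons)
total_at_bats = sum(at_bats for _, at_bats in seasons)
cumulative_average = total_hits / total_at_bats

print("average of averages: {:.3f}".format(average_of_averages))
print("cumulative average:  {:.3f}".format(cumulative_average))
```

The short season pulls the unweighted figure upward, while the cumulative figure stays close to the average of the season with most of the at bats.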

In [None]:
print('Cumulative {} - {} batting average: {:4.3f}'.format(start_year, end_year,
    float(mendoza_data[(mendoza_data.index) <= end_year]['H'].sum()/mendoza_data[(mendoza_data.index) <= end_year]['AB'].sum())))

It looks like the cumulative batting average over that period was nearly identical to the average of his batting averages, so the initial figure wasn't skewed by any outlier years.
  
How did he fare from 1980 through the end of his career in 1982?

In [None]:
final_career_year = mendoza_data.index.values.max()
print('{} - {} batting average: {:4.3f}'.format(end_year+1, final_career_year, 
      float(mendoza_data[(mendoza_data.index) > end_year]['H'].sum()/mendoza_data[(mendoza_data.index) > end_year]['AB'].sum())))

He was a little better those last few years, but unfortunately the saying had already become a cultural idiom and the "Mendoza Line" was memorialized as a batting average of **0.200**. 

In [None]:
MENDOZA_LINE = 0.200

### 2.3 Data quality check
We've imported the csv files into three dataframes for our analysis:

1. master_df --> player names and biographical data
2. batting_df --> batting statistics
3. appearances_df --> positional info

The master_df was only needed to find our info for Mario Mendoza as we aren't using biographical data elsewhere in our analysis, so we don't need to scrub this dataset as it has already served its limited purpose. However, we should investigate the batting and appearances datasets to check for data issues.

In [None]:
batting_df.info()

We can see in the information above that it looks like there are a good number of missing data points from the batting records.  This data set goes back to 1871 and it's not surprising that some data may not have been tracked in the same way historically.  However, our analysis will only be covering from 1975 onward, a relatively modern period.  We can check that subset of the data:

In [None]:
batting_df[batting_df['yearID'] >= 1975].info()

Great - it looks like there is no missing batting data in this period. Now, let's verify the same on the appearances data:

In [None]:
batting_window = remove_position(batting_window, all_pitchers)
print(len(batting_window), 'batting data records with pitchers removed')

#### Next, remove players without at least 50 at bats in that year. 
Similar to the process above, we need to create a multiindex to allow for players to have different 'stints' across different teams in the same season.  Note that to qualify for awards like the batting title, the minimum level of appearances is much higher.  

Create the at bats multiindex for the 1980 onward batting data:

In [None]:
stat = 'AB'
player_year_stats = get_player_year_sum(batting_window, stat)
player_year_stats.head(10)

And remove the players with fewer than 50 ABs in a year from our post-1980 batting dataframe:

In [None]:
batting_window.head(10)

In [None]:
#  helper function to return an array with the qualifying batting averages for any given year
def get_annual_BA(year):
    annual_data = batting_window[batting_window['yearID'] == year]
    return (annual_data['H']/annual_data['AB']).values

# create a dataframe with a column containing the qualifying batting averages for each year
# note that the columns will be of varying lengths, but pandas will pad the missing values with NaN

BA_dict = {x: get_annual_BA(x) for x in range(start_year, end_year+1)}
annual_BA_df = pd.DataFrame.from_dict(BA_dict, orient='index')
annual_BA_df = annual_BA_df.transpose()
annual_BA_df.head()

#### Quick detour: 
Let's take a look at the 1980s to get a feel for how batting averages are distributed by year.