# The Worst Hitter in MLB History

In [None]:
import pandas as pd
batting = pd.read_csv('../data/Batting.csv')

Data courtesy of [Sean Lahman's Baseball Archive](http://www.seanlahman.com/baseball-archive/statistics/).

In [None]:
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np

## Exploring the Batting Dataset

In [None]:
batting.head()

In [None]:
batting.columns

In [None]:
batting.tail()

### Checking out Pete Rose's Career Numbers

In [None]:
batting[batting['playerID'] == 'rosepe01']

### Adding a `seasons` column

In [None]:
batting['seasons'] = batting['stint'].map(lambda x: x if x == 1 else 0)

### Adding a `year_rect` column

This zeroes out the repeated copies of a year that appear when a player played for multiple teams in the same season. Summed over a career, `year_rect` and `seasons` together give the average year a player was active, which is how the `Era` column is defined below.

In [None]:
def year_correct(row):
    """
    This function replaces multiple copies of the same year
    in a player's record with 0's.
    """
    if row['stint'] != 1:
        return 0
    else:
        return row['yearID']

In [None]:
batting['year_rect'] = batting.apply(year_correct, axis = 1)

In [None]:
batting[batting['playerID'] == 'rosepe01']

### Grouping By `playerID` and then Summing Will Show Career Totals

In [None]:
# numeric_only=True keeps string columns such as teamID out of the sum.
players = batting.groupby('playerID').sum(numeric_only=True)

#### Hank Aaron, for example

In [None]:
players[players.index == 'aaronha01']

### Adding a Batting Average Column

In [None]:
players['avg'] = players['H'] / players['AB']
players[players.index == 'rosepe01']

### Adding a Slugging Average Column

In [None]:
players['slg'] = (players['H'] + players['2B'] + 2 * players['3B'] + 3 * players['HR']) / players['AB']

In [None]:
players.head()

## Finding the Worst Hitter

Begin by simply sorting by batting average.

In [None]:
players['avg'].sort_values()

But this is obviously not good enough. By this measure, any player who appeared in only a few games and never got a hit lands at the top (or bottom) of the list. So let's think like a Bayesian!

### Bringing in Bayes

New plan: Find some average batting average to use as a *prior* probability distribution. Then treat the weight of a player's career as the *evidence* upon which to conditionalize. This will remove the players with very few career appearances from the top of our worst hitters list, since the prior (if we choose it appropriately) will dominate over the new evidence. Conversely, players with long careers should have their actual numbers dominate over the prior.

Getting a hit is a binary outcome (hit or no hit), so a player's career hits can be modeled as binomial. That makes the [beta-binomial distribution](https://en.wikipedia.org/wiki/Beta-binomial_distribution) a natural fit, because the beta distribution is a [conjugate prior](https://en.wikipedia.org/wiki/Conjugate_prior) for a binomial likelihood.

Taking advantage of conjugacy here means that I can simply add a player's career hits and at-bats to the prior's hits and at-bats to get an updated estimate of a player's batting average.
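
The update can be sketched in a few lines. This is a minimal illustration rather than the notebook's own code: `update_beta` and `posterior_mode` are hypothetical helper names, and the Beta(27, 75) prior and the 50-for-200 career are inputs assumed purely for the example.

```python
def update_beta(alpha, beta, hits, at_bats):
    """Conjugate update: a Beta(alpha, beta) prior plus binomial data
    (hits out of at_bats) gives a Beta(alpha + hits, beta + misses) posterior."""
    misses = at_bats - hits
    return alpha + hits, beta + misses


def posterior_mode(alpha, beta):
    """Mode of a Beta(alpha, beta) distribution (valid when alpha, beta > 1)."""
    return (alpha - 1) / (alpha + beta - 2)


# Hypothetical inputs: a Beta(27, 75) prior and a 50-for-200 career.
alpha_post, beta_post = update_beta(27, 75, hits=50, at_bats=200)
print(alpha_post, beta_post)                  # 77 225
print(posterior_mode(alpha_post, beta_post))  # equals (50 + 26) / (200 + 100)
```

No integration or sampling is needed: the posterior stays in the beta family, so the whole update is two additions.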

#### Setting the prior

I'm going to use .260 as the typical MLB batting average, encoded in the prior as 26 hits in 100 at-bats.

In [None]:
# Prior

successes = 26
failures = 74
alpha_prior = successes + 1
beta_prior = failures + 1

beta_dist = stats.beta(alpha_prior, beta_prior)

fig, ax = plt.subplots(figsize=(8, 5))
pvals = np.linspace(0, 1, 101)
prior = beta_dist.pdf(pvals)

ax.plot(pvals, prior, lw=3)
ax.set_xlabel('p', fontsize=16)
ax.set_ylabel('P(p | a,b)\n', fontsize=16)
ax.set_title(f'Beta PDF for alpha={alpha_prior}, beta={beta_prior}\n',
            fontsize=18)
plt.show()

#### Adding a Maximum A Posteriori Column
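
A short derivation of where the constants in the next cell come from, assuming the Beta(27, 75) prior defined above and writing $H$ for career hits and $AB$ for career at-bats:

$$
p \sim \mathrm{Beta}(27,\ 75), \qquad H \mid p \sim \mathrm{Binomial}(AB,\ p)
\;\;\Longrightarrow\;\;
p \mid H \sim \mathrm{Beta}(27 + H,\ 75 + AB - H)
$$

$$
\hat{p}_{\mathrm{MAP}} = \frac{(27 + H) - 1}{(27 + H) + (75 + AB - H) - 2} = \frac{H + 26}{AB + 100}
$$

The second line uses the fact that the mode of a $\mathrm{Beta}(a, b)$ distribution with $a, b > 1$ is $(a - 1)/(a + b - 2)$.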

In [None]:
# MAP estimate: the posterior mode under the Beta(27, 75) prior

players['MAP'] = (players['H'] + 26) / (players['AB'] + 100)

In [None]:
players.sort_values('MAP', ascending = False).head()

In [None]:
players = players.reset_index()

#### Adding an `Era` Column

In [None]:
players = players.rename({'MAP': 'MAP_avg'}, axis = 1)

players['Era'] = (players['year_rect'] / players['seasons']).astype(int)
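
To see why the division above yields a player's average active year, here is a small sketch on toy data. The player `travel01` and his stints are hypothetical, and the vectorized `where` call is an assumed stand-in for the row-wise `year_correct` function used earlier.

```python
import pandas as pd

# Hypothetical player who split 1991 between two teams (stint 1 and stint 2).
toy = pd.DataFrame({
    'playerID': ['travel01'] * 4,
    'yearID':   [1990, 1991, 1991, 1992],
    'stint':    [1, 1, 2, 1],
})

first_stint = toy['stint'] == 1
toy['seasons'] = first_stint.astype(int)                # 1 once per season, else 0
toy['year_rect'] = toy['yearID'].where(first_stint, 0)  # year once per season, else 0

totals = toy.groupby('playerID')[['seasons', 'year_rect']].sum()
era = (totals['year_rect'] / totals['seasons']).astype(int)
print(era['travel01'])  # 1991: (1990 + 1991 + 1992) / 3
```

Without the zeroing, the second 1991 row would be counted twice and pull the average toward that season.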

Some players had perfect averages (going 1 for 1 in their careers or the like). Their new `MAP_avg` scores should be close to .260.
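
A quick numerical check of that shrinkage, using the same 26-hits-in-100-at-bats prior. The helper name `map_avg` and the three careers are hypothetical, chosen only to contrast short and long careers.

```python
def map_avg(hits, at_bats, prior_hits=26, prior_at_bats=100):
    """MAP batting average: add the prior's hits and at-bats to the career totals."""
    return (hits + prior_hits) / (at_bats + prior_at_bats)


# Hypothetical careers.
print(round(map_avg(1, 1), 3))        # 1-for-1, raw 1.000 -> 0.267
print(round(map_avg(0, 5), 3))        # 0-for-5, raw .000  -> 0.248
print(round(map_avg(1500, 5000), 3))  # raw .300           -> 0.299
```

The two tiny careers end up within a few points of .260, while the 5,000 at-bat career barely moves.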

In [None]:
# Series.max() skips NaN averages (0 at-bats), unlike the builtin max().
players[players['avg'] == players['avg'].max()].head()

#### Experimenting with Different Numbers for the Prior's Hits and At-Bats

In [None]:
def avg_prior(h, ab):
    """
    This function takes in a number of hits and a number of at-bats
    to use as prior values for the Bayesian MAP Method. It overwrites
    the `MAP_avg` column and returns the playerIDs of the worst
    hitters according to the MAP average, worst first (currently the
    top ten; use `.head(1)` for only the single worst). The ratio of
    hits to at-bats should be (near) 26:100.
    """
    players['MAP_avg'] = (players['H'] + h) / (players['AB'] + ab)
    return players.sort_values('MAP_avg',
                               #ascending=False
                              ).head(10)['playerID']

The following code will find the worst hitter for prior values of at-bats between 10 and 5000, counting by tens.

In [None]:
worst = []
for i in range(10, 5001, 10):
    worst.append(avg_prior(0.26 * i, i))

For just ten at-bats, the worst hitter is Ron Herbel.

In [None]:
worst[0]

In [None]:
worst[169]

In [None]:
worst[170]

At precisely 1705 at-bats, the worst hitter switches from Bob Buhl to Bill Bergen.
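
A crossover like this can also be located directly instead of by inspecting the list. The sketch below is an illustration under assumptions: `map_avg` and `first_switch` are hypothetical helpers, and the career totals (76 hits in 857 at-bats for Buhl, 516 hits in 3,028 at-bats for Bergen) are assumed values that should be checked against the `players` table.

```python
def map_avg(hits, at_bats, prior_at_bats, prior_mean=0.26):
    """MAP average with a prior worth prior_at_bats at-bats at prior_mean."""
    return (hits + prior_mean * prior_at_bats) / (at_bats + prior_at_bats)


def first_switch(player_a, player_b, max_prior_at_bats=5000):
    """Smallest whole number of prior at-bats at which player_b's MAP
    average drops below player_a's, or None if that never happens."""
    for n in range(1, max_prior_at_bats + 1):
        if map_avg(*player_b, n) < map_avg(*player_a, n):
            return n
    return None


# (hits, at-bats): assumed career totals, not read from the dataset here.
buhl = (76, 857)
bergen = (516, 3028)
print(first_switch(buhl, bergen))  # 1705 with these assumed totals
```

The switch happens because Bergen's much longer career resists the pull of the prior: as the prior grows, Buhl's estimate rises toward .260 faster than Bergen's does.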

In [None]:
avg_prior(0.26 * 1704, 1704)

In [None]:
avg_prior(0.26 * 1705, 1705)

In [None]:
worst

In [None]:
players[players['playerID'] == 'herbero01']

Ron Herbel: pitcher for the San Francisco Giants, San Diego Padres, New York Mets, and Atlanta Braves.

In [None]:
players[players['playerID'] == 'chancde01']

Dean Chance: pitcher for the Los Angeles / California Angels, Minnesota Twins, Cleveland Indians, New York Mets, and Detroit Tigers.

In [None]:
players[players['playerID'] == 'buhlbo01']

Bob Buhl: pitcher for the Milwaukee Braves, Chicago Cubs, and Philadelphia Phillies.

In [None]:
players[players['playerID'] == 'bergebi01']

Bill Bergen: catcher for the Cincinnati Reds and Brooklyn Superbas / Dodgers.

**Aside on the Best Hitters**

To find the *best* hitters, we can re-use our `avg_prior()` function: uncomment the `ascending=False` argument in the `.sort_values()` call and re-run the cell that defines the function before running the loop below.
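
An alternative to editing the function by hand is to expose the sort direction as a parameter. The following is a sketch only: `rank_by_map` is a hypothetical name, the toy players are invented, and unlike `avg_prior()` it does not write a `MAP_avg` column back to the DataFrame.

```python
import pandas as pd


def rank_by_map(df, prior_h, prior_ab, n=10, ascending=True):
    """Return the n playerIDs with the lowest (ascending=True) or
    highest (ascending=False) MAP averages, leaving df unchanged."""
    map_avg = (df['H'] + prior_h) / (df['AB'] + prior_ab)
    order = map_avg.sort_values(ascending=ascending).index[:n]
    return df.loc[order, 'playerID'].tolist()


# Toy data with hypothetical players.
toy = pd.DataFrame({
    'playerID': ['slugger01', 'cupofco01', 'weakbat01'],
    'H':  [2000, 1, 300],
    'AB': [6000, 1, 2000],
})
print(rank_by_map(toy, 26, 100, n=1))                   # ['weakbat01']
print(rank_by_map(toy, 26, 100, n=1, ascending=False))  # ['slugger01']
```

With a flag like this, the worst and best lists can be built from the same definition without re-running any cell.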

In [None]:
best = []
for i in range(10, 5001, 10):
    best.append(avg_prior(0.26 * i, i))

best

For almost all values of at-bats between 10 and 5000 (counting again by tens), the best hitter is Ty Cobb. But there are two names for 10 and 20 at-bats that I don't recognize.

In [None]:
players[players['playerID'] == 'jansera01']

Ray Jansen: St. Louis Browns.

In [None]:
players[players['playerID'] == 'sherlvi01']

Vince Sherlock: Brooklyn Dodgers.

In [None]:
players[players['playerID'] == 'cobbty01']

### Looking for Non-Pitchers

Pitchers are specialized players who are generally given a bit of a break when it comes to hitting, and so really what I'm after is the worst hitter *who was not a pitcher*.

For these next cells I change the function to print out the top ten instead of just the top one.

In [None]:
avg_prior(26, 100)

Ben Sheets: pitcher for the Milwaukee Brewers, Oakland Athletics, and Atlanta Braves.

Dick Ellsworth: pitcher for the Chicago Cubs, Philadelphia Phillies, Boston Red Sox, Cleveland Indians, and Milwaukee Brewers.

Bill Hands: pitcher for the San Francisco Giants, Chicago Cubs, Minnesota Twins, and Texas Rangers.

Al Leiter: pitcher for the New York Yankees, Toronto Blue Jays, Florida Marlins, and New York Mets.

Sandy Koufax: pitcher for the Brooklyn / Los Angeles Dodgers.

Brian Moehler: pitcher for the Detroit Tigers, Cincinnati Reds, Houston Astros, and Florida Marlins.

Roger Craig: pitcher for the Brooklyn / Los Angeles Dodgers, New York Mets, St. Louis Cardinals, Cincinnati Reds, and Philadelphia Phillies.

In [None]:
avg_prior(52, 200)

Aaron Harang: pitcher for the Oakland Athletics, Cincinnati Reds, San Diego Padres, Los Angeles Dodgers, Seattle Mariners, New York Mets, Atlanta Braves, and Philadelphia Phillies.

John Burkett: pitcher for the San Francisco Giants, Florida Marlins, Texas Rangers, Atlanta Braves, and Boston Red Sox.

Nolan Ryan: pitcher for the New York Mets, California Angels, Houston Astros, and Texas Rangers.

In [None]:
avg_prior(78, 300)

Mickey Lolich: pitcher for the Detroit Tigers, New York Mets, and San Diego Padres.

In [None]:
avg_prior(104, 400)

Bob Friend: pitcher for the Pittsburgh Pirates, New York Yankees, and New York Mets.

Milt Pappas: pitcher for the Baltimore Orioles, Cincinnati Reds, Atlanta Braves, and Chicago Cubs.

In [None]:
avg_prior(130, 500)

Jerry Koosman: pitcher for the New York Mets, Minnesota Twins, Chicago White Sox, and Philadelphia Phillies.

#### Bringing in pitching data

In [None]:
pitching = pd.read_csv('../data/Pitching.csv')

#### Exploring the pitching data

In [None]:
pitching['playerID'].values

In [None]:
# numeric_only=True keeps string columns such as teamID out of the sum.
pitchers = pitching.groupby('playerID').sum(numeric_only=True)

Somewhat crudely, we'll look for the hitters who have no pitching record whatever.
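
One way to express that filter is with `Series.isin`. This is a sketch on toy stand-ins for the real tables; `players_toy`, `pitching_toy`, and the IDs in them are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the career batting table and the pitching table.
players_toy = pd.DataFrame({'playerID': ['batter01', 'twoway01', 'hurler01']})
pitching_toy = pd.DataFrame({'playerID': ['twoway01', 'hurler01', 'hurler01']})

# True for players with no row at all in the pitching table.
never_pitched = ~players_toy['playerID'].isin(pitching_toy['playerID'])
pure_batters_toy = players_toy[never_pitched].copy()
print(pure_batters_toy['playerID'].tolist())  # ['batter01']
```

The filter is crude in the same way as the prose says: a position player who pitched a single inning is excluded along with the career pitchers.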

In [None]:
# Set difference: every batter with no row in the pitching table.
pure_bats = set(players['playerID']) - set(pitching['playerID'])

In [None]:
def also_pitch(x):
    """
    This function will be used to add a column to my
    players DataFrame that will indicate whether the
    player ever pitched: 0 if the player never pitched,
    1 if the player has any pitching record.
    """
    if x['playerID'] in pure_bats:
        return 0
    else:
        return 1

In [None]:
players['pitched'] = players.apply(also_pitch, axis = 1)

In [None]:
pure_batters = players[players['pitched'] == 0].copy()

Now we'll just redo our previous MAP Method, but this time applying it only to the pure batters who didn't also pitch.

In [None]:
def avg_prior_pure(h, ab):
    """
    This function mimics avg_prior() from before, but it ranks
    only the pure batters and returns the playerID of the single
    worst hitter by MAP average.
    """
    pure_batters['MAP_avg'] = (pure_batters['H'] + h) / (pure_batters['AB'] + ab)
    return pure_batters.sort_values('MAP_avg',
                               # ascending = False
                              ).head(1)['playerID']

In [None]:
worst = []
for i in range(10, 5001, 10):
    worst.append(avg_prior_pure(0.26 * i, i))

In [None]:
worst

These pure hitters come up early in the list, but only when the prior is worth very few at-bats.

Skeeter Shelton: New York Yankees.

Ed Gastfield: Detroit Wolverines and Chicago White Stockings.

Mike Jordan: Pittsburgh Alleghenys.

John Humphries: New York Gothams and Washington Nationals.

It looks like **Bill Bergen** (who came up even before we eliminated the pitchers!) is our winner here. Let's find the exact number of at-bats where Bergen comes to the "top". Given the output from the preceding code, it must be somewhere between 130 and 140 at-bats.

In [None]:
avg_prior_pure(0.26 * 132, 132)

In [None]:
avg_prior_pure(0.26 * 133, 133)

***If we set our prior number of at-bats at 133 or more, Bill Bergen will count as the worst hitter in the history of the Major Leagues!***