Fernando Crema

- **Course:** Inside Baseball Analytics in Action
- **Instructor:** Adam Guttridge.

# Scouting scale (assignment).

Throughout this class, we've discussed frameworks which communicate value, and allow for comparison. We've discussed the shortcomings of the conventional 20-80 scouting scale. We've suggested solutions for grading metrics outside of baseball.

Either suggest some improvements which would make the current 20-80 system more effective, or describe a new system for relaying scouting information which would be less problematic.

## Suggestions to the system

First of all, the word "effective" is a strange fit here, because the objective of the system is to predict the future performance of players. Unsigned players almost never get the opportunity to prove that they were valuable and that MLB made a mistake, so the accuracy of the system is difficult to measure.

Some suggestions to improve the system could be:

1. Gathering measurable data and assigning values without subjective evaluation should be mandatory. Assigning values from 20 to 80 just rescales the measured space and applies a simple categorization to it. Furthermore, assuming the values **must** follow a Gaussian distribution may be wrong in some cases. Even though averages over a sufficiently large number of samples tend to look Gaussian, the measurements themselves can (and do) follow other distributions (power laws, for example) that do not match this Gaussian prior.
1. Regarding point 1: even with "objective" tools for measuring future performance, it is sometimes good to maintain the "human" part of scouting. Otherwise Jose Altuve would not be playing.
1. I really like the idea of baseball IQ. Adding situational tests, or having the player analyze video, explain the situation and say what he would do, could be a way to measure future development on at least this criterion. I know that talent pays off on most plays, but in some cases seeing beyond the play can [flip](https://www.youtube.com/watch?v=ApoJk9X7Vto) some games for you.
1. Organizing a "scouting future world series" (or whatever name MLB prefers) to observe on-field performance and support the statistical/measurable elements from point 1.
1. Past medical records and nutrition information are used in Venezuela. Back in 2013 I heard from some scouts that Salvador Perez wouldn't last long because he had eating disorders as a child. This is extremely common, especially in Latin America.
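
To make the first suggestion concrete, here is a minimal sketch of two ways to turn a measured tool into a 20-80 grade. It assumes the common reading of the scale (50 is average, every 10 points is one standard deviation, grades move in steps of 5). The function names and the heavy-tailed sample are hypothetical and only meant to show how the Gaussian assumption behaves on skewed data.

```python
from statistics import NormalDist

import numpy as np


def to_20_80(z):
    """Map z-scores to 20-80: 50 is average, 10 points per standard
    deviation, rounded to the nearest 5 and clipped to the scale."""
    return np.clip(np.round((50 + 10 * np.asarray(z)) / 5) * 5, 20, 80)


def grade_gaussian(values):
    """Grade that assumes the raw measurements are Gaussian."""
    values = np.asarray(values, dtype=float)
    return to_20_80((values - values.mean()) / values.std())


def grade_percentile(values):
    """Grade from the empirical rank only (no Gaussian assumption):
    rank -> percentile -> normal quantile -> 20-80."""
    values = np.asarray(values, dtype=float)
    ranks = values.argsort().argsort()        # 0 = smallest value
    pct = (ranks + 0.5) / len(values)         # strictly inside (0, 1)
    z = [NormalDist().inv_cdf(p) for p in pct]
    return to_20_80(z)


# Hypothetical heavy-tailed measurements (not real scouting data)
rng = np.random.default_rng(0)
sample = rng.pareto(3.0, size=1000)

gaussian_grades = grade_gaussian(sample)
percentile_grades = grade_percentile(sample)
print(gaussian_grades.min(), gaussian_grades.max())
print(percentile_grades.min(), percentile_grades.max())
```

On a skewed sample like this one, the z-score grades cannot reach the bottom of the scale, because the smallest value sits less than one standard deviation below the mean. The rank-based grades use the whole 20-80 range regardless of the shape of the data.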

# Test.

## Run Expectancy

Understanding Run Expectancy and run estimators is a critical/fundamental aspect of sabermetrics. In the run environment which spanned 1993-2009, what was the expected run value of being on first base with no outs, versus the expected run value of being on second base with one out?

Use this as your source. [Tango, Tiger. re24](http://www.tangotiger.net/re24.html)

We take the table of expected runs **scored** and look at the requested period:

![](img/re24.PNG)

The values requested:

1. Expected run value of being on first base with no outs: **0.944**.
1. Expected run value of being on second base with one out: **0.723**.

This question may be related to the decline of bunts with a man on first base and no outs. Moreover, if we look at the probability (empirical chance) of scoring in the two states, we find **0.442** and **0.419**, respectively.
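
The cost of trading an out for a base can be read directly from the four numbers quoted above. The following sketch only does that arithmetic; the constant and function names are made up for illustration.

```python
# Values quoted above for the 1993-2009 run environment
RE_FIRST_0_OUT = 0.944        # expected runs, man on first, no outs
RE_SECOND_1_OUT = 0.723       # expected runs, man on second, one out
P_SCORE_FIRST_0_OUT = 0.442   # chance of scoring at least once
P_SCORE_SECOND_1_OUT = 0.419


def state_change(before, after):
    """Change in value when moving from one base-out state to another."""
    return round(after - before, 3)


run_cost = state_change(RE_FIRST_0_OUT, RE_SECOND_1_OUT)
prob_cost = state_change(P_SCORE_FIRST_0_OUT, P_SCORE_SECOND_1_OUT)

print(f"Expected runs change: {run_cost:+.3f}")
print(f"Scoring probability change: {prob_cost:+.3f}")
```

Both changes are negative: a successful sacrifice costs about 0.221 expected runs and about 2.3 percentage points of scoring probability in this run environment.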

However, further analysis could take into account the batter's strikeout percentage compared to his bunt success rate, since the chance of scoring halves if we move from a man on first with 0 outs to a man on first with 1 out. Also, these values generalize over all possible runners; with a fast runner on base and a strikeout-prone batter at the plate, it may be logical to bunt.
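
The batter-specific comparison described above can be sketched as a break-even calculation. This is an illustration under stated assumptions, not a full model: a bunt attempt has only two outcomes, the failure value of 0.221 is simply half of 0.442 (following the "halves" remark above), and the batter-specific value of swinging away is a hypothetical input.

```python
def bunt_value(p_success, value_success, value_failure):
    """Expected value of a bunt attempt with two possible outcomes."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be a probability")
    return p_success * value_success + (1 - p_success) * value_failure


def break_even_success_rate(value_swing, value_success, value_failure):
    """Bunt success rate needed to match swinging away.

    Returns None when no success rate in [0, 1] can reach it.
    """
    if value_success <= value_failure:
        return None
    p = (value_swing - value_failure) / (value_success - value_failure)
    return p if 0.0 <= p <= 1.0 else None


P_SUCCESS_STATE = 0.419   # scoring chance, man on second, one out (from the text)
P_FAILURE_STATE = 0.221   # assumption: half of 0.442, man on first, one out

# League-average hitter: swinging away is worth 0.442
league = break_even_success_rate(0.442, P_SUCCESS_STATE, P_FAILURE_STATE)

# Hypothetical strikeout-prone hitter whose swing-away value is only 0.40
weak_hitter = break_even_success_rate(0.40, P_SUCCESS_STATE, P_FAILURE_STATE)

print(league, weak_hitter)
```

With league-average values the bunt never breaks even, since even a perfect bunt leads to a worse state. It only becomes defensible when the value of swinging away for that specific batter and runner drops below the value of the successful-bunt state.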

And, second.... fill in these blanks: Run creation is the product of _________ and ________ while avoiding ______.

## Statcast 

1. Go to Baseball Savant's "Statcast Search."
1. Create a list of pitchers with the highest average velocity in 2018.
1. Export the resulting data into any sort of analysis software. Excel, R, SAS, Access, SQL, etc etc.

### Screenshot and showing data

The following screenshot shows the page used to extract this data:

![](img/statcast.PNG)

The export is saved as the file "average_exit.csv" in the data folder.

To load this data we can simply do:

In [134]:
import pandas as pd

df = pd.read_csv("data/average_exit.csv")

The file contains the following columns, from which we can select features:

In [135]:
print("\n".join(f"{i} {col}" for i, col in enumerate(df.columns)))

0 pitches
1 player_id
2 player_name
3 total_pitches
4 pitch_percent
5 ba
6 iso
7 babip
8 slg
9 woba
10 xwoba
11 xba
12 hits
13 abs
14 launch_speed
15 launch_angle
16 spin_rate
17 velocity
18 effective_speed
19 whiffs
20 swings
21 takes
22 eff_min_vel
23 release_extension
24 pos3_int_start_distance
25 pos4_int_start_distance
26 pos5_int_start_distance
27 pos6_int_start_distance
28 pos7_int_start_distance
29 pos8_int_start_distance
30 pos9_int_start_distance


To see the number of pitches, player name, total pitches, average velocity, BABIP and ISO for the top 10 pitchers, we can do the following:

In [136]:
df[["pitches", "player_name", "total_pitches", "velocity", "babip", "iso"]].head(10)

| | pitches | player_name | total_pitches | velocity | babip | iso |
|---|---|---|---|---|---|---|
| 0 | 264 | Jordan Hicks | 338 | 99.4 | 0.218 | 0.034 |
| 1 | 244 | Aroldis Chapman | 329 | 98.7 | 0.269 | 0.078 |
| 2 | 309 | Tayron Guerrero | 379 | 98.1 | 0.326 | 0.131 |
| 3 | 201 | Jose Alvarado | 288 | 97.9 | 0.279 | 0.043 |
| 4 | 54 | Ryne Stanek | 90 | 97.9 | 0.5 | 0.429 |
| 5 | 76 | Seranthony Dominguez | 105 | 97.8 | 0.077 | 0.0 |
| 6 | 92 | Justin Anderson | 227 | 97.7 | 0.462 | 0.056 |
| 7 | 471 | Luis Severino | 973 | 97.6 | 0.293 | 0.117 |
| 8 | 208 | Joe Kelly | 346 | 97.5 | 0.205 | 0.0 |
| 9 | 250 | Arodys Vizcaino | 367 | 97.5 | 0.22 | 0.167 |
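
Some of the pitchers in this list have very few pitches behind their average (Ryne Stanek has 54), so a minimum-sample filter may be worth adding before ranking. The sketch below rebuilds the three relevant columns from the output above; the cutoff of 200 pitches is an arbitrary assumption, not a standard qualifier.

```python
import pandas as pd

# Rows copied from the output shown above
top10 = pd.DataFrame(
    {
        "player_name": [
            "Jordan Hicks", "Aroldis Chapman", "Tayron Guerrero",
            "Jose Alvarado", "Ryne Stanek", "Seranthony Dominguez",
            "Justin Anderson", "Luis Severino", "Joe Kelly",
            "Arodys Vizcaino",
        ],
        "pitches": [264, 244, 309, 201, 54, 76, 92, 471, 208, 250],
        "velocity": [99.4, 98.7, 98.1, 97.9, 97.9, 97.8, 97.7, 97.6, 97.5, 97.5],
    }
)

MIN_PITCHES = 200  # hypothetical cutoff, chosen only for illustration

qualified = (
    top10[top10["pitches"] >= MIN_PITCHES]
    .sort_values("velocity", ascending=False, kind="stable")
    .reset_index(drop=True)
)
print(qualified)
```

With this cutoff, Stanek, Dominguez and Anderson drop out and seven pitchers remain, still led by Jordan Hicks.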


## Analyzing Jed Lowrie 

Pick one of the following players. Notice their 2018 seasons. Tell me about the changes you notice in their performance from prior seasons, and as a result, how you expect them to perform for the remainder of 2018. Jed Lowrie, Daniel Robertson, Sean Newcomb, Alex Cobb, Jose Quintana, Franmil Reyes.

### Why Jed?

I selected Jed Lowrie because of his performance and the number of previous seasons we have to compare against. This allows a better analysis than in the cases of Newcomb, Reyes and Robertson. Moreover, Cobb and Quintana have shown a decline in performance since last year. So the most interesting case, for me, is Jed Lowrie, who at 34 years old is having a career year.

To support the analysis, let's get his data for all previous years from Baseball Reference.

In [137]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

def get_header(soup):
    """
    Method to retrieve the column of the table
    :param soup: Beautiful Soup object.
    :returns: A list of the columns for the table
    """
    return list(map(lambda x: x.text, soup.find('thead').findAll('th')))[1:]

def get_table_body(soup):
    """
    Method to retrieve the batting table of a given player.
    :param soup: Beautiful soup object.
    :returns: The years as index and the data inside the target table
    """
    years, data = [], []
    
    for season in soup.find('tbody').findAll('tr', {'class': 'full'}):
        years.append(int(season.find('th').text))
        data.append(list(map(lambda x: str(x.text), season.findAll('td'))))
        
    return years, np.array(data)

def get_batting_data(
    player,
    template="https://www.baseball-reference.com/players/{letter}/{player}.shtml"
):
    """
    Method to extract the pandas table associated with player=player.
    :param player: ID from Baseball Reference.
    :param template: URL template from Baseball Reference used to scrape the player page.
    :returns: A DataFrame indexed by season year.
    """
    soup = BeautifulSoup(
        markup=requests.get(
            template.format(
                letter=player[0],
                player=player)
            ).content,
        features='html.parser'
    )
    
    header = get_header(soup)
    years, data = get_table_body(soup)
    
    df = pd.DataFrame(
        data=data,
        index=years,
        columns=header
    )
    
    # Changing data types: rate stats start with '.', counting stats are integers
    for col in df.columns:
        try:
            if df[col].iloc[0][0] == '.':
                df[col] = df[col].astype('float64')
            else:
                df[col] = df[col].astype('int64')
        except (IndexError, ValueError):
            # Empty first cell or non-numeric values: keep the column as text
            df[col] = df[col].astype(str)
            
    return df
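
The type conversion in `get_batting_data` decides each column's type from its first row only. A more defensive alternative, sketched here as an assumption rather than as part of the original notebook, is to let `pd.to_numeric` try every column and keep the result only when no non-empty cell is lost. The helper name and the toy frame are hypothetical.

```python
import pandas as pd


def to_numeric_where_possible(frame):
    """Convert each column to a numeric dtype when every non-empty cell
    parses as a number; otherwise leave the column untouched."""
    out = frame.copy()
    for col in out.columns:
        converted = pd.to_numeric(out[col], errors="coerce")
        non_empty = (out[col].astype(str).str.strip() != "").sum()
        # Keep the conversion only if no real value turned into NaN
        if converted.notna().sum() == non_empty:
            out[col] = converted
    return out


# Toy frame in the same all-strings shape the scraper produces
raw = pd.DataFrame(
    {
        "Tm": ["BOS", "BOS"],
        "PA": ["306", "76"],
        "BA": [".258", ".147"],
        "CS": ["0", ""],
    },
    index=[2008, 2009],
)
clean = to_numeric_where_possible(raw)
print(clean.dtypes)
```

Empty cells become `NaN` instead of raising an error, and values such as an OPS above 1.000 (which does not start with a dot) would still be parsed as floats.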

In [138]:
df = get_batting_data("lowrije01")
df.iloc[:, list(range(4, 11)) + list(range(13, 20))]

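| | PA | AB | R | H | 2B | 3B | HR | CS | BB | SO | BA | OBP | SLG | OPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2008 | 306 | 260 | 34 | 67 | 25 | 3 | 2 | 0 | 35 | 68 | 0.258 | 0.339 | 0.4 | 0.739 |
| 2009 | 76 | 68 | 5 | 10 | 2 | 0 | 2 | 0 | 6 | 20 | 0.147 | 0.211 | 0.265 | 0.475 |
| 2010 | 197 | 171 | 31 | 49 | 14 | 0 | 9 | 1 | 25 | 25 | 0.287 | 0.381 | 0.526 | 0.907 |
| 2011 | 341 | 309 | 40 | 78 | 14 | 4 | 6 | 1 | 23 | 60 | 0.252 | 0.303 | 0.382 | 0.685 |
| 2012 | 387 | 340 | 43 | 83 | 18 | 0 | 16 | 0 | 43 | 65 | 0.244 | 0.331 | 0.438 | 0.769 |
| 2013 | 662 | 603 | 80 | 175 | 45 | 2 | 15 | 0 | 50 | 91 | 0.29 | 0.344 | 0.446 | 0.791 |
| 2014 | 566 | 502 | 59 | 125 | 29 | 3 | 6 | 0 | 51 | 79 | 0.249 | 0.321 | 0.355 | 0.676 |
| 2015 | 263 | 230 | 35 | 51 | 14 | 0 | 9 | 0 | 28 | 43 | 0.222 | 0.312 | 0.4 | 0.712 |
| 2016 | 369 | 338 | 30 | 89 | 12 | 1 | 2 | 0 | 26 | 65 | 0.263 | 0.314 | 0.322 | 0.637 |
| 2017 | 645 | 567 | 86 | 157 | 49 | 3 | 14 | 1 | 73 | 100 | 0.277 | 0.36 | 0.448 | 0.808 |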


Right now, with 9 home runs, Lowrie ranks 5th in MLB and 3rd in the AL among second basemen. With 99 total bases he ranks 3rd in MLB and 2nd in the AL, just behind Jose Ramirez. Finally, with 37 RBIs he again ranks 2nd in the AL, always among second basemen.

For Jed, these numbers are far outside his normal range. If we project his current pace over 630 plate appearances, he'll end with:

1. 27 home runs
1. 112 RBIs
1. 60 runs scored (Oakland...)
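
The pace projection above is a simple proportional scaling. Here is a minimal sketch of it, with hypothetical function names. The plate appearances Lowrie has so far are not stated in this notebook, so the sketch back-solves them from the quoted projections rather than treating them as a reported figure.

```python
def project(count, pa_so_far, target_pa=630):
    """Scale a counting stat linearly to a full-season PA total."""
    return count * target_pa / pa_so_far


def implied_pa(count, projected, target_pa=630):
    """Plate appearances so far implied by a quoted projection."""
    return count * target_pa / projected


# Back-solving from the numbers above (9 HR -> 27, 37 RBI -> 112)
pa_from_hr = implied_pa(9, 27)
pa_from_rbi = implied_pa(37, 112)
print(pa_from_hr, round(pa_from_rbi, 1))

# Assumption: roughly 208 PA so far, consistent with both projections
ASSUMED_PA = 208
print(round(project(9, ASSUMED_PA)), round(project(37, ASSUMED_PA)))
```

Linear scaling assumes the current rate holds for the rest of the season, which is exactly the assumption the rest of this section questions.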

Is this going to last?

### Comparing with past performance

We can look at his year-by-year wOBA (weighted on-base average) from FanGraphs, split by home and away games:

![](/img/woba.PNG)

This year, Oakland has played 22 games at home and 27 games away. According to [ESPN](http://www.espn.com/mlb/stats/parkfactor), the Oakland Coliseum has, by far, the worst park factor in MLB at 0.774, and a surprisingly low 0.494 for HR. Not surprisingly, only 2 of Jed's 9 homers have come in Oakland. This could explain the difference in performance between home and away games, and it adds to the case that he will not sustain this level of performance.
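
The effect of the road-heavy schedule can be sketched with the counts quoted above. This is a rough illustration only: it uses team games as the denominator (so it assumes Lowrie appeared in every game), and the samples are tiny.

```python
HOME_GAMES, AWAY_GAMES = 22, 27   # Oakland's schedule so far (from the text)
HOME_HR, AWAY_HR = 2, 7           # Lowrie's home runs by venue (from the text)


def per_game(count, games):
    """Simple rate per team game."""
    return count / games


home_rate = per_game(HOME_HR, HOME_GAMES)
away_rate = per_game(AWAY_HR, AWAY_GAMES)
actual_rate = per_game(HOME_HR + AWAY_HR, HOME_GAMES + AWAY_GAMES)

# Rate he would show with the same venue rates but a 50/50 schedule
balanced_rate = 0.5 * home_rate + 0.5 * away_rate

print(f"home {home_rate:.3f}, away {away_rate:.3f}")
print(f"actual {actual_rate:.3f}, balanced schedule {balanced_rate:.3f}")
```

Re-weighting to an even home/away split lowers the home run rate slightly, which is consistent with the idea that the remaining schedule, with more games in Oakland, works against him.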

Lastly, let's analyze his BABIP throughout the years (and especially this year):

![](img/babip.PNG)

Lowrie has an absurd .364 BABIP this year, compared to his .298 career mark and a .297 league average.

Finally, let's look only at the seasons in which he has been healthy (say, more than 450 plate appearances):
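
For reference, BABIP is usually computed as (H - HR) / (AB - SO - HR + SF). The sketch below applies that formula to Lowrie's 2017 line from the batting table. Sacrifice flies are not among the columns displayed in this notebook, so `sf` defaults to 0 here as a stated assumption, and the result is only approximate.

```python
def babip(h, hr, ab, so, sf=0):
    """Batting average on balls in play: (H - HR) / (AB - SO - HR + SF)."""
    balls_in_play = ab - so - hr + sf
    if balls_in_play <= 0:
        raise ValueError("no balls in play")
    return (h - hr) / balls_in_play


# 2017 line from the batting table: H=157, HR=14, AB=567, SO=100
# (sf=0 is an assumption because SF is not shown in the table)
babip_2017 = babip(h=157, hr=14, ab=567, so=100)
print(round(babip_2017, 3))

# Gap between this year's BABIP and his career mark, as quoted above
gap = round(0.364 - 0.298, 3)
print(gap)
```

A gap of 66 points over his career mark is large, which is why the current batting line is more likely to move down than up.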

In [139]:
df.iloc[:, list(range(4, 11)) + list(range(13, 20))][df.PA > 450]

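| | PA | AB | R | H | 2B | 3B | HR | CS | BB | SO | BA | OBP | SLG | OPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013 | 662 | 603 | 80 | 175 | 45 | 2 | 15 | 0 | 50 | 91 | 0.29 | 0.344 | 0.446 | 0.791 |
| 2014 | 566 | 502 | 59 | 125 | 29 | 3 | 6 | 0 | 51 | 79 | 0.249 | 0.321 | 0.355 | 0.676 |
| 2017 | 645 | 567 | 86 | 157 | 49 | 3 | 14 | 1 | 73 | 100 | 0.277 | 0.36 | 0.448 | 0.808 |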


He has had only 3 healthy seasons out of 11: 2 good ones and 1 mediocre one. We could then say that, if healthy, Jed has a solid chance of having a decent season. Given that one of these 3 seasons is 2017, he could (maybe) keep some of these absurd numbers.

### Conclusion

Given his tendency to be injured or relegated to a secondary role, his absurd BABIP compared with the rest of the league, and his past performance, it's hard to expect him to keep this pace. So, if you play fantasy baseball, do as I did: it's time to sell.