# MLB opening day salaries

Let's start by poking at some MLB opening day salary data from 2017. The file lives here: `../data/mlb.csv`.

![wow](../img/rj-bird.gif "wowwwwww")

Let's also open the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/) in a new browser tab.

### Import pandas

We've already installed `pandas`, an external Python library that we'll use to analyze data. Now we just need to _import_ it so we can use its functionality in our script.

👉For more details on installing and importing Python libraries, [see this notebook](../appendix/Installing%20and%20importing%20modules%20and%20libraries.ipynb).

In [3]:
import pandas as pd

### Load the CSV

Next, we'll load the CSV into a pandas _dataframe_, which is sort of like a virtual spreadsheet with rows and columns.

We'll take a _string_ -- some text sandwiched between a pair of apostrophes or quotation marks -- with the path to our CSV and hand it off to the pandas `read_csv()` function ([here's the documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)). We'll assign the result to a variable called `df`. (The name of the `df` variable is arbitrary -- you could call it `banana` and things would still work, though people reading your notebook would be confused.)

👉For more details on _strings_ (and other data types) and _variable assignment_, [see this notebook](../appendix/Python%20data%20types%20and%20basic%20syntax.ipynb).

👉For more details on loading data into pandas, [see this notebook](../appendix/Importing%20data%20into%20pandas.ipynb).

In [4]:
df = pd.read_csv('../data/mlb.csv')
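To get a feel for what `read_csv()` hands back, here's a minimal sketch that loads a tiny, made-up CSV from a string in memory instead of from a file. The player names, team codes and the `sample_df` variable are invented for illustration and are not part of `mlb.csv`.

```python
import io

import pandas as pd

# A tiny made-up CSV held in memory, standing in for a file on disk.
csv_text = """NAME,TEAM,POS,SALARY
Player A,AAA,SP,1000000
Player B,BBB,C,550000
Player C,AAA,1B,2500000
"""

# read_csv() accepts a file path or a file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df))   # the result is a pandas DataFrame
print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # pandas guesses a data type for each column
```

The first line of the CSV becomes the column names, and pandas infers that `SALARY` holds integers.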

### Use `head()` to check out the data

Now that the dataframe is loaded with data, let's use the `head()` method to see the first five rows of data ([here's the documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html)).

In [5]:
df.head()

| | NAME | TEAM | POS | SALARY | START_YEAR | END_YEAR | YEARS |
|---|---|---|---|---|---|---|---|
| 0 | Clayton Kershaw | LAD | SP | 33000000 | 2014 | 2020 | 7 |
| 1 | Zack Greinke | ARI | SP | 31876966 | 2016 | 2021 | 6 |
| 2 | David Price | BOS | SP | 30000000 | 2016 | 2022 | 7 |
| 3 | Miguel Cabrera | DET | 1B | 28000000 | 2014 | 2023 | 10 |
| 4 | Justin Verlander | DET | SP | 28000000 | 2013 | 2019 | 7 |


### Other ways to check out the dataframe

- `.columns` will list the column names
- [`.info()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) will let us know if any columns have null values in them
- [`.count()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html) will count the non-null values in each column
- [`.shape`](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.shape.html) will give you `(number of rows, number of columns)`
- [`.describe()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) will compute summary stats for the values in each numeric column

In [6]:
df.columns

Index(['NAME', 'TEAM', 'POS', 'SALARY', 'START_YEAR', 'END_YEAR', 'YEARS'], dtype='object')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 868 entries, 0 to 867
Data columns (total 7 columns):
NAME          868 non-null object
TEAM          868 non-null object
POS           868 non-null object
SALARY        868 non-null int64
START_YEAR    868 non-null int64
END_YEAR      868 non-null int64
YEARS         868 non-null int64
dtypes: int64(4), object(3)
memory usage: 47.5+ KB


In [8]:
df.count()

NAME          868
TEAM          868
POS           868
SALARY        868
START_YEAR    868
END_YEAR      868
YEARS         868
dtype: int64

In [9]:
df.shape

(868, 7)

In [10]:
df.describe()

| | SALARY | START_YEAR | END_YEAR | YEARS |
|---|---|---|---|---|
| count | 868.0 | 868.0 | 868.0 | 868.0 |
| mean | 4468069.0 | 2016.486175 | 2017.430876 | 1.9447 |
| std | 5948459.0 | 1.205923 | 1.163087 | 1.916764 |
| min | 535000.0 | 2008.0 | 2015.0 | 1.0 |
| 25% | 545500.0 | 2017.0 | 2017.0 | 1.0 |
| 50% | 1562500.0 | 2017.0 | 2017.0 | 1.0 |
| 75% | 6000000.0 | 2017.0 | 2017.0 | 2.0 |
| max | 33000000.0 | 2017.0 | 2027.0 | 13.0 |
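Our salary data has no missing values, so `.count()` and `.shape` report the same number of rows. Here's a small sketch, using an invented dataframe with one missing salary, of how the two can differ. The data and the `sample_df` name are assumptions for illustration only.

```python
import pandas as pd

# Made-up data: Player B's salary is missing.
sample_df = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C'],
    'SALARY': [1000000, None, 2500000],
})

# .shape counts every row, whether or not values are missing.
row_count, column_count = sample_df.shape
print(row_count, column_count)

# .count() only counts the non-null values in each column.
non_null_counts = sample_df.count()
print(non_null_counts)

# isnull().sum() counts the missing values in each column directly.
missing_counts = sample_df.isnull().sum()
print(missing_counts)
```

If the numbers from `.count()` come up short of the row count in `.shape`, that column has gaps worth investigating.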


### Come up with a list of questions

Now that we have a general idea of our data, let's come up with a list of questions. For starters:

- What's the total, average and median salary for an MLB player?
- How many players are on each team?
- Which catchers make the most money?
- How many players make the league minimum?
- Which teams have the biggest payrolls?

Other questions?

### Q: What's the total, average and median salary for an MLB player?

If we were doing this in Excel, we'd probably scroll to the bottom of the worksheet and enter, in the SALARY column, `=SUM(D2:D869)`, and below that, `=AVERAGE(D2:D869)`, and then below that, `=MEDIAN(D2:D869)`. (The header sits in row 1, so the 868 records fill rows 2 through 869.) Here, we're going to select the values in the SALARY column and use a few built-in pandas methods to do the same math.

In pandas, to select a column of data, you can use dot notation (`df.SALARY`) or bracket notation (`df['SALARY']`). Bracket notation always works; you _have_ to use it if the column name has spaces in it.
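Here's a quick sketch of the two ways to select a column, using a made-up dataframe. The `START YEAR` column (with a space in its name) is hypothetical; the real data uses `START_YEAR`.

```python
import pandas as pd

# Made-up data; 'START YEAR' has a space in it on purpose.
sample_df = pd.DataFrame({
    'SALARY': [1000000, 550000, 2500000],
    'START YEAR': [2015, 2017, 2016],
})

# Dot notation and bracket notation return the same column.
by_dot = sample_df.SALARY
by_bracket = sample_df['SALARY']
print(by_dot.equals(by_bracket))

# A column name with a space only works with bracket notation.
start_years = sample_df['START YEAR']
print(start_years.max())
```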

In [11]:
df.SALARY.sum()

3878284045

In [12]:
df.SALARY.mean()

4468069.176267281

In [13]:
df.SALARY.median()

1562500.0
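If you'd rather get all three numbers in one go, one option is the `agg()` method, which accepts a list of function names. This is a sketch on invented salaries, not the approach used in the cells above.

```python
import pandas as pd

# Made-up salaries for illustration.
salaries = pd.Series([1000000, 2000000, 6000000], name='SALARY')

# agg() runs each named function and returns the results as a Series.
summary = salaries.agg(['sum', 'mean', 'median'])
print(summary)
```

Notice that one large salary pulls the mean above the median, which is the same pattern we see in the real data.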

### Q: How many players are on each team?

To answer this question, we're going to use a method called [`.value_counts()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) on the TEAM column. The equivalent operation in Excel would be a pivot table; in SQL, it'd be a GROUP BY statement with COUNT(\*).

In [14]:
df.TEAM.value_counts()

TEX    34
TB     32
COL    32
CIN    31
SEA    31
LAD    31
NYM    31
BOS    31
SD     31
STL    30
OAK    30
LAA    30
ATL    30
TOR    29
MIN    29
SF     28
BAL    28
CWS    28
ARI    28
KC     28
MIA    28
CLE    28
HOU    27
NYY    27
CHC    26
MIL    26
PIT    26
WSH    26
PHI    26
DET    26
Name: TEAM, dtype: int64
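To make the GROUP BY comparison concrete, here's a sketch showing that `value_counts()` gives the same counts as grouping by team and measuring the size of each group. The team codes and player names are invented.

```python
import pandas as pd

# Made-up roster data.
sample_df = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E'],
    'TEAM': ['AAA', 'BBB', 'AAA', 'CCC', 'AAA'],
})

# value_counts() tallies each unique value, most common first.
counts = sample_df['TEAM'].value_counts()
print(counts)

# The "GROUP BY ... COUNT(*)" spelling: group, take each group's size,
# then sort descending.
grouped = sample_df.groupby('TEAM').size().sort_values(ascending=False)
print(grouped)
```

Both approaches count rows per team; `value_counts()` is just shorter to type when you only care about one column.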

### Q: Which catchers make the most money?

To answer this question, first we'll _filter_ the dataframe to include only catchers. Then we'll sort the data descending and look at the top earners.

👉For more details on filtering data in pandas, [see this notebook](../appendix/Filtering%20columns%20and%20rows%20in%20pandas.ipynb).

First, we need to figure out how "catcher" is represented in our data. Let's use the [`unique()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) method to get a list of unique values in the `POS` column.

In [15]:
df.POS.unique()

array(['SP', '1B', 'RF', '2B', 'DH', 'CF', 'C', 'LF', '3B', 'SS', 'OF',
       'RP', 'P'], dtype=object)

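Looks like we want to target records where the `POS` value is `'C'`.

To filter data in a pandas dataframe, we'll write a condition that's either true or false for each row (`df['POS'] == 'C'`) and put that condition inside the square brackets of `df[]`. Only the rows where the condition is true come back. It's a little confusing at first.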

In [16]:
catchers = df[df['POS'] == 'C']
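If the double `df[df[...]]` looks odd, it can help to split the filter into two steps. Here's a sketch on a made-up dataframe; the `is_catcher` and `sample_catchers` names are invented for illustration.

```python
import pandas as pd

# Made-up data.
sample_df = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D'],
    'POS': ['C', 'SP', 'C', '1B'],
})

# Step 1: the condition produces one True/False value per row.
is_catcher = sample_df['POS'] == 'C'
print(is_catcher.tolist())  # [True, False, True, False]

# Step 2: passing those True/False values to df[] keeps only the True rows.
sample_catchers = sample_df[is_catcher]
print(sample_catchers['NAME'].tolist())  # ['Player A', 'Player C']
```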

Now we want to sort these records top to bottom. To do that, we'll use the [`sort_values()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) method, which needs the name of the column to sort by (`'SALARY'`). We want to sort largest to smallest, so we'll also specify that `ascending=False`. (Curious about what other options you could specify for this method? Check the documentation.)

In [17]:
catchers.sort_values('SALARY', ascending=False)

| | NAME | TEAM | POS | SALARY | START_YEAR | END_YEAR | YEARS |
|---|---|---|---|---|---|---|---|
| 18 | Buster Posey | SF | C | 22177778 | 2013 | 2021 | 9 |
| 36 | Russell Martin | TOR | C | 20000000 | 2015 | 2019 | 5 |
| 52 | Brian McCann | HOU | C | 17000000 | 2014 | 2018 | 5 |
| 75 | Yadier Molina | STL | C | 14200000 | 2013 | 2017 | 5 |
| 77 | Miguel Montero | CHC | C | 14000000 | 2013 | 2017 | 5 |
| 108 | Carlos Santana | CLE | C | 12000000 | 2012 | 2016 | 5 |
| 129 | Matt Wieters | WSH | C | 10500000 | 2017 | 2017 | 1 |
| 143 | Francisco Cervelli | PIT | C | 9000000 | 2017 | 2019 | 3 |
| 151 | Jason Castro | MIN | C | 8500000 | 2017 | 2019 | 3 |
| 176 | Devin Mesoraco | CIN | C | 7325000 | 2015 | 2018 | 4 |
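If you only want the top few rows of a sorted dataframe, you can chain `head()` onto the sort, or use `nlargest()` as a shortcut. This is a sketch on made-up data; neither call appears in the cell above.

```python
import pandas as pd

# Made-up catcher salaries.
sample_catchers = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D'],
    'SALARY': [550000, 9000000, 2000000, 14000000],
})

# Sort largest to smallest, then keep the first two rows.
top_two = sample_catchers.sort_values('SALARY', ascending=False).head(2)
print(top_two)

# nlargest() does the sort and the trim in one call.
top_two_alt = sample_catchers.nlargest(2, 'SALARY')
print(top_two_alt)
```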


### Q: How many players make the league minimum?

First, we'll need to figure out what the league minimum is. Assuming nobody in our data is paid less than the minimum, it's the lowest number in the salary data. We could also reasonably expect that number to occur more frequently than other numbers. So first, let's use the [`min()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html) method to see what the lowest salary value is; then we'll use `value_counts()` to check frequency.

In [18]:
df.SALARY.min()

535000

In [19]:
df.SALARY.value_counts()

535000      50
540000      21
545000      14
2000000     13
4000000     13
6000000     13
537000      12
3000000     12
5500000     11
537500      11
1250000     10
555000      10
1750000      9
6500000      8
12000000     8
13000000     8
5000000      7
11000000     7
7000000      7
9000000      7
8000000      7
1500000      7
600000       6
900000       6
536500       6
2500000      6
1400000      5
547000       5
4200000      5
539000       5
            ..
18700000     1
550625       1
579300       1
6625000      1
3666666      1
561900       1
7350000      1
3210000      1
546500       1
19000000     1
26000000     1
538300       1
537600       1
20083333     1
4325000      1
3650000      1
4687500      1
8250000      1
546450       1
1640000      1
537250       1
2350000      1
13625000     1
10888877     1
1550000      1
760500       1
2275000      1
2925000      1
8916667      1
7700000      1
Name: SALARY, Length: 419, dtype: int64

#### Bonus Q: What percentage of MLB players make the league minimum?

First, we can filter to get just the players who make the league minimum. Then we can use the built-in Python function `len()` to get the count. We can also use this function to get a quick record count on our entire dataset, and from there the math is straightforward: `(part / whole) * 100`

In [20]:
league_min = df[df.SALARY == df.SALARY.min()]
                
(len(league_min) / len(df)) * 100

5.76036866359447
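Another way to get the same kind of percentage, sketched here on invented salaries, is to skip the filtering step and do math on the True/False values directly. pandas treats `True` as 1 and `False` as 0, so `sum()` gives the count and `mean()` gives the share.

```python
import pandas as pd

# Made-up salaries: two of the four players earn the lowest amount.
sample_df = pd.DataFrame({'SALARY': [535000, 535000, 1000000, 2500000]})

# One True/False value per player: does this salary equal the minimum?
at_minimum = sample_df['SALARY'] == sample_df['SALARY'].min()

# Summing the True values counts the matching players.
count_at_minimum = at_minimum.sum()
print(count_at_minimum)

# The mean of the True/False values is the fraction that matched.
pct_at_minimum = at_minimum.mean() * 100
print(pct_at_minimum)
```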

### Q: Which teams have the biggest payrolls?

To answer this question, we're again going to use the equivalent of an Excel pivot table. Our steps:

1. Select the two columns we're interested in: `[TEAM, SALARY]`
2. Use the `groupby()` method to group the data by team
3. Use the `sum()` method to sum salaries by team
4. Use the `sort_values()` method to sort the results descending

_Furthermore_, we're gonna chain these methods together and do it all in one whack. We can use `\` at the end of the line to tell Python that we're _not quite done yet_.

In [21]:
df[['TEAM', 'SALARY']].groupby('TEAM') \
                      .sum() \
                      .sort_values('SALARY', ascending=False)

| TEAM | SALARY |
|---|---|
| LAD | 187989811 |
| DET | 180250600 |
| TEX | 178431396 |
| SF | 176531278 |
| NYM | 176284679 |
| BOS | 174287098 |
| NYY | 170389199 |
| CHC | 170088502 |
| WSH | 162742157 |
| TOR | 162353367 |
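The same select, group, sum and sort steps can also be chained without backslashes by wrapping the whole expression in parentheses, which some people find easier to read and edit. Here's a sketch on made-up data; the team codes and the `payrolls` name are invented.

```python
import pandas as pd

# Made-up salary data.
sample_df = pd.DataFrame({
    'TEAM': ['AAA', 'BBB', 'AAA', 'BBB', 'CCC'],
    'SALARY': [1000000, 550000, 2500000, 600000, 4000000],
})

# Inside parentheses, Python lets an expression run across several lines.
payrolls = (
    sample_df[['TEAM', 'SALARY']]            # 1. select the two columns
    .groupby('TEAM')                         # 2. group by team
    .sum()                                   # 3. sum salaries per team
    .sort_values('SALARY', ascending=False)  # 4. sort largest first
)
print(payrolls)
```

A side benefit of the parentheses is that you can put a comment on each step, which isn't possible after a `\`.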
