# 1. MLB opening day salaries

Let's start by poking at some MLB opening day salary data from 2017. The file lives at `../data/mlb.csv`.

Let's also open the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/) in a new browser tab.

### Import pandas

We've already installed `pandas`, an external library we'll use to analyze data, on this computer. Now we just need to _import_ it so we can use its functionality in our script.

👉For more details on installing and importing Python libraries, [see this notebook](../appendix/Installing%20and%20importing%20modules%20and%20libraries.ipynb).

In [None]:
import pandas as pd

### Load the CSV

Next, we'll load the CSV into a pandas _dataframe_, which is sort of like a virtual spreadsheet with rows and columns.

We'll take a _string_ -- some text sandwiched between two apostrophes or two quotation marks -- with the path to our CSV and hand it off to the pandas `read_csv()` function. We'll assign the result to a variable called `df`. (The name of the `df` variable is arbitrary, FWIW -- you could call it `banana` and things would still work, though people reading your notebook would be confused.)

👉For more details on _strings_ (and other data types) and _variable assignment_, [see this notebook](Appendix%20-%20Python%20data%20types%20and%20basic%20syntax.ipynb).

👉For more details on loading data into pandas, [see this notebook](Appendix%20-%20Importing%20data%20into%20pandas.ipynb).

In [None]:
df = pd.read_csv('../data/mlb.csv')
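If you want to see what `read_csv()` hands back without touching the real file, here's a minimal sketch that reads a tiny CSV from a string held in memory. The rows are made up for illustration, and the `NAME` column is an assumption (this notebook only relies on `TEAM`, `POS` and `SALARY`).

```python
import io

import pandas as pd

# Made-up rows for illustration only, not real salary data
csv_text = """NAME,TEAM,POS,SALARY
Player A,AAA,C,500000
Player B,AAA,SP,2000000
Player C,BBB,C,1500000
"""

# read_csv() accepts a file path or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df))
print(sample_df)
```

The first line of the CSV becomes the column names, and each line after that becomes a row in the dataframe.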

### Use `head()` to check out the data

Now that the dataframe is loaded with data, let's use the `head()` method to see the first five rows of data.

In [None]:
df.head()

### Other ways to check out the dataframe

- `.columns` will list the column names
- `.info()` will show each column's data type and non-null count, which lets us know if any columns have null values in them
- `.count()` will count the non-null values in each column
- `.shape` will give you `(number of rows, number of columns)`
- `.describe()` will compute summary stats for the values in each column

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.count()

In [None]:
df.shape

In [None]:
df.describe()
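One thing that's easy to miss: `.shape` and `.count()` can disagree when a column has missing values. Here's a small sketch using a made-up dataframe (not our salary data) with one missing salary.

```python
import pandas as pd

# Made-up data with one missing salary (None is stored as NaN)
sample_df = pd.DataFrame({
    'TEAM': ['AAA', 'AAA', 'BBB'],
    'SALARY': [500000, None, 1500000],
})

print(sample_df.shape)    # rows and columns, missing values included
print(sample_df.count())  # non-null values in each column
```

If the numbers from `.count()` all match the row count from `.shape`, there are no nulls to worry about.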

### Come up with a list of questions

Now that we have a general idea of our data, let's come up with a list of questions. For starters:

- What's the total, average and median salary for an MLB player?
- How many players are on each team?
- Which catchers make the most money?
- How many players make the league minimum?
- Which teams have the biggest payrolls?

Other questions?

### Q: What's the total, average and median salary for an MLB player?

If we were doing this in Excel, we'd probably scroll to the bottom of the worksheet and enter, in the SALARY column, `=SUM(D2:D868)`, and below that, `=AVERAGE(D2:D868)`, and then below that, `=MEDIAN(D2:D868)`. Here, we're going to select the values in the SALARY column and use three built-in pandas methods to do the same math.

In pandas, to select a column of data, you can use dot notation (`df.SALARY`) or bracket notation (`df['SALARY']`). Bracket notation always works, and it's the only option if the column name has spaces.

In [None]:
df.SALARY.sum()

In [None]:
df.SALARY.mean()

In [None]:
df.SALARY.median()
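Here's a quick sketch of the two ways to select a column, run on a made-up dataframe. The `FIRST NAME` column is hypothetical; it's only there to show a column name with a space in it.

```python
import pandas as pd

# Made-up data; 'FIRST NAME' is a hypothetical column with a space in it
sample_df = pd.DataFrame({
    'FIRST NAME': ['Ann', 'Bo', 'Cy'],
    'SALARY': [500000, 2000000, 1500000],
})

# Dot notation and bracket notation select the same column
same = sample_df.SALARY.equals(sample_df['SALARY'])
print(same)

# A column name with a space can only be reached with brackets
print(sample_df['FIRST NAME'])

# The same three calculations, this time with bracket notation
total = sample_df['SALARY'].sum()
average = sample_df['SALARY'].mean()
middle = sample_df['SALARY'].median()
print(total, average, middle)
```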

### Q: How many players are on each team?

To answer this question, we're going to use a method called `.value_counts()` on the TEAM column. The equivalent operation in Excel would be a pivot table; in SQL, it'd be a GROUP BY statement with COUNT(\*).

In [None]:
df.TEAM.value_counts()
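To make the SQL comparison concrete, here's a sketch on made-up team codes that gets the same counts two ways: once with `value_counts()`, and once with `groupby()` plus `size()`, which is closer in spirit to GROUP BY with COUNT(\*).

```python
import pandas as pd

# Made-up team codes for illustration
sample_df = pd.DataFrame({
    'TEAM': ['AAA', 'AAA', 'BBB', 'AAA', 'BBB', 'CCC'],
})

# Counts per team, sorted from most to fewest
counts = sample_df.TEAM.value_counts()
print(counts)

# Roughly "SELECT TEAM, COUNT(*) ... GROUP BY TEAM", then sorted descending
grouped = sample_df.groupby('TEAM').size().sort_values(ascending=False)
print(grouped)
```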

### Q: Which catchers make the most money?

To answer this question, first we'll _filter_ the dataframe to include only catchers. Then we'll sort the data descending and look at the top 5.

First, we need to figure out how "catcher" is represented in our data. Let's use the `unique()` method to get a list of unique values in a column.

In [None]:
df.POS.unique()

Looks like we want to target records where the `POS` value is "C."

To filter data in a pandas dataframe, we'll put the filtering condition inside the square brackets of `df[]`. The condition gets checked against every row, and only the rows where it's true are kept. It's a little confusing at first.

In [None]:
catchers = df[df['POS'] == 'C']
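If the nested brackets look odd, it can help to split the filter into two steps. This sketch uses a made-up dataframe (the `NAME` column is an assumption) to show what the condition produces on its own before it's handed to `df[]`.

```python
import pandas as pd

# Made-up rows for illustration
sample_df = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C'],
    'POS': ['C', 'SP', 'C'],
    'SALARY': [500000, 2000000, 1500000],
})

# The comparison yields one True/False value per row
is_catcher = sample_df['POS'] == 'C'
print(is_catcher)

# Handing those True/False values to df[] keeps only the True rows
sample_catchers = sample_df[is_catcher]
print(sample_catchers)
```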

Now we want to sort these records top to bottom. To do that, we'll use the `sort_values()` method, which needs the name of the column to sort by ('SALARY'), and we'll also specify that `ascending` is `False`. Tacking `head()` onto the end gives us just the top 5.

In [None]:
catchers.sort_values('SALARY', ascending=False).head()

### Q: How many players make the league minimum?

First, we'll need to figure out what the league minimum is. It's reasonable to assume that this number will be the lowest value in the data, and that it would probably appear most frequently. Let's use `value_counts()` again to check it out.

In [None]:
df.SALARY.value_counts()

#### Bonus Q: What percentage of players does that represent?

We now know how many make the league minimum (50). We can use Python's built-in `len()` function to get a quick record count on the dataframe, and from there the math is straightforward.

In [None]:
(50 / len(df)) * 100
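Typing the count in by hand works, but you could also let pandas find it. Here's a sketch on made-up salaries that assumes, as we did above, that the lowest value in the column is the league minimum.

```python
import pandas as pd

# Made-up salaries for illustration
sample_df = pd.DataFrame({'SALARY': [500000, 500000, 2000000, 1500000]})

# Assumption: the lowest salary in the data is the league minimum
lowest = sample_df.SALARY.min()

# Comparing gives True/False per row; summing counts the Trues
count_at_min = (sample_df.SALARY == lowest).sum()

pct = count_at_min / len(sample_df) * 100
print(lowest, count_at_min, pct)
```

This way the percentage updates on its own if the data changes.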

### Q: Which teams have the biggest payrolls?

To answer this question, we're again going to use the equivalent of an Excel pivot table. Our steps:

1. Select out the two columns we're interested in: `[TEAM, SALARY]`
2. Use the `groupby()` method to group the data by team
3. Use the `sum()` method to sum salaries by team
4. Use the `sort_values()` method to sort the results descending

_Furthermore_, we're gonna chain these methods together and do it all in one whack.

In [None]:
(df[['TEAM', 'SALARY']]
    .groupby('TEAM')
    .sum()
    .sort_values('SALARY', ascending=False))
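If the chain is hard to follow, here are the same four steps done one at a time, each saved to its own variable. The teams and salaries are made up, and the variable names are just for illustration.

```python
import pandas as pd

# Made-up rows for illustration
sample_df = pd.DataFrame({
    'TEAM': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC'],
    'POS': ['C', 'SP', 'C', '1B', 'SP'],
    'SALARY': [500000, 2000000, 1500000, 3000000, 750000],
})

# 1. Select the two columns we care about
two_columns = sample_df[['TEAM', 'SALARY']]

# 2. Group the rows by team
grouped = two_columns.groupby('TEAM')

# 3. Sum the salaries within each group
totals = grouped.sum()

# 4. Sort the totals from biggest to smallest
payrolls = totals.sort_values('SALARY', ascending=False)
print(payrolls)
```

Chaining gets you the same result without the leftover variables, but breaking it apart like this is handy when you need to inspect what each step returns.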