# Automate your analysis with `pandas`

Automating your data analysis is one of the most powerful things you can do with Python in a newsroom. We're going to use a library called [`pandas`](http://pandas.pydata.org/). Doing the analysis in code leaves a replicable, transparent script for others to follow.

## Warmup: MLB salary data

Remember `data/mlb.csv`? Let's use that to practice.

### Make a list of questions

- What's the total MLB payroll?
- What's the total payroll by team?
- What's the typical (mean) salary for a designated hitter?

(What else?)

### Import pandas
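
The widely used convention is to import the library under the short alias `pd`; the alias is only a convention, not a requirement.

```python
# Import pandas under its conventional alias
import pandas as pd
```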

### Read the data into a data frame

We'll use the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function to read the CSV file in as a [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).
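
Here is a minimal sketch of that step. So that it runs on its own, it reads a tiny in-memory CSV of made-up rows instead of the real file; the `NAME` column and every value in it are assumptions for illustration, while `TEAM`, `POS` and `SALARY` are the columns this lesson refers to.

```python
import io
import pandas as pd

# Made-up stand-in rows (NOT real salary data) so the sketch is self-contained.
# In the workshop you would run: df = pd.read_csv('data/mlb.csv')
sample_csv = io.StringIO(
    "NAME,TEAM,POS,SALARY\n"
    "Player A,AAA,DH,1000000\n"
    "Player B,AAA,P,2000000\n"
    "Player C,BBB,DH,3000000\n"
    "Player D,BBB,C,500000\n"
    "Player E,CCC,P,1500000\n"
)

# read_csv() accepts a file path or a file-like object and returns a DataFrame
df = pd.read_csv(sample_csv)

# Peek at the first few rows to check that the data loaded as expected
print(df.head())
```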

### _How many players?_

Use the `.count()` method to find out how many non-null values are in each column. If a column has no missing data, its count equals the number of rows.
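
A small sketch of counting, using a made-up stand-in data frame (the `NAME` column and all values are hypothetical; in the workshop `df` would come from `data/mlb.csv`):

```python
import pandas as pd

# Made-up stand-in data, NOT real salaries
df = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E'],
    'TEAM': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC'],
    'POS': ['DH', 'P', 'DH', 'C', 'P'],
    'SALARY': [1000000, 2000000, 3000000, 500000, 1500000],
})

# count() returns the number of non-null values in each column
counts = df.count()
print(counts)

# len() gives the number of rows, regardless of missing values
number_of_rows = len(df)
print(number_of_rows)
```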

### _Total MLB payroll_

To select a column from your data frame, you can use either dot or bracket notation. The result will be a [`Series`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html).

Then, you can use the [`sum()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html) method to sum numeric data.
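
As a sketch, here are both ways of selecting the `SALARY` column, followed by the sum. The data frame is a made-up stand-in (hypothetical `NAME` column and values), not the real MLB file.

```python
import pandas as pd

# Made-up stand-in data, NOT real salaries
df = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E'],
    'TEAM': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC'],
    'POS': ['DH', 'P', 'DH', 'C', 'P'],
    'SALARY': [1000000, 2000000, 3000000, 500000, 1500000],
})

# Bracket notation works for any column name
salaries = df['SALARY']

# Dot notation works when the column name is a valid Python identifier
salaries_dot = df.SALARY

# Sum the numeric values in the Series
total_payroll = salaries.sum()
print(total_payroll)
```

Bracket notation is the safer habit, since dot notation fails for column names that contain spaces or that collide with existing data frame attributes.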

### _Get a list of teams_

Sometimes you want to get a list of unique values in a column. You can use the [`unique()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) method for this.
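
A quick sketch, again on made-up stand-in data (the team codes here are placeholders, not real teams):

```python
import pandas as pd

# Made-up stand-in data, NOT real salaries
df = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E'],
    'TEAM': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC'],
    'POS': ['DH', 'P', 'DH', 'C', 'P'],
    'SALARY': [1000000, 2000000, 3000000, 500000, 1500000],
})

# unique() returns each distinct value once, in order of first appearance
teams = df['TEAM'].unique()
print(teams)
```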

### _Total payroll by team_

We want to group our data by team and sum the salaries -- analogous to a pivot table in Excel or a `GROUP BY` statement in SQL. Our steps:

- Select the two columns we care about -- to select multiple columns, pass a list of columns to the data frame inside square brackets
- Use the [`groupby()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method to group the data by team
- Use the [`sum()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.sum.html) method to sum up the grouped data
- Use the [`reset_index()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html) method to move `TEAM` out of the index and back into a regular column
- Index the results on the `TEAM` column (optional)
- Use the [`sort_values()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) method to order the data by the `SALARY` column
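
The steps above can be sketched roughly as follows, one step per line. The data is a made-up stand-in, and sorting in descending order is an assumption about what is most useful here; drop `ascending=False` to sort from smallest to largest.

```python
import pandas as pd

# Made-up stand-in data, NOT real salaries
df = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E'],
    'TEAM': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC'],
    'POS': ['DH', 'P', 'DH', 'C', 'P'],
    'SALARY': [1000000, 2000000, 3000000, 500000, 1500000],
})

# Select the two columns we care about by passing a list of column names
team_salaries = df[['TEAM', 'SALARY']]

# Group by team and sum the salaries within each group
grouped = team_salaries.groupby('TEAM').sum()

# Move TEAM out of the index and back into a regular column
flat = grouped.reset_index()

# Optional: index the results on the TEAM column
by_team = flat.set_index('TEAM')

# Order the teams by payroll, largest first (descending order is our choice)
payroll_by_team = by_team.sort_values('SALARY', ascending=False)
print(payroll_by_team)
```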

### _Typical salary for DH_

Let's find the mean salary for designated hitters. To filter a data frame, you use square brackets to pass a filtering condition to the data frame: `df[YOUR CONDITION HERE]`.

It's like a `WHERE` clause in SQL. In this case, we can use one of Python's [comparison operators](https://docs.python.org/3/reference/expressions.html#comparisons) to define our conditions. If the value in the `POS` column is `DH`, include it in the results.

Then we're going to select the values in the `SALARY` column and calculate the typical salary with the [`mean()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html) method.
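
Here is a sketch of the filter-then-average pattern on made-up stand-in data (hypothetical players and salaries, not the real file):

```python
import pandas as pd

# Made-up stand-in data, NOT real salaries
df = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D', 'Player E'],
    'TEAM': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC'],
    'POS': ['DH', 'P', 'DH', 'C', 'P'],
    'SALARY': [1000000, 2000000, 3000000, 500000, 1500000],
})

# The comparison produces a True/False value for every row
is_dh = df['POS'] == 'DH'

# Passing that condition inside square brackets keeps only the True rows
designated_hitters = df[is_dh]

# Select the SALARY column and take the mean
mean_dh_salary = designated_hitters['SALARY'].mean()
print(mean_dh_salary)
```

Swapping `mean()` for `median()` on the same Series is one way to check whether a few very large salaries are pulling the average up.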

### _Exercise_

Come up with more questions to ask the MLB data and use what you've learned to answer them. If you can't figure out how to answer them, try to formulate a search query that could lead you to a possible solution.

Some suggestions to get you started:

- Which team has the largest pitching staff?
- What is the lowest-paid position in MLB? (How do you define "lowest-paid"?)
- How many catchers are paid more than $1 million? 