# Filtering columns and rows in pandas

This notebook has a little more detail on selecting and filtering data in pandas. We'll use the MLB salary data as an example.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/mlb.csv')

In [3]:
df.head()

| | NAME | TEAM | POS | SALARY | START_YEAR | END_YEAR | YEARS |
|---|---|---|---|---|---|---|---|
| 0 | Clayton Kershaw | LAD | SP | 33000000 | 2014 | 2020 | 7 |
| 1 | Zack Greinke | ARI | SP | 31876966 | 2016 | 2021 | 6 |
| 2 | David Price | BOS | SP | 30000000 | 2016 | 2022 | 7 |
| 3 | Miguel Cabrera | DET | 1B | 28000000 | 2014 | 2023 | 10 |
| 4 | Justin Verlander | DET | SP | 28000000 | 2013 | 2019 | 7 |


### Selecting one column of data

You can select a column of data in a dataframe with a period ("dot notation") `.` or by providing the name of the column as a string inside square brackets ("bracket notation"): `[]`.

If you want to select just one column of data, and the name of the column you're selecting doesn't have spaces, you can use dot notation. If your column name has spaces, isn't otherwise a valid Python name, or matches the name of a dataframe method (such as `count`), you must use bracket notation.

Let's say we wanted to select the `TEAM` column. We could do this:

In [None]:
df.TEAM

... or we could do this:

In [None]:
df['TEAM']

Either works.
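
To see where the two notations stop being interchangeable, here is a small sketch. The dataframe `toy` and its `START YEAR` and `count` columns are made up for illustration and are not part of the MLB data.

```python
import pandas as pd

# Hypothetical toy dataframe; the column names are invented for this example.
toy = pd.DataFrame({
    'TEAM': ['LAD', 'ARI'],
    'START YEAR': [2014, 2016],   # name contains a space
    'count': [7, 6],              # name matches the DataFrame.count() method
})

# With a simple column name, dot and bracket notation return the same data.
print(toy.TEAM.equals(toy['TEAM']))

# A name with a space can only be reached with bracket notation.
print(toy['START YEAR'].tolist())

# toy.count is the built-in method, not the column; brackets get the data.
print(callable(toy.count))
print(toy['count'].tolist())
```

Bracket notation works in every case, so it is a reasonable default when you're unsure about a column name.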

### Selecting multiple columns of data

To select multiple columns of data, we're going to pass a _list_ of column names into the square brackets. Let's select the `NAME` and `TEAM` columns.

👉 For a refresher on _lists_, [check out this notebook](Python%20data%20types%20and%20basic%20syntax.ipynb).

In [None]:
df[['NAME', 'TEAM']]

Lots of square brackets happening here! You could easily assign the list of column names to its own variable to make things clearer:

In [None]:
cols_of_interest = ['TEAM', 'NAME']
df[cols_of_interest]
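
One detail worth knowing: a single column name gives you back a Series, while a list of names gives you back a dataframe whose columns follow the order of your list. A minimal sketch, using a hand-built `sample` dataframe (an assumption, so the example runs without the CSV) that copies the first three rows shown earlier:

```python
import pandas as pd

# First three rows from the head() output, rebuilt by hand so this runs on its own.
sample = pd.DataFrame({
    'NAME': ['Clayton Kershaw', 'Zack Greinke', 'David Price'],
    'TEAM': ['LAD', 'ARI', 'BOS'],
    'SALARY': [33000000, 31876966, 30000000],
})

one_col = sample['TEAM']              # a string returns a Series
two_cols = sample[['TEAM', 'NAME']]   # a list returns a DataFrame

print(type(one_col).__name__)    # Series
print(type(two_cols).__name__)   # DataFrame
print(list(two_cols.columns))    # ['TEAM', 'NAME'], same order as the list
```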

### Filtering rows of data

You can also filter rows to keep just the records that meet your filtering condition(s) -- like a `WHERE` clause in SQL.

For example, let's say you wanted to filter this data to include just the Los Angeles Dodgers. The basic syntax is to pass your filtering condition to the dataframe in square brackets `[]`.

First, let's use the `unique()` method on the `TEAM` column to make sure we understand how team names are represented in the data.

In [4]:
df.TEAM.unique()

array(['LAD', 'ARI', 'BOS', 'DET', 'CHC', 'LAA', 'SEA', 'NYY', 'TEX',
       'SF', 'MIN', 'NYM', 'WSH', 'CIN', 'ATL', 'BAL', 'CWS', 'COL',
       'TOR', 'STL', 'MIL', 'PHI', 'HOU', 'KC', 'MIA', 'CLE', 'PIT', 'TB',
       'OAK', 'SD'], dtype=object)

Cool. So we want to select all rows where the `TEAM` value is `'LAD'`.

In [None]:
lad = df[df['TEAM'] == 'LAD']

In [None]:
lad
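
It can help to look at the filtering condition on its own. The comparison produces a column of `True`/`False` values, one per row, and the dataframe keeps only the rows marked `True`. A small sketch, again with a hand-built `sample` dataframe (an assumption) holding three rows from the data:

```python
import pandas as pd

# Three rows from the head() output, rebuilt by hand so this runs on its own.
sample = pd.DataFrame({
    'NAME': ['Clayton Kershaw', 'Zack Greinke', 'David Price'],
    'TEAM': ['LAD', 'ARI', 'BOS'],
})

# The condition by itself is a Series of booleans.
is_lad = sample['TEAM'] == 'LAD'
print(is_lad.tolist())   # [True, False, False]

# Passing that Series in square brackets keeps only the True rows.
lad_sample = sample[is_lad]
print(lad_sample['NAME'].tolist())   # ['Clayton Kershaw']
```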

You can do numerical comparisons -- let's get just the players who make $1 million or more:

In [None]:
millionaires = df[df['SALARY'] >= 1000000]

In [None]:
millionaires
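
Numerical comparisons only work as expected if the column really holds numbers, so a quick look at the column's `dtype` is a cheap sanity check. The sketch below uses the five salaries from the `head()` output in a hand-built `sample` dataframe; the `threshold` value is a hypothetical cutoff picked so the five rows split into two groups.

```python
import pandas as pd

# The five rows from the head() output, rebuilt by hand so this runs on its own.
sample = pd.DataFrame({
    'NAME': ['Clayton Kershaw', 'Zack Greinke', 'David Price',
             'Miguel Cabrera', 'Justin Verlander'],
    'SALARY': [33000000, 31876966, 30000000, 28000000, 28000000],
})

# An integer or float dtype is what we want; 'object' would suggest text.
print(sample['SALARY'].dtype)

threshold = 30000000  # hypothetical cutoff for this example
at_or_above = sample[sample['SALARY'] >= threshold]
below = sample[sample['SALARY'] < threshold]

# len() gives a quick row count for each filtered dataframe.
print(len(at_or_above), len(below))   # 3 2
```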

### Filtering against multiple matches

You can use the [`isin()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html) method to test each value in a column against several possible matches -- just hand it your list of values to check against.

Let's say we wanted to return all of the players for the Texas Rangers and Houston Astros.

In [None]:
tx = df[df['TEAM'].isin(['TEX', 'HOU'])]

In [None]:
tx
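
One way to think about `isin()` is as shorthand for a chain of "equals this OR equals that" checks. The sketch below compares the two approaches on a made-up `toy` dataframe; the player names are placeholders, not real records.

```python
import pandas as pd

# Hypothetical rows; the player names are placeholders.
toy = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D'],
    'TEAM': ['TEX', 'HOU', 'LAD', 'TEX'],
})

# isin() checks each team against the whole list at once.
with_isin = toy['TEAM'].isin(['TEX', 'HOU'])

# The longhand version: two comparisons joined with | ("or").
with_or = (toy['TEAM'] == 'TEX') | (toy['TEAM'] == 'HOU')

print(with_isin.equals(with_or))         # True
print(toy[with_isin]['NAME'].tolist())   # ['Player A', 'Player B', 'Player D']
```

The longhand version gets unwieldy quickly as the list of values grows, which is where `isin()` earns its keep.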

### "Not" filtering

If we prepend a tilde `~` to the filtering condition, that reverses its meaning -- the filter will return all rows that do _not_ match the criteria. If we wanted to get all non-Texas players, we could use the same filter we just used, but with a tilde:

In [5]:
not_tx = df[~df['TEAM'].isin(['TEX', 'HOU'])]

In [None]:
not_tx
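
A quick way to convince yourself the tilde is doing what you expect: the matching rows and the non-matching rows should add up to the whole table. A sketch on a made-up `toy` dataframe (player names are placeholders):

```python
import pandas as pd

# Hypothetical rows; the player names are placeholders.
toy = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D'],
    'TEAM': ['TEX', 'HOU', 'LAD', 'TEX'],
})

in_texas = toy['TEAM'].isin(['TEX', 'HOU'])

# The tilde flips every True to False and every False to True.
print(in_texas.tolist())      # [True, True, False, True]
print((~in_texas).tolist())   # [False, False, True, False]

tx_toy = toy[in_texas]
not_tx_toy = toy[~in_texas]

# Every row lands in exactly one of the two groups.
print(len(tx_toy) + len(not_tx_toy) == len(toy))   # True
```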

### Filtering on multiple criteria

You can filter your data on multiple criteria. A few gotchas:
- Don't use Python's native `and` and `or` operators to chain the conditions -- [pandas wants you to use `&` and `|`](https://pandas.pydata.org/pandas-docs/version/0.22/indexing.html#boolean-indexing)
- Don't forget to wrap each condition in parentheses -- `&` and `|` bind more tightly than comparison operators like `==`, so without parentheses Python groups the expression the wrong way

Let's filter for all catchers who make the league minimum of $535,000.

In [None]:
catchers_lm = df[(df['POS'] == 'C') & (df['SALARY'] == 535000)]

In [None]:
catchers_lm
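
To see the first gotcha in action, the sketch below runs the same filter with `&` and then with `and`. The `toy` dataframe is made up (placeholder names and salaries), and each condition is stored in its own variable, which is one way to sidestep the parentheses issue.

```python
import pandas as pd

# Hypothetical rows; names and salaries are placeholders.
toy = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C'],
    'POS': ['C', 'C', 'SP'],
    'SALARY': [535000, 2000000, 535000],
})

is_catcher = toy['POS'] == 'C'
at_minimum = toy['SALARY'] == 535000

# & combines the two conditions row by row.
print(toy[is_catcher & at_minimum]['NAME'].tolist())   # ['Player A']

# `and` asks for one True/False answer from a whole column, which pandas refuses.
try:
    toy[is_catcher and at_minimum]
except ValueError as err:
    print('ValueError:', err)
```

Naming each condition also makes longer filters easier to read than a single line full of nested brackets.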