## CMPINF 2100 Week 05

### Filter Pandas DataFrames with strings

## Import Modules

In [1]:
import numpy as np
import pandas as pd

## Review of DataFrames

Let's create the baseball-related DataFrame that we worked with last week.

In [2]:
baseball_dict = {'City': ['Pittsburgh', 'Cincinatti', 'Chicago', 'St. Louis', 'Milwaukee'],
                 'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
                 'Division': 5 * ['Central'],
                 'League': 5 * ['NL']}

In [3]:
baseball_dict

{'City': ['Pittsburgh', 'Cincinatti', 'Chicago', 'St. Louis', 'Milwaukee'],
 'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
 'Division': ['Central', 'Central', 'Central', 'Central', 'Central'],
 'League': ['NL', 'NL', 'NL', 'NL', 'NL']}

In [4]:
baseball_df = pd.DataFrame( baseball_dict,
                            columns=['League', 'Division', 'City', 'Team'])

In [5]:
baseball_df

Unnamed: 0,League,Division,City,Team
0,NL,Central,Pittsburgh,Pirates
1,NL,Central,Cincinatti,Reds
2,NL,Central,Chicago,Cubs
3,NL,Central,St. Louis,Cardinals
4,NL,Central,Milwaukee,Brewers


Add a column for the number of games back.

In [6]:
baseball_df['games_back'] = pd.Series( [31.5, 27.5, 22.5, 0, 7.5],
                                       index=baseball_df.index )

In [7]:
baseball_df

Unnamed: 0,League,Division,City,Team,games_back
0,NL,Central,Pittsburgh,Pirates,31.5
1,NL,Central,Cincinatti,Reds,27.5
2,NL,Central,Chicago,Cubs,22.5
3,NL,Central,St. Louis,Cardinals,0.0
4,NL,Central,Milwaukee,Brewers,7.5
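
The `index=baseball_df.index` argument matters because Pandas aligns a Series to the DataFrame by index LABEL when assigning a column, not by position. Here is a minimal sketch of that behavior. The `demo_df` frame and the `shuffled` Series are made up for this illustration and are not part of the lecture data.

```python
import pandas as pd

# Hypothetical three-team frame, used only for this illustration.
demo_df = pd.DataFrame({'Team': ['Pirates', 'Reds', 'Cubs']})

# The Series index is deliberately reversed relative to demo_df.index.
shuffled = pd.Series([22.5, 27.5, 31.5], index=[2, 1, 0])

# Assignment aligns on index labels, so label 0 receives 31.5, not 22.5.
demo_df['games_back'] = shuffled

# Passing index=demo_df.index ties the values to the rows in list order.
demo_df['games_back_ordered'] = pd.Series([31.5, 27.5, 22.5], index=demo_df.index)

print(demo_df)
```

Both columns end up identical, even though the two lists were typed in opposite orders.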


Sort by the `games_back` column. Reset the index with `ignore_index=True`, and modify the DataFrame in place with `inplace=True`!

In [8]:
baseball_df.sort_values( ['games_back'], ignore_index=True, inplace=True)

In [9]:
baseball_df

Unnamed: 0,League,Division,City,Team,games_back
0,NL,Central,St. Louis,Cardinals,0.0
1,NL,Central,Milwaukee,Brewers,7.5
2,NL,Central,Chicago,Cubs,22.5
3,NL,Central,Cincinatti,Reds,27.5
4,NL,Central,Pittsburgh,Pirates,31.5
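
For comparison, here is a sketch of the same sort WITHOUT `inplace=True`, which returns a new DataFrame and leaves the original untouched. It also shows what happens to the row labels when `ignore_index=True` is left out. The `standings` frame is a trimmed-down, hypothetical copy of the lecture data.

```python
import pandas as pd

# Trimmed-down copy of the lecture data, in its original (unsorted) order.
standings = pd.DataFrame({'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
                          'games_back': [31.5, 27.5, 22.5, 0.0, 7.5]})

# Returns a new sorted DataFrame; `standings` itself is left unchanged.
sorted_standings = standings.sort_values('games_back', ignore_index=True)

# Without ignore_index=True the original row labels travel with the rows.
sorted_keep_labels = standings.sort_values('games_back')

print(sorted_standings)
print(sorted_keep_labels.index.tolist())
```

Assigning the result to a new name is often easier to reason about than modifying in place, because the unsorted data is still available.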


Add two more columns whose values vary from row to row.

In [10]:
baseball_df['wins'] = pd.Series([87, 79, 64, 59, 55],
                                index=baseball_df.index)

In [11]:
baseball_df['losses'] = pd.Series([63, 70, 85, 90, 94],
                                  index=baseball_df.index)

In [12]:
baseball_df

Unnamed: 0,League,Division,City,Team,games_back,wins,losses
0,NL,Central,St. Louis,Cardinals,0.0,87,63
1,NL,Central,Milwaukee,Brewers,7.5,79,70
2,NL,Central,Chicago,Cubs,22.5,64,85
3,NL,Central,Cincinatti,Reds,27.5,59,90
4,NL,Central,Pittsburgh,Pirates,31.5,55,94


Lastly, add a column with a constant value down all rows.

In [13]:
baseball_df['season'] = 2022

In [14]:
baseball_df

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
0,NL,Central,St. Louis,Cardinals,0.0,87,63,2022
1,NL,Central,Milwaukee,Brewers,7.5,79,70,2022
2,NL,Central,Chicago,Cubs,22.5,64,85,2022
3,NL,Central,Cincinatti,Reds,27.5,59,90,2022
4,NL,Central,Pittsburgh,Pirates,31.5,55,94,2022


## Filter rows

Filtering refers to SELECTING rows based on CONDITIONAL TESTS.

In [15]:
baseball_df.loc[ baseball_df.wins > 65, : ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
0,NL,Central,St. Louis,Cardinals,0.0,87,63,2022
1,NL,Central,Milwaukee,Brewers,7.5,79,70,2022


In [16]:
baseball_df.loc[ baseball_df.Team == 'Pirates', : ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
4,NL,Central,Pittsburgh,Pirates,31.5,55,94,2022
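
It can help to look at the conditional test on its own. The expression inside `.loc[]` is a boolean Series with one value per row, and two such Series can be combined with the AND operator, `&`. The sketch below uses a hypothetical `teams` frame holding just three of the lecture columns.

```python
import pandas as pd

teams = pd.DataFrame({'Team': ['Cardinals', 'Brewers', 'Cubs', 'Reds', 'Pirates'],
                      'wins': [87, 79, 64, 59, 55],
                      'losses': [63, 70, 85, 90, 94]})

# The conditional test is a boolean Series with one True/False per row.
more_than_65_wins = teams.wins > 65
print(more_than_65_wins)

# Masks combine element-wise with & (AND); each condition needs parentheses.
winning_and_under_70_losses = (teams.wins > 65) & (teams.losses < 70)

print(teams.loc[winning_and_under_70_losses, :])
```

Only the Cardinals pass both tests, because the Brewers have exactly 70 losses.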


We also saw how to use the OR operator, `|`, to find all rows where the value equals A or B.

In other words, the value is ONE OF those listed.

In [17]:
baseball_df.loc[ (baseball_df.Team == 'Cardinals') | (baseball_df.Team == 'Brewers'), : ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
0,NL,Central,St. Louis,Cardinals,0.0,87,63,2022
1,NL,Central,Milwaukee,Brewers,7.5,79,70,2022


Combining the `==` operator with the `|` operator is correct...but it does not SCALE well!

For example, if we needed to check for 10 possible values...we would need to type in 10 different conditions!!!

Instead, we can use the `.isin()` method to replace the whole chain of `|` conditions with a single call!

In [18]:
baseball_df.loc[ baseball_df.Team.isin(['Cardinals', 'Brewers']), :]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
0,NL,Central,St. Louis,Cardinals,0.0,87,63,2022
1,NL,Central,Milwaukee,Brewers,7.5,79,70,2022
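
To check that the two approaches really agree, here is a small sketch that builds both masks and compares them. The `teams` frame and the `wanted` list are stand-ins for the lecture data.

```python
import pandas as pd

teams = pd.DataFrame({'Team': ['Cardinals', 'Brewers', 'Cubs', 'Reds', 'Pirates']})

wanted = ['Cardinals', 'Brewers', 'Pirates']

# Chained equality tests: one condition per value.
mask_or = (teams.Team == 'Cardinals') | (teams.Team == 'Brewers') | (teams.Team == 'Pirates')

# One call regardless of how many values are in the list.
mask_isin = teams.Team.isin(wanted)

# Both masks flag exactly the same rows.
print(mask_or.equals(mask_isin))
```

Adding an eleventh team to `wanted` would not change the `.isin()` line at all.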


In [19]:
baseball_df.loc[ baseball_df.Team.isin(['Cardinals', 'Brewers', 'Pirates']), :]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
0,NL,Central,St. Louis,Cardinals,0.0,87,63,2022
1,NL,Central,Milwaukee,Brewers,7.5,79,70,2022
4,NL,Central,Pittsburgh,Pirates,31.5,55,94,2022


The `.isin()` method can also be applied to numbers.

In [20]:
baseball_df.loc[ baseball_df.losses.isin([70, 90]), :]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
1,NL,Central,Milwaukee,Brewers,7.5,79,70,2022
3,NL,Central,Cincinatti,Reds,27.5,59,90,2022


The `.isin()` method is especially useful for string filtering!

In [21]:
top_teams = baseball_df.loc[ baseball_df.games_back < 10, 'Team'].copy().tolist()

In [22]:
top_teams

['Cardinals', 'Brewers']

In [23]:
baseball_df.loc[ baseball_df.Team.isin( top_teams ), : ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
0,NL,Central,St. Louis,Cardinals,0.0,87,63,2022
1,NL,Central,Milwaukee,Brewers,7.5,79,70,2022
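
A common follow-up is to select the rows that are NOT in the list. One way to sketch that is with the `~` operator, which negates a boolean mask. The `teams` frame below is a hypothetical two-column version of the lecture data.

```python
import pandas as pd

teams = pd.DataFrame({'Team': ['Cardinals', 'Brewers', 'Cubs', 'Reds', 'Pirates'],
                      'games_back': [0.0, 7.5, 22.5, 27.5, 31.5]})

top_teams = teams.loc[teams.games_back < 10, 'Team'].tolist()

# ~ flips every True/False in the mask, keeping rows NOT in top_teams.
other_teams = teams.loc[~teams.Team.isin(top_teams), :]

print(other_teams)
```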


## String pattern matching

In [24]:
baseball_df.loc[ baseball_df.City == 'Pittsburgh', : ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
4,NL,Central,Pittsburgh,Pirates,31.5,55,94,2022


But what if I didn't feel like typing out the whole string for `'Pittsburgh'`?

In [25]:
baseball_df.loc[ baseball_df.City == 'Pitt', : ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season


What if we had a typo?

In [26]:
baseball_df.loc[ baseball_df.City == 'Pittsburg', :]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season


Instead, we could focus on a PATTERN. The `.str.contains()` method searches for a PATTERN **WITHIN** the string!

In [27]:
baseball_df.City.str.contains('Pitt')

0    False
1    False
2    False
3    False
4     True
Name: City, dtype: bool

In [28]:
baseball_df.City

0     St. Louis
1     Milwaukee
2       Chicago
3    Cincinatti
4    Pittsburgh
Name: City, dtype: object

In [29]:
baseball_df.loc[ baseball_df.City.str.contains('Pitt'), : ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
4,NL,Central,Pittsburgh,Pirates,31.5,55,94,2022
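
Two optional arguments of `.str.contains()` are worth knowing when the data are messier than this example: `case=False` ignores letter case, and `na=False` treats missing values as non-matches. The sketch below uses a made-up `cities` Series with mixed case and a missing entry. How a missing entry is reported WITHOUT `na=False` can depend on the Pandas version, so the sketch does not rely on it.

```python
import pandas as pd

# Hypothetical city values with mixed case and one missing entry.
cities = pd.Series(['Pittsburgh', 'pittsburgh', 'St. Louis', None], name='City')

# Default search is case sensitive, so 'pittsburgh' does not match 'Pitt'.
default_match = cities.str.contains('Pitt')

# case=False ignores letter case; na=False treats missing values as no match.
relaxed_match = cities.str.contains('pitt', case=False, na=False)

print(relaxed_match)
```

With `na=False` the result contains only True and False values, so it can be passed to `.loc[]` as a filter.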


We can even apply the PATTERN search to a single character!

In [30]:
baseball_df.loc[ baseball_df.City.str.contains('P'), :]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
4,NL,Central,Pittsburgh,Pirates,31.5,55,94,2022


But...be CAREFUL! If the PATTERN is TOO SHORT...it will not uniquely identify the string you are looking for!

In [31]:
baseball_df.loc[ baseball_df.City.str.contains('C'), : ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
2,NL,Central,Chicago,Cubs,22.5,64,85,2022
3,NL,Central,Cincinatti,Reds,27.5,59,90,2022
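
Two possible ways to tighten a pattern that is too short are to make it longer, or to switch to `.str.startswith()`, which only checks the beginning of each string. This is a sketch on a standalone `cities` Series that copies the lecture values, including the spelling used in the DataFrame.

```python
import pandas as pd

cities = pd.Series(['St. Louis', 'Milwaukee', 'Chicago', 'Cincinatti', 'Pittsburgh'],
                   name='City')

# 'C' alone matches both Chicago and Cincinatti.
too_short = cities.str.contains('C')

# A longer pattern narrows the match to a single city.
longer = cities.str.contains('Chi')

# .str.startswith() only checks the beginning of each string.
starts_with_ci = cities.str.startswith('Ci')

print(cities[too_short].tolist())
print(cities[longer].tolist())
print(cities[starts_with_ci].tolist())
```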


The `.str.contains()` method is very helpful when EXPLORING data!

I particularly like to use it to search for non-letter characters.

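To find a period in a string we need to search for the pattern `\\.`. By default, `.str.contains()` treats the pattern as a regular expression, where an unescaped `.` matches ANY character, so the period must be escaped with a backslash.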

In [32]:
baseball_df.loc[ baseball_df.City.str.contains( '\\.' ), : ]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
0,NL,Central,St. Louis,Cardinals,0.0,87,63,2022
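
The sketch below compares three ways of searching for a period: unescaped, escaped with a raw string, and as literal text with `regex=False`. The `cities` Series is a standalone copy of the lecture values.

```python
import pandas as pd

cities = pd.Series(['St. Louis', 'Milwaukee', 'Chicago', 'Cincinatti', 'Pittsburgh'],
                   name='City')

# Unescaped, '.' is a regular-expression wildcard and matches every city here.
wildcard = cities.str.contains('.')

# Escaped period, written as a raw string so only one backslash is needed.
escaped = cities.str.contains(r'\.')

# regex=False searches for the literal text, so no escaping is required.
literal = cities.str.contains('.', regex=False)

print(wildcard.tolist())
print(escaped.tolist())
print(literal.tolist())
```

The raw string `r'\.'` and the doubled backslash `'\\.'` describe the same pattern. The `regex=False` option is a reasonable choice whenever the search text is meant literally.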


We can even search for a WHITE SPACE.

In [33]:
baseball_df.loc[ baseball_df.City.str.contains( ' ' ), :]

Unnamed: 0,League,Division,City,Team,games_back,wins,losses,season
0,NL,Central,St. Louis,Cardinals,0.0,87,63,2022


There are many more STRING METHODS available. Many of the Pandas `.str.` methods are consistent with the base Python string methods.
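
As a quick illustration of that consistency, this sketch applies `upper()` once with base Python and once through the `.str` accessor. The short `cities` Series is made up for this comparison.

```python
import pandas as pd

cities = pd.Series(['St. Louis', 'Milwaukee', 'Chicago'], name='City')

# Base Python: call the string method on one value at a time.
upper_python = [city.upper() for city in cities]

# Pandas: the .str accessor applies the same-named method to every element.
upper_pandas = cities.str.upper()

# .str.len() plays the role of the built-in len() for each element.
lengths = cities.str.len()

print(upper_pandas.tolist())
print(lengths.tolist())
```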

In [34]:
dir( baseball_df.City.str )

['__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__frozen',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_data',
 '_doc_args',
 '_freeze',
 '_get_series_list',
 '_index',
 '_inferred_dtype',
 '_is_categorical',
 '_is_string',
 '_name',
 '_orig',
 '_parent',
 '_validate',
 '_wrap_result',
 'capitalize',
 'casefold',
 'cat',
 'center',
 'contains',
 'count',
 'decode',
 'encode',
 'endswith',
 'extract',
 'extractall',
 'find',
 'findall',
 'fullmatch',
 'get',
 'get_dummies',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'islower',
 'isnumeric',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'len',
 'ljust',
 'lower',
 'lstrip',
 'match',
 'normalize