# Filter Pandas with strings

## Import Modules

In [None]:
import numpy as np
import pandas as pd

## Review of DataFrames

Let's create the baseball-related DataFrame that we worked with last week.

In [None]:
baseball_dict = {'City': ['Pittsburgh', 'Cincinnati', 'Chicago', 'St. Louis', 'Milwaukee'],
                 'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
                 'Division': 5 * ['Central'],
                 'League': 5 * ['NL']}

In [None]:
baseball_dict

In [None]:
baseball_df = pd.DataFrame( baseball_dict,
                            columns=['League', 'Division', 'City', 'Team'])

In [None]:
baseball_df

Add a column for the number of games back.

In [None]:
baseball_df['games_back'] = pd.Series( [31.5, 27.5, 22.5, 0, 7.5],
                                       index=baseball_df.index )

In [None]:
baseball_df

Sort by the `games_back` column. Ignore the index, and modify in place!

In [None]:
baseball_df.sort_values( ['games_back'], ignore_index=True, inplace=True)

In [None]:
baseball_df

Add two more columns that have values which change down the rows.

In [None]:
baseball_df['wins'] = pd.Series([87, 79, 64, 59, 55],
                                index=baseball_df.index)

In [None]:
baseball_df['losses'] = pd.Series([63, 70, 85, 90, 94],
                                  index=baseball_df.index)

In [None]:
baseball_df

Lastly, add a column with a constant value down all rows.

In [None]:
baseball_df['season'] = 2022

In [None]:
baseball_df

## Filter rows

Filtering refers to SELECTING rows based on CONDITIONAL TESTS.

In [None]:
baseball_df.loc[ baseball_df.wins > 65, : ]

In [None]:
baseball_df.loc[ baseball_df.Team == 'Pirates', : ]

We also saw how to use the OR operator, `|`, to find all rows where the value equals A or B.

Or, the value is ONE OF those presented.

In [None]:
baseball_df.loc[ (baseball_df.Team == 'Cardinals') | (baseball_df.Team == 'Brewers'), : ]

The `==` operator combined with `|` operator is correct to use...but it does not SCALE well!

For example, if we needed to check for 10 possible values...we would need to type in 10 different conditions!!!

Instead, we can use the `.isin()` method to streamline the `|` operator!

In [None]:
baseball_df.loc[ baseball_df.Team.isin(['Cardinals', 'Brewers']), :]

In [None]:
baseball_df.loc[ baseball_df.Team.isin(['Cardinals', 'Brewers', 'Pirates']), :]
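As a minimal sketch of why `.isin()` can stand in for chained `==` and `|` tests, the cell below builds a small stand-in DataFrame and checks that both approaches give the same boolean mask. The names `demo_df`, `mask_or`, and `mask_isin` are hypothetical and used only for illustration.

```python
import pandas as pd

# Small stand-in for baseball_df (hypothetical data for illustration only)
demo_df = pd.DataFrame({'Team': ['Cardinals', 'Brewers', 'Cubs', 'Reds', 'Pirates']})

# Chained == tests combined with the OR operator
mask_or = (demo_df.Team == 'Cardinals') | (demo_df.Team == 'Brewers')

# The same test written with .isin()
mask_isin = demo_df.Team.isin(['Cardinals', 'Brewers'])

# Both masks flag exactly the same rows
print(mask_or.equals(mask_isin))
```

Adding a tenth team to the `.isin()` list is a one-word change, while the `|` version would need another full condition.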

The `.isin()` method can also be applied to numbers.

In [None]:
baseball_df.loc[ baseball_df.losses.isin([70, 90]), :]

The `.isin()` method is especially useful for string filtering!

In [None]:
top_teams = baseball_df.loc[ baseball_df.games_back < 10, 'Team'].copy().tolist()

In [None]:
top_teams

In [None]:
baseball_df.loc[ baseball_df.Team.isin( top_teams ), : ]

## String pattern matching

In [None]:
baseball_df.loc[ baseball_df.City == 'Pittsburgh', : ]

But what if I didn't feel like typing out the whole string for `'Pittsburgh'`?

In [None]:
baseball_df.loc[ baseball_df.City == 'Pitt', : ]

What if we had a typo?

In [None]:
baseball_df.loc[ baseball_df.City == 'Pittsburg', :]

Instead, we could focus on a PATTERN. The `.str.contains()` method searches for a PATTERN **WITHIN** the string!

In [None]:
baseball_df.City.str.contains('Pitt')

In [None]:
baseball_df.City

In [None]:
baseball_df.loc[ baseball_df.City.str.contains('Pitt'), : ]

We can even apply the PATTERN search to a single character!

In [None]:
baseball_df.loc[ baseball_df.City.str.contains('P'), :]

But...be CAREFUL! If the PATTERN is TOO SHORT...it will not uniquely identify the string you are looking for!

In [None]:
baseball_df.loc[ baseball_df.City.str.contains('C'), : ]

The `.str.contains()` method is very helpful when EXPLORING data!

I particularly like to use it to search for non-letter characters.

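By default, `.str.contains()` treats the PATTERN as a regular expression, where a bare `.` matches ANY character. To find a literal period in a string we need to search for the escaped pattern `\\.`.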

In [None]:
baseball_df.loc[ baseball_df.City.str.contains( '\\.' ), : ]
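The sketch below shows why the period needs special handling. It assumes a small hypothetical Series named `cities` and compares three searches: a bare `'.'`, the escaped `'\\.'`, and a bare `'.'` with regular expressions switched off through the `regex=False` argument of `.str.contains()`.

```python
import pandas as pd

# Hypothetical stand-in for the City column
cities = pd.Series(['Pittsburgh', 'St. Louis', 'Chicago'])

# As a regular expression, a bare '.' matches ANY character, so every row matches
any_char = cities.str.contains('.')

# Escaping the period searches for a literal '.'
escaped = cities.str.contains('\\.')

# Turning off regular expressions also treats '.' as a literal character
literal = cities.str.contains('.', regex=False)

print(any_char.tolist())
print(escaped.tolist())
print(literal.tolist())
```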

We can even search for a WHITE SPACE.

In [None]:
baseball_df.loc[ baseball_df.City.str.contains( ' ' ), :]

There are many more STRING METHODS available. Many of the Pandas `.str.` methods are consistent with the base Python string methods.

In [None]:
dir( baseball_df.City.str )
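As a quick illustration of that consistency, the cell below applies `.upper()` and `.startswith()` first to one base Python string and then to every element of a Series through the `.str` accessor. The Series `teams` is a hypothetical stand-in for the Team column.

```python
import pandas as pd

# Hypothetical stand-in for the Team column
teams = pd.Series(['Pirates', 'Reds', 'Cubs'])

# Base Python string methods act on ONE string
print('Pirates'.upper(), 'Pirates'.startswith('P'))

# The .str accessor applies the same-named method to EVERY element
upper_teams = teams.str.upper()
starts_with_p = teams.str.startswith('P')

print(upper_teams.tolist())
print(starts_with_p.tolist())
```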

# Read data into Pandas

## Import Modules

In [None]:
import numpy as np
import pandas as pd

## Preliminaries

Download the 4 files from Canvas into the SAME directory as this notebook!!!!!

Let's make sure the files were downloaded and saved within the correct location.

In [None]:
import os

In [None]:
os.listdir()

In [None]:
os.getcwd()

## Read Excel

In [None]:
pd.read_excel( 'Excel_Example_Data.xlsx' )

In [None]:
df0 = pd.read_excel( 'Excel_Example_Data.xlsx' )

In [None]:
%whos

In [None]:
df0.info()

In [None]:
df0.shape

In [None]:
df0.columns

In [None]:
df0.dtypes

In [None]:
df0.index

We can force one of the columns to be the `.index` attribute when the data are read.

In [None]:
df0_b = pd.read_excel( 'Excel_Example_Data.xlsx', index_col=0 )

In [None]:
df0_b

In [None]:
df0_b.info()

In [None]:
df0_b.index

In [None]:
df0_b.loc[ 'a' ]

In [None]:
df0_b.loc[ 'c' ]

In [None]:
df0_b.shape

The `.index` attribute does not have to come from the zeroth column. It can be any column!

In [None]:
df0_c = pd.read_excel( 'Excel_Example_Data.xlsx', index_col=4 )

In [None]:
df0_c

In [None]:
df0_c.info()

In [None]:
df0_c.loc[ 'aa' ]

By default, NONE of the columns are treated as the `.index`.

In [None]:
df0_d = pd.read_excel( 'Excel_Example_Data.xlsx', index_col = None )

In [None]:
df0_d

In [None]:
df0_d.info()

In [None]:
df0_d.index
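The effect of `index_col` can be tried without the Excel file. The sketch below is an assumption-laden stand-in: it reads hypothetical CSV text from memory with `pd.read_csv()`, which accepts the same `index_col` argument, and compares the default behavior with `index_col=0`.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "id,x,label\na,1,aa\nb,2,bb\nc,3,cc\n"

# Default: no column becomes the index, so a RangeIndex is created
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 moves the zeroth column into the .index attribute
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.shape)      # the id column is still a regular column
print(df_indexed.shape)      # one fewer column because id is now the index
print(df_indexed.loc['a'])   # rows can now be selected by the id labels
```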

## Headers or Column names

By default, the `pd.read_*` family of functions ASSUMES the TOP row is the HEADER row!!!

The top row therefore does NOT contain VALUES!!! Instead, it is assumed the TOP ROW or HEADER ROW contains the COLUMN NAMES!

In [None]:
df0.columns

In [None]:
df0.dtypes

In [None]:
df0

But, let's see what happens if we work with a data set or a SHEET within an Excel workbook that does NOT use a header row!

When you know there is NO HEADER...then the `header` argument must be set to `None`.

In [None]:
df0_no_names = pd.read_excel( 'Excel_Example_Data.xlsx', sheet_name='no_headers', header=None )

In [None]:
df0_no_names

In [None]:
df0_no_names.info()

In [None]:
df0_no_names.columns

If there is no header row, we can use the `names` argument to NAME the columns!

In [None]:
df0_no_names_2 = pd.read_excel( 'Excel_Example_Data.xlsx', 
                                sheet_name='no_headers',
                                header=None,
                                names=df0.columns)

In [None]:
df0_no_names_2

In [None]:
df0_no_names_2.info()

But...be VERY careful...if there is a HEADER row and you specify that there isn't by mistake!!!!

In [None]:
df0_mistake = pd.read_excel( 'Excel_Example_Data.xlsx', header=None )

In [None]:
df0_mistake.shape

In [None]:
df0_mistake.info()

In [None]:
df0_mistake

In [None]:
df0_mistake.dtypes

In [None]:
df0.dtypes

In [None]:
df0_mistake.columns

In [None]:
df0.columns
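The same mistake can be reproduced on a tiny example. The sketch below uses hypothetical CSV text read from memory with `pd.read_csv()` (which takes the same `header` argument) to show the two symptoms: one extra row, and columns that are no longer numeric because the column names were read as values.

```python
import io
import pandas as pd

# Hypothetical CSV text WITH a header row
csv_text = "x,y\n1,2.5\n3,4.5\n"

# Correct: the top row is used for the column names
df_ok = pd.read_csv(io.StringIO(csv_text))

# Mistake: header=None treats the column names as the first row of VALUES
df_mistake = pd.read_csv(io.StringIO(csv_text), header=None)

print(df_ok.shape, df_mistake.shape)   # the mistake has one extra row
print(df_ok.dtypes)                    # numeric columns
print(df_mistake.dtypes)               # no longer numeric
print(df_mistake.columns.tolist())     # default integer column names
```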

## Read other sheets

Just specify the `sheet_name` argument appropriately.

In [None]:
pd.read_excel( 'Excel_Example_Data.xlsx', sheet_name='ExB' )

In [None]:
pd.read_excel( 'Excel_Example_Data.xlsx', sheet_name='ExC' )

## Read CSV

CSV files have the extension `.csv`. Unlike Excel workbooks, they contain a single spreadsheet rather than multiple spreadsheets.

In [None]:
pd.read_csv( 'Example_A.csv' )

`pd.read_csv()` has nearly all the same arguments as `pd.read_excel()`!!!

In [None]:
pd.read_csv( 'Example_A.csv', index_col=0)

But there are a few other important arguments that you may need.

We can specify the MAX NUMBER of rows to read.

In [None]:
pd.read_csv( 'Example_A.csv', nrows=3 )

We can also skip rows.

In [None]:
pd.read_csv( 'Example_A.csv', skiprows=2 )

When you skip rows...be very careful with the HEADER !!!

In [None]:
pd.read_csv( 'Example_A.csv', skiprows=2, header=None )
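To see exactly what happens to the header, the sketch below reads hypothetical CSV text from memory. It contrasts `skiprows=2`, which drops the first two LINES of the file including the header line, with a list-like `skiprows`, which names the 0-indexed lines to skip and so can leave the header in place.

```python
import io
import pandas as pd

# Hypothetical CSV text: one header line followed by three data lines
csv_text = "A,B\n1,2\n3,4\n5,6\n"

# skiprows=2 drops the header line and the first data line,
# so the line "3,4" is promoted to the header
df_skip = pd.read_csv(io.StringIO(csv_text), skiprows=2)

# A list-like skiprows gives the 0-indexed line numbers to skip:
# line 0 (the header) is kept, lines 1 and 2 are dropped
df_keep_header = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2])

print(df_skip.columns.tolist())
print(df_keep_header.columns.tolist())
```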

Read in the 3 CSV files and assign them to different objects.

In [None]:
dfA = pd.read_csv( 'Example_A.csv' )

In [None]:
dfB = pd.read_csv( 'Example_B.csv' )

In [None]:
dfC = pd.read_csv( 'Example_C.csv' )

In [None]:
dfA

In [None]:
dfB

In [None]:
dfC

## Read or download from a website

The data might be located at a web address or URL.

In [None]:
gap_url = 'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'

In [None]:
type( gap_url )

To read in the data we provide the URL web address as a string instead of a file name on our computer!

Because we are reading a TAB separated file rather than a CSV...we need to change the `sep` argument.

In [None]:
gap_df = pd.read_csv( gap_url, sep = '\t' )

In [None]:
gap_df.shape

In [None]:
gap_df.info()
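The role of `sep` can be checked offline. The sketch below reads hypothetical tab-separated text from memory, once with the default comma separator and once with `sep='\t'`. The values are made up for illustration and are not taken from the gapminder data.

```python
import io
import pandas as pd

# Hypothetical TAB separated text (made-up values)
tsv_text = "group\tyear\tvalue\nx\t2000\t1.5\ny\t2001\t2.5\n"

# With the default sep=',' each whole line lands in a single column
df_wrong = pd.read_csv(io.StringIO(tsv_text))

# sep='\t' splits each line on the tab character
df_right = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(df_wrong.shape)
print(df_right.shape)
```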

## Read or load data from Modules

We will use data from Modules throughout the ADDM program.

We just need to make sure the module is imported!

In [None]:
import seaborn as sns

In [None]:
titanic = sns.load_dataset( 'titanic' )

In [None]:
titanic.info()

# Combine DataFrames - Concatenation

## Import Modules

In [None]:
import numpy as np
import pandas as pd

## Read data

Read in the Example A CSV file discussed in the previous recording.

In [None]:
dfA0 = pd.read_csv( 'Example_A.csv' )

In [None]:
dfA0

Add a column with a constant value of 0.

In [None]:
dfA0['attempt'] = 0

In [None]:
dfA0

Read in the same CSV file again!

In [None]:
dfA1 = pd.read_csv( 'Example_A.csv' )

In [None]:
dfA1

Add a constant but this time equal to 1.

In [None]:
dfA1['attempt'] = 1

In [None]:
dfA1

## Vertically Concatenate

Vertically combining means we STACK the objects on top of each other.

In [None]:
pd.concat( [dfA0, dfA1] )

This works because BOTH DataFrames have the SAME column names!

In [None]:
dfA0.columns == dfA1.columns

Look closely at the `.index` attribute of the COMBINED VERTICALLY STACKED DataFrames!

In [None]:
pd.concat( [dfA0, dfA1] ).loc[ 10 ]

By default, the `.index` attribute is allowed to repeat. The `.index` does NOT uniquely define a row in the new stacked DataFrame!

Ignoring the index allows each stacked row to be unique!

In [None]:
pd.concat( [dfA0, dfA1], ignore_index=True)
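The repeated index can be confirmed directly. The sketch below stacks two small hypothetical DataFrames (`top` and `bottom`, stand-ins for `dfA0` and `dfA1`) and uses the `.index.is_unique` attribute to compare the result with and without `ignore_index=True`.

```python
import pandas as pd

# Hypothetical stand-ins for dfA0 and dfA1
top = pd.DataFrame({'A': [1, 2], 'attempt': 0})
bottom = pd.DataFrame({'A': [1, 2], 'attempt': 1})

stacked = pd.concat([top, bottom])
stacked_reset = pd.concat([top, bottom], ignore_index=True)

print(stacked.index.is_unique)         # index labels 0 and 1 each appear twice
print(stacked.loc[0].shape)            # .loc[0] returns TWO rows
print(stacked_reset.index.is_unique)   # labels run from 0 to 3
print(stacked_reset.loc[0].shape)      # .loc[0] returns a single row as a Series
```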

I also like to force the DEEP COPY just in case.

In [None]:
pd.concat([dfA0, dfA1], ignore_index=True, copy=True)

We can assign the result to an object.

In [None]:
dfA_double = pd.concat( [dfA0, dfA1], ignore_index=True, copy=True)

In [None]:
dfA_double.shape

In [None]:
dfA0.shape

In [None]:
dfA1.shape

## Horizontal Concatenation

BINDING columns together!

The default `axis` argument is ZERO meaning the DATAFRAMES are VERTICALLY combined!

In [None]:
pd.concat([dfA0, dfA1], axis=0)

If we change `axis` to `axis=1` then the two DataFrames will be combined HORIZONTALLY!!!!!

In [None]:
pd.concat( [dfA0, dfA1], axis=1 )

In [None]:
pd.concat( [dfA0, dfA1], axis=1 ).columns

The column names are NO LONGER UNIQUE!!!!

In [None]:
pd.concat( [dfA0, dfA1], axis=1).loc[ :, ['A', 'B'] ]

I think this is VERY BAD. I really dislike that Pandas allows combining DataFrames horizontally even if they have the SAME COLUMN NAMES!!!!!

Be careful when you horizontally combine!!!!!
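One way to catch the problem early, sketched below under the assumption of two small hypothetical DataFrames named `left` and `right`, is to check the `.columns.is_unique` attribute of the combined result and list the repeated names with `.columns.duplicated()`.

```python
import pandas as pd

# Hypothetical DataFrames that share the SAME column names
left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
right = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

wide = pd.concat([left, right], axis=1)

# False means at least one column name is repeated
print(wide.columns.is_unique)

# Names that appear more than once
print(wide.columns[wide.columns.duplicated()].tolist())

# Selecting 'A' returns BOTH columns named 'A' as a DataFrame
print(wide['A'].shape)
```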

So why would we ever horizontally combine?

In [None]:
dfA_left = dfA0.loc[ :, dfA0.columns[:3] ].copy()

In [None]:
dfA_left

In [None]:
dfA_right = dfA0.loc[ :, dfA0.columns[-2:]].copy()

In [None]:
dfA_right

In [None]:
dfA_left.shape

In [None]:
dfA_right.shape

The point of horizontally combining is to bring together DIFFERENT columns that have the SAME number of rows!

In [None]:
pd.concat( [dfA_left, dfA_right], axis=1)
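One caveat worth keeping in mind: with `axis=1`, `pd.concat()` lines rows up by their `.index` LABELS, not by their position. The sketch below uses two hypothetical DataFrames with the same number of rows but different index labels to show the missing values that result, and one possible fix with `.reset_index(drop=True)`.

```python
import pandas as pd

# Same number of rows, but DIFFERENT index labels (hypothetical data)
left = pd.DataFrame({'A': [1, 2, 3]})                        # index 0, 1, 2
right = pd.DataFrame({'B': [10, 20, 30]}, index=[1, 2, 3])   # index 1, 2, 3

# Rows are matched on index labels, so unmatched labels are filled with NaN
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first lines the rows up by position
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(misaligned.shape)
print(aligned.shape)
```

`dfA_left` and `dfA_right` above were cut from the same DataFrame, so their index labels already match and the rows line up as expected.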

BUT...be careful...if you ignore the index with horizontal concatenation...you will REMOVE the column names!!!

In [None]:
pd.concat([dfA_left, dfA_right], axis=1, ignore_index=True)

# Begin Exploring Data by Summarizing Pandas Series

We will explore data before training predictive models. This process is known as Exploratory Data Analysis (EDA).

An important aspect of EDA is knowing how to calculate SUMMARY STATISTICS. This notebook demonstrates how to summarize Pandas Series.

## Import Modules

In [None]:
import numpy as np
import pandas as pd

## Review NumPy summary methods

Let's create a list of integers and then convert that list to a 1D NumPy array.

In [None]:
my_list = [10, 20, 30, 40, 50, 60, 70, 80]

Convert to an array.

In [None]:
my_array = np.array( my_list )

In [None]:
my_array.mean()

In [None]:
my_array.var(ddof=1)

In [None]:
my_array.std(ddof=1)

In [None]:
my_array.min()

In [None]:
my_array.max()

## Pandas Series - summary methods

Convert the list into a Pandas Series.

In [None]:
my_series = pd.Series( my_list )

Most of the Pandas Series summary methods work very similarly to their NumPy counterparts!

In [None]:
my_series.mean()

In [None]:
my_series.min()

In [None]:
my_series.max()

BUT...LOOK CLOSELY...at the VARIANCE!!!!

In [None]:
my_series.var()

In [None]:
my_array.var()

In [None]:
my_array.var(ddof=1)

Look closely at the standard deviation!!!

In [None]:
my_series.std()

In [None]:
my_array.std()

In [None]:
my_array.std(ddof=1)

Pandas sets `ddof=1` by DEFAULT when the variance or standard deviation is calculated, while NumPy defaults to `ddof=0`!!!

With `ddof=1`, Pandas calculates the UNBIASED estimate of the variance, and the standard deviation is its square root!!!
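The difference comes down to the divisor. The sketch below works through the arithmetic by hand for the same eight values: the sum of squared deviations from the mean is divided by `n` when `ddof=0` and by `n - 1` when `ddof=1`. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

values = [10, 20, 30, 40, 50, 60, 70, 80]
arr = np.array(values)
ser = pd.Series(values)

n = arr.size
# Sum of squared deviations from the mean (the mean is 45)
sum_sq_dev = ((arr - arr.mean()) ** 2).sum()   # 4200.0

var_ddof0 = sum_sq_dev / n         # 525.0, the NumPy default
var_ddof1 = sum_sq_dev / (n - 1)   # 600.0, the Pandas default

print(var_ddof0, arr.var())
print(var_ddof1, ser.var())
```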

### Unique values

We can get the number of unique values for a Pandas Series!

In [None]:
my_series.nunique()

In [None]:
my_series

In [None]:
my_series.size

Knowing the number of unique values is especially important for CATEGORICAL or STRING variables!

In [None]:
my_series_b = pd.Series( ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D', 'D'])

In [None]:
my_series_b

In [None]:
my_series_b.size

In [None]:
my_series_b.shape

In [None]:
my_series_b.nunique()

The number of unique values does NOT need to equal the number of elements or SIZE!!!!

My favorite Pandas method focuses on dealing with unique values!!!

Oftentimes we want to COUNT the number of times a unique value occurs!

In [None]:
my_series_b.value_counts()

The COUNTS give us more information than just the unique values.

In [None]:
my_series_b.value_counts().index

If you just want the unique values, then you can use the `.unique()` method.

In [None]:
my_series_b.unique()
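The three methods fit together, as the sketch below checks on the same letters (stored in a hypothetical Series named `letters`): `.value_counts()` has one entry per unique value, and with no missing values its counts add up to the number of elements.

```python
import pandas as pd

letters = pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D', 'D'])
counts = letters.value_counts()

# One entry per unique value
print(len(counts) == letters.nunique())

# The counts add up to the number of elements (there are no missing values here)
print(counts.sum() == letters.size)

# The index of the counts holds the same values that .unique() returns
print(sorted(counts.index) == sorted(letters.unique()))
```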

## Summarize individual columns within DataFrames

This is to reinforce the fact that COLUMNS are really Pandas Series within a DataFrame.

Let's read in the JOINED DATA set we created previously.

In [None]:
df = pd.read_csv('joined_data.csv')

In [None]:
df

Access any COLUMN and APPLY summary methods just like it was a "regular" Pandas Series in the environment.

In [None]:
df.columns

In [None]:
df.dtypes

In [None]:
df['A']

In [None]:
df['A'].nunique()

In [None]:
df.A.nunique()

In [None]:
df.D.nunique()

In [None]:
df.E.nunique()

In [None]:
df.F.nunique()

In [None]:
df.G.nunique()

We can apply summary methods like `.mean()` and `.std()` to any numeric column!

In [None]:
df.dtypes

In [None]:
df.F.mean()

In [None]:
df['F'].mean()

In [None]:
df.F.std()

In [None]:
df['F'].std()

We can also calculate the STANDARD ERROR OF THE MEAN (SEM)!!!

In [None]:
df.F.std() / np.sqrt( df.F.size )

In [None]:
df.F.sem()
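The manual formula and `.sem()` agree when a column has no missing values. The sketch below uses a small hypothetical Series with one missing value to show where they can differ: `.size` counts the missing element, while `.count()` and `.sem()` only use the non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with ONE missing value
s = pd.Series([1.0, 2.0, 3.0, np.nan])

# .size includes the missing element, so this divides by sqrt(4)
sem_size = s.std() / np.sqrt(s.size)

# .count() only counts non-missing values, so this divides by sqrt(3)
sem_count = s.std() / np.sqrt(s.count())

print(sem_size)
print(sem_count)
print(s.sem())   # matches the .count() version
```

If a column may contain missing values, dividing by `np.sqrt()` of `.count()` instead of `.size` keeps the manual calculation in line with `.sem()`.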