# Introduction to Pandas - Series

Pandas is built on top of NumPy. So whenever we work with Pandas it is good practice to import NumPy as well.

## Import Modules

In [None]:
import numpy as np
import pandas as pd

## Pandas Series

A Pandas Series is similar to, yet different from, Python lists and 1D NumPy arrays. It also borrows the KEY/VALUE pairing that defines each ITEM in a DICTIONARY!

Let's create a Pandas Series FROM a Python list.

In [None]:
grocery_list = ['milk', 'bananas', 'apples', 'lunch meat', 'soup', 'oreos']

In [None]:
%whos

In [None]:
len( grocery_list )

In [None]:
type( grocery_list )

We know how to SLICE or INDEX the list.

In [None]:
grocery_list[ 0 ]

In [None]:
grocery_list[ -1 ]

In [None]:
grocery_list[ :2 ]

In [None]:
grocery_list[ 1:4 ]

Let's convert the `grocery_list` LIST into a Pandas Series.

In [None]:
grocery_series = pd.Series( grocery_list )

In [None]:
%whos

In [None]:
type( grocery_series )

Let's focus on the `.index` attribute of the Pandas Series object.

In [None]:
grocery_series.index

In [None]:
grocery_list[0]

In [None]:
grocery_list[5]

In [None]:
grocery_list[-1]

In [None]:
grocery_series[0]

In [None]:
grocery_series[5]

In [None]:
grocery_series[-1]
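Here is a small sketch of why the last cell above does not behave like the list. With the default integer index, `grocery_series[-1]` is treated as a lookup of the LABEL `-1`, and no such label exists, so Pandas raises a `KeyError`. The `.iloc[]` attribute used below is covered later in these notes, and the helper variable names are only for illustration.

```python
import pandas as pd

grocery_list = ['milk', 'bananas', 'apples', 'lunch meat', 'soup', 'oreos']
grocery_series = pd.Series(grocery_list)

# The list understands negative POSITIONS
last_from_list = grocery_list[-1]

# The Series looks for the LABEL -1 in its index, which does not exist
try:
    grocery_series[-1]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Position-based access on the Series goes through .iloc[]
last_from_series = grocery_series.iloc[-1]
```
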

But, the `.index` attribute does NOT need to be a range of integers!

In [None]:
grocery_series.index

In [None]:
grocery_series.index = ['zero', 'one', 'two', 'three', 'four', 'five']

In [None]:
grocery_series.index

In [None]:
grocery_series

In [None]:
grocery_list

The Pandas Series has both INDEX and VALUES!

Kind of like...a DICTIONARY with KEY/VALUE defining each ITEM!

In [None]:
grocery_series.values

In [None]:
grocery_series.index

In [None]:
grocery_series[ 'zero' ]

In [None]:
grocery_series[ 'one' ]

In [None]:
grocery_series[ 0 ]

In [None]:
grocery_series[ 1 ]
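The dictionary comparison can be pushed a little further. The sketch below is illustrative and all helper names are made up: it shows that membership tests look at the index labels (like dictionary keys) and that a Series can round trip through a dictionary. It also uses `.loc[]` and `.iloc[]`, covered later, to say explicitly whether a label or a position is meant. As an assumption about newer Pandas versions, integer keys such as `grocery_series[ 0 ]` on a string index rely on a positional fallback that recent releases warn about, so the explicit forms are probably the safer habit.

```python
import pandas as pd

grocery_series = pd.Series(
    ['milk', 'bananas', 'apples', 'lunch meat', 'soup', 'oreos'],
    index=['zero', 'one', 'two', 'three', 'four', 'five'])

# Membership tests check the index labels, just like dictionary keys
has_label = 'zero' in grocery_series
has_value = 'milk' in grocery_series

# Explicit label-based and position-based access
by_label = grocery_series.loc['one']
by_position = grocery_series.iloc[1]

# Round trip: Series -> dictionary -> Series
as_dict = grocery_series.to_dict()
rebuilt = pd.Series(as_dict)
```
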

Let's continue to practice by defining a new Series. This Series will define the `.index` attribute when the object is created.

In [None]:
more_groceries = pd.Series( ['apple juice', 'poptarts', 'butter', 'yogurt'],
                            index = ['item 1', 'item 2', 'item 3', 'top gun'])

In [None]:
%whos

In [None]:
more_groceries

In [None]:
more_groceries.values

In [None]:
more_groceries.index

In [None]:
more_groceries[ 'item 1' ]

In [None]:
more_groceries[ 'top gun' ]

In [None]:
more_groceries[3]

IMPORTANT...please be careful...if you define the `.index` attribute manually upon creation of the Series...the number of elements or entries for `index` MUST be the same as the number of entries for the values!

In [None]:
pd.Series(['1', '2', '3', '4'],
          index=['a', 'b', 'c'])

In [None]:
pd.Series(['1', '2', '3', '4'],
          index=['a', 'b', 'c', 'd'])
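The two cells above can be combined into one sketch that catches the error instead of stopping the notebook. The flag name `mismatch_rejected` is only for illustration.

```python
import pandas as pd

# Four values but only three index labels: Pandas refuses to build the Series
try:
    pd.Series(['1', '2', '3', '4'], index=['a', 'b', 'c'])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

# Four values and four index labels: works as expected
matched = pd.Series(['1', '2', '3', '4'], index=['a', 'b', 'c', 'd'])
```
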

I think the `.index` attribute is confusing...because the `.index` attribute does NOT need to be unique!!!!!!

In [None]:
another_series = pd.Series( ['a', 'b', 'c'],
                            index=['1', '2', '2'])

In [None]:
another_series.values

In [None]:
another_series.index

Slicing the Series by giving the index...will return MULTIPLE values if the index is NOT unique!

In [None]:
another_series[ '1' ]

In [None]:
another_series[ '2' ]

In [None]:
another_series[ '2' ].values

In [None]:
another_series[ '2' ].index

In [None]:
another_series[ '2' ]
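If you are unsure whether an index has repeated labels, you can ask Pandas directly. This is a minimal sketch using the `.is_unique` attribute and the `.duplicated()` method of the index; the variable names are illustrative.

```python
import pandas as pd

another_series = pd.Series(['a', 'b', 'c'], index=['1', '2', '2'])

# Does every label appear only once?
index_is_unique = another_series.index.is_unique

# Flags each label that repeats an EARLIER label
repeated_labels = another_series.index.duplicated()

# A unique label returns a single value, a repeated label returns a Series
single = another_series.loc['1']
multiple = another_series.loc['2']
```
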

## Combining Series

We can APPEND or EXTEND or COMBINE multiple Series together using the `pd.concat()` function.

In [None]:
pd.concat( [grocery_series, more_groceries, another_series] )

When combining separate Series together, sometimes we want a BRAND NEW `.index` attribute. To do so, we can use the `ignore_index` argument within `pd.concat()`.

In [None]:
pd.concat( [grocery_series, more_groceries, another_series], ignore_index=True )

In [None]:
a_bigger_series = pd.concat( [grocery_series, more_groceries, another_series], ignore_index=True ).copy()

In [None]:
a_bigger_series

In [None]:
a_bigger_series.index

In [None]:
pd.concat( [ grocery_series, more_groceries, another_series, a_bigger_series ] )

In [None]:
pd.concat( [grocery_series, more_groceries, another_series, a_bigger_series ] ).index

In [None]:
pd.concat( [grocery_series, more_groceries, another_series, a_bigger_series ], ignore_index=True)
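A smaller example makes the effect of `ignore_index` easier to see. The two short Series below are made up for illustration and deliberately share the label `'x'`.

```python
import pandas as pd

first = pd.Series(['a', 'b'], index=['x', 'y'])
second = pd.Series(['c', 'd'], index=['x', 'z'])

# Default: the original labels are kept, so 'x' now appears twice
kept = pd.concat([first, second])

# ignore_index=True: the labels are thrown away and replaced by 0, 1, 2, ...
fresh = pd.concat([first, second], ignore_index=True)
```
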

## Summary

The Pandas Series looks kind of like a mix between Lists, 1D NumPy arrays, and Dictionaries.

Values are stored and associated with an `.index` attribute. The Series can be sliced using the `.index` Location.

# Introduction to Pandas DataFrame

The previous video was all about Pandas Series. A Series contains values associated with a single variable.

The DataFrame is a COLLECTION of variables! Or, a collection of Pandas Series!

## Import Modules

In [None]:
import numpy as np
import pandas as pd

## NumPy to DataFrames

We previously learned that Pandas Series are kind of like 1D NumPy arrays.

Pandas DataFrame is kind of like a 2D NumPy array.

So let's create a DataFrame from a 2D NumPy array.

In [None]:
X = np.arange(1, 25).reshape( 6, -1 )

In [None]:
X.shape

In [None]:
X.ndim

In [None]:
X.size

In [None]:
X[ 0 ]

In [None]:
X[ 1 ]

In [None]:
X[ -1 ]

In [None]:
X

In [None]:
X[ :, 0 ]

In [None]:
X[ :, -1 ]

In [None]:
X[ 0, -1 ]

In [None]:
X[ -1, 0 ]

In [None]:
X[ :3, :2 ]

Let's convert `X` into a DataFrame!!

In [None]:
Xdf = pd.DataFrame( X )

In [None]:
%whos

In [None]:
type( Xdf )

In [None]:
print( Xdf )

In [None]:
print( X )

In [None]:
Xdf

Provide a single index to slice the NumPy 2D array!

In [None]:
X[ 0 ]

Look what happens when we use a single index to slice the DataFrame!

In [None]:
Xdf[ 0 ]

In [None]:
X[ :, 0 ]

In [None]:
Xdf[ :, 0 ]

In [None]:
Xdf[ 0, : ]

There is clearly something different about the Pandas DataFrame compared to the NumPy 2D array...even though both are TABLE-LIKE.

Both have 2 dimensions.

Both have rows and columns.

But we CANNOT interact with a Pandas DataFrame using syntax just like the NumPy 2D array!!!

Remember...the Pandas Series is like a list, dictionary, and 1D NumPy array!!!!
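For reference, NumPy-style position slicing IS available on a DataFrame, just not through plain `[]`. The sketch below uses the `.iloc[]` attribute, which is introduced properly in the index deep dive later in these notes. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

X = np.arange(1, 25).reshape(6, -1)
Xdf = pd.DataFrame(X)

# Plain [] on the DataFrame selects the COLUMN named 0
column_zero = Xdf[0]

# .iloc[] accepts row and column positions, like the NumPy array
first_col = Xdf.iloc[:, 0]
first_row = Xdf.iloc[0, :]
block = Xdf.iloc[:3, :2]
```
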

## Dictionary to DataFrame

Let's see the connection between a Dictionary and a DataFrame by converting a Dictionary into a DataFrame!

Remember that the Dictionary has KEY/VALUE pairs to define each ITEM!

In most of our previous examples, the KEY was associated with a SINGLE valued VALUE. But now, we will use a MULTI-VALUED or MULTI-ENTRY VALUE per KEY!

In [None]:
baseball_dict = {'City': ['Pittsburgh', 'Cincinnati', 'Chicago', 'St. Louis', 'Milwaukee'],
                 'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
                 'Division': 5 * ['Central'],
                 'League': 5 * ['NL']}

In [None]:
baseball_dict

In [None]:
baseball_dict['City']

In [None]:
baseball_dict['Team']

Convert the dictionary into a DataFrame!!!

In [None]:
baseball_df = pd.DataFrame( baseball_dict )

In [None]:
baseball_df

The Dictionary KEYS become COLUMN NAMES!!!!!

In [None]:
baseball_df.columns

In [None]:
baseball_dict.keys()

The dictionary VALUES become the entries within the ROWS for each COLUMN!

In [None]:
baseball_df.index

## DataFrame attributes

In [None]:
Xdf.index

In [None]:
baseball_df.index

In [None]:
Xdf.columns

In [None]:
baseball_df.columns

In [None]:
Xdf.shape

In [None]:
baseball_df.shape

In [None]:
type( Xdf )

In [None]:
type( baseball_df )

But...we are REALLY interested in the DATA TYPE associated with each COLUMN contained within the DataFrame!!!

In [None]:
Xdf.dtypes

In [None]:
baseball_df.dtypes

In [None]:
baseball_df.dtypes

In [None]:
type( baseball_df.dtypes )

In [None]:
baseball_df.dtypes.index

In [None]:
baseball_df.columns

## DataFrame methods

We will only show a few in this video. Next week is really dedicated to DataFrame methods!!!!!!

In [None]:
Xdf.info()

In [None]:
baseball_df.info()

In [None]:
Xdf.describe()

In [None]:
X.mean(axis=0)

In [None]:
X.std(axis=0, ddof=1)
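The two NumPy cells above reproduce the `mean` and `std` rows of `.describe()`. A short check, with illustrative variable names, confirms that the `std` row is the SAMPLE standard deviation (`ddof=1`) rather than NumPy's default (`ddof=0`).

```python
import numpy as np
import pandas as pd

X = np.arange(1, 25).reshape(6, -1)
Xdf = pd.DataFrame(X)

summary = Xdf.describe()

# Compare the describe() rows against the NumPy column-wise calculations
means_match = np.allclose(summary.loc['mean'].to_numpy(), X.mean(axis=0))
stds_match = np.allclose(summary.loc['std'].to_numpy(), X.std(axis=0, ddof=1))

# NumPy's default ddof=0 does NOT match
stds_match_default = np.allclose(summary.loc['std'].to_numpy(), X.std(axis=0))
```
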

In [None]:
baseball_df.describe()

In [None]:
baseball_df

There are many more methods but we will not show most in this video.

Instead, let's consider how to SORT or ORDER or ARRANGE the DataFrame.

In [None]:
baseball_df

In [None]:
baseball_df.sort_values(['Team'])

In [None]:
baseball_df

In [None]:
baseball_df

Most Pandas methods do NOT modify in place!!!!!

In [None]:
baseball_df.sort_values(['Team'])

Sometimes, we do not want to retain the original `.index` attribute positions!

We want to IGNORE the `.index` when we SORT!

In [None]:
baseball_df.sort_values( ['Team'], ignore_index=True )

We could also sort in DESCENDING order.

In [None]:
baseball_df.sort_values( ['Team'], ignore_index=True, ascending=False)

The result is NOT stored. So to store or KEEP the sorting...we need to assign the result to a NEW object.

When I assign a result to a new object, I like to force the DEEP COPY!

In [None]:
baseball_df_b = baseball_df.sort_values( ['Team'], ignore_index=True, ascending=False).copy()

In [None]:
baseball_df_b

Alternatively, we CAN force Pandas to MODIFY in place using the `inplace` argument!

In [None]:
baseball_df.sort_values( ['Team'], inplace=True )

In [None]:
baseball_df

In [None]:
baseball_df_b
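The difference between the two approaches can be seen on a tiny made-up DataFrame. Note that with `inplace=True` the method returns `None`, so there is nothing useful to assign.

```python
import pandas as pd

teams_df = pd.DataFrame({'Team': ['Pirates', 'Reds', 'Cubs']})

# Without inplace: a sorted COPY is returned, the original is untouched
sorted_copy = teams_df.sort_values('Team', ignore_index=True)
order_after_plain_sort = teams_df['Team'].tolist()

# With inplace=True: the original is modified and the method returns None
returned = teams_df.sort_values('Team', inplace=True)
order_after_inplace_sort = teams_df['Team'].tolist()
```
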

## Summary

This video focused on CREATING DataFrames from NumPy and from Dictionaries.

This was to highlight the fact that DataFrames are 2D objects - they have 2 dimensions - ROWS AND COLUMNS!!!!!

This was to highlight that the KEY becomes the COLUMN NAME!!!!

We need to think of Pandas DataFrames as a combination of the NumPy 2D array and the Python Dictionary!!!!!

# Pandas DataFrame index deep dive

Let's build on the previous video, but focus heavily on the `.index` attribute of a DataFrame.

## Import Modules

In [None]:
import numpy as np
import pandas as pd

## Example DataFrame

We will use the same baseball example DataFrame from the previous recording.

We will create that DataFrame from a dictionary.

In [None]:
baseball_dict = {'City': ['Pittsburgh', 'Cincinnati', 'Chicago', 'St. Louis', 'Milwaukee'],
                 'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
                 'Division': 5 * ['Central'],
                 'League': 5 * ['NL']}

In [None]:
baseball_dict

In [None]:
baseball_df = pd.DataFrame( baseball_dict )

In [None]:
baseball_df

But...this video is ALL about the `.index` attribute!!!!

In [None]:
baseball_df.index

In [None]:
baseball_df.info()

But we can change the `.index` attribute!

I dislike that Pandas allows the `.index` attribute to be ANYTHING!!!

I feel the `.index` attribute should simply be a ROW counter.

But...Pandas lets the `.index` be a meaningful quantity! The `.index` attribute can therefore be a separate variable!

For the baseball example, let's change the `.index` to be the NUMBER OF GAMES BACK a team is from the division leader.

In [None]:
baseball_df.index = [31.5, 27.5, 22.5, 0, 7.5]

In [None]:
baseball_df.index

In [None]:
baseball_df.info()

In [None]:
baseball_df

How can we find a single row or SELECT a single row from the DataFrame?

In [None]:
baseball_df[ 0 ]

In [None]:
baseball_df[ 0.0 ]

Pandas is COLUMN or variable or field centric!!!!

In [None]:
baseball_df

When we type the `[]` to access SOMETHING in a Pandas DataFrame...we need to provide the COLUMN NAME!!!!

In [None]:
baseball_df[ 'City' ]

In [None]:
baseball_df[ 'Team' ]

The column name allows easy SLICING of the DataFrame!

In [None]:
baseball_df.columns

But how does this help us with the ROWS???

The `.index` attribute is associated with the ROWS!!!!!!!!

In [None]:
baseball_df.index

Pandas uses a SPECIAL ATTRIBUTE to let us manage or select ROWS!

There are two flavors.

The `.loc[]` attribute allows selecting rows based on the `.index` LOCATION or KEY!!!

The `.iloc[]` attribute allows selecting rows based on the `.index` integer POSITION!!!!

In [None]:
baseball_df

In [None]:
baseball_df.loc[ 0.0 ]

In [None]:
baseball_df.loc[ 31.5 ]

In [None]:
baseball_df.iloc[ 0 ]

In [None]:
baseball_df.iloc[ 3 ]

In [None]:
baseball_df.iloc[ -1 ]

The `.loc[]` attribute selects ROWS based on `.index` LOCATION KEY!!!

The `.iloc[]` attribute selects ROWS based on `.index` INTEGER POSITION!!!
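A compact side-by-side sketch of the two flavors, using the same games-back index on a reduced DataFrame with only the `Team` column. The names `games_df`, `leader`, and so on are illustrative.

```python
import pandas as pd

games_df = pd.DataFrame(
    {'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers']},
    index=[31.5, 27.5, 22.5, 0, 7.5])

# .loc[] uses the index LABEL: the team that is 0.0 games back
leader = games_df.loc[0.0, 'Team']

# .iloc[] uses the integer POSITION: the first and last rows as stored
first_stored = games_df.iloc[0, 0]
last_stored = games_df.iloc[-1, 0]

# A LIST of labels selects several rows at once
top_two = games_df.loc[[0.0, 7.5], 'Team'].tolist()
```
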

## Resetting Index

I do not like the `.index`. I prefer it to be a REGULAR column if it is storing MEANINGFUL VALUES!!!

In [None]:
baseball_df

Let's sort by the `.index`!

In [None]:
baseball_df.sort_index()

In [None]:
baseball_df

In [None]:
baseball_df.sort_index( inplace=True )

In [None]:
baseball_df

We can PULL OUT the `.index` attribute into a REGULAR column using the `.reset_index()` method!

In [None]:
baseball_df.reset_index()

In [None]:
baseball_df.reset_index().info()

If...you do NOT want to KEEP the values within the `.index` attribute when you reset...you can DROP them!

In [None]:
baseball_df.reset_index(drop=True)

But...if you want to KEEP the `.index` attribute values...the DEFAULT name of the new column `index` is VERY vague!

In [None]:
baseball_df.reset_index()

We can RENAME a column using the `.rename()` method!!!!

To rename the columns in Pandas DataFrames...we need to use a DICTIONARY within the `columns` argument of the `.rename()` method!!!!

The KEY is the original column name and the VALUE is the NEW or desired column name!

In [None]:
baseball_df.reset_index().rename(columns={'index': 'games_back'})

In [None]:
baseball_df

Sometimes we want to place each step in a PIPELINE or WORKFLOW of actions on a separate line.

This makes the code easier to read especially when there are MANY, MANY actions in the WORKFLOW!

In [None]:
baseball_df_b = (
    baseball_df
    .reset_index()
    .rename(columns={'index': 'games_back'})
    .copy()
)

In [None]:
baseball_df_b

In [None]:
baseball_df
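One possible alternative to renaming AFTER the reset is to name the index BEFORE the reset with the `.rename_axis()` method, so the new column never gets the vague `index` name. This is a sketch on a made-up two-row DataFrame, not a replacement for the approach above.

```python
import pandas as pd

games_df = pd.DataFrame({'Team': ['Cardinals', 'Brewers']},
                        index=[0.0, 7.5])

# Approach from above: reset first, then rename the 'index' column
two_step = games_df.reset_index().rename(columns={'index': 'games_back'})

# Alternative: give the index a name, then reset
named_first = games_df.rename_axis('games_back').reset_index()
```
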

## Index upon creation

The `.index` attribute can be defined when the DataFrame is created.

In [None]:
baseball_df_c = pd.DataFrame( data = baseball_dict,
                              index = [31.5, 27.5, 22.5, 0, 7.5],
                              columns = ['League', 'Division', 'City', 'Team'])

In [None]:
baseball_df_c

## Summary

The `.loc[]` and `.iloc[]` attributes allow us to select rows based on the `.index` attribute!!!!

We can also reset the index using the `.reset_index()` method.

We can change column names using the `.rename()` method.

# Columns and Rows in the Pandas DataFrame

We have seen how to create the DataFrame. We have worked with the `.index` attribute. We now need to practice selecting columns and rows within the DataFrame.

## Import Modules

In [None]:
import numpy as np
import pandas as pd

## Create example DataFrame

We will make the Dictionary about baseball and then convert that Dictionary to a DataFrame.

In [None]:
baseball_dict = {'City': ['Pittsburgh', 'Cincinnati', 'Chicago', 'St. Louis', 'Milwaukee'],
                 'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers'],
                 'Division': 5 * ['Central'],
                 'League': 5 * ['NL']}

In [None]:
baseball_dict

In [None]:
baseball_df = pd.DataFrame( baseball_dict,
                            columns=['League', 'Division', 'City', 'Team'])

In [None]:
baseball_df

In [None]:
baseball_df.info()

## Columns

### Selecting columns

In [None]:
baseball_df[ 'Team' ]

In [None]:
baseball_df[ 'Division' ]

In [None]:
type( baseball_df[ 'Division' ] )

In [None]:
baseball_df[ 'Division' ].index

In [None]:
baseball_df[ 'Team' ].index

There is another way to access or select a single Column and return it as a Series.

The previous way is the BRACKET notation.

But this other way is known as the DOT NOTATION!

In [None]:
baseball_df.Team

In [None]:
baseball_df.Division

In [None]:
baseball_df.City
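The DOT NOTATION has limits worth knowing. The sketch below uses a made-up DataFrame with two awkward column names: `count`, which clashes with an existing DataFrame method, and `games back`, which contains a space. In both cases the BRACKET notation still works.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Cubs', 'Reds'],
                   'count': [1, 2],
                   'games back': [22.5, 27.5]})

# Dot notation works for a simple column name
team_by_dot = df.Team

# df.count is the DataFrame METHOD, not the column named 'count'
count_is_method = callable(df.count)

# Bracket notation always returns the column
count_column = df['count']
games_back_column = df['games back']
```
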

Still another approach is to use the FORMAL `.loc[]` attribute for selecting the COLUMN.

In [None]:
baseball_df.loc[ :, 'Team' ]

In [None]:
baseball_df.loc[ :, 'Division' ]

In [None]:
baseball_df.loc[ :, 'City' ]

In [None]:
type( baseball_df.loc[ :, 'City'] )

In [None]:
baseball_df.iloc[ :, 0 ]

In [None]:
baseball_df.iloc[ :, -1 ]

In [None]:
baseball_df.loc[:, 'Team' ]

In [None]:
baseball_df.columns

However, be very careful...with how you ENTER or TYPE the column names!

In [None]:
baseball_df.loc[ :, ['Team'] ]

In [None]:
baseball_df

In [None]:
type( baseball_df.loc[ :, ['Team'] ] )

In [None]:
type( baseball_df.loc[ :, 'Team' ] )

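The returned data type is different because a single column NAME returns a Series...but a LIST of column names always returns a DataFrame, even when the LIST has just one entry! We MUST use a LIST to select MULTIPLE COLUMNS!!!!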

In [None]:
baseball_df

In [None]:
baseball_df.loc[ :, ['City', 'Team'] ]

In [None]:
baseball_df.loc[ :, ['Team', 'City'] ]

In [None]:
baseball_df.loc[ :, ['League', 'Team'] ]

In [None]:
baseball_df.loc[ :, baseball_df.columns[1:] ]

We need a LIST of names (or something list-like, such as the slice of `.columns` above) to SELECT multiple columns from the DataFrame!!!!
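The Series versus DataFrame distinction shows up clearly in the shape of the result. A minimal sketch on a reduced version of the baseball data, with illustrative variable names:

```python
import pandas as pd

baseball_df = pd.DataFrame(
    {'City': ['Pittsburgh', 'Cincinnati', 'Chicago', 'St. Louis', 'Milwaukee'],
     'Team': ['Pirates', 'Reds', 'Cubs', 'Cardinals', 'Brewers']})

# A single name returns a 1D Series
as_series = baseball_df.loc[:, 'Team']

# A one-entry list returns a 2D DataFrame with one column
as_frame = baseball_df.loc[:, ['Team']]
```
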

### Adding and Deleting Columns

Adding columns is similar to how we added new KEY/VALUE pairs to Dictionaries.

The VALUE will be assigned to a KEY. The KEY is the NEW column Name!!!!

Let's start with the VALUE that will be added.

In [None]:
baseball_df

In [None]:
baseball_df['games_back'] = [31.5, 27.5, 22.5, 0, 7.5]

In [None]:
baseball_df.info()

In [None]:
baseball_df

In [None]:
baseball_df.sort_values(['games_back'])

Force the sort to modify the DataFrame in place.

In [None]:
baseball_df.sort_values(['games_back'], inplace=True)

In [None]:
baseball_df

Let's add another column which contains the number of WINS per team.

This new column will be assigned a Pandas Series rather than a List.

In [None]:
baseball_df['wins'] = pd.Series([87, 79, 64, 59, 55],
                                index=baseball_df.index)

In [None]:
baseball_df

In [None]:
baseball_df.wins

In [None]:
baseball_df.Team

Let's add the number of losses per team.

In [None]:
baseball_df['losses'] = pd.Series([63, 70, 85, 90, 94],
                                  index=baseball_df.index)

In [None]:
baseball_df
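The `index=baseball_df.index` argument in the cells above matters because the DataFrame was sorted, so its `.index` is no longer 0, 1, 2, and so on. When a Series is assigned as a column, Pandas lines the values up by index LABEL, not by position. The sketch below shows the difference on a small made-up DataFrame with a shuffled index; the column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Cardinals', 'Brewers', 'Cubs']},
                  index=[2, 0, 1])

# Series with the DEFAULT index 0, 1, 2: values are matched by LABEL
df['by_label'] = pd.Series([10, 20, 30])

# Series that reuses the DataFrame index: values land in row order
df['by_position'] = pd.Series([10, 20, 30], index=df.index)

# A plain list has no index, so it is always assigned in row order
df['from_list'] = [10, 20, 30]
```
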

But what if we want to add a SCALAR or CONSTANT value to the DataFrame?

For example, I want a column named `season` to store the year the data came from.

At first, we might think we need to do the following to add the column:

In [None]:
pd.Series([2022, 2022, 2022, 2022, 2022], index=baseball_df.index)

In [None]:
pd.Series( 5 * [2022], index = baseball_df.index )

But we do NOT need to! Pandas will replicate the constant value down all rows of the new column!!!

In [None]:
baseball_df['season'] = 2022

In [None]:
baseball_df

In [None]:
baseball_df.season.index

In [None]:
baseball_df.wins.index

### Deleting or DROPPING columns

We can remove columns through the `.drop()` method.

In [None]:
baseball_df.drop(columns=['season'])

In [None]:
baseball_df.drop(columns=['wins', 'losses', 'season'])

In [None]:
baseball_df.drop(columns=['City', 'Team'])

In [None]:
baseball_df

To RETAIN or KEEP the resulting DataFrame that has FEWER columns, we would need to either assign the result to an object or set the `inplace` argument to True!!!
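A short sketch of keeping the result of `.drop()` by assignment, on a made-up three-row DataFrame. The name `trimmed_df` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Cardinals', 'Brewers', 'Cubs'],
                   'wins': [87, 79, 64],
                   'season': [2022, 2022, 2022]})

# .drop() returns a NEW DataFrame, so assign it to keep it
trimmed_df = df.drop(columns=['season'])

# The original still has all of its columns
original_columns = df.columns.tolist()
trimmed_columns = trimmed_df.columns.tolist()
```
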

## Selecting Rows

We know how to use the `.loc[]` and `.iloc[]` attributes to select rows based on the `.index` attribute.

In [None]:
baseball_df.loc[ 3 ]

In [None]:
baseball_df.iloc[ 3 ]

But the more interesting way to SELECT or FILTER rows is based on CONDITIONS!!!!

We want to CONDITIONALLY SUBSET the rows!!!

We must identify or SELECT the column to apply the condition!

In [None]:
baseball_df.wins > 65

In [None]:
baseball_df.loc[ baseball_df.wins > 65 ]

In [None]:
baseball_df.loc[ baseball_df['wins'] > 65 ]

In [None]:
baseball_df.loc[ baseball_df.wins > 65, : ]

We can select a subset of the columns by providing a LIST of column names!

In [None]:
baseball_df.loc[ baseball_df.wins > 65, ['Team', 'wins', 'losses'] ]

We can also FILTER by strings. For example, let's find the row where `Team == 'Brewers'`.

In [None]:
baseball_df.loc[ baseball_df.Team == 'Brewers', : ]

In [None]:
baseball_df.loc[ baseball_df.Team == 'Pirates', : ]

If you want to match or FILTER based on MULTIPLE conditions you need to use () to separate each condition, and combine the conditions with `|` (OR) or `&` (AND).

For example, let's find Brewers OR the Cardinals.

In [None]:
baseball_df.loc[ (baseball_df.Team == 'Brewers') | (baseball_df.Team == 'Cardinals'), : ]

We can also have an AND operation.

I want to find all rows where Team is equal to Cardinals AND wins is greater than 65.

In [None]:
baseball_df.loc[ (baseball_df.Team == 'Cardinals') & (baseball_df.wins > 65), : ]
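The filters above can be collected into one self-contained sketch using the team and win values from this notebook. The `.isin()` method shown here is an extra that the cells above do not use; it is one possible shorthand for several OR conditions on the same column. All variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Cardinals', 'Brewers', 'Cubs', 'Reds', 'Pirates'],
                   'wins': [87, 79, 64, 59, 55]})

# OR: each condition sits in its own parentheses, combined with |
either = df.loc[(df.Team == 'Brewers') | (df.Team == 'Cardinals'), :]

# Same rows using .isin() with a list of allowed values
same_rows = df.loc[df.Team.isin(['Brewers', 'Cardinals']), :]

# AND: both conditions must be True for a row to be kept
both = df.loc[(df.Team == 'Cardinals') & (df.wins > 65), :]
```
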

## Summary

We have seen how to select COLUMNS using BRACKET and DOT notation.

We have seen how to select MULTIPLE columns.

We have seen how to select ROWS based on `.index` attribute via `.loc[]` and `.iloc[]`.

We have seen how to select ROWS based on CONDITIONS (FILTERING rows).

We also saw how to ADD and REMOVE (DROP) COLUMNS.