<a href="https://colab.research.google.com/github/bundickm/CheatSheets/blob/master/Pandas_Cheat_Sheet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro
This Colab notebook runs through some of the basic functionality of Pandas. The [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/index.html) is linked from the relevant sections throughout the notebook. Many of the examples and descriptions come from Chapter 3 of the [Python Data Science Handbook](https://tanthiamhuat.files.wordpress.com/2018/04/pythondatasciencehandbook.pdf), which I recommend reading from cover to cover. The Pandas documentation also has a useful section on [Basic Functionality](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html).

# Importing and Aliasing

In [0]:
import numpy as np
import pandas as pd

# Pandas Series
A one-dimensional array of indexed data that can be created from a list, dictionary, or array.
[Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)

In [0]:
pd_series = pd.Series([1,2,3,4])
pd_series

0    1
1    2
2    3
3    4
dtype: int64
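The three construction routes mentioned above can be sketched side by side. This is only a minimal illustration; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# From a Python list: pandas supplies a default integer index (0, 1, 2, ...).
from_list = pd.Series([1, 2, 3, 4])

# From a NumPy array: the array supplies the values, the index is again the default.
from_array = pd.Series(np.array([1, 2, 3, 4], dtype='int64'))

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})

print(from_list.equals(from_array))  # compares values, index, and dtype
print(list(from_dict.index))
```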

As seen above, a Series has both values and an index. Both can be accessed with the attributes shown below ([Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.values.html#pandas.Series.values)). The values are returned as a NumPy array and the index is an array-like object of type pd.Index.

In [0]:
pd_series.values

array([1, 2, 3, 4])

In [0]:
pd_series.index

RangeIndex(start=0, stop=4, step=1)

Like a NumPy array, values can be accessed by their associated index, either one at a time or as a slice.

In [0]:
pd_series[1]

2

In [0]:
pd_series[1:4]

1    2
2    3
3    4
dtype: int64

In [0]:
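# With the default integer index, pd_series[-1] is a label lookup and raises KeyError; use iloc for positions.
pd_series.iloc[-1]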

4

A Pandas Series has an explicitly defined index, which means index values can be of any type and are not required to be contiguous.

In [0]:
pd_series = pd.Series([1,2,3,4], index=['a','d','c',1])
pd_series

a    1
d    2
c    3
1    4
dtype: int64

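Values can be retrieved with bracket notation using either the explicit index (the labels) or the implicit index (the integer positions). When an integer matches a label, as `1` does here, the label takes precedence. Note that recent Pandas releases deprecate the positional fallback for integer keys in plain brackets, so `pd_series[0]` may emit a warning or raise a `KeyError` depending on your version.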

In [0]:
pd_series['a']

1

In [0]:
pd_series[0]

1

In [0]:
pd_series[1]

4

To be explicit about which index you mean, use [loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc) for the explicit index (labels) and [iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html#pandas.DataFrame.iloc) for the implicit index (integer positions).

In [0]:
pd_series.loc[1]

4

In [0]:
pd_series.loc['d']

2

In [0]:
pd_series.iloc[3]

4

In [0]:
pd_series.iloc[1]

2
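The difference matters most when the labels are themselves integers. The sketch below uses a hypothetical Series whose labels do not match the positions, and also shows that label slices include the end label while position slices exclude the end position.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions.
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[1]      # label 1 is the first element
by_position = s.iloc[1]  # position 1 is the second element

label_slice = s.loc[1:3]      # label slice: end label 3 is included
position_slice = s.iloc[1:3]  # position slice: end position 3 is excluded

print(by_label, by_position)
print(list(label_slice.index), list(position_slice.index))
```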

Because we can explicitly define the index, a Pandas Series can resemble a dictionary and can even be created from one.

In [0]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
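# Current pandas keeps the dictionary's insertion order, so sort the labels explicitly.
pop_series = pd.Series(population_dict).sort_index()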
pop_series

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64
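The dictionary resemblance goes beyond construction. As a rough sketch (using a shortened, illustrative version of the population data), a Series supports membership tests, `keys()`, `items()`, and assignment to new labels:

```python
import pandas as pd

pop = pd.Series({'California': 38332521, 'Texas': 26448193})

has_texas = 'Texas' in pop   # membership test checks the index labels
labels = list(pop.keys())    # keys() returns the index
pairs = list(pop.items())    # (label, value) pairs

# Assigning to a label that does not exist yet extends the Series.
pop['Florida'] = 19552860

print(has_texas, labels, len(pop))
```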

Unlike a dictionary, though, a Pandas Series also supports array-style operations such as slicing.

In [0]:
pop_series['California':'Illinois']

California    38332521
Florida       19552860
Illinois      12882135
dtype: int64

A Pandas Series can be created from a scalar, which is repeated to fill the specified index.

In [0]:
pd.Series(10, index = [1,2,3,4])

1    10
2    10
3    10
4    10
dtype: int64

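A Pandas Series can also be built from a dictionary, in which case the keys become the index. Older Pandas releases sorted the keys by default; current releases keep the dictionary's insertion order, so call sort_index() if you want sorted labels.

In [0]:
pd.Series({3:'c',2:'b',1:'a'}).sort_index()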

1    a
2    b
3    c
dtype: object

You can also state the index explicitly, in which case only the specified keys are populated in the Series.

In [0]:
pd.Series({3:'c',2:'b',1:'a'}, index=[3,1])

3    c
1    a
dtype: object
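One consequence worth knowing: if the explicit index names a label that is not a key in the dictionary, that entry is filled with NaN instead of raising an error. A minimal sketch, where label `4` is deliberately absent from the data:

```python
import pandas as pd

data = {3: 'c', 2: 'b', 1: 'a'}

# Label 4 is not a key in the dictionary, so its value is filled with NaN.
s = pd.Series(data, index=[3, 1, 4])

# isna() marks the entries for which no value was found.
missing = s.isna()

print(s)
print(missing.tolist())
```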

Since a Pandas Series is built on NumPy arrays, you can also perform masking and fancy indexing.

In [0]:
pd_series = pd.Series([1,2,3,4,5])
pd_series

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [0]:
#masking
pd_series[pd_series > 3]

3    4
4    5
dtype: int64

In [0]:
pd_series[(pd_series > 1) & (pd_series < 4)]

1    2
2    3
dtype: int64

In [0]:
#fancy indexing
pd_series[[1,4,3]]

1    2
4    5
3    4
dtype: int64
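Masks can be combined in other ways as well. The sketch below, on the same five-element Series, uses `|` (or), `~` (not), and `isin`; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Combine conditions with | (or) and ~ (not). The parentheses are required
# because these operators bind more tightly than the comparisons.
outside = s[(s < 2) | (s > 4)]
not_three = s[~(s == 3)]

# isin builds a mask from a list of allowed values.
chosen = s[s.isin([2, 5])]

print(outside.tolist(), not_three.tolist(), chosen.tolist())
```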

# DataFrame
A DataFrame can be thought of as a two-dimensional array with flexible row indices and column names. Just as a two-dimensional array can be thought of as an ordered sequence of aligned one-dimensional columns, a DataFrame can be thought of as a sequence of aligned Series objects, where "aligned" means they share the same index. [Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)

In [0]:
area_dict = {'California': 423967, 
             'Texas': 695662, 
             'New York': 141297,
             'Florida': 170312, 
             'Illinois': 149995}
area_series = pd.Series(area_dict)

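# Current pandas keeps the dictionary's column order, so list 'area' first and sort the rows by state name.
states_df = pd.DataFrame({'area': area_series, 'population': pop_series}).sort_index()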
states_df

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193
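A dictionary of Series is only one way to build a DataFrame. As a hedged sketch with small made-up data, a DataFrame can also be constructed from a list of dictionaries, from a two-dimensional NumPy array, or from a single Series:

```python
import numpy as np
import pandas as pd

# From a list of dictionaries: each dictionary becomes one row.
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])

# From a two-dimensional NumPy array with explicit row and column labels.
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]),
                          index=['x', 'y'],
                          columns=['a', 'b'])

# From a single named Series: the result is a one-column DataFrame.
from_series = pd.DataFrame(pd.Series([1, 3], name='a'))

print(from_records.shape, from_array.shape, from_series.shape)
```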


The DataFrame has both an index and columns attribute that can be accessed.

In [0]:
states_df.index

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

In [0]:
states_df.columns

Index(['area', 'population'], dtype='object')

Values in the DataFrame can be accessed through [.values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html#pandas.DataFrame.values), or by index, columns, or a combination of the two.

In [0]:
states_df.loc['California']

area            423967
population    38332521
Name: California, dtype: int64

In [0]:
states_df['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [0]:
states_df.loc['California','area']

423967

In [0]:
states_df.values

array([[  423967, 38332521],
       [  170312, 19552860],
       [  149995, 12882135],
       [  141297, 19651127],
       [  695662, 26448193]])

We can use the dictionary-style syntax to modify a DataFrame object.

In [0]:
states_df['density'] = states_df['population'] / states_df['area']
states_df

              area  population     density
California  423967    38332521   90.413926
Florida     170312    19552860  114.806121
Illinois    149995    12882135   85.883763
New York    141297    19651127  139.076746
Texas       695662    26448193   38.018740
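Dictionary-style assignment modifies the DataFrame in place. If you would prefer to leave the original untouched, `assign` returns a new DataFrame with the extra column. The sketch below uses a two-row stand-in for the states data:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'population': [38332521, 26448193]},
                  index=['California', 'Texas'])

# assign returns a new DataFrame with the extra column; df itself is unchanged.
with_density = df.assign(density=df['population'] / df['area'])

print(list(df.columns))
print(list(with_density.columns))
```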


DataFrames can also be viewed as two-dimensional array-like objects, which allows common array operations such as transposition.

In [0]:
states_df.T

              California       Florida      Illinois      New York         Texas
area        4.239670e+05  1.703120e+05  1.499950e+05  1.412970e+05  6.956620e+05
population  3.833252e+07  1.955286e+07  1.288214e+07  1.965113e+07  2.644819e+07
density     9.041393e+01  1.148061e+02  8.588376e+01  1.390767e+02  3.801874e+01


However, because of its dictionary-like behavior, a DataFrame does not always act like an array. In particular, passing a single index in plain brackets selects a column, not a row. To access rows we can either convert to an array with the values attribute or index with iloc and loc, which accept both row and column indices, including slices.

In [0]:
states_df.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [0]:
states_df.iloc[0,0]

423967

In [0]:
states_df.loc['Florida':'New York', 'population':]

          population     density
Florida     19552860  114.806121
Illinois    12882135   85.883763
New York    19651127  139.076746


In [0]:
states_df.iloc[1:4,1:]

          population     density
Florida     19552860  114.806121
Illinois    12882135   85.883763
New York    19651127  139.076746
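To make the row-versus-column distinction concrete, here is a small sketch on a two-row stand-in DataFrame. It also shows one exception worth remembering: a slice in plain brackets selects rows.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 170312],
                   'population': [38332521, 19552860]},
                  index=['California', 'Florida'])

column = df['area']              # a single key in plain brackets selects a column
first_row = df.iloc[0]           # first row by position, returned as a Series
same_row = df.loc['California']  # the same row by label

rows = df[0:1]                   # a slice in plain brackets selects rows

print(first_row.equals(same_row))
print(rows.shape)
```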


We can also use any of the access patterns familiar to NumPy such as masking and fancy indexing.

In [0]:
states_df.loc[states_df['area'] > 170000, ['population', 'density']]

            population     density
California    38332521   90.413926
Florida       19552860  114.806121
Texas         26448193   38.018740
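The same `loc` pattern works with combined masks and can also be used to assign values. The following is a minimal sketch on a three-row stand-in DataFrame; the `large` column and the thresholds are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 170312, 695662],
                   'population': [38332521, 19552860, 26448193]},
                  index=['California', 'Florida', 'Texas'])

# Two conditions combined into one row mask.
mask = (df['area'] > 170000) & (df['population'] > 20000000)
selected = df.loc[mask, ['population']]

# loc with a mask can also assign: flag the rows that match.
df['large'] = False
df.loc[mask, 'large'] = True

print(list(selected.index))
print(df['large'].tolist())
```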
