# Intro to Pandas
---
DAT 512 Canisius College <br>
Professor Paul Lambson<br>
<br>
### Learning Objectives
- Become familiar with Pandas
- Use Series and DataFrame
- Use indexes and reindex
- Drop elements, select and filter
- Perform math between objects
- Apply user defined functions to data elements
- Sort and rank data elements
- Use summary statistics
<br>

### Markdown notes 
- Similar to `.readme` files
- More information is available in many places; one starting point is __[here](https://ipython.readthedocs.io/en/stable/interactive/magics.html)__

### Sections
- [Pandas Series](#pandas-series)
- [Pandas DataFrame](#pandas-dataframe)
- [Index Objects](#index-objects)
- [Reindex](#reindex)
- [Dropping Entries from an Axis](#dropping-entries-from-an-axis)
- [Indexing, Selection, and Filtering](#index-selection-and-filtering)
- [Arithmetic and Data Alignment](#arithmetic-and-data-alignment)
- [Function Application and Mapping](#function-application-and-mapping)
- [Sorting and Ranking](#sorting-and-ranking)
- [Axis Indexes with Duplicate Labels](#axis-indexes-with-duplicate-labels)
- [Summarizing and Computing Descriptive Statistics](#summarizing-and-computing-descriptive-statistics)
- [Correlation and Covariance](#correlation-and-covariance)
- [Unique Values, Value Counts, and Membership](#unique-values-value-counts-and-membership)

<a id='pandas-series'></a>
# Pandas Series
---
A Series is a one-dimensional array-like object containing a sequence of values of the same type (the types are similar to NumPy types) and an associated array of data labels, called its index.
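
As a minimal sketch of this definition (the `temps` Series and its city labels are made up for illustration), the values and the index can be inspected separately:

```python
import pandas as pd

# Hypothetical example data: three temperatures labeled by city
temps = pd.Series([21.5, 18.0, 25.2], index=["Buffalo", "Albany", "Rochester"])

values = temps.to_numpy()      # the underlying values as a NumPy array
labels = list(temps.index)     # the associated data labels
print(temps.dtype)             # a single dtype for the whole Series
print(temps["Albany"])         # look up a value by its label
```

The same pieces are explored with `obj` and `obj2` in the cells that follow.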

In [None]:
import numpy as np
import pandas as pd

In [None]:
from pandas import Series, DataFrame

In [None]:
import numpy as np
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc("figure", figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
pd.options.display.max_columns = 20
pd.options.display.max_colwidth = 80
np.set_printoptions(precision=4, suppress=True)

In [None]:
obj = pd.Series([4, 7, -5, 3])
obj

In [None]:
obj.array
obj.index

In [None]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2
obj2.index

In [None]:
# Selecting values by index value
obj2["a"]
obj2["d"] = 6
obj2[["c", "a", "d"]]  # note: the index labels to select are passed as a list

In [None]:
# Boolean masking
obj2>0

In [None]:
# Instead of a list of index labels, pass a boolean mask of the same length as the Series
obj2[obj2 > 0]

In [None]:
obj2 * 2

In [None]:
import numpy as np
np.exp(obj2)

In [None]:
print("b" in obj2)  # a Series behaves much like an ordered dictionary
print("e" in obj2)
print(2 in obj2)  # "in" checks the index labels, not the values

In [None]:
# A series can be created directly from a dictionary
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj3

In [None]:
# A Series can be converted back to a dictionary
obj3.to_dict()

In [None]:
# A Series can be created from component pieces; index labels missing from sdata get NaN
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
obj4

In [None]:
# Logical checks are available for rapid data validation
pd.isna(obj4)

In [None]:
pd.notna(obj4)

In [None]:

obj4.isna()

In [None]:
# Series can be aligned and combined
obj3
obj4
obj3 + obj4

In [None]:
# Metadata (a name) can be attached to a Series and to its index
obj4.name = "population"
obj4.index.name = "state"
obj4

In [None]:
# A Series index can be altered in place by assignment
obj

In [None]:
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
obj

## In-Class Exercise
---
- Make a pandas Series called 'roster' and populate it with the names of other students in the class
- Make another pandas Series called 'roster_2' where the index is the names of students in the class and each value is a random integer

<a id='pandas-dataframe'></a>
# Pandas DataFrame
---
A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dictionary of Series all sharing the same index.
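
The "dictionary of Series" view can be sketched as follows. The `price`, `qty`, and `fruit` names and their values are illustrative assumptions, not course data:

```python
import pandas as pd

# Hypothetical data: two Series that share the same index labels
price = pd.Series([0.5, 0.8, 1.2], index=["apple", "pear", "plum"])
qty = pd.Series([10, 4, 7], index=["apple", "pear", "plum"])

# The dictionary keys become the column names
fruit = pd.DataFrame({"price": price, "qty": qty})

# Each column comes back as a Series that carries the shared row index
col = fruit["price"]
print(type(col).__name__)
print(col.index.equals(fruit.index))
print(fruit.dtypes)   # each column keeps its own value type
```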

In [None]:
# Dictionary to DataFrame: keys become column headers, values become the entries of each column
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

In [None]:
# The full data frame can be accessed
frame

In [None]:
# In most cases, a subset can be seen
frame.head()

In [None]:
frame.tail()

In [None]:
# The column order can be specified when the DataFrame is created
pd.DataFrame(data, columns=["year", "state", "pop"])

In [None]:
# If a requested column is not in the data, it is filled with missing values (NaN)
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
frame2

In [None]:
frame2.columns

In [None]:
frame2["state"]
frame2.year

Attribute-like access (e.g., `frame2.year`) and tab completion of column names in IPython are provided as a convenience.

`frame2[column]` works for any column name, but `frame2.column` works only when the column name is a valid Python variable name and does not conflict with any of the method names in DataFrame. For example, if a column’s name contains whitespace or symbols other than underscores, it cannot be accessed with the dot attribute method.
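
A small sketch of these limits, using a hypothetical `sales` DataFrame whose column names were chosen to trigger both cases (a name containing a space, and a name that collides with the `DataFrame.count` method):

```python
import pandas as pd

# Hypothetical columns: one with a space, one that clashes with a method name
sales = pd.DataFrame({"unit price": [1.5, 2.0], "count": [3, 4], "year": [2001, 2002]})

by_attr = sales.year            # valid identifier with no clash: returns the column
by_key = sales["unit price"]    # bracket access works for any column name
clash = sales.count             # resolves to the DataFrame.count method, not the column

print(callable(clash))          # the attribute is a method
print(sales["count"].tolist())  # the column is still reachable with brackets
```

Bracket access is the safer default in scripts, since it behaves the same way for every column name.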

In [None]:
# Locate a row by index label with loc (the labels here are the default integers)
frame2.loc[1]

In [None]:
# Locate a row by integer position with iloc
frame2.iloc[2]

In [None]:
# Assign a single value to a column
frame2["debt"] = 16.5
frame2

In [None]:
# Assign a list or array to a column; its length must match the number of rows
frame2["debt"] = np.arange(6.)
frame2

In [None]:
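# A Series is aligned on index labels when assigned; rows without a matching label get NaN.
# None of these labels match frame2's default integer index, so the whole column becomes NaN.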
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])
frame2["debt"] = val
frame2
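
A minimal sketch of the same alignment rule with a partial match. The `table` and `partial` names and their values are illustrative assumptions:

```python
import pandas as pd

table = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=["one", "two", "three"])
partial = pd.Series([-1.2, -1.5], index=["two", "four"])

# "two" matches a row label; "four" matches nothing and is discarded;
# rows "one" and "three" have no matching label and get NaN
table["debt"] = partial
print(table)
```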

In [None]:
# Create a boolean column (deleted again in the next cell)
frame2["eastern"] = frame2["state"] == "Ohio"
frame2

In [None]:
# del column
del frame2["eastern"]
frame2.columns

In [None]:
# Create a dataframe from a nested dictionary
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}

In [None]:
frame3 = pd.DataFrame(populations)
frame3

In [None]:
# Easy transpose
frame3.T

In [None]:
# Only the rows for the specified index labels are included
pd.DataFrame(populations, index=[2001, 2002, 2003])

In [None]:
# Sliced Series keep their index labels, which are honored when the DataFrame is created
pdata = {"Ohio": frame3["Ohio"][:-1],
         "Nevada": frame3["Nevada"][:2]}
pd.DataFrame(pdata)

In [None]:
# modifying meta data
frame3.index.name = "year"
frame3.columns.name = "state"
frame3

In [None]:
# Export to a 2-D NumPy array (e.g., for scikit-learn data prep)
frame3.to_numpy()

In [None]:
# If the column data types differ, a dtype that accommodates all of them is chosen
frame2.to_numpy()

## In-Class Exercise
---
- Make a dictionary of three key-value pairs named 'Day', 'High', and 'Low', where the values are the day-of-week names, the high temperatures, and the low temperatures
- Create a DataFrame called 'weather' from this dictionary

<a id='index-objects'></a>
# Index Objects
---
pandas’s Index objects are responsible for holding the axis labels (including a DataFrame’s column names) and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index.
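
A short sketch of that conversion (the `labeled` Series and its labels are illustrative): a plain Python list passed as `index` comes back as an `Index` object, and the axis name is stored on that object.

```python
import pandas as pd

raw_labels = ["x", "y", "z"]                 # a plain Python list
labeled = pd.Series([1, 2, 3], index=raw_labels)

print(type(labeled.index).__name__)          # the list was converted to an Index
print(labeled.index.tolist())

# The axis name is metadata held by the Index itself
labeled.index.name = "key"
print(labeled.index.name)
```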

In [None]:
# Get the Index object from a Series
obj = pd.Series(np.arange(3), index=["a", "b", "c"])
index = obj.index
index

In [None]:
# Index objects are immutable; this assignment raises a TypeError
index[1:] = 'd'

In [None]:
# Immutability makes it safe to share Index objects among data structures
labels = pd.Index(np.arange(3))
labels
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2
obj2.index is labels

In [None]:
frame3

In [None]:
frame3.columns

In [None]:
"Ohio" in frame3.columns

In [None]:
2003 in frame3.index

In [None]:
# indexes CAN contain duplicate data
pd.Index(["foo", "foo", "bar", "bar"])

<a id='reindex'></a>
# Reindex

In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj

In [None]:
# reindex rearranges the data to match the new index; labels not present before get NaN
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
obj2

In [None]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
obj3
# method="ffill" forward-fills values for the newly introduced labels
obj3.reindex(np.arange(6), method="ffill")

In [None]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
frame

In [None]:
# new index adds rows without values
frame2 = frame.reindex(index=["a", "b", "c", "d"])
frame2

In [None]:
# reindex works on rows (the default) or on columns
states = ["Texas", "Utah", "California"]
frame.reindex(columns=states)

In [None]:
frame.reindex(states, axis="columns")

In [None]:
# loc selects by label and can be used on both rows and columns
frame.loc[["a", "d", "c"], ["California", "Texas"]]
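
As a sketch of two points from this section (names and values here are illustrative): `reindex` returns a new object rather than changing the original, and `fill_value` replaces the default NaN for labels that did not exist before.

```python
import pandas as pd

short = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Labels missing from the original get NaN unless fill_value is given
with_nan = short.reindex(["a", "b", "c", "d"])
with_zero = short.reindex(["a", "b", "c", "d"], fill_value=0)

print(with_nan)
print(with_zero)
print(short.index.tolist())   # the original Series is unchanged
```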

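## In-Class Exercise
---
- Make 'Day' the new index of the 'weather' DataFrame (hint: `set_index` promotes a column to the index, while `reindex` conforms the data to a new set of labels)
- Explore the `inplace` keyword of `set_index` (`reindex` itself has no `inplace` option)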

<a id='dropping-entries-from-an-axis'></a>
# Dropping Entries from an Axis

In [None]:
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
obj

In [None]:
new_obj = obj.drop("c")
new_obj

In [None]:
obj.drop(["d", "c"])

In [None]:

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data

In [None]:
# Calling drop with index labels deletes rows
data.drop(index=["Colorado", "Ohio"])

In [None]:
# calling drop with a columns keyword and list will drop columns
data.drop(columns=["two"])

In [None]:
# axis=1 is the same as axis="columns"
data.drop("two", axis=1)

In [None]:
data.drop(["two", "four"], axis="columns")
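
A minimal sketch of two behaviors worth knowing about `drop` (the `small` DataFrame is an illustrative assumption): it returns a new object and leaves the original intact, and by default it raises `KeyError` for a label that does not exist.

```python
import pandas as pd

small = pd.DataFrame({"one": [1, 2], "two": [3, 4]}, index=["r1", "r2"])

trimmed = small.drop(columns=["two"])   # drop returns a new DataFrame
print(small.columns.tolist())           # the original still has both columns
print(trimmed.columns.tolist())

# Dropping a label that does not exist raises KeyError by default
raised = False
try:
    small.drop(index=["missing"])
except KeyError:
    raised = True
print(raised)
```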

<a id='index-selection-and-filtering'></a>
# Indexing, Selection, and Filtering

In [None]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj

In [None]:
obj["b"]

In [None]:
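# Integer keys on a non-integer index fall back to position; recent pandas
# versions (2.1+) deprecate this in favor of the explicit obj.iloc[1]
obj[1]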

In [None]:
obj[2:4]

In [None]:
obj[["b", "a", "d"]]

In [None]:
obj[[1, 3]]

In [None]:
obj[obj < 2]

In [None]:
# loc is the preferred way to select by label
obj.loc[["b", "a", "d"]]

In [None]:
# Create two Series: one with an integer index, one with a string index
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])
obj1

In [None]:
obj2

In [None]:
# When the index holds integers, integer keys are matched against the index labels
obj1[[0, 1, 2]]

In [None]:
# When the index is not integer-based, integer keys fall back to position
# (deprecated in recent pandas; iloc is the explicit alternative)
obj2[[0, 1, 2]]

In [None]:
# iloc always selects by integer position, regardless of the index type
obj1.iloc[[0, 1, 2]]

In [None]:
obj2.iloc[[0, 1, 2]]

In [None]:
# Label slicing with loc includes the endpoint, unlike ordinary Python slicing
obj2.loc["b":"c"]

In [None]:
# slicing to assign values
obj2.loc["b":"c"] = 5
obj2

In [None]:
# Indexing into a DataFrame retrieves one or more columns either with a single value or sequence
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data


In [None]:
# one bracket returns a series
data["two"]

In [None]:
# two brackets returns a data frame
data[["three", "one"]]

In [None]:
# slicing
data[:2]

In [None]:
# selecting with a boolean mask
data[data["three"] > 5]

In [None]:
# Boolean DataFrame created from a scalar comparison
data < 5

In [None]:
# overwrite value based on boolean mask
data[data < 5] = 0
data

In [None]:
# Select a row as a Series: use loc with an axis label, iloc with an integer position
data
data.loc["Colorado"]

In [None]:
# pass a list of values
data.loc[["Colorado", "New York"]]

In [None]:
# select on both row and columns values
data.loc["Colorado", ["two", "three"]]

In [None]:
# more examples using integer index position
data.iloc[2]

In [None]:
data.iloc[[2, 1]]

In [None]:
data.iloc[2, [3, 0, 1]]

In [None]:
data.iloc[[1, 2], [3, 0, 1]]

In [None]:
# both loc and iloc work with slices
data.loc[:"Utah", "two"]

In [None]:
data.iloc[:, :3][data.three > 5]

In [None]:
# A boolean Series can be used as a mask with loc; iloc accepts only a plain boolean array
data.loc[data.three >= 2]
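
The differences between `loc` and `iloc` covered in this section can be sketched side by side. The `letters` Series is an illustrative assumption:

```python
import pandas as pd

letters = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = letters.loc["b":"c"]     # label slice: both endpoints are included
by_position = letters.iloc[1:2]     # position slice: the stop position is excluded

mask = letters > 15                 # a boolean Series
via_loc = letters.loc[mask]         # loc accepts the boolean Series directly
via_iloc = letters.iloc[mask.to_numpy()]  # iloc needs the plain boolean array

print(by_label.tolist())
print(by_position.tolist())
print(via_loc.equals(via_iloc))
```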

<a id='arithmetic-and-data-alignment'></a>
# Arithmetic and Data Alignment

In [None]:
# When objects are added, their labels are aligned; if the indexes differ,
# the index of the result is the union of the two indexes
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=["a", "c", "e", "f", "g"])
s1


In [None]:
s2

In [None]:
# Values are added only where labels match; labels found in just one Series give NaN
s1 + s2

In [None]:
# matches with columns as well
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
                   index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                   index=["Utah", "Ohio", "Texas", "Oregon"])
df1


In [None]:
df2

In [None]:
# Adding two DataFrames sums only the cells whose row and column labels both match; the rest are NaN
df1 + df2

In [None]:
# If the DataFrames have no column labels in common, the result is all NaN
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [3, 4]})
df1
df2
df1 + df2

In [None]:
# a more complex example where NaN plus a value is NaN
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list("abcde"))
df2.loc[1, "b"] = np.nan
df1


In [None]:
df2

In [None]:
df1 + df2

In [None]:
# add with fill_value substitutes 0 when a value is missing from one of the two objects
df1.add(df2, fill_value=0)

In [None]:
# Methods starting with r reverse the operands, so these two statements are equivalent
1 / df1

In [None]:
df1.rdiv(1)

In [None]:
# can specify fill_value when reindexing
df1.reindex(columns=df2.columns, fill_value=0)
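
The alignment rule and the effect of `fill_value` can be sketched with two tiny Series (the `left` and `right` names and values are illustrative):

```python
import numpy as np
import pandas as pd

left = pd.Series([1.0, 2.0], index=["a", "b"])
right = pd.Series([10.0, 20.0], index=["b", "c"])

plain = left + right                    # labels found on only one side give NaN
filled = left.add(right, fill_value=0)  # the missing side is treated as 0

print(plain)
print(filled)
```

Only label "b" exists in both Series, so it is the only non-NaN entry of `plain`; with `fill_value=0`, labels "a" and "c" keep the value from the side that has them.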

<a id='function-application-and-mapping'></a>
# Function Application and Mapping

In [None]:
# Numpy functions can be applied to a DataFrame
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
frame

In [None]:
np.abs(frame)

In [None]:
# apply runs a function on each column (the default) or each row, passed in as a 1-D Series
def f1(x):
    return x.max() - x.min()

frame.apply(f1)

In [None]:
# axis="columns" applies the function across the columns, once per row
frame.apply(f1, axis="columns")

In [None]:
# it can also return a Series with multiple values
def f2(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])
frame.apply(f2)

In [None]:
# Format each element as a string with two decimal places
def my_format(x):
    return f"{x:.2f}"

# applymap works element-wise; pandas 2.1 and later rename it to DataFrame.map
frame.applymap(my_format)

In [None]:
# Series.map applies a function element-wise to a single row or column
frame["e"].map(my_format)
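
Because the cells above use random numbers, the effect of `axis` is easier to verify on fixed data. This sketch uses a hypothetical `scores` DataFrame and a `spread` helper equivalent to `f1`:

```python
import pandas as pd

scores = pd.DataFrame({"x": [1, 5, 3], "y": [10, 20, 40]}, index=["r1", "r2", "r3"])

def spread(values):
    # difference between the largest and smallest value
    return values.max() - values.min()

per_column = scores.apply(spread)                  # one result per column
per_row = scores.apply(spread, axis="columns")     # one result per row

print(per_column.to_dict())
print(per_row.to_dict())
```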

<a id='sorting-and-ranking'></a>
# Sorting and Ranking

In [None]:
#  To sort lexicographically by row or column label, use the sort_index method
obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])
obj

In [None]:
# a new object is created
obj.sort_index()

In [None]:
# this method can also be applied to columns
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=["three", "one"],
                     columns=["d", "a", "b", "c"])
frame

In [None]:
frame.sort_index()

In [None]:
frame.sort_index(axis="columns")

In [None]:
# ascending=False reverses the sort order
frame.sort_index(axis="columns", ascending=False)

In [None]:
# sort_values sorts by value rather than by label
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()

In [None]:
# NaN values are placed last by default
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

In [None]:
# na_position="first" places them first instead
obj.sort_values(na_position="first")

In [None]:
# For a DataFrame, specify the column(s) to sort by
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})
frame

In [None]:
frame.sort_values("b")

In [None]:
# Sorting by multiple columns is an option
frame.sort_values(["a", "b"])

In [None]:
# rank returns a new object holding each value's rank; ties share the average rank by default
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

In [None]:
# method="first" breaks ties by order of appearance
obj.rank(method="first")

In [None]:
# Rank in descending order
obj.rank(ascending=False)

In [None]:
# A DataFrame can be ranked down the rows (the default) or across the columns
frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1],
                      "c": [-2, 5, 8, -2.5]})
frame

In [None]:
frame.rank(axis="columns")
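
To compare tie-breaking methods directly, this sketch puts three of them side by side for a short Series with one tie (the `tied` and `ranks` names are illustrative):

```python
import pandas as pd

tied = pd.Series([7, -5, 7, 4])

ranks = pd.DataFrame({
    "average": tied.rank(),               # default: ties share the mean of their ranks
    "min": tied.rank(method="min"),       # ties share the lowest rank in the group
    "first": tied.rank(method="first"),   # ties are ordered by appearance
})
print(ranks)
```

The two 7s occupy ranks 3 and 4, so they receive 3.5 under `average`, 3 under `min`, and 3 and 4 under `first`.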

<a id='axis-indexes-with-duplicate-labels'></a>
# Axis Indexes with Duplicate Labels

In [None]:
# Create a Series with duplicate index labels
obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])
obj

In [None]:
# check if index is unique
obj.index.is_unique

In [None]:
# Selecting a duplicated label returns a Series; a unique label returns a scalar
obj["a"]

In [None]:
obj["c"]

In [None]:
# similar example with a data frame
df = pd.DataFrame(np.random.standard_normal((5, 3)),
                  index=["a", "a", "b", "b", "c"])
df

In [None]:
df.loc["b"]

In [None]:
df.loc["c"]
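
Duplicate labels make the return type of a selection depend on the label, which can surprise downstream code. A minimal sketch, with an illustrative `dupes` Series:

```python
import pandas as pd

dupes = pd.Series([0, 1, 2], index=["a", "a", "b"])

repeated = dupes["a"]   # the label appears twice: a Series is returned
single = dupes["b"]     # the label appears once: a scalar is returned

print(type(repeated).__name__, repeated.tolist())
print(single)
print(dupes.index.is_unique)
```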

<a id='summarizing-and-computing-descriptive-statistics'></a>
# Summarizing and Computing Descriptive Statistics

In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])
df

In [None]:
# sum returns the sum of each column by default
df.sum()

In [None]:
# axis="columns" sums across the columns, giving one result per row
df.sum(axis="columns")

In [None]:
# NaN values are skipped by default; with skipna=False any NaN makes the result NaN
df.sum(axis="index", skipna=False)

In [None]:
df.sum(axis="columns", skipna=False)

In [None]:
# mean method
df.mean(axis="columns")

In [None]:
# idxmax returns the index label of the maximum value in each column (idxmin also exists)
df.idxmax()

In [None]:
# Cumulative sum
df.cumsum()

In [None]:
# multiple summary statistics at once
df.describe()

In [None]:
# describe also works on a Series; for non-numeric (object) data it reports count, unique, top, and freq
obj = pd.Series(["a", "a", "b", "c"] * 4)
obj.describe()
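
The effect of `skipna` is easiest to see on a tiny table with a single gap. The `gaps` DataFrame below is an illustrative assumption:

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [4.0, 5.0, 6.0]})

skipped = gaps.mean()               # NaN ignored: "one" averages 1.0 and 3.0
strict = gaps.mean(skipna=False)    # NaN propagates into the result for "one"

print(skipped)
print(strict)
```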

<a id='correlation-and-covariance'></a>
# Correlation and Covariance

In [None]:
# read in a larger file from examples folder
price = pd.read_pickle("examples/yahoo_price.pkl")
volume = pd.read_pickle("examples/yahoo_volume.pkl")

In [None]:
# make a new data frame that shows percentage change
returns = price.pct_change()
returns.tail()

In [None]:
# The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series
returns["MSFT"].corr(returns["IBM"])

In [None]:
# cov computes the covariance
returns["MSFT"].cov(returns["IBM"])

In [None]:
# Full correlation matrix
returns.corr()

In [None]:
# full covariance matrix
returns.cov()

In [None]:
# corrwith computes the correlation of each column with a single Series
returns.corrwith(returns["IBM"])

In [None]:
# Passing a DataFrame correlates the columns with matching names (here, percent change with volume)
returns.corrwith(volume)
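
If the pickle files in `examples/` are not at hand, the relationship between `corr` and `cov` can still be checked on synthetic data. In this sketch the tickers `AAA` and `BBB`, the seed, and the random-walk prices are all made-up assumptions, not the course data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the price data: two positive random-walk price series
steps = rng.normal(0, 0.01, size=(250, 2))
demo_prices = pd.DataFrame(100 * np.exp(steps.cumsum(axis=0)), columns=["AAA", "BBB"])

# Percent change leaves a NaN in the first row, which is dropped here
demo_returns = demo_prices.pct_change().dropna()

corr = demo_returns["AAA"].corr(demo_returns["BBB"])
cov = demo_returns["AAA"].cov(demo_returns["BBB"])

# Pearson correlation is the covariance divided by the product of the standard deviations
manual = cov / (demo_returns["AAA"].std() * demo_returns["BBB"].std())
print(np.isclose(corr, manual))
```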

<a id='unique-values-value-counts-and-membership'></a>
# Unique Values, Value Counts, and Membership

In [None]:
# get unique values from a series
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])
uniques = obj.unique()
uniques

In [None]:
# Count the occurrences of each value
obj.value_counts()

In [None]:
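# The top-level function also accepts arrays and a sort option
# (deprecated in pandas 2.1+; the Series.value_counts method is preferred there)
pd.value_counts(obj.to_numpy(), sort=False)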

In [None]:
# isin builds a boolean mask for membership in a set of values
obj
mask = obj.isin(["b", "c"])
mask


In [None]:
obj[mask]

In [None]:
# get_indexer maps possibly non-unique values to their positions in an Index of unique values
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])
unique_vals = pd.Series(["c", "b", "a"])
indices = pd.Index(unique_vals).get_indexer(to_match)
indices

In [None]:
# in some cases a histogram may be needed on multiple columns
data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})
data

In [None]:
# Histogram of a single column
data["Qu1"].value_counts().sort_index()

In [None]:
# Count occurrences across multiple columns
result = data.apply(pd.value_counts).fillna(0)
result
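
Two follow-up sketches, both on illustrative data (`answers` and `lookup` are assumed names). The first repeats the multi-column count using the `Series.value_counts` method inside a lambda, which avoids the top-level `pd.value_counts` function on newer pandas versions. The second shows that `get_indexer` returns -1 for values that are not in the Index.

```python
import pandas as pd

answers = pd.DataFrame({"Q1": [1, 3, 3], "Q2": [2, 3, 1]})

# Count each value per column; values absent from a column become 0
counts = answers.apply(lambda col: col.value_counts()).fillna(0).sort_index()
print(counts)

# Positions of each value within the lookup Index; -1 marks a value not found
lookup = pd.Index(["c", "b", "a"])
positions = lookup.get_indexer(["a", "z", "c"])
print(positions)
```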

# In Class Problems

In [None]:
'''
    #1
    How many rows and columns do the price and volume data frames have?
'''
'''
    #2
    Create a data frame that has summary statistics about price for all 4 stocks
'''
'''
    #3
    How many days did IBM trade at a higher volume than GOOG?
'''
'''
    #4
    Create a data frame that has the date of the max price and the date of the min price
    for each of the 4 stocks in the price data frame
'''

'''
    #5
    Create a data frame that has the date of the max volume and the date of the min volume
    for each of the 4 stocks in the volume data frame
'''
'''
    #6
    Which two stocks are most highly correlated in price?
    Which two stocks are most negatively correlated in volume?
'''
