# Pandas

<a href="https://colab.research.google.com/github/umsi-data-science/si370/blob/master/SI_370_Day_02_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learning Objectives
* understand the role of pandas in Python data science
* differentiate DataFrames, Series, and (NumPy) arrays
* be able to use apply(), map() and applymap() to manipulate DataFrames and Series
* be able to handle missing values
* be able to filter rows and columns
* be able to sort by indexes and values

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.DataFrame({"a": [1, 2, 3, 3, 2, 2],
                   "b": [4, 5, 6, 4, 5, 6],
                   "c": [7, 8, 9, 0, 0, 0]})

In [None]:
df

## Selecting a column

In [None]:
df['a']

Equivalently:

In [None]:
df.a

## Extracting the ndarray of values

In [None]:
df.a.values 

## Universal functions (h/t numpy)
* DataFrames and Series support element-wise arithmetic directly
* NumPy provides universal functions (ufuncs) that operate element-wise on ndarrays, and pandas objects accept them too
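As a minimal sketch of the idea above, the following applies a NumPy ufunc and an arithmetic operator to a small Series. The Series `s` and its labels are made up for illustration and are not part of the notebook's `df`.

```python
import numpy as np
import pandas as pd

# Illustrative data only (hypothetical, not the notebook's df)
s = pd.Series([1, 4, 9], index=["x", "y", "z"])

# A NumPy ufunc works element-wise on a Series and keeps its index
roots = np.sqrt(s)

# Arithmetic operators are element-wise as well
shifted = s + 1

print(roots)
print(shifted)
```

The result of `np.sqrt(s)` is still a Series, so the labels travel with the values.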

In [None]:
df

In [None]:
df + 1

In [None]:
df.a + 1

In [None]:
df['d'] = df['a'] + 1

In [None]:
df

## map() and applymap()
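* perform actions on each element of a Series (.map()) or a DataFrame (.applymap())
* in pandas 2.1 and later, DataFrame.applymap() is deprecated in favor of DataFrame.map(), which does the same thing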
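A short sketch of `Series.map()`, assuming a made-up Series `s`. It shows the two most common argument types: a function, which is called once per element, and a dict, which is used as a lookup table.

```python
import pandas as pd

# Illustrative data only (hypothetical)
s = pd.Series([4, 5, 6])

# With a function: called once for every element
halved = s.map(lambda x: x / 2)

# With a dict: each element is looked up; elements with no key become NaN
labels = s.map({4: "four", 5: "five"})

print(halved)
print(labels)
```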

In [None]:
def divide_and_round(x):
    """Divide x by 10 and round the result to an integer."""
    return round(x / 10)

In [None]:
df

In [None]:
df.applymap(divide_and_round)

In [None]:
df.b.map(divide_and_round)

## Using .apply() to apply a function to an axis
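* sometimes we want to apply a function not to a single element but to an entire column or row
* rows and columns are the two "axes" of a DataFrame
* axis 0 ("index" or "rows") runs down the rows, so the function receives each column; axis 1 ("columns") runs across the columns, so the function receives each row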
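To make the axis direction concrete, here is a small sketch using a made-up frame `demo` and a hypothetical helper `spread` (maximum minus minimum). Neither name comes from the notebook.

```python
import pandas as pd

# Illustrative data only (hypothetical)
demo = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 40]})

def spread(values):
    """Hypothetical helper: largest value minus smallest value."""
    return values.max() - values.min()

# axis=0: spread() receives each column, so there is one result per column
col_spread = demo.apply(spread, axis=0)

# axis=1: spread() receives each row, so there is one result per row
row_spread = demo.apply(spread, axis=1)

print(col_spread)
print(row_spread)
```

The length of the result is a quick check on the direction: `col_spread` has one entry per column, `row_spread` one entry per row.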


In [None]:
df

In [None]:
# axis="rows" is axis 0: np.mean receives each column, giving one mean per column
df.apply(np.mean, axis="rows")

In [None]:
# axis="columns" is axis 1: np.mean receives each row, giving one mean per row
df.apply(np.mean, axis="columns")

## Counting value frequencies

In [None]:
df

In [None]:
df.a

In [None]:
df.a.value_counts()

## Describing a DataFrame

In [None]:
df.describe()

In [None]:
df['e'] = ["one", "two", "three", "four", "five", "six"]
df['f'] = [1.0, 5.0, 9.0, 10.1, 11.2, 12.3]

In [None]:
df

In [None]:
df.dtypes

In [None]:
# 'number' selects every numeric dtype, regardless of platform integer width
df.select_dtypes(include='number')

## Missing values

* a missing value is just that -- it's missing
* shown as NaN in pandas output and written as np.nan in code (the np.NaN alias was removed in NumPy 2.0)
* many tools that we'll be learning can't handle missing values
* you need to decide what to do with it
* can leave it as is, replace it with a scalar value, replace it with the output of a function (like mean), or drop the row
* think of what that would mean if you were going to calculate the mean of 1,2,3,NaN,4,5,6
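The question in the last bullet can be worked through directly. This sketch computes the mean of 1, 2, 3, NaN, 4, 5, 6 three ways; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, np.nan, 4, 5, 6])

# pandas skips NaN by default: 21 / 6 non-missing values
mean_skipping = values.mean()

# With skipna=False the missing value propagates and the result is NaN
mean_propagating = values.mean(skipna=False)

# Filling with 0 first changes the denominator: 21 / 7 values
mean_zero_filled = values.fillna(0).mean()

print(mean_skipping, mean_propagating, mean_zero_filled)
```

The three answers differ, which is why the choice of how to handle a missing value has to be made deliberately.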


In [None]:
# np.nan marks a missing value (np.NaN no longer exists in NumPy 2.0+)
df_missing = pd.DataFrame(
    {'a': [1, 2, 3, np.nan],
     'b': [2, 3, np.nan, 5]})

In [None]:
df_missing

In [None]:
df_missing.dropna()

In [None]:
df_missing.dropna(subset=['a'])

In [None]:
df_missing

In [None]:
df_missing.fillna(0)

In [None]:
df_missing.fillna(df_missing.mean())

In [None]:
df_missing

In [None]:
# forward fill: copy the last valid value downward (fillna(method=...) is deprecated)
df_missing.ffill()

In [None]:
df_missing

In [None]:
# backward fill: copy the next valid value upward (fillna(method=...) is deprecated)
df_missing.bfill()

## Setting and resetting indexes
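* you can choose which column becomes the index with .set_index()
* note that index values do not have to be unique
* you can move the index back into a regular column with .reset_index()
* both return a new DataFrame and leave the original unchanged unless you assign the result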
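A minimal sketch of these points, using a made-up frame `demo` with a repeated key. The column names `key` and `val` are hypothetical.

```python
import pandas as pd

# Illustrative data only (hypothetical); "x" appears twice on purpose
demo = pd.DataFrame({"key": ["x", "y", "x"], "val": [1, 2, 3]})

# set_index() returns a new DataFrame; demo itself still has both columns
indexed = demo.set_index("key")

# A non-unique index is allowed, and .loc returns every matching row
both_x = indexed.loc["x"]

# reset_index() moves "key" back into a regular column
restored = indexed.reset_index()

print(indexed.index.is_unique)
print(both_x)
```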


In [None]:
df

In [None]:
df.set_index('e')

In [None]:
df.reset_index()

## Filtering rows
* df.iloc[] retrieves rows by integer position
* df.loc[] retrieves rows by their index labels
* df[df[col] == value] filters rows with a boolean mask
* df.query('col == value') filters rows with a query string
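The four approaches listed above can be compared side by side. This is a sketch on a made-up frame `demo`; its labels and values are illustrative only.

```python
import pandas as pd

# Illustrative data only (hypothetical)
demo = pd.DataFrame({"a": [1, 2, 3, 2], "b": [4, 5, 6, 7]},
                    index=["w", "x", "y", "z"])

by_position = demo.iloc[1]        # second row, by integer position
by_label = demo.loc["y"]          # row whose index label is "y"
by_mask = demo[demo["a"] == 2]    # rows where the boolean mask is True
by_query = demo.query("a == 2")   # same filter, written as a string

print(by_mask)
```

Note the double equals sign in both the mask and the query string: a single `=` is assignment, not comparison.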

In [None]:
df

In [None]:
df.iloc[1]

In [None]:
df.iloc[1:3]

In [None]:
df

In [None]:
df.iloc[:,1].values

In [None]:
df

In [None]:
df = df.set_index('e') 

In [None]:
df

In [None]:
df.loc['two']

In [None]:
df

In [None]:
df[df['a'] > 2]

## Sorting DataFrames
* you can either sort on index values (`df.sort_index()`) or on other values (`df.sort_values()`)
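A brief sketch of both sort methods on a made-up frame `demo`. The `ascending=False` argument is included to show descending order; like most pandas methods, both calls return a new DataFrame rather than sorting in place.

```python
import pandas as pd

# Illustrative data only (hypothetical)
demo = pd.DataFrame({"a": [3, 1, 2]}, index=["two", "three", "one"])

# Sort by index labels (alphabetical for strings)
by_index = demo.sort_index()

# Sort by the values in column "a", largest first
by_value = demo.sort_values("a", ascending=False)

print(by_index)
print(by_value)
```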


In [None]:
df

In [None]:
df.sort_index()

In [None]:
df.sort_values('a')

In [None]:
df

## End of notebook