# Pandas

The pandas module is one of the most powerful Python tools for data analysis. It was designed to work with tabular and heterogeneous data. The original author of pandas is Wes McKinney, so it makes sense that most of his book *Python for Data Analysis* covers the functionality of pandas. In fact, chapters 5-11 are essentially about what pandas can do.

Here are some of the things that I hope you can do by the end of the week:
* Create Series and DataFrames (ch 5)
* Index, slice, and filter (ch 5)
* Examine your data (ch 5)
* Compute summarization and descriptive statistics (ch 5)
* Drop rows and columns (ch 5)
* Create columns (ch 5)
* Count the number of missing values (ch 7)
* Drop or fill missing values (ch 7)
* Drop duplicate rows (ch 7)
* Combine categories of categorical data (ch 7)
* Discretize numerical data (ch 7)
* Have some practice with hierarchical indexing (ch 8)
* Reset the index (ch 8)
* Merge and concatenate DataFrames (ch 8)
* Make simple plots with pandas (ch 9)
* Use .groupby() for category aggregation (ch 10)
* Fill missing values by group summary statistics (ch 10)

## Importing Pandas

It is standard to use the alias ``pd`` when importing pandas.
~~~
import pandas as pd
~~~
I usually import numpy at the same time since pandas and numpy are often used in tandem.

In [1]:
import pandas as pd
import numpy as np

In [2]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [2]:
## The following code will allow us to see all the columns and rows in the dataset
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

---
The two main data structures that we will use from pandas are the *Series* and the *DataFrame*.  

### Series
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called the *index*.  

#### Creating a Series
A Series can be created from a list, a numpy ndarray, or a dictionary using the function ``pd.Series``.
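
The list case is covered in the cell below. For the other two inputs, here is a minimal sketch; the names `s_arr` and `s_dict` and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a numpy ndarray; with no index given, pandas assigns 0, 1, 2, ...
s_arr = pd.Series(np.array([10.0, 20.0, 30.0]))

# From a dictionary; the keys become the index labels
s_dict = pd.Series({'ohio': 1, 'wyoming': 2, 'utah': 3})

print(s_arr.index.tolist())   # [0, 1, 2]
print(s_dict['utah'])         # 3
```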

In [3]:
## Try:  Create a Series from a list
x = [1,2,3,4,5]
lab = ['a','b','c','d','e']

s = pd.Series(x, index=lab)

In [4]:
s

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [None]:
## Try: Select a value from the Series by its index label
s['c']

You can think of a Series as an ordered dictionary in which the index labels are the keys and the data are the values.

The data in a Series need not be numeric.
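
A small sketch of both points, using a made-up Series of strings called `colors`. Note that `in` checks the index labels (the "keys"), not the values.

```python
import pandas as pd

# A Series holding strings instead of numbers
colors = pd.Series(['red', 'green', 'blue'], index=['a', 'b', 'c'])

# Dictionary-like behavior: look up by label, test membership on the labels
print(colors['b'])        # green
print('c' in colors)      # True  (checks the index)
print('red' in colors)    # False (values are not checked)

# Unlike a plain dictionary, a Series also supports positional access
print(colors.iloc[0])     # red
```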

In [6]:
## Try:  Make a series with non-numeric data

In [None]:
s.mean()

In [None]:
np.mean(s)

### DataFrames
DataFrames are the main data structure of pandas and were directly inspired by the R programming language.  DataFrames are a bunch of Series objects put together to share the same (row) index.  A DataFrame has both a row and a column index.  

#### Creating DataFrames
DataFrames can also be created from lists, dictionaries, or numpy arrays.
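
One possible way to do each of these is sketched here; the variable names and values are placeholders, not part of the course data.

```python
import numpy as np
import pandas as pd

# From a list of lists: each inner list is one row
df_rows = pd.DataFrame([['ohio', 1], ['utah', 3]], columns=['state', 'v1'])

# From a dictionary: each key is a column name, each value is that column's data
df_dict = pd.DataFrame({'state': ['ohio', 'utah'], 'v1': [1, 3]})

# From a 2-D numpy array: rows and columns follow the array's shape
df_arr = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['x', 'y'])

print(df_rows.shape)   # (2, 2)
print(df_arr.shape)    # (3, 2)
```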

In [9]:
## Try: Make a DataFrame from a list of lists

In [10]:
## Try: Make a DataFrame from a dictionary

In [11]:
df = pd.DataFrame()

In [None]:
df.shape

In [5]:
d = {'state':['ohio','wyoming','utah'], 'v1': [1,2,3], 'v2':[4,5,6]}

In [6]:
df = pd.DataFrame(d)
df.index = ['a','b','c']

In [7]:
df

| | state | v1 | v2 |
|---|---|---|---|
| a | ohio | 1 | 4 |
| b | wyoming | 2 | 5 |
| c | utah | 3 | 6 |


In [None]:
df.loc['a','state':'v2']

**Index and columns**

We can turn a column of the data into the index (the row names).
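
As an alternative to assigning `df.index` and then deleting the column, `.set_index()` does both in one step. This is only a sketch on a throwaway DataFrame named `df_demo`.

```python
import pandas as pd

df_demo = pd.DataFrame({'state': ['ohio', 'wyoming', 'utah'],
                        'v1': [1, 2, 3], 'v2': [4, 5, 6]})

# set_index returns a new DataFrame; the column moves into the index
by_state = df_demo.set_index('state')

print(by_state.columns.tolist())    # ['v1', 'v2']
print(by_state.loc['utah', 'v2'])   # 6

# reset_index moves the index back into an ordinary column
restored = by_state.reset_index()
```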

In [17]:
## Try:  Turn a column into the index and drop

In [18]:
df.index = df['state']

In [19]:
del df['state']

In [None]:
df

### Read in some practice data

In [None]:
## Tips data
import seaborn as sns
iris = sns.load_dataset('iris')
tips = sns.load_dataset('tips')
pen = sns.load_dataset('penguins')

In [None]:
tips.head()

In [None]:
tips.describe()

In [None]:
tips.info()

In [None]:
tips.isna().sum()

In [8]:
import seaborn as sns
iris = sns.load_dataset('iris')

In [9]:
iris.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [20]:
iris.loc[0:4, 'sepal_length':'petal_width']

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [18]:
iris.loc[0,'sepal_length']

5.1

In [25]:
## This won't work in Python: 'species'=='setosa' is evaluated first and gives False, so pandas looks for a column named False
iris['species'=='setosa']

KeyError: False

In [24]:
iris[iris['species'] == 'setosa']

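| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |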


---

## Selection and Indexing

There are various ways to get subsets of the data.  In the following ``df`` refers to a DataFrame.

#### Selecting columns
One column (producing a Series)
```{python}
df['column_name'] 
df.column_name 
```
---

Multiple columns (producing a DataFrame)
```
df[['column_name']] # this will produce a DataFrame
df[['col1', 'col2', 'col3']]
```
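
A quick sketch of the difference between single and double brackets, on a made-up `df_demo`. One caveat with attribute access (`df.column_name`): it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so the bracket form is the safer habit.

```python
import pandas as pd

df_demo = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})

one_col = df_demo['col1']             # single brackets -> Series
one_col_df = df_demo[['col1']]        # list with one name -> DataFrame
two_cols = df_demo[['col1', 'col3']]  # list of names -> DataFrame

print(type(one_col).__name__)      # Series
print(type(one_col_df).__name__)   # DataFrame
print(two_cols.shape)              # (2, 2)
```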
---

#### Selecting rows and columns with ``loc`` and ``iloc``
```
df.loc['row_name', 'col_name'] 
df.iloc[row_position, col_position]  # integer positions, e.g. df.iloc[0, 2]
```

``loc`` and ``iloc`` also support slicing.  Note: when slicing with ``loc``, the end point IS included (but not when slicing with ``iloc``).

---
```
df.loc['row_name1':'row_name2', 'col_name1':'col_name2'] 
df.loc[:, 'col_name1':'col_name2']
df.loc['r1':'r2', :]
df.loc[['r1','r2','r3'],['c1','c2']]
```
*When using `.loc[]`, `row_name2` and `col_name2` WILL be included*

---
~~~
df.iloc[index1:index2, col1:col2] 
~~~
*When using `.iloc[]`, `index2` and `col2` will NOT be included*
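
To see the two slicing rules side by side, here is a small sketch on a made-up DataFrame. The last two lines show why this matters with the default integer index: the same-looking slice returns a different number of rows.

```python
import pandas as pd

df_demo = pd.DataFrame({'c1': [1, 2, 3, 4],
                        'c2': [5, 6, 7, 8],
                        'c3': [9, 10, 11, 12]},
                       index=['r1', 'r2', 'r3', 'r4'])

by_label = df_demo.loc['r1':'r3', 'c1':'c2']   # end labels are included
by_position = df_demo.iloc[0:3, 0:2]           # end positions are excluded

print(by_label.shape)                 # (3, 2)
print(by_label.equals(by_position))   # True

# With a default integer index, the two rules give different row counts
default_idx = df_demo.reset_index(drop=True)
print(len(default_idx.loc[0:2]))      # 3
print(len(default_idx.iloc[0:2]))     # 2
```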

---
#### Selecting rows based on column condition
~~~
df[df['column_name'] == value]   # boolean condition built from a column

df[mask]

df[df['authors']=='J.K. Rowling']
~~~
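
The mask is just a boolean Series with one entry per row, which the following sketch makes visible. The `books` table and its `pages` values are invented for the example. When combining conditions, use `&`, `|`, and `~` and wrap each condition in parentheses.

```python
import pandas as pd

# Made-up data for illustration only
books = pd.DataFrame({'authors': ['J.K. Rowling', 'Bryan', 'J.K. Rowling'],
                      'pages': [100, 200, 300]})

mask = books['authors'] == 'J.K. Rowling'   # boolean Series, one entry per row
print(mask.tolist())                        # [True, False, True]

rowling = books[mask]                       # keeps the rows where mask is True

# Two conditions combined with & (and); each one needs its own parentheses
long_rowling = books[(books['authors'] == 'J.K. Rowling') & (books['pages'] > 150)]
print(long_rowling['pages'].tolist())       # [300]
```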


In [32]:
## Sometimes, when coming from R, we try the following:
## df['authors'== 'Bryan']
## This will not work in Python
## Instead, we need to use the following:
## df[df['authors']=='Bryan']

In [None]:
type(iris[['species']])

In [None]:
iris.head(2)

In [None]:
iris.loc[0:10, 'sepal_length':'petal_width']
iris.iloc[0:10, 0:3]

In [None]:
iris.iloc[0:10, 0:2]

In [None]:
iris[iris['sepal_length'] < 5]

In [40]:
## Try:  Practice all these methods for selecting and slicing

In [None]:
iris.head()

In [None]:
iris.iloc[0:2, 0:]

In [None]:
iris['sepal_length']>6

In [None]:
## Try:  

iris[iris['sepal_length']>6]

---

## Looking at your DataFrame

``df.head()``  
``df.tail()``  
``df.shape``  
``df.info()``  
``df.describe()``   
``df.columns``

In [45]:
## Try: Explore the Iris Data

In [None]:
iris.info()

In [None]:
iris.columns

In [None]:
iris.index

In [None]:
iris.describe()

## Methods for computing summary and descriptive statistics
pandas objects have many reduction / summary statistics methods that extract a single value from the rows or columns of a DataFrame.  See Table 5-8 in *Python for Data Analysis* for a more complete list, but here are a few that are commonly used.

`count`: number of non-NA values   
`describe`: summary statistics for numerical columns   
`min`, `max`: min and max values  
`argmin`, `argmax`: integer position of the min and max values (available on a Series, not on a DataFrame)   
`idxmin`, `idxmax`: index or column name of min and max values  
`sum`: sum of values  
`mean`: mean of values  
`quantile`: quantile from 0 to 1 of values  
`var`: (sample) variance of values  
`std`: (sample) standard deviation of values  

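Most of these functions also take an `axis` argument which specifies the direction of the reduction: `axis=0` (the default) reduces down the rows, giving one value per column, and `axis=1` reduces across the columns, giving one value per row.   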
There is also an argument `skipna` which specifies whether or not to skip missing values.  The default is True.
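
Here is a small sketch of `axis` and `skipna` in action on a made-up DataFrame with one missing value.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'a': [1.0, 2.0, np.nan], 'b': [4.0, 5.0, 6.0]})

col_means = scores.mean()             # axis=0: one value per column
row_sums = scores.sum(axis=1)         # axis=1: one value per row
strict = scores.mean(skipna=False)    # NaN propagates when it is not skipped

print(col_means.tolist())      # [1.5, 5.0]
print(row_sums.tolist())       # [5.0, 7.0, 6.0]
print(strict['a'])             # nan
print(scores['b'].idxmax())    # 2  (index label of the largest value)
```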


In [None]:
iris[['sepal_length', 'sepal_width']].idxmax()

In [None]:
iris[['sepal_length', 'sepal_width']].max()

In [None]:
iris[iris.sepal_width==4.4]

### Unique values and value counts

``df.nunique()`` or ``df['column'].nunique()``  

``df.value_counts()`` or ``df['column'].value_counts()``
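
A short sketch with a made-up `animals` table. On a single column, `value_counts()` returns the counts sorted from most to least frequent; on a whole DataFrame it counts distinct rows instead.

```python
import pandas as pd

animals = pd.DataFrame({'species': ['cat', 'dog', 'cat', 'cat', 'bird'],
                        'legs': [4, 4, 4, 4, 2]})

print(animals['species'].nunique())   # 3
print(animals.nunique().tolist())     # [3, 2]  (one count per column)

counts = animals['species'].value_counts()
print(counts['cat'])                  # 3
print(counts.index[0])                # cat  (most frequent comes first)

# normalize=True returns proportions instead of counts
shares = animals['species'].value_counts(normalize=True)
```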

In [None]:
iris.species.value_counts()

In [None]:
iris.species.nunique()

### Correlation and covariance
`df.corr()` and `df.cov()` will produce the correlation or covariance matrix.  The correlation (or covariance) between two Series can be computed with `Series1.corr(Series2)` (or `Series1.cov(Series2)`).

Numpy functions can also be used: `np.corrcoef()`
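
A minimal sketch of the three approaches on made-up numeric data. In recent pandas versions, `df.corr()` may raise an error if the DataFrame contains non-numeric columns, so either select the numeric columns first or pass `numeric_only=True`.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                     'y': [2.0, 4.0, 6.0, 8.0],
                     'z': [4.0, 3.0, 2.0, 1.0]})

corr_matrix = data.corr()   # Pearson correlation by default
cov_matrix = data.cov()     # sample covariance

r_xy = data['x'].corr(data['y'])   # y = 2x, so r is 1 up to rounding
r_xz = data['x'].corr(data['z'])   # z decreases linearly in x, so r is -1

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r_np = np.corrcoef(data['x'], data['y'])[0, 1]

print(corr_matrix.shape)    # (3, 3)
```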

In [None]:
df.corr()

In [None]:
iris.iloc[:,0:4].corr()

---
## Dropping rows and columns

Columns and rows can be dropped with the `.drop()` method (using `axis=1` for columns and `axis=0`, the default, for rows).  This method returns a new object unless `inplace=True` is specified.

The `del` command can also be used to drop columns in place.
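
The difference between returning a new object and modifying in place is easy to miss, so here is a sketch on a throwaway `df_demo`.

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
                       index=['r1', 'r2', 'r3'])

no_b = df_demo.drop('b', axis=1)          # new object; df_demo still has 'b'
no_r1 = df_demo.drop('r1')                # axis=0 is the default, so this drops a row
also_no_b = df_demo.drop(columns=['b'])   # same as axis=1, but more explicit

del df_demo['c']                          # removes 'c' from df_demo itself

print(no_b.columns.tolist())      # ['a', 'c']
print(no_r1.index.tolist())       # ['r2', 'r3']
print(df_demo.columns.tolist())   # ['a', 'b']
```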

In [61]:
iris.drop(['species', 'sepal_length'], axis=1, inplace=True)

In [None]:
iris.head()

In [65]:
# Similar to using "inplace=True": 
# iris = iris.drop('species', axis=1)

In [66]:
iris.drop('sepal_width', axis=1, inplace=True)

In [None]:
iris.head()

## Adding columns

Add a new column to the end of a data frame
~~~
df['new_col'] = value
~~~

Add a new column at a specific index 

`.insert(col_index, 'new_col_name', value(s))` 
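
Both approaches are sketched below on a made-up `df_demo`. Note that `.insert()` modifies the DataFrame in place and returns `None`, so do not assign its result back to the variable.

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

df_demo['total'] = df_demo['a'] + df_demo['b']   # appended as the last column
df_demo['flag'] = True                           # a scalar is repeated for every row
df_demo.insert(0, 'id', ['x', 'y', 'z'])         # placed at column position 0

print(df_demo.columns.tolist())   # ['id', 'a', 'b', 'total', 'flag']
print(df_demo['total'].tolist())  # [5, 7, 9]
```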

In [None]:
iris.head()

In [70]:
iris.insert(0, 'new_column', np.random.randn(150))

In [None]:
iris.head()

## Missing Values

**Ways to count missing values**
~~~
df.info()
df.isna().sum()
df.isna().sum(axis=0)
~~~
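
A small sketch of what these counts look like, using a made-up `df_na`. Changing the axis or summing twice gives the per-row count and the overall total.

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, np.nan, 6.0],
                      'c': ['x', 'y', 'z']})

per_column = df_na.isna().sum()          # missing values in each column
per_row = df_na.isna().sum(axis=1)       # missing values in each row
total = int(df_na.isna().sum().sum())    # missing values in the whole table

print(per_column.tolist())   # [1, 2, 0]
print(per_row.tolist())      # [1, 2, 0]
print(total)                 # 3
```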

**Drop missing values with `.dropna()`**

Calling `.dropna()` without any arguments will drop every row that contains at least one missing value.

Arguments:
* `axis=1` will drop columns with missing values (default is `axis=0`)
* `how='all'` will drop rows (or columns) if all the values are NA (default is `how='any'`) 
* `subset=` will limit the NA search to these specific columns (or indexes)
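
The effect of each argument is sketched below on a made-up `df_na`; the comments note which rows or columns survive.

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan],
                      'b': [np.nan, np.nan, 6.0, 8.0],
                      'c': ['w', 'x', 'y', 'z']})

rows_complete = df_na.dropna()                 # only row 2 has no missing values
cols_complete = df_na.dropna(axis=1)           # only column 'c' has no missing values
# Drop a row only when BOTH 'a' and 'b' are missing (row 1)
not_all_missing = df_na.dropna(how='all', subset=['a', 'b'])
a_present = df_na.dropna(subset=['a'])         # look for missing values in 'a' only

print(rows_complete.index.tolist())     # [2]
print(cols_complete.columns.tolist())   # ['c']
print(not_all_missing.index.tolist())   # [0, 2, 3]
print(a_present.index.tolist())         # [0, 2]
```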
    

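**Fill missing values with `.fillna()`**

Arguments:
* `value`: value used to fill.
* `method`: method used to fill (`'ffill'` for forward fill or `'bfill'` for backward fill). Recent pandas versions deprecate this argument in favor of the `.ffill()` and `.bfill()` methods.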
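
A minimal sketch of the common fill strategies on a made-up Series. It uses `.ffill()` and `.bfill()` rather than the `method` argument, since I believe those are the forms that work across both older and newer pandas versions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_zero = s.fillna(0)           # fill with a constant
filled_mean = s.fillna(s.mean())    # fill with the mean of the non-missing values (2.5)
forward = s.ffill()                 # carry the last seen value forward
backward = s.bfill()                # pull the next seen value backward

print(filled_zero.tolist())   # [1.0, 0.0, 0.0, 4.0]
print(filled_mean.tolist())   # [1.0, 2.5, 2.5, 4.0]
print(forward.tolist())       # [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())      # [1.0, 4.0, 4.0, 4.0]
```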


In [None]:
pen.info()

In [73]:
pen.index = pen.species

In [None]:
pen.head()

In [None]:
pen.loc['Adelie', ['island', 'sex']]

In [None]:
pen.iloc[0:10, 0:3]

In [None]:
pen.info()

In [None]:
pen[(pen['bill_length_mm'] > 38) & (pen['bill_length_mm'] < 41)]

In [None]:
pen.isna().sum(axis=0)

In [None]:
len(pen)