# Pandas

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
data = sns.load_dataset('iris')
data.head()

## Indexing

There are often several ways to index the same data, but the pandas documentation recommends the dedicated indexing accessors: "While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, `.at`, `.iat`, `.loc` and `.iloc`."

For more on how `.loc` (label-based) and `.iloc` (position-based) differ, see this Stack Overflow answer: https://stackoverflow.com/questions/31593201/how-are-iloc-and-loc-different
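The quoted recommendation also names `.at` and `.iat`, which are the scalar counterparts of `.loc` and `.iloc`. Here is a minimal sketch of all four accessors, assuming a small hand-built frame with iris-like column names and a string index (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame with iris-like column names; the values are illustrative only.
df = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 6.3], "species": ["setosa", "setosa", "virginica"]},
    index=["a", "b", "c"],
)

by_label = df.at["b", "sepal_length"]  # single value, by row label and column name
by_position = df.iat[1, 0]             # single value, by integer row and column position

row_by_label = df.loc["c"]             # whole row as a Series, selected by label
row_by_position = df.iloc[2]           # the same row, selected by position
```

The string index makes the distinction visible: `"c"` is a label, while `2` is a position.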

**Index Single Column by Name**

*As a series*

In [None]:
a = data['sepal_length'].head()

# Or

b = data.sepal_length.head()

# Or

c = data.loc[:, 'sepal_length'].head()  # Preferred!

assert a.equals(b)
assert b.equals(c)

c

*As a DataFrame*

In [None]:
a = data[['sepal_length']].head()

# Or

b = data.loc[:, ['sepal_length']].head()  # Preferred!

assert a.equals(b)
b

**Index Multiple Columns by Name**

In [None]:
a = data[['sepal_length', 'sepal_width']].head()

# Or

b = data.loc[:, ['sepal_length', 'sepal_width']].head()  # Preferred!
assert a.equals(b)
b

**Indexing by Row Index**

*Single row (series)*

In [None]:
a = data.iloc[1, :]
b = data.iloc[1]
assert a.equals(b)
a

*Single row (DataFrame)*

In [None]:
a = data.iloc[[1], :]
b = data.iloc[[1]]
assert a.equals(b)
b

*Multiple rows (slice syntax)*

In [None]:
a = data[5:].head()

# Or

b = data.iloc[5:].head()  # Preferred
assert a.equals(b)
b

**Indexing rows and columns**

In [None]:
data.iloc[:3, :3]

In [None]:
data.iloc[[90, 95], [1, 2, 4]]

**By condition**

In [None]:
data.loc[data.species == 'versicolor'].head()

### Boolean Indexing

In general, you can pass a boolean Series as the indexer to select the rows (or, with `.loc`, the columns) where it is `True`.

In [None]:
data[data.sepal_length > 6.0].head()

*By certain column values*

In [None]:
data[data.species.isin(['virginica', 'setosa'])].head()
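Boolean masks can also be combined. The sketch below uses a small hand-built frame with made-up values rather than the full iris data; it shows `&` (and), `|` (or), and `~` (not), and a mask paired with a column list in `.loc`:

```python
import pandas as pd

# Toy frame with iris-like column names; the values are illustrative only.
df = pd.DataFrame({
    "sepal_length": [5.1, 6.4, 6.9, 5.8],
    "species": ["setosa", "versicolor", "virginica", "virginica"],
})

# Each comparison needs its own parentheses: & and | bind tighter than > and ==.
long_virginica = df.loc[(df.sepal_length > 6.0) & (df.species == "virginica")]
short_or_setosa = df.loc[(df.sepal_length < 5.5) | (df.species == "setosa")]
not_virginica = df.loc[~(df.species == "virginica")]

# A mask in the row slot can be paired with a list of column names.
lengths_only = df.loc[df.sepal_length > 6.0, ["sepal_length"]]
```

Python's `and`, `or`, and `not` do not work element-wise on a Series, which is why the bitwise operators are used here.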

## Other

**Column names**

In [None]:
# As an index
data.columns

# As a list
list(data)
list(data.columns)

**Sorting**

Sort by a column (low to high)

In [None]:
data.sort_values(by='sepal_length').head()

High to low:

In [None]:
data.sort_values(by='sepal_length', ascending=False).head()

**WARNING**: `.loc` selects rows by index *label*, **not** by integer position:

In [None]:
new_data = data.copy()
new_data.index = range(150, 300)

# Both access the first row!
new_data.loc[150].equals(data.loc[0])
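The same pitfall shows up after sorting, because `sort_values` reorders the rows but keeps their original index labels. A minimal sketch, assuming a three-row toy frame with made-up values:

```python
import pandas as pd

# Toy frame; the values are illustrative only.
df = pd.DataFrame({"sepal_length": [5.1, 4.6, 7.0]})
sorted_df = df.sort_values(by="sepal_length")  # index labels are now 1, 0, 2

first_by_position = sorted_df.iloc[0]  # smallest value: position 0 after sorting
row_labelled_zero = sorted_df.loc[0]   # the row that was first before sorting

# reset_index(drop=True) renumbers the labels so they match positions again.
renumbered = sorted_df.reset_index(drop=True)
```

After the reset, `renumbered.loc[0]` and `renumbered.iloc[0]` refer to the same row.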

## Apply

*Apply to each column*

In [None]:
data.drop(columns='species').apply(sum)  # Apply to each column

*Apply to each row*

In [None]:
data.drop(columns='species').apply(sum, axis=1).head()
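`apply` also accepts custom functions. The sketch below, on a toy frame with made-up values, passes a lambda that receives each column (default `axis=0`) or each row (`axis=1`) as a Series:

```python
import pandas as pd

# Toy frame with iris-like column names; the values are illustrative only.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "sepal_width": [3.5, 3.0, 3.3],
})

# Column-wise (default axis=0): the function receives each column as a Series.
value_range = df.apply(lambda col: col.max() - col.min())

# Row-wise (axis=1): the function receives each row as a Series.
length_to_width = df.apply(lambda row: row["sepal_length"] / row["sepal_width"], axis=1)

# The same ratio written as a vectorized expression, without apply.
vectorized = df["sepal_length"] / df["sepal_width"]
```

When a vectorized expression exists, as it does for the ratio here, it is usually shorter and faster than a row-wise `apply`.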

## Group By

*Apply mean to each species*

In [None]:
data.groupby('species').mean()
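A group-by is not limited to a single statistic. As a sketch on a toy frame with made-up values, `agg` can compute several statistics for one column or give each output column a name (the names `mean_length` and `max_width` are chosen here for illustration):

```python
import pandas as pd

# Toy frame with iris-like column names; the values are illustrative only.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.0, 5.2, 6.0, 7.0],
    "sepal_width": [3.4, 3.6, 3.0, 3.2],
})

# One statistic for every remaining column.
means = df.groupby("species").mean()

# Several statistics for a single column.
length_stats = df.groupby("species")["sepal_length"].agg(["min", "max", "mean"])

# Named aggregation: output_name=(input_column, function).
summary = df.groupby("species").agg(
    mean_length=("sepal_length", "mean"),
    max_width=("sepal_width", "max"),
)
```

In each result the group keys (`species`) become the index.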