<img src="files/img/pandas.png" alt="Operations Across Axes" />

# [Pandas](http://pandas.pydata.org/) - Python Data Analysis Library
---
## I shamelessly picked from these chaps:
 - [Daniel Chen - Pandas for Data Analysis](https://www.youtube.com/watch?v=oGzU688xCUs)
 - [Jeff Delaney - 19 Essential Snippets in Pandas](https://jeffdelaney.me/blog/useful-snippets-in-pandas/)
 - [Burke Squires - Intro to Data Analysis with Python](https://github.com/burkesquires/python_biologist/tree/master/05_python_data_analysis)


## General plan:
- What is Pandas all about?
- Brief intro to Pandas objects and syntax
- Build a DataFrame from NumPy test data and cover the basics
- Import the gapminder dataset and explore it interactively

## [Jupyter Notebook Shortcuts](http://maxmelnick.com/2016/04/19/python-beginner-tips-and-tricks.html)
- documentation:
```python
type?
```
- check function arguments using: shift + tab
- run current cell/block: shift + enter 
- insert cell above: esc + a
- delete cell: esc (hold) + d + d (double tap)

In [None]:
# try some shortcuts here


<img src="files/img/python-scientific-ecosystem.png" alt="Python Scientific Ecosystem" />

## What is Pandas?
- The go-to data analysis library for Python
    - Import and wrangle your raw data
    - Manipulate and visualize
- Allows for mixed data types in the same array

### The DataFrame is your friend!
- DataFrames are the primary object used in Pandas (it's like an Excel sheet)
- Each DataFrame has:
    - columns: the variables being measured
    - rows: the observations being made
    - index: the labels that identify each row (by default 0, 1, 2, ...)
- You'll also come across an object called a Series, which can be thought of as a single column from a DataFrame
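
A minimal sketch of those pieces, using a made-up three-row table (the column names, row labels, and values are purely illustrative and not part of the workshop data):

```python
import pandas as pd

# hypothetical toy data: two variables measured on three observations
toy = pd.DataFrame({'height_cm': [150, 162, 171],
                    'name': ['ann', 'bob', 'cat']},
                   index=['x', 'y', 'z'])

print(toy.columns.tolist())   # the variables
print(toy.index.tolist())     # the row labels
print(len(toy))               # number of observations

# pulling out one column gives a Series that shares the DataFrame's index
heights = toy['height_cm']
print(type(heights).__name__)
```

Note that the two columns hold different data types (integers and strings), which is the "mixed data types" point made above.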

<img src="files/img/dataframe.jpg" alt="Pandas DataFrame" />

# Creating A Simple Pandas DataFrame From NumPy Test Data
---
## Create A NumPy Array Of Test Data

In [None]:
# check numpy version
import numpy as np
print('NumPy version:',np.version.version)

In [None]:
# create a 20x4 numpy ndarray (20 rows, 4 columns) using numpy.random.randint()
np.random.seed(0)
array = np.random.randint(0,100,size=(20,4))
array

In [None]:
# check the array and the type
type(array)

## Create A Pandas DataFrame

In [None]:
# import the pandas library
import pandas as pd
print('Pandas version:', pd.__version__)

In [None]:
# check out the documentation
pd.DataFrame?

In [None]:
# create a Pandas DataFrame from the NumPy ndarray
df = pd.DataFrame(data=array, index=None, columns=None, dtype=None)

## Exploring A DataFrame

In [None]:
# check out what the dataframe looks like
df

In [None]:
# check the shape
df.shape

In [None]:
# you can also use the len() function to get the number of rows/observations
len(df)

In [None]:
# get a concise summary of the DataFrame with .info()
df.info()

In [None]:
# view brief descriptive stats of the DataFrame
df.describe()

In [None]:
# view the top 5 rows
# you can input how many rows you want, default is 5
df.head() 

In [None]:
# view the bottom 5 rows
df.tail()

In [None]:
# take a sample of random rows/observations
df.sample(5)

## Manipulating DataFrame Columns (Variables)

In [None]:
# view current columns 
df.columns

In [None]:
# send current column names to a list
cols = df.columns.tolist()
cols

In [None]:
# change column names
# create a list of new column names (same length as columns)
cols = ['a', 'b', 'c', 'd']

# set "df.columns" to the new list of names
df.columns = cols

# check the change by viewing the head of the dataframe
df.head()

In [None]:
# select an individual column
df['a']

In [None]:
# notice that a single column is a "Series" object in Pandas
type(df['a'])

In [None]:
# insert a new column and put a string in each cell
df['new_column'] = 'cheese'
df.head()

In [None]:
# shuffling column positions

# set current column order to a list object
cols = df.columns.tolist()
print('Starting column order:', cols)

# manipulate column names as a list object
# reverse column order
rev_order = cols[::-1]
print('Reverse column order:', rev_order)

# move last column to first
new_order = cols[-1:] + cols[:-1]
print('Last to first order:', new_order)

# set the column order (creates new dataframe)
df = df[new_order]
df.head()

In [None]:
# delete a column
del df['new_column']
df.head()

In [None]:
# alternate way to delete a column (or row): drop() returns a new DataFrame
# axis=1 means the labels given are column names, axis=0 means they are row labels

# save the column to add back later
a = df['a']

# use drop() to remove column
df.drop(['a'], axis=1) # does this really delete the column?

In [None]:
# recheck to see if column 'a' was dropped
df.head()

In [None]:
# drop column 'a' properly
df2 = df.drop(['a'], axis=1)
df2.head()

In [None]:
# but we still have the original DataFrame with 4 columns
df.head()

## Calculating Values from the DataFrame

In [None]:
# Pandas has basic arithmetic functions built-in
df.sum()

## Axis Key
- 0 == Calculate statistic for each column
- 1 == Calculate statistic for each row
  
<img src="files/img/python-operations-across-axes.svg" alt="Operations Across Axes" />

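* The axis numbers mean the same thing in drop(): axis=0 drops rows by label and axis=1 drops columns by label. It can feel reversed because sum(axis=1) returns one value per row, while drop(..., axis=1) removes a column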
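
A small sketch of how `axis` behaves for both a reduction and `drop()`, on a made-up table (the name `axes_demo` and its values are hypothetical). In both calls axis 0 refers to the rows and axis 1 to the columns. A reduction collapses that axis, while `drop()` looks up labels on it.

```python
import pandas as pd

# hypothetical 2x3 table for illustration
axes_demo = pd.DataFrame({'a': [1, 2], 'b': [10, 20], 'c': [100, 200]})

col_totals = axes_demo.sum(axis=0)   # collapses the rows: one value per column
row_totals = axes_demo.sum(axis=1)   # collapses the columns: one value per row

# drop(axis=1) looks for the label 'a' among the columns;
# it returns a new DataFrame and leaves axes_demo untouched
without_a = axes_demo.drop(['a'], axis=1)

print(col_totals.to_dict())
print(row_totals.tolist())
print(without_a.columns.tolist(), axes_demo.columns.tolist())
```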

In [None]:
# set the axis to apply functions across rows or down columns
df.sum(axis=1)

In [None]:
# get the sum of specific columns
df[['a', 'b']].sum(axis=1)

In [None]:
# add a new column called "sum" containing the sum of each row
df['sum'] = df.sum(axis=1)
df.head()

In [None]:
# does the sum stay updated?
# insert a new column of random integers
df['e'] = np.random.randint(0,100, size=df.shape[0])
df.head()
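
To make the point of the cell above concrete, here is a hedged sketch on a tiny made-up table (`static_demo` is a hypothetical name). A computed column is a snapshot of the values at the time it was created, so it has to be recomputed by hand after the data changes.

```python
import pandas as pd

# hypothetical data for illustration
static_demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
static_demo['sum'] = static_demo.sum(axis=1)      # computed once from a and b

static_demo['e'] = [100, 200]                     # add another column afterwards
before = static_demo['sum'].tolist()              # the old totals are unchanged
print(before)

# recompute explicitly, leaving the old 'sum' column out of the new total
static_demo['sum'] = static_demo.drop(columns='sum').sum(axis=1)
print(static_demo['sum'].tolist())
```

Dropping the old `sum` column before summing avoids counting the previous total twice.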

---
## Exercise
1. Make an object titled "df3" that is a copy of "df"
2. Add a column to "df3" that expresses the [mean](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html) of the values in columns a, b, c, and d across each row 
    * What axis refers to calculations across a row?
3. Make an object titled "df4" that is a copy of "df3" and delete column "c" from df4 
    * Do the values in the "mean" column change?

In [None]:
# Make an object titled "df3" that is a copy of "df"


In [None]:
# create a mean column in df3 that calculates the mean across each row
# be sure to exclude the sum value from the mean


In [None]:
# make a copy of df3 named df4, delete column "c"

---
## Combining DataFrames
Before getting into the manipulation of DataFrame rows, it helps to understand a bit more about index values and combined DataFrames.
<img src="files/img/concat_axis0.png" alt="concat axis0" />

In [None]:
# create two new DataFrames from NumPy values
df_index1 = pd.DataFrame(np.random.randint(0,100, size = (50,4)), 
                         columns = ['a', 'b', 'c', 'd'])
df_index2 = pd.DataFrame(np.random.randint(0,100, size = (50,4)), 
                         columns = ['a', 'b', 'c', 'd'])

In [None]:
# Use pd.concat() to combine the two DataFrames by stacking vertically (axis=0)
cat_df = pd.concat([df_index1, df_index2], axis=0) # what happens if the axis is 1?
cat_df.shape

In [None]:
# check the index
cat_df.index

In [None]:
# reset_index() will generate a column with the old index
# use this function when you want to reset the order of the index
reset = cat_df.reset_index()
# reset = cat_df.reset_index(drop=True) # use this to drop the new column with old index
reset.tail()
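
A compact sketch of the same ideas on two made-up two-row frames (the names `left`, `right`, `tall`, and `wide` are hypothetical). It shows what each `axis` does to the shape and what `reset_index()` does with the old labels.

```python
import pandas as pd

# hypothetical frames with identical default indexes
left = pd.DataFrame({'a': [1, 2]})
right = pd.DataFrame({'a': [3, 4]})

tall = pd.concat([left, right], axis=0)     # stack rows; the old labels repeat
wide = pd.concat([left, right], axis=1)     # place side by side, aligned on the index

kept = tall.reset_index()                   # old labels move into an 'index' column
clean = tall.reset_index(drop=True)         # old labels are discarded

print(tall.index.tolist(), tall.shape, wide.shape)
print(kept.columns.tolist(), clean.index.tolist())
```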

## Manipulating DataFrame Rows (Observations)
Two important indexers to introduce here are loc[ ] and iloc[ ]
- loc[ ] - selects rows by index label (the value stored in the index)
- iloc[ ] - selects rows by integer position (0 is the first row)  
You may come across ix[ ] to select rows, but it has been deprecated and was removed in pandas 1.0

Tips for specifying indexers:
- series.loc[indexer]
- dataframe.loc[row_indexer, column_indexer]
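
The difference is easiest to see when index labels repeat, as they do after stacking two frames. A minimal sketch with made-up data (`stacked` and the other names are hypothetical):

```python
import pandas as pd

# hypothetical frames; stacking them repeats the labels 0 and 1
top = pd.DataFrame({'a': [1, 2]})
bottom = pd.DataFrame({'a': [3, 4]})
stacked = pd.concat([top, bottom], axis=0)

by_label = stacked.loc[0]        # every row whose label is 0: a two-row DataFrame
by_position = stacked.iloc[0]    # only the first row: a Series

# dataframe.iloc[row_indexer, column_indexer]
cell = stacked.iloc[3, 0]        # fourth row, first column

print(by_label['a'].tolist(), by_position['a'], cell)
```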

In [None]:
# We're going back to the concatenated Dataframe that was NOT reindexed

# select a row based on the index value using loc
cat_df.loc[0]

In [None]:
# compare this with selecting a row using iloc
cat_df.iloc[0]

In [None]:
# just for giggles: ix was removed in pandas 1.0, so the line below only
# works on older versions (uncomment it to see what happens on yours)
# cat_df.ix[0]

In [None]:
# changing all the values in a specific row
cat_df.iloc[0] = [44, 45, 46, 47]
cat_df.head()

In [None]:
# change a value by label; the concatenated index has two rows labelled 0, so both are updated
cat_df.loc[0,'d'] = 50
cat_df.head()

In [None]:
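# add a row using loc[] with a label that is not in the index yet
# len(cat_df) is 100 here, and the existing labels only run from 0 to 49
cat_df.loc[len(cat_df)] = [1,2,3,4]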
cat_df.tail()

In [None]:
# delete the row that was just added using drop()
new = cat_df.drop(100)
new.tail()

## Exporting A DataFrame
If you want to save your work, you can [pickle](https://ianlondon.github.io/blog/pickling-basics/) the DataFrame or export it as a file.

In [None]:
# saving the "new" DataFrame as a .csv file, can export in multiple file types
new.to_csv('my_dataframe.csv', sep=',')
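
For comparison, a hedged sketch of both routes on a made-up DataFrame. The file names are placeholders, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile
import pandas as pd

# hypothetical data; the file names below are placeholders
export_demo = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, 'my_dataframe.pkl')
    csv_path = os.path.join(tmp, 'my_dataframe.csv')

    export_demo.to_pickle(pkl_path)              # binary format, keeps dtypes and index
    from_pickle = pd.read_pickle(pkl_path)

    export_demo.to_csv(csv_path, index=False)    # index=False leaves out the row labels
    from_csv = pd.read_csv(csv_path)

print(from_pickle.equals(export_demo))
print(from_csv.columns.tolist())
```

A pickle is convenient for picking up your own work later, while a .csv is the safer choice for sharing, since pickles should only be loaded from sources you trust.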

## Plots
In general, the [Matplotlib](https://matplotlib.org/) library is the go-to for plots and figures, but Pandas has a plot() function that uses matplotlib to generate basic visualizations.

In [None]:
# import matplotlib.pyplot
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# you can use the Pandas plot() function to return a matplotlib.axes.AxesSubplot object
plot = reset.plot(x=None, y=None, kind='line')
plot.set_xlabel('x label')
plot.set_ylabel('y label')

In [None]:
# how to save a figure

# save the figure by using get_figure() to extract the plot as 
# a matplotlib.figure.Figure object
fig = plot.get_figure()
fig.savefig('figure.png')

## Exercise
* Create a bar plot for the first 20 values in column "a" in the "reset" DataFrame

In [None]:
bar = reset['a'][:20].plot(kind='bar')

# Working With Heterogeneous Data
---
## Import a .csv as a Pandas DataFrame

In [None]:
# import a delimited text file (here a tab-separated .tsv) to a DataFrame
df = pd.read_csv('data/gapminder.tsv', 
                 sep='\t', # the delimiter in the file
                 header='infer', # row with names of the columns 
                 names=None, # change the names of the columns
                 index_col=None, # column to use for the row index
                 usecols=None) # what columns to use
df.head()

In [None]:
# strip-down column names
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('-','_')
df.head()

## Take A Glance At The DataFrame

In [None]:
# shape, columns, values


## Techniques To Filter Data

In [None]:
# Sorting
# Why not sort the df by year in ascending order
df = df.sort_values('year', axis=0, ascending=True)
df = df.reset_index(drop=True)
df.head()

In [None]:
# Unique Values
# Get a list of the countries represented using unique()
countries = df.country.unique()
len(countries)

In [None]:
# What about continents?
df.continent.unique()

In [None]:
# Groupby
# How would I get a list of the countries that fall within "Oceania"?
df.groupby('continent')['country'].unique()['Oceania']

In [None]:
# nunique()
# How many countries are represented by each continent?
df.groupby('continent')['country'].nunique()

In [None]:
# What about a dictionary containing all the countries for each continent?
conts = df.groupby('continent')['country'].unique()
cont_dict = conts.to_dict()
cont_dict['Africa']

In [None]:
len(cont_dict['Africa'])
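
If the gapminder file is not at hand, the same groupby pattern can be tried on a tiny stand-in table. The rows below are made up for illustration and only mimic the `continent` and `country` columns.

```python
import pandas as pd

# hypothetical mini-table standing in for the gapminder columns
mini = pd.DataFrame({'continent': ['Oceania', 'Oceania', 'Oceania', 'Asia'],
                     'country':   ['Australia', 'Australia', 'New Zealand', 'Japan']})

names = mini.groupby('continent')['country'].unique()     # one array of names per group
counts = mini.groupby('continent')['country'].nunique()   # one count per group

print(list(names['Oceania']))
print(counts.to_dict())
```

`unique()` keeps the names in order of first appearance, and `nunique()` counts them, so repeated rows for the same country are only counted once.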

In [None]:
# Filter Using Conditional Logic
# What if we just want a DataFrame of all the African countries?
africa = df[df['continent'] == 'Africa']
africa.head()

In [None]:
# reset the index
africa = africa.reset_index(drop=True)

In [None]:
# explore the new africa df
africa.describe()

In [None]:
# create a new series with the mean gdp per cap for each country in africa
mean_gdp_country = africa.groupby('country')['gdp_per_cap'].mean()
mean_gdp_country.plot(kind='hist')

In [None]:
# create a boxplot
mean_gdp_country.plot(kind='box')

In [None]:
# Filter further with conditional logic
# create a DataFrame that cuts down the outliers
filt_africa = africa[africa.gdp_per_cap < 2500]
plot = filt_africa['gdp_per_cap'].plot(kind = 'hist')

In [None]:
# apply()
# create a category/bins and apply to gdp per cap

# start with a function
def func(x):
    if x <=500:
        return 'low'
    elif 500< x <1750:
        return 'mid'
    else:
        return 'high'

# apply the function to each value in the gdp_per_cap column
africa['gdp_category'] = africa['gdp_per_cap'].apply(func)
africa.head()

In [None]:
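# we could also use the cut() function
# note: the 'mid' bin here ends at 2750, not the 1750 used in func(), so the two columns will differ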
bins = [0,500, 2750, 70000]
names = ['low', 'mid', 'high']
africa['new_categories'] = pd.cut(africa.loc[:,'gdp_per_cap'], bins, labels=names)
africa.head()
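
One detail worth checking with `cut()` is how values on a bin edge are handled. The sketch below uses made-up values and assumes the same thresholds as `func()` above (500 and 1750). By default each bin includes its right edge and excludes its left edge.

```python
import pandas as pd

# hypothetical values chosen to sit on and around the bin edges
values = pd.Series([100, 500, 501, 1750, 3000])
edges = [0, 500, 1750, 70000]
labels = ['low', 'mid', 'high']

# default right=True gives the bins (0, 500], (500, 1750], (1750, 70000]
binned = pd.cut(values, bins=edges, labels=labels)
print(binned.tolist())

# a value equal to the lowest edge falls outside every bin and becomes NaN
outside = pd.cut(pd.Series([0]), bins=edges, labels=labels)
print(outside.isna().tolist())
```

This means 1750 lands in "mid" with `cut()`, whereas `func()` sends it to "high", so the two approaches can disagree exactly on the edges.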

# Misc.
---
## Create a Pandas DataFrame from a dictionary

In [None]:
# create a dictionary object
my_dict = {'a':['cheese', 'dog', 'goat', '4h'], 'b':['lush','planet', '2017', 'la trance'] }

# create a pandas DataFrame from a dictionary
df = pd.DataFrame(my_dict)
df