# SheHacks 2020: Pandas Workshop

### Introduction

Pandas is a popular Python library providing high-performance, easy-to-use data structures and data analysis tools.  It is inspired by R's `data.frame` and bears many resemblances to SQL and other tabular data tools.

This workshop will serve as a basic introduction to pandas' capabilities.  For more, I encourage you to read the docs, browse blog posts, and read questions and answers on Stack Overflow.

Pandas documentation: https://pandas.pydata.org/pandas-docs/version/1.0.0/index.html

A set of useful examples from Chris Albon: https://chrisalbon.com/

Stack Overflow: https://stackoverflow.com/questions/tagged/pandas


### How To Install Pandas

Pandas relies on several other Python libraries.  The easiest way to install pandas is to download the Anaconda distribution of Python.  **Be sure to download Python 3.x since Python 2.7 is no longer being maintained**.

Anaconda: https://www.anaconda.com/distribution/

If you already have a Python installation with `pip`, you can install pandas with

`pip install pandas`


### What You Will Learn

This will be an introductory workshop intended to get you comfortable working with dataframes.  I'm going to focus on the basics but will also show you some other things pandas is capable of.  I'll do my best to follow the Pareto principle: 80% simple and useful things, 20% cool things that you can reference should you need them.

Here is a rough schedule:

* Loading data into Python with pandas

* Working with data

* Summarizing data

* Plotting data

* Saving data to file with pandas

Let's get started!

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display

%matplotlib inline

___
### Loading In Data

The easiest way to load data into pandas is with one of the `read_*` functions, such as `read_csv`.  Pandas can read many different data formats.

pandas I/O functions: https://pandas.pydata.org/pandas-docs/version/1.0.0/user_guide/io.html

Besides CSV, you can also read SQL, JSON, HTML, and Excel files.
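As a minimal sketch of how the `read_*` functions behave, the example below reads a small CSV from an in-memory string instead of a file on disk.  The sample rows and the names `csv_text` and `sample` are made up for illustration, so you can run it without the workshop's data file.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = """created_at,humidity,precipType,wr
2018-01-02 06:28:33,0.83,NoPrecip,9.0
2018-01-02 06:58:21,0.83,NoPrecip,10.0
2018-01-02 07:27:11,0.83,NoPrecip,10.0
"""

# read_csv accepts a file path or any file-like object.
# parse_dates converts the named column from text to datetimes.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["created_at"])

print(sample.dtypes)  # one dtype per column
print(sample.shape)   # (number of rows, number of columns)
```

Checking `dtypes` right after loading is a cheap way to catch columns that were read as text when you expected numbers or dates.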

In [10]:
# dataframes are usually denoted by df, but you should think of a better name
df = pd.read_csv('2018_data.csv') #reads in a csv into python.



#Now you have data!  Let's see what it looks like

df.head() #Shows you the first 5 rows of your data

| Unnamed: 0 | created_at | apparentTemperature | humidity | precipType | wr |
|---|---|---|---|---|---|
| 0 | 2018-01-02 06:28:33 | -23.27 | 0.83 | NoPrecip | 9.0 |
| 1 | 2018-01-02 06:58:21 | -23.27 | 0.83 | NoPrecip | 10.0 |
| 2 | 2018-01-02 07:27:11 | -24.22 | 0.83 | NoPrecip | 10.0 |
| 3 | 2018-01-02 07:58:38 | -24.22 | 0.83 | NoPrecip | 7.0 |
| 4 | 2018-01-02 08:27:15 | -19.47 | 0.85 | NoPrecip | 18.0 |


___
### Working With Data

There may come a time when you just need to work with a single column or a single row.  There are a few ways to go about doing this in pandas.

The main tools for this part are the `loc` and `iloc` indexers.  `loc` selects by label and `iloc` selects by integer position; both let you slice your dataframe much like you would slice an array.

`loc` docs: https://pandas.pydata.org/pandas-docs/version/1.0.0/reference/api/pandas.DataFrame.loc.html?highlight=loc#pandas.DataFrame.loc

`iloc` docs: https://pandas.pydata.org/pandas-docs/version/1.0.0/reference/api/pandas.DataFrame.iloc.html?highlight=iloc#pandas.DataFrame.iloc

#### Getting A Column

1) `df['column_name']` -- Extracts the column named `'column_name'` and returns a Pandas Series. 

2) `df.column_name` -- Does the same as above.  Can only be used when your column name has no spaces or special characters and does not clash with an existing dataframe attribute such as `mean`.

3) `df.loc[:, 'column_name']` -- Equivalent to method 1).

4) `df.iloc[:, column_number]` -- Extracts columns based on their integer position.  Positions start at 0, so if you need the fifth column, use `df.iloc[:, 4]`.

Let's try it together. Let's extract the `wr` column using these four different ways.



In [17]:
#1)
method_1 = df['wr']

#2)
method_2 = df.wr

#3)
method_3 = df.loc[:,'wr']

#4) 
method_4 = df.iloc[:,4]

#Print each of these out to see what they look like.  Verify they have returned the exact same thing.
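
If you want to run the comparison without the workshop's CSV file, here is a minimal sketch on a made-up dataframe.  The name `toy` and its values are hypothetical; only the column order mirrors the data above.  It uses `Series.equals` to check that all four selections match.

```python
import pandas as pd

# Hypothetical stand-in for the workshop data, with the same column order.
toy = pd.DataFrame({
    "created_at": ["2018-01-02 06:28:33", "2018-01-02 06:58:21", "2018-01-02 07:27:11"],
    "apparentTemperature": [-23.27, -23.27, -24.22],
    "humidity": [0.83, 0.83, 0.83],
    "precipType": ["NoPrecip", "NoPrecip", "NoPrecip"],
    "wr": [9.0, 10.0, 10.0],
})

by_brackets = toy["wr"]
by_attribute = toy.wr
by_loc = toy.loc[:, "wr"]
by_iloc = toy.iloc[:, 4]  # wr is the fifth column, so its position is 4

# Compare the first selection against each of the others.
all_same = (
    by_brackets.equals(by_attribute)
    and by_brackets.equals(by_loc)
    and by_brackets.equals(by_iloc)
)
print(all_same)
```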

#### Getting Several Columns

Almost certainly there will come a time when you need to remove some columns and work with a subset of your data.  Many of the methods above have natural extensions that let you select several columns.

Let's try it together.  Let's extract the `wr` column and the `created_at` column using the methods above (attribute access, method 2, does not extend to several columns).

In [20]:
#1)
method_1 = df[['wr', 'created_at']]

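#2) Attribute access (df.wr) returns one column at a time, so it can't be used here.


#3)
method_3 = df.loc[:,['wr','created_at']]

#4) The positions are listed in the same order as the names above: wr is 4, created_at is 0.
method_4 = df.iloc[:,[4,0]]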

#Print each of these out to see what they look like.  Verify they have returned the exact same thing.
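
One thing worth checking when selecting several columns is that the order of the list decides the order of the columns in the result.  The sketch below shows this on a made-up three-column dataframe (`toy` and its values are hypothetical, and the positions differ from the workshop data because it has fewer columns).

```python
import pandas as pd

# Hypothetical three-column frame: created_at is position 0, wr is position 2.
toy = pd.DataFrame({
    "created_at": ["2018-01-02 06:28:33", "2018-01-02 06:58:21"],
    "humidity": [0.83, 0.85],
    "wr": [9.0, 18.0],
})

by_brackets = toy[["wr", "created_at"]]
by_loc = toy.loc[:, ["wr", "created_at"]]
by_iloc = toy.iloc[:, [2, 0]]  # same order as the names above

print(by_brackets.equals(by_loc), by_brackets.equals(by_iloc))

# Listing the positions the other way round swaps the columns,
# so this frame does not compare equal to the ones above.
swapped = toy.iloc[:, [0, 2]]
print(list(swapped.columns), by_brackets.equals(swapped))
```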



#### Getting Rows of Data

Suppose you want to get a few rows of your data to work with.  You can subset and slice out rows using the following two ways:

1) `df.loc['row_name', :]` -- Gets all columns for the rows whose index label is `row_name`.  Useful when your dataframe is indexed by a unique identifier.

2) `df.iloc[row_numbers, :]` -- Gets rows based on their integer position.  For instance, to get the first 5 rows, you could do `df.iloc[:5, :]`.

Let's try it together.  Select the first 10 rows, the last 10 rows, and rows 45-55.

In [28]:
# First 10
first_10 = df.iloc[:10,:]

# Last 10
last_10 = df.iloc[-10:, :]

# Rows 45-55 (the stop position 56 is excluded, so this returns 11 rows)
middle_rows = df.iloc[45:56, :]
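
The cell above only uses `iloc`.  As a rough sketch of the label-based alternative, here is a made-up dataframe whose index holds text labels (`toy` and the labels `mon` to `thu` are hypothetical).

```python
import pandas as pd

toy = pd.DataFrame(
    {"humidity": [0.83, 0.85, 0.80, 0.78], "wr": [9.0, 18.0, 12.0, 7.0]},
    index=["mon", "tue", "wed", "thu"],  # hypothetical row labels
)

one_row = toy.loc["tue", :]            # a single label returns a Series
label_slice = toy.loc["tue":"wed", :]  # label slices include both endpoints
first_two = toy.iloc[:2, :]            # position slices exclude the stop
last_two = toy.iloc[-2:, :]            # negative positions count from the end

print(one_row)
print(label_slice)
```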



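You can combine the row selection and column selection methods too!  If you wanted the first 10 rows of the `wr` column, you could do...

`df.iloc[:10, 4]`

or, with the default integer index, `df.loc[:9, 'wr']`.  Note that `loc` slices by label and includes the end label, so `df.loc[:10, 'wr']` would return 11 rows.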
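
When you combine row and column selection, keep in mind that `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the stop.  Here is a minimal sketch on a made-up 20-row dataframe with the default integer index (`toy` is hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({
    "humidity": [i / 100 for i in range(20)],
    "wr": [float(i) for i in range(20)],
})

with_loc = toy.loc[:10, "wr"]  # labels 0 through 10
with_iloc = toy.iloc[:10, 1]   # positions 0 through 9 of the second column

print(len(with_loc), len(with_iloc))

# Stopping the label slice at 9 matches the positional slice.
print(toy.loc[:9, "wr"].equals(with_iloc))
```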


---
### Summarizing Data

Whether you're taking the mean of your data to report to your boss or doing more advanced summaries, pandas offers a multitude of ways to summarize data.  Let's talk about some of them together.

Dataframes have a bunch of methods you can call to get summaries of your data.  For instance, we can call `df.mean()` to get the mean of every numeric column.  In pandas 1.0, non-numeric columns are silently omitted from the result; in pandas 2.0 and later you need to pass `numeric_only=True` to get that behaviour.  Let's try it below.

In [30]:
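# Get the mean of every numeric column
# (numeric_only=True skips text columns; it is required in pandas 2.0 and later)
df.mean(numeric_only=True)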

apparentTemperature     6.906356
humidity                0.739849
wr                     70.747821
dtype: float64

There are a ton of methods like this!

* `.std()` -- Standard deviation
* `.median()` -- Median
* `.count()` -- Counts non-missing cells.  Very valuable for finding missing data!
* `.quantile()` -- Gets quantiles of each numeric column.  Pass the quantiles you want (the default is 0.5, the median).  For example, `df.quantile([0.25, 0.5, 0.75])` will get the quartiles of the data.

If you apply these methods directly to the dataframe, then pandas will summarize every column.  If you want to summarize only a single column, first extract it using the methods we practiced above and then summarize it.

Let's try it below.

In [42]:
# Get some summaries of the data
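
Here is one possible way to fill in the cell, sketched on a made-up dataframe so it runs without the workshop's CSV file (`toy` and its values are hypothetical).  It selects the numeric columns first with `select_dtypes`, which is one way to keep text columns out of the summaries.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one text column and one missing humidity value.
toy = pd.DataFrame({
    "precipType": ["NoPrecip", "rain", "rain", "snow"],
    "humidity": [0.80, 0.90, np.nan, 0.70],
    "wr": [10.0, 20.0, 30.0, 40.0],
})

# Keep only the numeric columns before summarizing.
numeric = toy.select_dtypes(include="number")

print(numeric.std())                         # standard deviation per column
print(numeric.median())                      # median per column
print(toy.count())                           # non-missing cells per column
print(numeric.quantile([0.25, 0.5, 0.75]))   # quartiles per column

# Summarize a single column by extracting it first.
wr_median = toy["wr"].median()
print(wr_median)
```

Comparing `toy.count()` against `len(toy)` shows which columns have missing values: here `humidity` has 3 non-missing cells out of 4 rows.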