In [None]:
%reload_ext postcell
%postcell register

In [None]:
import numpy as np
import pandas as pd

# Pandas - dataframes

Pandas dataframes are the main tool used by data scientists to explore data. Generally, "pandas" is used as a synonym for pandas dataframes. Pandas is designed to work with what we normally think of as "business data."

Given what we have learned so far, we can think of a dataframe in two ways: as a NumPy array with labels, and as a set of series objects.

#### Numpy with labels

In [None]:
rand_array = (np.random.rand(10,3) * 100).round()
rand_array

With pandas, we can annotate this numeric matrix with labels:

In [None]:
student_scores_pd = pd.DataFrame(rand_array
             , columns=['Assignment 1', 'Assignment 2', 'Test']
             , index=['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie', 'Jon', 'Arya', 'Ned', 'Danny', 'That red lady']
            )
student_scores_pd

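A dataframe can be converted back to a NumPy array, either with the `.values` attribute or with the `.to_numpy()` method (the pandas documentation recommends the latter):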

In [None]:
student_scores_pd.values

In [None]:
student_scores_pd.to_numpy()

#### Set of series objects

![](images/dataframes.jpg)

In [None]:
homers_grades = pd.Series(np.round(100 * np.random.rand(3)))
marges_grades = pd.Series(np.round(100 * np.random.rand(3)))
barts_grades = pd.Series(np.round(100 * np.random.rand(3)))
lisas_grades = pd.Series(np.round(100 * np.random.rand(3)))
maggies_grades = pd.Series(np.round(100 * np.random.rand(3)))
jons_grades = pd.Series(np.round(100 * np.random.rand(3)))
aryas_grades = pd.Series(np.round(100 * np.random.rand(3)))
neds_grades = pd.Series(np.round(100 * np.random.rand(3)))
dannys_grades = pd.Series(np.round(100 * np.random.rand(3)))
thatredlady_grades = pd.Series(np.round(100 * np.random.rand(3)))

In [None]:
homers_grades

In [None]:
student_scores_pd = pd.DataFrame({
    'homer': homers_grades
    , 'marge': marges_grades
    , 'bart': barts_grades
    , 'lisa': lisas_grades
    , 'maggie': maggies_grades
    , 'jon': jons_grades
    , 'arya': aryas_grades
    , 'ned': neds_grades
    , 'danny': dannys_grades
    , 'That red lady': thatredlady_grades
})
student_scores_pd

In [None]:
student_scores_pd = student_scores_pd.set_index(pd.Index(['Assignment 1', 'Assignment 2', 'Test']))
student_scores_pd

In [None]:
#student_scores_pd = student_scores_pd.set_index([['Assignment 1', 'Assignment 2', 'Test']])
#student_scores_pd

In [None]:
student_scores_pd = student_scores_pd.T
student_scores_pd

Notice that while each series has its own index, in a dataframe, the index is shared among all columns.

In [None]:
student_scores_pd['Test']

In [None]:
type(student_scores_pd['Test'])
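
The shared index also means that pandas aligns series by label when it builds a dataframe. A minimal sketch, using two made-up series (`quiz_a` and `quiz_b` are hypothetical names, not part of the data above) whose indexes only partly overlap:

```python
import pandas as pd

# Two hypothetical series whose indexes only partly overlap
quiz_a = pd.Series([90.0, 80.0], index=['homer', 'marge'])
quiz_b = pd.Series([70.0, 60.0], index=['marge', 'bart'])

# The dataframe's index is the union of both series' indexes;
# a label missing from one series shows up as NaN in that column
aligned_df = pd.DataFrame({'quiz_a': quiz_a, 'quiz_b': quiz_b})
print(aligned_df)
```

Each column of `aligned_df` is still a series, but all of them now use the same index.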

### Creating dataframes

Above we saw two examples of creating dataframes: from a numpy array and from a dictionary of series. You are more likely to create dataframes by reading them from a data file or loading them from a SQL database. The pandas overview already showed such an example, but let's take a closer look at some important functions.

#### `.read_csv()`

Pandas provides a very useful way of ingesting csv (comma-delimited) files. These types of files are overwhelmingly the format used to distribute small and medium-sized data.

In [None]:
data_df = pd.read_csv('../../datasets/deaths-in-gameofthrones/game-of-thrones-deaths-data.csv')

In [None]:
data_df.head()

Note that you can load a file directly from the web:

In [None]:
%%time
pd.read_csv('https://raw.githubusercontent.com/washingtonpost/data-game-of-thrones-deaths/master/game-of-thrones-deaths-data.csv').head()

In [None]:
#pd.read_excel()

The `.read_csv` function is very rich and lets you control:
1. Column names, including working with datasets which do not include column names: `names`, `header`
2. Which delimiter to use: comma, tabs, etc.: `delimiter`
3. Custom data types (in case your integers are being brought in as strings or floats): `dtype`
    a. `parse_dates` lets you control which columns are date or time values

etc.
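
These options can be tried without a file on disk by reading from an in-memory string. A small sketch, in which the data and the column names are invented for illustration:

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited data, held in memory instead of a file
raw_text = """student;score;submitted
homer;55;2021-03-01
lisa;98;2021-02-27
"""

scores_df = pd.read_csv(
    io.StringIO(raw_text),
    sep=';',                     # delimiter other than a comma (`delimiter` is an alias)
    dtype={'score': 'float64'},  # force a column's data type
    parse_dates=['submitted'],   # parse this column as dates
)
print(scores_df.dtypes)
```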

Similarly, `.read_sql` lets you read the results of SQL queries, and `.read_feather` lets you read an important, newer format called `feather` (built on Apache Arrow). The popular `parquet` format from the Hadoop ecosystem can be read with `.read_parquet`.
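
As a rough sketch of `.read_sql`, the example below uses an in-memory SQLite database as a stand-in for a real SQL server. The table name `grades` and its contents are made up for illustration:

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database stands in for a real SQL server
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE grades (student TEXT, test REAL)')
conn.executemany(
    'INSERT INTO grades VALUES (?, ?)',
    [('homer', 55.0), ('lisa', 98.0), ('bart', 42.0)],
)

# The result of the query comes back as a dataframe
passed_df = pd.read_sql('SELECT student, test FROM grades WHERE test > 50', conn)
conn.close()
print(passed_df)
```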

An example of read_csv's richness:

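`.read_csv(header=int or list of int)`
The `header` option controls which row(s) of the file supply the column names.
    - `header = 0` The first row contains the header, use that for column names
    - `header = [0, 1]` Rows 0 and 1 together form a multi-level (MultiIndex) header
    - `header = None` The file has no header row; supply column names with `names`, otherwise pandas numbers the columns

To import only some columns (for example columns 0, 1, 2, 3 and 10, skipping 4-9), use `usecols=[0, 1, 2, 3, 10]` rather than `header`.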
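
A short sketch of the `header` and `names` options, reading from in-memory strings with made-up data. It also shows `usecols`, which is the option for importing only some of the columns:

```python
import io
import pandas as pd

with_header = "student,test\nhomer,55\nlisa,98\n"
no_header = "homer,55\nlisa,98\n"

# header=0: the first row supplies the column names (this is also the default)
df_a = pd.read_csv(io.StringIO(with_header), header=0)

# header=None: the file has no header row; names= supplies the column names
df_b = pd.read_csv(io.StringIO(no_header), header=None, names=['student', 'test'])

# usecols: import only the listed columns
df_c = pd.read_csv(io.StringIO(with_header), usecols=['test'])

print(df_a.columns.tolist(), df_b.columns.tolist(), df_c.columns.tolist())
```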

#### `.read_html()`

The `read_html` function parses a web page to extract tables. This functionality is built on top of HTML parsing libraries such as `lxml` and "beautiful soup". Using those libraries directly is not necessarily easy for new developers: along with Python, they also have to know a bit about the HTML markup language. Pandas makes such tasks trivial.

In [None]:
import ssl
#In some instances, students get ssl errors, this will resolve it
#https://stackoverflow.com/a/56230607
ssl._create_default_https_context = ssl._create_unverified_context 

In [None]:
%%time
html_tables = pd.read_html('https://en.wikipedia.org/wiki/World_population')

In [None]:
len(html_tables)

In [None]:
html_tables[0]

In [None]:
html_tables[1]

If you want a specific table, use the `match` keyword to keep only the tables that contain the given text (a string or a regular expression):

In [None]:
pd.read_html('https://en.wikipedia.org/wiki/World_population'
             , match='10 most densely populated countries')[0]

**Exercise** Download the following file to your computer and read it into a dataframe: https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/addresses.csv

Note, you must download the file to your computer; you may not use Pandas to read it directly from the web.

In [None]:
%%postcell exercise_030_120_a

#type your answer here

### Getting rows and columns from Dataframes


In [None]:
student_scores_pd

Similar to series, values in dataframes can be retrieved in several ways:
1. Implicit index (similar to lists)
2. Explicit index or label (similar to dictionaries)
3. Slicing
4. Boolean masking
5. Fancy indexing

#### Implicit index

Perhaps surprisingly, the implicit index is not commonly used to access rows or columns. When you do need it, use `.iloc`.

In [None]:
student_scores_pd

In [None]:
student_scores_pd[0]  # raises KeyError: 0 is treated as a column label, and no column is named 0

In [None]:
student_scores_pd.iloc[:, 0]

#### Explicit index or labels

In [None]:
student_scores_pd

In a dataframe, using an explicit key returns the column associated with that key. Note that this is not what you may have expected. According to the rule "first rows, then columns" we learned in numpy, the syntax in the following line should return a row, but Pandas makes an exception here.

In [None]:
student_scores_pd['homer']  # raises KeyError: 'homer' is a row label, and [] looks up columns

In [None]:
student_scores_pd['Test']

Notice that in the last example, although in common vernacular you are asking for a column of data, it is more precise to say that you are asking for a series, since a series contains a column of values and the index associated with it.

Using `.loc` brings back the "rows then column" scheme.

In [None]:
student_scores_pd.loc['homer']

In [None]:
student_scores_pd.loc[:, 'Test']

In [None]:
student_scores_pd.loc['homer', "Test"]

Once you have a series, you can operate on it using the same methods we learned in the previous lecture.

In [None]:
tests_s = student_scores_pd['Test']

In [None]:
type(tests_s)

In [None]:
tests_s

In [None]:
tests_s['homer'] # <= tests_s is a series, not a dataframe, so the key selects a single value

Dataframe and series operations are often combined:

In [None]:
student_scores_pd['Test']['homer']

In the previous example, `student_scores_pd['Test']...` returns a series; the subsequent `...['homer']` then pulls out Homer's test result.
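
The same value can also be fetched in a single step with `.loc`. A minimal sketch on a small stand-in dataframe (`scores_df` and its values are made up for illustration):

```python
import pandas as pd

scores_df = pd.DataFrame(
    {'Assignment 1': [60.0, 95.0], 'Test': [55.0, 98.0]},
    index=['homer', 'lisa'],
)

# Two steps: select the 'Test' series, then look up 'homer' in it
chained = scores_df['Test']['homer']

# One step: row label and column label together
direct = scores_df.loc['homer', 'Test']

print(chained, direct)
```

Both forms give the same value when reading. For assignment, the pandas documentation recommends the single-step `.loc` form, because assigning through a chained lookup may not modify the original dataframe.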

**Exercise** Find the average of all test scores in `student_scores_pd`

In [None]:
%%postcell exercise_030_120_b

#type your answer here

#### Fancy indexing

Fancy indexing operates on columns:

In [None]:
student_scores_pd[['Test', 'Test', 'Assignment 1']]

**Exercise** Get the `Test` series from `student_scores_pd`

In [None]:
%%postcell exercise_030_120_c

#type your answer here

#### Slicing

However, slicing syntax operates on rows:

In [None]:
student_scores_pd['bart': 'arya']

**Note** Slicing column names will not work (unless the `.loc` notation is used)

In [None]:
student_scores_pd['Assignment 1': 'Assignment 2']

In [None]:
student_scores_pd.loc[:, 'Assignment 1': 'Assignment 2']

**Exercise** Use slicing to find test scores (not assignments) for Maggie, Jon and Arya (note that all three are next to each other in the dataframe)

In [None]:
%%postcell exercise_030_120_d

#type your answer here

#### Masking

Masking syntax also operates on rows

In [None]:
student_scores_pd

In [None]:
student_scores_pd[student_scores_pd.Test > 50]

In [None]:
student_scores_pd.Test > 50

In [None]:
student_scores_pd[student_scores_pd.index == 'homer']
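
Masks can be combined with `&` (and), `|` (or) and `~` (not). A minimal sketch on a made-up dataframe (`demo_df` and its columns are hypothetical); note that each comparison needs its own parentheses, because these operators bind more tightly than `>`:

```python
import pandas as pd

demo_df = pd.DataFrame(
    {'height': [150, 180, 165], 'age': [12, 40, 35]},
    index=['bart', 'homer', 'marge'],
)

# Rows where both conditions hold
tall_adults = demo_df[(demo_df['height'] > 160) & (demo_df['age'] > 18)]

# Rows where the condition does NOT hold
not_tall = demo_df[~(demo_df['height'] > 160)]

print(tall_adults.index.tolist())
print(not_tall.index.tolist())
```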

**Exercise** Find all records where test scores are above 80 in dataframe `student_scores_pd`

In [None]:
%%postcell exercise_030_120_e

#type your answer here

**Exercise** Find all records where test scores are above 80 or assignment 1 scores are above 75 in dataframe `student_scores_pd`

In [None]:
%%postcell exercise_030_120_h

#type your answer here

### `.loc`, `.iloc` and `.at`

Just like `Series`, `DataFrame` objects also use:

- `df.loc[]` to operate only on explicit keys/labels (like dictionaries)
- `df.iloc[]` to operate only on implicit index locations (like lists)
- `df.at[]` to return a single item

However, since dataframes are a two-dimensional data structure, these accessors take both a row selector and a column selector:

`df.loc/iloc/at[rows, columns]`

In [None]:
student_scores_pd

### `.loc` key based indexing

Get the column 'Test' and all rows

In [None]:
student_scores_pd.loc[:,'Test']

Get all columns for row containing 'homer'

In [None]:
student_scores_pd.loc['homer', :]

Get the column 'Test' and row with index 'homer'

In [None]:
student_scores_pd.loc['homer', 'Test']

Get columns 'Assignment 1' and 'Assignment 2' and rows containing Simpson children (**fancy indexing**)

In [None]:
student_scores_pd.loc[['bart', 'lisa', 'maggie'],['Assignment 1', 'Assignment 2']]

Get 'Assignment 1' and 'Assignment 2' scores for everyone between 'bart' and 'maggie' (**slicing** and **fancy indexing**)

In [None]:
student_scores_pd.loc['bart':'maggie', ['Assignment 1', 'Assignment 2']]

**Exercise** Find Maggie and Danny's 'Assignment 1' score

In [None]:
%%postcell exercise_030_120_f

#type your answer here

### `.iloc` location based indexing

In [None]:
student_scores_pd

Get the first column and all rows

In [None]:
student_scores_pd.iloc[:, 0]

Get the second column for the rows at positions 3 through 5 (counting from 0; the end of the slice is excluded)

In [None]:
student_scores_pd.iloc[3:6, 1]
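
One difference between the two accessors is easy to miss: a position-based slice with `.iloc` excludes its end point, while a label-based slice with `.loc` includes it. A small sketch on a made-up dataframe (`names_df` is a hypothetical name):

```python
import pandas as pd

names_df = pd.DataFrame(
    {'Test': [10.0, 20.0, 30.0, 40.0]},
    index=['homer', 'marge', 'bart', 'lisa'],
)

# Position-based slice: the end position is excluded, so this gives 2 rows
by_position = names_df.iloc[1:3]

# Label-based slice: the end label is included, so this gives 3 rows
by_label = names_df.loc['marge':'lisa']

print(by_position.index.tolist())
print(by_label.index.tolist())
```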

**Exercise** Get the first three rows of `student_scores_pd`, but only include 'Assignment 1' and 'Assignment 2'

In [None]:
%%postcell exercise_030_120_g

#type your answer here

### `.at` return a single value

Get Homer's test results (think of it as `.loc` which returns a single value...or an error)

In [None]:
student_scores_pd.at['homer', 'Test']

In [None]:
student_scores_pd.at['homer',['Test', 'Assignment 1']]  # raises an error: .at accepts exactly one row label and one column label
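
Because `.at` addresses exactly one cell, it can also be used to update a single value. A small sketch with made-up data (`cell_df` is a hypothetical name):

```python
import pandas as pd

cell_df = pd.DataFrame(
    {'Assignment 1': [60.0, 95.0], 'Test': [55.0, 98.0]},
    index=['homer', 'lisa'],
)

# Read a single cell
before = cell_df.at['homer', 'Test']

# Overwrite the same cell in place
cell_df.at['homer', 'Test'] = 65.0
after = cell_df.at['homer', 'Test']

print(before, after)
```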

#### `.query()`

Although I personally haven't developed a habit of using this method, `df.query(..)` can be very useful and intuitive.

In [None]:
student_scores_pd

In [None]:
student_scores_pd.query('Test > 50')

The above example is the same as:

In [None]:
student_scores_pd[student_scores_pd['Test'] > 50]

As you can see, `.query()` is much simpler.
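
`.query()` can also combine conditions, refer to Python variables with `@`, and handle column names that contain spaces by wrapping them in backticks. A minimal sketch on a stand-in dataframe (`query_df`, its values and `passing_mark` are made up for illustration):

```python
import pandas as pd

query_df = pd.DataFrame(
    {'Assignment 1': [60.0, 95.0, 80.0], 'Test': [55.0, 98.0, 45.0]},
    index=['homer', 'lisa', 'bart'],
)

passing_mark = 50  # an ordinary Python variable

# @name refers to a Python variable; backticks quote a column name with spaces
result = query_df.query('Test > @passing_mark and `Assignment 1` > 70')
print(result.index.tolist())
```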