# Pandas DataFrame Review/Crash Course

Recall the two main data structures pandas offers:
* Series
* DataFrame (our rectangular data)

As part of the review, this section covers:
* loading simple data from files using Pandas
* calculating how many rows and columns were loaded
* subsetting
* aggregate stats and grouped stats
* creating simple figures

---

## Loading Data
Recall: when looking at data, start with descriptives and descriptive statistics to characterize it.

In [84]:
import pandas as pd

As part of the review and for quick reference, we'll go over loading data from files with Pandas.

In [85]:
# read a csv file and specify delimiter 
df = pd.read_csv('./gapminder.tsv', sep='\t')
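
If `gapminder.tsv` is not at hand, the same call can be tried on a small in-memory sample. This is only a sketch: the two rows are copied from the output shown in this section, `tsv_text` and `sample_df` are illustrative names, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Two tab-separated rows copied from the gapminder output in this section
tsv_text = (
    "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n"
    "Afghanistan\tAsia\t1952\t28.801\t8425333\t779.445314\n"
    "Zimbabwe\tAfrica\t2007\t43.487\t12311143\t469.709298\n"
)

# read_csv accepts a path or a file-like object; sep='\t' sets tab as the delimiter
sample_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(sample_df.shape)
```

Without `sep='\t'`, `read_csv` would split on commas and each line would end up in a single column.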

In [86]:
# check out our dataframe
print(df)

          country continent  year  lifeExp       pop   gdpPercap
0     Afghanistan      Asia  1952   28.801   8425333  779.445314
1     Afghanistan      Asia  1957   30.332   9240934  820.853030
2     Afghanistan      Asia  1962   31.997  10267083  853.100710
3     Afghanistan      Asia  1967   34.020  11537966  836.197138
4     Afghanistan      Asia  1972   36.088  13079460  739.981106
...           ...       ...   ...      ...       ...         ...
1699     Zimbabwe    Africa  1987   62.351   9216418  706.157306
1700     Zimbabwe    Africa  1992   60.377  10704340  693.420786
1701     Zimbabwe    Africa  1997   46.809  11404948  792.449960
1702     Zimbabwe    Africa  2002   39.989  11926563  672.038623
1703     Zimbabwe    Africa  2007   43.487  12311143  469.709298

[1704 rows x 6 columns]


In [87]:
# are we working with a dataframe?...
print(type(df))

<class 'pandas.core.frame.DataFrame'>


In [88]:
# get number of rows and columns (shape is an attribute, not a method)
print(df.shape)

(1704, 6)
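
Because `shape` is a plain tuple of `(rows, columns)`, it can be unpacked directly. A small sketch on a made-up frame (`demo`, `n_rows`, and `n_cols` are illustrative names, not part of the original notebook):

```python
import pandas as pd

# Three rows and two columns, values taken from the gapminder output above
demo = pd.DataFrame({
    "year": [1952, 1957, 1962],
    "pop": [8425333, 9240934, 10267083],
})

# shape is an attribute holding (number of rows, number of columns)
n_rows, n_cols = demo.shape
print(n_rows, n_cols)

# len() on a DataFrame counts rows; len() of .columns counts columns
print(len(demo), len(demo.columns))
```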


In [89]:
# check out the columns attribute to get the column names
print(df.columns)

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')


In [90]:
# verify column data types
df.dtypes

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

In [91]:
# get more information about the dataframe columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB


--- 

## Look at Columns, Rows, and Cells

In [92]:
# show first 5 observations
df.head()

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106


### Select and Subset Columns by Name
Access a specific column of the dataset by name.

In [93]:
# only return country column
country_df = df['country']

In [94]:
# it's a Series
type(country_df)

pandas.core.series.Series

In [95]:
country_df.head()

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object

In [96]:
country_df.tail()

1699    Zimbabwe
1700    Zimbabwe
1701    Zimbabwe
1702    Zimbabwe
1703    Zimbabwe
Name: country, dtype: object

In [97]:
# subset with a Python list for multiple columns
subset = df[['country', 'continent', 'year']]

In [98]:
type(subset)

pandas.core.frame.DataFrame

In [99]:
subset

          country continent  year
0     Afghanistan      Asia  1952
1     Afghanistan      Asia  1957
2     Afghanistan      Asia  1962
3     Afghanistan      Asia  1967
4     Afghanistan      Asia  1972
...           ...       ...   ...
1699     Zimbabwe    Africa  1987
1700     Zimbabwe    Africa  1992
1701     Zimbabwe    Africa  1997
1702     Zimbabwe    Africa  2002


In [100]:
# improper subset by index position
df[0]

KeyError: 0
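
The error arises because `[]` with a single value looks up a column *label*, and no column is labeled `0`. A minimal sketch on a made-up frame (`demo` is an illustrative name) showing the failure and the positional alternatives with `.iloc[]`:

```python
import pandas as pd

demo = pd.DataFrame({
    "country": ["Afghanistan", "Zimbabwe"],
    "year": [1952, 2007],
})

# demo[0] searches the column labels for 0 and fails
try:
    demo[0]
except KeyError as err:
    print("KeyError:", err)

# positional access goes through .iloc[] instead
first_col = demo.iloc[:, 0]   # first column, all rows
first_row = demo.iloc[0]      # first row, all columns
print(first_col.tolist())
```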

In [None]:
# dot notation is a shorthand for [] (works when the column name is a valid Python identifier)
df.country

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object

### Subset Rows
We can subset rows by row label (*.loc[]*) or by integer position (*.iloc[]*).

In [None]:
# use .loc[] to subset row by index label to get first row
df.loc[0]

country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap     779.445314
Name: 0, dtype: object

In [None]:
# get 100th row
df.loc[99]

country      Bangladesh
continent          Asia
year               1967
lifeExp          43.453
pop            62821884
gdpPercap    721.186086
Name: 99, dtype: object

In [None]:
# get last row via calculation
num_rows = df.shape[0]
last_index = num_rows - 1
df.loc[last_index]

country        Zimbabwe
continent        Africa
year               2007
lifeExp          43.487
pop            12311143
gdpPercap    469.709298
Name: 1703, dtype: object

In [None]:
# alternatively...
df.tail(n=1)

       country continent  year  lifeExp       pop   gdpPercap
1703  Zimbabwe    Africa  2007   43.487  12311143  469.709298


In [None]:
# select multiple rows by passing a list of index labels
df.loc[[0, 99, 999]]

         country continent  year  lifeExp       pop    gdpPercap
0    Afghanistan      Asia  1952   28.801   8425333   779.445314
99    Bangladesh      Asia  1967   43.453  62821884   721.186086
999     Mongolia      Asia  1967   51.253   1149500  1226.041130


We can use **.iloc[]** to subset a row by its integer position.

In [None]:
# get the 2nd row
df.iloc[1]

country      Afghanistan
continent           Asia
year                1957
lifeExp           30.332
pop              9240934
gdpPercap      820.85303
Name: 1, dtype: object

The distinction between *.loc[]* and *.iloc[]* may seem confusing in this example because the labels and positions coincide. Recall that in many datasets rows have names; when none are given, pandas labels the rows with a monotonically increasing integer starting from 0.
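
The difference becomes visible once the labels stop matching the positions. A minimal sketch with made-up frames (`demo` and `shuffled` are illustrative names; the life expectancy values are copied from the output in this section):

```python
import pandas as pd

# Rows labeled by country name instead of the default 0, 1, 2, ...
demo = pd.DataFrame(
    {"lifeExp": [28.801, 43.453, 43.487]},
    index=["Afghanistan", "Bangladesh", "Zimbabwe"],
)
by_label = demo.loc["Bangladesh", "lifeExp"]   # look up by row name
by_position = demo.iloc[1, 0]                  # look up by row/column position
print(by_label, by_position)

# Integer labels that are not in positional order
shuffled = pd.DataFrame({"value": ["a", "b", "c"]}, index=[2, 0, 1])
label_zero = shuffled.loc[0, "value"]     # the row whose label is 0
position_zero = shuffled.iloc[0, 0]       # the row that comes first
print(label_zero, position_zero)
```

In `shuffled`, label `0` belongs to the second row, so the two lookups return different values.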

In [None]:
# get 100th row
df.iloc[99]

country      Bangladesh
continent          Asia
year               1967
lifeExp          43.453
pop            62821884
gdpPercap    721.186086
Name: 99, dtype: object

In [None]:
# we can use negative indices with .iloc[]
df.iloc[-1]

country        Zimbabwe
continent        Africa
year               2007
lifeExp          43.487
pop            12311143
gdpPercap    469.709298
Name: 1703, dtype: object
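
Negative values are positions counted from the end, so they only make sense with *.iloc[]*. With the default integer labels, *.loc[-1]* looks for a row labeled `-1` and fails. A small sketch on a made-up frame (`demo` and `loc_accepts_negative` are illustrative names):

```python
import pandas as pd

demo = pd.DataFrame({"country": ["Afghanistan", "Bangladesh", "Zimbabwe"]})

# .iloc[] counts backwards from the end
last_by_position = demo.iloc[-1]
print(last_by_position["country"])

# .loc[] treats -1 as a label, and no row carries that label here
try:
    demo.loc[-1]
    loc_accepts_negative = True
except KeyError:
    loc_accepts_negative = False
print(loc_accepts_negative)
```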

### Subsetting by row and column
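With *.loc[]* and *.iloc[]*, rows go to the left of the comma and columns to the right. A bare `:` (Python's slicing syntax) in the row position selects all rows, which lets us subset just the columns.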


In [None]:
# subset specific columns with loc and get all the rows for those columns;
# note that loc takes the column names, NOT their positions
subset = df.loc[:, ['year', 'pop']]
subset

      year       pop
0     1952   8425333
1     1957   9240934
2     1962  10267083
3     1967  11537966
4     1972  13079460
...    ...       ...
1699  1987   9216418
1700  1992  10704340
1701  1997  11404948
1702  2002  11926563


In [None]:
# subset columns using integer positions (negative values count from the end)
subset = df.iloc[:, [2, 4, -1]]
subset

      year       pop   gdpPercap
0     1952   8425333  779.445314
1     1957   9240934  820.853030
2     1962  10267083  853.100710
3     1967  11537966  836.197138
4     1972  13079460  739.981106
...    ...       ...         ...
1699  1987   9216418  706.157306
1700  1992  10704340  693.420786
1701  1997  11404948  792.449960
1702  2002  11926563  672.038623


### Subsetting with range()
The built-in *range()* function lets us specify a range of values using "half-open" 🙄 notation: the start value is included and the stop value is excluded.

Strictly speaking, *range()* returns a lazy *range* object rather than a list (it is not a generator, although it is often described as one). To use it for positional indexing here, we'll convert it to a list of integers first.
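
A quick check of what *range()* returns in Python 3, using only the standard library. The variable names are illustrative:

```python
from collections.abc import Iterator, Sequence

r = range(5)

# A range is a sequence: it has a length and supports indexing
is_sequence = isinstance(r, Sequence)
# It is not an iterator (and therefore not a generator)
is_iterator = isinstance(r, Iterator)
print(type(r).__name__, is_sequence, is_iterator)

# Half-open: 0 is included, 5 is excluded
as_list = list(r)
print(as_list)
```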

In [None]:
# range of integers from 0-4, convert the range object to a list
small_range = list(range(5))

In [None]:
small_range

[0, 1, 2, 3, 4]

In [None]:
# now subset the dataframe with our range: grab the first
# 5 columns and all of their rows
subset = df.iloc[:, small_range]

In [None]:
subset

          country continent  year  lifeExp       pop
0     Afghanistan      Asia  1952   28.801   8425333
1     Afghanistan      Asia  1957   30.332   9240934
2     Afghanistan      Asia  1962   31.997  10267083
3     Afghanistan      Asia  1967   34.020  11537966
4     Afghanistan      Asia  1972   36.088  13079460
...           ...       ...   ...      ...       ...
1699     Zimbabwe    Africa  1987   62.351   9216418
1700     Zimbabwe    Africa  1992   60.377  10704340
1701     Zimbabwe    Africa  1997   46.809  11404948
1702     Zimbabwe    Africa  2002   39.989  11926563


In [None]:
# create a different range of values to index by
small_range = list(range(3, 6))
subset = df.iloc[:, small_range]
subset

      lifeExp       pop   gdpPercap
0      28.801   8425333  779.445314
1      30.332   9240934  820.853030
2      31.997  10267083  853.100710
3      34.020  11537966  836.197138
4      36.088  13079460  739.981106
...       ...       ...         ...
1699   62.351   9216418  706.157306
1700   60.377  10704340  693.420786
1701   46.809  11404948  792.449960
1702   39.989  11926563  672.038623


In [None]:
# get every other column by using a step
small_range = list(range(0, 6, 2))

In [None]:
subset = df.iloc[:, small_range]

In [None]:
subset

          country  year       pop
0     Afghanistan  1952   8425333
1     Afghanistan  1957   9240934
2     Afghanistan  1962  10267083
3     Afghanistan  1967  11537966
4     Afghanistan  1972  13079460
...           ...   ...       ...
1699     Zimbabwe  1987   9216418
1700     Zimbabwe  1992  10704340
1701     Zimbabwe  1997  11404948
1702     Zimbabwe  2002  11926563


### Subsetting with Slicing
Slicing takes *start:stop:step* separated by colons and follows the same half-open convention. With *.iloc[]* it is essentially shorthand for using the *range()* function to generate a list of integers.
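
To make the equivalence concrete, here is a minimal sketch on a made-up one-row frame (`demo` and its single-letter column names are illustrative). It also shows one difference worth keeping in mind: label slices with *.loc[]* include both endpoints.

```python
import pandas as pd

demo = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list("abcdef"))

# A positional slice and the matching list(range(...)) select the same columns
via_slice = demo.iloc[:, 0:6:2]
via_range = demo.iloc[:, list(range(0, 6, 2))]
print(via_slice.equals(via_range))

# Label slices with .loc[] are inclusive of the stop label
by_label = demo.loc[:, "a":"c"]
print(list(by_label.columns))
```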

In [None]:
df.columns

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')

In [None]:
# let's get the first three columns by index via slicing and all rows
subset = df.iloc[:, :3]
subset

          country continent  year
0     Afghanistan      Asia  1952
1     Afghanistan      Asia  1957
2     Afghanistan      Asia  1962
3     Afghanistan      Asia  1967
4     Afghanistan      Asia  1972
...           ...       ...   ...
1699     Zimbabwe    Africa  1987
1700     Zimbabwe    Africa  1992
1701     Zimbabwe    Africa  1997
1702     Zimbabwe    Africa  2002


In [None]:
# get the columns at positions 3-5 (the stop value 6 is excluded)
subset = df.iloc[:, 3:6]
subset

      lifeExp       pop   gdpPercap
0      28.801   8425333  779.445314
1      30.332   9240934  820.853030
2      31.997  10267083  853.100710
3      34.020  11537966  836.197138
4      36.088  13079460  739.981106
...       ...       ...         ...
1699   62.351   9216418  706.157306
1700   60.377  10704340  693.420786
1701   46.809  11404948  792.449960
1702   39.989  11926563  672.038623


In [None]:
# get every other column from positions 3-5
subset = df.iloc[:, 3:6:2]
subset

      lifeExp   gdpPercap
0      28.801  779.445314
1      30.332  820.853030
2      31.997  853.100710
3      34.020  836.197138
4      36.088  739.981106
...       ...         ...
1699   62.351  706.157306
1700   60.377  693.420786
1701   46.809  792.449960
1702   39.989  672.038623


### Subsetting Rows and Columns
We can put values to the left of the comma to select specific rows along with specific columns.

In [None]:
# using loc to subset rows and cols by name,
# we want row name 42 and the column value for country
df.loc[42, 'country']

'Angola'

In [None]:
# using iloc 
df.iloc[42, 0]

'Angola'

In [None]:
# subsetting multiple rows and columns,
# get 1st, 100th, and 1000th rows from columns
# 0, 3, and 5
df.iloc[[0, 99, 999], [0, 3, 5]]

         country  lifeExp    gdpPercap
0    Afghanistan   28.801   779.445314
99    Bangladesh   43.453   721.186086
999     Mongolia   51.253  1226.041130


*Tip:* Try to use column names as much as possible (so use *.loc[]*), as it makes it easier to read and interpret which values are being returned.
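
As a sketch of that tip, the same selection can be written with positions or with names. The frame below is a made-up three-row sample (values copied from the output above; `demo`, `by_position`, and `by_name` are illustrative names):

```python
import pandas as pd

demo = pd.DataFrame({
    "country": ["Afghanistan", "Bangladesh", "Mongolia"],
    "continent": ["Asia", "Asia", "Asia"],
    "year": [1952, 1967, 1967],
    "lifeExp": [28.801, 43.453, 51.253],
})

# By position: the reader has to remember what columns 0 and 3 hold
by_position = demo.iloc[[0, 2], [0, 3]]

# By name: the selected columns are spelled out in the code
by_name = demo.loc[[0, 2], ["country", "lifeExp"]]

print(by_name.equals(by_position))
```

The named version also keeps working if the column order changes, whereas the positional one would silently select different columns.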

---

## Grouped and Aggregate Calculations