# Table of Contents

- [1 Loading, Inspecting, Getting Familiarised with the Data](#Loading,-Inspecting,-Getting-Familiarised-with-the-Data)
  - [1.1 The Preamble](#The-Preamble)
  - [1.2 Instantiating and Quickly Inspecting DataFrames](#Instantiating-and-Quickly-Inspecting-DataFrames)
  - [1.3 Schema, Structure, Datatypes, Distribution](#Schema,-Structure,-Datatypes,-Distribution)
  - [1.4 `numpy` Is Behind The Scenes](#numpy-Is-Behind-The-Scenes)
  - [1.5 Subsetting Data On Row and Column Labels](#Subsetting-Data-On-Row-and-Column-Labels)
  - [1.6 Size and Shape](#Size-and-Shape)
  - [1.7 Boolean Masks](#Boolean-Masks)
  - [1.8 Sorting `DataFrame`s](#Sorting-DataFrames)
- [2 Exercises](#Exercises)
  - [2.1 GapMinder](#GapMinder)

# Loading, Inspecting, Getting Familiarised with the Data

## The Preamble

In [None]:
import numpy as np
import pandas as pd

## Instantiating and Quickly Inspecting DataFrames

In [None]:
# .read_csv will read a CSV dataset and instantiate a suitable DataFrame
# `index_col=0` tells it the first column should be used as the row index
states_df = pd.read_csv('../data/area_pop_five_us_states.csv', index_col=0)

In [None]:
# a DataFrame is a tabular, 2D structure composed of rows and columns
# columns and rows are each associated with a set of labels (an `Index`)
# to address a specific row or column, use its label
states_df
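
To make the idea of row and column labels concrete without depending on the CSV file, here is a minimal sketch that builds a small DataFrame by hand. The name `toy_df` and the values are illustrative stand-ins, not the contents of `area_pop_five_us_states.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the states data; values are illustrative only
toy_df = pd.DataFrame(
    {"area": [423967, 170312, 149995], "pop": [38.3, 19.6, 12.9]},
    index=["California", "Florida", "Illinois"],
)

# Both axes carry an Index of labels
print(toy_df.index)    # row labels
print(toy_df.columns)  # column labels

# A single cell is addressed by its row label and its column label
florida_area = toy_df.loc["Florida", "area"]
print(florida_area)
```

Passing `index=` here plays the same role as `index_col=0` does for `read_csv`: it decides which labels identify the rows.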

We make use of a simplified version of the [`gapminder`](https://www.gapminder.org/data/) dataset, which concerns records of population, life expectancy and GDP of countries covering a number of years in the period of 1952 to 2007. The dataset can be downloaded from [here](https://github.com/jennybc/gapminder/blob/master/inst/extdata/gapminder.tsv); more information [here](https://github.com/jennybc/gapminder/). 

In [None]:
# notice that the data is tab-separated, hence the argument `sep='\t'`
gm_df = pd.read_csv('../data/gapminder.tsv', sep='\t')
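
The effect of `sep` can be checked on a tiny in-memory file. This is a sketch under the assumption that a made-up three-column TSV is a fair stand-in for the real one; `tsv_text` and its rows are invented for illustration.

```python
import io

import pandas as pd

# Made-up tab-separated text; not taken from gapminder.tsv
tsv_text = "country\tyear\tpop\nAtlantis\t1952\t1000\nAtlantis\t1957\t1100\n"

# With the right separator, each field becomes its own column
with_tabs = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# With the default separator (a comma), every line is read as one single field
with_default = pd.read_csv(io.StringIO(tsv_text))

print(with_tabs.shape)
print(with_default.shape)
```

A frame with a single, oddly named column is usually the first hint that the separator is wrong.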

In [None]:
# a quick look - the top 5 rows by default
gm_df.head()

In [None]:
# show the `n` top rows
gm_df.head(13)

In [None]:
# 5 rows picked at random
gm_df.sample(5)

In [None]:
# the bottom 5 rows by default
gm_df.tail()

## Schema, Structure, Datatypes, Distribution

In [None]:
gm_df.columns

In [None]:
gm_df.dtypes

In [None]:
# columns and some statistics about the data distribution (numeric, by default)
gm_df.describe()

In [None]:
# `include='all'` also summarises the non-numeric columns
gm_df.describe(include='all')
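
The difference between the two `describe` calls is easier to see on a frame small enough to check by hand. A minimal sketch, using an invented three-row frame (`toy_df` and its values are hypothetical):

```python
import pandas as pd

# Invented data with one text column and two numeric columns
toy_df = pd.DataFrame(
    {
        "country": ["A", "A", "B"],
        "year": [1952, 1957, 1952],
        "lifeExp": [50.0, 52.0, 60.0],
    }
)

# Default: numeric columns only (count, mean, std, min, quartiles, max)
numeric_summary = toy_df.describe()

# include='all': text columns appear too, with unique/top/freq rows added
full_summary = toy_df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

Statistics that make no sense for a column's type (for example `mean` of `country`) show up as `NaN` in the combined summary.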

## `numpy` Is Behind The Scenes

In [None]:
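# the underlying data as a `numpy` array
# (the pandas documentation recommends `.to_numpy()` over `.values`)
gm_df.values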
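
What comes back from `.values` or `.to_numpy()` depends on the column types, since a `numpy` array has a single dtype. The sketch below uses two invented frames to show this; the names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# All-numeric frame: int and float columns share a common float dtype
numeric_df = pd.DataFrame({"year": [1952, 1957], "lifeExp": [50.0, 52.5]})
numeric_arr = numeric_df.to_numpy()

# Mixed text/number frame: the only common dtype is the generic `object`
mixed_df = pd.DataFrame({"country": ["A", "B"], "year": [1952, 1957]})
mixed_arr = mixed_df.to_numpy()

print(type(numeric_arr), numeric_arr.dtype)
print(type(mixed_arr), mixed_arr.dtype)
```

An `object` array loses most of the speed advantage of `numpy`, which is one reason to pull out only the numeric columns before converting.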

## Subsetting Data On Row and Column Labels

In [None]:
# `.loc[]` is the preferred method
# a quick example of how to select a row based on its label
# by default, a single row is returned as a `Series` object.
# More on `Series` later
gm_df.loc[4]

In [None]:
# a list of labels (containing one element, if you'd like) will produce
# a DataFrame as a result
gm_df.loc[ [4] ]
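
The scalar-versus-list distinction can be verified directly by checking the type of each result. A minimal sketch on an invented frame (`toy_df` is hypothetical):

```python
import pandas as pd

# Invented frame with the default integer row labels 0, 1, 2
toy_df = pd.DataFrame({"country": ["A", "B", "C"], "year": [1952, 1957, 1962]})

# A single label gives a Series whose index is the column labels
as_series = toy_df.loc[1]

# A list of labels, even with one element, gives a DataFrame
as_frame = toy_df.loc[[1]]

print(type(as_series))
print(type(as_frame), as_frame.shape)
```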

In [None]:
# one can always break the task into individual steps and
# assign the intermediate results to variables
row = gm_df.loc[4]
row

In [None]:
type(row)

In [None]:
# `[]` can be used as a shortcut notation for `loc[]` when dealing with
# columns only
# a single column produced by `[]`, identified by its label, is a `Series`
gm_df['country'].head()

In [None]:
gm_df['pop'].head()

In [None]:
# a subset of columns (producing a DataFrame)
gm_df[ [ 'country', 'year' ] ].head()

In [None]:
# `.loc` expects the indexing to be on rows only or on both rows and columns
gm_df.loc[ [4], [ 'country', 'year' ] ]

In [None]:
# subsetting the smaller, US states dataset
states_df.loc['Illinois']

In [None]:
states_df

In [None]:
# you can also specify a slice (a range) of row labels; unlike ordinary
# Python slices, both endpoints are included
states_df.loc['California':'Illinois']
# equivalent statement...
# states_df.loc[slice('California', 'Illinois')]
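
Label slices behave differently from positional slices at the right-hand end. The sketch below contrasts `.loc` with `.iloc` (the position-based counterpart, which is not otherwise used in this notebook) on an invented frame; `toy_df` and its values are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the states data
toy_df = pd.DataFrame(
    {"pop": [38.3, 19.6, 12.9, 19.7]},
    index=["California", "Florida", "Illinois", "New York"],
)

# Label-based slice: the end label 'Illinois' IS part of the result
by_label = toy_df.loc["California":"Illinois"]

# Position-based slice: the end position 2 is NOT part of the result
by_position = toy_df.iloc[0:2]

print(by_label)
print(by_position)
```

The inclusive end makes sense for labels: there is no general way to name "the label just after `'Illinois'`".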

In [None]:
# a slice from 'Florida' to the last label
# states_df.loc['Florida':, ['pop']]
states_df.loc[slice('Florida', None), ['pop']]

In [None]:
# all rows (`:` is a slice with no start and no end)
states_df.loc[:, ['pop']]

In [None]:
states_df.loc['Florida', 'area']

## Size and Shape

In [None]:
len(gm_df)

In [None]:
# (number of rows, number of columns)
gm_df.shape

In [None]:
# total number of cells
gm_df.size
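
The three measures are related in a simple way, which a small invented frame makes easy to confirm (`toy_df` is hypothetical):

```python
import pandas as pd

# Invented frame with 3 rows and 2 columns
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy_df.shape

print(len(toy_df))   # number of rows only
print(toy_df.shape)  # (rows, columns)
print(toy_df.size)   # rows * columns
```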

## Boolean Masks

In [None]:
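# comparisons are applied element-wise, just like in `numpy`;
# the result is a boolean `Series` with the same row labels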
states_df['pop'] >= 20

In [None]:
states_df.loc[ states_df['pop'] >= 20 ]
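
It can help to look at the mask and the filtered result as two separate steps. A minimal sketch, assuming `pop` is expressed in millions as the `>= 20` threshold suggests; `toy_df` and its values are illustrative stand-ins.

```python
import pandas as pd

# Hypothetical stand-in for the states data
toy_df = pd.DataFrame(
    {"area": [423967, 170312, 149995], "pop": [38.3, 19.6, 12.9]},
    index=["California", "Florida", "Illinois"],
)

# Step 1: one True/False per row, aligned on the row labels
mask = toy_df["pop"] >= 20

# Step 2: `.loc` keeps only the rows where the mask is True
selected = toy_df.loc[mask]

print(mask)
print(selected)
```

Because `True` counts as 1, `mask.sum()` gives the number of matching rows without building the filtered frame.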

In [None]:
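# `pop` appears to be expressed in millions (hence the `>= 20` filter above);
# would you rather work with head counts?
# a quick example of vectorisation: the whole column is multiplied at once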
states_df.loc[:, 'pop'] * 1000000

In [None]:
# combining boolean arrays with & (and) and | (or); wrap each comparison in
# parentheses, because & and | bind more tightly than >=
states_df.loc[ (states_df['pop'] >= 20) & (states_df['area'] >= 500000) ]
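
The need for parentheses comes from Python's operator precedence, and the failure mode is worth seeing once. The sketch below uses an invented frame (`toy_df`, illustrative values) and also shows `|` and `~` (negation).

```python
import pandas as pd

# Hypothetical stand-in for the states data
toy_df = pd.DataFrame(
    {"area": [423967, 170312, 695662], "pop": [38.3, 19.6, 26.4]},
    index=["California", "Florida", "Texas"],
)

# Naming the masks sidesteps the precedence problem altogether
big_pop = toy_df["pop"] >= 20
big_area = toy_df["area"] >= 500000

both = toy_df.loc[big_pop & big_area]        # and
either = toy_df.loc[big_pop | big_area]      # or
neither = toy_df.loc[~(big_pop | big_area)]  # not

# Without parentheses, Python evaluates `20 & toy_df["area"]` first and the
# expression turns into a chained comparison, which raises an error
try:
    toy_df.loc[toy_df["pop"] >= 20 & toy_df["area"] >= 500000]
    unparenthesised_worked = True
except (TypeError, ValueError):
    unparenthesised_worked = False

print(both)
print(either)
print(neither)
print(unparenthesised_worked)
```

The keywords `and` and `or` do not work here either: they ask for a single truth value, whereas a mask holds one per row.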

## Sorting `DataFrame`s

In [None]:
# states_df.sort_values?

In [None]:
states_df.sort_values(by='pop', ascending=False)

In [None]:
# could add `inplace=True` to modify `states_df` itself; otherwise, a new
# DataFrame is produced and the original is left untouched
states_df.sort_values(by='area', ascending=False)
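
That sorting leaves the original untouched can be confirmed by comparing row order before and after. A minimal sketch on an invented frame (`toy_df` and its values are illustrative):

```python
import pandas as pd

# Hypothetical stand-in for the states data
toy_df = pd.DataFrame(
    {"area": [423967, 170312, 695662], "pop": [38.3, 19.6, 26.4]},
    index=["California", "Florida", "Texas"],
)

# Returns a new, sorted DataFrame
sorted_df = toy_df.sort_values(by="pop", ascending=False)

print(sorted_df.index.tolist())  # sorted order
print(toy_df.index.tolist())     # original order, unchanged
```

To keep the sorted version, assign the result to a name (possibly the same one) rather than relying on the call alone.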

In [None]:
# sorts the rows based on their index
states_df.sort_index(ascending=False)

In [None]:
states_df

# Exercises

In [None]:
# as a reminder, this is the DataFrame we are working with
gm_df.head()

## GapMinder

In [None]:
# select only the observations concerning the year of 1972


In [None]:
# for the query above, how many observations are there?


In [None]:
# select the rows for Brazil only


In [None]:
# for which years have observations been recorded?
# (might require checking the documentation; we are looking here for unique
# values...)


In [None]:
# ...and how many?


In [None]:
# what are the top 5 countries, for 2007, with regard to life expectancy?


In [None]:
# select the observations for Asia for the period of 2007


In [None]:
# select the observations for the Americas for the period of 2000 to 2009


In [None]:
# show life expectancy and GDP per capita for Brazil


In [None]:
# (optional) how many observations for each continent? There is a method that
# might just allow one to do that job in a single line
