# A Brief Introduction to Pandas
### Wrangling data with Python

In this notebook, we will cover the basics of Pandas, a powerful Python library that is ubiquitous in data science. It has robust I/O functionality and makes many data-wrangling tasks simpler than they are in Excel. Let's get started.

In [25]:
import pandas as pd

## 1.1 DataFrames
Let's read in some existing data using Pandas' convenient read_csv function.

In [1]:
unemployment_filepath = './data/unemployment_2016.csv'
# Read in data using this filepath

In [2]:
# Check the type

In [3]:
# Select a single column of the dataframe

In [4]:
# Check the type

In [5]:
# Try and access by index
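
The cells above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual data: the CSV text is a hypothetical stand-in for `unemployment_2016.csv`, and the column names `country` and `unemp_rate` are assumptions.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of a CSV file; the real
# unemployment_2016.csv may have different columns and values.
csv_text = """country,unemp_rate
Austria,6.0
Belgium,7.8
Italy,11.7
"""

# read_csv accepts a file path or any file-like object
unemployment_df = pd.read_csv(io.StringIO(csv_text))
print(type(unemployment_df))   # the whole table is a DataFrame

# Selecting a single column returns a Series
countries = unemployment_df['country']
print(type(countries))

# A Series can be accessed by its index label
first_country = countries[0]
print(first_country)
```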

## 1.2 Vectorized Operations
In Pandas, most operations are vectorized: they are applied to every value in a data structure, such as a dataframe or a series, without an explicit loop.
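
As a small sketch of this behaviour, assuming two hypothetical series of country names and unemployment rates (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data, not the notebook's real file
countries = pd.Series(['Austria', 'Belgium', 'Italy'])
rates = pd.Series([6.0, 7.8, 11.7])

# Adding a string is applied to every element of a string series
labelled = countries + ' (EU)'
print(labelled)

# Arithmetic is applied to every element of a numeric series
shifted = rates + 1
print(shifted)

# Mixing types fails: a number cannot be added to strings
try:
    countries + 1
except TypeError as err:
    print('TypeError:', err)

# Reductions such as mean() collapse the series to a single value
average_rate = rates.mean()
print(average_rate)
```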

In [8]:
# Add a string to the countries series

In [7]:
# Add a number to the countries series

In [154]:
# Select the unemployment rates as a series

In [155]:
# Get the average

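Pandas is smart! When possible, it applies an operation only to the part of the data where it makes sense, such as the numeric columns of a dataframe. Recent versions are stricter about this and may require you to pass numeric_only=True.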
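
A minimal sketch of averaging a dataframe that mixes text and numeric columns. The data is hypothetical; passing `numeric_only=True` is one way to make the intent explicit, so the result does not depend on the installed Pandas version.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'country': ['Austria', 'Belgium', 'Italy'],
    'unemp_rate': [6.0, 7.8, 11.7],
})

# Restrict the reduction to numeric columns; the text column
# 'country' is skipped instead of causing an error
column_means = df.mean(numeric_only=True)
print(column_means)
```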

In [10]:
# Get the average of the whole unemployment dataframe

In [11]:
# Sort the dataframe by rate

In [12]:
# Use .head() to get the top five values
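
Sorting and taking the top rows might look like the sketch below. The dataframe and its `unemp_rate` column are hypothetical, and `head(2)` is used only because the sample is small.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'country': ['Austria', 'Belgium', 'Italy', 'Greece'],
    'unemp_rate': [6.0, 7.8, 11.7, 23.6],
})

# sort_values returns a new dataframe; ascending=False puts the
# highest rates first
sorted_df = df.sort_values('unemp_rate', ascending=False)

# head(n) returns the first n rows (five when n is omitted)
top_two = sorted_df.head(2)
print(top_two)
```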

### Exercise - What is the average unemployment rate of the seven countries with the highest unemployment in Europe?

In [13]:
# Sort the values of the unemployment_df - descending or ascending?
# Get the first seven rows in the returned series
# Get the average of those rows

#### Extra - explore these methods:
.min(), .max(), .sum(), .unique(), .nunique(), .count(), .duplicated()

## 1.3 Merging DataFrames

In [18]:
gdp_filepath = './data/gdp_2016.csv'
# Read in data using this filepath

Let's create a new dataframe by merging two existing dataframes. This is the one we'll use for the rest of section 1.

In [14]:
# Merge unemployment df with gdp df
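
A merge of this kind could be sketched as follows, assuming both tables share a `country` column. The two small dataframes are hypothetical stand-ins for the unemployment and GDP files.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files
unemployment_df = pd.DataFrame({
    'country': ['Austria', 'Italy', 'Greece'],
    'unemp_rate': [6.0, 11.7, 23.6],
})
gdp_df = pd.DataFrame({
    'country': ['Italy', 'Austria', 'Spain'],
    'gdp': [1689824.0, 350000.0, 1100000.0],
})

# on='country' names the shared key column; the default
# how='inner' keeps only countries present in both frames
eur_data = pd.merge(unemployment_df, gdp_df, on='country')
print(eur_data)
```

Greece and Spain each appear in only one table, so the inner merge drops them. Passing `how='outer'` would keep them and fill the missing values with NaN.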

What other data can we include? Are there other file types that we can work with?

In [167]:
# Read in misc data from Excel
misc_filepath = './data/misc_data.xlsx'

In [15]:
# Get correct data from file
# ?pd.read_excel

In [16]:
# Get income sheet from Excel

In [175]:
# Merge with existing data
eur_data = pd.merge(eur_data, income_df)

In [177]:
# Read in population data from misc Excel file

In [178]:
# How can we get the right data?

In [180]:
# Do we need all of these columns?

In [181]:
# How do we select specific columns?

In [17]:
# Get countries along with total population
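
Selecting a subset of columns can be sketched like this. The population table and its column names are hypothetical; the real sheet in `misc_data.xlsx` may be laid out differently.

```python
import pandas as pd

# Hypothetical population table
pop_df = pd.DataFrame({
    'country': ['Austria', 'Italy'],
    'male_pop': [4100000, 28700000],
    'female_pop': [4300000, 30733744],
    'total_pop': [8400000, 59433744],
})

# A list of column names inside the brackets returns a dataframe
# containing only those columns, in the order given
country_pop = pop_df[['country', 'total_pop']]
print(country_pop)
```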

Pandas is smart, but it's not clairvoyant!

In [18]:
# Merge with existing data
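
One situation where a merge needs explicit guidance is when the key columns are named differently in the two tables. The sketch below assumes that case purely for illustration; it is not necessarily what happens with the notebook's files, and the `Country` column name is hypothetical.

```python
import pandas as pd

# Hypothetical tables whose key columns differ in name
eur_data = pd.DataFrame({
    'country': ['Austria', 'Italy'],
    'unemp_rate': [6.0, 11.7],
})
pop_df = pd.DataFrame({
    'Country': ['Italy', 'Austria'],
    'total_pop': [59433744, 8400000],
})

# Tell merge which column to use on each side
merged = pd.merge(eur_data, pop_df, left_on='country', right_on='Country')

# Both key columns are kept, so drop the redundant one
merged = merged.drop(columns='Country')
print(merged)
```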

In [19]:
# Sort values by country

## 1.4 Indexes

In [20]:
countries_dict = {
    15: {'country': 'Italy', 
         'unemp_rate': 11.7, 
         'gdp': 1689824.0, 
         'median_income': 16237, 
         'total_pop': 59433744
        }
}

In [21]:
# Make dataframe from countries_dict
countries_from_dict = pd.DataFrame.from_dict(countries_dict, orient='index')

In [22]:
# Get the index of the sorted eur data
# Check the type

Indexes are immutable, and for good reason: row labels that could change in place would make lookups and alignment unreliable.
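
A short sketch of what immutability means in practice, using a trimmed copy of `countries_dict`. Assigning to a single position of an index raises a `TypeError`.

```python
import pandas as pd

countries_dict = {15: {'country': 'Italy', 'unemp_rate': 11.7}}
df = pd.DataFrame.from_dict(countries_dict, orient='index')

idx = df.index
print(type(idx))

# Individual index entries cannot be reassigned in place
index_is_immutable = False
try:
    idx[0] = 99
except TypeError as err:
    index_is_immutable = True
    print('TypeError:', err)
```

The index as a whole can still be replaced, for example with `set_index` or `reset_index`, which return a new dataframe.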

Dataframes can also be indexed by labels, rather than numbers.

In [23]:
# Set index of data by 'country'

In some cases, it can be beneficial to reset the index of a dataframe. In this example, we do this for presentation purposes.
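
Setting a label index and resetting a numeric one might look like this sketch. The two-row dataframe is hypothetical.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'country': ['Italy', 'Austria'],
    'unemp_rate': [11.7, 6.0],
})

# Index by label: rows can now be looked up by country name
by_country = df.set_index('country')
italy_rate = by_country.loc['Italy', 'unemp_rate']
print(italy_rate)

# Sorting keeps the old row numbers, which now look shuffled
sorted_df = df.sort_values('country')
print(list(sorted_df.index))

# drop=True discards the old index instead of keeping it as a column
clean_df = sorted_df.reset_index(drop=True)
print(clean_df)
```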

In [26]:
# Reset indexes for data, drop existing

## 1.5 Exporting Data
Pandas has robust I/O functionality.

In [27]:
# Write a function to generate a filepath
# Params
prefix = './data/out/'
def get_filepath(filename, extension):
    pass

In [29]:
# get filepath
filepath = get_filepath('eur_data_sorted', 'csv')

In [30]:
# write data to csv
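
One possible way to build a path and write a CSV is sketched below. The helper `build_filepath` is a hypothetical name rather than the notebook's `get_filepath`, and a temporary directory stands in for `./data/out/` so the example runs anywhere.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'country': ['Austria', 'Italy'],
    'unemp_rate': [6.0, 11.7],
})


def build_filepath(prefix, filename, extension):
    """Join a directory, a file name and an extension into one path."""
    return os.path.join(prefix, f'{filename}.{extension}')


with tempfile.TemporaryDirectory() as out_dir:
    csv_path = build_filepath(out_dir, 'eur_data_sorted', 'csv')
    # index=False leaves the row numbers out of the file
    df.to_csv(csv_path, index=False)
    round_trip = pd.read_csv(csv_path)

# Without a path, these writers return the output as a string
json_text = df.to_json(orient='records')
html_text = df.to_html(index=False)
print(json_text)
```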

### Exercise - Write this dataframe to Excel, JSON, and HTML
Google is your friend...

In [33]:
# Excel
extension = 'xlsx'

In [34]:
# JSON
# How would you view this file?
# What similarities does it have with csv and excel? What differences?
extension = 'json'

In [35]:
# HTML
extension = 'html'

## 2.1 Working with DataFrames
Let's work with an existing dataframe that includes a lot of the same data from the previous sections.

In [36]:
# Read in csv data
filepath = './data/final/eur_data_final.csv'

In [37]:
# Get the head

In [38]:
# Get the tail

In [39]:
# Use .info()

In [40]:
# Get the shape

In [41]:
# Use describe()

In [42]:
# Use .mean()

## 2.2 Boolean Indexing
First of all, what is the boolean data type? It is a data type that represents one of two possible values, for example True/False or On/Off.

In [260]:
# Get boolean series for the country of Austria

0      True
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
Name: country, dtype: bool

In [43]:
# Query data with austria bool
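
Boolean indexing can be sketched in two steps: build a series of True/False values, then use it to filter the rows. The three-row dataframe is hypothetical.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'country': ['Austria', 'Belgium', 'Greece'],
    'unemp_rate': [6.0, 7.8, 23.6],
})

# Comparing a column to a value yields one boolean per row
austria_bool = df['country'] == 'Austria'
print(austria_bool)

# Passing the boolean series in brackets keeps only the True rows
austria_rows = df[austria_bool]
print(austria_rows)
```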

In [44]:
# Use bracket notation, as opposed to dot notation
# This can be useful when there are spaces in the column name
# Try it with 'Greece'

Queries can also be generated using partial matches.
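
A partial match can be sketched with `str.contains`. The country list is hypothetical, and treating case-insensitive matching as the "better" query is an assumption about what the later cells intend.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'country': ['Austria', 'Belgium', 'Iceland', 'Italy']})

# Case-sensitive by default: a lowercase 'i' does not match 'I'
has_i = df['country'].str.contains('i')
print(df[has_i])

# case=False matches regardless of capitalisation
has_i_any_case = df['country'].str.contains('i', case=False)
print(df[has_i_any_case])
```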

In [45]:
# Check a partial match in the country name

In [46]:
# Query the data with the partial match

In [47]:
# How can we make this query better?

In [48]:
# Use the improved query

## 2.3 Multiple queries

Which countries have low unemployment and an 'l' in their name?

In [275]:
# Get unemployment rates under 7

In [49]:
# Combine queries using Pandas equivalent of 'and'
# In set theory, this is intersection.

In [51]:
# Combine queries using Pandas equivalent of 'or'
# In set theory, this is union.

In [53]:
# Negate query using Pandas operator
# In set theory, this is complement
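
The three operators can be sketched as follows, using hypothetical data. Pandas uses `&`, `|` and `~` on boolean series in place of Python's `and`, `or` and `not`.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'country': ['Austria', 'Belgium', 'Italy', 'Poland'],
    'unemp_rate': [6.0, 7.8, 11.7, 6.2],
})

low_rate = df['unemp_rate'] < 7
has_l = df['country'].str.contains('l')

# and (intersection): both conditions hold
both = df[low_rate & has_l]

# or (union): at least one condition holds
either = df[low_rate | has_l]

# not (complement): the condition does not hold
not_low = df[~low_rate]

print(both)
print(either)
print(not_low)
```

When the conditions are written inline, each one needs its own parentheses, as in `df[(df['unemp_rate'] < 7) & (df['country'].str.contains('l'))]`, because `&` binds more tightly than the comparison operators.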

### Exercise - What countries have an unemployment rate greater than 9% and a GDP less than 280000?

### Exercise - What countries have a median_income between 10000 and 20000?

In [64]:
# Can you think of an alternative way of doing this?

## 3.1 Selection
Using .loc[] and .iloc[]

In [54]:
countries_dict = {
    15: {'country': 'Italy', 
         'unemp_rate': 11.7, 
         'gdp': 1689824.0, 
         'median_income': 16237, 
         'total_pop': 59433744
        }
}

In [67]:
# With vanilla python, how do we get the word 'Italy' from a dictionary?

'Italy'

In [56]:
# Now use .loc[]

.loc is short for locate, and it is used with square brackets: .loc[]. It can be used in a variety of ways. In the example above, it functions like selecting from a grid: first the row, then the column...
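
A minimal sketch comparing the plain dictionary lookup with the equivalent `.loc[]` selection, using the `countries_dict` defined earlier in this section:

```python
import pandas as pd

countries_dict = {
    15: {'country': 'Italy',
         'unemp_rate': 11.7,
         'gdp': 1689824.0,
         'median_income': 16237,
         'total_pop': 59433744
        }
}

# Vanilla Python: outer key first, then inner key
from_dict_lookup = countries_dict[15]['country']
print(from_dict_lookup)

# Pandas: row label first, then column label
df = pd.DataFrame.from_dict(countries_dict, orient='index')
from_loc = df.loc[15, 'country']
print(from_loc)
```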

In [57]:
# Try another

Entire rows can also be returned...

...or multiple columns of a row...

...or multiple rows and multiple columns.
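
These three selections can be sketched as follows. The dataframe is hypothetical (only the Italy row uses values from this notebook), and the last line uses `.iloc[]` to show position-based selection for contrast.

```python
import pandas as pd

# Hypothetical data with non-default row labels
df = pd.DataFrame(
    {
        'country': ['Greece', 'Ireland', 'Italy'],
        'unemp_rate': [23.6, 8.4, 11.7],
        'gdp': [176000.0, 270000.0, 1689824.0],
    },
    index=[13, 14, 15],
)

# An entire row, returned as a series
italy_row = df.loc[15]

# Multiple columns of one row
italy_part = df.loc[15, ['country', 'gdp']]

# Multiple rows and multiple columns, returned as a dataframe
block = df.loc[[13, 15], ['country', 'unemp_rate']]

# .iloc[] selects by position rather than by label
first_row = df.iloc[0]

print(italy_row)
print(italy_part)
print(block)
```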

If the data is indexed by label, .loc[] can be used with a string.

In [60]:
# Get the country of Slovenia using .loc[] on the label-indexed data
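
A label-based lookup could be sketched like this. The values are hypothetical, and the dataframe is indexed by an assumed `country` column.

```python
import pandas as pd

# Hypothetical sample data, indexed by country name
df = pd.DataFrame({
    'country': ['Slovakia', 'Slovenia', 'Spain'],
    'unemp_rate': [9.7, 8.0, 19.6],
})
by_country = df.set_index('country')

# With a label index, .loc[] takes the label as a string
slovenia = by_country.loc['Slovenia']
slovenia_rate = by_country.loc['Slovenia', 'unemp_rate']
print(slovenia)
print(slovenia_rate)
```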

### Exercise - What countries have a higher unemployment rate than Slovenia and have a lowercase 't' in their name?

## 4.1 MultiIndexes (hierarchical indexes)
Pandas also supports MultiIndexes, which allow users to index by multiple values or groups of values.

In [281]:
parent_array = ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']
child_array = ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']
arrays = [parent_array, child_array]

We want a multidimensional array of random numbers. Numpy for the win!

In [289]:
import numpy as np

In [61]:
# Set seed as 5433210
# Use np.random.randint() to create an array of shape (8, 4) of random integers below 10

In [62]:
# Create a dataframe with the multiple indexes

In [63]:
# Reset the column names
column_names = ['var1', 'var2', 'var3', 'var4', 'var5']

In [65]:
# Select the rows of the bar index
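
The steps in this section can be sketched as follows. Reading the instruction as "integers from 0 to 9" is an assumption, and because an (8, 4) array has four columns, the sketch uses four column names.

```python
import numpy as np
import pandas as pd

parent_array = ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']
child_array = ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']
arrays = [parent_array, child_array]

# Seed the generator so the random numbers are reproducible
np.random.seed(5433210)

# Integers from 0 up to (but not including) 10, in 8 rows and 4 columns
values = np.random.randint(10, size=(8, 4))

# Passing a list of arrays as the index builds a MultiIndex
multi_df = pd.DataFrame(values, index=arrays)
multi_df.columns = ['var1', 'var2', 'var3', 'var4']
print(multi_df)

# Selecting a top-level label returns its child rows
bar_rows = multi_df.loc['bar']
print(bar_rows)
```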

### Exercise - Add a 'three' to two of the high-level indexes.

## 4.2 Groupby

In [66]:
# Read in the csv data
un_data_filepath = './data/un_world_data.csv'

In [68]:
# Group the data by region

In [69]:
# Get the average of each column by region

In [70]:
# What other methods can we use?
# .min(), .max(), .count()

In [363]:
# Use .agg() to get different sets of values, applying multiple methods

In [71]:
# Access one of the sub indexes using bracket notation and apply another operation
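
A groupby workflow along these lines might look like the sketch below. The dataframe is entirely hypothetical, including its column names; the real `un_world_data.csv` may be structured differently.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'region': ['Europe', 'Europe', 'Asia', 'Asia'],
    'country': ['A', 'B', 'C', 'D'],
    'gdp': [100.0, 300.0, 200.0, 600.0],
    'population': [10, 30, 20, 40],
})

# Group the rows by region
grouped = df.groupby('region')

# Average of the numeric columns within each region
means = grouped[['gdp', 'population']].mean()
print(means)

# .agg() applies different methods to different columns; the
# result has two levels of column labels
summary = grouped.agg({'gdp': ['min', 'max'], 'population': ['sum']})
print(summary)

# Bracket notation reaches a sub-column, which can then be sorted
max_gdp = summary['gdp']['max'].sort_values(ascending=False)
print(max_gdp)
```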

### Exercise - What six regions of the world have the highest GDP?

### Exercise - What is the average population of the six regions with the highest population density?