<a href="https://colab.research.google.com/github/datamind-dotfit/python_for_data_science/blob/master/Python_for_data_science_answers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1:
## Initial setup - let's get some data!
Before we start, let's first get some data. In Google Colab you can execute shell commands by prefixing them with an exclamation mark, as shown below. The **wget** command is a command-line utility that retrieves files from the web.

**Note:** The wget command is available in the Google Colab environment, but it may not be installed on your own machine. To run this notebook locally, download the data by hand.

In [1]:
! wget https://github.com/datamind-dotfit/python_for_data_science/raw/master/observations.xlsx https://github.com/datamind-dotfit/python_for_data_science/raw/master/sales.xlsx

/bin/sh: wget: command not found
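If wget is not available, as in the output above, the same files can be fetched from Python. The following is a minimal sketch using only the standard library; the helper name `download_data` and the constants are our own and not part of the course material.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file names taken from the wget command above.
BASE_URL = "https://github.com/datamind-dotfit/python_for_data_science/raw/master/"
FILES = ["observations.xlsx", "sales.xlsx"]


def download_data(target_dir="."):
    """Download each file unless it already exists; return the local paths."""
    paths = []
    for name in FILES:
        path = Path(target_dir) / name
        if not path.exists():
            # Only hit the network for files that are still missing.
            urlretrieve(BASE_URL + name, path)
        paths.append(path)
    return paths
```

Calling `download_data()` once stores both files in the working directory; calling it again skips files that are already there.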


### Verify data retrieval
Let's verify that the data was actually retrieved and stored in the working directory. We can use the shell command 'ls' for this; it lists all the files in the current directory. We should see both *observations.xlsx* and *sales.xlsx* in the result.

In [2]:
! ls

Python_for_data_science_ex3.ipynb
Python_for_data_science_ex3_answers.ipynb
observations.xlsx
sales.xlsx


---

# Step 2:
## Let's load the pandas library and our freshly retrieved data

Load the pandas package and make sure you can access its functions through the name 'pd'. Remember that in Python we can import modules like so:

```
import module_name as some_shorthand
```



In [None]:
import pandas as pd

Read in the two Excel files that we downloaded in Step 1.<br>
http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

In [2]:
sales = pd.read_excel('sales.xlsx')
observations = pd.read_excel('observations.xlsx')

# Step 3
## 3.1 - Explore the data 

Let's start off by investigating some of the rows of our dataframe.

<br>

**Questions:** <br>
- Inspect the top of the sales dataframe. What's going on here?
- Sample 20 random rows from the dataframe
- Check out the seemingly empty columns. Which unique values do they contain? Is there anything other than NaN? <br>


https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html <br>
Note: unique() is only usable on a pandas Series, not on a DataFrame.
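To make the note concrete, here is a small sketch on a made-up dataframe (the column names and values are illustrative, not the real sales data). It shows `unique()` on a single column and, as one possible shortcut, a check that flags every column containing only NaN.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sales data; names and values are made up.
toy = pd.DataFrame({
    "name_of_country": ["AUSTRIA", "AUSTRIA", "BELGIUM"],
    "bikes": [60, 59, 41],
    "filler": [np.nan, np.nan, np.nan],
})

# unique() is a Series method, so select one column first.
filler_values = toy["filler"].unique()

# isna().all() checks each column and is True where every value is NaN.
empty_columns = toy.columns[toy.isna().all()].tolist()

print(filler_values)
print(empty_columns)
```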

In [3]:
# 3.1 - 1
# head() and tail() display the top and bottom rows.
sales.head()

Unnamed: 0,name_of_country,year,bikes,Unnamed: 3,Unnamed: 4,name_of_country.1,year.1,total_turnover
0,AUSTRIA,1670,60,,,AUSTRIA,1670,5274.0
1,AUSTRIA,1671,59,,,AUSTRIA,1671,5186.1
2,AUSTRIA,1674,53,,,AUSTRIA,1674,4658.7
3,AUSTRIA,1675,58,,,AUSTRIA,1675,5098.2
4,AUSTRIA,1676,56,,,AUSTRIA,1676,4922.4


In [None]:
# 3.1 - 2
# Sample works on a DataFrame
sales.sample(20)

In [4]:
# 3.1 - 3
# Unique() can be used on Series
sales.iloc[:, 3].unique()

array([nan])

## Selecting with pandas' iloc
Remember that you can use pandas' iloc functionality to access information by integer index:



```
df.iloc[:, 3]
```
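As a quick illustration of integer-based selection, the sketch below uses a made-up dataframe. Positions start at 0 and slices exclude their end position.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "country": ["Austria", "Austria", "Belgium", "Belgium"],
    "year": [1670, 1671, 1670, 1671],
    "bikes": [60, 59, 41, 43],
})

third_row = toy.iloc[2]        # one row, returned as a Series
bikes_column = toy.iloc[:, 2]  # all rows of the third column
first_two = toy.iloc[2, 0:2]   # row 2, columns 0 and 1 (position 2 is excluded)

print(third_row)
print(bikes_column)
print(first_two)
```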


**Questions:**
- Inspect the top of the observations dataframe. 
- What do you notice about the format of the data? How does it compare to the sales dataframe?
- Find the name of the country and the year in the sixth row of the sales dataframe.




In [5]:
# 3.1 - 4
observations.head()

Unnamed: 0,countryname,year,pop
0,Austria,1670,85
1,Austria,1671,83
2,Austria,1672,86
3,Austria,1673,81
4,Austria,1674,75


In [None]:
# 3.1 - 5
# Using iloc and column selection in sequence.
sales.iloc[5][['name_of_country', 'year']]

In [0]:
# 3.1 - 6
# Using iloc for both rows & columns.
sales.iloc[5, 0:2]

## Filtering in pandas
Filtering rows works by specifying the conditions that the rows should meet:



```
df[df['column_one'] == 3]
```
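The condition inside the brackets is a boolean Series, which we can store in a variable and combine with other conditions. A small sketch on made-up data, assuming the same kind of columns as in the sales dataframe:

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "country": ["NETHERLANDS", "NETHERLANDS", "BELGIUM"],
    "year": [1674, 1675, 1675],
    "bikes": [70, 72, 41],
})

# Each comparison yields a boolean Series with one value per row.
is_nl = toy["country"] == "NETHERLANDS"
in_1675 = toy["year"] == 1675

nl_rows = toy[is_nl]                          # single condition
nl_1675 = toy[is_nl & in_1675]                # both conditions must hold
nl_bikes = toy.loc[is_nl, ["bikes", "year"]]  # filter rows and pick columns at once

print(nl_1675)
```

When writing the conditions inline, wrap each one in parentheses, because `&` binds more tightly than `==`.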

**Questions:**
- Find all rows in the observations dataframe where the value of 'pop' is greater than 80.
- Find all rows in the sales dataframe concerning the Netherlands.
- Now from those rows, select the bikes and year columns.
- Add an extra filter: Find the row about the Netherlands in 1675.

In [0]:
# 3.1 - 7
# Filter pop > 80
observations[observations['pop'] > 80]

In [0]:
# 3.1 - 8
# Filter 'NETHERLANDS'
sales[sales['name_of_country'] == 'NETHERLANDS']

In [0]:
# 3.1 - 9
# Select columns from subset
sales[sales['name_of_country'] == 'NETHERLANDS'][['bikes', 'year']]

In [0]:
# 3.1 - 10
# Multiple criteria filter
sales[(sales['name_of_country'] == 'NETHERLANDS') & (sales['year'] == 1675)]

## 3.2 - Joining data

![alt text](https://i.stack.imgur.com/iJUMl.png)

Now that we know what the data looks like and what to expect from it, we can start thinking about merging the sales and observations dataframes. This will give us combined data on what happened in a country during a year. Before we can start joining, should any data transformations be performed? Remember that you can inspect the dataframes with the .head() function.

Using the str.title() function, which transforms a string so that the first letter of each word is capitalized and all other letters are lowercase, you can create uniform data among dataframes to join on.

**Question:**
- Make the relevant join data uniform among dataframes.

Now that our country names are written in the exact same way, we can perform the join we have been speaking about. What do we join on, and why? 

**Questions:**
- What happens if you join on only one column, such as the country or the year? 
- What happens if you join a dataframe to itself?

Look up the way to join on multiple columns in the pandas documentation:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

- Think of when you would use which type of join. If you wanted to find all observations and, if possible, add the sales information to them, how would the code look?

Try them all (create a df_inner, df_outer, df_left and df_right) and inspect the results to find out the effects of various joins. The following code will show you the shape of the dataframe in the format (rows, columns):

```
df.shape
```
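To see how the join type affects the shape, here is a sketch on two made-up dataframes that share only some (country, year) pairs. The `indicator=True` argument adds a `_merge` column that tells where each row came from.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
left = pd.DataFrame({"country": ["Austria", "Austria", "Belgium"],
                     "year": [1670, 1671, 1670],
                     "bikes": [60, 59, 41]})
right = pd.DataFrame({"country": ["Austria", "Belgium", "Belgium"],
                      "year": [1670, 1670, 1671],
                      "pop": [85, 40, 42]})

shapes = {}
for how in ["inner", "outer", "left", "right"]:
    merged = left.merge(right, how=how, on=["country", "year"], indicator=True)
    shapes[how] = merged.shape

print(shapes)
```

Only two key pairs occur in both dataframes, so the inner join is the smallest and the outer join the largest result.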



In [0]:
# 3.2 - 1
# Inspection
sales.head()
observations.head()

# The .str accessor applies title() to every value in a Series and leaves missing values (NaN) untouched.
# We reassign the result to overwrite the original column.
sales['name_of_country'] = sales['name_of_country'].str.title()
observations['countryname'] = observations['countryname'].str.title()



In [None]:
# 3.2 - 2
# Joining on only one column introduces multiple matches per record which leads to an increase in rows.

In [None]:
# 3.2 - 3
# Depending on the join column(s), you end up with the same number of rows or with a larger dataframe (more rows and more columns).
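Both effects can be reproduced on a small made-up example. This is only a sketch with illustrative values; the real dataframes behave the same way but on a larger scale.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
left = pd.DataFrame({"country": ["Austria", "Belgium"],
                     "year": [1670, 1670],
                     "bikes": [60, 41]})
right = pd.DataFrame({"country": ["Austria", "Belgium"],
                      "year": [1670, 1670],
                      "pop": [85, 40]})

# Joining on year only: every left row matches every right row of that year.
on_year = left.merge(right, on="year")

# Joining on both keys: each row has exactly one match.
on_both = left.merge(right, on=["country", "year"])

# Self join on unique keys: same rows, non-key columns get _x and _y suffixes.
self_join = left.merge(left, on=["country", "year"])

print(on_year.shape, on_both.shape, self_join.shape)
```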

In [0]:
# 3.2 - 4
# Inner, Outer, Left, Right join
df_inner = sales.merge(observations, how='inner', left_on=['name_of_country', 'year'], 
                       right_on=['countryname', 'year'])

df_outer = sales.merge(observations, how='outer', left_on=['name_of_country', 'year'], 
                       right_on=['countryname', 'year'])

df_left = sales.merge(observations, how='left', left_on=['name_of_country', 'year'], 
                      right_on=['countryname', 'year'])

df_right = sales.merge(observations, how='right', left_on=['name_of_country', 'year'], 
                       right_on=['countryname', 'year'])

## 3.3 - Missing values
We now have a dataframe with all information in it. Upon closer inspection, you will see that the dataframe from the **outer join** has some missing values. Do you understand completely why this happens? 

In this section, we will inspect the missing values and deal with them in an appropriate manner. Let's clean up the dataframe from the outer join a bit.

**Questions**:
- Delete all columns containing only NaN or duplicate data, making it easier to work with.
- Find all rows that have null values.
- Can we guess which country the observations with a missing country value belong to?
- If you had to fill in the missing values in the 'bikes', 'total_turnover' and 'pop' columns, how would you do this? Just think about it for now, later we will actually do it.

Remember that the isnull() function helps you find rows containing null values.
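A short sketch of how isnull() can be combined with other methods, shown on a made-up dataframe. Dropping all-NaN columns with `dropna(axis=1, how="all")` is one possible approach, not necessarily the one the exercise has in mind.

```python
import numpy as np
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "countryname": ["Austria", np.nan, "Belgium"],
    "year": [1670, 1671, 1670],
    "pop": [85.0, np.nan, 40.0],
    "filler": [np.nan, np.nan, np.nan],
})

# Drop columns in which every value is missing.
trimmed = toy.dropna(axis=1, how="all")

# isnull() gives a boolean dataframe; any(axis=1) flags rows with at least one True.
rows_with_nulls = trimmed[trimmed.isnull().any(axis=1)]

# Summing the booleans counts the missing values per column.
nulls_per_column = trimmed.isnull().sum()

print(rows_with_nulls)
print(nulls_per_column)
```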

In [0]:
# 3.3 - 1
# Select relevant columns
cols = ['countryname', 'year', 'bikes', 'total_turnover', 'pop']

df_outer = df_outer[cols]

In [0]:
# 3.3 - 2
# Use isnull(), then use any to select all rows that include a null value.
df_outer[df_outer.isnull().any(axis=1)]

In [None]:
# 3.3 - 3

In [None]:
# 3.3 - 4
# - Use a dictionary in the fillna function
# - OR loop over columns and apply fillna
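Both options could look like the sketch below, shown on a made-up dataframe. The choice of mean and median as fill values is purely an assumption for illustration; which statistic is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "countryname": ["Austria", "Austria", "Belgium", "Belgium"],
    "bikes": [60.0, np.nan, 40.0, 44.0],
    "pop": [85.0, 83.0, np.nan, 42.0],
})

# Option 1: a dictionary maps each column name to its own fill value.
fill_values = {"bikes": toy["bikes"].mean(), "pop": toy["pop"].median()}
filled = toy.fillna(fill_values)

# Option 2: loop over the columns and fill each one with its mean.
filled_by_loop = toy.copy()
for column in ["bikes", "pop"]:
    column_mean = filled_by_loop[column].mean()
    filled_by_loop[column] = filled_by_loop[column].fillna(column_mean)

print(filled)
print(filled_by_loop)
```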

## Aggregating data
Aggregations are useful to get a quick look at derived statistics. At Datamind we often use these derived statistics to get a better picture of our data and also to fill in missing values. In this section we will get a clear picture of how to aggregate data while grouping.

**Questions**:
- Count the number of observations per country in the dataframe.
- Calculate the average turnover per year
- Create a dataframe containing the year, average turnover per year and average population per year
- Calculate the maximum population per country.
- Calculate the sum of the total turnover per year per country. Why is this not such a sensible statistic? Take note of what happens to the NaN observations.

**Bonus**:
- Calculate, for each row, the difference between the total turnover for this observation and the average turnover for that year and store this in a new dataframe containing all old columns plus the new one. For completeness, rename the new column to something appropriate.

Hint: Create a new dataframe with the averages and join the new dataframe onto the old one. Then create a new column by subtracting one value from another:



```
df['derived_column'] = df['original_column1'] - df['original_column2']
```



In [0]:
# Count per group
df_outer.groupby('countryname').count()

In [0]:
# Average over year
df_outer[['year', 'total_turnover']].groupby('year').mean()

In [0]:
# Average turnover and population
df_outer[['year', 'total_turnover', 'pop']].groupby('year').mean()

In [0]:
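# Max population per country
df_outer.groupby('countryname')['pop'].max()

In [0]:
# Sum of total turnover per year per country
df_outer.groupby(['countryname', 'year'])['total_turnover'].sum()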
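What happens to NaN values when summing per group is easiest to see on a small made-up example. By default, rows with a missing group key are dropped and missing values are skipped, so a group with only NaN sums to 0. Passing `min_count=1` keeps such a group as NaN instead.

```python
import numpy as np
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "countryname": ["Austria", "Austria", "Belgium", np.nan],
    "year": [1670, 1670, 1670, 1670],
    "total_turnover": [100.0, np.nan, np.nan, 50.0],
})

grouped = toy.groupby(["countryname", "year"])["total_turnover"]

# Default: NaN values are skipped, an all-NaN group becomes 0.0.
default_sum = grouped.sum()

# min_count=1: a group needs at least one valid value, otherwise the sum is NaN.
strict_sum = grouped.sum(min_count=1)

print(default_sum)
print(strict_sum)
```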

In [0]:
# Bonus
average_per_year = df_outer[['year', 'total_turnover']].groupby('year').mean()
df_outer2 = df_outer.merge(average_per_year, on='year', how='left')
df_outer2['diff_from_avg'] = df_outer2['total_turnover_x'] - df_outer2['total_turnover_y']
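As an alternative to merging the averages back in, `groupby(...).transform` returns the group statistic aligned with the original rows, so no join and no `_x`/`_y` columns are needed. A sketch on made-up data:

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "countryname": ["Austria", "Belgium", "Austria", "Belgium"],
    "year": [1670, 1670, 1671, 1671],
    "total_turnover": [100.0, 200.0, 300.0, 500.0],
})

# transform("mean") repeats each year's average on every row of that year.
yearly_mean = toy.groupby("year")["total_turnover"].transform("mean")

# assign() returns a new dataframe with all old columns plus the new one.
result = toy.assign(diff_from_avg=toy["total_turnover"] - yearly_mean)

print(result)
```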

## Enrich our data using APIs, store information in a new database using Pandas

We can enrich our dataframe further by adding data from an API. Let's see if we can add country information to the data that we currently have. Let's use another public API to add the 2-letter ISO code, longitude and latitude of each country that we have in our data set. This exercise is meant to take some time, so feel free to experiment! This is where you learn the most.

API: https://restcountries.eu/rest/v2/name

This exercise requires you to do the following:
- Collect all countries in your data set
- Call the API for each country
- Extract the information from the **JSON** response
- Create a data frame from the API results
- Join the data to your existing data frame
- Inspect the results
- Optional: Store the table in a SQLite database table using pandas SQL
- Optional: Verify that the results are in your sqlite database
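
The steps above could be sketched as follows. This is not a reference solution: the response keys `alpha2Code` and `latlng` are assumptions about what the API returns, `sample_payload` is a hypothetical response used so the sketch runs without network access, and the function names and the table name `country_data` are our own.

```python
import json
import sqlite3
from urllib.parse import quote
from urllib.request import urlopen

import pandas as pd

API_URL = "https://restcountries.eu/rest/v2/name/"  # base URL from the exercise


def fetch_country(name):
    """Call the API for one country and return the decoded JSON payload."""
    with urlopen(API_URL + quote(name)) as response:
        return json.load(response)


def extract_fields(name, payload):
    """Pick the ISO code and coordinates from the first match.

    Assumes each match has 'alpha2Code' and 'latlng' ([latitude, longitude]).
    """
    match = payload[0]
    latitude, longitude = match["latlng"]
    return {"countryname": name, "iso2": match["alpha2Code"],
            "latitude": latitude, "longitude": longitude}


# Hypothetical response standing in for fetch_country("Austria").
sample_payload = [{"name": "Austria", "alpha2Code": "AT", "latlng": [47.33, 13.33]}]

# Toy version of our existing data.
data = pd.DataFrame({"countryname": ["Austria", "Austria"], "year": [1670, 1671]})

# Collect the countries, extract the fields and build a dataframe.
countries = data["countryname"].dropna().unique()
records = [extract_fields(name, sample_payload) for name in countries]
country_info = pd.DataFrame(records)

# Join the API data onto the existing dataframe.
enriched = data.merge(country_info, on="countryname", how="left")

# Optional: store the table in SQLite and read it back to verify.
conn = sqlite3.connect(":memory:")
enriched.to_sql("country_data", conn, index=False, if_exists="replace")
stored = pd.read_sql("SELECT * FROM country_data", conn)
conn.close()

print(stored)
```

In the real exercise, replace `sample_payload` with `fetch_country(name)` inside the list comprehension and inspect one response first to confirm which keys it actually contains.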



In [None]:
# Final assignment