# <center>Datamind - Python for Data Science</center>
Welcome to Python for Data Science! Today we'll take a deep dive into the world of Pandas, dataframes and dataframe operations. You may copy this notebook to your local Google Drive or computer to study the questions and answers later on. If you're not familiar with notebooks, please see https://jupyter-notebook.readthedocs.io/en/latest/notebook.html#notebook-documents for a quick introduction to Jupyter notebooks. These principles apply directly to Google Colab notebooks.

The notebook is divided into four sections; each step requires input from the previous step(s). If you have any questions, feel free to ask them in class or contact one of the trainers by e-mail!
    
**Some tips to get you started:**
- Make new cells! You can make as many as you want. This way you can execute small code blocks and work iteratively.
- Break larger problems down into small chunks (and work them out in individual code cells). Once you're convinced everything is in order, you can merge your code into one code cell.
- Use comments to explain what you're doing. The `#` character starts an inline comment.
- If you resort to Google or Stack Overflow for help, make sure you actually understand the solution that is provided. And rather than copying the code, make sure <u>*you*</u> write the code!


---

# <center> Part one: getting data from the web. </center>
## Getting data using `wget`
Before we start, let's first get some data. In Google Colab you can execute shell commands by prefixing them with an exclamation mark (`!`), as shown below. The `wget` command is a command-line utility for retrieving files from the web. A shell command is not Python code! Any commands in this notebook that start with `!` cannot be used in Python code directly; to do the same from Python you'll have to look into different solutions.

**Note:** <br>The `wget` command only works in the Google Colab environment and on some Linux-based systems. To run this locally, download the data by hand (by copying the links into a browser) or change the command to whatever is applicable for your OS.
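
If `wget` is not available, one possible alternative is to download the files from Python itself. The sketch below uses only the standard library; the helper names `filename_from_url` and `download_all` are made up for this illustration, and the download call is left commented out so nothing is fetched until you decide to run it.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlretrieve

# The two files used in this notebook (same links as in the wget command).
URLS = [
    "https://github.com/datamind-dotfit/python_for_data_science/raw/master/observations.xlsx",
    "https://github.com/datamind-dotfit/python_for_data_science/raw/master/sales.xlsx",
]


def filename_from_url(url):
    """Return the last path component of a URL, e.g. 'sales.xlsx' (hypothetical helper)."""
    return PurePosixPath(urlparse(url).path).name


def download_all(urls, target_dir="."):
    """Download every URL into target_dir, keeping the original file names (hypothetical helper)."""
    for url in urls:
        destination = Path(target_dir) / filename_from_url(url)
        urlretrieve(url, destination)
        print(f"Saved {destination}")


# download_all(URLS)  # uncomment to download into the current directory
```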

In [None]:
! wget https://github.com/datamind-dotfit/python_for_data_science/raw/master/observations.xlsx https://github.com/datamind-dotfit/python_for_data_science/raw/master/sales.xlsx

## Verify data retrieval
Let's verify that the data was actually retrieved and stored in the working directory. We can use the shell command `ls` for this; it lists all the files in the current directory. We should see both `observations.xlsx` and `sales.xlsx` in the result. You may notice some other files in the directory. These files are there by default (or are located on your computer).

In [None]:
! ls
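
The same check can also be done from Python, which may be handy when shell commands are not available. This is a small sketch using `pathlib`; the function name `missing_files` is an invention for this example.

```python
from pathlib import Path


def missing_files(expected_names, directory="."):
    """Return the expected file names that are not present in `directory`."""
    folder = Path(directory)
    return [name for name in expected_names if not (folder / name).is_file()]


expected = ["observations.xlsx", "sales.xlsx"]
# An empty list means both files were found in the working directory.
print("Missing files:", missing_files(expected))
```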


# <br><center>Part two: Loading data</center>
---
## Import Pandas
To load data we first have to import the Pandas module. After importing Pandas, we can use many of its useful features to work with our data. In Python we can import modules using the `import` statement. 


Load the Pandas module and make sure you can access its functions through the name 'pd'. Remember that in Python we can import modules like so: <br>`import module_name as some_shorthand`

In [None]:
# Your answer

## Load in the data
For this part we'll use the Pandas reader to load our files. Check what kind of file we downloaded in part one, and use the appropriate function to read both files in as data frames. See http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html for more information.

#### Question:
- What kind of files are we dealing with?
- Read in the sales data as `sales` and the observations data as `observations`

In [None]:
# What kind of files are we dealing with?

In [None]:
# Read in the sales data as sales and the observations data as observations


# <br><center>Part three: data exploration & cleaning </center>
---
## 3.1 - Explore the data 
Once we have the data loaded, it's time to explore! Let's start off by answering the following questions:

#### Questions: 
- Inspect the top five and bottom five rows of the `sales` dataframe. What's going on here?
- What's up with these column names, is this an error? Download the files by hand and see for yourself
- We would like to sample 20 rows from the data frame at random, write the code for this.
- Check out the seemingly empty columns. Which unique values do they contain? Is there anything other than NaN?
- Create descriptive statistics of **both** data frames.
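
As a warm-up, here is a sketch of the exploration methods these questions rely on, applied to a small made-up dataframe (the `toy` data and its column names are invented and have nothing to do with the course files). Try to map each call to one of the questions above before applying it to `sales` and `observations`.

```python
import numpy as np
import pandas as pd

# Invented toy data: one text column, one numeric column, one empty column.
toy = pd.DataFrame({
    "city": ["Utrecht", "Gent", "Lyon", "Bonn", "Porto", "Turin", "Graz"],
    "visitors": [120, 85, 240, 60, 150, 95, 70],
    "notes": [np.nan] * 7,
})

first_rows = toy.head()                          # first five rows
last_rows = toy.tail()                           # last five rows
random_rows = toy.sample(n=3, random_state=42)   # three random rows, reproducible
note_values = toy["notes"].unique()              # distinct values in one column
summary = toy.describe()                         # count, mean, std, min, quartiles, max

print(summary)
```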

In [None]:
# Inspect the top five and bottom five rows of the sales dataframe. What's going on here?

In [None]:
# What's up with these column names, is this an error? Download the files by hand and see for yourself

In [None]:
# We would like to sample 20 rows from the data frame at random, write the code for this.

In [None]:
# Check out the seemingly empty columns. Which unique values do they contain? Is there anything other than NaN?

In [None]:
# Create descriptive statistics of both data frames.

---
## 3.2 - Selecting with pandas' iloc
Remember that you can use Pandas' `iloc` functionality to access information by integer index `df.iloc[4]`. We may also select columns *directly* by index using `df.iloc[:, 3]`, or select columns *after* doing an `iloc` selection using `df.iloc[5]['ColumnName']`. 
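
The three access patterns can be tried on a small made-up dataframe first. The `toy` data below is invented for this sketch; note that positions start counting at 0, so `iloc[4]` is the fifth row.

```python
import pandas as pd

# Invented toy data for illustration only.
toy = pd.DataFrame({
    "city": ["Utrecht", "Gent", "Lyon", "Bonn", "Porto", "Turin"],
    "year": [2001, 2002, 2003, 2004, 2005, 2006],
    "visitors": [120, 85, 240, 60, 150, 95],
})

fifth_row = toy.iloc[4]                  # one row, returned as a Series
third_column = toy.iloc[:, 2]            # all rows of the third column
city_of_third_row = toy.iloc[2]["city"]  # row by position, then column by name

print(fifth_row)
print(city_of_third_row)
```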

#### Questions:
- Inspect the top of the observations dataframe. What do you notice about the format of the data? How does it compare to the sales dataframe?
- Find the name of the country and the year in the sixth row of the sales dataframe.




In [None]:
# Inspect the top of the observations dataframe. What do you notice about the format of the data? How does it compare to the sales dataframe?

In [None]:
# Find the name of the country and the year in the sixth row of the sales dataframe.

---
## 3.3 - Filtering in pandas

### Pandas filtering, a look under the hood
In this section you will practice filtering data in Pandas. Filtering in Pandas may look a bit daunting, but it's actually quite easy. Filtering always follows the same pattern. To understand how Pandas filtering works under the hood, we will first investigate the result of a filter expression. After that you'll practice by applying these filters on the sales and observations data.

Filtering is done by writing expressions that evaluate to `True` or `False`. There may be more than one expression; we can connect multiple expressions with logical operators such as AND and OR, which you may have seen before in mathematics or a course on logic. The outcomes of these operators are often captured in a *truth table*. In Pandas, AND is written as `&` and OR as `|`. 

The pattern to follow is:<br>
`dataframe[dataframe['ColumnName'] == 'Value']`

For multiple statements, we must put each expression between `(` and `)`.<br>
`dataframe[(dataframe['ColumnName'] == 'Value') & (dataframe['SomeOtherColumn'] == 'SomeOtherValue')]`
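
To make the pattern concrete, here is a minimal sketch on made-up data (the `toy` dataframe and its columns are invented). It shows that the expression itself is a Series of `True`/`False` values, and that passing it between the square brackets keeps only the rows marked `True`.

```python
import pandas as pd

# Invented toy data for illustration only.
toy = pd.DataFrame({
    "city": ["Utrecht", "Gent", "Lyon", "Bonn"],
    "country": ["NL", "BE", "FR", "DE"],
    "visitors": [120, 85, 240, 60],
})

mask = toy["visitors"] > 80   # boolean Series, one value per row
busy = toy[mask]              # keeps the rows where the mask is True

# Two conditions combined with AND: each one in its own parentheses.
busy_not_fr = toy[(toy["visitors"] > 80) & (toy["country"] != "FR")]

# Two conditions combined with OR.
de_or_very_busy = toy[(toy["country"] == "DE") | (toy["visitors"] > 200)]

print(mask)
print(busy_not_fr)
```

The parentheses are needed because `&` and `|` bind more tightly than comparison operators such as `==` and `>`.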

#### Question:
- Write the expression `expression = sales['name_of_country'] == 'NETHERLANDS'`
- Print out the expression, what do you see?

In [None]:
# Write the expression expression = sales['name_of_country'] == 'NETHERLANDS'
# Print out the expression, what do you see?

### Bringing it together
So instead of selecting the rows by their index, we determine `True` or `False` for each row based on this expression. With this knowledge, try to answer the questions below.

**Questions:**
- Find all rows in the observations dataframe where the value of 'pop' is greater than 80.
- Find all rows in the sales dataframe concerning the Netherlands.
- Now from those rows, select the bikes and year columns.
- Add an extra filter: Find the row about the Netherlands in 1675.
- Find all rows in the sales data frame where the country is Germany or France and the number of bikes is at least 49. In addition, if the year is 1680, we also want to see that row.

In [None]:
# Find all rows in the observations dataframe where the value of 'pop' is greater than 80.


In [None]:
# Find all rows in the sales dataframe concerning the Netherlands.


In [None]:
# Now from those rows, select the bikes and year columns.


In [None]:
# Add an extra filter: Find the row about the Netherlands in 1675.


In [None]:
# Find all rows in the sales data frame where the country is Germany or France and the number of bikes is at least 49. In addition, if the year is 1680, we also want to see that row.


---
## 3.4 - Joining data

![alt text](https://i.stack.imgur.com/iJUMl.png)

Now that we know what the data looks like and what to expect from it, we can start thinking about merging the `sales` and `observations` dataframes. This will give us combined data on what happened in a country during a year. Before we can start joining, should any data transformations be performed? Remember that you can inspect the data frames with the `.head()` function. 

#### Question:
- Investigate if the columns in both data frames are ready for joining. Do we need to make any transformations?
- If your answer for the previous question is 'yes', then proceed to make adjustments so the column values are aligned.
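
Join keys only match when their values are written identically. The sketch below illustrates one common kind of mismatch, differences in letter case and stray whitespace, on invented data; whether the course data needs this exact fix is for you to find out.

```python
import pandas as pd

# Invented toy data: the same countries, written differently in each frame.
left = pd.DataFrame({"country": ["NETHERLANDS", " BELGIUM "], "bikes": [10, 20]})
right = pd.DataFrame({"country": ["Netherlands", "Belgium"], "pop": [1.5, 2.0]})

# Before cleaning there is no overlap between the key values.
overlap_before = set(left["country"]) & set(right["country"])

# Strip surrounding whitespace and convert to title case so both sides agree.
left["country"] = left["country"].str.strip().str.title()

overlap_after = set(left["country"]) & set(right["country"])
print(overlap_before, overlap_after)
```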

In [None]:
# Investigate if the columns in both data frames are ready for joining, do we need to make any Transformations?

In [None]:
# If your answer for the previous question is 'yes', then proceed to make adjustments so the column values are aligned.

Now that our country names are written in the exact same way, we can perform the join we have been speaking about. What do we join on, and why? Look up the way to join on multiple columns in the pandas documentation:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

#### Questions:
- Think of when you would use which type of join. If you wanted to find all observations and, if possible, add the sales information to them, how would the code look?
- What happens if you join on only one column, such as the country or the year? 
- What happens if you join a dataframe to itself?
- Try them all (create a df_inner, df_outer, df_left and df_right) and inspect the results to find out the effects of various joins. The following code will show you the shape of the dataframe in the format (rows, columns): `df.shape`
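
Before trying this on the real data, it may help to see the join types side by side on two tiny invented dataframes. The sketch uses `DataFrame.merge` with a list of key columns and compares the resulting shapes.

```python
import pandas as pd

# Invented toy data: the frames share two of their key combinations.
left = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2000, 2001, 2000],
    "sales": [1, 2, 3],
})
right = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [2000, 2000, 2000],
    "pop": [10, 20, 30],
})

keys = ["country", "year"]  # joining on multiple columns: pass a list

# Shape (rows, columns) of the result for every join type.
shapes = {
    how: left.merge(right, on=keys, how=how).shape
    for how in ["inner", "outer", "left", "right"]
}
print(shapes)

# Joining on one column only: the other shared column gets _x/_y suffixes.
one_key = left.merge(right, on="country", how="inner")
print(one_key.columns.tolist())
```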


In [None]:
# Think of when you would use which type of join. If you wanted to find all observations and, if possible, add the sales information to them, how would the code look?


In [None]:
# What happens if you join on only one column, such as the country or the year?


In [None]:
# What happens if you join a dataframe to itself?


In [None]:
# Try them all (create a df_inner, df_outer, df_left and df_right) and inspect the results to find out the effects of various joins. 


---
## 3.5 - Missing values
We now have a dataframe with all information in it. Upon closer inspection, you will see that the dataframe from the **outer join** has some missing values. Do you understand completely why this happens? 

In this section, we will inspect the missing values and deal with them in an appropriate manner. Let's clean up the dataframe from the outer join a bit.

#### Questions:
- Delete all columns containing only NaN or duplicate data, making it easier to work with.
- Find all rows that have null values.
- If you had to fill in the missing values in the 'bikes', 'total_turnover' and 'pop' columns, how would you do this? Just think about it for now, later we will actually do it.

Remember that the `isnull()` function helps you find rows containing null values.
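
A minimal sketch of how `isnull()` can be combined with other Pandas methods, again on invented data. It counts nulls per column, drops columns that contain nothing but NaN, and selects the rows that still have at least one null.

```python
import numpy as np
import pandas as pd

# Invented toy data with one partly empty and one fully empty column.
toy = pd.DataFrame({
    "city": ["Utrecht", "Gent", "Lyon"],
    "visitors": [120, np.nan, 240],
    "empty": [np.nan, np.nan, np.nan],
})

null_counts = toy.isnull().sum()                        # nulls per column
without_empty = toy.dropna(axis="columns", how="all")   # drop all-NaN columns

# any(axis=1) is True for a row as soon as one of its values is null.
rows_with_nulls = without_empty[without_empty.isnull().any(axis=1)]

print(null_counts)
print(rows_with_nulls)
```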

In [None]:
# Delete all columns containing only NaN or duplicate data, making it easier to work with.


In [None]:
# Find all rows that have null values.


In [None]:
# If you had to fill in the missing values in the 'bikes', 'total_turnover' and 'pop' columns, how would you do this?


---
## 3.6 - Aggregating data
Aggregations are useful to get a quick look at derived statistics. At Datamind we often use these derived statistics to get a better picture of our data and also to fill in missing values. In this section we will get a clear picture on how to aggregate data while grouping.

**Questions**:
- Count the number of observations per country in the dataframe.
- Calculate the average turnover per year
- Create a dataframe containing the year, average turnover per year and average population per year
- Calculate the maximum population per country.
- Calculate the sum of the total turnover per year per country. Why is this not such a sensible statistic? Take note of what happens to the NaN observations.
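
The general shape of a grouped aggregation is `df.groupby(keys)[column].aggregation()`. The sketch below shows a few variants on invented data, including what happens to a NaN value; the column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Invented toy data with one missing turnover value.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "turnover": [10.0, np.nan, 30.0, 50.0],
    "pop": [1.0, 2.0, 3.0, 4.0],
})

rows_per_country = toy.groupby("country").size()        # rows in each group
mean_per_year = toy.groupby("year")["turnover"].mean()  # NaN is skipped

# Several aggregations at once, returned as a regular dataframe.
per_year = (
    toy.groupby("year")
    .agg(avg_turnover=("turnover", "mean"), avg_pop=("pop", "mean"))
    .reset_index()
)

sum_per_country = toy.groupby("country")["turnover"].sum()  # NaN is skipped here too

print(per_year)
```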

**Bonus**:
- Calculate, for each row, the difference between the total turnover for this observation and the average turnover for that year and store this in a new dataframe containing all old columns plus the new one. For completeness, rename the new column to something appropriate.

Hint: Create a new dataframe with the averages and join the new dataframe onto the old one. Then create a new column by subtracting one value from another:



```
df['derived_column'] = df['original_column1'] - df['original_column2']
```
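
The hint can be sketched on invented data as follows: compute the group averages, merge them back onto the original rows, and subtract. The dataframe and the column names `avg_turnover` and `diff_from_year_avg` are made up for this illustration; try the real data yourself before peeking.

```python
import pandas as pd

# Invented toy data for illustration only.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001],
    "turnover": [10.0, 30.0, 50.0],
})

# Step 1: one average per year, as a dataframe with a clearly named column.
averages = (
    toy.groupby("year")["turnover"].mean().rename("avg_turnover").reset_index()
)

# Step 2: attach the yearly average to every original row.
enriched = toy.merge(averages, on="year", how="left")

# Step 3: derive the new column by subtracting one column from another.
enriched["diff_from_year_avg"] = enriched["turnover"] - enriched["avg_turnover"]

print(enriched)
```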



In [None]:
# Count the number of observations per country in the dataframe.


In [None]:
# Calculate the average turnover per year


In [None]:
# Create a dataframe containing the year, average turnover per year and average population per year


In [None]:
# Calculate the maximum population per country.


In [None]:
# Calculate the sum of the total turnover per year per country. Why is this not such a sensible statistic? Take note of what happens to the NaN observations.


Congratulations, you've made it to the end. With the techniques used in these exercises, you can do about 80% of the work with Pandas. You could further look into the Pandas `.apply()` function and plotting.

# <br><br><center> Extra exercise </center>
## Enrich our data frame using API information and store the resulting data frame using Pandas

We can enrich our dataframe further by adding data from an API. Let's see if we can add country information to the data that we currently have. Let's use a public API to add the 2-letter ISO code, longitude and latitude of each country that we have in our data set. This exercise is meant to take some time, so feel free to experiment! This is where you learn the most.

API: https://restcountries.eu/rest/v2/name

This exercise requires you to do the following:
- Collect all countries in your data set
- Call the API for each country
- Extract the information from the **JSON** response
- Create a data frame from the API results
- Join the data to your existing data frame
- Inspect the results
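
To get a feel for the extraction and data frame steps without calling the API, here is a sketch that parses a hard-coded JSON string. The payload is invented, and the field names `name`, `alpha2Code` and `latlng` (with latitude listed first) are assumptions about the response format; inspect a real response and adjust the keys accordingly. The function name `extract_country_info` is also made up.

```python
import json

import pandas as pd

# Invented sample payload; the field names are assumptions, not a documented format.
sample_response = """
[{"name": "Exampleland", "alpha2Code": "EX", "latlng": [10.5, 20.25]}]
"""


def extract_country_info(payload):
    """Pick the fields of interest from the first match in a JSON payload."""
    record = json.loads(payload)[0]
    latitude, longitude = record["latlng"]  # assumed order: latitude, longitude
    return {
        "name_of_country": record["name"],
        "iso2": record["alpha2Code"],
        "latitude": latitude,
        "longitude": longitude,
    }


# One dictionary per country becomes one row in the resulting dataframe.
country_info = pd.DataFrame([extract_country_info(sample_response)])
print(country_info)
```

The resulting dataframe can then be merged onto your existing one, provided the country names in both are written the same way.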

