# Step 1:
## Initial setup - let's get some data!
Before we start, let's first get some data. In Google Colab you can execute shell commands by prefixing them with an exclamation mark, as shown below. The **wget** command is a command-line utility for retrieving files from the web.

**Note:** wget is preinstalled in the Google Colab environment but may be missing on your own machine. To run this notebook locally without wget, download the two files by hand.

In [None]:
! wget https://github.com/datamind-dotfit/python_for_data_science/raw/master/observations.xlsx https://github.com/datamind-dotfit/python_for_data_science/raw/master/sales.xlsx

### Verify data retrieval
Let's verify that the data was actually retrieved and stored in the working directory. The shell command 'ls' lists all the files in the current directory, so we should see both *observations.xlsx* and *sales.xlsx* in the result.

In [None]:
! ls

---

# Step 2:
## Let's load the pandas library and our freshly retrieved data

Load the pandas package and make sure you can access its functions through the name 'pd'. Remember that in Python we can import modules like so:

```
import module_name as some_shorthand
```



In [4]:
import pandas as pd

Read in the two Excel files that we downloaded in step 1.<br>
http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

In [6]:

sales = pd.read_excel('sales.xlsx')
observations = pd.read_excel('observations.xlsx')

# Step 3:
## Explore the data 

Let's start off by investigating some of the rows of our dataframe.

<br>

**Questions:** <br>
- Inspect the top of the sales dataframe. What's going on here?
- Sample 20 random rows from the dataframe
- Check out the seemingly empty columns. Which unique values do they contain? Is there anything other than NaN? <br>


https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html <br>
Note: unique() is only usable on a pandas Series, not on a DataFrame.
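
As a small illustration of that note, here is a sketch on toy data. The dataframe `toy` and its column names are made up for this example and are not part of the sales or observations data.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are hypothetical and only serve the illustration.
toy = pd.DataFrame({
    "country": ["Spain", "Spain", "Italy"],
    "empty_col": [np.nan, np.nan, np.nan],
})

# Selecting a single column returns a Series, which has a unique() method.
country_values = toy["country"].unique()
empty_values = toy["empty_col"].unique()

print(country_values)
print(empty_values)

# A DataFrame itself has no unique() method.
print(hasattr(toy, "unique"))
```

The same pattern, one column at a time, should work on the seemingly empty columns of the real dataframe.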

## Selecting with pandas' iloc
Remember that you can use pandas' iloc functionality to access information by integer position (rows first, then columns):



```
df.iloc[:, 3]
```
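
A few more iloc patterns, sketched on a made-up dataframe (`toy` and its columns are assumptions for this example only). Note that positions are zero-based.

```python
import pandas as pd

# Toy data; names and values are hypothetical.
toy = pd.DataFrame({
    "country": ["Spain", "Italy", "France"],
    "year": [1700, 1701, 1702],
    "bikes": [10, 20, 30],
})

third_row = toy.iloc[2]         # one row, returned as a Series
first_two_rows = toy.iloc[:2]   # a slice of rows, returned as a DataFrame
single_value = toy.iloc[2, 0]   # row position 2, column position 0
last_column = toy.iloc[:, -1]   # all rows, last column

print(third_row)
print(single_value)
```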


**Questions:**
- Inspect the top of the observations dataframe. 
- What do you notice about the format of the data? How does it compare to the sales dataframe?
- Find the name of the country and the year in the sixth row of the sales dataframe.




## Filtering in pandas
Filtering rows works by specifying the conditions that the rows should meet:



```
df[df['column_one'] == 3]
```
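
The sketch below shows how such a filter can be built up step by step, how two conditions can be combined, and how columns can be selected afterwards. The dataframe `toy` and its columns are invented for this example.

```python
import pandas as pd

# Toy data; names and values are hypothetical.
toy = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Lima", "Lima"],
    "year": [2000, 2001, 2000, 2001],
    "visitors": [5, 8, 3, 9],
})

# A comparison on a column yields a boolean Series (the "mask").
mask = toy["visitors"] > 4
busy = toy[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
oslo_2001 = toy[(toy["city"] == "Oslo") & (toy["year"] == 2001)]

# Select a subset of columns from the filtered rows with a list of names.
lima_subset = toy[toy["city"] == "Lima"][["year", "visitors"]]

print(busy)
print(oslo_2001)
print(lima_subset)
```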

**Questions:**
- Find all rows in the observations dataframe where the value of 'pop' is greater than 80.
- Find all rows in the sales dataframe concerning the Netherlands.
- Now from those rows, select the bikes and year columns.
- Add an extra filter: Find the row about the Netherlands in 1675.

## Joining data

![alt text](https://i.stack.imgur.com/iJUMl.png)

Now that we know what the data looks like and what to expect from it, we can start thinking about merging the sales and observations dataframes. This will give us combined data on what happened in a country during a year. Before we can start joining, should any data transformations be performed? Remember that you can inspect the dataframes with the .head() method.

Using the str.title() method, which capitalizes the first letter of each word and lowercases the rest, you can make the values uniform across the dataframes you want to join on.
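
A quick sketch of what str.title() does to differently cased strings. The Series `names` is made up for this example.

```python
import pandas as pd

# Toy data; the values are hypothetical.
names = pd.Series(["the netherlands", "FRANCE", "united KINGDOM"])

# .str applies the string method to every element of the Series.
uniform = names.str.title()

print(uniform.tolist())
# → ['The Netherlands', 'France', 'United Kingdom']
```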

**Question:**
- Make the relevant join data uniform among dataframes.

Now that our country names are written in the exact same way, we can perform the join we have been speaking about. What do we join on, and why? 

**Questions:**
- What happens if you join on only one column, such as the country or the year? 
- What happens if you join a dataframe to itself?

Look up the way to join on multiple columns in the pandas documentation:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

- Think of when you would use which type of join. If you wanted to find all observations and, if possible, add the sales information to them, how would the code look?

Try them all (create a df_inner, df_outer, df_left and df_right) and inspect the results to find out the effects of various joins. The following code will show you the shape of the dataframe in the format (rows, columns):

```
df.shape
```
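
To see how the join type affects the shape, here is a sketch on two tiny made-up dataframes. The names `left` and `right` and all values are assumptions for illustration; the `on` and `how` parameters are the ones described in the merge documentation linked above.

```python
import pandas as pd

# Toy data; names and values are hypothetical.
left = pd.DataFrame({
    "country": ["Spain", "Spain", "Italy"],
    "year": [1700, 1701, 1700],
    "pop": [7, 8, 13],
})
right = pd.DataFrame({
    "country": ["Spain", "Italy", "France"],
    "year": [1700, 1700, 1700],
    "bikes": [1, 2, 3],
})

# Join on two key columns at once and record the shape for each join type.
shapes = {}
for how in ["inner", "outer", "left", "right"]:
    merged = left.merge(right, on=["country", "year"], how=how)
    shapes[how] = merged.shape

print(shapes)
```

On this toy data only two (country, year) pairs occur in both dataframes, so the inner join keeps 2 rows, the left and right joins keep 3 rows each, and the outer join keeps 4.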



## Missing values
We now have a dataframe with all information in it. Upon closer inspection, you will see that the dataframe from the outer join has some missing values. Do you understand completely why this happens? 

In this section, we will inspect the missing values and deal with them in an appropriate manner. Let's clean up the dataframe from the outer join a bit.

**Questions**:
- Delete all columns that contain only NaN or duplicate data, to make the dataframe easier to work with.
- Find all rows that have null values.
- Can we guess from which country the observations with the missing country value are?
- If you had to fill in the missing values in the 'bikes', 'total_turnover' and 'pop' columns, how would you do this? Just think about it for now; we will actually do it later.

Remember that the isnull() function helps you find rows containing null values.
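
A minimal sketch of how isnull() can be combined with other methods, using made-up data (`toy` and its columns are assumptions for this example).

```python
import numpy as np
import pandas as pd

# Toy data; names and values are hypothetical.
toy = pd.DataFrame({
    "country": ["Spain", None, "Italy"],
    "bikes": [1.0, 2.0, np.nan],
    "unused": [np.nan, np.nan, np.nan],
})

# isnull() returns a dataframe of booleans; sum() counts the True values per column.
null_counts = toy.isnull().sum()

# Drop columns in which every value is missing.
cleaned = toy.dropna(axis="columns", how="all")

# Keep only the rows that have at least one missing value.
rows_with_nulls = cleaned[cleaned.isnull().any(axis=1)]

print(null_counts)
print(rows_with_nulls)
```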

## Aggregating data
Aggregations are useful to get a quick look at derived statistics. At Datamind we often use these derived statistics to get a better picture of our data and also to fill in missing values. In this section we will get a clear picture of how to aggregate data while grouping.
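
As a warm-up, here is a sketch of grouping and aggregating on a made-up dataframe. The names `toy`, `shop`, `month`, `revenue` and `staff` are assumptions for this example only.

```python
import pandas as pd

# Toy data; names and values are hypothetical.
toy = pd.DataFrame({
    "shop": ["A", "A", "B", "B"],
    "month": [1, 2, 1, 2],
    "revenue": [10.0, 30.0, 5.0, None],
    "staff": [2, 4, 1, 3],
})

# Number of rows in each group.
rows_per_shop = toy.groupby("shop").size()

# Mean of one column per group; mean() skips NaN values by default.
mean_revenue = toy.groupby("shop")["revenue"].mean()

# Several aggregations at once, with the group key kept as a regular column.
monthly = toy.groupby("month", as_index=False).agg(
    avg_revenue=("revenue", "mean"),
    avg_staff=("staff", "mean"),
)

print(rows_per_shop)
print(mean_revenue)
print(monthly)
```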

**Questions**:
- Count the number of observations per country in the dataframe.
- Calculate the average turnover per year.
- Create a dataframe containing the year, average turnover per year and average population per year.
- Calculate the maximum population per country.
- Calculate the sum of the total turnover per year per country. Why is this not such a sensible statistic? Take note of what happens to the NaN observations.

**Bonus**:
- Calculate, for each row, the difference between the total turnover for this observation and the average turnover for that year and store this in a new dataframe containing all old columns plus the new one. For completeness, rename the new column to something appropriate.

Hint: Create a new dataframe with the averages and join the new dataframe onto the old one. Then create a new column by subtracting one value from another:



```
df['derived_column'] = df['original_column1'] - df['original_column2']
```
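
The approach from the hint could look roughly like this on made-up data. All names (`toy`, `month`, `revenue`, `avg_revenue`, `revenue_vs_avg`) are assumptions for this sketch; the real exercise uses the joined sales and observations data.

```python
import pandas as pd

# Toy data; names and values are hypothetical.
toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "revenue": [10.0, 20.0, 30.0, 50.0],
})

# Step 1: a small dataframe with one average per group.
monthly_avg = (
    toy.groupby("month", as_index=False)["revenue"]
    .mean()
    .rename(columns={"revenue": "avg_revenue"})
)

# Step 2: join the averages back onto the original rows.
with_avg = toy.merge(monthly_avg, on="month", how="left")

# Step 3: derive the new column by subtraction.
with_avg["revenue_vs_avg"] = with_avg["revenue"] - with_avg["avg_revenue"]

print(with_avg)
```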

