# Introduction to Programming

<br/>
<br/>

# ***Datasets***

<br/>

1. [Setup](#0)<br>
2. [About Pandas](#I)<br>
3. [Creating & Reading](#II)<br>
    3.1 [Creating data](#II.I)<br>
    3.2 [Reading data](#II.II)<br>
4. [Indexing, Selecting and Assigning](#III)<br>
    4.1 [Native Python accessors](#III.I)<br>
    4.2 [Indexing using Pandas syntax](#III.II)<br>
    4.3 [Manipulating the index](#III.III)<br>
    4.4 [Conditional Selection](#III.VI)<br>
    4.5 [Assigning data](#III.V)<br>
5. [Summary Functions and Maps](#VI)<br>
    5.1 [Summary functions](#VI.I)<br>
    5.2 [Maps](#VI.II)<br>
6. [Grouping and Sorting](#V)<br>
    6.1 [Grouping](#V.I)<br>
    6.2 [Sorting](#V.II)<br>
7. [Data Types and Missing Values](#IV)<br>
    7.1 [Data Types](#IV.I)<br>
    7.2 [Missing data](#IV.II)<br>
8. [Renaming and Combining](#IIV)<br>
    8.1 [Renaming](#IIV.I)<br>
    8.2 [Combining](#IIV.II)<br>
<br/>
<br/>
<br/>
    





## 1. [Setup](#0)
<a id="0"></a>
<br/>

To see the *current working directory* of this `jupyter notebook`, we use the function `getcwd()` from the `os` module.

In [None]:
#Type your code here


The *current directory* of a `jupyter notebook` is usually initially set to the folder where the file is stored. Remember that in the first part of this course we learned that it is possible to change the *current working directory* with the function `chdir()` from the `os` module.

The `while` loop above *checks* whether the path returned by `os.getcwd()` ends with "IntroDS"; as long as it does not, the loop keeps moving one folder up.
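
As a hedged illustration of that idea, the sketch below climbs the folder tree until it reaches a folder with a given name. The helper name `climb_to` and the example path are assumptions made for illustration. Unlike a bare `while` loop, it stops at the filesystem root instead of looping forever, and it returns the path rather than calling `os.chdir()` itself.

```python
import os

def climb_to(folder_name, start=None):
    """Return the closest ancestor of `start` whose last component is `folder_name`."""
    path = os.path.abspath(start or os.getcwd())
    while os.path.basename(path) != folder_name:
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root without a match
            raise FileNotFoundError(f"no folder named {folder_name!r} above {start!r}")
        path = parent
    return path

# Hypothetical location of a notebook inside the course folder.
example = os.path.join(os.sep, "home", "user", "IntroDS", "notebooks")
project_root = climb_to("IntroDS", start=example)
print(project_root)
```

Passing the result to `os.chdir()` would then make that folder the current working directory.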

**IntroDS** is the name of the folder we have created for this second part of the course. The structure of it is as follows:

+ data
    + Data_Extract_From_World_Development_Indicators-OCDE-BRCH-EU.xlsx
    + WorldBankDataReshaped.csv
+ html
    + NOVASBEIP2020-Class.html
    + NOVASBEIP2020.html
+ images
    + DSKC_logo.png
+ metadata
    + Expense.xlsx
    + Final Consumption Expenditure.xlsx
    + Militar Expenditure.xlsx
+ notebooks
    + NOVASBEIP2020.html
    + NOVASBEIP2020.ipynb
    + ReshapingWorldBankData.R
+ README.md
    
   



## 2. [About pandas](#I)
<a id="I"></a>
<br/>

The most popular `Python` library for data analysis is `pandas`. In this part we will learn how to create our own data, along with how to work with data that already exists (i.e. how to import it to `Python`). To use `pandas` you will typically start with the following line of code.





In [None]:
#type your code here
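
The cell above is left for you to fill in. A minimal sketch of the usual import, assuming `pandas` is already installed, looks like this:

```python
import pandas as pd  # "pd" is the conventional alias for pandas

print(pd.__version__)  # confirms that the library can be imported
```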


(Note that the Anaconda distribution already includes the `pandas` *package*.)

## 3. [Creating & Reading data](#II)
<a id="II"></a>
<br/>

### 3.1 [Creating data](#II.I)
<a id="II.I"></a>
<br/>

There are two core objects in pandas: the `DataFrame` and the `Series`.
<br/>

#### 3.1.1 `DataFrame`

<hr>

A `DataFrame` is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

As an example, let's generate our first `DataFrame`. For this we will need to declare a `dictionary` and then feed it to the `pandas` constructor `DataFrame()`. 

`Dictionaries` are one of the basic `Python` data structures (as we learned in the first part of this course). If you are familiar with other programming languages, you can think of them as mappings, or collections of objects that are stored by a *key*, unlike structures such as sequences or lists, which store objects by their relative position.

In [None]:
#type your code here

In [None]:
#type your code here

In this example, the entry indexed by (0, "Key 1") corresponds to "Value 1", the (0, "Key 2") entry corresponds to "Value 3", and so on.
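
A minimal sketch of such a `DataFrame` follows. The dictionary contents are an assumption reconstructed from the labels mentioned in the text ("Value 2" and "Value 4" in particular are placeholders).

```python
import pandas as pd

# Hypothetical dictionary: each key becomes a column, each list holds its values.
my_dict = {"Key 1": ["Value 1", "Value 2"],
           "Key 2": ["Value 3", "Value 4"]}

df = pd.DataFrame(my_dict)
print(df)

# Rows are labeled 0, 1, ... by default, so (0, "Key 2") holds "Value 3".
print(df.loc[0, "Key 2"])
```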

`DataFrame` entries are not limited to `strings`. For instance, here is a `DataFrame` whose values are not strings.

In [None]:
#type your code here

We are using the `DataFrame()` constructor from the library `pandas`, which we imported as `pd`, to generate these `DataFrame` objects. The syntax for declaring a new one is a dictionary whose keys are the column names (`Yes` and `No` in the prior example) and whose values are lists of entries. This is the standard way of constructing a new `DataFrame`, and the one you are most likely to encounter.

The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the `row labels`. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.

The list of `row labels` used in a `DataFrame` is known as an **Index**. We can assign values to it by using an `index` parameter in our constructor, for example:

In [None]:
#type your code here
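
As a sketch of the `index` parameter, the example below reuses the column names `Yes` and `No` mentioned above. The counts and the row labels are hypothetical.

```python
import pandas as pd

# Hypothetical counts; the keys "Yes" and "No" become the column names.
votes = pd.DataFrame({"Yes": [50, 21], "No": [131, 2]},
                     index=["Product A", "Product B"])
print(votes)
print(votes.index.tolist())  # the row labels we assigned
```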

<br/>

#### 3.1.2 `Series`

<hr>

A `Series` is a sequence of data values. If a `DataFrame` is a **table**, a `Series` is a **list**. Both come with more specialized methods and constructors than the plain matrix (two-dimensional array) and `list` data structures. It is possible to generate a `pandas` `Series` with nothing more than a list, for example:

In [None]:
#type your code here

A `Series` is, in essence, a single column of a `DataFrame`. So you can assign row labels to the `Series` the same way as before, using an `index` parameter. However, a `Series` does not have column names; it only has one overall `name`.

In [None]:
#type your code here
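
The following sketch shows both ways of building a `Series`. The labels, values and name are hypothetical.

```python
import pandas as pd

# A Series built from a plain list gets the default labels 0, 1, 2, ...
plain = pd.Series([1, 2, 3, 4, 5])

# Hypothetical row labels and overall name, passed through `index` and `name`.
sales = pd.Series([30, 35, 40],
                  index=["2015 Sales", "2016 Sales", "2017 Sales"],
                  name="Product A")
print(plain)
print(sales)
```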

<hr>
<hr>

**Summarizing**, we saw that the `pandas` library has two main objects: `DataFrame` and `Series`. By this time it should be clear that they are intimately related. It's helpful to think of a `DataFrame` as actually being just a bunch of `Series` "glued together".

<hr>
<hr>

<br/>

### 3.2 [Reading data](#II.II)
<a id="II.II"></a>

<br/>

Data can be stored in any of a number of different forms and formats. By far the most basic format is the CSV file, or *Comma-Separated Values*. The pandas function that allows us to read this format into Python is `read_csv()`. Now, let's read our example database using the `read_csv()` function.

In [None]:
#type your code here

We can use the `shape` **attribute** to check the dimensions of the resulting `DataFrame`.

In [None]:
#type your code here

So our `DataFrame` has 59 k records split across 6 different columns. That is almost 356 k entries!

We can examine the contents of the resultant `DataFrame` using the `head()` **method**, which grabs, by default, the first five rows.

In [None]:
#type your code here
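
Since the course file may not be at hand, the sketch below reads a tiny in-memory stand-in instead. The column names are the ones mentioned in this notebook, but their order and every row are assumptions for illustration. With the real file you would pass its path to `read_csv()` instead of the `StringIO` object.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (hypothetical rows, invented values).
csv_text = """Continent,Country Name,Country Code,Series Name,Time,value
Europe,Portugal,PRT,"Population, total",2000,1.0
Europe,Spain,ESP,"Population, total",2000,2.0
Asia,Japan,JPN,"Population, total",2000,3.0
"""

data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)    # (number of rows, number of columns)
print(data.head(2))  # first two rows instead of the default five
```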

The pandas `read_csv()` function is well-endowed, with over 30 optional parameters for you to specify as needed. These optional parameters also allow you to read other file formats. For example, by specifying `sep='\t'` this same function will read **tsv** files. Another optional parameter is `index_col`, which lets you use the specified column (given by its number, starting from 0) as the `row_labels` of the `DataFrame`. For example:

In [None]:
#type your code here

In [None]:
#type your code here
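
A hedged sketch of both parameters, again on an in-memory stand-in with hypothetical rows:

```python
import io
import pandas as pd

# Tab-separated stand-in file; "\t" inside the string is a real tab character.
tsv_text = "Country Code\tCountry Name\tTime\nPRT\tPortugal\t2000\nESP\tSpain\t2000\n"

# sep="\t" switches the delimiter; index_col=0 uses the first column as row labels.
by_code = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)
print(by_code)
print(by_code.index.name)
```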

Another very common format is **xlsx**, the Excel file format. Unfortunately, the `read_csv()` function from the pandas library will not let us read this type of data. To read this format into Python we need an Excel engine installed: recent pandas versions use the package `openpyxl` for **xlsx** files, while older versions relied on `xlrd`. Once we have it, we can simply use the pandas function `read_excel()` in the following way:

In [None]:
#type your code here

In [None]:
#type your code here

The dataset above is the extraction as downloaded from [World Bank's databank](https://databank.worldbank.org/data/source/world-development-indicators#). 

So far we have declared two <code>dataframe</code> objects in our *environment*: `data` & `row_data`. The first one has been preprocessed to reshape it from a wide to a long format. The only difference is that in the raw data the values of a given series for each country were laid out horizontally, whereas now they are arranged in a *panel data* fashion. 


<div class="alert alert-block alert-info">
<b>Note:</b> you can always access the <i>specifications</i> of a method by ending its name with a question mark instead of parentheses. For example, let's see the additional parameters available in the <code>read_excel</code> method.
</div>




For example, note that the optional parameter `sheet_name` lets you pick the Excel worksheet of your choice (the default is the first one, indexed by number 0). 

<hr>

#### Exercise 

<hr>

Use the information about the parameters of the `read_excel()` method mentioned above to:
+ Read the `Data_Extract_From_World_Development_Indicators-OCDE-BRCH-EU.xlsx` file using the column `Country Code` as `row_labels`
+ Use the attribute `shape` to check its dimensions
+ Use the method `head()` to display its first 3 rows

In [None]:
#type your code here

In [None]:
#type your code here

<br/>

## 4. [Indexing, Selecting and Assigning](#III)
<a id="III"></a>
<br/>

Selecting specific values of a pandas `DataFrame` or `Series` to work on is an implicit step in almost any data operation you'll run, so one of the first things you need to learn in working with data in Python is how to go about selecting the data points relevant to you quickly and effectively.

<br/>



### 4.1 [Native Python accessors](#III.I)
<a id="III.I"></a>
<br/>

Native Python objects provide good ways of indexing data. Pandas carries all of these over, which helps make it easy to start with.

In Python, we can access the property of an object by accessing it as an attribute. A `book` object, for example, might have a `title` property, which we can access by calling `book.title`. **Columns in a pandas** `DataFrame` **work in much the same way**.

Hence to access the `Continent` **property** of our `data` we can use:

In [None]:
#type your code here

In a Python dictionary we can access the values by using the indexing operator `[ ]`. We can do the same with the columns of a `DataFrame`.

In [None]:
#type your code here

<br/>

Or `[[ ]]` for a `list` of columns.

In [None]:
#type your code here

Pandas `Series` are pretty much like `list`s wrapped inside a fancy dictionary. So we can select a specific value by using the indexing operator once more. For example, let's address the first observation of the column `Country Name`.

In [None]:
#type your code here
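
The native accessors described in this subsection can be sketched as follows, on a small hypothetical stand-in for `data`:

```python
import pandas as pd

# Small hypothetical stand-in for the course dataset.
data = pd.DataFrame({"Continent": ["Europe", "Europe", "Asia"],
                     "Country Name": ["Portugal", "Spain", "Japan"],
                     "value": [1.0, 2.0, 3.0]})

continents = data.Continent                    # attribute access
names = data["Country Name"]                   # key access, works with spaces
subset = data[["Continent", "Country Name"]]   # list of columns -> DataFrame
first_name = data["Country Name"][0]           # first observation of the column
print(first_name)
```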

<hr>
<hr>

<b>Summarizing:</b> We inherit <b>two</b> ways of selecting columns from native Python objects: by <b>attribute</b> of the <code>DataFrame</code> or by <b>key</b>, as in a <code>dictionary</code>. Either one selects a specific <code>Series</code> out of a <code>DataFrame</code>. Neither of them is more or less syntactically valid than the other, but the indexing operator <code>[]</code> does have the advantage that it can handle column names with reserved characters in them (e.g. column names containing blank spaces such as <code>Country Name</code>, since <code>data.Country Name</code> would not work!).

<hr>
<hr>



<br>

<hr>

#### Exercise
<hr>

Can you select element 4 of the same `Series`? 

In [None]:
#Type your code here


<br/>

### 4.2 [Indexing using pandas syntax](#III.II)
<a id="III.II"></a>
<br/>

Pandas has its own accessor operators, `loc[ ]` and `iloc[ ]`. For more advanced operations, these are the ones you are supposed to use.


<br/>

#### 4.2.1 Index-based selection `iloc[ ]`

<br/>

In pandas we have two paradigms for indexing. The first one we will review is **index-based selection**. This simply means that we select data based on its numerical position in the `DataFrame` (as we did when we reviewed `list`s). For this first paradigm we use the accessor `iloc[]`. 

To select the **first row** of our data `DataFrame`, we can do the following.

In [None]:
#type your code here

This returns a `Series` whose values are the first row of data in our `DataFrame` and whose `index` holds the `column_names` of our `DataFrame`.

Both `loc[]` and `iloc[]` are **row-first, column-second**. Please note that this is the opposite of what we do in native Python, which is column-first, row-second.

This means that with the pandas accessors it is marginally easier to retrieve rows, and marginally harder to retrieve columns.

To get a column with `iloc[]` we need to specify two arguments: `iloc[rows,columns]`. For example, let's select the column `Continent` using its numerical position in the data `DataFrame`.

In [None]:
#type your code here

This is returning a `Series` where the values of it are the **first column** of data in our `DataFrame`. And the `index` of them are the `row_labels` of our `DataFrame`.

The operator `:` also comes from native Python and it **means "everything"**. When combined with other selectors, it can be used to indicate a range of values. For example, to select the `Continent` column but just the first three rows we can do the following.

In [None]:
#type your code here

Or to select just rows 10 to 13 of the same column.

In [None]:
#type your code here

<div class="alert alert-block alert-info">
    <b>Note:</b>  I introduced a new method of pandas <code>DataFrame</code>s: <code>reset_index()</code>. This method resets the <code>row_labels</code>; recall that when we imported the data we asked the <code>index</code> to be equal to the <code>Country Code</code>. For presentation purposes, the code above shows the <code>row_labels</code> as 0, 1, 2, 3, ... so that it is clearer what <code>iloc[10:14,1]</code> is doing (also note that the <code>Continent</code> column is now the second one, since the first position is occupied by <code>Country Code</code>).
</div>


It is also possible to pass a list to `iloc[]` method.

In [None]:
#type your code here

As a final note on **index-based selection**, it is worth knowing that it is also possible to use negative numbers. This makes `iloc[]` count backwards from the *end* of the values. So, for example, here are the last five elements of our dataset.

In [None]:
#type your code here
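
The `iloc[]` patterns covered in this subsection can be sketched together on a small hypothetical `DataFrame` (five invented rows, default row labels):

```python
import pandas as pd

# Hypothetical stand-in data with the default 0, 1, 2, ... row labels.
data = pd.DataFrame({"Continent": ["Europe", "Europe", "Asia", "Africa", "America"],
                     "Country Name": ["Portugal", "Spain", "Japan", "Kenya", "Chile"]})

first_row = data.iloc[0]          # Series indexed by the column names
first_col = data.iloc[:, 0]       # every row, first column
top_three = data.iloc[:3, 0]      # rows 0, 1 and 2 of the first column
picked = data.iloc[[0, 2, 4], 0]  # rows chosen with a list of positions
last_two = data.iloc[-2:]         # negative positions count from the end
print(last_two)
```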

<br/>

#### 4.2.2 Label-based selection `loc[ ]`

<br/>

The second paradigm for accessing `DataFrame`s in pandas is the one followed by the `loc[ ]` accessor. In this paradigm, it is the data **index value**, not its position, that matters.

For example, as before, let's address the first observation of the column `Continent`.

<br/>

In [None]:
#type your code here
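
A minimal sketch of label-based selection, assuming a hypothetical `DataFrame` indexed by country code so that labels and positions clearly differ:

```python
import pandas as pd

# Hypothetical data indexed by country code.
data = pd.DataFrame({"Continent": ["Europe", "Asia"],
                     "Country Name": ["Portugal", "Japan"]},
                    index=["PRT", "JPN"])

print(data.loc["PRT", "Continent"])   # one cell, by row label and column label
print(data.loc[:, ["Continent"]])     # all rows, a list of column labels
```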

<br/>

#### 4.2.3 Differences between `loc[ ]` and `iloc[ ]`

<br/>


The two methods use different indexing schemes.

`iloc[ ]` uses a scheme where the **first** element of the range is **included** and the **last** one is **excluded**. So, for example `.iloc[0:10]` will select entries 0,...,9. In contrast, `.loc[0:10]` indexes inclusively, so will output entries 0,...,10.

This difference exists because `loc[ ]` can index any label type, strings for example. This is very convenient when we need to index a dataframe (call it `df`) whose index values are strings such as `Apples,...,Potatoes,...`, and we want to select "all the alphabetical choices between Apples and Potatoes". Then `df.loc['Apples':'Potatoes']` is much more intuitive than something like `df.loc['Apples':'Potatoet']` (t comes after s in the alphabet).

Otherwise, the semantics and use of `loc[ ]` are the same as those for `iloc[ ]`
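
The difference in slicing can be checked with a short sketch. The `Series` below are hypothetical and built so that, in the first one, labels and positions coincide.

```python
import pandas as pd

numbers = pd.Series(range(100, 120))   # labels 0..19 coincide with positions

by_position = numbers.iloc[0:10]       # positions 0..9, end excluded
by_label = numbers.loc[0:10]           # labels 0..10, end included
print(len(by_position), len(by_label))

# With string labels, the inclusive end means the slice stops at "Potatoes" itself.
stock = pd.Series([5, 3, 8, 1], index=["Apples", "Bananas", "Potatoes", "Quinces"])
print(stock.loc["Apples":"Potatoes"])
```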

<br/>

### 4.3 [Manipulating the index](#III.III)
<a id="III.III"></a>

<br/>

Label-based selection derives its power from the labels in the index. We can manipulate the index in any way we see fit by using the method `set_index()`.

For example, let's reset the index of our data. Remember that when we first imported it we set the column `Time` as the new index. 

In [None]:
#type your code here
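
A hedged sketch of both directions, on hypothetical data: `set_index()` turns a column into the row labels and `reset_index()` undoes it.

```python
import pandas as pd

# Hypothetical data.
data = pd.DataFrame({"Time": [2000, 2001, 2002],
                     "value": [1.0, 2.0, 3.0]})

by_time = data.set_index("Time")   # "Time" becomes the row labels
print(by_time.loc[2001, "value"])

restored = by_time.reset_index()   # back to 0, 1, 2, ... with "Time" as a column
print(restored.columns.tolist())
```

Both methods return a new `DataFrame`; `data` itself keeps its original index.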

<br/>

### 4.4 [Conditional Selection](#III.VI)
<a id="III.VI"></a>

<br/>


This type of selection works well for **asking interesting questions** to our data. In particular, it helps to *ask* questions based on conditions.

For example, suppose that for some reason we are interested in all European countries whose average GDP over the period is higher than 1,000,000,000,000 (or $10^{12}$).

First, we need to check if each observation has the column `Continent` equal to **Europe**.



In [None]:
#type your code here

This operation produced a `Series` of `True/False` booleans based on the continent of each record. This result can be used inside the `.loc[]` accessor to select the relevant data.

In [None]:
#type your code here

The resulting `DataFrame` has $\sim 37 k$ rows. The original had $\sim 59 k$. This means that around $63\%$ of the observations have the `Continent` column equal to **`Europe`**.

We also wanted to know which ones have an average GDP higher than 1,000,000,000,000. For this purpose, I will present a new method: `Series.mean()`. This method and others that allow us to generate summary statistics will be reviewed further in detail.

To implement the second condition we need to:

1. extract the `Series` with the values of the Gross Domestic Product (it is called `GDP (constant 2010 US$)`)
2. compare the `value` in the `Series` with the desired one ($10^{12}$)


In [None]:
#type your code here

In [None]:
#type your code here

Then we can use the ampersand (`&`) symbol to bring the two questions together.

In [None]:
#type your code here

Suppose we want to select any observations in which the continent is equal to Europe *or* the GDP is above the threshold mentioned above. For this we use the pipe (`|`) symbol.

In [None]:
#type your code here
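
The sketch below illustrates how boolean `Series` combine with `&` and `|`. It is a simplification: the data are hypothetical, and each row is compared directly against the threshold rather than against a per-country average.

```python
import pandas as pd

# Hypothetical data: names and values are invented for illustration.
data = pd.DataFrame({"Continent": ["Europe", "Europe", "Asia", "America"],
                     "Country Name": ["A", "B", "C", "D"],
                     "value": [2e12, 5e11, 3e12, 1e11]})

is_europe = data["Continent"] == "Europe"   # boolean Series
is_large = data["value"] > 1e12             # threshold of 10**12

both = data.loc[is_europe & is_large]       # rows meeting both conditions
either = data.loc[is_europe | is_large]     # rows meeting at least one condition
print(both)
print(either)
```

When the comparisons are written inline, each one needs its own parentheses, e.g. `(data["value"] > 1e12) & (data["Continent"] == "Europe")`, because `&` and `|` bind more tightly than `==` and `>`.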

<br/>

#### 4.4.1 Pandas built-in conditional selectors `.isin([])` and `.isnull()`

<br/>

Pandas comes with a few built-in conditional selectors. In this section we present two of them.

`.isin([])` helps to select data whose value "is in" a **list of values**. For example, we can use it to select observations from the [G7 countries](https://www.investopedia.com/terms/g/g5.asp). 

The G7 countries are the **United States**, the **United Kingdom**, **Canada**, **Germany**, **Japan**, **Italy** and **France**. Until 2014 the group also included Russia and was known as the G8; that year Russia was suspended indefinitely after annexing Crimea, an autonomous republic of Ukraine. As a result, the group is now referred to as the **G7**.



In [None]:
#type your code here

`.isnull()` (and its counterpart `.notnull()`) let you highlight values which are (or are not) empty (`NaN`). 

For example, let's filter out all rows that contain missing values for `Series Name`. Here is how we do it by using this operator.

In [None]:
#type your code here

The original dataset contains $\sim 59k$ entries. After filtering out all missing values we are left with $\sim 42k$ records.
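
Both selectors can be sketched on a small hypothetical stand-in for `data` (four invented rows, one of them with a missing `Series Name`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with one missing "Series Name".
data = pd.DataFrame({"Country Name": ["Canada", "Portugal", "Japan", "Spain"],
                     "Series Name": ["Population, total", np.nan,
                                     "Population, total", "Population, total"]})

g7 = ["United States", "United Kingdom", "Canada", "Germany",
      "Japan", "Italy", "France"]

in_g7 = data.loc[data["Country Name"].isin(g7)]               # value "is in" the list
has_series = data.loc[data.loc[:, "Series Name"].notnull()]   # drop missing names
print(in_g7)
print(len(data), len(has_series))
```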

To conclude this section, let me stress that I used the operator `.loc[:,'Series Name']`, which returns exactly the same `Series` as the native Python operator `data['Series Name']`. 

In the following lines of code I will switch freely between those two ways of selecting, just to emphasize that they produce exactly the same output (when outputting a `Series`). Note that `.loc[]` additionally lets us perform the type of **conditional selection** that we reviewed in this section.

<br/>



### 4.5 [Assigning data](#III.V)
<a id="III.V"></a>

<br/>

Assigning data to a `Series` or to a column of a `DataFrame` is easy. It is possible to assign either a constant value:

In [None]:
#type your code here

Or iterable values, such as:

In [None]:
#type your code here
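
A minimal sketch of both kinds of assignment. The column names `source` and `row_number`, and the data themselves, are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({"Country Name": ["Portugal", "Spain", "Japan"]})  # hypothetical

data["source"] = "World Bank"           # a constant is repeated on every row
data["row_number"] = range(len(data))   # an iterable with one value per row
print(data)
```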

To conclude this section, please note that the code above uses two native Python functions, `range()` and `len()`. The first one returns an object that produces a sequence of integers from start (inclusive) to stop (exclusive) by step. The second one simply returns the number of items in an object (for a `DataFrame`, the number of rows; recall that a `DataFrame` is always rows first, columns second). Lastly, you can always review what a Python function does by typing a question mark after its name.

In [None]:
#range?
#len?

## 5. [Summary Functions and Maps](#VI)
<a id="VI"></a>
<br/>

Data does not always come out of memory in the format we want. Sometimes we need to do some more work ourselves to reformat it for the task at hand. In this section, we will cover different operations that we can apply to our data to get the input "just right" for our models, presentations and so on. First, let's see a group of functions that are usually and informally called *summary functions*.

<br/>




### 5.1 [Summary functions](#VI.I)
<a id="VI.I"></a>
<br/>

Pandas provides many simple *summary functions* that restructure your data in some useful way. For example, consider the `.describe()` method.

In [None]:
#type your code here

This method generates a high-level summary of the attributes of the given column. It is **type-aware**, meaning that its output varies based on the data type of the column. The output above only makes sense for numerical data. For string data we get:

In [None]:
#type your code here
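
The type-aware behavior can be seen in a short sketch on hypothetical data:

```python
import pandas as pd

data = pd.DataFrame({"Country Name": ["Portugal", "Spain", "Spain", "Japan"],
                     "value": [1.0, 2.0, 3.0, 6.0]})  # hypothetical

numeric_summary = data["value"].describe()       # count, mean, std, min, quartiles, max
text_summary = data["Country Name"].describe()   # count, unique, top, freq
print(numeric_summary)
print(text_summary)
```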

If you want a particular simple summary of a column in a `DataFrame` or of a `Series`, most of the time there is a helpful pandas function that makes it happen.

For example, to see the mean of the Portuguese *population* in the available period, we can use the method `.mean()`.

To do so, we need to:

1. select the rows where the `Series Name` is equal to `Population, total`
2. select the rows where the `Country Name` is equal to `Portugal`
3. extract `value` series
4. apply the method mentioned before

In [None]:
#type your code here
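
The four steps listed above can be sketched as follows. The data are a hypothetical long-format stand-in with invented numbers.

```python
import pandas as pd

# Hypothetical long-format data; the numbers are invented.
data = pd.DataFrame({
    "Country Name": ["Portugal", "Portugal", "Spain", "Portugal"],
    "Series Name": ["Population, total", "Population, total",
                    "Population, total", "GDP (constant 2010 US$)"],
    "value": [10.0, 12.0, 40.0, 200.0]})

is_population = data["Series Name"] == "Population, total"         # step 1
is_portugal = data["Country Name"] == "Portugal"                   # step 2
port_population = data.loc[is_population & is_portugal, "value"]   # step 3
print(port_population.mean())                                      # step 4
```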

Another interesting **method** of pandas `DataFrame`s is `info()`. It builds a table that gives us the following information about each column:
+ name
+ how many rows are non-null
+ type

In [None]:
#type your code here

We will come back to null values and column types in a while.

<br/>

#### 5.1.1 Finding unique values

<br/>

To see a list of unique values we can use the `.unique()` method.

For example let's see the list of countries that this extraction has.

In [None]:
#type your code here

<br/>

#### 5.1.2 Counting unique values

<br/>

To see a list of values and how often they occur in a data set we can use `value_counts()` method. For example, let's see how many observations per year we have.

In [None]:
#type your code here
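
Both methods from the last two subsections can be sketched on hypothetical data:

```python
import pandas as pd

data = pd.DataFrame({"Country Name": ["Portugal", "Spain", "Portugal", "Japan"],
                     "Time": [2000, 2000, 2001, 2001]})  # hypothetical

countries = data["Country Name"].unique()   # distinct values, in order of appearance
per_year = data["Time"].value_counts()      # Series mapping each value to its count
print(countries)
print(per_year)
```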

<br/>

### 5.2 [Maps](#VI.II)
<a id="VI.II"></a>
<br/>
A **map** is a term borrowed from mathematics. It can be seen as a function that takes one set of values and "maps" them to another set of values. In data science we often need to create new representations from existing data, or transform data from the format we receive it in (for example from our operations team) to the format we want it to have later for models or business presentations. Maps are what handle this type of work.

There are two mapping methods that you will often use.

`map()` is the slightly simpler one. For example, suppose that we need to re-center the GDP so that its mean is 0. We can do it in the following way.

Again, to do so, we need to:

1. select the rows where the `Series Name` is equal to `GDP (constant 2010 US$)`
2. select the rows where the `Country Name` is equal to `Portugal`
3. extract the `value` series
4. apply the `map()` method

In [None]:
#type your code here

The function you pass to `map()` expects a single value from the `Series` and returns a transformed version of that value (in our example, that same value minus the column mean). `map()` returns a new `Series` where all the values have been transformed by your function.
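
A minimal sketch of this use of `map()`, assuming a hypothetical `port_gdp_data` with invented values in place of the filtered GDP rows:

```python
import pandas as pd

# Hypothetical GDP-like values for one country.
port_gdp_data = pd.DataFrame({"Time": [2000, 2001, 2002],
                              "value": [10.0, 20.0, 30.0]})

gdp_mean = port_gdp_data["value"].mean()
centered = port_gdp_data["value"].map(lambda v: v - gdp_mean)  # applied value by value
print(centered.tolist())
```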
<br/>
<hr>

#### 5.2.1 Quick review about Python's lambda expression

<hr>
<br/>
Lambda expressions allow us to create "anonymous" functions. This basically means we can quickly make ad-hoc functions without needing to properly define a function using def. 

Function objects returned by running lambda expressions work exactly the same as those created and assigned by `def`s. There is a key difference that makes lambda useful in specialized roles. Pandas library works very well with lambda expressions.

**lambda's body is a single expression, not a block of statements.**

+ The lambda's body is similar to what we would put in a `def` body's return statement. We simply type the result as an expression instead of explicitly returning it. Because it is limited to an expression, a lambda is less general than a `def`: we can only squeeze so much logic into a lambda body. This is by design, to limit program nesting. lambda is designed for coding simple functions, and `def` handles the larger tasks.




Let's generate a simple function and compare it with its equivalent lambda expression.

In [None]:
#type your code here

In [None]:
square(4)

This is the form that a lambda expression replicating the function above would take.

In [None]:
#type your code here

Note that the output is a function (in our data set this function is passed to the `map()` method so it can be applied to each of the elements of the column). So we need to assign it to a variable in order to use it in the desired way.

In [None]:
#type your code here

In [None]:
lambda_square(4)
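
The cells above call `square` and `lambda_square`; one plausible pair of definitions (an assumption, since the original cells are left blank) is:

```python
def square(num):
    """Return num multiplied by itself."""
    return num ** 2

# The same logic as an anonymous function, bound to a name so it can be reused.
lambda_square = lambda num: num ** 2

print(square(4), lambda_square(4))
```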

Now, let's generate a lambda function that will give us a quick overview of any dataset that we may encounter.

In [None]:
#type your code here

<br/>
<hr>
<hr>
<br/>

`apply()` is the equivalent method if we want to transform a whole `DataFrame` by calling a custom function on each row. For example, let's define the `mean_gdp` function and then use it to transform our data.

In [None]:
#type your code here

In [None]:
#type your code here

If we had called `port_gdp_data.apply()` with `axis='index'`, then instead of passing a function to transform each row, we would need to give a function to transform each **column**.

Note that both `map()` and `apply()` return a new, transformed `Series` and `DataFrame` respectively. They do not modify the original dataset they use as input. If we look at the original `port_gdp_data` we can see that it still has its original `value`.

In [None]:
#type your code here
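
A hedged sketch of row-wise `apply()`. The body of `mean_gdp` is an assumption about what such a function could look like, and the data are hypothetical.

```python
import pandas as pd

port_gdp_data = pd.DataFrame({"Time": [2000, 2001, 2002],
                              "value": [10.0, 20.0, 30.0]})  # hypothetical values

gdp_mean = port_gdp_data["value"].mean()

def mean_gdp(row):
    """Return a copy of the row with `value` re-centered around the mean."""
    row = row.copy()
    row["value"] = row["value"] - gdp_mean
    return row

centered = port_gdp_data.apply(mean_gdp, axis="columns")  # function receives each row
print(centered)
print(port_gdp_data["value"].tolist())  # the original is unchanged
```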

<br/>

#### 5.2.2 Built-in mapping operators in Pandas

<br/>
For the most common operations we do not need to use the `map()` method explicitly. For example, here is a faster way of re-centering our `value` column.

In [None]:
#type your code here

Note that in the code above we are performing an operation between a lot of values on the left-hand side (actually everything in the `Series`) and a single value on the right-hand side (the mean value $230,698,913,940.09$). Pandas looks at this expression and figures out that we must mean to subtract that value from every value in the column.

Pandas will also understand what to do if we perform these operations between `Series` of equal length. For example, an easy way to combine the continent and country name of our data is by doing the following:

In [None]:
#type your code here

In the cell above I introduced the method `.astype()` to transform `Time` from an integer column to a string column. 

This needs to be done because pandas can only concatenate a string with another string or, equivalently, add a number to another number. Both operations are performed with the same `+` operator.

<div class="alert alert-block alert-info">
    <b>Note:</b>  Using these operators rather than the <code>map()</code> or <code>apply()</code> methods is faster because they use speed-ups built into pandas. Actually, all of the standard Python operators (greater than, less than, equal to, and so on) work in this manner.
</div>

However, they are not as flexible as `map()` or `apply()`, which can do more advanced things, like applying conditional logic, which cannot be done with addition and subtraction alone.
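
The built-in operators discussed in this subsection can be sketched on hypothetical data. The separator `" - "` and the choice of columns to combine are assumptions.

```python
import pandas as pd

data = pd.DataFrame({"Country Name": ["Portugal", "Spain"],
                     "Time": [2000, 2001],
                     "value": [10.0, 30.0]})  # hypothetical

centered = data["value"] - data["value"].mean()   # Series minus a single number

# "Time" is numeric, so it is cast to string before concatenating.
label = data["Country Name"] + " - " + data["Time"].astype(str)
print(centered.tolist())
print(label.tolist())
```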

<br/>

## 6. [Grouping and Sorting](#V)
<a id="V"></a>
<br/>

### 6.1 [Grouping](#V.I)
<a id="V.I"></a>
<br/>

Most data operations are done on groups defined by variables. To create them, pandas `DataFrame`s have the method `groupby()`. The way it generally works is as follows:

1. Define the *groups* based on a set of columns in your data
2. Summarize the information of other columns at the grouped level

For the second step the method you will need is `agg()`. It lets you summarize a set of columns, each with a specific summary function. The structure of the command is presented in the image below:
<br/>

<p align="center">
  <img width="820" height="500" src="https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2019/10/pandas-python-group-by-named-aggregation-update.jpg">
</p>

<br/>

Now, let's make a simple table with summary statistics of our data.


In [None]:
#type your code here
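
A minimal sketch of the two steps, using named aggregation on hypothetical data. The output column names `mean_value`, `max_value` and `n_countries` are invented for illustration.

```python
import pandas as pd

data = pd.DataFrame({"Continent": ["Europe", "Europe", "Asia"],
                     "Country Name": ["Portugal", "Spain", "Japan"],
                     "value": [10.0, 30.0, 50.0]})  # hypothetical

summary = data.groupby("Continent").agg(
    mean_value=("value", "mean"),              # new column = (source column, function)
    max_value=("value", "max"),
    n_countries=("Country Name", "nunique"),
)
print(summary)
```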

Pandas already predefines a number of aggregation functions. Below you can see a **non-exhaustive** list of them:


| Function | Description  |
| :-----: | :-----: |
| count | Number of non-null observations |
| sum | Sum of values |
| mean | Mean of values |
| mad | Mean absolute deviation |
| median | Arithmetic median of values |
| min | Minimum |
| max | Maximum |
| mode | Mode |
| abs | Absolute Value |
| prod | Product of values |
| var | Unbiased variance |
| sem | Unbiased standard error of the mean |
| skew | Unbiased skewness (3rd moment) |
| kurt | Unbiased kurtosis (4th moment) |
| quantile | Sample quantile (value at %) |
| cumsum | Cumulative sum |
| cumprod | Cumulative product |
| cummax | Cumulative maximum |
| cummin | Cumulative minimum |

Note that it is also possible to create your own aggregation functions. However, the need for custom functions is minimal unless you have very specific requirements. The full range of basic statistics that are quickly calculable and built into the base Pandas package can be found [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html)

<br/>

### 6.2 [Sorting](#V.II)
<a id="V.II"></a>
<br/>

Now, imagine you need to sort the results shown above in descending order. The `DataFrame` method `sort_values()` with the parameter `ascending=False` does this easily.

In [None]:
#type your code here
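
A short sketch, assuming a hypothetical summary table with a single column named `mean_value`:

```python
import pandas as pd

summary = pd.DataFrame({"mean_value": [20.0, 50.0, 5.0]},
                       index=["Europe", "Asia", "Africa"])  # hypothetical

ranked = summary.sort_values(by="mean_value", ascending=False)  # largest first
print(ranked)
```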

<br/>

## 7. [Data Types and Missing Values](#IV)
<a id="IV"></a>
<br/>

### 7.1 [Data Types](#IV.I)
<a id="IV.I"></a>
<br/>

At the end of the prior section we introduced a method to modify the type of a specific column. The correct name for the type of a column in a `DataFrame` (or `Series`) is **dtype**.

<br/>



#### 7.1.1 Specific Column
<br/>

So, you can use the `dtype` attribute of a `Series` to grab the type of a specific column. For instance, we can get the `dtype` of the `Time` column in our data.

In [None]:
#type your code here

<br/>

#### 7.1.2 Every column in the dataset
<br/>
Alternatively, the `dtypes` attribute of a `DataFrame` returns the `dtype` of *every* column.

In [None]:
#type your code here

Data types tell us something about how pandas is storing the data internally. For example, `float64` means that it is using a 64-bit floating point number, whereas `int64` means a similarly sized integer instead.

One peculiarity to keep in mind is that columns consisting entirely of strings do not get their own type, instead they are given the `object` type.

As we presented before, the method `.astype()` makes it possible to convert a column of one type into another wherever such a conversion makes sense. 

A `DataFrame` or `Series` index has its own `dtype` too.

In [None]:
#type your code here
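
The attributes and the conversion described in this section can be sketched together on hypothetical data:

```python
import pandas as pd

data = pd.DataFrame({"Country Name": ["Portugal", "Spain"],
                     "Time": [2000, 2001],
                     "value": [10.0, 30.0]})  # hypothetical

print(data["Time"].dtype)   # dtype of one column
print(data.dtypes)          # dtype of every column
print(data.index.dtype)     # the index has a dtype too

time_as_float = data["Time"].astype("float64")  # convert where it makes sense
print(time_as_float.dtype)
```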

As a final note to this section, let's remark that pandas also supports more exotic data types. For example, categorical data and time series data are also allowed.

### 7.2 [Missing data](#IV.II)
<a id="IV.II"></a>
<br/>
Entries with missing values are given the value NaN, short for "Not a Number". For technical reasons these `NaN` values are always of the `float64` dtype.
<br/>

#### 7.2.1 Selecting missing rows data
<br/>
Pandas provides some methods specific to missing data. To select `NaN` entries you can use `pd.isnull()` (or its companion `pd.notnull()`). For example, let's output all the rows of our dataset that have missing values in the `value` column.

In [None]:
#type your code here
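
A minimal sketch of this selection, assuming a small hypothetical table with one missing entry in `value`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing observation.
df = pd.DataFrame({
    "Series Name": ["GDP", "Expense", "GDP"],
    "Time": [2000, 2000, 2001],
    "value": [1.5, np.nan, 2.0],
})

# pd.isnull() returns a boolean mask, which we use to filter the rows.
missing_rows = df[pd.isnull(df["value"])]
present_rows = df[pd.notnull(df["value"])]
print(missing_rows)
```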

Replacing missing values is a common operation. Pandas provides a handy method for this problem: `fillna()`, which offers a few different strategies for handling such data. For example, we can simply replace each `NaN` with `"Unknown"`.

In [None]:
#type your code here
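
As a sketch, assume a hypothetical table with a gap in a text column and another in a numeric column. Filling the text column with `"Unknown"` and the numeric one with `0` are just two possible choices:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the gaps are placed deliberately.
df = pd.DataFrame({
    "Continent": ["Europe", None, "Asia"],
    "value": [1.5, np.nan, 2.0],
})

# fillna() returns a new Series with every missing entry replaced.
continent_filled = df["Continent"].fillna("Unknown")
value_filled = df["value"].fillna(0)
print(continent_filled.tolist())  # ['Europe', 'Unknown', 'Asia']
```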

Alternatively, we may have a non-null value that we would like to replace. For example, let's use `replace()` to change the string values of the `Continent` column: 'World' to 0, 'America' to 1, 'Europe' to 2, 'Middle East' to 3, 'Africa' to 4, 'Asia' to 5 and 'Oceania' to 6.

In [None]:
#type your code here
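
A possible sketch of this replacement on a short, invented `Continent` column. The explicit `dtype="object"` and the final `astype("int64")` are assumptions added so that the result has the same type regardless of the pandas version:

```python
import pandas as pd

# Hypothetical column; dtype=object mirrors how string columns are described above.
continent = pd.Series(["World", "Europe", "Asia", "Europe"],
                      name="Continent", dtype="object")

# Mapping from continent name to numeric code, as listed in the text.
codes = {"World": 0, "America": 1, "Europe": 2, "Middle East": 3,
         "Africa": 4, "Asia": 5, "Oceania": 6}

# replace() swaps each matching value; astype() fixes the resulting dtype.
continent_codes = continent.replace(codes).astype("int64")
print(continent_codes.tolist())  # [0, 2, 5, 2]
```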

<br/>

## 8. [Renaming and Combining](#IIV)
<a id="IIV"></a>
<br/>
Data often comes to us with column names, index names or other naming conventions that we are not satisfied with. In that case, we can use pandas functions to change the names of the entries we need to.

We will also explore three methods to combine data from multiple `DataFrame` and/or `Series` objects.
<br/>

### 8.1 [Renaming](#IIV.I)
<a id="IIV.I"></a>
<br/>
The first function to introduce here is `rename()`, which allows you to change index names and/or column names. For example, to change the `value` column in our dataset to `Series Value`, we would do:
<br/>

In [None]:
#type your code here
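
A minimal sketch on a hypothetical two-column table:

```python
import pandas as pd

# Hypothetical data; only the column names echo the course dataset.
df = pd.DataFrame({"Time": [2000, 2001], "value": [1.5, 2.0]})

# rename() returns a new DataFrame; keys are old names, values are new names.
renamed = df.rename(columns={"value": "Series Value"})
print(renamed.columns.tolist())  # ['Time', 'Series Value']
```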

`rename()` supports a variety of input formats, but usually a Python dictionary is the most convenient. For **modifying a set of columns** we leverage two built-in Python functions: `dict()` and `zip()`. The first simply produces a dictionary (equivalent to `{ }`); the second *pairs up*, one to one, the elements of two lists.

In [None]:
#type your code here
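
The following sketch shows the idea with two invented lists of names (`old_names` and `new_names` are hypothetical variables):

```python
import pandas as pd

# Hypothetical data and name lists.
df = pd.DataFrame({"Time": [2000, 2001], "value": [1.5, 2.0]})
old_names = ["Time", "value"]
new_names = ["Year", "Series Value"]

# zip() pairs each old name with its new name; dict() turns the pairs into a mapping.
mapping = dict(zip(old_names, new_names))
print(mapping)  # {'Time': 'Year', 'value': 'Series Value'}

renamed = df.rename(columns=mapping)
```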

`rename()` lets you rename *index* or *column* values by specifying the `index` or `columns` keyword parameter, respectively. Here is an example using it to rename some elements of the index.

In [None]:
#type your code here
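
One way this could look, on a hypothetical table with the default integer index (the labels `"first"` and `"second"` are arbitrary):

```python
import pandas as pd

# Hypothetical data with the default index 0, 1, 2.
df = pd.DataFrame({"value": [1.5, 2.0, 3.25]})

# Only the index labels listed in the dictionary are renamed.
renamed = df.rename(index={0: "first", 1: "second"})
print(renamed.index.tolist())  # ['first', 'second', 2]
```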

Renaming columns is very common, but renaming index values is rare. For that, it is usually more convenient to use the `set_index()` method.
<br/>

#### 8.1.1 Renaming axis
<br/>
Both rows and columns have their own `name` attribute. So, in addition to renaming the values of each axis, the `rename_axis()` method may be used to change this name attribute (naming the rows as a whole or the columns as a whole). For example:

In [None]:
#type your code here
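
A small sketch, assuming a hypothetical table indexed by year; the axis names `"Time"` and `"fields"` are arbitrary choices:

```python
import pandas as pd

# Hypothetical table whose index holds the years.
df = pd.DataFrame({"value": [1.5, 2.0]}, index=[2000, 2001])

# Name the row axis and the column axis; the labels themselves are unchanged.
named = df.rename_axis("Time", axis="index").rename_axis("fields", axis="columns")
print(named.index.name, named.columns.name)  # Time fields
```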

<br/>

### 8.2 [Combining](#IIV.II)
<a id="IIV.II"></a>
<br/>

When working on a dataset, we will sometimes need to combine different `DataFrames` and/or `Series` in *non-trivial* ways. Pandas has three core methods for doing this. In order of increasing complexity, these are `concat()`, `join()`, and `merge()`.

As its name suggests, `concat()` concatenates two or more `DataFrames` along an axis. It is very useful when the data is spread across different `DataFrame` or `Series` objects that share the same fields (columns).

For example, imagine that we need to create a report with a subset of the available government data from Portugal for the period. The indicators that we need to present are the following:

+ GDP (constant 2010 US\$)
+ Expense (\% of GDP)
+ Government expenditure on education, total (\% of GDP)
+ Domestic general government health expenditure (\% of GDP)

For this we will need to create a subset of the original dataframe with all the series available for Portugal. With it we will need to:

+ select all the rows where `Series Name` equals each of the desired indicators
+ rename `Series Name` to fit the indicator description
+ use the method mentioned above

In [None]:
#First of all let's select only entries reported by Portugal
#type your code here

In [None]:
#Then let's set the conditions to select the data
#type your code here

In [None]:
#Then select and rename each series
#type your code here

In [None]:
#Finally let's concatenate all together
#and restore the column names, since the concat operation loses them
#type your code here
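
The four steps above could be sketched as follows. This is one possible reading, not the course solution: the `Country Name` column is an assumed name, every number is invented, only two of the four indicators are shown, and the series are placed side by side with `axis=1` (the default `axis=0` would stack them as rows instead).

```python
import pandas as pd

# Hypothetical long-format extract. `Country Name` is an assumed column name
# and all values are invented for illustration.
data = pd.DataFrame({
    "Country Name": ["Portugal"] * 4 + ["Spain"] * 2,
    "Series Name": ["GDP (constant 2010 US$)", "Expense (% of GDP)"] * 3,
    "Time": [2000, 2000, 2001, 2001, 2000, 2000],
    "value": [100.0, 40.0, 102.0, 41.0, 500.0, 38.0],
})

# Step 1: keep only the entries reported by Portugal.
portugal = data[data["Country Name"] == "Portugal"]

# Step 2: one boolean condition per indicator.
is_gdp = portugal["Series Name"] == "GDP (constant 2010 US$)"
is_expense = portugal["Series Name"] == "Expense (% of GDP)"

# Step 3: select each series, index it by year and give it a short name.
gdp = portugal.loc[is_gdp].set_index("Time")["value"].rename("GDP")
expense = portugal.loc[is_expense].set_index("Time")["value"].rename("Expense")

# Step 4: concatenate side by side; the Series names become the column names.
report = pd.concat([gdp, expense], axis=1)
print(report)
```

The remaining indicators would follow the same pattern: one condition, one renamed series, and one more entry in the list passed to `pd.concat()`.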

The combiner of intermediate complexity is `join()`. It combines different `DataFrame` objects which have a column (or index) in common. For a clear explanation of its usage and its parameters, let's generate two new `DataFrames`, each with two columns named `Key` and `Values`.

Name the first one `right` and fill its columns with the following values:

+ `Key`: 2000, 2001, 2002, 2003
+ `Values`: 12, 13, 12, 13

Name the second one `left` and fill its columns with the following values:

+ `Key`: 2000, 2001, 2003
+ `Values`: 15, 16, 17

In [None]:
#type your code here

Now, let's *join* these two datasets using the `.join()` method.

In [None]:
#type your code here
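
A possible sketch that builds the two frames described above and joins them. The suffixes `_right` and `_left` are my own choice; some suffix is needed because both frames have a `Values` column.

```python
import pandas as pd

# The two frames described in the text.
right = pd.DataFrame({"Key": [2000, 2001, 2002, 2003], "Values": [12, 13, 12, 13]})
left = pd.DataFrame({"Key": [2000, 2001, 2003], "Values": [15, 16, 17]})

# `on="Key"` refers to a column of the calling frame (`right`); the other frame
# is matched through its index, hence the set_index("Key") call.
joined = right.join(left.set_index("Key"), on="Key",
                    lsuffix="_right", rsuffix="_left")
print(joined)
```

Since `how` defaults to "left", all four rows of `right` are kept, and the year 2002, which is absent from `left`, gets a missing value in `Values_left`.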

Note that the `on` parameter allows us to join the datasets using the `Key` column. However, `DataFrame.join()` **always matches against the index of the second `DataFrame`**, which is why in the code above we call `set_index('Key')` on the second `DataFrame` (called `left` in the example).

Back to our example database, imagine that now we have a report with another set of indicators:

+ Final consumption expenditure (\% of GDP): (formerly total consumption) is the sum of household final consumption expenditure (private consumption) and general government final consumption expenditure (general government consumption)
+ Exports of goods and services (\% of GDP)

And we want to join them together in a single dataset.


In [None]:
#Let's set the conditions to select the data
#type your code here

#Then select and rename each series
#type your code here

In [None]:
#type your code here
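
As a sketch of this step, assume two hypothetical reports that are both indexed by `Time`; the short column names and all the numbers are invented:

```python
import pandas as pd

# Hypothetical first report: one column per indicator, indexed by year.
first_report = pd.DataFrame(
    {"GDP": [100.0, 102.0, 105.0], "Expense": [40.0, 41.0, 39.5]},
    index=pd.Index([2000, 2001, 2002], name="Time"),
)

# Hypothetical second report covering fewer years.
second_report = pd.DataFrame(
    {"Final consumption": [80.0, 81.0], "Exports": [30.0, 31.5]},
    index=pd.Index([2000, 2001], name="Time"),
)

# With no `on` argument, join() aligns the two frames on their indexes.
combined = first_report.join(second_report)
print(combined)
```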

Note that the new dataset has exactly the same number of rows (or observations) as our first dataset. By default, `join()` sets the `how` parameter to "left". So, the code above is *pasting* two additional columns, using the values of `Time` (set as index in both datasets) as keys for this process.

The last combining operator is `merge()`. It is said to be more complicated simply because it has more parameters to specify, which allows more flexibility when dealing with difficult *types* of combinations.

Unlike `join()`, it supports any column or index as the *key* for the combination. Its default `how` value is "inner" (so by default it returns only the rows whose key values belong to the intersection of both datasets).

Let's repeat what we have done so far using this operator.

In [None]:
#type your code here

In [None]:
#type your code here
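
A minimal sketch of `merge()` on the `right` and `left` frames defined earlier in this section. The suffixes and the extra `dupes` frame are hypothetical additions used only to show what `validate` does.

```python
import pandas as pd

right = pd.DataFrame({"Key": [2000, 2001, 2002, 2003], "Values": [12, 13, 12, 13]})
left = pd.DataFrame({"Key": [2000, 2001, 2003], "Values": [15, 16, 17]})

# Default how="inner": only keys present in both frames survive.
merged = right.merge(left, on="Key", suffixes=("_right", "_left"), validate="1:1")
print(merged)

# how="left" reproduces the behaviour of join(): every row of `right` is kept.
merged_left = right.merge(left, on="Key", how="left",
                          suffixes=("_right", "_left"), validate="1:1")

# validate="1:1" raises an error when a key is repeated in either frame.
dupes = pd.DataFrame({"Key": [2000, 2000], "Values": [1, 2]})
validation_failed = False
try:
    right.merge(dupes, on="Key", validate="1:1")
except pd.errors.MergeError as err:
    validation_failed = True
    print("validation failed:", err)
```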


Finally, it is important to remark that almost everything that can be done with `merge()` can also be done with `join()`. However, `merge()` gives you more control over how pandas processes your data internally. In particular, the **optional parameter** `validate` lets you check assumptions about the key values. In our example, I specified "1:1", which means "one to one", because it makes pandas check that the merge keys are unique in both datasets.

Remember that you can always *ask Python* for help when you feel confused about the parameters or usage of `merge()`, `join()` or any other operator.