# Exercise notebook

In [1]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

import pandas as pd
from datetime import datetime

## Exercise 1: Dataframes and CSV files
A CSV file is a plain text file that holds tabular data. The acronym CSV is short for
'comma-separated values'.

To read a CSV file into a dataframe, call the pandas function <code>read_csv()</code>. The simplest usage of this function is with a single argument, a string that holds the name of the CSV file, for example:

In [2]:
df = pd.read_csv('WHO POP TB all.csv')

The above code creates a dataframe from the data in the file `WHO POP TB all.csv` and
assigns it to the variable `df`.
The function can also take many additional arguments (some of which you’ll use
later), which determine how the file is to be read.
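
As a rough illustration of how additional arguments change what is read, the sketch below parses a few rows copied from this notebook's output. It assumes an in-memory string stands in for the CSV file, and the choice of `usecols` and `nrows` is only an example, not necessarily the arguments used later in the course.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk: the first rows shown in this notebook.
csv_text = """Country,Population (1000s),TB deaths
Afghanistan,30552,13000
Albania,3173,20
Algeria,39208,5100
Andorra,79,0.26
Angola,21472,6900
"""

# Single argument: read every row and every column.
df_all = pd.read_csv(io.StringIO(csv_text))

# Additional keyword arguments change how the data is read.
df_some = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['Country', 'TB deaths'],  # keep only these two columns
    nrows=3,                           # read only the first three data rows
)

print(df_all.shape)   # (rows, columns) of the full read
print(df_some.shape)  # (rows, columns) of the restricted read
```

With a real file, the file name string takes the place of `io.StringIO(csv_text)`.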

In [3]:
df

| | Country | Population (1000s) | TB deaths |
|---|---|---|---|
| 0 | Afghanistan | 30552 | 13000.00 |
| 1 | Albania | 3173 | 20.00 |
| 2 | Algeria | 39208 | 5100.00 |
| 3 | Andorra | 79 | 0.26 |
| 4 | Angola | 21472 | 6900.00 |
| ... | ... | ... | ... |
| 189 | Venezuela (Bolivarian Republic of) | 30405 | 480.00 |
| 190 | Viet Nam | 91680 | 17000.00 |
| 191 | Yemen | 24407 | 990.00 |
| 192 | Zambia | 14539 | 3600.00 |


### Dataframe attributes

A dataframe attribute is like a variable that can only be accessed in the context of a dataframe. One such attribute is <code>columns</code>, which holds a dataframe's column names.
So the expression `df.columns` evaluates to the value of the `columns` attribute inside
the dataframe `df`. The following code will get and display the names of the columns in the
dataframe `df`:

In [4]:
df.columns

Index(['Country', 'Population (1000s)', 'TB deaths'], dtype='object')
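
Because `columns` is an attribute rather than a method, it is read without parentheses. The sketch below uses a small hand-built dataframe (`df_small`, a hypothetical stand-in for `df` built from the first three rows above) and also reads the `shape` attribute, which is not covered in the text but follows the same pattern.

```python
import pandas as pd

# Hypothetical stand-in for df, using the first three rows shown earlier.
df_small = pd.DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria'],
    'Population (1000s)': [30552, 3173, 39208],
    'TB deaths': [13000.0, 20.0, 5100.0],
})

# Attributes are accessed with a dot and no parentheses.
column_names = list(df_small.columns)  # convert the Index to a plain list
print(column_names)

# shape is another attribute: a (rows, columns) tuple.
print(df_small.shape)
```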

## Getting and displaying dataframe rows

Dataframes can have hundreds or thousands of rows, so it is not practical to display a
whole dataframe.
However, there are a number of dataframe attributes and methods that allow you to get
and display either a single row or a number of rows at a time. Three of the most useful
are **the `iloc` attribute and the `head()` and `tail()` methods**. Note that to distinguish
methods from attributes, we write () after a method’s name.

### Dataframe rows
A dataframe has a default integer index for its rows, which starts at <code>0</code>. The `iloc` attribute can be used to obtain the row at a given index.

**The iloc attribute**

You can get and display any single row in a dataframe by using the `iloc` attribute with the index of the
row you want to access in square brackets. For example, the following code will get and
display the first row of data in the dataframe `df`, which is at index 0:

In [5]:
df.iloc[0] # first row, index 0

Country               Afghanistan
Population (1000s)          30552
TB deaths                   13000
Name: 0, dtype: object

Similarly, the following code will get and display the third row of data in the dataframe df,
which is at index 2:

In [6]:
df.iloc[2] # third row, index 2

Country               Algeria
Population (1000s)      39208
TB deaths                5100
Name: 2, dtype: object
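
The two `iloc` examples above can be tried on a small hand-built dataframe. In this sketch `df_small` is a hypothetical stand-in for `df`, and the negative index in the last step goes slightly beyond the text: it relies on `iloc` accepting Python-style positions counted from the end.

```python
import pandas as pd

# Hypothetical stand-in for df, using the first five rows shown earlier.
df_small = pd.DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola'],
    'Population (1000s)': [30552, 3173, 39208, 79, 21472],
    'TB deaths': [13000.0, 20.0, 5100.0, 0.26, 6900.0],
})

first_row = df_small.iloc[0]  # row at position 0
third_row = df_small.iloc[2]  # row at position 2

# A single row comes back as a Series whose index is the column names,
# so individual values can be read by column name.
print(type(first_row).__name__)
print(third_row['Country'])

# Positions can also be counted from the end, as with Python lists.
last_row = df_small.iloc[-1]
print(last_row['Country'])
```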

### The <code>head()</code> method

The `head()` method returns a dataframe with the first rows, as many as given in the argument. By default, if the argument is missing, it returns the first five rows.

You can tell `head()` is a method, rather than an attribute such as `columns`, because of
the parentheses (round brackets) after the method name.
If you give an argument, i.e. put a number within those parentheses, the method
returns that number of rows (starting from the row indexed by 0).

For example, executing the following code will get and display the first five rows in the
dataframe df.

In [7]:
df.head() # first five rows

| | Country | Population (1000s) | TB deaths |
|---|---|---|---|
| 0 | Afghanistan | 30552 | 13000.0 |
| 1 | Albania | 3173 | 20.0 |
| 2 | Algeria | 39208 | 5100.0 |
| 3 | Andorra | 79 | 0.26 |
| 4 | Angola | 21472 | 6900.0 |


And, executing the following code will get and display the first seven rows in the
dataframe df.

In [8]:
df.head(7) # first seven rows

| | Country | Population (1000s) | TB deaths |
|---|---|---|---|
| 0 | Afghanistan | 30552 | 13000.0 |
| 1 | Albania | 3173 | 20.0 |
| 2 | Algeria | 39208 | 5100.0 |
| 3 | Andorra | 79 | 0.26 |
| 4 | Angola | 21472 | 6900.0 |
| 5 | Antigua and Barbuda | 90 | 1.2 |
| 6 | Argentina | 41446 | 570.0 |


### The <code>tail()</code> method
The <code>tail()</code> method is similar to the <code>head()</code> method. If no argument is used, the last five rows of the dataframe are returned; otherwise it returns as many rows as the argument specifies, counted from the end of the dataframe.

In [9]:
df.tail() # last five rows

| | Country | Population (1000s) | TB deaths |
|---|---|---|---|
| 189 | Venezuela (Bolivarian Republic of) | 30405 | 480.0 |
| 190 | Viet Nam | 91680 | 17000.0 |
| 191 | Yemen | 24407 | 990.0 |
| 192 | Zambia | 14539 | 3600.0 |
| 193 | Zimbabwe | 14150 | 5700.0 |
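
To see how the argument changes what `head()` and `tail()` return, the sketch below applies both methods to a seven-row dataframe. Here `df_small` is a hypothetical stand-in for `df`, built from the first seven rows shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for df, using the first seven rows shown earlier.
df_small = pd.DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
                'Antigua and Barbuda', 'Argentina'],
    'Population (1000s)': [30552, 3173, 39208, 79, 21472, 90, 41446],
    'TB deaths': [13000.0, 20.0, 5100.0, 0.26, 6900.0, 1.2, 570.0],
})

first_five = df_small.head()   # no argument: first five rows
first_two = df_small.head(2)   # first two rows
last_five = df_small.tail()    # no argument: last five rows
last_two = df_small.tail(2)    # last two rows

# Each result is itself a dataframe that keeps the original row index.
print(len(first_five), len(first_two))
print(list(last_five.index))
print(list(last_two.index))
```

Because the original index is kept, the rows returned by `tail()` are labelled with their positions in the full dataframe rather than renumbered from zero.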


## Getting and displaying dataframe columns

You learned that you can get and display a single column of a dataframe by
putting the name of the column (in quotes) within square brackets immediately after the
dataframe’s name, for example:

In [10]:
df['TB deaths']

0      13000.00
1         20.00
2       5100.00
3          0.26
4       6900.00
         ...   
189      480.00
190    17000.00
191      990.00
192     3600.00
193     5700.00
Name: TB deaths, Length: 194, dtype: float64

Notice that although there is an index, there is no column heading. This is because what is
returned is not a new dataframe with a single column but an example of the Series data
type.

### Each column in a dataframe is an example of a series

The Series data type is a collection of values with an index, which by default is an integer index that starts from zero.
In addition, the Series data type has many of the same methods and attributes as
the DataFrame data type, so you can still execute code like:

In [11]:
df['TB deaths'].head()

0    13000.00
1       20.00
2     5100.00
3        0.26
4     6900.00
Name: TB deaths, dtype: float64

In [12]:
df['TB deaths'].iloc[2]

5100.0
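
A short check can confirm that selecting one column gives a Series rather than a dataframe. In this sketch `df_small` is a hypothetical stand-in for `df`, and the `name` attribute is shown only to illustrate where the `Name: TB deaths` line in the output comes from.

```python
import pandas as pd

# Hypothetical stand-in for df, using the first five rows shown earlier.
df_small = pd.DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola'],
    'Population (1000s)': [30552, 3173, 39208, 79, 21472],
    'TB deaths': [13000.0, 20.0, 5100.0, 0.26, 6900.0],
})

deaths = df_small['TB deaths']  # single brackets: one column as a Series

print(type(deaths).__name__)    # the class of the returned object
print(deaths.name)              # the Series remembers its column name

# The same methods and attributes used on dataframes work on a Series.
first_two = deaths.head(2)
third_value = deaths.iloc[2]
print(third_value)
```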

However, pandas does provide a mechanism for you to get and display one or more
selected columns as a new dataframe in its own right. To do this you need to use a list. 

A list in Python consists of items separated by commas and enclosed within
square brackets, for example `['Country']` or `['Country', 'Population (1000s)']`. This list is then put within outer square brackets immediately after the
dataframe’s name, like this:

In [13]:
df[['Country']].head()

| | Country |
|---|---|
| 0 | Afghanistan |
| 1 | Albania |
| 2 | Algeria |
| 3 | Andorra |
| 4 | Angola |


Note that the column is now named. The expression `df[['Country']]` (with two pairs of square
brackets) evaluates to a new dataframe (which happens to have a single column) rather
than a series.
To get a new dataframe with multiple columns you just need to put more column names in
the list, like this:

In [14]:
df[['Country', 'Population (1000s)']].head()

| | Country | Population (1000s) |
|---|---|---|
| 0 | Afghanistan | 30552 |
| 1 | Albania | 3173 |
| 2 | Algeria | 39208 |
| 3 | Andorra | 79 |
| 4 | Angola | 21472 |


The code has returned a new dataframe with just the `'Country'` and `'Population (1000s)'` columns.
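
The difference between single and double square brackets can be checked directly. The sketch below compares the types returned by each form, with `df_small` as a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df, using the first three rows shown earlier.
df_small = pd.DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria'],
    'Population (1000s)': [30552, 3173, 39208],
    'TB deaths': [13000.0, 20.0, 5100.0],
})

as_series = df_small['Country']    # a column name: returns a Series
as_frame = df_small[['Country']]   # a list with one name: returns a dataframe
two_cols = df_small[['Country', 'Population (1000s)']]  # list with two names

print(type(as_series).__name__)
print(type(as_frame).__name__, as_frame.shape)
print(list(two_cols.columns))
```

The list can also be stored in a variable first, for example `wanted = ['Country', 'TB deaths']` followed by `df_small[wanted]`, which some readers find easier to follow than the nested brackets.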