# Week 6 Lab: NumPy and pandas

This week we are going to focus on loading data, exploring a DataFrame to get a sense of what it contains, and manipulating columns. I have downloaded the [world population dataset from the World Bank](https://data.worldbank.org/indicator/SP.POP.TOTL), which we are going to use to learn pandas.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The following line changes the default floating point display format to two decimal places.

In [None]:
pd.set_option('display.float_format', '{:.2f}'.format)

## Load Data
### Import the CSV file
This notebook requires a CSV file called `world_pop.csv`. If you look inside the CSV file, you will notice that the first few lines contain information about the file rather than the actual data.

We can pass a `skiprows` argument with the number of lines that pandas should skip when the [read_csv function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) loads the file.

Task:
- Load the CSV file and refer to it with a variable named `df`
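
As a small sketch of how `skiprows` behaves, the example below reads an invented in-memory stand-in for a file with two metadata lines above the real header. The `raw` text, the `toy` name, and the value `2` are assumptions for illustration only; open `world_pop.csv` to count how many lines need skipping there.

```python
import io

import pandas as pd

# Invented stand-in for a file with two metadata lines above the real header.
raw = (
    "Data Source,Example Indicators\n"
    "Last Updated Date,2025-01-01\n"
    "Country Name,Country Code,1960,2024\n"
    "France,FRA,10,30\n"
    "Spain,ESP,20,40\n"
)

# skiprows=2 ignores the first two lines, so the third line becomes the header.
toy = pd.read_csv(io.StringIO(raw), skiprows=2)
print(toy)
```

For a file on disk, the first argument would be the file path instead of the `io.StringIO` object.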



In [None]:
# Write your code here

### View Data
We can see the [top five rows](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head) of a DataFrame using the `head()` method.

Task:
- Show the first five rows of the dataset

In [None]:
# Write your code here

We can see the [bottom five rows](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail) with the `tail()` method.

Task:
- Show the bottom five rows of the dataset
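
The sketch below shows `head()` and `tail()` on a made-up eight-row frame (the `toy` frame and its values are invented, not taken from the dataset). Both methods return five rows by default and accept a number to change that.

```python
import pandas as pd

# Invented eight-row frame, just large enough to see the default of five rows.
toy = pd.DataFrame({"Country Code": list("ABCDEFGH"), "2024": range(8)})

first_rows = toy.head()    # first five rows by default
last_three = toy.tail(3)   # pass a number to change how many rows come back

print(first_rows)
print(last_three)
```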

In [None]:
# Write your code here

### Select Data to Show
It would be useful to have a list of country names and their codes for future reference.

Task:
- Show only the `Country Name` and `Country Code` columns:
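
Selecting several columns means passing a list of column names inside the square brackets, which is why the brackets appear doubled. A minimal sketch on an invented `toy` frame:

```python
import pandas as pd

# Invented values, for illustration only.
toy = pd.DataFrame(
    {
        "Country Name": ["France", "Spain"],
        "Country Code": ["FRA", "ESP"],
        "2024": [30, 40],
    }
)

# The inner list names the columns to keep; the result is a new DataFrame.
names_and_codes = toy[["Country Name", "Country Code"]]
print(names_and_codes)
```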


In [None]:
# Write your code here

Find the information that is available about France. See the documentation on [selection](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/03_subset_data.html#how-do-i-filter-specific-rows-from-a-dataframe) for more details.

Tasks:
- Use the `Country Name` to find rows that match `France`.
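
Row filtering works by building a Series of `True`/`False` values (a boolean mask) and passing it inside the square brackets. A small sketch on an invented `toy` frame:

```python
import pandas as pd

# Invented values, for illustration only.
toy = pd.DataFrame(
    {
        "Country Name": ["France", "Spain", "Italy"],
        "Country Code": ["FRA", "ESP", "ITA"],
    }
)

# The comparison yields one True/False value per row.
mask = toy["Country Name"] == "France"

# Only the rows where the mask is True are kept.
france = toy[mask]
print(france)
```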

In [None]:
# Write your code here

You will have noticed that there is an extra column at the end of the DataFrame. We can [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html#pandas.DataFrame.drop) it because it does not contain any useful information, but first we need its name, which we can get from the column names.

Tasks:
- Find the name of the last column by accessing `df.columns` and getting the last value using `[-1]`
- Drop the last column from the DataFrame by using the `drop()` method and reassigning the dataframe back to `df`
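
The two steps can be sketched on an invented `toy` frame whose last column is empty. The column name `Unnamed: 2` is an assumption for this example; check `df.columns` to see the real name in your DataFrame.

```python
import pandas as pd

# Invented frame with an empty trailing column.
toy = pd.DataFrame(
    {
        "Country Name": ["France", "Spain"],
        "2024": [30, 40],
        "Unnamed: 2": [None, None],
    }
)

# Negative positions count from the end, so -1 is the last column name.
last_column = toy.columns[-1]

# drop() returns a new DataFrame and leaves the original untouched,
# which is why the result has to be assigned to a name.
trimmed = toy.drop(columns=last_column)
print(trimmed.columns)
```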

In [None]:
# Write your code here

There are two columns with repeated data that are not useful for us.

Tasks:
- Remove the `Indicator Name` and `Indicator Code` columns from the DataFrame

In [None]:
# Write your code here

## Data Cleaning and Pre-Processing
Check the data type of each column. Some columns have the dtype `object`; convert them to `string`. An `object` column can hold arbitrary Python objects, whereas a `string` column is restricted to text (and missing values), which makes the contents of the column explicit and can be more efficient, depending on the pandas version and storage backend.

Tasks:
- Check the datatypes in the DataFrame using the `info()` method
- Convert the object dtype columns to string
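
One possible way to do the conversion, sketched on an invented `toy` frame: find the `object` columns with `select_dtypes()` and convert them with `astype()`. The text column is created with `dtype="object"` explicitly so that the example behaves the same across pandas versions.

```python
import pandas as pd

# Invented values; the text column is forced to object dtype on purpose.
toy = pd.DataFrame(
    {
        "Country Name": pd.Series(["France", "Spain"], dtype="object"),
        "2024": [30.0, 40.0],
    }
)

# info() prints the dtype and non-null count of every column.
toy.info()

# Collect the names of the object columns, then convert only those.
object_columns = toy.select_dtypes(include="object").columns
toy = toy.astype({column: "string" for column in object_columns})
print(toy.dtypes)
```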

In [None]:
# Write your code here

It would be easier for us to access the data using the country codes as the index, since we could then look up a country directly by its code.

Tasks:
- Set the `Country Code` column as the index
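
A minimal sketch of `set_index()` on an invented `toy` frame. Like `drop()`, it returns a new DataFrame, so the result needs to be assigned to a name.

```python
import pandas as pd

# Invented values, for illustration only.
toy = pd.DataFrame(
    {
        "Country Name": ["France", "Spain"],
        "Country Code": ["FRA", "ESP"],
        "2024": [30, 40],
    }
)

# The chosen column becomes the row labels and is removed from the columns.
indexed = toy.set_index("Country Code")
print(indexed)
```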

In [None]:
# Write your code here

The country code for the United Kingdom is `GBR`. Select the data for `GBR` using the index.

Tasks:
- Select the data for `GBR` using the index
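
Label-based lookups use the `loc` indexer. The sketch below builds an invented `toy` frame that already has country codes as its index; a single label returns that row as a Series.

```python
import pandas as pd

# Invented values, with country codes already set as the index.
toy = pd.DataFrame(
    {"Country Name": ["France", "United Kingdom"], "2024": [30, 20]},
    index=pd.Index(["FRA", "GBR"], name="Country Code"),
)

# loc selects by label; one label gives back one row as a Series.
gbr = toy.loc["GBR"]
print(gbr)
```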

In [None]:
# Write your code here

## Data Analysis
### Total Population in 2024
What was the world population in 2024?

Tasks:
- Select the data from the `2024` column and calculate the total sum
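
A small sketch on invented numbers. Note that column labels read from a CSV header are strings, so the year is written in quotes.

```python
import pandas as pd

# Invented values, for illustration only.
toy = pd.DataFrame(
    {"Country Name": ["France", "Spain", "Italy"], "2024": [30, 20, 10]}
)

# Selecting one column gives a Series; sum() adds up its values.
total = toy["2024"].sum()
print(total)
```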

In [None]:
# Write your code here

### Average World Population in 1960
Task:
- Select the data from the `1960` column and calculate the [mean](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html#pandas.DataFrame.mean) using the `mean()` method
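
The sketch below uses invented numbers with one missing value to show that `mean()` skips missing values by default (`skipna=True`), which matters if some countries have no figure for a given year.

```python
import pandas as pd

# Invented values; None becomes NaN (missing) in a numeric column.
toy = pd.DataFrame({"1960": [10.0, None, 30.0]})

# The missing value is ignored: (10 + 30) / 2.
average = toy["1960"].mean()

# With skipna=False a single missing value makes the result missing too.
strict_average = toy["1960"].mean(skipna=False)

print(average, strict_average)
```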

In [None]:
# Write your code here

### Total World Population Over Time
Task:
- Calculate the total world population for each year in the dataset

Hint:

We can use the `iloc` indexer to [select by position](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html#pandas.DataFrame.iloc). The following syntax selects all rows and every column except the first:

```python
df.iloc[:, 1:]
```
Inside the square brackets, the `:` before the comma selects all rows. The `1:` after the comma selects columns from the second column (position 1) to the end.
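
Combining this slice with `sum()` gives one total per column. A minimal sketch on an invented `toy` frame, where the first column holds names and the rest hold yearly values:

```python
import pandas as pd

# Invented values, for illustration only.
toy = pd.DataFrame(
    {
        "Country Name": ["France", "Spain"],
        "1960": [10, 20],
        "2024": [30, 40],
    }
)

# Skip the first (text) column, then add up each remaining column.
totals = toy.iloc[:, 1:].sum()
print(totals)
```

The result is a Series whose index holds the year labels and whose values are the totals for each year.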

In [None]:
# Write your code here

### Line Chart of Population Over Time
Tasks:
- Draw a line chart to show the total sum for each year you calculated in the previous step

Hint:

We can use the `plot()` method to [make charts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html#pandas.DataFrame.plot).
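
As a sketch, calling `plot()` on a Series draws its values against its index. The `totals` Series and its numbers below are invented; pandas draws the chart with matplotlib and returns the axes, which can be used to add labels.

```python
import pandas as pd

# Invented yearly totals, indexed by year label.
totals = pd.Series([30, 45, 70], index=["1960", "1990", "2024"])

# kind="line" is the default; it is spelled out here for clarity.
ax = totals.plot(kind="line", title="Total population (invented values)")
ax.set_xlabel("Year")
ax.set_ylabel("Population")
```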

In [None]:
# Write your code here

### Top 10 in 2024
Show the ten countries with the largest population in 2024 by sorting the data with the [sort values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values) method.
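
A minimal sketch of the sort-then-slice pattern on an invented `toy` frame, taking the top two rows rather than ten. `sort_values()` sorts in ascending order unless told otherwise, so `ascending=False` puts the largest values first.

```python
import pandas as pd

# Invented values, for illustration only.
toy = pd.DataFrame(
    {
        "Country Name": ["France", "Spain", "Italy", "Greece"],
        "2024": [30, 40, 10, 20],
    }
)

# Largest values first, then keep the first two rows.
top_two = toy.sort_values("2024", ascending=False).head(2)
print(top_two)
```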

In [None]:
# Write your code here

### Top 10 in 1960
Which countries were in the top ten in 1960?

In [None]:
# Write your code here

### Percentage Change 1960-2024
Create a new column called `Percentage Change` which contains the percentage difference in population between 1960 and 2024 for each country.


The formula for percentage change is:

$$\frac{\text{Population}_{2024} - \text{Population}_{1960}}{\text{Population}_{1960}} \times 100$$
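
Arithmetic between columns is applied row by row, so the formula can be written almost as it appears. A small sketch on invented numbers: a country going from 10 to 30 has a change of (30 - 10) / 10 × 100 = 200%.

```python
import pandas as pd

# Invented values, for illustration only.
toy = pd.DataFrame(
    {
        "Country Name": ["France", "Spain"],
        "1960": [10, 20],
        "2024": [30, 40],
    }
)

# Assigning to a new column name adds that column to the DataFrame.
toy["Percentage Change"] = (toy["2024"] - toy["1960"]) / toy["1960"] * 100
print(toy)
```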

In [None]:
# Write your code here


### Biggest Percentage Change
Show the 15 countries with the greatest percentage increase in population between 1960 and 2024, displaying only the `Country Name` and `Percentage Change` columns.

In [None]:
# Write your code here

### Visualise Percentage Change
Create a bar chart that visualises this increase in population. Use the `plot()` method with the `x` and `y` parameters as the column names.
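
A sketch of a bar chart on an invented `toy` frame: `x` names the column used for the bar labels and `y` names the column used for the bar heights. The `legend=False` argument is an optional choice made here because there is only one series.

```python
import pandas as pd

# Invented values, for illustration only.
toy = pd.DataFrame(
    {
        "Country Name": ["France", "Spain", "Italy"],
        "Percentage Change": [200.0, 100.0, 50.0],
    }
)

# One bar per row: labels from x, heights from y.
ax = toy.plot(kind="bar", x="Country Name", y="Percentage Change", legend=False)
ax.set_ylabel("Percentage Change")
```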

In [None]:
# Write your code here