# Intro to `pandas` 

There are a very large number of open-source libraries we can import. 

The Python data analysis package `pandas` is a favorite: https://pandas.pydata.org/

`pandas` is built around the **DataFrame**, a table that stores data in labelled rows and columns. You may have encountered this data structure in R or Matlab. If not, you can think of it as similar to a spreadsheet, except with far greater flexibility and power in Python.

In this tutorial, we will focus on manipulation of dataframes using pandas. 

<div class="alert alert-block alert-info">
    
<b>Note:</b> Several libraries build on pandas for highly specialised tasks, such as geopandas for geospatial data. In the next tutorial we will cover geopandas.
</div>

To begin working with dataframes, we first install and import pandas (as we did for packages in previous sessions). 

<div class="alert alert-block alert-warning">

<b>!! Note !!</b> Pandas is not bundled with every Python distribution, so install it before starting the tutorial if it is not already available. You may need to exit and reinstall.

</div>

In [1]:
# Example
# import sys
# !conda install --yes --prefix {sys.prefix} pandas
import pandas as pd

## Creating a `pandas` dataframe

Here we focus on ways of creating dataframes by hand. In most day-to-day programming, however, you are likely to work with data in other formats, which you can often convert to a dataframe with a single line of code.

Nevertheless, a dataframe can be created by combining lists into a dictionary and then converting the dictionary to a dataframe.

First, let's create some lists with our information.

In [22]:
# Example
country = ['US','UK','France','Germany','Rwanda']
iso3 = ['USA','GBR','FRA','DEU','RWA']
regional_level = [2,2,2,2,1]
income_group = ['HIC','HIC','HIC','HIC','LIC']
continent = ['North America','Europe','Europe','Europe','Africa']

Next, we need to combine the lists into a dictionary:

In [23]:
# Example
country_data = {
    'name': country,
    'iso3': iso3,
    'regional_level': regional_level,
    'income_group': income_group,
    'continent': continent
}
country_data

{'name': ['US', 'UK', 'France', 'Germany', 'Rwanda'],
 'iso3': ['USA', 'GBR', 'FRA', 'DEU', 'RWA'],
 'regional_level': [2, 2, 2, 2, 1],
 'income_group': ['HIC', 'HIC', 'HIC', 'HIC', 'LIC'],
 'continent': ['North America', 'Europe', 'Europe', 'Europe', 'Africa']}

Finally, we convert the dictionary into a dataframe using the `pandas` constructor `pd.DataFrame()`.

In [24]:
# Example
df = pd.DataFrame(country_data)
df

Unnamed: 0,name,iso3,regional_level,income_group,continent
0,US,USA,2,HIC,North America
1,UK,GBR,2,HIC,Europe
2,France,FRA,2,HIC,Europe
3,Germany,DEU,2,HIC,Europe
4,Rwanda,RWA,1,LIC,Africa
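
Data often arrives as a list of records rather than as separate lists. As a minimal sketch, `pd.DataFrame` also accepts a list of dictionaries, one per row. The variable names `records` and `df_from_records` are made up for this illustration and the data is a shortened copy of the example above.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names.
records = [
    {'iso3': 'USA', 'continent': 'North America', 'regional_level': 2},
    {'iso3': 'RWA', 'continent': 'Africa', 'regional_level': 1},
]

df_from_records = pd.DataFrame(records)
print(df_from_records)
```

Which form to use mostly depends on how the data is already organised: a dictionary of lists suits column-wise data, a list of dictionaries suits row-wise data.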


The prepared dataframe can be written to other formats, such as an Excel spreadsheet or a CSV file. For example:

In [25]:
# Example
df.to_csv('country_information.csv')
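
By default `to_csv` also writes the row index as an unnamed first column, which reappears as `Unnamed: 0` when the file is read back. The sketch below shows the difference that `index=False` makes. It writes to in-memory buffers (`io.StringIO`) instead of real files purely to keep the example self-contained, and uses a shortened, illustrative dataframe.

```python
import io
import pandas as pd

df = pd.DataFrame({'iso3': ['USA', 'RWA'], 'regional_level': [2, 1]})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
print(columns_with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
df_roundtrip = pd.read_csv(without_index)
print(df_roundtrip.columns.tolist())
```

With a real file, the same pattern would be `df.to_csv('country_information.csv', index=False)` followed by `pd.read_csv('country_information.csv')`.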

## Exercise

Have a go at manually creating a `pandas` dataframe called 'point_data' using the following:

- A column called 'id' containing four numbers from 0-3.
- A column called 'longitude' containing 0.02, 0.05, 0.06 and 0.08.
- A column called 'latitude' containing 1.54, 1.65, 1.48 and 1.59.
- A column called 'filename' containing 'layer1.tiff', 'layer2.tiff', 'layer3.tiff' and 'layer4.tiff'.
    
Write this dataframe to a .csv file called 'point_data.csv'. 
        

In [None]:
#Enter your attempt below:


## Accessing data

`pandas` provides a lot of functionality. For example, you can view the first few rows of data using the `head` method:

In [7]:
# Example
df.head(n=2)

Unnamed: 0,name,iso3,regional_level,income_group,continent
0,US,USA,2,HIC,North America
1,UK,GBR,2,HIC,Europe


You can view the last few rows of your data using the `tail` method.

In [8]:
# Example
df.tail(n=2)

Unnamed: 0,name,iso3,regional_level,income_group,continent
3,Germany,DEU,2,HIC,Europe
4,Rwanda,RWA,1,LIC,Africa


You can find out how many rows and columns are in your data using the `shape` attribute.

In [9]:
# Example
df.shape

(5, 5)
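
Because `shape` is a plain tuple of `(rows, columns)`, it can be unpacked into two variables. The short sketch below uses a reduced, illustrative dataframe; `n_rows` and `n_cols` are names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'iso3': ['USA', 'GBR', 'RWA'],
    'regional_level': [2, 2, 1],
})

# shape is an attribute, not a method, so there are no parentheses.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() returns the number of rows only.
print(len(df))
```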

You can get the names of the columns using the `columns` attribute.

In [10]:
# Example
df.columns

Index(['name', 'iso3', 'regional_level', 'income_group', 'continent'], dtype='object')

You can access a single column using the following syntax.

In [11]:
# Example
df['iso3']

0    USA
1    GBR
2    FRA
3    DEU
4    RWA
Name: iso3, dtype: object
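
A single column is returned as a `pandas` `Series`, not as a Python list. If a plain list is needed, one option is the `tolist` method, sketched here on a reduced, illustrative dataframe (`iso3_column` and `iso3_list` are names chosen for this example).

```python
import pandas as pd

df = pd.DataFrame({
    'iso3': ['USA', 'GBR', 'RWA'],
    'regional_level': [2, 2, 1],
})

# Selecting one column with single brackets gives a Series.
iso3_column = df['iso3']
print(type(iso3_column).__name__)

# tolist() converts the Series values to an ordinary Python list.
iso3_list = iso3_column.tolist()
print(iso3_list)
```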

You can access multiple columns using the following line of code.

In [12]:
# Example
df[['name','iso3']]

Unnamed: 0,name,iso3
0,US,USA
1,UK,GBR
2,France,FRA
3,Germany,DEU
4,Rwanda,RWA


You can access an individual row by specifying its index label with `.loc`, e.g.:

In [13]:
# Example
df.loc[3]

name              Germany
iso3                  DEU
regional_level          2
income_group          HIC
continent          Europe
Name: 3, dtype: object

You can access a range of rows by slicing with `.loc`. Unlike ordinary Python slices, the end label is included:

In [15]:
# Example
df.loc[2:4]

Unnamed: 0,name,iso3,regional_level,income_group,continent
2,France,FRA,2,HIC,Europe
3,Germany,DEU,2,HIC,Europe
4,Rwanda,RWA,1,LIC,Africa
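
`.loc` selects by index label, while the related `.iloc` selects by integer position. With the default 0, 1, 2, ... index the two look alike, so the sketch below assumes a custom letter index (not used elsewhere in this tutorial) to make the difference visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'iso3': ['USA', 'GBR', 'FRA', 'DEU', 'RWA']},
    index=['a', 'b', 'c', 'd', 'e'],  # hypothetical labels for illustration
)

# Label-based slice: both the start and the end label are included.
by_label = df.loc['b':'d']
print(by_label)

# Position-based slice: the end position is excluded, as in normal Python slicing.
by_position = df.iloc[1:3]
print(by_position)
```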


You can select rows that meet a condition by subsetting, e.g. regional levels over 1.

In [16]:
# Example
df[df['regional_level'] > 1]

Unnamed: 0,name,iso3,regional_level,income_group,continent
0,US,USA,2,HIC,North America
1,UK,GBR,2,HIC,Europe
2,France,FRA,2,HIC,Europe
3,Germany,DEU,2,HIC,Europe


We can also apply multiple conditions at the same time. For example, those over regional level 1 in Europe:

In [20]:
# Example
# Combine the conditions with & and wrap each one in parentheses.
df[(df['regional_level'] > 1) & (df['continent'] == 'Europe')]

Unnamed: 0,name,iso3,regional_level,income_group,continent
1,UK,GBR,2,HIC,Europe
2,France,FRA,2,HIC,Europe
3,Germany,DEU,2,HIC,Europe
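
Each condition produces a boolean `Series` (a mask), and masks can be stored in variables and combined. As a hedged sketch on a reduced copy of the data, the example below uses `|` for "or", `~` for "not", and the `isin` method to match any value in a list. The variable names are chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'iso3': ['USA', 'GBR', 'FRA', 'DEU', 'RWA'],
    'regional_level': [2, 2, 2, 2, 1],
    'continent': ['North America', 'Europe', 'Europe', 'Europe', 'Africa'],
})

# Store each condition as a named boolean mask.
in_europe = df['continent'] == 'Europe'
level_one = df['regional_level'] == 1

# Rows that satisfy either condition.
europe_or_level_one = df[in_europe | level_one]

# Rows that do not satisfy a condition.
not_europe = df[~in_europe]

# Rows whose value matches any entry in a list.
selected = df[df['continent'].isin(['Africa', 'North America'])]

print(europe_or_level_one)
print(not_europe)
print(selected)
```

Naming the masks keeps long filters readable and avoids repeating the same comparison.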


You can add a new column to the existing dataframe as follows:

In [21]:
# Example
df['my_new_column'] = ''
df

Unnamed: 0,name,iso3,regional_level,income_group,continent,my_new_column
0,US,USA,2,HIC,North America,
1,UK,GBR,2,HIC,Europe,
2,France,FRA,2,HIC,Europe,
3,Germany,DEU,2,HIC,Europe,
4,Rwanda,RWA,1,LIC,Africa,
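
A new column does not have to be a constant: it can also be computed from existing columns. The sketch below, on a reduced copy of the data, adds a boolean column and a text column. The column names `is_high_income` and `label` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'iso3': ['USA', 'RWA'],
    'income_group': ['HIC', 'LIC'],
})

# A comparison yields one True/False value per row.
df['is_high_income'] = df['income_group'] == 'HIC'

# String columns can be concatenated element by element with +.
df['label'] = df['iso3'] + ' (' + df['income_group'] + ')'

print(df)
```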


## Exercise

Now let's use the dataframe you created in the first exercise to practise extracting information.

First, print the shape of 'point_data':

In [None]:
#Enter your attempt below:


Now print the top 3 rows of 'point_data' to inspect the contents:

In [None]:
#Enter your attempt below:


Subset the 'filename' column, convert to a list, and inspect the contents:

In [None]:
#Enter your attempt below:


Subset only the `longitude` and `latitude` columns, and then use a new method, `.to_dict('records')`, to convert the dataframe of coordinates to a list of dicts:

In [None]:
#Enter your attempt below:


Next subset those rows which have a 'latitude' between 1.5 and 1.6:

In [None]:
#Enter your attempt below:


Finally, subset those rows which have a 'longitude' between 0.05 and 0.1:

In [None]:
#Enter your attempt below:
