# Lab Notebook 3 – Data Science with Python 

The next two weeks' materials cover some of the third-party data science and visualisation libraries commonly used in Python: Pandas (Python Data Analysis Library) and Matplotlib (Visualisation with Python).
In this lab notebook, we will cover:
- 1) A new data type introduced by Pandas: DataFrames
- 2) Basics of data cleaning with Pandas
- 3) Loading in and saving data to and from csv

# Pandas
Pandas is a third-party Python library for data analysis. It introduces useful data types that come with many built-in methods for data handling: the `Series` and the `DataFrame`. While it is not important to understand the specifics yet, it is worth noting that both are built on top of NumPy arrays, so they are well optimised, and pandas and NumPy share many similarities (in naming conventions, functions, etc.).

Let's start by looking at these new data types, beginning with the Series (which operates like an array or list).

In [None]:
import pandas as pd

In [None]:
country_name_list = ["United Kingdom", "Burundi", "Moldova", "Singapore", "Canada", "Taiwan", "Uruguay"]

In [None]:
country_name_series = pd.Series(country_name_list, name="Country Name")

In [None]:
country_name_series.sort_values()

In [None]:
# Look at the variable below, why do you think the order is not changed? 
country_name_series

In [None]:
type(country_name_series)

In [None]:
# dtype is the data type; pandas uses it to decide which operations can be computed on that column, e.g. mathematical operations
country_name_series.dtype # 'O' means object which is a generic type 
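
As a rough illustration of why the dtype matters, the sketch below builds two small Series (the variable names `areas` and `hdi` are chosen just for this example) and shows that pandas infers a numeric dtype for numbers, which is what makes operations such as `.mean()` meaningful:

```python
import pandas as pd

# Small example Series, separate from the notebook's own variables
areas = pd.Series([242495, 27834, 30334], name="Area")
hdi = pd.Series([0.929, 0.426, 0.767], name="HDI")

print(areas.dtype)  # an integer dtype
print(hdi.dtype)    # a floating-point dtype

# Numeric dtypes support mathematical operations
mean_hdi = hdi.mean()
print(round(mean_hdi, 3))
```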

## Dataframes
Let's now look at DataFrames (which collect several Series into a table).

In [None]:
country_name_list = ["United Kingdom", "Burundi", "Moldova", "Singapore", "Cuba", "Taiwan", "Uruguay"]
continent = ["Europe", "Africa", "Europe", "Asia", "Central America", None, "South America"]
population_greater_than_10million = [True, True, False, False, True, True, False] # Boolean for population more than 10 million or not
hdi_list = [0.929, 0.426, 0.767, 0.939, 0.764, 0.926, 0.809] # Human Development Index
area_km2_list = [242495, 27834, 30334 , 734.3, 109884 , 36197, 176215] # Area in km^2

We can hard code our column names using a dictionary

In [None]:
country_info_df = pd.DataFrame({
    "Country Name": country_name_list,
    "Continennt": continent,
    "Population greater than 10 million": population_greater_than_10million,
    "HDI": hdi_list,
    "Area (km squared)": area_km2_list,
})

In [None]:
# Look at our dataframe
country_info_df

In [None]:
# Sometimes we only want a sneak preview of our data (especially if there are hundreds of rows); for this we can use the .head() or .tail() method
country_info_df.head()

#### 🤨 TASK
We've looked at the head of the data; in the cell below, look at the tail (the last few rows)  
*Replace the `???` below with your answer*

In [None]:
???

In [None]:
type(country_info_df)

We can still extract the data series from the dataframe using square brackets with a string identifier: following the syntax `dataframe['column_name']`. See below:

In [None]:
country_info_df['Country Name']
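
A small sketch of a related point: single square brackets with one column name return a Series, while a list of column names returns a DataFrame. The table `df` below is a cut-down stand-in for `country_info_df`, used only for this illustration:

```python
import pandas as pd

# Cut-down stand-in for country_info_df
df = pd.DataFrame({"Country Name": ["Moldova", "Uruguay"], "HDI": [0.767, 0.809]})

one_column = df["HDI"]                     # single name -> Series
two_columns = df[["Country Name", "HDI"]]  # list of names -> DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```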

We can look at all columns...

In [None]:
country_info_df.columns

We can look at the information of the dataframe including the data types (dtype), index, non-null count (non missing values) and memory usage.

In [None]:
# float64 is a 64-bit floating-point number: it is stored in 64 bits, which gives roughly 15-17 significant decimal digits of precision.
country_info_df.info()

Hmm, you may have noticed that there are a few minor mistakes with our data, let's clean it up a little (next week we will go further). Obviously, we can also go back and change the original cells, but let's assume we loaded in some data from another source.

Firstly, there is a mistake in the column name: "Continennt"...

In [None]:
country_info_df.rename(columns={"Continennt": "Continent"})

#### 🤨 Tough TASK
Open two new cells below and in the first look at the `country_info_df` dataframe again, and see if anything changed. In the second, try to fix the `country_info_df` so the changes are saved.   
*Hint: either use a keyboard shortcut ('b' for below) or the buttons at the top of the notebook to open new cells*

### Changing data types
We may also want the `Area (km squared)` column to be of type integer (this helps readability, and we may not care about 0.3 km² in Singapore).

In [None]:
country_info_df['Area (km squared)'] = country_info_df['Area (km squared)'].astype(int)
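
One thing worth knowing about `.astype(int)` is that it drops the decimal part rather than rounding. The sketch below compares the two behaviours; the value `734.9` is made up purely for comparison with Singapore's `734.3`:

```python
import pandas as pd

areas = pd.Series([734.3, 734.9])  # 734.9 is a made-up value for comparison

truncated = areas.astype(int)        # drops the decimal part
rounded = areas.round().astype(int)  # rounds to the nearest whole number first

print(truncated.tolist())
print(rounded.tolist())
```

If rounding is what you want, calling `.round()` before `.astype(int)` is one way to get it.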

#### 🤨 TASK
Have a look at the new dtype of the area column  
*Replace the `???` below with your answer*

In [None]:
???

Pandas comes with a built-in quick data summary method: `.describe()`. By default, this only summarises numeric (int or float) columns.

In [None]:
country_info_df.describe()
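
If you also want a summary of the non-numeric columns, `.describe()` accepts an `include` argument. A minimal sketch on a cut-down stand-in table (`df` is not the notebook's dataframe):

```python
import pandas as pd

df = pd.DataFrame({
    "Country Name": ["United Kingdom", "Burundi", "Moldova"],
    "HDI": [0.929, 0.426, 0.767],
})

numeric_summary = df.describe()            # numeric columns only (the default)
full_summary = df.describe(include="all")  # every column; NaN where a statistic does not apply

print(list(numeric_summary.columns))
print(list(full_summary.columns))
```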

## Adding new columns
This is easy enough, and roughly looks like a variable assignment...

In [None]:
country_info_df['Currency'] = ["pound", "franc", "leu", "dollar", "peso", "dollar", "peso"] # This has to be the same length as the data or it fails.
country_info_df['Has a Govt'] = True # We can also set single values to a column

In [None]:
country_info_df.head()
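
To see what "it fails" looks like in practice, the sketch below tries to assign a list that is too short. The small `df` is a stand-in for the real dataframe, and the `try`/`except` is only there so that the example keeps running:

```python
import pandas as pd

df = pd.DataFrame({"Country Name": ["United Kingdom", "Burundi", "Moldova"]})

try:
    df["Currency"] = ["pound", "franc"]  # only 2 values for 3 rows
    error_message = None
except ValueError as error:
    error_message = str(error)

print(error_message)
```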

## Dropping columns
Actually, let's drop that last column, it is redundant here...

In [None]:
country_info_df = country_info_df.drop("Has a Govt", axis=1) # axis=1 means columns, axis=0 means index. This is a numpy convention
country_info_df.head()

#### 🤨 TASK
Try running the cell above again. Can you interpret the error that is produced now?  
*To get things back to normal, you can reload the data or just restart the entire notebook*
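
Once you have worked out the error, one possible way to make a drop cell safe to re-run is the `errors="ignore"` argument of `.drop()`. This is a sketch on a stand-in dataframe, not a change to the cell above:

```python
import pandas as pd

df = pd.DataFrame({"Country Name": ["Cuba"], "Has a Govt": [True]})

# errors="ignore" skips labels that are not present instead of raising
df = df.drop(columns="Has a Govt", errors="ignore")
df = df.drop(columns="Has a Govt", errors="ignore")  # second run: nothing left to drop, no error

print(list(df.columns))
```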

## Subsetting data with locate (.loc & .iloc)
`.loc` and `.iloc` are extremely important tools for subsetting data. `.loc` selects by label and also accepts conditions, so we can use it to search for things in our dataframes. `.iloc` selects by integer position.

In [None]:
# Let's get all the European countries
country_info_df.loc[country_info_df["Continent"] == "Europe"]

In [None]:
# Let's get all the high HDI countries
country_info_df.loc[country_info_df["HDI"] > 0.8]

In [None]:
# Let's get all the high HDI countries in Europe in our dataframe.
# We use the syntax (condition) & (condition) within the loc
country_info_df.loc[(country_info_df["Continent"] == "Europe") & (country_info_df["HDI"] > 0.8)]
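
Conditions can also be combined with `|` (or) and inverted with `~` (not). A short sketch on a cut-down stand-in table; storing each condition in a variable first (here `in_europe` and `high_hdi`, names chosen for this example) can make longer filters easier to read:

```python
import pandas as pd

df = pd.DataFrame({
    "Country Name": ["United Kingdom", "Burundi", "Moldova", "Singapore"],
    "Continent": ["Europe", "Africa", "Europe", "Asia"],
    "HDI": [0.929, 0.426, 0.767, 0.939],
})

in_europe = df["Continent"] == "Europe"
high_hdi = df["HDI"] > 0.8

either = df.loc[in_europe | high_hdi]  # OR: at least one condition holds
not_europe = df.loc[~in_europe]        # NOT: invert the condition

print(either["Country Name"].tolist())
print(not_europe["Country Name"].tolist())
```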

In [None]:
# Integer-position indexing with .iloc works like this (positions count from 0)...
country_info_df.iloc[5]
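
The difference between the two is easiest to see when the index is not just 0, 1, 2. In this sketch the country names are used as index labels (an assumption for the example; the notebook's dataframe keeps the default integer index):

```python
import pandas as pd

hdi = pd.Series([0.929, 0.426, 0.767], index=["United Kingdom", "Burundi", "Moldova"])

by_label = hdi.loc["Burundi"]  # look up by index label
by_position = hdi.iloc[1]      # look up by integer position (counting from 0)

print(by_label, by_position)
```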

#### 🤨 TASK
Subset all the countries that use a currency called "dollar"  
*Replace the `???` below with your answer*

In [None]:
???

## More advanced loc to replace missing value
This is a bit more advanced, but here I will show you how to replace missing data on a given row.
You may have noticed one final mistake with the dataframe: Taiwan does not have a value for continent. Please see below:


In [None]:
country_info_df.loc[country_info_df["Country Name"] == "Taiwan"]

In [None]:
country_info_df.loc[country_info_df["Country Name"] == "Taiwan"]['Continent']

We can fill in this value using loc (or iloc if we use `5`)...

In [None]:
country_info_df.loc[country_info_df["Country Name"] == "Taiwan", "Continent"] = "Asia"

In [None]:
country_info_df.loc[country_info_df["Country Name"] == "Taiwan"]

In [None]:
country_info_df
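
When you do not already know which rows are incomplete, `.isna()` can find them and `.fillna()` can fill them. This is a sketch on a stand-in table; filling with the placeholder `"Unknown"` is a hypothetical choice for the example, not what the notebook does above:

```python
import pandas as pd

df = pd.DataFrame({
    "Country Name": ["Singapore", "Cuba", "Taiwan"],
    "Continent": ["Asia", "Central America", None],
})

missing_rows = df.loc[df["Continent"].isna()]  # rows where Continent is missing
print(missing_rows["Country Name"].tolist())

# Fill every missing continent with a placeholder (hypothetical choice for this sketch)
df["Continent"] = df["Continent"].fillna("Unknown")
print(df["Continent"].tolist())
```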

## Reading in csv files to pandas dataframe
For this purpose, I have found a small, but unformatted csv file from the Hestia API docs [crop.csv](https://www.hestia.earth/docs/#hestia-calculation-models-ipcc-2013-including-feedbacks)

In [None]:
crop = pd.read_csv("data/crop.csv")
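
If you do not have the file to hand, the sketch below shows the same idea on a tiny made-up CSV held in a string. The column names (`name`, `category`, `duration`) are hypothetical and are not the columns of `crop.csv`; `usecols` is an optional argument that loads only the listed columns:

```python
import io
import pandas as pd

# A tiny made-up CSV standing in for a real file; the column names are hypothetical
csv_text = "name,category,duration\nwheat,Annual crops,\napple,Perennial crops,365\n"

toy = pd.read_csv(io.StringIO(csv_text), usecols=["name", "duration"])

print(toy.shape)
print(toy["duration"].isna().tolist())  # the empty field is read as a missing value
```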

#### 🤨 TASK
Let's have a quick look at the head (and tail) of this data, and some basic stats...  
*Replace the `???` below with your answer*

In [None]:
???

Let's look at the columns in the data

In [None]:
crop.columns

In [None]:
## Look at unique land use categories in this data
crop["IPCC_LAND_USE_CATEGORY"].unique()

For now, let's just subset a few columns (feel free to change these if you want)

In [None]:
columns_to_examine = ["term.id", "IPCC_LAND_USE_CATEGORY", "Nursery_duration"]

In [None]:
crop_sub = crop[columns_to_examine]

In [None]:
crop_sub.head()

Next, let's rename these columns to something simpler. Remember to be careful, as we can lose information from our data if we lose verbosity.

In [None]:
crop_sub = crop_sub.rename(columns={"term.id": "Name", "IPCC_LAND_USE_CATEGORY": "Land Use Category", "Nursery_duration": "Nursery Duration"})

In [None]:
crop_sub.head()

In [None]:
len(crop_sub)

We have 1424 rows of data, but not all data is present. We can drop missing values.

## Dropping NaN Values
NaN stands for "Not a Number"; pandas uses it to mark missing values.

In [None]:
crop_sub = crop_sub.dropna(subset=["Nursery Duration"])

In [None]:
len(crop_sub)
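
The `subset` argument matters: without it, `.dropna()` removes a row if any column is missing. A sketch on a made-up three-row table (the values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "Land Use Category": ["Annual crops", None, "Perennial crops"],
    "Nursery Duration": [float("nan"), 30.0, 365.0],
})

any_missing_dropped = df.dropna()                        # drop rows with a missing value in any column
subset_dropped = df.dropna(subset=["Nursery Duration"])  # only check one column

print(len(df), len(any_missing_dropped), len(subset_dropped))
```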

## Save new data
Saving to a file is very easy. We can use the `.to_csv` method. 

In [None]:
crop_sub.to_csv("data/formatted_crop_subset.csv")
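
By default, `.to_csv` also writes the row index as an extra first column. If you do not want that, `index=False` leaves it out. The sketch below uses a made-up table and calls `.to_csv()` without a path, which returns the CSV text as a string so we can inspect it:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["wheat", "apple"], "Nursery Duration": [30.0, 365.0]})

with_index = df.to_csv()                # no path given, so the CSV text is returned as a string
without_index = df.to_csv(index=False)  # leave the row index out

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```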

# Extra

## Note on 'Methods'
If you take nothing else away from the explanation that follows, *Methods are functions* (introduced last week). They are slightly different in that they are functions specific to a class of object. Below we look at some of the in-built 'methods' that come with strings:

In [None]:
my_string = "hello" # tip: why not change the string here to experiment

In [None]:
my_string.upper() # the syntax for methods is '.<method_name>()'

In [None]:
my_string.capitalize()

As we see above, the syntax is `.<method_name>()`. The `()` is the function call. Most objects have methods, and a pro-tip, if I have not already taught this, is to press tab after you type the `.` of your variable to see a list of potential methods (ignore anything with an underscore `_` for now...).    

See more string methods here: https://www.w3schools.com/python/python_ref_string.asp
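
To make "methods are functions" concrete, this small sketch calls the same function in two ways: through the object (the usual method syntax) and through the `str` class it belongs to:

```python
my_string = "hello"

via_method = my_string.upper()       # usual method syntax
via_function = str.upper(my_string)  # the same function, reached through the str class

print(via_method, via_function)
```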

# (More) Dataframe Methods
We have already seen lots of methods, e.g. `.head()`, `.describe()`, `.rename(<args go here>)`.

*Note on cell below:* It is good practice to minimise what, in software development, we call 'scope'. This means that we define/declare variables near to where they are used. This prevents problems where a variable is modified in a way that means later code does not work as intended. This is especially important in Jupyter notebooks (where you can run cells in any order). Let's redeclare the `crop` dataframe below.


In [None]:
# redeclare the dataframe
crop = pd.read_csv("data/crop.csv")

In [None]:
crop.info()

In [None]:
crop['Nursery_duration'].count()

In [None]:
crop['Nursery_duration'].sum()
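
Note that `.count()` and `.sum()` both skip missing values, whereas `len()` counts every row. A sketch with made-up values (not taken from `crop.csv`):

```python
import pandas as pd

durations = pd.Series([30.0, None, 365.0], name="Nursery_duration")  # made-up values

n_rows = len(durations)        # all rows, missing or not
n_present = durations.count()  # non-missing values only
total = durations.sum()        # missing values are skipped

print(n_rows, n_present, total)
```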

#### 🤨 TASK
Before continuing, try pressing 'tab' after the `.` below to see what we can see. Remember we need to include a `()` at the end if we want to call methods. Confusingly, we do not use `()` if what follows `.` is a property (introduced after this cell).  
*Delete the `???` and press the tab key*

In [None]:
crop.???

### Quick note on properties
Properties, as the name suggests, describe an object. They store values rather than perform actions, e.g. the index is stored as a property. They are often named like `is_...` or after a characteristic of the object: `size` or `index`.

**Remember:** you can use the in-built `help` function if you want to read more about a method (but do not use `()` inside the `help(<method>)`)

In [None]:
crop.index

In [None]:
crop.size
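
As a rough check of the difference, the sketch below uses a small made-up dataframe: properties such as `.shape` and `.size` are read without brackets, while methods such as `.head()` are called with them:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = df.shape  # property: no brackets
total_cells = df.size      # property: rows multiplied by columns
first_rows = df.head(2)    # method: called with brackets

print(n_rows, n_cols, total_cells)
print(callable(df.head), callable(df.size))
```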