# Lab Notebook 3 – Data Science with Python 

The next two weeks' materials cover some of the third-party data science and visualisation libraries that are commonly used in Python: Pandas (Python Data Analysis Library) and Matplotlib (Visualisation with Python).
In this lab notebook, we will cover:
- 1) A new data type introduced by Pandas: DataFrames
- 2) Basics of data cleaning with Pandas
- 3) Loading in and saving data to and from csv

# Pandas
Pandas is a third-party Python library for data analysis. It introduces useful data types that come with many built-in methods for data handling. These new data types are the `Series` and the `DataFrame`. While it is not important to understand the specifics yet, it is worth noting that both are built on top of NumPy arrays, so they are well optimised, and pandas and NumPy share many similarities (naming conventions, functions, etc.).

Let's start by looking at these new data types, firstly the Series (which operates like an array or list).

In [1]:
import pandas as pd


In [2]:
country_name_list = ["United Kingdom", "Burundi", "Moldova", "Singapore", "Canada", "Taiwan", "Uruguay"]


In [3]:
country_name_series = pd.Series(country_name_list, name="Country Name")


In [4]:
country_name_series.sort_values()


1           Burundi
4            Canada
2           Moldova
3         Singapore
5            Taiwan
0    United Kingdom
6           Uruguay
Name: Country Name, dtype: object

In [5]:
# Look at the variable below, why do you think the order is not changed? 
country_name_series


0    United Kingdom
1           Burundi
2           Moldova
3         Singapore
4            Canada
5            Taiwan
6           Uruguay
Name: Country Name, dtype: object

In [8]:
# To keep the order changed, we would need to create a new variable
country_name_series_sorted = country_name_series.sort_values()
country_name_series_sorted


1           Burundi
4            Canada
2           Moldova
3         Singapore
5            Taiwan
0    United Kingdom
6           Uruguay
Name: Country Name, dtype: object
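
The behaviour above can be sketched in a few lines: `sort_values()` returns a new Series and leaves the original alone. The variable names here are illustrative only and are not used elsewhere in the notebook.

```python
import pandas as pd

names = pd.Series(["Moldova", "Burundi", "Canada"], name="Country Name")

# sort_values() returns a new Series; `names` itself is left untouched
sorted_names = names.sort_values()

print(names.tolist())         # original order
print(sorted_names.tolist())  # alphabetical order

# The old index labels travel with the values; reset_index(drop=True)
# renumbers them from 0 if that is what you want
renumbered = sorted_names.reset_index(drop=True)
print(renumbered.index.tolist())
```

Notice that the sorted Series keeps the original index labels, which is why the outputs above show `1, 4, 2, ...` down the left-hand side.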

In [9]:
type(country_name_series)


pandas.core.series.Series

In [10]:
# dtype is the data type. Pandas uses it to decide which operations (e.g. mathematical operations) can be computed on that column
country_name_series.dtype # 'O' means object, a generic type that pandas uses here for strings


dtype('O')
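
As a rough illustration of why the dtype matters, the sketch below compares a numeric Series with a text Series. The variable names are made up for this example.

```python
import pandas as pd

hdi = pd.Series([0.929, 0.426, 0.767], name="HDI")
names = pd.Series(["United Kingdom", "Burundi", "Moldova"], name="Country Name")

print(hdi.dtype)    # a numeric dtype, so arithmetic makes sense
print(hdi.mean())   # the average of the three values

print(names.dtype)  # a text dtype; names.mean() would raise an error
```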

## Dataframes
Let's now look at DataFrames (which store a collection of Series as a table).

In [11]:
country_name_list = ["United Kingdom", "Burundi", "Moldova", "Singapore", "Cuba", "Taiwan", "Uruguay"]
continent = ["Europe", "Africa", "Europe", "Asia", "Central America", None, "South America"]
population_greater_than_10million = [True, True, False, False, True, True, False] # Boolean for population more than 10 million or not
hdi_list = [0.929, 0.426, 0.767, 0.939, 0.764, 0.926, 0.809] # Human Development Index
area_km2_list = [242495, 27834, 30334, 734.3, 109884, 36197, 176215] # Area in km^2


We can hard code our column names using a dictionary

In [12]:
country_info_df = pd.DataFrame({"Country Name":country_name_list, "Continennt" : continent,
     "Population greater than 10 million" : population_greater_than_10million,
     "HDI" : hdi_list, "Area (km squared)" : area_km2_list})


In [13]:
# Look at our dataframe
country_info_df


Unnamed: 0,Country Name,Continennt,Population greater than 10 million,HDI,Area (km squared)
0,United Kingdom,Europe,True,0.929,242495.0
1,Burundi,Africa,True,0.426,27834.0
2,Moldova,Europe,False,0.767,30334.0
3,Singapore,Asia,False,0.939,734.3
4,Cuba,Central America,True,0.764,109884.0
5,Taiwan,,True,0.926,36197.0
6,Uruguay,South America,False,0.809,176215.0


In [53]:
# Sometimes we only want a sneak preview of our data (especially if there are hundreds of rows). For this we can use the .head() or .tail() method
country_info_df.head()


Unnamed: 0,Country Name,Continennt,Population greater than 10 million,HDI,Area (km squared)
0,United Kingdom,Europe,True,0.929,242495.0
1,Burundi,Africa,True,0.426,27834.0
2,Moldova,Europe,False,0.767,30334.0
3,Singapore,Asia,False,0.939,734.3
4,Cuba,Central America,True,0.764,109884.0


#### 🤨 TASK
We've looked at the header, in the cell below look at the footer of the data  
*Replace the `???` below with your answer*

In [15]:
country_info_df.tail()


Unnamed: 0,Country Name,Continennt,Population greater than 10 million,HDI,Area (km squared)
2,Moldova,Europe,False,0.767,30334.0
3,Singapore,Asia,False,0.939,734.3
4,Cuba,Central America,True,0.764,109884.0
5,Taiwan,,True,0.926,36197.0
6,Uruguay,South America,False,0.809,176215.0


In [16]:
type(country_info_df)


pandas.core.frame.DataFrame

We can still extract the data series from the dataframe using square brackets with a string identifier: following the syntax `dataframe['column_name']`. See below:

In [17]:
country_info_df['Country Name']


0    United Kingdom
1           Burundi
2           Moldova
3         Singapore
4              Cuba
5            Taiwan
6           Uruguay
Name: Country Name, dtype: object

We can look at all columns...

In [18]:
country_info_df.columns


Index(['Country Name', 'Continennt', 'Population greater than 10 million',
       'HDI', 'Area (km squared)'],
      dtype='object')

We can look at the information of the dataframe including the data types (dtype), index, non-null count (non missing values) and memory usage.

In [19]:
# float64 is a 64-bit float, i.e. each value is stored in 64 bits, which gives roughly 15-17 significant decimal digits of precision.
# Use the non-null count to spot the null values easily.
country_info_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Country Name                        7 non-null      object 
 1   Continennt                          6 non-null      object 
 2   Population greater than 10 million  7 non-null      bool   
 3   HDI                                 7 non-null      float64
 4   Area (km squared)                   7 non-null      float64
dtypes: bool(1), float64(2), object(2)
memory usage: 363.0+ bytes
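
If you only want the number of missing values per column, one common pattern is `isna().sum()`. This is a small sketch on a made-up three-row DataFrame rather than the one above.

```python
import pandas as pd

df = pd.DataFrame({
    "Country Name": ["Cuba", "Taiwan", "Uruguay"],
    "Continent": ["Central America", None, "South America"],
})

# isna() marks each missing cell as True; sum() counts the Trues per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```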


Hmm, you may have noticed that there are a few minor mistakes with our data, let's clean it up a little (next week we will go further). Obviously, we can also go back and change the original cells, but let's assume we loaded in some data from another source.

Firstly, there is a mistake in the column name: "Continennt"...

In [22]:
country_info_df.rename(columns={"Continennt": "Continent"})


Unnamed: 0,Country Name,Continent,Population greater than 10 million,HDI,Area (km squared)
0,United Kingdom,Europe,True,0.929,242495.0
1,Burundi,Africa,True,0.426,27834.0
2,Moldova,Europe,False,0.767,30334.0
3,Singapore,Asia,False,0.939,734.3
4,Cuba,Central America,True,0.764,109884.0
5,Taiwan,,True,0.926,36197.0
6,Uruguay,South America,False,0.809,176215.0


#### 🤨 Tough TASK
Open two new cells below and in the first look at the `country_info_df` dataframe again, and see if anything changed. In the second, try to fix the `country_info_df` so the changes are saved.   
*Hint: either use a keyboard shortcut ('b' for below) or the buttons at the top of the notebook to open new cells*

In [21]:
country_info_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Country Name                        7 non-null      object 
 1   Continennt                          6 non-null      object 
 2   Population greater than 10 million  7 non-null      bool   
 3   HDI                                 7 non-null      float64
 4   Area (km squared)                   7 non-null      float64
dtypes: bool(1), float64(2), object(2)
memory usage: 363.0+ bytes


In [24]:
country_info_df = country_info_df.rename(columns={"Continennt": "Continent"})
country_info_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Country Name                        7 non-null      object 
 1   Continent                           6 non-null      object 
 2   Population greater than 10 million  7 non-null      bool   
 3   HDI                                 7 non-null      float64
 4   Area (km squared)                   7 non-null      float64
dtypes: bool(1), float64(2), object(2)
memory usage: 363.0+ bytes


### Changing data types
We may also want the `Area (km squared)` column to be of type integer (as this helps readability, and we may not care about 0.3 km² in Singapore). Note that `.astype(int)` drops the decimal part rather than rounding.

In [25]:
country_info_df['Area (km squared)'] = country_info_df['Area (km squared)'].astype(int)


#### 🤨 TASK
Have a look at the new dtype of the area column  
*Replace the `???` below with your answer*

In [27]:
country_info_df["Area (km squared)"].info()


<class 'pandas.core.series.Series'>
RangeIndex: 7 entries, 0 to 6
Series name: Area (km squared)
Non-Null Count  Dtype
--------------  -----
7 non-null      int64
dtypes: int64(1)
memory usage: 188.0 bytes
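
One thing to be aware of is that `.astype(int)` truncates rather than rounds. The sketch below uses made-up area values to show the difference. Rounding first is just one option, not necessarily what you want for your data.

```python
import pandas as pd

area = pd.Series([242495.0, 734.3, 734.9], name="Area (km squared)")

truncated = area.astype(int)        # drops the decimal part: 734.9 -> 734
rounded = area.round().astype(int)  # rounds first: 734.9 -> 735

print(truncated.tolist())
print(rounded.tolist())
```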


Pandas comes with a built-in quick data summary method: `.describe()`. By default it only summarises the numeric (int or float) columns.

In [28]:
country_info_df.describe()


Unnamed: 0,HDI,Area (km squared)
count,7.0,7.0
mean,0.794286,89099.0
std,0.179792,90705.812728
min,0.426,734.0
25%,0.7655,29084.0
50%,0.809,36197.0
75%,0.9275,143049.5
max,0.939,242495.0
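
Text columns can be summarised too, although the statistics are different. The sketch below (on a small made-up DataFrame) shows `.describe()` on a text column and the `include="all"` option.

```python
import pandas as pd

df = pd.DataFrame({
    "Continent": ["Europe", "Africa", "Europe"],
    "HDI": [0.929, 0.426, 0.767],
})

numeric_summary = df.describe()              # numeric columns only
text_summary = df[["Continent"]].describe()  # count, unique, top, freq
everything = df.describe(include="all")      # both kinds in one table

print(text_summary)
```

In the combined table, statistics that do not apply to a column (for example the mean of a text column) are shown as NaN.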


## Adding new columns
This is easy enough, and roughly looks like a variable assignment...

In [29]:
country_info_df['Currency'] = ["pound", "france", "leu", "dollar", "peso", "dollar", "dollar"] # The list must be the same length as the DataFrame, otherwise pandas raises an error.
country_info_df['Has a Govt'] = True # A single value is copied into every row of the new column


In [30]:
country_info_df.head()


Unnamed: 0,Country Name,Continent,Population greater than 10 million,HDI,Area (km squared),Currency,Has a Govt
0,United Kingdom,Europe,True,0.929,242495,pound,True
1,Burundi,Africa,True,0.426,27834,france,True
2,Moldova,Europe,False,0.767,30334,leu,True
3,Singapore,Asia,False,0.939,734,dollar,True
4,Cuba,Central America,True,0.764,109884,peso,True
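
New columns can also be calculated from existing ones. The sketch below uses a two-row stand-in DataFrame, and the new column name is made up for this example. It also shows what happens when the list is the wrong length.

```python
import pandas as pd

df = pd.DataFrame({
    "Country Name": ["United Kingdom", "Singapore"],
    "Area (km squared)": [242495, 734],
})

# A new column can be computed from an existing one, row by row
df["Area (thousand km squared)"] = df["Area (km squared)"] / 1000

# A list of the wrong length is rejected with a ValueError
try:
    df["Currency"] = ["pound"]  # one value for two rows
except ValueError as error:
    print("Could not add column:", error)

print(df)
```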


## Dropping columns
Actually, let's drop that last column, it is redundant here...

In [32]:
country_info_df = country_info_df.drop("Has a Govt", axis=1) # axis=1 means columns, axis=0 means index. This is a numpy convention
country_info_df.head()


KeyError: "['Has a Govt'] not found in axis"

#### 🤨 TASK
Try running the cell above again, can you interpret the Error that is produced now?  
*To get things back to normal, you can reload the data or just restart the entire notebook*

In [None]:
# The error comes from the fact that we saved our variable with the dropped column, so there is no longer a "Has a Govt" column to drop. 
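
If you would like a drop cell that can be re-run safely, one option is the `errors="ignore"` argument of `.drop()`. This is a sketch on a one-row stand-in DataFrame. Whether silently ignoring a missing column is a good idea depends on your situation.

```python
import pandas as pd

df = pd.DataFrame({"Country Name": ["Cuba"], "Has a Govt": [True]})

df = df.drop(columns="Has a Govt")                   # first run: column removed
df = df.drop(columns="Has a Govt", errors="ignore")  # second run: nothing to drop, no error

print(df.columns.tolist())
```

Here `columns="Has a Govt"` does the same job as passing the name with `axis=1`.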


## Subsetting data with locate (.loc & .iloc)
`.loc` and `.iloc` are extremely important tools for subsetting data. `.loc` selects rows by label or by a condition (a True/False value for each row), while `.iloc` selects rows by their integer position.

In [33]:
# Let's get all the European countries
country_info_df.loc[country_info_df["Continent"] == "Europe"]


Unnamed: 0,Country Name,Continent,Population greater than 10 million,HDI,Area (km squared),Currency
0,United Kingdom,Europe,True,0.929,242495,pound
2,Moldova,Europe,False,0.767,30334,leu


In [34]:
# Let's get all the high HDI countries
country_info_df.loc[country_info_df["HDI"] > 0.8]


Unnamed: 0,Country Name,Continent,Population greater than 10 million,HDI,Area (km squared),Currency
0,United Kingdom,Europe,True,0.929,242495,pound
3,Singapore,Asia,False,0.939,734,dollar
5,Taiwan,,True,0.926,36197,dollar
6,Uruguay,South America,False,0.809,176215,dollar


In [35]:
# Let's get all the high HDI countries in Europe in our dataframe.
# We use the syntax (condition) & (condition) within the loc
country_info_df.loc[(country_info_df["Continent"] == "Europe") & (country_info_df["HDI"] > 0.8)]


Unnamed: 0,Country Name,Continent,Population greater than 10 million,HDI,Area (km squared),Currency
0,United Kingdom,Europe,True,0.929,242495,pound


In [36]:
# .iloc selects by integer position: here, the row at position 5
country_info_df.iloc[5]


Country Name                          Taiwan
Continent                               None
Population greater than 10 million      True
HDI                                    0.926
Area (km squared)                      36197
Currency                              dollar
Name: 5, dtype: object
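
The difference between `.loc` and `.iloc` is easier to see when the index is not just 0, 1, 2. The sketch below uses country names as the index (an assumption made for this example only) and also shows the `|` (or) and `~` (not) operators for combining conditions.

```python
import pandas as pd

df = pd.DataFrame(
    {"HDI": [0.929, 0.426, 0.939]},
    index=["United Kingdom", "Burundi", "Singapore"],  # labels instead of 0, 1, 2
)

by_label = df.loc["Burundi", "HDI"]  # row label, column label
by_position = df.iloc[1, 0]          # second row, first column

# Conditions can also be combined with | (or) and negated with ~ (not)
low_or_very_high = df.loc[(df["HDI"] < 0.5) | (df["HDI"] > 0.93)]
not_low = df.loc[~(df["HDI"] < 0.5)]

print(by_label, by_position)
print(low_or_very_high)
```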

#### 🤨 TASK
Subset all the countries that use a currency called "dollar"  
*Replace the `???` below with your answer*

In [37]:
country_info_df.loc[country_info_df["Currency"] == "dollar"]


Unnamed: 0,Country Name,Continent,Population greater than 10 million,HDI,Area (km squared),Currency
3,Singapore,Asia,False,0.939,734,dollar
5,Taiwan,,True,0.926,36197,dollar
6,Uruguay,South America,False,0.809,176215,dollar


## More advanced loc to replace missing value
This is a bit more advanced, but here is how to replace missing data in a given row.
You may have noticed one final mistake in the dataframe: Taiwan does not have a value for continent. See below:


In [38]:
country_info_df.loc[country_info_df["Country Name"] == "Taiwan"]


Unnamed: 0,Country Name,Continent,Population greater than 10 million,HDI,Area (km squared),Currency
5,Taiwan,,True,0.926,36197,dollar


In [39]:
country_info_df.loc[country_info_df["Country Name"] == "Taiwan"]['Continent']


5    None
Name: Continent, dtype: object

We can fill in this value using loc (or iloc if we use `5`)...

In [40]:
country_info_df.loc[country_info_df["Country Name"] == "Taiwan", "Continent"] = "Asia"


In [41]:
country_info_df.loc[country_info_df["Country Name"] == "Taiwan"]


Unnamed: 0,Country Name,Continent,Population greater than 10 million,HDI,Area (km squared),Currency
5,Taiwan,Asia,True,0.926,36197,dollar


In [42]:
country_info_df


Unnamed: 0,Country Name,Continent,Population greater than 10 million,HDI,Area (km squared),Currency
0,United Kingdom,Europe,True,0.929,242495,pound
1,Burundi,Africa,True,0.426,27834,france
2,Moldova,Europe,False,0.767,30334,leu
3,Singapore,Asia,False,0.939,734,dollar
4,Cuba,Central America,True,0.764,109884,peso
5,Taiwan,Asia,True,0.926,36197,dollar
6,Uruguay,South America,False,0.809,176215,dollar


In [54]:
country_info_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Country Name                        7 non-null      object 
 1   Continent                           7 non-null      object 
 2   Population greater than 10 million  7 non-null      bool   
 3   HDI                                 7 non-null      float64
 4   Area (km squared)                   7 non-null      int64  
 5   Currency                            7 non-null      object 
dtypes: bool(1), float64(1), int64(1), object(3)
memory usage: 419.0+ bytes
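
When every missing value in a column should get the same replacement, `.fillna()` is a common alternative to a targeted `.loc` assignment. The sketch below uses a two-row stand-in DataFrame, and the `"Unknown"` placeholder is an assumption for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Country Name": ["Singapore", "Taiwan"],
    "Continent": ["Asia", None],
})

# Replace every missing value in one column with the same default
df["Continent"] = df["Continent"].fillna("Unknown")

print(df)
```

The `.loc` approach used above is the better fit when you know the correct value for one specific row.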


## Reading in csv files to pandas dataframe
For this purpose, I have found a small but unformatted CSV file from the Hestia API docs: [crop.csv](https://www.hestia.earth/docs/#hestia-calculation-models-ipcc-2013-including-feedbacks)

In [57]:
crop = pd.read_csv("crop.csv")


#### 🤨 TASK
Let's have a quick look at the head (and foot) of this data, and some basic stats...
*Replace the `???` below with your answer*

In [46]:
crop.head()


Unnamed: 0,id,name,units,synonyms,subClassOf.0.id,subClassOf.1.id,definition,scientificName,hsCode,iccCode,...,lookups.41.source,lookups.41.dataState,lookups.42.name,lookups.42.value,lookups.42.source,lookups.42.dataState,lookups.43.name,lookups.43.value,lookups.43.source,lookups.43.dataState
0,genericCropPlant,Generic crop plant,ha,Unspecified crop plant; unknown crop plant; mu...,-,-,A term describing a generic plant.,-,-,-,...,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,not required,C_CONTENT_AG_CROP_RESIDUE,-,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,not required,C_CONTENT_BG_CROP_RESIDUE,-,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,not required
1,genericCropProduct,Generic crop product,kg,Unspecified crop product; unknown crop product...,genericCropPlant,-,A term describing a generic crop product.,-,-,-,...,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_AG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_BG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete
2,genericCropSeed,"Generic crop, seed",kg,Unspecified seed; unknown seed; multiple seed,genericCropPlant,-,A term describing the seed of a group of crops...,-,-,-,...,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_AG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_BG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete
3,genericCropStraw,"Generic crop, straw",kg,Unspecified straw; unknown straw; multiple str...,genericCropPlant,-,A term describing the straw of a group of crop...,-,-,-,...,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_AG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_BG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete
4,genericCropMeal,"Generic crop, meal",kg,Meal unspecified; Meal unknown; Meal generic; ...,oilCropSeedFruit,-,The coarse residue obtained after oil is remov...,-,-,-,...,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_AG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_BG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete


In [47]:
crop.tail()


Unnamed: 0,id,name,units,synonyms,subClassOf.0.id,subClassOf.1.id,definition,scientificName,hsCode,iccCode,...,lookups.41.source,lookups.41.dataState,lookups.42.name,lookups.42.value,lookups.42.source,lookups.42.dataState,lookups.43.name,lookups.43.value,lookups.43.source,lookups.43.dataState
1420,melilotusPlant,Melilotus plant,ha,Sweet yellow clover; Yellow melilot; Ribbed m...,-,-,A herbaceous annual or biennial plant of the b...,Melilotus officinalis,-,-,...,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,requires validation,C_CONTENT_AG_CROP_RESIDUE,-,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,requires validation,C_CONTENT_BG_CROP_RESIDUE,-,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,requires validation
1421,melilotusLeaf,"Melilotus, leaf",kg,Sweet yellow clover; Yellow melilot; Ribbed m...,melilotusPlant,-,The edible leaf of the melilotus plant.,Melilotus officinalis,-,-,...,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,requires validation,C_CONTENT_AG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,requires validation,C_CONTENT_BG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,requires validation
1422,melilotusFlower,"Melilotus, flower",kg,Sweet yellow clover; Yellow melilot; Ribbed m...,melilotusPlant,-,The edible flower of the melilotus plant.,Melilotus officinalis,-,-,...,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,requires validation,C_CONTENT_AG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,requires validation,C_CONTENT_BG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,requires validation
1423,patienceDockPlant,Patience dock plant,ha,Garden patience; Herb patience; Monk's rhubarb,-,-,A herbaceous perennial plant of the buckwheat ...,Rumex patientia,-,-,...,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,requires validation,C_CONTENT_AG_CROP_RESIDUE,-,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,requires validation,C_CONTENT_BG_CROP_RESIDUE,-,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,requires validation
1424,patienceDockLeaf,"Patience dock, leaf",kg,Garden patience; Herb patience; Monk's rhubarb,patienceDockPlant,-,"The edible leaf of the patience dock plant, wh...",Rumex patientia,-,-,...,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,requires validation,C_CONTENT_AG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,requires validation,C_CONTENT_BG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,requires validation


In [68]:
import numpy as np
crop.loc[crop["lookups.42.value"] == "-", "lookups.42.value"] = np.nan
crop["lookups.42.value"] = crop["lookups.42.value"].astype(float)
# country_info_df['Area (km squared)'] = country_info_df['Area (km squared)'].astype(int)
# country_info_df.loc[country_info_df["Country Name"] == "Taiwan", "Continent"] = "Asia"
crop.head()


Unnamed: 0,id,name,units,synonyms,subClassOf.0.id,subClassOf.1.id,definition,scientificName,hsCode,iccCode,...,lookups.41.source,lookups.41.dataState,lookups.42.name,lookups.42.value,lookups.42.source,lookups.42.dataState,lookups.43.name,lookups.43.value,lookups.43.source,lookups.43.dataState
0,genericCropPlant,Generic crop plant,ha,Unspecified crop plant; unknown crop plant; mu...,-,-,A term describing a generic plant.,-,-,-,...,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,not required,C_CONTENT_AG_CROP_RESIDUE,,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,not required,C_CONTENT_BG_CROP_RESIDUE,-,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,not required
1,genericCropProduct,Generic crop product,kg,Unspecified crop product; unknown crop product...,genericCropPlant,-,A term describing a generic crop product.,-,-,-,...,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_AG_CROP_RESIDUE,42.0,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_BG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete
2,genericCropSeed,"Generic crop, seed",kg,Unspecified seed; unknown seed; multiple seed,genericCropPlant,-,A term describing the seed of a group of crops...,-,-,-,...,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_AG_CROP_RESIDUE,42.0,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_BG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete
3,genericCropStraw,"Generic crop, straw",kg,Unspecified straw; unknown straw; multiple str...,genericCropPlant,-,A term describing the straw of a group of crop...,-,-,-,...,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_AG_CROP_RESIDUE,42.0,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_BG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete
4,genericCropMeal,"Generic crop, meal",kg,Meal unspecified; Meal unknown; Meal generic; ...,oilCropSeedFruit,-,The coarse residue obtained after oil is remov...,-,-,-,...,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_AG_CROP_RESIDUE,42.0,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete,C_CONTENT_BG_CROP_RESIDUE,42,IPCC (2019) https://www.ipcc-nggip.iges.or.jp/...,complete
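
Two other ways to handle placeholder strings such as `"-"` are sketched below, on a tiny made-up CSV rather than the real crop file. The column names and values are assumptions for this example.

```python
import io
import pandas as pd

# A tiny stand-in for crop.csv (made-up rows, the real file will differ)
raw_csv = io.StringIO("id,value\na,42\nb,-\nc,36.12\n")

# Option 1: tell read_csv which strings mean "missing"
df = pd.read_csv(raw_csv, na_values=["-"])
print(df["value"].dtype)

# Option 2: convert afterwards; anything that cannot be parsed becomes NaN
text_values = pd.Series(["42", "-", "36.12"])
numbers = pd.to_numeric(text_values, errors="coerce")
print(numbers)
```

Option 1 applies to every column in the file, which is convenient when the same placeholder is used throughout.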


In [74]:
# Print the name of every column that ends in "value"
for item in crop.columns:
    if item.endswith("value"):
        print(item)


defaultProperties.0.value
defaultProperties.1.value
defaultProperties.2.value
lookups.0.value
lookups.1.value
lookups.2.value
lookups.3.value
lookups.4.value
lookups.5.value
lookups.6.value
lookups.7.value
lookups.8.value
lookups.9.value
lookups.10.value
lookups.11.value
lookups.12.value
lookups.13.value
lookups.14.value
lookups.15.value
lookups.16.value
lookups.17.value
lookups.18.value
lookups.19.value
lookups.20.value
lookups.21.value
lookups.22.value
lookups.23.value
lookups.24.value
lookups.25.value
lookups.26.value
lookups.27.value
lookups.28.value
lookups.29.value
lookups.30.value
lookups.31.value
lookups.32.value
lookups.33.value
lookups.34.value
lookups.35.value
lookups.36.value
lookups.37.value
lookups.38.value
lookups.39.value
lookups.40.value
lookups.41.value
lookups.42.value
lookups.43.value


In [69]:
crop.describe()


Unnamed: 0,lookups.19.value,lookups.42.value
count,1425.0,1013.0
mean,0.811789,40.71616
std,0.154695,2.170149
min,-0.2,32.55
25%,0.8,39.48
50%,0.8,42.0
75%,0.8,42.0
max,3.8,42.0


In [67]:
crop["lookups.42.value"].unique()


array([None, '42', '36.12', '38.22', '37.59', '38.01', '39.27', '39.48',
       '38.85', '39.69', '37.38', '37.17', '35.28', '37.8', '35.07',
       '32.55', '38.43', '39.9', '36.33'], dtype=object)

Let's look at the columns in the data

In [49]:
crop.columns


Index(['id', 'name', 'units', 'synonyms', 'subClassOf.0.id', 'subClassOf.1.id',
       'definition', 'scientificName', 'hsCode', 'iccCode',
       ...
       'lookups.41.source', 'lookups.41.dataState', 'lookups.42.name',
       'lookups.42.value', 'lookups.42.source', 'lookups.42.dataState',
       'lookups.43.name', 'lookups.43.value', 'lookups.43.source',
       'lookups.43.dataState'],
      dtype='object', length=250)

In [75]:
# Look at unique land use categories in this data
crop["IPCC_LAND_USE_CATEGORY"].unique()


KeyError: 'IPCC_LAND_USE_CATEGORY'

For now, let's just subset a few columns (feel free to change these if you want)

In [None]:
columns_to_examine = ["term.id", "IPCC_LAND_USE_CATEGORY", "Nursery_duration"]


In [None]:
crop_sub = crop[columns_to_examine]


In [None]:
crop_sub.head()


Next, let's rename these columns to something simpler. Remember to be careful, as we can lose information about our data if the names become less descriptive.

In [None]:
crop_sub = crop_sub.rename(columns={"term.id": "Name", "IPCC_LAND_USE_CATEGORY": "Land Use Category", "Nursery_duration": "Nursery Duration"})


In [None]:
crop_sub.head()


In [None]:
len(crop_sub)


We have 1425 rows of data (index 0 to 1424), but not all data is present. We can drop missing values.

## Dropping NaN Values
NaN stands for "Not a Number". Pandas uses it to mark missing values.

In [None]:
crop_sub = crop_sub.dropna(subset=["Nursery Duration"])


In [None]:
len(crop_sub)
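
To see what the `subset` argument changes, here is a sketch on a three-row made-up DataFrame. The values (including "Annual crops") are invented for this example and are not taken from the crop file.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "Nursery Duration": [30.0, np.nan, 45.0],
    "Land Use Category": [None, "Annual crops", "Annual crops"],
})

only_duration = df.dropna(subset=["Nursery Duration"])  # drops row "b" only
any_missing = df.dropna()                               # drops rows "a" and "b"

print(len(df), len(only_duration), len(any_missing))
```

Without `subset`, a row is dropped if any of its columns is missing, which can remove far more data than intended.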


## Save new data
Saving to a file is very easy. We can use the `.to_csv` method. 

In [None]:
crop_sub.to_csv("data/formatted_crop_subset.csv")
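
By default `.to_csv` also writes the index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back in. The sketch below demonstrates this in memory, so no file is written. The DataFrame is a made-up stand-in.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["a", "b"], "Nursery Duration": [30, 45]})

with_index = df.to_csv()                # index written as an unnamed first column
without_index = df.to_csv(index=False)  # data columns only

# Reading the first version back in creates an "Unnamed: 0" column
reloaded = pd.read_csv(io.StringIO(with_index))
print(reloaded.columns.tolist())
```

Passing `index=False` is worth considering when the index is just the default row numbers.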


# Extra

## Note on 'Methods'
If you take nothing else away from the explanation that follows, remember that *methods are functions* (introduced last week). They are slightly different in that they are functions specific to a class of object. Below we look at some of the built-in 'methods' that come with strings:

In [None]:
my_string = "hello" # tip: why not change the string here to experiment


In [None]:
my_string.upper() # the syntax for methods is '.<method_name>()'


In [None]:
my_string.capitalize()


As we see above, the syntax is `.<method_name>()`. The `()` is the function call. Most objects have methods, and a pro-tip (if I have not already taught this) is to press tab after you type the `.` following your variable to see a list of available methods (ignore anything starting with an underscore `_` for now...).

See more string methods here: https://www.w3schools.com/python/python_ref_string.asp
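
As a small side-by-side sketch, here is a method call next to an ordinary function call on the same string:

```python
my_string = "hello"

shouted = my_string.upper()  # method: called on the object, after a dot
length = len(my_string)      # function: the object is passed in as an argument

print(shouted, length)
print(my_string)  # string methods return a new string; the original is unchanged
```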

# (More) Dataframe Methods
We have already seen lots of methods, e.g. `.head()`, `.describe()` and `.rename(<args go here>)`.

*Note on cell below:* It is good practice to minimise what, in software development, we call 'scope'. This means defining/declaring variables close to where they are used. This prevents problems where a variable is modified somewhere else and later code no longer works as intended. This is especially important in Jupyter notebooks (where you can run cells in any order). Let's redeclare the `crop` dataframe below.


In [None]:
# redeclare the dataframe
crop = pd.read_csv("data/crop.csv")


In [None]:
crop.info()


In [None]:
crop['Nursery_duration'].count()


In [None]:
crop['Nursery_duration'].sum()


#### 🤨 TASK
Before continuing, try pressing 'tab' after the `.` below to see what we can see. Remember we need to include a `()` at the end if we want to call methods. Confusingly, we do not use `()` if what follows `.` is a property (introduced after this cell).  
*Delete the `???` and press the tab key*

In [None]:
crop.???


### Quick note on properties
These are, as the name suggests, properties of the object. They store values, e.g. the index is stored as a property. They will often be named like `is_...` or named after a feature of the object, such as `size` or `index`.

**Remember:** you can use the built-in `help` function if you want to read more about a method. Do not put `()` after the method name inside `help`, i.e. write `help(crop.head)` rather than `help(crop.head())`.

In [None]:
crop.index


In [None]:
crop.size
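
To round off, here is a sketch contrasting properties (no brackets) with methods and functions (brackets), using a small made-up DataFrame rather than `crop`.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

print(df.shape)  # property: (rows, columns), no brackets
print(df.size)   # property: rows * columns
print(len(df))   # function: number of rows

row_count = df["a"].count()  # method: needs () to be called
print(row_count)
```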
