# GLOBAL UNDERNUTRITION STUDY - EXPLORATION  

*BY FURAWA*  

**Table of Contents**  

1. [Data collection](#data_collection)  
2. [Data discovery](#data_discovery)  
3. [Data cleaning](#data_cleaning)  
4. [Computing new variables to lead the analysis](#new_variables)  
5. [Identify major trends](#major_trends)  

In [1]:
# Import all the needed libraries for the project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from glob import glob
import re
%matplotlib inline
pd.set_option('display.max_rows', 20)

<a id='data_collection'></a>
## 1. Data collection  
All the data has been downloaded from the [FAO](http://www.fao.org/faostat/en/#data) website.  
Let us check the files.

In [2]:
# Store the file names in the file_names variable
file_names = glob('files/*.csv')
# Check the file_names list
file_names

['files/food_balance_cereals.csv',
 'files/food_security_indicators.csv',
 'files/food_balance_vegetal.csv',
 'files/food_balance_animal.csv',
 'files/population.csv']

In [3]:
# Loop over the file_names list
for file in file_names:
    # Read each file and bind it to a global variable named after the file (e.g. population)
    globals()[re.split(r'\W', file)[1]] = pd.read_csv(file)
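
Creating variables on the fly works, but it hides the dataframe names from the reader and from linters. A minimal alternative sketch, assuming the same `files/*.csv` layout, keeps the dataframes in a dictionary keyed by file name. `load_csv_folder` is a hypothetical helper and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_csv_folder(folder):
    """Read every CSV in `folder` into a dict keyed by the file stem (hypothetical helper)."""
    return {path.stem: pd.read_csv(path) for path in sorted(Path(folder).glob("*.csv"))}


# Usage sketch: dataframes = load_csv_folder("files")
# dataframes["population"] would then hold the content of files/population.csv
```

The rest of the notebook keeps using the global variable names, so this variant would require replacing `population` with `dataframes["population"]` and so on.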

In [4]:
# Check the dataframe info
food_security_indicators.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1020 entries, 0 to 1019
Data columns (total 15 columns):
Domain Code         1020 non-null object
Domain              1020 non-null object
Area Code           1020 non-null int64
Area                1020 non-null object
Element Code        1020 non-null int64
Element             1020 non-null object
Item Code           1020 non-null int64
Item                1020 non-null object
Year Code           1020 non-null int64
Year                1020 non-null object
Unit                1020 non-null object
Value               605 non-null object
Flag                1020 non-null object
Flag Description    1020 non-null object
Note                0 non-null float64
dtypes: float64(1), int64(4), object(10)
memory usage: 119.7+ KB


In [5]:
# Check the dataframe info
food_balance_vegetal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104871 entries, 0 to 104870
Data columns (total 14 columns):
Domain Code         104871 non-null object
Domain              104871 non-null object
Country Code        104871 non-null int64
Country             104871 non-null object
Element Code        104871 non-null int64
Element             104871 non-null object
Item Code           104871 non-null int64
Item                104871 non-null object
Year Code           104871 non-null int64
Year                104871 non-null int64
Unit                104871 non-null object
Value               104871 non-null float64
Flag                104871 non-null object
Flag Description    104871 non-null object
dtypes: float64(1), int64(5), object(8)
memory usage: 11.2+ MB


In [6]:
# Check the dataframe info
food_balance_animal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37166 entries, 0 to 37165
Data columns (total 14 columns):
Domain Code         37166 non-null object
Domain              37166 non-null object
Country Code        37166 non-null int64
Country             37166 non-null object
Element Code        37166 non-null int64
Element             37166 non-null object
Item Code           37166 non-null int64
Item                37166 non-null object
Year Code           37166 non-null int64
Year                37166 non-null int64
Unit                37166 non-null object
Value               37166 non-null float64
Flag                37166 non-null object
Flag Description    37166 non-null object
dtypes: float64(1), int64(5), object(8)
memory usage: 4.0+ MB


In [7]:
# Check the dataframe info
food_balance_cereals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 909 entries, 0 to 908
Data columns (total 14 columns):
Domain Code         909 non-null object
Domain              909 non-null object
Country Code        909 non-null int64
Country             909 non-null object
Element Code        909 non-null int64
Element             909 non-null object
Item Code           909 non-null int64
Item                909 non-null object
Year Code           909 non-null int64
Year                909 non-null int64
Unit                909 non-null object
Value               909 non-null int64
Flag                909 non-null object
Flag Description    909 non-null object
dtypes: int64(6), object(8)
memory usage: 99.5+ KB


In [8]:
population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175 entries, 0 to 174
Data columns (total 14 columns):
Domain Code         175 non-null object
Domain              175 non-null object
Country Code        175 non-null int64
Country             175 non-null object
Element Code        175 non-null int64
Element             175 non-null object
Item Code           175 non-null int64
Item                175 non-null object
Year Code           175 non-null int64
Year                175 non-null int64
Unit                175 non-null object
Value               175 non-null int64
Flag                1 non-null object
Flag Description    175 non-null object
dtypes: int64(6), object(8)
memory usage: 19.3+ KB


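Except for the food_security_indicators dataframe, all the dataframes share the same 14 columns. The food_security_indicators dataframe has 15: it uses Area Code and Area instead of Country Code and Country, and it adds a Note column.  
We can remove the Note variable of food_security_indicators, as it contains only NaN values.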
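
Both observations can be checked programmatically. The sketch below uses two toy frames with a reduced set of columns (illustrative stand-ins, not the real files): it compares the column sets and drops every column that holds nothing but NaN.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for two of the loaded dataframes (reduced columns, illustrative only)
indicators = pd.DataFrame({
    "Area Code": [2, 2],
    "Area": ["Afghanistan", "Afghanistan"],
    "Value": ["7.9", "8.8"],
    "Note": [np.nan, np.nan],
})
population_toy = pd.DataFrame({
    "Country Code": [2, 3],
    "Country": ["Afghanistan", "Albania"],
    "Value": [30552, 3173],
})

# Columns present in one dataframe but not in the other
only_in_indicators = set(indicators.columns) - set(population_toy.columns)
only_in_population = set(population_toy.columns) - set(indicators.columns)
print(sorted(only_in_indicators))
print(sorted(only_in_population))

# Drop every column that contains nothing but NaN
indicators_clean = indicators.dropna(axis=1, how="all")
print(list(indicators_clean.columns))
```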

<a id='data_discovery'></a>
## 2. Data discovery

In [9]:
# Check the first few rows of each dataframe
food_security_indicators.head(2)

,Domain Code,Domain,Area Code,Area,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description,Note
0,FS,Suite of Food Security Indicators,2,Afghanistan,6132,Value,210011,Number of people undernourished (million) (3-y...,20122014,2012-2014,millions,7.9,F,FAO estimate,
1,FS,Suite of Food Security Indicators,2,Afghanistan,6132,Value,210011,Number of people undernourished (million) (3-y...,20132015,2013-2015,millions,8.8,F,FAO estimate,


In [10]:
# Check food balance vegetal
food_balance_vegetal.head(2)  

,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,FBS,Food Balance Sheets,2,Afghanistan,5511,Production,2511,Wheat and products,2013,2013,1000 tonnes,5169.0,S,Standardized data
1,FBS,Food Balance Sheets,2,Afghanistan,5611,Import Quantity,2511,Wheat and products,2013,2013,1000 tonnes,1173.0,S,Standardized data


In [11]:
# Check food balance animal
food_balance_animal.head(2)  

,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,FBS,Food Balance Sheets,2,Afghanistan,5511,Production,2731,Bovine Meat,2013,2013,1000 tonnes,134.0,S,Standardized data
1,FBS,Food Balance Sheets,2,Afghanistan,5611,Import Quantity,2731,Bovine Meat,2013,2013,1000 tonnes,6.0,S,Standardized data


In [12]:
# Check food balance cereals
food_balance_cereals.head(2)

,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,FBS,Food Balance Sheets,2,Afghanistan,5511,Production,2511,Wheat and products,2013,2013,1000 tonnes,5169,S,Standardized data
1,FBS,Food Balance Sheets,2,Afghanistan,5511,Production,2805,Rice (Milled Equivalent),2013,2013,1000 tonnes,342,S,Standardized data


In [13]:
# Check population 
population.head(2)

,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,FBS,Food Balance Sheets,2,Afghanistan,511,Total Population - Both sexes,2501,Population,2013,2013,1000 persons,30552,,Official data
1,FBS,Food Balance Sheets,3,Albania,511,Total Population - Both sexes,2501,Population,2013,2013,1000 persons,3173,,Official data


Let us look for a primary key in each table.  
We will write a function that lists the columns that could serve as one.  

In [14]:
# Function to find the potential primary keys of a dataframe
def check_potential_primary_key(df) -> None:
    # Loop over the columns of the dataframe
    for column_pk in df.columns:
        # A column is a candidate key if all its values are distinct and none is missing
        if df[column_pk].is_unique and df[column_pk].notna().all():
            # Print all the potential primary keys
            print("{} could be a primary key!".format(column_pk))

Now we can use the function to find the potential primary keys of each dataframe. 

In [15]:
# Check the primary key of population dataframe
check_potential_primary_key(population)

Country Code could be a primary key!
Country could be a primary key!
Value could be a primary key!


We have 3 potential primary keys for the population dataframe, but the best choice is the Country Code variable. The Value column is unique only because no two countries happen to have the same population, and country names are awkward to query because their spelling can vary between sources.

In [16]:
# Check the primary key of food balance vegetal
check_potential_primary_key(food_balance_vegetal)

There is no output, which means that no single column can serve as a primary key in this dataframe.

In [17]:
# Check the primary key of food balance animal
check_potential_primary_key(food_balance_animal)

Same here: there is no potential primary key in the food balance animal dataframe.

In [18]:
# Check the primary key of food balance cereals
check_potential_primary_key(food_balance_cereals)

In [19]:
# Check the primary key of food security indicators
check_potential_primary_key(food_security_indicators)

The food balance cereals and food security indicators dataframes have no potential primary key either.
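
The absence of a single-column key does not rule out a composite key. In long-format tables like these, a combination such as country, item and element may identify each row, but that is an assumption to test rather than a fact about the FAO files. A minimal sketch on toy data, where `is_composite_key` is a hypothetical helper:

```python
import pandas as pd


def is_composite_key(df, columns):
    """Return True if the combination of `columns` is unique and has no missing values."""
    subset = df[columns]
    return not subset.isna().any().any() and not subset.duplicated().any()


# Toy long-format frame (illustrative values only)
toy = pd.DataFrame({
    "Country Code": [2, 2, 2, 3],
    "Item Code": [2511, 2511, 2805, 2511],
    "Element Code": [5511, 5611, 5511, 5511],
    "Value": [5169.0, 1173.0, 342.0, 294.0],
})

# Two columns are not enough here, three are
print(is_composite_key(toy, ["Country Code", "Item Code"]))
print(is_composite_key(toy, ["Country Code", "Item Code", "Element Code"]))
```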

Let us create a column with the total population and remove the columns we do not need from the population dataframe.

In [20]:
# Create the population column: take the multiplier (1000) from the first Unit value and multiply it by the Value column
population["population"] = int(population.Unit.str.split(" ")[0][0]) * population.Value
# Keep only the columns we need
population_df = population[["Country Code", "Country", "population"]].copy()
# Check the dataframe
population_df.head()

,Country Code,Country,population
0,2,Afghanistan,30552000
1,3,Albania,3173000
2,4,Algeria,39208000
3,7,Angola,21472000
4,8,Antigua and Barbuda,90000
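
The cell above reads the multiplier from the first row only, which is fine as long as every row shares the same unit. A hedged sketch, assuming units of the form `1000 persons`, extracts the multiplier row by row instead:

```python
import pandas as pd

# Toy population table (values taken from the previews above)
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "Unit": ["1000 persons", "1000 persons"],
    "Value": [30552, 3173],
})

# Extract the leading integer of each unit string, row by row
multiplier = toy["Unit"].str.extract(r"^(\d+)", expand=False).astype(int)
toy["population"] = multiplier * toy["Value"]
print(toy[["Country", "population"]])
```

A unit string without a leading number would make the `astype(int)` call fail, which is a useful signal that the column needs a closer look.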


Now we can calculate the total number of people covered by the data.

In [21]:
# Calculate the total number of humans
total_population = population_df.population.sum()
print("The total number of humans on the planet is : {:,}".format(total_population))

The total number of humans on the planet is : 8,413,993,000


This result cannot be correct for 2013: even in 2019 the world population was only around 7.7 billion. There must be a mistake in the data, so we will look into it during the cleaning step.  

<a id='data_cleaning'></a>
## 3. Data cleaning  

The dataframes are loaded but still dirty: they contain useless rows and columns, the anomaly in the population data must be corrected, and the column names must be changed. Let's do some cleaning.  
We start by combining all the food balance dataframes into a single dataframe.  

In [22]:
# Create the origin variable in each food balance dataframe to store the food origin
food_balance_animal["origin"] = "animal"
food_balance_cereals["origin"] = "cereal"
food_balance_vegetal["origin"] = "vegetal"

In [23]:
# Concatenate the 3 dataframes into a single dataframe (DataFrame.append was removed in pandas 2.0)
food_balance_df = pd.concat([food_balance_animal, food_balance_vegetal, food_balance_cereals])
# Check the first rows
food_balance_df.head(2)

,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description,origin
0,FBS,Food Balance Sheets,2,Afghanistan,5511,Production,2731,Bovine Meat,2013,2013,1000 tonnes,134.0,S,Standardized data,animal
1,FBS,Food Balance Sheets,2,Afghanistan,5611,Import Quantity,2731,Bovine Meat,2013,2013,1000 tonnes,6.0,S,Standardized data,animal


In [24]:
# Check the last rows
food_balance_df.tail(2)

,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description,origin
907,FBS,Food Balance Sheets,5000,World,5511,Production,2518,Sorghum and products,2013,2013,1000 tonnes,62118.0,A,"Aggregate, may include official, semi-official...",cereal
908,FBS,Food Balance Sheets,5000,World,5511,Production,2520,"Cereals, Other",2013,2013,1000 tonnes,28415.0,A,"Aggregate, may include official, semi-official...",cereal


In [25]:
# Delete the 3 food balance dataframes, which are no longer needed
del food_balance_animal, food_balance_cereals, food_balance_vegetal

In [26]:
# Transform the dataframe from long to wide with pivot_table
food_balance_wide = food_balance_df.pivot_table(
    # Put as index the Columns that we want to keep in the dataframe
    index = ["Country Code", "Item Code", "Country", "Item", "Year", "Unit", "origin"],
    # Select the column to spread from long to wide and the values to sum
    columns = ["Element"], values = ["Value"], aggfunc = "sum")
food_balance_wide.head(1)

,,,,,,,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value,Value
,,,,,,Element,Domestic supply quantity,Export Quantity,Fat supply quantity (g/capita/day),Feed,Food,Food supply (kcal/capita/day),Food supply quantity (kg/capita/yr),Import Quantity,Losses,Other uses,Processing,Production,Protein supply quantity (g/capita/day),Seed,Stock Variation
Country Code,Item Code,Country,Item,Year,Unit,origin,,,,,,,,,,,,,,,
1,2511,Armenia,Wheat and products,2013,1000 tonnes,cereal,,,,,,,,,,,,312.0,,,
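
To see what the long-to-wide step does, here is a small sketch on three toy rows. The first two come from the food_balance_animal preview; the third is made up for illustration.

```python
import pandas as pd

# Toy long-format rows (the "Food" row is invented for this example)
long_df = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "Item": ["Bovine Meat", "Bovine Meat", "Bovine Meat"],
    "Element": ["Production", "Import Quantity", "Food"],
    "Value": [134.0, 6.0, 140.0],
})

# One row per (Country, Item); each Element becomes its own column
wide_df = long_df.pivot_table(
    index=["Country", "Item"], columns="Element", values="Value", aggfunc="sum"
)
print(wide_df)
```

Passing `values` as a string rather than a list yields single-level columns, so this variant would not need the `droplevel` step used in the next cell.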


In [27]:
# Drop the top level ("Value") of the column index
food_balance_wide.columns = food_balance_wide.columns.droplevel(0)

In [28]:
food_balance_wide.head(1)

,,,,,,Element,Domestic supply quantity,Export Quantity,Fat supply quantity (g/capita/day),Feed,Food,Food supply (kcal/capita/day),Food supply quantity (kg/capita/yr),Import Quantity,Losses,Other uses,Processing,Production,Protein supply quantity (g/capita/day),Seed,Stock Variation
Country Code,Item Code,Country,Item,Year,Unit,origin,,,,,,,,,,,,,,,
1,2511,Armenia,Wheat and products,2013,1000 tonnes,cereal,,,,,,,,,,,,312.0,,,


In [29]:
# Reset the index and set the name of the columns axis (Element) to None so that it no longer appears in the header
food_balance = food_balance_wide.reset_index().rename_axis(None, axis = 1)
food_balance.head(1)

,Country Code,Item Code,Country,Item,Year,Unit,origin,Domestic supply quantity,Export Quantity,Fat supply quantity (g/capita/day),...,Food supply (kcal/capita/day),Food supply quantity (kg/capita/yr),Import Quantity,Losses,Other uses,Processing,Production,Protein supply quantity (g/capita/day),Seed,Stock Variation
0,1,2511,Armenia,Wheat and products,2013,1000 tonnes,cereal,,,,...,,,,,,,312.0,,,


In [30]:
# Rename the columns
food_balance.columns = ["country_code", "item_code", "country", "item", "year", "unit", "origin", 
                       "domestic_supply_quantity", "export_quantity", "fat_supply_quantity_gcapitaday",
                       "feed", "food", "food_supply_kcalcapitaday", "food_supply_quantity_kgcapita_yr", 
                       "import_quantity", "losses", "other_uses", "processing", "production", 
                       "protein_supply_quantity_gcapitaday", "seed", "stock_variation"]

In [31]:
# Check the info of the dataframe
food_balance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57905 entries, 0 to 57904
Data columns (total 22 columns):
country_code                          57905 non-null int64
item_code                             57905 non-null int64
country                               57905 non-null object
item                                  57905 non-null object
year                                  57905 non-null int64
unit                                  57905 non-null object
origin                                57905 non-null object
domestic_supply_quantity              15478 non-null float64
export_quantity                       12320 non-null float64
fat_supply_quantity_gcapitaday        11877 non-null float64
feed                                  2757 non-null float64
food                                  14107 non-null float64
food_supply_kcalcapitaday             14333 non-null float64
food_supply_quantity_kgcapita_yr      14107 non-null float64
import_quantity                       14946 non-
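
The positional rename in cell 30 depends on the column order staying the same. As a hedged alternative, the names could be derived from the original labels. `to_snake_case` below is a hypothetical helper; note that it would produce `food_supply_quantity_kgcapitayr`, which differs slightly from the hand-written `food_supply_quantity_kgcapita_yr` used in this notebook.

```python
import re


def to_snake_case(name):
    """Lower-case a label, drop punctuation and join the words with underscores."""
    cleaned = re.sub(r"[^0-9a-z\s]", "", name.lower())
    return re.sub(r"\s+", "_", cleaned.strip())


original = ["Country Code", "Fat supply quantity (g/capita/day)", "Food supply (kcal/capita/day)"]
print([to_snake_case(label) for label in original])

# Usage sketch: food_balance.columns = [to_snake_case(c) for c in food_balance.columns]
```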

In [32]:
# Rename the columns of the population dataframe
population_df.columns = ["country_code", "country", "population"]
population_df.columns

Index(['country_code', 'country', 'population'], dtype='object')

In [33]:
# Check some statistics of the dataframe
population_df.describe()

,country_code,population
count,175.0,175.0
mean,126.72,48079960.0
std,75.168519,178632700.0
min,1.0,54000.0
25%,64.5,2543500.0
50%,121.0,9413000.0
75%,188.5,28881500.0
max,351.0,1416667000.0


In [34]:
# Check the countries with a population higher than 1 billion
population_df[population_df.population > 1000000000]

,country_code,country,population
34,41,"China, mainland",1385567000
73,100,India,1252140000
174,351,China,1416667000


We found our error. There are 2 entries for China: the mainland and the whole country. The population of China, mainland is therefore counted twice, because it is already included in the China entry. We need to exclude the China, mainland row when we calculate the world population.  
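
This kind of overlap can also be searched for directly instead of being spotted by eye. The sketch below is a simple heuristic, not an FAO rule: it flags any name that is the leading part of another entry followed by a comma, as with "China" and "China, mainland".

```python
import pandas as pd

# Toy population table (values taken from the output above)
toy = pd.DataFrame({
    "country_code": [41, 100, 351],
    "country": ["China, mainland", "India", "China"],
    "population": [1385567000, 1252140000, 1416667000],
})

names = toy["country"].tolist()
# Keep the names that prefix another entry, e.g. "China" for "China, mainland"
possible_aggregates = [
    name for name in names
    if any(other != name and other.startswith(name + ",") for other in names)
]
print(possible_aggregates)
```

Any name flagged this way still needs a manual check before a row is dropped, because a shared prefix does not prove that one entry contains the other.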

In [35]:
# Calculate the world population without the "China, mainland" row (country_code 41)
world_population = population_df[population_df.country_code != 41].population.sum()
print("The total number of humans on the planet is : {:,}".format(world_population))

The total number of humans on the planet is : 7,028,426,000


In [36]:
# Remove useless columns
food_security_indicators.drop(columns=["Domain Code", "Domain", "Element Code", "Element", "Item Code",
                                       "Year Code", "Unit", "Flag", "Flag Description", "Note"],
                              inplace=True)
food_security_indicators.head(2)

,Area Code,Area,Item,Year,Value
0,2,Afghanistan,Number of people undernourished (million) (3-y...,2012-2014,7.9
1,2,Afghanistan,Number of people undernourished (million) (3-y...,2013-2015,8.8


In [37]:
# Rename the columns 
food_security_indicators.columns = ["country_code", "country", "item", "year", "value"]