# GLOBAL UNDERNUTRITION STUDY - EXPLORATION  

*BY FURAWA*  

**Table of Contents**  

1. [Data collection](#data_collection)  
2. [Data discovery](#data_discovery)  
3. [Data cleaning](#data_cleaning)  
4. [Computing new variables to lead the analysis](#new_variables)  
5. [Identify major trends](#major_trends)  

In [1]:
# Import all the needed libraries for the project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from glob import glob
import re
%matplotlib inline
pd.set_option('display.max_rows', 20)

<a id='data_collection'></a>
## 1. Data collection  
All the data has been downloaded from the [FAO](http://www.fao.org/faostat/en/#data) website.  
Let us check the files.

In [2]:
# Store the file names in the file_names variable
file_names = glob('files/*.csv')
# Check the file_names list
file_names

['files/food_balance_cereals.csv',
 'files/food_security_indicators.csv',
 'files/food_balance_vegetal.csv',
 'files/food_balance_animal.csv',
 'files/population.csv']

In [3]:
# Loop over the file_names list
for file in file_names:
    # Derive the variable name from the file name (e.g. 'files/population.csv' -> 'population')
    df_name = re.split(r'\W', file)[1]
    # Read the file and bind the dataframe to that name in the global namespace
    globals()[df_name] = pd.read_csv(file)
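
As an alternative to creating variables dynamically, the dataframes can be kept in a dictionary keyed by file name. The sketch below is only an illustration: `load_csv_folder` is a hypothetical helper that is not part of this notebook, and the demo writes two tiny throwaway CSV files to a temporary folder so that it runs on its own.

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd


def load_csv_folder(pattern):
    """Hypothetical helper: read every CSV matching `pattern` into a dict
    keyed by the file name without its extension."""
    return {Path(path).stem: pd.read_csv(path) for path in sorted(glob(pattern))}


# Demo on throwaway files (illustrative content, not the FAO data)
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "population.csv").write_text("Country,Value\nAlbania,3173\n")
    Path(folder, "food_balance_animal.csv").write_text("Country,Value\nAlbania,1\n")
    frames = load_csv_folder(str(Path(folder, "*.csv")))

# Each dataframe is then reached by name, e.g. frames["population"]
print(sorted(frames))
```

With this layout a typo in a name raises a `KeyError` on the dictionary instead of silently creating a new global variable.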

In [4]:
# Check the dataframe info
food_security_indicators.head(2)

Unnamed: 0,Domain Code,Domain,Area Code,Area,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description,Note
0,FS,Suite of Food Security Indicators,2,Afghanistan,6132,Value,210011,Number of people undernourished (million) (3-y...,20122014,2012-2014,millions,7.9,F,FAO estimate,
1,FS,Suite of Food Security Indicators,2,Afghanistan,6132,Value,210011,Number of people undernourished (million) (3-y...,20132015,2013-2015,millions,8.8,F,FAO estimate,


In [5]:
# Check the dataframe info
food_balance_vegetal.head(2)

Unnamed: 0,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,FBS,Food Balance Sheets,2,Afghanistan,5511,Production,2511,Wheat and products,2013,2013,1000 tonnes,5169.0,S,Standardized data
1,FBS,Food Balance Sheets,2,Afghanistan,5611,Import Quantity,2511,Wheat and products,2013,2013,1000 tonnes,1173.0,S,Standardized data


In [6]:
# Check the dataframe info
food_balance_animal.head(2)

Unnamed: 0,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,FBS,Food Balance Sheets,2,Afghanistan,5511,Production,2731,Bovine Meat,2013,2013,1000 tonnes,134.0,S,Standardized data
1,FBS,Food Balance Sheets,2,Afghanistan,5611,Import Quantity,2731,Bovine Meat,2013,2013,1000 tonnes,6.0,S,Standardized data


In [7]:
# Check the dataframe info
food_balance_cereals.head(2)

Unnamed: 0,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,FBS,Food Balance Sheets,2,Afghanistan,5511,Production,2511,Wheat and products,2013,2013,1000 tonnes,5169.0,S,Standardized data
1,FBS,Food Balance Sheets,2,Afghanistan,5611,Import Quantity,2511,Wheat and products,2013,2013,1000 tonnes,1173.0,S,Standardized data


In [8]:
population.head(2)

Unnamed: 0,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,FBS,Food Balance Sheets,2,Afghanistan,511,Total Population - Both sexes,2501,Population,2013,2013,1000 persons,30552,,Official data
1,FBS,Food Balance Sheets,3,Albania,511,Total Population - Both sexes,2501,Population,2013,2013,1000 persons,3173,,Official data


Except for the food_security_indicators dataframe, all the dataframes share the same 14 columns.  
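
The column comparison can also be done programmatically. Here is a minimal sketch, assuming the column names shown in the `head()` outputs above; `fbs_like` and `fs_like` are empty stand-in dataframes introduced only for this illustration.

```python
import pandas as pd

# Column names copied from the head() outputs above
fbs_columns = ["Domain Code", "Domain", "Country Code", "Country", "Element Code",
               "Element", "Item Code", "Item", "Year Code", "Year", "Unit",
               "Value", "Flag", "Flag Description"]
fs_columns = ["Domain Code", "Domain", "Area Code", "Area", "Element Code",
              "Element", "Item Code", "Item", "Year Code", "Year", "Unit",
              "Value", "Flag", "Flag Description", "Note"]

# Empty stand-in dataframes that only carry the column names
fbs_like = pd.DataFrame(columns=fbs_columns)
fs_like = pd.DataFrame(columns=fs_columns)

# Columns present in one dataframe but not in the other
only_in_fs = fs_like.columns.difference(fbs_like.columns).tolist()
only_in_fbs = fbs_like.columns.difference(fs_like.columns).tolist()
print(only_in_fs)
print(only_in_fbs)
```

On the real data the same `columns.difference` call could be applied directly to `food_security_indicators` and any of the food balance dataframes.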

<a id='data_discovery'></a>
## 2. Data discovery

Let us check the primary key of each table and test them.  
We will create a function to find the primary key.  

In [9]:
# Function to find the potential primary keys
def check_potential_primary_key(df) -> None:
    # Loop over the columns of the dataframe
    for column_pk in df.columns:
        # A column is a candidate key when dropping its duplicates keeps every row
        if len(df) == len(df[column_pk].drop_duplicates()):
            # Print all the potential primary keys
            print("{} could be a primary key!".format(column_pk))

Now we can use the function to find the potential primary keys of each dataframe. 

In [10]:
# Check the primary key of population dataframe
check_potential_primary_key(population)

Country Code could be a primary key!
Country could be a primary key!
Value could be a primary key!


We have 3 potential primary keys for the population dataframe, but the best choice is the Country Code variable. The population value is only incidentally unique (two countries could share the same figure), and the country name is a free-text label, so both would be poor keys to query and join on.

In [11]:
# Check the primary key of food balance vegetal
check_potential_primary_key(food_balance_vegetal)

We have no output, which means there is no potential primary key in this dataframe.

In [12]:
# Check the primary key of food balance animal
check_potential_primary_key(food_balance_animal)

Same here: there is no potential primary key in the food balance animal dataframe.

In [13]:
# Check the primary key of food balance cereals
check_potential_primary_key(food_balance_cereals)

In [14]:
# Check the primary key of food security indicators
check_potential_primary_key(food_security_indicators)

The food balance cereals and food security indicators dataframes have no potential primary key either: no single column identifies their rows uniquely.
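
When no single column is unique, a combination of columns may still identify each row. The sketch below shows one way to test that; `is_candidate_key` is a hypothetical helper, and the toy rows are taken from the `head()` outputs above. Whether a given combination is unique on the full dataframes is not checked here.

```python
import pandas as pd


def is_candidate_key(df, columns):
    """Hypothetical helper: True when the combination of `columns`
    identifies every row of `df` uniquely."""
    return not df.duplicated(subset=columns).any()


# Toy rows taken from the head() outputs above
toy = pd.DataFrame({
    "Country Code": [2, 2, 2, 2],
    "Element Code": [5511, 5611, 5511, 5611],
    "Item Code": [2511, 2511, 2731, 2731],
    "Value": [5169.0, 1173.0, 134.0, 6.0],
})

single = is_candidate_key(toy, ["Country Code"])
combined = is_candidate_key(toy, ["Country Code", "Element Code", "Item Code"])
print(single, combined)
```

On these four rows the country code alone repeats, while the triple of country, element and item codes does not.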

Let us create a column with the total population and remove the columns we do not need from the population dataframe.

In [15]:
# Create the population column: take the multiplier (1000) from the Unit column and multiply it by Value
population["population"] = int(population.Unit.str.split(" ")[0][0]) * population.Value
# Keep only the columns needed for the analysis
population_df = population[["Country Code", "Country", "population"]]
# Check the dataframe
population_df.head()

Unnamed: 0,Country Code,Country,population
0,2,Afghanistan,30552000
1,3,Albania,3173000
2,4,Algeria,39208000
3,7,Angola,21472000
4,8,Antigua and Barbuda,90000


Now we can calculate the total number of people covered by the data.

In [16]:
# Calculate the total number of humans
total_population = population_df.population.sum()
print("The total number of humans on the planet is : {:,}".format(total_population))

The total number of humans on the planet is : 8,413,993,000


This result cannot be correct, especially for 2013: even in 2019 the world population was only around 7.7 billion. There must be an error in the data, so we will dig deeper to find it.  
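
One way to start investigating is to look at the rows that contribute most to the total, since an oversized or unexpected entry would show up there. Here is a minimal sketch on the five rows displayed earlier; `sample` is a stand-in for `population_df`, and nothing is assumed here about what the full data would reveal.

```python
import pandas as pd

# Rows copied from the population_df.head() output above
sample = pd.DataFrame({
    "Country Code": [2, 3, 4, 7, 8],
    "Country": ["Afghanistan", "Albania", "Algeria", "Angola", "Antigua and Barbuda"],
    "population": [30552000, 3173000, 39208000, 21472000, 90000],
})

# List the largest contributors to the total first
largest = sample.nlargest(3, "population")
print(largest[["Country", "population"]])

# Share of the sample total carried by each of these rows
share = largest["population"] / sample["population"].sum()
print(share.round(3))
```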

<a id='data_cleaning'></a>
## 3. Data cleaning  

The dataframes are downloaded and loaded, but they are still dirty: they contain useless rows and columns, the anomalies in the population data must be corrected, and the column names must be changed. Let's do some cleaning.  
We start by putting the food balance dataframes into a single dataframe.  

In [17]:
# Create the origin variable in each food balance dataframe to store the food origin
food_balance_animal["origin"] = "animal"
food_balance_cereals["origin"] = "cereal"
food_balance_vegetal["origin"] = "vegetal"

In [18]:
# Concatenate the animal and vegetal dataframes into a single dataframe
# (food_balance_cereals is not included here)
food_balance_df = pd.concat([food_balance_animal, food_balance_vegetal])
# Check the first rows
food_balance_df.head(2)

Unnamed: 0,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description,origin
0,FBS,Food Balance Sheets,2,Afghanistan,5511,Production,2731,Bovine Meat,2013,2013,1000 tonnes,134.0,S,Standardized data,animal
1,FBS,Food Balance Sheets,2,Afghanistan,5611,Import Quantity,2731,Bovine Meat,2013,2013,1000 tonnes,6.0,S,Standardized data,animal


In [19]:
# Check the last rows
food_balance_df.tail(2)

Unnamed: 0,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description,origin
104869,FBS,Food Balance Sheets,351,China,674,Protein supply quantity (g/capita/day),2899,Miscellaneous,2013,2013,g/capita/day,0.01,Fc,Calculated data,vegetal
104870,FBS,Food Balance Sheets,351,China,684,Fat supply quantity (g/capita/day),2899,Miscellaneous,2013,2013,g/capita/day,0.0,Fc,Calculated data,vegetal


In [20]:
# Delete the 2 food balance dataframes that are no longer needed
del food_balance_animal, food_balance_vegetal

In [21]:
# Rename the columns
food_balance_df.rename(columns = {"Country Code":"country_code", "Country":"country", "Element":"element",
                                  "Item Code":"item_code", "Item":"item", "Year":"year","Unit":"unit",
                                  "Value":"value"}, inplace = True)

In [22]:
# Transform the dataframe from long to wide with pivot_table
food_balance_wide = food_balance_df.pivot_table(
    # Use as index the columns that we want to keep in the dataframe
    index = ["country_code", "country", "item_code", "item", "year", "unit", "origin"],
    # Spread the elements into columns and sum their values
    columns = ["element"], values = ["value"], aggfunc = "sum")
# Rename the columns (the names follow the alphabetical order of the elements)
food_balance_wide.columns = ["domestic_supply_quantity", "export_quantity", "fat_supply_quantity_gcapitaday",
                             "feed", "food", "food_supply_kcalcapitaday", "food_supply_quantity_kgcapitayr",
                             "import_quantity", "losses", "other_uses", "processing", "production",
                             "protein_supply_quantity_gcapitaday", "seed", "stock_variation"]

In [23]:
# Reset the index to have normal columns
food_balance = food_balance_wide.reset_index()
# Delete the intermediate dataframes
del food_balance_df, food_balance_wide
# Check the first rows of the dataframe
food_balance.head()

Unnamed: 0,country_code,country,item_code,item,year,unit,origin,domestic_supply_quantity,export_quantity,fat_supply_quantity_gcapitaday,...,food_supply_kcalcapitaday,food_supply_quantity_kgcapitayr,import_quantity,losses,other_uses,processing,production,protein_supply_quantity_gcapitaday,seed,stock_variation
0,1,Armenia,2511,Wheat and products,2013,1000 tonnes,vegetal,554.0,1.0,,...,,,361.0,32.0,0.0,10.0,312.0,,30.0,-118.0
1,1,Armenia,2511,Wheat and products,2013,g/capita/day,vegetal,,,3.6,...,,,,,,,,30.52,,
2,1,Armenia,2511,Wheat and products,2013,kcal/capita/day,vegetal,,,,...,1024.0,,,,,,,,,
3,1,Armenia,2511,Wheat and products,2013,kg,vegetal,,,,...,,130.6,,,,,,,,
4,1,Armenia,2513,Barley and products,2013,1000 tonnes,vegetal,198.0,0.0,,...,,,9.0,15.0,26.0,7.0,189.0,,14.0,0.0
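
The hard-coded list of column names depends on the order in which `pivot_table` returns the elements. A possible alternative derives each name from its element label. The sketch below assumes labels such as "Fat supply quantity (g/capita/day)", as seen in the outputs above; `to_snake_case` is a hypothetical helper, and the two toy rows are copied from the `food_balance_vegetal.head()` output.

```python
import re

import pandas as pd


def to_snake_case(label):
    """Hypothetical helper: lower-case a label, drop punctuation and
    join the words with underscores."""
    cleaned = re.sub(r"[^\w\s]", "", label.lower())
    return re.sub(r"\s+", "_", cleaned.strip())


# Toy long-format rows copied from the head() outputs above
long_df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan"],
    "item": ["Wheat and products", "Wheat and products"],
    "element": ["Production", "Import Quantity"],
    "value": [5169.0, 1173.0],
})

# One column per element, then derive the column names from the element labels
wide = long_df.pivot_table(index=["country", "item"], columns="element",
                           values="value", aggfunc="sum")
wide.columns = [to_snake_case(name) for name in wide.columns]
wide = wide.reset_index()
print(wide.columns.tolist())
```

Passing `values` as a plain string rather than a list keeps the resulting columns on a single level, which is what makes the renaming a one-liner.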


Looking at the element definitions on the [FAO](http://www.fao.org/faostat/en/#data/FBS) website (Definitions and Standards tab), we notice that some of the information carried by these elements is redundant. Let's identify the redundancy with a mathematical formula and illustrate it with wheat in France.


In [24]:
# Create a dataframe with France as country and wheat as item
wheat_france = food_balance.query("country == 'France' and item == 'Wheat and products'")
# Formulas
print("Formula 1 : Domestic supply = Production + Import Quantity + Stock Variation - Export Quantity \n\
Formula 2 : Domestic supply = Food + Feed + Losses + Seed + Processing + Other Uses")

Formula 1 : Domestic supply = Production + Import Quantity + Stock Variation - Export Quantity 
Formula 2 : Domestic supply = Food + Feed + Losses + Seed + Processing + Other Uses


In [93]:
# Apply the formulas to the first row of the wheat france dataframe (unit: 1000 tonnes)
row = wheat_france.iloc[0]

# Formula 1: Production + Import Quantity + Stock Variation - Export Quantity
term_1 = row.production + row.import_quantity + row.stock_variation - row.export_quantity

# Domestic supply quantity as reported by the FAO
term_2 = row.domestic_supply_quantity

# Formula 2: Food + Feed + Losses + Seed + Processing + Other Uses
term_3 = row.food + row.feed + row.losses + row.seed + row.processing + row.other_uses

In [90]:
# Check that all the terms are equal; no output means the check passed
assert term_1 == term_2 == term_3

In [92]:
print("For the wheat in France we have : \n\
Domestic supply quantity = {} ktonnes \n\
Production + Import Quantity + Stock Variation - Export Quantity = {} ktonnes \n\
Food + Feed + Losses + Seed + Processing + Other Uses = {} ktonnes".format(term_2, term_1, term_3))

For the wheat in France we have : 
Domestic supply quantity = 20298.0 ktonnes 
Production + Import Quantity + Stock Variation - Export Quantity = 20298.0 ktonnes 
Food + Feed + Losses + Seed + Processing + Other Uses = 20298.0 ktonnes
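
The same check can be applied to many rows at once. Here is a minimal sketch for Formula 1 only, using the two Armenia rows from the `food_balance.head()` output above (the food and feed columns are hidden in that output, so Formula 2 is left out). It compares with `np.isclose` instead of `==` as a precaution against floating-point rounding; `sample` and `availability` are names introduced only for the illustration.

```python
import numpy as np
import pandas as pd

# Rows copied from the food_balance.head() output above (unit: 1000 tonnes)
sample = pd.DataFrame({
    "item": ["Wheat and products", "Barley and products"],
    "domestic_supply_quantity": [554.0, 198.0],
    "production": [312.0, 189.0],
    "import_quantity": [361.0, 9.0],
    "stock_variation": [-118.0, 0.0],
    "export_quantity": [1.0, 0.0],
})

# Right-hand side of Formula 1 for every row at once
availability = (sample["production"] + sample["import_quantity"]
                + sample["stock_variation"] - sample["export_quantity"])

# Compare with a tolerance rather than with == to be safe with floats
sample["formula_1_holds"] = np.isclose(availability, sample["domestic_supply_quantity"])
print(sample[["item", "formula_1_holds"]])
```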
