# GLOBAL UNDERNUTRITION STUDY - EXPLORATION  

*BY FURAWA*  

**Table of Contents**  

1. [Data collection](#data_collection)  
2. [Data discovery](#data_discovery)  
3. [Data cleaning](#data_cleaning)  
4. [Computing new variables to lead the analysis](#new_variables)  
5. [Identify major trends](#major_trends)  

In [1]:
# Import all the needed libraries for the project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from glob import glob
import re
from pandas.api.types import CategoricalDtype

%matplotlib inline
pd.set_option('display.max_rows', 20)

<a id='data_collection'></a>
## 1. Data collection  
All the data has been downloaded from the [FAO](http://www.fao.org/faostat/en/#data) website.  
Let us check the files.

In [2]:
# Store the file names in the file_names variable
file_names = glob('files/*.csv')
# Check the file_names list
file_names

['files/food_balance_cereals.csv',
 'files/food_security_indicators.csv',
 'files/food_balance_vegetal.csv',
 'files/food_balance_animal.csv',
 'files/population.csv']

In [3]:
# Loop over the file_names list
for file in file_names:
    # Read each file and assign it to a variable named after the file (e.g. food_balance_cereals)
    exec(re.split(r'\. |\W', file)[1] + " = pd.read_csv(file)")
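`exec` works here, but it builds variable names from strings, which makes the code harder to trace. A minimal alternative sketch, assuming the same `files/*.csv` layout, derives each dataset name from the file stem and keeps the mapping in a dictionary. The name `paths_by_name` is hypothetical and not used elsewhere in this notebook.

```python
from pathlib import Path

# File paths as listed by glob() above
file_names = [
    "files/food_balance_cereals.csv",
    "files/food_security_indicators.csv",
    "files/food_balance_vegetal.csv",
    "files/food_balance_animal.csv",
    "files/population.csv",
]

# Map each dataset name (the file stem) to its path
paths_by_name = {Path(file).stem: file for file in file_names}

for name, path in paths_by_name.items():
    print(f"{name} <- {path}")
```

With pandas available, `{name: pd.read_csv(path) for name, path in paths_by_name.items()}` would then give one dataframe per name without `exec`.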

In [4]:
# Preview the first rows of the dataframe
food_security_indicators.head(2)

Unnamed: 0,Domain Code,Domain,Area Code,Area,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description,Note
0,FS,Suite of Food Security Indicators,2,Afghanistan,6132,Value,210011,Number of people undernourished (million) (3-y...,20122014,2012-2014,millions,7.9,F,FAO estimate,
1,FS,Suite of Food Security Indicators,2,Afghanistan,6132,Value,210011,Number of people undernourished (million) (3-y...,20132015,2013-2015,millions,8.8,F,FAO estimate,


In [5]:
# Preview the first rows of the dataframe
food_balance_vegetal.head(2)

Unnamed: 0,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,FBS,Food Balance Sheets,2,Afghanistan,5511,Production,2511,Wheat and products,2013,2013,1000 tonnes,5169.0,S,Standardized data
1,FBS,Food Balance Sheets,2,Afghanistan,5611,Import Quantity,2511,Wheat and products,2013,2013,1000 tonnes,1173.0,S,Standardized data


In [6]:
# Preview the first rows of the dataframe
food_balance_animal.head(2)

Unnamed: 0,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,FBS,Food Balance Sheets,2,Afghanistan,5511,Production,2731,Bovine Meat,2013,2013,1000 tonnes,134.0,S,Standardized data
1,FBS,Food Balance Sheets,2,Afghanistan,5611,Import Quantity,2731,Bovine Meat,2013,2013,1000 tonnes,6.0,S,Standardized data


In [7]:
# Preview the first rows of the dataframe
food_balance_cereals.head(2)

Unnamed: 0,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,FBS,Food Balance Sheets,2,Afghanistan,5511,Production,2511,Wheat and products,2013,2013,1000 tonnes,5169.0,S,Standardized data
1,FBS,Food Balance Sheets,2,Afghanistan,5611,Import Quantity,2511,Wheat and products,2013,2013,1000 tonnes,1173.0,S,Standardized data


In [8]:
population.head(2)

Unnamed: 0,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,FBS,Food Balance Sheets,2,Afghanistan,511,Total Population - Both sexes,2501,Population,2013,2013,1000 persons,30552,,Official data
1,FBS,Food Balance Sheets,3,Albania,511,Total Population - Both sexes,2501,Population,2013,2013,1000 persons,3173,,Official data


Except for the food_security_indicators dataframe, all the dataframes share the same 14 columns.  

<a id='data_discovery'></a>
## 2. Data Discovery

Let us check the primary key of each table and test them.  
We will create a function to find the primary key.  

In [9]:
# Function to find the potential primary keys
def check_potential_primary_key(df) -> None:
    # Loop over the columns of the dataframe
    for column_pk in df.columns:
        # A column is a candidate key when it still has as many values as the
        # dataframe has rows after removing its duplicates
        if len(df) == len(df[column_pk].drop_duplicates()):
            # Print every potential primary key; other columns produce no output
            print("{} could be a primary key!".format(column_pk))

Now we can use the function to find the potential primary keys of each dataframe. 

In [10]:
# Check the primary key of population dataframe
check_potential_primary_key(population)

Country Code could be a primary key!
Country could be a primary key!
Value could be a primary key!


We have 3 potential primary keys for the population dataframe, but the best choice here is the Country Code variable. Using the population value or the country name as primary key would not be a good idea, as they would be harder to query.

In [11]:
# Check the primary key of food balance vegetal
check_potential_primary_key(food_balance_vegetal)

We have no output, which means there is no potential primary key in this dataframe.

In [12]:
# Check the primary key of food balance animal
check_potential_primary_key(food_balance_animal)

Same here, there is no potential primary key in the food balance animal dataframe.

In [13]:
# Check the primary key of food balance cereals
check_potential_primary_key(food_balance_cereals)

In [14]:
# Check the primary key of food security indicators
check_potential_primary_key(food_security_indicators)

The food balance cereals and food security indicators dataframes have no potential primary key either.
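A table without a single-column key can still have a composite key. The sketch below tests a combination of columns on a small hypothetical sample shaped like the food balance data. The helper name `is_unique_key` and the sample rows are assumptions made for illustration, not part of the FAO files.

```python
import pandas as pd

def is_unique_key(df: pd.DataFrame, columns: list) -> bool:
    """Return True when the combination of columns identifies each row uniquely."""
    return not df.duplicated(subset=columns).any()

# Hypothetical sample mimicking the long food balance layout
sample = pd.DataFrame({
    "Country Code": [2, 2, 2, 3],
    "Item Code": [2511, 2511, 2731, 2511],
    "Element Code": [5511, 5611, 5511, 5511],
})

print(is_unique_key(sample, ["Country Code"]))                               # False
print(is_unique_key(sample, ["Country Code", "Item Code", "Element Code"]))  # True
```

Whether the same combination is unique in the real dataframes would have to be checked by calling the helper on them.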

Let us create a column with the total population and remove the useless columns from the population dataframe.

In [15]:
# Create the population column: retrieve the 1000 from the Unit column ("1000 persons") and multiply it by the Value column
population["population"] = int(population.Unit.str.split(" ")[0][0]) * population.Value
# Remove some useless columns
population_df = population.drop(population.columns.difference(["Country Code", "Country", "population"]),
                                axis =1)
# Check the dataframe
population_df.head()

Unnamed: 0,Country Code,Country,population
0,2,Afghanistan,30552000
1,3,Albania,3173000
2,4,Algeria,39208000
3,7,Angola,21472000
4,8,Antigua and Barbuda,90000


Now we can calculate the total number of humans involved.

In [16]:
# Calculate the total number of humans 
total_population = population_df.population.sum()
print("The total number of humans on the planet is : {:,}".format(total_population))

The total number of humans on the planet is : 8,413,993,000


This result cannot be correct, especially for the 2013 world population: even in 2019 the world population is only around 7.7 billion. There must be an error, so we will dig deeper to find it.  

<a id='data_cleaning'></a>
## 3. Data Cleaning  

The dataframes are loaded but still dirty: there are useless rows and columns, the anomalies in the population data must be corrected, and the column names must be changed. Let's do some cleaning.  
We start by combining the animal and vegetal food balance dataframes into a single dataframe.  
- **Food balance**

In [17]:
# Create the origin variable in each balance food dataframe to store the food origin
food_balance_animal["origin"] = "animal"
food_balance_cereals["origin"] = "cereal"
food_balance_vegetal["origin"] = "vegetal"

In [18]:
# Concatenate the animal and vegetal dataframes into a single dataframe
food_balance_df = pd.concat([food_balance_animal, food_balance_vegetal])
# Check the shape
food_balance_df.shape

(142037, 15)

In [19]:
# Select the rows where Unit equals 1000 tonnes and convert their values to kg (1000 tonnes = 1,000,000 kg)
food_balance_df.loc[food_balance_df.Unit == "1000 tonnes", "Value"] *= 1000000

In [20]:
food_balance_df

Unnamed: 0,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description,origin
0,FBS,Food Balance Sheets,2,Afghanistan,5511,Production,2731,Bovine Meat,2013,2013,1000 tonnes,1.340000e+08,S,Standardized data,animal
1,FBS,Food Balance Sheets,2,Afghanistan,5611,Import Quantity,2731,Bovine Meat,2013,2013,1000 tonnes,6.000000e+06,S,Standardized data,animal
2,FBS,Food Balance Sheets,2,Afghanistan,5301,Domestic supply quantity,2731,Bovine Meat,2013,2013,1000 tonnes,1.400000e+08,S,Standardized data,animal
3,FBS,Food Balance Sheets,2,Afghanistan,5142,Food,2731,Bovine Meat,2013,2013,1000 tonnes,1.400000e+08,S,Standardized data,animal
4,FBS,Food Balance Sheets,2,Afghanistan,645,Food supply quantity (kg/capita/yr),2731,Bovine Meat,2013,2013,kg,4.590000e+00,Fc,Calculated data,animal
5,FBS,Food Balance Sheets,2,Afghanistan,664,Food supply (kcal/capita/day),2731,Bovine Meat,2013,2013,kcal/capita/day,2.700000e+01,Fc,Calculated data,animal
6,FBS,Food Balance Sheets,2,Afghanistan,674,Protein supply quantity (g/capita/day),2731,Bovine Meat,2013,2013,g/capita/day,1.890000e+00,Fc,Calculated data,animal
7,FBS,Food Balance Sheets,2,Afghanistan,684,Fat supply quantity (g/capita/day),2731,Bovine Meat,2013,2013,g/capita/day,2.100000e+00,Fc,Calculated data,animal
8,FBS,Food Balance Sheets,2,Afghanistan,5511,Production,2732,Mutton & Goat Meat,2013,2013,1000 tonnes,1.500000e+08,S,Standardized data,animal
9,FBS,Food Balance Sheets,2,Afghanistan,5301,Domestic supply quantity,2732,Mutton & Goat Meat,2013,2013,1000 tonnes,1.500000e+08,S,Standardized data,animal


In [21]:
# Delete the 2 food balance dataframes that are no longer needed
del food_balance_animal, food_balance_vegetal

In [22]:
# Rename the columns
food_balance_df.rename(columns = {"Country Code":"country_code", "Country":"country", "Element":"element",
                                  "Item Code":"item_code", "Item":"item", "Year":"year",
                                  "Value":"value"}, inplace = True)

In [23]:
# Transform the dataframe from long to wide with pivot_table
food_balance_wide = food_balance_df.pivot_table(
    # Use as index the columns that we want to keep in the dataframe
    index = ["country_code", "country", "item_code", "item", "year", "origin"],
    # Spread the elements into columns and sum their values
    columns = ["element"], values = ["value"], aggfunc = "sum")
# Rename the columns (pivot_table returns the elements in sorted order)
food_balance_wide.columns = ["domestic_supply_quantity", "export_quantity", "fat_supply_quantity_gcapitaday",
                             "feed", "food", "food_supply_kcalcapitaday", "food_supply_quantity_kgcapitayr",
                             "import_quantity", "losses", "other_uses", "processing", "production",
                             "protein_supply_quantity_gcapitaday", "seed", "stock_variation"]

In [24]:
# Reset the index to have normal columns
food_balance = food_balance_wide.reset_index()
# Delete the dataframes that are no longer needed
del food_balance_df, food_balance_wide
# Check the first rows of the dataframe
food_balance.head()

Unnamed: 0,country_code,country,item_code,item,year,origin,domestic_supply_quantity,export_quantity,fat_supply_quantity_gcapitaday,feed,...,food_supply_kcalcapitaday,food_supply_quantity_kgcapitayr,import_quantity,losses,other_uses,processing,production,protein_supply_quantity_gcapitaday,seed,stock_variation
0,1,Armenia,2511,Wheat and products,2013,vegetal,554000000.0,1000000.0,3.6,93000000.0,...,1024.0,130.6,361000000.0,32000000.0,0.0,10000000.0,312000000.0,30.52,30000000.0,-118000000.0
1,1,Armenia,2513,Barley and products,2013,vegetal,198000000.0,0.0,0.0,137000000.0,...,0.0,0.0,9000000.0,15000000.0,26000000.0,7000000.0,189000000.0,0.0,14000000.0,0.0
2,1,Armenia,2514,Maize and products,2013,vegetal,102000000.0,,,96000000.0,...,0.0,0.03,82000000.0,7000000.0,,,21000000.0,0.01,0.0,
3,1,Armenia,2515,Rye and products,2013,vegetal,1000000.0,,0.0,1000000.0,...,1.0,0.12,0.0,0.0,,,1000000.0,0.02,0.0,0.0
4,1,Armenia,2516,Oats,2013,vegetal,6000000.0,,0.03,4000000.0,...,2.0,0.37,1000000.0,0.0,,,5000000.0,0.09,0.0,
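The positional renaming used after the pivot relies on the order in which the pivoted columns come out. A less fragile option, sketched here under the assumption that the target names are simply a snake_case version of the FAO element labels, is to derive each name from its label. The function `to_snake_case` is a hypothetical helper; the labels are those visible in the food balance preview above.

```python
import re

def to_snake_case(label: str) -> str:
    """Lowercase a label, drop punctuation and join the words with underscores."""
    cleaned = re.sub(r"[^\w\s]", "", label.lower())
    return re.sub(r"\s+", "_", cleaned.strip())

# Element labels taken from the food balance preview above
elements = [
    "Production",
    "Import Quantity",
    "Domestic supply quantity",
    "Food supply (kcal/capita/day)",
    "Protein supply quantity (g/capita/day)",
]

for element in elements:
    print(element, "->", to_snake_case(element))
```

Once the pivoted columns are flat (for example by passing `values = "value"` rather than a list), the helper could be applied with `DataFrame.rename(columns = to_snake_case)`.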


Looking at the definitions of the elements on the [FAO](http://www.fao.org/faostat/en/#data/FBS) website (Definitions and Standards tab), we notice that some of the information carried by those elements is redundant. Let's identify the redundancy with a mathematical formula and give an example with wheat in France.


In [25]:
# Create a dataframe with France as country and wheat as item
wheat_france = food_balance.query("country == 'France' and item == 'Wheat and products'")
# Formulas
print("Formula 1 : Domestic supply = Production + Import Quantity + Stock Variation - Export Quantity \n\
Formula 2 : Domestic supply = Food + Feed + Losses + Seed + Processing + Other Uses")

Formula 1 : Domestic supply = Production + Import Quantity + Stock Variation - Export Quantity 
Formula 2 : Domestic supply = Food + Feed + Losses + Seed + Processing + Other Uses


In [26]:
# Apply the formula in the wheat france dataframe
term_1 = (wheat_france[:1].production + wheat_france[:1].import_quantity + wheat_france[:1].stock_variation \
         - wheat_france[:1].export_quantity).values[0]

term_2 = wheat_france[:1].domestic_supply_quantity.values[0]

term_3 = (wheat_france[:1].food + wheat_france[:1].feed + wheat_france[:1].losses + wheat_france[:1].seed + \
         wheat_france[:1].processing + wheat_france[:1].other_uses).values[0]

In [27]:
# Check that all the terms are equal; no output means the check passed
assert term_1 == term_2 == term_3

In [28]:
print("For the wheat in France we have : \n\
Domestic supply quantity = {:,} kg \n\
Production + Import Quantity + Stock Variation - Export Quantity = {:,} kg \n\
Food + Feed + Losses + Seed + Processing + Other Uses = {:,} kg".format(term_2, term_1, term_3))

For the wheat in France we have : 
Domestic supply quantity = 20,298,000,000.0 kg 
Production + Import Quantity + Stock Variation - Export Quantity = 20,298,000,000.0 kg 
Food + Feed + Losses + Seed + Processing + Other Uses = 20,298,000,000.0 kg
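Exact equality between floats can fail because of rounding, so a tolerance-based comparison is a safer way to test the balance. The sketch below applies Formula 1 to the Afghanistan "Wheat and products" row that appears in the merged dataframe of section 4. Treating the missing export quantity as zero is an assumption made for this example.

```python
import math

# Afghanistan, "Wheat and products", 2013, values in kg
production = 5_169_000_000.0
import_quantity = 1_173_000_000.0
stock_variation = -350_000_000.0
export_quantity = 0.0  # assumption: the missing (NaN) export is treated as zero
domestic_supply_quantity = 5_992_000_000.0

# Formula 1: Domestic supply = Production + Import + Stock Variation - Export
computed = production + import_quantity + stock_variation - export_quantity
print(math.isclose(computed, domestic_supply_quantity, rel_tol=1e-9))  # True
```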


- **Food Security Indicators**

In [29]:
food_security_indicators.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1020 entries, 0 to 1019
Data columns (total 15 columns):
Domain Code         1020 non-null object
Domain              1020 non-null object
Area Code           1020 non-null int64
Area                1020 non-null object
Element Code        1020 non-null int64
Element             1020 non-null object
Item Code           1020 non-null int64
Item                1020 non-null object
Year Code           1020 non-null int64
Year                1020 non-null object
Unit                1020 non-null object
Value               605 non-null object
Flag                1020 non-null object
Flag Description    1020 non-null object
Note                0 non-null float64
dtypes: float64(1), int64(4), object(10)
memory usage: 119.7+ KB


- The Value column has only 605 non-null rows out of 1020, so there are many null values  
- There are many useless columns to remove  
- The column names need to be changed
- Value must be a float, not a string  
- Year must be a categorical variable, not a string 
- The value "<0.1" must be replaced by a number  
- The values are expressed in millions and must be converted to numbers of people

Let us make all those changes.
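The value-related steps listed above can be tried on a toy series first. This is a minimal sketch on hypothetical sample strings, using the same 0.09 replacement for the censored "<0.1" value that the notebook adopts; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical raw values, expressed in millions and stored as strings
raw = pd.Series(["7.9", "<0.1", None, "121.4"])

cleaned = (
    raw.dropna()                                  # drop the missing values
       .str.replace("<0.1", "0.09", regex=False)  # replace the censored value
       .astype("float64")                         # convert strings to floats
    * 1_000_000                                   # millions to numbers of people
)
print(cleaned.tolist())
```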

In [30]:
food_security_indicators[food_security_indicators.Value.isnull()][:5]

Unnamed: 0,Domain Code,Domain,Area Code,Area,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description,Note
15,FS,Suite of Food Security Indicators,5,American Samoa,6132,Value,210011,Number of people undernourished (million) (3-y...,20122014,2012-2014,millions,,NV,Data not available,
16,FS,Suite of Food Security Indicators,5,American Samoa,6132,Value,210011,Number of people undernourished (million) (3-y...,20132015,2013-2015,millions,,NV,Data not available,
17,FS,Suite of Food Security Indicators,5,American Samoa,6132,Value,210011,Number of people undernourished (million) (3-y...,20142016,2014-2016,millions,,NV,Data not available,
18,FS,Suite of Food Security Indicators,5,American Samoa,6132,Value,210011,Number of people undernourished (million) (3-y...,20152017,2015-2017,millions,,NV,Data not available,
19,FS,Suite of Food Security Indicators,5,American Samoa,6132,Value,210011,Number of people undernourished (million) (3-y...,20162018,2016-2018,millions,,NV,Data not available,


All these rows with NaN values are useless. We will remove them. 

In [31]:
# Remove all the rows with a NaN value from the dataframe
food_security_indicators = food_security_indicators[food_security_indicators.Value.notnull()]

In [32]:
# Check that the Value column has no more NaN values, no output means it is correct
assert not food_security_indicators.Value.isnull().any()

Now we can rename the columns and keep only those that we need for the analysis.

In [33]:
# Select the columns that we need
indicators_df = food_security_indicators.loc[:, ["Area Code", "Area", "Year", "Value"]]
# Change the columns name
indicators_df.columns = ["country_code", "country", "year", "value"]
# Turn the year variable into an ordered categorical type
category_type = CategoricalDtype(
    categories = ["2012-2014", "2013-2015", "2014-2016", "2015-2017", "2016-2018"],
    ordered = True)
indicators_df.year = indicators_df.year.astype(category_type)
# Label each 3-year range with its middle year
indicators_df.year = indicators_df.year.cat.rename_categories([2013, 2014, 2015, 2016, 2017])
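The new category labels are the middle years of the 3-year ranges. As a small sketch, the mapping can be computed rather than typed by hand; `range_midpoint` is a hypothetical helper name.

```python
def range_midpoint(year_range: str) -> int:
    """Return the middle year of a 'YYYY-YYYY' range, e.g. '2012-2014' -> 2013."""
    start, end = (int(part) for part in year_range.split("-"))
    return (start + end) // 2

ranges = ["2012-2014", "2013-2015", "2014-2016", "2015-2017", "2016-2018"]
print([range_midpoint(r) for r in ranges])  # [2013, 2014, 2015, 2016, 2017]
```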

In [34]:
# Assert that the data type is correct
assert indicators_df.year.dtype == "category"

In [35]:
# Replace the <0.1 value with 0.09 
indicators_df.value = indicators_df.value.str.replace("<0.1", "0.09", regex = False)

In [36]:
# Check that there is no <0.1 value anymore
assert not (indicators_df.value == "<0.1").any()

In [37]:
# Change the data type of the value columns from string to float
indicators_df.value = indicators_df.value.astype("float64")

In [38]:
# Assert that the change was applied
assert indicators_df.value.dtype == "float64"

In [39]:
indicators_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 605 entries, 0 to 1019
Data columns (total 4 columns):
country_code    605 non-null int64
country         605 non-null object
year            605 non-null category
value           605 non-null float64
dtypes: category(1), float64(1), int64(1), object(1)
memory usage: 19.7+ KB


In [40]:
indicators_df.sample(3)

Unnamed: 0,country_code,country,year,value
324,209,Eswatini,2017,0.3
708,166,Panama,2016,0.4
204,41,"China, mainland",2017,121.4


In [41]:
# The values are expressed in millions: convert them to numbers of people
indicators_df.value = indicators_df.value * 1000000
indicators_df.head(2)

Unnamed: 0,country_code,country,year,value
0,2,Afghanistan,2013,7900000.0
1,2,Afghanistan,2014,8800000.0


- **Population**  

Only one row has a flag. Let's see what it is about.

In [42]:
population[population.Flag.notnull()]

Unnamed: 0,Domain Code,Domain,Country Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value,Flag,Flag Description,population
174,FBS,Food Balance Sheets,351,China,511,Total Population - Both sexes,2501,Population,2013,2013,1000 persons,1416667,A,"Aggregate, may include official, semi-official...",1416667000


We see here that the "China" entry is an aggregate value. Let's check whether there are other entries containing China.

In [43]:
# Find all the entries with china
population_df[population_df.Country.str.contains("China")]

Unnamed: 0,Country Code,Country,population
32,96,"China, Hong Kong SAR",7204000
33,128,"China, Macao SAR",566000
34,41,"China, mainland",1385567000
35,214,"China, Taiwan Province of",23330000
174,351,China,1416667000


In [44]:
# Sum the first four values (the four components listed above)
population_df.population.iloc[32:36].sum()

1416667000

We can see that "China, Hong Kong SAR", "China, Macao SAR", "China, mainland" and "China, Taiwan Province of" are the four components of the aggregate "China" entry: the sum of their populations equals the population of the aggregate.
We will therefore remove the aggregate "China" row before computing the total number of humans on the planet.
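The double counting can be verified with plain arithmetic on the figures printed above. This sketch only restates numbers already shown in the notebook outputs.

```python
# Populations from the table above
china_parts = {
    "China, Hong Kong SAR": 7_204_000,
    "China, Macao SAR": 566_000,
    "China, mainland": 1_385_567_000,
    "China, Taiwan Province of": 23_330_000,
}
china_aggregate = 1_416_667_000
raw_total = 8_413_993_000  # total computed before cleaning

# The aggregate row equals the sum of its four components
assert sum(china_parts.values()) == china_aggregate
# Removing the aggregate gives the corrected world total
print("{:,}".format(raw_total - china_aggregate))  # 6,997,326,000
```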

In [45]:
# Remove China from the dataframe
population_df = population_df[population_df.Country != "China"]

In [46]:
# Compute the world population
world_population = population_df.population.sum()
print("The total number of humans on the planet in 2013 is : {:,}".format(world_population))

The total number of humans on the planet in 2013 is : 6,997,326,000


In [47]:
# Rename the columns of the dataframe
population_df.columns = ["country_code", "country", "population"]

In [48]:
population_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 174 entries, 0 to 173
Data columns (total 3 columns):
country_code    174 non-null int64
country         174 non-null object
population      174 non-null int64
dtypes: int64(2), object(1)
memory usage: 5.4+ KB


<a id=new_variables></a>
## 4. Computing New Variables To Lead the Analysis 

All the dataframes are clean, so we can compute some new variables for the analysis.

- **food_supply_kcal (food supply expressed in kcal)**

In [49]:
# Create a dataframe where we join the population and the food_balance dataframes together 
food_balance_full = pd.merge(population_df, food_balance, how = "left", on = ["country", "country_code"])
food_balance_full.shape

(15605, 22)

In [50]:
# Compute food_supply_kcal by multiplying by the population of each country and by 365 days
food_balance_full["food_supply_kcal"] = food_balance_full.food_supply_kcalcapitaday * food_balance_full.population * 365
food_balance_full.head(2)

Unnamed: 0,country_code,country,population,item_code,item,year,origin,domestic_supply_quantity,export_quantity,fat_supply_quantity_gcapitaday,...,food_supply_quantity_kgcapitayr,import_quantity,losses,other_uses,processing,production,protein_supply_quantity_gcapitaday,seed,stock_variation,food_supply_kcal
0,2,Afghanistan,30552000,2511,Wheat and products,2013,vegetal,5992000000.0,,4.69,...,160.23,1173000000.0,775000000.0,,,5169000000.0,36.91,322000000.0,-350000000.0,15266380000000.0
1,2,Afghanistan,30552000,2513,Barley and products,2013,vegetal,524000000.0,,0.24,...,2.92,10000000.0,52000000.0,,,514000000.0,0.79,22000000.0,0.0,289938500000.0


- **food_supply_kgprotein (food supply expressed in kg of protein)**

In [51]:
# Compute food_supply_kgprotein, dividing by 1000 to have it in kg then multiplying by 365 for 1 year
food_balance_full["food_supply_kgprotein"] = (food_balance_full.protein_supply_quantity_gcapitaday / 1000) * food_balance_full.population * 365
food_balance_full.head(2)

Unnamed: 0,country_code,country,population,item_code,item,year,origin,domestic_supply_quantity,export_quantity,fat_supply_quantity_gcapitaday,...,import_quantity,losses,other_uses,processing,production,protein_supply_quantity_gcapitaday,seed,stock_variation,food_supply_kcal,food_supply_kgprotein
0,2,Afghanistan,30552000,2511,Wheat and products,2013,vegetal,5992000000.0,,4.69,...,1173000000.0,775000000.0,,,5169000000.0,36.91,322000000.0,-350000000.0,15266380000000.0,411601126.8
1,2,Afghanistan,30552000,2513,Barley and products,2013,vegetal,524000000.0,,0.24,...,10000000.0,52000000.0,,,514000000.0,0.79,22000000.0,0.0,289938500000.0,8809669.2
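As a quick check of the conversion, the first row can be recomputed by hand. The sketch uses only the Afghanistan wheat figures shown in the table above.

```python
import math

population = 30_552_000        # Afghanistan, 2013
protein_g_capita_day = 36.91   # wheat and products, g/capita/day

# g/capita/day -> kg/capita/day -> kg for the whole country over one year
food_supply_kgprotein = protein_g_capita_day / 1000 * population * 365
print(round(food_supply_kgprotein, 1))
print(math.isclose(food_supply_kgprotein, 411_601_126.8, rel_tol=1e-9))  # True
```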


- **food_supply_kg (food supply expressed in kg)**

In [52]:
# food is already expressed in kg (converted from 1000 tonnes during cleaning), so we simply copy it
food_balance_full["food_supply_kg"] = food_balance_full.food
food_balance_full.head(2)

Unnamed: 0,country_code,country,population,item_code,item,year,origin,domestic_supply_quantity,export_quantity,fat_supply_quantity_gcapitaday,...,losses,other_uses,processing,production,protein_supply_quantity_gcapitaday,seed,stock_variation,food_supply_kcal,food_supply_kgprotein,food_supply_kg
0,2,Afghanistan,30552000,2511,Wheat and products,2013,vegetal,5992000000.0,,4.69,...,775000000.0,,,5169000000.0,36.91,322000000.0,-350000000.0,15266380000000.0,411601126.8,4895000000.0
1,2,Afghanistan,30552000,2513,Barley and products,2013,vegetal,524000000.0,,0.24,...,52000000.0,,,514000000.0,0.79,22000000.0,0.0,289938500000.0,8809669.2,89000000.0


- **ratio_kcalkg (energy-to-weight ratio of each item)**  
This variable is computed from food_supply_kcal and food_supply_kg.

In [53]:
# Compute ratio_kcalkg (food_supply_kcal / food_supply_kg)  
food_balance_full["ratio_kcalkg"] = round(food_balance_full.food_supply_kcal / food_balance_full.food_supply_kg,2)
food_balance_full.ratio_kcalkg = food_balance_full.ratio_kcalkg.replace((np.inf, -np.inf), 0)
food_balance_full.head(2)

Unnamed: 0,country_code,country,population,item_code,item,year,origin,domestic_supply_quantity,export_quantity,fat_supply_quantity_gcapitaday,...,other_uses,processing,production,protein_supply_quantity_gcapitaday,seed,stock_variation,food_supply_kcal,food_supply_kgprotein,food_supply_kg,ratio_kcalkg
0,2,Afghanistan,30552000,2511,Wheat and products,2013,vegetal,5992000000.0,,4.69,...,,,5169000000.0,36.91,322000000.0,-350000000.0,15266380000000.0,411601126.8,4895000000.0,3118.77
1,2,Afghanistan,30552000,2513,Barley and products,2013,vegetal,524000000.0,,0.24,...,,,514000000.0,0.79,22000000.0,0.0,289938500000.0,8809669.2,89000000.0,3257.74


In [54]:
# Check the ratio for eggs in Italy
food_balance_full.query("item == 'Eggs' and country == 'Italy'").ratio_kcalkg

7210    1422.1
Name: ratio_kcalkg, dtype: float64

- **protein_percentage (protein percentage of each item)**

In [55]:
food_balance_full["protein_percentage"] = round(food_balance_full.food_supply_kgprotein / food_balance_full.food_supply_kg * 100,1)
food_balance_full.protein_percentage = food_balance_full.protein_percentage.replace((np.inf, -np.inf), 0)
food_balance_full.head(2)

Unnamed: 0,country_code,country,population,item_code,item,year,origin,domestic_supply_quantity,export_quantity,fat_supply_quantity_gcapitaday,...,processing,production,protein_supply_quantity_gcapitaday,seed,stock_variation,food_supply_kcal,food_supply_kgprotein,food_supply_kg,ratio_kcalkg,protein_percentage
0,2,Afghanistan,30552000,2511,Wheat and products,2013,vegetal,5992000000.0,,4.69,...,,5169000000.0,36.91,322000000.0,-350000000.0,15266380000000.0,411601126.8,4895000000.0,3118.77,8.4
1,2,Afghanistan,30552000,2513,Barley and products,2013,vegetal,524000000.0,,0.24,...,,514000000.0,0.79,22000000.0,0.0,289938500000.0,8809669.2,89000000.0,3257.74,9.9


In [56]:
food_balance_full.query("item == 'Eggs' and country == 'Italy'").protein_percentage

7210    11.4
Name: protein_percentage, dtype: float64

According to [Wikipedia](https://en.wikipedia.org/wiki/Egg_as_food#Nutritional_value), 50 g of egg provides approximately 70 calories and 6 g of protein.   
For Italy we get a protein content of 11.4%, which means that 50 g of egg contains $50 \times 11.4 / 100 = 5.7$ g of protein, close to the Wikipedia value.
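The same sanity check can be run for energy as well as protein. This sketch reuses the two Italian egg values printed above and assumes a 50 g egg, as in the Wikipedia figures.

```python
ratio_kcalkg = 1422.1       # Italy, eggs (kcal per kg)
protein_percentage = 11.4   # Italy, eggs (% of weight)
egg_weight_kg = 0.05        # one 50 g egg

kcal_per_egg = ratio_kcalkg * egg_weight_kg
protein_g_per_egg = egg_weight_kg * 1000 * protein_percentage / 100
print(round(kcal_per_egg, 1), round(protein_g_per_egg, 1))  # 71.1 5.7
```

Both figures are close to the reference values of about 70 calories and 6 g of protein.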

- **dom_sup_kcal (global domestic supply in kcal)**

In [57]:
# Compute dom_sup_kcal 
food_balance_full["dom_sup_kcal"] = food_balance_full.domestic_supply_quantity * food_balance_full.ratio_kcalkg
food_balance_full.head(2)

Unnamed: 0,country_code,country,population,item_code,item,year,origin,domestic_supply_quantity,export_quantity,fat_supply_quantity_gcapitaday,...,production,protein_supply_quantity_gcapitaday,seed,stock_variation,food_supply_kcal,food_supply_kgprotein,food_supply_kg,ratio_kcalkg,protein_percentage,dom_sup_kcal
0,2,Afghanistan,30552000,2511,Wheat and products,2013,vegetal,5992000000.0,,4.69,...,5169000000.0,36.91,322000000.0,-350000000.0,15266380000000.0,411601126.8,4895000000.0,3118.77,8.4,18687670000000.0
1,2,Afghanistan,30552000,2513,Barley and products,2013,vegetal,524000000.0,,0.24,...,514000000.0,0.79,22000000.0,0.0,289938500000.0,8809669.2,89000000.0,3257.74,9.9,1707056000000.0


- **dom_sup_kgprot (global domestic supply in kg of protein)**

In [58]:
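# Compute dom_sup_kgprot; protein_percentage is a percentage, so the result is 100 times the mass in kg
# (section 5.2 compensates by not multiplying its final ratio by 100)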
food_balance_full["dom_sup_kgprot"] = food_balance_full.domestic_supply_quantity * food_balance_full.protein_percentage
food_balance_full.head(2)

Unnamed: 0,country_code,country,population,item_code,item,year,origin,domestic_supply_quantity,export_quantity,fat_supply_quantity_gcapitaday,...,protein_supply_quantity_gcapitaday,seed,stock_variation,food_supply_kcal,food_supply_kgprotein,food_supply_kg,ratio_kcalkg,protein_percentage,dom_sup_kcal,dom_sup_kgprot
0,2,Afghanistan,30552000,2511,Wheat and products,2013,vegetal,5992000000.0,,4.69,...,36.91,322000000.0,-350000000.0,15266380000000.0,411601126.8,4895000000.0,3118.77,8.4,18687670000000.0,50332800000.0
1,2,Afghanistan,30552000,2513,Barley and products,2013,vegetal,524000000.0,,0.24,...,0.79,22000000.0,0.0,289938500000.0,8809669.2,89000000.0,3257.74,9.9,1707056000000.0,5187600000.0
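Because `protein_percentage` is expressed in percent, the product above is 100 times the protein mass in kg. A minimal sketch with the Afghanistan wheat row shows both the value as computed in the notebook and the value obtained when dividing by 100; the variable names are illustrative.

```python
import math

domestic_supply_quantity = 5_992_000_000.0  # kg, Afghanistan wheat
protein_percentage = 8.4                    # percent

as_computed = domestic_supply_quantity * protein_percentage   # value shown in the table
in_kg = domestic_supply_quantity * protein_percentage / 100   # protein mass in kg
print("{:,.0f}".format(as_computed))  # 50,332,800,000
print("{:,.0f}".format(in_kg))        # 503,328,000
```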


- **great_import_from_undern_countries (200 highest imports of the 25 most exported items from countries with more than 10% undernourishment)**      
The variable must be a boolean (True or False).  
    1. Countries with an undernourishment rate of more than 10%

In [59]:
# Merge indicators_df and population_df together
ind_pop = indicators_df.merge(population_df, how = "left", on = ["country_code", "country"])
# Compute the ratio
ind_pop["ratio"] = round(ind_pop.value / ind_pop.population * 100, 1)
# Find the countries with ratio greater than 10% in 2013 
ind_pop_10 = ind_pop.query("ratio > 10 & year == 2013")
ind_pop_10.shape

(76, 6)

There are 76 countries with an undernourishment rate of more than 10%.
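As an illustration of the ratio, here is the computation for a single country, using the Afghanistan figures shown earlier in the notebook.

```python
undernourished = 7_900_000   # Afghanistan, 2012-2014 average
population = 30_552_000      # Afghanistan, 2013

ratio = round(undernourished / population * 100, 1)
print(ratio, ratio > 10)  # 25.9 True
```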

    2. 25 most exported items (in terms of quantity) by these countries for any given year  

In [60]:
# Merge ind_pop_10 and food_balance_full together to get the export quantity and items
exported_items_25 = ind_pop_10.merge(food_balance_full, how = "left"). \
sort_values(by = "export_quantity", ascending = False).loc[:,("item")].drop_duplicates()[:25]
# Print the 25 items from the highest to the lowest
for i, item in enumerate(exported_items_25):
    print("{} - {}".format(i+1, item))

1 - Rice (Milled Equivalent)
2 - Cassava and products
3 - Wheat and products
4 - Maize and products
5 - Soyabeans
6 - Bananas
7 - Milk - Excluding Butter
8 - Sugar (Raw Equivalent)
9 - Bovine Meat
10 - Onions
11 - Pineapples and products
12 - Beans
13 - Coffee and products
14 - Freshwater Fish
15 - Coconut Oil
16 - Nuts and products
17 - Cocoa Beans and products
18 - Vegetables, Other
19 - Fruits, Other
20 - Oilcrops Oil, Other
21 - Tomatoes and products
22 - Coconuts - Incl Copra
23 - Alcohol, Non-Food
24 - Groundnuts (Shelled Eq)
25 - Barley and products


    3. 200 highest import quantities among the 25 items

In [61]:
# Select the 200 highest import quantities
high_imp_200 = food_balance_full[food_balance_full.item.isin(exported_items_25.tolist())] \
.sort_values(by = "import_quantity", ascending = False)[:200]
high_imp_200.shape

(200, 29)

    4. Create the great_import_from_undern_countries variable and set it to True for the 200 highest import quantities

In [62]:
# If the index of food_balance_full is in high_imp_200 we set the variable to True, otherwise False
food_balance_full["great_import_from_undern_countries"] = food_balance_full.index.isin(high_imp_200.index)
# Check if there are just 200 lines with True for the variable
print(food_balance_full[food_balance_full.great_import_from_undern_countries == True].shape)
food_balance_full.head(2)

(200, 30)


Unnamed: 0,country_code,country,population,item_code,item,year,origin,domestic_supply_quantity,export_quantity,fat_supply_quantity_gcapitaday,...,seed,stock_variation,food_supply_kcal,food_supply_kgprotein,food_supply_kg,ratio_kcalkg,protein_percentage,dom_sup_kcal,dom_sup_kgprot,great_import_from_undern_countries
0,2,Afghanistan,30552000,2511,Wheat and products,2013,vegetal,5992000000.0,,4.69,...,322000000.0,-350000000.0,15266380000000.0,411601126.8,4895000000.0,3118.77,8.4,18687670000000.0,50332800000.0,True
1,2,Afghanistan,30552000,2513,Barley and products,2013,vegetal,524000000.0,,0.24,...,22000000.0,0.0,289938500000.0,8809669.2,89000000.0,3257.74,9.9,1707056000000.0,5187600000.0,False


<a id="major_trends"></a>
## 5. Identify Major Trends  
### 5.1 Proportion of the global domestic supply considering only plant products  
- **Proportion used as food**

In [63]:
# Create a dataframe with just vegetal products
vegetal_full = food_balance_full.query("origin == 'vegetal'")
# Compute the global domestic supply 
global_domestic_supply = vegetal_full.domestic_supply_quantity.sum()
global_domestic_supply

8482244000000.0

In 2013 the global domestic supply of plant products is about **8,482,244,000,000** kg.

In [64]:
# Compute the proportion of domestic supply used as food
vegetal_food_prop = round(vegetal_full.food_supply_kg.sum() / global_domestic_supply * 100, 2)
vegetal_food_prop

43.53

Only 43.53% of the global domestic supply of plant products is actually used as food.

- **Proportion used as feed**

In [65]:
vegetal_feed_prop = round(vegetal_full.feed.sum() / global_domestic_supply * 100, 2)
vegetal_feed_prop

14.11

- **Proportion lost as waste**

In [66]:
vegetal_waste_prop = round(vegetal_full.losses.sum() / global_domestic_supply * 100, 2)
vegetal_waste_prop

5.07

- **Proportion allocated to other uses**

In [67]:
vegetal_other_prop = round(vegetal_full.other_uses.sum() / global_domestic_supply * 100, 2)
vegetal_other_prop

9.65
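The four shares computed above do not add up to 100%. A short sketch sums them and shows the remainder. If Formula 2 holds at the global level, which is an assumption here because of the missing values in the data, that remainder would correspond to seed and processing.

```python
# Shares of the global domestic supply of plant products, in percent
shares = {"food": 43.53, "feed": 14.11, "losses": 5.07, "other_uses": 9.65}

covered = round(sum(shares.values()), 2)
remainder = round(100 - covered, 2)
print(covered, remainder)  # 72.36 27.64
```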

### 5.2 Number of humans that could be fed with vegetal  
We will give the result in terms of calories and protein and express it as a percentage of the world's population.
It is difficult to know the number of calories we need to survive because it depends on age, weight, activity level and gender. For long-term health most people will need a minimum of 1200 calories per day. 
Following the [Dietary Guidelines](https://health.gov/dietaryguidelines/2015/guidelines/appendix-2/), let's take an average of 2400 calories per day per human.  
- **Result considering kcal**  

In [78]:
# Compute the vegetal global food supply in kcal
vegetal_global_supply_kcal = vegetal_full.dom_sup_kcal.sum()
# Compute food supply in kcal considering just food and feed proportions
vegetal_food_supply_kcal = vegetal_global_supply_kcal * (vegetal_food_prop + vegetal_feed_prop) / 100
# Compute the number of humans that could be fed considering 2400 calories per day
vegetal_humans_fed_kcal = vegetal_food_supply_kcal / 2400 / 365
# Percentage of the world population
percentage_fed_kcal = round(vegetal_humans_fed_kcal / world_population * 100)
percentage_fed_kcal

116.0

- **Result considering kgprot**

In [79]:
# Compute the vegetal global food supply in kg of protein
vegetal_global_supply_kgprot = vegetal_full.dom_sup_kgprot.sum()
# Compute food supply in kg of protein considering just food and feed proportions
vegetal_food_supply_kgprot = vegetal_global_supply_kgprot * (vegetal_food_prop + vegetal_feed_prop) / 100
# Compute the number of humans that could be fed considering 56 grams of protein per day
vegetal_humans_fed_kgprot = vegetal_food_supply_kgprot / 0.056 / 365
# Percentage of the world population; dom_sup_kgprot was built from a percentage and is already
# 100 times the mass in kg, so no extra * 100 is needed here
percentage_fed_kgprot = round(vegetal_humans_fed_kgprot / world_population)
percentage_fed_kgprot

117.0

With just the global supply of plant products, counting only the food and feed shares, about 116% of the world population could be fed in terms of kcal and about 117% in terms of protein intake.
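Both results follow the same pattern: divide a yearly supply by a daily need per person and by 365, then compare with the population. A minimal sketch of that pattern, with a hypothetical function name and round numbers that are not the notebook's data:

```python
def share_of_population_fed(supply, need_per_person_per_day, population):
    """Percentage of the population whose yearly need is covered by the supply.

    supply and need_per_person_per_day must use the same unit
    (kcal, or kg of protein with a need of 0.056 kg per day).
    """
    people_fed = supply / need_per_person_per_day / 365
    return people_fed / population * 100

# Hypothetical round numbers: enough kcal for 1,160 people for one year
supply_kcal = 2400 * 365 * 1_160
print(round(share_of_population_fed(supply_kcal, 2400, 1_000)))  # 116
```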