# GLOBAL UNDERNUTRITION STUDY - EXPLORATION  

*BY FURAWA*  

**Table of Contents**  

1. [Data collection](#data_collection)  
2. [Data discovery](#data_discovery)  
3. [Data cleaning](#data_cleaning)  
4. [Computing new variables to lead the analysis](#new_variables)  
5. [Identify major trends](#major_trends)  

In [9]:
# Import all the needed libraries for the project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from glob import glob
import re
%matplotlib inline
pd.set_option('display.max_rows', 20)

<a id='data_collection'></a>
## 1. Data collection  
All the data has been downloaded from the [FAO](http://www.fao.org/faostat/en/#data) website.  
Let us check the files.

In [10]:
# Store the file names in the file_names variable
file_names = glob('files/*.csv')
# Check the file_names list
file_names

['files/food_balance_cereals.csv',
 'files/food_security_indicators.csv',
 'files/food_balance_vegetal.csv',
 'files/food_balance_animal.csv',
 'files/population.csv']

In [11]:
# Loop over the file_names list
for file in file_names:
    # Read each file and bind it to a variable named after the file (e.g. food_balance_cereals)
    exec(re.split(r'\. |\W', file)[1] + " = pd.read_csv(file)")
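
The loading step above creates one variable per file through `exec`. As an alternative, a minimal sketch that stores the dataframes in a dictionary keyed by file name is shown here. The helper name `load_csv_folder` is hypothetical and not part of the original notebook.

```python
from glob import glob
from pathlib import Path

import pandas as pd


def load_csv_folder(pattern):
    """Return a {file stem: DataFrame} dict for every CSV matching `pattern`."""
    return {Path(path).stem: pd.read_csv(path) for path in sorted(glob(pattern))}


# Hypothetical usage with the folder layout shown above:
# dataframes = load_csv_folder('files/*.csv')
# population = dataframes['population']
```
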

In [12]:
# Check the dataframe info
food_security_indicators.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1020 entries, 0 to 1019
Data columns (total 15 columns):
Domain Code         1020 non-null object
Domain              1020 non-null object
Area Code           1020 non-null int64
Area                1020 non-null object
Element Code        1020 non-null int64
Element             1020 non-null object
Item Code           1020 non-null int64
Item                1020 non-null object
Year Code           1020 non-null int64
Year                1020 non-null object
Unit                1020 non-null object
Value               605 non-null object
Flag                1020 non-null object
Flag Description    1020 non-null object
Note                0 non-null float64
dtypes: float64(1), int64(4), object(10)
memory usage: 119.7+ KB


In [13]:
# Check the unique elements of the food balance vegetal dataframe
food_balance_vegetal.Element.unique()

array(['Production', 'Import Quantity', 'Stock Variation',
       'Domestic supply quantity', 'Seed', 'Losses', 'Food',
       'Food supply quantity (kg/capita/yr)',
       'Food supply (kcal/capita/day)',
       'Protein supply quantity (g/capita/day)',
       'Fat supply quantity (g/capita/day)', 'Feed', 'Export Quantity',
       'Processing', 'Other uses'], dtype=object)

In [14]:
# Check the unique elements of the food balance animal dataframe
food_balance_animal.Element.unique()

array(['Production', 'Import Quantity', 'Domestic supply quantity',
       'Food', 'Food supply quantity (kg/capita/yr)',
       'Food supply (kcal/capita/day)',
       'Protein supply quantity (g/capita/day)',
       'Fat supply quantity (g/capita/day)', 'Seed', 'Losses',
       'Export Quantity', 'Feed', 'Other uses', 'Stock Variation',
       'Processing'], dtype=object)

In [23]:
# Check the dataframe info
food_balance_cereals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16192 entries, 0 to 16191
Data columns (total 14 columns):
Domain Code         16192 non-null object
Domain              16192 non-null object
Country Code        16192 non-null int64
Country             16192 non-null object
Element Code        16192 non-null int64
Element             16192 non-null object
Item Code           16192 non-null int64
Item                16192 non-null object
Year Code           16192 non-null int64
Year                16192 non-null int64
Unit                16192 non-null object
Value               16192 non-null float64
Flag                16192 non-null object
Flag Description    16192 non-null object
dtypes: float64(1), int64(5), object(8)
memory usage: 1.7+ MB


In [None]:
population.head()

Apart from the food_security_indicators dataframe, all the other dataframes have the same 14 columns.  
We can remove the Note variable from food_security_indicators, as it contains only NaN values.
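
Dropping a column that holds only NaN values can be sketched on a toy frame. The frame below is made up for illustration and only stands in for food_security_indicators.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for food_security_indicators (values are made up)
toy = pd.DataFrame({"Area": ["A", "B"],
                    "Value": ["1.5", None],
                    "Note": [np.nan, np.nan]})

# Keep only the columns that contain at least one non-null value
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))
```
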

<a id='data_discovery'></a>
## 2. Data Discovery

In [None]:
# Check the first few rows of each dataframe
food_security_indicators.head(2)

In [None]:
# Check food balance vegetal
food_balance_vegetal.head(2)  

In [None]:
# Check food balance animal
food_balance_animal.head(2)  

In [None]:
# Check food balance cereals
food_balance_cereals

In [None]:
# Check population 
population

Let us check the primary key of each table and test them.  
We will create a function to find the primary key.  

In [None]:
# Function to find the potential primary keys
def check_potential_primary_key(df) -> None:
    # Loop over the columns of the dataframe
    for column_pk in df.columns:
        # A column is a candidate key when it holds no duplicated values
        if df[column_pk].is_unique:
            # Print every potential primary key; other columns produce no output
            print("{} could be a primary key!".format(column_pk))

Now we can use the function to find the potential primary keys of each dataframe. 

In [None]:
# Check the primary key of population dataframe
check_potential_primary_key(population)

We have 3 potential primary keys for the population dataframe, but the best choice is the Country Code variable. The population or the country name would make poor primary keys, as they are harder to query reliably.

In [None]:
# Check the primary key of food balance vegetal
check_potential_primary_key(food_balance_vegetal)

We have no output, which means there is no potential primary key in this dataframe.

In [None]:
# Check the primary key of food balance animal
check_potential_primary_key(food_balance_animal)

Same here: there are no primary keys in the food balance animal dataframe.

In [None]:
# Check the primary key of food balance cereals
check_potential_primary_key(food_balance_cereals)

In [None]:
# Check the primary key of food security indicators
check_potential_primary_key(food_security_indicators)

The food balance cereals and food security indicators dataframes have no potential primary keys either.
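
When no single column identifies a row, a combination of columns may still do so. The sketch below tests a composite key on a toy long-format table. The helper `is_composite_key` and the toy values are assumptions for illustration, and whether these columns form a key in the real FAO tables would have to be checked on the data.

```python
import pandas as pd


def is_composite_key(df, columns):
    """Return True when no two rows share the same values in `columns`."""
    return not df.duplicated(subset=columns).any()


# Toy long-format frame (made-up values) shaped like the food balance tables
toy = pd.DataFrame({
    "Country Code": [1, 1, 1, 2],
    "Item Code": [10, 10, 20, 10],
    "Element Code": [100, 200, 100, 100],
    "Value": [5.0, 6.0, 7.0, 8.0],
})

print(is_composite_key(toy, ["Country Code"]))
print(is_composite_key(toy, ["Country Code", "Item Code", "Element Code"]))
```
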

Let us create a column with the total population and remove the columns we do not need from the population dataframe.

In [None]:
# Create the population column: read the multiplier (1000) from the Unit column and multiply it by the Value column
population["population"] = int(population.Unit.str.split(" ")[0][0]) * population.Value
# Keep only the columns needed for the analysis
population_df = population.drop(population.columns.difference(["Country Code", "Country", "population"]),
                                axis =1)
# Check the dataframe
population_df.head()

Now we can calculate the total number of humans covered by the data.

In [None]:
# Calculate the total number of humans
total_population = population_df.population.sum()
print("The total number of humans on the planet is : {:,}".format(total_population))

This result cannot be correct, especially for the 2013 world population: even in 2019 the world population was only around 7.7 billion. There must be a mistake, so we will dig deeper to find it.

<a id='data_cleaning'></a>
## 3. Data Cleaning  

The dataframes are loaded but dirty: there are useless rows and columns, anomalies in the population data must be corrected, and the column names must be changed. Let's do some cleaning.  
We start by putting all the food balance dataframes into a single dataframe.  

In [None]:
# Create the origin variable in each balance food dataframe to store the food origin
food_balance_animal["origin"] = "animal"
food_balance_cereals["origin"] = "cereal"
food_balance_vegetal["origin"] = "vegetal"

In [None]:
# Concatenate the 3 dataframes into a single dataframe
food_balance_df = pd.concat([food_balance_animal, food_balance_vegetal, food_balance_cereals])
# Check the first rows
food_balance_df.head(2)

In [None]:
# Check the last rows
food_balance_df.tail(2)

In [None]:
# Delete the 3 food balance dataframes that are no longer needed
del food_balance_animal, food_balance_cereals, food_balance_vegetal

In [None]:
# Transform the dataframe from long to wide with pivot_table
food_balance_wide = food_balance_df.pivot_table(
    # Use as index the columns that we want to keep in the dataframe
    index = ["Country Code", "Item Code", "Country", "Item", "Year", "origin"],
    # Spread the Element values into columns and sum the matching Value entries
    columns = ["Element"], values = ["Value"], aggfunc = "sum")
food_balance_wide.head(1)

In [None]:
# Drop the level of the columns
food_balance_wide.columns = food_balance_wide.columns.droplevel(0)

In [None]:
food_balance_wide.head(1)

In [None]:
# Reset the indexes and rename the Element as None to remove it as column name of the index
food_balance = food_balance_wide.reset_index().rename_axis(None, axis = 1)
food_balance.head(1)
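
The long-to-wide reshaping performed in the cells above can be reproduced on a toy table. This is a minimal sketch with made-up values, not the FAO data. It passes `values` as a plain string, so no column level needs to be dropped afterwards.

```python
import pandas as pd

# Toy long-format table (made-up values): one row per country, item and element
long_df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Item": ["Wheat", "Wheat", "Wheat", "Wheat"],
    "Element": ["Food", "Production", "Food", "Production"],
    "Value": [1.0, 4.0, 2.0, 5.0],
})

# One column per Element, summing any duplicated (Country, Item, Element) rows
wide = long_df.pivot_table(index=["Country", "Item"], columns="Element",
                           values="Value", aggfunc="sum")

# Flatten the result into ordinary columns and drop the "Element" axis name
wide = wide.reset_index().rename_axis(None, axis=1)
print(wide)
```
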

In [None]:
# Check the bottom of the dataframe
food_balance.tail(1)

The "World" value as country is not necessary for the analysis. We will remove it.

In [None]:
# Remove the rows with World as country
food_balance = food_balance.drop(food_balance[food_balance.Country == "World"].index)

In [None]:
# Rename the columns
food_balance.columns = ["country_code", "item_code", "country", "item", "year", "origin", 
                       "domestic_supply_quantity", "export_quantity", "fat_supply_quantity_gcapitaday",
                       "feed", "food", "food_supply_kcalcapitaday", "food_supply_quantity_kgcapita_yr", 
                       "import_quantity", "losses", "other_uses", "processing", "production", 
                       "protein_supply_quantity_gcapitaday", "seed", "stock_variation"]

In [None]:
# Display the dataframe
food_balance

In [None]:
# Rename the column of population dataframe
population_df.columns = ["country_code", "country", "population"]
population_df.columns

In [None]:
# Check some statistic of the dataframe
population_df.describe()

In [None]:
# Check the country with population higher than 1 billion
population_df[population_df.population > 1000000000]

It seems that there are several entries for China. Let us find them all.

In [None]:
# Find all the countries whose name contains China
population_df[population_df.country.str.contains("China")]

In [None]:
# Sum the populations of the four Chinese regions (rows 32 to 35)
population_df.population[32:36].sum()

We can see that *China, Hong Kong SAR*, *China, Macao SAR*, *China, mainland* and *China, Taiwan Province of* are the four components of the aggregate *China* entry. We can confirm this because the populations of the four parts add up to the population of the China entry.  
We will therefore exclude the aggregate China value when computing the total number of humans on the planet.
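
Excluding an aggregate row can also be done by name instead of by row position, which keeps working if the row order changes. The sketch below uses a toy table with made-up figures and assumes the aggregate row is labelled exactly "China".

```python
import pandas as pd

# Toy population table; the figures are made up for illustration
toy_population = pd.DataFrame({
    "country": ["China", "China, mainland", "China, Hong Kong SAR", "France"],
    "population": [110, 100, 10, 60],
})

# Drop the aggregate row by name rather than by position (assumed label: "China")
without_aggregate = toy_population[toy_population.country != "China"]
print(without_aggregate.population.sum())
```
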

In [None]:
# Calculate the world population, excluding the aggregate China row (index 174)
world_population = population_df[population_df.index != 174].population.sum()
print("The total number of humans on the planet is : {:,}".format(world_population))

This value makes more sense and is consistent with reality.

In [None]:
# Remove the columns that are not needed (selected by position)
food_security_indicators.drop(food_security_indicators.columns[[0,1,4,5,6,8,10,12,13,14]],
                              axis =1, inplace = True)
food_security_indicators.head(2)

In [None]:
# Rename the columns 
food_security_indicators.columns = ["country_code", "country", "item", "year", "value"]

In [None]:
# Check for the info
food_security_indicators.info()

The value variable is currently an object, but it must be a float. Let's change it.  
Before that, we will count and remove all the null values in the value variable.  

In [None]:
# Count the number of null values in the dataframe
food_security_indicators.isnull().sum()

In [None]:
# Check the rows with null values  
food_security_indicators[food_security_indicators.value.isnull()]

In [None]:
# Remove all the rows with NaN values
food_security_indicators.drop(food_security_indicators[food_security_indicators.value.isnull()].index, 
                             inplace = True)

In [None]:
# Assert that there are no null values anymore; no output means the check passed
assert not food_security_indicators.value.isnull().any()

In [None]:
# Check the unique values 
food_security_indicators.value.unique()

The value '<0.1' cannot be converted to a float as it stands. We will replace it with 0.09 and then change the variable type.
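
The replacement described above can be sketched on a toy column. The data is made up. The point of the sketch is that `.loc` with a column label restricts the assignment to the `value` column, leaving the other columns of the matching rows untouched.

```python
import pandas as pd

# Toy column mimicking the raw `value` strings (made-up numbers)
toy = pd.DataFrame({"country": ["A", "B", "C"], "value": ["2.5", "<0.1", "12"]})

# Replace the censored marker in the `value` column only, then convert to float
toy.loc[toy.value == "<0.1", "value"] = "0.09"
toy["value"] = toy["value"].astype("float64")
print(toy.value.tolist())
```
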

In [None]:
# Change the <0.1 value to 0.09 in the value column only
food_security_indicators.loc[food_security_indicators.value == "<0.1", "value"] = 0.09

In [None]:
# Assert there are no <0.1 values anymore
assert not (food_security_indicators.value == "<0.1").any()

In [None]:
# Turn value variable from string to float 
food_security_indicators.value = food_security_indicators.value.astype("float64")

In [None]:
# Assert that the variable is float
assert food_security_indicators.value.dtype == "float64"

<a id="new_variables"></a>
## 4. Computing New Variables To Lead The Analysis

Now that all the dataframes are clean, let us compute some new variables for future study.
- **food_supply_kcal (food supply expressed in kcal)**

In [None]:
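# Create a temporary dataframe that attaches each country's population to every food_balance row
temp = pd.merge(food_balance[["country_code", "food_supply_kcalcapitaday"]],
                population_df[["country_code", "population"]],
                # a left join on country_code keeps one row per food_balance row, in the same order
                how = "left", on = "country_code")
# Reuse the food_balance index so that later column assignments align row by row
temp.index = food_balance.index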

In [None]:
# Calculate the variable by multiplying by the population of each country and by 365 days
food_supply_kcal = temp.population * temp.food_supply_kcalcapitaday * 365
# Create food_supply_kcal column and assign it the previous variable
food_balance["food_supply_kcal"] = food_supply_kcal
# Keep the temp dataframe for now: its population column is reused below
# Check the result
food_balance.head()
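
The population table has one row per country, while food_balance has one row per country and item, so the two must be aligned row by row before multiplying. A minimal sketch of one way to do this with `Series.map` follows. The toy frames and their values are hypothetical, and the non-default index is there only to show that the lookup does not depend on row labels.

```python
import pandas as pd

# Toy tables with made-up values
population_toy = pd.DataFrame({"country_code": [1, 2], "population": [1000, 2000]})
balance_toy = pd.DataFrame({"country_code": [2, 1, 2],
                            "item": ["x", "y", "z"],
                            "kcal_capita_day": [10.0, 20.0, 30.0]},
                           index=[5, 7, 9])

# Look up the population of each row's country, whatever the row labels are
lookup = population_toy.set_index("country_code").population
balance_toy["population"] = balance_toy.country_code.map(lookup)

# Yearly supply in kcal = population * kcal/capita/day * 365 days
balance_toy["food_supply_kcal"] = (balance_toy.population
                                   * balance_toy.kcal_capita_day * 365)
print(balance_toy)
```
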

- **food_supply_kgprotein (food supply expressed in kg of protein)**

In [None]:
# Compute the new variable: convert g to kg, then multiply by the population and by 365 days
food_balance["food_supply_kgprotein"] = (food_balance.protein_supply_quantity_gcapitaday / 1000) * temp.population * 365
# Check the first rows
food_balance.head()

- **food_supply_kg (food supply expressed in kg)**

In [None]:
# Compute the new variable by multiplying by 1 million, as the food variable is expressed in 1000 tonnes
food_balance["food_supply_kg"] = food_balance.food * 1000000
# Check the first rows
food_balance.head()

- **ratio_kcalkg (energy:weight ratio of each item expressed in kcal/kg)**  
We will compute this variable from food_supply_kcal and food_supply_kg.
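
The unit conversions behind these variables can be checked on a single made-up row. This is a sketch with illustrative numbers, not values taken from the FAO data.

```python
import pandas as pd

# One made-up row: 1,000,000 inhabitants, 100 kcal/capita/day, 10 thousand tonnes of food
toy = pd.DataFrame({"population": [1_000_000],
                    "food_supply_kcalcapitaday": [100.0],
                    "food": [10.0]})

# kcal per year for the whole country
toy["food_supply_kcal"] = toy.population * toy.food_supply_kcalcapitaday * 365
# 1000 tonnes = 1,000,000 kg
toy["food_supply_kg"] = toy.food * 1_000_000
# Energy density in kcal per kg
toy["ratio_kcalkg"] = toy.food_supply_kcal / toy.food_supply_kg
print(toy.ratio_kcalkg.iloc[0])
```
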

In [None]:
# Compute ratio_kcalkg = food_supply_kcal / food_supply_kg
food_balance["ratio_kcalkg"] = food_balance.food_supply_kcal / food_balance.food_supply_kg
# Replace the infinite ratios produced by a zero food_supply_kg
food_balance.ratio_kcalkg = food_balance.ratio_kcalkg.replace([np.inf, -np.inf], 0)
# Show the 20 items with the highest mean ratio
food_balance.groupby("item").ratio_kcalkg.mean().sort_values(ascending = False)[:20]

In [None]:
# Compute the protein percentage 
