# GLOBAL UNDERNUTRITION STUDY - EXPLORATION  

*BY FURAWA*  

**Table of Contents**  

1. [Data collection](#data_collection)  
2. [Data discovery](#data_discovery)  
3. [Data cleaning](#data_cleaning)  
4. [Computing new variables to lead the analysis](#new_variables)  
5. [Identify major trends](#major_trends)  

In [1]:
# Import all the needed libraries for the project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from glob import glob
import re
%matplotlib inline
pd.set_option('display.max_rows', 20)

<a id='data_collection'></a>
## 1. Data collection  
All the data has been downloaded from the [FAO](http://www.fao.org/faostat/en/#data) website.  
Let us check the files.

In [2]:
# Store the file names in the file_names variable
file_names = glob('files/*.csv')
# Check the file_names list
file_names

['files/food_balance_cereals.csv',
 'files/food_security_indicators.csv',
 'files/food_balance_crops.csv',
 'files/food_balance_livestock.csv',
 'files/population.csv']

In [3]:
# Loop over the file_names list
for file in file_names:
    # Read each file and assign it to a variable named after the file (folder and extension removed)
    exec(re.split(r'\. |\W', file)[1] + " = pd.read_csv(file)")
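
As an aside, `exec` creates variables that linters cannot see and that are easy to overwrite by accident. A minimal alternative sketch, assuming the same `files/*.csv` layout, keeps the dataframes in a dictionary keyed by file name. `load_frames` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_frames(paths):
    """Read every CSV in `paths` into a dict keyed by the file stem.

    `load_frames` is a hypothetical helper introduced for illustration.
    """
    # Path("files/population.csv").stem gives "population"
    return {Path(p).stem: pd.read_csv(p) for p in paths}
```

With the files listed above, `frames = load_frames(glob('files/*.csv'))` would expose `frames['population']`, `frames['food_balance_crops']`, and so on. The rest of this notebook keeps using the plain variable names.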

In [4]:
# Check the dataframe info
food_security_indicators.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1020 entries, 0 to 1019
Data columns (total 15 columns):
Domain Code         1020 non-null object
Domain              1020 non-null object
Area Code           1020 non-null int64
Area                1020 non-null object
Element Code        1020 non-null int64
Element             1020 non-null object
Item Code           1020 non-null int64
Item                1020 non-null object
Year Code           1020 non-null int64
Year                1020 non-null object
Unit                1020 non-null object
Value               605 non-null object
Flag                1020 non-null object
Flag Description    1020 non-null object
Note                0 non-null float64
dtypes: float64(1), int64(4), object(10)
memory usage: 119.7+ KB


In [5]:
# Check the dataframe info
food_balance_crops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104871 entries, 0 to 104870
Data columns (total 14 columns):
Domain Code         104871 non-null object
Domain              104871 non-null object
Country Code        104871 non-null int64
Country             104871 non-null object
Element Code        104871 non-null int64
Element             104871 non-null object
Item Code           104871 non-null int64
Item                104871 non-null object
Year Code           104871 non-null int64
Year                104871 non-null int64
Unit                104871 non-null object
Value               104871 non-null float64
Flag                104871 non-null object
Flag Description    104871 non-null object
dtypes: float64(1), int64(5), object(8)
memory usage: 11.2+ MB


In [6]:
# Check the dataframe info
food_balance_livestock.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37166 entries, 0 to 37165
Data columns (total 14 columns):
Domain Code         37166 non-null object
Domain              37166 non-null object
Country Code        37166 non-null int64
Country             37166 non-null object
Element Code        37166 non-null int64
Element             37166 non-null object
Item Code           37166 non-null int64
Item                37166 non-null object
Year Code           37166 non-null int64
Year                37166 non-null int64
Unit                37166 non-null object
Value               37166 non-null float64
Flag                37166 non-null object
Flag Description    37166 non-null object
dtypes: float64(1), int64(5), object(8)
memory usage: 4.0+ MB


In [7]:
# Check the dataframe info
food_balance_cereals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 909 entries, 0 to 908
Data columns (total 14 columns):
Domain Code         909 non-null object
Domain              909 non-null object
Country Code        909 non-null int64
Country             909 non-null object
Element Code        909 non-null int64
Element             909 non-null object
Item Code           909 non-null int64
Item                909 non-null object
Year Code           909 non-null int64
Year                909 non-null int64
Unit                909 non-null object
Value               909 non-null int64
Flag                909 non-null object
Flag Description    909 non-null object
dtypes: int64(6), object(8)
memory usage: 99.5+ KB


In [8]:
population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175 entries, 0 to 174
Data columns (total 14 columns):
Domain Code         175 non-null object
Domain              175 non-null object
Country Code        175 non-null int64
Country             175 non-null object
Element Code        175 non-null int64
Element             175 non-null object
Item Code           175 non-null int64
Item                175 non-null object
Year Code           175 non-null int64
Year                175 non-null int64
Unit                175 non-null object
Value               175 non-null int64
Flag                1 non-null object
Flag Description    175 non-null object
dtypes: int64(6), object(8)
memory usage: 19.3+ KB


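Except for the food_security_indicators dataframe, all the dataframes have the same 14 columns. food_security_indicators names its location columns Area Code and Area instead of Country Code and Country, and has an extra Note column.  
We can drop the Note column of food_security_indicators because it contains only NaN values.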
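
Both observations can be checked in code instead of by reading the `info()` output. The sketch below uses two tiny stand-in dataframes (their contents are trimmed down for illustration) to compare column sets and to drop columns that contain only NaN values.

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for the real dataframes; only a few columns are kept for illustration
balance = pd.DataFrame({"Country Code": [2], "Country": ["Afghanistan"], "Value": [5169.0]})
indicators = pd.DataFrame(
    {"Area Code": [2], "Area": ["Afghanistan"], "Value": ["7.9"], "Note": [np.nan]}
)

# Columns present in one dataframe but not in the other
only_in_indicators = sorted(set(indicators.columns) - set(balance.columns))
only_in_balance = sorted(set(balance.columns) - set(indicators.columns))
print(only_in_indicators)  # ['Area', 'Area Code', 'Note']
print(only_in_balance)     # ['Country', 'Country Code']

# Drop every column whose values are all NaN (here: Note)
indicators_clean = indicators.dropna(axis=1, how="all")
print(list(indicators_clean.columns))  # ['Area Code', 'Area', 'Value']
```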

<a id='data_discovery'></a>
## 2. Data Discovery

In [9]:
# Check the first few rows of each dataframe
food_security_indicators.head(2)  

|   | Domain Code | Domain | Area Code | Area | Element Code | Element | Item Code | Item | Year Code | Year | Unit | Value | Flag | Flag Description | Note |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FS | Suite of Food Security Indicators | 2 | Afghanistan | 6132 | Value | 210011 | Number of people undernourished (million) (3-y... | 20122014 | 2012-2014 | millions | 7.9 | F | FAO estimate | NaN |
| 1 | FS | Suite of Food Security Indicators | 2 | Afghanistan | 6132 | Value | 210011 | Number of people undernourished (million) (3-y... | 20132015 | 2013-2015 | millions | 8.8 | F | FAO estimate | NaN |


In [10]:
# Check food balance crops
food_balance_crops.head(2)  

|   | Domain Code | Domain | Country Code | Country | Element Code | Element | Item Code | Item | Year Code | Year | Unit | Value | Flag | Flag Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FBS | Food Balance Sheets | 2 | Afghanistan | 5511 | Production | 2511 | Wheat and products | 2013 | 2013 | 1000 tonnes | 5169.0 | S | Standardized data |
| 1 | FBS | Food Balance Sheets | 2 | Afghanistan | 5611 | Import Quantity | 2511 | Wheat and products | 2013 | 2013 | 1000 tonnes | 1173.0 | S | Standardized data |


In [11]:
# Check food balance livestock
food_balance_livestock.head(2)  

|   | Domain Code | Domain | Country Code | Country | Element Code | Element | Item Code | Item | Year Code | Year | Unit | Value | Flag | Flag Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FBS | Food Balance Sheets | 2 | Afghanistan | 5511 | Production | 2731 | Bovine Meat | 2013 | 2013 | 1000 tonnes | 134.0 | S | Standardized data |
| 1 | FBS | Food Balance Sheets | 2 | Afghanistan | 5611 | Import Quantity | 2731 | Bovine Meat | 2013 | 2013 | 1000 tonnes | 6.0 | S | Standardized data |


In [12]:
# Check food balance cereals
food_balance_cereals.head(2)

|   | Domain Code | Domain | Country Code | Country | Element Code | Element | Item Code | Item | Year Code | Year | Unit | Value | Flag | Flag Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FBS | Food Balance Sheets | 2 | Afghanistan | 5511 | Production | 2511 | Wheat and products | 2013 | 2013 | 1000 tonnes | 5169 | S | Standardized data |
| 1 | FBS | Food Balance Sheets | 2 | Afghanistan | 5511 | Production | 2805 | Rice (Milled Equivalent) | 2013 | 2013 | 1000 tonnes | 342 | S | Standardized data |


In [13]:
# Check population 
population.head(2)

|   | Domain Code | Domain | Country Code | Country | Element Code | Element | Item Code | Item | Year Code | Year | Unit | Value | Flag | Flag Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FBS | Food Balance Sheets | 2 | Afghanistan | 511 | Total Population - Both sexes | 2501 | Population | 2013 | 2013 | 1000 persons | 30552 | NaN | Official data |
| 1 | FBS | Food Balance Sheets | 3 | Albania | 511 | Total Population - Both sexes | 2501 | Population | 2013 | 2013 | 1000 persons | 3173 | NaN | Official data |


Let us check the primary key of each table and test them.  
We will create a function to find the primary key.  

In [14]:
# Function to find the potential primary keys of a dataframe
def check_potential_primary_key(df) -> None:
    # Loop over the columns of the dataframe
    for column_pk in df.columns:
        # A column is a candidate key when dropping its duplicates keeps every row
        if len(df) == len(df[column_pk].drop_duplicates()):
            # Print each potential primary key; other columns produce no output
            print("{} could be a primary key!".format(column_pk))

Now we can use the function to find the potential primary keys of each dataframe. 

In [15]:
# Check the primary key of the population dataframe
# (population_df is only created later in this section, so the raw population dataframe is used here)
check_potential_primary_key(population)

We have 3 potential primary keys for the population dataframe, but the best choice is the Country Code variable. The population value and the country name would make poor primary keys because they are harder to query reliably.
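
To make the choice of key explicit, the key column can be promoted to the index. A minimal sketch on a two-row sample taken from the rows shown earlier: `verify_integrity=True` makes pandas raise a `ValueError` if the chosen column ever contains duplicates.

```python
import pandas as pd

# Two-row sample with the same key and name columns as the population dataframe
sample = pd.DataFrame({
    "Country Code": [2, 3],
    "Country": ["Afghanistan", "Albania"],
    "Value": [30552, 3173],
})

# Use Country Code as the index; duplicates in the key would raise ValueError
keyed = sample.set_index("Country Code", verify_integrity=True)

# Look up one row by its key
print(keyed.loc[3, "Country"])  # Albania
```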

In [None]:
# Check the primary key of food balance crops
check_potential_primary_key(food_balance_crops)

We have no output, which means there is no potential primary key in this dataframe.

In [None]:
# Check the primary key of food balance livestock
check_potential_primary_key(food_balance_livestock)

Same here: there is no potential primary key in the food balance livestock dataframe.

In [None]:
# Check the primary key of food balance cereals
check_potential_primary_key(food_balance_cereals)

In [None]:
# Check the primary key of food security indicators
check_potential_primary_key(food_security_indicators)

The food balance cereals and food security indicators dataframes have no potential primary key either.
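
A table in long format, with one row per country, element and item, is unlikely to have a single-column key, but a combination of columns may still identify each row. The sketch below is one way to test that, offered as an assumption and not as a result of the original analysis. `is_composite_key` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd


def is_composite_key(df, columns):
    """Return True when no two rows share the same values in `columns`."""
    # duplicated() flags every repeat of an earlier row for the given subset
    return not df.duplicated(subset=columns).any()


# Made-up rows in the shape of a food balance table
sample = pd.DataFrame({
    "Country Code": [2, 2, 3],
    "Element Code": [5511, 5611, 5511],
    "Item Code": [2511, 2511, 2511],
})

print(is_composite_key(sample, ["Country Code"]))                  # False
print(is_composite_key(sample, ["Country Code", "Element Code"]))  # True
```

On the real data one could try a call such as `is_composite_key(food_balance_crops, ["Country Code", "Element Code", "Item Code"])`; whether that combination is actually unique has to be checked, not assumed.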

Let us create a column with the total population and remove the columns we do not need from the population dataframe.

In [None]:
# Create the population column: take the 1000 from the first Unit value ("1000 persons")
# and multiply it by the Value column
population["population"] = int(population.Unit.str.split(" ")[0][0]) * population.Value
# Keep only the Country Code, Country and population columns
population_df = population.drop(population.columns.difference(["Country Code", "Country", "population"]),
                                axis=1)
# Check the dataframe
population_df.head()
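
The cell above reads the multiplier from the first row only, which is fine as long as every row uses the same unit. A slightly more defensive sketch, shown on a two-row sample, extracts the leading number from each row's `Unit` instead:

```python
import pandas as pd

# Two-row sample mimicking the Unit and Value columns of the population dataframe
sample = pd.DataFrame({"Unit": ["1000 persons", "1000 persons"], "Value": [30552, 3173]})

# Pull the leading digits out of each Unit string and convert them to integers
multiplier = sample["Unit"].str.extract(r"^(\d+)", expand=False).astype(int)

# Scale each Value by its own row's multiplier
sample["population"] = multiplier * sample["Value"]
print(sample["population"].tolist())  # [30552000, 3173000]
```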

Now we can calculate the total number of people covered by the dataset.

In [None]:
# Calculate the total number of humans
total_population = population_df.population.sum()
print("The total number of humans on the planet is : {:,}".format(total_population))

This result cannot be correct, especially for 2013: even in 2019 the world population was only around 7.7 billion. There must be a mistake, so we will dig deeper to find it.  
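
One plausible cause, stated here as an assumption and not as the finding of this study, is that some rows are aggregates that overlap with other rows, so the same people are counted twice. A first step is to look at the largest entries. The sketch below uses made-up names and numbers.

```python
import pandas as pd

# Made-up data: "Region A" is an aggregate that already includes "Country A1"
sample = pd.DataFrame({
    "Country": ["Region A", "Country A1", "Country B"],
    "population": [1_500, 1_400, 300],
})

# Show the largest entries first; overlapping aggregates tend to stand out here
print(sample.nlargest(2, "population"))

# Compare the total with and without the suspected aggregate row
total_all = sample["population"].sum()
total_without = sample.loc[sample["Country"] != "Region A", "population"].sum()
print(total_all, total_without)  # 3200 1700
```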