# GLOBAL UNDERNUTRITION STUDY - EXPLORATION  

*BY FURAWA*  

**Table of Contents**  

1. [Data collection](#data_collection)  
2. [Data discovery](#data_discovery)  
3. [Data cleaning](#data_cleaning)  
4. [Computing new variables to lead the analysis](#new_variables)  
5. [Identify major trends](#major_trends)  

In [1]:
# Import all the needed libraries for the project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from glob import glob
import re
%matplotlib inline
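# Use the full option name; the short 'max_rows' pattern is ambiguous in recent pandas versions
pd.set_option('display.max_rows', 20)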

<a id='data_collection'></a>
## 1. Data collection  
All the data was downloaded from the [FAO](http://www.fao.org/faostat/en/#data) website (FAOSTAT).  
Let us list the downloaded files.

In [2]:
# Store the file names in the file_names variable
file_names = glob('files/*.csv')
# Check the file_names list
file_names

['files/food_security_indicators.csv',
 'files/commodity_balance_crops.csv',
 'files/food_supply_crops.csv',
 'files/commodity_balance_livestock.csv',
 'files/food_balance_sheet.csv',
 'files/food_supply_livestock.csv']

In [3]:
# Loop over the file_names list
for file in file_names:
    # Take the file name without its folder and extension, e.g. 'food_balance_sheet'
    name = re.split(r'\W', file)[1]
    # Read the file and bind the dataframe to a variable with that name
    globals()[name] = pd.read_csv(file)
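Creating variables from file names is convenient in a notebook, but the resulting names are invisible to linters and easy to overwrite by accident. As a minimal sketch of an alternative, the files could be kept in a dictionary keyed by file name. The helper `load_csv_folder` and the temporary demo file below are hypothetical and only illustrate the idea; they are not part of the original analysis.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_folder(folder):
    """Read every CSV in `folder` into a dict keyed by file name without extension."""
    return {path.stem: pd.read_csv(path) for path in sorted(Path(folder).glob("*.csv"))}


# Demonstration on a throwaway folder holding one tiny CSV (illustrative content only)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "food_balance_sheet.csv").write_text("Country,Value\nAfghanistan,30552.0\n")
    frames = load_csv_folder(tmp)

# Each dataframe is then reached by name through the dictionary
print(list(frames))
print(frames["food_balance_sheet"].shape)
```

With this layout, `frames["food_security_indicators"]` would play the role of the `food_security_indicators` variable used in the cells that follow.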

In [4]:
# Check the dataframe info
food_security_indicators.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3173 entries, 0 to 3172
Data columns (total 15 columns):
Domain Code         3173 non-null object
Domain              3173 non-null object
Area Code           3173 non-null int64
Area                3173 non-null object
Element Code        3173 non-null int64
Element             3173 non-null object
Item Code           3173 non-null int64
Item                3173 non-null object
Year Code           3173 non-null int64
Year                3173 non-null object
Unit                3173 non-null object
Value               3052 non-null object
Flag                3173 non-null object
Flag Description    3173 non-null object
Note                0 non-null float64
dtypes: float64(1), int64(4), object(10)
memory usage: 372.0+ KB


In [5]:
# Check the dataframe info
food_supply_crops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60628 entries, 0 to 60627
Data columns (total 14 columns):
Domain Code         60628 non-null object
Domain              60628 non-null object
Country Code        60628 non-null int64
Country             60628 non-null object
Element Code        60628 non-null int64
Element             60628 non-null object
Item Code           60628 non-null int64
Item                60628 non-null object
Year Code           60628 non-null int64
Year                60628 non-null int64
Unit                60628 non-null object
Value               60628 non-null float64
Flag                60628 non-null object
Flag Description    60628 non-null object
dtypes: float64(1), int64(5), object(8)
memory usage: 6.5+ MB


In [6]:
# Check the dataframe info
food_supply_livestock.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25569 entries, 0 to 25568
Data columns (total 14 columns):
Domain Code         25569 non-null object
Domain              25569 non-null object
Country Code        25569 non-null int64
Country             25569 non-null object
Element Code        25569 non-null int64
Element             25569 non-null object
Item Code           25569 non-null int64
Item                25569 non-null object
Year Code           25569 non-null int64
Year                25569 non-null int64
Unit                25569 non-null object
Value               25569 non-null float64
Flag                25569 non-null object
Flag Description    25569 non-null object
dtypes: float64(1), int64(5), object(8)
memory usage: 2.7+ MB


In [7]:
# Check the dataframe info
food_balance_sheet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142212 entries, 0 to 142211
Data columns (total 14 columns):
Domain Code         142212 non-null object
Domain              142212 non-null object
Country Code        142212 non-null int64
Country             142212 non-null object
Element Code        142212 non-null int64
Element             142212 non-null object
Item Code           142212 non-null int64
Item                142212 non-null object
Year Code           142212 non-null int64
Year                142212 non-null int64
Unit                142212 non-null object
Value               142212 non-null float64
Flag                142038 non-null object
Flag Description    142212 non-null object
dtypes: float64(1), int64(5), object(8)
memory usage: 15.2+ MB


In [8]:
# Check the dataframe info
commodity_balance_crops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85123 entries, 0 to 85122
Data columns (total 14 columns):
Domain Code         85123 non-null object
Domain              85123 non-null object
Country Code        85123 non-null int64
Country             85123 non-null object
Element Code        85123 non-null int64
Element             85123 non-null object
Item Code           85123 non-null int64
Item                85123 non-null object
Year Code           85123 non-null int64
Year                85123 non-null int64
Unit                85123 non-null object
Value               85123 non-null int64
Flag                85123 non-null object
Flag Description    85123 non-null object
dtypes: int64(6), object(8)
memory usage: 9.1+ MB


In [9]:
commodity_balance_livestock.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30814 entries, 0 to 30813
Data columns (total 14 columns):
Domain Code         30814 non-null object
Domain              30814 non-null object
Country Code        30814 non-null int64
Country             30814 non-null object
Element Code        30814 non-null int64
Element             30814 non-null object
Item Code           30814 non-null int64
Item                30814 non-null object
Year Code           30814 non-null int64
Year                30814 non-null int64
Unit                30814 non-null object
Value               30814 non-null int64
Flag                30814 non-null object
Flag Description    30814 non-null object
dtypes: int64(6), object(8)
memory usage: 3.3+ MB


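Except for the food_security_indicators dataframe, all the dataframes share the same 14 columns.  
The food_security_indicators dataframe differs in three ways: it uses Area Code and Area instead of Country Code and Country, its Year and Value columns are stored as objects rather than numbers, and it has an extra Note column.  
We can remove the Note column of food_security_indicators, as it contains only NaN values (0 non-null entries).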
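The two observations above (differing column sets, and a column made only of NaN) can be checked programmatically. The sketch below uses small stand-in dataframes rather than the real files: the column names follow the `info()` output, while the frames themselves (`indicators`, `crops`) are hypothetical and reduced to a few columns.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for two of the dataframes (reduced to a few columns for illustration)
indicators = pd.DataFrame({
    "Area": ["Afghanistan", "Afghanistan"],
    "Value": ["99", "110"],
    "Note": [np.nan, np.nan],
})
crops = pd.DataFrame({
    "Country": ["Afghanistan"],
    "Value": [67678.0],
})

# Columns that appear in one dataframe but not in the other
only_in_indicators = sorted(set(indicators.columns) - set(crops.columns))
only_in_crops = sorted(set(crops.columns) - set(indicators.columns))
print(only_in_indicators, only_in_crops)

# Drop every column that contains nothing but NaN (here: Note)
indicators_clean = indicators.dropna(axis=1, how="all")
print(list(indicators_clean.columns))
```

Using `dropna(axis=1, how="all")` instead of dropping `Note` by name has the advantage of also catching any other fully empty column, at the cost of being less explicit.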

<a id='data_discovery'></a>
## 2. Data discovery

In [10]:
# Check the first few rows of each dataframe
food_security_indicators.head(2)  

| | Domain Code | Domain | Area Code | Area | Element Code | Element | Item Code | Item | Year Code | Year | Unit | Value | Flag | Flag Description | Note |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FS | Suite of Food Security Indicators | 2 | Afghanistan | 6121 | Value | 21010 | Average dietary energy supply adequacy (percen... | 20122014 | 2012-2014 | % | 99 | F | FAO estimate | NaN |
| 1 | FS | Suite of Food Security Indicators | 2 | Afghanistan | 6122 | Value | 21011 | Average value of food production (constant 200... | 20122014 | 2012-2014 | I$ per person | 110 | F | FAO estimate | NaN |


In [11]:
# Check food supply crops
food_supply_crops.head(2)  

| | Domain Code | Domain | Country Code | Country | Element Code | Element | Item Code | Item | Year Code | Year | Unit | Value | Flag | Flag Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CC | Food Supply - Crops Primary Equivalent | 2 | Afghanistan | 641 | Food supply quantity (tonnes) | 2617 | Apples and products | 2013 | 2013 | tonnes | 67678.0 | S | Standardized data |
| 1 | CC | Food Supply - Crops Primary Equivalent | 2 | Afghanistan | 646 | Food supply quantity (g/capita/day) | 2617 | Apples and products | 2013 | 2013 | g/capita/day | 6.07 | Fc | Calculated data |


In [12]:
# Check food supply livestock
food_supply_livestock.head(2)  

| | Domain Code | Domain | Country Code | Country | Element Code | Element | Item Code | Item | Year Code | Year | Unit | Value | Flag | Flag Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CL | Food Supply - Livestock and Fish Primary Equiv... | 2 | Afghanistan | 641 | Food supply quantity (tonnes) | 2731 | Bovine Meat | 2013 | 2013 | tonnes | 140087.0 | S | Standardized data |
| 1 | CL | Food Supply - Livestock and Fish Primary Equiv... | 2 | Afghanistan | 646 | Food supply quantity (g/capita/day) | 2731 | Bovine Meat | 2013 | 2013 | g/capita/day | 12.56 | Fc | Calculated data |


In [13]:
# Check food balance sheet 
food_balance_sheet.head(2)  

| | Domain Code | Domain | Country Code | Country | Element Code | Element | Item Code | Item | Year Code | Year | Unit | Value | Flag | Flag Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FBS | Food Balance Sheets | 2 | Afghanistan | 511 | Total Population - Both sexes | 2501 | Population | 2013 | 2013 | 1000 persons | 30552.0 | NaN | Official data |
| 1 | FBS | Food Balance Sheets | 2 | Afghanistan | 5511 | Production | 2511 | Wheat and products | 2013 | 2013 | 1000 tonnes | 5169.0 | S | Standardized data |


In [14]:
# Check commodity balance crops
commodity_balance_crops.head(2)  

| | Domain Code | Domain | Country Code | Country | Element Code | Element | Item Code | Item | Year Code | Year | Unit | Value | Flag | Flag Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BC | Commodity Balances - Crops Primary Equivalent | 2 | Afghanistan | 5510 | Production | 2617 | Apples and products | 2013 | 2013 | tonnes | 78597 | S | Standardized data |
| 1 | BC | Commodity Balances - Crops Primary Equivalent | 2 | Afghanistan | 5610 | Import Quantity | 2617 | Apples and products | 2013 | 2013 | tonnes | 100 | S | Standardized data |


In [15]:
# Check commodity balance livestock
commodity_balance_livestock.head(2)

| | Domain Code | Domain | Country Code | Country | Element Code | Element | Item Code | Item | Year Code | Year | Unit | Value | Flag | Flag Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BL | Commodity Balances - Livestock and Fish Primar... | 2 | Afghanistan | 5510 | Production | 2731 | Bovine Meat | 2013 | 2013 | tonnes | 134000 | S | Standardized data |
| 1 | BL | Commodity Balances - Livestock and Fish Primar... | 2 | Afghanistan | 5610 | Import Quantity | 2731 | Bovine Meat | 2013 | 2013 | tonnes | 6087 | S | Standardized data |


Let us create a dataframe with just the population information using the balance sheet dataframe.

In [16]:
# Retrieve the population rows; copy them so that adding columns later does not trigger a SettingWithCopyWarning
population_df = food_balance_sheet[food_balance_sheet.Item == "Population"].copy()


In [22]:
# Read the multiplier from the Unit column ("1000 persons" -> 1000)
multiplier = population_df["Unit"].str.split(" ").str[0].astype(int)
# Create the population column: multiplier x Value, in persons
population_df.loc[:, "population"] = multiplier * population_df["Value"]
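The unit-scaling step can be verified on a small example. The sketch below rebuilds two population rows by hand: the Afghanistan row comes from the `food_balance_sheet.head(2)` output, and the Albania value is an assumption inferred from the final population table under the same "1000 persons" unit. The dataframe `toy` is hypothetical.

```python
import pandas as pd

# Hand-built sample of population rows (Albania's Value is an inferred, illustrative figure)
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "Unit": ["1000 persons", "1000 persons"],
    "Value": [30552.0, 3173.0],
})

# Extract the leading integer of the Unit string and use it as a multiplier
multiplier = toy["Unit"].str.extract(r"^(\d+)", expand=False).astype(int)

# Population expressed in persons
toy["population"] = multiplier * toy["Value"]
print(toy[["Country", "population"]])
```

Operating on whole columns avoids a Python-level loop over the rows, which matters little for a few hundred countries but scales better on the larger dataframes.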

In [58]:
# Keep only the country identifiers and the computed population
population_df = population_df[["Country Code", "Country", "population"]]

In [60]:
# Check the first rows
population_df.head()

| | Country Code | Country | population |
|---|---|---|---|
| 0 | 2 | Afghanistan | 30552000.0 |
| 469 | 3 | Albania | 3173000.0 |
| 1284 | 4 | Algeria | 39208000.0 |
| 2132 | 7 | Angola | 21472000.0 |
| 2842 | 8 | Antigua and Barbuda | 90000.0 |
