# 1 Introduction

> Notebook for preparing the data from the FAO suite of food security indicators
> 
> Open data from https://www.fao.org/faostat/en/#data/SDGB

## 1.1 Business needs

> Loading and preparation of the open data

## 1.2 Dependencies

> What dependencies does the reader need to download in order to run the whole notebook? Write the pip or conda command to allow easy access to those. Be sure to include specific versions if needed.

In [1]:

# !pip install pandas
# !pip install -U numpy==1.23.2
# !pip install ydata-profiling
# !pip install dtale
# !pip install openpyxl
# !pip install --upgrade pip
# !pip install --upgrade Pillow

## 1.3 Imports

> Import all libraries needed. Include the latest versions used to allow reproducibility.

In [2]:
import pandas as pd
from ydata_profiling import ProfileReport      # For exploratory data analysis
import warnings
import dtale

In [3]:
warnings.filterwarnings('ignore')
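
To support the reproducibility goal stated in this section, the installed versions can be recorded next to the imports. The following is a minimal sketch, assuming the PyPI distribution names match the install commands in section 1.2; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI (assumed to match the installs above).
PACKAGES = ["pandas", "numpy", "ydata-profiling", "dtale"]


def installed_versions(packages):
    """Return {package: version string, or None when it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

The printed lines can be pasted into a requirements file as they are.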

## 1.4 Global variables

> Declare all variables that will retain a constant value throughout the notebook. For example, the path to the input data should be a global variable.
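
The notebook currently hard-codes its paths inside each cell. One possible way to collect them here is sketched below; the constant names are hypothetical, while the paths themselves are the ones used in the cells that follow.

```python
from pathlib import Path

# Hypothetical constant names; the paths are the ones used later in this notebook.
RAW_DIR = Path("../data/raw")
REPORTS_DIR = Path("../reports")

SDG_CSV = RAW_DIR / "SDBG" / "FAOSTAT_data_en_6-22-2024.csv"
AREA_CODE_CSV = RAW_DIR / "HDI" / "area_code.csv"
PROFILE_RAW_HTML = REPORTS_DIR / "report_SDG.html"
PROFILE_CLEAN_HTML = REPORTS_DIR / "report2_SDG.html"
OUTPUT_CSV = Path("SDBG-Jailson.csv")
```

`pd.read_csv` and `DataFrame.to_csv` accept `Path` objects directly, so the constants could replace the string literals without further changes.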

# 2 Exploratory data analysis

> Write a brief summary of which analyses and wrangling steps were performed.

## 2.1 Data reading

> Give a brief introduction to the data used and their sources.

In [4]:
df = pd.read_csv("../data/raw/SDBG/FAOSTAT_data_en_6-22-2024.csv", encoding='utf-8')

### Initial analysis

In [5]:
df.columns

Index(['Domain Code', 'Domain', 'Area Code (M49)', 'Area', 'Element Code',
       'Element', 'Item Code (SDG)', 'Item', 'Year Code', 'Year', 'Unit',
       'Value', 'Flag', 'Flag Description', 'Note'],
      dtype='object')

In [6]:
df.describe()

Unnamed: 0,Area Code (M49),Element Code,Year Code,Year,Value
count,57427.0,57427.0,57427.0,57427.0,44773.0
mean,430.823411,6159.884445,2018.697094,2018.697094,6000.389
std,252.119542,42.915204,1.258759,1.258759,69167.13
min,4.0,6121.0,2017.0,2017.0,-57.84
25%,214.0,6121.0,2018.0,2018.0,0.9
50%,418.0,6173.0,2018.0,2018.0,12.67
75%,646.0,6178.0,2020.0,2020.0,98.0
max,894.0,6241.0,2021.0,2021.0,5064050.0
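
In the summary above, `Value` has 44773 non-null entries while the other numeric columns have 57427, which points to missing values. A per-column count of nulls makes this explicit. The sketch below uses a made-up toy frame instead of the FAOSTAT file, and `missing_summary` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Count and percentage of nulls per column, keeping only columns with gaps."""
    n_missing = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": n_missing,
        "pct": (n_missing / len(frame) * 100).round(1),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for the FAOSTAT data (values are made up).
toy = pd.DataFrame({
    "Area": ["A", "A", "B", "B"],
    "Unit": ["%", None, "%", "%"],
    "Value": [1.0, np.nan, np.nan, 4.0],
})
print(missing_summary(toy))
```

Calling `missing_summary(df)` on the loaded data would give the same kind of table for every column, including the non-numeric ones that `describe()` leaves out.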


In [7]:
df.head().T

Unnamed: 0,0,1,2,3,4
Domain Code,SDGB,SDGB,SDGB,SDGB,SDGB
Domain,SDG Indicators,SDG Indicators,SDG Indicators,SDG Indicators,SDG Indicators
Area Code (M49),4,4,4,4,4
Area,Afghanistan,Afghanistan,Afghanistan,Afghanistan,Afghanistan
Element Code,6132,6132,6132,6132,6132
Element,Value,Value,Value,Value,Value
Item Code (SDG),SN_ITK_DEFCN,SN_ITK_DEFCN,SN_ITK_DEFCN,SN_ITK_DEFCN,SN_ITK_DEFCN
Item,2.1.1 Number of undernourished people,2.1.1 Number of undernourished people,2.1.1 Number of undernourished people,2.1.1 Number of undernourished people,2.1.1 Number of undernourished people
Year Code,2017,2018,2019,2020,2021
Year,2017,2018,2019,2020,2021


## 2.2 Data wrangling

> Guide the reader through which data wrangling steps were needed and why.

### Initial exploratory analysis

In [8]:
profile = ProfileReport(df)
profile.to_file("../reports/report_SDG.html")

profile.to_notebook_iframe()

#### Analysis of the Element column

In [9]:
df.groupby('Element')['Element'].agg('count')

Element
Value                           54647
Value (2017 constant prices)     2780
Name: Element, dtype: int64
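
The same per-category count can also be written with `Series.value_counts`, which additionally offers relative frequencies through `normalize=True`. A small sketch on made-up rows:

```python
import pandas as pd

# Toy column standing in for df['Element'] (rows are made up).
toy = pd.DataFrame({"Element": ["Value", "Value", "Value (2017 constant prices)"]})

counts = toy["Element"].value_counts()                 # absolute count per category
shares = toy["Element"].value_counts(normalize=True)   # fraction of rows per category
print(counts)
print(shares.round(3))
```

Both calls skip null entries by default, like the `groupby(...).agg('count')` form used above.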

In [10]:
df[df['Element'] != 'Value'].head()

Unnamed: 0,Domain Code,Domain,Area Code (M49),Area,Element Code,Element,Item Code (SDG),Item,Year Code,Year,Unit,Value,Flag,Flag Description,Note
410,SDGB,SDG Indicators,8,Albania,6174,Value (2017 constant prices),SI_AGR_LSFP-F,2.3.2 Average income of large-scale food produ...,2017,2017,USD_PPP/cap,,O,Missing value,Not Available | NA | NaN = A figure was not pr...
411,SDGB,SDG Indicators,8,Albania,6174,Value (2017 constant prices),SI_AGR_LSFP-F,2.3.2 Average income of large-scale food produ...,2018,2018,USD_PPP/cap,,O,Missing value,Not Available | NA | NaN = A figure was not pr...
412,SDGB,SDG Indicators,8,Albania,6174,Value (2017 constant prices),SI_AGR_LSFP-F,2.3.2 Average income of large-scale food produ...,2019,2019,USD_PPP/cap,,O,Missing value,Not Available | NA | NaN = A figure was not pr...
413,SDGB,SDG Indicators,8,Albania,6174,Value (2017 constant prices),SI_AGR_LSFP-F,2.3.2 Average income of large-scale food produ...,2020,2020,USD_PPP/cap,,O,Missing value,Not Available | NA | NaN = A figure was not pr...
414,SDGB,SDG Indicators,8,Albania,6174,Value (2017 constant prices),SI_AGR_LSFP-F,2.3.2 Average income of large-scale food produ...,2021,2021,USD_PPP/cap,,O,Missing value,Not Available | NA | NaN = A figure was not pr...


#### Analysis of Item Code

In [11]:
df.groupby(['Item Code (SDG)'])['Item Code (SDG)'].agg('count')

Item Code (SDG)
AG_FPA_CFPI           955
AG_FPA_COMM-0111      275
AG_FPA_COMM-0112      275
AG_FPA_COMM-0113      360
AG_FPA_COMM-0114       90
                     ... 
SN_ITK_DEFCN          860
SP_GNP_WNOWNS         245
SP_LGL_LNDAGSEC-F     245
SP_LGL_LNDAGSEC-M     245
SP_LGL_LNDAGSEC-_T    245
Name: Item Code (SDG), Length: 204, dtype: int64

In [12]:
df.groupby(['Item Code (SDG)', 'Item'])[['Item Code (SDG)', 'Item']].agg('count').to_excel('../reports/food_security_industry_features_SDG.xlsx') # Export to an Excel file

#### Analysis of the Year and Year Code columns

In [13]:
df.groupby('Year')['Year'].agg('count')

Year
2017     8490
2018    24032
2019     8613
2020     8967
2021     7325
Name: Year, dtype: int64

#### Analysis of the Area Code column

In [14]:

df.groupby('Area Code (M49)')['Area Code (M49)'].agg('count')

Area Code (M49)
4      326
8      338
12     270
16      80
20     110
      ... 
862    279
876     76
882    194
887    290
894    264
Name: Area Code (M49), Length: 250, dtype: int64

#### Conclusions

- Missing values:

| Field | Count | %     |
|-------|-------|-------|
| Unit  | 2960  | 5.2%  |
| Value | 12654 | 22.0% |

- Flag equal to Q marks rows with empty values

> Drop the Note column
>
> Drop rows with null fields
>
> About 73% of the dataset remains (41994 of 57427 rows)

- The Year and Year Code fields are equivalent

> Drop Year Code
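
The equivalence of `Year` and `Year Code` can be confirmed with a row-wise comparison before one of them is dropped. A minimal sketch on a toy frame (the real check would use `df` instead of `toy`):

```python
import pandas as pd

# Toy frame with the two columns under comparison (values are made up).
toy = pd.DataFrame({"Year Code": [2017, 2018, 2019], "Year": [2017, 2018, 2019]})

# True only when every row has the same value in both columns.
years_match = bool((toy["Year"] == toy["Year Code"]).all())
print(years_match)

# Drop the redundant column only when the check passes.
if years_match:
    toy = toy.drop(columns="Year Code")
```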

### Data cleaning and treatment

In [15]:
df2 = df.drop(['Area Code (M49)', 'Note', 'Element Code', 'Element', 'Item Code (SDG)', 'Year Code', 'Flag', 'Flag Description'],axis=1).dropna().copy(deep=True)

df2.count()

Domain Code    41994
Domain         41994
Area           41994
Item           41994
Year           41994
Unit           41994
Value          41994
dtype: int64
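
Comparing the row count before and after `dropna()` shows how much of the dataset survives the cleaning step. The sketch below illustrates the calculation on a made-up toy frame; on the real data it would compare `len(df2)` with `len(df)`.

```python
import numpy as np
import pandas as pd

# Toy frame: one row lacks a Value, another lacks a Unit (values are made up).
toy = pd.DataFrame({
    "Item": ["x", "x", "y", "y"],
    "Unit": ["%", "%", None, "km2"],
    "Value": [1.0, np.nan, 3.0, 4.0],
})

cleaned = toy.dropna()                 # drops any row with at least one null
kept_share = len(cleaned) / len(toy)   # fraction of rows retained
print(f"kept {len(cleaned)} of {len(toy)} rows ({kept_share:.1%})")
```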

In [16]:
df2.loc[0:1000]

Unnamed: 0,Domain Code,Domain,Area,Item,Year,Unit,Value
0,SDGB,SDG Indicators,Afghanistan,2.1.1 Number of undernourished people,2017,million No,8.10
1,SDGB,SDG Indicators,Afghanistan,2.1.1 Number of undernourished people,2018,million No,8.80
2,SDGB,SDG Indicators,Afghanistan,2.1.1 Number of undernourished people,2019,million No,10.20
3,SDGB,SDG Indicators,Afghanistan,2.1.1 Number of undernourished people,2020,million No,11.20
4,SDGB,SDG Indicators,Afghanistan,2.1.1 Number of undernourished people,2021,million No,12.00
...,...,...,...,...,...,...,...
984,SDGB,SDG Indicators,American Samoa,15.4.2a Mountain green cover area: Remaining m...,2018,km2,33.17
990,SDGB,SDG Indicators,American Samoa,15.4.2a Mountain green cover area: Remaining m...,2018,km2,18.91
992,SDGB,SDG Indicators,American Samoa,15.4.2a Mountain green cover area: Total; Land...,2018,km2,14.26
994,SDGB,SDG Indicators,American Samoa,15.4.2a Mountain green cover area: Total; Land...,2018,km2,0.40


### Exploratory analysis after data cleaning

In [17]:
# Exploratory analysis after data cleaning
profile = ProfileReport(df2)
profile.to_file("../reports/report2_SDG.html")

profile.to_notebook_iframe()

### Data Preparation

#### Pivot by Item

In [18]:
df_pivot = df2.pivot(index=['Area', 'Year'],
                    columns='Item',
                    values='Value').reset_index()
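
`DataFrame.pivot` raises a `ValueError` when two rows share the same index and column combination, so it can help to check the `(Area, Year, Item)` key for duplicates first. The sketch below shows such a check on a made-up toy frame; it is not part of the original notebook.

```python
import pandas as pd

# Toy long-format frame with the same column names as df2 (values are made up).
toy = pd.DataFrame({
    "Area": ["A", "A", "A", "B"],
    "Year": [2017, 2017, 2018, 2017],
    "Item": ["i1", "i2", "i1", "i1"],
    "Value": [1.0, 2.0, 3.0, 4.0],
})

# Rows sharing the same key would make pivot() fail, so list them up front.
key = ["Area", "Year", "Item"]
dupes = toy[toy.duplicated(subset=key, keep=False)]
if not dupes.empty:
    raise ValueError(f"{len(dupes)} rows share an (Area, Year, Item) key")

# One row per (Area, Year), one column per Item; missing combinations become NaN.
wide = toy.pivot(index=["Area", "Year"], columns="Item", values="Value").reset_index()
print(wide)
```

If duplicates did turn up, `pivot_table` with an explicit `aggfunc` would be one alternative, at the cost of silently combining the repeated values.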

In [19]:
df_pivot

Item,Area,Year,14.4.1 Proportion of fish stocks within biologically sustainable levels,14.6.1 International instruments to combat IUU fishing,14.7.1 Sustainable fisheries as a proportion of GDP,14.b.1 Protection of access rights for small-scale fisheries,15.1.1 Forest area,15.1.1 Forest area as a proportion of total land area,15.1.1 Land area,15.2.1 Above-ground biomass in forest,...,5.a.1 Share of women among owners or rights-bearers of agricultural land,5.a.2 Legal guarantees to women’s equal rights to land ownership and/or control,6.4.1 Water Use Efficiency (Agriculture (ISIC4 A01 A0210 A0322)),6.4.1 Water Use Efficiency (Industries),6.4.1 Water Use Efficiency (No breakdown),6.4.1 Water Use Efficiency (Services (G to T)),6.4.2 Level of water stress (Agriculture (ISIC4 A01 A0210 A0322)),6.4.2 Level of water stress (Industries),6.4.2 Level of water stress (No breakdown),6.4.2 Level of water stress (Services (G to T))
0,Afghanistan,2017,,,,,1208.44,1.85,65223.0,,...,,,0.11,9.32,0.77,57.81,53.75,0.46,54.76,0.55
1,Afghanistan,2018,,,,,1208.44,1.85,65223.0,,...,,,0.10,13.60,0.80,59.37,53.75,0.46,54.76,0.55
2,Afghanistan,2019,,,,,1208.44,1.85,65223.0,,...,,,0.12,14.55,0.82,57.99,53.75,0.46,54.76,0.55
3,Afghanistan,2020,,,,2.0,1208.44,1.85,65223.0,,...,,,0.12,14.50,0.79,55.58,53.75,0.46,54.76,0.55
4,Afghanistan,2021,,,,,,,,,...,,,0.12,12.20,0.60,38.30,53.75,0.46,54.76,0.55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1245,Åland Islands,2017,,,,,,,158.0,,...,,,,,,,,,,
1246,Åland Islands,2018,,,,,,,158.0,,...,,,,,,,,,,
1247,Åland Islands,2019,,,,,,,158.0,,...,,,,,,,,,,
1248,Åland Islands,2020,,,,,,,158.0,,...,,,,,,,,,,


In [21]:
# List of countries and territories to remove
remover = [
    'China, Hong Kong SAR', 'China, Macao SAR', 'China, Taiwan Province of',
    'Anguilla', 'Aruba', 'Bonaire, Sint Eustatius and Saba',
    'Bouvet Island', 'British Virgin Islands', 'Cayman Islands',
    'Chagos Archipelago', 'Channel Islands', 'Christmas Island',
    'Cocos (Keeling) Islands', 'Curaçao', 'Falkland Islands (Malvinas)',
    'Faroe Islands', 'French Guiana', 'French Southern Territories',
    'Gibraltar', 'Guadeloupe', 'Guam',
    'Guernsey', 'Heard and McDonald Islands', 'Isle of Man',
    'Jersey', 'Liechtenstein', 'Martinique',
    'Mayotte', 'Monaco', 'Montserrat',
    'Norfolk Island', 'Northern Mariana Islands', 'Pitcairn',
    'Réunion', 'Saint Barthélemy', 'Saint Helena, Ascension and Tristan da Cunha',
    'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'San Marino',
    'Sark', 'Sint Maarten (Dutch part)', 'South Georgia and the South Sandwich Islands',
    'Svalbard and Jan Mayen Islands', 'Turks and Caicos Islands', 'United States Minor Outlying Islands',
    'United States Virgin Islands', 'Wallis and Futuna Islands',
    'Western Sahara', 'Åland Islands', 'Holy See'
]

# .copy() makes the filtered frame independent, so later column assignments are safe
df_pivot = df_pivot[~df_pivot['Area'].isin(remover)].copy()
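
Because `isin` silently ignores names that do not occur in the data, a misspelled entry in the removal list would go unnoticed. A set difference can flag such entries. The sketch below uses made-up toy data with one deliberately misspelled name.

```python
import pandas as pd

# Toy data; "Gibralter" is deliberately misspelled to show what the check catches.
toy = pd.DataFrame({"Area": ["Brazil", "Aruba", "Guam", "Chile"]})
to_remove = ["Aruba", "Guam", "Gibralter"]

# Names in the removal list that never appear in the Area column.
not_found = sorted(set(to_remove) - set(toy["Area"]))
print(not_found)

filtered = toy[~toy["Area"].isin(to_remove)].copy()
```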

In [22]:
# Some countries were not found; adjust the DataFrame so the names match those in the area code file
rename_country = {
    'China, mainland': 'China',
}

# Rename the countries in df_pivot
df_pivot['Area'] = df_pivot['Area'].replace(rename_country)

In [23]:
df_codigo = pd.read_csv('../data/raw/HDI/area_code.csv', sep=',', encoding='utf-8',low_memory=False)

In [24]:
# Add the area code to the DataFrame
df_merged = pd.merge(df_pivot, df_codigo, on='Area', how='left')
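
A left merge keeps areas that have no match in the code table and leaves their `Area Code` empty. The `indicator=True` option of `merge` makes these rows easy to list, which is one way to find names that still need renaming. The sketch below uses toy frames; the contents of `area_code.csv` are not shown in this notebook, so the code values here are illustrative only.

```python
import pandas as pd

# Toy frames; "China" has no entry in the code table on purpose.
left = pd.DataFrame({"Area": ["Brazil", "China", "Chile"], "Year": [2020, 2020, 2020]})
codes = pd.DataFrame({"Area": ["Brazil", "Chile"], "Area Code": [76, 152]})

# indicator=True adds a _merge column; validate checks that each Area has one code at most.
merged = left.merge(codes, on="Area", how="left", indicator=True, validate="many_to_one")
unmatched = sorted(merged.loc[merged["_merge"] == "left_only", "Area"].unique())
print(unmatched)

# Remove the helper column once the check is done.
merged = merged.drop(columns="_merge")
```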

In [25]:
# Reorder the columns so that 'Area Code' comes first
cols = ['Area Code'] + [col for col in df_merged.columns if col != 'Area Code']
df_merged = df_merged[cols]

In [26]:
dtale.show(df_merged)



In [27]:
# Save to CSV
df_merged.to_csv('SDBG-Jailson.csv', index=False)

# 3 Data modeling

> Give a brief introduction about the model. If you feel it would help, explain the math behind it.

## 3.1 Identifying key variables

> Guide the reader through the thought process behind which variables were considered key to the model.

## 3.2 Building the model

> Guide the reader through the implementation and the decision to use the chosen model.

## 3.3 Extracting insights

> Be very thorough about which insights can be concluded from the data and how.

# 4 Conclusion

> Summarize the insights previously found.

## 4.1 Discussion

> What makes those insights relevant and how do they relate to the initial business needs?

## 4.2 Next steps

> If it makes sense, talk about the next logical steps that should follow the conclusion.