# Discovery

The world population keeps growing. More people need more food, which in turn increases the use of natural resources.


But which of the products we produce consume the most resources? And which countries contribute most to the exploitation of natural resources?

**Goal**: Discover the top 10 products that use the most water, need the most land, and emit the most gas. Then discover which countries are the largest producers in each category.


# Data Selection

In [None]:
import pandas as pd
import numpy as np
import sidetable as stb


In [None]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.options.display.max_colwidth = 100



In [None]:
fao_df = pd.read_csv('FAO.csv', encoding='latin-1')
food_pr_df = pd.read_csv('Food_Production.csv')

# Data Cleaning

### FAO Data

In [None]:
fao_df.head()

In [None]:
fao_df.shape

### Checking columns dtypes

In [None]:
for col in fao_df.columns:
    print(f'{col :<25}{fao_df[col].dtypes}', end='\n\n')

Changing the data type of *Y2012* and *Y2013* from int to float:

In [None]:
fao_df[['Y2012', 'Y2013']] = fao_df[['Y2012', 'Y2013']].astype(float)

### Checking column names and fixing them if needed

In [None]:
fao_df.columns

In [None]:
# strip the leading 'Y' from the year columns (e.g. 'Y1961' -> '1961')
fao_df.rename(columns={n: n[1:] for n in fao_df.columns if n.startswith('Y')}, inplace=True)
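
Matching on a single letter can catch more columns than intended. As a minimal sketch on a toy frame (the `Year Code` column is hypothetical and not part of the FAO data), a regular expression can restrict the rename to names that are exactly `Y` followed by four digits:

```python
import re

import pandas as pd

# Toy frame, invented for illustration; 'Year Code' is a hypothetical column
# that contains a 'Y' but is not a year column.
toy_df = pd.DataFrame(columns=['Area', 'Year Code', 'Y2012', 'Y2013'])

# Only names of the form 'Y' + four digits are rewritten; others pass through unchanged.
toy_df = toy_df.rename(columns=lambda name: re.sub(r'^Y(\d{4})$', r'\1', name))

print(list(toy_df.columns))
```

With this pattern, `Year Code` keeps its name while `Y2012` and `Y2013` lose the prefix.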

### Checking for duplicates

In [None]:
print(f'Number of rows before dropping duplicates: {fao_df.shape[0] :>6}')
fao_df = fao_df.drop_duplicates(keep='first')  # reassign: drop_duplicates returns a new DataFrame
print(f'Number of rows after dropping duplicates: {fao_df.shape[0] :>7}')

Comparing the number of rows before and after dropping duplicates, I can see there are no duplicates.
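
A minimal sketch on a toy frame (the data is invented for illustration) shows how the duplicate check behaves: `duplicated()` gives the count directly, and `drop_duplicates()` returns a new DataFrame rather than changing the original.

```python
import pandas as pd

# Toy data, invented for illustration: the second and third rows are identical.
toy_df = pd.DataFrame({
    'Area': ['Italy', 'Spain', 'Spain'],
    'Item': ['Wheat', 'Rice', 'Rice'],
    '2013': [10.0, 5.0, 5.0],
})

# duplicated() flags every repeat of an earlier row, so the sum is the duplicate count.
n_duplicates = int(toy_df.duplicated().sum())

# drop_duplicates() returns a new DataFrame; toy_df itself keeps all its rows.
deduped_df = toy_df.drop_duplicates(keep='first')

print(f'Duplicates found: {n_duplicates}')
print(f'Rows before: {toy_df.shape[0]}, rows after: {deduped_df.shape[0]}')
```

The row count only changes on the object that receives the result, so the before/after comparison has to read from that object.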


### Checking qualitative columns values

In [None]:
years = fao_df.columns[fao_df.columns.get_loc('1961'):]
fao_df.stb.counts(exclude=['number'])

In [None]:
to_check = ['Area', 'Area Abbreviation', 'Item']
for col in to_check:
    print(f'{col}\n{sorted(fao_df[col].unique())}', end=f'\n{"-"*100}\n')

The number of unique *Area Abbreviation* values does not match the number of unique *Area* values. Let's investigate why.

In [None]:
# check whether any area code is shared by more than one country
a_code_unique = fao_df.groupby('Area Abbreviation')['Area'].nunique()
a_code_unique[a_code_unique > 1]

As suspected, 3 *Area Abbreviation* codes are associated with more than one country. Let's find out which countries fall under the same code.

In [None]:
cond = fao_df['Area Abbreviation'].isin(['AZE', 'THA', 'CHN'])
fao_df.loc[cond].groupby('Area Abbreviation')['Area'].unique()

The investigation shows that *Bahamas* and *The former Yugoslav Republic of Macedonia* are assigned the wrong *Area Abbreviation* code.
Since we don't need that column, we can stop here and just drop it.


#### Dropping unnecessary columns

In [None]:
fao_df.drop(columns=['Area Abbreviation', 'Area Code', 'Item Code', 'Element Code', 'Unit'], inplace=True)

#### Fixing Area column values

After a close look, some of the *Area* names are not correct. Let's correct them:

In [None]:
Rename_area = {'Bolivia (Plurinational State of)':'Bolivia',
                      'Cabo Verde':'Cape Verde',
                      'China, mainland':'China',
                      'China, Hong Kong SAR':'Hong Kong',
                      'China, Macao SAR':'Macao',
                      'China, Taiwan Province of':'Taiwan',
                      'Czechia':'Czech Republic',
                      "Democratic People's Republic of Korea":'North Korea',
                      'Iran (Islamic Republic of)':'Iran',
                      'Republic of Korea':'South Korea',
                      'The former Yugoslav Republic of Macedonia':'Macedonia',
                      'Venezuela (Bolivarian Republic of)':'Venezuela',
                      }

# replace() maps the values; Series.rename() would only relabel the index
fao_df['Area'] = fao_df['Area'].replace(Rename_area)
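
As a small illustration of mapping values in a Series (toy data, using two of the names from the dictionary above), the sketch below contrasts `Series.rename`, which relabels the index, with `Series.replace`, which substitutes the values themselves:

```python
import pandas as pd

# Toy Series, invented for illustration.
toy_area = pd.Series(['China, mainland', 'Czechia', 'Italy'], name='Area')
toy_mapping = {'China, mainland': 'China', 'Czechia': 'Czech Republic'}

# rename() with a dict looks up index labels (0, 1, 2 here), so the values stay as they were.
renamed = toy_area.rename(toy_mapping)

# replace() substitutes matching values and leaves the others untouched.
replaced = toy_area.replace(toy_mapping)

print(renamed.tolist())
print(replaced.tolist())
```

Only the second call changes the country names, which is the behaviour we want for the *Area* column.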

#### Checking latitude and longitude values

Let's check whether latitude and longitude contain any invalid values. Latitude ranges from -90 to 90, and longitude ranges from -180 to 180.

In [None]:
fao_df[['longitude', 'latitude']].agg(['min', 'max'])

Both the max and min of latitude and longitude fall within the accepted range.
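
The same range check can also be written as a row-level filter. This is a minimal sketch on invented coordinates, not the FAO data, that lists the rows falling outside the valid ranges:

```python
import pandas as pd

# Toy coordinates, invented for illustration: rows 1 and 2 are out of range.
toy_coords = pd.DataFrame({
    'latitude': [41.87, -95.0, 35.86],
    'longitude': [12.57, 20.0, 190.0],
})

# between() is inclusive on both ends by default.
valid_lat = toy_coords['latitude'].between(-90, 90)
valid_lon = toy_coords['longitude'].between(-180, 180)

# Keep only the rows where at least one coordinate is invalid.
invalid_rows = toy_coords.loc[~(valid_lat & valid_lon)]
print(invalid_rows)
```

An empty result would mean every coordinate is valid, and a non-empty one points straight at the offending rows.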

#### Checking if there are some invalid values in the *Years* columns

Let's check whether there are any negative production numbers.

In [None]:
cond = fao_df.loc[:, '1961':'2013'].agg(['min'])
cond.T[cond.T['min'] < 0]

Now let's investigate which country has a negative amount of production for which *Item* and *Element* (food or feed)

In [None]:
fao_df.loc[fao_df[['2013', '2012']].idxmin().unique(), ['Area', 'Item', 'Element']]

A negative production number must be an input error, so we just drop the entire row.

In [None]:
fao_df.drop(labels=10082, inplace=True)
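
As an alternative to dropping by a fixed index label, the rows with negative values can be selected with a boolean mask. This is a minimal sketch on a toy frame (data and the `year_cols` name are invented for illustration):

```python
import pandas as pd

# Toy data, invented for illustration: only the last row has a negative value.
toy_df = pd.DataFrame({
    'Area': ['Italy', 'Spain', 'Japan'],
    'Item': ['Wheat', 'Rice', 'Oats'],
    '2012': [3.0, 4.0, -1.0],
    '2013': [2.0, 6.0, 5.0],
})
year_cols = ['2012', '2013']

# True for every row that has at least one negative value in the year columns.
has_negative = (toy_df[year_cols] < 0).any(axis=1)

# Inspect the offending rows, then keep everything else.
print(toy_df.loc[has_negative, ['Area', 'Item']])
cleaned_df = toy_df.loc[~has_negative]
```

The mask does not depend on where the bad row sits, so it keeps working if the data is reloaded or reindexed.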

#### Checking for missing values

In [None]:
fao_df.stb.missing(clip_0=True, style=True)

First let's drop all the rows that contain all NaN values, if there are any

In [None]:
fao_df = fao_df.dropna(how='all')

Now let's fill all the remaining NaN values with 0

In [None]:
fao_df.fillna(0, inplace=True)

Lastly, we drop all the rows that have only 0 values.

In [None]:
# compare only the year columns: the text columns are never equal to 0
fao_df = fao_df.loc[(fao_df.loc[:, '1961':'2013'] != 0).any(axis=1)]
fao_df
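
The fill-then-filter steps can be sketched on a toy frame (invented data, with `year_cols` as an illustrative name). The comparison is limited to the numeric year columns, since a text column such as *Area* always differs from 0:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: 'Spain' has no production in any year.
toy_df = pd.DataFrame({
    'Area': ['Italy', 'Spain', 'Japan'],
    '2012': [3.0, np.nan, 0.0],
    '2013': [np.nan, np.nan, 7.0],
})
year_cols = ['2012', '2013']

# Missing values become 0, as in the cell above.
filled_df = toy_df.fillna(0)

# A row is kept only if at least one year column is non-zero.
has_production = (filled_df[year_cols] != 0).any(axis=1)
result_df = filled_df.loc[has_production]

print(result_df)
```

Here only the all-zero row is removed, while rows with a single non-zero year survive.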

# Data Exploration

# Data Transformation

# Data Visualization

In [None]:
# cond = fao_df['Element'] == 'Feed'
# cond2 = fao_df['Element'] == 'Food'

# feed = fao_df.loc[cond, 'Item']
# food = fao_df.loc[cond2, 'Item']

# differences = np.setdiff1d(food, feed)
# for n, i in enumerate(differences):
#     print(f'{n}{i :>30}')