# Discovery

The world population is constantly increasing. More people need more food, which in turn means greater use of natural resources.


But which of the products we produce exploit the most resources? And which countries contribute to the exploitation of natural resources?

**Goal**: Discover the top 10 products that use the most water, need the most land and emit the most gas. Then discover which countries are the largest producers in each category.


# Data Selection

In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.options.display.max_colwidth = 100



In [3]:
fao_df = pd.read_csv('FAO.csv', encoding='latin-1')
food_pr_df = pd.read_csv('Food_Production.csv')

# Data Cleaning

### FAO Data

In [9]:
fao_df.head()

Unnamed: 0,Area Abbreviation,Area Code,Area,Item Code,Item,Element Code,Element,Unit,latitude,longitude,Y1961,Y1962,Y1963,Y1964,Y1965,Y1966,Y1967,Y1968,Y1969,Y1970,Y1971,Y1972,Y1973,Y1974,Y1975,Y1976,Y1977,Y1978,Y1979,Y1980,Y1981,Y1982,Y1983,Y1984,Y1985,Y1986,Y1987,Y1988,Y1989,Y1990,Y1991,Y1992,Y1993,Y1994,Y1995,Y1996,Y1997,Y1998,Y1999,Y2000,Y2001,Y2002,Y2003,Y2004,Y2005,Y2006,Y2007,Y2008,Y2009,Y2010,Y2011,Y2012,Y2013
0,AFG,2,Afghanistan,2511,Wheat and products,5142,Food,1000 tonnes,33.94,67.71,1928.0,1904.0,1666.0,1950.0,2001.0,1808.0,2053.0,2045.0,2154.0,1819.0,1963.0,2215.0,2310.0,2335.0,2434.0,2512.0,2282.0,2454.0,2443.0,2129.0,2133.0,2068.0,1994.0,1851.0,1791.0,1683.0,2194.0,1801.0,1754.0,1640.0,1539.0,1582.0,1840.0,1855.0,1853.0,2177.0,2343.0,2407.0,2463.0,2600.0,2668.0,2776.0,3095.0,3249.0,3486.0,3704.0,4164.0,4252.0,4538.0,4605.0,4711.0,4810,4895
1,AFG,2,Afghanistan,2805,Rice (Milled Equivalent),5142,Food,1000 tonnes,33.94,67.71,183.0,183.0,182.0,220.0,220.0,195.0,231.0,235.0,238.0,213.0,205.0,233.0,246.0,246.0,255.0,263.0,235.0,254.0,270.0,259.0,248.0,217.0,217.0,197.0,186.0,200.0,193.0,202.0,191.0,199.0,197.0,249.0,218.0,260.0,319.0,254.0,326.0,347.0,270.0,372.0,411.0,448.0,460.0,419.0,445.0,546.0,455.0,490.0,415.0,442.0,476.0,425,422
2,AFG,2,Afghanistan,2513,Barley and products,5521,Feed,1000 tonnes,33.94,67.71,76.0,76.0,76.0,76.0,76.0,75.0,71.0,72.0,73.0,74.0,71.0,70.0,72.0,76.0,77.0,80.0,60.0,65.0,64.0,64.0,60.0,55.0,53.0,51.0,48.0,46.0,46.0,47.0,46.0,43.0,43.0,40.0,50.0,46.0,41.0,44.0,50.0,48.0,43.0,26.0,29.0,70.0,48.0,58.0,236.0,262.0,263.0,230.0,379.0,315.0,203.0,367,360
3,AFG,2,Afghanistan,2513,Barley and products,5142,Food,1000 tonnes,33.94,67.71,237.0,237.0,237.0,238.0,238.0,237.0,225.0,227.0,230.0,234.0,223.0,219.0,225.0,240.0,244.0,255.0,185.0,203.0,198.0,202.0,189.0,174.0,167.0,160.0,151.0,145.0,145.0,148.0,145.0,135.0,132.0,120.0,155.0,143.0,125.0,138.0,159.0,154.0,141.0,84.0,83.0,122.0,144.0,185.0,43.0,44.0,48.0,62.0,55.0,60.0,72.0,78,89
4,AFG,2,Afghanistan,2514,Maize and products,5521,Feed,1000 tonnes,33.94,67.71,210.0,210.0,214.0,216.0,216.0,216.0,235.0,232.0,236.0,200.0,201.0,216.0,228.0,231.0,234.0,240.0,228.0,234.0,228.0,226.0,210.0,199.0,192.0,182.0,173.0,170.0,154.0,148.0,137.0,144.0,126.0,90.0,141.0,150.0,159.0,108.0,90.0,99.0,72.0,35.0,48.0,89.0,63.0,120.0,208.0,233.0,249.0,247.0,195.0,178.0,191.0,200,200


In [None]:
fao_df.shape

#### Checking columns dtypes

In [None]:
for col in fao_df.columns:
    print(f'{col :<25}{fao_df[col].dtypes}', end='\n\n')

Changing the *Y2012* and *Y2013* data type from int to float, to match the other year columns

In [51]:
fao_df[['Y2012', 'Y2013']] = fao_df[['Y2012', 'Y2013']].astype(float)
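
The two columns above were spotted by reading the dtype listing. As a minimal sketch on a toy frame (hypothetical values, not the FAO data), the integer-typed year columns could also be found programmatically and cast in one step. The names `toy`, `year_cols` and `int_year_cols` are illustrative only.

```python
import pandas as pd

# Toy frame (hypothetical values): one float year column and two integer ones
toy = pd.DataFrame({
    'Area': ['A', 'B'],
    'Y2011': [1.0, 2.0],
    'Y2012': [3, 4],
    'Y2013': [5, 6],
})

# Year columns are the ones named 'Y' followed by digits
year_cols = [c for c in toy.columns if c.startswith('Y') and c[1:].isdigit()]

# Among them, keep the columns stored as integers
int_year_cols = [c for c in year_cols if pd.api.types.is_integer_dtype(toy[c])]

# Cast only those columns to float; astype returns a new frame
toy = toy.astype({c: float for c in int_year_cols})

print(int_year_cols)
print(toy.dtypes)
```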

#### Checking column names and fixing them if needed

In [54]:
fao_df.columns

Index(['Area Abbreviation', 'Area Code', 'Area', 'Item Code', 'Item',
       'Element Code', 'Element', 'Unit', 'latitude', 'longitude', 'Y1961',
       'Y1962', 'Y1963', 'Y1964', 'Y1965', 'Y1966', 'Y1967', 'Y1968', 'Y1969',
       'Y1970', 'Y1971', 'Y1972', 'Y1973', 'Y1974', 'Y1975', 'Y1976', 'Y1977',
       'Y1978', 'Y1979', 'Y1980', 'Y1981', 'Y1982', 'Y1983', 'Y1984', 'Y1985',
       'Y1986', 'Y1987', 'Y1988', 'Y1989', 'Y1990', 'Y1991', 'Y1992', 'Y1993',
       'Y1994', 'Y1995', 'Y1996', 'Y1997', 'Y1998', 'Y1999', 'Y2000', 'Y2001',
       'Y2002', 'Y2003', 'Y2004', 'Y2005', 'Y2006', 'Y2007', 'Y2008', 'Y2009',
       'Y2010', 'Y2011', 'Y2012', 'Y2013'],
      dtype='object')

In [55]:
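# strip the leading 'Y' from the year columns only (e.g. 'Y1961' -> '1961')
fao_df.rename(columns={n: n[1:] for n in fao_df.columns if n.startswith('Y') and n[1:].isdigit()}, inplace=True)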

#### Checking qualitative columns values

In [None]:
for col in fao_df.columns: 
    if fao_df[col].dtypes == "O" or col == 'Area Code':
        var_unique = fao_df[col].unique()
        var_unique.sort()
        var_n = fao_df[col].nunique()
        print(f'{col}\n\n{var_unique}\n\ncount: {var_n}', end=f'\n\n\n{"-"*100}\n\n\n')

The *Area Abbreviation* and *Area* counts do not match. Let's investigate why.

In [None]:
# check whether any area abbreviation is shared by more than one country
a_code_unique = fao_df.groupby('Area Abbreviation')['Area'].nunique()
a_code_unique[a_code_unique > 1]

As suspected, 3 *Area Abbreviation* values are associated with more than one country. Let's find out which countries fall under the same code.

In [None]:
cond = fao_df['Area Abbreviation'].isin(['AZE', 'THA', 'CHN'])
fao_df.loc[cond].groupby('Area Abbreviation')['Area'].unique()

The investigation shows that *Bahamas* and *The former Yugoslav Republic of Macedonia* are filed under the wrong *Area Abbreviation* code.
Since we don't need that column, we can stop here and just drop it.
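
The two-step check used here (count the distinct areas per abbreviation, then list the areas behind the shared ones) can be tried in isolation. This is a minimal sketch on a toy frame; the codes `AAA`, `BBB`, `CCC` and the area names are hypothetical and are not taken from the FAO data.

```python
import pandas as pd

# Toy frame (hypothetical codes and names, not the FAO data)
toy = pd.DataFrame({
    'Area Abbreviation': ['AAA', 'AAA', 'BBB', 'CCC', 'CCC'],
    'Area': ['Alpha', 'Alpha', 'Beta', 'Gamma', 'Delta'],
})

# Number of distinct areas behind each abbreviation
areas_per_code = toy.groupby('Area Abbreviation')['Area'].nunique()

# Abbreviations shared by more than one area
shared_codes = areas_per_code[areas_per_code > 1].index.tolist()

# Areas behind each shared abbreviation
shared_areas = (
    toy.loc[toy['Area Abbreviation'].isin(shared_codes)]
       .groupby('Area Abbreviation')['Area']
       .unique()
)

print(shared_codes)
print(shared_areas)
```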


#### Dropping unnecessary columns

In [85]:
fao_df.drop(columns=['Area Abbreviation', 'Area Code', 'Item Code', 'Element Code', 'Unit'], inplace=True)

#### Fixing Area column values

A closer look shows that some of the *Area* names need fixing. Let's correct them:

In [None]:
Rename_area = {'Bolivia (Plurinational State of)': 'Bolivia',
               'Cabo Verde': 'Cape Verde',
               'China, mainland': 'China',
               'China, Hong Kong SAR': 'Hong Kong',
               'China, Macao SAR': 'Macao',
               'China, Taiwan Province of': 'Taiwan',
               'Czechia': 'Czech Republic',
               "Democratic People's Republic of Korea": 'North Korea',
               'Iran (Islamic Republic of)': 'Iran',
               'Republic of Korea': 'South Korea',
               'The former Yugoslav Republic of Macedonia': 'Macedonia',
               'Venezuela (Bolivarian Republic of)': 'Venezuela',
               }

# Series.rename with a dict relabels the index; replace maps the values themselves
fao_df['Area'] = fao_df['Area'].replace(Rename_area)
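
A minimal sketch of how `Series.rename` and `Series.replace` differ when given a dict, on a toy column. The entry `'Italy'` and the names `area` and `mapping` are illustrative assumptions. With a dict, `rename` looks up the index labels, while `replace` looks up the values.

```python
import pandas as pd

# Toy column: two names from the mapping above plus one hypothetical untouched name
area = pd.Series(['Czechia', 'Republic of Korea', 'Italy'])
mapping = {'Czechia': 'Czech Republic', 'Republic of Korea': 'South Korea'}

# rename() with a dict looks up the index labels (0, 1, 2), so no value changes
renamed = area.rename(mapping)

# replace() looks up the values; names missing from the dict are kept as they are
replaced = area.replace(mapping)

print(renamed.tolist())
print(replaced.tolist())
```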

#### Checking for duplicates

In [None]:
# check whether there are duplicate rows or columns
fao_df.drop_duplicates(keep='first')

In [10]:
# highlight that the number of rows does not change before and after running drop_duplicates
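
One way to make that comparison explicit is to count the rows before and after dropping duplicates. This is a minimal sketch on a toy frame with hypothetical values; the toy frame deliberately contains one duplicate so that the counts differ.

```python
import pandas as pd

# Toy frame (hypothetical values) with one fully duplicated row
toy = pd.DataFrame({
    'Area': ['A', 'A', 'B'],
    'Item': ['Oats', 'Oats', 'Oats'],
    'Y2013': [1.0, 1.0, 2.0],
})

rows_before = toy.shape[0]

# duplicated() flags every repeat of an earlier row
n_duplicates = int(toy.duplicated(keep='first').sum())

# drop_duplicates returns a new frame, so the result has to be assigned
deduped = toy.drop_duplicates(keep='first')
rows_after = deduped.shape[0]

print(f'rows before: {rows_before}, rows after: {rows_after}, duplicates: {n_duplicates}')
```

If the two counts are equal, the frame contains no duplicated rows.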

#### Checking latitude and longitude values

Let's check whether latitude and longitude contain any invalid values. Latitude ranges from -90 to 90, longitude from -180 to 180.

In [8]:
fao_df[['longitude', 'latitude']].agg(['min', 'max'])

Unnamed: 0,longitude,latitude
min,-172.1,-40.9
max,179.41,64.96


Both the min and max of latitude and longitude fall within the accepted range.
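
The same range check can also be written as a row filter, which would show the offending rows directly if any existed. This is a minimal sketch on a toy frame with hypothetical coordinates, where the last row is deliberately invalid.

```python
import pandas as pd

# Toy frame (hypothetical coordinates); the last row has an invalid latitude
toy = pd.DataFrame({
    'Area': ['A', 'B', 'C'],
    'latitude': [33.94, -40.9, 95.0],
    'longitude': [67.71, 179.41, 10.0],
})

# True where a coordinate lies inside its valid range (bounds included)
lat_ok = toy['latitude'].between(-90, 90)
lon_ok = toy['longitude'].between(-180, 180)

# Rows with at least one coordinate out of range
invalid_rows = toy.loc[~(lat_ok & lon_ok)]
print(invalid_rows)
```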

#### Checking for invalid values in the *Years* columns

Let's check whether there are any negative production values

In [109]:
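# minimum of each year column (addressed by the 'Y'-prefixed names, as in the output below)
year_min = fao_df.loc[:, 'Y1961':'Y2013'].agg(['min'])
# keep only the years whose minimum is negative
year_min.T[year_min.T['min'] < 0]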

Unnamed: 0,min
Y2012,-169.0
Y2013,-246.0


Now let's investigate which country has a negative production amount, and for which *Item* and *Element* (food or feed)

In [83]:
fao_df.loc[fao_df[['Y2013', 'Y2012']].idxmin().unique(), ['Area', 'Item', 'Element']]

Unnamed: 0,Area,Item,Element
10082,Japan,Oats,Food


A negative production value must be an input error, so we just drop the entire row

In [13]:
fao_df.drop(labels=10082, inplace=True)
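
`idxmin` returns only the row holding the smallest value of each column, so further negative rows, if any existed, would go unnoticed. A minimal sketch of a more general check, on a toy frame with hypothetical values, uses a boolean mask to flag every row with a negative value in any year column. The names `toy`, `year_cols` and `has_negative` are illustrative only.

```python
import pandas as pd

# Toy frame (hypothetical values); row 1 holds negative quantities
toy = pd.DataFrame({
    'Area': ['A', 'B', 'C'],
    'Item': ['Oats', 'Oats', 'Rice'],
    'Y2012': [10.0, -5.0, 3.0],
    'Y2013': [12.0, -7.0, None],
})
year_cols = ['Y2012', 'Y2013']

# True for rows with a negative value in any year column
# (NaN compares as False, so missing values are not flagged)
has_negative = (toy[year_cols] < 0).any(axis=1)

# Inspect the flagged rows, then keep everything else
negative_rows = toy.loc[has_negative, ['Area', 'Item']]
cleaned = toy.loc[~has_negative]

print(negative_rows)
```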

#### Checking for missing values

In [None]:
for col in fao_df.columns:
    n_missing = fao_df[col].isnull().sum()
    n_rows = fao_df.shape[0]
    print(f'''
    {col}
    missing value percentage: {round(n_missing / n_rows * 100, 2)}% ---> {n_missing}/{n_rows} rows
 ''')
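
The same information can be collected into a single table instead of one printout per column. This is a minimal sketch on a toy frame with hypothetical values, and the name `missing` is illustrative. `isna().mean()` gives the fraction of missing values per column, since `True` counts as 1.

```python
import pandas as pd

# Toy frame (hypothetical values) with a few missing entries
toy = pd.DataFrame({
    'Area': ['A', 'B', 'C', 'D'],
    'Y1961': [None, None, 1.0, 2.0],
    'Y2013': [5.0, 6.0, 7.0, None],
})

# One row per column: number and percentage of missing values
missing = pd.DataFrame({
    'missing': toy.isna().sum(),
    'percent': (toy.isna().mean() * 100).round(2),
})

# Keep only the columns that actually contain missing values
missing = missing[missing['missing'] > 0]
print(missing)
```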

In [None]:
# rwm = fao_df.loc[fao_df.isna().any(axis=1)].index.to_list()
# fao_df.loc[rwm]

#### Understanding why there are NaN values

# Data Exploration

# Data Transformation

# Data Visualization

In [None]:
# cond = fao_df['Element'] == 'Feed'
# cond2 = fao_df['Element'] == 'Food'

# feed = fao_df.loc[cond, 'Item']
# food = fao_df.loc[cond2, 'Item']

# differences = np.setdiff1d(food, feed)
# for n, i in enumerate(differences):
#     print(f'{n}{i :>30}')