# Data Wrangling

This first capstone aims to forecast global meat consumption over the next 10 years. We will look for patterns in social, economic, and environmental indicators that could predict consumption.

When downloading the data, we saw that the datasets had different numbers of null values. We'll inspect each dataset and treat its nulls so that we end up with a reasonably sized dataframe.
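
As a rough illustration of this inspection step, the sketch below tabulates null counts per column across several dataframes at once. The `null_report` helper and the toy frames are hypothetical stand-ins and are not part of the original notebook; the real CSVs would be loaded with `pd.read_csv`.

```python
import numpy as np
import pandas as pd


def null_report(frames):
    """Return one row per (dataset, column) with its null count and null share.

    `frames` is a dict mapping a dataset name to a non-empty DataFrame.
    """
    rows = []
    for name, df in frames.items():
        counts = df.isna().sum()
        for col, n in counts.items():
            rows.append(
                {"dataset": name, "column": col, "nulls": int(n), "share": n / len(df)}
            )
    return pd.DataFrame(rows)


# Toy frames standing in for the real CSVs (all values are made up).
toy_frames = {
    "food": pd.DataFrame({"Area": ["A", "B", "C"], "Value": [1.0, np.nan, 3.0]}),
    "pop": pd.DataFrame({"Area": ["A", "B"], "Value": [10.0, 20.0]}),
}

report = null_report(toy_frames)
print(report)
```

Seeing all datasets side by side makes it easier to judge how many rows each `dropna` call is likely to cost before merging.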

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from lib.sb_utils import save_file

## Food Balances
Our food balance sheet tells us the production quantity of each meat type, in units of 1000 tonnes.

In [2]:
food = pd.read_csv('new data/food_balances.csv')
food.head()

Unnamed: 0,Domain,Area,Element,Item,Year,Unit,Value
0,"Food Balances (-2013, old methodology and popu...",Afghanistan,Production,Bovine Meat,1991,1000 tonnes,86.0
1,"Food Balances (-2013, old methodology and popu...",Afghanistan,Production,Bovine Meat,1992,1000 tonnes,86.0
2,"Food Balances (-2013, old methodology and popu...",Afghanistan,Production,Bovine Meat,1993,1000 tonnes,97.0
3,"Food Balances (-2013, old methodology and popu...",Afghanistan,Production,Bovine Meat,1994,1000 tonnes,113.0
4,"Food Balances (-2013, old methodology and popu...",Afghanistan,Production,Bovine Meat,1995,1000 tonnes,130.0


In [3]:
food.shape, food.isna().sum()

((20082, 7),
 Domain        0
 Area          0
 Element       0
 Item          0
 Year          0
 Unit       1545
 Value      1545
 dtype: int64)

In [4]:
food.dropna(inplace=True)
food.reset_index(drop=True, inplace=True)

In [5]:
#Top 50 areas by mean production value across all meat types and years
top50 = food.groupby('Area')['Value'].mean().sort_values(ascending=False)[:50].index
top50

Index(['China, mainland', 'United States of America', 'Brazil', 'USSR',
       'Germany', 'Russian Federation', 'France', 'India', 'Spain', 'Mexico',
       'Argentina', 'Canada', 'Australia', 'Italy',
       'United Kingdom of Great Britain and Northern Ireland', 'Poland',
       'Japan', 'Viet Nam', 'Pakistan', 'Netherlands', 'Philippines',
       'Indonesia', 'Thailand', 'South Africa', 'Ukraine',
       'Iran (Islamic Republic of)', 'Turkey', 'Denmark', 'Republic of Korea',
       'Colombia', 'Belgium', 'Belgium-Luxembourg',
       'China, Taiwan Province of', 'Egypt', 'Czechoslovakia', 'New Zealand',
       'Myanmar', 'Malaysia', 'Venezuela (Bolivarian Republic of)', 'Chile',
       'Romania', 'Sudan', 'Nigeria', 'Peru', 'Hungary', 'Ireland',
       'Serbia and Montenegro', 'Sudan (former)', 'Austria', 'Belarus'],
      dtype='object', name='Area')
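
The ranking above averages `Value` over every meat type and year available for each area, so areas that only exist for part of the period (such as the USSR or Czechoslovakia) are presumably ranked on the years they do have. A minimal sketch of the same idea on made-up numbers, using `Series.nlargest` as an alternative to sorting and slicing (the toy data and variable names are assumptions):

```python
import pandas as pd

# Toy long-format data; values are invented for illustration only.
toy = pd.DataFrame(
    {
        "Area": ["X", "X", "Y", "Y", "Z"],
        "Value": [10.0, 30.0, 5.0, 7.0, 50.0],
    }
)

# Mean production per area, regardless of how many rows each area has.
mean_by_area = toy.groupby("Area")["Value"].mean()

# Two equivalent ways to take the top N areas (N = 2 here).
top2_sorted = mean_by_area.sort_values(ascending=False)[:2].index
top2_nlargest = mean_by_area.nlargest(2).index

print(list(top2_sorted), list(top2_nlargest))
```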

In [6]:
#Next we pivot our data frame so we have columns for each type of meat
food = pd.pivot_table(food, values='Value',index=['Area','Year'], columns='Item').reset_index()
food.head()

Item,Area,Year,Bovine Meat,Mutton & Goat Meat,Pigmeat,Poultry Meat
0,Afghanistan,1991,86.0,137.0,,12.0
1,Afghanistan,1992,86.0,133.0,,12.0
2,Afghanistan,1993,97.0,132.0,,12.0
3,Afghanistan,1994,113.0,134.0,,12.0
4,Afghanistan,1995,130.0,134.0,,12.0
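
The pivot turns each `Item` into its own column, and any area/year/item combination missing from the long table shows up as `NaN`, as with Pigmeat for Afghanistan above. The sketch below reproduces that behaviour on a toy frame (the data are made up). Note that `pd.pivot_table` aggregates with the mean by default, so duplicated area/year/item rows would be averaged silently; the duplicate check is a suggested safeguard rather than something the notebook does.

```python
import pandas as pd

# Toy long-format table; values are invented for illustration only.
long = pd.DataFrame(
    {
        "Area": ["X", "X", "X", "Y"],
        "Year": [1991, 1991, 1992, 1991],
        "Item": ["Bovine Meat", "Pigmeat", "Bovine Meat", "Bovine Meat"],
        "Value": [10.0, 4.0, 12.0, 7.0],
    }
)

# Check that no (Area, Year, Item) key repeats before relying on the default mean.
has_duplicates = long.duplicated(["Area", "Year", "Item"]).any()

wide = pd.pivot_table(
    long, values="Value", index=["Area", "Year"], columns="Item"
).reset_index()
wide.columns.name = None  # drop the leftover "Item" label on the column axis

print(has_duplicates)
print(wide)
```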


In [7]:
top_prod = food[food.Area.isin(top50)].reset_index(drop=True)
top_prod.isna().sum()

Item
Area                   0
Year                   0
Bovine Meat            0
Mutton & Goat Meat     0
Pigmeat               51
Poultry Meat           0
dtype: int64

In [14]:
top_prod.drop(labels='Mutton & Goat Meat', axis=1,inplace=True)
top_prod.head()

Item,Area,Year,Bovine Production,Pig Production,Poultry Production
0,Argentina,1991,2918.0,142.0,411.0
1,Argentina,1992,2784.0,157.0,498.0
2,Argentina,1993,2808.0,230.0,713.0
3,Argentina,1994,2783.0,230.0,753.0
4,Argentina,1995,2688.0,211.0,817.0


In [15]:
top_prod.rename(columns={'Bovine Meat':'Bovine Production', 'Pigmeat':'Pig Production','Poultry Meat':'Poultry Production'},inplace=True)
top_prod.head()

Item,Area,Year,Bovine Production,Pig Production,Poultry Production
0,Argentina,1991,2918.0,142.0,411.0
1,Argentina,1992,2784.0,157.0,498.0
2,Argentina,1993,2808.0,230.0,713.0
3,Argentina,1994,2783.0,230.0,753.0
4,Argentina,1995,2688.0,211.0,817.0


## Population

In [16]:
pop = pd.read_csv('new data/population.csv')
pop.head()

Unnamed: 0,Domain,Area,Year,Unit,Value
0,Annual population,Afghanistan,1991,1000 persons,13299.017
1,Annual population,Afghanistan,1992,1000 persons,14485.546
2,Annual population,Afghanistan,1993,1000 persons,15816.603
3,Annual population,Afghanistan,1994,1000 persons,17075.727
4,Annual population,Afghanistan,1995,1000 persons,18110.657


In [17]:
#Save the population unit (values are in 1000 persons) for later; we can then drop that column
pop_unit=1000

In [18]:
#Clean our dataset of nulls and unnecessary columns
pop.isna().sum()

Domain      0
Area        0
Year        0
Unit      475
Value     475
dtype: int64

In [19]:
pop.dropna(inplace=True)
pop.reset_index(drop=True, inplace=True)
pop.drop(labels=['Domain','Unit'], axis=1, inplace=True)

## Prices
Our prices data shows the Producer Price Index (2014-2016 = 100) for the live weight of each meat type.

In [20]:
prices = pd.read_csv('new data/ppi.csv')
prices.head()

Unnamed: 0,Domain,Area,Element,Item,Year,Value
0,Producer Prices,Afghanistan,Producer Price Index (2014-2016 = 100),"Meat live weight, cattle",1991,17.34
1,Producer Prices,Afghanistan,Producer Price Index (2014-2016 = 100),"Meat live weight, chicken",1991,11.85
2,Producer Prices,Afghanistan,Producer Price Index (2014-2016 = 100),"Meat live weight, pig",1991,
3,Producer Prices,Afghanistan,Producer Price Index (2014-2016 = 100),"Meat live weight, cattle",1992,17.34
4,Producer Prices,Afghanistan,Producer Price Index (2014-2016 = 100),"Meat live weight, chicken",1992,11.85


In [21]:
prices = pd.pivot_table(prices, values='Value',index=['Area','Year'], columns='Item').reset_index()
prices.head()

Item,Area,Year,"Meat live weight, cattle","Meat live weight, chicken","Meat live weight, pig"
0,Afghanistan,1991,17.34,11.85,
1,Afghanistan,1992,17.34,11.85,
2,Afghanistan,1993,17.34,11.85,
3,Afghanistan,1994,18.46,12.61,
4,Afghanistan,1995,18.23,12.46,


In [22]:
prices.isna().sum()

Item
Area                           0
Year                           0
Meat live weight, cattle      74
Meat live weight, chicken    247
Meat live weight, pig        452
dtype: int64

In [23]:
prices.dropna(inplace=True)
prices.reset_index(drop=True, inplace=True)

In [24]:
prices.rename(columns={'Meat live weight, cattle':'Bovine Price','Meat live weight, chicken':'Poultry Price','Meat live weight, pig':'Pig Price'},inplace=True)
prices.head()

Item,Area,Year,Bovine Price,Poultry Price,Pig Price
0,Albania,1993,29.81,28.89,40.25
1,Albania,1994,26.36,42.09,44.48
2,Albania,1995,27.03,51.35,52.74
3,Albania,1996,34.27,50.32,64.2
4,Albania,1997,39.59,45.43,68.79


## CO2 Emissions
Finally, we will look at agricultural emissions by country, reported in kilotonnes of CO2 equivalent.

In [25]:
co2 = pd.read_csv('new data/emissions.csv')
co2.head()

Unnamed: 0,Domain,Area,Element,Item,Year,Source,Unit,Value
0,Emissions Totals,Afghanistan,Emissions (CO2eq) (AR5),IPCC Agriculture,1991,FAO TIER 1,kilotonnes,8760.3622
1,Emissions Totals,Afghanistan,Emissions (CO2eq) (AR5),IPCC Agriculture,1992,FAO TIER 1,kilotonnes,8786.8525
2,Emissions Totals,Afghanistan,Emissions (CO2eq) (AR5),IPCC Agriculture,1993,FAO TIER 1,kilotonnes,8865.5553
3,Emissions Totals,Afghanistan,Emissions (CO2eq) (AR5),IPCC Agriculture,1994,FAO TIER 1,kilotonnes,8947.8165
4,Emissions Totals,Afghanistan,Emissions (CO2eq) (AR5),IPCC Agriculture,1995,FAO TIER 1,kilotonnes,9407.0224


In [26]:
co2_unit = 'kilotonnes'
co2.isna().sum()

Domain       0
Area         0
Element      0
Item         0
Year         0
Source       0
Unit       431
Value      431
dtype: int64

In [27]:
co2.dropna(inplace=True)
co2.reset_index(drop=True, inplace=True)
co2.drop(labels=['Domain','Element','Item','Unit','Source'],axis=1,inplace=True)
co2.head()

Unnamed: 0,Area,Year,Value
0,Afghanistan,1991,8760.3622
1,Afghanistan,1992,8786.8525
2,Afghanistan,1993,8865.5553
3,Afghanistan,1994,8947.8165
4,Afghanistan,1995,9407.0224


## Merging
We'll merge our four dataframes together using inner joins, since we're only interested in years and countries where we have all of the data points.
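
A small sketch of what an inner join on `Area` and `Year` keeps and discards, using toy frames (values are made up). The `validate` and `indicator` arguments are optional extras of `pd.merge` that the notebook itself does not use; they are shown here only as a way to audit the join.

```python
import pandas as pd

# Toy frames; values are invented for illustration only.
left = pd.DataFrame(
    {
        "Area": ["X", "X", "Y"],
        "Year": [1991, 1992, 1991],
        "Bovine Production": [10.0, 12.0, 7.0],
    }
)
right = pd.DataFrame(
    {
        "Area": ["X", "Y", "Z"],
        "Year": [1991, 1991, 1991],
        "Bovine Price": [3.5, 4.0, 9.9],
    }
)

# Inner join: only (Area, Year) pairs present in both frames survive.
# validate="one_to_one" raises if a key is duplicated on either side.
inner = pd.merge(left, right, on=["Area", "Year"], how="inner", validate="one_to_one")

# Outer join with an indicator column to list the rows the inner join drops.
audit = pd.merge(left, right, on=["Area", "Year"], how="outer", indicator=True)
dropped = audit[audit["_merge"] != "both"]

print(inner)
print(dropped)
```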

In [28]:
top_prod.shape, pop.shape, prices.shape, co2.shape

((1274, 5), (6441, 3), (3240, 5), (6373, 3))

In [29]:
merged = pd.merge(top_prod, prices, on=['Area','Year'])
merged.head()

Item,Area,Year,Bovine Production,Pig Production,Poultry Production,Bovine Price,Poultry Price,Pig Price
0,Argentina,1991,2918.0,142.0,411.0,3.66,9.65,4.82
1,Argentina,1992,2784.0,157.0,498.0,4.6,9.57,6.59
2,Argentina,1993,2808.0,230.0,713.0,4.01,10.01,5.83
3,Argentina,1994,2783.0,230.0,753.0,3.91,9.77,6.09
4,Argentina,1995,2688.0,211.0,817.0,3.96,9.52,6.15


In [30]:
#merge the above with population data
merged = pd.merge(pop, merged, on=['Area','Year'])
merged.head()

Unnamed: 0,Area,Year,Value,Bovine Production,Pig Production,Poultry Production,Bovine Price,Poultry Price,Pig Price
0,Argentina,1991,33079.0,2918.0,142.0,411.0,3.66,9.65,4.82
1,Argentina,1992,33529.326,2784.0,157.0,498.0,4.6,9.57,6.59
2,Argentina,1993,33970.111,2808.0,230.0,713.0,4.01,10.01,5.83
3,Argentina,1994,34402.672,2783.0,230.0,753.0,3.91,9.77,6.09
4,Argentina,1995,34828.17,2688.0,211.0,817.0,3.96,9.52,6.15


In [31]:
merged.rename(columns={'Value':'Pop'}, inplace=True)

In [32]:
merged = pd.merge(merged, co2, on=['Area','Year'])
merged.head()

Unnamed: 0,Area,Year,Pop,Bovine Production,Pig Production,Poultry Production,Bovine Price,Poultry Price,Pig Price,Value
0,Argentina,1991,33079.0,2918.0,142.0,411.0,3.66,9.65,4.82,123797.0043
1,Argentina,1992,33529.326,2784.0,157.0,498.0,4.6,9.57,6.59,125478.4229
2,Argentina,1993,33970.111,2808.0,230.0,713.0,4.01,10.01,5.83,123486.8655
3,Argentina,1994,34402.672,2783.0,230.0,753.0,3.91,9.77,6.09,125037.5721
4,Argentina,1995,34828.17,2688.0,211.0,817.0,3.96,9.52,6.15,124291.0892


In [33]:
merged.rename(columns={'Value':'Emissions'}, inplace=True)

In [34]:
merged.isna().sum()

Area                  0
Year                  0
Pop                   0
Bovine Production     0
Pig Production        0
Poultry Production    0
Bovine Price          0
Poultry Price         0
Pig Price             0
Emissions             0
dtype: int64
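
With population and production in one frame, the `pop_unit` saved earlier can be used to put both on a comparable scale. The sketch below is only an assumption about how that unit might be applied: it converts the Argentina 1991 row shown above into kilograms of bovine production per person. The column name `Bovine kg per person` and the constants `prod_unit_tonnes` and `kg_per_tonne` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

pop_unit = 1000          # population values are in 1000 persons
prod_unit_tonnes = 1000  # production values are in 1000 tonnes (assumed from the Unit column)
kg_per_tonne = 1000

# One row copied from the merged output above (Argentina, 1991).
row = pd.DataFrame(
    {
        "Area": ["Argentina"],
        "Year": [1991],
        "Pop": [33079.0],
        "Bovine Production": [2918.0],
    }
)

persons = row["Pop"] * pop_unit
kilograms = row["Bovine Production"] * prod_unit_tonnes * kg_per_tonne
row["Bovine kg per person"] = kilograms / persons

print(row)
```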

In [35]:
merged.shape

(1009, 10)

In [39]:
merged.Year.value_counts().sort_index()

1991    32
1992    33
1993    34
1994    35
1995    35
1996    35
1997    35
1998    35
1999    35
2000    37
2001    37
2002    37
2003    37
2004    37
2005    37
2006    37
2007    37
2008    37
2009    37
2010    37
2011    37
2012    37
2013    37
2014    37
2015    36
2016    36
2017    36
2018    37
Name: Year, dtype: int64

In [40]:
merged.Area.value_counts().sort_index()

Argentina                             28
Australia                             28
Austria                               28
Belarus                               19
Belgium                               19
Brazil                                28
Canada                                28
Chile                                 28
China, mainland                       28
Colombia                              28
Denmark                               28
France                                28
Germany                               28
Hungary                               28
India                                 28
Indonesia                             28
Italy                                 28
Japan                                 28
Malaysia                              28
Mexico                                28
Myanmar                               28
Netherlands                           28
New Zealand                           28
Nigeria                               28
Peru            
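
The counts above show that not every area covers all 28 years (Belarus and Belgium have 19 rows each). If a later model needs the same years for every country, one option is to keep only areas with complete coverage. This is a hedged sketch on toy data; the filtering step is an assumption and is not applied in the notebook.

```python
import pandas as pd

# Toy panel; "Y" is missing one of the three years.
panel = pd.DataFrame(
    {
        "Area": ["X", "X", "X", "Y", "Y"],
        "Year": [1991, 1992, 1993, 1992, 1993],
    }
)

# Number of distinct years in the whole panel, and per area.
n_years = panel["Year"].nunique()
years_per_area = panel.groupby("Area")["Year"].nunique()

# Keep only the areas observed in every year.
complete_areas = years_per_area[years_per_area == n_years].index
balanced = panel[panel["Area"].isin(complete_areas)].reset_index(drop=True)

print(list(complete_areas))
print(balanced)
```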

## Save our data

In [41]:
datapath = 'new data' 
save_file(merged, 'merged.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)Y
Writing file.  "new data/merged.csv"
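
`save_file` comes from this project's `lib.sb_utils`, so its exact behaviour is only inferred from the prompt shown above. For readers without that module, the sketch below is a hypothetical stand-in built on `DataFrame.to_csv`; the `save_csv` name and its `overwrite` flag are assumptions, and it raises an error instead of asking interactively.

```python
import os
import tempfile

import pandas as pd


def save_csv(df, filename, datapath, overwrite=False):
    """Write df to datapath/filename, refusing to overwrite unless asked to."""
    os.makedirs(datapath, exist_ok=True)
    path = os.path.join(datapath, filename)
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(f"A file already exists at {path}")
    df.to_csv(path, index=False)
    return path


# Demonstration in a temporary directory so nothing real is overwritten.
toy = pd.DataFrame({"Area": ["X", "Y"], "Year": [1991, 1991]})
with tempfile.TemporaryDirectory() as tmp:
    path = save_csv(toy, "merged.csv", tmp)
    round_trip = pd.read_csv(path)
    try:
        save_csv(toy, "merged.csv", tmp)  # second write without overwrite=True
        refused = False
    except FileExistsError:
        refused = True

print(round_trip)
print(refused)
```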
