# Data Wrangling

This first capstone aims to forecast global meat consumption over the next 10 years. We will look for patterns in social, economic, and environmental indicators that could be predictors of consumption.

When downloading the data, we saw that the datasets had different numbers of null values. We'll inspect each dataset and handle its nulls so that we end up with a reasonably sized dataframe.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from lib.sb_utils import save_file

## Population

We start by investigating the population dataset. Intuitively, we will want to prioritize the most populous countries since this likely corresponds with richer information about consumption trends.

In [2]:
pop = pd.read_csv('new data/population.csv')
pop.head()

Unnamed: 0,Domain,Area,Year,Unit,Value
0,Annual population,Afghanistan,1991,1000 persons,13299.017
1,Annual population,Afghanistan,1992,1000 persons,14485.546
2,Annual population,Afghanistan,1993,1000 persons,15816.603
3,Annual population,Afghanistan,1994,1000 persons,17075.727
4,Annual population,Afghanistan,1995,1000 persons,18110.657


In [3]:
#Save the population unit for later, we can then drop that column
pop_unit=1000

In [4]:
#Clean our dataset of nulls and unnecessary columns
pop.isna().sum()
pop.dropna(inplace=True)
pop.reset_index(drop=True, inplace=True)
pop.drop(labels=['Domain','Unit'], axis=1, inplace=True)
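
The same three steps (drop nulls, reset the index, drop unneeded columns) recur for every dataset in this notebook. As a minimal sketch, they could be wrapped in one function. The helper name `clean_fao` and the toy data are our own assumptions for illustration and are not part of the original analysis.

```python
import pandas as pd

def clean_fao(df, drop_cols):
    """Drop rows containing any null, reset the index, and remove unneeded columns.

    Hypothetical helper (not in the original notebook). It returns a new
    DataFrame rather than modifying `df` in place.
    """
    return (
        df.dropna()
          .reset_index(drop=True)
          .drop(columns=drop_cols)
    )

# Toy stand-in for the population table, with one missing Value
toy = pd.DataFrame({
    'Domain': ['Annual population'] * 3,
    'Area': ['A', 'A', 'B'],
    'Year': [1991, 1992, 1991],
    'Unit': ['1000 persons'] * 3,
    'Value': [10.0, None, 5.0],
})

cleaned = clean_fao(toy, ['Domain', 'Unit'])
print(cleaned)
```

Returning a new dataframe instead of using `inplace=True` keeps the raw table available if we need to re-inspect it.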

In [5]:
#Get the 100 topmost countries by population
top100 = pop.sort_values('Value',ascending=False).Area.unique()[:100]
poptop100 = pop[pop.Area.isin(top100)]
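
Because the cell above sorts every country-year row, a country is ranked by its largest population in any year. If we preferred to rank by each country's most recent year instead, one possible sketch is shown below on toy data. The helper `top_n_latest` is hypothetical and only illustrates the difference between the two rankings.

```python
import pandas as pd

def top_n_latest(df, n):
    """Return the n Areas with the largest Value in each Area's most recent Year."""
    # Keep only the last (most recent) row per Area after sorting by Year
    latest = df.sort_values('Year').groupby('Area').tail(1)
    return latest.nlargest(n, 'Value')['Area'].tolist()

toy = pd.DataFrame({
    'Area':  ['A', 'A', 'B', 'B', 'C', 'C'],
    'Year':  [1991, 2018, 1991, 2018, 1991, 2018],
    'Value': [100.0, 40.0, 50.0, 60.0, 10.0, 70.0],
})

# Ranking by the largest value in any year (the approach used in the cell above)
by_peak = toy.sort_values('Value', ascending=False).Area.unique()[:2]

# Ranking by the most recent year only
by_latest = top_n_latest(toy, 2)

print(list(by_peak), by_latest)
```

For populations that grow steadily the two rankings will mostly agree, so this matters mainly for countries whose population peaked early in the period.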

## Food Balances
Our food balance sheet gives the production quantity of each meat type, in units of 1000 tonnes.

In [6]:
food = pd.read_csv('new data/food_balances.csv')
food.head()

Unnamed: 0,Domain,Area,Element,Item,Year,Unit,Value
0,"Food Balances (-2013, old methodology and popu...",Afghanistan,Production,Bovine Meat,1991,1000 tonnes,86.0
1,"Food Balances (-2013, old methodology and popu...",Afghanistan,Production,Bovine Meat,1992,1000 tonnes,86.0
2,"Food Balances (-2013, old methodology and popu...",Afghanistan,Production,Bovine Meat,1993,1000 tonnes,97.0
3,"Food Balances (-2013, old methodology and popu...",Afghanistan,Production,Bovine Meat,1994,1000 tonnes,113.0
4,"Food Balances (-2013, old methodology and popu...",Afghanistan,Production,Bovine Meat,1995,1000 tonnes,130.0


In [7]:
food.shape, food.isna().sum()
food_unit = 1000
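
As an illustration of why we keep `pop_unit` and `food_unit`, the sketch below converts the first rows shown above (Afghanistan, 1991) into absolute units and derives bovine production per person. The per-capita figure is our own illustrative calculation and is not computed anywhere else in this notebook.

```python
pop_unit = 1000    # population is reported in thousands of persons
food_unit = 1000   # production is reported in thousands of tonnes

# Afghanistan, 1991, taken from the head() outputs above
population = 13299.017 * pop_unit   # persons
bovine_tonnes = 86.0 * food_unit    # tonnes

# 1 tonne = 1000 kg
kg_per_person = bovine_tonnes * 1000 / population
print(round(kg_per_person, 2))
```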

The column we care about most here is Value, so if it's null, we can drop those rows.

In [8]:
food.dropna(inplace=True)
food.reset_index(drop=True, inplace=True)
food.drop(labels=['Domain','Unit','Element'], axis=1, inplace=True)
food.head()

Unnamed: 0,Area,Item,Year,Value
0,Afghanistan,Bovine Meat,1991,86.0
1,Afghanistan,Bovine Meat,1992,86.0
2,Afghanistan,Bovine Meat,1993,97.0
3,Afghanistan,Bovine Meat,1994,113.0
4,Afghanistan,Bovine Meat,1995,130.0


In [9]:
#Next we pivot our data frame so we have columns for each type of meat
food = pd.pivot_table(food, values='Value',index=['Area','Year'], columns='Item').reset_index()
food.head()

Item,Area,Year,Bovine Meat,Mutton & Goat Meat,Pigmeat,Poultry Meat
0,Afghanistan,1991,86.0,137.0,,12.0
1,Afghanistan,1992,86.0,133.0,,12.0
2,Afghanistan,1993,97.0,132.0,,12.0
3,Afghanistan,1994,113.0,134.0,,12.0
4,Afghanistan,1995,130.0,134.0,,12.0


In [10]:
#Let's check for null values again now that we've pivoted
food.isna().sum()

Item
Area                    0
Year                    0
Bovine Meat            28
Mutton & Goat Meat    140
Pigmeat               316
Poultry Meat           51
dtype: int64
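
Nulls reappear here even though we dropped them before pivoting, because the pivot creates a cell for every Area/Year/Item combination and fills the combinations that have no row with NaN. A small sketch on made-up data shows the effect.

```python
import pandas as pd

# Long-format toy data with no nulls: country B has no Pigmeat row at all
long_df = pd.DataFrame({
    'Area': ['A', 'A', 'B'],
    'Year': [1991, 1991, 1991],
    'Item': ['Bovine Meat', 'Pigmeat', 'Bovine Meat'],
    'Value': [86.0, 3.0, 40.0],
})

wide = pd.pivot_table(
    long_df, values='Value', index=['Area', 'Year'], columns='Item'
).reset_index()

print(wide)
print(wide.isna().sum())
```

The missing Pigmeat row for country B becomes a NaN cell in the wide table, which is why a second null check is needed after pivoting.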

In [11]:
#Not too many, we can safely drop these and move on
food.dropna(inplace=True)
food.reset_index(drop=True, inplace=True)

## Prices
Our prices data shows the producer price in US dollars per tonne for each meat.

In [15]:
prices = pd.read_csv('new data/meat_prices.csv')
prices.head()

Unnamed: 0,Domain,Area,Element,Item,Year,Months,Unit,Value
0,Producer Prices,Afghanistan,Producer Price (USD/tonne),"Meat, cattle",1991,Annual value,,
1,Producer Prices,Afghanistan,Producer Price (USD/tonne),"Meat, chicken",1991,Annual value,,
2,Producer Prices,Afghanistan,Producer Price (USD/tonne),"Meat, goat",1991,Annual value,,
3,Producer Prices,Afghanistan,Producer Price (USD/tonne),"Meat, pig",1991,Annual value,,
4,Producer Prices,Afghanistan,Producer Price (USD/tonne),"Meat, cattle",1992,Annual value,,


In [16]:
prices.dropna(inplace=True)
prices.reset_index(drop=True, inplace=True)
prices.drop(labels=['Domain','Months','Unit','Element'],axis=1,inplace=True)

In [17]:
prices = pd.pivot_table(prices, values='Value',index=['Area','Year'], columns='Item').reset_index()
prices.head()

Item,Area,Year,"Meat, cattle","Meat, chicken","Meat, goat","Meat, pig"
0,Albania,1993,3116.2,,4330.1,2502.0
1,Albania,1994,2071.4,,3438.0,2078.8
2,Albania,1995,2481.2,,1723.2,2481.2
3,Albania,1996,2870.8,,1965.3,2679.5
4,Albania,1997,2350.1,,1762.0,2014.3


In [18]:
prices.isna().sum()

Item
Area                0
Year                0
Meat, cattle      364
Meat, chicken     501
Meat, goat       1312
Meat, pig         634
dtype: int64

In [19]:
#We have a lot of null values for goat meat - this might make our analysis difficult later, so we'll drop that column from prices and the matching column from food
prices.drop(labels='Meat, goat',axis=1,inplace=True)
food.drop(labels='Mutton & Goat Meat', axis=1,inplace=True)

In [20]:
prices.dropna(inplace=True)
prices.reset_index(drop=True,inplace=True)
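
Dropping the sparse goat column before calling `dropna()` matters because `dropna()` removes any row with a null in any remaining column. The toy example below (made-up values) compares how many rows survive under each ordering. Note that the food table had already been passed through `dropna()` before its goat column was removed, so food rows missing only goat values were already gone at that point.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Area': ['A', 'B', 'C', 'D'],
    'Meat, cattle': [1.0, 2.0, 3.0, np.nan],
    'Meat, goat':   [np.nan, np.nan, 5.0, np.nan],
})

# Keep the sparse column: only fully populated rows survive
rows_if_kept = len(toy.dropna())

# Drop the sparse column first: rows missing only goat prices survive
rows_if_dropped = len(toy.drop(columns='Meat, goat').dropna())

print(rows_if_kept, rows_if_dropped)
```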

## CO2 Emissions
Finally, we will look at agricultural emissions (in CO2 equivalent) by country.

In [22]:
co2 = pd.read_csv('new data/emissions.csv')
co2.head()

Unnamed: 0,Domain,Area,Element,Item,Year,Source,Unit,Value
0,Emissions Totals,Afghanistan,Emissions (CO2eq) (AR5),IPCC Agriculture,1991,FAO TIER 1,kilotonnes,8760.3622
1,Emissions Totals,Afghanistan,Emissions (CO2eq) (AR5),IPCC Agriculture,1992,FAO TIER 1,kilotonnes,8786.8525
2,Emissions Totals,Afghanistan,Emissions (CO2eq) (AR5),IPCC Agriculture,1993,FAO TIER 1,kilotonnes,8865.5553
3,Emissions Totals,Afghanistan,Emissions (CO2eq) (AR5),IPCC Agriculture,1994,FAO TIER 1,kilotonnes,8947.8165
4,Emissions Totals,Afghanistan,Emissions (CO2eq) (AR5),IPCC Agriculture,1995,FAO TIER 1,kilotonnes,9407.0224


In [23]:
co2_unit = 'kilotonnes'
co2.isna().sum()

Domain       0
Area         0
Element      0
Item         0
Year         0
Source       0
Unit       431
Value      431
dtype: int64

In [24]:
co2.dropna(inplace=True)
co2.reset_index(drop=True, inplace=True)
co2.drop(labels=['Domain','Element','Item','Unit','Source'],axis=1,inplace=True)
co2.head()

Unnamed: 0,Area,Year,Value
0,Afghanistan,1991,8760.3622
1,Afghanistan,1992,8786.8525
2,Afghanistan,1993,8865.5553
3,Afghanistan,1994,8947.8165
4,Afghanistan,1995,9407.0224


## Merging
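We'll start by merging the population and food dataframes with an inner join (the default for `pd.merge`), since we're only interested in countries and years where we have both datapoints. The prices and emissions data are then merged in the same way.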

In [26]:
merged = pd.merge(poptop100, food, left_on=['Area','Year'], right_on = ['Area','Year'])
merged.head()

Unnamed: 0,Area,Year,Value,Bovine Meat,Pigmeat,Poultry Meat
0,Afghanistan,2014,33370.794,137.0,0.0,22.0
1,Afghanistan,2015,34413.603,136.0,0.0,24.0
2,Afghanistan,2016,35383.032,135.0,0.0,25.0
3,Afghanistan,2017,36296.113,128.0,0.0,28.0
4,Afghanistan,2018,37171.921,130.0,0.0,29.0
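
An inner join silently discards country-years that appear in only one table. To see how much is lost, one option is an outer join with `indicator=True`, which labels each row by where it was found. The sketch below uses made-up data, not our actual tables.

```python
import pandas as pd

left = pd.DataFrame({
    'Area': ['A', 'A', 'B'],
    'Year': [2014, 2015, 2014],
    'Pop': [1.0, 1.1, 2.0],
})
right = pd.DataFrame({
    'Area': ['A', 'B', 'C'],
    'Year': [2014, 2014, 2014],
    'Bovine Meat': [10.0, 20.0, 30.0],
})

# The inner join keeps only Area/Year pairs present in both tables
inner = pd.merge(left, right, on=['Area', 'Year'], how='inner')

# The outer join with an indicator column shows what the inner join drops
audit = pd.merge(left, right, on=['Area', 'Year'], how='outer', indicator=True)
print(audit['_merge'].value_counts())
```

Rows labelled `left_only` or `right_only` in the audit are exactly the ones missing from the inner join.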


In [27]:
merged.rename(columns={'Bovine Meat':'Bovine Production', 'Pigmeat':'Pig Production','Poultry Meat':'Poultry Production','Value':'Pop'},inplace=True)

In [30]:
#merge the above with price data
merged = pd.merge(prices, merged, left_on=['Area','Year'], right_on = ['Area','Year'])

In [31]:
#rename the price columns; columns= and inplace=True are needed for the rename to take effect
merged.rename(columns={'Meat, cattle':'Bovine Price','Meat, chicken':'Poultry Price','Meat, pig':'Pig Price'}, inplace=True)
merged.head()

Unnamed: 0,Area,Year,Bovine Price,Poultry Price,Pig Price,Pop,Bovine Production,Pig Production,Poultry Production
0,Argentina,2004,1391.6,1048.2,929.2,38491.972,3024.0,160.0,909.0
1,Argentina,2005,1620.1,916.4,954.3,38892.931,3131.0,185.0,1053.0
2,Argentina,2006,1496.4,739.2,946.0,39289.878,3034.0,230.0,1202.0
3,Argentina,2007,1558.7,929.4,1071.3,39684.295,3224.0,240.0,1288.0
4,Argentina,2008,1741.1,981.3,1281.0,40080.16,3132.0,274.0,1445.0


In [34]:
merged = pd.merge(merged, co2, left_on=['Area','Year'], right_on = ['Area','Year'])
merged.head()

Unnamed: 0,Area,Year,Bovine Price,Poultry Price,Pig Price,Pop,Bovine Production,Pig Production,Poultry Production,Value
0,Argentina,2004,1391.6,1048.2,929.2,38491.972,3024.0,160.0,909.0,136899.1923
1,Argentina,2005,1620.1,916.4,954.3,38892.931,3131.0,185.0,1053.0,136478.2809
2,Argentina,2006,1496.4,739.2,946.0,39289.878,3034.0,230.0,1202.0,139872.1149
3,Argentina,2007,1558.7,929.4,1071.3,39684.295,3224.0,240.0,1288.0,142662.1991
4,Argentina,2008,1741.1,981.3,1281.0,40080.16,3132.0,274.0,1445.0,140063.2322


In [35]:
merged.rename(columns={'Value':'Emissions'}, inplace=True)

In [36]:
merged.isna().sum()

Area                  0
Year                  0
Bovine Price          0
Poultry Price         0
Pig Price             0
Pop                   0
Bovine Production     0
Pig Production        0
Poultry Production    0
Emissions             0
dtype: int64

## Save our data

In [37]:
datapath = 'new data' 
save_file(merged, 'merged_data.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)Y
Writing file.  "new data/merged_data.csv"
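
`save_file` comes from our project-local `lib.sb_utils` module and, judging by the output above, prompts before overwriting. For readers without that module, a rough non-interactive stand-in could look like the sketch below. The function `save_csv` and its `overwrite` flag are hypothetical and are not the actual `save_file` implementation.

```python
import os
import tempfile

import pandas as pd

def save_csv(df, name, datapath, overwrite=False):
    """Write df to datapath/name as CSV and return the path.

    Hypothetical stand-in for save_file: instead of prompting, it raises
    FileExistsError unless overwrite=True.
    """
    os.makedirs(datapath, exist_ok=True)
    path = os.path.join(datapath, name)
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(f'{path} already exists; pass overwrite=True to replace it')
    df.to_csv(path, index=False)
    return path

# Demo in a temporary directory so no real data is touched
demo_dir = tempfile.mkdtemp()
demo = pd.DataFrame({'Area': ['Argentina'], 'Year': [2004]})
out = save_csv(demo, 'merged_data.csv', demo_dir)
print(out)
```

Raising an error instead of prompting makes the notebook safe to run end to end without user input.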
