# Data Wrangling

This first capstone aims to forecast global meat consumption over the next 10 years. We will look for patterns in social, economic, and environmental indicators that could be predictors of consumption.

When downloading the data, we saw that the datasets contain different numbers of null values. We'll inspect each dataset and handle its nulls so that we end up with a reasonably sized dataframe.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from lib.sb_utils import save_file
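
Since each dataset below gets the same null inspection, a small helper can put the counts side by side. This is a minimal sketch, not part of the original notebook: `null_report` is a hypothetical name, and the toy frames only stand in for the real CSV files.

```python
import pandas as pd

def null_report(frames):
    """Return one row per (dataset, column) with its null count and null share."""
    rows = []
    for name, df in frames.items():
        for col in df.columns:
            n_null = int(df[col].isna().sum())
            share = n_null / len(df) if len(df) else 0.0
            rows.append({'dataset': name, 'column': col,
                         'nulls': n_null, 'share': share})
    return pd.DataFrame(rows)

# Toy frames standing in for the real datasets (values are illustrative only)
toy_frames = {
    'food': pd.DataFrame({'Area': ['A', 'B'], 'Value': [1.0, None]}),
    'prices': pd.DataFrame({'Area': ['A', 'B'], 'Value': [None, None]}),
}
report = null_report(toy_frames)
print(report)
```

Reporting the share alongside the raw count makes datasets of different lengths easier to compare.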

## Food Balances
Our food balance sheet tells us the production quantity of each meat type, in thousands of tonnes.

In [2]:
food = pd.read_csv('new data/food_balances.csv')
food.head()

Unnamed: 0,Domain,Area,Element,Item,Year,Unit,Value
0,"Food Balances (-2013, old methodology and popu...",Afghanistan,Production,Bovine Meat,1991,1000 tonnes,86.0
1,"Food Balances (-2013, old methodology and popu...",Afghanistan,Production,Bovine Meat,1992,1000 tonnes,86.0
2,"Food Balances (-2013, old methodology and popu...",Afghanistan,Production,Bovine Meat,1993,1000 tonnes,97.0
3,"Food Balances (-2013, old methodology and popu...",Afghanistan,Production,Bovine Meat,1994,1000 tonnes,113.0
4,"Food Balances (-2013, old methodology and popu...",Afghanistan,Production,Bovine Meat,1995,1000 tonnes,130.0


In [3]:
food.shape, food.isna().sum()

((20082, 7),
 Domain        0
 Area          0
 Element       0
 Item          0
 Year          0
 Unit       1545
 Value      1545
 dtype: int64)

The column we care about most here is Value, so if it's null, we can drop those rows.

In [4]:
food.dropna(inplace=True)
food.reset_index(drop=True, inplace=True)
food.drop(labels=['Domain','Unit','Element'], axis=1, inplace=True)
food.head()

Unnamed: 0,Area,Item,Year,Value
0,Afghanistan,Bovine Meat,1991,86.0
1,Afghanistan,Bovine Meat,1992,86.0
2,Afghanistan,Bovine Meat,1993,97.0
3,Afghanistan,Bovine Meat,1994,113.0
4,Afghanistan,Bovine Meat,1995,130.0
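
Calling `dropna()` with no arguments removes a row if any column is null, not only `Value`. Here `Unit` and `Value` have the same null count, so the result is probably identical, but that is an assumption. A minimal sketch on toy data (not the real file) shows how `subset` limits the check to `Value`:

```python
import pandas as pd

# Toy data: the second row has a Value but no Unit
toy = pd.DataFrame({
    'Area': ['A', 'A', 'B'],
    'Year': [1991, 1992, 1991],
    'Unit': ['1000 tonnes', None, None],
    'Value': [86.0, 90.0, None],
})

# Drop rows only when Value itself is null
cleaned = toy.dropna(subset=['Value']).reset_index(drop=True)

# For comparison: drop rows with a null in any column
dropped_any = toy.dropna().reset_index(drop=True)

print(len(cleaned), len(dropped_any))
```

With `subset`, a row that is missing only its `Unit` is kept.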


In [5]:
food = pd.pivot_table(food, values='Value',index=['Area','Year'], columns='Item').reset_index()
food.head()

Item,Area,Year,Bovine Meat,Mutton & Goat Meat,Pigmeat,Poultry Meat
0,Afghanistan,1991,86.0,137.0,,12.0
1,Afghanistan,1992,86.0,133.0,,12.0
2,Afghanistan,1993,97.0,132.0,,12.0
3,Afghanistan,1994,113.0,134.0,,12.0
4,Afghanistan,1995,130.0,134.0,,12.0


In [6]:
food.shape

(4768, 6)
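
The pivot reshapes the long table (one row per area, year, and item) into one row per area and year, with a column per meat type. `pd.pivot_table` aggregates duplicates with the mean by default, so it can help to check for repeated keys first. A minimal sketch on toy data, assuming nothing about duplicates in the real file:

```python
import pandas as pd

# Toy long-format data with one duplicated (Area, Year, Item) key
long_toy = pd.DataFrame({
    'Area': ['A', 'A', 'A', 'A'],
    'Year': [1991, 1991, 1992, 1992],
    'Item': ['Bovine Meat', 'Poultry Meat', 'Bovine Meat', 'Bovine Meat'],
    'Value': [86.0, 12.0, 90.0, 94.0],
})

# Flag every row whose key appears more than once
dupes = long_toy.duplicated(subset=['Area', 'Year', 'Item'], keep=False)
print('rows with duplicated keys:', int(dupes.sum()))

# Duplicated keys are averaged; missing combinations become NaN
wide = pd.pivot_table(long_toy, values='Value',
                      index=['Area', 'Year'], columns='Item').reset_index()
print(wide)
```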

## Prices
Our prices data shows the producer price in USD per tonne for each meat.

In [7]:
prices = pd.read_csv('new data/meat_prices.csv')
prices.head()

Unnamed: 0,Domain,Area,Element,Item,Year,Months,Unit,Value
0,Producer Prices,Afghanistan,Producer Price (USD/tonne),"Meat, cattle",1991,Annual value,,
1,Producer Prices,Afghanistan,Producer Price (USD/tonne),"Meat, chicken",1991,Annual value,,
2,Producer Prices,Afghanistan,Producer Price (USD/tonne),"Meat, goat",1991,Annual value,,
3,Producer Prices,Afghanistan,Producer Price (USD/tonne),"Meat, pig",1991,Annual value,,
4,Producer Prices,Afghanistan,Producer Price (USD/tonne),"Meat, cattle",1992,Annual value,,


In [8]:
prices.shape, prices.isna().sum()

((22968, 8),
 Domain         0
 Area           0
 Element        0
 Item           0
 Year           0
 Months         0
 Unit       14923
 Value      14923
 dtype: int64)

Again, the column we care about most here is Value, so if it's null, we can drop those rows.

In [9]:
prices.dropna(inplace=True)
prices.reset_index(drop=True, inplace=True)
prices.drop(labels=['Domain','Months','Unit','Element'], axis=1, inplace=True)

In [10]:
prices = pd.pivot_table(prices, values='Value',index=['Area','Year'], columns='Item').reset_index()

## Merging Price and Food
We'll merge our two dataframes together using an inner join, since we're only interested in years and countries where we have both data points.

In [11]:
merged = pd.merge(prices, food, left_on=['Area','Year'], right_on = ['Area','Year'])
merged.rename(columns={'Bovine Meat':'Bovine Production','Mutton & Goat Meat': 'Goat Production', 'Pigmeat':'Pig Production','Poultry Meat':'Poultry Production','Meat, cattle':'Bovine Price','Meat, chicken':'Poultry Price','Meat, goat':'Goat Price','Meat, pig':'Pig Price'}, inplace=True)
merged.head()

Item,Area,Year,Bovine Price,Poultry Price,Goat Price,Pig Price,Bovine Production,Goat Production,Pig Production,Poultry Production
0,Albania,1993,3116.2,,4330.1,2502.0,24.0,15.0,13.0,3.0
1,Albania,1994,2071.4,,3438.0,2078.8,28.0,19.0,14.0,4.0
2,Albania,1995,2481.2,,1723.2,2481.2,31.0,18.0,14.0,4.0
3,Albania,1996,2870.8,,1965.3,2679.5,33.0,17.0,6.0,4.0
4,Albania,1997,2350.1,,1762.0,2014.3,33.0,16.0,7.0,4.0
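
`pd.merge` defaults to an inner join, so rows present in only one dataframe are dropped silently. A minimal sketch on toy data shows one way to make the join explicit and audit what it discards. The `validate` and `indicator` arguments are real `pd.merge` parameters, but the frames and values here are illustrative only:

```python
import pandas as pd

prices_toy = pd.DataFrame({
    'Area': ['A', 'A', 'B'],
    'Year': [1991, 1992, 1991],
    'Bovine Price': [100.0, 110.0, 95.0],
})
food_toy = pd.DataFrame({
    'Area': ['A', 'B', 'C'],
    'Year': [1991, 1991, 1991],
    'Bovine Production': [86.0, 40.0, 12.0],
})

# Explicit inner join; validate raises if either side has duplicate keys
inner = pd.merge(prices_toy, food_toy, on=['Area', 'Year'],
                 how='inner', validate='one_to_one')

# Outer join with an indicator column to see which side each row came from
audit = pd.merge(prices_toy, food_toy, on=['Area', 'Year'],
                 how='outer', indicator=True)
counts = audit['_merge'].value_counts()
print(counts)
```

Rows marked `left_only` or `right_only` are the ones the inner join leaves out.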


In [12]:
merged.isna().sum()

Item
Area                     0
Year                     0
Bovine Price           336
Poultry Price          475
Goat Price            1244
Pig Price              569
Bovine Production        0
Goat Production         37
Pig Production          77
Poultry Production      22
dtype: int64

## Drop goat data
We'll drop our goat data because it has so many null values. We can still analyze the remaining three meats.

In [13]:
#dropping goat data because there are too many null values
merged.drop(labels=['Goat Price','Goat Production'],axis=1,inplace=True)
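
The decision to drop a column can also be expressed as a rule on its share of nulls. The sketch below uses toy data, and `MAX_NULL_SHARE` is a hypothetical threshold, not one taken from this analysis:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Bovine Price': [1.0, 2.0, 3.0, 4.0],
    'Goat Price': [np.nan, np.nan, np.nan, 4.0],
    'Pig Price': [1.0, np.nan, 3.0, 4.0],
})

MAX_NULL_SHARE = 0.5  # hypothetical cutoff, chosen only for illustration

# isna().mean() gives the fraction of null values in each column
null_share = toy.isna().mean()
too_sparse = null_share[null_share > MAX_NULL_SHARE].index.tolist()

trimmed = toy.drop(columns=too_sparse)
print(too_sparse, list(trimmed.columns))
```

A rule like this would flag only the sparse column. Dropping its paired production column as well, as the cell above does for goat, remains a separate manual choice.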

## Ag Jobs
Next we will look at employment in agriculture (in thousands of people).

In [39]:
jobs = pd.read_csv('new data/employees.csv')
jobs.head()

Unnamed: 0,Domain,Area,Indicator,Source,Year,Unit,Value
0,Employment Indicators,Afghanistan,Employment in agriculture,Administrative records,1991,,
1,Employment Indicators,Afghanistan,Employment in agriculture,Administrative records,1992,,
2,Employment Indicators,Afghanistan,Employment in agriculture,Administrative records,1993,,
3,Employment Indicators,Afghanistan,Employment in agriculture,Administrative records,1994,,
4,Employment Indicators,Afghanistan,Employment in agriculture,Administrative records,1995,,


In [40]:
jobs.shape, jobs.isna().sum()

((45878, 7),
 Domain           0
 Area             0
 Indicator        0
 Source           0
 Year             0
 Unit         43502
 Value        43502
 dtype: int64)

In [41]:
jobs.dropna(inplace=True)
jobs.reset_index(drop=True, inplace=True)
jobs.drop(labels=['Domain','Indicator','Source','Unit'],axis=1,inplace=True)

In [42]:
jobsunit = 1000  # Value is reported in thousands of people

In [43]:
jobs.head()

Unnamed: 0,Area,Year,Value
0,Albania,2007,358.403
1,Albania,2008,501.433
2,Albania,2009,511.223
3,Albania,2010,491.227
4,Albania,2011,526.261


In [44]:
jobs.shape

(2376, 3)

## Population
Lastly we'll look at each country's total population. Viewing the other indicators relative to population will help explain trends.

In [20]:
pop = pd.read_csv('new data/population.csv')
pop.head()

Unnamed: 0,Domain,Area,Year,Unit,Value
0,Annual population,Afghanistan,1991,1000 persons,13299.017
1,Annual population,Afghanistan,1992,1000 persons,14485.546
2,Annual population,Afghanistan,1993,1000 persons,15816.603
3,Annual population,Afghanistan,1994,1000 persons,17075.727
4,Annual population,Afghanistan,1995,1000 persons,18110.657


In [21]:
pop.shape, pop.isna().sum()

((6916, 5),
 Domain      0
 Area        0
 Year        0
 Unit      475
 Value     475
 dtype: int64)

In [22]:
pop.dropna(inplace=True)
pop.reset_index(drop=True, inplace=True)
pop.drop(labels=['Domain','Unit'],axis=1,inplace=True)

In [23]:
popunit = 1000  # Value is reported in thousands of persons

In [24]:
pop.shape

(6441, 3)

## More Merging
Now we can merge our population & jobs data in with our food/price data.

In [25]:
merged = pd.merge(merged,pop, left_on=['Area','Year'], right_on = ['Area','Year'])
merged.rename(columns={'Value':'Population'}, inplace=True)

In [26]:
merged = pd.merge(merged,jobs, left_on=['Area','Year'], right_on = ['Area','Year'])
merged.rename(columns={'Value':'Ag_jobs'}, inplace=True)

In [27]:
merged.head()

Unnamed: 0,Area,Year,Bovine Price,Poultry Price,Pig Price,Bovine Production,Pig Production,Poultry Production,Population,Ag_jobs
0,Albania,2007,5927.4,2952.6,4644.6,50.0,14.0,13.0,3033.998,358.403
1,Albania,2008,6708.4,3182.6,5006.3,57.0,16.0,16.0,3002.678,501.433
2,Albania,2009,7054.3,3337.6,5474.9,58.0,16.0,17.0,2973.048,511.223
3,Albania,2010,6494.4,3049.9,5031.9,64.0,17.0,17.0,2948.023,491.227
4,Albania,2011,7572.2,3350.0,5580.1,65.0,18.0,17.0,2928.592,526.261
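
With population in the same frame, production can be expressed per person. Production is reported in 1000 tonnes and population in 1000 persons, so both need scaling before dividing. A minimal sketch using the first two Albania rows shown above: `produnit` and the `Bovine kg per capita` column are hypothetical names, not part of the original notebook.

```python
import pandas as pd

popunit = 1000   # population is reported in 1000 persons
produnit = 1000  # hypothetical name: production is reported in 1000 tonnes

toy = pd.DataFrame({
    'Area': ['Albania', 'Albania'],
    'Year': [2007, 2008],
    'Bovine Production': [50.0, 57.0],
    'Population': [3033.998, 3002.678],
})

tonnes = toy['Bovine Production'] * produnit
persons = toy['Population'] * popunit

# 1 tonne = 1000 kg
toy['Bovine kg per capita'] = tonnes * 1000 / persons
print(toy[['Area', 'Year', 'Bovine kg per capita']])
```

A per-capita column of this kind separates changes in production from changes in population size.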


In [28]:
merged.isna().sum()

Area                    0
Year                    0
Bovine Price          183
Poultry Price         272
Pig Price             300
Bovine Production       0
Pig Production         35
Poultry Production      2
Population              0
Ag_jobs                 0
dtype: int64

In [29]:
merged.dropna(inplace=True)

## Save our data

In [None]:
datapath = 'data'
# Save the merged dataframe built above
save_file(merged, 'merged_data.csv', datapath)
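
`save_file` comes from the local `lib.sb_utils` module, which is not shown here. For readers without that helper, a plain pandas equivalent might look like the sketch below. `save_csv` is a hypothetical function, and it is an assumption that the original helper behaves similarly.

```python
import os
import tempfile
import pandas as pd

def save_csv(df, filename, directory):
    """Write df to directory/filename as CSV, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    df.to_csv(path, index=False)  # index=False leaves the row index out of the file
    return path

# Toy frame written to a temporary directory so the sketch has no side effects
toy = pd.DataFrame({'Area': ['Albania'], 'Year': [2007], 'Population': [3033.998]})
out_dir = os.path.join(tempfile.mkdtemp(), 'data')
saved_path = save_csv(toy, 'merged_data.csv', out_dir)

reloaded = pd.read_csv(saved_path)
print(reloaded)
```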