# Data Wrangling

This first capstone is intended to forecast the consumption of red meat in the United States over the next 10 years. We will look for patterns in potential social, economic, and environmental indicators that could be predictors of consumption. 

In this notebook, we will inspect and clean the datasets for this project. FAO publishes its food balance data in two separate sheets: one covering years through 2013 (old methodology) and one covering 2014 onward. Since both sheets have the same column structure, our first step will be to combine them into one dataframe.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from lib.sb_utils import save_file

In [2]:
fao08 = pd.read_csv('data/faostat_08.csv')
fao14 = pd.read_csv('data/faostat_14.csv')

#First we concat our dataframes together, since they have the same column structure.
#ignore_index=True renumbers the rows, so there is no old index to reset and drop.
fao_all = pd.concat([fao08, fao14], ignore_index=True)
fao_all.head()

Unnamed: 0,Domain,Area,Element,Item,Year,Unit,Value
0,"Food Balances (-2013, old methodology and popu...",United States of America,Production,Bovine Meat,2008,1000 tonnes,12163.0
1,"Food Balances (-2013, old methodology and popu...",United States of America,Production,Bovine Meat,2009,1000 tonnes,11891.0
2,"Food Balances (-2013, old methodology and popu...",United States of America,Production,Bovine Meat,2010,1000 tonnes,12046.0
3,"Food Balances (-2013, old methodology and popu...",United States of America,Production,Bovine Meat,2011,1000 tonnes,11921.0
4,"Food Balances (-2013, old methodology and popu...",United States of America,Production,Bovine Meat,2012,1000 tonnes,11811.0
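Before concatenating, it can help to confirm that the two sheets really share a column layout. The following is a minimal sketch on toy frames; `old_sheet` and `new_sheet` are hypothetical stand-ins for the FAO sheets and the numbers are placeholders, not real data.

```python
import pandas as pd

# Toy stand-ins for the two FAO sheets; the values are placeholders, not real data.
old_sheet = pd.DataFrame({"Year": [2012, 2013], "Value": [1.0, 2.0]})
new_sheet = pd.DataFrame({"Year": [2014, 2015], "Value": [3.0, 4.0]})

# Fail early if the layouts differ; concat would otherwise fill
# the mismatched columns with NaN without complaining.
assert list(old_sheet.columns) == list(new_sheet.columns)

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat([old_sheet, new_sheet], ignore_index=True)
print(combined)
```

If the column check fails, the mismatch is worth resolving by hand before combining the sheets.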


In [3]:
fao_all.dtypes

Domain      object
Area        object
Element     object
Item        object
Year         int64
Unit        object
Value      float64
dtype: object

In [4]:
fao_all.Element.unique()

array(['Production', 'Import Quantity', 'Stock Variation',
       'Export Quantity', 'Losses', 'Processing', 'Other uses (non-food)',
       'Tourist consumption', 'Residuals'], dtype=object)

In [5]:
fao_all.Item.unique()

array(['Bovine Meat', 'Mutton & Goat Meat', 'Pigmeat', 'Poultry Meat',
       'Meat, Other'], dtype=object)

In [6]:
fao_all.Unit.unique()

array(['1000 tonnes', nan], dtype=object)

In [7]:
fao_all['Domain'].unique()

array(['Food Balances (-2013, old methodology and population)',
       'Food Balances (2014-)'], dtype=object)

In [8]:
fao_all[(fao_all.Item=='Pigmeat') & (fao_all.Year==2008)]

Unnamed: 0,Domain,Area,Element,Item,Year,Unit,Value
108,"Food Balances (-2013, old methodology and popu...",United States of America,Production,Pigmeat,2008,1000 tonnes,10599.0
114,"Food Balances (-2013, old methodology and popu...",United States of America,Import Quantity,Pigmeat,2008,1000 tonnes,454.0
120,"Food Balances (-2013, old methodology and popu...",United States of America,Stock Variation,Pigmeat,2008,1000 tonnes,-19.0
126,"Food Balances (-2013, old methodology and popu...",United States of America,Export Quantity,Pigmeat,2008,1000 tonnes,2129.0
132,"Food Balances (-2013, old methodology and popu...",United States of America,Losses,Pigmeat,2008,,
138,"Food Balances (-2013, old methodology and popu...",United States of America,Processing,Pigmeat,2008,1000 tonnes,0.0
144,"Food Balances (-2013, old methodology and popu...",United States of America,Other uses (non-food),Pigmeat,2008,1000 tonnes,15.0
150,"Food Balances (-2013, old methodology and popu...",United States of America,Tourist consumption,Pigmeat,2008,,
156,"Food Balances (-2013, old methodology and popu...",United States of America,Residuals,Pigmeat,2008,,


It looks like the "Domain" column is not relevant to our analysis, since it only records which methodology the balance sheet used, so we'll drop it. We'll also drop "Area", since the rows shown above all refer to the United States of America.

In [9]:
fao_all.drop(labels=['Domain', 'Area'], axis=1, inplace=True)

## Null Values
Next we'll check for null values. From the output above, the numeric data we care most about is the Value column. We want to know where Value is null and whether those rows are the same ones that have a null Unit. Rows with a null Value add nothing to our totals, so we drop them.

In [10]:
null_vals = fao_all[fao_all.Value.isna()]
fao_all.drop(labels = null_vals.index, inplace=True)
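The check described above (do null values line up with null units?) could be sketched as follows. The toy frame mirrors the pattern in the Pigmeat 2008 rows shown earlier; the names `toy`, `nulls_coincide`, and `null_counts` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the Pigmeat 2008 pattern: Unit is missing wherever Value is missing.
toy = pd.DataFrame({
    "Element": ["Production", "Losses", "Residuals"],
    "Unit": ["1000 tonnes", np.nan, np.nan],
    "Value": [10599.0, np.nan, np.nan],
})

value_null = toy["Value"].isna()
unit_null = toy["Unit"].isna()

# True when the two columns are null on exactly the same rows.
nulls_coincide = bool((value_null == unit_null).all())

# How many null Value rows each Element contributes.
null_counts = toy.loc[value_null].groupby("Element").size()

# dropna with subset= removes only the rows whose Value is missing.
cleaned = toy.dropna(subset=["Value"])
print(nulls_coincide, null_counts.to_dict(), len(cleaned))
```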

## Understanding "Elements"
We don't have a precise measure of food consumption yet. Instead we have a variety of "Elements" whose values can be combined to measure how much meat was consumed. We don't yet have a sense of the range of these values or of which ones should be added and which subtracted. Let's see what the mean value for each element can tell us.

In [11]:
fao_means = fao_all.groupby('Element')['Value'].mean()
fao_means

Element
Export Quantity          1464.254545
Import Quantity           407.090909
Other uses (non-food)     159.393939
Processing                  7.121212
Production               8687.309091
Residuals                 -19.640000
Stock Variation             9.200000
Name: Value, dtype: float64

A few interesting things stand out. Production is by far the highest, which makes sense because the United States is a big producer of meat. That goes hand in hand with being a strong exporter of meat, which has the next highest mean. The rest of our values are much lower by comparison. Through some googling we have learned that Processing, Other uses, Stock Variation, and Residuals all *take away* from the overall production value. These are associated with food being lost before it reaches a consumer, so we'll want to subtract them from the overall production value. The one exception is Residuals, which already has a negative mean here. We'll flip its sign first so that it reflects the amount lost in the same way as the other elements.

In [12]:
#confirm residuals is never positive (a max of 0 means every value is zero or negative)
fao_all[fao_all.Element=='Residuals']['Value'].max()

0.0

In [13]:
#change residuals values to be positive so our measure of variance later on isn't skewed
fao_all.loc[fao_all.Element=='Residuals', 'Value'] *= -1

In [14]:
fao_all[fao_all.Element=='Residuals']['Value'].max()

97.0

## Create Consumption Dataframe
Now we will copy the fao_all dataframe and multiply by -1 every element value that takes away from consumption. In the code below that means every element other than Production and Import Quantity, so Export Quantity is subtracted as well. This lets us sum the element values for each meat type to get an overall consumption value.

In [15]:
#Multiply all element values by -1 where they should be subtracted
consumption = fao_all.copy()
pos_elems = ['Import Quantity', 'Production']
#every element that is not added counts against consumption
neg_elems = set(consumption.Element.unique()) - set(pos_elems)
consumption.loc[consumption.Element.isin(neg_elems), 'Value'] *= -1
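As a worked illustration of this sign convention, the sketch below applies it to the non-null Pigmeat 2008 rows shown earlier. It is a standalone example, not a cell from the notebook; `pig08` and `signed` are hypothetical names, and it skips the Residuals step because Residuals is null for that item and year.

```python
import pandas as pd

# Non-null Pigmeat 2008 rows from the table above (units of 1000 tonnes).
pig08 = pd.DataFrame({
    "Element": ["Production", "Import Quantity", "Stock Variation",
                "Export Quantity", "Processing", "Other uses (non-food)"],
    "Value": [10599.0, 454.0, -19.0, 2129.0, 0.0, 15.0],
})

pos_elems = ["Import Quantity", "Production"]

# Flip the sign of every element that is not in pos_elems.
signed = pig08.copy()
signed.loc[~signed["Element"].isin(pos_elems), "Value"] *= -1

# 10599 + 454 + 19 - 2129 - 0 - 15
total = signed["Value"].sum()
print(total)  # 8928.0
```

Note that Stock Variation is negative in the raw data, so under this rule it ends up adding 19 to the total rather than subtracting it.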

In [16]:
#clean consumption df
consumption.reset_index(drop=True, inplace=True)
consumption.rename(columns={'Value':'Consumption'}, inplace=True)
consumption.head()

Unnamed: 0,Element,Item,Year,Unit,Consumption
0,Production,Bovine Meat,2008,1000 tonnes,12163.0
1,Production,Bovine Meat,2009,1000 tonnes,11891.0
2,Production,Bovine Meat,2010,1000 tonnes,12046.0
3,Production,Bovine Meat,2011,1000 tonnes,11921.0
4,Production,Bovine Meat,2012,1000 tonnes,11811.0


Since our Unit is the same for all observations, we'll drop the column but store it in a variable for later use.

In [17]:
consumption_unit = 1000 #values are reported in units of 1000 tonnes
consumption.drop(labels='Unit', axis=1, inplace=True)

Next we need to group by Item and Year so that we can get the total consumption for each type of meat in each year. Because the subtracted elements are now negative, a plain sum gives the positive elements minus the elements that take away from the consumption value.

In [18]:
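#Group dataframe by item and year and sum the signed values to get the consumption value
#(select the numeric column explicitly so the text column Element is not summed)
consumption_per_item = consumption.groupby(['Item','Year'])['Consumption'].sum().reset_index()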
consumption_per_item.head()

Unnamed: 0,Item,Year,Consumption
0,Bovine Meat,2008,12444.0
1,Bovine Meat,2009,12258.0
2,Bovine Meat,2010,12072.0
3,Bovine Meat,2011,11592.0
4,Bovine Meat,2012,11725.0


In [19]:
#If we want to see the average signed element value across all items for each year
all_avg = consumption.groupby('Year')['Consumption'].mean().reset_index()
all_avg

Unnamed: 0,Year,Consumption
0,2008,1448.538462
1,2009,1415.192308
2,2010,1427.192308
3,2011,1386.5
4,2012,1401.192308
5,2013,1414.076923
6,2014,1186.677419
7,2015,1214.709677
8,2016,1243.064516
9,2017,1264.709677


## Population Data
We will also import data for total population, consumer price indices, and employment in agriculture. These indicators may provide context for why meat consumption increases or decreases over time.

In [20]:
pop = pd.read_csv('data/pop.csv')
pop.head()

Unnamed: 0,Domain,Area,Element,Item,Year,Unit,Value
0,Annual population,United States of America,Total Population - Both sexes,Population - Est. & Proj.,2008,1000 persons,303486.012
1,Annual population,United States of America,Total Population - Both sexes,Population - Est. & Proj.,2009,1000 persons,306307.567
2,Annual population,United States of America,Total Population - Both sexes,Population - Est. & Proj.,2010,1000 persons,309011.475
3,Annual population,United States of America,Total Population - Both sexes,Population - Est. & Proj.,2011,1000 persons,311584.047
4,Annual population,United States of America,Total Population - Both sexes,Population - Est. & Proj.,2012,1000 persons,314043.885


In [21]:
pop.rename(columns={'Value':'Pop'}, inplace=True)
pop.Unit.unique(), pop.Item.unique(), pop.Element.unique()

(array(['1000 persons'], dtype=object),
 array(['Population - Est. & Proj.'], dtype=object),
 array(['Total Population - Both sexes'], dtype=object))

In [22]:
pop.isna().sum(), pop.shape

(Domain     0
 Area       0
 Element    0
 Item       0
 Year       0
 Unit       0
 Pop        0
 dtype: int64,
 (11, 7))

Since our Unit is the same for all observations and there are no null values, we'll drop the column but store the unit in a variable for later use. We'll also drop the rest of our descriptive columns, since they contain the same information in every row.

In [23]:
pop_unit = 1000
pop.drop(labels=['Unit', 'Area','Domain', 'Element', 'Item'], axis=1, inplace=True)

In [24]:
pop.head()

Unnamed: 0,Year,Pop
0,2008,303486.012
1,2009,306307.567
2,2010,309011.475
3,2011,311584.047
4,2012,314043.885
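One possible later use of the stored unit variables is converting to a per-person figure. The sketch below is an assumption about how they might be used, not a step from this notebook; it takes the Bovine Meat 2008 consumption value and the 2008 population shown above, and `KG_PER_TONNE` is a hypothetical helper constant.

```python
# Unit multipliers as stored in the notebook: 1000 tonnes and 1000 persons.
consumption_unit = 1000
pop_unit = 1000
KG_PER_TONNE = 1000  # hypothetical helper constant (metric tonnes)

# Bovine Meat, 2008, taken from the tables shown in this notebook.
consumption_2008 = 12444.0  # in 1000 tonnes
pop_2008 = 303486.012       # in 1000 persons

# Undo the unit scaling on both sides before dividing.
kg_total = consumption_2008 * consumption_unit * KG_PER_TONNE
persons = pop_2008 * pop_unit
kg_per_person = kg_total / persons
print(round(kg_per_person, 1))  # roughly 41 kg per person
```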


## Price Data
Next we'll look at our price data, see what columns can be dropped, whether there are null values, and finally find an average price per year.

In [25]:
price = pd.read_csv('data/prices.csv')
price.head()

Unnamed: 0,Domain,Area,Year,Item,Months,Unit,Value
0,Consumer Price Indices,United States of America,2008,"Consumer Prices, Food Indices (2015 = 100)",January,,85.28566
1,Consumer Price Indices,United States of America,2009,"Consumer Prices, Food Indices (2015 = 100)",January,,90.16683
2,Consumer Price Indices,United States of America,2010,"Consumer Prices, Food Indices (2015 = 100)",January,,88.43106
3,Consumer Price Indices,United States of America,2011,"Consumer Prices, Food Indices (2015 = 100)",January,,90.28956
4,Consumer Price Indices,United States of America,2012,"Consumer Prices, Food Indices (2015 = 100)",January,,95.10648


In [26]:
price.Unit.unique(), price.Item.unique()

(array([nan]),
 array(['Consumer Prices, Food Indices (2015 = 100)',
        'Consumer Prices, General Indices (2015 = 100)'], dtype=object))

In [27]:
price.isna().sum(), price.shape

(Domain      0
 Area        0
 Year        0
 Item        0
 Months      0
 Unit      264
 Value       0
 dtype: int64,
 (264, 7))

In [28]:
price.drop(labels=['Unit', 'Item', 'Domain', 'Area'], axis=1,inplace=True)

In [29]:
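#average over all months and both index series (Food and General), since Item was dropped above
#(select Value explicitly so the text column Months is not averaged)
price_per_year = price.groupby('Year')['Value'].mean().reset_index()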
price_per_year.rename(columns={'Value':'Price'}, inplace=True)

In [30]:
price_per_year.head()

Unnamed: 0,Year,Price
0,2008,89.51137
1,2009,89.552337
2,2010,90.440451
3,2011,94.027484
4,2012,96.152371
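The yearly price above pools the Food and General indices into one number. If the two series should stay separate, one option is a pivot table. This is a sketch on toy rows with placeholder values; `toy_price`, `pooled`, and `by_item` are hypothetical names and the Item labels are shortened for readability.

```python
import pandas as pd

# Toy rows with the same columns as the price sheet; values are placeholders.
toy_price = pd.DataFrame({
    "Year": [2008, 2008, 2008, 2008],
    "Item": ["Food", "Food", "General", "General"],
    "Months": ["January", "February", "January", "February"],
    "Value": [80.0, 82.0, 90.0, 94.0],
})

# Pooled mean per year across all months and both series.
pooled = toy_price.groupby("Year")["Value"].mean()

# Alternative: one column per index series, averaged over months.
by_item = toy_price.pivot_table(index="Year", columns="Item",
                                values="Value", aggfunc="mean")
print(pooled.loc[2008])  # 86.5
print(by_item)
```

Keeping the series apart would let the food index be compared against general inflation later, at the cost of one extra column in the merged table.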


## Employment Data
Next we'll look at our employment data, see what columns can be dropped, whether there are null values, and finally keep only the "Employment in agriculture" indicator.

In [32]:
employment = pd.read_csv('data/employment.csv')
employment[employment.Value.notna()]

Unnamed: 0,Domain,Area,Indicator,Source,Year,Value
88,Employment Indicators,United States of America,"Agriculture value added per worker (US$, 2010 ...",Labour force survey,2008,69986.273438
89,Employment Indicators,United States of America,Employment in agriculture,Labour force survey,2008,1943.79
90,Employment Indicators,United States of America,"Agriculture value added per worker (US$, 2010 ...",Labour force survey,2009,79915.085938
91,Employment Indicators,United States of America,Employment in agriculture,Labour force survey,2009,1888.286
92,Employment Indicators,United States of America,"Agriculture value added per worker (US$, 2010 ...",Labour force survey,2010,73930.257813
93,Employment Indicators,United States of America,Employment in agriculture,Labour force survey,2010,1978.892
94,Employment Indicators,United States of America,"Agriculture value added per worker (US$, 2010 ...",Labour force survey,2011,69596.734375
95,Employment Indicators,United States of America,Employment in agriculture,Labour force survey,2011,2020.801
96,Employment Indicators,United States of America,"Agriculture value added per worker (US$, 2010 ...",Labour force survey,2012,68901.375
97,Employment Indicators,United States of America,Employment in agriculture,Labour force survey,2012,1966.731


In [36]:
employment[employment.Source !='Labour force survey']

Unnamed: 0,Domain,Area,Indicator,Source,Year,Value
0,Employment Indicators,United States of America,"Agriculture value added per worker (US$, 2010 ...",Administrative records,2008,
1,Employment Indicators,United States of America,Employment in agriculture,Administrative records,2008,
2,Employment Indicators,United States of America,"Agriculture value added per worker (US$, 2010 ...",Administrative records,2009,
3,Employment Indicators,United States of America,Employment in agriculture,Administrative records,2009,
4,Employment Indicators,United States of America,"Agriculture value added per worker (US$, 2010 ...",Administrative records,2010,
...,...,...,...,...,...,...
149,Employment Indicators,United States of America,Employment in agriculture,Population census,2016,
150,Employment Indicators,United States of America,"Agriculture value added per worker (US$, 2010 ...",Population census,2017,
151,Employment Indicators,United States of America,Employment in agriculture,Population census,2017,
152,Employment Indicators,United States of America,"Agriculture value added per worker (US$, 2010 ...",Population census,2018,


In [32]:
employment.Value.isna().sum(), employment.shape

(133, (154, 6))

In [33]:
employment.Indicator.unique(), employment.Source.unique()

(array(['Agriculture value added per worker (US$, 2010 prices)',
        'Employment in agriculture'], dtype=object),
 array(['Administrative records', 'Employment surveys',
        'Household income and expenditure survey', 'Household survey',
        'Labour force survey', 'Official estimates', 'Population census'],
       dtype=object))

In [34]:
employment = employment[~employment.Value.isna()]
employment = employment[employment.Indicator=='Employment in agriculture']

In [35]:
employment.drop(labels=['Domain', 'Area', 'Source', 'Indicator'], axis=1,inplace=True)
employment.rename(columns={'Value':'Employment'}, inplace=True)

In [36]:
employment.reset_index(drop=True, inplace=True)
employment.head()

Unnamed: 0,Year,Employment
0,2008,1943.79
1,2009,1888.286
2,2010,1978.892
3,2011,2020.801
4,2012,1966.731


## Combining Tables
Finally we left-join the population, price, and employment tables onto the consumption table by Year, so every Item and Year row keeps its consumption value even when an indicator is missing for that year.

In [59]:
merged1 = pd.merge(consumption_per_item, pop, how='left', on='Year')
merged1.head()

Unnamed: 0,Item,Year,Consumption,Pop
0,Bovine Meat,2008,12444.0,303486.012
1,Bovine Meat,2009,12258.0,306307.567
2,Bovine Meat,2010,12072.0,309011.475
3,Bovine Meat,2011,11592.0,311584.047
4,Bovine Meat,2012,11725.0,314043.885


In [60]:
merged2 = pd.merge(merged1, price_per_year, how='left', on='Year')
merged2.head()

Unnamed: 0,Item,Year,Consumption,Pop,Price
0,Bovine Meat,2008,12444.0,303486.012,89.51137
1,Bovine Meat,2009,12258.0,306307.567,89.552337
2,Bovine Meat,2010,12072.0,309011.475,90.440451
3,Bovine Meat,2011,11592.0,311584.047,94.027484
4,Bovine Meat,2012,11725.0,314043.885,96.152371


In [61]:
merged_indicators = pd.merge(merged2, employment, how='left', on='Year')
merged_indicators.head()

Unnamed: 0,Item,Year,Consumption,Pop,Price,Employment
0,Bovine Meat,2008,12444.0,303486.012,89.51137,1943.79
1,Bovine Meat,2009,12258.0,306307.567,89.552337,1888.286
2,Bovine Meat,2010,12072.0,309011.475,90.440451,1978.892
3,Bovine Meat,2011,11592.0,311584.047,94.027484,2020.801
4,Bovine Meat,2012,11725.0,314043.885,96.152371,1966.731


In [65]:
datapath = 'data' 
save_file(merged_indicators, 'merged_data.csv', datapath)

Writing file.  "data/merged_data.csv"
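Because these are left joins on Year, two things are worth checking: that each indicator table has one row per year, and which years found no match. The sketch below shows one way to do that with the `validate` and `indicator` options of `pd.merge`; the toy frames reuse a few values from the tables above, and leaving 2009 out of the population frame is a deliberate assumption made to demonstrate an unmatched year.

```python
import pandas as pd

# Two Bovine Meat rows from the consumption table above.
items = pd.DataFrame({
    "Item": ["Bovine Meat", "Bovine Meat"],
    "Year": [2008, 2009],
    "Consumption": [12444.0, 12258.0],
})

# Population frame with 2009 deliberately left out for the demonstration.
pop_toy = pd.DataFrame({"Year": [2008], "Pop": [303486.012]})

# validate="many_to_one" raises if Year is duplicated on the right side;
# indicator=True adds a _merge column saying where each row came from.
merged = pd.merge(items, pop_toy, how="left", on="Year",
                  validate="many_to_one", indicator=True)

# Years present in the consumption data but absent from the indicator table.
unmatched_years = merged.loc[merged["_merge"] == "left_only", "Year"].tolist()
print(unmatched_years)  # [2009]
```

The `_merge` column can be dropped again before saving once the unmatched years have been reviewed.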
