# Data Wrangling

This first capstone aims to forecast the consumption of red meat in the United States over the next 10 years. We will look for patterns in social, economic, and environmental indicators that could be predictors of consumption.

In this notebook, we will inspect and clean our datasets for this project. FAO stores its food balance data in two separate sheets, one for years through 2013 and one for 2014 onward. Since both sheets have the same column structure, our first step will be to combine them into one dataframe.

In [1]:
import pandas as pd
fao08 = pd.read_csv('faostat_08.csv')
fao14 = pd.read_csv('faostat_14.csv')

#Concatenate the two dataframes, since they have the same column structure.
#ignore_index=True builds a fresh 0..n-1 index, so no separate reset/drop is needed.
fao_all = pd.concat([fao08, fao14], ignore_index=True)
fao_all.head()


Unnamed: 0,Domain,Area,Element,Item,Year,Unit,Value
0,"Food Balances (-2013, old methodology and popu...",United States of America,Production,Bovine Meat,2008,1000 tonnes,12163.0
1,"Food Balances (-2013, old methodology and popu...",United States of America,Production,Bovine Meat,2009,1000 tonnes,11891.0
2,"Food Balances (-2013, old methodology and popu...",United States of America,Production,Bovine Meat,2010,1000 tonnes,12046.0
3,"Food Balances (-2013, old methodology and popu...",United States of America,Production,Bovine Meat,2011,1000 tonnes,11921.0
4,"Food Balances (-2013, old methodology and popu...",United States of America,Production,Bovine Meat,2012,1000 tonnes,11811.0
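Before concatenating, it can be worth confirming that the two sheets share the same column names, not just the same number of columns. The following is a minimal sketch on made-up rows; the small frames `old` and `new` are hypothetical stand-ins for `fao08` and `fao14`, and their values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-ins for the two FAO sheets (values are made up).
old = pd.DataFrame({"Item": ["Bovine Meat"], "Year": [2013], "Value": [1.0]})
new = pd.DataFrame({"Item": ["Bovine Meat"], "Year": [2014], "Value": [2.0]})

# Same names in the same order means rows line up column by column.
same_columns = list(old.columns) == list(new.columns)

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat([old, new], ignore_index=True)

print(same_columns)
print(combined.index.tolist())
```

If the column lists differ, `pd.concat` still runs but fills the unmatched columns with nulls, so a check like this can catch a silent mismatch early.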


In [2]:
fao_all.dtypes

Domain      object
Area        object
Element     object
Item        object
Year         int64
Unit        object
Value      float64
dtype: object

In [3]:
fao_all.Element.unique()

array(['Production', 'Import Quantity', 'Stock Variation',
       'Export Quantity', 'Losses', 'Processing', 'Other uses (non-food)',
       'Tourist consumption', 'Residuals'], dtype=object)

In [4]:
fao_all.Item.unique()

array(['Bovine Meat', 'Mutton & Goat Meat', 'Pigmeat', 'Poultry Meat',
       'Meat, Other'], dtype=object)

In [5]:
fao_all.Unit.unique()

array(['1000 tonnes', nan], dtype=object)

In [6]:
fao_all['Domain'].unique()

array(['Food Balances (-2013, old methodology and population)',
       'Food Balances (2014-)'], dtype=object)

In [7]:
fao_all[(fao_all.Item=='Pigmeat') & (fao_all.Year==2008)]

Unnamed: 0,Domain,Area,Element,Item,Year,Unit,Value
108,"Food Balances (-2013, old methodology and popu...",United States of America,Production,Pigmeat,2008,1000 tonnes,10599.0
114,"Food Balances (-2013, old methodology and popu...",United States of America,Import Quantity,Pigmeat,2008,1000 tonnes,454.0
120,"Food Balances (-2013, old methodology and popu...",United States of America,Stock Variation,Pigmeat,2008,1000 tonnes,-19.0
126,"Food Balances (-2013, old methodology and popu...",United States of America,Export Quantity,Pigmeat,2008,1000 tonnes,2129.0
132,"Food Balances (-2013, old methodology and popu...",United States of America,Losses,Pigmeat,2008,,
138,"Food Balances (-2013, old methodology and popu...",United States of America,Processing,Pigmeat,2008,1000 tonnes,0.0
144,"Food Balances (-2013, old methodology and popu...",United States of America,Other uses (non-food),Pigmeat,2008,1000 tonnes,15.0
150,"Food Balances (-2013, old methodology and popu...",United States of America,Tourist consumption,Pigmeat,2008,,
156,"Food Balances (-2013, old methodology and popu...",United States of America,Residuals,Pigmeat,2008,,


It looks like the "Domain" column is not relevant to our analysis, since it only records which methodology the balance sheet used, so we'll drop it.

In [8]:
fao_all.drop(labels='Domain', axis=1, inplace=True)

## Null Values
Next we'll check for null values. As the output above shows, the numeric data we care most about is in the Value column. We want to see where Value is null and whether those rows also have nulls in other columns, such as Unit. Rows without a Value carry no information for us, so we drop them.

In [9]:
null_vals = fao_all[fao_all.Value.isna()]
fao_all.drop(labels = null_vals.index, inplace=True)
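The question of whether a null Value goes together with a null Unit can be checked directly by comparing the two null masks. This is a sketch on an invented three-row frame `df`, not the real FAO data; the name `nulls_coincide` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows that mimic the pattern seen in the Pigmeat output.
df = pd.DataFrame({
    "Element": ["Production", "Losses", "Residuals"],
    "Unit": ["1000 tonnes", np.nan, np.nan],
    "Value": [10599.0, np.nan, np.nan],
})

value_null = df["Value"].isna()
unit_null = df["Unit"].isna()

# True only if the two masks agree on every row.
nulls_coincide = bool((value_null == unit_null).all())

# Which elements are affected by missing values, and how often.
null_counts = df.loc[value_null, "Element"].value_counts()

# Equivalent to dropping the rows selected by the null mask.
cleaned = df.dropna(subset=["Value"])

print(nulls_coincide)
print(null_counts)
```

`dropna(subset=["Value"])` is a shorter way to remove the same rows that the cell above removes by index.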

## Understanding "Elements"
We don't have a precise measure of food consumption yet. Instead, we have a variety of "Elements" whose values can be taken together to measure how much meat was consumed. We don't yet have a sense of the range of these values, or of which ones should be added and which subtracted. Let's see what the mean value for each element can tell us.

In [10]:
fao_means = fao_all.groupby('Element')['Value'].mean()
fao_means

Element
Export Quantity          1464.254545
Import Quantity           407.090909
Other uses (non-food)     159.393939
Processing                  7.121212
Production               8687.309091
Residuals                 -19.640000
Stock Variation             9.200000
Name: Value, dtype: float64

A few interesting things stand out. Production is by far the highest, which makes sense because the United States is a big producer of meat. That goes hand in hand with being a strong exporter of meat, and Export Quantity has the next highest mean. The rest of our values are a lot lower by comparison. Through some googling we have learned that Processing, Other uses, and Stock Variation all *take away* from the overall production value. These are associated with meat that does not reach a consumer, so we'll want to subtract them. Exports leave the domestic supply as well, so we subtract Export Quantity too. The one exception is Residuals, which are already recorded as negative values in this case, so we can add them as they are to reflect the amount lost.

In [11]:
#confirm residuals is never positive (a max of 0.0 means every value is <= 0)
fao_all[fao_all.Element=='Residuals']['Value'].max()

0.0

In [12]:
#Multiply element values by -1 where they should be subtracted
neg_elems = ['Export Quantity', 'Processing', 'Stock Variation', 'Other uses (non-food)']
fao_all.loc[fao_all.Element.isin(neg_elems), 'Value'] *= -1
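An alternative way to express the sign flip is a lookup from element name to sign, which keeps the accounting rule in one place and leaves the original values untouched. This is a sketch on invented rows rather than the notebook's own method; `signs` and the `Signed` column are hypothetical names, and any element missing from `signs` is assumed to count as +1.

```python
import pandas as pd

# Invented values, for illustration only.
df = pd.DataFrame({
    "Element": ["Production", "Import Quantity", "Export Quantity", "Residuals"],
    "Value": [100.0, 10.0, 30.0, -2.0],
})

# Elements that are subtracted from supply get -1; everything else defaults to +1.
signs = {
    "Export Quantity": -1,
    "Processing": -1,
    "Stock Variation": -1,
    "Other uses (non-food)": -1,
}

df["Signed"] = df["Value"] * df["Element"].map(signs).fillna(1)

print(df)
```

Writing to a new column means the cell can be rerun without flipping the signs a second time.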

In [13]:
#Average value for each element across all years
#Selecting 'Value' before .mean() avoids errors on the text columns in recent pandas versions
avg_elems = fao_all.groupby('Element')[['Value']].mean()
avg_elems

Element,Value
Export Quantity,-1464.254545
Import Quantity,407.090909
Other uses (non-food),-159.393939
Processing,-7.121212
Production,8687.309091
Residuals,-19.64
Stock Variation,-9.2


In [14]:
#Average value for each year across all items and elements
all_avg = fao_all.groupby('Year')[['Value']].mean()
all_avg

Year,Value
2008,1448.538462
2009,1415.192308
2010,1427.192308
2011,1386.5
2012,1401.192308
2013,1414.076923
2014,1186.677419
2015,1214.709677
2016,1243.064516
2017,1264.709677
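If a breakdown by item within each year is wanted rather than a single mean per year, grouping on both keys and unstacking is one option. The sketch below uses invented numbers, and `by_year_item` is a name introduced here for illustration.

```python
import pandas as pd

# Invented rows, for illustration only.
df = pd.DataFrame({
    "Year": [2008, 2008, 2009, 2009],
    "Item": ["Pigmeat", "Pigmeat", "Pigmeat", "Bovine Meat"],
    "Value": [10.0, 20.0, 30.0, 40.0],
})

# One row per year, one column per item; missing combinations become NaN.
by_year_item = df.groupby(["Year", "Item"])["Value"].mean().unstack("Item")

print(by_year_item)
```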


## Consumption Value
Now that the elements to be subtracted carry negative values, we can sum the Value column for each year and type of meat to get overall consumption for that type of meat.

In [15]:
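#Sum only the numeric Value column; as_index=False keeps Year and Item as regular columns
consumption = fao_all.groupby(['Year','Item'], as_index=False)['Value'].sum()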
consumption

Unnamed: 0,Year,Item,Value
0,2008,Bovine Meat,12444.0
1,2008,"Meat, Other",283.0
2,2008,Mutton & Goat Meat,163.0
3,2008,Pigmeat,8928.0
4,2008,Poultry Meat,15844.0
5,2009,Bovine Meat,12258.0
6,2009,"Meat, Other",281.0
7,2009,Mutton & Goat Meat,153.0
8,2009,Pigmeat,9005.0
9,2009,Poultry Meat,15098.0


## Other Indicators
We will also import data for total population, consumer price for agriculture, and employment in agriculture. These indicators may provide context to understand why meat consumption increases or decreases over time.

In [16]:
pop = pd.read_csv('pop.csv')
price = pd.read_csv('prices.csv')
employment = pd.read_csv('employment.csv')

In [17]:
pop.rename(columns={'Value':'Pop'}, inplace=True)
pop.head()

Unnamed: 0,Domain,Area,Element,Item,Year,Unit,Pop
0,Annual population,United States of America,Total Population - Both sexes,Population - Est. & Proj.,2008,1000 persons,303486.012
1,Annual population,United States of America,Total Population - Both sexes,Population - Est. & Proj.,2009,1000 persons,306307.567
2,Annual population,United States of America,Total Population - Both sexes,Population - Est. & Proj.,2010,1000 persons,309011.475
3,Annual population,United States of America,Total Population - Both sexes,Population - Est. & Proj.,2011,1000 persons,311584.047
4,Annual population,United States of America,Total Population - Both sexes,Population - Est. & Proj.,2012,1000 persons,314043.885


In [18]:
price.head()
price.isna().sum(), price.shape

(Domain      0
 Area        0
 Year        0
 Item        0
 Months      0
 Unit      264
 Value       0
 dtype: int64,
 (264, 7))

Every value in the Unit column is null (264 of 264 rows), so we'll drop it.

In [19]:
price.drop(labels='Unit', axis=1,inplace=True)
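When it is not known in advance which columns are empty, `dropna` with `axis=1` and `how="all"` removes every column that contains only nulls. This is a sketch on an invented frame `prices`; partially filled columns are kept.

```python
import numpy as np
import pandas as pd

# Invented rows: Unit is entirely null, Value is only partly null.
prices = pd.DataFrame({
    "Year": [2008, 2009],
    "Unit": [np.nan, np.nan],
    "Value": [1.5, np.nan],
})

# Drop columns where every entry is null; Value survives because one entry is present.
trimmed = prices.dropna(axis=1, how="all")

print(trimmed.columns.tolist())
```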

The employment data has several null entries in the Value column, which won't help us. We'll make a new dataframe with only the rows that have values.


In [20]:
employment = employment[~employment.Value.isna()].reset_index()
employment.Domain.unique()

array(['Employment Indicators'], dtype=object)

In [21]:
employment.drop(labels=['index','Domain'], axis=1, inplace=True)
employment.head()

Unnamed: 0,Area,Indicator,Source,Year,Value
0,United States of America,"Agriculture value added per worker (US$, 2010 ...",Labour force survey,2008,69986.273438
1,United States of America,Employment in agriculture,Labour force survey,2008,1943.79
2,United States of America,"Agriculture value added per worker (US$, 2010 ...",Labour force survey,2009,79915.085938
3,United States of America,Employment in agriculture,Labour force survey,2009,1888.286
4,United States of America,"Agriculture value added per worker (US$, 2010 ...",Labour force survey,2010,73930.257813
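The filter-and-reindex step for `employment` can also be written with `dropna` and `reset_index(drop=True)`, which avoids creating an extra `index` column that then has to be dropped. The rows below are invented, and `emp` and `emp_clean` are names used only in this sketch.

```python
import numpy as np
import pandas as pd

# Invented rows, for illustration only.
emp = pd.DataFrame({
    "Indicator": ["Employment in agriculture"] * 3,
    "Year": [2008, 2009, 2010],
    "Value": [1.0, np.nan, 3.0],
})

# Keep rows with a Value, then renumber without keeping the old index as a column.
emp_clean = emp.dropna(subset=["Value"]).reset_index(drop=True)

print(emp_clean)
```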
