# What's the Price of Wheat?

<img src="http://www.ictinternational.com/content/uploads/2015/02/wheat1.jpg" width=640 height=480/>

## Project Background and Motivation

We chose to build a predictor of the market price of wheat because food prices fluctuate constantly. Wheat is one of the most fundamental agricultural commodities in the United States, so understanding and ultimately predicting its price helps us understand an essential part of our economic ecosystem. Our team has diverse backgrounds in engineering and science, and we wanted a topic with profound global implications. One of the grand challenges identified by leaders of the U.N. and the World Bank is the shortage of food for an ever-growing population. Forecasting the price of this key commodity also serves as a proof of principle: the same framework could be used to forecast the price of any other agricultural commodity.

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

# Cleaning Data
We will be using precipitation, temperature, [fill in additional features here] to predict the price of wheat. Let's clean our data before using it in our model.

### 1.1 Precipitation Data

In [27]:
dirtyprecdf=pd.read_csv("stationprec.csv")
dirtyprecdf.head()

Unnamed: 0,Station ID,Year,Jan,Feb,March,April,May,June,July,August,September,October,November,December
0,AQC00914000,1981,4279,3745,10762,6067,4096,3606,6203,5292,3092,6866,7163,7866
1,AQC00914000,1982,5039,9643,3211,2016,3355,2827,3199,9356,4150,6418,3965,1595
2,AQC00914000,1983,3351,2971,3044,2642,1644,1717,1020,1788,3433,6801,2531,7242
3,AQC00914000,1984,3368,3538,8187,2715,2916,3288,1246,3391,2932,6578,4787,9787
4,AQC00914000,1985,5202,3078,3279,8414,2884,4787,3447,3193,5296,5410,3950,1651


There are a few things in the precipitation dataframe that we want to change. The monthly values need to be interpreted, since they often appear in the form ####F. Some entries are -9999M, which indicates missing data, so we drop the rows that contain them. We then strip the letters and convert each number to inches. According to the readme from our source, each number is a count of hundredths of an inch (e.g. 1486 = 14.86 inches).

In [28]:
precdf = dirtyprecdf.copy()
# remove the -9999M rows, strip the trailing F, and convert the numbers to inches
for col in precdf:
    if col != "Station ID" and col != "Year":
        # .copy() makes the filtered frame independent, which avoids SettingWithCopyWarning
        precdf = precdf[precdf[col] != '-9999M'].copy()
        precdf[col] = precdf[col].map(lambda x: float(x.rstrip("F")) / 100.0)
precdf.head()


Unnamed: 0,Station ID,Year,Jan,Feb,March,April,May,June,July,August,September,October,November,December
0,AQC00914000,1981,42.79,37.45,107.62,60.67,40.96,36.06,62.03,52.92,30.92,68.66,71.63,78.66
1,AQC00914000,1982,50.39,96.43,32.11,20.16,33.55,28.27,31.99,93.56,41.5,64.18,39.65,15.95
2,AQC00914000,1983,33.51,29.71,30.44,26.42,16.44,17.17,10.2,17.88,34.33,68.01,25.31,72.42
3,AQC00914000,1984,33.68,35.38,81.87,27.15,29.16,32.88,12.46,33.91,29.32,65.78,47.87,97.87
4,AQC00914000,1985,52.02,30.78,32.79,84.14,28.84,47.87,34.47,31.93,52.96,54.1,39.5,16.51
5,AQC00914000,1986,85.73,42.12,27.99,70.95,50.12,31.35,38.78,29.89,63.68,47.4,43.12,83.1
6,AQC00914000,1987,51.46,76.48,39.54,35.73,31.18,27.88,20.43,34.98,8.41,26.69,22.98,62.08
7,AQC00914000,1988,33.36,47.98,49.08,44.0,42.97,28.17,38.78,25.48,37.92,46.45,59.69,89.15
8,AQC00914000,1989,55.23,58.06,32.73,53.88,36.72,30.79,41.66,2.09,7.88,49.82,63.79,37.31
9,AQC00914000,1990,49.32,68.57,32.88,45.18,16.35,35.98,22.14,15.41,26.11,54.39,45.25,35.3
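The cleaning rules described above (strip a trailing F, treat -9999M as missing, divide by 100) can also be written as a small standalone function. The sketch below is only an illustration: the helper name `parse_hundredths` and the toy rows are hypothetical and are not taken from `stationprec.csv`.

```python
import numpy as np
import pandas as pd

MISSING = "-9999M"  # missing-value marker described in the text


def parse_hundredths(raw):
    """Convert a raw entry such as '1486F' to 14.86; return NaN if missing."""
    text = str(raw).strip()
    if text == MISSING:
        return np.nan
    return float(text.rstrip("F")) / 100.0


# Hypothetical toy rows, used only to exercise the helper
toy = pd.DataFrame({
    "Station ID": ["A", "A", "B"],
    "Year": [1981, 1982, 1981],
    "Jan": ["1486F", "-9999M", "4279"],
    "Feb": ["3745", "9643F", "120"],
})

month_cols = [c for c in toy.columns if c not in ("Station ID", "Year")]
cleaned = toy.copy()
for col in month_cols:
    cleaned[col] = cleaned[col].map(parse_hundredths)

# Drop every row that had at least one missing month
cleaned = cleaned.dropna(subset=month_cols)
print(cleaned)
```

Marking missing entries as NaN first and dropping them in one step keeps the parsing and the filtering separate, which makes it easy to switch to a different policy later, such as filling gaps instead of dropping rows.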


Station IDs aren't useful to us on their own; instead, we want to map each station to the state it is in.

In [29]:
statiddf = pd.read_csv("statid.csv")
statiddf = statiddf[['Station ID', 'State']]
# build a Station ID -> State lookup dictionary
statid = dict(zip(statiddf['Station ID'], statiddf['State']))
# hard code some values that aren't in the station id list, but appear in precdf
statid['USC00085612'] = 'FL'

In [31]:
# look up the state for every station ID (raises KeyError if an ID is missing)
precdf['State'] = [statid[station_id] for station_id in precdf['Station ID']]
precdf = precdf.drop('Station ID', axis=1)
precdf.head()

Unnamed: 0,Year,Jan,Feb,March,April,May,June,July,August,September,October,November,December,State
0,1981,42.79,37.45,107.62,60.67,40.96,36.06,62.03,52.92,30.92,68.66,71.63,78.66,AS
1,1982,50.39,96.43,32.11,20.16,33.55,28.27,31.99,93.56,41.5,64.18,39.65,15.95,AS
2,1983,33.51,29.71,30.44,26.42,16.44,17.17,10.2,17.88,34.33,68.01,25.31,72.42,AS
3,1984,33.68,35.38,81.87,27.15,29.16,32.88,12.46,33.91,29.32,65.78,47.87,97.87,AS
4,1985,52.02,30.78,32.79,84.14,28.84,47.87,34.47,31.93,52.96,54.1,39.5,16.51,AS
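The station-to-state conversion above relies on every station ID being present in the lookup. One way to find IDs that need hard-coding, sketched here under assumptions, is to use `Series.map` with the dictionary and then list the IDs that came back as NaN. Apart from `AQC00914000`, the station IDs and the small tables below are hypothetical stand-ins for `statid.csv` and the precipitation data.

```python
import pandas as pd

# Hypothetical lookup table standing in for statid.csv
stations = pd.DataFrame({
    "Station ID": ["AQC00914000", "USC00500243"],
    "State": ["AS", "AK"],
})
lookup = dict(zip(stations["Station ID"], stations["State"]))

# Hypothetical readings; the last ID is deliberately absent from the lookup
readings = pd.DataFrame({
    "Station ID": ["AQC00914000", "USC00500243", "XXX00000000"],
    "Year": [1981, 1981, 1981],
})

# Series.map with a dict returns NaN for keys it cannot find
readings["State"] = readings["Station ID"].map(lookup)

# IDs without a state can be listed and then patched by hand
unmatched = readings.loc[readings["State"].isna(), "Station ID"].unique().tolist()
print(unmatched)
```

Unlike direct dictionary indexing, this does not stop at the first unknown station, so all missing IDs can be collected in one pass.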


Now let's modify the column names. We want to standardize the month names to their three-letter abbreviations.

In [33]:
precdf.columns = ['Year', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 
                  'Sep', 'Oct', 'Nov', 'Dec', 'State']
precdf.head()

Unnamed: 0,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,State
0,1981,42.79,37.45,107.62,60.67,40.96,36.06,62.03,52.92,30.92,68.66,71.63,78.66,AS
1,1982,50.39,96.43,32.11,20.16,33.55,28.27,31.99,93.56,41.5,64.18,39.65,15.95,AS
2,1983,33.51,29.71,30.44,26.42,16.44,17.17,10.2,17.88,34.33,68.01,25.31,72.42,AS
3,1984,33.68,35.38,81.87,27.15,29.16,32.88,12.46,33.91,29.32,65.78,47.87,97.87,AS
4,1985,52.02,30.78,32.79,84.14,28.84,47.87,34.47,31.93,52.96,54.1,39.5,16.51,AS
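Assigning a full list to `columns` works as long as the column order never changes. A possible alternative, shown as a sketch, is to rename by name with a mapping so that only the long month names are touched. The dictionary `month_names` is a name introduced for this illustration, and the one-row frame reuses a few values from the 1981 row above.

```python
import pandas as pd

# Map only the long month names; Jan, Feb and May are already abbreviated
month_names = {
    "March": "Mar", "April": "Apr", "June": "Jun", "July": "Jul",
    "August": "Aug", "September": "Sep", "October": "Oct",
    "November": "Nov", "December": "Dec",
}

toy = pd.DataFrame({
    "Year": [1981],
    "Jan": [42.79],
    "March": [107.62],
    "September": [30.92],
    "State": ["AS"],
})

# Columns that are not keys of the mapping are left unchanged
renamed = toy.rename(columns=month_names)
print(list(renamed.columns))
```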


Now group the rows by state and year, averaging across all stations in each state.

In [35]:
prec_grouped = precdf.groupby(['State','Year']).mean().reset_index()
prec_grouped.head()

Unnamed: 0,State,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
0,AK,1981,15.683362,8.257069,8.708879,3.853276,4.915259,6.550172,10.535862,15.486552,12.651207,12.820948,12.226121,7.486121
1,AK,1982,5.100342,3.618462,5.281624,5.847863,6.187179,5.819829,7.964359,6.752137,15.152222,11.733419,9.297863,9.896068
2,AK,1983,7.984655,6.574655,2.432241,7.119828,6.262931,4.087155,6.538879,13.81931,10.171466,13.382414,8.525172,3.358017
3,AK,1984,12.432069,8.805259,6.767759,5.168362,4.187759,6.101121,8.3225,10.346638,8.663448,9.372586,7.268966,10.09319
4,AK,1985,15.284348,7.119652,8.681217,5.548087,5.936609,7.53713,5.375565,10.198435,13.984087,9.077913,4.656261,15.266783
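To make the grouping step concrete, here is a minimal sketch on made-up numbers: two hypothetical stations in AK report for 1981, and their monthly values are averaged into a single state-year row.

```python
import pandas as pd

# Made-up readings: two AK stations in 1981, one in 1982, and one FL station
toy = pd.DataFrame({
    "State": ["AK", "AK", "AK", "FL"],
    "Year": [1981, 1981, 1982, 1981],
    "Jan": [10.0, 20.0, 5.0, 3.0],
    "Feb": [8.0, 4.0, 6.0, 2.0],
})

# One row per (State, Year); each month is the mean over that state's stations
grouped = toy.groupby(["State", "Year"]).mean().reset_index()
print(grouped)
```

The two AK rows for 1981 collapse into one row with Jan = 15.0 and Feb = 6.0, while the other state-year pairs pass through unchanged because each has a single station.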


### 1.2 Temperature Data

Now that we have cleaned the precipitation data, we turn to temperature. Since the temperature data comes from the same source, we follow almost exactly the same procedure.

In [37]:
dirtytempdf=pd.read_csv("stationtemp.csv")

In [40]:
# remove the -9999M rows, convert to degrees celsius
for col in dirtytempdf:
    if col != "Station" and col != "Year":
        # .copy() avoids SettingWithCopyWarning on the filtered frame
        dirtytempdf = dirtytempdf[dirtytempdf[col] != '-9999M'].copy()
        dirtytempdf[col] = dirtytempdf[col].map(lambda x: float(x) / 100.0)

In [43]:
%%time
# %%time is a cell magic and must be the first line of the cell
# replace each station ID with its state, and rename 'Station' to 'State'
# (a vectorized dictionary lookup replaces the slow row-by-row loop;
# station IDs missing from statid become NaN)
dirtytempdf = dirtytempdf.rename(columns={'Station': 'State'})
dirtytempdf['State'] = dirtytempdf['State'].map(statid)
dirtytempdf.head()

In [44]:
tdf=dirtytempdf.groupby(['State','Year']).mean().reset_index()

In [45]:
tdf

Unnamed: 0,State,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
0,AK,1981,-7.316923,-10.849936,-2.146795,-0.466090,7.246795,11.301154,12.566090,11.463397,6.910577,-0.061474,-7.999615,-11.240769
1,AK,1982,-10.096667,-12.740577,-6.355897,-3.707115,4.874872,10.444679,13.082628,11.590128,7.221410,-3.105769,-10.229295,-7.180000
2,AK,1983,-10.949359,-9.052115,-6.856731,-1.559167,6.256795,11.532115,13.462115,11.002885,6.548077,-0.134551,-6.825641,-8.156731
3,AK,1984,-9.910449,-10.662051,-4.902692,-0.588718,6.524423,11.626667,13.034487,11.729359,7.952244,1.275385,-7.563333,-10.658910
4,AK,1985,-7.611731,-10.805769,-4.969167,-1.978077,6.095833,10.585897,13.630897,11.520577,6.658910,-2.034872,-10.239231,-7.453141
5,AK,1986,-14.249103,-9.054231,-8.436218,-1.005705,6.271795,11.427564,13.800705,11.938141,8.632692,0.615449,-9.357885,-6.661538
6,AK,1987,-11.177949,-13.390192,-5.334359,0.910128,7.198205,11.196282,13.571667,12.937628,7.433910,0.373910,-10.005577,-11.839038
7,AK,1988,-11.770385,-9.302564,-6.103590,0.414936,6.838333,11.586667,13.255641,11.353526,7.934808,-1.215641,-8.665321,-9.423782
8,AK,1989,-13.934968,-10.375097,-6.601097,0.337226,5.417097,11.458194,14.438710,12.680194,5.787161,-1.266968,-8.512065,-10.032129
9,AK,1990,-13.171731,-12.929231,-4.614231,2.455962,8.081154,12.093910,14.583205,12.959615,7.465449,0.566731,-9.072564,-9.837692


In [46]:
tdf.to_csv("tempdf.csv", index=False)
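Both cleaned tables are keyed by State and Year, so one possible way to bring precipitation and temperature together as features is a merge on those two columns. This is only a sketch of that idea and not necessarily how the features are combined later; the frame names and the `_prec`/`_temp` suffixes are assumptions, and the values are the AK January figures shown in the outputs above.

```python
import pandas as pd

# Small excerpts of the two state-year tables (January only)
prec = pd.DataFrame({
    "State": ["AK", "AK"],
    "Year": [1981, 1982],
    "Jan": [15.683362, 5.100342],
})
temp = pd.DataFrame({
    "State": ["AK", "AK"],
    "Year": [1981, 1982],
    "Jan": [-7.316923, -10.096667],
})

# Suffixes keep the identically named month columns apart after the merge
features = prec.merge(temp, on=["State", "Year"], suffixes=("_prec", "_temp"))
print(features)
```

The default inner join keeps only state-year pairs present in both tables, which avoids rows with half of the features missing.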