Production Curves from Oklahoma Dataset
=======================================

Part 1 - Extracting and Cleaning Data
-------------------------------------

This notebook explores the Oklahoma historical production data set that can be downloaded from

[ftp://ftp.occ.state.ok.us/OG_DATA/historical.ZIP](ftp://ftp.occ.state.ok.us/OG_DATA/historical.ZIP)

The data comes from the [Oklahoma Corporation Commission](http://www.occeweb.com/og/ogdatafiles2.htm) (OCC) website.  The data set contains historical oil and gas production records from 1987 to 2015. Note that the data for 1994 is missing.

The collection consists of a separate ASCII file for each year of production and contains many fields describing the well, ownership, lease numbers, etc.  This notebook demonstrates how to extract the data from these raw files, combine it into a single data set, and clean it so that it is ready for further exploration and analysis.

The Raw Data
------------
Download and extract the `historical.ZIP` archive.  The resulting directory contains the annual production data files.  Each file holds tabular data in which every row gives the monthly production totals for a single well over one year of production.  Fields are delimited by a vertical bar `|`.  The files contain mostly the same fields across the entire range of years, though there is a slight change in format after 2008.
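
As a rough illustration of this layout, the sketch below parses a small pipe-delimited sample with `pandas.read_csv`. The column names (`API_NUMBER`, `WELL_NAME`, `JAN`, `FEB`) and the values are invented for the example; they are not the actual OCC field names.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the layout described above: one row per well,
# with monthly totals as columns. Field names and values are invented.
sample = io.StringIO(
    "API_NUMBER|WELL_NAME|JAN|FEB\n"
    "3500100001|EXAMPLE 1|120|98\n"
    "3500100002|EXAMPLE 2|0|45\n"
)

# sep="|" tells pandas to split fields on the vertical bar.
df_sample = pd.read_csv(sample, sep="|")
print(df_sample)
```

The same `sep="|"` argument should apply when reading a real file from disk, although the actual files may need additional options (encoding, header handling) that this sketch does not cover.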

It will be more convenient to work with the data if it is all in one place.  The [pandas](http://pandas.pydata.org/) package provides an excellent set of tools for working with tabular data.  We can use `pandas` to read the data, and create a single `DataFrame` containing all of the data.
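
A minimal sketch of the combining step, assuming each year's file has already been read into its own `DataFrame`. The two small frames, their column names, and the added `YEAR` column are made up for illustration and do not reflect the real data.

```python
import pandas as pd

# Stand-ins for two yearly tables; columns and values are hypothetical.
df_a = pd.DataFrame({"WELL": ["A", "B"], "JAN": [10, 20]})
df_b = pd.DataFrame({"WELL": ["A", "C"], "JAN": [12, 7]})

# Tag each table with its year so rows remain traceable after combining.
df_a["YEAR"] = 1987
df_b["YEAR"] = 1988

# Stack the tables vertically; ignore_index=True renumbers the rows 0..n-1.
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)
```

Collecting the yearly frames in a list and calling `pd.concat` once is generally cheaper than appending to a growing `DataFrame` inside a loop.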


In [2]:
import pandas as pd
import numpy as np

# Limit how many rows pandas prints when displaying a DataFrame
pd.set_option("display.max_rows", 12)

data_dir = '~/Desktop/DATA/Projects/OK_production_data/historical/'

Because of the change in format, we will process the data files in two batches: years 1987-2008 in one group and 2009-2015 in the other.

In [4]:
# First batch: 1987 through 2008 (the end point of np.arange is exclusive)
years = np.arange(1987, 2009)

The data for 1994 is missing because of a database error, so we omit that year from the list of years to process.

In [5]:
# Keep every year except 1994
years = years[years != 1994]