# Frac Production Data Cleaning

In [1]:
# Necessary imports
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

## Production Reports
Note: This is Colorado well production data, updated monthly through the end of 2017 and broken out by individual well and by production month. It might be useful for comparing production across months (initial production versus production over time).
Data Source: [COGCC Production Summary Data](https://cogcc.state.co.us/documents/data/downloads/production/co%202017%20Annual%20Production%20Summary-xp.zip)

In [2]:
# Read in production reports for Colorado
df = pd.read_csv('2017_prod_reports.csv')
print(df.shape)
df.tail()

(857203, 33)


Unnamed: 0,report_month,report_year,ST,api_county_code,api_seq_num,sidetrack_num,formation_code,well_status,prod_days,water_disp_code,...,gas_prod,btu_sales,gas_press_tbg,gas_press_csg,operator_num,name,facility_name,facility_num,accepted_date,revised
857198,12,2016,5,125,12123,0,NBRR,PR,31.0,C,...,2225.0,1000.0,,,10489,AUGUSTUS ENERGY RESOURCES LLC,Gardner Trust,44-18 2N46W,2017-02-07 14:43:50.530000000,
857199,12,2016,5,125,12124,0,NBRR,PR,31.0,P,...,2364.0,992.0,,,66190,OMIMEX PETROLEUM INC,Fiddler Peak Ranch,4-3-5-45,2017-01-12 16:10:23.057000000,
857200,12,2016,5,125,12125,0,NBRR,PR,31.0,C,...,633.0,1000.0,,,10489,AUGUSTUS ENERGY RESOURCES LLC,Chapman,13-19 1S44W,2017-02-07 14:43:50.530000000,
857201,12,2016,5,125,12126,0,NBRR,PR,31.0,C,...,1202.0,996.0,,,10489,AUGUSTUS ENERGY RESOURCES LLC,Haven Hill,14-15 4N47W,2017-02-07 14:43:50.530000000,
857202,12,2017,5,43,6226,1,NBRR,TA,0.0,,...,,,,,10412,AUSCO PETROLEUM INC,Hudson,1,2017-12-27 14:58:21.907000000,


Let's check which columns I am dealing with here, along with how many non-null values I have to work with.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857203 entries, 0 to 857202
Data columns (total 33 columns):
report_month       857203 non-null int64
report_year        857203 non-null int64
ST                 857203 non-null int64
api_county_code    857203 non-null int64
api_seq_num        857203 non-null int64
sidetrack_num      857203 non-null int64
formation_code     857203 non-null object
well_status        857203 non-null object
prod_days          825707 non-null float64
water_disp_code    454823 non-null object
water_vol          443316 non-null float64
water_press_tbg    240511 non-null float64
water_press_csg    235473 non-null float64
bom_invent         489472 non-null float64
oil_vol            378131 non-null float64
oil_sales          301069 non-null float64
adjustment         50065 non-null float64
eom_invent         487351 non-null float64
gravity_sales      301097 non-null float64
gas_sales          593103 non-null float64
flared             42414 non-null float64
gas

The well-identifying columns are complete, but many of the measured values are missing, leaving approximately 240K months of production to be analyzed. Let's see if the other sources are any better.
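The non-null counts printed by `df.info()` can also be turned into a sortable table, which makes the least complete columns easier to spot. The sketch below runs on a small made-up frame (`toy`) that borrows a few column names from the output above. The helper name `completeness` is hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the production reports (values are made up);
# in the notebook the real frame is `df`.
toy = pd.DataFrame({
    "api_seq_num": [12123, 12124, 12125, 12126],
    "prod_days": [31.0, 31.0, np.nan, 0.0],
    "water_press_tbg": [np.nan, 120.0, np.nan, np.nan],
})


def completeness(frame):
    """Return the non-null count and fraction per column, least complete first."""
    summary = pd.DataFrame({
        "non_null": frame.notna().sum(),
        "fraction": frame.notna().mean(),
    })
    return summary.sort_values("fraction")


report = completeness(toy)
print(report)
```

Calling `completeness(df)` on the real frame should give the same counts as `df.info()`, plus the fraction of rows each column covers.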

## Frac and Well Data
This data was provided by Jeffrey A. Beunier of Front Range Energy Partners, via the Drilling Info site. I haven't been given any further details about it yet, so that's all I know for now.
### Data Indices
This first set of data contains all of the headers for each of the provided datasets.

In [4]:
indices = pd.read_excel('DATA_EXPORT_INDEX.xlsx', usecols=(5, 6, 7, 8, 9))
print(indices.shape)
indices

(105, 5)


Unnamed: 0,WELL HEADER DATA,PRODUCTION HEADER DATA,PRODUCTION TIME SERIES DATA,WELL TEST DATA,FORMATION TOP DATA
0,API10,,,,
1,API12,API/UWI,Entity ID,API,API
2,API14,Operator Alias,API/UWI,Test Date,Formation
3,Well Name,Well/Lease Name,API/UWI List,Test Formation,Formation Top MD
4,Well Number,Well Number,Monthly Production Date,Test Type,Formation Top TVD
5,Lease Name,Entity Type,Monthly Oil,Liquid Volume,Formation Top Unknown
6,Operator Alias,County/Parish,Monthly Gas,Gas Volume,Formation Bottom Unknown
7,Reported Operator,Reservoir,Monthly Water,Water Volume,Field
8,Field,Production Type,Well Count,Hours Tested,State Province
9,County/Parish,Producing Status,Days,,Basin Name


The Wells data has the majority of the columns of interest. The Production header data looks like a much more detailed version of the production data from the COGCC database, which could be interesting, but it was suggested that I look at the first 6 months of production, so I don't need this additional detail. The production time series data could be useful for time series modeling to predict future production beyond those first months. The Formations and Test datasets are not very important for this analysis: each well in the Wells data already includes its target formation, and the tests do not provide much quality data for our analysis.
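Because each column of the index sheet lists one dataset's headers and is padded with NaN where a dataset has fewer fields, it can help to collapse it into plain lists before comparing datasets. This is a minimal sketch on a made-up frame (`toy_indices`) shaped like `indices`. The helper name `headers_by_dataset` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `indices`: one column of header names per dataset,
# padded with NaN where a dataset has fewer headers.
toy_indices = pd.DataFrame({
    "WELL HEADER DATA": ["API10", "API12", "API14"],
    "WELL TEST DATA": [np.nan, "API", "Test Date"],
})


def headers_by_dataset(index_frame):
    """Map each dataset name to its list of header names, dropping NaN padding."""
    return {name: col.dropna().tolist() for name, col in index_frame.items()}


headers = headers_by_dataset(toy_indices)
for name, cols in headers.items():
    print(f"{name}: {len(cols)} headers")
```

Applied to the real `indices` frame, the same helper would give a per-dataset header count, which is a quick way to see how much wider the Wells data is than the others.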

### Production Time Series Data
Next, let's take a look at the time series data to see what's there.

In [5]:
prod_time = pd.read_csv('dj hz 6-17-18 Production Time Series.csv')
print(prod_time.shape)
prod_time.tail()

(236016, 19)


Unnamed: 0,Entity ID,API/UWI,API/UWI List,Monthly Production Date,Monthly Oil,Monthly Gas,Monthly Water,Well Count,Days,Daily Avg Oil,Daily Avg Gas,Daily Avg Water,Reservoir,Well/Lease Name,Well Number,Operator Alias,Production Type,Production Status,Entity Type
236011,104208052,5123144240000,51231440000.0,1998-11-01,22.0,393.0,0.0,1,0.0,0.0,13.0,0.0,CODELL,HSR-KING,4-23,"HS RESOURCES, INC.",OIL,INACTIVE,WELL
236012,104208052,5123144240000,51231440000.0,1998-12-01,22.0,387.0,0.0,1,0.0,0.0,12.0,0.0,CODELL,HSR-KING,4-23,"HS RESOURCES, INC.",OIL,INACTIVE,WELL
236013,104208052,5123144240000,51231440000.0,1999-01-01,60.0,935.0,0.0,1,29.0,2.0,32.0,0.0,CODELL,HSR-KING,4-23,"HS RESOURCES, INC.",OIL,INACTIVE,WELL
236014,104208052,5123144240000,51231440000.0,1999-02-01,49.0,705.0,0.0,1,28.0,1.0,25.0,0.0,CODELL,HSR-KING,4-23,"HS RESOURCES, INC.",OIL,INACTIVE,WELL
236015,104208052,5123144240000,51231440000.0,1999-09-01,1.0,0.0,0.0,1,30.0,0.0,0.0,0.0,CODELL,HSR-KING,4-23,"HS RESOURCES, INC.",OIL,INACTIVE,WELL


Great.  I might explore this more in depth after my initial models to determine how production changes over time, not just the total of the first six months.
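For reference, a first-six-months total per well could be computed from a frame shaped like `prod_time` roughly as follows. This is a sketch under stated assumptions, not the method used later in the analysis. It runs on made-up data (`toy`), the helper name `first_n_months_total` is hypothetical, and "first six months" is taken to mean the first six reported rows per well, so gaps between reporting months are not accounted for.

```python
import pandas as pd

# Toy stand-in for `prod_time`; column names match the export shown above,
# the well IDs and volumes are made up.
toy = pd.DataFrame({
    "API/UWI": ["A"] * 8 + ["B"] * 3,
    "Monthly Production Date": (
        list(pd.date_range("2015-01-01", periods=8, freq="MS"))
        + list(pd.date_range("2016-03-01", periods=3, freq="MS"))
    ),
    "Monthly Oil": [100, 90, 80, 70, 60, 50, 40, 30, 10, 20, 30],
})


def first_n_months_total(frame, n=6, value_col="Monthly Oil"):
    """Sum value_col over each well's first n reported months."""
    ordered = frame.sort_values(["API/UWI", "Monthly Production Date"])
    # cumcount numbers each well's rows 0, 1, 2, ... in date order
    month_rank = ordered.groupby("API/UWI").cumcount()
    first = ordered[month_rank < n]
    return first.groupby("API/UWI")[value_col].sum()


totals = first_n_months_total(toy)
print(totals)
```

Wells with fewer than `n` reported months (well `B` here) are summed over whatever rows they have. Whether to keep or drop such wells is a modeling choice this sketch leaves open.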