Formal Description
1. Find the correlation between horizontal distance drilled and the amount of oil recovered
2. Do this for both horizontal and vertical wells
3. Resample by end-date year and compare results to see whether production has increased over time, on the assumption that technologies and processes have improved
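
The three steps above can be sketched as a single grouped correlation. This is a minimal sketch, not the project's actual method: it assumes `lateral_len` stands in for horizontal distance drilled, `recovery` for oil recovered, and the year of `last_prod` for the "end date year". The function name and the `min_wells` threshold are hypothetical.

```python
import pandas as pd

def correlation_by_type_and_year(df, x="lateral_len", y="recovery",
                                 group="type", date_col="last_prod",
                                 min_wells=3):
    """Pearson correlation between x and y for each (well type, end year)."""
    work = df[[x, y, group, date_col]].dropna().copy()
    # Assumption: the year of the last production date is the "end date year".
    work["end_year"] = pd.to_datetime(work[date_col]).dt.year
    rows = []
    for (well_type, year), part in work.groupby([group, "end_year"], observed=True):
        if len(part) < min_wells:
            continue  # too few wells for a meaningful correlation
        rows.append({
            "type": well_type,
            "end_year": year,
            "n": len(part),
            "corr": part[x].corr(part[y]),
        })
    return pd.DataFrame(rows, columns=["type", "end_year", "n", "corr"])
```

Calling `correlation_by_type_and_year(df)` on the prepared frame returns one row per well type and year, so a trend in `corr` over `end_year` can be inspected separately for horizontal and vertical wells.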

Assumptions
1. Cost increases per linear foot drilled
2. Recovery per foot decreases as the number of feet drilled increases
3. Drilling horizontally carries more risk; however, it is much better than drilling a new hole every time
4. Geographic clustering of drilling data could lead us to different projections
5. Higher proppant PPF and frac fluid volume theoretically lead to more recovery. Not an assumption: costs rise as proppant PPF goes up.
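
Assumption 2 (diminishing recovery per foot) could be checked by binning wells on lateral length and comparing the median `recovery_per_foot` in each bin. The sketch below is illustrative only: the function name and the bin edges are hypothetical choices, not values taken from this analysis.

```python
import pandas as pd

def recovery_per_foot_by_length_bin(df, bins=(0, 2500, 5000, 7500, 10000, 15000)):
    """Median recovery per foot within each lateral-length bin."""
    work = df[["lateral_len", "recovery_per_foot"]].dropna().copy()
    # Hypothetical bin edges in feet; adjust to the distribution of the data.
    work["length_bin"] = pd.cut(work["lateral_len"], bins=list(bins))
    # observed=False keeps empty bins in the result (as NaN) so gaps stay visible.
    return work.groupby("length_bin", observed=False)["recovery_per_foot"].median()
```

If the assumption holds, the medians should fall as the bins move toward longer laterals. The median is used here rather than the mean because the summary statistics further down show extreme values in `recovery_per_foot`.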

# Data Dictionary:
## Dimensions/Categorical:
1. API:  American Petroleum Institute well number (10 digits) -- check for duplicates (switch to 14 digits if necessary)
5. Entity Reserve Category:  (IGNORE: unless we get into probability)
    - Developed Producing --
    - Undeveloped --
    - (Blanks) -- 
13. First Prod:  first production date, drop before 1940
6. Formation: 
    - Name of the layer of rock
    - Group by Formation: rock properties differ from one formation to the next; formation can be imputed from vertical depth if necessary
9. Frac Fluid GPF:  another key driver, a continuous variable, how much water is forced into the hole to frack it. Drives up cost.   
14. Last Prod:  last production date, drop before 1940
4. MajorPhase:  What is the predominant thing it produces
    - Gas
    - Inj -- Injection (ignore)
    - Oil
    - Other -- unknown; includes plugged wells and shut-in wells that no longer produce (there might be some exceptions) -- KEEP
    - SWD --  Salt Water Disposal (ignore)
7. Prod Method:  What type of surface machine is used at that well, probably won’t use this feature. May use for anomalies. 
    - Flowing --
    - Gas lift --
    - Jetted --
    - Other --
    - Plunger Lift -- 
    - Pumping --
    - Swabbing -- 
    - Undesignated -- 
    - Unknown --
3. Status:
    - Active -- producing and have ultimate recovery number
    - Drilling -- actively being drilled, likely no recovery numbers 
    - Drilled Un-completed -- drilled and hooked up, but not yet on production, no recovery number, also referred to as a “DUC”; we will treat it the same as Active
    - DryHole -- nothing in the hole
    - Inactive -- same as Shut-in
    - Injection -- we will exclude these, used for disposal water, etc.
    - Other -- do not know
    - P&A -- plugged and abandoned, know how much it previously produced
    - Permitted -- the state has approved drilling; if the well has first and last production dates, it is active and the state has not re-filed it yet
    - Shut-in -- also turned off, but not plugged with concrete, know how much it previously produced
    - Uncompleted -- drilled, but have not yet hooked it up
2. Type:  Vertical, Horizontal, or “Other” -- other will be imputed from the lateral length
25. Well Id:  
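
The duplicate check mentioned for the API field can be sketched as below. It assumes the identifier is stored as a 14-digit integer in a column named `api14` (as in the prepared frame shown later) and that the 10-digit form is simply its first ten digits. The function name is hypothetical.

```python
import pandas as pd

def api_duplicate_report(df, col="api14"):
    """Count unique well identifiers at 14-digit and 10-digit precision."""
    api14 = df[col].astype("int64")
    # Assumption: dropping the last four digits yields the 10-digit API number.
    api10 = api14 // 10_000
    return {
        "rows": len(df),
        "unique_api14": int(api14.nunique()),
        "unique_api10": int(api10.nunique()),
        # Rows whose 14-digit identifier appears more than once.
        "duplicated_api14_rows": int(api14.duplicated(keep=False).sum()),
    }
```

If `unique_api14` is lower than `rows`, some wells appear more than once even at 14-digit precision and need to be de-duplicated or aggregated before modeling.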

***********
## Measures/Continuous:
12. Frac Fluid Type:  classifier, may or may not be a driver   
    - Acid
    - Foam
    - Freshwater
    - GelXLink
    - None
    - Oil
    - Other
    - Saltwater
    - Slickwater
    - (Blanks) -- mean it’s unknown, but they used something
11. Frac Stages:   number of fracking stages; may not use, mostly null. We expect a direct correlation between profit and frac stages; each stage drives up the cost
16. Gas EUR:  estimated ultimate recovery of gas, gas measured in mcf (6 mcf of gas is roughly equivalent to one barrel of oil), may want to combine Oil EUR and Gas EUR 
20. GOR Hist: the ratio of gas to oil produced to date by well, changes every month
21. IP90 BOEQPD: a metric that might come in handy, initially garbage comes out and then good oil, so this gives average oil recovered over 90 days, “initial potential 90-day barrel of oil equivalent (oil and gas) per day”
10. Lateral Len:  length of perforations (to let oil into the pipe)
26. MidPoint Lat:  midpoint surface latitude
27. MidPoint Long:  midpoint surface longitude
15. Oil EUR:  estimated ultimate recovery of oil, what we want to predict, oil in barrels
17. Oil Gravity:  likely a driver, may have to impute, the thickness of the oil, viscosity, usually related to gas/oil ratio (GOR Hist), could be imputed from GOR Hist.
19. Oil Hist: the number of barrels produced to date by well
8. Proppant PPF:  after injecting water for fracking, sand or ceramic is injected to hold open the layers of formation, to “prop” open the layers. Measured in pounds per foot. A continuous variable that will be a key driver. Drives up cost.
18. Qi (init):  peak rate (like for a time series) at the start, initial producing rate
22. Sur Lat:  surface latitude 
23. Sur Long:  surface longitude  
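
Combining Oil EUR and Gas EUR, as suggested under Gas EUR, amounts to converting gas to barrels of oil equivalent (BOE) at 6 mcf per barrel and adding it to the oil figure:

$$\text{EUR}_{\text{BOE}} = \text{Oil EUR (bbl)} + \frac{\text{Gas EUR (mcf)}}{6}$$

A minimal sketch of that conversion follows. The function and constant names are hypothetical, and it assumes both inputs are in the units given in the data dictionary (barrels and mcf).

```python
import pandas as pd

MCF_PER_BOE = 6.0  # conversion factor stated in the data dictionary

def combined_eur_boe(oil_eur_bbl, gas_eur_mcf):
    """Combine oil (bbl) and gas (mcf) recovery into barrels of oil equivalent.

    Works on scalars or on pandas Series of equal length.
    """
    return oil_eur_bbl + gas_eur_mcf / MCF_PER_BOE
```

For example, a well with 100 barrels of oil and 600 mcf of gas comes to 100 + 600/6 = 200 BOE.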

In [1]:
import pandas as pd
from acquire_prepare import acquire_oil, prep_data

In [2]:
df = acquire_oil()
df = prep_data(df)
df.shape

(16250, 29)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16250 entries, 0 to 19495
Data columns (total 29 columns):
api14                 16250 non-null int64
type                  16250 non-null category
status                16250 non-null category
major_phase           16250 non-null category
formation             16229 non-null object
proppant_ppf          12018 non-null float64
prod_method           16250 non-null category
frac_fluid_gpf        14133 non-null float64
lateral_len           16250 non-null float64
frac_stages           15056 non-null float64
frac_fluid_type       16066 non-null category
first_prod            16250 non-null datetime64[ns]
last_prod             16250 non-null datetime64[ns]
oil_gravity           15817 non-null float64
peak_boepd            16250 non-null float64
oil_hist              16250 non-null float64
gas_hist              16250 non-null float64
gor_hist              15597 non-null float64
ip90_boeqpd           16250 non-null float64
landing_depth         160

In [4]:
df.sample(10)

Unnamed: 0,api14,type,status,major_phase,formation,proppant_ppf,prod_method,frac_fluid_gpf,lateral_len,frac_stages,...,landing_depth,sur_lat,sur_long,well_id,mid_point_lat,mid_point_long,recovery,recovery_per_foot,months_active,recovery_per_month
15289,42383398910000,Horizontal,Inactive,OIL,Wolfcamp Upper B,,Flowing,,9880.0,1.0,...,8795.0,31.508137,-101.720476,98283,31.49457,-101.716231,113.174667,11.454926,3,37724.888889
661,42003473530000,Horizontal,Active,OIL,SPRABERRY,1458.58142,Gas Lift,1940.74573,9802.0,41.0,...,9536.0,32.34169,-102.240423,142018,32.355178,-102.2457,114.172833,11.647912,26,4391.262821
11011,42371042580001,Vertical,Inactive,OIL,SEVEN RIVERS,465.116272,Undesignated,168.604645,172.0,0.0,...,0.0,31.104766,-102.653266,157170,31.104766,-102.653266,1.945167,11.309109,50,38.903333
8359,42329100630001,Vertical,Inactive,OIL,PENNSYLVANIAN,,Gas Lift,4.155673,1516.0,0.0,...,0.0,31.791336,-102.232676,154222,31.791336,-102.232676,6.952167,4.585862,41,169.565041
15955,42389342640000,Horizontal,Active,OIL,WOLFCAMP,1677.51147,Flowing,1883.88281,7008.0,0.0,...,10741.0,31.386147,-103.382306,152029,31.375801,-103.385103,811.9307,115.857691,52,15614.051923
18151,42461403840000,Horizontal,Active,OIL,WOLFCAMP,1525.64453,Gas Lift,1625.98694,7370.0,0.0,...,9794.0,31.629807,-102.049658,152176,31.639967,-102.053618,91.425503,12.405089,27,3386.129756
4780,42227200400000,Vertical,Inactive,OIL,SAN ANDRES D,67.95017,Pumping,35.107586,883.0,0.0,...,0.0,32.146523,-101.242953,153377,32.146523,-101.242953,15.038,17.030578,166,90.590361
13406,42383360720000,Vertical,Inactive,OIL,SPRABERRY UPPER,,Pumping,,2094.0,0.0,...,0.0,31.603748,-101.441888,156686,31.603748,-101.441888,44.388833,21.198106,114,389.375731
2124,42109329470000,Horizontal,Active,GAS,WOLFCAMP,1267.31421,Flowing,1370.04175,4736.0,0.0,...,10042.0,31.998733,-104.180796,143367,31.991794,-104.180121,73.206336,15.457419,21,3486.016008
8054,42317402530000,Horizontal,Active,OIL,SPRABERRY,1364.74121,Pumping,1944.06787,7981.0,0.0,...,9367.0,32.438881,-102.145977,146022,32.450102,-102.147224,170.747167,21.394207,32,5335.848958


In [5]:
print(df.nunique())

api14                 16033
type                      3
status                    2
major_phase               3
formation               179
proppant_ppf          11670
prod_method               9
frac_fluid_gpf        13297
lateral_len            6827
frac_stages              65
frac_fluid_type          10
first_prod              622
last_prod               438
oil_gravity             245
peak_boepd            14894
oil_hist              15395
gas_hist              15203
gor_hist              15589
ip90_boeqpd           15633
landing_depth          5984
sur_lat               15560
sur_long              15764
well_id               16250
mid_point_lat         15900
mid_point_long        15913
recovery              16182
recovery_per_foot     16237
months_active           565
recovery_per_month    16088
dtype: int64


In [6]:
df.describe()

Unnamed: 0,api14,proppant_ppf,frac_fluid_gpf,lateral_len,frac_stages,oil_gravity,peak_boepd,oil_hist,gas_hist,gor_hist,...,landing_depth,sur_lat,sur_long,well_id,mid_point_lat,mid_point_long,recovery,recovery_per_foot,months_active,recovery_per_month
count,16250.0,12018.0,14133.0,16250.0,15056.0,15817.0,16250.0,16250.0,16250.0,15597.0,...,16072.0,16250.0,16250.0,16250.0,16250.0,16250.0,16250.0,16250.0,16250.0,16250.0
mean,42201740000000.0,2409.546,2118.496,4170.744677,2.173618,3.412634,429.991144,122.782936,454.575591,4589510.0,...,5952.05662,31.724644,-102.157757,140956.327877,31.724889,-102.157851,190.474018,431.389005,89.3016,inf
std,1110139000000.0,75139.96,68325.95,3108.757709,7.677189,12.667286,470.398824,179.474173,1319.099721,486771800.0,...,4377.950489,0.536802,0.827376,19152.058603,0.536721,0.827368,345.909094,5313.508554,102.959469,
min,30005210000000.0,0.2531144,0.0,1.0,0.0,0.0,0.065754,0.0,0.0,0.0,...,0.0,30.233049,-104.533822,22866.0,30.232985,-104.532366,0.0,0.0,0.0,0.0
25%,42227010000000.0,335.5705,136.6258,1510.25,0.0,0.0,59.99211,20.40075,50.121,1367.16,...,0.0,31.38088,-102.660748,142531.25,31.382211,-102.659731,33.157084,14.667256,33.0,438.0776
50%,42329340000000.0,1048.781,745.9227,4051.0,0.0,0.0,285.733067,76.8385,218.8685,2477.042,...,7332.0,31.675202,-101.947623,147348.5,31.676249,-101.948102,107.65709,27.559452,55.0,1827.793
75%,42383390000000.0,1509.654,1495.726,7001.5,0.0,0.0,660.6077,167.86175,529.52625,5370.206,...,9542.0,31.986593,-101.530365,152325.75,31.991514,-101.531618,204.984171,63.99336,92.0,4239.073
max,42501370000000.0,7646000.0,7497903.0,13815.0,70.0,72.7,9358.215,3846.625,80091.79,60441170000.0,...,18593.0,33.83482,-100.399997,157521.0,33.83655,-100.399927,13345.776667,467575.734583,747.0,inf
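
The summary above reports a mean and max of `inf` for `recovery_per_month` while `months_active` has a minimum of 0, which suggests a division by zero for wells with no active months. One way to avoid that, sketched here with a hypothetical helper name, is to mask non-positive denominators so the ratio becomes missing rather than infinite.

```python
import numpy as np
import pandas as pd

def safe_rate(numerator, denominator):
    """Divide two Series, returning NaN where the denominator is not positive."""
    # Series.where keeps values where the condition holds and sets NaN elsewhere.
    masked = denominator.where(denominator > 0)
    return numerator / masked

def count_infinite(series):
    """Number of +/-inf entries in a numeric Series."""
    return int(np.isinf(series.astype("float64")).sum())
```

`count_infinite(df["recovery_per_month"])` shows how many rows are affected, and recomputing the column with `safe_rate` lets `describe()` report a finite mean and standard deviation. Whether zero-month wells should instead be dropped is a separate modeling decision.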
