# Frac Production Modeling
This is a continuation of the Frac Production Data Cleaning and Frac Production Analysis notebooks.

Here I will fit models to the data and tune both the models and the feature set.

In [22]:
# Necessary Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Model Imports
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

Great, now I will write two functions to streamline training and testing the models.  The training function fits the model, cross-validates it on the data it is given, and returns the mean score with a spread of two standard deviations.  The test function runs the same cross-validation on the data it is given and returns the scores in the same format.

In [25]:
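def train_model(model, X, Y, cv):
    # Fit on the full set so fitted attributes (e.g. feature_importances_) are available later
    model.fit(X, Y)
    # cross_val_score clones the model and refits it on each fold; for regressors the score is R-squared
    scores = cross_val_score(model, X, Y, cv=cv)
    return 'Training Scores: {:0.4f} (+/- {:0.4f})'.format(scores.mean(), scores.std()*2)

def test_model(model, X, Y, cv):
    # Cross-validates fresh clones on X, Y; it does not score the already-fitted model
    scores = cross_val_score(model, X, Y, cv=cv)
    return 'Test Scores: {:0.4f} (+/- {:0.4f})'.format(scores.mean(), scores.std()*2)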
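Because `cross_val_score` refits clones of the estimator on each fold, `test_model` trains new models on folds of whatever data it receives rather than scoring the model that was fitted on the training split. A minimal sketch of a held-out evaluation is shown here for comparison. The helper name `holdout_score` and the synthetic data from `make_regression` are assumptions for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split


def holdout_score(model, X_train, y_train, X_test, y_test, cv=5):
    """Hypothetical helper: CV on the training split, then one held-out score."""
    # Cross-validate on the training split only (the estimator is cloned per fold).
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv)
    # Fit once on the full training split, then score on data the model never saw.
    model.fit(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    return cv_scores.mean(), cv_scores.std() * 2, test_r2


# Synthetic stand-in data; not the frac well dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

cv_mean, cv_spread, test_r2 = holdout_score(
    RandomForestRegressor(n_estimators=50, random_state=42), X_tr, y_tr, X_te, y_te
)
print('CV R^2: {:0.4f} (+/- {:0.4f})'.format(cv_mean, cv_spread))
print('Held-out R^2: {:0.4f}'.format(test_r2))
```

With this layout the test split is used only once, for the final score, so it stays independent of any fitting or tuning.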


In [114]:
# Reading in ALL Colorado Features (pre-feature selection)
co_all = pd.read_csv('fracwells_co.csv', index_col=0)
co_all.head()

Unnamed: 0,CumBOE,CumGas,CumOil,DrillType,Field,First6BOE,FirstProdDate,GrossPerforatedInterval,HorizontalLength,LowerPerforation,...,Township,TreatmentJobCount,UpperPerforation,VerticalDepth,WellName,WellNumber,gel,sandmass,slick,xlinkgel
5001098010000,188242.0,273498.0,142659.0,H,WATTENBERG,46241.0,2015-01-01,,2247.84,,...,01S,1.0,,7774.0,SHARP,24-3-11HC,0.0,2533881.0,1.0,1.0
5001097850000,22828.0,39256.0,16285.0,H,THIRD CREEK,7094.0,2014-06-01,,4499.09,,...,01S,1.0,,7576.73,STATE OF CO,1S-66-36-1609CH,0.0,3274332.0,1.0,1.0
5001097830000,23909.0,44706.0,16458.0,H,THIRD CREEK,8304.0,2014-06-01,,4556.06,,...,01S,1.0,,7511.68,STATE OF CO,1S-66-36-0108BH,0.0,3045143.0,1.0,1.0
5001097820000,21407.0,45118.0,13887.0,H,WATTENBERG,7118.0,2014-06-01,,4525.4,,...,01S,1.0,,7574.72,STATE OF CO,1S-66-36-0108CH,0.0,2784211.0,1.0,1.0
5001097810000,31084.0,65779.0,20121.0,H,THIRD CREEK,10385.0,2014-06-01,,4504.51,,...,01S,1.0,,7513.7,STATE OF CO,1S-66-36-1609BH,0.0,3056162.0,1.0,1.0


## Modeling with ALL Colorado Features

In [115]:
# Dropping well identifiers, location/date columns, and cumulative production totals
co_all = co_all.drop(['OperatorAlias','WellName','WellNumber','Township','Range','Field',
                      'FirstProdDate','CumBOE','CumGas','CumOil'], axis=1)

co_all = co_all.dropna(axis=0)
co_all_d = pd.get_dummies(co_all)

co_all_features = co_all_d.drop('First6BOE', axis=1)
co_all_target = co_all_d['First6BOE']

Xa_train, Xa_test, ya_train, ya_test = train_test_split(co_all_features, co_all_target, test_size=0.25, random_state=42)

### Random Forest Regression

In [116]:
# Instantiate the model
rfra = RandomForestRegressor()
# Fit the model and generate training scores
rfra_train = train_model(rfra, Xa_train, ya_train, 5)
# Generate test scores
rfra_test = test_model(rfra, Xa_test, ya_test, 5)
print(rfra_train)
print(rfra_test)

Training Scores: 0.5911 (+/- 0.0849)
Test Scores: 0.4809 (+/- 0.1387)


In [117]:
from sklearn.feature_selection import RFE
# Recursive feature elimination; by default RFE keeps half of the features
rfr = RandomForestRegressor()
selector = RFE(rfr)
selector = selector.fit(co_all_features, co_all_target)

# support_ is True for the features RFE kept
sel_features = pd.DataFrame(selector.support_,
                            index=co_all_features.columns,
                            columns=['Selected']).sort_values('Selected', ascending=False)
sel_features

Unnamed: 0,Selected
GrossPerforatedInterval,True
gel,True
TargetFormation_CODELL,True
HorizontalLength,True
TargetFormation_NIOBRARA,True
xlinkgel,True
sandmass,True
slick,True
VerticalDepth,True
UpperPerforation,True
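The `support_` mask only says whether a feature was kept. As a rough sketch of how the elimination order could also be inspected, the example below reads `ranking_` from a fitted `RFE` object. The data is synthetic and the column names `f0` to `f5` are hypothetical stand-ins for the well features.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in data with hypothetical column names.
X_arr, y = make_regression(n_samples=200, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=['f{}'.format(i) for i in range(6)])

# With n_features_to_select left at its default, RFE keeps half of the features.
selector = RFE(RandomForestRegressor(n_estimators=25, random_state=0))
selector.fit(X, y)

# ranking_ is 1 for every kept feature; larger ranks were eliminated earlier.
ranking = pd.DataFrame(
    {'Selected': selector.support_, 'Rank': selector.ranking_},
    index=X.columns,
).sort_values('Rank')
print(ranking)
```

Sorting by `Rank` instead of `Selected` puts the eliminated features at the bottom in the order they were dropped, which makes it easier to see which ones were considered least important.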


Based on this recursive feature elimination, I will drop the Target Formation and Drill Type features, as they are the least important in this model.

In [118]:
co_all = co_all.drop(['DrillType','TargetFormation'], axis=1)

Now, let's get a final baseline with the remaining features using the random forest regressor.  With those two features dropped, I need to re-encode and split the data again.

In [120]:
co_all = pd.get_dummies(co_all)

co_sel_features = co_all.drop('First6BOE', axis=1)
co_all_target = co_all['First6BOE']

Xs_train, Xs_test, ys_train, ys_test = train_test_split(co_sel_features, co_all_target, test_size=0.25, random_state=42)

Now, let's check how these selected features did with the random forest model.

In [121]:
# Instantiate the model
rfrs = RandomForestRegressor()
# Fit the model and generate training scores
rfrs_train = train_model(rfrs, Xs_train, ys_train, 5)
# Generate test scores
rfrs_test = test_model(rfrs, Xs_test, ys_test, 5)
print(rfrs_train)
print(rfrs_test)

Training Scores: 0.5953 (+/- 0.0621)
Test Scores: 0.4793 (+/- 0.1702)


In [122]:
feature_importances = pd.DataFrame(rfrs.feature_importances_,
                                   index=Xs_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances

Unnamed: 0,importance
TotalDepth,0.272077
LowerPerforation,0.151895
GrossPerforatedInterval,0.133756
VerticalDepth,0.127252
UpperPerforation,0.105697
sandmass,0.091193
HorizontalLength,0.084611
xlinkgel,0.017526
TreatmentJobCount,0.00964
gel,0.005525


Alright, this is our baseline model now: an R-squared value of 0.4793 (+/- 0.1702) on the test split.

## Modeling with Selected and Engineered Features
Now I will try the engineered features to see how they perform.

In [123]:
# Reading in Colorado Features
cofeatures = pd.read_csv('cofeatures', index_col=0)
cofeatures.head()

Unnamed: 0,gel,slick,xlinkgel,VerticalDepth,HorizontalLength,GrossPerforatedInterval,nphf_sqrt,sandmass,sqrtsandmass,location,First6BOE,TargetFormation_CODELL,TargetFormation_NIOBRARA,TargetFormation_OTHER,hzlen_bin_1-2,hzlen_bin_<1,hzlen_bin_>2,County_Adams,County_Larimer,County_Weld
5001098010000,0.0,1.0,1.0,7774.0,2247.84,2901.282977,73.120155,2533881.0,1591.816874,0,46241.0,1,0,0,0,1,0,1,0,0
5001097820000,0.0,1.0,1.0,7574.72,4525.4,5120.217574,73.519946,2784211.0,1668.595385,0,7118.0,0,1,0,0,1,0,1,0,0
5001098410000,0.0,0.0,0.0,8045.0,4463.02,5059.443262,73.509025,3141628.0,1772.463745,0,23404.0,1,0,0,0,1,0,1,0,0
5001098450000,0.0,1.0,0.0,7841.0,4658.59,5249.979204,73.543258,3998957.0,1999.739152,0,97243.0,1,0,0,0,1,0,1,0,0
5001098470000,0.0,1.0,0.0,7707.0,5786.57,6348.924499,73.740393,3938537.0,1984.574673,0,93034.0,1,0,0,0,1,0,1,0,0


Knowing that this data is clean, I will first identify my variables, with the first six months' production (`First6BOE`) as the target and the remaining columns as features in the model.  I will then split both into training and test sets.

In [124]:
# Keeping the engineered features: drop the target and the raw sandmass, HorizontalLength, and GrossPerforatedInterval columns
cofeats = cofeatures.drop(['First6BOE', 'sandmass', 'HorizontalLength', 'GrossPerforatedInterval'], axis=1)

target = cofeatures.First6BOE

X_train, X_test, y_train, y_test = train_test_split(cofeats, target, test_size=0.25, random_state=42)

### Multivariate Linear Regression
I will start with multivariate linear regression to see how well the model can predict production.

In [125]:
# Instantiate the model
regr = LinearRegression()
# Fit the model and generate training scores
regr_train = train_model(regr, X_train, y_train, 5)
# Generate test scores
regr_test = test_model(regr, X_test, y_test, 5)
print(regr_train)
print(regr_test)

Training Scores: 0.3989 (+/- 0.0442)
Test Scores: 0.3418 (+/- 0.1858)


Well, that's not great.  The multivariate linear regression model explains only about 34% of the variance in first-six-month production on the test split.  Let's try some other models to see how they do.
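For reference, the scores reported here are R-squared values, since that is the default scorer `cross_val_score` uses for scikit-learn regressors. The sketch below works through the calculation by hand on made-up numbers (not taken from the well data) and checks it against `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up numbers, purely for illustration.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

# Residual sum of squares: what the model fails to explain.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: spread of the target around its own mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true, y_pred))
```

An R-squared of 0.34 therefore means the residual sum of squares is still about two thirds of the total sum of squares.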

### Random Forest Regression
The next model I will use to predict production is random forest regression. 

In [126]:
# Instantiate the model
rfr = RandomForestRegressor()
# Fit the model and generate training scores
rfr_train = train_model(rfr, X_train, y_train, 5)
# Generate test scores
rfr_test = test_model(rfr, X_test, y_test, 5)
print(rfr_train)
print(rfr_test)

Training Scores: 0.6186 (+/- 0.0825)
Test Scores: 0.4995 (+/- 0.2051)


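That's better: the mean test score rose from 0.3418 to 0.4995.  However, the spread of the test scores across folds increased relative to linear regression (+/- 0.2051 versus +/- 0.1858), and the training score (0.6186) sits well above the test score, which is indicative of some overfitting.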
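One way to look at overfitting more directly is to compare, fold by fold, the score on the data each model was fitted on with the score on the fold it did not see. The sketch below does this with `cross_validate` and `return_train_score=True`. It runs on synthetic data, so the numbers it prints say nothing about the well dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in data; not the frac well dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=42)

results = cross_validate(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X, y, cv=5, return_train_score=True,
)

# 'train_score' is measured on the folds used for fitting,
# 'test_score' on the fold held out in that round.
train_mean = results['train_score'].mean()
val_mean = results['test_score'].mean()
print('In-fold R^2:  {:0.4f}'.format(train_mean))
print('Held-out R^2: {:0.4f}'.format(val_mean))
print('Gap:          {:0.4f}'.format(train_mean - val_mean))
```

A large gap between the two means points to overfitting more directly than the fold-to-fold spread does, since the spread also reflects how small each fold is.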

I will use GridSearchCV to tune the model's hyperparameters: the number of estimators, the maximum number of features considered at each split, the minimum number of samples required to split a node, and the maximum tree depth, all using the engineered features.

In [127]:
# Identifying potential parameters
param_grid = { 
            "n_estimators"      : [10, 100, 1000],
            # Note: "auto" (all features, for regressors) was removed in scikit-learn 1.3; use 1.0 there
            "max_features"      : ["auto", "sqrt", "log2"],
            "min_samples_split" : [4,6,8],
            "max_depth": [4,6,8,10]
            }
# Instantiating grid search
grid = GridSearchCV(estimator=rfr, param_grid=param_grid, cv=5)
# Fitting model
grid.fit(X_train, y_train)
# Identifying best score and best parameters from the Grid Search
print(grid.best_score_)
best_params = grid.best_params_
print(best_params)

0.6301372653485476
{'max_depth': 10, 'max_features': 'auto', 'min_samples_split': 4, 'n_estimators': 1000}


Now that GridSearchCV has identified the optimal parameters from the set, I will use those parameters to fit a model and see what the training and test set scores look like.

In [128]:
# Instantiate the model
rfr_grid = RandomForestRegressor(**best_params)
# Fit the model and generate training scores
# NOTE: X_test, y_test are passed here, so this is not a true training score
rfr_grid_train = train_model(rfr_grid, X_test, y_test, 5)
# Generate test scores
rfr_grid_test = test_model(rfr_grid, X_test, y_test, 5)
print(rfr_grid_train)
print(rfr_grid_test)

Training Scores: 0.5235 (+/- 0.1920)
Test Scores: 0.5232 (+/- 0.2053)


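Even better! The mean test score rose from 0.4995 to 0.5232, but the scores still vary considerably between folds (+/- 0.2053).  The training score is nearly identical to the test score because both were computed on the test split.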
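As an alternative to re-instantiating the model from `best_params`, `GridSearchCV` refits the best configuration on the full training data by default (`refit=True`) and exposes it as `best_estimator_`. The sketch below scores that refitted model on a held-out split. It uses synthetic data and a deliberately tiny grid (`small_grid`, a hypothetical name) so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; not the frac well dataset.
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Deliberately small grid so the example runs in a few seconds.
small_grid = {'n_estimators': [10, 50], 'max_depth': [4, 8]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid=small_grid, cv=5)
search.fit(X_tr, y_tr)

# best_score_ is the mean cross-validated score on the training split.
print(search.best_params_)
print('CV R^2:       {:0.4f}'.format(search.best_score_))
# best_estimator_ was refitted on all of X_tr, so this is a true held-out score.
held_out = search.best_estimator_.score(X_te, y_te)
print('Held-out R^2: {:0.4f}'.format(held_out))
```

Scoring `best_estimator_` this way keeps the test split out of both tuning and fitting.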

Let's check out the different feature importances to see what features are influencing the model most.

In [129]:
feature_importances = pd.DataFrame(rfr_grid.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances

Unnamed: 0,importance
hzlen_bin_>2,0.301455
VerticalDepth,0.252768
nphf_sqrt,0.161671
sqrtsandmass,0.129723
hzlen_bin_<1,0.046109
location,0.037597
xlinkgel,0.034237
hzlen_bin_1-2,0.020284
TargetFormation_NIOBRARA,0.00703
TargetFormation_CODELL,0.005386


In [133]:
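# Random forest quantile regressor from the scikit-garden package
from skgarden import RandomForestQuantileRegressor
rfqr = RandomForestQuantileRegressor(random_state=42)
# Same training/test scoring as for the other models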
rfqr_train = train_model(rfqr, X_train, y_train, 5)
rfqr_test = test_model(rfqr, X_test, y_test, 5)
print(rfqr_train)
print(rfqr_test)

Training Scores: 0.6229 (+/- 0.0971)
Test Scores: 0.4728 (+/- 0.2629)




I will again use GridSearchCV, this time to tune the random forest quantile regressor.  The grid covers the same hyperparameters as before (number of estimators, maximum features, minimum samples to split, and maximum tree depth), with 2 added as a candidate for the minimum samples to split.

In [134]:
# Identifying potential parameters
param_grid = { 
            "n_estimators"      : [10, 100, 1000],
            "max_features"      : ["auto", "sqrt", "log2"],
            "min_samples_split" : [2,4,6,8],
            "max_depth": [4,6,8,10]
            }
# Instantiating grid search
grid = GridSearchCV(estimator=rfqr, param_grid=param_grid, cv=5)
# Fitting model
grid.fit(X_train, y_train)
# Identifying best score and best parameters from the Grid Search
print(grid.best_score_)
best_params = grid.best_params_
print(best_params)

0.6312129468215613
{'max_depth': 10, 'max_features': 'sqrt', 'min_samples_split': 2, 'n_estimators': 1000}


Now that GridSearchCV has identified the best parameters in this grid, I will again fit a model with them and check the training and test set scores.

In [135]:
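# Instantiate the model
# NOTE: this is a plain RandomForestRegressor, not the RandomForestQuantileRegressor that was tuned above
rfqr_grid = RandomForestRegressor(**best_params)
# Fit the model and generate training scores
# NOTE: X_test, y_test are passed here, so this is not a true training score
rfqr_grid_train = train_model(rfqr_grid, X_test, y_test, 5)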
# Generate test scores
rfqr_grid_test = test_model(rfqr_grid, X_test, y_test, 5)
print(rfqr_grid_train)
print(rfqr_grid_test)

Training Scores: 0.5707 (+/- 0.1224)
Test Scores: 0.5713 (+/- 0.1148)
