# Frac Production Modeling
This notebook continues the Frac Production Data Cleaning and Frac Production Analysis notebooks.

Here I will fit models to the cleaned data, then tune the models and their features.

In [1]:
# Necessary Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Model Imports
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

In [2]:
# Reading in Colorado Features
cofeatures = pd.read_csv('cofeatures', index_col=0)
cofeatures.head()

Unnamed: 0,gel,slick,xlinkgel,VerticalDepth,nphf_sqrt,sqrtsandmass,First6BOE,TargetFormation_CODELL,TargetFormation_NIOBRARA,TargetFormation_OTHER,hzlen_bin_1-2,hzlen_bin_<1,hzlen_bin_>2
5001098010000,0.0,1.0,1.0,7774.0,73.120155,1591.816874,46241.0,1,0,0,0,1,0
5001097820000,0.0,1.0,1.0,7574.72,73.519946,1668.595385,7118.0,0,1,0,0,1,0
5001098410000,0.0,0.0,0.0,8045.0,73.509025,1772.463745,23404.0,1,0,0,0,1,0
5001098450000,0.0,1.0,0.0,7841.0,73.543258,1999.739152,97243.0,1,0,0,0,1,0
5001098470000,0.0,1.0,0.0,7707.0,73.740393,1984.574673,93034.0,1,0,0,0,1,0


Knowing that this data is clean, I will first identify my variables: the first six months' production (`First6BOE`) is the target, and the remaining columns are the features. I will then split both into training and test sets, holding out 25% of the wells for testing.

In [3]:
cofeats = cofeatures.drop('First6BOE', axis=1)
target = cofeatures.First6BOE

X_train, X_test, y_train, y_test = train_test_split(cofeats, target, test_size=0.25, random_state=42)

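Great, now I will write two functions to streamline training and testing the models. The training function fits the model, cross-validates it on the data it is given, and returns the mean score with a spread of two standard deviations. The test function only cross-validates on the data it is given and returns the same summary. Because `cross_val_score` refits a fresh copy of the model on each fold, neither function scores the already-fitted model. For regressors the default score is R².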

In [4]:
def train_model(model, X, Y, cv):
    model.fit(X, Y)
    scores = cross_val_score(model, X, Y, cv=cv)
    return 'Training Scores: {:0.4f} (+/- {:0.4f})'.format(scores.mean(), scores.std()*2)

def test_model(model, X, Y, cv):
    scores = cross_val_score(model, X, Y, cv=cv)
    return 'Test Scores: {:0.4f} (+/- {:0.4f})'.format(scores.mean(), scores.std()*2)
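The two helpers above report only out-of-fold scores. As a hedged alternative, the sketch below uses scikit-learn's `cross_validate` with `return_train_score=True` to get in-fold and out-of-fold R² from one pass. The helper name `cv_summary` and the synthetic `X_demo`/`y_demo` arrays are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate


def cv_summary(model, X, y, cv=5):
    """Return mean in-fold and out-of-fold R^2 plus the out-of-fold spread."""
    result = cross_validate(model, X, y, cv=cv, return_train_score=True)
    return {
        "train_mean": result["train_score"].mean(),
        "val_mean": result["test_score"].mean(),
        "val_spread": 2 * result["test_score"].std(),
    }


# Synthetic stand-in data (hypothetical, not the Colorado well data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

summary = cv_summary(LinearRegression(), X_demo, y_demo)
print(summary)
```

A large gap between `train_mean` and `val_mean` is one common signal of overfitting.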


Now I'll use these functions to test out some models.  

## Modeling
### Multivariate Linear Regression
I will start with multivariate linear regression to see how well the model can predict production.

In [5]:
# Instantiate the model
regr = LinearRegression()
# Fit the model and generate training scores
regr_train = train_model(regr, X_train, y_train, 5)
# Generate test scores
regr_test = test_model(regr, X_test, y_test, 5)
print(regr_train)
print(regr_test)

Training Scores: 0.3950 (+/- 0.0438)
Test Scores: 0.3449 (+/- 0.1922)




Well, that's not great. The multivariate linear regression model explains only about 34% of the variance in production on the test set. Let's try some other models to see how they do.
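To make the "variance explained" reading concrete, here is a small sketch of how R² is computed: one minus the ratio of the residual sum of squares to the total sum of squares. The function `r_squared` and the toy arrays are illustrative assumptions. A score of 0 matches always predicting the mean, and a negative score means the model does worse than that.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot


y_true = np.array([10.0, 20.0, 30.0, 40.0])

perfect = r_squared(y_true, y_true)                # exact predictions
mean_only = r_squared(y_true, np.full(4, 25.0))    # always predict the mean
worse = r_squared(y_true, y_true[::-1])            # predictions in reversed order

print(perfect, mean_only, worse)
```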

### Random Forest Regression
The next model I will use to predict production is random forest regression. 

In [6]:
# Instantiate the model
rfr = RandomForestRegressor()
# Fit the model and generate training scores
rfr_train = train_model(rfr, X_train, y_train, 5)
# Generate test scores
rfr_test = test_model(rfr, X_test, y_test, 5)
print(rfr_train)
print(rfr_test)

Training Scores: 0.5835 (+/- 0.0695)
Test Scores: 0.4645 (+/- 0.1902)


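That's better. However, the spread of the training scores increased (from ±0.04 to ±0.07) and the gap between the training and test means widened (from about 0.05 to about 0.12), which suggests some overfitting. I will use GridSearchCV to optimize the parameters of the model: the number of estimators, the maximum number of features considered at each split, the minimum samples required to split a node, and the maximum depth of each tree.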

In [7]:
# Identifying potential parameters
param_grid = {
    "n_estimators": [10, 100, 1000],
    # "auto" was removed in scikit-learn 1.3; for regressors it meant all features (1.0)
    "max_features": [1.0, "sqrt", "log2"],
    "min_samples_split": [4, 6, 8],
    "max_depth": [4, 6, 8, 10],
}
# Instantiating grid search
grid = GridSearchCV(estimator=rfr, param_grid=param_grid, cv=5)
# Fitting model
grid.fit(X_train, y_train)
# Identifying best score and best parameters from the Grid Search
print(grid.best_score_)
best_params = grid.best_params_
print(best_params)

0.6054909727623922
{'max_depth': 10, 'max_features': 'sqrt', 'min_samples_split': 4, 'n_estimators': 1000}
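A grid search like this one can be expensive, so it helps to count the fits before running it. The sketch below is plain arithmetic on the grid sizes used above; the variable names are illustrative assumptions.

```python
from math import prod

# Number of candidate values per hyperparameter in the grid above
grid_sizes = {
    "n_estimators": 3,
    "max_features": 3,
    "min_samples_split": 3,
    "max_depth": 4,
}
cv_folds = 5

n_candidates = prod(grid_sizes.values())   # every combination of values
n_fits = n_candidates * cv_folds           # one fit per candidate per fold

print(n_candidates, n_fits)
```

With the default `refit=True`, GridSearchCV also refits the best candidate once on the full training set, so the total is one more than `n_fits`.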


Now that GridSearchCV has identified the optimal parameters from the set, I will use those parameters to fit a model and see what the training and test set scores look like.

In [8]:
# Instantiate the model
rfr_grid = RandomForestRegressor(**best_params)
# Fit the model and generate training scores
rfr_grid_train = train_model(rfr_grid, X_test, y_test, 5)
# Generate test scores
rfr_grid_test = test_model(rfr_grid, X_test, y_test, 5)
print(rfr_grid_train)
print(rfr_grid_test)

Training Scores: 0.5478 (+/- 0.1359)
Test Scores: 0.5455 (+/- 0.1443)


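Even better! The mean test score increased from 0.4645 to 0.5455, but there is still sizable variance in the scores between the folds. Note that the training call above receives `X_test` and `y_test`, so both printed lines are cross-validation scores on the test set and differ only through the randomness of the forest; a true training score would require passing `X_train` and `y_train`.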
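Cross-validating on the test set refits the model on folds of the test data. A more conventional check, sketched below under stated assumptions, fits once on the training split and scores once on the held-out split. The data here is synthetic and the names `X_demo`, `y_demo`, `train_r2`, and `test_r2` are hypothetical; the hyperparameters only loosely echo the grid search result.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the Colorado well data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 4))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

model = RandomForestRegressor(
    n_estimators=100, max_depth=10, min_samples_split=4, random_state=42
)
model.fit(X_tr, y_tr)                 # fit only on the training split

train_r2 = model.score(X_tr, y_tr)    # in-sample R^2 (optimistic)
test_r2 = model.score(X_te, y_te)     # held-out R^2

print('Train R^2: {:0.4f}'.format(train_r2))
print('Test R^2:  {:0.4f}'.format(test_r2))
```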

Let's check the feature importances to see which features influence the model most. The cell below reads them from `rfr`, the default-parameter forest fit on the training set earlier.

In [9]:
feature_importances = pd.DataFrame(rfr.feature_importances_,
                                   index = X_train.columns,
                                    columns=['importance']).sort_values('importance', ascending=False)
feature_importances

Unnamed: 0,importance
hzlen_bin_<1,0.263592
VerticalDepth,0.241367
nphf_sqrt,0.165844
sqrtsandmass,0.155211
xlinkgel,0.063691
hzlen_bin_1-2,0.040941
hzlen_bin_>2,0.038666
gel,0.010525
TargetFormation_CODELL,0.00862
TargetFormation_NIOBRARA,0.007104
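The impurity-based values in `feature_importances_` are computed from the training data and can favor continuous features over binary ones. As a hedged cross-check, the sketch below uses scikit-learn's `permutation_importance` on held-out data. The synthetic columns `strong`, `weak`, and `noise` are assumptions chosen so that the expected ranking is known.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with one dominant feature (hypothetical)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(400, 3)), columns=["strong", "weak", "noise"])
y_demo = 5.0 * X_demo["strong"] + 0.5 * X_demo["weak"] + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the held-out set and measure the drop in R^2
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
perm_importances = pd.Series(
    result.importances_mean, index=X_te.columns
).sort_values(ascending=False)

print(perm_importances)
```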


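### Support Vector Regression
The final model I will use to predict production is support vector regression.

In [None]:
# Instantiate the model
svr = SVR()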
# Fit the model and generate training scores
svr_train = train_model(svr, X_train, y_train, 5)
# Generate test scores
svr_test = test_model(svr, X_test, y_test, 5)
print(svr_train)
print(svr_test)

Training Scores: -0.0186 (+/- 0.0120)
Test Scores: -0.0350 (+/- 0.0315)
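Negative R² means the model does worse than predicting the mean. One plausible cause, stated here as an assumption rather than a diagnosis, is that SVR with default `C` and `epsilon` is sensitive to the scale of both features and target. The sketch below compares a raw SVR against one that standardizes features (via a pipeline) and target (via `TransformedTargetRegressor`) on synthetic data with magnitudes loosely resembling the columns above.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data with large, mismatched scales (hypothetical)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(8000, 300, size=300),
    rng.normal(75, 5, size=300),
    rng.normal(1800, 200, size=300),
])
y_demo = (
    50000
    + 40 * (X_demo[:, 0] - 8000)
    + 1500 * (X_demo[:, 1] - 75)
    + rng.normal(scale=3000, size=300)
)

# SVR on unscaled features and target
raw_scores = cross_val_score(SVR(), X_demo, y_demo, cv=5)

# SVR with standardized features and standardized target
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)
scaled_scores = cross_val_score(scaled_svr, X_demo, y_demo, cv=5)

print('Raw SVR:    {:0.4f}'.format(raw_scores.mean()))
print('Scaled SVR: {:0.4f}'.format(scaled_scores.mean()))
```

Whether scaling helps on the actual well data would need to be checked by rerunning the cell with the real features.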


In [None]:
# Candidate C values to try with each kernel
parameters = {'C': [0.01, 0.1, 1, 10, 100],
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}

# Instantiate an SVR and grid-search over each C value and kernel
svr = SVR()
grid = GridSearchCV(estimator=svr, param_grid=parameters)
grid.fit(X_train, y_train) 
print(grid.best_score_)
best_params = grid.best_params_
print(best_params)