# Well water cut calculation using Machine Learning
#### Alexander Kalinichenko

This notebook demonstrates how to train a machine learning model to predict the water cut (WCT) of oil production wells.
The exercise is based on data from a field in the Volga-Ural basin, Russia.

_Add some info about field..._

The dataset was built from observed production data for 70 wells, combined with other wellbore data such as bottom-hole location, perforation interval, wellbore type and well operation duration. This data is used to train a random forest regression model that predicts the water cut of producing wells or planned sidetracks. The exercise uses the `sklearn.ensemble` module. The [ensemble methods](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble) module provides ensemble-based methods for classification, regression and anomaly detection in the [scikit-learn Python library](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). A random forest is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and averages their predictions to improve predictive accuracy and control over-fitting. A random forest regressor was chosen because it is simple, easy to understand and comparatively resistant to over-fitting.
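The averaging described above can be checked directly. The following is a minimal sketch on synthetic data (not the field dataset; all `_demo` names are hypothetical) that compares the forest prediction with the mean of its individual trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only (not the field dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 100 * X_demo[:, 0] + 10 * rng.normal(size=200)

forest_demo = RandomForestRegressor(n_estimators=50, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Forest prediction versus the mean of the per-tree predictions
forest_pred = forest_demo.predict(X_demo[:5])
tree_mean = np.mean(
    [tree.predict(X_demo[:5]) for tree in forest_demo.estimators_], axis=0
)
print(np.allclose(forest_pred, tree_mean))
```

Each tree is trained on a bootstrap sample, so individual trees disagree; averaging them is what reduces the variance of the final estimate.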

First we will [explore the dataset](#Exploring-the-dataset). We will load the training data from 70 wells and take a look at what we have to work with. We will plot the location data and create cross plots to look at the variation within the data.

. . .

## Well locations in the dataset converted to distances between wells

**Overall objective:**
- assess the prospects of drilling additional sidetracks

Additional tasks:
- predict water cut
- predict incremental production
- predict the oil-water contact (OWC)

### In addition to R2, add SMA

Import libraries

In [None]:
import numpy as np
import pandas as pd
#import datetime
#from datetime import datetime, date
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import pylab
from pylab import rcParams
#import math
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
#from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score as r2, mean_absolute_error as mae, mean_squared_error as mse, accuracy_score
#from sklearn.model_selection import KFold, GridSearchCV
#from sklearn.tree import DecisionTreeRegressor
#from sklearn.svm import SVC
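from sklearn.decomposition import PCA  # used below to compress the well coordinates
# Note: since scikit-learn 1.0, DistanceMetric is also exposed from sklearn.metrics
from sklearn.neighbors import DistanceMetric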

## Exploring the dataset

Select the data for a single date (2018-06-01, the last date with production data available for all wells of the field) and drop the columns that are not needed.

In [None]:
data_path = '../data/df.xlsx'
df = pd.read_excel(data_path, sheet_name='prod')
df = df.drop(['Pres'], axis=1)
df_2018_06_01 = df.loc[df['Date'] == '2018-06-01']  # filter on df itself; df_train is not defined yet
df_2018_06_01
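Later cells refer to `df_train` and `df_test`, which are not defined in the cells shown here. The following is a minimal sketch of one way such a split could be produced, assuming a random hold-out of wells with `train_test_split`. The toy values, the 25% test share and all `_demo` names are illustrative assumptions, not the split actually used in this exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy snapshot using the column names from this notebook; the values are made up
snapshot_demo = pd.DataFrame({
    'Well': [f'W{i}' for i in range(1, 9)],
    'x': [0.0, 100.0, 200.0, 300.0, 0.0, 100.0, 200.0, 300.0],
    'y': [0.0, 0.0, 0.0, 0.0, 150.0, 150.0, 150.0, 150.0],
    'wct': [35.0, 60.0, 82.0, 91.0, 40.0, 55.0, 77.0, 95.0],
})

# Hold out 25% of the wells for testing (assumed share)
df_train_demo, df_test_demo = train_test_split(
    snapshot_demo, test_size=0.25, random_state=42
)

# Targets are the water cut values of each subset
y_train_demo = df_train_demo['wct']
y_test_demo = df_test_demo['wct']
print(len(df_train_demo), len(df_test_demo))
```

Because the snapshot holds one row per well, splitting by rows here is the same as splitting by wells, so no well appears in both subsets.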

Check that the data loaded correctly by plotting the well location map.

In [None]:
rcParams['figure.figsize'] = [11, 8]  # set before plotting so it applies to this figure
ax = df_2018_06_01.plot(kind='scatter', x='x', y='y')
# Label each point with its well name
df_2018_06_01[['x','y','Well']].apply(lambda row: ax.text(*row), axis=1);
#plt.rcParams.update({'font.size': 8})

Create a matrix of distances between wells.

In [None]:
dist = DistanceMetric.get_metric('euclidean')
loc = pd.DataFrame(dist.pairwise(df_train[['x','y']].to_numpy()),
             columns=df_train.Well.unique(), index=df_train.Well.unique())
loc
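For reference, the same pairwise Euclidean distances can be computed with plain NumPy broadcasting. This is a minimal sketch on three made-up wells (`wells_demo` and its coordinates are hypothetical), and it assumes one row per well so that the row and column labels line up:

```python
import numpy as np
import pandas as pd

# Three hypothetical wells forming a 3-4-5 triangle
wells_demo = pd.DataFrame({
    'Well': ['A', 'B', 'C'],
    'x': [0.0, 3.0, 0.0],
    'y': [0.0, 4.0, 4.0],
})

coords = wells_demo[['x', 'y']].to_numpy()
# diff[i, j] holds the coordinate difference between well i and well j
diff = coords[:, None, :] - coords[None, :, :]
dist_demo = pd.DataFrame(
    np.sqrt((diff ** 2).sum(axis=-1)),
    index=wells_demo['Well'],
    columns=wells_demo['Well'],
)
print(dist_demo)
```

The result is symmetric with zeros on the diagonal, so each well contributes one feature column per well in the dataset.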

In [None]:
# x_test = df_test.drop(['Well', 'wct'], axis=1)
# y_test = df_test['wct']
# x_test
df_train

In [None]:
df_train_prep = df_train.drop(['Well', 'wct', 'x', 'y'], axis=1)
# y = df_train['wct']
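# Align the distance matrix with the row index of df_train before concatenating;
# otherwise concat matches integer row labels against well names and fills with NaN
x = pd.concat([df_train_prep, loc.set_axis(df_train.index, axis=0)], axis=1)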
x

In [None]:
pca = PCA(n_components=1, random_state=100)
mc = pca.fit_transform(df_train[['x', 'y']])
mc

In [None]:
pca.explained_variance_ratio_
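As a minimal sketch of what this step does, the example below projects synthetic 2D coordinates onto one principal component. The data and all `_demo` names are assumptions for illustration. The component is fitted on the training coordinates only and then reused for the test coordinates, so both sets share the same axis:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic coordinates scattered around a line (illustration only)
rng = np.random.default_rng(1)
t = rng.uniform(0, 1000, size=40)
coords_demo = np.column_stack([t, 0.5 * t + rng.normal(scale=20, size=40)])
train_xy, test_xy = coords_demo[:30], coords_demo[30:]

pca_demo = PCA(n_components=1, random_state=100)
loc_train = pca_demo.fit_transform(train_xy)  # fit on training wells only
loc_test = pca_demo.transform(test_xy)        # reuse the fitted component

# Share of the coordinate variance kept by the single component
print(pca_demo.explained_variance_ratio_)
```

A ratio close to 1 means the wells lie almost along a line and little location information is lost; a low ratio means the single `loc` feature discards much of the spatial layout.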

In [None]:
x = df_train.drop(['Well', 'wct', 'x', 'y'], axis=1)
x['loc'] = mc
x

In [None]:
# Reuse the PCA fitted on the training coordinates; refitting on the test set
# would place the train and test locations on different axes
mc_test = pca.transform(df_test[['x', 'y']])
mc_test

x_test = df_test.drop(['Well', 'wct', 'x', 'y'], axis=1)
x_test['loc'] = mc_test
x_test

In [None]:
x_test

In [None]:
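# Targets: water cut of the training and test wells
y = df_train['wct']
y_test = df_test['wct']

# Fit the scaler on the training features only, then apply it to the test features
scaler = StandardScaler()
x_train = scaler.fit_transform(x)
x_test = scaler.transform(x_test)

model = RandomForestRegressor(random_state=42, max_depth=14)
model.fit(x_train, y)

y_pred = model.predict(x_test)
y_pred_train = model.predict(x_train)

r2_train = r2(y, y_pred_train)
print(f'R2 train: {r2_train:.4f}')

# Report the test score (the original printed r2_train here)
r2_test = r2(y_test, y_pred)
print(f'R2 test: {r2_test:.4f}')

model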

In [None]:
df_y_test = pd.DataFrame({'Well': df_test['Well'], 
                          'wct predicted, %': y_pred.round(1), 
                          'wct actual, %': y_test.round(1)})
df_y_test

In [None]:
df_y_train = pd.DataFrame({'Well': df_train['Well'], 
                           'wct predicted, %': y_pred_train.round(1), 
                           'wct actual, %': y.round(1)})
df_y_train

In [None]:
model.feature_importances_
feature_importances = pd.DataFrame()
feature_importances['importance'] = model.feature_importances_
feature_importances['feature_name'] = x.columns.tolist()
feature_importances = feature_importances.sort_values(by='importance', ascending=False)
feature_importances
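The impurity-based importances of a random forest are normalised so that they sum to one, which makes them easy to read as shares. This is a minimal sketch on synthetic data; the feature names `strong`, `weak` and `noise` are hypothetical and chosen only to make the ranking obvious:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features: the target depends mostly on 'strong' (illustration only)
rng = np.random.default_rng(2)
X_imp = pd.DataFrame(rng.uniform(size=(300, 3)), columns=['strong', 'weak', 'noise'])
y_imp = 100 * X_imp['strong'] + 5 * X_imp['weak'] + rng.normal(size=300)

model_demo = RandomForestRegressor(n_estimators=100, random_state=42)
model_demo.fit(X_imp, y_imp)

# Pair each importance with its feature name and sort in descending order
importances_demo = pd.Series(
    model_demo.feature_importances_, index=X_imp.columns
).sort_values(ascending=False)
print(importances_demo)
```

Impurity-based importances can overstate features with many distinct values, so they are best treated as a rough ranking rather than an exact measure of influence.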

In [None]:
feature_importances = feature_importances.sort_values(by='importance', ascending=True)
height = feature_importances['importance']
bars = feature_importances['feature_name']
y_pos = np.arange(len(bars))
# Create horizontal bars
plt.barh(y_pos, height)
# Create names on the y-axis
plt.yticks(y_pos, bars)
plt.show()

In [None]:
def evaluate_preds(true_values, pred_values):
    print("R2:\t" + str(round(r2(true_values, pred_values), 3)) + "\n" +
          "MAE:\t" + str(round(mae(true_values, pred_values), 0)) + "\n" +
          "MSE:\t" + str(round(mse(true_values, pred_values), 0))) 
    plt.figure(figsize=(6,6))
    sns.scatterplot(x=pred_values, y=true_values, s=1)
    plt.xlabel('Predicted values')
    plt.ylabel('True values')
    plt.title('True vs Predicted values')
    plt.ylim(0, 105)
    plt.xlim(0, 105)
    plt.show()

# Pass the true values first and the predictions second, matching the function signature
evaluate_preds(y.values.flatten(), y_pred_train.flatten())
evaluate_preds(y_test.values.flatten(), y_pred.flatten())
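The argument order matters because R2 is not symmetric: scikit-learn metrics expect the true values first and the predictions second. This is a minimal sketch with made-up numbers (the `_demo` arrays are hypothetical) showing that swapping the arguments changes R2 while MAE stays the same:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Made-up water cut values in percent (illustration only)
y_true_demo = np.array([0.0, 10.0, 20.0, 30.0])
y_pred_demo = np.array([5.0, 10.0, 15.0, 20.0])

r2_correct = r2_score(y_true_demo, y_pred_demo)  # true values first
r2_swapped = r2_score(y_pred_demo, y_true_demo)  # arguments swapped
mae_either = mean_absolute_error(y_true_demo, y_pred_demo)

print(r2_correct, r2_swapped, mae_either)
```

R2 normalises the residuals by the variance of whichever array is passed as the true values, so a swapped call can report a very different score for the same predictions.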