# Choosing a location for an oil well

We have data from the oil company GlavRosGosNeft and need to decide where to drill a new well.

We have been given oil samples from three regions: each contains 100,000 deposits for which the oil quality and the volume of reserves have been measured. The task is to build a machine learning model that helps identify the region where extraction will bring the most profit. Possible profits and risks should be analyzed using the *Bootstrap* technique.

Steps to choose a location:

- Deposits are surveyed in the selected region, and the feature values are determined for each one;
- A model is built to estimate the volume of reserves;
- The deposits with the highest estimated reserves are selected. The number of deposits depends on the company's budget and the cost of developing one well;
- The region's profit is the total profit from the selected deposits.
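The steps above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the data are randomly generated, the model is a plain least-squares fit done with NumPy, the fit and the prediction use the same rows for brevity, and the constants mirror the values this notebook defines later.

```python
import numpy as np

# Synthetic stand-in data (an assumption, not the company's dataset).
rng = np.random.default_rng(0)
n_deposits = 1000
features = rng.normal(size=(n_deposits, 3))
true_reserves = (90 + features @ np.array([5.0, -3.0, 20.0])
                 + rng.normal(scale=10, size=n_deposits))

# Step 2: fit a linear model (least squares with an intercept) and predict reserves.
design = np.column_stack([np.ones(n_deposits), features])
coef, *_ = np.linalg.lstsq(design, true_reserves, rcond=None)
predicted = design @ coef

# Step 3: select the deposits with the highest predicted reserves.
N_SELECTED = 200              # wells the budget allows
top_idx = np.argsort(predicted)[::-1][:N_SELECTED]

# Step 4: profit = income from the actual reserves of the selection minus the budget.
INCOME_PER_UNIT = 450_000     # income per thousand barrels
BUDGET = 10**10
profit = true_reserves[top_idx].sum() * INCOME_PER_UNIT - BUDGET
print(f"Profit on synthetic data: {profit:,.0f}")
```

Note that the selection is driven by the predictions, while the profit is computed from the actual reserves of the selected deposits.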

## Data overview

In [42]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [43]:
try:
    region_1 = pd.read_csv('/datasets/geo_data_0.csv')
    region_2 = pd.read_csv('/datasets/geo_data_1.csv')
    region_3 = pd.read_csv('/datasets/geo_data_2.csv')
except FileNotFoundError:
    region_1 = pd.read_csv('geo_data_0.csv')
    region_2 = pd.read_csv('geo_data_1.csv')
    region_3 = pd.read_csv('geo_data_2.csv')

In [44]:
def info(table):
    display(table.info())
    print(100*'=')
    display(table.describe())
    print(100*'=')
    # isna().mean() gives the share of missing values per column
    print(f'Number of null values: {table.isna().mean()}')
    print(100*'=')
    print(f'Number of duplicates: {table.duplicated().sum()}')
    print(100*'=')
    display(table.head())
    print(100*'=')
    display(table.shape)

for table in (region_1, region_2, region_3):
    info(table)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


None



| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 0.500419 | 0.250143 | 2.502647 | 92.5 |
| std | 0.871832 | 0.504433 | 3.248248 | 44.288691 |
| min | -1.408605 | -0.848218 | -12.088328 | 0.0 |
| 25% | -0.07258 | -0.200881 | 0.287748 | 56.497507 |
| 50% | 0.50236 | 0.250252 | 2.515969 | 91.849972 |
| 75% | 1.073581 | 0.700646 | 4.715088 | 128.564089 |
| max | 2.362331 | 1.343769 | 16.00379 | 185.364347 |


Number of null values: id         0.0
f0         0.0
f1         0.0
f2         0.0
product    0.0
dtype: float64
Number of duplicates: 0


| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |




(100000, 5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


None



| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 1.141296 | -4.796579 | 2.494541 | 68.825 |
| std | 8.965932 | 5.119872 | 1.703572 | 45.944423 |
| min | -31.609576 | -26.358598 | -0.018144 | 0.0 |
| 25% | -6.298551 | -8.267985 | 1.000021 | 26.953261 |
| 50% | 1.153055 | -4.813172 | 2.011479 | 57.085625 |
| 75% | 8.621015 | -1.332816 | 3.999904 | 107.813044 |
| max | 29.421755 | 18.734063 | 5.019721 | 137.945408 |


Number of null values: id         0.0
f0         0.0
f1         0.0
f2         0.0
product    0.0
dtype: float64
Number of duplicates: 0


| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |




(100000, 5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


None



| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 0.002023 | -0.002081 | 2.495128 | 95.0 |
| std | 1.732045 | 1.730417 | 3.473445 | 44.749921 |
| min | -8.760004 | -7.08402 | -11.970335 | 0.0 |
| 25% | -1.162288 | -1.17482 | 0.130359 | 59.450441 |
| 50% | 0.009424 | -0.009482 | 2.484236 | 94.925613 |
| 75% | 1.158535 | 1.163678 | 4.858794 | 130.595027 |
| max | 7.238262 | 7.844801 | 16.739402 | 190.029838 |


Number of null values: id         0.0
f0         0.0
f1         0.0
f2         0.0
product    0.0
dtype: float64
Number of duplicates: 0


| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | q6cA6 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |




(100000, 5)

The geological exploration data for the three regions are stored in the files loaded above. The columns are:

- `id`: unique identifier of the well;
- `f0`, `f1`, `f2`: three features of the deposits (their meaning is not important, but the features themselves are significant);
- `product`: the volume of reserves in the well (thousand barrels).


- Each region's table has 100,000 rows, with no missing values and no fully duplicated rows.
- The columns `f0`, `f1`, `f2` and `product` are stored as `float64`; `id` is stored as `object`.
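`DataFrame.duplicated()` compares whole rows, so a zero count does not by itself show that every `id` is unique. A minimal sketch of checking both, using a tiny hypothetical table in place of `region_1`, `region_2` or `region_3`:

```python
import pandas as pd

# Tiny hypothetical table; in the notebook this would be one of the region tables.
sample = pd.DataFrame({
    "id": ["a1", "b2", "a1", "c3"],
    "f0": [0.1, 0.2, 0.3, 0.4],
    "product": [10.0, 20.0, 30.0, 40.0],
})

# Rows that are identical to an earlier row in every column.
full_row_duplicates = sample.duplicated().sum()
# Rows whose id already appeared earlier, even if the other columns differ.
repeated_ids = sample["id"].duplicated().sum()

print(f"Fully duplicated rows: {full_row_duplicates}")
print(f"Repeated ids: {repeated_ids}")
```

In this toy table the id `a1` appears twice with different feature values, so the row-level check finds nothing while the id-level check reports one repeat.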


Now we can move on to machine learning.

## Model training and validation

In [45]:
state = np.random.RandomState(12345)

In [46]:
def region_predict(regions):
    # Features are f0, f1, f2; the target is the volume of reserves
    feature = regions.drop(['id', 'product'], axis=1)
    target = regions['product']

    # Hold out 25% of the wells for validation
    feature_train, feature_valid, target_train, target_valid = train_test_split(
        feature, target, test_size=0.25, random_state=12345)

    model = LinearRegression()
    model.fit(feature_train, target_train)
    predictions = model.predict(feature_valid)

    # RMSE on the validation set and the mean predicted reserves
    rmse = mean_squared_error(target_valid, predictions) ** 0.5
    average_product = predictions.mean()

    return predictions, rmse, target_valid, average_product

After training the model, let's look at the results for each region.

In [47]:
predictions_0, rmse_0, target_valid_0, average_product_0 = region_predict(region_1)
print('Region 1')
print(f'RMSE of the region  = {rmse_0 :.2f}')
print(f'Average stock of predicted raw materials in the region = {average_product_0:.2f} ths. barrels')

Region 1
RMSE of the region  = 37.58
Average stock of predicted raw materials in the region = 92.59 ths. barrels


In [48]:
predictions_1, rmse_1, target_valid_1, average_product_1 = region_predict(region_2)
print('Region 2')
print(f'RMSE of the region  = {rmse_1 :.2f}')
print(f'Average stock of predicted raw materials in the region = {average_product_1:.2f} ths. barrels')

Region 2
RMSE of the region  = 0.89
Average stock of predicted raw materials in the region = 68.73 ths. barrels


In [49]:
predictions_2, rmse_2, target_valid_2, average_product_2 = region_predict(region_3)
print('Region 3')
print(f'RMSE of the region  = {rmse_2 :.2f}')
print(f'Average stock of predicted raw materials in the region = {average_product_2:.2f} ths. barrels')

Region 3
RMSE of the region  = 40.03
Average stock of predicted raw materials in the region = 94.97 ths. barrels


After training and validating the models, we can draw the following conclusions:

- The RMSE is highest in the third region (40.03) and lowest in the second (0.89).
- The average predicted reserves per well are largest in the third region (94.97 thousand barrels) and smallest in the second (68.73 thousand barrels).
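To put these error values in context, it helps to compare them with a constant baseline. A model that always predicts the training mean has an RMSE close to the standard deviation of the target, and the `describe()` output above reports a standard deviation of `product` of roughly 44 to 46 thousand barrels in every region. The sketch below shows how such a baseline could be computed; the helper names and the toy numbers are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def baseline_rmse(target_train, target_valid):
    """RMSE of always predicting the mean of the training target."""
    constant = float(np.mean(target_train))
    return rmse(target_valid, np.full(len(target_valid), constant))

# Toy numbers, for illustration only.
y_train = [80.0, 100.0, 120.0]
y_valid = [90.0, 110.0]
model_pred = [92.0, 105.0]

print(f"Model RMSE:    {rmse(y_valid, model_pred):.2f}")
print(f"Baseline RMSE: {baseline_rmse(y_train, y_valid):.2f}")
```

Read this way, an RMSE of 37.58 or 40.03 is only a modest improvement over a constant prediction, whereas 0.89 in the second region means the model reproduces the reserves almost exactly.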

## Preparation for profit calculation

Let's save all key values for calculations. 

In [50]:
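BUDGET = 10**10           # development budget for one region
TOTAL_OIL_WELLS = 500     # wells drawn in each sample
OIL_WELLS = 200           # best wells selected for development
REVENUE = 450000          # income per unit of product (one unit = thousand barrels)
ONE_BARREL_PROFIT = 450   # income per barrel (not used below)
MAX_RISK = 0.025          # maximum acceptable probability of loss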

In [51]:
cost_per_well = BUDGET / OIL_WELLS
print(f'Development cost per well: {cost_per_well:.2f}')

Development cost per well: 50000000.00


In [52]:
break_even_volume = cost_per_well / REVENUE
print(f'The volume of raw materials for break-even development of a new well: {break_even_volume:.2f}')

The volume of raw materials for break-even development of a new well: 111.11


In [53]:
print(f'Average real raw material stock in region 1:  {target_valid_0.mean():.2f}')
print(f'Average real raw material stock in region 2:  {target_valid_1.mean():.2f}')
print(f'Average real raw material stock in region 3:  {target_valid_2.mean():.2f}')

Average real raw material stock in region 1:  92.08
Average real raw material stock in region 2:  68.72
Average real raw material stock in region 3:  94.88


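The average actual reserves per well are largest in regions 3 and 1 and smallest in region 2. In all three regions the average is below the break-even volume of 111.11 thousand barrels, so wells cannot be chosen at random; only the wells with the highest predicted reserves should be developed.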
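For reference, the break-even volume follows directly from the constants defined above. Writing $B$ for the budget, $n$ for the number of developed wells and $r$ for the income per unit of product (the symbols are introduced here only for this derivation):

$$
c_{\text{well}} = \frac{B}{n} = \frac{10^{10}}{200} = 5 \times 10^{7},
\qquad
v_{\text{break-even}} = \frac{c_{\text{well}}}{r} = \frac{5 \times 10^{7}}{450\,000} \approx 111.11 \text{ thousand barrels}.
$$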

In [54]:
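def revenue_region(target, probab):
    # Align predictions with the index of the actual values
    probab = pd.Series(probab, index=target.index)
    # Draw 500 wells with replacement (unseeded, so the result changes between runs)
    sample_probs = probab.sample(n=TOTAL_OIL_WELLS, replace=True)
    # Keep the 200 wells with the highest predicted reserves
    probs_sorted = sample_probs.sort_values(ascending=False).head(OIL_WELLS)
    # Look up the actual reserves of the selected wells by index label
    target_sorted = target[probs_sorted.index]
    # Income from the selected wells minus the development budget
    target_revenue = target_sorted.sum() * REVENUE
    net_revenue = target_revenue - BUDGET

    return np.around(net_revenue, decimals=3)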

In [55]:
revenue_region_0 = revenue_region(target_valid_0, predictions_0)
revenue_region_1 = revenue_region(target_valid_1, predictions_1)
revenue_region_2 = revenue_region(target_valid_2, predictions_2)
print(f'Revenue from oil production in the region 1 = {revenue_region_0:.2f}')
print(f'Revenue from oil production in the region 2 = {revenue_region_1:.2f}')
print(f'Revenue from oil production in the region 3 = {revenue_region_2:.2f}')

Revenue from oil production in the region 1 = 503023183.10
Revenue from oil production in the region 2 = 306512352.91
Revenue from oil production in the region 3 = 679275073.28


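In this single random draw, profit is highest in the third region (about 679 million), followed by the first (about 503 million), and lowest in the second (about 307 million). Because `revenue_region` samples wells at random without a fixed seed, these figures change from run to run.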
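One caveat about `revenue_region` is that it looks up actual reserves by index label. When the target passed in has repeated labels, which happens after sampling with replacement, a label-based lookup returns every row that shares a label, so more than 200 wells can end up in the sum. A possible alternative, sketched here as a hypothetical helper that is not part of the original notebook, ranks and selects wells by position and does no sampling of its own:

```python
import numpy as np
import pandas as pd

INCOME_PER_UNIT = 450_000   # same value as REVENUE in the notebook
BUDGET = 10**10
N_BEST = 200                # same value as OIL_WELLS in the notebook

def profit_top_wells(target, predictions, n_best=N_BEST):
    """Profit from the n_best wells ranked by predicted reserves (hypothetical helper)."""
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    # Rank by position, so repeated index labels cannot inflate the selection.
    best_positions = np.argsort(predictions)[::-1][:n_best]
    return target[best_positions].sum() * INCOME_PER_UNIT - BUDGET

# Toy example with a repeated index label (7 appears twice).
toy_target = pd.Series([100.0, 120.0, 90.0, 130.0], index=[7, 7, 3, 5])
toy_pred = np.array([95.0, 125.0, 80.0, 110.0])
print(profit_top_wells(toy_target, toy_pred, n_best=2))
```

In the toy example the two highest predictions belong to the wells with actual reserves of 120 and 130, so exactly two wells contribute to the result regardless of the repeated label.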

## Profit and Risk Calculation

In [56]:
state = np.random.RandomState(12345)

In [57]:
def bootstrap_region(target, predictions):
    values = []
    for _ in range(1000):
        # Resample 500 wells with replacement; positions also index the predictions array
        target_subsample = target.reset_index(drop=True).sample(n=TOTAL_OIL_WELLS, replace=True, random_state=state)
        probabilities_subsample = predictions[target_subsample.index]
        revenues = revenue_region(target_subsample, probabilities_subsample)
        values.append(revenues)

    values = pd.Series(values)
    lower = values.quantile(MAX_RISK)   # 2.5% quantile
    upper = values.quantile(0.95)       # 95% quantile
    mean = values.mean()

    print(f'Average profit: {mean:.2f}')
    print(f'2.5% quantile: {lower:.2f}')
    print(f'Interval (2.5% to 95% quantiles): from {lower:.2f} to {upper:.2f}')
    print(f'Loss risk: {(values < 0).mean():.2f}')

In [58]:
bootstrap_region(target_valid_0, predictions_0)

Average profit: 596450455.30
2.5% quantile: -203518869.07
Interval (2.5% to 95% quantiles): from -203518869.07 to 1363212272.00
Loss risk: 0.08


In [59]:
bootstrap_region(target_valid_1, predictions_1)

Average profit: 670158266.03
2.5% quantile: 28877177.66
Interval (2.5% to 95% quantiles): from 28877177.66 to 1289005367.38
Loss risk: 0.02


In [60]:
bootstrap_region(target_valid_2, predictions_2)

Average profit: 602304117.58
2.5% quantile: -219213987.44
Interval (2.5% to 95% quantiles): from -219213987.44 to 1381132426.07
Loss risk: 0.08


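Based on the bootstrap results, the following conclusions can be drawn:

- The average profit (revenue minus the budget) is highest in the second region (about 670 million) and lowest in the first (about 596 million).
- The 2.5% quantile is lowest in the third region (about -219 million) and highest in the second (about 29 million), the only region where it is positive.
- The interval between the 2.5% and 95% quantiles is widest in the third and first regions and narrowest in the second.
- The risk of loss is lowest in the second region (0.02) and is 0.08 in both the first and third regions.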
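The interval in this notebook is bounded by the 2.5% and 95% quantiles. A symmetric 95% interval would instead use the 2.5% and 97.5% quantiles. The sketch below shows one way to run the bootstrap with that choice. It is an illustrative variant under stated assumptions, not the notebook's own function: it samples by position with NumPy, selects the best wells inside each draw, and runs on synthetic data, so its numbers are not comparable with the results above.

```python
import numpy as np

def bootstrap_profit(target, predictions, n_iter=1000, sample_size=500,
                     n_best=200, income_per_unit=450_000, budget=10**10,
                     seed=12345):
    """Bootstrap the profit of the n_best predicted wells (hypothetical helper)."""
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    rng = np.random.default_rng(seed)
    profits = np.empty(n_iter)
    for i in range(n_iter):
        # Draw sample_size wells with replacement, by position.
        idx = rng.integers(0, len(target), size=sample_size)
        # Keep the n_best wells of this draw according to the predictions.
        best = idx[np.argsort(predictions[idx])[::-1][:n_best]]
        profits[i] = target[best].sum() * income_per_unit - budget
    # Symmetric 95% interval: 2.5% and 97.5% quantiles.
    lower, upper = np.quantile(profits, [0.025, 0.975])
    return {
        "mean": float(profits.mean()),
        "lower": float(lower),
        "upper": float(upper),
        "loss_risk": float((profits < 0).mean()),
    }

# Synthetic data, for illustration only.
toy_rng = np.random.default_rng(0)
toy_target = toy_rng.uniform(0, 190, size=5000)
toy_pred = toy_target + toy_rng.normal(scale=40, size=5000)
result = bootstrap_profit(toy_target, toy_pred, n_iter=200)
print(result)
```

The loss risk is the share of bootstrap samples with negative profit, which is the quantity compared against the 2.5% threshold.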

## Summary

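- The data was loaded and inspected; no missing values or fully duplicated rows were found.
- A linear regression model was trained for each region. The average predicted reserves per well are 92.59, 68.73 and 94.97 thousand barrels for regions 1, 2 and 3.
- The RMSE is highest in region 3 (40.03) and lowest in region 2 (0.89).
- The average actual reserves per well are largest in regions 3 and 1, but in every region they are below the break-even volume of 111.11 thousand barrels.
- In a single random draw of 500 wells, profit was highest in region 3 and lowest in region 2.
- Across 1,000 bootstrap samples, the average profit is highest in region 2.
- The risk of loss is lowest in region 2.

The second region is best suited for well development: it is the only region whose risk of loss (about 2%) is below the 2.5% threshold, it has the highest average bootstrap profit (about 670 million), and it is the only region whose 2.5% quantile of profit is positive (about 28.9 million).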
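The decision rule behind this conclusion can be written out explicitly: discard regions whose loss risk is not below the threshold, then pick the highest average profit among the rest. A minimal sketch using the figures printed above (the loss risks are the two-decimal values from the output, and the dictionary layout is an assumption made for illustration):

```python
MAX_RISK = 0.025

# Bootstrap results copied from the printed output above.
results = {
    "region 1": {"mean_profit": 596450455.30, "loss_risk": 0.08},
    "region 2": {"mean_profit": 670158266.03, "loss_risk": 0.02},
    "region 3": {"mean_profit": 602304117.58, "loss_risk": 0.08},
}

# Keep only the regions whose loss risk is below the threshold.
eligible = {name: r for name, r in results.items() if r["loss_risk"] < MAX_RISK}

# Among the eligible regions, choose the one with the highest average profit.
best_region = (max(eligible, key=lambda name: eligible[name]["mean_profit"])
               if eligible else None)
print(best_region)
```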
