# Selecting a location for an oil well

Suppose you work for an oil extraction company that needs to decide where to drill a new well.

You are given oil samples from three regions, with 10,000 fields in each; for every field, the oil quality and the volume of reserves were measured. Build a machine learning model that helps identify the region where extraction will bring the greatest profit. Analyze the potential profit and risks using the *bootstrap* technique.

Steps to select a location:

- Fields are surveyed in the selected region, and the feature values are determined for each one;
- A model is built to estimate the volume of reserves;
- The fields with the highest estimated values are selected. The number of fields depends on the company's budget and the cost of developing one well;
- The profit equals the total profit of the selected fields.
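The last two steps can be written as a single formula. The symbols below are notation introduced here for illustration, not taken from the task statement: $S$ is the set of fields with the highest predicted reserves, $v_i$ is the actual volume of field $i$ (in thousand barrels), $r$ is the revenue per thousand barrels (`BARREL` in the code below), and $B$ is the development budget (`BUDGET`).

$$
\text{profit} = r \sum_{i \in S} v_i - B
$$

Note that the selection uses predicted volumes, while the sum uses actual volumes, so the quality of the model directly affects the result.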

## Preparation of the data

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

from numpy.random import RandomState
RANDOM_STATE=RandomState(12345)

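POINTS=500              # fields sampled in each bootstrap round
BEST_POINTS=200         # best fields selected for development
BUDGET=10e9             # development budget, rubles
BARREL=45e4             # revenue per unit of `product` (thousand barrels), rubles
MAX_LOSS=0.025          # maximum acceptable probability of loss
BOOTSTRAP_NUMBER=1000   # number of bootstrap rounds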

In [2]:
region_1=pd.read_csv('geo_data_0.csv')
region_2=pd.read_csv('geo_data_1.csv')
region_3=pd.read_csv('geo_data_2.csv')

In [3]:
print(region_1.info())
print(region_2.info())
print(region_3.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column  

In [4]:
print(region_1.head())
print(region_2.head())
print(region_3.head())

      id        f0        f1        f2     product
0  txEyH  0.705745 -0.497823  1.221170  105.280062
1  2acmU  1.334711 -0.340164  4.365080   73.037750
2  409Wp  1.022732  0.151990  1.419926   85.265647
3  iJLyR -0.032172  0.139033  2.978566  168.620776
4  Xdl7t  1.988431  0.155413  4.751769  154.036647
      id         f0         f1        f2     product
0  kBEdx -15.001348  -8.276000 -0.005876    3.179103
1  62mP7  14.272088  -3.475083  0.999183   26.953261
2  vyE1P   6.263187  -5.948386  5.001160  134.766305
3  KcrkZ -13.081196 -11.506057  4.999415  137.945408
4  AHL4O  12.702195  -8.147433  5.004363  134.766305
      id        f0        f1        f2     product
0  fwXo0 -1.146987  0.963328 -0.828965   27.758673
1  WJtFt  0.262778  0.269839 -2.530187   56.069697
2  ovLUW  0.194587  0.289035 -5.586433   62.871910
3  q6cA6  2.236060 -0.553760  0.930038  114.572842
4  WPMUX -0.515993  1.716266  5.899011  149.600746


In [5]:
print("Number of duplicate rows in region #1:", region_1.duplicated().sum())
print("Number of duplicate rows in region #2:", region_2.duplicated().sum())
print("Number of duplicate rows in region #3:", region_3.duplicated().sum())

Number of duplicate rows in region #1: 0
Number of duplicate rows in region #2: 0
Number of duplicate rows in region #3: 0


**Conclusion**

The source data is of good quality: there are no missing values and no duplicate rows.
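`duplicated()` only flags rows that match in every column, so a repeated `id` with different measurements would go unnoticed. Below is a minimal sketch of a slightly broader check, run on a toy frame with invented values (only the column names mirror the real data); whether the real regions contain repeated ids is not verified here.

```python
import pandas as pd

# Toy frame with invented values; column names mirror the real data.
toy = pd.DataFrame({
    "id": ["a1", "b2", "a1", "c3"],
    "f0": [0.1, 0.2, 0.3, None],
    "product": [10.0, 20.0, 30.0, 40.0],
})

missing_per_column = toy.isna().sum()          # NaN count per column
full_row_duplicates = toy.duplicated().sum()   # rows identical in every column
repeated_ids = toy["id"].duplicated().sum()    # ids that occur more than once

print(missing_per_column)
print("Full-row duplicates:", full_row_duplicates)
print("Repeated ids:", repeated_ids)
```

The same three lines can be applied to `region_1`, `region_2`, and `region_3` in place of `toy`.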

## Model training and validation

We split each region's data into training and validation sets (75% / 25%):

In [6]:
features_1_train, features_1_valid, target_1_train, target_1_valid = \
train_test_split(region_1.drop(columns=['product', 'id']), region_1['product'], test_size=0.25, random_state=RANDOM_STATE)

features_2_train, features_2_valid, target_2_train, target_2_valid = \
train_test_split(region_2.drop(columns=['product', 'id']), region_2['product'], test_size=0.25, random_state=RANDOM_STATE)

features_3_train, features_3_valid, target_3_train, target_3_valid = \
train_test_split(region_3.drop(columns=['product', 'id']), region_3['product'], test_size=0.25, random_state=RANDOM_STATE)
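The three calls above differ only in the input frame, so they could be wrapped in a helper. The sketch below is one possible way to do that; `split_region` is a hypothetical name, and it assumes a fixed integer seed instead of the shared `RANDOM_STATE` object, so it would not reproduce exactly the same split as the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_region(df, target="product", drop=("id",), test_size=0.25, seed=12345):
    """Drop the id and target columns from the features and split 75/25."""
    features = df.drop(columns=[target, *drop])
    return train_test_split(features, df[target],
                            test_size=test_size, random_state=seed)

# Toy frame with invented values, only to show the shapes that come back.
toy = pd.DataFrame({
    "id": [f"w{i}" for i in range(8)],
    "f0": range(8),
    "f1": range(8, 16),
    "product": [float(i) for i in range(8)],
})
X_train, X_valid, y_train, y_valid = split_region(toy)
print(X_train.shape, X_valid.shape)
```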

In [7]:
def learn_model(features_train, target_train, features_valid, target_valid):
    model=LinearRegression()
    model.fit(features_train, target_train)
    predictions_valid=model.predict(features_valid)
    print('Average volume of predicted raw materials in region: {:.2f} thousands barrels'.format(predictions_valid.mean()))
    print('RMSE: {:.2f}'.format(mean_squared_error(target_valid, predictions_valid) ** 0.5))
    return predictions_valid

We train the model for each region:

In [8]:
predictions_1_valid=learn_model(features_1_train, target_1_train, features_1_valid, target_1_valid)

Average volume of predicted raw materials in region: 92.59 thousands barrels
RMSE: 37.58


In [9]:
predictions_2_valid=learn_model(features_2_train, target_2_train, features_2_valid, target_2_valid)

Average volume of predicted raw materials in region: 68.77 thousands barrels
RMSE: 0.89


In [10]:
predictions_3_valid=learn_model(features_3_train, target_3_train, features_3_valid, target_3_valid)

Average volume of predicted raw materials in region: 95.09 thousands barrels
RMSE: 39.96


**Conclusion**

The smallest error (RMSE 0.89) is in region 2.

At the same time, the average predicted reserve there (68.77 thousand barrels) is about 26 to 28% lower than in the first and third regions (92.59 and 95.09 thousand barrels).
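An RMSE value is easier to judge against a baseline that always predicts the training mean. The sketch below shows that comparison on synthetic data (the coefficients and noise level are invented); it is an illustration of the idea, not a result for the three regions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: a linear signal plus noise, invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = 50 + X @ np.array([5.0, -3.0, 10.0]) + rng.normal(scale=2.0, size=1000)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

# RMSE of the fitted linear model on the validation set.
model = LinearRegression().fit(X_train, y_train)
rmse_model = mean_squared_error(y_valid, model.predict(X_valid)) ** 0.5

# RMSE of a constant baseline that always predicts the training mean.
baseline_pred = np.full_like(y_valid, y_train.mean())
rmse_baseline = mean_squared_error(y_valid, baseline_pred) ** 0.5

print("Model RMSE: {:.2f}, baseline RMSE: {:.2f}".format(rmse_model, rmse_baseline))
```

If the model RMSE is close to the baseline RMSE, the features explain little of the variation in `product`.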

In [11]:
budget_per_well=BUDGET/BEST_POINTS
min_amount=budget_per_well/BARREL
print('Sufficient volume of raw materials for break-even development of a new well: {:.2f} thousands barrels'.format(min_amount))

Sufficient volume of raw materials for break-even development of a new well: 111.11 thousands barrels
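The same arithmetic written out, with $B$ for the budget (`BUDGET`), $k$ for the number of developed wells (`BEST_POINTS`), and $r$ for the revenue per thousand barrels (`BARREL`); the symbols are notation introduced here for illustration:

$$
v_{\min} = \frac{B}{k \cdot r} = \frac{10^{10}}{200 \times 450\,000} \approx 111.11 \text{ thousand barrels}
$$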


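The profit function draws a random sample of `POINTS` fields, selects the `BEST_POINTS` fields with the highest predictions, and returns the profit and the actual production volume of that selection: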

In [12]:
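def profit(target_valid, predictions_valid):
    predictions_valid=pd.Series(predictions_valid)
    # draw POINTS fields at random (with replacement), as in one bootstrap round
    predictions_valid=predictions_valid.sample(POINTS, replace=True, random_state=RANDOM_STATE)
    # keep the BEST_POINTS fields with the highest predicted volume
    predictions_valid=predictions_valid.sort_values(ascending=False)[:BEST_POINTS]
    # sum the actual volumes of the selected fields (labels match after reset_index)
    volume=(target_valid.reset_index(drop=True)[predictions_valid.index]).sum()
    result=volume*BARREL-BUDGET
    result/=10**6  # rubles -> million rubles
    return result, volume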

In [13]:
profit_1, volume1=profit(target_1_valid, predictions_1_valid)
print('Profit from the first region: {:.2f} mln rubles'.format(profit_1))
print('Production volume: {:.2f} thousands barrels'.format(volume1))

Profit from the first region: 334.51 mln rubles
Production volume: 22965.57 thousands barrels


In [14]:
profit_2, volume2=profit(target_2_valid, predictions_2_valid)
print('Profit from the second region: {:.2f} mln rubles'.format(profit_2))
print('Production volume: {:.2f} thousands barrels'.format(volume2))

Profit from the second region: 338.61 mln rubles
Production volume: 22974.68 thousands barrels


In [15]:
profit_3, volume3=profit(target_3_valid, predictions_3_valid)
print('Profit from the third region: {:.2f} mln rubles'.format(profit_3))
print('Production volume: {:.2f} thousands barrels'.format(volume3))

Profit from the third region: 407.97 mln rubles
Production volume: 23128.83 thousands barrels


**Conclusion**

1. The break-even volume per well is 111.11 thousand barrels, which is higher than the average predicted volume in every region (92.59, 68.77, and 95.09 thousand barrels). This means that only some of the wells are profitable to develop.
2. A profit function has been created that calculates the profit from the region and the volume of extracted raw materials.
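The selection and the profit arithmetic can also be checked in isolation, without random sampling. The sketch below uses a hypothetical helper, `profit_for_selection`, and four invented wells with invented prices, so the result can be verified by hand; it is not the notebook's `profit` function.

```python
import pandas as pd

def profit_for_selection(target, predictions, n_best, revenue_per_unit, budget):
    """Pick n_best wells by prediction, sum their actual volumes, subtract budget."""
    predictions = pd.Series(predictions, index=target.index)
    best_index = predictions.sort_values(ascending=False).index[:n_best]
    volume = target.loc[best_index].sum()
    return volume * revenue_per_unit - budget, volume

# Invented numbers: four wells, two of which are developed.
target = pd.Series([100.0, 50.0, 120.0, 80.0])
predictions = [90.0, 60.0, 110.0, 115.0]

result, volume = profit_for_selection(
    target, predictions, n_best=2, revenue_per_unit=10.0, budget=1500.0)
print(result, volume)
```

The two highest predictions are 115 and 110, whose actual volumes are 80 and 120, so the selected volume is 200 even though the well with actual volume 100 was passed over in favor of the one with 80. This is how prediction error lowers the realized profit.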

## Calculation of profits and risks

In [16]:
def profit_bootstrap (target_valid, predictions_valid):
    profits=[]
    
    for i in range (BOOTSTRAP_NUMBER):
        answer, volume=profit(target_valid, predictions_valid)
        profits.append(answer)
    
    profits=pd.Series(profits)
    profit_mean=profits.mean()
    
    lower=profits.quantile(0.025)
    upper=profits.quantile(0.975)
    
    loss_chance = (profits < 0).sum() / BOOTSTRAP_NUMBER 
    
    print('Average profit (million rubles): {:.2f}'.format(profit_mean))
    print('95% confidence interval (million rubles): ({:.2f}, {:.2f})'.format(lower, upper))
    
    if loss_chance < MAX_LOSS: 
        print('The risk of loss is {}, which is less than the maximum permissible probability {}'.format(loss_chance, MAX_LOSS))
    else:
        print('The risk of loss is {}, which is more than the maximum permissible probability {}'.format(loss_chance, MAX_LOSS))    

In [17]:
profit_bootstrap(target_1_valid, predictions_1_valid)

Average profit (million rubles): 394.14
95% confidence interval (million rubles): (-69.45, 915.50)
The risk of loss is 0.061, which is more than the maximum permissible probability 0.025


In [18]:
profit_bootstrap(target_2_valid, predictions_2_valid)

Average profit (million rubles): 454.46
95% confidence interval (million rubles): (64.53, 855.13)
The risk of loss is 0.007, which is less than the maximum permissible probability 0.025


In [19]:
profit_bootstrap(target_3_valid, predictions_3_valid)

Average profit (million rubles): 353.74
95% confidence interval (million rubles): (-162.65, 847.76)
The risk of loss is 0.076, which is more than the maximum permissible probability 0.025
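For reference, the same bootstrap procedure can be sketched with NumPy alone. Everything below is an assumption made for illustration: the function name `bootstrap_profit`, the synthetic volumes, the noise level, and the price and budget values. It is not a re-run of the results above.

```python
import numpy as np

def bootstrap_profit(target, predictions, n_sample, n_best,
                     revenue_per_unit, budget, n_rounds=1000, seed=0):
    """Return mean profit, the 2.5%-97.5% quantile interval, and the loss share."""
    rng = np.random.default_rng(seed)
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    profits = np.empty(n_rounds)
    for i in range(n_rounds):
        # sample n_sample wells with replacement
        idx = rng.integers(0, len(target), size=n_sample)
        # keep the n_best sampled wells with the highest predictions
        best = idx[np.argsort(predictions[idx])[::-1][:n_best]]
        profits[i] = target[best].sum() * revenue_per_unit - budget
    lower, upper = np.quantile(profits, [0.025, 0.975])
    return profits.mean(), (lower, upper), (profits < 0).mean()

# Synthetic region: true volumes plus noisy predictions (invented values).
rng = np.random.default_rng(1)
true_volume = rng.uniform(0, 200, size=5000)
predicted = true_volume + rng.normal(scale=20, size=5000)

mean_profit, interval, loss_share = bootstrap_profit(
    true_volume, predicted, n_sample=500, n_best=200,
    revenue_per_unit=1.0, budget=20000.0)
print(mean_profit, interval, loss_share)
```

Using a local `np.random.default_rng(seed)` keeps the result reproducible regardless of how many random calls were made earlier in the notebook.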


**Conclusion**

The only suitable region for drilling wells is the second (geo_data_1).

1. The second region is the only one where the risk of loss is below the specified maximum.
2. The second region has the highest average profit.
3. The second region has the narrowest 95% confidence interval, and it is the only one that lies entirely above zero.