# Choosing a location for a well

Study the data on oil samples from three regions. Each dataset covers 100,000 fields for which the oil quality and the well capacity were measured. Build a machine learning model that helps determine the region where production will bring the most profit, then analyze the possible profit and risks.

Steps to choose a location:

- In the selected region, oil fields are surveyed and the feature values are determined for each one;
- A model is built and used to estimate the capacity of each field;
- The oil fields with the highest capacity estimates are selected. The number of fields depends on the company's budget and the cost of developing one well;
- The revenue is equal to the total revenue of the selected oil fields.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-loading-and-preparation" data-toc-modified-id="Data-loading-and-preparation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data loading and preparation</a></span></li><li><span><a href="#Model-training-and-check" data-toc-modified-id="Model-training-and-check-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Model training and check</a></span><ul class="toc-item"><li><span><a href="#Results" data-toc-modified-id="Results-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Results</a></span></li></ul></li><li><span><a href="#Preparation-for-the-revenue-calculation" data-toc-modified-id="Preparation-for-the-revenue-calculation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Preparation for the revenue calculation</a></span><ul class="toc-item"><li><span><a href="#Results" data-toc-modified-id="Results-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Results</a></span></li></ul></li><li><span><a href="#Revenue-and-risks-calculation" data-toc-modified-id="Revenue-and-risks-calculation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Revenue and risks calculation</a></span><ul class="toc-item"><li><span><a href="#Results" data-toc-modified-id="Results-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Results</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

## Data loading and preparation

In [21]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [22]:
df0 = pd.read_csv('datasets/geo_data_0.csv')
df1 = pd.read_csv('datasets/geo_data_1.csv')
df2 = pd.read_csv('datasets/geo_data_2.csv')

In [23]:
display(df0.head())
df0.info()
print()
display(df1.head())
df1.info()
print()
display(df2.head())
df2.info()

|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB



|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB



|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | q6cA6 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [24]:
print("Number of duplicates in each dataset, respectively:",
      df0[['f0', 'f1', 'f2']].duplicated().sum(),
      df1[['f0', 'f1', 'f2']].duplicated().sum(),
      df2[['f0', 'f1', 'f2']].duplicated().sum())

Number of duplicates in each dataset, respectively: 0 0 0


In [25]:
# drop the 'id' column: it is not used as a feature in the analysis
df0.drop('id', axis=1, inplace=True)
df1.drop('id', axis=1, inplace=True)
df2.drop('id', axis=1, inplace=True)

## Model training and check

In [26]:
# automate model validation: returns a dictionary with the RMSE value
# and arrays of predictions and true values for the validation set
def check(df):
    # split the sample into training (75%) and validation (25%) sets
    X_train, X_valid, y_train, y_valid = train_test_split(
        df.drop(['product'], axis=1), df['product'],
        test_size=.25, random_state=25)
    model = LinearRegression() 
    model.fit(X_train, y_train) 
    predictions = model.predict(X_valid)
    
    result = {}
    result['RMSE'] = mean_squared_error(y_valid, predictions) ** 0.5
    result['predictions'] = predictions
    result['true'] = y_valid.values
    
    return result

In [27]:
results = [check(d) for d in [df0, df1, df2]]

for i in range(3):
    print(f"RMSE in the region {i}: {results[i]['RMSE']:.2f}")

print("\nAverage predicted capacity")
for i in range(3):
    mean = np.mean(results[i]['predictions'])
    print(f"in the region {i}: {mean:.2f}")

RMSE in the region 0: 37.65
RMSE in the region 1: 0.89
RMSE in the region 2: 40.08

Average predicted capacity
in the region 0: 92.65
in the region 1: 69.27
in the region 2: 94.90


### Results
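The model for region 1 has by far the lowest RMSE (0.89 against 37.65 for region 0 and 40.08 for region 2), so its predictions are the most accurate. Regions 0 and 2 have a higher average predicted capacity (92.65 and 94.90 against 69.27), but their predictions carry a much larger error.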
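
One way to put these RMSE values in context is to compare each model with a constant baseline that always predicts the training mean of `product`. The sketch below runs on synthetic data: the arrays, the coefficients, and the helper name `rmse_vs_baseline` are assumptions made for illustration and are not part of the notebook. With the real data, the same comparison could be run on each region's split.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse_vs_baseline(X, y, random_state=25):
    """Return (model RMSE, baseline RMSE) on a 25% validation split."""
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.25, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    # Baseline: always predict the mean of the training target.
    baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
    model_rmse = mean_squared_error(y_valid, model.predict(X_valid)) ** 0.5
    baseline_rmse = mean_squared_error(y_valid, baseline.predict(X_valid)) ** 0.5
    return model_rmse, baseline_rmse


# Synthetic stand-in for one region: three features and a noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = 90 + X_demo @ np.array([5.0, -3.0, 20.0]) + rng.normal(scale=10, size=1000)

model_rmse, baseline_rmse = rmse_vs_baseline(X_demo, y_demo)
print(f"model RMSE: {model_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}")
```

A model whose RMSE is close to the baseline adds little beyond the regional average, while a large gap suggests the features carry real signal.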

## Preparation for the revenue calculation

In [28]:
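product_price = 450e3   # income per unit of product (one thousand barrels)
investments = 10e9      # total budget for developing the wells in a region

# the budget is shared among the 200 wells selected for development
budget_per_well = investments / 200
# capacity at which a well pays back its share of the budget
required_product = budget_per_well / product_price

print("Required capacity of a well (thousand barrels):", round(required_product, 2))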

Required capacity of a well (thousand barrels): 111.11


In [29]:
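def revenue(predictions, target):
    # rank wells by predicted capacity, highest first
    predictions_sorted = predictions.sort_values(ascending=False)
    # take the actual capacity of the 200 top-ranked wells
    selected = target[predictions_sorted.index][:200]
    # income from the selected wells minus the development budget
    return selected.sum() * product_price - investments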

### Results
The required capacity of a well (111.11 thousand barrels) exceeds the average predicted capacity in every region. For regions 0 and 2 it is about 20% and 17% above the average, while for region 1 it is about 60% above.
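
The comparison above can be reproduced from the budget figures and the average predicted capacities printed earlier. This is a small sketch; the names `n_wells`, `avg_predicted`, and `gaps` are introduced here for illustration only.

```python
# Budget figures from the notebook; averages copied from the model output above.
product_price = 450e3   # income per unit of product (one thousand barrels)
investments = 10e9      # total development budget
n_wells = 200           # hypothetical name for the literal 200 used in the notebook

# Capacity at which one well pays back its share of the budget.
required_product = investments / n_wells / product_price

avg_predicted = {0: 92.65, 1: 69.27, 2: 94.90}
gaps = {}
for region, avg in avg_predicted.items():
    # How far the break-even capacity lies above the regional average, in percent.
    gaps[region] = (required_product / avg - 1) * 100
    print(f"Region {region}: break-even is {gaps[region]:.1f}% above the average prediction")
```

Since no region reaches the break-even capacity on average, profit depends on picking the best wells rather than developing wells at random.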

## Revenue and risks calculation

In [30]:
state = np.random.RandomState(25)

for n in range(3):  # loop over each region
    data = pd.DataFrame.from_dict(data=results[n], orient='columns').drop('RMSE', axis=1)

    # bootstrap with 1000 resamples
    values = []
    for i in range(1000): 
        target_subsample = data['true'].sample(n=500, replace=True, random_state=state)
        predictions_subsample = data['predictions'][target_subsample.index]
        result = revenue(predictions_subsample, target_subsample)
        values.append(result)

    values = pd.Series(values) / 1e9
    percentage_of_negative = values[values<0].count() / len(values) * 100

    mean = values.mean()
    lower = values.quantile(0.025)
    upper = values.quantile(0.975)
    
    print(f"Region {n}")
    print("Average revenue:", round(mean, 2))
    print("Confidence interval:", (round(lower, 2), round(upper, 2)))
    print(f"Risk of loss: {percentage_of_negative:.2f} %\n")

Region 0
Average revenue: 0.4
Confidence interval: (-0.1, 0.9)
Risk of loss: 5.90 %

Region 1
Average revenue: 0.55
Confidence interval: (0.14, 0.99)
Risk of loss: 0.10 %

Region 2
Average revenue: 0.37
Confidence interval: (-0.17, 0.91)
Risk of loss: 8.90 %



### Results
Region 1 offers the highest average revenue (0.55 billion rubles) and the lowest risk of loss (0.10%). For regions 0 and 2 the average revenue is 0.40 and 0.37 billion rubles, and the risk of loss is 5.9% and 8.9%, respectively.
The optimal region for oil production is region 1.
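
Because the bootstrap draws wells with replacement, the same index label can appear several times in a subsample, and a label-based lookup on a pandas Series with repeated labels can return more rows than were requested. A minimal sketch of a position-based variant that sidesteps this is given below. The function names `profit_top_n` and `bootstrap_profit`, the synthetic arrays, and the noise levels are assumptions for illustration, so the printed figures are not comparable to the results above.

```python
import numpy as np

PRODUCT_PRICE = 450e3   # income per unit of product, as in the notebook
INVESTMENTS = 10e9      # total development budget, as in the notebook


def profit_top_n(predictions, target, n_wells=200):
    """Profit from the n_wells wells with the highest predictions (positional)."""
    top = np.argsort(predictions)[::-1][:n_wells]
    return target[top].sum() * PRODUCT_PRICE - INVESTMENTS


def bootstrap_profit(predictions, target, n_iter=1000, sample_size=500, seed=25):
    """Return mean profit, 95% interval bounds, and risk of loss in percent."""
    rng = np.random.default_rng(seed)
    profits = np.empty(n_iter)
    for i in range(n_iter):
        # Draw row positions with replacement; predictions and targets stay paired.
        idx = rng.integers(0, len(target), size=sample_size)
        profits[i] = profit_top_n(predictions[idx], target[idx])
    lower, upper = np.quantile(profits, [0.025, 0.975])
    risk = (profits < 0).mean() * 100
    return profits.mean(), lower, upper, risk


# Synthetic stand-in for one region's validation set (not the real data).
rng = np.random.default_rng(0)
true_demo = rng.normal(loc=95, scale=45, size=25_000).clip(min=0)
pred_demo = true_demo + rng.normal(scale=40, size=25_000)

mean, lower, upper, risk = bootstrap_profit(pred_demo, true_demo)
print(f"mean: {mean / 1e9:.2f}, 95% interval: ({lower / 1e9:.2f}, {upper / 1e9:.2f}), "
      f"risk: {risk:.2f} %")
```

Working with positions instead of labels guarantees that each resample contributes exactly `sample_size` wells and that exactly `n_wells` of them are selected.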

## Conclusion
Datasets with well characteristics from three regions were analyzed. Each dataset was split into training and validation sets in a 3:1 ratio, and linear regression was used as the model.

The revenue and the risk of loss were calculated for each region: the distribution of revenue was estimated with the bootstrap technique (1000 resamples of 500 wells), and a 95% confidence interval was taken from it.

The optimal region for well development is region 1, with an average revenue of 0.55 billion rubles (95% confidence interval 0.14 to 0.99) and a risk of loss of 0.10%.