
# Choosing a location for a well

The oil-producing company GlavRosGosNeft needs to decide where to drill a new well.

We were given data on oil samples from three regions, with 100,000 field records in each (see the loaded datasets below), where the oil quality and the volume of reserves were measured. The goal is to build a machine learning model that helps identify the region where extraction will bring the greatest profit, and to analyze the possible profit and risks using the *Bootstrap* technique.

Steps to choose a location:

- Fields are surveyed in the selected region, and the feature values are determined for each one;
- A model is built to estimate the volume of reserves;
- The fields with the highest estimated values are selected. The number of fields depends on the company's budget and the cost of developing one well;
- The profit is the total profit of the selected fields.

## Loading and preparing data

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from numpy.random import RandomState
# gd = geo_data: one table per region.
# For a local run, use the bare file names instead, e.g. 'geo_data_0.csv'.
gd1 = pd.read_csv('/datasets/geo_data_0.csv')
gd2 = pd.read_csv('/datasets/geo_data_1.csv')
# The third region must go into gd3 (it was previously assigned to gd2 by mistake)
gd3 = pd.read_csv('/datasets/geo_data_2.csv')

In [2]:
gd1.info()
gd1.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
id         100000 non-null object
f0         100000 non-null float64
f1         100000 non-null float64
f2         100000 non-null float64
product    100000 non-null float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |
| 5 | wX4Hy | 0.96957 | 0.489775 | -0.735383 | 64.741541 |
| 6 | tL6pL | 0.645075 | 0.530656 | 1.780266 | 49.055285 |
| 7 | BYPU6 | -0.400648 | 0.808337 | -5.62467 | 72.943292 |
| 8 | j9Oui | 0.643105 | -0.551583 | 2.372141 | 113.35616 |
| 9 | OLuZU | 2.173381 | 0.563698 | 9.441852 | 127.910945 |


In [3]:
gd2.info()
gd2.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
id         100000 non-null object
f0         100000 non-null float64
f1         100000 non-null float64
f2         100000 non-null float64
product    100000 non-null float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |
| 5 | HHckp | -3.32759 | -2.205276 | 3.003647 | 84.038886 |
| 6 | h5Ujo | -11.142655 | -10.133399 | 4.002382 | 110.992147 |
| 7 | muH9x | 4.234715 | -0.001354 | 2.004588 | 53.906522 |
| 8 | YiRkx | 13.355129 | -0.332068 | 4.998647 | 134.766305 |
| 9 | jG6Gi | 1.069227 | -11.025667 | 4.997844 | 137.945408 |


In [4]:
gd3.info()
gd3.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
id         100000 non-null object
f0         100000 non-null float64
f1         100000 non-null float64
f2         100000 non-null float64
product    100000 non-null float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | q6cA6 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |
| 5 | LzZXx | -0.758092 | 0.710691 | 2.585887 | 90.222465 |
| 6 | WBHRv | -0.574891 | 0.317727 | 1.773745 | 45.641478 |
| 7 | XO8fn | -1.906649 | -2.45835 | -0.177097 | 72.48064 |
| 8 | ybmQ5 | 1.776292 | -0.279356 | 3.004156 | 106.616832 |
| 9 | OilcN | -1.214452 | -0.439314 | 5.922514 | 52.954532 |


In [5]:
gd1 = gd1.drop('id', axis = 1)
gd2 = gd2.drop('id', axis = 1)
gd3 = gd3.drop('id', axis = 1)

The preliminary inspection revealed no problems with the data: every column is fully populated and the column types are appropriate. The id column carries no information useful for this study, so it has been removed.
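A preliminary check of this kind could be automated along the following lines. This is only a minimal sketch: the helper name `basic_checks` and the toy table are hypothetical, and it assumes the check runs before the id column is dropped.

```python
import numpy as np
import pandas as pd


def basic_checks(df):
    """Return simple data-quality counts for one region's table."""
    return {
        "rows": len(df),
        # Total number of missing cells across all columns
        "missing_values": int(df.isna().sum().sum()),
        # Number of rows whose id repeats an earlier row's id
        "duplicate_ids": int(df["id"].duplicated().sum()),
    }


# Hypothetical toy table with the same columns as the region files
toy = pd.DataFrame({
    "id": ["a1", "b2", "a1", "c3"],
    "f0": [0.1, 0.2, 0.3, np.nan],
    "f1": [1.0, 1.1, 1.2, 1.3],
    "f2": [2.0, 2.1, 2.2, 2.3],
    "product": [10.0, 20.0, 30.0, 40.0],
})

report = basic_checks(toy)
print(report)
```

On the real data the same helper could be called once per region, for example `basic_checks(gd1)`, before the id column is removed.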

In [6]:
gd1.info()
gd1.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
f0         100000 non-null float64
f1         100000 non-null float64
f2         100000 non-null float64
product    100000 non-null float64
dtypes: float64(4)
memory usage: 3.1 MB


| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| 0 | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | 1.988431 | 0.155413 | 4.751769 | 154.036647 |
| 5 | 0.96957 | 0.489775 | -0.735383 | 64.741541 |
| 6 | 0.645075 | 0.530656 | 1.780266 | 49.055285 |
| 7 | -0.400648 | 0.808337 | -5.62467 | 72.943292 |
| 8 | 0.643105 | -0.551583 | 2.372141 | 113.35616 |
| 9 | 2.173381 | 0.563698 | 9.441852 | 127.910945 |


In [7]:
gd2.info()
gd2.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
f0         100000 non-null float64
f1         100000 non-null float64
f2         100000 non-null float64
product    100000 non-null float64
dtypes: float64(4)
memory usage: 3.1 MB


| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| 0 | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | 12.702195 | -8.147433 | 5.004363 | 134.766305 |
| 5 | -3.32759 | -2.205276 | 3.003647 | 84.038886 |
| 6 | -11.142655 | -10.133399 | 4.002382 | 110.992147 |
| 7 | 4.234715 | -0.001354 | 2.004588 | 53.906522 |
| 8 | 13.355129 | -0.332068 | 4.998647 | 134.766305 |
| 9 | 1.069227 | -11.025667 | 4.997844 | 137.945408 |


In [8]:
gd3.info()
gd3.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
f0         100000 non-null float64
f1         100000 non-null float64
f2         100000 non-null float64
product    100000 non-null float64
dtypes: float64(4)
memory usage: 3.1 MB


| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| 0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | -0.515993 | 1.716266 | 5.899011 | 149.600746 |
| 5 | -0.758092 | 0.710691 | 2.585887 | 90.222465 |
| 6 | -0.574891 | 0.317727 | 1.773745 | 45.641478 |
| 7 | -1.906649 | -2.45835 | -0.177097 | 72.48064 |
| 8 | 1.776292 | -0.279356 | 3.004156 | 106.616832 |
| 9 | -1.214452 | -0.439314 | 5.922514 | 52.954532 |


Next, we move on to training the models.

## Training and validating the model

In [9]:
features1 = gd1.drop(['product'], axis=1)
target1 = gd1['product']

In [10]:
features2 = gd2.drop(['product'], axis=1)
target2 = gd2['product']

In [11]:
features3 = gd3.drop(['product'], axis=1)
target3 = gd3['product']

In [12]:
features_train1, features_valid1, target_train1, target_valid1 = train_test_split(features1, 
                                                                              target1, 
                                                                              test_size=0.25, 
                                                                              random_state=12345)

In [13]:
features_train2, features_valid2, target_train2, target_valid2 = train_test_split(features2, 
                                                                              target2, 
                                                                              test_size=0.25, 
                                                                              random_state=12345)

In [14]:
features_train3, features_valid3, target_train3, target_valid3 = train_test_split(features3, 
                                                                              target3, 
                                                                              test_size=0.25, 
                                                                              random_state=12345)

The data for each of the three regions were split into features and a target. Given the task, the target is the volume of reserves in the well, stored in the 'product' column. The train_test_split function was used to divide the data into training and validation sets in a 3:1 ratio (75% and 25%).
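The 75%/25% split described above can be checked on a small synthetic table. The sketch below is illustrative only: the toy data and the variable names are assumptions, not part of the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical synthetic table with the same column layout as the region data
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(1000, 4)),
                   columns=["f0", "f1", "f2", "product"])

features = toy.drop(columns=["product"])
target = toy["product"]

# test_size=0.25 sends a quarter of the rows to the validation set
f_train, f_valid, t_train, t_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

print(f_train.shape, f_valid.shape)
```

Features and target are split with the same row assignment, so each validation row keeps its matching target value under the same index label.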

In [15]:
model = LinearRegression()

In [16]:
# Fit a separate estimator per region; model.fit() returns the same shared object
model1 = LinearRegression().fit(features_train1, target_train1)
predictions_valid1 = pd.Series(model1.predict(features_valid1), index=target_valid1.index)

In [17]:
# Fit a separate estimator per region; model.fit() returns the same shared object
model2 = LinearRegression().fit(features_train2, target_train2)
predictions_valid2 = pd.Series(model2.predict(features_valid2), index=target_valid2.index)

In [18]:
# Fit a separate estimator per region; model.fit() returns the same shared object
model3 = LinearRegression().fit(features_train3, target_train3)
predictions_valid3 = pd.Series(model3.predict(features_valid3), index=target_valid3.index)

According to the terms of our assignment, linear regression was used to train the models.

In [19]:
result1 = mean_squared_error(target_valid1, predictions_valid1) ** 0.5
print("RMSE of the linear regression model for the first region on the validation set:", result1)

RMSE of the linear regression model for the first region on the validation set: 37.5794217150813


In [20]:
result2 = mean_squared_error(target_valid2, predictions_valid2) ** 0.5
print("RMSE of the linear regression model for the second region on the validation set:", result2)

RMSE of the linear regression model for the second region on the validation set: 0.893099286775616


In [21]:
result3 = mean_squared_error(target_valid3, predictions_valid3) ** 0.5
print("RMSE of the linear regression model for the third region on the validation set:", result3)

RMSE of the linear regression model for the third region on the validation set: 40.02970873393434

In [22]:
pmean1 = predictions_valid1.mean()
print("Mean predicted reserves in the first region:", pmean1)

Mean predicted reserves in the first region: 92.59256778438038


In [23]:
pmean2 = predictions_valid2.mean()
print("Mean predicted reserves in the second region:", pmean2)

Mean predicted reserves in the second region: 68.728546895446


In [24]:
pmean3 = predictions_valid3.mean()
print("Mean predicted reserves in the third region:", pmean3)

Mean predicted reserves in the third region: 94.96504596800489


In [25]:
results = pd.DataFrame({'region' : ['geo_0', 'geo_1', 'geo_2'],
        'mean predicted reserves' : [pmean1, pmean2, pmean3],
        'RMSE' : [result1, result2, result3]})
results

| | region | mean predicted reserves | RMSE |
|---|---|---|---|
| 0 | geo_0 | 92.592568 | 37.579422 |
| 1 | geo_1 | 68.728547 | 0.893099 |
| 2 | geo_2 | 94.965046 | 40.029709 |


Comparing the mean predicted reserves and the RMSE of the model for each region shows that the second region (geo_1) stands out. Its RMSE of about 0.89 is close to zero, so the model predicts reserves there almost exactly. At the same time, this region has the lowest mean predicted reserves per well (about 68.7), which is a potential drawback for the business. The first (geo_0) and third (geo_2) regions are similar to each other: both have high mean predicted reserves (about 92.6 and 95.0) but also a high RMSE (about 37.6 and 40.0).
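RMSE is expressed in the same units as the target, which is why it can be compared directly with the mean reserves. A minimal sketch of the computation on toy values (the function name `rmse` and the numbers are illustrative assumptions):

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values: the errors are 10, -10 and 0
example = rmse([100.0, 50.0, 80.0], [90.0, 60.0, 80.0])
print(example)
```

This matches what the notebook obtains by raising `mean_squared_error(...)` to the power 0.5.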

## Preparing for profit calculation

In [26]:
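budget = 10000000000        # total development budget, rubles
well_number_revenue = 200   # number of wells used in the profit calculation
barell_revenue = 450000     # revenue per unit of product, rubles

# Break-even volume per well: budget per well divided by revenue per unit of product
sufficient_volume = budget / well_number_revenue / barell_revenue
print("Sufficient volume of raw material for break-even development of a new well:", sufficient_volume)

Sufficient volume of raw material for break-even development of a new well: 111.11111111111111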


In [27]:
mean1 = gd1['product'].mean()
print("Mean actual reserves in the first region:", mean1)

Mean actual reserves in the first region: 92.50000000000001


In [28]:
mean2 = gd2['product'].mean()
print("Mean actual reserves in the second region:", mean2)

Mean actual reserves in the second region: 68.82500000000002


In [29]:
mean3 = gd3['product'].mean()
print("Mean actual reserves in the third region:", mean3)

Mean actual reserves in the third region: 95.00000000000004


In [30]:
results['mean actual reserves'] = [mean1, mean2, mean3]
results

| | region | mean predicted reserves | RMSE | mean actual reserves |
|---|---|---|---|---|
| 0 | geo_0 | 92.592568 | 37.579422 | 92.5 |
| 1 | geo_1 | 68.728547 | 0.893099 | 68.825 |
| 2 | geo_2 | 94.965046 | 40.029709 | 95.0 |


The break-even volume for a new well was derived from the budget (10 billion rubles), the number of wells used in the profit calculation (200) and the revenue per unit of product (450 thousand rubles). The resulting value of about 111.11 units exceeds both the mean predicted reserves and the mean actual reserves in each of the three regions, so an average well in any region would not pay for its development. Profitable development therefore depends on selecting the best wells rather than average ones.
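The arithmetic behind the break-even volume can be laid out step by step. The sketch below restates the figures from the text; the variable names are illustrative, and the regional means are the rounded values from the table above.

```python
budget = 10_000_000_000      # rubles, total budget
wells = 200                  # wells developed in a region
revenue_per_unit = 450_000   # rubles per unit of product

# Cost attributed to one well
cost_per_well = budget / wells

# Volume at which one well's revenue equals its cost
break_even_volume = cost_per_well / revenue_per_unit

# Rounded mean actual reserves per region, taken from the results table
region_means = {"geo_0": 92.5, "geo_1": 68.825, "geo_2": 95.0}

# How far an average well falls short of the break-even volume
shortfall = {region: break_even_volume - mean
             for region, mean in region_means.items()}

print(cost_per_well, round(break_even_volume, 2), shortfall)
```

Every shortfall is positive, which is the numerical form of the statement that an average well does not reach break-even in any region.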

## Profit and risk calculation

In [31]:
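def revenue_formula(predictions, target, count):
    """Profit from the `count` wells with the highest predicted reserves."""
    # Rank wells by predicted reserves, largest first
    predictions_sort = predictions.sort_values(ascending = False)
    # Actual reserves of the top `count` wells (label-based lookup)
    target_top = target[predictions_sort.index][:count]
    # Revenue from the actual reserves minus the development budget
    revenue = target_top.sum() * barell_revenue - budget
    return revenue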

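A function was written to calculate profit according to the conditions of the task: it selects the wells with the highest predicted reserves, sums their actual reserves, converts the total into revenue and subtracts the budget.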
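One caveat is worth keeping in mind with label-based selection: a sample drawn with replacement contains repeated index labels, and looking up a repeated label in a Series returns every row that carries it. The sketch below shows the effect on toy data and a position-based alternative. It is an illustration under stated assumptions, not the notebook's method: `profit_top_wells`, its default parameters and the toy series are hypothetical.

```python
import numpy as np
import pandas as pd


def profit_top_wells(predictions, target, count,
                     revenue_per_unit=450_000, budget=10_000_000_000):
    """Profit from the `count` wells with the highest predictions.

    Assumes `predictions` and `target` are aligned by position.
    """
    pred = np.asarray(predictions, dtype=float)
    true = np.asarray(target, dtype=float)
    # Positions of the largest predictions, largest first
    top = np.argsort(-pred, kind="stable")[:count]
    return float(true[top].sum() * revenue_per_unit - budget)


# Toy sample in which the label "w3" occurs twice, as after resampling
labels = ["w7", "w3", "w3", "w9"]
target = pd.Series([10.0, 20.0, 20.0, 5.0], index=labels)
predictions = pd.Series([11.0, 19.0, 19.0, 6.0], index=labels)

# Label-based lookup returns extra rows for the repeated label
ranked_labels = predictions.sort_values(ascending=False).index
label_based = target.loc[ranked_labels]
print(len(target), len(label_based))

# Position-based selection uses each sampled well exactly once
toy_profit = profit_top_wells(predictions, target, count=2,
                              revenue_per_unit=1, budget=0)
print(toy_profit)
```

Working by position keeps the selected set at exactly `count` sampled wells regardless of how often a label repeats.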

Let's move on to calculating profits and risks for the three regions.

In [32]:
state = np.random.RandomState(12345)
values1 = []
for i in range(1000):
    sample_target1 = target_valid1.sample(500, replace = True, random_state = state)
    sample_predictions1 = predictions_valid1[sample_target1.index]
    revenues = revenue_formula(sample_predictions1, sample_target1, 200)
    values1.append(revenues)
values1 = pd.Series(values1)
vmean1 = values1.mean()
lower1 = values1.quantile(0.025)
upper1 = values1.quantile(0.975)
risk1 = (values1<0).mean() * 100
print("Mean profit for the first region:", vmean1)
print("Lower bound of the 95% confidence interval for the first region:", lower1)
print("Upper bound of the 95% confidence interval for the first region:", upper1)
print("Risk of loss for the first region, %:", risk1)

Mean profit for the first region: 425938526.9105923
Lower bound of the 95% confidence interval for the first region: -102090094.83793654
Upper bound of the 95% confidence interval for the first region: 947976353.358369
Risk of loss for the first region, %: 6.0


In [33]:
values2 = []
for i in range(1000):
    sample_target2 = target_valid2.sample(500, replace = True, random_state = state)
    sample_predictions2 = predictions_valid2[sample_target2.index]
    revenues = revenue_formula(sample_predictions2, sample_target2, 200)
    values2.append(revenues)
values2 = pd.Series(values2)
vmean2 = values2.mean()
lower2 = values2.quantile(0.025)
upper2 = values2.quantile(0.975)
risk2 = (values2<0).mean() * 100
print("Mean profit for the second region:", vmean2)
print("Lower bound of the 95% confidence interval for the second region:", lower2)
print("Upper bound of the 95% confidence interval for the second region:", upper2)
print("Risk of loss for the second region, %:", risk2)

Mean profit for the second region: 518259493.69732493
Lower bound of the 95% confidence interval for the second region: 128123231.43308629
Upper bound of the 95% confidence interval for the second region: 953612982.0669085
Risk of loss for the second region, %: 0.3


In [34]:
values3 = []
for i in range(1000):
    sample_target3 = target_valid3.sample(500, replace = True, random_state = state)
    sample_predictions3 = predictions_valid3[sample_target3.index]
    revenues = revenue_formula(sample_predictions3, sample_target3, 200)
    values3.append(revenues)
values3 = pd.Series(values3)
vmean3 = values3.mean()
lower3 = values3.quantile(0.025)
upper3 = values3.quantile(0.975)
risk3 = (values3<0).mean() * 100
print("Mean profit for the third region:", vmean3)
print("Lower bound of the 95% confidence interval for the third region:", lower3)
print("Upper bound of the 95% confidence interval for the third region:", upper3)
print("Risk of loss for the third region, %:", risk3)

Mean profit for the third region: 420194005.34405005
Lower bound of the 95% confidence interval for the third region: -115852609.16001143
Upper bound of the 95% confidence interval for the third region: 989629939.844574
Risk of loss for the third region, %: 6.2


In [35]:
results['mean profit'] = [vmean1, vmean2, vmean3]
results['risk of loss, %'] = [risk1, risk2, risk3]
results

| | region | mean predicted reserves | RMSE | mean actual reserves | mean profit | risk of loss, % |
|---|---|---|---|---|---|---|
| 0 | geo_0 | 92.592568 | 37.579422 | 92.5 | 425938500.0 | 6.0 |
| 1 | geo_1 | 68.728547 | 0.893099 | 68.825 | 518259500.0 | 0.3 |
| 2 | geo_2 | 94.965046 | 40.029709 | 95.0 | 420194000.0 | 6.2 |
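The three bootstrap cells above repeat the same procedure, so it could be collected into one function. The following is a sketch under assumptions, not a drop-in replacement: `bootstrap_profit` and its parameter names are hypothetical, it uses NumPy's `default_rng` rather than the notebook's `RandomState`, and it selects wells by position, so it would not reproduce the exact figures reported here. The demonstration data are synthetic.

```python
import numpy as np


def bootstrap_profit(target, predictions, n_boot=1000, sample_size=500,
                     count=200, revenue_per_unit=450_000,
                     budget=10_000_000_000, seed=12345):
    """Bootstrap the profit of the top `count` wells by prediction."""
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    rng = np.random.default_rng(seed)
    profits = np.empty(n_boot)
    for i in range(n_boot):
        # Draw well positions with replacement
        idx = rng.integers(0, len(target), size=sample_size)
        pred_s, true_s = predictions[idx], target[idx]
        # Keep the wells with the highest predicted reserves
        top = np.argsort(-pred_s)[:count]
        profits[i] = true_s[top].sum() * revenue_per_unit - budget
    return {
        "mean": float(profits.mean()),
        "ci_lower": float(np.quantile(profits, 0.025)),
        "ci_upper": float(np.quantile(profits, 0.975)),
        # Share of bootstrap samples that end in a loss, in percent
        "risk_pct": float((profits < 0).mean() * 100),
    }


# Synthetic demonstration data, not the notebook's regions
demo_rng = np.random.default_rng(0)
demo_target = demo_rng.normal(90.0, 40.0, size=5000)
demo_predictions = demo_target + demo_rng.normal(0.0, 30.0, size=5000)

summary = bootstrap_profit(demo_target, demo_predictions, n_boot=200)
print(summary)
```

With the notebook's objects, a call might look like `bootstrap_profit(target_valid1, predictions_valid1)`, repeated for each region.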


As with the earlier comparison of RMSE and mean predicted reserves, the profit and risk calculations single out the second region (geo_1): it has the highest mean profit and the lowest risk of loss among the three regions. The first and third regions have similar mean profit and risk, and both exceed the established risk threshold of 2.5%. The second region's risk of 0.3% is well below that threshold.

We recommend the second region (geo_data_1) for further drilling.