# Determining the most profitable oil production region

**Task description**
Suppose you work for the oil production company "MainRosGosOil." Your task is to decide where to drill a new oil well.

You are provided with oil samples from three regions, with data on 100,000 oil fields in each region, including measurements of oil quality and the volume of reserves. Your goal is to build a machine learning model that will help determine the region where oil production will yield the highest profit. You will also analyze the potential profit and risks using the bootstrap technique.

**Steps for selecting a location:**

1. Search for oil fields in the selected region and determine the feature values for each of them.
2. Build a model and estimate the volume of reserves.
3. Select the oil fields with the highest estimated values. The number of selected fields depends on the company's budget and the cost of developing one oil well.
4. Calculate the profit as the sum of the profits from the selected oil fields.

**Data description:**

- id: Unique identifier of the oil well.
- f0, f1, f2: Three features of the points.
- product: Volume of oil reserves in the well (thousand barrels).

**Task conditions:**

- Only linear regression is suitable for training the model (other models are not sufficiently predictable).
- During exploration of each region, 500 points will be investigated, from which 200 best points will be selected for development using machine learning.
- The budget for well development in each region is 10 billion rubles.
- At current oil prices, one barrel of crude oil generates a revenue of 450 rubles. The revenue from each unit of product is 450,000 rubles, as the volume is indicated in thousand barrels.
- After assessing the risks, only regions with a probability of losses less than 2.5% will be considered. Among them, the region with the highest average profit will be chosen.

## Data preparation

In [1]:
pip install ydata-profiling


In [2]:
pip install numerize



In [3]:
import pandas as pd
import ydata_profiling as pf
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from numpy.random import RandomState
from numerize import numerize

In [4]:
RANDOM_STATE = RandomState(12345)

In [5]:
data1 = pd.read_csv("https://code.s3.yandex.net/datasets/geo_data_0.csv")
data2 = pd.read_csv("https://code.s3.yandex.net/datasets/geo_data_1.csv")
data3 = pd.read_csv("https://code.s3.yandex.net/datasets/geo_data_2.csv")

In [6]:
pf.ProfileReport(data1)


In [7]:
pf.ProfileReport(data2)


In [8]:
pf.ProfileReport(data3)


In [9]:
# removing the column named "id" for all the dataframes, as it's of no use for the research
for df in [data1, data2, data3]:
    df.drop("id", axis=1, inplace=True)

The profiling reports show that the data does not require preprocessing. One notable feature is a very strong correlation between "product" and "f2" in the second dataset; in data1 and data3 the correlation between "product" and "f2" is only moderate.
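
The correlations mentioned above can also be checked directly, without the profiling report. The following is a minimal sketch on a small synthetic frame that stands in for one region's data (the values, the coefficient 27, and the noise level are made up for illustration). `DataFrame.corr()` computes the Pearson correlation by default, which may differ from the measures shown in the profiling report.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one region's data (illustrative values only)
rng = np.random.default_rng(0)
f2 = rng.normal(size=1000)
demo = pd.DataFrame({
    "f0": rng.normal(size=1000),
    "f1": rng.normal(size=1000),
    "f2": f2,
    # product is almost a linear function of f2, so the correlation is very strong
    "product": 27 * f2 + rng.normal(scale=0.5, size=1000),
})

# Pearson correlation of every feature with the target
corr_with_target = demo.corr()["product"].drop("product")
print(corr_with_target.round(2))
```

On the real data, the same call would be applied to `data1`, `data2`, and `data3` after the "id" column is dropped.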

## Model training and testing

In [10]:
# let's create a function that splits a dataframe in 75:25 ratio
def split(data):
    target = data['product']
    features = data.drop('product', axis=1)

    features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=RANDOM_STATE)

    print(f'features_train shape: {features_train.shape}', f'features_valid shape: {features_valid.shape}', sep='\n')
    return features_train, features_valid, target_train, target_valid

In [11]:
# applying data scaling
def scale(features_train, features_valid):
    scaler = StandardScaler()
    scaler.fit(features_train)

    features_train = scaler.transform(features_train)
    features_valid = scaler.transform(features_valid)

    return features_train, features_valid

In [12]:
# training the model and making predictions on validation sample
def train(features_train, target_train, features_valid, target_valid):
    model = LinearRegression()
    model.fit(features_train, target_train)
    predicted_valid = pd.Series(model.predict(features_valid), index=target_valid.index)

    return predicted_valid

In [13]:
# a function that prints the average predicted oil reserves and the RMSE of the model
def new_reserves_prediction(target_valid, predicted_valid):
    print('Mean predicted oil reserves in the region: {:.2f}'.format(predicted_valid.mean()))
    print('RMSE: {:.2f}'.format(mean_squared_error(target_valid, predicted_valid) ** 0.5))

In [14]:
# data1
features_train_1, features_valid_1, target_train_1, target_valid_1 = split(data1)
features_train_1, features_valid_1 = scale(features_train_1, features_valid_1)
predicted_valid_1 = train(features_train_1, target_train_1, features_valid_1, target_valid_1)
new_reserves_prediction(target_valid_1, predicted_valid_1)

features_train shape: (75000, 3)
features_valid shape: (25000, 3)
Mean predicted oil reserves in the region: 92.59
RMSE: 37.58


In [15]:
# data2
features_train_2, features_valid_2, target_train_2, target_valid_2 = split(data2)
features_train_2, features_valid_2 = scale(features_train_2, features_valid_2)
predicted_valid_2 = train(features_train_2, target_train_2, features_valid_2, target_valid_2)
new_reserves_prediction(target_valid_2, predicted_valid_2)

features_train shape: (75000, 3)
features_valid shape: (25000, 3)
Mean predicted oil reserves in the region: 68.77
RMSE: 0.89


In [16]:
# data3
features_train_3, features_valid_3, target_train_3, target_valid_3 = split(data3)
features_train_3, features_valid_3 = scale(features_train_3, features_valid_3)
predicted_valid_3 = train(features_train_3, target_train_3, features_valid_3, target_valid_3)
new_reserves_prediction(target_valid_3, predicted_valid_3)

features_train shape: (75000, 3)
features_valid shape: (25000, 3)
Mean predicted oil reserves in the region: 95.09
RMSE: 39.96


The model achieves its best (lowest) RMSE on the second dataset: 0.89, compared with 37.58 and 39.96 on the first and third.
According to the model's predictions, the highest mean predicted reserves are in data3 (95.09), with data1 close behind (92.59); data2 has the lowest (68.77).
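
To put these RMSE values in context, one option is to compare the model with a constant baseline that always predicts the training mean. The sketch below uses synthetic data; the arrays `X` and `y` and all numeric values are assumptions for illustration, not the notebook's datasets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends linearly on the third feature plus noise
rng = np.random.default_rng(12345)
X = rng.normal(size=(2000, 3))
y = 90 + 10 * X[:, 2] + rng.normal(scale=5, size=2000)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# RMSE of the linear model on the validation part
model = LinearRegression().fit(X_train, y_train)
rmse_model = mean_squared_error(y_valid, model.predict(X_valid)) ** 0.5

# RMSE of a constant prediction equal to the training mean
baseline_pred = np.full_like(y_valid, y_train.mean())
rmse_baseline = mean_squared_error(y_valid, baseline_pred) ** 0.5

print(f"model RMSE: {rmse_model:.2f}, baseline RMSE: {rmse_baseline:.2f}")
```

If the model's RMSE is close to the baseline's, the features explain little of the variation in the target, and a ranking of wells based on such predictions is correspondingly less reliable.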

## Preparation for profit calculation

In [17]:
# Defining all key values for calculations

# Number of points to be explored during exploration
RESEARCH_POINTS = 500
# Number of top points to be selected
POINTS_SHORTLIST = 200
# Budget for well development in the region
BUDGET = 10**10
# Revenue per unit of product (1,000 barrels), in rubles
ITEM_PRICE = 450000
# Maximum loss probability
MAX_LOSS = 0.025
# Number of samples for bootstrap
BOOTSTRAP_SAMPLES = 1000
# Confidence interval
CI = 0.95

In [18]:
# calculating the volume of reserves per well needed to break even on development
min_necessary_reserves = BUDGET/ITEM_PRICE/POINTS_SHORTLIST

print(f'Sufficient volume of raw materials for break-even development of a new well: {round(min_necessary_reserves,2)} thousand barrels')

Sufficient volume of raw materials for break-even development of a new well: 111.11 thousand barrels


In [19]:
# average reserves in each region
print(*[d['product'].mean() for d in [data1, data2, data3]], sep='\n')

92.50000000000001
68.82500000000002
95.00000000000004


- The volume of reserves a new well needs to break even is 111.11 thousand barrels.
- In all three regions, the average reserves per well are below this value (92.5, 68.8 and 95.0 thousand barrels).
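
The break-even figure follows directly from the task conditions: the budget of 10 billion rubles is split across the 200 selected wells, and each unit of product (one thousand barrels) brings in 450,000 rubles.

$$
V_{\text{break-even}} = \frac{10^{10}}{200 \times 450{,}000} \approx 111.11 \ \text{thousand barrels per well}
$$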

## Calculation of profit and risks

In [20]:
# a function for profit calculation for selected wells and model predictions
def profit(target, predicted_valid):
    probs_sorted = predicted_valid.sort_values(ascending=False)
    selected = target[probs_sorted.index][:POINTS_SHORTLIST]

    return ITEM_PRICE * selected.sum() - BUDGET

In [21]:
# a function for calculating risks and profits for each region
def risks_and_profits(target, predictions):
    values = []

    for i in range(BOOTSTRAP_SAMPLES):
        target_subsample = target.sample(n=RESEARCH_POINTS, replace=True, random_state=RANDOM_STATE)
        probs_subsample = predictions[target_subsample.index]

        values.append(profit(target_subsample,probs_subsample))

    values = pd.Series(values)
    mean = values.mean()

    lower = values.quantile(0.025)
    upper = values.quantile(0.975)

    loss_probability = (values < 0).mean()

    print(f'Average profit: {numerize.numerize(mean)} rubles')
    print(f'95% confidence interval: {numerize.numerize(lower)}, {numerize.numerize(upper)} rubles')

    if loss_probability < MAX_LOSS:
        print(f'Probability of losses is less than {MAX_LOSS} and is {numerize.numerize(loss_probability)}')
    else:
        print(f'Probability of losses is greater than {MAX_LOSS} and is {numerize.numerize(loss_probability)}')

    return mean, lower, upper, loss_probability


In [22]:
# results for data1
risks_and_profits(target_valid_1, predicted_valid_1)

Average profit: 423.9M rubles
95% confidence interval: -76.19M, 957.85M rubles
Probability of losses is greater than 0.025 and is 0.05


(423897237.91690534, -76187813.89036272, 957846531.951783, 0.048)

In [23]:
# results for data2
risks_and_profits(target_valid_2, predicted_valid_2)

Average profit: 513.26M rubles
95% confidence interval: 108.07M, 928.57M rubles
Probability of losses is less than 0.025 and is 0.01


(513256698.9172609, 108066895.23396212, 928574439.2324963, 0.006)

In [24]:
# results for data3
risks_and_profits(target_valid_3, predicted_valid_3)

Average profit: 381.12M rubles
95% confidence interval: -142.8M, 893.38M rubles
Probability of losses is greater than 0.025 and is 0.07


(381120359.57590145, -142800630.08786878, 893380565.7504003, 0.074)

Region 2 appears to be the best choice for development in terms of both potential average profit (513.26 million rubles) and probability of losses (0.6%). Its 95% interval is also narrower than those of the other regions and lies entirely above zero.
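
One caveat about the bootstrap: sampling with replacement repeats index labels, so a label-based lookup such as `target[probs_sorted.index]` can return more rows than were requested. The sketch below is a variant that selects wells by position instead. The names `profit_positional` and `bootstrap_profit` are hypothetical, the demo data is synthetic, and the numbers it produces are not expected to match the notebook's output.

```python
import numpy as np
import pandas as pd

def profit_positional(target, predictions, top_n, item_price, budget):
    """Profit from the top_n wells ranked by prediction (positional selection)."""
    # argsort on the raw arrays avoids label lookups, so repeated
    # index labels in a bootstrap sample cannot duplicate rows
    order = np.argsort(-predictions.to_numpy())[:top_n]
    return item_price * target.to_numpy()[order].sum() - budget

def bootstrap_profit(target, predictions, n_samples, sample_size, top_n,
                     item_price, budget, seed=12345):
    """Bootstrap the profit distribution and summarize it."""
    rng = np.random.default_rng(seed)
    profits = []
    for _ in range(n_samples):
        # draw row positions with replacement
        pos = rng.integers(0, len(target), size=sample_size)
        profits.append(profit_positional(target.iloc[pos], predictions.iloc[pos],
                                         top_n, item_price, budget))
    profits = pd.Series(profits)
    return {
        "mean": profits.mean(),
        "lower": profits.quantile(0.025),
        "upper": profits.quantile(0.975),
        "loss_probability": (profits < 0).mean(),
    }

# Tiny synthetic demo (illustrative values only)
rng = np.random.default_rng(0)
demo_target = pd.Series(rng.uniform(0, 190, size=2000))
demo_pred = demo_target + rng.normal(scale=20, size=2000)
result = bootstrap_profit(demo_target, demo_pred, n_samples=200, sample_size=500,
                          top_n=200, item_price=450_000, budget=10**10)
print(result)
```

Working with positions keeps each bootstrap sample at exactly 500 wells and the shortlist at exactly 200, whatever labels the index carries.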

**Summary:**

During the preliminary analysis, it was found that the second dataset exhibits a very strong correlation between the target ("product") and the feature f2. Thanks to this property, the linear regression model achieved its best RMSE on the second region. However, the mean predicted reserves are higher in the first (92.59 thousand barrels) and third (95.09 thousand barrels) regions than in the second (68.77 thousand barrels).

It was determined that a new well needs 111.11 thousand barrels of reserves to break even. In all three regions, the average reserves of the existing wells are below this value: 92.5, 68.8, and 95.0 thousand barrels, respectively.

The potential average profits and loss probabilities, calculated using the bootstrap technique, are as follows: 423.9 million rubles and 4.8% for the first region, 513.26 million rubles and 0.6% for the second region, and 381.12 million rubles and 7.4% for the third region.

The only region where the probability of losses is less than 2.5% is the second region. Therefore, it is recommended to drill a new well in this region.
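
The selection rule from the task conditions can be written out as a short sketch. The figures are copied from the bootstrap output above; the dictionary layout and the names `results`, `eligible`, and `best_region` are illustrative assumptions.

```python
# Bootstrap results from the notebook output:
# region -> (mean profit in rubles, probability of losses)
results = {
    "region 1": (423_897_237.92, 0.048),
    "region 2": (513_256_698.92, 0.006),
    "region 3": (381_120_359.58, 0.074),
}
MAX_LOSS = 0.025

# Keep only regions whose loss probability is below the threshold
eligible = {name: mean for name, (mean, loss) in results.items() if loss < MAX_LOSS}

# Among the eligible regions, pick the one with the highest mean profit
best_region = max(eligible, key=eligible.get) if eligible else None
print(best_region)  # region 2
```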