# Finding the best place for a new well

In this project, we use machine learning to build a model that analyzes oil well parameters in three regions and helps select the region with the highest profit.

# Contents

* [Data loading and preparation]()
* [Model training and testing]()    
* [Preparation for profit calculation]()
* [Profit calculation]()
* [Calculating risks and profit for each region]()
* [General conclusion]()

# Data loading and preparation

First, we will load the libraries and the data that we will use in this project.

In [1]:
# Loading all required libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_squared_error 
from numpy.random import RandomState

In [2]:
# Loading the data files into DataFrames
region0=pd.read_csv('/datasets/geo_data_0.csv')
region1=pd.read_csv('/datasets/geo_data_1.csv')
region2=pd.read_csv('/datasets/geo_data_2.csv')

We will display general information about the data.

In [3]:
# printing the general/summary information about the region0 DataFrame
print(region0.info())

# printing the general/summary information about the region1 DataFrame
print(region1.info())

# printing the general/summary information about the region2 DataFrame
region2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column  

All columns in all data sets have the correct type and there are no missing values. So, we can start to build our model.

# Model training and testing

At this stage, we will train and test the model for each region. For each region, we will split the data into target and features. Then we will divide target and features into a training set and validation set at a ratio of 75:25. Next, we will train the model on the training set and make predictions for the validation set. And finally, we will calculate the average volume of predicted reserves and model RMSE.
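
The steps described above are the same for every region, so they can be sketched as one helper. The function name `train_and_evaluate` and the synthetic data below are assumptions made for illustration; they are not part of the project files.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train_and_evaluate(df, random_state=12345):
    """Split 75:25, fit a linear regression, return validation data and metrics."""
    features = df[['f0', 'f1', 'f2']]
    target = df['product']
    features_train, features_valid, target_train, target_valid = train_test_split(
        features, target, test_size=0.25, random_state=random_state
    )
    model = LinearRegression()
    model.fit(features_train, target_train)
    # Keep the validation index so predictions stay aligned with the target
    predicted_valid = pd.Series(model.predict(features_valid), index=target_valid.index)
    rmse = mean_squared_error(target_valid, predicted_valid) ** 0.5
    return target_valid, predicted_valid, predicted_valid.mean(), rmse


# Synthetic stand-in for one region (hypothetical data, not the project files)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['f0', 'f1', 'f2'])
demo['product'] = 50 + 10 * demo['f2'] + rng.normal(scale=1.0, size=1000)

demo_target_valid, demo_predicted_valid, demo_average, demo_rmse = train_and_evaluate(demo)
print(f'Average volume of predicted reserves: {demo_average:.2f}')
print(f'Model RMSE: {demo_rmse:.2f}')
```

Returning the validation target together with the predictions keeps both available for the later profit calculation.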

## Region 0

In [4]:
# splitting the data into target and features
r0_target=region0['product']
r0_features=region0[['f0','f1','f2']]

#dividing target and features into a training set and validation set at a ratio of 75:25
r0_target_train, r0_target_valid,r0_features_train, r0_features_valid  = train_test_split(r0_target, r0_features, test_size=0.25, random_state=12345) 

#training model
model = LinearRegression()
model.fit(r0_features_train, r0_target_train)

#making predictions for the validation set
r0_predicted_valid = model.predict(r0_features_valid)

#calculating average volume of predicted reserves and model RMSE
average=r0_predicted_valid.mean()
mse = mean_squared_error(r0_target_valid, r0_predicted_valid)
rmse=mse**0.5
print(f'Average volume of predicted reserves: {average}')
print(f'Model RMSE: {rmse}')

Average volume of predicted reserves: 92.59256778438035
Model RMSE: 37.5794217150813


## Region 1

In [5]:
# splitting the data into target and features
r1_target=region1['product']
r1_features=region1[['f0','f1','f2']]

#dividing target and features into a training set and validation set at a ratio of 75:25
r1_target_train, r1_target_valid,r1_features_train, r1_features_valid  = train_test_split(r1_target, r1_features, test_size=0.25, random_state=12345) 

#training model
model = LinearRegression()
model.fit(r1_features_train, r1_target_train)

#making predictions for the validation set
r1_predicted_valid = model.predict(r1_features_valid)

#calculating average volume of predicted reserves and model RMSE
average=r1_predicted_valid.mean()
mse = mean_squared_error(r1_target_valid, r1_predicted_valid)
rmse=mse**0.5
print(f'Average volume of predicted reserves:{average}')
print(f'Model RMSE:{rmse}')

Average volume of predicted reserves:68.728546895446
Model RMSE:0.893099286775617


## Region 2

In [6]:
# splitting the data into target and features
r2_target=region2['product']
r2_features=region2[['f0','f1','f2']]

#dividing target and features into a training set and validation set at a ratio of 75:25
r2_target_train, r2_target_valid,r2_features_train, r2_features_valid  = train_test_split(r2_target, r2_features, test_size=0.25, random_state=12345) 

#training model
model = LinearRegression()
model.fit(r2_features_train, r2_target_train)

#making predictions for the validation set
r2_predicted_valid = model.predict(r2_features_valid)

#calculating average volume of predicted reserves and model RMSE
average=r2_predicted_valid.mean()
mse = mean_squared_error(r2_target_valid, r2_predicted_valid)
rmse=mse**0.5
print(f'Average volume of predicted reserves:{average}')
print(f'Model RMSE:{rmse}')

Average volume of predicted reserves:94.96504596800489
Model RMSE:40.02970873393434


As can be seen, regions 0 and 2 have similar average volumes of predicted reserves (about 92.6 and 95.0) and similar RMSE (about 37.6 and 40.0). Region 1 has a much lower average volume of predicted reserves (about 68.7), but its model is far more accurate, with an RMSE of about 0.89.
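
For reference, RMSE is the square root of the mean squared error on the validation set. The symbols are introduced here for illustration: $y_i$ is the actual volume of reserves of well $i$, $\hat{y}_i$ is the predicted volume, and $n$ is the number of wells in the validation set.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

RMSE is expressed in the same units as the target, that is, in thousand barrels.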

# Preparation for profit calculation

Here we will make preparation for profit calculation. We will define variables to store all key values for calculations. Then we will calculate the volume of reserves sufficient for developing a new well without losses. And finally, we will compare the obtained value with the average volume of reserves in each region.

When exploring a region, 500 points are studied and the best 200 of them are picked for the profit calculation. The budget for the development of 200 oil wells is 100 million USD. One barrel of raw materials brings 4.5 USD of revenue, so the revenue from one unit of product is 4,500 USD (the volume of reserves is in thousand barrels).

In [7]:
#defining variables to store all key values for calculations
budget=100000000
points=500
wells_number=200
unit_revenue=4500

#calculating budget for 1 well
well_budget=budget/wells_number

#calculating the volume of reserves sufficient for developing a new well without losses
desired_units_number=well_budget/unit_revenue
desired_units_number


111.11111111111111

So, a well should contain at least about 111.1 thousand barrels to be developed without losses.
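
Written out, the break-even volume per well is the budget for one well divided by the revenue per unit of product. The symbols $B$ (total budget), $N$ (number of wells) and $r$ (revenue per unit) are shorthand introduced here:

$$V_{\min} = \frac{B}{N \cdot r} = \frac{100\,000\,000}{200 \times 4\,500} \approx 111.11 \ \text{thousand barrels}$$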

In [8]:
#calculating average volume of reserves in each region
print(f'Average volume of reserves in region 0: {r0_target.mean()}')
print(f'Average volume of reserves in region 1: {r1_target.mean()}')
print(f'Average volume of reserves in region 2: {r2_target.mean()}')

Average volume of reserves in region 0: 92.50000000000001
Average volume of reserves in region 1: 68.82500000000002
Average volume of reserves in region 2: 95.00000000000004


As can be seen, the average volume of reserves in all regions is lower than the calculated volume of reserves sufficient for developing a new well without losses. But each region has 100,000 points and we only need to develop 200 wells.
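
One way to check that enough profitable wells exist is to count the wells whose reserves exceed the break-even volume. The sketch below uses a small made-up Series in place of the real `product` column, and the helper name `share_above_break_even` is hypothetical.

```python
import pandas as pd

BREAK_EVEN_VOLUME = 100_000_000 / 200 / 4500  # about 111.11 thousand barrels


def share_above_break_even(reserves, threshold=BREAK_EVEN_VOLUME):
    """Return the count and fraction of wells whose reserves exceed the threshold."""
    above = reserves > threshold
    return int(above.sum()), float(above.mean())


# Made-up reserve volumes in thousand barrels (illustration only)
demo_reserves = pd.Series([40.0, 95.5, 112.0, 130.2, 88.1, 150.7, 111.0, 60.3])

count, share = share_above_break_even(demo_reserves)
print(f'Wells above break-even: {count} ({share:.1%})')
```

On the real data, the same call could be made with `region0['product']` and the other regions in place of `demo_reserves`.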

# Profit calculation

At this step, we will write a function to calculate profit from a set of selected oil wells and model predictions. The function will pick the 200 wells with the highest predicted values. For these 200 wells, the function will sum the actual target volume of reserves, multiply it by the revenue per unit, and subtract the budget required for developing 200 wells.

In [9]:
# writing a function to calculate profit from a set of selected oil wells
def profit(target, predicted):

    #converting predicted values to Series
    predicted = pd.Series(predicted)

    #resetting index for target, so it matches the index for prediction
    target = target.reset_index(drop=True)

    #getting indices of predicted values Series sorted in descending order
    indices = predicted.sort_values(ascending=False).index

    #calculating profit for wells_number (200) wells with highest values of predictions
    return target.loc[indices][:wells_number].sum()*unit_revenue-budget
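
To see what the function does, consider a toy case with five wells where the two highest predictions are selected. The sketch below is a parameterised variant written for illustration (the name `profit_top_n` and all numbers are assumptions). It follows the same steps as `profit` above but takes the number of wells, the unit revenue and the budget as arguments.

```python
import pandas as pd


def profit_top_n(target, predicted, n_wells, unit_revenue, budget):
    """Profit from the n_wells wells with the highest predicted reserves."""
    predicted = pd.Series(predicted).reset_index(drop=True)
    target = pd.Series(target).reset_index(drop=True)
    # Positions of the wells ranked by prediction, best first
    top = predicted.sort_values(ascending=False).index[:n_wells]
    # Revenue comes from the actual reserves of the selected wells
    return target.loc[top].sum() * unit_revenue - budget


# Toy data: actual and predicted reserves in thousand barrels (made up)
toy_target = [100.0, 150.0, 80.0, 120.0, 90.0]
toy_predicted = [110.0, 140.0, 70.0, 95.0, 130.0]

# The top two predictions are wells 1 and 4, with actual reserves 150 and 90
toy_profit = profit_top_n(toy_target, toy_predicted, n_wells=2,
                          unit_revenue=4500, budget=1_000_000)
print(toy_profit)  # (150 + 90) * 4500 - 1,000,000 = 80,000
```

Note that the model picks well 4 (predicted 130, actual 90) over well 3 (predicted 95, actual 120): wells are chosen by prediction, but profit is counted from actual reserves, so prediction errors reduce the result.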
    

We will test the function on the whole validation sets.

In [10]:
# testing the function on the whole validation sets
print(profit(r0_target_valid, r0_predicted_valid))
print(profit(r1_target_valid, r1_predicted_valid))
print(profit(r2_target_valid, r2_predicted_valid))

33208260.43139851
24150866.966815114
27103499.635998324


So, if we study all 25,000 points of the validation set and select only the best 200 of them, region 0 can give us more than 33 million dollars of profit. But only 500 points are studied when exploring a region. So, we need to be confident that we are selecting the region with the highest profitability when studying only 500 points.

# Calculating risks and profit for each region

In this final stage, we will use the bootstrapping technique with 1000 samples to find the distribution of profit. We will calculate average profit, 95% confidence interval, and risk of losses for each region. To calculate the risk of losses, we will divide the number of samples in which profit is negative by the total number of samples.
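
The procedure described above can be sketched as one reusable function. This is a minimal sketch under stated assumptions: the name `bootstrap_profit`, its parameters and the synthetic data are hypothetical and only illustrate the technique.

```python
import numpy as np
import pandas as pd


def bootstrap_profit(target, predicted, n_samples=1000, points=500, wells=200,
                     unit_revenue=4500, budget=100_000_000, seed=12345):
    """Return a Series of bootstrapped profits for one region."""
    target = pd.Series(target).reset_index(drop=True)
    predicted = pd.Series(predicted).reset_index(drop=True)
    state = np.random.RandomState(seed)
    profits = []
    for _ in range(n_samples):
        # Draw `points` wells with replacement
        sample_index = target.sample(n=points, replace=True, random_state=state).index
        sample_target = target.loc[sample_index].reset_index(drop=True)
        sample_predicted = predicted.loc[sample_index].reset_index(drop=True)
        # Develop the `wells` wells with the highest predictions
        top = sample_predicted.sort_values(ascending=False).index[:wells]
        profits.append(sample_target.loc[top].sum() * unit_revenue - budget)
    return pd.Series(profits)


# Synthetic stand-in for a validation set (hypothetical data)
rng = np.random.default_rng(0)
demo_target = pd.Series(rng.normal(loc=95, scale=40, size=5000).clip(min=0))
demo_predicted = demo_target * 0.3 + rng.normal(loc=66, scale=10, size=5000)

profits = bootstrap_profit(demo_target, demo_predicted, n_samples=200)
lower, upper = profits.quantile(0.025), profits.quantile(0.975)
risk = (profits < 0).mean()  # share of samples with negative profit
print(f'Average profit: {profits.mean():.0f}')
print(f'95% confidence interval: {lower:.0f}, {upper:.0f}')
print(f'Risk of losses: {risk:.1%}')
```

The interval here is the percentile interval between the 2.5% and 97.5% quantiles of the bootstrapped profits, and the risk of losses is the fraction of samples with negative profit.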

## Region 0

In [11]:
#defining random state
state = RandomState(12345)

#resetting index for target, so it matches the index for prediction
r0_target_valid = r0_target_valid.reset_index(drop=True)

#defining empty list to store profits for samples
values = []

#calculating profit for each of 1000 samples and appending it to the values list
for i in range(1000):   
    target_subsample = r0_target_valid.sample(n=points, replace=True, random_state=state)
    probs_subsample = r0_predicted_valid[target_subsample.index]     
    subsample_revenue=profit(target_subsample, probs_subsample)
    values.append(subsample_revenue)
    
#converting values to Series for calculation simplicity
values = pd.Series(values)

#calculating and printing average profit, 95% confidence interval, and risk of losses for region 0
print(f'Average profit: {values.mean()}')
print(f'95% confidence interval: {values.quantile(0.025)}, {values.quantile(0.975)}')
print(f'Risk of losses: {(len(values[values < 0]) /1000)*100}%')

Average profit: 3961649.8480237117
95% confidence interval: -1112155.4589049604, 9097669.41553423
Risk of losses: 6.9%


## Region 1

In [12]:
#defining random state
state = RandomState(12345)

#resetting index for target, so it matches the index for prediction
r1_target_valid = r1_target_valid.reset_index(drop=True)

#defining empty list to store profits for samples
values = []


#calculating profit for each of 1000 samples and appending it to the values list
for i in range(1000):   
    target_subsample = r1_target_valid.sample(n=points, replace=True, random_state=state)
    probs_subsample = r1_predicted_valid[target_subsample.index]     
    subsample_revenue=profit(target_subsample, probs_subsample)
    values.append(subsample_revenue)

#converting values to Series for calculation simplicity
values = pd.Series(values)

#calculating and printing average profit, 95% confidence interval, and risk of losses for region 1
print(f'Average profit: {values.mean()}')
print(f'95% confidence interval: {values.quantile(0.025)}, {values.quantile(0.975)}')
print(f'Risk of losses: {(len(values[values < 0]) /1000)*100}%')

Average profit: 4560451.057866608
95% confidence interval: 338205.0939898458, 8522894.538660347
Risk of losses: 1.5%


## Region 2

In [13]:
#defining random state
state = RandomState(12345)

#resetting index for target, so it matches the index for prediction
r2_target_valid = r2_target_valid.reset_index(drop=True)

#defining empty list to store profits for samples
values = []

#calculating profit for each of 1000 samples and appending it to the values list
for i in range(1000):   
    target_subsample = r2_target_valid.sample(n=points, replace=True, random_state=state)
    probs_subsample = r2_predicted_valid[target_subsample.index]     
    subsample_revenue=profit(target_subsample, probs_subsample)
    values.append(subsample_revenue)

#converting values to Series for calculation simplicity
values = pd.Series(values)

#calculating and printing average profit, 95% confidence interval, and risk of losses for region 2
print(f'Average profit: {values.mean()}')
print(f'95% confidence interval: {values.quantile(0.025)}, {values.quantile(0.975)}')
print(f'Risk of losses: {(len(values[values < 0]) /1000)*100}%')

Average profit: 4044038.665683568
95% confidence interval: -1633504.1339559986, 9503595.749237997
Risk of losses: 7.6%


As can be seen, region 1 has the highest average profit. Moreover, it is the only region where the risk of losses is lower than 2.5% and where the 95% confidence interval contains no negative values. Therefore, we recommend this region for the development of new wells.

# General conclusion

In this project, we used machine learning to build a model that analyzed oil well parameters in three regions and helped select the region with the highest profit.

First, we inspected the data and ensured that all columns have the correct type and that there are no missing values. Then, for each region, we split the data into target and features. Next, we trained the model on the training set and made predictions for the validation set. Then, we calculated the average volume of predicted reserves and the model RMSE. We found that regions 0 and 2 have similar average volumes of predicted reserves and similar RMSE. Region 1 has a much lower average volume of predicted reserves, but its model is far more accurate (RMSE of about 0.89 against about 37.6 and 40.0).

Then, we calculated the volume of reserves sufficient for developing a new well without losses (about 111.1 thousand barrels). We found that the average volume of reserves in all regions is lower than this value. But each region has 100,000 points and we only need to develop 200 wells.

Finally, we wrote a function to calculate profit from a set of selected oil wells and model predictions. With that function, we used the bootstrapping technique with 1000 samples to find the distribution of profit. We calculated the average profit, the 95% confidence interval, and the risk of losses for each region. Region 1 has the highest average profit (about 4.56 million USD). Moreover, it is the only region where the risk of losses is lower than 2.5% and where the 95% confidence interval contains no negative values. Therefore, we recommend this region for the development of new wells.