# Predicting oil reserves and optimizing well placement using Machine Learning.

OilyGiant mining company has reservoir data containing oil well parameters for selected basins/regions. As a data scientist, you've been hired to analyze this data and build a model that predicts reserves in new wells. You are tasked with optimizing well placement and maximizing profit. An important deliverable for this project is a risk analysis using the bootstrap technique.

##### Business Statement

Production forecasts and reserve estimates are essential inputs to decision-making and investment evaluation for any oil company. Oil companies and reservoir asset managers must factor in reserves, production forecasts, and estimated ultimate recovery when determining whether a production project will be viable and profitable. In addition to reservoir volume, operational risk is another important consideration. To this end, we need to build a model that predicts the volume of reserves, select the best wells, and maximize profit by picking the region with the highest total profit. The resulting model will serve as a basis for critical decisions during reservoir management and field development planning.

##### Task Statement

Find the best place for a new well. Use the following steps to choose the location:
- Collect the oil well parameters in the selected region: oil quality and volume of reserves;
- Build a model for predicting the volume of reserves in the new wells;
- Pick the oil wells with the highest estimated values;
- Pick the region with the highest total profit for the selected oil wells.

You have data on oil samples from three regions. Parameters of each oil well in the region are already known. Build a model that will help to pick the region with the highest profit margin. Analyze potential profit and risks using the Bootstrap technique.
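The model-building step in the list above can be sketched as follows. This is a minimal sketch under stated assumptions: the data is randomly generated rather than read from the project files, and the 75/25 split, the random seed, and the RMSE metric are illustrative choices, not requirements stated in the task.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one region (assumption: same column names as the real files).
rng = np.random.default_rng(12345)
n_wells = 1000
region = pd.DataFrame({
    "f0": rng.normal(size=n_wells),
    "f1": rng.normal(size=n_wells),
    "f2": rng.normal(size=n_wells),
})
region["product"] = (
    90 + 5 * region["f0"] - 10 * region["f1"] + 7 * region["f2"]
    + rng.normal(scale=3, size=n_wells)
)

features = region[["f0", "f1", "f2"]]
target = region["product"]

# Hold out 25% of the wells for validation (the ratio is an assumption).
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

model = LinearRegression()
model.fit(features_train, target_train)

# Keep the validation index so predictions can be matched to true volumes later.
predictions = pd.Series(model.predict(features_valid), index=target_valid.index)

rmse = np.sqrt(mean_squared_error(target_valid, predictions))
print(f"Mean predicted reserves: {predictions.mean():.2f} thousand barrels")
print(f"Validation RMSE: {rmse:.2f} thousand barrels")
```

Storing the predictions as a Series that shares the validation index makes it straightforward to rank wells by predicted volume and then look up their true reserves.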

## Data description

Geological exploration data for the three regions are stored in files:
 - [geo_data_0.csv](https://code.s3.yandex.net/datasets/geo_data_0.csv)
 - [geo_data_1.csv](https://code.s3.yandex.net/datasets/geo_data_1.csv)
 - [geo_data_2.csv](https://code.s3.yandex.net/datasets/geo_data_2.csv)

Each file contains the following columns:
 - id: unique oil well identifier
 - f0, f1, f2: three features of points (their specific meaning is unimportant, but the features themselves are significant)
 - product: volume of reserves in the oil well (thousand barrels).
 
**Conditions:**

 - Only linear regression is suitable for model training (the other models are not sufficiently predictable).
 - When exploring a region, 500 points are studied, and the best 200 of them are picked for the profit calculation.
 - The budget for oil well development is 100 million USD.
 - One barrel of raw materials brings 4.5 USD of revenue. The revenue from one unit of product is 4,500 USD (the volume of reserves is in thousand barrels).
 - After the risk evaluation, keep only the regions with the risk of losses lower than 2.5%. From the ones that fit the criteria, the region with the highest average profit should be selected.
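The conditions above imply a break-even volume per well and a bootstrap procedure for estimating profit and risk. The following is a minimal sketch under stated assumptions: the target and prediction series are randomly generated stand-ins, the function names `calculate_profit` and `bootstrap_profit` are hypothetical, and the 1,000 resamples and the 95% interval are illustrative choices that the conditions do not specify.

```python
import numpy as np
import pandas as pd

# Figures taken from the conditions above.
BUDGET = 100_000_000       # USD for developing the selected wells in one region
POINTS_STUDIED = 500       # points explored per region
WELLS_SELECTED = 200       # best points kept for the profit calculation
REVENUE_PER_UNIT = 4_500   # USD per unit of product (one thousand barrels)
MAX_LOSS_RISK = 0.025      # a region passes only if its risk of losses is below this

# Break-even volume: what each selected well must hold, on average.
cost_per_well = BUDGET / WELLS_SELECTED                # 500,000 USD
break_even_volume = cost_per_well / REVENUE_PER_UNIT   # about 111.11 thousand barrels
print(f"Break-even volume per well: {break_even_volume:.2f} thousand barrels")


def calculate_profit(target, predictions, count=WELLS_SELECTED):
    """Profit (USD) from the `count` wells with the highest predicted reserves."""
    best = predictions.sort_values(ascending=False).head(count).index
    return target.loc[best].sum() * REVENUE_PER_UNIT - BUDGET


def bootstrap_profit(target, predictions, n_samples=1000, seed=12345):
    """Resample POINTS_STUDIED wells with replacement and collect the profits."""
    rng = np.random.default_rng(seed)
    profits = []
    for _ in range(n_samples):
        positions = rng.integers(0, len(target), size=POINTS_STUDIED)
        # Reset the index so wells drawn more than once keep distinct labels.
        sample_target = target.iloc[positions].reset_index(drop=True)
        sample_predictions = predictions.iloc[positions].reset_index(drop=True)
        profits.append(calculate_profit(sample_target, sample_predictions))
    return pd.Series(profits)


# Synthetic stand-ins for validation targets and model predictions (assumption).
data_rng = np.random.default_rng(0)
target = pd.Series(data_rng.normal(95, 45, size=5000).clip(min=0))
predictions = target + data_rng.normal(0, 35, size=5000)

profits = bootstrap_profit(target, predictions)
mean_profit = profits.mean()
lower, upper = profits.quantile(0.025), profits.quantile(0.975)
loss_risk = (profits < 0).mean()

print(f"Average profit: {mean_profit:,.0f} USD")
print(f"95% interval: {lower:,.0f} to {upper:,.0f} USD")
print(f"Risk of losses: {loss_risk:.2%} (passes: {loss_risk < MAX_LOSS_RISK})")
```

Resetting the index inside each resample matters: sampling with replacement can draw the same well twice, and selecting by a duplicated label would otherwise return more rows than intended.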
 
The data is synthetic: contract details and well characteristics are not
disclosed.


## Objectives

The objectives of this project are to:
- Optimize well placement
- Develop a model that predicts the volume of reserves in new wells
- Pick the oil wells with the highest estimated reserves and the region with the highest total profit.

<hr>

# Table of contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#open_the_data">Open the data file and study the general information</a></li>
        <li><a href="#train_test">Train and test the model for each region</a></li>
        <li><a href="#prepare_profit">Prepare for profit calculation</a></li>
        <li><a href="#write_function">Write a function to calculate profit from a set of selected oil wells and model predictions</a></li>
        <li><a href="#calculate_risk">Calculate risks and profit for each region</a></li>
        <li><a href="#overall_conclusion">Overall conclusion</a></li>
    </ol>
</div>
<br>
<hr>

<div id="open_the_data">
    <h2>Open the data file and study the general information</h2> 
</div>

We require the following libraries: *pandas* and *numpy* for data preprocessing and manipulation, *matplotlib* for visualization, and *Scikit-Learn* for building our learning algorithms.

In [4]:
# import pandas and numpy for data preprocessing and manipulation
import numpy as np
import pandas as pd
import random

# matplotlib for visualization
import matplotlib.pyplot as plt
%matplotlib inline

# import train_test_split to split data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
pd.options.mode.chained_assignment = None # to avoid SettingWithCopyWarning after scaling

# import machine learning module from the sklearn library
from sklearn.linear_model import LinearRegression # import linear regression 

# import sklearn utilities
from sklearn.utils import shuffle

print('Project libraries have been successfully imported!')

Project libraries have been successfully imported!


In [5]:
# read the data
try:
    geo_data_0 = pd.read_csv('https://code.s3.yandex.net/datasets/geo_data_0.csv')
    geo_data_1 = pd.read_csv('https://code.s3.yandex.net/datasets/geo_data_1.csv')
    geo_data_2 = pd.read_csv('https://code.s3.yandex.net/datasets/geo_data_2.csv')
except Exception:  # fall back to local copies if the download fails
    geo_data_0 = pd.read_csv('C:/Users/hotty/Desktop/Practicum by Yandex/Projects/Machine Learning in Business/geo_data_0.csv')
    geo_data_1 = pd.read_csv('C:/Users/hotty/Desktop/Practicum by Yandex/Projects/Machine Learning in Business/geo_data_1.csv')
    geo_data_2 = pd.read_csv('C:/Users/hotty/Desktop/Practicum by Yandex/Projects/Machine Learning in Business/geo_data_2.csv')
print('Data has been read correctly!')

Data has been read correctly!


In [6]:
# function to determine if columns in file have null values
def get_percent_of_na(df, num):
    count = 0
    df = df.copy()
    s = (df.isna().sum() / df.shape[0])
    for column, percent in zip(s.index, s.values):
        num_of_nulls = df[column].isna().sum()
        if num_of_nulls == 0:
            continue
        else:
            count += 1
        print('Column {} has {:.{}%} nulls ({} rows)'.format(column, percent, num, num_of_nulls))
    if count != 0:
        print("\033[1m" + 'There are {} columns with NA.'.format(count) + "\033[0m")
    else:
        print()
        print("\033[1m" + 'There are no columns with NA.' + "\033[0m")
        
# function to display general information about the dataset
def get_info(df):
    """
    This function uses the head(), info(), describe(), shape() and duplicated() 
    methods to display the general information about the dataset.
    """
    print("\033[1m" + '-'*100 + "\033[0m")
    print('Head:')
    print()
    display(df.head())
    print('-'*100)
    print('Info:')
    print()
    display(df.info())
    print('-'*100)
    print('Describe:')
    print()
    display(df.describe())
    print('-'*100)
    display(df.describe(include='object'))
    print()
    print('Columns with nulls:')
    display(get_percent_of_na(df, 4))  # check this out
    print('-'*100)
    print('Shape:')
    print(df.shape)
    print('-'*100)
    print('Duplicated:')
    print("\033[1m" + 'We have {} duplicated rows.\n'.format(df.duplicated().sum()) + "\033[0m")
    print()
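
The per-column null check above can also be expressed without an explicit loop. The following is a minimal sketch on a tiny made-up frame (an assumption, not the project data); the name `null_summary` is hypothetical.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one gap (assumption: not the project data).
sample = pd.DataFrame({
    "f0": [0.1, np.nan, 0.3, 0.4],
    "f1": [1.0, 2.0, 3.0, 4.0],
})

# Count and share of missing values per column.
null_summary = pd.DataFrame({
    "null_count": sample.isna().sum(),
    "null_share": sample.isna().mean(),
})

# Keep only the columns that actually contain nulls.
null_summary = null_summary[null_summary["null_count"] > 0]
print(null_summary)
```

An empty result means that no column has missing values.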

In [7]:
# study the general information about the dataset 
print('General information about the dataframe')
get_info(geo_data_0)
get_info(geo_data_1)
get_info(geo_data_2)

General information about the dataframe
----------------------------------------------------------------------------------------------------
Head:



| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |


----------------------------------------------------------------------------------------------------
Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


None

----------------------------------------------------------------------------------------------------
Describe:



| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 0.500419 | 0.250143 | 2.502647 | 92.5 |
| std | 0.871832 | 0.504433 | 3.248248 | 44.288691 |
| min | -1.408605 | -0.848218 | -12.088328 | 0.0 |
| 25% | -0.07258 | -0.200881 | 0.287748 | 56.497507 |
| 50% | 0.50236 | 0.250252 | 2.515969 | 91.849972 |
| 75% | 1.073581 | 0.700646 | 4.715088 | 128.564089 |
| max | 2.362331 | 1.343769 | 16.00379 | 185.364347 |


----------------------------------------------------------------------------------------------------


| | id |
|---|---|
| count | 100000 |
| unique | 99990 |
| top | bxg6G |
| freq | 2 |

Columns with nulls:

There are no columns with NA.


None

----------------------------------------------------------------------------------------------------
Shape:
(100000, 5)
----------------------------------------------------------------------------------------------------
Duplicated:
We have 0 duplicated rows.

----------------------------------------------------------------------------------------------------
Head:



| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |


----------------------------------------------------------------------------------------------------
Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


None

----------------------------------------------------------------------------------------------------
Describe:



| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 1.141296 | -4.796579 | 2.494541 | 68.825 |
| std | 8.965932 | 5.119872 | 1.703572 | 45.944423 |
| min | -31.609576 | -26.358598 | -0.018144 | 0.0 |
| 25% | -6.298551 | -8.267985 | 1.000021 | 26.953261 |
| 50% | 1.153055 | -4.813172 | 2.011479 | 57.085625 |
| 75% | 8.621015 | -1.332816 | 3.999904 | 107.813044 |
| max | 29.421755 | 18.734063 | 5.019721 | 137.945408 |


----------------------------------------------------------------------------------------------------


| | id |
|---|---|
| count | 100000 |
| unique | 99996 |
| top | wt4Uk |
| freq | 2 |

Columns with nulls:

There are no columns with NA.


None

----------------------------------------------------------------------------------------------------
Shape:
(100000, 5)
----------------------------------------------------------------------------------------------------
Duplicated:
We have 0 duplicated rows.

----------------------------------------------------------------------------------------------------
Head:



| | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | q6cA6 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |


----------------------------------------------------------------------------------------------------
Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


None

----------------------------------------------------------------------------------------------------
Describe:



| | f0 | f1 | f2 | product |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 0.002023 | -0.002081 | 2.495128 | 95.0 |
| std | 1.732045 | 1.730417 | 3.473445 | 44.749921 |
| min | -8.760004 | -7.08402 | -11.970335 | 0.0 |
| 25% | -1.162288 | -1.17482 | 0.130359 | 59.450441 |
| 50% | 0.009424 | -0.009482 | 2.484236 | 94.925613 |
| 75% | 1.158535 | 1.163678 | 4.858794 | 130.595027 |
| max | 7.238262 | 7.844801 | 16.739402 | 190.029838 |


----------------------------------------------------------------------------------------------------


| | id |
|---|---|
| count | 100000 |
| unique | 99996 |
| top | xCHr8 |
| freq | 2 |

Columns with nulls:

There are no columns with NA.


None

----------------------------------------------------------------------------------------------------
Shape:
(100000, 5)
----------------------------------------------------------------------------------------------------
Duplicated:
We have 0 duplicated rows.



**Conclusion**

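From the general information, each of the three datasets has 100,000 rows and 5 columns, with no missing values and no fully duplicated rows. The id column, however, is not strictly unique: region 0 has 99,990 unique ids, and regions 1 and 2 have 99,996 each, so a few well identifiers appear twice.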
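
The summary of the id column in the output above shows fewer unique values than rows, even though no full row is duplicated. The following is a minimal sketch of how such repeated identifiers could be inspected, using a tiny made-up frame (an assumption) in place of the real files. Whether repeated ids should be dropped is a judgment call that the task does not settle; the `drop_duplicates` line only shows one option.

```python
import pandas as pd

# Tiny illustrative frame (assumption: same columns as the geo_data_* files).
wells = pd.DataFrame({
    "id": ["aaa11", "bbb22", "aaa11", "ccc33"],
    "f0": [0.1, 0.2, 0.3, 0.4],
    "f1": [1.0, 1.1, 1.2, 1.3],
    "f2": [2.0, 2.1, 2.2, 2.3],
    "product": [50.0, 60.0, 70.0, 80.0],
})

# Rows that are identical in every column.
full_row_duplicates = int(wells.duplicated().sum())

# Rows whose id occurs more than once, shown side by side.
repeated_ids = wells[wells["id"].duplicated(keep=False)].sort_values("id")

# One possible treatment: keep the first record per id.
deduplicated = wells.drop_duplicates(subset="id", keep="first")

print(f"Fully duplicated rows: {full_row_duplicates}")
print(repeated_ids)
print(f"Rows after keeping one record per id: {len(deduplicated)}")
```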

<div id="train_test">
    <h2>Train and test the model for each region</h2> 
</div>

<div id="prepare_profit">
    <h2>Prepare for profit calculation</h2> 
</div>

<div id="write_function">
    <h2>Write a function to calculate profit from a set of selected oil wells and model predictions</h2> 
</div>

<div id="calculate_risk">
    <h2>Calculate risks and profit for each region</h2> 
</div>

<div id="overall_conclusion">
    <h2>Overall conclusion</h2> 
</div>