# OilyGiant
You work at the oil extraction company OilyGiant. Your task is to find the best locations to open 200 new oil wells. To complete this task, you must perform the following steps:
-	Read the files containing parameters collected from oil wells in the selected region: crude quality and reserve volume.
-	Build a model to predict the volume of reserves in new wells.
-	Select the oil wells with the highest estimated values.
-	Choose the region with the highest total profit for the selected oil wells.
You have data on crude oil samples from three regions. The parameters for each oil well in each region are already known.
Create a model that helps choose the region with the highest profit margin.
Analyze the potential profits and risks using the bootstrapping technique.

### Conditions:
-	Only linear regression must be used for training the model.
-	During regional exploration, a study of 500 points is carried out, from which the best 200 points are selected to calculate profit.
-	The budget for developing 200 oil wells is 100 million USD.
-	One barrel of raw material generates 4.5 USD in revenue.
-	The revenue per unit of product is 4,500 USD (the reserve volume is expressed in thousands of barrels).
-	After risk assessment, keep only the regions with a risk of loss below 2.5%. From those that meet the criteria, select the region with the highest average profit.
-	The data is synthetic: contract details and well characteristics are not publicly available.

# Data Description
Geological exploration data for the three regions is stored in the following files:
-	/datasets/geo_data_0.csv
-	/datasets/geo_data_1.csv
-	/datasets/geo_data_2.csv

Columns:
-	id — unique identifier of the oil well
-	f0, f1, f2 — three features of the locations (their specific meaning is not important, but the features themselves are significant)
-	product — volume of reserves in the oil well (thousands of barrels)

# 1. Initialization

In [1]:
# Add the parent directory to sys.path so the local src modules can be imported
import sys
import os

sys.path.append(os.path.abspath('..'))

In [2]:
# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from src.preprocessing import drop_id
from src.modeling import entrenar_y_evaluar_modelo
from src.revenues import revenue
from src.bootstrapping import risk_calculation

# Import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

## 1.1 Load Data

In [None]:
# Load data files
df_geodata_0 = pd.read_csv('../data/raw/geo_data_0.csv')
df_geodata_1 = pd.read_csv('../data/raw/geo_data_1.csv')
df_geodata_2 = pd.read_csv('../data/raw/geo_data_2.csv')

## 1.2 Data Preprocessing

In [4]:
# Dataset overview
print("The number of rows/columns in this dataset is:", df_geodata_0.shape)
df_geodata_0.info()
print()
print("The number of rows/columns in this dataset is:", df_geodata_1.shape)
df_geodata_1.info()
print()
print("The number of rows/columns in this dataset is:", df_geodata_2.shape)
df_geodata_2.info()

The number of rows/columns in this dataset is: (100000, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB

The number of rows/columns in this dataset is: (100000, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB

The number o

In [5]:
# Brief visualization of the Datasets
print(df_geodata_0.sample(3))
print()
print(df_geodata_1.sample(3))
print()
print(df_geodata_2.sample(3))

          id        f0        f1        f2     product
26868  DVeMG  0.229454 -0.200186  1.718064   96.064041
31895  QCGV4 -0.790291  0.805041  2.401600  114.246420
45337  PcF6Y  1.543394 -0.310921  4.351437  112.459134

          id        f0        f1        f2     product
52213  mWzwn -4.506458 -4.183843  2.004440   57.085625
73885  5zPHj -4.863572 -5.513499  4.995293  137.945408
85659  j9dT8  6.891210 -1.174799  0.007287    0.000000

          id        f0        f1        f2     product
71696  brNmo -1.638890  1.355378  4.234194  132.681459
24192  PNnlh -0.105074  0.520398  0.300557   83.649200
3399   TvxFG  1.052067 -1.534918 -2.780380   28.408516


In [6]:
# Remove columns with "non-useful" variables
df_geodata_0 = drop_id(df_geodata_0)
df_geodata_1 = drop_id(df_geodata_1)
df_geodata_2 = drop_id(df_geodata_2)

Index(['f0', 'f1', 'f2', 'product'], dtype='object')
Index(['f0', 'f1', 'f2', 'product'], dtype='object')
Index(['f0', 'f1', 'f2', 'product'], dtype='object')


In all three datasets, the id column adds no value for further analysis or for training the regression model.
It is removed, and the printed column index confirms that each new dataset no longer contains it.
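
The body of `drop_id` lives in `src/preprocessing.py` and is not shown in this notebook. A minimal sketch of what such a helper could look like is given below, assuming it drops the `id` column, prints the remaining columns, and returns the result. The name `drop_id_sketch` and the demo frame are hypothetical.

```python
import pandas as pd


def drop_id_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for src.preprocessing.drop_id (assumed behavior)."""
    # Return a copy without the identifier column; errors="ignore" keeps the
    # call safe if the column has already been removed.
    cleaned = df.drop(columns=["id"], errors="ignore")
    # Print the remaining columns so the removal can be verified visually.
    print(cleaned.columns)
    return cleaned


# Tiny made-up frame with the same column layout as the geo_data files.
demo = pd.DataFrame(
    {
        "id": ["a1", "b2"],
        "f0": [0.1, 0.2],
        "f1": [1.0, 2.0],
        "f2": [3.0, 4.0],
        "product": [10.0, 20.0],
    }
)
demo_clean = drop_id_sketch(demo)
```

Returning a new frame rather than modifying the input in place matches how the notebook reassigns `df_geodata_0 = drop_id(df_geodata_0)`.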

# 2. Model Training and Validation

In [7]:
pred_0, target_0, rmse_0 = entrenar_y_evaluar_modelo(df_geodata_0) # Results for df_geodata_0
volume_0 = pred_0.mean()
print()

pred_1, target_1, rmse_1 = entrenar_y_evaluar_modelo(df_geodata_1) # Results for df_geodata_1
volume_1 = pred_1.mean()
print()

pred_2, target_2, rmse_2 = entrenar_y_evaluar_modelo(df_geodata_2) # Results for df_geodata_2
volume_2 = pred_2.mean()

Average volume: 92.39880
RMSE: 37.75660

Average volume: 68.71288
RMSE: 0.89028

Average volume: 94.77102
RMSE: 40.14587


Metrics Summary for the Three Analyzed Datasets

-	geodata_0 has a high average predicted volume (92.39880) but a large prediction error (RMSE 37.75660).
-	geodata_1 has the lowest average predicted volume (68.71288) but by far the smallest prediction error (RMSE 0.89028).
-	geodata_2 has the highest average predicted volume (94.77102) but the largest prediction error (RMSE 40.14587).

Regions 0 and 2 show higher average volumes, but their RMSE values (about 38 and 40, in the same units as product) are large relative to those averages. The actual oil volume of a given well in these regions could therefore be significantly higher or lower than predicted.
Region 1, with its very low RMSE, appears to be the best option at this stage: its predictions are far more reliable, which gives greater certainty for decision-making and reduces the chance of unexpected losses.
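
The function `entrenar_y_evaluar_modelo` is imported from `src/modeling.py` and its body is not shown here. The sketch below illustrates one way such a step could work, assuming a 75/25 train/validation split, a fixed random seed, and the return order (predictions, validation target, RMSE) used in the cell above. The function name, the split ratio, the seed, and the synthetic demo data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train_and_evaluate_sketch(df, test_size=0.25, random_state=12345):
    """Hypothetical stand-in for entrenar_y_evaluar_modelo (assumed behavior)."""
    features = df.drop(columns=["product"])
    target = df["product"]
    # Hold out part of the data for validation (the ratio is an assumption).
    x_train, x_valid, y_train, y_valid = train_test_split(
        features, target, test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(x_train, y_train)
    # Keep the validation index so predictions line up with the true values.
    predictions = pd.Series(model.predict(x_valid), index=y_valid.index)
    rmse = float(np.sqrt(mean_squared_error(y_valid, predictions)))
    print(f"Average volume: {predictions.mean():.5f}")
    print(f"RMSE: {rmse:.5f}")
    return predictions, y_valid, rmse


# Synthetic demo data (not the real geo_data files): product depends on f2.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
demo["product"] = 50 + 10 * demo["f2"] + rng.normal(scale=2.0, size=1000)
pred_demo, target_demo, rmse_demo = train_and_evaluate_sketch(demo)
```

Because the demo noise has a standard deviation of 2, the RMSE on this synthetic data should land close to 2, which is a quick sanity check that the metric is in the same units as `product`.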

# 3. Profit Calculation

In [8]:
# Value storage
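inversion = 100_000_000  # development budget for 200 wells, USD
nuevos_pozos = 200       # number of wells to develop
ingreso_unidad = 4500    # revenue per unit of product (1,000 barrels), USD
count = nuevos_pozos     # number of top wells used in the profit calculation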

In [9]:
# Break-even point: minimum average volume per well (thousands of barrels) needed to recover the budget
punto_equilibrio = inversion / (nuevos_pozos * ingreso_unidad)
print(f'The minimum number of units required to avoid losses is: {punto_equilibrio:.2f}')

# Comparison between Break-even Point and Average Regional Volume
print(f'The average volume for Region 0 is: {volume_0:.5f}')
print(f'The average volume for Region 1 is: {volume_1:.5f}')
print(f'The average volume for Region 2 is: {volume_2:.5f}')

The minimum number of units required to avoid losses is: 111.11
The average volume for Region 0 is: 92.39880
The average volume for Region 1 is: 68.71288
The average volume for Region 2 is: 94.77102


Based on the comparison between the results of each region and the break-even point, none of the three regions reaches the minimum required value to be considered profitable.

In this scenario, it is recommended to adjust certain criteria in order to lower the break-even point and make the project attractive for investors in the development of new oil wells.
Some parameters to consider in order to reduce the break-even point and improve profitability include:

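-	Budget and number of wells (either reducing the budget or developing more wells with the same budget, since the break-even point is the budget divided by the number of wells times the revenue per unit)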
-	Well selection strategy (targeting higher-quality wells or being more conservative with fewer wells)
-	Reviewing the price per barrel or negotiating better contract conditions
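
The break-even formula from the cell above can be reused to see how each lever moves the threshold. The sketch below is illustrative only: the alternative budget, well count, and unit revenue are hypothetical values chosen for the comparison, not figures from the project.

```python
def break_even_volume(budget, wells, revenue_per_unit):
    """Average volume per well (thousands of barrels) needed to recover the budget."""
    return budget / (wells * revenue_per_unit)


# Baseline from the project conditions: 100 million USD, 200 wells, 4,500 USD per unit.
baseline = break_even_volume(100_000_000, 200, 4500)

# Hypothetical scenarios, one lever changed at a time.
lower_budget = break_even_volume(80_000_000, 200, 4500)   # smaller budget
more_wells = break_even_volume(100_000_000, 250, 4500)    # same budget, more wells
higher_price = break_even_volume(100_000_000, 200, 5000)  # better unit revenue

for label, value in [
    ("baseline", baseline),
    ("lower budget", lower_budget),
    ("more wells", more_wells),
    ("higher price", higher_price),
]:
    print(f"{label}: {value:.2f}")
```

In every scenario the threshold falls because the budget per well shrinks or the revenue per unit grows. A larger budget or fewer wells at the same budget would push it up instead.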

# 4. Profit Calculation per Set of Wells

In [10]:
# Results by Region
# Here we calculate the profit from the top results per region.
profit_0 = revenue(target_0, pred_0, count, ingreso_unidad, inversion)
profit_1 = revenue(target_1, pred_1, count, ingreso_unidad, inversion)
profit_2 = revenue(target_2, pred_2, count, ingreso_unidad, inversion)

print(f"The profit for Region 0 is: $ {profit_0:.2f}")
print(f"The profit for Region 1 is: $ {profit_1:.2f}")
print(f"The profit for Region 2 is: $ {profit_2:.2f}")

The profit for Region 0 is: $ 33591411.14
The profit for Region 1 is: $ 24150866.97
The profit for Region 2 is: $ 25985717.59


It is interesting to observe that, although the average volume of each region does not initially exceed the break-even point, when calculating profitability using the best 200 wells per region, all regions become profitable.

Final results by region (top 200 wells):

-	Region 0: profit of 33,591,411 USD (most profitable)
-	Region 1: profit of 24,150,867 USD (least profitable)
-	Region 2: profit of 25,985,718 USD (medium profitability)

Based on these results, the initially proposed region for oil well development would be Region 0, as it shows the highest profitability at this stage of the analysis.
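
The `revenue` function comes from `src/revenues.py` and is not reproduced in this notebook. A minimal sketch of the calculation it is called for is shown below, assuming wells are ranked by predicted volume and profit is computed from the actual volumes of the selected wells. The name `revenue_sketch` and the toy numbers are hypothetical.

```python
import pandas as pd


def revenue_sketch(target, predictions, count, unit_revenue, budget):
    """Hypothetical stand-in for src.revenues.revenue (assumed behavior)."""
    # Rank wells by predicted volume and keep the labels of the top `count`.
    top_idx = predictions.sort_values(ascending=False).index[:count]
    # Profit is based on the actual volumes of the selected wells.
    return unit_revenue * target.loc[top_idx].sum() - budget


# Toy example with four wells, choosing the best two by prediction.
toy_target = pd.Series([10.0, 50.0, 30.0, 40.0])
toy_pred = pd.Series([12.0, 45.0, 35.0, 20.0])
toy_profit = revenue_sketch(toy_target, toy_pred, count=2, unit_revenue=4500, budget=300_000)
print(toy_profit)  # wells 1 and 2 are selected: 4500 * (50 + 30) - 300000 = 60000.0
```

Note that the well with an actual volume of 40 is skipped because its prediction is low, which is how prediction error translates directly into lost profit.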

# 5. Risk Calculation

In [11]:
# Results by Region
mean_0, lower_0, upper_0, risk_0, values_0 = risk_calculation(target_0, pred_0, revenue, ingreso_unidad, inversion)
mean_1, lower_1, upper_1, risk_1, values_1 = risk_calculation(target_1, pred_1, revenue, ingreso_unidad, inversion)
mean_2, lower_2, upper_2, risk_2, values_2 = risk_calculation(target_2, pred_2, revenue, ingreso_unidad, inversion)

# Final Results for Region 0
print('Results for Region 0:')
print(f"Average profit: {mean_0:.4f}")
print(f"2.5% quantile: {lower_0:.4f}")
print(f"97.5% quantile: {upper_0:.4f}")
print(f"Risk of losses: {risk_0:.4f}")
print()

# Final Results for Region 1
print('Results for Region 1:')
print(f"Average profit: {mean_1:.4f}")
print(f"2.5% quantile: {lower_1:.4f}")
print(f"97.5% quantile: {upper_1:.4f}")
print(f"Risk of losses: {risk_1:.4f}")
print()

# Final Results for Region 2
print('Results for Region 2:')
print(f"Average profit: {mean_2:.4f}")
print(f"2.5% quantile: {lower_2:.4f}")
print(f"97.5% quantile: {upper_2:.4f}")
print(f"Risk of losses: {risk_2:.4f}")

Results for Region 0:
Average profit: 6076210.1038
2.5% quantile: 456985.4821
97.5% quantile: 12665456.9854
Risk of losses: 2.0000

Results for Region 1:
Average profit: 6487234.5713
2.5% quantile: 1851853.5836
97.5% quantile: 11498099.1577
Risk of losses: 0.2000

Results for Region 2:
Average profit: 5927131.0425
2.5% quantile: -394909.1520
97.5% quantile: 12570865.8797
Risk of losses: 3.1000
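
The `risk_calculation` helper is defined in `src/bootstrapping.py` and is not shown here. The sketch below illustrates a bootstrap procedure consistent with the project conditions, assuming 1,000 resamples of 500 wells each, selection of the best 200 by prediction, and risk reported as the percentage of resamples with negative profit. The function names, the iteration count, the seed, and the synthetic demo data are assumptions.

```python
import numpy as np
import pandas as pd


def top_wells_profit(target, predictions, count, unit_revenue, budget):
    """Profit from the `count` wells with the highest predicted volume."""
    top_idx = predictions.sort_values(ascending=False).index[:count]
    return unit_revenue * target.loc[top_idx].sum() - budget


def bootstrap_risk_sketch(target, predictions, profit_fn, unit_revenue, budget,
                          n_points=500, n_best=200, n_iter=1000, seed=12345):
    """Hypothetical stand-in for risk_calculation (assumed behavior)."""
    rng = np.random.default_rng(seed)
    target = target.reset_index(drop=True)
    predictions = predictions.reset_index(drop=True)
    profits = []
    for _ in range(n_iter):
        # Draw well positions with replacement.
        pos = rng.integers(0, len(target), size=n_points)
        # Give each draw a fresh 0..n-1 index so repeated wells stay separate rows.
        sample_target = target.iloc[pos].reset_index(drop=True)
        sample_pred = predictions.iloc[pos].reset_index(drop=True)
        profits.append(profit_fn(sample_target, sample_pred, n_best, unit_revenue, budget))
    profits = pd.Series(profits)
    lower, upper = profits.quantile(0.025), profits.quantile(0.975)
    risk_pct = (profits < 0).mean() * 100  # share of losing resamples, in percent
    return profits.mean(), lower, upper, risk_pct, profits


# Synthetic demo data (not the real validation sets).
demo_rng = np.random.default_rng(0)
demo_target = pd.Series(demo_rng.uniform(0, 190, size=5000))
demo_pred = demo_target + pd.Series(demo_rng.normal(scale=30, size=5000))
mean_p, low_p, high_p, risk_p, values_p = bootstrap_risk_sketch(
    demo_target, demo_pred, top_wells_profit, 4500, 100_000_000, n_iter=200
)
print(f"Average profit: {mean_p:.2f}, risk of losses: {risk_p:.1f}%")
```

Resetting the index inside each resample matters: with label-based selection, a well drawn twice would otherwise be pulled in multiple times for a single selected label and inflate the profit.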


# 6. Conclusion
After completing the analysis (data preprocessing, linear regression model training, metric validation, and bootstrap risk analysis), Region 1 (geo_data_1) is identified as the best location for developing the 200 new oil wells.

This conclusion is based on the following factors:

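-	Predictive model accuracy: Region 1 has the lowest prediction error, with an RMSE of 0.89.
-	Average profit: it achieves the highest average bootstrap profit, 6,487,234.57 USD.
-	Risk of losses: it has the lowest risk, 0.2%, well below the 2.5% limit. Region 0 (2.0%) also meets the limit but has a lower average profit, while Region 2 (3.1%) exceeds it.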

Considering these factors, it is recommended that OilyGiant proceed with the development of the 200 new oil wells in Region 1.
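
The selection rule from the project conditions (keep regions with a risk of loss below 2.5%, then take the highest average profit) can be checked directly against the bootstrap results printed in Section 5. The sketch below copies those printed figures into a dictionary; the variable names are hypothetical.

```python
# Average profit (USD) and risk of losses (%) as printed in the risk calculation output.
results = {
    "Region 0": {"mean_profit": 6076210.1038, "risk_pct": 2.0},
    "Region 1": {"mean_profit": 6487234.5713, "risk_pct": 0.2},
    "Region 2": {"mean_profit": 5927131.0425, "risk_pct": 3.1},
}

RISK_LIMIT_PCT = 2.5  # threshold from the project conditions

# Step 1: keep only regions whose risk of loss is below the limit.
eligible = {name: r for name, r in results.items() if r["risk_pct"] < RISK_LIMIT_PCT}

# Step 2: among those, choose the region with the highest average profit.
best_region = max(eligible, key=lambda name: eligible[name]["mean_profit"])

print(sorted(eligible))  # ['Region 0', 'Region 1']
print(best_region)       # Region 1
```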