# Optimal Oil Well Placement for OilyGiant Mining Company Using Linear Regression and Bootstrapping

<b>Introduction:</b>

This project identifies the most profitable region for drilling new oil wells for OilyGiant. Using geological exploration data from three regions, we train a linear regression model per region to predict the volume of oil reserves in each well. From these predictions we calculate the potential profit, estimate the risk of loss with bootstrapping, and choose the best region for development. The business constraints are a budget of $100 million for 200 wells and a risk of loss that must stay below 2.5%.

# Data Loading and Preparation

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Load datasets for three regions
data_0 = pd.read_csv('/datasets/geo_data_0.csv')
data_1 = pd.read_csv('/datasets/geo_data_1.csv')
data_2 = pd.read_csv('/datasets/geo_data_2.csv')

# Inspect datasets
print(data_0.head())
print(data_1.head())
print(data_2.head())


      id        f0        f1        f2     product
0  txEyH  0.705745 -0.497823  1.221170  105.280062
1  2acmU  1.334711 -0.340164  4.365080   73.037750
2  409Wp  1.022732  0.151990  1.419926   85.265647
3  iJLyR -0.032172  0.139033  2.978566  168.620776
4  Xdl7t  1.988431  0.155413  4.751769  154.036647
      id         f0         f1        f2     product
0  kBEdx -15.001348  -8.276000 -0.005876    3.179103
1  62mP7  14.272088  -3.475083  0.999183   26.953261
2  vyE1P   6.263187  -5.948386  5.001160  134.766305
3  KcrkZ -13.081196 -11.506057  4.999415  137.945408
4  AHL4O  12.702195  -8.147433  5.004363  134.766305
      id        f0        f1        f2     product
0  fwXo0 -1.146987  0.963328 -0.828965   27.758673
1  WJtFt  0.262778  0.269839 -2.530187   56.069697
2  ovLUW  0.194587  0.289035 -5.586433   62.871910
3  q6cA6  2.236060 -0.553760  0.930038  114.572842
4  WPMUX -0.515993  1.716266  5.899011  149.600746


<b>Explanation:</b>


We load the data for the three regions and display the first few rows to understand the structure (columns: id, f0, f1, f2, and product).


<b>Observations from the Data:</b>

Region 0:

In the first five rows, the feature values (f0, f1, f2) are small in magnitude, between roughly -0.5 and 4.8.
Oil reserves (product) range from about 73 to about 169 thousand barrels.

Region 1:

The f0 and f1 values are much larger in magnitude: f0 ranges from about -15.0 to 14.3 and f1 from about -11.5 to -3.5, while f2 stays between about 0 and 5.
The product (oil reserves) ranges from about 3 to about 138 thousand barrels. The value 134.766305 appears twice in five rows, which may hint that product takes a limited set of values in this region; this is worth checking on the full dataset.

Region 2:

The f0 and f1 values lie between about -1.1 and 2.2, while f2 varies more widely, from about -5.6 to 5.9.
Oil reserves range from about 28 to about 150 thousand barrels.

Five rows per region are too few to characterize the distributions, so these observations are only a first impression. In this sample, Region 1 has the most extreme f0 and f1 values, while Regions 0 and 2 have smaller feature values. All three regions show wide variation in actual oil reserves (the target), and the largest reserves in these samples appear in Regions 0 and 2.
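
Because `head()` shows only five rows, a fuller check of ranges, missing values, and duplicate ids is useful before modeling. The following is a minimal sketch, assuming each region's DataFrame has the columns shown above; the `summarize_region` helper and the small demo frame are hypothetical and stand in for the real files.

```python
import pandas as pd

def summarize_region(df: pd.DataFrame) -> pd.DataFrame:
    """Return min, max, mean, and missing-value count for each numeric column."""
    numeric = df.select_dtypes("number")
    summary = numeric.agg(["min", "max", "mean"]).T
    summary["n_missing"] = numeric.isna().sum()
    return summary

# Hypothetical demo data (not taken from the real datasets)
demo = pd.DataFrame({
    "id": ["a", "b", "c", "a"],
    "f0": [0.5, -1.0, 2.0, 0.5],
    "f1": [0.1, 0.2, None, 0.1],
    "f2": [1.0, 3.0, 5.0, 1.0],
    "product": [50.0, 100.0, 150.0, 50.0],
})

summary = summarize_region(demo)
n_duplicate_ids = int(demo["id"].duplicated().sum())
print(summary)
print(f"Duplicate ids: {n_duplicate_ids}")
```

On the real data, the same helper could be called once per region, for example `summarize_region(data_0)`.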

# Splitting Data into Training and Validation Sets

In [2]:
# Split the data for each region into training and validation sets (75:25)
def split_data(data):
    features = data.drop(columns=['product', 'id'])
    target = data['product']
    return train_test_split(features, target, test_size=0.25, random_state=42)

X_train_0, X_val_0, y_train_0, y_val_0 = split_data(data_0)
X_train_1, X_val_1, y_train_1, y_val_1 = split_data(data_1)
X_train_2, X_val_2, y_train_2, y_val_2 = split_data(data_2)


<b>Explanation:</b>

Each region's dataset is split into training (75%) and validation (25%) sets, with a fixed random_state of 42 so that the split is reproducible. The features are f0, f1, and f2 (the id column is dropped because it is only an identifier), and product is the target variable (oil reserves). The training sets (X_train and y_train) are used to fit the model, while the validation sets (X_val and y_val) are held out to assess its predictive performance.
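
A quick sanity check on a split is to confirm the set sizes and that no row ends up in both sets. The sketch below assumes scikit-learn is available and uses a hypothetical 100-row synthetic frame with the same column names, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical synthetic frame with the same columns as the regional datasets
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "id": [f"w{i}" for i in range(100)],
    "f0": rng.normal(size=100),
    "f1": rng.normal(size=100),
    "f2": rng.normal(size=100),
    "product": rng.uniform(0, 190, size=100),
})

features = demo.drop(columns=["product", "id"])
target = demo["product"]
X_train, X_val, y_train, y_val = train_test_split(
    features, target, test_size=0.25, random_state=42
)

# 75 training rows, 25 validation rows, and no shared row labels
print(X_train.shape, X_val.shape)
print(set(X_train.index).isdisjoint(X_val.index))
```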

# Model Training and Validation

In [3]:
# Train the model using Linear Regression
def train_and_evaluate(X_train, y_train, X_val, y_val):
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
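    # Take the square root of the MSE explicitly; the squared=False argument is deprecated in recent scikit-learn releases
    rmse = np.sqrt(mean_squared_error(y_val, predictions))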
    return predictions, rmse

pred_0, rmse_0 = train_and_evaluate(X_train_0, y_train_0, X_val_0, y_val_0)
pred_1, rmse_1 = train_and_evaluate(X_train_1, y_train_1, X_val_1, y_val_1)
pred_2, rmse_2 = train_and_evaluate(X_train_2, y_train_2, X_val_2, y_val_2)

# Print RMSE and average predicted volume for each region
print(f"Region 0 - RMSE: {rmse_0}, Avg predicted reserves: {pred_0.mean()}")
print(f"Region 1 - RMSE: {rmse_1}, Avg predicted reserves: {pred_1.mean()}")
print(f"Region 2 - RMSE: {rmse_2}, Avg predicted reserves: {pred_2.mean()}")



Region 0 - RMSE: 37.756600350261685, Avg predicted reserves: 92.3987999065777
Region 1 - RMSE: 0.890280100102884, Avg predicted reserves: 68.71287803913762
Region 2 - RMSE: 40.14587231134218, Avg predicted reserves: 94.77102387765939



<b>Intermediate Results:</b>

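Region 1 stands out with the lowest RMSE (0.89 thousand barrels), so its predictions are by far the most accurate of the three regions. It also has the lowest average predicted reserves (about 68.7 thousand barrels per well).

Regions 0 and 2 have much higher RMSE values (37.76 and 40.15 thousand barrels, respectively), which are large relative to their average predicted reserves (about 92.4 and 94.8 thousand barrels). The linear regression model therefore predicts individual wells in these regions only roughly.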
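
RMSE is expressed in the same units as the target (thousand barrels), and it is easier to interpret next to a naive baseline that always predicts the mean. The sketch below computes both by hand with NumPy; the `rmse` helper and the four demo values are hypothetical and only illustrate the comparison.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical reserves and predictions, in thousand barrels
y_true = np.array([100.0, 50.0, 150.0, 80.0])
y_model = np.array([90.0, 60.0, 140.0, 90.0])

# Baseline: predict the mean of the targets for every well
y_baseline = np.full_like(y_true, y_true.mean())

model_rmse = rmse(y_true, y_model)
baseline_rmse = rmse(y_true, y_baseline)
print(f"Model RMSE: {model_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}")
```

A model whose RMSE is close to the baseline's adds little information about individual wells, while a much smaller RMSE indicates that the features explain most of the variation.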


# Profit Calculation Preparation

In [4]:
# Constants
BUDGET = 100_000_000  # Total budget: 100 million USD for 200 wells
WELL_COST = BUDGET / 200  # Budget per well (500,000 USD)
REVENUE_PER_BARREL = 4500  # USD per unit of product; one unit is 1,000 barrels (despite the name)

# Break-even reserves per well: the volume at which revenue equals the cost of the well
sufficient_reserves = WELL_COST / REVENUE_PER_BARREL
print(f"Sufficient reserves per well: {sufficient_reserves} thousand barrels")



Sufficient reserves per well: 111.11111111111111 thousand barrels
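
The break-even figure follows directly from the budget and the revenue per unit. The average predicted reserves reported earlier (about 92.4, 68.7, and 94.8 thousand barrels) are all below it, which is why wells have to be selected by predicted volume rather than drilled at random. The arithmetic can be sketched as follows; the names `N_WELLS` and `REVENUE_PER_UNIT` are illustrative and are not used elsewhere in this notebook.

```python
BUDGET = 100_000_000      # USD for the whole project
N_WELLS = 200             # wells to be developed
REVENUE_PER_UNIT = 4_500  # USD per thousand barrels

well_cost = BUDGET / N_WELLS                        # cost of one well in USD
break_even_per_well = well_cost / REVENUE_PER_UNIT  # thousand barrels per well
break_even_total = BUDGET / REVENUE_PER_UNIT        # thousand barrels across all 200 wells

print(f"Cost per well: {well_cost:,.0f} USD")
print(f"Break-even per well: {break_even_per_well:.2f} thousand barrels")
print(f"Break-even in total: {break_even_total:.2f} thousand barrels")
```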


# Profit Calculation Function

In [5]:


def calculate_profit(predictions, actual_reserves, n_wells=200):
    # Ensure indices align by resetting them
    predictions = predictions.reset_index(drop=True)
    actual_reserves = actual_reserves.reset_index(drop=True)
    
    # Select the wells with the highest predicted reserves
    top_well_indices = predictions.sort_values(ascending=False).head(n_wells).index
    
    # Sum the actual reserves for these wells
    total_reserves = actual_reserves.loc[top_well_indices].sum()
    
    # Calculate the profit
    profit = total_reserves * REVENUE_PER_BARREL - BUDGET
    return profit


# Calculate profit for each region
profit_0 = calculate_profit(pd.Series(pred_0), y_val_0)
profit_1 = calculate_profit(pd.Series(pred_1), y_val_1)
profit_2 = calculate_profit(pd.Series(pred_2), y_val_2)

print(f"Region 0 - Profit: {profit_0}")
print(f"Region 1 - Profit: {profit_1}")
print(f"Region 2 - Profit: {profit_2}")



Region 0 - Profit: 33591411.14462179
Region 1 - Profit: 24150866.966815114
Region 2 - Profit: 25985717.59374112


<b>Analysis and Implications</b>

<b>Profitability vs. Prediction Accuracy:</b>

These profits assume that the 200 wells with the highest predicted reserves are chosen from the entire validation set, with the profit then computed from their actual reserves.

Region 0 has the highest profit (about $33.6 million), but it also has a high RMSE (37.76), indicating large prediction errors. There is therefore more uncertainty about which wells are truly the best, which could lead to variability in actual profits.

Region 1 has the lowest profit in this calculation (about $24.2 million) but the most accurate predictions (lowest RMSE). Its profit figure is therefore more reliable, with a lower risk of significant deviations.

Region 2 falls between the other two on profit (about $26.0 million) but has the highest RMSE (40.15), so its figure carries the most prediction uncertainty.

<b>Risk Assessment:</b>

Regions 0 and 2 look attractive on profit alone. However, their larger prediction errors mean that the wells ranked highest by the model are less certain to be the best wells, so the risk of loss is higher if actual reserves fall short of predictions.

Region 1 may be safer because its predictions are far more accurate, even though its average predicted reserves and its profit in this calculation are lower. A single profit figure per region cannot quantify that risk, which is why the next section uses bootstrapping.
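
The profit calculation ranks wells by predicted reserves but sums their actual reserves, so a well that the model overestimates still counts at its true volume. A small worked example makes this concrete; the `profit_for_top_wells` helper, the four wells, and the reduced budget are hypothetical values chosen to keep the arithmetic easy to follow.

```python
import pandas as pd

REVENUE_PER_UNIT = 4_500  # USD per thousand barrels

def profit_for_top_wells(predicted, actual, n_wells, budget):
    """Pick the n_wells with the highest predictions and value them at their actual reserves."""
    predicted = pd.Series(predicted).reset_index(drop=True)
    actual = pd.Series(actual).reset_index(drop=True)
    top = predicted.nlargest(n_wells).index
    return actual.loc[top].sum() * REVENUE_PER_UNIT - budget

# Hypothetical wells, in thousand barrels
predicted = [120.0, 80.0, 150.0, 60.0]
actual = [100.0, 90.0, 130.0, 70.0]

# The two highest predictions are wells 2 and 0, whose actual reserves are 130 and 100:
# (130 + 100) * 4,500 - 1,000,000 = 35,000 USD
demo_profit = profit_for_top_wells(predicted, actual, n_wells=2, budget=1_000_000)
print(demo_profit)
```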

# Risk and Profit Analysis using Bootstrapping

In [6]:

# Set a random seed for reproducibility
np.random.seed(42)

def bootstrap_profit(predictions, actual_reserves, n_samples=500, n_wells=200, iterations=1000):
    # Ensure indices align by resetting them
    predictions = predictions.reset_index(drop=True)
    actual_reserves = actual_reserves.reset_index(drop=True)
    
    profits = []
    
    for i in range(iterations):
        # Randomly sample with replacement
        sample_indices = np.random.choice(predictions.index, size=n_samples, replace=True)
        sample_predictions = predictions.loc[sample_indices]
        sample_actual_reserves = actual_reserves.loc[sample_indices]
        
        # Calculate profit using the top n_wells from sampled wells
        profit = calculate_profit(sample_predictions, sample_actual_reserves, n_wells=n_wells)
        profits.append(profit)
    
    # Calculate average profit and 95% confidence interval
    profits = pd.Series(profits)
    avg_profit = profits.mean()
    confidence_interval = np.percentile(profits, [2.5, 97.5])
    
    # Calculate risk of loss (probability of negative profit)
    risk_of_loss = (profits < 0).mean()
    
    return avg_profit, confidence_interval, risk_of_loss


# Perform bootstrapping for each region
mean_profit_0, conf_interval_0, risk_0 = bootstrap_profit(pd.Series(pred_0), y_val_0)
mean_profit_1, conf_interval_1, risk_1 = bootstrap_profit(pd.Series(pred_1), y_val_1)
mean_profit_2, conf_interval_2, risk_2 = bootstrap_profit(pd.Series(pred_2), y_val_2)

# Print the results
print(f"Region 0 - Mean Profit: {mean_profit_0}, Confidence Interval: {conf_interval_0}, Risk of Loss: {risk_0*100}%")
print(f"Region 1 - Mean Profit: {mean_profit_1}, Confidence Interval: {conf_interval_1}, Risk of Loss: {risk_1*100}%")
print(f"Region 2 - Mean Profit: {mean_profit_2}, Confidence Interval: {conf_interval_2}, Risk of Loss: {risk_2*100}%")




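# Final recommendation: among regions with a risk of loss below 2.5%,
# choose the one with the highest mean profit.
# The risk_* values are fractions (e.g. 0.009), so the threshold is 0.025, not 2.5.
MAX_RISK = 0.025
candidates = {
    "Region 0": (mean_profit_0, risk_0),
    "Region 1": (mean_profit_1, risk_1),
    "Region 2": (mean_profit_2, risk_2),
}
eligible = {name: profit for name, (profit, risk) in candidates.items() if risk < MAX_RISK}

if eligible:
    best_region = max(eligible, key=eligible.get)
    print(f"The best region for development is {best_region}.")
else:
    print("No region meets the 2.5% risk-of-loss limit.")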


Region 0 - Mean Profit: 3995754.780542297, Confidence Interval: [-1104678.95331971  8974603.27717879], Risk of Loss: 6.0%
Region 1 - Mean Profit: 4525765.942909006, Confidence Interval: [ 523094.09802734 8301463.13264741], Risk of Loss: 0.8999999999999999%
Region 2 - Mean Profit: 3787059.0365973767, Confidence Interval: [-1277794.34988305  9079234.83221064], Risk of Loss: 7.5%
The best region for development is Region 1.


<b>Observations and Analysis</b>

Mean Profit:

Region 1 has the highest mean profit ($4,525,765.94), ahead of Region 0 ($3,995,754.78) and Region 2 ($3,787,059.04). This reverses the ranking from the single profit calculation above. A plausible explanation is that Region 1's very low RMSE lets the model pick the genuinely best 200 wells out of each random sample of 500, whereas in Regions 0 and 2 the selection is much noisier.

Confidence Interval:

The 95% intervals (the 2.5th and 97.5th percentiles of the bootstrap profits) show the range of profits to expect in each region:

Region 1's interval is entirely positive (about $0.52 million to $8.30 million), suggesting a high likelihood of a profitable outcome.
In contrast, Region 0 and Region 2 have negative lower bounds (about -$1.10 million and -$1.28 million), indicating that while there is potential for profit, there is also a meaningful risk of loss.

Region 1's interval is also the narrowest (about $7.8 million wide, versus about $10.1 million for Region 0 and $10.4 million for Region 2), which suggests more certainty in its profit estimate.

Risk of Loss:

The risk of loss in Region 1 is the lowest (0.9%), and it is the only region below the 2.5% limit.
Regions 0 and 2 show higher risks of loss (6.0% and 7.5%, respectively), both of which exceed the limit.
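
The bootstrap procedure used above can also be written with a local random generator instead of the global `np.random.seed`, which keeps the result reproducible regardless of what other code runs first. The following is a minimal sketch under that assumption; the `bootstrap_profits` function and the synthetic reserves are hypothetical and do not reproduce the figures reported for the three regions.

```python
import numpy as np

def bootstrap_profits(predicted, actual, n_sample, n_wells, budget,
                      revenue_per_unit, iterations=1000, seed=42):
    """Return one profit per bootstrap iteration."""
    rng = np.random.default_rng(seed)  # local generator, independent of global state
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    profits = np.empty(iterations)
    for i in range(iterations):
        # Draw n_sample wells with replacement
        idx = rng.integers(0, len(predicted), size=n_sample)
        # Keep the n_wells sampled wells with the highest predictions
        top = idx[np.argsort(predicted[idx])[-n_wells:]]
        # Value the chosen wells at their actual reserves
        profits[i] = actual[top].sum() * revenue_per_unit - budget
    return profits

# Hypothetical synthetic reserves and noisy predictions, in thousand barrels
data_rng = np.random.default_rng(0)
actual = data_rng.uniform(0, 190, size=2000)
predicted = actual + data_rng.normal(0, 40, size=2000)

profits = bootstrap_profits(predicted, actual, n_sample=500, n_wells=200,
                            budget=100_000_000, revenue_per_unit=4_500)
lower, upper = np.percentile(profits, [2.5, 97.5])
risk_of_loss = (profits < 0).mean()  # fraction of iterations with negative profit
print(f"Mean: {profits.mean():,.0f}, 95% interval: [{lower:,.0f}, {upper:,.0f}], "
      f"risk of loss: {risk_of_loss:.1%}")
```

Note that `risk_of_loss` is a fraction between 0 and 1, so it should be compared against 0.025 rather than 2.5 when checking the 2.5% limit.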

# Final Conclusion:



Based on the comprehensive analysis using linear regression, profit calculations, and bootstrapping, here are the conclusions and recommendations for OilyGiant Mining Company regarding the optimal region for drilling new oil wells:

Region 0:
Profit Potential: Highest profit when the top 200 wells are chosen from the full validation set (about $33.6 million), but a bootstrap mean profit of about $4.0 million, below Region 1.
High Prediction Error: An RMSE of 37.76 means individual well predictions are unreliable.
Confidence Interval: The 95% interval runs from about -$1.10 million to $8.97 million, so losses are plausible.
Risk of Loss: 6.0%, above the 2.5% limit.
Recommendation: Not recommended under the stated risk constraint.

Region 1:
Most Accurate Predictions: The lowest RMSE (0.89) shows that the model is reliable here.
Highest Mean Profit: About $4.53 million in the bootstrap, despite the lowest average predicted reserves (about 68.7 thousand barrels per well).
Narrowest Confidence Interval: About $0.52 million to $8.30 million, entirely positive.
Risk of Loss: 0.9%, the only value below the 2.5% limit.
Recommendation: Recommended. It is the only region that satisfies the risk constraint, and it also has the highest mean profit.

Region 2:
Profit Potential: Highest average predicted reserves (about 94.8 thousand barrels per well), but the lowest bootstrap mean profit (about $3.79 million).
Highest Prediction Error: An RMSE of 40.15, the largest of the three regions.
Confidence Interval: The 95% interval runs from about -$1.28 million to $9.08 million.
Risk of Loss: 7.5%, above the 2.5% limit.
Recommendation: Not recommended under the stated risk constraint.

Overall Recommendation:

For OilyGiant Mining Company, Region 1 is the recommended region for development.
It is the only region whose risk of loss (0.9%) is below the required 2.5%, and it also offers the highest bootstrap mean profit.
Regions 0 and 2 show higher profit in the single best-case selection, but their risks of loss (6.0% and 7.5%) exceed the limit, so they do not meet the business constraints.
