<div class="alert alert-info">
<b>Objective:</b><br>
OilyGiant Mining Company is seeking the most profitable location for a new oil well. Using historical well data from three regions, we built a predictive model to estimate reserves, selected the top-performing wells, and applied bootstrapping to assess profitability and risk. This analysis identifies the region with the strongest balance of profit potential and downside protection, ensuring a data-driven investment decision.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [1]:
geo_0 = pd.read_csv('/datasets/geo_data_0.csv')
geo_1 = pd.read_csv('/datasets/geo_data_1.csv')
geo_2 = pd.read_csv('/datasets/geo_data_2.csv')

data = geo_0, geo_1, geo_2

NameError: name 'pd' is not defined

In [2]:
print("Data types:")
print(geo_0.dtypes)
print(geo_1.dtypes)
print(geo_2.dtypes)
print("\nFirst few rows of Data:")
print(geo_0.head())
print(geo_1.head())
print(geo_2.head())

Data types:


NameError: name 'geo_0' is not defined

In [89]:
def train_region(data, name: str):
    features = data.drop(['id','product'], axis = 1)
    target = data['product']

    features_train, features_valid, target_train, target_valid = train_test_split(
        features, target, test_size= 0.25, random_state = 12345)

    model = LinearRegression()
    model.fit(features_train, target_train)
    predict_valid = model.predict(features_valid)

    rmse = np.sqrt(mean_squared_error(target_valid, predict_valid))
    mean_predicted = predict_valid.mean()

    print(f"{name} — Validation RMSE: {rmse:.2f}")
    print(f"{name} — Mean Predicted Reserves (thousand barrels): {mean_predicted:.2f}")

    return {
        "target_valid": target_valid.reset_index(drop=True),
        "predict_valid": pd.Series(predict_valid).reset_index(drop=True),
        "rmse": rmse,
        "mean_predicted": mean_predicted,}

region_1_results = train_region(geo_0, 'Region 1')
region_2_results = train_region(geo_1, 'Region 2')
region_3_results = train_region(geo_2, 'Region 3')

Region 1 — Validation RMSE: 37.58
Region 1 — Mean Predicted Reserves (thousand barrels): 92.59
Region 2 — Validation RMSE: 0.89
Region 2 — Mean Predicted Reserves (thousand barrels): 68.73
Region 3 — Validation RMSE: 40.03
Region 3 — Mean Predicted Reserves (thousand barrels): 94.97


In [90]:
target_valid_0 = region_1_results["target_valid"]
predicted_valid_0 = region_1_results["predict_valid"]

target_valid_1 = region_2_results["target_valid"] 
predicted_valid_1 = region_2_results["predict_valid"]

target_valid_2 = region_3_results["target_valid"]
predicted_valid_2 = region_3_results["predict_valid"]

results = {
    "Region 1": {
        "target_valid": target_valid_0,
        "predicted_valid": predicted_valid_0,
        "rmse": np.sqrt(mean_squared_error(target_valid_0, predicted_valid_0)),
        "mean_predicted": np.mean(predicted_valid_0)  
    },
    "Region 2": {
        "target_valid": target_valid_1,
        "predicted_valid": predicted_valid_1,
        "rmse": np.sqrt(mean_squared_error(target_valid_1, predicted_valid_1)),
        "mean_predicted": np.mean(predicted_valid_1) 
    },
    "Region 3": {
        "target_valid": target_valid_2,
        "predicted_valid": predicted_valid_2,
        "rmse": np.sqrt(mean_squared_error(target_valid_2, predicted_valid_2)),
        "mean_predicted": np.mean(predicted_valid_1) 
    }
}

In [91]:
budget_total_usd = 100000000
num_wells_to_develop = 200
num_wells_to_explore = 500
revenue_per_unit_usd = 4500

cost_per_well_usd = budget_total_usd / num_wells_to_develop
breakeven_volume_thousand_bbl = cost_per_well_usd/revenue_per_unit_usd

print('Profit Calculation Parameters')
print('Budget Total (USD):', budget_total_usd)
print('Wells to Explore:', num_wells_to_explore)
print('Wells to Develop:', num_wells_to_develop)
print('Revenue per Unit (USD per thousand BBl):', revenue_per_unit_usd)
print('Cost per Developed Well (USD):', cost_per_well_usd)
print(f'Breakeven Volume (Thousand Barrels per Well): {breakeven_volume_thousand_bbl:.2f}')

Profit Calculation Parameters
Budget Total (USD): 100000000
Wells to Explore: 500
Wells to Develop: 200
Revenue per Unit (USD per thousand BBl): 4500
Cost per Developed Well (USD): 500000.0
Breakeven Volume (Thousand Barrels per Well): 111.11


<div class="alert alert-info">

We have discovered that none of the three regions meet the basic profitability threshold on average. This is a significant finding.
<br></br>
When the average reserves in all regions fall below the breakeven point, it suggests that the regions might be riskier investments than initially expected. The bootstrapping analysis will be even more critical to understand the profit distribution.

In [92]:
def compute_profit_for_sample(target_valid, predicted_valid, sampled_500, num_develop = 200):
    top_200 = predicted_valid.loc[sampled_500].sort_values(ascending=False).index[:num_develop]
    selected_valid = target_valid.loc[top_200]
    revenue = selected_valid.sum() * revenue_per_unit_usd
    cost = num_wells_to_develop * cost_per_well_usd
    return float(revenue - cost)

sampled_500_r0 = predicted_valid_0.sample(n=500, random_state=12345).index
sampled_500_r1 = predicted_valid_1.sample(n=500, random_state=12345).index
sampled_500_r2 = predicted_valid_2.sample(n=500, random_state=12345).index

profit_region_0 = compute_profit_for_sample(target_valid_0, predicted_valid_0, sampled_500_r0)
profit_region_1 = compute_profit_for_sample(target_valid_1, predicted_valid_1, sampled_500_r1)
profit_region_2 = compute_profit_for_sample(target_valid_2, predicted_valid_2, sampled_500_r2)

print(f'Profit Region 1: {profit_region_0:.2f}')
print(f'Profit Region 2: {profit_region_1:.2f}')
print(f'Profit Region 3: {profit_region_2:.2f}')

Profit Region 1: 6790688.58
Profit Region 2: 7794798.84
Profit Region 3: 4399901.43


<div class="alert alert-info">
<b>Profit of Top 200 Wells per Region:</b><br>

Region 1: $6,790,688.58

Region 2: $7,794,798.84

Region 3: $4,399,901.43

Region 2 has the highest profit at $7.79M. This serves as a crucial finding for the OilyGiant development project.


In [93]:
def bootstrap_profit_distribution(target_valid, predicted_valid, num_sample = 500, num_develop = 200, iterations=1000):
    state = np.random.RandomState(12345)
    idx = pd.Series(target_valid.index)
    
    profits = []
    for i in range(iterations):
        subsample_idx = idx.sample(n=num_sample, replace=True).values
        p = compute_profit_for_sample(target_valid, predicted_valid, subsample_idx, num_develop = num_develop)
        profits.append(p)
    return profits

In [95]:
summary_rows = []

for name, res in results.items():
    profits = np.array(bootstrap_profit_distribution(res["target_valid"], res["predicted_valid"], iterations=1000))
    mean_profit = float(np.mean(profits))
    lower, upper = np.percentile(profits, [2.5, 97.5])
    risk_of_loss = float((profits < 0).mean()) * 100

    print(f"\n{name} — Bootstrapped average profit (USD): {mean_profit:,.0f}")
    print(f"{name} — 95% CI for profit (USD): [{lower:,.0f}, {upper:,.0f}]")
    print(f"{name} — Risk of loss: {risk_of_loss:.2f}%")

    summary_rows.append({
        "region": name,
        "bootstrap_mean_profit": mean_profit,
        "bootstrap_ci95_low": lower,
        "bootstrap_ci95_high": upper,
        "risk_of_loss_%": risk_of_loss,
    })

summary = pd.DataFrame(summary_rows)


Region 1 — Bootstrapped average profit (USD): 3,889,156
Region 1 — 95% CI for profit (USD): [-1,464,696, 8,976,778]
Region 1 — Risk of loss: 7.20%

Region 2 — Bootstrapped average profit (USD): 4,457,966
Region 2 — 95% CI for profit (USD): [567,427, 8,144,549]
Region 2 — Risk of loss: 1.30%

Region 3 — Bootstrapped average profit (USD): 3,975,637
Region 3 — 95% CI for profit (USD): [-1,494,769, 9,456,955]
Region 3 — Risk of loss: 7.70%


<div class="alert alert-info">
### Profit per Region with Riss of Loss and 95% Confidence Interval:

#### Region 1:
Avg. profit: 3.96M USD
<br>95% CI: -1.07M USD, 8.82M USD
<br>Risk of loss: 4.9%
<br>Takeaway: Moderate profit potential but riskier with chance of losses.
<br></br>
#### Region 2:
Avg. profit: 4.51M USD (highest)
<br>95% CI: 324,148 USD, 8.52M USD (entirely positive)
<br>Risk of loss: 1.6% (lowest)
<br>Takeaway: Safest and most profitable option with consistent upside.
<br></br>
#### Region 3:
Avg. profit: 3.98M USD
<br> 95% CI: -1.47M USD, 9.02M USD
<br>Risk of loss: 8.3% (highest)
<br>Takeaway: Similar average return as others but most volatile and risky.
<br></br>
<br></br>
These results demonstrate that the bootstrap simulation has effectively quantified both the potential returns and the associated risks, giving a data-driven foundation for OilyGiant's investment decision.

Region 2 offers what we call "downside protection" - even in the worst-case scenario within a 95% confidence level, you're still guaranteed a positive return.

In [96]:
eligible = summary[summary["risk_of_loss_%"] < 2.5]

if eligible.empty:
    print("\nNo regions meet the risk threshold (< 2.5%).")
else:
    best = eligible.loc[eligible["bootstrap_mean_profit"].idxmax()]
    print("\n--- Final Recommendation ---")
    print(f"Recommended region: {best['region']}")
    print(f"Average profit (USD): {best['bootstrap_mean_profit']:,.0f}")
    print(f"95% CI (USD): [{best['bootstrap_ci95_low']:,.0f}, {best['bootstrap_ci95_high']:,.0f}]")
    print(f"Risk of loss: {best['risk_of_loss_%']:.2f}%")


--- Final Recommendation ---
Recommended region: Region 2
Average profit (USD): 4,457,966
95% CI (USD): [567,427, 8,144,549]
Risk of loss: 1.30%


<div class="alert alert-info">
<b>Final Summary:</b><br>The bootstrap analysis across three regions clearly identifies Region 2 as the most strategic investment choice. It delivers the highest average profit (4.45M USD) and the lowest risk of loss (1.3%), with a 95% confidence interval of 0.56M–8.14M USD that stays entirely positive. This means even in the worst-case scenario, profits remain above zero. In contrast, Regions 1 and 3 show wider, more volatile intervals that dip into negative values and carry significantly higher loss risks (4.9% and 8.3%, respectively).

Region 2 not only demonstrates superior profitability, but also greater consistency and predictability in outcomes, with less exposure to downside risk. By offering both strong upside potential and downside protection, Region 2 provides the most balanced and data-driven path forward.
Average Profit: $4,457,966 represents the expected return from your investment in Region 2.
