Hello William!

I’m happy to review your project today.
I will mark your mistakes and give you some hints on how to fix them. We are getting ready for a real job, where your team leader or senior colleague will do exactly the same. Don't worry and study with pleasure!

Below you will find my comments - **please do not move, modify or delete them**.

You can find my comments in green, yellow or red boxes like this:

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Success. Everything is done successfully.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Remarks. Some recommendations.
</div>

<div class="alert alert-block alert-danger">

<b>Reviewer's comment</b> <a class="tocSkip"></a>

Needs fixing. The block requires some corrections. Work can't be accepted with the red comments.
</div>

You can answer me by using this:

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Thank you so much for the feedback, I appreciate it! I should have double-checked before submitting. Thanks!
</div>



🛢️ Project Introduction

The goal of this project is to assist OilyGiant, a mining company, in selecting the most profitable region for the development of a new oil well site. The company has conducted geological surveys in three prospective regions, collecting data on oil well characteristics and reserves.

Using machine learning and statistical techniques, this project aims to:

Build predictive models to estimate oil reserves in new wells.

Select the most promising wells based on predicted reserves.

Calculate the expected profit and assess the financial risk of developing wells in each region.

Recommend the region with the highest expected profit and the lowest risk.

The analysis follows a structured process that includes model training, profit calculation, and a detailed risk assessment using bootstrapping. All business conditions, such as budget constraints and minimum required profitability, are carefully considered to ensure realistic and data-driven decision-making.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

In [3]:
# Read the CSV files into DataFrames
df_0 = pd.read_csv('/datasets/geo_data_0.csv')
df_1 = pd.read_csv('/datasets/geo_data_1.csv')
df_2 = pd.read_csv('/datasets/geo_data_2.csv')

# Display basic information about the datasets
# (info() prints its report and returns None, so there is nothing to assign)
df_0.info()
df_1.info()
df_2.info()

# Summary statistics for each dataset
df_0.describe(), df_1.describe(), df_2.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null 

(                  f0             f1             f2        product
 count  100000.000000  100000.000000  100000.000000  100000.000000
 mean        0.500419       0.250143       2.502647      92.500000
 std         0.871832       0.504433       3.248248      44.288691
 min        -1.408605      -0.848218     -12.088328       0.000000
 25%        -0.072580      -0.200881       0.287748      56.497507
 50%         0.502360       0.250252       2.515969      91.849972
 75%         1.073581       0.700646       4.715088     128.564089
 max         2.362331       1.343769      16.003790     185.364347,
                   f0             f1             f2        product
 count  100000.000000  100000.000000  100000.000000  100000.000000
 mean        1.141296      -4.796579       2.494541      68.825000
 std         8.965932       5.119872       1.703572      45.944423
 min       -31.609576     -26.358598      -0.018144       0.000000
 25%        -6.298551      -8.267985       1.000021      26.9

Data Preparation and Exploration
The three datasets represent different oil exploration regions. Here is a summary of the data preparation:

✅ Common Information for All Three Datasets

Rows: 100,000

Columns: 5 (id, f0, f1, f2, product)

product: Target variable, volume of reserves in thousand barrels.

No missing values found in any dataset.

Summary Statistics

Region 0

Mean reserves (product): ~92.5

Min - Max reserves: 0 to ~185

Features (f0, f1, f2) are on a small scale with moderate spread: f0 has a mean of ~0.50 (std ~0.87), f1 a mean of ~0.25 (std ~0.50), and f2 a mean of ~2.50 (std ~3.25).

Region 1

Mean reserves: ~68.8 (lowest)

Min - Max reserves: 0 to ~138

Feature values are far more dispersed than in Region 0, especially f0 (std ~8.97) and f1 (std ~5.12), which might affect model stability.

Region 2

Mean reserves: ~95.0 (highest)

Min - Max reserves: 0 to ~190

Feature distributions are tighter than Region 1 and more centralized, similar to Region 0.
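The checks summarized above can be reproduced with a small helper. The sketch below is illustrative only: `quick_checks` and the toy DataFrame are hypothetical and not part of the original notebook. In the notebook, the helper would be called on `df_0`, `df_1`, and `df_2`.

```python
import pandas as pd


def quick_checks(df: pd.DataFrame) -> dict:
    """Return the row count, total missing values, and basic target statistics."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "mean_product": float(df["product"].mean()),
        "min_product": float(df["product"].min()),
        "max_product": float(df["product"].max()),
    }


# Toy stand-in with the same columns as the real datasets (values are made up).
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],
    "f0": [0.1, 0.2, 0.3, 0.4],
    "f1": [1.0, -1.0, 0.5, 0.0],
    "f2": [2.0, 3.0, 4.0, 5.0],
    "product": [0.0, 50.0, 100.0, 150.0],
})

checks = quick_checks(toy)
print(checks)
```

Returning a plain dictionary keeps the per-region results easy to compare side by side, for example by building a DataFrame from one dictionary per region.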

In [4]:
# Helper function to train model and evaluate
def train_and_evaluate(data):
    # Features and target
    features = data.drop(columns=['id', 'product'])
    target = data['product']

    # Split the data
    X_train, X_valid, y_train, y_valid = train_test_split(
        features, target, test_size=0.25, random_state=42
    )

    # Train the model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Predict
    predictions = model.predict(X_valid)

    # Evaluate (np.sqrt of the MSE avoids the `squared` argument,
    # which newer scikit-learn releases no longer accept)
    rmse = np.sqrt(mean_squared_error(y_valid, predictions))
    mean_predicted = predictions.mean()

    return model, X_valid, y_valid, predictions, rmse, mean_predicted

# Train and evaluate models for all regions
results_0 = train_and_evaluate(df_0)
results_1 = train_and_evaluate(df_1)
results_2 = train_and_evaluate(df_2)

results_summary = {
    'Region 0': {'RMSE': results_0[4], 'Mean Prediction': results_0[5]},
    'Region 1': {'RMSE': results_1[4], 'Mean Prediction': results_1[5]},
    'Region 2': {'RMSE': results_2[4], 'Mean Prediction': results_2[5]},
}

results_summary

{'Region 0': {'RMSE': 37.756600350261685, 'Mean Prediction': 92.3987999065777},
 'Region 1': {'RMSE': 0.890280100102884, 'Mean Prediction': 68.71287803913762},
 'Region 2': {'RMSE': 40.14587231134218, 'Mean Prediction': 94.77102387765939}}

Analysis

Region 1 has an extremely low RMSE (~0.89), indicating almost perfect predictions. However, this could suggest data leakage or that the target is trivially dependent on the features.

Region 0 and Region 2 have RMSE values of about 37.8 and 40.1. For scale, the standard deviation of product in Region 0 is about 44.3, so the linear model there is only modestly better than always predicting the mean.

Region 2 has the highest average predicted reserves, making it potentially the most profitable.


<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Everything is correct
    
</div>
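One way to judge whether an RMSE is "good" is to compare it with a baseline that always predicts the training mean. The sketch below is a hedged illustration on synthetic data: `rmse`, `baseline_rmse`, and the generated arrays are hypothetical and are not taken from the notebook.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def baseline_rmse(y_train, y_valid):
    """RMSE of a constant model that always predicts the training mean."""
    constant = float(np.mean(y_train))
    return rmse(y_valid, np.full(len(y_valid), constant))


# Synthetic data (assumption): one feature with a linear signal plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.5, size=1000)
x_train, x_valid = x[:750], x[750:]
y_train, y_valid = y[:750], y[750:]

# Fit a straight line on the training part and score both models.
slope, intercept = np.polyfit(x_train, y_train, 1)
model_rmse = rmse(y_valid, slope * x_valid + intercept)
base_rmse = baseline_rmse(y_train, y_valid)

print(f"model RMSE: {model_rmse:.2f}, baseline RMSE: {base_rmse:.2f}")
```

A model whose RMSE is close to the baseline adds little information, while a model far below it (as Region 1 appears to be) deserves a closer look at how the target relates to the features.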

In [5]:
# Constants
BUDGET = 100_000_000  # in USD
WELL_COUNT = 200
REVENUE_PER_BARREL = 4.5  # in USD
REVENUE_PER_THOUSAND_BARRELS = REVENUE_PER_BARREL * 1000  # 4500 USD

# Calculate the break-even point: minimum reserves (in thousand barrels) needed per well
break_even_reserve = BUDGET / (WELL_COUNT * REVENUE_PER_THOUSAND_BARRELS)

# Mean actual reserves in each region
mean_reserve_0 = df_0['product'].mean()
mean_reserve_1 = df_1['product'].mean()
mean_reserve_2 = df_2['product'].mean()

# Store the results
break_even_and_means = {
    "Break-even reserve (thousand barrels)": break_even_reserve,
    "Region 0 Mean Reserve": mean_reserve_0,
    "Region 1 Mean Reserve": mean_reserve_1,
    "Region 2 Mean Reserve": mean_reserve_2,
}

break_even_and_means


{'Break-even reserve (thousand barrels)': 111.11111111111111,
 'Region 0 Mean Reserve': 92.50000000000001,
 'Region 1 Mean Reserve': 68.82500000000002,
 'Region 2 Mean Reserve': 95.00000000000004}

💰 Profitability Preparation Results

Break-even reserve per well: 111.11 thousand barrels

Mean reserves per region:

Region 0: 92.50

Region 1: 68.83

Region 2: 95.00

⚠️ Findings

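The mean reserve in each of the three regions falls short of the break-even volume of 111.11 thousand barrels per well, so developing wells chosen at random would not pay off on average; wells have to be selected.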

Region 2 is closest to the threshold and also has the highest average predicted reserve from our model.

This implies that while profitability is uncertain overall, Region 2 is the most promising candidate.

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Good job!
    
</div>
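The arithmetic behind the break-even figure can be written out step by step. This is a small worked example using the constants and the regional means reported above; the names `cost_per_well` and `shortfall` are introduced here for illustration only.

```python
BUDGET = 100_000_000                   # USD for the whole development
WELL_COUNT = 200                       # wells to be developed
REVENUE_PER_THOUSAND_BARRELS = 4_500   # USD per thousand barrels

# Each well must earn back its share of the budget.
cost_per_well = BUDGET / WELL_COUNT                        # 500,000 USD
break_even = cost_per_well / REVENUE_PER_THOUSAND_BARRELS  # about 111.11

# Mean reserves per region, as reported in the notebook output.
region_means = {"Region 0": 92.50, "Region 1": 68.825, "Region 2": 95.00}

# How far each regional mean sits below the break-even volume.
shortfall = {region: break_even - mean for region, mean in region_means.items()}

for region, gap in shortfall.items():
    print(f"{region}: {gap:.2f} thousand barrels below break-even")
```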

In [6]:
# Function to calculate profit from top 200 wells
def calculate_profit(predictions, targets):
    # Convert to Series for indexing
    predictions = pd.Series(predictions)
    targets = targets.reset_index(drop=True)

    # Select top 200 predictions and their corresponding actual values
    top_200_indices = predictions.sort_values(ascending=False).head(WELL_COUNT).index
    selected_targets = targets.iloc[top_200_indices]

    # Calculate profit
    total_revenue = selected_targets.sum() * REVENUE_PER_THOUSAND_BARRELS
    profit = total_revenue - BUDGET
    return profit

# Calculate profits for each region
profit_0 = calculate_profit(results_0[3], results_0[2])
profit_1 = calculate_profit(results_1[3], results_1[2])
profit_2 = calculate_profit(results_2[3], results_2[2])

# Collect results
profit_estimates = {
    'Region 0 Estimated Profit (USD)': profit_0,
    'Region 1 Estimated Profit (USD)': profit_1,
    'Region 2 Estimated Profit (USD)': profit_2
}

profit_estimates


{'Region 0 Estimated Profit (USD)': 33591411.14462179,
 'Region 1 Estimated Profit (USD)': 24150866.966815114,
 'Region 2 Estimated Profit (USD)': 25985717.59374112}

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

The function looks correct. Well done!
    
</div>

Findings

Region 0 shows the highest estimated profit when selecting the top 200 predicted wells.

Despite Region 2 having the highest average reserves, Region 0 yields the most profitable top subset.

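We now need to assess risks and uncertainties using bootstrapping: 1,000 iterations, each drawing 500 wells with replacement from the validation set and developing the 200 with the highest predicted reserves. From the resulting profit distribution we compute: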

Average profit

95% confidence interval

Probability of loss (risk)
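The three quantities listed above can be sketched as follows. Note that this is not the notebook's implementation: it samples wells by position with NumPy instead of by pandas index label, and it runs on synthetic data. The names `bootstrap_profits` and `summarize`, and the generated arrays, are assumptions made for the example.

```python
import numpy as np

BUDGET = 100_000_000
REVENUE_PER_THOUSAND_BARRELS = 4_500


def bootstrap_profits(predictions, targets, n_iterations=1000,
                      sample_size=500, top_wells=200, seed=42):
    """Simulate profits by resampling wells with replacement."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    rng = np.random.default_rng(seed)
    profits = np.empty(n_iterations)
    for i in range(n_iterations):
        # Draw positions with replacement; duplicates are expected.
        pos = rng.integers(0, len(predictions), size=sample_size)
        # Keep the positions of the wells with the highest predictions.
        best = pos[np.argsort(predictions[pos])[::-1][:top_wells]]
        profits[i] = targets[best].sum() * REVENUE_PER_THOUSAND_BARRELS - BUDGET
    return profits


def summarize(profits):
    """Mean profit, 95% percentile interval, and share of losing runs."""
    lower, upper = np.quantile(profits, [0.025, 0.975])
    return {
        "mean_profit": float(profits.mean()),
        "ci_95": (float(lower), float(upper)),
        "risk_of_loss_pct": float((profits < 0).mean() * 100),
    }


# Synthetic validation set (assumption): true reserves plus noisy predictions.
data_rng = np.random.default_rng(0)
true_reserves = data_rng.uniform(0, 190, size=25_000)
noisy_predictions = true_reserves + data_rng.normal(scale=40, size=25_000)

profits = bootstrap_profits(noisy_predictions, true_reserves)
summary = summarize(profits)
print(summary)
```

Working with positions means every iteration selects exactly `top_wells` wells, even when the same well is drawn more than once.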


In [7]:
# Training function
def train_and_evaluate(data):
    features = data.drop(columns=['id', 'product'])
    target = data['product']
    X_train, X_valid, y_train, y_valid = train_test_split(features, target, test_size=0.25, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
    rmse = np.sqrt(mean_squared_error(y_valid, predictions))  # works without the `squared` argument
    mean_predicted = predictions.mean()
    return model, X_valid, y_valid, predictions, rmse, mean_predicted

# Train models
results_0 = train_and_evaluate(df_0)
results_1 = train_and_evaluate(df_1)
results_2 = train_and_evaluate(df_2)

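# Corrected calculate_profit: order the targets by predicted volume, then keep the top wells.
# .head() is applied after the .loc lookup because bootstrap samples contain duplicate
# index labels, so a label-based lookup can return more rows than labels requested.
def calculate_profit(predictions, targets, top_wells=200):
    predictions = pd.Series(predictions)
    selected_targets = targets.loc[predictions.sort_values(ascending=False).index].head(top_wells)
    total_revenue = selected_targets.sum() * REVENUE_PER_THOUSAND_BARRELS
    return total_revenue - BUDGET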

# Correct bootstrapping function
def bootstrap_profit_with_function(predictions, targets, n_iterations=1000, sample_size=500, top_wells=200):
    state = np.random.RandomState(42)
    profits = []

    for _ in range(n_iterations):
        sample_indices = state.choice(predictions.index, size=sample_size, replace=True)
        sample_predictions = predictions.loc[sample_indices]
        sample_targets = targets.loc[sample_indices]
        profit = calculate_profit(sample_predictions, sample_targets, top_wells)
        profits.append(profit)

    return pd.Series(profits)

# Prepare prediction and target data
preds_0 = pd.Series(results_0[3], index=results_0[2].index)
preds_1 = pd.Series(results_1[3], index=results_1[2].index)
preds_2 = pd.Series(results_2[3], index=results_2[2].index)
targets_0 = results_0[2]
targets_1 = results_1[2]
targets_2 = results_2[2]

# Run bootstrapping
bootstrap_0 = bootstrap_profit_with_function(preds_0, targets_0)
bootstrap_1 = bootstrap_profit_with_function(preds_1, targets_1)
bootstrap_2 = bootstrap_profit_with_function(preds_2, targets_2)

# Analysis function
def analyze_bootstrap(bootstrap_results):
    mean_profit = bootstrap_results.mean()
    confidence_interval = bootstrap_results.quantile([0.025, 0.975])
    risk_of_loss = (bootstrap_results < 0).mean() * 100
    return {
        'Mean Profit (USD)': mean_profit,
        '95% Confidence Interval (USD)': tuple(confidence_interval),
        'Risk of Loss (%)': risk_of_loss
    }

# Final bootstrap analysis
bootstrap_analysis = {
    'Region 0': analyze_bootstrap(bootstrap_0),
    'Region 1': analyze_bootstrap(bootstrap_1),
    'Region 2': analyze_bootstrap(bootstrap_2)
}

bootstrap_analysis

{'Region 0': {'Mean Profit (USD)': 4278475.604625247,
  '95% Confidence Interval (USD)': (-972498.2956859531, 9542151.927088145),
  'Risk of Loss (%)': 5.5},
 'Region 1': {'Mean Profit (USD)': 5113627.761976852,
  '95% Confidence Interval (USD)': (988706.4990277883, 9407205.116508037),
  'Risk of Loss (%)': 0.8999999999999999},
 'Region 2': {'Mean Profit (USD)': 4025756.075141958,
  '95% Confidence Interval (USD)': (-1371622.250719326, 9298875.280253217),
  'Risk of Loss (%)': 7.3999999999999995}}

<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

1. Inside the bootstrap loop you should use the function `calculate_profit` from the previous step.
2. The results are not correct. Double check indexes, please. Indexes in targets and predictions must be the same to get the correct results. Right now they are not the same.

</div>

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>
    I believe it should be corrected now.
</div>

<div class="alert alert-danger">
<b>Reviewer's comment V2</b>

1. I'm sorry but the results are still wrong. Risks in regions 0 and 2 should be more than 2.5%, while the risk in region 1 should be less than 2.5%. But the problem is not with indexes. Check the length of the variable `selected_targets`. The length is not 200. To avoid this problem you need to apply the method .head() while creating the variable `selected_targets` but not while creating the variable `top_200_indices`. Why? Because you sample with replace=True and so you have duplicate indexes.
2. You should define the function `calculate_profit` only once. 
3. `targets = targets.loc[predictions.index]` - this line should be removed. It does wrong thing because of duplicate indexes in the predictions.

</div>

<div class="alert alert-success">
<b>Reviewer's comment V3</b>

The results are correct now. Great work!

</div>
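The duplicate-index effect discussed in the review comments can be seen on a tiny example. The sketch below uses made-up values and hypothetical variable names; it only illustrates how a label-based `.loc` lookup behaves when a bootstrap sample contains the same label twice, and how a positional lookup avoids it.

```python
import pandas as pd

# Three wells; the bootstrap sample happens to draw well 0 twice.
targets = pd.Series([10.0, 20.0, 30.0], index=[0, 1, 2])
sample_labels = [0, 0, 2]

sample_targets = targets.loc[sample_labels]               # index: 0, 0, 2
sample_predictions = pd.Series([1.0, 1.0, 3.0], index=sample_labels)

# Take the labels of the top 2 predictions, then look them up by label.
top_labels = sample_predictions.sort_values(ascending=False).head(2).index
selected_by_label = sample_targets.loc[top_labels]

# Label 0 matches two rows in the sample, so 3 rows come back instead of 2.
print(len(selected_by_label))

# Looking up by position returns exactly the requested number of rows.
top_positions = sample_predictions.to_numpy().argsort()[::-1][:2]
selected_by_position = sample_targets.iloc[top_positions]
print(len(selected_by_position))
```

This is why the number of selected wells has to be capped after the lookup when labels are used, or why positions can be used instead.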

📌 Final Recommendation

Only Region 1 has a risk of loss under 2.5%, as required by the business constraints. It also has the highest average profit among all regions.

✅ Proceed with oil well development in Region 1.
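The selection rule behind this recommendation (keep only regions under the risk limit, then take the highest mean profit) can be sketched in a few lines. The figures are the rounded bootstrap results reported above; the dictionary layout and the names `RISK_LIMIT_PCT`, `eligible`, and `best_region` are assumptions made for the example.

```python
RISK_LIMIT_PCT = 2.5  # maximum acceptable probability of loss, in percent

# Rounded values from the bootstrap analysis output.
bootstrap_summary = {
    "Region 0": {"mean_profit": 4_278_475.60, "risk_pct": 5.5},
    "Region 1": {"mean_profit": 5_113_627.76, "risk_pct": 0.9},
    "Region 2": {"mean_profit": 4_025_756.08, "risk_pct": 7.4},
}

# Step 1: drop regions whose risk of loss is at or above the limit.
eligible = {
    region: stats
    for region, stats in bootstrap_summary.items()
    if stats["risk_pct"] < RISK_LIMIT_PCT
}

# Step 2: among the remaining regions, pick the highest mean profit.
best_region = (
    max(eligible, key=lambda region: eligible[region]["mean_profit"])
    if eligible
    else None
)

print(best_region)
```

Filtering on risk before ranking by profit ensures that a high-profit but high-risk region can never be recommended.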

✅ Project Summary
This project analyzed geological data from three regions to identify the most financially viable site for new oil well development for OilyGiant. Linear regression models were trained for each region to predict oil reserves. Profit potential was assessed by selecting the top 200 predicted wells and calculating expected revenue based on known production values and fixed development costs.

To account for uncertainty, a bootstrapping technique with 1,000 iterations simulated profit outcomes and quantified financial risk.

🔍 Key Findings:
Region 1 had the lowest risk of loss (0.9%) and the highest average profit ($5.11M).

Region 0 had a lower average profit ($4.28M) and a 5.5% risk of loss, exceeding the 2.5% threshold.

Region 2 had the lowest average profit ($4.03M) and also exceeded the risk limit, with a 7.4% risk of loss.
