# Oil Well Analysis

## Introduction
This project analyzes three regions to determine the most profitable location for a new oil well. We use linear regression to predict oil reserves, calculate the expected profit, and assess risk with bootstrapping.


In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

## Load and Examine/Prepare the Data

### Loading the Datasets

In [4]:
def load_and_prepare_data():
    data_0 = pd.read_csv('/Users/yaneiribaez/Downloads/geo_data_0.csv')
    data_1 = pd.read_csv('/Users/yaneiribaez/Downloads/geo_data_1.csv')
    data_2 = pd.read_csv('/Users/yaneiribaez/Downloads/geo_data_2.csv')
    return data_0, data_1, data_2

data_0, data_1, data_2 = load_and_prepare_data()

### Display basic info about datasets

In [14]:
def data_summary(data, name):
    print(f"\n{name} Summary")
    print(data.head())
    print(data.describe())

data_summary(data_0, "Region 0")
data_summary(data_1, "Region 1")
data_summary(data_2, "Region 2")


Region 0 Summary
      id        f0        f1        f2     product
0  txEyH  0.705745 -0.497823  1.221170  105.280062
1  2acmU  1.334711 -0.340164  4.365080   73.037750
2  409Wp  1.022732  0.151990  1.419926   85.265647
3  iJLyR -0.032172  0.139033  2.978566  168.620776
4  Xdl7t  1.988431  0.155413  4.751769  154.036647
                  f0             f1             f2        product
count  100000.000000  100000.000000  100000.000000  100000.000000
mean        0.500419       0.250143       2.502647      92.500000
std         0.871832       0.504433       3.248248      44.288691
min        -1.408605      -0.848218     -12.088328       0.000000
25%        -0.072580      -0.200881       0.287748      56.497507
50%         0.502360       0.250252       2.515969      91.849972
75%         1.073581       0.700646       4.715088     128.564089
max         2.362331       1.343769      16.003790     185.364347

Region 1 Summary
      id         f0         f1        f2     product
0  kBEdx -1

### Section Summary
In this section, we loaded and explored the datasets for three regions, each containing 100,000 samples with three features (f0, f1, f2) and a target variable (product) representing the volume of oil reserves in thousands of barrels.

Region 0 has an average product value of 92.5 thousand barrels, ranging from 0 to approximately 185.4 thousand. The variability in feature f2 (standard deviation of 3.25) may reflect significant geological diversity within this region.

Region 1 shows a lower average product value of 68.83 thousand barrels and a maximum of 137.9 thousand barrels, accompanied by higher variability in features f0 and f1, indicating broader fluctuations in geological conditions.

Region 2 has the highest average product value of 95 thousand barrels and a maximum of 190 thousand barrels, with f2 again the most variable feature. This suggests that Region 2 may offer a more promising exploration site, though the broad range of f2 values points to diverse and potentially challenging geological conditions.

Across all regions, the data is clean, with no missing values or irregularities, making it suitable for modeling and for determining the most profitable location for oil well development.
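The claim that the data is clean can be backed by an explicit check. The snippet below is a minimal sketch, not part of the original notebook: `quality_report` is a hypothetical helper, and the `toy` frame is made-up data that only mimics the column layout of the regional files.

```python
import numpy as np
import pandas as pd


def quality_report(data: pd.DataFrame) -> dict:
    """Count missing cells, fully duplicated rows, and repeated well ids."""
    return {
        "missing_cells": int(data.isna().sum().sum()),
        "duplicate_rows": int(data.duplicated().sum()),
        "duplicate_ids": int(data["id"].duplicated().sum()),
    }


# Made-up frame with the same columns as the regional files (illustration only).
toy = pd.DataFrame(
    {
        "id": ["a1", "b2", "b2", "c3"],
        "f0": [0.1, 0.2, 0.3, np.nan],
        "f1": [1.0, 1.1, 1.2, 1.3],
        "f2": [2.0, 2.1, 2.2, 2.3],
        "product": [90.0, 80.0, 85.0, 70.0],
    }
)

report = quality_report(toy)
print(report)
```

Calling `quality_report(data_0)` (and likewise for the other two regions) would show whether each count is actually zero.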


## Train and Test the Model for Each Region

In [15]:
# Function to train and test the model for a given dataset
def train_and_evaluate_model(data):
    X = data[['f0', 'f1', 'f2']]
    y = data['product']

    # Split data
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=42)

    # Train model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Predict and evaluate
    predictions = model.predict(X_valid)
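    # RMSE as the square root of MSE; the `squared=False` argument is
    # deprecated in recent scikit-learn releases
    rmse = np.sqrt(mean_squared_error(y_valid, predictions))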
    avg_prediction = predictions.mean()

    return model, predictions, y_valid, rmse, avg_prediction

results = {}

for i, data in enumerate([data_0, data_1, data_2]):
    print(f"\nRegion {i} Training")
    model, predictions, y_valid, rmse, avg_prediction = train_and_evaluate_model(data)
    results[f"Region {i}"] = (model, predictions, y_valid, rmse, avg_prediction)
    print(f"RMSE: {rmse}, Average Predicted Reserves: {avg_prediction}")



Region 0 Training
RMSE: 37.75660035026169, Average Predicted Reserves: 92.39879990657768

Region 1 Training
RMSE: 0.890280100102884, Average Predicted Reserves: 68.71287803913762

Region 2 Training
RMSE: 40.145872311342174, Average Predicted Reserves: 94.77102387765939
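An RMSE is easier to interpret next to a naive baseline that always predicts the mean. In the output above, Region 1's RMSE (about 0.89) is far smaller than the RMSE for Regions 0 and 2 (about 37.76 and 40.15), which are not much below the spread of `product` itself (a standard deviation of about 44.29 in Region 0). The sketch below uses made-up numbers and a hypothetical `rmse` helper to show the comparison; it is an illustration, not the notebook's evaluation code.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up reserves in thousands of barrels (illustration only).
y_true = np.array([50.0, 100.0, 150.0])
model_pred = np.array([60.0, 100.0, 140.0])

# Naive baseline: always predict the mean of the target.
baseline_pred = np.full_like(y_true, y_true.mean())

model_rmse = rmse(y_true, model_pred)
baseline_rmse = rmse(y_true, baseline_pred)
print(f"model RMSE: {model_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}")
```

A model whose RMSE is close to the baseline's adds little beyond predicting the average well.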


## Preparation for Profit Calculation

In [16]:
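BUDGET = 100_000_000              # total development budget, USD
REVENUE_PER_BARREL = 4.5          # USD per barrel; note that `product` is in thousands of barrels
TOTAL_WELLS = 200                 # wells selected for development
WELL_COST = BUDGET / TOTAL_WELLS  # average cost per well, USD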

## Calculate the sufficient volume of reserves

In [17]:
sufficient_volume = WELL_COST / REVENUE_PER_BARREL

for i, data in enumerate([data_0, data_1, data_2]):
    avg_volume = data['product'].mean()
    print(f"\nRegion {i}: Average Volume = {avg_volume:.2f}, Sufficient Volume = {sufficient_volume:.2f}")


Region 0: Average Volume = 92.50, Sufficient Volume = 111111.11

Region 1: Average Volume = 68.83, Sufficient Volume = 111111.11

Region 2: Average Volume = 95.00, Sufficient Volume = 111111.11


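## Section Summary
The break-even reserve volume per well was calculated and compared with the average reserves in each region. The printed threshold (111,111.11) is in barrels, while the regional averages are in thousands of barrels, so the equivalent threshold is about 111.11 thousand barrels.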
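The break-even arithmetic can be restated with both sides in the same units. This is a sketch under one assumption: `BARRELS_PER_UNIT` is a name introduced here to encode that one `product` unit equals 1,000 barrels, as the data summary describes. The regional means are copied from the output above.

```python
BUDGET = 100_000_000        # total development budget, USD
TOTAL_WELLS = 200           # wells developed
REVENUE_PER_BARREL = 4.5    # USD per barrel
BARRELS_PER_UNIT = 1_000    # assumption: one `product` unit = 1,000 barrels

well_cost = BUDGET / TOTAL_WELLS                     # USD per well
break_even_barrels = well_cost / REVENUE_PER_BARREL  # barrels per well
break_even_units = break_even_barrels / BARRELS_PER_UNIT  # thousands of barrels

# Mean `product` per region, taken from the printed output above.
regional_means = {"Region 0": 92.50, "Region 1": 68.83, "Region 2": 95.00}
for name, mean_units in regional_means.items():
    shortfall = break_even_units - mean_units
    print(f"{name}: mean {mean_units:.2f}, break-even {break_even_units:.2f}, "
          f"shortfall {shortfall:.2f} (thousand barrels)")
```

Expressed this way, every regional average still falls short of the threshold, but by tens of thousands of barrels rather than by three orders of magnitude.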

## Profit Calculation Function

In [18]:
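def calculate_profit(predictions, y_valid):
    # Rank wells by predicted reserves and keep the top TOTAL_WELLS
    sorted_predictions = predictions.sort_values(ascending=False)
    top_predictions = sorted_predictions.iloc[:TOTAL_WELLS]
    # Revenue from the actual reserves of the selected wells, minus the budget
    profit = (y_valid.loc[top_predictions.index].sum() * REVENUE_PER_BARREL) - BUDGET
    return profit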

## Bootstrapping for Risk Assessment

In [19]:
def bootstrap_profit(predictions, y_valid, iterations=1000):
    profits = []
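    for _ in range(iterations):
        # Resample TOTAL_WELLS wells with replacement. Because the sample size
        # equals TOTAL_WELLS, the top-well selection in calculate_profit keeps
        # every sampled well.
        sample = predictions.sample(n=TOTAL_WELLS, replace=True)
        profit = calculate_profit(sample, y_valid)
        profits.append(profit)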

    profits = np.array(profits)
    mean_profit = profits.mean()
    lower, upper = np.percentile(profits, [2.5, 97.5])
    loss_risk = (profits < 0).mean()

    return mean_profit, lower, upper, loss_risk

for i, (model, predictions, y_valid, _, _) in enumerate(results.values()):
    predictions = pd.Series(predictions, index=y_valid.index)
    mean_profit, lower, upper, loss_risk = bootstrap_profit(predictions, y_valid)

    print(f"\nRegion {i} Results")
    print(f"Mean Profit: ${mean_profit:.2f}")
    print(f"95% Confidence Interval: ${lower:.2f} to ${upper:.2f}")
    print(f"Risk of Loss: {loss_risk * 100:.2f}%")


Region 0 Results
Mean Profit: $-99916861.66
95% Confidence Interval: $-99922699.28 to $-99911267.79
Risk of Loss: 100.00%

Region 1 Results
Mean Profit: $-99937958.76
95% Confidence Interval: $-99943511.88 to $-99932558.14
Risk of Loss: 100.00%

Region 2 Results
Mean Profit: $-99914238.83
95% Confidence Interval: $-99919647.05 to $-99908404.07
Risk of Loss: 100.00%


## Section Summary
In this section, we calculated the minimum volume of oil reserves required for a well to break even. With a development budget of 100 million USD spread across 200 wells, each well costs 500,000 USD and must yield at least 111,111.11 barrels (about 111.11 thousand barrels) to cover its cost at a revenue rate of $4.50 per barrel. The average reserves per well were 92.50, 68.83, and 95.00 thousand barrels in Regions 0, 1, and 2, all below that threshold.

This analysis highlights a potential challenge: no region's average reserve volume reaches the break-even threshold, so wells developed at random would be expected to lose money. Profitability therefore depends on selecting only the top-performing wells in each region.
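Selecting only the top-performing wells has an effect only when each bootstrap draw contains more candidate sites than wells to develop. The following is a minimal sketch of that variant, not the notebook's method. Several names are assumptions made for illustration: `sample_size` (500 here) is a hypothetical number of sites surveyed per draw, `revenue_per_unit=4_500` assumes one `product` unit is 1,000 barrels sold at $4.50 each, and the data is synthetic, so the resulting numbers say nothing about the real regions.

```python
import numpy as np
import pandas as pd


def top_wells_profit(predictions, actual, n_wells, revenue_per_unit, budget):
    """Pick the n_wells highest predictions and price their actual reserves."""
    chosen = predictions.sort_values(ascending=False).index[:n_wells]
    return actual.loc[chosen].sum() * revenue_per_unit - budget


def bootstrap_top_wells(predictions, actual, n_wells, sample_size,
                        revenue_per_unit, budget, iterations=1000, seed=0):
    """Resample candidate sites, then develop only the best n_wells per draw."""
    rng = np.random.default_rng(seed)
    preds = predictions.reset_index(drop=True)
    acts = actual.reset_index(drop=True)
    profits = np.empty(iterations)
    for i in range(iterations):
        # Draw row positions with replacement; re-index so repeated sites
        # remain distinct rows in the sample.
        rows = rng.integers(0, len(preds), size=sample_size)
        sample_preds = preds.iloc[rows].reset_index(drop=True)
        sample_acts = acts.iloc[rows].reset_index(drop=True)
        profits[i] = top_wells_profit(
            sample_preds, sample_acts, n_wells, revenue_per_unit, budget
        )
    lower, upper = np.percentile(profits, [2.5, 97.5])
    return profits.mean(), lower, upper, (profits < 0).mean()


# Synthetic reserves and noisy predictions (illustration only).
rng = np.random.default_rng(42)
actual = pd.Series(rng.uniform(0, 190, size=5_000))
predictions = actual + rng.normal(0, 10, size=5_000)

mean_profit, lower, upper, loss_risk = bootstrap_top_wells(
    predictions, actual, n_wells=200, sample_size=500,
    revenue_per_unit=4_500, budget=100_000_000, iterations=200,
)
print(f"Mean profit: {mean_profit:.2f}")
print(f"95% interval: {lower:.2f} to {upper:.2f}")
print(f"Risk of loss: {loss_risk * 100:.2f}%")
```

With the notebook's data, `predictions` and `actual` would be the validation predictions and `y_valid` for one region; whether 500 sites per draw is appropriate depends on the project's conditions, which the notebook does not state.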


# Conclusion
This project aimed to determine the most profitable region for developing a new oil well by analyzing geological data from three areas. Linear regression models were used to predict oil reserves, with performance evaluated using RMSE and average predicted reserves. RMSE differed sharply between regions: about 37.76 in Region 0 and 40.15 in Region 2, but only 0.89 in Region 1, so the Region 1 model is far more accurate even though its average predicted reserves are the lowest (68.71). The profit calculations showed that no region's average reserves reach the break-even volume at a revenue rate of $4.50 per barrel. The bootstrapping analysis reported a 100% loss probability in every region, with mean losses nearly equal to the full budget. A result that extreme should be read with caution. The calculation multiplies `product` values, which the data summary describes as thousands of barrels, by a per-barrel price, so revenue may be understated by a factor of 1,000. In addition, each bootstrap draw contains exactly 200 wells, so no top-well selection takes place. As computed, the analysis suggests that developing wells in these regions is not financially viable under the given conditions, but the units and the sampling scheme should be rechecked before acting on that conclusion.
