<p style="color: #000000; font-size: 32px; font-weight: bold; text-align: center; margin-top: 20px;">In Search of the Next 200 Oil Wells</p>
<p style="color: #000000; font-size: 24px; text-align: center; margin-bottom: 20px;">Oily Giant Company</p>

<hr style="border: .4px solid #000000; width: 55%; margin: 10px auto;">

**This project** focuses on identifying the best locations to open new oil wells in three regions using synthetic geological data and a linear regression model. The main goal is to maximize profits and minimize risks, ensuring the economic sustainability of the investment.

**The main steps to be taken are the following:**

1. **Read** the files with the oil well parameters: `crude oil quality` and `reserve volume`.
2. **Create** a model to predict the `reserve volume` in new wells.
3. **Select** the wells with the highest estimated values.
4. **Identify** the region with the highest total profit for the selected wells.

Potential benefits and risks are analyzed using the **bootstrapping** technique.

**The following are the conditions for the project:**
  - Only linear regression should be used for model training.
  - When exploring the region, a study of 500 points is carried out, with the selection of the best 200 points for calculating the profit.
  - The budget for developing 200 oil wells is **100 million dollars**.
  - A barrel of raw materials generates **4.5 USD** in revenue. The revenue from a product unit is **4500 dollars** (the reserve volume is expressed in thousands of barrels).
  - After risk evaluation:
    - Keep only the regions with a loss risk of less than **2.5%**.
    - From those that meet the criteria, select the region with the highest average profit.

## Data Description
The datasets for the three regions contain the following features:

- **`id`**: Unique well identifier.
- **`f0`, `f1`, `f2`**: Characteristics of the exploration points.
- **`product`**: Oil reserve volume (thousands of barrels).


**The data is synthetic:** contract details and well characteristics are not disclosed. 

--- 



# 1. Libraries and Data Loading

In [22]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import locale  # currency formatting

In [23]:
# Read the CSV files
geo_0 = pd.read_csv("geo_data_0.csv")
geo_1 = pd.read_csv("geo_data_1.csv")
geo_2 = pd.read_csv("geo_data_2.csv")

In [24]:
# Create the list of datasets
datasets = [geo_0, geo_1, geo_2]

# Print the first 5 rows of each dataset
for i in range(len(datasets)):
    print('geo_' + str(i) + ':')
    print(datasets[i].head(5))
    print()

geo_0:
      id        f0        f1        f2     product
0  txEyH  0.705745 -0.497823  1.221170  105.280062
1  2acmU  1.334711 -0.340164  4.365080   73.037750
2  409Wp  1.022732  0.151990  1.419926   85.265647
3  iJLyR -0.032172  0.139033  2.978566  168.620776
4  Xdl7t  1.988431  0.155413  4.751769  154.036647

geo_1:
      id         f0         f1        f2     product
0  kBEdx -15.001348  -8.276000 -0.005876    3.179103
1  62mP7  14.272088  -3.475083  0.999183   26.953261
2  vyE1P   6.263187  -5.948386  5.001160  134.766305
3  KcrkZ -13.081196 -11.506057  4.999415  137.945408
4  AHL4O  12.702195  -8.147433  5.004363  134.766305

geo_2:
      id        f0        f1        f2     product
0  fwXo0 -1.146987  0.963328 -0.828965   27.758673
1  WJtFt  0.262778  0.269839 -2.530187   56.069697
2  ovLUW  0.194587  0.289035 -5.586433   62.871910
3  q6cA6  2.236060 -0.553760  0.930038  114.572842
4  WPMUX -0.515993  1.716266  5.899011  149.600746



In [25]:
# Summary information for each dataset
for i in range(len(datasets)):
    print('geo_' + str(i) + ' info:')
    print(datasets[i].info())
    print()

geo_0 info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None

geo_1 info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None

geo_2 info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data co

**Data Quality:**
  - There are no missing values.
  - The data types (*dtypes*) are correct.
  - Duplicate values will be checked.

In [26]:
# Duplicates in the datasets
print(f"Duplicates in Geo 0: {geo_0.duplicated().sum()}")
print(f"Duplicates in Geo 1: {geo_1.duplicated().sum()}")
print(f"Duplicates in Geo 2: {geo_2.duplicated().sum()}")

Duplicates in Geo 0: 0
Duplicates in Geo 1: 0
Duplicates in Geo 2: 0


The `id` column is only an identifier and is not needed for modeling, so it is removed from the three DataFrames.

In [27]:
# Drop 'id' columns
geo_0 = geo_0.drop('id',axis=1)
geo_1 = geo_1.drop('id',axis=1)
geo_2 = geo_2.drop('id',axis=1)

# 2. Training and Testing the Model for Each Region.

Now, the training and validation sets for the three regions are created.  
The **ratio is 3:1**, so 75% for training and 25% for validation.  
The `train_test_split` function from the `sklearn.model_selection` module will be used to split the data.

In [28]:
# Create training and validation datasets (ratio of 3:1)

# 'Features' for each region
geo_0_features = geo_0.drop('product', axis=1)
geo_1_features = geo_1.drop('product', axis=1)
geo_2_features = geo_2.drop('product', axis=1)

# 'Targets' for each region
geo_0_target = geo_0['product']
geo_1_target = geo_1['product']
geo_2_target = geo_2['product']

# Split feature and target for each region into training and validation datasets
# test_size with 0.25 and training dataset with 0.75 of the data
features_train_0, features_valid_0, target_train_0, target_valid_0 = train_test_split(geo_0_features, geo_0_target, test_size=0.25)
features_train_1, features_valid_1, target_train_1, target_valid_1 = train_test_split(geo_1_features, geo_1_target, test_size=0.25)
features_train_2, features_valid_2, target_train_2, target_valid_2 = train_test_split(geo_2_features, geo_2_target, test_size=0.25)

In [29]:
# Function to check the proportions of data splits for each region
def check_split_proportion(region_name, total, train, valid):
    train_prop = train / total
    valid_prop = valid / total
    print(f"{region_name}:")
    print(f"Training: {train_prop:.2f} | Validation: {valid_prop:.2f}")

# Validate proportions for each region
check_split_proportion("Geo 0", len(geo_0_features), len(features_train_0), len(features_valid_0))
check_split_proportion("Geo 1", len(geo_1_features), len(features_train_1), len(features_valid_1))
check_split_proportion("Geo 2", len(geo_2_features), len(features_train_2), len(features_valid_2))

Geo 0:
Training: 0.75 | Validation: 0.25
Geo 1:
Training: 0.75 | Validation: 0.25
Geo 2:
Training: 0.75 | Validation: 0.25
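
The `train_test_split` calls above do not pass `random_state`, so each run produces a different split and slightly different metrics downstream. A minimal sketch of a repeatable split follows, assuming a fixed seed is acceptable for the project. The `SEED` value, the `split_region` helper, and the synthetic arrays are illustrative and do not come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one region: 1,000 rows, three features, one target.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 3))
target = rng.normal(size=1000)

# SEED is a hypothetical value; any fixed integer gives a repeatable split.
SEED = 12345

def split_region(features, target, seed=SEED):
    """Return a 75/25 train/validation split that is identical on every run."""
    return train_test_split(features, target, test_size=0.25, random_state=seed)

f_train, f_valid, t_train, t_valid = split_region(features, target)
f_train_again, f_valid_again, _, _ = split_region(features, target)

print(f_train.shape, f_valid.shape)
print(np.array_equal(f_valid, f_valid_again))
```

Applying the same idea to the three regions would mean passing `random_state` to each existing call. The figures reported in the following sections come from an unseeded run, so they would not be reproduced exactly.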


## 2.1. Model Training  
A function will be created to train the model and analyze the results.

- `sklearn.linear_model.LinearRegression`: Used to implement the linear regression model, which predicts reserves based on the features of the data.  

- `sklearn.metrics.mean_squared_error`: Calculates the Mean Squared Error (MSE). Its square root, the RMSE, is the metric reported below because it is expressed in the same units as the target (thousands of barrels).  

Training a separate model for each region lets each model capture region-specific patterns in the data, which can improve the accuracy of predictions.

In [30]:
# Function to train the model, make predictions, and calculate metrics
def analyze_reserves(features_train, target_train, features_valid, target_valid):
    # Create the model
    model = LinearRegression()
    
    # Train the model with the training data
    model.fit(features_train, target_train)
    
    # Make predictions on the validation data
    predictions = model.predict(features_valid)
    
    # Calculate the average predicted reserves
    average_reserves = round(predictions.mean(), 2)
    
    # Calculate the model's RMSE
    rmse = mean_squared_error(target_valid, predictions) ** 0.5
    
    # Print results
    print(f"Average volume of predicted reserves: {average_reserves} thousand barrels")
    print(f"RMSE of the linear regression model: {round(rmse, 2)}")
    
    # Return the trained model and predictions
    return model, predictions


# Apply the function to the three regions
print("Geo 0")
model_0, predictions_0 = analyze_reserves(features_train_0, target_train_0, features_valid_0, target_valid_0)

print("\nGeo 1")
model_1, predictions_1 = analyze_reserves(features_train_1, target_train_1, features_valid_1, target_valid_1)

print("\nGeo 2")
model_2, predictions_2 = analyze_reserves(features_train_2, target_train_2, features_valid_2, target_valid_2)


Geo 0
Average volume of predicted reserves: 92.58 thousand barrels
RMSE of the linear regression model: 37.56

Geo 1
Average volume of predicted reserves: 68.71 thousand barrels
RMSE of the linear regression model: 0.89

Geo 2
Average volume of predicted reserves: 94.86 thousand barrels
RMSE of the linear regression model: 40.22


**Analysis of Results**

- **Geo 1**: Shows the best model performance with the lowest error (**RMSE = 0.89**) and predicts an average reserve of **68.71 thousand barrels**.  
- **Geo 0 and Geo 2**: Have much higher prediction errors (**RMSE of 37.56 and 40.22**, respectively) and predict higher reserves (**92.58 and 94.86 thousand barrels**).  

The **significantly lower RMSE for Geo 1** suggests that the predictions for this area are much more reliable than those for the other two regions.
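
An RMSE value is easier to judge against a reference point. One common reference is the error of a constant model that predicts the mean volume for every well, which equals the standard deviation of the target. The sketch below illustrates that comparison with NumPy only; the `rmse` helper and the five volumes are hypothetical and are not taken from the datasets.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical reserve volumes in thousands of barrels (not from the datasets).
actual = np.array([105.0, 73.0, 85.0, 169.0, 154.0])
model_predictions = np.array([98.0, 80.0, 90.0, 150.0, 160.0])

# Baseline: predict the mean volume for every well.
baseline_predictions = np.full_like(actual, actual.mean())

model_rmse = rmse(actual, model_predictions)
baseline_rmse = rmse(actual, baseline_predictions)

print(f"Model RMSE:    {model_rmse:.2f}")
print(f"Baseline RMSE: {baseline_rmse:.2f}")
```

Running the same comparison on each region's validation targets would show how much of the error in Geo 0 and Geo 2 the linear model actually removes relative to the constant baseline.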

# 3. Profit Calculation

The following key data will be considered for calculating the profits:

- **Budget**: The total budget for developing 200 oil wells is **100 million dollars**, which equals **0.5 million per well**.  
- **Revenue per barrel**: Each barrel of raw material generates **4.5 USD** in revenue.  
- **Revenue per product unit**: Since the reserve volume is expressed in thousands of barrels, the revenue per product unit is **4500 dollars** (1000 barrels × 4.5 USD/barrel).  

In [31]:
# Budget variables
budget = 100_000_000      # total budget in dollars
revenue_per_unit = 4_500  # revenue in dollars per product unit (thousand barrels)

# Reserve volume needed to avoid losses
required_reserves = budget / revenue_per_unit

# Results
print(f"A total of {round(required_reserves, 2)} thousand barrels are required to avoid losses.")
print(f"Approximately {round(required_reserves / 200, 2)} thousand barrels per well are required.")

A total of 22222.22 thousand barrels are required to avoid losses.
Approximately 111.11 thousand barrels per well are required.


**Analyzing the relationship between the results**:

Breaking even requires an average of 111.11 thousand barrels per well. The average predicted reserves in every region fall short of this figure (92.58 for Geo 0, 68.71 for Geo 1, and 94.86 for Geo 2, in thousands of barrels), so a region can only be profitable if the selected wells are well above its average.
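
The size of the gap can be made explicit with a few lines of arithmetic. The sketch below recomputes the break-even volume per well from the project conditions and subtracts the average predicted reserves reported in section 2.1; the constant and variable names are illustrative.

```python
BUDGET = 100_000_000        # dollars, for 200 wells
REVENUE_PER_UNIT = 4_500    # dollars per thousand barrels
WELLS = 200

# Break-even volume per well, in thousands of barrels.
break_even_per_well = BUDGET / REVENUE_PER_UNIT / WELLS

# Average predicted reserves reported in section 2.1 (thousands of barrels).
predicted_means = {"Geo 0": 92.58, "Geo 1": 68.71, "Geo 2": 94.86}

# Shortfall of each regional average relative to the break-even volume.
shortfalls = {
    region: round(break_even_per_well - mean_volume, 2)
    for region, mean_volume in predicted_means.items()
}

for region, gap in shortfalls.items():
    print(f"{region}: {gap} thousand barrels below break-even per well")
```

Geo 1 has the largest shortfall on average, which is why the selection of the top wells, rather than the regional mean, drives the profit analysis that follows.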

## 3.1 Profit Analysis, Predictions, and Risk Evaluation in Oil Wells

To complete the analysis, potential profit is calculated by selecting the most productive wells according to the predictions of the linear regression model. For each region, the 200 wells with the highest predicted reserves are selected, their actual reserve volumes are summed, and the resulting profit is computed. The most profitable region for well development is then suggested based on the results obtained.

Additionally, the **bootstrapping** technique with 1,000 simulations is used to evaluate risk and profit. Each simulation draws 500 points with replacement from the validation set and keeps the best 200 by prediction. The simulations yield the average profit, the 95% confidence interval, and the probability of a loss.
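
Before the notebook's own implementation, the idea can be shown in miniature. The sketch below is a NumPy-only variant written for illustration, not the code used to produce the results in this report: it runs on synthetic data, and the `bootstrap_profits` helper, its parameters, and the seeds are assumptions.

```python
import numpy as np

BUDGET = 100_000_000       # dollars
REVENUE_PER_UNIT = 4_500   # dollars per thousand barrels

def bootstrap_profits(actual, predicted, n_rounds=1000, n_points=500, n_wells=200, seed=0):
    """Resample exploration points with replacement and return one profit per round."""
    rng = np.random.default_rng(seed)
    profits = np.empty(n_rounds)
    for i in range(n_rounds):
        # Draw 500 exploration points with replacement.
        idx = rng.integers(0, len(actual), size=n_points)
        # Keep the 200 points with the highest predicted reserves.
        best = idx[np.argsort(predicted[idx])[-n_wells:]]
        # Profit is computed from the actual volumes of the chosen wells.
        profits[i] = REVENUE_PER_UNIT * actual[best].sum() - BUDGET
    return profits

# Synthetic stand-in data (hypothetical, not the project's datasets).
rng = np.random.default_rng(1)
actual = rng.uniform(0, 190, size=25_000)
predicted = actual + rng.normal(0, 40, size=25_000)

profits = bootstrap_profits(actual, predicted)
lower, upper = np.quantile(profits, [0.025, 0.975])
risk_of_loss = (profits < 0).mean() * 100

print(f"Mean profit: {profits.mean():,.0f}")
print(f"95% interval: {lower:,.0f} to {upper:,.0f}")
print(f"Risk of loss: {risk_of_loss:.1f}%")
```

The interval here is a percentile interval: the 2.5th and 97.5th percentiles of the simulated profits. The risk of loss is the share of simulations that end with a negative profit.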

In [32]:
# Function to calculate profit based on predicted values for each region
def get_profit(targets, predictions, count, budget):
    '''Function to calculate potential profit based on predicted oil reserve volumes for each region'''
    
    # Sort the predicted volumes from highest to lowest
    predictions_sorted = pd.Series(predictions).sort_values(ascending=False)[:count]
    
    # Select the 200 largest reserve volumes for each region, but use the target volumes (real volumes)
    selected_wells = targets.iloc[predictions_sorted.index]
    
    # Calculate the profit based on a unit that generates $4,500 in revenue
    # Subtract the budget of $100,000,000 from the total revenue
    profit = (4500 * selected_wells.sum()) - budget
    
    # Return the profit value
    return round(profit, 2)


# Create a list of predicted values and real values (targets) for each region
predictions = [predictions_0, predictions_1, predictions_2]
targets = [target_valid_0.reset_index(drop=True), target_valid_1.reset_index(drop=True), target_valid_2.reset_index(drop=True)]

profits = []

# Loop to execute the get_profit function on the predicted datasets of each region
# Store the profit values in the 'profits' list
for i in range(len(predictions)):
    profits.append(get_profit(targets[i], predictions[i], 200, budget))  # Make sure to pass 'budget' as an argument

# Set the local currency to USD (make sure to use a locale that supports currency formatting)
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

# Print the profit values for each region
for i in range(len(profits)):
    print(f'Geo {i}')
    print(f'Profit: {locale.currency(profits[i], grouping=True)}')
    print(f'Target reserve volume: {round((profits[i] + budget) / 4500, 2)} thousand barrels\n')


Geo 0
Profit: $31,357,450.99
Target reserve volume: 29190.54 thousand barrels

Geo 1
Profit: $24,150,866.97
Target reserve volume: 27589.08 thousand barrels

Geo 2
Profit: $24,529,739.01
Target reserve volume: 27673.28 thousand barrels



In [33]:
def get_profit_distribution(targets, predictions, count, budget):
    '''Function that calculates the profit distribution using the bootstrapping method'''
    
    # Initialize the random state
    state = np.random.RandomState(12345)
    
    # List to store profit values
    values = []
    
    # Create a DataFrame with predictions and targets
    combined_df = pd.DataFrame()
    combined_df['predictions'] = predictions
    combined_df['targets'] = targets.reset_index(drop=True)
    
    # Get 1000 samples with replacement
    for i in range(1000):
        target_subsample = combined_df.sample(n=500, replace=True, random_state=state).reset_index(drop=True)
        values.append(get_profit(target_subsample['targets'], target_subsample['predictions'], count, budget))
    
    # Convert the profit list into a pandas series
    values = pd.Series(values)
    
    # Calculate the average profit
    mean = values.mean()
    
    # 95% confidence interval
    upper = values.quantile(0.975)
    lower = values.quantile(0.025)
    
    # Risk of loss (negative profit)
    count_loss = (values < 0).sum()
    risk_of_loss = (count_loss * 100) / len(values)
    
    # Calculate the target oil reserve volume
    target_oil_reserve_volume = (mean + budget) / 4500
    
    # Print the results
    print(f"Average profit: {locale.currency(mean, grouping=True)}")
    print(f"Target oil reserve volume: {round(target_oil_reserve_volume, 2)} thousand barrels")
    print(f"95% confidence interval: {locale.currency(lower, grouping=True)} to {locale.currency(upper, grouping=True)}")
    print(f"Risk of loss: {risk_of_loss:.2f}%")

# Run the function for each region
for i in range(len(predictions)):
    print(f"Geo {i}")
    get_profit_distribution(targets[i], predictions[i], 200, budget)
    print()


Geo 0
Average profit: $4,086,065.05
Target oil reserve volume: 23130.24 thousand barrels
95% confidence interval: -$1,090,836.74 to $9,498,203.25
Risk of loss: 5.90%

Geo 1
Average profit: $4,467,968.61
Target oil reserve volume: 23215.1 thousand barrels
95% confidence interval: $424,025.31 to $8,559,670.19
Risk of loss: 1.30%

Geo 2
Average profit: $3,525,773.12
Target oil reserve volume: 23005.73 thousand barrels
95% confidence interval: -$1,648,102.99 to $8,649,986.06
Risk of loss: 9.10%
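
The project conditions state that only regions with a loss risk below 2.5% are kept, and that the region with the highest average profit is chosen among them. A short sketch of that rule applied to the bootstrap figures printed above follows; the `choose_region` helper and the dictionary layout are illustrative.

```python
MAX_RISK_PERCENT = 2.5  # threshold from the project conditions

# Bootstrap results reported above: (average profit in dollars, risk of loss in percent).
bootstrap_results = {
    "Geo 0": (4_086_065.05, 5.90),
    "Geo 1": (4_467_968.61, 1.30),
    "Geo 2": (3_525_773.12, 9.10),
}

def choose_region(results, max_risk=MAX_RISK_PERCENT):
    """Drop regions at or above the risk threshold, then pick the highest average profit."""
    eligible = {name: stats for name, stats in results.items() if stats[1] < max_risk}
    if not eligible:
        return None  # no region satisfies the risk condition
    return max(eligible, key=lambda name: eligible[name][0])

selected = choose_region(bootstrap_results)
print(selected)
```

With these figures only Geo 1 passes the risk filter, so it is selected regardless of how the other two regions rank on profit.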



# Conclusions

Geo 1 should be selected for the development of the 200 wells for the following reasons:

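1. **Highest Average Profit**: In the bootstrapping simulations, Geo 1 has the highest average profit, **$4,467,968.61**, compared with $4,086,065.05 for Geo 0 and $3,525,773.12 for Geo 2. This holds even though Geo 1 had the lowest profit ($24,150,866.97) when the top 200 wells were selected from the entire validation set, where Geo 0 led with $31,357,450.99. The bootstrap figure is the more relevant one because it reproduces the project condition of choosing the best 200 wells out of 500 explored points.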

2. **Lowest Risk of Loss**: Geo 1 has the **lowest risk of loss** (1.30%) among all regions and is the only region below the 2.5% threshold set in the project conditions. Geo 0 (5.90%) and Geo 2 (9.10%) exceed the threshold and are therefore excluded.

3. **Confidence Interval**: The 95% confidence interval for Geo 1 is the narrowest, ranging from **$424,025.31 to $8,559,670.19** (a width of about $8.1 million, versus about $10.6 million for Geo 0 and $10.3 million for Geo 2). It is also the only interval that lies entirely above zero. A narrower interval indicates less variability in profit outcomes and more predictable results for investors.

4. **Target Reserve Volume**: When the top 200 wells are selected from the entire validation set, Geo 1 yields **27,589.08 thousand barrels**, close to Geo 2's 27,673.28 and well above the 22,222.22 thousand barrels needed to break even. In the bootstrapping simulations, its average volume of 23,215.1 thousand barrels is the highest of the three regions.

5. **Consistent Results**: The fact that Geo 1 remains profitable across simulations, with a high average profit and relatively low risk, makes it a more dependable choice for future well development compared to the other regions.

In summary, Geo 1 stands out as the most reliable and profitable region for well development: it is the only region that meets the 2.5% risk condition, and it combines the highest average profit with the lowest risk of loss and the narrowest confidence interval.