## 1 Overview

### 1.1 Project Description

My task is to find the optimal location for a new oil well for the OilyGiant mining company. First, I'll gather data on oil well parameters from the candidate regions, including oil quality and reserve volumes. Using this data, I'll build a predictive model to estimate reserve volumes for new wells. I'll then select the wells with the highest estimated values and identify the region that yields the greatest total profit from them.

With oil sample data available from three different regions, and knowing the parameters of each well in these regions, my goal will be to develop a model that identifies the region with the highest profit margin. To ensure a thorough analysis, I'll use the Bootstrapping technique to evaluate potential profits and assess associated risks, ensuring a well-informed decision for the new well's location.

### 1.2 Project Parameters

Steps to identify the optimal location:

1. Gather data on oil well parameters in the selected regions, including oil quality and reserve volume.
2. Develop a model to predict reserve volumes in new wells.
3. Select the oil wells with the highest predicted reserve volumes.
4. Choose the region with the highest total profit from the selected wells.

I have data on oil samples from three regions, with known parameters for each well. My goal is to build a model that identifies the region with the highest profit potential. I'll use the Bootstrapping technique to analyze potential profits and associated risks.

## 2 Initialization

### 2.1 Add imports

Imports in Jupyter notebooks allow users to access external libraries for extended functionality and facilitate code organization by declaring dependencies at the beginning of the notebook, ensuring clear and efficient development.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from scipy import stats as st

1. **Pandas**: a Python library for data manipulation and analysis, offering data structures and operations for working with structured data.


### 2.2 Set up CSV DataFrames

In my Jupyter notebook, I use Pandas to load CSV files, enabling me to manipulate and analyze data seamlessly within the notebook environment.

In [2]:
local = {
    'region 0': './datasets/geo_data_0.csv',
    'region 1': './datasets/geo_data_1.csv',
    'region 2': './datasets/geo_data_2.csv',
}

server = {
    'region 0': '/datasets/geo_data_0.csv',
    'region 1': '/datasets/geo_data_1.csv',
    'region 2': '/datasets/geo_data_2.csv',
}

online = {
    'region 0': '',
    'region 1': '',
    'region 2': '',
}

I use three dictionaries to store the dataset paths: `local` for my own machine, `server` for TripleTen's platform, and `online` as a placeholder for remote sources (currently left empty).

In [3]:
def load_csv(key):
    try:
        df = pd.read_csv(local[key])
    except FileNotFoundError:
        try:
            df = pd.read_csv(server[key])
        except FileNotFoundError:
            df = pd.read_csv(online[key])
    return df

I define the `load_csv` function to load the dataset identified by the argument `key`. It first attempts to read the file from `local[key]`. If that raises a `FileNotFoundError`, it tries `server[key]`, and if that also fails, it falls back to `online[key]`.
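The same fallback idea can be written as a loop over candidate paths, which avoids nesting one `try` block per location. This is only a sketch: `load_first_available` is a hypothetical helper, not part of the notebook, and it assumes every candidate is a filesystem path (it would not handle URLs).

```python
from pathlib import Path

import pandas as pd


def load_first_available(candidates):
    """Return a DataFrame read from the first candidate path that exists on disk."""
    for path in candidates:
        # Skip empty placeholders, such as blank entries for remote sources.
        if path and Path(path).is_file():
            return pd.read_csv(path)
    raise FileNotFoundError(f'None of the candidate paths exist: {candidates}')
```

With the dictionaries above, a call could look like `load_first_available([local['region 0'], server['region 0'], online['region 0']])`.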

In [4]:
region_0 = load_csv('region 0')
region_1 = load_csv('region 1')
region_2 = load_csv('region 2')

The variables `region_0`, `region_1`, and `region_2` are assigned the resulting DataFrames from the created function.

In [5]:
display(region_0, region_1, region_2)

Unnamed: 0,id,f0,f1,f2,product
0,txEyH,0.705745,-0.497823,1.221170,105.280062
1,2acmU,1.334711,-0.340164,4.365080,73.037750
2,409Wp,1.022732,0.151990,1.419926,85.265647
3,iJLyR,-0.032172,0.139033,2.978566,168.620776
4,Xdl7t,1.988431,0.155413,4.751769,154.036647
...,...,...,...,...,...
99995,DLsed,0.971957,0.370953,6.075346,110.744026
99996,QKivN,1.392429,-0.382606,1.273912,122.346843
99997,3rnvd,1.029585,0.018787,-1.348308,64.375443
99998,7kl59,0.998163,-0.528582,1.583869,74.040764


Unnamed: 0,id,f0,f1,f2,product
0,kBEdx,-15.001348,-8.276000,-0.005876,3.179103
1,62mP7,14.272088,-3.475083,0.999183,26.953261
2,vyE1P,6.263187,-5.948386,5.001160,134.766305
3,KcrkZ,-13.081196,-11.506057,4.999415,137.945408
4,AHL4O,12.702195,-8.147433,5.004363,134.766305
...,...,...,...,...,...
99995,QywKC,9.535637,-6.878139,1.998296,53.906522
99996,ptvty,-10.160631,-12.558096,5.005581,137.945408
99997,09gWa,-7.378891,-3.084104,4.998651,137.945408
99998,rqwUm,0.665714,-6.152593,1.000146,30.132364


Unnamed: 0,id,f0,f1,f2,product
0,fwXo0,-1.146987,0.963328,-0.828965,27.758673
1,WJtFt,0.262778,0.269839,-2.530187,56.069697
2,ovLUW,0.194587,0.289035,-5.586433,62.871910
3,q6cA6,2.236060,-0.553760,0.930038,114.572842
4,WPMUX,-0.515993,1.716266,5.899011,149.600746
...,...,...,...,...,...
99995,4GxBu,-1.777037,1.125220,6.263374,172.327046
99996,YKFjq,-1.261523,-0.894828,2.524545,138.748846
99997,tKPY3,-1.199934,-2.957637,5.219411,157.080080
99998,nmxp2,-2.419896,2.417221,-5.548444,51.795253


The datasets contain information about oil samples from three regions with the following attributes:

- **id**: Unique identifier for the oil well.
- **f0, f1, f2**: Three features of the points (their specific meanings are irrelevant, but the features themselves are important).

**Target**

- **product**: Volume of reserves in the oil well (in thousand barrels).

## 3 Preparing the Data

### 3.1 Processing `region_0`

I will now proceed to examine the `region_0` data frame.

In [6]:
region_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


Based on our current knowledge, all the data types are acceptable.

In [7]:
data_miss_0 = region_0.isna().sum()
data_dupl_0 = region_0.duplicated().sum()
print(f'There are {data_dupl_0} duplicate rows and {data_miss_0.sum()} missing values.')

There are 0 duplicate rows and 0 missing values.


After verifying for missing and duplicate values, it’s confirmed that `region_0` is fully processed.
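One caveat: `DataFrame.duplicated()` only flags rows where every column matches, so it would not catch the same well `id` appearing twice with different measurements. The sketch below illustrates the difference on a made-up three-row frame (the values are hypothetical, not taken from the datasets).

```python
import pandas as pd

# Toy frame: no fully duplicated rows, but the id 'a1' appears twice.
toy = pd.DataFrame({
    'id': ['a1', 'b2', 'a1'],
    'f0': [0.1, 0.2, 0.3],
    'product': [10.0, 20.0, 30.0],
})

full_row_dupes = toy.duplicated().sum()   # compares every column
id_dupes = toy['id'].duplicated().sum()   # compares the id column only

print(f'Duplicate rows: {full_row_dupes}, repeated ids: {id_dupes}')
```

Running the same `['id'].duplicated().sum()` check on each region would show whether any well identifiers repeat.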

### 3.2 Processing `region_1`

I will now proceed to examine the `region_1` data frame.

In [8]:
region_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


Based on our current knowledge, all the data types are acceptable.

In [9]:
data_miss_1 = region_1.isna().sum()
data_dupl_1 = region_1.duplicated().sum()
print(f'There are {data_dupl_1} duplicate rows and {data_miss_1.sum()} missing values.')

There are 0 duplicate rows and 0 missing values.


After verifying for missing and duplicate values, it’s confirmed that `region_1` is fully processed.

### 3.3 Processing `region_2`

I will now proceed to examine the `region_2` data frame.

In [10]:
region_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


Based on our current knowledge, all the data types are acceptable.

In [11]:
data_miss_2 = region_2.isna().sum()
data_dupl_2 = region_2.duplicated().sum()
print(f'There are {data_dupl_2} duplicate rows and {data_miss_2.sum()} missing values.')

There are 0 duplicate rows and 0 missing values.


After verifying for missing and duplicate values, it’s confirmed that `region_2` is fully processed.

## 4 Train and Test the Model

### 4.1 Set Up

To ensure repeatability and accuracy, I’ll create a function that minimizes code while effectively completing the task.

I will train and test the model for each region.

In [12]:
def training(df):
    features = df.drop(columns=['id', 'product'])
    target = df['product']
    
    features_train, features_valid, target_train, target_valid = train_test_split(
        features, target, test_size=0.25, random_state=12345
    )
    target_valid = target_valid.reset_index(drop=True)
    
    return features_train, features_valid, target_train, target_valid

The `training` function separates a DataFrame into features (every column except `'id'` and `'product'`) and the target (`'product'`). It then splits both into training and validation sets using a 75/25 ratio, resets the index of the validation target so it lines up with the predictions later on, and returns all four subsets.
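To make the split concrete, here is a minimal sketch on a synthetic eight-row frame with the same column layout. The data is randomly generated for illustration only; with `test_size=0.25`, six rows go to training and two to validation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a region DataFrame (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'id': [f'w{i}' for i in range(8)],
    'f0': rng.normal(size=8),
    'f1': rng.normal(size=8),
    'f2': rng.normal(size=8),
    'product': rng.uniform(0, 150, size=8),
})

features = toy.drop(columns=['id', 'product'])
target = toy['product']

features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

# After the shuffle, validation rows keep their original labels;
# resetting the index relabels them 0..n-1.
target_valid_reset = target_valid.reset_index(drop=True)

print(features_train.shape, features_valid.shape)
```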

In [13]:
def modeling(features_train, target_train, features_valid):
    model = LinearRegression()
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    predictions = pd.Series(predictions)
    return predictions

The `modeling` function trains a `LinearRegression` model using the training features and target data. It then makes predictions on the validation features and returns these predictions.

In [14]:
def rmse(target_valid, predictions):
    # Use a distinct local name so the function name is not shadowed
    score = mean_squared_error(target_valid, predictions) ** 0.5
    avg_pred = predictions.mean()
    
    print(f'RMSE: {score:.2f}\nAverage Prediction: {avg_pred:.2f}')

The `rmse` function calculates the Root Mean Squared Error (RMSE) between the actual validation targets and the model's predictions, then prints the RMSE along with the average of the predictions. This provides a quick assessment of the model's prediction accuracy and central tendency.
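For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below computes it by hand with NumPy on three made-up wells and compares it with a naive baseline that always predicts the mean. `rmse_manual` is a hypothetical helper used only for this illustration.

```python
import numpy as np


def rmse_manual(actual, predicted):
    """Root mean squared error computed directly from its definition."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Made-up reserve volumes, in thousand barrels.
actual = [100.0, 80.0, 120.0]
predicted = [90.0, 85.0, 130.0]

model_rmse = rmse_manual(actual, predicted)
# Baseline: predict the mean of the actual values for every well.
baseline_rmse = rmse_manual(actual, [np.mean(actual)] * len(actual))

print(f'Model RMSE: {model_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}')
```

Comparing a model's RMSE with such a baseline gives a sense of whether the features add any predictive value.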

### 4.2 Evaluating `region_0`

Now the functions created above will be applied to each DataFrame.

In [15]:
features_train_0, features_valid_0, target_train_0, target_valid_0 = training(region_0)
predictions_0 = modeling(features_train_0, target_train_0, features_valid_0)
rmse(target_valid_0, predictions_0)

RMSE: 37.58
Average Prediction: 92.59


The RMSE measures the typical size of the gap between predicted and actual values. For `region_0`, the average prediction is 92.59 and the RMSE is 37.58, which serves as the score to beat.

### 4.3 Evaluating `region_1`

Next for `region_1`:

In [16]:
features_train_1, features_valid_1, target_train_1, target_valid_1 = training(region_1)
predictions_1 = modeling(features_train_1, target_train_1, features_valid_1)
rmse(target_valid_1, predictions_1)

RMSE: 0.89
Average Prediction: 68.73


For `region_1`, the average prediction is 68.73, and the new RMSE (or 'score') to beat is 0.89. Despite the lower average prediction, the deviation from actual values is nearly zero.

### 4.4 Evaluating `region_2`

Lastly, `region_2`:

In [17]:
features_train_2, features_valid_2, target_train_2, target_valid_2 = training(region_2)
predictions_2 = modeling(features_train_2, target_train_2, features_valid_2)
rmse(target_valid_2, predictions_2)

RMSE: 40.03
Average Prediction: 94.97


Across the three regions, higher average predictions coincide with larger errors: `region_2` has both the highest average prediction (94.97) and the highest RMSE (40.03). With only three regions, this is an observation rather than an established pattern.

## 5 Region Parameters

### 5.1 Key Values

To begin determining the best region, the key parameters need to be established based on the `Project Parameters`.

In [18]:
BUDGET = 100_000_000
REVENUE_PER_UNIT = 4_500
NUM_WELLS = 200

This code snippet defines three key variables for the project:

- The total budget allocated for the project, set to \$100 million.
- The revenue earned for each unit of oil (one unit is a thousand barrels), set to \$4,500.
- The total number of oil wells planned for development in the project, set to 200.

In [19]:
COST_PER_WELL = BUDGET / NUM_WELLS

This code calculates the cost to drill each well by dividing the total budget (`BUDGET`) by the number of wells (`NUM_WELLS`). The result is stored in the variable `COST_PER_WELL`.

### 5.2 The Minimum

With the cost per well determined, we need to calculate the minimum reserves required to cover the drilling costs.

In [20]:
MIN_RESERVES = COST_PER_WELL / REVENUE_PER_UNIT
print(f'Minimum reserves: {MIN_RESERVES:.2f}')

Minimum reserves: 111.11


The minimum reserves required for a well to break even are 111.11 units, or about 111,111 barrels.
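The arithmetic behind this threshold, and how it compares with the average predictions reported in section 4, can be sketched as follows. The lowercase names are illustrative stand-ins for the notebook's constants, and the average predictions are copied from the outputs above.

```python
budget = 100_000_000
revenue_per_unit = 4_500   # revenue per thousand barrels
num_wells = 200

cost_per_well = budget / num_wells                 # 500,000 per well
min_reserves = cost_per_well / revenue_per_unit    # about 111.11 thousand barrels

# Average predicted reserves per region, taken from section 4.
avg_predictions = {'region_0': 92.59, 'region_1': 68.73, 'region_2': 94.97}

for region, avg in avg_predictions.items():
    gap = min_reserves - avg
    print(f'{region}: average {avg:.2f}, shortfall vs break-even {gap:.2f}')
```

Every regional average falls below the break-even volume, which is why only the wells with the highest predicted reserves are selected rather than wells picked at random.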

## 6 Profit Calculation

### 6.1 Setup

As seen before, to ensure repeatability and accuracy, I’ll create a function that minimizes code while effectively completing the task.

I will write a function to calculate the profit from a set of selected oil wells based on model predictions. I will choose the wells with the highest predicted values.

In [21]:
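def profit_calc(prediction, target_valid):
    # Pick the NUM_WELLS wells with the highest predicted reserves
    best_wells = prediction.sort_values(ascending=False)[:NUM_WELLS]
    # Look up the actual reserves of those wells by index label
    target_wells = target_valid[best_wells.index]
    # Revenue from the actual reserves minus the fixed development budget
    total_profit = (sum(target_wells) * REVENUE_PER_UNIT) - BUDGET
    return total_profit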

The `profit_calc` function calculates the total profit by selecting the top `NUM_WELLS` predictions, retrieving the corresponding target values, and computing the profit based on these values. The profit is calculated as the total revenue from the selected wells minus a fixed budget.
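A small worked example may help show why wells are ranked by prediction while profit is computed from actual reserves. Everything below is hypothetical: four made-up wells, two selected, and toy revenue and budget figures. `toy_profit` mirrors the logic described above but is not the notebook's function.

```python
import pandas as pd


def toy_profit(predictions, targets, n_wells, revenue_per_unit, budget):
    """Rank wells by prediction, then compute profit from their actual reserves."""
    top = predictions.sort_values(ascending=False).iloc[:n_wells]
    selected = targets.loc[top.index]
    return selected.sum() * revenue_per_unit - budget


predictions = pd.Series([5.0, 1.0, 3.0, 4.0])   # model estimates
targets = pd.Series([4.0, 2.0, 6.0, 3.0])       # actual reserves

# Wells 0 and 3 have the highest predictions; their actual reserves are 4 and 3.
profit = toy_profit(predictions, targets, n_wells=2, revenue_per_unit=10, budget=50)
print(profit)
```

Here the richest well (index 2, actual reserves 6) is missed because its prediction was low, so prediction errors translate directly into lost profit.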

### 6.2 `region_0`

Next, we will apply the `profit_calc()` function to each region, starting with `region_0`.

In [22]:
total_profit_0 = profit_calc(predictions_0, target_valid_0)
print(f'Total profit: {total_profit_0:.2f}')

Total profit: 33208260.43


The total expected profit is \$33,208,260.43.

### 6.3 `region_1`

Moving on to `region_1`:

In [23]:
total_profit_1 = profit_calc(predictions_1, target_valid_1)
print(f'Total profit: {total_profit_1:.2f}')

Total profit: 24150866.97


The total expected profit is \$24,150,866.97, which is lower than the profit for `region_0`.

So far, the profit aligns with the average prediction, suggesting that `region_2` may have a slightly higher profit than `region_0`. However, it's important to consider that `region_2` also has the highest RMSE (40.03), so its predictions are the least reliable.

### 6.4 `region_2`

Lastly, let's evaluate `region_2`:

In [24]:
total_profit_2 = profit_calc(predictions_2, target_valid_2)
print(f'Total profit: {total_profit_2:.2f}')

Total profit: 27103499.64


The total expected profit is \$27,103,499.64, which is higher than the profit for `region_1` but lower than the profit for `region_0`.

It can be noted that, although similar, the results do not perfectly align with my original expectations when comparing the overall region prediction to the expected profit of the top wells in each region.

## 7 Risk and Profit Analysis 

### 7.1 Setup

As seen before, to ensure repeatability and accuracy, I’ll create a function that minimizes code while effectively completing the task.

Now, I will calculate the risks and profit for each region using the bootstrapping technique with 1,000 samples to determine the distribution of profit for each region.

In [25]:
def bootstrap_sampling(predictions, target_valid):
    profits = []
    state = np.random.RandomState(12345)
    
    for i in range(1000):
        sample_target = target_valid.sample(n=500, replace=True, random_state=state)
        sample_prediction = predictions[sample_target.index]
        profit = profit_calc(sample_prediction, sample_target)
        profits.append(profit)
    
    return profits

This code performs bootstrapping by repeatedly sampling 500 observations with replacement from `target_valid` and calculating the profit for each sample. It uses a fixed random seed for reproducibility and collects 1000 profit estimates to return as a list.
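One pandas behaviour is worth keeping in mind here: sampling with replacement produces repeated index labels, and label-based selection returns every row that shares a requested label. The toy series below is hypothetical and only illustrates that behaviour. Resetting the index before selecting is one possible way to keep one row per position.

```python
import pandas as pd

# A sample drawn with replacement can contain the same label more than once.
sample = pd.Series([10.0, 20.0, 30.0], index=[0, 1, 1])

# Asking for label 1 returns both rows carrying that label.
picked_by_label = sample.loc[[1]]

# After resetting the index, every row has a unique label again.
sample_reset = sample.reset_index(drop=True)
picked_after_reset = sample_reset.loc[[1]]

print(len(picked_by_label), len(picked_after_reset))
```

In a top-N selection, this means the number of rows retrieved can exceed N when the chosen labels repeat.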

In [26]:
def profit_stats(predictions, target_valid):
    profits = bootstrap_sampling(predictions, target_valid)
    
    profit_mean = np.mean(profits)
    intervals = st.t.interval(0.95, len(profits) - 1, loc=profit_mean, scale=st.sem(profits))
    risk = np.mean(np.array(profits) < 0) * 100
    
    print(f'Average profit: {profit_mean:.2f}')
    print(f'95% confidence: {intervals}')
    print(f'Risk of loss: {risk:.2f}%')

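This function calculates and prints profit statistics with the help of the `bootstrap_sampling()` function. It reports the average profit, a 95% confidence interval, and the risk of loss, defined as the percentage of bootstrap samples with a negative profit. Note that the interval is a t-based confidence interval for the mean of the bootstrap profits (built from their standard error), so it is much narrower than the spread of the individual bootstrap profits.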
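An alternative way to summarise the bootstrap results is a percentile interval, which takes the 2.5% and 97.5% quantiles of the simulated profits and therefore describes the spread of outcomes rather than the uncertainty of the mean. The sketch below contrasts the two on synthetic profits drawn from a normal distribution. The numbers are invented for illustration and are not the notebook's results.

```python
import numpy as np
from scipy import stats as st

# Synthetic stand-in for 1,000 bootstrap profit estimates.
rng = np.random.default_rng(12345)
profits = rng.normal(loc=6_000_000, scale=3_000_000, size=1000)

# t-based interval for the mean (the approach used in profit_stats).
mean_interval = st.t.interval(
    0.95, len(profits) - 1, loc=profits.mean(), scale=st.sem(profits)
)

# Percentile interval: where 95% of the simulated profits fall.
lower, upper = np.quantile(profits, [0.025, 0.975])

# Share of simulated outcomes that lose money, in percent.
risk = (profits < 0).mean() * 100

print(f'Mean interval width: {mean_interval[1] - mean_interval[0]:,.0f}')
print(f'Percentile interval width: {upper - lower:,.0f}')
print(f'Risk of loss: {risk:.2f}%')
```

The percentile interval is the one that relates directly to the risk of loss, since both are read off the same distribution of simulated profits.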

### 7.2 Calculating `region_0` Profit Stats

Now the `profit_stats()` function will be applied to each region.

In [27]:
profit_stats(predictions_0, target_valid_0)

Average profit: 6007352.44
95% confidence: (5809910.873188795, 6204794.012034521)
Risk of loss: 2.00%


This cell calls `profit_stats` on `predictions_0` and `target_valid_0`, which were already converted to a Pandas Series and re-indexed inside `modeling` and `training`. The output shows an average profit of \$6,007,352.44, a 95% confidence interval for the profit ranging from \$5,809,910.87 to \$6,204,794.01, and a 2.00% risk of loss.

I will also repeat this process for `region_1` and `region_2`.

### 7.3 Calculating `region_1` Profit Stats

Next for `region_1`:

In [28]:
profit_stats(predictions_1, target_valid_1)

Average profit: 6652410.58
95% confidence: (6488555.194678185, 6816265.969743733)
Risk of loss: 0.30%


The output shows an average profit of \$6,652,410.58, a 95% confidence interval for the profit ranging from \$6,488,555.19 to \$6,816,265.97, and a 0.30% risk of loss.

For the first time, we see that `region_1` has the greatest average profit. As observed previously, its model also has the lowest RMSE, which is consistent with it showing the lowest risk of loss.

### 7.4 Calculating `region_2` Profit Stats

Lastly, `region_2`:

In [29]:
profit_stats(predictions_2, target_valid_2)

Average profit: 6155597.23
95% confidence: (5956412.740240851, 6354781.716578515)
Risk of loss: 3.00%


The output shows an average profit of \$6,155,597.23, a 95% confidence interval for the profit ranging from \$5,956,412.74 to \$6,354,781.72, and a 3.00% risk of loss.

The results for `region_2` match my expectations from section `4.4 Evaluating region_2`, though the average profit is higher than that of `region_0`.

## 8 Conclusion

The OilyGiant mining company should invest in `region_1` because:

- It has the lowest risk of loss (0.30%, versus 2.00% for `region_0` and 3.00% for `region_2`).
- Its model has the lowest RMSE (0.89), making its reserve predictions the most reliable.
- Across the bootstrap samples, it shows the highest average profit (\$6,652,410.58).
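The comparison behind this recommendation can be collected into one table. The sketch below simply re-enters the figures reported in sections 4 and 7. The column names are illustrative choices, not part of the original analysis.

```python
import pandas as pd

# Figures copied from the outputs in sections 4 and 7.
summary = pd.DataFrame(
    {
        'rmse': [37.58, 0.89, 40.03],
        'avg_bootstrap_profit': [6007352.44, 6652410.58, 6155597.23],
        'risk_of_loss_pct': [2.00, 0.30, 3.00],
    },
    index=['region_0', 'region_1', 'region_2'],
)

best_by_profit = summary['avg_bootstrap_profit'].idxmax()
lowest_risk = summary['risk_of_loss_pct'].idxmin()
lowest_rmse = summary['rmse'].idxmin()

print(summary)
print(f'Highest average profit: {best_by_profit}')
print(f'Lowest risk of loss: {lowest_risk}')
print(f'Lowest RMSE: {lowest_rmse}')
```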