# OilyGiant's New Well Location

## Introduction

OilyGiant mining company is seeking the ideal place for a new oil well.

The task will require:

 1. Analyzing oil quality and volumes of different reserves. 
 2. Creating a model to predict the volume of reserves in new wells. 
 
 This model will identify:
 
 1. Oil wells with the highest estimated values 
 2. Regions with oil wells that produce high profit margins

### Data Description

Three datasets of oil samples, one for each of three regions, have been provided in the datasets folder. Each contains the following columns:

`id:` unique oil well identifier

`f0, f1, f2:` three undisclosed features of significance

`product:` volume of reserves in the oil well (thousand barrels)

Potential profits and risks will be identified using the bootstrapping method.

### Procedure

The process will include five distinct steps:
1. Data Preparation
2. Model training and testing
3. Prepare data for profit calculation
4. Calculate profit from a set of selected oil wells and model predictions
5. Calculate risks and profit for each region

## Data Preparation

Import packages and read in a dataframe for each region.

Check for:

1. Data types
2. Missing data
3. Duplicate Data

### Import Packages

In [3]:
# Import packages
import pandas as pd # for data processing
from sklearn.model_selection import train_test_split # for splitting data
from sklearn.linear_model import LinearRegression # for linear regression
from sklearn.metrics import mean_squared_error # for error calculation
from numpy.random import RandomState # for random state
import numpy as np # for array operations

### Save Dataframes

In [6]:
# create dataframes
df1 = pd.read_csv('datasets/geo_data_0.csv')
df2 = pd.read_csv('datasets/geo_data_1.csv')
df3 = pd.read_csv('datasets/geo_data_2.csv')

### Examine Values

In [4]:
### Show first rows of each dataframe
for i in [df1, df2, df3]:
    print(i.head(3))

      id        f0        f1        f2     product
0  txEyH  0.705745 -0.497823  1.221170  105.280062
1  2acmU  1.334711 -0.340164  4.365080   73.037750
2  409Wp  1.022732  0.151990  1.419926   85.265647
      id         f0        f1        f2     product
0  kBEdx -15.001348 -8.276000 -0.005876    3.179103
1  62mP7  14.272088 -3.475083  0.999183   26.953261
2  vyE1P   6.263187 -5.948386  5.001160  134.766305
      id        f0        f1        f2    product
0  fwXo0 -1.146987  0.963328 -0.828965  27.758673
1  WJtFt  0.262778  0.269839 -2.530187  56.069697
2  ovLUW  0.194587  0.289035 -5.586433  62.871910


From quick inspection, the id column contains string values and the others are floats.

### Data Types

Data types of each column will be checked with 'info'.

In [5]:
# Check datatypes of each dataframe
for i in [df1, df2, df3]:
    print(i.info(), '\n')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Colu

Each column is in the right format. 

### Missing Values

From the above analysis, no missing values are present.
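The non-null counts reported by `info` already cover this, but an explicit count per column is a common cross-check. The following is a minimal sketch on a small made-up frame; the `toy` data is hypothetical and not taken from the regional datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column layout as the regional datasets
toy = pd.DataFrame({
    'id': ['a1B2c', 'd3E4f', 'g5H6i'],
    'f0': [0.7, np.nan, -1.1],
    'f1': [-0.5, 0.3, 0.9],
    'f2': [1.2, 4.4, np.nan],
    'product': [105.3, 73.0, 27.8],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Applied to `df1`, `df2` and `df3`, the same call should return zero for every column if the `info` output above is accurate.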

### Duplicate Values

In [6]:
# Check for duplicates within 'id' column
for i in [df1, df2, df3]:
    print(i.duplicated(subset='id').sum())

10
4
4


As duplicates exist, print each set of duplicated ids in every dataframe.

In [7]:
# Find duplicate values in each dataframe
for i in [df1, df2, df3]:
    duplicates = (i[i.duplicated(subset='id', keep=False)])
    duplicates = duplicates.sort_values(by='id')
    print(duplicates, '\n')

          id        f0        f1         f2     product
66136  74z30  1.084962 -0.312358   6.990771  127.643327
64022  74z30  0.741456  0.459229   5.153109  140.771492
51970  A5aEY -0.180335  0.935548  -2.094773   33.020205
3389   A5aEY -0.039949  0.156872   0.209861   89.249364
69163  AGS9W -0.933795  0.116194  -3.655896   19.230453
42529  AGS9W  1.454747 -0.479651   0.683380  126.370504
931    HZww2  0.755284  0.368511   1.863211   30.681774
7530   HZww2  1.061194 -0.373969  10.430210  158.828695
63593  QcMuo  0.635635 -0.473422   0.862670   64.578675
1949   QcMuo  0.506563 -0.323775  -2.215583   75.496502
75715  Tdehs  0.112079  0.430296   3.218993   60.964018
21426  Tdehs  0.829407  0.298807  -0.049563   96.035308
92341  TtcGQ  0.110711  1.022689   0.911381  101.318008
60140  TtcGQ  0.569276 -0.104876   6.440215   85.350186
89582  bsk9y  0.398908 -0.400253  10.122376  163.433078
97785  bsk9y  0.378429  0.005837   0.160827  160.637302
41724  bxg6G -0.823752  0.546319   3.630479   93

Each id is a 5-character string, but every dataset holds 100,000 rows, so some ids may repeat purely by chance. To test this, estimate the number of duplicates expected if ids were generated at random and compare it with the number observed.

In [8]:
# Find the chances 'id' names are duplicated

# Start with a single option and multiply across the five character positions
options = 1

# Count the unique characters seen at each position of the 'id' column
for i in range(5):
    options *= df1['id'].str[i].nunique()

# Calculate the chances of each 'id' name being duplicated
chances = 100000 / options

# Expected number of duplicates
print('Number of expected duplicate id numbers for each dataframe =', round(100000 * chances))

Number of expected duplicate id numbers for each dataframe = 11


As no dataset has more duplicates than the expected 11, they can be considered as a result of random chance.
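The estimate above divides the squared number of rows by the number of possible ids. A pair-counting (birthday problem) estimate is a common alternative: with n random ids drawn from m possibilities, the expected number of colliding pairs is n(n-1)/(2m), which is roughly half the simpler ratio. The sketch below assumes ids are drawn uniformly from the 62 alphanumeric characters at each of the 5 positions; that alphabet is an assumption, not something stated with the datasets.

```python
import string

# Assumption: ids are 5 characters drawn uniformly from 62 alphanumeric symbols
alphabet_size = len(string.ascii_letters + string.digits)
id_length = 5
n_rows = 100_000

# Number of distinct ids that could exist under this assumption
possible_ids = alphabet_size ** id_length

# Expected number of colliding pairs among n_rows random ids: n(n-1) / (2m)
n_pairs = n_rows * (n_rows - 1) // 2
expected_pairs = n_pairs / possible_ids

print('Possible ids:', possible_ids)
print('Expected colliding pairs:', round(expected_pairs, 2))
```

Under these assumptions the expectation falls between 5 and 6 pairs per dataset, so the observed counts of 4 to 10 duplicates are of the same order as chance collisions.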

## Model Training

Now that the data has been checked, each region's data will be split into training and validation sets and a model trained on it. The model predicts the volume of reserves for each well in the validation set, and the regions will be compared by their average predicted volume.

### Split Data

The data needs to be split in two ways:
1. Into features and target
2. Into training and validation sets

The second split will be done in a 75:25 ratio.

In [9]:
### Split data into training and validation.

# Create function to split data
def split_data(data):
    # Split data into features and target
    features = data.drop(['product', 'id'], axis=1)
    target = data['product']
    
    # Split data into training and validation sets
    features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=12345)
    
    # Return the split data
    return features_train, features_valid, target_train, target_valid
    
# Split data for each region
features_train_1, features_valid_1, target_train_1, target_valid_1 = split_data(df1)
features_train_2, features_valid_2, target_train_2, target_valid_2 = split_data(df2)
features_train_3, features_valid_3, target_train_3, target_valid_3 = split_data(df3)

In [10]:
# Confirm each set is the proper size (75% training, 25% validation)

# Print shape of each set
print('df1 splits:', '\n'
      'Features train shape:', features_train_1.shape, '\n'
      'Features valid shape:', features_valid_1.shape, '\n'
      'Target train shape:', target_train_1.shape, '\n'
      'Target valid shape:', target_valid_1.shape, '\n')

print('df2 splits:', '\n'
        'Features train shape:', features_train_2.shape, '\n'
        'Features valid shape:', features_valid_2.shape, '\n'
        'Target train shape:', target_train_2.shape, '\n'
        'Target valid shape:', target_valid_2.shape, '\n')

print('df3 splits:', '\n'
        'Features train shape:', features_train_3.shape, '\n'
        'Features valid shape:', features_valid_3.shape, '\n'
        'Target train shape:', target_train_3.shape, '\n'
        'Target valid shape:', target_valid_3.shape, '\n')

df1 splits: 
Features train shape: (75000, 3) 
Features valid shape: (25000, 3) 
Target train shape: (75000,) 
Target valid shape: (25000,) 

df2 splits: 
Features train shape: (75000, 3) 
Features valid shape: (25000, 3) 
Target train shape: (75000,) 
Target valid shape: (25000,) 

df3 splits: 
Features train shape: (75000, 3) 
Features valid shape: (25000, 3) 
Target train shape: (75000,) 
Target valid shape: (25000,) 



### Train Model and Make Predictions

As the target is numerical, a linear regression model will be used. For each region, the steps are:

1. Train the model on the features and target of the training set
2. Predict volumes for the validation set
3. Calculate the average volume of predicted reserves
4. Calculate the root mean squared error (RMSE)
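As a reminder of what step 4 measures, the sketch below computes RMSE by hand on four made-up wells. The `actual` and `predicted` arrays are hypothetical values chosen for illustration only.

```python
import numpy as np

# Hypothetical actual and predicted volumes (thousand barrels)
actual = np.array([100.0, 80.0, 60.0, 120.0])
predicted = np.array([90.0, 85.0, 70.0, 110.0])

# RMSE: square the errors, average them, then take the square root
errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))

print(round(rmse, 2))
```

Because the errors are squared before averaging, RMSE is in the same units as the target (thousand barrels) and penalises large misses more heavily than small ones.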


In [11]:
# Create a function to predict volume and calculate RMSE
def predicted_volume_rmse(features_train, features_valid, target_train, target_valid):
    # Initialize model constructor
    model = LinearRegression()
    
    # 1. Train model on training set
    model.fit(features_train, target_train)
    
    # 2. Get model predictions on validation set
    predictions_valid = model.predict(features_valid)
    
    # Save predictions as a series with index from target_valid
    predictions_valid = pd.Series(predictions_valid, index=target_valid.index)
    
    # 3. Calculate volume reserve mean of predictions_valid
    mean_predictions = predictions_valid.mean()
    
    # 4. Calculate RMSE on validation set
    rmse = mean_squared_error(target_valid, predictions_valid) ** 0.5
    
    # Return RMSE, mean predicted volume and the predictions themselves
    return round(rmse,2), round(mean_predictions,2), predictions_valid

In [12]:
# Calculate region 1 data
rmse_1, mean_predictions_1, predictions_valid_1 = predicted_volume_rmse(features_train_1, 
                                                                        features_valid_1, 
                                                                        target_train_1, 
                                                                        target_valid_1)
print('Region 1 values:', '\n'
      'RMSE:', rmse_1, '\n'
      'Predicted volume mean:', mean_predictions_1, '\n')

# Calculate region 2 data
rmse_2, mean_predictions_2, predictions_valid_2 = predicted_volume_rmse(features_train_2, 
                                                                        features_valid_2, 
                                                                        target_train_2, 
                                                                        target_valid_2)
print('Region 2 values:', '\n'
      'RMSE:', rmse_2, '\n'
      'Predicted volume mean:', mean_predictions_2, '\n')


# Calculate region 3 data
rmse_3, mean_predictions_3, predictions_valid_3 = predicted_volume_rmse(features_train_3, 
                                                                        features_valid_3, 
                                                                        target_train_3, 
                                                                        target_valid_3)
print('Region 3 values:', '\n'
      'RMSE:', rmse_3, '\n'
      'Predicted volume mean:', mean_predictions_3)

Region 1 values: 
RMSE: 37.58 
Predicted volume mean: 92.59 

Region 2 values: 
RMSE: 0.89 
Predicted volume mean: 68.73 

Region 3 values: 
RMSE: 40.03 
Predicted volume mean: 94.97


### Analysis of Results

The third region has the largest average predicted volume at 94.97 thousand barrels (94,970 barrels). However, its root mean squared error (RMSE) is also the highest at 40.03, meaning its predictions deviate most from the actual volumes. The first region follows with a slightly lower mean (92.59) and RMSE (37.58). The second region has a much lower mean (68.73), but its RMSE of 0.89 shows that its predictions are far more accurate.

## Profit Calculation Preparation

The profit calculation requires several key values to be saved as variables. A baseline volume will also be calculated to determine the break-even point. The formula can be expressed as:

**required volume to break even = cost of a single well / revenue per unit of volume**

The total budget is `$100 million` for 200 wells, so the cost of one well is the budget divided by 200. Revenue is `$4,500` per unit of volume, where one unit is a thousand barrels (`$4.50` per barrel).

Key values will first be saved according to the above formula. The calculation will then determine how much each well must hold to break even. This value will be compared to the average predicted volume of reserves in each region.

In [13]:
# Store budget as $100 million for the total cost of the construction 
budget = 100000000

# Divide budget by 200 to find the budget per well
budget_per_well = budget / 200

# Store revenue per unit of volume (one unit = thousand barrels) as $4,500
revenue_per_unit = 4500

### Breakeven Production

For a well to break even, the revenue from its reserves must cover the budget for an individual well.

In [14]:
# Find breakeven volume
breakeven = budget_per_well / revenue_per_unit

print(round(breakeven, 3))

111.111


### Findings

A well needs to hold at least 111,111 barrels of oil in its reserves to break even. The average predicted volume for each region is below this level, with the third region closest at 94,970 barrels. However, this is an average over the 25,000 wells of the validation set, so a sufficient number of individual wells may still exceed the threshold.
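One way to check how many wells clear the threshold is to count predictions above the break-even volume. The sketch below does this on synthetic data: `toy_predictions` is a hypothetical stand-in drawn from a normal distribution with made-up parameters, not the model output of any region.

```python
import numpy as np
import pandas as pd

# Break-even volume per well in thousand barrels: $500,000 / $4,500
breakeven_volume = (100_000_000 / 200) / 4500

# Hypothetical predicted volumes standing in for one region's predictions
rng = np.random.default_rng(0)
toy_predictions = pd.Series(rng.normal(loc=95, scale=25, size=25_000))

# Count wells predicted to exceed break-even and their share of the total
above = toy_predictions > breakeven_volume
n_above = int(above.sum())
share_above = above.mean()

print('Wells above break-even:', n_above)
print('Share above break-even:', round(share_above, 3))
```

Replacing `toy_predictions` with a region's predicted volumes would show whether at least 200 wells are expected to exceed the break-even volume.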

## Profit Calculation Execution

A function will be created to determine which region produces the greatest profit. The function first selects the 200 wells with the highest predicted volumes in a region. The actual volumes of those wells are then summed and multiplied by the revenue per unit to give the revenue. Finally, the budget is deducted from the revenue to give the profit for the region. The region with the highest profit can then be identified.

### Profit Function

In [15]:
# Create function to find profit from the wells with the highest predicted volumes
def profit(target, predictions):
    # Find index labels of the 200 wells with the highest predicted volumes
    top_wells = predictions.sort_values(ascending=False).head(200).index
    
    # Find total actual reserves of the selected wells
    total_volume = target.loc[top_wells].sum()
    
    # Find revenue of the selected wells
    revenue = total_volume * revenue_per_unit
    
    # Find profit after deducting the budget
    return round(revenue - budget, 2)

### Application of Profit Calculation

In [16]:
# Find profit of wells in each region
profit_1 = profit(target_valid_1, predictions_valid_1)
print('Region 1 profit:', profit_1,'\n')

profit_2 = profit(target_valid_2, predictions_valid_2)
print('Region 2 profit:', profit_2,'\n')

profit_3 = profit(target_valid_3, predictions_valid_3)
print('Region 3 profit:', profit_3)

Region 1 profit: 33208260.43 

Region 2 profit: 24150866.97 

Region 3 profit: 27103499.64


### Findings

After applying the profit function to all three regions, the first region produced the highest profit at `$33,208,260`, over `$6 million` more than the next best region (Region 3). These figures are maximum potential profits, as the top 200 wells are chosen from the whole validation set.

## Regional Risk and Profit

Profit and risk will be assessed for each region using the bootstrapping method. Each iteration randomly samples 500 wells with replacement and calculates the profit from the 200 wells with the highest predicted volumes. This is repeated 1,000 times.

Risk is measured as the percentage of samples that result in negative profit. Regions with a loss probability higher than 2.5% will be disregarded when making a recommendation. Those that remain will compete on mean profit. A 95% confidence interval will be taken from the 2.5% and 97.5% quantiles of the bootstrapped profits.

In [17]:
# Create a function to find profit using bootstrap technique
def distr_profit(target, predictions):
    
    # Initialize loss counter
    loss = 0
    
    # Save values as an empty list
    values = []
    
    # Choose random state
    state = RandomState(12345)
    
    for i in range(1000):
        subsample_target = target.sample(n=500, replace=True, random_state=state)
        subsample_predict = predictions[subsample_target.index]

        subsample_profit = profit(subsample_target, subsample_predict)
        
        values.append(subsample_profit)
        # Calculate loss
        if subsample_profit < 0:
            loss += 1
    
    # Express loss as a percentage
    loss_percentage = (loss / 1000) * 100
    
    # Save values as a series
    values = pd.Series(values)
    
    # Find average profit
    mean = round(values.mean(),2)

    # 95% confidence interval bounds (2.5% and 97.5% quantiles)
    lower, upper = round(values.quantile(0.025),2), round(values.quantile(0.975),2)

    return mean, (lower, upper), loss_percentage
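One detail worth checking in a bootstrap like the one above: sampling with replacement produces repeated index labels, and label-based `.loc` selection returns every row that matches each requested label. When the top wells are looked up by label in a sample that contains repeats, more rows than intended can be summed. The sketch below illustrates the effect on four hypothetical wells and shows a positional alternative using `reset_index(drop=True)`. It is an illustration under made-up data, not the method used to produce the results that follow, and re-running the analysis with the positional variant would likely give different figures.

```python
import pandas as pd

# Hypothetical targets and predictions for four wells
target = pd.Series([10.0, 20.0, 30.0, 40.0], index=[0, 1, 2, 3])
predictions = pd.Series([1.0, 2.0, 3.0, 4.0], index=[0, 1, 2, 3])

# A bootstrap sample drawn with replacement: well 3 appears twice
sample_target = target.loc[[3, 3, 1]]
sample_predictions = predictions.loc[sample_target.index]

# Label-based selection of the top 2 predictions
top_labels = sample_predictions.sort_values(ascending=False).head(2).index
by_label = sample_target.loc[top_labels]

# Positional selection: reset both indexes so every sampled row has a unique label
sample_target_pos = sample_target.reset_index(drop=True)
sample_predictions_pos = sample_predictions.reset_index(drop=True)
top_positions = sample_predictions_pos.sort_values(ascending=False).head(2).index
by_position = sample_target_pos.loc[top_positions]

# Compare how many rows each approach selects
print(len(by_label), len(by_position))
```

In this toy case the label-based lookup selects four rows for two requested wells, while the positional lookup selects exactly two.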

In [18]:
# Calculate profit for each region with bootstrapping method
mean_1, ci_95_1, loss_1 = distr_profit(target_valid_1, predictions_valid_1)
print('Region 1 bootstrapping values:', '\n', 'Mean:', mean_1, '\n',  
      '95% Confidence Interval:', (ci_95_1),
      '\n', 'Loss Probability:', loss_1, "%", '\n')

mean_2, ci_95_2, loss_2 = distr_profit(target_valid_2, predictions_valid_2)
print('Region 2 bootstrapping values:', '\n', 'Mean:', mean_2, '\n',  
      '95% Confidence Interval:', (ci_95_2),
      '\n', 'Loss Probability:', loss_2, "%", '\n')


mean_3, ci_95_3, loss_3 = distr_profit(target_valid_3, predictions_valid_3)
print('Region 3 bootstrapping values:', '\n', 'Mean:', mean_3, '\n',  
      '95% Confidence Interval:', (ci_95_3),
      '\n', 'Loss Probability:', loss_3, "%", '\n')


Region 1 bootstrapping values: 
 Mean: 6007352.44 
 95% Confidence Interval: (129483.32, 12311636.06) 
 Loss Probability: 2.0 % 

Region 2 bootstrapping values: 
 Mean: 6652410.58 
 95% Confidence Interval: (1579884.81, 11976415.88) 
 Loss Probability: 0.3 % 

Region 3 bootstrapping values: 
 Mean: 6155597.23 
 95% Confidence Interval: (-122184.95, 12306444.74) 
 Loss Probability: 3.0 % 



### Findings

#### Risk

Region 3 is disqualified because its loss probability of 3.0% exceeds the 2.5% threshold. Regions 1 and 2 both fall below it, at 2.0% and 0.3% respectively.

#### Profit

Bootstrapping suggests that Region 2 is the better choice, with a mean profit of `$6,652,410`, whilst Region 1 averages about `$645,000` less.

#### Recommendation

Region 2 is the recommended region, as it combines the lowest risk with the highest mean profit. Whilst Regions 1 and 3 showed promise earlier in the analysis, with higher average predicted volumes and higher profits from the top 200 wells of the validation set, bootstrapping shows that Region 2 is the most profitable on average when wells are chosen from random samples of 500.

## Conclusion

### Data Preparation

Packages were imported and data from three regions was read in. The data was checked for data types, missing values and duplicate values. Some duplicate ids were found but were determined to be present due to chance.

### Model Training

The data was split and a model was trained for each region. The model predicted that Regions 1 and 3 would have higher average volumes of oil than Region 2.

### Profit Calculation Preparation

Data about costs and revenue were introduced. The break-even point was determined to be 111,111 barrels of oil per well. This was higher than the average predicted volume in all three regions.

### Profit Calculation Execution

A profit function was created that summed the actual volumes of the wells with the highest predicted volumes. Region 1 showed the most promise with a profit of around `$33 million`.

### Regional Risk and Profit

Bootstrapping produced results that suggested the risk of Region 3 was too high. Region 2 was chosen as the recommended region due to its higher mean profit of around `$6.7 million`.
