# Sprint 9 Project

## Introduction

Working for the OilyGiant mining company, I am tasked with finding the best place to build a new well. 

I will take the following steps to choose the new location:

* Collect the oil well parameters in the selected regions: oil quality and volume of reserves
* Build a model for predicting the volume of reserves in the new wells
* Pick the oil wells with the highest estimated values
* Pick the region with the highest total profit for the selected oil wells

I have data on oil samples from three regions. The parameters of each oil well in a region are already known. I will build a model that helps pick the region with the highest profit margin, and analyze the potential profit and risks.

### Data description

`id` — unique oil well identifier\
`f0, f1, f2` — three features of points (their specific meaning is unimportant, but the features themselves are significant)\
`product` — volume of reserves in the oil well (thousand barrels).

## Loading Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Loading Data

In [2]:
geo0 = pd.read_csv('/datasets/geo_data_0.csv')
geo1 = pd.read_csv('/datasets/geo_data_1.csv')
geo2 = pd.read_csv('/datasets/geo_data_2.csv')

In [3]:
geo0.head()

|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | txEyH | 0.705745 | -0.497823 | 1.22117 | 105.280062 |
| 1 | 2acmU | 1.334711 | -0.340164 | 4.36508 | 73.03775 |
| 2 | 409Wp | 1.022732 | 0.15199 | 1.419926 | 85.265647 |
| 3 | iJLyR | -0.032172 | 0.139033 | 2.978566 | 168.620776 |
| 4 | Xdl7t | 1.988431 | 0.155413 | 4.751769 | 154.036647 |


In [4]:
geo0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [5]:
geo1.head()

|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | kBEdx | -15.001348 | -8.276 | -0.005876 | 3.179103 |
| 1 | 62mP7 | 14.272088 | -3.475083 | 0.999183 | 26.953261 |
| 2 | vyE1P | 6.263187 | -5.948386 | 5.00116 | 134.766305 |
| 3 | KcrkZ | -13.081196 | -11.506057 | 4.999415 | 137.945408 |
| 4 | AHL4O | 12.702195 | -8.147433 | 5.004363 | 134.766305 |


In [6]:
geo1.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


In [7]:
geo2.head()

|   | id | f0 | f1 | f2 | product |
|---|---|---|---|---|---|
| 0 | fwXo0 | -1.146987 | 0.963328 | -0.828965 | 27.758673 |
| 1 | WJtFt | 0.262778 | 0.269839 | -2.530187 | 56.069697 |
| 2 | ovLUW | 0.194587 | 0.289035 | -5.586433 | 62.87191 |
| 3 | q6cA6 | 2.23606 | -0.55376 | 0.930038 | 114.572842 |
| 4 | WPMUX | -0.515993 | 1.716266 | 5.899011 | 149.600746 |


In [8]:
geo2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   id       100000 non-null  object 
 1   f0       100000 non-null  float64
 2   f1       100000 non-null  float64
 3   f2       100000 non-null  float64
 4   product  100000 non-null  float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB


## Preparing Data

### Validating Data

In [9]:
# checking for duplicates

print(geo0.duplicated().sum())
print(geo1.duplicated().sum())
print(geo2.duplicated().sum())

0
0
0


In [10]:
# checking data types
display(geo0.dtypes)
display(geo1.dtypes)
display(geo2.dtypes)

id          object
f0         float64
f1         float64
f2         float64
product    float64
dtype: object

id          object
f0         float64
f1         float64
f2         float64
product    float64
dtype: object

id          object
f0         float64
f1         float64
f2         float64
product    float64
dtype: object

### Splitting Data

In [11]:
# splitting each region into feature and target sets

geo0_features = geo0.drop(['product', 'id'], axis=1)
geo0_target = geo0['product']

geo1_features = geo1.drop(['product', 'id'], axis=1)
geo1_target = geo1['product']

geo2_features = geo2.drop(['product', 'id'], axis=1)
geo2_target = geo2['product']

In [12]:
# splitting each set into training and validation sets at a 75:25 ratio

geo0_features_train, geo0_features_valid, geo0_target_train, geo0_target_valid = train_test_split(
    geo0_features, geo0_target, test_size=0.25, random_state=1)

geo1_features_train, geo1_features_valid, geo1_target_train, geo1_target_valid = train_test_split(
    geo1_features, geo1_target, test_size=0.25, random_state=1)

geo2_features_train, geo2_features_valid, geo2_target_train, geo2_target_valid = train_test_split(
    geo2_features, geo2_target, test_size=0.25, random_state=1)

**Wrap-up**

Once the data was loaded into its respective dataframes, I checked it for duplicate rows and confirmed that the data types were appropriate for the data stored in them.

After that was complete, I split the datasets into features and targets. I dropped the `id` column from the features because it is an object-type identifier and not relevant to the regression model.

With the features and targets separated, I used `train_test_split` to split each dataset into training and validation sets at a ratio of 75%:25%.

Now I can move on to building the model.
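The feature/target separation and the 75:25 split are repeated three times, once per region. As a minimal sketch, the same steps could be wrapped in one helper. The function name `split_region` and the synthetic `demo` table are assumptions made for illustration; they are not part of the original notebook, and the real region files are not used here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_region(df, target_col="product", drop_cols=("id",), test_size=0.25, seed=1):
    """Drop the identifier, separate the target, and split 75:25 (hypothetical helper)."""
    features = df.drop(columns=[target_col, *drop_cols])
    target = df[target_col]
    # Returns: features_train, features_valid, target_train, target_valid
    return train_test_split(features, target, test_size=test_size, random_state=seed)


# Synthetic stand-in for one region's table (NOT the real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "id": [f"w{i}" for i in range(100)],
    "f0": rng.normal(size=100),
    "f1": rng.normal(size=100),
    "f2": rng.normal(size=100),
    "product": rng.uniform(0, 180, size=100),
})

features_train, features_valid, target_train, target_valid = split_region(demo)
print(features_train.shape, features_valid.shape)
```

With a helper like this, each region would need a single call, which keeps the split ratio and random state consistent across regions.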

## Building Linear Regression Model

In [13]:
# geo0 linear regression model

geo0_model = LinearRegression()

geo0_model.fit(geo0_features_train, geo0_target_train)

predictions_valid_geo0 = geo0_model.predict(geo0_features_valid)

result = mean_squared_error(geo0_target_valid, predictions_valid_geo0) ** 0.5

predictions_avg_geo0 = predictions_valid_geo0.mean()

print("Average value of predicted reserves:", predictions_avg_geo0)
print('RMSE of the geo0 linear regression model on the validation set:', result)

Average value of predicted reserves: 92.49262459838863
RMSE of the geo0 linear regression model on the validation set: 37.74258669996437


In [14]:
# geo1 linear regression model

geo1_model = LinearRegression()

geo1_model.fit(geo1_features_train, geo1_target_train)

predictions_valid_geo1 = geo1_model.predict(geo1_features_valid)

result = mean_squared_error(geo1_target_valid, predictions_valid_geo1) ** 0.5

predictions_avg_geo1 = predictions_valid_geo1.mean()

print("Average value of predicted reserves:", predictions_avg_geo1)

print('RMSE of the geo1 linear regression model on the validation set:', result)

Average value of predicted reserves: 69.12040524285558
RMSE of the geo1 linear regression model on the validation set: 0.8943375629130574


In [15]:
# geo2 linear regression model

geo2_model = LinearRegression()

geo2_model.fit(geo2_features_train, geo2_target_train)

predictions_valid_geo2 = geo2_model.predict(geo2_features_valid)

result = mean_squared_error(geo2_target_valid, predictions_valid_geo2) ** 0.5

predictions_avg_geo2 = predictions_valid_geo2.mean()

print("Average value of predicted reserves:", predictions_avg_geo2)
                            
print('RMSE of the geo2 linear regression model on the validation set:', result)

Average value of predicted reserves: 94.9568304858529
RMSE of the geo2 linear regression model on the validation set: 39.86671127773423


**Wrap-up**

The models have been built and trained on data for each region. I can conclude the following:

* The geo2 region has the largest average predicted reserves (94.96), followed closely by geo0 (92.49).
* The geo1 model has by far the lowest RMSE: its predictions are typically off by only about 0.89 thousand barrels, compared with roughly 38 to 40 thousand barrels for the other two regions.
    * This leads me to conclude that the geo1 model is the most accurate.
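To make the RMSE figures easier to interpret, here is a small worked example of how the metric is computed. The four `actual` and `predicted` values are invented for illustration and do not come from any of the three regions.

```python
import numpy as np

# Made-up reserve volumes (thousand barrels), for illustration only
actual = np.array([100.0, 80.0, 120.0, 60.0])
predicted = np.array([90.0, 85.0, 110.0, 75.0])

errors = actual - predicted          # per-well prediction errors
mse = np.mean(errors ** 2)           # mean squared error
rmse = mse ** 0.5                    # same step as `mean_squared_error(...) ** 0.5`

print(f"MSE: {mse}, RMSE: {rmse:.2f}")
```

Because the errors are squared before averaging, large misses weigh more heavily than small ones, and the result is expressed in the same unit as `product`.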
   
Next, I will move on to preparing data for profit calculation.

## Preparing For Profit Calculation

In [16]:
# provided variables for calculations

BUDGET = 100000000
WELL_SAMPLE_SIZE = 200
REVENUE_PER_UNIT = 4500

In [17]:
min_product_per_well = (BUDGET/WELL_SAMPLE_SIZE) / REVENUE_PER_UNIT

print(min_product_per_well)

111.11111111111111


Each well needs to produce an average of at least **111.11 units of product** (thousand barrels) for the new wells to operate without losses.

Every region's average predicted reserves fall short of this threshold. The highest average belongs to the geo2 region, with a value of **94.96**, which is still below the **111.11** unit break-even point. Wells therefore cannot be picked at random; only the most promising ones should be developed.

The next step is to pick the top 200 predicted wells in each region and calculate their profit.
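The break-even arithmetic can be laid out step by step. The sketch below reuses the constants given above and the average predicted reserves printed by the three models (rounded to two decimals); the names `budget_per_well`, `region_means`, and `shortfall` are introduced here only for illustration.

```python
BUDGET = 100_000_000        # total budget for development
WELL_SAMPLE_SIZE = 200      # number of wells to be developed
REVENUE_PER_UNIT = 4_500    # revenue per unit of product (thousand barrels)

# Budget available per well, then the volume each well must yield to pay for itself
budget_per_well = BUDGET / WELL_SAMPLE_SIZE
break_even_units = budget_per_well / REVENUE_PER_UNIT

# Average predicted reserves per region, copied from the model outputs (rounded)
region_means = {"geo0": 92.49, "geo1": 69.12, "geo2": 94.96}
shortfall = {region: round(break_even_units - mean, 2) for region, mean in region_means.items()}

print(f"Budget per well: {budget_per_well:,.0f}")
print(f"Break-even volume per well: {break_even_units:.2f}")
print("Shortfall of the average well, by region:", shortfall)
```

The shortfall is positive for every region, which is why selecting the best wells matters more than the regional average.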

## Calculating Profit of Top 200 Wells Per Region

In [18]:
# Defining a function to find the top 200 wells by predicted reserves and calculate their profit
# It takes the true reserves (target) and the model predictions, and returns a profit amount

def calculate_profit(target, predictions):
    target = target.reset_index(drop=True)
    predictions = pd.Series(predictions)
    predictions = predictions.reset_index(drop=True)
    top_wells = predictions.sort_values(ascending=False).head(WELL_SAMPLE_SIZE).index
    target_reserves = target.loc[top_wells].sum()
    revenue = target_reserves * REVENUE_PER_UNIT
    profit = revenue - BUDGET
    return profit.round(2)

In [19]:
# calculating profit for each region's top 200 wells

print(f'geo0 Region top 200 wells predicted profit: {calculate_profit(geo0_target_valid, predictions_valid_geo0):,}')
print(f'geo1 Region top 200 wells predicted profit: {calculate_profit(geo1_target_valid, predictions_valid_geo1):,}')
print(f'geo2 Region top 200 wells predicted profit: {calculate_profit(geo2_target_valid, predictions_valid_geo2):,}')

geo0 Region top 200 wells predicted profit: 32,607,814.18
geo1 Region top 200 wells predicted profit: 24,150,866.97
geo2 Region top 200 wells predicted profit: 25,630,933.52


**Wrap-Up**

If I look strictly at the top 200 wells in each region, the region represented by the **geo0 dataset** would produce the most profit. This outcome is unlikely to be reproduced in the later sampling, because here the best 200 wells are picked from the entire validation set rather than from a 500-well sample. It mainly shows that geo0 has the highest-performing top 200 compared with the other regions' top 200.

The profit was calculated by taking the top 200 sites by predicted reserves, looking up their true reserve values in the provided data, multiplying the total by the revenue per unit (4,500), and then subtracting the initial investment of $100 million.
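A toy example may help show why wells are ranked by prediction but valued by their true reserves. The sketch below uses five invented wells, picks the top two, and uses a made-up budget of 900,000; the function name `profit_of_top_wells` and all numbers are assumptions for illustration, not the project's real figures.

```python
import pandas as pd


def profit_of_top_wells(target, predictions, n_wells, revenue_per_unit, budget):
    """Rank wells by prediction, then value the chosen wells by their true reserves."""
    target = pd.Series(target).reset_index(drop=True)
    predictions = pd.Series(predictions).reset_index(drop=True)
    top = predictions.sort_values(ascending=False).head(n_wells).index
    return target.loc[top].sum() * revenue_per_unit - budget


# Invented values (thousand barrels); well 3 is the best in reality but is under-predicted
true_reserves = [50.0, 120.0, 90.0, 150.0, 30.0]
predicted = [60.0, 140.0, 130.0, 100.0, 20.0]

toy_profit = profit_of_top_wells(true_reserves, predicted, n_wells=2,
                                 revenue_per_unit=4_500, budget=900_000)
print(toy_profit)  # wells 1 and 2 are chosen: (120 + 90) * 4500 - 900000 = 45000.0
```

The model selects wells 1 and 2, so the truly best well is missed. This is how prediction error translates into lost profit.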

Next, I will take a more realistic look at each region, using bootstrapping to repeatedly resample the wells and estimate the distribution of profit.

## Calculating Risks and Profits of Each Region

In [20]:
# establishing a global random state for the bootstrapping
state = np.random.RandomState(1)

In [21]:
# defining a function for bootstrapping
def bootstrapping(target, predictions, well_samples, bootstrap_samples, region):
    target = pd.Series(target)
    predictions = pd.Series(predictions)
    
    target = target.reset_index(drop=True)
    predictions = predictions.reset_index(drop=True)
    
    values = [] # empty list to store profits
    
    # running the profit calculation for the bootstrapped samples
    for _ in range(bootstrap_samples):
        # draw wells with replacement, then look up the matching predictions by index label
        target_subsample = target.sample(n=well_samples, replace=True, random_state=state)
        preds_subsample = predictions[target_subsample.index]
        values.append(calculate_profit(target_subsample, preds_subsample))
    
    values = pd.Series(values)
    # creating upper and lower bounds for a 95% confidence interval
    upper = round(values.quantile(0.975), 2)
    lower = round(values.quantile(0.025), 2)
    
    # calculating mean of the samples
    mean = round(values.mean(), 2)
    
    # calculating the probability of losses
    risk_of_loss = (values < 0).mean() * 100
    
    print(f'Region represented by {region}', end='\n')
    print(f'The upper and lower bounds for a 95% confidence interval are: ({lower:,}, {upper:,})', end='\n')
    print(f'The average profit in this region is: {mean:,}', end='\n')
    print(f'The risk of losses in this region is: {risk_of_loss}%', end='\n')
    print()

In [22]:
bootstrapping(geo0_target_valid, predictions_valid_geo0, 500, 1000, 'geo0')
bootstrapping(geo1_target_valid, predictions_valid_geo1, 500, 1000, 'geo1')
bootstrapping(geo2_target_valid, predictions_valid_geo2, 500, 1000, 'geo2')

Region represented by geo0
The upper and lower bounds for a 95% confidence interval are: (-1,002,396.27, 9,471,095.33)
The average profit in this region is: 4,330,094.0
The risk of losses in this region is: 5.2%

Region represented by geo1
The upper and lower bounds for a 95% confidence interval are: (874,445.41, 8,857,187.54)
The average profit in this region is: 4,833,586.7
The risk of losses in this region is: 0.6%

Region represented by geo2
The upper and lower bounds for a 95% confidence interval are: (-1,071,231.91, 9,322,531.18)
The average profit in this region is: 3,926,721.33
The risk of losses in this region is: 7.199999999999999%



**Conclusion**

I analyzed each region by bootstrapping 1,000 iterations of 500 sampled wells each, selecting the top 200 wells by predicted reserves in every iteration, and calculating the average profit, the 95% confidence interval, and the risk of loss. The following can be concluded:

The region represented by **`geo1`** will likely secure the most profit: it has the highest average profit (about $4.83 million). While this region has the lowest 97.5th-percentile profit, it is the only region whose 95% confidence interval lies ***entirely above zero***. The other two regions show losses of about $1 million at their 2.5th percentiles.

This data suggests that **`geo1`** will deliver the most consistent profit, with the lowest risk of losses (0.6%, compared with 5.2% for geo0 and about 7.2% for geo2).

Since I used 1,000 bootstrap iterations, we can be reasonably confident that these results are not driven by random chance alone.
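As a minimal sketch of the bootstrap procedure described here, the variant below returns the summary statistics instead of printing them and takes its own seed, so one region's result does not depend on the order in which regions are processed. The function name `bootstrap_profit`, its parameters, and the synthetic reserves are assumptions for illustration; the numbers it produces are unrelated to the results reported above.

```python
import numpy as np
import pandas as pd


def bootstrap_profit(target, predictions, n_sampled, n_selected,
                     revenue_per_unit, budget, n_iterations=1000, seed=1):
    """Bootstrap the profit of the top `n_selected` wells out of `n_sampled` drawn wells."""
    target = pd.Series(target).reset_index(drop=True)
    predictions = pd.Series(predictions).reset_index(drop=True)
    rng = np.random.RandomState(seed)

    profits = []
    for _ in range(n_iterations):
        # Draw well positions with replacement and keep targets/predictions aligned
        idx = rng.randint(0, len(target), size=n_sampled)
        sample_true = target.iloc[idx].reset_index(drop=True)
        sample_pred = predictions.iloc[idx].reset_index(drop=True)
        # Select by prediction, value by true reserves
        top = sample_pred.sort_values(ascending=False).head(n_selected).index
        profits.append(sample_true.loc[top].sum() * revenue_per_unit - budget)

    profits = pd.Series(profits)
    return {
        "mean": profits.mean(),
        "ci_lower": profits.quantile(0.025),
        "ci_upper": profits.quantile(0.975),
        "risk_of_loss_pct": (profits < 0).mean() * 100,
    }


# Synthetic wells (NOT the real data): true reserves plus noisy predictions
gen = np.random.RandomState(0)
true_reserves = gen.uniform(0, 180, size=5000)
predicted = true_reserves + gen.normal(0, 30, size=5000)

summary = bootstrap_profit(true_reserves, predicted, n_sampled=500, n_selected=200,
                           revenue_per_unit=4_500, budget=100_000_000, n_iterations=200)
print({key: round(value, 2) for key, value in summary.items()})
```

Returning a dictionary makes it straightforward to collect the three regions into one table for comparison.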

Throughout this project, I followed a typical machine learning workflow:
* Imported the provided datasets and prepared them for analysis (by validating them and splitting them into training and validation sets)
* Trained a linear regression model for each region's dataset
* Used the models to predict the oil reserves of the wells in the validation sets
* Calculated the profit of each region's top 200 wells
* Used bootstrapping to generate 1,000 iterations of 500 sampled wells to calculate the average profit, 95% confidence interval, and risk of loss
* Used these results to make an informed recommendation on the best region to develop