# Machine Learning for Business project

## About Project

You work for the OilyGiant mining company. Your task is to find the best place for a new well.

**Steps to choose the location:**
1. Collect the oil well parameters in the selected region: oil quality and volume of reserves;
2. Build a model for predicting the volume of reserves in the new wells;
3. Pick the oil wells with the highest estimated values;
4. Pick the region with the highest total profit for the selected oil wells.

You have data on oil samples from three regions. The parameters of each oil well in the region are already known. Build a model that will help to pick the region with the highest profit margin. Analyze potential profit and risks using the bootstrap technique.

## Conditions

Only linear regression is suitable for model training (the rest are not sufficiently predictable).

When exploring a region, 500 points are studied, and the best 200 of them are picked for the profit calculation.

The budget for oil well development is 100 million USD.

One barrel of raw materials brings 4.5 USD of revenue. The revenue from one unit of product is 4,500 USD (the volume of reserves is in thousands of barrels).

After the risk evaluation, keep only the regions with the risk of losses lower than 2.5%. From the ones that fit the criteria, the region with the highest average profit should be selected.
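
The selection rule can be sketched in a few lines. The region names, the figures, and the helper `choose_region` below are hypothetical placeholders used only to illustrate the rule; they are not results from this project.

```python
# Hypothetical bootstrap summaries per region; the figures are placeholders,
# not results from this project.
summaries = {
    "A": {"mean_profit": 4.0e6, "loss_risk": 0.060},
    "B": {"mean_profit": 4.5e6, "loss_risk": 0.010},
    "C": {"mean_profit": 3.9e6, "loss_risk": 0.020},
}

MAX_LOSS_RISK = 0.025  # 2.5% threshold from the conditions


def choose_region(summaries, max_risk=MAX_LOSS_RISK):
    """Return the eligible region with the highest mean profit, or None."""
    # Step 1: keep only regions whose risk of losses is below the threshold
    eligible = {name: s for name, s in summaries.items() if s["loss_risk"] < max_risk}
    if not eligible:
        return None
    # Step 2: among those, pick the highest average profit
    return max(eligible, key=lambda name: eligible[name]["mean_profit"])


best = choose_region(summaries)
print(best)
```

Filtering on risk first means that a region with a high average profit but a risk above the threshold is never selected.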

The data is synthetic: contract details and well characteristics are not disclosed.

## Project plan

1. **Download and prepare the data. Explain the procedure.**

2. **Train and test the model for each region:**

    2.1. Split the data into a training set and validation set at a ratio of 75:25.
    
    2.2. Train the model and make predictions for the validation set.
    
    2.3. Save the predictions and correct answers for the validation set.
    
    2.4. Print the average volume of predicted reserves and model RMSE.
    
    2.5. Analyze the results.

3. **Prepare for profit calculation:**
    
    3.1. Store all key values for calculations in separate variables.
    
    3.2. Calculate the volume of reserves sufficient for developing a new well without losses. Compare the obtained value with the average volume of reserves in each region.
    
    3.3. Provide the findings about the preparation for profit calculation step.

4. **Write a function to calculate profit from a set of selected oil wells and model predictions:**
    
    4.1. Pick the wells with the highest values of predictions. The number of wells depends on the budget and cost of developing one oil well.
    
    4.2. Summarize the target volume of reserves in accordance with these predictions.
    
    4.3. Provide findings: suggest a region for oil wells' development and
    justify the choice. Calculate the profit for the obtained volume of reserves.

5. **Calculate risks and profit for each region:**
    
    5.1. Use the bootstrap technique with 1000 samples to find the distribution of profit.
    
    5.2. Find average profit, 95% confidence interval and risk of losses. Loss is negative profit.
    
    5.3. Provide findings: suggest a region for development of oil wells and justify the choice.

### 1. Download and prepare the data:

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [2]:
# Load the data for the three regions and print general info about each dataset
data_0 = pd.read_csv('/datasets/geo_data_0.csv')
data_1 = pd.read_csv('/datasets/geo_data_1.csv')
data_2 = pd.read_csv('/datasets/geo_data_2.csv')
print("-------------------Data_0 info--------------------")
print(data_0.info())
print("-------------------Data_1 info--------------------")
print(data_1.info())
print("-------------------Data_2 info--------------------")
print(data_2.info())

-------------------Data_0 info--------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
id         100000 non-null object
f0         100000 non-null float64
f1         100000 non-null float64
f2         100000 non-null float64
product    100000 non-null float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
-------------------Data_1 info--------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
id         100000 non-null object
f0         100000 non-null float64
f1         100000 non-null float64
f2         100000 non-null float64
product    100000 non-null float64
dtypes: float64(4), object(1)
memory usage: 3.8+ MB
None
-------------------Data_2 info--------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
id         100000 non-null object
f0         100000 non-nu

In [3]:
def print_data(data):
    print(data.head())

In [4]:
# Print the first 5 rows of each dataset
print('-------------------data_0 first five rows--------------------')
print_data(data_0)
print('-------------------data_1 first five rows--------------------')
print_data(data_1)
print('-------------------data_2 first five rows--------------------')
print_data(data_2)

-------------------data_0 first five rows--------------------
      id        f0        f1        f2     product
0  txEyH  0.705745 -0.497823  1.221170  105.280062
1  2acmU  1.334711 -0.340164  4.365080   73.037750
2  409Wp  1.022732  0.151990  1.419926   85.265647
3  iJLyR -0.032172  0.139033  2.978566  168.620776
4  Xdl7t  1.988431  0.155413  4.751769  154.036647
-------------------data_1 first five rows--------------------
      id         f0         f1        f2     product
0  kBEdx -15.001348  -8.276000 -0.005876    3.179103
1  62mP7  14.272088  -3.475083  0.999183   26.953261
2  vyE1P   6.263187  -5.948386  5.001160  134.766305
3  KcrkZ -13.081196 -11.506057  4.999415  137.945408
4  AHL4O  12.702195  -8.147433  5.004363  134.766305
-------------------data_2 first five rows--------------------
      id        f0        f1        f2     product
0  fwXo0 -1.146987  0.963328 -0.828965   27.758673
1  WJtFt  0.262778  0.269839 -2.530187   56.069697
2  ovLUW  0.194587  0.289035 -5.58643

In [5]:
# Check the data for duplicated rows
# NOTE: we did not check for missing values separately, because the info() output above showed none
print("Duplicated rows in data_0:", data_0.duplicated().sum())
print("Duplicated rows in data_1:", data_1.duplicated().sum())
print("Duplicated rows in data_2:", data_2.duplicated().sum())

Duplicated rows in data_0: 0
Duplicated rows in data_1: 0
Duplicated rows in data_2: 0
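
As a complement to the row-level check, missing values and repeated identifiers can be counted explicitly. The sketch below uses a made-up toy frame, not the project data, and only shows that `duplicated()` on whole rows and on a single column answer different questions.

```python
import pandas as pd

# Toy frame standing in for one region's data (values are made up)
toy = pd.DataFrame({
    "id": ["a1", "b2", "a1", "c3"],
    "f0": [0.1, 0.2, 0.3, None],
    "product": [10.0, 20.0, 30.0, 40.0],
})

missing_per_column = toy.isna().sum()         # NaN count per column
duplicate_rows = toy.duplicated().sum()       # rows identical in every column
duplicate_ids = toy["id"].duplicated().sum()  # repeated ids, other columns may differ

print(missing_per_column)
print("Duplicated rows:", duplicate_rows)
print("Duplicated ids:", duplicate_ids)
```

In the toy frame the id `a1` appears twice with different feature values, so the row-level check finds nothing while the id-level check finds one repeat.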


In [6]:
# Drop the id column: it is an identifier, and the model could learn spurious relationships from it
def id_drop(data):
    data = data.drop('id',axis=1)
    return data

In [7]:
data_0 = id_drop(data_0)
data_1 = id_drop(data_1)
data_2 = id_drop(data_2)
print_data(data_0)
print_data(data_1)
print_data(data_2)

         f0        f1        f2     product
0  0.705745 -0.497823  1.221170  105.280062
1  1.334711 -0.340164  4.365080   73.037750
2  1.022732  0.151990  1.419926   85.265647
3 -0.032172  0.139033  2.978566  168.620776
4  1.988431  0.155413  4.751769  154.036647
          f0         f1        f2     product
0 -15.001348  -8.276000 -0.005876    3.179103
1  14.272088  -3.475083  0.999183   26.953261
2   6.263187  -5.948386  5.001160  134.766305
3 -13.081196 -11.506057  4.999415  137.945408
4  12.702195  -8.147433  5.004363  134.766305
         f0        f1        f2     product
0 -1.146987  0.963328 -0.828965   27.758673
1  0.262778  0.269839 -2.530187   56.069697
2  0.194587  0.289035 -5.586433   62.871910
3  2.236060 -0.553760  0.930038  114.572842
4 -0.515993  1.716266  5.899011  149.600746


### Conclusion

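First, we read the data from the CSV files, printed general information about each dataset, and looked at the first rows. We then checked for duplicated rows and found none; `info()` showed no missing values. Finally, we dropped the `id` column so that the model does not learn spurious relationships from well identifiers.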

### 2. Train and test the model for each region:

In [8]:
#Splitter function will split data into train and validation sets
def splitter(data):
    target = data['product']
    features = data.drop('product', axis = 1)
    
    features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size = 0.25, random_state = 12345)
    
    return features_train, features_valid, target_train, target_valid

In [9]:
features_train_0, features_valid_0, target_train_0, target_valid_0 = splitter(data_0)
features_train_1, features_valid_1, target_train_1, target_valid_1 = splitter(data_1)
features_train_2, features_valid_2, target_train_2, target_valid_2 = splitter(data_2)

In [10]:
# The trainer function fits a linear regression on the training set, predicts on the
# validation set, and returns the average prediction, the RMSE and the predictions
def trainer(features_train, features_valid, target_train, target_valid):
    model = LinearRegression()
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    
    avg = predicted_valid.mean()
    
    mse = mean_squared_error(target_valid, predicted_valid)
    rmse = mse ** 0.5
    
    return avg, rmse, predicted_valid
    
    

In [11]:
avg_0, rmse_0, predicted_valid_0 = trainer(features_train_0, features_valid_0, target_train_0, target_valid_0)
avg_1, rmse_1, predicted_valid_1 = trainer(features_train_1, features_valid_1, target_train_1, target_valid_1)
avg_2, rmse_2, predicted_valid_2 = trainer(features_train_2, features_valid_2, target_train_2, target_valid_2)

print("Predicted Average volume for data_0 = {:.2f} and rmse = {:.2f}".format(avg_0, rmse_0))
print("Predicted Average volume for data_1 = {:.2f} and rmse = {:.2f}".format(avg_1, rmse_1))
print("Predicted Average volume for data_2 = {:.2f} and rmse = {:.2f}".format(avg_2, rmse_2))

Predicted Average volume for data_0 = 92.59 and rmse = 37.58
Predicted Average volume for data_1 = 68.73 and rmse = 0.89
Predicted Average volume for data_2 = 94.97 and rmse = 40.03


In [12]:
df_0 = pd.DataFrame({'Actual': target_valid_0, 'Predicted': predicted_valid_0}).reset_index(drop=True)
df_1 = pd.DataFrame({'Actual': target_valid_1, 'Predicted': predicted_valid_1}).reset_index(drop=True)
df_2 = pd.DataFrame({'Actual': target_valid_2, 'Predicted': predicted_valid_2}).reset_index(drop=True)


### Conclusion

Here we split each dataset into training and validation sets with the `splitter` function and trained a linear regression model with the `trainer` function. We then stored the actual and predicted volumes in one dataframe per region for later use. Region 1 has a lower average predicted volume (68.73) than regions 0 and 2 (92.59 and 94.97), but its RMSE (0.89) is also far lower than theirs (37.58 and 40.03), so its predictions are much closer to the actual values.
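
RMSE is expressed in the same units as the target (thousand barrels), so it is easier to judge next to a reference point. One common reference is a constant prediction equal to the mean of the target. The sketch below uses synthetic data and a stand-in for model output, and for simplicity it takes the mean of the validation target itself; all names and numbers in it are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic validation target and two sets of predictions (made-up data)
target = rng.normal(loc=90.0, scale=40.0, size=1000)
model_pred = target + rng.normal(scale=10.0, size=1000)  # stand-in for model output
baseline_pred = np.full_like(target, target.mean())      # constant-mean baseline


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays of equal length."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


rmse_model = rmse(target, model_pred)
rmse_baseline = rmse(target, baseline_pred)
print(f"model RMSE = {rmse_model:.2f}, baseline RMSE = {rmse_baseline:.2f}")
```

The baseline RMSE equals the standard deviation of the target, so a model RMSE close to it would mean the features add little information.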

### 3. Prepare for profit calculation:

In [13]:
budget = 100000000
num_of_wells = 200
revenue = 4500 # USD per unit of product (1 unit = 1,000 barrels)
volume =  (budget / num_of_wells) / revenue
print('The volume of reserves sufficient for developing a new well without losses = {:.1f}'.format(volume))

The volume of reserves sufficient for developing a new well without losses = 111.1


### Conclusion

Here we calculated the volume of reserves sufficient for developing a new well without losses, using the numbers from the conditions. The result is `111.1` thousand barrels per well, which is higher than the average predicted volume in every region (92.59, 68.73 and 94.97). Even with the RMSE taken into account, the average for region 1 remains below this threshold.
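
The break-even arithmetic can be written out step by step. The sketch below uses the budget, well count and revenue from the conditions, and the average predicted volumes printed in section 2; the variable names are chosen for illustration only.

```python
BUDGET = 100_000_000      # USD, from the conditions
WELLS = 200               # wells developed per region
REVENUE_PER_UNIT = 4_500  # USD per unit (1 unit = 1,000 barrels)

cost_per_well = BUDGET / WELLS                        # USD spent on one well
break_even_volume = cost_per_well / REVENUE_PER_UNIT  # units one well must yield

# Average predicted volumes as printed in section 2
avg_predicted = {0: 92.59, 1: 68.73, 2: 94.97}

# Positive gap: the regional average is below the break-even volume
shortfall = {region: break_even_volume - avg for region, avg in avg_predicted.items()}
for region, gap in shortfall.items():
    print(f"Region {region}: average is {gap:.2f} units below break-even")
```

Because every regional average is below the break-even volume, developing 200 randomly chosen wells would be expected to lose money, which is why the wells are selected by predicted volume.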

### 4. Write a function to calculate profit from a set of selected oil wells and model predictions:

In [14]:
# Calculate the profit from the 200 wells with the highest predicted volume,
# using their actual volume for the revenue
def calculate_profit(actual, predicted):
    predicted = predicted.sort_values(ascending=False)
    highestVolumeWells = actual[predicted.index][:200]
    volume = highestVolumeWells.sum()
    profit = (revenue * volume) - budget
    
    return profit

In [15]:
print('The average profit in {} region is {:.1f}'.format(0,calculate_profit(df_0['Actual'], df_0['Predicted'])))
print('The average profit in {} region is {:.1f}'.format(1,calculate_profit(df_1['Actual'], df_1['Predicted'])))
print('The average profit in {} region is {:.1f}'.format(2,calculate_profit(df_2['Actual'], df_2['Predicted'])))

The average profit in 0 region is 33208260.4
The average profit in 1 region is 24150867.0
The average profit in 2 region is 27103499.6


### Conclusion

Here we calculated the profit for each region by taking the 200 wells with the highest predicted volume from each region's validation set and summing their actual volume. As we see, region 0 has the highest profit, more than `33` million dollars.
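
The key point of the profit calculation is that wells are chosen by predicted volume while the revenue comes from actual volume. A minimal sketch on five made-up wells shows the difference; the helper `profit_top_n` and the toy budget are hypothetical and not part of the project code.

```python
import pandas as pd

REVENUE_PER_UNIT = 4_500  # USD per unit, from the conditions


def profit_top_n(actual, predicted, n_wells, budget):
    """Pick the n_wells with the highest predictions and sum their actual volume."""
    top_idx = predicted.sort_values(ascending=False).index[:n_wells]
    volume = actual.loc[top_idx].sum()
    return REVENUE_PER_UNIT * volume - budget


# Toy data: five wells with unique index labels (values are made up)
actual = pd.Series([100.0, 150.0, 50.0, 120.0, 80.0])
predicted = pd.Series([110.0, 90.0, 60.0, 130.0, 70.0])

# The top 2 by prediction are wells 3 and 0, with actual volumes 120 and 100.
# Well 1 has the highest actual volume but is not selected.
toy_profit = profit_top_n(actual, predicted, n_wells=2, budget=2 * 500_000)
print(toy_profit)
```

The toy selection yields 220 units, which is slightly below the 2 × 111.1 units needed to break even, so the profit is negative even though a better pair of wells existed.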

### 5. Calculate risks and profit for each region:

In [16]:
#Setting random state
state = np.random.RandomState(12345)

In [17]:
# Apply bootstrapping: estimate the average profit, the 95% confidence interval and the risk of losses
def profit_distribution(target,predicted, region):
    values = []
    negative_count = 0
    for i in range(1000):
    
        target_subsample = target.sample(n=500, replace=True, random_state = state)
        predicted_subsample = predicted[target_subsample.index][:len(target)]
        profit = calculate_profit(target_subsample, predicted_subsample)
        values.append(profit)
        if profit < 0 :
            negative_count += 1
            

    values = pd.Series(values)
    lower = values.quantile(.025)
    upper = values.quantile(.975)
    loss_probability = negative_count / len(values)
    
    mean = values.mean()
    print("Average profit for region {} is: {:.3f}".format(region, mean))
    print("Confidence interval for region {} is between {:.3f} and {:.3f}".format(region, lower, upper))
    print("The probability loss is {:.2%}".format(loss_probability))

In [18]:
profit_distribution(df_0['Actual'], df_0['Predicted'], 0)
profit_distribution(df_1['Actual'], df_1['Predicted'], 1)
profit_distribution(df_2['Actual'], df_2['Predicted'], 2)

Average profit for region 0 is: 4259385.269
Confidence interval for region 0 is between -1020900.948 and 9479763.534
The probability loss is 6.00%
Average profit for region 1 is: 5182594.937
Confidence interval for region 1 is between 1281232.314 and 9536129.821
The probability loss is 0.30%
Average profit for region 2 is: 4201940.053
Confidence interval for region 2 is between -1158526.092 and 9896299.398
The probability loss is 6.20%


### Conclusion

Here we applied bootstrapping with `1000` samples. For regions 0 and 2, the lower bound of the 95% confidence interval is negative, which means a loss is possible, and the risk of losses (6.00% and 6.20%) exceeds the 2.5% threshold. Region 1 has the lowest risk (0.30%) and the highest average profit (about 5.18 million dollars), and its whole confidence interval is positive. Region 1 is therefore the better choice.
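
One detail is worth checking in a bootstrap of this kind. When a pandas Series is resampled with replacement, its index contains repeated labels, and selecting from it by label returns every matching row for every requested label, so the selection can hold more rows than were drawn. The sketch below first shows that effect on a three-element Series and then gives a position-based variant on NumPy arrays. The function `bootstrap_profit`, its parameters, and the synthetic data are assumptions made for illustration, and its figures are not comparable to the results printed above.

```python
import numpy as np
import pandas as pd

# Effect of repeated labels: label 1 occurs twice in the index and twice in
# the key, so .loc returns 2 * 2 rows for it plus 1 row for label 2.
resampled = pd.Series([10.0, 20.0, 30.0]).iloc[[1, 1, 2]]
expanded = resampled.loc[resampled.index]
print(len(resampled), len(expanded))


def bootstrap_profit(actual, predicted, n_points=500, n_wells=200,
                     budget=100_000_000, revenue_per_unit=4_500,
                     n_samples=1000, seed=12345):
    """Position-based bootstrap of profit (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    profits = np.empty(n_samples)
    for i in range(n_samples):
        # Draw n_points well positions with replacement
        pos = rng.integers(0, len(actual), size=n_points)
        # Keep the n_wells positions with the highest predicted volume
        top = pos[np.argsort(predicted[pos])[::-1][:n_wells]]
        profits[i] = revenue_per_unit * actual[top].sum() - budget
    return profits


# Synthetic stand-in data, not the project's datasets
rng = np.random.default_rng(0)
actual = rng.normal(90.0, 40.0, size=5000).clip(min=0)
predicted = actual + rng.normal(0.0, 30.0, size=5000)

profits = bootstrap_profit(actual, predicted)
mean_profit = profits.mean()
lower, upper = np.quantile(profits, [0.025, 0.975])
loss_risk = (profits < 0).mean()
print(f"mean={mean_profit:.0f}, 95% interval=({lower:.0f}, {upper:.0f}), risk={loss_risk:.2%}")
```

Working with integer positions keeps each bootstrap sample at exactly 500 points, so the top 200 are always chosen from the sample that was actually drawn.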

### General conclusion

We loaded and explored the data, split it into training and validation sets, and trained a linear regression model for each region. We then calculated the break-even volume per well and the profit for each region, and applied the bootstrap technique. According to our findings, region 1 has the lowest risk of losses and the highest average profit.