# Task 1 
---

Today is our hands-on machine learning project! Hopefully you brought some interesting datasets you would like to work with.

**If not, we provide a petrophysical dataset (check the data folder for a description) and propose some questions to be investigated with this data.**

Our goal is to practice and learn to address data science questions:

- How to frame the problem? What are the questions I want to address with my data?
- How to identify problems with the data: what are the data cleaning stages that I will have to do?
- How could I explore the data? How can I visualize my data to search for correlations?
- How can I prepare my data for the ML algorithms?
- What are meaningful evaluation metrics that I can apply?


Once your data is prepared for the ML algorithms, you will need to think about further aspects of the
project, for instance:
- The type of ML technique to be used.
- The task to be applied (for instance, classification or regression).
- The evaluation and validation criteria that show whether your model accurately addresses the questions you raised before.

---

## Petrophysical dataset 

### Background info
 
The information below is taken from the file **Data Description.docx**.

The data set is a set of petrophysical well logs from a deep offshore gas exploration well, Iago-1, on the Northwest Shelf of Australia. The complete well dataset is publicly available at no cost, in the form of Log ASCII Standard (.LAS) files, from the WAPIMS (Western Australian Petroleum and Geothermal Information Management System) database.

Six petrophysical measurements were chosen from the well that record changes in:
* density (RHOZ)
* electrical resistivity (HART)
* sonic velocity (DTCO)
* natural radioactivity (ECGR)
* mean atomic number (PEFZ) 
* porosity of the rocks penetrated by the well (TNPH).

These six logs are typically the most important and commonly acquired petrophysical measurements used in offshore oil and gas exploration wells. 

In addition, a **geological manual domaining** is provided. 

---
### Your task

Two features were not completely recorded: their logs include missing data. **Your task is to predict those missing values**. 

You are invited to approach that task in an exploratory fashion. Feel free to test different ideas and re-use and adapt the code from the previous weeks. 

The following steps are *suggested* as a general guideline: 

1. Data inspection/exploration and cleaning
    * open the data set and inspect its size, number of features, and data structure
    * inspect the data types and find the features with missing values (NaNs)
    * perform data cleaning and set appropriate data types
    * get more insight into the data, for instance by inspecting its statistics: the features' distributions and correlations


2. Frame the problem
    * Select one feature to have its missing values predicted
    * Select the predictors (i.e., the features to be used in the prediction)
    * Choose the ML model
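
As a rough illustration of step 2, the sketch below frames the prediction of missing values as a regression problem. It uses a small synthetic table, not the real well data; the choice of `DTCO` as target, of `RHOZ` and `ECGR` as predictors, and of a linear model are assumptions made only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the well logs (NOT the real data): the target
# depends linearly on two complete logs, plus a little noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "RHOZ": rng.normal(2.4, 0.1, 200),
    "ECGR": rng.normal(80.0, 30.0, 200),
})
toy["DTCO"] = (300.0 - 90.0 * toy["RHOZ"] + 0.05 * toy["ECGR"]
               + rng.normal(0.0, 0.5, 200))
toy.loc[150:, "DTCO"] = np.nan  # pretend the last 50 samples were not recorded

target = "DTCO"                # feature whose missing values we want to predict
predictors = ["RHOZ", "ECGR"]  # complete features used for the prediction

# Rows with a recorded target are used for training;
# rows with a missing target are the ones to fill in.
known = toy[toy[target].notna()]
missing = toy[toy[target].isna()]

model = LinearRegression().fit(known[predictors], known[target])
predicted = pd.Series(model.predict(missing[predictors]),
                      index=missing.index, name=target)
```

In a real workflow, part of the `known` rows would be held out (for instance with `train_test_split`) to evaluate the model before trusting the predictions.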

In [1]:
# Standard libraries
import numpy as np  # fast, robust library for numerical and matrix operations (core written in C)
import pandas as pd  # data manipulation library, widely used for data analysis and built on numpy
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical plots with nicer defaults =)

from sklearn.model_selection import train_test_split  # split arrays or matrices into random train and test subsets
from sklearn.preprocessing import StandardScaler  # standardize features by removing the mean and scaling to unit variance

# Auxiliary functions
from utils import *

# the following two lines tell the IPython kernel to reload utils.py after every
# modification, without the need to restart the kernel.
%load_ext autoreload
%autoreload 2

# using the 'inline' backend, your matplotlib graphs will be included in your notebook, next to the code
%matplotlib inline

## Load data


In [2]:
load_path = r"../../data/Petrophysical/"
load_file = "Regression_task1_Petrophysical_data.csv"

df = pd.read_csv(load_path+load_file)

df

Unnamed: 0,DEPTH,DTCO,ECGR,HART,PEFZ,RHOZ,TNPH,Gelogical layer
0,M,us/ft,gAPI,ohm.m,B/E,g/cm3,m3/m3,
1,2207.057,73.166,36.884,2.016,4.214,2.437,0.149,1.0
2,2207.209,74.623,39.817,1.821,4.212,2.438,0.156,1.0
3,2207.362,74.979,42.094,1.758,4.182,2.434,0.163,1.0
4,2207.514,73.891,39.149,1.7,4.088,2.429,0.171,1.0
...,...,...,...,...,...,...,...,...
5715,3077.87,Nan,143.098,0.707,3.133,2.351,Nan,10.0
5716,3078.023,Nan,146.562,0.729,3.213,2.326,Nan,10.0
5717,3078.175,Nan,148.03,0.654,3.262,2.316,Nan,10.0
5718,3078.328,Nan,146.121,0.624,3.304,2.303,Nan,10.0
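
As a possible alternative to cleaning the table after loading, `pd.read_csv` can skip the units row and treat the `Nan` strings as missing values while reading. The sketch below uses a few rows copied from the preview above as an in-memory stand-in for the file; the exact layout of the real CSV is an assumption.

```python
import io
import pandas as pd

# A few rows from the preview, standing in for the real CSV file
# (assumed layout: header line, units line, then the data).
csv_text = """DEPTH,DTCO,ECGR,HART,PEFZ,RHOZ,TNPH,Gelogical layer
M,us/ft,gAPI,ohm.m,B/E,g/cm3,m3/m3,
2207.057,73.166,36.884,2.016,4.214,2.437,0.149,1.0
2207.209,74.623,39.817,1.821,4.212,2.438,0.156,1.0
3077.87,Nan,143.098,0.707,3.133,2.351,Nan,10.0
"""

df_alt = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=[1],       # skip the second line of the file (the units row)
    na_values=["Nan"],  # also treat the string "Nan" as a missing value
)

print(df_alt.dtypes)
```

With this approach the columns are parsed as floats directly, and the row index starts at 0 instead of 1. The notebook continues with the step-by-step cleaning, which makes each problem visible.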


In [3]:
# inspecting data types

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5720 entries, 0 to 5719
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   DEPTH            5720 non-null   object 
 1   DTCO             5720 non-null   object 
 2   ECGR             5720 non-null   object 
 3   HART             5720 non-null   object 
 4   PEFZ             5720 non-null   object 
 5   RHOZ             5720 non-null   object 
 6   TNPH             5720 non-null   object 
 7   Gelogical layer  5719 non-null   float64
dtypes: float64(1), object(7)
memory usage: 357.6+ KB


In [4]:
# another option to inspect the data types

df.dtypes

DEPTH               object
DTCO                object
ECGR                object
HART                object
PEFZ                object
RHOZ                object
TNPH                object
Gelogical layer    float64
dtype: object

In [5]:
# we will see a "pitfall" later related to this counting

df.count()

DEPTH              5720
DTCO               5720
ECGR               5720
HART               5720
PEFZ               5720
RHOZ               5720
TNPH               5720
Gelogical layer    5719
dtype: int64

After loading the data set, we discover that it needs some manipulation:

* remove the first row, which contains the physical units as strings
* change the data types from *object* to *float*

In [6]:
# remove the first row, which contains a string with the physical unit.

df.drop(index=0, inplace=True)

df

Unnamed: 0,DEPTH,DTCO,ECGR,HART,PEFZ,RHOZ,TNPH,Gelogical layer
1,2207.057,73.166,36.884,2.016,4.214,2.437,0.149,1.0
2,2207.209,74.623,39.817,1.821,4.212,2.438,0.156,1.0
3,2207.362,74.979,42.094,1.758,4.182,2.434,0.163,1.0
4,2207.514,73.891,39.149,1.7,4.088,2.429,0.171,1.0
5,2207.666,74.385,36.679,1.641,3.98,2.421,0.172,1.0
...,...,...,...,...,...,...,...,...
5715,3077.87,Nan,143.098,0.707,3.133,2.351,Nan,10.0
5716,3078.023,Nan,146.562,0.729,3.213,2.326,Nan,10.0
5717,3078.175,Nan,148.03,0.654,3.262,2.316,Nan,10.0
5718,3078.328,Nan,146.121,0.624,3.304,2.303,Nan,10.0


In [7]:
# change types from object to float

df = df.astype("float")
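
If a column might contain strings that cannot be parsed as numbers, `astype("float")` raises an error. A more tolerant option is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN. The toy frame below is hypothetical, and the `"bad value"` entry is invented to show the behaviour.

```python
import pandas as pd

# Toy object-typed columns (hypothetical values, not the real data)
raw = pd.DataFrame({
    "DTCO": ["73.166", "Nan", "bad value"],
    "ECGR": ["36.884", "143.098", "146.562"],
})

# Convert every column; anything that is not a valid number becomes NaN
converted = raw.apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
print(converted.isna().sum())
```

The trade-off is that coercion hides unexpected values silently, so it is worth counting the NaNs per column afterwards.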

It is a good idea to perform a "sanity check" after every step to see if the outcome is meaningful.

For instance, use the next cell(s) to verify if the data types have changed to "float". If you want, you can use different ways to double-check that. 
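
One possible way to write such a check is an `assert` that fails loudly if any column is not a float. This is a minimal sketch on a toy frame with made-up columns; the same idea can be applied to `df`.

```python
import pandas as pd
from pandas.api.types import is_float_dtype

# Toy frame with numbers stored as strings (object dtype)
toy = pd.DataFrame({"DEPTH": ["2207.057", "2207.209"],
                    "RHOZ": ["2.437", "2.438"]})
toy = toy.astype("float")

# True only if every column has a float dtype
all_float = all(is_float_dtype(dtype) for dtype in toy.dtypes)
assert all_float, "some columns are still not float"
```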

Remember the "pitfall" related to the counting? Use the cell below to run the count again. What has changed?

**Now it is your turn!**

## When you have finished

You can:
* Train other models and compare them (e.g., a random forest regressor).
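
A comparison of this kind could look like the sketch below. It runs on synthetic data, not on the well logs, and the choice of models, hyperparameters, and mean absolute error as the metric are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem with a non-linear dependence on the features
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(400, 3))
y = np.sin(4.0 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0.0, 0.05, 400)

# Hold out 25% of the samples to evaluate both models on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model on the same training set and score it on the same test set
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {scores[name]:.3f}")
```

Using the same train/test split for every model keeps the comparison fair.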