<a href="https://colab.research.google.com/github/divx1979/IMB_CLASSIFICATION/blob/main/Oil_Spill_CLASS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Oil Spill Classification
## About Dataset
The dataset was built from satellite images of the ocean, some of which contain an oil spill and some of which do not.
Images were split into sections and processed with computer vision algorithms to produce a feature vector describing the contents of each image section, or patch.
The task is to predict, given the feature vector for a patch of a satellite image, whether the patch contains an oil spill, e.g. from the illegal or accidental dumping of oil in the ocean.

There are two classes and the goal is to distinguish between spill and non-spill using the features for a given ocean patch.

- Non-Spill: negative case, or majority class.
- Oil Spill: positive case, or minority class.
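
Before modelling, it helps to quantify how skewed the two classes are. The sketch below is a minimal example on toy labels, not on the real data: `class_balance` and `toy_target` are hypothetical names introduced for illustration, and the 19-to-1 ratio is made up. With the real dataset one would pass `da1['target']` instead.

```python
import pandas as pd

def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return per-class counts and proportions, most frequent class first."""
    counts = labels.value_counts()
    return pd.DataFrame({
        "count": counts,
        "proportion": counts / counts.sum(),
    })

# Toy labels standing in for da1['target'] (0 = non-spill, 1 = oil spill).
# The 19:1 ratio is an assumption for illustration only.
toy_target = pd.Series([0] * 19 + [1] * 1, name="target")
balance = class_balance(toy_target)
print(balance)
```

The proportion of the positive class is also a useful reference point later on, since it is roughly the average precision a no-skill classifier would achieve.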

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as IMBPipeline
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import average_precision_score

In [2]:
da1 = pd.read_csv('/content/oil_spill.csv')

print(da1.head())

   f_1    f_2      f_3     f_4  f_5       f_6    f_7   f_8      f_9  f_10  \
0    1   2558  1506.09  456.63   90   6395000  40.88  7.89  29780.0  0.19   
1    2  22325    79.11  841.03  180  55812500  51.11  1.21  61900.0  0.02   
2    3    115  1449.85  608.43   88    287500  40.42  7.34   3340.0  0.18   
3    4   1201  1562.53  295.65   66   3002500  42.40  7.97  18030.0  0.19   
4    5    312   950.27  440.86   37    780000  41.43  7.03   3350.0  0.17   

   ...     f_41      f_42     f_43     f_44   f_45  f_46      f_47   f_48  \
0  ...  2850.00   1000.00   763.16   135.46   3.73     0  33243.19  65.74   
1  ...  5750.00  11500.00  9593.48  1648.80   0.60     0  51572.04  65.73   
2  ...  1400.00    250.00   150.00    45.13   9.33     1  31692.84  65.81   
3  ...  6041.52    761.58   453.21   144.97  13.33     1  37696.21  65.67   
4  ...  1320.04    710.63   512.54   109.16   2.58     0  29038.17  65.66   

   f_49  target  
0  7.95       1  
1  6.26       0  
2  7.84       1  
3 

In [3]:
print(da1.describe())

              f_1           f_2          f_3          f_4         f_5  \
count  937.000000    937.000000   937.000000   937.000000  937.000000   
mean    81.588047    332.842049   698.707086   870.992209   84.121665   
std     64.976730   1931.938570   599.965577   522.799325   45.361771   
min      1.000000     10.000000     1.920000     1.000000    0.000000   
25%     31.000000     20.000000    85.270000   444.200000   54.000000   
50%     64.000000     65.000000   704.370000   761.280000   73.000000   
75%    124.000000    132.000000  1223.480000  1260.370000  117.000000   
max    352.000000  32389.000000  1893.080000  2724.570000  180.000000   

                f_6         f_7         f_8            f_9        f_10  ...  \
count  9.370000e+02  937.000000  937.000000     937.000000  937.000000  ...   
mean   7.696964e+05   43.242721    9.127887    3940.712914    0.221003  ...   
std    3.831151e+06   12.718404    3.588878    8167.427625    0.090316  ...   
min    7.031200e+04   21.2

In [4]:
## Split the data Method 1:

In [5]:
X = da1.drop('target', axis = 1)
y = da1['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1, stratify = y)
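
The `stratify = y` argument in the split above keeps the share of oil spill cases the same in the training and test sets. A minimal sketch of that effect, run on synthetic data rather than the oil spill file (the arrays `X_demo` and `y_demo` and the 5% positive rate are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the oil spill data: 1000 rows, 5% positives.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 50 + [0] * 950)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# With stratification the positive rate should match across all three sets.
print(f"positive rate, full:  {y_demo.mean():.3f}")
print(f"positive rate, train: {y_tr.mean():.3f}")
print(f"positive rate, test:  {y_te.mean():.3f}")
```

Without stratification, a small test set drawn from a heavily imbalanced dataset can end up with very few positives, which makes the test metrics unstable.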

In [6]:
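## Build Pipeline

def build_pipe(model):
  """Wrap a scale-sensitive model in a pipeline that standardizes features first."""
  return Pipeline([
      ('sc', StandardScaler()),
      ('model', model)
  ])

## Logistic Regression (scaled)
lr_mod1 = build_pipe(LogisticRegression(solver = 'liblinear', random_state = 42))

## Decision Tree (tree-based models do not need feature scaling)
dt_mod2 = DecisionTreeClassifier(random_state = 42)

## SVM (scaled; probability = True enables predict_proba)
svm_mod3 = build_pipe(SVC(probability = True, random_state = 42))

## Random Forest
rf_mod4 = RandomForestClassifier(random_state = 42)

## Gradient Boosting
gbm_mod5 = GradientBoostingClassifier(random_state = 42)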

In [7]:
## Spot-Check Algorithms

In [8]:
mods = {
    'Logistic Regression': lr_mod1,
    'Decision Tree': dt_mod2,
    'SVM': svm_mod3,
    'Random Forest': rf_mod4,
    'Gradient Boosting': gbm_mod5
}

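for name, model in mods.items():
  model.fit(X_train, y_train)
  # Predicted probability of the positive (oil spill) class
  y_prob = model.predict_proba(X_test)[:, 1]
  # Average precision summarizes the precision-recall curve (PR AUC)
  pr_auc = average_precision_score(y_test, y_prob)
  print(f'{name}: {pr_auc}')

Logistic Regression: 0.8747957516339869
Decision Tree: 0.4623860182370821
SVM: 0.9166666666666666
Random Forest: 0.8915178571428571
Gradient Boosting: 0.8533549783549783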
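
The scores printed above are average precision values. To make the metric concrete, here is a minimal pure-Python sketch of how it can be computed when no two scores are tied: rank the samples by predicted score, and average the precision at each rank where a true positive appears. The function name `average_precision` and the toy `labels` and `scores` are hypothetical, not part of the notebook.

```python
def average_precision(y_true, y_score):
    """Average precision for binary labels, assuming no tied scores."""
    # Rank samples from highest to lowest predicted score.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of this positive
    return total / n_pos

# Toy example: two positives among six samples.
labels = [0, 0, 1, 0, 1, 0]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05]
ap = average_precision(labels, scores)
print(round(ap, 4))  # positives sit at ranks 1 and 3: (1/1 + 2/3) / 2
```

Because the metric only looks at how highly the positives are ranked, it is not inflated by the large number of easy negatives in an imbalanced dataset.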


**Handle Class Imbalance with SMOTE (Oversampling) and One Model (e.g. Random Forest)**

In [9]:
## SMOTE
smote_pipe = IMBPipeline([
    ('smote', SMOTE(random_state = 1)),
    ('mod', RandomForestClassifier(random_state = 1))
])

smote_pipe.fit(X_train, y_train)
y_pred_smote = smote_pipe.predict_proba(X_test)[:, 1]
pr_au1 = average_precision_score(y_test, y_pred_smote)
print(pr_au1)

0.772165915915916


**Hyperparameter Tuning**

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_mod = RandomForestClassifier(random_state=42)  # Base estimator

param_grid = { 'n_estimators': [50, 100, 200],
               'max_depth': [None, 5, 10]}  # Parameters to test

grid_search = GridSearchCV(rf_mod, param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)
best_m = grid_search.best_estimator_

y_pred_b1 = best_m.predict_proba(X_test)[:, 1]
auc_sc_b1 = roc_auc_score(y_test, y_pred_b1)
print(auc_sc_b1)

0.9847222222222223
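
The grid search above is scored with ROC AUC, while the spot check used average precision, so the two sets of numbers are not directly comparable. One possible variation, sketched here on synthetic data, is to tune on average precision with stratified folds so that both steps use the same metric. The dataset from `make_classification`, the smaller grid, and the 3-fold setting are assumptions chosen to keep the example quick; they are not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic imbalanced data (about 10% positives) standing in for the oil spill set.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Stratified folds keep some positives in every validation split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [None, 5]},
    scoring="average_precision",  # PR AUC instead of ROC AUC
    cv=cv,
)
search.fit(X_tr, y_tr)

# GridSearchCV refits the best model on all training data by default.
test_ap = average_precision_score(y_te, search.predict_proba(X_te)[:, 1])
print(search.best_params_, round(test_ap, 3))
```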


**Explanation**

- **Import Necessary Libraries:** We import the libraries needed for data manipulation, model building, evaluation, and handling imbalanced data.
- **Load the Dataset:** Replace '/content/oil_spill.csv' with the actual path to your dataset.
- **Split the Data:** We divide the dataset into training and test sets (80/20, stratified on the target) so that model performance is evaluated on held-out data.
- **Define Models and Pipelines:** Logistic Regression and SVM are wrapped in a pipeline that standardizes the features before fitting, because both are sensitive to the scale of the input features. The tree-based models (Decision Tree, Random Forest, Gradient Boosting) are used without scaling.
- **Spot-Check Algorithms:** We fit each model on the training data and evaluate it on the test set using average precision (the area under the precision-recall curve), a metric well suited to imbalanced datasets.
- **Handle Imbalance with SMOTE:** We use SMOTE (Synthetic Minority Over-sampling Technique) inside a pipeline to address class imbalance, with Random Forest as the example model. On this split the SMOTE pipeline scored about 0.772 average precision, below the 0.892 of the plain Random Forest, although the two runs use different random seeds.
- **Hyperparameter Tuning:** We run a grid search over n_estimators and max_depth for the Random Forest, using 5-fold cross-validation scored by ROC AUC, then evaluate the best estimator on the test set. The reported test score (about 0.985) is a ROC AUC value, so it is not directly comparable to the average precision scores above.