<div style="width:100%;text-align: center;"> <img align=middle src="https://media.istockphoto.com/id/1166511366/vector/oil-waste-pollution-series-of-oil-pollution-in-the-ocean-cartoon-vector.jpg?s=612x612&w=0&k=20&c=Qz6MwswmxtyHwWB3j7m1Tu6P8ml7ud2FpeeEbhhc80o=" alt="Oil spill pollution in the ocean" style="height:366px;margin-top:3rem;"> </div>

# <h1 style='background: #006994; border:0; color:white'><center>🛢🌊Oil Spill Classification</center></h1> 

# **<span style="color:#cd486b;">📰About the Dataset</span>**

The dataset was developed by starting with satellite images of the ocean, some of which contain an oil spill and some that do not.
Images were split into sections and processed using computer vision algorithms to provide a vector of features to describe the contents of the image section or patch.
The task is to predict, given a vector that describes the contents of a patch of a satellite image, whether the patch contains an oil spill, e.g. from the illegal or accidental dumping of oil in the ocean.

There are two classes and the goal is to distinguish between spill and non-spill using the features for a given ocean patch.

Non-Spill (0): negative case, or majority class.

Oil Spill (1): positive case, or minority class.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Major Imports

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
# Import Data

df = pd.read_csv('/kaggle/input/oil-spill-detection/oil_spill.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

## **<span style="color:#cd486b;">🪓Split the Dataset</span>**

In [None]:
# Features are every column except the label; the label is the 'target' column
X = df.drop(columns='target')
y = df['target']

In [None]:
X.shape

In [None]:
y.shape

In [None]:
y.value_counts()

The value counts show that target label 1 has far fewer samples than target label 0, which makes the dataset **imbalanced.**
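To see why this imbalance matters, consider a baseline that always predicts the majority class. The sketch below assumes the counts are 896 non-spill and 41 spill patches (937 rows in total); substitute the output of `y.value_counts()` if it differs. The helper name `majority_baseline` is made up for illustration.

```python
# Assumed class counts; replace with the output of y.value_counts() if they differ.
n_non_spill, n_spill = 896, 41

def majority_baseline(n_negative, n_positive):
    """Accuracy and recall of a model that always predicts the negative class."""
    total = n_negative + n_positive
    accuracy = n_negative / total   # every negative is right, every positive is wrong
    recall = 0 / n_positive         # no positive (spill) patch is ever detected
    return accuracy, recall

accuracy, recall = majority_baseline(n_non_spill, n_spill)
print(f"Always predicting non-spill: accuracy = {accuracy:.3f}, spill recall = {recall:.3f}")
```

Under these assumed counts, a model that never finds a single spill still scores well on accuracy, so accuracy alone is a poor guide for this dataset.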

## **<span style="color:#cd486b;">Oversampling</span>**

In [None]:
# Class counts are 896 non-spill : 41 spill (937 rows in total).
# Keep the real X and y here: calling make_classification at this point would
# overwrite them with synthetic data, and the model would never see the oil spill features.
print('Original dataset shape %s' % Counter(y))

In [None]:
ros = RandomOverSampler(random_state = 42)
X_res, y_res = ros.fit_resample(X, y)

print('Resampled dataset shape %s' % Counter(y_res))
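For intuition, random oversampling can be sketched with NumPy alone: rows of the smaller class are drawn with replacement and appended until the classes match. This is a simplified illustration of the idea, not imblearn's implementation; `random_oversample` and the toy arrays are hypothetical, and the 896:41 split is assumed.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate rows of the smaller classes (with replacement) until every class matches the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # keep every original row
    for cls, count in zip(classes, counts):
        if count < target:
            cls_idx = np.flatnonzero(y == cls)
            # draw the missing rows from this class, with replacement
            keep.append(rng.choice(cls_idx, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Toy data with the assumed 896:41 split; every row is distinct
X_toy = np.arange(937 * 2, dtype=float).reshape(937, 2)
y_toy = np.array([0] * 896 + [1] * 41)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))              # class counts after resampling
print(len(np.unique(X_bal, axis=0)))   # number of distinct rows after resampling
```

The class counts become equal, but the number of distinct rows does not grow: the new minority rows are exact copies, so no new information is added.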

## **<span style="color:#cd486b;">✂Splitting Data</span>**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, random_state = 42, test_size = 0.2, stratify = y_res)

In [None]:
model = RandomForestClassifier() 
model.fit(X_train, y_train)

In [None]:
predictions = model.predict(X_test)

In [None]:
import sklearn

In [None]:
# Precision for the positive (oil spill) class
model_precision_score = sklearn.metrics.precision_score(y_test, predictions)
print("Precision score for Random Forest on the oversampled data is ", model_precision_score)

In [None]:
print("F1 score for Random Forest on the oversampled data is ", sklearn.metrics.f1_score(y_test, predictions))

In [None]:
#Confusion Matrix

cm = confusion_matrix(y_test, model.predict(X_test), labels=model.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=model.classes_)
disp.plot()

In [None]:
#Classification Report
from sklearn.metrics import classification_report
clf_report = classification_report(y_test, predictions)
print(clf_report)

## **<span style="color:#cd486b;">🤘Conclusion</span>**
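A high accuracy and F1 score here may suggest a well-performing model, but that impression is misleading. The dataset is imbalanced, and balancing it with random oversampling does not seem to solve the problem. One likely reason is that the oversampling was applied before the train/test split, so copies of the same minority rows can land in both sets and inflate the test scores. Another possible approach is the SMOTE oversampling technique.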
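The sketch below illustrates how duplicated rows end up on both sides of the split when random oversampling is done before `train_test_split`, as in this notebook. It tracks toy row ids instead of real features, assumes the 896:41 class counts, and uses a plain random 80/20 split; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 896 majority rows and 41 minority rows, each with a unique id
toy_ids = np.arange(937)
toy_y = np.array([0] * 896 + [1] * 41)

# Oversample BEFORE splitting: duplicate minority rows until the classes are balanced
minority_ids = toy_ids[toy_y == 1]
extra_ids = rng.choice(minority_ids, size=896 - 41, replace=True)
ids_res = np.concatenate([toy_ids, extra_ids])
labels_res = toy_y[ids_res]

# Random 80/20 split of the resampled rows
perm = rng.permutation(len(ids_res))
n_test = int(0.2 * len(ids_res))
test_ids, train_ids = ids_res[perm[:n_test]], ids_res[perm[n_test:]]
test_labels = labels_res[perm[:n_test]]

# A test row "leaks" if a copy of the same original row also sits in the training set
leaked = np.isin(test_ids, train_ids)
is_minority = test_labels == 1
print(f"Minority test rows also present in training: {leaked[is_minority].mean():.0%}")
print(f"Majority test rows also present in training: {leaked[~is_minority].mean():.0%}")
```

Splitting first and resampling only the training portion avoids this overlap, which is what the SMOTE cells later in the notebook do.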

## This marks the end of 🛢🌊Oil Spill Classification

**Stay tuned for more.**

**Please share your feedback and suggestions and help me improve 😇**

## **<span style="color:#cd486b;">A look into using SMOTE</span>**

In [None]:
# train_test_split returns (X_train, X_test, y_train, y_test), in that order
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.2, stratify = y)

In [None]:
# Oversample the training set only, so the test set keeps its original class distribution
from imblearn.over_sampling import SMOTE
oversample = SMOTE(random_state = 101)
X_S, y_S = oversample.fit_resample(X_train, y_train)

In [None]:
# summarize the new class distribution
counter = Counter(y_S)
print(counter)
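Unlike random oversampling, SMOTE creates new minority points by interpolating between an existing minority row and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that idea, not imblearn's implementation; `smote_like_samples`, its parameters, and the toy data are hypothetical, and `k=5` is chosen to mirror the usual default number of neighbours.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=101):
    """Interpolate between a minority row and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority rows
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)                 # a row is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest rows

    base = rng.integers(0, len(X_minority), size=n_new)         # random starting rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour per start
    gap = rng.random((n_new, 1))                   # interpolation factor in [0, 1)
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Toy minority class: 41 points in 2-D (hypothetical data, not the oil spill features)
X_min = np.random.default_rng(0).normal(size=(41, 2))
X_new = smote_like_samples(X_min, n_new=10)
print(X_new.shape)
```

Each synthetic point lies on the line segment between two real minority points, so the new rows are plausible variations instead of exact duplicates.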