Here's an outline of the project:

Download the dataset

Explore & analyze the dataset

Prepare the dataset for ML training

Train hardcoded & baseline models

Make predictions & submit to Kaggle

Perform feature engineering

Train & evaluate different models

Tune hyperparameters for the best models

Train on a GPU with the entire dataset

Document & publish the project online





# Downloading the dataset

In [None]:
! pip install opendatasets

In [None]:
import opendatasets as od

In [None]:
data_url='https://www.kaggle.com/datasets/edumagalhaes/quality-prediction-in-a-mining-process'

In [None]:
od.download(data_url)

In [None]:
data_dir='quality-prediction-in-a-mining-process'

# Opening and viewing the dataset


In [None]:
!ls -lh {data_dir}

In [None]:
!head {data_dir}/MiningProcess_Flotation_Plant_Database.csv

In [None]:
cols='date,% Iron Feed,% Silica Feed,Starch Flow,Amina Flow,Ore Pulp Flow,Ore Pulp pH,Ore Pulp Density,Flotation Column 01 Air Flow,Flotation Column 02 Air Flow,Flotation Column 03 Air Flow,Flotation Column 04 Air Flow,Flotation Column 05 Air Flow,Flotation Column 06 Air Flow,Flotation Column 07 Air Flow,Flotation Column 01 Level,Flotation Column 02 Level,Flotation Column 03 Level,Flotation Column 04 Level,Flotation Column 05 Level,Flotation Column 06 Level,Flotation Column 07 Level,% Iron Concentrate,% Silica Concentrate'.split(',')
cols

In [None]:
len(cols)

So the given data set has 24 columns

We may not need all of these columns in the first iteration of building the model

Let us pick the columns we need as we go

In [None]:
!wc -l {data_dir}/MiningProcess_Flotation_Plant_Database.csv

There are roughly 737 thousand rows in the given dataset


In [None]:
!head {data_dir}/MiningProcess_Flotation_Plant_Database.csv

# Using pandas to read the dataset

In [None]:
import pandas as pd

In [None]:
cols

In [None]:
need_cols=('% Iron Feed',
 '% Silica Feed',
 'Starch Flow',
 'Amina Flow',
 'Ore Pulp Flow',
 'Ore Pulp pH',
 'Ore Pulp Density',
 'Flotation Column 01 Air Flow',
 'Flotation Column 02 Air Flow',
 'Flotation Column 03 Air Flow',
 'Flotation Column 04 Air Flow',
 'Flotation Column 05 Air Flow',
 'Flotation Column 06 Air Flow',
 'Flotation Column 07 Air Flow',
 'Flotation Column 01 Level',
 'Flotation Column 02 Level',
 'Flotation Column 03 Level',
 'Flotation Column 04 Level',
 'Flotation Column 05 Level',
 'Flotation Column 06 Level',
 'Flotation Column 07 Level',
 '% Iron Concentrate',
 '% Silica Concentrate')

In [None]:
need_cols


We ignore the date field in the given dataset for the first iteration, on the assumption that the timestamp has little effect on the output quality

Besides, processing dates is resource-intensive, so let us take care of that in later iterations if needed.

In [None]:
data_types_cols={
    '% Iron Feed':'float32',
 '% Silica Feed':'float32',
 'Starch Flow':'float32',
 'Amina Flow':'float32',
 'Ore Pulp Flow':'float32',
 'Ore Pulp pH':'float32',
 'Ore Pulp Density':'float32',
 'Flotation Column 01 Air Flow':'float32',
 'Flotation Column 02 Air Flow':'float32',
 'Flotation Column 03 Air Flow':'float32',
 'Flotation Column 04 Air Flow':'float32',
 'Flotation Column 05 Air Flow':'float32',
 'Flotation Column 06 Air Flow':'float32',
 'Flotation Column 07 Air Flow':'float32',
 'Flotation Column 01 Level':'float32',
 'Flotation Column 02 Level':'float32',
 'Flotation Column 03 Level':'float32',
 'Flotation Column 04 Level':'float32',
 'Flotation Column 05 Level':'float32',
 'Flotation Column 06 Level':'float32',
 'Flotation Column 07 Level':'float32',
 '% Iron Concentrate':'float32',
 '% Silica Concentrate':'float32'
    
}

In [None]:
data_types_cols

In [None]:

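# Earlier attempt, kept for reference. Note that dtype must be the data_types_cols dict, not data_dir.
# Casting to float32 while reading would still fail here because the values use a comma as the decimal separator.
# df=pd.read_csv("/content/quality/MiningProcess_Flotation_Plant_Database.csv",usecols=need_cols,dtype=data_types_cols)
# Read all 24 columns as-is; the conversion to float happens in a later cell.
df=pd.read_csv("/content/quality-prediction-in-a-mining-process/MiningProcess_Flotation_Plant_Database.csv",usecols=[x for x in range(0,24)])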

In [None]:
df

In [None]:
df.shape

In [None]:
df.info()

We don't need the date column in this iteration because we assume the date has little bearing on the quality of the concentrate

Let us drop the date field

In [None]:
df.drop(['date'],axis=1,inplace=True)

The remaining columns were read as strings (object dtype) rather than as floats

Let us convert them to a float data type

In [None]:
df.columns

In [None]:
for i in list(df.columns):
    df[i]=df[i].str.replace(',','.')

We have replaced the comma in each value with a decimal point because pandas cannot cast strings that use a comma as the decimal separator to float32
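As an alternative to replacing the commas column by column, `pd.read_csv` accepts a `decimal` argument. The sketch below uses a tiny made-up CSV (the rows in `sample_csv` are illustrative, not taken from the real file) to show the idea; it assumes the real file uses the same comma-decimal format.

```python
import io

import pandas as pd

# Tiny made-up stand-in for the real CSV: numbers use a comma as the decimal separator.
sample_csv = io.StringIO(
    'date,% Iron Feed,% Silica Feed\n'
    '2000-01-01 00:00:00,"55,2","16,98"\n'
    '2000-01-01 01:00:00,"56,1","15,40"\n'
)

# decimal="," makes pandas parse "55,2" as 55.2 while reading;
# usecols skips the date column in the same step.
sample_df = pd.read_csv(
    sample_csv,
    decimal=",",
    usecols=["% Iron Feed", "% Silica Feed"],
)

# Downcast to float32 to match the rest of the notebook.
sample_df = sample_df.astype("float32")
print(sample_df.dtypes)
```

With this approach the `str.replace` and `astype` loops would no longer be needed, at the cost of relying on every numeric column sharing the same format.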

In [None]:
for i in list(df.columns):
    df[i]=df[i].astype('float32')

# EDA for whole dataset

In [None]:
df.info()

In [None]:
df.isnull().sum()

So there are no null values

In [None]:
df.describe()

In [None]:
df.head(20)

In [None]:
from google.colab import files

df.to_csv('processed_df.csv', encoding = 'utf-8-sig') 
files.download('processed_df.csv')

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(18,18))
sns.heatmap(df.corr(),linewidths=0.003,linecolor='white',annot=True)

# Splitting the dataset into training set and test set

The given dataset does not have predefined training and test sets.

Let us split the given dataset into two parts

In [None]:
y=df['% Silica Concentrate']
x=df.drop(['% Silica Concentrate'],axis=1)

In [None]:
x.shape, y.shape

In [None]:
x.info()

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=55,test_size=0.3)

In [None]:
x_train.head()

In [None]:
y_train.head()

# EDA for specific columns of interest

In [None]:
# fig=plt.figure(figsize=(30,30))
df.hist(column=['Starch Flow', 'Amina Flow', 'Ore Pulp Flow', 'Ore Pulp pH', 'Ore Pulp Density'],grid=True,figsize=(14,14))

In [None]:
df.hist(column=['Flotation Column 01 Air Flow',
 'Flotation Column 02 Air Flow',
 'Flotation Column 03 Air Flow',
 'Flotation Column 04 Air Flow',
 'Flotation Column 05 Air Flow',
 'Flotation Column 06 Air Flow',
 'Flotation Column 07 Air Flow',],grid=True,figsize=(16,16))

In [None]:
df.hist(column=['Flotation Column 01 Level',
       'Flotation Column 02 Level', 'Flotation Column 03 Level',
       'Flotation Column 04 Level', 'Flotation Column 05 Level',
       'Flotation Column 06 Level', 'Flotation Column 07 Level'],grid=True,figsize=(14,14))

In [None]:
df.hist(column=['% Iron Feed', '% Silica Feed'],grid=True,figsize=(10,6))

# Creating a baseline model

Let us use a dummy regressor as a baseline model: any machine learning model worth keeping has to beat it

In [None]:
from sklearn.dummy import DummyRegressor

In [None]:
dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(x_train, y_train)

In [None]:
y_preds=dummy_regr.predict(x_test)

Now that we have made a prediction using the dummy regressor, let us evaluate its performance

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
mean_squared_error(y_test, y_preds, squared=False)
#setting squared to False returns RMSE (scikit-learn 1.4 deprecates this argument in favour of root_mean_squared_error)

In [None]:
mean_squared_error(y_test, y_preds, squared=True)
#squared=True (the default) returns the MSE, which is the square of the RMSE value

In [None]:
from sklearn.metrics import mean_absolute_error

In [None]:
mean_absolute_error(y_test, y_preds)

The baseline model we chose is the dummy regressor, which always predicts the mean of the training targets

This baseline model achieves an RMSE of 1.1251 and an MAE of 0.9162

We have to build models that perform better than this
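To make the baseline concrete, here is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration). It checks that `DummyRegressor(strategy="mean")` ignores the features and predicts the training mean, and computes RMSE and MAE by hand with NumPy.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))       # features are ignored by the dummy model
y_demo = rng.normal(loc=2.0, size=200)   # synthetic target

X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

dummy = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
preds = dummy.predict(X_te)

# Every prediction equals the mean of the training targets.
constant_prediction = bool(np.allclose(preds, y_tr.mean()))

# RMSE and MAE written out explicitly.
rmse = float(np.sqrt(np.mean((y_te - preds) ** 2)))
mae = float(np.mean(np.abs(y_te - preds)))
print(f"constant={constant_prediction}, RMSE={rmse:.4f}, MAE={mae:.4f}")
```

Because the prediction is a constant, the baseline RMSE mostly reflects how spread out the target is, which is why it is a useful yardstick for the models that follow.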

# Creating a linear regressor

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
reg_model = LinearRegression()
reg_model.fit(x_train,y_train)

In [None]:
reg_model.coef_

These are the coefficients for all variables in the linear regression model

In [None]:
train_reg_pred=reg_model.predict(x_train)

In [None]:
mean_squared_error(y_train,train_reg_pred,squared=False)

In [None]:
mean_absolute_error(y_train, train_reg_pred)

Judged against the test errors computed below, the model doesn't seem to overfit the training data

In [None]:
reg_pred=reg_model.predict(x_test)

In [None]:
reg_pred

In [None]:
mean_squared_error(y_test, reg_pred, squared=False)
#setting squared to false returns RMSE

In [None]:
mean_absolute_error(y_test, reg_pred)

This simple linear regression model achieves an RMSE of 0.6385 and an MAE of 0.4931

This is a drastic improvement over the baseline model (RMSE 1.1251, MAE 0.9162). So we can say that this problem can be solved by machine learning
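The train-versus-test comparison used above to check for overfitting can be wrapped in a small helper. This is a sketch, not part of the original notebook: `report_errors` is a hypothetical name, the data is synthetic, and RMSE is computed with `np.sqrt` so that it does not depend on the `squared` argument.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split


def report_errors(model, X_tr, y_tr, X_te, y_te):
    """Return RMSE and MAE on both splits for an already fitted model."""
    out = {}
    for name, X, y in (("train", X_tr, y_tr), ("test", X_te, y_te)):
        pred = model.predict(X)
        out[f"{name}_rmse"] = float(np.sqrt(mean_squared_error(y, pred)))
        out[f"{name}_mae"] = float(mean_absolute_error(y, pred))
    return out


# Synthetic linear data with a little noise, only for demonstration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=55)

demo_model = LinearRegression().fit(X_tr, y_tr)
errors = report_errors(demo_model, X_tr, y_tr, X_te, y_te)
print(errors)
```

A large gap between the train and test values would point to overfitting; similar values on both splits suggest the model generalizes about as well as it fits.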

# Creating a Ridge Regression Model

In [None]:
from sklearn.linear_model import Ridge

In [None]:
ridge1 = Ridge(alpha=1.0)

In [None]:
ridge1.fit(x_train,y_train)

In [None]:
ridge1_pred=ridge1.predict(x_test)

In [None]:
ridge1_pred

In [None]:
mean_squared_error(y_test, ridge1_pred, squared=False)

In [None]:
mean_absolute_error(y_test, ridge1_pred)

The ridge regression model with the regularization parameter set to 1 performs slightly better than the linear regression model without regularization

Let us repeat this with the regularization parameter set to 3

In [None]:
ridge2=Ridge(alpha=3.0)

In [None]:
ridge2.fit(x_train,y_train)

In [None]:
ridge2_pred=ridge2.predict(x_test)

In [None]:
ridge2_pred

In [None]:
mean_squared_error(y_test, ridge2_pred, squared=False)

In [None]:
mean_absolute_error(y_test, ridge2_pred)

Let us now build a ridge regression model with the regularization parameter set to 50000. In theory, this should lead to underfitting

In [None]:
ridge3=Ridge(alpha=50000.0)

In [None]:
ridge3.fit(x_train,y_train)

In [None]:
ridge3_pred=ridge3.predict(x_test)
ridge3_pred

In [None]:
mean_squared_error(y_test, ridge3_pred, squared=False)

In [None]:
mean_absolute_error(y_test, ridge3_pred)

As expected, the predictions get worse and the RMSE and MAE shoot up. This is because such a large penalty shrinks the coefficients too strongly, so the model underfits
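The effect of the penalty can be seen directly in the size of the coefficients. The sketch below fits `Ridge` on synthetic data (`X_demo`, `y_demo` and `true_coef` are made up for illustration) with the same three `alpha` values used above and records the norm of the coefficient vector for each.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=300)

coef_norms = {}
for alpha in (1.0, 3.0, 50000.0):
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # A larger alpha penalizes large coefficients more, pulling them towards zero.
    coef_norms[alpha] = float(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>8}: ||coef|| = {coef_norms[alpha]:.4f}")
```

With a very large `alpha` the coefficients are pushed close to zero, so the predictions approach a constant and the model behaves much like the dummy baseline.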

# Creating a Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor


In [None]:
# randFor = RandomForestRegressor(n_estimators=70,max_depth=10, random_state=55,max_features='log2')
randFor = RandomForestRegressor(n_estimators=10, random_state=55)

Training the model on the entire training set with n_estimators set to 50 takes over 23 minutes, even on a GPU runtime (scikit-learn's random forest runs on the CPU, so the GPU does not help)

So let us reduce the computational load by limiting n_estimators to 10 and using 20% of the training set

In [None]:
new_x_train, new_X_test, new_y_train, new_y_test = train_test_split( x_train, y_train, train_size=0.20, random_state=66)

In [None]:
new_x_train.shape

In [None]:
new_y_train.shape

In [None]:
randFor.fit(new_x_train,new_y_train)
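Besides subsampling the data, training time can often be reduced by building the trees in parallel. A minimal sketch on synthetic data, assuming the machine has more than one CPU core: `n_jobs=-1` asks scikit-learn to use all available cores. The names `X_demo`, `y_demo` and `rf_demo` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(2000, 6))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=2000)

# n_jobs=-1 fits the individual trees in parallel across all CPU cores.
rf_demo = RandomForestRegressor(n_estimators=10, random_state=55, n_jobs=-1)
rf_demo.fit(X_demo, y_demo)

n_trees = len(rf_demo.estimators_)
print(n_trees)
```

The actual speed-up depends on the number of cores in the runtime, so it is worth timing both variants with `%%time` before committing to one.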

Now that we have trained a model on 20% of the training data, let us save it for later use, because training even on this subset takes about 20 seconds

In [None]:
import joblib

In [None]:
joblib.dump(randFor, "./random_forest.joblib")
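For completeness, a saved model can be restored with `joblib.load`. The sketch below does a round trip on a small synthetic model in a temporary directory (the file name `random_forest_demo.joblib` is made up) and checks that the restored model gives the same predictions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo.sum(axis=1)

model = RandomForestRegressor(n_estimators=5, random_state=55).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "random_forest_demo.joblib")
    joblib.dump(model, path)        # write the fitted model to disk
    restored = joblib.load(path)    # read it back into memory

# The restored model should reproduce the original predictions.
same_predictions = bool(np.allclose(model.predict(X_demo), restored.predict(X_demo)))
print(same_predictions)
```

Models saved this way are generally best loaded with the same scikit-learn version that produced them.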

In [None]:
rand1_pred=randFor.predict(x_test)
rand1_pred

In [None]:
mean_squared_error(y_test, rand1_pred, squared=False)

In [None]:
mean_absolute_error(y_test, rand1_pred)

This first model has achieved an RMSE of 0.1332 and an MAE of 0.0581

Clearly, the random forest regressor outperforms all the previous models by a huge margin

This has been achieved with only 10 trees and 20% of the training data

Let us try to refine the random forest regressor in the upcoming iterations

# Refining the Random Forest Regressor

In [None]:
new_x_train1, new_X_test, new_y_train1, new_y_test = train_test_split( x_train, y_train, train_size=0.33, random_state=66)

In [None]:
randFor1 = RandomForestRegressor(n_estimators=15, random_state=55)

In [None]:
%%time
randFor1.fit(new_x_train1,new_y_train1)

The new random forest regressor model has taken 53 seconds to train on one third of the original training set

In [None]:
rand2_pred=randFor1.predict(x_test)
rand2_pred

In [None]:
mean_squared_error(y_test, rand2_pred, squared=False)

In [None]:
mean_absolute_error(y_test, rand2_pred)

This model has achieved an RMSE of 0.0949 and an MAE of 0.0347

These are the best results so far

This model outperforms even the first random forest regressor (RMSE 0.1332, MAE 0.0581)
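The metrics reported in this notebook can be collected in one table for comparison. The sketch below only uses the RMSE and MAE values quoted in the text above (the ridge models are left out because their scores are not stated); the row labels and the `rmse_reduction_%` column are illustrative additions.

```python
import pandas as pd

# Test-set metrics as reported in the text of this notebook.
results = pd.DataFrame(
    {
        "model": [
            "Dummy (mean)",
            "Linear regression",
            "Random forest (10 trees, 20% of train)",
            "Random forest (15 trees, 33% of train)",
        ],
        "rmse": [1.1251, 0.6385, 0.1332, 0.0949],
        "mae": [0.9162, 0.4931, 0.0581, 0.0347],
    }
).set_index("model")

# Relative RMSE reduction with respect to the dummy baseline, in percent.
baseline_rmse = results.loc["Dummy (mean)", "rmse"]
results["rmse_reduction_%"] = (1 - results["rmse"] / baseline_rmse) * 100

print(results.round(2))
```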