## Importance of Data Preprocessing

In this notebook, we will explore the "Motorbike Marketplace" dataset (link: https://www.kaggle.com/datasets/mexwell/motorbike-marketplace). However, we will stop at data preprocessing, as this dataset is a prime example of why cleaning the data is so important. Most machine learning models can only be trained on numerical data, but real-world datasets rarely arrive in purely numerical form. We'll explore this in more detail below.

## Step 1: Data Cleaning and Preprocessing
First, we'll import the necessary modules that we'll use to visualize our features and convert our categorical data to numerical data.

In [None]:
import pandas  # For CSV I/O and dataframe manipulation
import numpy  # For NaN handling and numeric dtypes
import matplotlib.pyplot as plt  # For data visualization
import seaborn  # Also for data visualization, specifically heatmaps and countplots
import warnings  # To silence library warnings in the notebook output

seaborn.set()
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
data = pandas.read_csv('/kaggle/input/motorbike-marketplace/europe-motorbikes-zenrows.csv')
data.head()

In [None]:
# data.info() shows that only price, mileage, and power are numeric (int64/float64).
# data.duplicated().sum() reports 5832 duplicated rows.
data.isna().sum()

As the output shows, a considerable portion of the values are NaN. The dataset is currently unclean: it contains NaN/missing values and redundant features, both of which we'll get rid of.
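To make the scale of the problem concrete, here is a minimal sketch of how missing values and duplicated rows could be quantified. It uses a small made-up frame named `toy` (an assumption for illustration, not the marketplace data), so the numbers it prints say nothing about the real dataset.

```python
import numpy
import pandas

# Toy frame standing in for the marketplace data (all values are made up).
toy = pandas.DataFrame({
    "price": [4500, 4500, 7200, 3100],
    "fuel": ["Gasoline", "Gasoline", numpy.nan, "Electric"],
    "power": [35.0, 35.0, numpy.nan, 11.0],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isna().mean() * 100

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Drop the repeated rows and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(missing_pct.round(1))
print(n_duplicates, len(deduped))
```

Whether duplicated rows should actually be dropped depends on whether they are true repeats of one listing or separate listings that happen to look identical, so treat `drop_duplicates()` as an option to evaluate rather than a required step.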

## Step 2: Data Preprocessing

We will work through the dataset column by column to clean it. The final result can be passed to a machine learning algorithm for classification or regression.

In [None]:
seaborn.countplot(x='fuel', data=data)

We fill the NaN values of 'fuel' with its most common value, 'Gasoline', and replace the missing 'power' values with 0. We can assume that the majority of motorbikes use gasoline and have a manual transmission (the latter assumption is used for 'gear' below).

In [None]:
data['fuel'] = data['fuel'].fillna('Gasoline')
data['power'] = data['power'].replace(numpy.nan, 0)
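As a hedged alternative to hard-coding the fill value, the most common category can be computed from the column itself. The sketch below does that on a made-up frame called `toy`. The `power_missing` flag column is a hypothetical addition, not part of the original notebook. It records which rows had no power value, so that the placeholder 0 can later be told apart from a genuine measurement.

```python
import numpy
import pandas

# Made-up values for illustration only.
toy = pandas.DataFrame({
    "fuel": ["Gasoline", numpy.nan, "Gasoline", "Electric"],
    "power": [35.0, numpy.nan, 70.0, 11.0],
})

# Most frequent non-missing category; mode() ignores NaN by default.
most_common_fuel = toy["fuel"].mode().iloc[0]
toy["fuel"] = toy["fuel"].fillna(most_common_fuel)

# Record which rows had no power value before overwriting them with 0.
toy["power_missing"] = toy["power"].isna()
toy["power"] = toy["power"].fillna(0)

print(toy)
```

Filling with 0 keeps the column numeric, but a model may read 0 as "no power at all", which is why keeping a flag (or using a median instead) can be worth considering.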

In [None]:
seaborn.countplot(x='offer_type', data=data)

In [None]:
# We still have 'gear' and 'version' to make changes to
data.isna().sum()

In [None]:
seaborn.countplot(x='gear', data=data)

In [None]:
data['gear'] = data['gear'].fillna('Manual')
data.isna().sum()

In [None]:
data['version'] = data['version'].replace(numpy.nan, 'None')
data['version']

## Step 3: Scaling and Converting the Values

First, we will perform what's known as label encoding to change our categorical data into numerical data, since most models cannot be trained on non-numerical values. Once the dataset has been cleaned and encoded, we can use standard scaling to bring the features onto a common scale, which makes model training easier.

In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

# Transforming all our object datatype into ints
data['version'] = label_encoder.fit_transform(data['version'])
data['make_model'] = label_encoder.fit_transform(data['make_model'])
data['date'] = label_encoder.fit_transform(data['date'])
data['fuel'] = label_encoder.fit_transform(data['fuel'])
data['gear'] = label_encoder.fit_transform(data['gear'])
data['offer_type'] = label_encoder.fit_transform(data['offer_type'])

data = data.drop('link', axis=1)
data.dtypes
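The cell above uses `LabelEncoder`, which assigns one integer per category. One-hot encoding is a different technique that creates one 0/1 column per category instead. The sketch below contrasts the two on a made-up `fuel` column. The frame `toy` and the choice of `pandas.get_dummies` are assumptions for illustration, not what the notebook itself does.

```python
import pandas
from sklearn.preprocessing import LabelEncoder

# Made-up categories for illustration only.
toy = pandas.DataFrame({"fuel": ["Gasoline", "Electric", "Diesel", "Gasoline"]})

# Label encoding: one integer per category, assigned in sorted order
# (Diesel -> 0, Electric -> 1, Gasoline -> 2).
label_codes = LabelEncoder().fit_transform(toy["fuel"])

# One-hot encoding: one 0/1 column per category, with no implied order.
one_hot = pandas.get_dummies(toy["fuel"], prefix="fuel", dtype=int)

print(list(label_codes))
print(one_hot)
```

Label encoding keeps the frame compact, but the integers imply an ordering between categories that distance-based models such as KNN will take literally. One-hot encoding avoids that at the cost of many extra columns for high-cardinality features such as `make_model`.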

In [None]:
# Pairwise correlations between the columns are weak in this dataset.
correlation = data.corr()
figure = plt.figure(figsize=(12, 10))
seaborn.heatmap(correlation, annot=True, cmap='magma', fmt='.1f')
plt.show()

In [None]:
data['power'] = data['power'].astype(numpy.int64)
data.dtypes

In [None]:
data.head()
# This is our clean dataset without standard scaling.

In [None]:
from sklearn.preprocessing import StandardScaler
std_scl = StandardScaler()
y = data['price']
X = data.drop('price', axis=1)

We've created our dependent variable (`y`, the price) and our independent variables (`X`), and we now scale the independent variables, which helps many machine learning algorithms, especially distance-based ones such as KNN, train more effectively.

In [None]:
# Scale the features and keep the original column names
X = pandas.DataFrame(std_scl.fit_transform(X), columns=X.columns)
X.head()
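If the scaled features are later split into train and test sets, one commonly recommended variation is to fit the scaler on the training rows only and reuse its statistics for the test rows. The sketch below shows that order of operations on made-up data. The names `toy_X`, `toy_y`, `X_tr`, and `X_te` are hypothetical and chosen so they don't clash with the notebook's own variables.

```python
import pandas
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up feature values, for illustration only.
toy_X = pandas.DataFrame({
    "mileage": [1000, 25000, 40000, 8000, 60000, 12000],
    "power": [35, 70, 95, 11, 120, 48],
})
toy_y = pandas.Series([4500, 6200, 5100, 2900, 7800, 3900], name="price")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.33, random_state=42
)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only...
X_tr_scaled = pandas.DataFrame(
    scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
# ...and reuse those statistics for the held-out rows.
X_te_scaled = pandas.DataFrame(
    scaler.transform(X_te), columns=X_te.columns, index=X_te.index
)

print(X_tr_scaled.round(2))
```

Fitting on the full dataset, as the notebook does, lets information from the test rows influence the scaling, so test scores can look slightly better than they would on truly unseen data.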

## Optional: Passing through a KNN

NOTE: The algorithm used here is not well suited to this dataset. I used a KNN classifier to try to learn decision boundaries between the features, but `price` is a continuous value, so the classifier treats every distinct price as its own class. This didn't work, and I recommend that anyone using this notebook for reference stop before this step and continue however they please.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

test_scores = []
train_scores = []

# Try k = 1..14 and record the accuracy on the train and test splits
for i in range(1, 15):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)

    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

In [None]:
max_train = max(train_scores)

train_score = [i for i, v in enumerate(train_scores) if v == max_train]
print(f"Max train score {max_train * 100}% and k = {list(map(lambda x: x + 1, train_score))}")

In [None]:
max_test = max(test_scores)

test_score = [i for i, v in enumerate(test_scores) if v == max_test]
print(f"Max test score {max_test * 100}% and k = {list(map(lambda x: x + 1, test_score))}")
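Because `price` is continuous, a regressor may be a better fit than the classifier used above. The sketch below swaps in `KNeighborsRegressor`, which is not what the notebook uses, and runs the same search over k on synthetic data. The arrays `X_demo` and `y_demo` are randomly generated stand-ins, so the result says nothing about how this would perform on the motorbike data.

```python
import numpy
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = numpy.random.default_rng(0)
# Synthetic stand-in: 200 rows, 3 already-scaled features, noisy linear "price".
X_demo = rng.normal(size=(200, 3))
y_demo = (
    5000
    + 800 * X_demo[:, 0]
    - 300 * X_demo[:, 1]
    + rng.normal(scale=100, size=200)
)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

r2_by_k = {}
for k in range(1, 15):
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(Xd_train, yd_train)
    # For regressors, score() returns R^2 rather than classification accuracy.
    r2_by_k[k] = reg.score(Xd_test, yd_test)

best_k = max(r2_by_k, key=r2_by_k.get)
print(f"Best k = {best_k}, test R^2 = {r2_by_k[best_k]:.3f}")
```

Note that the scores are not comparable with the accuracy values printed above: accuracy counts exact class matches, while R^2 measures how much of the variance in the target is explained.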

### And we're done!

Here, we stopped at data cleaning. Data preprocessing is one of the most important steps in training a machine/deep learning model, and this notebook has walked through an example of it: we filled in the missing values and used label encoding to transform our categorical data into numerical data. If there are any questions or if you'd like to collab with me, send me a mail at akshathmangudi@gmail.com.

Notebook by Akshath Mangudi