### Data Preprocessing
Data preprocessing is the process of preparing raw data so that it is suitable for a machine learning model. It is the first, and a crucial, step in building a model.

Data preprocessing involves tasks such as:
- Acquiring the dataset from various sources
- Importing the necessary libraries and tools for data manipulation
- Importing the dataset into a suitable data structure
- Handling missing values and outliers
- Encoding categorical variables
- Splitting the dataset into training and test sets
- Scaling or normalizing the features
- Extracting or selecting relevant features
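
Several of the tasks listed above can be sketched on a toy table. The data, the column names (`age`, `city`, `label`), and the split size are hypothetical and do not come from the diabetes dataset; this is only one possible ordering of the steps, not the procedure used later in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data; the column names are made up for illustration.
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "city": ["A", "B", "A", "C", "B", "C"],
    "label": [0, 1, 0, 1, 1, 0],
})

# Encode the categorical column as one-hot indicator columns.
encoded = pd.get_dummies(toy, columns=["city"], dtype=float)

# Separate the features from the target, then split into train and test sets.
X = encoded.drop(columns=["label"])
y = encoded["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training rows only keeps information from the test rows out of the learned minimum and maximum.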

> Data preprocessing is important because it can improve the quality and performance of the machine learning model. It can also reduce the complexity and computational cost of the model. Data preprocessing can help to uncover hidden patterns and insights from the data. 🧠


In this scenario, we will apply the following steps to the data:
- [x] Remove duplicated rows
- [x] Remove rows with missing values
- [x] Scale (normalize) the features

#### Essential library imports
Many Python libraries can help with data preprocessing. This notebook uses pandas, NumPy, and scikit-learn, along with the standard-library `csv` module:

In [22]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from numpy import genfromtxt
import csv

#### Load the dataset
The dataset was obtained from Kaggle.

In [2]:
dataset = pd.read_csv('dataset/diabetes.csv')
dataset.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


#### Rename the columns
As we can see above, the column names are hard to read, so we rename them to shorter, lowercase names.

In [3]:
dataset = dataset.rename(columns={"Pregnancies": "pregnancies",
                                  "Glucose": "glucose",
                                  "BloodPressure": "bloodpress",
                                  "SkinThickness": "skinthick",
                                  "Insulin": "insulin",
                                  "BMI": "bmi",
                                  "DiabetesPedigreeFunction": "diabetespedigree",
                                  "Age": "age",
                                  "Outcome": "outcome"})
dataset.head()

Unnamed: 0,pregnancies,glucose,bloodpress,skinthick,insulin,bmi,diabetespedigree,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


#### Remove Duplicates
Removing duplicated data is important in data analysis because it can affect the results and performance of the data analysis methods. Duplicated data can:
- Bias the distribution and statistics of the data, such as mean, variance, and correlation.
- Inflate the accuracy and confidence of the data analysis models, such as clustering and classification.
- Increase the complexity and computational cost of the data analysis algorithms, such as k-means and hierarchical clustering.
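
The first point can be illustrated with a small sketch. The readings below are hypothetical, with the last row standing in for an accidental copy of the third; the example only shows how a single duplicate shifts the mean.

```python
import pandas as pd

# Hypothetical glucose readings; the last row duplicates the third.
readings = pd.DataFrame({
    "patient": ["p1", "p2", "p3", "p3"],
    "glucose": [90, 100, 170, 170],
})

# The duplicated high reading pulls the mean upward.
mean_with_dupes = readings["glucose"].mean()

# drop_duplicates() keeps the first occurrence of each identical row.
mean_deduped = readings.drop_duplicates()["glucose"].mean()

print(f"mean with duplicate: {mean_with_dupes}")  # 132.5
print(f"mean after removing: {mean_deduped}")     # 120.0
```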

In [4]:
print(f"Duplicated data: {dataset.duplicated().sum()}")
if dataset.duplicated().sum() != 0:
  dataset = dataset.drop_duplicates()
print(f"Duplicated data after removing: {dataset.duplicated().sum()}")

Duplicated data: 0
Duplicated data after removing: 0


#### Remove Null Value/s
Removing null values is important in data analysis because they can affect the quality and validity of the data and of the analysis results. Null values are values that are missing or were not recorded in the dataset. They can occur for various reasons, such as data entry errors, data corruption, or data collection problems.
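
Before dropping anything, it can help to see where the nulls are. The sketch below uses a hypothetical three-row frame (not the diabetes data) to show a per-column count and the effect of `dropna()`, which removes every row containing at least one null.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing entries.
toy = pd.DataFrame({
    "glucose": [148.0, np.nan, 183.0],
    "bmi": [33.6, 26.6, np.nan],
    "age": [50, 31, 32],
})

# Count the nulls in each column rather than only the grand total.
null_per_column = toy.isnull().sum()
print(null_per_column)

# Drop every row that contains at least one null value.
cleaned = toy.dropna()
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Because two of the three rows each contain a null, only one row survives here, which shows how quickly row-wise removal can shrink a small dataset.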

In [5]:
print(f"Null values: {dataset.isnull().sum().sum()}")
if dataset.isnull().sum().sum() != 0:
  dataset = dataset.dropna()
print(f"Null values after removing: {dataset.isnull().sum().sum()}")

Null values: 0
Null values after removing: 0


#### Normalize the data
Normalizing data is important because it can improve the consistency of the data and the performance of the analysis methods applied to it. Normalizing here means rescaling every feature to a common range so that features with large values (such as glucose) do not dominate features with small values (such as the diabetes pedigree function).

The min-max scaler from scikit-learn transforms the data so that each feature lies between a chosen minimum and maximum value, usually 0 and 1. This can improve the performance and stability of some machine learning algorithms, such as distance-based or gradient-based methods.

The scaler's `fit_transform` method learns the minimum and maximum of each feature from the data and then scales each feature according to the formula:

$$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}} \times (max - min) + min$$

where $X_{min}$ and $X_{max}$ are the minimum and maximum values of the feature in the training data, and $min$ and $max$ are the bounds of the target range set by the `feature_range` parameter (0 and 1 by default).
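
As a quick check of the formula, the sketch below applies it by hand to a toy single-feature column and compares the result with `MinMaxScaler`. The helper `min_max_scale` and the sample values are hypothetical and are used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def min_max_scale(X, new_min=0.0, new_max=1.0):
    """Apply the min-max formula column by column (hypothetical helper)."""
    X_min = X.min(axis=0)
    X_max = X.max(axis=0)
    return (X - X_min) / (X_max - X_min) * (new_max - new_min) + new_min


# Toy single-feature column; the values are illustrative only.
ages = np.array([[21.0], [31.0], [50.0], [81.0]])

manual = min_max_scale(ages)
from_sklearn = MinMaxScaler(feature_range=(0, 1)).fit_transform(ages)

# Both approaches should agree up to floating-point rounding.
print(np.allclose(manual, from_sklearn))
print(manual.ravel())
```

With these values the smallest entry maps to 0 and the largest to 1, and every other entry falls in between in proportion to its distance from the minimum.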

In [10]:
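# Save the cleaned data without header or index, then read it back as a NumPy array
dataset.to_csv("dataset/data_train.csv", header=False, index=False)
data_train = genfromtxt("dataset/data_train.csv", delimiter=",")

# Fit the scaler on the data and scale every column to the range [0, 1];
# the outcome column already holds only 0 and 1, so it is unchanged
scaler = MinMaxScaler()
data_train_scaler = scaler.fit_transform(data_train)

print(f"data before normalize: \n{data_train}\n")
print(f"data after normalize: \n{data_train_scaler}")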

data before normalize: 
[[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]

data after normalize: 
[[0.35294118 0.74371859 0.59016393 ... 0.23441503 0.48333333 1.        ]
 [0.05882353 0.42713568 0.54098361 ... 0.11656704 0.16666667 0.        ]
 [0.47058824 0.91959799 0.52459016 ... 0.25362938 0.18333333 1.        ]
 ...
 [0.29411765 0.6080402  0.59016393 ... 0.07130658 0.15       0.        ]
 [0.05882353 0.63316583 0.49180328 ... 0.11571307 0.43333333 1.        ]
 [0.05882353 0.46733668 0.57377049 ... 0.10119556 0.03333333 0.        ]]


In [26]:
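headers = dataset.columns.tolist()
# The last column is the outcome (label); all other columns are features
x_train = data_train_scaler[:, :-1]
y_train = data_train_scaler[:, -1]

# newline="" lets the csv module control line endings on every platform
with open("dataset/headers.txt", "w", newline="") as output:
  writer = csv.writer(output)
  writer.writerow(headers)

# Save the features and labels as separate CSV files
np.savetxt("dataset/x_train.csv", x_train, delimiter=",")
np.savetxt("dataset/y_train.csv", y_train, delimiter=",")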


array([[0.35294118, 0.74371859, 0.59016393, ..., 0.50074516, 0.23441503,
        0.48333333],
       [0.05882353, 0.42713568, 0.54098361, ..., 0.39642325, 0.11656704,
        0.16666667],
       [0.47058824, 0.91959799, 0.52459016, ..., 0.34724292, 0.25362938,
        0.18333333],
       ...,
       [0.29411765, 0.6080402 , 0.59016393, ..., 0.390462  , 0.07130658,
        0.15      ],
       [0.05882353, 0.63316583, 0.49180328, ..., 0.4485842 , 0.11571307,
        0.43333333],
       [0.05882353, 0.46733668, 0.57377049, ..., 0.45305514, 0.10119556,
        0.03333333]])
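
Files written with `np.savetxt` can be read back with `np.loadtxt`. The sketch below demonstrates the round trip on small stand-in arrays in a temporary directory, so it does not touch the real `dataset/` files; the array values and variable names are illustrative only.

```python
import os
import tempfile

import numpy as np

# Small stand-in arrays; the real x_train and y_train are much larger.
x_demo = np.array([[0.35, 0.74], [0.06, 0.43]])
y_demo = np.array([1.0, 0.0])

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "x_train.csv")
    y_path = os.path.join(tmp, "y_train.csv")

    # Save with the same delimiter used in the notebook.
    np.savetxt(x_path, x_demo, delimiter=",")
    np.savetxt(y_path, y_demo, delimiter=",")

    # Load the files back while the temporary directory still exists.
    x_loaded = np.loadtxt(x_path, delimiter=",")
    y_loaded = np.loadtxt(y_path, delimiter=",")

print(x_loaded.shape, y_loaded.shape)
print(np.allclose(x_demo, x_loaded), np.allclose(y_demo, y_loaded))
```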