# ZDMP

Project from [https://github.com/ZhengLinLei/ZDMP](https://github.com/ZhengLinLei/ZDMP)

![License](https://img.shields.io/badge/License-Apache%202.0-green)


## Load Datasets

Load the .csv dataset from [GitHub Repository](https://github.com/ZhengLinLei/ZDMP-datasets)

RAW: [.csv](https://media.githubusercontent.com/media/ZhengLinLei/ZDMP-datasets/main/batch_data.csv)



In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('https://media.githubusercontent.com/media/ZhengLinLei/ZDMP-datasets/main/batch_data.csv')

# Keep only rows in which every value is finite (drops rows containing NaN or inf)
df = df[np.isfinite(df).all(1)]
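To make the filtering step concrete, here is a minimal sketch on a toy table. The frame `toy` and its columns `a` and `b` are made up for illustration and are not part of the ZDMP dataset. Note that `np.isfinite` only works on numeric columns; on a column of strings it raises a `TypeError`, so this approach assumes the CSV is entirely numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 contains inf, row 2 contains NaN
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.inf, 30.0, 40.0],
})

# True only for rows in which every column holds a finite number
mask = np.isfinite(toy).all(axis=1)

# Boolean indexing keeps the rows where the mask is True
clean = toy[mask]
print(clean)
```

Only rows 0 and 3 survive, because each of the other two rows has at least one non-finite value.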

## Prepare data


### X and Y

Split the table into the features `X` and the target `Y`. The target values are in the `reliability` column.

In [None]:
Y = df['reliability']
X = df.drop('reliability', axis=1)

### Split data

In this step we split the data into two parts: `training data` (80%) and `testing data` (20%). The parameter `test_size=0.2` sets the fraction held out for testing, and `random_state=42` fixes the seed used to shuffle the data, so the same split is produced on every run.



In [None]:
from sklearn.model_selection import train_test_split

# Split the data into four data variables: Training(x,y), Testing(x,y)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
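The effect of the two parameters can be checked on a small synthetic example. The arrays `X_toy` and `y_toy` below are hypothetical stand-ins, not the project data. The sketch shows that 10 rows split into 8 for training and 2 for testing, and that repeating the call with the same seed returns the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# First split with a fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Second split with the same seed, to check reproducibility
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)        # sizes of the two parts
print(np.array_equal(X_te, X_te2))   # same test rows both times
```

Changing `random_state` to another integer would give a different, but equally reproducible, shuffle.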

In [None]:
X_train

| | ID V1 | ID V10 | ID V11 | ID V12 | ID V13 | ID V14 | ID V15 | ID V16 | ID V17 | ID V18 | ... | CV V5 | CV V6 | CV V7 | CV V8 | CV V9 | CV V10 | CV V11 | CV V12 | CV V13 | IMP Paso |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20126 | 1 | 56 | 119 | 696 | 367 | 438 | 17 | 742 | 521 | 360 | ... | 28.2208 | 59.4618 | 0.0 | 26.8645 | 58.9193 | 0.0 | 36.5849 | 595.983 | 0.0 | 0 |
| 36646 | 1 | 54 | 123 | 693 | 368 | 469 | 18 | 733 | 530 | 360 | ... | 29.4416 | 63.8925 | 0.0 | 30.8883 | 63.3048 | 0.0 | 61.7224 | 941.714 | 0.0 | 0 |
| 19385 | 1 | 61 | 133 | 684 | 367 | 437 | 18 | 720 | 509 | 360 | ... | 28.4921 | 58.5576 | 0.0 | 27.5879 | 58.0150 | 0.0 | 34.1435 | 622.432 | 0.0 | 0 |
| 27012 | 1 | 63 | 124 | 680 | 368 | 459 | 18 | 731 | 532 | 360 | ... | 28.2208 | 56.5231 | 0.0 | 26.3672 | 56.0257 | 0.0 | 45.4915 | 506.917 | 0.0 | 0 |
| 40808 | 1 | 56 | 121 | 693 | 368 | 467 | 19 | 734 | 528 | 360 | ... | 20.4445 | 71.2619 | 0.0 | 19.9020 | 70.5838 | 0.0 | 25.6890 | 916.395 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6467 | 1 | 56 | 119 | 693 | 368 | 466 | 18 | 726 | 526 | 360 | ... | 29.3963 | 59.1453 | 0.0 | 27.7687 | 58.6028 | 0.0 | 38.2577 | 575.684 | 0.0 | 0 |
| 11650 | 1 | 58 | 115 | 693 | 366 | 467 | 18 | 733 | 526 | 360 | ... | 27.5879 | 57.2012 | 0.0 | 26.1411 | 56.7039 | 0.0 | 48.5659 | 307.581 | 0.0 | 0 |
| 39396 | 1 | 55 | 162 | 712 | 369 | 471 | 17 | 736 | 509 | 360 | ... | 31.2952 | 54.6694 | 0.0 | 36.6754 | 54.1721 | 0.0 | 43.1405 | 828.505 | 0.0 | 0 |
| 886 | 1 | 60 | 158 | 740 | 366 | 474 | 17 | 746 | 523 | 360 | ... | 28.4017 | 55.6189 | 0.0 | 27.3166 | 55.0763 | 0.0 | 45.4463 | 356.545 | 0.0 | 0 |


In [None]:
Y_train

20126    0.626774
36646    0.628032
19385    0.626774
27012    0.591375
40808    0.612040
           ...   
6467     0.351662
11650    1.000000
39396    0.642767
886      0.770710
16311    0.626774
Name: reliability, Length: 32112, dtype: float64

## Model training

### K-Nearest Neighbors

#### Training the model

In [None]:
from sklearn.neighbors import KNeighborsRegressor

kn = KNeighborsRegressor(n_neighbors=1)
kn.fit(X_train, Y_train)

KNeighborsRegressor(n_neighbors=1)

#### Applying the model to make a prediction

In [None]:
Y_kn_train_pred = kn.predict(X_train)
Y_kn_test_pred = kn.predict(X_test)

#### Evaluate model performance

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

kn_train_mse = mean_squared_error(Y_train, Y_kn_train_pred)
kn_train_r2 = r2_score(Y_train, Y_kn_train_pred)

kn_test_mse = mean_squared_error(Y_test, Y_kn_test_pred)
kn_test_r2 = r2_score(Y_test, Y_kn_test_pred)
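For readers unfamiliar with the two metrics, the sketch below computes them by hand with NumPy and compares the result with scikit-learn. The arrays `y_true` and `y_pred` are invented values for illustration, not predictions from the model above. MSE is the mean of the squared errors; R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions, for illustration only
y_true = np.array([0.60, 0.63, 0.35, 1.00])
y_pred = np.array([0.62, 0.60, 0.40, 0.90])

# Mean squared error: average of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Both manual values should agree with scikit-learn
print(np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

A lower MSE and an R² closer to 1 both indicate predictions that track the targets more closely.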

In [None]:
kn_results = pd.DataFrame(['K-Nearest Neighbors', kn_train_mse, kn_train_r2, kn_test_mse, kn_test_r2]).transpose()
kn_results.columns = ['Method', 'Training MSE', 'Training R2', 'Test MSE', 'Test R2']

In [None]:
kn_results

| | Method | Training MSE | Training R2 | Test MSE | Test R2 |
|---|---|---|---|---|---|
| 0 | K-Nearest Neighbors | 5.6e-05 | 0.999266 | 0.014994 | 0.801152 |
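The gap between the training and test scores in the table above is what one would expect with `n_neighbors=1`: each training point is its own nearest neighbor, so the model largely reproduces the training targets. The sketch below illustrates this on synthetic data. Everything in it (`X_demo`, `y_demo`, the noise level, the 150/50 split) is an assumption made for the demonstration, not the project's data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical noisy regression problem with 200 distinct points
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# Simple hold-out: first 150 rows to fit, last 50 to evaluate
X_fit, y_fit = X_demo[:150], y_demo[:150]
X_held, y_held = X_demo[150:], y_demo[150:]

model = KNeighborsRegressor(n_neighbors=1).fit(X_fit, y_fit)

# On the fitting data every point is matched to itself
train_mse = mean_squared_error(y_fit, model.predict(X_fit))
# On held-out data the nearest neighbor is a different, noisy point
held_mse = mean_squared_error(y_held, model.predict(X_held))

print(train_mse, held_mse)
```

In this sketch the training error is exactly zero because all feature rows are distinct. The small non-zero training MSE reported above may come from rows with identical features but different targets; that is a guess, not something the results confirm.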
