# IoV FL-based Misbehavior Detection Part 2 - Normalized Data MDS

<b>Author</b>: Jingze Dai, McMaster University
<br>
<b>Research Supervisor</b>: Dr. Jiaqi Huang, University of Central Missouri
<br>
<b>Key Topics</b>: Federated Learning, Explainable AI

<a name="toc"></a>
## Table of Contents
* [Section 1: Summary of Section Structures](#1-bullet)
* [Section 2: Environment Setup](#2-bullet)
* [Section 3.1: Generating the final dataset (Method 1 - Euclidean Distance)](#3-1-bullet)
* [Section 3.2: Generated Data Testing (Method 1 - Euclidean Distance)](#3-2-bullet)

<a class="anchor" id="1-bullet"><h3><b>Section 1</b>: Summary of Section Structures</h3></a> 
<br>
[Back to Menu](#toc)

<b>Section 1</b>: Summary of Section Structures
<ul>
  <li><b>This Section</b></li>
  <li>Introduces the main contents of each section to help readers navigate the notebook.</li>
</ul>

<b>Section 2</b>: Environment Setup
<ul>
  <li>Builds the necessary environment and installs the required tools.</li>
  <li>Loads and accesses the dataset.</li>
</ul>

<b>Section 3-1</b>: Generating the final dataset (Method 1 - Euclidean Distance)
<ul>
  <li>Uses the Euclidean distance method to generate a transformed dataset: each x/y/z component triple is collapsed into a single magnitude column.</li>
</ul>

<b>Section 3-2</b>: Generated Data Testing (Method 1 - Euclidean Distance)
<ul>
  <li>Inspection and data cleaning of the dataset generated in Section 3-1.</li>
</ul>

<a class="anchor" id="2-bullet"><h3><b>Section 2</b>: Environment Setup</h3></a> 
<br>
[Back to Menu](#toc)

In [3]:
import pandas as pd

# Load the cleaned dataset from CSV
file_path = 'mixalldata_clean.csv'
df = pd.read_csv(file_path)

# List the available columns
print("Column names in the dataset:")
print(df.columns)

Column names in the dataset:
Index(['type', 'sendTime', 'sender', 'senderPseudo', 'messageID', 'class',
       'posx', 'posy', 'posz', 'posx_n', 'posy_n', 'posz_n', 'spdx', 'spdy',
       'spdz', 'spdx_n', 'spdy_n', 'spdz_n', 'aclx', 'acly', 'aclz', 'aclx_n',
       'acly_n', 'aclz_n', 'hedx', 'hedy', 'hedz', 'hedx_n', 'hedy_n',
       'hedz_n'],
      dtype='object')
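
The listing above follows a regular naming pattern: four quantities (`pos`, `spd`, `acl`, `hed`), each with `x`/`y`/`z` components and an `_n`-suffixed counterpart. As a minimal sketch, the component triples can be generated from that pattern and checked against the column list before any transformation. The helper name `build_triples` and the hard-coded `dataset_columns` list (copied from the output above) are illustrative assumptions, not part of the original notebook.

```python
# Column names copied from the output above (stand-in for list(df.columns)).
dataset_columns = [
    'type', 'sendTime', 'sender', 'senderPseudo', 'messageID', 'class',
    'posx', 'posy', 'posz', 'posx_n', 'posy_n', 'posz_n',
    'spdx', 'spdy', 'spdz', 'spdx_n', 'spdy_n', 'spdz_n',
    'aclx', 'acly', 'aclz', 'aclx_n', 'acly_n', 'aclz_n',
    'hedx', 'hedy', 'hedz', 'hedx_n', 'hedy_n', 'hedz_n',
]


def build_triples(prefixes=('pos', 'spd', 'acl', 'hed'), suffixes=('', '_n')):
    """Hypothetical helper: return (x, y, z, new_name) tuples for each quantity."""
    triples = []
    for prefix in prefixes:
        for suffix in suffixes:
            x, y, z = (f"{prefix}{axis}{suffix}" for axis in "xyz")
            triples.append((x, y, z, f"{prefix}{suffix}"))
    return triples


triples = build_triples()

# Every component column named by a triple should exist in the dataset.
missing = [c for t in triples for c in t[:3] if c not in dataset_columns]
print(f"{len(triples)} triples, missing columns: {missing}")
```

Generating the triples from the prefixes avoids typing all 24 component names by hand, at the cost of assuming the naming pattern holds for every column.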


In [2]:
print(df.head())

   type      sendTime  sender  senderPseudo  messageID  class        posx  \
0     4  72002.302942  130137     101301377  422013806      0  266.982401   
1     4  72003.302942  130137     101301377  422023410      0  266.827208   
2     4  72004.302942  130137     101301377  422032081      0  266.420297   
3     4  72005.302942  130137     101301377  422040712      0  268.912026   
4     4  72006.302942  130137     101301377  422052949      0  268.242276   

        posy  posz    posx_n  ...  aclz    aclx_n    acly_n  aclz_n      hedx  \
0  32.336955   0.0  3.480882  ...   0.0  0.000862  0.000862     0.0 -0.102790   
1  34.624145   0.0  3.546261  ...   0.0  0.000107  0.001040     0.0 -0.099856   
2  38.836461   0.0  3.544045  ...   0.0  0.000172  0.001661     0.0 -0.099856   
3  45.414229   0.0  3.340080  ...   0.0  0.000171  0.001654     0.0 -0.100172   
4  53.729986   0.0  3.328872  ...   0.0  0.000193  0.001852     0.0 -0.097105   

       hedy  hedz     hedx_n     hedy_n  hedz_n  


<a class="anchor" id="3-1-bullet"><h3><b>Section 3.1</b>: Generating the final dataset (Method 1 - Euclidean Distance)</h3></a> 
<br>
[Back to Menu](#toc)

In [8]:
import pandas as pd
import numpy as np

# Identifier and label columns are carried over unchanged
columns_to_keep = ['type', 'sendTime', 'sender', 'senderPseudo', 'messageID', 'class']
transformed_df = df[columns_to_keep].copy()

# Each tuple: (x component, y component, z component, name of the new magnitude column)
columns = [
    ('posx', 'posy', 'posz', 'pos'),
    ('posx_n', 'posy_n', 'posz_n', 'pos_n'),
    ('spdx', 'spdy', 'spdz', 'spd'),
    ('spdx_n', 'spdy_n', 'spdz_n', 'spd_n'),
    ('aclx', 'acly', 'aclz', 'acl'),
    ('aclx_n', 'acly_n', 'aclz_n', 'acl_n'),
    ('hedx', 'hedy', 'hedz', 'hed'),
    ('hedx_n', 'hedy_n', 'hedz_n', 'hed_n')
]

# Collapse each (x, y, z) triple into its Euclidean norm: sqrt(x^2 + y^2 + z^2)
for col_x, col_y, col_z, new_col in columns:
    transformed_df[new_col] = np.sqrt(df[col_x]**2 + df[col_y]**2 + df[col_z]**2)

print(transformed_df.head())

   type      sendTime  sender  senderPseudo  messageID  class         pos  \
0     4  72002.302942  130137     101301377  422013806      0  268.933600   
1     4  72003.302942  130137     101301377  422023410      0  269.064287   
2     4  72004.302942  130137     101301377  422032081      0  269.236040   
3     4  72005.302942  130137     101301377  422040712      0  272.719875   
4     4  72006.302942  130137     101301377  422052949      0  273.570521   

      pos_n       spd     spd_n       acl     acl_n  hed      hed_n  
0  4.917271  1.212767  0.000000  2.131618  0.001219  1.0  26.631126  
1  5.032356  3.149541  0.001046  2.110864  0.001046  1.0  25.042812  
2  4.933492  5.078437  0.002716  1.794447  0.001670  1.0  24.027949  
3  4.731177  7.237398  0.004379  1.821157  0.001663  1.0  23.398577  
4  4.700795  9.328836  0.006241  2.504642  0.001862  1.0  23.150423  
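
As a quick sanity check on the transformation, the sketch below recomputes `pos` for the first two rows from the `posx`, `posy` and `posz` values printed in Section 2, once with the explicit square-root formula and once with `np.linalg.norm`. The small `sample` frame is an illustrative stand-in for `df`; only the values themselves come from the notebook output.

```python
import numpy as np
import pandas as pd

# First two rows of posx/posy/posz, copied from the df.head() output in Section 2.
sample = pd.DataFrame({
    "posx": [266.982401, 266.827208],
    "posy": [32.336955, 34.624145],
    "posz": [0.0, 0.0],
})

# Explicit formula: sqrt(x^2 + y^2 + z^2) per row.
explicit = np.sqrt(sample["posx"]**2 + sample["posy"]**2 + sample["posz"]**2)

# Equivalent form: row-wise L2 norm over the three component columns.
via_norm = np.linalg.norm(sample[["posx", "posy", "posz"]].to_numpy(), axis=1)

print(explicit.to_numpy())
print(via_norm)
```

Both forms should agree with the `pos` values printed above (about 268.9336 and 269.0643). Note also that `hed` equals 1.0 in all five rows shown, which would be consistent with the heading being stored as a unit vector; this is an observation from the printed sample only, not a verified property of the full dataset.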


In [9]:
import pandas as pd

output_file_path = 'mixalldata_transformed_euclidean.csv'

# index=False omits the DataFrame row index from the CSV
transformed_df.to_csv(output_file_path, index=False)

print(f"The transformed data has been saved to {output_file_path}")

The transformed data has been saved to mixalldata_transformed_euclidean.csv


<a class="anchor" id="3-2-bullet"><h3><b>Section 3.2</b>: Generated Data Testing (Method 1 - Euclidean Distance)</h3></a> 
<br>
[Back to Menu](#toc)

In [10]:
num_rows = transformed_df.shape[0]

print(f"The transformed dataframe has {num_rows} rows.")

The transformed dataframe has 3194808 rows.
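
Beyond printing the row count, one could also assert that the transformation neither dropped rows nor produced unexpected columns. The following is a minimal sketch on toy data; `check_transformed_shape` is a hypothetical helper and the two small frames stand in for `df` and `transformed_df`.

```python
import pandas as pd


def check_transformed_shape(original, transformed, kept_columns, new_columns):
    """Hypothetical helper: fail loudly if the transformed frame's shape is unexpected."""
    assert len(transformed) == len(original), "row count changed"
    expected = list(kept_columns) + list(new_columns)
    assert list(transformed.columns) == expected, "unexpected columns"
    return True


# Toy stand-ins for df and transformed_df.
original = pd.DataFrame({
    "messageID": [1, 2],
    "posx": [3.0, 0.0],
    "posy": [4.0, 0.0],
    "posz": [0.0, 2.0],
})
transformed = original[["messageID"]].copy()
transformed["pos"] = (
    original["posx"]**2 + original["posy"]**2 + original["posz"]**2
) ** 0.5

ok = check_transformed_shape(original, transformed, ["messageID"], ["pos"])
print(ok, transformed["pos"].tolist())
```

In the notebook itself, the equivalent call would pass `df`, `transformed_df`, `columns_to_keep`, and the fourth element of each tuple in `columns`.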


In [4]:
# Element-wise missing-value mask of the original df (True marks a missing entry)
null_df = df.isnull()
print(null_df)

          type  sendTime  sender  senderPseudo  messageID  class   posx  \
0        False     False   False         False      False  False  False   
1        False     False   False         False      False  False  False   
2        False     False   False         False      False  False  False   
3        False     False   False         False      False  False  False   
4        False     False   False         False      False  False  False   
...        ...       ...     ...           ...        ...    ...    ...   
3194803  False     False   False         False      False  False  False   
3194804  False     False   False         False      False  False  False   
3194805  False     False   False         False      False  False  False   
3194806  False     False   False         False      False  False  False   
3194807  False     False   False         False      False  False  False   

          posy   posz  posx_n  ...   aclz  aclx_n  acly_n  aclz_n   hedx  \
0        False  False  

In [5]:
# Count missing values per column of the original df
null_counts = df.isnull().sum()
print(null_counts)

type            0
sendTime        0
sender          0
senderPseudo    0
messageID       0
class           0
posx            0
posy            0
posz            0
posx_n          0
posy_n          0
posz_n          0
spdx            0
spdy            0
spdz            0
spdx_n          0
spdy_n          0
spdz_n          0
aclx            0
acly            0
aclz            0
aclx_n          0
acly_n          0
aclz_n          0
hedx            0
hedy            0
hedz            0
hedx_n          0
hedy_n          0
hedz_n          0
dtype: int64
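
The check above runs on the original `df`. Since the square root is only applied to sums of squares, the derived columns should not gain missing values on their own, but the same inspection can be repeated on the generated frame and extended to infinities. A minimal sketch, assuming a hypothetical helper `count_non_finite` and a toy frame in place of `transformed_df`:

```python
import numpy as np
import pandas as pd


def count_non_finite(frame: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: count NaN or +/-inf entries per numeric column."""
    numeric = frame.select_dtypes(include=[np.number])
    bad = ~np.isfinite(numeric.to_numpy(dtype=float))
    return pd.Series(bad.sum(axis=0), index=numeric.columns)


# Toy stand-in for transformed_df with one NaN and one infinity planted in it.
toy = pd.DataFrame({
    "messageID": [1, 2, 3],
    "pos": [268.9, np.nan, 269.2],
    "spd": [1.2, 3.1, np.inf],
})

counts = count_non_finite(toy)
print(counts)
```

Calling `count_non_finite(transformed_df)` in the notebook would be expected to report zeros everywhere if the source data is clean, which is what the `df.isnull().sum()` output above suggests.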


In [6]:
# Count rows of the original df that exactly duplicate an earlier row
duplicate_record = df.duplicated().sum()

print("Total Records Duplications: " + str(duplicate_record))

Total Records Duplications: 0


In [7]:
# Count rows whose messageID repeats that of an earlier row
duplicate_rows = df.duplicated(subset=['messageID']).sum()

print("Duplicated messageID Rows: " + str(duplicate_rows))

Duplicated messageID Rows: 0
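
No duplicates were found here, so no cleaning step is needed for this dataset. Had any been reported, one possible cleaning step would be to keep the first row per `messageID`. The sketch below shows this on toy data; `drop_duplicate_messages` is a hypothetical helper, and keeping the first occurrence is an assumed policy rather than one stated in the notebook.

```python
import pandas as pd


def drop_duplicate_messages(frame: pd.DataFrame, key: str = "messageID"):
    """Hypothetical helper: keep the first row for each key value.

    Returns the cleaned frame and the number of rows removed.
    """
    before = len(frame)
    cleaned = frame.drop_duplicates(subset=[key], keep="first").reset_index(drop=True)
    return cleaned, before - len(cleaned)


# Toy stand-in for transformed_df with one repeated messageID.
toy = pd.DataFrame({
    "messageID": [10, 11, 10, 12],
    "pos": [1.0, 2.0, 1.5, 3.0],
})

cleaned, removed = drop_duplicate_messages(toy)
print(f"Removed {removed} rows; {len(cleaned)} remain")
```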
