# IoV FL-based Misbehavior Detection Part 1 - Data Analysis

<b>Author</b>: Jingze Dai, McMaster University
<br>
<b>Research Supervisor</b>: Dr. Jiaqi Huang, University of Central Missouri
<br>
<b>Key Topics</b>: Federated Learning, Explainable AI

<a name="toc"></a>
## Table of Contents
* [Section 1: Summary of Section Structures](#1-bullet)
* [Section 2: Environment Setup](#2-bullet)
* [Section 3: Data Cleaning](#3-bullet)
* [Section 4: Analysis - Message Identity Attributes](#4-bullet)
* [Section 5: Analysis - Position Attributes](#5-bullet)
* [Section 6: Analysis - Speed Attributes](#6-bullet)
* [Section 7: Analysis - Acceleration Attributes](#7-bullet)
* [Section 8: Analysis - Heading Position Attributes](#8-bullet)

<a class="anchor" id="1-bullet"><h3><b>Section 1</b>: Summary of Section Structures</h3></a> 
<br>
[Back to Menu](#toc)

<b>Section 1</b>: Summary of Section Structures
<ul>
  <li><b>This Section</b></li>
  <li>Introduces the main contents of each section to help readers navigate the notebook.</li>
</ul>

<b>Section 2</b>: Environment Setup
<ul>
  <li>Builds the necessary environment and installs the required tools.</li>
  <li>Loads and accesses the dataset.</li>
</ul>

<b>Section 3</b>: Data Cleaning
<ul>
  <li>Removes incorrect and dirty data, and fixes data problems.</li>
</ul>

<b>Section 4</b>: Analysis - Message Identity Attributes
<ul>
  <li>Analysis of message identity attributes - 'type', 'sendTime', 'sender', 'senderPseudo', and 'messageID'.</li>
</ul>

<b>Section 5</b>: Analysis - Position Attributes
<ul>
  <li>Analysis of position attributes - 'posx', 'posy', 'posz', 'posx_n', 'posy_n', and 'posz_n'.</li>
</ul>

<b>Section 6</b>: Analysis - Speed Attributes
<ul>
  <li>Analysis of speed attributes - 'spdx', 'spdy', 'spdz', 'spdx_n', 'spdy_n', and 'spdz_n'.</li>
</ul>

<b>Section 7</b>: Analysis - Acceleration Attributes
<ul>
  <li>Analysis of acceleration attributes - 'aclx', 'acly', 'aclz', 'aclx_n', 'acly_n', and 'aclz_n'.</li>
</ul>

<b>Section 8</b>: Analysis - Heading Position Attributes
<ul>
  <li>Analysis of heading position attributes - 'hedx', 'hedy', 'hedz', 'hedx_n', 'hedy_n', and 'hedz_n'.</li>
</ul>

<a class="anchor" id="2-bullet"><h3><b>Section 2</b>: Environment Setup</h3></a> 
<br>
[Back to Menu](#toc)

In [2]:
import pandas as pd

file_path = 'mixalldata_clean.csv'
df = pd.read_csv(file_path)

print("Column names in the dataset:")
print(df.columns)

Column names in the dataset:
Index(['type', 'sendTime', 'sender', 'senderPseudo', 'messageID', 'class',
       'posx', 'posy', 'posz', 'posx_n', 'posy_n', 'posz_n', 'spdx', 'spdy',
       'spdz', 'spdx_n', 'spdy_n', 'spdz_n', 'aclx', 'acly', 'aclz', 'aclx_n',
       'acly_n', 'aclz_n', 'hedx', 'hedy', 'hedz', 'hedx_n', 'hedy_n',
       'hedz_n'],
      dtype='object')


In [9]:
print(df.head())

   type      sendTime  sender  senderPseudo  messageID  class        posx  \
0     4  72002.302942  130137     101301377  422013806      0  266.982401   
1     4  72003.302942  130137     101301377  422023410      0  266.827208   
2     4  72004.302942  130137     101301377  422032081      0  266.420297   
3     4  72005.302942  130137     101301377  422040712      0  268.912026   
4     4  72006.302942  130137     101301377  422052949      0  268.242276   

        posy  posz    posx_n  ...  aclz    aclx_n    acly_n  aclz_n      hedx  \
0  32.336955   0.0  3.480882  ...   0.0  0.000862  0.000862     0.0 -0.102790   
1  34.624145   0.0  3.546261  ...   0.0  0.000107  0.001040     0.0 -0.099856   
2  38.836461   0.0  3.544045  ...   0.0  0.000172  0.001661     0.0 -0.099856   
3  45.414229   0.0  3.340080  ...   0.0  0.000171  0.001654     0.0 -0.100172   
4  53.729986   0.0  3.328872  ...   0.0  0.000193  0.001852     0.0 -0.097105   

       hedy  hedz     hedx_n     hedy_n  hedz_n  


In [10]:
num_records = df.shape[0]

print(f"Number of records in the dataset: {num_records}")

Number of records in the dataset: 3194808


In [11]:
data_types = df.dtypes

print("Data types:")
print(data_types)

Data types:
type              int64
sendTime        float64
sender            int64
senderPseudo      int64
messageID         int64
class             int64
posx            float64
posy            float64
posz            float64
posx_n          float64
posy_n          float64
posz_n          float64
spdx            float64
spdy            float64
spdz            float64
spdx_n          float64
spdy_n          float64
spdz_n          float64
aclx            float64
acly            float64
aclz            float64
aclx_n          float64
acly_n          float64
aclz_n          float64
hedx            float64
hedy            float64
hedz            float64
hedx_n          float64
hedy_n          float64
hedz_n          float64
dtype: object


In [14]:
# shape[1] is the number of columns, i.e. attributes per record
num_attributes = df.shape[1]

print(f"Number of attributes in each record: {num_attributes}")

Number of attributes in each record: 30


The attribute 'class' represents the category of each V2X message. A value of 0 means the message shows <b>normal behavior</b>, and the values 1 to 19 each denote one of 19 <b>misbehaviors</b> (some are errors, others are attacks).
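
As a rough illustration of how the 'class' labels could be summarized, the sketch below counts messages per class and derives a binary normal/misbehavior flag. The small `toy` frame and the column name `is_misbehavior` are assumptions made for illustration; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only).
toy = pd.DataFrame({"class": [0, 0, 0, 3, 7, 0, 19, 3]})

# Number of messages per class label, ordered by label.
class_counts = toy["class"].value_counts().sort_index()

# Hypothetical binary flag: False = normal (class 0), True = any misbehavior (classes 1 to 19).
toy["is_misbehavior"] = toy["class"] != 0

print(class_counts)
print("Share of misbehaving messages:", toy["is_misbehavior"].mean())
```

The per-class counts give a first view of class imbalance, which may matter for any later training step.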

<a class="anchor" id="3-bullet"><h3><b>Section 3</b>: Data Cleaning</h3></a> 
<br>
[Back to Menu](#toc)

<i>The provided dataset "mixalldata_clean.csv" has already been cleaned. The data cleaning checks are nevertheless run below for verification.</i>

<p><b>Potential Problem 1</b>: Unnamed columns -> <b>Does Not Exist</b>. By observation, all 30 columns have valid names.</p>
<p><b>Potential Problem 2</b>: Existence of null values.</p>

In [15]:
# First, check whether each individual value is null.
null_df = df.isnull()
print(null_df)

          type  sendTime  sender  senderPseudo  messageID  class   posx  \
0        False     False   False         False      False  False  False   
1        False     False   False         False      False  False  False   
2        False     False   False         False      False  False  False   
3        False     False   False         False      False  False  False   
4        False     False   False         False      False  False  False   
...        ...       ...     ...           ...        ...    ...    ...   
3194803  False     False   False         False      False  False  False   
3194804  False     False   False         False      False  False  False   
3194805  False     False   False         False      False  False  False   
3194806  False     False   False         False      False  False  False   
3194807  False     False   False         False      False  False  False   

          posy   posz  posx_n  ...   aclz  aclx_n  acly_n  aclz_n   hedx  \
0        False  False  

In [16]:
# Second, count the null values in each column and display the counts.
null_val_count = df.isna().sum()
print(null_val_count)

type            0
sendTime        0
sender          0
senderPseudo    0
messageID       0
class           0
posx            0
posy            0
posz            0
posx_n          0
posy_n          0
posz_n          0
spdx            0
spdy            0
spdz            0
spdx_n          0
spdy_n          0
spdz_n          0
aclx            0
acly            0
aclz            0
aclx_n          0
acly_n          0
aclz_n          0
hedx            0
hedy            0
hedz            0
hedx_n          0
hedy_n          0
hedz_n          0
dtype: int64


<p>The dataset contains no null values, so no null-value handling is required.</p>

<p><b>Potential Problem 3</b>: Wrong-format data -> <b>Does Not Exist</b>. By observation, all 30 columns have valid types, and their values are consistent with those types, so no conversion is needed.</p>

<p><b>Potential Problem 4</b>: Data with inappropriate values -> <b>Not Checked</b>, because the valid ranges and other constraints for these attributes have not been provided.</p>
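
If valid ranges were available, a check along the following lines could flag out-of-range values. This is only a sketch: the bounds in `hypothetical_bounds` are invented for illustration and are not taken from any dataset documentation, and the `toy` frame stands in for `df`.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "posx": [266.9, -20.0, 1518.9, 5000.0],
    "posy": [32.3, 34.6, -500.0, 38.8],
})

# Hypothetical (lower, upper) bounds per column; NOT from the dataset specification.
hypothetical_bounds = {"posx": (-100.0, 2000.0), "posy": (-100.0, 2000.0)}

# Count the values that fall outside the assumed (inclusive) range of each column.
out_of_range = {
    col: int((~toy[col].between(low, high)).sum())
    for col, (low, high) in hypothetical_bounds.items()
}
print(out_of_range)
```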

<p><b>Potential Problem 5</b>: Duplicate rows -> <b>Does Not Exist</b>. No rows are repeated when all values are compared.</p>

In [17]:
# Count the records that are exact duplicates of an earlier record
duplicate_record = df.duplicated().sum()

print("Total Records Duplications: " + str(duplicate_record))

Total Records Duplications: 0


<p>No rows are repeated. Looking at the individual columns, only "messageID" needs to have unique values, since it is an identifier; the other columns may legitimately contain repeated values.</p>

In [18]:
# Count the records whose "messageID" repeats an earlier one
duplicate_rows = df.duplicated(subset=['messageID']).sum()

print("Duplicated messageID Rows: " + str(duplicate_rows))

Duplicated messageID Rows: 0
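
A sketch of an equivalent uniqueness check, and of how the offending rows could be inspected if duplicates did exist. The `toy` frame is an assumption for illustration and deliberately contains one repeated ID.

```python
import pandas as pd

# Toy stand-in with one deliberately repeated messageID (illustrative values).
toy = pd.DataFrame({
    "messageID": [101, 102, 102, 103],
    "sender": [9, 15, 15, 21],
})

# True only if every messageID occurs exactly once.
ids_are_unique = toy["messageID"].is_unique

# keep=False marks every member of a duplicated group, not just the later copies.
repeated_rows = toy[toy.duplicated(subset=["messageID"], keep=False)]

print("messageID unique:", ids_are_unique)
print(repeated_rows)
```

Using `keep=False` makes it possible to compare all copies of a repeated identifier side by side before deciding which one to keep.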


<a class="anchor" id="4-bullet"><h3><b>Section 4</b>: Analysis - Message Identity Attributes</h3></a> 
<br>
[Back to Menu](#toc)

There are 5 message identity attributes - 'type', 'sendTime', 'sender', 'senderPseudo', and 'messageID'.

In [2]:
distinct_values = df['type'].unique()

print("Distinct values of 'type':")
print(distinct_values)

print("Amount of distinct values:")
print(len(distinct_values))

Distinct values of 'type':
[4]
Amount of distinct values:
1


In [7]:
min_value = df['type'].min()
max_value = df['type'].max()

print("Minimum value of 'type':", min_value)
print("Maximum value of 'type':", max_value)

Minimum value of 'type': 4
Maximum value of 'type': 4


In [3]:
distinct_values = df['sendTime'].unique()

print(f"Distinct values of 'sendTime':")
print(distinct_values)

print(f"Amount of distinct values in 'sendTime':")
print(len(distinct_values))

Distinct values of 'sendTime':
[72002.30294186 72003.30294186 72004.30294186 ... 79377.75261029
 79378.25261029 79378.75261029]
Amount of distinct values in 'sendTime':
3194808


In [8]:
min_value = df['sendTime'].min()
max_value = df['sendTime'].max()

print("Minimum value of 'sendTime':", min_value)
print("Maximum value of 'sendTime':", max_value)

Minimum value of 'sendTime': 240.602763370378
Maximum value of 'sendTime': 86399.98493806412
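
The maximum is just below 86400, the number of seconds in a day. This suggests, as an assumption not confirmed by the source, that 'sendTime' is measured in seconds within a 24-hour period. Under that assumption, messages could be grouped by hour as sketched below; the `toy` frame and the column name `hour_of_day` are illustrative.

```python
import pandas as pd

# Toy stand-in for the 'sendTime' column (illustrative values).
toy = pd.DataFrame({"sendTime": [240.6, 3599.9, 3600.0, 72002.3, 72003.3, 86399.98]})

# ASSUMPTION: sendTime is in seconds since the start of a 24-hour period.
toy["hour_of_day"] = (toy["sendTime"] // 3600).astype(int)

# Number of messages sent in each hour that appears in the data.
messages_per_hour = toy["hour_of_day"].value_counts().sort_index()
print(messages_per_hour)
```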


In [4]:
distinct_values = df['sender'].unique()

print(f"Distinct values of 'sender':")
print(distinct_values)

print(f"Amount of distinct values in 'sender':")
print(len(distinct_values))

Distinct values of 'sender':
[130137 130143 130161 ... 140877 140889 140895]
Amount of distinct values in 'sender':
24663


In [9]:
min_value = df['sender'].min()
max_value = df['sender'].max()

print("Minimum value of 'sender':", min_value)
print("Maximum value of 'sender':", max_value)

Minimum value of 'sender': 9
Maximum value of 'sender': 147981


In [12]:
most_frequent_values = df['sender'].value_counts()

print("Most frequent values of 'sender' and their occurrences:")
print(most_frequent_values)

Most frequent values of 'sender' and their occurrences:
10995     1494
120051    1361
23913     1301
127653    1291
64575     1274
          ... 
136533       1
126717       1
22413        1
68967        1
119277       1
Name: sender, Length: 24663, dtype: int64


In [5]:
distinct_values = df['senderPseudo'].unique()

print(f"Distinct values of 'senderPseudo':")
print(distinct_values)

print(f"Amount of distinct values in 'senderPseudo':")
print(len(distinct_values))

Distinct values of 'senderPseudo':
[ 101301377  101301437  101301617 ... 1311474365 1411474365  101408957]
Amount of distinct values in 'senderPseudo':
118909


In [10]:
min_value = df['senderPseudo'].min()
max_value = df['senderPseudo'].max()

print("Minimum value of 'senderPseudo':", min_value)
print("Maximum value of 'senderPseudo':", max_value)

Minimum value of 'senderPseudo': 1
Maximum value of 'senderPseudo': 4206511862


In [13]:
most_frequent_values = df['senderPseudo'].value_counts()

print("Most frequent values of 'senderPseudo' and their occurrences:")
print(most_frequent_values)

Most frequent values of 'senderPseudo' and their occurrences:
1            40853
10449913      1223
10484113      1116
10677914      1079
10713914       996
             ...  
306388441        1
23044732         1
900947136        1
911512346        1
350475953        1
Name: senderPseudo, Length: 118909, dtype: int64


In [6]:
distinct_values = df['messageID'].unique()

print("Distinct values of 'messageID':")
print(distinct_values)

print("Amount of distinct values:")
print(len(distinct_values))

Distinct values of 'messageID':
[422013806 422023410 422032081 ... 447809215 447809698 447810281]
Amount of distinct values:
3194808


In [11]:
min_value = df['messageID'].min()
max_value = df['messageID'].max()

print("Minimum value of 'messageID':", min_value)
print("Maximum value of 'messageID':", max_value)

Minimum value of 'messageID': 4184
Maximum value of 'messageID': 458622809
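
The cells above repeat the same steps (distinct values, minimum and maximum, most frequent values) for each attribute. A small helper such as the hypothetical `summarize_column` below could bundle them; the function name, its `top_n` parameter, and the `toy` frame are assumptions for illustration.

```python
import pandas as pd

def summarize_column(frame: pd.DataFrame, column: str, top_n: int = 3) -> dict:
    """Collect the distinct count, range, and most frequent values of one column."""
    series = frame[column]
    return {
        "distinct": int(series.nunique()),
        "min": series.min(),
        "max": series.max(),
        # value_counts sorts by frequency (descending) by default.
        "top": series.value_counts().head(top_n).to_dict(),
    }

# Toy stand-in for the real dataset (illustrative values).
toy = pd.DataFrame({"sender": [9, 15, 15, 21, 15, 9]})

summary = summarize_column(toy, "sender")
print(summary)
```

The same helper could be looped over any list of column names, for example the speed, acceleration, or heading attributes.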


<i>Some more complicated analyses follow below: </i>

In [14]:
filtered_df = df[df['senderPseudo'] == 1]

sender_distribution = filtered_df['sender'].value_counts()

print("Value distribution of 'sender' when senderPseudo = 1:")
print(sender_distribution)

Value distribution of 'sender' when senderPseudo = 1:
1875      716
57735     673
99153     652
10995     638
15933     609
         ... 
24675       1
54315       1
71031       1
40635       1
120813      1
Name: sender, Length: 377, dtype: int64


In [5]:
filtered_df = df[df['senderPseudo'] == 10449913]

sender_distribution = filtered_df['sender'].value_counts()

print("Value distribution of 'sender' when senderPseudo = 10449913:")
print(sender_distribution)

Value distribution of 'sender' when senderPseudo = 10449913:
44991    1223
Name: sender, dtype: int64


In [6]:
filtered_df = df[df['senderPseudo'] == 10484113]

sender_distribution = filtered_df['sender'].value_counts()

print("Value distribution of 'sender' when senderPseudo = 10484113:")
print(sender_distribution)

Value distribution of 'sender' when senderPseudo = 10484113:
48411    1116
Name: sender, dtype: int64


In [7]:
filtered_df = df[df['senderPseudo'] == 10677914]

sender_distribution = filtered_df['sender'].value_counts()

print("Value distribution of 'sender' when senderPseudo = 10677914:")
print(sender_distribution)

Value distribution of 'sender' when senderPseudo = 10677914:
67791    1079
Name: sender, dtype: int64


In [8]:
filtered_df = df[df['senderPseudo'] == 10713914]

sender_distribution = filtered_df['sender'].value_counts()

print("Value distribution of 'sender' when senderPseudo = 10713914:")
print(sender_distribution)

Value distribution of 'sender' when senderPseudo = 10713914:
71391    996
Name: sender, dtype: int64


In [11]:
filtered_df = df[df['sender'] == 10995]

# Distribution of pseudonyms used by this sender
pseudo_distribution = filtered_df['senderPseudo'].value_counts()

print("Value distribution of 'senderPseudo' when sender = 10995:")
print(pseudo_distribution)

Value distribution of 'senderPseudo' when sender = 10995:
1           638
20109952    169
10109952    153
30109952    145
40109952    143
50109952    140
60109952    106
Name: senderPseudo, dtype: int64


In [9]:
filtered_df = df[df['sender'] == 120051]

# Distribution of pseudonyms used by this sender
pseudo_distribution = filtered_df['senderPseudo'].value_counts()

print("Value distribution of 'senderPseudo' when sender = 120051:")
print(pseudo_distribution)

Value distribution of 'senderPseudo' when sender = 120051:
1            332
101200516    209
301200516    208
501200516    204
201200516    193
401200516    112
601200516    103
Name: senderPseudo, dtype: int64


In [10]:
filtered_df = df[df['sender'] == 23913]

# Distribution of pseudonyms used by this sender
pseudo_distribution = filtered_df['senderPseudo'].value_counts()

print("Value distribution of 'senderPseudo' when sender = 23913:")
print(pseudo_distribution)

Value distribution of 'senderPseudo' when sender = 23913:
1           327
30239132    217
10239132    214
50239132    145
20239132    145
40239132    133
60239132    120
Name: senderPseudo, dtype: int64


In [12]:
filtered_df = df[df['sender'] == 127653]

# Distribution of pseudonyms used by this sender
pseudo_distribution = filtered_df['senderPseudo'].value_counts()

print("Value distribution of 'senderPseudo' when sender = 127653:")
print(pseudo_distribution)

Value distribution of 'senderPseudo' when sender = 127653:
1            343
201276537    185
401276537    169
301276537    165
501276537    155
101276537    151
601276537    123
Name: senderPseudo, dtype: int64


In [13]:
filtered_df = df[df['sender'] == 64575]

# Distribution of pseudonyms used by this sender
pseudo_distribution = filtered_df['senderPseudo'].value_counts()

print("Value distribution of 'senderPseudo' when sender = 64575:")
print(pseudo_distribution)

Value distribution of 'senderPseudo' when sender = 64575:
10645754    211
30645754    202
50645754    194
20645754    185
40645754    169
1           161
60645754    152
Name: senderPseudo, dtype: int64
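
The per-value filters above can be generalized with a `groupby`. The sketch below counts how many distinct pseudonyms each sender uses and how many distinct senders share each pseudonym. The `toy` frame only imitates the pattern seen in the outputs (a shared pseudonym 1 plus sender-specific pseudonyms); its rows are illustrative, not a sample of the real data.

```python
import pandas as pd

# Toy stand-in imitating the observed pattern (illustrative rows).
toy = pd.DataFrame({
    "sender":       [10995, 10995, 10995, 44991, 44991, 64575],
    "senderPseudo": [1, 10109952, 20109952, 10449913, 10449913, 1],
})

# Distinct pseudonyms used by each sender.
pseudos_per_sender = toy.groupby("sender")["senderPseudo"].nunique()

# Distinct senders that appear under each pseudonym.
senders_per_pseudo = toy.groupby("senderPseudo")["sender"].nunique()

# Pseudonyms shared by more than one sender (only the value 1 in this toy data).
shared_pseudos = senders_per_pseudo[senders_per_pseudo > 1]

print(pseudos_per_sender)
print(shared_pseudos)
```

On the full dataset, `shared_pseudos` would show directly whether 1 is the only pseudonym shared across senders, without filtering one value at a time.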


<a class="anchor" id="5-bullet"><h3><b>Section 5</b>: Analysis - Position Attributes</h3></a> 
<br>
[Back to Menu](#toc)

There are 6 position attributes - 'posx', 'posy', 'posz', 'posx_n', 'posy_n', and 'posz_n'.

In [26]:
distinct_values = df['posx'].unique()

print(f"Distinct values of 'posx':")
print(distinct_values)

print(f"Amount of distinct values in 'posx':")
print(len(distinct_values))

Distinct values of 'posx':
[266.98240149 266.8272084  266.42029673 ... 224.85690015 223.17498692
 222.24713112]
Amount of distinct values in 'posx':
2665316


In [14]:
min_value = df['posx'].min()
max_value = df['posx'].max()

print("Minimum value of 'posx':", min_value)
print("Maximum value of 'posx':", max_value)

Minimum value of 'posx': -20.012992182835962
Maximum value of 'posx': 1518.9556909741427


In [20]:
most_frequent_values = df['posx'].value_counts()

print("Most frequent values of 'posx' and their occurrences:")
print(most_frequent_values)

Most frequent values of 'posx' and their occurrences:
0.000000       23420
1122.755939      341
201.301003       283
116.387911       269
871.054199       267
               ...  
626.476901         1
695.610841         1
178.334344         1
203.314731         1
1028.601987        1
Name: posx, Length: 2665316, dtype: int64


In [27]:
distinct_values = df['posy'].unique()

print(f"Distinct values of 'posy':")
print(distinct_values)

print(f"Amount of distinct values in 'posy':")
print(len(distinct_values))

Distinct values of 'posy':
[ 32.33695481  34.62414468  38.83646124 ... 310.85491439 321.59727307
 328.01369459]
Amount of distinct values in 'posy':
2665325


In [15]:
min_value = df['posy'].min()
max_value = df['posy'].max()

print("Minimum value of 'posy':", min_value)
print("Maximum value of 'posy':", max_value)

Minimum value of 'posy': -22.479557911313783
Maximum value of 'posy': 1522.6086272866387


In [21]:
most_frequent_values = df['posy'].value_counts()

print("Most frequent values of 'posy' and their occurrences:")
print(most_frequent_values)

Most frequent values of 'posy' and their occurrences:
0.000000      23420
296.065121      341
224.840560      283
888.557748      269
542.238381      267
              ...  
273.422724        1
33.040097         1
953.748244        1
928.577874        1
892.570309        1
Name: posy, Length: 2665325, dtype: int64


In [28]:
distinct_values = df['posz'].unique()

print(f"Distinct values of 'posz':")
print(distinct_values)

print(f"Amount of distinct values in 'posz':")
print(len(distinct_values))

Distinct values of 'posz':
[0.]
Amount of distinct values in 'posz':
1


In [16]:
min_value = df['posz'].min()
max_value = df['posz'].max()

print("Minimum value of 'posz':", min_value)
print("Maximum value of 'posz':", max_value)

Minimum value of 'posz': 0.0
Maximum value of 'posz': 0.0


In [22]:
most_frequent_values = df['posz'].value_counts()

print("Most frequent values of 'posz' and their occurrences:")
print(most_frequent_values)

Most frequent values of 'posz' and their occurrences:
0.0    3194808
Name: posz, dtype: int64


In [29]:
distinct_values = df['posx_n'].unique()

print(f"Distinct values of 'posx_n':")
print(distinct_values)

print(f"Amount of distinct values in 'posx_n':")
print(len(distinct_values))

Distinct values of 'posx_n':
[3.48088247 3.5462606  3.54404455 ... 3.50654891 3.63467052 3.34195205]
Amount of distinct values in 'posx_n':
2511396


In [17]:
min_value = df['posx_n'].min()
max_value = df['posx_n'].max()

print("Minimum value of 'posx_n':", min_value)
print("Maximum value of 'posx_n':", max_value)

Minimum value of 'posx_n': 0.0
Maximum value of 'posx_n': 8.245799564253035


In [23]:
most_frequent_values = df['posx_n'].value_counts()

print("Most frequent values of 'posx_n' and their occurrences:")
print(most_frequent_values)

Most frequent values of 'posx_n' and their occurrences:
0.000000    23467
4.185654      436
3.161534      297
3.059389      269
4.568067      266
            ...  
4.260074        1
4.447737        1
3.549270        1
4.391230        1
4.629670        1
Name: posx_n, Length: 2511396, dtype: int64


In [30]:
distinct_values = df['posy_n'].unique()

print(f"Distinct values of 'posy_n':")
print(distinct_values)

print(f"Amount of distinct values in 'posy_n':")
print(len(distinct_values))

Distinct values of 'posy_n':
[3.47318391 3.57052407 3.43206785 ... 3.34484057 3.32893055 3.29684768]
Amount of distinct values in 'posy_n':
2511396


In [18]:
min_value = df['posy_n'].min()
max_value = df['posy_n'].max()

print("Minimum value of 'posy_n':", min_value)
print("Maximum value of 'posy_n':", max_value)

Minimum value of 'posy_n': 0.0
Maximum value of 'posy_n': 8.29282635651494


In [24]:
most_frequent_values = df['posy_n'].value_counts()

print("Most frequent values of 'posy_n' and their occurrences:")
print(most_frequent_values)

Most frequent values of 'posy_n' and their occurrences:
0.000000    23467
4.195968      436
3.229807      297
3.088235      269
4.438285      266
            ...  
3.822297        1
3.439644        1
3.708630        1
4.155950        1
4.561010        1
Name: posy_n, Length: 2511396, dtype: int64


In [31]:
distinct_values = df['posz_n'].unique()

print(f"Distinct values of 'posz_n':")
print(distinct_values)

print(f"Amount of distinct values in 'posz_n':")
print(len(distinct_values))

Distinct values of 'posz_n':
[0.]
Amount of distinct values in 'posz_n':
1


In [19]:
min_value = df['posz_n'].min()
max_value = df['posz_n'].max()

print("Minimum value of 'posz_n':", min_value)
print("Maximum value of 'posz_n':", max_value)

Minimum value of 'posz_n': 0.0
Maximum value of 'posz_n': 0.0


In [25]:
most_frequent_values = df['posz_n'].value_counts()

print("Most frequent values of 'posz_n' and their occurrences:")
print(most_frequent_values)

Most frequent values of 'posz_n' and their occurrences:
0.0    3194808
Name: posz_n, dtype: int64
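
The outputs so far show that 'type', 'posz', and 'posz_n' each hold a single value across all records. A sketch of how such constant columns could be detected in one pass follows; the `toy` frame and the name `constant_columns` are assumptions for illustration, and whether to drop these columns is a modeling decision the source does not make.

```python
import pandas as pd

# Toy stand-in for the real dataset (illustrative values).
toy = pd.DataFrame({
    "type": [4, 4, 4],
    "posx": [266.98, 266.83, 266.42],
    "posz": [0.0, 0.0, 0.0],
    "posz_n": [0.0, 0.0, 0.0],
})

# nunique() per column; a count of 1 means the column is constant.
distinct_counts = toy.nunique()
constant_columns = distinct_counts[distinct_counts == 1].index.tolist()

print("Constant columns:", constant_columns)
```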


<i>Some more complicated analyses follow below: </i>
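
One candidate for such an analysis: 'posx' and 'posy' both report 0.0 exactly 23420 times, and 'posx_n' and 'posy_n' both report 0.0 exactly 23467 times. Matching counts suggest, but do not prove, that the zeros occur in the same rows. The sketch below shows how this could be checked; the `toy` frame is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in (illustrative values): two rows with both coordinates zero.
toy = pd.DataFrame({
    "posx": [0.0, 266.98, 0.0, 1122.76],
    "posy": [0.0, 32.34, 0.0, 296.07],
})

# Flags for rows where each coordinate is exactly zero.
x_zero = toy["posx"] == 0
y_zero = toy["posy"] == 0

# Cross-tabulate the two flags; off-diagonal cells count rows where only one is zero.
zero_table = pd.crosstab(x_zero, y_zero)
both_zero = int((x_zero & y_zero).sum())
only_one_zero = int((x_zero ^ y_zero).sum())

print(zero_table)
print("Both zero:", both_zero, "| only one zero:", only_one_zero)
```

If `only_one_zero` came out as 0 on the real data, the zero positions would always appear together in the same records.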

<a class="anchor" id="6-bullet"><h3><b>Section 6</b>: Analysis - Speed Attributes</h3></a> 
<br>
[Back to Menu](#toc)

<a class="anchor" id="7-bullet"><h3><b>Section 7</b>: Analysis - Acceleration Attributes</h3></a> 
<br>
[Back to Menu](#toc)

<a class="anchor" id="8-bullet"><h3><b>Section 8</b>: Analysis - Heading Position Attributes</h3></a> 
<br>
[Back to Menu](#toc)

<i>End of this section</i>