# Classification based on the First Original Featuring Result

<p><b>Author</b>: Jingze Dai</p>
<p><b>McMaster University</b>, Honors Computer Science (Coop) student</p>
<p><b>Campus Email Address</b>: daij24@mcmaster.ca</p>
<p><b>Personal Email Address</b>: david1147062956@gmail.com</p>
<a href="https://github.com/daijingz">GitHub Homepage</a>
<a href="https://www.linkedin.com/in/jingze-dai/">LinkedIn Webpage</a>

<i>The first feature-selection method in the original research chose a distinct set of features using a Recursive Feature Elimination (RFE) implementation. This notebook classifies misbehavior using those features.</i>

<i>Your feedback is important for Jingze's further development. If you would like to give feedback or suggestions, or to work and learn together, please email Jingze at daij24@mcmaster.ca. If you would like Jingze to contribute to your research or open-source project, or to help with a programming issue, please email Jingze at david1147062956@gmail.com. Thank you for your help.</i>

## Table of Contents:
* [Section 1: Selected Features](#bullet1)
* [Section 2: Extract and Load Datasets](#bullet2)
* [Section 3: Classification (Binary Classification Approach (BCA))](#bullet3)
* [Section 4: Classification (A Multi-class Classification Approach for Three Classes (MCATC))](#bullet4)
* [Section 5: Classification (A Classic Learning Approach for Multi-class classification (C-LAMC))](#bullet5)

### <a class="anchor" id="bullet1"><p><b>Section 1</b>: Selected Features</p></a>

There are 24 selected features: 'posx', 'posy', 'posz', 'posx_n', 'posy_n', 'posz_n', 'spdx', 'spdy', 'spdz', 'spdx_n', 'spdy_n', 'spdz_n', 'aclx', 'acly', 'aclz', 'aclx_n', 'acly_n', 'aclz_n',
'hedx', 'hedy', 'hedz', 'hedx_n', 'hedy_n', 'hedz_n'
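
The 24 names follow a regular pattern: four quantity prefixes (`pos`, `spd`, `acl`, `hed`), three axes, and a plain and an `_n` variant of each. As a minimal sketch, the list can be generated instead of typed by hand. The helper name `build_feature_names` is hypothetical and is not part of the original notebook.

```python
def build_feature_names():
    """Generate the 24 selected feature names in the order listed above."""
    quantities = ["pos", "spd", "acl", "hed"]  # prefixes seen in the feature list
    suffixes = ["", "_n"]                      # plain column and its "_n" variant
    axes = ["x", "y", "z"]
    # For each quantity: the three plain axes first, then the three "_n" axes.
    return [q + a + s for q in quantities for s in suffixes for a in axes]


features = build_feature_names()
print(len(features))
print(features[:6])
```

Generating the list this way keeps the three classification cells from drifting apart if the feature set is ever edited in only one place.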

### <a class="anchor" id="bullet2"><p><b>Section 2</b>: Extract and Load Datasets</p></a>

The first step is to install all necessary packages and libraries.

In [1]:
pip install gdown

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install tensorflow

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install --upgrade tensorflow --user

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install --user numpy==1.24.4

Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install --upgrade scipy

Note: you may need to restart the kernel to use updated packages.


<p>There are two ways to obtain the dataset; choose one of them: </p>
<p><b>Method 1</b>: Using gdown (this sometimes fails with errors)</p>

<p>Here we download the VANETs dataset CSV file from Google Drive and save it locally (to the Downloads folder in the run shown below). </p>
The <b>correct</b> dataset name is "mixalldata_clean.csv".

In [6]:
import pandas as pd
import gdown

# Google Drive file ID of the dataset (replace it to download a different file)
file_id = '1mbQUfSEe2EU2sh40Q1Q0KiZD-k7vRuU9'

# Construct the download link
file_url = f'https://drive.google.com/uc?id={file_id}'

# Use gdown to download the file
output_file = 'mixalldata_clean.csv'
gdown.download(file_url, output_file, quiet=False)

Downloading...
From (original): https://drive.google.com/uc?id=1mbQUfSEe2EU2sh40Q1Q0KiZD-k7vRuU9
From (redirected): https://drive.google.com/uc?id=1mbQUfSEe2EU2sh40Q1Q0KiZD-k7vRuU9&confirm=t&uuid=f45a3f3a-d566-4053-ae21-53a8ace3dc1a
To: C:\Users\david\Downloads\mixalldata_clean.csv
100%|█████████████████████████████████████████████████████████████████████████████| 1.21G/1.21G [00:25<00:00, 48.1MB/s]


'mixalldata_clean.csv'

<p><b>Method 2</b>: Downloading directly from the source</p>
First, go to the webpage <a href="https://data.mendeley.com/datasets/k62n4z9gdz/1">Dataset for Misbehaviors in VANETs</a>.
Then click the "Download All 314 MB" button and decompress the downloaded archive.

<b>Expected Outcome</b>
<p>After downloading the dataset, the program below prints its first 5 records as a quick check.</p>

In [7]:
import pandas as pd

# Load the dataset
output_file = 'mixalldata_clean.csv'
df = pd.read_csv(output_file)

# Display the DataFrame
print(df.head())

   type      sendTime  sender  senderPseudo  messageID  class        posx  \
0     4  72002.302942  130137     101301377  422013806      0  266.982401   
1     4  72003.302942  130137     101301377  422023410      0  266.827208   
2     4  72004.302942  130137     101301377  422032081      0  266.420297   
3     4  72005.302942  130137     101301377  422040712      0  268.912026   
4     4  72006.302942  130137     101301377  422052949      0  268.242276   

        posy  posz    posx_n  ...  aclz    aclx_n    acly_n  aclz_n      hedx  \
0  32.336955   0.0  3.480882  ...   0.0  0.000862  0.000862     0.0 -0.102790   
1  34.624145   0.0  3.546261  ...   0.0  0.000107  0.001040     0.0 -0.099856   
2  38.836461   0.0  3.544045  ...   0.0  0.000172  0.001661     0.0 -0.099856   
3  45.414229   0.0  3.340080  ...   0.0  0.000171  0.001654     0.0 -0.100172   
4  53.729986   0.0  3.328872  ...   0.0  0.000193  0.001852     0.0 -0.097105   

       hedy  hedz     hedx_n     hedy_n  hedz_n  


<b>Important: before moving on to later sections, please run all of this section's cells, in order, to prevent possible errors.</b>
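
Because the CSV file is large, it can be useful to confirm that the expected columns exist before loading everything. The sketch below reads only the header row with the standard `csv` module. The helper `missing_columns` and the small demo file are hypothetical and only illustrate the idea.

```python
import csv
import os
import tempfile


def missing_columns(csv_path, required):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle))  # read the first row only
    present = set(header)
    return [name for name in required if name not in present]


# Demo on a tiny throwaway file (stand-in for the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["type", "class", "posx", "posy"])
        writer.writerow([4, 0, 266.98, 32.34])
    demo_missing = missing_columns(demo_path, ["class", "posx", "posz"])

print(demo_missing)  # ['posz']
```

On the real file, calling `missing_columns('mixalldata_clean.csv', features + ['class'])` with the 24 selected feature names should return an empty list if the download is the expected dataset.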

### <a class="anchor" id="bullet3"><p><b>Section 3</b>: Classification (Binary Classification Approach (BCA))</p></a>

<p>First, we divide the dataset into training data and testing data. This is done within each algorithm's implementation. </p>
<p>By default, 80% of the data is used for training and the remaining 20% for testing. However, to improve accuracy, the training-testing ratios of some models are customized.</p>
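
The 80/20 split can be sketched on a small synthetic dataset as follows. This is an illustration under stated assumptions, not the notebook's own procedure: the data is random, and `stratify` and `make_pipeline` are additions that the original cells do not use. The pipeline fits the scaler on the training rows only, so no statistics from the test rows reach the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 1000 rows, 4 features, binary target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# 80% training, 20% testing; stratify keeps the class ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler inside the pipeline is fitted on X_train only.
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
model.fit(X_train, y_train)

demo_accuracy = model.score(X_test, y_test)
print(X_train.shape, X_test.shape)
print("Held-out accuracy:", demo_accuracy)
```

A different ratio, such as the customized ones mentioned above, only requires changing `test_size`.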

In [8]:
from sklearn.metrics import precision_score, recall_score, f1_score
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

features = ['posx', 'posy', 'posz', 'posx_n', 'posy_n', 'posz_n', 'spdx', 
            'spdy', 'spdz', 'spdx_n', 'spdy_n', 'spdz_n', 'aclx', 'acly','aclz', 
            'aclx_n', 'acly_n', 'aclz_n', 'hedx', 'hedy', 'hedz', 'hedx_n', 'hedy_n', 'hedz_n']
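# Feature matrix and binary target (1 if 'class' is non-zero, 0 otherwise)
X = df[features]
Y = (df['class'] != 0).astype(int)

# Standardize every feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Hold out 20% of the rows for testing
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=42)

# Fit logistic regression on the training split
lr = LogisticRegression(random_state=42)

lr.fit(X_train, Y_train)

# Predict on the held-out test split
Y_pred = lr.predict(X_test)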

accuracy = accuracy_score(Y_test, Y_pred)
precision = precision_score(Y_test, Y_pred)
recall = recall_score(Y_test, Y_pred)
f1 = f1_score(Y_test, Y_pred)

print("Logistic Regression Accuracy:", accuracy)
print("Logistic Regression Precision:", precision)
print("Logistic Regression Recall:", recall)
print("Logistic Regression F1-score:", f1)

Logistic Regression Accuracy: 0.7163900200637909
Logistic Regression Precision: 0.8919502374664886
Logistic Regression Recall: 0.3426010559727252
Logistic Regression F1-score: 0.49505127061970583
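
The F1-score is the harmonic mean of precision and recall, which is why it sits much closer to the low recall than to the high precision. As a quick check, it can be recomputed from the two printed values:

```python
# Values copied from the logistic regression output above.
reported_precision = 0.8919502374664886
reported_recall = 0.3426010559727252

# F1 = 2 * P * R / (P + R), the harmonic mean of precision and recall.
f1_check = 2 * reported_precision * reported_recall / (
    reported_precision + reported_recall
)
print(f1_check)
```

The result is approximately 0.495, in agreement with the reported F1-score. A recall of about 0.34 means that roughly two thirds of the samples labeled 1 in the test split were predicted as 0.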


In [10]:
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)

features = ['posx', 'posy', 'posz', 'posx_n', 'posy_n', 'posz_n', 'spdx', 
            'spdy', 'spdz', 'spdx_n', 'spdy_n', 'spdz_n', 'aclx', 'acly','aclz', 
            'aclx_n', 'acly_n', 'aclz_n', 'hedx', 'hedy', 'hedz', 'hedx_n', 'hedy_n', 'hedz_n']
X = df[features]
Y = (df['class'] != 0).astype(int)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Try every neighbor count from 1 to 10 and report the metrics for each
print("*********")
for n_neighbor_amount in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=n_neighbor_amount)
    knn.fit(X_train, Y_train)
    Y_pred = knn.predict(X_test)
    accuracy = accuracy_score(Y_test, Y_pred)
    print("KNN Accuracy when n_neighbors =", n_neighbor_amount, ":", accuracy)
    precision = precision_score(Y_test, Y_pred)
    recall = recall_score(Y_test, Y_pred)
    f1 = f1_score(Y_test, Y_pred)

    print("KNN Precision:", precision)
    print("KNN Recall:", recall)
    print("KNN F1-score:", f1)
    print("*********")

*********


KeyboardInterrupt: 
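
The run above was interrupted before any result was printed. One possible way to get a quicker first estimate, offered here as an assumption and not as the author's method, is to evaluate KNN on a random subsample of the rows. The sketch uses synthetic data, and the value of `sample_fraction` and the tested neighbor counts are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data (assumption): 24 features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=24, n_informative=10, random_state=42
)

# Keep only a fraction of the rows, preserving the class ratio.
sample_fraction = 0.25  # hypothetical value
X_small, _, y_small, _ = train_test_split(
    X_demo, y_demo, train_size=sample_fraction, random_state=42, stratify=y_demo
)

# Usual 80/20 split on the reduced data.
X_train, X_test, y_train, y_test = train_test_split(
    X_small, y_small, test_size=0.2, random_state=42, stratify=y_small
)

# Evaluate a few neighbor counts on the reduced data.
knn_scores = {}
for k in (1, 5, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    knn_scores[k] = accuracy_score(y_test, knn.predict(X_test))

print(knn_scores)
```

With the real DataFrame, a comparable subsample could be drawn with `df.sample(frac=..., random_state=42)` before building `X` and `Y`. Scores measured on a subsample are only a rough guide to what the full run would report.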

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

features = ['posx', 'posy', 'posz', 'posx_n', 'posy_n', 'posz_n', 'spdx', 
            'spdy', 'spdz', 'spdx_n', 'spdy_n', 'spdz_n', 'aclx', 'acly','aclz', 
            'aclx_n', 'acly_n', 'aclz_n', 'hedx', 'hedy', 'hedz', 'hedx_n', 'hedy_n', 'hedz_n']
X = df[features]
Y = (df['class'] != 0).astype(int)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=20, random_state=42)

rf.fit(X_train, Y_train)

Y_pred = rf.predict(X_test)

accuracy = accuracy_score(Y_test, Y_pred)
precision = precision_score(Y_test, Y_pred)
recall = recall_score(Y_test, Y_pred)
f1 = f1_score(Y_test, Y_pred)

print("Random Forest Accuracy:", accuracy)
print("Random Forest Precision:", precision)
print("Random Forest Recall:", recall)
print("Random Forest F1-score:", f1)