# Amazon reviews for cell phones and accessories dataset

## Introduction

* This study focuses on evaluating deep learning models for sentiment analysis using a dataset of 194,439 Amazon reviews related to cell phones and accessories. The goal is to predict review ratings through multi-class classification, alongside an exploration of traditional machine learning techniques using linguistic features like TF-IDF and various classifiers for performance comparison.

## Importing required libraries


In [1]:
import os
import pandas as pd

In [2]:
import numpy as np
import nltk
import json
import re
import multiprocessing as mp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

In [3]:
import warnings
# Ignore warnings
warnings.filterwarnings("ignore")

## Reading the dataset

In [4]:
os.chdir('D:/vu/academics/terms/term_12/AI_Decision_Sciences-2/midterm/dataset')

In [5]:
df = pd.read_json('Cell_Phones_and_Accessories_5.json', lines=True)

In [6]:
df.head()

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A30TL5EWN6DFXT | 120401325X | christina | [0, 0] | They look good and stick good! I just don't li... | 4 | Looks Good | 1400630400 | 05 21, 2014 |
| 1 | ASY55RVNIL0UD | 120401325X | emily l. | [0, 0] | These stickers work like the review says they ... | 5 | Really great product. | 1389657600 | 01 14, 2014 |
| 2 | A2TMXE2AFO7ONB | 120401325X | Erica | [0, 0] | These are awesome and make my phone look so st... | 5 | LOVE LOVE LOVE | 1403740800 | 06 26, 2014 |
| 3 | AWJ0WZQYMYFQ4 | 120401325X | JM | [4, 4] | Item arrived in great time and was in perfect ... | 4 | Cute! | 1382313600 | 10 21, 2013 |
| 4 | ATX7CZYFXI1KW | 120401325X | patrice m rogoza | [2, 3] | awesome! stays on, and looks great. can be use... | 5 | leopard home button sticker for iphone 4s | 1359849600 | 02 3, 2013 |


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194439 entries, 0 to 194438
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   reviewerID      194439 non-null  object
 1   asin            194439 non-null  object
 2   reviewerName    190920 non-null  object
 3   helpful         194439 non-null  object
 4   reviewText      194439 non-null  object
 5   overall         194439 non-null  int64 
 6   summary         194439 non-null  object
 7   unixReviewTime  194439 non-null  int64 
 8   reviewTime      194439 non-null  object
dtypes: int64(2), object(7)
memory usage: 13.4+ MB


In [8]:
# Extracting the 'reviewText' and 'overall' columns 
df = df[['reviewText', 'overall']]

## Sampling

#### Sampling the dataset to make it manageable

In [9]:
# The percentage of data to sample
sample_percentage = 0.05

# Perform simple random sampling
sample_data = df.sample(frac=sample_percentage, random_state=42)

## Preprocessing 

#### Cleaning the text column ('reviewText') by:
1. Converting text to lowercase
2. Removing punctuation and numbers
3. Tokenizing
4. Removing stop words
5. Lemmatizing
6. Stemming

In [10]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
import re

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to C:\Users\Dhwani
[nltk_data]     Bhandari\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Dhwani
[nltk_data]     Bhandari\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Dhwani
[nltk_data]     Bhandari\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [11]:
# Define stopwords
stop_words = set(stopwords.words('english'))

# Define stemmer
stemmer = PorterStemmer()

# Define lemmatizer
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove punctuations and numbers
    text = re.sub('[^a-zA-Z]', ' ', text)

    # Tokenize and remove stop words
    tokenized_text = [w for w in word_tokenize(text) if w not in stop_words]
    text = ' '.join(tokenized_text)

    # Lemmatize each token first, then stem the lemma
    stemmed_lemmatized_text = [stemmer.stem(lemmatizer.lemmatize(w)) for w in word_tokenize(text)]
    text = ' '.join(stemmed_lemmatized_text)

    return text

In [12]:
sample_data['reviewText'] = sample_data['reviewText'].apply(preprocess_text)
print(sample_data)

                                               reviewText  overall
156187  ibolt xprodock activ car dock holder mount sam...        5
102252  pouch everyth look otter box commut case aroun...        5
23146   first case iphon previous one free one give al...        3
86461   order case case htc inspir case last year abso...        5
62407   bought gift big hit love choic color made devi...        5
...                                                   ...      ...
148123  use charg note problem charg rel fast use usb ...        5
86480   small enough purs great charg cell go requir m...        5
65159   bought lg optimu slider prior lg rumor succumb...        4
57656   receiv case week earli fit amaz nice tight pho...        5
162194  first one bubbl adequ immedi took tri anoth on...        4

[9722 rows x 2 columns]


## Splitting the dataset

#### Split: the first 70% of the sampled data for training, the next 10% for validation, and the remaining 20% for testing

In [13]:
# Split the data
train_size = int(0.7 * len(sample_data))
val_size = int(0.1 * len(sample_data))

train_data = sample_data[:train_size]
val_data = sample_data[train_size : train_size+val_size]
test_data = sample_data[train_size+val_size:]

# Extract the 'reviewText' and 'overall' fields
X_train, y_train = train_data['reviewText'], train_data['overall']
X_val, y_val = val_data['reviewText'], val_data['overall']
X_test, y_test = test_data['reviewText'], test_data['overall']

In [14]:
(X_train.shape), (y_train.shape)

((6805,), (6805,))

In [15]:
(X_val.shape), (y_val.shape)

((972,), (972,))

In [16]:
(X_test.shape), (y_test.shape)

((1945,), (1945,))

## Converting into TFIDF vector

TF-IDF (Term Frequency-Inverse Document Frequency) vector is a numerical representation of a document in natural language processing. It captures the importance of each word by combining its frequency in the document (TF) with its rarity across the entire corpus (IDF), allowing it to highlight words that are both frequent in the document and rare in the corpus.
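
To make the weighting concrete, here is a minimal plain-Python sketch of the textbook formula (term frequency times the log of inverse document frequency) on a hypothetical three-review corpus. It is an illustration only: the toy reviews and the `tf_idf` helper are assumptions, and scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already tokenized like the output of preprocess_text
docs = [
    "great case great fit".split(),
    "case broke fast".split(),
    "great charger".split(),
]

def tf_idf(docs):
    """Textbook TF-IDF: tf = count / doc length, idf = ln(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(" ".join(doc), "->", {t: round(v, 3) for t, v in w.items()})
```

In the first toy review, "fit" appears once but in no other review, so it receives a higher weight than "case", which also appears once but is shared with another review.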

In [17]:
# Extract TFIDF features
vectorizer = TfidfVectorizer(max_features=10000, max_df=0.95)
X_train = vectorizer.fit_transform(X_train)
X_val = vectorizer.transform(X_val)
X_test = vectorizer.transform(X_test)

## Label Encoding

In [18]:
# create a labelencoder object
le = LabelEncoder()

# fit and transform on the data
y_train = le.fit_transform(y_train)
y_val = le.transform(y_val)
y_test = le.transform(y_test)
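
As a rough illustration of what the encoding step does, the plain-Python sketch below mimics the mapping `LabelEncoder` produces: the distinct labels are sorted and numbered from 0. The short `ratings` list is hypothetical. With all five star ratings present, ratings 1 to 5 become classes 0 to 4, which is why the classification reports later in the notebook list classes 0 through 4.

```python
# Hypothetical handful of star ratings; the real y_train holds thousands
ratings = [5, 3, 5, 1, 4, 2, 5]

# Sort the distinct labels and number them from 0, as LabelEncoder does
classes = sorted(set(ratings))
to_index = {label: i for i, label in enumerate(classes)}

encoded = [to_index[r] for r in ratings]   # star rating -> class index
decoded = [classes[i] for i in encoded]    # class index -> star rating

print(classes)
print(encoded)
```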

## Baseline Models

#### Decision tree

In [19]:
clf1 = DecisionTreeClassifier()
clf1.fit(X_train, y_train)
preds = clf1.predict(X_test)
print("Decision tree accuracy: ", accuracy_score(y_test, preds))

Decision tree accuracy:  0.49562982005141387


#### Logistic regression

In [20]:
clf3 = LogisticRegression()
clf3.fit(X_train, y_train)
preds = clf3.predict(X_test)
print("Logistic regression accuracy: ", accuracy_score(y_test, preds))

Logistic regression accuracy:  0.5933161953727506


#### XGBoost

In [21]:
clf4 = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
clf4.fit(X_train, y_train)
preds = clf4.predict(X_test)
print("XGBoost accuracy: ", accuracy_score(y_test, preds))

XGBoost accuracy:  0.5953727506426735


#### Random forest

In [22]:
clf5 = RandomForestClassifier()
clf5.fit(X_train, y_train)
preds = clf5.predict(X_test)
print("Random forest accuracy: ", accuracy_score(y_test, preds))

Random forest accuracy:  0.5670951156812339


#### SVM with linear kernel

In [23]:
clf2 = SVC(kernel='linear')
clf2.fit(X_train, y_train)
preds = clf2.predict(X_test)
print("SVM (linear kernel) accuracy: ", accuracy_score(y_test, preds))

SVM (linear kernel) accuracy:  0.5994858611825192


#### Of all the baseline models, logistic regression, SVM, and XGBoost perform best, each at around 59-60% accuracy

## Applying Grid Search on models

In [24]:
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV

### SVM Linear

In [25]:
svm_linear_classifier = SVC(kernel='linear')
svm_linear_param_grid = {'C': [0.1, 1]}
svm_linear_grid_search = GridSearchCV(svm_linear_classifier, svm_linear_param_grid, cv=3, n_jobs=-1)
svm_linear_grid_search.fit(X_train, y_train)

In [26]:
best_svm_linear = svm_linear_grid_search.best_estimator_
best_svm_linear

In [27]:
svm_linear_train_acc = accuracy_score(y_train, best_svm_linear.predict(X_train))
svm_linear_val_acc = accuracy_score(y_val, best_svm_linear.predict(X_val))
svm_linear_test_acc = accuracy_score(y_test, best_svm_linear.predict(X_test))

svm_linear_report = classification_report(y_test, best_svm_linear.predict(X_test))

print("SVM Linear Results (using best model):")
print("Train Accuracy:", svm_linear_train_acc)
print("Validation Accuracy:", svm_linear_val_acc)
print("Test Accuracy:", svm_linear_test_acc)
print("Classification Report:\n", svm_linear_report)

SVM Linear Results (using best model):
Train Accuracy: 0.7641440117560617
Validation Accuracy: 0.6224279835390947
Test Accuracy: 0.5994858611825192
Classification Report:
               precision    recall  f1-score   support

           0       0.66      0.37      0.47       144
           1       0.27      0.05      0.08       120
           2       0.27      0.16      0.20       181
           3       0.40      0.18      0.25       422
           4       0.65      0.93      0.76      1078

    accuracy                           0.60      1945
   macro avg       0.45      0.34      0.36      1945
weighted avg       0.54      0.60      0.54      1945



#### The SVM model with a linear kernel gives an accuracy of around 60% on the test set

### SVM RBF

In [28]:
svm_rbf_classifier = SVC(kernel='rbf')
svm_rbf_param_grid = {'C': [0.1, 1], 'gamma': ['scale', 'auto']}
svm_rbf_grid_search = GridSearchCV(svm_rbf_classifier, svm_rbf_param_grid, cv=3, n_jobs=-1)
svm_rbf_grid_search.fit(X_train, y_train)

In [29]:
best_svm_rbf = svm_rbf_grid_search.best_estimator_
best_svm_rbf

In [30]:
svm_rbf_train_acc = accuracy_score(y_train, best_svm_rbf.predict(X_train))
svm_rbf_val_acc = accuracy_score(y_val, best_svm_rbf.predict(X_val))
svm_rbf_test_acc = accuracy_score(y_test, best_svm_rbf.predict(X_test))

svm_rbf_report = classification_report(y_test, best_svm_rbf.predict(X_test))

print("SVM RBF Results (using best model):")
print("Train Accuracy:", svm_rbf_train_acc)
print("Validation Accuracy:", svm_rbf_val_acc)
print("Test Accuracy:", svm_rbf_test_acc)
print("Classification Report:\n", svm_rbf_report)

SVM RBF Results (using best model):
Train Accuracy: 0.8902277736958119
Validation Accuracy: 0.5925925925925926
Test Accuracy: 0.579948586118252
Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.16      0.26       144
           1       0.00      0.00      0.00       120
           2       0.47      0.09      0.15       181
           3       0.38      0.06      0.11       422
           4       0.59      0.99      0.74      1078

    accuracy                           0.58      1945
   macro avg       0.44      0.26      0.25      1945
weighted avg       0.51      0.58      0.46      1945



#### The SVM model with rbf kernel gives an accuracy of around 58% on the test set

### Random Forest

In [31]:
rf_classifier = RandomForestClassifier()
rf_param_grid = {'n_estimators': [100, 300],
                 'max_depth': [10, 20]}
rf_grid_search = GridSearchCV(rf_classifier, rf_param_grid, cv=3, n_jobs=-1)
rf_grid_search.fit(X_train, y_train)

In [32]:
best_rf = rf_grid_search.best_estimator_
best_rf

In [33]:
rf_train_acc = accuracy_score(y_train, best_rf.predict(X_train))
rf_val_acc = accuracy_score(y_val, best_rf.predict(X_val))
rf_test_acc = accuracy_score(y_test, best_rf.predict(X_test))

rf_report = classification_report(y_test, best_rf.predict(X_test))

print("Random Forest Results (using best model):")
print("Train Accuracy:", rf_train_acc)
print("Validation Accuracy:", rf_val_acc)
print("Test Accuracy:", rf_test_acc)
print("Classification Report:\n", rf_report)

Random Forest Results (using best model):
Train Accuracy: 0.5651726671565026
Validation Accuracy: 0.5648148148148148
Test Accuracy: 0.5542416452442159
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00       144
           1       0.00      0.00      0.00       120
           2       0.00      0.00      0.00       181
           3       0.00      0.00      0.00       422
           4       0.55      1.00      0.71      1078

    accuracy                           0.55      1945
   macro avg       0.11      0.20      0.14      1945
weighted avg       0.31      0.55      0.40      1945



#### The Random Forest model gives an accuracy of around 55% on the test set, but only by predicting class 4 (5-star reviews) for every test review
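
As a sanity check on that figure, the short sketch below computes the accuracy of always predicting the most frequent class, using the per-class support values copied from the classification report above. The variable names are illustrative.

```python
# Per-class test support copied from the classification report above
# (classes 0-4 are the label-encoded star ratings)
support = {0: 144, 1: 120, 2: 181, 3: 422, 4: 1078}

majority_class = max(support, key=support.get)
total = sum(support.values())

# Accuracy obtained by predicting the majority class for every review
majority_accuracy = support[majority_class] / total

print(majority_class, total, round(majority_accuracy, 4))
```

The result, about 0.5542, equals the Random Forest test accuracy reported above, so any model in this notebook should be judged against roughly 55% rather than against chance.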

### Decision Tree

In [34]:
dt_classifier = DecisionTreeClassifier()
dt_param_grid = {'criterion': ['gini', 'entropy']}
dt_grid_search = GridSearchCV(dt_classifier, dt_param_grid, cv=3, n_jobs=-1)
dt_grid_search.fit(X_train, y_train)

In [35]:
best_dt = dt_grid_search.best_estimator_
best_dt

In [36]:
dt_train_acc = accuracy_score(y_train, best_dt.predict(X_train))
dt_val_acc = accuracy_score(y_val, best_dt.predict(X_val))
dt_test_acc = accuracy_score(y_test, best_dt.predict(X_test))

dt_report = classification_report(y_test, best_dt.predict(X_test))

print("Decision Tree Results (using best model):")
print("Train Accuracy:", dt_train_acc)
print("Validation Accuracy:", dt_val_acc)
print("Test Accuracy:", dt_test_acc)
print("Classification Report:\n", dt_report)

Decision Tree Results (using best model):
Train Accuracy: 0.9995591476855253
Validation Accuracy: 0.4506172839506173
Test Accuracy: 0.48277634961439586
Classification Report:
               precision    recall  f1-score   support

           0       0.22      0.19      0.21       144
           1       0.11      0.09      0.10       120
           2       0.18      0.17      0.17       181
           3       0.28      0.24      0.26       422
           4       0.65      0.71      0.68      1078

    accuracy                           0.48      1945
   macro avg       0.29      0.28      0.28      1945
weighted avg       0.46      0.48      0.47      1945



#### The Decision Tree model gives an accuracy of around 48% on the test set

### Logistic Regression

In [37]:
lr_classifier = LogisticRegression()
lr_param_grid = {'C': [0.1, 1, 10]}
lr_grid_search = GridSearchCV(lr_classifier, lr_param_grid, cv=3, n_jobs=-1)
lr_grid_search.fit(X_train, y_train)

In [38]:
best_lr = lr_grid_search.best_estimator_
best_lr

In [39]:
lr_train_acc = accuracy_score(y_train, best_lr.predict(X_train))
lr_val_acc = accuracy_score(y_val, best_lr.predict(X_val))
lr_test_acc = accuracy_score(y_test, best_lr.predict(X_test))

lr_report = classification_report(y_test, best_lr.predict(X_test))

print("Logistic Regression Results (using best model):")
print("Train Accuracy:", lr_train_acc)
print("Validation Accuracy:", lr_val_acc)
print("Test Accuracy:", lr_test_acc)
print("Classification Report:\n", lr_report)


Logistic Regression Results (using best model):
Train Accuracy: 0.7203526818515797
Validation Accuracy: 0.5997942386831275
Test Accuracy: 0.5933161953727506
Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.30      0.42       144
           1       0.33      0.02      0.03       120
           2       0.33      0.15      0.21       181
           3       0.35      0.17      0.23       422
           4       0.64      0.94      0.76      1078

    accuracy                           0.59      1945
   macro avg       0.48      0.32      0.33      1945
weighted avg       0.53      0.59      0.52      1945



#### The logistic regression model gives an accuracy of around 59% on the test set

### XGBoost

In [40]:
xgb_classifier = XGBClassifier()
xgb_param_grid = {'max_depth': [3, 5, 7],
                  'learning_rate': [0.1, 0.01]}
xgb_grid_search = GridSearchCV(xgb_classifier, xgb_param_grid, cv=3, n_jobs=-1)
xgb_grid_search.fit(X_train, y_train)

In [41]:
best_xgb = xgb_grid_search.best_estimator_
best_xgb

In [42]:
xgb_train_acc = accuracy_score(y_train, best_xgb.predict(X_train))
xgb_val_acc = accuracy_score(y_val, best_xgb.predict(X_val))
xgb_test_acc = accuracy_score(y_test, best_xgb.predict(X_test))

xgb_report = classification_report(y_test, best_xgb.predict(X_test))

print("XGBoost Results (using best model):")
print("Train Accuracy:", xgb_train_acc)
print("Validation Accuracy:", xgb_val_acc)
print("Test Accuracy:", xgb_test_acc)
print("Classification Report:\n", xgb_report)


XGBoost Results (using best model):
Train Accuracy: 0.8085231447465099
Validation Accuracy: 0.5843621399176955
Test Accuracy: 0.5897172236503856
Classification Report:
               precision    recall  f1-score   support

           0       0.61      0.17      0.27       144
           1       0.44      0.06      0.10       120
           2       0.34      0.12      0.17       181
           3       0.38      0.17      0.24       422
           4       0.62      0.95      0.75      1078

    accuracy                           0.59      1945
   macro avg       0.48      0.29      0.31      1945
weighted avg       0.53      0.59      0.51      1945



#### The XGBoost model gives an accuracy of around 59% on the test set

### Taking a larger sample of the data to re-evaluate the best-performing models

In [43]:
# Define the percentage of data to sample
sample_percentage = 0.25  # Adjust this as needed

# Perform simple random sampling
sample_data2 = df.sample(frac=sample_percentage, random_state=42)

In [44]:
sample_data2['reviewText'] = sample_data2['reviewText'].apply(preprocess_text)
print(sample_data2)

                                               reviewText  overall
156187  ibolt xprodock activ car dock holder mount sam...        5
102252  pouch everyth look otter box commut case aroun...        5
23146   first case iphon previous one free one give al...        3
86461   order case case htc inspir case last year abso...        5
62407   bought gift big hit love choic color made devi...        5
...                                                   ...      ...
64845   le month use phone experienc text issu essenti...        1
66485   honestli expect someth le qualiti though wrong...        5
132440  bought someth quick cheap poetic case design g...        5
77563   great product batteri last longer stock batter...        5
70510            bought gift seen want thing get snag end        4

[48610 rows x 2 columns]


#### Splitting the dataset

In [45]:
# Split the data
train_size = int(0.7 * len(sample_data2))
val_size = int(0.1 * len(sample_data2))

train_data = sample_data2[:train_size]
val_data = sample_data2[train_size : train_size+val_size]
test_data = sample_data2[train_size+val_size:]

# Extract the 'reviewText' and 'overall' fields
X_train, y_train = train_data['reviewText'], train_data['overall']
X_val, y_val = val_data['reviewText'], val_data['overall']
X_test, y_test = test_data['reviewText'], test_data['overall']

#### Converting into TFIDF vector

In [46]:
# Extract TFIDF features
vectorizer = TfidfVectorizer(max_features=10000, max_df=0.95)
X_train = vectorizer.fit_transform(X_train)
X_val = vectorizer.transform(X_val)
X_test = vectorizer.transform(X_test)

#### Label Encoding

In [47]:
# create a labelencoder object
le = LabelEncoder()

# fit and transform on the data
y_train = le.fit_transform(y_train)
y_val = le.transform(y_val)
y_test = le.transform(y_test)

#### Logistic Regression

In [48]:
lr_classifier = LogisticRegression()
lr_param_grid = {'C': [0.1, 1, 10]}
lr_grid_search = GridSearchCV(lr_classifier, lr_param_grid, cv=3, n_jobs=-1)
lr_grid_search.fit(X_train, y_train)

In [49]:
best_lr = lr_grid_search.best_estimator_
best_lr

In [50]:
lr_train_acc = accuracy_score(y_train, best_lr.predict(X_train))
lr_val_acc = accuracy_score(y_val, best_lr.predict(X_val))
lr_test_acc = accuracy_score(y_test, best_lr.predict(X_test))

lr_report = classification_report(y_test, best_lr.predict(X_test))

print("Logistic Regression Results (using best model):")
print("Train Accuracy:", lr_train_acc)
print("Validation Accuracy:", lr_val_acc)
print("Test Accuracy:", lr_test_acc)
print("Classification Report:\n", lr_report)


Logistic Regression Results (using best model):
Train Accuracy: 0.6982690216592706
Validation Accuracy: 0.6257971610779675
Test Accuracy: 0.6232256737296853
Classification Report:
               precision    recall  f1-score   support

           0       0.57      0.44      0.49       672
           1       0.27      0.07      0.12       560
           2       0.38      0.23      0.28      1058
           3       0.43      0.27      0.33      2023
           4       0.69      0.91      0.79      5409

    accuracy                           0.62      9722
   macro avg       0.47      0.38      0.40      9722
weighted avg       0.57      0.62      0.58      9722



#### SVM with linear kernel

In [57]:
svm_linear_classifier = SVC(kernel='linear')
svm_linear_param_grid = {'C': [0.1, 1]}
svm_linear_grid_search = GridSearchCV(svm_linear_classifier, svm_linear_param_grid, cv=3, n_jobs=-1)
svm_linear_grid_search.fit(X_train, y_train)

In [58]:
best_svm_linear = svm_linear_grid_search.best_estimator_
best_svm_linear

In [59]:
svm_linear_train_acc = accuracy_score(y_train, best_svm_linear.predict(X_train))
svm_linear_val_acc = accuracy_score(y_val, best_svm_linear.predict(X_val))
svm_linear_test_acc = accuracy_score(y_test, best_svm_linear.predict(X_test))

svm_linear_report = classification_report(y_test, best_svm_linear.predict(X_test))

print("SVM Linear Results (using best model):")
print("Train Accuracy:", svm_linear_train_acc)
print("Validation Accuracy:", svm_linear_val_acc)
print("Test Accuracy:", svm_linear_test_acc)
print("Classification Report:\n", svm_linear_report)

SVM Linear Results (using best model):
Train Accuracy: 0.7125517971023011
Validation Accuracy: 0.6301172598230816
Test Accuracy: 0.6241514091750668
Classification Report:
               precision    recall  f1-score   support

           0       0.55      0.46      0.50       672
           1       0.29      0.09      0.14       560
           2       0.38      0.23      0.29      1058
           3       0.44      0.24      0.31      2023
           4       0.69      0.92      0.79      5409

    accuracy                           0.62      9722
   macro avg       0.47      0.39      0.40      9722
weighted avg       0.57      0.62      0.58      9722



* After taking 25% of the data, both the SVM model with a linear kernel and the logistic regression model show a slight improvement, from about 59-60% to about 62% test accuracy.

* Results:

    **1. The logistic regression model gives an accuracy of 62.3% on the test set.**
    
    **2. The SVC model with a linear kernel gives an accuracy of 62.4% on the test set.**

# Conclusion

* I built the initial models using 5% of the data. Even after hyperparameter tuning with grid search, the best-performing models peaked at about 60% test accuracy.

* After increasing the sample to 25% of the dataset and rerunning the two best-performing models, both improved to about 62% test accuracy.

* Logistic regression and the linear-kernel SVM may have yielded the best results because linear decision boundaries tend to work well on high-dimensional TF-IDF features and are less prone to overfitting on limited data than more complex models. The decision tree, for example, reached 99.96% train accuracy but only about 48% test accuracy.

* With a dataset this large, training every model on all of the data is costly, so it is important to use the available resources wisely, for example by sampling, while still aiming for the best results.