# Hoax Identification from Kompas News

### Project Overview

The project "Hoax Identification from Kompas News Modeling" aims to build a system that automatically classifies news articles as factual or hoax, using articles from the Kompas news website as the factual reference. The work is motivated by the spread of misinformation and fake news on social media and other platforms, which can harm individuals and society.

### Import Libraries

In [1]:
# Utilities
import pandas as pd
import numpy as np

# NLP
import string
import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Deep Learning
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

# Evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

[nltk_data] Downloading package punkt to /Users/user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
2024-03-10 00:19:10.483301: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


### Datasets Input

Two datasets, fact_df and hoax_df, are imported from Excel files. fact_df is loaded from dataset_kompas_4k_cleaned.xlsx and holds cleaned factual articles, labeled 0; hoax_df is loaded from dataset_turnbackhoax_10_cleaned.xlsx and holds cleaned hoax articles, labeled 1. The "4k" and "10" in the file names likely refer to the approximate number of records (about 4,000 and 10,000), and "cleaned" suggests that irrelevant or erroneous entries were removed and the text standardized before this analysis.

#### Import and Concat Datasets

In [2]:
fact_df = pd.read_excel('./dataset_kompas_4k_cleaned.xlsx')
fact_df['Label'] = 0

hoax_df = pd.read_excel('./dataset_turnbackhoax_10_cleaned.xlsx')
hoax_df['Label'] = 1

news_df = pd.concat([fact_df, hoax_df], axis=0, ignore_index=True)
news_df = news_df[['FullText', 'Label']]

news_df.head()

Unnamed: 0,FullText,Label
0,Hasil jajak pendapat yang diselenggarakan Litb...,0
1,"JAKARTA, KOMPAS.com - Pemerintah menargetkan p...",0
2,"PDI-Perjuangan, Partai Gerindra, dan Partai Go...",0
3,"JAKARTA, KOMPAS.com - Survei Litbang Kompas Ja...",0
4,"JAKARTA, KOMPAS.com - Presiden Joko Widodo la...",0


#### Check Dataset

In [3]:
news_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15131 entries, 0 to 15130
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   FullText  15104 non-null  object
 1   Label     15131 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 236.6+ KB


#### Check Duplicate

In [4]:
news_df.duplicated().sum()

42

#### Remove Duplicate Rows

In [5]:
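# keep=False drops every copy of a duplicated row, including the first occurrence.
# Rows with a missing FullText count as duplicates of one another, so they are removed here too.
news_df.drop_duplicates(keep=False, inplace=True)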

In [6]:
news_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15082 entries, 0 to 15130
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   FullText  15082 non-null  object
 1   Label     15082 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 353.5+ KB
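
The gap between the 42 rows counted by duplicated() and the 49 rows actually removed (15131 minus 15082) comes from the keep argument. The sketch below uses a small made-up frame (the toy values are not taken from the dataset) to show how keep='first' and keep=False differ, and that pandas treats rows with a missing FullText as duplicates of one another.

```python
import numpy as np
import pandas as pd

# Toy frame: one repeated article and two rows with missing text (hypothetical data)
toy = pd.DataFrame({
    "FullText": ["a", "a", "b", np.nan, np.nan],
    "Label": [0, 0, 1, 1, 1],
})

# duplicated() flags only the repeats after the first copy of each row
n_flagged = int(toy.duplicated().sum())

# keep='first' keeps one copy of every duplicated row
kept_first = toy.drop_duplicates(keep="first")

# keep=False removes every copy, so only rows that were unique survive
kept_none = toy.drop_duplicates(keep=False)

print(n_flagged, len(kept_first), len(kept_none))
```

If the intent is to keep one copy of each article, keep='first' combined with an explicit dropna on 'FullText' would be one alternative; the row counts shown in this notebook were produced with keep=False.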


### Preprocessing

#### Lower Casing

In [7]:
def lowercase_text(text):
    return text.lower()

In [8]:
news_df['LowerCase'] = news_df['FullText'].apply(lambda text: lowercase_text(text))

In [9]:
news_df.head()

Unnamed: 0,FullText,Label,LowerCase
0,Hasil jajak pendapat yang diselenggarakan Litb...,0,hasil jajak pendapat yang diselenggarakan litb...
1,"JAKARTA, KOMPAS.com - Pemerintah menargetkan p...",0,"jakarta, kompas.com - pemerintah menargetkan p..."
2,"PDI-Perjuangan, Partai Gerindra, dan Partai Go...",0,"pdi-perjuangan, partai gerindra, dan partai go..."
3,"JAKARTA, KOMPAS.com - Survei Litbang Kompas Ja...",0,"jakarta, kompas.com - survei litbang kompas ja..."
4,"JAKARTA, KOMPAS.com - Presiden Joko Widodo la...",0,"jakarta, kompas.com - presiden joko widodo la..."


#### Remove Unnecessary Characters

Hyphens are replaced with spaces so that hyphenated and reduplicated words (for example "pdi-perjuangan") are split into separate tokens.

The cleaning functions below are applied in sequence to the 'LowerCase' column of news_df. They remove punctuation, special characters, isolated single letters, digits, non-ASCII (Unicode) characters, escaped newline and tab sequences, and repeated whitespace. The result is stored in a new column, 'RemoveUnnecessaryCharacters', which feeds the later preprocessing steps.

In [10]:
def remove_strip(text):
    return text.replace('-', ' ')

In [11]:
def remove_special_character(text):
    return re.sub(r'\W', ' ', text)

In [12]:
def remove_single_character(text):
    return re.sub(r'\s+[a-zA-Z]\s+', ' ', text)

In [13]:
def remove_digit_number(text):
    return re.sub(r"\d+", "", text)

In [14]:
def remove_ascii(text):
    # Despite its name, this keeps ASCII characters and drops everything else
    return text.encode('ascii', 'ignore').decode('utf-8')

In [15]:
def remove_unicode(text):
    return re.sub(r'[^\x00-\x7f]', r'', text)

In [16]:
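def remove_newline_etc(text):
    # Removes literal escape sequences (a backslash followed by t, n, or u) left in the text,
    # not real tab or newline characters
    return text.replace('\\t',"").replace('\\n',"").replace('\\u'," ").replace('\\',"")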

In [17]:
def remove_multispace(text):
    # Raw string avoids the invalid-escape warning that '\s' raises in a normal string
    return re.sub(r'\s+', ' ', text)

In [18]:
def remove_punctuation(text):
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

In [19]:
# Cleaning steps, listed in the order they are applied
cleaning_steps = [
    remove_multispace,
    remove_newline_etc,
    remove_unicode,
    remove_ascii,
    remove_digit_number,
    remove_single_character,
    remove_special_character,
    remove_strip,
    remove_punctuation,
]

def clean_text(text):
    for step in cleaning_steps:
        text = step(text)
    return text

news_df['RemoveUnnecessaryCharacters'] = news_df['LowerCase'].apply(clean_text)
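
The order of the cleaning steps affects the result: replacing punctuation with spaces creates new runs of whitespace, so collapsing whitespace is most useful as the final step. The following is a minimal standalone sketch of one possible ordering, not the notebook's exact pipeline; the function names clean and strip_non_ascii and the sample sentence are made up for illustration.

```python
import re

def strip_non_ascii(text):
    # Drop every character outside the ASCII range
    return text.encode("ascii", "ignore").decode("ascii")

def clean(text):
    text = text.lower()
    text = strip_non_ascii(text)
    text = re.sub(r"\d+", "", text)            # remove digits
    text = re.sub(r"[\W_]", " ", text)         # punctuation, hyphens, symbols -> space
    text = re.sub(r"\s+[a-z]\s+", " ", text)   # isolated single letters
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace last

sample = "JAKARTA, KOMPAS.com - Pemerintah menargetkan “5,2 persen”"
cleaned = clean(sample)
print(cleaned)
```

Running whitespace collapsing last guarantees that the output never contains double spaces, whichever characters the earlier steps replaced.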

In [20]:
news_df.head()

Unnamed: 0,FullText,Label,LowerCase,RemoveUnnecessaryCharacters
0,Hasil jajak pendapat yang diselenggarakan Litb...,0,hasil jajak pendapat yang diselenggarakan litb...,hasil jajak pendapat yang diselenggarakan litb...
1,"JAKARTA, KOMPAS.com - Pemerintah menargetkan p...",0,"jakarta, kompas.com - pemerintah menargetkan p...",jakarta kompas com pemerintah menargetkan p...
2,"PDI-Perjuangan, Partai Gerindra, dan Partai Go...",0,"pdi-perjuangan, partai gerindra, dan partai go...",pdi perjuangan partai gerindra dan partai go...
3,"JAKARTA, KOMPAS.com - Survei Litbang Kompas Ja...",0,"jakarta, kompas.com - survei litbang kompas ja...",jakarta kompas com survei litbang kompas ja...
4,"JAKARTA, KOMPAS.com - Presiden Joko Widodo la...",0,"jakarta, kompas.com - presiden joko widodo la...",jakarta kompas com presiden joko widodo la...


#### Remove Stopwords

Read a CSV file of Indonesian stopwords and rename its first column to 'stopword'. The remove_stopword function then removes every stopword from a text: it splits the text on spaces, drops each word found in the stopword list, and collapses the leftover extra spaces. It is applied to the 'RemoveUnnecessaryCharacters' column, and the result is stored in a new column, 'RemoveStopword'.

In [21]:
id_stopword_dict = pd.read_csv('./stopwordbahasa.csv', header=None)
id_stopword_dict = id_stopword_dict.rename(columns={0: 'stopword'})

In [22]:
id_stopword_dict.head()

Unnamed: 0,stopword
0,ada
1,adalah
2,adanya
3,adapun
4,agak


In [23]:
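# A set gives constant-time membership checks instead of scanning the whole column per word
stopword_set = set(id_stopword_dict['stopword'])

def remove_stopword(text):
    kept_words = [word for word in text.split(' ') if word not in stopword_set]
    text = ' '.join(kept_words)
    text = re.sub('  +', ' ', text) # Remove extra spaces
    return text.strip()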

In [24]:
news_df['RemoveStopword'] = news_df['RemoveUnnecessaryCharacters'].apply(lambda text: remove_stopword(text))

In [25]:
news_df.head()

Unnamed: 0,FullText,Label,LowerCase,RemoveUnnecessaryCharacters,RemoveStopword
0,Hasil jajak pendapat yang diselenggarakan Litb...,0,hasil jajak pendapat yang diselenggarakan litb...,hasil jajak pendapat yang diselenggarakan litb...,hasil jajak pendapat diselenggarakan litbang k...
1,"JAKARTA, KOMPAS.com - Pemerintah menargetkan p...",0,"jakarta, kompas.com - pemerintah menargetkan p...",jakarta kompas com pemerintah menargetkan p...,jakarta kompas com pemerintah menargetkan pert...
2,"PDI-Perjuangan, Partai Gerindra, dan Partai Go...",0,"pdi-perjuangan, partai gerindra, dan partai go...",pdi perjuangan partai gerindra dan partai go...,pdi perjuangan partai gerindra partai golkar m...
3,"JAKARTA, KOMPAS.com - Survei Litbang Kompas Ja...",0,"jakarta, kompas.com - survei litbang kompas ja...",jakarta kompas com survei litbang kompas ja...,jakarta kompas com survei litbang kompas janua...
4,"JAKARTA, KOMPAS.com - Presiden Joko Widodo la...",0,"jakarta, kompas.com - presiden joko widodo la...",jakarta kompas com presiden joko widodo la...,jakarta kompas com presiden joko widodo bicara...


#### Stemming

Use the Sastrawi library to stem Bahasa Indonesia text. The stemmer is applied to the 'RemoveStopword' column of news_df and reduces each word to its root form, so that inflected variants of the same word are counted together. The result is stored in a new column, 'Stemming'. Stemming can over-reduce: in the output below, "pemerintah" (government) becomes "perintah" (command).

In [26]:
!pip install PySastrawi



In [27]:
factory = StemmerFactory()
stemmer = factory.create_stemmer()

In [28]:
def stemming(text):
    return stemmer.stem(text)

In [29]:
news_df['Stemming'] = news_df['RemoveStopword'].apply(lambda text: stemming(text))

In [30]:
news_df.head()

Unnamed: 0,FullText,Label,LowerCase,RemoveUnnecessaryCharacters,RemoveStopword,Stemming
0,Hasil jajak pendapat yang diselenggarakan Litb...,0,hasil jajak pendapat yang diselenggarakan litb...,hasil jajak pendapat yang diselenggarakan litb...,hasil jajak pendapat diselenggarakan litbang k...,hasil jajak dapat selenggara litbang kompas ja...
1,"JAKARTA, KOMPAS.com - Pemerintah menargetkan p...",0,"jakarta, kompas.com - pemerintah menargetkan p...",jakarta kompas com pemerintah menargetkan p...,jakarta kompas com pemerintah menargetkan pert...,jakarta kompas com perintah target tumbuh ekon...
2,"PDI-Perjuangan, Partai Gerindra, dan Partai Go...",0,"pdi-perjuangan, partai gerindra, dan partai go...",pdi perjuangan partai gerindra dan partai go...,pdi perjuangan partai gerindra partai golkar m...,pdi juang partai gerindra partai golkar tempat...
3,"JAKARTA, KOMPAS.com - Survei Litbang Kompas Ja...",0,"jakarta, kompas.com - survei litbang kompas ja...",jakarta kompas com survei litbang kompas ja...,jakarta kompas com survei litbang kompas janua...,jakarta kompas com survei litbang kompas janua...
4,"JAKARTA, KOMPAS.com - Presiden Joko Widodo la...",0,"jakarta, kompas.com - presiden joko widodo la...",jakarta kompas com presiden joko widodo la...,jakarta kompas com presiden joko widodo bicara...,jakarta kompas com presiden joko widodo bicara...


#### Tokenizing

Tokenize the text in the 'Stemming' column of news_df into individual words using NLTK's word_tokenize. The apply method runs tokenize_text on each row, and the result is stored in a new column, 'word_tokenize', where each row holds the list of tokens for the corresponding article.

In [31]:
def tokenize_text(text):
    return word_tokenize(text)

In [32]:
news_df['word_tokenize'] = news_df['Stemming'].apply(lambda text: tokenize_text(text))

In [33]:
news_df.head()

Unnamed: 0,FullText,Label,LowerCase,RemoveUnnecessaryCharacters,RemoveStopword,Stemming,word_tokenize
0,Hasil jajak pendapat yang diselenggarakan Litb...,0,hasil jajak pendapat yang diselenggarakan litb...,hasil jajak pendapat yang diselenggarakan litb...,hasil jajak pendapat diselenggarakan litbang k...,hasil jajak dapat selenggara litbang kompas ja...,"[hasil, jajak, dapat, selenggara, litbang, kom..."
1,"JAKARTA, KOMPAS.com - Pemerintah menargetkan p...",0,"jakarta, kompas.com - pemerintah menargetkan p...",jakarta kompas com pemerintah menargetkan p...,jakarta kompas com pemerintah menargetkan pert...,jakarta kompas com perintah target tumbuh ekon...,"[jakarta, kompas, com, perintah, target, tumbu..."
2,"PDI-Perjuangan, Partai Gerindra, dan Partai Go...",0,"pdi-perjuangan, partai gerindra, dan partai go...",pdi perjuangan partai gerindra dan partai go...,pdi perjuangan partai gerindra partai golkar m...,pdi juang partai gerindra partai golkar tempat...,"[pdi, juang, partai, gerindra, partai, golkar,..."
3,"JAKARTA, KOMPAS.com - Survei Litbang Kompas Ja...",0,"jakarta, kompas.com - survei litbang kompas ja...",jakarta kompas com survei litbang kompas ja...,jakarta kompas com survei litbang kompas janua...,jakarta kompas com survei litbang kompas janua...,"[jakarta, kompas, com, survei, litbang, kompas..."
4,"JAKARTA, KOMPAS.com - Presiden Joko Widodo la...",0,"jakarta, kompas.com - presiden joko widodo la...",jakarta kompas com presiden joko widodo la...,jakarta kompas com presiden joko widodo bicara...,jakarta kompas com presiden joko widodo bicara...,"[jakarta, kompas, com, presiden, joko, widodo,..."


#### Count Vectorizer UNIGRAM

Use scikit-learn's CountVectorizer to convert the tokenized articles in the 'word_tokenize' column of news_df into a matrix of unigram counts. The tokens are first joined back into space-separated strings with map, because CountVectorizer expects raw text. fit_transform then builds a sparse matrix X in which each row is an article and each column is a unique token; get_feature_names_out returns the vocabulary, and X.toarray() produces a dense copy of the matrix for printing.

In [34]:
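# CountVectorizer is already imported in the first cell
news_list = news_df['word_tokenize'].map(' '.join)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(news_list)
feature_names = vectorizer.get_feature_names_out()  # vocabulary, one entry per column of X
# Note: toarray() builds a dense copy of X, which needs a lot of memory for a large vocabulary
print(X.toarray())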

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


### Training

#### Random Forest

##### Split Data

In [35]:
# test_size=0.7 holds out 70% of the rows for testing; only 30% are used for training
X_train, X_test, y_train, y_test = train_test_split(X, news_df['Label'], test_size=0.7, random_state=42)
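
With test_size=0.7 the larger part of the data goes to the test set. The arithmetic below is a rough check of the resulting split sizes, assuming scikit-learn rounds the test share up to the next whole row and that Keras evaluates with its default batch size of 32; under those assumptions the numbers agree with the 71 training steps and 330 evaluation steps reported in the CNN logs later in this notebook.

```python
import math

n_rows = 15082       # rows left after removing duplicates
test_size = 0.7

n_test = math.ceil(test_size * n_rows)   # assumed rounding rule for the test share
n_train = n_rows - n_test

# Batches per epoch for batch_size=64 (training) and 32 (assumed default for evaluation)
train_steps = math.ceil(n_train / 64)
eval_steps = math.ceil(n_test / 32)

print(n_train, n_test, train_steps, eval_steps)
```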

##### Compiling Model

In [36]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

In [37]:
y_pred = rf_model.predict(X_test)

##### Evaluation Model

In [38]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

precision = precision_score(y_test, y_pred)
print("precision:", precision)

recall = recall_score(y_test, y_pred)
print("recall:", recall)

# Store the score under a new name so the imported f1_score function is not overwritten
f1 = f1_score(y_test, y_pred)
print("f1-measure:", f1)

Accuracy: 0.9967796931236976
precision: 0.9963933971424608
recall: 0.9988874982617161
f1-measure: 0.997638888888889


The random forest scores highly on every metric: accuracy 0.997, precision 0.996, recall 0.999, and F1-measure 0.998. High precision means that few factual articles are flagged as hoaxes (few false positives), and high recall means that few hoaxes are missed (few false negatives). Because the two classes come from different sources, part of this performance may reflect source-specific wording, such as the "kompas com" dateline that survives preprocessing, rather than hoax content alone.
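
The F1-measure is the harmonic mean of precision and recall, so it can be recomputed from the two values printed above. The short check below does that with the reported numbers; the variable names are chosen for this sketch only.

```python
# Precision and recall as printed by the evaluation cell
precision = 0.9963933971424608
recall = 0.9988874982617161

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 6))
```

The harmonic mean is dominated by the smaller of the two values, which is why F1 drops sharply whenever either precision or recall is poor.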

#### Convolutional Neural Network

##### Convert X to Sparse Tensor

Convert a sparse matrix X into a sparse tensor X_sparse_tensor using TensorFlow. First, X is converted to COO (Coordinate Format) using tocoo(). Then, the row and column indices are stacked together to create a 2D array of indices. Finally, a sparse tensor X_sparse_tensor is created using the SparseTensor class from TensorFlow, with the indices, data (values), and shape of the COO matrix. This conversion allows the sparse matrix X to be used efficiently in TensorFlow operations.

In [39]:
X_coo = X.tocoo()
indices = np.column_stack((X_coo.row, X_coo.col))
X_sparse_tensor = tf.SparseTensor(indices, X_coo.data, X_coo.shape)

In [40]:
print(X_sparse_tensor)

SparseTensor(indices=tf.Tensor(
[[    0 38534]
 [    0 46181]
 [    0 21496]
 ...
 [15081 78656]
 [15081 30553]
 [15081 76753]], shape=(1891203, 2), dtype=int64), values=tf.Tensor([1 2 1 ... 1 1 2], shape=(1891203,), dtype=int64), dense_shape=tf.Tensor([ 15082 118576], shape=(2,), dtype=int64))
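
To make the COO layout concrete, the sketch below applies the same tocoo and column_stack steps to a tiny hand-written matrix (made-up values). It stops before the TensorFlow call and only shows the indices and values arrays that would be passed to tf.SparseTensor.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A 2 x 3 count matrix with three non-zero entries (hypothetical values)
counts = csr_matrix(np.array([[0, 2, 0],
                              [1, 0, 3]]))

coo = counts.tocoo()

# One (row, column) pair per stored value, in the same order as coo.data
indices = np.column_stack((coo.row, coo.col))
values = coo.data

print(indices.tolist())
print(values.tolist())
print(coo.shape)
```

Only the three non-zero cells are stored, together with their coordinates, which is what keeps the sparse representation small when most counts are zero.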


In [41]:
num_classes = 2

##### Convert The SparseTensor to Dense NumPy Arrays

In [42]:
X_ordered_sparse_tensor = tf.sparse.reorder(X_sparse_tensor)
X_dense = tf.sparse.to_dense(X_ordered_sparse_tensor).numpy()
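
Converting to a dense array gives up the memory savings of the sparse format. As a rough estimate based on the dense_shape and the number of stored values printed for X_sparse_tensor, the sketch below computes how many cells the dense array has and how much memory it needs as int64 and, after the later cast, as float32.

```python
n_rows, n_cols = 15082, 118576   # dense_shape printed for X_sparse_tensor
n_nonzero = 1891203              # number of stored values in the sparse tensor

n_cells = n_rows * n_cols
density = n_nonzero / n_cells    # fraction of cells that are non-zero

dense_gib_int64 = n_cells * 8 / 2**30     # 8 bytes per int64 value
dense_gib_float32 = n_cells * 4 / 2**30   # 4 bytes per float32 value

print(f"{n_cells:,} cells, density {density:.4%}")
print(f"int64: {dense_gib_int64:.1f} GiB, float32: {dense_gib_float32:.1f} GiB")
```

Roughly one cell in a thousand is non-zero, so almost all of that memory holds zeros.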

##### One-hot Encode Labels

In [43]:
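# scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`; use sparse=False on older versions
encoder = OneHotEncoder(sparse_output=False)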
y = encoder.fit_transform(news_df['Label'].values.reshape(-1, 1))



##### Reshape The Input Data to Include The Sequence Length Dimension

In [44]:
sequence_length = X_dense.shape[1]  # The length of your vocabulary
X_dense = X_dense.reshape(-1, sequence_length, 1)

##### Cast The Input Data to The Appropriate Data Type (float32)

In [45]:
X_dense = X_dense.astype(np.float32)

##### Split the Data

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X_dense, y, test_size=0.7, random_state=42)

##### Model Architecture

Define a convolutional neural network (CNN) with Keras' Sequential API. The model has a 1D convolutional layer with 64 filters and a kernel size of 3, followed by a global max pooling layer that reduces each filter's output to a single value. A final dense layer with one unit per class and a softmax activation outputs a probability distribution over the classes. Conv1D architectures are common in text classification, where they usually operate on sequences of word embeddings; here the convolution instead slides over the columns of the count matrix, so each window covers three tokens that are adjacent in the vocabulary ordering rather than in the article.

In [None]:
model = Sequential()
model.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(units=num_classes, activation='softmax'))
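
To show what a single Conv1D filter followed by global max pooling computes, here is a minimal NumPy sketch for one filter on a short one-channel input. The helper conv1d_valid, the input vector, and the kernel weights are invented for illustration and are not taken from the trained model.

```python
import numpy as np

def conv1d_valid(x, kernel, bias=0.0):
    # Slide a window of len(kernel) over x and take the dot product at each position
    windows = np.lib.stride_tricks.sliding_window_view(x, len(kernel))
    return np.maximum(windows @ kernel + bias, 0.0)   # ReLU activation

x = np.array([0.0, 1.0, 0.0, 0.0, 2.0, 1.0])   # e.g. six adjacent count columns
kernel = np.array([1.0, -1.0, 0.5])            # one filter with kernel_size=3

feature_map = conv1d_valid(x, kernel)   # one activation per window position
pooled = feature_map.max()              # global max pooling keeps the strongest response

print(feature_map.tolist(), pooled)
```

Each filter therefore reports only its strongest local match, wherever it occurs along the input. With 64 filters, the dense layer receives 64 such values per article.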

##### Compile Model

In [48]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [49]:
model.fit(X_train, y_train, epochs=30, batch_size=64, validation_data=(X_test, y_test))

Epoch 1/30
71/71 ━━━━━━━━━━━━━━━━━━━━ 685s 10s/step - accuracy: 0.4502 - loss: 2.0578 - val_accuracy: 0.6811 - val_loss: 0.6144
Epoch 2/30
71/71 ━━━━━━━━━━━━━━━━━━━━ 662s 9s/step - accuracy: 0.6976 - loss: 0.5992 - val_accuracy: 0.6811 - val_loss: 0.6084
Epoch 3/30
71/71 ━━━━━━━━━━━━━━━━━━━━ 659s 9s/step - accuracy: 0.6911 - loss: 0.6027 - val_accuracy: 0.6811 - val_loss: 0.6081
Epoch 4/30
71/71 ━━━━━━━━━━━━━━━━━━━━ 669s 9s/step - accuracy: 0.6915 - loss: 0.6062 - val_accuracy: 0.6811 - val_loss: 0.6086
Epoch 5/30
71/71 ━━━━━━━━━━━━━━━━━━━━ 654s 9s/step - accuracy: 0.7033 - loss: 0.5846 - val_accuracy: 0.6811 - val_loss: 0.6044
Epoch 6/30
71/71 ━━━━━━━━━━━━━━━━━━━━ 662s 9s/step - accuracy: 0.7031 - loss: 0.5923 - val_accuracy: 0.6811 - val_loss: 0.6045
Epoch 7/30
71/71 ━━━


KeyboardInterrupt



##### Evaluate Model

In [58]:
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss}, Test Accuracy: {accuracy}")

330/330 ━━━━━━━━━━━━━━━━━━━━ 308s 931ms/step - accuracy: 0.6801 - loss: 0.6048
Test Loss: 0.6018175482749939, Test Accuracy: 0.6810948848724365


In [51]:
y_pred = model.predict(X_test)

330/330 ━━━━━━━━━━━━━━━━━━━━ 333s 1s/step


In [59]:
print("Accuracy:", accuracy)

precision = precision_score(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1), average='weighted')
print("precision:", precision)

recall = recall_score(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1), average='weighted')
print("recall:", recall)

print("loss:", loss)

Accuracy: 0.6810948848724365
precision: 0.46389026871511146
recall: 0.6810949043379428
loss: 0.6018175482749939


  _warn_prf(average, modifier, msg_start, len(result))


Although the accuracy is 0.681, these numbers indicate that the CNN did not learn to separate the classes. The validation accuracy stays at 0.6811 in every completed epoch, and the weighted precision (0.464) equals 0.681 squared, which is what results when a model assigns every test article to the majority class (about 68% of the test set). The `_warn_prf` line in the output is the tail of scikit-learn's warning that precision is undefined for a class that was never predicted. Weighted recall always equals accuracy, so the recall of 0.681 adds no new information, and both scores are averages over the two classes rather than per-class scores for hoaxes. The loss of 0.602 is the mean categorical cross-entropy on the test set, with lower values indicating better performance. Training was also interrupted during epoch 7 of 30, so the model was evaluated before the planned training finished.
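
One way to read the reported CNN metrics is to check what a classifier that always predicts the majority class would score. The sketch below works through that case by hand, under the assumption that the reported accuracy equals the majority-class share of the test set; the variable names are chosen for this sketch.

```python
# Assumption: the model predicts the majority class for every test article,
# so accuracy equals the share of that class in the test set
majority_share = 0.6810949043379428

# Per-class precision under that assumption
precision_majority = majority_share   # correct predictions / all predictions of this class
precision_minority = 0.0              # never predicted; scikit-learn substitutes 0 and warns

# average='weighted' weights each class's precision by its share of the true labels
weighted_precision = (majority_share * precision_majority
                      + (1 - majority_share) * precision_minority)

print(weighted_precision)
```

The result matches the reported precision of 0.4639, which is consistent with the model outputting a single class for the whole test set.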

### Random Forest vs CNN

To compare the random forest and the convolutional neural network (CNN) for hoax detection on news_df, consider the nature of the dataset and the characteristics of each model:

* Dataset Characteristics: news_df contains full news articles, which are linguistically complex. Text needs careful preprocessing to yield meaningful features for classification.

* Random Forest: Random forests are robust on high-dimensional inputs such as bag-of-words counts. They handle a large number of features, are less prone to overfitting than a single decision tree, and expose feature importances that show which tokens drive the classification.

* CNN: CNNs were developed for data with spatial structure, such as images, but can also be applied to sequential data like text, where they learn local patterns automatically. They typically need more data and tuning to generalize well, and they depend on an input representation that preserves word order, which a count matrix does not.

**Comparison:**

* Accuracy: In this notebook the random forest reached an accuracy of 0.997, while the CNN stayed at 0.681 after training was interrupted in epoch 7. CNNs can learn intricate patterns in text, but this configuration did not show that.
* Precision and Recall: The random forest achieved a precision of 0.996 and a recall of 0.999 on the hoax class. The CNN's weighted precision (0.464) and recall (0.681) are much lower, and the two sets of figures are not directly comparable because the CNN scores are weighted averages over both classes.
* Interpretability: Random forests are more interpretable than CNNs because they provide an importance score for each input feature, which helps show which words matter most for hoax detection.
* Computational Complexity: The CNN was far more expensive to train here, at roughly 11 minutes per epoch on the dense count matrix. Random forests are generally faster to train and work directly on the sparse matrix.


In conclusion, the random forest is the stronger model in this notebook: it reaches near-perfect scores on the held-out data, trains quickly, and offers feature importances for interpretation. CNNs have the potential to learn intricate patterns in textual data, but they need more computational resources, careful tuning, and a suitable input representation; the configuration tested here did not outperform the random forest.