# Overview: Naive Bayes

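We'll use a multinomial [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) classifier to predict the `reason` label of a message from its text (`singleMessage`), using TF-IDF features.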
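
To make the idea concrete before reaching for scikit-learn, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. It is an illustration under stated assumptions, not what `MultinomialNB` does internally: the toy messages, the labels `advice` and `politics`, and the helper names `train_nb` and `predict_proba_nb` are all hypothetical, and it works on raw word counts rather than TF-IDF weights.

```python
import math
from collections import Counter, defaultdict


def train_nb(messages, labels, alpha=1.0):
    """Count documents per class and words per class (toy whitespace tokenizer)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(messages, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "class_counts": class_counts,
        "word_counts": word_counts,
        "vocab": vocab,
        "alpha": alpha,
    }


def predict_proba_nb(model, text):
    """Return P(class | text) under the naive independence assumption."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    log_scores = {}
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        # Start from the log prior, then add one log likelihood per token.
        score = math.log(n_docs / total_docs)
        for token in text.lower().split():
            if token not in model["vocab"]:
                continue  # skip words never seen during training
            smoothed = (counts[token] + alpha) / (total_words + alpha * vocab_size)
            score += math.log(smoothed)
        log_scores[label] = score
    # Normalize in log space to avoid underflow on long messages.
    max_score = max(log_scores.values())
    exp_scores = {label: math.exp(s - max_score) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: value / norm for label, value in exp_scores.items()}


# Hypothetical toy data, not taken from data/preprocessed-data.csv.
toy_messages = [
    "buy this stock now",
    "sell sell sell now",
    "who won the election",
    "vote in the election",
]
toy_labels = ["advice", "advice", "politics", "politics"]

toy_model = train_nb(toy_messages, toy_labels)
toy_probs = predict_proba_nb(toy_model, "election vote")
for label, prob in sorted(toy_probs.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {prob:.4f}")
```

The smoothing term `alpha` keeps a word that never appeared in a class from driving that class's probability to zero; scikit-learn's `MultinomialNB` exposes the same idea through its `alpha` parameter, which defaults to 1.0.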

In [1]:
pip install scikit-learn pandas

Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

# Load the CSV
file_path = 'data/preprocessed-data.csv'
df = pd.read_csv(file_path)

# Handle NaN values: choose one of the following methods
df = df.dropna(subset=['singleMessage'])  # Method 1: Drop rows with NaN values
# df['singleMessage'] = df['singleMessage'].fillna('')  # Method 2: Fill NaN values with a default string

# Extract features and target
X = df['singleMessage']
y = df['reason']

# Split the dataset into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Preprocess the text data using TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Train a Multinomial Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train_vectorized, y_train)

# Make predictions on the testing set
y_pred = classifier.predict(X_test_vectorized)

# Evaluate the classifier's performance
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("\nAccuracy:")
print(accuracy_score(y_test, y_pred))


Classification Report:
                                                                                                                                            precision    recall  f1-score   support

"Any discussion related in any way to market manipulation is strictly prohibited, as is advising others on whether to buy, sell, or hold."       0.00      0.00      0.00         8
                                                                      Account number visible. Please remove from content before reposting.       0.00      0.00      0.00         1
                                                                                                           Bullying a member or moderator.       0.00      0.00      0.00         8
                                                                                                Bypassing the chat filters is not allowed.       0.00      0.00      0.00         6
                                                                            

  _warn_prf(average, modifier, msg_start, len(result))
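
The `0.00` rows in the report, together with the truncated `_warn_prf` lines, most likely come from classes the classifier never predicts on the test set: precision is then 0/0, and scikit-learn reports it as 0 with a warning. The sketch below reproduces that arithmetic by hand on a hypothetical toy example; `per_class_precision_recall`, the shortened label `Bullying`, and the toy label lists are assumptions for illustration, not part of the notebook's data.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred, zero_division=0.0):
    """Compute precision, recall, and support per class, guarding against 0/0."""
    labels = sorted(set(y_true) | set(y_pred))
    true_positives = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)
    actual = Counter(y_true)
    report = {}
    for label in labels:
        # Precision is undefined when the class is never predicted.
        precision = (
            true_positives[label] / predicted[label]
            if predicted[label]
            else zero_division
        )
        # Recall is undefined when the class never occurs in y_true.
        recall = (
            true_positives[label] / actual[label]
            if actual[label]
            else zero_division
        )
        report[label] = {
            "precision": precision,
            "recall": recall,
            "support": actual[label],
        }
    return report


# Toy case: a dominant class swallows every prediction, as a rare class might here.
y_true_toy = ["Off-topic"] * 6 + ["Bullying"] * 2
y_pred_toy = ["Off-topic"] * 8

toy_report = per_class_precision_recall(y_true_toy, y_pred_toy)
for label, scores in toy_report.items():
    print(label, scores)
```

In the notebook itself, passing `zero_division=0` to `classification_report` should silence the warning without changing the reported numbers. Whether the rare classes can be learned at all is a separate question, since several of them have a support of fewer than ten test messages.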


In [13]:
# Example message to predict
new_message = ["Trump will win 2024"]

# Preprocess the new message
new_message_vectorized = vectorizer.transform(new_message)

# Predict the class probabilities
class_probabilities = classifier.predict_proba(new_message_vectorized)

# Sort the class probabilities in descending order along with their corresponding classes
sorted_probabilities = sorted(zip(classifier.classes_, class_probabilities[0]), key=lambda x: x[1], reverse=True)

# Print the sorted class probabilities
for reason, probability in sorted_probabilities:
    print(f"{reason}: {probability:.4f}")

Off-topic: 0.5341
Inappropriate comment.: 0.1583
Caps for tickers only.: 0.0928
Third-party links / content not allowed.: 0.0564
Politics not allowed outside of references to the market.: 0.0462
Personal or sensitive information not allowed in chat.: 0.0452
"Any discussion related in any way to market manipulation is strictly prohibited, as is advising others on whether to buy, sell, or hold.": 0.0168
Bullying a member or moderator.: 0.0100
Reviewed by admin internally; not necessary to post to public chat.: 0.0080
False information or no source.: 0.0065
Bypassing the chat filters is not allowed.: 0.0065
False or misleading information, or no source.: 0.0025
Account number visible. Please remove from content before reposting.: 0.0025
False information.: 0.0020
Not sure what this is: 0.0020
Support Room would be more appropriate for this inquiry.: 0.0020
Perv is an inappropriate term please refrain from these kinds of discussions here: 0.0015
password: 0.0010
language please: 0.0005
Lan