# <u>Subreddit prediction</u> #



## 1. Description of the project ##

### <span style="color: #FF9800;">Project overview </span> ###


This project aims to develop machine learning models for **analyzing Reddit text** to determine the origin subreddit of a given post or comment. Reddit, a popular social media platform, is organized into a variety of thematic communities known as *subreddits*, where users share content and engage in discussions.



### <span style="color: #FF9800;">Objective </span> ###


The primary objective is to build a model that can **predict the subreddit** of a Reddit post or comment. Given a text entry from Reddit, the model will identify which of the following subreddits it originally came from:

- **Toronto**
- **Brussels**
- **London**
- **Montreal**

<b>This defines a multiclass classification problem.</b>


### <span style="color: #FF9800;">Approach</span> ###



This project consists of two main parts:

1. **Implement a Bernoulli Naïve Bayes Classifier from Scratch**  
   First, a Bernoulli Naïve Bayes classifier will be developed from the ground up, without relying on external libraries for the core algorithm. This implementation will provide a deeper understanding of how the Bernoulli Naïve Bayes method works and how it can be applied to text classification.

2. **Utilize a Classifier from Scikit-Learn**  
   In the second part, a pre-built classifier from the `scikit-learn` library will be used to perform the same task. This comparison will allow us to evaluate the effectiveness of our custom implementation against a widely used, optimized machine learning library.
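As a rough illustration of the first part, the sketch below shows one way a Bernoulli Naïve Bayes classifier can be written with NumPy alone. It is not the implementation developed in this project: the class name `BernoulliNaiveBayesSketch`, the Laplace smoothing parameter `alpha`, and the toy word-presence matrix are all assumptions made for the example.

```python
import numpy as np


class BernoulliNaiveBayesSketch:
    """Illustrative Bernoulli Naive Bayes with Laplace smoothing (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (assumed default of 1)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        n_classes, n_features = len(self.classes_), X.shape[1]
        self.log_prior_ = np.zeros(n_classes)
        self.theta_ = np.zeros((n_classes, n_features))
        for k, c in enumerate(self.classes_):
            X_c = X[y == c]
            # P(class = c), estimated by relative frequency
            self.log_prior_[k] = np.log(X_c.shape[0] / X.shape[0])
            # P(word j present | class c), smoothed so it is never exactly 0 or 1
            self.theta_[k] = (X_c.sum(axis=0) + self.alpha) / (X_c.shape[0] + 2 * self.alpha)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # log P(x | c) = sum_j [x_j * log(theta_cj) + (1 - x_j) * log(1 - theta_cj)]
        log_likelihood = X @ np.log(self.theta_).T + (1 - X) @ np.log(1 - self.theta_).T
        return self.classes_[np.argmax(log_likelihood + self.log_prior_, axis=1)]


# Toy word-presence matrix; the columns stand for three made-up words
X_toy = np.array([[1, 0, 0],
                  [1, 0, 1],
                  [0, 1, 0],
                  [0, 1, 1]])
y_toy = np.array(["Toronto", "Toronto", "London", "London"])

model = BernoulliNaiveBayesSketch().fit(X_toy, y_toy)
predictions = model.predict(np.array([[1, 0, 0], [0, 1, 1]]))
print(predictions)
```

The sketch expects a dense 0/1 array; a sparse matrix produced by `CountVectorizer` would need to be converted first, for example with `.toarray()`.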


## 2. Load dataset and modules ##

### <span style="color: #FF9800;">Module imports </span> ###

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
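# Ensure required NLTK resources are downloaded.
# nltk.download() signals most failures by returning False instead of raising.
for resource in ('punkt', 'stopwords'):
    try:
        if not nltk.download(resource):
            print(f"Could not download NLTK resource: {resource}")
    except Exception as e:
        print(f"Error downloading NLTK resource {resource}: {e}")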


[nltk_data] Downloading package punkt to /home/clatimie/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/clatimie/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### <span style="color: #FF9800;">Load training dataset</span> ###

In [2]:
# Define the path to the training data file
path_training = "../datasets/Train.csv"

# Read the CSV file into a pandas DataFrame
training_data = pd.read_csv(path_training, delimiter=',')

# Set column names explicitly for better readability
training_data.columns = ['text', 'subreddit']

# Separate the training data into two series: texts and subreddit labels
texts_train = training_data['text']          # Contains the Reddit posts or comments
subreddits_train = training_data['subreddit'] # Contains the subreddit each post originates from

# Get unique subreddit labels
unique_labels = np.unique(subreddits_train)   # List of unique subreddits in the dataset

n_samples = texts_train.shape[0]
n_classes = unique_labels.shape[0]

print(f"Dataset has {n_samples} examples and {n_classes} classes")

Dataset has 1399 examples and 4 classes


## 3. Vectorization of the texts ##

Machine learning models operate on numeric vectors, so each text must first be converted into a vectorized format. The approaches below combine the following techniques:

1. **Binary Representation of Words**  
   Each text is encoded as a vector with one entry per vocabulary word: 1 if the word is present in the text, 0 if it is absent. Word counts and word order are discarded.

2. **Removal of Stop Words**  
   Stop words are common words such as "and", "the", or "is" that carry little meaning for the analysis. Removing them shrinks the vocabulary and leaves features that are more likely to distinguish one subreddit from another.

3. **Stemming**  
   Words are reduced to their stem with the Porter stemmer, so that inflected forms such as "running" and "runs" map to the same feature.

4. **TF-IDF (Term Frequency-Inverse Document Frequency)**  
   TF-IDF weights each word by its frequency within a text and discounts words that appear in many texts across the corpus. Here the TF-IDF scores are used to select features: only words whose score exceeds a threshold in at least one text are kept, and the final representation remains binary.


In [3]:
# Approach 1: Binary representation of words (present [1] or absent [0]); stop words are kept and no token selection is applied
BinaryVectorizer = CountVectorizer(binary=True)
x_train_1 = BinaryVectorizer.fit_transform(texts_train)
n_features_1 = x_train_1.shape[1]
print(f"Binary vectorized dataset(WITHOUT stop words consideration) has {n_samples} examples and {n_features_1} features")


stopwords_list = stopwords.words('english') + stopwords.words('french') # the corpus contains both English and French text

# Approach 2: Binary representation of words (present [1] or absent [0]) with stop words removed
BinaryVectorizerStopWords = CountVectorizer(
    binary=True, 
    stop_words=stopwords_list)

x_train_2 = BinaryVectorizerStopWords.fit_transform(texts_train)
n_features_2 = x_train_2.shape[1]
print(f"Binary vectorized dataset(WITH stop words consideration) has {n_samples} examples and {n_features_2} features")

# Approach 3: Binary representation of words (present [1] or absent [0]) with stop words removed and stemming
class StemTokenizer:
    def __init__(self):
        # Initialize the Porter stemmer
        self.stemmer = nltk.stem.PorterStemmer()
        # Store the stop words in a set for fast membership tests
        self.stop_words = set(stopwords_list)

    def __call__(self, doc):
        # Tokenize the document, drop non-alphabetic tokens and stop words, then stem the remaining tokens
        return [self.stemmer.stem(t) for t in word_tokenize(doc) if t.isalpha() and t not in self.stop_words]

# Set up the CountVectorizer with binary representation, stop words, and stemming
BinaryVectorizerStopWordsandStemming = CountVectorizer(
    binary=True,
    tokenizer=StemTokenizer(),
    stop_words=stopwords_list
)
x_train_3 = BinaryVectorizerStopWordsandStemming.fit_transform(texts_train)
n_features_3 = x_train_3.shape[1]
print(f"Binary vectorized dataset (WITH stop words consideration and STEMMING) has {n_samples} examples and {n_features_3} features")

# Approach 4: Binary representation of words (present [1] or absent [0]) with stop words removed, stemming, and token selection based on TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords_list, use_idf=True, smooth_idf=True, tokenizer=StemTokenizer())

# Fit and transform the training texts
x_train_tfidf = tfidf_vectorizer.fit_transform(texts_train)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert the sparse TF-IDF matrix to a dense NumPy array
dense_tfidf = x_train_tfidf.toarray()
tfidf_df = pd.DataFrame(dense_tfidf, columns=feature_names)

# Keep the tokens whose TF-IDF score exceeds the threshold in at least one text
threshold = 0.2  # Hyperparameter
important_tokens = tfidf_df.columns[(tfidf_df > threshold).any(axis=0)]

# Set up the CountVectorizer with binary representation, stop words, stemming and token selection based on TF-IDF
BinaryVectorizerStopWordsandStemmingandTFIDF = CountVectorizer(
    binary=True,
    tokenizer=StemTokenizer(),
    stop_words=stopwords_list,
    vocabulary=important_tokens
)
x_train_4 = BinaryVectorizerStopWordsandStemmingandTFIDF.fit_transform(texts_train)
n_features_4 = x_train_4.shape[1]
print(f"Binary vectorized dataset (WITH stop words consideration and STEMMING and TFIDF based token selection) has {n_samples} examples and {n_features_4} features")

Binary vectorized dataset(WITHOUT stop words consideration) has 1399 examples and 13690 features
Binary vectorized dataset(WITH stop words consideration) has 1399 examples and 13461 features
Binary vectorized dataset (WITH stop words consideration and STEMMING) has 1399 examples and 8635 features
Binary vectorized dataset (WITH stop words consideration and STEMMING and TFIDF based token selection) has 1399 examples and 5231 features
