# <u>Subreddit prediction</u> #



## 1. Description of the project ##

### <span style="color: #FF9800;">Project overview </span> ###


This project aims to develop machine learning models for **analyzing Reddit text** to determine the origin subreddit of a given post or comment. Reddit, a popular social media platform, is organized into a variety of thematic communities known as *subreddits*, where users share content and engage in discussions.



### <span style="color: #FF9800;">Objective </span> ###


The primary objective is to build a model that can **predict the subreddit** of a Reddit post or comment. Given a text entry from Reddit, the model will identify which of the following subreddits it originally came from:

- **Toronto**
- **Brussels**
- **London**
- **Montreal**

<b>This defines a multiclass classification problem.</b>


### <span style="color: #FF9800;">Approach</span> ###



This project consists of two main parts:

1. **Implement a Bernoulli Naïve Bayes Classifier from Scratch**  
   First, a Bernoulli Naïve Bayes classifier will be developed from the ground up, without relying on external libraries for the core algorithm. This implementation will provide a deeper understanding of how the Bernoulli Naïve Bayes method works and how it can be applied to text classification.

2. **Utilize a Classifier from Scikit-Learn**  
   In the second part, a pre-built classifier from the `scikit-learn` library will be used to perform the same task. Comparing the two will allow us to evaluate the effectiveness of our custom implementation against a widely used, optimized machine learning library.
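
To make the first part more concrete, here is a minimal sketch of what a from-scratch Bernoulli Naïve Bayes classifier could look like. It is not the implementation developed in this project: the class name `BernoulliNBSketch`, the Laplace smoothing parameter `alpha`, and the toy data are all assumptions made for illustration. It expects a binary feature matrix in which entry `(i, j)` is 1 if word `j` appears in text `i` and 0 otherwise.

```python
import numpy as np


class BernoulliNBSketch:
    """Hypothetical minimal Bernoulli Naive Bayes (illustration only)."""

    def __init__(self, alpha=1.0):
        # alpha is an assumed Laplace smoothing strength
        self.alpha = alpha

    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        n_classes, n_features = len(self.classes_), X.shape[1]
        self.log_prior_ = np.zeros(n_classes)
        self.theta_ = np.zeros((n_classes, n_features))
        for k, c in enumerate(self.classes_):
            X_c = X[y == c]
            # Log of the class prior P(y = c)
            self.log_prior_[k] = np.log(X_c.shape[0] / X.shape[0])
            # Smoothed estimate of P(word j present | y = c)
            self.theta_[k] = (X_c.sum(axis=0) + self.alpha) / (
                X_c.shape[0] + 2 * self.alpha
            )
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Bernoulli log-likelihood: present words use log(theta),
        # absent words use log(1 - theta)
        log_lik = X @ np.log(self.theta_).T + (1 - X) @ np.log(1 - self.theta_).T
        return self.classes_[np.argmax(log_lik + self.log_prior_, axis=1)]


# Toy binary features (3 hypothetical words) and labels, invented for this example
X_toy = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
y_toy = np.array(["London", "London", "Toronto", "Toronto"])

model = BernoulliNBSketch().fit(X_toy, y_toy)
pred = model.predict(np.array([[1, 0, 0], [0, 1, 1]]))
print(pred)
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together, and the smoothing term keeps every probability strictly between 0 and 1 so the logarithms stay finite.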


## 2. Load dataset and modules ##

### <span style="color: #FF9800;">Module imports </span> ###

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk
import pandas as pd


### <span style="color: #FF9800;">Load training dataset</span> ###

In [None]:
# Define the path to the training data file
path_training = "../datasets/Train.csv"

# Read the CSV file into a pandas DataFrame
training_data = pd.read_csv(path_training, delimiter=',')

# Set column names explicitly for better readability
training_data.columns = ['text', 'subreddit']

# Separate the training data into two series: texts and subreddit labels
texts_train = training_data['text']          # Contains the Reddit posts or comments
subreddits_train = training_data['subreddit'] # Contains the subreddit each post originates from

# Get unique subreddit labels
unique_labels = np.unique(subreddits_train)   # Sorted array of unique subreddits in the dataset

n_samples = texts_train.shape[0]
n_classes = unique_labels.shape[0]

print(f"Dataset has {n_samples} examples and {n_classes} classes")

Dataset has 1399 examples and 4 classes
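
Beyond the number of classes, it can be useful to check how the examples are distributed across subreddits, since a strong imbalance would affect both the class priors and how accuracy should be read. The sketch below shows one possible way to do this with the same `np.unique` call used above. The `labels_demo` array is a hypothetical stand-in for `subreddits_train`, so the counts it prints say nothing about the real dataset.

```python
import numpy as np

# Hypothetical stand-in for subreddits_train; the real labels come from Train.csv
labels_demo = np.array(
    ["London", "Toronto", "London", "Brussels", "Montreal", "London"]
)

# return_counts=True also returns the number of examples per unique label
classes_demo, counts_demo = np.unique(labels_demo, return_counts=True)
proportions_demo = counts_demo / counts_demo.sum()

for name, count, share in zip(classes_demo, counts_demo, proportions_demo):
    print(f"{name}: {count} examples ({share:.1%})")
```

On the real data, passing `subreddits_train` instead of `labels_demo` would give the per-class counts for the 1399 training examples.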
