This project uses a recent spam dataset from Kaggle to build a spam classifier. The text data is preprocessed with techniques including stop-word removal and lemmatization. A pipeline combining CountVectorizer and a Multinomial Naive Bayes classifier achieved an accuracy of 98%. The trained model is then deployed as a FastAPI endpoint for real-time spam classification.
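The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the repository's training code: the four toy messages, the step names `vectorizer` and `classifier`, and the variable `spam_pipeline` are assumptions made for the example, and the real project trains on the Kaggle dataset after preprocessing.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical; stands in for the Kaggle dataset).
messages = [
    "win free prize now",
    "free cash offer click now",
    "are we meeting for lunch today",
    "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer turns each message into word counts;
# MultinomialNB learns per-class word probabilities from those counts.
spam_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
spam_pipeline.fit(messages, labels)

# Classify an unseen message.
prediction = spam_pipeline.predict(["free prize now"])[0]
print(prediction)
```

Wrapping both steps in one `Pipeline` means the same vectorizer vocabulary is reused at prediction time, so raw text can be passed straight to `predict`.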
Technologies used:
- Sklearn
- Pandas
- FastAPI
1. Data Preprocessing
   1.1 Changing the characters to lowercase
   1.2 Tokenization
   1.3 Stemming
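The three preprocessing steps listed above might look like the sketch below. It uses only the standard library so that it runs anywhere; `toy_stem` is a deliberately crude suffix stripper standing in for a real stemmer (for example NLTK's `PorterStemmer`), and the function names and the suffix list are assumptions for illustration, not the project's code.

```python
import re

# Toy suffix list (assumption); a real stemmer applies far more careful rules.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(message):
    """Lowercase the message, split it into tokens, and stem each token."""
    lowered = message.lower()                   # 1.1 lowercase
    tokens = re.findall(r"[a-z0-9]+", lowered)  # 1.2 tokenization
    return [toy_stem(token) for token in tokens]  # 1.3 stemming


tokens = preprocess("Winning PRIZES now!")
print(tokens)
```

The stems are not always dictionary words, which is expected: stemming only needs to map related word forms to the same token.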
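Let $w_1, w_2, \dots, w_n$ be the words in a message.
The probability that the message is spam given the words can be written as:

$$P(\text{spam} \mid w_1, \dots, w_n) = \frac{P(w_1, \dots, w_n \mid \text{spam})\, P(\text{spam})}{P(w_1, \dots, w_n)}$$

If we assume that the occurrence of each word is independent of the other words given the class, the formula can be rewritten as:

$$P(\text{spam} \mid w_1, \dots, w_n) = \frac{P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})}{P(w_1, \dots, w_n)}$$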
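To make the independence assumption concrete, the sketch below computes the spam posterior by hand on a tiny invented corpus. The function names, the toy messages, and the use of add-one (Laplace) smoothing are assumptions for illustration; the project itself relies on scikit-learn's Multinomial Naive Bayes rather than this code.

```python
import math
from collections import Counter


def train(examples):
    """Count classes and per-class word occurrences from (label, tokens) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, tokens in examples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def posterior(tokens, class_counts, word_counts, vocab, alpha=1.0):
    """Return P(class | tokens) for every class under the naive assumption."""
    total = sum(class_counts.values())
    log_scores = {}
    for label, count in class_counts.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + alpha * len(vocab)
        score = math.log(count / total)  # log prior P(class)
        for token in tokens:
            if token in vocab:  # words never seen in training are ignored
                score += math.log((counts[token] + alpha) / denom)
        log_scores[label] = score
    # Normalising plays the role of dividing by P(w_1, ..., w_n).
    top = max(log_scores.values())
    scaled = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(scaled.values())
    return {label: value / norm for label, value in scaled.items()}


# Hypothetical toy corpus.
examples = [
    ("spam", ["win", "free", "prize"]),
    ("spam", ["free", "cash", "now"]),
    ("ham", ["lunch", "at", "noon"]),
    ("ham", ["see", "you", "now"]),
]
model = train(examples)
result = posterior(["free", "prize"], *model)
print(result)
```

Working in log space avoids numerical underflow when a message contains many words, since the product of many small probabilities becomes a sum of logarithms.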
- Clone the project repository: `git clone https://github.com/abdulhakkeempa/spam-detection.git`
- Navigate to the project directory: `cd spam-detection`
- Install the required dependencies: `pip install -r requirements.txt`
- Run the `eda.py` script: `python eda.py`. It generates word cloud images and saves them to an 'images' folder, which you must create in the project directory before running the script. It also displays a bar chart of spam and ham message counts.
- Before running the training script, create a 'model' folder in the project directory.
- Run the `train.py` script: `python train.py`. It preprocesses the data and trains the Naive Bayes model.
- Run the `evaluate.py` script: `python evaluate.py`. It tests the model on the test portion of the dataset.
- To evaluate the model on a custom input message, run the `test.py` script with the `--message` option followed by your message: `python test.py --message="custom input message"`
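The 'images' and 'model' folders required by the steps above can be created ahead of time with a few lines of Python. This is a convenience sketch; the helper name `ensure_output_dirs` is hypothetical and is not part of the repository.

```python
from pathlib import Path


def ensure_output_dirs(project_root=".", names=("images", "model")):
    """Create the folders the scripts expect, if they do not already exist."""
    root = Path(project_root)
    created = []
    for name in names:
        path = root / name
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Usage (run from the project directory):
# ensure_output_dirs()
```

Running it once before `eda.py` and `train.py` avoids errors caused by a missing output folder.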
- Run the FastAPI server using Uvicorn with the `--reload` option: `uvicorn main:app --reload`
- You can test the API by navigating to `localhost:8000/docs` in your web browser.
- Using curl: `curl -X POST "http://localhost:8000/predict" -H "accept: application/json" -H "Content-Type: application/json" -d "{\"message\":\"your message here\"}"`
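The same request can be sent from Python. The sketch below mirrors the curl call using only the standard library; the helper names are hypothetical, and because the response schema is not documented here, the parsed JSON is returned as-is rather than interpreted.

```python
import json
import urllib.request


def build_predict_request(message, base_url="http://localhost:8000"):
    """Build a POST request for /predict with the same headers as the curl call."""
    payload = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def predict(message, base_url="http://localhost:8000"):
    """Send the message to the running server and return the decoded JSON body."""
    request = build_predict_request(message, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the Uvicorn server to be running):
# print(predict("your message here"))
```

Separating request construction from sending keeps the payload easy to inspect without a running server.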