dylansiew/SC1015_Spam_Message_Detection

Spam Buster 3000

About

This is a Mini-Project for SC1015 (Introduction to Data Science and Artificial Intelligence) which focuses on Spam messages from the Spam DataSet. For a detailed walkthrough, please view the source code in the following order:

  1. Data Extraction
  2. Data Visualization
  3. Machine Learning
  4. Resampling and Analysis
  5. Spam Buster 3000
  6. Compiled

Contributors

  • @dylansiew - Model training, final product, slides and script
  • @integr8ti0n - Video
  • @ruochee723 - Data Cleaning and Extraction, slides and script

Problem Definition

  • How can we effectively identify Spam messages with the attributes of text messages?
  • Which model would be the best to predict it? Or can all models be used to predict it?
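As an illustration of what "attributes of text messages" could look like, here is a minimal sketch that turns a raw message into a few numeric and boolean features. The feature names, the `extract_features` helper and the phone-number pattern are assumptions made for this example and are not taken from the project's notebooks.

```python
import re

# Hypothetical pattern: an optional "+", then at least 8 digits that may be
# separated by spaces or dashes. Real phone formats vary; this is only a sketch.
PHONE_RE = re.compile(r"\+?\d[\d\s-]{6,}\d")


def extract_features(message):
    """Return a few simple, hand-picked attributes of a text message."""
    letters = [c for c in message if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    return {
        "length": len(message),
        "has_phone_number": bool(PHONE_RE.search(message)),
        "digit_count": sum(c.isdigit() for c in message),
        # Share of letters that are uppercase; 0.0 when there are no letters.
        "uppercase_ratio": upper / len(letters) if letters else 0.0,
    }


if __name__ == "__main__":
    print(extract_features("WIN a prize! Call 09061701461 now"))
    print(extract_features("See you at 5 pm"))
```

Features like these can sit alongside a vectorized representation of the message text when training a classifier.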

Models Used

  1. Naive Bayes
  2. Support Vector Machine
  3. Random Forest Classifier
  4. Logistic Regression
  5. Model Ensemble
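The README does not state how the Model Ensemble combines the four classifiers. One common approach is hard majority voting, sketched below in plain Python. The `majority_vote` function, the `tie_label` parameter and the label strings `"spam"` and `"ham"` are assumptions for illustration. With four voters a 2-2 tie is possible, so the sketch needs an explicit tie rule.

```python
from collections import Counter


def majority_vote(predictions, tie_label="spam"):
    """Combine per-model labels for one message by hard majority voting.

    predictions: list of labels, one per model (e.g. from NB, SVM, RFC, LR).
    tie_label:   label returned when the top two labels get equal votes
                 (an assumption; pick whichever error is cheaper for you).
    """
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]


if __name__ == "__main__":
    # One label per model for a single incoming message.
    print(majority_vote(["spam", "ham", "spam", "spam"]))
    print(majority_vote(["spam", "ham", "spam", "ham"], tie_label="ham"))
```

scikit-learn offers `VotingClassifier` for the same idea applied directly to fitted estimators.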

Product

The ultimate spam detection tool designed to keep your inbox free of unwanted and harmful messages. With its powerful model ensemble of Naive Bayes, SVM, RFC, and Logistic Regression, Spam Buster 3000 constantly updates its dataframe with every new user input to improve accuracy and robustness over time. Its advanced algorithms allow for a comprehensive analysis of incoming data, ensuring that no spam goes unnoticed. Say goodbye to spam once and for all with Spam Buster 3000.

Conclusion

  • All models performed well when predicting Spam messages, with low false negative and false positive rates
  • Support Vector Machine performed the best of the 4 individual models (97.7% accuracy), and there is a logistic correlation between the presence of phone numbers and a message being Spam
  • Running K-Fold resampling on the models produced a more accurate measure of their performance
  • Model Ensemble performed better than each of the 4 individual models, as it combines all of them (98.4% accuracy)
  • It is possible to predict Spam messages with sufficiently large datasets for the models to train on.
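To make the K-Fold point concrete, here is a minimal sketch of how K-Fold splits a dataset so that every sample is used for testing exactly once, with the final score averaged over the folds. The helper names `k_fold_indices` and `cross_validate` are hypothetical, and the README does not say which `k` or which splitting strategy the project used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds, without shuffling."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample each.
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


def cross_validate(evaluate, n_samples, k=5):
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for train_idx, test_idx in k_fold_indices(10, 3):
        print("train:", train_idx, "test:", test_idx)
```

In practice the data is usually shuffled first. For an imbalanced Spam dataset, a stratified variant such as scikit-learn's `StratifiedKFold` keeps the spam-to-ham ratio similar in every fold.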

What did we learn from this project?

  • Handling imbalanced datasets using resampling methods like K-Fold
  • Logistic Regression, Naive Bayes, SVC and RandomForestClassifier from sklearn
  • Other packages such as tqdm, Figlet and Wordcloud
  • Collaborating using GitHub
  • Concepts about Accuracy, Vectorizing, and F1 Score
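The difference between Accuracy and F1 Score matters most on imbalanced data, where a model can score high accuracy while missing every Spam message. The sketch below computes both from lists of true and predicted labels. The `classification_metrics` function and the `"spam"`/`"ham"` label strings are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # 9 ham and 1 spam; the "model" predicts ham for everything.
    y_true = ["ham"] * 9 + ["spam"]
    y_pred = ["ham"] * 10
    print(classification_metrics(y_true, y_pred))
```

In the toy data above, always predicting "ham" is right on 9 of 10 messages but never catches the one Spam message, so accuracy looks strong while F1 is zero. scikit-learn provides `accuracy_score` and `f1_score` for the same measurements.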
