A Natural Language Processing (NLP) project that normalizes noisy, informal Malaysian tweets (containing slang, abbreviations, and misspellings in both Malay and English) into standard, formal text, either in English or Malay.
This project utilizes fine-tuned T5 transformer models and a comprehensive preprocessing pipeline with language-specific dictionaries to achieve high-quality normalization. It includes an interactive Gradio web interface for testing and deployment.
Built as a university group project for WID3002 Natural Language Processing at Universiti Malaya.
- Bilingual Normalization: Preprocessing dictionaries and models natively support both Malay and English tweets.
- Preprocessing: Utilizes custom dictionaries for Malaysian slang, abbreviations, and common misspellings.
- Fine-Tuning: Built on top of the mesolitica/t5-base-standard-bahasa-cased architecture for an optimal understanding of the local context.
- Interactive UI: Includes a Gradio web application for real-time inference and demonstration.
- Frameworks & Libraries: PyTorch, Hugging Face Transformers, Gradio, Pandas, NumPy
- NLP Tools: Malaya, NLTK, FastText, PyEnchant, Emoji
- Model Architecture: T5 (Text-to-Text Transfer Transformer)
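The dictionary-based preprocessing mentioned in the feature list can be illustrated with a minimal sketch. The lexicon entries and the `expand_slang` helper below are hypothetical examples written for this illustration; the project's actual dictionaries and preprocessing code live in the notebook and may work differently.

```python
import re

# Hypothetical sample entries for illustration only; the project's real
# dictionaries are larger and are not reproduced here.
MALAY_SLANG = {"yg": "yang", "sy": "saya", "x": "tidak", "dgn": "dengan"}
ENGLISH_SLANG = {"u": "you", "pls": "please", "tmr": "tomorrow", "bcs": "because"}

WORD_RE = re.compile(r"\b\w+\b")


def expand_slang(text: str, lexicon: dict) -> str:
    """Replace each word found in the lexicon with its standard form.

    Matching is case-insensitive; words that are not in the lexicon,
    as well as punctuation and spacing, are left untouched.
    """
    def lookup(match: re.Match) -> str:
        word = match.group(0)
        return lexicon.get(word.lower(), word)

    return WORD_RE.sub(lookup, text)


print(expand_slang("sy x suka yg ni", MALAY_SLANG))
print(expand_slang("pls call u tmr!", ENGLISH_SLANG))
```

A lookup like this only handles tokens that appear verbatim in the dictionary, which is presumably why the pipeline pairs it with a fine-tuned model for everything the dictionaries miss.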
Due to GitHub's file size limits, the fine-tuned models and datasets are hosted on the Hugging Face Hub.
You can load these directly in your code using the transformers library (e.g., AutoModelForSeq2SeqLM.from_pretrained(...)):
- English Normalization Model: Wrynaft/t5-tweet-normalizer-en
- Malay Normalization Model: Wrynaft/t5-tweet-normalizer-ms
- Training Datasets: Wrynaft/tweet-normalization-datasets
  - Origin: Sourced from the Mesolitica Malaysian Dataset (ChatGPT 3.5 Noisy Twitter).
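As a rough sketch of loading the models listed above, the snippet below wraps the standard `AutoTokenizer` and `AutoModelForSeq2SeqLM` calls from transformers. The helper names (`repo_for`, `load_normalizer`, `normalize`) and the `max_new_tokens` default are assumptions made for this example. It also assumes that each Hub repository ships its tokenizer files and that the models take the raw tweet as input without a task prefix; check the notebook for the exact input format.

```python
# Repository IDs as listed in this README.
MODEL_REPOS = {
    "en": "Wrynaft/t5-tweet-normalizer-en",
    "ms": "Wrynaft/t5-tweet-normalizer-ms",
}


def repo_for(target_lang: str) -> str:
    """Return the Hub repository ID for a target language code ('en' or 'ms')."""
    try:
        return MODEL_REPOS[target_lang]
    except KeyError:
        raise ValueError(f"Unsupported target language: {target_lang!r}") from None


def load_normalizer(target_lang: str):
    """Download (or load from cache) the tokenizer and model for one target language."""
    # Imported here so the lookup helper above works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    repo = repo_for(target_lang)
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForSeq2SeqLM.from_pretrained(repo)
    model.eval()
    return tokenizer, model


def normalize(text: str, tokenizer, model, max_new_tokens: int = 64) -> str:
    """Generate a normalized version of `text` (assumes no task prefix is needed)."""
    import torch

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Typical use would be `tokenizer, model = load_normalizer("ms")` followed by `normalize(tweet, tokenizer, model)`. The first call downloads the weights from the Hugging Face Hub, so it needs network access.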
- MsianTweetNormalizer.ipynb: The main application notebook containing the final end-to-end pipeline (preprocessing, model inference, post-processing) and the Gradio web interface.
- fineTune.ipynb: The notebook used for fine-tuning the T5 model for the Malay target.
- fineTune_English.ipynb: The notebook used for fine-tuning the T5 model for the English target.
- Group5_Report.pdf: Comprehensive project report detailing the methodology, architecture, and results.
Ensure you have the following packages installed:
pip install pandas numpy torch transformers huggingface_hub gradio malaya fasttext-wheel pyenchant nltk emoji
- Clone this repository.
- Open msiantweetnormalizer1.ipynb in Jupyter Notebook, VS Code, or Google Colab.
- Update the local file paths in the model loading cells to point directly to the Hugging Face repos (Wrynaft/t5-tweet-normalizer-en and Wrynaft/t5-tweet-normalizer-ms).
- Run all cells in the notebook.
- The final cell will launch a Gradio web interface locally. Click the provided URL to interact with the Normalizer!
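For readers who want a sense of what the final Gradio cell might look like, here is a minimal sketch. It is not the notebook's actual code: `normalize_tweet` is a placeholder that only collapses whitespace and tags the target language, and the component labels and title are invented for this example. In the real application the function would run the preprocessing, model inference, and post-processing steps.

```python
def normalize_tweet(text: str, target: str) -> str:
    """Placeholder for the real pipeline.

    Only collapses repeated whitespace and prefixes the chosen target
    language, so the interface can be tried without loading any model.
    """
    cleaned = " ".join(text.split())
    return f"[{target}] {cleaned}"


def build_demo():
    """Assemble a simple two-input Gradio interface around normalize_tweet."""
    # Imported here so normalize_tweet can be used without Gradio installed.
    import gradio as gr

    return gr.Interface(
        fn=normalize_tweet,
        inputs=[
            gr.Textbox(label="Noisy tweet", lines=3),
            gr.Radio(choices=["English", "Malay"], value="Malay", label="Target language"),
        ],
        outputs=gr.Textbox(label="Normalized text"),
        title="Malaysian Tweet Normalizer",
    )
```

Calling `build_demo().launch()` starts a local server and prints the URL to open in a browser.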
- Ryan Chin Jian Hwa
- Chong Jia Ying
- Ang Li Jia
- Chua Hui Ying Nicole