
Malaysian Tweet Normalizer

A Natural Language Processing (NLP) project that normalizes noisy, informal Malaysian tweets (containing slang, abbreviations, and misspellings in both Malay and English) into standard, formal text in either English or Malay.

This project utilizes fine-tuned T5 transformer models and a comprehensive preprocessing pipeline with language-specific dictionaries to achieve high-quality normalization. It includes an interactive Gradio web interface for testing and deployment.

Built as a university group project for WID3002 Natural Language Processing at Universiti Malaya.


Features

  • Bilingual Normalization: Preprocessing dictionaries and models natively support both Malay and English tweets.
  • Preprocessing: Utilizes custom dictionaries for Malaysian slang, abbreviations, and common misspellings.
  • Fine-Tuning: Fine-tuned from the mesolitica/t5-base-standard-bahasa-cased checkpoint for a strong grasp of the local Malaysian context.
  • Interactive UI: Includes a Gradio web application for real-time inference and demonstration.
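The dictionary-based preprocessing step can be sketched as follows. This is a minimal, illustrative version: the replacement entries and the function name `apply_dictionary` are hypothetical, and the real slang/abbreviation dictionaries live in the project notebooks.

```python
# Hypothetical sample entries; the project's actual dictionaries are far larger.
SLANG_MS = {"x": "tidak", "mcm": "macam", "sgt": "sangat"}

def apply_dictionary(text: str, mapping: dict) -> str:
    """Replace each known slang/abbreviation token with its standard form."""
    tokens = text.lower().split()
    return " ".join(mapping.get(tok, tok) for tok in tokens)

print(apply_dictionary("x sgt best", SLANG_MS))  # -> "tidak sangat best"
```

In the full pipeline this lookup runs before the T5 model sees the text, so the model only has to handle spelling and grammar that the dictionaries cannot cover.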

Tech Stack

  • Frameworks & Libraries: PyTorch, Hugging Face Transformers, Gradio, Pandas, NumPy
  • NLP Tools: Malaya, NLTK, FastText, PyEnchant, Emoji
  • Model Architecture: T5 (Text-to-Text Transfer Transformer)

Models and Datasets

Due to GitHub's file size limits, the fine-tuned models and datasets are hosted on the Hugging Face Hub.

Pre-trained Models

You can load these directly in your code using the transformers library (e.g., AutoModelForSeq2SeqLM.from_pretrained(...)):

  • Wrynaft/t5-tweet-normalizer-en (English target)
  • Wrynaft/t5-tweet-normalizer-ms (Malay target)
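A minimal loading-and-inference sketch, assuming the Hugging Face repo ids given under Running the Application (the helper names `load_normalizer` and `normalize` are illustrative, not part of the project's API):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Repo ids from the Running the Application section of this README.
MODEL_IDS = {
    "en": "Wrynaft/t5-tweet-normalizer-en",
    "ms": "Wrynaft/t5-tweet-normalizer-ms",
}

def load_normalizer(target_lang: str):
    """Download (or load from cache) the fine-tuned T5 model and tokenizer."""
    repo_id = MODEL_IDS[target_lang]
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)
    return tokenizer, model

def normalize(text: str, tokenizer, model, max_length: int = 64) -> str:
    """Run one tweet through the seq2seq model and decode the result."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Note that the first call downloads the model weights from the Hub, so it needs network access (or a pre-populated local cache).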

Datasets


Project Structure

  • MsianTweetNormalizer.ipynb: The main application notebook containing the final end-to-end pipeline (preprocessing, model inference, post-processing) and the Gradio web interface.
  • fineTune.ipynb: The notebook used for fine-tuning the T5 model for the Malay target.
  • fineTune_English.ipynb: The notebook used for fine-tuning the T5 model for the English target.
  • Group5_Report.pdf: Comprehensive project report detailing the methodology, architecture, and results.

Setup and Usage

Prerequisites

Ensure you have the following packages installed:

pip install pandas numpy torch transformers huggingface_hub gradio malaya fasttext-wheel pyenchant nltk emoji

Running the Application

  1. Clone this repository.
  2. Open MsianTweetNormalizer.ipynb in Jupyter Notebook, VS Code, or Google Colab.
  3. Update the local file paths in the model loading cells to point directly to the Hugging Face repos (Wrynaft/t5-tweet-normalizer-en and Wrynaft/t5-tweet-normalizer-ms).
  4. Run all cells in the notebook.
  5. The final cell will launch a Gradio web interface locally. Click the provided URL to interact with the Normalizer!

Contributors

  • Ryan Chin Jian Hwa
  • Chong Jia Ying
  • Ang Li Jia
  • Chua Hui Ying Nicole
