Malaysian Tweet Normalizer

A Natural Language Processing (NLP) project that normalizes noisy, informal Malaysian tweets (containing slang, abbreviations, and misspellings in both Malay and English) into standard, formal text, either in English or Malay.

The project combines fine-tuned T5 transformer models with a preprocessing pipeline built on language-specific dictionaries. It also includes an interactive Gradio web interface for testing and demonstration.

Built as a university group project for WID3002 Natural Language Processing at Universiti Malaya.

Features

Bilingual Normalization: Preprocessing dictionaries and models natively support both Malay and English tweets.
Preprocessing: Uses custom dictionaries for Malaysian slang, abbreviations, and common misspellings.
Fine-Tuning: Fine-tuned from the mesolitica/t5-base-standard-bahasa-cased checkpoint, chosen for its coverage of the local language context.
Interactive UI: Includes a Gradio web application for real-time inference and demonstration.
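
The dictionary-based preprocessing step can be pictured as a token-level lookup. The sketch below is only an illustration of that idea: the SLANG_MS entries and the expand_slang helper are hypothetical and are not taken from the project's actual dictionaries or notebook code.

```python
import re
from typing import Mapping

# Illustrative entries only (assumption); the project's real dictionaries
# are larger and are not reproduced here.
SLANG_MS = {
    "yg": "yang",
    "sgt": "sangat",
    "dgn": "dengan",
    "tak": "tidak",
}

# Match whole words so punctuation and spacing are left untouched.
WORD_RE = re.compile(r"\b\w+\b")


def expand_slang(text: str, lexicon: Mapping[str, str]) -> str:
    """Replace each word found in the lexicon with its standard form."""
    def lookup(match):
        word = match.group(0)
        # Lookup is case-insensitive; unknown words pass through unchanged.
        return lexicon.get(word.lower(), word)

    return WORD_RE.sub(lookup, text)


print(expand_slang("Dia sgt baik dgn saya.", SLANG_MS))
```

A lookup like this only handles tokens that appear in the dictionary; anything else is left for the fine-tuned model to resolve.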

Tech Stack

Frameworks & Libraries: PyTorch, Hugging Face Transformers, Gradio, Pandas, NumPy
NLP Tools: Malaya, NLTK, FastText, PyEnchant, Emoji
Model Architecture: T5 (Text-to-Text Transfer Transformer)

Models and Datasets

Due to GitHub's file size limits, the fine-tuned models and datasets are hosted on the Hugging Face Hub.

Pre-trained Models

You can load these directly in your code using the transformers library (e.g., AutoModelForSeq2SeqLM.from_pretrained(...)):

English Normalization Model: Wrynaft/t5-tweet-normalizer-en
Malay Normalization Model: Wrynaft/t5-tweet-normalizer-ms
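
As a rough sketch of how the two repos might be loaded and queried with transformers: the helper names load_normalizer and normalize, the max_new_tokens value, and the choice to pass the raw tweet without any task prefix are all assumptions for illustration. It also assumes each repo ships its tokenizer files alongside the model weights.

```python
MODEL_IDS = {
    "en": "Wrynaft/t5-tweet-normalizer-en",
    "ms": "Wrynaft/t5-tweet-normalizer-ms",
}


def load_normalizer(target_lang: str):
    """Return (tokenizer, model) for one target language: 'en' or 'ms'."""
    repo_id = MODEL_IDS[target_lang]  # raises KeyError for unknown languages
    # Imported here so the mapping above can be used without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)
    return tokenizer, model


def normalize(text: str, tokenizer, model, max_new_tokens: int = 64) -> str:
    """Generate normalized text for a single tweet."""
    # Assumption: the model takes the raw tweet as input with no prefix.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads the model on first call):
# tokenizer, model = load_normalizer("ms")
# print(normalize("sy xtau la nk buat ape", tokenizer, model))
```

The first call to from_pretrained downloads the files to the local Hugging Face cache; later calls reuse the cached copy.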

Datasets

Training Datasets: Wrynaft/tweet-normalization-datasets
- Origin: Sourced from the Mesolitica Malaysian Dataset (ChatGPT 3.5 Noisy Twitter).

Project Structure

MsianTweetNormalizer.ipynb: The main application notebook containing the final end-to-end pipeline (preprocessing, model inference, post-processing) and the Gradio web interface.
fineTune.ipynb: The notebook used for fine-tuning the T5 model for the Malay target.
fineTune_English.ipynb: The notebook used for fine-tuning the T5 model for the English target.
Group5_Report.pdf: Comprehensive project report detailing the methodology, architecture, and results.

Setup and Usage

Prerequisites

Ensure you have the following packages installed:

pip install pandas numpy torch transformers huggingface_hub gradio malaya fasttext-wheel pyenchant nltk emoji

Running the Application

Clone this repository.
Open MsianTweetNormalizer.ipynb in Jupyter Notebook, VS Code, or Google Colab.
In the model loading cells, replace the local file paths with the Hugging Face repo IDs (Wrynaft/t5-tweet-normalizer-en and Wrynaft/t5-tweet-normalizer-ms) so the models are downloaded from the Hub.
Run all cells in the notebook.
The final cell will launch a Gradio web interface locally. Click the provided URL to interact with the Normalizer!
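
For readers who want a picture of what the final cell does, here is a minimal sketch of a Gradio interface of this kind. It is not the notebook's actual code: normalize_tweet is a placeholder that only collapses whitespace, and the labels, the language selector, and the build_demo helper are assumptions.

```python
def normalize_tweet(text: str, target_lang: str) -> str:
    """Placeholder for the real pipeline (preprocessing, model, post-processing)."""
    # Assumption: the notebook would call its own pipeline here.
    return " ".join(text.split())


def build_demo():
    """Build the web UI; gradio is imported only when the demo is needed."""
    import gradio as gr

    return gr.Interface(
        fn=normalize_tweet,
        inputs=[
            gr.Textbox(label="Noisy tweet", lines=3),
            gr.Radio(choices=["en", "ms"], value="ms", label="Target language"),
        ],
        outputs=gr.Textbox(label="Normalized text"),
        title="Malaysian Tweet Normalizer",
    )


# To serve the interface locally and print its URL:
# build_demo().launch()
```

Calling launch() starts a local server and prints the address to open in a browser.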

Contributors

Ryan Chin Jian Hwa
Chong Jia Ying
Ang Li Jia
Chua Hui Ying Nicole
