Skip to content

implementation of Arabic Text Summarization using AraBERT. Leveraging the capabilities of AraBERT, the project provides a fine-tuned model for generating concise and coherent summaries in Arabic.

Notifications You must be signed in to change notification settings

aliasharf123/arabic-summarization-with-araBert

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 

Repository files navigation

Arabic Text Summarization with AraBert Project Report

Executive Summary

The "Arabic Text Summarization with AraBERT" project aimed to harness the power of AraBERT, a transformer-based model, for summarizing Arabic language text. AraBERT was fine-tuned on an Arabic summarization dataset to produce concise and coherent summaries.

Objectives

  1. Explore AraBERT Architecture: Investigate the architecture of AraBERT and its applicability to Arabic text summarization.
  2. Fine-Tuning Process: Describe the process of fine-tuning AraBERT on an Arabic summarization dataset.
  3. Performance Evaluation: Evaluate the performance of the fine-tuned model on a validation dataset.
  4. Case Studies: Present case studies showcasing the model's effectiveness in summarizing diverse Arabic text.

Definitions

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE is a set of metrics used for evaluating the quality of summaries by comparing them to reference summaries. It includes measures such as ROUGE-N (unigrams, bigrams, etc.), ROUGE-L (longest common subsequence), and ROUGE-W (weighted longest common subsequence). These metrics provide a quantitative assessment of the overlap and content similarity between generated and reference summaries.

AraBERT (Arabic BERT)

AraBERT is a transformer-based model introduced for processing Arabic language text. It has been fine-tuned for various natural language processing tasks, including summarization, by leveraging transfer learning.

Methodology

Data Collection

The project utilized the csebuetnlp/xlsum dataset for Arabic summarization. This dataset consists of a variety of documents with corresponding summaries.

Preprocessing

Data preprocessing involved tokenization using the AraBERT tokenizer, handling input-output sequences, and ensuring the proper format for training.

Fine-Tuning

AraBERT was fine-tuned on the Arabic summarization dataset using a sequence-to-sequence training approach.

Results and Findings

Model Evaluation

ROUGE Scores

The model's performance was evaluated using various ROUGE metrics, including ROUGE-N, ROUGE-L, and ROUGE-W. The following table presents the key ROUGE scores:

Metric Score
ROUGE-1 2.3202
ROUGE-2 0.0977
ROUGE-L 2.3113

About

implementation of Arabic Text Summarization using AraBERT. Leveraging the capabilities of AraBERT, the project provides a fine-tuned model for generating concise and coherent summaries in Arabic.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published