<a href="https://colab.research.google.com/github/aneeq-shaffy/SE4050-Deep-Learning/blob/main/mbert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Create a Google Colab notebook named "mBERT" that trains a multilingual BERT model for Sri Lankan news category classification using the dataset from "https://github.com/nuuuwan/lk_news". The notebook should include sections for setup, dataset loading and processing, label generation, text cleaning, data preparation for model training, and a placeholder for model training. The notebook should be fully runnable and include clear explanations for each step.

## Notebook setup

### Subtask:
Configure the notebook with a title, GPU runtime detection, library installations, optional Google Drive mounting, and necessary package imports.


**Reasoning**:
The subtask requires setting up the notebook environment, including checking for GPU, installing libraries, and importing packages. These steps can be grouped into a single code block for efficiency. Also, adding a title is a manual step in Colab settings and cannot be done via code.



In [1]:
# Check for GPU (the Hugging Face Trainer imported below runs on PyTorch)
import torch
gpu_available = torch.cuda.is_available()
if gpu_available:
    print("GPU is available.")
else:
    print("No GPU available.")

# Install necessary libraries (scikit-learn is used later for the train/test split)
!pip install transformers datasets pandas scikit-learn

# Optional Google Drive mounting (uncomment if needed)
# from google.colab import drive
# drive.mount('/content/drive')
# print("Google Drive mounted.")

# Import required packages
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset, load_dataset

No GPU available.


## Dataset preparation

### Subtask:
Clone the dataset repository, explore the data structure, combine relevant files into a pandas DataFrame, and select required columns.


**Reasoning**:
Clone the dataset repository and explore its structure.



In [2]:
import os

# Clone the repository
!git clone https://github.com/nuuuwan/lk_news.git

# Explore the directory structure
!ls -R lk_news/

Cloning into 'lk_news'...
remote: Enumerating objects: 253620, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 253620 (delta 0), reused 1 (delta 0), pack-reused 253616 (from 4)
Receiving objects: 100% (253620/253620), 876.17 MiB | 16.82 MiB/s, done.
Resolving deltas: 100% (16166/16166), done.
lk_news/:
custom_summary.json  push_to_news_lk3.sh  requirements.txt  tests
LICENSE		     README.md		  src		    workflows

lk_news/src:
lk_news  news_lk3

lk_news/src/lk_news:
__init__.py  NewsArticle.py  NewspaperFactory.py

lk_news/src/news_lk3:
core  custom_newspapers  __init__.py

lk_news/src/news_lk3/core:
AbstractNewsPaper.py  article  __init__.py

lk_news/src/news_lk3/core/article:
ArticleBase.py	Article.py  __init__.py

lk_news/src/news_lk3/custom_newspapers:
AdaDeranaLk.py		DailyFtLk.py	   __init__.py
AdaDeranaSinhalaLk.py	DailyMirrorLk.py   IslandLk.py
AdaLk.py		DailyNewsLk.py	   LankadeepaLk.py
BBCComSinhala.py	D

**Reasoning**:
Read the JSON files into pandas DataFrames, concatenate them, and select the required columns.



In [3]:
import json
import glob

all_files = glob.glob('lk_news/data/*/*/*.json')

list_of_dfs = []

for file_path in all_files:
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
            # Check if data is a list and not empty, then create DataFrame
            if isinstance(data, list) and data:
                df_temp = pd.DataFrame(data)
                list_of_dfs.append(df_temp)
            # If data is a dictionary, convert to list and create DataFrame
            elif isinstance(data, dict):
                df_temp = pd.DataFrame([data])
                list_of_dfs.append(df_temp)
            else:
                print(f"Skipping file {file_path}: Data is not a list or dictionary or is empty.")

    except json.JSONDecodeError:
        print(f"Skipping file {file_path}: JSON decode error.")
    except Exception as e:
        print(f"Skipping file {file_path}: An error occurred - {e}")


if list_of_dfs:
    df_combined = pd.concat(list_of_dfs, ignore_index=True)
    df = df_combined[['url', 'text']]
    display(df.head())
else:
    print("No dataframes were created. The combined dataframe is empty.")
    df = pd.DataFrame(columns=['url', 'text']) # Create an empty dataframe with required columns


No dataframes were created. The combined dataframe is empty.
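
When a glob pattern matches nothing, it can help to confirm whether the target directory exists at all before adjusting the pattern. The following is a minimal sketch, assuming the repository was cloned into `lk_news/`; the helper name `count_files_by_suffix` is hypothetical and not part of the original notebook.

```python
from collections import Counter
from pathlib import Path


def count_files_by_suffix(root):
    """Count files under `root` by extension, skipping the .git directory."""
    root_path = Path(root)
    counts = Counter()
    if not root_path.is_dir():
        return counts  # Missing directory: nothing to count
    for path in root_path.rglob("*"):
        # Ignore Git's internal object store
        if ".git" in path.relative_to(root_path).parts:
            continue
        if path.is_file():
            counts[path.suffix or "(none)"] += 1
    return counts


# Assumed clone location from the previous cell
print("data directory exists:", Path("lk_news/data").is_dir())
print(count_files_by_suffix("lk_news").most_common(5))
```

If the `data` directory is reported as missing, no variation of a `lk_news/data/...` pattern can match, and the search has to move elsewhere.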


**Reasoning**:
The previous attempt to load data failed because the file path pattern was incorrect. Explore the directory structure again to find the correct path to the data files, then attempt to load and process the data again.



In [4]:
# Explore the directory structure again to find the correct data path
!ls -R lk_news/

# The listing above shows no 'data' directory at the top level of the clone.
# As a fallback, try a glob pattern one level deeper under lk_news/data/.

all_files = glob.glob('lk_news/data/*/*/*/*.json')

list_of_dfs = []

for file_path in all_files:
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
            # Check if data is a list and not empty, then create DataFrame
            if isinstance(data, list) and data:
                df_temp = pd.DataFrame(data)
                list_of_dfs.append(df_temp)
            # If data is a dictionary, convert to list and create DataFrame
            elif isinstance(data, dict):
                df_temp = pd.DataFrame([data])
                list_of_dfs.append(df_temp)
            else:
                print(f"Skipping file {file_path}: Data is not a list or dictionary or is empty.")

    except json.JSONDecodeError:
        print(f"Skipping file {file_path}: JSON decode error.")
    except Exception as e:
        print(f"Skipping file {file_path}: An error occurred - {e}")


if list_of_dfs:
    df_combined = pd.concat(list_of_dfs, ignore_index=True)
    df = df_combined[['url', 'text']]
    display(df.head())
else:
    print("No dataframes were created. The combined dataframe is empty.")
    df = pd.DataFrame(columns=['url', 'text']) # Create an empty dataframe with required columns

lk_news/:
custom_summary.json  push_to_news_lk3.sh  requirements.txt  tests
LICENSE		     README.md		  src		    workflows

lk_news/src:
lk_news  news_lk3

lk_news/src/lk_news:
__init__.py  NewsArticle.py  NewspaperFactory.py

lk_news/src/news_lk3:
core  custom_newspapers  __init__.py

lk_news/src/news_lk3/core:
AbstractNewsPaper.py  article  __init__.py

lk_news/src/news_lk3/core/article:
ArticleBase.py	Article.py  __init__.py

lk_news/src/news_lk3/custom_newspapers:
AdaDeranaLk.py		DailyFtLk.py	   __init__.py
AdaDeranaSinhalaLk.py	DailyMirrorLk.py   IslandLk.py
AdaLk.py		DailyNewsLk.py	   LankadeepaLk.py
BBCComSinhala.py	DBSJeyarajCom.py   NewsFirstLk.py
CeylonTodayLk.py	DivainaLk.py	   TamilMirrorLk.py
ColomboTelegraphCom.py	EconomyNextCom.py  VirakesariLk.py

lk_news/tests:
test_newspapers.py

lk_news/workflows:
one_off_backpopulate.py  pipeline.py
No dataframes were created. The combined dataframe is empty.


**Reasoning**:
The previous attempt to load data still failed, indicating the file path pattern is likely still incorrect. Explore the directory structure again, specifically within the `lk_news/` directory, to pinpoint where the actual data (JSON files) is located.



In [5]:
# Explore the lk_news directory more thoroughly
!ls -R lk_news/

# Based on the output, it seems the data is not directly in a 'data' directory.
# Let's check if the JSON files are within the 'src' directory or elsewhere.
# We can use a more general glob pattern to find any JSON files within lk_news/
all_files = glob.glob('lk_news/**/*.json', recursive=True)

list_of_dfs = []

for file_path in all_files:
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
            # Check if data is a list and not empty, then create DataFrame
            if isinstance(data, list) and data:
                df_temp = pd.DataFrame(data)
                list_of_dfs.append(df_temp)
            # If data is a dictionary, convert to list and create DataFrame
            elif isinstance(data, dict):
                df_temp = pd.DataFrame([data])
                list_of_dfs.append(df_temp)
            else:
                print(f"Skipping file {file_path}: Data is not a list or dictionary or is empty.")

    except json.JSONDecodeError:
        print(f"Skipping file {file_path}: JSON decode error.")
    except Exception as e:
        print(f"Skipping file {file_path}: An error occurred - {e}")


if list_of_dfs:
    df_combined = pd.concat(list_of_dfs, ignore_index=True)
    # Check if 'url' and 'text' columns exist before selecting
    required_columns = ['url', 'text']
    if all(col in df_combined.columns for col in required_columns):
        df = df_combined[required_columns]
        display(df.head())
    else:
        missing = [col for col in required_columns if col not in df_combined.columns]
        print(f"Combined DataFrame is missing required columns: {missing}")
        df = pd.DataFrame(columns=required_columns) # Create an empty dataframe with required columns
else:
    print("No dataframes were created. The combined dataframe is empty.")
    df = pd.DataFrame(columns=['url', 'text']) # Create an empty dataframe with required columns


lk_news/:
custom_summary.json  push_to_news_lk3.sh  requirements.txt  tests
LICENSE		     README.md		  src		    workflows

lk_news/src:
lk_news  news_lk3

lk_news/src/lk_news:
__init__.py  NewsArticle.py  NewspaperFactory.py

lk_news/src/news_lk3:
core  custom_newspapers  __init__.py

lk_news/src/news_lk3/core:
AbstractNewsPaper.py  article  __init__.py

lk_news/src/news_lk3/core/article:
ArticleBase.py	Article.py  __init__.py

lk_news/src/news_lk3/custom_newspapers:
AdaDeranaLk.py		DailyFtLk.py	   __init__.py
AdaDeranaSinhalaLk.py	DailyMirrorLk.py   IslandLk.py
AdaLk.py		DailyNewsLk.py	   LankadeepaLk.py
BBCComSinhala.py	DBSJeyarajCom.py   NewsFirstLk.py
CeylonTodayLk.py	DivainaLk.py	   TamilMirrorLk.py
ColomboTelegraphCom.py	EconomyNextCom.py  VirakesariLk.py

lk_news/tests:
test_newspapers.py

lk_news/workflows:
one_off_backpopulate.py  pipeline.py
Combined DataFrame is missing required columns: ['url', 'text']
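
The clone above transferred roughly 876 MiB, yet the checked-out tree contains only source files. One possible explanation, stated here as an assumption rather than a confirmed fact, is that the article data lives on a branch other than the default one. A minimal sketch for listing the remote branches follows; `parse_remote_branches` and `list_remote_branches` are hypothetical helper names.

```python
import os
import subprocess


def parse_remote_branches(git_output):
    """Turn the text printed by `git branch -r` into a list of branch names."""
    branches = []
    for line in git_output.splitlines():
        name = line.strip()
        # Skip blank lines and the symbolic "origin/HEAD -> origin/..." entry
        if not name or "->" in name:
            continue
        branches.append(name)
    return branches


def list_remote_branches(repo_dir):
    """Run `git branch -r` inside `repo_dir` and return the branch names."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "branch", "-r"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_remote_branches(result.stdout)


# Only runs when the clone from the earlier cell is present
if os.path.isdir("lk_news/.git"):
    print(list_remote_branches("lk_news"))
```

If a branch holding the data shows up, checking it out with `git -C lk_news checkout <branch>` would make its files visible to the glob patterns used above.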


**Reasoning**:
The previous attempt to find data was unsuccessful. The `ls -R` output confirms that the `data` directory is not present directly in the cloned repository. It is likely that the data needs to be generated by running one of the scripts within the repository. The `workflows/pipeline.py` and `workflows/one_off_backpopulate.py` seem like potential candidates for data generation. Let's try running `workflows/one_off_backpopulate.py` as it might generate the data needed for the task.



In [6]:
# Run the one_off_backpopulate.py script to potentially generate data
!python lk_news/workflows/one_off_backpopulate.py

# After running the script, check the directory structure again to see if a 'data' directory was created.
!ls -R lk_news/

Traceback (most recent call last):
  File "/content/lk_news/workflows/one_off_backpopulate.py", line 4, in <module>
    from utils import JSONFile, Log
ModuleNotFoundError: No module named 'utils'
lk_news/:
custom_summary.json  push_to_news_lk3.sh  requirements.txt  tests
LICENSE		     README.md		  src		    workflows

lk_news/src:
lk_news  news_lk3

lk_news/src/lk_news:
__init__.py  NewsArticle.py  NewspaperFactory.py

lk_news/src/news_lk3:
core  custom_newspapers  __init__.py

lk_news/src/news_lk3/core:
AbstractNewsPaper.py  article  __init__.py

lk_news/src/news_lk3/core/article:
ArticleBase.py	Article.py  __init__.py

lk_news/src/news_lk3/custom_newspapers:
AdaDeranaLk.py		DailyFtLk.py	   __init__.py
AdaDeranaSinhalaLk.py	DailyMirrorLk.py   IslandLk.py
AdaLk.py		DailyNewsLk.py	   LankadeepaLk.py
BBCComSinhala.py	DBSJeyarajCom.py   NewsFirstLk.py
CeylonTodayLk.py	DivainaLk.py	   TamilMirrorLk.py
ColomboTelegraphCom.py	EconomyNextCom.py  VirakesariLk.py

lk_news/tests:
test_newspapers

**Reasoning**:
The `one_off_backpopulate.py` script failed due to a `ModuleNotFoundError: No module named 'utils'`. This indicates that the required dependencies are not installed or the script is not being run in the correct environment. The repository has a `requirements.txt` file, which likely contains the necessary dependencies. Install the dependencies listed in `requirements.txt` and then try running the script again.



In [7]:
# Install dependencies from requirements.txt
!pip install -r lk_news/requirements.txt

# After installing dependencies, try running the one_off_backpopulate.py script again.
!python lk_news/workflows/one_off_backpopulate.py

# After running the script, check the directory structure again to see if a 'data' directory was created.
!ls -R lk_news/

Collecting scraper-nuuuwan (from -r lk_news/requirements.txt (line 1))
  Downloading scraper_nuuuwan-1.0.2-py3-none-any.whl.metadata (959 bytes)
Collecting utils-nuuuwan (from scraper-nuuuwan->-r lk_news/requirements.txt (line 1))
  Downloading utils_nuuuwan-2.0.11-py3-none-any.whl.metadata (725 bytes)
Collecting bs4 (from scraper-nuuuwan->-r lk_news/requirements.txt (line 1))
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting pymupdf (from scraper-nuuuwan->-r lk_news/requirements.txt (line 1))
  Downloading pymupdf-1.26.4-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Collecting selenium (from scraper-nuuuwan->-r lk_news/requirements.txt (line 1))
  Downloading selenium-4.36.0-py3-none-any.whl.metadata (7.5 kB)
Collecting trio<1.0,>=0.30.0 (from selenium->scraper-nuuuwan->-r lk_news/requirements.txt (line 1))
  Downloading trio-0.31.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket<1.0,>=0.12.2 (from selenium->scraper-nuuuwan->-r lk_news/r

**Reasoning**:
The `one_off_backpopulate.py` script still failed with a `ModuleNotFoundError: No module named 'lk_news'`. This suggests that the `lk_news` package, although present in the cloned repository's `src` directory, is not installed in the Python environment in a way that it can be imported directly. To fix this, we need to add the `src` directory to the Python path. After fixing the import issue, we can try running the script again, then attempt to load and process the generated data.



In [8]:
import sys

# Add the src directory to the Python path
sys.path.append('./lk_news/src')

# After adding to path, try running the one_off_backpopulate.py script again.
# This script might take a while to run as it fetches data.
print("Attempting to run the data generation script...")
!python lk_news/workflows/one_off_backpopulate.py
print("Data generation script finished.")

# After running the script, check the directory structure again to see if a 'data' directory was created.
print("Checking directory structure after script execution...")
!ls -R lk_news/
print("Directory structure checked.")

# Now, try reading the JSON files again from the expected data directory.
all_files = glob.glob('lk_news/data/*/*/*/*.json')

list_of_dfs = []

for file_path in all_files:
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
            # Check if data is a list and not empty, then create DataFrame
            if isinstance(data, list) and data:
                df_temp = pd.DataFrame(data)
                list_of_dfs.append(df_temp)
            # If data is a dictionary, convert to list and create DataFrame
            elif isinstance(data, dict):
                df_temp = pd.DataFrame([data])
                list_of_dfs.append(df_temp)
            else:
                # Some files might contain empty lists or non-dict/list data, skip those.
                print(f"Skipping file {file_path}: Data is not a list or dictionary or is empty.")

    except json.JSONDecodeError:
        print(f"Skipping file {file_path}: JSON decode error.")
    except Exception as e:
        print(f"Skipping file {file_path}: An error occurred - {e}")


if list_of_dfs:
    df_combined = pd.concat(list_of_dfs, ignore_index=True)
    # Check if 'url' and 'text' columns exist before selecting
    required_columns = ['url', 'text']
    if all(col in df_combined.columns for col in required_columns):
        df = df_combined[required_columns]
        print("\nSuccessfully loaded and combined data.")
        display(df.head())
    else:
        missing = [col for col in required_columns if col not in df_combined.columns]
        print(f"\nCombined DataFrame is missing required columns: {missing}")
        df = pd.DataFrame(columns=required_columns) # Create an empty dataframe with required columns
else:
    print("\nNo dataframes were created. The combined dataframe is empty.")
    df = pd.DataFrame(columns=['url', 'text']) # Create an empty dataframe with required columns


Attempting to run the data generation script...
Traceback (most recent call last):
  File "/content/lk_news/workflows/one_off_backpopulate.py", line 6, in <module>
    from lk_news import NewsArticle
ModuleNotFoundError: No module named 'lk_news'
Data generation script finished.
Checking directory structure after script execution...
lk_news/:
custom_summary.json  push_to_news_lk3.sh  requirements.txt  tests
LICENSE		     README.md		  src		    workflows

lk_news/src:
lk_news  news_lk3

lk_news/src/lk_news:
__init__.py  NewsArticle.py  NewspaperFactory.py

lk_news/src/news_lk3:
core  custom_newspapers  __init__.py

lk_news/src/news_lk3/core:
AbstractNewsPaper.py  article  __init__.py

lk_news/src/news_lk3/core/article:
ArticleBase.py	Article.py  __init__.py

lk_news/src/news_lk3/custom_newspapers:
AdaDeranaLk.py		DailyFtLk.py	   __init__.py
AdaDeranaSinhalaLk.py	DailyMirrorLk.py   IslandLk.py
AdaLk.py		DailyNewsLk.py	   LankadeepaLk.py
BBCComSinhala.py	DBSJeyarajCom.py   NewsFirstLk.py
C

**Reasoning**:
The script still fails with `ModuleNotFoundError: No module named 'lk_news'`, even after adding `./lk_news/src` to the Python path. This suggests there might be an issue with how the package is structured or how the script is trying to import it. The `lk_news` package is nested within the `src` directory, so let's try adding the package directory itself, `./lk_news/src/lk_news`, to the Python path.



In [9]:
import sys

# Add the lk_news package directory within src to the Python path
sys.path.append('./lk_news/src/lk_news')

# Try running the data generation script again.
print("Attempting to run the data generation script again...")
!python lk_news/workflows/one_off_backpopulate.py
print("Data generation script finished.")

# After running the script, check the directory structure again.
print("Checking directory structure after script execution...")
!ls -R lk_news/
print("Directory structure checked.")

# Now, try reading the JSON files again from the expected data directory.
all_files = glob.glob('lk_news/data/*/*/*/*.json')

list_of_dfs = []

for file_path in all_files:
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
            # Check if data is a list and not empty, then create DataFrame
            if isinstance(data, list) and data:
                df_temp = pd.DataFrame(data)
                list_of_dfs.append(df_temp)
            # If data is a dictionary, convert to list and create DataFrame
            elif isinstance(data, dict):
                df_temp = pd.DataFrame([data])
                list_of_dfs.append(df_temp)
            else:
                # Some files might contain empty lists or non-dict/list data, skip those.
                print(f"Skipping file {file_path}: Data is not a list or dictionary or is empty.")

    except json.JSONDecodeError:
        print(f"Skipping file {file_path}: JSON decode error.")
    except Exception as e:
        print(f"Skipping file {file_path}: An error occurred - {e}")


if list_of_dfs:
    df_combined = pd.concat(list_of_dfs, ignore_index=True)
    # Check if 'url' and 'text' columns exist before selecting
    required_columns = ['url', 'text']
    if all(col in df_combined.columns for col in required_columns):
        df = df_combined[required_columns]
        print("\nSuccessfully loaded and combined data.")
        display(df.head())
    else:
        missing = [col for col in required_columns if col not in df_combined.columns]
        print(f"\nCombined DataFrame is missing required columns: {missing}")
        df = pd.DataFrame(columns=required_columns) # Create an empty dataframe with required columns
else:
    print("\nNo dataframes were created. The combined dataframe is empty.")
    df = pd.DataFrame(columns=['url', 'text']) # Create an empty dataframe with required columns

Attempting to run the data generation script again...
Traceback (most recent call last):
  File "/content/lk_news/workflows/one_off_backpopulate.py", line 6, in <module>
    from lk_news import NewsArticle
ModuleNotFoundError: No module named 'lk_news'
Data generation script finished.
Checking directory structure after script execution...
lk_news/:
custom_summary.json  push_to_news_lk3.sh  requirements.txt  tests
LICENSE		     README.md		  src		    workflows

lk_news/src:
lk_news  news_lk3

lk_news/src/lk_news:
__init__.py  NewsArticle.py  NewspaperFactory.py

lk_news/src/news_lk3:
core  custom_newspapers  __init__.py

lk_news/src/news_lk3/core:
AbstractNewsPaper.py  article  __init__.py

lk_news/src/news_lk3/core/article:
ArticleBase.py	Article.py  __init__.py

lk_news/src/news_lk3/custom_newspapers:
AdaDeranaLk.py		DailyFtLk.py	   __init__.py
AdaDeranaSinhalaLk.py	DailyMirrorLk.py   IslandLk.py
AdaLk.py		DailyNewsLk.py	   LankadeepaLk.py
BBCComSinhala.py	DBSJeyarajCom.py   NewsFirstL
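
Both attempts above call `sys.path.append` inside the notebook kernel, but `!python ...` starts a separate interpreter that does not inherit the kernel's `sys.path`. Environment variables are inherited, so setting `PYTHONPATH` for the child process is one way to make `lk_news/src` importable. The following is a minimal sketch under that assumption; `run_with_pythonpath` is a hypothetical helper, and whether the script then succeeds in producing data has not been verified here.

```python
import os
import subprocess
import sys


def run_with_pythonpath(script_path, extra_path):
    """Run a Python script in a child process with `extra_path` on PYTHONPATH."""
    env = os.environ.copy()
    existing = env.get("PYTHONPATH")
    # Prepend the new entry and keep anything that was already set
    env["PYTHONPATH"] = (
        extra_path if not existing else extra_path + os.pathsep + existing
    )
    return subprocess.run(
        [sys.executable, script_path],
        env=env,
        capture_output=True,
        text=True,
    )


# Usage with the paths from this notebook (left commented out because the
# script may fetch data for a long time):
# result = run_with_pythonpath(
#     "lk_news/workflows/one_off_backpopulate.py",
#     os.path.abspath("lk_news/src"),
# )
# print(result.stdout or result.stderr)
```

In a notebook cell, the shell form `!PYTHONPATH=lk_news/src python lk_news/workflows/one_off_backpopulate.py` should have the same effect.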

## Label generation

### Subtask:
Infer a category label based on keywords in the 'url' column and display the category distribution.


**Reasoning**:
Define a function to categorize URLs based on keywords and apply it to the 'url' column to create a new 'category' column. Then, display the distribution of the new 'category' column.



In [10]:
def categorize_url(url):
    """Categorizes a URL based on keywords."""
    if isinstance(url, str):
        url_lower = url.lower()
        if 'sport' in url_lower:
            return 'sports'
        elif 'business' in url_lower or 'biz' in url_lower:
            return 'business'
        elif 'politic' in url_lower: # Using 'politic' to catch variations like 'politics'
            return 'politics'
        elif 'entertain' in url_lower: # Using 'entertain' to catch variations
            return 'entertainment'
        elif 'news' in url_lower:
            return 'news'
        else:
            return 'other'
    else:
        return 'other' # Handle potential non-string values

# Apply the function to create the 'category' column
df['category'] = df['url'].apply(categorize_url)

# Display the distribution of the 'category' column
display(df['category'].value_counts())

Empty table with columns: category, count (no rows)
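
One caveat with matching keywords against the whole URL is that the domain name itself can trigger a match: any URL on a host containing "news" falls into the `news` category regardless of the article's section. A minimal sketch of a variant that inspects only the URL path is shown here; `categorize_url_path` and `CATEGORY_KEYWORDS` are hypothetical names, the keyword list mirrors the function above, and the URLs in the comments are made-up examples.

```python
from urllib.parse import urlparse

# Checked in order, so earlier categories win when several keywords match
CATEGORY_KEYWORDS = {
    "sports": ["sport"],
    "business": ["business", "biz"],
    "politics": ["politic"],
    "entertainment": ["entertain"],
    "news": ["news"],
}


def categorize_url_path(url):
    """Categorize a URL using keywords found in its path, ignoring the host."""
    if not isinstance(url, str):
        return "other"  # Handle missing or non-string values
    path = urlparse(url).path.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in path for keyword in keywords):
            return category
    return "other"


# Made-up example URLs:
# "https://www.example-news.lk/sports/match-report"  -> path contains "sport"
# "https://www.example-news.lk/2023/05/01/headline"  -> no keyword in the path
```

It can be applied the same way as the original function, for example `df['url'].apply(categorize_url_path)`.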


## Text cleaning

### Subtask:
Clean the text data by removing URLs, non-letter symbols, and extra spaces, then filter out short articles and save the cleaned data.


**Reasoning**:
Define the text cleaning function, apply it to the 'text' column, filter the DataFrame based on text length, and display the head of the cleaned and filtered data.



In [11]:
import re

def clean_text(text):
    """Cleans text data by removing URLs, non-letter symbols, and extra spaces."""
    if not isinstance(text, str):
        return ""
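    # Remove URLs (the dot in "www." is escaped so it only matches a literal dot)
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove non-letter symbols (keeping spaces).
    # Note: this ASCII-only class also removes Sinhala and Tamil letters.
    text = re.sub(r'[^a-zA-Z\s]', '', text)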
    # Replace multiple spaces with a single space
    text = re.sub(r'\s+', ' ', text).strip()
    # Convert to lowercase
    text = text.lower()
    return text

# Apply the clean_text function
df['cleaned_text'] = df['text'].apply(clean_text)

# Filter out short articles (keep only texts longer than 50 characters)
min_text_length = 50
df_cleaned_filtered = df[df['cleaned_text'].str.len() > min_text_length]

# Display the head of the cleaned and filtered DataFrame
display(df_cleaned_filtered.head())

# Consider saving the cleaned DataFrame (optional)
# df_cleaned_filtered.to_csv('cleaned_news_data.csv', index=False)
# print("Cleaned and filtered data saved to 'cleaned_news_data.csv'")

Empty table with columns: url, text, category, cleaned_text (no rows)
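
The character class `[^a-zA-Z\s]` keeps only ASCII letters, so Sinhala and Tamil text would be deleted entirely. That matters here because the repository lists outlets such as `AdaDeranaSinhalaLk.py` and `TamilMirrorLk.py`. A minimal sketch of a Unicode-aware alternative follows. It keeps every character whose Unicode category is a letter or a combining mark, plus the zero-width joiners that Sinhala conjuncts rely on. The name `clean_text_unicode` is hypothetical, and lowercasing is left out because `bert-base-multilingual-cased` is a cased checkpoint.

```python
import re
import unicodedata

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
ZERO_WIDTH_JOINERS = {"\u200c", "\u200d"}  # ZWNJ and ZWJ


def clean_text_unicode(text):
    """Remove URLs, digits, and punctuation while keeping letters of any script."""
    if not isinstance(text, str):
        return ""
    text = URL_PATTERN.sub(" ", text)
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category[0] in ("L", "M") or ch.isspace() or ch in ZERO_WIDTH_JOINERS:
            kept.append(ch)  # Letter, combining mark, whitespace, or joiner
        else:
            kept.append(" ")  # Replace everything else with a space
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", "".join(kept)).strip()
```

Combining marks are kept explicitly because vowel signs in Sinhala and Tamil are not classified as letters, and dropping them would corrupt the words.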


## Data preparation for training

### Subtask:
Split the cleaned data into training and testing sets, convert them to Hugging Face Datasets format, and tokenize the text data using a multilingual BERT tokenizer.


**Reasoning**:
Split the cleaned data into training and testing sets, convert them to Hugging Face Datasets format, load the tokenizer, define the tokenization function, and apply it to the datasets.



In [12]:
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import AutoTokenizer

# Split the cleaned and filtered data into training and testing sets
# Check if df_cleaned_filtered is not empty before splitting
if not df_cleaned_filtered.empty:
    train_df, test_df = train_test_split(df_cleaned_filtered, test_size=0.2, random_state=42)

    # Convert pandas DataFrames to Hugging Face Dataset objects
    train_dataset = Dataset.from_pandas(train_df)
    test_dataset = Dataset.from_pandas(test_df)

    # Load a multilingual BERT tokenizer
    tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

    # Define the tokenization function
    def tokenize_function(examples):
        return tokenizer(examples["cleaned_text"], padding="max_length", truncation=True)

    # Apply the tokenization function to the datasets
    tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
    tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

    # Display the first example of the tokenized training dataset
    print("First example of the tokenized training dataset:")
    display(tokenized_train_dataset[0])

else:
    print("df_cleaned_filtered is empty. Cannot split data or tokenize.")
    # Create empty datasets to avoid errors in subsequent steps
    train_dataset = Dataset.from_pandas(pd.DataFrame(columns=df_cleaned_filtered.columns))
    test_dataset = Dataset.from_pandas(pd.DataFrame(columns=df_cleaned_filtered.columns))
    tokenized_train_dataset = train_dataset
    tokenized_test_dataset = test_dataset


df_cleaned_filtered is empty. Cannot split data or tokenize.
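
The tokenized datasets above still carry `category` as a string, while a sequence classification model expects integer class ids in a `labels` column. A minimal sketch of building that mapping is shown here; `build_label_maps` is a hypothetical helper, not part of the original notebook.

```python
def build_label_maps(categories):
    """Build category-to-id and id-to-category mappings in sorted order."""
    labels = sorted(set(categories))
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


# Usage, once the DataFrame is non-empty:
# label2id, id2label = build_label_maps(df_cleaned_filtered["category"])
# train_dataset = train_dataset.map(
#     lambda example: {"labels": label2id[example["category"]]}
# )
```

Sorting the category names keeps the id assignment stable across runs, which matters when a saved model is reloaded later.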


## Model training (placeholder)

### Subtask:
This is a placeholder for the model training step. Due to the failure in previous steps to load and prepare the data, the actual model training cannot be performed at this time. This section outlines the steps that would be taken if the data were available.


**Reasoning**:
The previous steps failed to load and prepare the data, resulting in empty dataframes. Therefore, the model training step cannot be performed as intended. I will outline the steps for model training as placeholders and finish the task with a failure status due to the lack of data.
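
The placeholder steps can be sketched as follows, assuming tokenized datasets with an integer `labels` column and a `label2id` dict mapping category names to ids. The helper names `training_config` and `build_trainer`, the output directory, and all hyperparameter values are assumptions chosen as common starting points, not settings taken from the original notebook.

```python
def training_config(output_dir="mbert-lk-news"):
    """Return keyword arguments for TrainingArguments (assumed starting values)."""
    return {
        "output_dir": output_dir,
        "num_train_epochs": 3,
        "per_device_train_batch_size": 16,
        "per_device_eval_batch_size": 16,
        "learning_rate": 2e-5,
        "weight_decay": 0.01,
    }


def build_trainer(train_dataset, eval_dataset, label2id):
    """Create a Trainer that fine-tunes multilingual BERT for classification."""
    # Imported here so training_config stays usable without transformers installed
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    id2label = {idx: label for label, idx in label2id.items()}
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased",
        num_labels=len(label2id),
        id2label=id2label,
        label2id=label2id,
    )
    args = TrainingArguments(**training_config())
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )


# Usage, once non-empty tokenized datasets exist:
# trainer = build_trainer(tokenized_train_dataset, tokenized_test_dataset, label2id)
# trainer.train()
# print(trainer.evaluate())
```

Because the earlier tokenization step pads every example to `max_length`, the default data collator is sufficient and no custom collator is passed.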

