Several pre-trained models for Persian (Farsi) text are available on the Hugging Face Hub and can be loaded with the Transformers library. Notable options include:

    ParsBERT:
        Model IDs:
            HooshvareLab/bert-fa-base-uncased (BERT model)
            HooshvareLab/bert-fa-zwnj-base (BERT model with ZWNJ token)
        Description: ParsBERT is a monolingual BERT model pre-trained specifically for Persian. It's suitable for a wide range of NLP tasks like text classification, sentiment analysis, and question answering in Persian.

    mBERT (Multilingual BERT):
        Model ID:
            bert-base-multilingual-cased
        Description: Although not exclusively for Persian, mBERT was pre-trained on Wikipedia text from 104 languages, including Persian. It is most useful for cross-lingual transfer, for example fine-tuning on one language and applying the model to another.

    XLM-RoBERTa (XLM-R):
        Model IDs:
            xlm-roberta-base
            xlm-roberta-large
        Description: XLM-RoBERTa is a multilingual RoBERTa-style encoder pre-trained on CommonCrawl text in 100 languages, including Persian. It has shown strong performance across many languages and tasks, and is particularly effective for zero-shot or few-shot cross-lingual transfer.

    DistilBERT Multilingual:
        Model ID:
            distilbert-base-multilingual-cased
        Description: A lighter version of mBERT, which is faster and smaller while retaining most of the performance. It supports Persian among other languages.

    ParsBERT sentiment models:
        Model IDs:
            HooshvareLab/bert-fa-base-uncased-sentiment-snappfood
            HooshvareLab/bert-fa-base-uncased-sentiment-digikala
        Description: These are ParsBERT checkpoints fine-tuned for sentiment analysis on Persian review datasets (SnappFood and Digikala). They are ready-made sentiment classifiers rather than general-purpose encoders.

    ParsRoBERTa:
        Model ID:
            HooshvareLab/roberta-fa-zwnj-base
        Description: A RoBERTa-based model pre-trained specifically for Persian, with explicit handling of the zero-width non-joiner (ZWNJ) character, which occurs frequently in Persian text.
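
The ZWNJ variants listed above exist because the zero-width non-joiner (Unicode U+200C) separates parts of a Persian word without adding a visible space. As a small illustration in plain Python, the sketch below shows that the character is invisible yet still changes the string a tokenizer receives. The helper names `count_zwnj` and `strip_zwnj` are hypothetical and not part of any library.

```python
ZWNJ = "\u200c"  # zero-width non-joiner

def count_zwnj(text: str) -> int:
    """Count ZWNJ characters in text."""
    return text.count(ZWNJ)

def strip_zwnj(text: str) -> str:
    """Remove every ZWNJ; used here only to contrast the two strings."""
    return text.replace(ZWNJ, "")

# The verb "mi-ravam" ("I go"), written with escapes so the ZWNJ is explicit
PREFIX = "\u0645\u06cc"      # "mi", two letters
STEM = "\u0631\u0648\u0645"  # "ravam", three letters

with_zwnj = PREFIX + ZWNJ + STEM
without_zwnj = strip_zwnj(with_zwnj)

print(count_zwnj(with_zwnj))              # 1
print(len(with_zwnj), len(without_zwnj))  # 6 5
print(with_zwnj == without_zwnj)          # False
```

Whether a given checkpoint keeps or drops this character depends on its tokenizer, so it is worth checking before normalizing input text.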

In [9]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
model = AutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")

# Pick the device once and move the model there (not on every call)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Function to extract fixed features using mean pooling
def extract_fixed_features(text):
    # Encode text
    encoded_input = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    # Move encoded input to the same device as the model
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}

    # Get model output without tracking gradients
    with torch.no_grad():
        output = model(**encoded_input)

    # Get hidden states, shape (batch, tokens, hidden_size)
    hidden_states = output.last_hidden_state

    # Perform mean pooling across the token embeddings.
    # A plain mean is fine for a single text; a padded batch of several
    # texts would also average the padding positions.
    mean_pooled = hidden_states.mean(dim=1)

    return mean_pooled

# Example usage
text = "متن فارسی برای استخراج بهتر است"
fixed_features = extract_fixed_features(text)

print(fixed_features[0])

tensor([-2.6275e-01,  5.7108e-01,  3.3121e-01,  1.4154e+00, -4.0521e-01,
         1.3762e+00,  3.3867e-01, -3.5899e-01,  3.3816e-01, -6.7192e-01,
        -1.1155e-01,  4.2443e-01, -1.4721e-01, -2.8911e-01,  1.1189e+00,
         7.7049e-01,  1.9862e-01, -2.3774e-01,  1.3620e-01,  5.8136e-01,
         5.8763e-01,  2.9841e-01,  6.9543e-02, -1.3864e-01,  3.5593e-01,
         2.7774e-01,  3.0156e-01,  9.8226e-01, -5.0453e-01, -6.4687e-01,
        -9.2100e-03, -3.4271e-01, -1.2741e-01,  9.8573e-02,  9.4802e-01,
         3.9395e-01, -4.4823e-01,  3.5571e-01,  1.2426e-01, -7.9641e-01,
         1.5209e-01,  5.1598e-02,  9.2991e-01,  8.1859e-01, -3.7078e-01,
         4.6551e-01,  6.6500e-01,  3.8866e-01, -1.0488e-01, -1.2904e-01,
        -2.8155e-01, -4.7948e-01, -1.5584e-01, -3.7851e-01, -5.5119e-02,
         1.1239e-01,  7.3316e-01,  6.5924e-01, -3.3957e-01, -3.6224e-01,
         4.0625e-01,  6.8341e-01,  4.1370e-01, -1.1419e-01,  1.0184e-01,
        -6.1253e-01,  7.4381e-01,  7.0447e-03, -3.2
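
A caveat on mean pooling: with a single text there is no padding, so averaging every position is fine. Once several texts of different lengths are batched, `padding=True` adds pad positions, and a plain mean counts them too. The sketch below works through the arithmetic on plain Python lists with made-up numbers, so it runs without downloading a model. The function name `masked_mean` is hypothetical; in PyTorch the same idea is usually written with the `attention_mask` returned by the tokenizer.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # max(..., 1) guards against a row that is all padding
    return [t / max(kept, 1) for t in totals]

# Toy sequence: two real tokens followed by two padding positions
token_vectors = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]
attention_mask = [1, 1, 0, 0]

# Plain mean over every position, padding included
plain = [sum(col) / len(token_vectors) for col in zip(*token_vectors)]

print(plain)                                       # [5.5, 6.5]
print(masked_mean(token_vectors, attention_mask))  # [2.0, 4.0]
```

The padded positions pull the plain mean away from the real tokens, while the masked version averages only the two real token vectors.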

In [6]:
import torch
from transformers import AutoTokenizer, AutoModel

# Load mBERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

# Linear layer mapping the model's hidden size to 400 features.
# It is created once, with a fixed seed, so every call uses the same weights.
# The weights are random and untrained, so this is only a random projection.
torch.manual_seed(0)
projection = torch.nn.Linear(model.config.hidden_size, 400)

# Function to encode text and extract features
def get_features(text):
    # Encode text
    encoded_input = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding=True)

    with torch.no_grad():
        # Get model output
        output = model(**encoded_input)

        # Get the hidden states from the last layer
        hidden_states = output.last_hidden_state

        # Perform mean pooling on the output of the last layer to get one vector per input
        mean_pooled = hidden_states.mean(dim=1)

        # Apply the projection to the pooled output to get 400 features
        fixed_size_features = projection(mean_pooled)

    # Convert the tensor to a NumPy array after detaching from the graph
    return fixed_size_features.squeeze().detach().numpy()

# Example usage:
text = "متن فارسی برای استخراج بهتر است"
features = get_features(text)
print(features)


[-0.07062095 -0.22050056  0.3408863   0.20093411  0.38109398  0.372801
  0.31965354 -0.24889152  0.2574297  -0.14708084 -0.09364371 -0.48636004
 -0.07245316  0.24653907 -0.05866271  0.21474178 -0.03034709  0.18886095
  0.03708272 -0.29993188  0.38528806 -0.21577215 -0.01156279 -0.26285094
 -0.1154431   0.01309147  0.4693251   0.03314454  0.30270854 -0.00264688
  0.37809548 -0.21427162  0.17976114  0.11154488 -0.00449676  0.13950966
 -0.17645545  0.00180549  0.00089142 -0.11490244  0.11222611  0.08659229
  0.26471224  0.0571307  -0.50953114  0.17031685 -0.12884814  0.37300533
  0.02508075 -0.15491746  0.11758441  0.02963535  0.44197762 -0.07445274
 -0.24695557 -0.13911942 -0.09545573  0.09621528  0.68771803  0.31991008
  0.3138378   0.30337143 -0.0042295  -0.19705886  0.15482847  0.166331
 -0.35959178  0.08633351 -0.5741924   0.3287912  -0.06345899 -0.1222142
 -0.22476983 -0.29923564 -0.02423243  0.05643653 -0.04367613  0.03739271
  0.08263633 -0.14173499  0.03113445  0.01298385  0.1126
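
One point about the 400-feature projection: a linear layer built fresh with random weights gives a different mapping each time it is created, so feature vectors are only comparable when the same weights are reused for every text. The sketch below illustrates this with a tiny seeded random projection in plain Python. The names `make_projection` and `project` are hypothetical, and the sizes are shrunk to 4 inputs and 3 outputs to keep it readable.

```python
import random

def make_projection(in_dim, out_dim, seed):
    """Build an out_dim x in_dim matrix of random weights from a fixed seed."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

def project(matrix, vector):
    """Multiply the matrix by the vector (one dot product per output feature)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

pooled = [0.5, -1.0, 2.0, 0.25]  # stand-in for a mean-pooled embedding

same_a = project(make_projection(4, 3, seed=0), pooled)
same_b = project(make_projection(4, 3, seed=0), pooled)
other = project(make_projection(4, 3, seed=1), pooled)

print(same_a == same_b)  # True: same seed, same weights, same features
print(same_a == other)   # False: different weights give different features
```

A projection like this only changes the vector size. If the 400 features feed a downstream task, the layer would normally be trained together with that task rather than left at its random initialization.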

In [7]:
import torch
from transformers import AutoTokenizer, AutoModel

# Load XLM-RoBERTa large tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModel.from_pretrained("xlm-roberta-large")

# Linear layer mapping the model's hidden size to 400 features.
# It is created once, with a fixed seed, so every call uses the same weights.
# The weights are random and untrained, so this is only a random projection.
torch.manual_seed(0)
projection = torch.nn.Linear(model.config.hidden_size, 400)

# Function to encode text and extract features
def get_features(text):
    # Encode text
    encoded_input = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding=True)

    with torch.no_grad():
        # Get model output
        output = model(**encoded_input)

        # Get the hidden states from the last layer
        hidden_states = output.last_hidden_state

        # Perform mean pooling on the output of the last layer to get one vector per input
        mean_pooled = hidden_states.mean(dim=1)

        # Apply the projection to the pooled output to get 400 features
        fixed_size_features = projection(mean_pooled)

    # Convert the tensor to a NumPy array after detaching from the graph
    return fixed_size_features.squeeze().detach().numpy()

# Example usage:
text = "متن فارسی برای استخراج بهتر است"
features = get_features(text)
print(features)


[-7.60693908e-01 -2.48057976e-01  9.24422085e-01 -5.32367051e-01
 -9.00858343e-02  7.70152748e-01 -7.14325786e-01 -2.47421697e-01
  1.74703181e-01 -5.64610481e-01  1.04018641e+00 -3.56198221e-01
 -7.76074886e-01 -1.13747612e-01 -1.99324153e-02 -8.63505602e-01
 -1.10646832e+00  6.69764996e-01 -3.39600235e-01  1.89778343e-01
  1.08140796e-01 -4.96761054e-01  7.21082509e-01  5.41450441e-01
  7.32753694e-01 -6.96872592e-01 -4.77564245e-01  8.63723338e-01
 -1.18853375e-01 -6.82338774e-01 -3.00767750e-01 -9.89049196e-01
 -1.42933190e-01  5.52920461e-01  8.97379041e-01  1.11381507e+00
  7.84100354e-01 -7.44526505e-01 -8.29315782e-02  7.21124172e-01
  7.66084790e-02 -1.46397755e-01  9.64250147e-01  6.03629291e-01
 -6.11941099e-01  4.44982350e-01  9.04703438e-01  1.34414226e-01
 -9.82283711e-01 -5.70435464e-01  1.01839697e+00  7.99292088e-01
 -3.05734485e-01 -9.32261646e-01  6.33407414e-01 -4.51679192e-02
  3.92974347e-01  4.33743596e-01 -4.48608011e-01 -5.36629379e-01
 -5.34412041e-02  3.52246

In [8]:
import torch
from transformers import AutoTokenizer, AutoModel

# Load XLM-RoBERTa base tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

# Linear layer mapping the model's hidden size to 400 features.
# It is created once, with a fixed seed, so every call uses the same weights.
# The weights are random and untrained, so this is only a random projection.
torch.manual_seed(0)
projection = torch.nn.Linear(model.config.hidden_size, 400)

# Function to encode text and extract features
def get_features(text):
    # Encode text
    encoded_input = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding=True)

    with torch.no_grad():
        # Get model output
        output = model(**encoded_input)

        # Get the hidden states from the last layer
        hidden_states = output.last_hidden_state

        # Perform mean pooling on the output of the last layer to get one vector per input
        mean_pooled = hidden_states.mean(dim=1)

        # Apply the projection to the pooled output to get 400 features
        fixed_size_features = projection(mean_pooled)

    # Convert the tensor to a NumPy array after detaching from the graph
    return fixed_size_features.squeeze().detach().numpy()

# Example usage:
text = "متن فارسی برای استخراج بهتر است"
features = get_features(text)
print(features)


[-4.75815505e-01 -2.98950225e-01  3.35942090e-01  6.16071522e-01
  1.15709379e-04  6.51665688e-01 -5.21875322e-02 -4.14154259e-03
  4.05965149e-01  5.22243083e-01 -4.87056315e-01  4.20386463e-01
  2.34429896e-01  6.11598305e-02 -2.23275796e-02  6.84694886e-01
 -5.78581452e-01  1.27659887e-01  1.59644440e-01 -1.79681152e-01
 -6.32148802e-01  5.28266907e-01  3.97546083e-01 -5.72776675e-01
 -1.97332054e-01  4.17273760e-01 -3.63003492e-01  1.28683016e-01
  6.26814187e-01  1.71466008e-01 -4.35827263e-02 -5.86148858e-01
  5.53992510e-01 -5.25772631e-01 -4.13774282e-01  4.00797725e-02
  5.96164942e-01  1.71618864e-01 -5.34344792e-01 -3.30057323e-01
  2.67316252e-01  2.81477839e-01  2.29817659e-01 -4.08190727e-01
  1.01996817e-01 -5.90485811e-01 -2.56165415e-01 -6.99609041e-01
 -3.89084280e-01  1.89915255e-01  3.20787966e-01  7.29326189e-01
 -8.62130523e-02 -7.78351724e-02 -6.35403097e-01  5.02042174e-01
 -1.53027967e-01  6.66966259e-01  6.44240797e-01 -3.72996032e-01
 -3.43259633e-01  6.15921