### Section 1: Data Loading and Preprocessing


In [12]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import os
import string

# Setting the current directory
curr_dir = './'

# Initializing a list to store all headlines
all_headlines = []

# Loading data from CSV files
for filename in os.listdir(curr_dir):
    if 'Articles' in filename:
        # Reading the CSV file
        article_df = pd.read_csv(os.path.join(curr_dir, filename))
        
        # Extending the list with headline values from the DataFrame
        all_headlines.extend(list(article_df.headline.values))
        
        # Breaking the loop after reading the first matching file
        break

# Dropping headlines whose value is the placeholder string "Unknown"
all_headlines = [h for h in all_headlines if h != "Unknown"]

# Text cleaning function
def clean_text(txt):
    """
    Cleans the given text by removing punctuation, converting to lowercase,
    and dropping non-ASCII characters.

    Args:
    - txt (str): Input text.

    Returns:
    - txt (str): Cleaned text.
    """
    # Removing punctuation and converting to lowercase
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    # Dropping any characters that cannot be represented in ASCII
    txt = txt.encode("utf8").decode("ascii", "ignore")
    return txt

# Creating cleaned text
corpus = [clean_text(x) for x in all_headlines]

This cell loads the headlines from the first CSV file in the current directory whose name contains "Articles", drops every headline equal to the placeholder string "Unknown", and defines clean_text, which strips punctuation, lowercases the text, and discards non-ASCII characters. Applying clean_text to each remaining headline produces the cleaned corpus.
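As a quick check of the cleaning step, the sketch below repeats clean_text so that it runs on its own and applies it to a made-up headline. The sample string is an assumption for illustration and is not taken from the dataset.

```python
import string


def clean_text(txt):
    """Strip ASCII punctuation, lowercase, and drop non-ASCII characters."""
    txt = "".join(ch for ch in txt if ch not in string.punctuation).lower()
    return txt.encode("utf8").decode("ascii", "ignore")


# Hypothetical headline containing ASCII punctuation and curly quotes
sample = "Trump\u2019s Plan: A \u2018Big\u2019 Win!"
cleaned = clean_text(sample)
print(cleaned)  # trumps plan a big win
```

The curly quotes are not part of string.punctuation, so they survive the first step and are only removed by the ASCII round trip.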

### Section 2: Tokenization and Data to Tokens Conversion

In [13]:
# Importing necessary libraries
from keras.preprocessing.text import Tokenizer

# Creating a tokenizer object
tokenizer = Tokenizer()

# Tokenization function
def get_sequence_of_tokens(corpus):
    """
    Tokenizes the given corpus and generates input sequences for training.

    Args:
    - corpus (list): A list of text data.

    Returns:
    - input_sequences (list): List of input sequences for training.
    - total_words (int): Vocabulary size: number of unique words plus one,
      because index 0 is reserved for padding.
    """
    # Fitting the tokenizer on the corpus
    tokenizer.fit_on_texts(corpus)
    
    # Calculating the vocabulary size (+1 because index 0 is reserved for padding)
    total_words = len(tokenizer.word_index) + 1

    # Generating n-gram prefix sequences: tokens [a, b, c] yield [a, b] and [a, b, c]
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    
    return input_sequences, total_words

# Generating token sequences and getting total number of words
inp_sequences, total_words = get_sequence_of_tokens(corpus)


This cell tokenizes the corpus with the Keras Tokenizer class, which maps each word to an integer index. After the tokenizer is fitted on the corpus, total_words is set to the number of unique words plus one, because index 0 is reserved for padding. Each headline is then expanded into n-gram prefix sequences: a headline with tokens [a, b, c] yields [a, b] and [a, b, c], so the LSTM model can be trained to predict each word from the words before it.
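To make the prefix expansion concrete, here is a minimal pure-Python sketch that does not use Keras. The helper build_word_index is a hypothetical stand-in that only approximates how the Tokenizer assigns indices (most frequent word first, starting at 1), and the two-line corpus is invented for illustration.

```python
from collections import Counter


def build_word_index(corpus):
    """Assign indices from 1 upward, most frequent word first (0 stays free for padding)."""
    counts = Counter(word for line in corpus for word in line.split())
    # sorted() is stable, so equally frequent words keep their first-seen order
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}


def prefix_sequences(corpus, word_index):
    """Return every prefix of length >= 2 for each tokenized line."""
    sequences = []
    for line in corpus:
        tokens = [word_index[w] for w in line.split()]
        for i in range(1, len(tokens)):
            sequences.append(tokens[: i + 1])
    return sequences


toy_corpus = ["the cat sat", "the dog"]  # hypothetical cleaned headlines
word_index = build_word_index(toy_corpus)
total_words = len(word_index) + 1
sequences = prefix_sequences(toy_corpus, word_index)

print(word_index)   # {'the': 1, 'cat': 2, 'sat': 3, 'dog': 4}
print(sequences)    # [[1, 2], [1, 2, 3], [1, 4]]
print(total_words)  # 5
```

A headline with n tokens contributes n - 1 training sequences, so the number of samples grows with headline length rather than with the number of headlines alone.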

### Section 3: Generating Padded Input Sequences

In [14]:
# Importing necessary libraries
from keras.utils import pad_sequences
import keras.utils as ku

# Function for generating padded input sequences
def generate_padded_sequences(input_sequences):
    """
    Generates padded input sequences for training the LSTM model.

    Args:
    - input_sequences (list): List of input sequences.

    Returns:
    - predictors (array): Padded input sequences excluding the last element.
    - label (array): Last element of each input sequence as one-hot encoded labels.
    - max_sequence_len (int): Maximum length of the input sequences after padding.
    """
    # Finding the maximum sequence length
    max_sequence_len = max([len(x) for x in input_sequences])
    
    # Padding input sequences
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    # Extracting predictors and labels
    predictors, label = input_sequences[:,:-1], input_sequences[:,-1]
    
    # One-hot encoding labels (relies on the module-level total_words from the tokenization cell)
    label = ku.to_categorical(label, num_classes=total_words)
    
    return predictors, label, max_sequence_len

# Generating padded input sequences
predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)

This cell uses Keras to pad the input sequences. The function generate_padded_sequences left-pads every sequence with zeros to the length of the longest one and returns three values: predictors (each padded sequence without its last element), label (the last element of each sequence, one-hot encoded over total_words classes), and max_sequence_len (the length of the longest sequence, which is also the padded length).
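The padding and label split can be traced by hand on a tiny input. The sketch below uses plain Python lists instead of pad_sequences and to_categorical; the helpers pad_pre and one_hot are hypothetical stand-ins, and the three sequences and the vocabulary size are made up.

```python
def pad_pre(seq, maxlen):
    """Left-pad seq with zeros to maxlen (keeps the last maxlen items if longer)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position index."""
    row = [0] * num_classes
    row[index] = 1
    return row


sequences = [[1, 2], [1, 2, 3], [1, 4]]  # hypothetical prefix sequences
total_words = 5                          # hypothetical vocabulary size

max_sequence_len = max(len(s) for s in sequences)
padded = [pad_pre(s, max_sequence_len) for s in sequences]
predictors = [row[:-1] for row in padded]                  # all but the last token
label = [one_hot(row[-1], total_words) for row in padded]  # last token, one-hot

print(padded)      # [[0, 1, 2], [1, 2, 3], [0, 1, 4]]
print(predictors)  # [[0, 1], [1, 2], [0, 1]]
print(label)       # [[0, 0, 1, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1]]
```

Padding on the left keeps the target word in the last column of every row, which is what allows a single slice to separate predictors from labels.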

### Section 4: Model Creation

In [15]:
# Importing necessary libraries
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

# Function for creating the model
def create_model(max_sequence_len, total_words):
    """
    Creates and compiles an LSTM model for text generation.

    Args:
    - max_sequence_len (int): Maximum length of the input sequences after padding.
    - total_words (int): Total number of unique words in the corpus.

    Returns:
    - model (Sequential): Compiled LSTM model for text generation.
    """
    # Setting the input length
    input_len = max_sequence_len - 1
    
    # Initializing a Sequential model
    model = Sequential()
    
    # Adding an Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Adding Hidden Layer 1 - LSTM
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    
    # Adding Output Layer
    model.add(Dense(total_words, activation='softmax'))

    # Compiling the model
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

# Creating and compiling the LSTM model
model = create_model(max_sequence_len, total_words)

This cell creates and compiles an LSTM model for text generation. The create_model function builds a Sequential model with an Embedding layer (10-dimensional vectors), an LSTM layer with 100 units followed by Dropout at rate 0.1, and a Dense softmax output layer with one unit per vocabulary index. The model is compiled with categorical crossentropy as the loss function and the Adam optimizer.
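To get a feel for the size of this architecture, the trainable parameters can be counted by hand. The sketch below assumes a standard LSTM layer with four gates and one bias vector per gate; the function name count_parameters and the vocabulary size of 2000 are assumptions, since the notebook does not report the real value of total_words.

```python
def count_parameters(total_words, embedding_dim=10, lstm_units=100):
    """Count trainable parameters for Embedding -> LSTM -> Dropout -> Dense."""
    # One embedding vector per vocabulary index
    embedding = total_words * embedding_dim
    # Four gates, each with input weights, recurrent weights, and a bias
    lstm = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
    # Dropout has no trainable parameters
    # One weight per (LSTM unit, output class) pair, plus one bias per class
    dense = lstm_units * total_words + total_words
    return {
        "embedding": embedding,
        "lstm": lstm,
        "dense": dense,
        "total": embedding + lstm + dense,
    }


params = count_parameters(total_words=2000)  # hypothetical vocabulary size
print(params)
# {'embedding': 20000, 'lstm': 44400, 'dense': 202000, 'total': 266400}
```

Under these assumptions the output layer dominates, because both the Embedding and the Dense layer scale with the vocabulary size while the LSTM layer does not.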

### Section 5: Model Training and Saving Weights


In [16]:
# Importing necessary libraries
from keras.callbacks import EarlyStopping

# Model training
model.fit(predictors, label, epochs=100, verbose=5)

# Saving model weights
model.save_weights('textg3_model.h5')

Epoch 1/100
Epoch 2/100
Epoch 3/100
...
Epoch 76/100
Epoch 77/100
Epoch 78

This cell trains the model on predictors and label for 100 epochs and saves the model weights to the file 'textg3_model.h5'. Keras documents the verbose values 0 (silent), 1 (progress bar), and 2 (one line per epoch); 5 is not among them, and in the output above only the epoch counter is printed, with no loss values. EarlyStopping is imported but never passed to model.fit, so no early stopping is applied.
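The cell imports EarlyStopping without using it. As an illustration of the idea behind such a callback, the following pure-Python sketch applies a patience rule to a made-up list of per-epoch losses. The function name stopping_epoch, the loss values, and the patience setting are all assumptions; this is not Keras code and it does not reproduce the callback's full behaviour.

```python
def stopping_epoch(losses, patience=2, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Training stops once the loss has failed to improve on the best value
    by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # Never triggered: training runs through every epoch
    return len(losses)


toy_losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.4]  # hypothetical training losses
stopped_at = stopping_epoch(toy_losses, patience=2)
print(stopped_at)  # 5
```

With a patience of 3, the same losses would run to the end, because the improvement at epoch 6 arrives before the counter reaches the limit.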

### Section 6: Text Generation Using Trained Model

In [17]:
# Function for generating text
def generate_text(seed_text, next_words, model, max_sequence_len):
    """
    Generates new text based on a seed text using the trained LSTM model.

    Args:
    - seed_text (str): Seed text for text generation.
    - next_words (int): Number of words to generate.
    - model: Trained LSTM model.
    - max_sequence_len (int): Maximum length of the input sequences after padding.

    Returns:
    - str: Seed text followed by the generated words, in title case.
    """
    for _ in range(next_words):
        # Converting seed text to token list
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        
        # Padding the token list
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        
        # Predicting the next word: index of the highest-probability class
        predict_x = model.predict(token_list, verbose=0)
        predicted = int(np.argmax(predict_x, axis=1)[0])

        # Converting the predicted index to the corresponding word;
        # index 0 (padding) has no word and maps to an empty string
        output_word = tokenizer.index_word.get(predicted, "")
        
        # Appending the predicted word to the seed text
        seed_text += " " + output_word
    
    return seed_text.title()

# Generating new text
print(generate_text("my", 10, model, max_sequence_len))

My Days Of Horror Brings Geopolitical Whiplash Lacks Sip Of Niger


This cell defines generate_text, which extends a seed text using the trained LSTM model. It takes four arguments: seed_text (the initial text), next_words (the number of words to generate), model (the trained LSTM model), and max_sequence_len (the maximum length of the input sequences after padding). On each iteration the function tokenizes the current text, pads it to max_sequence_len - 1, picks the word with the highest predicted probability, and appends it. The result is returned in title case and printed, as in the sample output above.
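The generation loop can be exercised without a trained model by swapping in a stub predictor. In the sketch below, toy_predict is a hypothetical stand-in for model.predict that always puts all probability on one fixed successor index, and the four-word vocabulary is invented; only the loop structure (tokenize, pad, take the argmax, append) mirrors the notebook.

```python
def pad_pre(tokens, maxlen):
    """Left-pad tokens with zeros to maxlen (keeps the last maxlen items if longer)."""
    tokens = tokens[-maxlen:]
    return [0] * (maxlen - len(tokens)) + tokens


def toy_predict(padded, vocab_size):
    """Stub for model.predict: all probability on (last token % (vocab_size - 1)) + 1."""
    probs = [0.0] * vocab_size
    probs[padded[-1] % (vocab_size - 1) + 1] = 1.0
    return probs


def generate_text_sketch(seed_text, next_words, word_index, max_sequence_len):
    """Greedy decoding: repeatedly append the most probable next word."""
    index_word = {i: w for w, i in word_index.items()}
    vocab_size = len(word_index) + 1
    for _ in range(next_words):
        # Words missing from the vocabulary are skipped
        tokens = [word_index[w] for w in seed_text.lower().split() if w in word_index]
        padded = pad_pre(tokens, max_sequence_len - 1)
        probs = toy_predict(padded, vocab_size)
        predicted = max(range(vocab_size), key=probs.__getitem__)  # argmax
        seed_text += " " + index_word.get(predicted, "")
    return seed_text.title()


word_index = {"my": 1, "days": 2, "of": 3, "horror": 4}  # hypothetical vocabulary
result = generate_text_sketch("my", 4, word_index, max_sequence_len=4)
print(result)  # My Days Of Horror My
```

Because the choice is always the argmax, the same seed produces the same continuation on every run; sampling from the predicted distribution instead would give varied output.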