# Text Generator Project

## Overview
This project builds a text generator from a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) layers. The model is trained to predict the next word from the 10 words that precede it, and text is generated by repeatedly appending a predicted word to the input and predicting again.

## Key Steps

1. **Data Preparation**:
   - Convert text data into a list of words.
   - Tokenize the text and create a dictionary of unique tokens.

2. **Sequence Creation**:
   - Create input sequences and corresponding next words.
   - Encode these sequences as one-hot vectors.

3. **Model Building and Training**:
   - Define a Sequential model with LSTM layers and a Dense layer.
   - Compile the model with categorical crossentropy loss and RMSprop optimizer.
   - Train the model on the prepared data.

4. **Text Generation**:
   - Define functions to predict the next word and generate text based on input sequences.

In [1]:
# Import Libraries and model
import random
import pickle

import numpy as np
import pandas as pd
from nltk.tokenize import RegexpTokenizer

from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import LSTM, Dense, Activation
from tensorflow.keras.optimizers import RMSprop

1. **Data Preparation**

In [2]:
text_df = pd.read_csv("fake_or_real_news.csv")

In [3]:
# Convert the 'text' column of the DataFrame to a list
text = list(text_df.text.values)

# Join all the text strings in the list into a single string, separated by spaces
joined_text = " ".join(text)

In [4]:
# Extract the first 500,000 characters
partial_text = joined_text[:500000]

# If your machine has enough memory, raise this to 1M characters; more training text should improve the generator

In [5]:
# Create a tokenizer that matches runs of word characters (letters, digits, underscores)
tokenizer = RegexpTokenizer(r"\w+")

# Convert the partial_text string to lowercase and tokenize it into words
tokens = tokenizer.tokenize(partial_text.lower())
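To see what the `\w+` pattern does to contractions and punctuation, here is a minimal sketch that uses Python's built-in `re` module rather than NLTK. For this particular pattern, `re.findall` should yield the same tokens as `RegexpTokenizer(r"\w+").tokenize`. The sample sentence is made up for illustration.

```python
import re

# Hypothetical sample sentence, not taken from the dataset
sample = "He doesn't know; it's 2016!"

# \w+ keeps runs of letters, digits, and underscores and drops everything else
sample_tokens = re.findall(r"\w+", sample.lower())
print(sample_tokens)
```

Apostrophes split contractions into fragments, which is why tokens such as `s`, `re`, and `doesn` show up in the generated text at the end of this notebook.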

In [6]:
# Find all unique tokens in the list of tokens
unique_tokens = np.unique(tokens)

# Create a dictionary mapping each unique token to its index
unique_token_index = {token: idx for idx, token in enumerate(unique_tokens)}
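The vocabulary mapping works in both directions: the dictionary maps a token to its index, and the sorted array returned by `np.unique` maps an index back to its token. A small sketch on a toy token list (the `toy_` names are illustrative only):

```python
import numpy as np

# Toy token list (hypothetical), standing in for the real `tokens`
toy_tokens = ["the", "cat", "saw", "the", "dog"]

# np.unique returns the distinct tokens in sorted order
toy_unique = np.unique(toy_tokens)
toy_index = {token: idx for idx, token in enumerate(toy_unique)}

print(toy_unique)
print(toy_index)

# The array itself acts as the reverse lookup: index -> token
roundtrip_ok = all(toy_unique[idx] == token for token, idx in toy_index.items())
print(roundtrip_ok)
```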

2. **Sequence Creation**

In [7]:
# Set the number of words in each input sequence
n_words = 10

# Initialize lists for input sequences and the next word
input_words = []
next_words = []

# Create pairs of input sequences and the next word
for i in range(len(tokens) - n_words):
    input_words.append(tokens[i:i + n_words])
    next_words.append(tokens[i + n_words])
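The sliding window is easier to follow on a tiny input. This sketch repeats the same loop with a window of 3 instead of 10; the token list and the `toy_` names are made up for illustration.

```python
# Toy example with a window of 3 words (the notebook uses n_words = 10)
toy_tokens = ["a", "b", "c", "d", "e"]
window = 3

toy_inputs = []
toy_targets = []
for i in range(len(toy_tokens) - window):
    toy_inputs.append(toy_tokens[i:i + window])   # words i .. i+window-1
    toy_targets.append(toy_tokens[i + window])    # the word that follows them

for seq, target in zip(toy_inputs, toy_targets):
    print(seq, "->", target)
```

Each step shifts the window by one word, so a text of N tokens yields N minus the window size training pairs.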

In [8]:
# Create a NumPy array X to store input sequences
X = np.zeros((len(input_words), n_words, len(unique_tokens)), dtype=bool)

# Create a NumPy array y to store the next word for each input sequence
y = np.zeros((len(next_words), len(unique_tokens)), dtype=bool)
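One-hot arrays grow quickly, which is why the amount of text was capped earlier. The helper below is a rough estimate, assuming NumPy stores each `bool` element in one byte. The sizes passed in are hypothetical and should be replaced with `len(input_words)` and `len(unique_tokens)` from your own run.

```python
def one_hot_bytes(n_samples, n_words, vocab_size):
    """Bytes needed for X and y as NumPy bool arrays (1 byte per element)."""
    x_bytes = n_samples * n_words * vocab_size
    y_bytes = n_samples * vocab_size
    return x_bytes + y_bytes

# Hypothetical sizes, for illustration only
total = one_hot_bytes(n_samples=80_000, n_words=10, vocab_size=10_000)
print(f"{total / 1e9:.1f} GB")
```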

In [9]:
# Iterate over input sequences
for i, words in enumerate(input_words):
    # Iterate over words in each input sequence
    for j, word in enumerate(words):
        # Encode each word as a one-hot vector in the X array
        X[i, j, unique_token_index[word]] = 1
    # Encode the next word as a one-hot vector in the y array
    y[i, unique_token_index[next_words[i]]] = 1
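The same encoding loop on a four-word toy vocabulary shows what ends up in the arrays. All names prefixed with `toy_` or suffixed with `_toy` are illustrative and not part of the notebook.

```python
import numpy as np

toy_index = {"cat": 0, "dog": 1, "saw": 2, "the": 3}
toy_inputs = [["the", "cat"], ["cat", "saw"]]
toy_targets = ["saw", "the"]

X_toy = np.zeros((len(toy_inputs), 2, len(toy_index)), dtype=bool)
y_toy = np.zeros((len(toy_targets), len(toy_index)), dtype=bool)

for i, words in enumerate(toy_inputs):
    for j, word in enumerate(words):
        X_toy[i, j, toy_index[word]] = True
    y_toy[i, toy_index[toy_targets[i]]] = True

# Each word position holds exactly one True, so argmax recovers the indices
print(X_toy.argmax(axis=-1))
print(y_toy.argmax(axis=-1))
```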

3. **Model Building and Training**

In [None]:
# Initialize a sequential model
model = Sequential()

# Add the first LSTM layer with 128 units, returning sequences
model.add(LSTM(128, input_shape=(n_words, len(unique_tokens)), return_sequences=True))

# Add the second LSTM layer with 128 units
model.add(LSTM(128))

# Add a Dense layer with units equal to the number of unique tokens
model.add(Dense(len(unique_tokens)))

# Add a softmax activation layer to convert outputs to probabilities
model.add(Activation("softmax"))
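To get a feel for the size of this network, the parameter counts can be worked out by hand. The sketch below uses the standard formulas for an LSTM layer with biases and for a Dense layer; the vocabulary size is a hypothetical placeholder, and the totals should agree with what `model.summary()` reports when you plug in your own `len(unique_tokens)`.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size = 10_000  # hypothetical; use len(unique_tokens) in practice

total = (
    lstm_params(vocab_size, 128)     # first LSTM reads one-hot vectors
    + lstm_params(128, 128)          # second LSTM reads the first one's output
    + dense_params(128, vocab_size)  # projection back onto the vocabulary
)
print(total)
```

With one-hot inputs, most of the weights sit in the first LSTM layer and the final Dense layer, because both scale with the vocabulary size.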

In [11]:
# Compile the model with categorical crossentropy loss, RMSprop optimizer, and accuracy metric
model.compile(loss="categorical_crossentropy", optimizer=RMSprop(learning_rate=0.01), metrics=["accuracy"])

# Train the model with the input data X and labels y, using a batch size of 128 and 50 epochs
model.fit(X, y, batch_size=128, epochs=50, shuffle=True)

Epoch 1/50
663/663 ━━━━━━━━━━━━━━━━━━━━ 126s 186ms/step - accuracy: 0.0545 - loss: 7.3420
Epoch 2/50
663/663 ━━━━━━━━━━━━━━━━━━━━ 119s 179ms/step - accuracy: 0.0753 - loss: 6.8467
Epoch 3/50
663/663 ━━━━━━━━━━━━━━━━━━━━ 118s 178ms/step - accuracy: 0.0984 - loss: 6.5159
Epoch 4/50
663/663 ━━━━━━━━━━━━━━━━━━━━ 119s 179ms/step - accuracy: 0.1184 - loss: 6.2630
Epoch 5/50
663/663 ━━━━━━━━━━━━━━━━━━━━ 119s 179ms/step - accuracy: 0.1350 - loss: 6.0215
Epoch 6/50
663/663 ━━━━━━━━━━━━━━━━━━━━ 119s 179ms/step - accuracy: 0.1527 - loss: 5.7858
Epoch 7/50
663/663 ━━━━━━━━━━━━━━━━━━━━ 117s 177ms/step - accuracy: 0.1703 - loss: 5.5577
Epoch 8/50
663/663 ━━━━━━━━━━━━━━━━━━━━ 118s 178ms/step - accuracy: 0.1891 - loss: 5.3308
Epoch 9/

<keras.src.callbacks.history.History at 0x24887f78860>
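The loss values in the log above are easier to interpret as perplexity. Keras computes categorical crossentropy with the natural logarithm, so the exponential of the loss is the perplexity on the training data. A minimal sketch using the epoch 8 figure from the log:

```python
import math

# Training loss reported for epoch 8 in the log above
loss = 5.3308

# Cross-entropy is measured in nats, so exp(loss) gives the perplexity
perplexity = math.exp(loss)
print(f"perplexity ~ {perplexity:.0f}")
```

A perplexity of this order means that, on average, the model is about as uncertain as if it were choosing uniformly among a couple of hundred words. This is measured on the training data only, since the notebook does not hold out a validation set.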

In [None]:
model.save("mymodel.h5")

In [None]:
model = load_model("mymodel.h5")
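Reloading the model in a fresh session is only useful if the vocabulary is available too, because the generation functions depend on `unique_tokens` and `unique_token_index`. One option, sketched here as an assumption rather than something the notebook does, is to store them with `pickle`. The file name and the toy vocabulary are hypothetical.

```python
import os
import pickle
import tempfile

# Toy stand-ins for `unique_tokens` and `unique_token_index`
vocab = {
    "unique_tokens": ["cat", "dog"],
    "unique_token_index": {"cat": 0, "dog": 1},
}

# Hypothetical file name; written to the temp directory for this demo
path = os.path.join(tempfile.gettempdir(), "vocab.pkl")

with open(path, "wb") as f:
    pickle.dump(vocab, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == vocab)
```

The real `unique_tokens` is a NumPy array, which `pickle` can serialize as well.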

4. **Text Generation**

In [14]:
def predict_next_word(input_text, n_best):
    # Convert the input text to lowercase
    input_text = input_text.lower()
    
    # Initialize the input array X
    X = np.zeros((1, n_words, len(unique_tokens)))
    
    # Encode the input text as a one-hot vector
    for i, word in enumerate(input_text.split()):
        X[0, i, unique_token_index[word]] = 1
    
    # Predict the next word probabilities
    predictions = model.predict(X)[0]
    
    # Return the indices of the n_best predictions
    return np.argpartition(predictions, -n_best)[-n_best:]
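Note that `np.argpartition` returns the top `n_best` indices in no particular order. If a ranked list is wanted, the selected indices can be sorted by probability afterwards. A small sketch with a made-up probability vector standing in for the model output:

```python
import numpy as np

# Hypothetical probability vector standing in for model.predict(X)[0]
probs = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
n_best = 3

# argpartition finds the top n_best indices but leaves them unordered
top = np.argpartition(probs, -n_best)[-n_best:]

# Sort those indices by probability, highest first
ranked = top[np.argsort(probs[top])[::-1]]
print(ranked)
```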

In [15]:
possible = predict_next_word("he will have to look into this thing and he", 5)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 265ms/step


In [16]:
print([unique_tokens[idx] for idx in possible])

['s', 'did', 'is', 'would', 'was']


In [17]:
def generate_text(input_text, text_length, creativity=3):
    # Initialize the list of words from the input text
    word_sequence = input_text.split()
    current = 0
    
    # Loop to generate new text
    for _ in range(text_length):
        # Create a sub-sequence of the current word sequence
        sub_sequence = " ".join(tokenizer.tokenize(" ".join(word_sequence).lower())[current:current+n_words])
        
        try:
            # Predict the next word and choose one from the top predictions
            choice = unique_tokens[random.choice(predict_next_word(sub_sequence, creativity))]
        except (KeyError, IndexError):
            # Fall back to a random word if the input has an out-of-vocabulary word or too many words
            choice = random.choice(unique_tokens)
        
        # Append the chosen word to the word sequence
        word_sequence.append(choice)
        current += 1
    
    # Return the generated text as a single string
    return " ".join(word_sequence)
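`generate_text` picks uniformly among the top `creativity` candidates, so the fifth-best word is as likely to be chosen as the best one. A possible variation, not part of the notebook, is to weight the choice by the predicted probabilities. The helper name `sample_top_k` and the probability list are hypothetical.

```python
import random

def sample_top_k(probs, k, rng=random):
    """Pick an index among the k most likely, weighted by probability."""
    # Indices of the k largest probabilities
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Hypothetical distribution over a five-word vocabulary
probs = [0.05, 0.40, 0.10, 0.30, 0.15]
random.seed(0)
picks = [sample_top_k(probs, k=3) for _ in range(1000)]
print({i: picks.count(i) for i in sorted(set(picks))})
```

To use something like this in the notebook, `predict_next_word` would need to return the probabilities as well as the indices.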

In [18]:
generate_text("he will have to look into this thing and he", 100, 5)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 35ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 35ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 24ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 27

'he will have to look into this thing and he s every and question in that america doesn hope some before keeping being supporting as defense even taking killed before the western night carl again and then was other rules changes a other contributors s day number from how they re everything to demand obamacare ayotte run john proceeds reports will 11 defense n el post for it was related to their husband put as america a couple ago at america event about every lot and hope that see everything he knows they will need that number the congressional earners when we re supporting international speech believe that truth is'