<a href="https://colab.research.google.com/github/gaytri9/Text-Generation-in-NLP/blob/main/Text_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation

## Introduction

Markov chains can be used for very basic text generation. Think of every word in a corpus as a state. We make the simplifying assumption that the next word depends only on the current word, which is the defining assumption of a first-order Markov chain.

Markov chains don't generate text as well as deep learning models, but they're a good (and fun!) start.
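
The assumption described above can be written compactly. The notation here is introduced only for illustration and is not part of the original notebook: $w_t$ stands for the word at position $t$, and $\text{count}(a, b)$ for the number of times word $b$ directly follows word $a$ in the corpus.

$$
P(w_t \mid w_1, \dots, w_{t-1}) = P(w_t \mid w_{t-1})
$$

A simple way to estimate the right-hand side from text is by relative frequency:

$$
\hat{P}(b \mid a) = \frac{\text{count}(a, b)}{\sum_{c} \text{count}(a, c)}
$$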

## Select Text to Imitate

In this notebook, we're going to generate text in the style of sports news articles, so as a first step, let's load the articles from an uploaded ZIP archive.

In [None]:
from google.colab import files

# Upload the ZIP file containing the folder named "sport"
uploaded = files.upload()


Saving sport.zip to sport.zip


In [None]:
import os
import zipfile
import io
import pandas as pd

# Function to read files with various encodings
def read_file_with_encoding(file_path, encodings):
    for encoding in encodings:
        try:
            with open(file_path, 'r', encoding=encoding) as file:
                return file.read()
        except UnicodeDecodeError:
            continue
    return None

# Extract the uploaded ZIP file
zip_file = list(uploaded.keys())[0]  # Get the name of the uploaded ZIP file
with zipfile.ZipFile(io.BytesIO(uploaded[zip_file]), 'r') as zip_ref:
    zip_ref.extractall()

# Get the list of files in the "sport" folder
folder_path = 'sport/'
file_names = os.listdir(folder_path)

# Create an empty list to store the contents of each file
documents = []

# Loop through each file, read its contents, and append to the list
for file_name in file_names:
    file_path = os.path.join(folder_path, file_name)
    # 'latin-1' accepts any byte sequence, so it must come last as the catch-all
    content = read_file_with_encoding(file_path, ['utf-8-sig', 'windows-1252', 'latin-1'])
    if content is not None:
        documents.append(content)
    else:
        print(f"Failed to read file '{file_name}' with all tried encodings.")

# Create a DataFrame with the documents
sport_df = pd.DataFrame({'text': documents})


In [None]:
print(sport_df.head())

                                                text
0  Melzer shocks Agassi\n\nSecond seed Andre Agas...
1  Blues slam Blackburn over Savage\n\nBirmingham...
2  Thanou desperate to make return\n\nGreek sprin...
3  Dementieva prevails in Hong Kong\n\nElena Deme...
4  Almagro continues Spanish surge\n\nUnseeded Ni...


In [None]:
import os
import pickle

import nltk
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

nltk.download('wordnet')
nltk.download('punkt')

# Define custom stopwords for sports-related terms
custom_stopwords = set([...])  # Add your custom stopwords here

# Define the folder path containing the documents
folder_path = 'sport/'

# Get the list of file names in the folder
file_names = os.listdir(folder_path)

# Create an empty list to store the contents of each document
documents = []

# Initialize NLTK components for preprocessing
lemmatizer = WordNetLemmatizer()

# Process each document
for file_name in file_names:
    try:
        with open(os.path.join(folder_path, file_name), 'r', encoding='utf-8') as file:
            content = file.read()
    except UnicodeDecodeError:
        # Try reading with a different encoding if utf-8 fails
        try:
            with open(os.path.join(folder_path, file_name), 'r', encoding='latin-1') as file:
                content = file.read()
        except Exception as e:
            print(f"Error reading file {file_name}: {e}")
            continue

    # Tokenize the content with punctuation included
    tokens = word_tokenize(content)

    # Remove custom stopwords and lemmatize the remaining tokens (punctuation tokens are kept)
    filtered_tokens = [lemmatizer.lemmatize(word.lower()) for word in tokens
                       if word.lower() not in custom_stopwords]

    # Join the tokens back into a single string
    preprocessed_text = ' '.join(filtered_tokens)
    documents.append(preprocessed_text)

# Convert preprocessed data to DataFrame
data_clean = pd.DataFrame({'text': documents})

# Save the preprocessed data as a pickle file
data_clean.to_pickle('data_clean.pkl')

# Save the preprocessed data as a CSV file
data_clean.to_csv('data_clean.csv', index=False)

# Initialize and fit CountVectorizer
cv = CountVectorizer()
cv.fit(data_clean['text'])

# Save the CountVectorizer object as a pickle file
with open('cv.pkl', 'wb') as cv_file:
    pickle.dump(cv, cv_file)

# Create TF-IDF vectorizer with custom parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.85, min_df=0.1, max_features=500, ngram_range=(1, 2))

# Fit and transform the processed text data
document_term_matrix = tfidf_vectorizer.fit_transform(documents)

# Convert document-term matrix to DataFrame for easier manipulation
dtm_df = pd.DataFrame(document_term_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Save the DataFrame as a pickle file
dtm_df.to_pickle('dtm_df.pkl')

# Display the document-term matrix
print(dtm_df.head())

# Save the corpus DataFrame
data = pd.DataFrame({'text': documents})
data.to_pickle('corpus.pkl')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


         10       17        18  2003   about  action     added  admitted  \
0  0.213269  0.00000  0.000000   0.0  0.0823     0.0  0.080413  0.000000   
1  0.000000  0.00000  0.092140   0.0  0.0000     0.0  0.125033  0.000000   
2  0.000000  0.12741  0.063017   0.0  0.0000     0.0  0.000000  0.060187   
3  0.000000  0.00000  0.000000   0.0  0.0000     0.0  0.057962  0.000000   
4  0.000000  0.00000  0.000000   0.0  0.0000     0.0  0.095364  0.067119   

      after  after the  ...       won   won the  work  world     would  \
0  0.000000   0.000000  ...  0.000000  0.000000   0.0    0.0  0.000000   
1  0.083734   0.000000  ...  0.066130  0.093146   0.0    0.0  0.000000   
2  0.028634   0.000000  ...  0.045228  0.000000   0.0    0.0  0.000000   
3  0.000000   0.000000  ...  0.000000  0.000000   0.0    0.0  0.211368   
4  0.031932   0.062687  ...  0.000000  0.000000   0.0    0.0  0.086939   

   would be  would have      year  year old       you  
0  0.000000         0.0  0.055003  0.08064

In [None]:
# Read in the corpus, including punctuation!
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

Unnamed: 0,text
0,robben play down european return injured chels...
1,stuart join norwich from addicks norwich have ...
2,yachvili savour france comeback france scrum-h...
3,smith keen on home series return scotland mana...
4,edu blast arsenal arsenal 's brazilian midfiel...
...,...
506,palace threat over cantona mask manchester uni...
507,big gun ease through in san jose top-seeded am...
508,newcastle 2-1 bolton kieron dyer smashed home ...
509,jones file conte lawsuit marion jones ha filed...


In [None]:
# Extract text for index '508'
text = data.loc[508, 'text']
print(text[:200])


poll explains free-kick decision referee graham poll said he applied the law of the game in allowing arsenal striker thierry henry 's free-kick in sunday 's 2-2 draw with chelsea . keeper petr cech wa


## Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys
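
To make the target structure concrete, here is a minimal sketch on a made-up sentence. The names `toy_text` and `toy_chain` are hypothetical and used only for this illustration.

```python
from collections import defaultdict

# Hypothetical toy sentence, used only for illustration
toy_text = "the cat sat on the mat"

toy_words = toy_text.split(' ')
pairs = defaultdict(list)

# Pair each word with the word that follows it
for current_word, next_word in zip(toy_words[:-1], toy_words[1:]):
    pairs[current_word].append(next_word)

# Plain dictionary: word -> list of words seen right after it
toy_chain = {word: followers for word, followers in pairs.items()}
print(toy_chain)
# {'the': ['cat', 'mat'], 'cat': ['sat'], 'sat': ['on'], 'on': ['the']}
```

Note that `'mat'` is not a key: it is the last word and nothing follows it, so a generator that reaches it has no next word to pick.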

In [None]:
from collections import defaultdict

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''

    # Tokenize the text by word, though including punctuation
    words = text.split(' ')

    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)

    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

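    # Convert the defaultdict back into a plain dictionary. A comprehension is used
    # because the name `dict` is reassigned in a later cell of this notebook.
    return {word: next_words for word, next_words in m_dict.items()}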

In [None]:
# Create the dictionary for sports, take a look at it
dict = markov_chain(text)
dict

{'poll': ['explains', 'said', '.', 'added', "'s"],
 'explains': ['free-kick'],
 'free-kick': ['decision', 'in', 'flew', '?', 'option', 'or', '.'],
 'decision': ['referee', 'to'],
 'referee': ['graham', "'", 'have', 'tends', 'for', 'were'],
 'graham': ['poll'],
 'said': ['he', "'can", "'yes", 'poll', 'that', 'to'],
 'he': ['applied', 'said', 'wa', 'did', 'paused', 'said', 'turned', 'added'],
 'applied': ['the'],
 'the': ['law',
  'game',
  'whistle',
  'law',
  'game',
  'signal',
  'same',
  'ball',
  'goal',
  'ref',
  'strike',
  'advantage',
  'non-offending',
  'player',
  'quick',
  'wall',
  'referee',
  'kick',
  'referee',
  'premier',
  'premier',
  'football',
  'summer',
  'situation',
  'option',
  'quick',
  "'ceremonial"],
 'law': ['of', 'of'],
 'of': ['the', 'the', 'the', 'free-kick', '2003', 'either', 'what'],
 'game': ['in', '.'],
 'in': ['allowing', 'sunday', ',', 'an', 'the'],
 'allowing': ['arsenal'],
 'arsenal': ['striker', '2-1', ','],
 'striker': ['thierry'],
 't

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

Here are some examples of generated sentences (these appear to come from a different, non-sports corpus, so your output will look different):

>'Shape right turn– I also takes so that she’s got women all know that snail-trail.'

>'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'
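
The value lists in the dictionary keep duplicates, and that matters for generation: `random.choice` picks uniformly from the list, so a follower that appears twice is twice as likely to be chosen. The sketch below checks this on a hypothetical three-item list; the list, the seed, and the number of draws are assumptions made for the illustration.

```python
import random
from collections import Counter

# Hypothetical follower list in which 'law' occurs twice and 'game' once
followers = ['law', 'law', 'game']

rng = random.Random(0)  # seeded generator so the run is repeatable
n_draws = 3000
draws = Counter(rng.choice(followers) for _ in range(n_draws))

# Share of draws that returned 'law'; it should be close to 2/3
share_law = draws['law'] / n_draws
print(draws, round(share_law, 2))
```

In other words, storing raw follower lists is an implicit way of storing transition frequencies without computing probabilities explicitly.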

In [None]:
import random

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''

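    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for _ in range(count - 1):
        # A word that only occurs at the very end of the text has no followers,
        # so fall back to a random key instead of raising a KeyError
        followers = chain.get(word1) or list(chain.keys())
        word2 = random.choice(followers)
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return sentence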

In [None]:
generate_sentence(dict)

'Henry told bbc radio five live : `` at one point , he paused before.'

## Additional Exercises

1. Try making the generate_sentence function better. Maybe allow it to end with a random punctuation mark or end whenever it gets to a word that already ends with a punctuation mark.

In [None]:
import random

def generate_sentence(chain, count=15):
    # Start the sentence with a random word from the chain
    word = random.choice(list(chain.keys()))
    sentence = [word]

    # Generate the rest of the sentence
    while len(sentence) < count:
        # Stop early at a dead end (a word with no recorded followers)
        followers = chain.get(word)
        if not followers:
            break
        word = random.choice(followers)
        sentence.append(word)
        # End sentence if word ends with punctuation mark
        if word.endswith(('?', '!', '.')):
            break

    # If the sentence did not end with punctuation, add a random punctuation mark
    if not sentence[-1].endswith(('?', '!', '.')):
        sentence[-1] += random.choice(['.', '!', '?'])

    return ' '.join(sentence)


In [None]:
generate_sentence(dict)

'his defensive wall back 9.15 metre ?'
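
Because the corpus was tokenized with punctuation as separate tokens, generated sentences contain a space before marks such as `?` or `,`. One possible clean-up step is sketched below; `tidy_spacing` is a hypothetical helper, not part of the original notebook, and it handles only a few common punctuation marks.

```python
import re

def tidy_spacing(sentence):
    """Remove whitespace that tokenization leaves before common punctuation."""
    # Only whitespace directly before one of ? . ! , ; : is dropped,
    # so a number such as 9.15 is left untouched
    return re.sub(r"\s+([?.!,;:])", r"\1", sentence)

print(tidy_spacing('his defensive wall back 9.15 metre ?'))
# his defensive wall back 9.15 metre?
```

Quotation tokens such as the `` produced by the tokenizer are left as they are; handling them would need additional rules.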