# Markov Chain Text Generator

## What is a Markov Chain?
A Markov Chain is a stochastic model of a sequence of events in which the probability of each event depends only on the state reached in the previous event (the Markov property). Markov Chains are used in fields such as natural language processing, game theory, and statistical modeling.

In the context of text generation, a Markov Chain can be used to model the probability of a word following another word based on a given text corpus. By training a Markov Chain on a dataset, we can generate new sequences of text that mimic the style and structure of the original data.

## How Does It Work?
1. **Tokenization**: The input text is split into individual words or tokens.
2. **Transition Graph**: A graph is built where each word is a node and each edge records an observed transition to a following word. How often a transition occurs in the training text determines its probability.
3. **Text Generation**: Starting from a given word (or prompt), the model selects the next word based on the probabilities in the graph. This process is repeated to generate a sequence of words.
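
To make the idea of transition probabilities concrete, here is a minimal sketch that counts word pairs in a hard-coded sentence and normalizes the counts. It is an illustration only, not the notebook's class: `counts` and `transition_probabilities` are hypothetical names, and the whitespace split stands in for the tokenizer described later.

```python
from collections import Counter, defaultdict

tokens = "the cat sat on the mat and the cat slept".split()

# For each word, count how often every other word follows it.
counts = defaultdict(Counter)
for current_word, next_word in zip(tokens, tokens[1:]):
    counts[current_word][next_word] += 1

def transition_probabilities(word):
    """Return {next_word: probability} for `word`, or {} if it has no successors."""
    successors = counts.get(word, Counter())
    total = sum(successors.values())
    return {nxt: n / total for nxt, n in successors.items()}

# "the" is followed by "cat" twice and "mat" once: 'cat' gets 2/3, 'mat' gets 1/3.
print(transition_probabilities("the"))
```

Storing every observed successor in a list, repeats included, and sampling uniformly from it yields the same distribution without computing the probabilities explicitly.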

This notebook implements a Markov Chain-based text generator using Python. It reads text data from CSV files, trains a Markov Chain model, and generates new text sequences based on user input.

## Importing Required Libraries
The following libraries are used in this notebook:
- `random`: For randomly selecting the next word during text generation.
- `string.punctuation`: For removing punctuation during tokenization.
- `collections.defaultdict`: For storing the Markov Chain graph as a dictionary of lists.
- `pandas`: For reading and processing CSV files.

In [29]:
# Importing required libraries
import random
from string import punctuation
from collections import defaultdict
import pandas as pd

## MarkovChain Class
The `MarkovChain` class implements the core functionality of the text generator. It includes methods for:
- Initializing the Markov Chain graph.
- Tokenizing input text.
- Training the model by building a graph of word transitions.
- Reading and processing CSV files.
- Generating text sequences based on the trained model.

### `__init__` Method
This method initializes the `MarkovChain` instance. It sets up the Markov Chain graph as a `defaultdict` of lists, where each key is a word, and the value is a list of possible next words.

In [30]:
class MarkovChain:
    def __init__(self):
        """
        Initializes the MarkovChain instance.

        Attributes:
            graph (defaultdict): A dictionary where each key is a word and the value 
                                 is a list of possible next words.
        """
        self.graph = defaultdict(list)
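
A short sketch of how a `defaultdict(list)` behaves may help here, since the difference between indexing and `.get()` matters when looking up words the model has never seen. The variable names are illustrative only.

```python
from collections import defaultdict

graph = defaultdict(list)
graph["hello"].append("world")   # indexing creates the list on first access

print(graph["hello"])            # ['world']
print(graph.get("unseen", []))   # [] and no key is created
print("unseen" in graph)         # False

_ = graph["unseen"]              # plain indexing inserts an empty list
print("unseen" in graph)         # True
```

Using `.get(word, [])` for lookups keeps the graph from filling up with empty entries for words that only ever appear in prompts.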

### `_tokenize` Method
This method tokenizes the input text by:
1. Removing punctuation and numeric characters.
2. Replacing newlines with spaces.
3. Splitting the text into individual words.

#### Arguments:
- `text` (str): The input text to be tokenized.

#### Returns:
- `list`: A list of words (tokens) extracted from the input text.

In [None]:
def _tokenize(self, text):
    """
    Tokenizes the input text by removing punctuation, numbers, and splitting it into words.
    """
    return (
        text.translate(str.maketrans("", "", punctuation + "1234567890"))
        .replace("\n", " ")
        .split()  # no argument: splits on whitespace runs, so no empty tokens
    )
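
To see what this kind of cleanup does to real input, the standalone sketch below applies the same `translate`-based stripping to a sample string. `tokenize_sketch` is a hypothetical name, and it splits on any run of whitespace.

```python
from string import punctuation

def tokenize_sketch(text):
    """Strip punctuation and digits, then split on any whitespace."""
    cleaned = text.translate(str.maketrans("", "", punctuation + "1234567890"))
    return cleaned.split()

sample = "Hello, world!\nIt's 2024;  let's go."
print(tokenize_sketch(sample))
# ['Hello', 'world', 'Its', 'lets', 'go']
```

Two side effects are worth noting: apostrophes count as punctuation, so contractions lose them, and case is preserved, so `Hello` and `hello` would be distinct tokens.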

### `_train` Method
This method trains the Markov Chain model by building a graph of word transitions from the input text.

#### Arguments:
- `text` (str): The input text used to train the Markov Chain model.

#### How It Works:
1. Tokenizes the input text using the `_tokenize` method.
2. Iterates through the tokens and builds a graph where each word points to a list of possible next words.

### `_read_pd_csv` Method
This method reads a CSV file and converts the first column into a single string.

#### Arguments:
- `csv_file_path` (str): The path to the CSV file.
- `header` (int or None): Row number to use as the column names, or `None` if the CSV file has no header row.

#### Returns:
- `str`: A string containing all rows of the first column, separated by newlines.

In [None]:
def _read_pd_csv(self, csv_file_path, header=None):
    """
    Reads a CSV file into a pandas DataFrame and converts the first column to a single string.
    """
    try:
        df = pd.read_csv(csv_file_path, encoding='UTF-8', header=header)
        first_column_as_string = "\n".join(df.iloc[:, 0].astype(str))
        return first_column_as_string
    except Exception as e:
        print(f"Error processing CSV file at {csv_file_path}: {e}")
        raise

### Constants: CSV File Paths

The `CSV_FILE_PATHS` constant defines a list of file paths to the CSV datasets used for training the Markov Chain model. Each file contains text data that will be processed and combined to build the Markov Chain graph.

#### Details:
- The datasets are stored in the `csv_datasets` directory.
- Each file is expected to have text data in the first column, which will be concatenated into a single string for training.

#### Example File Paths:
1. `markov_chain_impression_dataset.csv`: Contains impression-based text data.
2. `reddit_social_media_comments.csv`: Contains comments from Reddit.
3. `twitter_social_media_comments.csv`: Contains comments from Twitter.

These datasets are used to train the model, enabling it to generate text sequences that mimic the style and structure of the input data.

In [None]:
# Define constants for CSV file paths
CSV_FILE_PATHS = [
    "/Users/apple/Documents/Projects/Samhail/csv_datasets/markov_chain_impression_dataset.csv",
    "/Users/apple/Documents/Projects/Samhail/csv_datasets/reddit_social_media_comments.csv",
    "/Users/apple/Documents/Projects/Samhail/csv_datasets/twitter_social_media_comments.csv"
]
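
Absolute paths like these only resolve on the original machine. As a hedged alternative, the sketch below builds the same list relative to a dataset directory and reports any files that are missing. `DATASET_DIR`, `DATASET_NAMES`, and `RELATIVE_CSV_FILE_PATHS` are hypothetical names, and the layout (a `csv_datasets` folder in the working directory) is an assumption.

```python
from pathlib import Path

# Assumed layout: the datasets sit in ./csv_datasets relative to the working directory.
DATASET_DIR = Path("csv_datasets")
DATASET_NAMES = [
    "markov_chain_impression_dataset.csv",
    "reddit_social_media_comments.csv",
    "twitter_social_media_comments.csv",
]
RELATIVE_CSV_FILE_PATHS = [str(DATASET_DIR / name) for name in DATASET_NAMES]

# Report missing files up front instead of failing midway through training.
missing = [path for path in RELATIVE_CSV_FILE_PATHS if not Path(path).is_file()]
if missing:
    print("Missing dataset files:", missing)
```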

### `_generate` Method
This method generates a sequence of text based on the trained Markov Chain and a given prompt.

#### Arguments:
- `prompt` (str): The initial text to start the generation.
- `length` (int): The maximum number of words to append to the prompt (default is 10). Fewer words are produced if the current word has no recorded successors.

#### Returns:
- `str`: A string containing the generated sequence of text.

In [None]:
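def _generate(self, prompt, length=10):
    """
    Generates a sequence of text based on the trained Markov Chain and a given prompt.
    """
    tokens = self._tokenize(prompt)
    if not tokens:
        # Nothing to continue from (empty or punctuation-only prompt).
        return prompt
    current = tokens[-1]
    output = prompt
    for _ in range(length):
        # .get() avoids inserting unseen words into the defaultdict.
        options = self.graph.get(current, [])
        if not options:
            # Dead end: `current` never changes, so further iterations cannot succeed.
            break
        current = random.choice(options)
        output += f" {current}"
    return output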
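
The walk performed during generation can be traced on a tiny hand-built graph. The sketch below is a standalone function, not the class method: `generate_sketch` is a hypothetical name, the prompt is split on whitespace only, and each word in the example graph has a single successor so the result is deterministic.

```python
import random

def generate_sketch(graph, prompt, length=10, rng=random):
    """Append up to `length` words to `prompt`, stopping at a dead end."""
    tokens = prompt.split()
    if not tokens:
        return prompt
    current = tokens[-1]          # only the last prompt word drives the first step
    words = [prompt]
    for _ in range(length):
        options = graph.get(current, [])
        if not options:
            break                 # no recorded successor for `current`
        current = rng.choice(options)
        words.append(current)
    return " ".join(words)

graph = {"good": ["morning"], "morning": ["everyone"]}
print(generate_sketch(graph, "a very good", length=10))
# a very good morning everyone
```

Although `length` is 10, only two words are appended because `everyone` has no successors. Passing `rng=random.Random(seed)` would make runs on a larger graph repeatable.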

### `_train_model` Method
This method trains the Markov Chain model using text data from multiple CSV files.

#### Arguments:
- `csv_file_paths` (list of str): A list of file paths to the CSV files containing the training data.
- `csv_header` (int or None): Row number to use as the column names, or `None` if the CSV files have no headers.

#### Returns:
- `MarkovChain`: The same `MarkovChain` instance (`self`), now trained on the combined text data.

In [None]:
def _train_model(self, csv_file_paths, csv_header=None):
    """
    Trains the Markov Chain model using text data from multiple CSV files.
    """
    if not csv_file_paths:
        raise ValueError("No CSV file paths provided.")
    # Join files with a newline so the last word of one file is not
    # fused with the first word of the next.
    text = "\n".join(
        self._read_pd_csv(csv_file_path, header=csv_header)
        for csv_file_path in csv_file_paths
    )
    self._train(text)
    return self

### `predict_next` Function
This standalone function trains the Markov Chain model and generates text based on user input.

#### Steps:
1. Creates an instance of the `MarkovChain` class.
2. Trains the model using the `_train_model` method.
3. Prompts the user for input text.
4. Generates a sequence of text using the `_generate` method.

In [36]:
def predict_next():
    """
    Trains the Markov Chain model and generates text based on user input.
    """
    model = MarkovChain()
    trained_model = model._train_model(CSV_FILE_PATHS)
    prompt = input("Enter a prompt: ")
    # Concatenate instead of passing two arguments, so the output line has no leading space.
    print("The predicted sentence is the following:\n" + trained_model._generate(prompt, length=10))

In [37]:
# Run the predict_next function
predict_next()

AttributeError: 'MarkovChain' object has no attribute '_train_model'
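
This traceback is consistent with how the cells above are written: `_train_model` and the other helpers are defined with `def` at the top level of their own cells, so they are ordinary functions and never become attributes of `MarkovChain`. One possible repair that keeps the one-method-per-cell layout is to assign each function onto the class after defining it. The sketch below shows the pattern on a stand-in class; `Demo` and `greet` are hypothetical names.

```python
class Demo:
    def __init__(self):
        self.name = "demo"

# Defined in a separate cell, outside the class body: a plain function.
def greet(self):
    return f"hello from {self.name}"

instance = Demo()
try:
    instance.greet()
except AttributeError as error:
    print(error)  # 'Demo' object has no attribute 'greet'

# Attaching the function to the class turns it into a method,
# including for instances that already exist.
Demo.greet = greet
print(instance.greet())  # hello from demo
```

For this notebook, that would mean adding a line such as `MarkovChain._tokenize = _tokenize` after each method cell, or moving all method definitions into the `class MarkovChain:` body in a single cell. No code cell for `_train` is shown above, so it would also need to be defined before `_train_model` can run.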