<a href="https://colab.research.google.com/github/UsmanGhias/NLP-Market-Projects/blob/main/ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Exploring the Limits of Language Models: Human vs. AI-Generated Text




- Text Classification: Develop a classification model capable of accurately distinguishing between human-written text and AI-generated text. This model is not just a tool for analysis but also a stepping stone toward understanding the intricacies of language models like GPT-3 and ChatGPT.
- Text Generation: Experiment with N-gram models to generate sentences that closely resemble human writing, exploring how changes in model parameters affect the coherence and authenticity of the generated text.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Theoretical Calculation for N-gram Model with Laplacian Smoothing

#### Introduction to N-grams and Laplacian Smoothing:

- N-grams are sequences of 'n' items (words or letters) used for predicting the next item in a sequence. For example, in a trigram model (n=3), we look at sequences of three words.
- Laplacian (or Add-One) Smoothing is used to handle the issue of zero probabilities for unseen n-grams in the training data. By adding one to the count of each n-gram during probability calculations, we ensure that every potential n-gram has a non-zero chance of occurring.
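
As a small illustration of the definitions above, the sketch below extracts bigrams and trigrams from a toy sentence. The helper name `ngrams_of` and the example sentence are assumptions made for this illustration and are not part of the notebook's pipeline.

```python
def ngrams_of(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    # zip over n shifted copies of the token list; zip stops at the shortest copy
    return list(zip(*[tokens[i:] for i in range(n)]))

# Toy sentence (an assumption for illustration only)
tokens = "the program prints the result".split()

bigrams = ngrams_of(tokens, 2)
trigrams = ngrams_of(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k - 1 bigrams and k - 2 trigrams, which is why higher-order models see far fewer repeated n-grams in the same corpus.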

# Example: Calculating Probability with Laplace Smoothing

##### Suppose we want to calculate the probability of the bigram "the program" using Laplace smoothing on the provided text (copied from the gpt.txt and hum.txt files).

#### Step 1: Count the Occurrences

- Let \(C("the program")\) be the count of the bigram "the program" in the corpus. Assume \(C("the program") = 2\).
- Let \(C("the")\) be the count of the word "the", the first word of our target bigram. Assume \(C("the") = 10\).
- Let \(V\) be the vocabulary size, which is the total number of unique words in the corpus. Assume \(V = 100\).

#### Step 2: Apply Laplace Smoothing Formula

The formula for calculating the probability of a word \(w_i\) given the previous word \(w_{i-1}\) with Laplace smoothing (add-one smoothing) is:

```Latex
\[ P(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V} \]
```

- Substituting our counts into the formula gives:

```Latex
\[ P("program" | "the") = \frac{C("the program") + 1}{C("the") + V} = \frac{2 + 1}{10 + 100} = \frac{3}{110} \]
```

#### Complete Explanation

- The numerator \(C("the program") + 1\) represents the smoothed count of the bigram "the program". We add 1 to ensure that every bigram, including those not seen in the training corpus, has a non-zero probability.
- The denominator \(C("the") + V\) represents the smoothed count of all bigrams starting with "the". We add \(V\) (the vocabulary size) to account for the possibility of any word from the vocabulary following "the".

### Example Results

The calculated probability \(P("program" | "the") = \frac{3}{110}\) indicates that, after applying Laplace smoothing, the bigram "the program" has a small but non-zero probability of occurring in our corpus. This method ensures that even unseen bigrams can be accounted for in our language model, which is crucial for handling new or rare word combinations in natural language processing tasks.
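
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses the assumed counts from Step 1; the function name `laplace_bigram_probability` is hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

def laplace_bigram_probability(bigram_count, context_count, vocab_size):
    """Add-one smoothed estimate of P(w_i | w_{i-1}) as an exact fraction."""
    return Fraction(bigram_count + 1, context_count + vocab_size)

# Assumed counts from Step 1: C("the program") = 2, C("the") = 10, V = 100
p_seen = laplace_bigram_probability(2, 10, 100)

# A bigram that never occurs after "the" still receives some probability mass
p_unseen = laplace_bigram_probability(0, 10, 100)

print(p_seen, float(p_seen))
print(p_unseen, float(p_unseen))
```

Under these assumed counts, the seen bigram gets 3/110 while an unseen one gets 1/110, so smoothing shifts some probability away from observed bigrams toward unobserved ones.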




# Data Preparation

####  Importing Necessary Libraries

In [None]:
import re
from collections import defaultdict, Counter
from random import choices

## Text Cleaning Function

In [None]:
def clean_text(text):
    """
    Cleans input text by:
    - Lowercasing the text
    - Removing all non-alphanumeric characters (except periods, exclamation marks, and question marks)
    - Adding <START> and <END> tokens to each sentence.
    """
    # Convert text to lowercase
    text = text.lower()
    # Remove unwanted characters, keeping '.?!' for sentence delimitation
    text = re.sub(r'[^\w\s.?!]', '', text)
    # Split text into sentences and add <START> and <END> tokens
    sentences = re.split(r'(?<=[.?!])\s+', text)
    cleaned_sentences = [f'<START> {sentence.strip()} <END>' for sentence in sentences if sentence]
    return cleaned_sentences
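
Because `clean_text` deletes punctuation outright, words joined by a character such as "/" end up fused (for example, "banks/insurance" becomes "banksinsurance"). One possible variant, sketched below under the hypothetical name `clean_text_spaced`, replaces removed characters with a space and drops sentences that contain only punctuation. It is an illustration, not the function used to produce the outputs in this notebook.

```python
import re

def clean_text_spaced(text):
    """Variant of clean_text that keeps word boundaries when punctuation is removed."""
    text = text.lower()
    # Replace unwanted characters with a space instead of deleting them
    text = re.sub(r'[^\w\s.?!]', ' ', text)
    # Collapse runs of whitespace (including newlines) into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    # Split after sentence-ending punctuation
    sentences = re.split(r'(?<=[.?!])\s+', text)
    # Skip fragments that contain nothing but punctuation or spaces
    return [f'<START> {s.strip()} <END>' for s in sentences if s.strip('.?! ')]

example = clean_text_spaced("Banks/insurance companies are popular. . Really?!")
print(example)
```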


#### Load and Clean the Dataset

In [None]:
# Here's the Function to load and clean our files
def load_and_clean_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    cleaned_text = clean_text(text)
    return cleaned_text

# Paths to our text files
hum_file_path = '/content/hum.txt'
gpt_file_path = '/content/gpt.txt'

# Loading and cleaning the datasets
hum_cleaned_texts = load_and_clean_file(hum_file_path)
gpt_cleaned_texts = load_and_clean_file(gpt_file_path)

# -------------------------------------------------------

# Optionally combine the texts from both files
# cleaned_texts = hum_cleaned_texts + gpt_cleaned_texts


In [None]:
print(hum_cleaned_texts[:5])
print(gpt_cleaned_texts[:5])


# # Checking cleaned text
# print(cleaned_texts[:5])

['<START> neither texas  nor any other state  is allowed to secede from the us . <END>', '<START> why is the answer not so simple as either becoming one country and tearing down the walls or becoming two countries ? <END>', '<START> combusinessnetwortharticlewhengovernmentfinescompanieswhogetscash3189724 . <END>', '<START> just be happy  relax yourselftake care\nyou could  theoretically  tunnel from na to asia  since they are not antipodal  and thus you would nt need to tunnel through the center of the earth . <END>', '<START> . <END>']
['<START> they then used this information to calculate the value of r that would best describe the behavior of gases . <END>', '<START> flash is a software platform that was developed by adobe and is used to create animations  games  and other interactive content . <END>', '<START> the expansion of deserts can have significant impacts on the world . <END>', '<START> if you are concerned about your credit utilization rate  you may want to consider paying

#### Split the Data into Training and Testing Sets for ChatGPT and Human

In [None]:
import random

def split_data(data, test_size=0.1):
    # Shuffle a copy so the caller's list is left untouched
    shuffled = list(data)
    random.shuffle(shuffled)
    split_idx = int(len(shuffled) * (1 - test_size))
    return shuffled[:split_idx], shuffled[split_idx:]
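
Since the split depends on a random shuffle, the train and test sets change on every run. A seeded variant makes the split repeatable. The sketch below is one way to do that; the name `split_data_seeded` and the default seed are assumptions for illustration.

```python
import random

def split_data_seeded(data, test_size=0.1, seed=42):
    """Shuffle a copy of data with a fixed seed and split it into train and test parts."""
    rng = random.Random(seed)      # private generator, does not touch global random state
    shuffled = list(data)          # copy, so the input list keeps its order
    rng.shuffle(shuffled)
    split_idx = int(len(shuffled) * (1 - test_size))
    return shuffled[:split_idx], shuffled[split_idx:]

# Toy data (an assumption for illustration only)
sentences = [f"<START> sentence {i} <END>" for i in range(10)]
train, test = split_data_seeded(sentences)
train_again, test_again = split_data_seeded(sentences)

print(len(train), len(test))
```

Calling the function twice with the same seed returns the same partition, which makes perplexity figures comparable across runs.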


In [None]:
hum_train_sentences, hum_test_sentences = split_data(hum_cleaned_texts, test_size=0.1)
gpt_train_sentences, gpt_test_sentences = split_data(gpt_cleaned_texts, test_size=0.1)


In [None]:
def save_data(file_path, data):
    with open(file_path, 'w', encoding='utf-8') as file:
        for sentence in data:
            file.write(sentence + "\n")


In [None]:
# Define paths for saving
hum_train_path = '/content/hum_train.txt'
hum_test_path = '/content/hum_test.txt'
gpt_train_path = '/content/gpt_train.txt'
gpt_test_path = '/content/gpt_test.txt'

# Save the datasets
save_data(hum_train_path, hum_train_sentences)
save_data(hum_test_path, hum_test_sentences)
save_data(gpt_train_path, gpt_train_sentences)
save_data(gpt_test_path, gpt_test_sentences)


# Generating N-grams for Human and ChatGPT

In [None]:
def generate_ngrams(sentences, n=2):
    """
    Generate n-grams from a list of sentences.
    """
    ngrams = []
    for sentence in sentences:
        tokens = sentence.split()
        ngrams.extend(list(zip(*[tokens[i:] for i in range(n)])))
    return ngrams


### Human

In [None]:
# Assuming hum_train_sentences is already defined
hum_bigrams = generate_ngrams(hum_train_sentences, n=2)
hum_trigrams = generate_ngrams(hum_train_sentences, n=3)


In [None]:
hum_bigram_freq = Counter(hum_bigrams)
hum_trigram_freq = Counter(hum_trigrams)


### ChatGPT

In [None]:
# Assuming gpt_train_sentences is already defined
gpt_bigrams = generate_ngrams(gpt_train_sentences, n=2)
gpt_trigrams = generate_ngrams(gpt_train_sentences, n=3)


In [None]:
gpt_bigram_freq = Counter(gpt_bigrams)
gpt_trigram_freq = Counter(gpt_trigrams)


# Data Sets

In [None]:
# Human Dataset
# "Vocabulary size" here is the number of distinct n-gram types, not distinct words
human_vocab_size_bigrams = len(set(hum_bigrams))
human_vocab_size_trigrams = len(set(hum_trigrams))

# Total N-grams
total_human_bigrams = sum(hum_bigram_freq.values())
total_human_trigrams = sum(hum_trigram_freq.values())


In [None]:
# ChatGPT Dataset
# "Vocabulary size" here is the number of distinct n-gram types, not distinct words
gpt_vocab_size_bigrams = len(set(gpt_bigrams))
gpt_vocab_size_trigrams = len(set(gpt_trigrams))

# Total N-grams
total_gpt_bigrams = sum(gpt_bigram_freq.values())
total_gpt_trigrams = sum(gpt_trigram_freq.values())


# Step 3: Applying Laplacian Smoothing

### Laplacian Smoothing Function

In [None]:
def laplace_smoothed_probability(ngram, ngram_freq, total_ngrams, vocabulary_size, alpha=1):
    """
    Calculate the Laplace-smoothed probability of an n-gram.

    Note: the count is normalized by the total number of n-grams, so this is a
    smoothed joint probability of the whole n-gram, not the conditional
    probability P(w_i | w_{i-1}) from the theoretical section.

    :param ngram: The n-gram for which to calculate the probability.
    :param ngram_freq: A Counter object containing frequencies of n-grams.
    :param total_ngrams: The total number of n-grams in the training set.
    :param vocabulary_size: The size of the vocabulary (the callers below pass the number of distinct n-grams).
    :param alpha: The smoothing parameter (default: 1 for Laplacian smoothing).
    :return: The smoothed probability of the n-gram.
    """
    ngram_count = ngram_freq[ngram]
    smoothed_prob = (ngram_count + alpha) / (total_ngrams + alpha * vocabulary_size)
    return smoothed_prob
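
For comparison, the conditional form from the theoretical section divides by the count of the context rather than by the total number of n-grams, and uses the number of distinct words as V. The sketch below is an assumption-laden illustration of that form: the function name `conditional_laplace_probability`, the `context_freq` counter, and the toy corpus are all hypothetical and are not used elsewhere in this notebook.

```python
from collections import Counter

def conditional_laplace_probability(ngram, ngram_freq, context_freq, vocab_size, alpha=1):
    """Smoothed P(last word | preceding words) = (C(ngram) + alpha) / (C(context) + alpha * V)."""
    context = ngram[:-1]
    return (ngram_freq[ngram] + alpha) / (context_freq[context] + alpha * vocab_size)

# Toy corpus (an assumption for illustration only)
toy_sentences = ["<START> the program runs <END>", "<START> the test runs <END>"]
toy_tokens = [s.split() for s in toy_sentences]

bigram_freq = Counter(bg for t in toy_tokens for bg in zip(t, t[1:]))
# Count each word as a context only where a following word exists
context_freq = Counter((w,) for t in toy_tokens for w in t[:-1])
vocab = {w for t in toy_tokens for w in t}

p = conditional_laplace_probability(("the", "program"), bigram_freq, context_freq, len(vocab))
# Summing over every possible next word should give 1 for a proper conditional distribution
total = sum(
    conditional_laplace_probability(("the", w), bigram_freq, context_freq, len(vocab))
    for w in vocab
)
print(p, total)
```

In this toy corpus "the" occurs twice and is followed once by "program", and the vocabulary has six word types, so the estimate is (1 + 1) / (2 + 6) = 0.25.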


## Calculating Vocabulary Size

In [None]:
# Human Dataset Vocabulary Size and Total N-grams
human_vocab_size_bigrams = len(set(hum_bigrams))
human_vocab_size_trigrams = len(set(hum_trigrams))
total_human_bigrams = sum(hum_bigram_freq.values())
total_human_trigrams = sum(hum_trigram_freq.values())

# ChatGPT Dataset Vocabulary Size and Total N-grams
gpt_vocab_size_bigrams = len(set(gpt_bigrams))
gpt_vocab_size_trigrams = len(set(gpt_trigrams))
total_gpt_bigrams = sum(gpt_bigram_freq.values())
total_gpt_trigrams = sum(gpt_trigram_freq.values())


In [None]:
def get_next_word_human(current_tokens, backoff_order=(3, 2)):
    return get_next_word_general(current_tokens, hum_bigram_freq, hum_trigram_freq, backoff_order, human_vocab_size_bigrams, total_human_bigrams)

def get_next_word_gpt(current_tokens, backoff_order=(3, 2)):
    return get_next_word_general(current_tokens, gpt_bigram_freq, gpt_trigram_freq, backoff_order, gpt_vocab_size_bigrams, total_gpt_bigrams)

def get_next_word_general(current_tokens, bigram_freq, trigram_freq, backoff_order, vocab_size, total_ngrams):
    """
    Sample the next word, trying each n-gram order in backoff_order until one has a match.
    vocab_size and total_ngrams are kept in the signature but are not used:
    sampling relies on raw (unsmoothed) counts.
    """
    for n in backoff_order:
        possible_ngrams = None
        if n == 2:
            # Bigram: match on the last token only
            possible_ngrams = [(ngram, freq) for ngram, freq in bigram_freq.items() if ngram[:-1] == tuple(current_tokens[-1:])]
        elif n == 3 and trigram_freq is not None:
            # Trigram: match on the last two tokens
            possible_ngrams = [(ngram, freq) for ngram, freq in trigram_freq.items() if ngram[:-1] == tuple(current_tokens[-2:])]

        if possible_ngrams:
            total_freq = sum(freq for _, freq in possible_ngrams)
            words, weights = zip(*[(ngram[-1], freq / total_freq) for ngram, freq in possible_ngrams])
            return random.choices(words, weights=weights)[0]

    return None  # Fallback if no suitable continuation is found


# Step 4: Model Evaluation

#### Defining Perplexity Calculation Function

In [None]:
import math

def calculate_perplexity(sentences, ngram_freq, total_ngrams, vocab_size, n=2, alpha=1):
    """
    Calculate the perplexity of a set of sentences under an n-gram model with
    Laplace smoothing. Log-probabilities are summed with math.log and the
    average is exponentiated with math.exp.

    # Parameters:
    :param sentences: The set of sentences to evaluate.
    :param ngram_freq: A Counter object containing frequencies of n-grams.
    :param total_ngrams: The total number of n-grams in the training data.
    :param vocab_size: The size of the vocabulary.
    :param n: The order of the n-gram model (2 for bigram, 3 for trigram).
    :param alpha: The smoothing parameter.
    :return: The calculated perplexity.
    """

    log_prob_sum = 0
    N = 0
    for sentence in sentences:
        tokens = sentence.split()
        # Sentences from clean_text already carry <START>/<END>; add them only if missing
        if not tokens or tokens[0] != '<START>':
            tokens = ['<START>'] + tokens
        if tokens[-1] != '<END>':
            tokens = tokens + ['<END>']
        sentence_ngrams = zip(*[tokens[i:] for i in range(n)])
        for ngram in sentence_ngrams:
            prob = laplace_smoothed_probability(ngram, ngram_freq, total_ngrams, vocab_size, alpha)
            log_prob_sum += math.log(prob)
            N += 1
    if N == 0:
        raise ValueError("No n-grams found in the given sentences.")
    perplexity = math.exp(-log_prob_sum / N)
    return perplexity
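
To build intuition for the formula used above, perplexity is the exponential of the average negative log-probability, so a model that assigns every n-gram the same probability 1/k has perplexity k. The sketch below checks this on made-up probabilities; the helper name `perplexity_from_probs` and the numbers are assumptions for illustration.

```python
import math

def perplexity_from_probs(probs):
    """Perplexity of a sequence given the probability assigned to each of its n-grams."""
    log_prob_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_prob_sum / len(probs))

# Uniform case: every n-gram gets probability 1/4, so perplexity is about 4
uniform = perplexity_from_probs([0.25] * 8)

# Lower probabilities for the same number of n-grams give a higher perplexity
less_confident = perplexity_from_probs([0.01] * 8)

print(uniform, less_confident)
```

Read this way, a perplexity in the tens of thousands means the model is, on average, about as uncertain as if it were choosing uniformly among that many alternatives.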



#### Evaluating using Bigram & Trigram Model on the Test Set

In [None]:
# Human Dataset
perplexity_human_bigrams = calculate_perplexity(hum_test_sentences, hum_bigram_freq, total_human_bigrams, human_vocab_size_bigrams, n=2)
perplexity_human_trigrams = calculate_perplexity(hum_test_sentences, hum_trigram_freq, total_human_trigrams, human_vocab_size_trigrams, n=3)

# ChatGPT Dataset
perplexity_gpt_bigrams = calculate_perplexity(gpt_test_sentences, gpt_bigram_freq, total_gpt_bigrams, gpt_vocab_size_bigrams, n=2)
perplexity_gpt_trigrams = calculate_perplexity(gpt_test_sentences, gpt_trigram_freq, total_gpt_trigrams, gpt_vocab_size_trigrams, n=3)


In [None]:
# Print the perplexity of the human and ChatGPT datasets

print("Human Dataset:")
print(f"  - Bigram Perplexity: {perplexity_human_bigrams}")
print(f"  - Trigram Perplexity: {perplexity_human_trigrams}")

print("ChatGPT Dataset:")
print(f"  - Bigram Perplexity: {perplexity_gpt_bigrams}")
print(f"  - Trigram Perplexity: {perplexity_gpt_trigrams}")


Human Dataset:
  - Bigram Perplexity: 125213.95266966379
  - Trigram Perplexity: 1697687.723540955
ChatGPT Dataset:
  - Bigram Perplexity: 61908.717378033034
  - Trigram Perplexity: 716206.9919634318


# Step 5: Text Generation

#### Human Text Generation


In [None]:
import random

def generate_sentence(ngram_freq, start_token='<START>', end_token='<END>', n=2):
    """
    Generates a sentence using an n-gram model.

    :param ngram_freq: A Counter object containing frequencies of n-grams.
    :param start_token: The token indicating the start of a sentence.
    :param end_token: The token indicating the end of a sentence.
    :param n: The order of the n-gram model.
    :return: A generated sentence as a string.
    """
    sentence = [start_token]
    while True:
        if n == 2:
            # For bigrams, condition on the last word in the sentence
            current_token = sentence[-1]
            candidates = [(ngram[1], freq) for ngram, freq in ngram_freq.items() if ngram[0] == current_token]
        elif n == 3:
            if len(sentence) < 2:
                # Only <START> is available: take the second word of trigrams that begin with it
                candidates = [(ngram[1], freq) for ngram, freq in ngram_freq.items() if ngram[0] == start_token]
            else:
                # For trigrams, condition on the last two words in the sentence
                current_tokens = tuple(sentence[-2:])
                candidates = [(ngram[2], freq) for ngram, freq in ngram_freq.items() if ngram[:2] == current_tokens]
        else:
            raise ValueError("This function currently supports only bigram (n=2) and trigram (n=3) models.")

        # If no continuation is found, break the loop
        if not candidates:
            break

        # random.choices treats weights as relative, so raw counts can be used directly
        words, weights = zip(*candidates)
        next_word = random.choices(words, weights=weights)[0]

        if next_word == end_token or len(sentence) > 100:  # Prevent infinite loops
            break
        sentence.append(next_word)

    return ' '.join(sentence[1:])  # Skip the <START> token for output


#### Generate a Sentence with the Bigram Model

In [None]:
generated_human_sentence_bigram = generate_sentence(hum_bigram_freq, n=2)
print(f"Generated Human Bigram Sentence: {generated_human_sentence_bigram}")


Generated Human Bigram Sentence: not only because about 22 you can tell us needs to be a collie lab .
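
The trigram cells in this section call `generate_sentence_with_trigram_backoff`, whose definition does not appear in this notebook. The sketch below is one possible implementation that matches the call signature used in those cells. The body is an assumption, not the original code: it tries the trigram counts first and falls back to the bigram counts when no trigram continues the last two words. The `toy_*` names are hypothetical demo data.

```python
import random
from collections import Counter

def generate_sentence_with_trigram_backoff(start_token, end_token, bigram_freq,
                                           trigram_freq, max_length=100):
    """Generate a sentence from trigram counts, backing off to bigram counts."""
    sentence = [start_token]
    while len(sentence) <= max_length:
        candidates = []
        if len(sentence) >= 2:
            # Trigram step: condition on the last two tokens
            context = tuple(sentence[-2:])
            candidates = [(ng[2], f) for ng, f in trigram_freq.items() if ng[:2] == context]
        if not candidates:
            # Backoff step: condition on the last token only
            last = sentence[-1]
            candidates = [(ng[1], f) for ng, f in bigram_freq.items() if ng[0] == last]
        if not candidates:
            break  # No known continuation
        words, weights = zip(*candidates)
        next_word = random.choices(words, weights=weights)[0]
        if next_word == end_token:
            break
        sentence.append(next_word)
    return ' '.join(sentence[1:])  # Skip the start token in the output

# Toy counts built from a single sentence (an assumption for illustration only)
toy_tokens = "<START> the cat sat . <END>".split()
toy_bigrams = Counter(zip(toy_tokens, toy_tokens[1:]))
toy_trigrams = Counter(zip(toy_tokens, toy_tokens[1:], toy_tokens[2:]))

toy_sentence = generate_sentence_with_trigram_backoff('<START>', '<END>', toy_bigrams, toy_trigrams)
print(toy_sentence)
```

With a single training sentence every context has exactly one continuation, so the toy call reproduces that sentence. On real counts the output varies from run to run.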


In [None]:
generated_human_sentence_trigram = generate_sentence_with_trigram_backoff('<START>', '<END>', hum_bigram_freq, hum_trigram_freq, max_length=100)
print(f"Generated Human Trigram Sentence with Backoff: {generated_human_sentence_trigram}")


Generated Human Trigram Sentence with Backoff: other banksinsurance companiesservice providers so popular .


#### ChatGPT Text Generation

In [None]:
generated_gpt_sentence_bigram = generate_sentence(gpt_bigram_freq, n=2)
print(f"Generated ChatGPT Bigram Sentence: {generated_gpt_sentence_bigram}")


Generated ChatGPT Bigram Sentence: the price of the way the joint to understand how a better to make it .


In [None]:
generated_gpt_sentence_trigram = generate_sentence_with_trigram_backoff('<START>', '<END>', gpt_bigram_freq, gpt_trigram_freq, max_length=100)
print(f"Generated ChatGPT Trigram Sentence with Backoff: {generated_gpt_sentence_trigram}")


Generated ChatGPT Trigram Sentence with Backoff: scientists have also been used to train their troops overseas to support the cheeks and body in many different countries and it contains sensitive personal information to answer your question it is normal to have legal and illegal while he was a student at harvard university .


# Results

### Human-Written Text Generation

Bigram Sentence: "not only because about 22 you can tell us needs to be a collie lab."

- Analysis: This sentence, generated from a bigram model, is disjointed. It starts coherently but quickly loses its logical thread. This illustrates a common limitation of bigram models: conditioning only on the immediately preceding word gives little long-range structure.

Trigram Sentence with Backoff: "other banksinsurance companiesservice providers so popular."

- Analysis: The trigram-generated sentence, while short, is slightly more coherent than the bigram sentence. However, the fused words ("banksinsurance" and "companiesservice") point to a preprocessing issue: `clean_text` deletes characters such as "/" without inserting a space, so slash-separated words are likely merged into one token. Despite this, the trigram model appears to stay closer to a single topic.

### ChatGPT Text Generation

Bigram Sentence: "the price of the way the joint to understand how a better to make it."

- Analysis: Similar to the human bigram sentence, this sentence lacks coherence, with a repetitive structure that doesn't logically progress. It reflects the bigram model's limitations in capturing longer dependency relationships within text.

Trigram Sentence with Backoff: "scientists have also been used to train their troops overseas to support the cheeks and body in many different countries and it contains sensitive personal information to answer your question it is normal to have legal and illegal while he was a student at harvard university."

- Analysis: This sentence showcases a higher degree of coherence and complexity, characteristic of trigram models' ability to capture longer-term dependencies. However, it eventually diverges into a less coherent narrative, suggesting that while trigram models can generate more structured beginnings, maintaining topic consistency over long sentences remains challenging.