
# Natural Language Engineering



In [1]:
# my unique candidate number is here
candidate_number = 262809



This project applies natural language processing techniques, specifically POS tagging and the Naive Bayes classification model, to analyze and classify sentences according to predefined stylistic categories. Below, I detail the steps taken, adapting the original question prompts into descriptive project steps.

In [2]:
### do not change the code in this cell
# make sure you run this cell
import nltk
from nltk import pos_tag
from nltk.probability import FreqDist
nltk.download('tagsets')
nltk.download('averaged_perceptron_tagger')


# This is a list of sentences written in various styles.
sentences=["a tediously verbose sentence may contain many gratuitous and overly contrived modifiers .",
           "another sentence could be too short .",
           "some people write sentences that contain nouns and verbs , avoiding adjectives and descriptions ."]

# This is a dictionary containing counts of pos tags from a corpus of sentences which were labelled as styles A and B.
classtagcounts={"A":{"RB":30, "JJ":30, "NN":10, "NNS":10, "VB":10, "VBD":10},
                "B":{"VBP":20, "VBZ":10, "VBG":10, "VBD":10, "NN":20, "NNS":30}}

# This is a complete list of pos tags.
taglist = list(nltk.data.load('help/tagsets/upenn_tagset.pickle').keys())


[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


a) By following the steps below, pos-tag the sentences and construct a bag-of-tags representation of each one.

To start, I split each sentence from a predefined list into individual words and applied the pos_tag function from the NLTK library to tag each word with its corresponding part of speech. This process converted the list of sentences into a list of lists, where each inner list contained tuples of word-tag pairs.
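Splitting on whitespace works here because the sentences are already tokenised, with a space before each punctuation mark. The short sketch below illustrates the difference on a raw string; the variable names are illustrative only and are not part of the notebook.

```python
# Whitespace splitting only separates punctuation when the text is pre-tokenised.
raw = "another sentence could be too short."
pretokenised = "another sentence could be too short ."

raw_tokens = raw.split()
pre_tokens = pretokenised.split()

print(raw_tokens[-1])   # 'short.' (full stop stays attached to the word)
print(pre_tokens[-2:])  # ['short', '.']
```

For raw text, a proper tokeniser would be needed before tagging.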



In [3]:
# List of sentences
sentences = ["a tediously verbose sentence may contain many gratuitous and overly contrived modifiers .",
             "another sentence could be too short .",
             "some people write sentences that contain nouns and verbs , avoiding adjectives and descriptions ."]

# Splitting each sentence into words and then POS tagging
tagged = [pos_tag(sentence.split()) for sentence in sentences]
print(tagged)


[[('a', 'DT'), ('tediously', 'RB'), ('verbose', 'JJ'), ('sentence', 'NN'), ('may', 'MD'), ('contain', 'VB'), ('many', 'JJ'), ('gratuitous', 'JJ'), ('and', 'CC'), ('overly', 'RB'), ('contrived', 'VBD'), ('modifiers', 'NNS'), ('.', '.')], [('another', 'DT'), ('sentence', 'NN'), ('could', 'MD'), ('be', 'VB'), ('too', 'RB'), ('short', 'JJ'), ('.', '.')], [('some', 'DT'), ('people', 'NNS'), ('write', 'VBP'), ('sentences', 'NNS'), ('that', 'WDT'), ('contain', 'VBP'), ('nouns', 'NNS'), ('and', 'CC'), ('verbs', 'NNS'), (',', ','), ('avoiding', 'VBG'), ('adjectives', 'NNS'), ('and', 'CC'), ('descriptions', 'NNS'), ('.', '.')]]


Following the tagging, I transformed each list of word-tag pairs into a bag-of-tags representation using the FreqDist class from NLTK. This representation counts the occurrences of each POS tag within a sentence, effectively summarizing its grammatical structure.


In [4]:
# Converting each list of word, tag pairs into a bag-of-tags representation
bag_of_tags = [FreqDist(tag for word, tag in sentence) for sentence in tagged]

# Printing the bag-of-tags representation
for bot in bag_of_tags:
    print(dict(bot))

{'DT': 1, 'RB': 2, 'JJ': 3, 'NN': 1, 'MD': 1, 'VB': 1, 'CC': 1, 'VBD': 1, 'NNS': 1, '.': 1}
{'DT': 1, 'NN': 1, 'MD': 1, 'VB': 1, 'RB': 1, 'JJ': 1, '.': 1}
{'DT': 1, 'NNS': 6, 'VBP': 2, 'WDT': 1, 'CC': 2, ',': 1, 'VBG': 1, '.': 1}


b) Naive Bayes Model Parameter Calculation and Sentence Classification

i) Explaining my ideas behind the Naive Bayes model.

Starting from the assumption that we want to find the class that maximises $p(class|document)$, I explain how Bayes' theorem is used and what naive assumption is made about the features in the document. I also describe the priors and conditional probabilities that are used to predict the most likely class for a document.



The Naive Bayes model is a probabilistic machine learning model that's used for classification tasks. The fundamental principle behind this model is Bayes' Theorem, which describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For a Naive Bayes classifier, the event is the class we want to predict (e.g., class A or B), and the conditions are the features observed in the data (in this case, the bag-of-tags from sentences).

Bayes' Theorem: Bayes' Theorem is mathematically stated as:

$$P(\text{class} \mid \text{document}) = \frac{P(\text{document} \mid \text{class}) \times P(\text{class})}{P(\text{document})}$$

In the context of document classification:

- P(class|document) is the posterior probability of a class given a document.
- P(document|class) is the likelihood, which is the probability of the document given a class.
- P(class) is the prior probability of the class.
- P(document) is the evidence, the probability of the document.

Naive Assumption: The 'naive' assumption made in this model is that all features (in this case, the POS tags) are independent of each other given the class. This assumption simplifies the computation of P(document|class), as it becomes the product of the individual probabilities of each feature given the class.

Priors and Conditional Probabilities:
- **Prior Probability P(class):** This is the probability of observing each class in the training data without any additional information. It's usually calculated by the frequency of each class in the training data.
- **Conditional Probability P(feature|class):** This is the probability of observing a certain feature given a class. In this task, it is the probability of seeing a specific POS tag in sentences belonging to class A or B. These probabilities are typically calculated from the frequency of each feature in documents of a particular class.

To predict the most likely class for a document, Naive Bayes calculates the posterior probability for each class and selects the class with the highest posterior probability. This is done by multiplying the prior probability of each class with the product of the conditional probabilities of each feature observed in the document. The evidence term P(document) is usually ignored in this calculation since it's constant for all classes and does not affect the ranking of the classes.
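In symbols, with $t_1, \dots, t_n$ standing for the POS tags observed in the document (notation introduced here for illustration), the naive assumption and the resulting decision rule can be written as:

$$P(\text{document} \mid \text{class}) = \prod_{i=1}^{n} P(t_i \mid \text{class})$$

$$\hat{c} = \underset{\text{class}}{\arg\max}\; P(\text{class}) \prod_{i=1}^{n} P(t_i \mid \text{class})$$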

ii) I summed the counts of POS tags for each class from the pre-existing dictionary classtagcounts to derive the total tag frequency for each class. From these totals, I calculated the prior probability of each class. Because only tag counts are available, each prior is the share of tag occurrences belonging to that class, used here as an estimate of how likely a sentence is to belong to the class before any of its features are considered.


In [5]:
# Dictionary containing counts of pos tags for classes A and B
classtagcounts = {"A": {"RB": 30, "JJ": 30, "NN": 10, "NNS": 10, "VB": 10, "VBD": 10},
                  "B": {"VBP": 20, "VBZ": 10, "VBG": 10, "VBD": 10, "NN": 20, "NNS": 30}}

# Summing the counts for classes A and B
classcounts = {class_label: sum(tag_counts.values()) for class_label, tag_counts in classtagcounts.items()}

# Calculating the total count of all classes
total_count = sum(classcounts.values())

# Calculating the prior probabilities for each class
classpriors = {class_label: count / total_count for class_label, count in classcounts.items()}

classcounts, classpriors

({'A': 100, 'B': 100}, {'A': 0.5, 'B': 0.5})

iii) To calculate conditional probabilities for each POS tag within the classes, I defined a function condprobs() that computes the probability of observing each tag given a class, using the frequencies obtained from the training data.



In [6]:
def condprobs(feature_counts, class_counts):
    """Return P(feature|class) for every class as a nested dictionary."""
    cond_probs = {}
    for class_label, features in feature_counts.items():
        # Total number of feature occurrences observed for this class
        total_count = class_counts[class_label]
        cond_probs[class_label] = {feature: count / total_count for feature, count in features.items()}
    return cond_probs

# Applying the function to classtagcounts and classcounts
conditional_probabilities = condprobs(classtagcounts, classcounts)
conditional_probabilities


{'A': {'RB': 0.3, 'JJ': 0.3, 'NN': 0.1, 'NNS': 0.1, 'VB': 0.1, 'VBD': 0.1},
 'B': {'VBP': 0.2, 'VBZ': 0.1, 'VBG': 0.1, 'VBD': 0.1, 'NN': 0.2, 'NNS': 0.3}}
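As a quick sanity check, the conditional probabilities for each class should sum to 1. The sketch below copies the values printed above into a literal dictionary so that it runs on its own; `isclose` is used because floating-point sums are rarely exactly 1.

```python
from math import isclose

# Values copied from the notebook output above.
conditional_probabilities = {
    "A": {"RB": 0.3, "JJ": 0.3, "NN": 0.1, "NNS": 0.1, "VB": 0.1, "VBD": 0.1},
    "B": {"VBP": 0.2, "VBZ": 0.1, "VBG": 0.1, "VBD": 0.1, "NN": 0.2, "NNS": 0.3},
}

# Each class should define a proper probability distribution over its tags.
class_totals = {label: sum(probs.values()) for label, probs in conditional_probabilities.items()}
for label, total in class_totals.items():
    print(label, isclose(total, 1.0))
```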

iv) Here I explain why we might want to smooth the conditional probabilities, and how add one smoothing works.



Smoothing in the context of Naive Bayes classifiers, particularly for conditional probabilities, is a crucial technique to address the issue of zero probability. This situation arises when a feature (such as a specific POS tag in this case) that appears in the test data has not been observed in the training data for a particular class. Without smoothing, the probability of observing this unseen feature given the class would be zero. This zero probability, when multiplied with other probabilities (as Naive Bayes does), would result in a zero posterior probability for the class, skewing the classification results.
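The zero-probability problem can be seen directly with the unsmoothed values calculated earlier. The sketch below is illustrative only: `unsmoothed_score` is a hypothetical helper, and the tag sequence is the one the tagger produced for the second sentence, with punctuation left out.

```python
from math import prod

# Unsmoothed conditional probabilities and priors, copied from the earlier outputs.
cond_probs = {
    "A": {"RB": 0.3, "JJ": 0.3, "NN": 0.1, "NNS": 0.1, "VB": 0.1, "VBD": 0.1},
    "B": {"VBP": 0.2, "VBZ": 0.1, "VBG": 0.1, "VBD": 0.1, "NN": 0.2, "NNS": 0.3},
}
priors = {"A": 0.5, "B": 0.5}

# Tags of "another sentence could be too short ." without the final full stop.
tags = ["DT", "NN", "MD", "VB", "RB", "JJ"]

def unsmoothed_score(tags, label):
    # A tag never seen for this class contributes 0.0, which zeroes the whole product.
    return priors[label] * prod(cond_probs[label].get(tag, 0.0) for tag in tags)

scores = {label: unsmoothed_score(tags, label) for label in priors}
print(scores)  # {'A': 0.0, 'B': 0.0}
```

Both classes score zero because DT and MD were never observed in either class, so the unsmoothed model cannot rank them at all.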

**Add-One Smoothing (Laplace Smoothing):**
Add-one smoothing, also known as Laplace smoothing, is a simple yet effective method to prevent zero probabilities in a Naive Bayes classifier. It works by adding 1 to the count of each feature for every class in the training set. This adjustment is applied regardless of whether the feature was observed in the training data for that class. The formula for calculating the conditional probability with add-one smoothing is:

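$$P(\text{feature} \mid \text{class}) = \frac{\text{Count}(\text{feature}, \text{class}) + 1}{\text{Count}(\text{class}) + N}$$

Where:

- Count(feature, class) is the original count of the feature for the given class.
- Count(class) is the total count of all features for the class.
- N is the number of distinct features that can occur (here, the number of tags in the tag list), including any that were never observed in training.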

The addition of 1 to the feature count ensures that no feature has a zero probability. The denominator is increased by N (the number of distinct features) so that the smoothed probabilities for each class still sum to 1. In this way, smoothing handles the problem of unseen features and makes the classifier more robust, particularly when dealing with sparse data sets or small training samples.
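A worked instance of the formula may help. The sketch below uses the class A counts from this notebook and assumes N = 36, the size of the tag list used in the next cell; `add_one` is a hypothetical helper, and exact fractions are used so the arithmetic is easy to follow.

```python
from fractions import Fraction

# Class A counts from classtagcounts; N = 36 is assumed from the notebook's tag list.
counts_a = {"RB": 30, "JJ": 30, "NN": 10, "NNS": 10, "VB": 10, "VBD": 10}
total_a = sum(counts_a.values())  # 100 tag occurrences in class A
n_tags = 36

def add_one(tag):
    # (Count(feature, class) + 1) / (Count(class) + N)
    return Fraction(counts_a.get(tag, 0) + 1, total_a + n_tags)

p_jj = add_one("JJ")  # seen 30 times in class A
p_dt = add_one("DT")  # never seen in class A
print(p_jj, float(p_jj))  # 31/136 0.22794117647058823
print(p_dt, float(p_dt))  # 1/136 0.007352941176470588

# Total mass: the six seen tags plus the remaining unseen tags at 1/136 each.
mass = sum(add_one(tag) for tag in counts_a) + (n_tags - len(counts_a)) * add_one("DT")
print(mass)  # 1
```

The unseen tag now receives a small non-zero probability, and the distribution over all 36 tags still sums to exactly 1.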

v) I applied add-one smoothing across all POS tags and updated the class frequencies accordingly. This adjustment was necessary to ensure that each tag, including those not present in the initial training data, was considered in the model.


In [7]:
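# List of possible tags (36 word-level tags, punctuation excluded); this replaces the taglist loaded in the setup cell.
# Note: the list uses NP, NPS, PP and PP$, whereas the NLTK tagger emits NNP, NNPS, PRP and PRP$
# for proper nouns and pronouns. None of the three example sentences contain those tags.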
taglist = ['CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNS', 'NP', 'NPS', 'PDT', 'POS',
           'PP', 'PP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT',
           'WP', 'WP$', 'WRB']

# Applying add-one smoothing to classtagcounts
smoothed_classtagcounts = {}
for class_label, tag_counts in classtagcounts.items():
    smoothed_classtagcounts[class_label] = {}
    for tag in taglist:
        smoothed_classtagcounts[class_label][tag] = tag_counts.get(tag, 0) + 1

# Updating classcounts to reflect modified class frequencies after smoothing
updated_classcounts = {class_label: sum(tag_counts.values()) for class_label, tag_counts in smoothed_classtagcounts.items()}

smoothed_classtagcounts, updated_classcounts

({'A': {'CC': 1,
   'CD': 1,
   'DT': 1,
   'EX': 1,
   'FW': 1,
   'IN': 1,
   'JJ': 31,
   'JJR': 1,
   'JJS': 1,
   'LS': 1,
   'MD': 1,
   'NN': 11,
   'NNS': 11,
   'NP': 1,
   'NPS': 1,
   'PDT': 1,
   'POS': 1,
   'PP': 1,
   'PP$': 1,
   'RB': 31,
   'RBR': 1,
   'RBS': 1,
   'RP': 1,
   'SYM': 1,
   'TO': 1,
   'UH': 1,
   'VB': 11,
   'VBD': 11,
   'VBG': 1,
   'VBN': 1,
   'VBP': 1,
   'VBZ': 1,
   'WDT': 1,
   'WP': 1,
   'WP$': 1,
   'WRB': 1},
  'B': {'CC': 1,
   'CD': 1,
   'DT': 1,
   'EX': 1,
   'FW': 1,
   'IN': 1,
   'JJ': 1,
   'JJR': 1,
   'JJS': 1,
   'LS': 1,
   'MD': 1,
   'NN': 21,
   'NNS': 31,
   'NP': 1,
   'NPS': 1,
   'PDT': 1,
   'POS': 1,
   'PP': 1,
   'PP$': 1,
   'RB': 1,
   'RBR': 1,
   'RBS': 1,
   'RP': 1,
   'SYM': 1,
   'TO': 1,
   'UH': 1,
   'VB': 1,
   'VBD': 11,
   'VBG': 11,
   'VBN': 1,
   'VBP': 21,
   'VBZ': 11,
   'WDT': 1,
   'WP': 1,
   'WP$': 1,
   'WRB': 1}},
 {'A': 136, 'B': 136})

vi) Using the adjusted frequencies, I recalculated the conditional probabilities to reflect the smoothing adjustments.


In [8]:
# Applying the condprobs function to the smoothed frequencies
smoothed_conditional_probabilities = condprobs(smoothed_classtagcounts, updated_classcounts)
smoothed_conditional_probabilities

{'A': {'CC': 0.007352941176470588,
  'CD': 0.007352941176470588,
  'DT': 0.007352941176470588,
  'EX': 0.007352941176470588,
  'FW': 0.007352941176470588,
  'IN': 0.007352941176470588,
  'JJ': 0.22794117647058823,
  'JJR': 0.007352941176470588,
  'JJS': 0.007352941176470588,
  'LS': 0.007352941176470588,
  'MD': 0.007352941176470588,
  'NN': 0.08088235294117647,
  'NNS': 0.08088235294117647,
  'NP': 0.007352941176470588,
  'NPS': 0.007352941176470588,
  'PDT': 0.007352941176470588,
  'POS': 0.007352941176470588,
  'PP': 0.007352941176470588,
  'PP$': 0.007352941176470588,
  'RB': 0.22794117647058823,
  'RBR': 0.007352941176470588,
  'RBS': 0.007352941176470588,
  'RP': 0.007352941176470588,
  'SYM': 0.007352941176470588,
  'TO': 0.007352941176470588,
  'UH': 0.007352941176470588,
  'VB': 0.08088235294117647,
  'VBD': 0.08088235294117647,
  'VBG': 0.007352941176470588,
  'VBN': 0.007352941176470588,
  'VBP': 0.007352941176470588,
  'VBZ': 0.007352941176470588,
  'WDT': 0.007352941176470

vii) Finally, I used the prior probabilities and the smoothed conditional probabilities to compute a score for each sentence under classes A and B. Each score is the prior multiplied by the conditional probabilities of the sentence's tags, so it is proportional to the posterior probability of the class. I then predicted the most likely class for each sentence by selecting the class with the highest score, displaying each original sentence alongside its predicted classification.

Through these steps, this project not only applies fundamental natural language processing techniques but also provides a systematic approach to text classification, leveraging the simplicity of Naive Bayes in the context of linguistic features.


In [10]:

def predict_class(sentence, classpriors, conditional_probabilities, taglist):
    """Return the class with the highest (unnormalised) posterior score for a sentence."""
    # Splitting the sentence into words and POS tagging
    words = sentence.split()
    tagged_words = pos_tag(words)

    # Calculating probabilities for each class
    class_scores = {}
    for class_label in classpriors.keys():
        # Start from the prior, then multiply in P(tag|class) for each tag
        class_scores[class_label] = classpriors[class_label]
        for word, tag in tagged_words:
            # Tags outside taglist (e.g. punctuation) are skipped
            if tag in taglist:
                class_scores[class_label] *= conditional_probabilities[class_label].get(tag, 1/len(taglist))

    # Predicting the class with the highest probability
    predicted_class = max(class_scores, key=class_scores.get)
    return predicted_class

# The sentences and other variables (classpriors, smoothed_conditional_probabilities, taglist) should be defined as per previous steps

# Predicting the most likely class for each sentence
predictions = [predict_class(sentence, classpriors, smoothed_conditional_probabilities, taglist) for sentence in sentences]

# Printing out each original sentence alongside the prediction
for sentence, prediction in zip(sentences, predictions):
    print(f"Sentence: \"{sentence}\" \nPredicted Class: {prediction}\n")


Sentence: "a tediously verbose sentence may contain many gratuitous and overly contrived modifiers ." 
Predicted Class: A

Sentence: "another sentence could be too short ." 
Predicted Class: A

Sentence: "some people write sentences that contain nouns and verbs , avoiding adjectives and descriptions ." 
Predicted Class: B
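Multiplying many small probabilities can underflow for longer documents, so a common alternative is to sum log probabilities instead. The sketch below is not the method used in this notebook. It re-scores the three bag-of-tags representations printed earlier in log space, assuming the same priors of 0.5, the same 36-tag list, and the same rule of skipping tags outside that list; `TAGS`, `bags` and `log_score` are names introduced here for illustration.

```python
from math import log

# Training counts and tag list copied from the notebook.
counts = {
    "A": {"RB": 30, "JJ": 30, "NN": 10, "NNS": 10, "VB": 10, "VBD": 10},
    "B": {"VBP": 20, "VBZ": 10, "VBG": 10, "VBD": 10, "NN": 20, "NNS": 30},
}
TAGS = ("CC CD DT EX FW IN JJ JJR JJS LS MD NN NNS NP NPS PDT POS PP PP$ RB RBR RBS "
        "RP SYM TO UH VB VBD VBG VBN VBP VBZ WDT WP WP$ WRB").split()

# Bag-of-tags representations copied from the earlier output.
bags = [
    {"DT": 1, "RB": 2, "JJ": 3, "NN": 1, "MD": 1, "VB": 1, "CC": 1, "VBD": 1, "NNS": 1, ".": 1},
    {"DT": 1, "NN": 1, "MD": 1, "VB": 1, "RB": 1, "JJ": 1, ".": 1},
    {"DT": 1, "NNS": 6, "VBP": 2, "WDT": 1, "CC": 2, ",": 1, "VBG": 1, ".": 1},
]

def log_score(bag, label):
    # log P(class) + sum over tags of count * log P(tag|class), with add-one smoothing
    total = sum(counts[label].values()) + len(TAGS)
    score = log(0.5)
    for tag, n in bag.items():
        if tag in TAGS:  # punctuation tags are skipped, as in predict_class
            score += n * log((counts[label].get(tag, 0) + 1) / total)
    return score

predictions = [max(counts, key=lambda label: log_score(bag, label)) for bag in bags]
print(predictions)  # ['A', 'A', 'B']
```

Because the logarithm is monotonic, the ranking of the classes is unchanged, and the predictions agree with those printed above.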

