
# Comparing and Contrasting Popular NLP Frameworks and Libraries: The Case Study of Lemmatisation and Part of Speech Tagging
### 6 March 2024, University College London
### Module: Natural Language Processing and Text Analysis (INST0073)
### Module tutor: Andreas Vlachidis

I have been assigned the task of processing a large corpus of news articles. The steps below show how I implement it:

**-Retrieve specific sentences from the Brown Corpus via NLTK:** Extract sentences 014 to 023 of the "News" category for analysis.

**-Sentence Processing Using NLTK and spaCy:**
   - NLTK for tokenization on the sentences followed by POS tagging and lemmatization.
   - spaCy for processing sentences using its integrated pipeline.
   
**-Analysis of lemmatization and POS tagging results** to compare the outcomes of both frameworks.


## Mount Google Drive on Colab
This notebook was created on Colab. To save the generated results we need access to Google Drive, which the following code mounts.

In [None]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


## Access the Brown Corpus
Let's begin by importing the necessary libraries, then accessing and preparing the sentences from the Brown Corpus.

In [None]:
# Import NLTK modules and download NLTK resources
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import brown
from nltk.stem import WordNetLemmatizer

#import spaCy modules
import spacy
from spacy.tokens import Doc

# Download the NLTK resources needed for processing: Brown corpus, Punkt tokenizer, POS tagger, WordNet lemmatizer
nltk.download('brown')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
# Load the English model for spaCy
nlp = spacy.load("en_core_web_sm")

# Accessing sentences from the "News" category of the Brown Corpus directly
news_sentences = brown.sents(categories='news')

# Initialize an empty list to store selected sentences
selected_sentences = []

# Loop through the specific range based on the student ID, 23223014 (i.e., sentences 14 through 23)
for i in range(14, 24):
    # Join words in each sentence to form a string and add it to the list
    selected_sentences.append(' '.join(news_sentences[i]))

# Iterate over the selected sentences and print them with their corresponding index
for i in range(14, 24):
    print(f"Sentence {i}: {selected_sentences[i-14]}")


Sentence 14: `` This is one of the major items in the Fulton County general assistance program '' , the jury said , but the State Welfare Department `` has seen fit to distribute these funds through the welfare departments of all the counties in the state with the exception of Fulton County , which receives none of this money .
Sentence 15: The jurors said they realize `` a proportionate distribution of these funds might disable this program in our less populous counties '' .
Sentence 16: Nevertheless , `` we feel that in the future Fulton County should receive some portion of these available funds '' , the jurors said .
Sentence 17: `` Failure to do this will continue to place a disproportionate burden '' on Fulton taxpayers .
Sentence 18: The jury also commented on the Fulton ordinary's court which has been under fire for its practices in the appointment of appraisers , guardians and administrators and the awarding of fees and compensation .
Sentence 19: Wards protected
Sentence 20: 

## NLTK

In [None]:
# Setting up the lemmatizer from NLTK for text processing
lemmatizer = WordNetLemmatizer()

# This function applies NLTK's tools to analyze sentences
def analyze_sentence(sentence):
    # Tokenizing the sentence into words
    words = word_tokenize(sentence)
    # Assigning part-of-speech tags to each token
    tagged_words = pos_tag(words)
    # Lemmatizing each word; no POS hint is passed, so WordNetLemmatizer applies its default (noun)
    lemmas_with_tags = [(word, lemmatizer.lemmatize(word), pos) for word, pos in tagged_words]
    return lemmas_with_tags

# Initiating analysis and output saving
print("\nProcessing with NLTK:")
output_file_path = "/gdrive/My Drive/ProcessedOutputNLTK.txt"
with open(output_file_path, "w") as output_file:
    index = 14  # Begin from the 14th sentence to align with student ID
    for sentence in selected_sentences:
        # Analyzing each sentence
        lemma_results = analyze_sentence(sentence)
        output_line = f"Sentence {index} (Word, Lemma, POS): {lemma_results}\n"
        print(output_line)
        output_file.write(output_line)
        index += 1  # Move to the next sentence

print(f"Processed results are stored in '{output_file_path}'")




Processing with NLTK:
Sentence 14 (Word, Lemma, POS): [('``', '``', '``'), ('This', 'This', 'DT'), ('is', 'is', 'VBZ'), ('one', 'one', 'CD'), ('of', 'of', 'IN'), ('the', 'the', 'DT'), ('major', 'major', 'JJ'), ('items', 'item', 'NNS'), ('in', 'in', 'IN'), ('the', 'the', 'DT'), ('Fulton', 'Fulton', 'NNP'), ('County', 'County', 'NNP'), ('general', 'general', 'JJ'), ('assistance', 'assistance', 'NN'), ('program', 'program', 'NN'), ('``', '``', '``'), (',', ',', ','), ('the', 'the', 'DT'), ('jury', 'jury', 'NN'), ('said', 'said', 'VBD'), (',', ',', ','), ('but', 'but', 'CC'), ('the', 'the', 'DT'), ('State', 'State', 'NNP'), ('Welfare', 'Welfare', 'NNP'), ('Department', 'Department', 'NNP'), ('``', '``', '``'), ('has', 'ha', 'VBZ'), ('seen', 'seen', 'VBN'), ('fit', 'fit', 'NN'), ('to', 'to', 'TO'), ('distribute', 'distribute', 'VB'), ('these', 'these', 'DT'), ('funds', 'fund', 'NNS'), ('through', 'through', 'IN'), ('the', 'the', 'DT'), ('welfare', 'welfare', 'NN'), ('departments', 'departm
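
In the output above, `has` is lemmatized to `ha`. This is because `lemmatizer.lemmatize(word)` is called without a POS argument, so every word is treated as a noun. A possible refinement, not part of the assignment code, is to translate each Penn Treebank tag into the single-letter WordNet POS code before lemmatizing. The helper name `penn_to_wordnet_pos` below is hypothetical, and the mapping is a minimal sketch covering only the four WordNet categories.

```python
def penn_to_wordnet_pos(penn_tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter: 'n', 'v', 'a' or 'r'."""
    if penn_tag.startswith("JJ"):
        return "a"  # adjective
    if penn_tag.startswith("VB"):
        return "v"  # verb
    if penn_tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun, the lemmatizer's default


# A few (word, tag) pairs taken from the NLTK output for sentence 14
tagged = [("has", "VBZ"), ("seen", "VBN"), ("items", "NNS"), ("major", "JJ")]
wordnet_pos = [(word, penn_to_wordnet_pos(tag)) for word, tag in tagged]
print(wordnet_pos)

# With NLTK and the WordNet data available, the hint would be passed like this:
#   lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag))
```

Passing the hint should let verbs such as `has` and `seen` be looked up as verbs rather than nouns, which would make the NLTK lemmas more directly comparable with spaCy's.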

## spaCy

In [None]:
# Setup for text analysis using spaCy
def analyze_text_spacy(sentence):
    document = nlp(sentence)
    analysis_results = [(token.text, token.lemma_, token.pos_) for token in document]
    return analysis_results

# Begin text analysis using spaCy and save the results
print("\nAnalyzing with spaCy:")
output_file_name = "/gdrive/My Drive/spaCyAnalysisResults.txt"
with open(output_file_name, "w") as file:
    sentence_counter = 14  # Initial sentence index as per student ID requirements
    for sentence in selected_sentences:
        analysis_outcome = analyze_text_spacy(sentence)
        formatted_output = f"Sentence {sentence_counter} (Word, Lemma, POS): {analysis_outcome}\n"
        print(formatted_output)  # Display the formatted output
        file.write(formatted_output)  # Save the formatted output to a file
        sentence_counter += 1  # Prepare for the next sentence

print(f"Analysis completed. Results saved to '{output_file_name}'")




Analyzing with spaCy:
Sentence 14 (Word, Lemma, POS): [('`', '`', 'PUNCT'), ('`', '`', 'PUNCT'), ('This', 'this', 'PRON'), ('is', 'be', 'AUX'), ('one', 'one', 'NUM'), ('of', 'of', 'ADP'), ('the', 'the', 'DET'), ('major', 'major', 'ADJ'), ('items', 'item', 'NOUN'), ('in', 'in', 'ADP'), ('the', 'the', 'DET'), ('Fulton', 'Fulton', 'PROPN'), ('County', 'County', 'PROPN'), ('general', 'general', 'ADJ'), ('assistance', 'assistance', 'NOUN'), ('program', 'program', 'NOUN'), ("''", "''", 'PUNCT'), (',', ',', 'PUNCT'), ('the', 'the', 'DET'), ('jury', 'jury', 'NOUN'), ('said', 'say', 'VERB'), (',', ',', 'PUNCT'), ('but', 'but', 'CCONJ'), ('the', 'the', 'DET'), ('State', 'State', 'PROPN'), ('Welfare', 'Welfare', 'PROPN'), ('Department', 'Department', 'PROPN'), ('`', '`', 'PUNCT'), ('`', '`', 'PUNCT'), ('has', 'have', 'AUX'), ('seen', 'see', 'VERB'), ('fit', 'fit', 'ADJ'), ('to', 'to', 'PART'), ('distribute', 'distribute', 'VERB'), ('these', 'these', 'DET'), ('funds', 'fund', 'NOUN'), ('through',

## Comparison

To address the comparison of NLTK and spaCy in terms of Part-of-Speech (POS) tagging and lemmatization, we can implement a qualitative review process using Python. This involves:

1. **Extracting the original POS tags** from the Brown Corpus for our selected sentences to serve as a reference.
2. **Reviewing the POS tags and lemmatization** outputs from both NLTK and spaCy.
3. **Identifying core differences and limitations** in the outputs, focusing on noticeable patterns or discrepancies rather than evaluating every single token.

We'll start by extracting the original POS tags for the selected sentences from the Brown Corpus. Then, we'll load the outputs from both NLTK and spaCy, and compare these outputs with the original tags to observe any significant differences or patterns.

In [None]:
# Retrieve and document the original POS tags from the Brown Corpus for comparison
news_tagged_sents = brown.tagged_sents(categories='news')  # load the tagged view once
original_tags_collection = [news_tagged_sents[index] for index in range(14, 24)]

# Save the original POS tags for subsequent analysis
with open("/gdrive/My Drive/OriginalPOSTagsAnalysis.txt", "w") as tags_file:
    # Number the sentences from 14 to match the selected range
    for sentence_number, tagged_sent in enumerate(original_tags_collection, start=14):
        tags_file.write(f"Sentence {sentence_number}: {tagged_sent}\n")

print("Original POS tags have been successfully saved to 'OriginalPOSTagsAnalysis.txt'")




Original POS tags have been successfully saved to 'OriginalPOSTagsAnalysis.txt'


### Load NLTK and spaCy Outputs for Comparison

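Instead of reading the saved text files back, the cell below re-runs both analyses in memory and reshapes the Brown tags into the same (word, lemma, POS) triple format.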

In [None]:
# Processing sentences to gather NLTK and spaCy linguistic annotations
nltk_annotated = [analyze_sentence(sentence) for sentence in selected_sentences]
spacy_annotated = [analyze_text_spacy(sentence) for sentence in selected_sentences]

# Adjusting the original POS tags from the Brown Corpus for a unified comparison
adjusted_pos_tags = [
    [(token, '', tag) for token, tag in tagged_sent]
    for tagged_sent in original_tags_collection
]


### Comparison Overview

- **POS Tagging:** Juxtapose the POS tags from NLTK and spaCy against Brown's original tags to identify variances.
- **Lemmatization:** Without lemmatized forms in Brown, the evaluation will be subjective, focusing on inaccuracies or divergences in lemmatization between the frameworks.
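
A practical obstacle to the POS comparison is that the three sources use different tagsets: Brown's own tags (`BEZ`, `AT`, `NN-TL`), Penn Treebank tags from NLTK (`VBZ`, `DT`, `NNP`) and Universal POS tags from spaCy (`AUX`, `DET`, `PROPN`). One way to compare them is to project each onto a shared coarse category first. The lookup tables below are hypothetical and deliberately incomplete: they cover only a handful of tags seen in sentence 14, and the decision to fold `AUX` into `VERB` and `PROPN` into `NOUN` is an assumption made for this sketch. NLTK also ships a mapping to a universal tagset, which may be a more complete alternative.

```python
# Illustrative, incomplete lookup tables (assumptions for this sketch only)
BROWN_TO_COARSE = {"AT": "DET", "DT": "DET", "BEZ": "VERB", "HVZ": "VERB",
                   "NN": "NOUN", "NP": "NOUN", "JJ": "ADJ"}
PENN_TO_COARSE = {"DT": "DET", "VBZ": "VERB", "NN": "NOUN", "NNP": "NOUN", "JJ": "ADJ"}
SPACY_TO_COARSE = {"AUX": "VERB", "PROPN": "NOUN"}  # other tags pass through unchanged


def coarse(tag, table):
    """Strip Brown-style suffixes such as -TL, then look the tag up in the table."""
    base = tag.split("-")[0] or tag  # keep tags that start with a hyphen intact
    return table.get(base, base)


def agreement_rate(reference, predicted):
    """Share of positions where two equally long tag lists agree."""
    pairs = list(zip(reference, predicted))
    return sum(r == p for r, p in pairs) / len(pairs) if pairs else 0.0


# Tags for 'This', 'is', 'the', 'County', 'has', 'fit' from sentence 14
brown_tags = ["DT", "BEZ", "AT", "NN-TL", "HVZ", "JJ"]
nltk_tags = ["DT", "VBZ", "DT", "NNP", "VBZ", "NN"]
spacy_tags = ["PRON", "AUX", "DET", "PROPN", "AUX", "ADJ"]

brown_coarse = [coarse(t, BROWN_TO_COARSE) for t in brown_tags]
nltk_coarse = [coarse(t, PENN_TO_COARSE) for t in nltk_tags]
spacy_coarse = [coarse(t, SPACY_TO_COARSE) for t in spacy_tags]

nltk_rate = agreement_rate(brown_coarse, nltk_coarse)
spacy_rate = agreement_rate(brown_coarse, spacy_coarse)
print(f"NLTK vs Brown: {nltk_rate:.2f}, spaCy vs Brown: {spacy_rate:.2f}")
```

On these six tokens each framework disagrees with Brown once: NLTK tags `fit` as a noun, and spaCy tags `This` as a pronoun.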





In [None]:
# Example code for reviewing POS tags and lemmatization
for sentence_counter in range(10):
    sentence_id = 14 + sentence_counter
    print(f"Analyzing Sentence {sentence_id} - Details:")
    print(f"Baseline from Corpus: {adjusted_pos_tags[sentence_counter]}")
    print(f"Results via NLTK: {nltk_annotated[sentence_counter]}")
    print(f"Results via spaCy: {spacy_annotated[sentence_counter]}")
    print("\n")


Analyzing Sentence 14 - Details:
Baseline from Corpus: [('``', '', '``'), ('This', '', 'DT'), ('is', '', 'BEZ'), ('one', '', 'CD'), ('of', '', 'IN'), ('the', '', 'AT'), ('major', '', 'JJ'), ('items', '', 'NNS'), ('in', '', 'IN'), ('the', '', 'AT'), ('Fulton', '', 'NP-TL'), ('County', '', 'NN-TL'), ('general', '', 'JJ'), ('assistance', '', 'NN'), ('program', '', 'NN'), ("''", '', "''"), (',', '', ','), ('the', '', 'AT'), ('jury', '', 'NN'), ('said', '', 'VBD'), (',', '', ','), ('but', '', 'CC'), ('the', '', 'AT'), ('State', '', 'NN-TL'), ('Welfare', '', 'NN-TL'), ('Department', '', 'NN-TL'), ('``', '', '``'), ('has', '', 'HVZ'), ('seen', '', 'VBN'), ('fit', '', 'JJ'), ('to', '', 'TO'), ('distribute', '', 'VB'), ('these', '', 'DTS'), ('funds', '', 'NNS'), ('through', '', 'IN'), ('the', '', 'AT'), ('welfare', '', 'NN'), ('departments', '', 'NNS'), ('of', '', 'IN'), ('all', '', 'ABN'), ('the', '', 'AT'), ('counties', '', 'NNS'), ('in', '', 'IN'), ('the', '', 'AT'), ('state', '', 'NN'), ('w

This conceptual approach allows for a manual review of the frameworks' outputs. The following questions may be considered:

- **POS Tagging Differences:** Are there tags where one framework consistently differs from the original Brown tags or the other framework?
- **Lemmatization Differences:** Does one framework appear to produce more accurate or contextually appropriate lemmas than the other?


### Save the Outputs

Given the detailed nature of the outputs, which include original tags, NLTK outputs, and spaCy outputs, a structured file format that supports hierarchy and is easily readable for both humans and machines is preferable. JSON (JavaScript Object Notation) is an ideal choice for this purpose because:

- **It can easily represent the nested structure of sentences, POS tags, and lemmatization outputs.**
- **It's widely supported and can be easily read into most data analysis tools, programming languages, and libraries.**
- **It facilitates a straightforward comparison of complex structures.**

This code will structure the data accordingly and save it to a file named "ComparativeAnalysis.json".
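
One detail worth keeping in mind when the file is loaded again: JSON has no tuple type, so the `(word, lemma, POS)` tuples are written as arrays and come back as Python lists. The short round trip below illustrates this with two triples from the spaCy output. It uses only the standard `json` module and in-memory strings rather than a file.

```python
import json

# Two (word, lemma, POS) tuples as produced by the analysis functions
annotated = [("has", "have", "AUX"), ("`", "`", "PUNCT")]

# Serialize and immediately parse again, as a file write and read would do
restored = json.loads(json.dumps(annotated))

print(type(restored[0]).__name__)            # element type after the round trip
print(restored[1] == ["`", "`", "PUNCT"])    # comparison against a list
print(restored[1] == ("`", "`", "PUNCT"))    # comparison against a tuple

# Convert back to tuples if the original shape is needed
as_tuples = [tuple(item) for item in restored]
```

Any later comparison against the reloaded data therefore has to use lists, or the data has to be converted back to tuples first.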


In [None]:
import json

# 'sentence_reviews' will hold each sentence's analysis, including the original Brown tags, and both NLTK and spaCy's outputs.
sentence_reviews = [
    {
        "Sentence Index": sentence_num,
        "Brown Original Tags": adjusted_pos_tags[sentence_num - 14],
        "Output from NLTK": nltk_annotated[sentence_num - 14],
        "Output from spaCy": spacy_annotated[sentence_num - 14]
    }
    for sentence_num in range(14, 24)
]

# Writing the comparative analysis to a JSON file for easy review and sharing.
comparison_json_path = "/gdrive/My Drive/ComparativeAnalysis.json"
with open(comparison_json_path, "w") as file:
    json.dump(sentence_reviews, file, indent=2)  # Using a two-space indent for compactness

comparison_json_path



'/gdrive/My Drive/ComparativeAnalysis.json'

## Data Analysis

The data analysis within the Python environment could involve:

- **Descriptive Statistics:** Compute statistics like mean, median, or mode to summarize the agreement rates or the frequency of specific POS tags.
- **Visualization:** Create visualizations such as bar charts to compare the frequency of POS tags assigned by each framework or to visualize agreement rates.
- **Clustering or Anomaly Detection:** Identify patterns or outliers in the tagging or lemmatization outputs, which could indicate systemic differences between the frameworks.
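
As a minimal sketch of the descriptive-statistics idea, tag frequencies can be counted with `collections.Counter` from the standard library. The rows below are the first nine spaCy triples of sentence 14. The variable names are illustrative, and the same pattern would apply to the NLTK or Brown tags.

```python
from collections import Counter

# First nine (word, lemma, POS) triples from the spaCy output for sentence 14
spacy_rows = [
    ("`", "`", "PUNCT"), ("`", "`", "PUNCT"), ("This", "this", "PRON"),
    ("is", "be", "AUX"), ("one", "one", "NUM"), ("of", "of", "ADP"),
    ("the", "the", "DET"), ("major", "major", "ADJ"), ("items", "item", "NOUN"),
]

# Count how often each POS tag occurs
pos_counts = Counter(pos for _, _, pos in spacy_rows)

# Relative frequency of each tag, most common first
total = sum(pos_counts.values())
for tag, count in pos_counts.most_common():
    print(f"{tag:<6} {count:>2} {count / total:.2f}")
```

The resulting counts could then feed a bar chart or be compared across frameworks once the tagsets have been brought to a common level.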


### Load the JSON Data and Generate the Spreadsheet

To facilitate the analysis, we load the data from the JSON file into a format suitable for analysis, a Pandas DataFrame, and export the same table as a spreadsheet.

In [None]:
import pandas as pd
# Merge spaCy's split tokens (double backticks, hyphenated words) so its rows line up better with Brown and NLTK

# Reload the JSON file so this cell can run from scratch
with open(comparison_json_path, 'r') as file:
    data = json.load(file)

# Prepare the rows for the dataframe, applying the merging rule to the spaCy output
prepared_data = []

for sentence in data:
    sentence_number = sentence['Sentence Index']
    original_tags = sentence['Brown Original Tags']
    nltk_output = sentence['Output from NLTK']
    spacy_output = sentence['Output from spaCy']

    # Adjust for spaCy tokenization of double backticks and hyphenated words
    adjusted_spacy_output = []
    i = 0
    while i < len(spacy_output):
        if i < len(spacy_output) - 1 and spacy_output[i] == ["`", "`", "PUNCT"] and spacy_output[i + 1] == ["`", "`", "PUNCT"]:
            adjusted_spacy_output.append(["` `", "` `", "PUNCT PUNCT"])
            i += 2  # Skip the next token as well
        elif i < len(spacy_output) - 2 and spacy_output[i + 1][0] == "-":
            # Combine the three parts of a hyphenated word into a single token
            hyphenated_word = spacy_output[i][0] + " - " + spacy_output[i + 2][0]
            hyphenated_lemma = spacy_output[i][1] + " - " + spacy_output[i + 2][1]
            hyphenated_pos = spacy_output[i][2] + " " + spacy_output[i + 1][2] + " " + spacy_output[i + 2][2]
            adjusted_spacy_output.append([hyphenated_word, hyphenated_lemma, hyphenated_pos])
            i += 3  # Skip the next two tokens
        else:
            adjusted_spacy_output.append(spacy_output[i])
            i += 1

    # Ensure all lists are of the same length by padding shorter ones
    max_length = max(len(original_tags), len(nltk_output), len(adjusted_spacy_output))
    original_tags.extend([["", "", ""]] * (max_length - len(original_tags)))
    nltk_output.extend([["", "", ""]] * (max_length - len(nltk_output)))
    adjusted_spacy_output.extend([["", "", ""]] * (max_length - len(adjusted_spacy_output)))

    # Combine the data into a single list
    # (loop names avoid shadowing the imported nltk and spacy modules)
    for orig, nltk_tok, spacy_tok in zip(original_tags, nltk_output, adjusted_spacy_output):
        prepared_data.append([
            sentence_number,
            orig[0], orig[2],
            nltk_tok[0], nltk_tok[1], nltk_tok[2],
            spacy_tok[0], spacy_tok[1], spacy_tok[2]
        ])

# Create the adjusted dataframe
adjusted_df = pd.DataFrame(prepared_data, columns=[
    'Sentence_No', 'Words_Brown', 'POS_Brown',
    'Words_NLTK', 'Lemma_NLTK', 'POS_NLTK',
    'Words_spaCy', 'Lemma_spaCy', 'POS_spaCy'
])

print(adjusted_df)

# Save the adjusted dataframe to a new Excel file
adjusted_output_file_path = '/gdrive/My Drive/adjusted_processed_output.xlsx'
adjusted_df.to_excel(adjusted_output_file_path, index=False)

adjusted_output_file_path

     Sentence_Number Original_Words Original_POS  NLTK_Words  NLTK_Lemma  \
0                 14             ``           ``          ``          ``   
1                 14           This           DT        This        This   
2                 14             is          BEZ          is          is   
3                 14            one           CD         one         one   
4                 14             of           IN          of          of   
..               ...            ...          ...         ...         ...   
292               23            the           AT         the         the   
293               23         prices          NNS      prices       price   
294               23     reasonable           JJ  reasonable  reasonable   
295               23             ''           ''          ``          ``   
296               23              .            .           .           .   

    NLTK_POS spaCy_Words spaCy_Lemma    spaCy_POS  
0         ``         ` `         ` 

'adjusted_processed_output.xlsx'
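
The cell above pads the shorter lists at the end, so a single extra token early in a sentence shifts every later row out of alignment. A possible alternative, shown here only as a sketch, is to align the token sequences with `difflib.SequenceMatcher` from the standard library and insert gaps where they differ. The function name `align_tokens` is hypothetical, and this approach was not used to produce the spreadsheet.

```python
from difflib import SequenceMatcher


def align_tokens(left, right):
    """Pair up two token lists, inserting None where one side has no counterpart."""
    matcher = SequenceMatcher(a=left, b=right, autojunk=False)
    rows = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        left_chunk, right_chunk = left[i1:i2], right[j1:j2]
        if op != "equal":
            # Pad the shorter chunk locally so later tokens stay aligned
            width = max(len(left_chunk), len(right_chunk))
            left_chunk = left_chunk + [None] * (width - len(left_chunk))
            right_chunk = right_chunk + [None] * (width - len(right_chunk))
        rows.extend(zip(left_chunk, right_chunk))
    return rows


# Brown keeps the opening quote as one token, spaCy splits it into two backticks
brown_words = ["``", "This", "is", "one"]
spacy_words = ["`", "`", "This", "is", "one"]

aligned = align_tokens(brown_words, spacy_words)
for brown_word, spacy_word in aligned:
    print(brown_word, spacy_word)
```

Here the gap is placed next to the split quotation mark, so `This`, `is` and `one` remain on matching rows without a special rule for backticks.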