
# **INFO5731 Assignment 3**

In this assignment, we will delve into various aspects of natural language processing (NLP) and text analysis. The tasks are designed to deepen your understanding of key NLP concepts and techniques, as well as to provide hands-on experience with practical applications.

Through these tasks, you'll gain practical experience in NLP techniques such as N-gram analysis, TF-IDF, word embedding model creation, and sentiment analysis dataset creation.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and answer the questions. Do not create a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean", *i.e.*, it contains no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).


**Total points**: 100

**Deadline**: See Canvas

**Late submissions incur a 10% penalty for each day after the deadline.**


## Question 1 (30 points)

**Understand N-gram**

Write a Python program to conduct N-gram analysis on the dataset from assignment two. Write the code from scratch instead of using any pre-existing libraries:

(1) Count the frequency of all the N-grams (N=3).

(2) Calculate the probabilities for all the bigrams in the dataset by using the formula count(w1 w2) / count(w1). For example, count(really like) / count(really) = 1 / 3 = 0.33.

(3) Extract all the noun phrases and calculate the relative probability of each noun phrase in each review (abstract, or tweet) by using the formula frequency (noun phrase) / max frequency (noun phrase) on the whole dataset. Print the result as a table whose columns are the noun phrases and whose rows are the 100 reviews (abstracts, or tweets).
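
The bigram probability in task (2) is the count of a bigram divided by the count of its first word. The following is a minimal sketch of that calculation using only the standard library; the function name `bigram_probabilities` and the toy reviews are hypothetical and are not part of the assignment dataset.

```python
from collections import Counter

def bigram_probabilities(documents):
    """Return count(first second) / count(first) for every bigram."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        unigram_counts.update(tokens)
        # Pair each token with its successor; bigrams never cross documents
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {
        (first, second): count / unigram_counts[first]
        for (first, second), count in bigram_counts.items()
    }

# Hypothetical toy reviews, chosen so that "really" occurs three times
toy_reviews = [
    "i really like this movie",
    "i really enjoyed it",
    "it was really slow",
]
probs = bigram_probabilities(toy_reviews)
print(round(probs[("really", "like")], 2))  # 0.33
```

The toy data reproduces the example above: "really like" occurs once and "really" occurs three times, so the probability is 1 / 3.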

In [3]:
import requests
from bs4 import BeautifulSoup
import re
from collections import defaultdict

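# Function to fetch data from URL
def fetch_data(url):
    try:
        # A timeout keeps the cell from hanging on an unresponsive server
        response = requests.get(url, timeout=30)
    except requests.RequestException:
        # Connection errors and timeouts are treated like a failed fetch
        print("Failed to fetch data from URL:", url)
        return None
    if response.status_code == 200:
        return response.text
    print("Failed to fetch data from URL:", url)
    return None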

# Function to preprocess text
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

# Function to generate N-grams
def generate_ngrams(tokens, n):
    ngrams = []
    for i in range(len(tokens)-n+1):
        ngrams.append(tuple(tokens[i:i+n]))
    return ngrams

# Function to count frequencies of N-grams
def count_ngram_frequencies(texts, n):
    ngram_freq = defaultdict(int)
    total_ngrams = 0
    for text in texts:
        tokens = text.split()
        ngrams = generate_ngrams(tokens, n)
        for ngram in ngrams:
            ngram_freq[ngram] += 1
            total_ngrams += 1
    return ngram_freq, total_ngrams

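# Function to extract noun phrases
def extract_noun_phrases(text):
    # The previous regex searched the raw text for POS tag names (NN, JJ, ...),
    # which never appear in untagged text. Tag the tokens first, then collect
    # runs of adjectives/nouns that end in a noun.
    # Assumption: NLTK and its POS tagger data are installed.
    import nltk
    noun_phrases, current = [], []
    # The trailing sentinel flushes a phrase that ends the text
    for word, tag in nltk.pos_tag(text.split()) + [('', '.')]:
        if tag.startswith('NN') or tag.startswith('JJ'):
            current.append((word, tag))
            continue
        # Drop trailing adjectives so the phrase ends in a noun
        while current and not current[-1][1].startswith('NN'):
            current.pop()
        if current:
            noun_phrases.append(' '.join(w for w, _ in current))
        current = []
    return noun_phrases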

# Function to calculate relative probabilities of noun phrases
def calculate_relative_probabilities(texts):
    noun_phrase_freq = defaultdict(int)
    max_freq = 0
    for text in texts:
        noun_phrases = extract_noun_phrases(text)
        for phrase in noun_phrases:
            noun_phrase_freq[phrase] += 1
            if noun_phrase_freq[phrase] > max_freq:
                max_freq = noun_phrase_freq[phrase]
    relative_probabilities = {}
    for i, text in enumerate(texts, 1):
        relative_probabilities[i] = {}
        noun_phrases = extract_noun_phrases(text)
        for phrase in noun_phrases:
            relative_probabilities[i][phrase] = noun_phrase_freq[phrase] / max_freq
    return relative_probabilities

# URL to fetch data
url = "https://ddr.densho.org/narrators/"
# Fetch data from URL
data = fetch_data(url)

if data:
    # Parse HTML content
    soup = BeautifulSoup(data, 'html.parser')
    # Extract text from HTML
    text = soup.get_text()
    # Preprocess the text
    preprocessed_text = preprocess_text(text)

    # Perform N-gram analysis
    n = 3  # N for N-grams (change as needed)
    trigram_freq, total_trigrams = count_ngram_frequencies([preprocessed_text], n)
    print("Trigram Frequencies:")
    for trigram, freq in trigram_freq.items():
        print(trigram, ":", freq)

    # Extract noun phrases
    noun_phrases = extract_noun_phrases(preprocessed_text)

    # Calculate relative probabilities of noun phrases
    relative_probabilities = calculate_relative_probabilities([preprocessed_text])
    print("\nRelative Probabilities of Noun Phrases:")
    for i, probs in relative_probabilities.items():
        print("Review", i, ":")
        for phrase, rel_prob in probs.items():
            print(phrase, ":", rel_prob)
else:
    print("No data fetched from the URL.")


Failed to fetch data from URL: https://ddr.densho.org/narrators/
No data fetched from the URL.


## Question 2 (25 points)

**Understand TF-IDF and Document Representation**

Starting from the documents (all the reviews, or abstracts, or tweets) collected for assignment two, write a Python program:

(1) To build the document-term weight (tf * idf) matrix.

(2) To rank the documents with respect to a query (design a query yourself, for example, "An Outstanding movie with a haunting performance and best character development") by using cosine similarity.

Note: Write the code from scratch instead of using any pre-existing libraries.
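
As one possible starting point, the sketch below builds sparse tf * idf vectors and ranks documents against a query by cosine similarity using only the standard library. The weighting is an assumption (tf is the term count divided by document length, idf is log(N / df)); other variants add smoothing. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector; tf is the count normalized by document length."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    # Terms unseen in the corpus have no idf and are skipped
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}

def build_tfidf(documents):
    """Return (idf table, list of tf-idf vectors) for raw text documents."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(tokenized)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return idf, [tfidf_vector(tokens, idf) for tokens in tokenized]

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(documents, query):
    """Return (document index, score) pairs, best match first."""
    idf, vectors = build_tfidf(documents)
    query_vec = tfidf_vector(query.lower().split(), idf)
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Hypothetical toy documents and query
docs = [
    "an outstanding movie with a haunting performance",
    "the plot was slow and the acting was flat",
    "best character development in a movie this year",
]
ranking = rank_documents(docs, "outstanding movie with best character development")
for index, score in ranking:
    print(index, round(score, 3))
```

Storing each vector as a dict keeps only the non-zero weights; a dense document-term matrix can be produced from the same values by iterating over the `idf` keys in a fixed order.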

In [6]:
# Write your code here







Failed to fetch data from URL: https://ddr.densho.org/narrators/


TypeError: 'NoneType' object is not iterable

## Question 3 (25 points)

**Create your own word embedding model**

Use the data you collected for assignment 2 to build a word embedding model:

(1) Train a 300-dimension word embedding model (it can be word2vec, GloVe, ULMFiT, BERT, or others).

(2) Visualize the word embedding model you created.

Reference: https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

Reference: https://jaketae.github.io/study/word2vec/
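
If the model is trained from scratch as a skip-gram word2vec, the first step is turning the tokenized corpus into (center word, context word) training pairs. The sketch below covers only that data-preparation step, not the training itself; the function names, the window size, and the toy sentences are hypothetical.

```python
def build_vocab(sentences):
    """Map each distinct token to an integer id in first-seen order."""
    vocab = {}
    for tokens in sentences:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def skipgram_pairs(sentences, vocab, window=2):
    """Return (center_id, context_id) pairs for tokens within `window` positions."""
    pairs = []
    for tokens in sentences:
        ids = [vocab[t] for t in tokens]
        for i, center in enumerate(ids):
            start = max(0, i - window)
            stop = min(len(ids), i + window + 1)
            for j in range(start, stop):
                if j != i:  # a word is not its own context
                    pairs.append((center, ids[j]))
    return pairs

# Hypothetical toy corpus, already tokenized
toy = [
    ["the", "movie", "was", "great"],
    ["the", "acting", "was", "great"],
]
vocab = build_vocab(toy)
pairs = skipgram_pairs(toy, vocab, window=1)
print(len(vocab), len(pairs))  # 5 12
```

Each pair then becomes one training example: the model learns a 300-dimension vector per vocabulary id by predicting the context id from the center id.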

In [None]:
# Write your code here







## Question 4 (20 Points)

**Create your own training and evaluation data for sentiment analysis.**

**You don't need to write a program for this question!**

For example, if you collected movie review or product review data, then you can do the following steps:

*   Read each review (abstract or tweet) you collected in detail, and annotate each review with a sentiment (positive, negative, or neutral).

*   Save the annotated dataset into a csv file with three columns (document_id, clean_text, sentiment), upload the csv file to GitHub, and submit the file link below.

*   This dataset will be used for assignment four: sentiment analysis and text classification.
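
Although no program is required here, the three-column layout can be illustrated with a short sketch that writes annotated rows with Python's `csv` module. The function name `write_annotations` and the sample rows are hypothetical; the sentiment labels are the three listed above.

```python
import csv
import io

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}

def write_annotations(rows, handle):
    """Write (document_id, clean_text, sentiment) rows with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["document_id", "clean_text", "sentiment"])
    for document_id, clean_text, sentiment in rows:
        if sentiment not in ALLOWED_SENTIMENTS:
            raise ValueError(f"unexpected sentiment label: {sentiment!r}")
        writer.writerow([document_id, clean_text, sentiment])

# Hypothetical annotated rows; real rows come from manual annotation
rows = [
    (1, "great acting and a moving story", "positive"),
    (2, "the pacing was slow, and dull", "negative"),
]
buffer = io.StringIO()
write_annotations(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The `csv` writer quotes any text that contains a comma, so the review text stays in a single column. To write a real file, pass a handle opened with `open(path, "w", newline="")` instead of the in-memory buffer.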


In [10]:
# The GitHub link of your final csv file


# Link:





All reviews saved to 'reviews_all.csv'.


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Was the time provided to complete the assignment sufficient?

In [None]:
# Type your answer