# **INFO5731 Assignment 3**

In this assignment, we will delve into various aspects of natural language processing (NLP) and text analysis. The tasks are designed to deepen your understanding of key NLP concepts and techniques, as well as to provide hands-on experience with practical applications.

Through these tasks, you'll gain practical experience in NLP techniques such as N-gram analysis, TF-IDF, word embedding model creation, and sentiment analysis dataset creation.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Do not create a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).


**Total points**: 100

**Deadline**: See Canvas

**Late submissions incur a 10% penalty for each day after the deadline.**


## Question 1 (30 points)

**Understand N-gram**

Write a Python program to conduct N-gram analysis on the dataset from assignment two. You need to write the code from scratch instead of using any pre-existing libraries:

(1) Count the frequency of all the N-grams (N=3).

(2) Calculate the probabilities for all the bigrams in the dataset by using the formula count(w2 w1) / count(w2). For example, count(really like) / count(really) = 1 / 3 = 0.33.

(3) Extract all the noun phrases and calculate the relative probabilities of each review in terms of other reviews (abstracts, or tweets) by using the formula frequency (noun phrase) / max frequency (noun phrase) on the whole dataset. Print out the result in a table with all the noun phrases as column names and all the 100 reviews (abstracts, or tweets) as row names.
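Parts (1) and (2) can be sketched without NLTK as follows. This is a minimal illustration, not the required solution: the toy sentences and the helper names `make_ngrams` and `bigram_probabilities` are assumptions made for the example, and `collections.Counter` from the standard library is used only as a counting dictionary (a plain `dict` would work as well).

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_probabilities(sentences):
    """Estimate count(first second) / count(first) for every bigram."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        unigram_counts.update(tokens)
        bigram_counts.update(make_ngrams(tokens, 2))
    return {
        bigram: count / unigram_counts[bigram[0]]
        for bigram, count in bigram_counts.items()
    }

# Hypothetical toy corpus, already lowercased and split on whitespace
toy_sentences = [
    "i really like this charger".split(),
    "it is really fast".split(),
    "really fast charging".split(),
]

# Part (1): trigram frequencies, counted sentence by sentence
trigram_counts = Counter(
    gram for tokens in toy_sentences for gram in make_ngrams(tokens, 3)
)

# Part (2): bigram probabilities
bigram_probs = bigram_probabilities(toy_sentences)

print(trigram_counts.most_common(3))
print(round(bigram_probs[("really", "like")], 2))  # 1 / 3, printed as 0.33
```

Counting n-grams per sentence avoids creating n-grams that span a sentence boundary; whether that is desired depends on how the dataset is tokenized.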

In [None]:
import nltk
import pandas as ps
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
from nltk.util import ngrams
from nltk.tag import pos_tag

# Downloading NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from google.colab import files

# Select a file to upload (the reviews file)
upld_f = files.upload()

# Reading the uploaded file
for fn in upld_f.keys():
    with open(fn, 'r') as file:
        text = file.read()

# tokenizing the text
tkns = word_tokenize(text)
sntncs = sent_tokenize(text)

# Print a sample of the tokens and sentences; printing everything can exceed
# the notebook's output rate limit
print("Tokens:", tkns[:50])
print("\nSentences:", sntncs[:5])


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Saving amazonreviews.csv to amazonreviews (3).csv





## Question 2 (25 points)

**Understand TF-IDF and Document representation**

Starting from the documents (all the reviews, or abstracts, or tweets) collected for assignment two, write a Python program:

(1) To build the documents-terms weights (tf * idf) matrix.

(2) To rank the documents with respect to query (design a query by yourself, for example, "An Outstanding movie with a haunting performance and best character development") by using cosine similarity.

Note: You need to write the code from scratch instead of using any pre-existing libraries.
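A from-scratch version of both parts might look like the sketch below. It assumes one common weighting variant (term frequency normalized by document length, and idf = log(N / df)); the assignment does not fix the exact variant, and the toy documents, the query, and all function names are hypothetical.

```python
import math
from collections import Counter

def build_tfidf(docs):
    """Return (vocabulary, idf weights, tf-idf vectors) for tokenized docs."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, idf, vectors

def tfidf_query(query_tokens, vocab, idf):
    """Weight the query with the corpus idf; unknown terms are ignored."""
    counts = Counter(query_tokens)
    total = len(query_tokens) or 1
    return [counts[t] / total * idf[t] for t in vocab]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_documents(docs, query_tokens):
    """Return (document index, score) pairs, best match first."""
    vocab, idf, vectors = build_tfidf(docs)
    query_vector = tfidf_query(query_tokens, vocab, idf)
    scores = [(i, cosine(vec, query_vector)) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy documents and query
toy_docs = [
    "fast charger works great".split(),
    "charger stopped working".split(),
    "great cable and great price".split(),
]
ranking = rank_documents(toy_docs, "great fast charger".split())
for doc_index, score in ranking:
    print(doc_index, round(score, 3))
```

With this idf definition, a term that occurs in every document gets weight zero, so it contributes nothing to the ranking.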

In [None]:
# Write your code here




import io
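# Reading the reviews file from the uploads. The first uploaded file is used
# because Colab may rename a re-uploaded file (e.g. 'amazonreviews (3).csv'),
# in which case the key 'amazonreviews.csv' may not exist.
d_f = ps.read_csv(io.StringIO(next(iter(upld_f.values())).decode('utf-8')))
# Reading the review descriptions from the CSV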
ds = d_f['Desc'].tolist()

# preprocessing the description
def pp_text(text):
    sntncs = sent_tokenize(text)
    tknzd_sntncs = [word_tokenize(sentence.lower()) for sentence in sntncs]
    return tknzd_sntncs

#  N-grams
def get_ngrams(tokenized_text, n):
    n_grams = []
    for sentence_tokens in tokenized_text:
        n_grams.extend(list(ngrams(sentence_tokens, n)))
    return n_grams

# extracting noun phrases
def extract_noun_phrases(text):
    tgd_sntncs = pos_tag(word_tokenize(text))
    grammar = "NP: {<DT>?<JJ>*<NN>}"
    c_p = nltk.RegexpParser(grammar)
    tree = c_p.parse(tgd_sntncs)
    noun_phrases = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        noun_phrases.append(' '.join(word for word, tag in subtree.leaves()))
    return noun_phrases

# Calculating the N-gram frequencies where n is 3
tknzd_ds = [pp_text(text) for text in ds]
# Flatten the per-review lists into one list of tokenized sentences
all_sntncs = [sentence for review in tknzd_ds for sentence in review]
three_grams = get_ngrams(all_sntncs, 3)
three_gram_freq = FreqDist(three_grams)

two_grams = get_ngrams(all_sntncs, 2)
two_gram_freq = FreqDist(two_grams)

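# Bigram probabilities: count(first second) / count(first)
wrd_freq = FreqDist(word for review in tknzd_ds for sentence in review for word in sentence)
prblts = {}
for bigram in two_gram_freq:
    # Look up the first word in the unigram counts; indexing the bigram
    # counts with a single word would always return 0
    first_word_freq = wrd_freq[bigram[0]]
    prblts[bigram] = two_gram_freq[bigram] / first_word_freq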

# Extract and calculate relative probabilities of noun phrases
noun_phrases_all_reviews = []
for text in ds:
    noun_phrases = extract_noun_phrases(text)
    noun_phrases_all_reviews.append(noun_phrases)

noun_phrases_flattened = [phrase for sublist in noun_phrases_all_reviews for phrase in sublist]
noun_phrase_freq = FreqDist(noun_phrases_flattened)

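r_prblts = {}
# Compute the dataset-wide maximum once instead of once per phrase
max_np_freq = max(noun_phrase_freq.values(), default=1)
for i, review_phrases in enumerate(noun_phrases_all_reviews):
    r_prblts[f"Review {i + 1}"] = {
        phrase: noun_phrase_freq[phrase] / max_np_freq for phrase in review_phrases
    }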

# Displaying results of n grams
print("Frequency of N-grams (N=3):")
print(three_gram_freq.most_common())

print("\nProbabilities for bigrams:")
print(prblts)

print("\nRelative probabilities of noun phrases for each review:")
d_f = ps.DataFrame.from_dict(r_prblts, orient='index')
print(d_f)




Frequency of N-grams (N=3):
[(('>', 'pd', 'output'), 1100), (('pd', 'output', '('), 1100), (('as', 'our', 'default'), 1100), (('usb-c', 'power', 'adapter'), 880), (('15', 'and', '19.98'), 880), (('and', '19.98', 'watts'), 880), (('19.98', 'watts', ')'), 880), (('the', 'apple', '20w'), 660), (('power', 'adapter', 'is'), 660), (('usb-c', 'power', 'delivery'), 660), (('can', 'be', 'used'), 660), (('be', 'used', 'for'), 660), (('watts', '>', 'input'), 660), (('output', '(', 'same'), 660), (('(', 'same', 'options'), 660), (('same', 'options', 'as'), 660), (('options', 'as', 'our'), 660), (('our', 'default', ','), 660), (('default', ',', '15'), 660), ((',', '15', 'and'), 660), (('apple', '20w', 'charger'), 440), (('for', 'a', 'while'), 440), (('a', 'while', 'now'), 440), (('and', 'it', "'s"), 440), (('when', 'it', 'comes'), 440), (('it', 'comes', 'to'), 440), (('comes', 'to', 'charging'), 440), (('before', 'heading', 'out'), 440), (('heading', 'out', 'or'), 440), ((',', 'this', 'charger'), 4
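For part (3) of Question 1, the formula frequency (noun phrase) / max frequency (noun phrase) leaves some room for interpretation. The sketch below takes one possible reading: the numerator is how often a phrase occurs in a single review, and the denominator is the highest total count of any noun phrase in the whole dataset. The phrase lists are hypothetical stand-ins for the output of `extract_noun_phrases`, and `relative_np_table` is an illustrative name.

```python
from collections import Counter

def relative_np_table(phrases_per_review):
    """Return (column names, rows) of per-review relative phrase frequencies."""
    per_review = [Counter(phrases) for phrases in phrases_per_review]
    totals = Counter()
    for counts in per_review:
        totals.update(counts)
    # Highest dataset-wide count of any single noun phrase
    max_freq = max(totals.values(), default=1)
    columns = sorted(totals)
    rows = [[counts[p] / max_freq for p in columns] for counts in per_review]
    return columns, rows

# Hypothetical noun phrases, one list per review
toy_phrases = [
    ["the charger", "a cable", "the charger"],
    ["the charger"],
    ["a cable", "the box"],
]
columns, rows = relative_np_table(toy_phrases)

# Print the table with noun phrases as columns and reviews as rows
print("".ljust(10) + "".join(c.ljust(14) for c in columns))
for i, row in enumerate(rows, start=1):
    print(f"Review {i}".ljust(10) + "".join(f"{v:.2f}".ljust(14) for v in row))
```

Phrases that do not occur in a review get 0 rather than a missing value, which keeps every row the same length.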

## Question 3 (25 points)

**Create your own word embedding model**

Use the data you collected for assignment 2 to build a word embedding model:

(1) Train a 300-dimension word embedding (it can be word2vec, glove, ulmfit, bert, or others).

(2) Visualize the word embedding model you created.

Reference: https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

Reference: https://jaketae.github.io/study/word2vec/

In [None]:
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import matplotlib.pyplot as pt
from nltk.tokenize import word_tokenize
import pandas as ps
import nltk
nltk.download('punkt')

# Loading the data
d_f = ps.read_csv('amazonreviews.csv')

# Tokenizing the reviews desc
tokenized_reviews = [word_tokenize(text.lower()) for text in d_f['Desc']]

# Training Word2Vec model
mdl = Word2Vec(sentences=tokenized_reviews, vector_size=300, window=5, min_count=1, workers=4)
mdl.save('word2vec_mdl')

# Getting vocabulary using the model
vocab = list(mdl.wv.key_to_index.keys())

# Checking which candidate words are in the model's vocabulary
# (this function only reports membership; it does not draw a plot)
def pt_embdngs(mdl, w):
    for word in w:
        if word in mdl.wv.key_to_index:
            print(f"Word '{word}' is found in the model's vocabulary.")
        else:
            print(f"Word '{word}' not found in the model's vocabulary.")


# Check some candidate words
pt_w = ['good', 'excellent', 'bad', 'performance', 'story', 'director']
pt_embdngs(mdl, pt_w)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Word 'good' is found in the model's vocabulary.
Word 'excellent' not found in the model's vocabulary.
Word 'bad' not found in the model's vocabulary.
Word 'performance' is found in the model's vocabulary.
Word 'story' not found in the model's vocabulary.
Word 'director' not found in the model's vocabulary.
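The cell above only reports which words are in the vocabulary, so part (2) still needs a plot. One way to get one is to project the 300-dimensional vectors onto their first two principal components and scatter-plot them. The sketch below is an assumption about how this could be done, not the notebook's method: it computes PCA with NumPy's SVD, uses random vectors in place of the trained embeddings, and `project_to_2d` and `plot_embeddings` are hypothetical helper names.

```python
import numpy as np

def project_to_2d(vectors):
    """Project row vectors onto their first two principal components."""
    matrix = np.asarray(vectors, dtype=float)
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def plot_embeddings(word_vectors):
    """Scatter-plot a {word: vector} mapping (needs at least two words)."""
    # Imported here so the projection above works without matplotlib
    import matplotlib.pyplot as plt
    words = list(word_vectors)
    coords = project_to_2d([word_vectors[w] for w in words])
    plt.scatter(coords[:, 0], coords[:, 1])
    for word, (x, y) in zip(words, coords):
        plt.annotate(word, (x, y))
    plt.title("Word embeddings projected onto two principal components")
    plt.show()

# Random stand-ins for six 300-dimensional word vectors
rng = np.random.default_rng(0)
toy_vectors = {word: rng.normal(size=300) for word in
               ["good", "charger", "cable", "fast", "price", "performance"]}
coords = project_to_2d(list(toy_vectors.values()))
print(coords.shape)  # (6, 2)
```

With the trained model, the same function could be called as `plot_embeddings({w: mdl.wv[w] for w in pt_w if w in mdl.wv.key_to_index})`, which skips the words that are missing from the vocabulary.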


## Question 4 (20 Points)

**Create your own training and evaluation data for sentiment analysis.**

 **You don't need to write a program for this question!**

 For example, if you collected movie review or product review data, then you can do the following steps:

*   Read each review (abstract or tweet) you collected in detail, and annotate each review with a sentiment (positive, negative, or neutral).

*   Save the annotated dataset as a CSV file with three columns (document_id, clean_text, sentiment), upload the CSV file to GitHub, and submit the file link below.

*   This dataset will be used for assignment four: sentiment analysis and text classification.


In [1]:
# The GitHub link of your final csv file


# Link: https://github.com/VinodkumarKonduru/Info5731/blob/main/amazonreviewssentimented.csv



# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [2]:
# Type your answer


"""
I found this assignment very informative.

Compared with the previous assignments, I found this one a bit easier.

Because we are doing many assignments and exercises, my Python skills are improving day by day.
"""