# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions are penalized 10% for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software products from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [15]:
#first we import the necessary libraries..
import requests
from bs4 import BeautifulSoup
import pandas as pd

# we are implementing the first part of the question, i.e, amazon reviews
def fetch_product_reviews(url, review_count=1000):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    review_texts = []
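    # Start from the first page of reviews
    page_counter = 1
    # Keep requesting pages until enough reviews are collected
    while len(review_texts) < review_count:
        # Pass the page number through params: the product URL already has a
        # query string, so appending "?page=..." would produce a malformed URL
        server_response = requests.get(url, headers=headers, params={'page': page_counter})
        soup_parser = BeautifulSoup(server_response.content, 'html.parser')
        reviews = soup_parser.find_all('span', {'data-hook': 'review-body'})
        # Stop if the page has no reviews (last page reached or request blocked);
        # without this check the loop would never end
        if not reviews:
            break

        for review in reviews:
            if len(review_texts) < review_count:
                review_texts.append(review.get_text())
            else:
                break
        # Move on to the next page
        page_counter += 1
    return review_texts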

# URL of the amazon product..
product_reviews_url = 'https://www.amazon.com/SHASHIBO-Shifting-Geometric-Magnetic-Transforming/dp/B07W5QM4DP/ref=sr_1_3?dib=eyJ2IjoiMSJ9.zC0c7nLgmQII8muDoATYhXugk9nCudoP2yUhmI71OGWYHu2xkpmERJU3XPGeYp18aPnBr2luax-7CPUG5k5BOr4w1koHGSrcVvHRU0P8EgpC9bXm_B4v3pa9fLaXwbiFKVowO5xcBwEZRzYqfwKVSSqzh5CeBqZ3Wvzd2Gg54hbHRqcxw-Gd1orLjChnSYQL6Kwkgfnldk08W5Peigcn-P2rQJAaxwCeydypyh7jJPYIZCf-N28yNm3qhiMof7_wkCPxl72bcGyXeqPodXEFKyA7z2bFfmozamfIH1tZtb0.wZP12Q-1YtXbjDn8kI58vMy0XH8CQJVp8BNXhJEfyR8&dib_tag=se&keywords=shashibo%2Bshape%2Bshifting%2Bbox&qid=1708957841&sr=8-3&th=1'

print("Retrieving product reviews...")
reviews = fetch_product_reviews(product_reviews_url, review_count=1000)
print("Product reviews retrieved successfully.")

# DataFrame to hold the reviews..
reviews_df = pd.DataFrame({'Review': reviews})

print("Saving reviews to CSV file...")
# Save the DataFrame to a CSV file..
reviews_df.to_csv('product_reviews.csv', index=False)
print("Reviews saved successfully.")


Retrieving product reviews...
Product reviews retrieved successfully.
Saving reviews to CSV file...
Reviews saved successfully.
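
A loop that pages through reviews until it reaches a target count needs a stop condition for pages that return nothing, and it helps to drop repeated reviews in case the same page is served more than once. The sketch below shows one way to do that. `fetch_page` is a hypothetical stand-in for the request-and-parse step, `fake_fetch_page` is made-up test data, and `max_pages` is an assumed safety cap; none of them are part of the notebook.

```python
import csv
import io
from typing import Callable, List


def collect_reviews(fetch_page: Callable[[int], List[str]],
                    review_count: int = 1000,
                    max_pages: int = 200) -> List[str]:
    """Collect up to review_count unique reviews, stopping on the first empty page."""
    collected: List[str] = []
    seen = set()
    for page in range(1, max_pages + 1):
        page_reviews = fetch_page(page)
        if not page_reviews:
            break  # nothing parsed: end of the reviews or a blocked request
        for text in page_reviews:
            text = text.strip()
            if text and text not in seen:  # skip blanks and repeated reviews
                seen.add(text)
                collected.append(text)
                if len(collected) == review_count:
                    return collected
    return collected


def fake_fetch_page(page: int) -> List[str]:
    """Hypothetical fetcher: three pages of two reviews each, then nothing."""
    pages = {
        1: ["Great toy ", "Too small"],
        2: ["Too small", "Fun for kids"],
        3: ["Strong magnets", "Pricey"],
    }
    return pages.get(page, [])


reviews = collect_reviews(fake_fetch_page, review_count=10)

# Write the result in the same one-column layout as product_reviews.csv.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Review"])
writer.writerows([r] for r in reviews)
print(len(reviews), "unique reviews collected")
```

With the sample data this stops after the fourth (empty) page and keeps five reviews, because the repeated "Too small" is stored only once.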


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the cleaned data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
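
Steps (1) through (4) can be sketched with the standard library alone. The stopword set below is a tiny illustrative sample rather than a real stopwords list, and `basic_clean` is a hypothetical helper name. Stemming and lemmatization are left out because they need a library such as NLTK.

```python
import re
import string

# Illustrative sample only; a real run would use a full stopwords list.
SAMPLE_STOPWORDS = {"i", "this", "for", "my", "and", "it", "is", "a", "the", "in"}

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def basic_clean(text: str) -> str:
    """Lowercase, strip punctuation and digits, then drop stopwords."""
    # (4) Lowercase first so stopword matching is case-insensitive.
    text = text.lower()
    # (1) Remove ASCII punctuation.
    text = text.translate(PUNCTUATION_TABLE)
    # (2) Remove numbers.
    text = re.sub(r"\d+", "", text)
    # Split on whitespace, then (3) drop stopwords.
    tokens = [token for token in text.split() if token not in SAMPLE_STOPWORDS]
    return " ".join(tokens)


print(basic_clean("I bought this for my 2 kids, and it is GREAT!"))
```

Note that `string.punctuation` covers only ASCII characters, so typographic marks such as `’` pass through unless they are handled separately.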

In [16]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Acquiring the NLTK assets..
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Loading the dataset and ensuring its integrity
df = pd.read_csv('product_reviews.csv')
print("Data successfully loaded.")
print("Total rows:", len(df))

# Define the stop words and initialize the stemmer and the lemmatizer..
stop_words_list = set(stopwords.words('english'))
stemmer_engine = PorterStemmer()
lemmatizer_tool = WordNetLemmatizer()

# This is the text cleaning function..
def sanitize_text(text):
    print("Original Text:", text)
    # Eliminating punctuations and numbers
    text = ''.join([character for character in text if character not in string.punctuation and not character.isdigit()])
    print("Text after removing punctuations and numbers:", text)
    # Tokenizing the text
    tokens = word_tokenize(text)
    print("Tokenized Text:", tokens)
    # Removing stop words..
    tokens = [word for word in tokens if word.lower() not in stop_words_list]
    print("Text after stop words removal:", tokens)
    # Converting to lowercase...
    tokens = [word.lower() for word in tokens]
    print("Text after converting to lowercase:", tokens)
    # Stemming
    stemmed_tokens = [stemmer_engine.stem(word) for word in tokens]
    print("Stemmed Text:", stemmed_tokens)
    # Lemmatizing
    lemmatized_tokens = [lemmatizer_tool.lemmatize(word) for word in tokens]
    print("Lemmatized Text:", lemmatized_tokens)
    return ' '.join(lemmatized_tokens)

# Applying text cleaning to the Review column
df['Cleaned_Review'] = df['Review'].apply(sanitize_text)


cleaned_file_path = '/content/drive/MyDrive/cleaned.csv'
df.to_csv(cleaned_file_path, index=False)

print("Data frame successfully saved to:", cleaned_file_path)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Data successfully loaded.
Total rows: 1000
Original Text: 
Absolutely life-changing! I purchased this shape-shifting box for my family member who is on the autism spectrum, and it has been an incredible addition to their home. The versatility and adaptability of the box have provided endless hours of sensory stimulation and exploration. Its ability to transform into various shapes has captivated my family member's attention, encouraging both creativity and engagement. The quality and durability of the materials used make it perfect for repetitive use, and the calming effect it has had is truly remarkable. This box has become an essential tool for sensory development and relaxation. I highly recommend it to anyone looking for a versatile, engaging, and therapeutic tool for their loved ones.
Read more
Text after removing punctuations and numbers: 
Absolutely li

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Streaming output truncated to the last 5000 lines.
Tokenized Text: ['I', 'bought', 'this', 'for', 'a', 'grandchild', 'for', 'Christmas', 'She', 'loves', 'this', 'kind', 'of', 'thing', 'and', 'I', 'thought', 'she', 'would', 'enjoy', 'it', 'They', 'are', 'a', 'little', 'steep', 'in', 'price', 'and', 'I', 'must', 'say', 'I', 'was', 'a', 'bit', 'surprised', 'when', 'it', 'arrived', 'in', 'a', 'tiny', 'little', 'x', 'inch', 'box', 'From', 'the', 'pics', 'I', 'was', 'expecting', 'something', 'about', 'double', 'that', 'size', 'I', 'wanted', 'to', 'be', 'certain', 'I', 'was', 'getting', 'a', 'decent', 'quality', 'item', 'so', 'I', 'removed', 'it', 'from', 'the', 'box', 'and', 'started', 'messing', 'around', 'with', 'it', 'to', 'check', 'quality', 'and', 'entertainment', 'value', 'Well', 'we', 'are', 'now', 'days', 'closer', 'to', 'Christmas', 'since', 'it', 'arrived', 'and', 'I', 'am', 'no', 'closer', 'to', 'getting', 'back', 'into', 'that', 'x', 'box', 'it', 'came', 'in', 'It',

# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.

In [23]:
import nltk
import spacy
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
!pip install spacy
!python -m spacy download en_core_web_sm
from nltk.tokenize import sent_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
from nltk.tag import PerceptronTagger
spacy_model = spacy.load("en_core_web_sm")
tagger = PerceptronTagger()
from collections import Counter

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     12.8/12.8 MB 22.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [26]:
import spacy
import pandas as pd
import nltk
from nltk import ne_chunk, pos_tag, word_tokenize, sent_tokenize

# Loading spaCy model
nlp = spacy.load("en_core_web_sm")

# Loading cleaned text from the CSV file
cleaned_file_path = '/content/drive/MyDrive/cleaned.csv'
df = pd.read_csv(cleaned_file_path)
cleaned_text = ' '.join(df['Cleaned_Review'])

# Tokenizing text
sentences = sent_tokenize(cleaned_text)

# Initialize counters for POS tags
noun_count = 0
verb_count = 0
adj_count = 0
adv_count = 0

# Initialize counters for named entities
person_count = 0
organization_count = 0
location_count = 0
product_count = 0
date_count = 0

# Function to extract named entities from a sentence
def extract_entities(sentence):
    # Declare the module-level counters once, before any of them is updated
    global person_count, organization_count, location_count, product_count, date_count
    entities = ne_chunk(pos_tag(word_tokenize(sentence)))
    for entity in entities:
        if isinstance(entity, nltk.tree.Tree):
            # Each leaf is a (word, POS tag) pair; join the words, not the tags
            words, _tags = zip(*entity)
            entity_name = ' '.join(words)
            if entity.label() == 'PERSON':
                print("Person:", entity_name)
                person_count += 1
            elif entity.label() == 'ORGANIZATION':
                print("Organization:", entity_name)
                organization_count += 1
            elif entity.label() == 'GPE':
                print("Location:", entity_name)
                location_count += 1
            elif entity.label() == 'PRODUCT':
                print("Product:", entity_name)
                product_count += 1
            elif entity.label() == 'DATE':
                print("Date:", entity_name)
                date_count += 1

# Conduct POS tagging and NER for each sentence
for sentence in sentences:
    # Tokenize the sentence
    tokens = word_tokenize(sentence)
    # Perform POS tagging
    tagged_words = pos_tag(tokens)
    # Count POS tags
    for word, tag in tagged_words:
        if tag.startswith('N'):
            noun_count += 1
        elif tag.startswith('V'):
            verb_count += 1
        elif tag.startswith('J'):
            adj_count += 1
        elif tag.startswith('R'):
            adv_count += 1
    # Perform named entity recognition with NLTK
    extract_entities(sentence)
    # Perform named entity recognition with spaCy
    doc = nlp(sentence)
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            #print("Person:", ent.text)
            person_count += 1
        elif ent.label_ == 'ORG':
            #print("Organization:", ent.text)
            organization_count += 1
        elif ent.label_ == 'GPE':
            #print("Location:", ent.text)
            location_count += 1
        elif ent.label_ == 'PRODUCT':
            #print("Product:", ent.text)
            product_count += 1
        elif ent.label_ == 'DATE':
            #print("Date:", ent.text)
            date_count += 1

# Print total counts of POS tags and named entities
print("\nTotal counts of POS tags:")
print("Noun:", noun_count)
print("Verb:", verb_count)
print("Adjective:", adj_count)
print("Adverb:", adv_count)

print("\nTotal counts of named entities:")
print("Person:", person_count)
print("Organization:", organization_count)
print("Location:", location_count)
print("Product:", product_count)
print("Date:", date_count)



Total counts of POS tags:
Noun: 23078
Verb: 13251
Adjective: 11502
Adverb: 5688

Total counts of named entities:
Person: 315
Organization: 154
Location: 168
Product: 0
Date: 481
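
The counting loop above tests only the first letter of each tag, which is slightly loose for Penn Treebank tags: `RP` (particle) begins with `R` but is not an adverb. A small sketch using two-letter prefixes follows. The tagged tokens are hand-written for illustration rather than produced by a tagger, and `count_pos` is a hypothetical helper.

```python
from collections import Counter

# Two-letter Penn Treebank prefixes mapped to coarse categories.
PREFIX_TO_CATEGORY = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}


def count_pos(tagged_tokens):
    """Count nouns, verbs, adjectives and adverbs in (word, tag) pairs."""
    counts = Counter({category: 0 for category in PREFIX_TO_CATEGORY.values()})
    for _word, tag in tagged_tokens:
        category = PREFIX_TO_CATEGORY.get(tag[:2])
        if category is not None:
            counts[category] += 1
    return counts


# Hand-tagged example, not tagger output.
example = [
    ("kids", "NNS"), ("really", "RB"), ("love", "VBP"), ("colorful", "JJ"),
    ("puzzle", "NN"), ("gave", "VBD"), ("up", "RP"),
]
pos_counts = count_pos(example)
for category in ("Noun", "Verb", "Adjective", "Adverb"):
    print(category + ":", pos_counts[category])
```

Here the particle "up" is left out of the adverb count, whereas a first-letter test on `R` would include it.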


In [27]:
from collections import Counter, defaultdict

# Function to extract named entities from a sentence and count frequencies
def extract_entities_and_count(sentence, entity_category):
    entities = ne_chunk(pos_tag(word_tokenize(sentence)))
    for entity in entities:
        if isinstance(entity, nltk.tree.Tree):
            # Each leaf is a (word, POS tag) pair; join the words, not the tags
            words, _tags = zip(*entity)
            entity_name = ' '.join(words)
            if entity.label() == 'PERSON':
                entity_category["Person"].append(entity_name)
            elif entity.label() == 'ORGANIZATION':
                entity_category["Organization"].append(entity_name)
            elif entity.label() == 'GPE':
                entity_category["Location"].append(entity_name)
            elif entity.label() == 'PRODUCT':
                entity_category["Product"].append(entity_name)
            elif entity.label() == 'DATE':
                entity_category["Date"].append(entity_name)

# Initialize dictionary to store words for each category
category_words = defaultdict(list)

# Conduct POS tagging and NER for each sentence
for sentence in sentences:
    # Perform named entity recognition with NLTK
    extract_entities_and_count(sentence, category_words)

    # Perform named entity recognition with spaCy
    doc = nlp(sentence)
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            category_words["Person"].append(ent.text)
        elif ent.label_ == 'ORG':
            category_words["Organization"].append(ent.text)
        elif ent.label_ == 'GPE':
            category_words["Location"].append(ent.text)
        elif ent.label_ == 'PRODUCT':
            category_words["Product"].append(ent.text)
        elif ent.label_ == 'DATE':
            category_words["Date"].append(ent.text)

# Print the category and words along with their frequencies
print("\nWord frequencies by category:")
for category, words in category_words.items():
    print("\nCategory:", category)
    word_counter = Counter(words)
    for word, frequency in word_counter.items():
        print("Word:", word, "| Frequency:", frequency)

# Print total counts of POS tags and named entities
print("\nTotal counts of POS tags:")
print("Noun:", noun_count)
print("Verb:", verb_count)
print("Adjective:", adj_count)
print("Adverb:", adv_count)
print("\nTotal counts of named entities:")
print("Person:", person_count)
print("Organization:", organization_count)
print("Location:", location_count)
print("Product:", product_count)
print("Date:", date_count)



Word frequencies by category:

Category: Date
Word: christmas | Frequency: 156
Word: sunday | Frequency: 78
Word: several day christmas | Frequency: 78
Word: week | Frequency: 78
Word: june | Frequency: 77
Word: year | Frequency: 14

Category: Person
Word: jun freaking played | Frequency: 77
Word: construir el cubo e pequeño | Frequency: 70
Word: luego se | Frequency: 70
Word: mucho para construir la otras | Frequency: 70
Word: neemt | Frequency: 7
Word: älskar denna lilla kub nu funderar vi | Frequency: 7
Word: att köpa | Frequency: 7
Word: kunna koppla | Frequency: 7

Category: Location
Word: china | Frequency: 77
Word: la calidad | Frequency: 70
Word: julklappsstrumpa | Frequency: 7
Word: någon | Frequency: 7
Word: hela | Frequency: 7

Category: Organization
Word: multitude de formes et de possibilités | Frequency: 70
Word: buena lo recomiendo para toda la | Frequency: 70
Word: snel de maling | Frequency: 7
Word: för | Frequency: 7

Total counts of POS tags:
Noun: 23078
Verb: 13251
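
Because the cells above add both NLTK and spaCy results to the same lists and counters, an entity found by both recognizers is counted twice. One way to keep per-category frequencies while avoiding that is sketched below, under the assumption that each recognizer's output for a sentence has already been reduced to `(text, label)` pairs. `merge_entities` and `count_by_category` are hypothetical helpers, and the sample pairs are invented.

```python
from collections import Counter, defaultdict


def merge_entities(nltk_pairs, spacy_pairs):
    """Union of two recognizers' (text, label) pairs for one sentence."""
    return set(nltk_pairs) | set(spacy_pairs)


def count_by_category(sentence_entities):
    """Build a Counter of entity texts for each label."""
    counts = defaultdict(Counter)
    for pairs in sentence_entities:
        for text, label in pairs:
            counts[label][text] += 1
    return counts


# Hypothetical recognizer output for two sentences.
sentence_1 = merge_entities([("china", "Location")],
                            [("china", "Location"), ("christmas", "Date")])
sentence_2 = merge_entities([], [("christmas", "Date"), ("sunday", "Date")])

entity_counts = count_by_category([sentence_1, sentence_2])
for label in sorted(entity_counts):
    for text, freq in entity_counts[label].most_common():
        print(label, "|", text, "|", freq)
```

The trade-off of a set union is that an entity mentioned several times within one sentence is counted once for that sentence.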

In [19]:
# Loading the cleaned text from the CSV file..
cleaned_file_path = '/content/drive/MyDrive/cleaned.csv'
df = pd.read_csv(cleaned_file_path)
cleaned_text = ' '.join(df['Cleaned_Review'])

# Tokenizing our text into sentences
sentences = sent_tokenize(cleaned_text)

# Loading the English language model for spaCy..
nlp = spacy.load('en_core_web_sm')

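# Function to print the dependency parse and the noun chunks for a sentence.
# Note: en_core_web_sm has no constituency parser, so noun chunks are printed
# as a shallow, phrase-level stand-in; a full constituency tree needs a separate parser.
def print_parsing_trees(sentence):
    doc = nlp(sentence)
    print("\nSentence:", sentence)
    # Each token with its dependency relation and its head word
    print("\nDependency parse (token, relation, head):")
    for token in doc:
        print(token.text, token.dep_, token.head.text)
    # Base noun phrases and their root words
    print("\nNoun chunks (phrase --> root):")
    for chunk in doc.noun_chunks:
        print(chunk.text, "-->", chunk.root.text)

# Printing the dependency parse and noun chunks for each sentence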
for sentence in sentences:
    print_parsing_trees(sentence)


Streaming output truncated to the last 5000 lines.
cost going movie bowling --> bowling
activity --> activity
least worth entertainment --> entertainment
nice looking --> looking
good hand shapeshifter --> shapeshifter
cube puzzle --> puzzle
fun toy overpriced get term fidget --> fidget
distracting lot option --> option
great gift idea --> idea
kid love adult house --> house
much fun play --> play
kind cool shape --> shape
fidget toy puzzle fun age complaint color --> color
somewhat hard understand photo --> photo
sell product price --> price
really high product --> product
well charge lot --> lot
yo ’ favorite toy --> toy
non stop --> stop
one seam little rough fault toy --> toy
new one birthday --> birthday
love --> love
fun --> fun
good gift --> gift
someone --> someone
colourful magnet work --> work
bit pricey size item --> item
box --> box
one purchased read loved xmas present lot fun --> fun
older adult fun --> fun
amazing many different shape --> shape
color bright
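
To contrast the two kinds of parse on one sentence: a constituency tree groups words into nested phrases (such as NP and VP), while a dependency tree links each word to a head word through a labeled relation. The sketch below uses a hand-written analysis of an example sentence, not model output. The sentence, both structures, and the helper names are assumptions made for illustration.

```python
# Constituency: nested (label, children...) tuples; leaves are plain strings.
constituency = (
    "S",
    ("NP", ("NNS", "kids")),
    ("VP", ("VBP", "love"),
           ("NP", ("DT", "this"), ("NN", "puzzle"))),
)

# Dependency: one (word, relation, head) triple per word; the root heads itself.
dependency = [
    ("kids", "nsubj", "love"),
    ("love", "ROOT", "love"),
    ("this", "det", "puzzle"),
    ("puzzle", "dobj", "love"),
]


def to_brackets(node):
    """Render a constituency node in bracketed notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(to_brackets(c) for c in children) + ")"


def leaves(node):
    """Return the words of a constituency tree from left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


print(to_brackets(constituency))
for word, relation, head in dependency:
    print(f"{word} --{relation}--> {head}")
```

Both structures cover the same four words. The constituency tree says that "this puzzle" forms a noun phrase inside the verb phrase, while the dependency triples say that "puzzle" is the object of "love" and "this" modifies "puzzle".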

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? What is your opinion on the time provided to complete the assignment?

In [20]:
'''
Thoughts on the Assignment:
The assignment was a nice opportunity to learn about collecting and analyzing text data using Python. It covered several important tasks, such as getting reviews from websites,
cleaning up the text, and understanding what the text means. It's relevant to real life because jobs in areas like marketing or research often involve working with text data.

Challenges Faced:
Some parts were tricky. Working out the text-cleaning process and deciding what to do with all the data took some time and was a bit confusing at first.

Aspects Enjoyed:
It's cool to actually work with real data and see how we can make sense of it using computer programs. Figuring out how to solve problems and get the program to do what we want was really satisfying.

Opinion on Time to Complete the Assignment:
The time given to finish the assignment was adequate, though more time would have been appreciated.
'''
