# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data to a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDb. [If one movie doesn't have enough reviews, collect reviews from at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [25]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_reviews(source_url, num_reviews, output_csv):
    all_reviews = []

    # Amazon reviews have a fixed number of reviews per page (usually 10), so we can calculate the number of pages needed
    pages_needed = (num_reviews // 10) + (1 if num_reviews % 10 != 0 else 0)

    for page_number in range(1, pages_needed + 1):
        # URL to scrape the data from
        url = f"{source_url}&pageNumber={page_number}"

        # Using the requests module to get the web page content
        page = requests.get(url)

        # Parsing the HTML content using BeautifulSoup
        soup = BeautifulSoup(page.content, 'html.parser')

        # Finding all the review elements
        review_elements = soup.find_all('div', class_='a-section review aok-relative')

        for review in review_elements:
            review_text = review.find('span', class_='review-text')
            if review_text:
                review_text = review_text.get_text().strip()
                all_reviews.append(review_text)

                # Stop reading this page once the desired number of reviews is collected
                if len(all_reviews) >= num_reviews:
                    break

        # The inner break only leaves the per-page loop, so also stop requesting further pages
        if len(all_reviews) >= num_reviews:
            break

    # Create a DataFrame
    df = pd.DataFrame({'Reviews': all_reviews})

    # Save the DataFrame to a CSV file
    df.to_csv(output_csv, index=False)

    return df

if __name__ == "__main__":
    # URL of the product on Amazon
    amazon_product_url = "https://www.amazon.com/Fossil-Womens-Stella-Stainless-Chronograph/dp/B00KGTUKFU/ref=sr_1_7?crid=WEP1EJ0NCYIB&dib=eyJ2IjoiMSJ9.5NV6CyBgRX7xVrm3oZfd1dIxSlkLWx2llT2CaPsXDMQ9fMiUzTARP0O1FPV38FnJx5wczAWZ0828FwWaq-_RMpgr4Rp9jVv4CqqkngfwTjebACKL186f9LVJ-FtLWFIydTp_vQc3hcqUWCusU_s4EV7wGQzlFOMDPKLaBr-JFd24TXlccLsLJayvokqArbUvEUfmGmP8WoOoGj-L7Zzrapojx5VsUBF0UysSKr8uEPtHXZMvT-e5Cfhwer4y7ZtkMXNeR4apneWRrbTH1eCc08i5nSBk-6ydNGZjFsIz5wg.IMpYYpboVFoWNMeR3o4aHfZl80ZwXd2JQoBy7a4h5B8&dib_tag=se&keywords=fossil%2Bwatch%2Bwomen&qid=1709164745&sprefix=fossil%2Caps%2C127&sr=8-7&th=1"
    # Number of reviews to scrape
    num_of_reviews = 1000  # Adjust the number of reviews as needed
    # Output CSV file name
    output_csv_file = "reviews.csv"

    reviews_df = scrape_reviews(amazon_product_url, num_of_reviews, output_csv_file)

    # Print the DataFrame
    print(reviews_df)
    print(f"Reviews saved to {output_csv_file}")


                                               Reviews
0          The quality of the item is great\nRead more
1                     Just what I expected.\nRead more
2    The media could not be loaded.\n              ...
3    I've had this watch for 3 days now. I got it f...
4    The media could not be loaded.\n              ...
..                                                 ...
795  I've had this watch for 3 days now. I got it f...
796  The media could not be loaded.\n              ...
797                       Amazing I love ❤️\nRead more
798  Honestly exceeded my expectations. What a beau...
799  The watch is gaudy. I paid $94.00 for this pie...

[800 rows x 1 columns]
Reviews saved to reviews.csv
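
The printed frame has 800 rows rather than the requested 1000, several rows appear to repeat (rows 3 and 795 begin identically), and many entries end with the "Read more" link text. A minimal post-processing sketch follows, assuming the scraped reviews are available as a plain list of strings. The helper name `tidy_reviews` is hypothetical and not part of the notebook.

```python
import re

def tidy_reviews(reviews):
    """Strip the trailing 'Read more' link text and drop exact duplicates."""
    seen = set()
    tidy = []
    for text in reviews:
        # Remove a trailing "Read more" left over from the expand link
        text = re.sub(r'\s*Read more\s*$', '', text)
        # Collapse runs of whitespace (including newlines) into single spaces
        text = re.sub(r'\s+', ' ', text).strip()
        # Keep only the first occurrence of each non-empty review
        if text and text not in seen:
            seen.add(text)
            tidy.append(text)
    return tidy

sample = [
    "The quality of the item is great\nRead more",
    "Just what I expected.\nRead more",
    "The quality of the item is great\nRead more",
]
print(tidy_reviews(sample))
```

With the DataFrame from the cell above, `df['Reviews'].tolist()` could be passed to the helper before saving the CSV. Deduplicating will shrink the row count, so more pages may be needed to reach the required number of distinct reviews.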


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [24]:
import nltk
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the CSV file into a DataFrame
df = pd.read_csv('reviews.csv')

# Function to remove punctuation and special characters from text
def clean_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

# Function to remove numbers from text
def clean_numbers(text):
    return re.sub(r'\d+', '', text)

# Function to remove stopwords from text
def remove_stop_words(text):
    stop_words = set(stopwords.words('english'))
    return ' '.join(word for word in text.split() if word.lower() not in stop_words)

# Function to convert text to lowercase
def convert_lowercase(text):
    return text.lower()

# Function to apply stemming to text
def apply_stemming(text):
    stemmer = PorterStemmer()
    return ' '.join(stemmer.stem(word) for word in text.split())

# Function to apply lemmatization to text
def apply_lemmatization(text):
    lemmatizer = WordNetLemmatizer()
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

if __name__ == "__main__":
    # Apply text preprocessing steps to the Amazon reviews
    print("Original DataFrame:")
    print(df.head())

    df['cleaned_reviews'] = df['Reviews'].copy()
    df['cleaned_reviews'] = df['cleaned_reviews'].apply(clean_punctuation)
    print("\nAfter removing punctuation and special characters:")
    print(df.head())

    df['cleaned_reviews'] = df['cleaned_reviews'].apply(clean_numbers)
    print("\nAfter removing numbers:")
    print(df.head())

    df['cleaned_reviews'] = df['cleaned_reviews'].apply(remove_stop_words)
    print("\nAfter removing stopwords:")
    print(df.head())

    df['cleaned_reviews'] = df['cleaned_reviews'].apply(convert_lowercase)
    print("\nAfter converting text to lowercase:")
    print(df.head())

    df['cleaned_reviews'] = df['cleaned_reviews'].apply(apply_stemming)
    print("\nAfter applying stemming:")
    print(df.head())

    df['cleaned_reviews'] = df['cleaned_reviews'].apply(apply_lemmatization)
    print("\nAfter applying lemmatization:")
    print(df.head())

    # Save the modified DataFrame to CSV file
    df.to_csv('cleaned_reviews.csv', index=False)

    # Print message indicating successful completion
    print("Text data cleaned and saved in 'cleaned_reviews.csv'")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Original DataFrame:
                                             Reviews
0        The quality of the item is great\nRead more
1                   Just what I expected.\nRead more
2  The media could not be loaded.\n              ...
3  I've had this watch for 3 days now. I got it f...
4  The media could not be loaded.\n              ...

After removing punctuation and special characters:
                                             Reviews  \
0        The quality of the item is great\nRead more   
1                   Just what I expected.\nRead more   
2  The media could not be loaded.\n              ...   
3  I've had this watch for 3 days now. I got it f...   
4  The media could not be loaded.\n              ...   

                                     cleaned_reviews  
0        The quality of the item is great\nRead more  
1                    Just what I expected\nRead more  
2  The media could not be loaded\n               ...  
3  Ive had this watch for 3 days now I got it for... 
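
Row 3 above shows "I've" turning into "Ive" once punctuation is stripped. A stopword list that stores contractions with their apostrophe (or not at all) can no longer match such a token, so the order of the cleaning steps matters. The sketch below is one possible ordering: lowercase first, remove stopwords while apostrophes are still present, and only then strip punctuation and digits. The stopword set is a tiny illustrative assumption, not NLTK's list, and `clean_text` is a hypothetical helper name.

```python
import re

# Tiny illustrative stopword set; an assumption for this sketch, not NLTK's list
STOP_WORDS = {"i've", "had", "this", "for", "now", "i", "it", "the", "is", "of"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, drop stopwords, then strip punctuation and digits."""
    # Lowercase first so stopword matching is case-insensitive
    words = text.lower().split()
    # Compare each word without surrounding punctuation, keeping inner apostrophes
    kept = [w for w in words if w.strip('.,!?;:"()') not in stop_words]
    text = ' '.join(kept)
    # Remove remaining punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove digits, then tidy the spacing
    text = re.sub(r'\d+', '', text)
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text("I've had this watch for 3 days now."))
```

In the notebook, the same idea would amount to applying `convert_lowercase` and `remove_stop_words` before `clean_punctuation`, with the NLTK stopword list in place of the toy set.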

# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.
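
To make the distinction in item (2) concrete, the sketch below writes both structures out by hand for the sentence "The quality is great". A constituency tree groups words into nested phrases (NP, VP, and so on), while a dependency tree links each word directly to its head word with a labeled relation. Both annotations here are hand-written for illustration and are not the output of any parser. The helper names `bracketed` and `leaves` are hypothetical.

```python
# Hand-annotated constituency tree as nested tuples: (label, child, child, ...)
# A leaf is (part-of-speech tag, word). Illustrative only, not parser output.
CONSTITUENCY = (
    "S",
    ("NP", ("DT", "The"), ("NN", "quality")),
    ("VP", ("VBZ", "is"), ("ADJP", ("JJ", "great"))),
)

# Hand-annotated dependency arcs: (dependent, relation, head)
DEPENDENCIES = [
    ("The", "det", "quality"),
    ("quality", "nsubj", "is"),
    ("is", "ROOT", "is"),
    ("great", "acomp", "is"),
]

def bracketed(node):
    """Render a nested-tuple tree in bracketed (Penn Treebank style) notation."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return f"({label} {children[0]})"
    return f"({label} " + " ".join(bracketed(child) for child in children) + ")"

def leaves(node):
    """Return the words at the leaves of the tree, left to right."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    return [word for child in children for word in leaves(child)]

print(bracketed(CONSTITUENCY))
for dependent, relation, head in DEPENDENCIES:
    print(f"{dependent} --{relation}--> {head}")
```

The constituency tree has internal phrase nodes that correspond to no single word, whereas the dependency structure has exactly one arc per word. spaCy's core pipelines produce the second kind of structure, so a true constituency tree would need a separate constituency parser.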

In [42]:
import pandas as pd
import spacy
from spacy import displacy
from collections import Counter

# Load spaCy English model
nlp_custom = spacy.load('en_core_web_sm')

# Load the cleaned text data
df_custom_text = pd.read_csv('cleaned_reviews.csv')

# Function to perform POS tagging and count POS
def pos_tagging_and_count_custom_text(text):
    doc_custom = nlp_custom(text)
    pos_tags_custom = [token.pos_ for token in doc_custom]
    pos_count_custom = Counter(pos_tags_custom)
    return pos_count_custom

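# Function intended for constituency parsing. Note: displacy's 'dep' style draws
# dependency arcs, and en_core_web_sm has no constituency parser, so this renders
# a wider-spaced dependency tree rather than a true constituency tree.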
def constituency_parsing_custom_text(text):
    doc_custom = nlp_custom(text)
    for sent_custom in doc_custom.sents:
        displacy.render(sent_custom, style='dep', jupyter=True, options={'distance': 90})

# Function to perform dependency parsing and display the tree
def dependency_parsing_custom_text(text):
    doc_custom = nlp_custom(text)
    for sent_custom in doc_custom.sents:
        displacy.render(sent_custom, style='dep', jupyter=True)

# Function to perform Named Entity Recognition (NER) and count entities
def named_entity_recognition_custom_text(text):
    doc_custom = nlp_custom(text)
    entities_custom = [ent_custom.text for ent_custom in doc_custom.ents]
    entity_count_custom = Counter(entities_custom)
    return entity_count_custom

if __name__ == "__main__":
    # Choose one sentence for example analysis
    example_sentence_custom_text = df_custom_text['cleaned_reviews'].iloc[0]

    # (1) POS Tagging
    pos_count_custom_text = pos_tagging_and_count_custom_text(example_sentence_custom_text)
    print("\n(1) Parts of Speech (POS) Tagging:")
    print(pos_count_custom_text)

    # (2) Constituency Parsing
    print("\n(2) Constituency Parsing:")
    constituency_parsing_custom_text(example_sentence_custom_text)

    # (2) Dependency Parsing
    print("\n(3) Dependency Parsing:")
    dependency_parsing_custom_text(example_sentence_custom_text)

    # (3) Named Entity Recognition (NER)
    entity_count_custom_text = named_entity_recognition_custom_text(example_sentence_custom_text)
    print("\n(4) Named Entity Recognition (NER):")
    print(entity_count_custom_text)


(1) Parts of Speech (POS) Tagging:
Counter({'ADJ': 2, 'NOUN': 2})

(2) Constituency Parsing:



(3) Dependency Parsing:



(4) Named Entity Recognition (NER):
Counter({'qualiti': 1})
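
The run above analyzes only the first review, while the question asks for total noun, verb, adjective, and adverb counts over the text. One way to aggregate is to sum per-review tag counts, sketched below. The function `total_pos_counts` and the lexicon-based `toy_tagger` are hypothetical stand-ins so that the example runs without a language model.

```python
from collections import Counter

TARGET_TAGS = ("NOUN", "VERB", "ADJ", "ADV")

def total_pos_counts(texts, tagger, target_tags=TARGET_TAGS):
    """Sum the counts of the target POS tags over every text."""
    totals = Counter({tag: 0 for tag in target_tags})
    for text in texts:
        # tagger(text) is assumed to return one coarse POS tag per token
        totals.update(tag for tag in tagger(text) if tag in target_tags)
    return totals

# Stand-in tagger backed by a tiny hand-made lexicon (hypothetical, for the demo)
TOY_LEXICON = {"quality": "NOUN", "item": "NOUN", "great": "ADJ",
               "just": "ADV", "expected": "VERB"}

def toy_tagger(text):
    return [TOY_LEXICON.get(word, "X") for word in text.lower().split()]

totals = total_pos_counts(["quality item great", "just expected"], toy_tagger)
print(dict(totals))
```

In the notebook, the tagger could instead be `lambda text: [token.pos_ for token in nlp_custom(text)]`, applied to `df_custom_text['cleaned_reviews'].dropna()`. The stem "qualiti" being reported as an entity above suggests that tagging and entity recognition may behave better on text that has not been stemmed.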


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [None]:
# This assignment was very challenging for me compared to the previous assignment and exercises.
# A few of the topics, such as POS tagging and parsing, were new to me.
# Understanding the structure of the Amazon website and locating the HTML sections that contain the reviews was the most difficult part. Amazon's HTML structure can be intricate, so identifying the elements to scrape requires careful inspection.
# The assignment provided a valuable opportunity to practice web scraping techniques for collecting and organizing data. The course offered practical insight into dealing with dynamic websites and handling obstacles encountered while scraping data.
