# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid creating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software products from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [10]:
# Your code here

import requests
from bs4 import BeautifulSoup
import csv

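def scrape_reviews(url, num_pages, output_file):
    """Scrape user reviews from paginated review pages and save them to a CSV file."""
    reviews_data = []

    for page_num in range(1, num_pages + 1):
        # NOTE: this assumes the site honors a `page` query parameter. If it does not,
        # every request returns the same reviews and the CSV will contain duplicates.
        page_url = f"{url}&page={page_num}"

        # A timeout keeps the loop from hanging on an unresponsive server
        response = requests.get(page_url, timeout=30)
        if response.status_code != 200:
            print(f"Failed to retrieve page {page_num}. Exiting.")
            break

        soup = BeautifulSoup(response.text, 'html.parser')
        review_containers = soup.find_all('div', class_='lister-item-content')

        for container in review_containers:
            text_tag = container.find('div', class_='text')
            user_tag = container.find('span', class_='display-name-link')
            date_tag = container.find('span', class_='review-date')

            # find() returns None when a tag is missing; skip such containers
            # instead of crashing with an AttributeError
            if text_tag is None or user_tag is None or date_tag is None:
                continue

            review_text = text_tag.get_text()
            username = user_tag.get_text()
            review_date = date_tag.get_text()

            reviews_data.append([username, review_date, review_text])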

    if reviews_data:
        with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
            csv_writer = csv.writer(csvfile)
            csv_writer.writerow(['Username', 'Review Date', 'Review Text'])
            csv_writer.writerows(reviews_data)

        print(f"{len(reviews_data)} reviews have been successfully scraped and saved to '{output_file}'.")
    else:
        print("No reviews found on the pages.")

if __name__ == "__main__":
    movie_url = 'https://www.imdb.com/title/tt9603212/reviews/?ref_=tt_ql_2'
    num_pages_to_scrape = 50
    output_csv_file = 'yash_reviews.csv'

    scrape_reviews(movie_url, num_pages_to_scrape, output_csv_file)




1250 reviews have been successfully scraped and saved to 'yash_reviews.csv'.


In [11]:
import pandas as pd
pd.read_csv('yash_reviews.csv')

Unnamed: 0,Username,Review Date,Review Text
0,Paragon240,12 July 2023,Man.... I wish I loved this movie more than I ...
1,JackRJosie,15 July 2023,Ethan Hunt has left the mere secret agent stat...
2,ragingbull_2005,18 July 2023,After the first 30 minutes that promised an in...
3,imseeg,12 July 2023,4 considerations for those with high expectati...
4,BA_Harrison,11 July 2023,Mission Impossible is one of those rare franch...
...,...,...,...
1245,namob-43673,12 July 2023,All of the instalments in this franchise are v...
1246,HalBanksy,10 July 2023,I have just watched the greatest car-chase in ...
1247,denis888,22 December 2023,"I wanted to write a larhe review, making many ..."
1248,Truedutch,9 July 2023,Straight up Dead Reckoning Part 1 was fine. I ...
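
The scraper requests the same URL with a different `page` value each time, so if the site ignores that parameter the CSV can contain repeated rows. A minimal sketch of a duplicate check, assuming rows shaped like `[username, review_date, review_text]` as in the scraper above. `dedupe_rows` and the sample rows are hypothetical and not part of the original submission:

```python
def dedupe_rows(rows):
    """Return rows with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are unhashable, so use a tuple as the set key
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Illustrative rows only (not taken from the scraped data)
sample_rows = [
    ["user_a", "1 January 2024", "Great film."],
    ["user_b", "2 January 2024", "Too long."],
    ["user_a", "1 January 2024", "Great film."],
]
unique_rows = dedupe_rows(sample_rows)
print(f"{len(sample_rows)} rows scraped, {len(unique_rows)} unique")
```

On a DataFrame that has already been loaded, `df.drop_duplicates()` performs the same kind of check.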


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts.

(5) Stemming.

(6) Lemmatization.

In [12]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [13]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import re

# Download NLTK resources if not already installed
nltk.download('stopwords')
nltk.download('wordnet')

# Read the CSV file containing the raw data
df = pd.read_csv('yash_reviews.csv')

# Build these once; recreating them for every review is needlessly slow
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define a function for text cleaning
def clean_text(text):
    # Remove special characters and punctuation (keeps only letters and whitespace).
    # Note: characters are deleted, not replaced by a space, so "jaw-dropping"
    # becomes "jawdropping" and "don't" becomes "dont".
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove numbers (digits are already removed by the pattern above;
    # kept as an explicit step to mirror the assignment)
    text = re.sub(r'\d+', '', text)

    # Lowercase the text
    text = text.lower()

    # Tokenize the text
    tokens = nltk.word_tokenize(text)

    # Remove stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]

    # Apply stemming, then lemmatize the resulting stems
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]

    # Join the cleaned tokens to form the cleaned text
    cleaned_text = ' '.join(lemmatized_tokens)

    return cleaned_text

# Apply the clean_text function to the 'Review Text' column
df['Cleaned Text'] = df['Review Text'].apply(clean_text)

# Save the cleaned data to a new CSV file
df.to_csv('yash_reviews_cleaned.csv', index=False)

print("Text data has been cleaned and saved to 'yash_reviews_cleaned.csv'.")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Text data has been cleaned and saved to 'yash_reviews_cleaned.csv'.
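
Since the question asks for output for each part, the following is a small sketch that shows the intermediate result after steps (1) to (4) on a single made-up sentence. It uses only the standard library: `DEMO_STOPWORDS` is a tiny hand-picked set standing in for NLTK's English list, `clean_steps` is a hypothetical helper, and stemming and lemmatization are left out because they need NLTK. Unlike the cell above, this sketch replaces punctuation with a space so that hyphenated words are not fused together.

```python
import re

# Tiny illustrative stopword set; the notebook itself uses NLTK's English list
DEMO_STOPWORDS = {"i", "this", "the", "than", "did", "more", "a"}


def clean_steps(text):
    """Return the intermediate result of each cleaning step as a dict."""
    steps = {}
    # (1) Replace punctuation and special characters with a space
    steps["no_noise"] = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # (2) Remove numbers
    steps["no_numbers"] = re.sub(r"\d+", " ", steps["no_noise"])
    # (4) Lowercase before stopword matching, since the stopword set is lowercase
    steps["lowercased"] = steps["no_numbers"].lower()
    # (3) Remove stopwords; split() also collapses the leftover extra spaces
    tokens = [w for w in steps["lowercased"].split() if w not in DEMO_STOPWORDS]
    steps["no_stopwords"] = " ".join(tokens)
    return steps


result = clean_steps("I watched this jaw-dropping movie 2 times!")
for name, value in result.items():
    print(f"{name}: {value!r}")
```

Lowercasing is done before stopword removal here because stopword lists are normally lowercase, which is also the order used in `clean_text`.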


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tokens, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [14]:
# Your code here
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Reads the cleaned text data
df = pd.read_csv('yash_reviews_cleaned.csv')

# (1) POS tagging: count nouns, verbs, adjectives, and adverbs across all reviews
noun_count = 0
verb_count = 0
adj_count = 0
adv_count = 0

for text in df['Cleaned Text']:
    doc = nlp(text)
    for token in doc:
        if token.pos_ == "NOUN":
            noun_count += 1
        elif token.pos_ == "VERB":
            verb_count += 1
        elif token.pos_ == "ADJ":
            adj_count += 1
        elif token.pos_ == "ADV":
            adv_count += 1

print(f"Noun Count: {noun_count}")
print(f"Verb Count: {verb_count}")
print(f"Adjective Count: {adj_count}")
print(f"Adverb Count: {adv_count}")

# Constituency Parsing and Dependency Parsing (using one review as an example)
sample_text = df['Cleaned Text'].iloc[0]  # Take the first cleaned review as the example

# NOTE: en_core_web_sm ships a dependency parser only, with no constituency parser.
# The loop below prints each token's dependency label, not a phrase-structure tree.
sample_doc = nlp(sample_text)
print("\nConstituency Parsing Tree:")
for token in sample_doc:
    print(f"{token.text} [{token.dep_}]", end=" -> ")
print()

# Dependency Parsing Tree: print each token together with its syntactic head
print("\nDependency Parsing Tree:")
for token in sample_doc:
    print(f"{token.text} [{token.head.text}]", end=" -> ")
print()

# Named Entity Recognition
entities = {
    "PERSON": 0,
    "ORG": 0,
    "LOC": 0,
    "PRODUCT": 0,
    "DATE": 0
}

for text in df['Cleaned Text']:
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in entities:
            entities[ent.label_] += 1

print("\nNamed Entity Counts:")
for entity, count in entities.items():
    print(f"{entity}: {count}")







Noun Count: 69050
Verb Count: 29050
Adjective Count: 27900
Adverb Count: 7150

Constituency Parsing Tree:
man [nsubj] -> wish [ROOT] -> love [compound] -> movi [nsubj] -> do [aux] -> nt [neg] -> get [ccomp] -> wrong [amod] -> solid [amod] -> action [compound] -> movi [compound] -> jawdrop [compound] -> stunt [dobj] -> best [amod] -> seri [amod] -> mission [compound] -> imposs [compound] -> movi [nsubj] -> felt [conj] -> like [prep] -> small [amod] -> step [pobj] -> backward [advmod] -> franchis [det] -> fallout [npadvmod] -> mindblow [amod] -> action [compound] -> sequenc [compound] -> stunt [compound] -> work [dobj] -> along [prep] -> develop [xcomp] -> ethan [compound] -> relationship [dobj] -> ilsa [compound] -> provid [compound] -> closur [compound] -> julia [compound] -> show [compound] -> length [compound] -> ethan [nsubj] -> would [aux] -> go [ccomp] -> protect [advcl] -> closest [amod] -> battl [compound] -> impos [compound] -> villain [nsubj] -> dead [nsubj] -> reckon [ccomp] 
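
The if/elif tally used for the POS counts can also be written with `collections.Counter`. The sketch below is only an illustration: `count_pos` is a hypothetical helper, and the hand-tagged `(word, tag)` pairs stand in for real spaCy output so the example runs without a model.

```python
from collections import Counter

# The four coarse tags the question asks about
TARGET_TAGS = ("NOUN", "VERB", "ADJ", "ADV")


def count_pos(tagged_tokens):
    """Count how many (word, tag) pairs carry each of the target tags."""
    counts = Counter(tag for _, tag in tagged_tokens if tag in TARGET_TAGS)
    # Report every target tag, including those that never occurred
    return {tag: counts[tag] for tag in TARGET_TAGS}


# Hand-tagged pairs for illustration; not produced by a tagger
tagged = [
    ("man", "NOUN"),
    ("wish", "VERB"),
    ("solid", "ADJ"),
    ("action", "NOUN"),
    ("backward", "ADV"),
    ("the", "DET"),
]
pos_counts = count_pos(tagged)
print(pos_counts)
```

With spaCy, the pairs could be built as `(token.text, token.pos_)` for each token in a parsed `doc`.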

**Explanation about constituency parsing tree and dependency parsing tree with example**

Constituency parsing represents the sentence "Man, I wish I loved this movie more than I did" as a hierarchical tree of nested phrases, such as noun phrases (NP) and verb phrases (VP); for example, "this movie" is an NP that sits inside the VP headed by "loved". Dependency parsing instead links each word directly to its head with a labeled grammatical relation, such as the object relation between "movie" and "loved" in the original sentence, or the subject relation (nsubj) that the parser assigns to "man" in the cleaned text. The output above lists these dependency labels for the cleaned tokens; spaCy's `en_core_web_sm` model does not produce constituency trees.
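
To make the difference concrete, here is a small sketch that stores both analyses of the clause "I loved this movie" as plain Python data. Both structures are written by hand for illustration and are not parser output; the labels follow common Penn Treebank and spaCy naming, and `leaves` and `phrases` are hypothetical helpers.

```python
# Constituency view: nested (label, children...) tuples; leaves are (POS, word)
constituency = (
    "S",
    ("NP", ("PRP", "I")),
    ("VP",
     ("VBD", "loved"),
     ("NP", ("DT", "this"), ("NN", "movie"))),
)

# Dependency view: one (dependent, relation, head) triple per word.
# The root points at itself, which is how spaCy represents it.
dependencies = [
    ("I", "nsubj", "loved"),
    ("loved", "ROOT", "loved"),
    ("this", "det", "movie"),
    ("movie", "dobj", "loved"),
]


def leaves(tree):
    """Collect the words of a constituency tree from left to right."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]  # pre-terminal node such as ("NN", "movie")
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def phrases(tree):
    """List (label, text) for every phrase node, skipping single-word POS nodes."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return []
    found = [(label, " ".join(leaves(tree)))]
    for child in children:
        found.extend(phrases(child))
    return found


for label, text in phrases(constituency):
    print(f"{label}: {text}")
for dependent, relation, head in dependencies:
    print(f"{dependent} --{relation}--> {head}")
```

The constituency structure groups words into phrases that contain other phrases, while the dependency list has exactly one entry per word and no phrase nodes at all.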

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [15]:
# Write your response below

print("In this assignment, the code for parsing is a bit challenging, but aside from that, I have learned many new methods to scrape and clean the data. Overall, I am acquiring a wealth of knowledge in coding through these assignments")

In this assignment, the code for parsing is a bit challenging, but aside from that, I have learned many new methods to scrape and clean the data. Overall, I am acquiring a wealth of knowledge in coding through these assignments
