<a href="https://colab.research.google.com/github/chesterhuynguyen/huynguyen_INFO5731_Fall2023/blob/main/Huy_Nguyen_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will try to gather text data from open data source via web scraping or API. After that you need to clean the text data and syntactic analysis of the data.

# **Question 1**

(40 points). Write a python program to collect text data from **either of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon.

(2) Collect the top 10000 User Reviews of a film recently in 2023 or 2022 (you can choose any film) from IMDB.

(3) Collect all the reviews of the top 1000 most popular software from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/)

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 reddits by using a hashtag (you can use any hashtag) from Reddits.


In [1]:
import requests
from bs4 import BeautifulSoup
import csv

# The URL of "Black Pather: Wakanda Forever" movie's IMDB page with user reviews.
url = "https://www.imdb.com/title/tt9114286/reviews"

# Create a function to scrape user reviews from IMDB.
def scrape_imdb_reviews(url, num_reviews=1000):
    reviews = []
    page_num = 1

    while len(reviews) < num_reviews:
        # Fetch the IMDB page.
        page_url = f"{url}?start={page_num}"
        response = requests.get(page_url)

        if response.status_code != 200:
            print(f"Failed to retrieve page {page_url}")
            break

        soup = BeautifulSoup(response.text, "html.parser")

        # Extract user reviews from the page.
        review_elements = soup.find_all("div", class_="text show-more__control")
        for review_element in review_elements:
            review_text = review_element.get_text(strip=True)
            reviews.append(review_text)

        # Move to the next page.
        page_num += 1

    return reviews

# Scrape user reviews (up to 1000).
user_reviews = scrape_imdb_reviews(url, num_reviews=1000)

# Save the user reviews to a CSV file.
csv_file = "Reviews.csv"
with open(csv_file, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Review"])
    for review in user_reviews:
        writer.writerow([review])

print(f"{len(user_reviews)} reviews have been saved to {csv_file}.")





1000 reviews have been saved to Reviews.csv.


# **Question 2**

(30 points). Write a python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [None]:
pip install pandas nltk textblob



In [2]:
import nltk
import pandas as pd
from nltk.corpus import stopwords
from textblob import Word
nltk.download("wordnet")
nltk.download("stopwords")

# Load the CSV file containing the user reviews.
csv_file = "Reviews.csv"
df = pd.read_csv(csv_file)

# Define a function for text cleaning.
def clean_text(text):
    # Remove special characters and punctuation.
    text = text.replace("\n", " ")
    text = " ".join(word for word in text.split() if word.isalnum())

    # Remove numbers.
    text = " ".join(word for word in text.split() if not word.isnumeric())

    # Remove stopwords.
    stop_words = set(stopwords.words("english"))
    text = " ".join(word for word in text.split() if word not in stop_words)

    # Lowercase all text.
    text = text.lower()

    # Stemming and Lemmatization.
    text = " ".join(Word(word).lemmatize() for word in text.split())

    return text

# Apply the cleaning function to the "Review" column.
df["Cleaned_Review"] = df["Review"].apply(clean_text)

# Save the cleaned data in the same CSV file.
df.to_csv(csv_file, index=False, encoding="utf-8")

print("Data cleaning and saved in the same CSV file.")




[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Data cleaning and saved in the same CSV file.


# **Question 3**

(30 points). Write a python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [13]:
pip install spacy




In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Load the CSV file
csv_file = "Reviews.csv"
df = pd.read_csv(csv_file)

# Download the NLTK data for POS tagging.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

# Define a function for POS tagging and counting.
def pos_tag_and_count(text):
    # Tokenize the text into words.
    words = word_tokenize(text)

    # Perform POS tagging.
    pos_tags = nltk.pos_tag(words)

    # Initialize counts for Nouns, Verbs, Adjectives, and Adverbs.
    noun_count = 0
    verb_count = 0
    adj_count = 0
    adv_count = 0

    # Define POS tag prefixes for Nouns, Verbs, Adjectives, and Adverbs.
    noun_prefixes = ["NN", "NNS", "NNP", "NNPS"]
    verb_prefixes = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]
    adj_prefixes = ["JJ", "JJR", "JJS"]
    adv_prefixes = ["RB", "RBR", "RBS"]

    # Count the POS tags.
    for word, pos in pos_tags:
        if pos in noun_prefixes:
            noun_count += 1
        elif pos in verb_prefixes:
            verb_count += 1
        elif pos in adj_prefixes:
            adj_count += 1
        elif pos in adv_prefixes:
            adv_count += 1

    return {"Noun": noun_count, "Verb": verb_count, "Adjective": adj_count, "Adverb": adv_count}

# Apply POS tagging and counting to the "Cleaned_Review" column.
df["POS_Counts"] = df["Cleaned_Review"].apply(pos_tag_and_count)

# Calculate the total counts for each POS.
total_counts = df["POS_Counts"].apply(pd.Series).sum()

print("Total POS Counts:")
print(total_counts)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Total POS Counts:
Noun         59840
Verb         30600
Adjective    33520
Adverb       16760
dtype: int64


In [9]:
import pandas as pd
import spacy

# Load the CSV file containing the clean text data.
csv_file = "Reviews.csv"
df = pd.read_csv(csv_file)

# Load the spaCy English model.
nlp = spacy.load("en_core_web_sm")

# Define a function to extract and count named entities.
def extract_and_count_entities(text):
    doc = nlp(text)
    entity_counts = {"Person": 0, "Organization": 0, "Location": 0, "Product": 0, "Date": 0}

    for ent in doc.ents:
        entity_type = ent.label_
        if entity_type in entity_counts:
            entity_counts[entity_type] += 1

    return entity_counts

# Apply entity extraction and counting to the "Cleaned_Review" column.
df["Entity_Counts"] = df["Cleaned_Review"].apply(extract_and_count_entities)

# Calculate the total counts for each entity type.
total_counts = df["Entity_Counts"].apply(pd.Series).sum()

# Print the entity counts.
print("Total Entity Counts:")
print(total_counts)


Total Entity Counts:
PERSON     920
ORG        320
LOC         40
PRODUCT     40
DATE       240
dtype: int64


**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**