<a href="https://colab.research.google.com/github/unt-iialab/INFO5731_Spring2020/blob/master/Assignments/INFO5731_Assignment_Two.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a Python program to collect text data from **any one of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a film released in 2023 or 2022 (you can choose any film) from IMDB.

(3) Collect all the reviews of the top 1000 most popular software products from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/).

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [21]:
import requests
from bs4 import BeautifulSoup
import csv

page = "https://www.amazon.com/product-reviews/B096M85BSH/ref=acr_dp_hist_5?ie=UTF8&filterByStar=five_star&reviewerType=all_reviews#reviews-filter-bar"

# Define user-agent to mimic a web browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

def collect_amazon_reviews(page):
    try:
        response = requests.get(page, headers=headers)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        reviews = []

        # Extract review text
        review_elements = soup.find_all('span', {'data-action': 'reviews:content:read-more'})
        for review_element in review_elements:
            review_text = review_element.text.strip()
            reviews.append({"Review": review_text})

        return reviews
    except requests.exceptions.RequestException as e:
        print("Error collecting data from the website:", e)
        return []

# Collect data from the Amazon product review page
data = collect_amazon_reviews(page)

file = "amazon_product_reviews.csv"

with open(file, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['Review']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    writer.writerows(data)

print(f"Amazon product reviews data has been saved to '{file}'.")


Amazon product reviews data has been saved to 'amazon_product_reviews.csv'.
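
The cell above requests a single review page, so it can only return the reviews shown on that page. A minimal sketch of how the collection could be extended across pages is given below. It assumes a hypothetical `fetch_page(page_number)` callable that returns the reviews for one page (for example, a wrapper around `collect_amazon_reviews` that adds the site's page parameter to the URL, which is not shown here). The stand-in `fake_fetch_page` only simulates three pages so that the loop can be run without network access.

```python
from typing import Callable, Dict, List

Review = Dict[str, str]


def collect_all_pages(fetch_page: Callable[[int], List[Review]],
                      max_pages: int = 500) -> List[Review]:
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty."""
    all_reviews: List[Review] = []
    for page_number in range(1, max_pages + 1):
        page_reviews = fetch_page(page_number)
        if not page_reviews:
            # An empty page means there are no more reviews (or the request failed)
            break
        all_reviews.extend(page_reviews)
    return all_reviews


def fake_fetch_page(page_number: int) -> List[Review]:
    """Hypothetical stand-in for a real per-page scraper: three pages, then nothing."""
    fake_pages = {
        1: [{"Review": "Works great"}, {"Review": "Fast shipping"}],
        2: [{"Review": "Stopped working"}, {"Review": "Good value"}],
        3: [{"Review": "Would buy again"}],
    }
    return fake_pages.get(page_number, [])


sample_reviews = collect_all_pages(fake_fetch_page)
print(len(sample_reviews), "reviews collected")
```

The `max_pages` cap is a design choice to keep the loop from running indefinitely if a site keeps returning content; the resulting list has the same shape as `data` above, so it can be written with the same `csv.DictWriter` code.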


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [24]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import string

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Read the CSV file with the original data
file = "amazon_product_reviews.csv"
df = pd.read_csv(file)

# Function to clean and preprocess the text
def clean_text(text):
    # Treat missing values (NaN from empty CSV cells) as empty text
    if not isinstance(text, str):
        return ''
    # Lowercase the text
    text = text.lower()
    # Remove special characters, punctuation, and numbers
    text = ''.join(char for char in text if char not in string.punctuation and not char.isdigit())
    # Tokenize on whitespace and remove stopwords
    words = [word for word in text.split() if word not in stop_words]
    # Stem each remaining word, then lemmatize the stem
    words = [lemmatizer.lemmatize(stemmer.stem(word)) for word in words]
    return ' '.join(words)

# Apply the cleaning function to the 'Review' column
df['Cleaned Review'] = df['Review'].apply(clean_text)

# Save the cleaned data to a new CSV file
cleaned_csv_file = "amazon_product_reviews_cleaned.csv"
df.to_csv(cleaned_csv_file, index=False)

print(f"Cleaned data has been saved to '{cleaned_csv_file}'.")


Cleaned data has been saved to 'amazon_product_reviews_cleaned.csv'.


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bhanuprasadkommula/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/bhanuprasadkommula/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
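
To see what each cleaning step contributes, the sketch below traces one made-up review through steps (1) to (4) using only the standard library. The sample sentence and the tiny `SAMPLE_STOPWORDS` set are assumptions for illustration; the notebook itself uses NLTK's full English stopword list, and stemming and lemmatization are left out here because they need NLTK.

```python
import string

# Tiny illustrative stopword subset (an assumption, not the full list used above)
SAMPLE_STOPWORDS = {"i", "the", "this", "it", "and", "for", "is", "a"}


def trace_cleaning(text):
    """Return the intermediate result after each cleaning step."""
    steps = {}
    steps["lowercased"] = text.lower()
    # Delete every character listed in string.punctuation
    steps["no_punctuation"] = steps["lowercased"].translate(
        str.maketrans("", "", string.punctuation)
    )
    # Delete digits
    steps["no_numbers"] = "".join(
        ch for ch in steps["no_punctuation"] if not ch.isdigit()
    )
    # Split on whitespace and drop stopwords
    tokens = steps["no_numbers"].split()
    steps["no_stopwords"] = " ".join(t for t in tokens if t not in SAMPLE_STOPWORDS)
    return steps


sample = "I bought this charger for $25, and it is GREAT!!!"
for name, value in trace_cleaning(sample).items():
    print(f"{name:>15}: {value}")
```

Lowercasing first means the stopword check does not need a separate case conversion for each token.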


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) Constituency Parsing and Dependency Parsing: Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [16]:
pip install nltk spacy



In [17]:
import nltk
import spacy

nltk.download('punkt')
spacy.cli.download("en_core_web_sm")


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/bhanuprasadkommula/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Collecting en-core-web-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m36.5 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [25]:
import pandas as pd
import nltk
import spacy

# Load NLTK and spaCy resources
nltk.download('averaged_perceptron_tagger')
nlp = spacy.load("en_core_web_sm")

# Read the CSV file with the cleaned data
cleaned_csv_file = "amazon_product_reviews_cleaned.csv"
df = pd.read_csv(cleaned_csv_file)

# Function for syntax and structure analysis
def syntax_structure_analysis(text):
    # Tokenize the text
    tokens = nltk.word_tokenize(text)

    # Perform POS tagging
    pos_tags = nltk.pos_tag(tokens)
    noun_count = len([word for word, tag in pos_tags if tag.startswith('N')])
    verb_count = len([word for word, tag in pos_tags if tag.startswith('V')])
    adj_count = len([word for word, tag in pos_tags if tag.startswith('J')])
    # Adverb tags are RB, RBR, and RBS; a bare 'R' prefix would also count particles (RP)
    adv_count = len([word for word, tag in pos_tags if tag.startswith('RB')])

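    # Perform constituency parsing and dependency parsing
    # Dependency parse with spaCy: one (token, relation, head) triple per token
    doc = nlp(text)
    dependency_parsing_tree = [(token.text, token.dep_, token.head.text) for token in doc]
    # en_core_web_sm has no constituency parser, so approximate phrase structure with a
    # shallow NLTK noun-phrase chunker over the POS tags (not a full constituency parse)
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
    constituency_parsing_tree = str(chunker.parse(pos_tags))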

    # Perform named entity recognition (NER)
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    return {
        "Noun Count": noun_count,
        "Verb Count": verb_count,
        "Adjective Count": adj_count,
        "Adverb Count": adv_count,
        "Constituency Parsing Tree": constituency_parsing_tree,
        "Dependency Parsing Tree": dependency_parsing_tree,
        "Named Entities": entities
    }

# Apply syntax and structure analysis to the 'Cleaned Review' column
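# Empty cleaned reviews are read back from the CSV as NaN, so replace them with '' first
df['Syntax and Structure Analysis'] = df['Cleaned Review'].fillna('').apply(syntax_structure_analysis)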

# Save the analyzed data to a new CSV file
analyzed_csv_file = "amazon_product_reviews_analyzed.csv"
df.to_csv(analyzed_csv_file, index=False)

print(f"Syntax and structure analysis data has been saved to '{analyzed_csv_file}'.")


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/bhanuprasadkommula/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Syntax and structure analysis data has been saved to 'amazon_product_reviews_analyzed.csv'.
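
Question 3 also asks for totals: the overall number of nouns, verbs, adjectives, and adverbs, and the count of each entity. A minimal sketch of that aggregation follows. The `sample_results` list is made-up data shaped like the dictionaries returned by `syntax_structure_analysis`; with real data, the same function could be applied to the list of per-review dictionaries before they are saved to CSV.

```python
from collections import Counter

# Hypothetical per-review results, shaped like the output of syntax_structure_analysis
sample_results = [
    {"Noun Count": 3, "Verb Count": 1, "Adjective Count": 2, "Adverb Count": 0,
     "Named Entities": [("Amazon", "ORG"), ("Monday", "DATE")]},
    {"Noun Count": 2, "Verb Count": 2, "Adjective Count": 0, "Adverb Count": 1,
     "Named Entities": [("Amazon", "ORG"), ("Texas", "GPE")]},
]

POS_KEYS = ("Noun Count", "Verb Count", "Adjective Count", "Adverb Count")


def summarize(results):
    """Sum POS counts over all reviews and count entities and entity labels."""
    pos_totals = Counter()
    entity_counts = Counter()
    label_counts = Counter()
    for result in results:
        for key in POS_KEYS:
            pos_totals[key] += result[key]
        for ent_text, ent_label in result["Named Entities"]:
            entity_counts[(ent_text, ent_label)] += 1
            label_counts[ent_label] += 1
    return pos_totals, entity_counts, label_counts


pos_totals, entity_counts, label_counts = summarize(sample_results)
print(dict(pos_totals))
print(dict(entity_counts))
print(dict(label_counts))
```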


**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

Constituency parsing captures the hierarchical phrase structure of a sentence. It recursively divides the sentence into phrases and subphrases (for example, a noun phrase and a verb phrase), so each internal node of the tree represents a phrase and each leaf represents a word.

Dependency parsing highlights the grammatical relationships between individual words. It shows how each word is connected to its head word through a labeled dependency such as subject, object, or modifier.
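
To make the difference concrete, the sketch below encodes both structures for one short sentence, "The battery lasts long". Both annotations are written by hand for illustration (an assumption, not parser output): the constituency tree as nested tuples with Penn Treebank style labels, and the dependency parse as (dependent, relation, head) triples.

```python
# Hand-annotated example (not parser output) for: "The battery lasts long"
constituency = (
    "S",
    ("NP", ("DT", "The"), ("NN", "battery")),
    ("VP", ("VBZ", "lasts"), ("ADVP", ("RB", "long"))),
)

# (dependent, relation, head); the root verb has no head
dependencies = [
    ("The", "det", "battery"),
    ("battery", "nsubj", "lasts"),
    ("lasts", "ROOT", None),
    ("long", "advmod", "lasts"),
]


def is_leaf(children):
    """A pre-terminal node holds exactly one child, and that child is a word."""
    return len(children) == 1 and isinstance(children[0], str)


def to_brackets(node):
    """Render a nested-tuple tree in bracketed notation."""
    label, *children = node
    if is_leaf(children):
        return f"({label} {children[0]})"
    return f"({label} " + " ".join(to_brackets(child) for child in children) + ")"


def leaves(node):
    """Collect the words of the tree from left to right."""
    label, *children = node
    if is_leaf(children):
        return [children[0]]
    return [word for child in children for word in leaves(child)]


print(to_brackets(constituency))
for dependent, relation, head in dependencies:
    print(f"{dependent} --{relation}--> {head or '(none)'}")
```

In the constituency tree, the words appear only at the leaves and every other node is a phrase label (S, NP, VP, ADVP). In the dependency parse, there are no phrase nodes at all: every word points to exactly one head, and the main verb "lasts" is the root.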