<a href="https://colab.research.google.com/github/bharathreddy-2802/BharathSimhaReddy_INFO5731_Fall2024/blob/main/Samala_BharathSimhaReddy_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have enough reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
# Install beautifulsoup4 requests pandas libraries
!pip install beautifulsoup4 requests pandas

# Importing required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Headers to mimic browser behavior
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Function to fetch reviews from Amazon
def fetch_amazon_reviews(url):
    reviews = []
    for page in range(1, 21):  # 20 pages; the number of reviews per page is set by the site
        print(f'Scraping page {page}...')
        # Send request to the website, passing the page number as a query parameter
        response = requests.get(url, headers=headers, params={'pageNumber': page}, timeout=30)
        if response.status_code != 200:
            # Blocked or failed request: skip this page instead of parsing an error page
            print(f'Page {page} returned HTTP {response.status_code}; skipping.')
            time.sleep(2)
            continue

        # Parse the page content
        soup = BeautifulSoup(response.content, 'html.parser')
        review_blocks = soup.find_all('div', {'data-hook': 'review'})

        # Extract review data (an empty string is stored when an element is missing)
        for review in review_blocks:
            title = review.find('a', {'data-hook': 'review-title'})
            rating = review.find('i', {'data-hook': 'review-star-rating'})
            body = review.find('span', {'data-hook': 'review-body'})
            reviews.append({
                'Title': title.text.strip() if title else '',
                'Rating': rating.text.strip() if rating else '',
                'Body': body.text.strip() if body else '',
            })

        # To avoid overloading the server, introduce a delay
        time.sleep(2)

    return reviews

# Reviews URL for the product. The ASIN (B0BF64DN6H) comes from the product link; the
# /product-reviews/ path is an assumed URL layout and may need adjusting.
product_url = 'https://www.amazon.com/product-reviews/B0BF64DN6H'

# Scrape the reviews
amazon_reviews = fetch_amazon_reviews(product_url)

# Save to a CSV file
df = pd.DataFrame(amazon_reviews)
df.to_csv('keyboard_reviews.csv', index=False)
print("Reviews successfully saved to keyboard_reviews.csv")


Scraping page 1...
Scraping page 2...
Scraping page 3...
Scraping page 4...
Scraping page 5...
Scraping page 6...
Scraping page 7...
Scraping page 8...
Scraping page 9...
Scraping page 10...
Scraping page 11...
Scraping page 12...
Scraping page 13...
Scraping page 14...
Scraping page 15...
Scraping page 16...
Scraping page 17...
Scraping page 18...
Scraping page 19...
Scraping page 20...
Reviews successfully saved to keyboard_reviews.csv
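
Product links copied from a search results page usually carry search and tracking parameters, which makes them awkward to paginate. The following is a minimal sketch of deriving clean per-page review URLs from such a link. It assumes that the ten-character ID after `/dp/` is the product's ASIN and that reviews are served under `/product-reviews/<ASIN>` with a `pageNumber` query parameter; both are assumptions about the site's URL layout, and `extract_asin` and `build_review_urls` are hypothetical helper names.

```python
import re
from urllib.parse import urlencode, urlsplit

def extract_asin(product_url):
    """Return the 10-character ID that follows /dp/ in the URL path, or None."""
    match = re.search(r'/dp/([A-Z0-9]{10})(?:/|$)', urlsplit(product_url).path)
    return match.group(1) if match else None

def build_review_urls(product_url, pages):
    """Build one review-page URL per page number (assumed URL layout)."""
    asin = extract_asin(product_url)
    if asin is None:
        raise ValueError('No /dp/<ASIN> segment found in the URL')
    parts = urlsplit(product_url)
    base = f'{parts.scheme}://{parts.netloc}/product-reviews/{asin}'
    # urlencode builds the query string, so no manual "?" or "&" juggling is needed
    return [base + '?' + urlencode({'pageNumber': page}) for page in range(1, pages + 1)]

# Shortened version of a product link with search parameters attached
example = ('https://www.amazon.com/SteelSeries-Worlds-Fastest-Mechanical-Keyboard'
           '/dp/B0BF64DN6H/ref=sr_1_1?keywords=gaming%2Bkeyboard&sr=1-1')
urls = build_review_urls(example, 3)
print(urls[0])
```

Building the query string with `urlencode` avoids ending up with two `?` separators when the starting link already has parameters.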


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.
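
Steps (1) to (4) can be sketched on a single string with the standard library alone. This is a minimal illustration, not the NLTK-based pipeline: the stopword set below is a tiny hand-picked assumption (NLTK's English list is much longer), `basic_clean` is a hypothetical helper, and stemming and lemmatization are left out because they need NLTK data.

```python
import re

# Tiny illustrative stopword list (an assumption; NLTK's English list is much longer).
STOPWORDS = {'i', 'a', 'an', 'the', 'this', 'is', 'it', 'and', 'of', 'to', 'for'}

def basic_clean(text):
    """Apply steps (1)-(4): strip punctuation, strip digits, lowercase, drop stopwords."""
    text = re.sub(r'[^\w\s]', ' ', text)   # punctuation -> space, so "keyboard.I" does not fuse
    text = re.sub(r'\d+', ' ', text)       # remove runs of digits
    tokens = text.lower().split()          # lowercase, then split on whitespace
    return ' '.join(token for token in tokens if token not in STOPWORDS)

print(basic_clean("Great keyboard.I bought it for $199 in 2023!"))
```

Replacing punctuation with a space, rather than deleting it, keeps neighboring words apart when a period has no trailing space. With this choice a contraction such as `I've` becomes `i ve`; deleting apostrophes before the substitution is one way to keep it as `ive`.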

In [2]:
# Write code for each of the sub parts with proper comments.
# Install the required libraries (pandas, nltk)
!pip install pandas nltk

# Import libraries
import re
import nltk
import pandas as pd

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the required NLTK data
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Load the CSV file with Amazon reviews
df = pd.read_csv('keyboard_reviews.csv')

# Display the first few rows of the dataset
print("Original Data:")
print(df.head())

# Part 1: Remove noise (special characters and punctuations)
def remove_special_characters(text):
    return re.sub(r'[^\w\s]', '', text)

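# Missing review bodies are read as NaN, so convert them to empty strings first
df['cleaned_text'] = df['Body'].fillna('').astype(str).apply(remove_special_characters)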
print("\nData after removing special characters:")
print(df[['Body', 'cleaned_text']].head())

# Part 2: Remove numbers
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

df['cleaned_text'] = df['cleaned_text'].apply(remove_numbers)
print("\nData after removing numbers:")
print(df[['Body', 'cleaned_text']].head())

# Part 3: Remove stopwords
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    words = word_tokenize(text)
    return ' '.join([word for word in words if word.lower() not in stop_words])

df['cleaned_text'] = df['cleaned_text'].apply(remove_stopwords)
print("\nData after removing stopwords:")
print(df[['Body', 'cleaned_text']].head())

# Part 4: Convert to lowercase
df['cleaned_text'] = df['cleaned_text'].str.lower()
print("\nData after converting to lowercase:")
print(df[['Body', 'cleaned_text']].head())

# Part 5: Stemming
stemmer = PorterStemmer()
def stem_text(text):
    words = word_tokenize(text)
    return ' '.join([stemmer.stem(word) for word in words])

df['stemmed_text'] = df['cleaned_text'].apply(stem_text)
print("\nData after stemming:")
print(df[['cleaned_text', 'stemmed_text']].head())

# Part 6: Lemmatization
lemmatizer = WordNetLemmatizer()
def lemmatize_text(text):
    words = word_tokenize(text)
    return ' '.join([lemmatizer.lemmatize(word) for word in words])

df['lemmatized_text'] = df['cleaned_text'].apply(lemmatize_text)
print("\nData after lemmatization:")
print(df[['cleaned_text', 'lemmatized_text']].head())

# Save the cleaned data to a new CSV file
df.to_csv('keyboard_reviews_cleaned.csv', index=False)
print("\nCleaned data saved to 'keyboard_reviews_cleaned.csv'")





[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Original Data:
                                               Title              Rating  \
0  5.0 out of 5 stars\nEASILY THE BEST, IF NOT TH...  5.0 out of 5 stars   
1  5.0 out of 5 stars\nBest keyboard you will EVE...  5.0 out of 5 stars   
2                        5.0 out of 5 stars\nAmazing  5.0 out of 5 stars   
3  5.0 out of 5 stars\nThe magnetic keys feel rea...  5.0 out of 5 stars   
4                                                NaN                 NaN   

                                                Body  
0  I've used most of the top rated keyboards befo...  
1  I bought this keyboard about a year and a half...  
2  Amazing keyboard and a better ergonomics for t...  
3  Really nice keyboard,  I am usually a hardcore...  
4  I’ve upgraded from an older Corsair K70 to the...  

Data after removing special characters:
                                                Body  \
0  I've used most of the top rated keyboards befo...   
1  I bought this keyboard about a year and a 
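
In the sample above, the `Title` column still begins with the star-rating text (`5.0 out of 5 stars\n...`), and `Rating` is stored as a string rather than a number. A minimal sketch of separating the two follows. `parse_rating` and `split_title` are hypothetical helper names, and the pattern assumes the rating text always has the form `<number> out of 5 stars`.

```python
import re

# Matches a leading "<number> out of 5 stars" plus any whitespace after it
RATING_PATTERN = re.compile(r'^\s*(\d+(?:\.\d+)?) out of 5 stars\s*')

def parse_rating(rating_text):
    """Return the numeric rating, or None if the field is missing or malformed."""
    if not isinstance(rating_text, str):
        return None  # e.g. NaN from a missing field
    match = RATING_PATTERN.match(rating_text)
    return float(match.group(1)) if match else None

def split_title(title_text):
    """Drop a leading '<n> out of 5 stars' prefix from a scraped title."""
    if not isinstance(title_text, str):
        return ''
    return RATING_PATTERN.sub('', title_text, count=1).strip()

print(parse_rating('5.0 out of 5 stars'))
print(split_title('5.0 out of 5 stars\nAmazing'))
```

With pandas, these could be applied column-wise, for example `df['Rating'].apply(parse_rating)` and `df['Title'].apply(split_title)`.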

# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity type.

In [3]:
# Your code here
# Install nltk spacy benepar libraries
!pip install nltk spacy benepar
!python -m spacy download en_core_web_sm

# Importing required libraries
import pandas as pd
import nltk
import spacy
from collections import Counter
from nltk import pos_tag, word_tokenize
from spacy import displacy
import benepar

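# Load the cleaned dataset
df = pd.read_csv('keyboard_reviews_cleaned.csv')  # Adjust the path to the cleaned CSV if needed
# Empty strings are read back from CSV as NaN, so restore them before tokenizing
df['lemmatized_text'] = df['lemmatized_text'].fillna('').astype(str)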
nlp = spacy.load("en_core_web_sm")

# Part 1: POS Tagging (Parts of Speech)
nltk.download('averaged_perceptron_tagger')

# Function to perform POS tagging
def pos_tagging(text):
    tokens = word_tokenize(text)
    return pos_tag(tokens)

# Apply POS tagging to the cleaned text
df['pos_tags'] = df['lemmatized_text'].apply(pos_tagging)

# Counting Nouns, Verbs, Adjectives, and Adverbs
pos_counts = Counter()

for tags in df['pos_tags']:
    for word, pos in tags:
        if pos.startswith('N'):
            pos_counts['Noun'] += 1
        elif pos.startswith('V'):
            pos_counts['Verb'] += 1
        elif pos.startswith('J'):
            pos_counts['Adjective'] += 1
        elif pos.startswith('R'):
            pos_counts['Adverb'] += 1

print(f"POS Counts:\nNouns: {pos_counts['Noun']}\nVerbs: {pos_counts['Verb']}\nAdjectives: {pos_counts['Adjective']}\nAdverbs: {pos_counts['Adverb']}")

# Part 2: Constituency Parsing and Dependency Parsing
# Benepar parser for constituency parsing
benepar.download('benepar_en3')
nlp.add_pipe("benepar", config={"model": "benepar_en3"})

# Function to print constituency and dependency parsing for a sentence
def parse_sentence(text):
    doc = nlp(text)
    for sent in doc.sents:
        print(f"Sentence: {sent.text}")
        print("\nConstituency Parsing Tree:")
        print(sent._.parse_string)
        print("\nDependency Parsing:")
        for token in sent:
            print(f"{token.text}: {token.dep_} --> {token.head.text}")

# Example text for parsing (the first review is used for the explanation)
example_sentence = df['lemmatized_text'].iloc[0]  # First review from the cleaned text
parse_sentence(example_sentence)

# Part 3: Named Entity Recognition (NER)
# Function to extract named entities
def extract_entities(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

# Apply NER to the lemmatized text
df['entities'] = df['lemmatized_text'].apply(extract_entities)

# Count the number of each entity type
entity_counts = Counter()

for entities in df['entities']:
    for text, label in entities:
        entity_counts[label] += 1

print("\nNamed Entity Recognition (NER) Counts:")
for entity, count in entity_counts.items():
    print(f"{entity}: {count}")

# Display a few examples of the NER output
print("\nSample NER Output:")
print(df[['lemmatized_text', 'entities']].head())



Collecting benepar
  Downloading benepar-0.2.0.tar.gz (33 kB)
  Preparing metadata (setup.py) ... done
Collecting torch-struct>=0.5 (from benepar)
  Downloading torch_struct-0.5-py3-none-any.whl.metadata (4.3 kB)
Downloading torch_struct-0.5-py3-none-any.whl (34 kB)
Building wheels for collected packages: benepar
  Building wheel for benepar (setup.py) ... done
  Created wheel for benepar: filename=benepar-0.2.0-py3-none-any.whl size=37626 sha256=50462af822712661f97ab22ae0dd6f5d7ea02eaea85a2fcb79ba78c5f4c84671
  Stored in directory: /root/.cache/pip/wheels/8d/4d/c1/a5af726368d5dbaaaa0b2dd36ed39b9da8cec46279a49bd6db
Successfully built benepar
Installing collected packages: torch-struct, benepar
Successfully installed benepar-0.2.0 torch-struct-0.5
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


POS Counts:
Nouns: 4324
Verbs: 1878
Adjectives: 2355
Adverbs: 643
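
The counts above group Penn Treebank tags by their first letter. One detail worth checking: a prefix test on `R` matches `RP` (particle) as well as the adverb tags `RB`, `RBR`, and `RBS`. Below is a sketch of a stricter mapping based on two-letter prefixes. It runs on a small hand-tagged list, which is illustrative and not tagger output, and `count_pos` is a hypothetical helper name.

```python
from collections import Counter

# Penn Treebank tag prefixes mapped to the four coarse categories of interest
PREFIX_TO_CATEGORY = (('NN', 'Noun'), ('VB', 'Verb'), ('JJ', 'Adjective'), ('RB', 'Adverb'))

def count_pos(tagged_tokens):
    """Count nouns, verbs, adjectives, and adverbs in (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        for prefix, category in PREFIX_TO_CATEGORY:
            if tag.startswith(prefix):
                counts[category] += 1
                break  # a tag belongs to at most one category
    return counts

# Hand-tagged illustration; 'up' is a particle (RP) and is not counted as an adverb
tagged = [('keyboard', 'NN'), ('feels', 'VBZ'), ('really', 'RB'), ('smooth', 'JJ'),
          ('set', 'VB'), ('up', 'RP'), ('keys', 'NNS')]
counts = count_pos(tagged)
for category in ('Noun', 'Verb', 'Adjective', 'Adverb'):
    print(f'{category}: {counts[category]}')
```

Whether particles should count as adverbs is a judgment call; the point of the sketch is only that the choice is made explicitly.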


[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Unzipping models/benepar_en3.zip.
  state_dict = torch.load(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Sentence: ive used top rated keyboard like steelseries razer blackwidow v green brown switch model razer huntsman red optical three well respected keyboard get lot praise built offerive played fps game year tend notice look new tech product give advantage gameplay besides im also fulltime engineering student find typing report frequently switch linear dont tactile click bump instead smooth keystroke perfect fps shooter allows faster better feeling key pressesi tested keyboard favorite shooter game main one play apex legend game like valorant aim heavy apex movement heavy game using keyboard mouse time lowest actuation setting mm nice quick movement pulling healsshield wheel throwable wheel apex legendsthe oled screen really cool feature actually allows change setting without steelseries engine application setting like brightness actuation macro plus custom message main screen display ive seen even connects game display killsdeathsassists game like csgoin term report actuation setting m
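
Because the example above is a long run of lemmatized tokens with punctuation and stopwords removed, its trees are hard to read. The difference between the two representations is easier to see on a short sentence. The bracketed tree and the dependency triples below are written by hand for illustration (they are not output from benepar or spaCy), and `parse_brackets` and `leaves` are hypothetical helpers.

```python
def parse_brackets(text):
    """Parse a bracketed tree string into nested lists: [label, child, ...]."""
    tokens = text.replace('(', ' ( ').replace(')', ' ) ').split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                          # skip '('
        node = [tokens[pos]]              # phrase label or POS tag
        pos += 1
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                node.append(read_node())  # nested constituent
            else:
                node.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                          # skip ')'
        return node

    return read_node()

def leaves(node):
    """Collect the words at the bottom of the tree, left to right."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words

# Constituency: words are grouped into nested phrases (NP, VP, ADJP).
tree = parse_brackets('(S (NP (DT The) (NN keyboard)) (VP (VBZ feels) (ADJP (JJ great))))')

# Dependency: each word points to its head word with a labeled relation.
dependencies = [('The', 'det', 'keyboard'), ('keyboard', 'nsubj', 'feels'),
                ('feels', 'ROOT', 'feels'), ('great', 'acomp', 'feels')]

print(leaves(tree))
for word, relation, head in dependencies:
    print(f'{word}: {relation} --> {head}')
```

In the constituency tree, `The keyboard` forms a noun phrase and `feels great` a verb phrase, and the two combine into the sentence `S`. In the dependency view there are no phrase nodes: `feels` is the root, `keyboard` attaches to it as its subject, and `great` attaches to it as a complement.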

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [4]:
from google.colab import files

# This will download the CSV file to your local machine
files.download('keyboard_reviews_cleaned.csv')



# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Also share your opinion on the time provided to complete the assignment.

In [7]:
'''
The assignment was great. It was an eye-opener to the world of web scraping, and I look forward to more such assignments. The only issue I noted was that some
sites did not respond consistently to scraping, and some blocked requests outright. This required many changes to the code, which introduced a lot of
errors that were not of my own making. I feel the assignment is hard to complete in the time provided.
'''
