<a href="https://colab.research.google.com/github/VijayaKumariGanipineni/VijayaKumari_INFO5731_Fall2024/blob/main/Ganipineni_VijayaKumari_Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
# Installing required libraries
!pip install requests beautifulsoup4 pandas

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_imdb_reviews(url, max_reviews=1000):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
    }
    reviews = []
    page = 0

    while len(reviews) < max_reviews:
        # Constructing the URL for the review pages (IMDB uses offsets for pages)
        review_url = f"{url}?ref_=undefined&paginationKey={page}"
        print(f"Visiting {review_url}")

        response = requests.get(review_url, headers=headers)
        if response.status_code != 200:
            print("Failed to retrieve page. Exiting...")
            break

        soup = BeautifulSoup(response.content, 'html.parser')
        review_elements = soup.find_all('div', class_='text show-more__control')

        for review in review_elements:
            try:
                review_text = review.text.strip()
                reviews.append({
                    'Review': review_text
                })

                if len(reviews) >= max_reviews:
                    break
            except Exception as e:
                print(f"Error while parsing review: {e}")

        page += 1

    return reviews

# URL of the IMDB movie reviews page
imdb_url = "https://www.imdb.com/title/tt17526714/reviews/?ref_=tt_ov_ql_2"  #  URL for movie
imdb_reviews_data = get_imdb_reviews(imdb_url, max_reviews=1000)

# Saving to CSV
df_imdb_reviews = pd.DataFrame(imdb_reviews_data)
df_imdb_reviews.to_csv('movie_reviews.csv', index=False)
print(f"Saved {len(df_imdb_reviews)} reviews to 'movie_reviews.csv'")


Visiting https://www.imdb.com/title/tt17526714/reviews/?ref_=tt_ov_ql_2?ref_=undefined&paginationKey=0
Visiting https://www.imdb.com/title/tt17526714/reviews/?ref_=tt_ov_ql_2?ref_=undefined&paginationKey=1
Visiting https://www.imdb.com/title/tt17526714/reviews/?ref_=tt_ov_ql_2?ref_=undefined&paginationKey=2
Visiting https://www.imdb.com/title/tt17526714/reviews/?ref_=tt_ov_ql_2?ref_=undefined&paginationKey=3
Visiting https://www.imdb.com/title/tt17526714/reviews/?ref_=tt_ov_ql_2?ref_=undefined&paginationKey=4
Visiting https://www.imdb.com/title/tt17526714/reviews/?ref_=tt_ov_ql_2?ref_=undefined&paginationKey=5
Visiting https://www.imdb.com/title/tt17526714/reviews/?ref_=tt_ov_ql_2?ref_=undefined&paginationKey=6
Visiting https://www.imdb.com/title/tt17526714/reviews/?ref_=tt_ov_ql_2?ref_=undefined&paginationKey=7
Visiting https://www.imdb.com/title/tt17526714/reviews/?ref_=tt_ov_ql_2?ref_=undefined&paginationKey=8
Visiting https://www.imdb.com/title/tt17526714/reviews/?ref_=tt_ov_ql_2?r

# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [2]:

# Installing necessary libraries
!pip install nltk pandas

import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Downloading nltk resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Loading the IMDB reviews CSV file
df = pd.read_csv('movie_reviews.csv')

# Function to clean the text data
def clean_text(text):
    # (1) Removing noise (special characters and punctuations)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # (2) Removing numbers
    text = re.sub(r'\d+', '', text)

    # (3) Removing stopwords
    stop_words = set(stopwords.words('english'))
    words = nltk.word_tokenize(text)
    words = [word for word in words if word.lower() not in stop_words]

    # (4) Lowercasing all texts
    words = [word.lower() for word in words]

    # (5) Stemming
    stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(word) for word in words]

    # (6) Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in stemmed_words]

    # Joining the words back into a single string
    clean_text = ' '.join(lemmatized_words)
    return clean_text

# Applying the cleaning function to the review text
df['Cleaned Review'] = df['Review'].apply(clean_text)

# Saving the cleaned data into a new CSV file
df.to_csv('cleaned_movie_reviews.csv', index=False)

# Displaying the first few rows of the cleaned data
print(df.head())





[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


                                              Review  \
0  Let me start out by saying: I love body horror...   
1  12 minutes of standing ovation during Cannes F...   
2  I was extremely hyped for that movie even thou...   
3  And I've seen thousands. I estimate about 5k m...   
4  Every scene of this film wowed me at TIFF. The...   

                                      Cleaned Review  
0  let start say love bodi horror dont your squea...  
1  minut stand ovat cann film festiv premier demi...  
2  extrem hype movi even though im big fan fargea...  
3  ive seen thousand estim k movi year earth cant...  
4  everi scene film wow tiff cast atmospher visua...  


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [3]:
# Install the necessary libraries
!pip install nltk spacy
!python -m spacy download en_core_web_sm

import nltk
import spacy
from collections import Counter
from nltk import pos_tag, word_tokenize

# Download required NLTK resources
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

# Load the cleaned IMDB reviews CSV file
df = pd.read_csv('cleaned_movie_reviews.csv')

# Load the spaCy English model
nlp = spacy.load('en_core_web_sm')

# Function for Parts of Speech (POS) Tagging
def pos_tagging(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    return pos_tags

# Function to count POS tags (Nouns, Verbs, Adjectives, Adverbs)
def count_pos(pos_tags):
    counts = Counter(tag for word, tag in pos_tags)
    return counts

# Function for Dependency Parsing and Constituency Parsing
def parse_sentence(text):
    doc = nlp(text)
    return doc

# Function for Named Entity Recognition (NER)
def named_entity_recognition(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

# Perform analysis on the first review as an example
review_text = df['Cleaned Review'][0]

# 1. Parts of Speech Tagging
pos_tags = pos_tagging(review_text)
print(f"POS Tags: {pos_tags}")

# Count total number of Nouns, Verbs, Adjectives, Adverbs
pos_counts = count_pos(pos_tags)
print(f"POS Counts: {pos_counts}")

# 2. Dependency Parsing and Constituency Parsing
parsed_doc = parse_sentence(review_text)
print(f"Dependency Parsing: {[f'{token.text}: {token.dep_}' for token in parsed_doc]}")

# Display the Constituency Parse Tree (spaCy doesn't provide direct constituency parsing)
# But here's how we interpret dependencies:
for token in parsed_doc:
    print(f'{token.text} -> {token.head.text} ({token.dep_})')

# 3. Named Entity Recognition (NER)
entities = named_entity_recognition(review_text)
print(f"Named Entities: {entities}")

# Count the named entities
entity_count = Counter([label for text, label in entities])
print(f"Entity Counts: {entity_count}")


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m81.3 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


POS Tags: [('let', 'VB'), ('start', 'VB'), ('say', 'VB'), ('love', 'VB'), ('bodi', 'JJ'), ('horror', 'NN'), ('dont', 'VB'), ('your', 'PRP$'), ('squeamish', 'JJ'), ('might', 'MD'), ('want', 'VB'), ('pas', 'NN'), ('film', 'NN'), ('said', 'VBD'), ('thought', 'VBN'), ('balanc', 'NN'), ('disturb', 'NN'), ('impact', 'NN'), ('intrigu', 'JJ'), ('disgust', 'NN'), ('absolut', 'NN'), ('right', 'RB'), ('moneyin', 'JJ'), ('world', 'NN'), ('full', 'JJ'), ('filler', 'NN'), ('botox', 'NN'), ('face', 'NN'), ('lift', 'NN'), ('implant', 'JJ'), ('skin', 'NN'), ('care', 'NN'), ('routin', 'NN'), ('ob', 'IN'), ('youth', 'NN'), ('peak', 'JJ'), ('medium', 'NN'), ('substanc', 'NN'), ('call', 'NN'), ('question', 'NN'), ('whether', 'IN'), ('new', 'JJ'), ('better', 'JJR'), ('realli', 'NN'), ('one', 'CD'), ('perhap', 'NN'), ('clever', 'NN'), ('gross', 'JJ'), ('point', 'NN'), ('fun', 'NN'), ('watch', 'NN'), ('moor', 'NN'), ('qualley', 'NN'), ('excel', 'NN'), ('duo', 'NN'), ('quaid', 'VBD'), ('play', 'NN'), ('hollywo

#**Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [None]:
from google.colab import files
files.download('cleaned_movie_reviews.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [None]:
# Write your response below
'''
The assignment was relevant although I faced a lot of challenges accessing some information from the required
websites. For example, the page for the narrators was access forbiden and the articles abstracts too page was blocking
and therefore settled on the movies. I enjoyed data scaping and the cleaning of data alot. In future, I would suggest
to be given a wider range of getting data.
'''

'\nThe assignment was relevant although I faced a lot of challenges accessing some information from the required\nwebsites. For example, the page for the narrators was access forbiden and the articles abstracts too page was blocking\nand therefore settled on the movies. I enjoyed data scaping and the cleaning of data alot. In future, I would suggest\nto be given a wider range of getting data.\n'