# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie released recently, in 2023 or 2024 (you can choose any movie), from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
# Collect IMDb user reviews and save them to a CSV file
import requests as req
import pandas as pd
from bs4 import BeautifulSoup

# Function to fetch IMDb reviews for one title ID
def get_movie_reviews(movie_id):
    movie_reviews_list = []
    for page_number in range(1, 41):
        # Build the review-page URL for this iteration
        imdb_movie_url = "https://www.imdb.com/title/" + movie_id + "/reviews?start=" + str(page_number)
        response_of_url = req.get(imdb_movie_url, timeout=30)
        review_soup = BeautifulSoup(response_of_url.content, 'html.parser')
        # find_all is the current name of the deprecated findAll alias
        review_div = review_soup.find_all('div', class_='text show-more__control')
        for each_review in review_div:
            movie_reviews_list.append(each_review.text)
    return movie_reviews_list

# Collect the reviews of Killers of the Flower Moon (2023) from IMDb. Its title ID is 'tt5537002'.
movie_review = get_movie_reviews('tt5537002')

# Save the data into a CSV file
movie_review_df = pd.DataFrame(movie_review, columns=['Review'])
movie_review_df.to_csv('killers_of_flower_moon_reviews.csv', index=False)


In [2]:
movie_review_df.head()

| | Review |
|---|---|
| 0 | The good: 2 brilliant actors, Leonardo Di Capr... |
| 1 | "Killers of the Flower Moon" is a Western crim... |
| 2 | Leonardo Di Caprio returns from World War One ... |
| 3 | Oil is discovered under Osage Nation land in l... |
| 4 | Some films warrant long runtimes. Epics like '... |
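
Because the scraper requests 40 URLs that differ only in the `start=` value, it is worth checking that the collected list really contains 1000 distinct reviews. Whether IMDb returns different reviews for each of those URLs is not verified here, so the following is only a minimal sketch of a sanity check; `dedupe_reviews` and the sample list are hypothetical names and data introduced for illustration.

```python
def dedupe_reviews(reviews, target=1000):
    """Strip whitespace, drop empty and repeated reviews, and report whether the target is met."""
    stripped = (review.strip() for review in reviews)
    # dict.fromkeys keeps the first occurrence of each review and preserves order
    unique = list(dict.fromkeys(text for text in stripped if text))
    return unique, len(unique) >= target

# Hypothetical sample: the same review scraped from two pages, plus an empty one
sample = ["Great film. ", "Too long.", "Great film.", "   "]
unique_reviews, enough = dedupe_reviews(sample, target=3)
print(len(unique_reviews), enough)
```

In the notebook, the same helper could be applied to `movie_review` before building the DataFrame.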


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [3]:
#importing the required packages
import string
!pip install nltk
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Ensure you have the necessary nltk resources downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Build the stopword set once instead of reloading the list for every token
english_stopwords = set(stopwords.words('english'))

def clean_review_data(text_1):
    # Remove punctuation
    text_1 = text_1.translate(str.maketrans('', '', string.punctuation))

    # Remove numbers from the text data
    text_1 = ''.join(char for char in text_1 if not char.isdigit())

    # Remove the stopwords from the review content. The match is case-sensitive and
    # runs before lowercasing, so capitalized stopwords such as "The" are kept.
    text_1 = ' '.join([x for x in text_1.split() if x not in english_stopwords])

    # Convert to lowercase letters
    text_1 = text_1.lower()

    # Tokenize the text. Note: `stemmer` is never called, so no stemming is applied here.
    stem_tokens_1 = nltk.word_tokenize(text_1)

    # Apply lemmatization
    final_cleaned_tokens_list = [lemmatizer.lemmatize(word_1) for word_1 in stem_tokens_1]

    # Return the pre-processed data
    return ' '.join(final_cleaned_tokens_list)


movie_review_df["cleaned Review text"]=movie_review_df["Review"].apply(clean_review_data)
movie_review_df.to_csv('movie_reviews_cleaned.csv', index=False)
movie_review_df.head(10)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


| | Review | cleaned Review text |
|---|---|---|
| 0 | The good: 2 brilliant actors, Leonardo Di Capr... | the good brilliant actor leonardo di caprio ro... |
| 1 | "Killers of the Flower Moon" is a Western crim... | killer flower moon western crime drama film co... |
| 2 | Leonardo Di Caprio returns from World War One ... | leonardo di caprio return world war one oklaho... |
| 3 | Oil is discovered under Osage Nation land in l... | oil discovered osage nation land late th centu... |
| 4 | Some films warrant long runtimes. Epics like '... | some film warrant long runtimes epic like lawr... |
| 5 | Martin Scorsese follows up his sloppy The Iris... | martin scorsese follows sloppy the irishman an... |
| 6 | Obviously this isn't bad. It's from an amazing... | obviously isnt bad it amazing director interes... |
| 7 | I'm not a die-hard Martin Scorsese fan. I have... | im diehard martin scorsese fan i deep apprecia... |
| 8 | First things first. There is absolutely no nee... | first thing first there absolutely need ½ hour... |
| 9 | Well that was a real snorer.Scorsese took a ta... | well real snorerscorsese took taut thriller bo... |
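
The cleaned column above still contains leading stopwords ("the", "some"), fused tokens ("snorerscorsese"), and unstemmed words. A possible variant of the cleaning steps is sketched below: it replaces punctuation with spaces, lowercases before removing stopwords, and applies NLTK's `PorterStemmer`. This is an illustration under stated assumptions, not the notebook's actual pipeline: `clean_text_sketch` and `SAMPLE_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for `stopwords.words('english')`.

```python
import string
from nltk.stem import PorterStemmer

# Tiny illustrative stopword set (an assumption); the notebook uses stopwords.words('english')
SAMPLE_STOPWORDS = {"the", "a", "is", "it", "was", "were", "and", "of"}
stemmer = PorterStemmer()

def clean_text_sketch(text, stopword_set=SAMPLE_STOPWORDS):
    # Replace punctuation with spaces so "snorer.Scorsese" does not fuse into one token
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    # Drop digits
    text = "".join(ch for ch in text if not ch.isdigit())
    # Lowercase BEFORE stopword removal so "The" and "It" are matched
    tokens = [tok for tok in text.lower().split() if tok not in stopword_set]
    # Apply the Porter stemmer to every remaining token
    return " ".join(stemmer.stem(tok) for tok in tokens)

print(clean_text_sketch("The movies were running 3 hours. It was a snorer.Scorsese took over."))
```

Stemming and lemmatization give different results, so storing each in its own column would make the effect of both steps visible in the output.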


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.

In [4]:
import pandas as pd
import spacy
from spacy import displacy
from collections import Counter

# Load spaCy's small English language model
nlp = spacy.load("en_core_web_sm")

# Function to perform POS tagging and count the tags of one text
def pos_tagging_of_text(text):
    return Counter([token.pos_ for token in nlp(text)])


# Dependency parsing with spaCy
def dependency_parsing(text):
    doc = nlp(text)
    for sentence in doc.sents:
        print(f"Dependency parsing tree for: {sentence}")
        displacy.render(sentence, style='dep', jupyter=True, options={'distance': 90})

# Named Entity Recognition: returns the entity texts and the count of each entity label
def named_entity_recognition(text):
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    entity_labels = [ent.label_ for ent in doc.ents]
    entity_counts = Counter(entity_labels)
    return entities, entity_counts


# Apply POS tagging to every cleaned review
movie_review_df['POS_counts'] = movie_review_df['cleaned Review text'].apply(pos_tagging_of_text)

# Print dependency parses for the first five reviews
# (constituency parsing is not implemented in this cell)
for content in movie_review_df['cleaned Review text'].head():
    # Punctuation was removed during cleaning, so this split normally yields a single
    # chunk per review; spaCy's doc.sents then segments it into sentences.
    sentences = content.split(".")
    for sentence in sentences:
        if sentence.strip():
            dependency_parsing(sentence)

# Apply Named Entity Recognition to every cleaned review
movie_review_df['Entities'], movie_review_df['Entity_counts'] = zip(*movie_review_df['cleaned Review text'].apply(named_entity_recognition))

# Display results
print(movie_review_df[['POS_counts', 'Entities', 'Entity_counts']].head())

Dependency parsing tree for: the good brilliant actor leonardo di caprio robert deniro acting quite well not extraordinary well really deliver solid performance it always joy watch performthe bad movie last way too long come hour minute beware aint gripping masterpiece aint fastpaced spectacle aint visceral drama it merely slowburning portrait already wearing patience thin half hourthe first minute story get see nothing else money hungry men marying indian woman land rich oil married slowly poison indian wife altered medicine make terminally ill and die oil rich land copy repeat on plot may sound vicious intrigueing oppositehow director martin scorsese manage ruin devestating story by simply not restraining time and injecting true drama thrill it toddles along only hour pace take spark punch experiencedthere fault martin scorsese direction his choice editing photography soundscore usual level quality directing not terrible quite average martin scorsese movie used wild dangerous fast no

Dependency parsing tree for: killer flower moon western crime drama film cowritten directed martin scorsese based nonfiction book name david grann starring leonardo dicaprio robert de niro lily gladstone touch upon often overlooked piece american history best way possible thanks talent director castin early discovery oil land belonging native american osage nation turn tribe richest people world this sudden acquisition wealth attracts attention white businessmen looking seize opportunity stealing much osage tribe possible among group interloper ernest burkhart leonardo dicaprio upon arriving oklahoma encouraged uncle william king hale robert de niro marry member osage way inheriting fortune ernest soon fall love later marries mollie lily gladstone young osage woman strong tie family rich a white occupation native land continues member osage tribe repeatedly found murdered mysterious circumstance molly close family among prominent victimsone favourite thing movie addition enjoyable mean

Dependency parsing tree for: right what make scheme intriguing watch patience required pull ethical ramification result only filmmaker like scorsese could explore topic like complexity style remains timeless everadditionally almost scorsese visual trademark director full display wideopen cinematography designed immerse audience world america creative framing character shot give certain perspective scene one particular stood conversation ernest william discus business regarding osage we see two seated inside darkly lit room discussing type future lie ahead entire tribe ernest choosing remain loyal osage wife mollie uncle william reminds nephew important reason married first place here scorsese place character way make look place inside single bright spot dark room the darkness surrounding two likened perfect visual representation true intention supposed brightness focused actuality metaphor tainted presence everything osage created pointdue scale theme plotting film rightfully earns lon

Dependency parsing tree for: leonardo di caprio return world war one oklahoma oil made osage indian rich after long meandering talk uncle robert de niro di caprio run taxi pick lily gladstone pureblood osage begin drive regularly they fall love eventually three child someone busy killing osage indiansthis scorsese movie stellar cast impeccably written directed shot i dont know thelma schoonmaker scorsese longtime editor at three hour wrap tenminute epilogue presented episode gangbusters talking everyone wound spending rest life it go long movie least movie without break this upside downside company like amazon netflix wherewithal tempt great film maker large gob money because viewer home watch movie like leisure taking time go bath room meal forth made there need edit what could cut reasonable length without dicaprios army career finding lived trailer final year left audience bladder telling time leave story what would made fine miniseries television must watched one go movie theater t

Dependency parsing tree for: oil discovered osage nation land late th century oklahoma it the oil company moved osage people richest folk world all wealth attracts lot worker capitalist criminal general scammer everybody looking get piece indian oil pie war veteran ernest burkhart leo dicaprio return join brother byron uncle william hale robert de niro william become wealthy leader town family friend worm way native community he known king hale ingratiated tribe he directs ernest woo marry mollie lily gladstone shy sickly native womani put watching long possible


Dependency parsing tree for: i dont want spend three hour watching evil people stealing helpless indian folk a i started watching phrase pop mind this stealing candy baby fun time watch it tough watch two hour i find stopping multitasking ipad when fbi show becomes tradition howcatchem thats little bit watchable i dont stop anymore this would great two hour mystery movie fbi martin scorsese probably want native character time he shy making character evil leo something like brando i dont like it seems much fitting handsome leo mollie would powerless charm a mollie cluelessness rather frustrating there denying scorsese filmmaking mastery sincere intention the first two hour grind still rewarding watch


Dependency parsing tree for: some film warrant long runtimes epic like lawrence arabia da boot three hour length rocket along brisk pace largely fastidious editing the duration picture necessary one could argue tell story without sacrificing detail coherence excitement then film like heaven gate also runtime three hour bloated selfindulgent unaffecting watch thanks director michael ciminos arrogant refusal cut anythingmartin scorsese killer flower moon teeter somewhere camp it element deserving high praise inarguably long pacing structural narrative issue galore based nonfiction novel name david grann film center emmet burkhart simpleminded world war i veteran return the osage nation home uncle william king hale there emmet fall osage named mollie uncle tell set inherit much people oil headrights meanwhile someone killing wealthy osage area look like molly family might nexton paper sound like fascinating exciting picture dash psychological intrigue however scorsese version tale dour sw

                                          POS_counts  \
0  {'DET': 4, 'ADJ': 31, 'NOUN': 57, 'PROPN': 10,...   
1  {'PROPN': 71, 'ADJ': 139, 'NOUN': 312, 'VERB':...   
2  {'PROPN': 17, 'NOUN': 69, 'VERB': 32, 'NUM': 5...   
3  {'NOUN': 59, 'VERB': 39, 'ADJ': 27, 'ADP': 3, ...   
4  {'DET': 11, 'NOUN': 252, 'ADJ': 115, 'ADP': 18...   

                                            Entities  \
0  [caprio robert deniro, thin half hourthe, firs...   
1  [martin, scorsese, david grann, leonardo dicap...   
2  [one, oklahoma, indian, robert de niro, three,...   
3  [late th century, indian, uncle william, rober...   
4  [lawrence arabia da, three hour, three hour, m...   

                                       Entity_counts  
0  {'PERSON': 3, 'DATE': 1, 'ORDINAL': 1, 'NORP':...  
1  {'ORG': 1, 'NORP': 11, 'PERSON': 15, 'GPE': 1,...  
2  {'CARDINAL': 4, 'GPE': 2, 'NORP': 3, 'PERSON':...  
3  {'DATE': 1, 'NORP': 2, 'PERSON': 3, 'TIME': 4,...  
4  {'PERSON': 18, 'TIME': 3, 'ORG': 4, 'NORP': 6,..
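
The output above holds one counter per review, while the question asks for corpus-level totals of nouns, verbs, adjectives, and adverbs and for a count of each entity. One way to aggregate is sketched below using only the standard library. The helper names and the sample data are hypothetical; the sample merely mimics the shape of the `POS_counts` and `Entities` columns.

```python
from collections import Counter

def total_pos_counts(per_review_counts, tags=("NOUN", "VERB", "ADJ", "ADV")):
    """Sum per-review POS Counters and keep only the requested tags."""
    totals = Counter()
    for counts in per_review_counts:
        totals.update(counts)
    return {tag: totals[tag] for tag in tags}

def entity_text_counts(per_review_entities):
    """Count how often each entity string occurs across all reviews."""
    return Counter(ent for entities in per_review_entities for ent in entities)

# Hypothetical sample data shaped like the POS_counts and Entities columns
sample_pos = [Counter({"NOUN": 3, "ADJ": 1}), Counter({"NOUN": 2, "VERB": 4, "DET": 1})]
sample_entities = [["martin scorsese", "oklahoma"], ["martin scorsese"]]

print(total_pos_counts(sample_pos))
print(entity_text_counts(sample_entities).most_common(1))
```

In the notebook, the same helpers could be called as `total_pos_counts(movie_review_df['POS_counts'])` and `entity_text_counts(movie_review_df['Entities'])`.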

In [7]:
movie_review_df.head(15)

| | Review | cleaned Review text | POS_counts | Entities | Entity_counts |
|---|---|---|---|---|---|
| 0 | The good: 2 brilliant actors, Leonardo Di Capr... | the good brilliant actor leonardo di caprio ro... | {'DET': 4, 'ADJ': 31, 'NOUN': 57, 'PROPN': 10,... | [caprio robert deniro, thin half hourthe, firs... | {'PERSON': 3, 'DATE': 1, 'ORDINAL': 1, 'NORP':... |
| 1 | "Killers of the Flower Moon" is a Western crim... | killer flower moon western crime drama film co... | {'PROPN': 71, 'ADJ': 139, 'NOUN': 312, 'VERB':... | [martin, scorsese, david grann, leonardo dicap... | {'ORG': 1, 'NORP': 11, 'PERSON': 15, 'GPE': 1,... |
| 2 | Leonardo Di Caprio returns from World War One ... | leonardo di caprio return world war one oklaho... | {'PROPN': 17, 'NOUN': 69, 'VERB': 32, 'NUM': 5... | [one, oklahoma, indian, robert de niro, three,... | {'CARDINAL': 4, 'GPE': 2, 'NORP': 3, 'PERSON':... |
| 3 | Oil is discovered under Osage Nation land in l... | oil discovered osage nation land late th centu... | {'NOUN': 59, 'VERB': 39, 'ADJ': 27, 'ADP': 3, ... | [late th century, indian, uncle william, rober... | {'DATE': 1, 'NORP': 2, 'PERSON': 3, 'TIME': 4,... |
| 4 | Some films warrant long runtimes. Epics like '... | some film warrant long runtimes epic like lawr... | {'DET': 11, 'NOUN': 252, 'ADJ': 115, 'ADP': 18... | [lawrence arabia da, three hour, three hour, m... | {'PERSON': 18, 'TIME': 3, 'ORG': 4, 'NORP': 6,... |
| 5 | Martin Scorsese follows up his sloppy The Iris... | martin scorsese follows sloppy the irishman an... | {'PROPN': 5, 'VERB': 14, 'ADJ': 17, 'DET': 5, ... | [martin scorsese, three half hour, indian, okl... | {'PERSON': 2, 'TIME': 1, 'NORP': 1, 'GPE': 1} |
| 6 | Obviously this isn't bad. It's from an amazing... | obviously isnt bad it amazing director interes... | {'ADV': 13, 'AUX': 12, 'PART': 8, 'ADJ': 24, '... | [fbi, two hour, fbi, i worried hour, irishman] | {'ORG': 2, 'TIME': 2, 'NORP': 1} |
| 7 | I'm not a die-hard Martin Scorsese fan. I have... | im diehard martin scorsese fan i deep apprecia... | {'PRON': 9, 'VERB': 21, 'ADJ': 18, 'PROPN': 7,... | [martin, scorsese, third] | {'ORG': 1, 'ORDINAL': 2} |
| 8 | First things first. There is absolutely no nee... | first thing first there absolutely need ½ hour... | {'ADJ': 42, 'NOUN': 67, 'ADV': 26, 'VERB': 43,... | [first, first, ½ hour, robert de niro, washing... | {'ORDINAL': 2, 'TIME': 1, 'PERSON': 1, 'GPE': ... |
| 9 | Well that was a real snorer.Scorsese took a ta... | well real snorerscorsese took taut thriller bo... | {'INTJ': 1, 'ADJ': 12, 'NOUN': 32, 'VERB': 14,... | [snorerscorsese, oklahoma, edgar hoover, texas... | {'NORP': 2, 'GPE': 2, 'PERSON': 2, 'CARDINAL': 1} |
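
The code for this question prints dependency trees only, so the constituency half of part (2) has no output. As a rough illustration of what a constituency tree looks like, the sketch below parses one short sentence with NLTK's `ChartParser` and a hand-written toy grammar. Both the grammar and the sentence are assumptions made for this example; real review text would need a trained constituency parser rather than a six-rule grammar.

```python
import nltk

# Hand-written toy grammar (an assumption for illustration, not a trained parser)
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'director' | 'film'
V -> 'made'
""")
parser = nltk.ChartParser(toy_grammar)

tokens = "the director made a film".split()
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)  # bracketed constituency tree
```

The constituency tree groups words into nested phrases: the sentence (S) splits into a noun phrase ("the director") and a verb phrase ("made a film"), which in turn contains another noun phrase. A dependency tree for the same sentence has no phrase nodes; it links each word directly to its head, with "made" as the root and "director" and "film" attached to it as subject and object.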


# Mandatory Question

**Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.**

This assignment gave me hands-on experience with the topics taught in class.
The time given was sufficient for the assignment. Errors were easily resolved using Stack Overflow and Reddit.
The one challenge I faced was writing the code to generate and print the parse trees.