<a href="https://colab.research.google.com/github/unt-iialab/INFO5731_Spring2020/blob/master/Assignments/INFO5731_Assignment_Two.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform a syntactic analysis of it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data into a **csv file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a recent film released in 2022 or 2023 (you can choose any film) from IMDb.

(3) Collect all the reviews of the top 1000 most popular software from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/)

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [2]:
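import csv

import requests
from bs4 import BeautifulSoup

REQUEST_TIMEOUT = 30  # seconds; prevents a stalled connection from hanging the run

def get_total_reviews(url):
    """Return the total number of reviews reported in the page header."""
    response = requests.get(url, timeout=REQUEST_TIMEOUT)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    # The first whitespace-separated token of the header is the count,
    # possibly with thousands separators (commas).
    total_reviews_str = soup.find('div', {'class': 'header'}).get_text()
    return int(total_reviews_str.split()[0].replace(',', ''))

def save_reviews_to_csv(csv_file_path, reviews):
    """Write the reviews to a one-column CSV file with a 'Review' header."""
    with open(csv_file_path, 'w', newline='', encoding='utf-8') as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerow(['Review'])
        csv_writer.writerows(reviews)

def scrape_reviews(url, total_reviews):
    """Collect review texts page by page, capped at 1000 pages (10000 reviews)."""
    reviews = []
    # Assumes 10 reviews per page and that the site honours the `start` offset;
    # if it does not, every request returns the same reviews.
    for page_number in range(1, min(1001, total_reviews // 10 + 2)):
        page_url = f'{url}&start={10 * (page_number - 1)}'
        page_response = requests.get(page_url, timeout=REQUEST_TIMEOUT)
        page_response.raise_for_status()
        page_soup = BeautifulSoup(page_response.content, 'html.parser')
        page_reviews = page_soup.find_all('div', {'class': 'text show-more__control'})
        for review in page_reviews:
            # Each row is a one-element list so csv.writer emits one column.
            reviews.append([review.get_text().strip()])
    return reviews

def main():
    imdb_url = 'https://www.imdb.com/title/tt0120338/reviews/?ref_=tt_ql_2'
    csv_file_path = 'Titanic_reviews.csv'

    total_reviews = get_total_reviews(imdb_url)
    all_reviews = scrape_reviews(imdb_url, total_reviews)
    save_reviews_to_csv(csv_file_path, all_reviews)

    print(f'Reviews saved to {csv_file_path}')

if __name__ == "__main__":
    main()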


Reviews saved to Titanic_reviews.csv


In [5]:
import pandas as pd
df = pd.read_csv('Titanic_reviews.csv')
df.head()

Unnamed: 0,Review
0,I have watched Titanic how many times I don't ...
1,The stage curtains open ...Not since the adven...
2,"Ah, yes, the film that propelled Leonardi DiCa..."
3,Very beautiful and cinematic movie with lots o...
4,"Back in 1997, do I remember that year: Clinton..."


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the data in a new column in the csv file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all text.

(5) Stemming.

(6) Lemmatization.

In [8]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [9]:
pip install textblob

Note: you may need to restart the kernel to use updated packages.


In [11]:
# Replace every character that is not an ASCII letter or digit with a space.
# One regex pass covers punctuation and special characters. regex=True is
# required because pandas 2.0+ treats the pattern as a literal string by default.
df['Reviews after Noise Removal'] = df['Review'].str.replace(r'[^a-zA-Z0-9]', ' ', regex=True)


In [12]:
df['After digits removal'] = df['Reviews after Noise Removal'].apply(lambda text: ''.join([char for char in text if not char.isdigit()]))


In [13]:
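from nltk.corpus import stopwords

# A set gives constant-time membership tests.
stop_words = set(stopwords.words('english'))

# Note: the stopword list is lowercase and this comparison is case-sensitive,
# so capitalized tokens such as "I" or "The" are not removed at this step.
df['Stopwords Removal'] = df['After digits removal'].apply(lambda text: " ".join(word for word in text.split() if word not in stop_words))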


In [14]:
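from nltk.stem import PorterStemmer
from textblob import Word

nltk.download('wordnet')  # WordNet data is required by the lemmatizer

# Lowercase every token.
df['Lower Case'] = df['Stopwords Removal'].apply(lambda text: " ".join(word.lower() for word in text.split()))

# Stemming: reduce each token to its Porter stem.
porter_stemmer = PorterStemmer()

df['After Stemming'] = df['Lower Case'].apply(lambda text: " ".join(porter_stemmer.stem(word) for word in text.split()))

# Lemmatization runs on the stemmed tokens. Stems that are not dictionary
# words (e.g. "titan", "mani") pass through unchanged, which is why the last
# two columns look almost identical.
df['After Lemmatization'] = df['After Stemming'].apply(lambda text: " ".join(Word(word).lemmatize() for word in text.split()))

df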


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/bhanuprasadkommula/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Review,Reviews after Noise Removal,After digits removal,Stopwords Removal,Lower Case,After Stemming,After Lemmatization
0,I have watched Titanic how many times I don't ...,I have watched Titanic how many times I don t ...,I have watched Titanic how many times I don t ...,I watched Titanic many times I know Everytime ...,i watched titanic many times i know everytime ...,i watch titan mani time i know everytim i watc...,i watch titan mani time i know everytim i watc...
1,The stage curtains open ...Not since the adven...,The stage curtains open Not since the adven...,The stage curtains open Not since the adven...,The stage curtains open Not since advent film ...,the stage curtains open not since advent film ...,the stage curtain open not sinc advent film br...,the stage curtain open not sinc advent film br...
2,"Ah, yes, the film that propelled Leonardi DiCa...",Ah yes the film that propelled Leonardi DiCa...,Ah yes the film that propelled Leonardi DiCa...,Ah yes film propelled Leonardi DiCapro super s...,ah yes film propelled leonardi dicapro super s...,ah ye film propel leonardi dicapro super stard...,ah ye film propel leonardi dicapro super stard...
3,Very beautiful and cinematic movie with lots o...,Very beautiful and cinematic movie with lots o...,Very beautiful and cinematic movie with lots o...,Very beautiful cinematic movie lots classic sc...,very beautiful cinematic movie lots classic sc...,veri beauti cinemat movi lot classic scene als...,veri beauti cinemat movi lot classic scene als...
4,"Back in 1997, do I remember that year: Clinton...",Back in 1997 do I remember that year Clinton...,Back in do I remember that year Clinton ban...,Back I remember year Clinton bans cloning rese...,back i remember year clinton bans cloning rese...,back i rememb year clinton ban clone research ...,back i rememb year clinton ban clone research ...
...,...,...,...,...,...,...,...
8370,I am still crying as I am writing this review ...,I am still crying as I am writing this review ...,I am still crying as I am writing this review ...,I still crying I writing review right I even k...,i still crying i writing review right i even k...,i still cri i write review right i even know b...,i still cri i write review right i even know b...
8371,This movie re-wrote film history in every way....,This movie re wrote film history in every way ...,This movie re wrote film history in every way ...,This movie wrote film history every way No one...,this movie wrote film history every way no one...,thi movi wrote film histori everi way no one c...,thi movi wrote film histori everi way no one c...
8372,Titanic is the film ive viewed 2nd most of all...,Titanic is the film ive viewed 2nd most of all...,Titanic is the film ive viewed nd most of all ...,Titanic film ive viewed nd kinda guy watche go...,titanic film ive viewed nd kinda guy watche go...,titan film ive view nd kinda guy watch good mo...,titan film ive view nd kinda guy watch good mo...
8373,James Cameron's 'Titanic' shares a similar mot...,James Cameron s Titanic shares a similar mot...,James Cameron s Titanic shares a similar mot...,James Cameron Titanic shares similar motto Mar...,james cameron titanic shares similar motto mar...,jame cameron titan share similar motto marmit ...,jame cameron titan share similar motto marmit ...
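
In the table above, capitalized words such as "I" and "The" survive the stopword step, because the stopword list is lowercase and the comparison is case-sensitive. A minimal sketch of the alternative ordering, lowercasing before filtering, is shown below. The tiny `STOP_WORDS` set, the function `clean_text`, and the sample sentence are illustrative assumptions rather than the notebook's code; in practice the full stopword list from step (3) would be used.

```python
import re

# Illustrative subset only; the real stopword list is much longer.
STOP_WORDS = {"i", "have", "how", "the", "a", "in", "do", "that", "t", "don"}


def clean_text(text, stop_words=STOP_WORDS):
    """Remove noise and digits, lowercase, then drop stopwords."""
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)  # noise characters -> spaces
    text = re.sub(r"\d", "", text)             # drop digits
    tokens = text.lower().split()              # lowercase BEFORE filtering
    return " ".join(tok for tok in tokens if tok not in stop_words)


cleaned = clean_text("I have watched Titanic how many times I don't know")
print(cleaned)
```

Because the tokens are lowercased first, "I" is matched against the lowercase list and removed along with the other stopwords.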


In [15]:
df.to_csv('Titanic_Reviews.csv', index=False)


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) Constituency Parsing and Dependency Parsing: print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [16]:
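from nltk.tokenize import word_tokenize

# Tokenize and tag each lowercased review. word_tokenize and pos_tag require
# the NLTK tokenizer and tagger resources to have been downloaded.
pos_tagged_sentences = []

for sentence in df['Lower Case']:
    tokens = word_tokenize(sentence)
    pos_tagged_sentences.append(nltk.pos_tag(tokens))

pos_tagged_sentences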




[[('i', 'NN'),
  ('watched', 'VBD'),
  ('titanic', 'RB'),
  ('many', 'JJ'),
  ('times', 'NNS'),
  ('i', 'VBP'),
  ('know', 'VBP'),
  ('everytime', 'RB'),
  ('i', 'JJ'),
  ('watch', 'VBP'),
  ('i', 'NN'),
  ('still', 'RB'),
  ('cry', 'VBZ'),
  ('laugh', 'IN'),
  ('smile', 'NN'),
  ('feel', 'VBP'),
  ('the', 'DT'),
  ('story', 'NN'),
  ('flows', 'VBZ'),
  ('tension', 'NN'),
  ('throughout', 'IN'),
  ('movie', 'NN'),
  ('two', 'CD'),
  ('actors', 'NNS'),
  ('acting', 'VBG'),
  ('chemistry', 'NN'),
  ('need', 'NN'),
  ('applaud', 'IN'),
  ('sinking', 'VBG'),
  ('ship', 'NN'),
  ('realistically', 'RB'),
  ('filmed', 'VBD'),
  ('my', 'PRP$'),
  ('heart', 'NN'),
  ('will', 'MD'),
  ('go', 'VB'),
  ('on', 'IN'),
  ('perfect', 'JJ'),
  ('fit', 'JJ'),
  ('jack', 'NN'),
  ('roses', 'NNS'),
  ('love', 'VBP'),
  ('story', 'NN'),
  ('timeless', 'NN'),
  ('well', 'RB'),
  ('all', 'DT'),
  ('movie', 'NN'),
  ('factors', 'NNS'),
  ('fully', 'RB'),
  ('qualified', 'VBD'),
  ('what', 'WP'),
  ('i', 'VBZ'
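
Question 3-1 also asks for the total number of nouns, verbs, adjectives, and adverbs. A minimal sketch of that count follows, assuming the Penn Treebank tagset that `nltk.pos_tag` returns by default (noun tags start with `NN`, verb tags with `VB`, adjective tags with `JJ`, adverb tags with `RB`). The helper name `count_pos_categories` and the small `sample` list are illustrative and not part of the original notebook; in the notebook the function would be called on `pos_tagged_sentences`.

```python
from collections import Counter

# Penn Treebank tag prefixes for the four categories asked for in Question 3-1.
POS_PREFIXES = {"Noun": "NN", "Verb": "VB", "Adjective": "JJ", "Adverb": "RB"}


def count_pos_categories(tagged_sentences):
    """Count nouns, verbs, adjectives and adverbs in a list of tagged sentences."""
    counts = Counter({category: 0 for category in POS_PREFIXES})
    for sentence in tagged_sentences:
        for _word, tag in sentence:
            for category, prefix in POS_PREFIXES.items():
                if tag.startswith(prefix):
                    counts[category] += 1
                    break
    return counts


# Illustrative input: a few (word, tag) pairs copied from the output above.
sample = [[("watched", "VBD"), ("titanic", "RB"), ("many", "JJ"),
           ("times", "NNS"), ("story", "NN"), ("flows", "VBZ")]]
pos_counts = count_pos_categories(sample)
print(dict(pos_counts))
```

Tags outside the four categories (for example `DT` or `IN`) are simply ignored by the prefix match.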

In [17]:
pip install spacy


Note: you may need to restart the kernel to use updated packages.


In [26]:
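from collections import Counter

import spacy

# The small English model must be installed first, e.g. with
# `python -m spacy download en_core_web_sm`.
nlp = spacy.load("en_core_web_sm")

# Despite the variable name, this is the original 'Review' column, not a
# cleaned one, so the NER model sees the original casing and punctuation.
clean_texts = df['Review']

def extract_entities_and_counts(texts):
    """Count how many entities of each label spaCy finds across all texts."""
    entity_counter = Counter()

    for text in texts:
        doc = nlp(text)
        for ent in doc.ents:
            entity_counter[ent.label_] += 1

    return entity_counter

entity_counts = extract_entities_and_counts(clean_texts)

print("Entity Counts:")
for entity_type, count in entity_counts.items():
    print(f"{entity_type}: {count}")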


Entity Counts:
ORG: 18090
CARDINAL: 9380
PERSON: 38525
DATE: 15075
WORK_OF_ART: 3015
ORDINAL: 7035
MONEY: 335
PERCENT: 335
PRODUCT: 335
NORP: 1340
QUANTITY: 335
GPE: 1675
TIME: 4355
EVENT: 335
LOC: 335
FAC: 335
LAW: 335
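
Every count above is a multiple of 335, which may indicate that the same small set of reviews was collected many times (for example, if each page request returned identical content). A minimal sketch of a duplicate check using only the standard library is shown below; the helper name `duplicate_summary` and the sample list are illustrative, and in the notebook the input would be `df['Review']`.

```python
from collections import Counter


def duplicate_summary(texts):
    """Return (total, unique, max_repeats) for a sequence of strings."""
    counts = Counter(texts)
    total = sum(counts.values())
    unique = len(counts)
    max_repeats = max(counts.values()) if counts else 0
    return total, unique, max_repeats


# Illustrative data: three distinct reviews, each repeated four times.
sample_reviews = ["great film", "too long", "a classic"] * 4
summary = duplicate_summary(sample_reviews)
print(summary)
```

If `unique` is much smaller than `total`, removing duplicates (for example with `df.drop_duplicates(subset='Review')`) before the analysis keeps the entity counts from being inflated.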


**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

Constituency Parsing Tree

A constituency parsing tree, also known as a phrase structure tree, represents the hierarchical structure of a sentence in terms of its grammatical constituents (phrases). The leaves are the words of the sentence, and each internal node groups the words below it into a phrase such as a noun phrase (NP) or a verb phrase (VP). The tree therefore shows how words combine into phrases and how phrases combine into the full sentence.
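
As a concrete illustration, the sketch below encodes a constituency tree for the short sentence "the ship sank quickly" as nested tuples and prints it in bracketed form. The sentence, the tree, and the helper names `leaves` and `to_brackets` are hand-built assumptions for illustration; this is not the output of a parser run on the reviews.

```python
# A node is (label, child, child, ...); a leaf is a plain string.
TREE = ("S",
        ("NP", ("DT", "the"), ("NN", "ship")),
        ("VP", ("VBD", "sank"), ("ADVP", ("RB", "quickly"))))


def leaves(node):
    """Return the words of the tree from left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def to_brackets(node):
    """Render the tree in bracketed notation."""
    if isinstance(node, str):
        return node
    return "(" + " ".join([node[0]] + [to_brackets(c) for c in node[1:]]) + ")"


bracketed = to_brackets(TREE)
print(bracketed)
print(leaves(TREE))
```

Here S (the sentence) splits into a noun phrase NP ("the ship") and a verb phrase VP ("sank quickly"), and each phrase in turn splits down to part-of-speech tags and words.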

Dependency Parsing Tree

A dependency parsing tree represents the grammatical relationships between the words of a sentence. Unlike a constituency parsing tree, which groups words into phrases, a dependency tree shows how each word depends on or relates to another word. In this tree:

- Each word in the sentence is represented as a node.
- Arrows between nodes indicate syntactic relationships, showing which word depends on which.
- The root of the tree typically corresponds to the main verb or the central element of the sentence.
- Nodes branching from the root represent words that depend on or modify the root.
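
For a short sentence such as "the ship sank quickly", these relations can be written down as (head, relation, dependent) triples. The sketch below stores hand-annotated triples and recovers the root and its direct dependents. The annotation and the helper names `find_root` and `dependents_of` are illustrative assumptions, not parser output; the relation labels (`nsubj`, `det`, `advmod`) follow common dependency-label conventions.

```python
# Each triple is (head, relation, dependent); "ROOT" is a virtual head.
DEPENDENCIES = [
    ("ROOT", "root", "sank"),
    ("sank", "nsubj", "ship"),
    ("ship", "det", "the"),
    ("sank", "advmod", "quickly"),
]


def find_root(dependencies):
    """Return the word attached to the virtual ROOT node, or None."""
    for head, _relation, dependent in dependencies:
        if head == "ROOT":
            return dependent
    return None


def dependents_of(word, dependencies):
    """Return (relation, dependent) pairs whose head is `word`."""
    return [(rel, dep) for head, rel, dep in dependencies if head == word]


root = find_root(DEPENDENCIES)
print(root, dependents_of(root, DEPENDENCIES))
```

The main verb "sank" is the root; "ship" depends on it as the subject, "quickly" as an adverbial modifier, and "the" depends on "ship" as its determiner.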