<a href="https://colab.research.google.com/github/amulyabodempudi/amulya_INFO5731_Fall2023/blob/main/Amulya_Bodempudi_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Two**

In this assignment, you will gather text data from an open data source via web scraping or an API. You will then clean the text data and perform syntactic analysis on it.

# **Question 1**

(40 points). Write a Python program to collect text data from **one of the following sources** and save the data to a **CSV file**:

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon.

(2) Collect the top 10000 user reviews of a film released in 2022 or 2023 (you can choose any film) from IMDb.

(3) Collect all the reviews of the top 1000 most popular software products from [G2](https://www.g2.com/) or [Capterra](https://www.capterra.com/).

(4) Collect the abstracts of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from [Semantic Scholar](https://www.semanticscholar.org).

(5) Collect all the information of the 904 narrators in the [Densho Digital Repository](https://ddr.densho.org/narrators/).

(6) Collect the top 10000 Reddit posts by using a hashtag (you can use any hashtag) from Reddit.


In [9]:
# Write your code here
# I chose option (2): collecting 10000 reviews of the movie "John Wick: Chapter 4" from IMDb
import requests
from bs4 import BeautifulSoup
import csv

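def scrape_imdb_reviews(movie_id, num_reviews=10000):
    base_url = f'https://www.imdb.com/title/{movie_id}/reviews?ref_=tt_ql_3'

    reviews = []
    seen = set()  # review texts already collected
    page_number = 1

    while len(reviews) < num_reviews:
        url = f"{base_url}&start={page_number}"
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        soup = BeautifulSoup(response.text, 'html.parser')
        review_elements = soup.find_all('div', class_='text show-more__control')

        new_reviews = []
        for review in review_elements:
            text = review.text.strip()
            if text not in seen:
                seen.add(text)
                new_reviews.append(text)

        # Stop once a page adds nothing new, so that a pagination parameter
        # the site ignores cannot fill the list with copies of the same page.
        if not new_reviews:
            break

        reviews.extend(new_reviews)
        page_number += 1

    return reviews[:num_reviews]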

def save_to_csv(data, filename='imdb_reviews.csv'):
    with open(filename, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['Review'])
        for review in data:
            writer.writerow([review])

# IMDb title ID for "John Wick: Chapter 4"
movie_id = 'tt10366206'
movie_reviews = scrape_imdb_reviews(movie_id, num_reviews=10000)

if movie_reviews:
    save_to_csv(movie_reviews)
    print(f"Successfully collected and saved {len(movie_reviews)} reviews for movie Johnwick:Chapter 4.")


Successfully collected and saved 10000 reviews for movie Johnwick:Chapter 4.
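
Because a scraper can return the same page more than once, it may be worth confirming that the saved rows are distinct before moving on. The sketch below assumes the single-column layout written by `save_to_csv`; the helper name `count_unique_reviews` and the demo file are illustrative only and are not part of the original notebook.

```python
import csv
import os
import tempfile


def count_unique_reviews(filename):
    """Return (total_rows, unique_rows) for a CSV with one header row."""
    with open(filename, newline='', encoding='utf-8') as csv_file:
        reader = csv.reader(csv_file)
        next(reader, None)  # skip the header row
        rows = [row[0] for row in reader if row]
    return len(rows), len(set(rows))


# Small demo on a throwaway file (made-up rows, not real scraped data)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_reviews.csv')
with open(demo_path, 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Review'])
    writer.writerows([['great action'], ['too long'], ['great action']])

total, unique = count_unique_reviews(demo_path)
print(f"{total} rows, {unique} unique")
```

Calling `count_unique_reviews('imdb_reviews.csv')` on the real file would show whether the row count and the number of distinct reviews agree.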


# **Question 2**

(30 points). Write a Python program to **clean the text data** you collected above and save the cleaned data in a new column in the CSV file. The data cleaning steps include:

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the [stopwords list](https://gist.github.com/sebleier/554280).

(4) Lowercase all texts.

(5) Stemming.

(6) Lemmatization.

In [11]:
# Write your code here
# Import the required libraries
import requests
from bs4 import BeautifulSoup
import csv
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Build these once instead of on every call to clean_text
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
LEMMATIZER = WordNetLemmatizer()

def clean_text(text):
    # (1) + (2) Keep only letters and whitespace, which drops punctuation,
    # special characters, and numbers in a single pass
    text = ''.join(char for char in text if char.isalpha() or char.isspace())

    # (4) Lowercase before the stopword check so "The" and "the" both match
    tokens = nltk.word_tokenize(text.lower())

    # (3) Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # (5) Stemming
    tokens = [STEMMER.stem(word) for word in tokens]

    # (6) Lemmatization (applied to the stems, since the steps are chained)
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]

    return ' '.join(tokens)

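def scrape_imdb_reviews(movie_id, num_reviews=10000):
    base_url = f'https://www.imdb.com/title/{movie_id}/reviews?ref_=tt_ql_3'

    reviews = []
    seen = set()  # raw review texts already collected
    page_number = 1

    while len(reviews) < num_reviews:
        url = f"{base_url}&start={page_number}"
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        soup = BeautifulSoup(response.text, 'html.parser')
        review_elements = soup.find_all('div', class_='text show-more__control')

        new_reviews = []
        for review in review_elements:
            # Keep the raw text here; save_to_csv does the cleaning, so the
            # 'Original Review' column holds the unmodified review.
            raw_review = review.text.strip()
            if raw_review not in seen:
                seen.add(raw_review)
                new_reviews.append(raw_review)

        # Stop once a page adds nothing new, so that a pagination parameter
        # the site ignores cannot fill the list with copies of the same page.
        if not new_reviews:
            break

        reviews.extend(new_reviews)
        page_number += 1

    return reviews[:num_reviews]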

def save_to_csv(data, filename='imdb_reviews_cleaned.csv'):
    with open(filename, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['Original Review', 'Cleaned Review'])
        for original_review in data:
            writer.writerow([original_review, clean_text(original_review)])

# IMDb title ID for "John Wick: Chapter 4"
movie_id = 'tt10366206'
movie_reviews = scrape_imdb_reviews(movie_id, num_reviews=10000)
save_to_csv(movie_reviews)





[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
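
The question asks for the cleaned text in a new column of the existing file, which can be done without scraping again. A minimal sketch, assuming the CSV from Question 1 has a single `Review` column: `add_cleaned_column` and `simple_clean` are hypothetical names, and `simple_clean` covers only steps (1), (2), and (4) with the standard library so that the block runs without any NLTK data.

```python
import csv
import os
import re
import tempfile


def simple_clean(text):
    """Steps (1), (2), (4) only: drop everything except ASCII letters, then lowercase."""
    text = re.sub(r'[^A-Za-z\s]', ' ', text)
    return ' '.join(text.lower().split())


def add_cleaned_column(src, dst, clean=simple_clean):
    """Copy src to dst, appending a 'Cleaned Review' column."""
    with open(src, newline='', encoding='utf-8') as fin, \
         open(dst, 'w', newline='', encoding='utf-8') as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader)
        writer.writerow(header + ['Cleaned Review'])
        for row in reader:
            if not row:  # skip blank lines
                continue
            writer.writerow(row + [clean(row[0])])


# Demo on a throwaway file with one made-up review
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'raw.csv')
dst = os.path.join(tmp_dir, 'cleaned.csv')
with open(src, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows([['Review'], ['Loved it!!! 10/10, would watch AGAIN.']])

add_cleaned_column(src, dst)
with open(dst, newline='', encoding='utf-8') as f:
    result = list(csv.reader(f))
print(result)
```

Passing `clean=clean_text` instead of the default would apply the full six-step pipeline defined in the cell above.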


# **Question 3**

(30 points). Write a Python program to conduct **syntax and structure analysis** of the clean text you just saved above. The syntax and structure analysis includes:

(1) Parts of Speech (POS) Tagging: Tag the part of speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), and Adv(erb) tags, respectively.

(2) Constituency Parsing and Dependency Parsing: Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) Named Entity Recognition: Extract all the entities such as person names, organizations, locations, product names, and dates from the clean texts, and calculate the count of each entity.

In [8]:
# Write your code here

import csv
from collections import Counter

import nltk
from nltk import pos_tag, ne_chunk
from nltk.chunk import tree2conlltags
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tree import Tree

# Resources needed by the tokenizer, the POS tagger, and the NE chunker
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

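def pos_tagging(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    # Collapse Penn Treebank tags (NN, NNS, VBD, JJR, RB, ...) into the four
    # classes the question asks for, using the first two letters of each tag
    classes = {'NN': 'Noun', 'VB': 'Verb', 'JJ': 'Adjective', 'RB': 'Adverb'}
    pos_counts = Counter(classes[tag[:2]] for _, tag in pos_tags if tag[:2] in classes)
    return pos_tags, pos_counts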

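def constituency_parsing(text):
    # ne_chunk only marks named entities, so it does not yield phrase structure.
    # As an approximation, this uses a small hand-written chunk grammar
    # (shallow parsing); a full constituency parse needs a dedicated parser.
    grammar = r"""
      NP: {<DT>?<JJ.*>*<NN.*>+}
      PP: {<IN><NP>}
      VP: {<VB.*><NP|PP>*}
    """
    chunker = nltk.RegexpParser(grammar)
    for sentence in sent_tokenize(text):
        tagged = pos_tag(word_tokenize(sentence))
        print("Constituency Parsing Tree:")
        print(chunker.parse(tagged))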

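def dependency_parsing(text):
    # NOTE: tree2conlltags returns (word, POS tag, IOB chunk tag) triples.
    # These are chunk labels, not head-dependent relations, so this output is
    # only a stand-in; a real dependency parse needs a dedicated parser.
    sentences = sent_tokenize(text)
    for sentence in sentences:
        words = word_tokenize(sentence)
        tagged = pos_tag(words)
        parsing_tree = ne_chunk(tagged)
        conll_tags = tree2conlltags(parsing_tree)
        print("CoNLL IOB tags (stand-in for a dependency parse):")
        for tag in conll_tags:
            print(tag)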

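def named_entity_recognition(text):
    entities = []
    for sentence in sent_tokenize(text):
        tagged = pos_tag(word_tokenize(sentence))
        # binary=False keeps the entity type (PERSON, ORGANIZATION, GPE, ...)
        parsing_tree = ne_chunk(tagged, binary=False)
        # Named entities are the subtrees; plain tokens are (word, tag) tuples.
        # The cleaned text is lowercased and stemmed, so few entities may be found.
        for subtree in parsing_tree:
            if isinstance(subtree, Tree):
                name = ' '.join(word for word, _ in subtree.leaves())
                entities.append((name, subtree.label()))
    entity_counts = Counter(entities)
    return entity_counts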

# Load cleaned reviews from the CSV file
cleaned_reviews = []
with open('imdb_reviews_cleaned.csv', 'r', encoding='utf-8') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # Skip header
    for row in reader:
        cleaned_reviews.append(row[1])

# Performing analyses on a sample review
sample_review = cleaned_reviews[0]

# (1) Parts of Speech (POS) Tagging
pos_tags, pos_counts = pos_tagging(sample_review)
print("\nParts of Speech (POS) Tagging:")
print(pos_tags)
print("POS Counts:", pos_counts)

# (2) Constituency Parsing and Dependency Parsing
constituency_parsing(sample_review)
dependency_parsing(sample_review)

# (3) Named Entity Recognition
entity_counts = named_entity_recognition(sample_review)
print("\nNamed Entity Recognition:")
print(entity_counts)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!



Parts of Speech (POS) Tagging:
[('imagin', 'NN'), ('video', 'NN'), ('game', 'NN'), ('shoot', 'NN'), ('bad', 'JJ'), ('guy', 'NN'), ('hardwar', 'NN'), ('old', 'JJ'), ('everyth', 'NN'), ('kind', 'NN'), ('slow', 'JJ'), ('focu', 'NN'), ('oppon', 'NN'), ('set', 'VBN'), ('easi', 'JJ'), ('instal', 'JJ'), ('hack', 'NN'), ('give', 'VBP'), ('invinc', 'NN'), ('autoaim', 'NN'), ('come', 'VBP'), ('slowli', 'NN'), ('shout', 'NN'), ('open', 'JJ'), ('weapon', 'NN'), ('fire', 'NN'), ('three', 'CD'), ('four', 'CD'), ('bullet', 'NN'), ('run', 'VB'), ('shoot', 'JJ'), ('anyth', 'NN'), ('anyway', 'RB'), ('use', 'JJ'), ('sniper', 'JJ'), ('explo', 'NN'), ('trap', 'NN'), ('kind', 'NN'), ('cant', 'JJ'), ('even', 'RB'), ('hit', 'VBP'), ('theyr', 'JJ'), ('next', 'JJ'), ('wield', 'NN'), ('knife', 'NN'), ('tri', 'NN'), ('fight', 'NN'), ('decent', 'NN'), ('manner', 'NN'), ('yet', 'RB'), ('avatar', 'JJ'), ('move', 'NN'), ('like', 'IN'), ('year', 'NN'), ('old', 'JJ'), ('man', 'NN'), ('even', 'RB'), ('autoaim', 'VBZ'),

**Write your explanations of the constituency parsing tree and dependency parsing tree here (Question 3-2):**

**Constituency Parsing Tree:**
Constituency parsing is a natural language parsing technique that divides a sentence into smaller
constituents or phrases such as noun phrases (NP), verb phrases (VP), prepositional phrases (PP), and others.
As a result, a hierarchical tree structure representing the syntactic structure of a sentence is produced.
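
To make this concrete, the sketch below encodes one possible constituency tree for a made-up sentence, "The assassin fights bad guys", as nested Python tuples. The bracketing is a hand-written illustration rather than parser output, and the helper names `leaves` and `bracketed` are hypothetical.

```python
# Each node is (label, child, child, ...); a leaf is a plain string.
tree = (
    'S',
    ('NP', ('DT', 'The'), ('NN', 'assassin')),
    ('VP',
        ('VBZ', 'fights'),
        ('NP', ('JJ', 'bad'), ('NNS', 'guys'))),
)


def leaves(node):
    """Return the words of the tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def bracketed(node):
    """Render the tree in bracketed notation."""
    if isinstance(node, str):
        return node
    return '(' + ' '.join([node[0]] + [bracketed(c) for c in node[1:]]) + ')'


print(bracketed(tree))
print(leaves(tree))
```

Every phrase label (S, NP, VP) covers a contiguous span of words, and the spans nest: the object NP sits inside the VP, which sits inside S.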

**Dependency Parsing Tree:**
Dependency parsing is another approach to analyzing a sentence's grammatical structure, but it focuses on the relationships between individual words, expressed as directed links (dependencies). Each word in the sentence is treated as a node, and each link connects a head word to one of its dependents and is labelled with their grammatical relation, such as subject or object.
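
As an illustration, a made-up sentence, "The assassin fights bad guys", can be written as a dependency structure with one head per word and a labelled arc from each head to its dependent. The arcs and relation labels below are hand-written in a Universal-Dependencies-like style and are not parser output; the helper name `dependents` is hypothetical.

```python
# (index, word, head_index, relation); head 0 is the artificial ROOT node.
arcs = [
    (1, 'The', 2, 'det'),
    (2, 'assassin', 3, 'nsubj'),
    (3, 'fights', 0, 'root'),
    (4, 'bad', 5, 'amod'),
    (5, 'guys', 3, 'obj'),
]

# Lookup table from word index to word, including the artificial root
words = {index: word for index, word, _, _ in arcs}
words[0] = 'ROOT'


def dependents(head_index):
    """Return the words directly attached to the given head."""
    return [word for _, word, head, _ in arcs if head == head_index]


# Print one line per arc, from head to dependent
for index, word, head, relation in arcs:
    print(f"{words[head]} --{relation}--> {word}")

roots = dependents(0)
```

There are no phrase nodes in this representation: the verb is the root, and every other word attaches directly to exactly one head.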


