<a href="https://colab.research.google.com/github/Yusmitha-Lekha/YusmithaLekha_INFO5731_Fall2024/blob/main/Yusmithalekha_Prathi_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Tuesday, at 11:59 PM.

**Late submissions incur a 10% penalty for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (40 points)

Write a Python program to collect text data from **one of the following sources** and save the data into a **CSV file:**

(1) Collect all the customer reviews of a product (you can choose any product) on Amazon. [at least 1000 reviews]

(2) Collect the top 1000 user reviews of a movie released in 2023 or 2024 (you can choose any movie) from IMDb. [If one movie doesn't have sufficient reviews, collect reviews of at least 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artificial intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [1]:
import requests
from bs4 import BeautifulSoup
import csv
import time

# Base URL of the IMDb reviews AJAX endpoint for the chosen title; the pagination key is appended to it
url = 'https://www.imdb.com/title/tt15239678/reviews/_ajax?ref_=undefined&paginationKey='

# Headers to mimic a real browser visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# Initialize the variables used for pagination and for storing the reviews
reviews = []
pagination_key = ''
total_reviews_needed = 1000  # top 1000 reviews

# Opened the  CSV file for the purpose of writing
with open('imdb_reviews.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Review'])  # Write header row

    # Loop to go through all the pages until we have enough(1000) reviews
    while len(reviews) < total_reviews_needed:
        # Construct the full URL with the pagination key
        full_url = url + pagination_key

        # Sending a GET request to the URL (with a timeout so a stalled request cannot hang the loop)
        response = requests.get(full_url, headers=headers, timeout=30)

        # Parsing the page content
        soup = BeautifulSoup(response.content, 'html.parser')

        # Finding all the review-containers
        new_reviews = soup.find_all('div', class_='text show-more__control')

        # If no new reviews are being found, break the loop which is (end of pages)
        if not new_reviews:
            print("No more reviews found.")
            break

        # Adding  the new reviews to the final or the total list
        for review in new_reviews:
            review_text = review.get_text().strip()
            if len(reviews) < total_reviews_needed:
                reviews.append(review_text)
                writer.writerow([review_text])
            else:
                break  # Break if we already have the 1000 reviews

        # Update the pagination key for the next page
        load_more_data = soup.find('div', {'class': 'load-more-data'})
        if load_more_data and load_more_data.has_attr('data-key'):
            pagination_key = load_more_data['data-key']
        else:
            break  # Break here, if no pagination key is being found

        # Pausing between requests to avoid being blocked
        time.sleep(1)

# Printing the number of reviews scraped
print(f'Scraped {len(reviews)} reviews successfully!')


Scraped 1000 reviews successfully!
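
The loop above interleaves fetching, parsing, and the stop conditions. As a minimal sketch, the same key-based pagination can be written with the network call factored out so the stopping logic can be checked offline. The names `collect_items`, `fake_fetch`, and `FAKE_PAGES` are hypothetical and not part of the notebook, and the in-memory pages merely stand in for IMDb responses.

```python
from typing import Callable, List, Optional, Tuple

# A page fetcher takes a pagination key and returns (items, next_key).
# next_key is None when there are no further pages.
Page = Tuple[List[str], Optional[str]]


def collect_items(fetch_page: Callable[[str], Page], limit: int) -> List[str]:
    """Follow pagination keys until `limit` items are collected or pages run out."""
    items: List[str] = []
    key: Optional[str] = ""  # an empty key requests the first page
    while key is not None and len(items) < limit:
        page_items, key = fetch_page(key)
        if not page_items:
            break  # an empty page is treated as the end of the listing
        # Take only as many items as are still needed
        items.extend(page_items[: limit - len(items)])
    return items


# Hypothetical in-memory pages standing in for real HTTP responses
FAKE_PAGES = {
    "": (["r1", "r2"], "k1"),
    "k1": (["r3", "r4"], "k2"),
    "k2": (["r5"], None),
}


def fake_fetch(key: str) -> Page:
    return FAKE_PAGES[key]


print(collect_items(fake_fetch, limit=3))
print(len(collect_items(fake_fetch, limit=1000)))
```

In the notebook's setting, `fetch_page` would wrap the `requests.get` call and the BeautifulSoup lookups for the review text and the `data-key` attribute.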


# Question 2 (30 points)

Write a Python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the CSV file. The data cleaning steps include: [Code and output are required for each part]

(1) Remove noise, such as special characters and punctuation.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [2]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Downloading the NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [3]:
# Loading the reviews from the CSV file
dataFrame = pd.read_csv('imdb_reviews.csv')

# first few rows of the original data
dataFrame.head()


Unnamed: 0,Review
0,This is what Hollywood needs. A great story wi...
1,I'm going to write this as a review for both D...
2,Had the pleasure to watch this film in an earl...
3,Phenomenal stuff. I'll probably calm down tomo...
4,"If you liked or loved the first one, the same ..."


# (1) Remove noise, such as special characters and punctuation.

In [18]:
# Function for removing the noise
def remove_noise(text):
    return re.sub(r'[^A-Za-z\s]', '', text)  # Keep only letters and whitespace

# Applying the function
dataFrame['Noisy Removed'] = dataFrame['Review'].apply(remove_noise)

# Displaying the updated DataFrame
dataFrame[['Review', 'Noisy Removed']].head()


Unnamed: 0,Review,Noisy Removed
0,This is what Hollywood needs. A great story wi...,This is what Hollywood needs A great story wit...
1,I'm going to write this as a review for both D...,Im going to write this as a review for both Du...
2,Had the pleasure to watch this film in an earl...,Had the pleasure to watch this film in an earl...
3,Phenomenal stuff. I'll probably calm down tomo...,Phenomenal stuff Ill probably calm down tomorr...
4,"If you liked or loved the first one, the same ...",If you liked or loved the first one the same w...


# (2) Remove numbers.

In [5]:
# Function to remove numbers
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

# Apply the function
dataFrame['Numbers Removed'] = dataFrame['Noisy Removed'].apply(remove_numbers)

# Display the updated DataFrame
dataFrame[['Review', 'Numbers Removed']].head()


Unnamed: 0,Review,Numbers Removed
0,This is what Hollywood needs. A great story wi...,This is what Hollywood needs A great story wit...
1,I'm going to write this as a review for both D...,Im going to write this as a review for both Du...
2,Had the pleasure to watch this film in an earl...,Had the pleasure to watch this film in an earl...
3,Phenomenal stuff. I'll probably calm down tomo...,Phenomenal stuff Ill probably calm down tomorr...
4,"If you liked or loved the first one, the same ...",If you liked or loved the first one the same w...
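
The 'Numbers Removed' column above is identical to 'Noisy Removed' in every displayed row. This is expected: the pattern in `remove_noise` keeps only letters and whitespace, so digits are already gone before `remove_numbers` runs. A small check, using a made-up sample string rather than a real review:

```python
import re


def remove_noise(text):
    # Same pattern as the notebook: keep only letters and whitespace
    return re.sub(r'[^A-Za-z\s]', '', text)


def remove_numbers(text):
    # Same pattern as the notebook: drop runs of digits
    return re.sub(r'\d+', '', text)


sample = "Dune 2 (2024) scored 9/10!"  # made-up example, not from the dataset
after_noise = remove_noise(sample)
after_numbers = remove_numbers(after_noise)

print(after_noise.split())
print(after_noise == after_numbers)
```

If the numbers step should be visible on its own, one option is to run it on the original 'Review' column before the noise step.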


# (3) Remove stopwords by using the stopwords list.

In [6]:
# Initializing the stopwords
stop_words = set(stopwords.words('english'))

# Function for removing the stopwords
def remove_stopwords(text):
    text_tokens = text.split()
    return ' '.join([word for word in text_tokens if word.lower() not in stop_words])

# Applying the function
dataFrame['Stopwords Removed'] = dataFrame['Numbers Removed'].apply(remove_stopwords)

# Displaying the updated DataFrame
dataFrame[['Review', 'Stopwords Removed']].head()


Unnamed: 0,Review,Stopwords Removed
0,This is what Hollywood needs. A great story wi...,Hollywood needs great story great directorprod...
1,I'm going to write this as a review for both D...,Im going write review Dune movies Ill include ...
2,Had the pleasure to watch this film in an earl...,pleasure watch film early screening completely...
3,Phenomenal stuff. I'll probably calm down tomo...,Phenomenal stuff Ill probably calm tomorrow ri...
4,"If you liked or loved the first one, the same ...",liked loved first one apply one Personally lov...
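
In the output above, tokens such as "Im" and "Ill" survive stopword removal because the apostrophes were stripped in step (1), leaving strings that no longer look like contractions. One possible alternative, sketched here under assumptions, is to filter stopwords while apostrophes are still present and strip them afterwards. `STOP_WORDS` is a tiny illustrative set and `remove_stopwords_first` is a hypothetical helper; neither is the NLTK list or the notebook's function.

```python
import re

# Tiny illustrative stopword set (an assumption; not the NLTK list)
STOP_WORDS = {"i", "i'm", "i'll", "to", "this", "as", "a", "for", "the"}


def remove_stopwords_first(text):
    # Tokenize on letters and apostrophes so contractions stay intact
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    # Drop stopwords and tokens that consist only of apostrophes
    kept = [t for t in tokens if t not in STOP_WORDS and t.strip("'")]
    # Strip any remaining apostrophes only after filtering
    return ' '.join(t.replace("'", "") for t in kept)


print(remove_stopwords_first("I'm going to write this as a review"))
```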


# (4) Lowercase all texts

In [7]:
# Function to lowercase all text
def lowercase_text(text):
    return text.lower()

# Applying the function
dataFrame['Lowercased'] = dataFrame['Stopwords Removed'].apply(lowercase_text)

# Displaying the updated DataFrame
dataFrame[['Review', 'Lowercased']].head()


Unnamed: 0,Review,Lowercased
0,This is what Hollywood needs. A great story wi...,hollywood needs great story great directorprod...
1,I'm going to write this as a review for both D...,im going write review dune movies ill include ...
2,Had the pleasure to watch this film in an earl...,pleasure watch film early screening completely...
3,Phenomenal stuff. I'll probably calm down tomo...,phenomenal stuff ill probably calm tomorrow ri...
4,"If you liked or loved the first one, the same ...",liked loved first one apply one personally lov...


# (5) Stemming.

In [8]:
# Initializing the stemmer
stemmer = PorterStemmer()

# Function to stem each word in the text
def stem_text(text):
    return ' '.join([stemmer.stem(word) for word in text.split()])

# Applying the function
dataFrame['Stemmed'] = dataFrame['Lowercased'].apply(stem_text)

# Displaying the updated DataFrame
dataFrame[['Review', 'Stemmed']].head()


Unnamed: 0,Review,Stemmed
0,This is what Hollywood needs. A great story wi...,hollywood need great stori great directorprodu...
1,I'm going to write this as a review for both D...,im go write review dune movi ill includ though...
2,Had the pleasure to watch this film in an earl...,pleasur watch film earli screen complet blown ...
3,Phenomenal stuff. I'll probably calm down tomo...,phenomen stuff ill probabl calm tomorrow right...
4,"If you liked or loved the first one, the same ...",like love first one appli one person love one ...


# (6) Lemmatization.

In [9]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (13 kB)
Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
Downloading pyahocorasick-2.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)

In [10]:
import pandas as pd
import spacy
import contractions

# Loading  the English NLP model
nlp = spacy.load('en_core_web_sm')


# Function to expand contractions
def expand_contractions(text):
    return contractions.fix(text)

# Function for lemmatization using spaCy
def lemmatize_text(text):
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc])

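# Applying the function to expand contractions and then lemmatize.
# Note: the input is the 'Stemmed' column, so the lemmatizer mostly sees stems
# (e.g. "stori", "movi") rather than full words.
dataFrame['Lemmatized'] = dataFrame['Stemmed'].apply(expand_contractions).apply(lemmatize_text)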

# Displaying the updated DataFrame
print(dataFrame[['Review', 'Lemmatized']].head())


                                              Review  \
0  This is what Hollywood needs. A great story wi...   
1  I'm going to write this as a review for both D...   
2  Had the pleasure to watch this film in an earl...   
3  Phenomenal stuff. I'll probably calm down tomo...   
4  If you liked or loved the first one, the same ...   

                                          Lemmatized  
0  hollywood need great stori great directorprodu...  
1  I be go write review dune movi ill includ thin...  
2  pleasur watch film earli screen complet blow a...  
3  phenomen stuff ill probabl calm tomorrow right...  
4  like love first one appli one person love one ...  


In [11]:
# Saving the cleaned data to a new CSV file, imdb_reviews_cleaned.csv
dataFrame.to_csv('imdb_reviews_cleaned.csv', index=False)

# Print a confirmation message
print("Cleaned data saved to 'imdb_reviews_cleaned.csv'")


Cleaned data saved to 'imdb_reviews_cleaned.csv'


# Question 3 (30 points)

Write a Python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** Print out the constituency parsing trees and dependency parsing trees of all the sentences. Use one sentence as an example to explain your understanding of the constituency parsing tree and the dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities, such as person names, organizations, locations, product names, and dates, from the clean texts, and calculate the count of each entity.

# (1) Parts of Speech (POS) Tagging

In [12]:
import nltk
import pandas as pd
from collections import Counter

# Downloading all  necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def pos_tagging(text):
    words = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(words)
    return pos_tags

def count_pos(pos_tags):
    pos_counts = Counter(tag for word, tag in pos_tags)
    return {
        'Nouns': sum(pos_counts[tag] for tag in ['NN', 'NNS', 'NNP', 'NNPS']),
        'Verbs': sum(pos_counts[tag] for tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']),
        'Adjectives': sum(pos_counts[tag] for tag in ['JJ', 'JJR', 'JJS']),
        'Adverbs': sum(pos_counts[tag] for tag in ['RB', 'RBR', 'RBS'])
    }

# Loading the cleaned data
dataFrame = pd.read_csv('imdb_reviews_cleaned.csv')

# Performing POS tagging on the cleaned text
dataFrame['POS_tags'] = dataFrame['Lemmatized'].apply(pos_tagging)

# Counting the POS categories for each review
dataFrame['POS_counts'] = dataFrame['POS_tags'].apply(count_pos)

# Calculating the total POS counts
total_pos_counts = dataFrame['POS_counts'].apply(pd.Series).sum()

print("Total POS counts:")
print(total_pos_counts)

# Saving all the  results to CSV file
dataFrame.to_csv('imdb_reviews_pos_tagged.csv', index=False)
print("POS tagging results saved to 'imdb_reviews_pos_tagged.csv'")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Total POS counts:
Nouns         61873
Verbs         15913
Adjectives    22720
Adverbs        6796
dtype: int64
POS tagging results saved to 'imdb_reviews_pos_tagged.csv'


# (2) Constituency Parsing and Dependency Parsing

In [13]:
import nltk
import spacy
from nltk import Tree
import pandas as pd

# Downloading all the  necessary NLTK data
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('words', quiet=True)

# Loading the  spaCy model
nlp = spacy.load("en_core_web_sm")

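def constituency_parse(sentence):
    # Note: nltk.ne_chunk returns a shallow tree of POS-tagged tokens with
    # named-entity chunks; it is not a full constituency (phrase-structure) parse.
    words = nltk.word_tokenize(sentence)
    pos_tags = nltk.pos_tag(words)
    tree = nltk.ne_chunk(pos_tags)
    return tree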

def dependency_parse(sentence):
    doc = nlp(sentence)
    return [(token.text, token.dep_, token.head.text) for token in doc]

# Loading  the cleaned-data from csv file
dataFrame = pd.read_csv('imdb_reviews_cleaned.csv')

# Processing all the sentences
for index, row in dataFrame.iterrows():
    sentence = row['Lemmatized']
    print(f"\n\nSentence {index + 1}:")
    print(sentence)

    try:
        print("\nConstituency Parse Tree:")
        constituency_tree = constituency_parse(sentence)
        print(constituency_tree)
    except LookupError as e:
        print(f"Error in constituency parsing: {e}")

    print("\nDependency Parse:")
    dependency_relations = dependency_parse(sentence)
    for word, dep, head in dependency_relations:
        print(f"{word} --{dep}--> {head}")

print("\nParsing completed for all sentences.")

Streaming output truncated to the last 5000 lines.
sand --nsubj--> hide
hide --ccomp--> make
leav --amod--> trail
clear --amod--> trail
trail --dobj--> hide
behind --prep--> hide
find --advcl--> watch
scatteredoveral --amod--> doubt
enjoy --amod--> doubt
movi --compound--> doubt
doubt --nsubj--> watch
watch --ccomp--> bloat


Sentence 987:
think mani thing talk movi let keep simpl film absolut incre masterpiec work art besid think also one import film ever make accomplish much mani level exampl last time see sciencefict movi masterclass act deep charact perfect cinematographi immacul sound design list could go adapt novel specif believ film show we true power cgi use properli nearli invis blend perfectli rest footag deni villeneuv take we somewher one ever truli undescrib experi make perfect use wide shot lens filter ad immacul sound design han zimmer yet make incred origin score villeneuv compliment perfect background nois one thing sure though movi see theatr good theat
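
A dependency parse gives every word exactly one head, with a single root that heads the whole sentence, whereas a constituency parse groups words into nested phrases. The `word --dep--> head` lines above identify heads by their text, which becomes ambiguous when a word repeats; spaCy also exposes positions through `token.i` and `token.head.i`. As a sketch of how such relations form a tree, the hypothetical helper `render_tree` below prints an indented view from hand-written tuples (an illustrative analysis of a short sentence, not parser output).

```python
from collections import defaultdict

# (index, word, relation, head_index); the root points at itself, as in spaCy.
# Hand-written analysis of "The film shows true power" for illustration.
tokens = [
    (0, "The", "det", 1),
    (1, "film", "nsubj", 2),
    (2, "shows", "ROOT", 2),
    (3, "true", "amod", 4),
    (4, "power", "dobj", 2),
]


def render_tree(tokens):
    """Return indented lines showing each word under its head."""
    children = defaultdict(list)
    root = None
    for idx, _, _, head in tokens:
        if idx == head:
            root = idx  # the root is its own head
        else:
            children[head].append(idx)

    lines = []

    def walk(idx, depth):
        _, word, rel, _ = tokens[idx]
        lines.append("  " * depth + f"{word} ({rel})")
        for child in children[idx]:
            walk(child, depth + 1)

    walk(root, 0)
    return lines


print("\n".join(render_tree(tokens)))
```

Reading the result top-down, "shows" is the root, "film" is its subject, "power" is its object, and "The" and "true" modify those nouns.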

# (3) Named Entity Recognition

In [14]:
import spacy
import pandas as pd
from collections import Counter

# Loading the  spaCy model
nlp = spacy.load("en_core_web_sm")

def perform_ner(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

# Loading the cleaned data
dataFrame = pd.read_csv('imdb_reviews_cleaned.csv')

# Performing NER on all the cleaned texts extracted from the CSV file
all_entities = []

for index, row in dataFrame.iterrows():
    text = row['Lemmatized']
    entities = perform_ner(text)
    all_entities.extend(entities)

    # Printing all entities for each text
    print(f"\nEntities in text {index + 1}:")
    for entity, label in entities:
        print(f"{entity} - {label}")

# Calculating  the count of each and every entity type
entity_counts = Counter(label for _, label in all_entities)

print("\nTotal entity counts:")
for entity_type, count in entity_counts.items():
    print(f"{entity_type}: {count}")

# Creating  a list of all unique entities
unique_entities = list(set(all_entities))

print("\nSample of unique entities found (up to 20):")
for entity, label in unique_entities[:20]:
    print(f"{entity} - {label}")

# Saving the results to CSV
results_dataFrame = pd.DataFrame(unique_entities, columns=['Entity', 'Type'])
results_dataFrame.to_csv('named_entities.csv', index=False)
print("\nFull list of named entities saved to 'named_entities.csv'")

Streaming output truncated to the last 5000 lines.
Entities in text 491:
yesterday - DATE
embrac messiah destini perfectzendaya - ORG
javier bardem - PERSON
first - ORDINAL
one - CARDINAL
beautifulth - ORG

Entities in text 492:
deni - NORP
surpris - NORP
han - NORP
everi time - PERSON
paul - PERSON
believ mayb - ORG
frank herbert - PERSON
mayb frank - PERSON

Entities in text 493:
grandeur depth - ORG
two - CARDINAL
challeng soar - PERSON
coloss narr - PERSON
frank herbert semin - PERSON
narr expans - PERSON
first - ORDINAL
paul - PERSON
journey naiv young duke - PERSON
harden realiti - PERSON
destini desirerebecca - PERSON
harkonnen depict grotesqu opul repel - ORG
harkonnen figur - PERSON
han - NORP
two - CARDINAL
mostli - GPE
materi hindranc - PERSON
narr - PERSON
howev - GPE
quibbl - PERSON
two - CARDINAL
ten - CARDINAL

Entities in text 494:
two - CARDINAL
first - ORDINAL
jessica rebecca - PERSON
quest aveng - PERSON
paul - PERSON
lisan al gaib - PERSON
zendaya fier

# **Comment**
Make sure to submit the cleaned data CSV in the comment section - 10 points

In [15]:
# Saved the final cleaned and the processed dataset to a CSV file
dataFrame.to_csv('final_cleaned_dataset.csv', index=False)

# Print a confirmation message
print("Final cleaned dataset with all steps saved to 'final_cleaned_dataset.csv'")


Final cleaned dataset with all steps saved to 'final_cleaned_dataset.csv'


In [16]:
import pandas as pd

# Loading  the final cleaned CSV file
dataFrame = pd.read_csv('final_cleaned_dataset.csv')

# Keeping only the 'Lemmatized' column (the output of the last cleaning step) and renaming it to 'Final Review'
final_dataFrame = dataFrame[['Lemmatized']].rename(columns={'Lemmatized': 'Final Review'})

# Saving the final dataset to a new CSV file, final_reviews_dataset.csv
final_dataFrame.to_csv('final_reviews_dataset.csv', index=False)

# Print a final confirmation message
print("Cleaned and Final reviews dataset has been saved to 'final_reviews_dataset.csv'")


Cleaned and Final reviews dataset has been saved to 'final_reviews_dataset.csv'


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [19]:
# Write your response below

'''
1. The first and main challenge I faced was the web scraping. The data came from the IMDb reviews page, which loads its content dynamically, so handling the content loading and the pagination was tricky, and I also had to avoid being blocked by the server. It was difficult to page through enough results to collect 1000 reviews efficiently without overloading the server.
2. The next challenge was text cleaning. I implemented several text pre-processing techniques (noise removal, stopword removal, stemming, and lemmatization), all of which require attention to detail. If stemming and lemmatization are not handled correctly, they can alter the meaning of the text.
3. The other challenge was applying POS tagging and parsing, which require a good understanding of linguistic structures and are hard to implement correctly.

The enjoyable aspects of this assignment were exploring libraries such as Beautiful Soup for web scraping, and spaCy and NLTK for text pre-processing.

Once the data had been cleaned, analysing the text with POS tagging and extracting entities with NER gave me valuable insights into the data. It was very interesting to learn how much information can be extracted from raw text.

My final thoughts: this assignment combined multiple concepts, including web scraping, text pre-processing, and advanced text analytics, which helped me gain more insight into all of these topics together.

The assignment comprised multiple tasks, so it required a lot of attention to detail, but it is good practice for hands-on experience. The time provided is sufficient for someone who is already familiar with the libraries and techniques, but the work is time-intensive; additional time would let us grasp more of the concepts and prepare the solutions properly.
'''
