# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Wednesday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**


# Question 1 (40 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]

(3) Collect all the reviews of the top 1000 most popular software from G2 or Capterra.

(4) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(5) Collect all the information of the 904 narrators in the Densho Digital Repository.


In [51]:
import requests
from bs4 import BeautifulSoup
import time
import csv

def get_narrators(url: object) -> object:
    narrators = []

    # Assuming there's a way to determine when to stop, like a "no more pages" indicator.
    # This loop structure might need adjustment based on the actual pagination implementation.
    while True:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
        }

        response = requests.get(url, headers=headers)

        if response.status_code != 200:
            print(f"Failed to retrieve data: {response.status_code}")
            break

        soup = BeautifulSoup(response.text, 'html.parser')

        # Replace '.narrator-class' with the actual class or identifier used on the website for narrators.
        narrator_elements = soup.select('.media')
        if not narrator_elements:
            print("No more narrators found or reached the end of the list.")
            break
        # print(narrator_elements)
        for element in narrator_elements:
            narrator = {}
            media_heading = element.find('h4', class_='media-heading')
            if media_heading:
                anchor = media_heading.find('a')  # Find the anchor tag within the media heading
                if anchor:
                    narrator['name'] = anchor.text.strip()
                    narrator['link'] = anchor['href']
                    narrators.append(narrator)

        next_page_element = soup.find('a', string='Next')
        if next_page_element:
            url = next_page_element['href']  # Assuming the next page URL is in the 'href' attribute
        else:
            break  # Stop if there's no "Next" button found

    return narrators


def narrator_info(url):
    info = {}
    interviews = []
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }

    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        print(f"Failed to retrieve data: {response.status_code}")
        return

    soup = BeautifulSoup(response.text, 'html.parser')
    _title_element = soup.find('h1')
    _summary_ele = soup.find('p')
    media_elements = soup.find_all('div', class_='media')
    info['title'] = _title_element.text.strip()
    info['summary'] = _summary_ele.text.strip()

    for media in media_elements:
        detail = {}
        # Extract the interview title from the 'b' tag within 'media-heading' class
        interview_title_element = media.find('b', class_='media-heading')
        if interview_title_element and interview_title_element.find('a'):
            detail['interview_title'] = interview_title_element.a.text.strip()

        # Extract the URL from the 'a' tag within 'media-heading' class
        if interview_title_element and interview_title_element.find('a'):
            detail['url'] = interview_title_element.a['href']

        # Extract the description from the 'div' with class 'source muted'
        description_element = media.find('div', class_='source muted')
        if description_element:
            detail['description'] = description_element.text.strip()

        # Add the extracted details to the list
        interviews.append(detail)
    if(len(interviews) > 0):
      info['interview_title'] = interviews[0]['interview_title']
      info['latest_interview_summary'] = interviews[0]['description']
    return info

base_url = 'https://ddr.densho.org/narrators/?page='

print("Initiating Scrapper sequence")
narrators_list = []
narrator_info_list = []

start_time = time.time()

for i in range(1, 40):
    narrators_list.extend(get_narrators(base_url + str(i)))
    print('Scrapped Page '+str(i))

end_time = time.time()

elapsed_time = end_time - start_time

print('Sequence ran for :'+str(elapsed_time))
print('All Narrators Urls fetched, Initiating Scrapper Info Sequence')

for i in narrators_list:
    start_time = time.time()
    link = i['link']
    narrator_info_list.append(narrator_info(link))
    end_time = time.time()
    print('Info Sequence for :'+ i['name']+ 'ran for '+str(elapsed_time))

# Open the file in write mode ('w') and create a csv.writer object
with open('narrators.csv', mode='w', newline='', encoding='utf-8') as file:
    # Use the DictWriter class, which maps dictionaries onto output rows
    writer = csv.DictWriter(file, fieldnames=narrator_info_list[0].keys())

    # Write the header row based on keys from the first dictionary
    writer.writeheader()

    # Write all the dictionaries to the CSV file
    for narrator in narrator_info_list:
        writer.writerow(narrator)

# # info = narrator_info('https://ddr.densho.org/narrators/887/')
print('Narrator Info List')
print(narrator_info_list)
print('Total Narrators: ' + str(len(narrator_info_list)))

print(f"CSV file ''narrators.csv'' created successfully.")

Initiating Scrapper sequence
Scrapped Page 1
Scrapped Page 2
Scrapped Page 3
Scrapped Page 4
Scrapped Page 5
Scrapped Page 6
Scrapped Page 7
Scrapped Page 8
Scrapped Page 9
Scrapped Page 10
Scrapped Page 11
Scrapped Page 12
Scrapped Page 13
Scrapped Page 14
Scrapped Page 15
Scrapped Page 16
Scrapped Page 17
Scrapped Page 18
Scrapped Page 19
Scrapped Page 20
Scrapped Page 21
Scrapped Page 22
Scrapped Page 23
Scrapped Page 24
Scrapped Page 25
Scrapped Page 26
Scrapped Page 27
Scrapped Page 28
Scrapped Page 29
Scrapped Page 30
Scrapped Page 31
Scrapped Page 32
Scrapped Page 33
Scrapped Page 34
Scrapped Page 35
Scrapped Page 36
Scrapped Page 37
Scrapped Page 38
Scrapped Page 39
Sequence ran for :15.646780252456665
All Narrators Urls fetched, Initiating Scrapper Info Sequence
Info Sequence for :Kay Aiko Aberan for 15.646780252456665
Info Sequence for :Art Aberan for 15.646780252456665
Info Sequence for :Sharon Tanagi Aburanoran for 15.646780252456665
Info Sequence for :Toshiko Aiboshiran fo

# Question 2 (30 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [52]:
# Write code for each of the sub parts with proper comments.
from bs4 import BeautifulSoup
import pandas as pd
import requests
from IPython.display import display


df_narrators = pd.read_csv('narrators.csv')

df_narrators.columns = ['narrator_name','summary','latest_interview_title','latest_interview_summary']

# Converting to lower case, removing special charecters, removing any weird noises
df_narrators['narrator_name_clean'] = df_narrators['narrator_name'].str.replace('[^\w\s]','').str.replace('\d+', '').str.lower()

df_narrators['latest_interview_title_clean'] = df_narrators['latest_interview_title'].str.replace('[^\w\s]','').str.replace('\d+', '').str.lower()


df_narrators['latest_interview_summary_clean'] = df_narrators['latest_interview_summary'].str.replace('[^\w\s]','').str.replace('\d+', '').str.lower()

df_narrators['latest_interview_title_clean'].fillna('', inplace=True)
df_narrators['latest_interview_summary_clean'].fillna('', inplace=True)
df_narrators['narrator_name_clean'].fillna('', inplace=True)


df_narrators.head()



  df_narrators['narrator_name_clean'] = df_narrators['narrator_name'].str.replace('[^\w\s]','').str.replace('\d+', '').str.lower()
  df_narrators['latest_interview_title_clean'] = df_narrators['latest_interview_title'].str.replace('[^\w\s]','').str.replace('\d+', '').str.lower()
  df_narrators['latest_interview_summary_clean'] = df_narrators['latest_interview_summary'].str.replace('[^\w\s]','').str.replace('\d+', '').str.lower()


Unnamed: 0,narrator_name,summary,latest_interview_title,latest_interview_summary,narrator_name_clean,latest_interview_title_clean,latest_interview_summary_clean
0,Kay Aiko Abe,"Nisei female. Born May 9, 1927, in Selleck, Wa...",Kay Aiko Abe Interview — ddr-densho-1000-232,"December 2, 2008.\n Seattle, Washington.\...",kay aiko abe,kay aiko abe interview ddrdensho,december \n seattle washington\n \n...
1,Art Abe,"Nisei male. Born June 12, 1921, in Seattle, Wa...",Art Abe Interview — ddr-densho-1000-206,"January 24, 2008.\n Seattle, Washington.\...",art abe,art abe interview ddrdensho,january \n seattle washington\n \n ...
2,Sharon Tanagi Aburano,"Nisei female. Born October 31, 1925, in Seattl...",Sharon Tanagi Aburano Interview II — ddr-densh...,"April 3, 2008.\n Seattle, Washington.\n ...",sharon tanagi aburano,sharon tanagi aburano interview ii ddrdensho,april \n seattle washington\n \n ...
3,Toshiko Aiboshi,"Nisei female. Born July 8, 1928, in Boyle Heig...",Toshiko Aiboshi Interview — ddr-manz-1-112,"January 20, 2011.\n Culver City, Californ...",toshiko aiboshi,toshiko aiboshi interview ddrmanz,january \n culver city california\n ...
4,Douglas L. Aihara,"Sansei male. Born March 15, 1950, in Torrance,...",Douglas L. Aihara Interview — ddr-densho-1000-522,"9-Nov-22.\n Los Angeles, California.\n ...",douglas l aihara,douglas l aihara interview ddrdensho,nov\n los angeles california\n \n ...


In [53]:
#Removing stopwords
from nltk.corpus import stopwords

stop = stopwords.words('english')

df_narrators['latest_interview_summary_clean'] = df_narrators['latest_interview_summary_clean'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df_narrators['latest_interview_summary_clean'].head()

0       december seattle washington segments
1        january seattle washington segments
2          april seattle washington segments
3    january culver city california segments
4        nov los angeles california segments
Name: latest_interview_summary_clean, dtype: object

In [54]:
result = df_narrators.dtypes
print(result)

narrator_name                     object
summary                           object
latest_interview_title            object
latest_interview_summary          object
narrator_name_clean               object
latest_interview_title_clean      object
latest_interview_summary_clean    object
dtype: object


In [55]:
#Stemming.

from nltk.stem import PorterStemmer
st = PorterStemmer()
df_narrators['latest_interview_summary_clean'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

0          decemb seattl washington segment
1         januari seattl washington segment
2           april seattl washington segment
3    januari culver citi california segment
4           nov lo angel california segment
Name: latest_interview_summary_clean, dtype: object

In [56]:
#Lemmatization
from textblob import Word
import nltk
nltk.download('wordnet')

df_narrators['latest_interview_summary_clean'] = df_narrators['latest_interview_summary_clean'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df_narrators['latest_interview_summary_clean'].head()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


0       december seattle washington segment
1        january seattle washington segment
2          april seattle washington segment
3    january culver city california segment
4        nov los angeles california segment
Name: latest_interview_summary_clean, dtype: object

In [57]:
df_narrators

Unnamed: 0,narrator_name,summary,latest_interview_title,latest_interview_summary,narrator_name_clean,latest_interview_title_clean,latest_interview_summary_clean
0,Kay Aiko Abe,"Nisei female. Born May 9, 1927, in Selleck, Wa...",Kay Aiko Abe Interview — ddr-densho-1000-232,"December 2, 2008.\n Seattle, Washington.\...",kay aiko abe,kay aiko abe interview ddrdensho,december seattle washington segment
1,Art Abe,"Nisei male. Born June 12, 1921, in Seattle, Wa...",Art Abe Interview — ddr-densho-1000-206,"January 24, 2008.\n Seattle, Washington.\...",art abe,art abe interview ddrdensho,january seattle washington segment
2,Sharon Tanagi Aburano,"Nisei female. Born October 31, 1925, in Seattl...",Sharon Tanagi Aburano Interview II — ddr-densh...,"April 3, 2008.\n Seattle, Washington.\n ...",sharon tanagi aburano,sharon tanagi aburano interview ii ddrdensho,april seattle washington segment
3,Toshiko Aiboshi,"Nisei female. Born July 8, 1928, in Boyle Heig...",Toshiko Aiboshi Interview — ddr-manz-1-112,"January 20, 2011.\n Culver City, Californ...",toshiko aiboshi,toshiko aiboshi interview ddrmanz,january culver city california segment
4,Douglas L. Aihara,"Sansei male. Born March 15, 1950, in Torrance,...",Douglas L. Aihara Interview — ddr-densho-1000-522,"9-Nov-22.\n Los Angeles, California.\n ...",douglas l aihara,douglas l aihara interview ddrdensho,nov los angeles california segment
...,...,...,...,...,...,...,...
970,George Yoshinaga,"Nisei male. Born July 19, 1925, in Redwood Cit...",George Yoshinaga Interview — ddr-manz-1-107,"August 10, 2010.\n Las Vegas, Nevada.\n ...",george yoshinaga,george yoshinaga interview ddrmanz,august la vega nevada segment
971,George M. Yoshino,"Nisei male. Born February 25, 1921, in Bellevu...",George M. Yoshino Interview — ddr-densho-1014-8,"June 17, 2009.\n Bloomington, Minnesota.\...",george m yoshino,george m yoshino interview ddrdensho,june bloomington minnesota segment
972,Karen Yoshitomi,"Sansei female. Born 1962 in Spokane, Washingto...",Karen Yoshitomi Interview — ddr-densho-1000-527,"23-Jan-23.\n Seattle, Washington.\n ...",karen yoshitomi,karen yoshitomi interview ddrdensho,jan seattle washington segment
973,John Young,"Chinese American male. Born May 22, 1923, in L...",John Young Interview — ddr-manz-1-171,"May 22, 2015.\n San Gabriel, California.\...",john young,john young interview ddrmanz,may san gabriel california segment


# Question 3 (30 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [58]:
# Your code here
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [59]:
#3-1
import nltk
from collections import Counter

import spacy

# Tokenize the text into words
df_narrators['latest_interview_summary_clean'] = df_narrators['latest_interview_summary_clean'].astype(str)
df_narrators['words'] = df_narrators['latest_interview_summary_clean'].apply(nltk.word_tokenize)

# Perform POS tagging for each row
df_narrators['pos_tags'] = df_narrators['words'].apply(nltk.pos_tag)

# Count the number of each POS tag for each row
df_narrators['pos_counts'] = df_narrators['pos_tags'].apply(lambda x: Counter(tag[1][0] for tag in x))

# Sum up the total number of each POS tag across all rows
total_counts = df_narrators['pos_counts'].sum()

# Print the results
print("Total number of nouns (N):", total_counts['N'])
print("Total number of verbs (V):", total_counts['V'])
print("Total number of adjectives (Adj):", total_counts['J'])
print("Total number of adverbs (Adv):", total_counts['R'])

Total number of nouns (N): 2748
Total number of verbs (V): 399
Total number of adjectives (Adj): 581
Total number of adverbs (Adv): 122


In [60]:
#3-2
import spacy
from spacy import displacy
nlp=spacy.load('en_core_web_sm')

for i in df_narrators['latest_interview_summary_clean'].head():
  displacy.render(nlp(i),jupyter=True)

In [61]:
from nltk import pos_tag, word_tokenize, RegexpParser

chunker = RegexpParser("""
                       NP: {<DT>?<JJ>*<NN>}    #To extract Noun Phrases
                       P: {<IN>}               #To extract Prepositions
                       V: {<V.*>}              #To extract Verbs
                       PP: {<P> <NP>}          #To extract Prepostional Phrases
                       VP: {<V> <NP|PP>*}      #To extarct Verb Phrases
                       """)
for i in df_narrators['pos_tags']:
  output = chunker.parse(i)
  print("After Extracting\n", output)


After Extracting
 (S (NP december/NN) (NP seattle/JJ washington/NN) (NP segment/NN))
After Extracting
 (S (NP january/JJ seattle/JJ washington/NN) (NP segment/NN))
After Extracting
 (S (NP april/NN) (NP seattle/JJ washington/NN) (NP segment/NN))
After Extracting
 (S
  (NP january/JJ culver/NN)
  (NP city/NN)
  (NP california/NN)
  (NP segment/NN))
After Extracting
 (S
  (NP nov/JJ los/NN)
  angeles/NNS
  (VP (V california/VBP) (NP segment/NN)))
After Extracting
 (S (NP july/NN) denver/WRB (NP colorado/NN) (NP segment/NN))
After Extracting
 (S )
After Extracting
 (S
  (NP june/NN)
  salt/NNS
  (VP (V lake/VBP) (NP city/NN) (NP utah/NN) (NP segment/NN)))
After Extracting
 (S july/RB (NP klamath/JJ fall/NN) (NP oregon/NN) (NP segment/NN))
After Extracting
 (S (NP march/NN) (NP spokane/NN) (NP washington/NN) (NP segment/NN))
After Extracting
 (S
  october/RB
  (NP hood/NN)
  (NP river/NN)
  (NP oregon/NN)
  (NP segment/NN))
After Extracting
 (S october/RB (NP seattle/JJ washington/NN) (NP 

# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

In [62]:
# Write your response below
One of the difficult tasks was web scrapping, frequent access to the site was being blocked, alot of time was being invested into scrapping choosing the right elements to retrieve data and correct techniques to retrieve data strategically.
Cleanzing data was a bit easy, but faced a typical Nan error, had to perform few other filters to clean the data properly. Overall it was a good experience and time was acceptable