<a href="https://colab.research.google.com/github/anoohyabhaskarla/INFO_5731/blob/main/Bhaskarla_Anoohya_Assignment_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5)**Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [None]:
# Your code here
!pip install requests beautifulsoup4 pandas



In [38]:

import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.error import HTTPError

# Base URL for scraping
main_url = "https://ddr.densho.org/narrators/"

# Dictionary to store the data
data_dict = {"Name": [], "Bio": []}

# Loop through each page (page 1 to 41)
for page_num in range(1, 42):
    try:
        # Build the request with the page number
        link1 = Request(main_url + f"?page={page_num}", headers={'User-Agent': 'Mozilla/5.0'})
        url1 = urlopen(link1)

        # Read the page content
        data1 = url1.read()
        soup = BeautifulSoup(data1, 'html.parser').find_all('div', class_="col-md-6 col-sm-6")

        # Loop through each narrators' information
        for i in soup:
            narrator_info = i.find('div', class_="media-body")

            # Extracting Name and Bio
            name = narrator_info.find('a').text
            Bio = narrator_info.find('div', class_="source muted").text.strip()

            # Add the information to the dictionary
            data_dict["Name"].append(name)
            data_dict["Bio"].append(Bio)

            # Print for debugging
            print(name)
            print(Bio)

    except HTTPError as e:
        print(f"HTTPError on page {page_num}: {e}")
    except Exception as e:
        print(f"Error on page {page_num}: {e}")

# Create a DataFrame from the dictionary
df = pd.DataFrame(data_dict)

# Specify the filename for saving
csv_filename = "narrators_data.csv"

# Save the data to a CSV file
df.to_csv(csv_filename, index=False, encoding="utf-8")

# Success message
print(f"CSV file '{csv_filename}' has been created successfully!")






Kay Aiko Abe
Nisei female. Born May 9, 1927, in Selleck, Washington. Spent much of childhood in Beaverton, Oregon, where father owned a farm. Influenced at an early …
Art Abe
Nisei male. Born June 12, 1921, in Seattle, Washington. Grew up in an area of Seattle with few other Japanese Americans, and was attending the …
Sharon Tanagi Aburano
Nisei female. Born October 31, 1925, in Seattle, Washington. Family owned and operated a successful grocery store prior to World War II. After the bombing …
Toshiko Aiboshi
Nisei female. Born July 8, 1928, in Boyle Heights, California. At an early age, went to live with family friends when father passed away and …
Douglas L. Aihara
Sansei male. Born March 15, 1950, in Torrance, California. Grew up in the Los Angeles area, where father sold insurance. Active with Los Angeles' Koyasan …
Yae Aihara
Nisei female. Born August 18, 1925 in Tacoma, Washington. Raised in Seattle, Washington, where family operated a grocery store. Attended Washington Grammar S

# Question 2 (15 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [39]:
# Write code for each of the sub parts with proper comments.

import pandas as pd
train = pd.read_csv("/content/narrators_data.csv")
#1. Removing Noise
df['Processed_name'] = df['Name'].str.replace('[^\w\s]','',regex=True)
df['Processed_details'] = df['Bio'].str.replace('[^\w\s]','',regex=True)
print(df)
# 2. Removing Numbers
df['Processed_name'] = df['Processed_name'].str.replace('\d+', '', regex=True)
df['Processed_details'] = df['Processed_details'].str.replace('\d+', '', regex=True)
print(df)
#3. Removing stopwords by using the stopwords list
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

df['Processed_name'] = df['Processed_name'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df['Processed_details'] = df['Processed_details'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
print(df)

#4. Lowercase all texts
df['Processed_name'] = df['Processed_name'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['Processed_details'] = df['Processed_details'].apply(lambda x: " ".join(x.lower() for x in x.split()))
print(df)

#5. Stemming
from nltk.stem import PorterStemmer
st = PorterStemmer()
df['Processed_name']=df['Processed_name'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
df['Processed_details']=df['Processed_details'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
print(df)

#6. Lemmatization
from textblob import Word
import nltk
nltk.download('wordnet')

df['Processed_name'] = df['Processed_name'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df['Processed_details'] = df['Processed_details'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
print(df)

df.head()
df.to_csv('Narrators_data_Cleaned.csv', index=False)




                       Name  \
0              Kay Aiko Abe   
1                   Art Abe   
2     Sharon Tanagi Aburano   
3           Toshiko Aiboshi   
4         Douglas L. Aihara   
...                     ...   
997         Karen Yoshitomi   
998              John Young   
999             Sharon Yuen   
1000              Lois Yuki   
1001            Aaron Zajic   

                                                    Bio  \
0     Nisei female. Born May 9, 1927, in Selleck, Wa...   
1     Nisei male. Born June 12, 1921, in Seattle, Wa...   
2     Nisei female. Born October 31, 1925, in Seattl...   
3     Nisei female. Born July 8, 1928, in Boyle Heig...   
4     Sansei male. Born March 15, 1950, in Torrance,...   
...                                                 ...   
997   Sansei female. Born 1962 in Spokane, Washingto...   
998   Chinese American male. Born May 22, 1923, in L...   
999   Sansei female. Born July 1945 in Seattle, Wash...   
1000  Nisei female. Born September 13

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                       Name  \
0              Kay Aiko Abe   
1                   Art Abe   
2     Sharon Tanagi Aburano   
3           Toshiko Aiboshi   
4         Douglas L. Aihara   
...                     ...   
997         Karen Yoshitomi   
998              John Young   
999             Sharon Yuen   
1000              Lois Yuki   
1001            Aaron Zajic   

                                                    Bio  \
0     Nisei female. Born May 9, 1927, in Selleck, Wa...   
1     Nisei male. Born June 12, 1921, in Seattle, Wa...   
2     Nisei female. Born October 31, 1925, in Seattl...   
3     Nisei female. Born July 8, 1928, in Boyle Heig...   
4     Sansei male. Born March 15, 1950, in Torrance,...   
...                                                 ...   
997   Sansei female. Born 1962 in Spokane, Washingto...   
998   Chinese American male. Born May 22, 1923, in L...   
999   Sansei female. Born July 1945 in Seattle, Wash...   
1000  Nisei female. Born September 13

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Question 3 (15 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [43]:
import nltk
import spacy
from nltk.tokenize import sent_tokenize
from collections import Counter
import pandas as pd

# Download NLTK resources (if needed)
try:
    nltk.data.find('tokenizers/punkt')
    nltk.data.find('taggers/averaged_perceptron_tagger')
except LookupError:
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')

# Load spaCy model
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading spacy model...")
    import subprocess
    subprocess.run(["python", "-m", "spacy", "download", "en_core_web_sm"])
    nlp = spacy.load("en_core_web_sm")


def analyze_text(text):
    if not isinstance(text, str):  # Check if it's a string
        print("Skipping analysis: Text is not a string.")
        return  # Or return some default values if needed

    sentences = sent_tokenize(text)

    # 1. POS Tagging
    pos_counts = Counter()
    for sentence in sentences:
        doc = nlp(sentence)  # Use spaCy for POS tagging
        for token in doc:
            tag = token.pos_  # spaCy's POS tags are more detailed
            if tag.startswith('N'):
                pos_counts['Noun'] += 1
            elif tag.startswith('V'):
                pos_counts['Verb'] += 1
            elif tag.startswith('J'):  # spaCy uses 'ADJ'
                pos_counts['Adjective'] += 1
            elif tag.startswith('R'):  # spaCy uses 'ADV'
                pos_counts['Adverb'] += 1

    print("POS Tagging and Counts:")
    print(pos_counts)


    # 2. Dependency Parsing (Constituency parsing is complex; using a simpler method here)
    print("\nDependency Parsing:")
    for sentence in sentences:
        doc = nlp(sentence)
        print(f"\nSentence: {sentence}")
        for token in doc:
            print(f"{token.text} --({token.dep_})--> {token.head.text}")

        # Example explanation (using the first sentence)
        if sentences:
            example_sentence = sentences[0]
            example_doc = nlp(example_sentence)
            print("\nExplanation of Dependency Parsing for Example Sentence:")
            print(f"Sentence: {example_sentence}")
            for token in example_doc:
                print(f"{token.text} --({token.dep_})--> {token.head.text}")
            print("(Explanation: Dependency parsing shows the relationships between words. Each word is dependent on a 'head' word. The 'dep_' label describes the type of dependency (e.g., 'nsubj' for nominal subject, 'dobj' for direct object). This shows the grammatical structure and how words relate to each other.)")


    # 3. Named Entity Recognition
    print("\nNamed Entity Recognition:")
    entity_counts = Counter()
    for sentence in sentences:
        doc = nlp(sentence)
        for ent in doc.ents:
            entity_counts[ent.label_] += 1
            print(f"{ent.text} ({ent.label_})")

    print(f"\nEntity Counts: {entity_counts}")


# Load the DataFrame
df = pd.read_csv('narrators_data.csv')  # Replace with your CSV file

# Fill NaN values in the 'Bio' column with empty strings
df['Bio'].fillna('', inplace=True)  # This is the crucial line!

# Apply the analysis to the 'Bio' column (or whichever column you want)
for index, row in df.iterrows():
    print(f"\n--- Analyzing Bio for Narrator {index + 1} ---") # Add narrator number for better tracking.
    analyze_text(row['Bio'])  # Or row['Preprocessed_Content'] if you have it.

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Bio'].fillna('', inplace=True)  # This is the crucial line!


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
of --(prep)--> bombing
Pearl --(compound)--> Harbor
Harbor --(pobj)--> of
, --(punct)--> After
… --(punct)--> After

Explanation of Dependency Parsing for Example Sentence:
Sentence: Nisei female.
Nisei --(amod)--> female
female --(ROOT)--> female
. --(punct)--> female
(Explanation: Dependency parsing shows the relationships between words. Each word is dependent on a 'head' word. The 'dep_' label describes the type of dependency (e.g., 'nsubj' for nominal subject, 'dobj' for direct object). This shows the grammatical structure and how words relate to each other.)

Named Entity Recognition:
Nisei (PERSON)
October 3, 1919 (DATE)
Fresno (GPE)
California (GPE)
Watsonville (GPE)
California (GPE)
Pearl Harbor (FAC)

Entity Counts: Counter({'GPE': 4, 'PERSON': 1, 'DATE': 1, 'FAC': 1})

--- Analyzing Bio for Narrator 942 ---
POS Tagging and Counts:
Counter({'Noun': 5, 'Verb': 4})

Dependency Parsing:

Sentence: Sansei male.
Sanse

# **Following Questions must answer using AI assitance**

#Question 4 (20 points).

Q4. (PART-1)
Web scraping data from the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details extracted include the product name, a short description, and the URL.

 The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist by helping with the parsing of HTML, error handling, and generating reports based on the data collected.

 The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.


Github MarketPlace page:
https://github.com/marketplace?type=actions

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from random import choice

In [None]:
# storaging for extracted data
product_data = []

# listing of different user-agentt strings to avoid detection
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
]

In [None]:
# creating a session to maintain cookies & headers
session = requests.Session()

# adding extra headers to mimic a real browser
headers = {
    'User-Agent': choice(user_agents),
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://github.com/',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Connection': 'keep-alive',
    'Cache-Control': 'no-cache',
}

In [None]:
# fetching multiple pages
for i in range(1, 55):  # adjusting range for more pages
    time.sleep(3)  # delay to avoid getting blocked

    base_url = f'https://github.com/marketplace?page={i}&type=actions'
    print(f"Scraping: {base_url}")

    # retrying logic
    for attempt in range(3):  # Retrying up to 3 times
        try:
            response = session.get(base_url, headers=headers, timeout=10)

            if response.status_code == 200:
                break  # exiting loop if successful
            else:
                print(f"Attempt {attempt+1}: Failed with status {response.status_code}")
                time.sleep(2)  # waiting before retrying
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            time.sleep(2)

    if response.status_code != 200:
        print(f"Skipping page {i} due to failure")
        continue

    soup = BeautifulSoup(response.text, 'html.parser')

    # finding all marketplace items
    github_actions = soup.find_all('div', class_='position-relative border rounded-2 d-flex marketplace-common-module__marketplace-item--MohVH gap-3 p-3')

    for actions in github_actions:

        # handling exception error
        try:
            # extracting product details
            product_name_tag = actions.find('a', class_='marketplace-common-module__marketplace-item-link--jrIHf line-clamp-1')
            product_name = product_name_tag.text.strip() if product_name_tag else 'N/A'

            # extracting product URL
            url = product_name_tag['href'] if product_name_tag else 'N/A'
            if url.startswith('/'):
                url = f'https://github.com{url}'

            # extracting action description
            action_description_tag = actions.find('p', class_='mt-1 mb-0 text-small fgColor-muted line-clamp-2')
            action_description = action_description_tag.text.strip() if action_description_tag else 'N/A'

            # storing in a structured dictionary
            product_data.append({
                'Product Name': product_name,
                'URL': url,
                'Description': action_description,
                'Page Number': i
            })

        except Exception as e:
            print(f"Error extracting data on page {i}: {e}")

Scraping: https://github.com/marketplace?page=1&type=actions
Scraping: https://github.com/marketplace?page=2&type=actions
Scraping: https://github.com/marketplace?page=3&type=actions
Scraping: https://github.com/marketplace?page=4&type=actions
Scraping: https://github.com/marketplace?page=5&type=actions
Scraping: https://github.com/marketplace?page=6&type=actions
Scraping: https://github.com/marketplace?page=7&type=actions
Scraping: https://github.com/marketplace?page=8&type=actions
Scraping: https://github.com/marketplace?page=9&type=actions
Scraping: https://github.com/marketplace?page=10&type=actions
Scraping: https://github.com/marketplace?page=11&type=actions
Scraping: https://github.com/marketplace?page=12&type=actions
Scraping: https://github.com/marketplace?page=13&type=actions
Scraping: https://github.com/marketplace?page=14&type=actions
Scraping: https://github.com/marketplace?page=15&type=actions
Scraping: https://github.com/marketplace?page=16&type=actions
Scraping: https:/

In [None]:
# converting to a dataframe
df_products = pd.DataFrame(product_data)

# saving to csv
df_products.to_csv('github_marketplace_actions.csv', index=False)

print("Scraping Completed. Data saved to github_marketplace_actions.csv.")

Scraping Completed. Data saved to github_marketplace_actions.csv.


In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import stopwords


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
from nltk.corpus import stopwords

# Define the cleaning function with NaN handling
def clean_text_remove_stopword(text):
    if isinstance(text, str):  # Ensure the text is a string
        stop_words = set(stopwords.words('english'))
        text = ' '.join([word for word in text.split() if word.lower() not in stop_words])
        return text
    return ''  # Return an empty string if it's NaN or not a string

# Applying the cleaning function to 'Product Name' and 'Description' columns
df_action_products['Product Name'] = df_action_products['Product Name'].apply(clean_text_remove_stopword)
df_action_products['Description'] = df_action_products['Description'].apply(clean_text_remove_stopword)


In [None]:
# Import necessary NLTK functions
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Define the cleaning function
def clean_text_remove_stopword(text):
    if isinstance(text, str):  # Ensure the text is a string
        stop_words = set(stopwords.words('english'))  # Define stopwords
        lemmatizer = WordNetLemmatizer()

        # Tokenizing the text
        tokens = word_tokenize(text)

        # Removing stopwords and lemmatizing the tokens
        cleaned_tokens = [lemmatizer.lemmatize(word.lower()) for word in tokens if word.isalnum() and word.lower() not in stop_words]
        return ' '.join(cleaned_tokens)
    return ''  # Return an empty string if it's NaN or not a string

# Applying the cleaning function to 'Product Name' and 'Description' columns
df_action_products['Product Name'] = df_action_products['Product Name'].apply(clean_text_remove_stopword)
df_action_products['Description'] = df_action_products['Description'].apply(clean_text_remove_stopword)


In [35]:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.error import HTTPError
import json
import pandas as pd
import numpy as np
import re
main_url = "https://github.com/marketplace?type=actions"
pages=500
data_dict = {"Name": [], "info": []}

for page_num in range(1,pages+1):
  link1 = Request(main_url, headers={'User-Agent': 'Mozilla/5.0'})
  url1 = urlopen(link1)
  data1 = url1.read()
  soup = BeautifulSoup(data1).find('div',class_="mt-4 marketplace-common-module__marketplace-list-grid--vCk7D")
  #print(soup)
  products = soup. find_all('div',class_="position-relative border rounded-2 d-flex marketplace-common-module__marketplace-item--MohVH gap-3 p-3")
  #print(products)
  for i in products:
    p1 = i.find('div',class_="d-flex flex-justify-between flex-items-start gap-3")
    p1_name = p1.find('a').text
    p1_info = i.find('p').text
    data_dict["Name"].append(p1_name)
    data_dict["info"].append(p1_info)
    print(p1_name)
    print(p1_info)
df = pd.DataFrame(data_dict)
csv_filename = "Github_products.csv"
df.to_csv(csv_filename, index=False, encoding="utf-8")

print(f"CSV file '{csv_filename}' has been created successfully!")



[1;30;43mStreaming output truncated to the last 5000 lines.[0m

Rebuild Armbian
Build Armbian Linux

run-digger
Manage terraform collaboration

GitHub Script
Run simple scripts using the GitHub client

Deploy to GitHub Pages
This action will handle the deployment process of your project to GitHub Pages

ChatGPT CodeReviewer
A Code Review Action Powered By ChatGPT

FTP Deploy
Automate deploying websites and more with this GitHub action via FTP and FTPS

TruffleHog OSS
Scan Github Actions with TruffleHog

Metrics embed
An infographics generator with 40+ plugins and 300+ options to display stats about your GitHub account

yq - portable yaml processor
create, read, update, delete, merge, validate and do more with yaml

Super-Linter
Super-linter is a ready-to-run collection of linters and code analyzers, to help validate your source code

Gosec Security Checker
Runs the gosec security checker

Rebuild Armbian and Kernel
Support Amlogic, Rockchip and Allwinner boxes

OpenCommit — improve c

In [50]:
#Part_2
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')


# Step 1: Load the CSV file into a DataFrame
df = pd.read_csv('/content/Github_products.csv')

# Step 2: Preprocessing Function for 'product_information'
def preprocess_text(text):
    # Remove special characters, HTML tags, and extra spaces
    text_clean = re.sub(r'<.*?>', '', text)  # Remove HTML tags
    text_clean = re.sub(r'[^a-zA-Z\s]', '', text_clean)  # Remove special characters
    text_clean = text_clean.strip()  # Remove extra whitespace

    # Convert to lowercase
    text_clean = text_clean.lower()

    # Tokenize
    tokens = word_tokenize(text_clean)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return ' '.join(lemmatized_tokens)

# Step 3: Apply preprocessing to the 'product_information' column
df['processed_product_info'] = df['info'].apply(preprocess_text)

# Step 4: Data Quality Operations

# (a) Handle missing values (in both columns)
df['Name'].fillna('Unknown', inplace=True)  # Fill missing product names with 'Unknown'
df['info'].fillna('No information available', inplace=True)  # Fill missing info with placeholder

# (b) Remove duplicates based on product_name and product_information
df.drop_duplicates(subset=['Name', 'info'], keep='first', inplace=True)

# (c) Check for consistency (ensure product names are non-empty, and product info is valid)
# Ensure that no product name is empty
if df['Name'].str.strip().eq('').any():
    print("Warning: There are empty product names.")

# Ensure that product_information is not empty
if df['info'].str.strip().eq('').any():
    print("Warning: There are products with no information.")


# Step 5: Check cleaned data
print(df.head())




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


                           Name  \
0                TruffleHog OSS   
1                 Metrics embed   
2  yq - portable yaml processor   
3                  Super-Linter   
4        Gosec Security Checker   

                                                info  \
0              Scan Github Actions with TruffleHog\n   
1  An infographics generator with 40+ plugins and...   
2  create, read, update, delete, merge, validate ...   
3  Super-linter is a ready-to-run collection of l...   
4                  Runs the gosec security checker\n   

                              processed_product_info  
0                      scan github action trufflehog  
1  infographics generator plugins option display ...  
2      create read update delete merge validate yaml  
3  superlinter readytorun collection linters code...  
4                         run gosec security checker  


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Name'].fillna('Unknown', inplace=True)  # Fill missing product names with 'Unknown'
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['info'].fillna('No information available', inplace=True)  # Fill missing info with placeholder


In [37]:
# droping missing value
df_action_products = df_action_products.dropna(subset=['Description'])
# selecting specifice column
df_actions = df_action_products[['Product Name', 'Description', 'URL', 'Page Number']]
# storing data to csv
df_actions.to_csv('cleaned_github_actions_data.csv', index=False)

#Question 5 (20 points)

PART 1:
Web Scrape  tweets from Twitter using the Tweepy API, specifically targeting hashtags related to subtopics (machine learning or artificial intelligence.)
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow tutorials provided in canvas to obtain api keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide GPT code as shown in tutorials, dont make multiple requestes.


In [None]:
# installing tweepy for twitter scraping
!pip install tweepy



In [None]:
# importing tweepy for twitter api
import tweepy

In [46]:
# twitter API credentials
API_KEY = "2WAsQ5tiDlNjD2ygNfrvjPN3b"
API_KEY_SECRET = "r9N5FrfAnH9MkwC584cyTPDXB5wFbPb0AzY25jTkLn7ZDYf1CT"
ACCESS_TOKEN = "1872694165047234560-QgvHHSobJEQfnYMMTTHjQilSJyWGOE"
ACCESS_TOKEN_SECRET = "OyEXjP2GRHsUW7t9HSBTznax6OAkWnInAmcWRXLIDxdFd"
BEARER_TOKEN = "AAAAAAAAAAAAAAAAAAAAAHDHzQEAAAAAR3aL%2BfNtSRSrUw%2F1s2npdDv7Liw%3DFH5PvRylSEv0hN4PElepYCI7vD78j0xRrfhuP7WVgPK1Jc5IoN"

In [47]:
# authenticating using OAuth2
client = tweepy.Client(bearer_token=BEARER_TOKEN)

# defining search query and parameters
query = "(#MachineLearning OR #AI) -is:retweet lang:en"
tweets = client.search_recent_tweets(query=query, tweet_fields=["id", "text", "author_id"], max_results=100)

# extracting relevant data
data = []
if tweets.data:
    for tweet in tweets.data:
        data.append({
            "Tweet ID": tweet.id,
            "Username": tweet.author_id,
            "Text": tweet.text
        })

# converting to dataframe and display
twitter_df = pd.DataFrame(data)
twitter_df.head()

Unnamed: 0,Tweet ID,Username,Text
0,1892788816756560355,1888107342191087616,"@MEIMaths Meanwhile, @PublicAIData is over her..."
1,1892788799274467698,70110112,#Processvenue has ready infra to handle any vo...
2,1892788740294136214,1843300177203056640,💎 Acie AI – A Must-Watch AI &amp; Biotech Proj...
3,1892788736481530190,1793969145580568576,@MathewRiggs12 @UnleashBenjamin @BenjaminOnIp ...
4,1892788714025140629,1889333873475031040,Buy Naver Account\n Telegram: @Usaseoonline\n ...


In [48]:

# storing twitter_df to csv
twitter_df.to_csv('twitter_data.csv', index=False)

In [49]:
# performing data quality checks
missing_values = twitter_df.isnull().sum()
duplicate_rows = twitter_df.duplicated().sum()

# printing data quality report
print("Missing Values:\n", missing_values)
print("\nDuplicate Rows:", duplicate_rows)

# removing duplicates (if any)
df_twitter_cleaned = twitter_df.drop_duplicates()

# saving the cleaned data to a new CSV file
df_twitter_cleaned.to_csv('twitter_cleaned_data.csv', index=False)

Missing Values:
 Tweet ID    0
Username    0
Text        0
dtype: int64

Duplicate Rows: 0


# Mandatory Question

It took me really very long time to complete the assignment. I broke with lots of stuff. Incredibly, I found out the reasons for my errors were really small missing aspects at first. However, I found out the ways to achieve it though it is late. By debugging and installing necessary libraries. I enjoyed the practical nature of the assignment.  It was satisfying to see the scraper successfully collect data from the website and then use NLP techniques to analyze that data. Seeing the results of the NLP analysis (POS tags, entity counts) was interesting.  It provided some insights into the text data and the kind of information that can be extracted using these techniques. but the provided time was sufficient to complete the core requirements of the assignment however, I just realized I should have been start Little more.

# Write your response below
Fill out survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog