<a href="https://colab.research.google.com/github/VK1843/Varunkumar_INFO5731_Fall2025/blob/main/Chennuri_Varun_Assignment_2_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5)**Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [5]:
pip install pandas requests



In [8]:
import requests
import pandas as pd
import time
import os
from typing import List, Dict, Any

# --- Configuration ---
API_BASE_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
QUERY = "machine learning"
TARGET_PAPERS = 1000
PAGE_LIMIT = 100 # Max number of results per API call (Semantic Scholar limit)
CSV_FILENAME = "semantic_scholar_abstracts.csv"
SLEEP_TIME_SECONDS = 5  # Recommended delay to respect API rate limits

def fetch_data_from_api(query: str, offset: int) -> Dict[str, Any]:
    """
    Fetches a single page of data from the Semantic Scholar API.
    Handles basic error checking and rate limiting.
    """
    params = {
        'query': query,
        'offset': offset,
        'limit': PAGE_LIMIT,
        # Specify the fields we want to retrieve
        'fields': 'paperId,title,abstract,authors,year'
    }

    print(f"-> Fetching papers starting at offset {offset}...")

    try:
        response = requests.get(API_BASE_URL, params=params)
        response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)
        return response.json()
    except requests.exceptions.HTTPError as e:
        if response.status_code == 429:
            print(f"Rate limit hit (429). Sleeping for {SLEEP_TIME_SECONDS * 2} seconds.")
            time.sleep(SLEEP_TIME_SECONDS * 2)
            # Try again after waiting
            return fetch_data_from_api(query, offset)
        else:
            print(f"HTTP Error occurred: {e}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"An error occurred during API request: {e}")
        return None

def collect_and_save_abstracts():
    """
    Main function to collect abstracts, handle pagination, and save to CSV.
    """
    all_abstracts: List[Dict[str, Any]] = []
    current_offset = 0
    total_collected = 0

    # Loop until the target number of papers is met or no more results are available
    while total_collected < TARGET_PAPERS:

        data = fetch_data_from_api(QUERY, current_offset)

        if data is None or 'data' not in data:
            print("API returned no data or an error. Stopping collection.")
            break

        papers = data['data']

        if not papers:
            print("No more papers found. Stopping collection.")
            break

        print(f"   -> Successfully fetched {len(papers)} papers. Current total: {total_collected + len(papers)}")

        for paper in papers:
            # Clean and flatten the data before appending
            abstract_text = paper.get('abstract', 'N/A')

            # Skip papers without abstracts to ensure data quality
            if abstract_text == 'N/A' or not abstract_text:
                continue

            authors = ", ".join([a['name'] for a in paper.get('authors', [])])

            all_abstracts.append({
                'Paper ID': paper.get('paperId'),
                'Title': paper.get('title'),
                'Year': paper.get('year'),
                'Authors': authors,
                'Abstract': abstract_text
            })

        # Update counters for the next loop
        total_collected = len(all_abstracts)
        current_offset += PAGE_LIMIT

        # Check if we should stop
        if total_collected >= TARGET_PAPERS:
            print(f"Target of {TARGET_PAPERS} papers reached.")
            break

        # Respect API limits
        time.sleep(SLEEP_TIME_SECONDS)

    print(f"\n--- Collection Complete ---")
    print(f"Total valid abstracts collected: {total_collected}")

    if all_abstracts:
        # Save to CSV using pandas
        df = pd.DataFrame(all_abstracts)
        df.to_csv(CSV_FILENAME, index=False, encoding='utf-8')
        print(f"Data saved successfully to {os.path.abspath(CSV_FILENAME)}")
    else:
        print("No abstracts were collected. CSV file not created.")


if __name__ == "__main__":
    # Ensure pandas and requests are installed:
    # pip install pandas requests

    # Running the collector
    collect_and_save_abstracts()


-> Fetching papers starting at offset 0...
   -> Successfully fetched 100 papers. Current total: 100
-> Fetching papers starting at offset 100...
   -> Successfully fetched 100 papers. Current total: 159
-> Fetching papers starting at offset 200...
Rate limit hit (429). Sleeping for 10 seconds.
-> Fetching papers starting at offset 200...
   -> Successfully fetched 100 papers. Current total: 224
-> Fetching papers starting at offset 300...
   -> Successfully fetched 100 papers. Current total: 296
-> Fetching papers starting at offset 400...
Rate limit hit (429). Sleeping for 10 seconds.
-> Fetching papers starting at offset 400...
   -> Successfully fetched 100 papers. Current total: 346
-> Fetching papers starting at offset 500...
   -> Successfully fetched 100 papers. Current total: 387
-> Fetching papers starting at offset 600...
   -> Successfully fetched 100 papers. Current total: 439
-> Fetching papers starting at offset 700...
Rate limit hit (429). Sleeping for 10 seconds.
-> Fe

# Question 2 (15 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [13]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import os
import io

# --- Configuration ---
INPUT_FILE = "semantic_scholar_abstracts.csv"
OUTPUT_FILE = "semantic_scholar_abstracts_step_analysis.csv"
TARGET_COLUMN = "Abstract"

# --- NLTK Setup ---
# Downloads required NLTK data resources (needed for tokenization, stopwords, and lemmatization)
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt', quiet=True)
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab', quiet=True)
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords', quiet=True)
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet', quiet=True)
try:
    nltk.data.find('corpora/omw-1.4')
except LookupError:
    nltk.download('omw-1.4', quiet=True)


# Initialize NLTK tools globally for efficiency
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
english_stopwords = set(stopwords.words('english'))

In [14]:
# --- Data Cleaning Functions (Mapped to User's Request Numbering) ---

# We will run these functions in the standard NLP order (4, 1, 2, 3, 6, 5)
# to ensure correctness, but name them according to the user's requested numbering.

def clean_step_1_remove_noise_punct(text):
    """(1) Remove noise, such as special characters and punctuations."""
    # This step is applied AFTER lowercasing for better regex handling.
    if pd.isna(text) or not isinstance(text, str):
        return ""
    # Keep only letters and spaces, replacing punctuation/special chars with a space
    text = re.sub(r'[^\w\s]', ' ', text)
    # Collapse multiple spaces and strip
    return re.sub(r'\s+', ' ', text).strip()

In [15]:
def clean_step_2_remove_numbers(text):
    """(2) Remove numbers."""
    if pd.isna(text) or not isinstance(text, str):
        return ""
    # Replace all digits with a single space
    text = re.sub(r'\d+', ' ', text)
    # Collapse multiple spaces and strip
    return re.sub(r'\s+', ' ', text).strip()

In [16]:
def clean_step_3_remove_stopwords(text):
    """(3) Remove stopwords by using the stopwords list."""
    if pd.isna(text) or not isinstance(text, str):
        return ""
    # Tokenize the text
    tokens = word_tokenize(text)
    # Filter out stopwords and single characters (often remnants of noise)
    tokens = [word for word in tokens if word not in english_stopwords and len(word) > 1]
    # Rejoin the processed tokens
    return ' '.join(tokens)

In [17]:
def clean_step_4_lowercase(text):
    """(4) Lowercase all texts."""
    # NOTE: This step is logically the first cleaning step performed on the raw data.
    if pd.isna(text) or not isinstance(text, str):
        return ""
    return text.lower()

In [18]:
def clean_step_5_stemming(text):
    """(5) Stemming (Applied after Lemmatization)."""
    if pd.isna(text) or not isinstance(text, str):
        return ""
    # Tokenize the text
    tokens = word_tokenize(text)
    # Stem tokens to their root
    tokens = [stemmer.stem(word) for word in tokens]
    # Rejoin the processed tokens
    return ' '.join(tokens)

In [19]:
def clean_step_6_lemmatization(text):
    """(6) Lemmatization (Applied before Stemming)."""
    if pd.isna(text) or not isinstance(text, str):
        return ""
    # Tokenize the text
    tokens = word_tokenize(text)
    # Lemmatize tokens to their base form
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Rejoin the processed tokens
    return ' '.join(tokens)

In [20]:
def create_dummy_data(file_name):
    """Creates a dummy CSV file if the actual file is not found, for demonstration."""
    data = {
        'paperId': ['id1', 'id2', 'id3', 'id4'],
        'title': ['Machine Learning in 2024', 'A Novel Algorithm for 5G Networks', 'The Future of AI and Robotics', 'Data Analysis with Python 3.9'],
        'Abstract': [
            "This paper explores various machine learning models (like SVM and Neural Networks) and compares their performance on large datasets, achieving 98.5% accuracy.",
            "Our research proposes a novel algorithm (ALGO-5G-B) for optimizing bandwidth allocation in complex 5G network architectures. Results show a 20% improvement.",
            "Understanding the societal impact of artificial intelligence and advanced robotics systems in the next decade. Ethical considerations are paramount.",
            "A tutorial on performing basic data cleaning, analysis, and visualization using the Python 3.9 ecosystem. No complex mathematics are involved."
        ],
        'year': [2024, 2023, 2025, 2022]
    }
    df = pd.DataFrame(data)
    df.to_csv(file_name, index=False)
    print(f"'{file_name}' not found. Created dummy data for demonstration.")

In [21]:
def main():
    """Main function to run the text cleaning pipeline and create step-by-step columns."""

    # 1. Load Data
    try:
        df = pd.read_csv(INPUT_FILE)
    except FileNotFoundError:
        create_dummy_data(INPUT_FILE)
        df = pd.read_csv(INPUT_FILE)
    except Exception as e:
        print(f"Error loading file {INPUT_FILE}: {e}")
        return

    print(f"\n--- Starting Step-by-Step Text Cleaning Pipeline ---")
    print(f"Input file: {INPUT_FILE} ({len(df)} rows)")

    if TARGET_COLUMN not in df.columns:
        print(f"Error: Target column '{TARGET_COLUMN}' not found. Please check your CSV column names.")
        return

    # 2. Apply Cleaning Functions Sequentially (Standard NLP Order)
    # NOTE: The steps below are applied in the order required for correct NLP processing.

    # Step A (User's #4): Lowercase (Applied to Raw Data)
    df['step_A_lowercase'] = df[TARGET_COLUMN].apply(clean_step_4_lowercase)

    # Step B (User's #1): Remove Punctuation/Noise (Applied to Lowercased text)
    df['step_B_no_punct'] = df['step_A_lowercase'].apply(clean_step_1_remove_noise_punct)

    # Step C (User's #2): Remove Numbers (Applied to Punctuation-Free text)
    df['step_C_no_numbers'] = df['step_B_no_punct'].apply(clean_step_2_remove_numbers)

    # Step D (User's #3): Remove Stopwords (Applied to Number-Free text)
    df['step_D_no_stopwords'] = df['step_C_no_numbers'].apply(clean_step_3_remove_stopwords)

    # Step E (User's #6): Lemmatization (Applied to Stopword-Free text)
    df['step_E_lemmatized'] = df['step_D_no_stopwords'].apply(clean_step_6_lemmatization)

    # Step F (User's #5): Stemming (Applied to Lemmatized text - Final Output)
    df['step_F_stemmed'] = df['step_E_lemmatized'].apply(clean_step_5_stemming)


    # 3. Save Cleaned Data
    df['cleaned_abstract_final'] = df['step_F_stemmed']
    df.to_csv(OUTPUT_FILE, index=False)

    print(f"\n--- Cleaning Complete ---")
    print(f"Total rows processed: {len(df)}")
    print(f"Cleaned data saved with all 6 steps in: {OUTPUT_FILE}")

    # Show step-by-step output based on the user's requested numbering (1-6)
    print("\n--- Demonstration of Cleaning Steps (First Row Transformation) ---")
    print(f"Original Abstract: \n  {df[TARGET_COLUMN].iloc[0]}\n")
    print(f"(4) Lowercase (Step A):\n  {df['step_A_lowercase'].iloc[0]}")
    print(f"(1) No Punctuation (Step B):\n  {df['step_B_no_punct'].iloc[0]}")
    print(f"(2) No Numbers (Step C):\n  {df['step_C_no_numbers'].iloc[0]}")
    print(f"(3) No Stopwords (Step D):\n  {df['step_D_no_stopwords'].iloc[0]}")
    print(f"(6) Lemmatization (Step E):\n  {df['step_E_lemmatized'].iloc[0]}")
    print(f"(5) Stemming/Final (Step F):\n  {df['step_F_stemmed'].iloc[0]}")


if __name__ == "__main__":
    main()


--- Starting Step-by-Step Text Cleaning Pipeline ---
Input file: semantic_scholar_abstracts.csv (451 rows)

--- Cleaning Complete ---
Total rows processed: 451
Cleaned data saved with all 6 steps in: semantic_scholar_abstracts_step_analysis.csv

--- Demonstration of Cleaning Steps (First Row Transformation) ---
Original Abstract: 
  We present Fashion-MNIST, a new dataset comprising of 28x28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the test set has 10,000 images. Fashion-MNIST is intended to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms, as it shares the same image size, data format and the structure of training and testing splits. The dataset is freely available at this https URL

(4) Lowercase (Step A):
  we present fashion-mnist, a new dataset comprising of 28x28 grayscale images of 70,000 fashion products from 10 categori

# Question 3 (15 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [22]:
# Your code here
import pandas as pd
import spacy
from collections import Counter
import os
import io

# NOTE ON LIBRARIES:
# This script uses the powerful spaCy library for advanced analysis (NER and Dependency Parsing).
# Before running, you must install spaCy and download its small English model:
# pip install spacy
# python -m spacy download en_core_web_sm

# --- Configuration ---
INPUT_FILE = "semantic_scholar_abstracts_step_analysis.csv"
OUTPUT_FILE = "semantic_scholar_abstracts_syntax_analysis.csv"
CLEANED_COLUMN = "cleaned_abstract_final" # The final stemmed text column from the previous script

# Global spaCy model object
nlp = None

def load_spacy_model():
    """Loads the spaCy model, handling potential errors if not installed."""
    global nlp
    try:
        # Load the small English model
        nlp = spacy.load("en_core_web_sm")
        print("spaCy model loaded successfully.")
    except OSError:
        print("\nERROR: spaCy model 'en_core_web_sm' not found.")
        print("Please run the following commands in your terminal:")
        print("  pip install spacy")
        print("  python -m spacy download en_core_web_sm")
        nlp = None
    except Exception as e:
        print(f"An unexpected error occurred while loading spaCy: {e}")
        nlp = None

def create_dummy_data(file_name):
    """Creates a dummy CSV with cleaned text if the actual file is not found."""
    print(f"'{file_name}' not found. Creating dummy data with a '{CLEANED_COLUMN}' column.")
    data = {
        'paperId': ['id1', 'id2', 'id3', 'id4'],
        CLEANED_COLUMN: [
            "paper explor variou machin learn model like svm neural network compar perform larg dataset achiev accuraci",
            "research propos novel algorithm algo b optim bandwidth alloc complex network architectur result show improv",
            "understand societal impact artificial intellig advanc robot system next decad ethic consider paramount",
            "tutori perform basic data clean analysi visual python ecosystem complex mathemat involv"
        ]
    }
    df = pd.DataFrame(data)
    # The dummy data is intentionally stemmed and lowercase, mimicking the input.
    df.to_csv(file_name, index=False)
    print("Dummy file created. Proceeding with analysis...")
    return df

# --- (1) Parts of Speech (POS) Tagging Analysis ---

def analyze_pos(doc):
    """
    Tags Parts of Speech and calculates the count of Noun, Verb, Adj, and Adv.
    Uses spaCy's coarse-grained tags (e.g., NOUN, VERB).
    """
    pos_counts = Counter()
    pos_tags = []

    for token in doc:
        # Get the coarse-grained Universal POS tag (NOUN, VERB, ADJ, ADV)
        pos_tag = token.pos_
        pos_tags.append(f"{token.text}/{pos_tag}")

        if pos_tag in ['NOUN', 'VERB', 'ADJ', 'ADV']:
            # Count the required POS categories
            pos_counts[pos_tag] += 1

    # Format the results for the DataFrame column
    summary = (f"N: {pos_counts['NOUN']}, "
               f"V: {pos_counts['VERB']}, "
               f"Adj: {pos_counts['ADJ']}, "
               f"Adv: {pos_counts['ADV']}")

    return ' '.join(pos_tags), summary

# --- (2) Dependency Parsing Analysis ---
# NOTE: Constituency Parsing structure is explained in the 'analysis_explanation.md' file.

def analyze_dependency_parsing(doc):
    """
    Analyzes dependency parsing for all sentences in a text.
    Returns a string representation of the dependency tree for each sentence.
    """
    dependency_trees = []

    # Iterate over sentences segmented by spaCy
    for sent in doc.sents:
        tree_lines = []
        # Iterate over tokens in the sentence to build the dependency structure
        for token in sent:
            # Format: TOKEN (DEP_RELATION) --> HEAD_TOKEN
            line = (
                f"{token.text} ({token.dep_}) --> {token.head.text}"
            )
            tree_lines.append(line)

        # Add the visualization for the whole sentence
        dependency_trees.append(" | ".join(tree_lines))

    return "\n".join(dependency_trees)

# --- (3) Named Entity Recognition (NER) Analysis ---

def analyze_ner(doc):
    """
    Extracts all named entities and calculates the count of each entity type specified.
    """
    entity_counts = Counter()
    entities_list = []

    for ent in doc.ents:
        # ent.text is the entity text, ent.label_ is the type (e.g., ORG, PERSON)
        entities_list.append(f"{ent.text}/{ent.label_}")

        # Count only the types relevant to the user's request
        if ent.label_ in ['PERSON', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'DATE']:
            entity_counts[ent.label_] += 1

    # Format the results for the DataFrame column
    summary = (f"Person: {entity_counts['PERSON']}, "
               f"Org: {entity_counts['ORG']}, "
               f"Loc/GPE: {entity_counts['LOC'] + entity_counts['GPE']}, "
               f"Product: {entity_counts['PRODUCT']}, "
               f"Date: {entity_counts['DATE']}")

    return ' '.join(entities_list), summary


def main():
    """Main function to run the syntax and structure analysis pipeline."""

    # 0. Initialize spaCy
    load_spacy_model()
    if nlp is None:
        return # Exit if model failed to load

    # 1. Load Data
    try:
        df = pd.read_csv(INPUT_FILE)
    except FileNotFoundError:
        df = create_dummy_data(INPUT_FILE)
    except Exception as e:
        print(f"Error loading file {INPUT_FILE}: {e}")
        return

    print(f"\n--- Starting Syntax and Structure Analysis Pipeline ---")
    print(f"Input file: {INPUT_FILE} ({len(df)} rows)")

    if CLEANED_COLUMN not in df.columns:
        print(f"Error: Required cleaned column '{CLEANED_COLUMN}' not found in the input CSV.")
        print("Please ensure you ran the previous cleaning script successfully.")
        return

    # 2. Process Text and Apply Analyses

    # Create a spaCy Doc object for each row for efficient processing
    # The text is already stemmed, which may impact POS and NER accuracy, but we proceed with the cleaned text as instructed.
    df['spacy_doc'] = df[CLEANED_COLUMN].apply(lambda text: nlp(str(text)) if pd.notna(text) else nlp(""))
    print("All abstracts converted to spaCy Doc objects.")

    # (1) Parts of Speech (POS) Tagging
    # The zip(*...) unpacks the tuple return (tags_string, summary_string) into two separate columns
    df['pos_tags'], df['pos_summary'] = zip(*df['spacy_doc'].apply(analyze_pos))
    print("-> (1) POS Tagging Complete.")

    # (2) Dependency Parsing
    df['dependency_tree'] = df['spacy_doc'].apply(analyze_dependency_parsing)
    print("-> (2) Dependency Parsing Complete.")

    # (3) Named Entity Recognition (NER)
    df['ner_entities'], df['ner_summary'] = zip(*df['spacy_doc'].apply(analyze_ner))
    print("-> (3) Named Entity Recognition Complete.")


    # 3. Save Results
    # Select original columns plus new analysis columns for the output file

    # Create the list of new columns
    analysis_columns = ['pos_summary', 'ner_summary', 'pos_tags', 'dependency_tree', 'ner_entities']

    # Identify existing columns and remove the temporary 'spacy_doc'
    output_columns = df.columns.tolist()
    if 'spacy_doc' in output_columns:
        output_columns.remove('spacy_doc')

    # Reorder columns to put original data first, then the analysis summaries/details
    final_output_columns = [col for col in output_columns if col not in analysis_columns] + analysis_columns

    df[final_output_columns].to_csv(OUTPUT_FILE, index=False)

    print(f"\n--- Analysis Complete ---")
    print(f"Results saved to: {OUTPUT_FILE}")

    # Display the analysis for the first row
    print("\n--- Summary of Analysis for First Row ---")
    first_row = df.iloc[0]
    print(f"Original Cleaned Text: {first_row[CLEANED_COLUMN]}")
    print(f"\n(1) POS Summary: {first_row['pos_summary']}")
    print(f"\n(2) Dependency Parse (Sample Sentence):\n{first_row['dependency_tree'].split('\n')[0]}")
    print(f"\n(3) NER Summary: {first_row['ner_summary']}")


if __name__ == "__main__":
    main()



spaCy model loaded successfully.

--- Starting Syntax and Structure Analysis Pipeline ---
Input file: semantic_scholar_abstracts_step_analysis.csv (451 rows)
All abstracts converted to spaCy Doc objects.
-> (1) POS Tagging Complete.
-> (2) Dependency Parsing Complete.
-> (3) Named Entity Recognition Complete.

--- Analysis Complete ---
Results saved to: semantic_scholar_abstracts_syntax_analysis.csv

--- Summary of Analysis for First Row ---
Original Cleaned Text: present fashion mnist new dataset compris grayscal imag fashion product categori imag per categori train set imag test set imag fashion mnist intend serv direct drop replac origin mnist dataset benchmark machin learn algorithm share imag size data format structur train test split dataset freeli avail http url

(1) POS Summary: N: 20, V: 7, Adj: 5, Adv: 0

(2) Dependency Parse (Sample Sentence):
present (amod) --> mnist | fashion (compound) --> mnist | mnist (nsubj) --> imag | new (amod) --> imag | dataset (compound) --> grays

# **Following Questions must answer using AI assitance**

#Question 4 (20 points).

Q4. (PART-1)
Web scraping data from the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details extracted include the product name, a short description, and the URL.

 The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist by helping with the parsing of HTML, error handling, and generating reports based on the data collected.

 The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.


Github MarketPlace page:
https://github.com/marketplace?type=actions

In [37]:
# Part 1
import random, requests
from bs4 import BeautifulSoup

BASE_URL = "https://github.com/marketplace?type=actions"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; GH-ResearchBot/1.0)",
    "Accept-Language": "en-US,en;q=0.9",
}

def get_page_html(page: int, retries=3, timeout=20):
    url = f"{BASE_URL}&page={page}"
    last_err = None
    for attempt in range(1, retries+1):
        try:
            r = requests.get(url, headers=HEADERS, timeout=timeout)
            # If GitHub ever rate-limits, back off and retry
            if r.status_code == 429:
                time.sleep(attempt * 2.0)
                continue
            r.raise_for_status()
            html = r.text
            # Quick human-check/captcha guard
            if "please verify you are a human" in html.lower():
                raise RuntimeError("Blocked by anti-bot / CAPTCHA")
            return html
        except Exception as e:
            last_err = e
            time.sleep(attempt * 1.5)
    raise RuntimeError(f"Failed page {page}: {last_err}")

def parse_actions(html: str, page: int):
    soup = BeautifulSoup(html, "html.parser")
    out = []
    # Anchor on the action detail link; classes change, href pattern is stable
    for a in soup.select('a[href^="/marketplace/actions/"]'):
        href = a.get("href", "")
        # Avoid top nav/side links by focusing on list area:
        # walk up to a reasonable container then mine fields
        container = a.find_parent(["article", "li", "div", "section"]) or a.parent

        # Name: prefer the anchor text or the nearest <h3>
        name = a.get_text(" ", strip=True)
        if not name:
            h3 = container.find("h3")
            name = h3.get_text(" ", strip=True) if h3 else ""

        # Description: common pattern is a muted <p> or first <p> in container
        desc = ""
        p = container.find("p")
        if p:
            desc = p.get_text(" ", strip=True)

        # Normalize URL
        url = f"https://github.com{href}" if href.startswith("/") else href

        if name and url:
            out.append({"Product Name": name, "Description": desc, "URL": url, "Page Number": page})
    return out

# ---- scrape loop (keeps your config) ----
TARGET_COUNT = 1000      # GitHub typically caps listings to ~1000 results across pagination. :contentReference[oaicite:1]{index=1}
MAX_PAGES = 500          # safety cap (page links show up to 500 currently). :contentReference[oaicite:2]{index=2}
DELAY_RANGE = (1.0, 2.0) # polite delay

records = []
page = 1
while len(records) < TARGET_COUNT and page <= MAX_PAGES:
    html = get_page_html(page)
    rows = parse_actions(html, page)

    # de-dup across pages by URL before extending
    before = len(records)
    seen = {r["URL"] for r in records}
    rows = [r for r in rows if r["URL"] not in seen]
    records.extend(rows)

    print(f"Page {page}: found {len(rows)} actions (total so far: {len(records)})")

    # Stop if the page obviously has no results (end of pagination)
    if len(rows) == 0:
        break

    time.sleep(random.uniform(*DELAY_RANGE))
    page += 1

# Save & continue with your Part-2 cleaning
raw_df = pd.DataFrame(records, columns=['Product Name','Description','URL','Page Number'])
raw_df.to_csv('github_actions_raw.csv', index=False)
print(f"✅ Scraped {len(raw_df)} actions across {raw_df['Page Number'].nunique()} pages.")
print(raw_df.head())


Page 1: found 20 actions (total so far: 20)
Page 2: found 20 actions (total so far: 40)
Page 3: found 20 actions (total so far: 60)
Page 4: found 0 actions (total so far: 60)
✅ Scraped 60 actions across 3 pages.
                   Product Name Description  \
0                TruffleHog OSS               
1                 Metrics embed               
2  yq - portable yaml processor               
3                  Super-Linter               
4        Gosec Security Checker               

                                                 URL  Page Number  
0  https://github.com/marketplace/actions/truffle...            1  
1  https://github.com/marketplace/actions/metrics...            1  
2  https://github.com/marketplace/actions/yq-port...            1  
3  https://github.com/marketplace/actions/super-l...            1  
4  https://github.com/marketplace/actions/gosec-s...            1  


In [38]:
# === PART-2: Preprocess Data + Data Quality for GitHub Marketplace Actions ===
import re, os, pandas as pd, nltk
from collections import Counter

# --- Ensure NLTK resources ---
try: nltk.data.find("corpora/stopwords")
except LookupError: nltk.download("stopwords", quiet=True)
try: nltk.data.find("corpora/wordnet")
except LookupError: nltk.download("wordnet", quiet=True)

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text: str) -> str:
    """
    Clean text by:
      - removing HTML tags
      - lowercasing
      - removing special chars
      - tokenizing on whitespace
      - removing stopwords / 1-char tokens
      - lemmatizing
    Returns a single space-joined normalized string.
    """
    if not isinstance(text, str):
        return ""
    txt = re.sub(r"<[^>]+>", " ", text)            # strip HTML
    txt = re.sub(r"\s+", " ", txt).strip().lower() # normalize whitespace + lowercase
    txt = re.sub(r"[^a-z0-9 ]+", " ", txt)         # keep alnum and space
    toks = [t for t in txt.split() if len(t) > 1 and t not in STOPWORDS]
    toks = [LEMMATIZER.lemmatize(t) for t in toks]
    return " ".join(toks)

# --- Load raw data (prefer in-memory raw_df; fallback to CSV) ---
if "raw_df" in globals() and isinstance(raw_df, pd.DataFrame):
    df = raw_df.copy()
else:
    src = "github_actions_raw.csv"
    if not os.path.exists(src):
        raise FileNotFoundError("github_actions_raw.csv not found. Run PART-1 first.")
    df = pd.read_csv(src)

# --- DATA QUALITY OPERATIONS ---
# 1) Required columns present
required_cols = ["Product Name", "Description", "URL", "Page Number"]
missing_cols = [c for c in required_cols if c not in df.columns]
if missing_cols:
    raise ValueError(f"Missing required columns: {missing_cols}")

initial_rows = len(df)

# 2) Drop rows missing essential keys
df.dropna(subset=["Product Name", "URL"], inplace=True)

# 3) Normalize URL + filter to github-only
df["URL"] = df["URL"].astype(str).str.strip()
df = df[df["URL"].str.startswith("https://github.com/")]

# 4) Dedupe primarily by URL
before_dedup = len(df)
df.drop_duplicates(subset=["URL"], inplace=True)
duplicates_removed = before_dedup - len(df)

# 5) Coerce types
df["Page Number"] = pd.to_numeric(df["Page Number"], errors="coerce").fillna(-1).astype(int)

# 6) Handle missing description (allowed but normalized)
df["Description"] = df["Description"].fillna("").astype(str)

# --- PREPROCESSING (tokenize/lower/stopwords/lemmatize) ---
df["Product Name Clean"] = df["Product Name"].apply(preprocess_text)
df["Description Clean"] = df["Description"].apply(preprocess_text)

# 7) Keep rows that still have a meaningful name after cleaning
before_name_filter = len(df)
df = df[df["Product Name Clean"].str.len() > 0]
name_filtered_removed = before_name_filter - len(df)

# --- SMALL REPORTS / ARTIFACTS ---
# Token frequency (top 50) across both fields
all_tokens = (df["Product Name Clean"].fillna("").str.cat(df["Description Clean"].fillna(""), sep=" ").str.split())
flat_tokens = [t for sub in all_tokens for t in sub]
freq = Counter(flat_tokens).most_common(50)
freq_df = pd.DataFrame(freq, columns=["token", "count"])
freq_df.to_csv("token_frequency_top50.csv", index=False)

# Final save
clean_path = "github_actions_cleaned.csv"
df.to_csv(clean_path, index=False)

# Data Quality summary
report = {
    "initial_rows": int(initial_rows),
    "after_required_keys": int(len(df) + duplicates_removed + name_filtered_removed),
    "duplicates_removed": int(duplicates_removed),
    "name_filtered_rows_removed": int(name_filtered_removed),
    "final_rows": int(len(df)),
    "unique_urls": int(df["URL"].nunique()),
    "rows_with_empty_description": int((df["Description"].str.strip()=="").sum()),
    "unique_pages": int(df["Page Number"].nunique()),
}
print("📊 Data Quality Report")
for k,v in report.items():
    print(f"- {k}: {v}")

print(f"\n✅ Saved cleaned dataset → {clean_path}")
print("🧾 Saved token frequency (top 50) → token_frequency_top50.csv")

# Quick preview
df.head(10)



📊 Data Quality Report
- initial_rows: 60
- after_required_keys: 60
- duplicates_removed: 0
- name_filtered_rows_removed: 0
- final_rows: 60
- unique_urls: 60
- rows_with_empty_description: 60
- unique_pages: 3

✅ Saved cleaned dataset → github_actions_cleaned.csv
🧾 Saved token frequency (top 50) → token_frequency_top50.csv


Unnamed: 0,Product Name,Description,URL,Page Number,Product Name Clean,Description Clean
0,TruffleHog OSS,,https://github.com/marketplace/actions/truffle...,1,trufflehog os,
1,Metrics embed,,https://github.com/marketplace/actions/metrics...,1,metric embed,
2,yq - portable yaml processor,,https://github.com/marketplace/actions/yq-port...,1,yq portable yaml processor,
3,Super-Linter,,https://github.com/marketplace/actions/super-l...,1,super linter,
4,Gosec Security Checker,,https://github.com/marketplace/actions/gosec-s...,1,gosec security checker,
5,Rebuild Armbian and Kernel,,https://github.com/marketplace/actions/rebuild...,1,rebuild armbian kernel,
6,Checkout,,https://github.com/marketplace/actions/checkout,1,checkout,
7,OpenCommit — improve commits with AI 🧙,,https://github.com/marketplace/actions/opencom...,1,opencommit improve commits ai,
8,SSH Remote Commands,,https://github.com/marketplace/actions/ssh-rem...,1,ssh remote command,
9,generate-snake-game-from-github-contribution-grid,,https://github.com/marketplace/actions/generat...,1,generate snake game github contribution grid,


#Question 5 (20 points)

PART 1:
Web Scrape  tweets from Twitter using the Tweepy API, specifically targeting hashtags related to subtopics (machine learning or artificial intelligence.)
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow tutorials provided in canvas to obtain api keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide GPT code as shown in tutorials, dont make multiple requestes.


In [41]:
# Install Tweepy
!pip install tweepy

import tweepy
import pandas as pd

# Replace with your actual Bearer Token from Twitter Developer Portal
bearer_token = "AAAAAAAAAAAAAAAAAAAAAK734QEAAAAAXTwiK1bd4xYcPYRDfzIk%2FrKepx8%3DKbGbuWcJvxDe8hie2lJP39zTmw1HVHZmwBRMbnvnmiPYnAXcO1"

# Authenticate with Tweepy v2
client = tweepy.Client(bearer_token=bearer_token)

# Define query and parameters
query = "#machinelearning OR #artificialintelligence -is:retweet lang:en"
max_results = 100

# Fetch tweets
response = client.search_recent_tweets(
    query=query,
    max_results=max_results,
    tweet_fields=["id", "text", "author_id"]
)

# Extract data
tweets_data = []
if response.data:
    for tweet in response.data:
        tweets_data.append({
            "Tweet ID": tweet.id,
            "Username": tweet.author_id,
            "Text": tweet.text
        })

# Save raw data
df_raw = pd.DataFrame(tweets_data)
df_raw.to_csv("raw_tweets.csv", index=False)
print(f"✅ Scraped {len(df_raw)} tweets.")
print(df_raw.head())


✅ Scraped 100 tweets.
              Tweet ID            Username  \
0  1972810628223156645  945828015535038464   
1  1972810623030554629           952502046   
2  1972810594312192041          3068484222   
3  1972810311154704578           178563985   
4  1972810280725008808           568641861   

                                                Text  
0  ⏪🪐\nIf you want to make money investing in sto...  
1  For creative minds https://t.co/9ilgTaZESd\nPl...  
2  RT @rasangarocks: The best books to learn Pyth...  
3  RT @jblefevre60: 5 project to learn AI!\n\n#AI...  
4  RT @We_Promote_All: 🚀 Unlock the secrets of AI...  


In [42]:
import re

# Load raw data
df = pd.read_csv("raw_tweets.csv")

# Clean text
def clean_text(text):
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"#\w+", "", text)
    text = re.sub(r"[^\w\s]", "", text)
    return text.strip()

df["Cleaned Text"] = df["Text"].apply(clean_text)

# Drop duplicates and missing values
df.drop_duplicates(subset="Tweet ID", inplace=True)
df.dropna(subset=["Tweet ID", "Username", "Cleaned Text"], inplace=True)

# Save cleaned data
df.to_csv("cleaned_tweets.csv", index=False)
print(f"\n✅ Cleaned data saved. Final record count: {len(df)}")
print(df[["Tweet ID", "Username", "Cleaned Text"]].head())



✅ Cleaned data saved. Final record count: 100
              Tweet ID            Username  \
0  1972810628223156645  945828015535038464   
1  1972810623030554629           952502046   
2  1972810594312192041          3068484222   
3  1972810311154704578           178563985   
4  1972810280725008808           568641861   

                                        Cleaned Text  
0  If you want to make money investing in stocks ...  
1  For creative minds \nPlay inside of your  with...  
2  RT  The best books to learn Python programming...  
3                          RT  5 project to learn AI  
4  RT   Unlock the secrets of AI amp speech recog...  


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

# Write your response below
Fill out survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog