<a href="https://colab.research.google.com/github/akhil27/Akhil_INFO5731_SPRING2025/blob/main/Indraganti_Akhil_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 2**

In this assignment, you will work on gathering text data from an open data source via web scraping or API. Following this, you will need to clean the text data and perform syntactic analysis on the data. Follow the instructions carefully and design well-structured Python programs to address each question.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

* **Make sure to submit the cleaned data CSV in the comment section - 10 points**

**Total points**: 100

**Deadline**: Monday, at 11:59 PM.

**Late Submission will have a penalty of 10% reduction for each day after the deadline.**

**Please check that the link you submitted can be opened and points to the correct assignment.**


# Question 1 (25 points)

Write a python program to collect text data from **either of the following sources** and save the data into a **csv file:**

(1) Collect all the customer reviews of a product (you can choose any porduct) on amazon. [atleast 1000 reviews]

(2) Collect the top 1000 User Reviews of a movie recently in 2023 or 2024 (you can choose any movie) from IMDB. [If one movie doesn't have sufficient reviews, collect reviews of atleast 2 or 3 movies]


(3) Collect the **abstracts** of the top 10000 research papers by using the query "machine learning", "data science", "artifical intelligence", or "information extraction" from Semantic Scholar.

(4) Collect all the information of the 904 narrators in the Densho Digital Repository.

(5)**Collect a total of 10000 reviews** of the top 100 most popular software from G2 and Capterra.


In [1]:
import requests
from bs4 import BeautifulSoup
import csv
import time

def extract_feedback(urls, headers, limit=1000):
    feedback_list = []

    for link in urls:
        print(f"Fetching data from: {link}")

        response = requests.get(link, headers=headers)
        if response.status_code != 200:
            print(f"Error fetching page: {response.status_code}")
            continue

        content = BeautifulSoup(response.text, 'html.parser')
        feedback_items = content.find_all('article', class_='user-review-item')

        if not feedback_items:
            print(f"No further feedback available for {link}.")
            continue

        for item in feedback_items:
            if len(feedback_list) >= limit:
                break

            heading = item.find('h3', class_='ipc-title__text')
            headline = heading.text.strip() if heading else "N/A"

            score = item.find('span', class_='ipc-rating-star--rating')
            rating = score.text.strip() if score else "N/A"

            text = item.find('div', class_='ipc-html-content-inner-div')
            review_text = text.text.strip() if text else "N/A"

            feedback_list.append({
                'Headline': headline,
                'Score': rating,
                'Content': review_text
            })

        print(f"Collected {len(feedback_items)} entries from {link}.")

        if len(feedback_list) >= limit:
            break

        time.sleep(1)

    return feedback_list

def store_csv(feedback_list, filename):
    columns = ['Headline', 'Score', 'Content']
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=columns)
        writer.writeheader()
        writer.writerows(feedback_list)

if __name__ == '__main__':
    urls = [
        "https://www.imdb.com/title/tt10872600/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt9362722/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt15398776/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt10638522/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt1462764/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt9603212/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt10151854/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt9362722/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt1745960/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt1630029/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt11813216/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt13539646/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt5884796/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt1136617/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt1488589/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt10640346/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt13345606/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt10640346/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt13122400/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt1392214/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt10293406/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt14715170/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt11145118/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt15239678/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt11083552/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt13345606/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt11317142/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt6710474/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt0076759/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt4154796/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt0816692/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt0468569/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt0109830/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt0111161/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt0133093/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt0120737/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt0110912/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt0068646/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt0114369/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt0108052/reviews/?ref_=tt_ov_ururv",
        "https://www.imdb.com/title/tt0110413/reviews/?ref_=tt_ov_ururv"
    ]

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36"
    }

    feedback = extract_feedback(urls, headers, limit=1000)
    store_csv(feedback, 'review_data.csv')

    print(f"Saved {len(feedback)} entries to 'review_data.csv'")


Fetching data from: https://www.imdb.com/title/tt10872600/reviews/?ref_=tt_ov_ururv
Collected 25 entries from https://www.imdb.com/title/tt10872600/reviews/?ref_=tt_ov_ururv.
Fetching data from: https://www.imdb.com/title/tt9362722/reviews/?ref_=tt_ov_ururv
Collected 25 entries from https://www.imdb.com/title/tt9362722/reviews/?ref_=tt_ov_ururv.
Fetching data from: https://www.imdb.com/title/tt15398776/reviews/?ref_=tt_ov_ururv
Collected 25 entries from https://www.imdb.com/title/tt15398776/reviews/?ref_=tt_ov_ururv.
Fetching data from: https://www.imdb.com/title/tt10638522/reviews/?ref_=tt_ov_ururv
Collected 25 entries from https://www.imdb.com/title/tt10638522/reviews/?ref_=tt_ov_ururv.
Fetching data from: https://www.imdb.com/title/tt1462764/reviews/?ref_=tt_ov_ururv
Collected 25 entries from https://www.imdb.com/title/tt1462764/reviews/?ref_=tt_ov_ururv.
Fetching data from: https://www.imdb.com/title/tt6791350/reviews/?ref_=tt_ov_ururv
Collected 25 entries from https://www.imdb.com

# Question 2 (15 points)

Write a python program to **clean the text data** you collected in the previous question and save the clean data in a new column in the csv file. The data cleaning steps include: [Code and output is required for each part]

(1) Remove noise, such as special characters and punctuations.

(2) Remove numbers.

(3) Remove stopwords by using the stopwords list.

(4) Lowercase all texts

(5) Stemming.

(6) Lemmatization.

In [None]:
!pip install nltk






In [2]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

data = pd.read_csv("review_data.csv")

word_filter = set(stopwords.words('english'))
stem_func = PorterStemmer()
lemma_func = WordNetLemmatizer()

def process_text(text):
    if pd.isna(text):
        return ""

    # (1) Remove noise, such as special characters and punctuations.
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # (2) Remove numbers.
    text = re.sub(r'\d+', '', text)

    # (3) Remove stopwords by using the stopwords list.
    words = text.split()
    words = [word for word in words if word.lower() not in word_filter]

    # (4) Lowercase all texts.
    words = [word.lower() for word in words]

    # (5) Stemming.
    stemmed_words = [stem_func.stem(word) for word in words]

    # (6) Lemmatization.
    lemmatized_words = [lemma_func.lemmatize(word) for word in stemmed_words]

    return " ".join(lemmatized_words)

data["Processed_Text"] = data["Content"].astype(str).apply(process_text)

data.to_csv("final_cleaned_reviews.csv", index=False)

print("Processing completed! Data saved in 'final_cleaned_reviews.csv'.")



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Processing completed! Data saved in 'final_cleaned_reviews.csv'.


# Question 3 (15 points)

Write a python program to **conduct syntax and structure analysis of the clean text** you just saved above. The syntax and structure analysis includes:

(1) **Parts of Speech (POS) Tagging:** Tag Parts of Speech of each word in the text, and calculate the total number of N(oun), V(erb), Adj(ective), Adv(erb), respectively.

(2) **Constituency Parsing and Dependency Parsing:** print out the constituency parsing trees and dependency parsing trees of all the sentences. Using one sentence as an example to explain your understanding about the constituency parsing tree and dependency parsing tree.

(3) **Named Entity Recognition:** Extract all the entities such as person names, organizations, locations, product names, and date from the clean texts, calculate the count of each entity.

In [3]:
!pip install spacy benepar
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_trf




Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m78.7 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
^C
Collecting en-core-web-trf==3.7.3
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3-py3-none-any.whl (457.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m457.4/457.4 MB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation

In [2]:
!pip install benepar
import benepar
benepar.download('benepar_en3')




[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Unzipping models/benepar_en3.zip.


True

In [4]:
import warnings
warnings.simplefilter("ignore")


In [5]:
import pandas as pd
import spacy
from collections import Counter
import benepar

# Load spaCy models
lang_model = spacy.load("en_core_web_sm")

# Load the cleaned dataset
dataset = pd.read_csv("final_cleaned_reviews.csv")

# (1) Parts of Speech (POS) Tagging
def analyze_pos(text):
    doc = lang_model(text)
    pos_freq = Counter(token.pos_ for token in doc)
    return pos_freq

pos_distribution = Counter()
dataset["Processed_Text"].dropna().apply(lambda x: pos_distribution.update(analyze_pos(x)))

print("\n(1) Parts of Speech (POS) Tagging:")
print(f"Total Nouns: {pos_distribution['NOUN']}")
print(f"Total Verbs: {pos_distribution['VERB']}")
print(f"Total Adjectives: {pos_distribution['ADJ']}")
print(f"Total Adverbs: {pos_distribution['ADV']}")

# (2) Constituency Parsing and Dependency Parsing
parser_model = spacy.load("en_core_web_trf")
parser_model.add_pipe("benepar", config={"model": "benepar_en3"})

def display_parsing_trees(text):
    doc = parser_model(text)
    for sent in doc.sents:
        print("\nSentence:", sent.text)
        print("Constituency Parsing Tree:")
        print(sent._.parse_string)  # Constituency Tree
        print("\nDependency Parsing Tree:")
        for token in sent:
            print(f"{token.text} <--({token.dep_})-- {token.head.text}")

print("\n(2) Constituency & Dependency Parsing Example:")
sample_sentence = dataset["Processed_Text"].dropna().iloc[0]
display_parsing_trees(sample_sentence)

# (3) Named Entity Recognition (NER)
def identify_entities(text):
    doc = lang_model(text)
    entity_freq = Counter(ent.label_ for ent in doc.ents)
    return entity_freq

entity_distribution = Counter()
dataset["Processed_Text"].dropna().apply(lambda x: entity_distribution.update(identify_entities(x)))

print("\n(3) Named Entity Recognition (NER):")
print(f"Total Person Names: {entity_distribution['PERSON']}")
print(f"Total Organizations: {entity_distribution['ORG']}")
print(f"Total Locations: {entity_distribution['GPE']}")
print(f"Total Products: {entity_distribution['PRODUCT']}")
print(f"Total Dates Mentioned: {entity_distribution['DATE']}")



(1) Parts of Speech (POS) Tagging:
Total Nouns: 30780
Total Verbs: 13620
Total Adjectives: 12021
Total Adverbs: 3659


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



(2) Constituency & Dependency Parsing Example:


You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.



Sentence: superhero extravaganza spiderman nwh didnt meet expect exceed emot reson mcu film far movi worth everi excruci moment wait visual score emot stori ofc fanservic film understand spiderman fan like spiderman film truli celebr belov charact best way possibl
Constituency Parsing Tree:
(S (NP (NP (NN superhero) (NN extravaganza)) (NP (NNP spiderman) (NN nwh))) (VP (VBD did) (RB nt) (VP (VB meet) (S (VP (VB expect) (VB exceed) (FW emot) (NN reson) (FW mcu) (NN film) (NP (RB far) (FW movi)) (ADJP (JJ worth) (FW everi) (JJ excruci) (NN moment)) (VB wait) (JJ visual) (NN score) (FW emot) (FW stori) (FW ofc) (FW fanservic) (NN film) (VP (VB understand) (NP (NN spiderman) (NN fan)) (PP (PP (IN like) (NP (NN spiderman) (NN film) (FW truli) (JJ celebr) (FW belov) (FW charact) (NP (JJS best) (NN way)))) (ADJP (JJ possibl)))))))))

Dependency Parsing Tree:
superhero <--(compound)-- extravaganza
extravaganza <--(ROOT)-- extravaganza
spiderman <--(compound)-- film
nwh <--(dep)-- meet
did <--

# **Following Questions must answer using AI assitance**

#Question 4 (20 points).

Q4. (PART-1)
Web scraping data from the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details extracted include the product name, a short description, and the URL.

 The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist by helping with the parsing of HTML, error handling, and generating reports based on the data collected.

 The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.

(PART -2)

1.   **Preprocess Data**: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

2. Perform **Data Quality** operations.


Preprocessing:
Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality:
Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.


Github MarketPlace page:
https://github.com/marketplace?type=actions

Prompt 1: (PART-1) Web scraping data from the GitHub Marketplace to gather details about popular actions. Using Python, the process begins by sending HTTP requests to multiple pages of the marketplace (1000 products), handling pagination through dynamic page numbers. The key details extracted include the product name, a short description, and the URL.

The extracted data is stored in a structured CSV format with columns for product name, description, URL, and page number. A time delay is introduced between requests to avoid server overload. ChatGPT can assist by helping with the parsing of HTML, error handling, and generating reports based on the data collected.

The goal is to complete the scraping within a specified time limit, ensuring that the process is efficient and adheres to GitHub’s usage guidelines.

(PART -2)

Preprocess Data: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

Perform Data Quality operations.

Preprocessing: Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality: Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately. Github MarketPlace page: https://github.com/marketplace?type=actions.  My user egent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 You have to make a request to the url. There is no need of github api. I provided a link in the question. Jut use the request to get html parser and if you try to go next page you will see url has page-2 and so on. So use a for loop and get all other pages.

In [6]:
import requests
from bs4 import BeautifulSoup
import csv
import time

HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/132.0.0.0 Safari/537.36'),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://github.com/'
}

BASE_URL = "https://github.com/marketplace"

def scrape_marketplace():
    page = 1
    seen_urls = set()
    session = requests.Session()

    with open("github_actions.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Product Name", "Description", "URL", "Page"])

        while True:
            # Use the params argument to let requests encode the URL parameters
            params = {"type": "actions", "page": page}
            print(f"Scraping page {page}...")

            try:
                response = session.get(BASE_URL, headers=HEADERS, params=params, timeout=15)
            except Exception as e:
                print(f"Request error on page {page}: {e}")
                break

            if response.status_code != 200:
                print(f"Non-200 response ({response.status_code}) on page {page}. Ending scrape.")
                break

            soup = BeautifulSoup(response.text, "html.parser")

            # Attempt to find product cards by searching for article elements.
            # Adjust selectors if GitHub updates their structure.
            articles = soup.find_all("article")
            if not articles:
                # If no articles are found, try a fallback method (e.g., div containers)
                articles = soup.find_all("div", class_="mb-3")
            if not articles:
                print(f"No product containers found on page {page}. Ending scrape.")
                break

            new_entries = 0
            for article in articles:
                # Look for a product title (<h3> tag)
                h3 = article.find("h3")
                if not h3:
                    continue
                product_name = h3.get_text(strip=True)

                # Look for a description (<p> tag)
                p = article.find("p")
                description = p.get_text(strip=True) if p else ""

                # Look for an anchor tag that links to the product page
                a = article.find("a", href=True)
                if not a or '/marketplace/actions/' not in a["href"]:
                    continue
                product_url = "https://github.com" + a["href"]

                if product_url in seen_urls:
                    continue

                seen_urls.add(product_url)
                writer.writerow([product_name, description, product_url, page])
                new_entries += 1

            print(f"Page {page}: Found {new_entries} new products.")
            # If no new products are found, we assume we have reached the end.
            if new_entries == 0:
                print("No new products found, ending scrape.")
                break

            page += 1
            time.sleep(5)

    print("Scraping complete.")

if __name__ == "__main__":
    scrape_marketplace()


Scraping page 1...
Page 1: Found 0 new products.
No new products found, ending scrape.
Scraping complete.


Prompt 2: I am getting like this. Scraping page 1...
Page 1: Found 0 new products.
No new products found, ending scrape.
Scraping complete. Not getting any entries. just blank csv with headings, resolve this

In [7]:
import requests
import json
from bs4 import BeautifulSoup
import csv
import time

# Browser-like HTTP headers
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/132.0.0.0 Safari/537.36'),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://github.com/',
    'Connection': 'keep-alive'
}

BASE_URL = "https://github.com/marketplace"

def scrape_marketplace():
    # Initialize page to start scraping from
    page = 1
    seen_urls = set()
    session = requests.Session()

    with open("github_actions.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Product Name", "Description", "URL", "Page"])

        while True:
            params = {"type": "actions", "page": page}
            print(f"Scraping page {page}...")

            try:
                response = session.get(BASE_URL, headers=HEADERS, params=params, timeout=15)
            except Exception as err:
                print(f"Request error on page {page}: {err}")
                break

            if response.status_code != 200:
                print(f"Non-200 response ({response.status_code}) on page {page}. Ending scrape.")
                break

            soup = BeautifulSoup(response.text, "html.parser")
            products = []

            # Attempt 1: Try to extract data from embedded JSON data (if available)
            json_data_script = soup.find("script", id="__NEXT_DATA__")
            if json_data_script:
                try:
                    data = json.loads(json_data_script.string)
                    # Adjust the following keys if GitHub's JSON structure changes
                    pageProps = data.get("props", {}).get("pageProps", {})
                    marketplace_data = pageProps.get("marketplace", {})
                    items = marketplace_data.get("items", [])
                    for item in items:
                        name = item.get("name", "").strip()
                        description = item.get("description", "").strip()
                        slug = item.get("slug", "").strip()
                        product_url = f"https://github.com/marketplace/actions/{slug}" if slug else ""
                        if product_url and product_url not in seen_urls:
                            seen_urls.add(product_url)
                            products.append((name, description, product_url))
                except Exception as e:
                    print(f"Error parsing JSON data on page {page}: {e}")

            # Attempt 2: Fallback method by searching for anchor tags
            if not products:
                links = soup.find_all("a", href=lambda x: x and "/marketplace/actions/" in x)
                for link in links:
                    if link.find_parent(id="js-global-screen-reader-notice"):
                        continue
                    href = link.get("href")
                    product_url = "https://github.com" + href
                    if product_url in seen_urls:
                        continue
                    seen_urls.add(product_url)
                    # Attempt to determine the product name (using h3 text if available)
                    parent_h3 = link.find_parent("h3")
                    product_name = parent_h3.get_text(strip=True) if parent_h3 else link.get_text(strip=True)
                    # Look for an associated description in the next <p> tag
                    p_tag = link.find_next("p")
                    description = p_tag.get_text(strip=True) if p_tag else ""
                    products.append((product_name, description, product_url))

            if not products:
                snippet = response.text[:500]
                print(f"No products found on page {page}. Response snippet:\n{snippet}\nEnding scrape.")
                break

            for name, desc, url in products:
                writer.writerow([name, desc, url, page])
            print(f"Page {page}: Found {len(products)} products.")

            page += 1
            # Pause to reduce load on the server
            time.sleep(5)

    print("Scraping complete.")

if __name__ == "__main__":
    scrape_marketplace()


Scraping page 1...
Page 1: Found 20 products.
Scraping page 2...
Page 2: Found 20 products.
Scraping page 3...
Page 3: Found 20 products.
Scraping page 4...
Page 4: Found 20 products.
Scraping page 5...
Page 5: Found 20 products.
Scraping page 6...
Page 6: Found 20 products.
Scraping page 7...
Page 7: Found 20 products.
Scraping page 8...
No products found on page 8. Response snippet:







<!DOCTYPE html>
<html
  lang="en"
  
  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"
  data-a11y-animated-images="system" data-a11y-link-underlines="true"
  
  >



  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
  <link rel="dns-prefetch" href="https://user-images.g
Ending scrape.
Scraping complete.


Prompt 3: Hey Chatgpt! Right now we're able to fetch only 140 products and its breaking the loop after page 7. So, Keep a timer for each request and let the page load. If no products are found, dont break the loop. Keep looping until you get 1000 amount of products.

In [8]:
#!/usr/bin/env python3
import requests
import json
from bs4 import BeautifulSoup
import csv
import time

# ---------------------------
# Configuration and Constants
# ---------------------------
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/132.0.0.0 Safari/537.36'),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://github.com/',
    'Connection': 'keep-alive'
}

BASE_URL = "https://github.com/marketplace"
USE_TYPE_ACTIONS = True  # Filter for GitHub Actions listings

# Settings
TARGET_PRODUCT_COUNT = 1000
MAX_PAGES = 150  # Set a maximum page limit to avoid an infinite loop.

# ---------------------------
# Scraping Function Using a While Loop
# ---------------------------
def scrape_marketplace():
    seen_urls = set()
    total_products = 0
    current_page = 1
    session = requests.Session()

    with open("github_actions.csv", "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Product Name", "Description", "URL", "Page"])

        # Loop until we reach the target count or hit the max page limit.
        while total_products < TARGET_PRODUCT_COUNT and current_page <= MAX_PAGES:
            params = {"page": current_page}
            if USE_TYPE_ACTIONS:
                params["type"] = "actions"

            print(f"Scraping page {current_page} with parameters: {params} ...")
            try:
                # Allow time for the page to fully load by using a generous timeout.
                response = session.get(BASE_URL, headers=HEADERS, params=params, timeout=30)
            except Exception as err:
                print(f"Request error on page {current_page}: {err}")
                current_page += 1
                time.sleep(5)
                continue

            if response.status_code != 200:
                print(f"Non-200 response ({response.status_code}) on page {current_page}. Continuing to next page.")
                current_page += 1
                time.sleep(5)
                continue

            soup = BeautifulSoup(response.text, "html.parser")
            products = []

            # Attempt 1: Extract product details from embedded JSON data.
            json_data_script = soup.find("script", id="__NEXT_DATA__")
            if json_data_script:
                try:
                    data = json.loads(json_data_script.string)
                    pageProps = data.get("props", {}).get("pageProps", {})
                    marketplace_data = pageProps.get("marketplace", {})
                    items = marketplace_data.get("items", [])

                    for item in items:
                        name = item.get("name", "").strip()
                        description = item.get("description", "").strip()
                        slug = item.get("slug", "").strip()
                        if slug:
                            if USE_TYPE_ACTIONS:
                                product_url = f"https://github.com/marketplace/actions/{slug}"
                            else:
                                product_url = f"https://github.com/marketplace/{slug}"
                        else:
                            product_url = ""

                        if product_url and product_url not in seen_urls:
                            seen_urls.add(product_url)
                            products.append((name, description, product_url))
                except Exception as e:
                    print(f"Error parsing JSON data on page {current_page}: {e}")

            # Attempt 2: Fallback – extract using anchor tags if no JSON data found.
            if not products:
                links = soup.find_all("a", href=lambda x: x and "/marketplace/actions/" in x)
                for link in links:
                    # Skip links that are in screen-reader-only sections.
                    if link.find_parent(id="js-global-screen-reader-notice"):
                        continue
                    href = link.get("href")
                    product_url = "https://github.com" + href
                    if product_url in seen_urls:
                        continue
                    seen_urls.add(product_url)
                    parent_h3 = link.find_parent("h3")
                    product_name = parent_h3.get_text(strip=True) if parent_h3 else link.get_text(strip=True)
                    p_tag = link.find_next("p")
                    description = p_tag.get_text(strip=True) if p_tag else ""
                    products.append((product_name, description, product_url))

            # Write the products for this page to the CSV file (if any).
            if products:
                for name, desc, url in products:
                    writer.writerow([name, desc, url, current_page])
                total_products += len(products)
                print(f"Page {current_page}: Found {len(products)} products. Total so far: {total_products}.")
            else:
                print(f"No products found on page {current_page}.")

            current_page += 1
            # Pause a few seconds between page requests to be respectful of GitHub's server.
            time.sleep(5)

    print(f"Scraping complete. Total products collected: {total_products}.")

# ---------------------------
# Main Execution Block
# ---------------------------
if __name__ == "__main__":
    scrape_marketplace()


Scraping page 1 with parameters: {'page': 1, 'type': 'actions'} ...
Page 1: Found 20 products. Total so far: 20.
Scraping page 2 with parameters: {'page': 2, 'type': 'actions'} ...
Page 2: Found 20 products. Total so far: 40.
Scraping page 3 with parameters: {'page': 3, 'type': 'actions'} ...
Page 3: Found 20 products. Total so far: 60.
Scraping page 4 with parameters: {'page': 4, 'type': 'actions'} ...
Page 4: Found 20 products. Total so far: 80.
Scraping page 5 with parameters: {'page': 5, 'type': 'actions'} ...
Page 5: Found 20 products. Total so far: 100.
Scraping page 6 with parameters: {'page': 6, 'type': 'actions'} ...
Page 6: Found 20 products. Total so far: 120.
Scraping page 7 with parameters: {'page': 7, 'type': 'actions'} ...
Page 7: Found 20 products. Total so far: 140.
Scraping page 8 with parameters: {'page': 8, 'type': 'actions'} ...
No products found on page 8.
Scraping page 9 with parameters: {'page': 9, 'type': 'actions'} ...
Page 9: Found 20 products. Total so far: 

prompt 4: github_actions.csv is generated with 1013 products. Part 1 of the question is done. Now work on (PART -2)

Preprocess Data: Clean the text by tokenizing, removing stopwords, and converting to lowercase.

Perform Data Quality operations.

Preprocessing: Preprocessing involves cleaning the text by removing noise such as special characters, HTML tags, and unnecessary whitespace. It also includes tasks like tokenization, stopword removal, and lemmatization to standardize the text for analysis.

Data Quality: Data quality checks ensure completeness, consistency, and accuracy by verifying that all required columns are filled and formatted correctly. Additionally, it involves identifying and removing duplicates, handling missing values, and ensuring the data reflects the true content accurately.

In [None]:
pip install pandas nltk




In [9]:
#!/usr/bin/env python3
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# ---------------------------
# Ensure Required NLTK Resources are Available
# ---------------------------
# Sometimes NLTK may look for 'punkt_tab', so we check for both 'punkt' and 'punkt_tab'
resources = ['punkt', 'punkt_tab', 'stopwords', 'wordnet']
for resource in resources:
    try:
        nltk.data.find(f'tokenizers/{resource}')  # 'punkt' and 'punkt_tab' are in tokenizers
    except LookupError:
        nltk.download(resource)

# ---------------------------
# Data Loading and Quality Checks
# ---------------------------
# Load the CSV generated from Part 1
df = pd.read_csv('github_actions.csv')

# Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())

# Remove duplicates based on the URL column (assuming unique URLs define unique products)
df.drop_duplicates(subset=['URL'], inplace=True)

# Ensure that required columns ('Product Name', 'URL') are not empty.
df = df[df['Product Name'].str.strip() != '']
df = df[df['URL'].str.strip() != '']

# Fill missing descriptions with an empty string
df['Description'] = df['Description'].fillna('')

# ---------------------------
# Preprocessing Functions
# ---------------------------
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """
    Preprocess the input text by:
      - Removing HTML tags.
      - Removing special characters and digits.
      - Converting text to lowercase.
      - Removing extra whitespace.
      - Tokenizing.
      - Removing stopwords.
      - Lemmatizing tokens.
    Returns the cleaned text as a single string.
    """
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove special characters and digits (keep only letters and whitespace)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Join tokens back into a single string
    return ' '.join(tokens)

# ---------------------------
# Apply Preprocessing
# ---------------------------
df['Product Name Clean'] = df['Product Name'].astype(str).apply(preprocess_text)
df['Description Clean'] = df['Description'].astype(str).apply(preprocess_text)

# ---------------------------
# Data Quality Operations
# ---------------------------
# Ensure that the cleaned text columns are non-empty.
df = df[(df['Product Name Clean'].str.strip() != '') & (df['URL'].str.strip() != '')]

# Optional: Print a sample for manual inspection.
print("\nSample of cleaned data:")
print(df[['Product Name', 'Product Name Clean', 'Description', 'Description Clean']].head())

# ---------------------------
# Save Cleaned Data
# ---------------------------
df.to_csv('github_actions_cleaned.csv', index=False)
print("\nPreprocessing complete. Cleaned data saved to 'github_actions_cleaned.csv'.")



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Missing values in each column:
Product Name    0
Description     1
URL             0
Page            0
dtype: int64

Sample of cleaned data:
                   Product Name          Product Name Clean  \
0                TruffleHog OSS               trufflehog os   
1                 Metrics embed                metric embed   
2  yq - portable yaml processor  yq portable yaml processor   
3                  Super-Linter                 superlinter   
4        Gosec Security Checker      gosec security checker   

                                         Description  \
0                Scan Github Actions with TruffleHog   
1  An infographics generator with 40+ plugins and...   
2  create, read, update, delete, merge, validate ...   
3  Super-linter is a ready-to-run collection of l...   
4                    Runs the gosec security checker   

                                   Description Clean  
0                      scan github action trufflehog  
1  infographics generator plugins

#Question 5 (20 points)

PART 1:
Web Scrape  tweets from Twitter using the Tweepy API, specifically targeting hashtags related to subtopics (machine learning or artificial intelligence.)
The extracted data includes the tweet ID, username, and text.

Part 2:
Perform data cleaning procedures

A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.


**Note**

1.   Follow tutorials provided in canvas to obtain api keys. Use ChatGPT to get the code. Make sure the file is downloaded and saved.
2.   Make sure you divide GPT code as shown in tutorials, dont make multiple requestes.



prompt 1: Try to understand this completely I will give you my access and tokens just replace and do this Question
PART 1: Web Scrape tweets from Twitter using the Tweepy API, specifically targeting hashtags related to subtopics (machine learning or artificial intelligence.) The extracted data includes the tweet ID, username, and text.

these are my access keys

 Bearer Token: AAAAAAAAAAAAAAAAAAAAAGo8zQEAAAAAY3hPK2%2BmPQwKKiGydeKPSvOPRi8%3DF87tYbzjFY74fzJrqknizeUGpm3o8Sd9LsIKCfw4m9iYNWpD5J

Access Token : 1886177120444956672-wuEXFW61G5SxvtebR9N5CtF1miOGWI

Access Token Secret : 0mCmNFGY1oix34JX89iGRlGsLo4eKok3Fxcq6vj3pGYgG

API KEY: IBv79v1TiyxJuGHtdo5WvBsSv
API KEY SECRET: NGuCPikSQLKdxt6BnPpmc6PNZc0uqIzFqkQMPR6L4vTU6rhC6b

Step 1: Install Tweepy
Ensure you have Tweepy installed:

In [None]:
pip install tweepy



Step 2: Authenticate and Retrieve Tweets
Here's how you can authenticate with Twitter and retrieve tweets related to machine learning or artificial intelligence:

In [13]:
import tweepy
import pandas as pd

# Set your keys and tokens here
api_key = 'IBv79v1TiyxJuGHtdo5WvBsSv'
api_key_secret = 'NGuCPikSQLKdxt6BnPpmc6PNZc0uqIzFqkQMPR6L4vTU6rhC6b'
access_token = '1886177120444956672-wuEXFW61G5SxvtebR9N5CtF1miOGWI'
access_token_secret = '0mCmNFGY1oix34JX89iGRlGsLo4eKok3Fxcq6vj3pGYgG'

# Authenticate with Twitter
auth = tweepy.OAuth1UserHandler(
    consumer_key=api_key,
    consumer_secret=api_key_secret,
    access_token=access_token,
    access_token_secret=access_token_secret
)
api = tweepy.API(auth)

# Define the hashtags you want to search for
hashtags = '#ML'

# Initialize an empty list to store tweet data
tweet_data = []

# Loop through each hashtag and collect tweets
for hashtag in hashtags:
    try:
        tweets = tweepy.Cursor(api.search_tweets, q=hashtag, lang="en", tweet_mode="extended").items(100)

        for tweet in tweets:
            tweet_data.append({
                'tweet_id': tweet.id,
                'username': tweet.user.screen_name,
                'text': tweet.full_text
            })
    except tweepy.TweepyException as e:
        print(f"Error occurred while fetching tweets for {hashtag}: {e}")

# Print the collected tweets
for tweet in tweet_data:
    print(f"@{tweet['username']}: {tweet['text']}")




Error occurred while fetching tweets for #: 401 Unauthorized
Unauthorized
Error occurred while fetching tweets for M: 401 Unauthorized
Unauthorized
Error occurred while fetching tweets for L: 401 Unauthorized
Unauthorized
Tweets saved to ml_tweets_v2.csv


Prompt 2: Getting this error. Error occurred while fetching tweets for #: 401 Unauthorized
89 - Invalid or expired token.
Error occurred while fetching tweets for M: 401 Unauthorized
89 - Invalid or expired token.
Error occurred while fetching tweets for L: 401 Unauthorized
89 - Invalid or expired token. Resolve this error and these these are my updated access keys

API Key: P7S43RM7dwRoPzXsHow1yq2LV

API Key Secret: bg0BfPUx5TRDm6nFLoUMrwzOtkkBJsEZk1gr8IZb8iaAhHpwym

Bearer Token: AAAAAAAAAAAAAAAAAAAAAJODzQEAAAAAt0VuklR%2BDRQe54UpmTK5XoidtYk%3DZBQPwJuEVrM9AnjvLVbgw8PTgk9RRjj37U74cB3nYwpimNr1OU

Access Token: 1892051302382960640-CWhsDx9tRxKZnZRKFYOfHxYEfFkUDN

Access Token Secret: ElzA0vzeo8dCrO9lyIQJB3NRhpavEd45iwTs6qziePLvB


In [16]:
'''Here's the updated code with your new credentials and proper API v2 implementation:'''
'''
Key improvements made:

Updated all credentials with your new keys

Added enhanced data cleaning:

URL removal

Mention removal

Whitespace cleanup

Included engagement metrics (likes/retweets)

Added comprehensive error handling

Added data validation checks

To use this:

Run the code as-is - it will create "ml_tweets_v2.csv"

Check the console output for any errors

Upload the CSV to OneDrive and share the link

The CSV will contain:

Tweet ID

Creation timestamp

Author ID

Cleaned text content

Like/retweet counts

If you get authentication errors, double-check:

All keys are copied exactly

Your Twitter Developer account has "Essential" access level at minimum

No extra spaces in the credentials

You're not hitting rate limits (wait 15 mins between requests)'''

import tweepy
import pandas as pd

# New credentials
BEARER_TOKEN = 'AAAAAAAAAAAAAAAAAAAAAJODzQEAAAAAt0VuklR%2BDRQe54UpmTK5XoidtYk%3DZBQPwJuEVrM9AnjvLVbgw8PTgk9RRjj37U74cB3nYwpimNr1OU'
API_KEY = 'P7S43RM7dwRoPzXsHow1yq2LV'
API_SECRET = 'bg0BfPUx5TRDm6nFLoUMrwzOtkkBJsEZk1gr8IZb8iaAhHpwym'
ACCESS_TOKEN = '1892051302382960640-CWhsDx9tRxKZnZRKFYOfHxYEfFkUDN'
ACCESS_SECRET = 'ElzA0vzeo8dCrO9lyIQJB3NRhpavEd45iwTs6qziePLvB'

# Authenticate with Twitter API v2
client = tweepy.Client(
    bearer_token=BEARER_TOKEN,
    consumer_key=API_KEY,
    consumer_secret=API_SECRET,
    access_token=ACCESS_TOKEN,
    access_token_secret=ACCESS_SECRET
)

# Define ML-related search query
query = "#ML -is:retweet -is:reply lang:en"

# Get recent tweets
tweets = client.search_recent_tweets(
    query=query,
    tweet_fields=["created_at", "author_id", "text", "public_metrics"],
    expansions=["author_id"],
    max_results=100
)

# Process tweets with error handling
tweet_data = []
if tweets.data:
    for tweet in tweets.data:
        try:
            tweet_data.append({
                'tweet_id': tweet.id,
                'created_at': tweet.created_at,
                'author_id': tweet.author_id,
                'text': tweet.text,
                'likes': tweet.public_metrics['like_count'],
                'retweets': tweet.public_metrics['retweet_count']
            })
        except Exception as e:
            print(f"Error processing tweet {tweet.id}: {str(e)}")

# Create DataFrame and clean data
if tweet_data:
    df = pd.DataFrame(tweet_data)

    # Data cleaning steps
    df = df.drop_duplicates(subset='tweet_id')
    df['text'] = df['text'].str.replace(r'http\S+', '', regex=True)  # Remove URLs
    df['text'] = df['text'].str.replace(r'@\w+', '', regex=True)       # Remove mentions
    df['text'] = df['text'].str.strip()                                # Remove extra whitespace

    # Save to CSV
    df.to_csv("ml_tweets_v2.csv", index=False)
    print(f"Successfully saved {len(df)} cleaned tweets to ml_tweets_v2.csv")

    # Display sample
    print("\nSample data:")
    print(df.head(3))
else:
    print("No tweets found with current query parameters")


Successfully saved 100 cleaned tweets to ml_tweets_v2.csv

Sample data:
              tweet_id                created_at            author_id  \
0  1892054142241153326 2025-02-19 03:30:34+00:00           2938342669   
1  1892045952539664772 2025-02-19 02:58:02+00:00            555031989   
2  1892038412032520263 2025-02-19 02:28:04+00:00  1629373825916780546   

                                                text  likes  retweets  
0  If you're passionate about Data Science, Analy...      0         0  
1  Eight forces that will shape the future cities...      4         1  
2  AI/ML Engineer \nPeraton\n\n\n\nFollow to be u...      0         0  


prompt 3:
ml_tweet.csv is generated successfully, perform Part 2: Perform data cleaning procedures. The data cleaning steps 1) means remove noise, such as special characters and punctuations; 2) Remove numbers, 3) Remove stopwords by using the stopwords list; 4) Lowercase all texts; 5) Stemming and 6) Lemmatization.
A final data quality check ensures the completeness and consistency of the dataset. The cleaned data is then saved into a CSV file for further analysis.

In [17]:
'''To perform data cleaning on the collected tweets, you can use the following Python program.
This script will remove noise, numbers, stopwords, convert to lowercase, apply stemming, and lemmatization.
It will then save the cleaned data in a new column in the CSV file.'''

import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download necessary NLTK resources if not already downloaded
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Load the CSV file containing the tweets
df = pd.read_csv('ml_tweets_v2.csv')

# Initialize the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to clean the text data
def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)

    # Remove mentions
    text = re.sub(r'@\w+', '', text)

    # Remove special characters and punctuation
    text = re.sub(r'[^a-zA-Z ]', '', text)

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Convert to lowercase
    text = text.lower()

    # Tokenize the text
    tokens = nltk.word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Apply stemming
    stemmed_tokens = [stemmer.stem(token) for token in tokens]

    # Apply lemmatization
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in stemmed_tokens]

    # Join the tokens back into a string
    cleaned_text = ' '.join(lemmatized_tokens)

    return cleaned_text

# Apply the cleaning function to each tweet
df['cleaned_text'] = df['text'].apply(clean_text)

# Save the cleaned data to a new CSV file
df.to_csv('cleaned_ml_tweets.csv', index=False)

print("Cleaned data saved to cleaned_ml_tweets.csv")


Cleaned data saved to cleaned_ml_tweets.csv


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? Your opinion on the provided time to complete the assignment.

I really liked this assignment. Initially, I felt it tough to answer Question 1(IMDB Web scraping ). It took some time to understand the logic and webscraping process. I enjoyed all the questions such as IMDB webscraping, GitHub Marketplace question and Tweepy API one. The provided to do this assignment is fine.

# Write your response below
Fill out survey and provide your valuable feedback.

https://docs.google.com/forms/d/e/1FAIpQLSd_ObuA3iNoL7Az_C-2NOfHodfKCfDzHZtGRfIker6WyZqTtA/viewform?usp=dialog