# **Introduction to Project**


# 🚀 Cross-Platform Tech Trends Scraper

## 📌 Project Overview

This project demonstrates a complete **end-to-end data collection and processing pipeline** using several web data acquisition techniques, all free to use and applicable in real-world scenarios. It brings together **Requests + BeautifulSoup, Selenium, RapidAPI, and pandas** to collect, store, clean, and process web and API-derived data.

You will learn how to:
- Extract data from static HTML pages using Requests + BeautifulSoup
- Scrape dynamic websites using Selenium
- Fetch structured JSON data from a free RapidAPI endpoint
- Store all gathered data in structured CSV files
- Clean and normalize real, messy web data
- Perform basic processing and derive simple insights

This workflow mirrors typical data-engineering and data-science pipelines, which makes it well suited to learning and portfolio work.

---

## 📌 Data Sources and Methods

| Source | Method | Output |
|--------|--------|--------|
| Hacker News Top Stories | Static scraping with `requests` + `BeautifulSoup` | `hacker_news.csv` |
| Twitter Search Page (e.g., "AI" topic) | Dynamic scraping using `selenium` | `twitter_ai_tweets.csv` |
| RapidAPI Free News API (e.g., AI or tech news) | API call using `requests` | `rapidapi_tech_news.csv` |

---

## 📌 Project Steps

### 🧱 1. Requests + BeautifulSoup
Use Requests to fetch web pages and BeautifulSoup to parse static HTML and extract structured data.

### 🌀 2. Selenium
Use a headless Selenium browser to load dynamic content (like Twitter search results) that cannot be obtained via simple HTTP requests.

### 📡 3. RapidAPI
Query a free API endpoint (e.g., news API) to fetch JSON metadata for articles related to a topic.
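
As a rough illustration of this step, the sketch below sends one authenticated request and flattens the JSON into rows. RapidAPI authenticates requests with the `X-RapidAPI-Key` and `X-RapidAPI-Host` headers; everything else here is an assumption made for illustration: the host `example-news.p.rapidapi.com`, the `/search` path, the `q` parameter, the `RAPIDAPI_KEY` environment variable, and the `articles` response shape will all differ from one API to another.

```python
import os

import requests

# Hypothetical host and path: replace with the values shown on the API's RapidAPI page.
API_HOST = "example-news.p.rapidapi.com"
API_URL = f"https://{API_HOST}/search"


def fetch_news(topic, api_key=None, timeout=10):
    """Call the (hypothetical) news endpoint and return the decoded JSON."""
    headers = {
        # Assumption: the key is stored in the RAPIDAPI_KEY environment variable
        "X-RapidAPI-Key": api_key or os.environ["RAPIDAPI_KEY"],
        "X-RapidAPI-Host": API_HOST,
    }
    response = requests.get(
        API_URL, headers=headers, params={"q": topic}, timeout=timeout
    )
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error body
    return response.json()


def articles_to_rows(payload):
    """Flatten an assumed {"articles": [...]} payload into CSV-ready dicts."""
    rows = []
    for article in payload.get("articles", []):
        rows.append({
            "title": article.get("title"),
            "url": article.get("url"),
            "published_at": article.get("publishedAt"),
        })
    return rows
```

With a valid key, `pd.DataFrame(articles_to_rows(fetch_news("AI"))).to_csv("rapidapi_tech_news.csv", index=False)` would produce the CSV named in the table above.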

### 💾 4. Data Storage
Save all collected data as CSV files for easy future access and analysis.

### 🧹 5. Cleaning & Processing
Load CSVs via pandas, drop duplicates, handle missing values, normalize fields, and prepare data for analysis.
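
A minimal sketch of this step for the Hacker News data, assuming the `title`, `article_url`, `score`, and `comments` columns hold raw strings such as `"211 points"` and `"89\xa0comments"`. The function name `clean_hn_frame` and the choice to treat a missing comment count as 0 are illustrative, not part of the original pipeline.

```python
import pandas as pd


def clean_hn_frame(df):
    """Return a cleaned copy with de-duplicated rows and numeric score/comments."""
    out = df.drop_duplicates(subset=["title", "article_url"]).copy()
    # "211 points" -> 211; rows without a score become <NA>
    score_digits = out["score"].str.extract(r"(\d+)", expand=False)
    out["score"] = pd.to_numeric(score_digits, errors="coerce").astype("Int64")
    # "89\xa0comments" -> 89; text without digits (e.g. "discuss") counts as 0
    comment_digits = out["comments"].str.extract(r"(\d+)", expand=False)
    out["comments"] = (
        pd.to_numeric(comment_digits, errors="coerce").astype("Int64").fillna(0)
    )
    out["title"] = out["title"].str.strip()
    return out.reset_index(drop=True)
```

Typical use would be `clean_hn_frame(pd.read_csv("hacker_news.csv"))`, after which `score` and `comments` can be sorted and aggregated as numbers.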

### 📊 6. Basic Insights
Use simple Python logic (e.g., frequency counts, word counts) to generate preliminary analysis.
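
One way to do a word-frequency count over story titles is sketched below with the standard library only. The stop-word list is a small hand-picked assumption, and `top_title_words` is a hypothetical helper name.

```python
import re
from collections import Counter

# A small, hand-picked stop-word list (an assumption; extend as needed)
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "for", "and", "with", "on", "is"}


def top_title_words(titles, n=5):
    """Count lowercase words across titles, skipping stop words."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z0-9']+", title.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)
```

Calling `top_title_words(df["title"])` on the scraped frame returns a list of `(word, count)` pairs, most frequent first.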

---

## 🧠 What You'll See

- Structured CSV outputs (`hacker_news.csv`, `twitter_ai_tweets.csv`, `rapidapi_tech_news.csv`)
- A combined understanding of **static scraping, dynamic scraping, and API usage**
- Cleaned and processed data
- Basic analytical insights




Requests and BeautifulSoup

In [None]:
!pip install requests beautifulsoup4 selenium pandas webdriver-manager


Collecting selenium
  Downloading selenium-4.40.0-py3-none-any.whl.metadata (7.7 kB)
Collecting webdriver-manager
  Downloading webdriver_manager-4.0.2-py2.py3-none-any.whl.metadata (12 kB)
Collecting trio<1.0,>=0.31.0 (from selenium)
  Downloading trio-0.32.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket<1.0,>=0.12.2 (from selenium)
  Downloading trio_websocket-0.12.2-py3-none-any.whl.metadata (5.1 kB)
Collecting trio-typing>=0.10.0 (from selenium)
  Downloading trio_typing-0.10.0-py3-none-any.whl.metadata (10 kB)
Collecting types-certifi>=2021.10.8.3 (from selenium)
  Downloading types_certifi-2021.10.8.3-py3-none-any.whl.metadata (1.4 kB)
Collecting types-urllib3>=1.26.25.14 (from selenium)
  Downloading types_urllib3-1.26.25.14-py3-none-any.whl.metadata (1.7 kB)
Collecting urllib3<3,>=1.21.1 (from requests)
  Downloading urllib3-2.6.3-py3-none-any.whl.metadata (6.9 kB)
Collecting sortedcontainers (from trio<1.0,>=0.31.0->selenium)
  Downloading sortedcontainers-2.4.0

In [None]:
import requests
from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

import pandas as pd
import time
import json
import csv

# **SECTION 1: Scraping using Requests + BeautifulSoup**

**We are scraping data from Hacker News. These are the fields we will work with:**


---


*   Title - what the article is about
*   Article URL - external link
*   HN URL - discussion link
*   Score (points)
*   Author - who posted it
*   Post's age
*   Comments count
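
The HN URL can be derived rather than scraped. The sketch below assumes each `tr.athing` row carries the item id in its `id` attribute (so `story.get("id")` would supply it inside a scraping loop) and that self posts such as "Ask HN" use a relative `href`. Both are observations about Hacker News markup that may change, and `build_story_links` is a hypothetical helper name.

```python
from urllib.parse import urljoin

HN_BASE = "https://news.ycombinator.com/"


def build_story_links(story_id, href):
    """Return (article_url, hn_url) as absolute URLs."""
    # Relative links like "item?id=123" are resolved against the site root;
    # absolute external links pass through unchanged
    article_url = urljoin(HN_BASE, href)
    hn_url = f"{HN_BASE}item?id={story_id}"
    return article_url, hn_url
```

For an external story, the first value is the original link and the second is the discussion page; for a self post, both point at the same item.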





In [None]:
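url = "https://news.ycombinator.com/news"
# A timeout keeps the cell from hanging; raise_for_status surfaces HTTP errors early
response = requests.get(url, timeout=10)
response.raise_for_status()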

In [None]:
soup = BeautifulSoup(response.text, "html.parser")

In [None]:
stories = soup.find_all("tr", class_="athing")
print(len(stories))

30


In [None]:
all_stories = []



In [None]:
for story in stories:
  # The title link lives inside <span class="titleline"> on the story row
  title_tag = story.find("span", class_="titleline").find("a")
  title = title_tag.get_text()
  article_url = title_tag.get("href")
  # Score, author, age, and comments sit in the row right after the story row
  subtext_row = story.find_next_sibling("tr")
  score_tag = subtext_row.find("span", class_="score")
  score = score_tag.get_text() if score_tag else None
  author_tag = subtext_row.find("a", class_="hnuser")
  author = author_tag.get_text() if author_tag else None
  age_tag = subtext_row.find("span", class_="age")
  age = age_tag.get_text() if age_tag else None
  # The comments link is normally the last <a> in the subtext row.
  # For posts without a discussion, this text may not be a comment count.
  links = subtext_row.find_all("a")
  comments = links[-1].get_text() if links else None
  story_data = {
    "title": title,
    "article_url": article_url,
    "score": score,
    "author": author,
    "age": age,
    "comments": comments
  }
  all_stories.append(story_data)



In [None]:
# After the loop, these variables hold the values of the last story processed
print(title)
print(article_url)
print(score, author, age, comments)
print("-" * 50)

Software engineers can no longer neglect their soft skills
https://www.qu8n.com/posts/most-important-software-engineering-skill-2026
159 points quanwinn 12 hours ago 200 comments
--------------------------------------------------


In [None]:
print(all_stories[0]["title"])
print(all_stories[1]["title"])
print(all_stories[2]["title"])


Gaussian Splatting – A$AP Rocky "Helicopter" music video
Flux 2 Klein pure C inference
Fil-Qt: A Qt Base build with Fil-C experience


In [None]:
print(len(all_stories))
print(all_stories[1:3])

30
[{'title': 'Flux 2 Klein pure C inference', 'article_url': 'https://github.com/antirez/flux2.c', 'score': '211 points', 'author': 'antirez', 'age': '7 hours ago', 'comments': '89\xa0comments'}, {'title': 'Fil-Qt: A Qt Base build with Fil-C experience', 'article_url': 'https://git.qt.io/cradam/fil-qt', 'score': '15 points', 'author': 'pjmlp', 'age': '1 hour ago', 'comments': '3\xa0comments'}]


In [None]:
df = pd.DataFrame(all_stories)
df.to_csv("hacker_news.csv", index=False)

df.head(5)

|   | title | article_url | score | author | age | comments |
|---|-------|-------------|-------|--------|-----|----------|
| 0 | Gaussian Splatting – A$AP Rocky "Helicopter" m... | https://radiancefields.com/a-ap-rocky-releases... | 428 points | ChrisArchitect | 7 hours ago | 149 comments |
| 1 | Flux 2 Klein pure C inference | https://github.com/antirez/flux2.c | 211 points | antirez | 7 hours ago | 89 comments |
| 2 | Fil-Qt: A Qt Base build with Fil-C experience | https://git.qt.io/cradam/fil-qt | 15 points | pjmlp | 1 hour ago | 3 comments |
| 3 | A Social Filesystem | https://overreacted.io/a-social-filesystem/ | 266 points | icy | 9 hours ago | 132 comments |
| 4 | Wine 11.0 | https://gitlab.winehq.org/wine/wine/-/releases... | 194 points | zdw | 3 hours ago | 35 comments |


**Note: I was not able to complete the rest of this project. Running Selenium in this environment was not easy, and although the code ran in VS Code, I needed a proxy to scrape the data, so I could not finish it.**

# **Dynamic Job Listings Scraper using Selenium (Indeed)**


---
**In this project we scrape dynamic job listing data from Indeed using Selenium, because the content is rendered with JavaScript. The goal is to collect structured, real-world dynamic data.**

---
*   Job Title - Role being offered
*   Company Name - Hiring organization
*   Job Location - City / remote / hybrid info
*   Job URL - Direct link to the job posting
*   Posted Time - When the job was listed
*   Scrape Timestamp - When the data was collected
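
The fields above could be modeled as a small record type that stamps itself with the scrape time. This is only a sketch under assumptions: the `JobListing` class, its attribute names, and `write_listings` are hypothetical and do not come from the original notebook.

```python
import csv
from dataclasses import asdict, dataclass, fields
from datetime import datetime, timezone


@dataclass
class JobListing:
    title: str
    company: str
    location: str
    url: str
    posted: str
    scraped_at: str = ""

    def __post_init__(self):
        # Stamp the record in UTC at creation time if the caller did not supply one
        if not self.scraped_at:
            self.scraped_at = datetime.now(timezone.utc).isoformat(timespec="seconds")


def write_listings(listings, path):
    """Write JobListing records to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(JobListing)])
        writer.writeheader()
        writer.writerows(asdict(item) for item in listings)
```

Keeping the timestamp in UTC and ISO 8601 format makes rows from different scraping runs easy to compare and sort.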

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time



In [None]:
# Install Chrome first if it's not already installed
!wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
!echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" | tee /etc/apt/sources.list.d/google-chrome.list
!apt-get update
!apt-get install -y google-chrome-stable

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time

# Configure Chrome options for headless execution in Colab
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")

# Setup ChromeDriver using webdriver_manager
service = Service(ChromeDriverManager().install())

driver = webdriver.Chrome(service=service, options=chrome_options)
query = "dslr+camera"
# Build the search URL; each query parameter is separated by "&"
driver.get(f"https://www.amazon.in/s?k={query}&crid=1P5SPEL9BCYHC&sprefix=dslrcamera%2Caps%2C464&ref=nb_sb_noss_2")

# Added a small delay to ensure content loads (if needed, adjust further)
time.sleep(5)

try:
  # Using find_elements for robustness as class might appear multiple times or not at all
  elems = driver.find_elements(By.CSS_SELECTOR, ".puis-card-container")
  if elems:
    for elem in elems:
      print(elem.text)
  else:
    print("No elements with class 'puis-card-container' found.")
except Exception as e:
  print(f"An error occurred: {e}")
finally:
  # quit() ends the whole browser session, not just the current window
  driver.quit()

OK
deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main
Hit:1 http://dl.google.com/linux/chrome/deb stable InRelease
Hit:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:3 https://cli.github.com/packages stable InRelease
Hit:4 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Reading package lists... Done
Building dependency tree... 