## Explanation of the Code

1️⃣ **Fetch news from Mediastack**  
   - Requests up to `MAX_ARTICLES` US articles (10 by default) from the Mediastack API using your API key.

2️⃣ **Check for paywalled articles**  
   - Skips articles whose URL matches a known paywalled domain (e.g., `nytimes.com`).

3️⃣ **Extract full article text**  
   - Tries `newspaper3k` first, then `Unstructured`, then `BeautifulSoup`, and keeps the first result longer than 500 characters.

4️⃣ **Store articles in JSON**  
   - Saves each article's title, URL, and extracted text to a JSON file (`news.json`).

5️⃣ **Convert text to embeddings**  
   - Uses the `all-MiniLM-L6-v2` `SentenceTransformer` model to generate one embedding per article from its title and text.

6️⃣ **Store embeddings in ChromaDB**  
   - Adds the generated embeddings into ChromaDB for semantic search.
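
Step 2 can be made stricter than a plain substring match by comparing the URL's hostname against the paywalled list, so that a domain name appearing only in the path does not trigger a skip. The following is a minimal sketch, not the notebook's own implementation: the helper name `is_paywalled_host` and the constant `PAYWALLED_DOMAINS` are hypothetical, and it assumes subdomains such as `www.nytimes.com` should count as paywalled too.

```python
from urllib.parse import urlparse

# Hypothetical constant mirroring the paywalled list used in this notebook.
PAYWALLED_DOMAINS = ("nytimes.com", "washingtonpost.com", "theatlantic.com", "bloomberg.com")


def is_paywalled_host(url, domains=PAYWALLED_DOMAINS):
    """Return True if the URL's hostname is a listed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


if __name__ == "__main__":
    # Hostname is www.nytimes.com, a subdomain of a listed domain.
    print(is_paywalled_host("https://www.nytimes.com/2025/01/01/world/example.html"))
    # The listed domain appears only in the path, so this is not treated as paywalled.
    print(is_paywalled_host("https://example.com/why-nytimes.com-has-a-paywall"))
```

A substring test would flag both URLs above; the hostname comparison flags only the first.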


### Installation

pip install requests chromadb sentence-transformers unstructured beautifulsoup4 fake-useragent newspaper3k lxml_html_clean

import requests
import json
import os
import time
import chromadb
from sentence_transformers import SentenceTransformer
from unstructured.partition.html import partition_html
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from newspaper import Article

# 🔹 CONFIGURATION: Define the Mediastack API and the number of articles to fetch
# Read the key from the environment instead of hardcoding a secret in the notebook.
# MEDIASTACK_API_KEY is an assumed variable name; set it before running this cell.
API_KEY = os.environ.get("MEDIASTACK_API_KEY", "YOUR_MEDIASTACK_API_KEY")
MAX_ARTICLES = 10  # You can change this to 5, 10, etc.

# Mediastack Base URL
BASE_URL = f"http://api.mediastack.com/v1/news?access_key={API_KEY}&countries=us&limit={MAX_ARTICLES}"

# List of known paywalled domains (to avoid scraping content)
paywalled_domains = ["nytimes.com", "washingtonpost.com", "theatlantic.com", "bloomberg.com"]

# User-Agent Rotator
ua = UserAgent()

def is_paywalled(url):
    """Check if the article is from a paywalled domain."""
    return any(domain in url for domain in paywalled_domains)

def extract_full_text(url):
    """Extract full article text using newspaper3k, Unstructured, and BeautifulSoup.

    Returns the extracted text, or None if the page could not be fetched or parsed.
    """
    try:
        headers = {'User-Agent': ua.random}
        page = requests.get(url, headers=headers, timeout=10)

        if page.status_code != 200:
            print(f"⚠️ Page returned status code {page.status_code}: {url}")
            return None

        # Attempt 1: newspaper3k (best for full-text extraction)
        # Reuse the HTML fetched above instead of downloading the page a second time.
        article = Article(url)
        article.download(input_html=page.text)
        article.parse()
        if len(article.text) > 500:
            return article.text

        # Attempt 2: Unstructured (fallback)
        elements = partition_html(text=page.text)
        extracted_text = "\n".join(el.text for el in elements if el.text.strip())
        if len(extracted_text) > 500:
            return extracted_text

        # Attempt 3: BeautifulSoup (last resort)
        soup = BeautifulSoup(page.text, "html.parser")
        paragraphs = soup.find_all("p")
        extracted_text = "\n".join(p.get_text() for p in paragraphs)
        return extracted_text if len(extracted_text) > 500 else None

    except Exception as e:
        print(f"⚠️ Error extracting content from {url}: {e}")
        return None

# 🔹 Fetch news from Mediastack
response = requests.get(BASE_URL, timeout=10)
response.raise_for_status()  # Fail early on HTTP errors
news_data = response.json().get("data", [])[:MAX_ARTICLES]  # Limit articles

articles_list = []

# 🔹 Process each article
for i, article in enumerate(news_data):
    url = article.get("url", "")

    if not url:
        print("🚫 Skipping article with no URL.")
        continue

    if is_paywalled(url):
        print(f"🚫 Skipping paywalled article: {url}")
        continue

    print(f"🔍 [{i+1}/{MAX_ARTICLES}] Processing: {url}")
    full_text = extract_full_text(url)
    time.sleep(2)  # Avoid being blocked by rate limits

    # Do not store (or later embed) error messages as if they were article text
    if full_text is None:
        print(f"🚫 Skipping article with no extractable content: {url}")
        continue

    articles_list.append({
        "title": article.get("title", "Unknown title"),
        "url": url,
        "content": full_text
    })

# 🔹 Save articles in JSON format
with open("news.json", "w", encoding="utf-8") as f:
    json.dump(articles_list, f, indent=4, ensure_ascii=False)  # Keep non-ASCII text readable

print("✅ Articles saved in 'news.json'.")

# 🔹 INTEGRATION WITH CHROMADB (Embeddings)
print("🔄 Converting articles to embeddings and storing them in ChromaDB...")

# Initialize ChromaDB
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("news_articles")

# Load the embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Convert each article to embeddings and store them in ChromaDB
for article in articles_list:
    text = article["title"] + " " + article["content"]
    embedding = embedding_model.encode(text).tolist()

    collection.add(
        ids=[article["url"]],
        embeddings=[embedding],
        metadatas=[{"title": article["title"], "url": article["url"]}],
        documents=[text]
    )

print("✅ Articles converted to embeddings and stored in ChromaDB.")



🔍 [1/10] Processing: https://www.ndtvprofit.com/technology/gen-ai-adoption-not-moving-at-speed-of-technology-agentic-ai-no-silver-bullet-finds-deloitte-survey
🔍 [2/10] Processing: https://www.wral.com/story/mules-that-provided-aid-after-hurricane-helene-struck-down-on-road/21863436/
🔍 [3/10] Processing: https://www.argophilia.com/news/lavrio-agios-efstratios-lemnos-kavala/240341/
🔍 [4/10] Processing: https://www.scotsman.com/recommended/oodie-has-just-launched-football-based-jackets-blankets-and-coats-4963229
🔍 [5/10] Processing: https://www.burnleyexpress.net/recommended/oodie-has-just-launched-football-based-jackets-blankets-and-coats-4963229
🔍 [6/10] Processing: https://www.deccanchronicle.com/sports/australian-team-arrives-in-pakistan-for-champions-trophy-1861721
🔍 [7/10] Processing: https://www.wigantoday.net/recommended/oodie-has-just-launched-football-based-jackets-blankets-and-coats-4963229
🔍 [8/10] Processing: https://famagusta-gazette.com/over-2800-traffic-violations-recorded

Add of existing embedding ID: https://famagusta-gazette.com/over-2800-traffic-violations-recorded-in-cyprus-over-just-six-days/
Insert of existing embedding ID: https://famagusta-gazette.com/over-2800-traffic-violations-recorded-in-cyprus-over-just-six-days/
Add of existing embedding ID: https://www.zeebiz.com/personal-finance/news-return-comparison-sip-vs-ppf-which-investment-can-build-larger-corpus-for-rs-130000-annual-contribution-calculations-inside-347543
Insert of existing embedding ID: https://www.zeebiz.com/personal-finance/news-return-comparison-sip-vs-ppf-which-investment-can-build-larger-corpus-for-rs-130000-annual-contribution-calculations-inside-347543
Add of existing embedding ID: https://www.ft.com/content/9fb2c865-8183-4b2c-86a0-bd55e86728da
Insert of existing embedding ID: https://www.ft.com/content/9fb2c865-8183-4b2c-86a0-bd55e86728da


✅ Articles converted to embeddings and stored in ChromaDB.
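
The `Add of existing embedding ID` and `Insert of existing embedding ID` lines in the output above indicate that some URLs were already stored in the collection, which is expected when the cell is re-run against the persistent `./chroma_db` directory or when the API returns the same URL twice. One way to avoid them is to drop already-stored URLs before embedding. The sketch below is a library-free illustration under stated assumptions: `filter_new_articles` and `existing_ids` are hypothetical names, and in the notebook the existing IDs would have to be read from the collection first (for example via `collection.get(ids=[...])["ids"]`).

```python
def filter_new_articles(articles, existing_ids):
    """Return articles whose URL is neither in existing_ids nor repeated earlier in the list."""
    seen = set(existing_ids)
    fresh = []
    for article in articles:
        url = article["url"]
        if url in seen:
            continue  # Already stored, or a duplicate within this batch
        seen.add(url)
        fresh.append(article)
    return fresh


if __name__ == "__main__":
    batch = [
        {"title": "A", "url": "https://example.com/a"},
        {"title": "B", "url": "https://example.com/b"},
        {"title": "A again", "url": "https://example.com/a"},
    ]
    # Assumption: these IDs were previously read from the ChromaDB collection.
    stored = ["https://example.com/b"]
    print([a["title"] for a in filter_new_articles(batch, stored)])  # prints ['A']
```

If refreshed content should replace the stored copy instead of being skipped, `collection.upsert(...)` accepts the same arguments as `collection.add(...)` and is another option.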
