#### Crypto Sentiment Tracker – News & Text Sources

**Goal**: Collect recent text data (titles, summaries, descriptions) from crypto news sources for BTC, ETH, SOL, and general crypto sentiment.

**2026 Free Tier Reality**:
- Primary (always works): RSS feeds from major crypto news sites (no key, no limits, recent articles)
- Upgrade path: NewsData.io free plan (200 credits/day, 10 articles/credit, 12-hour delay, basic filters)
- X/Reddit: Skipped (paid-only for meaningful search in 2026 free tier)
- Price baseline: Daily OHLCV from Binance (CCXT, public, free)

**Security**: API keys managed via dotenv (.env file)

#### Data Collection

#### 1. Imports & dotenv setup

In [2]:
import os
import pandas as pd
from datetime import datetime, timedelta
import time
import feedparser
import requests
import ccxt

from dotenv import load_dotenv

# Load secrets from .env (create this file!)
load_dotenv()

NEWSDATA_API_KEY = os.getenv("NEWSDATA_API_KEY")

print("Phase 1 started:", datetime.now().strftime("%Y-%m-%d %H:%M EAT"))
print("NewsData.io key loaded:", "YES" if NEWSDATA_API_KEY else "NO (using RSS fallback)")

Phase 1 started: 2026-02-16 16:38 EAT
NewsData.io key loaded: YES


##### Step 1: RSS Feeds – Zero-Cost, Immediate Baseline
Reliable crypto news RSS (Cointelegraph, CoinDesk, etc.)
→ ~20–100 recent articles, no signup, no rate limits

##### RSS Collection

In [3]:
RSS_FEEDS = {
    "Cointelegraph": "https://cointelegraph.com/rss",
    "CoinDesk":      "https://www.coindesk.com/arc/outboundfeeds/rss/",
    "CryptoSlate":   "https://cryptoslate.com/feed/",
    "CryptoPotato":  "https://cryptopotato.com/feed/",
    # Optional extras: "Decrypt": "https://decrypt.co/feed",
}

def parse_rss(source_name, url, days_back=14):
    feed = feedparser.parse(url)
    if feed.bozo:
        print(f"Feed error {source_name}: {feed.bozo_exception}")
        return pd.DataFrame()
    
    articles = []
    cutoff = datetime.now() - timedelta(days=days_back)
    
    for entry in feed.entries:
        pub = entry.get('published_parsed')
        published = datetime(*pub[:6]) if pub else datetime.now()
        if published < cutoff:
            continue
            
        articles.append({
            'source': source_name,
            'title': entry.get('title', ''),
            'link': entry.get('link', ''),
            'published': published,
            'summary': entry.get('summary', ''),
            'content': entry.get('content', [{}])[0].get('value', '') if 'content' in entry else '',
            'tags': ', '.join(t['term'] for t in entry.get('tags', [])) if 'tags' in entry else ''
        })
    
    df = pd.DataFrame(articles)
    if not df.empty:
        df['published'] = pd.to_datetime(df['published'])
    return df

print("Collecting RSS feeds...")
all_rss = [parse_rss(name, url) for name, url in RSS_FEEDS.items()]
rss_df = pd.concat(all_rss, ignore_index=True).drop_duplicates(subset=['link', 'title'])
rss_df = rss_df.sort_values('published', ascending=False).reset_index(drop=True)

print(f"Collected {len(rss_df)} articles from RSS (last {14} days)")
rss_df[['source', 'published', 'title', 'summary']].head(8)

# Save
rss_df.to_csv('raw_crypto_news_rss.csv', index=False)

Collecting RSS feeds...
Feed error CryptoSlate: <unknown>:2:751: not well-formed (invalid token)
Collected 91 articles from RSS (last 14 days)


##### Step 2: NewsData.io – Free Tier (200 credits/day)
If you added key to .env → runs automatically  
→ Fetches keyword-targeted crypto news (better relevance than general RSS)

#### NewsData.io

In [4]:
if NEWSDATA_API_KEY:
    print("Fetching from NewsData.io (free tier)...")
    
    def fetch_nd_crypto():
        base_url = "https://newsdata.io/api/1/news"
        queries = [
            "Bitcoin OR BTC crypto",
            "Ethereum OR ETH crypto",
            "Solana OR SOL crypto",
            "crypto OR cryptocurrency market"
        ]
        results = []
        
        for q in queries:
            params = {
                'apikey': NEWSDATA_API_KEY,
                'q': q[:100],           # free tier limit 100 chars
                'language': 'en',
                'size': 10              # max per request on free
            }
            try:
                r = requests.get(base_url, params=params, timeout=12)
                r.raise_for_status()
                data = r.json()
                if data.get('status') != 'success':
                    print(f"Error for '{q}': {data.get('message')}")
                    continue
                arts = data.get('results', [])
                for a in arts:
                    a['query'] = q
                results.extend(arts)
                time.sleep(1.2)  # respect ~30 credits/15 min
            except Exception as e:
                print(f"Request failed '{q}': {e}")
        
        if not results:
            return pd.DataFrame()
        
        df = pd.DataFrame(results)
        df['pubDate'] = pd.to_datetime(df['pubDate'], errors='coerce')
        df = df.dropna(subset=['pubDate'])
        df['source'] = 'NewsData.io'
        
        # Tag relevance
        df['btc_relevant'] = df.apply(lambda row: any(k in str(row).lower() for k in ['bitcoin', 'btc']), axis=1)
        df['eth_relevant'] = df.apply(lambda row: any(k in str(row).lower() for k in ['ethereum', 'eth']), axis=1)
        df['sol_relevant'] = df.apply(lambda row: any(k in str(row).lower() for k in ['solana', 'sol']), axis=1)
        
        print(f"NewsData.io: {len(df)} articles collected")
        return df
    
    nd_df = fetch_nd_crypto()
    if not nd_df.empty:
        nd_df.to_csv('raw_newsdataio_crypto.csv', index=False)
        display(nd_df[['source', 'pubDate', 'title', 'description']].head(6))
else:
    print("No NewsData.io key in .env → skip (sign up free at newsdata.io if desired)")

Fetching from NewsData.io (free tier)...
NewsData.io: 40 articles collected


Unnamed: 0,source,pubDate,title,description
0,NewsData.io,2026-02-16 01:40:32,"As XRP and LTC Catch Tailwinds, This Top Crypt...",Crypto is moving fast again. Charts are flashi...
1,NewsData.io,2026-02-16 01:40:12,Justin Sun LIT Deposit: A Strategic $4.1M Move...,BitcoinWorldJustin Sun LIT Deposit: A Strategi...
2,NewsData.io,2026-02-16 01:33:00,Cathie Wood buys $46 million of tumbling tech ...,Here are Cathie Wood’s latest moves.
3,NewsData.io,2026-02-16 01:29:00,Crash course: Vietnam's crypto boom goes bust,"Unlike neighboring China, which has banned cry..."
4,NewsData.io,2026-02-16 01:10:44,Bitwise Crypto Industry Innovators ETF (NYSEAR...,Bitwise Crypto Industry Innovators ETF (NYSEAR...
5,NewsData.io,2026-02-16 01:07:56,Strategy Bitcoin Push Deepens With Stretch Fin...,Strategy Inc (NasdaqGS:MSTR) continues to accu...


In [14]:
print("Fetching daily prices (BTC/ETH/SOL) – prioritizing CoinGecko free tier (365 days max)...")

from datetime import datetime, timedelta
import time

prices_list = []

try:
    !pip install pycoingecko --quiet
    from pycoingecko import CoinGeckoAPI
    
    cg = CoinGeckoAPI()
    
    coin_map = {
        'BTC': 'bitcoin',
        'ETH': 'ethereum',
        'SOL': 'solana'
    }
    
    lookback_days = 365  # Free tier safe limit (do NOT exceed 365)
    print(f"  Using CoinGecko lookback: {lookback_days} days")
    
    for coin_key, coin_id in coin_map.items():
        print(f"  Fetching {coin_key} from CoinGecko...")
        try:
            data = cg.get_coin_market_chart_by_id(
                id=coin_id,
                vs_currency='usd',
                days=lookback_days,
                interval='daily'
            )
            prices = pd.DataFrame(data['prices'], columns=['ts', 'price'])
            prices['date'] = pd.to_datetime(prices['ts'], unit='ms').dt.date
            df_cg = prices[['date', 'price']].set_index('date').rename(columns={'price': f'{coin_key}_price'})
            prices_list.append(df_cg)
            print(f"    Success: {len(df_cg)} days for {coin_key}")
        except Exception as e:
            print(f"    CoinGecko failed for {coin_key}: {e}")
            # Continue to next coin
    
except Exception as cg_setup_err:
    print(f"CoinGecko setup failed: {cg_setup_err} → skipping to Binance")

if len(prices_list) < 3:
    print("\nTrying Binance fallback with extended timeout...")
    
    import ccxt
    
    def get_prices_binance(symbol, days=365, retries=3):
        exchange = ccxt.binance({
            'timeout': 90000,                # 90 seconds
            'enableRateLimit': True,
            'options': {'defaultType': 'spot'},
        })
        
        since = exchange.parse8601(
            (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%dT00:00:00Z')
        )
        
        for attempt in range(1, retries + 1):
            try:
                print(f"  Binance attempt {attempt}/{retries} for {symbol}...")
                ohlcv = exchange.fetch_ohlcv(symbol, '1d', since=since, limit=1000)
                df = pd.DataFrame(ohlcv, columns=['ts', 'o', 'h', 'l', 'c', 'v'])
                df['date'] = pd.to_datetime(df['ts'], unit='ms').dt.date
                return df[['date', 'c']].rename(columns={'c': f'{symbol.split("/")[0]}_price'}).set_index('date')
            except ccxt.RequestTimeout as e:
                print(f"    Timeout attempt {attempt}: {e}")
                if attempt < retries:
                    time.sleep(5 + attempt * 5)
            except Exception as e:
                print(f"    Error on {symbol}: {e}")
                break
        return pd.DataFrame()
    
    for sym in ['BTC/USDT', 'ETH/USDT', 'SOL/USDT']:
        df_sym = get_prices_binance(sym)
        if not df_sym.empty:
            prices_list.append(df_sym)

# ────────────────────────────────────────────────
# Combine, save & show
# ────────────────────────────────────────────────
if prices_list:
    prices = pd.concat(prices_list, axis=1, join='outer')
    prices = prices.sort_index()
    prices.to_csv('daily_prices_btc_eth_sol.csv')
    
    print("\nSuccess! Prices saved (likely 365 days of data).")
    print("Most recent 7 days:")
    display(prices.tail(7).round(2))
    
    # Fixed date range print
    min_date = prices.index.min()
    max_date = prices.index.max()
    print(f"\nDate range: {min_date} to {max_date} ({(max_date - min_date).days + 1} days total)")
    
    print(f"\nColumns available: {list(prices.columns)}")
else:
    print("""
All automatic fetches failed.
    
Quick fixes to try next:
  1. Re-run the cell (transient network glitch)
  2. Use a free VPN (US/EU server) → often solves Binance timeouts from ET
  3. For demo/portfolio: Manually download CSV from CoinGecko website
     → https://www.coingecko.com/en/coins/bitcoin/historical_data → last 365 days
     → Repeat for ethereum & solana
     → Then load with:
        btc = pd.read_csv('bitcoin.csv', parse_dates=['Date']).set_index('Date')['Price'].rename('BTC_price')
        # etc.
    """)

Fetching daily prices (BTC/ETH/SOL) – prioritizing CoinGecko free tier (365 days max)...



[notice] A new release of pip is available: 24.0 -> 26.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


  Using CoinGecko lookback: 365 days
  Fetching BTC from CoinGecko...
    Success: 366 days for BTC
  Fetching ETH from CoinGecko...
    Success: 366 days for ETH
  Fetching SOL from CoinGecko...
    Success: 366 days for SOL

Success! Prices saved (likely 365 days of data).
Most recent 7 days:


Unnamed: 0_level_0,BTC_price,ETH_price,SOL_price
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2026-02-11,68779.91,2018.92,82.86
2026-02-12,66937.58,1939.43,79.27
2026-02-13,66184.58,1945.74,78.24
2026-02-14,68838.87,2047.36,84.26
2026-02-15,69765.6,2085.52,88.16
2026-02-16,68716.58,1963.96,85.94
2026-02-16,68460.89,1968.17,84.13



Date range: 2025-02-17 to 2026-02-16 (365 days total)

Columns available: ['BTC_price', 'ETH_price', 'SOL_price']
