# Wikipedia Sentiment + Bitcoin Price: Algo Trading Notebook

This notebook builds a **simple algorithmic trading signal** for Bitcoin using:

1. **Wikipedia edit activity** on the `Bitcoin` page
2. **Sentiment analysis** of edit comments
3. **BTC-USD daily prices** from Yahoo Finance

We then:
- Construct daily features from Wikipedia edits
- Merge them with Bitcoin price data
- Train a simple model to predict **next-day price direction**
- Evaluate the strategy performance on a test set


## 0. Setup
Uncomment and run the following cell if you don't have the required libraries installed.


In [None]:
# !pip install mwclient transformers yfinance tqdm pandas numpy matplotlib scikit-learn
# The sentiment pipeline needs a backend such as PyTorch; if it is missing, also run:
# !pip install torch --index-url https://download.pytorch.org/whl/cpu

## 1. Imports & Configuration

In [None]:
import time
from datetime import datetime, timedelta

import mwclient
import pandas as pd
import numpy as np
import yfinance as yf
from tqdm import tqdm

from transformers import pipeline

import matplotlib.pyplot as plt

plt.style.use('ggplot')

# ---- Configuration ----
WIKI_SITE = 'en.wikipedia.org'
WIKI_PAGE_TITLE = 'Bitcoin'

# Date range for analysis
# You can adjust these
START_DATE = '2020-01-01'
END_DATE   = '2023-12-31'

# Ticker for Bitcoin (USD) on Yahoo Finance
BTC_TICKER = 'BTC-USD'

START_DT = datetime.fromisoformat(START_DATE)
END_DT   = datetime.fromisoformat(END_DATE)

## 2. Download Wikipedia Revisions
We use the `mwclient` library to fetch the revision history of the Bitcoin page.
Each revision has:
- `timestamp`
- `user`
- `comment` (edit summary)
- other metadata


In [None]:
def fetch_wiki_revisions(page_title: str,
                          start_dt: datetime,
                          end_dt: datetime,
                          max_retries: int = 5):
    """Fetch revisions for a Wikipedia page between start_dt and end_dt.

    Returns a list of revision dicts.
    """
    site = mwclient.Site(WIKI_SITE)
    page = site.pages[page_title]

    # mwclient uses MW timestamps (UTC); convert datetimes
    start_str = start_dt.strftime('%Y%m%d%H%M%S')
    end_str   = end_dt.strftime('%Y%m%d%H%M%S')

    params = {
        'start': start_str,
        'end': end_str,
        'dir': 'newer',
        'prop': 'ids|timestamp|user|comment|flags|size'
    }

    retries = 0
    while retries < max_retries:
        try:
            revs = list(page.revisions(**params))
            return revs
        except Exception as e:
            print(f"Error fetching revisions (attempt {retries+1}/{max_retries}): {e}")
            retries += 1
            time.sleep(2 * retries)
    raise RuntimeError('Failed to fetch revisions after max retries.')


revs = fetch_wiki_revisions(WIKI_PAGE_TITLE, START_DT, END_DT)
print(f"Fetched {len(revs)} revisions from Wikipedia.")

# Peek at a few revisions
revs[:3]

## 3. Build Daily Edit Activity DataFrame
We aggregate revisions by **date** and compute:
- `edit_count`: number of edits on that date

Per-day sentiment statistics are added in the next section.
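
As a minimal offline sketch of this aggregation, assume each revision carries its timestamp as a `time.struct_time`, which is what mwclient typically returns. The sample revisions and the helper name `revision_date` are made up for illustration.

```python
import time
from collections import Counter
from datetime import datetime

# Hypothetical revisions, shaped like mwclient output (timestamp as struct_time).
FMT = "%Y-%m-%d %H:%M:%S"
sample_revs = [
    {"timestamp": time.strptime("2021-03-01 08:15:00", FMT), "comment": "fix typo"},
    {"timestamp": time.strptime("2021-03-01 21:40:00", FMT), "comment": "revert vandalism"},
    {"timestamp": time.strptime("2021-03-02 10:05:00", FMT), "comment": "update price section"},
]

def revision_date(rev):
    """Convert a struct_time timestamp to a calendar date (fields used as-is, no timezone shift)."""
    return datetime(*rev["timestamp"][:6]).date()

# Count edits per calendar day.
edit_counts = Counter(revision_date(r) for r in sample_revs)
for day, count in sorted(edit_counts.items()):
    print(day, count)
```

The pandas `groupby` version used in this notebook computes the same per-day counts.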


In [None]:
def revisions_to_dataframe(revs):
    records = []
    for r in revs:
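        # mwclient typically returns r['timestamp'] as a time.struct_time (UTC)
        ts = r['timestamp']
        if isinstance(ts, str):
            # Fallback: ISO 8601 string such as '2021-03-01T08:15:00Z'
            ts = datetime.strptime(ts, '%Y-%m-%dT%H:%M:%SZ')
        elif not isinstance(ts, datetime):
            # struct_time -> naive datetime, keeping the UTC fields as-is
            ts = datetime(*ts[:6])
        date_only = ts.date()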
        records.append({
            'timestamp': ts,
            'date': date_only,
            'user': r.get('user', None),
            'comment': r.get('comment', ''),
            'size': r.get('size', None)
        })
    df = pd.DataFrame(records)
    return df

wiki_df = revisions_to_dataframe(revs)
print(wiki_df.head())
print('\nTotal unique days with edits:', wiki_df['date'].nunique())

### Daily aggregation

In [None]:
# Basic daily aggregation: number of edits per day
daily_edits = (
    wiki_df
    .groupby('date')
    .agg(edit_count=('timestamp', 'count'))
    .reset_index()
)

daily_edits.head()

## 4. Sentiment Analysis of Edit Comments
We will use a Hugging Face `pipeline` for sentiment analysis.

For simplicity, we use the default `sentiment-analysis` pipeline, which
labels each comment `POSITIVE` or `NEGATIVE` with a confidence score between 0 and 1.
Empty comments are treated as neutral with a score of 0.

We then aggregate per day:
- `avg_sentiment_score`: average signed sentiment score
- `pos_frac`: fraction of positive comments
- `neg_frac`: fraction of negative comments
- `comment_count`: total number of comments considered
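
The daily statistics listed above can be sketched on a tiny hand-written table without loading any model. The labels and scores below are made up; in the notebook they come from the Hugging Face pipeline.

```python
import pandas as pd

# Made-up classifier outputs for three edit comments on two days.
toy = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-01", "2021-03-02"],
    "sent_label": ["POSITIVE", "NEGATIVE", "NEGATIVE"],
    "sent_score": [0.9, 0.6, 0.8],
})

# Signed score: +score for POSITIVE, -score for NEGATIVE, 0 for anything else.
sign = toy["sent_label"].map({"POSITIVE": 1.0, "NEGATIVE": -1.0}).fillna(0.0)
toy["sent_signed"] = sign * toy["sent_score"]
toy["is_pos"] = toy["sent_label"].eq("POSITIVE")
toy["is_neg"] = toy["sent_label"].eq("NEGATIVE")

# The mean of a boolean column is the fraction of True values.
daily = toy.groupby("date").agg(
    avg_sentiment_score=("sent_signed", "mean"),
    pos_frac=("is_pos", "mean"),
    neg_frac=("is_neg", "mean"),
    comment_count=("sent_label", "count"),
)
print(daily)
```

On days with mixed comments the signed scores partly cancel, so `avg_sentiment_score` can sit near zero even when individual comments are strongly classified.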


In [None]:
# Initialize sentiment pipeline
sentiment_pipeline = pipeline('sentiment-analysis')

def compute_signed_score(sent_label, score):
    """Map (label, score) to a signed value in [-1, 1]."""
    if 'NEG' in sent_label.upper():
        return -score
    return score

def add_comment_sentiment(df: pd.DataFrame, comment_col: str = 'comment') -> pd.DataFrame:
    comments = df[comment_col].fillna('').astype(str).tolist()
    labels = []
    scores = []
    signed_scores = []

    for c in tqdm(comments, desc='Running sentiment on comments'):
        if not c.strip():
            labels.append('NEUTRAL')
            scores.append(0.0)
            signed_scores.append(0.0)
            continue
        try:
            result = sentiment_pipeline(c[:512])[0]  # truncate long comments
            label = result['label']
            score = float(result['score'])
            signed = compute_signed_score(label, score)
        except Exception as e:
            # In case of any error, treat as neutral
            label = 'NEUTRAL'
            score = 0.0
            signed = 0.0
        labels.append(label)
        scores.append(score)
        signed_scores.append(signed)

    out_df = df.copy()
    out_df['sent_label'] = labels
    out_df['sent_score'] = scores
    out_df['sent_signed'] = signed_scores
    return out_df

wiki_df_sent = add_comment_sentiment(wiki_df)
wiki_df_sent.head()

### Aggregate sentiment per day

In [None]:
daily_sent = (
    wiki_df_sent
    .groupby('date')
    .agg(
        edit_count=('timestamp', 'count'),
        avg_sentiment_score=('sent_signed', 'mean'),
        pos_frac=('sent_label', lambda s: s.str.upper().str.contains('POS').mean()),
        neg_frac=('sent_label', lambda s: s.str.upper().str.contains('NEG').mean()),
        comment_count=('comment', 'count')
    )
    .reset_index()
)

daily_sent.head()

## 5. Download Bitcoin Price Data (Yahoo Finance)
We fetch **daily OHLCV** (Open, High, Low, Close, Volume) for `BTC-USD`.

Then we compute:
- `return`: daily log-return
- `next_return`: next-day log-return
- `target_up`: 1 if next_return > 0 else 0
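
A small worked example of these three columns, using made-up closing prices rather than real BTC data:

```python
import numpy as np
import pandas as pd

# Made-up closing prices for four consecutive days.
close = pd.Series([100.0, 110.0, 99.0, 99.0], name="close")

ret = np.log(close).diff()              # log-return from the previous day to this day
next_ret = ret.shift(-1)                # log-return realized over the following day
target_up = (next_ret > 0).astype(int)  # 1 if the following day closes higher

toy = pd.DataFrame({
    "close": close,
    "return": ret,
    "next_return": next_ret,
    "target_up": target_up,
})
print(toy)
```

The last row has no next-day return, so its `target_up` of 0 is a placeholder rather than a real label. Such rows should be excluded before training.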


In [None]:
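# Note: yfinance treats `end` as exclusive, so END_DATE itself is not included
btc_data = yf.download(BTC_TICKER, start=START_DATE, end=END_DATE)
# Recent yfinance versions may return MultiIndex columns (field, ticker); flatten them
if isinstance(btc_data.columns, pd.MultiIndex):
    btc_data.columns = btc_data.columns.get_level_values(0)
btc_data = btc_data.rename_axis('date').reset_index()
btc_data['date'] = pd.to_datetime(btc_data['date']).dt.date

btc_data['close'] = btc_data['Close']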
btc_data['return'] = np.log(btc_data['close']).diff()
btc_data['next_return'] = btc_data['return'].shift(-1)
btc_data['target_up'] = (btc_data['next_return'] > 0).astype(int)

btc_data.head()

## 6. Merge Wikipedia Features with Bitcoin Prices

In [None]:
# Merge on 'date'
merged = pd.merge(btc_data, daily_sent, on='date', how='left')

# Fill missing Wikipedia days with zeros / neutral sentiment
merged[['edit_count', 'avg_sentiment_score', 'pos_frac', 'neg_frac', 'comment_count']] = (
    merged[['edit_count', 'avg_sentiment_score', 'pos_frac', 'neg_frac', 'comment_count']]
    .fillna(0.0)
)

print(merged.head())
print('\nShape:', merged.shape)

### Quick visualization: edits & sentiment vs price

In [None]:
fig, ax1 = plt.subplots(figsize=(12, 5))

ax1.plot(merged['date'], merged['close'], label='BTC Close')
ax1.set_ylabel('BTC Close Price (USD)')
ax1.set_xlabel('Date')

ax2 = ax1.twinx()
ax2.plot(merged['date'], merged['edit_count'], alpha=0.5, label='Wiki Edit Count')
ax2.set_ylabel('Wikipedia Edit Count')

plt.title('BTC Price vs. Wikipedia Edit Count')
fig.tight_layout()
plt.show()

fig, ax1 = plt.subplots(figsize=(12, 5))
ax1.plot(merged['date'], merged['avg_sentiment_score'])
ax1.set_ylabel('Avg Sentiment Score (signed)')
ax1.set_xlabel('Date')
plt.title('Daily Wikipedia Sentiment for Bitcoin Page')
fig.tight_layout()
plt.show()

## 7. Build a Simple Predictive Model
We use a simple **logistic regression** model to predict whether the next-day return is positive (`target_up`).

Features:
- `edit_count`
- `avg_sentiment_score`
- `pos_frac`
- `neg_frac`
- `comment_count`

You can easily extend this with more technical features (moving averages, volatility, etc.).
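
As a hedged sketch of such an extension, the snippet below derives a moving-average ratio and a rolling volatility from made-up prices. The column names `ma_ratio` and `volatility` and the window length are illustrative choices, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up closing prices; in the notebook these would come from merged['close'].
prices = pd.DataFrame({"close": [100.0, 102.0, 101.0, 105.0, 107.0, 104.0]})
prices["return"] = np.log(prices["close"]).diff()

window = 3  # illustrative window length, not tuned

# Both features use data up to and including the current day only,
# so they are known before the next-day return is realized.
prices["ma_ratio"] = prices["close"] / prices["close"].rolling(window).mean() - 1.0
prices["volatility"] = prices["return"].rolling(window).std()

# Rolling windows need `window` observations, so the first rows are NaN and are dropped.
features = prices.dropna().reset_index(drop=True)
print(features)
```

New columns like these would be appended to `feature_cols` before fitting.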

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Drop the last row, which has no next-day return (its target_up is only a placeholder)
model_df = merged.dropna(subset=['next_return']).copy()

feature_cols = ['edit_count', 'avg_sentiment_score', 'pos_frac', 'neg_frac', 'comment_count']
X = model_df[feature_cols].values
y = model_df['target_up'].values

# Train-test split by time (no shuffling to avoid lookahead bias)
split_idx = int(0.7 * len(model_df))
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])

pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"Test Accuracy: {acc:.4f}")
print("Confusion Matrix:\n", cm)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
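
Accuracy on its own is hard to interpret without a naive baseline. A minimal sketch, with made-up label arrays standing in for `y_train`, `y_test`, and `y_pred`, compares the model against always predicting the majority class seen in training:

```python
import numpy as np

# Made-up labels for illustration only.
y_train_toy = np.array([1, 1, 0, 1, 0, 1])
y_test_toy = np.array([1, 0, 1, 1, 0])
y_pred_toy = np.array([1, 0, 1, 0, 0])

# Baseline: always predict the majority class seen in training.
majority_class = int(np.bincount(y_train_toy).argmax())
baseline_acc = float(np.mean(y_test_toy == majority_class))
model_acc = float(np.mean(y_test_toy == y_pred_toy))

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_acc:.2f}")
print(f"Model accuracy: {model_acc:.2f}")
```

If the model's test accuracy is not clearly above this baseline, the features are probably adding little beyond the class balance.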

## 8. Simple Strategy Backtest
We create a **toy trading strategy**:

- Each day in the test period, use the model to predict `target_up`
- If predicted **up**, we go **long** for the next day (PnL = next_return)
- If predicted **down**, we go **flat** (PnL = 0)

This is *not* a realistic backtest (no transaction costs, slippage, or risk constraints), but it is enough to check if the signal has any predictive power.
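
The long-or-flat rule can be traced by hand on a few made-up numbers before running it on the test set:

```python
import numpy as np

# Made-up next-day log-returns and model predictions (1 = up, 0 = down).
next_return = np.array([0.02, -0.01, 0.03, -0.02])
pred_up = np.array([1, 0, 1, 1])

# Long when the model predicts up, flat otherwise.
strategy_return = np.where(pred_up == 1, next_return, 0.0)

# Log-returns add, so cumulative growth is exp of the running sum.
cum_strategy = np.exp(np.cumsum(strategy_return))
cum_buy_hold = np.exp(np.cumsum(next_return))

print(f"Strategy growth: {cum_strategy[-1]:.4f}")
print(f"Buy & hold growth: {cum_buy_hold[-1]:.4f}")
```

Here the strategy sits out the second day's loss but still takes the fourth day's loss, because that prediction was wrong.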

In [None]:
test_df = model_df.iloc[split_idx:].copy().reset_index(drop=True)
X_test_full = test_df[feature_cols].values
y_pred_test = pipe.predict(X_test_full)

test_df['pred_up'] = y_pred_test

# Strategy return: if we predict up, take next_return; otherwise 0
test_df['strategy_return'] = np.where(test_df['pred_up'] == 1,
                                       test_df['next_return'], 0.0)

# Cumulative returns
test_df['cum_strategy'] = test_df['strategy_return'].cumsum().apply(np.exp)
test_df['cum_buy_hold'] = test_df['next_return'].cumsum().apply(np.exp)

fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(test_df['date'], test_df['cum_strategy'], label='Strategy (Wiki Sentiment)')
ax.plot(test_df['date'], test_df['cum_buy_hold'], label='Buy & Hold')
ax.set_ylabel('Cumulative Growth (exp of cum log-returns)')
ax.set_xlabel('Date')
ax.legend()
plt.title('Strategy vs Buy & Hold (Test Period)')
fig.tight_layout()
plt.show()

final_strategy = test_df['cum_strategy'].iloc[-1]
final_buy_hold = test_df['cum_buy_hold'].iloc[-1]
print(f"Final Strategy Growth (test): {final_strategy:.3f}")
print(f"Final Buy & Hold Growth (test): {final_buy_hold:.3f}")

## 9. Next Steps / Extensions
- Add **technical indicators**: moving averages, RSI, volatility, etc.
- Use a more sophisticated **sentiment model** (finance-specific, multi-class).
- Add **lagged features** (previous days' sentiment and edit activity).
- Use proper **walk-forward validation** and **transaction costs**.
- Experiment with tree-based models (Random Forest, XGBoost) or deep learning.
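
As a rough sketch of the walk-forward and transaction-cost ideas, assuming an expanding training window and a fixed cost per position change: the helpers `walk_forward_splits` and `net_strategy_returns` are hypothetical names introduced here, and the cost value is arbitrary.

```python
import numpy as np

def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_idx, test_idx) with an expanding training window and no shuffling."""
    fold_size = (n_samples - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        test_end = train_end + fold_size
        yield np.arange(0, train_end), np.arange(train_end, test_end)

def net_strategy_returns(positions, next_returns, cost_per_trade):
    """Subtract a fixed cost (in log-return units) each time the position changes."""
    prev = np.concatenate(([0], positions[:-1]))  # assume we start flat
    trades = np.abs(positions - prev)
    return positions * next_returns - cost_per_trade * trades

# Each fold trains only on data that precedes its test window.
for train_idx, test_idx in walk_forward_splits(n_samples=10, n_folds=3, min_train=4):
    print(f"train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")

positions = np.array([1, 1, 0, 1])  # 1 = long, 0 = flat
next_returns = np.array([0.02, -0.01, 0.03, 0.01])
print(net_strategy_returns(positions, next_returns, cost_per_trade=0.001))
```

In a fuller version, the model would be refit on each `train_idx` and its predictions on `test_idx` would supply `positions`.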

---
This notebook is structured so you can plug it into a more serious research or trading pipeline.
Tweak the configuration at the top and iterate from here.