In [1]:
%load_ext autoreload
%autoreload 2


In [2]:
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Get Anthropic API key from environment variables
DATABRICKS_TOKEN = os.getenv("DATABRICKS_TOKEN")



from pydantic import BaseModel
from typing import Literal

class AnkiCard(BaseModel):
    front: str
    back: str
    source_type: Literal["Book", "Paper", "WebArticle", "Course"]
    source_address: str
    field: Literal["Engineering", "Algorithms", "Productivity", "AIEngineering"]
    note: str


In [3]:
import dspy

from typing import List

class AnkiCreation(dspy.Signature):
    """
    # Anki Card Creation Assistant

<system>You are an expert Anki flashcard creator with deep expertise in machine learning and educational psychology. Your goal is to transform learning materials into effective, memorable Anki cards following proven principles of spaced repetition and active recall.

<task>
Create high-quality Anki flashcards from provided learning materials that optimize for long-term retention and understanding. Each card should follow the minimum information principle and test a single, well-defined concept.
</task>

<output_format>
Generate a CSV with the following fields for each flashcard:
- Front: Question shown to the user
- Back: Answer/solution. For mathematical expressions, use LaTeX delimited by \[ \] for display math or \( \) for inline math. Ensure all equations are properly escaped.
- SourceType: Book/Paper/WebArticle/Course
- SourceAddress: Source URL/reference
- Field: Engineering/Algorithms/Productivity/AIEngineering
- Note: Additional context and related concepts
</output_format>

<principles>
# Core Principles for Effective Flashcards

## 1. Understand Before Memorizing
- Master the material conceptually before creating cards
- Anki reinforces existing understanding, not initial learning
- Build coherent mental models before memorization

## 2. Minimum Information Principle
- One idea per card
- Keep cards as simple as possible
- Benefits:
  - Easier recall
  - Efficient review scheduling
  - Prevents cognitive overload

## 3. Question-Based Formulation
- Structure cards as specific questions
- Questions provide better memory cues than statements
- Promotes active recall

## 4. Super techincal cards with objective questions
- Focus on the technical part of the content when making cards, not opinions or philosophy
- Avoid making coards on introductory things like what the materical covers, who the author is,etc
- Overindex on equations, mathematical concepts, and algorithms and key ideas
- If the content is not technical, do not make cards, or do it in a way that is testable


## Common Pitfalls to Avoid

1. Statement Cards
   - Avoid term-definition pairs
   - Use active questions instead

2. Essay-Length Answers
   - Break into multiple focused cards
   - Keep answers concise
   - Use keywords when possible

3. Mechanical Cloze Deletions
   - Use thoughtfully to test specific concepts
   - Avoid overuse that promotes surface learning

4. List Memorization
   - Use mnemonics or memory techniques
   - Break long lists into chunks

5. Multiple Choice Questions
   - Avoid recognition-based testing
   - Focus on active recall

6. Binary Questions
   - Skip yes/no and true/false formats
   - Too high chance of guessing correctly

## Card Structure Best Practices

### Basic Card Anatomy
1. Question (Front)
   - Clear, specific prompt
   - Tests single concept

2. Answer (Back)
   - Concise response
   - Clear success criteria

3. Optional Context
   - Supporting information
   - Source references
   - Related concepts

### Content Guidelines
- Facts: Embed in meaningful context
- Complex Topics: Focus on relationships
- Formulas: Include derivation context
- Processes: Break into logical chunks
- Create redundant cards from different angles
</principles>

<examples>
Example 1: Python Walrus Operator
Front: What will be the output of this code in python?

if (a:= 0) > -1 :
print("hello")
print(a)

Back: 
hello 
0

SourceType: WebArticle
SourceAddress: https://docs.python.org/3/whatsnew/3.8.html
Field: Engineering
Note: The walrus operator combines expressions and assignments. Useful for:
- Avoiding repeated function calls
- Regular expression matching
- While-loop conditions

Example 2: Transformer Positional Encodings
Front: What are positional encodings in Transformers and why are they used?

Back: Since Transformers lack recurrence and convolution, positional encodings inject information about token positions in the sequence. Added to input embeddings in encoder/decoder stacks with dimension d_model. Uses sine/cosine functions: \[PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{model}})\]

SourceType: Paper
SourceAddress: https://arxiv.org/abs/1706.03762
Field: Algorithms
Note: Modern implementations often use RoPE (Rotary Position Embeddings)



Example 3: Batch Normalization Parameters
Front: Why does Batch Normalization have learned scale and shift parameters?

Back: The learned \(\gamma\) and \(\beta\) parameters in \[y_k = \gamma_k \hat{x}_k + \beta_k\] enable full neural network representation power.

SourceType: Paper
SourceAddress: https://arxiv.org/abs/1502.03167
Field: Algorithms
Note: Setting \[ \gamma_k = \sqrt{\operatorname{Var}[x_k]} \] and \[ \beta_k = \operatorname{E}[x_k] \] can recover original values if optimal.

Example 4: Pythagorean theorem
front=What is the relationship between the lengths of the sides of a right triangle according to the Pythagorean theorem?', 
back=The square of the length of the hypotenuse is equal to the sum of the squares of the lengths of the other two sides, expressed as \(c^2 = a^2 + b^2\).
SourceType='Book', 
SourceAddress='Geometry textbook'
Field='Engineering'
noNotete='Fundamental concept in geometry used for calculating distances and sizes in right-angled triangles.
</examples>
</system>
    """
    input_knowledge: list[str] = dspy.InputField(desc="knowledge to be converted into anki cards")
    extracted_card: list[AnkiCard] = dspy.OutputField(desc="all tokens referring to specific people extracted from the tokenized text")

anki_creator = dspy.ChainOfThought(AnkiCreation)




In [4]:
lm = dspy.LM('databricks/databricks-claude-sonnet-4', api_key=DATABRICKS_TOKEN,base_url="https://dream11-e2.cloud.databricks.com/serving-endpoints")
dspy.configure(lm=lm)

In [5]:
import requests
from bs4 import BeautifulSoup
import PyPDF2
import io
import re

def is_pdf_url(url):
    """Check if the URL points to a PDF file."""
    # Check common PDF indicators in URL
    if url.lower().endswith('.pdf'):
        return True
    
    # Check for Google Drive PDF
    if 'drive.google.com' in url:
        return True
    
    # Try to check content type in headers
    try:
        response = requests.head(url, allow_redirects=True)
        content_type = response.headers.get('content-type', '').lower()
        return 'application/pdf' in content_type
    except:
        return False

def scrape_article(url):
    """Scrape content from a web article."""
    # Send GET request to the URL
    response = requests.get(url)
    
    # Create BeautifulSoup object to parse HTML
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find the main article content
    article_text = []
    
    # Try common article content selectors
    content_selectors = ['article', '.article-content', '.post-content', '.entry-content', 'main']
    
    for selector in content_selectors:
        content = soup.select_one(selector)
        if content:
            # Get all paragraph text
            paragraphs = content.find_all('p')
            article_text = [p.get_text().strip() for p in paragraphs]
            break
    
    # If no content found with selectors, try getting all paragraphs
    if not article_text:
        paragraphs = soup.find_all('p')
        article_text = [p.get_text().strip() for p in paragraphs]
    
    # Join all paragraphs with newlines
    full_text = '\n'.join(article_text)
    
    return full_text

def scrape_pdf(url):
    """Scrape content from a PDF file."""
    # For Google Drive links, need to get the file ID and use direct download URL
    if 'drive.google.com' in url:
        file_id = url.split('/')[-2]
        url = f'https://drive.google.com/uc?export=download&id={file_id}'
    
    try:
        # Download PDF content
        response = requests.get(url)
        
        # Create PDF reader object
        pdf_file = io.BytesIO(response.content)
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        
        # Extract text from all pages
        text = []
        for page in pdf_reader.pages:
            text.append(page.extract_text())
            
        # Join all pages with newlines
        full_text = '\n'.join(text)
        
        return full_text
        
    except Exception as e:
        print(f"Error reading PDF: {e}")
        return ""

def extract_content(url):
    """
    Automatically detect URL type and extract content accordingly.
    Returns a tuple of (content, source_type)
    """
    if is_pdf_url(url):
        content = scrape_pdf(url)
        source_type = "Paper" if "arxiv.org" in url.lower() else "Book"
    else:
        content = scrape_article(url)
        source_type = "WebArticle"
    
    return content, source_type


In [6]:
# URL of the content to process
url = "https://arxiv.org/pdf/2508.15260"


# Extract content and detect source type
content, source_type = extract_content(url)

# Print first 500 chars of content and source type
print(f"First 500 characters of content:\n{content[:500]}\n")
print(f"Source type: {source_type}")



First 500 characters of content:
Deep Think with Confidence
DEEPTHINK WITH CONFIDENCE
Yichao Fu2†∗, Xuewei Wang1, Yuandong Tian1, Jiawei Zhao1†
1Meta AI,2UCSD
†Equal contribution
Project Page: jiaweizzhao.github.io/deepconf
ABSTRACT
Large Language Models (LLMs) have shown great potential in reasoning tasks
through test-time scaling methods like self-consistency with majority voting.
However, this approach often leads to diminishing returns in accuracy and high
computational overhead. To address these challenges, we introduce De

Source type: Paper


In [7]:


# Create Anki cards
preds = anki_creator(input_knowledge=content.split('\n'))

# Update source type for all cards based on detected type
for card in preds.extracted_card:
    card.source_type = source_type
    card.source_address = url

In [None]:
import pandas as pd

# Extract the AnkiCards from the prediction
cards = preds.extracted_card

# Create lists to store card data
fronts = []
backs = []
source_types = []
source_addresses = []
fields = []
notes = []

# Extract data from each card
for card in cards:
    fronts.append(card.front)
    backs.append(card.back) 
    source_types.append(card.source_type)
    source_addresses.append(card.source_address)
    fields.append(card.field)
    notes.append(card.note)

# Create DataFrame
df = pd.DataFrame({
    'Front': fronts,
    'Back': backs,
    'SourceType': source_types,
    'SourceAddress': source_addresses,
    'Field': fields,
    'Note': notes
})

# Save to CSV
df.to_csv('cards/anki_cards.csv', index=False,header=False)
print("Saved anki cards to anki_cards.csv")

Saved anki cards to anki_cards.csv


In [9]:
lm.inspect_history(5)





[34m[2025-09-01T09:54:08.911944][0m

[31mSystem message:[0m

Your input fields are:
1. `input_knowledge` (list[str]): knowledge to be converted into anki cards
Your output fields are:
1. `reasoning` (str): 
2. `extracted_card` (list[AnkiCard]): all tokens referring to specific people extracted from the tokenized text
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## input_knowledge ## ]]
{input_knowledge}

[[ ## reasoning ## ]]
{reasoning}

[[ ## extracted_card ## ]]
{extracted_card}        # note: the value you produce must adhere to the JSON schema: {"type": "array", "$defs": {"AnkiCard": {"type": "object", "properties": {"back": {"type": "string", "title": "Back"}, "field": {"type": "string", "enum": ["Engineering", "Algorithms", "Productivity", "AIEngineering"], "title": "Field"}, "front": {"type": "string", "title": "Front"}, "note": {"type": "string", "title": "Note"}, "source_address": {"type": "string", "title": "Sou