# Beatles Lyrics Scraper

This notebook scrapes Beatles song lyrics from Genius.com and saves them as a clean text file.

## 📋 Setup Instructions:
1. Get your Genius API token at: https://genius.com/api-clients
2. Click the 🔑 key icon in the left sidebar (Secrets)
3. Add a new secret:
   - **Name**: `GENIUS_TOKEN`
   - **Value**: Your Genius API token
   - **Enable**: "Notebook access"
4. Run all cells below

---

## 📦 Install Dependencies

In [1]:
!pip install lyricsgenius

Collecting lyricsgenius
  Downloading lyricsgenius-3.7.2-py3-none-any.whl.metadata (6.1 kB)
Downloading lyricsgenius-3.7.2-py3-none-any.whl (48 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48.4/48.4 kB 534.7 kB/s eta 0:00:00
Installing collected packages: lyricsgenius
Successfully installed lyricsgenius-3.7.2


## 📚 Import Libraries

In [3]:
import json
from lyricsgenius import Genius
import re
import os
from google.colab import files, userdata

## 🔧 Define Functions
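
The scraping function in this section calls `normalize_title` and `clean_title_display`, but neither is defined anywhere in this notebook, so every song would fail with a `NameError`. The block below is a minimal sketch of what such helpers might look like, not the original implementations. The rules it applies (dropping bracketed qualifiers, ignoring case and punctuation) are assumptions chosen to fit how the two names are used.

```python
import re

# Hypothetical helpers: the notebook calls these names but never defines them.
# Matches a parenthesized or bracketed qualifier such as "(Remastered 2009)" or "[Take 2]",
# together with any whitespace in front of it.
QUALIFIER_RE = re.compile(r'\s*[\(\[][^\)\]]*[\)\]]')

def normalize_title(title):
    """Reduce a title to a comparison key for duplicate detection (assumed behavior)."""
    base = QUALIFIER_RE.sub('', title)
    # Lowercase and keep only ASCII letters and digits, so spacing and punctuation are ignored
    return re.sub(r'[^a-z0-9]', '', base.lower())

def clean_title_display(title):
    """Tidy a title for the <Title> header line (assumed behavior)."""
    base = QUALIFIER_RE.sub('', title)
    # Collapse runs of whitespace into single spaces
    return re.sub(r'\s+', ' ', base).strip()
```

Run this before the cell that defines `scrape_beatles_lyrics`. With these rules, two versions of a song that differ only by a trailing qualifier share one key, so only the first is kept.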

In [12]:
def get_credentials():
    """Get the Genius API token from Colab secrets, an environment variable, or a prompt"""
    # 1. Colab secrets (raises if the secret is missing or notebook access is off)
    try:
        token = userdata.get('GENIUS_TOKEN')
        if token:
            print("Token loaded from Colab secrets")
            return token
    except Exception:
        pass  # Fall through to the next source

    # 2. Environment variable
    token = os.getenv('GENIUS_TOKEN')
    if token:
        print("Token loaded from environment variable")
        return token

    # 3. Interactive prompt as a last resort
    print("No token found in secrets or environment variables")
    return input("Enter your Genius API token: ")

def scrape_beatles_lyrics(max_songs=500):
    """
    Scrape Beatles lyrics and return as simple text

    Args:
        max_songs: Maximum number of songs to scrape

    Returns:
        String containing all lyrics with titles
    """
    # Setup Genius client
    token = get_credentials()
    genius = Genius(token, timeout=500)
    genius.remove_section_headers = False  # Leave [Verse]/[Chorus] headers in the raw lyrics; they are stripped line by line below
    genius.verbose = False

    # Get Beatles songs
    print(f"Searching for The Beatles songs (max: {max_songs})...")
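    artist = genius.search_artist('The Beatles', max_songs=max_songs)
    if artist is None:  # search_artist returns None when nothing matches
        raise RuntimeError("Artist 'The Beatles' was not found on Genius")
    songs = artist.songs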

    print(f"Found {len(songs)} songs. Processing...")

    # Process all lyrics with deduplication
    all_lyrics = ""
    seen_titles = set()

    for i, song in enumerate(songs, 1):
        try:
            # Skip items with 'Film script' etc in title
            if 'film script' in song.title.lower():
                continue

            # Check for duplicates
            normalized = normalize_title(song.title)
            if normalized in seen_titles:
                continue
            seen_titles.add(normalized)

            # Clean title for display and add in angle brackets
            display_title = clean_title_display(song.title)
            song_title = f"<{display_title}>"

            # Get and clean lyrics
            raw_lyrics = song.lyrics if song.lyrics else ""

            # Skip if lyrics suggest it's not a song
            lyrics_lower = raw_lyrics.lower()
            # 'discography' also covers 'compilation discography'
            if any(term in lyrics_lower for term in ['discography', 'tracklist']):
                continue

            # Basic cleaning
            lines = raw_lyrics.split('\n')
            clean_lines = []

            for line in lines:
                line = line.strip()
                # Skip empty lines, contributors, embed numbers, "You might also like", and dialog
                if (line and
                    'Contributors' not in line and
                    'You might also like' not in line and
                    not re.match(r'^\d+Embed$', line) and
                    not re.match(r'^[A-Z][A-Za-z\s]*:\s', line)):  # Skip dialog like "NARRATOR: " or "Ringo: "

                    # Remove all section headers like '[Verse 1]', '[Chorus]', '[Bridge]', etc.
                    if line.startswith('[') and line.endswith(']'):
                        continue  # Skip all section headers
                    else:
                        # Remove parentheses content like '(sha-da-da-da)' but keep text before
                        line = re.sub(r'\s*\([^)]*\)', '', line).strip()

                    if line:  # Only add if there's still content after cleaning
                        clean_lines.append(line)

            cleaned_lyrics = '\n'.join(clean_lines)

            # Add to collection: Title in angle brackets on one line, lyrics on following lines
            if cleaned_lyrics.strip():  # Only add if there are actual lyrics
                song_section = f"{song_title}\n{cleaned_lyrics}\n\n"
                all_lyrics += song_section

            if len(seen_titles) % 10 == 0:  # Progress update every 10 songs
                print(f"Processed {len(seen_titles)} unique songs...")

        except Exception as e:
            print(f"Error processing {song.title}: {e}")
            continue

    print(f"Final count: {len(seen_titles)} unique songs")
    return all_lyrics

def save_and_download_lyrics(lyrics_text, filename="beatles_lyrics.txt"):
    """Save lyrics to file and trigger download in Colab"""
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(lyrics_text)

    num_lines = len(lyrics_text.split('\n'))
    print(f"\nSaved {num_lines} lines to {filename}")

    files.download(filename)

    return filename
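
The line-level cleaning inside `scrape_beatles_lyrics` is easier to check in isolation. The sketch below pulls the same filters into a standalone function, using the same substrings and regular expressions as the loop above, and runs them on made-up sample lines. The function name `clean_line` and the sample data are illustrative only.

```python
import re

EMBED_RE = re.compile(r'^\d+Embed$')              # page residue such as "45Embed"
SPEAKER_RE = re.compile(r'^[A-Z][A-Za-z\s]*:\s')  # dialog such as "NARRATOR: ..." or "Ringo: ..."
PAREN_RE = re.compile(r'\s*\([^)]*\)')            # asides such as "(sha-da-da-da)"

def clean_line(line):
    """Return the cleaned line, or None if the line should be dropped."""
    line = line.strip()
    if not line:
        return None
    if 'Contributors' in line or 'You might also like' in line:
        return None
    if EMBED_RE.match(line) or SPEAKER_RE.match(line):
        return None
    if line.startswith('[') and line.endswith(']'):
        return None  # section header like [Verse 1]
    line = PAREN_RE.sub('', line).strip()
    return line or None  # a line that was only a parenthetical becomes empty

# Made-up sample lines, not real lyrics
sample = [
    "3 Contributors",
    "[Chorus]",
    "Singing along (sha-da-da-da)",
    "Ringo: hello there",
    "You might also like",
    "45Embed",
    "   ",
    "(spoken intro)",
    "A plain lyric line",
]
cleaned = [c for c in map(clean_line, sample) if c is not None]
print(cleaned)  # ['Singing along', 'A plain lyric line']
```

One side effect to be aware of: the dialog pattern drops any line that begins with capitalized words followed by a colon and a space, so a genuine lyric of that shape is removed as well.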

## 🚀 Main Scraper Function

In [7]:
def run_scraper(max_songs=200):
    """
    Main function for Google Colab

    Args:
        max_songs: Maximum number of songs to scrape
    """

    try:
        print("🎸 Beatles Lyrics Scraper")
        print("=" * 50)

        print("\n=== SCRAPING LYRICS ===")
        lyrics = scrape_beatles_lyrics(max_songs)

        print("\n=== SAVING FILE ===")
        save_and_download_lyrics(lyrics)

        print(f"\nCompleted! Total characters: {len(lyrics):,}")
        # Each saved song section ends with exactly one blank line, so count those
        song_count = lyrics.count('\n\n')
        print(f"📄 Total songs processed: {song_count}")

        return lyrics

    except Exception as e:
        print(f"❌ Error: {e}")
        return None

print("Main function ready!")

Main function ready!
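
The text built by `scrape_beatles_lyrics` puts each title on its own line as `<Title>`, followed by the lyrics and one blank line. A minimal sketch of splitting that text back into individual songs, for example to inspect or count them, could look like this. The name `split_songs` and the sample text are hypothetical.

```python
def split_songs(lyrics_text):
    """Split scraper output into a list of (title, lyrics) tuples."""
    songs = []
    # Sections are separated by a blank line; cleaned lyrics contain no blank lines
    for section in lyrics_text.split('\n\n'):
        header, _, body = section.strip().partition('\n')
        # A valid section starts with a "<Title>" line
        if header.startswith('<') and header.endswith('>'):
            songs.append((header[1:-1], body))
    return songs

# Made-up sample in the same layout as the scraper output
sample = "<Song One>\nline a\nline b\n\n<Song Two>\nline c\n\n"
songs = split_songs(sample)
print(len(songs))   # 2
print(songs[0][0])  # Song One
```

A list of tuples is used instead of a dict so that two songs that end up with the same display title are both preserved.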


## 🎯 Run the Scraper

**Adjust `max_songs` as needed:**
- `50` = Quick test
- `200` = Good selection
- `500+` = Comprehensive collection (will take longer)

In [8]:
lyrics = run_scraper(max_songs=5)

if lyrics:
    print("\n📝 Preview of scraped lyrics:")
    print("-" * 30)
    print(lyrics[:500] + "...")
else:
    print("❌ Scraping failed. Check the error above (token, network access) and try again.")

🎸 Beatles Lyrics Scraper

=== SCRAPING LYRICS ===
Token loaded from Colab secrets
Searching for The Beatles songs (max: 5)...
❌ Error: Unexpected response status code: 403. Expected 200 or 204. Response body: <!DOCTYPE html><html lang="en-US"><head><title>Just a moment...</title> [remaining HTML truncated]
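
The 403 response above carries an HTML page titled "Just a moment..." rather than an API error. That suggests the request was stopped by a browser-check interstitial before it reached Genius, so the token itself may be fine. This is an interpretation of the output, not something the log states. A small, hypothetical helper for telling this case apart from an ordinary failure might look like the following sketch.

```python
def looks_like_challenge_page(error_message):
    """Heuristic: True if a 403 error body looks like a browser-check page.

    The check relies only on the two markers visible in the error output above.
    """
    text = str(error_message)
    return '403' in text and '<title>Just a moment...</title>' in text

# Shortened copy of the error message shown in the cell output
msg = ('Unexpected response status code: 403. Expected 200 or 204. '
       'Response body: <!DOCTYPE html><html lang="en-US"><head>'
       '<title>Just a moment...</title>')

if looks_like_challenge_page(msg):
    print("Request appears to have been blocked before reaching the API.")
```

If the helper returns `True`, changing the token is unlikely to help. Running the notebook from a different network or machine is one thing to try.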