# Naruhodo Podcast Graph Analyzer

**Naruhodo** is a Brazilian podcast dedicated to answering listeners’ questions about science, common sense, and curiosities. Every episode is packed with science-based content and is enriched with a diverse set of references—ranging from scientific papers and articles to books and online resources. Many episodes share overlapping themes and often reference the same sources, which makes the dataset ideal for creating an interconnected graph.

This project focuses on scraping the available Naruhodo podcast data and importing it into Neo4j. The primary objective here is to efficiently collect and structure the data into a graph database, establishing a robust foundation. Future projects will build upon this groundwork to reveal connections between episodes, identify clusters of related themes, and explore how references bridge multiple subjects.

## Table of Contents

- [Introduction](#introduction)
- [Project Structure](#project-structure)
- [Environment and Dependencies](#Environment-and-dependencies)
- [Code Breakdown](#Code-breakdown)
  - [1. Data Scraping Module](#data-scraping-module)
  - [2. Data Collection and CSV Generation](#data-collection-and-csv-generation)
  - [3. CSV Normalization](#csv-normalization)
  - [4. Neo4j Data Import](#neo4j-data-import)
- [Analytical Possibilities in Neo4j](#analytical-possibilities-in-neo4j)
- [Conclusion](#conclusion)


<a name="introduction"></a>
## Introduction

*Naruhodo* is not only a podcast—it’s a curated collection of scientific exploration where episodes often intersect through shared references. **The primary goal of this notebook is to scrape the available Naruhodo podcast data and import it into Neo4j, creating a robust graph database foundation.** Further projects utilizing this dataset will be developed in separate notebooks.

This foundation opens up several follow-on projects, especially ones that combine the graph with LLMs and machine learning. Five candidates that become possible once the data is in Neo4j:

1. **Retrieval-Augmented Generation (RAG) for Podcast Summaries:**  
   Combine large language models (LLMs) with data retrieval from Neo4j to generate insightful episode summaries or answer user queries by referencing related content.

2. **RAG-Graph for Thematic Exploration:**  
   Integrate RAG techniques with graph-based search methods to offer context-aware, detailed insights into episodes. This approach can help users navigate complex scientific topics by linking episodes and references seamlessly.

3. **Episode Clusterization and Recommendation Systems:**  
   Apply clustering algorithms on the graph data to identify groups of episodes that share common themes or references. This can power personalized recommendation systems, suggesting episodes similar to those users already enjoy.

4. **Pathway Discovery for Thematic Learning:**  
   Leverage graph analytics to map out learning pathways. For example, if a user is interested in the theme of behavior, the system can highlight a sequential pathway through episodes and references that deepen their understanding of the topic.

5. **Interdisciplinary Knowledge Mapping:**  
   Analyze the intersections of various scientific disciplines across episodes by examining shared references. This can uncover hidden relationships and provide insights into how different fields influence each other.

The following sections explain how the data is scraped, normalized, and imported into Neo4j, setting the stage for these advanced analyses and applications in future projects.


For more details about the podcast and its themes, you can check out [Naruhodo on B9](https://www.b9.com.br/shows/naruhodo/).

<a name="project-structure"></a>
## Project Structure

The repository is organized into the following modules:

- **Environment Configuration:**  
  Keeps connection details (such as Neo4j credentials and file paths) in a `.env` file that is loaded with `python-dotenv`, so credentials stay out of the source code.

- **Data Scraping Module:**  
  Contains functions that send HTTP requests, parse HTML content, and extract references from individual podcast posts. This module forms the foundation for gathering raw data from the Naruhodo website.

- **Data Collection and CSV Generation:**  
  Iterates over multiple search result pages to collect all podcast post URLs and then scrapes each post for its references. The collected data is saved as a ragged CSV file, where each row contains the episode URL followed by a variable number of reference strings.

- **CSV Normalization:**  
  Converts the ragged CSV into a normalized CSV format. In the normalized file, each row represents a single relationship between an episode and one reference, making the data ideal for graph import and subsequent analysis.

- **Neo4j Data Import:**  
  Loads the normalized CSV file and builds the graph in Neo4j by creating nodes for episodes and references, and establishing `:REFERENCES` relationships between them. This module lays the groundwork for future graph-based analyses and applications.


<a name="Environment-and-dependencies"></a>
## Environment and Dependencies

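- **Python 3.9+** (the code uses built-in generic annotations such as `dict[str, str]`)
- **Dependencies:**
  - `requests`
  - `beautifulsoup4`
  - `neo4j` version 5 or later (the official Neo4j Python driver; the older `neo4j-driver` package name is deprecated, and the import code calls `session.execute_write`)
  - `python-dotenv`
  - `pandas` (optional for CSV processing)
  - `csv` (Python’s built-in module, no installation needed)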

All sensitive configuration values—such as the Neo4j URI, username, and password, as well as the output CSV path—are stored in a single `.env` file that is excluded from version control.
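
As a minimal sketch of how those values might be read: the variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` match the import section, while `OUTPUT_CSV`, `Neo4jConfig`, and `read_config` are hypothetical names chosen for this example. In the real pipeline `load_dotenv()` would populate `os.environ` first; here a plain mapping is passed in so the example runs without a `.env` file.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Neo4jConfig:
    """Connection settings plus the CSV path (illustrative container)."""
    uri: str
    user: str
    password: str
    output_csv: str


def read_config(env: Mapping[str, str] = os.environ) -> Neo4jConfig:
    """Build a config from environment-style key/value pairs.

    The password is required; the other keys fall back to local defaults.
    OUTPUT_CSV is a hypothetical variable name chosen for this sketch.
    """
    if not env.get("NEO4J_PASSWORD"):
        raise KeyError("NEO4J_PASSWORD is not set; add it to the .env file")
    return Neo4jConfig(
        uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        user=env.get("NEO4J_USER", "neo4j"),
        password=env["NEO4J_PASSWORD"],
        output_csv=env.get("OUTPUT_CSV", "references.csv"),
    )


# Example with an in-memory mapping standing in for a loaded .env file.
config = read_config({"NEO4J_PASSWORD": "example-only"})
print(config.uri, config.output_csv)
```

Failing early on a missing password avoids a confusing authentication error later, when the driver first connects.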

<a name="Code-breakdown"></a>
## Code Breakdown

<a name="data-scraping-module"></a>
### 1. Data Scraping Module
**`get_soup(url: str) -> BeautifulSoup`**  
  **Purpose:**  
  - Sends a GET request to the given URL using custom headers.
  - Handles HTTP errors and sets the proper encoding.
  - Returns a BeautifulSoup object for HTML parsing.


**`extract_references(post_url: str) -> List[str]`**  
  **Purpose:**  
  - Fetches the HTML content of a podcast post.
  - Locates the “REFERÊNCIAS” paragraph and collects the text of each following sibling element until the `========` delimiter is encountered.
  - Returns a list of reference strings (or an empty list if no references are found).

In [None]:
# Importing libraries
import requests
from bs4 import BeautifulSoup
from typing import List

# Base URL of the website to scrape.
BASE_URL: str = 'https://www.b9.com.br'

# Custom headers to mimic a real browser request.
HEADERS: dict[str, str] = {
    'User-Agent': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/90.0.4430.93 Safari/537.36'
    )
}


def get_soup(url: str) -> BeautifulSoup:
    """
    Fetch the content from the given URL and return a BeautifulSoup object
    for parsing the HTML.

    Args:
        url (str): The URL of the webpage to fetch.

    Returns:
        BeautifulSoup: A BeautifulSoup object containing the parsed HTML.

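    Raises:
        requests.HTTPError: If the server responds with a 4xx or 5xx status code.
    """
    # Send a GET request with custom headers. The 30-second timeout is an
    # arbitrary choice that keeps a stalled connection from hanging the scrape.
    response = requests.get(url, headers=HEADERS, timeout=30)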
    # Raise an error for bad responses (e.g., 404, 500).
    response.raise_for_status()
    # Set the encoding to UTF-8 to properly interpret the response.
    response.encoding = 'utf-8'
    # Parse and return the HTML content using the built-in parser.
    return BeautifulSoup(response.text, 'html.parser')


def extract_references(post_url: str) -> List[str]:
    """
    Extract a list of reference strings from a post page.

    This function looks for a paragraph element containing the text
    'REFERÊNCIAS'. It then collects the text from all subsequent sibling
    elements until it encounters a sibling with the text '========', which is
    used as a delimiter to mark the end of the references section.

    Args:
        post_url (str): The URL of the post containing references.

    Returns:
        List[str]: A list of reference strings. If no references section is found,
                   an empty list is returned.
    """
    # Retrieve and parse the HTML of the post page.
    soup = get_soup(post_url)
    
    # Locate the paragraph element that contains 'REFERÊNCIAS'.
    references_section = soup.find('p', string=lambda x: x and 'REFERÊNCIAS' in x)
    if not references_section:
        return []
    
    references: List[str] = []
    # Iterate over all sibling elements that follow the references section.
    for sibling in references_section.find_next_siblings():
        text = sibling.get_text(strip=True)
        # Stop collecting references when encountering the delimiter.
        if text == '========':
            break
        # Skip empty elements so they do not become blank references.
        if text:
            references.append(text)
    
    return references


In [None]:
# Example usage:
if __name__ == '__main__':
    # Replace 'your_post_url' with the actual URL you want to scrape.
    your_post_url = 'https://www.b9.com.br/shows/naruhodo/naruhodo-418-o-que-e-a-birra/?highlight=naruhodo'
    refs = extract_references(your_post_url)
    for ref in refs:
        print(ref)
        

<a name="data-collection-and-csv-generation"></a>
### 2. Data Collection and CSV Generation
**`get_podcast_posts(page_number: int) -> List[str]`**  
  **Purpose:**  
  - Constructs the search URL using the page number.
  - Scrapes the page to extract all podcast post URLs by selecting elements with the CSS class `c-post-card__link`.

**`scrape_references() -> List[List[str]]`**   
  **Purpose:**  
  - Iterates through search result pages starting from page 1 until no more post URLs are found.
  - For each post URL, calls `extract_references` to collect the references.
  - Aggregates the data so that each row consists of the post URL followed by its corresponding references.

**`save_to_csv(data: List[List[str]], filename: str = 'references.csv') -> None`**   
  **Purpose:**  
  - Writes the aggregated (ragged) data to a CSV file using UTF-8 encoding.
  - Each row in the CSV starts with the post URL and is followed by the extracted references.


In [None]:
import csv
import time
from typing import List

from dotenv import load_dotenv

# Ensure that get_soup and extract_references are available.
# For example:
# from your_module import get_soup, extract_references

SEARCH_URL: str = 'https://www.b9.com.br/?s=naruhodo&pagina={}'

def get_podcast_posts(page_number: int) -> List[str]:
    """
    Retrieve podcast post URLs from a search page.

    This function formats the search URL with the provided page number,
    fetches the page content using get_soup, and extracts all post links
    from anchor elements with the CSS class 'c-post-card__link'.

    Args:
        page_number (int): The page number to scrape.

    Returns:
        List[str]: A list of URLs for the podcast posts found on the page.
    """
    # Format the URL with the given page number and retrieve its parsed content.
    soup = get_soup(SEARCH_URL.format(page_number))
    # Extract the href attribute from each anchor tag matching the selector.
    return [a['href'] for a in soup.select('a.c-post-card__link')]

def scrape_references() -> List[List[str]]:
    """
    Scrape references from podcast posts across all available search pages.

    Iterates through pages starting at page 1 until no podcast post links are
    found on a page (indicating there are no more pages available). For each page,
    it retrieves all podcast post links, scrapes each post for references, and
    aggregates the results into a list. Each element in the returned list contains
    the post URL as the first element, followed by its extracted references.

    Returns:
        List[List[str]]: A list of lists, where each inner list contains a post URL
                         and its corresponding references.
    """
    all_references: List[List[str]] = []
    page: int = 1  # Start at the first page

    while True:
        print(f"Scraping page {page}...")
        post_links = get_podcast_posts(page)
        
        # If no post links are found, assume that there are no more pages.
        if not post_links:
            print("No more posts found on this page. Ending loop.")
            break

        for post_link in post_links:
            print(f"Scraping post {post_link}...")
            references = extract_references(post_link)
            # Prepend the post URL to the list of references.
            all_references.append([post_link] + references)
            # Pause for 1 second to be respectful to the server.
            time.sleep(1)

        page += 1  # Move to the next page

    return all_references

def save_to_csv(data: List[List[str]], filename: str = 'references.csv') -> None:
    """
    Save the scraped data to a CSV file.

    Writes each row of data to a CSV file using UTF-8 encoding. Each row in the data
    should be a list of strings, where the first element is the post URL followed by its references.

    Args:
        data (List[List[str]]): The data to write to the CSV file.
        filename (str): The name of the CSV file to create or overwrite.
    """
    with open(filename, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        for row in data:
            writer.writerow(row)

if __name__ == "__main__":
    # Load environment variables from the .env file (if needed)
    load_dotenv()
    
    # Scrape references from the website and save them to a CSV file.
    references = scrape_references()
    save_to_csv(references)
    print("Data has been saved to references.csv")
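
The stop condition in `scrape_references` (an empty page ends the loop) can be exercised without touching the network. A minimal sketch, assuming two stand-in callables, `fake_posts` and `fake_refs`, that play the roles of `get_podcast_posts` and `extract_references`; `collect_rows` and its `max_pages` safety cap are hypothetical additions, not part of the original code:

```python
from typing import Callable, Dict, List


def collect_rows(
    fetch_posts: Callable[[int], List[str]],
    fetch_refs: Callable[[str], List[str]],
    max_pages: int = 1000,
) -> List[List[str]]:
    """Walk pages 1, 2, ... until a page has no posts or max_pages is hit."""
    rows: List[List[str]] = []
    for page in range(1, max_pages + 1):
        post_links = fetch_posts(page)
        if not post_links:
            break  # An empty page is treated as the end of the results.
        for link in post_links:
            # Same ragged layout as the CSV: URL first, then its references.
            rows.append([link] + fetch_refs(link))
    return rows


# Stand-in data: two pages of results, then nothing.
PAGES: Dict[int, List[str]] = {1: ["/ep-1", "/ep-2"], 2: ["/ep-3"]}
REFS: Dict[str, List[str]] = {"/ep-1": ["Ref A", "Ref B"], "/ep-3": ["Ref A"]}


def fake_posts(page: int) -> List[str]:
    """Return the stand-in post links for a page (empty past the last page)."""
    return PAGES.get(page, [])


def fake_refs(link: str) -> List[str]:
    """Return the stand-in references for a post (empty if it has none)."""
    return REFS.get(link, [])


for row in collect_rows(fake_posts, fake_refs):
    print(row)
```

The cap is a guard for the case where a site keeps returning results for any page number, which would otherwise make the loop run indefinitely.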


<a name="csv-normalization"></a>
### 3. CSV Normalization
**`normalize_references(input_file: str, output_file: str) -> None`**  
  **Purpose:**  
  - Reads the ragged CSV (where each row has an episode followed by a variable number of references).
  - Converts the data into a normalized CSV format with two columns: "Episode" and "Reference".
  - Each row in the normalized CSV represents one episode–reference relationship.

In [None]:
import csv

def normalize_references(input_file: str, output_file: str) -> None:
    """
    Reads a ragged CSV file where the first element of each row is the episode and
    the remaining elements are references. It writes a normalized CSV with two columns:
    'Episode' and 'Reference', with each row representing one reference relationship.
    
    Args:
        input_file (str): Path to the original, ragged CSV file.
        output_file (str): Path where the normalized CSV will be saved.
    """
    with open(input_file, newline='', encoding='utf-8') as f_in, \
         open(output_file, mode='w', newline='', encoding='utf-8') as f_out:
        
        reader = csv.reader(f_in, delimiter=',')
        writer = csv.writer(f_out)
        
        # Write header row
        writer.writerow(["Episode", "Reference"])
        
        for row in reader:
            if not row:
                continue  # Skip empty rows
            episode = row[0]
            # Each additional cell is considered a reference.
            for reference in row[1:]:
                # You may want to add additional cleaning or filtering here.
                writer.writerow([episode, reference])

# Example usage:
if __name__ == '__main__':
    input_csv = 'references.csv'
    output_csv = 'normalized_references.csv'
    normalize_references(input_csv, output_csv)
    print(f"Normalized data has been saved to {output_csv}")
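
To make the ragged-to-normalized step concrete, here is a small in-memory sketch of the same transformation. The sample rows and the helper name `normalize_rows` are invented for illustration; only the two-column `Episode`/`Reference` layout comes from the function above.

```python
import csv
import io
from typing import Iterable, List


def normalize_rows(ragged: Iterable[List[str]]) -> List[List[str]]:
    """Expand [episode, ref1, ref2, ...] rows into [episode, ref] pairs."""
    pairs: List[List[str]] = []
    for row in ragged:
        if not row:
            continue  # Skip empty rows, as normalize_references does.
        episode, references = row[0], row[1:]
        pairs.extend([episode, ref] for ref in references)
    return pairs


# Invented sample data in the ragged layout; note the quoted comma.
ragged_csv = (
    'https://example.org/ep-1,"Smith, J. (2020). A paper.",A book\n'
    "https://example.org/ep-2\n"
    "https://example.org/ep-3,A book\n"
)

pairs = normalize_rows(csv.reader(io.StringIO(ragged_csv)))

# Write the pairs with the same header that normalize_references uses.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["Episode", "Reference"])
writer.writerows(pairs)
print(out.getvalue())
```

Note that `ep-2`, which has no references, produces no rows at all, so episodes without references do not appear in the normalized file or, later, in the graph.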


<a name="neo4j-data-import"></a>
### 4. Neo4j Data Import
**`load_data(filename: str = "references.csv") -> List[List[str]]`**  
  **Purpose:**  
  - Loads a CSV file and returns the data as a list of rows, where each row is a list of strings. The import expects the normalized file (`normalized_references.csv`), not the ragged `references.csv`.

**`create_graph(tx: Transaction, data: List[List[str]]) -> None`**  
  **Purpose:**  
  - Iterates over each row from the CSV.
  - For each row, creates (or merges) an Episode node (keyed by the episode URL) and a Reference node (keyed by the reference text, which the code stores in the node's `url` property).
  - Establishes a `:REFERENCES` relationship between the Episode and Reference nodes via Cypher queries.

**`main() -> None`**  
  **Purpose:**  
  - Orchestrates the Neo4j data import process by loading the CSV data, opening a session, executing the transaction to create the graph, and closing the driver.

In [None]:
import csv
import os
from neo4j import GraphDatabase, Transaction
from dotenv import load_dotenv
from typing import List

# Load environment variables from the .env file
load_dotenv()

# Retrieve Neo4j connection details from environment variables.
# The password has no fallback: a missing NEO4J_PASSWORD raises KeyError
# instead of silently using a hard-coded credential.
NEO4J_URI: str = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USER: str = os.environ.get("NEO4J_USER", "neo4j")
NEO4J_PASSWORD: str = os.environ["NEO4J_PASSWORD"]

# Create the Neo4j driver instance
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))


def load_data(filename: str = "references.csv") -> List[List[str]]:
    """
    Load data from a CSV file.

    Each row in the CSV is expected to have two columns:
      - The first column contains the episode URL.
      - The second column contains the reference text.
    
    (If your CSV contains more than two columns, only the first two will be used.)
    
    Args:
        filename (str): Path to the CSV file.
    
    Returns:
        List[List[str]]: A list of rows, where each row is a list of strings.
    """
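    with open(filename, mode="r", newline="", encoding="utf-8") as file:
        reader = csv.reader(file)
        data = list(reader)
    # Drop the header row written by normalize_references so that it is not
    # imported as an Episode/Reference pair.
    if data and data[0] == ["Episode", "Reference"]:
        data = data[1:]
    return data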


def create_graph(tx: Transaction, data: List[List[str]]) -> None:
    """
    Create or merge nodes and relationships in the Neo4j graph from CSV data.

    For each row in the CSV, the first element is considered the episode URL,
    and the second element is the reference text (stored in the Reference
    node's `url` property). The function creates (or merges) an Episode node
    and a Reference node, then creates a relationship between them.
    
    Args:
        tx (Transaction): The active Neo4j transaction.
        data (List[List[str]]): The CSV data as a list of rows.
    """
    for row in data:
        # Skip rows that are empty or do not have at least two columns.
        if len(row) < 2:
            continue

        episode: str = row[0].strip()
        reference: str = row[1].strip()

        if not episode or not reference:
            continue  # Skip if either field is empty

        # Merge both nodes and the relationship in a single round trip.
        tx.run(
            """
            MERGE (e:Episode {url: $episode})
            MERGE (r:Reference {url: $reference})
            MERGE (e)-[:REFERENCES]->(r)
            """,
            episode=episode,
            reference=reference,
        )


def main() -> None:
    """
    Main function to load CSV data and import it into Neo4j.
    """
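    # Load the normalized CSV (one episode/reference pair per row).
    data = load_data("normalized_references.csv")
    try:
        with driver.session() as session:
            session.execute_write(create_graph, data)
        print("Data has been imported into Neo4j")
    finally:
        # Close the driver even if the import fails.
        driver.close()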


if __name__ == "__main__":
    main()
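
For larger files, issuing queries row by row can be slow. A common alternative is to send rows in batches and let Cypher's `UNWIND` iterate over them. The sketch below is an assumption about how that could look here, not the project's method: `BATCH_QUERY`, `to_params`, `chunks`, and `import_in_batches` are hypothetical names, and `tx` is anything exposing a `run(query, **params)` method, such as the transaction object passed in by `session.execute_write`. A recording stand-in is used so the example runs without a database.

```python
from typing import Any, Dict, Iterator, List

# Same MERGE logic as create_graph, applied to every element of $rows.
BATCH_QUERY = """
UNWIND $rows AS row
MERGE (e:Episode {url: row.episode})
MERGE (r:Reference {url: row.reference})
MERGE (e)-[:REFERENCES]->(r)
"""


def to_params(data: List[List[str]]) -> List[Dict[str, str]]:
    """Turn CSV rows into dictionaries, dropping short or blank rows."""
    params: List[Dict[str, str]] = []
    for row in data:
        if len(row) < 2:
            continue
        episode, reference = row[0].strip(), row[1].strip()
        if episode and reference:
            params.append({"episode": episode, "reference": reference})
    return params


def chunks(items: List[Dict[str, str]], size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def import_in_batches(tx: Any, data: List[List[str]], batch_size: int = 500) -> int:
    """Run BATCH_QUERY once per batch and return the number of batches."""
    count = 0
    for batch in chunks(to_params(data), batch_size):
        tx.run(BATCH_QUERY, rows=batch)
        count += 1
    return count


class RecordingTx:
    """Stand-in for a Neo4j transaction that only records calls."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def run(self, query: str, **params: Any) -> None:
        self.calls.append({"query": query, **params})


rows = [["ep-1", "Ref A"], ["ep-1", "Ref B"], ["ep-2", " "], ["ep-3", "Ref A"]]
tx = RecordingTx()
print(import_in_batches(tx, rows, batch_size=2), "batches sent")
```

With a real driver, the same function could be passed to `session.execute_write(import_in_batches, data)` in place of `create_graph`. The batch size of 500 is an arbitrary starting point rather than a tuned value.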


<a name="analytical-possibilities-in-neo4j"></a>
## Analytical Possibilities in Neo4j
Once your data is imported into Neo4j, there are numerous analyses you can perform, including:

- **Cluster Analysis:**
Identify clusters or communities of episodes that share many common references, which might indicate similar themes or topics.

- **Centrality Measures:**
Calculate metrics like degree centrality to identify which episodes or references are the most influential or central in the network.

- **Path Analysis:**
Trace paths between episodes to understand how scientific ideas or themes evolve and interconnect across different episodes.

- **Thematic Mapping:**
Explore how different subjects or areas of science intersect by analyzing shared references among episodes.

- **Content Recommendations:**
Build recommendation systems that suggest related episodes based on shared references or thematic similarities.
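
Before writing any Cypher, some of these ideas can be prototyped directly on the normalized episode/reference pairs. The following sketch uses invented sample pairs and hypothetical helper names (`reference_degree`, `shared_reference_counts`). It counts how many episodes cite each reference (a degree measure) and how many references each pair of episodes has in common (a simple basis for clustering or recommendations).

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (episode, reference)


def reference_degree(pairs: Iterable[Pair]) -> Counter:
    """Count the distinct episodes that cite each reference."""
    episodes_by_ref: Dict[str, Set[str]] = defaultdict(set)
    for episode, reference in pairs:
        episodes_by_ref[reference].add(episode)
    return Counter({ref: len(eps) for ref, eps in episodes_by_ref.items()})


def shared_reference_counts(pairs: Iterable[Pair]) -> Counter:
    """Count shared references for every pair of episodes."""
    episodes_by_ref: Dict[str, Set[str]] = defaultdict(set)
    for episode, reference in pairs:
        episodes_by_ref[reference].add(episode)
    shared: Counter = Counter()
    for episodes in episodes_by_ref.values():
        # Every two episodes citing the same reference share one link.
        for a, b in combinations(sorted(episodes), 2):
            shared[(a, b)] += 1
    return shared


# Invented sample data in the normalized layout.
sample = [
    ("ep-1", "Ref A"), ("ep-1", "Ref B"),
    ("ep-2", "Ref A"), ("ep-2", "Ref B"),
    ("ep-3", "Ref A"),
]

print(reference_degree(sample).most_common(1))
print(shared_reference_counts(sample).most_common(1))
```

In this sample, `Ref A` is the most cited reference and `ep-1`/`ep-2` are the most similar pair of episodes. On the full dataset the same counts could feed a recommendation list or be compared against results computed inside Neo4j.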


<a name="conclusion"></a>
## Conclusion
This project implements a complete pipeline for extracting, normalizing, and importing Naruhodo episode data into a Neo4j graph database. Configuration is read from the environment, the scraped references are normalized into one episode–reference pair per row, and the import uses `MERGE` so that repeated runs do not duplicate nodes or relationships. The resulting graph is the starting point for uncovering connections between episodes and references in the follow-up projects.

Feel free to explore the code and extend its functionality. Happy coding and graph exploring!