<a href="https://colab.research.google.com/github/amien1410/colab-notebooks/blob/main/GPT_4o_article_outline_generator_using_firecrawl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Colab notebook helps SEO professionals automate the generation of article outlines optimized for specific keywords. It uses Google's Custom Search and Cloud Natural Language APIs, Firecrawl, OpenAI's GPT-4o, and Weights & Biases to find top-ranking web pages, scrape them, extract key information, and generate an optimized article outline based on key aspects of the ranking sites.

In this notebook, you will:

* Input your target keywords and article type.
* Use Google Custom Search API to find top-ranking pages for your keywords.
* Scrape and analyze the content of these pages.
* Extract key entities using Google's NLP API and key questions using GPT-4o.
* Generate an optimized article outline using GPT-4o.

You'll need the following.

**Accounts and API keys for:**

* Custom Search and Cloud Natural Language APIs: A step-by-step guide can be found [here](https://wandb.ai/onlineinference/article_outlines_entities_questions/reports/Generating-content-outlines-with-prompt-engineering-entities-and-GPT-4o--Vmlldzo4Mjc1MDc1?utm_source=pubcon&utm_medium=colab&utm_campaign=daves_pubcon_demo#google-api-key). (very low cost)
* OpenAI: [Sign up](https://platform.openai.com/docs/quickstart) and obtain your API key from the OpenAI dashboard. (very low cost with a free trial)
* Firecrawl: [Sign up](https://www.firecrawl.dev/) and obtain your API key. (free option)
* Weights & Biases: [Sign up](https://wandb.ai/site/?utm_source=pubcon&utm_medium=colab&utm_campaign=daves_pubcon_demo) and obtain your API key. (free option)

# Step 1: Define Your Target Keywords and Article Type

In this step, you will define the primary and secondary keywords you are aiming to rank for, as well as the type of article you want to create.

In [None]:
# Step 1: Define Your Target Keywords and Article Type

# Define the primary term that you're trying to rank for.
query = input("What do you want to rank for: ")

# Define any secondary terms you're trying to rank for.
query_secondary = input("Are there other terms you're trying to rank for (comma separated): ")

# Define the type of article outline you want to create.
article_type = input("What type of article is it (e.g., deep dive, quickstart, tutorial, etc.): ")
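
The secondary terms are collected as one comma-separated string. If you later want them as a list (for example, to log or loop over them), a minimal sketch like the following could be used. The helper name `parse_terms` is hypothetical and is not part of the notebook.

```python
from typing import List


def parse_terms(raw: str) -> List[str]:
    """Split a comma-separated string into trimmed, unique terms, keeping order."""
    seen = set()
    terms = []
    for part in raw.split(","):
        term = part.strip()
        key = term.lower()  # compare case-insensitively to drop duplicates
        if term and key not in seen:
            seen.add(key)
            terms.append(term)
    return terms


# Example input with an empty entry and a case-insensitive duplicate
secondary_terms = parse_terms("seo outline, content brief, , SEO Outline")
print(secondary_terms)  # ['seo outline', 'content brief']
```

The notebook itself passes the raw string straight into the prompt, so this step is optional.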

# Step 2: Install and Import Necessary Libraries

In this step, we will install and import all the necessary libraries required for the notebook.

In [None]:
# Step 2: Install and Import Necessary Libraries

# Install required packages
!pip install --upgrade google-api-python-client google-cloud-language openai weave wandb firecrawl

# Import necessary libraries
import os
from getpass import getpass
from collections import defaultdict
from google.cloud import language_v1
from googleapiclient.discovery import build
from openai import OpenAI
import re
import time
from firecrawl import FirecrawlApp

# Import wandb and weave for logging and visualization
import wandb
import weave
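
If an import fails after installation, it can help to check which modules are actually visible to the runtime. The sketch below uses the standard library's `importlib.util.find_spec`. The helper name `missing_modules` is hypothetical, and the list holds top-level import names (not pip package names). Dotted names such as `google.cloud.language_v1` are left out because `find_spec` imports the parent package for them.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names used by this notebook (top-level only)
required = ["googleapiclient", "openai", "wandb", "weave", "firecrawl"]
missing = missing_modules(required)
if missing:
    print("Not importable yet:", ", ".join(missing))
else:
    print("All required modules were found.")
```

In Colab, restarting the runtime after the `pip install` step sometimes resolves modules that are reported as missing.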

# Step 3: Set Up API Keys and Credentials

In this step, you will input your API keys and set up credentials for the various services used in this notebook.

Instructions on getting keys:

**Google Cloud API Key and Credentials:**
* Sign up for a Google Cloud account if you haven't already.
* Enable the Google NLP API and the Custom Search API.
* Generate an API key and download the JSON credentials file.

**OpenAI API Key:**
* Sign up for an OpenAI account and obtain your API key from the OpenAI dashboard.

**Firecrawl API Key:**
* Sign up for Firecrawl at Firecrawl's website and obtain your API key.

**Weights & Biases:**
* Sign up for a wandb account at wandb.ai and obtain your API key.
* Once you've signed up you'll find it [here](https://wandb.ai/authorize).


In [None]:
# Step 3: Set Up API Keys and Credentials

# Google API Key
google_api = getpass("Enter your Google API Key: ")  # getpass keeps the key out of the cell output

# Google Application Credentials (JSON file path)
# In Colab, we need to upload the JSON file and set the environment variable
from google.colab import files

print("Upload your Google Application Credentials JSON file.")
uploaded = files.upload()
credentials_filename = list(uploaded.keys())[0]
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credentials_filename

# Google Custom Search Engine ID
google_search_id = input("Enter your Google Custom Search Engine ID (cx): ")


# Firecrawl API Key
firecrawl_api_key = getpass("Enter your Firecrawl API Key: ")
app = FirecrawlApp(api_key=firecrawl_api_key)

# Initialize Weights & Biases
wandb_api_key = getpass("Enter your wandb API Key: ")
wandb.login(key=wandb_api_key)

# OpenAI API Key
openai_api_key = getpass("Enter your OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key
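
If you rerun the notebook often, you may prefer to read keys from environment variables and only prompt when one is missing. The following is a minimal sketch under that assumption. The helper `get_secret` and the variable name `DEMO_API_KEY` are hypothetical and are not part of the notebook.

```python
import os
from getpass import getpass


def get_secret(env_name, prompt, ask=getpass):
    """Return the value of env_name if it is set, otherwise prompt without echoing."""
    value = os.environ.get(env_name, "").strip()
    if value:
        return value
    # Fall back to an interactive prompt; `ask` can be swapped out for testing
    return ask(prompt).strip()


# Demonstration with a made-up variable so that no prompt is shown
os.environ["DEMO_API_KEY"] = "from-environment"
print(get_secret("DEMO_API_KEY", "Enter your key: "))  # from-environment
```

Passing the prompt function as a parameter keeps the helper usable in non-interactive runs.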

# Step 4: Define the functions

In this step, we define several functions that will be used throughout the notebook:

* *google_search*: Performs a Google Custom Search.

* *fetch_content_with_firecrawl*: Fetches content from a URL using Firecrawl.

* *extract_headings_from_markdown*: Extracts headings from markdown text.

* *generate_summary*: Generates a summary of the text using OpenAI's GPT-4o.

* *extract_questions*: Extracts important questions from the text.

* *top_questions*: Selects the top questions from a list.

* *analyze_entities*: Analyzes entities in the text using Google Cloud NLP.
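
Two of these helpers are plain text parsing and can be tried without any API keys. The sketch below is a standalone variant, not the notebook's exact code. As assumptions of this sketch, it skips lines inside fenced code blocks (where `#` usually marks a comment rather than a heading), requires a space after the `#` characters, and only treats `-` or `*` at the start of a line as a bullet. The names `extract_headings` and `parse_bullets` are hypothetical.

```python
import re

FENCE = "`" * 3  # three backticks, built here to keep this example readable


def extract_headings(markdown_text):
    """Collect '# Title' style headings, ignoring lines inside fenced code blocks."""
    headings = []
    in_fence = False
    for line in markdown_text.split("\n"):
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence  # toggle on every opening or closing fence
            continue
        if in_fence:
            continue
        match = re.match(r"^#{1,6}\s+(.*)$", stripped)
        if match and match.group(1).strip():
            headings.append(match.group(1).strip())
    return headings


def parse_bullets(text):
    """Return the text of lines that start with a '-' or '*' bullet."""
    found = re.findall(r"^[ \t]*[-*][ \t]+(.+)$", text, flags=re.MULTILINE)
    return [item.strip() for item in found]


sample = "\n".join([
    "# Guide to Outlines",
    "Intro text.",
    FENCE + "python",
    "# this is a code comment, not a heading",
    FENCE,
    "## Why Outlines Matter",
    "#hashtag",
])
print(extract_headings(sample))  # ['Guide to Outlines', 'Why Outlines Matter']
print(parse_bullets("Some well-known questions:\n- What is an outline?\n* How long should it be?"))
# ['What is an outline?', 'How long should it be?']
```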

In [None]:
# Step 4: Define Helper Functions

# Setup Google Search API
def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    # 'items' is absent when the search returns no results, so fall back to an empty list
    return res.get('items', [])

# Function to extract content from a webpage using Firecrawl
def fetch_content_with_firecrawl(url):
    try:
        scrape_result = app.scrape_url(url, params={'formats': ['markdown']})
        if '403 Forbidden' in scrape_result.get('status', ''):
            print(f"Access to {url} was denied with a 403 Forbidden error.")
            return None
        page_text = scrape_result.get('markdown', '')
        if not page_text:
            print(f"No content available for {url}")
            return None
        return page_text
    except Exception as e:
        print(f"Error fetching content from {url}: {str(e)}")
        return None

# Function to extract headings from markdown text
def extract_headings_from_markdown(markdown_text):
    """Extract headings from markdown text based on markdown syntax."""
    headings = []
    for line in markdown_text.split('\n'):
        line = line.strip()
        if line.startswith('#'):
            # Remove leading '#' characters and any extra whitespace
            heading = line.lstrip('#').strip()
            if heading:
                headings.append(heading)
    return headings

# Function to generate a summary of the text using OpenAI GPT-4o
def generate_summary(text, headings):
    """Generate a GPT-4o summary of the text using the headings."""
    # Prepare the prompt
    headings_text = '\n'.join(f"- {heading}" for heading in headings)
    prompt = (f"Summarize the following article, focusing on these headings:\n{headings_text}\n\n"
              f"The summary should be concise (max 500 tokens) and capture the key points.")
    try:
        response = client.chat.completions.create(
            messages=[
                {"role": "system", "content": "You are an expert summarizer."},
                {"role": "user", "content": prompt + "\n\n" + text}
            ],
            model="gpt-4o",
            max_tokens=500,
            temperature=0.5,
            n=1
        )
        summary = response.choices[0].message.content.strip()
        return summary
    except Exception as e:
        print(f"Error generating summary: {e}")
        return "Summary not available."

# Function to extract questions from the text using OpenAI GPT-4o
def extract_questions(text):
    """Extract questions from the text using GPT-4o."""
    prompt = (f"Extract important questions from the following text related to the query '{query}'. "
              f"List them as bullet points.\n\n{text}")
    try:
        response = client.chat.completions.create(
            messages=[
                {"role": "system", "content": "You are a helpful assistant who extracts key questions from texts."},
                {"role": "user", "content": prompt}
            ],
            model="gpt-4o",
            max_tokens=1000,
            temperature=0.1,
            n=1
        )
        questions_text = response.choices[0].message.content.strip()
        # Split the response into individual questions based on bullet points
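        # Only match bullets at the start of a line so hyphens inside other text are ignored
        questions = re.findall(r"^\s*[-*]\s+(.*)$", questions_text, flags=re.MULTILINE)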
        if not questions:
            questions = [questions_text]
        return questions
    except Exception as e:
        print(f"Error extracting questions: {e}")
        return []

# Function to select the top questions
def top_questions(all_questions):
    """Generate the top questions from the list of all questions."""
    try:
        questions_text = '\n'.join(f"- {question}" for question in all_questions)
        prompt = (f"From the following list of questions extracted from top articles about '{query}', "
                  f"select the 5 most important questions that would be most useful to the user. "
                  f"List them as bullet points.\n\n{questions_text}")
        response = client.chat.completions.create(
            messages=[
                {"role": "system", "content": "You are an expert at identifying key questions on a topic."},
                {"role": "user", "content": prompt}
            ],
            model="gpt-4o",
            max_tokens=500,
            temperature=0.1,
            n=1
        )
        top_questions_text = response.choices[0].message.content.strip()
        # Split the response into individual questions based on bullet points
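        # Only match bullets at the start of a line so hyphens inside other text are ignored
        top_questions_list = re.findall(r"^\s*[-*]\s+(.*)$", top_questions_text, flags=re.MULTILINE)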
        if not top_questions_list:
            top_questions_list = [top_questions_text]
        return top_questions_list
    except Exception as e:
        print(f"Error generating top questions: {e}")
        return []

# Function to analyze entities using Google Cloud NLP
def analyze_entities(text_content):
    """Analyze entities in the text using Google Cloud NLP."""
    try:
        document = language_v1.Document(content=text_content, type_=language_v1.Document.Type.PLAIN_TEXT)
        response = nlp_client.analyze_entities(document=document, encoding_type=language_v1.EncodingType.UTF8)
        return response.entities
    except Exception as e:
        print(f"Error analyzing entities: {e}")
        return []
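
The fetch helper above returns `None` on any failure, so a page that fails once is skipped. If you see transient errors, one option is to retry with an increasing delay. The following is a generic sketch of that idea, not something the notebook does. The helper `with_retries` and its parameters are hypothetical, and the flaky function only simulates a scraper.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it returns a non-None value or the attempts run out."""
    for attempt in range(attempts):
        try:
            result = func()
        except Exception as exc:
            print(f"Attempt {attempt + 1} raised: {exc}")
            result = None
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * (2 ** attempt))
    return None


# Simulated fetch that fails twice before succeeding
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    return "page text" if calls["count"] >= 3 else None


delays = []  # record the delays instead of actually sleeping
result = with_retries(flaky_fetch, attempts=3, base_delay=1.0, sleep=delays.append)
print(result, delays)  # page text [1.0, 2.0]
```

In the notebook, this could wrap the existing helper, for example `with_retries(lambda: fetch_content_with_firecrawl(url))`.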


# Step 5: Scrape and Analyze Top Ranking Pages

In this step, we will:

* Search for the top-ranking pages for your query using Google Custom Search API.
* Scrape the content of these pages using Firecrawl.
* Extract summaries, questions, and entities from the scraped content.
* Log the data using Weights & Biases (wandb).

In [None]:
# Step 5: Scrape and Analyze Top Ranking Pages

client = OpenAI()

# Initialize Weights & Biases
wandb.init(project="seo-content-strategy")
weave.init('seo-content-strategy')

# Create W&B Tables to store scraped data
firecrawl_table = wandb.Table(columns=[
    "url",
    "markdown_summary",
    "artifact_link",
    "title",
    "description",
    "language",
    "status_code"
])
top_questions_table = wandb.Table(columns=[
    "question"
])

entities_table = wandb.Table(columns=[
    "entity",
    "aggregated_score",
    "page_count"
])

# Initialize a list to collect all questions
all_questions = []
entity_data = {}
markdown_summaries = []

# Initialize Google Cloud NLP client
nlp_client = language_v1.LanguageServiceClient()

# Search and scrape the top 10 pages
search_results = google_search(query, google_api, google_search_id, num=10)

for result in search_results:
    url = result['link']
    print(f"Processing URL: {url}")
    # Fetch content using Firecrawl
    page_text = fetch_content_with_firecrawl(url)
    if page_text is None:
        print(f"Failed to fetch content from {url}")
        continue  # Skip if no content

    # Save the full content as a file
    safe_title = ''.join(c if c.isalnum() else '_' for c in result.get('title', 'page_text'))
    artifact_filename = f"{safe_title}.txt"
    with open(artifact_filename, 'w', encoding='utf-8') as f:
        f.write(page_text)

    # Create and log the artifact
    artifact = wandb.Artifact(name=f"page_text_{safe_title}", type='page_text')
    artifact.add_file(artifact_filename)
    artifact = wandb.run.log_artifact(artifact)  # Capture the logged artifact

    # Wait for the artifact to be logged
    artifact.wait()

    # Get the artifact link
    artifact_link = artifact.get_path(artifact_filename).ref_url

    # Extract metadata
    title = result.get('title', 'Unknown Title')
    description = result.get('snippet', 'No description available')
    language = 'en'  # Adjust as needed
    status_code = 200  # Adjust as needed

    # Extract headings from the markdown text
    headings = extract_headings_from_markdown(page_text)

    # Generate a summary using GPT-4o
    markdown_summary = generate_summary(page_text, headings)
    if markdown_summary:
        markdown_summaries.append(markdown_summary)
    else:
        print(f"No summary generated for {url}")

    # Extract questions from the page and add them to the list
    questions = extract_questions(page_text)
    all_questions.extend(questions)

    # Analyze entities in the page text
    entities = analyze_entities(page_text)
    page_entities = set()  # To track unique entities in this page

    for entity in entities:
        entity_name = entity.name
        salience = entity.salience
        # Update entity data
        if entity_name in entity_data:
            entity_info = entity_data[entity_name]
            entity_info['total_salience'] += salience
            if url not in entity_info['pages']:
                entity_info['page_count'] += 1
                entity_info['pages'].add(url)
        else:
            entity_data[entity_name] = {
                'total_salience': salience,
                'page_count': 1,
                'pages': {url}
            }

    # Add data to the table, including the markdown summary and artifact link
    firecrawl_table.add_data(
        url,
        markdown_summary,
        artifact_link,
        title,
        description,
        language,
        status_code
    )

    # Clean up the temporary file
    os.remove(artifact_filename)

# After processing all pages, generate the top questions
top_questions_list = top_questions(all_questions)

# Add the top questions to the table
for question in top_questions_list:
    top_questions_table.add_data(question)

# Determine the top entities
# Calculate a combined score: total_salience * page_count
for entity_name, data in entity_data.items():
    aggregated_score = data['total_salience'] * data['page_count']
    data['aggregated_score'] = aggregated_score

# Sort entities by the aggregated score
top_entities = sorted(entity_data.items(), key=lambda item: item[1]['aggregated_score'], reverse=True)

# Get the top N entities (e.g., top 10)
top_n = 10
top_entities = top_entities[:top_n]

# Add top entities to the entities table
for entity_name, data in top_entities:
    entities_table.add_data(
        entity_name,
        data['aggregated_score'],
        data['page_count']
    )

# Log the tables to W&B
wandb.log({
    "scraped_data": firecrawl_table,
    "top_questions_table": top_questions_table,
    "entities_table": entities_table,
    "markdown_summaries": markdown_summaries,
})

print("Markdown Summaries:", markdown_summaries)

# Finish the W&B run
wandb.finish()
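
The entity ranking above multiplies each entity's summed salience by the number of pages it appears on, so entities that are prominent across several pages rise to the top. The standalone sketch below reproduces that scoring on made-up data. The function name `rank_entities`, the URLs, and the salience values are illustrative assumptions.

```python
def rank_entities(pages, top_n=10):
    """Rank entities by total_salience * page_count.

    `pages` maps a URL to a list of (entity_name, salience) pairs.
    """
    totals = {}
    for url, entities in pages.items():
        for name, salience in entities:
            info = totals.setdefault(name, {"total_salience": 0.0, "pages": set()})
            info["total_salience"] += salience
            info["pages"].add(url)  # a set counts each page only once
    scored = [
        (name, info["total_salience"] * len(info["pages"]), len(info["pages"]))
        for name, info in totals.items()
    ]
    return sorted(scored, key=lambda row: row[1], reverse=True)[:top_n]


# Made-up salience values for two pages
pages = {
    "https://example.com/a": [("outline", 0.5), ("keyword", 0.25)],
    "https://example.com/b": [("outline", 0.25), ("backlink", 0.5)],
}
ranking = rank_entities(pages)
print(ranking)
# [('outline', 1.5, 2), ('backlink', 0.5, 1), ('keyword', 0.25, 1)]
```

Here "outline" scores (0.5 + 0.25) * 2 = 1.5 because it appears on both pages, while the single-page entities keep their raw salience.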


# Step 6: Generate the Article Outline
In this final step, we will generate an article outline using the collected data and OpenAI's GPT-4o.
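
The prompt in this step asks for specific character ranges, such as a description of 130 to 160 characters and snippet paragraphs of 260 to 320 characters. The model may not always hit these ranges, so you might want to check candidate text yourself. The helper below is a hypothetical sketch for that check and is not part of the notebook.

```python
def length_ok(text, low, high):
    """Return True when the trimmed text length is within [low, high] characters."""
    return low <= len(text.strip()) <= high


# Made-up candidate description, checked against the 130-160 character range
candidate = (
    "Learn how to build an SEO article outline from top-ranking pages, "
    "using entities, common questions, and page summaries as your guide."
)
print(len(candidate), length_ok(candidate, 130, 160))
```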

In [None]:
# Generate the article outline using the collected data
@weave.op()
def generate_outline(top_entities, top_questions, query, query_secondary, article_type, markdown_summaries):
    entities_str = ', '.join([entity_name for entity_name, _ in top_entities])
    questions_str = ', '.join(top_questions)
    summaries_str = '\n\n'.join(markdown_summaries)
    try:
        response = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system",
                     "content": "You create succinct and clear article outlines. You can include your understanding "
                       "of a topic to enhance an outline, but the focus should be on the inclusion of the entities, questions and top ranking content you are provided with."},
                    {"role": "assistant", "content": "You are a highly skilled writer, and you want to produce a " + article_type +
                       " article outline that will appeal to users and rank well for queries given by the user. "
                       "The outline will contain headings and sub-headings, with clear and concise descriptions of the content "
                       "that is recommended for that section, and why. \n When you are recommending to create an introductory paragraph to a section, to capture a featured snippet, "
                       "note that it should be between 260 and 320 characters, and then provide a clearly noted example of what one might be. "
                       "\n After you have provided the outline, explain clearly how this article outline "
                       "could be used to create an article that will rank well using best-practice SEO strategies as well as be helpful "
                       "to users. You will be judged based on how well the article ranks, as well as how engaging the article is to readers, "
                       "and provide the metrics you would suggest be used to judge whether you are successful. \n An example of an article "
                       "structure that works well is: \n\n"
                       "Title: Top-Level Content (e.g., An Introduction to [Main Topic])\n\n"
                       "The description: Visible on the page as well as used as the description to Google. Should be 130 <= character_count "
                       "<= 160 and include the main keywords whenever possible.\n\n"
                       "**Introduction**\n\n"
                       "We don't need a heading tag here. Simply dive in with a brief description of what you'll be covering. One or two short "
                       "paragraphs is great, but longer is fine.\n\n"
                       "**H2 - Table Of Contents**\n"
                       "Ideally this section is done manually, but in a pinch, you can use the / Table Of Contents feature. You can add a bit of "
                       "additional content below the table of contents if you like.\n\n"
                       "**H2 - What Is [Main Topic]?**\n"
                       "Ideally here we have a section on the main topic. It will start with a paragraph that runs between 230 and 260 characters. "
                       "This first paragraph should be the short answer to the question, and it will hopefully get picked up as a featured snippet "
                       "and be used for voice search. It can and should be extended from there to provide additional information and context.\n\n"
                       "**H2 - Optional - Subset Such As 'Types Of [Main Topic]'**\n"
                       "Brief description here, ideally 230 to 260 characters.\n\n"
                       "**H3 - [Subset Type One] (e.g., First Subtype of [Main Topic])**\n"
                       "Description of [Subset Type One]. Ideally starting with 260 to 320 characters, but not mandatory, and expanding from there.\n\n"
                       "**H3 - [Subset Type Two] (e.g., Second Subtype of [Main Topic])**\n"
                       "Description of [Subset Type Two]. Ideally starting with 260 to 320 characters, but not mandatory, and expanding from there.\n\n"
                       "**H2 - A tutorial for [Main Topic]**\n"
                       "Generate a tutorial based on common machine learning tasks, which are likely found in the summaries provided by the user. You should use W&B Weave whenever possible.\n\n"
                       "**H2 - What Is [Main Topic] Used For?**\n"
                       "Again, ideally this starts with a 230 to 260 character short answer and is expanded upon.\n\n"
                       "**H2 - Examples Of [Main Topic]** \n"
                       "Optionally, we can place a brief description of the types of examples. It should be done in H3 tags (assuming it's a simple one). "
                       "A robust example requiring multiple stages (e.g., setup, running, visualizing) may require multiple H2 tags with H3s nested beneath.\n"
                       "**H2 - Recommended Reading On [Main Topic]** \n"
                       "Here we simply add a list with 2 to 4 articles you feel are related and would be of interest to the reader."},
                    {"role": "user",
            "content": "Create an article outline that will rank well for " + query + " as the primary term, and " + query_secondary +
                       " secondary keywords, which are less important but should still be considered. The following entities appear to be "
                       "relevant to ranking in the top 10 and should be worked into the page:\n" + entities_str + "\n Try to ensure the outline "
                       "will make it easy to work these into the article prominently and explain how this might be done in comments. Additionally, "
                       "the following questions appear to be important to answer in the article:\n" + questions_str +
                       "\n The following are summaries of the content and format that can be found on the top-ranking pages. This should heavily influence "
                       "the outlines you produce, as this content ranks well: \n" + summaries_str + "\n"
                       "Try to ensure that it will be easy to answer these questions in the article, and again, explain how you would recommend "
                       "doing this in a way that will seem useful to the user. The article outline should begin by explaining \n- all of the core "
                       "concepts required to understand the topic"
        }],
                max_tokens=2000,
                temperature=0.2,
                n=1
            )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error generating outline: {e}")
        return "Outline not available."

# Generate the article outline
article_outline = generate_outline(
    top_entities,
    top_questions_list,
    query,
    query_secondary,
    article_type,
    markdown_summaries
)

# Optionally, you can print or save the article outline
print("Generated Article Outline:")
print(article_outline)
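
If you want to keep the outline rather than only print it, one option is to write it to a markdown file named after the query. The following is a minimal sketch under that assumption. The helpers `slugify` and `save_outline` are hypothetical, and the demonstration writes to a temporary directory instead of the notebook's working directory.

```python
import re
import tempfile
from pathlib import Path


def slugify(text, max_length=60):
    """Turn arbitrary text into a lowercase, hyphen-separated file name stem."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug[:max_length].rstrip("-") or "outline"


def save_outline(outline, query, directory="."):
    """Write the outline to '<slug>.md' in the given directory and return the path."""
    path = Path(directory) / f"{slugify(query)}.md"
    path.write_text(outline, encoding="utf-8")
    return path


# Demonstration with placeholder text in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    saved = save_outline("# Example outline\n", "How to Rank: GPT-4o Outlines!", tmp)
    print(saved.name)  # how-to-rank-gpt-4o-outlines.md
```

In the notebook, the call would look like `save_outline(article_outline, query)`. In Colab the file lands in the session's working directory, which is temporary, so download it if you want to keep it.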
