# Creating an Intelligent Branded Content Assistant

## Part 1 – Exploring the Possibilities

The motivation behind this exercise is to build a small, tractable textual dataset that speaks to the general expertise of content creators associated with adventure-oriented media brands.

We are going to attempt to create an AI assistant that can help select talent and build authentic, badass storylines around creators who have demonstrated their commitment to experiencing nature to the fullest.


### Data Preparation and Pre-Processing

This section of our notebook deals with the data retrieval, pre-processing, and storage techniques used to curate a source of truth for our AI application.

##### Retrieving Biographical Information from Inkwell Media's 'Creator' Pages

We're going to use BeautifulSoup because the task is small and simple. All we want to do is extract biographical information for the members of Inkwell Media's 'Creator Network':

![Inkwell Creator Network](images/inkwell_creator_network.png)


In [38]:
# load dependencies
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [39]:
# instantiate a variable to hold the content creator network page
talent_roster_url = 'https://inkwell.media/creatornetwork'

# extract the html
talent_roster_html = requests.get(talent_roster_url)
talent_roster_soup = BeautifulSoup(talent_roster_html.text, 'html.parser')

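# create a link list using a comprehension: keep anchors whose first class is
# one of the two we care about, skipping anchors with no href or Instagram links
creator_classes = ['content-fill', 'image-slide-anchor']
links = [
    node.get('href')
    for node in talent_roster_soup.find_all('a')
    if node.get('class')
    and node.get('class')[0] in creator_classes
    and node.get('href')
    and 'instagram' not in node.get('href')
]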

# create a data frame with the available data
links_df = pd.DataFrame({'links' : links})

In [40]:
links_df.head(5)

| | links |
|---|---|
| 0 | http://inkwell.media/creator/andrewmiller |
| 1 | http://inkwell.media/creator/jimmychin |
| 2 | http://www.inkwell.media/creator/ianwalsh |
| 3 | http://www.inkwell.media/creator/sashadigiulian |
| 4 | /creator/chrisburkard |


In [41]:
# split the links to get an informative slug
links_df['slug'] = links_df['links'].str.split('/').str[-1]
links_df.head(5)

| | links | slug |
|---|---|---|
| 0 | http://inkwell.media/creator/andrewmiller | andrewmiller |
| 1 | http://inkwell.media/creator/jimmychin | jimmychin |
| 2 | http://www.inkwell.media/creator/ianwalsh | ianwalsh |
| 3 | http://www.inkwell.media/creator/sashadigiulian | sashadigiulian |
| 4 | /creator/chrisburkard | chrisburkard |
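
The output above mixes absolute links (with and without `www`) and one relative link. As a minimal sketch, the standard library's `urljoin` can resolve the relative ones, and `urlparse` gives a slug that survives a trailing slash. `BASE_URL`, `normalize_link`, and `slug_from_link` are names made up for this illustration; the base URL is simply the prefix the scraping loop further down uses.

```python
from urllib.parse import urljoin, urlparse

# Assumed base URL, taken from the prefix the scraping loop uses
BASE_URL = 'http://www.inkwell.media'

def normalize_link(link, base=BASE_URL):
    """Resolve a relative link against `base`; absolute links pass through unchanged."""
    return urljoin(base, link)

def slug_from_link(link):
    """Return the last non-empty path segment, ignoring a trailing slash."""
    path = urlparse(link).path
    return path.rstrip('/').split('/')[-1]

examples = [
    'http://inkwell.media/creator/andrewmiller',
    '/creator/chrisburkard',
]
for link in examples:
    print(normalize_link(link), slug_from_link(link))
```

This is not what the notebook runs; it only shows one way to make the link handling a little less dependent on string checks.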


Now for the more intensive scraping task... We need to create a set of request headers so that Inkwell's site (which is built on Squarespace, and is wont to return a message telling us to visit a Squarespace status page for additional information) can't tell the difference between our scraper and real traffic.

We don't need to get too intense, though. For instance, I'm not going to try to feed it any additional parameters beyond the headers.

In [42]:
# instantiate a header
hdr = {
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"',
    'sec-ch-ua-mobile': '?0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'
}
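
Since the site sometimes answers with an error message instead of the page we asked for, it may be worth retrying a request before giving up. The sketch below assumes a blocked request shows up as a non-200 status code, which this notebook does not verify. `fetch_with_retry` is a hypothetical helper, and `fetch` stands in for whatever makes the request, e.g. `lambda u: requests.get(u, headers=hdr)`.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call `fetch(url)` until it returns a 200 response or attempts run out.

    `fetch` is any callable returning an object with a `status_code`
    attribute. The last response is returned either way, so the caller
    can inspect a failure.
    """
    last = None
    for attempt in range(attempts):
        last = fetch(url)
        if last.status_code == 200:
            return last
        # wait a little longer after each failed attempt
        if attempt < attempts - 1:
            time.sleep(delay * (attempt + 1))
    return last
```

Passing the request function in as an argument keeps the retry logic separate from `requests`, so it can be tried out with a stand-in before pointing it at the real site.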

In [43]:
# instantiate an empty list to hold the bios
bios = []

# loop through the data frame
for i, row in links_df.iterrows():
    
    # if there's no protocol or home page route, add it to the front of the reference link
    if 'http' not in row['links']:
        url = 'http://www.inkwell.media' + row['links']
        
    # otherwise, use the link provided
    else:
        url = row['links']
        
    # retrieve the html from the specific creator page in question on this iteration
    bio_page = requests.get(url, headers=hdr)
    bio_soup = BeautifulSoup(bio_page.text, 'html.parser')
    
    # grab the first <p> element and append its text to the list
    # (fall back to an empty string if the page has no paragraph)
    first_paragraph = bio_soup.find('p')
    bio_text = first_paragraph.text if first_paragraph else ''
    bios.append(bio_text)
    
# add the bios to the data frame
creator_df = links_df.copy()
creator_df['bio'] = bios

creator_df.head(5)

| | links | slug | bio |
|---|---|---|---|
| 0 | http://inkwell.media/creator/andrewmiller | andrewmiller | Andrew Miller is a Utah-based adventure photog... |
| 1 | http://inkwell.media/creator/jimmychin | jimmychin | Jimmy Chin is an Academy Award winning filmmak... |
| 2 | http://www.inkwell.media/creator/ianwalsh | ianwalsh | Maui native Ian Walsh has been tackling massiv... |
| 3 | http://www.inkwell.media/creator/sashadigiulian | sashadigiulian | Sasha DiGiulian is a professional climber. She... |
| 4 | /creator/chrisburkard | chrisburkard | Chris Burkard is an American self-taught photo... |


In [44]:
with open('resources/testwrite.txt', 'w') as file:
    
    for i, row in creator_df.iterrows():
        file.write(
f"""
Bio: {row['bio']}
Link: {row['links']}


"""
        )

##### Preparing the Data for LLM Compatibility

Now it's time to pull in some tools we can use to work with LLMs—specifically `LlamaIndex` and `LangChain`, which should help us vectorize our database and optimize prompt engineering, respectively.

We'll start by loading API keys and instantiating environment variables:

In [45]:
# load environment variables
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

True
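
`load_dotenv` returning `True` only tells us a `.env` file was found, not that it contains the keys we need. A quick check like the sketch below can catch a missing key before the first API call fails. `OPENAI_API_KEY` is the variable the OpenAI client reads; the two Pinecone names are assumptions about how the `.env` file is laid out, and `missing_env_vars` is a made-up helper.

```python
import os

# OPENAI_API_KEY is the variable the OpenAI client reads; the two Pinecone
# names are assumptions about how the .env file is laid out.
REQUIRED_VARS = ['OPENAI_API_KEY', 'PINECONE_API_KEY', 'PINECONE_ENVIRONMENT']

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    print('Missing environment variables:', ', '.join(missing))
```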

And we should start with a basic query to GPT-4 to make sure everything is working properly with our connection to the LLM:

In [48]:
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

from langchain.chat_models import ChatOpenAI

chat = ChatOpenAI(model_name='gpt-4', temperature=0.7)
messages = [
    SystemMessage(content='You are a senior creative writer for Outside Media.'),
    HumanMessage(content='Create a 200-word storyline around famous mountain biking talent, and suggest brands that might want to sponsor the content.')
]

response=chat(messages)

print(response.content, end='\n')

Title: "Pedal to the Peak: Max Thompson's Journey to Mountain Biking Glory"

Storyline:
Max Thompson, a small-town prodigy, has skyrocketed to international fame as the mountain biking world's most celebrated talent. With his fierce dedication, Max has pushed the boundaries of the sport, leaving fans and fellow athletes in awe. "Pedal to the Peak" follows Max's inspiring journey, from his humble beginnings in rural Colorado, to his relentless pursuit of greatness in the most challenging terrains worldwide. 

As Max's reputation grows, he takes on the most grueling test of his career: the prestigious Red Mountain Challenge. A treacherous course filled with dangerous obstacles and steep inclines, Red Mountain is notorious for breaking down even the most seasoned riders. Max's determination, however, remains unshaken, as he navigates the unforgiving terrain. Through perseverance and intense focus, Max overcomes adversity and emerges victorious, cementing his status as a mountain biking le

Looks great. Now we can focus on vectorizing our creator data and creating some chains that can help us narrow the focus and specialize our AI application:

In [55]:
# load langchain and pinecone dependencies
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chains.question_answering import load_qa_chain
import pinecone

# load our text document and split it into new documents
loader = TextLoader('resources/testwrite.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=900, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# instantiate the embeddings model from OpenAI
embeddings = OpenAIEmbeddings()
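
To get a feel for what the splitter does with our text file: as far as I understand it, `CharacterTextSplitter` splits on blank lines by default and then merges neighbouring pieces until the next one would push a chunk past `chunk_size`. The function below is a simplified approximation of that behaviour in plain Python, not LangChain's actual implementation, and `split_on_blank_lines` is a name made up for this sketch.

```python
def split_on_blank_lines(text, chunk_size=900):
    """Greedy approximation: split on blank lines, then merge pieces up to chunk_size."""
    pieces = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks, current = [], ''
    for piece in pieces:
        candidate = piece if not current else current + '\n\n' + piece
        if len(candidate) <= chunk_size or not current:
            # the piece still fits (or is the first piece of a new chunk)
            current = candidate
        else:
            # the piece would overflow the chunk: close it and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

# two made-up records in the same layout as resources/testwrite.txt
sample = (
    'Bio: ' + 'a' * 500 + '\nLink: /creator/one\n\n\n'
    'Bio: ' + 'b' * 500 + '\nLink: /creator/two\n'
)
for chunk in split_on_blank_lines(sample):
    print(len(chunk), chunk[:20])
```

One consequence worth keeping in mind: two short bios can land in the same chunk if together they stay under `chunk_size`, so a retrieved document is not guaranteed to describe exactly one creator.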

In [51]:
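# create a pinecone index or load the existing one
# (the key and environment are read from the variables loaded by dotenv above;
# the variable names are assumed to match the .env file)
pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],
    environment=os.environ['PINECONE_ENVIRONMENT']
)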

index_name = 'outsidedemo1'

if index_name not in pinecone.list_indexes():
    
    pinecone.create_index(
        name=index_name, 
        metric="cosine",
        dimension=1536
    )
    
    docsearch = Pinecone.from_documents(
        docs, 
        embeddings, 
        index_name=index_name
    )

else:
    docsearch = Pinecone.from_existing_index(
        index_name, 
        embeddings
    )
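
Two of the index settings deserve a note. `dimension` has to match the length of the vectors the embedding model returns (1536 is, to my knowledge, the output size of OpenAI's `text-embedding-ada-002`), and `metric="cosine"` means the scores we get back compare the direction of two vectors rather than their length. A minimal sketch of that metric in plain Python, for intuition only (Pinecone computes it server-side):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError('cosine similarity is undefined for a zero vector')
    return dot / (norm_a * norm_b)

# toy 2-dimensional vectors; real embeddings would have 1536 entries
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Identical directions score 1 and orthogonal directions score 0, so higher means more similar.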

### Making Use of Vectorized Data

##### Testing Some Basic LLM Functionality for Our Customized Database

The first things we need to be able to do are:

1) Retrieve documents related to a specific query through similarity search within our vectorized database, and display similarity scores so we can supervise the model and make sure that it's returning results that make sense to a user with industry experience.

2) Answer factual questions about our custom database with natural language user queries.

Once we achieve those objectives, we can move on and try to optimize our ability to develop content ideas around the creators in our network.

In [97]:
## a function that should return `k` related documents from our
## pinecone database, along with the context and a relevance score
def get_similar(query, k=10, score=False):
    
    if score:
        similar_docs = docsearch.similarity_search_with_score(
            query,
            k=k
        )
        
    else:
        similar_docs = docsearch.similarity_search(
            query,
            k=k
        )
    
    return similar_docs

In [98]:
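# instantiate the model (gpt-4 is a chat model, so we use the
# ChatOpenAI wrapper imported earlier rather than the completion-style OpenAI class)
model_name = 'gpt-4'
llm = ChatOpenAI(model_name=model_name)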

# define a chain specifically designed for document-based Q&A
chain = load_qa_chain(llm, chain_type='stuff')

# put a function wrapper around the chain
def get_answer(query):
    
    similar_docs = get_similar(query)
    answer = chain.run(input_documents=similar_docs, question=query)
    
    return answer
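
The `'stuff'` chain type is the simplest option: it "stuffs" every retrieved document into a single prompt and asks the model once. The sketch below shows the idea in plain Python. The prompt wording is illustrative only, not LangChain's actual template, and `build_stuff_prompt` is a hypothetical helper.

```python
def build_stuff_prompt(doc_texts, question):
    """Concatenate every retrieved document into one context block, then append the question."""
    context = '\n\n'.join(doc_texts)
    return (
        'Use the following pieces of context to answer the question at the end.\n\n'
        f'{context}\n\n'
        f'Question: {question}\n'
        'Answer:'
    )

# made-up documents in the same shape as our bios
prompt = build_stuff_prompt(
    ['Bio: First creator bio.', 'Bio: Second creator bio.'],
    'Which creators are photographers?',
)
print(prompt)
```

Because all `k` documents go into one prompt, the total length of the retrieved chunks has to fit within the model's context window, which is one reason to keep `k` and `chunk_size` modest.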

In [96]:
get_answer('Do we have any adventure photographers with experience in the Andes?')

'Yes, Sunny Stroeer is an adventure photographer with experience in the Andes. She holds multiple speed records in the greater ranges, including the "360" record on 22,838ft Aconcagua in the Andes.'

Fantastic. That's specific, it's based on context that makes sense to the user, and it's a really good indicator that the LLM is a powerful ally in this task. That's because if we look back at the data we vectorized, Sunny's bio reads as follows:

    Sunny is a photographer and world-class mountain athlete. Her award-winning photography is complemented by her extensive expedition and high-angle experience, and outstanding endurance ability.  She holds multiple speed records in the greater ranges including the “360" record on 22,838ft Aconcagua. Sunny is also a competent big wall climber and, as a frequent solo adventurer, used to creating exceptional content in unforgiving environments in a highly nimble fashion.
    
Notice that nowhere in the text does it mention the "Andes". The bio only mentions 'Aconcagua', so the connection to our query was made by the model stack (OpenAI's embeddings during retrieval and the LLM when answering) rather than by a keyword match.

Just for the sake of taking a look under the hood, let's see what happened with the `get_similar()` function that was called through the `get_answer()` request we made above:

In [101]:
## observe the relevant document retrieval
get_similar(
    'Do we have any adventure photographers with experience in the Andes?',
    score=True
)

[(Document(page_content='Bio: Tony is an adventure lifestyle photographer who has spent most of his life climbing, skiing, surfing and chasing adventure - after nearly a decade in the industry, he’s established himself as not only a photographer & DP but also a photo journalist. A few of his recent projects include a kayak documentary shot in the rebel territories of Mexico and tracing the Marco Polo Path along the Tajikistan and Afghan border. Tony has worked with numerous clients including Patagonia, Toyota, and Red Bull and has been published in works such as GQ, Nat Geo Adventure, and Outside Magazine.\nLink: http://inkwell.media/creator/tonyczech', metadata={'source': 'resources/testwrite.txt'}),
  0.8237468),
 (Document(page_content='Bio: Sofia Jaramillo is an outdoor commercial photographer based in Jackson, Wyoming. Her background is in photojournalism. She believes in the power of storytelling and with this approach has photographed ad campaigns for some of the top outdoor bra

Okay, now this is interesting. It did answer us correctly, but a lot of the returned documents have similarity scores around 0.8, and we're not exactly sure why it chose to answer with just Sunny. We might need to dig into this more as we iron out specific use cases.
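
One simple way to supervise retrieval is to drop low-scoring documents before they reach the LLM. The sketch below works on the `(document, score)` pairs that `get_similar(..., score=True)` returns and assumes a higher score means a closer match, as it does for cosine similarity. `filter_by_score` is a hypothetical helper, and the `0.8` cutoff is an arbitrary starting point rather than a tuned value.

```python
def filter_by_score(scored_docs, min_score=0.8):
    """Keep (doc, score) pairs at or above min_score, best match first."""
    kept = [(doc, score) for doc, score in scored_docs if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# stand-in data: plain strings instead of Document objects, made-up scores
scored = [('doc a', 0.79), ('doc b', 0.85), ('doc c', 0.81)]
for doc, score in filter_by_score(scored):
    print(f'{score:.2f}  {doc}')
```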

##### Probing the Model to Test Fidelity

For now, let's probe it a little just to see what responses to more specific questions look like:

In [102]:
get_answer('Do we have any adventure photographers with experience in the Andes who have previously worked with National Geographic?')

"Yes, Andy Mann has experience with expeditions on all 7 continents, including the Andes, and has worked for National Geographic Magazine, Sea Legacy, and National Geographic's Pristine Seas."

In [108]:
get_answer('List all adventure photographers who have previously worked with National Geographic, and explain the connection.')

'1. Keith Ladzinski: He has made films, advertising, and television content for National Geographic TV.\n2. Paul Nicklen: He is an assignment photographer for National Geographic Magazine, a National Geographic Fellow, and has documented expeditions for National Geographic on all 7 continents.\n3. Andy Mann: He has documented expeditions on all 7 continents for National Geographic Magazine and is a director at 3 Strings Productions, which works with clients like National Geographic.\n4. Pete McBride: He has traveled on assignment to over 75 countries for the National Geographic Society and has produced a book and documentaries for them.\n\nThese adventure photographers have worked with National Geographic, either as assignment photographers, documentarians, or creating content for their various platforms.'

In [104]:
get_answer('In what capacity has Chris Figenshau worked with National Geographic.')

'Chris Figenshau has worked with National Geographic as a still photographer, documenting various expeditions and adventures.'

Okay, now let's try something a little more specific to branded content development:

In [99]:
get_answer('Who do we know that could be a great ambassador for a company that sells cycling gear')

'Sonya Looney would be a great ambassador for a company that sells cycling gear, as she is a World Champion endurance mountain biker, motivational speaker, writer, and adventure traveler.'

In [105]:
get_answer('Give me a list of everyone who would make sense as an ambassador for a cycling gear company and explain why.')

'1. Sonya Looney: As a World Champion endurance mountain biker, Sonya would be a great ambassador for a cycling gear company because she has experience, success, and credibility in the biking world.\n\n2. Lea Davison: Lea is an American cross-country mountain biker with two World Championships titles and two Olympic medals. Her experience and success in the sport make her a strong candidate as an ambassador for a cycling gear company.\n\n3. Adrien Costa: Despite losing his right leg in a mountain accident, Adrien has continued to push the boundaries of what is possible for people with physical limitations. As a former professional road cyclist and current climber, he would bring a unique and inspiring perspective as an ambassador for a cycling gear company.'

In [107]:
get_answer('Give me a full list of our athletes with disabilities, and explain what they are.')

'1. Adrien Costa - Adrien had a mountain accident in 2018 that resulted in the amputation of his right leg above the knee.\n2. Quinn Brett - Quinn sustained paralysis below the waist due to a 100-foot fall while climbing in Yosemite in October 2017.\n3. Jeff Denholm - Jeff lost his dominant arm in an accident on an Alaskan fishing trawler.'

Okay, I'm sold. This is working well enough to move on to putting together some creative content development modules. We'll get started on that in the next notebook.
