<a href="https://colab.research.google.com/github/barbaroja2000/llm/blob/main/Langchain_%26_OpenAi_AWS_Summit_2023_London_Sponser_Categorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Langchain & OpenAi - AWS Summit 2023 London Sponser Categorization

This notebook categorizes all the companies sponsering the 2023 Summit , into the following categories:

* Security
* Managed Service Providers (MSPs)
* Landing Zone/Infrastructure Providers
* Training Partners
* Consulting Partners/Systems Integrators
* Software/Application Providers
* Data Management Providers
* Observability
* AI/ML

Process:

1.  Parse the sponser page, pulling out all non AWS pages into a list
2. Spider these uris pulling out title & description 
3. Feed these into a GPT model and categorize
4. Display results in a Pandas table

Requires OpenAi Key:

```Python
OPENAI_API_KEY="abc"
```

In [60]:
#@title Load Keys
#@markdown Utitily to load keys from fs, replace with environ vars if not using

import os

#os.environ.get("OPENAI_API_KEY")
#os.environ.get("HUGGINGFACE_API_KEY")

!python -m pip install python-dotenv
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)
import dotenv
dotenv.load_dotenv('/content/drive/MyDrive/keys/keys.env')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Mounted at /content/drive/


True

In [61]:
sponser_page = "https://aws.amazon.com/events/summits/london/sponsors/"

In [62]:
#@title Parse hrefs
#@markdown Exclude all local hrefs and anything AWS related

from bs4 import BeautifulSoup
import requests

candidates = []
#required as many pages will return 403 forbidden without user-agent string
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36'}
result = requests.get(sponser_page, headers=headers)
parser = 'html.parser' 
page = requests.get(sponser_page)
soup = BeautifulSoup(page.text, parser)

for link in soup.find_all('a', href=True):
  x = link['href'].find("https://")
  y =  link['href'].find("aws")
  if x == 0 and  y ==-1:
    domain = link['href'].split("/")
    candidates.append(domain[2:3].pop())

In [63]:
print(candidates[:10])

['ladiesthatlaunch.splashthat.com', 'www.paloaltonetworks.com', 'www.dynatrace.com', 'www.vmware.com', 'www.accenture.com', 'www2.deloitte.com', 'www.snowflake.com', 'www.trendmicro.com', 'www.reply.com', 'vmware.com']


In [64]:
#@title Crawl websites
#@markdown Pull title and description back from html

candidates_for_categorization = []

for x,i in enumerate(candidates):
  try:
    uri = f"https://{i}"
    resp = requests.get(uri, timeout=10, headers=headers)
    soup = BeautifulSoup(resp.text, parser)
    description  = soup.find("meta", property="og:description") or soup.find("meta", property="description")
    title  = soup.find("meta", property="og:title") or soup.find("meta", property="title") 
    description = description["content"] if description else None
    title = title["content"] if title else None
    tmp_dict = {"title": title, "description": description, "url": uri}
    candidates_for_categorization.append(tmp_dict)

  except Exception as e:
    print(e)

In [65]:
print(len(candidates_for_categorization))

110


In [66]:
!pip install langchain openai > /dev/null
from langchain import PromptTemplate, LLMChain
from langchain.llms import OpenAI

model_name= 'text-davinci-003' #@param ["text-davinci-003", "gpt-4"]
llm = OpenAI(model_name=model_name, temperature=0)

In [67]:
template = """Categories listed here are types of cloud companies.

Security
Managed Service Providers (MSPs)
Landing Zone/Infrastructure Providers
Training Partners
Consulting Partners/Systems Integrators
Software/Application Providers
Data Management Providers
Observability
AI/ML

Using the description of companies below classify each into the preceeding categories. 
If the company matches multiple categories, return all matching in a comma seperated list
If your confidence is poor for the given classifications, propose a new classification

Blurb: {blurb}
==============================================================
Classification: """

prompt = PromptTemplate(
    input_variables=["blurb" ],
    template=template
)

llm_chain = LLMChain(prompt=prompt, llm=llm)

In [130]:
final = []
for i in candidates_for_categorization:
  blurb = f"{i['title']} {i['description']}"
  category = llm_chain.run(blurb)
  if "title" in i and i["title"] is not None:
    i["category"] = [x.strip(' ') for x in category.split(", ")]
    final.append(i)

['Security', 'Consulting Partners/Systems Integrators', 'Training Partners']
['Observability', 'AI/ML', 'Security']
['Consulting Partners/Systems Integrators']
['Consulting Partners/Systems Integrators']
['Consulting Partners/Systems Integrators', 'Audit & Assurance', 'Financial Advisory', 'Risk Advisory', 'Tax & Legal']




['Security', 'Software/Application Providers']
['Consulting Partners/Systems Integrators']
['Consulting Partners/Systems Integrators']
['Security', 'Managed Service Providers (MSPs)']
['Data Management Providers', 'Software/Application Providers']
['Consulting Partners/Systems Integrators']
['Consulting Partners/Systems Integrators']
['Security', 'Software/Application Providers']
['Security', 'Managed Service Providers (MSPs)']
['Software/Application Providers', 'Consulting Partners/Systems Integrators']
['Software/Application Providers']
['AI/ML', 'Consulting Partners/Systems Integrators']
['Software/Application Providers', 'Data Management Providers']




['Observability']
['Data Management Providers', 'Software/Application Providers']
['Observability', 'Software/Application Providers']
['Consulting Partners/Systems Integrators', 'Training Partners']
['Consulting Partners/Systems Integrators', 'Security', 'Data Management Providers', 'Observability']




['Landing Zone/Infrastructure Providers', 'Security', 'Data Management Providers']
['Consulting Partners/Systems Integrators']
['Software/Application Providers']
['Security', 'Consulting Partners/Systems Integrators']
['Software/Application Providers']
['Consulting Partners/Systems Integrators']
['Software/Application Providers', 'Consulting Partners/Systems Integrators']
['Software/Application Providers']
['Observability', 'Consulting Partners/Systems Integrators']
['Observability', 'Consulting Partners/Systems Integrators']
['Data Management Providers', 'Software/Application Providers']
['Consulting Partners/Systems Integrators']
['Security', 'Consulting Partners/Systems Integrators']




['Consulting Partners/Systems Integrators']
['Security']
['Software/Application Providers']
['Data Management Providers', 'Software/Application Providers']
['Software/Application Providers']
['Data Management Providers', 'Security']
['Software/Application Providers']
['Data Management Providers', 'Software/Application Providers']
['Data Management Providers', 'Observability']
['Consulting Partners/Systems Integrators', 'Software/Application Providers', 'Data Management Providers']
['Software/Application Providers']
['Software/Application Providers']
['Consulting Partners/Systems Integrators', 'Software/Application Providers', 'Data Management Providers']




['Software/Application Providers']
['Software/Application Providers', 'AI/ML']
['Software/Application Providers']
['Software/Application Providers']
['Software/Application Providers']
['Consulting Partners/Systems Integrators', 'Software/Application Providers', 'AI/ML']
['Data Management Providers', 'Security']
['Data Management Providers']
['Security', 'Data Management Providers']
['Observability', 'Training Partners']
['Security', 'Managed Service Providers (MSPs)', 'Consulting Partners/Systems Integrators']
['Software/Application Providers']
['Infrastructure Providers', 'Software/Application Providers']
['Observability', 'Consulting Partners/Systems Integrators']
['Security', 'Software/Application Providers']
['Data Management Providers', 'Observability']
['Observability', 'Software/Application Providers']
['Consulting Partners/Systems Integrators']
['Data Management Providers']
['Software/Application Providers']
['Data Management Providers', 'Consulting Partners/Systems Integrators



['Data Management Providers', 'Software/Application Providers']
['Consulting Partners/Systems Integrators']
['Software/Application Providers']


In [144]:
import pandas as pd
df.drop(df.index, inplace=True)
df = pd.DataFrame(final)

In [145]:
from google.colab import data_table
data_table.enable_dataframe_formatter()

In [148]:
#@title One hot encode the category column
df_ohe = df.drop(["category", "description", "company_category"], 1).join(df.category.str.join('|').str.get_dummies())

  df_ohe = df.drop(["category", "description", "company_category"], 1).join(df.category.str.join('|').str.get_dummies())


In [149]:
#@title Filter by Category
#@markdown Click on `filter` and enter 1 in the from field against the category
df_ohe

Unnamed: 0,title,url,AI/ML,Audit & Assurance,Consulting Partners/Systems Integrators,Data Management Providers,Financial Advisory,Infrastructure Providers,Landing Zone/Infrastructure Providers,Managed Service Providers (MSPs),Observability,Risk Advisory,Security,Software/Application Providers,Tax & Legal,Training Partners
0,Leader in Cybersecurity Protection & Software ...,https://www.paloaltonetworks.com,0,0,1,0,0,0,0,0,0,0,1,0,0,1
1,Dynatrace | Modern cloud done right,https://www.dynatrace.com,1,0,0,0,0,0,0,0,1,0,1,0,0,0
2,Introducing VMware Cross-Cloud Services,https://www.vmware.com,0,0,1,0,0,0,0,0,0,0,0,0,0,0
3,Accenture | Let there be change,https://www.accenture.com,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,"Deloitte | Audit & Assurance, Consulting, Fina...",https://www2.deloitte.com,0,1,1,0,1,0,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,Worldwide IT Training,https://www.globalknowledge.com,0,0,0,0,0,0,0,0,0,0,0,0,0,1
97,"qa.com | QA | World-Leading Tech Training, Ski...",https://www.qa.com,0,0,1,0,0,0,0,0,0,0,0,0,0,1
98,"Modern Data Protection, Backup, & Recovery Sof...",https://www.veeam.com,0,0,0,1,0,0,0,0,0,0,0,1,0,0
99,Introducing VMware Cross-Cloud Services,https://www.vmware.com,0,0,1,0,0,0,0,0,0,0,0,0,0,0


In [138]:
df_sum = df_ohe.drop(['title','url'], 1)

  df_sum = df_ohe.drop(['title','url'], 1)


In [141]:
#@title Sum Companies by Categorization
df_sum.sum()

  df_sum.sum()


AI/ML                                       8
Audit & Assurance                           1
Consulting Partners/Systems Integrators    44
Data Management Providers                  25
Financial Advisory                          1
Infrastructure Providers                    1
Landing Zone/Infrastructure Providers       2
Managed Service Providers (MSPs)            8
Observability                              12
Risk Advisory                               1
Security                                   22
Software/Application Providers             43
Tax & Legal                                 1
Training Partners                           6
dtype: int64

In [142]:
#@title Save out
file_name=f"aws-summit-sponsers-{model_name}.csv"
df.to_csv(f"/content/drive/MyDrive/{file_name}", encoding='utf-8', index=False)