# CB Insights

CB Insights (https://www.cbinsights.com/) has a collection of around one million companies, each with\
information such as name and funding details and, for some, revenue and valuation.

#### Check robots.txt

First, let's check robots.txt to ensure we are scraping CB Insights ethically.

In [25]:
import requests

url = 'https://www.cbinsights.com/robots.txt'
headers = { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36' }
response = requests.get(url, headers=headers)

print(response.content.decode('utf-8'))

User-agent: *
Disallow: /nyserda/
Disallow: /embed/
Disallow: /forbes/ampc-survey2.php
Disallow: /user_package_options.php
Disallow: /admin/
Disallow: /pricing-monthly
Disallow: /marketing/mosaic_investor.php
Disallow: /reports/
Disallow: /old-pricing
Disallow: /trial-pricing
Disallow: /marketing/*.php
Disallow: /contact-tos
Disallow: /ajax
Disallow: /sendgrid_event_hook.php
Disallow: /browser-plugins/
Disallow: /performance/
Disallow: /public_pages/
Disallow: /research-reports/Global-Healthcare-Exits-Report.pdf
Disallow: /*.pdf$
Disallow: /internal/
Disallow: /company2/

User-agent: GPTBot
Disallow: /

Sitemap: https://www.cbinsights.com/sitemap/sitemap_master.xml
Sitemap: https://www.cbinsights.com/research/sitemap.xml



As the output shows, pages under the prefix `https://www.cbinsights.com/company/` do not fall under any `Disallow` rule for `User-agent: *`, so we are allowed to scrape them.
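The same check can be done programmatically instead of by eye. The sketch below feeds a hand-copied excerpt of the rules printed above to the standard library's `urllib.robotparser`. The helper name `is_allowed` and the excerpt are illustrative assumptions. Wildcard rules such as `/*.pdf$` are left out because this parser only does plain prefix matching.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the robots.txt output above (wildcard rules omitted, since
# urllib.robotparser matches paths by simple prefix only).
ROBOTS_EXCERPT = """\
User-agent: *
Disallow: /reports/
Disallow: /internal/
Disallow: /company2/

User-agent: GPTBot
Disallow: /
"""

def is_allowed(url, user_agent='*', robots_text=ROBOTS_EXCERPT):
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed('https://www.cbinsights.com/company/slice/financials'))
print(is_allowed('https://www.cbinsights.com/company2/slice'))
print(is_allowed('https://www.cbinsights.com/company/slice', user_agent='GPTBot'))
```

Note that `/company/` and `/company2/` are different prefixes, so the rule for the latter does not affect the pages scraped here.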

#### Obtain URLs to scrape from sitemap

CB Insights provides a master sitemap at https://www.cbinsights.com/sitemap/sitemap_master.xml. \
From it we obtain 215 company sitemaps, each containing the URLs of roughly 4,000 to 5,000 companies.\
Here, we scrape all 215 sitemaps and store the link of every company in `./cbinsights/cbinsights_links.txt`.
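As an alternative to pulling `<loc>` values out with a regular expression, a sitemap can be parsed as XML. This is a minimal sketch using the standard library; it assumes the sitemaps use the standard sitemaps.org namespace, and the function name `extract_locs` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemaps declare the standard sitemaps.org namespace.
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def extract_locs(xml_text):
    """Return the text of every <loc> element in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.iterfind('.//sm:loc', SITEMAP_NS)
            if loc.text]

# Made-up sitemap index in the sitemaps.org format
sample = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.cbinsights.com/sitemap/sitemap_company_1.xml</loc></sitemap>
  <sitemap><loc>https://www.cbinsights.com/sitemap/sitemap_company_2.xml</loc></sitemap>
</sitemapindex>"""

print(extract_locs(sample))
```

Parsing the master sitemap this way would also avoid hard-coding the number of company sitemaps.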

DO NOT RERUN THE CELL BELOW:

In [21]:
import requests
import time
import re
from random import randint

def scrape_sitemaps():
    output_file = './cbinsights/cbinsights_links.txt'

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36'
    }

    with open(output_file, 'w', encoding='utf-8') as f:
        sitemap_count = 215
        for i in range(1, sitemap_count + 1):
            url = f'https://www.cbinsights.com/sitemap/sitemap_company_{i}.xml'
            delay = randint(4, 8)  # polite pause between requests, in seconds
            try:
                print(f'Scraping {url}')
                # Network timeout so a stalled connection cannot hang the loop
                response = requests.get(url, headers=headers, timeout=30)

                if response.status_code == 200:
                    urls = re.findall(r'<loc>(.*?)</loc>', response.text)
                    for link in urls:
                        f.write(link + '\n')
                    print(f'Successfully scraped sitemap {i}: {len(urls)} URLs. Sleeping for {delay} seconds.')
                else:
                    print(f"Didn't find sitemap {i}, status: {response.status_code}")

            except Exception as e:
                print(f'Scraping sitemap {i} caused error: {e}')

            # Pause after failures too, so errors do not trigger rapid-fire requests
            time.sleep(delay)
                
scrape_sitemaps()

Scraping https://www.cbinsights.com/sitemap/sitemap_company_1.xml
Successfully scraped sitemap 1: 4664 URLs. Sleeping for 7 seconds.
Scraping https://www.cbinsights.com/sitemap/sitemap_company_2.xml
Successfully scraped sitemap 2: 4612 URLs. Sleeping for 7 seconds.
Scraping https://www.cbinsights.com/sitemap/sitemap_company_3.xml
Successfully scraped sitemap 3: 4554 URLs. Sleeping for 7 seconds.
Scraping https://www.cbinsights.com/sitemap/sitemap_company_4.xml
Successfully scraped sitemap 4: 4373 URLs. Sleeping for 6 seconds.
Scraping https://www.cbinsights.com/sitemap/sitemap_company_5.xml
Successfully scraped sitemap 5: 4549 URLs. Sleeping for 6 seconds.
Scraping https://www.cbinsights.com/sitemap/sitemap_company_6.xml
Successfully scraped sitemap 6: 4233 URLs. Sleeping for 7 seconds.
Scraping https://www.cbinsights.com/sitemap/sitemap_company_7.xml
Successfully scraped sitemap 7: 4896 URLs. Sleeping for 6 seconds.
Scraping https://www.cbinsights.com/sitemap/sitemap_company_8.xml
Suc


#### Find relevant URLs

We already have a list of roughly 8,000 startups from OurCrowd in `../datasets/profiles.csv`.\
We should only visit CB Insights links for companies that are also present on OurCrowd.
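The matching idea is to derive a name from each CB Insights URL slug, take exact matches first, and fall back to fuzzy matching for the rest. The sketch below illustrates it on a toy list. It uses the standard library's `difflib` as a stand-in for `rapidfuzz`, so its similarity scores are on a 0 to 1 scale and are not identical to `fuzz.ratio`. The helper names `slug_to_name` and `match_name` are hypothetical.

```python
from difflib import get_close_matches

def slug_to_name(url):
    """Turn '.../company/valera-health' into 'valera health'."""
    return url.strip().rsplit('/', 1)[-1].replace('-', ' ')

def match_name(name, candidates, cutoff=0.93):
    """Return an exact match if present, else the closest candidate
    scoring at least cutoff, else None."""
    if name in candidates:
        return name
    close = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return close[0] if close else None

# Toy URLs following the /company/<slug> pattern
urls = [
    'https://www.cbinsights.com/company/valera-health',
    'https://www.cbinsights.com/company/bestow',
]
candidates = {slug_to_name(u) for u in urls}

print(match_name('bestow', candidates))        # exact match
print(match_name('valera helth', candidates))  # fuzzy match despite the typo
print(match_name('snapcart', candidates))      # no sufficiently close candidate
```

A high cutoff keeps false positives low at the cost of missing companies whose names are spelled quite differently on the two sites.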

In [8]:
import pandas as pd
from rapidfuzz import fuzz, process

# Get OurCrowd names
df = pd.read_csv('../datasets/profiles.csv')
names_ourcrowd = list(df['name'].str.lower())

# Get CB Insights names and their line numbers so we can obtain their links after matching
names_cbinsights = {}
with open('./cbinsights/cbinsights_links.txt', 'r') as file:
    for i, line in enumerate(file):
        name = line.strip().rsplit('/', 1)[-1].replace('-', ' ')
        names_cbinsights[name] = i

# Exact matches
matches = []
remaining_names_ourcrowd = []
cbinsights_set = set(names_cbinsights.keys())

for name_ourcrowd in names_ourcrowd:
    if name_ourcrowd in cbinsights_set:
        line_number = names_cbinsights[name_ourcrowd]
        matches.append((name_ourcrowd, name_ourcrowd, 1, line_number))
    else:
        remaining_names_ourcrowd.append(name_ourcrowd)

num_exact_matches = len(matches)
num_remaining_names_ourcrowd = len(remaining_names_ourcrowd)
print(f'Number of exact matches: {num_exact_matches}')
print(f'Number of remaining names: {num_remaining_names_ourcrowd}')

# Fuzzy matches
count = 0
total = num_remaining_names_ourcrowd
percent_interval = 5
next_threshold = percent_interval

for name_ourcrowd in remaining_names_ourcrowd:
    progress = (count / total) * 100
    if progress >= next_threshold:
        print(f'{progress:.0f}% of remaining names processed.')
        next_threshold += percent_interval
    
    match, score, _ = process.extractOne(name_ourcrowd, cbinsights_set, scorer=fuzz.ratio)
    if score >= 93:
        line_number = names_cbinsights[match]
        matches.append((name_ourcrowd, match, score, line_number))
    count += 1

num_total_matches = len(matches)
print(f'Total matches: {num_total_matches}')
print(matches[:10])

# Store relevant URLs in file
with open('./cbinsights/cbinsights_links.txt', 'r') as file:
    cbinsights_lines = file.readlines()
    
relevant_urls = [cbinsights_lines[match[3]] for match in matches]
with open('./cbinsights/cbinsights_relevant_links.txt', 'w') as f:
    for url in relevant_urls:
        f.write(url.strip() + '/financials\n')
        
print(f'All {num_total_matches} relevant URLs have been stored in cbinsights/cbinsights_relevant_links.txt')

Number of exact matches: 7047
Number of remaining names: 1316
5% of remaining names processed.
10% of remaining names processed.
15% of remaining names processed.
20% of remaining names processed.
25% of remaining names processed.
30% of remaining names processed.
35% of remaining names processed.
40% of remaining names processed.
45% of remaining names processed.
50% of remaining names processed.
55% of remaining names processed.
60% of remaining names processed.
65% of remaining names processed.
70% of remaining names processed.
75% of remaining names processed.
80% of remaining names processed.
85% of remaining names processed.
90% of remaining names processed.
95% of remaining names processed.
Total matches: 7258
[('valera health', 'valera health', 1, 183229), ('bestow', 'bestow', 1, 226641), ('caura', 'caura', 1, 124237), ('plainid', 'plainid', 1, 264744), ('snapcart', 'snapcart', 1, 181676), ('slice', 'slice', 1, 184590), ('barn2door', 'barn2door', 1, 242655), ('deep sky', 'deep 

#### Scrape the URLs

Now let's scrape and parse each page using HTTP requests and BeautifulSoup, then store the results in `../datasets/financials.csv`.
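Most fields are pulled out of the page's summary paragraphs with regular expressions. The sketch below shows the idea on an invented sentence; it is not real page text, and "Acme" is a placeholder. The patterns are adapted from the `matchers` dictionary in the cell that follows, with the magnitude suffix written as `[TBMK]`. The function name `parse_summary` is hypothetical.

```python
import re

FUNDING_TOTAL = re.compile(r'raised\s+\$([0-9,.]+[TBMK]?)')
FUNDING_COUNT = re.compile(r'over\s+(\d+)\s+rounds?')
REVENUE = re.compile(r'(\d{4})\s+revenue\s+was\s+\$([0-9,.]+[TBMK]?)')

def parse_summary(text):
    """Extract funding and revenue figures from one summary paragraph."""
    out = {'funding_total': None, 'funding_count': None,
           'revenue_year': None, 'revenue_total': None}

    total = FUNDING_TOTAL.search(text)
    if total:
        out['funding_total'] = total.group(1)

    count = FUNDING_COUNT.search(text)
    if count:
        out['funding_count'] = count.group(1)

    revenue = REVENUE.search(text)
    if revenue:
        out['revenue_year'] = revenue.group(1)
        out['revenue_total'] = revenue.group(2)

    return out

# Invented example sentence, for illustration only
sample = "Acme has raised $12.5M over 3 rounds. Acme's 2023 revenue was $4.2M."
print(parse_summary(sample))
```

Fields whose pattern does not match stay `None`, which mirrors how missing values end up empty in the CSV.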

DO NOT RERUN THE CELL BELOW:

In [12]:
from bs4 import BeautifulSoup
from time import sleep
from random import randint
import requests
import re
import csv
import sys

# Note: the magnitude suffix is written as [TBMK]; commas inside a character
# class are literal, so [T,B,M,K] would also accept ',' as a suffix.
matchers = {
    'investor_count': re.compile(r'Investors Count'),

    'funding_total': re.compile(r'raised\s+\$([0-9,.]+[TBMK]?)'),
    'funding_count': [
        re.compile(r'over\s+(\d+)\s+rounds?'),
        re.compile(r'\s*(\d+)\s+fundings?')
    ],

    'funding_last_type': re.compile(r'latest funding round.*was\s+a\s+([\w\s\/-]+)\s+for'),
    'funding_last': re.compile(r'latest funding round.*for\s+\$(\S+[TBMK]?)'),
    'funding_last_date': re.compile(r'latest funding round.*on\s+([A-Za-z0-9,\s]+)'),

    'valuation': [
        re.compile(r'valuation.*(\$[0-9,.]+[TBMK])'),
        re.compile(r'valuation.*(\$[0-9,.]+\s*-\s*\$[0-9,.]+[TBMK])')
    ],
    'valuation_date': re.compile(r'valuation.*in(\s+([A-Za-z]+\s+\d{4})(?=\s+was))'),

    # Both revenue keys share one pattern: group 1 is the year, group 2 the amount
    'revenue_year': re.compile(r'(\d{4})\s+revenue\s+was\s+\$([0-9,.]+[TBMK]?)'),
    'revenue_total': re.compile(r'(\d{4})\s+revenue\s+was\s+\$([0-9,.]+[TBMK]?)')
}

# Fetches and parses page
def scrape(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36'
    }
    
    while True:
        response = requests.get(url, headers=headers, timeout=30)  # avoid hanging on a stalled connection
    
        if response.status_code == 429:
            retry_after = response.headers.get('Retry-After')
            if retry_after:
                print(f'Too many requests: {url}. Status code: {response.status_code}. Retrying after {retry_after} seconds.')
                sleep(int(retry_after))
            else:
                print(f'Too many requests: {url}. Status code: {response.status_code}. Aborting.')
                sys.exit(1)
        elif response.status_code == 404:
            return None
        elif response.status_code != 200:
            print(f'Failed to retrieve {url}. Status code: {response.status_code}')
            return None
        else:
            return BeautifulSoup(response.content, 'html.parser')
        
def clean_soup(full_soup):
    if not full_soup:
        return None
    title_div = full_soup.find('div', class_='flex flex-col')
    kpi_div = full_soup.find('div', {'data-test': 'kpi-section'})
    fundings_div = full_soup.find('div', {'data-test': 'fundings-section'})

    soup = BeautifulSoup('<div></div>', 'html.parser').div

    if title_div:
        soup.append(title_div)
    if kpi_div:
        soup.append(kpi_div)
    if fundings_div:
        soup.append(fundings_div)
    
    # No financials tab
    if not kpi_div or not fundings_div:
        return None

    return soup

def extract(soup):
    data = {
        'name': None,
        'website': None,
        'investor_count': None,
        'funding_count': None,
        'funding_total': None,
        'funding_last_type': None,
        'funding_last': None,
        'funding_last_date': None,
        'valuation': None,
        'valuation_date': None,
        'revenue_year': None,
        'revenue_total': None,
    }
    
    # Extract name
    name_match = soup.find('h1', class_='cbi-default pr-2 text-2xl font-medium text-black')
    if name_match:
        data['name'] = name_match.text.strip()
        
    # Extract website URL
    website_match = soup.find('a', class_='color--blue padding--top--s text-sm font-medium')
    if website_match:
        data['website'] = website_match['href'].strip()
    
    # Extract investor count
    investor_count_match = soup.find('h2', string=matchers['investor_count'])  # `string` replaces the deprecated `text` argument
    if investor_count_match:
        data['investor_count'] = investor_count_match.find_next('span').text.strip()

    funding_section = soup.find('h2', string=re.compile(r'Funding, Valuation & Revenue'))
    if funding_section:
        paragraphs = funding_section.find_all_next('p')
        for p in paragraphs:
            p_text = p.text.strip()

            # Extract total raised amount and total number of rounds
            for key in ['funding_total', 'funding_count']:
                if key == 'funding_count':
                    for pattern in matchers[key]:
                        funding_count_match = re.search(pattern, p_text)
                        if funding_count_match:
                            data[key] = funding_count_match.group(1).strip()
                            break 
                else:
                    match = re.search(matchers[key], p_text)
                    if match:
                        data[key] = match.group(1).strip()

            # Extract latest funding type, amount, and date
            for key in ['funding_last_type', 'funding_last', 'funding_last_date']:
                match = re.search(matchers[key], p_text)
                if match:
                    data[key] = match.group(1).strip()

            # Extract valuation amount and date
            if 'valuation' in p_text.lower():
                for key in ['valuation', 'valuation_date']:
                    if key == 'valuation':
                        for pattern in matchers[key]:
                            valuation_match = re.search(pattern, p_text)
                            if valuation_match:
                                data[key] = valuation_match.group(1).strip()
                                break
                    else:
                        match = re.search(matchers[key], p_text)
                        if match:
                            data[key] = match.group(1).strip()

            # Extract revenue amount and date
            if 'revenue' in p_text.lower():
                revenue_match = re.search(matchers['revenue_year'], p_text)
                if revenue_match:
                    data['revenue_year'] = revenue_match.group(1).strip()
                    data['revenue_total'] = revenue_match.group(2).strip()
                    
    # Return whatever was found, even if the funding section heading is missing
    return data

# Generator that yields financials
def process_websites(urls):
    for url in urls:
        dirty_soup = scrape(url)
        soup = clean_soup(dirty_soup)
        if soup:            
            yield extract(soup)
        else:
            yield None
            
# Store financials in CSV
def write_to_csv(financial, filename='../datasets/financials.csv'):
    fieldnames = [*financial]

    with open(filename, mode='a', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        if file.tell() == 0:
            writer.writeheader()

        writer.writerow(financial)

def main(urls):
    financials = process_websites(urls)
    processed_financials = 0
    total_financials = len(urls)
    previous_percent_processed = 0
    
    for financial in financials:
            
        processed_financials += 1
        if financial:
            write_to_csv(financial)
        
        percent_processed = (processed_financials / total_financials) * 100
        if percent_processed - previous_percent_processed >= 5:
            print(f'{(percent_processed):.1f}% of pages processed.')
            previous_percent_processed = percent_processed
    
        timeout = randint(1, 4)
        sleep(timeout)
        
    print(f'All {total_financials} pages processed.')
    
if __name__ == '__main__':
    test_urls = ['https://www.cbinsights.com/company/unitree/financials',
                 'https://www.cbinsights.com/company/slice/financials',
                 'https://www.cbinsights.com/company/roman-health-ventures/financials',
                ]
    
    urls = list()
    with open('./cbinsights/cbinsights_relevant_links.txt', 'r') as file:
        for line in file:
            urls.append(line.strip())
    
    # If program crashes, leftoff is the last index that was processed
    leftoff = 508
    main(urls[leftoff + 1:])

5.0% of pages processed.
10.0% of pages processed.
15.0% of pages processed.
20.0% of pages processed.
25.0% of pages processed.
30.0% of pages processed.
35.1% of pages processed.
40.1% of pages processed.
45.1% of pages processed.
50.1% of pages processed.
55.1% of pages processed.
60.1% of pages processed.
65.1% of pages processed.
70.1% of pages processed.
75.1% of pages processed.
Failed to retrieve https://www.cbinsights.com/company/actionfigure/financials. Status code: 503
80.1% of pages processed.
85.1% of pages processed.
90.1% of pages processed.
95.2% of pages processed.
All 6749 pages processed.


Notes:
1. The output above reports 6749 pages rather than all 7258 (6749 + 508 + 1) because an earlier run crashed prematurely and this run resumed after index 508 (`leftoff`).
2. `financials.csv` has 6872 entries rather than 7258 because some of the processed pages did not contain a financials tab.
3. There's no need to retry retrieving https://www.cbinsights.com/company/actionfigure/financials because its tab does not contain much info
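
The manual `leftoff` bookkeeping described in note 1 could be automated with a small checkpoint file. The following is a minimal sketch under that assumption, not part of the original run. The function names, the checkpoint path, and the `handle` callback (standing in for scraping and writing one URL) are all hypothetical.

```python
import os

def load_checkpoint(path):
    """Return the index of the last processed URL, or -1 if none is recorded."""
    if not os.path.exists(path):
        return -1
    with open(path, 'r', encoding='utf-8') as f:
        content = f.read().strip()
    return int(content) if content else -1

def save_checkpoint(path, index):
    """Record index as the last successfully processed URL."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(str(index))

def process_with_checkpoint(urls, handle, path):
    """Call handle(url) for each URL not yet processed, saving progress after each."""
    start = load_checkpoint(path) + 1
    for index in range(start, len(urls)):
        handle(urls[index])
        save_checkpoint(path, index)
```

If `handle` raises partway through, rerunning `process_with_checkpoint` with the same arguments resumes at the URL that failed instead of starting over.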