# OurCrowd

OurCrowd (https://www.ourcrowd.com/) lists around 8,000 startups, each with information such as its name, a summary, keywords, and news articles with sentiment data.

#### Check robots.txt

First, let's check robots.txt to make sure we only scrape pages that OurCrowd permits crawlers to access.

In [5]:
import requests

url = 'https://www.ourcrowd.com/robots.txt'
response = requests.get(url)

print(response.content.decode('utf-8'))

User-agent: *
Disallow: /myportfolio/
Disallow: /reset/
Disallow: /preset/
Disallow: /search/
Disallow: *?__hstc

Sitemap: https://www.ourcrowd.com/sitemap.xml


No `Disallow` rule matches the prefix `https://www.ourcrowd.com/startup/`, so we are allowed to scrape the startup pages.
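The same check can also be done programmatically. Below is a minimal sketch using the standard library's `urllib.robotparser`, fed with the rules printed above. `is_allowed` is a hypothetical helper that is not part of the original notebook. To our knowledge this parser does plain prefix matching and does not interpret the `*` wildcard in `Disallow: *?__hstc`, which does not matter for the `/startup/` check.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt output above
ROBOTS_TXT = """\
User-agent: *
Disallow: /myportfolio/
Disallow: /reset/
Disallow: /preset/
Disallow: /search/
Disallow: *?__hstc

Sitemap: https://www.ourcrowd.com/sitemap.xml
"""


def is_allowed(url, robots_txt=ROBOTS_TXT, user_agent='*'):
    """Return True if robots_txt permits user_agent to fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed('https://www.ourcrowd.com/startup/bestow'))
print(is_allowed('https://www.ourcrowd.com/search/fintech'))
```

The startup URL is reported as allowed and the `/search/` URL as disallowed, which agrees with the manual reading of the rules.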

#### Obtain URLs to scrape from sitemap

OurCrowd provides a sitemap at https://www.ourcrowd.com/sitemap.xml.
From a local copy of this sitemap (saved as `./ourcrowd/sitemap.xml`), we extract the URL of every startup page.

In [1]:
import xml.etree.ElementTree as ET

# Load and parse sitemap.xml
def extract_sitemap(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    namespace = {'ns': 'https://www.sitemaps.org/schemas/sitemap/0.9'}

    # Find each startup link
    extracted_urls = []
    for loc in root.findall('.//ns:url/ns:loc', namespaces=namespace):
        extracted_url = loc.text
        if extracted_url.startswith('https://www.ourcrowd.com/startup/'):
            extracted_urls.append(extracted_url)
    
    return extracted_urls


xml_file = './ourcrowd/sitemap.xml'
urls = extract_sitemap(xml_file)

print(len(urls))
print(urls[:20])

8363
['https://www.ourcrowd.com/startup/valera-health', 'https://www.ourcrowd.com/startup/bestow', 'https://www.ourcrowd.com/startup/mediktor', 'https://www.ourcrowd.com/startup/oros', 'https://www.ourcrowd.com/startup/caura', 'https://www.ourcrowd.com/startup/cloudkitchens', 'https://www.ourcrowd.com/startup/plainid', 'https://www.ourcrowd.com/startup/beereaders', 'https://www.ourcrowd.com/startup/snapcart', 'https://www.ourcrowd.com/startup/slice', 'https://www.ourcrowd.com/startup/barn2door', 'https://www.ourcrowd.com/startup/deep-sky', 'https://www.ourcrowd.com/startup/synapticure', 'https://www.ourcrowd.com/startup/ant-group', 'https://www.ourcrowd.com/startup/nomagic', 'https://www.ourcrowd.com/startup/dovetail', 'https://www.ourcrowd.com/startup/infosum', 'https://www.ourcrowd.com/startup/narvar', 'https://www.ourcrowd.com/startup/ownhome', 'https://www.ourcrowd.com/startup/decent']
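ElementTree matches namespaces by exact URI string, so the hard-coded value in `extract_sitemap` has to be identical to the one the file declares. The sitemap protocol defines `http://www.sitemaps.org/schemas/sitemap/0.9`; the cell above uses an `https://` variant, which evidently matched the saved file given the 8,363 results. A minimal sketch of a variant that reads the namespace from the root element instead of hard-coding it follows. `extract_startup_urls` and the three-entry sample sitemap are hypothetical and only stand in for the real file.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for sitemap.xml (made-up sample; the real file is far larger)
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.ourcrowd.com/startup/bestow</loc></url>
  <url><loc>https://www.ourcrowd.com/about</loc></url>
  <url><loc>https://www.ourcrowd.com/startup/oros</loc></url>
</urlset>
"""


def extract_startup_urls(xml_text, prefix='https://www.ourcrowd.com/startup/'):
    """Return <loc> values starting with prefix, whatever namespace the file declares."""
    root = ET.fromstring(xml_text)

    # A namespaced root tag looks like '{namespace-uri}urlset'
    uri = root.tag[1:].split('}', 1)[0] if root.tag.startswith('{') else ''
    loc_tag = f'{{{uri}}}loc' if uri else 'loc'

    urls = []
    for loc in root.iter(loc_tag):
        text = (loc.text or '').strip()
        if text.startswith(prefix):
            urls.append(text)
    return urls


print(extract_startup_urls(SAMPLE_SITEMAP))
```

For the real file, the same function could be called on the text of `./ourcrowd/sitemap.xml`.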


#### Scrape the URLs

Now let's fetch each startup page with HTTP requests, parse it with BeautifulSoup, and append the results to `../datasets/profiles.csv`.

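DO NOT RERUN THE CELL BELOW (it sends thousands of requests to OurCrowd, and because the CSV is opened in append mode, a rerun would duplicate rows):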

In [81]:
from bs4 import BeautifulSoup
from time import sleep
from random import randint
import requests
import json
import re
import itertools
import csv
import sys

# Fetches and parses page
def scrape(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36'
    }
    
    while True:
        response = requests.get(url, headers=headers)
    
        if response.status_code == 429:
            retry_after = response.headers.get('Retry-After')
            if retry_after:
                print(f'Too many requests: {url}. Status code: {response.status_code}. Retrying after {retry_after} seconds.')
                sleep(int(retry_after))
            else:
                print(f'Too many requests: {url}. Status code: {response.status_code}. Aborting.')
                sys.exit(1)
        elif response.status_code != 200:
            print(f'Failed to retrieve {url}. Status code: {response.status_code}')
            return None
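        else:
            soup = BeautifulSoup(response.content, 'html.parser')
            # The profile data is embedded as a JavaScript object in an inline <script> tag
            for script in soup.find_all('script'):
                scraped = script.string
                if scraped and 'const startupPulseCompany' in scraped:
                    match = re.search(r'\{(.*)\}', scraped, re.DOTALL)
                    if match:
                        return match.group(0)

            # No embedded profile found: return instead of re-requesting the page forever
            print(f'No profile data found in {url}.')
            return None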

# Extracts relevant details from JSON
def extract(scraped_json):    
    webpage_json = json.loads(scraped_json)

    profile = {
        'name': webpage_json.get('name'),
        'tagline': webpage_json.get('tagline'),
        'website': webpage_json.get('website'),
        'summary': webpage_json.get('summary'),
        'concepts': webpage_json.get('atAGlance'),
        'keywords': webpage_json.get('verticals'),
        'sentiment': webpage_json.get('aggregatedNewsData'),
        'articles': webpage_json.get('news'),
    }

    return profile 

# Generator that yields profiles
def process_websites(urls):
    for url in urls:
        scraped_json = scrape(url)
        
        if scraped_json:
            # Batch progress
            print('◆', end='', flush=True)

            # Timeout between page scrapes
            timeout = randint(0, 3)
            sleep(timeout)
            
            yield extract(scraped_json)

# Store profiles in CSV
def write_to_csv(profiles, filename='../datasets/profiles.csv'):
    fieldnames = ['name', 'tagline', 'website', 'summary', 'concepts', 'keywords', 'sentiment', 'articles']

    with open(filename, mode='a', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        if file.tell() == 0:
            writer.writeheader()

        writer.writerows(profiles)
            

def main(urls):
    profiles = process_websites(urls)
    batch_size = 60
    processed_profiles = 0
    total_profiles = len(urls)
    
    # Process in batches to avoid overwhelming the website with requests
    while True:
        print('[', end='')
        batch = list(itertools.islice(profiles, batch_size))
        if not batch:
            # Close the progress bracket opened for the final, empty batch
            print(']')
            break

        processed_profiles += len(batch)
        write_to_csv(batch)

        # Timeout between batches
        timeout = randint(5, 15)
        print(f'] {(processed_profiles / total_profiles * 100):.1f}% of pages processed. Sleeping for {timeout} seconds.')
        sleep(timeout)

    # Pages that failed to download or parse are skipped, so report both counts
    print(f'{processed_profiles} of {total_profiles} pages processed.')

if __name__ == '__main__':
    #test_urls = ['https://www.ourcrowd.com/startup/valera-health', 'https://www.ourcrowd.com/startup/bestow', 'https://www.ourcrowd.com/startup/mediktor', 'https://www.ourcrowd.com/startup/oros', 'https://www.ourcrowd.com/startup/caura', 'https://www.ourcrowd.com/startup/cloudkitchens', 'https://www.ourcrowd.com/startup/plainid', 'https://www.ourcrowd.com/startup/beereaders', 'https://www.ourcrowd.com/startup/snapcart', 'https://www.ourcrowd.com/startup/slice', 'https://www.ourcrowd.com/startup/barn2door', 'https://www.ourcrowd.com/startup/deep-sky', 'https://www.ourcrowd.com/startup/synapticure', 'https://www.ourcrowd.com/startup/ant-group', 'https://www.ourcrowd.com/startup/nomagic', 'https://www.ourcrowd.com/startup/dovetail', 'https://www.ourcrowd.com/startup/infosum', 'https://www.ourcrowd.com/startup/narvar', 'https://www.ourcrowd.com/startup/ownhome', 'https://www.ourcrowd.com/startup/decent']
    #main(test_urls)
    main(urls)

[◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆] 0.7% of pages processed. Sleeping for 14 seconds.
[◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆] 1.4% of pages processed. Sleeping for 14 seconds.
[◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆] 2.2% of pages processed. Sleeping for 15 seconds.
[◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆] 2.9% of pages processed. Sleeping for 9 seconds.
[◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆] 3.6% of pages processed. Sleeping for 8 seconds.
[◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆] 4.3% of pages processed. Sleeping for 7 seconds.
[◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆] 5.0% of pages processed. Sleeping for 15 seconds.
[◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆] 5.7% of pages processed. Sleeping for 6 seconds.
[◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆◆] 6.5% of pages processed. Sleeping for
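The `scrape` function extracts the profile from an inline script with a greedy regular expression, which captures everything from the first `{` to the last `}`. If a script ever contained further code after the object, that capture would no longer be valid JSON. One possible alternative, sketched below, is `json.JSONDecoder.raw_decode`, which parses a single JSON value and stops there. The sketch assumes the object is assigned as `const startupPulseCompany = {...}`; only the `const startupPulseCompany` marker is confirmed by the code above. `extract_embedded_json` and the sample script are hypothetical.

```python
import json

MARKER = 'const startupPulseCompany'


def extract_embedded_json(script_text, marker=MARKER):
    """Return the first JSON object after marker as a dict, or None (hypothetical helper)."""
    start = script_text.find(marker)
    if start == -1:
        return None

    # Position of the object's opening brace
    brace = script_text.find('{', start)
    if brace == -1:
        return None

    try:
        # raw_decode parses one JSON value starting at `brace` and ignores what follows
        obj, _end = json.JSONDecoder().raw_decode(script_text, brace)
    except json.JSONDecodeError:
        return None
    return obj


# Made-up script: trailing code would break a first-brace-to-last-brace capture
sample_script = '''
const startupPulseCompany = {"name": "Example Co", "verticals": ["Fintech"]};
function render() { return {}; }
'''

print(extract_embedded_json(sample_script))
```

Unlike the regex version, this returns an already parsed dictionary, so the separate `json.loads` call in `extract` would not be needed.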

We have successfully scraped and processed the webpage of every startup on OurCrowd and stored the data in `../datasets/profiles.csv`. 
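Because `csv.DictWriter` converts non-string values with `str()`, nested fields such as `keywords` or `articles` end up in the CSV as the text of Python literals rather than as JSON (assuming the page JSON holds lists or dictionaries in those fields). Below is a minimal sketch of how such columns could be restored with `ast.literal_eval`. `parse_cell`, `NESTED_COLUMNS`, and the sample row are hypothetical, and an in-memory buffer stands in for `../datasets/profiles.csv`.

```python
import ast
import csv
import io

# Columns assumed to hold lists/dicts; plain-text columns are left untouched
NESTED_COLUMNS = {'keywords', 'sentiment'}


def parse_cell(value):
    """Convert a CSV cell holding a Python literal back into an object (hypothetical helper)."""
    if value == '':
        return None  # the csv module writes None as an empty string
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # not a literal: keep the raw text


# Write one made-up profile the same way write_to_csv does
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['name', 'keywords', 'sentiment'])
writer.writeheader()
writer.writerow({'name': 'Example Co', 'keywords': ['Fintech', 'AI'], 'sentiment': None})

# Read it back, restoring only the nested columns
buffer.seek(0)
rows = [
    {key: parse_cell(val) if key in NESTED_COLUMNS else val for key, val in row.items()}
    for row in csv.DictReader(buffer)
]
print(rows)
```

Restricting the conversion to known nested columns avoids turning a text value that happens to look like a literal (for example a company named `123`) into a number.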