# THIS PROGRAM AIMS TO SCRAPE STARTUPS FROM THE YC WEBSITE #

##APPROACH##
We will be looking at the YC website. Depending on the ask, the program should cater towards a filter available by YC or by program. The steps for this project will be:

- Observe the YC HTML elements
- Recognize the patterns
- Zoom into elements that can provide name of the company
- Use the name of the company to go to YC’s company specific website
- Scrape information
- Rinse and repeat


In [1]:
# Set up for running selenium in Google Colab
## You don't need to run this code if you do it in Jupyter notebook, or other local Python setting
%%shell
sudo apt -y update
sudo apt install -y wget curl unzip
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb
CHROME_DRIVER_VERSION=`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/$CHROME_DRIVER_VERSION/chromedriver_linux64.zip -P /tmp/
unzip -o /tmp/chromedriver_linux64.zip -d /tmp/
chmod +x /tmp/chromedriver
mv /tmp/chromedriver /usr/local/bin/chromedriver
pip install selenium

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [830 kB]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Hit:7 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:9 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [1,756 kB]
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Hit:12 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Get:13 http://archive.ubuntu.com/ubuntu



In [2]:
!pip install chromedriver-autoinstaller

Collecting chromedriver-autoinstaller
  Downloading chromedriver_autoinstaller-0.6.4-py3-none-any.whl (7.6 kB)
Installing collected packages: chromedriver-autoinstaller
Successfully installed chromedriver-autoinstaller-0.6.4


In [3]:
# Taken from: https://nariyoo.com/python-how-to-run-selenium-in-google-colab/

import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import chromedriver_autoinstaller
import re
import requests


chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless') # ensure GUI is off
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# set path to chromedriver as per your configuration
chromedriver_autoinstaller.install()

# set up the webdriver
driver = webdriver.Chrome(options=chrome_options)

In [4]:
url = "https://www.ycombinator.com/companies/?batch=W24"
driver.get(url)

In [5]:
def page_load():
    """
    Loads the page. Scrolls down every two seconds.
    No parameters needed
    """
    last_height = driver.execute_script('return document.body.scrollHeight')
    while True:
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(2)  # wait for new elements to load
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break
        last_height = new_height

In [6]:
page_load()

In [7]:
source = driver.page_source

In [8]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [9]:
## Formatting company name to access YC company specific website

def replace(s):
    """Replaces any dots or spaces with -
    Param (company name)->string type

    Example:
    replace('InspectMind AI;)

    company = "voltage k"
    replace('voltage k)
    """
    return re.sub(r'[ .]', '-', s).lower()


In [10]:
## Parsing through [Company Name], [Location], [Description], [Category]
divs = soup.find_all('div', 'lg:max-w-[90%]')

## Taking company info and creating a web_name to iterate through later
company_info = []

for info in divs:
    name = info.find('span','_coName_99gj3_454').text
    location = info.find('span','_coLocation_99gj3_470').text
    descrip = info.find('span','_coDescription_99gj3_479').text

    dictionary = { 'Company': name,
                  'Location' : location,
                   'Description' : descrip,
                   'web_name' : replace(name)
    }

    company_info.append(dictionary)

In [11]:
df = pd.DataFrame(company_info)

In [12]:
df

Unnamed: 0,Company,Location,Description,web_name
0,Alacrity,"San Francisco, CA, USA",AI Based Account Takeover Prevention Platform,alacrity
1,ParcelBio,"San Francisco, CA, USA",Next-generation mRNA medicines,parcelbio
2,K-Scale Labs,"New York, NY, USA",Open-source humanoid robots,k-scale-labs
3,Marr Labs,"San Francisco, CA, USA",AI-voice agents that are indistinguishable fro...,marr-labs
4,Forge Rewards,"San Francisco, CA, USA",All-in-one operations software to power restau...,forge-rewards
...,...,...,...,...
243,Lantern,"San Francisco, CA, USA",Postgres vector database extension to build AI...,lantern
244,Danswer,,Open Source AI Assistant and Enterprise Search,danswer
245,Yenmo,"Bengaluru, KA, India",Secured consumer lending in India,yenmo
246,GovernGPT,"Toronto, ON, Canada",AI Back Office for Private Funds,governgpt


In [13]:
def find_emails(url):
    """ Finds email given a URL
    Param (url or "url")

    Examples:
    find_emails("https://www.ycombinator.com/companies/alacrity")

    url= "https://www.ycombinator.com/companies/alacrity"
    find_emails(url)
    """
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'

    emails = list(re.findall(email_regex, soup.get_text()))

    return emails


In [14]:
## Testing
url = "https://www.ycombinator.com/companies/alacrity"

find_emails(url)

['founders@joinalacrity.com']

In [15]:
## URL = https://www.ycombinator.com/companies/

list_founders = []
empty_sets = []

for index, data in df.iterrows():
    # Some companies might not have founders name
    # using try for code to keep running if founders name not found
    try:
        url = f"https://www.ycombinator.com/companies/{data['web_name']}"
        driver.get(url)
        iter_source = driver.page_source
        # Searching for elements
        soup2 = BeautifulSoup(iter_source, "html.parser")
        founders = soup2.find('div','space-y-5')
        comp = soup2.find('div', 'space-y-3')
        season = soup2.find('div', 'flex flex-row items-center gap-[6px]')
        tags = soup2.find_all('div', 'yc-tw-Pill rounded-sm bg-[#E6E4DC] uppercase tracking-widest px-3 py-[3px] text-[12px] font-thin')
        # Setting our identifiers
        comp_name = comp.find('h1', 'font-extralight').text
        founders_name = founders.find_all('h3', 'text-lg font-bold')
        cleaned_founders = [name.text for name in founders_name]
        linkedin_link = soup2.find('a', title='LinkedIn profile')['href']
        season_tag = season.find('span').text
        other_tags = [tag.text for tag in tags]
        # Organized dictionary
        fc_dic = {
            "URL": url,
            "Company" : comp_name,
            "Founders": cleaned_founders,
            "LinkedIn" : linkedin_link,
            "Season": season_tag,
            "Other Tags": other_tags,
            "Email": find_emails(url)
        }
        #Append dictionary to list
        list_founders.append(fc_dic)
        print('\x1b[6;30;42m' + 'Success!' + '\x1b[0m')
        print(f'{comp_name} has valid founder information. continue...')
    except:
        empty_sets.append(comp_name)
        print('\x1b[6;30;41m' + 'Error!' + '\x1b[0m')
        print(f'{comp_name} has empty information somwhere. continue...')
        continue

list_founders

[6;30;42mSuccess![0m
Alacrity has valid founder information. continue...
[6;30;42mSuccess![0m
ParcelBio has valid founder information. continue...
[6;30;42mSuccess![0m
K-Scale Labs has valid founder information. continue...
[6;30;42mSuccess![0m
Marr Labs has valid founder information. continue...
[6;30;42mSuccess![0m
Forge Rewards has valid founder information. continue...
[6;30;42mSuccess![0m
FanCave has valid founder information. continue...
[6;30;42mSuccess![0m
RetailReady has valid founder information. continue...
[6;30;42mSuccess![0m
Million has valid founder information. continue...
[6;30;42mSuccess![0m
NowHouse has valid founder information. continue...
[6;30;42mSuccess![0m
Crux has valid founder information. continue...
[6;30;42mSuccess![0m
Reprompt has valid founder information. continue...
[6;30;42mSuccess![0m
InspectMind AI has valid founder information. continue...
[6;30;41mError![0m
InspectMind AI has empty information somwhere. continue...
[6;30

[{'URL': 'https://www.ycombinator.com/companies/alacrity',
  'Company': 'Alacrity',
  'Founders': ['Omar Draz', 'Anderthan Hsieh'],
  'LinkedIn': 'https://www.linkedin.com/company/100487698/admin',
  'Season': 'W24',
  'Other Tags': ['Active',
   'identity',
   'fraud-prevention',
   'fraud-detection',
   'San Francisco'],
  'Email': ['founders@joinalacrity.com']},
 {'URL': 'https://www.ycombinator.com/companies/parcelbio',
  'Company': 'ParcelBio',
  'Founders': ['David Weinberg', 'Chris Carlson'],
  'LinkedIn': 'https://www.linkedin.com/company/parcelbio',
  'Season': 'W24',
  'Other Tags': ['Active',
   'gene-therapy',
   'biotech',
   'healthcare',
   'drug-delivery',
   'therapeutics',
   'San Francisco'],
  'Email': ['founders@parcelbio.com']},
 {'URL': 'https://www.ycombinator.com/companies/k-scale-labs',
  'Company': 'K-Scale Labs',
  'Founders': ['Benjamin Bolte', 'Pawel Budzianowski', 'Matthew Freed'],
  'LinkedIn': 'https://www.linkedin.com/company/kscale/',
  'Season': 'W24

In [16]:
## Making a copy in case to alter the copy instead of original
copy_list = list_founders

In [17]:
founder_df = pd.DataFrame(copy_list)

In [18]:
founder_df

Unnamed: 0,URL,Company,Founders,LinkedIn,Season,Other Tags,Email
0,https://www.ycombinator.com/companies/alacrity,Alacrity,"[Omar Draz, Anderthan Hsieh]",https://www.linkedin.com/company/100487698/admin,W24,"[Active, identity, fraud-prevention, fraud-det...",[founders@joinalacrity.com]
1,https://www.ycombinator.com/companies/parcelbio,ParcelBio,"[David Weinberg, Chris Carlson]",https://www.linkedin.com/company/parcelbio,W24,"[Active, gene-therapy, biotech, healthcare, dr...",[founders@parcelbio.com]
2,https://www.ycombinator.com/companies/k-scale-...,K-Scale Labs,"[Benjamin Bolte, Pawel Budzianowski, Matthew F...",https://www.linkedin.com/company/kscale/,W24,"[Active, artificial-intelligence, machine-lear...",[ben@kscale.dev]
3,https://www.ycombinator.com/companies/marr-labs,Marr Labs,"[Dave Grannan, Han Shu]",https://www.linkedin.com/company/marrlabs,W24,"[Active, artificial-intelligence, ai, ai-assis...",[hello@marrlabs.com]
4,https://www.ycombinator.com/companies/forge-re...,Forge Rewards,"[Ethan Chang, Isaac Kan]",https://linkedin.com/company/forgerewards,W24,"[Active, fintech, food-tech, ai, San Francisco]",[founders@forgerewards.com]
...,...,...,...,...,...,...,...
235,https://www.ycombinator.com/companies/lantern,Lantern,"[Bastien Beurier, Guillaume Lachaud]",https://www.linkedin.com/in/bastienbeurier/,S19,"[Inactive, artificial-intelligence, San Franci...",[]
236,https://www.ycombinator.com/companies/danswer,Danswer,"[Yuhong Sun, Chris Weaver]",https://www.linkedin.com/in/danswer-ai/,W24,[Active],[]
237,https://www.ycombinator.com/companies/yenmo,Yenmo,"[Ashutosh Purohit, Aryan Agarwal]",https://www.linkedin.com/company/yenmo-in/,W24,"[Active, fintech, lending, consumer-finance]",[]
238,https://www.ycombinator.com/companies/governgpt,GovernGPT,"[Mamal Amini, Oliver Walerys]",https://www.linkedin.com/company/93647085/,W24,"[Active, artificial-intelligence, finance, b2b...",[mamal@governgpt.ai]


In [19]:
def non_duplicates(dataframe):
    """Removes duplicates and return a cleaned dataframe

    Param (dataframe)-->DF pandas
    """
    duplicates = dataframe.duplicated('Company', keep=False)
    cleaned_dataframe = dataframe.drop_duplicates('Company', keep = 'first')
    return cleaned_dataframe

In [20]:
cleaned_founder_df = non_duplicates(founder_df)

In [21]:
complete_df = pd.merge(df, cleaned_founder_df, on='Company', how='outer')

In [22]:
complete_df

Unnamed: 0,Company,Location,Description,web_name,URL,Founders,LinkedIn,Season,Other Tags,Email
0,Alacrity,"San Francisco, CA, USA",AI Based Account Takeover Prevention Platform,alacrity,https://www.ycombinator.com/companies/alacrity,"[Omar Draz, Anderthan Hsieh]",https://www.linkedin.com/company/100487698/admin,W24,"[Active, identity, fraud-prevention, fraud-det...",[founders@joinalacrity.com]
1,ParcelBio,"San Francisco, CA, USA",Next-generation mRNA medicines,parcelbio,https://www.ycombinator.com/companies/parcelbio,"[David Weinberg, Chris Carlson]",https://www.linkedin.com/company/parcelbio,W24,"[Active, gene-therapy, biotech, healthcare, dr...",[founders@parcelbio.com]
2,K-Scale Labs,"New York, NY, USA",Open-source humanoid robots,k-scale-labs,https://www.ycombinator.com/companies/k-scale-...,"[Benjamin Bolte, Pawel Budzianowski, Matthew F...",https://www.linkedin.com/company/kscale/,W24,"[Active, artificial-intelligence, machine-lear...",[ben@kscale.dev]
3,Marr Labs,"San Francisco, CA, USA",AI-voice agents that are indistinguishable fro...,marr-labs,https://www.ycombinator.com/companies/marr-labs,"[Dave Grannan, Han Shu]",https://www.linkedin.com/company/marrlabs,W24,"[Active, artificial-intelligence, ai, ai-assis...",[hello@marrlabs.com]
4,Forge Rewards,"San Francisco, CA, USA",All-in-one operations software to power restau...,forge-rewards,https://www.ycombinator.com/companies/forge-re...,"[Ethan Chang, Isaac Kan]",https://linkedin.com/company/forgerewards,W24,"[Active, fintech, food-tech, ai, San Francisco]",[founders@forgerewards.com]
...,...,...,...,...,...,...,...,...,...,...
245,Yenmo,"Bengaluru, KA, India",Secured consumer lending in India,yenmo,https://www.ycombinator.com/companies/yenmo,"[Ashutosh Purohit, Aryan Agarwal]",https://www.linkedin.com/company/yenmo-in/,W24,"[Active, fintech, lending, consumer-finance]",[]
246,GovernGPT,"Toronto, ON, Canada",AI Back Office for Private Funds,governgpt,https://www.ycombinator.com/companies/governgpt,"[Mamal Amini, Oliver Walerys]",https://www.linkedin.com/company/93647085/,W24,"[Active, artificial-intelligence, finance, b2b...",[mamal@governgpt.ai]
247,Stitch Technologies,"London, England, United Kingdom",-,stitch-technologies,https://www.ycombinator.com/companies/stitch-t...,"[Till Kern, Yuriy Oparenko]",https://www.linkedin.com/company/stitch-tech,W24,"[Inactive, developer-tools, saas]",[]
248,TableFlow,,,,https://www.ycombinator.com/companies/inquery,"[Mitch Patin, Eric Ciminelli]",https://www.linkedin.com/company/tableflowhq,W23,"[Active, artificial-intelligence, developer-to...",[]


In [23]:
complete_df.to_csv('YCscrape.csv')

In [24]:
# User Search Function
# want to show:
# company, location, description, web_name, founders, linkedin, email, season, other tags

def search():
    """ Input company name to output information including
    [company, location, description, web_name, founders, linkedin,
    email, season, other tags]

    information is validated within function

    param("any company name")--> string type
    """
    company_name = input("Please input a company name that you would like to see information about. ").lower()
    valid_names = [name.lower() for name in complete_df['Company']]
    while company_name in valid_names:
        return complete_df[complete_df['Company'].str.lower() == company_name]
    else:
        return "We cannot find your company. Please try again."

search()




Please input a company name that you would like to see information about. Alacrity


Unnamed: 0,Company,Location,Description,web_name,URL,Founders,LinkedIn,Season,Other Tags,Email
0,Alacrity,"San Francisco, CA, USA",AI Based Account Takeover Prevention Platform,alacrity,https://www.ycombinator.com/companies/alacrity,"[Omar Draz, Anderthan Hsieh]",https://www.linkedin.com/company/100487698/admin,W24,"[Active, identity, fraud-prevention, fraud-det...",[founders@joinalacrity.com]


In [25]:
driver.quit()