# Project Topic: Scrape Company Information Listed on Y Combinator

## Project Outline - (Last updated: March, 2022)
The objective is to scrape information about every company listed on Y Combinator (https://ycombinator.com/companies) and perform some basic analysis on the data.

## Information to scrape
The fields below are scraped for analysis. The page section each field comes from is given in parentheses and indicated in the image below.
1. `company_name` (company's summary and tags)
1. `short_description` (company's summary and tags)
1. `tags` (company's summary and tags)
1. `link` (company's link)
1. `company_socials` (company's info)
1. `founded` (company's info)
1. `team_size` (company's info)
1. `location` (company's info)
1. `active_founders` (founders' description)
1. `about_founders` (founders' info)
1. `description` (company's description)


<img width="1353" alt="Screenshot 2022-04-03 at 7 29 58 PM" src="https://user-images.githubusercontent.com/55639062/162624402-ed21f6f2-ab55-46e4-aa4a-58b91b093ed6.png">


## Output data sample

<img width="1972" alt="Screenshot 2022-04-10 at 3 42 02 PM" src="https://user-images.githubusercontent.com/55639062/162624837-b2dc1d20-e0b2-44aa-9a1b-27a0ef2b3970.png">

### Important Notice
If the code breaks, the first thing to check is whether the HTML tags and class names used in the code are still valid. If they are not, update them to match the current page markup.
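
One quick way to run this check is to test whether the class names the scraper depends on still occur in a fetched page. The sketch below uses only the standard library; `ClassCollector` and `missing_classes` are hypothetical helpers that are not part of the original notebook, and the sample HTML is made up for illustration (only the class names are taken from the code in this document).

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'class' and value:
                self.seen.update(value.split())


def missing_classes(html, expected):
    """Return the expected class names that no tag in `html` carries."""
    collector = ClassCollector()
    collector.feed(html)
    return sorted(set(expected) - collector.seen)


# Made-up page fragment: two of the three expected classes are present.
sample = '<div class="space-y-3"><h1>Airbnb</h1><div class="text-xl">Book</div></div>'
print(missing_classes(sample, ['space-y-3', 'text-xl', 'ycdc-badge']))
```

Any name in the returned list is a selector that probably needs updating in the scraper.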

## Import Necessary Libraries

In [1]:
from __future__ import print_function
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time, re, os
from datetime import datetime as dt
from pprint import pprint

## Run selenium driver
This function handles the dynamic part of the page: it scrolls the companies page until no more companies load. Once all the companies have been fetched, BeautifulSoup extracts the link to each company together with its location.

`URL`: https://www.ycombinator.com/companies

In [8]:

from selenium.webdriver.chrome.service import Service

def run_seleniun_and_get_page_source():
    url = 'https://www.ycombinator.com/companies'
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')

    # Selenium 4 takes the driver path through a Service object. Try the path in
    # CHROMEDRIVER_HOME first, otherwise let webdriver_manager download a driver.
    try:
        path = os.getenv('CHROMEDRIVER_HOME')
        driver = webdriver.Chrome(service=Service(path), options=options)
    except Exception:
        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

    try:
        driver.get(url)
        time.sleep(10)  # wait for the first batch of companies to render

        # Scroll until the page height stops growing
        check_page_length = 0
        while True:
            page_len = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
            time.sleep(0.5)

            if check_page_length == page_len:
                break
            check_page_length = page_len

        page_source = driver.page_source
    finally:
        # Always release the browser, even if loading or scrolling fails
        driver.quit()

    # Parse the fully loaded page and extract each company's link and location
    selenium_web_content = BeautifulSoup(page_source, 'lxml')
    get_company_list_block = selenium_web_content.find_all('a', class_='styles-module__company___1UVnl no-hovercard')
    full_location_block = selenium_web_content.find_all('span', class_='styles-module__coLocation___yhKam')

    url_and_location = [
        [link['href'], location.text.strip()]
        for link, location in zip(get_company_list_block, full_location_block)
    ]
    lenght = len(url_and_location)

    return url_and_location, lenght
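
The stopping rule of the scroll loop can be isolated and tried without a browser. This is a minimal sketch, assuming a driver object that exposes Selenium's `execute_script`; `scroll_to_bottom`, its `max_rounds` cap and `FakeDriver` are illustrative names that do not appear in the original notebook. Unlike the function above, it reads the page height after the pause and stops after a fixed number of rounds.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_rounds=200):
    """Scroll until document.body.scrollHeight stops growing; return the final height."""
    last_height = -1
    for _ in range(max_rounds):  # assumed safety cap, not in the original loop
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to render
        height = driver.execute_script("return document.body.scrollHeight;")
        if height == last_height:
            break
        last_height = height
    return last_height


class FakeDriver:
    """Stand-in for a Selenium driver: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.reads = 0

    def execute_script(self, script):
        if script.startswith("return"):
            height = self.heights[min(self.reads, len(self.heights) - 1)]
            self.reads += 1
            return height
        return None  # scroll commands have no return value


print(scroll_to_bottom(FakeDriver(), pause=0))
```

With a real driver, the same function would be called after `driver.get(url)` in place of the `while True` loop.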



## Get Company's Info

The main HTML element for the company's summary (name, short description and tags) is `soup.find('div',  class_='space-y-3')`. It was consistent across company pages, which made extraction straightforward.

In [19]:

def get_company_info(soup):
    info = {}

    # Summary block: name, short description and tags
    summary = soup.find('div',  class_='space-y-3')
    info['company_name'] = summary.h1.text
    info['link'] = soup.find('div', class_='flex flex-row items-center leading-none px-3').a['href']
    info['short_description'] = summary.find('div',  class_='text-xl').text

    badges = summary.find_all('span',  class_='ycdc-badge')
    info['tags'] = [badge.text.replace('Y Combinator Logo', '') for badge in badges]
    info['description'] = soup.p.text

    # The facts block alternates label and value spans. Each label, minus its
    # last character, becomes a key such as 'founded' or 'team_size'.
    facts = [fact.text for fact in soup.find('div', class_="space-y-0.5").find_all('span')]
    for label, value in zip(facts[0::2], facts[1::2]):
        key_ = label.lower().replace(' ', '_')[:-1]
        info[key_] = value

    info['company_socials'] = [a['href'] for a in soup.find('div',  class_='space-x-2').find_all('a')]

    return info
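
To see what the label/value pairing in `get_company_info` produces, the same idea can be run on plain strings. This is only a sketch: `pair_facts` is a hypothetical helper, and the input list assumes that each label ends with a colon (which is what the `[:-1]` slice in the function suggests); the values are taken from the Airbnb row in the sample output.

```python
def pair_facts(texts):
    """Turn alternating label/value strings into a dict with snake_case keys."""
    facts = {}
    for label, value in zip(texts[0::2], texts[1::2]):
        # Assumed label format: "Team Size:" -> "team_size"
        key = label.strip().lower().replace(' ', '_').rstrip(':')
        facts[key] = value.strip()
    return facts


print(pair_facts(['Founded:', '2008', 'Team Size:', '5000', 'Location:', 'San Francisco']))
```

A trailing label without a value is silently dropped, because `zip` stops at the shorter of the two slices.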

## Get Founder's Info

This part was trickier because the markup is not consistent across companies: some company pages contain more information about the founders than others. To see the differences, compare:
- `Airbnb` - https://www.ycombinator.com/companies/airbnb
- `Coinbase` -  https://www.ycombinator.com/companies/coinbase
- `Pardes Bio` - https://www.ycombinator.com/companies/pardes-bio

In [None]:
def get_founders_info(soup):
    founders_info = {}
    try:
        founders = soup.find('div', class_='space-y-5')
        founders_info['active_founders'] = [name.div.text for name in founders.find_all('div', class_='leading-snug')]
        about_founder = {}
        all_about = []
        for founder in founders.find_all('div', class_='leading-snug'):
            about_founder = {}
            name = founder.find('div', class_='font-bold').text

            about_founder['name'] = name

            divs = [ what for what in founder.find_all('div')]
            about_founder['role'] = divs[1].text
            about_founder['social_media_links'] = [link['href'] for link in founder.find('div', class_='mt-1 space-x-2').find_all('a')]

            all_about.append(about_founder)
        
    except Exception:
        # Fallback for pages whose founders block uses 'space-y-4'
        founders = soup.find('div', class_='space-y-4')
        founders_info['active_founders'] = [name.div.text for name in founders.find_all('div', class_='leading-snug')]
        all_about = []
        for founder in founders.find_all('div', class_='leading-snug'):
            about_founder = {}
            name = founder.find('div', class_='font-bold').text

            about_founder['name'] = name

            divs = [ what for what in founder.find_all('div')]
            about_founder['role'] = divs[1].text
            about_founder['social_media_links'] = [link['href'] for link in founder.find('div', class_='mt-1 space-x-2').find_all('a')]

            all_about.append(about_founder)
            
    founders_info['about_founders'] = all_about
    return founders_info


In [37]:
def get_founders_info(soup):
    founders_info = {}
    try:
        founders = soup.find('div', class_='space-y-5')
        founders_info['active_founders'] = [name.div.text for name in founders.find_all('div', class_='leading-snug')]

        all_about = []
        for what in founders.find_all('div', class_='flex flex-row gap-3 items-start flex-col md:flex-row'):
            about_founder = {}
            name = what.h3.text
            split = name.split(', ')

            about_founder['name'] = name

            if split[0] != split[-1]:
                about_founder['role'] = split[-1]
            else:
                about_founder['role'] = ''

            about_founder['social_media_links'] = [link['href'] for link in what.find('div', class_='mt-1 space-x-2').find_all('a')]
            
            all_about.append(about_founder)

        
    except Exception:
        # Fallback for pages whose founders block uses 'space-y-4'
        founders = soup.find('div', class_='space-y-4')
        founders_info['active_founders'] = [name.div.text for name in founders.find_all('div', class_='leading-snug')]
        all_about = []
        for founder in founders.find_all('div', class_='leading-snug'):
            about_founder = {}
            name = founder.find('div', class_='font-bold').text

            about_founder['name'] = name

            divs = [ what for what in founder.find_all('div')]
            about_founder['role'] = divs[1].text
            about_founder['social_media_links'] = [link['href'] for link in founder.find('div', class_='mt-1 space-x-2').find_all('a')]

            all_about.append(about_founder)
            
    founders_info['about_founders'] = all_about
    return founders_info
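
The role extraction relies on founder headings of the form `Name, Role`, as in the Airbnb output further down. A minimal sketch of that split on plain strings follows; `split_name_and_role` is a hypothetical helper, and unlike the function above it also strips the role from the returned name. It assumes the role follows the last `', '`, so a name that itself ends in a comma-separated suffix would be split incorrectly.

```python
def split_name_and_role(heading):
    """Split 'Name, Role' into (name, role); role is '' when no comma is present."""
    name, sep, role = heading.rpartition(', ')
    if not sep:
        # No ', ' in the heading: the whole string is the name
        return heading.strip(), ''
    return name.strip(), role.strip()


print(split_name_and_role('Nathan Blecharczyk, CTO'))
print(split_name_and_role('Brian Armstrong'))
```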


## Scrape all Info
This function puts the scrapers together, i.e. the company info scraper and the founder info scraper.

In [13]:
def scrape_info(link_and_location):
    main_url = 'https://www.ycombinator.com'
    url = main_url + link_and_location[0]
    source = requests.get(url).text
    soup = BeautifulSoup(source, 'lxml')

    company_all_info = get_company_info(soup)
    company_all_info['full_location'] = link_and_location[1]
    try:
        founder_info = get_founders_info(soup)
    except Exception:  # founders could not be parsed; fall back to empty values
        founder_info = {
            "active_founders":[],
            "about_founders": []
        }
    company_all_info.update(founder_info)

    return company_all_info


## Save the scraped data
This converts the scraped info to a pandas DataFrame, rearranges the columns and writes the result to a CSV file.

In [14]:
def save_to_csv(scraped_info, savepath):
    df = pd.DataFrame(scraped_info)
    df = df[['company_name', 'link', 'short_description', 'tags',
             'company_socials', 'founded', 'team_size', 'full_location', 'location',
             'active_founders', 'about_founders', 'description']]

    df.to_csv(savepath, index=False)
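
List-valued columns such as `tags` are written to CSV as their string representation, so they come back as strings when the file is read again. The sketch below shows the round trip with the standard `csv` module and recovers the list with `ast.literal_eval`; the single row is illustrative, using values from the sample output.

```python
import ast
import csv
import io

rows = [{'company_name': 'Airbnb', 'tags': ['W09', 'Public', 'Marketplace', 'Travel']}]

# Write to an in-memory CSV; the list is stored as its str() representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['company_name', 'tags'])
writer.writeheader()
writer.writerows(rows)

# Read it back: every value is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]['tags']).__name__)

# literal_eval safely parses the string back into a Python list
tags = ast.literal_eval(loaded[0]['tags'])
print(tags)
```

With pandas, the same conversion can be applied while loading through the `converters` argument of `pd.read_csv`. Empty cells would need separate handling, since `ast.literal_eval` cannot parse an empty string.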

## Get all links from Y Combinator

In [9]:

main_url = 'https://www.ycombinator.com'

start = dt.now()
url_and_location, lenght = run_seleniun_and_get_page_source()

runtime = (dt.now() - start).total_seconds()
print(f'Total scroll runtime - {runtime}')



Total scroll runtime - 15.207302


## Scraping without multi-threading
Scraping the companies one at a time took `~15 mins` for 1000 companies. The cell output recorded below reports about 39 secs, so treat these figures as indicative only.

`Note:` The runtime can be faster or slower depending on internet connectivity.

In [20]:
companies = []
pending = list(url_and_location)  # copy, so url_and_location stays intact for later cells
max_passes = 3  # assumed cap so a permanently failing page cannot loop forever

start_no_thread = dt.now()
for _ in range(max_passes):
    retries = []
    for count, link_href in enumerate(pending, start=1):
        try:
            print(f"{count}/{len(pending)}", end='\r')
            company_all_info = scrape_info(link_href)
            companies.append(company_all_info)

        except Exception as e:
            print(e)
            retries.append(link_href)
    if not retries:
        break
    pending = retries  # next pass retries only the pages that failed


runtime_no_thread = (dt.now() - start_no_thread).total_seconds()
print(f'Total runtime - {runtime_no_thread}')

Total runtime - 38.820362


## Scraping using multi-threading
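Multithreading is considerably faster here because most of the time is spent waiting on network responses. The script ran for about `~57.5 secs`, compared with `~15 mins` for the sequential approach. The cell output recorded below reports about 3.6 secs, so treat these figures as indicative only.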

`Note:` The runtime can be faster or slower depending on internet connectivity.

In [23]:
import concurrent.futures as cf


start_thread = dt.now()
m_companies = []
with cf.ThreadPoolExecutor() as exc:
    results = exc.map(scrape_info, url_and_location)

    for result in results:
        m_companies.append(result)

runtime_thread = (dt.now() - start_thread).total_seconds()
print(f'Total runtime - {runtime_thread}')

Total runtime - 3.564972
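
When results are read from `exc.map`, the first page that raises an exception stops the loop and the remaining results are lost. One alternative is to submit each page separately and record failures next to the results. This is a hedged sketch rather than the notebook's method: `scrape_all`, `fake_scrape` and the `/companies/broken` path are hypothetical and stand in for `scrape_info` and real company links.

```python
import concurrent.futures as cf


def scrape_all(items, worker, max_workers=8):
    """Run worker(item) in a thread pool; return (results, failures)."""
    results, failures = [], []
    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, item): item for item in items}
        for future in cf.as_completed(futures):
            item = futures[future]
            try:
                results.append(future.result())
            except Exception as error:
                failures.append((item, error))  # keep the item so it can be retried
    return results, failures


def fake_scrape(item):
    """Stand-in for scrape_info that fails on one made-up link."""
    if item[0] == '/companies/broken':
        raise ValueError('page layout changed')
    return {'link': item[0], 'full_location': item[1]}


items = [['/companies/airbnb', 'San Francisco, CA, USA'], ['/companies/broken', 'Nowhere']]
results, failures = scrape_all(items, fake_scrape)
print(len(results), len(failures))
```

Results arrive in completion order rather than input order, so sort them afterwards if the original order matters.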


## Save as CSV

In [18]:
import pandas as pd

save_to_csv(scraped_info=m_companies, savepath='ycombinator.csv')
       

In [25]:
df = pd.read_csv('ycombinator.csv')
print(df.shape)
df.head()

(1000, 12)


| | company_name | link | short_description | tags | company_socials | founded | team_size | full_location | location | active_founders | about_founders | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Airbnb | http://airbnb.com | Book accommodations around the world. | ['W09', 'Public', 'Marketplace', 'Travel'] | ['https://www.linkedin.com/company/airbnb/', '... | 2008.0 | 5000.0 | San Francisco, CA, USA | San Francisco | ['Nathan Blecharczyk', 'Brian Chesky', 'Joe Ge... | [{'name': 'Joe Gebbia, CPO', 'role': 'CPO', 's... | Founded in August of 2008 and based in San Fra... |
| 1 | Coinbase | https://www.coinbase.com | Buy, sell, and manage cryptocurrencies. | ['S12', 'Public'] | ['https://twitter.com/coinbase', 'https://www.... | 2012.0 | 500.0 | San Francisco, CA, USA | San Francisco | ['Brian Armstrong'] | [{'name': 'Brian Armstrong', 'role': 'CEO', 's... | Founded in June of 2012, Coinbase is a digital... |
| 2 | DoorDash | http://doordash.com | Restaurant delivery. | ['S13', 'Public'] | ['https://www.linkedin.com/company/doordash/',... | | 1600.0 | San Francisco, CA, USA | San Francisco | ['Tony Xu'] | [{'name': 'Andy Fang', 'role': 'Founder', 'soc... | Founded in 2013, DoorDash is a San Francisco-b... |
| 3 | GitLab | http://gitlab.com/ | A complete DevOps platform delivered as a sing... | ['W15', 'Public', 'DevSecOps', 'Developer Tools'] | ['https://www.linkedin.com/company/gitlab-com'... | 2012.0 | 1200.0 | San Francisco, CA, USA | San Francisco | ['Dmitriy Zaporozhets'] | [{'name': 'Sid Sijbrandij', 'role': 'CEO', 'so... | GitLab is the first single application for the... |
| 4 | Dropbox | http://dropbox.com | Backup and share files in the cloud. | ['S07', 'Public'] | ['https://www.linkedin.com/company/dropbox/', ... | 2008.0 | 4000.0 | San Francisco, CA, USA | San Francisco | ['Arash Ferdowsi', 'Drew Houston'] | [{'name': 'Drew Houston, CEO', 'role': 'CEO', ... | Dropbox is building the world’s first smart wo... |


In [2]:
url = 'https://www.ycombinator.com/companies/airbnb'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

In [38]:
info = get_founders_info(soup)

pprint(info)

{'about_founders': [{'name': 'Nathan Blecharczyk, CTO',
                     'role': 'CTO',
                     'social_media_links': ['https://twitter.com/nathanblec',
                                            'https://www.linkedin.com/in/blecharczyk/']},
                    {'name': 'Brian Chesky, CEO',
                     'role': 'CEO',
                     'social_media_links': ['https://twitter.com/bchesky']},
                    {'name': 'Joe Gebbia, CPO',
                     'role': 'CPO',
                     'social_media_links': ['https://twitter.com/jgebbia']}],
 'active_founders': ['Nathan Blecharczyk', 'Brian Chesky', 'Joe Gebbia']}


In [41]:
df = pd.read_csv('ycombinator.csv')
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 2000)
df.head(10)

| | company_name | link | short_description | tags | company_socials | founded | team_size | full_location | location | active_founders | about_founders | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Airbnb | http://airbnb.com | Book accommodations around the world. | ['W09', 'Public', 'Travel', 'Marketplace'] | ['https://www.linkedin.com/company/airbnb/', '... | 2008.0 | 5000.0 | San Francisco, CA, USA | San Francisco | ['Nathan Blecharczyk', 'Brian Chesky', 'Joe Ge... | [{'name': 'Nathan Blecharczyk, CTO', 'role': '... | Founded in August of 2008 and based in San Fra... |
| 1 | Amplitude | https://amplitude.com | Pioneering Digital Optimization | ['W12', 'Public', 'B2B'] | ['https://www.crunchbase.com/organization/ampl... | 2012.0 | 500.0 | San Francisco, CA, USA | San Francisco | ['Spenser Skates'] | [{'name': 'Curtis Liu', 'role': 'CTO', 'social... | Amplitude is the global leader in Digital Opti... |
| 2 | DoorDash | http://doordash.com | Restaurant delivery. | ['S13', 'Public'] | ['https://www.linkedin.com/company/doordash/',... | | 1600.0 | San Francisco, CA, USA | San Francisco | ['Tony Xu'] | [{'name': 'Andy Fang', 'role': 'Founder', 'soc... | Founded in 2013, DoorDash is a San Francisco-b... |
| 3 | Coinbase | https://www.coinbase.com | Buy, sell, and manage cryptocurrencies. | ['S12', 'Public'] | ['https://twitter.com/coinbase', 'https://www.... | 2012.0 | 500.0 | Remote; San Francisco, CA, USA | San Francisco | ['Brian Armstrong'] | [{'name': 'Brian Armstrong', 'role': 'CEO', 's... | Founded in June of 2012, Coinbase is a digital... |
| 4 | Dropbox | http://dropbox.com | Backup and share files in the cloud. | ['S07', 'Public'] | ['https://www.linkedin.com/company/dropbox/', ... | 2008.0 | 4000.0 | San Francisco, CA, USA | San Francisco | ['Arash Ferdowsi', 'Drew Houston'] | [{'name': 'Arash Ferdowsi', 'role': '', 'socia... | Dropbox is building the world’s first smart wo... |
| 5 | Embark Trucks | http://embarktrucks.com | Self-driving semi trucks. | ['W16', 'Public', 'Autonomous Trucking'] | ['https://www.linkedin.com/company/embarktruck... | 2016.0 | 100.0 | San Francisco, CA, USA | San Francisco | ['Alex Rodrigues'] | [{'name': 'Brandon Moak', 'role': 'CTO', 'soci... | We are a San Francisco based team building sel... |
| 6 | Ginkgo Bioworks | http://ginkgobioworks.com | Our mission is to make biology easier to engin... | ['S14', 'Public', 'Synthetic Biology', 'Diagno... | ['https://www.linkedin.com/company/ginkgo-biow... | 2009.0 | 500.0 | Boston, MA, USA; Remote | Boston | ['Reshma Shetty'] | [{'name': 'Barry Canton', 'role': 'Founder', '... | Ginkgo Bioworks is the organism company. We de... |
| 7 | Lucira Health | https://www.lucirahealth.com/ | Next generation infectious disease test kits | ['W15', 'Public'] | ['https://www.linkedin.com/company/lucira-heal... | | 43.0 | Emeryville, CA, USA | Emeryville, CA | ['Debkishore Mitra'] | [{'name': 'Debkishore Mitra', 'role': 'Founder... | Diassess Inc. is revolutionizing the way infec... |
| 8 | MatterPort | http://matterport.com | Turn physical objects and environments into 3D... | ['W12', 'Public', 'Computer Vision'] | ['https://www.crunchbase.com/organization/matt... | | 201.0 | Sunnyvale, CA, USA | Sunnyvale, CA | ['David Gausebeck'] | [{'name': 'Matt Bell', 'role': 'Founder', 'soc... | Matterport is an immersive media technology co... |
| 9 | Momentus | https://momentus.space | The space infrastructure services company | ['S18', 'Public', 'Space Exploration', 'Solar ... | ['https://www.linkedin.com/company/momentusinc... | 2017.0 | 125.0 | Santa Clara, CA, USA | Santa Clara, CA | | [] | Momentus is a space infrastructure services co... |
