# Notebook

The goal is to build a dataset of every company in the Y Combinator Startup Directory, with the following fields for each company:

- company_name
- short_description
- year_founded
- batch (e.g. W20, S19)
- tags (e.g. marketplace, fintech, edtech)
- location
  - city
  - state
  - country
- team_size
- url
- long_description
- date_fetched
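
The field list above could be captured as a small record type. This is only a sketch: the class names `Company` and `Location`, the chosen types (for example `team_size` as an integer), and the defaults are assumptions, not something the directory itself defines.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
from typing import List, Optional


@dataclass
class Location:
    # all parts are optional because a company page may omit them (assumption)
    city: Optional[str] = None
    state: Optional[str] = None
    country: Optional[str] = None


@dataclass
class Company:
    company_name: str
    url: str
    short_description: str = ""
    long_description: str = ""
    year_founded: Optional[int] = None
    batch: Optional[str] = None  # e.g. "W20", "S19"
    tags: List[str] = field(default_factory=list)
    location: Location = field(default_factory=Location)
    team_size: Optional[int] = None
    # ISO date string of the day the record was created
    date_fetched: str = field(default_factory=lambda: date.today().isoformat())


# hypothetical record; the values are made up for illustration
example = Company(
    company_name="Example Co",
    url="https://example.com",
    batch="W20",
    tags=["fintech"],
)
record = asdict(example)  # nested dict, ready for json.dump
```

Keeping `location` as a nested record mirrors the nested bullet list, and `asdict` flattens the whole thing into plain dictionaries for serialization.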

Data visualizations:

- map of all startups by location
- startups by batch, team_size, and tags
- trends by batch using description (NLP)

## Todos

- [x] instantiate first driver
- [x] enable scrolling until the end of the page
- [x] fetch urls from start url
- [x] click on 'See all options'
- [x] click on batches recursively
- [x] fetch all urls from one batch
- [ ] print script runtime
- [ ] add tqdm
- [ ] print comments/info for main script (e.g. "fetched links for batch 2..")
- [ ] scrape fields from each startup url
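
One possible way to handle the "print script runtime" todo is a small wrapper around the function being timed. The helper names `format_runtime` and `run_timed` are hypothetical and not part of the notebook yet.

```python
from time import perf_counter


def format_runtime(seconds: float) -> str:
    """Formats a duration in seconds as 'Xm Ys' (fractions are dropped)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m {secs}s"


def run_timed(func, *args, **kwargs):
    """Calls func, prints how long the call took, and returns its result."""
    start = perf_counter()
    result = func(*args, **kwargs)
    print(f"finished in {format_runtime(perf_counter() - start)}")
    return result
```

With this in place, the final cell could call `run_timed(main)` instead of `main()`.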

In [192]:
import json
import re
from time import sleep

from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By

In [193]:
driver = Firefox()
driver.get("https://www.ycombinator.com/companies")

In [194]:
# click 'See all options'
see_all_options = driver.find_element(By.LINK_TEXT, 'See all options')
see_all_options.click()

In [195]:
# seasons = ['W', 'S', 'IK',]
# decades = [0, 1, 2,]

def compile_batches():
    """Returns elements of checkboxes from all batches."""
    pattern = re.compile(r'^(W|S|IK)[012]')
    bx = driver.find_elements(By.XPATH, '//label')
    for element in bx:
        if pattern.match(element.text):
            yield element
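
To see what the pattern in `compile_batches` accepts, it can be tried on plain strings without a browser. The label texts below are hypothetical; the real filter labels on the page may look different.

```python
import re

# same pattern as in compile_batches: a season prefix followed by 0, 1 or 2
BATCH_PATTERN = re.compile(r'^(W|S|IK)[012]')

# hypothetical label texts for illustration
labels = ['W20', 'S19', 'IK12', 'Marketplace', 'San Francisco', 'B2B']

# keep only the labels that look like a batch
batch_labels = [text for text in labels if BATCH_PATTERN.match(text)]
```

Because the character after the prefix must be `0`, `1` or `2`, a label such as `S30` would not match, so the pattern would need extending for later decades.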

In [196]:
# source: https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python

def scroll_to_bottom():
    """Scrolls to the bottom of the page."""
    # get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # wait to load page
        sleep(3)

        # calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
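
The loop above runs until the page height stops changing. A variant that takes the driver as an argument and gives up after a fixed number of rounds might look like the sketch below; the name `scroll_until_stable` and the parameters `pause` and `max_rounds` are assumptions for illustration, not part of the original script.

```python
from time import sleep


def scroll_until_stable(driver, pause=3.0, max_rounds=50):
    """Scrolls down until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        # scroll down to the current bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # give the page time to load more content
        sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds
```

Passing the driver in explicitly also makes the function usable with a stand-in object, which helps when trying it out without opening a browser.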

In [197]:
def fetch_url_paths():
    """Yields the url of every company listed on the current page."""
    # match links that contain '/companies/' but not 'founders'
    xpath = '//a[contains(@href,"/companies/") and not(contains(@href,"founders"))]'
    for element in driver.find_elements(By.XPATH, xpath):
        yield element.get_attribute('href')

In [245]:
def write_urls_to_file(ul: list):
    """Writes a list of company urls to a file as JSON, overwriting any previous content."""
    with open('start_urls.txt', 'w') as f:
        json.dump(ul, f)
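
Since the file holds a JSON array, it can be read back with `json.load`. The sketch below also drops repeated links in case the same company is collected more than once; `dedupe_urls` and `load_urls` are hypothetical helpers, not part of the notebook.

```python
import json


def dedupe_urls(urls):
    """Returns the urls without duplicates, keeping first-seen order."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(urls))


def load_urls(path='start_urls.txt'):
    """Reads the list of urls back from the JSON file."""
    with open(path) as f:
        return json.load(f)
```

For example, `write_urls_to_file(dedupe_urls(ulist))` would store each link once.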

In [243]:
def main():
    """Run the main script to write all start urls to a file."""
    ulist = []
    # compile an array of batches (checkbox elements)
    batches = compile_batches()

    # NOTE: only the last two batches are processed here; drop the slice to cover all batches
    for b in list(batches)[-2:]:
        # filter companies
        b.click()

        # scroll down to load all companies
        scroll_to_bottom()

        # fetch links and append them to ulist
        urls = list(fetch_url_paths())
        ulist.extend(urls)

        # uncheck the batch checkbox
        b.click()

    write_urls_to_file(ulist)

In [244]:
main()

['https://www.ycombinator.com/companies/wufoo', 'https://www.ycombinator.com/companies/project-wedding', 'https://www.ycombinator.com/companies/clustrix', 'https://www.ycombinator.com/companies/inkling', 'https://www.ycombinator.com/companies/audiobeta', 'https://www.ycombinator.com/companies/flagr', 'https://www.ycombinator.com/companies/snipshot']
['https://www.ycombinator.com/companies/wufoo', 'https://www.ycombinator.com/companies/project-wedding', 'https://www.ycombinator.com/companies/clustrix', 'https://www.ycombinator.com/companies/inkling', 'https://www.ycombinator.com/companies/audiobeta', 'https://www.ycombinator.com/companies/flagr', 'https://www.ycombinator.com/companies/snipshot', 'https://www.ycombinator.com/companies/reddit', 'https://www.ycombinator.com/companies/kiko', 'https://www.ycombinator.com/companies/clickfacts', 'https://www.ycombinator.com/companies/textpayme', 'https://www.ycombinator.com/companies/loopt', 'https://www.ycombinator.com/companies/infogami', 'h

## Extras

In [94]:
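# save the rendered page source for offline inspection
# (the output of a %%capture cell is only available after that cell finishes,
# so the page source is written to the file directly)
with open('output.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)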


In [None]:
# example of how to scroll to the bottom of a page
# driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

In [None]:
# example of taking a screenshot
# driver.save_screenshot('screenshot.png')