## Phase I Project Proposal
### What Makes Y Combinator Startups Successful?

#### Name: Mukhilkanna Balakumar, DS 3000


### Introduction

Y Combinator is one of the most prestigious startup accelerators in the world, having funded companies like Airbnb, Dropbox, Stripe, and Reddit. What factors contribute to the success of YC-backed startups? I'm interested in examining two key questions:

1. Can we predict a startup's industry/category based on its characteristics (such as founding year, location, and team size)? A model that does this well would show which characteristics distinguish one sector from another, helping investors and founders see what kinds of companies each sector tends to produce.

2. Is there a relationship between a company's founding year and its success metrics (such as funding raised or company status)? This could reveal trends in startup ecosystems over time and help identify optimal timing for launching certain types of companies.

These questions have practical applications: the first could help accelerator programs tailor their support to specific industries, while the second could provide insights into market cycles and timing strategies for entrepreneurs. Understanding these patterns could also help predict which types of startups are more likely to succeed or get acquired based on historical YC data.

### Data Collection

I will collect the data with the `requests` library from Y Combinator's public company directory API endpoint. The endpoint returns JSON, so no HTML parsing with BeautifulSoup is needed. It lists YC companies with key information including founding year, industry, location, team size, and company status. The data is publicly accessible and paginated: the first response reports 219 pages and contains 25 companies, which implies roughly 5,000 companies in total.

Code:

In [None]:
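import requests
import pandas as pd

# Request the first page of the YC company directory API
url = "https://api.ycombinator.com/v0.1/companies"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on an HTTP error
companies = response.json()

# The response is a dict with four top-level keys (companies, nextPage,
# page, totalPages), so len() counts those keys, not individual companies
print(f"Retrieved {len(companies)} companies")

# Build a DataFrame: one row per entry in the 'companies' list, with the
# pagination fields repeated on every row and each record still nested
df = pd.DataFrame(companies)

# Sample of the data
print(df.head())
print(f"\nShape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Save the first page as collected
df.to_csv('yc_companies.csv', index=False)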

Retrieved 4 companies
                                           companies  \
0  {'id': 31013, 'name': 'Rivet', 'slug': 'rivet-...   
1  {'id': 31011, 'name': 'Openroll', 'slug': 'ope...   
2  {'id': 31009, 'name': 'Bear', 'slug': 'bear', ...   
3  {'id': 31005, 'name': 'MarkIt', 'slug': 'marki...   
4  {'id': 31004, 'name': 'Dome', 'slug': 'dome', ...   

                                            nextPage  page  totalPages  
0  https://api.ycombinator.com/v0.1/companies?page=2     1         219  
1  https://api.ycombinator.com/v0.1/companies?page=2     1         219  
2  https://api.ycombinator.com/v0.1/companies?page=2     1         219  
3  https://api.ycombinator.com/v0.1/companies?page=2     1         219  
4  https://api.ycombinator.com/v0.1/companies?page=2     1         219  

Shape: (25, 4)
Columns: ['companies', 'nextPage', 'page', 'totalPages']
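
The output above shows that a single request returns only page 1 of 219, with a `nextPage` link to the following page. A minimal sketch of collecting every page is given here. It assumes that `nextPage` is missing or empty on the last page, which the output does not confirm, so `max_pages` is included as a safety cap. The names `fetch_json` and `fetch_all_companies` are hypothetical helpers, not part of any YC library.

```python
from typing import Callable, Optional

API_URL = "https://api.ycombinator.com/v0.1/companies"


def fetch_json(url: str) -> dict:
    """Default fetcher: GET one page and decode its JSON body."""
    import requests  # third-party; imported lazily so the paging logic runs offline

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()


def fetch_all_companies(
    start_url: str = API_URL,
    get_json: Callable[[str], dict] = fetch_json,
    max_pages: Optional[int] = None,
) -> list:
    """Follow `nextPage` links and collect every record under `companies`."""
    records = []
    url = start_url
    pages_read = 0
    while url and (max_pages is None or pages_read < max_pages):
        payload = get_json(url)
        records.extend(payload.get("companies", []))
        # Assumption: the last page has no usable `nextPage` value
        url = payload.get("nextPage")
        pages_read += 1
    return records
```

Calling `fetch_all_companies(max_pages=2)` first is a cheap way to check the response format before requesting all pages; the resulting list of dictionaries can then be passed to `pd.DataFrame` so that each company field becomes its own column.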


### Data Usage and Remaining Issues

The dataset includes:
- **Numeric features**: `year_founded`, `team_size` (2+ numeric)
- **Categorical features**: `industry`, `status`, `location` (1+ categorical)  
- **Observations**: Thousands of YC companies (30+ observations)

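For **Question 1**, I can use classification to predict `industry` from other features such as `year_founded`, `team_size`, and `location`. For **Question 2**, I can examine the relationship between `year_founded` and `status`. Because `status` is categorical, this is most naturally a classification problem (for example, logistic regression); a regression model would apply if a numeric outcome such as `team_size` were used instead.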
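
As a first look at Question 2 before fitting any model, the share of each status within each founding year can be tabulated. The sketch below uses a small hypothetical table; the column names follow this proposal, and the status labels (`Active`, `Acquired`, `Inactive`) are assumptions rather than values confirmed from the API.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real YC data
toy = pd.DataFrame({
    "year_founded": [2012, 2012, 2012, 2020, 2020, 2020],
    "status": ["Acquired", "Inactive", "Active", "Active", "Active", "Inactive"],
})


def status_share_by_year(df: pd.DataFrame) -> pd.DataFrame:
    """Share of each status within each founding year (each row sums to 1)."""
    return pd.crosstab(df["year_founded"], df["status"], normalize="index")


shares = status_share_by_year(toy)
print(shares.round(2))
```

If the shares shift noticeably from one founding year to the next, that would suggest `year_founded` carries some signal about `status` and is worth including in a model.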

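Two issues remain before analysis. First, the request above returns only page 1 of 219, and each record is still nested as a dictionary inside the `companies` column, so I need to loop over all pages and expand the records into one column per field. Second, some data cleaning will be needed (handling missing values, standardizing locations). Once that is done, the data meets all requirements and is collected via an API returning JSON data.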
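
The cleaning steps mentioned above could look roughly like the following sketch. The rows are hypothetical, and the specific choices (lowercasing locations, labeling missing ones `unknown`, dropping rows without a founding year) are assumptions for illustration, not decisions fixed by this proposal.

```python
import pandas as pd

# Hypothetical raw rows showing the kinds of problems expected
raw = pd.DataFrame({
    "year_founded": [2015, None, 2021],
    "team_size": [12, 3, None],
    "location": [" San Francisco, CA ", "san francisco, ca", None],
})


def clean_companies(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize locations and drop rows that lack a founding year."""
    out = df.copy()
    # Trim whitespace and lowercase so spelling variants compare equal
    out["location"] = out["location"].str.strip().str.lower()
    out["location"] = out["location"].fillna("unknown")
    # Founding year is needed for both questions, so rows without it are dropped
    out = out.dropna(subset=["year_founded"])
    out["year_founded"] = out["year_founded"].astype(int)
    return out.reset_index(drop=True)


cleaned = clean_companies(raw)
print(cleaned)
```

Missing `team_size` values are left in place here, since whether to impute or drop them depends on how many are missing in the full dataset.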