# Startup Success Prediction
Startups are young companies founded to develop a unique product or service, with the intention of disrupting industries and changing the world at scale. They aim to grow very fast, starting with a minimum viable product (MVP) that tests whether customers actually want to use the product. From there, the company goes through iterative growth and innovation, looking to rapidly expand its customer base and establish itself in a larger market. The ultimate goal, implicitly or explicitly, is going public through an Initial Public Offering (IPO). Going public gives investors a chance to cash out, which is known as an exit.
<br>
Before going public, startups go through several rounds of funding in which venture capital firms invest tens or hundreds of millions of dollars, hoping to match the more than 200,000% return Peter Thiel saw when he invested in Facebook 8 years before its IPO. However, about 90% of startups fail, so investors have a very high chance of getting no return on their investment. This model aims to help rich people get richer: its estimates can be used to judge whether a startup is worth investing in.
<br>

### Objective
The objective is to predict whether a startup turns into a success or a failure. A company counts as a success if it reaches an exit that pays its founders and investors a large sum of money, either through M&A (a merger or acquisition) or through an IPO. A company counts as a failure if it had to be shut down.
<br>

### Dataset Description
The data contains industry trends, investment insights (total funding, number of investors, etc.), individual company information (location, industry, etc.), and whether or not the startup has been acquired.

# Data Exploration

In [19]:
import pandas as pd

df = pd.read_csv('data_startup.csv')
df.head()

|   | Unnamed: 0 | state_code | latitude | longitude | zip_code | id | city | Unnamed: 6 | name | labels | ... | object_id | has_VC | has_angel | has_roundA | has_roundB | has_roundC | has_roundD | avg_participants | is_top500 | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1005 | CA | 42.35888 | -71.05682 | 92101 | c:6669 | San Diego | NaN | Bandsintown | 1 | ... | c:6669 | 0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0 | acquired |
| 1 | 204 | CA | 37.238916 | -121.973718 | 95032 | c:16283 | Los Gatos | NaN | TriCipher | 1 | ... | c:16283 | 1 | 0 | 0 | 1 | 1 | 1 | 4.75 | 1 | acquired |
| 2 | 1001 | CA | 32.901049 | -117.192656 | 92121 | c:65620 | San Diego | San Diego CA 92121 | Plixi | 1 | ... | c:65620 | 0 | 0 | 1 | 0 | 0 | 0 | 4.0 | 1 | acquired |
| 3 | 738 | CA | 37.320309 | -122.05004 | 95014 | c:42668 | Cupertino | Cupertino CA 95014 | Solidcore Systems | 1 | ... | c:42668 | 0 | 0 | 0 | 1 | 1 | 1 | 3.3333 | 1 | acquired |
| 4 | 1002 | CA | 37.779281 | -122.419236 | 94105 | c:65806 | San Francisco | San Francisco CA 94105 | Inhale Digital | 0 | ... | c:65806 | 1 | 1 | 0 | 0 | 0 | 0 | 1.0 | 1 | closed |


### Cleanup
The dataset seems to have some redundant features.

Unnamed: 0: All the values are unique, ranging from 1 to 1153 with some values missing. This suggests it may once have been a unique ID.

In [69]:
records_count = df.shape[0]
unnamed_0 = df.iloc[:,0].unique()
print("Records in dataset:", records_count)
print("Unique values in Unnamed: 0:", len(unnamed_0))

Records in dataset: 923
Unique values in Unnamed: 0: 923
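
The "unique, but with gaps" observation can be checked directly. The following is a minimal sketch on a hypothetical stand-in series holding only the five `Unnamed: 0` values visible in `df.head()`; on the real data, `values` would be `df['Unnamed: 0']`.

```python
import pandas as pd

# Hypothetical stand-in: the five "Unnamed: 0" values visible in df.head().
values = pd.Series([1005, 204, 1001, 738, 1002], name="Unnamed: 0")

# True when no value appears more than once.
all_unique = values.is_unique

# Count the integers between min and max that never appear in the column.
lo, hi = int(values.min()), int(values.max())
n_gaps = len(set(range(lo, hi + 1)) - set(values))

print("All values unique:", all_unique)
print("Range:", lo, "to", hi, "with", n_gaps, "missing values")
```

A column that is unique and has gaps in an otherwise consecutive range looks like a leftover row ID rather than a feature.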


Feature Unnamed: 6 is missing in some records, and where it is present it seems to be just the city, state_code, and zip_code features joined together. Safe to drop.

In [70]:
df.iloc[:, [1,4,6,7]][:20]

|   | state_code | zip_code | city | Unnamed: 6 |
|---|---|---|---|---|
| 0 | CA | 92101 | San Diego | NaN |
| 1 | CA | 95032 | Los Gatos | NaN |
| 2 | CA | 92121 | San Diego | San Diego CA 92121 |
| 3 | CA | 95014 | Cupertino | Cupertino CA 95014 |
| 4 | CA | 94105 | San Francisco | San Francisco CA 94105 |
| 5 | CA | 94043 | Mountain View | Mountain View CA 94043 |
| 6 | CA | 94041 | Mountain View | NaN |
| 7 | CA | 94901 | San Rafael | NaN |
| 8 | MA | 1267 | Williamstown | Williamstown MA 1267 |
| 9 | CA | 94306 | Palo Alto | NaN |
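
Eyeballing the first rows only suggests that Unnamed: 6 is a merge of the other three columns. A rough way to confirm it is to rebuild the string and compare. This is a sketch on a toy frame copied from a few of the rows above, and it assumes the separator is a single space; the real data may contain other formats.

```python
import pandas as pd

# Toy frame copied from a few rows of the output above (not the full dataset).
sample = pd.DataFrame({
    "city": ["San Diego", "San Diego", "Cupertino", "Williamstown"],
    "state_code": ["CA", "CA", "CA", "MA"],
    "zip_code": ["92101", "92121", "95014", "1267"],
    "Unnamed: 6": [None, "San Diego CA 92121", "Cupertino CA 95014",
                   "Williamstown MA 1267"],
})

# Rebuild the merged string from its three source columns.
rebuilt = (sample["city"] + " " + sample["state_code"] + " "
           + sample["zip_code"].astype(str))

# Compare only the rows where Unnamed: 6 is actually present.
present = sample["Unnamed: 6"].notna()
matches = sample.loc[present, "Unnamed: 6"] == rebuilt[present]
n_mismatch = int((~matches).sum())

print("Rows with Unnamed: 6 present:", int(present.sum()))
print("Rows that differ from city + state_code + zip_code:", n_mismatch)
```

If the mismatch count is zero on the full dataset, the column adds no information and dropping it loses nothing.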


Features id and object_id are identical in every record, and both only identify the company. There is no need for either. Safe to drop both.

In [71]:
id_to_obj_id = df.id == df.object_id
print("Records in dataset:", records_count)
print("Records where the id feature is equal to the object_id feature:", id_to_obj_id.sum())

Records in dataset: 923
Records where the id feature is equal to the object_id feature: 923


The labels feature seems to be just a binary representation of the status feature (1 for acquired, 0 for closed). Safe to drop the status feature.

In [72]:
df.loc[:, ['labels', 'status']][:20]

|   | labels | status |
|---|---|---|
| 0 | 1 | acquired |
| 1 | 1 | acquired |
| 2 | 1 | acquired |
| 3 | 1 | acquired |
| 4 | 0 | closed |
| 5 | 0 | closed |
| 6 | 1 | acquired |
| 7 | 1 | acquired |
| 8 | 1 | acquired |
| 9 | 1 | acquired |
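
Looking at the first rows does not prove the mapping holds everywhere. One way to check it over all records is a cross-tabulation plus a uniqueness test in both directions. The sketch below uses a toy frame copied from the first seven rows above; on the real data, `sample` would be `df`.

```python
import pandas as pd

# Toy frame copied from the first seven rows of the output above.
sample = pd.DataFrame({
    "labels": [1, 1, 1, 1, 0, 0, 1],
    "status": ["acquired", "acquired", "acquired", "acquired",
               "closed", "closed", "acquired"],
})

# Counts per (labels, status) pair; off-diagonal cells should be zero.
print(pd.crosstab(sample["labels"], sample["status"]))

# A pure encoding means each status has one label and each label one status.
labels_per_status = sample.groupby("status")["labels"].nunique()
status_per_label = sample.groupby("labels")["status"].nunique()
is_pure_encoding = bool((labels_per_status == 1).all()
                        and (status_per_label == 1).all())

print("labels is a one-to-one encoding of status:", is_pure_encoding)
```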


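Records 124 and 832 describe the same company: they share the same id (c:28482) and name (Redwood Systems). In the columns shown they differ only in Unnamed: 0, latitude, longitude, and Unnamed: 6. Safe to drop one of them.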

In [73]:
df.iloc[[124, 832]]

|   | Unnamed: 0 | state_code | latitude | longitude | zip_code | id | city | Unnamed: 6 | name | labels | ... | object_id | has_VC | has_angel | has_roundA | has_roundB | has_roundC | has_roundD | avg_participants | is_top500 | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 124 | 506 | CA | 37.54827 | -121.988572 | 94538 | c:28482 | Fremont | Fremont CA 94538 | Redwood Systems | 1 | ... | c:28482 | 1 | 0 | 1 | 1 | 1 | 0 | 2.25 | 1 | acquired |
| 832 | 505 | CA | 37.48151 | -121.945328 | 94538 | c:28482 | Fremont | NaN | Redwood Systems | 1 | ... | c:28482 | 1 | 0 | 1 | 1 | 1 | 0 | 2.25 | 1 | acquired |
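
Because the two rows are not byte-for-byte identical, a plain `df.duplicated()` would miss them; matching on `id` is one option. The sketch below uses a toy frame (the two Redwood Systems rows plus one unrelated row, with only a few columns) and then applies the drops discussed in this section. Treating `Unnamed: 0` as droppable and keeping the first row per `id` are assumptions, not decisions taken from the notebook.

```python
import pandas as pd

# Toy frame: the two Redwood Systems rows shown above plus one unrelated row.
sample = pd.DataFrame(
    {
        "Unnamed: 0": [506, 505, 204],
        "id": ["c:28482", "c:28482", "c:16283"],
        "name": ["Redwood Systems", "Redwood Systems", "TriCipher"],
        "latitude": [37.54827, 37.48151, 37.238916],
        "Unnamed: 6": ["Fremont CA 94538", None, None],
        "object_id": ["c:28482", "c:28482", "c:16283"],
        "labels": [1, 1, 1],
        "status": ["acquired", "acquired", "acquired"],
    },
    index=[124, 832, 1],
)

# keep=False flags every row that shares an id with another row.
dupes = sample[sample.duplicated(subset="id", keep=False)]
print(dupes[["id", "name", "latitude"]])

# Keep the first row per id, then drop the columns judged redundant above.
redundant = ["Unnamed: 0", "Unnamed: 6", "id", "object_id", "status"]
cleaned = (sample.drop_duplicates(subset="id", keep="first")
                 .drop(columns=redundant))
print(cleaned.shape)
```

Deduplicating before dropping `id` matters here, since `id` is the key used to find the repeated company.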
