
## **Agenda:**
1. About the Dataset

2. Objective

3. Importing Libraries

4. Importing Data

5. Exploratory Data Analysis

6. Data Preprocessing

7. Splitting Data into Train and Test

8. Fitting the Model and Checking Accuracy

9. Predicting on Test Dataset

## **About the Dataset**
Startups play a major role in economic growth: they bring new ideas, spur innovation, and create employment. The number of startups has grown rapidly over the past few years. Predicting the success of a startup helps investors find companies with the potential for rapid growth, keeping them one step ahead of the competition.

The data contains industry trends, investment insights and individual company information. There are 48 columns/features. Some of the features are:

age_first_funding_year – quantitative

age_last_funding_year – quantitative

relationships – quantitative

funding_rounds – quantitative

funding_total_usd – quantitative

milestones – quantitative

age_first_milestone_year – quantitative

age_last_milestone_year – quantitative

state – categorical

industry_type – categorical

has_VC – categorical

has_angel – categorical

has_roundA – categorical

has_roundB – categorical

has_roundC – categorical

has_roundD – categorical

avg_participants – quantitative

is_top500 – categorical

status (acquired/closed) – categorical (the target variable; a startup that is ‘acquired’ by another organization is considered a success)



## **Objective**

The objective is to predict whether a startup that is currently operating will turn into a success or a failure. Success is defined as an event that gives the company's founders a large sum of money through M&A (Merger and Acquisition) or an IPO (Initial Public Offering). A company is considered a failure if it had to be shut down.

## **Importing Libraries**

In [None]:
import pandas as pd
import numpy as np

## **Importing Data**

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# Importing the dataset
train=pd.read_csv('/content/drive/My Drive/startup data/training_set_label.csv')
train.head()

In [None]:
test=pd.read_csv('/content/drive/My Drive/startup data/testing_set_label.csv')

## **Exploratory Data Analysis**

In [None]:
print(train.shape)
print(test.shape)

(923, 48)
(231, 47)


In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 923 entries, 0 to 922
Data columns (total 48 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                923 non-null    int64  
 1   state_code                923 non-null    object 
 2   latitude                  923 non-null    float64
 3   longitude                 923 non-null    float64
 4   zip_code                  923 non-null    object 
 5   id                        923 non-null    object 
 6   city                      923 non-null    object 
 7   Unnamed: 6                430 non-null    object 
 8   name                      923 non-null    object 
 9   founded_at                923 non-null    object 
 10  closed_at                 335 non-null    object 
 11  first_funding_at          923 non-null    object 
 12  last_funding_at           923 non-null    object 
 13  age_first_funding_year    923 non-null    float64
 14  age_last_f

Fourteen of the 48 columns have the object data type; the other 34 are numeric.
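
A quick way to confirm how the columns split between numeric and non-numeric types is to count them per dtype group. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not the startup data); on the real data the same calls would be run on `train`.

```python
import pandas as pd

# Toy frame standing in for `train`; the values are made up for illustration.
toy = pd.DataFrame({
    "state_code": ["CA", "NY", "MA"],
    "name": ["a", "b", "c"],
    "latitude": [37.2, 40.7, 42.3],
    "funding_rounds": [3, 1, 2],
})

# Columns with a numeric dtype (ints and floats).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else (here: text columns).
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_cols), "numeric columns:", numeric_cols)
print(len(non_numeric_cols), "non-numeric columns:", non_numeric_cols)
```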

In [None]:
train.describe()

Unnamed: 0.1,Unnamed: 0,latitude,longitude,age_first_funding_year,age_last_funding_year,age_first_milestone_year,age_last_milestone_year,relationships,funding_rounds,funding_total_usd,milestones,is_CA,is_NY,is_MA,is_TX,is_otherstate,is_software,is_web,is_mobile,is_enterprise,is_advertising,is_gamesvideo,is_ecommerce,is_biotech,is_consulting,is_othercategory,has_VC,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,is_top500
count,923.0,923.0,923.0,923.0,923.0,771.0,771.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0,923.0
mean,572.297941,38.517442,-103.539212,2.23563,3.931456,3.055353,4.754423,7.710726,2.310943,25419750.0,1.84182,0.527627,0.114843,0.089924,0.045504,0.221018,0.165764,0.156013,0.08559,0.07909,0.067172,0.056338,0.027086,0.036836,0.00325,0.32286,0.326111,0.254605,0.508126,0.392199,0.232936,0.099675,2.838586,0.809317
std,333.585431,3.741497,22.394167,2.510449,2.96791,2.977057,3.212107,7.265776,1.390922,189634400.0,1.322632,0.499507,0.319005,0.286228,0.208519,0.415158,0.37207,0.363064,0.27991,0.270025,0.250456,0.230698,0.162421,0.188462,0.056949,0.467823,0.469042,0.435875,0.500205,0.488505,0.422931,0.299729,1.874601,0.393052
min,1.0,25.752358,-122.756956,-9.0466,-9.0466,-14.1699,-7.0055,0.0,1.0,11000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,283.5,37.388869,-122.198732,0.5767,1.66985,1.0,2.411,3.0,1.0,2725000.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.5,1.0
50%,577.0,37.779281,-118.374037,1.4466,3.5288,2.5205,4.4767,5.0,2.0,10000000.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.5,1.0
75%,866.5,40.730646,-77.214731,3.57535,5.56025,4.6863,6.7534,10.0,3.0,24725000.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,3.8,1.0
max,1153.0,59.335232,18.057121,21.8959,21.8959,24.6849,24.6849,63.0,10.0,5700000000.0,8.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,16.0,1.0


In [None]:
test.describe()

The summary statistics show that the age columns contain negative values, which should not be possible.
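
To see how many negative values each age column holds, the comparison can be applied to all of them at once. This is a minimal sketch on made-up numbers; the column names follow the dataset, while `toy`, `age_cols` and `negative_counts` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up values; NaN stands for a startup with no milestone.
toy = pd.DataFrame({
    "age_first_funding_year": [2.2, -1.5, 0.0, 5.1],
    "age_first_milestone_year": [4.6, np.nan, -0.3, -2.0],
})

# Select every column whose name starts with "age_".
age_cols = [c for c in toy.columns if c.startswith("age_")]

# Comparing NaN with 0 yields False, so missing values are not counted.
negative_counts = (toy[age_cols] < 0).sum()
print(negative_counts)
```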

In [None]:
train.isnull().sum()

Unnamed: 0                    0
state_code                    0
latitude                      0
longitude                     0
zip_code                      0
id                            0
city                          0
Unnamed: 6                  493
name                          0
founded_at                    0
closed_at                   588
first_funding_at              0
last_funding_at               0
age_first_funding_year        0
age_last_funding_year         0
age_first_milestone_year    152
age_last_milestone_year     152
relationships                 0
funding_rounds                0
funding_total_usd             0
milestones                    0
state_code.1                  1
is_CA                         0
is_NY                         0
is_MA                         0
is_TX                         0
is_otherstate                 0
category_code                 0
is_software                   0
is_web                        0
is_mobile                     0
is_enter

In [None]:
test.isnull().sum()

Unnamed: 0                    0
state_code                    0
latitude                      0
longitude                     0
zip_code                      0
id                            0
city                          0
Unnamed: 6                  125
name                          0
founded_at                    0
closed_at                   147
first_funding_at              0
last_funding_at               0
age_first_funding_year        0
age_last_funding_year         0
age_first_milestone_year     43
age_last_milestone_year      43
relationships                 0
funding_rounds                0
funding_total_usd             0
milestones                    0
state_code.1                  0
is_CA                         0
is_NY                         0
is_MA                         0
is_TX                         0
is_otherstate                 0
category_code                 0
is_software                   0
is_web                        0
is_mobile                     0
is_enter

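In the output shown above, four columns contain missing values in both the train and test datasets: Unnamed: 6, closed_at, age_first_milestone_year and age_last_milestone_year. In the train set, state_code.1 is also missing one value.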
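
With 48 columns the full `isnull().sum()` listing is long, so it can help to keep only the columns that actually have gaps. A small sketch on made-up rows (`toy` and `missing` are illustrative names):

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "city": ["San Diego", "Los Gatos", "Cupertino"],
    "closed_at": [np.nan, np.nan, "04-01-12"],
    "age_first_milestone_year": [4.67, np.nan, 6.0],
})

# Count nulls per column, then keep only columns with at least one null.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```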



In [None]:
train.duplicated().sum()

0

The train data contains no duplicated rows.

## **Data Preprocessing**

In [None]:
# First, drop the 'Unnamed: 6' column: it is just a mix of city and zip_code, and it has many missing values.
train=train.drop(columns=['Unnamed: 6'])
test=test.drop(columns=['Unnamed: 6'])

In [None]:
# closed_at cannot simply be dropped because it is tied to success, so we add a new column that indicates whether the startup is closed or still open.

train["is_open"]=train["closed_at"].isnull()
train=train.drop(columns=["closed_at"])
train.head()

Unnamed: 0.1,Unnamed: 0,state_code,latitude,longitude,zip_code,id,city,name,founded_at,first_funding_at,last_funding_at,age_first_funding_year,age_last_funding_year,age_first_milestone_year,age_last_milestone_year,relationships,funding_rounds,funding_total_usd,milestones,state_code.1,is_CA,is_NY,is_MA,is_TX,is_otherstate,category_code,is_software,is_web,is_mobile,is_enterprise,is_advertising,is_gamesvideo,is_ecommerce,is_biotech,is_consulting,is_othercategory,object_id,has_VC,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,is_top500,status,is_open
0,1005,CA,42.35888,-71.05682,92101,c:6669,San Diego,Bandsintown,01-01-07,04-01-09,01-01-10,2.2493,3.0027,4.6685,6.7041,3,3,375000,3,CA,1,0,0,0,0,music,0,0,0,0,0,0,0,0,0,1,c:6669,0,1,0,0,0,0,1.0,0,acquired,True
1,204,CA,37.238916,-121.973718,95032,c:16283,Los Gatos,TriCipher,01-01-00,2/14/2005,12/28/2009,5.126,9.9973,7.0055,7.0055,9,4,40100000,1,CA,1,0,0,0,0,enterprise,0,0,0,1,0,0,0,0,0,0,c:16283,1,0,0,1,1,1,4.75,1,acquired,True
2,1001,CA,32.901049,-117.192656,92121,c:65620,San Diego,Plixi,3/18/2009,3/30/2010,3/30/2010,1.0329,1.0329,1.4575,2.2055,5,1,2600000,2,CA,1,0,0,0,0,web,0,1,0,0,0,0,0,0,0,0,c:65620,0,0,1,0,0,0,4.0,1,acquired,True
3,738,CA,37.320309,-122.05004,95014,c:42668,Cupertino,Solidcore Systems,01-01-02,2/17/2005,4/25/2007,3.1315,5.3151,6.0027,6.0027,5,3,40000000,1,CA,1,0,0,0,0,software,1,0,0,0,0,0,0,0,0,0,c:42668,0,0,0,1,1,1,3.3333,1,acquired,True
4,1002,CA,37.779281,-122.419236,94105,c:65806,San Francisco,Inhale Digital,08-01-10,08-01-10,04-01-12,0.0,1.6685,0.0384,0.0384,2,2,1300000,1,CA,1,0,0,0,0,games_video,0,0,0,0,0,1,0,0,0,0,c:65806,1,1,0,0,0,0,1.0,1,closed,False
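
Because `is_open` is derived from `closed_at`, it may track `status` very closely. A cross-tabulation is a quick way to see how much of the target the new column already encodes. The frame below is made up for illustration; on the notebook's data the call would be `pd.crosstab(train["is_open"], train["status"])`.

```python
import pandas as pd

# Made-up rows: is_open is True when closed_at was missing.
toy = pd.DataFrame({
    "is_open": [True, True, False, True, False],
    "status": ["acquired", "acquired", "closed", "acquired", "closed"],
})

# Rows are is_open values, columns are status values, cells are row counts.
table = pd.crosstab(toy["is_open"], toy["status"])
print(table)
```

If nearly all rows fall into just two cells, the feature is close to a restatement of the target, which is worth keeping in mind when reading the accuracy scores.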


In [None]:
test["is_open"]=test["closed_at"].isnull()
test=test.drop(columns=["closed_at"])

In [None]:
# Encode the flag as 1 (open) / 0 (closed); astype avoids chained assignment and its SettingWithCopyWarning.
train["is_open"]=train["is_open"].astype(float)


In [None]:
# Encode the flag as 1 (open) / 0 (closed); astype avoids chained assignment and its SettingWithCopyWarning.
test["is_open"]=test["is_open"].astype(float)


In [None]:
# Replace null values in the age_first_milestone_year and age_last_milestone_year columns with zero.
train["age_first_milestone_year"]=train["age_first_milestone_year"].fillna(0)
train["age_last_milestone_year"]=train["age_last_milestone_year"].fillna(0)


In [None]:
# Apply the same zero fill to the test set.
test["age_first_milestone_year"]=test["age_first_milestone_year"].fillna(0)
test["age_last_milestone_year"]=test["age_last_milestone_year"].fillna(0)


In [None]:
print(train["state_code"].value_counts())
print(train["state_code.1"].value_counts())

CA    488
NY    106
MA     83
WA     42
TX     42
CO     19
IL     18
PA     17
VA     13
GA     11
OR      7
NJ      7
NC      7
MD      7
FL      6
OH      6
MN      5
DC      4
CT      4
RI      3
TN      3
MI      3
UT      3
MO      2
ME      2
KY      2
NH      2
IN      2
AZ      2
NV      2
WV      1
NM      1
WI      1
ID      1
AR      1
Name: state_code, dtype: int64
CA    487
NY    106
MA     83
WA     42
TX     42
CO     19
IL     18
PA     17
VA     13
GA     11
OR      7
NJ      7
NC      7
MD      7
FL      6
OH      6
MN      5
DC      4
CT      4
RI      3
TN      3
MI      3
UT      3
MO      2
ME      2
KY      2
NH      2
IN      2
AZ      2
NV      2
WV      1
NM      1
WI      1
ID      1
AR      1
Name: state_code.1, dtype: int64


In [None]:
print(train["id"].value_counts())
print(train["object_id"].value_counts())

c:28482    2
c:49960    1
c:1853     1
c:10864    1
c:15652    1
          ..
c:4        1
c:65601    1
c:11498    1
c:2808     1
c:512      1
Name: id, Length: 922, dtype: int64
c:28482    2
c:49960    1
c:1853     1
c:10864    1
c:15652    1
          ..
c:4        1
c:65601    1
c:11498    1
c:2808     1
c:512      1
Name: object_id, Length: 922, dtype: int64
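
The counts above show one id (`c:28482`) occurring twice even though `train.duplicated()` found no fully identical rows. A sketch of how such rows could be listed for inspection, using made-up rows (only the id `c:28482` comes from the output above; the other values are illustrative):

```python
import pandas as pd

# Made-up rows: two share an id but differ in another column.
toy = pd.DataFrame({
    "id": ["c:28482", "c:49960", "c:28482"],
    "name": ["A", "B", "A"],
    "funding_rounds": [2, 1, 3],
})

# Fully identical rows (what DataFrame.duplicated() checks by default).
full_dupes = toy.duplicated().sum()

# Rows that share an id; keep=False marks every member of each group.
same_id = toy[toy.duplicated(subset=["id"], keep=False)]

print(full_dupes)
print(same_id)
```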


In [None]:
train['category_code'].value_counts()

software            153
web                 144
mobile               79
enterprise           73
advertising          62
games_video          52
semiconductor        35
network_hosting      34
biotech              34
hardware             27
ecommerce            25
public_relations     25
cleantech            23
analytics            19
security             19
social               14
search               12
messaging            11
other                11
travel                8
fashion               8
news                  8
medical               7
photo_video           7
finance               6
music                 6
education             4
health                3
consulting            3
real_estate           3
manufacturing         2
automotive            2
transportation        2
hospitality           1
sports                1
Name: category_code, dtype: int64

In [None]:
# state_code.1 is just a copy of state_code, and id and object_id are the same. 'Unnamed: 0' and category_code are not required either, so we remove all of these columns.

train=train.drop(columns=['state_code','state_code.1','id','object_id','category_code','Unnamed: 0'])
test=test.drop(columns=['state_code','state_code.1','id','object_id','category_code','Unnamed: 0'])
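
Before dropping columns as duplicates, the claim can be checked directly. The sketch below runs on made-up rows; `differs` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd

# Made-up rows: state_code.1 has one missing value, object_id mirrors id.
toy = pd.DataFrame({
    "state_code": ["CA", "NY", "CA"],
    "state_code.1": ["CA", "NY", np.nan],
    "id": ["c:1", "c:2", "c:3"],
    "object_id": ["c:1", "c:2", "c:3"],
})

def differs(df, left, right):
    """Return rows where two columns disagree (a NaN on one side counts as a difference)."""
    a, b = df[left], df[right]
    both_missing = a.isna() & b.isna()
    return df[(a != b) & ~both_missing]

print(differs(toy, "id", "object_id").empty)
print(differs(toy, "state_code", "state_code.1"))
```

On the notebook's data, a single differing row for the state columns would be consistent with the one missing value reported for `state_code.1`.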

In [None]:
train["status"]=train.status.map({"acquired":1, "closed":0})

In [None]:
train

Unnamed: 0,latitude,longitude,zip_code,city,name,founded_at,first_funding_at,last_funding_at,age_first_funding_year,age_last_funding_year,age_first_milestone_year,age_last_milestone_year,relationships,funding_rounds,funding_total_usd,milestones,is_CA,is_NY,is_MA,is_TX,is_otherstate,is_software,is_web,is_mobile,is_enterprise,is_advertising,is_gamesvideo,is_ecommerce,is_biotech,is_consulting,is_othercategory,has_VC,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,is_top500,status,is_open
0,42.358880,-71.056820,92101,San Diego,Bandsintown,01-01-07,04-01-09,01-01-10,2.2493,3.0027,4.6685,6.7041,3,3,375000,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1.0000,0,1,1.0
1,37.238916,-121.973718,95032,Los Gatos,TriCipher,01-01-00,2/14/2005,12/28/2009,5.1260,9.9973,7.0055,7.0055,9,4,40100000,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,1,4.7500,1,1,1.0
2,32.901049,-117.192656,92121,San Diego,Plixi,3/18/2009,3/30/2010,3/30/2010,1.0329,1.0329,1.4575,2.2055,5,1,2600000,2,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,4.0000,1,1,1.0
3,37.320309,-122.050040,95014,Cupertino,Solidcore Systems,01-01-02,2/17/2005,4/25/2007,3.1315,5.3151,6.0027,6.0027,5,3,40000000,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,3.3333,1,1,1.0
4,37.779281,-122.419236,94105,San Francisco,Inhale Digital,08-01-10,08-01-10,04-01-12,0.0000,1.6685,0.0384,0.0384,2,2,1300000,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,1.0000,1,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
918,37.740594,-122.376471,94107,San Francisco,CoTweet,01-01-09,07-09-09,07-09-09,0.5178,0.5178,0.5808,4.5260,9,1,1100000,2,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,6.0000,1,1,1.0
919,42.504817,-71.195611,1803,Burlington,Reef Point Systems,01-01-98,04-01-05,3/23/2007,7.2521,9.2274,6.0027,6.0027,1,3,52000000,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,2.6667,1,0,0.0
920,37.408261,-122.015920,94089,Sunnyvale,Paracor Medical,01-01-99,6/29/2007,6/29/2007,8.4959,8.4959,9.0055,9.0055,5,1,44000000,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,8.0000,1,0,0.0
921,37.556732,-122.288378,94404,San Francisco,Causata,01-01-09,10-05-09,11-01-11,0.7589,2.8329,0.7589,3.8356,12,2,15500000,2,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1.0000,1,1,1.0


In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 923 entries, 0 to 922
Data columns (total 41 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   latitude                  923 non-null    float64
 1   longitude                 923 non-null    float64
 2   zip_code                  923 non-null    object 
 3   city                      923 non-null    object 
 4   name                      923 non-null    object 
 5   founded_at                923 non-null    object 
 6   first_funding_at          923 non-null    object 
 7   last_funding_at           923 non-null    object 
 8   age_first_funding_year    923 non-null    float64
 9   age_last_funding_year     923 non-null    float64
 10  age_first_milestone_year  923 non-null    float64
 11  age_last_milestone_year   923 non-null    float64
 12  relationships             923 non-null    int64  
 13  funding_rounds            923 non-null    int64  
 14  funding_to

In [None]:
train['founded_at']=pd.to_datetime(train['founded_at'])
test['founded_at']=pd.to_datetime(test['founded_at'])
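
The `founded_at` column mixes two date layouts (`01-01-07` and `3/18/2009` both appear in the rows shown earlier). Depending on the pandas version, parsing the whole column in one call may infer a single format from the first value and reject the rest. A hedged, version-independent sketch is to parse each value on its own; the sample strings are taken from the table above, and `raw` and `founded_year` are illustrative names.

```python
import pandas as pd

# Sample strings in the two layouts that appear in founded_at.
raw = pd.Series(["01-01-07", "3/18/2009", "01-01-02"])

# Parse each string separately so every value gets its own format inference,
# then keep only the year.
founded_year = raw.apply(lambda s: pd.to_datetime(s).year)
print(founded_year.tolist())
```

Parsing value by value is slower than a vectorised call, which should not matter for a dataset of this size.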

In [None]:
train["founded_year"]=train["founded_at"].dt.year
train=train.drop(columns=["founded_at"])

In [None]:
test["founded_year"]=test["founded_at"].dt.year
test=test.drop(columns=["founded_at"])

In [None]:
train=train.select_dtypes(exclude='object')
test=test.select_dtypes(exclude='object')

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 923 entries, 0 to 922
Data columns (total 36 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   latitude                  923 non-null    float64
 1   longitude                 923 non-null    float64
 2   age_first_funding_year    923 non-null    float64
 3   age_last_funding_year     923 non-null    float64
 4   age_first_milestone_year  923 non-null    float64
 5   age_last_milestone_year   923 non-null    float64
 6   relationships             923 non-null    int64  
 7   funding_rounds            923 non-null    int64  
 8   funding_total_usd         923 non-null    int64  
 9   milestones                923 non-null    int64  
 10  is_CA                     923 non-null    int64  
 11  is_NY                     923 non-null    int64  
 12  is_MA                     923 non-null    int64  
 13  is_TX                     923 non-null    int64  
 14  is_otherst

In [None]:
print(train.isnull().any())
print(test.isnull().any())

latitude                    False
longitude                   False
age_first_funding_year      False
age_last_funding_year       False
age_first_milestone_year    False
age_last_milestone_year     False
relationships               False
funding_rounds              False
funding_total_usd           False
milestones                  False
is_CA                       False
is_NY                       False
is_MA                       False
is_TX                       False
is_otherstate               False
is_software                 False
is_web                      False
is_mobile                   False
is_enterprise               False
is_advertising              False
is_gamesvideo               False
is_ecommerce                False
is_biotech                  False
is_consulting               False
is_othercategory            False
has_VC                      False
has_angel                   False
has_roundA                  False
has_roundB                  False
has_roundC    

In [None]:
print(len(train[train['age_first_funding_year']<0]))

print(len(train[train['age_last_funding_year']<0]))

print(len(train[train['age_first_milestone_year']<0]))

print(len(train[train['age_last_milestone_year']<0]))

46
13
46
12


In [None]:
# Set negative ages to zero; clip avoids chained indexing.
age_cols=["age_first_funding_year","age_last_funding_year","age_first_milestone_year","age_last_milestone_year"]
train[age_cols]=train[age_cols].clip(lower=0)

In [None]:
print(len(test[test['age_first_funding_year']<0]))

print(len(test[test['age_last_funding_year']<0]))

print(len(test[test['age_first_milestone_year']<0]))

print(len(test[test['age_last_milestone_year']<0]))

25
8
3
1


In [None]:
# Set negative ages to zero in the test set as well; clip avoids chained indexing.
age_cols=["age_first_funding_year","age_last_funding_year","age_first_milestone_year","age_last_milestone_year"]
test[age_cols]=test[age_cols].clip(lower=0)

In [None]:
train

Unnamed: 0,latitude,longitude,age_first_funding_year,age_last_funding_year,age_first_milestone_year,age_last_milestone_year,relationships,funding_rounds,funding_total_usd,milestones,is_CA,is_NY,is_MA,is_TX,is_otherstate,is_software,is_web,is_mobile,is_enterprise,is_advertising,is_gamesvideo,is_ecommerce,is_biotech,is_consulting,is_othercategory,has_VC,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,is_top500,status,is_open,founded_year
0,42.358880,-71.056820,2.2493,3.0027,4.6685,6.7041,3,3,375000,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1.0000,0,1,1.0,2007
1,37.238916,-121.973718,5.1260,9.9973,7.0055,7.0055,9,4,40100000,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,1,1,4.7500,1,1,1.0,2000
2,32.901049,-117.192656,1.0329,1.0329,1.4575,2.2055,5,1,2600000,2,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,4.0000,1,1,1.0,2009
3,37.320309,-122.050040,3.1315,5.3151,6.0027,6.0027,5,3,40000000,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,3.3333,1,1,1.0,2002
4,37.779281,-122.419236,0.0000,1.6685,0.0384,0.0384,2,2,1300000,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,1.0000,1,0,0.0,2010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
918,37.740594,-122.376471,0.5178,0.5178,0.5808,4.5260,9,1,1100000,2,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,6.0000,1,1,1.0,2009
919,42.504817,-71.195611,7.2521,9.2274,6.0027,6.0027,1,3,52000000,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,2.6667,1,0,0.0,1998
920,37.408261,-122.015920,8.4959,8.4959,9.0055,9.0055,5,1,44000000,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,8.0000,1,0,0.0,1999
921,37.556732,-122.288378,0.7589,2.8329,0.7589,3.8356,12,2,15500000,2,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1.0000,1,1,1.0,2009


In [None]:
# Separate the features (X) from the target (y)
X=train.drop("status", axis=1)
y=train["status"]

## **Splitting Data into Train and Test**

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=56,stratify=y)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((646, 35), (646,), (277, 35), (277,))
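
The `stratify=y` argument keeps the class proportions the same in both parts of the split. A minimal sketch on made-up labels (`X_toy` and `y_toy` are illustrative; only the `train_test_split` arguments mirror the cell above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 30 samples of class 0 and 70 of class 1.
y_toy = np.array([0] * 30 + [1] * 70)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=56, stratify=y_toy
)

# With stratification the share of class 1 is preserved in both parts.
print("share of class 1 overall:", y_toy.mean())
print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test:", y_te.mean())
```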

## **Fitting the Model and Checking Accuracy**

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix

In [None]:
dt=DecisionTreeClassifier(max_depth=7)
dt.fit(X_train,y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=7, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [None]:
y_pred=dt.predict(X_test)
print(accuracy_score(y_test,y_pred))
confusion_matrix(y_test,y_pred)

0.9675090252707581


array([[ 93,   5],
       [  4, 175]])
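
The accuracy printed above can be recovered from the confusion matrix, which also gives precision and recall for the "acquired" class. The sketch assumes scikit-learn's convention (rows are true classes, columns are predicted classes) and the notebook's encoding of 0 = closed, 1 = acquired.

```python
import numpy as np

# Confusion matrix reported for the decision tree.
cm = np.array([[93, 5],
               [4, 175]])

# Unpack in scikit-learn order: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted acquisitions that were correct
recall = tp / (tp + fn)      # share of actual acquisitions that were found

print(accuracy, precision, recall)
```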

In [None]:
KNN=KNeighborsClassifier(3)
KNN.fit(X_train,y_train)
y_pred=KNN.predict(X_test)

In [None]:
print(accuracy_score(y_test,y_pred))
confusion_matrix(y_test,y_pred)

0.6028880866425993


array([[ 37,  61],
       [ 49, 130]])
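
One plausible reason for the much lower KNN score is feature scale: KNN relies on distances, and a column such as `funding_total_usd` (values in the millions) can dominate columns that only take 0 or 1. This is an assumption, not something verified on the notebook's data. The sketch below uses synthetic data to show the pattern of putting a `StandardScaler` in front of the classifier.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one small-scale feature decides the label,
# one large-scale feature is pure noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=400)
noise = rng.normal(size=400) * 1e6
X_syn = np.column_stack([signal, noise])
y_syn = (signal > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

# KNN on raw features versus KNN on standardised features.
raw_knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)).fit(X_tr, y_tr)

raw_acc = raw_knn.score(X_te, y_te)
scaled_acc = scaled_knn.score(X_te, y_te)
print("unscaled:", raw_acc, "scaled:", scaled_acc)
```

Putting the scaler inside the pipeline means it is fitted on the training part only, so no information from the hold-out rows leaks into the scaling.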

In [None]:
Ada=AdaBoostClassifier()
Ada.fit(X_train,y_train)
y_pred=Ada.predict(X_test)

In [None]:
print(accuracy_score(y_test,y_pred))
confusion_matrix(y_test,y_pred)

0.9819494584837545


array([[ 98,   0],
       [  5, 174]])
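
`cross_val_score` is imported above but not used. A single train/test split can be sensitive to which rows land in each part, so a k-fold estimate is a common complement. The sketch below runs on synthetic data from `make_classification`, shown with the decision tree; any of the three classifiers could be substituted, and on the notebook's data `X` and `y` would take the place of `X_syn` and `y_syn`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and target.
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)

# Five-fold cross-validation; for classifiers the default score is accuracy.
model = DecisionTreeClassifier(max_depth=7, random_state=0)
scores = cross_val_score(model, X_syn, y_syn, cv=5)

print("fold accuracies:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```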

## **Predicting on Test Dataset**

In [None]:
test

Unnamed: 0,latitude,longitude,age_first_funding_year,age_last_funding_year,age_first_milestone_year,age_last_milestone_year,relationships,funding_rounds,funding_total_usd,milestones,is_CA,is_NY,is_MA,is_TX,is_otherstate,is_software,is_web,is_mobile,is_enterprise,is_advertising,is_gamesvideo,is_ecommerce,is_biotech,is_consulting,is_othercategory,has_VC,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,is_top500,is_open,founded_year
0,41.321520,-72.929423,0.2192,1.6795,0.0000,0.0000,5,3,2700000,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,1,1,0,0,1.3333,1,0.0,2007
1,37.452084,-122.112879,7.5507,9.7534,11.0192,11.0192,1,3,62800000,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,6.0000,1,0.0,2000
2,40.296222,-74.050972,1.3178,2.5534,4.0027,4.0027,4,2,8500000,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,2.0000,0,0.0,2004
3,37.789268,-122.395184,0.7507,1.3315,1.0849,6.4986,8,2,2000000,4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1.0000,0,0.0,2005
4,33.133240,-117.275027,8.9781,12.0658,0.0000,0.0000,4,2,80500000,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,5.5000,1,0.0,1997
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
226,45.520247,-122.674195,1.2329,1.2329,2.3945,2.5452,10,1,350000,2,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,4.0000,0,1.0,2010
227,37.440682,-122.123103,0.0000,3.7260,1.4219,1.4219,6,3,14078664,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1.5000,1,0.0,2007
228,37.496904,-122.333057,7.2438,7.2438,11.0932,11.0932,6,1,5000000,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.0000,1,1.0,2000
229,40.728425,-73.999882,0.9233,0.9233,1.5863,2.7726,7,1,1270000,2,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,15.0000,1,1.0,2009


In [None]:
predictions=Ada.predict(test)
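
`Ada.predict(test)` assumes that `test` has the same feature columns, in the same order, as the frame the model was fitted on. A small sketch of aligning a frame to the training columns before predicting; the frames are made up, and `train_cols` and `test_aligned` are illustrative names.

```python
import pandas as pd

# Made-up frames: same columns, different order.
X_toy = pd.DataFrame({"latitude": [37.2], "is_open": [1.0], "founded_year": [2007]})
test_toy = pd.DataFrame({"founded_year": [2005], "latitude": [41.3], "is_open": [0.0]})

train_cols = X_toy.columns

# Fail early if the test frame lacks any training column.
missing = train_cols.difference(test_toy.columns)
assert missing.empty, f"test is missing columns: {list(missing)}"

# Reorder the test columns to match the training frame.
test_aligned = test_toy[train_cols]
print(test_aligned.columns.tolist())
```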

In [None]:
# Map the numeric predictions back to the original labels.
res = pd.DataFrame({"prediction": np.where(predictions==1, "acquired", "closed")}, index=test.index)

from google.colab import files
res.to_csv("submission_1.csv")
files.download("submission_1.csv")
