**Hello, welcome to the preprocessing notebook of our data science project: predicting successful startups!**

We will try to include as much detail as possible in the markdown cells :)

# Data Import

In [78]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [79]:
# data = pd.read_csv('/Users/danyhersco/data_startup_final.csv')
# Local path on my machine: uncomment and point it at your own copy of the CSV.

# Data Exploration

In [80]:
data.shape

(314486, 19)

Let's quickly drop real duplicates (see below for "false duplicates"):

In [84]:
data = data.drop_duplicates()
data.shape

(313349, 19)

We extracted the dataset from DBeaver after running an SQL query. Because the query uses several `LEFT JOIN`s to bring in the `industry` and `technology` features, many startups appear on multiple rows.
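To see why the joins multiply rows, here is a minimal sketch on made-up tables. The names `startups`, `industries_tbl`, and `technologies_tbl` are hypothetical and do not reflect the actual schema behind our query. A startup with 2 industries and 2 technologies comes back as 2 × 2 = 4 rows.

```python
import pandas as pd

# Hypothetical toy tables; the real schema behind the SQL query is not shown here.
startups = pd.DataFrame({"name": ["A", "B"]})
industries_tbl = pd.DataFrame({
    "name": ["A", "A", "B"],
    "industry_name": ["Health Care", "Software", "Payments"],
})
technologies_tbl = pd.DataFrame({
    "name": ["A", "A"],
    "technology_name": ["Software", "Hardware"],
})

# Two successive left joins, mimicking the structure of the query.
joined = (
    startups
    .merge(industries_tbl, on="name", how="left")
    .merge(technologies_tbl, on="name", how="left")
)

# Number of rows each startup ends up with after the joins.
rows_per_startup = joined.groupby("name").size()
print(rows_per_startup)
```

Startup `B` has no technology, so the left join keeps it on a single row with a missing `technology_name`.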

Consequently, let's check for the number of unique values for each feature of our dataframe.

In [85]:
data.describe(include='all')

Unnamed: 0,name,website,short_description,ipo_status,founded_on,went_public_on,exited_on,num_funding_rounds,last_equity_funding_type,last_equity_funding_total,last_funding_at,headquartersCountry,headquartersRegion,employeeCount,industry_name,technology_name,announcedOn,stage,moneyRaised
count,313349,310151,313346,313349,313349,5368,22027,313349.0,287895,235682,313349,312208,312165,301115.0,304228,298398,312698,280208,223067
unique,55665,54614,56142,3,1083,507,1370,,18,15347,2905,162,1392,,40,8,3360,9,34767
top,Arthur Intelligence,https://www.goarthur.ai/,Limited information available,private,2016-01-01,2021-11-11,2021-02-19,,seed,"{""currency"":""USD"",""amount"":100000000,""amountUS...",2019-01-01,US,California,,Health Care,Software,2019-01-01 00:00:00.000,seed,"{""amount"": 10000000, ""currency"": ""USD"", ""amoun..."
freq,216,216,597,308312,67276,114,146,,107263,4776,2882,138007,49580,,29285,164076,4340,100995,5655
mean,,,,,,,,3.410079,,,,,,61.976743,,,,,
std,,,,,,,,2.42072,,,,,,443.794835,,,,,
min,,,,,,,,1.0,,,,,,0.0,,,,,
25%,,,,,,,,2.0,,,,,,6.0,,,,,
50%,,,,,,,,3.0,,,,,,31.0,,,,,
75%,,,,,,,,5.0,,,,,,31.0,,,,,


We have **55665** unique startups in our dataset. The next step is to collapse the repeated rows into one row per startup while keeping the information held in the columns that differ between those rows.
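The same check can be done directly, without reading it off the `describe` table. This is a small sketch on a toy frame (the `toy` data is made up for illustration): `nunique` counts distinct startups and `value_counts` shows how many rows each one occupies.

```python
import pandas as pd

# Made-up example: startup "A" appears on three rows, "B" on one.
toy = pd.DataFrame({
    "name": ["A", "A", "A", "B"],
    "stage": ["seed", "series_a", "seed", "seed"],
})

n_startups = toy["name"].nunique()         # number of distinct startups
rows_per_name = toy["name"].value_counts() # rows per startup, most frequent first
print(n_startups)
print(rows_per_name)
```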

Here is what our dataset looks like:

In [86]:
data.head()

Unnamed: 0,name,website,short_description,ipo_status,founded_on,went_public_on,exited_on,num_funding_rounds,last_equity_funding_type,last_equity_funding_total,last_funding_at,headquartersCountry,headquartersRegion,employeeCount,industry_name,technology_name,announcedOn,stage,moneyRaised
0,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2019-04-15 00:00:00.000,unknown,"{""amount"": 5652200, ""currency"": ""USD"", ""amount..."
1,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2020-08-01 00:00:00.000,seed,"{""amount"": 10000000, ""currency"": ""USD"", ""amoun..."
2,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2017-04-01 00:00:00.000,unknown,"{""amount"": 2500000, ""currency"": ""USD"", ""amount..."
3,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2016-08-09 00:00:00.000,pre_seed,"{""amount"": 13000000, ""currency"": ""USD"", ""amoun..."
4,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2019-02-01 00:00:00.000,pre_seed,"{""amount"": 12000000, ""currency"": ""USD"", ""amoun..."


# Remove the duplicates through encoding

This is challenging because the remaining rows are not exact duplicates. As noted above, the `LEFT JOIN`s multiplied the rows for each company: one row per combination of industry, technology, and funding round. We therefore have to one-hot encode these features and collapse them to one row per startup, which is what we do in this section.

At the end of this section, we should reach a shape of `(55665, >18)`.
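The idea can be sketched on a toy frame before applying it to the real data. This sketch uses `pd.get_dummies` instead of scikit-learn's `OneHotEncoder` purely for brevity, and the `toy` data is made up: each row is one-hot encoded, then the maximum per startup gives a "multi-hot" row with one indicator per industry.

```python
import pandas as pd

# Made-up example: startup "A" is listed under two industries.
toy = pd.DataFrame({
    "name": ["A", "A", "A", "B"],
    "industry_name": ["Health Care", "Software", "Health Care", "Payments"],
})

# One indicator column per industry, one row per original row.
dummies = pd.get_dummies(toy["industry_name"], dtype=int)

# Taking the max per startup sets an indicator to 1 if the startup
# appears with that industry on any of its rows.
multi_hot = dummies.groupby(toy["name"]).max()
print(multi_hot)
```

The result has exactly one row per startup, which is the shape we are aiming for.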

In [27]:
from useful.variables import industries, technologies

## `industry`

In [28]:
data.reset_index(drop=True, inplace=True)

In [29]:
data.head()

Unnamed: 0,name,website,short_description,ipo_status,founded_on,went_public_on,exited_on,num_funding_rounds,last_equity_funding_type,last_equity_funding_total,last_funding_at,headquartersCountry,headquartersRegion,employeeCount,industry_name,technology_name,announcedOn,stage,moneyRaised
0,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2019-04-15 00:00:00.000,unknown,"{""amount"": 5652200, ""currency"": ""USD"", ""amount..."
1,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2020-08-01 00:00:00.000,seed,"{""amount"": 10000000, ""currency"": ""USD"", ""amoun..."
2,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2017-04-01 00:00:00.000,unknown,"{""amount"": 2500000, ""currency"": ""USD"", ""amount..."
3,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2016-08-09 00:00:00.000,pre_seed,"{""amount"": 13000000, ""currency"": ""USD"", ""amoun..."
4,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2019-02-01 00:00:00.000,pre_seed,"{""amount"": 12000000, ""currency"": ""USD"", ""amoun..."


In [10]:
from sklearn.preprocessing import OneHotEncoder

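# On scikit-learn >= 1.2 use sparse_output=False; the `sparse` argument was removed in 1.4.
encoder = OneHotEncoder(sparse=False)
# The right-hand side runs first, so the encoder is already fitted
# when get_feature_names_out() is evaluated.
data[encoder.get_feature_names_out()] = encoder.fit_transform(data[['industry_name']])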

In [12]:
data.head()

Unnamed: 0,name,website,short_description,ipo_status,founded_on,went_public_on,exited_on,num_funding_rounds,last_equity_funding_type,last_equity_funding_total,...,industry_name_Privacy and Security,industry_name_Professional Services,industry_name_Real Estate and Construction,industry_name_Sales and Marketing,industry_name_Software,industry_name_Sports,industry_name_Transportation,industry_name_Travel and Tourism,industry_name_Video,industry_name_nan
0,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
encoder2 = OneHotEncoder(sparse=False)
data[encoder2.get_feature_names_out()] = encoder2.fit_transform(data[['technology_name']])

In [18]:
data.head()

Unnamed: 0,name,website,short_description,ipo_status,founded_on,went_public_on,exited_on,num_funding_rounds,last_equity_funding_type,last_equity_funding_total,...,industry_name_nan,technology_name_AR and VR,technology_name_Artificial Intelligence,technology_name_Biotechnology,technology_name_BlockChain,technology_name_Hardware,technology_name_Science and Engineering,technology_name_Software,technology_name_Sustainability,technology_name_nan
0,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [14]:
data.columns

Index(['name', 'website', 'short_description', 'ipo_status', 'founded_on',
       'went_public_on', 'exited_on', 'num_funding_rounds',
       'last_equity_funding_type', 'last_equity_funding_total',
       'last_funding_at', 'headquartersCountry', 'headquartersRegion',
       'employeeCount', 'industry_name', 'technology_name', 'announcedOn',
       'stage', 'moneyRaised', 'industry_name_Advertising',
       'industry_name_Agriculture and Farming',
       'industry_name_Clothing and Apparel',
       'industry_name_Commerce and Shopping',
       'industry_name_Community and Lifestyle',
       'industry_name_Computer Hardware', 'industry_name_Consumer Electronics',
       'industry_name_Consumer Goods', 'industry_name_Content and Publishing',
       'industry_name_Data and Analytics', 'industry_name_Design',
       'industry_name_Education', 'industry_name_Energy',
       'industry_name_Environment and Sustainability', 'industry_name_Events',
       'industry_name_Financial Services', 'i

In [None]:
grouped = data.groupby('name')['']

In [19]:
# Lists do not support `-`; keep every column that is not listed in `industries`.
columns_to_groupby = [col for col in data.columns if col not in industries]

# Test

In [70]:
data.columns

Index(['name', 'website', 'short_description', 'ipo_status', 'founded_on',
       'went_public_on', 'exited_on', 'num_funding_rounds',
       'last_equity_funding_type', 'last_equity_funding_total',
       'last_funding_at', 'headquartersCountry', 'headquartersRegion',
       'employeeCount', 'industry_name', 'technology_name', 'announcedOn',
       'stage', 'moneyRaised'],
      dtype='object')

In [71]:
test = data[['name', 'website', 'short_description', 'ipo_status', 'founded_on',
             'went_public_on', 'exited_on', 'num_funding_rounds',
             'last_equity_funding_type', 'last_equity_funding_total',
             'last_funding_at', 'headquartersCountry', 'headquartersRegion',
             'employeeCount', 'industry_name', 'technology_name']]

In [72]:
from sklearn.preprocessing import OneHotEncoder

#encoder1 = OneHotEncoder(sparse=False)
#test[encoder1.get_feature_names_out()] = encoder1.fit_transform(test[['industry_name']])

encoder2 = OneHotEncoder(sparse=False)
test[encoder2.get_feature_names_out()] = encoder2.fit_transform(test[['technology_name']])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test[encoder2.get_feature_names_out()] = encoder2.fit_transform(test[['technology_name']])
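This warning appears because `test` was created by selecting columns from `data`, so pandas cannot tell whether the new columns should also affect `data`. A minimal sketch of the usual way around it, on a made-up `toy` frame, is to take an explicit `.copy()` before adding columns:

```python
import pandas as pd

# Made-up example frame.
toy = pd.DataFrame({
    "name": ["A", "B"],
    "technology_name": ["Software", "Hardware"],
    "stage": ["seed", "seed"],
})

# .copy() makes `subset` an independent DataFrame, so adding columns to it
# is unambiguous and leaves `toy` untouched.
subset = toy[["name", "technology_name"]].copy()
subset["is_software"] = (subset["technology_name"] == "Software").astype(int)
print(subset)
```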


In [73]:
test

Unnamed: 0,name,website,short_description,ipo_status,founded_on,went_public_on,exited_on,num_funding_rounds,last_equity_funding_type,last_equity_funding_total,...,technology_name,technology_name_AR and VR,technology_name_Artificial Intelligence,technology_name_Biotechnology,technology_name_BlockChain,technology_name_Hardware,technology_name_Science and Engineering,technology_name_Software,technology_name_Sustainability,technology_name_nan
0,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",...,Software,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",...,Software,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",...,Software,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",...,Software,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",...,Software,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
314481,Maze,https://maze.co/,"Rapid, remote testing for agile teams, from id...",private,2018-01-01,,,4,series_b,"{""currency"":""USD"",""amount"":4000000000,""amountU...",...,Software,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
314482,Maze,https://maze.co/,"Rapid, remote testing for agile teams, from id...",private,2018-01-01,,,4,series_b,"{""currency"":""USD"",""amount"":4000000000,""amountU...",...,Software,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
314483,Maze,https://maze.co/,"Rapid, remote testing for agile teams, from id...",private,2018-01-01,,,4,series_b,"{""currency"":""USD"",""amount"":4000000000,""amountU...",...,Software,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
314484,Maze,https://maze.co/,"Rapid, remote testing for agile teams, from id...",private,2018-01-01,,,4,series_b,"{""currency"":""USD"",""amount"":4000000000,""amountU...",...,Software,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [74]:
test.columns

Index(['name', 'website', 'short_description', 'ipo_status', 'founded_on',
       'went_public_on', 'exited_on', 'num_funding_rounds',
       'last_equity_funding_type', 'last_equity_funding_total',
       'last_funding_at', 'headquartersCountry', 'headquartersRegion',
       'employeeCount', 'industry_name', 'technology_name',
       'technology_name_AR and VR', 'technology_name_Artificial Intelligence',
       'technology_name_Biotechnology', 'technology_name_BlockChain',
       'technology_name_Hardware', 'technology_name_Science and Engineering',
       'technology_name_Software', 'technology_name_Sustainability',
       'technology_name_nan'],
      dtype='object')

In [75]:
# Select the columns with a list (double brackets): tuple-style selection
# after groupby is deprecated in pandas.
grouped = test.groupby(['name'])[['name', 'technology_name', 'technology_name_AR and VR',
                                  'technology_name_Artificial Intelligence',
                                  'technology_name_Biotechnology', 'technology_name_BlockChain',
                                  'technology_name_Hardware', 'technology_name_Science and Engineering',
                                  'technology_name_Software', 'technology_name_Sustainability',
                                  'technology_name_nan']].max()


In [76]:
grouped.head(20)

Unnamed: 0_level_0,name,technology_name_AR and VR,technology_name_Artificial Intelligence,technology_name_Biotechnology,technology_name_BlockChain,technology_name_Hardware,technology_name_Science and Engineering,technology_name_Software,technology_name_Sustainability,technology_name_nan
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
!Creatice,!Creatice,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
#IconSource,#IconSource,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
&ME,&ME,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
&Open,&Open,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
&SISTERS,&SISTERS,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
'She said' App,'She said' App,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
'hoodHeroes,'hoodHeroes,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
(med)24,(med)24,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
* mastBus Laundry,* mastBus Laundry,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
+ Organic Remedies,+ Organic Remedies,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [77]:
grouped.columns

Index(['name', 'technology_name_AR and VR',
       'technology_name_Artificial Intelligence',
       'technology_name_Biotechnology', 'technology_name_BlockChain',
       'technology_name_Hardware', 'technology_name_Science and Engineering',
       'technology_name_Software', 'technology_name_Sustainability',
       'technology_name_nan'],
      dtype='object')

In [54]:
encoder1 = OneHotEncoder(sparse=False)
grouped[encoder1.get_feature_names_out()] = encoder1.fit_transform(grouped[['industry_name']])

In [55]:
grouped

Unnamed: 0_level_0,Unnamed: 1_level_0,name,industry_name,technology_name_AR and VR,technology_name_Artificial Intelligence,technology_name_Biotechnology,technology_name_BlockChain,technology_name_Hardware,technology_name_Science and Engineering,technology_name_Software,technology_name_Sustainability,...,industry_name_Payments,industry_name_Privacy and Security,industry_name_Professional Services,industry_name_Real Estate and Construction,industry_name_Sales and Marketing,industry_name_Software,industry_name_Sports,industry_name_Transportation,industry_name_Travel and Tourism,industry_name_Video
name,industry_name,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
#IconSource,Commerce and Shopping,#IconSource,Commerce and Shopping,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
#IconSource,Media and Entertainment,#IconSource,Media and Entertainment,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
#IconSource,Sales and Marketing,#IconSource,Sales and Marketing,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
#IconSource,Sports,#IconSource,Sports,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
&ME,Health Care,&ME,Health Care,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
​Masto,Manufacturing,​Masto,Manufacturing,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
​Masto,Music and Audio,​Masto,Music and Audio,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
客湖KEHU,Commerce and Shopping,客湖KEHU,Commerce and Shopping,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
客湖KEHU,Sales and Marketing,客湖KEHU,Sales and Marketing,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [58]:
grouped.reset_index(drop=True, inplace=True)
grouped

Unnamed: 0,name,industry_name,technology_name_AR and VR,technology_name_Artificial Intelligence,technology_name_Biotechnology,technology_name_BlockChain,technology_name_Hardware,technology_name_Science and Engineering,technology_name_Software,technology_name_Sustainability,...,industry_name_Payments,industry_name_Privacy and Security,industry_name_Professional Services,industry_name_Real Estate and Construction,industry_name_Sales and Marketing,industry_name_Software,industry_name_Sports,industry_name_Transportation,industry_name_Travel and Tourism,industry_name_Video
0,#IconSource,Commerce and Shopping,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,#IconSource,Media and Entertainment,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,#IconSource,Sales and Marketing,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,#IconSource,Sports,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,&ME,Health Care,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
104771,​Masto,Manufacturing,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
104772,​Masto,Music and Audio,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
104773,客湖KEHU,Commerce and Shopping,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
104774,客湖KEHU,Sales and Marketing,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [59]:
grouped.columns

Index(['name', 'industry_name', 'technology_name_AR and VR',
       'technology_name_Artificial Intelligence',
       'technology_name_Biotechnology', 'technology_name_BlockChain',
       'technology_name_Hardware', 'technology_name_Science and Engineering',
       'technology_name_Software', 'technology_name_Sustainability',
       'technology_name_nan', 'industry_name_Advertising',
       'industry_name_Agriculture and Farming',
       'industry_name_Clothing and Apparel',
       'industry_name_Commerce and Shopping',
       'industry_name_Community and Lifestyle',
       'industry_name_Computer Hardware', 'industry_name_Consumer Electronics',
       'industry_name_Consumer Goods', 'industry_name_Content and Publishing',
       'industry_name_Data and Analytics', 'industry_name_Design',
       'industry_name_Education', 'industry_name_Energy',
       'industry_name_Environment and Sustainability', 'industry_name_Events',
       'industry_name_Financial Services', 'industry_name_Food 

In [60]:
grouped_2 = grouped.groupby('name')[grouped.drop(columns='industry_name').columns].max()

In [62]:
grouped_2.reset_index(drop=True, inplace=True)

In [63]:
grouped_2

Unnamed: 0,name,technology_name_AR and VR,technology_name_Artificial Intelligence,technology_name_Biotechnology,technology_name_BlockChain,technology_name_Hardware,technology_name_Science and Engineering,technology_name_Software,technology_name_Sustainability,technology_name_nan,...,industry_name_Payments,industry_name_Privacy and Security,industry_name_Professional Services,industry_name_Real Estate and Construction,industry_name_Sales and Marketing,industry_name_Software,industry_name_Sports,industry_name_Transportation,industry_name_Travel and Tourism,industry_name_Video
0,#IconSource,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1,&ME,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,&Open,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,&SISTERS,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,'She said' App,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48871,Протеже Системс,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48872,Сибирьэко,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48873,​Masto,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48874,客湖KEHU,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


# Remove the duplicates through encoding

This is challenging because the remaining rows are not exact duplicates. As noted above, the `LEFT JOIN`s multiplied the rows for each company: one row per combination of industry, technology, and funding round. We therefore have to one-hot encode these features and collapse them to one row per startup, which is what we do in this section.

At the end of this section, we should reach a shape of `(55665, >18)`.

In [88]:
data.head()

Unnamed: 0,name,website,short_description,ipo_status,founded_on,went_public_on,exited_on,num_funding_rounds,last_equity_funding_type,last_equity_funding_total,last_funding_at,headquartersCountry,headquartersRegion,employeeCount,industry_name,technology_name,announcedOn,stage,moneyRaised
0,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2019-04-15 00:00:00.000,unknown,"{""amount"": 5652200, ""currency"": ""USD"", ""amount..."
1,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2020-08-01 00:00:00.000,seed,"{""amount"": 10000000, ""currency"": ""USD"", ""amoun..."
2,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2017-04-01 00:00:00.000,unknown,"{""amount"": 2500000, ""currency"": ""USD"", ""amount..."
3,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2016-08-09 00:00:00.000,pre_seed,"{""amount"": 13000000, ""currency"": ""USD"", ""amoun..."
4,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2019-02-01 00:00:00.000,pre_seed,"{""amount"": 12000000, ""currency"": ""USD"", ""amoun..."


##  `industry` encoding

In [93]:
data_ind = data[['name', 'industry_name']].drop_duplicates()
data_ind.head()

Unnamed: 0,name,industry_name
0,"Cardiomo Care, Inc.",Health Care
20,"Cardiomo Care, Inc.",Computer Hardware
40,"Cardiomo Care, Inc.",Consumer Electronics
60,UniCoin Blockchain Inc.,Payments
62,UniCoin Blockchain Inc.,Financial Services


In [94]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse=False)
data_ind[encoder.get_feature_names_out()] = encoder.fit_transform(data_ind[['industry_name']])

In [95]:
data_ind.head()

Unnamed: 0,name,industry_name,industry_name_Advertising,industry_name_Agriculture and Farming,industry_name_Clothing and Apparel,industry_name_Commerce and Shopping,industry_name_Community and Lifestyle,industry_name_Computer Hardware,industry_name_Consumer Electronics,industry_name_Consumer Goods,...,industry_name_Privacy and Security,industry_name_Professional Services,industry_name_Real Estate and Construction,industry_name_Sales and Marketing,industry_name_Software,industry_name_Sports,industry_name_Transportation,industry_name_Travel and Tourism,industry_name_Video,industry_name_nan
0,"Cardiomo Care, Inc.",Health Care,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20,"Cardiomo Care, Inc.",Computer Hardware,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
40,"Cardiomo Care, Inc.",Consumer Electronics,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
60,UniCoin Blockchain Inc.,Payments,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
62,UniCoin Blockchain Inc.,Financial Services,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [99]:
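# Strip the 'industry_name_' prefix from the encoded columns (everything after the first two).
prefix = 'industry_name_'
data_ind = data_ind.rename(columns={col: col[len(prefix):] for col in data_ind.columns[2:]})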

In [101]:
data_ind.head()

Unnamed: 0,name,industry_name,Advertising,Agriculture and Farming,Clothing and Apparel,Commerce and Shopping,Community and Lifestyle,Computer Hardware,Consumer Electronics,Consumer Goods,...,Privacy and Security,Professional Services,Real Estate and Construction,Sales and Marketing,Software,Sports,Transportation,Travel and Tourism,Video,nan
0,"Cardiomo Care, Inc.",Health Care,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20,"Cardiomo Care, Inc.",Computer Hardware,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
40,"Cardiomo Care, Inc.",Consumer Electronics,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
60,UniCoin Blockchain Inc.,Payments,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
62,UniCoin Blockchain Inc.,Financial Services,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [102]:
# Drop the original text column explicitly so that max() only sees the indicator columns.
data_ind = data_ind.drop(columns='industry_name').groupby('name').max()
data_ind.head()


Unnamed: 0_level_0,Advertising,Agriculture and Farming,Clothing and Apparel,Commerce and Shopping,Community and Lifestyle,Computer Hardware,Consumer Electronics,Consumer Goods,Content and Publishing,Data and Analytics,...,Privacy and Security,Professional Services,Real Estate and Construction,Sales and Marketing,Software,Sports,Transportation,Travel and Tourism,Video,nan
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
!Creatice,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
#IconSource,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
&ME,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
&Open,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
&SISTERS,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [112]:
data_ind.shape

(55665, 41)
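
The 41 columns correspond to the 40 distinct `industry_name` values plus the `nan` indicator, and the 55665 rows to one row per startup. The collapse from one row per (startup, industry) pair to one multi-hot row per startup can be sketched on a toy frame. This is only an illustration: the `toy` data is made up, and `pd.get_dummies` is swapped in for `OneHotEncoder` to keep the example short.

```python
import pandas as pd

# Made-up (name, industry_name) pairs standing in for the real data.
toy = pd.DataFrame({
    "name": ["A", "A", "B"],
    "industry_name": ["Health Care", "Software", "Software"],
})

# One indicator column per industry, still one row per (name, industry) pair.
one_hot = pd.get_dummies(toy["industry_name"], dtype=float)

# Taking the max per startup keeps a 1.0 wherever any of its rows had one.
multi_hot = pd.concat([toy[["name"]], one_hot], axis=1).groupby("name").max()
print(multi_hot)
```

Because every indicator is either 0.0 or 1.0, `max` acts as a logical OR across a startup's rows.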

## `technology` encoding

In [103]:
data_tec = data[['name', 'technology_name']].drop_duplicates()
data_tec.head()

Unnamed: 0,name,technology_name
0,"Cardiomo Care, Inc.",Software
10,"Cardiomo Care, Inc.",Hardware
60,UniCoin Blockchain Inc.,BlockChain
64,Dutch Finance Lab,Software
66,Polybit,Software


In [104]:
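encoder2 = OneHotEncoder(sparse=False)  # scikit-learn >= 1.2 renames this argument to sparse_output
encoded_tec = encoder2.fit_transform(data_tec[['technology_name']])  # fit before asking for feature names
data_tec[encoder2.get_feature_names_out()] = encoded_tec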

In [107]:
data_tec.head()

Unnamed: 0,name,technology_name,technology_name_AR and VR,technology_name_Artificial Intelligence,technology_name_Biotechnology,technology_name_BlockChain,technology_name_Hardware,technology_name_Science and Engineering,technology_name_Software,technology_name_Sustainability,technology_name_nan
0,"Cardiomo Care, Inc.",Software,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
10,"Cardiomo Care, Inc.",Hardware,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
60,UniCoin Blockchain Inc.,BlockChain,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
64,Dutch Finance Lab,Software,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
66,Polybit,Software,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [108]:
# Strip the 'technology_name_' prefix that OneHotEncoder adds to each indicator column
data_tec.columns = data_tec.columns.str.replace('^technology_name_', '', regex=True)

In [109]:
data_tec.head()

Unnamed: 0,name,technology_name,AR and VR,Artificial Intelligence,Biotechnology,BlockChain,Hardware,Science and Engineering,Software,Sustainability,nan
0,"Cardiomo Care, Inc.",Software,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
10,"Cardiomo Care, Inc.",Hardware,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
60,UniCoin Blockchain Inc.,BlockChain,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
64,Dutch Finance Lab,Software,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
66,Polybit,Software,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [110]:
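# Drop the text column first so that max() only aggregates the 0/1 indicator columns
data_tec = data_tec.drop(columns='technology_name').groupby('name').max()
data_tec.head()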


name,AR and VR,Artificial Intelligence,Biotechnology,BlockChain,Hardware,Science and Engineering,Software,Sustainability,nan
!Creatice,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
#IconSource,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
&ME,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
&Open,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
&SISTERS,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [111]:
data_tec.shape

(55665, 9)
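
The row count matches the 55665 unique `name` values reported by `describe` earlier, which is what one row per startup should give. A minimal sketch of that sanity check, plus a check that every startup carries at least one flag (possibly the `nan` one), is shown below on a toy frame. The `toy` data and variable names are illustrative assumptions, and `pd.get_dummies` with `dummy_na=True` is used here in place of `OneHotEncoder`.

```python
import pandas as pd

# Made-up (name, technology_name) pairs; startup "C" has no technology listed.
toy = pd.DataFrame({
    "name": ["A", "A", "B", "C"],
    "technology_name": ["Software", "Hardware", "Software", None],
})

# dummy_na=True adds an indicator for missing values, like the `nan` column above.
flags = pd.get_dummies(toy["technology_name"], dummy_na=True, dtype=float)
flags.columns = [str(c) for c in flags.columns]  # turn the NaN label into 'nan'

toy_tec = pd.concat([toy[["name"]], flags], axis=1).groupby("name").max()

# One row per startup, and no startup left with an all-zero row.
assert toy_tec.shape[0] == toy["name"].nunique()
assert (toy_tec.sum(axis=1) >= 1).all()
```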

## `funding_round` encoding

In [113]:
data.columns

Index(['name', 'website', 'short_description', 'ipo_status', 'founded_on',
       'went_public_on', 'exited_on', 'num_funding_rounds',
       'last_equity_funding_type', 'last_equity_funding_total',
       'last_funding_at', 'headquartersCountry', 'headquartersRegion',
       'employeeCount', 'industry_name', 'technology_name', 'announcedOn',
       'stage', 'moneyRaised'],
      dtype='object')

In [123]:
data_fun = data[['name', 'announcedOn', 'stage', 'moneyRaised']].drop_duplicates()

In [124]:
data_fun.head()

Unnamed: 0,name,announcedOn,stage,moneyRaised
0,"Cardiomo Care, Inc.",2019-04-15 00:00:00.000,unknown,"{""amount"": 5652200, ""currency"": ""USD"", ""amount..."
1,"Cardiomo Care, Inc.",2020-08-01 00:00:00.000,seed,"{""amount"": 10000000, ""currency"": ""USD"", ""amoun..."
2,"Cardiomo Care, Inc.",2017-04-01 00:00:00.000,unknown,"{""amount"": 2500000, ""currency"": ""USD"", ""amount..."
3,"Cardiomo Care, Inc.",2016-08-09 00:00:00.000,pre_seed,"{""amount"": 13000000, ""currency"": ""USD"", ""amoun..."
4,"Cardiomo Care, Inc.",2019-02-01 00:00:00.000,pre_seed,"{""amount"": 12000000, ""currency"": ""USD"", ""amoun..."
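
The `moneyRaised` values are JSON strings, so they need parsing before the amounts can be used as numbers. A possible sketch follows, assuming only the `amount` and `currency` keys that are visible in the output above (the third key is truncated there and is left alone). The `toy_fun` frame and the `parse_money` helper are hypothetical names introduced for illustration.

```python
import json
import pandas as pd

# Made-up rows shaped like the `moneyRaised` strings shown above.
toy_fun = pd.DataFrame({
    "name": ["A", "A", "B"],
    "moneyRaised": [
        '{"amount": 5652200, "currency": "USD"}',
        None,  # some rounds have no amount recorded
        '{"amount": 10000000, "currency": "USD"}',
    ],
})


def parse_money(raw):
    """Return (amount, currency) from a JSON string, or (NaN, None) if missing."""
    if pd.isna(raw):
        return (float("nan"), None)
    record = json.loads(raw)
    return (record.get("amount", float("nan")), record.get("currency"))


parsed = toy_fun["moneyRaised"].apply(parse_money)
toy_fun["amount"] = [p[0] for p in parsed]
toy_fun["currency"] = [p[1] for p in parsed]
print(toy_fun[["name", "amount", "currency"]])
```

Keeping missing amounts as `NaN` rather than 0 avoids treating an undisclosed round as a round that raised nothing.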


In [125]:
data_fun['announcedOn'] = pd.to_datetime(data_fun['announcedOn'])

In [126]:
data_fun.sort_values(by=['name', 'announcedOn']).head(30)

Unnamed: 0,name,announcedOn,stage,moneyRaised
62012,!Creatice,2018-07-09,pre_seed,"{""amount"": 30200000, ""currency"": ""USD"", ""amoun..."
251494,#IconSource,2021-05-18,seed,"{""amount"": 160000000, ""currency"": ""USD"", ""amou..."
233180,&ME,2018-08-20,seed,"{""amount"": 38800000, ""currency"": ""USD"", ""amoun..."
233181,&ME,2019-03-12,seed,"{""amount"": 100000000, ""currency"": ""USD"", ""amou..."
233182,&ME,2020-01-18,seed,"{""amount"": 67600000, ""currency"": ""USD"", ""amoun..."
244142,&Open,2017-02-27,pre_seed,
244140,&Open,2021-05-20,seed,"{""amount"": 720000000, ""currency"": ""USD"", ""amou..."
244141,&Open,2022-04-17,unknown,"{""amount"": 217184000, ""currency"": ""USD"", ""amou..."
244139,&Open,2022-06-22,series_a,"{""amount"": 2600000000, ""currency"": ""USD"", ""amo..."
308288,&SISTERS,2019-11-20,unknown,"{""amount"": 25803800, ""currency"": ""USD"", ""amoun..."
