**Hello, and welcome to the preprocessing notebook of our data science project: predicting successful startups!**

We will try to include as much detail as possible in the markdown cells :)

# Data Import

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = pd.read_csv('/Users/danyhersco/data_startup_final.csv')

# Data Exploration

In [3]:
data.shape

(314486, 19)

We extracted the dataset from DBeaver after running an SQL query. Because the query uses multiple `LEFT JOIN`s to include the `industry` and `technology` features, many startups appear on several rows.

Consequently, let's check the number of unique values for each feature of our dataframe.
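To illustrate why the joins multiply rows, here is a minimal sketch on toy data. The tables `toy_startups`, `toy_industries` and `toy_rounds` and their contents are hypothetical and only mimic the structure described above; they are not the actual SQL query or data.

```python
import pandas as pd

# Hypothetical toy tables, not taken from the real dataset
toy_startups = pd.DataFrame({"name": ["A", "B"]})
toy_industries = pd.DataFrame({
    "name": ["A", "A", "B"],
    "industry_name": ["Health Care", "Software", "Finance"],
})
toy_rounds = pd.DataFrame({
    "name": ["A", "A", "A"],
    "stage": ["pre_seed", "seed", "series_a"],
})

# Chained left joins: every industry row is paired with every funding-round row
joined = (
    toy_startups
    .merge(toy_industries, on="name", how="left")
    .merge(toy_rounds, on="name", how="left")
)

# Startup A has 2 industries x 3 rounds; startup B has no rounds,
# so the left join keeps it on a single row with a missing stage
rows_per_startup = joined.groupby("name").size()
print(rows_per_startup)
```

Each startup ends up with as many rows as there are combinations of its joined values, which is why the row count is much larger than the number of startups.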

In [4]:
data.describe(include='all')

Unnamed: 0,name,website,short_description,ipo_status,founded_on,went_public_on,exited_on,num_funding_rounds,last_equity_funding_type,last_equity_funding_total,last_funding_at,headquartersCountry,headquartersRegion,employeeCount,industry_name,technology_name,announcedOn,stage,moneyRaised
count,314486,311288,314483,314486,314486,5383,22095,314486.0,288993,236615,314486,313337,313294,302219.0,305345,299492,313835,281285,223828
unique,55665,54614,56142,3,1083,507,1370,,18,15347,2905,162,1392,,40,8,3360,9,34767
top,Arthur Intelligence,https://www.goarthur.ai/,Limited information available,private,2016-01-01,2021-11-11,2021-02-19,,seed,"{""currency"":""USD"",""amount"":100000000,""amountUS...",2019-01-01,US,California,,Health Care,Software,2019-01-01 00:00:00.000,seed,"{""amount"": 10000000, ""currency"": ""USD"", ""amoun..."
freq,216,216,597,309434,67496,114,146,,107532,4782,2892,138635,49880,,29388,164642,4343,101387,5685
mean,,,,,,,,3.414276,,,,,,61.950658,,,,,
std,,,,,,,,2.421491,,,,,,443.224729,,,,,
min,,,,,,,,1.0,,,,,,0.0,,,,,
25%,,,,,,,,2.0,,,,,,6.0,,,,,
50%,,,,,,,,3.0,,,,,,31.0,,,,,
75%,,,,,,,,5.0,,,,,,31.0,,,,,


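We have **55665** unique startup names in our dataset (the `unique` row of the `name` column). The next step will be to collapse the duplicated rows into one row per startup, while handling the columns whose values differ across those rows.

Here is what our dataset looks like: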
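One way to find out which columns differ across the rows of a single startup is to count distinct values per company. The sketch below uses a small hypothetical frame `toy` with a few of the column names from our dataset; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame reusing some of the dataset's column names
toy = pd.DataFrame({
    "name": ["A", "A", "A", "B"],
    "website": ["a.com", "a.com", "a.com", "b.com"],
    "industry_name": ["Health Care", "Health Care", "Software", "Finance"],
    "stage": ["seed", "pre_seed", "seed", "seed"],
})

# Number of distinct values per column, as in the `unique` row of describe()
n_unique = toy.nunique()

# Columns that take more than one value for at least one startup:
# these are the ones that need encoding before the rows can be collapsed
varies = toy.groupby("name").nunique().gt(1).any()
varying_columns = varies[varies].index.tolist()
print(varying_columns)
```

Columns that never vary within a startup (here `website`) can simply keep their first value when the rows are collapsed.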

In [5]:
data.head()

Unnamed: 0,name,website,short_description,ipo_status,founded_on,went_public_on,exited_on,num_funding_rounds,last_equity_funding_type,last_equity_funding_total,last_funding_at,headquartersCountry,headquartersRegion,employeeCount,industry_name,technology_name,announcedOn,stage,moneyRaised
0,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2019-04-15 00:00:00.000,unknown,"{""amount"": 5652200, ""currency"": ""USD"", ""amount..."
1,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2020-08-01 00:00:00.000,seed,"{""amount"": 10000000, ""currency"": ""USD"", ""amoun..."
2,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2017-04-01 00:00:00.000,unknown,"{""amount"": 2500000, ""currency"": ""USD"", ""amount..."
3,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2016-08-09 00:00:00.000,pre_seed,"{""amount"": 13000000, ""currency"": ""USD"", ""amoun..."
4,"Cardiomo Care, Inc.",https://cardiomo.com/,AI-based Remote Patient Monitoring solution to...,private,2016-01-27,,,10,seed,"{""currency"":""USD"",""amount"":10000000,""amountUSD...",2020-08-01,US,New York,31.0,Health Care,Software,2019-02-01 00:00:00.000,pre_seed,"{""amount"": 12000000, ""currency"": ""USD"", ""amoun..."


# Remove the duplicates through encoding

This will be challenging because there are no fully duplicated rows. As we said above, the duplicates are a result of the `LEFT JOIN`s, which multiplied the rows for each company (several industries, several technologies, and several funding rounds). We have to one-hot encode these features, which is what we will do in this section.

At the end of this section, we should reach a shape of `(55665, >18)`.
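The general idea can be sketched as follows, assuming a hypothetical toy frame `toy`: build indicator columns for the multi-valued feature, take the maximum per startup, and join the result to the company-level columns. This is only an illustration of the technique, not necessarily the exact code used in the rest of the notebook.

```python
import pandas as pd

# Hypothetical toy frame: startup A appears on three rows, B on one
toy = pd.DataFrame({
    "name": ["A", "A", "A", "B"],
    "founded_on": ["2016-01-27", "2016-01-27", "2016-01-27", "2018-05-01"],
    "industry_name": ["Health Care", "Software", "Health Care", "Finance"],
})

# Company-level columns are identical on every row of a startup: keep the first
company = toy.groupby("name")[["founded_on"]].first()

# One indicator column per industry, then the max per startup:
# 1 if the startup has that industry on any of its rows, else 0
dummies = pd.get_dummies(toy["industry_name"], prefix="industry").astype(int)
industry_flags = dummies.groupby(toy["name"]).max()

# One row per startup, with more columns than before
encoded = company.join(industry_flags)
print(encoded.shape)
```

The number of rows drops to the number of unique startups, while the number of columns grows with the number of distinct categories.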

In [3]:
from useful.variables import industries, technologies

## `industry`