## Data Preprocessing

### Categorical Variables and Encoding

In [77]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [78]:
df = pd.read_csv('./data/clean_startup_success_dataset.csv')

In [79]:
df.head()

Unnamed: 0,name,country,year,city,main_category,funding_rounds,funding_filled,first_funding_at,last_funding_at,status,status_class,fail,operating,success
0,#fame,IND,2015,Mumbai,Media,1,10000000,2015-01-05,2015-01-05,operating,operating,False,True,False
1,:Qounter,USA,2014,Delaware City,Application Platforms,2,700000,2014-03-01,2014-10-14,operating,operating,False,True,False
2,"(THE) ONE of THEM,Inc.",GBR,2014,,Apps,1,3406878,2014-01-30,2014-01-30,operating,operating,False,True,False
3,0-6.com,CHN,2007,Beijing,Curated Web,1,2000000,2008-03-19,2008-03-19,operating,operating,False,True,False
4,004 Technologies,USA,2010,Champaign,Software,1,10070591,2014-07-24,2014-07-24,operating,operating,False,True,False


So far we have analyzed and cleaned the data. Now we will preprocess it and train the model.

In [80]:
df["first_funding_at"] = pd.to_datetime(df["first_funding_at"], errors='coerce')
df["last_funding_at"] = pd.to_datetime(df["last_funding_at"], errors='coerce')


In [81]:
df["funding_year"] = df["first_funding_at"].dt.year
df["funding_month"] = df["first_funding_at"].dt.month
df["funding_dayofweek"] = df["first_funding_at"].dt.dayofweek
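
The `errors='coerce'` option used above replaces any value that cannot be parsed as a date with `NaT`, and that missing value then propagates into every feature derived through `.dt`. The sketch below illustrates this on hypothetical toy values (not rows from the dataset); it also shows why the derived year appears as a float such as `2015.0` once a missing date is present. The conversion to pandas' nullable `Int64` dtype is an optional suggestion, not something the notebook does.

```python
import pandas as pd

# Toy data (hypothetical values); the second entry is not a valid date
raw = pd.Series(["2015-01-05", "not-a-date", "2014-03-01"])

# Invalid entries become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors="coerce")

# NaT propagates as NaN, which forces the year column to a float dtype
years = parsed.dt.year

# Count how many dates failed to parse
n_unparsed = int(parsed.isna().sum())

# Optional: the nullable integer dtype keeps whole-number years and marks gaps as <NA>
years_nullable = years.astype("Int64")

print(n_unparsed)
print(years_nullable)
```

Checking `parsed.isna().sum()` right after parsing is a cheap way to see how many rows will carry missing date features further down.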

In [82]:
df.head()

Unnamed: 0,name,country,year,city,main_category,funding_rounds,funding_filled,first_funding_at,last_funding_at,status,status_class,fail,operating,success,funding_year,funding_month,funding_dayofweek
0,#fame,IND,2015,Mumbai,Media,1,10000000,2015-01-05,2015-01-05,operating,operating,False,True,False,2015.0,1.0,0.0
1,:Qounter,USA,2014,Delaware City,Application Platforms,2,700000,2014-03-01,2014-10-14,operating,operating,False,True,False,2014.0,3.0,5.0
2,"(THE) ONE of THEM,Inc.",GBR,2014,,Apps,1,3406878,2014-01-30,2014-01-30,operating,operating,False,True,False,2014.0,1.0,3.0
3,0-6.com,CHN,2007,Beijing,Curated Web,1,2000000,2008-03-19,2008-03-19,operating,operating,False,True,False,2008.0,3.0,2.0
4,004 Technologies,USA,2010,Champaign,Software,1,10070591,2014-07-24,2014-07-24,operating,operating,False,True,False,2014.0,7.0,3.0


In [83]:
df['country'].unique().shape

(135,)

* Now that we have seen that the dataset contains categorical variables, we will encode them using the most suitable encoding techniques.

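For the country column (135 unique values), we will use frequency encoding: each country is replaced by the number of rows in which it appears. This gives the model a single numeric feature that reflects how well represented a country is in the dataset. Note that frequency encoding captures prevalence only. It carries no direct information about the relative success of startups from different countries, so any such relationship has to be learned by the model from the data.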

In [84]:
df["country_freq"] = df["country"].map(df["country"].value_counts())
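
As a minimal sketch of what the mapping above does, the example below applies frequency encoding to a small hypothetical column. The `toy` frame and the `country_share` column are illustrative assumptions; the normalized variant is shown only as an alternative that does not depend on dataset size.

```python
import pandas as pd

# Toy data (hypothetical), standing in for the real `country` column
toy = pd.DataFrame({"country": ["USA", "USA", "IND", "GBR", "USA", "IND"]})

# Count how often each category occurs: USA 3, IND 2, GBR 1
counts = toy["country"].value_counts()

# Replace each category by its count, as in the cell above
toy["country_freq"] = toy["country"].map(counts)

# Optional variant: relative frequency in [0, 1]
shares = toy["country"].value_counts(normalize=True)
toy["country_share"] = toy["country"].map(shares)

print(toy)
```

Two categories with the same count receive the same encoded value, so the model cannot tell them apart through this feature alone.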

In [85]:
df['city'].unique()

array(['Mumbai', 'Delaware City', nan, ..., 'Zwolle', 'Middlefield',
       'Damansara New Village'], shape=(4963,), dtype=object)

Here we can see that city is also a categorical variable with many unique values, and main category is categorical as well, so we will use frequency encoding for both columns.

In [86]:
# Note: fillna returns a new Series; the result is not assigned here, so missing cities stay NaN
df['city'].fillna('Unknown')
df['city_freq'] = df['city'].map(df['city'].value_counts())
df['main_category_freq'] = df['main_category'].map(df['main_category'].value_counts())
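
In the cell above, the result of `fillna` is not assigned back, so the missing cities stay `NaN` and `city_freq` inherits the same gaps (both show 7927 missing values in the `df.isna().sum()` output at the end of this section). The sketch below, on hypothetical toy data, shows one way to make the fill take effect, assuming the intent was to treat missing cities as their own "Unknown" category.

```python
import pandas as pd

# Toy data (hypothetical) with two missing cities
toy = pd.DataFrame({"city": ["Mumbai", None, "Beijing", "Mumbai", None]})

# fillna returns a new Series, so the result must be assigned back
toy["city"] = toy["city"].fillna("Unknown")

# "Unknown" is now an ordinary category and receives its own count
toy["city_freq"] = toy["city"].map(toy["city"].value_counts())

print(toy)
```

With this change the "Unknown" group gets a frequency like any other city, which may or may not be desirable. Leaving the values missing and imputing later is an equally valid choice.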

Let's also add a new feature, the funding duration, which is the number of days between the first and last funding dates.

In [87]:
df['funding_duration_days'] = (df['last_funding_at'] - df['first_funding_at']).dt.days
df['funding_duration_days']

0          0.0
1        227.0
2          0.0
3          0.0
4          0.0
         ...  
64852      0.0
64853    851.0
64854      0.0
64855      0.0
64856      0.0
Name: funding_duration_days, Length: 64857, dtype: float64
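
To make the computation concrete, here is a small sketch of the same subtraction on hypothetical toy rows. It assumes one row has a missing first funding date, which shows why the resulting column has a float dtype: a missing date yields a missing duration.

```python
import pandas as pd

# Toy data (hypothetical); the third row has no first funding date
toy = pd.DataFrame({
    "first_funding_at": pd.to_datetime(["2014-03-01", "2015-01-05", None]),
    "last_funding_at": pd.to_datetime(["2014-10-14", "2015-01-05", "2016-06-01"]),
})

# Subtracting two datetime columns gives a timedelta column
delta = toy["last_funding_at"] - toy["first_funding_at"]

# .dt.days extracts whole days; rows with a missing date become NaN
toy["funding_duration_days"] = delta.dt.days

print(toy["funding_duration_days"])
```

A duration of 0 means the first and last funding dates coincide, which is the case for every single-round startup in the sample rows shown earlier.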

* Here we can see that the first and last funding dates are in datetime format. They are very useful because they tell us over how long a period a startup has been able to secure funding, which can be an important indicator of its success and growth potential.

However, raw datetime columns generally cannot be passed to a model directly, so we will create a different set of numeric features from these two columns.

In [88]:
# --- Extract Features from first_funding_at ---
df["first_funding_year"] = df["first_funding_at"].dt.year
df["first_funding_month"] = df["first_funding_at"].dt.month
df["first_funding_dayofweek"] = df["first_funding_at"].dt.dayofweek
df["years_since_first_funding"] = 2025 - df["first_funding_at"].dt.year

# --- Extract Features from last_funding_at ---
df["last_funding_year"] = df["last_funding_at"].dt.year
df["last_funding_month"] = df["last_funding_at"].dt.month
df["last_funding_dayofweek"] = df["last_funding_at"].dt.dayofweek
df["years_since_last_funding"] = 2025 - df["last_funding_at"].dt.year
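
The two `years_since_*` features above hard-code 2025 as the reference year. A possible refactoring, sketched below, moves that value into a named constant and a small helper so that both features stay in sync. `REFERENCE_YEAR` and `years_since` are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
import pandas as pd

# Assumption: same reference year as the value hard-coded in the cell above
REFERENCE_YEAR = 2025


def years_since(dates: pd.Series, reference_year: int = REFERENCE_YEAR) -> pd.Series:
    """Return the difference between the reference year and each date's calendar year."""
    return reference_year - dates.dt.year


# Toy dates (hypothetical), matching the format of the funding columns
toy_dates = pd.Series(pd.to_datetime(["2015-01-05", "2008-03-19"]))
result = years_since(toy_dates)

print(result.tolist())
```

Because the feature depends on the chosen reference year, the same value should be used when preparing any data the model later predicts on.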

In [89]:
df.head()

Unnamed: 0,name,country,year,city,main_category,funding_rounds,funding_filled,first_funding_at,last_funding_at,status,...,main_category_freq,funding_duration_days,first_funding_year,first_funding_month,first_funding_dayofweek,years_since_first_funding,last_funding_year,last_funding_month,last_funding_dayofweek,years_since_last_funding
0,#fame,IND,2015,Mumbai,Media,1,10000000,2015-01-05,2015-01-05,operating,...,226,0.0,2015.0,1.0,0.0,10.0,2015,1,0,10
1,:Qounter,USA,2014,Delaware City,Application Platforms,2,700000,2014-03-01,2014-10-14,operating,...,228,227.0,2014.0,3.0,5.0,11.0,2014,10,1,11
2,"(THE) ONE of THEM,Inc.",GBR,2014,,Apps,1,3406878,2014-01-30,2014-01-30,operating,...,1501,0.0,2014.0,1.0,3.0,11.0,2014,1,3,11
3,0-6.com,CHN,2007,Beijing,Curated Web,1,2000000,2008-03-19,2008-03-19,operating,...,2181,0.0,2008.0,3.0,2.0,17.0,2008,3,2,17
4,004 Technologies,USA,2010,Champaign,Software,1,10070591,2014-07-24,2014-07-24,operating,...,4010,0.0,2014.0,7.0,3.0,11.0,2014,7,3,11


In [90]:
df.isna().sum()

name                            1
country                         0
year                            0
city                         7927
main_category                   0
funding_rounds                  0
funding_filled                  0
first_funding_at               26
last_funding_at                 0
status                          0
status_class                    0
fail                            0
operating                       0
success                         0
funding_year                   26
funding_month                  26
funding_dayofweek              26
country_freq                    0
city_freq                    7927
main_category_freq              0
funding_duration_days          26
first_funding_year             26
first_funding_month            26
first_funding_dayofweek        26
years_since_first_funding      26
last_funding_year               0
last_funding_month              0
last_funding_dayofweek          0
years_since_last_funding        0
dtype: int64