## Data Preprocessing

### Categorical Variables and Encoding

In [41]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [42]:
df = pd.read_csv('./data/clean_startup_success_dataset.csv')

In [43]:
df.head()

Unnamed: 0,name,country,year,city,main_category,funding_rounds,funding_filled,first_funding_at,last_funding_at,status,status_class,fail,operating,success
0,#fame,IND,2015,Mumbai,Media,1,10000000,2015-01-05,2015-01-05,operating,operating,False,True,False
1,:Qounter,USA,2014,Delaware City,Application Platforms,2,700000,2014-03-01,2014-10-14,operating,operating,False,True,False
2,"(THE) ONE of THEM,Inc.",GBR,2014,,Apps,1,3406878,2014-01-30,2014-01-30,operating,operating,False,True,False
3,0-6.com,CHN,2007,Beijing,Curated Web,1,2000000,2008-03-19,2008-03-19,operating,operating,False,True,False
4,004 Technologies,USA,2010,Champaign,Software,1,10070591,2014-07-24,2014-07-24,operating,operating,False,True,False


We have now analyzed and cleaned the data. Next, we preprocess it so that it is ready for model training.

In [44]:
df["first_funding_at"] = pd.to_datetime(df["first_funding_at"], errors='coerce')
df["last_funding_at"] = pd.to_datetime(df["last_funding_at"], errors='coerce')

In [45]:
df['country'].unique().shape

(135,)

* Now that we have confirmed the dataset contains categorical variables, we will encode each one with the technique best suited to it.

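For the `country` column, we will use frequency encoding: each country is replaced by the number of rows in which it appears. This turns the 135 distinct values into a single numeric column and tells the model how common each country is in the dataset. Note that frequency encoding never looks at the target, so it says nothing directly about how successful startups from a given country are; it only captures how well represented that country is.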

In [46]:
df["country_freq"] = df["country"].map(df["country"].value_counts())
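
To make the encoding concrete, here is a minimal sketch on a made-up six-row frame (the `toy` data and the `country_rel_freq` column are illustrative assumptions, not part of the notebook). It shows the count-based encoding used above next to a relative-frequency variant that does not depend on dataset size.

```python
import pandas as pd

# Toy data; these rows are invented purely for illustration.
toy = pd.DataFrame({"country": ["USA", "USA", "USA", "IND", "GBR", "IND"]})

# Count encoding, as in the cell above: each row gets its category's row count.
counts = toy["country"].value_counts()
toy["country_freq"] = toy["country"].map(counts)

# Optional variant (an assumption, not used in this notebook):
# relative frequency, i.e. the share of rows belonging to each category.
shares = toy["country"].value_counts(normalize=True)
toy["country_rel_freq"] = toy["country"].map(shares)

print(toy)
```

Two categories with the same count receive the same encoded value, so the model can no longer tell them apart; that is an inherent trade-off of this encoding.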

In [47]:
df['city'].unique()

array(['Mumbai', 'Delaware City', nan, ..., 'Zwolle', 'Middlefield',
       'Damansara New Village'], shape=(4963,), dtype=object)

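The `city` column is also categorical with many unique values (4,963, counting the missing value), and `main_category` is a high-cardinality categorical column as well, so we use frequency encoding for both. Missing cities are first filled with 'Unknown' so that they receive a count of their own.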

In [48]:
df['city'] = df['city'].fillna('Unknown')
df['city_freq'] = df['city'].map(df['city'].value_counts())
df['main_category_freq'] = df['main_category'].map(df['main_category'].value_counts())
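
The `fillna('Unknown')` step matters because `value_counts()` drops missing values by default. The sketch below, on invented toy data, shows what would probably happen without it: rows with a missing city would map to NaN instead of a count.

```python
import numpy as np
import pandas as pd

# Toy data; the values are invented for illustration.
toy = pd.DataFrame({"city": ["Mumbai", np.nan, "Beijing", np.nan, "Mumbai"]})

# Without filling: NaN is absent from value_counts(), so it maps to NaN.
freq_without_fill = toy["city"].map(toy["city"].value_counts())

# With filling: "Unknown" becomes a regular category with its own count.
filled = toy["city"].fillna("Unknown")
freq_with_fill = filled.map(filled.value_counts())

print(freq_without_fill.tolist())
print(freq_with_fill.tolist())
```

One side effect worth keeping in mind: when many rows are missing, 'Unknown' can end up with a very large count, so it looks like the most common city to the model.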

* The first and last funding dates are now in datetime format. They are useful because together they tell us how long a startup has been able to secure funding, which can be an important indicator of its success and growth potential.

Most models cannot consume raw datetime values directly, so we derive a set of numeric features from these two columns.

In [49]:
# Where the first funding date is missing, fall back to the last funding date
df['first_funding_at'] = df['first_funding_at'].fillna(df['last_funding_at'])

# --- Extract Features from first_funding_at ---
df["first_funding_year"] = df["first_funding_at"].dt.year
df["first_funding_month"] = df["first_funding_at"].dt.month
df["first_funding_dayofweek"] = df["first_funding_at"].dt.dayofweek
df["years_since_first_funding"] = 2025 - df["first_funding_at"].dt.year

# --- Extract Features from last_funding_at ---
df["last_funding_year"] = df["last_funding_at"].dt.year
df["last_funding_month"] = df["last_funding_at"].dt.month
df["last_funding_dayofweek"] = df["last_funding_at"].dt.dayofweek
df["years_since_last_funding"] = 2025 - df["last_funding_at"].dt.year
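
The same steps can be sketched on a two-row toy frame. The `REFERENCE_YEAR` constant is a hypothetical name introduced here to make the hard-coded 2025 explicit; the dates are borrowed from the first two rows shown earlier, with one first-funding date blanked out for illustration.

```python
import pandas as pd

# Hypothetical constant standing in for the hard-coded 2025 above.
REFERENCE_YEAR = 2025

toy = pd.DataFrame({
    "first_funding_at": ["2015-01-05", None],          # second value is missing
    "last_funding_at": ["2015-01-05", "2014-10-14"],
})
for col in ["first_funding_at", "last_funding_at"]:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Fill a missing first funding date with the last funding date.
toy["first_funding_at"] = toy["first_funding_at"].fillna(toy["last_funding_at"])

# Calendar components; dayofweek uses Monday=0 ... Sunday=6.
toy["first_funding_year"] = toy["first_funding_at"].dt.year
toy["first_funding_month"] = toy["first_funding_at"].dt.month
toy["first_funding_dayofweek"] = toy["first_funding_at"].dt.dayofweek
toy["years_since_first_funding"] = REFERENCE_YEAR - toy["first_funding_at"].dt.year

print(toy)
```

Because the reference year is fixed, `years_since_first_funding` is just a shifted copy of `first_funding_year`; the two columns carry the same information for the model.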

In [50]:
df.head()

Unnamed: 0,name,country,year,city,main_category,funding_rounds,funding_filled,first_funding_at,last_funding_at,status,...,city_freq,main_category_freq,first_funding_year,first_funding_month,first_funding_dayofweek,years_since_first_funding,last_funding_year,last_funding_month,last_funding_dayofweek,years_since_last_funding
0,#fame,IND,2015,Mumbai,Media,1,10000000,2015-01-05,2015-01-05,operating,...,288,226,2015,1,0,10,2015,1,0,10
1,:Qounter,USA,2014,Delaware City,Application Platforms,2,700000,2014-03-01,2014-10-14,operating,...,5,228,2014,3,5,11,2014,10,1,11
2,"(THE) ONE of THEM,Inc.",GBR,2014,Unknown,Apps,1,3406878,2014-01-30,2014-01-30,operating,...,7927,1501,2014,1,3,11,2014,1,3,11
3,0-6.com,CHN,2007,Beijing,Curated Web,1,2000000,2008-03-19,2008-03-19,operating,...,573,2181,2008,3,2,17,2008,3,2,17
4,004 Technologies,USA,2010,Champaign,Software,1,10070591,2014-07-24,2014-07-24,operating,...,32,4010,2014,7,3,11,2014,7,3,11


In [51]:
df.isna().sum()

name                         1
country                      0
year                         0
city                         0
main_category                0
funding_rounds               0
funding_filled               0
first_funding_at             0
last_funding_at              0
status                       0
status_class                 0
fail                         0
operating                    0
success                      0
country_freq                 0
city_freq                    0
main_category_freq           0
first_funding_year           0
first_funding_month          0
first_funding_dayofweek      0
years_since_first_funding    0
last_funding_year            0
last_funding_month           0
last_funding_dayofweek       0
years_since_last_funding     0
dtype: int64

Unnamed: 0,status,status_class,fail,operating,success
0,operating,operating,False,True,False
1,operating,operating,False,True,False
2,operating,operating,False,True,False
3,operating,operating,False,True,False
4,operating,operating,False,True,False


### Feature Extraction from Date Columns

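* With the categorical variables encoded, we derive two more features: the number of days between the first and last funding dates, and the average funding per round. We then assemble the feature matrix `X`.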

In [55]:
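# Days elapsed between the first and the last funding date
df['funding_duration_days'] = (df['last_funding_at'] - df['first_funding_at']).dt.days
# The small epsilon guards against division by zero; it also lowers each result slightly
df['avg_funding_per_round'] = df['funding_filled'] / (df['funding_rounds'] + 1e-5)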

X = df[['country_freq', 'year', 'city_freq', 'main_category_freq', 'funding_rounds', 'funding_filled', 'first_funding_year', 'last_funding_year', 'funding_duration_days', 'avg_funding_per_round']]

In [56]:
X.head()

Unnamed: 0,country_freq,year,city_freq,main_category_freq,funding_rounds,funding_filled,first_funding_year,last_funding_year,funding_duration_days,avg_funding_per_round
0,1736,2015,288,226,1,10000000,2015,2015,0,9999900.0
1,40920,2014,5,228,2,700000,2014,2014,227,349998.3
2,4009,2014,7927,1501,1,3406878,2014,2014,0,3406844.0
3,1714,2007,573,2181,1,2000000,2008,2008,0,1999980.0
4,40920,2010,32,4010,1,10070591,2014,2014,0,10070490.0
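
The first row of `X.head()` shows an average of 9999900.0 for a single round of 10000000, which comes from the `1e-5` added to the denominator. The sketch below reproduces that on two toy rows taken from the output above and compares it with plain division. Plain division is only safe under the assumption that no row has zero funding rounds, which this notebook does not verify.

```python
import pandas as pd

toy = pd.DataFrame({
    "first_funding_at": pd.to_datetime(["2015-01-05", "2014-03-01"]),
    "last_funding_at": pd.to_datetime(["2015-01-05", "2014-10-14"]),
    "funding_filled": [10_000_000, 700_000],
    "funding_rounds": [1, 2],
})

# Subtracting datetimes yields timedeltas; .dt.days extracts whole days.
toy["funding_duration_days"] = (
    toy["last_funding_at"] - toy["first_funding_at"]
).dt.days

# Notebook version: epsilon in the denominator.
toy["avg_with_eps"] = toy["funding_filled"] / (toy["funding_rounds"] + 1e-5)

# Alternative (assumes funding_rounds >= 1 for every row): exact division.
toy["avg_plain"] = toy["funding_filled"] / toy["funding_rounds"]

print(toy[["funding_duration_days", "avg_with_eps", "avg_plain"]])
```

A quick `(df['funding_rounds'] == 0).any()` on the real data would show whether the epsilon is needed at all.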


In [60]:
cur = ['status', 'status_class', 'fail', 'operating', 'success']
for c in cur:
    print(df[c].value_counts())

status
operating    52065
closed        6123
acquired      5336
ipo           1333
Name: count, dtype: int64
status_class
operating    52065
success       6669
fail          6123
Name: count, dtype: int64
fail
False    58734
True      6123
Name: count, dtype: int64
operating
True     52065
False    12792
Name: count, dtype: int64
success
False    58188
True      6669
Name: count, dtype: int64


In [61]:
y = df['success']
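
The counts above suggest how the target was built: `success` (6669) equals `acquired` plus `ipo` (5336 + 1333), and `fail` (6123) equals `closed`. The sketch below reconstructs that mapping on toy data. The `STATUS_TO_CLASS` dictionary is an assumption inferred from those counts, not code taken from the cleaning step.

```python
import pandas as pd

# Toy statuses; invented for illustration.
status = pd.Series(
    ["operating", "closed", "acquired", "ipo", "operating"], name="status"
)

# Assumed mapping, inferred from the value counts printed above.
STATUS_TO_CLASS = {
    "operating": "operating",
    "closed": "fail",
    "acquired": "success",
    "ipo": "success",
}
status_class = status.map(STATUS_TO_CLASS)

# One indicator column per class: fail, operating, success.
flags = pd.get_dummies(status_class)

# Binary target and the share of positive rows.
y = flags["success"]
positive_rate = y.mean()
print(positive_rate)
```

On the real data the positive share is 6669 / 64857, roughly 10.3%, so the target is imbalanced, and still-operating startups are counted as negatives alongside closed ones.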

* Let's save the feature matrix and the target variable to separate CSV files for future use.

In [64]:
X.to_csv('./data/feature_matrix.csv', index=False)
y.to_csv('./data/target_variable.csv', index=False)
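
A later notebook could load the two files back along these lines. This is a sketch on toy data written to a temporary directory, so the real files are left untouched; with `index=False`, rows in the two files are matched only by position.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the real feature matrix and target.
X = pd.DataFrame({"country_freq": [1736, 40920], "funding_rounds": [1, 2]})
y = pd.Series([False, True], name="success")

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "feature_matrix.csv")
    y_path = os.path.join(tmp, "target_variable.csv")

    # Same calls as above; the Series name becomes the CSV header.
    X.to_csv(x_path, index=False)
    y.to_csv(y_path, index=False)

    X_loaded = pd.read_csv(x_path)
    y_loaded = pd.read_csv(y_path)["success"]

# Row order is the only link between features and target, so check lengths.
assert len(X_loaded) == len(y_loaded)
print(X_loaded.shape, y_loaded.dtype)
```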