# Week 2 Step 2: Feature Selection and Encoding

## 1. Feature Selection

### 1.1 Load datasets

In [2]:
import pandas as pd
import numpy as np

In [3]:
# Forward slashes work on every OS and avoid invalid escape sequences in the path string
file_path = 'data/cleaned/startup_success_cleaned.csv'
overall_df = pd.read_csv(file_path)
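
A path can also be assembled with `pathlib`, which avoids hand-written separators altogether. The following is a minimal sketch; the helper name `cleaned_csv_path` is hypothetical, and the folder layout is assumed to match the load cell.

```python
from pathlib import Path

def cleaned_csv_path(base_dir="data"):
    # Hypothetical helper: joins the parts with the separator of the current OS
    return Path(base_dir) / "cleaned" / "startup_success_cleaned.csv"

file_path = cleaned_csv_path()
# as_posix() shows the path with forward slashes regardless of platform
print(file_path.as_posix())
```

The resulting `Path` object can be passed directly to `pd.read_csv`.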


In [4]:
# Display the first 5 rows of the dataset
print("First 5 rows of the dataset:")
print(overall_df.head())

First 5 rows of the dataset:
     id state_code   latitude   longitude zip_code     id.1           city  \
0  1005         CA  42.358880  -71.056820    92101   c:6669      San Diego   
1   204         CA  37.238916 -121.973718    95032  c:16283      Los Gatos   
2  1001         CA  32.901049 -117.192656    92121  c:65620      San Diego   
3   738         CA  37.320309 -122.050040    95014  c:42668      Cupertino   
4  1002         CA  37.779281 -122.419236    94105  c:65806  San Francisco   

               Unnamed: 6               name  labels  ... has_angel  \
0  San Francisco CA 94105        Bandsintown       1  ...         1   
1  San Francisco CA 94105          TriCipher       1  ...         0   
2      San Diego CA 92121              Plixi       1  ...         0   
3      Cupertino CA 95014  Solidcore Systems       1  ...         0   
4  San Francisco CA 94105     Inhale Digital       0  ...         1   

  has_roundA has_roundB has_roundC  has_roundD  avg_participants  is_top500

In [5]:
# Show general information about the dataset
print("\nGeneral information:")
print(overall_df.info())


General information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 923 entries, 0 to 922
Data columns (total 51 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   id                        923 non-null    int64  
 1   state_code                923 non-null    object 
 2   latitude                  923 non-null    float64
 3   longitude                 923 non-null    float64
 4   zip_code                  923 non-null    object 
 5   id.1                      923 non-null    object 
 6   city                      923 non-null    object 
 7   Unnamed: 6                923 non-null    object 
 8   name                      923 non-null    object 
 9   labels                    923 non-null    int64  
 10  founded_at                923 non-null    object 
 11  closed_at                 923 non-null    object 
 12  first_funding_at          923 non-null    object 
 13  last_funding_at           923 non-null    o

### 1.2 Select features
All variables in the dataset:

| # | Column                    | Dtype    |
|---|----------------------------|----------|
| 0 | id                         | int64    |
| 1 | state_code                 | object   |
| 2 | latitude                   | float64  |
| 3 | longitude                  | float64  |
| 4 | zip_code                   | object   |
| 5 | id.1                       | object   |
| 6 | city                       | object   |
| 7 | Unnamed: 6                 | object   |
| 8 | name                       | object   |
| 9 | labels                     | int64    |
| 10 | founded_at                | object   |
| 11 | closed_at                 | object   |
| 12 | first_funding_at          | object   |
| 13 | last_funding_at           | object   |
| 14 | age_first_funding_year    | float64  |
| 15 | age_last_funding_year     | float64  |
| 16 | age_first_milestone_year  | float64  |
| 17 | age_last_milestone_year   | float64  |
| 18 | relationships             | int64    |
| 19 | funding_rounds            | int64    |
| 20 | funding_total_usd         | int64    |
| 21 | milestones                | int64    |
| 22 | state_code.1              | object   |
| 23 | is_CA                     | int64    |
| 24 | is_NY                     | int64    |
| 25 | is_MA                     | int64    |
| 26 | is_TX                     | int64    |
| 27 | is_otherstate             | int64    |
| 28 | category_code             | object   |
| 29 | is_software               | int64    |
| 30 | is_web                    | int64    |
| 31 | is_mobile                 | int64    |
| 32 | is_enterprise             | int64    |
| 33 | is_advertising            | int64    |
| 34 | is_gamesvideo             | int64    |
| 35 | is_ecommerce              | int64    |
| 36 | is_biotech                | int64    |
| 37 | is_consulting             | int64    |
| 38 | is_othercategory          | int64    |
| 39 | object_id                 | object   |
| 40 | has_VC                    | int64    |
| 41 | has_angel                 | int64    |
| 42 | has_roundA                | int64    |
| 43 | has_roundB                | int64    |
| 44 | has_roundC                | int64    |
| 45 | has_roundD                | int64    |
| 46 | avg_participants          | float64  |
| 47 | is_top500                 | int64    |
| 48 | status                    | object   |
| 49 | first_funding_year        | int64    |
| 50 | last_funding_year         | int64    |



According to the literature, multiple factors have been found to be relevant to start-up success. The top factors include:
- Funding amount and progression through funding rounds
- Founder/management team quality and experience
- Social and online presence
- Team size/human capital
- Timing (time between funding rounds, which indicates speed of growth)
- Industry
- Business traction/milestones
- Geographic location

This section explains each potential factor and lists the DataFrame columns that relate to it. Columns that are not used by any factor are dropped along the way.
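
The factor-to-column mapping described in this section can be kept in a small dictionary, which makes it easy to list the columns that no factor claims. This is only a sketch: the group names and the helper `unused_columns` are hypothetical, the groups are deliberately incomplete, and the frame is a tiny stand-in rather than the real dataset.

```python
import pandas as pd

# Hypothetical grouping; column names are taken from the variable table
feature_groups = {
    "location": ["state_code", "is_CA", "is_NY", "is_MA", "is_TX", "is_otherstate"],
    "status": ["status", "is_top500"],
    "funding": ["funding_rounds", "funding_total_usd", "milestones", "avg_participants"],
}

def unused_columns(df, groups):
    # Columns of df that are not claimed by any factor group, in original order
    used = {col for cols in groups.values() for col in cols}
    return [col for col in df.columns if col not in used]

# Stand-in frame with no rows (not the real dataset)
sample_df = pd.DataFrame(
    columns=["id", "state_code", "is_CA", "name", "status", "funding_rounds"]
)
print(unused_columns(sample_df, feature_groups))
```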

#### 1.2.1 Basic information of companies and geographical locations

`founded_at` and `closed_at` are retained. The `id` column is dropped in the cell below together with the other identifier columns.

The `state_code` column and its corresponding boolean columns are used when looking for relationships:
- `state_code`
- `is_CA`
- `is_NY`
- `is_MA`
- `is_TX`
- `is_otherstate`
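
Before relying on the boolean columns, it can be worth confirming that they agree with `state_code`. The sketch below does this on a small stand-in frame. It assumes that `is_otherstate` flags every state other than CA, NY, MA and TX, which the source does not state explicitly; `dummies_match` is a hypothetical helper.

```python
import pandas as pd

# Stand-in rows (not the real dataset)
sample_df = pd.DataFrame({
    "state_code": ["CA", "NY", "WA", "TX"],
    "is_CA": [1, 0, 0, 0],
    "is_NY": [0, 1, 0, 0],
    "is_MA": [0, 0, 0, 0],
    "is_TX": [0, 0, 0, 1],
    "is_otherstate": [0, 0, 1, 0],
})

def dummies_match(df, states=("CA", "NY", "MA", "TX")):
    # Each is_<STATE> flag should be 1 exactly where state_code equals that state
    for state in states:
        expected = (df["state_code"] == state).astype(int)
        if not (expected == df[f"is_{state}"]).all():
            return False
    # Assumption: is_otherstate marks every state outside the named ones
    expected_other = (~df["state_code"].isin(list(states))).astype(int)
    return bool((expected_other == df["is_otherstate"]).all())

print(dummies_match(sample_df))
```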


In [6]:
# Remove columns that are not needed for analysis
columns_to_remove = ['id', 'Unnamed: 6', 'name', 'labels', 'zip_code', 'id.1','state_code.1','city','longitude','latitude']
overall_df.drop(columns=columns_to_remove, inplace=True)

#### 1.2.2 Status of the firm
Two columns describe the status of the firm, and both are kept:
- `status`
- `is_top500`

#### 1.2.3 Funding and Milestones information
Important funding information includes:
- `first_funding_year`
- `last_funding_year`
- `age_first_funding_year`
- `age_last_funding_year`
- `age_first_milestone_year`
- `age_last_milestone_year`
- `funding_rounds`
- `funding_total_usd`
- `milestones`
- `avg_participants`
- and 6 boolean columns describing funding types and rounds:
    - `has_VC`
    - `has_angel`
    - `has_roundA`
    - `has_roundB`
    - `has_roundC`
    - `has_roundD`
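
The timing factor mentioned earlier (time between funding rounds) is not a column of its own, but it could be approximated from the two funding-age columns. The sketch below is not part of the original workflow: the column name `funding_span_years` is hypothetical, and the rows are stand-ins that reuse values from this notebook's preview output.

```python
import pandas as pd

# Stand-in rows using values from the notebook's preview output
sample_df = pd.DataFrame({
    "age_first_funding_year": [2.2493, 5.1260, 1.0329],
    "age_last_funding_year": [3.0027, 9.9973, 1.0329],
    "funding_rounds": [3, 4, 1],
})

# Hypothetical derived column: years elapsed between first and last funding
sample_df["funding_span_years"] = (
    sample_df["age_last_funding_year"] - sample_df["age_first_funding_year"]
)
print(sample_df)
```

A company with a single funding round has a span of zero, so the column would likely need to be read together with `funding_rounds`.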

In [7]:
# Remove columns that are not needed for analysis
columns_to_remove = ['relationships', 'first_funding_at', 'last_funding_at']
overall_df.drop(columns=columns_to_remove, inplace=True)

#### 1.2.4 Industry
Industry information includes:
- `category_code`

and 10 boolean columns derived from `category_code`:
- `is_software`
- `is_web`
- `is_mobile`
- `is_enterprise`
- `is_advertising`
- `is_gamesvideo`
- `is_ecommerce`
- `is_biotech`
- `is_consulting`
- `is_othercategory`

In [8]:
# Remove columns that are not needed for analysis
overall_df.drop(columns="object_id", inplace=True)
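
`DataFrame.drop` raises a `KeyError` when a listed column is absent, for example when a drop cell is run a second time. A possible way to make the step repeatable is sketched below; `drop_existing` is a hypothetical helper and the frame is a stand-in.

```python
import pandas as pd

def drop_existing(df, columns):
    # Split the requested names into those present in df and those missing
    present = [col for col in columns if col in df.columns]
    missing = [col for col in columns if col not in df.columns]
    # Return a new frame instead of modifying df in place
    return df.drop(columns=present), missing

# Stand-in frame (not the real dataset)
sample_df = pd.DataFrame({"object_id": ["c:1"], "name": ["Acme"], "milestones": [2]})
trimmed, missing = drop_existing(sample_df, ["object_id", "name", "zip_code"])
print(list(trimmed.columns), missing)
```

Returning a new frame rather than using `inplace=True` also keeps the original available for comparison.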

In [9]:
# Display the first 5 rows of the dataset after dropping unnecessary columns
print("First 5 rows of the dataset:")
print(overall_df.head())

First 5 rows of the dataset:
  state_code founded_at   closed_at  age_first_funding_year  \
0         CA   1/1/2007  2013-06-01                  2.2493   
1         CA   1/1/2000  2013-06-01                  5.1260   
2         CA  3/18/2009  2013-06-01                  1.0329   
3         CA   1/1/2002  2013-06-01                  3.1315   
4         CA   8/1/2010  2012-10-01                  0.0000   

   age_last_funding_year  age_first_milestone_year  age_last_milestone_year  \
0                 3.0027                    4.6685                   6.7041   
1                 9.9973                    7.0055                   7.0055   
2                 1.0329                    1.4575                   2.2055   
3                 5.3151                    6.0027                   6.0027   
4                 1.6685                    0.0384                   0.0384   

   funding_rounds  funding_total_usd  milestones  ...  has_angel  has_roundA  \
0               3             375000 

## 2. Encode categorical variables

Identify categorical variables:
- `state_code` (already has dummy variables)
- `status`
- `category_code` (already has dummy variables)
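
For a categorical column that does not yet have dummy variables, `pd.get_dummies` can generate them. The sketch below uses a stand-in `category_code` column and an `is` prefix to mimic the existing naming. It is not a claim about how the dataset's flags were originally built.

```python
import pandas as pd

# Stand-in column (not the real dataset)
sample_df = pd.DataFrame({"category_code": ["software", "web", "mobile", "web"]})

# One 0/1 column per category, named is_<category>
encoded = pd.get_dummies(sample_df, columns=["category_code"], prefix="is", dtype=int)
print(encoded)
```

Because the dataset already ships with such columns, the original text columns only need to be dropped afterwards.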


In [10]:
# Check which distinct status values exist
overall_df['status'].unique()

array(['acquired', 'closed'], dtype=object)

In [11]:
overall_df['status'] = overall_df['status'].map({'acquired': 1, 'closed': 0})
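
`Series.map` returns `NaN` for any value that is missing from the dictionary, which also happens if the mapping cell is re-run on an already encoded column. A guarded version could look like the sketch below; `encode_status` is a hypothetical helper.

```python
import pandas as pd

status_map = {"acquired": 1, "closed": 0}

def encode_status(series, mapping):
    encoded = series.map(mapping)
    # Values that did not appear in the mapping end up as NaN after map()
    unmapped = series[encoded.isna()].unique().tolist()
    if unmapped:
        raise ValueError(f"Unmapped status values: {unmapped}")
    return encoded.astype("int64")

# Stand-in values (not the real dataset)
print(encode_status(pd.Series(["acquired", "closed", "acquired"]), status_map).tolist())
```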


#### Drop all categorical columns

In [12]:
# Drop the original categorical columns; their dummy variables are kept
drop_column = ["state_code","category_code"]
overall_df.drop(columns=drop_column, inplace=True)

#### Convert the date into integer year

In [13]:
# Convert the date into integer year
overall_df['founded_at'] = pd.to_datetime(overall_df['founded_at']).dt.year
overall_df['closed_at'] = pd.to_datetime(overall_df['closed_at']).dt.year
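
The preview output suggests that `founded_at` uses a month/day/year layout while `closed_at` uses year-month-day. Passing an explicit format and counting unparseable entries is one way to make the conversion less dependent on format inference. This is a sketch on stand-in values; `to_year` is a hypothetical helper, and the two format strings are assumptions based on the preview only.

```python
import pandas as pd

def to_year(series, fmt):
    # errors="coerce" turns unparseable entries into NaT instead of raising
    parsed = pd.to_datetime(series, format=fmt, errors="coerce")
    n_bad = int(parsed.isna().sum())
    return parsed.dt.year, n_bad

# Stand-in values (not the real dataset)
founded = pd.Series(["1/1/2007", "3/18/2009", "8/1/2010"])
closed = pd.Series(["2013-06-01", "2012-10-01", "unknown"])

founded_year, founded_bad = to_year(founded, "%m/%d/%Y")
closed_year, closed_bad = to_year(closed, "%Y-%m-%d")
print(founded_year.tolist(), founded_bad, closed_bad)
```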

#### Overview again

In [14]:
# Show general information about the dataset
print("\nGeneral information:")
print(overall_df.info())


General information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 923 entries, 0 to 922
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   founded_at                923 non-null    int32  
 1   closed_at                 923 non-null    int32  
 2   age_first_funding_year    923 non-null    float64
 3   age_last_funding_year     923 non-null    float64
 4   age_first_milestone_year  923 non-null    float64
 5   age_last_milestone_year   923 non-null    float64
 6   funding_rounds            923 non-null    int64  
 7   funding_total_usd         923 non-null    int64  
 8   milestones                923 non-null    int64  
 9   is_CA                     923 non-null    int64  
 10  is_NY                     923 non-null    int64  
 11  is_MA                     923 non-null    int64  
 12  is_TX                     923 non-null    int64  
 13  is_otherstate             923 non-null    i

All columns and their data types:

| # | Column                    | Dtype    |
|---|----------------------------|----------|
| 0 | founded_at                | int32    |
| 1 | closed_at                 | int32    |
| 2 | age_first_funding_year    | float64  |
| 3 | age_last_funding_year     | float64  |
| 4 | age_first_milestone_year  | float64  |
| 5 | age_last_milestone_year   | float64  |
| 6 | funding_rounds            | int64    |
| 7 | funding_total_usd         | int64    |
| 8 | milestones                | int64    |
| 9 | is_CA                     | int64    |
| 10 | is_NY                    | int64    |
| 11 | is_MA                    | int64    |
| 12 | is_TX                    | int64    |
| 13 | is_otherstate            | int64    |
| 14 | is_software              | int64    |
| 15 | is_web                   | int64    |
| 16 | is_mobile                | int64    |
| 17 | is_enterprise            | int64    |
| 18 | is_advertising           | int64    |
| 19 | is_gamesvideo            | int64    |
| 20 | is_ecommerce             | int64    |
| 21 | is_biotech               | int64    |
| 22 | is_consulting            | int64    |
| 23 | is_othercategory         | int64    |
| 24 | has_VC                   | int64    |
| 25 | has_angel                | int64    |
| 26 | has_roundA               | int64    |
| 27 | has_roundB               | int64    |
| 28 | has_roundC               | int64    |
| 29 | has_roundD               | int64    |
| 30 | avg_participants         | float64  |
| 31 | is_top500                | int64    |
| 32 | status                   | int64    |
| 33 | first_funding_year       | int64    |
| 34 | last_funding_year        | int64    |


In [15]:
# Display the first 5 rows of the dataset after encoding and date conversion
print("First 5 rows of the dataset:")
print(overall_df.head())

First 5 rows of the dataset:
   founded_at  closed_at  age_first_funding_year  age_last_funding_year  \
0        2007       2013                  2.2493                 3.0027   
1        2000       2013                  5.1260                 9.9973   
2        2009       2013                  1.0329                 1.0329   
3        2002       2013                  3.1315                 5.3151   
4        2010       2012                  0.0000                 1.6685   

   age_first_milestone_year  age_last_milestone_year  funding_rounds  \
0                    4.6685                   6.7041               3   
1                    7.0055                   7.0055               4   
2                    1.4575                   2.2055               1   
3                    6.0027                   6.0027               3   
4                    0.0384                   0.0384               2   

   funding_total_usd  milestones  is_CA  ...  has_angel  has_roundA  \
0             37

In [16]:
# Final check
overall_df.select_dtypes(include='object').columns

Index([], dtype='object')
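
The same final check can be expressed as a single boolean that also covers missing values. The sketch below uses a hypothetical helper `is_model_ready` on stand-in frames.

```python
import pandas as pd

def is_model_ready(df):
    # True only if every column is numeric and no cell is missing
    all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in df.dtypes)
    no_missing = not df.isna().any().any()
    return bool(all_numeric and no_missing)

# Stand-in frames (not the real dataset)
ready_df = pd.DataFrame({"founded_at": [2007, 2000], "status": [1, 0]})
text_df = ready_df.assign(state_code=["CA", "NY"])
print(is_model_ready(ready_df), is_model_ready(text_df))
```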

In [17]:
# Save the processed file
overall_df.to_csv('data/processed/startup_success_processed.csv', index=False)

print("\nThe processed file has been saved as 'startup_success_processed.csv'.")


The processed file has been saved as 'startup_success_processed.csv'.
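
`to_csv` fails if the target folder does not exist, so it may help to create it first and to reload the file as a quick sanity check. The sketch below writes a stand-in frame to a temporary folder rather than the real project tree; `save_processed` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_processed(df, out_path):
    out_path = Path(out_path)
    # Create the parent folder (and any missing ancestors) if needed
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)
    return out_path

# Stand-in frame (not the real dataset)
sample_df = pd.DataFrame({"founded_at": [2007, 2000], "status": [1, 1]})
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "processed" / "startup_success_processed.csv"
    saved = save_processed(sample_df, target)
    # Reload while the temporary folder still exists
    reloaded = pd.read_csv(saved)

print(reloaded.shape == sample_df.shape)
```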


