## Understanding the problem

Predict whether a start-up will succeed (active/acquired) or fail (closed).

## Understanding the dataset

- **id**: start-up identifier
- **labels**: **target** -> 1 (success, ~64.7%) | 0 (failure, ~35.3%)
- `age_*` columns (NaN means the event never happened -> requires treatment)

    - **age_first_funding_year**: years from founding to the first funding
    - **age_last_funding_year**: years from founding to the last funding

    - **age_first_milestone_year**: years from founding to the first milestone
    - **age_last_milestone_year**: years from founding to the last milestone

- **relationships**: total number of people linked to the start-up (founders, executives, investors)
- **funding_rounds**: number of funding rounds
- **funding_total_usd**: total amount raised (USD)
- **milestones**: count of relevant milestones
- **avg_participants**: average number of investors per round
- **is_CA**...: state where the start-up is headquartered {0,1}
- **is_software**...: sector {0,1} (already binary; the object-typed `category_code` column is the one that requires encoding)
- **has_VC/has_angel**: investor type {0,1}
- **has_roundA**...: funding round reached {0,1}
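
The NaN treatment flagged for the `age_*` columns could be sketched as follows. This is a minimal sketch, assuming that a missing age really means the event never happened. The `<col>_missing` indicator columns, the helper `flag_and_fill` and the fill value `-1.0` are illustrative choices, not something the dataset prescribes, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame reusing two of the dataset's column names (values are illustrative).
toy = pd.DataFrame({
    "age_first_milestone_year": [8.98, np.nan, 1.95],
    "age_last_milestone_year": [12.72, np.nan, 2.28],
})

age_cols = ["age_first_milestone_year", "age_last_milestone_year"]


def flag_and_fill(frame, cols, fill_value=-1.0):
    """Add a 0/1 '<col>_missing' flag per column, then fill NaN with fill_value."""
    out = frame.copy()
    for col in cols:
        # Record the missingness first, so the information survives the fill.
        out[f"{col}_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(fill_value)
    return out


treated = flag_and_fill(toy, age_cols)
print(treated)
```

Keeping an explicit flag lets a model distinguish "event never happened" from a genuinely small age, which a plain fill with 0 or the median would blur.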

### Roadmap
**exploration**: 
1. inspect the outliers in funding_total_usd
2. is milestones related to age_first/last_milestone_year?
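
Both exploration questions could be approached along these lines. This is a minimal sketch on made-up toy values: the 1.5 × IQR rule is one common outlier heuristic (an assumption here, not a choice stated in the notebook), and `iqr_outlier_mask` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy values with the dataset's column names; not the real data.
toy = pd.DataFrame({
    "funding_total_usd": [50_000, 900_000, 4_087_500, 5_200_000, 14_500_000, 500_000_000],
    "milestones": [1, 0, 3, 2, 2, 0],
    "age_first_milestone_year": [0.74, np.nan, 8.98, 1.95, 9.62, np.nan],
})


def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# 1. Which funding totals fall outside the IQR fences?
outliers = iqr_outlier_mask(toy["funding_total_usd"])
print(toy.loc[outliers, "funding_total_usd"])

# 2. Does "zero milestones" coincide with a missing first-milestone age?
check = pd.crosstab(
    toy["milestones"].eq(0),
    toy["age_first_milestone_year"].isna(),
    rownames=["no_milestones"],
    colnames=["first_milestone_age_missing"],
)
print(check)
```

If the off-diagonal cells of the cross-tabulation are zero on the real data, the NaN values in the milestone ages are fully explained by `milestones == 0`.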

**pre-processing**
1. standardize scales (StandardScaler) for funding_total_usd, relationships, funding_rounds and avg_participants
2. handle the class imbalance in the target: use class_weight, threshold tuning or robust metrics (AUC/F1)
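
The two pre-processing steps could be combined in a single scikit-learn pipeline, roughly as below. This is a sketch under stated assumptions: the notebook does not name a model, so `LogisticRegression` is only a placeholder classifier, and the data is a random synthetic stand-in, not `train.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Columns the roadmap lists for scaling.
scale_cols = ["funding_total_usd", "relationships", "funding_rounds", "avg_participants"]

# Synthetic stand-in for the training data (random values, illustration only).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "funding_total_usd": rng.lognormal(mean=15, sigma=1.5, size=n),
    "relationships": rng.integers(0, 30, size=n),
    "funding_rounds": rng.integers(1, 8, size=n),
    "avg_participants": rng.uniform(1, 5, size=n),
    "has_VC": rng.integers(0, 2, size=n),
})
y = (rng.random(n) < 0.65).astype(int)  # roughly a 65/35 split, similar to `labels`

preprocess = ColumnTransformer(
    [("scale", StandardScaler(), scale_cols)],
    remainder="passthrough",  # leave the 0/1 flag columns untouched
)
model = Pipeline([
    ("preprocess", preprocess),
    # class_weight="balanced" reweights classes inversely to their frequency.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
model.fit(X, y)

# Scaled columns come first in the transformer output, passthrough columns last.
X_scaled = model.named_steps["preprocess"].transform(X)
print(X_scaled[:, :4].mean(axis=0).round(6))
```

Wrapping the scaler in the pipeline means it is refit on each training fold during cross-validation, which avoids leaking statistics from validation rows.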

# Importing the required libraries

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
import sklearn as sk

print("Libraries imported successfully:","\n Pandas - ", pd.__version__, "\n NumPy -", np.__version__, "\n Matplotlib", "\n Seaborn -", sns.__version__, "\n Scikit-learn -", sk.__version__)

Libraries imported successfully: 
 Pandas -  2.3.2 
 NumPy - 2.3.3 
 Matplotlib 
 Seaborn - 0.13.2 
 Scikit-learn - 1.7.2


# Data exploration

In [11]:
df = pd.read_csv('../data/train.csv')
df.head(10)

Unnamed: 0,id,age_first_funding_year,age_last_funding_year,age_first_milestone_year,age_last_milestone_year,relationships,funding_rounds,funding_total_usd,milestones,is_CA,...,is_consulting,is_othercategory,has_VC,has_angel,has_roundA,has_roundB,has_roundC,has_roundD,avg_participants,labels
0,719,10.42,13.09,8.98,12.72,4,3,4087500,3,1,...,0,0,1,1,0,0,0,0,1.0,0
1,429,3.79,3.79,,,21,1,45000000,0,0,...,0,0,0,0,0,1,0,0,1.0,1
2,178,0.71,2.28,1.95,2.28,5,2,5200000,2,1,...,0,1,1,0,1,0,0,0,1.0,0
3,197,3.0,5.0,9.62,10.39,16,2,14500000,2,0,...,0,0,0,1,0,1,0,0,2.0,1
4,444,0.66,5.88,6.21,8.61,29,5,70000000,4,1,...,0,0,0,0,1,1,1,1,2.8,1
5,67,1.09,1.09,,,5,1,20500000,0,0,...,0,0,1,0,0,0,0,0,4.0,0
6,505,0.65,0.65,,,3,1,900000,0,1,...,0,0,0,0,1,0,0,0,1.0,0
7,410,0.34,5.22,3.0,6.03,5,4,48730000,3,0,...,0,0,1,0,1,1,1,0,3.0,1
8,284,0.0,0.0,0.74,0.74,2,1,50000,1,1,...,0,0,0,1,0,0,0,0,2.0,0
9,252,0.08,4.33,3.66,5.61,12,7,88651133,5,0,...,0,0,1,1,1,1,1,1,3.1667,1


In [12]:
numero_de_linhas = len(df)
print(f"The DataFrame has {numero_de_linhas} rows, each representing one start-up.")

The DataFrame has 646 rows, each representing one start-up.


In [5]:
df.dtypes

id                            int64
age_first_funding_year      float64
age_last_funding_year       float64
age_first_milestone_year    float64
age_last_milestone_year     float64
relationships                 int64
funding_rounds                int64
funding_total_usd             int64
milestones                    int64
is_CA                         int64
is_NY                         int64
is_MA                         int64
is_TX                         int64
is_otherstate                 int64
category_code                object
is_software                   int64
is_web                        int64
is_mobile                     int64
is_enterprise                 int64
is_advertising                int64
is_gamesvideo                 int64
is_ecommerce                  int64
is_biotech                    int64
is_consulting                 int64
is_othercategory              int64
has_VC                        int64
has_angel                     int64
has_roundA                  

### Identifying missing values

The columns with missing values are:
- age_first_funding_year: 35
- age_last_funding_year: 9
- age_first_milestone_year: 138
- age_last_milestone_year: 111

In [6]:
df.isnull().sum()

id                            0
age_first_funding_year       35
age_last_funding_year         9
age_first_milestone_year    138
age_last_milestone_year     111
relationships                 0
funding_rounds                0
funding_total_usd             0
milestones                    0
is_CA                         0
is_NY                         0
is_MA                         0
is_TX                         0
is_otherstate                 0
category_code                 0
is_software                   0
is_web                        0
is_mobile                     0
is_enterprise                 0
is_advertising                0
is_gamesvideo                 0
is_ecommerce                  0
is_biotech                    0
is_consulting                 0
is_othercategory              0
has_VC                        0
has_angel                     0
has_roundA                    0
has_roundB                    0
has_roundC                    0
has_roundD                    0
avg_part

### **Pre-processing**: data cleaning and transformation