# Train/Test Split + Logistic Regression (Cathy)


---


Workflow:
* Split the dataset (80% for training, 20% for testing)
    * a plain random split may not preserve the ratio of 1s to 0s in each set, so we stratify on the target
    * drop unnecessary columns
* Prepare/structure the input features
    * scale the continuous features so that large-valued ones do not dominate the fit; leave the binary ones as they are
* Fit the Logistic Regression model (this works best when the features are not strongly correlated with each other)
    * start from initial weights and a bias, and compute the score z = w·x + b
    * plug z into the sigmoid (probability) formula, and compare the result with the true label
    * adjust the weights so the predicted probability gets as close to the true label as possible (maximum likelihood estimation)
    * repeat until the weights stop changing much; at that point the model has learned the pattern
* Evaluate and interpret results
* Run cross-validation
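
The fitting steps listed above can be sketched in plain NumPy. This is only an illustration on made-up data: `fit_logreg`, the learning rate, the iteration count, and the toy dataset are all assumptions. scikit-learn's `LogisticRegression` (used later in this notebook) fits with the L-BFGS solver and L2 regularisation by default, not with this plain gradient-descent loop.

```python
import numpy as np

def sigmoid(z):
    """Map a raw score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lr=0.1, n_iter=2000, seed=0):
    """Hypothetical helper: gradient descent on the mean log-loss."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])  # small random initial weights
    b = rng.normal(scale=0.01)                   # small random initial bias
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)                   # predicted probabilities
        grad_w = X.T @ (p - y) / len(y)          # gradient of the mean log-loss w.r.t. w
        grad_b = np.mean(p - y)                  # gradient w.r.t. b
        w -= lr * grad_w                         # step towards higher likelihood
        b -= lr * grad_b
    return w, b

# Toy data (not the startup dataset): the label depends on a linear rule
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(float)

w_toy, b_toy = fit_logreg(X_toy, y_toy)
acc_toy = np.mean((sigmoid(X_toy @ w_toy + b_toy) >= 0.5) == y_toy)
print("weights:", w_toy, "bias:", b_toy, "training accuracy:", acc_toy)
```

Minimising the log-loss is the same as maximising the likelihood, which is why the update uses `p - y`: the further a predicted probability is from its label, the larger the correction.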

New files pushed: `test_cleaned.csv`, `train_cleaned.csv`, `logreg_confusion_matrix.png`, `logreg_classification_report.txt`

First, we perform the train/test split:

#### **Important:** Comments from Robert (during PR):
I found that the `out_dir` here and the path later used to store the model-ready data are both `data/processed` instead of `data/model_ready`. However, in the actual repo the model-ready data (train and test) are stored in `data/model_ready`.

Perhaps the developer initially wrote the data to the `processed` folder and then manually created the `model_ready` folder?

For anyone in the future: please make sure the paths are correct if you need to run the code again :)
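
One way to keep the paths from drifting apart is to define them once and build every file path from those constants. The sketch below assumes the layout Robert describes (`data/model_ready` for the splits); `DATA_DIR`, `PROCESSED_DIR`, `MODEL_READY_DIR`, and `split_path` are hypothetical names, not something that exists in the repo.

```python
from pathlib import Path

# Hypothetical constants: adjust to the real repo layout before use
DATA_DIR = Path("data")
PROCESSED_DIR = DATA_DIR / "processed"      # output of the cleaning step
MODEL_READY_DIR = DATA_DIR / "model_ready"  # train/test splits, per Robert's comment

def split_path(name: str, base: Path = MODEL_READY_DIR) -> Path:
    """Return the CSV path for a named split, e.g. 'train' or 'test_clean'."""
    return base / f"{name}.csv"

print(split_path("train").as_posix())
print(split_path("test_clean").as_posix())
```

With this in place, switching between `processed` and `model_ready` is a one-line change rather than an edit in every cell.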

In [None]:
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split

# --- paths ---
data_path = Path("data/processed/startup_success_processed.csv")  # adjust if needed
out_dir = Path("data/processed")
out_dir.mkdir(parents=True, exist_ok=True)

# --- load ---
df = pd.read_csv(data_path)

# --- target & features ---
target_col = "status"
X = df.drop(columns=[target_col])
y = df[target_col]

# --- single, reproducible, stratified split ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,       # <-- this keeps class proportions the same
    random_state=42   # <-- ensures reproducibility
)

# --- save the split (push these to the repo) ---
train_df = X_train.copy()
train_df[target_col] = y_train
test_df = X_test.copy()
test_df[target_col] = y_test

train_df.to_csv(out_dir / "train.csv", index=False)
test_df.to_csv(out_dir / "test.csv", index=False)

print("Saved:", out_dir / "train.csv", "and", out_dir / "test.csv")


Saved: data/processed/train.csv and data/processed/test.csv
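
To check that `stratify=y` does what the comment says, one can compare the share of positives in each split. The sketch below uses a synthetic target (65% ones) as a stand-in, since the real CSV is not loaded here; `X_demo`, `y_demo`, and the 65/35 ratio are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 650 ones and 350 zeros
y_demo = pd.Series([1] * 650 + [0] * 350, name="status")
X_demo = pd.DataFrame({"x": range(len(y_demo))})

_, _, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    stratify=y_demo,   # keep the class ratio the same in both splits
    random_state=42,
)

# Share of positives in each split; both should be close to the overall share
print("overall:", y_demo.mean())
print("train  :", y_tr_demo.mean())
print("test   :", y_te_demo.mean())
```

On the real data the same check would be `y_train.mean()` against `y_test.mean()`. The outputs recorded further down are consistent with this: 477 of 738 training rows are positive (about 64.6%), and 120 of 185 test rows (about 64.9%).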


Then, we drop unnecessary columns:

In [35]:
# load the already-split data
train = pd.read_csv("data/processed/train.csv")
test  = pd.read_csv("data/processed/test.csv")

# columns to drop (as discussed)
cols_to_drop = [
    "closed_at",
    "founded_at",
    "first_funding_year",
    "last_funding_year"
]

# drop them from both train + test
train_clean = train.drop(columns=cols_to_drop, errors='ignore')
test_clean  = test.drop(columns=cols_to_drop, errors='ignore')

# save cleaned splits
train_clean.to_csv("data/processed/train_clean.csv", index=False)
test_clean.to_csv("data/processed/test_clean.csv", index=False)

print("Dropped:", cols_to_drop)
print("Train shape:", train_clean.shape)
print("Test shape:", test_clean.shape)


Dropped: ['closed_at', 'founded_at', 'first_funding_year', 'last_funding_year']
Train shape: (738, 31)
Test shape: (185, 31)


We scale only the continuous variables:

In [37]:
# load cleaned splits
train_clean = pd.read_csv("data/processed/train_clean.csv")
test_clean  = pd.read_csv("data/processed/test_clean.csv")

target_col = "status"

# separate X/y
X_train = train_clean.drop(columns=[target_col])
y_train = train_clean[target_col]
X_test  = test_clean.drop(columns=[target_col])
y_test  = test_clean[target_col]

# ---- choose which columns to scale (continuous / count-like) ----
continuous_cols = [
    "age_first_funding_year",
    "age_last_funding_year",
    "age_first_milestone_year",
    "age_last_milestone_year",
    "funding_rounds",
    "funding_total_usd",
    "milestones",
    "avg_participants"
]

# sanity: keep only those that exist after your earlier drops
continuous_cols = [c for c in continuous_cols if c in X_train.columns]




Now we can build the logistic regression model, wrapping the scaling and the classifier in a single pipeline:

In [41]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

preprocess = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), continuous_cols),
    ],
    remainder="passthrough"   # leave binary and other columns untouched
)

logreg = LogisticRegression(max_iter=1000, n_jobs=None)  # simple baseline

pipe = Pipeline(steps=[
    ("prep", preprocess),
    ("model", logreg)
])
pipe.fit(X_train, y_train)

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



Evaluate the model on the test set:

In [43]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# class predictions (threshold = 0.5 by default)
y_pred = pipe.predict(X_test)

# probabilities for ROC-AUC
y_proba = pipe.predict_proba(X_test)[:, 1]

acc  = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, zero_division=0)
rec  = recall_score(y_test, y_pred, zero_division=0)
f1   = f1_score(y_test, y_pred, zero_division=0)
auc  = roc_auc_score(y_test, y_proba)
cm   = confusion_matrix(y_test, y_pred)

print(f"Accuracy : {acc:.3f}")
print(f"Precision: {prec:.3f}")
print(f"Recall   : {rec:.3f}")
print(f"F1-score : {f1:.3f}")
print(f"ROC-AUC  : {auc:.3f}")
print("Confusion matrix:\n", cm)


Accuracy : 0.714
Precision: 0.782
Recall   : 0.775
F1-score : 0.778
ROC-AUC  : 0.760
Confusion matrix:
 [[39 26]
 [27 93]]
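
As a cross-check, the reported metrics can be recomputed by hand from the confusion matrix above. The matrix values are copied from the output; the variable names are just for this sketch.

```python
import numpy as np

# Confusion matrix from the output above: rows = true label, cols = predicted label
cm_test = np.array([[39, 26],
                    [27, 93]])
tn, fp, fn, tp = cm_test.ravel()

accuracy  = (tp + tn) / cm_test.sum()   # (93 + 39) / 185
precision = tp / (tp + fp)              # 93 / (93 + 26)
recall    = tp / (tp + fn)              # 93 / (93 + 27)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy : {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall   : {recall:.3f}")
print(f"F1-score : {f1:.3f}")
```

These agree with the printed values (0.714, 0.782, 0.775, 0.778). Note that the model finds only 39 of the 65 true class-0 cases (60%), so it is noticeably weaker on class 0 than on class 1.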


Interpret the model (which features matter?):

In [45]:
scaled_names = continuous_cols
other_names = [c for c in X_train.columns if c not in continuous_cols]
feature_names = scaled_names + other_names  # ColumnTransformer + remainder order

coefs = pipe.named_steps["model"].coef_.ravel()
pairs = sorted(zip(feature_names, coefs), key=lambda x: x[1], reverse=True)

print("\nTop positive features (increase success probability):")
for name, w in pairs[:10]:
    print(f"{name:30s}  {w: .3f}")

print("\nTop negative features (decrease success probability):")
for name, w in pairs[-10:]:
    print(f"{name:30s}  {w: .3f}")



Top positive features (increase success probability):
has_roundD                       0.866
is_top500                        0.760
milestones                       0.715
is_MA                            0.675
is_biotech                       0.564
is_enterprise                    0.457
is_NY                            0.415
avg_participants                 0.316
has_roundC                       0.249
age_first_milestone_year         0.229

Top negative features (decrease success probability):
is_web                          -0.055
is_mobile                       -0.139
age_first_funding_year          -0.172
is_otherstate                   -0.194
has_angel                       -0.208
is_gamesvideo                   -0.212
is_consulting                   -0.242
has_VC                          -0.248
is_TX                           -0.273
is_ecommerce                    -0.738


In [47]:
import joblib, os
os.makedirs("models", exist_ok=True)
joblib.dump(pipe, "models/baseline_logreg.pkl")
print("Saved model to models/baseline_logreg.pkl")


Saved model to models/baseline_logreg.pkl
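
The saved pipeline can later be restored with `joblib.load` and used for prediction without refitting. The sketch below demonstrates the round trip on a throwaway model in a temporary folder, so it does not touch `models/baseline_logreg.pkl`; the toy data and file name are assumptions.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Throwaway model on made-up data, only to demonstrate dump/load
rng = np.random.default_rng(0)
X_small = rng.normal(size=(50, 3))
y_small = (X_small[:, 0] > 0).astype(int)
model_small = LogisticRegression().fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_logreg.pkl"   # hypothetical file name
    joblib.dump(model_small, path)
    restored = joblib.load(path)

# The restored model should give identical predictions
same_predictions = np.array_equal(model_small.predict(X_small), restored.predict(X_small))
print("Predictions identical after reload:", same_predictions)
```

For the real model, the equivalent would be `pipe = joblib.load("models/baseline_logreg.pkl")`, ideally with the same scikit-learn version that was used to save it.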


Create confusion matrix heatmap (saved as external file):

In [51]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import numpy as np

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot heatmap
fig = plt.figure()
plt.imshow(cm, interpolation='nearest')
plt.title('Confusion Matrix (Logistic Regression)')
plt.colorbar()

tick_marks = np.arange(2)
plt.xticks(tick_marks, ['Pred 0', 'Pred 1'])
plt.yticks(tick_marks, ['True 0', 'True 1'])

# Add numbers inside squares
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        plt.text(j, i, format(cm[i, j], 'd'), ha="center", va="center")

plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.tight_layout()

# Save the figure; the message is built from the same path so the two cannot disagree
out_path = "docs/logreg_confusion_matrix.png"
plt.savefig(out_path, dpi=200)
plt.close(fig)

print(f"Saved confusion matrix heatmap → {out_path}")


Saved confusion matrix heatmap → docs/logreg_confusion_matrix.png


To get a more reliable estimate of the model's performance, I ran stratified 5-fold cross-validation on the training set.

In [53]:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_validate, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# --- load your TRAIN ONLY split you already saved/cleaned ---
train = pd.read_csv("data/processed/train_clean.csv")

target_col = "status"
X = train.drop(columns=[target_col])
y = train[target_col]

# --- detect binary vs continuous automatically (the earlier cells used a hand-picked list) ---
def is_binary(s: pd.Series) -> bool:
    vals = pd.unique(s.dropna())
    return set(vals).issubset({0, 1})

binary_cols = [c for c in X.columns if is_binary(X[c])]
cont_cols   = [c for c in X.columns if c not in binary_cols]

# --- preprocessing inside CV to avoid leakage ---
preproc = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), cont_cols),
        ("passthrough", "passthrough", binary_cols),
    ],
    remainder="drop"
)

pipe = Pipeline([
    ("prep", preproc),
    ("clf", LogisticRegression(max_iter=2000))
])

# --- 5-fold stratified CV ---
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    "accuracy": "accuracy",
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
    "roc_auc": "roc_auc"
}

cvres = cross_validate(pipe, X, y, cv=skf, scoring=scoring, return_train_score=False)

# summarize
summary = {m: (np.mean(cvres[f'test_{m}']), np.std(cvres[f'test_{m}'])) for m in scoring.keys()}
for m, (mu, sd) in summary.items():
    print(f"{m:9s}: {mu:.4f} ± {sd:.4f}")

# --- optional: out-of-fold predictions to form a CV confusion matrix ---
y_pred_oof = cross_val_predict(pipe, X, y, cv=skf)
cm = confusion_matrix(y, y_pred_oof, labels=[0,1])
print("\nOOF Confusion Matrix [rows=true, cols=pred]:\n", cm)

# optional: quick text report on OOF preds
print("\nClassification report (OOF):")
print(classification_report(y, y_pred_oof, digits=4))



accuracy : 0.7372 ± 0.0372
precision: 0.7654 ± 0.0234
recall   : 0.8554 ± 0.0406
f1       : 0.8077 ± 0.0290
roc_auc  : 0.7606 ± 0.0513

OOF Confusion Matrix [rows=true, cols=pred]:
 [[136 125]
 [ 69 408]]

Classification report (OOF):
              precision    recall  f1-score   support

           0     0.6634    0.5211    0.5837       261
           1     0.7655    0.8553    0.8079       477

    accuracy                         0.7371       738
   macro avg     0.7144    0.6882    0.6958       738
weighted avg     0.7294    0.7371    0.7286       738
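
The reason for putting the scaler inside the pipeline is that `cross_validate` then refits it on each fold's training part, so the held-out fold never influences the scaling. The sketch below makes this visible with `return_estimator=True` on made-up data; the array shapes and names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data, not the startup dataset
rng = np.random.default_rng(0)
X_cv = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_cv = (X_cv[:, 0] > 5.0).astype(int)

pipe_cv = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
skf_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# return_estimator=True keeps the pipeline fitted on each fold
res_cv = cross_validate(pipe_cv, X_cv, y_cv, cv=skf_cv, return_estimator=True)

# One row per fold: the column means each fold's scaler learned from its own training part
fold_means = np.array([est.named_steps["scale"].mean_ for est in res_cv["estimator"]])
print(fold_means.round(3))
```

Each row differs slightly because each scaler saw a different 80% of the rows. Scaling the whole training set once before cross-validation would instead use statistics that include the held-out fold.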

