# Week 05 - Perceptron: From-Scratch Implementation vs scikit-learn

A detailed comparison between a from-scratch implementation and scikit-learn's `Perceptron`, `SGDClassifier` (perceptron mode), and a `LogisticRegression` baseline, using the *Breast Cancer Wisconsin (Diagnostic)* dataset.

**Objectives:**
- Reinforce the mathematical foundations of the perceptron.
- Evaluate the practical differences between a hand-written implementation and library implementations.
- Analyze hyperparameters (learning rate, epochs, regularization).
- Measure performance, robustness, and training time.
- Build a reproducible, extensible workflow.

> If downloading the remote dataset fails (for example, with no Internet access), the built-in `sklearn.datasets.load_breast_cancer` dataset is used as a fallback.

## 1. Downloading and Extracting the Dataset (UCI)
Source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
Direct ZIP: https://archive.ics.uci.edu/static/public/17/breast+cancer+wisconsin+diagnostic.zip

In [None]:
import os, zipfile, io, hashlib, time, json, math, textwrap, statistics
from pathlib import Path
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from urllib.request import urlopen, Request

DATA_DIR = Path('data')
DATA_DIR.mkdir(exist_ok=True)
ZIP_URL = 'https://archive.ics.uci.edu/static/public/17/breast+cancer+wisconsin+diagnostic.zip'
ZIP_PATH = DATA_DIR / 'breast_cancer_wisconsin_diagnostic.zip'
RAW_CSV = None
try:
    if not ZIP_PATH.exists():
        print('Downloading ZIP...')
        req = Request(ZIP_URL, headers={'User-Agent':'Mozilla/5.0'})
        with urlopen(req, timeout=30) as r: data = r.read()
        ZIP_PATH.write_bytes(data)
    else:
        print('ZIP already exists, skipping download.')
    with zipfile.ZipFile(ZIP_PATH,'r') as z:
        print('ZIP contents:', z.namelist())
        # Look for a .data or .csv member
        members = [m for m in z.namelist() if m.lower().endswith(('.data','.csv'))]
        if members:
            target = members[0]
            csv_bytes = z.read(target)
            # Save it under a normalized CSV name
            RAW_CSV = DATA_DIR / 'breast_cancer_wisconsin_diagnostic.csv'
            RAW_CSV.write_bytes(csv_bytes)
            print('Extracted to', RAW_CSV)
        else:
            print('No .data/.csv file found in the ZIP; the fallback will be used.')
except Exception as e:
    print('ZIP download/read failed:', e)
    RAW_CSV = None

## 2. Initial Load and Data Structure

In [None]:
from sklearn.datasets import load_breast_cancer
if RAW_CSV and RAW_CSV.exists():
    # The original file is usually laid out as: ID, diagnosis, 30 features (no header row)
    df_raw = pd.read_csv(RAW_CSV, header=None)
    # Per the documentation, the columns are: ID, diagnosis, 30 features (possibly plus a trailing empty column)
    # For robustness, check the shape before assigning names
    # Column names are taken from the sklearn copy of the dataset
    sk = load_breast_cancer()
    feature_names = list(sk.feature_names) if hasattr(sk,'feature_names') else [f'f{i}' for i in range(30)]
    cols = ['id','diagnosis'] + feature_names
    if df_raw.shape[1] >= 32: # the file sometimes includes an empty trailing column
        df_raw = df_raw.iloc[:, :32]
    df_raw.columns = cols[:df_raw.shape[1]]
else:
    print('Using fallback sklearn.load_breast_cancer')
    sk = load_breast_cancer()
    df_raw = pd.DataFrame(sk.data, columns=sk.feature_names)
    df_raw.insert(0,'id', range(1, len(df_raw)+1))
    df_raw.insert(1,'diagnosis', sk.target)
    # sklearn encodes 0=malignant, 1=benign; map to text labels here and re-encode to binary during cleaning
    df_raw['diagnosis'] = df_raw['diagnosis'].map({0:'malignant',1:'benign'})
df_raw.head()

In [None]:
print('Shape:', df_raw.shape)
print('Dtypes:\n', df_raw.dtypes.head())
print('Diagnosis counts:', df_raw['diagnosis'].value_counts())

## 3. Cleaning and Type Conversion

In [None]:
df = df_raw.copy()
# Drop the ID column, which adds no predictive value
if 'id' in df.columns: df = df.drop(columns=['id'])
# Map diagnosis to binary (1 = malignant, 0 = benign). Original dataset: M=Malignant, B=Benign
mapping = {'M':1,'B':0,'malignant':1,'benign':0}
df['diagnosis'] = df['diagnosis'].map(mapping)
# Ensure numeric types
for c in df.columns:
    if c!='diagnosis': df[c] = pd.to_numeric(df[c], errors='coerce')
print('Nulls per column (expected ~0):')
print(df.isna().sum().sort_values(ascending=False).head())
df.head()

## 4. Basic Exploratory Data Analysis (EDA)

In [None]:
desc = df.describe().T
target_dist = df['diagnosis'].value_counts(normalize=True) * 100
print(target_dist)
corr = df.corr(numeric_only=True)
sns.heatmap(corr.iloc[:10,:10], cmap='coolwarm', center=0); plt.title('Correlations (first 10 features)'); plt.show()
desc.head()

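## 5. Feature Engineering and Normalization

In [None]:
from sklearn.preprocessing import StandardScaler
X_full = df.drop(columns=['diagnosis']).to_numpy(dtype=float)
y = df['diagnosis'].astype(int).values
# The scaler is fitted on the training split only (section 6), so that
# validation and test statistics do not leak into the preprocessing.
scaler = StandardScaler()
X_full[:3]

## 6. Train / Validation / Test Split

In [None]:
from sklearn.model_selection import train_test_split
# 70% train, 15% validation, 15% test, stratified by class
X_train, X_temp, y_train, y_temp = train_test_split(X_full, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)
# Fit the scaler on the training data, then apply the same transform to validation and test
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
print('Train:', X_train.shape, 'Val:', X_val.shape, 'Test:', X_test.shape)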

## 7. Mathematical Definition of the Perceptron
\( \hat{y} =
\begin{cases} 1 & \text{if } w^T x + b \ge 0 \\ 0 & \text{otherwise} \end{cases} \)

Update rule: \( w := w + \eta (y-\hat{y}) x, \quad b := b + \eta (y-\hat{y}) \), where \( \eta \) is the learning rate.
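
To make the update rule concrete, here is a minimal sketch of a single update step in NumPy. The function name `perceptron_step` and the numeric values are illustrative assumptions, not part of the notebook's class.

```python
import numpy as np

def perceptron_step(w, b, x, y, eta):
    """Apply one perceptron update for a single sample (x, y) with y in {0, 1}."""
    y_hat = 1 if np.dot(w, x) + b >= 0 else 0   # threshold at zero
    error = y - y_hat                            # one of -1, 0, 1
    w_new = w + eta * error * x                  # w := w + eta * (y - y_hat) * x
    b_new = b + eta * error                      # b := b + eta * (y - y_hat)
    return w_new, b_new, y_hat

# Hypothetical starting point: the sample is a positive that is misclassified.
w = np.array([0.5, -0.5])
b = 0.0
x = np.array([1.0, 2.0])

w1, b1, y_hat = perceptron_step(w, b, x, y=1, eta=0.5)
print(y_hat, w1, b1)
```

Here the net input is `0.5 - 1.0 = -0.5`, so the prediction is 0 and the error is 1; the update moves the weights toward `x`, and the same sample is classified correctly afterwards.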

## 8. From-Scratch Implementation: The PerceptronScratch Class

In [None]:
from dataclasses import dataclass
import numpy as np
@dataclass
class PerceptronScratch:
    learning_rate: float = 0.01
    n_epochs: int = 50
    shuffle: bool = True
    random_state: int = 42
    verbose: bool = False
    def __post_init__(self):
        self.rng = np.random.default_rng(self.random_state)
        self.w_ = None
        self.b_ = 0.0
        self.errors_ = []
        self.acc_ = []
    def _shuffle(self, X, y):
        idx = self.rng.permutation(len(X))
        return X[idx], y[idx]
    def fit(self, X, y, X_val=None, y_val=None, early_stop=True):
        X = np.asarray(X); y = np.asarray(y).astype(int)
        n_samples, n_features = X.shape
        self.w_ = self.rng.normal(0, 0.01, size=n_features)
        self.b_ = 0.0
        self.errors_.clear(); self.acc_.clear()
        for epoch in range(self.n_epochs):
            if self.shuffle: X, y = self._shuffle(X, y)
            errors = 0
            for xi, target in zip(X, y):
                # update = eta * (y - y_hat); it is zero when the sample is classified correctly
                update = self.learning_rate * (target - self.predict_single(xi))
                if update != 0:
                    self.w_ += update * xi
                    self.b_ += update
                    errors += 1
            self.errors_.append(errors)
            if X_val is not None:
                val_pred = self.predict(X_val)
                acc = (val_pred == y_val).mean()
                self.acc_.append(acc)
            if self.verbose:
                print(f'Epoch {epoch+1}/{self.n_epochs} - errors={errors}')
            # A full pass with no mistakes means the training data is already separated
            if early_stop and errors == 0:
                if self.verbose: print('Early convergence.')
                break
        return self
    def net_input(self, X):
        return np.dot(X, self.w_) + self.b_
    def predict_single(self, x):
        return 1 if (np.dot(x, self.w_) + self.b_) >= 0 else 0
    def predict(self, X):
        return (self.net_input(X) >= 0).astype(int)
    def decision_function(self, X):
        return self.net_input(X)

## 9. Manual Training (Epoch Loop) + 10. History

In [None]:
scratch = PerceptronScratch(learning_rate=0.001, n_epochs=100, verbose=False)
t0=time.perf_counter()
scratch.fit(X_train, y_train, X_val=X_val, y_val=y_val)
t_scratch = time.perf_counter()-t0
print('Training time (scratch):', t_scratch)
print('Errors per epoch (first 15):', scratch.errors_[:15])

In [None]:
plt.plot(scratch.errors_, marker='o'); plt.title('Errors per epoch (Scratch)'); plt.xlabel('Epoch'); plt.ylabel('Errors'); plt.show()
if scratch.acc_:
    plt.plot(scratch.acc_); plt.title('Validation accuracy (Scratch)'); plt.xlabel('Epoch'); plt.ylabel('Acc'); plt.show()

## 11. Evaluating the Manual Model (Metrics and Confusion Matrix)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc, classification_report
y_pred_scratch = scratch.predict(X_test)
metrics_scratch = {
    'accuracy': accuracy_score(y_test,y_pred_scratch),
    'precision': precision_score(y_test,y_pred_scratch),
    'recall': recall_score(y_test,y_pred_scratch),
    'f1': f1_score(y_test,y_pred_scratch)
}
cm_scratch = confusion_matrix(y_test,y_pred_scratch)
metrics_scratch, cm_scratch
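
The metrics above can all be read off the confusion matrix. As a minimal sketch on made-up labels (the arrays `y_true_demo` and `y_pred_demo` are illustrative, not results from this notebook), precision, recall, and F1 follow directly from the four cell counts:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Illustrative labels only (1 = malignant, 0 = benign)
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are real
recall = tp / (tp + fn)      # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

In a diagnostic setting, recall on the malignant class is usually the number to watch, since a false negative is a missed malignant case.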

## 12. Decision Boundary Visualization (PCA 2D)

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=42)
X_train_2d = pca.fit_transform(X_train)
X_test_2d = pca.transform(X_test)
# Fit a scratch perceptron in the 2D space, for visualization only
viz_model = PerceptronScratch(learning_rate=0.01, n_epochs=50).fit(X_train_2d, y_train)
# Grid over which the decision boundary is evaluated
xx, yy = np.meshgrid(
    np.linspace(X_train_2d[:,0].min()-1, X_train_2d[:,0].max()+1, 200),
    np.linspace(X_train_2d[:,1].min()-1, X_train_2d[:,1].max()+1, 200),
)
grid = np.c_[xx.ravel(), yy.ravel()]
zz = viz_model.predict(grid).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.25, cmap='coolwarm');
sns.scatterplot(x=X_train_2d[:,0], y=X_train_2d[:,1], hue=y_train, palette='coolwarm', s=30, edgecolor='k');
plt.title('Decision boundary (PCA 2D)'); plt.show()

## 13. Hyperparameter Tuning (lr, epochs)

In [None]:
grid_lr=[0.0005,0.001,0.01,0.1]
grid_epochs=[25,50,100]
results=[]
for lr in grid_lr:
    for ne in grid_epochs:
        m=PerceptronScratch(learning_rate=lr,n_epochs=ne)
        m.fit(X_train,y_train,X_val=X_val,y_val=y_val,early_stop=True)
        yv=m.predict(X_val)
        results.append({'lr':lr,'epochs':ne,'val_acc':(yv==y_val).mean(),'final_errors':m.errors_[-1]})
hp_df = pd.DataFrame(results).sort_values('val_acc', ascending=False)
hp_df.head()
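
One point worth keeping in mind when reading a learning-rate grid for a perceptron: with weights initialised at exactly zero, the learning rate only rescales `(w, b)` and leaves every prediction unchanged. `PerceptronScratch` starts from small random weights, so the learning rate can still have some effect there. The sketch below illustrates the zero-initialisation case on synthetic data; `train_perceptron` is a hypothetical helper, and the two rates differ by a power of two so the rescaling is exact in floating point.

```python
import numpy as np

def train_perceptron(X, y, eta, n_epochs=10):
    """Classic perceptron rule with zero-initialised weights and no shuffling."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w + b >= 0 else 0
            update = eta * (yi - y_hat)
            w += update * xi
            b += update
    return w, b

# Synthetic, linearly separable data (assumption: purely illustrative)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) + 0.3 > 0).astype(int)

w_a, b_a = train_perceptron(X_demo, y_demo, eta=1.0)
w_b, b_b = train_perceptron(X_demo, y_demo, eta=0.25)

pred_a = (X_demo @ w_a + b_a >= 0).astype(int)
pred_b = (X_demo @ w_b + b_b >= 0).astype(int)
same_predictions = bool(np.array_equal(pred_a, pred_b))
print('Same predictions:', same_predictions)
```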

## 14. scikit-learn Implementations (Perceptron, SGDClassifier, LogisticRegression)

In [None]:
from sklearn.linear_model import Perceptron, SGDClassifier, LogisticRegression
# Time each fit separately with a wall-clock timer
t0 = time.perf_counter()
sk_perc = Perceptron(max_iter=1000, eta0=1.0, random_state=42, tol=1e-3)
sk_perc.fit(X_train, y_train)
t_perc = time.perf_counter() - t0

t0 = time.perf_counter()
sk_sgd = SGDClassifier(loss='perceptron', learning_rate='constant', eta0=0.01, max_iter=1000, random_state=42, tol=1e-3)
sk_sgd.fit(X_train, y_train)
t_sgd = time.perf_counter() - t0

t0 = time.perf_counter()
sk_log = LogisticRegression(max_iter=1000, random_state=42)
sk_log.fit(X_train, y_train)
t_log = time.perf_counter() - t0
models = {'scratch': (scratch, t_scratch), 'sk_perceptron': (sk_perc, t_perc), 'sk_sgd_perc': (sk_sgd, t_sgd), 'log_reg': (sk_log, t_log)}
models

## 15. Model Comparison

In [None]:
rows=[]
for name,(model,t) in models.items():
    ypred = model.predict(X_test)
    rows.append({'modelo':name,'accuracy':accuracy_score(y_test,ypred),'precision':precision_score(y_test,ypred),'recall':recall_score(y_test,ypred),'f1':f1_score(y_test,ypred),'tiempo_s':t})
cmp_df = pd.DataFrame(rows).sort_values('f1', ascending=False)
cmp_df

## 16. ROC Curves and AUC

In [None]:
plt.figure()
for name,(model,_) in models.items():
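    if hasattr(model,'decision_function'):
        # All four models here expose decision_function (signed score relative to the hyperplane)
        scores = model.decision_function(X_test)
    else:
        # Fallback for estimators that only provide probabilities
        scores = model.predict_proba(X_test)[:,1]
    fpr,tpr,_ = roc_curve(y_test, scores)
    auc_val = auc(fpr,tpr)
    plt.plot(fpr,tpr,label=f'{name} (AUC={auc_val:.3f})')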
plt.plot([0,1],[0,1],'k--')
plt.xlabel('FPR'); plt.ylabel('TPR'); plt.title('ROC Curves'); plt.legend(); plt.show()
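
ROC analysis only needs a ranking of the samples, which is why raw `decision_function` scores work without probabilities. As a small sketch on made-up scores (`y_true_demo` and `scores_demo` are illustrative), the AUC equals the fraction of (positive, negative) pairs in which the positive receives the higher score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative labels and uncalibrated scores
y_true_demo = np.array([0, 0, 1, 1, 0, 1])
scores_demo = np.array([-1.2, 0.3, 0.8, -0.1, -0.5, 2.0])

pos = scores_demo[y_true_demo == 1]
neg = scores_demo[y_true_demo == 0]

# Compare every positive against every negative; ties count as half a win
diff = pos[:, None] - neg[None, :]
auc_pairs = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

auc_sklearn = roc_auc_score(y_true_demo, scores_demo)
print(auc_pairs, auc_sklearn)
```

In this example 8 of the 9 pairs are ordered correctly, so both values are 8/9.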

## 17. Stratified Cross-Validation

In [None]:
from sklearn.model_selection import StratifiedKFold
def cv_score_scratch(lr=0.001, epochs=50, k=5):
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores=[]
    for tr, va in skf.split(X_train, y_train):
        m=PerceptronScratch(learning_rate=lr,n_epochs=epochs)
        m.fit(X_train[tr], y_train[tr])
        pred = m.predict(X_train[va])
        scores.append((pred==y_train[va]).mean())
    return np.mean(scores), np.std(scores)
mean_cv, std_cv = cv_score_scratch()
print('Scratch CV acc mean±std:', f'{mean_cv:.4f} ± {std_cv:.4f}')
# sklearn Perceptron CV
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores=[]
for tr,va in skf.split(X_train,y_train):
    m=Perceptron(max_iter=1000, random_state=42)
    m.fit(X_train[tr],y_train[tr])
    scores.append(m.score(X_train[va],y_train[va]))
print('sklearn Perceptron CV acc:', f'{np.mean(scores):.4f} ± {np.std(scores):.4f}')
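
For the scikit-learn estimator, the manual fold loop can also be written with `cross_val_score`. The sketch below is self-contained and makes two assumptions that differ from the cell above: it loads the data with `load_breast_cancer` directly, and it places the `StandardScaler` inside a pipeline so the scaler is refitted on the training part of each fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so each fold standardizes with its own training statistics
pipe_demo = make_pipeline(StandardScaler(), Perceptron(max_iter=1000, random_state=42))
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=cv_demo, scoring='accuracy')
print(f'CV accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}')
```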

## 18. Handling Class Imbalance (class_weight / resampling)

In [None]:
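from sklearn.metrics import balanced_accuracy_score
class_dist = pd.Series(y).value_counts(normalize=True)
print('Class distribution (%):', class_dist*100)
# Example: class_weight with the sklearn Perceptron
cw_model = Perceptron(class_weight='balanced', max_iter=1000, random_state=42)
cw_model.fit(X_train,y_train)
# model.score() returns plain accuracy, so balanced accuracy is computed explicitly
print('Balanced accuracy (test):', balanced_accuracy_score(y_test, cw_model.predict(X_test)))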
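
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count_of_that_class)`. A minimal sketch on a toy label vector (the 6/4 split is illustrative, not the dataset's actual ratio) checks the formula against scikit-learn's helper:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 6 samples of class 0 and 4 of class 1
y_demo = np.array([0] * 6 + [1] * 4)
classes = np.unique(y_demo)

# Manual formula: n_samples / (n_classes * per-class counts)
manual_weights = len(y_demo) / (len(classes) * np.bincount(y_demo))

sk_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)
print(manual_weights, sk_weights)
```

The minority class gets the larger weight (1.25 versus about 0.83 here), so its misclassifications trigger proportionally larger updates.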

## 19. Timing and Complexity

In [None]:
def time_train(model, Xtr, ytr):
    t0=time.perf_counter(); model.fit(Xtr,ytr); return time.perf_counter()-t0
times = {name: t for name,(_,t) in models.items()}
pd.Series(times).sort_values()
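
Single-run timings are noisy for fits this fast. One possible refinement, sketched here under stated assumptions, is to refit a fresh estimator several times and report the median. The helper `time_fit` and the synthetic data are hypothetical stand-ins, not part of the notebook's flow.

```python
import statistics
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron

def time_fit(make_model, X, y, repeats=5):
    """Return (median, best) wall-clock fit time in seconds over several fresh fits."""
    durations = []
    for _ in range(repeats):
        model = make_model()                 # fresh, unfitted estimator for each run
        t0 = time.perf_counter()
        model.fit(X, y)
        durations.append(time.perf_counter() - t0)
    return statistics.median(durations), min(durations)

# Synthetic stand-in data (assumption: roughly the size of the training split)
X_demo, y_demo = make_classification(n_samples=400, n_features=30, random_state=0)

median_s, best_s = time_fit(lambda: Perceptron(max_iter=1000, random_state=42), X_demo, y_demo)
print(f'median={median_s:.5f}s best={best_s:.5f}s')
```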

## 20. Model Persistence

In [None]:
import joblib
# Save the scratch weights
scratch_artifact = {'w': scratch.w_.tolist(), 'b': scratch.b_, 'learning_rate': scratch.learning_rate, 'epochs': len(scratch.errors_)}
with open('perceptron_scratch.json','w') as f: json.dump(scratch_artifact,f)
joblib.dump(sk_perc,'perceptron_sklearn.joblib')
print('Artifacts saved.')
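
Reloading a JSON artifact of this shape and predicting with it could look like the sketch below. The weights, the file name `perceptron_demo.json`, and the input rows are hypothetical values chosen for illustration.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical weights, for illustration only
artifact = {'w': [0.5, -1.0], 'b': 0.25}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'perceptron_demo.json'
    path.write_text(json.dumps(artifact))       # save
    loaded = json.loads(path.read_text())       # reload

w = np.asarray(loaded['w'], dtype=float)
b = float(loaded['b'])

# Same decision rule as the scratch class: predict 1 when w.x + b >= 0
X_new = np.array([[1.0, 0.0], [0.0, 1.0]])
preds = (X_new @ w + b >= 0).astype(int)
print(preds)
```

Note that a JSON file like this holds only the weights. New data has to be standardized with the same fitted scaler before prediction, so persisting the scaler alongside the weights is worth considering.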

## 21. Unit Tests (Concept)
Examples of quick assertions; ideally these move to a separate test file run with pytest.

In [None]:
# Test: weight vector dimensions
assert scratch.w_.shape[0] == X_train.shape[1]
# Test: predict agrees with the sign of the decision function
test_scores = scratch.decision_function(X_test[:5])
assert np.all((test_scores>=0).astype(int) == scratch.predict(X_test[:5]))
print('Basic tests OK.')
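
As a rough idea of what a separate pytest file might contain, here is a sketch of two test functions. Because the scratch class is not importable from a module yet, the sketch uses scikit-learn's `Perceptron` as a stand-in; the file name `test_perceptron.py` and the toy data are assumptions.

```python
# Hypothetical contents of test_perceptron.py
import numpy as np
from sklearn.linear_model import Perceptron

def make_toy_data():
    """Four points that are linearly separable by construction."""
    X = np.array([[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]])
    y = np.array([0, 0, 1, 1])
    return X, y

def test_weight_shape():
    X, y = make_toy_data()
    model = Perceptron(random_state=0).fit(X, y)
    # Binary problem: one row of coefficients, one column per feature
    assert model.coef_.shape == (1, X.shape[1])

def test_separable_data_is_fit_perfectly():
    X, y = make_toy_data()
    model = Perceptron(random_state=0).fit(X, y)
    assert (model.predict(X) == y).all()
```

pytest collects functions whose names start with `test_`, so running `pytest` in the folder containing such a file would execute both checks.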

## 22. Pipeline + GridSearchCV (sklearn)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([('clf', Perceptron(random_state=42))])
param_grid = {'clf__penalty':[None,'l2','l1'],'clf__alpha':[0.0001,0.001,0.01]}
gs = GridSearchCV(pipe, param_grid, scoring='f1', cv=5, n_jobs=-1)
gs.fit(X_train,y_train)
print('Best params:', gs.best_params_, 'Best f1:', gs.best_score_)

## 23. Programmatic Summary of Results

In [None]:
best_sklearn = gs.best_estimator_
y_best = best_sklearn.predict(X_test)
summary_rows = []
for name,(model,t) in models.items():
    pred = model.predict(X_test)
    summary_rows.append({'modelo':name,'accuracy':accuracy_score(y_test,pred),'f1':f1_score(y_test,pred),'tiempo_s':t})
summary_rows.append({'modelo':'best_gridsearch','accuracy':accuracy_score(y_test,y_best),'f1':f1_score(y_test,y_best),'tiempo_s':None})
summary_df = pd.DataFrame(summary_rows).sort_values('f1', ascending=False)
summary_df.to_csv('resumen_modelos.csv', index=False)
summary_df

## Conclusions and Future Work
- The from-scratch perceptron offers transparency but is less optimized.
- scikit-learn handles early stopping, regularization, and efficiency better.
- Logistic regression and regularized variants tend to be more stable than the plain perceptron.
- Next steps: extend to repeated stratified cross-validation, probability calibration, L1/L2 regularization in the from-scratch version, and a comparison with a linear SVM.
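
As a starting point for the L2 item, one common way to add an L2 penalty to the update is to shrink the weights by a factor `(1 - eta * alpha)` at every step (weight decay). This is only a sketch of that idea; the function name `perceptron_step_l2` and the parameter `alpha` are assumptions, and other formulations exist.

```python
import numpy as np

def perceptron_step_l2(w, b, x, y, eta, alpha):
    """One perceptron update with L2 weight decay; the bias is left unpenalized."""
    y_hat = 1 if np.dot(w, x) + b >= 0 else 0
    error = y - y_hat
    # Shrink the weights first, then apply the usual error-driven correction
    w_new = (1.0 - eta * alpha) * w + eta * error * x
    b_new = b + eta * error
    return w_new, b_new

# Illustrative values: the sample is already classified correctly (net input is 0)
w = np.array([1.0, -1.0])
x = np.array([1.0, 1.0])
w_reg, b_reg = perceptron_step_l2(w, 0.0, x, y=1, eta=0.5, alpha=0.1)
print(w_reg, b_reg)
```

Unlike the plain rule, the weights change even when the sample is classified correctly: here they shrink by 5% while the bias stays at zero.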

---
Notebook generated for educational purposes (Week 05).

In [None]:
# (Optional) Move generated artifacts into the artifacts/ folder
from pathlib import Path
import shutil

art_dir = Path('artifacts')
art_dir.mkdir(exist_ok=True)
for fname in ['perceptron_scratch.json','perceptron_sklearn.joblib','resumen_modelos.csv']:
    f = Path(fname)
    if f.exists():
        target = art_dir / f.name
        shutil.move(str(f), target)
        print(f'Moved {f} -> {target}')
print('Contents of artifacts:', list(art_dir.iterdir()))