<a href="https://colab.research.google.com/github/cam2149/DistributedProcessing/blob/main/4_procesamiento_en_paralelo_con_dask.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Installing the required libraries**

In [None]:
!pip install dask[complete] dask-ml scikit-learn pandas numpy matplotlib kaggle



In [None]:
!du -hs *

55M	sample_data


In [None]:
# Imports
import dask
import dask.array as da
import dask.dataframe as dd
from dask.distributed import Client
from dask_ml.datasets import make_classification
from dask_ml.model_selection import train_test_split
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
import pandas as pd

### **Initializing the Dask client**

In [None]:
# Create a Dask client for parallel processing.
# With no scheduler address, Client starts a LocalCluster automatically.
client = Client(n_workers=2, threads_per_worker=2, memory_limit='2GB')
client
# The dashboard is available at the link shown in the output


Client
Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

LocalCluster
Dashboard: http://127.0.0.1:8787/status
Workers: 2
Total threads: 4
Total memory: 3.73 GiB
Status: running
Using processes: True

Scheduler
Comm: tcp://127.0.0.1:33655
Workers: 0
Dashboard: http://127.0.0.1:8787/status
Total threads: 0
Started: Just now
Total memory: 0 B

Worker: 0
Comm: tcp://127.0.0.1:35795
Total threads: 2
Dashboard: http://127.0.0.1:44671/status
Memory: 1.86 GiB
Nanny: tcp://127.0.0.1:44875
Local directory: /tmp/dask-scratch-space/worker-diyky9jy

Worker: 1
Comm: tcp://127.0.0.1:37705
Total threads: 2
Dashboard: http://127.0.0.1:39675/status
Memory: 1.86 GiB
Nanny: tcp://127.0.0.1:38813
Local directory: /tmp/dask-scratch-space/worker-ia0_bf9z


The Dask client distributes work across multiple workers and provides a dashboard for monitoring progress.
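As a minimal sketch of how work reaches the workers, the snippet below sends a hypothetical `square` function through `client.map` and collects the results with `client.gather`. It assumes an in-process cluster (`processes=False`) with the dashboard disabled so that it starts quickly outside Colab; these settings differ from the client created above.

```python
from dask.distributed import Client


def square(x):
    """Hypothetical task used only for illustration."""
    return x * x


# Assumption: in-process cluster with one worker, two threads, no dashboard.
with Client(processes=False, n_workers=1, threads_per_worker=2,
            dashboard_address=None) as client:
    futures = client.map(square, range(5))  # one task per input value
    results = client.gather(futures)        # block until every task finishes

print(results)
```

Using the client as a context manager closes it, and the local cluster it started, when the block exits.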

### **Generating a large synthetic dataset**

In [None]:
# Create a classification dataset with 1 million samples
n_samples = 1_000_000
n_features = 100
n_classes = 2

X, y = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_classes=n_classes,
    n_informative=50,
    n_redundant=20,
    chunks=n_samples // 10,  # Split the rows into 10 chunks
    flip_y=0.1,
    random_state=42
)

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")
print(f"Type: {type(X)}")

Shape of X: (1000000, 100)
Shape of y: (1000000,)
Type: <class 'dask.array.core.Array'>


This code generates one million samples split into 10 chunks of 100,000 rows each, so Dask can work on individual chunks instead of materializing the whole array at once.
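To make the memory argument concrete, the sketch below builds a lazy array of zeros with the same shape and chunking as `X` and compares the total size with the size of one chunk. The zeros array is a stand-in chosen for illustration; nothing is computed, so no memory is allocated for the data.

```python
import dask.array as da

# Same shape and chunking as the dataset above, filled with zeros.
# This only records a task graph; the data is never materialized here.
x = da.zeros((1_000_000, 100), chunks=(100_000, 100))

rows_per_chunk, cols_per_chunk = x.chunksize
chunk_bytes = rows_per_chunk * cols_per_chunk * x.dtype.itemsize

print(x.numblocks)                           # number of blocks along each axis
print(f"{x.nbytes / 1e6:.0f} MB in total")
print(f"{chunk_bytes / 1e6:.0f} MB per chunk")
```

With float64 values the full array is about 800 MB, while each of the 10 chunks is about 80 MB, which fits comfortably under the 2 GB per-worker limit set earlier.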

### **Train-test split and in-memory persistence**

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

# Persist the data in distributed memory for fast access
X_train, X_test, y_train, y_test = dask.persist(
    X_train, X_test, y_train, y_test
)

print(f"Train samples: {len(y_train)}")
print(f"Test samples: {len(y_test)}")

# Precompute the classes (required for classification)
classes = da.unique(y_train).compute()
print(f"Classes: {classes}")


Train samples: 800000
Test samples: 200000
Classes: [0 1]


Persisting keeps the computed data in distributed memory, so later steps reuse it instead of recomputing the same operations on every iteration.
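The effect of persisting can be sketched with `dask.delayed` and a call counter. The `load` function and the counter are hypothetical, and the example runs on the local synchronous scheduler rather than on the distributed client, but the principle is the same: without `persist`, each `compute()` re-runs the upstream work.

```python
import dask

calls = {"load": 0}


@dask.delayed
def load():
    """Hypothetical expensive step; counts how often it actually runs."""
    calls["load"] += 1
    return [1, 2, 3, 4]


data = load()                      # lazy: nothing has run yet
total = dask.delayed(sum)(data)

# Without persist, every compute() re-runs the whole graph.
total.compute(scheduler="synchronous")
total.compute(scheduler="synchronous")
runs_without_persist = calls["load"]

# persist() runs the graph once and keeps the result in memory.
data_persisted = data.persist(scheduler="synchronous")
total_persisted = dask.delayed(sum)(data_persisted)
total_persisted.compute(scheduler="synchronous")
total_persisted.compute(scheduler="synchronous")
runs_with_persist = calls["load"] - runs_without_persist

print(runs_without_persist, runs_with_persist)
```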

### **Incremental training with SGDClassifier**

In [None]:
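# Create the scikit-learn base estimator.
# Note: partial_fit (used by Incremental) runs a single pass per call,
# so max_iter does not control the number of passes here.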
base_estimator = SGDClassifier(
    loss='log_loss',
    penalty='l2',
    max_iter=1000,
    tol=1e-3,
    random_state=42
)

# Wrap it with Dask-ML's Incremental
incremental_model = Incremental(
    estimator=base_estimator,
    scoring='accuracy'
)

# Train the model (one pass over all the data)
incremental_model.fit(X_train, y_train, classes=classes)

# Evaluate
train_score = incremental_model.score(X_train, y_train)
test_score = incremental_model.score(X_test, y_test)

print(f"Train accuracy: {train_score:.4f}")
print(f"Test accuracy: {test_score:.4f}")

Train accuracy: 0.8789
Test accuracy: 0.8781


The Incremental wrapper automates calling partial_fit on each chunk of the Dask data in turn.

### **Training for multiple epochs**

In [None]:
# Create a fresh model for training with multiple passes
base_estimator_multi = SGDClassifier(
    loss='log_loss',
    penalty='l2',
    max_iter=1000,
    tol=1e-4,
    random_state=42
)

incremental_model_multi = Incremental(
    estimator=base_estimator_multi,
    scoring='accuracy'
)

# Train for 10 epochs (each partial_fit call is one pass over the training data)
scores_history = []

for epoch in range(10):
    incremental_model_multi.partial_fit(X_train, y_train, classes=classes)
    score = incremental_model_multi.score(X_test, y_test)
    scores_history.append(score)
    print(f"Epoch {epoch+1}/10 - Test accuracy: {score:.4f}")

print(f"\nBest accuracy: {max(scores_history):.4f}")


Epoch 1/10 - Test accuracy: 0.8795
Epoch 2/10 - Test accuracy: 0.8871
Epoch 3/10 - Test accuracy: 0.8879
Epoch 4/10 - Test accuracy: 0.8904
Epoch 5/10 - Test accuracy: 0.8903
Epoch 6/10 - Test accuracy: 0.8907
Epoch 7/10 - Test accuracy: 0.8906
Epoch 8/10 - Test accuracy: 0.8906
Epoch 9/10 - Test accuracy: 0.8912
Epoch 10/10 - Test accuracy: 0.8909

Best accuracy: 0.8912


This approach makes multiple passes over the data, which raises test accuracy from 0.8795 after the first epoch to a best of 0.8912, with gains flattening out after epoch 4.
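Since later epochs add little, a simple stopping rule can save passes over the data. The helper below is a hypothetical sketch: `first_plateau_epoch`, `min_delta`, and `patience` are names and thresholds chosen for illustration, applied to the test accuracies printed above.

```python
def first_plateau_epoch(scores, min_delta=1e-3, patience=2):
    """Return the 1-based epoch at which training could stop, or None.

    An epoch counts as an improvement only if it beats the best score
    seen so far by more than `min_delta`.
    """
    best = scores[0]
    stale = 0
    for epoch, score in enumerate(scores[1:], start=2):
        if score - best > min_delta:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Test accuracies copied from the run above.
scores_history = [0.8795, 0.8871, 0.8879, 0.8904, 0.8903,
                  0.8907, 0.8906, 0.8906, 0.8912, 0.8909]

print(first_plateau_epoch(scores_history))  # 6
```

With these thresholds the loop would stop after epoch 6, at 0.8907 test accuracy, compared with the best value of 0.8912 over all 10 epochs.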

### **Predictions and detailed evaluation**

In [None]:
# Make predictions (lazy computation)
y_pred_lazy = incremental_model_multi.predict(X_test)

# Compute the actual predictions
y_pred = y_pred_lazy.compute()
y_test_computed = y_test.compute()

# Detailed metrics
accuracy = accuracy_score(y_test_computed, y_pred)
print(f"\nFinal accuracy: {accuracy:.4f}")

print("\nClassification Report:")
print(classification_report(y_test_computed, y_pred))



Final accuracy: 0.8909

Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.89      0.89    100091
           1       0.89      0.89      0.89     99909

    accuracy                           0.89    200000
   macro avg       0.89      0.89      0.89    200000
weighted avg       0.89      0.89      0.89    200000



### **Parallel GridSearchCV with the Dask backend**

In [None]:
from sklearn.model_selection import GridSearchCV
import joblib

# Create a smaller dataset for the grid search
X_small, y_small = make_classification(
    n_samples=10_000,
    n_features=20,
    n_classes=2,
    chunks=2000,
    random_state=42
)

X_small_computed = X_small.compute()
y_small_computed = y_small.compute()

# Define the hyperparameter grid
param_grid = {
    'alpha': [0.0001, 0.001, 0.01],
    'penalty': ['l2', 'l1', 'elasticnet'],
    'max_iter': [1000, 2000]
}

base_sgd = SGDClassifier(loss='log_loss', random_state=42)

grid_search = GridSearchCV(
    base_sgd,
    param_grid=param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1
)

# Run the grid search with the Dask backend
with joblib.parallel_backend('dask'):
    grid_search.fit(X_small_computed, y_small_computed)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")

# Show results (head() returns the first rows in grid order;
# sort by rank_test_score to list the best configurations first)
results_df = pd.DataFrame(grid_search.cv_results_)
print("\nFirst 5 configurations:")
print(results_df[['params', 'mean_test_score', 'rank_test_score']].head())


Best parameters: {'alpha': 0.01, 'max_iter': 1000, 'penalty': 'l1'}
Best score: 0.6919

First 5 configurations:
                                              params  mean_test_score  \
0  {'alpha': 0.0001, 'max_iter': 1000, 'penalty':...         0.679101   
1  {'alpha': 0.0001, 'max_iter': 1000, 'penalty':...         0.678700   
2  {'alpha': 0.0001, 'max_iter': 1000, 'penalty':...         0.684800   
3  {'alpha': 0.0001, 'max_iter': 2000, 'penalty':...         0.679101   
4  {'alpha': 0.0001, 'max_iter': 2000, 'penalty':...         0.678700   

   rank_test_score  
0               15  
1               17  
2               11  
3               15  
4               17  


The Dask backend for joblib lets GridSearchCV distribute its cross-validation fits across the workers in the cluster.
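To see how much independent work there is to distribute, the sketch below expands the same grid by hand and counts the fits. It uses only the standard library; the variable names `candidates`, `n_candidates`, and `n_fits` are chosen for illustration.

```python
from itertools import product

# Same grid and cv setting as in the cell above.
param_grid = {
    'alpha': [0.0001, 0.001, 0.01],
    'penalty': ['l2', 'l1', 'elasticnet'],
    'max_iter': [1000, 2000],
}
cv = 3

# Every combination of values is one candidate configuration.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv  # one fit per candidate per fold

print(n_candidates, n_fits)  # 18 54
```

Each of these 54 fits is independent of the others, which is why they parallelize well. With the default `refit=True`, GridSearchCV then trains the best configuration once more on the full dataset.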

### **Processing real CSV data with Dask**

In [None]:
# Example with CSV data (uncomment if you have a file)
# df = dd.read_csv('large_dataset.csv')
#
# # Preprocessing with Dask DataFrame
# df = df.dropna()
# df['feature_engineered'] = df['feature1'] * df['feature2']
#
# # Convert to Dask arrays
# X_from_df = df.drop('target', axis=1).to_dask_array(lengths=True)
# y_from_df = df['target'].to_dask_array(lengths=True)
#
# # Continue with the ML pipeline...


In [None]:
# Close the Dask client
client.close()



### **Key advantages of Dask for ML**
- **Scalability**: Handles datasets larger than RAM by splitting them into chunks
- **Parallelization**: Distributes work across multiple cores/workers automatically
- **Integration**: Compatible with scikit-learn and its ecosystem
- **Lazy evaluation**: Computes only when needed, which conserves resources

This complete example ranges from basic setup to more advanced techniques such as incremental training and distributed hyperparameter search.