# Group information

**GROUP 06**


**Members**:

* [exchange] Rita Castro Lobo - ritacastrolobo@tecnico.ulisboa.pt
* [exchange] Isabella Schneider - ischneider.ext@fi.uba.ar


# Problem statement

In this assignment we use a dataset containing user stories from several projects, together
with the number of story points assigned to each one. Story points indicate the complexity of
each task. The goal is to predict the story points of each user story from the text that
describes it.

The train and test datasets are available in the Kaggle competition and must be downloaded
from there. The same page also provides a sample file showing how solutions must be
submitted.

The work consists of building different regression models capable of analyzing a piece of
natural-language text and predicting its story points. This requires preprocessing the text so
that the models can analyze it. The bag-of-words model will be used, or any other model that
converts text into vectors.

The models to build are:
- Naive Bayes
- Random Forest
- XGBoost
- A neural network model built with Keras and TensorFlow.
- An ensemble of at least 3 models chosen by the group.

For each of these models, a hyperparameter search must be carried out to optimize its
performance on the local test set (a held-out portion of the training file).
Once those hyperparameters have been found, a submission is made to Kaggle. That is,
there will be at least 5 submissions (one per model).
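
To make the bag-of-words idea concrete, here is a minimal standard-library sketch, not the vectorizer used later in this notebook. The helper names and the three-story corpus are hypothetical, and tokenization is a plain lowercase whitespace split.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index (sorted for stability)."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def bag_of_words(text, vocabulary):
    """Return the count vector of `text`; tokens outside the vocabulary are ignored."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector

# Hypothetical mini-corpus, not taken from the Kaggle data.
stories = ["fix login bug", "add login page", "fix page layout bug bug"]
vocab = build_vocabulary(stories)
vectors = [bag_of_words(s, vocab) for s in stories]
```

Each story becomes a fixed-length vector of token counts, which is the kind of input a regression model can consume.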

# Installation

Run only if the required packages are not already installed.

In [56]:
!pip install stop_words



In [57]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 28.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


# Libraries

In [58]:
import datetime
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import seaborn as sns
import spacy
import stop_words
import tensorflow as tf

from tensorflow import keras

from keras import Sequential
from keras.callbacks import EarlyStopping, ReduceLROnPlateau
from keras.layers import BatchNormalization, Dense, Dropout, Input, TextVectorization
from keras.metrics import F1Score
from keras.models import load_model
from keras.optimizers import Adadelta, Adam, RMSprop
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_val_predict, cross_val_score, GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier

from google.colab import drive

# Mount Drive

In [59]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Datasets

In [60]:
# Load the Kaggle datasets
conjunto_train = pd.read_csv('/content/drive/MyDrive/Datasets/train.csv')
conjunto_test = pd.read_csv('/content/drive/MyDrive/Datasets/test.csv')
sample_solution = pd.read_csv('/content/drive/MyDrive/Datasets/sample_solution.csv')

# English stop words, used later by the vectorizers
stop_words_en = stop_words.get_stop_words('en')

In [61]:
conjunto_test.head()

Unnamed: 0,id,title,description,project
0,3433,Add Run > Tizen Emulator menu action in App an...,The action will create the launch shortcut for...,project8
1,106,Chrome & IE mis-behavior,"On Wed, Aug 4, 2010 at 12:21 PM, Bryan Beecher...",project2
2,7182,Problems with Publishing routes (on release re...,I have a problem with publishing routes in Nex...,project1
3,8985,Redis sink: better handling of module options/...,Please see the discussion here: https://githu...,project6
4,2149,java0.log generated by the SAM,"I found an issue on the TAC 5.2.1, a java0.log...",project1


In [62]:
conjunto_train.head()

Unnamed: 0,id,title,description,project,storypoint
0,5660,Error enabling Appcelerator services during ap...,"When creating the default app, I encountered t...",project8,3
1,9014,Create a maintenance branch,"As a developer, I'd like to have a maintenance...",project6,5
2,4094,Service Activity Monitoring Backend integrated...,SAM API used by SAM GUI,project1,5
3,811,fs::enter(rootfs) does not work if 'rootfs' is...,I noticed this when I was testing the unified ...,project5,2
4,4459,transform processor with script option is broken,Creating the following stream throws exception...,project6,2


# Train_test split



In [63]:
# Combine title and description into the full text of each user story
conjunto_train['full_text'] = conjunto_train['title'] + ' ' + conjunto_train['description']

# Define X and y
X = conjunto_train['full_text']
y = conjunto_train['storypoint']  # Target variable

# Split into training and local test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Export the new splits
X_train.to_csv('/content/drive/MyDrive/Train_test_split/X_train.csv')
X_test.to_csv('/content/drive/MyDrive/Train_test_split/X_test.csv')
y_train.to_csv('/content/drive/MyDrive/Train_test_split/y_train.csv')
y_test.to_csv('/content/drive/MyDrive/Train_test_split/y_test.csv')
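
The `full_text` column is built by concatenating title and description. In pandas, adding a string to a missing value yields a missing value, so a story without a description would lose its title too. Whether this dataset contains such rows has not been checked here. The sketch below shows one defensive way to join the two fields; `full_text` as a function name is hypothetical.

```python
def full_text(title, description):
    """Join title and description, treating missing parts (None or empty) as absent."""
    parts = [p.strip() for p in (title, description) if p]
    return " ".join(parts)

# Hypothetical rows: the second one has no description.
with_both = full_text("Create a maintenance branch", "As a developer, I'd like to have one")
title_only = full_text("Create a maintenance branch", None)
```

On a DataFrame, a comparable effect could be obtained by filling missing descriptions with an empty string before concatenating.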


# Data preprocessing

In [64]:
# Transform the Kaggle test set
test = conjunto_test.set_index('id')
test.index.name = None
test['full_text'] = test['title'] + ' ' + test['description']
test = test['full_text']
test

Unnamed: 0,full_text
3433,Add Run > Tizen Emulator menu action in App an...
106,"Chrome & IE mis-behavior On Wed, Aug 4, 2010 a..."
7182,Problems with Publishing routes (on release re...
8985,Redis sink: better handling of module options/...
2149,java0.log generated by the SAM I found an issu...
...,...
9069,Test source module in isolation Register the m...
3100,Multiple use of term repository in Service Bui...
6648,"Update Jobs documentation to include ""job laun..."
6076,Create com.appcelerator.titanium.windows.core ...


In [65]:
import spacy
# Lemmatization
nlp = spacy.load('en_core_web_sm')

def preprocess_text(text):
    # Keep the lemma of every purely alphabetic token (drops numbers and punctuation)
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc if token.is_alpha])

# Apply the preprocessing and export the results
conjunto_test_processed = test.apply(preprocess_text)
X_train_processed = X_train.apply(preprocess_text)
X_test_processed = X_test.apply(preprocess_text)
conjunto_test_processed.to_csv('/content/drive/MyDrive/processed/conjunto_test_processed.csv')
X_train_processed.to_csv('/content/drive/MyDrive/processed/X_train_processed.csv')
X_test_processed.to_csv('/content/drive/MyDrive/processed/X_test_processed.csv')
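
The preprocessing keeps only alphabetic tokens. The sketch below illustrates what that filter discards, using a plain whitespace split as a stand-in for spaCy's tokenizer and skipping lemmatization entirely. The helper `keep_alpha_tokens` and the sample sentence are hypothetical, and spaCy may split tokens differently.

```python
def keep_alpha_tokens(text):
    """Keep whitespace-separated tokens made only of letters (no lemmatization)."""
    return " ".join(tok for tok in text.split() if tok.isalpha())

# Hypothetical sentence modelled on the kind of text seen in the dataset.
sample = "java0.log generated by the SAM in 5.2.1"
cleaned = keep_alpha_tokens(sample)
```

File names and version numbers disappear, which shrinks the vocabulary but also removes identifiers that might carry signal.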

# 1. Naive Bayes

## Imports

In [66]:
# Datasets to load
X_train = pd.read_csv('/content/drive/MyDrive/Train_test_split/X_train.csv', index_col=0)['full_text']
X_test = pd.read_csv('/content/drive/MyDrive/Train_test_split/X_test.csv', index_col=0)['full_text']
y_train = pd.read_csv('/content/drive/MyDrive/Train_test_split/y_train.csv', index_col=0)['storypoint']
y_test = pd.read_csv('/content/drive/MyDrive/Train_test_split/y_test.csv', index_col=0)['storypoint']

# Lemmatized versions of the same sets
X_train_processed = pd.read_csv('/content/drive/MyDrive/processed/X_train_processed.csv', index_col=0)['full_text']
X_test_processed = pd.read_csv('/content/drive/MyDrive/processed/X_test_processed.csv', index_col=0)['full_text']
conjunto_test_processed = pd.read_csv('/content/drive/MyDrive/processed/conjunto_test_processed.csv', index_col=0)['full_text']

## Vectorization - Hyperparameter search


We try different vectorizations and keep the best one:
- Vect_1: TF-IDF vectorizer without lemmatization, with tuned hyperparameters
- Vect_2: TF-IDF vectorizer with lemmatization, with tuned hyperparameters
- Vect_3: Count vectorizer without lemmatization, with tuned hyperparameters
- Vect_4: Count vectorizer with lemmatization, with tuned hyperparameters
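
To illustrate how TF-IDF weighting differs from raw counts, here is a minimal standard-library sketch. It uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which matches the defaults documented for scikit-learn's `TfidfVectorizer`; the tokenization (lowercase whitespace split), the function name and the corpus are simplifying assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return (vocabulary, rows) with smoothed-IDF weights and L2-normalised rows."""
    docs = [Counter(t.lower().split()) for t in texts]
    vocab = sorted({tok for d in docs for tok in d})
    n = len(docs)
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n) / (1 + sum(tok in d for d in docs))) + 1 for tok in vocab}
    rows = []
    for d in docs:
        raw = [d[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical mini-corpus.
vocab, rows = tfidf_vectors(["fix login bug", "add login page", "fix page layout"])
```

In the first story, "bug" appears in one document while "login" appears in two, so "bug" receives the larger weight even though both occur once.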

In [67]:

'''
# Show the top 5 configurations
print("\nTop 5 configurations:")
for i in range(5):
    print(best_configs[i])


Top 5 configurations:
('Count', 'Lemmatized', {'min_df': 1, 'ngram_range': (1, 4), 'max_df': 0.95, 'binary': True, 'max_features': None}, 2.884183, True)
('TFIDF', 'Original', {'min_df': 1, 'ngram_range': (1, 3), 'max_df': 0.75, 'binary': False, 'max_features': None}, 2.911783, False)
('TFIDF', 'Lemmatized', {'min_df': 3, 'ngram_range': (1, 3), 'max_df': 0.85, 'binary': True, 'max_features': None}, 2.912462, True)
('TFIDF', 'Lemmatized', {'min_df': 3, 'ngram_range': (1, 3), 'max_df': 0.95, 'binary': True, 'max_features': None}, 2.912462, True)
('TFIDF', 'Original', {'min_df': 2, 'ngram_range': (1, 2), 'max_df': 0.85, 'binary': True, 'max_features': None}, 2.915449, False)
'''



In [None]:
import random
import time
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from tqdm import tqdm  # For progress tracking
import numpy as np

# Set random seeds for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# List that collects every evaluated configuration
best_configs = []

# Vectorizer hyperparameters
min_dfs = [1, 2, 3, 5, 8, 10, 12, 15]
ngram_ranges = [(1, 2), (1, 3), (1, 4)]
max_dfs = [0.75, 0.85, 0.95]
binaries = [True, False]
max_features_options = [None, 1000, 5000, 10000]

# Generate all possible configurations
all_configs = [
    {
        'min_df': min_df,
        'ngram_range': ngram_range,
        'max_df': max_df,
        'binary': binary,
        'max_features': max_features
    }
    for min_df in min_dfs
    for ngram_range in ngram_ranges
    for max_df in max_dfs
    for binary in binaries
    for max_features in max_features_options
]

# Naive Bayes model
nb_model = MultinomialNB()

# Randomly select configurations for random search
n_samples = 110  # Number of random configurations to sample
random_configs = random.sample(all_configs, n_samples)  # Random sampling using fixed seed

# Function that evaluates one configuration
def test_config(config, X_train, vectorizer):
    # Note: the vectorizer is fitted on all of X_train before cross-validation
    vect = vectorizer(**config)
    X_train_vect = vect.fit_transform(X_train)

    # Evaluate the model with cross-validation
    score = cross_val_score(nb_model, X_train_vect, y_train, cv=9, scoring='neg_mean_squared_error')
    return round(np.sqrt(-score.mean()), 6)  # RMSE as the metric

# Try the configurations with both the TF-IDF and the Count vectorizer
vectorizers = [(TfidfVectorizer, "TFIDF"), (CountVectorizer, "Count")]

# Total iterations for progress tracking
total_iterations = len(vectorizers) * len(random_configs) * 2  # 2 datasets: Original and Lemmatized
start_time = time.time()  # Start time

# Use tqdm for progress tracking
with tqdm(total=total_iterations, desc="Random Search Progress") as pbar:
    for vectorizer_class, name in vectorizers:
        for config in random_configs:
            # Evaluate both X_train and X_train_processed
            for dataset_name, dataset, lemmatized in [("Original", X_train, False), ("Lemmatized", X_train_processed, True)]:
                score = test_config(config, dataset, vectorizer_class)
                best_configs.append((name, dataset_name, config, score, lemmatized))

                # Update progress bar
                pbar.update(1)

                # Calculate elapsed and remaining time
                elapsed_time = time.time() - start_time
                iterations_completed = pbar.n
                remaining_iterations = total_iterations - iterations_completed
                estimated_time_remaining = (elapsed_time / iterations_completed) * remaining_iterations if iterations_completed > 0 else 0

                # Update progress bar with estimated time remaining
                pbar.set_postfix(elapsed=f"{elapsed_time:.2f}s", remaining=f"{estimated_time_remaining:.2f}s")

# Sort by lowest score (best RMSE)
best_configs.sort(key=lambda x: x[3])

# Show the top 5 configurations
print("\nTop 5 configurations:")
for i in range(5):
    print(best_configs[i])

'''Top 5 configurations:
n_samples=70, cv=10
('Count', 'Lemmatized', {'min_df': 1, 'ngram_range': (1, 3), 'max_df': 0.85, 'binary': True, 'max_features': None}, 2.897456, True)
('TFIDF', 'Lemmatized', {'min_df': 2, 'ngram_range': (1, 3), 'max_df': 0.85, 'binary': True, 'max_features': None}, 2.907297, True)
('TFIDF', 'Original', {'min_df': 2, 'ngram_range': (1, 3), 'max_df': 0.85, 'binary': True, 'max_features': None}, 2.910886, False)
('TFIDF', 'Original', {'min_df': 1, 'ngram_range': (1, 3), 'max_df': 0.95, 'binary': False, 'max_features': None}, 2.914092, False)
('TFIDF', 'Original', {'min_df': 2, 'ngram_range': (1, 2), 'max_df': 0.85, 'binary': True, 'max_features': None}, 2.915449, False)

n_samples=100, cv=10
('Count', 'Lemmatized', {'min_df': 1, 'ngram_range': (1, 3), 'max_df': 0.85, 'binary': True, 'max_features': None}, 2.897456, True)
('Count', 'Lemmatized', {'min_df': 1, 'ngram_range': (1, 2), 'max_df': 0.95, 'binary': True, 'max_features': None}, 2.898685, True)
('TFIDF', 'Original', {'min_df': 2, 'ngram_range': (1, 3), 'max_df': 0.85, 'binary': False, 'max_features': None}, 2.906589, False)
('TFIDF', 'Lemmatized', {'min_df': 2, 'ngram_range': (1, 3), 'max_df': 0.85, 'binary': True, 'max_features': None}, 2.907297, True)
('TFIDF', 'Lemmatized', {'min_df': 2, 'ngram_range': (1, 3), 'max_df': 0.95, 'binary': True, 'max_features': None}, 2.907297, True)

n_samples=110, cv=9 (about 35 min)
('Count', 'Lemmatized', {'min_df': 1, 'ngram_range': (1, 2), 'max_df': 0.95, 'binary': True, 'max_features': None}, 2.876408, True)
('Count', 'Lemmatized', {'min_df': 1, 'ngram_range': (1, 3), 'max_df': 0.85, 'binary': True, 'max_features': None}, 2.89051, True)
('TFIDF', 'Original', {'min_df': 2, 'ngram_range': (1, 3), 'max_df': 0.85, 'binary': False, 'max_features': None}, 2.901705, False)
('TFIDF', 'Original', {'min_df': 3, 'ngram_range': (1, 4), 'max_df': 0.75, 'binary': False, 'max_features': None}, 2.905119, False)
('TFIDF', 'Lemmatized', {'min_df': 2, 'ngram_range': (1, 3), 'max_df': 0.85, 'binary': True, 'max_features': None}, 2.905359, True)
'''


Random Search Progress:  92%|█████████▏| 404/440 [31:09<02:19,  3.87s/it, elapsed=1869.99s, remaining=166.63s]
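
One detail worth keeping in mind when comparing scores across cells: the search above takes the square root of the mean fold MSE, whereas the `neg_root_mean_squared_error` scorer used later averages the per-fold RMSEs. The two aggregates are close but not identical. A minimal sketch with hypothetical fold values:

```python
import math

def rmse_from_mean_mse(fold_mses):
    """Square root of the mean fold MSE (what sqrt(-scores.mean()) computes)."""
    return math.sqrt(sum(fold_mses) / len(fold_mses))

def mean_of_fold_rmses(fold_mses):
    """Mean of the per-fold RMSEs (what averaging an RMSE scorer gives)."""
    return sum(math.sqrt(m) for m in fold_mses) / len(fold_mses)

fold_mses = [4.0, 9.0, 16.0]              # hypothetical per-fold MSE values
pooled = rmse_from_mean_mse(fold_mses)    # sqrt(29 / 3)
averaged = mean_of_fold_rmses(fold_mses)  # (2 + 3 + 4) / 3
```

The pooled value is never smaller than the averaged one, so RMSE figures produced by the two conventions should not be compared digit for digit.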

In [None]:
# Finer search for max_df
def refine_max_df(best_config, vectorizer_class, dataset, y_train):
    # Extract the best max_df value and generate a finer range around it
    best_max_df = best_config['max_df']
    max_df_range = [round(x, 3) for x in np.arange(best_max_df - 0.15, best_max_df + 0.16, 0.01)]
    # Clamp to (0, 1] and drop the duplicates that clamping creates
    max_df_range = sorted({min(value, 1.0) for value in max_df_range if value > 0.0})

    # Prepare updated configuration without max_df
    refined_config = {key: value for key, value in best_config.items() if key != 'max_df'}

    # Initialize variables to track the best result
    best_rmse = float('inf')
    best_max_df_value = None

    # Iterate over the finer range of max_df values
    for max_df in max_df_range:
        refined_config['max_df'] = max_df
        vect = vectorizer_class(**refined_config)
        X_train_vect = vect.fit_transform(dataset)

        # Cross-validation
        score = cross_val_score(nb_model, X_train_vect, y_train, cv=9, scoring='neg_mean_squared_error')
        rmse = round(np.sqrt(-score.mean()), 6)

        # Track the best result
        if rmse < best_rmse:
            best_rmse = rmse
            best_max_df_value = max_df

    return best_max_df_value, best_rmse

# Extract the best configuration from the random search
best_overall_config = best_configs[0]
best_vectorizer = best_overall_config[0]  # Vectorizer type ('TFIDF' or 'Count')
best_dataset = best_overall_config[1]  # Dataset name ('Original' or 'Lemmatized')
best_params = best_overall_config[2]  # Hyperparameters
best_dataset_data = X_train if best_dataset == "Original" else X_train_processed

# Perform finer search for max_df
vectorizer_class = TfidfVectorizer if best_vectorizer == "TFIDF" else CountVectorizer
best_max_df, best_rmse = refine_max_df(best_params, vectorizer_class, best_dataset_data, y_train)

# Print the refined max_df and updated RMSE
print("\nRefined max_df:")
print(f"Best max_df: {best_max_df}")
print(f"Best RMSE after refinement: {best_rmse}")
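
The refinement step scans a finer grid of `max_df` values around the best coarse value. Below is a minimal standard-library sketch of building such a grid with clamping and without repeated candidates; `refinement_range` and its parameters are hypothetical names, not part of the notebook.

```python
def refinement_range(center, half_width=0.05, step=0.01, low=0.0, high=1.0):
    """Evenly spaced candidates around `center`, clamped to [low, high], no repeats."""
    n_steps = round(half_width / step)
    candidates = []
    for k in range(-n_steps, n_steps + 1):
        value = round(min(high, max(low, center + k * step)), 3)
        if value not in candidates:
            candidates.append(value)
    return candidates

# Candidates around a coarse optimum of 0.95 with a half-width of 0.15.
max_df_candidates = refinement_range(0.95, half_width=0.15)
```

Skipping repeated values avoids re-running the same cross-validation several times once the grid hits the upper bound of 1.0.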


In [None]:
# Show the top 5 configurations before adding the refined one
print("\nTop 5 configurations (before refinement):")
for i, config in enumerate(best_configs[:5]):
    print(f"{i + 1}. Vectorizer: {config[0]}, Dataset: {config[1]}, Params: {config[2]}, RMSE: {config[3]:.6f}, Lemmatization: {config[4]}")

# Build the refined best configuration
refined_best_config = {
    'vectorizer': best_vectorizer,
    'dataset': best_dataset,
    'params': {**best_params, 'max_df': best_max_df},  # Update max_df with the refined value
    'rmse': best_rmse,
    'refined': True  # Mark as a refined configuration
}

# Add the refined configuration to the list of best configs
best_configs.append((
    refined_best_config['vectorizer'],
    refined_best_config['dataset'],
    refined_best_config['params'],
    refined_best_config['rmse'],
    'Refined'
))

# Sort the configurations again to include the refined configuration
best_configs.sort(key=lambda x: x[3])

# Display the top 5 configurations, including the refined one
print("\nTop 5 configurations (including refinement):")
for i, config in enumerate(best_configs[:5]):
    print(f"{i + 1}. Vectorizer: {config[0]}, Dataset: {config[1]}, "
          f"Params: {config[2]}, RMSE: {config[3]:.6f}, Note: {config[4]}")


In [None]:
best_score = best_configs[0][3]
best_config = best_configs[0][2]
# The fifth field may hold the string 'Refined', so derive the flag from the dataset name
bool_lemma = best_configs[0][1] == "Lemmatized"
type_vect = best_configs[0][0]

print(f"Score: {best_score} - Configuration: {best_config} - Lemmatization: {bool_lemma} - Type: {type_vect}")


#Score: 2.884183 - Configuration: {'min_df': 1, 'ngram_range': (1, 4), 'max_df': 0.95, 'binary': True, 'max_features': None} - Lemmatization: True - Type: Count


In [None]:
'''
# Extract the best configuration
best_score = best_configs[0]['best_score']
best_config = best_configs[0]['best_params']
bool_lemma = best_configs[0]['dataset'] == "Lemmatized"  # Check if the best dataset is 'Lemmatized'
type_vect = best_configs[0]['vectorizer']  # Best vectorizer type

# Print the results
print(f"Score: {best_score:.6f} - Configuracion: {best_config} - Lematizacion: {bool_lemma} - Tipo: {type_vect}")
'''

In [None]:

if bool_lemma :
    X_train = X_train_processed
    X_test = X_test_processed
if type_vect == "TFIDF":
    vect = TfidfVectorizer(stop_words=stop_words_en, **best_config)
elif type_vect == "Count":
    vect = CountVectorizer(stop_words=stop_words_en, **best_config)

X_train_vect = vect.fit_transform(X_train)
X_test_vect = vect.transform(X_test)

# Export transformed data
from scipy.sparse import save_npz, load_npz

# Export sparse matrices
save_npz('/content/drive/MyDrive/NB/X_train_vect.npz', X_train_vect)
save_npz('/content/drive/MyDrive/NB/X_test_vect.npz', X_test_vect)

print("X_train_vect shape:", X_train_vect.shape)
print("y_train shape:", y_train.shape)

## Model - Hyperparameter search

In [None]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.linear_model import Ridge  # Regression model
import numpy as np

# Hyperparameter values to try
param_distributions = {
    'alpha': [0.1, 20, 40, 55, 65, 60]
}

# Regression model
ridge_model = Ridge()

# Randomized search scored with the (negative) RMSE
random_search = RandomizedSearchCV(estimator=ridge_model,
                                   param_distributions=param_distributions,
                                   cv=5,
                                   n_iter=len(param_distributions['alpha']),  # Only 6 candidates exist
                                   scoring='neg_root_mean_squared_error',
                                   random_state=42)
random_search.fit(X_train_vect, y_train)

# Best parameters found and the best score
best_params = random_search.best_params_
best_score = random_search.best_score_

# The score is the negated RMSE, so flip the sign
best_rmse = -best_score
print(f'Best parameters: {best_params}')
print(f'Best RMSE: {best_rmse}')

# Refine the chosen parameter around its best value
best_alpha = random_search.best_params_['alpha']
param_grid = {
    # Drop candidates that would fall to zero or below
    'alpha': [a for a in (best_alpha - 1.6, best_alpha - 1.5, best_alpha, best_alpha + 0.5, best_alpha + 5) if a > 0]
}

grid_search = GridSearchCV(estimator=ridge_model, param_grid=param_grid, cv=5, scoring='neg_root_mean_squared_error')
grid_search.fit(X_train_vect, y_train)

# Best refined parameters and the best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Convert the refined score to a positive RMSE
best_rmse_fined = -best_score
print(f'Best parameters (refined): {best_params}')
print(f'Best RMSE (refined): {best_rmse_fined}')

'''
Best parameters: {'alpha': 5.0}
Best RMSE: 2.701914578249913
Best parameters (refined): {'alpha': 5.05}
Best RMSE (refined): 2.701874559903312
'''
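
To build intuition for what `alpha` controls in the search above, here is a minimal sketch of ridge regression in the simplest possible setting: one feature and no intercept, where the slope has the closed form `sum(x*y) / (sum(x^2) + alpha)`. The data and the helper `ridge_slope` are hypothetical and only illustrate the shrinkage effect.

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form ridge slope for one feature and no intercept."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

# Hypothetical data generated with a true slope of 2.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
slopes = {alpha: ridge_slope(xs, ys, alpha) for alpha in (0.0, 14.0, 56.0)}
```

With `alpha = 0` the slope equals the least-squares value of 2; larger values pull it toward zero, which is why the search has to balance underfitting against overfitting.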

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Ridge
import numpy as np

# Use the best model found during the hyperparameter search
best_ridge_model = grid_search.best_estimator_

# Train the model on the training data
best_ridge_model.fit(X_train_vect, y_train)

# Predict on the local test set
y_pred = best_ridge_model.predict(X_test_vect)

# Compute the regression metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # Root mean squared error (RMSE)
r2 = r2_score(y_test, y_pred)  # R² score

# Print the results
print(f'RMSE: {rmse}')
print(f'R²: {r2}')

'''
RMSE: 2.4028558111634597
R²: 0.30083648590265677
'''
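
For reference, the two metrics reported above can be written out by hand. The sketch below is a standard-library illustration with hypothetical story-point values, not a replacement for the scikit-learn functions used in the cell.

```python
import math

def rmse(y_true, y_pred):
    """Root of the mean squared difference between targets and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot: share of target variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical story points and predictions.
y_true = [1, 2, 3, 5, 8]
y_pred = [2, 2, 3, 4, 6]
example_rmse = rmse(y_true, y_pred)
example_r2 = r2(y_true, y_pred)
```

An R² of 0 corresponds to always predicting the mean of the targets, so a value around 0.3 means the model explains roughly 30% of the variance on that set.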

## Prediction


In [None]:
if bool_lemma:
    # The lemmatized Kaggle test set is already a Series of processed text
    conjunto_test = conjunto_test_processed.copy()
else:
    # Without lemmatization, use the raw 'full_text' Series built in the preprocessing step
    conjunto_test = test.copy()

# Vectorize with the vectorizer fitted on the training data
X_conjunto_test = vect.transform(conjunto_test)

# Predict with the already trained model
pred_test = best_ridge_model.predict(X_conjunto_test)

# conjunto_test is a Series whose index holds the story 'id'
final_pred_df = pd.DataFrame({
    'id': conjunto_test.index,
    'storypoint': pred_test
})

# Show the final DataFrame with 'id' and 'storypoint'
final_pred_df


## Export

In [None]:
current_date = datetime.datetime.now().strftime('%Y-%m-%d')

final_pred_df.to_csv(f"/content/drive/MyDrive/Predicciones/Bayes_Naives_{current_date}.csv", index=False)

joblib.dump(best_ridge_model, f'/content/drive/MyDrive/Modelos/bn_model_{current_date}.joblib')

# 2. Random Forest

## Vectorization - Hyperparameter search

In [None]:
# 1. Vectorization - hyperparameter search
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from tqdm import tqdm  # For progress tracking
import numpy as np
import random
import time

# Set random seed
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Initialize RF model
rf_model = RandomForestRegressor(random_state=SEED)

# Hyperparameters for vectorization
min_dfs = [1, 2, 3, 5]
ngram_ranges = [(1, 2), (1, 3)]
max_dfs = [0.75, 0.85]
binaries = [True, False]
max_features_options = [1000, 5000]

# Generate all configurations
all_configs = [
    {
        'min_df': min_df,
        'ngram_range': ngram_range,
        'max_df': max_df,
        'binary': binary,
        'max_features': max_features
    }
    for min_df in min_dfs
    for ngram_range in ngram_ranges
    for max_df in max_dfs
    for binary in binaries
    for max_features in max_features_options
]

# Randomly sample configurations
n_samples = min(110, len(all_configs))  # Number of configurations to try
random_configs = random.sample(all_configs, n_samples)

# Function to evaluate a vectorizer configuration
def test_config(config, X_train_sample, vectorizer):
    vect = vectorizer(**config)
    X_train_vect = vect.fit_transform(X_train_sample)

    # Reduce dimensionality
    svd = TruncatedSVD(n_components=100, random_state=SEED)  # Reduce to 100 components
    X_train_vect_reduced = svd.fit_transform(X_train_vect)

    # Use the targets of the sampled rows (same index), not an independent random sample
    y_train_sample = y_train.loc[X_train_sample.index]

    # Cross-validation with RF
    score = cross_val_score(rf_model, X_train_vect_reduced, y_train_sample, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
    return round(np.sqrt(-score.mean()), 6)  # RMSE as metric

# Vectorizers to test
vectorizers = [(TfidfVectorizer, "TFIDF")]

# Track progress
best_configs = []
with tqdm(total=len(vectorizers) * len(random_configs), desc="Random Search Progress") as pbar:
    for vectorizer_class, name in vectorizers:
        for config in random_configs:
            # Fixed seed so every configuration is scored on the same 20% subsample
            score = test_config(config, X_train.sample(frac=0.2, random_state=SEED), vectorizer_class)
            best_configs.append((name, config, score))
            pbar.update(1)

# Sort configurations by RMSE (lowest is best)
best_configs.sort(key=lambda x: x[2])

# Show top 5 configurations
print("\nTop 5 configurations:")
for i in range(min(5, len(best_configs))):
    print(f"{i+1}. Vectorizer: {best_configs[i][0]}, Config: {best_configs[i][1]}, RMSE: {best_configs[i][2]}")

# Best configuration
best_config = best_configs[0][1]
best_rmse = best_configs[0][2]
vectorizer = TfidfVectorizer(**best_config)

# Add RMSE to the best configuration
best_config_with_rmse = {**best_config, 'rmse': best_rmse}

print(f"Best configuration with RMSE: {best_config_with_rmse}")

# Vectorize the complete set
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)

# Save to CSV
file_path = "/content/drive/MyDrive/Vect_RF.csv"
best_configs_df = pd.DataFrame(
    best_configs,
    columns=['vectorizer', 'config', 'rmse']
)
best_configs_df.to_csv(file_path, index=False)
print(f"Best configurations saved to {file_path}")
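
The random search above evaluates each configuration on a sample of the training texts, so the targets passed to cross-validation have to belong to the same rows. A minimal sketch of index-aligned sampling, using toy data (`X_demo` and `y_demo` are hypothetical stand-ins for `X_train` and `y_train`):

```python
import pandas as pd

# Toy stand-ins for X_train / y_train (hypothetical values, shared index)
X_demo = pd.Series(["a", "b", "c", "d", "e"], index=[10, 11, 12, 13, 14])
y_demo = pd.Series([1, 2, 3, 5, 8], index=[10, 11, 12, 13, 14])

# Sample the texts, then look up the targets by the sampled index
X_sample = X_demo.sample(frac=0.6, random_state=42)
y_sample = y_demo.loc[X_sample.index]  # same rows, same order as X_sample

print(pd.DataFrame({"text": X_sample, "storypoint": y_sample}))
```

Sampling `y` independently would pair each text with the target of a different row, which turns the cross-validation score into noise.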

In [None]:
#2. Finer search for max_df
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import numpy as np
from tqdm import tqdm

# Load all configurations
best_configs_df = pd.read_csv("/content/drive/MyDrive/Vect_RF.csv")

# Convert the 'config' column back to dictionaries (literal_eval only parses Python literals)
import ast
best_configs_df['config'] = best_configs_df['config'].apply(ast.literal_eval)

# Sort by RMSE and extract the best configuration
best_configs_df = best_configs_df.sort_values(by='rmse').reset_index(drop=True)
best_config = best_configs_df.iloc[0]['config']

print("Best configuration extracted:")
print(best_config)

# Initialize vectorizer with the best configuration
vectorizer = TfidfVectorizer(**best_config)

def refine_max_df(best_config, dataset, y_train, cv=2, refine_range=(-0.05, 0.05), step=0.01, sample_frac=0.2):
    """
    Refine the max_df parameter in the vectorizer configuration for improved RMSE.

    """
    best_max_df = best_config['max_df']
    max_df_range = np.clip(
        np.round(np.arange(best_max_df + refine_range[0], best_max_df + refine_range[1] + step, step), 3),
        0.0,
        1.0
    )  # Generate a refined range and clamp values to [0.0, 1.0]

    # Prepare updated configuration without max_df
    refined_config = {key: value for key, value in best_config.items() if key != 'max_df'}

    # Initialize variables to track the best result
    best_rmse = float('inf')
    best_max_df_value = None

    # Sample the dataset
    dataset_sample = dataset.sample(frac=sample_frac, random_state=SEED)
    y_sample = y_train.loc[dataset_sample.index]

    # Use a progress bar to track iterations
    with tqdm(total=len(max_df_range), desc="Refining max_df") as pbar:
        for max_df in max_df_range:
            refined_config['max_df'] = max_df
            vect = TfidfVectorizer(**refined_config)
            X_train_vect = vect.fit_transform(dataset_sample)

            # Perform cross-validation
            score = cross_val_score(
                rf_model,
                X_train_vect,
                y_sample,
                cv=cv,
                scoring='neg_mean_squared_error',
                n_jobs=-1
            )
            rmse = round(np.sqrt(-score.mean()), 6)

            # Update the best parameters if a better RMSE is found
            if rmse < best_rmse:
                best_rmse = rmse
                best_max_df_value = max_df

            pbar.update(1)

    return best_max_df_value, best_rmse

# Debugging: Print dataset shapes
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)

# Debugging: Print the number of configurations evaluated
print(f"Number of configurations evaluated: {len(best_configs)}")
print(f"Best configuration: {best_configs[0][1]}")

# Extract the best configuration from the random search
best_overall_config = best_configs[0]  # (name, config, rmse) tuple
best_params = best_overall_config[1]  # Retrieve hyperparameters (index 2 holds the RMSE)

# Refine max_df
best_max_df, best_rmse = refine_max_df(best_config, X_train, y_train)  # Refine max_df

# Print the refined max_df and updated RMSE
print("\nRefined max_df:")
print(f"Best max_df: {best_max_df}")
print(f"Best RMSE after refinement: {best_rmse}")
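
The refinement step builds a fine grid of `max_df` values around the best coarse value and clamps it to the valid range. A small sketch of that grid construction follows; the helper name `max_df_grid` and the lower bound equal to `step` are assumptions made for illustration, not part of the notebook:

```python
import numpy as np

def max_df_grid(center, half_width=0.05, step=0.01):
    """Return sorted, unique max_df candidates around `center`, clamped to (0, 1]."""
    raw = np.arange(center - half_width, center + half_width + step, step)
    # Round away floating-point noise, then clamp; the lower bound `step` is an
    # assumption that keeps max_df strictly positive
    clipped = np.clip(np.round(raw, 3), step, 1.0)
    # Clamping can create repeated 1.0 entries, so drop duplicates
    return np.unique(clipped)

grid = max_df_grid(0.95)
print(grid)
```

Removing duplicates matters near the upper bound: without it, every candidate above 1.0 collapses to 1.0 and the same configuration is cross-validated several times.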

In [None]:
# Update and add the refined best configuration to the best_configs list
refined_best_config = {
    'vectorizer': 'TFIDF',
    'dataset': 'original',
    'params': {**best_config, 'max_df': best_max_df},  # Update max_df with the refined value
    'rmse': best_rmse,
    'refined': True  # Mark as a refined configuration
}

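# Bring the random-search entries (name, config, rmse) to the 5-field layout used below,
# then append the refined configuration
best_configs = [
    (name, 'original', config, score, 'Random search')
    for name, config, score in best_configs
]
best_configs.append((
    refined_best_config['vectorizer'],
    refined_best_config['dataset'],
    refined_best_config['params'],
    refined_best_config['rmse'],
    'Refined'
))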

# Convert best_configs to a DataFrame for easier manipulation
df_configs = pd.DataFrame(
    best_configs,
    columns=['vectorizer', 'dataset', 'params', 'rmse', 'note', ]
)

# Check if the 'params' column contains valid dictionaries
df_configs['max_df'] = df_configs['params'].apply(
    lambda x: x.get('max_df') if isinstance(x, dict) else None
)

# Remove rows where 'max_df' is None
df_configs = df_configs.dropna(subset=['max_df'])

# Sort configurations by RMSE
df_configs = df_configs.sort_values(by='rmse').reset_index(drop=True)

while len(df_configs) < 5:
    df_configs = pd.concat([df_configs, pd.DataFrame([{'vectorizer': 'None', 'params': {}, 'rmse': float('inf'), 'max_df': 'No max_df'}])], ignore_index=True)

# Display the top configurations (ensuring at least 5 if available)
top_n = min(len(df_configs), 5)
print(f"\nTop {top_n} configurations (including refinement):")
for i, row in df_configs.head(top_n).iterrows():
    print(f"{i + 1}. Vectorizer: {row['vectorizer']}, Dataset: {row['dataset']}, "
          f"Params: {row['params']}, RMSE: {row['rmse']:.6f}, Note: {row['note']}")

# Update the best_configs list with the optimized results
best_configs = df_configs.values.tolist()

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
stop_words = ENGLISH_STOP_WORDS

# Retrieve the best configuration from the sorted list
best_score = best_configs[0][3]
best_config = best_configs[0][2]

# Display the best configuration details
print("\nBest Configuration Details:")
print(f"- Best RMSE: {best_score:.6f}")
print(f"- Configuration: {best_config}")

# Apply the best configuration for vectorization
vect = TfidfVectorizer(stop_words='english', **best_config)

# Vectorize the train and test datasets
X_train_vect = vect.fit_transform(X_train)
X_test_vect = vect.transform(X_test)

## Model

In [None]:
#3. Model - Hyperparameter search
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
from tqdm import tqdm  # Progress tracking

# Best TF-IDF configuration found earlier
best_tfidf_config = best_configs[0][2]

# Vectorize the text data
vectorizer = TfidfVectorizer(stop_words='english', **best_tfidf_config)
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)

# Reduce dimensionality
svd = TruncatedSVD(n_components=100, random_state=42)
X_train_reduced = svd.fit_transform(X_train_vect)
X_test_reduced = svd.transform(X_test_vect)

# Initialize Random Forest
rf_model = RandomForestRegressor(random_state=42)

# Define Random Forest hyperparameter distributions
rf_param_distributions = {
    'n_estimators': randint(50, 200),
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 5),
    'max_leaf_nodes': randint(10, 50),  # Limit number of leafs
    'bootstrap': [True]
}

# RandomizedSearchCV for Random Forest
random_search_rf = RandomizedSearchCV(
    estimator=rf_model,
    param_distributions=rf_param_distributions,
    n_iter=5,  # Number of random combinations to try
    cv=3,       # 3-fold cross-validation
    scoring='neg_mean_squared_error',
    verbose=0,  # Suppress individual output
    n_jobs=-1,  # Use all available cores
    random_state=42
)

# Progress tracking
with tqdm(total=random_search_rf.n_iter, desc="Randomized Search Progress") as pbar:
    random_search_rf.fit(X_train_reduced, y_train)
    pbar.update(random_search_rf.n_iter)

# Best parameters and score
best_rf_params = random_search_rf.best_params_
best_rf_rmse = np.sqrt(-random_search_rf.best_score_)

print("\nBest RF Parameters:", best_rf_params)
print("Best RMSE (CV):", best_rf_rmse)

# Use the best model found during RandomizedSearchCV
best_rf_model = random_search_rf.best_estimator_

# Train the model on the training data
best_rf_model.fit(X_train_reduced, y_train)

# Predict on the test set
y_pred = best_rf_model.predict(X_test_reduced)

# Calculate regression metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # Root Mean Squared Error
r2 = r2_score(y_test, y_pred)  # R² score

# Print results
print("\nPerformance on Test Set:")
print(f'RMSE: {rmse}')
print(f'R²: {r2}')
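
The model above is trained on TF-IDF features reduced with SVD, so any new text has to go through the same fitted vectorizer and the same fitted SVD before prediction. One way to keep those steps together is a scikit-learn `Pipeline`. The sketch below uses toy texts and small hypothetical settings (`n_components=2`, `n_estimators=10`) purely for illustration; it is not the configuration used in this notebook:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy user stories and story points (hypothetical data)
train_texts = [
    "fix login bug on mobile",
    "add export to csv button",
    "refactor payment service api",
    "update user profile page layout",
    "migrate database to new cluster",
    "write unit tests for login",
]
train_points = [1, 2, 5, 3, 8, 2]
new_texts = ["fix export bug", "migrate payment api"]

# Each step is fitted on the training texts only and reused as-is at predict time
pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
    RandomForestRegressor(n_estimators=10, random_state=42),
)
pipe.fit(train_texts, train_points)
preds = pipe.predict(new_texts)
print(preds)
```

With a pipeline, a call to `predict` cannot accidentally skip the SVD step or use a vectorizer fitted with different settings.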

## Predictions

In [None]:
conjunto_test = conjunto_test_processed
# Apply the same TF-IDF and SVD transforms the model was trained on
X_conjunto_test = svd.transform(vectorizer.transform(conjunto_test))

# Predict on the final test set
pred_test = best_rf_model.predict(X_conjunto_test)

# Create a DataFrame for predictions
final_pred_df = pd.DataFrame({
    'id': conjunto_test.index,
    'storypoint': pred_test
})

# Display the final DataFrame
final_pred_df

## Export

In [None]:
#5. Export
import datetime
import joblib

current_date = datetime.datetime.now().strftime('%Y-%m-%d')

final_pred_df.to_csv(f"/content/drive/MyDrive/Predicciones/Random_Forest_{current_date}.csv", index=False)

joblib.dump(best_rf_model, f'/content/drive/MyDrive/Modelos/rf_model_{current_date}.joblib')

In [None]:
import joblib

# Define the model file path
model_path = '/content/drive/MyDrive/Modelos/bn_model_2024-12-02.joblib'
# Load the model
rf_model = joblib.load(model_path)

print("Model loaded successfully.")

In [None]:
# Train the model on the training data
rf_model.fit(X_train_reduced, y_train)


# Predict on the test set
y_pred = rf_model.predict(X_test_reduced)

# Export predictions to a CSV file
import pandas as pd

# Convert predictions to a DataFrame
y_pred_rf = pd.DataFrame(y_pred, columns=["Predictions"])
y_pred_rf.to_csv('/content/drive/MyDrive/Ensemble/RF_y_pred.csv')

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Test MSE: {mse}")
print(f"Test RMSE: {rmse}")
print(f"Test R²: {r2}")

# Write the metrics to the file
with open('/content/drive/MyDrive/Metricas/RF', "w") as file:
    file.write(f"Test MSE: {mse}\n")
    file.write(f"Test RMSE: {rmse}\n")
    file.write(f"Test R²: {r2}\n")

# 3. XGBoost

## Imports

In [None]:
# Datasets to load
X_train = pd.read_csv('Datasets/X_train.csv', index_col=0)['full_text']
X_test = pd.read_csv('Datasets/X_test.csv', index_col=0)['full_text']
y_train = pd.read_csv('Datasets/y_train.csv', index_col=0)['storypoint']
y_test = pd.read_csv('Datasets/y_test.csv', index_col=0)['storypoint']

## Vectorization - Hyperparameter search

In [None]:
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from tqdm import tqdm  # For progress tracking
import numpy as np
import random
import time

# Set random seed
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Initialize XGBoost model
xgb_model = XGBRegressor(random_state=SEED, n_jobs=-1, verbosity=0)

# Vectorization hyperparameters
min_dfs = [1, 2, 3, 5, 8, 10, 12, 15]
ngram_ranges = [(1, 2), (1, 3), (1, 4)]
max_dfs = [0.75, 0.85, 0.95]
binaries = [True, False]
max_features_options = [None, 1000, 5000, 10000]

# Generate all configurations
all_configs = [
    {
        'min_df': min_df,
        'ngram_range': ngram_range,
        'max_df': max_df,
        'binary': binary,
        'max_features': max_features
    }
    for min_df in min_dfs
    for ngram_range in ngram_ranges
    for max_df in max_dfs
    for binary in binaries
    for max_features in max_features_options
]

# Randomly sample configurations
n_samples = min(110, len(all_configs))  # Number of configurations to try
random_configs = random.sample(all_configs, n_samples)

# Function to evaluate a vectorizer configuration
def test_config(config, X_train_sample, vectorizer):
    vect = vectorizer(**config)
    X_train_vect = vect.fit_transform(X_train_sample)

    # Reduce Dimensionality
    svd = TruncatedSVD(n_components=100, random_state=SEED)  # Reduce to 100 components
    X_train_vect_reduced = svd.fit_transform(X_train_vect)

    # Cross-validation with XGBoost; take the targets of the sampled rows so X and y stay aligned
    y_train_sample = y_train.loc[X_train_sample.index]
    score = cross_val_score(xgb_model, X_train_vect_reduced, y_train_sample, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
    return round(np.sqrt(-score.mean()), 6)  # RMSE as metric

# Vectorizers to test
vectorizers = [(TfidfVectorizer, "TFIDF")]

# Track progress
best_configs = []
with tqdm(total=len(vectorizers) * len(random_configs), desc="Random Search Progress") as pbar:
    for vectorizer_class, name in vectorizers:
        for config in random_configs:
            score = test_config(config, X_train.sample(frac=0.4), vectorizer_class)
            best_configs.append((name, config, score))
            pbar.update(1)

# Sort configurations by RMSE (lowest is best)
best_configs.sort(key=lambda x: x[2])

# Show top 5 configurations
print("\nTop 5 configurations:")
for i in range(min(5, len(best_configs))):
    print(f"{i+1}. Vectorizer: {best_configs[i][0]}, Config: {best_configs[i][1]}, RMSE: {best_configs[i][2]}")

# Best configuration
best_config = best_configs[0][1]
best_rmse = best_configs[0][2]
vectorizer = TfidfVectorizer(**best_config)

# Add RMSE to the best configuration
best_config_with_rmse = {**best_config, 'rmse': best_rmse}

print(f"Best configuration with RMSE: {best_config_with_rmse}")

# Vectorize the complete set
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)

# Reduce dimensionality: fit the SVD on the training matrix only and reuse it for the test matrix
svd = TruncatedSVD(n_components=100, random_state=SEED)
X_train_reduced = svd.fit_transform(X_train_vect)
X_test_reduced = svd.transform(X_test_vect)

# Train XGBoost with the best configuration
xgb_model.fit(X_train_reduced, y_train)
y_pred = xgb_model.predict(X_test_reduced)

# Save arrays as .npy files
np.save('/content/drive/MyDrive/XGB/X_train_reduced.npy', X_train_reduced)
np.save('/content/drive/MyDrive/XGB/X_test_reduced.npy', X_test_reduced)
np.save('/content/drive/MyDrive/XGB/X_train_vect.npy', X_train_vect.toarray())
np.save('/content/drive/MyDrive/XGB/X_test_vect.npy', X_test_vect.toarray())

# Load later
X_train_reduced = np.load('/content/drive/MyDrive/XGB/X_train_reduced.npy')



In [None]:
import time

def refine_max_df(best_config, vectorizer_class, dataset, y_train):
    # Start the timer for the entire function
    total_start_time = time.time()

    # Extract the best max_df value and generate a finer range around it
    best_max_df = best_config['max_df']
    max_df_range = [round(x, 3) for x in np.arange(best_max_df, best_max_df + 0.35, 0.01)]
    # Clamp to the valid range and drop the duplicates that clamping creates at 1.0
    max_df_range = sorted({max(0.0, min(value, 1.0)) for value in max_df_range})

    # Prepare updated configuration without max_df
    refined_config = {key: value for key, value in best_config.items() if key != 'max_df'}

    # Initialize variables to track the best result
    best_rmse = float('inf')
    best_max_df_value = None

    # Iterate over the finer range of max_df values
    for i, max_df in enumerate(max_df_range):
        iteration_start_time = time.time()  # Start time for this iteration
        refined_config['max_df'] = max_df

        # Vectorize with the current configuration
        vect = vectorizer_class(**refined_config)
        X_train_vect = vect.fit_transform(dataset)

        # Reduce dimensions with SVD
        svd = TruncatedSVD(n_components=100, random_state=SEED)
        X_train_reduced = svd.fit_transform(X_train_vect)

        # Cross-validation with XGBoost
        score = cross_val_score(xgb_model, X_train_reduced, y_train, cv=3, scoring='neg_mean_squared_error')
        rmse = round(np.sqrt(-score.mean()), 6)

        # Track the best result
        if rmse < best_rmse:
            best_rmse = rmse
            best_max_df_value = max_df

        # Calculate and log the iteration time
        iteration_time = time.time() - iteration_start_time
        print(f"Iteration {i + 1}/{len(max_df_range)}: max_df={max_df}, RMSE={rmse}, Time={iteration_time:.2f}s")

    # Total time
    total_time = time.time() - total_start_time
    print(f"Total time for refine_max_df: {total_time:.2f}s")

    return best_max_df_value, best_rmse


# Extract the best configuration from the random search
best_overall_config = best_configs[0]
best_vectorizer = best_overall_config[0]  # Vectorizer type (only 'TFIDF' was searched above)
best_params = best_overall_config[1]  # Hyperparameters
best_dataset_data = X_train  # The random search above only used the original texts

# Perform finer search for max_df
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_class = TfidfVectorizer if best_vectorizer == "TFIDF" else CountVectorizer
best_max_df, best_rmse = refine_max_df(best_config, vectorizer_class, best_dataset_data, y_train)

# Print the refined max_df and updated RMSE
print("\nRefined max_df:")
print(f"Best max_df: {best_max_df}")
print(f"Best RMSE after refinement: {best_rmse}")


## Vectorization

In [None]:

type_vect = "TFIDF"

# Initialize the vectorizer based on the type
if type_vect == "TFIDF":
    vect = TfidfVectorizer(stop_words=stop_words_en, **best_params)  # Use best_params
elif type_vect == "Count":
    vect = CountVectorizer(stop_words=stop_words_en, **best_params)  # Use best_params
else:
    raise ValueError("Invalid vectorizer type. Choose 'TFIDF' or 'Count'.")

# Fit and transform X_train, and transform X_test
X_train_vect = vect.fit_transform(X_train)
X_test_vect = vect.transform(X_test)

# Export transformed data as sparse matrices
from scipy.sparse import save_npz

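# One file per matrix; writing both to the same path would overwrite the first
save_npz('/content/drive/MyDrive/XGB/X_train_vect.npz', X_train_vect)
save_npz('/content/drive/MyDrive/XGB/X_test_vect.npz', X_test_vect)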

# Print shapes of the transformed data
print("X_train_vect shape:", X_train_vect.shape)
print("y_train shape:", y_train.shape)
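
TF-IDF matrices are sparse, and `scipy.sparse.save_npz` stores them without converting to a dense array. The sketch below shows a save and load round trip with one file per matrix; the temporary directory and the toy matrices are assumptions for illustration only:

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz

# Toy sparse matrices standing in for X_train_vect / X_test_vect (hypothetical values)
X_train_demo = csr_matrix(np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 0.0]]))
X_test_demo = csr_matrix(np.array([[0.0, 0.0, 3.0]]))

# Each matrix gets its own .npz file
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "X_train_vect.npz")
test_path = os.path.join(out_dir, "X_test_vect.npz")
save_npz(train_path, X_train_demo)
save_npz(test_path, X_test_demo)

# Load them back and confirm nothing changed
restored_train = load_npz(train_path)
restored_test = load_npz(test_path)
print(restored_train.shape, restored_test.shape)
```

Compared with `np.save(..., X.toarray())`, this keeps only the non-zero entries on disk, which is usually much smaller for bag-of-words features.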


## Model - Hyperparameter search

In [None]:
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import numpy as np
from scipy.stats import randint, uniform
from tqdm import tqdm  # Progress tracking

# Define XGBoost hyperparameter distributions
xgb_param_distributions = {
    'n_estimators': randint(100, 1000),  # Number of trees
    'learning_rate': uniform(0.01, 0.3),  # Learning rate (eta)
    'max_depth': randint(3, 10),  # Maximum tree depth
    'subsample': uniform(0.5, 0.5),  # Subsample ratio
    'colsample_bytree': uniform(0.5, 0.5),  # Column subsample ratio
    'gamma': uniform(0, 0.5),  # Minimum loss reduction
    'reg_alpha': uniform(0, 1),  # L1 regularization term
    'reg_lambda': uniform(0, 1)  # L2 regularization term
}

# Initialize XGBoost Regressor
xgb_model = XGBRegressor(random_state=42, objective='reg:squarederror')

# RandomizedSearchCV for XGBoost
random_search_xgb = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=xgb_param_distributions,
    n_iter=5,  # Number of random configurations to try
    cv=3,       # 3-fold cross-validation
    scoring='neg_mean_squared_error',
    verbose=0,  # Suppress individual output
    n_jobs=-1,
    random_state=42
)

# Progress tracking: a single fit() call already evaluates all n_iter configurations
total_iterations = random_search_xgb.n_iter
with tqdm(total=total_iterations, desc="Randomized Search Progress") as pbar:
    random_search_xgb.fit(X_train_vect, y_train)  # Run the search once
    pbar.update(total_iterations)  # Mark the whole search as done

# Best parameters and score
best_xgb_params = random_search_xgb.best_params_
best_rmse = np.sqrt(-random_search_xgb.best_score_)

print("Best XGBoost Parameters:", best_xgb_params)
print("Best RMSE:", best_rmse)
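
The distributions passed to `RandomizedSearchCV` follow the SciPy conventions: `uniform(loc, scale)` samples from `[loc, loc + scale]`, and `randint(low, high)` excludes `high`. The sketch below draws samples to make the effective ranges visible; the sample size and seed are arbitrary choices:

```python
from scipy.stats import randint, uniform

# Same distribution objects as in the search space above
learning_rate_dist = uniform(0.01, 0.3)  # loc=0.01, scale=0.3 -> values in [0.01, 0.31]
subsample_dist = uniform(0.5, 0.5)       # loc=0.5, scale=0.5 -> values in [0.5, 1.0]
max_depth_dist = randint(3, 10)          # integers 3..9 (upper bound excluded)

# Draw reproducible samples and inspect the observed ranges
lr_samples = learning_rate_dist.rvs(size=1000, random_state=0)
sub_samples = subsample_dist.rvs(size=1000, random_state=0)
depth_samples = max_depth_dist.rvs(size=1000, random_state=0)

print("learning_rate:", lr_samples.min(), lr_samples.max())
print("subsample:", sub_samples.min(), sub_samples.max())
print("max_depth:", depth_samples.min(), depth_samples.max())
```

In particular, `uniform(0.01, 0.3)` allows learning rates slightly above 0.3, because the second argument is a width and not an upper bound.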


## Training

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Best model, already refit on the full training data by RandomizedSearchCV (refit=True by default)
best_xgb_model = random_search_xgb.best_estimator_

# Evaluate on the test set
y_pred = best_xgb_model.predict(X_test_vect)

# Export predictions to a CSV file
import pandas as pd

# Convert predictions to a DataFrame
y_pred_df = pd.DataFrame(y_pred, columns=["Predictions"])
y_pred_df.to_csv('/content/drive/MyDrive/Ensemble/XGB_y_pred.csv')



mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Test MSE: {mse}")
print(f"Test RMSE: {rmse}")
print(f"Test R²: {r2}")

# Write the metrics to the file
with open('/content/drive/MyDrive/Metricas/XGB', "w") as file:
    file.write(f"Test MSE: {mse}\n")
    file.write(f"Test RMSE: {rmse}\n")
    file.write(f"Test R²: {r2}\n")

## Prediction

In [None]:
conjunto_test = conjunto_test_processed
X_conjunto_test = vect.transform(conjunto_test)  # Use the vectorizer the model was trained with

# Predict on the final test set
pred_test = best_xgb_model.predict(X_conjunto_test)

# Create a DataFrame for predictions
final_pred_df = pd.DataFrame({
    'id': conjunto_test.index,
    'storypoint': pred_test
})

# Display the final DataFrame
final_pred_df

## Export

In [None]:
import datetime
import joblib

current_date = datetime.datetime.now().strftime('%Y-%m-%d')

final_pred_df.to_csv(f"/content/drive/MyDrive/Predicciones/XGB_{current_date}.csv", index=False)

joblib.dump(best_xgb_model, f'/content/drive/MyDrive/Modelos/xgb_model_{current_date}.joblib')

# 4. Ensemble

## Imports

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
import joblib

y_test = pd.read_csv('/content/drive/MyDrive/Train_test_split/y_test.csv', index_col=0)['storypoint']

# Load predictions for each model
pred_nb = pd.read_csv('/content/drive/MyDrive/Ensemble/NB_y_pred.csv')["Predictions"].values  # Bayes Naive
pred_rf = pd.read_csv('/content/drive/MyDrive/Ensemble/RF_y_pred.csv')["Predictions"].values  # Random Forest
pred_xgb = pd.read_csv('/content/drive/MyDrive/Ensemble/XGB_y_pred.csv')["Predictions"].values  # XGBoost

# Verify the shapes match
assert len(pred_rf) == len(pred_xgb) == len(pred_nb), "Mismatch in predictions length!"

## Weights

In [None]:
def read_rmse_from_file(file_path):
    """
    Reads the RMSE value from a specified file.
    Assumes the file contains a line starting with 'Test RMSE:'.
    """
    with open(file_path, 'r') as file:
        for line in file:
            if line.startswith('Test RMSE:'):
                return float(line.split(':')[1].strip())
    raise ValueError(f"RMSE value not found in file: {file_path}")

# Specify the file paths for each model
file_paths = {
    'Bayes Naive': '/content/drive/MyDrive/Metricas/NB',
    'Random Forest': '/content/drive/MyDrive/Metricas/RF',
    'XGBoost': '/content/drive/MyDrive/Metricas/XGB',
}

# Read RMSE values from the files
rmse_values = {model: read_rmse_from_file(file_path) for model, file_path in file_paths.items()}

# Calculate inverse RMSE
inverse_rmse = {model: 1/rmse for model, rmse in rmse_values.items()}

# Normalize weights
total_inverse_rmse = sum(inverse_rmse.values())
weights = {model: inv_rmse / total_inverse_rmse for model, inv_rmse in inverse_rmse.items()}

# Print calculated weights
print("Calculated Weights for Ensemble:")
for model, weight in weights.items():
    print(f"{model}: {weight:.2f}")



In [None]:
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd


# Map weights to model predictions
model_weights = {
    'Bayes Naive': weights['Bayes Naive'],
    'Random Forest': weights['Random Forest'],
    'XGBoost': weights['XGBoost']
}

# Calculate the ensemble predictions using a weighted average
y_pred_ensemble = (model_weights['Bayes Naive'] * pred_nb +
                   model_weights['Random Forest'] * pred_rf +
                   model_weights['XGBoost'] * pred_xgb)

# Export predictions to a CSV file
# Convert predictions to a DataFrame
y_pred_df = pd.DataFrame(y_pred_ensemble, columns=["Predictions"])
y_pred_df.to_csv('/content/drive/MyDrive/Ensemble/Ensemble_y_pred.csv')



mse = mean_squared_error(y_test, y_pred_ensemble)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_ensemble)

print(f"Test MSE: {mse}")
print(f"Test RMSE: {rmse}")
print(f"Test R²: {r2}")

# Write the metrics to the file
with open('/content/drive/MyDrive/Metricas/Ensemble', "w") as file:
    file.write(f"Test MSE: {mse}\n")
    file.write(f"Test RMSE: {rmse}\n")
    file.write(f"Test R²: {r2}\n")
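
The ensemble weights each model by the inverse of its test RMSE and normalizes the weights so they sum to 1, that is, w_i = (1 / RMSE_i) / sum_j (1 / RMSE_j). A small worked example with made-up RMSE values and predictions (the model names `A`, `B`, `C` and all numbers are hypothetical):

```python
import numpy as np

# Hypothetical RMSE values for three models
rmse_demo = {"A": 2.0, "B": 4.0, "C": 4.0}

# Inverse-RMSE weights, normalized to sum to 1
inverse = {name: 1.0 / value for name, value in rmse_demo.items()}
total = sum(inverse.values())
weights_demo = {name: inv / total for name, inv in inverse.items()}
# A gets 0.5, B and C get 0.25 each

# Hypothetical predictions of each model for two samples
preds_demo = {
    "A": np.array([1.0, 3.0]),
    "B": np.array([2.0, 5.0]),
    "C": np.array([4.0, 5.0]),
}

# Weighted average of the predictions
blend = sum(weights_demo[name] * preds_demo[name] for name in preds_demo)
print(weights_demo, blend)
```

Because the weights depend only on the ratio of RMSE values, models with similar errors end up with nearly equal weights, and the result stays close to a plain average.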

## Predictions

In [None]:
y_test = pd.read_csv('/content/drive/MyDrive/Train_test_split/y_test.csv', index_col=0)['storypoint']

# Evaluate the ensemble predictions
rmse = np.sqrt(mean_squared_error(y_test, y_pred_ensemble))
print(f"Ensemble RMSE: {rmse}")

# Create a DataFrame, using the index of y_test as the row id
final_pred_esbl = pd.DataFrame({
    'id': y_test.index,
    'story_points': y_pred_ensemble
})

# Display the first few rows of the final DataFrame
print("\nFinal Predictions DataFrame:")
print(final_pred_esbl.head())

## Export

In [None]:
# Export
import datetime

current_date = datetime.datetime.now().strftime('%Y-%m-%d')

final_pred_esbl.to_csv(f"/content/drive/MyDrive/Predicciones/Ensamble_{current_date}.csv", index=False)

joblib.dump(y_pred_ensemble, '/content/drive/MyDrive/Modelos/Ensamble.joblib')

# Neural network using Keras and TensorFlow

## Imports

In [None]:
# Datasets to load
X_train = pd.read_csv('/content/drive/MyDrive/Train_test_split/X_train.csv', index_col=0)['full_text']
X_test = pd.read_csv('/content/drive/MyDrive/Train_test_split/X_test.csv', index_col=0)['full_text']
y_train = pd.read_csv('/content/drive/MyDrive/Train_test_split/y_train.csv', index_col=0)['storypoint']
y_test = pd.read_csv('/content/drive/MyDrive/Train_test_split/y_test.csv', index_col=0)['storypoint']

# These are the lemmatized versions
X_train_processed = pd.read_csv('/content/drive/MyDrive/processed/X_train_processed.csv', index_col=0)['full_text']
X_test_processed = pd.read_csv('/content/drive/MyDrive/processed/X_test_processed.csv', index_col=0)['full_text']
conjunto_test_processed = pd.read_csv('/content/drive/MyDrive/processed/conjunto_test_processed.csv', index_col=0)['full_text']

In [None]:
# Convert to lists (the text is already preprocessed)
X_train_list = X_train_processed.tolist()
X_test_list = X_test_processed.tolist()

# Convert to arrays (no reshape needed since these are texts)
X_train_array = np.array(X_train_list, dtype=object)
X_test_array = np.array(X_test_list, dtype=object)


## Model selection

In [None]:
# Build a neural network model from the selected hyperparameters
def create_model(optimizer='adam', dense_length=64, learning_rate=0.001, rho=0.95, epsilon=1e-07, momentum=0.0, max_tokens=10000, dropout_rate=0.5):
    # Selected optimizer
    if optimizer == 'adam':
        opt = Adam(learning_rate=learning_rate, epsilon=epsilon)
    elif optimizer == 'rmsprop':
        opt = RMSprop(learning_rate=learning_rate, rho=rho, epsilon=epsilon, momentum=momentum)
    elif optimizer == 'adadelta':
        opt = Adadelta(learning_rate=learning_rate, rho=rho, epsilon=epsilon)
    else:
        raise ValueError(f"Unsupported optimizer: {optimizer}")

    # Text vectorization layer, adapted to the training texts
    vectorizer = TextVectorization(output_mode='tf-idf', max_tokens=max_tokens)
    vectorizer.adapt(X_train_list)

    # Model definition
    model = Sequential([
        Input(shape=(1,), dtype=tf.string),
        vectorizer,
        BatchNormalization(),
        Dense(dense_length, activation='relu', kernel_regularizer='l2'),
        Dropout(dropout_rate),
        Dense(dense_length // 2, activation='relu', kernel_regularizer='l2'),
        Dropout(dropout_rate),
        Dense(1, activation='linear', kernel_regularizer='l2')  # Regression output
    ])

    model.compile(optimizer=opt, loss='mean_squared_error', metrics=[tf.keras.metrics.RootMeanSquaredError()])

    return model

# Hyperparameters to try for each optimizer type
param_dist_adam = {
    'optimizer': ['adam'],
    'dense_length': [16, 32, 64, 128, 256],
    'learning_rate': [0.01, 0.001, 0.0001],
    'dropout_rate': [0.3, 0.4, 0.5],
    'max_tokens': [5000, 10000]
}
param_dist_rmsprop = {
    'optimizer': ['rmsprop'],
    'dense_length': [16, 32, 64, 128, 256],
    'learning_rate': [0.01, 0.001, 0.0001],
    'rho': [0.85, 0.9, 0.95],
    'epsilon': [1e-08, 1e-07, 1e-06],
    'momentum': [0.0, 0.2, 0.5],
    'dropout_rate': [0.3, 0.4, 0.5],
    'max_tokens': [5000, 10000]
}
param_dist_adadelta = {
    'optimizer': ['adadelta'],
    'dense_length': [16, 32, 64, 128, 256],
    'learning_rate': [0.01, 0.001, 0.0001],
    'rho': [0.85, 0.9, 0.95],
    'epsilon': [1e-08, 1e-07, 1e-06],
    'dropout_rate': [0.3, 0.4, 0.5],
    'max_tokens': [5000, 10000]
}
param_grid = [
    param_dist_adam,
    #param_dist_rmsprop,
    #param_dist_adadelta
]

# Search routine adapted to neural networks (it samples n_iter random combinations per grid)
def grid_search(X_train, y_train, X_val, y_val, param_grid, n_iter):
    best_score = float('inf')  # We want to minimize RMSE
    best_params = None

    for param_dist in param_grid:
        for _ in range(n_iter):
            # Random selection of hyperparameters
            params = {key: random.choice(value) for key, value in param_dist.items()}
            print(f"Testing parameters: {params}")

            model = create_model(**params)

            # Callbacks: early stopping and learning-rate reduction
            early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
            reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, min_lr=0.000001)
            callbacks_list = [early_stopping, reduce_lr]

            # Training
            model.fit(
                X_train, y_train,
                epochs=20,
                batch_size=128,
                validation_data=(X_val, y_val),
                callbacks=callbacks_list,
                verbose=0
            )

            # Evaluation
            val_predictions = model.predict(X_val).flatten()
            val_rmse = np.sqrt(mean_squared_error(y_val, val_predictions))
            print(f"Validation RMSE: {val_rmse}")

            if val_rmse < best_score:
                best_score = val_rmse
                best_params = params

            model = None

    return best_score, best_params

# Run the search
best_score, best_params = grid_search(X_train_array, y_train, X_test_array, y_test, param_grid, n_iter=30)

print(f"Best RMSE: {best_score} with parameters: {best_params}")

'''
best_score = 2.671667730078612
best_params = {'optimizer': 'adam', 'dense_length': 32, 'learning_rate': 0.01, 'dropout_rate': 0.3, 'max_tokens': 5000}
'''
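
Although the search function is named `grid_search`, it draws `n_iter` random combinations from each dictionary rather than enumerating every combination. The sampling step can be sketched on its own as follows; `sample_params` and `demo_dist` are hypothetical names, and a seeded `random.Random` is used here so the draws are reproducible:

```python
import random

def sample_params(param_dist, rng):
    """Pick one value per hyperparameter, independently and uniformly."""
    return {key: rng.choice(values) for key, values in param_dist.items()}

# Reduced copy of the Adam search space (for illustration only)
demo_dist = {
    "optimizer": ["adam"],
    "dense_length": [16, 32, 64],
    "learning_rate": [0.01, 0.001],
    "dropout_rate": [0.3, 0.5],
}

rng = random.Random(42)
trials = [sample_params(demo_dist, rng) for _ in range(5)]
for trial in trials:
    print(trial)
```

Since draws are independent, the same combination can be picked more than once, so fewer than `n_iter` distinct configurations may end up being evaluated.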


## Fit with the best model

In [None]:
# More complete version
# Text vectorization
vectorizer = TextVectorization(output_mode='tf-idf', max_tokens=best_params['max_tokens'])
vectorizer.adapt(X_train_list)

# Build the model with the best parameters found
model = Sequential([
    Input(shape=(1,), dtype=tf.string),
    vectorizer,
    BatchNormalization(),
    Dense(best_params['dense_length'], activation='relu', kernel_regularizer='l2'),
    Dropout(best_params['dropout_rate']),
    Dense(best_params['dense_length'] // 2, activation='relu', kernel_regularizer='l2'),
    Dropout(best_params['dropout_rate']),
    Dense(1, activation='linear', kernel_regularizer='l2')  # Linear activation for regression
])

# Compile the model according to the selected optimizer
if best_params['optimizer'] == 'rmsprop':
    model.compile(
        optimizer=RMSprop(learning_rate=best_params['learning_rate'], rho=best_params['rho'], epsilon=best_params['epsilon'], momentum=best_params['momentum']),
        loss='mean_squared_error',  # Regression loss
        metrics=[tf.keras.metrics.RootMeanSquaredError()]  # Main metric: RMSE
    )
elif best_params['optimizer'] == 'adadelta':
    model.compile(
        optimizer=Adadelta(learning_rate=best_params['learning_rate'], rho=best_params['rho'], epsilon=best_params['epsilon']),
        loss='mean_squared_error',
        metrics=[tf.keras.metrics.RootMeanSquaredError()]
    )
elif best_params['optimizer'] == 'adam':
    model.compile(
        optimizer=Adam(learning_rate=best_params['learning_rate'], epsilon=1e-07),
        loss='mean_squared_error',
        metrics=[tf.keras.metrics.RootMeanSquaredError()]
    )

# Model summary
model.summary()

# Training callbacks: early stopping and learning-rate reduction on plateau
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, min_lr=0.000001)
callbacks_list = [early_stopping, reduce_lr]

# Train the model with the best hyperparameters
history = model.fit(
    X_train_array, y_train,
    epochs=50,
    batch_size=128,
    validation_data=(X_test_array, y_test),
    callbacks=callbacks_list
)
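
The training call above passes the local test split as `validation_data`, so early stopping and the learning-rate schedule are driven by the same rows that are later used to report metrics. One possible way to keep those metrics independent is to carve a separate validation subset out of the training rows. The sketch below only produces index lists with the standard library. The names `split_indices` and `val_fraction` and the 80/20 ratio are assumptions for illustration, not part of the notebook.

```python
import random


def split_indices(n_samples, val_fraction=0.2, seed=42):
    """Shuffle row positions and return (train_indices, val_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * val_fraction))
    # The first n_val shuffled positions become the validation subset.
    return indices[n_val:], indices[:n_val]


# Tiny illustrative example with 10 rows.
train_idx, val_idx = split_indices(10, val_fraction=0.2)
print("train rows:", sorted(train_idx))
print("validation rows:", sorted(val_idx))
```

Under that assumption, the resulting positions could index the training array and targets (for instance with `.iloc` on a pandas Series), and the local test split would be used only for the final metrics.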



In [None]:
# Model predictions
y_predic = model.predict(X_test_array).flatten()  # Make sure the predictions are one-dimensional

# Convert the predictions to a Series aligned with y_test for comparison
y_pred_series = pd.Series(y_predic, index=y_test.index)

# Compute the regression metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred_series))
mse = mean_squared_error(y_test, y_pred_series)
r2 = 1 - (np.sum((y_test - y_pred_series) ** 2) / np.sum((y_test - y_test.mean()) ** 2))

print(f"Test MSE: {mse}")
print(f"Test RMSE: {rmse}")
print(f"Test R²: {r2}")

# Write the metrics to the file
with open('/content/drive/MyDrive/Metricas/RN', "w") as file:
    file.write(f"Test MSE: {mse}\n")
    file.write(f"Test RMSE: {rmse}\n")
    file.write(f"Test R²: {r2}\n")

'''
Test MSE: 8.546259059054583
Test RMSE: 2.9233985460512533
Test R²: -0.03490238611566365
'''
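
The R² recorded above is slightly negative. With the formula used in this cell, R² = 1 - SS_res / SS_tot, a constant prediction equal to the mean of the targets scores exactly 0, so a negative value means the squared error is larger than that of the mean baseline. The sketch below reproduces the formula in plain Python on made-up numbers. The function name `r2_score_manual` and the data are hypothetical.

```python
def r2_score_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, the same formula as in the cell above."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up story points, for illustration only.
y_true = [1.0, 2.0, 3.0, 5.0, 8.0]

# Baseline: always predict the mean of the targets.
mean_baseline = [sum(y_true) / len(y_true)] * len(y_true)

# A constant prediction that is off by one point from the mean.
shifted = [p + 1.0 for p in mean_baseline]

print("mean baseline R²:", r2_score_manual(y_true, mean_baseline))
print("shifted constant R²:", r2_score_manual(y_true, shifted))
```

The first value is zero and the second is negative, which is the situation the recorded test metrics describe.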

## Predictions

In [None]:
# Copy and prepare the processed test set
X_conjunto_test = conjunto_test_processed.copy()
X_conjunto_test_array = np.array(X_conjunto_test, dtype=object)

# Predict the values for the test set
y_predic = model.predict(X_conjunto_test_array).flatten()

# Build the final DataFrame with the predicted values
final_pred_df = pd.DataFrame({
    'id': conjunto_test_processed.index,  # IDs are taken from the index as-is (no offset is applied here)
    'storypoint': y_predic  # Predictions kept as continuous values
})

# Display the final DataFrame
final_pred_df



## Exports

In [None]:

# Get the current date
current_date = datetime.datetime.now().strftime('%Y-%m-%d')

# Save the predictions to a CSV file
final_pred_df.to_csv(f"/content/drive/MyDrive/Predicciones/Red_Neuronal_{current_date}.csv", index=False)

# Save the model in Keras HDF5 (.h5) format
model.save(f"/content/drive/MyDrive/Modelos/red_neuronal_model_{current_date}.h5")
# Also serialize the model object as a .joblib file
joblib.dump(model, f"/content/drive/MyDrive/Modelos/red_neuronal_model_{current_date}.joblib")
