# Los MiserAIbles Team

### Team members
Sorany Hincapie Salazar  
Brayan Montoya Osorio

## Data exploration

In [5]:
import pandas as pd

path_to_data = '../data/challenge_data-18-ago.csv'

df = pd.read_csv(path_to_data, sep = ';')
df.head(10)

Unnamed: 0,title,abstract,group
0,Adrenoleukodystrophy: survey of 303 cases: bio...,Adrenoleukodystrophy ( ALD ) is a genetically ...,neurological|hepatorenal
1,endoscopy reveals ventricular tachycardia secrets,Research question: How does metformin affect c...,neurological
2,dementia and cholecystitis: organ interplay,Purpose: This randomized controlled study exam...,hepatorenal
3,The interpeduncular nucleus regulates nicotine...,Partial lesions were made with kainic acid in ...,neurological
4,guillain-barre syndrome pathways in leukemia,Hypothesis: statins improves stroke outcomes v...,neurological
5,Effects of suprofen on the isolated perfused r...,Although suprofen has been associated with the...,hepatorenal
6,atherosclerosis and lymphoma: vascular insights,Aim: To investigate aspirin effects on diabete...,cardiovascular
7,Potential therapeutic use of the selective dop...,The clinical utility of dopamine (DA) D1 recep...,neurological
8,The basal ganglia connection in epilepsy,Background: dementia affects cardiac patients ...,neurological
9,septum and peripheral artery disease: vascular...,Purpose: This observational study examined cal...,cardiovascular


In [6]:
from collections import Counter

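# The 'label_list' column is only created later, in the model training section.
# Iterating over df['label_list'] at this point raised KeyError: 'label_list',
# so the labels are split directly from the raw 'group' column instead.
all_labels = [label for labels in df['group'].str.split('|') for label in labels]
label_counts = Counter(all_labels)

for label, count in label_counts.items():
    print(f"{label}: {count}")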
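The per-label count can also be sketched without pandas. The snippet below is a minimal illustration that uses only the ten `group` values visible in the `df.head(10)` preview above, so the counts describe that preview and not the full dataset. The names `preview_groups` and `count_labels` are hypothetical helpers introduced for this example.

```python
from collections import Counter

# 'group' values of the ten rows shown in the df.head(10) preview
preview_groups = [
    "neurological|hepatorenal", "neurological", "hepatorenal", "neurological",
    "neurological", "hepatorenal", "cardiovascular", "neurological",
    "neurological", "cardiovascular",
]

def count_labels(groups):
    """Count how often each label occurs across pipe-separated label strings."""
    return Counter(label for group in groups for label in group.split("|"))

preview_counts = count_labels(preview_groups)
for label, count in preview_counts.most_common():
    print(f"{label}: {count}")
```

Because an article can carry several labels, the counts add up to more than the number of rows.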

## Feature extraction with NLP

Steps:
1. Tokenization.
2. Cleaning.
3. Lemmatization.
4. Stopword filtering.
5. Vectorization with TF-IDF.

### Test example with one record from the dataset, following the steps above.

In [None]:
row_example =  df.iloc[0]['title'] + ' ' +  df.iloc[0]['abstract']
print(row_example)

Adrenoleukodystrophy: survey of 303 cases: biochemistry, diagnosis, and therapy. Adrenoleukodystrophy ( ALD ) is a genetically determined disorder associated with progressive central demyelination and adrenal cortical insufficiency . All affected persons show increased levels of saturated unbranched very-long-chain fatty acids , particularly hexacosanoate ( C26  0 ) , because of impaired capacity to degrade these acids . This degradation normally takes place in a subcellular organelle called the peroxisome , and ALD , together with Zellwegers cerebrohepatorenal syndrome , is now considered to belong to the newly formed category of peroxisomal disorders . Biochemical assays permit prenatal diagnosis , as well as identification of most heterozygotes . We have identified 303 patients with ALD in 217 kindreds . These patients show a wide phenotypic variation . Sixty percent of patients had childhood ALD and 17 % adrenomyeloneuropathy , both of which are X-linked , with the gene mapped to X

### 1. Tokenization:  
Split the text into words, sentences, or other small units. This step uses the **nltk** library and downloads its **punkt_tab** resource.

In [None]:
import nltk
nltk.download('punkt_tab')

def tokenize_text(text):
    return nltk.word_tokenize(text)

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
tokens = tokenize_text(row_example)
print(tokens)

['Adrenoleukodystrophy', ':', 'survey', 'of', '303', 'cases', ':', 'biochemistry', ',', 'diagnosis', ',', 'and', 'therapy', '.', 'Adrenoleukodystrophy', '(', 'ALD', ')', 'is', 'a', 'genetically', 'determined', 'disorder', 'associated', 'with', 'progressive', 'central', 'demyelination', 'and', 'adrenal', 'cortical', 'insufficiency', '.', 'All', 'affected', 'persons', 'show', 'increased', 'levels', 'of', 'saturated', 'unbranched', 'very-long-chain', 'fatty', 'acids', ',', 'particularly', 'hexacosanoate', '(', 'C26', '0', ')', ',', 'because', 'of', 'impaired', 'capacity', 'to', 'degrade', 'these', 'acids', '.', 'This', 'degradation', 'normally', 'takes', 'place', 'in', 'a', 'subcellular', 'organelle', 'called', 'the', 'peroxisome', ',', 'and', 'ALD', ',', 'together', 'with', 'Zellwegers', 'cerebrohepatorenal', 'syndrome', ',', 'is', 'now', 'considered', 'to', 'belong', 'to', 'the', 'newly', 'formed', 'category', 'of', 'peroxisomal', 'disorders', '.', 'Biochemical', 'assays', 'permit', 'pr

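### 2. Cleaning:
Remove special characters and convert the text to lowercase. Only purely alphabetic tokens are kept, so punctuation, numbers, and hyphenated terms such as 'very-long-chain' are dropped.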

In [None]:
def clean_tokens(tokens):
    return [token.lower() for token in tokens if token.isalpha()]

In [None]:
cleaned_tokens = clean_tokens(tokens)
print(cleaned_tokens)

['adrenoleukodystrophy', 'survey', 'of', 'cases', 'biochemistry', 'diagnosis', 'and', 'therapy', 'adrenoleukodystrophy', 'ald', 'is', 'a', 'genetically', 'determined', 'disorder', 'associated', 'with', 'progressive', 'central', 'demyelination', 'and', 'adrenal', 'cortical', 'insufficiency', 'all', 'affected', 'persons', 'show', 'increased', 'levels', 'of', 'saturated', 'unbranched', 'fatty', 'acids', 'particularly', 'hexacosanoate', 'because', 'of', 'impaired', 'capacity', 'to', 'degrade', 'these', 'acids', 'this', 'degradation', 'normally', 'takes', 'place', 'in', 'a', 'subcellular', 'organelle', 'called', 'the', 'peroxisome', 'and', 'ald', 'together', 'with', 'zellwegers', 'cerebrohepatorenal', 'syndrome', 'is', 'now', 'considered', 'to', 'belong', 'to', 'the', 'newly', 'formed', 'category', 'of', 'peroxisomal', 'disorders', 'biochemical', 'assays', 'permit', 'prenatal', 'diagnosis', 'as', 'well', 'as', 'identification', 'of', 'most', 'heterozygotes', 'we', 'have', 'identified', 'pat
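A quick check of what `str.isalpha()` keeps may help when judging this cleaning rule. The sketch below reuses a few tokens from the example record. It also shows one hypothetical alternative, `clean_tokens_keep_hyphens`, which is not part of the notebook and is included only to illustrate how hyphenated terms could be preserved.

```python
import re

def clean_tokens(tokens):
    # Same rule as the notebook: keep purely alphabetic tokens, lowercased
    return [token.lower() for token in tokens if token.isalpha()]

# Hypothetical variant: also accept letter groups joined by single hyphens
HYPHENATED_WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")

def clean_tokens_keep_hyphens(tokens):
    return [token.lower() for token in tokens if HYPHENATED_WORD.fullmatch(token)]

sample = ["Adrenoleukodystrophy", ":", "303", "very-long-chain", "C26", "X-linked", "acids"]
strict = clean_tokens(sample)
relaxed = clean_tokens_keep_hyphens(sample)
print(strict)
print(relaxed)
```

Whether hyphenated biomedical terms are worth keeping is a modelling choice; the strict rule gives a smaller vocabulary, the relaxed one keeps more domain-specific tokens.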

### 3. Lemmatization:  
Reduce words to their base form (lemma).

Lemmatization is used instead of stemming because it gives more precise and meaningful results. Stemming tends to discard grammatical context and can produce non-existent words, whereas lemmatization maps each word to a real dictionary form and preserves its meaning. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, so verb forms such as 'determined' are left unchanged here.

In [None]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
def lemmatize_words(words):
    return [lemmatizer.lemmatize(word) for word in words]

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [None]:
lemmatized_words = lemmatize_words(cleaned_tokens)
print(lemmatized_words)


['adrenoleukodystrophy', 'survey', 'of', 'case', 'biochemistry', 'diagnosis', 'and', 'therapy', 'adrenoleukodystrophy', 'ald', 'is', 'a', 'genetically', 'determined', 'disorder', 'associated', 'with', 'progressive', 'central', 'demyelination', 'and', 'adrenal', 'cortical', 'insufficiency', 'all', 'affected', 'person', 'show', 'increased', 'level', 'of', 'saturated', 'unbranched', 'fatty', 'acid', 'particularly', 'hexacosanoate', 'because', 'of', 'impaired', 'capacity', 'to', 'degrade', 'these', 'acid', 'this', 'degradation', 'normally', 'take', 'place', 'in', 'a', 'subcellular', 'organelle', 'called', 'the', 'peroxisome', 'and', 'ald', 'together', 'with', 'zellwegers', 'cerebrohepatorenal', 'syndrome', 'is', 'now', 'considered', 'to', 'belong', 'to', 'the', 'newly', 'formed', 'category', 'of', 'peroxisomal', 'disorder', 'biochemical', 'assay', 'permit', 'prenatal', 'diagnosis', 'a', 'well', 'a', 'identification', 'of', 'most', 'heterozygote', 'we', 'have', 'identified', 'patient', 'wit

### 4. Stopword filtering

The goal is to remove words that carry little meaning, such as: the, an, a, or, and. Generic academic and medical terms (for example study, patient, result) are removed as well, to reduce dimensionality and speed up processing of the dataset.

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

academic_medic_stopwords = {
        'abstract', 'paper', 'study', 'research', 'article', 'journal',
        'analysis', 'method', 'approach', 'technique', 'result', 'conclusion',
        'introduction', 'discussion', 'experimental', 'theoretical',
        'also', 'however', 'therefore', 'furthermore', 'moreover',
        'studies', 'report', 'review', 'evaluation', 'assessment', 'investigation', 
        'examination', 'observation', 'finding', 'findings', 'results', 'methods', 
        'methodology', 'patient', 'patients', 'subject', 'subjects', 'participant', 
        'participants', 'case', 'cases', 'group', 'groups', 'control', 'controls',
        'significant', 'significantly', 'statistical', 'statistically',
        'important', 'effective', 'successful', 'common', 'rare', 'typical',
        'normal', 'abnormal', 'positive', 'negative', 'high', 'low', 'increased', 'decreased'
    }

# Build the English stopword set once instead of on every call
english_stopwords = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [token for token in tokens
            if token not in english_stopwords and token not in academic_medic_stopwords]

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
filtered_tokens = remove_stopwords(lemmatized_words)
print(filtered_tokens)

['adrenoleukodystrophy', 'survey', 'biochemistry', 'diagnosis', 'therapy', 'adrenoleukodystrophy', 'ald', 'genetically', 'determined', 'disorder', 'associated', 'progressive', 'central', 'demyelination', 'adrenal', 'cortical', 'insufficiency', 'affected', 'person', 'show', 'level', 'saturated', 'unbranched', 'fatty', 'acid', 'particularly', 'hexacosanoate', 'impaired', 'capacity', 'degrade', 'acid', 'degradation', 'normally', 'take', 'place', 'subcellular', 'organelle', 'called', 'peroxisome', 'ald', 'together', 'zellwegers', 'cerebrohepatorenal', 'syndrome', 'considered', 'belong', 'newly', 'formed', 'category', 'peroxisomal', 'disorder', 'biochemical', 'assay', 'permit', 'prenatal', 'diagnosis', 'well', 'identification', 'heterozygote', 'identified', 'ald', 'kindred', 'show', 'wide', 'phenotypic', 'variation', 'sixty', 'percent', 'childhood', 'ald', 'adrenomyeloneuropathy', 'gene', 'mapped', 'neonatal', 'ald', 'distinct', 'entity', 'autosomal', 'recessive', 'inheritance', 'point', 'r

Pipeline for tokenization, cleaning, lemmatization, and filtering, applied to the whole dataset:

In [None]:
def tokenization_pipeline(text):
    tokens = tokenize_text(text)
    cleaned_tokens = clean_tokens(tokens)
    lemmatized_words = lemmatize_words(cleaned_tokens)
    filtered_words = remove_stopwords(lemmatized_words)
    return filtered_words


### Dataset processing:

In [None]:
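# Join title and abstract with a space. Concatenating them directly fuses the
# last word of the title with the first word of the abstract, which is the
# source of tokens like 'interplaypurpose' in the outputs shown in this
# notebook (those outputs were produced without the separating space).
df['tokenized_article'] = (df['title'] + ' ' + df['abstract']).apply(tokenization_pipeline)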

In [None]:
df['text_joined'] = df['tokenized_article'].apply(lambda tokens: ' '.join(tokens))
df.head(3)

Unnamed: 0,title,abstract,group,tokenized_article,text_joined
0,Adrenoleukodystrophy: survey of 303 cases: bio...,Adrenoleukodystrophy ( ALD ) is a genetically ...,neurological|hepatorenal,"[adrenoleukodystrophy, survey, biochemistry, d...",adrenoleukodystrophy survey biochemistry diagn...
1,endoscopy reveals ventricular tachycardia secrets,Research question: How does metformin affect c...,neurological,"[endoscopy, reveals, ventricular, tachycardia,...",endoscopy reveals ventricular tachycardia secr...
2,dementia and cholecystitis: organ interplay,Purpose: This randomized controlled study exam...,hepatorenal,"[dementia, cholecystitis, organ, interplaypurp...",dementia cholecystitis organ interplaypurpose ...


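### 5. Vectorization with TF-IDF
TF-IDF (term frequency-inverse document frequency) weights each term by how often it occurs in a document and down-weights terms that occur in many documents, so words that are distinctive for an article receive the highest scores. Each preprocessed article in `text_joined` becomes one row of a sparse matrix with one column per vocabulary term.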

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text_joined'])
X.shape


(3565, 12438)
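To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed IDF and L2 row normalisation that `TfidfVectorizer` uses by default, and it splits on whitespace instead of using scikit-learn's token pattern. The function `tfidf` and the three toy documents are illustrative only and are not part of the notebook.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

toy_corpus = [
    "dementia cholecystitis organ",
    "epilepsy basal ganglia dementia",
    "atherosclerosis lymphoma vascular",
]
vocabulary, rows = tfidf(toy_corpus)
for term, weight in zip(vocabulary, rows[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the first toy document, 'dementia' also appears in the second document, so it receives a lower weight than 'organ' or 'cholecystitis', which appear only once in the corpus.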

### Model training

In [None]:
# Convert the '|'-separated labels into lists
df['label_list'] = df['group'].apply(lambda x: x.split('|'))
df.head(3)


Unnamed: 0,title,abstract,group,tokenized_article,text_joined,label_list
0,Adrenoleukodystrophy: survey of 303 cases: bio...,Adrenoleukodystrophy ( ALD ) is a genetically ...,neurological|hepatorenal,"[adrenoleukodystrophy, survey, biochemistry, d...",adrenoleukodystrophy survey biochemistry diagn...,"[neurological, hepatorenal]"
1,endoscopy reveals ventricular tachycardia secrets,Research question: How does metformin affect c...,neurological,"[endoscopy, reveals, ventricular, tachycardia,...",endoscopy reveals ventricular tachycardia secr...,[neurological]
2,dementia and cholecystitis: organ interplay,Purpose: This randomized controlled study exam...,hepatorenal,"[dementia, cholecystitis, organ, interplaypurp...",dementia cholecystitis organ interplaypurpose ...,[hepatorenal]


In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df['label_list'])
y

array([[0, 1, 1, 0],
       [0, 0, 1, 0],
       [0, 1, 0, 0],
       ...,
       [1, 0, 1, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 1]], shape=(3565, 4))
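The indicator matrix above can be reproduced by hand on a few rows. The sketch below is a simplified stand-in for what `MultiLabelBinarizer` does (sorted class names, one 0/1 column per class). The first three `group` strings come from the dataset preview; the last two are added only so that all four classes appear, and `binarize_labels` is a hypothetical helper.

```python
def binarize_labels(label_lists):
    """Map lists of labels to 0/1 indicator rows, one column per sorted class."""
    classes = sorted({label for labels in label_lists for label in labels})
    column = {label: i for i, label in enumerate(classes)}
    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        matrix.append(row)
    return classes, matrix

groups = [
    "neurological|hepatorenal",  # preview row 0
    "neurological",              # preview row 1
    "hepatorenal",               # preview row 2
    "cardiovascular",            # illustrative, to cover all classes
    "oncological",               # illustrative, to cover all classes
]
classes, matrix = binarize_labels([g.split("|") for g in groups])
print(classes)
for row in matrix:
    print(row)
```

The first three rows match the first three rows of `y` shown above, which confirms that the columns are ordered cardiovascular, hepatorenal, neurological, oncological.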

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [None]:
from sklearn.multioutput import MultiOutputClassifier

# Fit one independent logistic regression per label column
clf = LogisticRegression(max_iter=1000)
multi_clf = MultiOutputClassifier(clf)
multi_clf.fit(X_train, y_train)


# Prediction and evaluation
y_pred = multi_clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=mlb.classes_))

                precision    recall  f1-score   support

cardiovascular       0.98      0.78      0.87       260
   hepatorenal       0.99      0.63      0.77       228
  neurological       0.84      0.89      0.87       338
   oncological       0.98      0.50      0.66       130

     micro avg       0.92      0.75      0.82       956
     macro avg       0.95      0.70      0.79       956
  weighted avg       0.93      0.75      0.82       956
   samples avg       0.92      0.82      0.84       956





### Linear SVM

In [None]:
from sklearn.svm import LinearSVC
from sklearn.multioutput import MultiOutputClassifier

# Create the SVM classifier (one LinearSVC per label column)
svm_clf = LinearSVC(max_iter=1000)
multi_svm = MultiOutputClassifier(svm_clf)

# Train the model
multi_svm.fit(X_train, y_train)

# Prediction and evaluation
y_pred_svm = multi_svm.predict(X_test)
print(classification_report(y_test, y_pred_svm, target_names=mlb.classes_))

                precision    recall  f1-score   support

cardiovascular       0.95      0.86      0.90       260
   hepatorenal       0.98      0.78      0.87       228
  neurological       0.88      0.89      0.88       338
   oncological       0.98      0.69      0.81       130

     micro avg       0.93      0.83      0.88       956
     macro avg       0.95      0.81      0.87       956
  weighted avg       0.94      0.83      0.88       956
   samples avg       0.94      0.88      0.89       956




Precision is ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.
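The warning concerns the `samples avg` row: precision is computed per article and then averaged, and an article for which the model predicts no label at all gives 0/0. The sketch below is a simplified pure-Python version of that averaging, not scikit-learn's implementation. The function `samples_precision` and the small matrices are made up for illustration.

```python
def samples_precision(y_true, y_pred, zero_division=0.0):
    """Mean per-sample precision for 0/1 indicator rows."""
    scores = []
    for true_row, pred_row in zip(y_true, y_pred):
        predicted = sum(pred_row)
        if predicted == 0:
            # No label predicted: precision is 0/0, so use the fallback value
            scores.append(zero_division)
            continue
        true_positives = sum(t * p for t, p in zip(true_row, pred_row))
        scores.append(true_positives / predicted)
    return sum(scores) / len(scores)

y_true_toy = [[0, 1, 1, 0], [0, 0, 1, 0], [1, 0, 0, 1]]
y_pred_toy = [[0, 1, 1, 0], [0, 0, 1, 0], [0, 0, 0, 0]]  # last sample: nothing predicted

pessimistic = samples_precision(y_true_toy, y_pred_toy, zero_division=0.0)
optimistic = samples_precision(y_true_toy, y_pred_toy, zero_division=1.0)
print(round(pessimistic, 2), round(optimistic, 2))
```

With scikit-learn, passing `zero_division=0` (or `1`) to `classification_report` makes this choice explicit and silences the warning. The samples with no predicted label are also a sign that recall, not precision, is the weaker side of these models.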



### XGBoost

3. Training of classical models
4. Dimensionality reduction
5. Retraining
