### Subreddit classification system


The task is to implement a function `classify_subreddit(text)` that classifies an input text into one of the following categories: MachineLearning, datascience, statistics, learnmachinelearning, computerscience, AskStatistics, artificial, analytics, datasets, deeplearning, rstats, computervision, DataScienceJobs, MLQuestions, dataengineering, data, dataanalysis, datascienceproject, Kaggle.


To do this, a dataset is provided on which different classification algorithms can be trained. The label of the corresponding subreddit is in the “subreddit” column. At least the following 3 methods must be tested:
- A method based on TF-IDF + a machine learning classification algorithm
- A method based on recognized entities (Named-Entity Recognition) + a machine learning classification algorithm
- A method based on Word Embeddings + a machine learning classification algorithm


To evaluate each method, the F1 score will be used, with 70% of the dataset for training and 30% for testing (after random sampling).
In the notebook implementacion_modulo_2.ipynb you must document every step followed and the results obtained, and explain the differences between the methods tested. The file core.py must include a function classify_subreddit(text) that returns a string with the label resulting from the classification.
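A minimal sketch of what `core.py` could look like, assuming a scikit-learn pipeline (TF-IDF plus logistic regression) is the chosen method. The toy training texts, the `train_pipeline` helper and the module-level `_MODEL` are illustrative assumptions, not part of the assignment; the real model would be fit on the provided dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_pipeline(texts, labels):
    """Fit a TF-IDF + logistic regression pipeline (hypothetical helper)."""
    pipeline = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=500),
    )
    pipeline.fit(texts, labels)
    return pipeline


# Toy data for illustration only; not taken from the real dataset.
_TEXTS = [
    "how to tune a neural network learning rate",
    "gradient descent for deep neural network training",
    "interpret p value in linear regression output",
    "confidence interval for sample mean t test",
]
_LABELS = ["deeplearning", "deeplearning", "statistics", "statistics"]
_MODEL = train_pipeline(_TEXTS, _LABELS)


def classify_subreddit(text: str) -> str:
    """Return the predicted subreddit label as a plain Python string."""
    return str(_MODEL.predict([text])[0])
```

In practice the fitted pipeline would probably be saved once (for example with `joblib.dump`) and loaded inside `core.py`, so that importing the module does not retrain the model.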

In [None]:
import pandas as pd
pd.options.display.max_columns = None
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, f1_score
from sklearn.ensemble import RandomForestClassifier

import spacy
from sklearn.feature_extraction.text import CountVectorizer

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer #type: ignore
from tensorflow.keras.preprocessing.sequence import pad_sequences #type: ignore
from tensorflow.keras.layers import Embedding #type: ignore
from tensorflow.keras.models import Sequential #type: ignore





In [3]:
# we load the dataset with the cleanpost column so we don't have to process it again.
df = pd.read_csv("processed_dataset.csv",
                 low_memory=False) 
df = df.dropna(subset=['clean_post', 'subreddit'])
df

Unnamed: 0,created_date,created_timestamp,subreddit,title,author,author_created_utc,full_link,score,num_comments,num_crossposts,subreddit_subscribers,post,sentiment,clean_post
0,2010-02-11 19:47:22,1265910442.0,analytics,So what do you guys all do related to analytic...,xtom,1.227476e+09,https://www.reddit.com/r/analytics/comments/b0...,7.0,4.0,0.0,,There's a lot of reasons to want to know all t...,NEGATIVE,theres lot reasons want know stuff figured id ...
1,2010-03-04 20:17:26,1267726646.0,analytics,"Google's Invasive, non-Anonymized Ad Targeting...",xtom,1.227476e+09,https://www.reddit.com/r/analytics/comments/b9...,2.0,1.0,0.0,,"I'm cross posting this from /r/cyberlaw, hopef...",NEGATIVE,im cross posting rcyberlaw hopefully guys find...
2,2011-01-06 04:51:18,1294282278.0,analytics,"DotCed - Functional Web Analytics - Tagging, R...",dotced,1.294282e+09,https://www.reddit.com/r/analytics/comments/ew...,1.0,1.0,,,"DotCed,a Functional Analytics Consultant, offe...",NEGATIVE,dotceda functional analytics consultant offeri...
3,2011-01-19 11:45:30,1295430330.0,analytics,Program Details - Data Analytics Course,iqrconsulting,1.288245e+09,https://www.reddit.com/r/analytics/comments/f5...,0.0,0.0,,,Here is the program details of the data analyt...,NEGATIVE,program details data analytics certification c...
4,2011-01-19 21:52:28,1295466748.0,analytics,potential job in web analytics... need to anal...,therewontberiots,1.278672e+09,https://www.reddit.com/r/analytics/comments/f5...,2.0,4.0,,,i decided grad school (physics) was not for me...,POSITIVE,decided grad school physics branching job mark...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
274234,2022-05-07 21:38:52,1651948732.0,rstats,Help interpretting lmer model output,seeking-stillness,,https://www.reddit.com/r/rstats/comments/ukjiy...,1.0,0.0,0.0,64078.0,Hello! I am wonder how the following output wo...,NEGATIVE,hello wonder following output would interprete...
274235,2022-05-07 22:13:52,1651950832.0,rstats,Medical stats book with R,Sweaty_Catch_4275,,https://www.reddit.com/r/rstats/comments/ukk7u...,1.0,0.0,0.0,64080.0,Can anybody recommend me a book with medical s...,POSITIVE,anybody recommend book medical statistics r th...
274236,2022-05-08 00:38:50,1651959530.0,rstats,Markov chains with unequal sequence lengths,sebelly,,https://www.reddit.com/r/rstats/comments/ukn1i...,1.0,0.0,0.0,64083.0,I'm trying to build a simple Markov chain. I h...,NEGATIVE,im trying build simple markov chain data thera...
274237,2022-05-08 01:19:00,1651961940.0,rstats,view all available Rcpp::plugins,BOBOLIU,,https://www.reddit.com/r/rstats/comments/uknuh...,1.0,0.0,0.0,64084.0,How do I view all available Rcpp::plugins? Tha...,POSITIVE,view available rcppplugins thanks


## 1. TF-IDF + Logistic Regression

We try this method first because it is fast and gives us a general idea of what to expect.

In [3]:
X = df['clean_post']
y = df['subreddit']

# train 70% test 30% split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [4]:
print(y_train.isnull().sum())  # check for NaN
print(y_test.isnull().sum())
print(y_train.unique())


0
0
['datascience' 'artificial' 'statistics' 'learnmachinelearning'
 'AskStatistics' 'DataScienceJobs' 'MachineLearning' 'dataengineering'
 'computervision' 'computerscience' 'MLQuestions' 'analytics' 'datasets'
 'kaggle' 'data' 'rstats' 'deeplearning' 'dataanalysis'
 'datascienceproject']


In [5]:
X_train = X_train.astype(str)
X_test = X_test.astype(str)

In [7]:
# vectorize TF-IDF
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))  # keep at most 5000 terms (unigrams and bigrams)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
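To make the weighting concrete, here is a small pure-Python sketch of TF-IDF. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of scikit-learn's defaults. Whitespace tokenization and unigrams only are simplifications, and `tfidf_vectors` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute L2-normalized TF-IDF weights per document (smoothed IDF)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = ["data science job", "data engineering pipeline", "markov chain data"]
vecs = tfidf_vectors(docs)
# "data" occurs in every document, so it is weighted lower than "science"
print(vecs[0])
```

A term that appears in every post, such as "data" here, carries little information about the subreddit, which is why it ends up with a smaller weight than rarer terms.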

In [8]:
print(type(X_train_tfidf))
print(type(X_test_tfidf))


<class 'scipy.sparse._csr.csr_matrix'>
<class 'scipy.sparse._csr.csr_matrix'>


**`lbfgs` -> "Limited-memory Broyden–Fletcher–Goldfarb–Shanno"**. It is a quasi-Newton optimization algorithm: it uses gradients to build a limited-memory approximation of the Hessian.

In [9]:
# train model
model = LogisticRegression(solver='lbfgs', max_iter=500, random_state=42)
model.fit(X_train_tfidf, y_train)


In [10]:
y_pred = model.predict(X_test_tfidf)

print("classification report:")
print(classification_report(y_test, y_pred))

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted') #weighted average

print(f"accuracy: {accuracy:.2f}")
print(f"weighted: {f1:.2f}")


classification report:




                      precision    recall  f1-score   support

       AskStatistics       0.49      0.58      0.53      9056
     DataScienceJobs       0.93      0.60      0.73       688
         MLQuestions       0.23      0.05      0.08      3410
     MachineLearning       0.45      0.58      0.51     11223
           analytics       0.73      0.59      0.65      2349
          artificial       0.57      0.40      0.47      2621
     computerscience       0.62      0.79      0.70      6712
      computervision       0.59      0.53      0.56      2925
                data       0.71      0.25      0.36       799
        dataanalysis       0.47      0.18      0.26      1214
     dataengineering       0.77      0.65      0.71      2467
         datascience       0.56      0.62      0.59     11168
  datascienceproject       0.00      0.00      0.00        76
            datasets       0.62      0.71      0.66      3442
        deeplearning       0.31      0.10      0.15      2432
       

We observe that `TF-IDF + Logistic Regression` reaches about **52% accuracy**. That is not a very satisfactory result, so we will try a more complex algorithm, `Random Forest`.
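The report above shows very uneven per-class scores (for example `datascienceproject` at 0.00 with only 76 test samples), so the choice of averaging matters. The sketch below computes per-class F1 by hand and compares macro and weighted averages on toy labels; the helper names and the toy data are assumptions for illustration.

```python
def f1_per_class(y_true, y_pred):
    """Return ({label: f1}, {label: support}) computed from raw counts."""
    scores, support = {}, {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
        support[label] = sum(t == label for t in y_true)
    return scores, support


def macro_f1(scores):
    """Unweighted mean: every class counts the same."""
    return sum(scores.values()) / len(scores)


def weighted_f1(scores, support):
    """Mean weighted by the number of true samples in each class."""
    total = sum(support.values())
    return sum(scores[label] * support[label] for label in scores) / total


# Toy example: the minority class "b" is never predicted.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
scores, support = f1_per_class(y_true, y_pred)
macro = macro_f1(scores)
weighted = weighted_f1(scores, support)
print(f"macro={macro:.3f} weighted={weighted:.3f}")
```

With these toy labels the weighted average stays high even though one class is missed completely, which is the same effect that lets small subreddits drop to zero without moving the weighted F1 much.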

## 2. TF-IDF + RandomForestClassifier

In [7]:
rf_model = RandomForestClassifier(n_estimators=100, n_jobs=-1, min_samples_split=10)

In [14]:
rf_model.fit(X_train_tfidf, y_train)

In [17]:
y_pred_rf = rf_model.predict(X_test_tfidf)

print("classification report:")
print(classification_report(y_test, y_pred_rf)) 

accuracy_rf = accuracy_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf, average='weighted') #weighted average

print(f"accuracy: {accuracy_rf:.5f}")
print(f"weighted: {f1_rf:.5f}")


classification report:
                      precision    recall  f1-score   support

       AskStatistics       0.45      0.62      0.52      9056
     DataScienceJobs       0.94      0.52      0.67       688
         MLQuestions       0.05      0.01      0.01      3410
     MachineLearning       0.40      0.57      0.47     11223
           analytics       0.71      0.53      0.60      2349
          artificial       0.52      0.39      0.44      2621
     computerscience       0.58      0.75      0.66      6712
      computervision       0.57      0.46      0.51      2925
                data       0.69      0.24      0.36       799
        dataanalysis       0.31      0.06      0.09      1214
     dataengineering       0.78      0.60      0.68      2467
         datascience       0.54      0.62      0.58     11168
  datascienceproject       0.12      0.04      0.06        76
            datasets       0.56      0.64      0.59      3442
        deeplearning       0.19      0.06     

Given the class distribution and the kind of data, **`RandomForestClassifier` predicts correctly about 48% of the time**.

Next we try a different approach, with features based on `Named-Entity Recognition`.

## 3. Named-Entity Recognition (NER)

In [21]:
!python -m spacy download en_core_web_md


Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)

In [8]:
nlp = spacy.load("en_core_web_md")

In [10]:
def extract_entities(text: str) -> list:

    doc = nlp(text)
    entities = [ent.label_ for ent in doc.ents]  # extract entity labels (e.g. PERSON, ORG)
    return entities

In [25]:
df['entities'] = df['clean_post'].apply(lambda x: extract_entities(str(x)))

Because `apply` processes the whole DataFrame row by row, running a spaCy model over long texts or a large dataset consumes a lot of memory and time, which makes the **Jupyter kernel crash**.


We import `swifter` to see whether it solves this problem, since it is a library that tries to speed up `apply` operations by vectorizing or parallelizing them automatically.
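As an alternative to a row-by-row `apply`, spaCy can stream texts in batches through `nlp.pipe`. The sketch below assumes `nlp` is an already loaded spaCy pipeline; `extract_entity_labels` is a hypothetical helper, and whether it avoids the crash on this dataset has not been verified here.

```python
def extract_entity_labels(nlp, texts, batch_size=256):
    """Return one list of entity labels per text, streaming through nlp.pipe."""
    labels = []
    # nlp.pipe yields Doc objects lazily and processes the texts in batches
    for doc in nlp.pipe(texts, batch_size=batch_size):
        labels.append([ent.label_ for ent in doc.ents])
    return labels


# Example call (assumes the model and DataFrame from this notebook):
# nlp = spacy.load("en_core_web_md")
# df["entities"] = extract_entity_labels(nlp, df["clean_post"].astype(str))
```

`nlp.pipe` also accepts an `n_process` argument for multiprocessing, which may help with run time at the cost of more memory.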

In [None]:
%pip install swifter


In [9]:
import swifter

df['entities'] = df['clean_post'].swifter.apply(lambda x: extract_entities(str(x)))

  from .autonotebook import tqdm as notebook_tqdm
Pandas Apply:  43%|████▎     | 116490/272248 [19:49<29:00, 89.49it/s] 

We try a small slice of the DataFrame to see how it behaves, and store the entities in a variable.

In [1]:
sample_df = df[:10]  # first 10 rows

entities_list = sample_df['clean_post'].apply(lambda x: extract_entities(str(x)))

entities_list


NameError: name 'df' is not defined

In [15]:
entities_list = sample_df['clean_post'].apply(lambda x: extract_entities(str(x))).tolist()
print(entities_list[:5])

[[], ['PERSON', 'PERSON', 'DATE', 'CARDINAL', 'ORDINAL', 'CARDINAL'], ['CARDINAL'], [], ['DATE', 'CARDINAL']]


In [19]:
entities_text = [' '.join(entities) for entities in entities_list]

print(entities_text)  


['', 'PERSON PERSON DATE CARDINAL ORDINAL CARDINAL', 'CARDINAL', '', 'DATE CARDINAL', '', 'ORG PERSON ORG GPE GPE PERSON ORG', '', '', 'ORG DATE CARDINAL']


In [17]:
# Vectorize the entity labels
vectorizer = CountVectorizer()
X_entities = vectorizer.fit_transform(entities_text)

# Inspect the resulting matrix
print(X_entities.toarray())  
print(vectorizer.get_feature_names_out())  


[[0 0 0 0 0 0]
 [2 1 0 1 0 2]
 [1 0 0 0 0 0]
 [0 0 0 0 0 0]
 [1 1 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 2 0 3 2]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [1 1 0 0 1 0]]
['cardinal' 'date' 'gpe' 'ordinal' 'org' 'person']
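Each row of the matrix above is simply a count of how often every entity label appears in a post. The following sketch reproduces the second row by hand with `collections.Counter`; the label list and vocabulary are copied from the outputs above.

```python
from collections import Counter

# Entity labels extracted for the second post in the sample
entities = ['PERSON', 'PERSON', 'DATE', 'CARDINAL', 'ORDINAL', 'CARDINAL']
# Vocabulary learned by CountVectorizer (lowercased, alphabetical order)
vocabulary = ['cardinal', 'date', 'gpe', 'ordinal', 'org', 'person']

counts = Counter(label.lower() for label in entities)
row = [counts[term] for term in vocabulary]
print(row)  # [2, 1, 0, 1, 0, 2]
```

Note that posts without any recognized entity become all-zero rows, so a classifier trained only on these features has nothing to distinguish them by.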


In [None]:
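# store the joined entity labels as a column, then vectorize that column
sample_df = sample_df.assign(entities_text=entities_text)
vectorizer = CountVectorizer()
X_entities = vectorizer.fit_transform(sample_df['entities_text'])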

Now that it works for a few rows, we try again on the full dataset.

In [22]:
nlp = spacy.load("en_core_web_md", disable=["parser", "tagger"])  # disable components not needed for NER to speed things up

def extract_entities(text):
    doc = nlp(text)
    entities = [ent.label_ for ent in doc.ents]  
    return entities



In [23]:
entities_list = df['clean_post'].apply(lambda x: extract_entities(str(x))).tolist()



## 4. Word Embeddings + machine learning classification algorithm

Because `gensim` failed to install for reasons unrelated to pip itself, this part does not use `import gensim.downloader as api`.

Instead, we download the glove.6B.zip archive from **[GloVe](https://nlp.stanford.edu/projects/glove/)** and use the 50-dimensional version, `glove.6B.50d.txt`, because it is lighter and faster. Note that the cells below initialize the embedding matrix randomly and do not yet load these vectors.
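A minimal sketch of how the vectors in `glove.6B.50d.txt` could be read into an embedding matrix, assuming the usual plain-text layout of one word followed by its space-separated floats per line. `load_glove` and `build_embedding_matrix` are hypothetical helpers, and `word_index` is assumed to have the same shape as `tokenizer.word_index` (word to positive integer index).

```python
import numpy as np


def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.array(
                [float(value) for value in parts[1:]], dtype="float32"
            )
    return embeddings


def build_embedding_matrix(word_index, embeddings, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, idx in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix


# Example call (paths and names are assumptions):
# glove = load_glove("glove.6B.50d.txt")
# embedding_matrix = build_embedding_matrix(tokenizer.word_index, glove, 50)
```

Row 0 is left as zeros on purpose, since index 0 is the padding position.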

In [12]:
X = df['clean_post']
y = df['subreddit']

# train 70% test 30% split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [13]:
# map each word to a unique integer token
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)

In [14]:
# convert the texts to sequences of integers
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

In [15]:
# pad (or truncate) so that all sequences have the same length
maxlen = 200  # fixed sequence length
X_train_padded = pad_sequences(X_train_seq, maxlen=maxlen, padding="post", truncating="post")
X_test_padded = pad_sequences(X_test_seq, maxlen=maxlen, padding="post", truncating="post")

In [16]:
# Get the vocabulary size
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding

In [18]:
# Create random embeddings (initialization)
embedding_dim = 50  # embedding dimension
embedding_matrix = np.random.uniform(-1, 1, (vocab_size, embedding_dim))

# Wrap the matrix in a Keras Embedding layer (optional)
embedding_layer = Embedding(
    input_dim=vocab_size,        # vocabulary size
    output_dim=embedding_dim,   # dimension of each embedding vector
    weights=[embedding_matrix], # initialize with the random matrix
    trainable=True              # flag only matters if the layer is trained inside a model
)

In [19]:
# Get the final embedding for each text
X_train_embeddings = embedding_layer(X_train_padded).numpy().mean(axis=1)  # average the token vectors
X_test_embeddings = embedding_layer(X_test_padded).numpy().mean(axis=1)
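The plain `mean(axis=1)` above also averages the padding positions (index 0), which dilutes short posts. A possible refinement, sketched here with NumPy only, is to average over the real tokens; `masked_mean` is a hypothetical helper, and the tiny matrix is made-up data for illustration.

```python
import numpy as np


def masked_mean(padded, embedding_matrix):
    """Average embedding vectors per row, ignoring padding index 0."""
    vectors = embedding_matrix[padded]            # shape (n, maxlen, dim)
    mask = (padded != 0)[..., np.newaxis]         # shape (n, maxlen, 1)
    counts = np.maximum(mask.sum(axis=1), 1)      # avoid division by zero
    return (vectors * mask).sum(axis=1) / counts


# Toy embedding matrix: row 0 is the padding vector and should be ignored
matrix = np.array([[9.0, 9.0], [1.0, 3.0], [3.0, 5.0]])
padded = np.array([[1, 2, 0, 0], [0, 0, 0, 0]])
pooled = masked_mean(padded, matrix)
print(pooled)
```

The first row averages only tokens 1 and 2, and the all-padding second row falls back to a zero vector instead of the padding embedding.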

In [20]:
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_train_embeddings, y_train)

In [21]:
y_pred = clf.predict(X_test_embeddings)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"F1-Score (Weighted): {f1}")
print("\nClasification Report:\n", classification_report(y_test, y_pred))

F1-Score (Weighted): 0.20587179229134223





Clasification Report:
                       precision    recall  f1-score   support

       AskStatistics       0.25      0.28      0.26      9056
     DataScienceJobs       0.89      0.06      0.11       688
         MLQuestions       0.06      0.00      0.00      3410
     MachineLearning       0.21      0.47      0.29     11223
           analytics       0.34      0.02      0.04      2349
          artificial       0.45      0.08      0.13      2621
     computerscience       0.23      0.23      0.23      6712
      computervision       0.37      0.03      0.06      2925
                data       0.73      0.14      0.23       799
        dataanalysis       0.00      0.00      0.00      1214
     dataengineering       0.36      0.03      0.05      2467
         datascience       0.27      0.58      0.37     11168
  datascienceproject       0.00      0.00      0.00        76
            datasets       0.31      0.16      0.21      3442
        deeplearning       0.33      0.00    



In [23]:
# Train the Random Forest model
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, min_samples_split=10, random_state=42)
clf.fit(X_train_embeddings, y_train)

# Predict
y_pred = clf.predict(X_test_embeddings)

# Evaluate
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1-Score (Weighted): {f1}")

# Classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))


F1-Score (Weighted): 0.19345899029641456

Classification Report:
                       precision    recall  f1-score   support

       AskStatistics       0.22      0.21      0.22      9056
     DataScienceJobs       0.86      0.36      0.51       688
         MLQuestions       0.05      0.01      0.01      3410
     MachineLearning       0.20      0.44      0.28     11223
           analytics       0.28      0.01      0.02      2349
          artificial       0.15      0.02      0.03      2621
     computerscience       0.20      0.17      0.19      6712
      computervision       0.17      0.01      0.02      2925
                data       0.66      0.20      0.30       799
        dataanalysis       0.14      0.01      0.02      1214
     dataengineering       0.30      0.01      0.03      2467
         datascience       0.26      0.54      0.35     11168
  datascienceproject       0.17      0.03      0.05        76
            datasets       0.32      0.14      0.20      3442
   

In [24]:
# try to improve the score by balancing the class weights
clf = RandomForestClassifier(
    n_estimators=100, 
    n_jobs=-1, 
    min_samples_split=10, 
    random_state=42, 
    class_weight="balanced"
)
clf.fit(X_train_embeddings, y_train)

# Predict
y_pred = clf.predict(X_test_embeddings)

# Evaluate
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1-Score (Weighted): {f1}")

# Classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))

F1-Score (Weighted): 0.20047570630988357

Classification Report:
                       precision    recall  f1-score   support

       AskStatistics       0.20      0.20      0.20      9056
     DataScienceJobs       0.59      0.43      0.50       688
         MLQuestions       0.05      0.02      0.03      3410
     MachineLearning       0.23      0.25      0.24     11223
           analytics       0.17      0.05      0.07      2349
          artificial       0.19      0.10      0.13      2621
     computerscience       0.18      0.26      0.21      6712
      computervision       0.13      0.04      0.06      2925
                data       0.40      0.21      0.27       799
        dataanalysis       0.12      0.03      0.05      1214
     dataengineering       0.21      0.05      0.08      2467
         datascience       0.29      0.44      0.35     11168
  datascienceproject       0.08      0.05      0.06        76
            datasets       0.17      0.36      0.23      3442
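
To see what `class_weight="balanced"` changes, the sketch below computes the weights by hand, assuming the formula `n_samples / (n_classes * count_per_class)`, which is my understanding of scikit-learn's documented behavior. `balanced_class_weights` is a hypothetical helper and the labels are toy data.

```python
import numpy as np


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


# Toy example: class "b" is four times rarer than class "a"
weights = balanced_class_weights(["a"] * 8 + ["b"] * 2)
print(weights)
```

Rare classes get proportionally larger weights, which tends to raise their recall at the expense of precision on the frequent classes, in line with the shift visible between the two Random Forest reports.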
   

## Named-Entity Recognition (NER)