# **Case study 1: Tweet analysis**

Analysis of the tweets in this dataset: https://github.com/garnachod/TwitterSentimentDataset

## 1. Load and explore the data

In [None]:
import pandas as pd

dataset = pd.read_csv("datos/tweets_en_es.csv")
# Shuffle the rows: frac=1.0 samples the whole dataset in random order
dataset = dataset.sample(frac=1.0)
dataset

In [None]:
import rubrix as rb

records = [
    rb.TextClassificationRecord(
        inputs=row.text,
        metadata={"source": row.source}
    )
    for i,row in dataset[0:5000].iterrows()
]

rb.delete(name="tuits_en_es")

output = rb.log(records, name="tuits_en_es")

## 2. Sentiment analysis

We will use `pysentimiento` (https://github.com/pysentimiento/pysentimiento), which was originally trained on tweet datasets.

Created by **Juan Manuel Pérez** (https://twitter.com/perezjotaeme).


In [None]:
from pysentimiento import SentimentAnalyzer
from pysentimiento.preprocessing import preprocess_tweet

analyzer = SentimentAnalyzer(lang="es")
analyzer.predict(dataset.iloc[1].text).probas.items()
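
`probas` maps each sentiment label to its probability. A minimal sketch of turning such a mapping into a single decision, assuming a plain dict shaped like `probas`. The scores below are invented for illustration, and `top_label` is a hypothetical helper, not part of `pysentimiento`:

```python
def top_label(probas):
    """Return the (label, probability) pair with the highest probability."""
    return max(probas.items(), key=lambda item: item[1])


# Made-up scores shaped like the `probas` dict of a prediction
example_probas = {"NEG": 0.05, "NEU": 0.15, "POS": 0.80}

label, score = top_label(example_probas)
print(label, score)
```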

In [None]:
records = [
    rb.TextClassificationRecord(
        inputs=preprocess_tweet(row.text),
        metadata={"source": row.source},
        # (label, probability) pairs for every sentiment class
        prediction=list(analyzer.predict(preprocess_tweet(row.text)).probas.items()),
    )
    for i, row in dataset[0:100].iterrows()
]
rb.delete(name="tuits_en_es")
_ = rb.log(records, name="tuits_en_es")

In [None]:
from rubrix.metrics.text_classification import f1

# Optionally restrict the metric to a subset, e.g. query="metadata.source:tweets_pos_clean.txt"
f1(name="tuits_en_es")
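
F1 compares predicted labels against annotated ones, so the metric only becomes meaningful once some records have been annotated. To make the definition concrete, here is a minimal pure-Python sketch of per-label and macro-averaged F1. It illustrates the formula only and is not Rubrix's implementation. The helper name and the toy labels are hypothetical:

```python
def f1_scores(annotations, predictions):
    """Return per-label F1 and the macro average for two parallel label lists."""
    pairs = list(zip(annotations, predictions))
    per_label = {}
    for label in sorted(set(annotations) | set(predictions)):
        tp = sum(1 for gold, pred in pairs if gold == label and pred == label)
        fp = sum(1 for gold, pred in pairs if gold != label and pred == label)
        fn = sum(1 for gold, pred in pairs if gold == label and pred != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        # F1 is the harmonic mean of precision and recall (0 when both are 0)
        per_label[label] = 2 * precision * recall / denom if denom else 0.0
    macro = sum(per_label.values()) / len(per_label)
    return per_label, macro


# Toy data: four annotated tweets and the labels a model predicted for them
annotations = ["POS", "NEG", "NEU", "POS"]
predictions = ["POS", "NEG", "POS", "NEG"]

per_label, macro = f1_scores(annotations, predictions)
print(per_label, round(macro, 3))
```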

## 3. Emotion analysis

Created by **Juan Manuel Pérez** (https://twitter.com/perezjotaeme).

Now we will use Hugging Face directly.

In [None]:
from transformers import pipeline
from datasets import Dataset

classifier = pipeline(
    "text-classification",
    model="finiteautomata/beto-emotion-analysis", 
    return_all_scores=True
)

dataset = Dataset.from_csv("datos/tweets_en_es.csv")
dataset = dataset.select(range(0,100)) 

In [None]:
rb.delete("tweets_en_es_emocion")
classifier = rb.monitor(classifier, dataset="tweets_en_es_emocion", sample_rate=1.0)

In [None]:
dataset.map(
    lambda r: {"prediction": classifier(r["text"])},
)

## 4. Text categorization ("zero-shot" model)


See the tutorial: https://rubrix.readthedocs.io/en/stable/tutorials/zeroshot_data_annotation.html

In [None]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification", 
                       model="Recognai/zeroshot_selectra_medium")

labels = ["amistad", "política", "videojuegos", "deporte", "comida", "famosos", "música"]
template = "Este mensaje es sobre {}"

classifier = rb.monitor(classifier, dataset="tweets_en_es_categorizacion", sample_rate=1.0)

rb.delete("tweets_en_es_categorizacion")
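
In zero-shot classification, each candidate label is inserted into the hypothesis template, and the model scores how well the input text supports each resulting sentence. A small sketch of the sentences this template produces, using plain string formatting and no model call:

```python
labels = ["amistad", "política", "videojuegos", "deporte", "comida", "famosos", "música"]
template = "Este mensaje es sobre {}"

# One hypothesis per candidate label, built by filling the {} placeholder
hypotheses = [template.format(label) for label in labels]

for hypothesis in hypotheses:
    print(hypothesis)
```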

In [12]:
classifier("Te quiero amiga!", candidate_labels=labels, hypothesis_template=template)

{'sequence': 'Te quiero amiga!',
 'labels': ['amistad',
  'famosos',
  'música',
  'política',
  'deporte',
  'comida',
  'videojuegos'],
 'scores': [0.9217610955238342,
  0.02147088013589382,
  0.016207056120038033,
  0.014193572103977203,
  0.010944677516818047,
  0.008591394871473312,
  0.0068312352523207664]}
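
The output pairs each label with a score, sorted from highest to lowest. A minimal sketch of keeping a prediction only when the model is confident enough. `confident_label` and the 0.5 threshold are hypothetical choices, and the dict is a rounded, truncated copy of the output above:

```python
def confident_label(result, threshold=0.5):
    """Return the best-scoring label if its score reaches the threshold, else None."""
    # Pick the best pair explicitly instead of relying on the list order
    label, score = max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])
    return label if score >= threshold else None


result = {
    "sequence": "Te quiero amiga!",
    "labels": ["amistad", "famosos", "música"],
    "scores": [0.9218, 0.0215, 0.0162],
}

print(confident_label(result))
print(confident_label(result, threshold=0.95))
```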

In [14]:
dataset.map(
    lambda r: {"prediction": classifier(r["text"], candidate_labels=labels, hypothesis_template=template)}
)

Dataset({
    features: ['Unnamed: 0', 'text', 'source', 'prediction'],
    num_rows: 100
})

## 5. Named entity recognition with spaCy

In [None]:
!python -m spacy download es_core_news_md

In [27]:
import spacy
import rubrix as rb

nlp = spacy.load("es_core_news_md")
nlp = rb.monitor(nlp, dataset="tuits_ner")

rb.delete("tuits_ner")

dataset = Dataset.from_csv("tweets_en_es.csv")
dataset = dataset.select(range(0,1000)) 


In [28]:
def process_record(r):
    # Run the monitored pipeline; rb.monitor logs the result to "tuits_ner"
    doc = nlp(r["text"])
    return {"processed": True}

dataset.map(process_record)

Dataset({
    features: ['Unnamed: 0', 'text', 'source', 'processed'],
    num_rows: 1000
})
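
`process_record` only runs the monitored pipeline and discards `doc`. To inspect entities locally, character spans are enough. A sketch assuming entities arrive as `(start_char, end_char, label)` tuples, the same information spaCy exposes through `ent.start_char`, `ent.end_char`, and `ent.label_`. The sentence, the spans, and the helper name are invented for illustration:

```python
from collections import defaultdict


def group_entities(text, spans):
    """Group entity surface forms by label from (start_char, end_char, label) spans."""
    grouped = defaultdict(list)
    for start, end, label in spans:
        grouped[label].append(text[start:end])
    return dict(grouped)


# Hypothetical sentence with hand-written entity spans
text = "Messi jugó en Barcelona"
spans = [(0, 5, "PER"), (14, 23, "LOC")]

print(group_entities(text, spans))
```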

## 6. How do you train a model once the data is labeled?

See the full tutorial at: https://rubrix.readthedocs.io/en/stable/tutorials/01-labeling-finetuning.html

![Labeling workflow](https://rubrix.readthedocs.io/en/stable/_images/workflow.svg "Labeling workflow")

### Preparing the training and test data

In [29]:
rb_df = rb.load(
    name='tweets_en_es_emocion', 
    query="status:Validated"
)

In [30]:
rb_df

In [None]:
from datasets import Dataset

# select text input and the annotated label
rb_df['text'] = rb_df.inputs.transform(lambda r: r['text'])
rb_df['labels'] = rb_df.annotation


# create 🤗 dataset from pandas with labels as numeric ids
# every annotated label must appear in this list, otherwise the lookup below raises a KeyError
# ("anger" is assumed to be among the emotion model's labels)
emotions = ["joy", "sadness", "fear", "others", "surprise", "disgust", "anger"]
label2id = {label: i for i, label in enumerate(emotions)}
train_ds = Dataset.from_pandas(rb_df[['text', 'labels']])
train_ds = train_ds.map(lambda example: {'labels': label2id[example['labels']]})

# hold out 20% of the records as the test split
train_ds = train_ds.train_test_split(test_size=0.2)
train_ds
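
Numeric ids have to be mapped back to label names when reading model predictions, and any annotated label missing from `label2id` would raise a `KeyError` inside `map`. A minimal sketch of both directions, using a short illustrative label list and hypothetical helper names:

```python
# Illustrative subset of emotion labels, not the full label set
emotion_labels = ["joy", "sadness", "fear"]

label2id = {label: i for i, label in enumerate(emotion_labels)}
# Inverse mapping, used to turn predicted ids back into label names
id2label = {i: label for label, i in label2id.items()}


def encode(label):
    """Map a label name to its id, failing with a clear message on unknown labels."""
    if label not in label2id:
        raise KeyError(f"Unknown label {label!r}; add it to emotion_labels")
    return label2id[label]


print(encode("fear"), id2label[0])
```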

From this point on, you can follow the standard process with the Hugging Face `Trainer` (see the tutorial).