# **Case study 1: Tweet analysis**

Analysis of tweets from the dataset at https://github.com/garnachod/TwitterSentimentDataset

## 1. Load and explore the data

In [12]:
import pandas as pd

dataset = pd.read_csv("datos/tweets_en_es.csv")
dataset = dataset.sample(frac=1.0) ; dataset

Unnamed: 0.1,Unnamed: 0,text,source
99233,43873,1509. Hace UN SIGLO que no hablamos :(\n,tweets_neg_clean.txt
89888,34528,Suelen ser los que más ensucian :( https://t.c...,tweets_neg_clean.txt
194909,17333,Cuando Consuelo Císcar se creyó Marco Polo htt...,tweets_clean.txt
130703,75343,La única que me manda snaps es Kiara y no son ...,tweets_neg_clean.txt
143792,88432,@Reynerosrvng Ah! .... q triste :(\n,tweets_neg_clean.txt
...,...,...,...
19572,19572,@camilacabello97 mijaaa llegaste a los 2 millo...,tweets_pos_clean.txt
47372,47372,al final todo va a estar bien :) http://t.co/K...,tweets_pos_clean.txt
211488,33912,@MMiguelj30 No la conozco. La apunto a la list...,tweets_clean.txt
130776,75416,EXTRAÑO A MIS AMIGAS :( @PauManavella00 @taaru...,tweets_neg_clean.txt
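
The `source` column records which file each tweet came from. The sketch below shows how that column could be turned into a weak sentiment label. It assumes that `tweets_pos_clean.txt` and `tweets_neg_clean.txt` hold positive and negative tweets respectively, which is inferred only from the file names and the emoticons in the sample above. `SOURCE_TO_LABEL` and `weak_label` are hypothetical names introduced for illustration.

```python
from collections import Counter

# Hypothetical mapping: the polarity of each file is an assumption
# inferred from its name, not a documented guarantee.
SOURCE_TO_LABEL = {
    "tweets_pos_clean.txt": "POS",
    "tweets_neg_clean.txt": "NEG",
}


def weak_label(source):
    """Return a weak sentiment label for a source file, or None if unknown."""
    return SOURCE_TO_LABEL.get(source)


# Source values copied from the nine rows displayed above.
sources = [
    "tweets_neg_clean.txt", "tweets_neg_clean.txt", "tweets_clean.txt",
    "tweets_neg_clean.txt", "tweets_neg_clean.txt", "tweets_pos_clean.txt",
    "tweets_pos_clean.txt", "tweets_clean.txt", "tweets_neg_clean.txt",
]

# Count how many of the displayed tweets fall under each weak label.
label_counts = Counter(weak_label(s) for s in sources)
print(label_counts)
```

Tweets from `tweets_clean.txt` map to `None` here, since nothing in the file name suggests a polarity.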


In [13]:
import rubrix as rb

records = [
    rb.TextClassificationRecord(
        inputs=row.text,
        metadata={"source": row.source}
    )
    for i,row in dataset[0:5000].iterrows()
]

# Remove any previous version of the dataset, then upload the records
rb.delete(name="tuits_en_es")

output = rb.log(records, name="tuits_en_es")

In [15]:
rb.load("tuits_en_es", query="status:Validated")

Unnamed: 0,inputs,prediction,annotation,prediction_agent,annotation_agent,multi_label,explanation,id,metadata,status,event_timestamp
0,{'text': 'Personas lindas con camisas lindas :...,,Negativo,,rubrix,False,,0002390b-cc0c-4c14-8f7c-1b3b61fa45dc,{'source': 'tweets_neg_clean.txt'},Validated,
1,{'text': 'El Arte de Aplaudir con las nalgas :...,,Positivo,,rubrix,False,,000e680f-5981-4ad8-9d8b-f84f666cfa91,{'source': 'tweets_pos_clean.txt'},Validated,
2,{'text': 'Necesito que alguien vea Mr Robot y ...,,Negativo,,rubrix,False,,0018a876-c4f4-4059-b93b-2521bcd264b9,{'source': 'tweets_neg_clean.txt'},Validated,


## 2. Análisis de sentimiento

We will use `pysentimiento` (https://github.com/pysentimiento/pysentimiento), which was originally trained on tweet datasets.

Created by **Juan Manuel Pérez** (https://twitter.com/perezjotaeme).


In [16]:
from pysentimiento import SentimentAnalyzer
from pysentimiento.preprocessing import preprocess_tweet

analyzer = SentimentAnalyzer(lang="es")
analyzer.predict(dataset.iloc[1].text).probas.items()

dict_items([('NEG', 0.9988141059875488), ('NEU', 0.0006684837280772626), ('POS', 0.0005174684920348227)])
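
The analyzer returns one probability per label. A minimal sketch of how a single label could be picked from that dictionary is shown below; the values are copied from the output above, while `top_label` and its `threshold` parameter are hypothetical helpers, not part of `pysentimiento`.

```python
# Probabilities copied from the output above.
probas = {
    "NEG": 0.9988141059875488,
    "NEU": 0.0006684837280772626,
    "POS": 0.0005174684920348227,
}


def top_label(probas, threshold=0.0):
    """Return the most probable label, or None if its score is below threshold."""
    label = max(probas, key=probas.get)
    return label if probas[label] >= threshold else None


print(top_label(probas))          # NEG
print(top_label(probas, 0.9999))  # None
```

A threshold like this could be used to send only low-confidence tweets to manual review.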

In [17]:
prediction = analyzer.predict("Hola me llamo Daniel y estoy contento")
prediction

SentimentOutput(output=POS, probas={POS: 0.999, NEG: 0.001, NEU: 0.001})

In [23]:
records = [
    rb.TextClassificationRecord(
        inputs=preprocess_tweet(row.text),
        metadata={"source": row.source},
        # To also store the top label as a validated annotation, add:
        # annotation=analyzer.predict(preprocess_tweet(row.text)).output,
        prediction=list(analyzer.predict(preprocess_tweet(row.text)).probas.items()),
    )
    for i, row in dataset[0:100].iterrows()
]
# Delete any previous version of the dataset that is logged below
rb.delete(name="tuits_en_es_nuevo")
_ = rb.log(records, name="tuits_en_es_nuevo")
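
`preprocess_tweet` normalizes each tweet before it reaches the model. The sketch below is not the library's implementation; it only illustrates the kind of normalization involved (masking user handles and URLs, collapsing whitespace). The function name `rough_preprocess` and the placeholder tokens `@usuario` and `url` are assumptions made for this example.

```python
import re

URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")


def rough_preprocess(text, user_token="@usuario", url_token="url"):
    """Mask URLs and user handles, then collapse runs of whitespace."""
    text = URL_RE.sub(url_token, text)
    text = USER_RE.sub(user_token, text)
    return " ".join(text.split())


# Tweets taken from the sample shown in section 1.
print(rough_preprocess("@Reynerosrvng Ah! .... q triste :(\n"))
print(rough_preprocess("al final todo va a estar bien :) http://t.co/K"))
```

Masking handles and links keeps the model from keying on specific accounts or shortened URLs, which carry no sentiment on their own.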

In [21]:
from rubrix.metrics.text_classification import f1

f1(name="tuits_en_es")

MetricSummary(data={'micro': 0.8888888888888888, 'macro': 0.6363636363636364, 'per_label': {'POS': 1.0, 'NEU': 0.0, 'NEG': 0.9090909090909091}})
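
To make the summary easier to read: the macro score is the unweighted mean of the per-label scores, while the micro score pools true positives, false positives and false negatives across all labels. The sketch below uses the textbook F1 definition on a toy example; it is not Rubrix's implementation, and `f1_report` is a hypothetical helper.

```python
def f1_report(gold, pred):
    """Compute per-label, micro and macro F1 for single-label predictions."""
    labels = sorted(set(gold) | set(pred))
    per_label = {}
    total_tp = total_fp = total_fn = 0
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the label never occurs
        per_label[label] = 2 * tp / denom if denom else 0.0
        total_tp += tp
        total_fp += fp
        total_fn += fn
    micro_denom = 2 * total_tp + total_fp + total_fn
    micro = 2 * total_tp / micro_denom if micro_denom else 0.0
    macro = sum(per_label.values()) / len(per_label) if per_label else 0.0
    return {"micro": micro, "macro": macro, "per_label": per_label}


# Toy data: the only NEU tweet is misclassified as NEG.
gold = ["POS", "NEG", "NEG", "NEU"]
pred = ["POS", "NEG", "NEG", "NEG"]
report = f1_report(gold, pred)
print(report)

# The macro value reported above is the mean of its three per-label scores.
reported = {"POS": 1.0, "NEU": 0.0, "NEG": 0.9090909090909091}
reported_macro = sum(reported.values()) / len(reported)
print(reported_macro)
```

A label that is never predicted correctly, such as `NEU` here, pulls the macro score down much more than the micro score.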

## 3. Emotion analysis

Created by **Juan Manuel Pérez** (https://twitter.com/perezjotaeme).

Now we will use Hugging Face directly.

In [24]:
from transformers import pipeline
from datasets import Dataset

classifier = pipeline(
    "text-classification",
    model="finiteautomata/beto-emotion-analysis", 
    return_all_scores=True
)

dataset = Dataset.from_csv("datos/tweets_en_es.csv")
dataset = dataset.select(range(0,100)) 

loading configuration file https://huggingface.co/finiteautomata/beto-emotion-analysis/resolve/main/config.json from cache at /Users/dani/.cache/huggingface/transformers/b75b62ad64772a1df4c46943b8729e726a8bcc147e845197d591ebde2d1430b2.516dbdb8064a2498055e9af6e5d92ae5a4583bcefe0ca71535c71f42cf513138
Model config BertConfig {
  "_name_or_path": "dccuchile/bert-base-spanish-wwm-cased",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "others",
    "1": "joy",
    "2": "sadness",
    "3": "anger",
    "4": "surprise",
    "5": "disgust",
    "6": "fear"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "anger": 3,
    "disgust": 5,
    "fear": 6,
    "joy": 1,
    "others": 0,
    "sadness": 2,
    "surprise": 4
  },
  "layer_norm_eps": 1e-12,

In [None]:
rb.delete("tweets_en_es_emocion")

classifier = rb.monitor(classifier, dataset="tweets_en_es_emocion", sample_rate=1.0)

In [None]:
dataset.map(
    lambda r: {"prediction": classifier(r["text"])},
)
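
With `return_all_scores=True`, the pipeline is expected to return one score per label instead of only the best one. The sketch below assumes that the scores for a single tweet arrive as a list of `{"label", "score"}` dictionaries (the exact nesting can differ between `transformers` versions) and turns them into ranked `(label, score)` pairs. The scores are made up for illustration, and `rank_scores` is a hypothetical helper.

```python
# Illustrative scores only; they are not real model output.
# The label names come from the model configuration printed above.
all_scores = [
    {"label": "others", "score": 0.05},
    {"label": "joy", "score": 0.80},
    {"label": "sadness", "score": 0.06},
    {"label": "anger", "score": 0.03},
    {"label": "surprise", "score": 0.03},
    {"label": "disgust", "score": 0.02},
    {"label": "fear", "score": 0.01},
]


def rank_scores(all_scores):
    """Return (label, score) pairs sorted from most to least probable."""
    pairs = [(item["label"], item["score"]) for item in all_scores]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranked = rank_scores(all_scores)
print(ranked[0])
```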

## 4. Text categorization ("zero-shot" model)


See the tutorial: https://rubrix.readthedocs.io/en/stable/tutorials/zeroshot_data_annotation.html

In [None]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification", 
                       model="Recognai/zeroshot_selectra_medium")

labels = ["amistad", "política", "videojuegos", "deporte", "comida", "famosos", "música"]
template = "Este mensaje es sobre {}"

# Start from an empty dataset, then log every prediction (sample_rate=1.0)
rb.delete("tweets_en_es_categorizacion")

classifier = rb.monitor(classifier, dataset="tweets_en_es_categorizacion", sample_rate=1.0)

In [12]:
classifier("Te quiero amiga!", candidate_labels=labels, hypothesis_template=template)

{'sequence': 'Te quiero amiga!',
 'labels': ['amistad',
  'famosos',
  'música',
  'política',
  'deporte',
  'comida',
  'videojuegos'],
 'scores': [0.9217610955238342,
  0.02147088013589382,
  0.016207056120038033,
  0.014193572103977203,
  0.010944677516818047,
  0.008591394871473312,
  0.0068312352523207664]}
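
The zero-shot pipeline scores each candidate label by filling it into the hypothesis template and checking whether the tweet entails the resulting sentence. The sketch below shows the hypotheses built from the template and how the output above could be filtered; the result dictionary is copied from that output, while `min_score` is a hypothetical cut-off chosen for illustration.

```python
labels = ["amistad", "política", "videojuegos", "deporte", "comida", "famosos", "música"]
template = "Este mensaje es sobre {}"

# One hypothesis per candidate label, as produced by the template.
hypotheses = [template.format(label) for label in labels]
print(hypotheses[0])

# Result copied from the cell output above.
result = {
    "sequence": "Te quiero amiga!",
    "labels": ["amistad", "famosos", "música", "política",
               "deporte", "comida", "videojuegos"],
    "scores": [0.9217610955238342, 0.02147088013589382, 0.016207056120038033,
               0.014193572103977203, 0.010944677516818047,
               0.008591394871473312, 0.0068312352523207664],
}

# Keep only the labels whose score reaches the (hypothetical) cut-off.
min_score = 0.5
confident = [
    label
    for label, score in zip(result["labels"], result["scores"])
    if score >= min_score
]
print(confident)
```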

In [14]:
dataset.map(
    lambda r: {"prediction": classifier(r["text"], candidate_labels=labels, hypothesis_template=template)}
)

Dataset({
    features: ['Unnamed: 0', 'text', 'source', 'prediction'],
    num_rows: 100
})

## 5. Named entity recognition with spaCy

In [None]:
!python -m spacy download es_core_news_md

In [25]:
import spacy
import rubrix as rb

from datasets import Dataset

nlp = spacy.load("es_core_news_md")
nlp = rb.monitor(nlp, dataset="tuits_ner")

rb.delete("tuits_ner")

dataset = Dataset.from_csv("datos/tweets_en_es.csv")
dataset = dataset.select(range(0,1000)) 

Using custom data configuration default-a85a1ff4f4fab58b
Reusing dataset csv (/Users/dani/.cache/huggingface/datasets/csv/default-a85a1ff4f4fab58b/0.0.0)


In [26]:
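def process_record(r):
    # Calling the monitored pipeline is what logs the entities to Rubrix;
    # the returned doc is not needed here.
    nlp(r["text"])
    return {"processed": True}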

dataset.map(process_record)

Loading cached processed dataset at /Users/dani/.cache/huggingface/datasets/csv/default-a85a1ff4f4fab58b/0.0.0/cache-1c80317fa3b1799d.arrow


Dataset({
    features: ['Unnamed: 0', 'text', 'source', 'prediction'],
    num_rows: 100
})
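
Named entities are usually exchanged as character spans over the original text; spaCy exposes them through `ent.start_char`, `ent.end_char` and `ent.label_`. The sketch below assumes spans are stored as `(label, start, end)` tuples with an exclusive end offset, which is how they are understood to be logged here. The entity labels are hypothetical rather than real model output, and `find_span` is a helper introduced for this example.

```python
def find_span(text, surface, label):
    """Locate `surface` in `text` and return a (label, start, end) span."""
    start = text.find(surface)
    if start == -1:
        raise ValueError(f"{surface!r} not found in text")
    return (label, start, start + len(surface))


# Visible part of a tweet from the sample in section 1.
text = "Cuando Consuelo Císcar se creyó Marco Polo"

# Hypothetical annotations; a real run would take them from the spaCy doc.
entities = [
    find_span(text, "Consuelo Císcar", "PER"),
    find_span(text, "Marco Polo", "PER"),
]

for label, start, end in entities:
    # Slicing with the offsets recovers the entity text exactly.
    print(label, text[start:end])
```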

## 6. How do you train a model once the data is labeled?

See the full tutorial at: https://rubrix.readthedocs.io/en/stable/tutorials/01-labeling-finetuning.html

![Labeling workflow](https://rubrix.readthedocs.io/en/stable/_images/workflow.svg "Labeling workflow")

### Preparing the training and test data

In [29]:
rb_df = rb.load(
    name='tweets_en_es_emocion', 
    query="status:Validated"
)

In [30]:
rb_df

In [None]:
from datasets import Dataset

# select text input and the annotated label
rb_df['text'] = rb_df.inputs.transform(lambda r: r['text'])
rb_df['labels'] = rb_df.annotation


# create 🤗 dataset from pandas with labels as numeric ids
# "anger" is included because the emotion model in section 3 can also predict it
label2id = {label: i for i, label in enumerate(["joy", "sadness", "fear", "others", "surprise", "disgust", "anger"])}
train_ds = Dataset.from_pandas(rb_df[['text', 'labels']])
train_ds = train_ds.map(lambda example: {'labels': label2id[example['labels']]})

train_ds = train_ds.train_test_split(test_size=0.2) ; train_ds
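
A plain dictionary lookup raises `KeyError` as soon as an annotation is missing from the mapping. The sketch below shows one way to encode labels while collecting any unexpected values for inspection. The label order is taken from the model configuration printed in section 3, and `encode_labels` is a hypothetical helper, not part of `datasets` or Rubrix.

```python
# Label order taken from the `id2label` section of the model configuration.
MODEL_LABELS = ["others", "joy", "sadness", "anger", "surprise", "disgust", "fear"]

label2id = {label: i for i, label in enumerate(MODEL_LABELS)}
id2label = {i: label for label, i in label2id.items()}


def encode_labels(annotations, label2id):
    """Map labels to ids; return the ids plus the set of unknown labels."""
    ids, unknown = [], set()
    for annotation in annotations:
        if annotation in label2id:
            ids.append(label2id[annotation])
        else:
            unknown.add(annotation)
    return ids, unknown


# "Joy" is a deliberately mistyped label to show how it gets reported.
ids, unknown = encode_labels(["joy", "anger", "Joy", "others"], label2id)
print(ids, unknown)
```

Checking that `unknown` is empty before training avoids a failure halfway through the `map` call.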

From this point on, you can follow the standard process with the Hugging Face `Trainer` (see the tutorial).