# pysentimiento: A multilingual toolkit for Sentiment Analysis and SocialNLP tasks

# Sentiment Analysis and Opinion Mining in Portuguese



In this notebook we show a brief example of how to use [pysentimiento](https://github.com/pysentimiento/pysentimiento/), a multilingual toolkit for opinion mining and sentiment analysis, on Portuguese tasks.

`pysentimiento` supports the following tasks for Portuguese:

- Sentiment Analysis 
- Hate Speech Detection
- Irony Detection

First, let's install the library:


In [1]:
!pip install pysentimiento


Let's create an analyzer. The `create_analyzer` function takes the task and the language as parameters; here we pass `lang="pt"` for Portuguese.

In [2]:
from pysentimiento import create_analyzer
analyzer = create_analyzer(task="sentiment", lang="pt")



Let's check out some examples:

In [3]:
analyzer.predict("oh estou muito feliz porque é verão")

AnalyzerOutput(output=POS, probas={POS: 0.961, NEU: 0.036, NEG: 0.003})

In [4]:
analyzer.predict("Como isso me entristece! Não pode ser!!!")

AnalyzerOutput(output=NEG, probas={NEG: 0.991, POS: 0.005, NEU: 0.005})

In [5]:
analyzer.predict("O que o cocô quer dizer sobre sua saúde? Médica revela em 5 pontos, de cor a formato ideal")

AnalyzerOutput(output=NEU, probas={NEU: 0.910, NEG: 0.063, POS: 0.027})
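
Each prediction pairs a label with one probability per class. The snippet below is a minimal sketch of how the two relate, assuming the probabilities are available as a plain dictionary; the values are copied by hand from the first example above rather than read from the library object, and the variable names are illustrative.

```python
# Probabilities copied from the first sentiment example above.
probas = {"POS": 0.961, "NEU": 0.036, "NEG": 0.003}

# For single-label sentiment, the reported label is the class
# with the highest probability.
label = max(probas, key=probas.get)

print(label, probas[label])
```

The same comparison applied to the other two examples gives `NEG` and `NEU`, matching the `output` field shown in each result.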

### Emojis

The analyzer also handles emojis, relying on the [emoji](https://pypi.org/project/emoji/) library.


In [6]:
analyzer.predict("🤢")

AnalyzerOutput(output=NEG, probas={NEG: 0.976, NEU: 0.016, POS: 0.008})
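
Judging from the preprocessing output shown later in this notebook, where 📢 becomes `emoji buzina emoji`, emojis are turned into words before reaching the model. The sketch below illustrates that idea only. It is not the library's implementation: it swaps in the standard-library `unicodedata` module for the `emoji` package, so the names come out in English, and `describe_emojis` is a hypothetical helper.

```python
import unicodedata


def describe_emojis(text):
    """Replace each symbol character with 'emoji <unicode name> emoji'."""
    pieces = []
    for ch in text:
        # Category "So" (Symbol, other) covers most single-codepoint emojis.
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "unknown").lower()
            pieces.append(f" emoji {name} emoji ")
        else:
            pieces.append(ch)
    return "".join(pieces).strip()


print(describe_emojis("🤢"))
```

This sketch ignores multi-codepoint emojis (skin tones, joined sequences), which a dedicated library handles properly.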

## Hate Speech

`pysentimiento` also supports hate speech detection, with a model trained on the dataset from ["A Hierarchically Labeled Portuguese Hate Speech Dataset"](https://github.com/paulafortuna/Portuguese-Hate-Speech-Dataset).

In [9]:
hate_speech_analyzer = create_analyzer(task="hate_speech", lang="pt")

loading configuration file config.json from cache at /users/jmperez/.cache/huggingface/hub/models--pysentimiento--bertabaporu-pt-hate-speech/snapshots/22acaa49cd237583c60359bf087b42bc2622841b/config.json
Model config BertConfig {
  "_name_or_path": "pysentimiento/bertabaporu-pt-hate-speech",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "Sexism",
    "1": "Body",
    "2": "Racism",
    "3": "Ideology",
    "4": "Homophobia"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "Body": 1,
    "Homophobia": 4,
    "Ideology": 3,
    "Racism": 2,
    "Sexism": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 8,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_t

This model is a multi-label classifier: for each message it returns one probability per hate speech category, and more than one category can apply at the same time. The categories are:

- Sexism
- Body
- Racism
- Ideology
- Homophobia

In [12]:
hate_speech_analyzer.predict("gordas são realmente nojentos, por que eles não pensam em si mesmos?")

AnalyzerOutput(output=['Sexism', 'Body'], probas={Sexism: 0.822, Body: 0.864, Racism: 0.060, Ideology: 0.066, Homophobia: 0.009})

In [23]:
hate_speech_analyzer.predict("todos os comunistas são terroristas")

AnalyzerOutput(output=['Ideology'], probas={Sexism: 0.059, Body: 0.018, Racism: 0.051, Ideology: 0.805, Homophobia: 0.046})

In [31]:
hate_speech_analyzer.predict("chega de homossexuais, gays e outros")

AnalyzerOutput(output=['Homophobia'], probas={Sexism: 0.039, Body: 0.008, Racism: 0.021, Ideology: 0.014, Homophobia: 0.960})
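
In the multi-label outputs above, the probabilities do not sum to one and several labels can be returned together. A minimal sketch of that selection step follows. The 0.5 cutoff is an assumption that happens to agree with the three examples shown, not a documented setting of the library, and the probabilities are copied by hand from the first hate speech example.

```python
# Probabilities copied from the first hate speech example above.
probas = {
    "Sexism": 0.822,
    "Body": 0.864,
    "Racism": 0.060,
    "Ideology": 0.066,
    "Homophobia": 0.009,
}

# Assumed cutoff: each label is decided independently of the others.
THRESHOLD = 0.5

labels = [name for name, p in probas.items() if p > THRESHOLD]

print(labels)
```

Because each category is thresholded on its own, a message can receive several labels or none.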

## Preprocessing

`pysentimiento` features a preprocessing module with various options for manipulating hashtags, emojis, character repetition, and so on.

In [33]:
from pysentimiento.preprocessing import preprocess_tweet

preprocess_tweet("📢 O Twitter removeu as postagens de @JairBolsonaro por 'violar as regras de convivência' #BreakingNews", lang="pt")

"emoji buzina emoji  O Twitter removeu as postagens de @USER por 'violar as regras de convivência' hashtag breaking news"

In [34]:
preprocess_tweet("📢 O Twitter removeu as postagens de @JairBolsonaro por 'violar as regras de convivência' #BreakingNews", preprocess_handles=False, demoji=False)

"📢 O Twitter removeu as postagens de @JairBolsonaro por 'violar as regras de convivência' hashtag breaking news"
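
To make the two transformations visible in the outputs above more concrete (handles replaced by `@USER`, hashtags split into words), here is a rough regex-based sketch. It only mimics the visible behaviour and is not how `preprocess_tweet` is implemented; `sketch_preprocess` and its `user_token` parameter are hypothetical names, and emojis are left untouched.

```python
import re


def sketch_preprocess(text, user_token="@USER"):
    """Illustrative only: mask handles and split CamelCase hashtags."""
    # Replace every @handle with a generic placeholder.
    text = re.sub(r"@\w+", user_token, text)

    def split_hashtag(match):
        tag = match.group(1)
        # Break the tag into capitalised words, acronyms, and digit runs.
        words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", tag)
        # Fall back to the raw tag if nothing matched (e.g. accented text).
        words = words or [tag]
        return "hashtag " + " ".join(w.lower() for w in words)

    return re.sub(r"#(\w+)", split_hashtag, text)


print(sketch_preprocess("postagens de @JairBolsonaro #BreakingNews"))
```

The word-splitting pattern only knows ASCII letters, so hashtags with accented characters would need a broader pattern.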