# We are going to build and fine-tune a sentiment classifier from scratch to classify our 1M reddit posts and +3M reddit comments.

- First we need to upload our scrapped data

- We apply spicy to create a list with all the comments splitted by sentences. We do this because the sentiment classifier was designed to work by sentencem, instead of full comments.

- We are going to tune the most popular sentiment classifier on the Hugging Face Hub which has been fine-tuned on the SST2 sentiment dataset, the distilbert-base-uncased-finetuned-sst-2-english.

- Label a training dataset of a shuffle sample of all the comments from our subreddit "whatcarshouldIbuy" as baseline to get an initial pre-trained sentiment classifier predictions.

- Fine-tune the pre-trained classifier with your training dataset. We upload the sample to Rubrix and we label a % of it. To do this we created a new label "NEUTRAL", as the original one has only a binary predictors ("POSITIVE" and "NEGATIVE")

- Label more data by correcting the predictions of the fine-tuned model.

- Fine-tune the pre-trained classifier with the extended training dataset.

## Introduction

We will fine-tune a sentiment classifier for our used-car domain, starting with no labeled data. The schema of the process is the following


### Uploading our scrapping datasets

In [1]:
import pandas as pd
import spacy
import rubrix as rb
#!python -m spacy download "en_core_web_sm"
#spacy option!
import spacy
from transformers import AutoTokenizer
import numpy as np
from transformers import Trainer
from datasets import load_metric
from transformers import TrainingArguments
from transformers import pipeline
from transformers import AutoModelForSequenceClassification

In [9]:
titles = pd.read_csv("scraped_reddit_posts_for_whatcarshouldIbuy.csv")

In [10]:
titles.head()

Unnamed: 0,full_link,subreddit,post keywords,id,date,score,num_comments,author,title,selftext,top_comment,comment_score
0,https://www.reddit.com/r/whatcarshouldIbuy/com...,whatcarshouldIbuy,,5fsq70,2016-11-30,3.0,5.0,ChalkPie,≤$9k and <50k miles?,I've been driving the same car since I started...,I've seen a few Ford Focus/Hyundai Accent seda...,1.0
1,https://www.reddit.com/r/whatcarshouldIbuy/com...,whatcarshouldIbuy,,5fsfy7,2016-11-30,3.0,2.0,everreadyy,buying a new car and feel overwhelmed,I am about to go on a cross-country road trip ...,"Insurance should be the same, as the $130 you ...",2.0
2,https://www.reddit.com/r/whatcarshouldIbuy/com...,whatcarshouldIbuy,,5frt0p,2016-11-30,1.0,3.0,thecw,Replacement for a 2006 Mariner,"Hi all,\nMy wife and I have a 2006 Mercury Mar...",I'd go for a Toyota Rav4 or Honda CRV. Both wi...,2.0
3,https://www.reddit.com/r/whatcarshouldIbuy/com...,whatcarshouldIbuy,,5frs4b,2016-11-30,1.0,6.0,chy_vak,Cheapest car to insure?,"Hey guys, not sure if this is the right place ...","Chevy Impala's are pretty good on insurance, a...",3.0
4,https://www.reddit.com/r/whatcarshouldIbuy/com...,whatcarshouldIbuy,,5frs37,2016-11-30,113.0,12.0,sockrocker,Optimizing your car search: A few tips I wish ...,**Background**: I just bought a used car. I sp...,You missed the most import step by far:\n**11a...,22.0


In [11]:
comments = pd.read_csv("scraped_reddit_comments_for_whatcarshouldIbuy.csv")

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


In [12]:
comments.head()

Unnamed: 0,parent_id,comment_id,score_id,created_utc,body,score,permalink,is_submitter,author,date,link
0,t3_559xc5,d88vp4i,2,2016-09-30 22:10:09,Torque is what will help a car ascent a steep ...,2,/r/whatcarshouldIbuy/comments/559xc5/car_with_...,False,pinks1ip,2016-09-30 22:10:09,https://www.reddit.com/r/whatcarshouldIbuy/com...
1,t1_d88vp4i,d8b3wiq,1,2016-10-02 19:07:45,Thank you.\nI don't really have the option of ...,1,/r/whatcarshouldIbuy/comments/559xc5/car_with_...,True,Nyxelestia,2016-10-02 19:07:45,https://www.reddit.com/r/whatcarshouldIbuy/com...
2,t1_d8b3wiq,d8cah3g,1,2016-10-03 16:38:21,I would aim for the larger engine options with...,1,/r/whatcarshouldIbuy/comments/559xc5/car_with_...,False,pinks1ip,2016-10-03 16:38:21,https://www.reddit.com/r/whatcarshouldIbuy/com...
3,t1_d8cah3g,d8d1dnj,1,2016-10-04 02:40:55,"This is so helpful, thank you! :)\nOut of curi...",1,/r/whatcarshouldIbuy/comments/559xc5/car_with_...,True,Nyxelestia,2016-10-04 02:40:55,https://www.reddit.com/r/whatcarshouldIbuy/com...
4,t1_d8d1dnj,d8dpmsl,2,2016-10-04 16:33:02,An Accord will drive smoother/quieter than a C...,2,/r/whatcarshouldIbuy/comments/559xc5/car_with_...,False,pinks1ip,2016-10-04 16:33:02,https://www.reddit.com/r/whatcarshouldIbuy/com...


#### Drop na

In [164]:
comments.dropna(inplace = True)

### Sentiment Analysis pre-train

Model: sentiment distilbert fine-tuned on sst-2
As of December 2021, the distilbert-base-uncased-finetuned-sst-2-english is in the top five of the most popular text-classification models in the Hugging Face Hub.

This model is a distilbert model fine-tuned on SST-2 (Stanford Sentiment Treebank), a highly popular sentiment classification benchmark.

This is a general-purpose sentiment classifier, which will need further fine-tuning for specific use cases and styles of text.

In [2]:
sentiment_classifier = pipeline(
    model="distilbert-base-uncased-finetuned-sst-2-english",
    task="sentiment-analysis", 
    return_all_scores=True,
)

### Sentencizer

##### Creation of the sentencizer aplying it over the 100 sample.

First step is to divide all the comments by sentences. We are going to train our sentiment classifier per sentence, and later, we will pass it over the whole column of comments, adding the sentiment classification as a new column of the dataset. 

In [3]:
nlp = spacy.load("en_core_web_sm")

To train Rubrix We get a random sample of 100k of the comments, and we prepare the dataset to upload it to the Rubrix server. This is a flat list with all the sentences splitted.

In [None]:
sentences_for_rubrix = []
for text in comments["body"].sample(100000):
    doc=nlp(text)
    sentences = [sentence.text for sentence in doc.sents]
    sentences_for_rubrix += sentences

Sentiment prediction per sentence

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
txt_into_sentences_to_label1, txt_into_sentences_to_label2 = train_test_split(sentences_for_rubrix, test_size=0.5)

## 1. Run the **pre-trained model** over the dataset and log the predictions

First step: we use the pre-trained model for predicting over our raw dataset.


In [None]:
sample_txt = []
for element in txt_into_sentences_to_label1[0:100000]:
        sample_txt.append({"text":element, "predictions": sentiment_classifier(element, truncation=True)}) 

We create the Rubrix dataset for labelling

In [None]:
sample_txt_rb = []
for item in sample_txt:
    sample_txt_ = rb.TextClassificationRecord(
        text=item["text"],
        #metadata={'title_id': item['title_id'], "title": item["title"]}, # log the intents for exploration of specific intents
        prediction=[(pred['label'], pred['score']) for pred in item['predictions'][0]],
        prediction_agent="distilbert-base-uncased-finetuned-sst-2-english"
    )
    sample_txt_rb.append(sample_txt_)

In [None]:
dataset_rb = rb.DatasetForTextClassification(sample_txt_rb)

We upload everything to Rubrix

In [None]:
rb.log(name='cars_sentence_with_pretrained_for_neutral', records=dataset_rb)

## 2. Explore and label data with the pretrained model

In this step, we'll start by exploring how the pre-trained model is performing with our dataset. 

At first sight:

- The pre-trained sentiment classifier only has 'POSITIVE' and 'NEGATIVE', and after checking the comments, we consider that it is needed a NEUTRAL LABEL

- The prediction per comment tends to be NEGATIVE-(60%), POSITIVE(40%).

- Labelling RULES:
    * A recommendation of a car/cars brand or model/models is considered POSITIVE
    * A selection of some cars of a list (are in the titles) are considered POSTIIVE
    * If the comment has something like: I love the car, but there is only one problem, is POSITIVE
    * If the comment has something like: I hate the car, but there is only one thing good, is NEGATIVE.
    * If has a balance with the some negatives and some positives, is NEUTRAL
    * If it speaks about something around cars without influence, is NEUTRAL
    * Questions about reliability or models or whatever is NEUTRAL
    * Questions about problems, are NEGATIVES
    * Questions about good features are POSITIVES
    
Taking into account this RULES, we can start labeling our data. 

Rubrix provides you with a search-driven UI to annotated data, using **free-text search**, **search filters** and **the Elasticsearch query DSL** for advanced queries. This is especially useful for sparse datasets, tasks with a high number of labels, or unbalanced classes. In the standard case, we recommend you to follow the workflow below:

1. **Start labeling examples sequentially**, without using search features. This way you will annotate a fraction of your data which will be aligned with the dataset distribution.

2. Once you have a sense of the data, you can **start using filters and search features to annotate examples with specific labels**. In our case, we'll label examples predicted as `POSITIVE` by our pre-trained model, and then a few examples predicted as `NEGATIVE`.

3. In our case as we add a NEUTRAL label, we need to label more to train this new NEUTRAL label

### We have uploaded to rubrix the first 100000 senteces of our sample_txt list. The dataset is huge so we need to reduce resources as is super time consuming.
### After upload it, we have validated and labeled by hand around 1000 records, a 1% of those 100000, and we tuned the baseline classifier with those new inputs.

## 3. Fine-tune the pre-trained model


First, let's load the annotations from our dataset using the query parameter from the `load` method. The `Validated` status corresponds to annotated records.

In [7]:
rb_dataset = rb.load(name='cars_sentence_with_pretrained_for_neutral', query="status:Validated")

##### Creating and preparing our train and test dataset

Let's now prepare our dataset for training and testing our sentiment classifier, using the `datasets` library:

In [8]:
# create 🤗 dataset with labels as numeric ids
train_ds = rb_dataset.prepare_for_training()

In [9]:
train_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 1012
})

In [10]:
# tokenize our datasets
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_train_ds = train_ds.map(tokenize_function, batched=True)

  0%|          | 0/2 [00:00<?, ?ba/s]

We split the data into a training and evalutaion set

In [11]:
train_dataset, eval_dataset = tokenized_train_ds.train_test_split(test_size=0.2, seed=42).values()

### Train our sentiment classifier

As we mentioned before, we're going to fine-tune the `distilbert-base-uncased-finetuned-sst-2-english` model. 
We load the model:

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels = 3, ignore_mismatched_sizes = True)

In [None]:
tokenized_train_ds

In [19]:
training_args = TrainingArguments(
    "distilbert-base-uncased-sentiment_cars", 
    evaluation_strategy="epoch",
    logging_steps=30,
)

metric = load_metric("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels, average = "micro")

trainer = Trainer(
    args=training_args,
    model=model, 
    train_dataset=train_dataset, 
    eval_dataset=eval_dataset, 
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

## 4. Testing the fine-tuned model

In this step, let's first test the model we have just trained.

Let's create a new pipeline with our model:

In [None]:
finetuned_sentiment_classifier = pipeline(
    model=model.to("cpu"),
    tokenizer=tokenizer, 
    task="sentiment-analysis", 
    return_all_scores=True,
    truncation = True
)

LABEL_0 = NEGATIVE, LABEL_1 = NEUTRAL, LABEL_2 = POSITIVE

#### Then, we can compare its predictions with the pre-trained model and an example:

In [None]:
finetuned_sentiment_classifier(
                        'Always consider your insurance costs, whichever car you choose.'
), sentiment_classifier('Always consider your insurance costs, whichever car you choose.'
                       )

In [None]:
finetuned_sentiment_classifier(
                        'I only buy Toyota/Lexus for reliability reasons.'
), sentiment_classifier('I only buy Toyota/Lexus for reliability reasons.'
                       )

In [None]:
finetuned_sentiment_classifier(
                        'I got my 2013 BMW 335i M-Sport with almost every option, ~18000 miles for $35k'
), sentiment_classifier('I got my 2013 BMW 335i M-Sport with almost every option, ~18000 miles for $35k'
                       )

In [None]:
finetuned_sentiment_classifier(
'BMWs lease really well, too, so if you want a BMW I would strongly advise leasing one.'
), sentiment_classifier('BMWs lease really well, too, so if you want a BMW I would strongly advise leasing one.'
                       )

In [None]:
finetuned_sentiment_classifier(
    'I spent another ~$2k for the new shocks, tires, replaced the rear sway bar, got the free recalls done, and now I have a car I love for a grand total of $10k, half my initial budget.'
), sentiment_classifier(
    'I spent another ~$2k for the new shocks, tires, replaced the rear sway bar, got the free recalls done, and now I have a car I love for a grand total of $10k, half my initial budget.'
)

## 5. Run our **fine-tuned model** over the dataset and log the predictions


Let's now create a dataset from the remaining records (those which we haven't annotated in the first annotation session).

We'll do this using the `Default` status, which means the record hasn't been assigned a label.

In [3]:
rb_dataset = rb.load(name='cars_sentence_with_pretrained_for_neutral', query="status:Default",)

#### HE HECHO TODO ESTO, RETUNEAMOS EL CLASSIFIER PERO AHORA CON TRES LABELS!

From here, this is basically the same as step 1, in this case using our fine-tuned model:


Let's take advantage of the datasets map feature, to make batched predictions.

In [None]:
def predict(examples):
    texts = [example["text"] for example in examples["inputs"]]
    return {
        "prediction": finetuned_sentiment_classifier(texts), 
        "prediction_agent": ["distilbert-base-uncased-sentiment-car"]*len(texts)
    }

ds_dataset = rb_dataset.to_datasets().map(predict, batched=True, batch_size=8) 

Afterward, we can convert the dataset directly to Rubrix records again and log them to the web app.

In [None]:
records = rb.read_datasets(ds_dataset, task="TextClassification")

rb.log(records=records, name='car_labeling_with_finetune_neutral_to_improve')

## 6. Explore and label data with the fine-tuned model


In this step, we'll start by exploring how the fine-tuned model is performing with our dataset and we label again a few number of sentences to refine more our classifier.

We labelled more than 300 annotated examples.

Let's add our new examples to our previous training set.

In [13]:
rb_dataset = rb.load("car_labeling_with_finetune_neutral_to_improve")

mapping = {
    "LABEL_0": "NEGATIVE",
    "LABEL_1": "NEUTRAL",
    "LABEL_2": "POSITIVE"
}
for record in rb_dataset:
  # skip unlabeled records
  if not record.annotation:
    continue

  record.annotation = mapping[record.annotation]

In [21]:
train_ds = rb_dataset.prepare_for_training()

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_train_ds = train_ds.map(tokenize_function, batched=True)

loading configuration file https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english/resolve/main/config.json from cache at C:\Users\fredi/.cache\huggingface\transformers\4e60bb8efad3d4b7dc9969bf204947c185166a0a3cf37ddb6f481a876a3777b5.9f8326d0b7697c7fd57366cdde57032f46bc10e37ae81cb7eb564d66d23ec96b
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "ti

  0%|          | 0/1 [00:00<?, ?ba/s]

In [22]:
tokenized_train_ds.info

DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=3, names=['NEGATIVE', 'NEUTRAL', 'POSITIVE'], id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name=None, config_name=None, version=None, splits=None, download_checksums=None, download_size=None, post_processing_size=None, dataset_size=None, size_in_bytes=None)

In [23]:
train_dataset.info

DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=3, names=['NEGATIVE', 'NEUTRAL', 'POSITIVE'], id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name=None, config_name=None, version=None, splits=None, download_checksums=None, download_size=None, post_processing_size=None, dataset_size=None, size_in_bytes=None)

In [24]:
from datasets import concatenate_datasets

train_dataset = concatenate_datasets([train_dataset, tokenized_train_ds])

As we want to measure the effect of adding examples to our training set we will:

Fine-tune from the pre-trained sentiment weights (as we did before)
Use the previous test set and the extended train set (obtaining a metric we use to compare this new version with our previous model)

In [25]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", num_labels = 3, ignore_mismatched_sizes = True)

loading configuration file https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english/resolve/main/config.json from cache at C:\Users\fredi/.cache\huggingface\transformers\4e60bb8efad3d4b7dc9969bf204947c185166a0a3cf37ddb6f481a876a3777b5.9f8326d0b7697c7fd57366cdde57032f46bc10e37ae81cb7eb564d66d23ec96b
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "s

In [26]:
train_ds = train_dataset.shuffle(seed=42)

trainer = Trainer(
    args=training_args,
    model=model, 
    train_dataset=train_dataset, 
    eval_dataset=eval_dataset, 
    compute_metrics=compute_metrics,
)


In [27]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1487
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 558


Epoch,Training Loss,Validation Loss,F1
1,0.6098,0.877583,0.64532
2,0.282,1.167308,0.635468
3,0.1621,1.370419,0.655172


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 203
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 203
  Batch size = 8
Saving model checkpoint to distilbert-base-uncased-sentiment_cars\checkpoint-500
Configuration saved in distilbert-base-uncased-sentiment_cars\checkpoint-500\config.json
Model weights saved in distilbert-base-uncased-sentiment_cars\checkpoint-500\pytorch_model.bin
The following columns in the evaluation set don't have a 

TrainOutput(global_step=558, training_loss=0.4252532032655559, metrics={'train_runtime': 4980.3513, 'train_samples_per_second': 0.896, 'train_steps_per_second': 0.112, 'total_flos': 590947603928064.0, 'train_loss': 0.4252532032655559, 'epoch': 3.0})

Finally we saved our model tuned!

In [28]:
model.save_pretrained("distilbert-base-uncased-sentiment-car_with_neutral")

Configuration saved in distilbert-base-uncased-sentiment-car_with_neutral\config.json
Model weights saved in distilbert-base-uncased-sentiment-car_with_neutral\pytorch_model.bin


In [29]:
final_car_sentiment_classifier_tuned_for_testing = pipeline(
    model=model.to("cpu"),
    tokenizer=tokenizer, 
    task="sentiment-analysis", 
    return_all_scores=True
)

In [30]:
final_car_sentiment_classifier_tuned_for_testing(
    "Rx8 is always an option if you like checking your oil at every tank up."
), sentiment_classifier(
    "Rx8 is always an option if you like checking your oil at every tank up.")

([[{'label': 'LABEL_0', 'score': 0.01840960793197155},
   {'label': 'LABEL_1', 'score': 0.9537872076034546},
   {'label': 'LABEL_2', 'score': 0.027803298085927963}]],
 [[{'label': 'NEGATIVE', 'score': 0.726030170917511},
   {'label': 'POSITIVE', 'score': 0.2739698588848114}]])

In [31]:
final_car_sentiment_classifier_tuned_for_testing(
    "The seats of the Infiniti feel like they were made for a medium sized Asian man."
), sentiment_classifier(
    "The seats of the Infiniti feel like they were made for a medium sized Asian man.")


([[{'label': 'LABEL_0', 'score': 0.8444103598594666},
   {'label': 'LABEL_1', 'score': 0.10671481490135193},
   {'label': 'LABEL_2', 'score': 0.04887479916214943}]],
 [[{'label': 'NEGATIVE', 'score': 0.977379322052002},
   {'label': 'POSITIVE', 'score': 0.02262074500322342}]])

## 5. Now its time to make the sentiment of the entire comments

Here we divide the comment per sentence, creating a new column in our dataset as we need to the sentiment per sentence and later we sum the sentiment result

In [117]:
comment_into_sentences = []
for text in comments["body"]:
    doc=nlp(text)
    sentences = [sentence.text for sentence in doc.sents]
    comment_into_sentences.append(sentences)

comments["body_splitted"] = comment_into_sentences   

In [118]:
comments.to_json("comments_splitted.json")

First, we save our last model as the good one. And we do the sentiment per comment and sentence, getting a list per comment with all the sentiments per sentences. Later we use this sentiment to calculate how good/bad/neutral is the sentiment per comment.

In [151]:
final_car_sentiment_classifier_tuned_for_op = pipeline(
    model=model.to("cpu"),
    tokenizer=tokenizer, 
    task="sentiment-analysis", 
    return_all_scores=False
)

In [152]:
sentiment_numbers = []
for comment in comments["body_splitted"]:
    sentiment = final_car_sentiment_classifier_tuned_for_op(comment, truncation = True)
    sentiment_numbers.append(sentiment)
    if (len(sentiment_numbers)==100000 or len(sentiment_numbers)==300000 or len(sentiment_numbers)==600000 or len(sentiment_numbers)==900000 or len(sentiment_numbers)==1100000):
        print(len(sentiment_numbers))

100000
300000
600000
900000
1100000


In [153]:
comments["sentiment"] = sentiment_numbers

In [154]:
comments.to_json("comments_splitted_sentiment.json")

In [236]:
comments = pd.read_json("comments_splitted_sentiment.json")

Now we count the number of negatives and positives per comment, and we average the result!

First approach, we average the sum dividing by the total number of sentences in the comment

In [237]:
sentiment_count = []
for comment in comments["sentiment"]:
    sentiment_number = 0
    for sentiment in comment:
        if sentiment["label"] == "LABEL_2":
            sentiment_number +=1
        elif sentiment["label"] == "LABEL_0":
            sentiment_number +=-1
    sentiment_of_comment = sentiment_number/len(comment)    
    sentiment_count.append(sentiment_number)

In [238]:
comments["sentiment_score"] = sentiment_count

Second approach: Instead of dividing by the number of sentences in the comment, we divide by the avg number of sentences per comment in the dataset. Why would be beneficial?
We consider that if someone spend time explaining things and givinig a highly positive comment or a highly negative comment, would be good to get it. The only way of doing that is giving more weight to the posts with more sentences.

This is the code to calculate the avg number of sentences per comment.

In [247]:
avg = 0
for item in comments["body_splitted"]:
    avg += len(item)/len(comments)

In [248]:
avg

3.286314331600568

In [252]:
sentiment_count2 = []
for comment in comments["sentiment"]:
    sentiment_number2 = 0
    for sentiment in comment:
        if sentiment["label"] == "LABEL_2":
            sentiment_number2 +=1
        elif sentiment["label"] == "LABEL_0":
            sentiment_number2 +=-1
    sentiment_of_comment2 = sentiment_number2/avg   
    sentiment_count2.append(sentiment_of_comment2)

In [253]:
comments["sentiment_score_avg"] = sentiment_count2

In [254]:
comments

Unnamed: 0,parent_id,comment_id,score_id,created_utc,body,score,permalink,is_submitter,author,date,link,body_splitted,sentiment,sentiment_score,sentiment_score_avg
0,t3_559xc5,d88vp4i,2,2016-09-30 22:10:09,Torque is what will help a car ascent a steep ...,2,/r/whatcarshouldIbuy/comments/559xc5/car_with_...,False,pinks1ip,2016-09-30 22:10:09,https://www.reddit.com/r/whatcarshouldIbuy/com...,[Torque is what will help a car ascent a steep...,"[{'label': 'LABEL_0', 'score': 0.9868171811}, ...",-3,-0.912877
1,t1_d88vp4i,d8b3wiq,1,2016-10-02 19:07:45,Thank you.\nI don't really have the option of ...,1,/r/whatcarshouldIbuy/comments/559xc5/car_with_...,True,Nyxelestia,2016-10-02 19:07:45,https://www.reddit.com/r/whatcarshouldIbuy/com...,"[Thank you., \nI don't really have the option ...","[{'label': 'LABEL_2', 'score': 0.9974262118}, ...",0,0.000000
2,t1_d8b3wiq,d8cah3g,1,2016-10-03 16:38:21,I would aim for the larger engine options with...,1,/r/whatcarshouldIbuy/comments/559xc5/car_with_...,False,pinks1ip,2016-10-03 16:38:21,https://www.reddit.com/r/whatcarshouldIbuy/com...,[I would aim for the larger engine options wit...,"[{'label': 'LABEL_2', 'score': 0.9396278858}, ...",0,0.000000
3,t1_d8cah3g,d8d1dnj,1,2016-10-04 02:40:55,"This is so helpful, thank you! :)\nOut of curi...",1,/r/whatcarshouldIbuy/comments/559xc5/car_with_...,True,Nyxelestia,2016-10-04 02:40:55,https://www.reddit.com/r/whatcarshouldIbuy/com...,"[This is so helpful, thank you!, :), \nOut of ...","[{'label': 'LABEL_2', 'score': 0.9975322485}, ...",-2,-0.608585
4,t1_d8d1dnj,d8dpmsl,2,2016-10-04 16:33:02,An Accord will drive smoother/quieter than a C...,2,/r/whatcarshouldIbuy/comments/559xc5/car_with_...,False,pinks1ip,2016-10-04 16:33:02,https://www.reddit.com/r/whatcarshouldIbuy/com...,[An Accord will drive smoother/quieter than a ...,"[{'label': 'LABEL_2', 'score': 0.944755733}, {...",6,1.825754
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1124909,t1_ge7fc5x,ge7rel9,1,2020-12-01 01:49:38,Thanks for your input. I only say I don't want...,1,/r/whatcarshouldIbuy/comments/k47weg/hi_lookin...,True,Skyline99x,2020-12-01 01:49:38,https://www.reddit.com/r/whatcarshouldIbuy/com...,"[Thanks for your input., I only say I don't wa...","[{'label': 'LABEL_2', 'score': 0.8439986706}, ...",2,0.608585
1124910,t1_ge7gmfe,ge7rr9x,1,2020-12-01 01:52:49,Thanks for the in depth input. I honestly forg...,1,/r/whatcarshouldIbuy/comments/k47weg/hi_lookin...,True,Skyline99x,2020-12-01 01:52:49,https://www.reddit.com/r/whatcarshouldIbuy/com...,"[Thanks for the in depth input., I honestly fo...","[{'label': 'LABEL_2', 'score': 0.9389728904}, ...",4,1.217169
1124911,t1_ge8ad5z,gea8j4i,1,2020-12-01 18:36:15,Will consider. Do you know if maintenance is m...,1,/r/whatcarshouldIbuy/comments/k47weg/hi_lookin...,True,Skyline99x,2020-12-01 18:36:15,https://www.reddit.com/r/whatcarshouldIbuy/com...,"[Will consider., Do you know if maintenance is...","[{'label': 'LABEL_2', 'score': 0.8665452003}, ...",0,0.000000
1124912,t1_gea8j4i,geaant1,1,2020-12-01 18:53:00,A car this old I’d probably go to an independe...,1,/r/whatcarshouldIbuy/comments/k47weg/hi_lookin...,False,BalIsack,2020-12-01 18:53:00,https://www.reddit.com/r/whatcarshouldIbuy/com...,[A car this old I’d probably go to an independ...,"[{'label': 'LABEL_1', 'score': 0.8880066276}]",0,0.000000


In [None]:
comments.to_json("comments_splitted_sentiment_score_scoreAvg.json")

In [2]:
comments = pd.read_json("comments_splitted_sentiment_score_scoreAvg.json")

In [3]:
comments

Unnamed: 0,parent_id,comment_id,score_id,created_utc,body,score,permalink,is_submitter,author,date,link,body_splitted,sentiment,sentiment_score,sentiment_score_avg
0,t3_559xc5,d88vp4i,2,2016-09-30 22:10:09,Torque is what will help a car ascent a steep ...,2,/r/whatcarshouldIbuy/comments/559xc5/car_with_...,False,pinks1ip,2016-09-30 22:10:09,https://www.reddit.com/r/whatcarshouldIbuy/com...,[Torque is what will help a car ascent a steep...,"[{'label': 'LABEL_0', 'score': 0.9868171811}, ...",-0.272727,-0.912877
1,t1_d88vp4i,d8b3wiq,1,2016-10-02 19:07:45,Thank you.\nI don't really have the option of ...,1,/r/whatcarshouldIbuy/comments/559xc5/car_with_...,True,Nyxelestia,2016-10-02 19:07:45,https://www.reddit.com/r/whatcarshouldIbuy/com...,"[Thank you., \nI don't really have the option ...","[{'label': 'LABEL_2', 'score': 0.9974262118}, ...",0.000000,0.000000
2,t1_d8b3wiq,d8cah3g,1,2016-10-03 16:38:21,I would aim for the larger engine options with...,1,/r/whatcarshouldIbuy/comments/559xc5/car_with_...,False,pinks1ip,2016-10-03 16:38:21,https://www.reddit.com/r/whatcarshouldIbuy/com...,[I would aim for the larger engine options wit...,"[{'label': 'LABEL_2', 'score': 0.9396278858}, ...",0.000000,0.000000
3,t1_d8cah3g,d8d1dnj,1,2016-10-04 02:40:55,"This is so helpful, thank you! :)\nOut of curi...",1,/r/whatcarshouldIbuy/comments/559xc5/car_with_...,True,Nyxelestia,2016-10-04 02:40:55,https://www.reddit.com/r/whatcarshouldIbuy/com...,"[This is so helpful, thank you!, :), \nOut of ...","[{'label': 'LABEL_2', 'score': 0.9975322485}, ...",-0.400000,-0.608585
4,t1_d8d1dnj,d8dpmsl,2,2016-10-04 16:33:02,An Accord will drive smoother/quieter than a C...,2,/r/whatcarshouldIbuy/comments/559xc5/car_with_...,False,pinks1ip,2016-10-04 16:33:02,https://www.reddit.com/r/whatcarshouldIbuy/com...,[An Accord will drive smoother/quieter than a ...,"[{'label': 'LABEL_2', 'score': 0.944755733}, {...",0.666667,1.825754
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1124909,t1_ge7fc5x,ge7rel9,1,2020-12-01 01:49:38,Thanks for your input. I only say I don't want...,1,/r/whatcarshouldIbuy/comments/k47weg/hi_lookin...,True,Skyline99x,2020-12-01 01:49:38,https://www.reddit.com/r/whatcarshouldIbuy/com...,"[Thanks for your input., I only say I don't wa...","[{'label': 'LABEL_2', 'score': 0.8439986706}, ...",0.333333,0.608585
1124910,t1_ge7gmfe,ge7rr9x,1,2020-12-01 01:52:49,Thanks for the in depth input. I honestly forg...,1,/r/whatcarshouldIbuy/comments/k47weg/hi_lookin...,True,Skyline99x,2020-12-01 01:52:49,https://www.reddit.com/r/whatcarshouldIbuy/com...,"[Thanks for the in depth input., I honestly fo...","[{'label': 'LABEL_2', 'score': 0.9389728904}, ...",0.444444,1.217169
1124911,t1_ge8ad5z,gea8j4i,1,2020-12-01 18:36:15,Will consider. Do you know if maintenance is m...,1,/r/whatcarshouldIbuy/comments/k47weg/hi_lookin...,True,Skyline99x,2020-12-01 18:36:15,https://www.reddit.com/r/whatcarshouldIbuy/com...,"[Will consider., Do you know if maintenance is...","[{'label': 'LABEL_2', 'score': 0.8665452003}, ...",0.000000,0.000000
1124912,t1_gea8j4i,geaant1,1,2020-12-01 18:53:00,A car this old I’d probably go to an independe...,1,/r/whatcarshouldIbuy/comments/k47weg/hi_lookin...,False,BalIsack,2020-12-01 18:53:00,https://www.reddit.com/r/whatcarshouldIbuy/com...,[A car this old I’d probably go to an independ...,"[{'label': 'LABEL_1', 'score': 0.8880066276}]",0.000000,0.000000


#### Steps to improve the whole dataset with the comments.