<a href="https://colab.research.google.com/github/aishafarooque/Tweet-Intimacy-Analysis/blob/main/Baseline_BERT_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning distilbert-base-uncased for Twitter and Reddit

In [None]:
from IPython.display import clear_output 

! pip install --upgrade pip
! pip install transformers
! pip install tqdm
! pip install datasets
! pip install evaluate


# (Aisha) Clear output because it tends to get long. 
# But these libraries are always successfully installed.
clear_output()

In [None]:
import numpy as np
import pandas as pd
import transformers
from datasets import Dataset,load_dataset, load_from_disk
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [None]:
# Remove Twitter's train.csv if it already exists
! rm -rf train.csv

# Download Twitter's training data
! wget https://raw.githubusercontent.com/aishafarooque/Tweet-Intimacy-Analysis/main/train.csv

# Rename train.csv -> twitter_train.csv for more clarity
! mv train.csv twitter_train.csv

--2022-12-05 07:03:46--  https://raw.githubusercontent.com/aishafarooque/Tweet-Intimacy-Analysis/main/train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 813066 (794K) [text/plain]
Saving to: ‘train.csv’


2022-12-05 07:03:46 (51.1 MB/s) - ‘train.csv’ saved [813066/813066]



In [None]:
# Sanitize working directory
! rm -rf /content/__MACOSX
! rm -rf /content/annotated_question_intimacy_data

# Removing the .zip file if it already exists.
! rm -rf annotated_question_intimacy_data.zip

# Download the dataset from the author's GitHub repository.
! wget https://raw.githubusercontent.com/Jiaxin%2DPei/Quantifying%2DIntimacy%2Din%2DLanguage/main/data/annotated_question_intimacy_data.zip

# Unzip the file. 
! unzip /content/annotated_question_intimacy_data.zip

--2022-12-05 07:03:47--  https://raw.githubusercontent.com/Jiaxin%2DPei/Quantifying%2DIntimacy%2Din%2DLanguage/main/data/annotated_question_intimacy_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 94741 (93K) [application/zip]
Saving to: ‘annotated_question_intimacy_data.zip’


2022-12-05 07:03:47 (20.2 MB/s) - ‘annotated_question_intimacy_data.zip’ saved [94741/94741]

Archive:  /content/annotated_question_intimacy_data.zip
   creating: annotated_question_intimacy_data/
  inflating: annotated_question_intimacy_data/final_train.txt  
   creating: __MACOSX/
   creating: __MACOSX/annotated_question_intimacy_data/
  inflating: __MACOSX/annotated_question_intimacy_data/._final_train.txt  
  inflating: annotated_question_intimacy_data/final_val.txt  
  i

## Loading data

In [None]:
import pandas as pd

twitter_df_train = pd.read_csv('/content/twitter_train.csv', on_bad_lines='skip')
twitter_df_train = twitter_df_train.drop(columns=['language'])
twitter_df_train = twitter_df_train.rename(columns={'text': 'document', 'label': 'labels'})

print("Dataset size:", len(twitter_df_train), '\n')
twitter_df_train.info()

Dataset size: 9491 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9491 entries, 0 to 9490
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   document  9491 non-null   object 
 1   labels    9491 non-null   float64
dtypes: float64(1), object(1)
memory usage: 148.4+ KB


In [None]:
reddit_df_train = pd.read_csv('/content/annotated_question_intimacy_data/final_train.txt', 
                              sep='\t', header=None, names=['document', 'labels'])

print("Dataset size:", len(reddit_df_train), '\n')
reddit_df_train.info()

Dataset size: 1797 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1797 entries, 0 to 1796
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   document  1797 non-null   object 
 1   labels    1797 non-null   float64
dtypes: float64(1), object(1)
memory usage: 28.2+ KB


In [None]:
combined_df = pd.concat([twitter_df_train,reddit_df_train])
print("Dataset size:", len(combined_df), '\n')
combined_df.info()

Dataset size: 11288 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11288 entries, 0 to 1796
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   document  11288 non-null  object 
 1   labels    11288 non-null  float64
dtypes: float64(1), object(1)
memory usage: 264.6+ KB


#### Linear Mapping

In [None]:
# Linearly map the Reddit scores from [A, B] = [-1, 1] onto [C, D] = [1, 5].
A, B, C, D = -1, 1, 1, 5
scale = (D - C) / (B - A)
offset = -A * (D - C) / (B - A) + C

# If the cell is re-run without clearing local variables, the scores would be
# converted a second time and end up between 5 and 13. Only convert while the
# scores are still in the original range, i.e. none of them is greater than B.
if reddit_df_train['labels'].max() <= B:
  reddit_df_train['labels'] = (reddit_df_train['labels'] * scale + offset).round(1)

reddit_df_train.head()

Unnamed: 0,document,labels
0,What are the most mediocre animals in the anim...,2.3
1,What's the difference between an allergic reac...,3.1
2,What is your favorite subreddit that not every...,3.1
3,What's the most disgusting meal you've ever ea...,3.5
4,Whats one question you hate being asked?,4.0
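
The cell above rescales the Reddit scores from the range [-1, 1] to the range [1, 5] with a straight line. As a minimal sketch, the same formula can be written as a small reusable function. The helper name `linear_map` and the sample inputs are ours, not part of the original notebook. With these endpoints the map reduces to `q = 2 * x + 3`.

```python
def linear_map(x, a=-1.0, b=1.0, c=1.0, d=5.0):
    """Map x from the interval [a, b] onto [c, d] with a straight line."""
    scale = (d - c) / (b - a)      # 2.0 for the default endpoints
    offset = c - a * scale         # 3.0 for the default endpoints
    return x * scale + offset


# Illustrative raw scores (hypothetical values, not taken from the dataset).
for raw in (-1.0, 0.0, 0.25, 1.0):
    print(raw, "->", round(linear_map(raw), 1))
```

The endpoints map exactly: -1 goes to 1 and 1 goes to 5, so the order of the scores is preserved.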


In [None]:
import random as rn
from datasets import Dataset, DatasetDict

import datasets

twitter_dataset = Dataset.from_pandas(twitter_df_train)
twitter_dataset = twitter_dataset.train_test_split(test_size=0.2)
twitter_dataset

DatasetDict({
    train: Dataset({
        features: ['document', 'labels'],
        num_rows: 7592
    })
    test: Dataset({
        features: ['document', 'labels'],
        num_rows: 1899
    })
})

In [None]:
reddit_dataset = Dataset.from_pandas(reddit_df_train)
reddit_dataset = reddit_dataset.train_test_split(test_size=0.2)
reddit_dataset

DatasetDict({
    train: Dataset({
        features: ['document', 'labels'],
        num_rows: 1437
    })
    test: Dataset({
        features: ['document', 'labels'],
        num_rows: 360
    })
})

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

The tokenizer takes a sentence and encodes it into integers. 

In [82]:
sample_sentence = 'Will USA win the world cup?'
tokenizer(sample_sentence)['input_ids']

[101, 2097, 3915, 2663, 1996, 2088, 2452, 1029, 102]

If we decode each token ID individually, we can see the token assigned to each word. `[CLS]` and `[SEP]` are special tokens indicating the start and end of a sentence. 

We do not need to handle unknown tokens such as emojis or platform-specific jargon (for example, '[17M]' or 'r/askReddit') because the authors of the dataset have already processed them. 

In [83]:
tokenized_sample = []

for i in tokenizer(sample_sentence)['input_ids']:
  tokenized_sample.append(tokenizer.decode(i))

tokenized_sample

['[CLS]', 'will', 'usa', 'win', 'the', 'world', 'cup', '?', '[SEP]']
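
To make the ID-to-token relationship concrete, here is a toy sketch that uses only the nine IDs shown in the output above. It is not the real tokenizer: the actual `distilbert-base-uncased` tokenizer uses WordPiece with a much larger vocabulary, and the helper names `toy_encode` and `toy_decode` are hypothetical.

```python
# Toy vocabulary: only the nine IDs that appear in the output above.
ID_TO_TOKEN = {
    101: "[CLS]", 2097: "will", 3915: "usa", 2663: "win",
    1996: "the", 2088: "world", 2452: "cup", 1029: "?", 102: "[SEP]",
}
TOKEN_TO_ID = {token: idx for idx, token in ID_TO_TOKEN.items()}


def toy_encode(tokens):
    """Wrap already-split, lowercased tokens in [CLS] ... [SEP] and map them to IDs."""
    body = [TOKEN_TO_ID[t] for t in tokens]
    return [TOKEN_TO_ID["[CLS]"]] + body + [TOKEN_TO_ID["[SEP]"]]


def toy_decode(ids):
    """Map each ID back to its token string."""
    return [ID_TO_TOKEN[i] for i in ids]


ids = toy_encode(["will", "usa", "win", "the", "world", "cup", "?"])
print(ids)
print(toy_decode(ids))
```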

## Data Preprocessing

#### Tokenize Twitter Data

Preprocess our samples by feeding them into `tokenizer`. To ensure that an input is truncated to fit the model's maximum input length, set `truncation=True`. By default, no padding is applied, so we also pass `padding="max_length"` to pad every sequence to the same length. 

Reference:
- [Padding and truncation](https://huggingface.co/docs/transformers/pad_truncation)

Source:
- [Fine-tuning a model with the Trainer API or Keras](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter3/section3.ipynb)

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["document"], padding="max_length", truncation=True)

twitter_tokenized_datasets = twitter_dataset.map(tokenize_function, batched=True)

# Selecting a small sample of rows from training and testing datasets will help the 
# model train quickly. 
twitter_tokenized_datasets_train = twitter_tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
twitter_tokenized_datasets_test = twitter_tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

  0%|          | 0/8 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

After tokenizing, we will have `input_ids` and `attention_mask` added to the dataset. 

Here, `input_ids` are numerical representations of the tokens used to build the sequence, and `attention_mask` tells the model whether a token should be attended to (1) or ignored (0), as is the case for padding tokens. 
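
The following is a simplified sketch of how padding and truncation could produce `input_ids` and `attention_mask` for a single sequence. It is an illustration under stated assumptions, not the tokenizer's actual implementation: the helper `pad_and_truncate` is hypothetical, the padding ID is assumed to be 0, and truncation is modeled by keeping the final `[SEP]` token.

```python
def pad_and_truncate(input_ids, max_length, pad_id=0):
    """Return fixed-length input_ids and the matching attention_mask."""
    if len(input_ids) > max_length:
        # Truncation: drop tokens from the end but keep the closing special token.
        ids = input_ids[:max_length - 1] + input_ids[-1:]
    else:
        ids = list(input_ids)
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,             # pad up to max_length
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }


sample_ids = [101, 2097, 3915, 2663, 1996, 2088, 2452, 1029, 102]
padded = pad_and_truncate(sample_ids, max_length=12)
truncated = pad_and_truncate(sample_ids, max_length=4)
print(padded)
print(truncated)
```

In the padded case the last three mask entries are 0, so the model ignores those positions. In the truncated case every remaining position is a real token, so the mask is all ones.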

In [None]:
twitter_tokenized_datasets_train

Dataset({
    features: ['document', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 1000
})

In [None]:
twitter_tokenized_datasets_test

Dataset({
    features: ['document', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 1000
})

#### Tokenize Reddit Data

We follow the same preprocessing steps for Reddit as for Twitter.
The only difference is that since the size is already small, we're not choosing 1000 rows out of the dataset. 

In [None]:
reddit_tokenized_datasets = reddit_dataset.map(tokenize_function, batched=True)
reddit_tokenized_datasets

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['document', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 1437
    })
    test: Dataset({
        features: ['document', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 360
    })
})

## DistilBERT Base Uncased for Sequence Classification

As a baseline for our project, we're using the `distilbert-base-uncased` model. This model is smaller and faster than BERT and was pretrained on the same corpus. Like BERT, it was pretrained in a self-supervised fashion on raw text without human labelling; the difference is that it was trained using the BERT base model as a teacher. 

#### Model Configuration

Because this is a regression problem, we set the number of labels to 1. According to the documentation, the sequence classification head then computes a regression (mean squared error) loss. 

[DistilBert Model transformer with a sequence classification/regression head](https://github.com/TalSchuster/pytorch-transformers/blob/master/pytorch_transformers/modeling_distilbert.py#L546)
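
As a rough illustration of what a regression loss means here, the sketch below computes the mean squared error between a few predicted scores and reference scores. The function `mse_loss` and the sample values are hypothetical; the model computes its loss internally on tensors, so this is only a plain-Python picture of the same quantity.

```python
def mse_loss(predictions, labels):
    """Mean squared error between predicted scores and reference scores."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical intimacy scores on the 1-5 scale (not taken from the dataset).
sample_predictions = [1.8, 3.5, 2.0]
sample_references = [2.0, 3.0, 2.0]
print(mse_loss(sample_predictions, sample_references))
```

Because the errors are squared, a prediction that is off by 0.5 contributes more than six times as much as one that is off by 0.2.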


In [None]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=1)

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

Since we did not add any new tokens to the tokenizer, the following line of code can be ignored. We have kept it here to look at the shape of the model's token embedding matrix.

In [None]:
model.resize_token_embeddings(len(tokenizer))

Embedding(30522, 768, padding_idx=0)

#### Training only using Twitter

Since this is a regression problem where we are trying to predict a continuous value, we need to define metrics.

We reuse the metric from GLUE's `STS-B` (Semantic Textual Similarity Benchmark) task. STS-B scores the similarity of two sentences on a scale from 0 to 5. Our task is different, but it is also a regression onto a continuous score, so the same correlation metrics apply. 

For this task, we use the Pearson correlation coefficient and Spearman's rank correlation coefficient. Pearson will primarily be used for analysis because the authors have done the same.


- The Pearson correlation coefficient measures the strength of the linear association between two continuous variables. It answers questions such as "Do test scores and hours spent studying have a statistically significant relationship?" Source: [Pearson's Correlation Coefficient - Statistics Solutions](https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/pearsons-correlation-coefficient/#:~:text=High%20degree%3A%20If%20the%20coefficient,to%20be%20a%20small%20correlation.)
- Spearman's rank correlation coefficient measures the strength and direction of the monotonic relationship between two sets of data by comparing their ranks. This is not used in our report, but we included it for our learning. Source: [Spearman's Rank Correlation Coefficient](https://geographyfieldwork.com/SpearmansRank.htm#:~:text=The%20Spearman's%20Rank%20Correlation%20Coefficient,Museum%20in%20El%20Raval%2C%20Barcelona.)
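
To make the two definitions concrete, here is a minimal pure-Python sketch of both coefficients. It is for illustration only: the notebook itself relies on the GLUE metric, and the function names and sample values below are ours. Spearman is computed as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


# Hypothetical labels and predictions: perfectly monotonic but not linear.
example_labels = [1.0, 2.0, 3.0, 4.0, 5.0]
example_preds = [1.0, 4.0, 9.0, 16.0, 25.0]
print(pearson(example_preds, example_labels))
print(spearman(example_preds, example_labels))
```

In this example the predictions preserve the ordering of the labels exactly, so Spearman is 1, while Pearson is slightly lower because the relationship is curved rather than linear.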

In [None]:
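from datasets import load_metric

# Load the GLUE STS-B metric, which reports Pearson and Spearman correlations.
# In recent `datasets` releases, `load_metric` is deprecated in favor of
# `evaluate.load('glue', 'stsb')`.
metric_name = "pearson"
metric = load_metric('glue', 'stsb')
metric 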

Metric(name: "glue", features: {'predictions': Value(dtype='float32', id=None), 'references': Value(dtype='float32', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print

The `compute_metrics` function will help monitor metrics for this regression task. It computes the Pearson and Spearman correlations. It is invoked at each evaluation step during training and during the final evaluation. 

In [None]:
from datasets import load_metric

def compute_metrics(eval_pred):
    # eval_pred unpacks into the model outputs and the reference labels.
    predictions, labels = eval_pred
    # With num_labels=1 the outputs have shape (n_samples, 1); take the single column.
    predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)


Here's an example of how the function above is invoked using random samples and labels.

Here, `sample_preds` and `sample_labels` are arrays in the format: `array([6, 3, 7, 4, 6, 9, 2, 6, 7, 4, 3, 7, 7, 2, 5, 4])`, with 16 elements.

In [84]:
sample_preds = np.random.randint(0, 10, size=(16,))
sample_labels = np.random.randint(0, 10, size=(16,))

metric.compute(predictions=sample_preds, 
               references=sample_labels)


{'pearson': -0.2802949870590709, 'spearmanr': -0.26313293188717224}

In [None]:
from transformers import TrainingArguments, Trainer

# The output directory where the model predictions and checkpoints will be written
OUTPUT_DIR = 'twitter_trainer'

training_args = TrainingArguments(output_dir=OUTPUT_DIR,  
                                  # Logging is done at the end of each epoch
                                  logging_strategy="epoch",
                                  # Evaluation is done at the end of each epoch
                                  evaluation_strategy="epoch",
                                  # The batch size per GPU/TPU core/CPU for training.
                                  # It is set to 8 to stay under the Google Colab free RAM limit. 
                                  per_device_train_batch_size=8,
                                  per_device_eval_batch_size=8,
                                  # We train for only 3 epochs to keep training time short on Google Colab's free tier. 
                                  num_train_epochs=3,
                                  # Limit the total amount of checkpoints and delete old ones.
                                  save_total_limit = 2,
                                  # No save is done during training.
                                  save_strategy = 'no',
                                  # Do not load the best model found during training.
                                  # Done to speed up training. 
                                  load_best_model_at_end=False,
                                  # Pearson metric to match the paper
                                  metric_for_best_model="pearson",
                                  )

twitter_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=twitter_tokenized_datasets_train,
    eval_dataset=twitter_tokenized_datasets_test,
    compute_metrics=compute_metrics
)

# Force the compute_metrics to be set to our custom function. 
# Even though it is in the trainer, it doesn't always
# register. 
twitter_trainer.compute_metrics = compute_metrics

# The Trainer class provides an API for feature-complete training in PyTorch 
# for most standard use cases
twitter_trainer.train()


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: document. If document are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 375
  Number of trainable parameters = 66954241


Epoch,Training Loss,Validation Loss,Pearson,Spearmanr
1,0.2155,0.724226,0.412502,0.433367
2,0.1351,0.806192,0.391574,0.409468
3,0.0848,0.769739,0.409009,0.424968


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: document. If document are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8

TrainOutput(global_step=375, training_loss=0.14510909525553387, metrics={'train_runtime': 196.6766, 'train_samples_per_second': 15.253, 'train_steps_per_second': 1.907, 'total_flos': 397395108864000.0, 'train_loss': 0.14510909525553387, 'epoch': 3.0})
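
The `Total optimization steps = 375` line in the log above follows from the dataset size, batch size, and number of epochs. The sketch below reproduces that arithmetic, assuming a single device and gradient accumulation of 1 as reported in the log. The helper name `total_optimization_steps` is ours.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Batches per epoch (the last batch may be partial) times the number of epochs."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * num_epochs


# 1000 Twitter examples, batch size 8, 3 epochs: 125 batches per epoch.
print(total_optimization_steps(1000, 8, 3))  # 375
```

The same formula gives 540 steps for the 1437-example Reddit training split, since 1437 / 8 rounds up to 180 batches per epoch.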

In [None]:
twitter_trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: document. If document are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


{'eval_loss': 0.7697387337684631,
 'eval_pearson': 0.40900907219256627,
 'eval_spearmanr': 0.42496835028286695,
 'eval_runtime': 17.4996,
 'eval_samples_per_second': 57.144,
 'eval_steps_per_second': 7.143,
 'epoch': 3.0}

In [108]:
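from transformers import Trainer

# A bare Trainer with default arguments is used only to run inference with the fine-tuned model.
trainer = Trainer(model=model)

def pipeline_prediction(text):
    # Wrap the input string in a one-row dataset so it can be tokenized like the training data.
    dataset = Dataset.from_pandas(pd.DataFrame({'document': [text]}),
                                  preserve_index=False) 
    tokenized_datasets = dataset.map(tokenize_function)
    # predict() returns (predictions, label_ids, metrics); keep only the predictions.
    pred = trainer.predict(tokenized_datasets)[0]

    return f'Prediction for {text} is {pred[0][0]}'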

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [109]:
pipeline_prediction("What's on your bucket list?")

  0%|          | 0/1 [00:00<?, ?ex/s]

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: document. If document are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1
  Batch size = 8


"Prediction for What's on your bucket list? is 1.8093481"

In [110]:
pipeline_prediction("Where did you get married?")

  0%|          | 0/1 [00:00<?, ?ex/s]

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: document. If document are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1
  Batch size = 8


'Prediction for Where did you get married? is 3.5103896'

#### Training using Twitter train and Reddit test

In [None]:
from transformers import TrainingArguments, Trainer

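# `model` is the same object that was fine-tuned on Twitter above, so training continues from those weights.
twitter_reddit_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=twitter_tokenized_datasets_train,
    # Evaluate on the Reddit 'train' split (1437 rows) rather than the held-out 'test' split.
    eval_dataset=reddit_tokenized_datasets['train'],
    compute_metrics=compute_metrics
)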

twitter_reddit_trainer.train()


The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: document. If document are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 375
  Number of trainable parameters = 66954241


Epoch,Training Loss,Validation Loss,Pearson,Spearmanr
1,0.1084,0.182191,0.96283,0.972531
2,0.0761,0.285813,0.9474,0.961618
3,0.0511,0.272361,0.95133,0.962483


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: document. If document are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1437
  Batch size = 8

TrainOutput(global_step=375, training_loss=0.07854781087239583, metrics={'train_runtime': 219.8952, 'train_samples_per_second': 13.643, 'train_steps_per_second': 1.705, 'total_flos': 397395108864000.0, 'train_loss': 0.07854781087239583, 'epoch': 3.0})

In [None]:
twitter_reddit_trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: document. If document are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1437
  Batch size = 8


{'eval_loss': 0.2723606526851654,
 'eval_pearson': 0.9513302955418135,
 'eval_spearmanr': 0.9624829753371851,
 'eval_runtime': 25.2409,
 'eval_samples_per_second': 56.931,
 'eval_steps_per_second': 7.131,
 'epoch': 3.0}

In [None]:
from transformers import Trainer

trainer = Trainer(model=model)
twitter_reddit_trainer_prediction_1 = pipeline_prediction("What's on your bucket list?")
twitter_reddit_trainer_prediction_1

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


  0%|          | 0/1 [00:00<?, ?ex/s]

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: document. If document are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1
  Batch size = 8


1.8093481

In [None]:
twitter_reddit_trainer_prediction_2 = pipeline_prediction("Where did you get married?")
twitter_reddit_trainer_prediction_2

  0%|          | 0/1 [00:00<?, ?ex/s]

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: document. If document are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1
  Batch size = 8


3.5103896

#### Training using Reddit train and Twitter test

In [None]:
from transformers import TrainingArguments, Trainer

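# `model` is reused again, so this run continues from the weights produced by the previous sections.
reddit_twitter_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=reddit_tokenized_datasets['train'],
    # Evaluate on the 1000-row Twitter training subset, the same rows used for training in the Twitter runs above.
    eval_dataset=twitter_tokenized_datasets_train,
    compute_metrics=compute_metrics
)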

reddit_twitter_trainer.compute_metrics = compute_metrics
reddit_twitter_trainer.train()


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: document. If document are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1437
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 540
  Number of trainable parameters = 66954241


Epoch,Training Loss,Validation Loss,Pearson,Spearmanr
1,0.0288,0.925028,0.779328,0.801089
2,0.0197,0.814385,0.787937,0.806806
3,0.0227,0.818589,0.785322,0.799982


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: document. If document are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8

TrainOutput(global_step=540, training_loss=0.023718603452046714, metrics={'train_runtime': 261.5605, 'train_samples_per_second': 16.482, 'train_steps_per_second': 2.065, 'total_flos': 571056771437568.0, 'train_loss': 0.023718603452046714, 'epoch': 3.0})

In [None]:
reddit_twitter_trainer.evaluate()


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: document. If document are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8


{'eval_loss': 0.8185886144638062,
 'eval_pearson': 0.7853215695759056,
 'eval_spearmanr': 0.7999821929798833,
 'eval_runtime': 17.8743,
 'eval_samples_per_second': 55.946,
 'eval_steps_per_second': 6.993,
 'epoch': 3.0}

In [None]:
from transformers import Trainer

trainer = Trainer(model=model)
reddit_twitter_trainer_prediction_1 = pipeline_prediction("What's on your bucket list?")
reddit_twitter_trainer_prediction_1

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


  0%|          | 0/1 [00:00<?, ?ex/s]

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: document. If document are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1
  Batch size = 8


1.8093481

In [None]:
reddit_twitter_trainer_prediction_2 = pipeline_prediction("Where did you get married?")
reddit_twitter_trainer_prediction_2

  0%|          | 0/1 [00:00<?, ?ex/s]

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: document. If document are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1
  Batch size = 8


3.5103896