In [19]:
import torch

device = 0 if torch.cuda.is_available() else -1  # Use GPU if available, otherwise CPU

In [20]:
!nvidia-smi

Wed Jul 24 16:30:04 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       On  |   00000000:00:04.0 Off |                    0 |
| N/A   69C    P0             30W /   70W |   14907MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

### 1. Import Libraries

In [22]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

### 2. Load Model and Tokenizer

In [23]:
# Load the model and tokenizer
name = "TweebankNLP/bertweet-tb2_ewt-pos-tagging"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name)

* **AutoTokenizer:** Loads the tokenizer specific to the model.
* **AutoModelForTokenClassification:** Loads the model fine-tuned for token classification tasks, such as POS tagging.

### 3. Initialize the Pipeline

In [25]:
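# Force CPU inference, overriding the auto-detected device above
# (the nvidia-smi output shows the T4's memory almost fully in use: 14907MiB / 15360MiB)
device = -1  # -1 selects the CPU in transformers pipelines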

# Initialize the pipeline for token classification
pos_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple", device=device)

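* **pipeline:** Initializes a token-classification pipeline. The task name "ner" is an alias for "token-classification", so it works for POS tagging as well as named entity recognition (NER). With aggregation_strategy="simple", adjacent tokens that receive the same label are grouped into one entry. This rejoins subword pieces that share a label, but it also merges neighboring words with the same tag, as with "quick brown" in the output further down.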
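
The grouping effect of aggregation_strategy="simple" can be illustrated without loading a model. The sketch below is a simplified stand-in, not the library's implementation: `group_adjacent` is a hypothetical helper, the input pairs are hand-written, and subword handling is left out.

```python
def group_adjacent(tokens):
    """Merge runs of consecutive (word, label) pairs that share a label."""
    groups = []
    for word, label in tokens:
        if groups and groups[-1][1] == label:
            # Same label as the previous entry: extend that entry.
            groups[-1] = (groups[-1][0] + " " + word, label)
        else:
            groups.append((word, label))
    return groups


# Hand-written per-word labels (assumed, not model output).
tokens = [("the", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN")]
grouped = group_adjacent(tokens)
print(grouped)
# [('the', 'DET'), ('quick brown', 'ADJ'), ('fox', 'NOUN')]
```

Because POS labels carry no B-/I- prefixes, nothing marks where one word ends and the next begins, which is why two adjectives in a row end up in a single entry.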

### 4. Define Helper Functions


a. Function to Get POS Sequence

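get_pos_sequence: Walks the pipeline output, rejoins any WordPiece continuation pieces (tokens starting with "##") to the preceding word, and returns a list of (word, POS tag) pairs. A rejoined word keeps the tag of its first piece.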

In [26]:
def get_pos_sequence(tags):
    pos_sequence = []
    current_word = ""
    current_pos = None
    for tag in tags:
        token = tag['word']
        pos_tag = tag['entity_group']
        
        if token.startswith("##"):  # Handle subword tokens
            current_word += token[2:]
        else:
            if current_word:
                pos_sequence.append((current_word, current_pos))
            current_word = token
            current_pos = pos_tag
    
    if current_word:  # Append the last word if any
        pos_sequence.append((current_word, current_pos))
        
    return pos_sequence
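
The "##" check in get_pos_sequence follows the WordPiece convention used by BERT tokenizers, where a piece that continues the previous word is prefixed with "##". A minimal sketch of that rejoining step on its own, assuming a hand-written split (the pieces are hypothetical and `merge_wordpieces` is an illustrative name, not part of the notebook):

```python
def merge_wordpieces(pieces):
    """Rejoin WordPiece tokens: a piece starting with '##' continues the previous word."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # drop the '##' marker and append
        else:
            words.append(piece)
    return words


# Hypothetical split of "tuning the hyperparameters" (not actual tokenizer output).
pieces = ["tuning", "the", "hyper", "##para", "##meters"]
words = merge_wordpieces(pieces)
print(words)
# ['tuning', 'the', 'hyperparameters']
```

Tokenizers that are not WordPiece-based mark continuations differently, so this check only applies to models that emit "##" pieces.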


b. Function to Summarize POS Tags

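get_pos_summary: Counts how often each POS tag occurs in the pipeline output and returns the counts as a dict. Each aggregated entry counts once, so a merged group such as "quick brown" adds one to ADJ, not two.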

In [27]:
def get_pos_summary(tags):
    pos_tags = {}
    for tag in tags:
        pos_tag = tag['entity_group']
        if pos_tag in pos_tags:
            pos_tags[pos_tag] += 1
        else:
            pos_tags[pos_tag] = 1
    return pos_tags
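
The same count can be written with collections.Counter. The sketch below runs on hypothetical pipeline-style dicts (scores omitted) and also shows a per-word count for comparison. The whitespace split used for the per-word count is an assumption and only approximates word boundaries.

```python
from collections import Counter

# Hypothetical pipeline output for "the quick brown fox" (scores omitted).
tags = [
    {"word": "the", "entity_group": "DET"},
    {"word": "quick brown", "entity_group": "ADJ"},
    {"word": "fox", "entity_group": "NOUN"},
]

# One count per aggregated entry, equivalent to get_pos_summary.
group_counts = dict(Counter(tag["entity_group"] for tag in tags))

# One count per whitespace-separated word inside each entry.
word_counts = Counter()
for tag in tags:
    word_counts[tag["entity_group"]] += len(tag["word"].split())

print(group_counts)       # {'DET': 1, 'ADJ': 1, 'NOUN': 1}
print(dict(word_counts))  # {'DET': 1, 'ADJ': 2, 'NOUN': 1}
```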


c. Function to Apply POS Tagging

get_token_pos_data: Applies the POS tagging pipeline to the text and returns both the sequence of POS tags and the summary.

In [28]:
def get_token_pos_data(text):
    tags = pos_pipeline(text)
    pos_sequence = get_pos_sequence(tags)
    pos_summary = get_pos_summary(tags)
    return pos_sequence, pos_summary


### 5. Apply POS Tagging to DataFrame


In [35]:
# Example usage with a sample DataFrame
data = {
    "comments": [
        "The quick brown fox jumps over the lazy dog.",
        "Although it was raining heavily, the children went outside to play soccer.",
        "What time does the meeting start tomorrow?",
        "I wanted to go to the concert, but I had to work late.",
        "Wow, what an incredible performance that was!",
        "The tall, handsome man with the blue jacket and the red hat greeted everyone warmly.",
        "She didn't go to the party because she was feeling sick, and her friends understood why.",
        "Carefully review the report and ensure that all errors are corrected promptly.",
        "The neural network's performance improved significantly after tuning the hyperparameters.",
        "He hit the nail on the head with his comments about the project's issues."
    ]
}

df = pd.DataFrame(data)

# Tag each comment; zip(*...) splits the (sequence, summary) pairs into two columns
df['pos_tokens_sequence'], df['pos_tokens_summary'] = zip(*df['comments'].apply(get_token_pos_data))

# Set display options
pd.set_option('display.max_colwidth', None)  # Display full content of each column

# Display DataFrame
df

Unnamed: 0,comments,pos_tokens_sequence,pos_tokens_summary
0,The quick brown fox jumps over the lazy dog.,"[(the, DET), (quick brown, ADJ), (fox, NOUN), (jumps, VERB), (over, ADP), (the, DET), (lazy, ADJ), (dog, NOUN), (., PUNCT)]","{'DET': 2, 'ADJ': 2, 'NOUN': 2, 'VERB': 1, 'ADP': 1, 'PUNCT': 1}"
1,"Although it was raining heavily, the children went outside to play soccer.","[(although, SCONJ), (it, PRON), (was, AUX), (raining, VERB), (heavily, ADV), (,, PUNCT), (the, DET), (children, NOUN), (went, VERB), (outside, ADV), (to, PART), (play, VERB), (soccer, NOUN), (., PUNCT)]","{'SCONJ': 1, 'PRON': 1, 'AUX': 1, 'VERB': 3, 'ADV': 2, 'PUNCT': 2, 'DET': 1, 'NOUN': 2, 'PART': 1}"
2,What time does the meeting start tomorrow?,"[(what, DET), (time, NOUN), (does, AUX), (the, DET), (meeting, NOUN), (start, VERB), (tomorrow, NOUN), (?, PUNCT)]","{'DET': 2, 'NOUN': 3, 'AUX': 1, 'VERB': 1, 'PUNCT': 1}"
3,"I wanted to go to the concert, but I had to work late.","[(i, PRON), (wanted, VERB), (to, PART), (go, VERB), (to, ADP), (the, DET), (concert, NOUN), (,, PUNCT), (but, CCONJ), (i, PRON), (had, VERB), (to, PART), (work, VERB), (late, ADV), (., PUNCT)]","{'PRON': 2, 'VERB': 4, 'PART': 2, 'ADP': 1, 'DET': 1, 'NOUN': 1, 'PUNCT': 2, 'CCONJ': 1, 'ADV': 1}"
4,"Wow, what an incredible performance that was!","[(wow, INTJ), (,, PUNCT), (what an, DET), (incredible, ADJ), (performance, NOUN), (that, PRON), (was, AUX), (!, PUNCT)]","{'INTJ': 1, 'PUNCT': 2, 'DET': 1, 'ADJ': 1, 'NOUN': 1, 'PRON': 1, 'AUX': 1}"
5,"The tall, handsome man with the blue jacket and the red hat greeted everyone warmly.","[(the, DET), (tall, ADJ), (,, PUNCT), (handsome, ADJ), (man, NOUN), (with, ADP), (the, DET), (blue, ADJ), (jacket, NOUN), (and, CCONJ), (the, DET), (red, ADJ), (hat, NOUN), (greeted, VERB), (everyone, PRON), (warmly, ADV), (., PUNCT)]","{'DET': 3, 'ADJ': 4, 'PUNCT': 2, 'NOUN': 3, 'ADP': 1, 'CCONJ': 1, 'VERB': 1, 'PRON': 1, 'ADV': 1}"
6,"She didn't go to the party because she was feeling sick, and her friends understood why.","[(she, PRON), (didn, AUX), (', PUNCT), (t, PART), (go, VERB), (to, ADP), (the, DET), (party, NOUN), (because, SCONJ), (she, PRON), (was, AUX), (feeling, VERB), (sick, ADJ), (,, PUNCT), (and, CCONJ), (her, PRON), (friends, NOUN), (understood, VERB), (why, ADV), (., PUNCT)]","{'PRON': 3, 'AUX': 2, 'PUNCT': 3, 'PART': 1, 'VERB': 3, 'ADP': 1, 'DET': 1, 'NOUN': 2, 'SCONJ': 1, 'ADJ': 1, 'CCONJ': 1, 'ADV': 1}"
7,Carefully review the report and ensure that all errors are corrected promptly.,"[(carefully, ADV), (review, VERB), (the, DET), (report, NOUN), (and, CCONJ), (ensure, VERB), (that, SCONJ), (all, DET), (errors, NOUN), (are, AUX), (corrected, VERB), (promptly, ADV), (., PUNCT)]","{'ADV': 2, 'VERB': 3, 'DET': 2, 'NOUN': 2, 'CCONJ': 1, 'SCONJ': 1, 'AUX': 1, 'PUNCT': 1}"
8,The neural network's performance improved significantly after tuning the hyperparameters.,"[(the, DET), (neural, ADJ), (network, NOUN), (' s, PART), (performance, NOUN), (improved, VERB), (significantly, ADV), (after, SCONJ), (tuning, VERB), (the, DET), (hyperparameters, NOUN), (., PUNCT)]","{'DET': 2, 'ADJ': 1, 'NOUN': 3, 'PART': 1, 'VERB': 2, 'ADV': 1, 'SCONJ': 1, 'PUNCT': 1}"
9,He hit the nail on the head with his comments about the project's issues.,"[(he, PRON), (hit, VERB), (the, DET), (nail, NOUN), (on, ADP), (the, DET), (head, NOUN), (with, ADP), (his, PRON), (comments, NOUN), (about, ADP), (the, DET), (project, NOUN), (' s, PART), (issues, NOUN), (., PUNCT)]","{'PRON': 2, 'VERB': 1, 'DET': 3, 'NOUN': 5, 'ADP': 3, 'PART': 1, 'PUNCT': 1}"


The output is a DataFrame showing the original comments, the sequence of POS-tagged words, and a summary of the POS tags for each comment.
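
The two columns are filled in one statement by unpacking with zip(*...). A small sketch of that idiom on plain Python data, assuming hypothetical results shaped like the return value of get_token_pos_data:

```python
# Hypothetical per-comment results: one (pos_sequence, pos_summary) pair per comment.
results = [
    ([("the", "DET"), ("fox", "NOUN")], {"DET": 1, "NOUN": 1}),
    ([("wow", "INTJ"), ("!", "PUNCT")], {"INTJ": 1, "PUNCT": 1}),
]

# zip(*results) transposes the list of pairs into two tuples:
# all sequences first, then all summaries.
sequences, summaries = zip(*results)

print(sequences[1])  # [('wow', 'INTJ'), ('!', 'PUNCT')]
print(summaries[0])  # {'DET': 1, 'NOUN': 1}
```

With no rows at all, zip yields nothing and the two-name unpacking raises ValueError, so an empty DataFrame needs to be handled separately.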

### Script for POS Tagging Using vblagoje/bert-english-uncased-finetuned-pos

In [34]:
from transformers import pipeline

# Load the pipeline for POS tagging
pos_pipeline = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos", aggregation_strategy="simple")

# Example sentences
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Although it was raining heavily, the children went outside to play soccer.",
    "What time does the meeting start tomorrow?",
    "I wanted to go to the concert, but I had to work late.",
    "Wow, what an incredible performance that was!",
    "The tall, handsome man with the blue jacket and the red hat greeted everyone warmly.",
    "She didn't go to the party because she was feeling sick, and her friends understood why.",
    "Carefully review the report and ensure that all errors are corrected promptly.",
    "The neural network's performance improved significantly after tuning the hyperparameters.",
    "He hit the nail on the head with his comments about the project's issues."
]


# Apply the POS tagging pipeline to each sentence
for sentence in sentences:
    pos_tags = pos_pipeline(sentence)
    print(f"Sentence: {sentence}")
    print("POS Tags:")
    for tag in pos_tags:
        print(f"  {tag['word']} -> {tag['entity_group']} (Score: {tag['score']:.4f})")
    print("\n")


Some weights of the model checkpoint at vblagoje/bert-english-uncased-finetuned-pos were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Sentence: The quick brown fox jumps over the lazy dog.
POS Tags:
  the -> DET (Score: 0.9994)
  quick brown -> ADJ (Score: 0.9697)
  fox -> NOUN (Score: 0.9970)
  jumps -> VERB (Score: 0.9994)
  over -> ADP (Score: 0.9993)
  the -> DET (Score: 0.9995)
  lazy -> ADJ (Score: 0.9979)
  dog -> NOUN (Score: 0.9989)
  . -> PUNCT (Score: 0.9997)


Sentence: Although it was raining heavily, the children went outside to play soccer.
POS Tags:
  although -> SCONJ (Score: 0.9988)
  it -> PRON (Score: 0.9995)
  was -> AUX (Score: 0.9985)
  raining -> VERB (Score: 0.9967)
  heavily -> ADV (Score: 0.9991)
  , -> PUNCT (Score: 0.9997)
  the -> DET (Score: 0.9995)
  children -> NOUN (Score: 0.9987)
  went -> VERB (Score: 0.9995)
  outside -> ADV (Score: 0.9975)
  to -> PART (Score: 0.9991)
  play -> VERB (Score: 0.9995)
  soccer -> NOUN (Score: 0.9892)
  . -> PUNCT (Score: 0.9997)


Sentence: What time does the meeting start tomorrow?
POS Tags:
  what -> DET (Score: 0.9990)
  time -> NOUN (Score: 0.99

The first method wraps the pipeline in helper functions. For each comment it returns the sequence of (word, tag) pairs, without confidence scores, and a count of each tag.

The second method prints the pipeline output directly: one line per aggregated entry, showing the POS tag together with its confidence score.

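Although the presentation and detail differ, the POS tags are the same in the two outputs. The cell numbers qualify this result: the DataFrame cell (In [35]) ran after the second script (In [34]) had reassigned pos_pipeline, so both outputs appear to come from vblagoje/bert-english-uncased-finetuned-pos. The lowercased tokens in the DataFrame are consistent with an uncased model. To compare the two models, rerun the DataFrame cell right after initializing the TweebankNLP pipeline.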
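
One way to check the consistency of two outputs is to compare the tags position by position. A minimal sketch, using the first sentence's tags copied from the two outputs above. `tag_agreement` is a hypothetical helper, not part of either script, and it assumes both sequences are split into the same entries.

```python
def tag_agreement(seq_a, seq_b):
    """Fraction of positions where two equally long (word, tag) sequences share the tag."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        return 1.0
    matches = sum(1 for (_, tag_a), (_, tag_b) in zip(seq_a, seq_b) if tag_a == tag_b)
    return matches / len(seq_a)


# "The quick brown fox jumps over the lazy dog." as tagged in the DataFrame output.
first_method = [
    ("the", "DET"), ("quick brown", "ADJ"), ("fox", "NOUN"), ("jumps", "VERB"),
    ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"), (".", "PUNCT"),
]
# The same sentence as printed by the second script (scores dropped).
second_method = [
    ("the", "DET"), ("quick brown", "ADJ"), ("fox", "NOUN"), ("jumps", "VERB"),
    ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"), (".", "PUNCT"),
]

agreement = tag_agreement(first_method, second_method)
print(agreement)  # 1.0
```

If two models group or split words differently, the sequences will differ in length and need to be aligned before a comparison like this is meaningful.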