<a href="https://colab.research.google.com/github/darrickpang/Email/blob/master/AI_Week8_Assignment_Darrick_Pang.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# NLP Assignment Solution Using Generative AI

## Task 1: Text Preprocessing Using spaCy

import spacy

nlp = spacy.load("en_core_web_sm")

text = "Natural Language Processing (NLP) is a fascinating field of AI that enables computers to understand human language."
text2 = "Artificial Intelligence is shaping the future."


def text_preprocessing(text):
  doc = nlp(text)

  # Sentence Tokenization
  sentences = [sent.text for sent in doc.sents]
  print("Tokenized Sentences:", sentences)

  # Tokenization (Lowercased for Consistency)
  tokens = [token.text.lower() for token in doc if not token.is_punct and not token.is_digit]
  print("Tokenized Text:", tokens)

  # Stopword Removal and Lemmatization
  lemmas = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]
  print("Lemmatized Text:", lemmas)


text_preprocessing(text)
print()
text_preprocessing(text2)

Tokenized Sentences: ['Natural Language Processing (NLP) is a fascinating field of AI that enables computers to understand human language.']
Tokenized Text: ['natural', 'language', 'processing', 'nlp', 'is', 'a', 'fascinating', 'field', 'of', 'ai', 'that', 'enables', 'computers', 'to', 'understand', 'human', 'language']
Lemmatized Text: ['natural', 'language', 'processing', 'nlp', 'fascinating', 'field', 'ai', 'enable', 'computer', 'understand', 'human', 'language']

Tokenized Sentences: ['Artificial Intelligence is shaping the future.']
Tokenized Text: ['artificial', 'intelligence', 'is', 'shaping', 'the', 'future']
Lemmatized Text: ['artificial', 'intelligence', 'shape', 'future']


First, we import spaCy, the NLP library used for this task. Next, we load its small English pipeline, "en_core_web_sm", which handles both tokenization and lemmatization. Tokenization breaks a text into meaningful segments called "tokens". This is similar to using "text.split(' ')", except that spaCy also separates punctuation from the word next to it, so "language." becomes the two tokens "language" and ".". Lemmatization reduces each word to its base form. For example, the base form of "computers" is "computer".
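To make the comparison with "text.split(' ')" concrete, here is a minimal sketch that uses only the standard library. The regular expression is a crude stand-in chosen for illustration and is not spaCy's actual tokenization algorithm; the variable names are my own.

```python
import re

text = "Artificial Intelligence is shaping the future."

# Whitespace splitting leaves punctuation attached to the neighbouring word.
whitespace_tokens = text.split(" ")

# Illustrative regex tokenizer (an assumption, not spaCy's rules):
# each run of word characters is a token, and each punctuation mark is its own token.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print("split(' '):", whitespace_tokens)
print("regex     :", regex_tokens)
```

With whitespace splitting the last token is "future." with the period attached, while the regex version yields "future" and "." separately, which is closer to what the spaCy output above shows before punctuation is filtered out.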


The only difference between the two runs is the input sentence; tokenization and lemmatization behaved the same way for both. That is, each sentence was broken down into separate words, and each word was reduced to its base form. In both cases, punctuation such as periods and parentheses is dropped from both lists, while stop words such as "is" and "a" stay in the tokenized list and are removed only from the lemmatized list.

In [None]:
## Task 2: Sentiment Analysis Using Fine-Tuned Model

from transformers import pipeline

sentiment_model = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")

label_map = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}

text_samples = [
    "I love this product! It's amazing.",
    "The service was terrible and I am very disappointed.",
    "The movie was okay, not the best but not the worst."
]

text_samples2 = [
    "I love this product! It's amazing.",
    "That movie sucks.",
    "The food was delicious but overpriced."
]

def classify_sentiment(text_samples):
  sentiment_counts = {"Positive": 0, "Neutral": 0, "Negative": 0}

  for text in text_samples:
      sentiment = sentiment_model(text)
      label = label_map[sentiment[0]['label']]
      confidence = round(sentiment[0]['score'], 2)
      sentiment_counts[label] += 1
      print(f"Text: {text}")
      print(f"Sentiment Analysis Result: {label} (Confidence: {confidence})")
      print("-" * 50)

  print("Sentiment Distribution:", sentiment_counts)

classify_sentiment(text_samples)
print()
classify_sentiment(text_samples2)


Text: I love this product! It's amazing.
Sentiment Analysis Result: Positive (Confidence: 0.99)
--------------------------------------------------
Text: The service was terrible and I am very disappointed.
Sentiment Analysis Result: Negative (Confidence: 0.98)
--------------------------------------------------
Text: The movie was okay, not the best but not the worst.
Sentiment Analysis Result: Neutral (Confidence: 0.49)
--------------------------------------------------
Sentiment Distribution: {'Positive': 1, 'Neutral': 1, 'Negative': 1}

Text: I love this product! It's amazing.
Sentiment Analysis Result: Positive (Confidence: 0.99)
--------------------------------------------------
Text: That movie sucks.
Sentiment Analysis Result: Negative (Confidence: 0.97)
--------------------------------------------------
Text: The food was delicious but overpriced.
Sentiment Analysis Result: Positive (Confidence: 0.83)
--------------------------------------------------
Sentiment Distribution: {'P

In part 2, we want to analyze the sentiment of each text. To do so, we import "pipeline" from the Transformers library and use it to load the sentiment model "cardiffnlp/twitter-roberta-base-sentiment", which was trained on about 58 million tweets. We then run two sets of sample texts through the model, which classifies the mood of each sentence. Finally, we show the count of positive, neutral, and negative results.

The confidence scores changed significantly between sentences. For example, the sentence "The movie was okay, not the best but not the worst" was given a confidence of 0.49 for neutral, but "The food was delicious but overpriced" had a confidence score of 0.83 for positive.


However, I would argue that the sentence "The food was delicious but overpriced" should have been classified as neutral, because it pairs a positive statement (the food was delicious) with a negative one (the food was overpriced). The positive part is cancelled out by the negative part, which makes the overall sentiment neutral.
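The confidence values discussed above are probabilities, and the pipeline prints only the largest one. The sketch below shows how a low top score such as 0.49 can arise when the probability is spread across all three labels. The logits are made-up numbers chosen for illustration, not values produced by the model, and the helper name is my own.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Negative", "Neutral", "Positive"]

# Hypothetical logits for an ambivalent sentence (assumed values, not model output).
hypothetical_logits = [0.2, 0.9, 0.6]

probs = softmax(hypothetical_logits)
top_index = max(range(len(probs)), key=lambda i: probs[i])

for label, prob in zip(labels, probs):
    print(f"{label:<8} {prob:.2f}")
print("Top label:", labels[top_index])
```

Here the top label wins with less than half of the probability mass, so the other two labels together are more likely than the reported one. Looking at all three scores, rather than only the top one, is one way to spot mixed-sentiment sentences.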

In [None]:
## Task 3: Named Entity Recognition Using Generative AI with Token Aggregation

import warnings
warnings.filterwarnings("ignore")

ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")

text_sample = "Apple Inc. was founded by Steve Jobs and Steve Wozniak in California. The company made $274.5 billion in revenue in 2020."
text2 = "Elon Musk founded Space Exploration Technologies Corporation, commonly called SpaceX, in California to decrease costs of space launches and create a colony on Mars and beyond."

def classify_entities(text_sample):
  entities = ner_pipeline(text_sample)
  entities_sorted = sorted(entities, key=lambda x: x['score'], reverse=True)

  print("Named Entity Recognition (NER) Results:")
  for entity in entities_sorted:
      confidence = round(entity['score'], 2)
      print(f"Entity: {entity['word']:<15} | Type: {entity['entity_group']:<5} | Confidence: {confidence}")


classify_entities(text_sample)
print()
classify_entities(text2)


Named Entity Recognition (NER) Results:
Entity: California      | Type: LOC   | Confidence: 1.0
Entity: Apple Inc       | Type: ORG   | Confidence: 1.0
Entity: Steve Jobs      | Type: PER   | Confidence: 0.9900000095367432
Entity: Steve Wozniak   | Type: PER   | Confidence: 0.8899999856948853

Named Entity Recognition (NER) Results:
Entity: California      | Type: LOC   | Confidence: 1.0
Entity: Space Exploration Technologies Corporation | Type: ORG   | Confidence: 1.0
Entity: SpaceX          | Type: ORG   | Confidence: 1.0
Entity: Mars            | Type: LOC   | Confidence: 1.0
Entity: Elon Musk       | Type: PER   | Confidence: 1.0


In this 3rd part, we want to identify the persons, locations, and organizations mentioned in each text. We load the named-entity recognition (NER) model "dbmdz/bert-large-cased-finetuned-conll03-english" through the same "pipeline" function. Then we run the texts through the model to locate our entities.
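The task title mentions token aggregation: the model tags individual tokens, and the aggregation step merges neighbouring tokens of the same type into one entity. The sketch below is a simplified illustration of that idea, assuming B-/I- style tags and an averaged score per group. The function name, the tagged tokens, and their scores are hypothetical and are not the library's implementation or real model output.

```python
def group_entities(tagged_tokens):
    """Merge adjacent tokens of the same entity type into one entity span."""
    groups = []
    current = None
    for word, tag, score in tagged_tokens:
        if tag == "O":  # "O" marks a token outside any entity
            current = None
            continue
        prefix, entity_type = tag.split("-", 1)
        if current is not None and prefix == "I" and current["type"] == entity_type:
            # Continuation of the entity that is currently open.
            current["words"].append(word)
            current["scores"].append(score)
        else:
            # Start a new entity.
            current = {"type": entity_type, "words": [word], "scores": [score]}
            groups.append(current)
    return [
        {
            "entity_group": g["type"],
            "word": " ".join(g["words"]),
            "score": sum(g["scores"]) / len(g["scores"]),  # mean token score
        }
        for g in groups
    ]

# Hypothetical per-token tags and scores (assumed for illustration).
tagged = [
    ("Steve", "B-PER", 0.99), ("Jobs", "I-PER", 0.99),
    ("and", "O", 1.00),
    ("Steve", "B-PER", 0.98), ("Wozniak", "I-PER", 0.80),
    ("in", "O", 1.00),
    ("California", "B-LOC", 1.00),
]

entities = group_entities(tagged)
for entity in entities:
    print(f"{entity['word']:<15} {entity['entity_group']:<4} {entity['score']:.2f}")
```

Because the group score in this sketch is an average, one uncertain token is enough to pull down the confidence of the whole entity.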

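For the first text, the entities detected are California for location, Apple Inc for organization, and Steve Jobs and Steve Wozniak for persons. The revenue figure and the year are not reported because the model was fine-tuned on CoNLL-2003, which only labels persons, locations, organizations, and miscellaneous names. For the second text, California and Mars are detected for location, SpaceX and Space Exploration Technologies Corporation for organization, and Elon Musk for person.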

Changing the names of the locations, persons, and organizations did not affect whether the model identified them, but it did change the confidence level. For the first text, the confidence for the location and the organization rounds to 100 percent, while the confidence for the persons is lower: 99 percent for Jobs and 89 percent for Wozniak. With different entity names in the second text, the confidence rounds to 100 percent for every entity.

Perhaps the confidence level for Steve Jobs and Steve Wozniak is lower than for Elon Musk because Musk is mentioned almost all the time today and the two Steves are not, although the output alone cannot confirm this.

In [None]:
## Task 4: Machine Translation Using Generative AI

to_spanish = "Helsinki-NLP/opus-mt-en-es"
to_chinese = "Helsinki-NLP/opus-mt-en-zh"

translation_pipeline = pipeline("translation_en_to_es", model=to_spanish)

sentence = "Natural language processing is a branch of AI."
sentence2 = "Although AI is still in its infancy, it is being used more and more now, and we will see it in just about every industry such as healthcare and manufacturing."

def translate_text(sentence):
  translated_text = translation_pipeline(sentence)[0]['translation_text']
  print("Translated Text:", translated_text)

translate_text(sentence)
print()
translate_text(sentence2)

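# Install sacremoses, a dependency the translation model's tokenizer recommends.
# Ideally this runs before the pipeline is created.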
import os
os.system("pip install sacremoses")


Translated Text: El procesamiento del lenguaje natural es una rama de la IA.

Translated Text: Aunque la IA todavía está en su infancia, se está utilizando cada vez más ahora, y lo veremos en casi todas las industrias como la salud y la fabricación.


0

Part 4 is all about translating text from English to Spanish, which we do with the model "Helsinki-NLP/opus-mt-en-es". We run each sentence through the translation pipeline and print the translated text.

I believe the model translated both sentences correctly: I do not see any grammatical errors, and it handled the longer, more complex sentence well.

In [None]:
## Task 5: Topic Modeling & Classification Using Generative AI

topic_model_pipeline = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

articles = {
    "Article 1": "The new government policy on climate change was announced today.",
    "Article 2": "The stock market has seen an unexpected rise this week.",
    "Article 3": "Scientists discovered a new exoplanet that could support life.",
    "Article 4": "The latest smartphone has innovative features never seen before."
}

article2 = {
    "Article 1": "A carrier battle group appears in the Western Pacific as a show of force to North Korea.",
    "Article 2": "Bitcoin prices reached an all-time high this week.",
    "Article 3": "A new medical discovery that could cure cancer sees massive investment.",
    "Article 4": "The latest aircraft carriers will have quantum computers."
}

def classify_articles(articles):
  categories = ["Politics", "Finance", "Science", "Technology"]

  classified_articles = {}
  for article, content in articles.items():
      classification = topic_model_pipeline(content, candidate_labels=categories)
      classified_articles[article] = (classification['labels'][0], round(classification['scores'][0], 2))

  # Sort Articles by Confidence
  classified_articles_sorted = sorted(classified_articles.items(), key=lambda x: x[1][1], reverse=True)

  print("Automatically Classified Articles:")
  print("{:<10} | {:<15} | {:<10}".format("Article", "Category", "Confidence"))
  print("-" * 40)
  for article, (category, confidence) in classified_articles_sorted:
      print(f"{article:<10} | {category:<15} | {confidence:<10}")

classify_articles(articles)
print()
classify_articles(article2)


Automatically Classified Articles:
Article    | Category        | Confidence
----------------------------------------
Article 3  | Science         | 0.93      
Article 4  | Technology      | 0.93      
Article 1  | Politics        | 0.64      
Article 2  | Finance         | 0.55      

Automatically Classified Articles:
Article    | Category        | Confidence
----------------------------------------
Article 4  | Technology      | 0.85      
Article 3  | Science         | 0.8       
Article 1  | Politics        | 0.63      
Article 2  | Technology      | 0.6       


In part 5, we want to classify article titles into categories using the zero-shot model "facebook/bart-large-mnli". We run each title through the model, which ranks the categories science, technology, politics, and finance, and we show the top category with its confidence score.

The first set of articles was classified correctly. However, I cannot say the same for the second set: article 2 is about Bitcoin prices, not the technology behind Bitcoin, so it should have been classified as finance rather than technology.

For the first set, the confidence level varies from 55 percent to 93 percent. The second set saw a variation from 60 percent to 85 percent.

If an article title contains more than one topic, the code still reports only one category, because it keeps just the top-ranked label. For example, in the second set of articles, we have the title "A new medical discovery that could cure cancer sees massive investment". This is both a science and a finance article because the medical discovery led to a massive investment. Perhaps it was classified as science because the title focuses more on the science than on the finance.
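One way to surface a second topic is to keep every label whose score clears a threshold instead of only the first one. The sketch below assumes a result dictionary shaped like the one the code above indexes, with "labels" and "scores" sorted from most to least likely. The helper name, the threshold, and every score except the 0.80 for science are hypothetical values chosen for illustration.

```python
def labels_above(result, threshold=0.1):
    """Return every (label, score) pair whose score meets the threshold."""
    return [
        (label, score)
        for label, score in zip(result["labels"], result["scores"])
        if score >= threshold
    ]

# Hypothetical classification result (scores other than 0.80 are assumed).
hypothetical_result = {
    "sequence": "A new medical discovery that could cure cancer sees massive investment.",
    "labels": ["Science", "Finance", "Technology", "Politics"],
    "scores": [0.80, 0.12, 0.05, 0.03],
}

selected = labels_above(hypothetical_result, threshold=0.1)
for label, score in selected:
    print(f"{label:<10} {score:.2f}")
```

With these assumed scores, both science and finance are kept. The zero-shot pipeline also accepts a "multi_label=True" argument, which scores each category independently rather than making the categories compete; that may be a better fit for titles that span several topics.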