
# **Introduction to Big Data Analytics and Machine Learning with NLP and Computer Vision**
# *Course Overview*
Welcome to our comprehensive course on Big Data Analytics and Machine Learning, where we'll delve into the fascinating realms of Natural Language Processing (NLP) and Computer Vision. This course is designed to equip you with the essential skills and knowledge needed to analyze, interpret, and glean insights from large datasets using advanced machine learning techniques.

Throughout this journey, we'll explore how machine learning can be applied to understand and process human languages and visual data, opening doors to numerous applications in various industries.

# Introduction to Challenges in Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. It involves the creation of computational algorithms to process and understand human language. NLP has many challenges due to the complexity and nuance of human language, including ambiguity, sarcasm, idioms, and the need for context understanding.

## The Rise of Transformers

Transformers have revolutionized the way we approach NLP tasks. They are a neural network architecture that handles sequences of data, such as text, exceptionally well. Unlike earlier recurrent models, which read a sequence one token at a time, transformers rely on self-attention: for each position in the input, the model computes weights that determine how much every other position influences its representation.
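To make the idea of self-attention more concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions: real transformers apply learned query, key, and value projections and use several attention heads, whereas this toy version uses the input vectors directly. The token vectors are made up for illustration.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Assumption: queries, keys, and values are the input vectors themselves.
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Similarity of this position to every position, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # New representation: weighted average of all input vectors
        out = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up toy embeddings for three tokens
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence, including itself.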

## Hugging Face
Hugging Face is an AI research organization and company known for its pioneering work in the field of natural language processing (NLP). It has gained widespread recognition for developing and open-sourcing the transformers library, which provides a collection of state-of-the-art machine learning models, primarily focused on NLP tasks.

One of the key impacts of Hugging Face's work is the facilitation of the open-source movement in the AI field. By open-sourcing their models and training frameworks, Hugging Face has cultivated a collaborative community where individuals and organizations can contribute to the advancement of NLP technologies. This collaborative approach has accelerated innovation, improved model performance, and increased the pace at which NLP models evolve.


Moreover, Hugging Face has also contributed to the ML community through their model hub, which allows users to share and discover pre-trained models, and through active participation in research, often publishing papers and releasing datasets. Their work has also promoted the practice of fine-tuning pre-trained models for specific tasks, which has become a standard approach in NLP tasks due to its effectiveness and efficiency.

### Hugging Face Transformers

Hugging Face provides a powerful and easy-to-use library for implementing transformer models. The library includes pre-trained models that can be fine-tuned on a specific task, such as text classification, summarization, and question answering.

The Hugging Face transformers library has democratized access to powerful NLP models like BERT, GPT-2, T5, and many others by making them available with an easy-to-use interface. This has enabled researchers, developers, and companies around the world to implement cutting-edge NLP features into their applications without the need for the deep and specialized expertise that was once required to build such models from scratch.




## Examples of NLP Tasks

In this section, we will explore some primary NLP tasks using Hugging Face transformers.



### Text Classification
Text classification involves assigning categories or labels to a piece of text. We will use a pre-trained transformer that has already been fine-tuned for sentiment analysis on the SST-2 dataset, so no further training is needed here.

We will use: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english



```python
# Example Python code for text classification
from transformers import pipeline

classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')
result = classifier("I love using transformers for NLP tasks!")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
```

Output:

```
label: POSITIVE, with score: 0.9983
```


The **pipeline** function is part of the transformers library provided by [Hugging Face](https://huggingface.co/docs). It is a high-level API that bundles a pre-trained model with its tokenizer and post-processing for common tasks such as text classification, question answering, translation, and summarization.

The result is usually a list with a dictionary for each piece of text analyzed. Each dictionary typically contains the keys 'label' and 'score', where 'label' is the predicted sentiment, and 'score' is the confidence level of the prediction.
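As a small illustration of that structure, the sketch below pairs a list of input texts with a list of result dictionaries and keeps only the confident predictions. The helper name `confident_predictions`, the threshold, and the sample texts and scores are hypothetical, made up to mirror the shape of the pipeline output rather than produced by the model.

```python
def confident_predictions(texts, results, threshold=0.9):
    """Pair each text with its prediction and keep those scoring at least `threshold`."""
    return [
        (text, r["label"], r["score"])
        for text, r in zip(texts, results)
        if r["score"] >= threshold
    ]

# Hypothetical inputs and outputs, shaped like a list of pipeline results
sample_texts = ["I love this library.", "It is fine, I guess."]
sample_results = [
    {"label": "POSITIVE", "score": 0.9983},
    {"label": "NEGATIVE", "score": 0.6120},
]

for text, label, score in confident_predictions(sample_texts, sample_results):
    print(f"{text!r} -> {label} ({score:.4f})")
```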

### Summarization
Summarization aims to shorten a piece of text, distilling the most essential information. We will use a pre-trained model for summarizing a news article.

We will use this model: https://huggingface.co/sshleifer/distilbart-cnn-12-6

This is the text to analyze:

```
On Dec. 6, Google launched its latest artificial intelligence (AI) model, Gemini, which it claimed is the most advanced model currently available on the market — even better than the popular model developed by OpenAI, ChatGPT-4.

This bold claim was treated like a challenge by community sleuths across the internet, who swiftly moved to examine the methods and benchmarks used by Google to assert Gemini’s supposed superiority and poke fun at the company’s marketing of the product.

David Gull, CEO of AI-powered wellness startup Vital, told Cointelegraph that each model, be it ChatGPT-4, Llama 2, or now Gemini, has its own set of strengths and challenges.

“When navigating the AI startup world, choosing the right language model is key to the success of your product. With Google’s introduction of the Gemini AI model, our LLM [large language model] options have expanded significantly.”
Currently, OpenAI’s ChatGPT-4 model stands out with its “extensive real-world application,” along with enhanced safety measures, he added.

Gull said as businesses now “dive into” Gemini, a good goal would be to “strike a balance between performance and customization” that aligns with the company’s mission and values and provides the best user experience.
```

```python
# Example Python code for summarization
from transformers import pipeline

summarizer = pipeline('summarization', model='sshleifer/distilbart-cnn-12-6')
article = """On Dec. 6, Google launched its latest artificial intelligence (AI) model, Gemini, which it claimed is the most advanced model currently available on the market — even better than the popular model developed by OpenAI, ChatGPT-4.

This bold claim was treated like a challenge by community sleuths across the internet, who swiftly moved to examine the methods and benchmarks used by Google to assert Gemini’s supposed superiority and poke fun at the company’s marketing of the product.

David Gull, CEO of AI-powered wellness startup Vital, told Cointelegraph that each model, be it ChatGPT-4, Llama 2, or now Gemini, has its own set of strengths and challenges.

“When navigating the AI startup world, choosing the right language model is key to the success of your product. With Google’s introduction of the Gemini AI model, our LLM [large language model] options have expanded significantly.”
Currently, OpenAI’s ChatGPT-4 model stands out with its “extensive real-world application,” along with enhanced safety measures, he added.

Gull said as businesses now “dive into” Gemini, a good goal would be to “strike a balance between performance and customization” that aligns with the company’s mission and values and provides the best user experience."""
summary = summarizer(article, max_length=130, min_length=30, do_sample=False)[0]
print(summary['summary_text'])
```

Output:

```
 Google launched its latest artificial intelligence (AI) model, Gemini, which it claimed is the most advanced model currently available on the market . Community sleuths across the internet quickly moved to examine the methods and benchmarks used by Google to assert Gemini’s supposed superiority .
```


The result is a list containing one dictionary per input text. Each dictionary has the key 'summary_text', which holds the generated summary.

### Question Answering
Question answering models can find the answer to a question in a given context. We will use a pre-trained model to answer a question using a paragraph from Wikipedia.

We will use: https://huggingface.co/distilbert-base-cased-distilled-squad

```python
# Example Python code for question answering
from transformers import pipeline

qa_pipeline = pipeline('question-answering', model="distilbert-base-cased-distilled-squad")
context = """France (French: [fʁɑ̃s] ⓘ), officially the French Republic (French: République française [ʁepyblik fʁɑ̃sɛːz]),[14] is a country located primarily in Western Europe. It also includes overseas regions and territories in the Americas and the Atlantic, Pacific and Indian oceans,[XII] giving it one of the largest discontiguous exclusive economic zones in the world. Metropolitan France shares borders with Belgium and Luxembourg to the north, Germany to the north east, Switzerland to the east, Italy and Monaco to the south east, Andorra and Spain to the south, and a maritime border with the United Kingdom to the north west. Its metropolitan area extends from the Rhine to the Atlantic Ocean and from the Mediterranean Sea to the English Channel and the North Sea. Its overseas territories include French Guiana in South America, Saint Pierre and Miquelon in the North Atlantic, the French West Indies, and many islands in Oceania and the Indian Ocean. Its eighteen integral regions (five of which are overseas) span a combined area of 643,801 km2 (248,573 sq mi) and have a total population of over 68 million as of January 2023.[5][8] France is a unitary semi-presidential republic with its capital in Paris, the country's largest city and main cultural and commercial centre; other major urban areas include Marseille, Lyon, Toulouse, Lille, Bordeaux, Strasbourg, and Nice."""
result = qa_pipeline(question="What is the capital of France?", context=context)
print(f"Answer: '{result['answer']}' with score: {round(result['score'], 4)}")
```

Output:

```
Answer: 'Paris' with score: 0.9945
```


The result is a dictionary containing the keys 'score', 'start', 'end', and 'answer'. 'score' is the model's confidence in the answer, 'start' and 'end' are the character positions of the answer within the input context, and 'answer' is the answer text itself.
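The relationship between 'start', 'end', and 'answer' can be checked with ordinary string slicing. The sketch below assumes that 'start' and 'end' are character offsets into the context, with 'end' exclusive. The result dictionary is built by hand to mimic the pipeline output (the score is illustrative), and `extract_span` is a hypothetical helper name.

```python
def extract_span(context, result):
    """Recover the answer text from the character offsets in a QA result."""
    return context[result["start"]:result["end"]]

# Shortened context and a hand-built result shaped like the pipeline output
context = "France is a unitary semi-presidential republic with its capital in Paris."
start = context.index("Paris")
result = {
    "score": 0.99,  # illustrative value, not produced by a model
    "start": start,
    "end": start + len("Paris"),
    "answer": "Paris",
}

span = extract_span(context, result)
print(f"Characters {result['start']}-{result['end']}: '{span}'")
```

Offsets like these are useful for highlighting the answer inside the original passage.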

### Zero-Shot Learning
Zero-shot learning models are designed to understand and perform tasks they haven't been explicitly trained on. This is particularly valuable in scenarios where labeled data is scarce.

**Zero-Shot Classification**

We will demonstrate zero-shot classification, where the model classifies text into categories it has not seen during training.

We will use: https://huggingface.co/facebook/bart-large-mnli
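It may help to see roughly how a model trained on natural language inference (NLI), such as bart-large-mnli, can be used for zero-shot classification. As far as we understand the approach, each candidate label is turned into a hypothesis sentence, the model scores whether the input text entails that hypothesis, and the entailment scores are normalized across labels. The sketch below imitates only the bookkeeping: the template string is our assumption about the library default, the logits are made up, and no model is actually run.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]

def softmax(xs):
    """Normalize raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(labels, entailment_logits):
    """Sort labels by normalized entailment score, highest first."""
    probs = softmax(entailment_logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

labels = ['education', 'politics', 'business', 'technology']
print(build_hypotheses(labels))

# Made-up entailment logits, one per hypothesis (a real model would produce these)
made_up_logits = [-1.0, -2.5, 0.2, 3.0]
ranked = rank_labels(labels, made_up_logits)
for label, prob in ranked:
    print(f"{label}: {prob:.4f}")
```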

```python
# Example Python code for zero-shot classification
from transformers import pipeline

zero_shot_classifier = pipeline('zero-shot-classification', model="facebook/bart-large-mnli")
result = zero_shot_classifier(
    "This is a new breakthrough in NLP, allowing models to generalize to tasks not seen during training.",
    candidate_labels=['education', 'politics', 'business', 'technology']
)
print(f"Label: {result['labels'][0]}, with score: {round(result['scores'][0], 4)}")
print(f"Label: {result['labels'][1]}, with score: {round(result['scores'][1], 4)}")
```

Output:

```
Label: technology, with score: 0.9717
Label: business, with score: 0.0147
```


The result is a dictionary containing the keys 'sequence', 'labels', and 'scores'. 'sequence' is the input text, 'labels' is the list of candidate labels sorted from most to least likely, and 'scores' is the list of confidence scores in the same order, so the first entry of each list is the top prediction.
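Because 'labels' and 'scores' are parallel lists, they can be walked together with `zip`. The sketch below assumes the lists are ordered by descending score, which matches the output above. The dictionary is written by hand: the first two scores are taken from that output, while the last two are made up to complete the example. `top_labels` is a hypothetical helper name.

```python
def top_labels(result, k=2):
    """Return the first k (label, score) pairs from a zero-shot result."""
    return list(zip(result["labels"], result["scores"]))[:k]

# Hand-written dictionary shaped like the zero-shot pipeline output
sample_result = {
    "sequence": "This is a new breakthrough in NLP, allowing models to generalize to tasks not seen during training.",
    "labels": ["technology", "business", "education", "politics"],
    "scores": [0.9717, 0.0147, 0.0090, 0.0046],  # last two values are made up
}

for label, score in top_labels(sample_result, k=3):
    print(f"Label: {label}, with score: {score:.4f}")
```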