# Exploring Word Embeddings with Word2Vec

## Introduction

Word embeddings, which are dense vector representations of words, have revolutionized the way we handle textual data in machine learning. Among the various methods to obtain word embeddings, Google's Word2Vec has stood out due to its ability to capture semantic relationships between words. In this notebook, we'll delve into the Word2Vec embeddings and visualize them to understand their structure better.

### Objectives:

1. **Load a Pretrained Word2Vec Model**: We'll use the `gensim` library to load Google's pretrained Word2Vec model. This model has been trained on a massive amount of data and can capture intricate semantic relationships.

2. **Explore the Capabilities of Word2Vec**: We'll showcase some fundamental operations that can be performed using Word2Vec, such as:
    - Finding words most similar to a given word.
    - Computing the similarity score between two words.
    - Solving analogies. For instance, given "man is to king as woman is to ?", the model can predict the word that best completes the analogy.

3. **Visualize Word Embeddings**: The embeddings are typically in a high-dimensional space (e.g., 300 dimensions). We'll use dimensionality reduction techniques, like PCA, to visualize these embeddings in a 2D space. This visualization will help us understand how similar words cluster together.

## Let's Dive In!

With the background set, let's dive into the exploration and see the magic of Word2Vec in action!


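The `gensim.downloader` utility provides a convenient way to download several pre-trained models. With `return_path=True`, `api.load` returns the path to the downloaded file instead of loading the model into memory: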

In [31]:
import gensim.downloader as api
path = api.load("word2vec-google-news-300", return_path=True)
print(path)

/root/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz


## Load the Pretrained Word2Vec Model

Once you've downloaded the model, you can load it into memory:

In [None]:
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(path, binary=True)


## Test the model

Try different words as input for your model. What is the size of the word embeddings?

In [None]:
len(model['word'])

Here are a few things the model can do:
- Find Most Similar Words:

In [None]:
similar_words = model.most_similar('cat', topn=5)
print(similar_words)


- Compute Similarity between Two Words:

In [None]:
similarity = model.similarity('king', 'queen')
print(similarity)

similarity = model.similarity('king', 'lunch')
print(similarity)
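
`model.similarity` returns the cosine similarity between the two word vectors. As a minimal sketch of that computation, the snippet below uses made-up 3-dimensional vectors (the real embeddings have 300 dimensions). The function name `cosine_similarity` and the toy values are illustrative and not part of `gensim`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values, chosen only for illustration)
king = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.1]
lunch = [0.1, 0.0, 0.9]

print(cosine_similarity(king, queen))  # close to 1: vectors point the same way
print(cosine_similarity(king, lunch))  # much smaller: vectors are nearly orthogonal
```

A score near 1 means the two vectors point in almost the same direction, and a score near 0 means they are unrelated.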


- Use the Analogy Feature:

Given the analogy "man is to king as woman is to ?", the model can find the word that best completes the analogy:

In [None]:
analogy_result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(analogy_result)
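
Behind the analogy query is simple vector arithmetic: roughly, the model looks for the word whose vector is closest (by cosine similarity) to `king - man + woman`, ignoring the query words themselves. The sketch below reproduces this idea on a hypothetical six-word vocabulary with made-up 2D vectors. The helper `solve_analogy` is an illustrative name, and `gensim`'s own implementation differs in details such as vector normalization.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "prince": [0.7, 0.9],
    "princess": [0.7, -0.9],
}

def solve_analogy(a, b, c, vectors):
    """Return the word d that best completes 'a is to b as c is to d'."""
    # Target point: b - a + c, computed component by component
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words, then pick the candidate closest to the target
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

print(solve_analogy("man", "king", "woman", toy_vectors))  # queen
```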


## Plotting Word Embeddings

To visualize word embeddings, we need to reduce their dimensionality to 2D. Let's pick a few words and use PCA to reduce their dimensions. Then, we'll plot them:

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Choose words to visualize
words = ['king', 'queen', 'man', 'woman', 'prince', 'princess', 'boy', 'girl', 'car', 'bike']

# Extract embeddings for these words
embeddings = [model[word] for word in words]

# Use PCA to reduce dimensions to 2D
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

# Plot the results
plt.figure(figsize=(10, 8))
for i, word in enumerate(words):
    plt.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1], marker='x', color='red')
    plt.text(embeddings_2d[i, 0]+0.02, embeddings_2d[i, 1]+0.02, word, fontsize=12)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Word Embeddings in 2D using PCA')
plt.grid(True)
plt.show()


Download a dataset for translation

# Machine Translation with Hugging Face

## Introduction

Machine translation is the task of converting text from one language to another. In this notebook, we will leverage pre-trained models and datasets from Hugging Face, a platform offering a vast array of NLP resources.

We will go through the following:

### 1. Loading a Translation Dataset from Hugging Face:
Hugging Face's `datasets` library offers a plethora of datasets catering to numerous NLP tasks. We will fetch a translation dataset, which will serve as our ground truth for evaluating translation performance.

### 2. Acquiring a Pretrained Translation Model and Tokenizer:
The power of neural machine translation often lies in models trained on vast amounts of data. Thankfully, Hugging Face's `transformers` library provides access to several state-of-the-art pre-trained models. We will retrieve a model specifically trained for our language pair of interest, along with its tokenizer, facilitating the conversion of text into a format the model can understand.

### 3. Evaluating Machine Translation:
Once armed with our model, we will put it to the test! After translating an example from our dataset's source sentences, we will compare the translation against the reference translation. For this, we'll use the BLEU (Bilingual Evaluation Understudy) score, a widely accepted metric in the NLP community for assessing translation quality.


Let's first install a few useful libraries.

In [None]:
!pip install datasets
!pip install transformers
!pip install sentencepiece

Import the dataset from Hugging Face

In [None]:
from datasets import load_dataset

dataset = load_dataset('wmt16', 'de-en')


In [None]:
# Checking the dataset structure
print(dataset)

# Viewing an example from the training set
print(dataset['train'][0])


Let's pick a sentence from the testing dataset

In [None]:
test_split = dataset['test']
line = test_split[1]['translation']  # dict with 'de' and 'en' entries
line

## Loading a Pretrained MarianMT Model for English-to-German Translation

In the code below, we use the `transformers` library from Hugging Face to load a machine translation model and its associated tokenizer:

1. **Import Necessary Modules**:
   We start by importing the relevant classes:
   - `AutoModelForSeq2SeqLM`: Loads the sequence-to-sequence model that converts text from one language to another. For this checkpoint, it resolves to a MarianMT model.
   - `AutoTokenizer`: Loads the matching tokenizer, which converts text into a format (tokens) that the model can understand and vice versa.

2. **Specify the Model Name**:
   We define the `model_name` as `"Helsinki-NLP/opus-mt-en-de"`. This points to a pretrained model on the Hugging Face Model Hub that's optimized for English-to-German translations. The naming convention typically follows the pattern `<organization>/<model-name>`, indicating the group or individual that trained the model and the specific model identifier.

3. **Load the Model and Tokenizer**:
   Using the `from_pretrained` method, we:
   - Load the translation model (`AutoModelForSeq2SeqLM`) using the specified `model_name`.
   - Load the associated tokenizer (`AutoTokenizer`) that knows how to tokenize English text for this specific model and convert German tokens back into text.

With these steps, we're fully equipped to process English text, feed it into our translation model, and obtain German translations.


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


In [None]:
# Inspect how the tokenizer splits the English source sentence into sub-word tokens
tokenizer.tokenize(line['en'])

In [None]:
def translate(text, model, tokenizer):
    # Tokenize the source text (returns input IDs and an attention mask)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

    # Generate the translation with beam search
    outputs = model.generate(**inputs, max_length=512, num_beams=4, early_stopping=True)
    # Convert the generated token IDs back into text
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return translated_text

# Test the function
source_text = line['en']
translated = translate(source_text, model, tokenizer)
print(translated)


In [None]:
source_text

## Evaluating translation with the BLEU Score

BLEU (Bilingual Evaluation Understudy) is a metric used to evaluate the quality of machine-generated translations against human-provided reference translations. Introduced in a [paper](https://www.aclweb.org/anthology/P02-1040.pdf) by Kishore Papineni and others in 2002, BLEU has since become one of the most widely used metrics in machine translation evaluation.

### Key Concepts:

 **Precision of N-grams**:
   - At its core, BLEU considers the precision of n-grams (contiguous sequences of n items from a text) in the machine-generated translation with respect to the reference translation.
   - The precision is computed for different n-gram lengths, such as unigrams (1-grams), bigrams (2-grams), trigrams (3-grams), and so on.
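
As a minimal sketch of this idea, the snippet below computes the clipped ("modified") n-gram precision from the BLEU paper for a single reference. The helper names and the two example sentences are illustrative and not part of `nltk`.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / sum(cand_counts.values())

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

for n in (1, 2, 3):
    print(n, modified_precision(reference, candidate, n))
```

A single wrong word ("sat" instead of "is") costs one unigram match but several bigram and trigram matches, which is why longer n-grams are a stricter test of fluency.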


### Interpreting the Score:

- The BLEU score ranges from 0 to 1 (or 0% to 100% when multiplied by 100).
- A score of 1 indicates that the machine translation matches the human reference translation perfectly.
- Different human translators can produce slightly different translations for the same source text, so even a high-quality translation rarely reaches a score of 1.
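
To see why the score lands between 0 and 1, it helps to look at how the n-gram precisions are combined. In the original formulation, BLEU is the geometric mean of the modified n-gram precisions (usually for n = 1 to 4) multiplied by a brevity penalty that punishes candidates shorter than the reference. The sketch below is a simplified single-reference version without smoothing. `simple_bleu` is an illustrative name, and its values will not necessarily match `nltk`'s `sentence_bleu` once smoothing is applied.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(reference, candidate, max_n=4):
    """Unsmoothed single-reference BLEU: brevity penalty times geometric mean of precisions."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # any zero precision drives the geometric mean to zero
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

reference = "the cat is on the mat".split()
print(simple_bleu(reference, reference))                              # identical: 1.0
print(simple_bleu(reference, "the cat is on the mat today".split()))  # extra word lowers precision
print(simple_bleu(reference, "the cat is on the".split()))            # too short: brevity penalty
print(simple_bleu(reference, "the cat sat on the mat".split()))       # no 4-gram match: 0.0
```

The last case shows that one missing 4-gram match zeroes the unsmoothed score for a short sentence, which is one reason the evaluation cell later in this notebook passes a smoothing function.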

### Limitations:

- It may not always correlate perfectly with human judgment, especially for individual sentences.
- BLEU assumes that more matches with the reference translation indicate better quality, so a correct translation that uses different wording can still receive a low score.


In [None]:
# Split the model's translation into tokens (simple whitespace tokenization)
new_translation_tokens = translated.split()

# sentence_bleu expects a list of tokenized references, so wrap the single reference in a list
reference_texts_tokens = [line['de'].split()]


Import the `sentence_bleu` function from the `nltk` library

In [None]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

In [None]:
# Smoothing avoids a zero score when some higher-order n-grams have no match
smooth = SmoothingFunction().method2

sentence_bleu(reference_texts_tokens, new_translation_tokens, smoothing_function=smooth)

What if you have no time to waste?

## Hugging Face Pipelines: Simplifying NLP Tasks

Hugging Face's `transformers` library offers a high-level utility called `pipelines`, exposed through the `pipeline` function, that provides an easy-to-use interface for several common natural language processing (NLP) tasks. Built on top of the vast collection of pretrained models available in the library, the `pipelines` utility abstracts away many of the underlying complexities, making it straightforward for both newcomers and experienced practitioners to leverage state-of-the-art NLP models.

### Key Features:

1. **Predefined Tasks**:
   - `pipelines` supports a variety of tasks out-of-the-box, including text classification, token classification, question answering, text generation, and more.

2. **Minimal Code**:
   - With just a few lines of code, users can obtain meaningful results without needing to worry about tokenization, model inference, or post-processing.

3. **Flexibility**:
   - While `pipelines` offers simplicity, it doesn't sacrifice flexibility. Users can easily customize the underlying models, tokenizers, and more.

4. **Broad Model Support**:
   - Whether you're looking to use BERT for token classification, GPT-2 for text generation, or any other model in the Hugging Face Model Hub, there's likely a pipeline ready for it.


The `pipelines` utility in the Hugging Face `transformers` library is a powerful tool that democratizes access to state-of-the-art NLP models. Whether you're building an AI-powered chatbot, a document summarization system, or just exploring the capabilities of modern NLP, `pipelines` can accelerate your implementations.



In [None]:
from transformers import pipeline

# Initialize the translation pipeline
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de", tokenizer="Helsinki-NLP/opus-mt-en-de")

# Translate a sentence
source_text = "Hello, how are you?"
translation = translator(source_text)

print(translation[0]['translation_text'])


In [None]:
translator = pipeline("translation_en_to_da", model="Helsinki-NLP/opus-mt-en-da", tokenizer="Helsinki-NLP/opus-mt-en-da")
# Translate a sentence
source_text = "Hello, how are you?"
translation = translator(source_text)

print(translation[0]['translation_text'])

Pipelines are available for many different tasks!

## Available Tasks in Hugging Face's `pipeline` Utility

Hugging Face's `pipeline` utility in the `transformers` library provides a high-level, easy-to-use API for various NLP tasks. As of January 2022, the following tasks are supported:

- **Feature Extraction**:
  - `feature-extraction`
  
- **Text Classification**:
  - `text-classification`
  
- **Sentiment Analysis**:
  - `sentiment-analysis` (alias for `text-classification`)
  
- **Token Classification**:
  - `token-classification`
  
- **Named Entity Recognition (NER)**:
  - `ner` (alias for `token-classification`)
  
- **Question Answering**:
  - `question-answering`
  
- **Masked Language Modeling**:
  - `fill-mask`
  
- **Summarization**:
  - `summarization`
  
- **Translation**:
  - `translation_xx_to_yy` (where `xx` and `yy` are source and target language codes, respectively)
  
- **Text-to-Text Generation**:
  - `text2text-generation`
  
- **Text Generation**:
  - `text-generation`
  
- **Zero-Shot Classification**:
  - `zero-shot-classification`
  
- **Conversational Models**:
  - `conversational`
  
For translation tasks, the format `translation_xx_to_yy` allows flexibility in specifying any source (`xx`) and target (`yy`) language combination, provided there's a model that supports that particular pair.
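
As a small sketch of this naming scheme, the helper below builds the task string and a candidate model identifier from two language codes. The function name is hypothetical, and the `Helsinki-NLP/opus-mt-<src>-<tgt>` pattern is an assumption generalized from the two examples in this notebook; not every language pair has a model under that name.

```python
def translation_pipeline_args(src, tgt):
    """Build the task string and a candidate model id for a language pair."""
    task = f"translation_{src}_to_{tgt}"
    # Assumption: the model follows the Helsinki-NLP "opus-mt-<src>-<tgt>" naming
    # used by the English-German and English-Danish examples above.
    model_id = f"Helsinki-NLP/opus-mt-{src}-{tgt}"
    return {"task": task, "model": model_id, "tokenizer": model_id}

print(translation_pipeline_args("en", "de"))
```

The resulting dictionary can be unpacked into the call, as in `pipeline(**translation_pipeline_args("en", "da"))`, which only works if a model with that identifier exists on the Hub.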

To always get the most up-to-date list of supported tasks, refer to the official Hugging Face documentation or inspect the source code of the `pipeline` function in the `transformers` library.


Let's try some of these pipelines!

In [None]:
from transformers import pipeline

# Initialize the sentiment analysis pipeline
sentiment_analyzer = pipeline("sentiment-analysis")

# Analyze the sentiment of a sample sentence
result = sentiment_analyzer("I love Artificial Intelligence !!")

result

In [None]:
# Initialize the text generation pipeline
text_generator = pipeline("text-generation")

# Generate text based on a given prompt
prompt = "Once upon a time"
generated_text = text_generator(prompt, max_length=100, do_sample=True, temperature=0.7)

print('\n')
print(generated_text[0]['generated_text'])
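
The `do_sample=True` and `temperature=0.7` arguments make generation stochastic: instead of always picking the most likely next token, the model samples from its predicted distribution, and the temperature rescales the logits before the softmax. The sketch below shows that rescaling on three made-up logits (hypothetical values, chosen only for illustration). Temperatures below 1 sharpen the distribution, and temperatures above 1 flatten it.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical next-token scores
for t in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```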


In [None]:
# Initialize the question answering pipeline
qa_pipeline = pipeline("question-answering")

# Define the context and the question
context = """
The Transformers library provides state-of-the-art machine learning architectures like BERT, GPT-2, and RoBERTa.
It is designed by Hugging Face for natural language processing tasks such as text classification, extraction, and translation.
"""
question = "Who designed the Transformers library?"

# Extract the answer from the context
answer = qa_pipeline(question=question, context=context)

print(f"Question: {question}")
print(f"Answer: {answer['answer']} (with score: {answer['score']:.4f})")


In [None]:
from transformers import pipeline

# Initialize the NER pipeline
ner_pipeline = pipeline("ner")

# Define the text
text = "Hugging Face is a company based in New York City. Its Transformers library is very popular in the NLP community."

# Recognize named entities in the text
entities = ner_pipeline(text)

# Display the recognized entities
for entity in entities:
    word = entity["word"]
    label = entity["entity"]
    score = entity["score"]
    print(f"Entity: {word}, Label: {label}, Score: {score:.4f}")
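
Without further options, the `ner` pipeline reports one entry per sub-word token, so a name such as "Hugging Face" can appear split into several pieces. The pipeline accepts an `aggregation_strategy` argument (for example `"simple"`) that groups them for you. As a sketch of the idea, the helper below merges consecutive tokens that share an entity type. The `tokens` list is hand-written for illustration and is not actual model output, and the `##` continuation prefix assumes a WordPiece tokenizer.

```python
def merge_entities(tokens):
    """Group consecutive tokens with the same entity type into whole entities."""
    merged = []
    for token in tokens:
        entity_type = token["entity"].split("-")[-1]  # "I-ORG" -> "ORG"
        word = token["word"]
        last = merged[-1] if merged else None
        is_continuation = (
            last is not None
            and last["type"] == entity_type
            and token["index"] == last["last_index"] + 1
            and not token["entity"].startswith("B-")  # "B-" marks a new entity
        )
        if is_continuation:
            # "##" pieces attach directly; full words are joined with a space
            last["text"] += word[2:] if word.startswith("##") else " " + word
            last["last_index"] = token["index"]
        else:
            merged.append({"text": word, "type": entity_type, "last_index": token["index"]})
    return [(entity["text"], entity["type"]) for entity in merged]

# Hand-written sample in the shape of the pipeline output (not real model output)
tokens = [
    {"word": "Hu", "entity": "I-ORG", "index": 1},
    {"word": "##gging", "entity": "I-ORG", "index": 2},
    {"word": "Face", "entity": "I-ORG", "index": 3},
    {"word": "New", "entity": "I-LOC", "index": 9},
    {"word": "York", "entity": "I-LOC", "index": 10},
    {"word": "City", "entity": "I-LOC", "index": 11},
]
print(merge_entities(tokens))
```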


Now you are ready to bring NLP into your applications!!