# Lesson 2: NLP Fundamentals

## Introduction

Natural Language Processing (NLP) is a subfield of artificial intelligence concerned with enabling computers to process, interpret, and generate human language. This lesson explores why NLP matters, what we can accomplish with it, and how NLP algorithms and models have evolved over time.

## Lesson Objectives

By the end of this lesson, you will:
1. Understand the importance of NLP in modern AI applications
2. Recognize the main types of NLP tasks
3. Comprehend the evolution of NLP algorithms and models
4. Gain insights into state-of-the-art NLP models and their applications

## 1. Why We Need NLP

Natural Language Processing is essential for several reasons:

1. **Human-Computer Interaction**: NLP enables more natural and intuitive interaction between humans and computers.
2. **Information Extraction**: It helps in extracting meaningful information from vast amounts of unstructured text data.
3. **Automation**: NLP can automate many language-related tasks, saving time and resources.
4. **Accessibility**: It can make information and services more accessible to people with disabilities or language barriers.
5. **Decision Making**: NLP can assist in data-driven decision making by analyzing text-based information.

## 2. What Can We Do with NLP?

NLP encompasses a wide range of tasks, which can be broadly categorized into three main types:

### 2.1 Classification Tasks

Classification tasks involve categorizing text into predefined classes or categories. Examples include:

- Sentiment Analysis
- Spam Detection
- Topic Classification
- Language Identification

Let's look at a simple example of sentiment analysis using the VADER analyzer from the NLTK library:

In [None]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

text = "I love this course on NLP! It's very informative and engaging."
sentiment_scores = sia.polarity_scores(text)

print(f"Text: {text}")
print(f"Sentiment Scores: {sentiment_scores}")
print(f"Overall Sentiment: {'Positive' if sentiment_scores['compound'] > 0 else 'Negative' if sentiment_scores['compound'] < 0 else 'Neutral'}")
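
The example above labels any nonzero compound score as positive or negative, so a barely positive score of 0.01 counts as "Positive". A common convention for VADER is to leave a small neutral band around zero, typically ±0.05. Here is a minimal sketch of that rule; the helper name `label_from_compound` and its `threshold` parameter are illustrative and not part of NLTK:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Hypothetical compound scores, chosen to hit each branch
for score in (0.82, 0.01, -0.40):
    print(score, "->", label_from_compound(score))
```

In practice you would pass `sentiment_scores['compound']` to the helper and tune the threshold on your own data.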

### 2.2 Extraction Tasks

Extraction tasks involve identifying and extracting specific information from text. Examples include:

- Named Entity Recognition (NER)
- Relation Extraction
- Key Phrase Extraction
- Information Retrieval

Here's a simple example of Named Entity Recognition using spaCy:

In [None]:
import spacy

# Requires the model to be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Apple Inc. was founded by Steve Jobs in Cupertino, California."
doc = nlp(text)

for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")
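
Not every extraction task needs a trained model. When the target has a regular surface form, hand-written patterns are often enough. The sketch below pulls email addresses and ISO-style dates out of text with regular expressions. The patterns are deliberately simplified, and the names `EMAIL_RE`, `DATE_RE`, and `extract_fields` are made up for this example:

```python
import re

# Simplified patterns for illustration; real-world formats are more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_fields(text):
    """Return every email address and YYYY-MM-DD date found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }


sample = "Contact jane.doe@example.com before 2024-03-15 or write to bob@example.org."
print(extract_fields(sample))
```

Patterns like these break down for entities such as people and organizations, whose form varies too much, which is where statistical models like the spaCy pipeline above come in.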

### 2.3 Generation Tasks

Generation tasks involve creating human-readable text based on input or prompts. Examples include:

- Machine Translation
- Text Summarization
- Question Answering
- Dialogue Systems
- Text Completion

Here's a simple example of text generation using a pre-trained GPT-2 model:

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Natural Language Processing is"
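inputs = tokenizer(prompt, return_tensors="pt")

# GPT-2 has no padding token, so reuse the end-of-sequence token to avoid a warning
output = model.generate(
    **inputs,
    max_length=50,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    pad_token_id=tokenizer.eos_token_id,
)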
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(f"Generated Text: {generated_text}")
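
Models like GPT-2 generate text one token at a time, with each new token chosen based on the tokens before it. To make that loop concrete without a neural network, here is a toy sketch that counts which word follows which in a tiny corpus and then greedily appends the most frequent next word. The corpus and the helper names `train_bigrams` and `complete` are invented for illustration. This is a stand-in for the idea of step-by-step generation, not a description of how GPT-2 works internally:

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for each word, which words follow it in the training sentences."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def complete(prompt, follows, max_new_words=5):
    """Greedily append the most frequent next word until none is known."""
    words = prompt.lower().split()
    for _ in range(max_new_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # the last word never appeared with a successor
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


corpus = [
    "language models generate text",
    "language models predict the next word",
    "models generate text one word at a time",
]
follows = train_bigrams(corpus)
print(complete("language", follows))
```

A real language model replaces the count table with a neural network that scores every token in its vocabulary, and usually samples from those scores rather than always taking the top one.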

## 3. Evolution of NLP Algorithms and Models

NLP has seen significant advancements over the years. Let's explore the key milestones in the evolution of NLP algorithms and models:

### 3.1 Bag-of-Words Model

The Bag-of-Words (BoW) model is one of the simplest representations of text in NLP. It represents text as an unordered collection (a multiset, or "bag") of words, disregarding grammar and word order but keeping each word's count.

Example using scikit-learn:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Natural language processing is fascinating.",
    "NLP models have evolved significantly over time.",
    "Modern NLP uses advanced neural networks."
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW representation:\n", X.toarray())
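
To see what "disregarding word order but keeping multiplicity" means in practice, here is a small sketch using only the standard library. The helper `bag_of_words` is a simplified stand-in for `CountVectorizer` (it just lowercases and splits on whitespace), and the sentences are made up:

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word."""
    return Counter(text.lower().split())


first = bag_of_words("the dog chased the cat")
second = bag_of_words("the cat chased the dog")

print(first == second)  # True: same words with the same counts
print(first["the"])     # 2: repeated words are counted, not collapsed
```

The two sentences mean different things, yet their representations are identical. That loss of order is the main limitation the later models in this section address.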

### 3.2 Word2Vec

Word2Vec, introduced by Google in 2013, is a group of related models used to produce word embeddings. These models are shallow, two-layer neural networks trained to reconstruct linguistic contexts of words.

Example using gensim:

In [None]:
from gensim.models import Word2Vec

sentences = [
    ["natural", "language", "processing", "is", "fascinating"],
    ["nlp", "models", "have", "evolved", "significantly", "over", "time"],
    ["modern", "nlp", "uses", "advanced", "neural", "networks"]
]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

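# With only three sentences the learned vectors are close to random,
# so these numbers illustrate the API rather than meaningful similarity.
print("Similarity between 'nlp' and 'language':", model.wv.similarity('nlp', 'language'))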
print("Most similar words to 'neural':", model.wv.most_similar('neural'))
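
What does "reconstructing linguistic contexts" look like as training data? Word2Vec comes in two variants, CBOW and skip-gram. In the skip-gram variant, the network sees a center word and is trained to predict the words around it. The sketch below builds those (center, context) pairs for one sentence. The function name `skipgram_pairs` is illustrative, and the sketch leaves out refinements that real implementations use, such as subsampling frequent words:

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["natural", "language", "processing", "is", "fascinating"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```

The `window` argument plays the same role as the `window=5` parameter passed to gensim above: a larger window pairs each word with more distant neighbors.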

### 3.3 LSTM (Long Short-Term Memory)

LSTM is a type of recurrent neural network (RNN) architecture designed to address the vanishing gradient problem faced by traditional RNNs. LSTMs are capable of learning long-term dependencies, making them particularly useful for sequential data like text.

Example using Keras:

In [None]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = [
    "Natural language processing is fascinating",
    "NLP models have evolved significantly over time",
    "Modern NLP uses advanced neural networks"
]
# Keras expects array-like labels, so use a NumPy array rather than a plain list
labels = np.array([0, 1, 1])  # Binary classification example

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
padded_sequences = pad_sequences(sequences)

model = Sequential([
    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=64),
    LSTM(64),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(padded_sequences, labels, epochs=10, verbose=0)

print("Model summary:")
model.summary()
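
The vanishing gradient problem mentioned above can be illustrated with a few lines of arithmetic. In a plain RNN, the gradient that reaches an early time step is a product of one factor per step, and when those factors are below 1 the product shrinks exponentially. An LSTM's cell state is instead updated additively, so along that path the per-step factor is the forget gate, which the network can learn to keep close to 1. The sketch below compares the two in a scalar setting. The weight, starting state, and gate value are arbitrary, and treating the gate as a constant is a simplification:

```python
import math


def rnn_gradient(steps, w=0.9, h0=0.5):
    """Gradient of h_T with respect to h_0 for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)  # chain rule through one tanh step
    return grad


def cell_state_gradient(steps, forget_gate=0.99):
    """Gradient of c_T with respect to c_0 along the LSTM cell-state path
    c_t = f * c_{t-1} + i * g, with the gate value held constant."""
    return forget_gate ** steps


for steps in (5, 20, 50):
    print(f"{steps:>2} steps  RNN: {rnn_gradient(steps):.2e}  LSTM cell path: {cell_state_gradient(steps):.2e}")
```

After 50 steps the plain RNN gradient is orders of magnitude smaller than the cell-state gradient, which is the intuition behind the LSTM's ability to learn long-term dependencies.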

### 3.4 Transformer

The Transformer model, introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017, revolutionized NLP by replacing recurrence with an architecture built entirely on attention, with self-attention at its core. This allows far more parallelization during training and better handling of long-range dependencies.

While implementing a full Transformer is beyond the scope of this lesson, here's a simplified example of the self-attention mechanism:

In [None]:
import numpy as np

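def self_attention(query, key, value):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    attention_scores = np.dot(query, key.T) / np.sqrt(key.shape[1])
    # Subtract each row's max before exponentiating for numerical stability
    attention_scores = attention_scores - attention_scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(attention_scores)
    attention_weights = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    output = np.dot(attention_weights, value)
    return output

# Example usage: random stand-ins for the projected queries, keys, and values
query = np.random.randn(2, 4)  # 2 queries, each with dimension 4
key = np.random.randn(3, 4)    # 3 keys, each with dimension 4
value = np.random.randn(3, 4)  # 3 values, each with dimension 4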

attention_output = self_attention(query, key, value)
print("Self-attention output shape:", attention_output.shape)
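
The function above takes queries, keys, and values as separate arguments. In a self-attention layer, all three are derived from the same input sequence through learned projection matrices. Here is a minimal sketch of that step, with random matrices standing in for learned weights. The names `self_attention_layer`, `w_q`, `w_k`, and `w_v` are illustrative:

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention_layer(x, w_q, w_k, w_v):
    """Project one sequence into queries, keys, and values, then attend."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # one row of weights per token
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))  # 3 tokens, each a 4-dimensional embedding
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))

output, weights = self_attention_layer(x, w_q, w_k, w_v)
print("Output shape:", output.shape)
print("Attention weight row sums:", weights.sum(axis=1))
```

Each row of `weights` sums to 1 and says how much one token draws on every token in the sequence, itself included.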

### 3.5 BERT (Bidirectional Encoder Representations from Transformers)

BERT, introduced by Google in 2018, is a Transformer-based encoder trained bidirectionally: each token's representation can draw on context from both its left and its right, which lets the model learn contextual relations between words (or sub-words) in a text.

Example using the Transformers library:

In [None]:
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

text = "Natural language processing has advanced significantly."
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

print("BERT output shape:", outputs.last_hidden_state.shape)
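
One way to picture "bidirectional" is through the attention mask. In a BERT-style encoder, every token may attend to every other token. In a left-to-right language model such as GPT-2, a token may only attend to itself and the tokens before it. The sketch below builds both masks for a three-token sequence; `attention_mask` is a made-up helper, not a Transformers API:

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len grid with 1 where position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


tokens = ["nlp", "is", "fun"]
for name, causal in (("Bidirectional (BERT-style)", False), ("Causal (GPT-style)", True)):
    print(name)
    for token, row in zip(tokens, attention_mask(len(tokens), causal)):
        print(f"  {token:>3}: {row}")
```

In the causal mask, the row for "nlp" can see only itself, while in the bidirectional mask it also sees "is" and "fun".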

### 3.6 T5 (Text-to-Text Transfer Transformer)

T5, introduced by Google in 2019, treats every NLP task as a "text-to-text" problem, where both the input and output are always text strings.

Example using the Transformers library:

In [None]:
# T5Tokenizer requires the sentencepiece package to be installed
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_text = "translate English to German: Natural language processing is fascinating."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Cap the output length explicitly rather than relying on the default
outputs = model.generate(input_ids, max_new_tokens=40)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Translated text:", translated_text)
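
The "text-to-text" idea means the task itself is spelled out in the input string, as with the `translate English to German:` prefix above. The sketch below shows how several tasks reduce to the same string-in, string-out format. The prefixes follow the conventions used in the T5 paper, while the helper `to_text_to_text` and its task keys are invented for this example:

```python
# Task prefixes in the style used by T5; the dictionary keys are illustrative.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "acceptability": "cola sentence: ",
}


def to_text_to_text(task, text):
    """Build the input string for a task by prepending its prefix."""
    return TASK_PREFIXES[task] + text


examples = [
    ("translate_en_de", "Natural language processing is fascinating."),
    ("summarize", "NLP has evolved from bag-of-words models to large Transformers."),
    ("acceptability", "The course is fascinating very."),
]
for task, text in examples:
    print(to_text_to_text(task, text))
```

Because every task shares this format, one model with one decoding procedure can handle all of them. Only the prefix changes.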

### 3.7 GPT-2 (Generative Pre-trained Transformer 2)

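GPT-2, introduced by OpenAI in 2019, is a family of transformer-based language models trained on a dataset of 8 million web pages; the largest version has 1.5 billion parameters. The `gpt2` checkpoint used in this lesson is the smallest release, with roughly 124 million parameters. GPT-2 is known for its fluent text generation.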

We've already seen a GPT-2 example earlier in this lesson, but here's another one focusing on text completion:

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The future of NLP is"
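inputs = tokenizer(prompt, return_tensors="pt")

# GPT-2 has no padding token, so reuse the end-of-sequence token to avoid a warning
output = model.generate(
    **inputs,
    max_length=50,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    pad_token_id=tokenizer.eos_token_id,
)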
completed_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(f"Completed Text: {completed_text}")

## Conclusion

In this lesson, we've explored the fundamentals of Natural Language Processing, including why it's important, what tasks it can perform, and how NLP algorithms and models have evolved over time. From simple Bag-of-Words representations to sophisticated models like BERT and GPT-2, NLP has come a long way in its ability to understand and generate human language.

As we progress through this course, we'll delve deeper into these models and learn how to apply them to solve real-world problems. The field of NLP is rapidly evolving, and staying updated with the latest advancements will be crucial for anyone working in this exciting domain.

## Additional Resources

1. "Speech and Language Processing" by Dan Jurafsky and James H. Martin
2. "Natural Language Processing with Transformers" by Lewis Tunstall, Leandro von Werra, and Thomas Wolf
3. The Illustrated Transformer by Jay Alammar: http://jalammar.github.io/illustrated-transformer/
4. BERT paper: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
5. GPT-2 paper: "Language Models are Unsupervised Multitask Learners"

In the next lesson, we'll explore the basic knowledge and architectural characteristics of Large Language Models in more detail.