# Natural Language Processing

Natural Language Processing (NLP) is a field of AI that helps computers understand, interpret, and respond to human language.  It has applications in chatbots, sentiment analysis, language translation, and more.

## Challenges in NLP

Ambiguity: Words can have multiple meanings (e.g., "bank" as a financial institution vs river bank).

Context Understanding: Understanding implied meanings or references, e.g., the idiom "digging oneself into a hole".

Computational Complexity: Analyzing large datasets efficiently.

Rather than building these capabilities from scratch, we can use standard libraries whose models have been trained on large corpora. For example, spaCy is widely used in production for tasks such as tokenization, lemmatization, part-of-speech tagging, and named entity recognition.

In [None]:
!pip install spacy==3.5.3

In [None]:
import spacy
!python -m spacy download en_core_web_sm

## Tokenization:
Splitting text into individual words or sentences.

In [33]:
import spacy
nlp=spacy.load("en_core_web_sm")
text = "Manufacturing machines are important. Machines help production"
doc = nlp(text)
tokens = []
for token in doc:
    tokens.append(token)
print("Tokens ",tokens)

Tokens  [Manufacturing, machines, are, important, ., Machines, help, production]
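The definition above mentions splitting into sentences as well as words. As a rough illustration of both, here is a minimal regex-based sketch. It is a simplification for intuition only, not spaCy's tokenization algorithm, and the function names `simple_tokenize` and `simple_sentences` are made up for this example.

```python
import re

def simple_tokenize(text):
    # A word is a run of letters, digits or apostrophes;
    # any other non-space character becomes its own token.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def simple_sentences(text):
    # Split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Manufacturing machines are important. Machines help production"
print(simple_tokenize(text))
print(simple_sentences(text))
```

Rules this simple break on abbreviations such as "e.g." or "Dr.", which is one reason to prefer a library tokenizer in practice.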


## Stemming and Lemmatization
Stemming: Reduces words to a root form by chopping off suffixes; the result need not be a real word.

Lemmatization: Converts words to their dictionary base form (lemma), e.g., "are" → "be".

In [38]:
# spaCy does not include a stemmer; lemmatization gives each token's base form directly
lemmas = [token.lemma_ for token in tokens]
print(lemmas)

['manufacturing', 'machine', 'be', 'important', '.', 'machine', 'help', 'production']
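Since spaCy has no stemmer, the difference between the two approaches can be sketched with toy stand-ins. The suffix list and the mini-dictionary below are hypothetical and far smaller than what a real stemmer (e.g., Porter) or lemmatizer uses; the point is only to contrast rule-based chopping with dictionary lookup.

```python
def toy_stem(word):
    # Crude suffix stripping with a hypothetical rule list (not the Porter algorithm).
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for a lemmatizer's lookup tables.
TOY_LEMMAS = {"machines": "machine", "are": "be", "running": "run", "better": "good"}

def toy_lemma(word):
    # Fall back to the word itself when it is not in the dictionary.
    return TOY_LEMMAS.get(word, word)

for w in ["machines", "running", "are", "better"]:
    print(f"{w}: stem={toy_stem(w)}, lemma={toy_lemma(w)}")
```

The stemmer produces fragments such as "machin" and "runn", while the lookup returns real dictionary forms and can handle irregular words like "are" and "better".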


In [40]:
industries = ["Machines", "Machine", "Machining", "Production Machines", "Manufacturing machine"]

normalized_industries = []
for industry in industries:
    doc = nlp(industry)
    lemmatized_text = " ".join([token.lemma_ for token in doc])
    normalized_industries.append(lemmatized_text)

print("Normalized Industry Names:", list(set(normalized_industries)))

Normalized Industry Names: ['production machine', 'manufacturing machine', 'machine', 'machining']


## Named Entity Recognition and Part-of-Speech Tagging

NER (Named Entity Recognition): Identifies real-world entities like people, organizations, locations, etc., in text (e.g., "Narendra Modi" → PERSON, "India" → GPE). spaCy uses statistical models trained on labeled data to detect and label these entities.

POS Tagging (Part of Speech Tagging): Assigns grammatical categories to words, such as noun, verb, or adjective (e.g., "run" → VERB). spaCy performs this using pre-trained language models that analyze word context.

spaCy processes text by building a "Doc" object containing linguistic annotations for each token, including its lemma, POS tag, and named entity label.


In [42]:
text = "Narendra Modi is the Prime Minister of India"

doc = nlp(text)

print("Tokens and POS tags")
for token in doc:
    print(f"{token.text}: {token.pos_}")

print("\n Named Entities")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")

Tokens and POS tags
Narendra: PROPN
Modi: PROPN
is: AUX
the: DET
Prime: PROPN
Minister: PROPN
of: ADP
India: PROPN

 Named Entities
Narendra Modi: PERSON
India: GPE


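Before transformers and embeddings came into the picture, libraries like these handled a wide variety of NLP tasks. spaCy's standard pipelines do not ship a sentiment analyzer, so the next example uses TextBlob for sentiment analysis.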

In [1]:
!pip install textblob

Collecting textblob
  Downloading textblob-0.18.0.post0-py3-none-any.whl (626 kB)
Installing collected packages: textblob
Successfully installed textblob-0.18.0.post0

In [3]:
import spacy
from textblob import TextBlob

# Load English language model in spaCy
nlp = spacy.load("en_core_web_sm")

def analyze_sentiment(text):
    # Run the spaCy pipeline; doc.text is the original string, unchanged
    doc = nlp(text)

    # TextBlob does the sentiment scoring; spaCy does not contribute to the score here
    sentiment = TextBlob(doc.text).sentiment
    
    # Output sentiment polarity and subjectivity
    print(f"Text: {text}")
    print(f"Polarity: {sentiment.polarity} (range -1 to 1, negative to positive)")
    print(f"Subjectivity: {sentiment.subjectivity} (range 0 to 1, objective to subjective)")

# Example usage
analyze_sentiment("I love spaCy and Python! They make NLP easy and fun.")
analyze_sentiment("I hate waiting for hours in traffic.")


Text: I love spaCy and Python! They make NLP easy and fun.
Polarity: 0.4527777777777778 (range -1 to 1, negative to positive)
Subjectivity: 0.5444444444444444 (range 0 to 1, objective to subjective)
Text: I hate waiting for hours in traffic.
Polarity: -0.8 (range -1 to 1, negative to positive)
Subjectivity: 0.9 (range 0 to 1, objective to subjective)
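The scores above come from TextBlob's default analyzer, which works from a sentiment lexicon rather than a trained neural model. A rough sketch of the lexicon-averaging idea follows. The four-word lexicon and its scores are hypothetical, and TextBlob's real lexicon and its handling of intensifiers and negation are more involved, so this will not reproduce the numbers above in general.

```python
import re

# Hypothetical polarity scores in [-1, 1]; a real lexicon has thousands of entries.
TOY_LEXICON = {"love": 0.5, "easy": 0.4, "fun": 0.3, "hate": -0.8}

def toy_polarity(text):
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    # Average over the words that carry sentiment; text with none scores 0.0.
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("I love spaCy and Python! They make NLP easy and fun."))
print(toy_polarity("I hate waiting for hours in traffic."))
```

A scorer like this has no notion of context, so a sentence such as "I do not love waiting" would still come out positive.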


## Vector Embeddings
Alright, things are going to get interesting. For a long time, researchers relied on hand-crafted features such as part-of-speech tags and sentiment scores. Then came Word2Vec, which represents words or tokens as vectors. Introduced in 2013 by Google researchers led by Tomas Mikolov, Word2Vec changed NLP by learning dense word embeddings in a vector space, so that simple vector arithmetic captures semantic relationships (e.g., "king - man + woman ≈ queen").

Word vectors live in an n-dimensional space. Here's a simple representation of word vectors in a few dimensions:
![Word vectors](https://corpling.hypotheses.org/files/2018/04/3dplot-500x381.jpg) 

Word vectors are built around the idea that "the meaning of a word can be derived from the company it keeps". Hence, a neural network is trained to predict either the target word from the surrounding words (CBOW) or the surrounding words from the target word (Skip-Gram).


![CBOW Skip gram](https://i0.wp.com/spotintelligence.com/wp-content/uploads/2023/12/continuous-bag-of-words-vs-skip-gram-1-1024x576.webp?resize=1024%2C576&ssl=1) 
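To make the two training setups concrete, here is a minimal sketch of how training examples could be generated from a tokenized sentence. The function names and the `window` parameter are illustrative choices, not taken from any particular Word2Vec implementation.

```python
def skipgram_pairs(tokens, window=2):
    # Skip-Gram: each (target, context) pair is one training example.
    pairs = []
    for i, target in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    # CBOW: the surrounding words together predict the target word.
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = "machines help production lines".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

With `window=1`, the word "help" yields the Skip-Gram pairs `("help", "machines")` and `("help", "production")`, and the single CBOW example `(["machines", "production"], "help")`.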

Now let's look at some examples of how one can derive meaning from word vectors. We will use pre-trained vectors, since they have been trained on a large corpus of text. The Stanford NLP group publishes a good set of pre-trained word vectors called GloVe. Download them from http://nlp.stanford.edu/data/glove.6B.zip, unzip the archive, and point `glove_path` below at the extracted file.

In [4]:
import os
glove_path = "/Users/adityaganguli/Downloads/glove/glove.6B.100d.txt"
print("File exists:", os.path.exists(glove_path))

File exists: True


### How word embeddings are trained using neural networks

Training Process:

**Input Layer**: A one-hot encoded vector for the word.

**Hidden Layer**: Contains the embedding weights. This layer maps high-dimensional one-hot vectors to dense, lower-dimensional embeddings.

**Output Layer**: Predicts the probability of context words (softmax activation).

**Loss Function**: Optimizes based on how well the model predicts the surrounding words (e.g., cross-entropy loss).

#### Optimization:

**Backpropagation** adjusts the weights (embeddings) based on the loss, gradually improving the vector representation of words.
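The layers and the optimization step described above can be sketched in a few lines of NumPy. This is a toy Skip-Gram step with a full softmax over a three-word vocabulary. The vocabulary, dimensions, learning rate, and the name `train_step` are assumptions made for illustration; real Word2Vec training uses tricks such as negative sampling to avoid the full softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["machines", "help", "production"]   # toy vocabulary (assumption)
V, D = len(vocab), 4                          # vocabulary size, embedding dimension

W_in = rng.normal(scale=0.1, size=(V, D))     # hidden layer: row i is the embedding of word i
W_out = rng.normal(scale=0.1, size=(D, V))    # output layer weights

def train_step(W_in, W_out, target_idx, context_idx, lr=0.5):
    """One Skip-Gram update; modifies the weights in place and returns the loss."""
    x = np.zeros(V)
    x[target_idx] = 1.0                       # input layer: one-hot vector
    h = x @ W_in                              # hidden layer: picks out the target's embedding
    scores = h @ W_out
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()     # output layer: softmax over the vocabulary
    loss = -np.log(probs[context_idx])        # cross-entropy against the true context word

    # Backpropagation: gradient w.r.t. the scores, then w.r.t. each weight matrix
    d_scores = probs.copy()
    d_scores[context_idx] -= 1.0
    d_h = W_out @ d_scores
    W_out -= lr * np.outer(h, d_scores)
    W_in[target_idx] -= lr * d_h
    return float(loss)

# Repeatedly train on the pair ("machines" -> "help") and watch the loss fall.
losses = [train_step(W_in, W_out, target_idx=0, context_idx=1) for _ in range(20)]
print(f"loss before: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

After training, the rows of `W_in` are the word embeddings; the output layer is usually discarded.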

GloVe (Global Vectors) is an unsupervised learning algorithm for obtaining vector representations of words. Instead of predicting context words one window at a time, it is trained on aggregated global word-word co-occurrence statistics from a corpus.
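To give a feel for what "co-occurrence statistics" means, here is a minimal sketch that counts how often word pairs appear within a fixed window. It covers only the counting stage on a hypothetical two-sentence corpus; GloVe itself goes further, down-weighting distant pairs and then fitting vectors to these counts.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    # counts[(w1, w2)] = number of times w2 appears within `window` words of w1
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["machines help production", "machines need maintenance"]  # toy corpus (assumption)
counts = cooccurrence_counts(corpus, window=1)
for pair, n in sorted(counts.items()):
    print(pair, n)
```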

In [5]:
import numpy as np

# Load GloVe embeddings
def load_glove_model(file_path):
    glove_model = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            word = parts[0]
            vector = np.array(parts[1:], dtype='float32')
            glove_model[word] = vector
    print(f"Loaded {len(glove_model)} word vectors.")
    return glove_model

# Specify GloVe file path (100d)
glove_path = "/Users/adityaganguli/Downloads/glove/glove.6B.100d.txt"
glove_model = load_glove_model(glove_path)



Loaded 400000 word vectors.


In [8]:
# Find the word whose vector is nearest (by Euclidean distance) to word_vec.
# Note: the words used to build word_vec are not excluded from the search.
def find_closest_word(glove_model, word_vec):
    closest_word = None
    min_dist = float("inf")
    for word, vec in glove_model.items():
        dist = np.linalg.norm(word_vec - vec)
        if dist < min_dist:
            min_dist = dist
            closest_word = word
    return closest_word

# Vector arithmetic examples:

# 1. Country Capitals: France - Paris + Berlin ≈ ?
capital_prediction = find_closest_word(glove_model, glove_model["france"] - glove_model["paris"] + glove_model["berlin"])
print(f"France - Paris + Berlin ≈ {capital_prediction}")

# 2. Country Leaders: Italy - Germany + Hitler ≈ ?
leader_prediction = find_closest_word(glove_model, glove_model["italy"] - glove_model["germany"] + glove_model["hitler"])
print(f"Italy - Germany + Hitler ≈ {leader_prediction}")

# 3. Animal Babies: Cat - Dog + Puppy ≈ ? (one would hope for "kitten")
baby_animal_prediction = find_closest_word(glove_model, glove_model["cat"] - glove_model["dog"] + glove_model["puppy"])
print(f"Cat - Dog + Puppy ≈ {baby_animal_prediction}")

# 4. Famous Landmarks: India - Paris + Eiffel ≈ ? (mixes a country with a city, so this analogy is loose)
landmark_prediction = find_closest_word(glove_model, glove_model["india"] - glove_model["paris"] + glove_model["eiffel"])
print(f"India - Paris + Eiffel ≈ {landmark_prediction}")


France - Paris + Berlin ≈ germany
Italy - Germany + Hitler ≈ mussolini
Cat - Dog + Puppy ≈ puppy
India - Paris + Eiffel ≈ maldives


In [9]:
result_vector = glove_model["king"] - glove_model["queen"] + glove_model["woman"]
closest_word = find_closest_word(glove_model, result_vector)
print(f"'king' - 'queen' + 'woman' ≈ {closest_word}")

'king' - 'queen' + 'woman' ≈ man
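One of the results above returned "puppy", which is itself an input to the expression. `find_closest_word` searches the whole vocabulary, inputs included, and uses Euclidean distance. A common variation, sketched here as an assumption rather than as the method used above, ranks candidates by cosine similarity and skips the input words. The helper name `analogy` and the three-dimensional toy vectors are made up so the example runs without the GloVe file.

```python
import numpy as np

def analogy(model, a, b, c, topn=1):
    """Words closest by cosine similarity to model[a] - model[b] + model[c], excluding a, b, c."""
    query = model[a] - model[b] + model[c]
    query = query / np.linalg.norm(query)
    scored = []
    for word, vec in model.items():
        if word in (a, b, c):
            continue  # never return one of the input words
        scored.append((float(vec @ query / np.linalg.norm(vec)), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:topn]]

# Hypothetical 3-d vectors, chosen so that king - man + woman lands on queen.
toy_model = {
    "king": np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man": np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([-1.0, 0.2, 0.1]),
}
print(analogy(toy_model, "king", "man", "woman"))
```

The same function accepts the `glove_model` dictionary loaded earlier, e.g. `analogy(glove_model, "cat", "dog", "puppy")`; what it returns there depends on the vectors.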


### Why word vectors were revolutionary
#### Semantic Representation: 
Word vectors encode meaning, placing similar words (e.g., "king" and "queen") close in a continuous vector space.
#### Efficient Encoding: 
Replaced sparse one-hot encoding with dense, low-dimensional vectors, reducing computational and storage costs.
#### Analogy Reasoning: 
Enabled analogy-style vector arithmetic that exposes semantic relationships, e.g., king - queen ≈ man - woman.
#### Scalability: 
Trained on large, unlabeled datasets, making them broadly applicable for diverse NLP tasks.
#### Foundation for Modern NLP:
Inspired contextual embeddings (e.g., BERT, GPT), leading to breakthroughs in machine translation, chatbots, and search engines.
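The first two points can be illustrated with a small comparison. One-hot vectors are all mutually orthogonal, so they cannot express that two words are related, whereas dense vectors can. The two-dimensional embeddings below are hypothetical values picked for illustration, not taken from a trained model.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["king", "queen", "apple"]

# One-hot: one dimension per vocabulary word, a single 1 in each vector.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Hypothetical dense embeddings with only two dimensions.
dense = {"king": [0.9, 0.8], "queen": [0.8, 0.9], "apple": [-0.7, 0.1]}

print("one-hot king vs queen:", cosine(one_hot["king"], one_hot["queen"]))
print("dense   king vs queen:", round(cosine(dense["king"], dense["queen"]), 3))
print("dense   king vs apple:", round(cosine(dense["king"], dense["apple"]), 3))
```

With a real vocabulary, the one-hot vectors would also need one dimension per word (400,000 for the GloVe file used here), compared with 100 dimensions for the dense vectors.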