# Natural Language Processing (NLP) Overview

Welcome to this introductory notebook on Natural Language Processing (NLP). This notebook aims to provide an overview of NLP, its importance, and key techniques used in the field, along with code examples using Python libraries.

## What is Natural Language Processing?

Natural Language Processing (NLP) is a subfield of data science and artificial intelligence focused on enabling computers to understand, interpret, and generate human language. With NLP techniques, we can process large amounts of text data to perform tasks such as:

- **Automatic Summarization**
- **Machine Translation**
- **Named Entity Recognition**
- **Sentiment Analysis**
- **Speech Recognition**
- **Topic Segmentation**




## Why is Natural Language Processing Important?

- **Facilitates Communication**: Enables seamless interaction between humans and computers, powering chatbots, virtual assistants, and translation systems.
- **Extracts Meaningful Information**: Helps in deriving insights from unstructured text data like social media posts, reviews, and articles.
- **Automates Tasks**: Automates language-related tasks such as document classification, report generation, and question answering.
- **Personalizes Experiences**: Enhances user experiences through personalized recommendations and content filtering.





## 1. Text Preprocessing

Text data is often unstructured and noisy, requiring preprocessing to make it suitable for analysis. Text preprocessing involves:

1. **Noise Removal**
2. **Lexicon Normalization**
3. **Object Standardization**



### 1.1 Noise Removal

Noise is information in the text that is irrelevant to the task at hand, such as stop words, URLs, hashtags, mentions, and punctuation. Removing it lets later steps focus on the meaningful parts of the text.


In [1]:
import re
import string

def remove_noise(text):
    """Remove noise from text by eliminating URLs, mentions, hashtags, punctuation, and digits.

    Args:
        text (str): The input text to be cleaned.

    Returns:
        str: The cleaned text with noise removed.
    """
    # Remove URLs using regex pattern that matches 'http' followed by any non-whitespace characters
    text = re.sub(r'http\S+', '', text)
    
    # Remove mentions (e.g., @user) and hashtags (e.g., #topic) using regex pattern that matches '@' or '#' followed by word characters
    text = re.sub(r'@\w+|#\w+', '', text)
    
    # Remove punctuation by mapping each punctuation character to None
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Remove digits using regex pattern that matches one or more digits
    text = re.sub(r'\d+', '', text)
    
    return text

# Sample text containing a URL, hashtag, and mention
sample_text = "Check out this link: https://example.com #NLP @user123"

# Clean the sample text by removing noise
clean_text = remove_noise(sample_text)

# Print the cleaned text
print(clean_text)


Check out this link   


You can test regular expressions like these interactively at https://regexr.com/
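The `remove_noise` function above does not handle stop words, which the text lists as noise. A minimal sketch of stop-word filtering follows. The `STOP_WORDS` set and the `remove_stop_words` helper are illustrative assumptions, not a standard resource; in practice a curated list from a library would normally be used.

```python
import re

# Assumption: a tiny hand-written stop-word list, for illustration only
STOP_WORDS = {"a", "an", "the", "is", "are", "this", "of", "and", "to", "in"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop tokens that appear in stop_words (case-insensitive)."""
    # Keep runs of letters and apostrophes as tokens
    tokens = re.findall(r"[A-Za-z']+", text)
    return " ".join(t for t in tokens if t.lower() not in stop_words)

filtered = remove_stop_words("This is an example of the noise in a sentence")
print(filtered)  # example noise sentence
```

Which words count as stop words depends on the task. For sentiment analysis, for example, removing "not" would discard important information.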

### 1.2 Lexicon Normalization

Lexicon normalization is a crucial preprocessing step in NLP that involves reducing words to their root or base form.
This helps in minimizing the complexity of text data and ensures that different forms of a word are treated as a single item.

- **Stemming**: This technique truncates words to a base or root form, usually by stripping suffixes with fixed rules. For example, "running" becomes "run", "happiness" becomes "happi", and "cats" becomes "cat".
  - **Use Cases**:
    - **Search Engines**: Improves search results by matching different forms of a word to a single base form.
    - **Text Classification**: Reduces the dimensionality of feature space, making models more efficient.
    - **Information Retrieval**: Enhances the retrieval of documents by treating different word forms as equivalent.

- **Lemmatization**: This technique converts words to their dictionary or base form, considering the context and part of speech. For example, "better" becomes "good", "am" becomes "be", and "geese" becomes "goose".
  - **Use Cases**:
    - **Machine Translation**: Ensures accurate translation by considering the context and part of speech.
    - **Chatbots**: Improves understanding of user input by normalizing words to their base form.
    - **Sentiment Analysis**: Enhances the accuracy of sentiment classification by treating different forms of a word as a single item.

Both techniques normalize text, but they serve different purposes. Stemming is generally faster and more aggressive, which can produce non-dictionary forms. Lemmatization is more accurate because it considers the context and part of speech, but it is computationally more intensive.

Examples:
- Original: "The striped bats are hanging on their feet for best"
- Stemming: "the stripe bat are hang on their feet for best"
- Lemmatization (idealized, assuming each word is lemmatized with its correct part of speech): "the stripe bat be hang on their foot for good"
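To make the contrast concrete, here is a deliberately simplified sketch. `toy_stem` strips suffixes by rule and `toy_lemmatize` looks words up in a small table. Both functions and the `TOY_LEMMAS` table are hypothetical stand-ins, not the Porter algorithm or WordNet.

```python
def toy_stem(word, suffixes=("ing", "ed", "ly", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Assumption: a tiny lookup table standing in for a real lemmatizer's dictionary
TOY_LEMMAS = {"feet": "foot", "geese": "goose", "am": "be", "are": "be", "better": "good"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the lowercased word."""
    return TOY_LEMMAS.get(word.lower(), word.lower())

words = ["hanging", "bats", "feet", "running"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['hang', 'bat', 'feet', 'runn']
print(lemmas)  # ['hanging', 'bats', 'foot', 'running']
```

The rule-based stemmer is cheap but yields the non-word "runn" and cannot handle the irregular plural "feet". The lookup-based lemmatizer gets "foot" right but only knows the words in its dictionary.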

In [2]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('wordnet')


[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/carlosmorales/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/carlosmorales/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/carlosmorales/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

text = "The striped bats are hanging on their feet for best"
words = word_tokenize(text)

# Stemming
stems = [stemmer.stem(word) for word in words]
print("Stemming:", stems)

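# Lemmatization (pos='v' treats every token as a verb, so nouns and
# adjectives such as 'feet' and 'best' are left unchanged)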
lemmas = [lemmatizer.lemmatize(word, pos='v') for word in words]
print("Lemmatization:", lemmas)


Stemming: ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best']
Lemmatization: ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'feet', 'for', 'best']


### 1.3 Object Standardization

Object standardization converts slang, abbreviations, contractions, and other colloquial terms to a standard form. The example below expands a few English contractions.


In [4]:

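import re

# Specific contractions plus a generic "n't" fallback
contractions_dict = {"can't": "cannot", "won't": "will not", "n't": " not"}

def standardize_text(text):
    """Expand the contractions listed in contractions_dict (case-insensitive)."""
    # Try longer keys first so "can't" wins over the generic "n't" suffix
    keys = sorted(contractions_dict, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys), flags=re.IGNORECASE)
    # Substitute inside the text, so trailing punctuation does not block a match
    return pattern.sub(lambda m: contractions_dict[m.group(0).lower()], text)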

sample_text = "I can't do this anymore. It's not fair!"
standard_text = standardize_text(sample_text)
print(standard_text)


I cannot do this anymore. It's not fair!
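The same lookup idea extends to slang and abbreviations. The sketch below is a minimal illustration. The `SLANG` table and the `expand_slang` helper are assumptions made up for this example, not part of any library.

```python
import re

# Assumption: a small hand-written slang table, for illustration only
SLANG = {"u": "you", "gr8": "great", "thx": "thanks", "idk": "I do not know"}

def expand_slang(text, lookup=SLANG):
    """Replace whole tokens found in lookup, leaving other tokens untouched."""
    def replace_token(match):
        token = match.group(0)
        return lookup.get(token.lower(), token)
    # \b\w+\b matches whole words only, so "u" inside "bus" is not replaced
    return re.sub(r"\b\w+\b", replace_token, text)

result = expand_slang("thx, u did a gr8 job")
print(result)  # thanks, you did a great job
```

Matching whole tokens with a regular expression avoids the partial replacements that a plain `str.replace` call would make.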




## 2. Text to Features (Feature Engineering)

Machine learning models operate on numbers, so text must first be transformed into numerical features.



### 2.1 Syntactic Parsing

Syntactic parsing analyzes the grammatical structure of sentences.

- **Part-of-Speech (POS) Tagging**: Assigning word types to each word in a sentence, such as nouns, verbs, adjectives, etc.



In [5]:
from nltk import pos_tag

nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/carlosmorales/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/carlosmorales/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [7]:
text = "I am learning Natural Language Processing"
words = word_tokenize(text)
pos_tags = pos_tag(words)
print("POS Tags:", pos_tags)


POS Tags: [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP')]


In [8]:
def explain_pos_tags(pos_tags):
    """Provide explanations for POS tags."""
    explanations = {
        'CC': 'Coordinating conjunction',
        'CD': 'Cardinal number',
        'DT': 'Determiner',
        'EX': 'Existential there',
        'FW': 'Foreign word',
        'IN': 'Preposition or subordinating conjunction',
        'JJ': 'Adjective',
        'JJR': 'Adjective, comparative',
        'JJS': 'Adjective, superlative',
        'LS': 'List item marker',
        'MD': 'Modal',
        'NN': 'Noun, singular or mass',
        'NNS': 'Noun, plural',
        'NNP': 'Proper noun, singular',
        'NNPS': 'Proper noun, plural',
        'PDT': 'Predeterminer',
        'POS': 'Possessive ending',
        'PRP': 'Personal pronoun',
        'PRP$': 'Possessive pronoun',
        'RB': 'Adverb',
        'RBR': 'Adverb, comparative',
        'RBS': 'Adverb, superlative',
        'RP': 'Particle',
        'SYM': 'Symbol',
        'TO': 'to',
        'UH': 'Interjection',
        'VB': 'Verb, base form',
        'VBD': 'Verb, past tense',
        'VBG': 'Verb, gerund or present participle',
        'VBN': 'Verb, past participle',
        'VBP': 'Verb, non-3rd person singular present',
        'VBZ': 'Verb, 3rd person singular present',
        'WDT': 'Wh-determiner',
        'WP': 'Wh-pronoun',
        'WP$': 'Possessive wh-pronoun',
        'WRB': 'Wh-adverb'
    }
    return [(word, tag, explanations.get(tag, "Unknown")) for word, tag in pos_tags]



# Explain POS tags
explained_pos_tags = explain_pos_tags(pos_tags)

print("Explained POS Tags:", explained_pos_tags)


Explained POS Tags: [('I', 'PRP', 'Personal pronoun'), ('am', 'VBP', 'Verb, non-3rd person singular present'), ('learning', 'VBG', 'Verb, gerund or present participle'), ('Natural', 'NNP', 'Proper noun, singular'), ('Language', 'NNP', 'Proper noun, singular'), ('Processing', 'NNP', 'Proper noun, singular')]
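POS tags are also useful as input to other steps. The lemmatization example in Section 1.2 passes `pos='v'` for every token. A tagger's output could supply the right part of speech per word instead. The sketch below maps Penn Treebank tags to the single-letter codes WordNet uses (`'n'`, `'v'`, `'a'`, `'r'`). The `penn_to_wordnet` helper is hypothetical, and falling back to noun for all other tags is an assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('n', 'v', 'a', 'r')."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("R"):
        return "r"  # adverbs: RB, RBR, RBS (RP also lands here)
    return "n"      # assumption: treat everything else as a noun

# Tagged tokens copied from the POS tagging output above
tagged = [("I", "PRP"), ("am", "VBP"), ("learning", "VBG"), ("Natural", "NNP")]
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)
# [('I', 'n'), ('am', 'v'), ('learning', 'v'), ('Natural', 'n')]
```

Each pair could then be passed to a lemmatizer, for example as `lemmatizer.lemmatize(word, pos=code)`.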




### 2.2 Entity Extraction

Entity extraction, also called named entity recognition (NER), identifies important entities in text, such as the names of people, locations, and organizations.



In [9]:
import spacy
from spacy.cli import download

# Download the 'en_core_web_sm' model if not already present
download('en_core_web_sm')

nlp = spacy.load('en_core_web_sm')


Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [10]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.label_)


Apple ORG
U.K. GPE
$1 billion MONEY
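Extracted entities are often grouped by label before further use. The sketch below does this for the three entities printed above. The pairs are hard-coded so that it runs without spaCy; with a real `doc`, they could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above
entities = [("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY")]

# Collect entity texts under their label
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

print(dict(by_label))
# {'ORG': ['Apple'], 'GPE': ['U.K.'], 'MONEY': ['$1 billion']}
```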



### 2.3 Statistical Features

**Term Frequency-Inverse Document Frequency (TF-IDF)**: This statistical measure evaluates the importance of a word in a document relative to a collection of documents (corpus). It is calculated by multiplying two metrics:

1. **Term Frequency (TF)**: The number of times a word appears in a document, normalized by the total number of words in that document.
   Example: In the document "the cat sat on the mat", the term frequency of "the" is 2/6 ≈ 0.33.

2. **Inverse Document Frequency (IDF)**: The logarithm of the total number of documents divided by the number of documents containing the word. This reduces the weight of words that appear in many documents and increases the weight of rare words.
   Example: If the word "cat" appears in 3 out of 10 documents, the IDF with a base-10 logarithm is log₁₀(10/3) ≈ 0.52.

The TF-IDF score for a word is the product of its TF and IDF scores. This score reflects how important a word is to a specific document in the context of the entire corpus. Libraries often use variants of these formulas: by default, scikit-learn's `TfidfVectorizer` uses raw counts for TF, a smoothed natural-log IDF, and L2 normalization of each document vector, so its values differ from the textbook ones.
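The arithmetic above can be checked in a few lines of plain Python. The second half of the sketch is an assumption-labeled reimplementation of scikit-learn's default `TfidfVectorizer` weighting (raw term counts, IDF = ln((1 + n) / (1 + df)) + 1, then L2 normalization) on the three-sentence corpus used in the next cell. The helper names `tokenize` and `tfidf_row` are made up for this example.

```python
import math
import re
from collections import Counter

# Textbook examples from the text
tf_the = "the cat sat on the mat".split().count("the") / 6  # 2/6
idf_cat = math.log10(10 / 3)                                 # base-10 log
print(round(tf_the, 2), round(idf_cat, 2))  # 0.33 0.52

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(d) for d in corpus]
n_docs = len(docs)
# Document frequency: in how many documents each term occurs
df = Counter(term for doc in docs for term in set(doc))

def tfidf_row(doc):
    """Smoothed TF-IDF weights for one document, L2-normalized."""
    counts = Counter(doc)
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row0 = tfidf_row(docs[0])
print({t: round(w, 4) for t, w in sorted(row0.items())})
# {'document': 0.4694, 'first': 0.6172, 'is': 0.3645, 'the': 0.3645, 'this': 0.3645}
```

Words that occur in every document ("is", "the", "this") get the lowest weight, while "first", which occurs in only one document, gets the highest.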



In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)


In [12]:
print(vectorizer.get_feature_names_out())


['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


In [13]:
print(X.toarray())


[[0.         0.46941728 0.61722732 0.3645444  0.         0.
  0.3645444  0.         0.3645444 ]
 [0.         0.7284449  0.         0.28285122 0.         0.47890875
  0.28285122 0.         0.28285122]
 [0.49711994 0.         0.         0.29360705 0.49711994 0.
  0.29360705 0.49711994 0.29360705]]



## 3. Important NLP Tasks

### 3.1 Text Classification

Text classification assigns text to predefined classes, such as positive or negative sentiment.



In [15]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ['I love this movie', 'I hate this movie', 'This movie is great', 'This movie is terrible']
labels = ['positive', 'negative', 'positive', 'negative']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
clf = MultinomialNB()
clf.fit(X, labels)

test_text = ['I really love this fantastic movie']
test_X = vectorizer.transform(test_text)
print(clf.predict(test_X))


['positive']
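To see why the classifier predicts "positive", here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing on the same four training texts. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the `tokenize` and `predict` helpers are hypothetical, tokens are split on whitespace (so "I" is kept, whereas `CountVectorizer` drops single-character tokens by default), and words unseen in training are ignored.

```python
import math
from collections import Counter, defaultdict

texts = ['I love this movie', 'I hate this movie',
         'This movie is great', 'This movie is terrible']
labels = ['positive', 'negative', 'positive', 'negative']

def tokenize(text):
    return text.lower().split()

# Count documents per class and word occurrences per class
class_counts = Counter(labels)
word_counts = defaultdict(Counter)
for text, label in zip(texts, labels):
    word_counts[label].update(tokenize(text))
vocab = {word for counts in word_counts.values() for word in counts}

def predict(text, alpha=1.0):
    """Return the class with the highest log-probability score."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        # Start from the log prior of the class
        score = math.log(class_counts[label] / len(labels))
        for word in tokenize(text):
            if word in vocab:  # skip words never seen in training
                # Smoothed likelihood of the word given the class
                score += math.log((word_counts[label][word] + alpha)
                                  / (total + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

prediction = predict('I really love this fantastic movie')
print(prediction)  # positive
```

The words "I", "this", and "movie" are equally frequent in both classes, so the prediction is decided by "love", which appears only in a positive training text.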




### 3.2 Text Similarity

Text similarity measures how close two texts are. The example below uses the cosine similarity between their TF-IDF vectors.



In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text1 = "Natural Language Processing is fun"
text2 = "I find very fun learning about Natural Language Processing"

# Use TfidfVectorizer instead of CountVectorizer
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([text1, text2])

# Compute cosine similarity between the two vectors
cos_sim = cosine_similarity(vectors[0:1], vectors[1:2])
print("Cosine Similarity:", cos_sim[0][0])

Cosine Similarity: 0.474330706497194
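Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on raw word counts for the same two sentences. The `cosine` helper is written for this example, and because it uses counts instead of TF-IDF weights (and keeps the single-character token "i"), the value differs from the one above.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    # Only terms present in both vectors contribute to the dot product
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

v1 = Counter("natural language processing is fun".split())
v2 = Counter("i find very fun learning about natural language processing".split())

similarity = cosine(v1, v2)
print(round(similarity, 4))  # 0.5963
```

The two sentences share four words, so the result is 4 / (√5 × √9) ≈ 0.596.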


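### 3.3 Text Similarity with Sentence Transformers

Sentence embeddings map each sentence to a dense vector, so similarity can reflect meaning and not only shared words.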

In [26]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Two lists of sentences
sentences1 = [
    "The new movie is awesome",
    "The cat sits outside",
    "A man is playing guitar",
]

sentences2 = [
    "The dog plays in the garden",
    "The new movie is so great",
    "A woman watches TV",
]

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1)
embeddings2 = model.encode(sentences2)

# Compute cosine similarities
similarities = model.similarity(embeddings1, embeddings2)

# Output the pairs with their score
for idx_i, sentence1 in enumerate(sentences1):
    print(sentence1)
    for idx_j, sentence2 in enumerate(sentences2):
        print(f" - {sentence2: <30}: {similarities[idx_i][idx_j]:.4f}")



The new movie is awesome
 - The dog plays in the garden   : 0.0543
 - The new movie is so great     : 0.8939
 - A woman watches TV            : -0.0502
The cat sits outside
 - The dog plays in the garden   : 0.2838
 - The new movie is so great     : -0.0029
 - A woman watches TV            : 0.1310
A man is playing guitar
 - The dog plays in the garden   : 0.2277
 - The new movie is so great     : -0.0136
 - A woman watches TV            : -0.0327


## References

- NLTK Documentation: [http://www.nltk.org/](http://www.nltk.org/)
- spaCy Documentation: [https://spacy.io/](https://spacy.io/)
- Scikit-learn Documentation: [https://scikit-learn.org/](https://scikit-learn.org/)
