
# ***NLP (Natural Language Processing)***

## Akash Patil
aakashgolu1008@gmail.com


1. **Definition**: Natural Language Processing (NLP) is a subfield of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language.

2. **Tasks in NLP**: NLP encompasses various tasks, including:
   - Text Classification: Assigning categories or labels to text documents (e.g., sentiment analysis, spam detection).
   - Named Entity Recognition (NER): Identifying and classifying entities (e.g., names of people, places) in text.
   - Machine Translation: Translating text from one language to another.
   - Sentiment Analysis: Determining the sentiment expressed in a piece of text (e.g., positive, negative, neutral).
   - Text Generation: Creating human-like text, such as chatbots or language models.

3. **Uses of NLP**:
   - Virtual Assistants: Chatbots and virtual assistants like Siri, Alexa, and Google Assistant use NLP to understand and respond to user queries.
   - Search Engines: NLP helps search engines understand user queries and retrieve relevant results.
   - Text Analysis: NLP is used for sentiment analysis, customer feedback analysis, and market research.
   - Language Translation: NLP powers language translation services like Google Translate.
   - Speech Recognition: NLP plays a crucial role in converting spoken language into text (speech-to-text).

Popular NLP libraries:

1. NLTK (Natural Language Toolkit)
2. spaCy
3. Gensim
4. scikit-learn
5. TensorFlow
6. Keras
7. Transformers (Hugging Face)
8. TextBlob

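These libraries cover different needs: NLTK, spaCy, and TextBlob for text processing; Gensim for topic modeling and word embeddings; scikit-learn for classical machine learning; and TensorFlow, Keras, and Transformers for deep learning models.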

**Text Representation**:
   - **Corpus**: A collection of text documents used for analysis.
   - **Documents**: Individual pieces of text (e.g., sentences, paragraphs) in the corpus.
   - **Vocabulary**: The set of unique words present in the corpus.
   - **Words**: In NLP, words are often represented as vectors (numeric representations).

In [None]:

from sklearn.feature_extraction.text import CountVectorizer

# Corpus: Collection of text documents (sentences)
corpus = [
    "This is the first document.",
    "The second document is here.",
    "And this is the third document.",
    "Is this the first document?"
]

# Create the CountVectorizer
vectorizer = CountVectorizer()

# Generate the word count matrix (Vector representation of the corpus)
X = vectorizer.fit_transform(corpus)

# Vocabulary: Set of unique words in the corpus
vocabulary = vectorizer.get_feature_names_out()

# Print the results
print("Corpus:")
print(corpus, "\n")

for i, document in enumerate(corpus, start=1):
    print(f"Document {i}: {document}", "\n")

print("\nVocabulary:")
print(vocabulary, "\n")

print("\nVector representation of the corpus:")
print(X.toarray())



## Steps in NLP:

1. **Text Preprocessing**:
   - **Tokenization**: Splitting text into smaller units, called tokens. It helps in analyzing individual words or phrases.
   - **Cleaning**: Removing unnecessary characters, symbols, and special characters from the text.
   - **Lowercasing**: Converting all text to lowercase to ensure uniformity.
   - **Stemming**: Reducing words to a root form by stripping suffixes with heuristic rules (e.g., "running" to "run").
     - *Advantages*: It collapses word variants and shrinks the vocabulary.
     - *Disadvantages*: The result is sometimes not a real word (e.g., the Porter stemmer turns "studies" into "studi"), which can obscure meaning.
   - **Lemmatization**: Similar to stemming, but it maps words to their base or dictionary form (e.g., "better" to "good").
     - *Advantages*: It produces valid words, which makes the output easier to analyze and interpret.
     - *Disadvantages*: It is slower than stemming because it relies on dictionary lookups and, often, part-of-speech information.
   - **Stopwords**: Common words like "the," "and," and "is" that are often removed because they add little meaning to the text.
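
To make the stemming trade-off above concrete, here is a minimal sketch using NLTK's `PorterStemmer`, which needs no extra data download. The word list is an illustrative assumption and not part of the original notebook.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words chosen for this sketch
words = ["running", "studies", "happily"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

A stem such as "studi" is not a dictionary word, which is the weakness of stemming; a lemmatizer would typically return "study" instead.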

In [None]:
# Tokenization (using NLTK):
from rich import print
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download tokenizer data (run only once); newer NLTK releases may require 'punkt_tab' instead
nltk.download('punkt')

text = "This is an example sentence. Tokenization is essential."
tokens = word_tokenize(text)
print("[bold green]Word tokenization:[/bold green]", tokens)
sentences = sent_tokenize(text)
print("[bold green]Sentence tokenization:[/bold green]", sentences)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# Cleaning and lowercasing
import re
from rich import print  # Needed for the [blue bold] markup in the print calls below

def clean_text(text):
    # Keep only letters and whitespace (drops punctuation, symbols, and digits)
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text)
    return cleaned_text

text = "This is An Example sentence with !@#$ special characters AND 123 digits."
cleaned_text = clean_text(text)
lowered_text = cleaned_text.lower()
print('[blue bold]Cleaned text:[/blue bold]', cleaned_text)
print('[blue bold]Lowercased text:[/blue bold]', lowered_text)

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('punkt')  # Download tokenizer data (run only once)
nltk.download('stopwords')  # Download stopwords data (run only once)

def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text)

    # Removing stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

    # Stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

    return stemmed_tokens

text = "This is an example sentence with some stopwords and it will be stemmed."
processed_text = preprocess_text(text)
print(processed_text)




In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')       # Download tokenizer data (run only once)
nltk.download('stopwords')   # Download stopwords data (run only once)
nltk.download('wordnet')     # Download WordNet data (run only once)

def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

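    # Lemmatize the words (lemmatize() treats every token as a noun by default; pass pos='v' or pos='a' for verbs or adjectives)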
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    return lemmatized_tokens

text = "This is an example sentence with some stopwords and lemmatized words."
processed_tokens = preprocess_text(text)
print(processed_tokens)






2. **Word Embeddings**:
   - **Count or Frequency-Based Methods**:
     - **One-Hot Encoding**: Each word is represented as a binary vector with '1' at its index in the vocabulary and '0' elsewhere.
       - *Advantages*: Simple and easy to implement.
       - *Disadvantages*: Produces high-dimensional, sparse vectors and does not capture relationships between words.
     - **Bag of Words (BoW)**: Represents a text by how often each vocabulary word occurs in it.
       - *Advantages*: Captures word occurrences in a document.
       - *Disadvantages*: Ignores word order and context.
     - **TF-IDF (Term Frequency-Inverse Document Frequency)**: Weighs the importance of words in a document relative to the entire corpus.
       - *Advantages*: Down-weights words that are common across the corpus, so words that are distinctive to a document stand out.
       - *Disadvantages*: Ignores word order and context.

In [None]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

# Corpus: Collection of text documents (sentences)
corpus = [
    "This is the first document.",
    "The second document is here.",
    "And this is the third document.",
    "Is this the first document?"
]

# Build the vocabulary and reuse the vectorizer's tokenizer
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# Vocabulary: Set of unique words in the corpus
vocabulary = vectorizer.get_feature_names_out()
analyzer = vectorizer.build_analyzer()

# One-Hot Encoder fitted on the vocabulary, so each word gets its own column
# (encoding the count matrix instead would one-hot the counts, not the words)
encoder = OneHotEncoder(dtype=int)  # Integer 0/1 output
encoder.fit(vocabulary.reshape(-1, 1))

# Print vocabularies
print("Vocabulary:")
print(vocabulary)

# Print OHE for each document: one row per word, with a single 1 at the word's vocabulary index
print("\nOne-Hot Encoding for each document:")
for i, document in enumerate(corpus, start=1):
    words = np.array(analyzer(document), dtype=object).reshape(-1, 1)
    print(f"Document {i}:")
    print(encoder.transform(words).toarray())


In [None]:
# Bag of Words over bigrams (N-grams of size 2)
from sklearn.feature_extraction.text import CountVectorizer

# Corpus: Collection of text documents (sentences)
corpus = [
    "This is the first document.",
    "The second document is here.",
    "And this is the third document.",
    "Is this the first document?"
]

# Create the CountVectorizer with bigram N-grams
vectorizer = CountVectorizer(ngram_range=(2, 2))
X_bigram = vectorizer.fit_transform(corpus)

# Vocabulary with bigram N-grams
bigram_vocabulary = vectorizer.get_feature_names_out()

# Print bigram vocabularies
print("Vocabulary with Bigram N-grams:")
print(bigram_vocabulary)

# Print vector representation of the documents with bigram N-grams
print(X_bigram.toarray())
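
The bag-of-words weakness mentioned earlier (word order is ignored) can be seen with two sentences that use the same words in a different order. The sketch below uses plain Python counters rather than `CountVectorizer`; the sentences and the helper names `bag_of_words` and `bag_of_bigrams` are assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count single words, ignoring their order."""
    return Counter(text.lower().split())

def bag_of_bigrams(text):
    """Count adjacent word pairs, which keeps some local order."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

a = "dog bites man"
b = "man bites dog"

same_unigrams = bag_of_words(a) == bag_of_words(b)
same_bigrams = bag_of_bigrams(a) == bag_of_bigrams(b)

print("Same unigram bag:", same_unigrams)
print("Same bigram bag:", same_bigrams)
```

The unigram bags are identical even though the sentences mean different things, while the bigram bags differ. That is the motivation for N-grams, at the cost of a larger vocabulary.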

In [None]:
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# Corpus: Collection of text documents (sentences)
corpus = [
    "This is the first document.",
    "The second document is here.",
    "And this is the third document.",
    "Is this the first document?"
]

# Create the TF-IDF Vectorizer
vectorizer_tfidf = TfidfVectorizer()
X_tfidf = vectorizer_tfidf.fit_transform(corpus)

# Vocabulary for TF-IDF
tfidf_vocabulary = vectorizer_tfidf.get_feature_names_out()

# Print the TF-IDF vocabulary
print("Vocabulary for TF-IDF:")
print(tfidf_vocabulary)

# Print the TF-IDF vector of each document (one row of the matrix per document)
print("\nVector representation of the documents with TF-IDF:")
X_tfidf_dense = X_tfidf.toarray()
for i, document in enumerate(corpus, start=1):
    print(f"Document {i}:")
    print(X_tfidf_dense[i - 1])
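
To show what the weights mean, the sketch below computes a textbook TF-IDF, tf(t, d) × ln(N / df(t)), by hand in plain Python. This formula is an assumption chosen for illustration: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Same four documents, already lowercased and stripped of punctuation
corpus = [
    "this is the first document",
    "the second document is here",
    "and this is the third document",
    "is this the first document",
]
docs = [text.split() for text in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each word appears
df = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Textbook TF-IDF: relative term frequency times ln(N / df)."""
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(n_docs / df[word])
        for word, count in counts.items()
    }

weights = tfidf(docs[1])
for word, weight in weights.items():
    print(f"{word}: {weight:.3f}")
```

Words that occur in every document, such as "the", get a weight of zero, while "second" and "here", which occur in only one document, get the highest weights.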



3. **Word Embeddings - Deep Learning Trained Models**:
   - **Word2Vec**: A popular word embedding model that learns word vectors from large amounts of text data.
     - **CBOW (Continuous Bag of Words)**: Predicts a word from its context (surrounding words).
     - **Skip-gram**: Predicts context words given a target word.
     - *Advantages*: Captures semantic relationships and word analogies.
     - *Disadvantages*: Requires large training datasets and computational resources.
   - **Average Word2Vec**: Takes the average of word vectors in a document to obtain a document-level representation.
     - *Advantages*: Simple and computationally efficient.
     - *Disadvantages*: Loses word order and context information.
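
The difference between CBOW and Skip-gram comes down to how training examples are built from a sliding window over the text. The sketch below is a simplified illustration in plain Python; `context_windows` is a hypothetical helper, and real Word2Vec training adds details such as subsampling and negative sampling that are omitted here.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

tokens = "i love natural language processing".split()

# CBOW: the context words are the input, the target word is the label
cbow_examples = [(context, target) for target, context in context_windows(tokens)]

# Skip-gram: the target word is the input, each context word is a separate label
skipgram_examples = [
    (target, context_word)
    for target, context in context_windows(tokens)
    for context_word in context
]

print("CBOW example:", cbow_examples[2])
print("Skip-gram examples:", skipgram_examples[:3])
```

CBOW produces one example per position, whereas Skip-gram produces one example per (target, context word) pair, so it generates more training examples from the same text.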

In [None]:
# Word Embeddings: CBOW
from gensim.models import Word2Vec

# Sample corpus (list of sentences)
corpus = [
    "I love natural language processing.",
    "Word embeddings are powerful.",
    "Gensim is a popular NLP library.",
    "Word2Vec is used for word embeddings.",
    "Machine learning is an exciting field.",
    "Natural language processing is fascinating.",
]

# Tokenize the sentences into words (a plain split keeps case and punctuation, so "Natural" and "natural" are separate tokens)
tokenized_corpus = [sentence.split() for sentence in corpus]

# Train Word2Vec model
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, sg=0)  # sg=0 for CBOW, sg=1 for Skip-gram

# Get word embeddings
word_embeddings = model.wv

# Check word embeddings for a specific word
word = "natural"
print(f"Word embeddings for '{word}':")
print(word_embeddings[word])


Word embeddings for 'natural':
[-8.7273493e-03  2.1338412e-03 -8.7237195e-04 -9.3203643e-03
 -9.4293803e-03 -1.4108053e-03  4.4366214e-03  3.7041160e-03
 -6.4995661e-03 -6.8759792e-03 -5.0012907e-03 -2.2880139e-03
 -7.2466894e-03 -9.6056927e-03 -2.7409189e-03 -8.3655678e-03
 -6.0378369e-03 -5.6738453e-03 -2.3442381e-03 -1.7093163e-03
 -8.9567676e-03 -7.3503336e-04  8.1489393e-03  7.6890145e-03
 -7.2097708e-03 -3.6669916e-03  3.1176507e-03 -9.5751481e-03
  1.4777203e-03  6.5273619e-03  5.7451706e-03 -8.7653380e-03
 -4.5186016e-03 -8.1421286e-03  4.5378572e-05  9.2684561e-03
  5.9704515e-03  5.0691962e-03  5.0627715e-03 -3.2445851e-03
  9.5548471e-03 -7.3563973e-03 -7.2690817e-03 -2.2616270e-03
 -7.7878643e-04 -3.2184774e-03 -5.9350772e-04  7.4876128e-03
 -6.9592992e-04 -1.6239390e-03  2.7441652e-03 -8.3578480e-03
  7.8571858e-03  8.5343011e-03 -9.5808953e-03  2.4493726e-03
  9.9083344e-03 -7.6644877e-03 -6.9650519e-03 -7.7358182e-03
  8.3951578e-03 -6.8444869e-04  9.1427807e-03 -8.15933

In [None]:
# Word Embeddings: Skip-gram
from gensim.models import Word2Vec

# Sample corpus (list of sentences)
corpus = [
    "I love natural language processing.",
    "Word embeddings are powerful.",
    "Gensim is a popular NLP library.",
    "Word2Vec is used for word embeddings.",
    "Machine learning is an exciting field.",
    "Natural language processing is fascinating.",
]

# Tokenize the sentences into words (a plain split keeps case and punctuation, so "Natural" and "natural" are separate tokens)
tokenized_corpus = [sentence.split() for sentence in corpus]

# Train Word2Vec model with Skip-gram
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, sg=1)  # sg=1 for Skip-gram

# Get word embeddings
word_embeddings = model.wv

# Check word embeddings for a specific word
word = "natural"
print(f"Word embeddings for '{word}':")
print(word_embeddings[word])


Word embeddings for 'natural':
[-8.7282434e-03  2.1323499e-03 -8.7293837e-04 -9.3202461e-03
 -9.4291931e-03 -1.4109012e-03  4.4344491e-03  3.7042808e-03
 -6.4992919e-03 -6.8747955e-03 -5.0009028e-03 -2.2876314e-03
 -7.2487770e-03 -9.6054496e-03 -2.7421385e-03 -8.3645936e-03
 -6.0391035e-03 -5.6730080e-03 -2.3443603e-03 -1.7078170e-03
 -8.9573581e-03 -7.3504820e-04  8.1506288e-03  7.6900194e-03
 -7.2083804e-03 -3.6672179e-03  3.1183362e-03 -9.5731718e-03
  1.4771431e-03  6.5263347e-03  5.7457364e-03 -8.7650688e-03
 -4.5180772e-03 -8.1416983e-03  4.5987152e-05  9.2664352e-03
  5.9724897e-03  5.0689536e-03  5.0624730e-03 -3.2434696e-03
  9.5536355e-03 -7.3564979e-03 -7.2704842e-03 -2.2632363e-03
 -7.7888061e-04 -3.2179474e-03 -5.9269514e-04  7.4890708e-03
 -6.9653458e-04 -1.6243787e-03  2.7444451e-03 -8.3584115e-03
  7.8573301e-03  8.5354634e-03 -9.5825167e-03  2.4483970e-03
  9.9072671e-03 -7.6650758e-03 -6.9659669e-03 -7.7361115e-03
  8.3959494e-03 -6.8271364e-04  9.1436859e-03 -8.15924
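
Average Word2Vec, described earlier, can be sketched as follows. The three-dimensional vectors are hypothetical values made up for illustration (a trained model would supply vectors like the 100-dimensional ones printed by the cells in this section), and `average_vector` is a helper name introduced here.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, for illustration only
embeddings = {
    "natural": np.array([0.2, 0.4, 0.0]),
    "language": np.array([0.4, 0.0, 0.2]),
    "processing": np.array([0.0, 0.2, 0.4]),
}

def average_vector(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

doc_a = average_vector(["natural", "language", "processing"], embeddings)
doc_b = average_vector(["processing", "language", "natural"], embeddings)

print("Document vector:", doc_a)
print("Same vector after reordering:", np.allclose(doc_a, doc_b))
```

Both word orders give the same document vector, which illustrates the loss of word order listed as a disadvantage.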


4. **Machine Learning Use Cases**:
   - Text Classification Example:
     - Dataset: Collection of movie reviews labeled as "positive" or "negative."
     - Steps:
       - Text preprocessing: Tokenization, lowercasing, removing stopwords.
       - Text representation: Using TF-IDF or Word2Vec to convert text to numeric vectors.
       - Model Training: Employing a machine learning algorithm like Naive Bayes or SVM.
       - Model Evaluation: Testing the trained model on a separate test dataset.
       - Prediction: Using the model to predict sentiment on new movie reviews.
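
The steps above can be sketched with a scikit-learn pipeline. This is a minimal illustration under stated assumptions: the reviews and labels are made up for the example, the stopword list is a tiny hand-picked one, and a real project would use a much larger labeled dataset and a proper train/test split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data, for illustration only
train_reviews = [
    "I loved this movie, it was great",
    "A wonderful film with great acting",
    "Loved the story, wonderful ending",
    "I hated this movie, it was terrible",
    "A boring film with terrible acting",
    "Hated the story, boring ending",
]
train_labels = ["positive"] * 3 + ["negative"] * 3

# Made-up held-out data for evaluation
test_reviews = ["What a great, wonderful film", "Terrible and boring movie"]
test_labels = ["positive", "negative"]

# Preprocessing and representation: lowercasing, stopword removal, TF-IDF
# Model: multinomial Naive Bayes
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words=["this", "it", "was", "with", "the"]),
    MultinomialNB(),
)
model.fit(train_reviews, train_labels)

# Evaluation on the held-out reviews
predictions = model.predict(test_reviews)
accuracy = accuracy_score(test_labels, predictions)
print("Accuracy on held-out reviews:", accuracy)

# Prediction on a new review
new_prediction = model.predict(["The acting was wonderful"])[0]
print("Predicted sentiment:", new_prediction)
```

Wrapping the vectorizer and classifier in one pipeline means new text goes through the same preprocessing as the training data. Swapping `MultinomialNB` for an SVM such as `LinearSVC` would require changing only one line.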