<a href="https://colab.research.google.com/github/Yashgabani845/ml/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Basic NLP Concepts**

Text Preprocessing:


## **Tokenization in NLP**
Tokenization is the process of breaking text down into smaller units, such as words or sentences. These units are called "tokens". Tokenization is one of the fundamental steps in natural language processing (NLP).



Types of tokenization:
1.   Word
2.   Sentence
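
To make the two types concrete, here is a minimal sketch that uses only Python's `re` module. The sample `text` and both regex patterns are illustrative assumptions; they are much cruder than the NLTK and spaCy tokenizers used in this notebook.

```python
import re

# Hypothetical sample text, chosen only for illustration
text = "Tokenization splits text. It can work on words or sentences!"

# Word tokens: runs of word characters, or single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Sentence tokens: split after '.', '!' or '?' when followed by whitespace
sentence_tokens = re.split(r"(?<=[.!?])\s+", text.strip())

print(word_tokens)
print(sentence_tokens)
```

A rule this simple breaks on abbreviations such as "Dr." or on ellipses, which is one reason to prefer a library tokenizer in practice.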





In [None]:
!pip install nltk





In [None]:
import nltk


In [None]:
nltk.download('punkt')  # Download the tokenizer models
from nltk.tokenize import word_tokenize

teststring = "long-off long-off suryakumar yadav suryakumar yadav.... ne pakada hai apne careear ka sabse important catch "

tokens = word_tokenize(teststring)
tokens

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['long-off',
 'long-off',
 'suryakumar',
 'yadav',
 'suryakumar',
 'yadav',
 '....',
 'ne',
 'pakada',
 'hai',
 'apne',
 'careear',
 'ka',
 'sabse',
 'important',
 'catch']

**Sentence Tokenization**

In [None]:
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(teststring)
sentences

['long-off long-off suryakumar yadav suryakumar yadav.... ne pakada hai apne careear ka sabse important catch']

# ***With spaCy***



In [None]:
import spacy

In [None]:
# Load the small English spaCy model
# (install it once with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
doc = nlp(teststring)
# Word tokenization
tokens = [token.text for token in doc]
print(tokens)
# Sentence tokenization
sentences = [sent.text for sent in doc.sents]
print(sentences)

['long', '-', 'off', 'long', '-', 'off', 'suryakumar', 'yadav', 'suryakumar', 'yadav', '....', 'ne', 'pakada', 'hai', 'apne', 'careear', 'ka', 'sabse', 'important', 'catch']
['long-off long-off suryakumar yadav suryakumar yadav....', 'ne pakada hai apne careear ka sabse important catch']



> For this particular text, I would say NLTK tokenizes better: it keeps `long-off` as a single token, while spaCy splits it into `long`, `-`, and `off`.




**Comparison Between NLTK and spaCy for Tokenization**

| Aspect | NLTK | spaCy |
|---|---|---|
| Speed | Slower | Faster |
| Ease of use | Slightly more code required | More intuitive API |
| Approach | Rule-based, uses the Punkt model | Based on rules and machine learning |
| Customization | Allows manual adjustments | Highly customizable pipelines |



# **Stemming and Lemmatization in NLP**




In [None]:
from nltk.stem import PorterStemmer

# Initialize the stemmer
stemmer = PorterStemmer()

# Sample words
words = ["running", "ran", "easily", "fairly", "studies", "studying"]

# Apply stemming
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

['run', 'ran', 'easili', 'fairli', 'studi', 'studi']


In [None]:
from nltk.stem import LancasterStemmer

lancaster_stemmer = LancasterStemmer()
stemmed_words = [lancaster_stemmer.stem(word) for word in words]
print(stemmed_words)

['run', 'ran', 'easy', 'fair', 'study', 'study']


**PorterStemmer is one of the most popular stemming algorithms. It strips common suffixes such as "ing" by rule, so the result is not always a dictionary word (for example, "easily" becomes "easili").**

**LancasterStemmer is another stemming algorithm that is more aggressive than Porter.**
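
The general idea of rule-based suffix stripping can be sketched in a few lines. This toy stemmer is not the Porter or Lancaster algorithm; the name `toy_stem`, the `SUFFIXES` list, and the `min_stem_len` parameter are all assumptions made up for illustration.

```python
# Toy suffix-stripping stemmer: NOT the Porter or Lancaster algorithm,
# only an illustration of the general idea of rule-based stemming.
SUFFIXES = ("ing", "ies", "ly", "s")  # hypothetical rule list, longest first

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["running", "ran", "easily", "fairly", "studies", "studying"]
stems = [toy_stem(w) for w in words]
print(stems)
```

With these rules "running" becomes "runn", not "run". Real stemmers add further rules (such as undoubling consonants) to handle cases like this.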

**Lemmatization in NLTK**

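Lemmatization returns valid dictionary words (lemmas). NLTK's `WordNetLemmatizer` looks each word up in WordNet using the part of speech (POS) you pass in; it does not infer the POS from the surrounding context.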

Without specifying the POS, lemmatize() assumes the word is a noun. Providing the correct POS (e.g., pos='v' for verbs) leads to better results.
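
One common way to supply the POS is to map Penn Treebank tags (the kind a tagger such as `nltk.pos_tag` returns) to the single-letter codes that `lemmatize()` accepts. The helper name `penn_to_wordnet` is hypothetical, and this is only a minimal sketch of such a mapping, falling back to noun to mirror the lemmatizer's default.

```python
# Hypothetical helper: map a Penn Treebank tag to a WordNet POS letter.
# 'a' = adjective, 'v' = verb, 'n' = noun, 'r' = adverb.
def penn_to_wordnet(tag):
    if tag.startswith("J"):   # JJ, JJR, JJS
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    if tag.startswith("R"):   # RB, RBR, RBS
        return "r"
    # Nouns (NN, NNS, ...) and everything else fall back to noun,
    # the same default lemmatize() uses when no POS is given
    return "n"

for tag in ["VBG", "NNS", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The returned letter can then be passed as the `pos` argument, for example `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.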

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('wordnet')
nltk.download('omw-1.4')  # For multilingual WordNet

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Sample words
words = ["running", "ran", "easily", "fairly", "studies", "studying"]

# Apply lemmatization
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]
print(lemmatized_words)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


['run', 'run', 'easily', 'fairly', 'study', 'study']


**Stopwords Removal: This involves removing common words that do not add significant meaning to the text. These are usually articles, prepositions, and auxiliary verbs like “the”, “is”, “in”, etc.**

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download required NLTK data
nltk.download('stopwords')
nltk.download('punkt')

# Sample text
text = "This is a simple sentence with some common words like 'the' and 'is'."

# Tokenize the text
tokens = word_tokenize(text)

# Load the list of English stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

['simple', 'sentence', 'common', 'words', 'like', "'the", "'", "'is", "'", '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
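
In the output above, `'the` and `'is` survive the filter because the leading quote stays attached to the token, so the lookup in `stop_words` fails. A minimal sketch of one possible fix is to strip punctuation from each token before checking it. The small `stop_words` set here is a hand-written assumption standing in for `stopwords.words('english')`, so the sketch runs without NLTK data.

```python
import string

# filtered_tokens exactly as printed in the output above
filtered_tokens = ['simple', 'sentence', 'common', 'words', 'like',
                   "'the", "'", "'is", "'", '.']

# Tiny hand-written stopword subset (an assumption, standing in for
# the full NLTK English stopword list)
stop_words = {"the", "is", "a", "and", "with", "some", "this"}

cleaned = []
for token in filtered_tokens:
    # Strip leading/trailing punctuation such as the stray quote in "'the"
    core = token.strip(string.punctuation)
    # Skip tokens that were pure punctuation or turn out to be stopwords
    if core and core.lower() not in stop_words:
        cleaned.append(core)

print(cleaned)
```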


**Text Normalization in NLTK**

In [None]:
import re

# Sample text
text = "This is an example sentence, with PUNCTUATION! And special characters like @ and #."

# Convert to lowercase
text_lower = text.lower()

# Remove punctuation and special characters using regex
text_clean = re.sub(r'[^\w\s]', '', text_lower)

# Tokenize the cleaned text
tokens = word_tokenize(text_clean)
print(tokens)


['this', 'is', 'an', 'example', 'sentence', 'with', 'punctuation', 'and', 'special', 'characters', 'like', 'and']


# **Feature Extraction**

Feature extraction is about representing text in a numeric form that machine learning models can work with.



 **The Bag of Words (BoW)** model is a simple and widely used method for text representation. In this approach, text (such as a sentence or document) is represented as a collection of words, without considering the grammar, order, or structure. Each unique word in the corpus is used as a feature, and the value for each word is typically its frequency (how often it appears) in the document.

In [None]:
from nltk.tokenize import word_tokenize
from collections import Counter

# Sample documents
documents = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
]

# Tokenize each document
tokenized_docs = [word_tokenize(doc.lower()) for doc in documents]

# Create a vocabulary (unique words)
vocabulary = set([word for doc in tokenized_docs for word in doc])

# Create a frequency vector for each document
def create_bow(doc, vocabulary):
    bow_vector = Counter(doc)
    return [bow_vector.get(word, 0) for word in vocabulary]

# Generate the BoW vectors for all documents
bow_vectors = [create_bow(doc, vocabulary) for doc in tokenized_docs]
print("Vocabulary:", vocabulary)
print("BoW Vectors:", bow_vectors)

Vocabulary: {'jump', 'quick', 'never', 'over', 'the', 'dog', 'fox', 'quickly', 'brown', 'lazy', 'jumps'}
BoW Vectors: [[0, 1, 0, 1, 2, 1, 1, 0, 1, 1, 1], [1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0]]
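
The column order in the output above follows the iteration order of a Python `set`, which can change between runs, so the same document may produce differently ordered vectors. A minimal sketch of one way to make the order stable is to sort the vocabulary. Here `str.split` stands in for `word_tokenize`, an assumption that holds for these punctuation-free sentences and avoids needing NLTK data.

```python
from collections import Counter

documents = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
]

# str.split stands in for word_tokenize (assumption: no punctuation here)
tokenized_docs = [doc.lower().split() for doc in documents]

# Sorting the vocabulary fixes the column order across runs
vocabulary = sorted({word for doc in tokenized_docs for word in doc})

def create_bow(doc, vocabulary):
    """Count each vocabulary word in doc, in vocabulary order."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]

bow_vectors = [create_bow(doc, vocabulary) for doc in tokenized_docs]
print("Vocabulary:", vocabulary)
print("BoW Vectors:", bow_vectors)
```

With an alphabetically sorted vocabulary, position 0 always refers to `brown` and the last position to `the`.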


**With scikit-learn**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the model to the documents and transform into BoW vectors
bow_vectors = vectorizer.fit_transform(documents)

# Get the feature names (vocabulary)
vocabulary = vectorizer.get_feature_names_out()

# Convert the sparse matrix to a dense matrix for better readability
bow_dense = bow_vectors.toarray()

print("Vocabulary:", vocabulary)
print("BoW Vectors:", bow_dense)


Vocabulary: ['brown' 'dog' 'fox' 'jump' 'jumps' 'lazy' 'never' 'over' 'quick'
 'quickly' 'the']
BoW Vectors: [[1 1 1 0 1 1 0 1 1 0 2]
 [0 1 0 1 0 1 1 1 0 1 1]]
