<a href="https://colab.research.google.com/github/Vivek-afk81/NLP_intro/blob/main/nlp_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [19]:
try:
    from google.colab import drive
    drive.mount("/content/drive")
except ImportError:
    pass


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [20]:
import os
os.chdir("/content/drive/My Drive/nlp")

#NLP Phase 1: Text Cleaning & Normalization

Machine learning models cannot understand raw text.
We must clean, normalize, and structure text before converting it into numbers.

###Step 1 :- Load the text data and lowercase it

In [21]:
text="this is a classic Anaconda environment mismatch problem"
text=text.lower()

print("Lower text: ",text)

Lower text:  this is a classic anaconda environment mismatch problem


###Step 2 :- Remove Punctuation
We will remove everything except word characters (letters, digits, underscore) and whitespace,
using `re`, Python's regular expression module.

In [22]:
import re
text =re.sub(r'[^\w\s]','',text)
print("Without punctuation: ",text)

## Removes everything except word characters and whitespace

Without punctuation:  this is a classic anaconda environment mismatch problem
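The sample sentence above contains no punctuation, so the output looks unchanged. A minimal sketch of what the pattern does on a messier string follows; the `sample` text and the stricter letters-only variant are assumptions added for illustration, not part of the original pipeline.

```python
import re

# Hypothetical sample text with punctuation, a digit, and an underscore.
sample = "Hello, World! user_name has 2 cats."

# \w matches letters, digits, and underscore, so "user_name" and "2" survive.
keep_word_chars = re.sub(r"[^\w\s]", "", sample)

# Stricter variant (assumption): keep only ASCII letters and whitespace.
letters_only = re.sub(r"[^a-zA-Z\s]", "", sample)

print(keep_word_chars)  # Hello World user_name has 2 cats
print(letters_only)     # Hello World username has  cats
```

Note that the stricter pattern leaves a double space where the digit was removed; `split()` in the next step absorbs it.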


###STEP 3 :- TOKENIZATION

In [23]:
#This is whitespace tokenization
tokens=text.split()
print("tokens",tokens)


tokens ['this', 'is', 'a', 'classic', 'anaconda', 'environment', 'mismatch', 'problem']
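Whitespace tokenization works here because punctuation was removed first. As a rough sketch of why the order matters, the example below compares `split()` with a simple regex tokenizer on uncleaned text; the `raw` string and the `[a-z0-9]+` pattern are illustrative assumptions.

```python
import re

# Hypothetical uncleaned input.
raw = "Conda's env-mismatch problem,   again!"

# split() breaks on runs of whitespace only, so punctuation stays attached.
whitespace_tokens = raw.lower().split()

# A simple regex tokenizer (assumption): runs of lowercase letters or digits.
regex_tokens = re.findall(r"[a-z0-9]+", raw.lower())

print(whitespace_tokens)  # ["conda's", 'env-mismatch', 'problem,', 'again!']
print(regex_tokens)       # ['conda', 's', 'env', 'mismatch', 'problem', 'again']
```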


###STEP 4 :- Remove Stop Words

In [24]:
# Remove stop words such as "is", "am", "are", "the"

stopwords = [
    "a", "an", "the",
    "is", "am", "are", "was", "were",
    "this", "that", "these", "those",
    "and", "or", "but",
    "to", "of", "in", "on"
]
# Equivalent explicit loop:
# filtered_tokens = []
# for word in tokens:
#     if word not in stopwords:
#         filtered_tokens.append(word)

filtered_tokens = [word for word in tokens if word not in stopwords]

print(filtered_tokens)

['classic', 'anaconda', 'environment', 'mismatch', 'problem']
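For longer texts, membership tests against a list scan every entry. One possible variant, sketched below, stores the same stop words in a `frozenset` for constant-time lookups; the names `STOPWORDS` and `remove_stopwords` are hypothetical helpers, not part of the notebook.

```python
# Same stop words as above, stored in a frozenset (hypothetical name).
STOPWORDS = frozenset({
    "a", "an", "the",
    "is", "am", "are", "was", "were",
    "this", "that", "these", "those",
    "and", "or", "but",
    "to", "of", "in", "on",
})

def remove_stopwords(tokens):
    """Return the tokens that are not stop words, preserving order."""
    return [token for token in tokens if token not in STOPWORDS]

sample_tokens = ["this", "is", "a", "classic", "anaconda",
                 "environment", "mismatch", "problem"]
print(remove_stopwords(sample_tokens))
# ['classic', 'anaconda', 'environment', 'mismatch', 'problem']
```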


###Step 5 :- Simple Stemming (Demonstration)

In [25]:
#This is not a full stemmer like Porter Stemmer.
#It only demonstrates the idea of reducing words to a base form
def simple_stem(word):
  if word.endswith("ing"):
    return word[:-3]
  return word

stemmed_words=[]
for word in filtered_tokens:
  stemmed_words.append(simple_stem(word))

print("After stemming :",stemmed_words)

After stemming : ['classic', 'anaconda', 'environment', 'mismatch', 'problem']
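None of the filtered tokens ends in "ing", so the output above is unchanged. The sketch below runs the same rule on a few hypothetical words to show its effect and its limits; `guarded_stem` and its `min_stem_len` parameter are assumptions for illustration, and a real stemmer such as the Porter Stemmer handles many more cases.

```python
def simple_stem(word):
    # Same rule as in the notebook: strip a trailing "ing".
    if word.endswith("ing"):
        return word[:-3]
    return word

def guarded_stem(word, min_stem_len=3):
    """Hypothetical variant: strip "ing" only if enough characters remain."""
    if word.endswith("ing") and len(word) - 3 >= min_stem_len:
        return word[:-3]
    return word

for word in ["debugging", "running", "string", "sing"]:
    print(word, "->", simple_stem(word), "|", guarded_stem(word))
# debugging -> debugg | debugg
# running -> runn | runn
# string -> str | str
# sing -> s | sing
```

The guard only protects very short words; "string" is still over-stemmed, which is why rule-based demos like this are not a substitute for a full stemmer.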


#NLP Phase 2 -Words to vector

Why vectorize? Machine learning models work with numerical features, not text, so each document must be converted into a vector of numbers.



In [26]:
from sklearn.feature_extraction.text import CountVectorizer  # converts text into token counts


###Bag of Words (Count Vectorization)

Each document is represented as a vector of word counts.

In [27]:
sentences = [
    "This is a classic problem in programming.",
    "Anaconda environment mismatch can cause import errors.",
    "Python libraries must be installed in the correct environment.",
    "Understanding environments saves a lot of debugging time."
]
# Vectorize
vectorizer=CountVectorizer()
X=vectorizer.fit_transform(sentences) # learn the vocabulary and convert the sentences to count vectors

In [28]:
X

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 29 stored elements and shape (4, 27)>

In [29]:
print("Vocabulary:")
print(vectorizer.get_feature_names_out())

Vocabulary:
['anaconda' 'be' 'can' 'cause' 'classic' 'correct' 'debugging'
 'environment' 'environments' 'errors' 'import' 'in' 'installed' 'is'
 'libraries' 'lot' 'mismatch' 'must' 'of' 'problem' 'programming' 'python'
 'saves' 'the' 'this' 'time' 'understanding']


In [30]:
print("\nBag of Words Matrix:\n")
print(X.toarray())


Bag of Words Matrix:

[[0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0]
 [1 0 1 1 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 0 0]
 [0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 1 1]]


Here:
- each row = one sentence (document)
- each column = one word in the vocabulary
- each value = how many times that word occurs in that sentence
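To make the row/column layout concrete, here is a pure-Python sketch that rebuilds the same matrix for the first corpus without scikit-learn. It assumes `CountVectorizer`'s default behavior of lowercasing and keeping only tokens of two or more word characters, which is why "a" is missing from the vocabulary; the helper name `tokenize` is hypothetical.

```python
import re
from collections import Counter

sentences = [
    "This is a classic problem in programming.",
    "Anaconda environment mismatch can cause import errors.",
    "Python libraries must be installed in the correct environment.",
    "Understanding environments saves a lot of debugging time.",
]

def tokenize(text):
    # Lowercase, then keep tokens of 2+ word characters (drops "a").
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(sentence) for sentence in sentences]

# Columns: the sorted set of all distinct tokens.
vocabulary = sorted({token for doc in tokenized for token in doc})

# Rows: one count vector per sentence.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[term] for term in vocabulary])

print(len(vocabulary))  # 27
print(matrix[0])
```

The 29 non-zero entries across the four rows correspond to the "29 stored elements" reported for the sparse matrix.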

###Another Example (Different Corpus)

In [31]:
sentences = [
    "Speech recognition requires the correct audio drivers.",
    "Missing dependencies often lead to runtime errors.",
    "Virtual environments help isolate Python projects.",
    "Proper setup makes development smoother."
]
vectorizer=CountVectorizer()
X=vectorizer.fit_transform(sentences)
print("Vocabulary",vectorizer.get_feature_names_out())
print("\nMatrix : ")
print(X.toarray())

Vocabulary ['audio' 'correct' 'dependencies' 'development' 'drivers' 'environments'
 'errors' 'help' 'isolate' 'lead' 'makes' 'missing' 'often' 'projects'
 'proper' 'python' 'recognition' 'requires' 'runtime' 'setup' 'smoother'
 'speech' 'the' 'to' 'virtual']

Matrix : 
[[1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0]
 [0 0 1 0 0 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0]
 [0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1]
 [0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0]]


###Introducing TF-IDF
###TERM FREQUENCY - INVERSE DOCUMENT FREQUENCY

TF-IDF Vectorization

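Bag of Words treats all words equally, so a frequent but uninformative word counts as much as a distinctive one.
TF-IDF assigns lower weight to words that appear in many documents and higher weight to words concentrated in a few.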

TF-IDF = TF × IDF

TF  = term frequency in a document  
IDF = log(total documents / documents containing the term)
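The formula above can be worked through by hand. The sketch below applies it to a small hypothetical set of tokenized documents; the function names and the `docs` data are assumptions for illustration, and the code assumes the term occurs in at least one document (otherwise the division fails).

```python
import math

def textbook_idf(term, tokenized_docs):
    """IDF = log(total documents / documents containing the term)."""
    containing = sum(1 for doc in tokenized_docs if term in doc)
    return math.log(len(tokenized_docs) / containing)

def textbook_tfidf(term, doc, tokenized_docs):
    """TF-IDF = raw count of the term in doc, times its IDF."""
    return doc.count(term) * textbook_idf(term, tokenized_docs)

# Hypothetical tokenized documents.
docs = [
    ["classic", "problem", "in", "programming"],
    ["installed", "in", "the", "correct", "environment"],
    ["environment", "mismatch", "errors"],
    ["debugging", "time"],
]

idf_in = textbook_idf("in", docs)            # in 2 of 4 docs: log(4 / 2)
idf_classic = textbook_idf("classic", docs)  # in 1 of 4 docs: log(4 / 1)

print(round(idf_in, 3), round(idf_classic, 3))  # 0.693 1.386
print(round(textbook_tfidf("classic", docs[0], docs), 3))  # 1.386
```

The rarer word "classic" receives twice the weight of "in", which appears in half of the documents.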



In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer=TfidfVectorizer()
X=vectorizer.fit_transform(sentences)
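# With default settings it will internally:
# 1. calculate TF (raw term counts per document)
# 2. calculate IDF (scikit-learn uses a smoothed variant: ln((1 + n) / (1 + df)) + 1)
# 3. multiply TF by IDF
# 4. L2-normalize each row and return the sparse matrix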

In [33]:
print("Vocabulary",vectorizer.get_feature_names_out())

Vocabulary ['audio' 'correct' 'dependencies' 'development' 'drivers' 'environments'
 'errors' 'help' 'isolate' 'lead' 'makes' 'missing' 'often' 'projects'
 'proper' 'python' 'recognition' 'requires' 'runtime' 'setup' 'smoother'
 'speech' 'the' 'to' 'virtual']


In [34]:
print("TF-IDF Matrix: ")
print(X.toarray())

TF-IDF Matrix: 
[[0.37796447 0.37796447 0.         0.         0.37796447 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.37796447 0.37796447
  0.         0.         0.         0.37796447 0.37796447 0.
  0.        ]
 [0.         0.         0.37796447 0.         0.         0.
  0.37796447 0.         0.         0.37796447 0.         0.37796447
  0.37796447 0.         0.         0.         0.         0.
  0.37796447 0.         0.         0.         0.         0.37796447
  0.        ]
 [0.         0.         0.         0.         0.         0.40824829
  0.         0.40824829 0.40824829 0.         0.         0.
  0.         0.40824829 0.         0.40824829 0.         0.
  0.         0.         0.         0.         0.         0.
  0.40824829]
 [0.         0.         0.         0.4472136  0.         0.
  0.         0.         0.         0.         0.4472136  0.
  0.         0.         0.4472136  0.         0.         0.
  0.      

In [35]:
# IDF calculation:
#log(total documents / documents containing word)
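The values in the matrix above do not come from the plain formula alone. With its default settings, `TfidfVectorizer` uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, and then scales each row to unit length. The pure-Python sketch below reproduces that under those assumptions; the helper names are hypothetical, and it assumes every sentence yields at least one token.

```python
import math
import re
from collections import Counter

sentences = [
    "Speech recognition requires the correct audio drivers.",
    "Missing dependencies often lead to runtime errors.",
    "Virtual environments help isolate Python projects.",
    "Proper setup makes development smoother.",
]

def tokenize(text):
    # Lowercase, keep tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(sentence) for sentence in sentences]
vocabulary = sorted({token for doc in tokenized for token in doc})
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = Counter(token for doc in tokenized for token in set(doc))

# Smoothed IDF (scikit-learn default, smooth_idf=True).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
print([round(max(row), 8) for row in matrix])
```

In this corpus every word appears in exactly one sentence, so all IDF values are equal and each non-zero entry reduces to 1/sqrt(number of distinct tokens in the sentence): 1/sqrt(7) ≈ 0.378 for the first two rows, 1/sqrt(6) ≈ 0.408 for the third, and 1/sqrt(5) ≈ 0.447 for the fourth.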