## **1. Natural Language Processing (NLP)**

### **Libraries and Modules**
- **`nltk` (Natural Language Toolkit):**
  A Python library for text processing tasks such as tokenization, stopword removal, stemming, and lemmatization.

In [None]:
import nltk
nltk.download('punkt_tab')  # Tokenization
nltk.download('stopwords')  # Stopword removal
nltk.download('wordnet')  # Lemmatization

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### **Stopword Removal**
- Filters out very common function words (such as "this", "is", "a") that carry little meaning on their own.

In [None]:
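from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize  # needed here; otherwise word_tokenize is undefined

stop_words = set(stopwords.words('english'))  # set gives fast membership tests
text = "This is a sample sentence demonstrating stopword removal."
words = word_tokenize(text)
# Compare in lowercase because the stopword list is all lowercase
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)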

['sample', 'sentence', 'demonstrating', 'stopword', 'removal', '.']
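
The output above still contains the `'.'` token, because punctuation is not part of the stopword list. A minimal sketch of one way to drop such tokens, assuming that "contains at least one letter or digit" is an acceptable definition of a word (the name `word_tokens` is illustrative):

```python
# Token list copied from the stopword-removal output above
filtered_words = ['sample', 'sentence', 'demonstrating', 'stopword', 'removal', '.']

# Keep only tokens that contain at least one alphanumeric character
word_tokens = [w for w in filtered_words if any(ch.isalnum() for ch in w)]
print(word_tokens)
```

This keeps tokens such as "2023" or "n't"; a stricter filter like `w.isalpha()` would remove those as well.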


- **`re` (Regular Expressions):**
  Python's built-in regular expression module, useful for cleaning text.

In [None]:
import re
text = "Hello!! Welcome to NLP, 2023. Let's explore!"
clean_text = re.sub(r'[^a-zA-Z\s]', '', text)  # Removes special characters and digits
print(clean_text)

Hello Welcome to NLP  Lets explore
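
Removing "2023" and the punctuation around it leaves two consecutive spaces in the output above. A short sketch of a follow-up step that collapses whitespace; the second substitution is an addition for illustration, not part of the original cell:

```python
import re

text = "Hello!! Welcome to NLP, 2023. Let's explore!"

# Step 1 (same as above): drop everything except ASCII letters and whitespace
letters_only = re.sub(r"[^a-zA-Z\s]", "", text)

# Step 2 (added): collapse runs of whitespace into one space and trim the ends
normalized = re.sub(r"\s+", " ", letters_only).strip()

print(repr(letters_only))
print(repr(normalized))
```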


### **Tokenization**
- **`word_tokenize`:**
  Splits text into words.

In [None]:
from nltk.tokenize import word_tokenize
text = "Natural Language Processing is fun!"
tokens = word_tokenize(text)
print(tokens)

['Natural', 'Language', 'Processing', 'is', 'fun', '!']


- **`sent_tokenize`:**
  Splits text into sentences.

In [None]:
from nltk.tokenize import sent_tokenize
text = "Natural Language Processing is fun. It helps in text analysis."
sentences = sent_tokenize(text)
print(sentences)

['Natural Language Processing is fun.', 'It helps in text analysis.']


### **Stemming and Lemmatization**
- **Stemming:**
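  Reduces words to a stem by stripping suffixes with fixed rules. The result is not always a real word (for example "easili"), and irregular forms such as "ran" are left unchanged.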

In [None]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ["running", "runner", "ran", "easily", "studies"]
stemmed = [ps.stem(word) for word in words]
print(stemmed)

['run', 'runner', 'ran', 'easili', 'studi']


- **Lemmatization:**
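  Maps words to their dictionary base form (lemma) using WordNet and a part-of-speech tag. With no `pos` argument, `WordNetLemmatizer` treats every word as a noun, which is why "running" and "better" come back unchanged in the output; passing `pos="v"` for "running" or `pos="a"` for "better" yields "run" and "good".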

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "better", "studies", "feet"]
lemmatized = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized)

['running', 'better', 'study', 'foot']


### 1. **CountVectorizer** (Word Counts):
- **What it does:** It counts how many times each word appears in a document.
- **Result:** A matrix showing the raw frequency of each word in the document.
- **Example:**
  - Sentence 1: "Natural language processing is fascinating."
  - Sentence 2: "Learning NLP opens many career opportunities."
  - **Matrix Output:**
    ```
    | Word         | Document 1 | Document 2 |
    |--------------|------------|------------|
    | natural      | 1          | 0          |
    | language     | 1          | 0          |
    | processing   | 1          | 0          |
    | is           | 1          | 0          |
    | fascinating  | 1          | 0          |
    | learning     | 0          | 1          |
    | nlp          | 0          | 1          |
    | opens        | 0          | 1          |
    | many         | 0          | 1          |
    | career       | 0          | 1          |
    | opportunities| 0          | 1          |
    ```
    - Here, **"fascinating" appears once in Document 1** and **"learning" appears once in Document 2**.

- **Good for:** Simple tasks like word frequency analysis or finding common words.
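
The counting step can be sketched in plain Python to show what the matrix contains. This is an illustration, not scikit-learn's implementation: the `tokenize` helper is hypothetical and only approximates `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

documents = [
    "Natural language processing is fascinating.",
    "Learning NLP opens many career opportunities.",
]

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# One Counter of word frequencies per document
counts = [Counter(tokenize(doc)) for doc in documents]

# The vocabulary is the sorted union of all words seen in any document
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary word (0 if the word is absent)
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is a document and each column a vocabulary word, so the table above is this matrix transposed.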


### 2. **TfidfVectorizer** (Word Importance):
- **What it does:** It counts words and then reweights each count by how rare the word is across the corpus. Words that occur in few documents get more weight because they help more in telling documents apart.
- **TF-IDF Formula:**
  - **Term Frequency (TF):** How often a word appears in a document.
  - **Inverse Document Frequency (IDF):** Reduces the weight of common words like "the" or "is" that appear in all documents.
  - Final score = TF × IDF.

- **Example:**
  - Sentence 1: "Natural language processing is fascinating."
  - Sentence 2: "Learning NLP opens many career opportunities."
  - **Matrix Output (TF-IDF scores):**
    ```
    | Word         | Document 1     | Document 2     |
    |--------------|----------------|----------------|
    | natural      | 0.447214       | 0.000000       |
    | language     | 0.447214       | 0.000000       |
    | processing   | 0.447214       | 0.000000       |
    | is           | 0.447214       | 0.000000       |
    | fascinating  | 0.447214       | 0.000000       |
    | learning     | 0.000000       | 0.408248       |
    | nlp          | 0.000000       | 0.408248       |
    | opens        | 0.000000       | 0.408248       |
    | many         | 0.000000       | 0.408248       |
    | career       | 0.000000       | 0.408248       |
    | opportunities| 0.000000       | 0.408248       |
    ```
    - Here, every word occurs in only one of the two documents, so all words share the same IDF. After the default L2 normalization, each score is 1/√5 ≈ 0.447214 for the five words of Document 1 and 1/√6 ≈ 0.408248 for the six words of Document 2.
    - A word like "is" would receive a lower score than the others only if it appeared in both documents; in this tiny corpus IDF has nothing to down-weight.

- **Good for:** Understanding which words are more meaningful in differentiating documents.
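
To make the TF × IDF arithmetic concrete, here is a plain-Python sketch for the two example sentences. It assumes the defaults that scikit-learn's `TfidfVectorizer` documents: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document's row. The helper names (`tokenize`, `idf`, `tfidf_row`) are illustrative.

```python
import math
import re
from collections import Counter

documents = [
    "Natural language processing is fascinating.",
    "Learning NLP opens many career opportunities.",
]

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

term_counts = [Counter(tokenize(doc)) for doc in documents]
n_docs = len(documents)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for counts in term_counts for word in counts)

def idf(word):
    # Smoothed IDF (assumed to match scikit-learn's smooth_idf=True default)
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf_row(counts):
    # Raw TF x IDF per word, then scale the row to unit Euclidean length
    raw = {word: tf * idf(word) for word, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {word: v / norm for word, v in raw.items()}

rows = [tfidf_row(counts) for counts in term_counts]
for i, row in enumerate(rows, start=1):
    print(f"Document {i}:", {w: round(v, 6) for w, v in sorted(row.items())})
```

Because no word is shared between the two sentences, every word has the same IDF and the normalization alone determines the scores: 1/√5 per word in the first document and 1/√6 in the second.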


## **2. Text Vectorization**

### **Bag of Words (BoW):**
Converts text into a matrix of token counts.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
documents = ["Natural language processing is fascinating.",
             "Learning NLP opens many career opportunities."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['career' 'fascinating' 'is' 'language' 'learning' 'many' 'natural' 'nlp'
 'opens' 'opportunities' 'processing']
[[0 1 1 1 0 0 1 0 0 0 1]
 [1 0 0 0 1 1 0 1 1 1 0]]


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample sentences/documents
documents = [
    "Natural language processing is fascinating.",
    "Learning NLP opens many career opportunities."
]

# Create a CountVectorizer instance
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Convert the sparse matrix to a Pandas DataFrame
df = pd.DataFrame(
    X.toarray(),
    index=documents,  # Use sentences as row indices
    columns=vectorizer.get_feature_names_out()  # Use feature names as columns
)

# Display the DataFrame (shown because it is the cell's last expression)
df.head()

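|  | career | fascinating | is | language | learning | many | natural | nlp | opens | opportunities | processing |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Natural language processing is fascinating. | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| Learning NLP opens many career opportunities. | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |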


### **TF-IDF (Term Frequency-Inverse Document Frequency):**
Assigns weights to words based on importance in the document and corpus.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
documents = ["Natural language processing is fascinating.",
             "Learning NLP opens many career opportunities."]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(documents)
print(tfidf.get_feature_names_out())
print(X.toarray())

['career' 'fascinating' 'is' 'language' 'learning' 'many' 'natural' 'nlp'
 'opens' 'opportunities' 'processing']
[[0.         0.4472136  0.4472136  0.4472136  0.         0.
  0.4472136  0.         0.         0.         0.4472136 ]
 [0.40824829 0.         0.         0.         0.40824829 0.40824829
  0.         0.40824829 0.40824829 0.40824829 0.        ]]


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample sentences/documents
documents = [
    "Natural language processing is fascinating.",
    "Learning NLP opens many career opportunities."
]

# Create a TfidfVectorizer instance
tfidf = TfidfVectorizer()

# Fit and transform the documents
X = tfidf.fit_transform(documents)

# Convert the sparse matrix to a Pandas DataFrame
df = pd.DataFrame(
    X.toarray(),
    index=documents,  # Use sentences as row indices
    columns=tfidf.get_feature_names_out()  # Use feature names as columns
)

# Display the DataFrame (shown because it is the cell's last expression)
df.head()

|  | career | fascinating | is | language | learning | many | natural | nlp | opens | opportunities | processing |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Natural language processing is fascinating. | 0.0 | 0.447214 | 0.447214 | 0.447214 | 0.0 | 0.0 | 0.447214 | 0.0 | 0.0 | 0.0 | 0.447214 |
| Learning NLP opens many career opportunities. | 0.408248 | 0.0 | 0.0 | 0.0 | 0.408248 | 0.408248 | 0.0 | 0.408248 | 0.408248 | 0.408248 | 0.0 |


## **5. Machine Learning Integration**

Preprocessed data (BoW or TF-IDF) can be used for machine learning tasks.

### Example: Sentiment Classification

In [None]:
from sklearn.feature_extraction.text import CountVectorizer  # needed for the BoW step below
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Sample dataset
texts = ["I love programming.", "Python is awesome!", "I hate bugs.", "Debugging is frustrating."]
labels = [1, 1, 0, 0]  # 1 = Positive, 0 = Negative

# Text Vectorization (BoW)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Train Model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print(predictions)

[0]
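
To classify text the model has not seen, the new text has to go through the same fitted vectorizer with `transform` (not `fit_transform`), so that its columns line up with the vocabulary learned during training. A minimal sketch, assuming the same four-sentence toy dataset; it trains on all four sentences instead of a split, and the input `"I love Python"` is a made-up example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["I love programming.", "Python is awesome!", "I hate bugs.", "Debugging is frustrating."]
labels = [1, 1, 0, 0]  # 1 = Positive, 0 = Negative

# Learn the vocabulary and build the training matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB()
model.fit(X, labels)

# Reuse the fitted vocabulary for unseen text: transform, not fit_transform
new_texts = ["I love Python"]  # hypothetical input
X_new = vectorizer.transform(new_texts)
prediction = model.predict(X_new)

print(X_new.shape)
print(prediction)
```

Words that were not in the training vocabulary are silently ignored by `transform`, so a sentence made up entirely of unseen words becomes an all-zero row.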
