<a href="https://colab.research.google.com/github/abhyagarg22/NLP/blob/main/Basics_of_Natural_Language_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing (NLP)

### Objectives:
- **To introduce the basic concepts of NLP.**
- **To demonstrate real-world applications of NLP.**
- **To engage students with interactive examples and exercises.**


## 1. What is NLP?

- **Definition:** Natural Language Processing (NLP) is a field of artificial intelligence concerned with enabling computers to read, interpret, and generate human (natural) language.
- **Simple Explanation:** Think of NLP as teaching computers to understand and respond in human language.

## 2. Need for NLP

- **Understanding Human Language:** Computers need to understand human language to provide meaningful responses.
- **Automating Repetitive Tasks:** Tasks like sorting emails, summarizing texts, or analyzing sentiment can be automated using NLP.
- **Enhancing User Experience:** Virtual assistants, chatbots, and translation services all rely on NLP to improve interactions.

## 3. Applications of NLP

1. **Machine Translation:** Tools like Google Translate.
2. **Sentiment Analysis:** Understanding emotions in texts (e.g., social media posts).
3. **Text Summarization:** Creating short summaries of long documents.
4. **Speech Recognition:** Voice-activated assistants like Siri or Alexa.
5. **Chatbots and Virtual Assistants:** Automated customer service bots.

## 4. Basic Steps in NLP

1. **Text Preprocessing:** Cleaning and preparing text data.
2. **Tokenization:** Splitting text into individual words or phrases.
3. **Removing Stop Words:** Filtering out common words that add little meaning.
4. **Stemming and Lemmatization:** Reducing words to a base form: a truncated stem (stemming) or a dictionary form (lemmatization).
5. **Vectorization:** Converting text into numerical vectors.
6. **Model Building:** Creating machine learning models to analyze text.
7. **Evaluation:** Assessing the performance of the models.


## Demonstration Outline:

## 1. Text Preprocessing:

In [None]:
import re

def preprocess_text(text):
    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()
    return text

sample_text = "Hello World! This is a sample text for NLP preprocessing."
processed_text = preprocess_text(sample_text)
print(processed_text)


hello world this is a sample text for nlp preprocessing
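The regular expression in `preprocess_text` keeps every `\w` character, which includes digits and underscores, and it leaves repeated spaces untouched. A minimal variant, assuming you also want underscores treated as separators and whitespace collapsed, could look like the sketch below. The name `normalize_text` is a hypothetical helper, not part of the notebook.

```python
import re

def normalize_text(text):
    """Lowercase, replace punctuation/underscores with spaces, collapse whitespace."""
    text = text.lower()
    # [^\w\s] matches punctuation; "_" is listed separately because \w includes it
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse runs of whitespace left behind by the substitution
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("Hello,   World!  snake_case 42"))
```

Replacing punctuation with a space instead of deleting it turns "don't" into "don t" rather than "dont". Which behavior is preferable depends on the tokenizer used afterwards.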


## 2. Tokenization

In [None]:
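import nltk
from nltk.tokenize import word_tokenize

# Tokenizer data used by word_tokenize (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

# Split the cleaned text into word tokens
tokens = word_tokenize(processed_text)
print(tokens)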


[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['hello', 'world', 'this', 'is', 'a', 'sample', 'text', 'for', 'nlp', 'preprocessing']
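Because the text was already stripped of punctuation, the tokens above are simply the whitespace-separated words. When tokenizing raw text, punctuation usually becomes its own token. The sketch below illustrates that idea with a plain regular expression. It is a rough approximation written for this example, not NLTK's algorithm, and `simple_tokenize` is a hypothetical name.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ grabs runs of letters/digits/underscores; [^\w\s] grabs one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hello World! This is NLP."))
```

A library tokenizer such as `word_tokenize` applies additional rules (for contractions and abbreviations, for example), so its output can differ from this sketch.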


## 3. Removing Stop Words:

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
print(filtered_tokens)

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['hello', 'world', 'sample', 'text', 'nlp', 'preprocessing']
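The filter above works because the text was lowercased first: the lookup is an exact string match, and the stop-word list is lowercase. The sketch below shows the difference using a small hand-picked subset of stop words. `STOP_WORDS` and `remove_stop_words` are illustrative names, and the subset is an assumption standing in for the full NLTK list.

```python
# Hand-picked subset for illustration; NLTK's English list is much longer
STOP_WORDS = {"this", "is", "a", "for"}

def remove_stop_words(tokens, lowercase=True):
    """Drop tokens found in STOP_WORDS, optionally lowercasing before the lookup."""
    kept = []
    for token in tokens:
        key = token.lower() if lowercase else token
        if key not in STOP_WORDS:
            kept.append(token)
    return kept

raw_tokens = ["This", "is", "a", "Sample", "text"]
print(remove_stop_words(raw_tokens, lowercase=False))  # "This" survives the exact-match lookup
print(remove_stop_words(raw_tokens))
```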


In [None]:
print(stop_words)

{"don't", 're', 'again', 'be', 'yours', 'when', 'most', 'hadn', 'can', 'y', 'do', 'yourselves', 'such', 'have', "you'll", 'yourself', 'some', 'which', 'in', 'they', 'itself', 'no', 'whom', 'down', 'nor', 'and', "couldn't", 'were', 'am', 'shouldn', 'doing', 's', "shouldn't", 'on', "mustn't", 'his', 'a', 'under', 'weren', 'through', 'me', 'should', 'above', 'we', 'if', 'out', 'there', 'himself', 'its', "it's", 'other', 'wouldn', 'at', "hadn't", "shan't", 'over', 'is', 'during', 'she', "she's", 'after', 'm', 'themselves', 'while', 'your', "you're", 'isn', 'all', 'then', 'd', 'further', 'ourselves', 'here', 'was', 'same', 'i', 'them', 'to', 'he', 'or', 'too', "doesn't", 'own', 'but', 'her', 'those', 'where', 'into', "wasn't", 'for', "you've", 'will', "didn't", 'my', 'of', 'who', 'hers', 'theirs', 'only', "that'll", 'doesn', 'what', 'from', 'once', 'until', 'having', "hasn't", 'haven', 'mustn', 'ain', "aren't", 'aren', 'few', 'between', 'up', 'as', "won't", 'that', 'because', 'off', 'both',

## 4. Stemming

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print(stemmed_tokens)



['hello', 'world', 'sampl', 'text', 'nlp', 'preprocess']
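Stems such as 'sampl' are not dictionary words because a stemmer removes suffixes by rule rather than by looking words up. The toy function below illustrates rule-based suffix stripping only. It is not the Porter algorithm, which applies several ordered rule phases, and `toy_stem` with its suffix list is an assumption made for this example.

```python
# Toy suffix list, checked in order; the real Porter algorithm is considerably more involved
SUFFIXES = ("ing", "ed", "es", "s", "e")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["hello", "world", "sample", "text", "nlp", "preprocessing"]
print([toy_stem(w) for w in words])
```

On this word list the toy rules happen to agree with the output above, but they over-strip in many other cases (for example, "glass" becomes "glas"), which is why a tested stemmer is preferable in practice.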


## 5. Lemmatization:

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
import spacy

# Download the necessary NLTK data
nltk.download('wordnet')
nltk.download('omw-1.4')  # Required for wordnet
nltk.download('punkt')  # Ensure punkt is downloaded for tokenization

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

# Example text
text = "The boy was going for a trip where he could say that he hiked, danced, sung, swam, surfed and cooked."

# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Example list of filtered tokens
filtered_tokens = ["running", "jumps", "easily", "fairly"]

# Check if the required NLTK resources are available
try:
    nltk.data.find('corpora/wordnet.zip')
    nltk.data.find('corpora/omw-1.4.zip')
    # Lemmatize each token
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
    print("Lemmatized tokens using NLTK:", lemmatized_tokens)
except LookupError:
    print("WordNet resource not found. Please ensure it is downloaded properly.")

# Additionally, lemmatize using spaCy
doc = nlp(text)
spacy_lemmatized_tokens = [token.lemma_ for token in doc]
print("Lemmatized tokens using spaCy:", spacy_lemmatized_tokens)


[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
WordNet resource not found. Please ensure it is downloaded properly.
Lemmatized tokens using spaCy: ['the', 'boy', 'be', 'go', 'for', 'a', 'trip', 'where', 'he', 'could', 'say', 'that', 'he', 'hike', ',', 'danced', ',', 'sung', ',', 'swam', ',', 'surfed', 'and', 'cook', '.']


In [None]:

nltk.download('wordnet')
nltk.download('omw-1.4')


lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmatized_tokens)


[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


LookupError: 
**********************************************************************
  Resource 'corpora/wordnet' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/root/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
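The log reports the wordnet package as "already up-to-date", yet the lookup for `corpora/wordnet` fails. One possible explanation, which is an assumption because the log does not confirm it, is that the `wordnet.zip` archive is present in the `corpora` folder but was never extracted. Under that assumption, extracting the archive next to itself may resolve the error. `ensure_extracted` is a hypothetical helper, and the path is taken from the directories listed in the error message.

```python
import zipfile
from pathlib import Path

def ensure_extracted(zip_path):
    """Extract zip_path next to itself if the matching folder is missing.

    Returns True if extraction happened, False otherwise.
    """
    zip_path = Path(zip_path)
    target_dir = zip_path.with_suffix("")  # e.g. corpora/wordnet.zip -> corpora/wordnet
    if target_dir.exists() or not zip_path.exists():
        return False
    with zipfile.ZipFile(zip_path) as archive:
        # Assumes the archive holds a top-level folder named after the corpus
        archive.extractall(zip_path.parent)
    return True

# Assumed location, based on the directories listed in the error message
print(ensure_extracted("/usr/share/nltk_data/corpora/wordnet.zip"))
```

If the function returns `False`, either the folder already exists or the archive is not at that path, and the cause of the `LookupError` lies elsewhere.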

## 6. Vectorization

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy corpus with one sentiment label per text
texts = ["I love programming.", "Python is awesome.", "I hate bugs.", "Debugging is fun."]
labels = [1, 1, 0, 1]  # 1 for positive, 0 for negative

# Learn the vocabulary and convert each text into a row of TF-IDF weights
tfidf_vectorizer = TfidfVectorizer()
vectorized_texts = tfidf_vectorizer.fit_transform(texts)
print(vectorized_texts)  # sparse matrix printed as (row, column) weight

  (0, 7)	0.7071067811865476
  (0, 6)	0.7071067811865476
  (1, 0)	0.6176143709756019
  (1, 5)	0.48693426407352264
  (1, 8)	0.6176143709756019
  (2, 1)	0.7071067811865476
  (2, 4)	0.7071067811865476
  (3, 3)	0.6176143709756019
  (3, 2)	0.6176143709756019
  (3, 5)	0.48693426407352264
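To see where these numbers come from, the sketch below recomputes the weights for the second text by hand. It assumes the `TfidfVectorizer` defaults: lowercasing, tokens of two or more word characters (so "I" is dropped), smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names `tokenize` and `tfidf_row` are made up for this example.

```python
import math
import re

texts = ["I love programming.", "Python is awesome.", "I hate bugs.", "Debugging is fun."]

def tokenize(text):
    # Mirrors the default token pattern: lowercase words of two or more characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(t) for t in texts]
n_docs = len(docs)

# Document frequency: number of texts that contain each term
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

def tfidf_row(doc):
    """Return {term: weight} using smoothed IDF and L2 normalization."""
    weights = {}
    for term in set(doc):
        tf = doc.count(term)
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

for term, weight in sorted(tfidf_row(docs[1]).items()):
    print(f"{term}: {weight:.4f}")
```

The word "is" appears in two of the four texts, so it receives a lower weight (about 0.4869) than "python" and "awesome" (about 0.6176 each), matching row 1 of the output above.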


## 7. Simple Model Building:

In [None]:

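# Hold out 25% of the four texts (a single example) for testing
X_train, X_test, y_train, y_test = train_test_split(vectorized_texts, labels, test_size=0.25, random_state=42)

# Train a Naive Bayes classifier on the TF-IDF features
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict the held-out example and compare with its true label
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")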


Accuracy: 100.00%
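With four texts and `test_size=0.25`, the test set contains one example, so the score can only be 0% or 100% and says little about the model. One simple sanity check, sketched below under the assumption that a majority-class baseline is a fair point of comparison, is to ask how well a "model" does if it always predicts the most frequent label. The `accuracy` helper is written out here for illustration.

```python
from collections import Counter

labels = [1, 1, 0, 1]  # labels from the toy corpus above

def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and true label agree."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Baseline: always predict the most frequent label
majority_label = Counter(labels).most_common(1)[0][0]
baseline_pred = [majority_label] * len(labels)

print(f"Majority label: {majority_label}")
print(f"Baseline accuracy: {accuracy(labels, baseline_pred) * 100:.2f}%")
```

A classifier is only informative if it beats this baseline on a test set large enough to measure the difference, which the four-text corpus is not.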
