
**Natural Language Processing (NLP) with Python**

**Introduction to NLP**

Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language. NLP combines computational linguistics with machine learning techniques to process and analyze large amounts of natural language data.

**Why NLP?**

NLP underpins many common applications, including:

- Sentiment Analysis
- Text Classification
- Named Entity Recognition (NER)
- Machine Translation
- Speech Recognition
- Chatbots and Virtual Assistants

**Getting Started with NLP in Python**

Python provides several libraries for NLP, including:

**NLTK (Natural Language Toolkit):** Useful for text processing and analysis.

**spaCy:** Efficient NLP library with pre-trained models.

**TextBlob:** Simplified text processing and sentiment analysis.

**Transformers (Hugging Face):** Advanced NLP models like BERT and GPT.

**Install Packages**

In [1]:
print(10+10)

20


In [2]:
%pip install nltk spacy textblob transformers
!python -m spacy download en_core_web_sm



**1. Tokenization**

Tokenization is the process of breaking text into words or sentences.

**Using NLTK:**

In [3]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the 'punkt_tab' data that word_tokenize and sent_tokenize rely on
nltk.download('punkt_tab')

text = "Natural Language Processing is fun! Let's learn more."

print("Word Tokens:", word_tokenize(text))
print("Sentence Tokens:", sent_tokenize(text))

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Word Tokens: ['Natural', 'Language', 'Processing', 'is', 'fun', '!', 'Let', "'s", 'learn', 'more', '.']
Sentence Tokens: ['Natural Language Processing is fun!', "Let's learn more."]


**Using spaCy:**

In [4]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("Word Tokens:", [token.text for token in doc])

Word Tokens: ['Natural', 'Language', 'Processing', 'is', 'fun', '!', 'Let', "'s", 'learn', 'more', '.']
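To see what a tokenizer has to decide, here is a rough regex-based sketch that uses only the standard library. The function names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this illustration and are not part of NLTK or spaCy. Notice that this naive version splits "Let's" into three pieces, while both libraries above keep "'s" together.

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "Natural Language Processing is fun! Let's learn more."
print("Word Tokens:", simple_word_tokenize(sample))
print("Sentence Tokens:", simple_sent_tokenize(sample))
```

A rule this simple would also break on abbreviations such as "e.g." or "Dr.", which is one reason to prefer a library tokenizer in practice.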


**2. Stopword Removal**

Stopwords are common words (e.g., "the", "is", "and") that add little meaning.

In [5]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in word_tokenize(text) if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)

Filtered Words: ['Natural', 'Language', 'Processing', 'fun', '!', 'Let', "'s", 'learn', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
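The filtered list above still contains punctuation tokens such as '!' and '.'. A common follow-up step is to drop non-alphabetic tokens as well. The sketch below shows the idea on the tokens from the earlier output; `toy_stop_words` is a tiny hand-picked set used only for illustration (NLTK's English list is much larger), and `remove_stopwords_and_punct` is a hypothetical helper name.

```python
# Tokens copied from the NLTK tokenization output shown earlier.
tokens = ['Natural', 'Language', 'Processing', 'is', 'fun', '!',
          'Let', "'s", 'learn', 'more', '.']

# Hand-picked stopwords for illustration only (an assumption, not NLTK's list).
toy_stop_words = {"the", "is", "and", "more"}

def remove_stopwords_and_punct(tokens, stop_words):
    """Keep alphabetic tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

content_words = remove_stopwords_and_punct(tokens, toy_stop_words)
print("Content Words:", content_words)
```

To use the real list, pass the `stop_words` set built in the cell above instead of `toy_stop_words`.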


**3. Lemmatization**

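Lemmatization reduces each word to its base (dictionary) form; for example, "is" becomes "be". The cell below reuses the spaCy `doc` created in the tokenization section.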

In [7]:
print("Lemmatized Words:", [token.lemma_ for token in doc])

Lemmatized Words: ['Natural', 'Language', 'Processing', 'be', 'fun', '!', 'let', 'us', 'learn', 'more', '.']
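Conceptually, a lemmatizer maps each surface form to a dictionary form. The toy sketch below mimics that with a hand-written lookup table covering only this one sentence. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and this is not how spaCy works internally.

```python
# Tokens copied from the tokenization output shown earlier.
tokens = ['Natural', 'Language', 'Processing', 'is', 'fun', '!',
          'Let', "'s", 'learn', 'more', '.']

# Hypothetical lookup table, hand-written for this sentence only.
TOY_LEMMAS = {"is": "be", "let": "let", "'s": "us"}

def toy_lemmatize(tokens, table=TOY_LEMMAS):
    """Return the table entry for each lowercased token, else the token unchanged."""
    return [table.get(tok.lower(), tok) for tok in tokens]

print("Toy Lemmas:", toy_lemmatize(tokens))
```

The table maps "'s" to "us", which is only right inside "Let's"; in "Anna's book" it would be wrong. That is why real lemmatizers take context and part of speech into account.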


**4. Part-of-Speech (POS) Tagging**

POS tagging identifies the grammatical category (noun, verb, adjective, and so on) of each word in a sentence.

In [8]:
import nltk
from nltk.tokenize import word_tokenize

# Download the 'averaged_perceptron_tagger_eng' resource
nltk.download('averaged_perceptron_tagger_eng')

print("POS Tags:", nltk.pos_tag(word_tokenize(text)))

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


POS Tags: [('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('is', 'VBZ'), ('fun', 'JJ'), ('!', '.'), ('Let', 'NNP'), ("'s", 'POS'), ('learn', 'NN'), ('more', 'RBR'), ('.', '.')]




**"Natural Language Processing is fun! Let's learn more."**

Each word is assigned a **POS tag**, which indicates its grammatical role in the sentence. Let's break it down:

| Word       | POS Tag | Meaning |
|------------|--------|---------|
| **Natural**  | JJ     | Adjective (describes a noun) |
| **Language** | NNP    | Proper noun (specific name) |
| **Processing** | NNP | Proper noun (specific name) |
| **is**      | VBZ    | Verb, 3rd person singular present |
| **fun**     | JJ     | Adjective (describes a noun) |
| **!**       | .      | Sentence-final punctuation (period, exclamation mark, question mark) |
| **Let**     | NNP    | Proper noun (incorrect tagging; should be a verb) |
| **'s**      | POS    | Possessive ending (incorrect tagging; here it is the contracted pronoun "us") |
| **learn**   | NN     | Noun (incorrect tagging; should be a verb) |
| **more**    | RBR    | Adverb, comparative (e.g., more, better, faster) |
| **.**       | .      | Punctuation |

### Observations:
1. **Proper Noun Tagging for "Natural Language Processing"**:  
   - "Language" and "Processing" are tagged as **NNP (Proper Nouns)**, which makes sense because **Natural Language Processing (NLP)** is often treated as a proper noun.
   - However, "Natural" is tagged as **JJ (Adjective)**, which is also reasonable.

2. **Incorrect Tagging**:
   - "Let" is tagged as **NNP (Proper Noun)** instead of a verb.
   - "'s" is tagged as **POS (Possessive)**, but in this context it is the contracted pronoun "us" ("Let's" = "Let us"), so **PRP (Personal Pronoun)** would be expected.
   - "learn" is tagged as **NN (Noun)**, but it should be a **VB (Verb, base form)**.

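### Why Are Some Tags Incorrect?
- POS taggers are **statistical models** that predict each tag from the word and its neighbors, so they can mislabel words in short or unusual contexts such as the imperative "Let's learn more."
- A tagger that uses more context, such as the one in **spaCy’s pretrained pipelines** (exposed as `token.pos_` and `token.tag_`), may handle such cases better. Dependency parsing is a separate step that links words to each other; it does not assign POS tags.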
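The tags that NLTK returns come from the Penn Treebank tagset. When reading tagger output, it can help to attach a short gloss to each tag programmatically. The sketch below does this for the output above; `TAG_GLOSSES` covers only the tags that appear in this sentence, and `describe` is a hypothetical helper.

```python
# Tagger output copied from the POS tagging cell above.
pos_tags = [('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'),
            ('is', 'VBZ'), ('fun', 'JJ'), ('!', '.'), ('Let', 'NNP'),
            ("'s", 'POS'), ('learn', 'NN'), ('more', 'RBR'), ('.', '.')]

# Short glosses for the Penn Treebank tags used in this sentence only.
TAG_GLOSSES = {
    "JJ": "adjective",
    "NNP": "proper noun, singular",
    "NN": "noun, singular or mass",
    "VBZ": "verb, 3rd person singular present",
    "POS": "possessive ending",
    "RBR": "adverb, comparative",
    ".": "sentence-final punctuation",
}

def describe(tags, glosses=TAG_GLOSSES):
    """Pair each (word, tag) with a readable gloss, or 'unknown tag'."""
    return [(word, tag, glosses.get(tag, "unknown tag")) for word, tag in tags]

for word, tag, gloss in describe(pos_tags):
    print(f"{word:<12}{tag:<5}{gloss}")
```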


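**5. Sentiment Analysis**

Sentiment analysis determines whether a text is positive, negative, or neutral. TextBlob reports this as a polarity score between -1.0 (most negative) and 1.0 (most positive).

**Using TextBlob:**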

In [10]:
from textblob import TextBlob

tb = TextBlob(text)
print("Sentiment Score:", tb.sentiment.polarity)

Sentiment Score: 0.325
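A raw score is often turned into a label before it is used. The sketch below is one possible mapping, not TextBlob functionality: `label_polarity` and its `neutral_band` parameter are assumptions made for illustration, and the band width of 0.05 is arbitrary.

```python
def label_polarity(polarity, neutral_band=0.05):
    """Map a polarity score in [-1.0, 1.0] to a coarse sentiment label.

    Scores within +/- neutral_band of zero are treated as neutral.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

# 0.325 is the score TextBlob returned for the example sentence above.
print("Sentiment Label:", label_polarity(0.325))
```

A small neutral band keeps scores that are barely above or below zero from being reported as clearly positive or negative.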


**6. Text Generation with Transformers**

Pre-trained language models can continue a prompt with generated text. The example below uses GPT-2.

**Using Hugging Face's Transformers:**

In [9]:
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Once upon a time,"))

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Once upon a time, I would not stand still, the voice of truth receded from every door.\n\nI kept following her, going back to the wall, through the streets of Ammon and back to my heart.\n\nShe came'}]
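As the output shows, the pipeline returns a list of dictionaries, each with a `generated_text` key that includes the original prompt. The sketch below extracts only the continuation. It works on a shortened copy of the output above so that it runs without downloading a model, and `continuation` is a hypothetical helper name.

```python
# Same shape as the pipeline output above; the text is shortened for this sketch.
outputs = [{"generated_text": "Once upon a time, I would not stand still, "
                              "the voice of truth receded from every door."}]
prompt = "Once upon a time,"

def continuation(outputs, prompt):
    """Return the first result's text with the leading prompt removed, if present."""
    text = outputs[0]["generated_text"]
    if text.startswith(prompt):
        return text[len(prompt):].lstrip()
    return text

print(continuation(outputs, prompt))
```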


## **Small Project: Sentiment Analysis Chatbot**

**Description**

This chatbot takes user input, analyzes its sentiment with TextBlob, and responds accordingly.

In [None]:
from textblob import TextBlob

def chatbot():
    """Run a console chat loop that replies according to the input's sentiment."""
    print("Hello! I am a simple chatbot. Type 'exit' to end the conversation.")
    while True:
        user_input = input("You: ")
        # Ignore surrounding whitespace and letter case when checking for 'exit'
        if user_input.strip().lower() == 'exit':
            print("Chatbot: Goodbye!")
            break
        # Polarity ranges from -1.0 (negative) to 1.0 (positive)
        sentiment = TextBlob(user_input).sentiment.polarity
        if sentiment > 0:
            response = "I'm glad to hear that!"
        elif sentiment < 0:
            response = "I'm sorry to hear that."
        else:
            response = "That's interesting!"
        print(f"Chatbot: {response}")

if __name__ == "__main__":
    chatbot()

Hello! I am a simple chatbot. Type 'exit' to end the conversation.
You: hello I am good
Chatbot: I'm glad to hear that!
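Because the chatbot above reads from `input()` and calls TextBlob directly, it is awkward to test. One possible refactoring, sketched below, separates the reply logic from the console and passes the scoring function in as a parameter. `choose_response`, `run_chat`, and `fake_score` are hypothetical names, and `fake_score` is a crude keyword stand-in for TextBlob so the sketch runs without any installed packages.

```python
def choose_response(polarity):
    """Pick a reply using the same thresholds as the chatbot above."""
    if polarity > 0:
        return "I'm glad to hear that!"
    if polarity < 0:
        return "I'm sorry to hear that."
    return "That's interesting!"

def run_chat(messages, score):
    """Reply to each message until 'exit'; `score` maps text to a polarity."""
    replies = []
    for message in messages:
        if message.strip().lower() == "exit":
            replies.append("Goodbye!")
            break
        replies.append(choose_response(score(message)))
    return replies

def fake_score(message):
    """Stand-in scorer for illustration only (not TextBlob)."""
    lowered = message.lower()
    if "good" in lowered:
        return 0.7
    if "bad" in lowered:
        return -0.7
    return 0.0

replies = run_chat(["hello I am good", "this is bad", "ok", "exit", "ignored"], fake_score)
for reply in replies:
    print("Chatbot:", reply)
```

To use real sentiment scores, pass `lambda m: TextBlob(m).sentiment.polarity` as the `score` argument.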
