#Overview of NLP and its applications
##Introduction to basic NLP tasks: tokenization, stemming, and lemmatization
##Hands-on exercise: Tokenizing text using Python

#Natural Language Processing (NLP)
Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics that focuses on the interaction between computers and human language. It develops algorithms and techniques that enable computers to understand, interpret, and generate human language in a meaningful way. NLP has applications across many domains.

---



One of the primary applications of NLP is in language understanding and machine translation. NLP techniques enable machines to process and comprehend human language, making it possible to build systems that can automatically translate text from one language to another. This has paved the way for multilingual communication, cross-cultural collaboration, and global knowledge sharing.

---



NLP is also widely used in information extraction and text mining. By analyzing and extracting structured information from unstructured text, NLP allows for the extraction of valuable insights and knowledge from large volumes of data. This is particularly useful in applications such as sentiment analysis, topic modeling, named entity recognition, and text summarization.

---



Another prominent application of NLP is in question answering and conversational agents. Natural language understanding and generation techniques enable the development of intelligent chatbots, virtual assistants, and voice-controlled systems. These systems can understand user queries, provide relevant responses, and engage in human-like conversations, enhancing user experiences and enabling seamless human-computer interactions.

---

NLP plays a crucial role in sentiment analysis and opinion mining, where it helps in determining the sentiment and subjective information expressed in text data. This has applications in areas such as social media monitoring, brand reputation management, market research, and customer feedback analysis.
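
As a rough illustration of the idea, the simplest form of sentiment analysis counts words from hand-made positive and negative lists. The sketch below assumes two tiny, made-up word lists (`POSITIVE_WORDS`, `NEGATIVE_WORDS`) and a hypothetical helper `sentiment_score`; real systems typically rely on much larger lexicons or trained models.

```python
# Toy lexicon-based sentiment scoring.
# The word lists are illustrative assumptions, not a real sentiment lexicon.
POSITIVE_WORDS = {"good", "great", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "terrible"}


def sentiment_score(text):
    """Return (#positive words - #negative words) for a piece of text."""
    score = 0
    for raw in text.lower().split():
        word = raw.strip(".,!?;:")  # drop punctuation attached to the word
        if word in POSITIVE_WORDS:
            score += 1
        elif word in NEGATIVE_WORDS:
            score -= 1
    return score


print(sentiment_score("I love this great product!"))   # 2
print(sentiment_score("Terrible support, bad docs."))  # -2
```

A positive score suggests positive sentiment and a negative score the opposite; a word-counting approach like this misses negation ("not good") and sarcasm.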

---



In addition, NLP is extensively used in the healthcare domain for clinical text analysis, medical records processing, and biomedical information extraction. It aids in extracting relevant medical information, identifying patterns, and assisting healthcare professionals in decision-making and research.

---



Furthermore, NLP finds applications in information retrieval, search engines, and document classification. By analyzing and understanding the content of documents and web pages, NLP enables efficient indexing, retrieval, and ranking of information, leading to improved search experiences and knowledge discovery.
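
To make indexing, retrieval, and ranking concrete, here is a minimal sketch of an inverted index in plain Python. The three documents, the whitespace tokenizer, and the term-count scoring are all simplifying assumptions chosen for illustration; production search engines use more elaborate tokenization and ranking functions.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus: document id -> text.
docs = {
    "d1": "NLP enables search engines to index documents",
    "d2": "Stemming and lemmatization normalize words",
    "d3": "Search engines rank documents for a query",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


# Indexing: map each term to the documents containing it, with counts.
index = defaultdict(Counter)
for doc_id, text in docs.items():
    for term in tokenize(text):
        index[term][doc_id] += 1


def search(query):
    """Retrieve documents sharing terms with the query, ranked by match count."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


print(search("rank documents"))  # ['d3', 'd1']
```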

---



NLP also plays a vital role in natural language generation and text-to-speech synthesis, which power chatbots, virtual assistants, and automated report generation. These capabilities enable machines to produce coherent and contextually appropriate language, expanding the possibilities for automated content creation and personalized user experiences.

---



Overall, NLP is a rapidly evolving field with a wide range of applications that have the potential to revolutionize how we interact with computers, process information, and communicate with each other. Its versatility and impact span across various industries, including healthcare, education, e-commerce, customer service, finance, and many others.

#A sample text and exercises for tokenization, stemming, and lemmatization.

The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English, written in the Python programming language. It supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

https://www.nltk.org/

In [None]:
# Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet


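# Download required resources
# (newer NLTK releases may instead ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng'; download those if a LookupError names them)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')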

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# Sample text for exercises
text = "I am enjoying this hands-on exercise. It is teaching me about NLP tasks like tokenization, stemming, and lemmatization."


Exercise 1: Tokenization

Tokenize the given text using the word_tokenize() function from the nltk.tokenize module.
Print the tokens.


In [None]:
# Tokenization exercise
print("Tokenization:")
tokens = word_tokenize(text)
print(tokens)
print()

Tokenization:
['I', 'am', 'enjoying', 'this', 'hands-on', 'exercise', '.', 'It', 'is', 'teaching', 'me', 'about', 'NLP', 'tasks', 'like', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']
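
To see roughly what a word tokenizer does, the sketch below splits text with a single regular expression. It is a simplified stand-in and not NLTK's algorithm: the pattern `TOKEN_PATTERN` and the helper `simple_tokenize` are assumptions made for illustration, and they handle contractions differently from `word_tokenize()` (for example, "don't" stays one token here).

```python
import re

# A word (optionally with internal hyphens or apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens using one regex."""
    return TOKEN_PATTERN.findall(text)


sample = "I am enjoying this hands-on exercise."
print(simple_tokenize(sample))
# ['I', 'am', 'enjoying', 'this', 'hands-on', 'exercise', '.']
```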



**Stemming** is the process of reducing words to a base or root form by stripping affixes; the Porter stemmer used in the exercise below strips suffixes only.
It follows a simple and rule-based approach, chopping off the ends of words based on predefined rules to extract the stem.
The resulting stems may not always be valid words or have a semantic meaning.
Stemming is generally faster and computationally less expensive compared to lemmatization.

Examples of stemming include transforming "running" to "run," "cats" to "cat," and "jumped" to "jump."
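
The rule-based idea can be sketched with a toy suffix stripper. This is not the Porter algorithm, only a minimal illustration under assumed rules: the suffix list `SUFFIXES`, the minimum stem length, and the helper `toy_stem` are all hypothetical.

```python
# Suffixes are tried in order; the first applicable one is removed.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word, min_stem_len=3):
    """Strip one common suffix, keeping at least `min_stem_len` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            stem = word[: len(word) - len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run"
            # (but keep endings such as "ss" and "ll").
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


for w in ["running", "cats", "jumped", "this"]:
    print(w, "->", toy_stem(w))
# running -> run
# cats -> cat
# jumped -> jump
# this -> thi
```

The last result shows why stems are not always valid words: the rule removes the final "s" without knowing that "this" is not a plural.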


**Lemmatization** is the process of reducing words to their base or dictionary form (called lemmas), considering the word's morphological analysis and context.
It takes into account the word's part of speech (POS) tag to ensure accurate normalization.
Lemmatization uses lexical knowledge resources like WordNet or linguistic rules to transform words.
The resulting lemmas are valid words with semantic meaning and can be used for analysis and comprehension.
Lemmatization is generally slower and computationally more expensive compared to stemming.

Examples of lemmatization include transforming "running" to "run" (as a verb), "cats" to "cat," and "better" to "good" (as an adjective).
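
Conceptually, a lemmatizer looks a word up together with its part of speech. The sketch below mimics that with a tiny hand-written dictionary; `TOY_LEXICON` and `toy_lemmatize` are illustrative assumptions, whereas a real lemmatizer draws on a full lexical resource such as WordNet.

```python
# (word, part of speech) -> lemma; POS codes: n = noun, v = verb, a = adjective.
TOY_LEXICON = {
    ("running", "v"): "run",
    ("cats", "n"): "cat",
    ("better", "a"): "good",
    ("is", "v"): "be",
}


def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); return the word unchanged if unknown."""
    return TOY_LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("better", pos="a"))  # good
print(toy_lemmatize("better"))           # better (treated as a noun by default)
print(toy_lemmatize("running", pos="v")) # run
```

The second call shows why the part of speech matters: without it, the lookup assumes a noun and leaves the word unchanged.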

In summary, stemming is a simpler and more aggressive technique that strips affixes (suffixes, in the case of the Porter stemmer) to approximate the root form, while lemmatization considers the context and part of speech to normalize words to their base forms, resulting in valid words with semantic meaning. The choice between stemming and lemmatization depends on the specific NLP task and the level of normalization required.


Exercise 2: Stemming

Initialize a PorterStemmer object from the nltk.stem module.
Perform stemming on the tokens obtained from Exercise 1.
Print the stemmed words.

In [None]:
# Stemming exercise
print("Stemming:")
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in tokens]
print(stemmed_words)
print()

Stemming:
['i', 'am', 'enjoy', 'thi', 'hands-on', 'exercis', '.', 'it', 'is', 'teach', 'me', 'about', 'nlp', 'task', 'like', 'token', ',', 'stem', ',', 'and', 'lemmat', '.']



Exercise 3: Lemmatization

Initialize a WordNetLemmatizer object from the nltk.stem module.
Perform lemmatization on the tokens obtained from Exercise 1.
Print the lemmatized words.

In [None]:
# Lemmatization exercise
# (no POS is passed, so WordNetLemmatizer treats every token as a noun)
print("Lemmatization:")
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
print(lemmatized_words)

Lemmatization:
['I', 'am', 'enjoying', 'this', 'hands-on', 'exercise', '.', 'It', 'is', 'teaching', 'me', 'about', 'NLP', 'task', 'like', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']


In [None]:
# POS tagging
print("POS tagging:")
pos_tags = pos_tag(tokens)
print(pos_tags)
print()

POS tagging:
[('I', 'PRP'), ('am', 'VBP'), ('enjoying', 'VBG'), ('this', 'DT'), ('hands-on', 'JJ'), ('exercise', 'NN'), ('.', '.'), ('It', 'PRP'), ('is', 'VBZ'), ('teaching', 'VBG'), ('me', 'PRP'), ('about', 'IN'), ('NLP', 'NNP'), ('tasks', 'NNS'), ('like', 'IN'), ('tokenization', 'NN'), (',', ','), ('stemming', 'VBG'), (',', ','), ('and', 'CC'), ('lemmatization', 'NN'), ('.', '.')]



In [None]:
# Function to map POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
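
The same mapping can also be written as a lookup table on the first letter of the Penn Treebank tag. The sketch below is a standalone variant that assumes the single-letter codes WordNet uses for its parts of speech ('a', 'v', 'n', 'r'); the names `PENN_PREFIX_TO_WORDNET` and `penn_to_wordnet` are hypothetical.

```python
# First letter of a Penn Treebank tag -> WordNet POS code.
PENN_PREFIX_TO_WORDNET = {
    "J": "a",  # adjectives (JJ, JJR, JJS)
    "V": "v",  # verbs (VB, VBD, VBG, ...)
    "N": "n",  # nouns (NN, NNS, NNP, ...)
    "R": "r",  # adverbs (RB, RBR, RBS)
}


def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)


for tag in ["VBG", "JJ", "NNS", "RB", "PRP"]:
    print(tag, "->", penn_to_wordnet(tag))
# VBG -> v
# JJ -> a
# NNS -> n
# RB -> r
# PRP -> n
```

Tags with no WordNet counterpart, such as pronouns (PRP) or determiners (DT), fall back to noun, which in practice leaves those tokens unchanged.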

In [None]:

# Lemmatization exercise (with POS tags)
print("Lemmatization:")
lemmatizer = WordNetLemmatizer()
lemmatized_words = []
for token, pos in pos_tags:
    # Map the Penn Treebank tag to a WordNet POS; a separate name is used
    # so the imported pos_tag() function is not overwritten
    wn_pos = get_wordnet_pos(pos)
    lemmatized_word = lemmatizer.lemmatize(token, pos=wn_pos)
    lemmatized_words.append(lemmatized_word)
print(lemmatized_words)

Lemmatization:
['I', 'be', 'enjoy', 'this', 'hands-on', 'exercise', '.', 'It', 'be', 'teach', 'me', 'about', 'NLP', 'task', 'like', 'tokenization', ',', 'stem', ',', 'and', 'lemmatization', '.']


Example of Lemmatization based on POS

"He is fishing."
Lemmatized form: ['He', 'be', 'fish', '.']

"Fishing is a sport."
Lemmatized form: ['Fishing', 'be', 'a', 'sport', '.']

In the first sentence, "fishing" is tagged as a verb and is lemmatized to its base form "fish". In the second sentence, "Fishing" is tagged as a noun, so it is already in its base form and is left unchanged.

#spaCy
https://spacy.io/

spaCy is a powerful and widely used open-source library for natural language processing (NLP) in Python. It provides efficient and high-performance tools for various NLP tasks, making it popular among researchers, developers, and data scientists.

---



One of spaCy's key capabilities is tokenization, which involves splitting text into individual words or tokens. It handles complex tokenization scenarios, such as contractions, hyphenated words, and punctuation, with excellent accuracy.

---



spaCy also offers advanced linguistic annotations, including part-of-speech tagging, dependency parsing, named entity recognition (NER), and sentence segmentation. These annotations provide detailed linguistic information about the text, enabling deeper analysis and understanding of the language's structure and meaning.

---



Additionally, spaCy supports word vector representations, allowing users to access pre-trained word embeddings like GloVe or train custom models on their own data. These word vectors capture semantic relationships between words and are useful for tasks such as similarity calculations and text classification. Note that the small `en_core_web_sm` model used later in this notebook does not ship with static word vectors; larger models such as `en_core_web_md` do.
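
Similarity between word vectors is commonly measured with cosine similarity. The sketch below computes it in plain Python on made-up three-dimensional vectors; the values in `toy_vectors` are assumptions chosen for illustration, whereas real embeddings have hundreds of dimensions learned from data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (not taken from any trained model).
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))
```

With these values, "cat" and "dog" score close to 1 while "cat" and "car" score close to 0, which is the kind of pattern trained embeddings are expected to show for related and unrelated words.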

---



Another significant feature of spaCy is its efficient processing pipeline design, which allows for fast and scalable NLP workflows. It is designed to handle large volumes of text efficiently, making it suitable for real-world applications that require processing large datasets.

---



Furthermore, spaCy provides support for rule-based matching, allowing users to define custom patterns and extract information based on specific linguistic patterns or rules. This feature is beneficial for tasks such as extracting specific entities or specific linguistic patterns from the text.
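
The idea behind rule-based matching can be sketched without spaCy: slide a window over tagged tokens and report the spans whose tags equal a pattern. This is not spaCy's matcher API; the helper `find_matches` is hypothetical and the POS tags in `tagged` were assigned by hand for illustration.

```python
def find_matches(tagged_tokens, pattern):
    """Return (start, end) spans whose POS tags equal `pattern` in order."""
    n = len(pattern)
    spans = []
    for start in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[start:start + n]
        if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
            spans.append((start, start + n))
    return spans


# Hand-tagged tokens (text, coarse POS tag).
tagged = [
    ("I", "PRON"), ("am", "AUX"), ("a", "DET"), ("runner", "NOUN"),
    ("running", "VERB"), ("in", "ADP"), ("a", "DET"), ("race", "NOUN"),
    (".", "PUNCT"),
]

# Pattern: a determiner immediately followed by a noun.
for start, end in find_matches(tagged, ["DET", "NOUN"]):
    print(" ".join(text for text, _ in tagged[start:end]))
# a runner
# a race
```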

---



Overall, spaCy is a comprehensive NLP library that combines performance, ease of use, and a wide range of capabilities. Its extensive features make it a go-to choice for tasks like text preprocessing, information extraction, text classification, and many other NLP applications.

In [None]:
import spacy

# Load the English language model in spaCy
nlp = spacy.load('en_core_web_sm')

# Define a text for processing
text = "I am a runner running in a race. I have been running since morning."

# Tokenization
doc = nlp(text)
tokens = [token.text for token in doc]

print("Tokens:", tokens)

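# Lemmatization with POS tagging
# (spaCy 2.x returns "-PRON-" as the lemma of pronouns; keep the original
# text in that case. Later versions return the pronoun itself.)
lemmatized_tokens = [token.lemma_ if token.lemma_ != "-PRON-" else token.text for token in doc]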

print("Lemmatized Tokens:", lemmatized_tokens)



Tokens: ['I', 'am', 'a', 'runner', 'running', 'in', 'a', 'race', '.', 'I', 'have', 'been', 'running', 'since', 'morning', '.']
Lemmatized Tokens: ['I', 'be', 'a', 'runner', 'run', 'in', 'a', 'race', '.', 'I', 'have', 'be', 'run', 'since', 'morning', '.']
