In [None]:
'''
NLP: Natural Language Processing
Definition: 
A subfield of artificial intelligence that focuses on the interaction between 
computers and humans through natural language.

Applications:
1. Sentiment Analysis
2. Machine Translation
3. Chatbots
4. Information Retrieval
5. Text Summarization

Basics:
1. Tokenization
2. Stopword Removal
3. Stemming and Lemmatization
4. Part-of-Speech Tagging
5. Named Entity Recognition

How it works:
NLP works by using algorithms and models to process and analyze large amounts of natural language data. 
This involves several steps, including:

1. Text Preprocessing: Cleaning and preparing the text data for analysis, which may include tokenization, 
stopword removal, and stemming.

2. Feature Extraction: Converting the text data into a numerical format that can be used by 
machine learning algorithms. This may involve techniques such as bag-of-words, TF-IDF, or word embeddings.

3. Model Training: Using labeled data to train machine learning models to perform specific NLP tasks, 
such as sentiment analysis or named entity recognition.

4. Inference: Applying the trained models to new, unseen text data to make predictions or extract insights.

5. Post-processing: Refining the model outputs and integrating them into applications or workflows.


'''
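
The first two steps above (text preprocessing and feature extraction) can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the stopword list, the helper names `preprocess` and `bag_of_words`, and the two sample documents are all assumptions made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "is", "a", "an", "to", "in", "and", "of"}

def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical sample documents.
docs = ["The movie is great", "The movie is boring and long"]

# Step 1: text preprocessing.
tokenized = [preprocess(d) for d in docs]

# Step 2: feature extraction (one count vector per document).
vocabulary = sorted({t for tokens in tokenized for t in tokens})
vectors = [bag_of_words(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
print(vectors)
```

Each vector has one position per vocabulary word, so the two documents become rows of equal length that a machine learning model could be trained on in step 3.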

In [None]:
'''
Roadmap for NLP:
1. Learn the basics of NLP and its applications.
2. Familiarize yourself with text preprocessing techniques.
    a. Tokenization
    b. Stopword Removal
    c. Stemming and Lemmatization
    d. Part-of-Speech Tagging
    e. Named Entity Recognition
3. Explore feature extraction methods (text representation: turning input text into vectors).
    a. Bag-of-Words
    b. TF-IDF
    c. N-grams (uni-grams, bi-grams, tri-grams)
    d. Word Embeddings (Word2Vec, GloVe, FastText)
4. Understand different NLP models and algorithms.
    a. Rule-based Models
    b. Machine Learning Models (e.g., SVM, Random Forest)
    c. Deep Learning Models (e.g., GRU, RNN, LSTM, Transformers)
5. Word Embeddings
    a. Word2Vec
    b. GloVe
    c. FastText
6. Transformers
    a. BERT
    b. GPT
    c. T5
7. Hugging Face Transformers - A library for state-of-the-art NLP models.
8. BERT
   a. Architecture
   b. Pre-training
   c. Fine-tuning
9. Work on real-world NLP projects to gain hands-on experience.
'''
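
Two of the text representation methods from step 3 can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `ngrams` and `tf_idf` are made up here, the sample tokens are hypothetical, and the TF-IDF formula used (term frequency times log(N / document frequency)) is one common variant among several; libraries often apply extra smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Score each term per document as tf * log(N / df), without smoothing."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical, already tokenized input.
tokens = ["come", "here", "to", "learn", "nlp"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

docs = [["nlp", "is", "fun"], ["nlp", "is", "hard"]]
scores = tf_idf(docs)

print(bigrams)
print(scores[0])
```

A term that occurs in every document (here "nlp" and "is") gets a score of zero under this variant, while a term unique to one document (here "fun") gets a positive score.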

In [None]:
'''
Practical Applications of NLP:
1. Sentiment Analysis: Determining the sentiment behind a piece of text (e.g., positive, negative, neutral).
2. Chatbots: Building conversational agents that can understand and respond to user queries.
3. Machine Translation: Automatically translating text from one language to another.
4. Text Summarization: Generating concise summaries of long documents.
5. Named Entity Recognition: Identifying and classifying key entities in text (e.g., names, dates, locations).
'''

In [None]:
'''
Tokenization in NLP:

Definition:
Tokenization is the process of breaking down a text into smaller units, called tokens. 
These tokens can be words, phrases, or even individual characters, 
depending on the level of granularity required for the NLP task at hand. 
Tokenization is a crucial step in NLP as it helps in understanding the structure and meaning of the text.


Topics:
a. Corpus: A corpus is a large and structured set of texts (or speech) that is used for 
linguistic analysis and model training in NLP. It serves as the foundational dataset for various NLP tasks, 
providing the necessary context and examples for algorithms to learn from.
b. Token: A token is an individual unit of text (a word, punctuation mark, phrase, or character) 
produced by the tokenization process.
c. Document: A single unit of text within a corpus (a sentence, a paragraph, or a whole article) 
that is processed as a whole during an NLP task.
d. Vocabulary: The set of all unique tokens (words, phrases, etc.) present in a corpus or dataset. 
            Its size determines the dimensionality of representations such as bag-of-words.
e. Words: The tokens that correspond to words of the language, as opposed to punctuation or other symbols.

'''
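
The terms above can be made concrete with a small standard-library sketch. The sample text, the variable names, and the naive regex-based splitting are assumptions for illustration only; they are not how NLTK or spaCy segment text.

```python
import re

# Corpus: the full body of text (a hypothetical sample).
sample_corpus = (
    "Hello, from the other world. My name is Abhishek. "
    "Come here to learn about NLP."
)

# Documents: here one sentence each, split naively after ., ! or ?.
documents = re.split(r"(?<=[.!?])\s+", sample_corpus.strip())

# Tokens: words and single punctuation marks, one list per document.
tokens = [re.findall(r"\w+|[^\w\s]", doc) for doc in documents]

# Vocabulary: the unique word tokens, lowercased, punctuation excluded.
vocabulary = sorted({t.lower() for doc in tokens for t in doc if t.isalnum()})

print(len(documents), "documents")
print(tokens[1])
print(len(vocabulary), "unique words")
```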

In [None]:
'''Tokenization
Two important libraries:
a. nltk - Natural Language Toolkit, a powerful library for text processing and analysis.
b. spacy - An industrial-strength NLP library that provides fast and efficient tools for various NLP tasks.

Differences between NLTK and spaCy:

1. Ease of Use:
   - NLTK: More flexible and allows for fine-grained control over text processing tasks, 
            but has a steeper learning curve.
   - spaCy: Designed for production use, with a simpler API and better performance out of the box.

2. Speed:
   - NLTK: Generally slower, as it is written largely in pure Python and geared toward 
            teaching and research rather than throughput.
   - spaCy: Implemented in Cython and optimized for speed, making it suitable for large-scale applications.

3. Pre-trained Models:
   - NLTK: Ships mostly classical resources (e.g., the Punkt sentence tokenizer), 
            fetched on demand with nltk.download().
   - spaCy: Provides pre-trained pipelines for many languages (e.g., en_core_web_sm), 
            making it easier to get started.

4. Features:
   - NLTK: A comprehensive library with a wide range of tools for text processing, including tokenization, 
            stemming, and parsing.
   - spaCy: Focuses on providing a streamlined set of features for common NLP tasks, such as named 
            entity recognition and dependency parsing.

'''



In [1]:
import nltk
import spacy

In [9]:
corpus = """Hello, from the other world, My name is Abhishek.
Please do learn the entire course to become expert in NLP! Come here for more to learn about NLP.
"""

In [None]:
corpus

'Hello, from the other world, My name is Abhishek.\nPlease do learn the entire course to become expert in NLP! Come here for more to learn about NLP.\n'

In [12]:
# Tokenization: paragraph -> sentences
from nltk.tokenize import sent_tokenize, word_tokenize
# If the Punkt sentence model is missing, run once: nltk.download('punkt_tab')
sentences = sent_tokenize(corpus)

In [13]:
sentences

['Hello, from the other world, My name is Abhishek.',
 'Please do learn the entire course to become expert in NLP!',
 'Come here for more to learn about NLP.']

In [14]:
for sentence in sentences:
    print(sentence)

Hello, from the other world, My name is Abhishek.
Please do learn the entire course to become expert in NLP!
Come here for more to learn about NLP.


In [17]:
word_list = word_tokenize(corpus)
word_list

['Hello',
 ',',
 'from',
 'the',
 'other',
 'world',
 ',',
 'My',
 'name',
 'is',
 'Abhishek',
 '.',
 'Please',
 'do',
 'learn',
 'the',
 'entire',
 'course',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '!',
 'Come',
 'here',
 'for',
 'more',
 'to',
 'learn',
 'about',
 'NLP',
 '.']

In [None]:
# Tokenization: sentence -> words (one list of tokens per sentence)
words = [word_tokenize(sentence) for sentence in sentences]
words

[['Hello',
  ',',
  'from',
  'the',
  'other',
  'world',
  ',',
  'My',
  'name',
  'is',
  'Abhishek',
  '.'],
 ['Please',
  'do',
  'learn',
  'the',
  'entire',
  'course',
  'to',
  'become',
  'expert',
  'in',
  'NLP',
  '!'],
 ['Come', 'here', 'for', 'more', 'to', 'learn', 'about', 'NLP', '.']]

In [25]:
corpus1 = '''Hello Abhishek. NLP's are better than ML for Text analysis.
'''

In [26]:
from nltk.tokenize import wordpunct_tokenize
# Splits off every run of punctuation, so "NLP's" becomes 'NLP', "'", 's'
word_list = wordpunct_tokenize(corpus1)
word_list

['Hello',
 'Abhishek',
 '.',
 'NLP',
 "'",
 's',
 'are',
 'better',
 'than',
 'ML',
 'for',
 'Text',
 'analysis',
 '.']
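
The output above can be reproduced without NLTK. To my understanding, `wordpunct_tokenize` is a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of non-space punctuation); the sketch below applies that pattern directly with `re.findall`. The variable names are assumptions for this example.

```python
import re

# Same text as corpus1 in the cell above.
text = "Hello Abhishek. NLP's are better than ML for Text analysis.\n"

# A token is either a run of word characters or a run of punctuation.
regex_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(regex_tokens)
```

Because the apostrophe is neither a word character nor whitespace, it becomes its own token, which is why "NLP's" is split into three pieces.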

In [27]:
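from nltk.tokenize import TreebankWordTokenizer
# Treebank rules keep "'s" together as one token. The tokenizer expects one sentence
# per call, so only the final period is split off and 'Abhishek.' stays a single token.
tokenizer = TreebankWordTokenizer()
word_list = tokenizer.tokenize(corpus1)
word_list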

['Hello',
 'Abhishek.',
 'NLP',
 "'s",
 'are',
 'better',
 'than',
 'ML',
 'for',
 'Text',
 'analysis',
 '.']