# **Session 1: Introduction to NLP**


**Natural Language Processing (NLP)**  
NLP is a field at the intersection of linguistics, computer science, and artificial intelligence, focused on enabling computers to interact with and understand human language. It involves teaching machines to analyze, process, and generate natural language, which includes both text and speech.

**Key Concepts:**
- **Problem**: NLP tackles the problem of how to program computers to process and analyze large volumes of natural language data. Human languages are complex, with variations in grammar, syntax, and context, making this a challenging task for machines. The goal is to make computers capable of understanding and responding to human language effectively.
  
- **Challenge**: One of the biggest challenges in NLP is dealing with unstructured data. Unlike structured data (e.g., tables and databases), natural language is unstructured and often ambiguous. Words can have multiple meanings, sentences can have different interpretations, and context can change the meaning of a phrase. Teaching machines to handle this variability is difficult.

- **Ultimate Goal**: The ultimate goal of NLP is to give machines the ability to understand, interpret, and generate human language in a way that is meaningful and useful. This includes reading and comprehending text, engaging in conversations, answering questions, and translating between languages—all with an understanding of the nuances of human communication. The ideal outcome is that machines can not only process language but also derive accurate meanings and insights from it.

### **Applications of NLP**

Here are some common applications of Natural Language Processing (NLP):

1. **Search Engines**: Enhances search capabilities by understanding and interpreting user queries to provide relevant search results.

2. **Speech Recognition**: Converts spoken language into text, enabling voice assistants like Siri, Google Assistant, and Alexa to understand and respond to user commands.

3. **Machine Translation**: Translates text from one language to another, as seen in services like Google Translate and DeepL.

4. **Sentiment Analysis**: Analyzes text to determine the sentiment or emotional tone behind it, used in social media monitoring, customer feedback analysis, and brand reputation management.

5. **Text Summarization**: Creates concise summaries of longer documents, useful for quickly grasping the main points of articles, reports, or research papers.

6. **Named Entity Recognition (NER)**: Identifies and classifies entities such as names of people, organizations, locations, dates, and more within text, used in information extraction and document indexing.

7. **Chatbots and Virtual Assistants**: Facilitates automated interactions with users through conversational agents that understand and respond to natural language queries.

8. **Information Extraction**: Extracts specific pieces of information from unstructured text, such as extracting key facts from news articles or legal documents.

9. **Text Classification**: Categorizes text into predefined categories, used for spam detection in emails, topic categorization, and content moderation.

10. **Recommendation Systems**: Provides personalized recommendations by analyzing text such as reviews, item descriptions, and search queries alongside user preferences and behavior, used in platforms like Netflix, Amazon, and Spotify.

11. **Grammar and Spell Checking**: Identifies and corrects grammatical errors and spelling mistakes in text, improving writing quality and readability.

12. **Topic Modeling**: Discovers the abstract topics that occur in a collection of documents, used for organizing and summarizing large datasets of text.


### **Basic Concepts of NLP (Terminology)**

- **Text Corpus (plural: Corpora)**: A large collection of text data, in one language or several (English, French, and so on). It serves as the primary source of data for NLP tasks.

- **Paragraph**: A group of related sentences. Within a document, paragraphs are often the largest unit processed in NLP tasks, and they provide broader context and structure than single sentences.

- **Sentences**: A sentence is a coherent unit of text that conveys a complete thought or idea. In NLP, sentence boundaries are usually detected from punctuation marks such as periods, question marks, and exclamation points.

- **Phrases and Words**: Words are typically the smallest units of text handled in a basic NLP pipeline, although some models split words further into subword units or characters. Phrases consist of groups of words that convey specific meanings or functions within a sentence.

### **Text Pre-Processing**

Text pre-processing involves preparing and transforming textual data to make it suitable for analysis and machine learning tasks. Key steps in text pre-processing include:

- **Sentence Tokenization**: The process of dividing text into individual sentences, allowing for better analysis and understanding of sentence-level information.

- **Word Tokenization**: Breaking down sentences into individual words or tokens, which are the basic units for text analysis.

- **Text Lemmatization and Stemming**: Techniques used to reduce words to their base or root forms. Lemmatization involves converting words to their dictionary form, while stemming cuts words down to their stem or root form.

- **Stop Words**: Commonly used words (such as "and," "the," "is") that are often filtered out from the text because they do not contribute significant meaning for analysis.

- **Regex (Regular Expressions)**: Patterns used to match and manipulate specific sequences of characters within text, aiding in tasks such as extracting or cleaning data.
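
As a small illustration of the regex step, the sketch below uses Python's built-in `re` module to extract numbers from a noisy string and then normalize it. The sample string is adapted from the speech used later in this session, with extra spaces and punctuation added on purpose; the variable names are illustrative only.

```python
import re

# Sample text adapted from the speech, with noise added for illustration.
raw = "For  220 years,\n our leaders have fulfilled this duty!!"

# Extract every run of digits.
numbers = re.findall(r"\d+", raw)

# Replace anything that is not a letter or whitespace with a space.
letters_only = re.sub(r"[^A-Za-z\s]", " ", raw)

# Collapse repeated whitespace, trim the ends, and lowercase.
cleaned = re.sub(r"\s+", " ", letters_only).strip().lower()

print(numbers)
print(cleaned)
```

Whether to drop digits and punctuation depends on the task; for sentiment analysis, for instance, exclamation marks may be worth keeping.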

### **NLTK (Natural Language Toolkit)**

- **Overview**: NLTK is an open-source suite of Python modules, datasets, and tutorials designed to support research and development in natural language processing (NLP). It provides a comprehensive platform for working with text data.

- **Download**: You can download NLTK from [nltk.org](https://www.nltk.org).

- **Components of NLTK**:
  1. **Code**: Includes a range of tools and libraries such as corpus readers, tokenizers, stemmers, taggers, chunkers, parsers, and WordNet. NLTK contains around 50,000 lines of code.
  
  2. **Corpora**: Provides over 30 annotated datasets commonly used in NLP tasks, totaling more than 300 MB of data. These datasets are essential for training and evaluating NLP models.
  
  3. **Documentation**: Features extensive resources including a 400-page book, various articles, reviews, and detailed API documentation to assist users in understanding and utilizing NLTK effectively.

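Let's start by importing NLTK and downloading its data. Note that `nltk.download('all')` fetches every available corpus and model, which can take a while; this session mainly relies on the Punkt sentence tokenizer, the stop word list, and WordNet.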

In [21]:
import nltk
nltk.download('all')
from nltk.corpus import webtext
from nltk.probability import FreqDist
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_r

## **Mount your Google Drive files**

In [22]:
from google.colab import drive
drive.mount('/content/gdrive')


Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## **Load the corpus file from Google Drive**





In [23]:
# Use a context manager so the file is closed once it has been read.
with open('gdrive/My Drive/ObamaSpeech.txt', 'r') as f:
    corpus = f.read()
print(corpus)

Madam Speaker, Vice President Biden, Members of Congress, distinguished guests, and fellow Americans:

Our Constitution declares that from time to time, the President shall give to Congress information about the state of our Union. For 220 years, our leaders have fulfilled this duty. They've done so during periods of prosperity and tranquility, and they've done so in the midst of war and depression, at moments of great strife and great struggle.

It's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed. But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt. When the market crashed on Black Tuesday and civil rights marchers were beaten on Bloody Sunday, the future was anything but certain. These were the times that tested the courage of our convictions and the strength of our Union. And despite all our divisions and disagreements, our hesitation

## **Understanding Sentence Tokenization**

Sentence tokenization, also known as sentence segmentation, is the process of dividing a continuous piece of text into its individual sentences. Although this might seem like a straightforward task, it involves several complexities that need to be addressed for accurate results.

#### **Basic Approach**

In many languages, including English, sentences are often separated by punctuation marks such as periods (.), exclamation points (!), or question marks (?). At a basic level, sentence tokenization involves splitting text whenever one of these punctuation marks is encountered.

#### **Challenges in Sentence Tokenization**

1. **Abbreviations**: Periods used in abbreviations (e.g., "Dr." for "Doctor" or "e.g." for "for example") do not indicate the end of a sentence. Tokenization algorithms need to distinguish these cases to avoid incorrect sentence breaks.

2. **Decimal Points**: Periods in numerical contexts (e.g., "3.14" or "1.5") might be confused with sentence-ending periods. Proper tokenization must recognize that such periods are part of numbers, not sentence boundaries.

3. **Quotes and Parentheses**: Punctuation within quotes or parentheses can complicate sentence boundaries. For instance, in "He said, 'Hello.'" the period sits inside the quotation marks, so the tokenizer must decide whether it ends only the quoted speech or the enclosing sentence as well.

#### **Advanced Techniques**

To handle these complexities, modern sentence tokenization methods often employ algorithms and models that use context and rules to correctly identify sentence boundaries. These methods consider the surrounding text, punctuation, and common language patterns to ensure accurate segmentation.
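
To make the basic approach and the abbreviation problem concrete, here is a minimal sketch using only Python's `re` module. It is not how NLTK segments sentences; the function names and the tiny `ABBREVIATIONS` set are hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical, deliberately tiny abbreviation list for illustration.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e."}


def naive_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def split_with_abbreviations(text):
    """Naive split, then re-join pieces that ended on a known abbreviation."""
    sentences = []
    for piece in naive_split(text):
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences


text = "Dr. Smith paid 3.14 dollars. He left!"
print(naive_split(text))
print(split_with_abbreviations(text))
```

The naive version breaks the text after "Dr.", while the second version keeps "Dr. Smith paid 3.14 dollars." together. The decimal point in "3.14" is not treated as a boundary in either case because no whitespace follows it. A fixed abbreviation list does not scale well, which is one reason to prefer trained or statistical tokenizers.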

### **Explanation of the code**

1. **Importing Libraries**:
   ```python
   from nltk.corpus.reader import wordlist
   ```
   - This line imports the `wordlist` module from NLTK’s `corpus.reader` package. Nothing in this snippet uses it, so the import can be removed without changing the result.

2. **Sentence Tokenization**:
   ```python
   sentences = nltk.sent_tokenize(corpus)
   ```
   - `nltk.sent_tokenize(corpus)` is a function from the Natural Language Toolkit (NLTK) library that performs sentence tokenization. It takes `corpus`, which is a string containing the text to be processed, and divides it into a list of sentences.
   - `sent_tokenize` uses NLTK's pre-trained Punkt model to identify sentence boundaries, which lets it handle cases such as abbreviations and decimal points better than a plain split on punctuation.

3. **Iterating Through Sentences**:
   ```python
   for sentence in sentences:
       print(sentence)
   ```
   - This loop iterates through each sentence in the `sentences` list. The `for` loop accesses each `sentence` one by one.
   - `print(sentence)` outputs each sentence to the console, displaying the results of the sentence tokenization process.

**Summary**

This code snippet tokenizes a given text (`corpus`) into sentences using NLTK’s `sent_tokenize` function and then prints each sentence. The purpose is to break down the text into manageable units (sentences) and visualize the results. If you’re working with a large text, this approach helps in analyzing and processing text data on a sentence-by-sentence basis.

In [24]:
from nltk.corpus.reader import wordlist
sentences=nltk.sent_tokenize(corpus)
for sentence in sentences:
    print(sentence)

Madam Speaker, Vice President Biden, Members of Congress, distinguished guests, and fellow Americans:

Our Constitution declares that from time to time, the President shall give to Congress information about the state of our Union.
For 220 years, our leaders have fulfilled this duty.
They've done so during periods of prosperity and tranquility, and they've done so in the midst of war and depression, at moments of great strife and great struggle.
It's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed.
But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt.
When the market crashed on Black Tuesday and civil rights marchers were beaten on Bloody Sunday, the future was anything but certain.
These were the times that tested the courage of our convictions and the strength of our Union.
And despite all our divisions and disagreements, our hesitations

## **Word Tokenization**

Word tokenization, also known as word segmentation, is the process of dividing a continuous string of text into its individual words. In languages like English, spaces are typically a reliable indicator of word boundaries, so they serve as effective delimiters for most tokens. Punctuation and contractions still need extra handling: a tokenizer usually separates a trailing comma from the word before it and may split a contraction such as "It's" into "It" and "'s".
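
The difference between splitting on spaces and treating punctuation as separate tokens can be sketched with the standard library alone. This is a simplified illustration, not NLTK's tokenization algorithm; the function names and the sample sentence are made up for the example.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays glued to words."""
    return text.split()


def simple_tokens(text):
    """Emit runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Progress was inevitable, wasn't it?"
print(whitespace_tokens(sample))
print(simple_tokens(sample))
```

The whitespace version keeps "inevitable," and "it?" as single tokens, while the regex version separates the punctuation but also breaks "wasn't" into three pieces. Dedicated tokenizers apply more careful rules for contractions.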


### **Explanation of the code**

1. **Word Tokenization**:
   ```python
   words = nltk.word_tokenize(sentences[3])
   ```
   - `nltk.word_tokenize(sentences[3])` is a function call that tokenizes a specific sentence into individual words. Here, `sentences[3]` refers to the fourth element of the list `sentences` (since indexing starts at 0), which is the sentence beginning "It's tempting to look back".
   - `nltk.word_tokenize` takes a string input (in this case, `sentences[3]`) and breaks it down into a list of words. It handles punctuation and other language-specific rules to accurately identify word boundaries.

2. **Iterating Through Words**:
   ```python
   for word in words:
       print(word)
   ```
   - This `for` loop iterates over each word in the `words` list, which was produced by the `word_tokenize` function.
   - `print(word)` outputs each word to the console, displaying the results of the word tokenization process.

**Summary**

This code snippet takes the fourth sentence from a list of tokenized sentences (`sentences[3]`), tokenizes it into individual words using NLTK’s `word_tokenize` function, and then prints each word. Working at the word level is useful for tasks such as counting word frequencies or preparing features for machine learning models.

In [25]:
words=nltk.word_tokenize(sentences[3])

for word in words:
    print(word)

It
's
tempting
to
look
back
on
these
moments
and
assume
that
our
progress
was
inevitable
,
that
America
was
always
destined
to
succeed
.


## **Stop Words**

Stop words are common words that typically carry little meaning or significance in text analysis and are often filtered out during text preprocessing. These words are frequently used in language but do not contribute much to the overall meaning or context of the text. In machine learning and natural language processing, removing stop words helps to reduce noise and improve the quality of the analysis.

- **Purpose**: Stop words are removed to focus on more meaningful words that provide valuable insights for tasks such as text classification, sentiment analysis, and information retrieval.

- **Examples**: Common stop words include "it," "is," "and," "a," "am," "are," and others. These words are so frequent that they tend to overshadow the more informative content in the text.

By eliminating stop words, we can streamline the text and enhance the effectiveness of various text-processing and machine learning algorithms.

### **Explanation of the code**

1. **Initialize the List**:
   ```python
   filtered_sentence = []
   ```
   - An empty list `filtered_sentence` is created to store words that are not stop words.

2. **Import Stop Words**:
   ```python
   from nltk.corpus import stopwords
   ```
   - This line imports the `stopwords` corpus reader from NLTK’s `corpus` package, which provides lists of common stop words for several languages.

3. **Load Stop Words**:
   ```python
   stop_words = set(stopwords.words("english"))
   ```
   - `stopwords.words("english")` retrieves a list of English stop words from NLTK.
   - `set(stopwords.words("english"))` converts this list into a set, which allows for faster lookup and comparison.

4. **Filter Words**:
   ```python
   for w in words:
       if w not in stop_words:
           filtered_sentence.append(w)
   ```
   - This `for` loop iterates over each word `w` in the list `words`.
   - For each word, it checks if the word is not in the `stop_words` set.
   - If the word is not a stop word, it is added to the `filtered_sentence` list.

5. **Print Filtered Sentence**:
   ```python
   print(filtered_sentence)
   ```
   - This line prints the `filtered_sentence` list, which contains only the words from `words` that are not considered stop words.

**Summary**

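The code snippet removes common stop words from a list of words (`words`) and stores the remaining words in `filtered_sentence`. It uses NLTK’s predefined list of English stop words to filter out low-information words, allowing you to focus on more meaningful content in the text. Note that the comparison is case-sensitive and NLTK's list is lowercase, which is why the capitalized token `It` survives in the output; punctuation tokens such as `,` and `.` are also kept because they are not stop words.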

In [26]:
filtered_sentence=[]

from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)

print(filtered_sentence)

['It', "'s", 'tempting', 'look', 'back', 'moments', 'assume', 'progress', 'inevitable', ',', 'America', 'always', 'destined', 'succeed', '.']
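
In the output above, `It` and the punctuation tokens are still present. One possible refinement, sketched here under assumptions, is to lowercase each token before the lookup and to skip tokens made up only of punctuation. The `STOP_WORDS` set below is a hypothetical miniature list for illustration, not NLTK's actual list, and `filter_tokens` is an illustrative helper name.

```python
import string

# Hypothetical miniature stop word set; NLTK's English list is much longer.
STOP_WORDS = {"it", "'s", "to", "on", "these", "and", "that", "our", "was"}


def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stop words (case-insensitively) and punctuation-only tokens."""
    kept = []
    for tok in tokens:
        if tok.lower() in stop_words:
            continue
        if all(ch in string.punctuation for ch in tok):
            continue
        kept.append(tok)
    return kept


tokens = ["It", "'s", "tempting", "to", "look", "back", ",",
          "that", "America", "was", "destined", "."]
filtered = filter_tokens(tokens)
print(filtered)
```

Lowercasing only for the comparison keeps the original casing of the surviving tokens, which can matter for later steps such as named entity recognition.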


## **Lemmatization and Stemming**
In text processing, words can appear in different forms, such as various tenses or related terms. To manage these variations and simplify analysis, lemmatization and stemming are used.

#### **Lemmatization**

- **Definition**: Lemmatization is the process of reducing words to their base or dictionary form, known as a lemma. The lemma is the standard, canonical form of a word as found in a dictionary.

- **How It Works**: Lemmatization relies on a vocabulary and, ideally, the word's part of speech to determine its lemma. For example:
  - "studies" becomes "study"
  - "studying" becomes "study"

- **Purpose**: The goal of lemmatization is to convert different forms of a word into a single, meaningful base form, which helps in accurate text analysis.

#### **Stemming**

- **Definition**: Stemming is the process of reducing words to their root form by removing prefixes or suffixes, using straightforward algorithms.

- **How It Works**: Stemming applies rules to strip away common affixes, resulting in a root form that may not always be a valid word. For example, with the Porter stemmer:
  - "studies" becomes "studi"
  - "studying" becomes "studi"

- **Purpose**: The aim of stemming is to simplify words to a common base form, which can aid in text analysis. However, stemming can sometimes produce less precise or less meaningful results.

**Summary**

Both lemmatization and stemming aim to reduce words to a common base form, but they do so differently:

- **Lemmatization** provides a precise and meaningful base form by drawing on a dictionary and the word’s part of speech.
- **Stemming** uses simpler rules to strip away affixes, resulting in a root form that might not always be a meaningful word.

The choice between lemmatization and stemming depends on the specific needs of the text analysis task, with lemmatization often being more accurate and stemming being faster and more straightforward.
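
The contrast can be sketched with a toy example that needs no external data. The suffix rules and the lookup table below are hypothetical and far simpler than the Porter algorithm or WordNet; they only illustrate that a stemmer applies mechanical rules while a lemmatizer consults a dictionary.

```python
# Hypothetical suffix rules, tried in order: (suffix, replacement).
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Hypothetical miniature dictionary mapping inflected forms to lemmas.
LEMMAS = {"studies": "study", "destined": "destine", "moments": "moment"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_lemma(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word.lower(), word.lower())


for w in ["studies", "destined", "moments"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemma(w))
```

The stems "studi" and "destin" are not dictionary words, whereas the lemmas "study" and "destine" are. The price of the dictionary approach is coverage: a word missing from the table is returned unchanged.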

### **Explanation of the code**

1. **Import Libraries**:
   ```python
   from nltk.stem import PorterStemmer, WordNetLemmatizer
   from nltk.corpus import wordnet
   ```
   - The first line imports `PorterStemmer` and `WordNetLemmatizer` from the NLTK library, which are used for stemming and lemmatization, respectively.
   - `wordnet` is imported from the NLTK corpus because it defines the part-of-speech constants (such as `wordnet.VERB`) that are passed to the lemmatizer.

2. **Select a Word**:
   ```python
   word = filtered_sentence[12]
   print(word)
   ```
   - `word` is assigned the value of the 13th item in the `filtered_sentence` list (index 12, as indexing starts from 0).
   - `print(word)` outputs this word to the console for reference.

3. **Initialize Lemmatizer and Stemmer**:
   ```python
   lemmatizer = WordNetLemmatizer()
   stemmer = PorterStemmer()
   ```
   - `WordNetLemmatizer` is initialized to handle lemmatization.
   - `PorterStemmer` is initialized to handle stemming.

4. **Lemmatization**:
   ```python
   print("The lemma of the word is", lemmatizer.lemmatize(word, pos=wordnet.VERB))
   ```
   - `lemmatizer.lemmatize(word, pos=wordnet.VERB)` computes the lemma of `word`, treating it as a verb (`pos=wordnet.VERB`).
   - `print` displays the lemma.

5. **Stemming**:
   ```python
   print("The stem of the word is", stemmer.stem(word))
   ```
   - `stemmer.stem(word)` computes the stem of `word`.
   - `print` displays the stem.

**Summary**

This code snippet demonstrates how to apply lemmatization and stemming to a specific word using NLTK. The `WordNetLemmatizer` converts the word to its canonical form based on its part of speech, while the `PorterStemmer` reduces the word to its root form. The output provides both the lemma and the stem of the selected word.

In [27]:
from nltk.stem import PorterStemmer,WordNetLemmatizer
from nltk.corpus import wordnet


word=filtered_sentence[12]
print(word)

lemmatizer=WordNetLemmatizer()
stemmer=PorterStemmer()

print("The lemma of the word is", lemmatizer.lemmatize(word,pos=wordnet.VERB))
print("The stem of the word is", stemmer.stem(word))

destined
The lemma of the word is destine
The stem of the word is destin
