In [1]:
## We will use a paragraph from Wikipedia
paragraph = """Deep learning is the subset of machine learning methods based on neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be either supervised, semi-supervised or unsupervised.[2]
Deep-learning architectures such as deep neural networks, deep belief networks, recurrent neural networks, convolutional neural networks and transformers have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.[3][4][5]
Early forms of neural networks were inspired by information processing and distributed communication nodes in biological systems, in particular the human brain. However, current neural networks do not intend to model the brain function of organisms, and are generally seen as low quality models for that purpose.[6]"""

In [2]:
paragraph

'Deep learning is the subset of machine learning methods based on neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be either supervised, semi-supervised or unsupervised.[2]\nDeep-learning architectures such as deep neural networks, deep belief networks, recurrent neural networks, convolutional neural networks and transformers have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.[3][4][5]\nEarly forms of neural networks were inspired by information processing and distributed communication nodes in biological systems, in particular the human brain. However, current neural networks do not intend to model the brain funct

In [34]:
#Importing nltk libraries
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Tokenization

Tokenization is the process of breaking down a text into smaller units called tokens. These tokens can be words, phrases, or even characters.

#### Types of Tokenization
1. Word Tokenization
2. Sentence Tokenization
3. Character Tokenization
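
The cells that follow cover sentence and word tokenization with NLTK. As a minimal sketch of the third type, character tokenization needs no library at all. The names `text` and `char_tokens` are illustrative only and are not used elsewhere in this notebook.

```python
# Character tokenization: every character (including the space) becomes a token.
text = "Deep learning"
char_tokens = list(text)

print(char_tokens[:4])   # ['D', 'e', 'e', 'p']
print(len(char_tokens))  # 13
```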

In [6]:
## Tokenization

# Split the paragraph into sentences. sent_tokenize uses NLTK's pretrained Punkt
# model to detect sentence boundaries; it does not simply split on every full stop.
sentences = nltk.sent_tokenize(paragraph)

sentences


['Deep learning is the subset of machine learning methods based on neural networks with representation learning.',
 'The adjective "deep" refers to the use of multiple layers in the network.',
 'Methods used can be either supervised, semi-supervised or unsupervised.',
 '[2]\nDeep-learning architectures such as deep neural networks, deep belief networks, recurrent neural networks, convolutional neural networks and transformers have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.',
 '[3][4][5]\nEarly forms of neural networks were inspired by information processing and distributed communication nodes in biological systems, in particular the human brain.',
 'However, current neural networks do not intend t

In [8]:
type(sentences), type(paragraph)

(list, str)

In [12]:
# Word tokenization can be applied directly to the paragraph string.

nltk.word_tokenize(paragraph)[0:10]
# Shows the first 10 tokens. The result is again a list.

['Deep',
 'learning',
 'is',
 'the',
 'subset',
 'of',
 'machine',
 'learning',
 'methods',
 'based']

In [18]:
# Can we do the same with `sentences`?


# nltk.word_tokenize(sentences)  # Running this raises an error

# nltk.word_tokenize expects a single string as input,
# but `sentences` is a list of strings.

# To get around that, tokenize each sentence in a loop.

[nltk.word_tokenize(sent) for sent in sentences][0:2]
# The result is a list of lists: each inner list holds the word tokens
# of one sentence.
# Only the first two inner lists are shown here.

[['Deep',
  'learning',
  'is',
  'the',
  'subset',
  'of',
  'machine',
  'learning',
  'methods',
  'based',
  'on',
  'neural',
  'networks',
  'with',
  'representation',
  'learning',
  '.'],
 ['The',
  'adjective',
  '``',
  'deep',
  "''",
  'refers',
  'to',
  'the',
  'use',
  'of',
  'multiple',
  'layers',
  'in',
  'the',
  'network',
  '.']]

## Stemming

Stemming is the process of reducing words to a base or root form, usually by stripping suffixes with fixed rules. The purpose of stemming is to normalize related surface forms so that they can be analyzed as the same item. For instance, the words "running" and "runs" both reduce to the stem "run". A stem does not have to be a dictionary word ("machine" becomes "machin"), and irregular forms such as "ran" are typically left unchanged by rule-based stemmers.

### Types of Stemmers

1. Porter Stemmer: One of the most commonly used stemming algorithms, developed by Martin Porter in 1980. It applies a series of rules to iteratively trim suffixes from words.

2. Snowball Stemmer: Snowball is a framework for writing stemmers in several languages. Its English stemmer, also known as the Porter2 stemmer, is a revised version of the Porter Stemmer.

3. Lancaster Stemmer: Known for being more aggressive compared to the Porter Stemmer. It can result in very short stems and might be too aggressive for some applications.
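
To compare the three stemmers side by side, a small sketch along these lines can help. It assumes NLTK is installed (the stemmer classes need no extra data download). The word list is an arbitrary sample chosen for illustration, and the outputs are printed rather than stated here because they vary by algorithm.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

# One instance of each stemmer described above.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Arbitrary sample words (an assumption for this sketch).
words = ["running", "networks", "learning", "generously"]

# Stem every word with every stemmer and print one row per stemmer.
stems = {name: [s.stem(w) for w in words] for name, s in stemmers.items()}
for name, result in stems.items():
    print(f"{name:10s} {result}")
```

Rows that differ from each other show where the algorithms disagree; the Lancaster row is usually the place to look for the shortest stems.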



##### Another option: Lemmatization

### What is Lemmatization?
Lemmatization is the process of reducing a word to its base or dictionary form, known as a lemma. Unlike stemming, which simply cuts off word endings in an attempt to achieve a base form, lemmatization considers the context and converts the word to its meaningful base form. Lemmatization is more sophisticated than stemming and usually produces more accurate results. Lemmatization is highly beneficial for tasks requiring a deeper understanding of the text, such as sentiment analysis, text classification, and more.
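
NLTK provides a WordNet-based lemmatizer, but it needs an extra corpus download, so the sketch below uses a tiny hand-written lookup table instead. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names invented for this illustration. A real lemmatizer relies on a full dictionary and morphological rules, not on five entries.

```python
# Toy lemma table keyed by (word, part of speech). Hypothetical data.
TOY_LEMMAS = {
    ("ran", "verb"): "run",
    ("running", "verb"): "run",
    ("better", "adj"): "good",
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form of `word` for the given part of speech.

    Falls back to the lowercased word when the pair is not in the table.
    """
    word = word.lower()
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("ran", "verb"))   # run
print(toy_lemmatize("saw", "verb"))   # see
print(toy_lemmatize("saw", "noun"))   # saw
```

The same surface form ("saw") maps to different lemmas depending on its part of speech, which is the kind of context a suffix-stripping stemmer does not use.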



#### For now, we will use stemming

In [20]:
# Now we will use stemming - We will use Snowball Stemmer

stemmer = SnowballStemmer('english') #Defined the object


In [21]:
p_Stemmer = PorterStemmer() #Porter stemmer object, kept for comparison

In [36]:
## We also need to clean the data; we will use a regex for that

import re

corpus = []
stop_words = set(stopwords.words('english'))  # build the set once, not once per word

# Clean, lowercase, remove stopwords, and stem each sentence
for sentence in sentences:
  review = re.sub('[^a-zA-Z]', ' ', sentence)  # replace every non-letter with a space
  review = review.lower()
  review = review.split()
  review = [stemmer.stem(word) for word in review if word not in stop_words]
  review = ' '.join(review)
  corpus.append(review)

In [37]:
corpus[4]

'earli form neural network inspir inform process distribut communic node biolog system particular human brain'

In [38]:
sentences[4]

'[3][4][5]\nEarly forms of neural networks were inspired by information processing and distributed communication nodes in biological systems, in particular the human brain.'
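
To see what the regex step contributes on its own, here is a minimal sketch that applies only the cleaning, lowercasing, and splitting to the start of the sentence above. Stopword removal and stemming are left out, and the names `raw`, `letters_only`, and `tokens` are illustrative.

```python
import re

raw = "[3][4][5]\nEarly forms of neural networks were inspired"

# Every character that is not a letter (digits, brackets, newline) becomes a space.
letters_only = re.sub("[^a-zA-Z]", " ", raw)

# split() with no argument collapses runs of whitespace, so no empty tokens remain.
tokens = letters_only.lower().split()

print(tokens)
# ['early', 'forms', 'of', 'neural', 'networks', 'were', 'inspired']
```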

## Bag of Words

The Bag of Words (BoW) model is a fundamental technique in natural language processing (NLP) for representing text data. It simplifies the representation of text by treating it as a collection of words (or tokens), disregarding grammar and word order but maintaining the multiplicity of occurrences. Each unique word in the text is treated as a feature.

### How Bag of Words Works

1. **Text Preprocessing:**
   - Tokenize the text into words.
   - Convert all words to lowercase (or uppercase) to ensure uniformity.
   - Remove punctuation, special characters, and stopwords (common words like "the," "is," etc. that do not carry significant meaning).

2. **Vocabulary Creation:**
   - Create a vocabulary of all unique words present in the text.

3. **Vector Representation:**
   - For each document or sentence, create a vector of length equal to the vocabulary size.
   - Each element of the vector represents the count of a word's occurrence in the document.

### Example

Consider the following two sentences:
1. "I love NLP."
2. "NLP is fun."

#### Step-by-Step Process:

1. **Tokenization and Preprocessing** (stopwords are kept here so that the example stays small):
   - Sentence 1: ['i', 'love', 'nlp']
   - Sentence 2: ['nlp', 'is', 'fun']

2. **Vocabulary Creation:**
   - Vocabulary: ['i', 'love', 'nlp', 'is', 'fun']

3. **Vector Representation:**
   - Sentence 1: [1, 1, 1, 0, 0]
   - Sentence 2: [0, 0, 1, 1, 1]
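
The worked example above can be reproduced with a few lines of plain Python. This is a minimal sketch, not how any particular library implements it; `toy_sentences`, `vocabulary`, and `vectors` are names chosen for illustration.

```python
# Pre-tokenized sentences from the worked example.
toy_sentences = [["i", "love", "nlp"], ["nlp", "is", "fun"]]

# Build the vocabulary in first-seen order, matching the example.
vocabulary = []
for tokens in toy_sentences:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One count per vocabulary word for each sentence.
vectors = [[tokens.count(word) for word in vocabulary] for tokens in toy_sentences]

print(vocabulary)  # ['i', 'love', 'nlp', 'is', 'fun']
print(vectors)     # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```

scikit-learn's `CountVectorizer` will not give exactly these columns: it sorts the vocabulary alphabetically and, with its default token pattern, ignores single-character tokens such as "i".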


### Considerations

- **Simplicity:** The BoW model is simple and easy to understand but does not capture the semantics (meaning) of the words or the order of words in the text.
- **High Dimensionality:** For large text corpora, the vocabulary can become very large, leading to high-dimensional feature vectors.
- **Sparsity:** Most vectors will be sparse (i.e., containing many zeros) because most documents contain only a small subset of the total vocabulary.
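
The loss of word order is easy to demonstrate. In this small sketch (the two example sentences are made up for illustration), two sentences with opposite meanings produce identical bags of words.

```python
from collections import Counter

# Same words, different order, different meaning.
bow_a = Counter("dog bites man".split())
bow_b = Counter("man bites dog".split())

print(bow_a == bow_b)  # True
```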

### Applications

- **Text Classification:** BoW is commonly used in text classification tasks where the goal is to classify documents into predefined categories.
- **Information Retrieval:** Used in search engines to index and retrieve documents based on keyword queries.
- **Document Clustering:** Helps in clustering similar documents together.

The Bag of Words model is a foundational concept in NLP and serves as a stepping stone to more advanced text representation techniques like TF-IDF (Term Frequency-Inverse Document Frequency) and word embeddings (e.g., Word2Vec, GloVe).


In [41]:
from sklearn.feature_extraction.text import CountVectorizer

In [42]:
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(corpus) #X is a sparse matrix

In [43]:
#Check the column index assigned to each word in the vocabulary
vectorizer.vocabulary_

{'deep': 17,
 'learn': 39,
 'subset': 68,
 'machin': 41,
 'method': 44,
 'base': 4,
 'neural': 49,
 'network': 48,
 'represent': 62,
 'adject': 0,
 'refer': 61,
 'use': 75,
 'multipl': 46,
 'layer': 38,
 'either': 22,
 'supervis': 69,
 'semi': 66,
 'unsupervis': 74,
 'architectur': 3,
 'belief': 5,
 'recurr': 60,
 'convolut': 15,
 'transform': 72,
 'appli': 2,
 'field': 24,
 'includ': 32,
 'comput': 14,
 'vision': 76,
 'speech': 67,
 'recognit': 59,
 'natur': 47,
 'languag': 37,
 'process': 54,
 'translat': 73,
 'bioinformat': 6,
 'drug': 20,
 'design': 18,
 'medic': 43,
 'imag': 31,
 'analysi': 1,
 'climat': 11,
 'scienc': 64,
 'materi': 42,
 'inspect': 34,
 'board': 8,
 'game': 27,
 'program': 56,
 'produc': 55,
 'result': 63,
 'compar': 13,
 'case': 10,
 'surpass': 70,
 'human': 30,
 'expert': 23,
 'perform': 53,
 'earli': 21,
 'form': 25,
 'inspir': 35,
 'inform': 33,
 'distribut': 19,
 'communic': 12,
 'node': 50,
 'biolog': 7,
 'system': 71,
 'particular': 52,
 'brain': 9,
 'howe

In [44]:
corpus[0]

'deep learn subset machin learn method base neural network represent learn'

In [46]:
#The index of 'deep' is 17
#Let us see how the first document in the corpus (index 0) is represented as a bag of words

X[0].toarray()
#There is a 1 at index 17 because 'deep' occurs once in corpus[0],
#and a 3 at index 39 because 'learn' occurs three times.

array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 1, 0, 0,
        1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]])