###### Word embedding is a technique for representing text as numeric vectors.

The more popular forms of word embeddings are:

1- BoW, which stands for Bag of Words
2- TF-IDF, which stands for Term Frequency-Inverse Document Frequency




#### Basic feature extraction using text data
Number of words

Number of characters

Average word length

Number of stopwords

Number of special characters

Number of numerics

Number of uppercase words
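The features listed above can be computed with plain Python. The following is a minimal sketch, not a reference implementation: the function name `basic_text_features`, the tiny `STOPWORDS` set, and the sample sentence are all assumptions made for illustration (a real project would usually take its stop-word list from a library).

```python
import string

# Assumed stop-word list, for illustration only.
STOPWORDS = {"this", "is", "and", "the", "a", "of", "to", "in"}

def basic_text_features(text):
    """Return simple count-based features for one piece of text."""
    words = text.split()
    return {
        "num_words": len(words),
        "num_chars": len(text),  # spaces are counted too
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "num_stopwords": sum(w.lower() in STOPWORDS for w in words),
        "num_special_chars": sum(ch in string.punctuation for ch in text),
        "num_numerics": sum(w.isdigit() for w in words),
        "num_uppercase_words": sum(w.isupper() for w in words),
    }

features = basic_text_features("This movie is NOT scary! I give it 7 out of 10")
print(features)
```

Words are split on whitespace only, so a token such as "scary!" keeps its punctuation; that is why punctuation is counted per character rather than per word.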

#### Basic pre-processing of text data
Lower casing

Punctuation removal

Stopwords removal

Frequent words removal

Rare words removal

Spelling correction

Tokenization

Stemming

Lemmatization
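A few of the pre-processing steps above (lower casing, punctuation removal, tokenization, stopword removal) can be sketched with the standard library alone. The `preprocess` function and the small `STOPWORDS` set are assumptions for illustration; spelling correction, stemming, and lemmatization need language resources from a library and are not shown.

```python
import string

# Assumed stop-word list, for illustration only.
STOPWORDS = {"this", "is", "and", "the", "a", "of", "to", "in"}

def preprocess(text):
    """Lower-case, strip punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]

tokens = preprocess("This movie is very scary, and long!")
print(tokens)  # ['movie', 'very', 'scary', 'long']
```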

#### Advanced Text Processing

N-grams

Term Frequency

Inverse Document Frequency

Term Frequency-Inverse Document Frequency (TF-IDF)

Bag of Words

Sentiment Analysis

Word Embedding

Refer to this link:
https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/
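N-grams, term frequency, and inverse document frequency from the list above can be sketched in a few lines. This is one common variant only: term frequency is taken as count divided by document length, and IDF as log(N / df) without smoothing. Libraries often use different smoothing and normalization, so their numbers will not match these exactly. All function names and the three sample documents are assumptions for illustration.

```python
import math
from collections import Counter

docs = [
    "this movie is very scary and long".split(),
    "this movie is not scary and is slow".split(),
    "this movie is spooky and good".split(),
]

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def term_frequency(term, doc):
    """Share of the document's tokens that are `term`."""
    return Counter(doc)[term] / len(doc)

def inverse_document_frequency(term, docs):
    """log(N / df); assumes the term occurs in at least one document."""
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    """Product of the two quantities above."""
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

print(ngrams(docs[0], 2)[:3])         # [('this', 'movie'), ('movie', 'is'), ('is', 'very')]
print(term_frequency("is", docs[1]))  # 0.25
print(tf_idf("this", docs[0], docs))  # 0.0, because "this" occurs in every document
print(tf_idf("scary", docs[0], docs))
```

A word that appears in every document gets an IDF of zero under this variant, which is the intended effect: it carries no information for telling documents apart.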

# Bag of Words

It follows the steps below:

<img src="https://miro.medium.com/max/700/1*ZJykVXi_OQRULpK6PlGNPQ.png" width="550px" />

The Bag of Words (BoW) model is the simplest form of text representation in numbers.

We can represent a sentence as a bag-of-words vector (a list of numbers).

##### Eg

Three movie reviews:

###### Review 1: This movie is very scary and long

###### Review 2: This movie is not scary and is slow

###### Review 3: This movie is spooky and good

=====================================
###### We will first build a vocabulary from all the unique words in the above three reviews. 
###### The vocabulary consists of these 11 words: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’,  ‘slow’, ‘spooky’,  ‘good’.

### Note:

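##### We have three sentences above and have collected their unique words into a vocabulary. Next, we build a table with one row per review and one column per vocabulary word, where each cell counts how often that word occurs in the review.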

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/BoWBag-of-Words-model-2.png" width="550px" />

=================================================

Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]

Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]

Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]
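The counting can be checked with a short sketch. It assumes whitespace tokenization and a vocabulary ordered by first appearance, which matches the word order listed earlier; the variable names are illustrative.

```python
reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

# Vocabulary: unique words in order of first appearance.
vocabulary = []
for review in reviews:
    for word in review.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One count per vocabulary word for each review.
vectors = [[review.split().count(word) for word in vocabulary] for review in reviews]

print(vocabulary)
for vector in vectors:
    print(vector)
# ['This', 'movie', 'is', 'very', 'scary', 'and', 'long', 'not', 'slow', 'spooky', 'good']
# [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# [1, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0]
# [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
```

Each position follows the vocabulary order, so the 2 in the second vector is the count of "is", which occurs twice in Review 2.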

### Drawbacks of using a Bag-of-Words (BoW) Model

1- It assumes all words are independent of each other.

2- It leads to highly sparse vectors.

3- It leads to a high-dimensional feature vector because the vocabulary is large.

4- Semantic meaning: the basic BoW approach does not consider the meaning of a word in the document. It completely ignores the context in which the word is used, even though the same word can mean different things depending on the context or nearby words.

### Advantages:

1- Very simple to understand and implement.


=======================================================================

##### NOTE: A more advanced way of representing text data is with embeddings, also called word vectors.

##### Example:

Suppose the vocabulary contains the words: { and, cat, dog, jumped, sat, over, ran, the }. You have the following sentence: “The fox jumped over the dog and the dog ran”. The bag-of-words representation for this toy example is [1 0 2 1 0 1 1 3]. The word fox is not in the vocabulary, so it is ignored.

##### Nonzero values:

The word and occurs only once in the sentence, so the feature and has a value of 1. The word dog occurs twice, so the feature dog has a value of 2.

##### Zero values:

The word cat does not occur in the sentence, so the feature cat is 0. In a real dataset, the vocabulary can contain 50K to 100K words, which leads to extremely high-dimensional sparse vectors. Dimensionality-reduction techniques are typically used to handle bag-of-words vectors.
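The toy example can be reproduced with a fixed vocabulary. This sketch assumes lower-casing and whitespace tokenization; the dictionary at the end is only one simple way to store the non-zero entries of a sparse vector, chosen here for illustration.

```python
vocabulary = ["and", "cat", "dog", "jumped", "sat", "over", "ran", "the"]
sentence = "The fox jumped over the dog and the dog ran"

tokens = sentence.lower().split()
vector = [tokens.count(word) for word in vocabulary]
print(vector)  # [1, 0, 2, 1, 0, 1, 1, 3]

# Words outside the vocabulary are simply dropped.
ignored = [tok for tok in tokens if tok not in vocabulary]
print(ignored)  # ['fox']

# Sparse form: keep only the non-zero entries as {index: count}.
sparse = {i: count for i, count in enumerate(vector) if count}
print(sparse)  # {0: 1, 2: 2, 3: 1, 5: 1, 6: 1, 7: 3}
```

With a vocabulary of tens of thousands of words, almost every entry of the dense vector would be zero, which is why sparse storage is the usual choice.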

##### Where does it fail ?

It fails when we want to capture more context (which word appeared after which other word) and not just co-occurrence in the same document. A bag of bigrams is sometimes used to capture some of that context, though it is much more expensive because the vocabulary grows far larger.
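A small sketch makes the limitation concrete: two sentences with opposite meanings share the same bag of words, while their bags of bigrams differ. The helper `bag_of_ngrams` and the two sentences are assumptions for illustration.

```python
from collections import Counter

def bag_of_ngrams(sentence, n):
    """Count the n-grams (runs of n consecutive words) in a sentence."""
    tokens = sentence.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

a = "the dog chased the cat"
b = "the cat chased the dog"

same_unigrams = bag_of_ngrams(a, 1) == bag_of_ngrams(b, 1)
same_bigrams = bag_of_ngrams(a, 2) == bag_of_ngrams(b, 2)

print(same_unigrams)  # True: same words, same counts
print(same_bigrams)   # False: the word order differs
```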

==========================================

#### Popular and simple methods of feature extraction from text data currently in use are:
    
1- Bag-of-Words

2- TF-IDF

3- Word2Vec

In [None]:
## references
https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
https://medium.com/greyatom/an-introduction-to-bag-of-words-in-nlp-ac967d43b428
https://www.machinelearningaptitude.com/topics/natural-language-processing/what-are-some-advantages-and-disadvantages-using-bag-of-words-where-would-you-use-it-and-where-would-you-not/#:~:text=Disadvantages%3A,are%20independent%20of%20each%20other.

### Implementing Bag of Words with NLTK

Step #1 : We will first preprocess the data, in order to:

1- Convert text to lower case.

2- Remove all non-word characters.

3- Remove all punctuation.


In [6]:
text="Beans. I was trying to explain to somebody as we were flying in, that’s corn. That’s beans. And they were very impressed at my agricultural knowledge. Please give it up for Amaury once again for that outstanding introduction. I have a bunch of good friends here today, including somebody who I served with, who is one of the finest senators in the country, and we’re lucky to have him, your Senator, Dick Durbin is here. I also noticed, by the way, former Governor Edgar here, who I haven’t seen in a long time, and somehow he has not aged and I have. And it’s great to see you, Governor. I want to thank President Killeen and everybody at the U of I System for making it possible for me to be here today. And I am deeply honored at the Paul Douglas Award that is being given to me. "

In [None]:
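# Python 3 code for preprocessing text
import re

import nltk
import numpy as np

# sent_tokenize needs NLTK's Punkt data. If it is missing, run
# nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab').
# `text` is the string defined in the previous cell.
dataset = nltk.sent_tokenize(text)
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()               # lower case
    dataset[i] = re.sub(r'\W', ' ', dataset[i])   # non-word characters (incl. punctuation) -> space
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])  # collapse runs of whitespace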

Step #2 : Counting how often each word occurs in our text.

We will apply the following steps to generate our model.

1- We declare a dictionary to hold our bag of words.

2- Next we tokenize each sentence into words.

3- For each word in a sentence, we check whether the word exists in our dictionary.

4- If it does, we increment its count by 1. If it doesn’t, we add it to our dictionary and set its count to 1.

In [None]:
# Creating the Bag of Words model
word2count = {}
for data in dataset:
    words = nltk.word_tokenize(data)
    for word in words:
        # Start at 0 for an unseen word, then add 1 for this occurrence.
        word2count[word] = word2count.get(word, 0) + 1

Step #3 : Building the Bag of Words model

The count reported for this model is 118 words in total; len(word2count) gives the exact figure for the text you run. When processing large texts, however, the number of distinct words can grow very large, and we do not need to use all of them. Hence, we select a fixed number of the most frequently used words. To implement this we use:

In [None]:

import heapq

# Keep the 100 words with the highest counts (or all of them if there are fewer).
freq_words = heapq.nlargest(100, word2count, key=word2count.get)

Next we construct one vector per sentence. Each position in the vector corresponds to one frequent word: it is 1 if that word occurs in the sentence and 0 otherwise.
This can be implemented with the following code:

In [None]:

X = []
for data in dataset:
    tokens = nltk.word_tokenize(data)  # tokenize each sentence once
    vector = []
    for word in freq_words:
        if word in tokens:
            vector.append(1)
        else:
            vector.append(0)
    X.append(vector)
X = np.asarray(X)  # shape: (number of sentences, number of frequent words)

### Another way

CountVectorizer from scikit-learn builds the vocabulary and the count vectors in a few calls.

In [None]:

from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())
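To see roughly what happens inside, here is a pure-Python sketch of CountVectorizer's default behaviour as documented: lower-casing, tokens of two or more word characters, and column indices assigned in alphabetical order. It is an approximation for illustration, not scikit-learn's implementation, and the helper names `tokenize`, `fit_vocabulary`, and `transform` are assumptions. It also returns plain lists rather than a sparse matrix.

```python
import re
from collections import Counter

# Default token pattern documented for CountVectorizer: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    """Lower-case the document and extract its tokens."""
    return TOKEN_PATTERN.findall(doc.lower())

def fit_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}

def transform(docs, vocabulary):
    """Return one list of counts per document, ordered by column index."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows

text = ["The quick brown fox jumped over the lazy dog."]
vocabulary = fit_vocabulary(text)
encoded = transform(text, vocabulary)
print(vocabulary)
# {'brown': 0, 'dog': 1, 'fox': 2, 'jumped': 3, 'lazy': 4, 'over': 5, 'quick': 6, 'the': 7}
print(encoded)
# [[1, 1, 1, 1, 1, 1, 1, 2]]
```

One consequence of the two-character minimum is that single-letter words such as "I" or "a" never enter the vocabulary under the default settings.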

### Practical stuff
https://www.geeksforgeeks.org/bag-of-words-bow-model-in-nlp/

https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/#:~:text=A%20simple%20and%20effective%20model,each%20word%20a%20unique%20number.