# BAG OF WORDS


Bag of words (BoW) is a way of extracting features from text for use in modeling, for example with machine learning algorithms.
A bag-of-words representation describes the occurrence of words within a document. It involves two things:

a) A vocabulary of known words.
b) A measure of the presence of known words.
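
The two ingredients above can be sketched in plain Python. This is only a minimal illustration and not the notebook's own method: the toy sentences and the helper names `build_vocabulary` and `vectorize` are assumptions made up for this example, and tokenization is a plain whitespace split.

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect the unique lowercase words across all documents, sorted."""
    words = set()
    for doc in documents:
        words.update(doc.lower().split())
    return sorted(words)

def vectorize(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Toy documents (hypothetical, chosen only for illustration)
docs = ["the atom is small", "the nucleus is in the atom"]

vocab = build_vocabulary(docs)                      # a) the vocabulary
vectors = [vectorize(doc, vocab) for doc in docs]   # b) the presence measure

print(vocab)
print(vectors)
```

Each vector has one entry per vocabulary word, so every document is mapped to a vector of the same length regardless of how many words it contains.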



Drawbacks of using a Bag-of-Words (BoW) Model:



1. If new sentences contain new words, the vocabulary grows, and the length of every vector grows with it.
2. The vectors also contain many 0s, resulting in a sparse matrix (which is what we would like to avoid).
3. No information is retained about the grammar of the sentences or the ordering of the words in the text.
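
The three drawbacks can be seen on a toy example. The sketch below uses only the standard library; the sentences and the helper `bow_vector` are hypothetical and exist only for this illustration.

```python
from collections import Counter

def bow_vector(sentence, vocabulary):
    """Return the word counts of a sentence in vocabulary order."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

# Two sentences with the same words in a different order
a = "the nucleus attracts the electron"
b = "the electron attracts the nucleus"
vocab = sorted(set(a.split()) | set(b.split()))

# Drawback 3: word order is lost, so both sentences get the same vector
vec_a = bow_vector(a, vocab)
vec_b = bow_vector(b, vocab)
print(vec_a == vec_b)

# Drawback 1: a new sentence with unseen words enlarges the vocabulary
c = "the neutron has no charge"
bigger_vocab = sorted(set(vocab) | set(c.split()))
print(len(vocab), len(bigger_vocab))

# Drawback 2: against the larger vocabulary, the old sentence is mostly zeros
vec_a_big = bow_vector(a, bigger_vocab)
zero_fraction = vec_a_big.count(0) / len(vec_a_big)
print(zero_fraction)
```

With a realistic vocabulary of thousands of words, the fraction of zeros in each vector is far higher than in this toy case.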

In [9]:
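import nltk
# one-time setup if the resources are missing: nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')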
from nltk.tokenize import sent_tokenize as st, word_tokenize as wt
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer as ps, WordNetLemmatizer as wl

In [10]:
para='''An atom is the smallest unit of ordinary matter that forms a chemical element.
		Every solid, liquid, gas, and plasma is composed of neutral or ionized atoms.
		Atoms are extremely small, typically around 100 picometers across. 
		They are so small that accurately predicting their behavior using classical physics—as 
		if they were tennis balls, for example—is not possible due to quantum effects.
		Every atom is composed of a nucleus and one or more electrons bound to the nucleus. 
		The nucleus is made of one or more protons and a number of neutrons. 
		Only the most common variety of hydrogen has no neutrons. 
		More than 99.94% of an atom's mass is in the nucleus. 
		The protons have a positive electric charge, the electrons have a negative electric charge, 
		and the neutrons have no electric charge. If the number of protons and electrons are equal, 
		then the atom is electrically neutral. If an atom has more or fewer electrons than protons, 
		then it has an overall negative or positive charge, respectively – such atoms are called ions.
		The electrons of an atom are attracted to the protons in an atomic nucleus by the electromagnetic force. 
		The protons and neutrons in the nucleus are attracted to each other by the nuclear force. 
		This force is usually stronger than the electromagnetic force that repels the positively 
		charged protons from one another. Under certain circumstances, the repelling electromagnetic 
		force becomes stronger than the nuclear force. In this case, the nucleus splits and leaves 
		behind different elements. This is a form of nuclear decay.'''

In [11]:
import re

ps=ps() #create the Porter stemmer (this rebinds the name ps, so the cell cannot be re-run without re-importing)
wl=wl() #create the WordNet lemmatizer (same caveat for wl)
sentences=st(para) #split the paragraph into sentences
corpus=[]

In [12]:
stop_words=set(stopwords.words('english')) #build the stopword set once instead of once per word
for sentence in sentences:
    rev=re.sub('[^a-zA-Z]',' ',sentence) #replace everything except letters with a space
    rev=rev.lower() #lowercase the sentence
    rev=rev.split() #split into a list of words
    rev=[wl.lemmatize(word) for word in rev if word not in stop_words] #drop stopwords, lemmatize the rest
    rev=' '.join(rev) #join the words back into one string
    corpus.append(rev) #append the cleaned sentence to the corpus

In [13]:
#bag of words
from sklearn.feature_extraction.text import CountVectorizer #importing CountVectorizer
cv=CountVectorizer()
x=cv.fit_transform(corpus).toarray() #learn the vocabulary, count the words, and convert the sparse result to a dense array

In [14]:
x

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

# TF-IDF 

TF-IDF, which stands for term frequency-inverse document frequency, is a scoring measure widely used in information retrieval (IR) and summarization. TF-IDF is intended to reflect how relevant a term is to a given document.

>The intuition behind it is that if a word occurs multiple times in a document, we should boost its relevance, since it is likely to be more meaningful than words that appear fewer times (TF).

TF = (number of times the word occurs in a sentence) / (number of words in the sentence)

>At the same time, if a word occurs many times in a document but also across many other documents, it may simply be a frequent word rather than a relevant or meaningful one (IDF).

IDF = log((number of sentences) / (number of sentences containing the word))

Here each sentence plays the role of a document.

TF-IDF scores are word-frequency scores that try to highlight the more interesting words, e.g. those that are frequent in a document but not across documents.
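
To make the TF and IDF formulas concrete, here is a minimal sketch that applies them literally to a toy corpus. The documents and the function names `term_frequency`, `inverse_document_frequency`, and `tf_idf` are assumptions for illustration. It also assumes the term appears in at least one document, otherwise the IDF division would fail. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from this textbook version.

```python
import math

def term_frequency(term, tokens):
    """TF: occurrences of the term divided by the number of words."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, tokenized_docs):
    """IDF: log of (number of documents / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(len(tokenized_docs) / containing)

def tf_idf(term, tokens, tokenized_docs):
    """TF-IDF score of a term within one document of the corpus."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, tokenized_docs)

# Toy corpus (hypothetical): "atom" occurs in every document, "charge" in one
docs = ["atom nucleus electron", "atom proton neutron", "atom atom charge"]
tokenized = [doc.split() for doc in docs]

score_atom = tf_idf("atom", tokenized[2], tokenized)
score_charge = tf_idf("charge", tokenized[2], tokenized)
print(score_atom, score_charge)
```

Although "atom" is the most frequent word in the third document, it occurs in every document, so its IDF is log(1) = 0 and its score vanishes. The rarer word "charge" receives the higher score.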

In [15]:
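#creating TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer as tfidf
cv=tfidf() #create the vectorizer (this rebinds cv from the bag-of-words cell)
x=cv.fit_transform(corpus).toarray() #learn the vocabulary and IDF weights, then convert the sparse TF-IDF matrix to a dense array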

In [16]:
x

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.40679695, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])