- https://stackabuse.com/python-for-nlp-creating-bag-of-words-model-from-scratch/

# Bag of Words Model in Python

In [1]:
import nltk
import numpy as np
import random
import string

import bs4 as bs  # BeautifulSoup, used for HTML parsing
import urllib.request
import re

In [2]:
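# Download the Wikipedia article on natural language processing
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
raw_html = raw_html.read()

# Parse the HTML (the 'lxml' parser must be installed) and collect every <p> element
article_html = bs.BeautifulSoup(raw_html, 'lxml')
article_paragraphs = article_html.find_all('p')

# Concatenate the paragraph texts into a single string
article_text = ''
for para in article_paragraphs:
    article_text += para.text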

In [5]:
article_text

'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.\nChallenges in natural language processing frequently involve speech recognition, natural language understanding, and natural-language generation.\nNatural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, a task that involves the automated interpretation and generation of natural language, but at the time not articulated as a problem separate from artificial intelligence.\nThe premise of symbolic NLP is well-summarized by John Searle\'s Chinese room experiment: Given a collection of rules (e.g., a Chinese phrasebook, with questions an

In [41]:
# Split the article into sentences (requires NLTK's Punkt tokenizer data to be downloaded)
corpus = nltk.sent_tokenize(article_text)

In [42]:
print(corpus[11])

More recent systems based on machine-learning algorithms have many advantages over hand-produced rules: 
Despite the popularity of machine learning in NLP research, symbolic methods are still (2020) commonly used
Since the so-called "statistical revolution"[14][15] in the late 1980s and mid-1990s, much natural language processing research has relied heavily on machine learning.


In [43]:
for i in range(len(corpus)):
    corpus[i] = corpus[i].lower()
    corpus[i] = re.sub(r'\W', ' ', corpus[i])  # replace non-word characters with spaces
    corpus[i] = re.sub(r'\s\d+\s', ' ', corpus[i])  # remove digits that are not attached to other words
    corpus[i] = re.sub(r'\s+', ' ', corpus[i])  # collapse repeated whitespace into one space
    

In [44]:
print(len(corpus))

43


In [45]:
print(corpus[11])

more recent systems based on machine learning algorithms have many advantages over hand produced rules despite the popularity of machine learning in nlp research symbolic methods are still commonly used since the so called statistical revolution in the late 1980s and mid 1990s much natural language processing research has relied heavily on machine learning 
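
The cleaning steps above can also be tried on a single string. The following is a minimal sketch: the helper name `clean_sentence` and the sample sentence are made up for illustration and are not part of the original notebook.

```python
import re

def clean_sentence(text):
    """Apply the same four cleaning steps used in the loop above (hypothetical helper)."""
    text = text.lower()
    text = re.sub(r'\W', ' ', text)       # replace non-word characters with spaces
    text = re.sub(r'\s\d+\s', ' ', text)  # drop numbers surrounded by whitespace
    text = re.sub(r'\s+', ' ', text)      # collapse repeated whitespace
    return text

# Made-up sample with quotes, citation markers, and a number attached to a letter
sample = 'The "statistical revolution"[14][15] began in the late 1980s.'
cleaned = clean_sentence(sample)
print(repr(cleaned))
```

The citation numbers disappear because the brackets around them become whitespace, while `1980s` survives because its digits are attached to a letter. The trailing space left by the final period matches the trailing space visible in the notebook output.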


In [46]:
# Count how often each token occurs across the whole corpus
wordfreq = {}
for sentence in corpus:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
        wordfreq[token] = wordfreq.get(token, 0) + 1

In [56]:
len(wordfreq)

491
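
The same frequency table could also be built with `collections.Counter` from the standard library. This is only a sketch under assumptions: the two-sentence `toy_corpus` is invented, and `str.split` stands in for `nltk.word_tokenize` so the example runs without NLTK data.

```python
from collections import Counter

# Invented, already-cleaned toy corpus (not the Wikipedia text)
toy_corpus = [
    'natural language processing is fun',
    'language models process natural language',
]

toy_wordfreq = Counter()
for sentence in toy_corpus:
    # Whitespace splitting is a simplification of nltk.word_tokenize
    toy_wordfreq.update(sentence.split())

print(toy_wordfreq['language'])     # count for a single token
print(toy_wordfreq.most_common(2))  # the two most frequent tokens with their counts
```

A `Counter` behaves like the plain dictionary used in the notebook, so `len(toy_wordfreq)` still gives the vocabulary size.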

In [49]:
import heapq
# Keep the 200 most frequent tokens as the vocabulary
most_freq = heapq.nlargest(200, wordfreq, key=wordfreq.get)
most_freq

['the',
 'of',
 'a',
 'to',
 'in',
 'language',
 'is',
 'and',
 'natural',
 'processing',
 'as',
 'that',
 'machine',
 'learning',
 'for',
 'nlp',
 'such',
 'statistical',
 'on',
 'rules',
 'tasks',
 'based',
 'algorithms',
 'are',
 'cognitive',
 'linguistics',
 'with',
 'which',
 'or',
 'many',
 'used',
 'models',
 'has',
 'by',
 'systems',
 'neural',
 'methods',
 'research',
 'input',
 'word',
 'big',
 'data',
 'e',
 'this',
 'more',
 'have',
 'real',
 'when',
 'some',
 'intelligence',
 'large',
 'speech',
 'understanding',
 'an',
 'task',
 'from',
 'g',
 'hand',
 'however',
 'results',
 'they',
 'be',
 'intent',
 'author',
 'being',
 'computer',
 'science',
 'symbolic',
 's',
 'given',
 'other',
 '1980s',
 'was',
 'deep',
 'part',
 'can',
 'set',
 'commonly',
 'since',
 'inference',
 'through',
 'world',
 'examples',
 'these',
 'probabilistic',
 'relative',
 'larger',
 'translation',
 'sequence',
 'idea',
 'presented',
 'meaning',
 'text',
 'artificial',
 'computers',
 'human',
 'ho

In [50]:
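# Build one binary vector per sentence: 1 if the vocabulary word occurs in it, else 0
sentence_vectors = []
for sentence in corpus:
    sentence_tokens = set(nltk.word_tokenize(sentence))  # set gives fast membership tests
    sent_vec = []
    for token in most_freq:
        if token in sentence_tokens:
            sent_vec.append(1)
        else:
            sent_vec.append(0)
    sentence_vectors.append(sent_vec)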

In [52]:
len(sentence_vectors)

43

In [54]:
sentence_vectors = np.asarray(sentence_vectors)
sentence_vectors

array([[1, 1, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])

In [57]:
sentence_vectors.shape

(43, 200)
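
The vectors above record only whether a vocabulary word is present in a sentence. A bag-of-words model can also store how many times each word occurs. The sketch below shows that count-based variant on invented data: `toy_vocab`, `toy_corpus`, and the helper `count_vector` are hypothetical, and `str.split` replaces `nltk.word_tokenize` to keep the example self-contained.

```python
# Invented vocabulary and corpus for illustration only
toy_vocab = ['language', 'natural', 'models']
toy_corpus = [
    'natural language processing',
    'language models process natural language',
]

def count_vector(sentence, vocab):
    """Return the number of occurrences of each vocabulary word in the sentence."""
    tokens = sentence.split()  # simplification of nltk.word_tokenize
    return [tokens.count(word) for word in vocab]

count_vectors = [count_vector(s, toy_vocab) for s in toy_corpus]

# Thresholding the counts recovers the binary presence vectors used in the notebook
binary_vectors = [[1 if c > 0 else 0 for c in row] for row in count_vectors]

print(count_vectors)
print(binary_vectors)
```

Either list of lists can be passed to `np.asarray` in the same way as `sentence_vectors`.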

In [58]:
import pandas as pd
pd.DataFrame(sentence_vectors, columns=most_freq)

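|   | the | of | a | to | in | language | is | and | natural | processing | ... | amounts | challenges | frequently | involve | roots | 1950s | already | alan | published | article |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |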
