In [236]:
import nltk
import re
import os
import pandas as pd
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\alex-\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

##### 1. Load Input Text

In [239]:
text_file_path = r'\MSAI_532\ResidencyWeekend\text_file.txt'  # raw string, so '\t' is not read as a tab

In [241]:
with open('text_file.txt', 'r') as file:
    text = file.read()
print(text[:100])

KBR said Friday the global economic downturn so far has
had
little effect on its business but warned


In [243]:
len(text)

3052306
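
A minimal sketch of the loading step, assuming the file is UTF-8 encoded (the notebook does not state its encoding). The sample sentence is a stand-in written to a temporary folder so that the snippet runs on its own, and `load_text` is a hypothetical helper, not something defined in this notebook.

```python
import tempfile
from pathlib import Path


def load_text(path, encoding="utf-8"):
    """Read a whole text file into one string (the encoding is an assumption)."""
    return Path(path).read_text(encoding=encoding)


# Stand-in for text_file.txt, created in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "text_file.txt"
    sample_path.write_text("KBR said Friday the global economic downturn", encoding="utf-8")
    sample_text = load_text(sample_path)

print(sample_text[:8])
print(len(sample_text))
```

Building the path with `Path` and the `/` operator avoids backslash escape problems in Windows-style path strings.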

##### 2. Tokenize the text into words

In [246]:
# Approach 1: Using nltk

words = word_tokenize(text)
words_df = pd.DataFrame(words, columns=['Word'])
words_df['Count'] = 1

In [247]:
words_df.groupby(by=['Word'], as_index=False)['Count'].sum().sort_values(by=['Count'], ascending=False).head(10)

      Word  Count
185      ,  32581
34305  the  26511
194      .  25307
14550    a  12750
34565   to  12697
27636   of  12476
15249  and  11759
24312   in  10340
5       ''   6795
14549   ``   6171

NLTK's word_tokenize() is not a good fit for this exercise as used here: it returns punctuation marks such as ',' and '.', as well as the quotation tokens '' and ``, as separate tokens, so they end up being counted as if they were words.

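If the NLTK tokens were kept, one possible way to drop the punctuation-only tokens is to filter them after tokenizing. This is only a sketch: the token list is a hand-written stand-in for word_tokenize() output (so the snippet runs without NLTK data), and `is_word` is a hypothetical helper.

```python
# Hand-written stand-in for word_tokenize(text) output.
tokens = ["``", "We", "'re", "fine", ",", "''", "KBR", "said", "."]


def is_word(token):
    """Keep tokens that contain at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


# Drop punctuation-only tokens and lowercase the rest.
word_tokens = [t.lower() for t in tokens if is_word(t)]
print(word_tokens)  # ['we', "'re", 'fine', 'kbr', 'said']
```

Clitics such as "'re" survive this filter, so whether the result counts as a "word" list depends on the definition the exercise needs.
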
In [249]:
# Approach 2: Using regular expressions

words = re.findall(r'\b\w+\b', text.lower())
words_df = pd.DataFrame(words, columns=['Word'])
words_df['Count'] = 1

Here, we use the regular expression \b\w+\b to find runs of one or more word characters (letters, digits, and underscores) delimited by word boundaries, that is, by non-word characters such as blank spaces and commas, or by the start or end of the text. Any such run is treated as a word.

In addition, we are making sure that all characters in the text are lowercase, so we avoid situations in which we might treat words that are actually the same as different (e.g., "There" vs "there").
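
To make the behavior of the pattern concrete, here is a small sketch on a made-up sentence. The second pattern is a hypothetical variant, not used in this notebook, that keeps internal apostrophes so contractions and possessives stay in one piece.

```python
import re

sample = "There it's; there, KBR's 2009 plan."

# Same pattern as the notebook: runs of letters, digits, or underscores.
basic = re.findall(r"\b\w+\b", sample.lower())

# Hypothetical variant that keeps internal apostrophes together.
with_apostrophes = re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", sample.lower())

print(basic)             # ['there', 'it', 's', 'there', 'kbr', 's', '2009', 'plan']
print(with_apostrophes)  # ['there', "it's", 'there', "kbr's", '2009', 'plan']
```

Lowercasing makes "There" and "there" identical, while the basic pattern splits "it's" into "it" and "s" because the apostrophe is not a word character.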

In [251]:
words_df.groupby(by=['Word'], as_index=False)['Count'].sum().sort_values(by=['Count'], ascending=False).head(10)

       Word  Count
25485   the  29900
944       a  13488
25751    to  12915
17781    of  12586
1826    and  12408
12983    in  11250
25482  that   6219
21999     s   5789
10359   for   5156
17873    on   4298


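The punctuation tokens are now gone. The token "s" most likely appears because the apostrophe is a non-word character, so possessives and contractions such as "it's" are split into "it" and "s".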
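
The frequency table can also be sketched without the helper Count column, using collections.Counter from the standard library. The word list here is a small stand-in for the `words` list produced by the regex step.

```python
from collections import Counter

# Small stand-in for the `words` list produced by the regex step.
sample_words = "the cat and the dog and the bird".split()

word_counts = Counter(sample_words)

# most_common(n) returns (word, count) pairs sorted by descending count.
top_words = word_counts.most_common(2)
print(top_words)  # [('the', 3), ('and', 2)]
```

On the full text, `Counter(words).most_common(10)` should give the same ranking as the groupby above, apart from the order of ties.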

##### 3. Find the Top 10 most common bigrams

In [258]:
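# Shift the word list by one so that each word lines up with its successor.
words_shifted = words[1:]

# The last word has no successor, so drop it to avoid a NaN row.
bigrams = pd.concat([pd.Series(words[:-1]), pd.Series(words_shifted)], axis=1)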

bigrams['Bigram'] = bigrams[0] + ' ' + bigrams[1]
bigrams['Count'] = 1

In [259]:
top_10_bigrams = bigrams.groupby(by=['Bigram'], as_index=False)['Count'].sum().sort_values(by=['Count'], ascending=False).head(10)
top_10_bigrams = top_10_bigrams.reset_index(drop=True)

In [261]:
display(top_10_bigrams)

    Bigram  Count
0   of the   3154
1   in the   2758
2   to the   1196
3   on the   1159
4  for the    942
5  and the    859
6     in a    846
7     it s    776
8   at the    773
9    to be    743


As expected, the top 10 bigrams found in the text are mostly combinations of prepositions and conjunctions with articles.

There are two exceptions. The eighth bigram, "it s", is the contraction "it's" split at the apostrophe by the regular expression. The tenth bigram, "to be", combines "to" with the infinitive of the verb "be", which is not surprising given that "be" is one of the most frequently used verbs in the English language, if not the most.
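
The shift-and-pair idea behind the bigram count can also be sketched in plain Python with zip and collections.Counter. The word list is a small stand-in for the notebook's `words`, and the variable names are illustrative only.

```python
from collections import Counter

sample_words = "to be or not to be".split()

# zip stops at the shorter sequence, so the last word produces no pair.
bigram_counts = Counter(zip(sample_words, sample_words[1:]))

top_bigrams = [(" ".join(pair), count) for pair, count in bigram_counts.most_common(1)]
print(top_bigrams)  # [('to be', 2)]
```

Because the pairs come from one flat word list, this sketch, like the DataFrame version, also counts bigrams that span sentence boundaries.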