### The order words are used in is important

The order in which words are used in text is not random. In English, for example, you can say "the red apple" but not "apple red the". The relationships between words in text are very complex. So complex, in fact, that there's an entire field of linguistics devoted to them: syntax. (You can learn more about syntax [here](https://pdfs.semanticscholar.org/ba33/a9656f43a6b7420a180bdeccd5be98fc05eb.pdf), if you're interested.)

However, there is a relatively simple way of capturing some of these relationships between words, originally proposed by Warren Weaver. You can capture quite a bit of information just by looking at which words tend to show up next to each other most often.


### What are n-grams?

The general idea is that you can look at each pair (or triple, set of four, etc.) of words that occur next to each other. In a sufficiently large corpus, you're likely to see "the red" and "red apple" several times, but less likely to see "apple red" and "red the". This is useful to know if, for example, you're building an automatic speech recognition system and need to decide which of several possible outputs someone is more likely to have said.

These co-occurring words are known as "n-grams", where "n" is the number of words in each sequence. (Unigrams are single words, bigrams are two words, trigrams are three words, 4-grams are four words, 5-grams are five words, etc.)
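
To make the definition concrete, here is a minimal sketch in plain Python (no NLTK needed). The helper name `simple_ngrams` is made up for this illustration; it just slides a window of length `n` over a list of words.

```python
def simple_ngrams(tokens, n):
    """Return a list of all n-grams (as tuples) in a list of tokens."""
    # zip together n staggered views of the token list; zip stops at the
    # shortest view, so only complete n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the red apple fell".split()

unigrams = simple_ngrams(tokens, 1)
bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)

print(bigrams)
```

A text with four words has four unigrams, three bigrams and two trigrams: each step up in "n" loses one n-gram at the end of the text.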


> **Learning goal:** In this tutorial, you'll learn how to find all the n-grams in a corpus & count how often each is used.

In [1]:
# load in all the modules we're going to need
import nltk, re, string, collections
from nltk.util import ngrams # function for making ngrams

# this corpus is pretty big, so let's look at just one of the files in it
with open(r"D:\Anderson\Downloads\wife.txt", "r", encoding='UTF-8') as file:  # raw string so the backslashes aren't read as escapes
    text = file.read()
    

# check to make sure the file read in alright; let's print out the first 2000 characters
text[0:2000]

"Taxila or Takshasila, on the\nbanks of the Vitasata was once\nnuled by King Kalingadutt. He\nwas a follower of Buddhism.\nBut in his kingdom there were\nseveral who advocated the Vedic\nreligion. The king never forced\nhis subjects to adopt the Bud-\ndhist creed. Only when. people\nvoluntarily approached. him he\nused to initiate them,\n\nAmong those who thus adopted\nthe Buddhist creed was one rich\nmerchant called Vitastadutt. But\nhis son, Ratnadutt, was an ardent\nbeliever of the Vedic cult. So, he\nwas always condemning and\ncursing his father.\n\nâ€œYou area sinner. You have\nstrayed away from the virtuous\nVedic path and adopted atheist\ncreeds. Instead of worshipping\nBrahmans you take to beggar-\nWorship. This accursed religion\nis for those who do not bathe\nproperly, who eat any time of the\ny, Who lead a life of ease with-\nout let or hindrance, lounging\nin the viharas along with loafers\nof all castes and communities.\nHow could you take to it?â€\x9d\nRatoadutt would ask

Looking at the text, you can see line breaks and punctuation, and a file like this may also contain XML markup (the stuff in <pointy brackets>) that we don't really want to analyze. Let's get rid of all that.

In [2]:
# let's do some preprocessing. We don't care about XML notation, new lines
# or punctuation marks. Capitalization is left as-is, so "The" and "the"
# are counted as different words.

# get rid of any XML markup (greedy: removes from the first "<" to the last ">" on a line)
text = re.sub(r'<.*>', '', text)

# replace newlines with spaces
text = re.sub(r'\n', ' ', text)

# get rid of the "ENDOFARTICLE." text
text = re.sub(r'ENDOFARTICLE\.', '', text)

# build a character class from string.punctuation with the period taken out.
# Note: the class still contains ",-/", which the regex engine reads as a range
# that includes ".", so periods are stripped too (as the output below shows).
punctuationNoPeriod = "[" + re.sub(r"\.", "", string.punctuation) + "]"
text = re.sub(punctuationNoPeriod, "", text)

# make sure it looks ok
text[0:2000]

'Taxila or Takshasila on the banks of the Vitasata was once nuled by King Kalingadutt He was a follower of Buddhism But in his kingdom there were several who advocated the Vedic religion The king never forced his subjects to adopt the Bud dhist creed Only when people voluntarily approached him he used to initiate them  Among those who thus adopted the Buddhist creed was one rich merchant called Vitastadutt But his son Ratnadutt was an ardent believer of the Vedic cult So he was always condemning and cursing his father  â€œYou area sinner You have strayed away from the virtuous Vedic path and adopted atheist creeds Instead of worshipping Brahmans you take to beggar Worship This accursed religion is for those who do not bathe properly who eat any time of the y Who lead a life of ease with out let or hindrance lounging in the viharas along with loafers of all castes and communities How could you take to itâ€\x9d Ratoadutt would ask his father  And the father would reply with a great pain 
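
Notice that the cleaned text above has no periods left, even though the period was taken out of the punctuation list. That's because the characters `,-/` end up next to each other inside the square brackets, and a regex character class reads that as a range running from `,` to `/`, which happens to include `.`. If you do want to keep periods, one option (a sketch with a made-up sample sentence and variable names, not the code that produced the output above) is to escape the punctuation with `re.escape` first:

```python
import re
import string

# all ASCII punctuation except the period
punctuation_no_period = string.punctuation.replace(".", "")

# unescaped, the class contains ",-/", a range that also matches "."
unescaped_class = "[" + punctuation_no_period + "]"
# re.escape() backslash-escapes characters like "-", "]" and "^",
# so every character in the class is treated literally
escaped_class = "[" + re.escape(punctuation_no_period) + "]"

sample = "Only when people approached him, he used to initiate them. Why?"

stripped_all = re.sub(unescaped_class, "", sample)   # periods are lost
kept_periods = re.sub(escaped_class, "", sample)     # periods survive

# put a space before each period so it becomes its own "word" when splitting
tokens = kept_periods.replace(".", " .").split()
print(tokens)
```

With periods kept as separate tokens, n-grams such as `('them', '.')` tell you which words tend to end a sentence.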

OK, now on to making the n-grams! The first thing we want to do is "tokenize", or break the text into individual words (you can find more information on tokenization [here](https://www.kaggle.com/rtatman/tokenization-tutorial)).

In [3]:
# first get individual words
tokenized = text.split()

# get a list of all the unigrams (single words)
esGrams = ngrams(tokenized,1)

# and get a list of all the bi-grams
esBigrams = ngrams(tokenized, 2)

# get list of all the trigrams
esTrigrams = ngrams(tokenized,3)


# If you like, you can uncomment the next line to take a look at
# the first ten to make sure they look ok. Please note that doing so
# will consume the generator & will break the next block of code, so you'll
# need to re-comment it and run this block again to get it to work.
# list(esBigrams)[:10]
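
The warning about consuming the generator is worth seeing in action. Here's a small sketch using a made-up stand-in function, `bigram_generator`, that behaves like any one-shot iterator: once you've looped over it, it's empty.

```python
def bigram_generator(tokens):
    """Yield bigrams lazily (hypothetical stand-in for a one-shot iterator)."""
    for first, second in zip(tokens, tokens[1:]):
        yield (first, second)

tokens = "the king went to the city".split()

lazy_bigrams = bigram_generator(tokens)
first_pass = list(lazy_bigrams)    # this consumes the generator
second_pass = list(lazy_bigrams)   # nothing left, so this is an empty list

# storing the result as a list once lets you peek at it and still count it later
stored_bigrams = list(bigram_generator(tokens))
preview = stored_bigrams[:3]
print(preview)
```

The same idea should work with NLTK: wrapping the call as `list(ngrams(tokenized, 2))` stores the bigrams in memory, at the cost of holding them all at once for a large corpus.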

Yay, we've got our n-grams! Just a list of all the bigrams in a text isn't that useful, though. Let's collapse it a little bit & get a count of how frequently we see each bigram in our corpus. 

In [4]:
# get the frequency of each bigram in our corpus
esBigramFreq = collections.Counter(esBigrams)

# what are the 30 most common bigrams in this text?
esBigramFreq.most_common(30)

[(('to', 'the'), 20),
 (('the', 'king'), 16),
 (('of', 'the'), 14),
 (('the', 'princess'), 12),
 (('he', 'was'), 10),
 (('on', 'the'), 8),
 (('away', 'from'), 7),
 (('in', 'the'), 7),
 (('The', 'princess'), 7),
 (('The', 'king'), 6),
 (('Pravara', 'was'), 6),
 (('him', 'to'), 6),
 (('of', 'them'), 6),
 (('for', 'the'), 6),
 (('was', 'a'), 5),
 (('his', 'father'), 5),
 (('from', 'the'), 5),
 (('to', 'be'), 5),
 (('at', 'the'), 5),
 (('the', 'city'), 5),
 (('wanted', 'to'), 5),
 (('ministers', 'son'), 5),
 (('that', 'he'), 5),
 (('the', 'ministers'), 5),
 (('one', 'of'), 5),
 (('the', 'old'), 5),
 (('food', 'and'), 5),
 (('to', 'his'), 4),
 (('One', 'day'), 4),
 (('to', 'him'), 4)]
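
Counts like these are what let you compare candidate phrases, as in the speech recognition example from earlier. Below is a toy sketch of that idea. The tiny corpus and the helper names `bigram_counts` and `phrase_score` are invented for illustration, and real systems use smoothed probabilities rather than raw counts.

```python
import collections

def bigram_counts(tokens):
    """Count every pair of adjacent tokens."""
    return collections.Counter(zip(tokens, tokens[1:]))

def phrase_score(phrase, counts):
    """Add up how often each bigram in the phrase was seen (unseen = 0)."""
    words = phrase.split()
    return sum(counts[pair] for pair in zip(words, words[1:]))

corpus = ("the red apple fell . the red apple rolled . "
          "the king ate the red apple").split()
counts = bigram_counts(corpus)

score_good = phrase_score("the red apple", counts)
score_bad = phrase_score("apple red the", counts)
print(score_good, score_bad)
```

The phrase whose bigrams were seen more often gets the higher score, so "the red apple" beats "apple red the".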

In [5]:
esTrigramFreq = collections.Counter(esTrigrams)

esTrigramFreq.most_common(20)

[(('to', 'the', 'king'), 5),
 (('the', 'city', 'wall'), 4),
 (('run', 'away', 'from'), 4),
 (('The', 'ministers', 'son'), 4),
 (('ministers', 'son', 'was'), 4),
 (('pre', 'tended', 'to'), 3),
 (('and', 'lay', 'down'), 3),
 (('away', 'from', 'home'), 3),
 (('a', 'couple', 'of'), 3),
 (('one', 'of', 'them'), 3),
 (('the', 'old', 'lady'), 3),
 (('to', 'the', 'sea'), 3),
 (('his', 'son', 'Ratnadutt'), 2),
 (('you', 'take', 'to'), 2),
 (('went', 'to', 'the'), 2),
 (('came', 'to', 'the'), 2),
 (('the', 'In', 'Kashmir'), 2),
 (('In', 'Kashmir', 'there'), 2),
 (('Kashmir', 'there', 'was'), 2),
 (('there', 'was', 'once'), 2)]

In [6]:
esGramFreq = collections.Counter(esGrams)

esGramFreq.most_common(20)

[(('the',), 168),
 (('to',), 99),
 (('and',), 83),
 (('was',), 64),
 (('a',), 59),
 (('of',), 56),
 (('his',), 34),
 (('him',), 34),
 (('he',), 32),
 (('her',), 30),
 (('The',), 28),
 (('that',), 28),
 (('Pravara',), 28),
 (('in',), 25),
 (('king',), 24),
 (('with',), 22),
 (('princess',), 21),
 (('He',), 19),
 (('for',), 19),
 (('one',), 17)]

These look like reasonable high-frequency words and phrases for an English story: mostly function words such as "the", "to" and "of", plus the story's main characters.
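
One thing to notice in the counts above is that "The" and "the" show up as separate entries, because the text was never lowercased. If you'd rather merge them, a minimal sketch (with a made-up sample sentence) is to lowercase each token before counting:

```python
import collections

tokens = "The king saw the princess . The princess saw the king".split()

# counting the tokens as-is keeps "The" and "the" apart
case_sensitive = collections.Counter(tokens)

# lowercasing each token first merges them into one entry
case_folded = collections.Counter(token.lower() for token in tokens)

print(case_sensitive["The"], case_sensitive["the"], case_folded["the"])
```

Whether to lowercase is a judgment call: it merges sentence-initial words with the rest, but it also loses the distinction between names and ordinary words.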

And that's all there is to it! Here are some exercises to help you try it out yourself.


### Your turn!

1. Find the most frequent n-grams in another file in this corpus. You can find the other files by entering "../input/spanish_corpus/" in a code chunk and then hitting the Tab key. All of the files will be listed in a drop-down menu. Are the most frequent bigrams the same as they were in this file?
2. Instead of bigrams (two word phrases), can you find trigrams (three words)?
3. Find the most frequent n-grams in another corpus. You can find some [here](https://www.kaggle.com/datasets?sortBy=relevance&group=featured&search=corpus) to start you out.