### Word order matters

The order in which words are used in text is not random. In English, for example, you can say "the red apple" but not "apple red the". The relationships between words in text are very complex. So complex, in fact, that there's an entire field of linguistics devoted to them: syntax. (You can learn more about syntax [here](https://pdfs.semanticscholar.org/ba33/a9656f43a6b7420a180bdeccd5be98fc05eb.pdf), if you're interested.)

However, there is a relatively simple way of capturing some of these relationships between words, originally proposed by Warren Weaver. You can capture quite a bit of information just by looking at which words tend to show up next to each other most often.


### What are n-grams?

The general idea is that you can look at each pair (or triple, set of four, etc.) of words that occur next to each other. In a sufficiently large corpus, you're likely to see "the red" and "red apple" several times, but less likely to see "apple red" and "red the". This is useful to know if, for example, you're trying to figure out what someone is more likely to have said, to help decide between possible outputs of an automatic speech recognition system.

These co-occurring words are known as "n-grams", where "n" is a number saying how long a string of words you considered. (Unigrams are single words, bigrams are two words, trigrams are three words, 4-grams are four words, 5-grams are five words, etc.)
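
To make the definition concrete, here is a minimal sketch in plain Python that slides a window of length n over a list of words. The helper name `simple_ngrams` and the example sentence are made up for illustration; the tutorial itself relies on `nltk.util.ngrams`, which yields the same kind of tuples.

```python
def simple_ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    # zip together n copies of the list, each shifted one word further along
    return list(zip(*(words[i:] for i in range(n))))

# illustrative sentence, not taken from the corpus
words = "the red apple is on the table".split()

print(simple_ngrams(words, 1))  # unigrams: one-word tuples
print(simple_ngrams(words, 2))  # bigrams: two-word tuples
print(simple_ngrams(words, 3))  # trigrams: three-word tuples
```

A text with 7 words has 6 bigrams and 5 trigrams, since each window has to fit entirely inside the list.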


> **Learning goal:** In this tutorial, you'll learn how to find all the n-grams in a corpus & count how often each is used.

In [1]:
# load in all the modules we're going to need
import nltk, re, string, collections
from nltk.util import ngrams # function for making ngrams

# this corpus is pretty big, so let's look at just one of the files in it
with open("C:/Users/nosre/Downloads/spanishText_25000_30000", "r",encoding='latin-1') as file:
    text = file.read()
    

# check to make sure the file read in alright; let's print out the first 1000 characters
text[0:1000]

'<doc id="62546" title="Pragmática Sanción de 1713" nonfiltered="1" processed="1" dbindex="25000">\nPragmática Sanción de 1713, norma promulgada por el emperador Carlos VI, de la Casa de Austria, en 1713 que facilitó la futura entronización de su hija María Teresa I. Ésta fue la primera ley fundamental común para todas las zonas de los Habsburgo, se decreto con un intento del emperador por lograr la integracion de los territorios del sacro imperio; sin embargo, el proyecto unificador no pudo alcanzarse del todo, pues Hungría puso como condicon para aceptar la Pragmática Sanción que fuera ratificada su constitucion y autonomia, lo que en realidad fortalecio el separatismo Hungaro.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nENDOFARTICLE.\n</doc>\n<doc id="62547" title="Universidad Complutense de Madrid" nonfiltered="2" processed="2" dbindex="25001">\n\n\nLa Universidad Complutense de Madrid (UCM) es una importante universidad española localizada en la Ciudad Universitaria de Madrid, España.\n\nHistori

Looking at the text, you can see that there's some extra XML markup that we don't really want to analyze (the stuff in <pointy brackets>). Let's get rid of that.

In [2]:
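# let's do some preprocessing. We don't care about the XML notation, newlines
# or punctuation marks, so we'll strip them out. (Capitalization is left as-is
# here; call text.lower() first if you want "La" and "la" counted together.)

# get rid of all the XML markup
text = re.sub(r'<.*>', '', text)

# get rid of newlines. Note: replacing "\n" with "" glues together the words on
# either side of a line break; use " " instead if you want to keep them apart
text = re.sub(r'\n', '', text)

# get rid of the "ENDOFARTICLE." text
text = re.sub(r'ENDOFARTICLE\.', '', text)

# get rid of punctuation. The period is dropped from the list here, but the
# ",-/" left inside the [...] character class is read as a range that still
# includes ".", so periods are removed as well
punctuationNoPeriod = "[" + re.sub(r"\.", "", string.punctuation) + "]"
text = re.sub(punctuationNoPeriod, "", text)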

# make sure it looks ok
text[0:1000]

'Pragmática Sanción de 1713 norma promulgada por el emperador Carlos VI de la Casa de Austria en 1713 que facilitó la futura entronización de su hija María Teresa I Ésta fue la primera ley fundamental común para todas las zonas de los Habsburgo se decreto con un intento del emperador por lograr la integracion de los territorios del sacro imperio sin embargo el proyecto unificador no pudo alcanzarse del todo pues Hungría puso como condicon para aceptar la Pragmática Sanción que fuera ratificada su constitucion y autonomia lo que en realidad fortalecio el separatismo HungaroLa Universidad Complutense de Madrid UCM es una importante universidad española localizada en la Ciudad Universitaria de Madrid EspañaHistoriaLos orígenes de la Universidad Complutense se encuentran en Universidad Cisneriana de Alcalá de Henares Tras languidecer durante el siglo XVIII mediante Real Orden de la Reina Regente de 29 de octubre de 1836 se decretó su supresión en Alcalá y traslado a Madrid donde pasó a den
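
In the output above the periods have disappeared along with the rest of the punctuation, because the `,-/` run inside the character class is read as a range that includes `.`. If you do want to keep periods, one option is to escape the punctuation characters before building the class. This is a sketch on a made-up sentence, not the code that produced the output above; `keep_periods_pattern` and `sample` are illustrative names.

```python
import re
import string

# every ASCII punctuation character except the period
no_period = string.punctuation.replace(".", "")

# re.escape backslash-escapes special characters such as "-", "[" and "]",
# so each one is matched literally inside the [...] class
keep_periods_pattern = "[" + re.escape(no_period) + "]"

# illustrative sentence, not taken from the corpus
sample = "Hola, mundo. ¿Qué tal? (Bien.)"
cleaned = re.sub(keep_periods_pattern, "", sample)
print(cleaned)
```

Note that `string.punctuation` only covers ASCII, so Spanish marks such as `¿` and `¡` are left in place unless you add them to the pattern yourself.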

Ok, now onto making the n-grams! The first thing we want to do is "tokenize", or break the text into individual words (you can find more information on tokenization [here](https://www.kaggle.com/rtatman/tokenization-tutorial)).

In [3]:
# first get individual words
tokenized = text.split()

# get all the unigrams (ngrams() gives back a lazy iterator, not a list)
esGrams = ngrams(tokenized, 1)

# and get all the bigrams
esBigrams = ngrams(tokenized, 2)

# get all the trigrams
esTrigrams = ngrams(tokenized, 3)


# If you like, you can uncomment the next line to take a look at
# the first ten to make sure they look ok. Please note that doing so
# will consume the generator & will break the next block of code, so you'll
# need to re-comment it and run this block again to get it to work.
# list(esBigrams)[:10]
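
One way to peek at the first few n-grams without using them up is to split the stream with `itertools.tee` and look at only one of the copies. The sketch below uses a toy token list and a plain `zip` standing in for the n-gram generator, so the variable names are illustrative rather than part of the tutorial's code.

```python
import itertools

# illustrative tokens, not taken from the corpus
tokens = "en la ciudad de la universidad".split()

# a lazy stream of bigrams, standing in for ngrams(tokenized, 2)
bigrams = zip(tokens, tokens[1:])

# tee gives two independent iterators over the same stream
bigrams, preview = itertools.tee(bigrams)

# look at the first three bigrams using the preview copy only
first_three = list(itertools.islice(preview, 3))
print(first_three)

# the other copy is still complete and can be counted as usual
remaining = list(bigrams)
print(len(remaining))
```

For a corpus that fits comfortably in memory, calling `list(...)` once and reusing the resulting list is a simpler alternative.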

Yay, we've got our n-grams! Just a list of all the bigrams in a text isn't that useful, though. Let's collapse it a little bit & get a count of how frequently we see each bigram in our corpus. 

In [4]:
# get the frequency of each bigram in our corpus
esBigramFreq = collections.Counter(esBigrams)

# what are the thirty most frequent bigrams in this Spanish corpus?
esBigramFreq.most_common(30)

[(('de', 'la'), 38923),
 (('en', 'el'), 16435),
 (('de', 'los'), 15828),
 (('en', 'la'), 15291),
 (('a', 'la'), 11038),
 (('de', 'las'), 9316),
 (('que', 'se'), 7004),
 (('y', 'el'), 5286),
 (('y', 'la'), 5111),
 (('a', 'los'), 4966),
 (('por', 'el'), 4939),
 (('con', 'el'), 4608),
 (('de', 'un'), 4290),
 (('de', 'su'), 4043),
 (('con', 'la'), 3955),
 (('en', 'los'), 3953),
 (('por', 'la'), 3781),
 (('lo', 'que'), 3458),
 (('de', 'una'), 3395),
 (('que', 'el'), 3339),
 (('en', 'las'), 3116),
 (('la', 'ciudad'), 3064),
 (('que', 'la'), 2991),
 (('en', 'un'), 2746),
 (('en', 'su'), 2730),
 (('es', 'el'), 2701),
 (('y', 'de'), 2605),
 (('y', 'en'), 2568),
 (('a', 'las'), 2510),
 (('es', 'un'), 2473)]

In [5]:
esTrigramFreq = collections.Counter(esTrigrams)

esTrigramFreq.most_common(20)

[(('uno', 'de', 'los'), 1581),
 (('de', 'la', 'ciudad'), 1316),
 (('una', 'de', 'las'), 1116),
 (('la', 'ciudad', 'de'), 990),
 (('por', 'lo', 'que'), 936),
 (('a', 'través', 'de'), 881),
 (('el', 'nombre', 'de'), 737),
 (('parte', 'de', 'la'), 711),
 (('en', 'el', 'que'), 707),
 (('en', 'la', 'que'), 673),
 (('a', 'partir', 'de'), 637),
 (('la', 'mayoría', 'de'), 609),
 (('de', 'la', 'provincia'), 602),
 (('la', 'provincia', 'de'), 592),
 (('de', 'la', 'República'), 571),
 (('en', 'el', 'año'), 562),
 (('a', 'lo', 'largo'), 539),
 (('de', 'octubre', 'de'), 491),
 (('en', 'el', 'siglo'), 485),
 (('la', 'Universidad', 'de'), 483)]

In [6]:
esGramFreq = collections.Counter(esGrams)

esGramFreq.most_common(20)

[(('de',), 262457),
 (('la',), 133828),
 (('en',), 98541),
 (('el',), 93087),
 (('y',), 92372),
 (('que',), 66878),
 (('a',), 60677),
 (('los',), 52295),
 (('del',), 49278),
 (('se',), 42588),
 (('por',), 33389),
 (('un',), 32902),
 (('las',), 31993),
 (('con',), 30297),
 (('una',), 28447),
 (('su',), 23918),
 (('es',), 22440),
 (('como',), 19584),
 (('al',), 19269),
 (('para',), 19124)]

I'm not a fluent Spanish speaker, but those do look like reasonable high-frequency words and phrases in Spanish.
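
The tokenize, n-gram and count steps can also be wrapped into one small function, which is handy when repeating the analysis on other files. This is a sketch that assumes the text has already been cleaned; `top_ngrams` is a hypothetical helper name, and it uses `zip` in place of `nltk.util.ngrams` so that it runs without extra installs.

```python
import collections

def top_ngrams(text, n, k):
    """Return the k most frequent n-grams in text as (ngram, count) pairs."""
    tokens = text.split()
    # n shifted copies of the token list, zipped into n-word tuples
    grams = zip(*(tokens[i:] for i in range(n)))
    return collections.Counter(grams).most_common(k)

# illustrative text, not taken from the corpus
toy = "la ciudad de la luz y la ciudad de la paz"

print(top_ngrams(toy, 2, 3))  # three most frequent bigrams
print(top_ngrams(toy, 3, 1))  # single most frequent trigram
```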

And that's all there is to it! Here are some exercises to help you try it out yourself.


### Your turn!

1. Find the most frequent n-grams in another file in this corpus. You can find the other files by entering "../input/spanish_corpus/" in a code chunk and then hitting the Tab key. All of the files will be listed in a drop-down menu. Are the most frequent bigrams the same as they were in this file?
2. Instead of bigrams (two-word phrases), can you find trigrams (three-word phrases)?
3. Find the most frequent n-grams in another corpus. You can find some [here](https://www.kaggle.com/datasets?sortBy=relevance&group=featured&search=corpus) to start you out.