<a href="https://colab.research.google.com/github/gzc/spark/blob/main/project_1_student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project 1**

Unigrams, bigrams, and in general n-grams are sequences of 1, 2, or n words that appear consecutively in a single sentence. Consider the sentence:

     "to know you is to love you."

This sentence contains:

     Unigrams (single words): "to" (2 times), "know" (1 time), "you" (2 times), "is" (1 time), "love" (1 time)
     Bigrams: "to know", "know you", "you is", "is to", "to love", "love you" (all 1 time)
     Trigrams: "to know you", "know you is", "you is to", "is to love", "to love you" (all 1 time)
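
As a minimal sketch of the definition above, the n-grams of an already tokenized sentence can be produced with a sliding window. The helper name `ngrams` is hypothetical and not part of the assignment; it assumes the sentence has already been split into words.

```python
from collections import Counter

def ngrams(words, n):
    """Return every run of n consecutive words, joined by single spaces."""
    # A sentence with fewer than n words yields an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "to know you is to love you".split()
print(Counter(ngrams(words, 1)))  # unigram counts
print(ngrams(words, 2))           # bigrams, in order of appearance
print(ngrams(words, 3))           # trigrams, in order of appearance
```

Because the window never crosses the end of the list, applying the helper to one sentence at a time keeps n-grams from spanning sentence boundaries.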

 The goal of this project is to find the most common n-grams in the text of Moby Dick.

 Your task is to:

 * Convert all text to lower case and remove all punctuation. (The resulting text should contain only letters, numbers, and spaces.)
 * Count the occurrences of each word and of each 2-, 3-, 4-, and 5-gram.
 * List the 5 most common elements for each order (word, bigram, trigram, ...). For each element, list the sequence of words and the number of occurrences.

In practice, replace every punctuation character with a space and treat as a word anything that lies between whitespace, or at the beginning or end of a sentence, and does not itself consist of whitespace (strings consisting only of whitespace are not words). The aim is to be simple, not to parse English with 100% accuracy. Evaluation will be based primarily on identifying the 5 most frequent n-grams in the correct order for every value of n. Some slack will be allowed in the n-gram frequency values to accommodate differences in text processing.
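
The normalization rule above could be sketched as a single per-sentence function. The name `tokenize` is hypothetical, and the sketch assumes that "letters" means ASCII `a-z` after lowercasing, so accented characters are also replaced by spaces.

```python
import re

def tokenize(sentence):
    """Lowercase the sentence, replace anything that is not a letter,
    digit, or whitespace with a space, and split into words."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", sentence.lower())
    # str.split() with no argument splits on runs of whitespace and
    # never returns empty strings, so whitespace-only input gives [].
    return cleaned.split()

print(tokenize("We remain open for trauma, emergency care, and urgent care needs"))
print(tokenize("  ,, "))
```

A function of this shape takes one string and returns a list, so it can be passed to an RDD `map` over sentences without modification.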

This text is short enough to process on a single core using standard Python. However, you are required to use RDDs for the whole process. At the very end you can use `.take(5)` to bring the results to the driver node for printing.
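
One way to keep counting and ranking inside RDD operations is sketched below. It assumes an RDD whose elements are n-gram strings; the names `rank_key` and `count_and_rank` are hypothetical, and breaking ties alphabetically is an assumption made here only to get a reproducible order, not a requirement of the assignment.

```python
from operator import add

def rank_key(pair):
    """Sort key for (ngram, count) pairs: larger counts first, ties alphabetical."""
    ngram, count = pair
    return (-count, ngram)

def count_and_rank(ngram_rdd):
    """Turn an RDD of n-gram strings into an RDD of (ngram, count) pairs,
    most frequent first. Only standard RDD methods are used."""
    return (ngram_rdd
            .map(lambda ngram: (ngram, 1))  # one pair per occurrence
            .reduceByKey(add)               # sum the ones for each distinct n-gram
            .sortBy(rank_key))              # distributed sort, highest count first

# The key function can be checked locally without a SparkContext:
pairs = [("you", 2), ("to", 5), ("care", 3), ("we", 2)]
print(sorted(pairs, key=rank_key))
```

With this shape, `count_and_rank(ngram_rdd).take(5)` would be the only step that moves data back to the driver.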

The code for reading the file and splitting it into sentences is shown below:

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf /content/spark-2.4.7-bin-hadoop2.7.tgz
!pip install -q findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

In [None]:
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext()

Download the dataset from [here](https://github.com/gzc/spark/blob/main/n-grams.txt) and keep it somewhere on your computer. Load the dataset into your Colab directory from your local system:

In [None]:
from google.colab import files
files.upload()

Saving n-grams.txt to n-grams (2).txt


{'n-grams.txt': b'to know you is to love you.\nhello world.\nThe public needs to understand we continue to care for other patient populations in addition to Covid patients.\nWe remain open for trauma, emergency care, and urgent care needs.'}

In [None]:
# wholeTextFiles yields (path, content) pairs; keep only the file content.
textRDD = sc.wholeTextFiles('n-grams.txt').map(lambda x: x[1])
print(textRDD.take(1))

['to know you is to love you.\nhello world.\nThe public needs to understand we continue to care for other patient populations in addition to Covid patients.\nWe remain open for trauma, emergency care, and urgent care needs.']


In [None]:
# Split the text on periods, then drop the newline characters from each sentence.
sentences = textRDD.flatMap(lambda x: x.split(".")).map(lambda x: x.replace('\n', ''))
print(sentences.take(4))

['to know you is to love you', 'hello world', 'The public needs to understand we continue to care for other patient populations in addition to Covid patients', 'We remain open for trauma, emergency care, and urgent care needs']


Let `freq_ngramRDD` be the final result RDD containing the n-grams sorted by their frequency in descending order. Use the following function to print your final output:


In [None]:
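def printOutput(n, freq_ngramRDD):
    # freq_ngramRDD holds (ngram, count) pairs, already sorted by count in descending order.
    top = freq_ngramRDD.take(5)
    print('\n============ %d most frequent %d-grams' % (5, n))
    print('\nindex\tcount\tngram')
    # Iterate over what take(5) returned, in case there are fewer than 5 distinct n-grams.
    for i, item in enumerate(top):
        print('%d.\t%d: \t"%s"' % (i + 1, item[1], item[0]))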

Your output for unigrams should look like:
```
============ 5 most frequent 1-grams

index	count	ngram
1.	40: 	"a"
2.	25: 	"the"
3.	21: 	"and"
4.	16: 	"to"
5.	9: 	"of"

```
Note: This is only a sample of the output format; the values do not reflect the actual results.

Your final program should generate an output using the following code:

In [None]:
for n in range(1,6):
    # Put your logic for generating the sorted n-gram RDD here and store it in freq_ngramRDD variable
    
    printOutput(n,freq_ngramRDD)



index	count	ngram
1.	5: 	"to"
2.	3: 	"care"
3.	2: 	"you"
4.	2: 	"needs"
5.	2: 	"we"


index	count	ngram
1.	1: 	"to know"
2.	1: 	"know you"
3.	1: 	"you is"
4.	1: 	"is to"
5.	1: 	"to love"


index	count	ngram
1.	1: 	"to know you"
2.	1: 	"know you is"
3.	1: 	"you is to"
4.	1: 	"is to love"
5.	1: 	"to love you"


index	count	ngram
1.	1: 	"to know you is"
2.	1: 	"know you is to"
3.	1: 	"you is to love"
4.	1: 	"is to love you"
5.	1: 	"the public needs to"


index	count	ngram
1.	1: 	"to know you is to"
2.	1: 	"know you is to love"
3.	1: 	"you is to love you"
4.	1: 	"the public needs to understand"
5.	1: 	"public needs to understand we"
