# Homework - kNN implementation with Spark 

In this homework you will implement the kNN algorithm with Spark and apply it to classify text documents. “Classification”
is the task of labeling documents based on their contents. You will be asked to perform 4 tasks, covering data preparation, feature extraction, and classification.

## Data

The dataset for this homework is the same one you used in the Spark introduction lab: the widely used “20 newsgroups” dataset. A newsgroup post is like an old-school blog post, and this dataset has 19,997 such posts from 20 different categories, according to the newsgroup where each post was made.

The 20 categories are listed in the file `news_categories.txt`. The category name can be extracted from the id of the document. For example, 
* the document with the id `20_newsgroups/comp.graphics/37261` is from the `comp.graphics` category, 
* the document with the id `20_newsgroups/sci.med/59082` is from the `sci.med` category. 
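
A minimal sketch of that extraction, assuming every id has the `20_newsgroups/<category>/<number>` shape shown above. The helper name `category_of` is hypothetical and not part of the provided code.

```python
def category_of(doc_id):
    """Return the newsgroup name embedded in a document id."""
    # ids are assumed to look like "20_newsgroups/<category>/<number>"
    return doc_id.split("/")[1]

print(category_of("20_newsgroups/comp.graphics/37261"))  # comp.graphics
print(category_of("20_newsgroups/sci.med/59082"))        # sci.med
```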

The data file has one line per document of text. It can be accessed at:

`s3://comp643bucket/lab/spark_intro_aws/20_news_same_line.txt`

We have also provided a small subset of the data in the file `20_news_same_line_random_sample.txt`, so that you can debug your Spark code on a small dataset before running it on the entire dataset.

In [1]:
import pyspark
import re
import numpy as np
import random
from collections import Counter


In [2]:
# pyspark works best with java8 
# set JAVA_HOME environment variable to java8 path 
%env JAVA_HOME = /usr/lib/jvm/java-8-openjdk-amd64

env: JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64


In [3]:
sc = pyspark.SparkContext()

## Task 1 - compute "bag of words" for each document (25 pts)

For task 1, we want to extract "bag of words" features for documents. 

The first part of this task is the same as what you've already implemented in `Lab - Spark introduction (Vocareum)`. We need a dictionary, as an RDD, that includes the 20,000 most frequent words
in the training corpus. The result of such an RDD must be in this format:
`
[('mostcommonword', 0),
 ('nextmostcommonword', 1),
 ...]
`
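
To illustrate the target format without Spark, here is a small plain-Python sketch. The three-document corpus and the names `toy_docs`, `toy_num_words`, and `toy_dict` are made up for illustration; the provided Spark code further down is what actually builds `refDict`.

```python
from collections import Counter

# Toy corpus standing in for the newsgroup posts (hypothetical data).
toy_docs = ["the cat saw the dog", "the dog ran", "the end"]

# Count every word occurrence across all documents.
counts = Counter(word for doc in toy_docs for word in doc.split())

# Keep the most frequent words and pair each with its rank, starting at 0.
toy_num_words = 2
toy_dict = [(word, rank)
            for rank, (word, _) in enumerate(counts.most_common(toy_num_words))]
print(toy_dict)  # [('the', 0), ('dog', 1)]
```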

**NOTE**: There aren’t 20,000 unique words in the small dataset (`20_news_same_line_random_sample.txt`). Use only the top 50 words when working with this file.

For this part, we provided our code, so that you only need to run it, to create this dictionary, named `refDict`, as an RDD. This `refDict` RDD will be our reference dictionary of words. The words in `refDict` will be our reference words for which we will compute "bag of words" and "TF-IDF" features for our training corpus and finally for the test documents.

**Provided code to create the reference dictionary of words.**

Run the code cells below to create the `refDict` RDD.

In [4]:
# set the number of dictionary words 
# 50 for the small dataset
# 20,000 for the large dataset
numWords = 50

In [5]:
# load up the dataset 
# "data/20_news_same_line_random_sample.txt" for small dataset 
# "s3://comp643bucket/lab/spark_intro_aws/20_news_same_line.txt" for entire large dataset
corpus = sc.textFile ("data/20_news_same_line_random_sample.txt")

# each entry in validLines will be a line from the text file
validLines = corpus.filter(lambda x : 'id=' in x)

# now we transform it into a bunch of (docID, text) pairs
keyAndText = validLines.map(lambda x : (x[x.index('id="') + 4 : x.index('" url=')], x[x.index('"> ') + 3:x.index(' </doc>')]))

# now we split the text in each (docID, text) pair into a list of words
# after this, we have a data set with (docID, ["word1", "word2", "word3", ...])
# we have a bit of fancy regular expression stuff here to make sure that we do not
# die on some of the documents                       
regex = re.compile('[^a-zA-Z]')  
keyAndListOfWords = keyAndText.map(lambda x : (str(x[0]), regex.sub(' ', x[1]).lower().split()))

# now get the top 20,000 words... first change (docID, ["word1", "word2", "word3", ...])
# to ("word1", 1) ("word2", 1)...
allWords = keyAndListOfWords.flatMap(lambda x: ((j, 1) for j in x[1]))

# now, count all of the words, giving us ("word1", 1433), ("word2", 3423423), etc.
allCounts = allWords.reduceByKey (lambda a, b: a + b)

# and get the top numWords (50 for small dataset, 20K for large dataset) frequent words in a local array
topWords = allCounts.top (numWords, lambda x : x[1])

# and we'll create an RDD that has a bunch of (word, rank) pairs
# start by creating an RDD that has the number 0 up to numWords (50 for small dataset, 20K for large dataset) 
# numWords is the number of words that will be in our dictionary
twentyK = sc.parallelize(range(numWords))

# now, we transform (0), (1), (2), ... to ("mostcommonword", 0) ("nextmostcommon", 1), ...
# the number will be the spot in the dictionary used to tell us where the word is located
refDict = twentyK.map(lambda x:(topWords[x][0],x))

In [6]:
refDict.take(10)

[('the', 0),
 ('to', 1),
 ('of', 2),
 ('a', 3),
 ('and', 4),
 ('i', 5),
 ('in', 6),
 ('is', 7),
 ('that', 8),
 ('it', 9)]

Now, your task is to write Spark code to create "bag of words" features based on the words in the reference dictionary, `refDict`.  

You need to create a new RDD, named `bag_of_words`. Each element of this RDD corresponds to one document, and is a key-value pair. Specifically, the key is the document identifier `id` (like `20_newsgroups/comp.graphics/37261`) and the value is a `numpy` array with `numWords` (50 for small dataset, 20K for large dataset) entries, where the ith entry in the array is the number of times that the ith word in the `refDict` (created in the first part) appears in the document. This array corresponds to the "bag of words" features for each document.  
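
As a rough, Spark-free illustration of the count vector described above, the sketch below uses a hypothetical three-word dictionary (`toy_ref`) and a hypothetical helper (`toy_bag_of_words`). In the homework the same idea has to be applied to every document in an RDD.

```python
import numpy as np

# Hypothetical three-word dictionary in the same (word -> rank) form as refDict.
toy_ref = {"the": 0, "to": 1, "of": 2}

def toy_bag_of_words(words, ref):
    """Count how often each dictionary word occurs in `words`."""
    bow = np.zeros(len(ref), dtype=int)
    for w in words:
        rank = ref.get(w)
        if rank is not None:  # words outside the dictionary are ignored
            bow[rank] += 1
    return bow

a = toy_bag_of_words("the road to the end of the line".split(), toy_ref)
print(a)  # [3 1 1]

b = toy_bag_of_words("the the dog".split(), toy_ref)
print(b[b.nonzero()])  # [2]
```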

Once you have created this `bag_of_words` RDD, print out the result arrays for these documents:
* `20_newsgroups/soc.religion.christian/21626`
* `20_newsgroups/talk.politics.misc/179019`
* `20_newsgroups/rec.autos/103167`

Since each array is going to be huge, with a lot of zeros, just print out the non-zero entries in the array (that is, for an array `a`, print out `a[a.nonzero()]`).

Notes:
* Only use the words in `refDict`.
* `bag_of_words` is an RDD of key-value pairs: (docID, `numpy` array with the count of each dictionary word in that document).
* The numerator of the TF in TF-IDF is the bag-of-words count.
* You don't have to make your IDF an RDD.
* Keep the term frequencies as an RDD and the IDF as an array, so that you can multiply them.


In [7]:
ref_dict = refDict.collectAsMap()

In [8]:
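def creat_bow(doc_id, word_list):
    # count every word in the document once
    counts = Counter(word_list)
    # place each count at the word's rank, so entry i always belongs to the
    # ith word of refDict regardless of the dictionary's iteration order
    bow = [0] * len(ref_dict)
    for word, rank in ref_dict.items():
        bow[rank] = counts[word]
    return doc_id, bow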

In [9]:
creat_bow('myid', ['from', 'cst', 'garfield', 'catt', 'ncsu', 'the', 'the', 'a','a'])

('myid',
 [2, 0, 0, 2, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [10]:
bag_of_words = keyAndListOfWords.map(lambda x: creat_bow(x[0], x[1]))

In [11]:
bag_of_words.take(3)

[('20_newsgroups/comp.graphics/37926',
  [0, 1, 0, 4, 3, 3, 1, 1, 0, 0,
   0, 3, 0, 0, 0, 1, 1, 2, 1, 0,
   0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
   1, 0, 1, 1, 1, 0, 0, 0, 0, 1,
   1, 0, 0, 1, 0, 0, 0, 0, 0, 0]),
 ('20_newsgroups/comp.graphics/37944',
  [4, 6, 1, 4, 8, 0, 4, 2, 0, 3,
   0, 2, 1, 0, 0, 2, 2, 1, 1, 2,
   2, 1, 0, 3, 1, 1, 0, 0, 0, 0,
   1, 0, 2, 1, 1, 1, 1, 1, 3, 1,
   1, 0, 0, 0, 0, 0, 1, 0, 0, 0]),
 ('20_newsgroups/comp.graphics/38274',
  [1, 0, 1, 0, 0, 0, 2, 0, 1, 0,
   0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
   0, 0, 0, 0, 1, 0, 2, 0, 0, 0,
   1, 0, 1, 1, 1, 0, 0, 0, 0, 1,
   0, 0, 1, 0, 0, 0,
   

Once you have created your `bag_of_words` RDD, print out the result arrays for these documents,
* `20_newsgroups/soc.religion.christian/21626`
* `20_newsgroups/talk.politics.misc/179019`
* `20_newsgroups/rec.autos/103167`

by running the code cells below:

In [12]:
arr1_1 = np.array(bag_of_words.lookup("20_newsgroups/soc.religion.christian/21626"))
arr1_1[arr1_1.nonzero()]

array([ 7,  2, 10,  4,  4,  5,  1,  6,  7,  8,  1,  1,  1,  3,  1,  2,  1,
        1,  2,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

In [13]:
arr1_2 = np.array(bag_of_words.lookup("20_newsgroups/talk.politics.misc/179019"))
arr1_2[arr1_2.nonzero()]

array([ 7, 23,  5, 17,  6,  5, 14, 10,  3, 20, 15,  4, 11,  1,  1,  4,  4,
        8,  4,  3,  2,  1,  2,  3, 10,  3,  1,  1,  1,  1,  1,  1,  2,  1,
        1,  1,  2,  2,  2,  3])

In [14]:
arr1_3 = np.array(bag_of_words.lookup("20_newsgroups/rec.autos/103167"))
arr1_3[arr1_3.nonzero()]

array([9, 1, 2, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1])

## Task 2 - compute TF-IDF for each document (30 pts)

It is often difficult to classify documents accurately using raw count vectors (bag of words). Thus, the next task is
to write some more Spark code that converts each of the count vectors to a TF-IDF vector. You need to create an RDD of key-value pairs, named `tfidf`, in which the keys are document identifiers and the values are the TF-IDF vectors. Again, we are only interested in the top `numWords` (50 for small dataset, 20K for large dataset) most common words.

The ith entry in a TF-IDF vector corresponds to the ith word in the top `numWords` most common words dictionary `refDict`. Then, the ith entry in a TF-IDF vector for document $d$ is computed as:

$$ TF(i, d) \times IDF(i) $$

Where $TF(i, d)$ is: 

$$ \frac {\textrm{Number of occurrences of word $i$ in $d$}} {\textrm{Total number of words in $d$}} $$

Note that the “Total number of words” is not the number of distinct words. The “total number of words”
in “Today is a great day today” is six. 
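
The example sentence can be checked with a short plain-Python sketch. Lowercasing before counting is an assumption that mirrors the tokenization used for the corpus earlier in this notebook; the variable names are illustrative only.

```python
from collections import Counter

words = "Today is a great day today".lower().split()
total = len(words)  # repeated words are counted every time they occur
counts = Counter(words)

# term frequency of each distinct word in this one document
tf = {w: c / total for w, c in counts.items()}

print(total)            # 6
print(counts["today"])  # 2
print(tf["today"])      # 2/6, roughly 0.333
```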

And the $IDF(i)$ is:

$$ \log \frac {\textrm{Number of documents in corpus}} {\textrm{Number of documents having word $i$}} $$
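
A small worked sketch of the IDF formula and the final TF-IDF product, on a hypothetical three-document corpus. It assumes the natural logarithm; the formula above does not fix the base, so treat that as an assumption. All names here are illustrative.

```python
import math

# Hypothetical corpus of three tokenized documents.
toy_corpus = [["spark", "rdd", "spark"], ["spark", "numpy"], ["numpy", "array"]]
toy_vocab = ["spark", "numpy", "array"]

n_docs = len(toy_corpus)
toy_idf = {}
for word in toy_vocab:
    # number of documents that contain the word at least once
    doc_freq = sum(1 for doc in toy_corpus if word in doc)
    toy_idf[word] = math.log(n_docs / doc_freq)

# TF-IDF of "spark" in the first document: (2/3) * log(3/2)
tf_spark = toy_corpus[0].count("spark") / len(toy_corpus[0])
tfidf_spark = tf_spark * toy_idf["spark"]
print(tfidf_spark)
```

Note that a word occurring in every document gets an IDF of log(1) = 0, so its TF-IDF entry is zero even when its count is not.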

Once you have created this `tfidf` RDD, print out the non-zero array entries (TF-IDF vector) that you have created for these documents:

* `20_newsgroups/soc.religion.christian/21626`
* `20_newsgroups/talk.politics.misc/179019`
* `20_newsgroups/rec.autos/103167`

In [15]:
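# start your code here ...
# pair each document's count vector with its word list: (docID, (bow, words))
TFid = bag_of_words.leftOuterJoin(keyAndListOfWords)
# local copy of the count vectors, used below to compute document frequencies
bow = bag_of_words.collectAsMap()
# raw string avoids invalid-escape warnings; captures the category between the slashes
reg_ex = re.compile(r'\/(.*?)\/')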

In [16]:
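def calc_tfid(doc_id, bow, all_words):
    # term frequency: count of each dictionary word divided by the
    # total number of words in the document (not the number of distinct words)
    n = len(all_words)
    tfid = np.divide(bow, n)
    return doc_id, tfid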

In [17]:
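def calc_idf(bow):
    # bow maps docID -> count vector; idf[j] first accumulates the number
    # of documents that contain dictionary word j
    idf = np.zeros(numWords)
    n = len(bow)
    for i in bow:
        for j in range(len(bow[i])):
            if bow[i][j]:
                idf[j] += 1
    # IDF(j) = log(number of documents / number of documents containing word j)
    for i in range(len(idf)):
        if idf[i]:
            idf[i] = n/idf[i]
    return np.log(idf)
IDFi = calc_idf(bow)
IDFi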

array([0.06827884, 0.13238919, 0.20702417, 0.13581972, 0.19358475,
       0.18874212, 0.14966077, 0.24334626, 0.3188288 , 0.32850407,
       0.54128483, 0.3188288 , 0.52424864, 0.57270103, 2.81341072,
       0.        , 0.4780358 , 0.4975804 , 0.64055473, 0.55512588,
       0.58339632, 0.50252682, 0.51583817, 0.77435724, 0.5798185 ,
       0.6558514 , 0.47965001, 0.85802182, 0.65392647, 0.89159812,
       0.        , 1.35867919, 0.        , 0.00501254, 0.06081214,
       0.72981116, 0.33547274, 0.94417594, 0.83701755, 0.15782409,
       0.93649344, 0.89404012, 0.85802182, 0.90881872, 1.50959258,
       1.08767235, 1.27296568, 0.99967234, 0.597837  , 0.90386821])

In [18]:
TFid1 = TFid.map(lambda x: calc_tfid(x[0], x[1][0], x[1][1]))
TFid1.first()[1]

array([0.04059041, 0.0295203 , 0.01107011, 0.01107011, 0.0295203 ,
       0.02214022, 0.01476015, 0.00369004, 0.01476015, 0.0295203 ,
       0.        , 0.        , 0.00369004, 0.00738007, 0.        ,
       0.01107011, 0.00369004, 0.01845018, 0.        , 0.        ,
       0.00738007, 0.00738007, 0.01845018, 0.        , 0.00369004,
       0.00369004, 0.00738007, 0.00738007, 0.00738007, 0.01476015,
       0.00369004, 0.        , 0.00369004, 0.00369004, 0.00369004,
       0.01476015, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.00369004, 0.        , 0.        ,
       0.        , 0.00369004, 0.01107011, 0.        , 0.00738007])

In [19]:
tf_idf = TFid1.map(lambda x: (x[0], re.findall(r'\/(.*?)\/', x[0])[0], np.multiply(x[1], IDFi)))
tf_idf.first()

('20_newsgroups/comp.graphics/38299',
 'comp.graphics',
 array([2.77146586e-03, 3.90816791e-03, 2.29178047e-03, 1.50353937e-03,
        5.71467894e-03, 4.17879243e-03, 2.20901512e-03, 8.97956674e-04,
        4.70596017e-03, 9.69753703e-03, 0.00000000e+00, 0.00000000e+00,
        1.93449684e-03, 4.22657585e-03, 0.00000000e+00, 0.00000000e+00,
        1.76396975e-03, 9.18045013e-03, 0.00000000e+00, 0.00000000e+00,
        4.30550787e-03, 3.70868503e-03, 9.51730933e-03, 0.00000000e+00,
        2.13955164e-03, 2.42011585e-03, 3.53985245e-03, 6.33226438e-03,
        4.82602559e-03, 1.31601198e-02, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 1.84964643e-05, 2.24399038e-04, 1.07721205e-02,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 3.16613219e-03, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 4.69729032e-03, 1.10664835e-02,
        0.00000000e+00, 6.67061411e-03]))

Once you have created your `tfidf` RDD, print out the result arrays for these documents,
* `20_newsgroups/soc.religion.christian/21626`
* `20_newsgroups/talk.politics.misc/179019`
* `20_newsgroups/rec.autos/103167`

by running the code cells below:

In [20]:
arr2_1 = tf_idf.filter(lambda x: x[0]=='20_newsgroups/soc.religion.christian/21626')
arr2_1.collect()

[('20_newsgroups/soc.religion.christian/21626',
  'soc.religion.christian',
  array([2.35444278e-03, 1.30432698e-03, 1.01982349e-02, 2.67625070e-03,
         3.81447781e-03, 4.64882080e-03, 7.37245195e-04, 7.19250026e-03,
         1.09940966e-02, 1.29459731e-02, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         2.35485616e-03, 2.45113496e-03, 9.46632607e-03, 0.00000000e+00,
         2.87387348e-03, 4.95100316e-03, 2.54107471e-03, 0.00000000e+00,
         2.85624875e-03, 6.46159011e-03, 0.00000000e+00, 0.00000000e+00,
         3.22131265e-03, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 2.46923243e-05, 0.00000000e+00, 3.59512889e-03,
         0.00000000e+00, 0.00000000e+00, 4.12323917e-03, 7.77458548e-04,
         4.61326817e-03, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 5.35799187e-03, 0.00000000e+00, 4.92449429e-03,
         0.00000000e+00, 0.00000000e+00]))]

In [21]:
arr2_2 = tf_idf.filter(lambda x: x[0]=='20_newsgroups/talk.politics.misc/179019')
arr2_2.collect()

[('20_newsgroups/talk.politics.misc/179019',
  'talk.politics.misc',
  array([1.04584658e-03, 6.66291318e-03, 2.26503468e-03, 5.05237482e-03,
         2.54159408e-03, 0.00000000e+00, 1.63742642e-03, 7.45480880e-03,
         6.97656021e-03, 2.15648184e-03, 2.36886141e-02, 1.04648403e-02,
         4.58860958e-03, 1.37849263e-02, 0.00000000e+00, 0.00000000e+00,
         1.04603020e-03, 4.35518947e-03, 5.60660596e-03, 9.71773974e-03,
         5.10631349e-03, 3.29886316e-03, 2.25749744e-03, 1.69443596e-03,
         2.53749889e-03, 0.00000000e+00, 3.14868713e-03, 0.00000000e+00,
         1.43091131e-02, 5.85294170e-03, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 1.09683629e-05, 1.33068139e-04, 0.00000000e+00,
         7.34076009e-04, 2.06603049e-03, 3.66309650e-03, 3.45348108e-04,
         2.04921978e-03, 0.00000000e+00, 1.87750946e-03, 3.97732480e-03,
         0.00000000e+00, 4.76005404e-03, 0.00000000e+00, 0.00000000e+00,
         2.61635449e-03, 5.93348936e-03]))]

In [22]:
arr2_3 = tf_idf.filter(lambda x: x[0]=='20_newsgroups/rec.autos/103167')
arr2_3.collect()

[('20_newsgroups/rec.autos/103167',
  'rec.autos',
  array([6.08425314e-03, 1.31078404e-03, 0.00000000e+00, 0.00000000e+00,
         3.83336137e-03, 5.60620172e-03, 2.96357969e-03, 2.40936890e-03,
         3.15672081e-03, 0.00000000e+00, 0.00000000e+00, 3.15672081e-03,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         4.73302773e-03, 0.00000000e+00, 6.34212604e-03, 0.00000000e+00,
         5.77620115e-03, 4.97551308e-03, 1.02146171e-02, 0.00000000e+00,
         0.00000000e+00, 6.49357818e-03, 0.00000000e+00, 2.54857967e-02,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 4.96291270e-05, 6.02100390e-04, 0.00000000e+00,
         3.32151224e-03, 0.00000000e+00, 0.00000000e+00, 1.56261470e-03,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00]))]

## Task 3 - build a kNN classifier (30 pts)

Task 3 is to build a kNN classifier, as a Python function named `predictLabel` in the cell below. This function will take as input a text string (`test_doc`) and a number k, and then output the name of one of the 20 newsgroups. This name is the news group that the classifier thinks that the text string is “closest” to. It is computed using the classical kNN algorithm. 

Your function first needs to convert the input string into all lower case words, and then compute a TF-IDF vector corresponding to the words in `refDict` created in the first task. Recall that the words in `refDict` are our reference words for computing "TF-IDF" features. In task 2, we already computed TF-IDF values of these words for our training corpus. In this task, you need to compute TF-IDF values of these words for the input text string `test_doc`. For that, you need to compute the term frequency of these words in the `test_doc`. Since the IDF of a word depends only on the training corpus, and it was already calculated for the `refDict` words in task 2, you don't need to re-calculate IDF for the `test_doc` and can re-use what you have.    
Then, your function needs to find the k documents in the corpus that are “closest” to the `test_doc` (where distance is computed using the l2 norm between the TF-IDF feature vectors), and return the newsgroup label that is most frequent among those k. Ties go to the label with the closest corpus document. 
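
The voting and tie-breaking rule can be sketched without Spark as follows. The helper `knn_label` and the toy training vectors are hypothetical; in the homework the distances are computed over the TF-IDF RDD rather than a local list.

```python
import math
from collections import Counter

def knn_label(k, query, labelled_vectors):
    """labelled_vectors: list of (label, vector) pairs; returns the voted label."""
    # sort by l2 distance to the query and keep the k closest
    nearest = sorted(
        ((math.dist(vec, query), label) for label, vec in labelled_vectors),
        key=lambda pair: pair[0],
    )[:k]
    votes = Counter(label for _, label in nearest)
    best = max(votes.values())
    # nearest is ordered closest-first, so the first label holding the
    # top vote count is the tied label with the closest document
    for _, label in nearest:
        if votes[label] == best:
            return label

toy_train = [("sci.med", [0.0, 1.0]), ("rec.autos", [1.0, 0.0]),
             ("rec.autos", [0.9, 0.2]), ("sci.med", [0.1, 0.8])]

print(knn_label(3, [1.0, 0.1], toy_train))  # rec.autos (2 of the 3 nearest)
print(knn_label(2, [0.5, 0.6], toy_train))  # sci.med (1-1 tie, sci.med is closest)
```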

Once you have implemented your function, run it on the following 8 test cases, each an excerpt from a Wikipedia article
chosen to match one of the 20 newsgroups. By reading each test document, you can guess which of the 20 newsgroups is the most relevant topic, and you can compare that with what your prediction function returns. The results you get from the small dataset might not be very accurate, due to the small training corpus. But once you run it on the entire dataset in S3, you should see reasonable results.

In [45]:
# k is the number of neighbors to consider
# test_doc is the text to compare 
def predictLabel (k, test_doc):
    # your code here
    # turn the test document into a TF-IDF vector over the refDict words
    t_words = regex.sub(' ', test_doc).lower().split()
    t_bow = creat_bow('test', t_words)
    t_tfid = calc_tfid('test', t_bow[1], t_words)
    t_tf_idf = np.multiply(t_tfid[1], IDFi)
    
    # k nearest training documents as (label, l2 distance), closest first
    nearest = tf_idf.map(lambda x: (x[1], np.linalg.norm(np.subtract(x[2], t_tf_idf)))).takeOrdered(k, key = lambda x: x[1])

    # majority vote over the labels of the k nearest documents
    votes = Counter(label for label, _ in nearest)
    max_value = max(votes.values())
    # ties go to the label with the closest corpus document:
    # nearest is sorted by distance, so return the first label with the top count
    for label, _ in nearest:
        if votes[label] == max_value:
            return label

#### Test cases

Run your predictLabel function on the 8 test cases below.

In [46]:
print(predictLabel (10, 'Graphics are pictures and movies created using computers – usually referring to image data created by a computer specifically with help from specialized graphical hardware and software. It is a vast and recent area in computer science. The phrase was coined by computer graphics researchers Verne Hudson and William Fetter of Boeing in 1960. It is often abbreviated as CG, though sometimes erroneously referred to as CGI. Important topics in computer graphics include user interface design, sprite graphics, vector graphics, 3D modeling, shaders, GPU design, implicit surface visualization with ray tracing, and computer vision, among others. The overall methodology depends heavily on the underlying sciences of geometry, optics, and physics. Computer graphics is responsible for displaying art and image data effectively and meaningfully to the user, and processing image data received from the physical world. The interaction and understanding of computers and interpretation of data has been made easier because of computer graphics. Computer graphic development has had a significant impact on many types of media and has revolutionized animation, movies, advertising, video games, and graphic design generally.'))

talk.politics.mideast


In [47]:
print(predictLabel (10, 'A deity is a concept conceived in diverse ways in various cultures, typically as a natural or supernatural being considered divine or sacred. Monotheistic religions accept only one Deity (predominantly referred to as God), polytheistic religions accept and worship multiple deities, henotheistic religions accept one supreme deity without denying other deities considering them as equivalent aspects of the same divine principle, while several non-theistic religions deny any supreme eternal creator deity but accept a pantheon of deities which live, die and are reborn just like any other being. A male deity is a god, while a female deity is a goddess. The Oxford reference defines deity as a god or goddess (in a polytheistic religion), or anything revered as divine. C. Scott Littleton defines a deity as a being with powers greater than those of ordinary humans, but who interacts with humans, positively or negatively, in ways that carry humans to new levels of consciousness beyond the grounded preoccupations of ordinary life.'))

talk.politics.mideast


In [48]:
print(predictLabel (10, 'Egypt, officially the Arab Republic of Egypt, is a transcontinental country spanning the northeast corner of Africa and southwest corner of Asia by a land bridge formed by the Sinai Peninsula. Egypt is a Mediterranean country bordered by the Gaza Strip and Israel to the northeast, the Gulf of Aqaba to the east, the Red Sea to the east and south, Sudan to the south, and Libya to the west. Across the Gulf of Aqaba lies Jordan, and across from the Sinai Peninsula lies Saudi Arabia, although Jordan and Saudi Arabia do not share a land border with Egypt. It is the worlds only contiguous Eurafrasian nation. Egypt has among the longest histories of any modern country, emerging as one of the worlds first nation states in the tenth millennium BC. Considered a cradle of civilisation, Ancient Egypt experienced some of the earliest developments of writing, agriculture, urbanisation, organised religion and central government. Iconic monuments such as the Giza Necropolis and its Great Sphinx, as well the ruins of Memphis, Thebes, Karnak, and the Valley of the Kings, reflect this legacy and remain a significant focus of archaeological study and popular interest worldwide. Egypts rich cultural heritage is an integral part of its national identity, which has endured, and at times assimilated, various foreign influences, including Greek, Persian, Roman, Arab, Ottoman, and European. One of the earliest centers of Christianity, Egypt was Islamised in the seventh century and remains a predominantly Muslim country, albeit with a significant Christian minority.'))

talk.politics.mideast


In [49]:
print(predictLabel (10, 'The term atheism originated from the Greek atheos, meaning without god(s), used as a pejorative term applied to those thought to reject the gods worshiped by the larger society. With the spread of freethought, skeptical inquiry, and subsequent increase in criticism of religion, application of the term narrowed in scope. The first individuals to identify themselves using the word atheist lived in the 18th century during the Age of Enlightenment. The French Revolution, noted for its unprecedented atheism, witnessed the first major political movement in history to advocate for the supremacy of human reason. Arguments for atheism range from the philosophical to social and historical approaches. Rationales for not believing in deities include arguments that there is a lack of empirical evidence; the problem of evil; the argument from inconsistent revelations; the rejection of concepts that cannot be falsified; and the argument from nonbelief. Although some atheists have adopted secular philosophies (eg. humanism and skepticism), there is no one ideology or set of behaviors to which all atheists adhere.'))

talk.politics.mideast


In [50]:
print(predictLabel (10, 'President Dwight D. Eisenhower established NASA in 1958 with a distinctly civilian (rather than military) orientation encouraging peaceful applications in space science. The National Aeronautics and Space Act was passed on July 29, 1958, disestablishing NASAs predecessor, the National Advisory Committee for Aeronautics (NACA). The new agency became operational on October 1, 1958. Since that time, most US space exploration efforts have been led by NASA, including the Apollo moon-landing missions, the Skylab space station, and later the Space Shuttle. Currently, NASA is supporting the International Space Station and is overseeing the development of the Orion Multi-Purpose Crew Vehicle, the Space Launch System and Commercial Crew vehicles. The agency is also responsible for the Launch Services Program (LSP) which provides oversight of launch operations and countdown management for unmanned NASA launches.'))

talk.politics.mideast


In [51]:
print(predictLabel (10, 'The transistor is the fundamental building block of modern electronic devices, and is ubiquitous in modern electronic systems. First conceived by Julius Lilienfeld in 1926 and practically implemented in 1947 by American physicists John Bardeen, Walter Brattain, and William Shockley, the transistor revolutionized the field of electronics, and paved the way for smaller and cheaper radios, calculators, and computers, among other things. The transistor is on the list of IEEE milestones in electronics, and Bardeen, Brattain, and Shockley shared the 1956 Nobel Prize in Physics for their achievement.'))

talk.politics.mideast


In [52]:
print(predictLabel (10, 'The Colt Single Action Army which is also known as the Single Action Army, SAA, Model P, Peacemaker, M1873, and Colt .45 is a single-action revolver with a revolving cylinder holding six metallic cartridges. It was designed for the U.S. government service revolver trials of 1872 by Colts Patent Firearms Manufacturing Company – todays Colts Manufacturing Company – and was adopted as the standard military service revolver until 1892. The Colt SAA has been offered in over 30 different calibers and various barrel lengths. Its overall appearance has remained consistent since 1873. Colt has discontinued its production twice, but brought it back due to popular demand. The revolver was popular with ranchers, lawmen, and outlaws alike, but as of the early 21st century, models are mostly bought by collectors and re-enactors. Its design has influenced the production of numerous other models from other companies.'))

sci.med


In [53]:
print(predictLabel (10, 'Howe was recruited by the Red Wings and made his NHL debut in 1946. He led the league in scoring each year from 1950 to 1954, then again in 1957 and 1963. He ranked among the top ten in league scoring for 21 consecutive years and set a league record for points in a season (95) in 1953. He won the Stanley Cup with the Red Wings four times, won six Hart Trophies as the leagues most valuable player, and won six Art Ross Trophies as the leading scorer. Howe retired in 1971 and was inducted into the Hockey Hall of Fame the next year. However, he came back two years later to join his sons Mark and Marty on the Houston Aeros of the WHA. Although in his mid-40s, he scored over 100 points twice in six years. He made a brief return to the NHL in 1979–80, playing one season with the Hartford Whalers, then retired at the age of 52. His involvement with the WHA was central to their brief pre-NHL merger success and forced the NHL to expand their recruitment to European talent and to expand to new markets.'))

rec.sport.baseball


## Task 4 - run on the entire dataset in EMR cluster (15 pts)

For the last part of this homework, you need to run your Spark code for tasks 1 through 3 on the entire dataset stored in S3, in an AWS EMR cluster. 

Follow the instructions in `Lab - Spark Intro (AWS)` to create and connect to an EMR cluster in AWS and run Spark programs there. You can gather your code for each task in a Python `.py` file, submit it as a job in batch mode, and get the final result back. To troubleshoot, you can run your code line by line in interactive mode.

The entire dataset exists in this S3 URI: `s3://comp643bucket/lab/spark_intro_aws/20_news_same_line.txt`

Repeat tasks 1 through 3 on the entire dataset in your EMR cluster, and print your results in the markdown cells below (keep the results from the small subset above). 

**Repeat task 1 on the entire dataset in your EMR cluster - print out the non-zero array entries (bag of words) that you have created for documents:**

* `20_newsgroups/soc.religion.christian/21626`
* `20_newsgroups/talk.politics.misc/179019`
* `20_newsgroups/rec.autos/103167`

`20_newsgroups/soc.religion.christian/21626`:
[ 7  2 10  4  4  5  1  6  7  8  1  1  1  1  3  2  1  1  1  2  1  1  1  1                                             
  1  1  1  3  1  1  1  1  1  4  1  1  1  1  2  2  1  1  1  1  1  1  1  1                                                           
  1  1  1  1  1  1  1  2  1  1  1  1  1  2  1  1  1  1  1  2  1  1  1  1                                                           
  1  1  3  1  1  1  1  1  1  1  1  2  1  2  1  1  2  1  1  1  1  2  1  1                                                           
  1  2  3  1  1  2  1  2  5  3  1  2  1  1  1  1  3  1  1  1  1  2  2]                                                             
`20_newsgroups/talk.politics.misc/179019`:
[ 7 23  5 17  6  5 14 10  3 15 20  4  1 11  1  4  8  4  4  3  2  1  2  3                                                  
 10  3  1  1  1  1  1  1  2  1  1  1  1  2  2  3  5  2  2  1  1  2  1  1                                                           
  8  1  2  1  3  1  1  2  1  2  1  1  2  2  3  1  1  1  1  2  1  1 11  3                                                           
  1  1  1  1  1  2  1  2  1  1  1  1  1  1  1  1  1  1  2  2  1  1  1  1                                                           
  1  1  2  1  1  2  1  1  6  1  3  1  3  1  1  1  3  3  2  1  1  3 11  2                                                           
  1  1  1  1 11  1  1  1  1  1  2  2  1  1  1  2  2  1  5  1  2  1  4  1                                                           
  2  1  1  1  1  1  2  3  1  1  1  1  4  3  1  1  1  1  1  1  1  1  1  1                                                           
  1  1  1  2  1  2  1  1  1  1]                                                                                                    
`20_newsgroups/rec.autos/103167`:
[9 1 2 3 2 1 1 1 1 1 1 1 1 2 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 3 1 1                                                   
 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 3 1 1 1 1 1 1 2 1 1 1]    

In [55]:
# In each tf_idf record, x[0] is the document id and x[2] is the TF-IDF vector.
# For each of the three documents, print only the non-zero TF-IDF entries.
arr2_1 = tf_idf.filter(lambda x: x[0] == "20_newsgroups/soc.religion.christian/21626")
x = np.array(arr2_1.first()[2])
print(x[x.nonzero()])

arr2_2 = tf_idf.filter(lambda x: x[0] == "20_newsgroups/talk.politics.misc/179019")
x = np.array(arr2_2.first()[2])
print(x[x.nonzero()])

arr2_3 = tf_idf.filter(lambda x: x[0] == "20_newsgroups/rec.autos/103167")
x = np.array(arr2_3.first()[2])
print(x[x.nonzero()])

[2.35444278e-03 1.30432698e-03 1.01982349e-02 2.67625070e-03
 3.81447781e-03 4.64882080e-03 7.37245195e-04 7.19250026e-03
 1.09940966e-02 1.29459731e-02 2.35485616e-03 2.45113496e-03
 9.46632607e-03 2.87387348e-03 4.95100316e-03 2.54107471e-03
 2.85624875e-03 6.46159011e-03 3.22131265e-03 2.46923243e-05
 3.59512889e-03 4.12323917e-03 7.77458548e-04 4.61326817e-03
 5.35799187e-03 4.92449429e-03]
[1.04584658e-03 6.66291318e-03 2.26503468e-03 5.05237482e-03
 2.54159408e-03 1.63742642e-03 7.45480880e-03 6.97656021e-03
 2.15648184e-03 2.36886141e-02 1.04648403e-02 4.58860958e-03
 1.37849263e-02 1.04603020e-03 4.35518947e-03 5.60660596e-03
 9.71773974e-03 5.10631349e-03 3.29886316e-03 2.25749744e-03
 1.69443596e-03 2.53749889e-03 3.14868713e-03 1.43091131e-02
 5.85294170e-03 1.09683629e-05 1.33068139e-04 7.34076009e-04
 2.06603049e-03 3.66309650e-03 3.45348108e-04 2.04921978e-03
 1.87750946e-03 3.97732480e-03 4.76005404e-03 2.61635449e-03
 5.93348936e-03]
[6.08425314e-03 1.31078404e-03 3.833

Copy your result in this markdown cell ...


**Repeat task 2 on the entire dataset in your EMR cluster - print out the non-zero array entries (TF-IDF) that you have created for documents:**

* `20_newsgroups/soc.religion.christian/21626`
* `20_newsgroups/talk.politics.misc/179019`
* `20_newsgroups/rec.autos/103167`

[2.50653569e-03 1.26524403e-03 9.07357579e-03 2.52936770e-03                                                                       
 3.60456817e-03 4.46138143e-03 8.18136376e-04 7.92091348e-03                                                                       
 1.15652641e-02 1.31238499e-02 2.46500149e-03 2.52057198e-03                                                                       
 2.93979464e-03 9.36767830e-03 5.13279316e-03 2.87004562e-03                                                                       
 3.24764680e-03 3.24336219e-03 6.52582242e-03 1.43086181e-05                                                                       
 3.87998805e-03 4.28381785e-03 8.00426464e-04 4.64467872e-03                                                                       
 1.33572660e-02 4.87579931e-03 5.58843456e-03 4.81912373e-03                                                                       
 4.90705097e-03 5.26027628e-03 2.00789036e-02 6.19148775e-03                                                                       
 5.21888616e-03 5.71804583e-03 6.29472023e-03 1.25049320e-02                                                                       
 2.44253425e-02 6.33110163e-03 1.43139978e-02 1.21455292e-02                                                                       
 7.28530911e-03 7.42565733e-03 7.07579097e-03 1.32021767e-02                                                                       
 8.45750301e-03 7.94727055e-03 8.00823340e-03 1.83211264e-02                                                                       
 8.56985894e-03 1.27987652e-02 8.79815310e-03 9.04371859e-03                                                                       
 1.66418042e-02 1.03766364e-02 1.29228044e-02 1.18830047e-02                                                                       
 1.13766488e-02 1.19078075e-02 2.34841885e-02 1.18147560e-02                                                                       
 1.16077016e-02 1.27329950e-02 1.28053907e-02 1.20280648e-02                                                                       
 2.83614632e-02 1.44232690e-02 1.31061065e-02 1.42647139e-02                                                                       
 1.37168044e-02 1.37247884e-02 1.47124262e-02 4.69877601e-02                                                                       
 1.40127063e-02 1.52865393e-02 1.52646454e-02 1.68034370e-02                                                                       
 1.76836921e-02 1.81612386e-02 1.89140509e-02 1.65844628e-02                                                                       
 3.36517588e-02 1.83927429e-02 3.41122266e-02 1.62331868e-02                                                                       
 1.82305518e-02 3.99979618e-02 1.85818278e-02 1.90299647e-02                                                                       
 1.90652774e-02 2.22829566e-02 4.15071733e-02 2.29116459e-02                                                                       
 2.23285690e-02 2.23515346e-02 4.72054163e-02 7.34314819e-02                                                                       
 2.38460222e-02 2.31765542e-02 5.14404563e-02 2.48832871e-02                                                                       
 5.62924414e-02 1.33645338e-01 8.21605732e-02 2.65639312e-02                                                                       
 5.47737155e-02 2.80721425e-02 2.89555953e-02 2.82214299e-02                                                                       
 2.88684053e-02 1.02082836e-01 3.03727484e-02 3.11321113e-02                                                                       
 3.48281980e-02 3.48281980e-02 7.39452440e-02 7.48842606e-02] 
 
 [1.11340644e-03 6.46326521e-03 2.01524714e-03 4.77508093e-03                                                                       
 2.40173087e-03 1.81708626e-03 8.20978705e-03 7.33900782e-03                                                                       
 2.18611176e-03 1.05050958e-02 2.33151344e-02 4.53349691e-03                                                                       
 1.48213179e-02 1.09495690e-03 4.47856554e-03 9.99176432e-03                                                                       
 5.22344255e-03 5.54818000e-03 3.41999019e-03 2.54975607e-03                                                                       
 1.71413481e-03 2.88521794e-03 3.06969009e-03 1.44070574e-02                                                                       
 6.04366002e-03 6.35590695e-06 1.83315588e-04 2.07669887e-03                                                                       
 7.74990292e-04 3.80575502e-03 3.55550486e-04 1.92703048e-03                                                                       
 2.06317239e-03 1.97777170e-03 4.06220263e-03 3.77972373e-03                                                                       
 6.23858233e-03 1.04558149e-02 4.96477994e-03 2.79932601e-03                                                                       
 2.35195495e-03 2.71214436e-03 3.43786062e-03 2.79612299e-03                                                                       
 2.53612424e-03 2.12023308e-02 2.50530668e-03 6.43594532e-03                                                                       
 4.52031331e-03 8.43685097e-03 2.55327832e-03 2.84055464e-03                                                                       
 6.01147952e-03 2.88181946e-03 6.34732132e-03 3.11291223e-03                                                                       
 3.23614387e-03 6.28615127e-03 7.25466363e-03 1.09634576e-02                                                                       
 3.54728113e-03 3.82175163e-03 5.94737533e-03 4.16408763e-03
 7.72329455e-03 4.15822511e-03 4.17291096e-03 4.55001745e-02                                                                       
 1.26414384e-02 4.23114652e-03 4.20781570e-03 4.57717517e-03                                                                       
 4.31678706e-03 4.63373445e-03 9.38358907e-03 4.61111503e-03                                                                       
 9.59207248e-03 4.85058251e-03 4.80683726e-03 4.84056333e-03                                                                       
 4.91892100e-03 5.30053693e-03 5.31911772e-03 5.24813014e-03                                                                       
 5.42097130e-03 5.91937410e-03 5.71782201e-03 1.13763427e-02                                                                       
 1.14088590e-02 5.79839641e-03 6.03704674e-03 6.39867017e-03                                                                       
 6.08594674e-03 6.33244772e-03 6.12694853e-03 1.20362622e-02                                                                       
 6.54832988e-03 6.37637259e-03 1.31097692e-02 6.78057554e-03                                                                       
 6.80988201e-03 4.20796869e-02 7.85511925e-03 2.57955882e-02                                                                       
 7.34789471e-03 2.25534277e-02 7.77723972e-03 7.42468096e-03                                                                       
 8.77832238e-03 2.81633004e-02 2.49504088e-02 1.58799804e-02                                                                       
 8.50586517e-03 8.29246250e-03 2.59647870e-02 1.09554209e-01                                                                       
 1.82352965e-02 8.53812399e-03 8.41184758e-03 8.94141440e-03                                                                       
 8.97422137e-03 1.06554954e-01 8.32661545e-03 8.68947999e-03                                                                       
 8.67213630e-03 9.15322924e-03 8.59296856e-03 1.89715231e-02                                                                       
 1.91767029e-02 9.83843713e-03 9.66857901e-03 9.78991666e-03                                                                       
 2.00027079e-02 2.48106391e-02 9.72374623e-03 5.01129930e-02                                                                       
 9.92858102e-03 2.02640612e-02 1.04194179e-02 4.40099800e-02                                                                       
 1.03563535e-02 2.17455738e-02 1.03939736e-02 1.07207686e-02                                                                       
 1.06343817e-02 1.16826900e-02 1.18730869e-02 2.60485506e-02                                                                       
 3.70283873e-02 1.19489853e-02 1.21091678e-02 1.22228653e-02                                                                       
 1.26043825e-02 5.04175301e-02 3.89477831e-02 1.35983766e-02                                                                       
 1.37689707e-02 1.30242753e-02 1.36537766e-02 1.31100975e-02                                                                       
 1.35983766e-02 1.32925517e-02 1.31543047e-02 1.35983766e-02                                                                       
 1.32454922e-02 1.35983766e-02 1.52273492e-02 1.47161569e-02                                                                       
 3.04546983e-02 1.64232872e-02 3.42402451e-02 1.56033891e-02                                                                       
 1.58955803e-02 1.56033891e-02 1.56033891e-02]
 
 [6.47728529e-03 1.27150762e-03 3.62241256e-03 5.38016097e-03                                                                       
 3.28874623e-03 2.65337531e-03 3.32071938e-03 3.16886389e-03                                                                       
 4.95440894e-03 5.90869615e-03 6.27603530e-03 5.15820302e-03                                                                       
 1.15370151e-02 2.60757032e-02 6.55812847e-03 2.87589057e-05                                                                       
 8.29457660e-04 3.50663924e-03 1.60877794e-03 8.55115716e-03                                                                       
 9.68596156e-03 1.52290169e-02 2.55005644e-02 1.51761379e-02                                                                       
 1.40013108e-02 5.15552510e-02 1.63480406e-02 3.19017710e-02                                                                       
 1.46884821e-02 1.91073027e-02 1.94155971e-02 6.64072708e-02                                                                       
 2.28659376e-02 2.15382376e-02 2.45876992e-02 2.41752193e-02                                                                       
 2.64201893e-02 2.96592303e-02 3.11986848e-02 1.08230379e-01                                                                       
 3.26672645e-02 3.52944897e-02 3.60201617e-02 3.41399899e-02                                                                       
 3.69262519e-02 3.79236931e-02 3.62294123e-02 3.63649118e-02                                                                       
 3.84147520e-02 3.67829051e-02 3.95569763e-02 4.15468047e-02                                                                       
 4.65825793e-02 9.95671330e-02 1.10610880e-01 5.26571293e-02                                                                       
 5.32826630e-02 1.73043567e-01 5.47909870e-02 5.54383411e-02                                                                       
 5.85579554e-02 6.05856228e-02 5.80226364e-02 6.40529736e-02                                                                       
 1.50509949e-01 7.19235665e-02 7.19235665e-02 7.74643168e-02]
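
The three lookups used for task 2 repeat the same filter-and-print pattern. As a minimal sketch (not part of the original notebook), they could be folded into one helper. The names `nonzero_entries` and `nonzero_by_doc` are hypothetical, and the record layout (document id at index 0, TF-IDF vector at index 2) is assumed from the cell shown earlier. A plain Python list stands in for the collected records so that the example runs without a cluster.

```python
import numpy as np


def nonzero_entries(vector):
    """Return the non-zero entries of a 1-D feature vector, in index order."""
    arr = np.asarray(vector)
    return arr[arr.nonzero()]


def nonzero_by_doc(records, doc_ids):
    """Map each requested document id to the non-zero entries of its vector.

    `records` is any iterable of tuples whose index 0 is the document id and
    whose index 2 is the feature vector (an assumed layout, see the lead-in).
    """
    wanted = set(doc_ids)
    return {rec[0]: nonzero_entries(rec[2]) for rec in records if rec[0] in wanted}


# Hypothetical toy records standing in for the contents of `tf_idf`.
sample_records = [
    ("20_newsgroups/rec.autos/1", None, [0.0, 0.5, 0.0, 0.25]),
    ("20_newsgroups/sci.med/2", None, [1.0, 0.0, 0.0, 0.0]),
]

for doc_id, values in nonzero_by_doc(sample_records, ["20_newsgroups/rec.autos/1"]).items():
    print(doc_id, values)
```

With a real RDD, one option would be to filter once with `tf_idf.filter(lambda rec: rec[0] in wanted).collect()` and pass the collected list to `nonzero_by_doc`, so the dataset is scanned a single time instead of once per document.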

**Repeat task 3 on the entire dataset in your EMR cluster - print out the predicted label for each of the test documents below:**

In [33]:
print(predictLabel (10, 'Graphics are pictures and movies created using computers – usually referring to image data created by a computer specifically with help from specialized graphical hardware and software. It is a vast and recent area in computer science. The phrase was coined by computer graphics researchers Verne Hudson and William Fetter of Boeing in 1960. It is often abbreviated as CG, though sometimes erroneously referred to as CGI. Important topics in computer graphics include user interface design, sprite graphics, vector graphics, 3D modeling, shaders, GPU design, implicit surface visualization with ray tracing, and computer vision, among others. The overall methodology depends heavily on the underlying sciences of geometry, optics, and physics. Computer graphics is responsible for displaying art and image data effectively and meaningfully to the user, and processing image data received from the physical world. The interaction and understanding of computers and interpretation of data has been made easier because of computer graphics. Computer graphic development has had a significant impact on many types of media and has revolutionized animation, movies, advertising, video games, and graphic design generally.'))
print(predictLabel (10, 'A deity is a concept conceived in diverse ways in various cultures, typically as a natural or supernatural being considered divine or sacred. Monotheistic religions accept only one Deity (predominantly referred to as God), polytheistic religions accept and worship multiple deities, henotheistic religions accept one supreme deity without denying other deities considering them as equivalent aspects of the same divine principle, while several non-theistic religions deny any supreme eternal creator deity but accept a pantheon of deities which live, die and are reborn just like any other being. A male deity is a god, while a female deity is a goddess. The Oxford reference defines deity as a god or goddess (in a polytheistic religion), or anything revered as divine. C. Scott Littleton defines a deity as a being with powers greater than those of ordinary humans, but who interacts with humans, positively or negatively, in ways that carry humans to new levels of consciousness beyond the grounded preoccupations of ordinary life.'))
print(predictLabel (10, 'Egypt, officially the Arab Republic of Egypt, is a transcontinental country spanning the northeast corner of Africa and southwest corner of Asia by a land bridge formed by the Sinai Peninsula. Egypt is a Mediterranean country bordered by the Gaza Strip and Israel to the northeast, the Gulf of Aqaba to the east, the Red Sea to the east and south, Sudan to the south, and Libya to the west. Across the Gulf of Aqaba lies Jordan, and across from the Sinai Peninsula lies Saudi Arabia, although Jordan and Saudi Arabia do not share a land border with Egypt. It is the worlds only contiguous Eurafrasian nation. Egypt has among the longest histories of any modern country, emerging as one of the worlds first nation states in the tenth millennium BC. Considered a cradle of civilisation, Ancient Egypt experienced some of the earliest developments of writing, agriculture, urbanisation, organised religion and central government. Iconic monuments such as the Giza Necropolis and its Great Sphinx, as well the ruins of Memphis, Thebes, Karnak, and the Valley of the Kings, reflect this legacy and remain a significant focus of archaeological study and popular interest worldwide. Egypts rich cultural heritage is an integral part of its national identity, which has endured, and at times assimilated, various foreign influences, including Greek, Persian, Roman, Arab, Ottoman, and European. One of the earliest centers of Christianity, Egypt was Islamised in the seventh century and remains a predominantly Muslim country, albeit with a significant Christian minority.'))
print(predictLabel (10, 'The term atheism originated from the Greek atheos, meaning without god(s), used as a pejorative term applied to those thought to reject the gods worshiped by the larger society. With the spread of freethought, skeptical inquiry, and subsequent increase in criticism of religion, application of the term narrowed in scope. The first individuals to identify themselves using the word atheist lived in the 18th century during the Age of Enlightenment. The French Revolution, noted for its unprecedented atheism, witnessed the first major political movement in history to advocate for the supremacy of human reason. Arguments for atheism range from the philosophical to social and historical approaches. Rationales for not believing in deities include arguments that there is a lack of empirical evidence; the problem of evil; the argument from inconsistent revelations; the rejection of concepts that cannot be falsified; and the argument from nonbelief. Although some atheists have adopted secular philosophies (eg. humanism and skepticism), there is no one ideology or set of behaviors to which all atheists adhere.'))
print(predictLabel (10, 'President Dwight D. Eisenhower established NASA in 1958 with a distinctly civilian (rather than military) orientation encouraging peaceful applications in space science. The National Aeronautics and Space Act was passed on July 29, 1958, disestablishing NASAs predecessor, the National Advisory Committee for Aeronautics (NACA). The new agency became operational on October 1, 1958. Since that time, most US space exploration efforts have been led by NASA, including the Apollo moon-landing missions, the Skylab space station, and later the Space Shuttle. Currently, NASA is supporting the International Space Station and is overseeing the development of the Orion Multi-Purpose Crew Vehicle, the Space Launch System and Commercial Crew vehicles. The agency is also responsible for the Launch Services Program (LSP) which provides oversight of launch operations and countdown management for unmanned NASA launches.'))
print(predictLabel (10, 'The transistor is the fundamental building block of modern electronic devices, and is ubiquitous in modern electronic systems. First conceived by Julius Lilienfeld in 1926 and practically implemented in 1947 by American physicists John Bardeen, Walter Brattain, and William Shockley, the transistor revolutionized the field of electronics, and paved the way for smaller and cheaper radios, calculators, and computers, among other things. The transistor is on the list of IEEE milestones in electronics, and Bardeen, Brattain, and Shockley shared the 1956 Nobel Prize in Physics for their achievement.'))
print(predictLabel (10, 'The Colt Single Action Army which is also known as the Single Action Army, SAA, Model P, Peacemaker, M1873, and Colt .45 is a single-action revolver with a revolving cylinder holding six metallic cartridges. It was designed for the U.S. government service revolver trials of 1872 by Colts Patent Firearms Manufacturing Company – todays Colts Manufacturing Company – and was adopted as the standard military service revolver until 1892. The Colt SAA has been offered in over 30 different calibers and various barrel lengths. Its overall appearance has remained consistent since 1873. Colt has discontinued its production twice, but brought it back due to popular demand. The revolver was popular with ranchers, lawmen, and outlaws alike, but as of the early 21st century, models are mostly bought by collectors and re-enactors. Its design has influenced the production of numerous other models from other companies.'))
print(predictLabel (10, 'Howe was recruited by the Red Wings and made his NHL debut in 1946. He led the league in scoring each year from 1950 to 1954, then again in 1957 and 1963. He ranked among the top ten in league scoring for 21 consecutive years and set a league record for points in a season (95) in 1953. He won the Stanley Cup with the Red Wings four times, won six Hart Trophies as the leagues most valuable player, and won six Art Ross Trophies as the leading scorer. Howe retired in 1971 and was inducted into the Hockey Hall of Fame the next year. However, he came back two years later to join his sons Mark and Marty on the Houston Aeros of the WHA. Although in his mid-40s, he scored over 100 points twice in six years. He made a brief return to the NHL in 1979–80, playing one season with the Hartford Whalers, then retired at the age of 52. His involvement with the WHA was central to their brief pre-NHL merger success and forced the NHL to expand their recruitment to European talent and to expand to new markets.'))

rec.sport.baseball
rec.sport.baseball
comp.os.ms-windows.misc
comp.os.ms-windows.misc
rec.sport.baseball
comp.os.ms-windows.misc
talk.politics.mideast
misc.forsale


comp.graphics                                                                                                                      
talk.religion.misc                                                                                                                 
alt.atheism                                                                                                                        
alt.atheism                                                                                                                        
sci.space                                                                                                                          
talk.politics.misc                                                                                                                 
talk.politics.guns                                                                                                                 
talk.politics.mideast  
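
The labels above are category names, which can be read off a document id as described in the Data section. As a rough illustration of the voting step in kNN (not the `predictLabel` implementation used in this homework), the sketch below takes documents that have already been scored against a query and returns the most common category among the `k` best. The helper names `category_of` and `majority_label` are hypothetical, and it assumes a similarity score where larger means closer; with a distance measure the selection would use the `k` smallest values instead.

```python
import heapq
from collections import Counter


def category_of(doc_id):
    """Extract the category from an id like '20_newsgroups/comp.graphics/37261'."""
    return doc_id.split("/")[1]


def majority_label(scored_docs, k):
    """Return the most common category among the k highest-scoring documents.

    `scored_docs` is an iterable of (doc_id, similarity) pairs; a larger
    similarity is assumed to mean a closer neighbor.
    """
    top_k = heapq.nlargest(k, scored_docs, key=lambda pair: pair[1])
    votes = Counter(category_of(doc_id) for doc_id, _ in top_k)
    return votes.most_common(1)[0][0]


# Hypothetical scores for four training documents against one query.
scored = [
    ("20_newsgroups/sci.space/1", 0.90),
    ("20_newsgroups/sci.space/2", 0.80),
    ("20_newsgroups/rec.autos/3", 0.95),
    ("20_newsgroups/rec.autos/4", 0.10),
]

print(majority_label(scored, 3))
```

In a Spark setting, the `k` best pairs could be obtained with the RDD `top` method before voting on the driver, since only `k` small records need to be collected.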

### Copyright ©2020 Christopher M Jermaine (cmj4@rice.edu), Risa B Myers (rbm2@rice.edu), Marmar Orooji (marmar.orooji@rice.edu)

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.