# COMP 755 Machine Learning Final Project
By Abu-Bakar Raja and Duy Nguyen


### Description of the project:

In this project, we have a data set of Jeopardy questions, and we will classify these questions by category. We will first prepare the data and discuss the process of vectorizing it. Then, we will train a basic Multinomial Naive Bayes model using sklearn, train a linear SVM model, and use grid search to look for better parameters. Finally, we will compare the performance of these models and try changing the categories we classify to improve our models.

### Loading the data

In [2]:
#importing data and printing size of the data
import json

with open('JEOPARDY_QUESTIONS1.json', 'r') as f:
    data = json.load(f)
print("Number of questions in this data set:")    
print(len(data))

Number of questions in this data set:
216930


### Exploring the data 
First, we want to see how many distinct categories there are in the data set and how many questions are in each. We build a dictionary that maps each category to its number of questions.

In [5]:
#counting how many questions are there per category and printing how many different category there are in the data set
data_info = {}
for d in data:
    if d['category'] in data_info:
        data_info[d['category']] +=1
    else:
        data_info[d['category']] = 1
print('Number of distinct categories: ')
print(len(data_info))

Number of distinct categories: 
27995
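
The same tally can also be written with `collections.Counter`, which additionally returns the most common categories directly. This is only a minimal sketch on a few made-up records; the `sample` list is hypothetical and just mimics the shape of the real data.

```python
from collections import Counter

# Hypothetical records that mimic the structure of the Jeopardy JSON entries.
sample = [
    {'category': 'SCIENCE', 'question': 'q1'},
    {'category': 'LITERATURE', 'question': 'q2'},
    {'category': 'SCIENCE', 'question': 'q3'},
    {'category': 'POTPOURRI', 'question': 'q4'},
    {'category': 'SCIENCE', 'question': 'q5'},
    {'category': 'LITERATURE', 'question': 'q6'},
]

# Count the questions per category in one pass.
category_counts = Counter(d['category'] for d in sample)

# most_common(n) returns the n largest (category, count) pairs, largest first.
top_two = category_counts.most_common(2)
print(len(category_counts))
print(top_two)
```

On the real data, `Counter(d['category'] for d in data).most_common(5)` would give the five largest categories without choosing a threshold by hand.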


Since we have 27,995 categories, we will only analyze a subset of them to keep things manageable. We start with the questions in the 5 most popular categories, which are the categories with at least 400 questions each. Here we get these top 5 categories and how many questions are in each of them.

In [87]:
#only using categories that have at least 400 questions aka top 5 categories.
#printing the # of useful categories
#printing the target_name array

print('The top 5 categories and how many questions are in each:')

min_num_questions = 400 #use 400 for real data
target_names = []
num_cat = 0

for x in data_info:
    if data_info[x]>= min_num_questions: 
        print(x,data_info[x])
        target_names.append(x)
        num_cat +=1

print ("\nnumber of categories: ", num_cat)
target_names

The top 5 categories and how many questions are in each:
SCIENCE 519
AMERICAN HISTORY 418
POTPOURRI 401
LITERATURE 496
BEFORE & AFTER 547

number of categories:  5


['SCIENCE', 'AMERICAN HISTORY', 'POTPOURRI', 'LITERATURE', 'BEFORE & AFTER']

We will now split the data into a training set and a smaller test set. We use 75% of the data to train our models and hold out the remaining 25% for testing: within each category, every fourth question goes to the test set.

In [21]:
#splitting the data into train_data and test_data
#3/4 of the data is used for training and 1/4 is used for testing

train_data = {'data':[],
             'target_names':target_names,
            'target': []}
test_data = {'data':[],
             'target_names':target_names,
            'target': []}

q_in_cat = []
for x in range (len(target_names)):
    q_in_cat.append(0)

for d in data:
    if d['category'] in target_names:
        # put every fourth question of each category into our test data set
        if q_in_cat[target_names.index(d['category'])] % 4 == 3:
            test_data['data'].append(d['question'])
            test_data['target'].append(target_names.index(d['category']))
        else:
            train_data['data'].append(d['question'])
            train_data['target'].append(target_names.index(d['category']))
        q_in_cat[target_names.index(d['category'])] +=1
        
print('Number of samples in the training data set: ', len(train_data['data']))
print('Number of samples in the test data set: ', len(test_data['data']))

Number of samples in the training data set:  1788
Number of samples in the test data set:  593
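
The every-fourth-question rule above keeps the category proportions roughly equal in both sets. As an alternative sketch (not what we used here), sklearn's `train_test_split` with the `stratify` argument gives a similar 75/25 split and also shuffles the samples. The `questions` and `labels` lists below are made up for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the question texts and their category indices.
questions = ['question %d' % i for i in range(40)]
labels = [i % 2 for i in range(40)]  # two balanced classes, 20 samples each

# stratify=labels keeps the class proportions the same in both splits.
q_train, q_test, y_train, y_test = train_test_split(
    questions, labels, test_size=0.25, stratify=labels, random_state=0)

print(len(q_train), len(q_test))
print(Counter(y_train), Counter(y_test))
```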


Now that we have the data set prepared, we need to extract features from each sample by vectorizing the text of the questions into numerical values that we can analyze. There are many ways to extract features from text, ranging from a simple bag-of-words approach to learned word embeddings such as Word2Vec. Each has its own benefits and use cases, but here we take the most intuitive approach and use the simple bag-of-words representation.


As the name suggests, we "put all of our words into a bag": we assign a unique integer ID to each distinct word occurring in our training set, and then, for each sample i, we count the number of occurrences of each word w and store that count in sample i's feature vector at the position given by w's ID. The number of features is therefore the number of distinct words in the training set, and each feature value is the number of times that word occurs in the sample.

Note that the questions in the data set are short compared to the number of distinct words in the data set. Only a tiny fraction of the words in our bag appear in any one question, so the feature vectors are mostly 0's. We therefore store only the non-zero features for each sample, that is, we use a sparse representation.
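
To make the idea concrete, here is a small hand-rolled sketch of a bag of words on two toy sentences. It is only an illustration: it splits on whitespace and assigns IDs in order of first appearance, so the IDs and tokens will not match what a library vectorizer produces.

```python
from collections import Counter

docs = ['the cat sat', 'the dog sat on the cat']  # toy samples

# Assign a unique integer ID to each distinct word, in order of first appearance.
vocabulary = {}
for doc in docs:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

# For each sample, store only the non-zero counts as {word_id: count}.
sparse_rows = []
for doc in docs:
    counts = Counter(vocabulary[word] for word in doc.split())
    sparse_rows.append(dict(counts))

print(vocabulary)
print(sparse_rows)
```

The second sample contains "the" twice, so its row stores a count of 2 at the ID of "the" and has no entries for words that do not occur in it.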

We can do this using sklearn's CountVectorizer from its feature_extraction.text module.



In [67]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_data['data'])
X_train_counts.shape

(1788, 7006)

X_train_counts is now our feature matrix, with one row for each of our 1788 training samples, and we can see that our "bag" has 7006 words in total. The words in the bag and their IDs are shown below:

In [23]:
print(len(count_vect.vocabulary_))
count_vect.vocabulary_

7006


{'at': 828,
 'sea': 5648,
 'level': 3833,
 '70': 450,
 'degrees': 2028,
 'this': 6383,
 'travels': 6511,
 '129': 54,
 'feet': 2632,
 'per': 4793,
 'second': 5660,
 'it': 3536,
 'speeds': 5983,
 'up': 6630,
 'over': 4650,
 'foot': 2748,
 'sec': 5659,
 'for': 2752,
 'each': 2292,
 'rising': 5442,
 'degree': 2027,
 'the': 6354,
 'largest': 3750,
 'tree': 6520,
 'general': 2902,
 'sherman': 5766,
 'in': 3411,
 'california': 1312,
 'is': 3524,
 'type': 6584,
 'also': 661,
 'called': 1314,
 'sierra': 5809,
 'redwood': 5288,
 'href': 3332,
 'http': 3333,
 'www': 6964,
 'archive': 767,
 'com': 1652,
 'media': 4135,
 '2006': 355,
 '02': 5,
 '06_dj_13': 24,
 'jpg': 3629,
 'target': 6281,
 '_blank': 476,
 'sarah': 5576,
 'of': 4543,
 'clue': 1600,
 'crew': 1865,
 'reads': 5261,
 'from': 2831,
 'pole': 4945,
 'vault': 6673,
 'duke': 2276,
 'university': 6616,
 'track': 6479,
 'durham': 2282,
 'nc': 4410,
 'bending': 1017,
 'an': 691,
 'elastic': 2353,
 'solid': 5929,
 'stress': 6120,
 'force': 275

Instead of using just the number of occurrences, we can use the frequency of each word to account for longer, wordier questions that ask about the same thing. Such questions have the same content as their shorter counterparts but higher raw counts simply because they are longer. Dividing the number of occurrences of each word in a sample by the total number of words in that sample normalizes these values.

Moreover, words that appear in many samples are less informative than words that appear in only a few. We therefore also want to scale down the weights of words that appear frequently across samples.

We use sklearn's TfidfTransformer (tf-idf, term frequency times inverse document frequency) to transform the count matrix accordingly. With its default settings, it multiplies each count by an inverse-document-frequency weight and then scales each sample's vector to unit Euclidean length, which plays the same length-normalizing role as dividing by the word count. The new matrix has the same shape as before, but common words are down-weighted and long and short samples are on the same scale.

In [34]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(1788, 7006)
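
To see what the transformer does with its default settings (`smooth_idf=True`, `norm='l2'`), the sketch below reproduces its output by hand on a made-up 3-by-3 count matrix. The matrix and the variable names are ours and are for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 3 samples (rows) by 3 words (columns).
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 1, 0]])

# sklearn's result with default settings (smooth_idf=True, norm='l2').
sk_tfidf = TfidfTransformer().fit_transform(counts).toarray()

# The same computation by hand.
n_samples = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                 # number of samples containing each word
idf = np.log((1 + n_samples) / (1 + doc_freq)) + 1  # smoothed inverse document frequency
weighted = counts * idf                             # down-weight words found in many samples
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # unit-length rows

print(np.allclose(sk_tfidf, manual_tfidf))
```

The first word occurs in every sample, so it gets the smallest idf weight of the three words, while the rarer words are weighted more heavily.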

### Training a model to classify questions
Now that we have our feature matrix, we can train a classifier to predict the categories of questions. We will start with a simple multinomial Naive Bayes model. 

In [36]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, train_data['target'])

clf is now our trained model. As a quick check, we can see what it predicts for a few questions that we wrote ourselves:

In [56]:
docs_new = ['The element with the atomic number of 2', 'The war that ended slavery', 'The author of the popular youth series Harry Potter', 'The year slavery was abolished']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, train_data['target_names'][category]))


'The element with the atomic number of 2' => SCIENCE
'The war that ended slavery' => AMERICAN HISTORY
'The author of the popular youth series Harry Potter' => LITERATURE
'The year slavery was abolished' => AMERICAN HISTORY


Now that we have a working model, we want to use it to predict the categories of the questions in our test set to see how well it does. We compose the steps from above into a pipeline so that we can reuse them quickly. In this pipeline, we use the count vectorizer from above to get the matrix of occurrence counts, then reweight and length-normalize it with TfidfTransformer, and finally train our Naive Bayes model on the resulting matrix. For comparison, we also build a second pipeline without the tf-idf step.

In [71]:
from sklearn.pipeline import Pipeline
clf_wo_normalizing = Pipeline([('vect', CountVectorizer()),
                      ('clf', MultinomialNB())])

text_clf = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB())])

clf_wo_normalizing = clf_wo_normalizing.fit(train_data['data'], train_data['target'])
text_clf = text_clf.fit(train_data['data'], train_data['target'])

In [96]:
import numpy as np

predicted = text_clf.predict(test_data['data'])
acc_with = np.mean(predicted == test_data['target'])
print('Accuracy of the model with normalizing for length:', acc_with)

predicted = clf_wo_normalizing.predict(test_data['data'])
acc_without = np.mean(predicted == test_data['target'])
print('Accuracy of the model without normalizing for length:', acc_without)

Accuracy of the model with normalizing for length: 0.7217537942664418
Accuracy of the model without normalizing for length: 0.7234401349072512


### Discussion of the model used and the result

##### Overall Accuracy:
We got about 72% accuracy with just a simple Naive Bayes model. This is not bad for a start, but it is not good enough. We suspect that the main reason is that some of the categories overlap in content. Potpourri, by definition, is a mixture of things, so a question in this category could just as easily have been a history, literature, or science question. In effect, some of the categories we are trying to separate contain questions that would fit equally well in one of the others, for example in both Potpourri and Science. Later in this notebook, we train on categories that have less overlap.
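
One way to check the overlap explanation would be a confusion matrix, which counts how often each true category is predicted as each other category. The labels below are made up for illustration; on the real data one would pass `test_data['target']` and `predicted` instead.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted category indices (0, 1, 2) for eight questions.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 2, 1, 1, 2, 0, 0]

# Rows are true categories, columns are predicted categories.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-category recall: correct predictions divided by the row total.
recall = cm.diagonal() / cm.sum(axis=1)
print(recall)
```

In this toy case, category 2 is mostly predicted as category 0, which is the kind of pattern we would expect to see if a catch-all category such as Potpourri were being absorbed by the others.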

##### Normalizing vs Not Normalizing:
We actually got a slightly lower accuracy with normalizing for length (72.18% versus 72.34%, a difference of a single question out of the 593 in the test set). We believe this is because Jeopardy questions all have roughly the same length, so normalizing makes little to no difference. Since the values are so close, we keep the normalizing step in case we ever need to apply this to a different data set. 


### Try a different model
Now let's see if we can do better with a different text classification algorithm. Here we try a linear support vector machine (SVM), which is a popular choice for text classification. The biggest difference between the two is that Multinomial Naive Bayes (MNB) is a generative model that assumes the features are conditionally independent given the category, while a linear SVM is a discriminative model that learns all of the feature weights jointly and makes no such independence assumption. An SVM can therefore be the better choice when there are dependencies among the features. Given our data set, we want to see whether an SVM does better than MNB, since our data consists of short questions whose words may or may not be related to one another. This is motivated by a paper that suggests MNB does better than SVM on short snippets (https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf); since our questions are short enough to resemble snippets, we want to test this.

In [98]:
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-5, max_iter=5, tol=None,  # n_iter was renamed to max_iter
                                           random_state=42)),
])
_ = text_clf_svm.fit(train_data['data'], train_data['target'])
predicted_svm = text_clf_svm.predict(test_data['data'])
svm_acc = np.mean(predicted_svm == test_data['target'])
print('Accuracy of SVM:', svm_acc )





Accuracy of SVM: 0.7082630691399663




We got a lower accuracy with our SVM model, at about 71%, compared to about 72% for our Naive Bayes model. While this is not a definite conclusion, it might be because these questions are, in effect, just snippets, so there is little dependency among the features for the SVM to exploit, and that is why MNB did better, even though not by much.
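
For reference, `loss='hinge'` is what makes `SGDClassifier` train a linear SVM. The model scores a sample with a weighted sum of its features, and the hinge loss penalizes samples that are misclassified or that fall inside the margin. Below is a minimal numeric sketch for the two-class case, with made-up weights and samples.

```python
import numpy as np

# Hypothetical weights, bias, and two samples whose true label is +1.
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[2.0, 0.0],    # score = 2.5, correct side with a comfortable margin
              [1.0, 1.0]])   # score = -0.5, wrong side of the boundary
y = np.array([1, 1])

scores = X @ w + b
# Hinge loss is zero once a sample is on the correct side with margin >= 1.
hinge = np.maximum(0.0, 1.0 - y * scores)
print(hinge)
```

Only the second sample contributes to the loss, so only samples like it push the weights around during training.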

### Grid Search to fine tune the parameters

Instead of trying parameter values one at a time by hand, we can use grid search, which fits the model for every combination of the values we list and scores each combination with cross-validation. Below, we consider unigrams versus unigrams plus bigrams, tf-idf with and without the idf weighting, and two values of alpha. Note that we run the search on the Naive Bayes pipeline text_clf, so clf__alpha is the smoothing parameter of MultinomialNB. (We set n_jobs to -1 so that all detected cores are used.)


In [100]:
from sklearn.model_selection import GridSearchCV
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

In [103]:
gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)  # the iid argument was removed from recent sklearn releases
gs_clf = gs_clf.fit(train_data['data'], train_data['target'])
print('Best score for grid search: ', gs_clf.best_score_)

Best score for grid search:  0.6979573183865223
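
Note that `best_score_` is the mean cross-validated accuracy over the training folds, not the accuracy on our held-out test set, so it need not match the test accuracies above. The winning combination is available as `best_params_`. The sketch below shows this on a tiny made-up corpus with a reduced grid; the texts, labels, and `cv=2` are chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 0 = science-like, 1 = history-like.
texts = ['atom element gas', 'element atomic number', 'gas atom proton',
         'proton element atom', 'war president treaty', 'treaty war battle',
         'president battle war', 'battle treaty president']
targets = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
param_grid = {'vect__ngram_range': [(1, 1), (1, 2)], 'clf__alpha': (1e-2, 1e-3)}

# 2 ngram settings x 2 alpha values = 4 combinations, each scored with 2-fold CV.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, targets)

print(search.best_params_)
print(len(search.cv_results_['params']))
```

On the real pipeline, `gs_clf.best_params_` would show which n-gram range, idf setting, and alpha produced the best score.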



### Running with just the top 3 categories WITHOUT Potpourri and Before and After

To address the problem we saw before, where some categories overlap with others, we remove Potpourri and Before & After: questions in these two categories can be about almost anything, and we suspect they are confusing our MNB model. We therefore carry out the same steps as above with only 3 categories.

In [104]:
#only using 'SCIENCE', 'AMERICAN HISTORY', 'LITERATURE' as the categories aka removing 'POTPOURRI' and 'BEFORE & AFTER' since they overlap
#printing the # categories

target_names = ['SCIENCE', 'AMERICAN HISTORY', 'LITERATURE']
num_cat = 0

for x in target_names:
    print(x,data_info[x])
    num_cat +=1

print ("\nnumber of categories: ", num_cat)
target_names

SCIENCE 519
AMERICAN HISTORY 418
LITERATURE 496

number of categories:  3


['SCIENCE', 'AMERICAN HISTORY', 'LITERATURE']

In [105]:
#splitting the data into train_data and test_data
#3/4 of the data is used for training and 1/4 is used for testing

train_data = {'data':[],
             'target_names':target_names,
            'target': []}
test_data = {'data':[],
             'target_names':target_names,
            'target': []}

q_in_cat = []
for x in range (len(target_names)):
    q_in_cat.append(0)

for d in data:
    if d['category'] in target_names:
        if q_in_cat[target_names.index(d['category'])] % 4 == 3:
            test_data['data'].append(d['question'])
            test_data['target'].append(target_names.index(d['category']))
        else:
            train_data['data'].append(d['question'])
            train_data['target'].append(target_names.index(d['category']))
        q_in_cat[target_names.index(d['category'])] +=1

In [106]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_data['data'])
X_train_counts.shape

(1076, 4776)

In [107]:
count_vect.vocabulary_.get(u'algorithm')

In [108]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(1076, 4776)

In [109]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, train_data['target'])

In [110]:
docs_new = ['The element with the atomic number of 2', 'The war that ended slavery', 'The author of the popular youth series Harry Potter', 'The year slavery was abolished']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, train_data['target_names'][category]))


'The element with the atomic number of 2' => SCIENCE
'The war that ended slavery' => AMERICAN HISTORY
'The author of the popular youth series Harry Potter' => LITERATURE
'The year slavery was abolished' => AMERICAN HISTORY


In [111]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB())])

text_clf = text_clf.fit(train_data['data'], train_data['target'])

In [112]:
import numpy as np
predicted = text_clf.predict(test_data['data'])
np.mean(predicted == test_data['target'])

0.8935574229691877

In [115]:
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                   ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-5, max_iter=5, tol=None,  # n_iter was renamed to max_iter
                                           random_state=42)),
 ])
_ = text_clf_svm.fit(train_data['data'], train_data['target'])
predicted_svm = text_clf_svm.predict(test_data['data'])
np.mean(predicted_svm == test_data['target'])



0.8935574229691877

In [116]:
from sklearn.model_selection import GridSearchCV
parameters = {
     'vect__ngram_range': [(1, 1), (1, 2)],
     'tfidf__use_idf': (True, False),
     'clf__alpha': (1e-2, 1e-3),
}

In [119]:
gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)  # the iid argument was removed from recent sklearn releases
gs_clf = gs_clf.fit(train_data['data'], train_data['target'])
print('Best score for grid search: ', gs_clf.best_score_)

Best score for grid search:  0.888470139341689


### Discussion of the result of running just Science, American History, and Literature

By removing the two problematic categories, we increased our accuracy to about 89% across all models (up from about 72%). This supports our earlier suspicion that Potpourri and Before & After contain questions that could just as easily have been in any of the other categories. The 3 remaining categories have clearer differences, so our models can now draw a clearer line between them. 

Knowing that the models perform much better when the categories overlap as little as possible, we now repeat the whole process with 5 categories: Science, American History, Literature, Sports, and Business & Industry. We believe this is a more reasonable set of categories, since they are distinguishable and none of them is a subset of another by definition.
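
Since the same split is repeated for each set of categories, it could be wrapped in a function. The sketch below is one possible refactoring, not the code we ran; `split_by_category` and the toy records are hypothetical names introduced here for illustration.

```python
def split_by_category(records, target_names, test_every=4):
    """Send every `test_every`-th question of each category to the test set."""
    train = {'data': [], 'target_names': target_names, 'target': []}
    test = {'data': [], 'target_names': target_names, 'target': []}
    seen = {name: 0 for name in target_names}  # questions seen so far per category
    index_of = {name: i for i, name in enumerate(target_names)}

    for record in records:
        category = record['category']
        if category not in seen:
            continue  # skip categories we are not classifying
        # The 4th, 8th, ... question of a category goes to the test set.
        subset = test if seen[category] % test_every == test_every - 1 else train
        subset['data'].append(record['question'])
        subset['target'].append(index_of[category])
        seen[category] += 1
    return train, test


# Made-up records in the same shape as the Jeopardy data.
toy = [{'category': 'SCIENCE', 'question': 'q%d' % i} for i in range(8)]
toy += [{'category': 'SPORTS', 'question': 's%d' % i} for i in range(4)]
toy += [{'category': 'OTHER', 'question': 'ignored'}]

toy_train, toy_test = split_by_category(toy, ['SCIENCE', 'SPORTS'])
print(len(toy_train['data']), len(toy_test['data']))
print(toy_test['data'])
```

With the real data, the call would be `split_by_category(data, target_names)`, which should reproduce the train and test dictionaries built by the cells in this notebook.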

In [120]:
#only using 'SCIENCE', 'AMERICAN HISTORY', 'LITERATURE','SPORTS', 'BUSINESS & INDUSTRY' as the categories aka 5 distinct categories
#printing the # categories

target_names = ['SCIENCE', 'AMERICAN HISTORY', 'LITERATURE', 'SPORTS', 'BUSINESS & INDUSTRY']
num_cat = 0

for x in target_names:
    print(x,data_info[x])
    num_cat +=1

print ("\nnumber of categories: ", num_cat)
target_names

SCIENCE 519
AMERICAN HISTORY 418
LITERATURE 496
SPORTS 342
BUSINESS & INDUSTRY 311

number of categories:  5


['SCIENCE', 'AMERICAN HISTORY', 'LITERATURE', 'SPORTS', 'BUSINESS & INDUSTRY']

In [121]:
#splitting the data into train_data and test_data
#3/4 of the data is used for training and 1/4 is used for testing

train_data = {'data':[],
             'target_names':target_names,
            'target': []}
test_data = {'data':[],
             'target_names':target_names,
            'target': []}

q_in_cat = []
for x in range (len(target_names)):
    q_in_cat.append(0)

for d in data:
    if d['category'] in target_names:
        if q_in_cat[target_names.index(d['category'])] % 4 == 3:
            test_data['data'].append(d['question'])
            test_data['target'].append(target_names.index(d['category']))
        else:
            train_data['data'].append(d['question'])
            train_data['target'].append(target_names.index(d['category']))
        q_in_cat[target_names.index(d['category'])] +=1

In [122]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_data['data'])
X_train_counts.shape

(1567, 6135)

In [123]:
count_vect.vocabulary_.get(u'algorithm')

In [124]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(1567, 6135)

In [125]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, train_data['target'])

In [127]:
docs_new = ['The element with the atomic number of 2', 'The war that ended slavery', 'The author of the popular youth series Harry Potter', 'The year slavery was abolished']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, train_data['target_names'][category]))


'The element with the atomic number of 2' => SCIENCE
'The war that ended slavery' => AMERICAN HISTORY
'The author of the popular youth series Harry Potter' => LITERATURE
'The year slavery was abolished' => AMERICAN HISTORY


In [128]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB())])

text_clf = text_clf.fit(train_data['data'], train_data['target'])

In [129]:
import numpy as np
predicted = text_clf.predict(test_data['data'])
np.mean(predicted == test_data['target'])

0.815028901734104

In [130]:
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-5, max_iter=5, tol=None,  # n_iter was renamed to max_iter
                                           random_state=42)),
])
_ = text_clf_svm.fit(train_data['data'], train_data['target'])
predicted_svm = text_clf_svm.predict(test_data['data'])
np.mean(predicted_svm == test_data['target'])



0.8439306358381503

In [131]:
from sklearn.model_selection import GridSearchCV
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

In [132]:
gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)  # the iid argument was removed from recent sklearn releases
gs_clf = gs_clf.fit(train_data['data'], train_data['target'])
gs_clf.best_score_

0.8474518169999097

### Discussion of the result of running on 5 different (or so we thought) categories

We obtained accuracies of about 82% with MNB and about 84% with the SVM when running on the 5 hand-selected categories. Again, we picked these categories because we think they are different enough for a decent accuracy, and we got a decent score as predicted. We did only slightly worse than with 3 categories, which makes sense since there are more categories to tell apart. This suggests that we need to formulate the problem carefully so that there is little overlap among the target groups and no group is a subgroup of another, since such overlap appears to confuse the MNB model. 

### Future Work

For future work, it would be interesting to try more text classification models on this data set and compare them as we did here with MNB and SVM. There are many models to try, and we need to analyze our data set more rigorously to choose which model best fits the data. 

It would also be interesting to try out different ways of extracting the features and see which method makes the most sense for our data set. Maybe we can try Word2Vec or other extraction algorithms.

We can also extend this work and apply our findings and experience to the more pressing problem of classifying Piazza questions for students (as proposed).

### Clustering (For fun)

Here we did a quick clustering analysis to ee 

In [32]:
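from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
# Assumption: twenty_train is not defined above; the cluster terms printed below
# suggest it is sklearn's 20 newsgroups training set, not the Jeopardy data.
twenty_train = fetch_20newsgroups(subset='train')
vectorizer = TfidfVectorizer(max_df=0.5,
                                 min_df=2, stop_words='english',
                                 use_idf= True)

X = vectorizer.fit_transform(twenty_train.data)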

In [33]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=20, init='k-means++', max_iter=100, n_init=1,
                verbose=False)

In [34]:
km.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=20, n_init=1, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=False)

In [35]:
from sklearn import metrics
labels = twenty_train.target
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))

Homogeneity: 0.351
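
Homogeneity ranges from 0 to 1 and is highest when every cluster contains samples from a single true class, so 0.351 indicates fairly mixed clusters. The toy labelings below (made up for illustration) show the extreme cases.

```python
from sklearn import metrics

true_labels = [0, 0, 1, 1]

# Each cluster contains members of only one class: perfectly homogeneous.
perfect = metrics.homogeneity_score(true_labels, [1, 1, 0, 0])

# Everything lumped into one cluster tells us nothing about the classes.
lumped = metrics.homogeneity_score(true_labels, [0, 0, 0, 0])

# Splitting further keeps every cluster pure, so homogeneity stays at its maximum.
over_split = metrics.homogeneity_score(true_labels, [0, 1, 2, 3])

print(perfect, lumped, over_split)
```

Because over-splitting is not penalized, homogeneity is usually read together with a complementary measure such as `metrics.completeness_score`.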


In [36]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # use get_feature_names() on sklearn versions before 1.0
for i in range(20):
        print("Cluster %d:" % i, end='')
        for ind in order_centroids[i, :10]:
            print(' %s' % terms[ind], end='')
        print()


Top terms per cluster:
Cluster 0: uiuc cso baseball year players illinois game article team cs
Cluster 1: windows window mouse dos motif server problem com using screen
Cluster 2: car access digex com cars pat ax radar engine usa
Cluster 3: people israel israeli com article don think just jews government
Cluster 4: sale 00 offer shipping condition new price asking 10 university
Cluster 5: com netcom hp article posting sun ibm nntp host stratus
Cluster 6: nasa gov space jpl larc baalke gsfc jsc higgins center
Cluster 7: gun guns firearms people weapons com militia don amendment control
Cluster 8: ca canada university bc columbia bnr usc posting host nntp
Cluster 9: team game hockey ca nhl play games players season toronto
Cluster 10: keith caltech livesey sgi solntze wpd jon schneider cco morality
Cluster 11: drive scsi ide controller drives hard disk bus floppy hd
Cluster 12: god jesus christians bible people christian christ faith believe christianity
Cluster 13: turkish armenian arme