##### Lab: Naive Bayes, K-NN classifiers, and text

In this lab we will work with a subset of the 20 newsgroups data that was mentioned during the Naive Bayes discussion in class.

## Data

The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics. The classification problem is to identify the newsgroup a post was submitted to, given the text of the post.

There are a few versions of this dataset from different sources online. Below, we use the version within scikit-learn which is already split into a train and test/eval set. For a longer introduction to this dataset, see the scikit-learn website: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html


### Let's download the data and take a look at the target names:

In [1]:
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
data.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

### For simplicity here, we will select just a few of these categories, and download the training and testing set:

In [2]:
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med', 'sci.space',
              'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey']

# load training data
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

### target_names holds the list of the requested category names:

In [3]:
# print the class names
print(twenty_train.target_names, "\n")

# the posts themselves are held in memory in the data attribute (the file names are in the filenames attribute);
# print how many posts were loaded
print(len(twenty_train.data))

['alt.atheism', 'comp.graphics', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.med', 'sci.space', 'soc.religion.christian'] 

5824


#### For speed and space efficiency reasons scikit-learn loads the target attribute as an array of integers that corresponds to the index of the category name in the target_names list. The category integer id of each sample is stored in the target attribute:

In [4]:
twenty_train.target[:25]

array([6, 9, 2, 5, 8, 0, 6, 2, 5, 6, 9, 4, 1, 6, 5, 8, 2, 1, 5, 5, 1, 3,
       3, 9, 1])

In [5]:
# print the class names for the first 25 articles
for target_id in twenty_train.target[:25]:
    print(target_id, twenty_train.target_names[target_id])

# print the first article
print("\n", twenty_train.data[:1])

6 rec.sport.hockey
9 soc.religion.christian
2 misc.forsale
5 rec.sport.baseball
8 sci.space
0 alt.atheism
6 rec.sport.hockey
2 misc.forsale
5 rec.sport.baseball
6 rec.sport.hockey
9 soc.religion.christian
4 rec.motorcycles
1 comp.graphics
6 rec.sport.hockey
5 rec.sport.baseball
8 sci.space
2 misc.forsale
1 comp.graphics
5 rec.sport.baseball
5 rec.sport.baseball
1 comp.graphics
3 rec.autos
3 rec.autos
9 soc.religion.christian
1 comp.graphics

 ["From: huot@cray.com (Tom Huot)\nSubject: Re: Ulf and all...\nLines: 29\nNntp-Posting-Host: pittpa.cray.com\nOrganization: Cray Research Inc.\nX-Newsreader: TIN [version 1.1 PL8]\n\nRichard Wernick (richard@amc.com) wrote:\n: You should be ashamed to call yourself an Ulf Samuelson fan. Anybody who plays\n: the way he does, does not belong in the NHL. There have been cheap shot artists\n: through the history of the game, but a lot of them have been talanted players.\n: Bobby Clarke, Kenny Linsemen, Pie McKenzie, Chris Chelios etc

### Convert the data to word counts and look up the column index of the word 'algorithm'
To use this data for machine learning, we need to convert the content of each string into a vector of numbers. One of the simplest encodings is word counts: for each snippet of text, count the occurrences of each word within it. Each distinct word in the training set becomes one column of the resulting matrix.
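
A minimal sketch of this encoding on a hypothetical three-sentence corpus (the sentences and the `toy_` names are made up for illustration). It also shows that `vocabulary_` maps a word to its column index, so the number of occurrences has to be read from that column of the count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, made up for illustration.
toy_docs = [
    "the algorithm works",
    "an algorithm is an algorithm",
    "no match here",
]

toy_vect = CountVectorizer()
# sparse matrix: one row per document, one column per distinct word
toy_counts = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps a word to its column index, not to a count
col = toy_vect.vocabulary_["algorithm"]

# occurrences per document, and in total, are read from that column
per_doc = toy_counts[:, col].toarray().ravel().tolist()
total = int(toy_counts[:, col].sum())

print(per_doc, total)
```

The same column lookup could be applied to the real count matrix to find how often a word occurs in the training posts.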

In [6]:
# import and use CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)

# vocabulary_ maps each word to its column index in X_train_counts (not to a word count)
count_vect.vocabulary_.get('algorithm')


10285

### Run Naive Bayes
Multinomial naive Bayes is often used in text classification, where the features are word counts or frequencies within the documents to be classified.
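
As a small illustration of the idea, the sketch below fits `MultinomialNB` on a hypothetical four-sentence corpus with two made-up classes and classifies one new sentence. The sentences, labels, and `toy_` names are assumptions for illustration only, and the default smoothing (`alpha=1.0`) is used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data, made up for illustration.
toy_train_docs = [
    "goalie saves the puck",
    "puck hits the net",
    "rocket reaches orbit",
    "orbit around the moon",
]
toy_train_labels = ["hockey", "hockey", "space", "space"]

toy_vect = CountVectorizer()
toy_X = toy_vect.fit_transform(toy_train_docs)

toy_nb = MultinomialNB().fit(toy_X, toy_train_labels)

# transform (not fit_transform) the new text so it uses the training vocabulary;
# words never seen in training, such as "crosses", are simply ignored
toy_new = toy_vect.transform(["the puck crosses the net"])

toy_pred = toy_nb.predict(toy_new)[0]
toy_proba = dict(zip(toy_nb.classes_, toy_nb.predict_proba(toy_new)[0]))

print(toy_pred, toy_proba)
```

The classifier favors the class whose training documents used the words "puck" and "net" more often, which is the same mechanism at work on the newsgroup posts.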

In [7]:
# Import Multinomial NB (this is a good Naive Bayes classifier for text) and other libraries to help with analysis

from sklearn.naive_bayes import MultinomialNB
import numpy as np
from sklearn import metrics

# fit classifier using word counts
clf_1 = MultinomialNB().fit(X_train_counts, twenty_train.target)

# load the test data set and convert to word counts
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

# use .transform (not .fit_transform) so the test matrix reuses the vocabulary, and therefore the columns, learned from the training set
X_test_counts = count_vect.transform(twenty_test.data)

# make predictions on test data
predicted = clf_1.predict(X_test_counts)

# print accuracy
print (np.mean(predicted == twenty_test.target)) 

# print precision and recall statistics
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

# print confusion matrix
print(metrics.confusion_matrix(twenty_test.target, predicted))

0.9223426212590299
                        precision    recall  f1-score   support

           alt.atheism       0.91      0.87      0.89       319
         comp.graphics       0.90      0.91      0.90       389
          misc.forsale       0.95      0.85      0.90       390
             rec.autos       0.87      0.95      0.91       396
       rec.motorcycles       0.97      0.95      0.96       398
    rec.sport.baseball       0.97      0.93      0.95       397
      rec.sport.hockey       0.95      0.97      0.96       399
               sci.med       0.95      0.86      0.90       396
             sci.space       0.92      0.94      0.93       394
soc.religion.christian       0.87      0.97      0.92       398

              accuracy                           0.92      3876
             macro avg       0.92      0.92      0.92      3876
          weighted avg       0.92      0.92      0.92      3876

[[276   1   1   2   1   1   2   2   3  30]
 [  5 354   2   5   0   2   2   5  11  

### Convert the data to a TF-IDF representation and run Naive Bayes
Raw word counts can put too much weight on words that appear very frequently across all documents, which can be sub-optimal for some classification algorithms. One way to address this is term frequency-inverse document frequency (TF-IDF), which down-weights each word's count according to how many documents contain that word.
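
To make the weighting concrete, here is a hedged sketch on a hypothetical three-sentence corpus in which the word "space" appears in every document and "shuttle" in only one. The corpus and the `toy_` names are made up; `TfidfTransformer` is used with its default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus: "space" occurs in every document, "shuttle" in one.
toy_docs = ["space shuttle launch", "space hockey game", "space car engine"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

toy_tfidf = TfidfTransformer()
toy_weights = toy_tfidf.fit_transform(toy_counts)

# idf_ holds one inverse-document-frequency weight per column (word)
idf = {word: float(toy_tfidf.idf_[i]) for word, i in toy_vect.vocabulary_.items()}

# TF-IDF weights of the first document
first_row = toy_weights[0].toarray().ravel()
w_space = float(first_row[toy_vect.vocabulary_["space"]])
w_shuttle = float(first_row[toy_vect.vocabulary_["shuttle"]])

print(idf["space"], idf["shuttle"])
print(w_space, w_shuttle)
```

Both words occur once in the first document, yet "space" ends up with the smaller weight because it carries no information about which document it came from.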

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf_2 = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# make predictions on test data
predicted = clf_2.predict(X_test_tfidf)

# print accuracy
print (np.mean(predicted == twenty_test.target)) 

# print precision and recall statistics
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

# print confusion matrix
print(metrics.confusion_matrix(twenty_test.target, predicted))

0.868937048503612
                        precision    recall  f1-score   support

           alt.atheism       0.97      0.55      0.70       319
         comp.graphics       0.93      0.83      0.88       389
          misc.forsale       0.96      0.83      0.89       390
             rec.autos       0.88      0.94      0.91       396
       rec.motorcycles       0.94      0.94      0.94       398
    rec.sport.baseball       0.94      0.90      0.92       397
      rec.sport.hockey       0.92      0.97      0.94       399
               sci.med       0.94      0.77      0.85       396
             sci.space       0.92      0.91      0.91       394
soc.religion.christian       0.58      0.98      0.73       398

              accuracy                           0.87      3876
             macro avg       0.90      0.86      0.87      3876
          weighted avg       0.90      0.87      0.87      3876

[[175   1   0   0   2   1   1   6   3 130]
 [  1 322   2   9   3   7   2   1  10  3

### Create a pipeline for Naive Bayes for TF-IDF and rerun experiments
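A scikit-learn Pipeline chains several transformers together with a final estimator, so the vectorizer, the TF-IDF transformer, and the classifier can be fit and applied with a single call.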

In [9]:
from sklearn.pipeline import Pipeline
clf_3 = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('clf', MultinomialNB()),
])

clf_3.fit(twenty_train.data, twenty_train.target)

# make predictions on test data
predicted = clf_3.predict(twenty_test.data)

# print accuracy
print (np.mean(predicted == twenty_test.target)) 

# print precision and recall statistics
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

# print confusion matrix
print(metrics.confusion_matrix(twenty_test.target, predicted))

0.868937048503612
                        precision    recall  f1-score   support

           alt.atheism       0.97      0.55      0.70       319
         comp.graphics       0.93      0.83      0.88       389
          misc.forsale       0.96      0.83      0.89       390
             rec.autos       0.88      0.94      0.91       396
       rec.motorcycles       0.94      0.94      0.94       398
    rec.sport.baseball       0.94      0.90      0.92       397
      rec.sport.hockey       0.92      0.97      0.94       399
               sci.med       0.94      0.77      0.85       396
             sci.space       0.92      0.91      0.91       394
soc.religion.christian       0.58      0.98      0.73       398

              accuracy                           0.87      3876
             macro avg       0.90      0.86      0.87      3876
          weighted avg       0.90      0.87      0.87      3876

[[175   1   0   0   2   1   1   6   3 130]
 [  1 322   2   9   3   7   2   1  10  3

###  We can now determine the category for any string, using the predict() method of this pipeline. Here's a quick utility function that will return the prediction for a single string:

In [10]:
def predict_category(s, train=twenty_train, model=clf_3):
    pred = model.predict([s])
    return train.target_names[pred[0]]

In [11]:
predict_category('sending a payload to the ISS')

'sci.space'

In [12]:
predict_category('discussing islam vs atheism')

'alt.atheism'

In [13]:
predict_category('determining the screen resolution')

'comp.graphics'

### Create and test a nearest-neighbor classifier using word counts and 5 neighbors (default)

In [14]:
# k-nearest neighbors classifier
from sklearn import neighbors

clf_5 = neighbors.KNeighborsClassifier(n_neighbors=5)  # 5 is the default; stated explicitly for clarity

# fit the classifier on the training word counts
clf_5.fit(X_train_counts, twenty_train.target)

predicted = clf_5.predict(X_test_counts)

print (np.mean(predicted == twenty_test.target)) 

print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

print(metrics.confusion_matrix(twenty_test.target, predicted))

0.4538183694530444
                        precision    recall  f1-score   support

           alt.atheism       0.46      0.53      0.49       319
         comp.graphics       0.25      0.50      0.34       389
          misc.forsale       0.72      0.48      0.58       390
             rec.autos       0.35      0.40      0.37       396
       rec.motorcycles       0.74      0.49      0.59       398
    rec.sport.baseball       0.46      0.36      0.40       397
      rec.sport.hockey       0.50      0.52      0.51       399
               sci.med       0.53      0.26      0.35       396
             sci.space       0.84      0.36      0.50       394
soc.religion.christian       0.38      0.66      0.48       398

              accuracy                           0.45      3876
             macro avg       0.53      0.46      0.46      3876
          weighted avg       0.53      0.45      0.46      3876

[[168  23   6  16   4  13   9   9   2  69]
 [ 19 195  15  36   8  15  20   7   8  

**<font color="red" size=5> Create and test a nearest-neighbor classifier using TF-IDF vectors and 5 neighbors (default)</font>**

In [16]:
# TF-IDF features with a 5-nearest-neighbor classifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import neighbors

tfidf_transformer = TfidfTransformer()

# fit the IDF weights on the training counts only, then reuse them for the test counts
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

clf_x = neighbors.KNeighborsClassifier(n_neighbors=5)
clf_x.fit(X_train_tfidf, twenty_train.target)
predicted = clf_x.predict(X_test_tfidf)

# print accuracy
print(np.mean(predicted == twenty_test.target))

print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

print(metrics.confusion_matrix(twenty_test.target, predicted))

0.7647058823529411
                        precision    recall  f1-score   support

           alt.atheism       0.52      0.85      0.64       319
         comp.graphics       0.74      0.73      0.73       389
          misc.forsale       0.76      0.66      0.71       390
             rec.autos       0.80      0.79      0.79       396
       rec.motorcycles       0.84      0.83      0.83       398
    rec.sport.baseball       0.82      0.78      0.80       397
      rec.sport.hockey       0.86      0.88      0.87       399
               sci.med       0.89      0.54      0.67       396
             sci.space       0.86      0.77      0.81       394
soc.religion.christian       0.71      0.84      0.77       398

              accuracy                           0.76      3876
             macro avg       0.78      0.77      0.76      3876
          weighted avg       0.78      0.76      0.77      3876

[[270   2   5   1   0   1   0   7   5  28]
 [ 35 285  13  14   9   8   5   3   9  

**<font color="red" size=5> Using TF-IDF vectors, write code that uses grid search to select the number of neighbors</font>**<br>
<font size=3>Use grid search to assess n_neighbors values of 1, 3, 5 and weights values of 'uniform' and 'distance'</font>

In [25]:
from sklearn.model_selection import GridSearchCV
from sklearn import neighbors
# hyperparameter grid to search over
gs_params = {"n_neighbors": [1,3,5],
              "weights": ['uniform', 'distance']  
             }
est = neighbors.KNeighborsClassifier()
gs_clf = GridSearchCV(estimator=est, param_grid=gs_params)
gs_results = gs_clf.fit(X_train_tfidf, twenty_train.target)

print(gs_results.best_params_)

print(gs_results.best_estimator_)

best_clf = gs_results.best_estimator_
gs_predicted = gs_clf.predict(X_test_tfidf)

print (np.mean(gs_predicted == twenty_test.target)) 

print(metrics.classification_report(twenty_test.target, gs_predicted,
    target_names=twenty_test.target_names))

print(metrics.confusion_matrix(twenty_test.target, gs_predicted))

# assess the best classifier the same way as done above

{'n_neighbors': 1, 'weights': 'uniform'}
KNeighborsClassifier(n_neighbors=1)
0.7763157894736842
                        precision    recall  f1-score   support

           alt.atheism       0.64      0.84      0.73       319
         comp.graphics       0.71      0.71      0.71       389
          misc.forsale       0.76      0.64      0.70       390
             rec.autos       0.82      0.75      0.78       396
       rec.motorcycles       0.83      0.86      0.84       398
    rec.sport.baseball       0.79      0.79      0.79       397
      rec.sport.hockey       0.83      0.86      0.85       399
               sci.med       0.84      0.67      0.75       396
             sci.space       0.82      0.85      0.83       394
soc.religion.christian       0.74      0.80      0.77       398

              accuracy                           0.78      3876
             macro avg       0.78      0.78      0.77      3876
          weighted avg       0.78      0.78      0.78      3876

[[267
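
The grid search above reports only the winning combination. As a hedged sketch of how every combination could be inspected, the block below runs the same parameter grid on a synthetic dataset from `make_classification` (a stand-in, not the newsgroup data) and lists the mean cross-validated accuracy of each setting from `cv_results_`. The names `X_toy`, `y_toy`, and `toy_search` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration, not the newsgroup posts).
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

toy_params = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}

# cv=5 means each setting is scored by 5-fold cross-validation on the training data
toy_search = GridSearchCV(KNeighborsClassifier(), toy_params, cv=5)
toy_search.fit(X_toy, y_toy)

# pair every parameter combination with its mean cross-validated accuracy
toy_rows = sorted(
    zip(toy_search.cv_results_["mean_test_score"], toy_search.cv_results_["params"]),
    key=lambda row: row[0],
    reverse=True,
)
for score, params in toy_rows:
    print(round(float(score), 3), params)

print("best:", toy_search.best_params_, round(float(toy_search.best_score_), 3))
```

Note that `best_params_` is chosen from cross-validation scores on the training data only; the test set is not involved until `predict` is called.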

**<font color="red" size=5> Use GridSearchCV to compare the performance of count vectors to TF-IDF vectors on just the comp.* subset of newsgroups</font>**

In [30]:
# insert code to read in appropriate newsgroup data here
cats = ['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x']

our_train = fetch_20newsgroups(subset='train', categories=cats, shuffle=True, random_state=42)
our_train.target[:25]
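# NOTE: p is a label value (0-4), so our_train.target[p] looks up the label of article number p,
# not the label of each of the first 25 articles; iterating with
# `for target_id in our_train.target[:25]:` would print the intended listing
for p in our_train.target[:25]:
    target_id = our_train.target[p]
    print(target_id, our_train.target_names[target_id])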

# print the first article
print("\n", our_train.data[:1])

2 comp.sys.ibm.pc.hardware
1 comp.os.ms-windows.misc
3 comp.sys.mac.hardware
3 comp.sys.mac.hardware
1 comp.os.ms-windows.misc
4 comp.windows.x
3 comp.sys.mac.hardware
1 comp.os.ms-windows.misc
1 comp.os.ms-windows.misc
4 comp.windows.x
4 comp.windows.x
1 comp.os.ms-windows.misc
1 comp.os.ms-windows.misc
2 comp.sys.ibm.pc.hardware
2 comp.sys.ibm.pc.hardware
1 comp.os.ms-windows.misc
2 comp.sys.ibm.pc.hardware
2 comp.sys.ibm.pc.hardware
1 comp.os.ms-windows.misc
2 comp.sys.ibm.pc.hardware
3 comp.sys.mac.hardware
1 comp.os.ms-windows.misc
2 comp.sys.ibm.pc.hardware
2 comp.sys.ibm.pc.hardware
4 comp.windows.x

 ["From: lemons@cadsys.enet.dec.com\nSubject: Xremote into X11R6?\nReply-To: lemons@cadsys.enet.dec.com ()\nOrganization: Digital Equipment Corporation\nLines: 12\nX-Newsreader: mxrn 6.18\n\n\nHi!\n\nI remember reading (or hallucinating) that NCD's PC-Xremote functionality had \nbeen given, by NCD, to MIT for inclusion in X11R6.  Is this true?  If so,\n(set mode/cheap) can I just wa

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

ct_vec = CountVectorizer()
X_Train_counts = ct_vec.fit_transform(our_train.data)

# vocabulary_ maps each word to its column index in X_Train_counts (not to a word count)
ct_vec.vocabulary_.get('algorithm')



15462

In [37]:
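# NOTE: `categories` is the ten-newsgroup list from In [2], not the comp.* list `cats`, so this test set
# contains newsgroups that the comp.* training set does not and its label ids do not match the training labels;
# passing categories=cats would load the matching comp.* test set
our_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)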

# note that I am using .transform instead of .fit_transform. this keeps the columns the same as the training set
X_Test_counts = ct_vec.transform(our_test.data)

print("\n", our_test.data[:1])


 ['From: donn@carson.u.washington.edu (Donn Cave)\nSubject: Re: Anyone know use "rayshade" out there?\nOrganization: University of Washington\nLines: 13\nNNTP-Posting-Host: carson.u.washington.edu\nKeywords: rayshade, uw.\n\nfineman@stein2.u.washington.edu (Twixt your toes) writes:\n\n| I\'m using "rayshade" on the u.w. computers here, and i\'d like input\n| from other users, and perhaps swap some ideas.  I could post\n| uuencoded .gifs here, or .ray code, if anyone\'s interested.  I\'m having\n| trouble coming up with colors that are metallic (i.e. brass, steel)\n| from the RGB values.\n\nSorry, I\'m not a rayshade user - but hey, it looks like this group could\nuse some traffic.  My guess is that "metallic" isn\'t a color, in the RGB\nsense.  Rather, it\'s a matter of how the surface reflects light.  I\'m not\nsure what property metallic materials have, that makes them recognizable\nas such, but I\'m pretty sure any color material can look metallic.\n']


In [38]:
from sklearn import neighbors

clf_8 = neighbors.KNeighborsClassifier(n_neighbors=5)  # 5 is the default; stated explicitly for clarity

# fit the k-nearest-neighbors classifier on the training counts
clf_8.fit(X_Train_counts, our_train.target)

predicted = clf_8.predict(X_Test_counts)

print(np.mean(predicted == our_test.target))

print(metrics.classification_report(our_test.target, predicted,
    target_names=our_test.target_names))

print(metrics.confusion_matrix(our_test.target, predicted))

0.09236326109391124
                        precision    recall  f1-score   support

           alt.atheism       0.08      0.44      0.14       319
         comp.graphics       0.10      0.14      0.12       389
          misc.forsale       0.09      0.09      0.09       390
             rec.autos       0.11      0.28      0.16       396
       rec.motorcycles       0.10      0.05      0.06       398
    rec.sport.baseball       0.00      0.00      0.00       397
      rec.sport.hockey       0.00      0.00      0.00       399
               sci.med       0.00      0.00      0.00       396
             sci.space       0.00      0.00      0.00       394
soc.religion.christian       0.00      0.00      0.00       398

              accuracy                           0.09      3876
             macro avg       0.05      0.10      0.06      3876
          weighted avg       0.05      0.09      0.05      3876

[[141  56  32  72  18   0   0   0   0   0]
 [242  53  21  49  24   0   0   0   0 



In [40]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import neighbors

tfidf_transformer = TfidfTransformer()

X_Train_tfidf = tfidf_transformer.fit_transform(X_Train_counts)
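# NOTE: fit_transform re-estimates the IDF weights from the test counts;
# tfidf_transformer.transform(X_Test_counts) would reuse the weights learned from the training set
X_Test_tfidf = tfidf_transformer.fit_transform(X_Test_counts)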

clf_x=neighbors.KNeighborsClassifier(n_neighbors = 5)
clf_x.fit(X_Train_tfidf, our_train.target)
predicted = clf_x.predict(X_Test_tfidf)

print (np.mean(predicted == our_test.target)) 

print(metrics.classification_report(our_test.target, predicted,
    target_names=our_test.target_names))

print(metrics.confusion_matrix(our_test.target, predicted))

0.11145510835913312
                        precision    recall  f1-score   support

           alt.atheism       0.05      0.14      0.08       319
         comp.graphics       0.08      0.08      0.08       389
          misc.forsale       0.26      0.32      0.29       390
             rec.autos       0.15      0.15      0.15       396
       rec.motorcycles       0.10      0.43      0.16       398
    rec.sport.baseball       0.00      0.00      0.00       397
      rec.sport.hockey       0.00      0.00      0.00       399
               sci.med       0.00      0.00      0.00       396
             sci.space       0.00      0.00      0.00       394
soc.religion.christian       0.00      0.00      0.00       398

              accuracy                           0.11      3876
             macro avg       0.06      0.11      0.08      3876
          weighted avg       0.06      0.11      0.08      3876

[[ 46  16  22  25 210   0   0   0   0   0]
 [254  31  19  24  61   0   0   0   0 



In [42]:
# insert code to use gridsearchcv to assess count vectors
from sklearn.model_selection import GridSearchCV
from sklearn import neighbors
# hyperparameter grid to search over
gs_params = {"n_neighbors": [1,3,5],
              "weights": ['uniform', 'distance']  
             }
est = neighbors.KNeighborsClassifier()
gs_clf = GridSearchCV(estimator=est, param_grid=gs_params)
gs_results = gs_clf.fit(X_Train_counts, our_train.target)

print(gs_results.best_params_)

print(gs_results.best_estimator_)

best_clf = gs_results.best_estimator_
gs_predicted = gs_clf.predict(X_Test_counts)

print (np.mean(gs_predicted == our_test.target)) 

print(metrics.classification_report(our_test.target, gs_predicted,
    target_names=our_test.target_names))

print(metrics.confusion_matrix(our_test.target, gs_predicted))

{'n_neighbors': 1, 'weights': 'uniform'}
KNeighborsClassifier(n_neighbors=1)
0.09313725490196079
                        precision    recall  f1-score   support

           alt.atheism       0.08      0.33      0.13       319
         comp.graphics       0.09      0.12      0.10       389
          misc.forsale       0.10      0.13      0.11       390
             rec.autos       0.11      0.30      0.16       396
       rec.motorcycles       0.10      0.10      0.10       398
    rec.sport.baseball       0.00      0.00      0.00       397
      rec.sport.hockey       0.00      0.00      0.00       399
               sci.med       0.00      0.00      0.00       396
             sci.space       0.00      0.00      0.00       394
soc.religion.christian       0.00      0.00      0.00       398

              accuracy                           0.09      3876
             macro avg       0.05      0.10      0.06      3876
          weighted avg       0.05      0.09      0.06      3876

[[10



In [43]:
# insert code to use gridsearchcv to assess tfidf vectors
from sklearn.model_selection import GridSearchCV
from sklearn import neighbors
# hyperparameter grid to search over
gs_params = {"n_neighbors": [1,3,5],
              "weights": ['uniform', 'distance']  
             }
est = neighbors.KNeighborsClassifier()
gs_clf = GridSearchCV(estimator=est, param_grid=gs_params)
gs_results = gs_clf.fit(X_Train_tfidf, our_train.target)

print(gs_results.best_params_)

print(gs_results.best_estimator_)

best_clf = gs_results.best_estimator_
gs_predicted = gs_clf.predict(X_Test_tfidf)

print (np.mean(gs_predicted == our_test.target)) 

print(metrics.classification_report(our_test.target, gs_predicted,
    target_names=our_test.target_names))

print(metrics.confusion_matrix(our_test.target, gs_predicted))

{'n_neighbors': 1, 'weights': 'uniform'}
KNeighborsClassifier(n_neighbors=1)
0.10036119711042311
                        precision    recall  f1-score   support

           alt.atheism       0.06      0.14      0.08       319
         comp.graphics       0.07      0.09      0.08       389
          misc.forsale       0.18      0.25      0.21       390
             rec.autos       0.12      0.16      0.14       396
       rec.motorcycles       0.10      0.37      0.15       398
    rec.sport.baseball       0.00      0.00      0.00       397
      rec.sport.hockey       0.00      0.00      0.00       399
               sci.med       0.00      0.00      0.00       396
             sci.space       0.00      0.00      0.00       394
soc.religion.christian       0.00      0.00      0.00       398

              accuracy                           0.10      3876
             macro avg       0.05      0.10      0.07      3876
          weighted avg       0.05      0.10      0.07      3876

[[ 4
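
One way the count-versus-TF-IDF comparison could be set up in a single grid search is sketched below, under stated assumptions. It wraps the vectorizer, an optional TF-IDF step, and the k-NN classifier in a Pipeline, and lets the grid switch the TF-IDF step between `TfidfTransformer()` and the string `'passthrough'` (which skips the step, assuming a scikit-learn version that supports it). The twelve-sentence corpus, its two labels, and all `toy_` names are hypothetical stand-ins for the comp.* posts so that the block runs without downloading data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the comp.* posts (0 = graphics, 1 = hardware).
toy_docs = [
    "render the image with ray tracing",
    "image rendering and shading colors",
    "ray tracing renders shiny surfaces",
    "convert the image to gif format",
    "shading and colors in the rendered image",
    "gif image viewer for graphics files",
    "the disk drive controller failed",
    "install a new scsi disk drive",
    "motherboard and memory upgrade",
    "memory chips for the motherboard",
    "scsi controller card for the drive",
    "upgrade the disk and the memory",
]
toy_labels = [0] * 6 + [1] * 6

toy_pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("knn", KNeighborsClassifier()),
])

toy_grid = {
    # 'passthrough' skips the step, so the classifier sees raw counts
    "tfidf": [TfidfTransformer(), "passthrough"],
    "knn__n_neighbors": [1, 3, 5],
    "knn__weights": ["uniform", "distance"],
}

# the pipeline is refit inside every fold, so IDF weights never see held-out data
toy_search = GridSearchCV(toy_pipe, toy_grid, cv=3)
toy_search.fit(toy_docs, toy_labels)

# best mean cross-validated accuracy for each representation
best_by_repr = {"counts": 0.0, "tfidf": 0.0}
for score, params in zip(toy_search.cv_results_["mean_test_score"],
                         toy_search.cv_results_["params"]):
    key = "counts" if isinstance(params["tfidf"], str) else "tfidf"
    best_by_repr[key] = max(best_by_repr[key], float(score))

print(best_by_repr)
print(toy_search.best_params_)
```

On the real data, the same pipeline would be fit on the comp.* training posts and evaluated with `predict` on test posts loaded with the same category list, so that the label ids of the two sets agree.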

