Similar to my Logistic Regression notebook, in this notebook I will fit a k-nearest neighbors model on count vectorized data, TFIDF data, and count vectorized data with sentiment analysis in an attempt to create a model that can differentiate between the subreddits. If I am successful, this will provide evidence that Democrats and Republicans prefer to discuss different topics.

In [1]:
import pandas as pd
import numpy as np
import time
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text as stop_words  # text.ENGLISH_STOP_WORDS is available across scikit-learn versions



### Loading my train/test split data

In [48]:
reddit = pd.read_csv('../Data/reddit.csv', index_col=0)
X_train = pd.read_csv('../Data/X_train.csv', header = None, index_col=0)
X_test = pd.read_csv('../Data/X_test.csv', header = None, index_col=0)
y_train = pd.read_csv('../Data/y_train.csv', header = None, index_col=0)
y_test = pd.read_csv('../Data/y_test.csv', header = None, index_col=0)
X_train_sen = pd.read_csv('../Data/X_train_sen.csv', index_col=0)
X_test_sen = pd.read_csv('../Data/X_test_sen.csv', index_col=0)
y_train_sen = pd.read_csv('../Data/y_train_sen.csv', header = None, index_col=0)
y_test_sen = pd.read_csv('../Data/y_test_sen.csv', header = None, index_col=0)
reddit.fillna('', inplace=True)

Creating a list of custom stop words that includes numbers and some terms found in URLs

In [2]:
custom_stopwords = list(stop_words.ENGLISH_STOP_WORDS)
custom_stopwords.extend(['10', '12', '13', '14', '15', '18', '25','200', '000','https', 'com', 'youtube', 'www'])

### KNN on TFIDF data

TFIDF stands for term frequency-inverse document frequency. It weights each word by how often it appears in a post, scaled down by how many posts in the corpus contain that word, so words that are common everywhere count for less. I will vectorize my data using TFIDF, then fit a k-nearest neighbors classifier on the result.
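To make the weighting concrete, here is a minimal sketch that computes TFIDF weights by hand for three made-up posts. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and it skips the final normalization step. The toy posts and variable names are illustrative only and are not taken from the reddit data.

```python
import math
from collections import Counter

# Three made-up, already-tokenized posts (illustrative only)
posts = [
    ["tax", "vote", "vote"],
    ["tax", "border"],
    ["tax", "health"],
]
n_posts = len(posts)

# Document frequency: the number of posts that contain each term
doc_freq = Counter(term for post in posts for term in set(post))

# Smoothed inverse document frequency (assumed to match scikit-learn's default)
idf = {term: math.log((1 + n_posts) / (1 + df)) + 1 for term, df in doc_freq.items()}

# TFIDF weight = raw count in the post * idf of the term (no normalization here)
tfidf = [
    {term: count * idf[term] for term, count in Counter(post).items()}
    for post in posts
]

print(tfidf[0])
```

Because "tax" appears in every post, it gets the smallest possible idf, while "vote" appears in only one post and ends up with a larger weight in that post.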

In [11]:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=custom_stopwords)),
    ('knn', KNeighborsClassifier())
])

After extensive grid searching, these are the best parameters I've found to maximize accuracy:

tfidf_min_df: keep every word that appears in at least 1 post (no rare words are filtered out)

tfidf_max_df: ignore words that are found in over $65 \%$ of the documents

tfidf_norm: normalize each post's vector with an l1 norm, so the absolute values of its weights sum to 1

knn_n_neighbors: the 50 closest neighbors to a point get to vote on what that point's class is

knn_weights: weight each neighbor's vote by the inverse of its distance, so closer neighbors count more

knn_metric: measure distance with the manhattan metric
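The voting described by these parameters can be sketched in plain Python. This is a simplified illustration under my own assumptions, not scikit-learn's implementation: the helper names `manhattan` and `knn_predict` and the toy points are hypothetical, and zero-distance neighbors are handled by letting only exact matches vote.

```python
from collections import defaultdict

def manhattan(a, b):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(points, labels, query, k, weights="uniform"):
    """Predict the label of `query` from its k nearest training points."""
    # Sort training points by distance to the query and keep the k closest
    neighbors = sorted(
        ((manhattan(p, query), label) for p, label in zip(points, labels)),
        key=lambda pair: pair[0],
    )[:k]
    votes = defaultdict(float)
    if weights == "distance":
        exact = [label for dist, label in neighbors if dist == 0]
        if exact:
            # Exact matches dominate, so only they vote
            for label in exact:
                votes[label] += 1.0
        else:
            # Closer neighbors get larger votes (inverse distance)
            for dist, label in neighbors:
                votes[label] += 1.0 / dist
    else:
        # Every neighbor gets one vote
        for _, label in neighbors:
            votes[label] += 1.0
    return max(votes, key=votes.get)

# Toy training set: one nearby "dem" point and two farther "rep" points
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]
labels = ["dem", "rep", "rep"]
query = (0.5, 0.5)

print(knn_predict(points, labels, query, k=3, weights="uniform"))
print(knn_predict(points, labels, query, k=3, weights="distance"))
```

With uniform weights the two "rep" points outvote the single "dem" point, but with distance weights the much closer "dem" point wins, which is why the choice of weighting can change predictions.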

In [24]:
params = {
    'tfidf__min_df': [1,3,5,7],
    'tfidf__max_df': [.65,.70,.75,.85,.90],
    'tfidf__norm': ['l1','l2'],
    'knn__n_neighbors': [40,50,60,70],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['minkowski', 'euclidean', 'manhattan']
}

In [25]:
gs = GridSearchCV(pipe, params)

In [26]:
gs.fit(X_train[1], y_train[1])

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...owski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'tfidf__min_df': [1, 3, 5, 7], 'tfidf__max_df': [0.65, 0.7, 0.75, 0.85, 0.9], 'tfidf__norm': ['l1', 'l2'], 'knn__n_neighbors': [40, 50, 60, 70], 'knn__weights': ['uniform', 'distance'], 'knn__metric': ['minkowski', 'euclidean', 'manhattan']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

### Scoring my model

In [30]:
gs.score(X_train[1],y_train[1])

0.9887323943661972

In [31]:
gs.score(X_test[1],y_test[1])

0.6028169014084507

In [32]:
gs.best_params_

{'knn__metric': 'manhattan',
 'knn__n_neighbors': 50,
 'knn__weights': 'distance',
 'tfidf__max_df': 0.65,
 'tfidf__min_df': 1,
 'tfidf__norm': 'l1'}

With $98.9 \%$ training accuracy but only $60.3 \%$ test accuracy, this model did not score better than any of my logistic regression models and is more overfit.
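Part of the gap between training and test accuracy may come from the distance weighting itself. When a model is scored on its own training data, each point is its own nearest neighbor at distance zero, so its own label dominates the vote. The sketch below illustrates this on synthetic data where the labels have nothing to do with the features; the data, seed, and variable names are assumptions for illustration and are unrelated to the reddit posts.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 2)             # random features
y = rng.randint(0, 2, size=200)  # labels that are unrelated to the features

uniform_knn = KNeighborsClassifier(n_neighbors=50, weights="uniform").fit(X, y)
distance_knn = KNeighborsClassifier(n_neighbors=50, weights="distance").fit(X, y)

# Score each model on the same data it was fit on
uniform_train_acc = uniform_knn.score(X, y)
distance_train_acc = distance_knn.score(X, y)
print(uniform_train_acc, distance_train_acc)
```

Even though there is nothing to learn here, the distance-weighted model scores perfectly on its training data. For models with weights set to distance, the test score (or the cross-validated `gs.best_score_`) is therefore the more informative number.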

### KNN on count vectorized data

Now I'll vectorize my data by counting the appearances of each term in each post. Then I'll fit a k-nearest neighbors model to said data.

In [33]:
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words=custom_stopwords)),
    ('knn', KNeighborsClassifier())
])

After extensive grid searching I found the best parameters to optimize accuracy would be:

cvec_min_df: Ignore words that don't appear in at least 3 documents

cvec_max_df: Ignore words that appear in over $85 \%$ of the documents

cvec_ngram_range: (1,2) Consider each word individually and pairs of words

knn_n_neighbors: 40 of the closest neighbors to a point get to vote on that point's class

knn_weights: each point has a uniform weight

knn_metric: distance is measured using the minkowski metric (with the default p=2, this is the same as euclidean distance)
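To show what the n-gram range and minimum document frequency settings do to the vocabulary, here is a small sketch on three made-up posts. It uses `min_df=2` instead of 3 because the toy corpus only has three posts; the posts and variable names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up posts (illustrative only)
posts = ["gun control bill", "gun control debate", "tax bill"]

# Count single words and word pairs, keeping terms found in at least 2 posts
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2)
counts = vectorizer.fit_transform(posts)

kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
print(counts.toarray())
```

Only "bill", "control", "gun", and the pair "gun control" survive, because every other word or pair appears in a single post.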

In [37]:
params = {
    'cvec__min_df': [1,3,5],
    'cvec__max_df': [.85,.9,.95,1.0],
    'cvec__ngram_range': [(1,1),(1,2)],
    'knn__n_neighbors': [30,40,50],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['minkowski', 'euclidean', 'manhattan']
}

In [38]:
gs = GridSearchCV(pipe, params)

In [39]:
gs.fit(X_train[1], y_train[1])

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['ourselves...owski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'cvec__min_df': [1, 3, 5], 'cvec__max_df': [0.85, 0.9, 0.95, 1.0], 'cvec__ngram_range': [(1, 1), (1, 2)], 'knn__n_neighbors': [30, 40, 50], 'knn__weights': ['uniform', 'distance'], 'knn__metric': ['minkowski', 'euclidean', 'manhattan']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

### Scoring my model

In [40]:
gs.score(X_train[1],y_train[1])

0.6046948356807512

In [41]:
gs.score(X_test[1],y_test[1])

0.5633802816901409

In [42]:
gs.best_params_

{'cvec__max_df': 0.85,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 2),
 'knn__metric': 'minkowski',
 'knn__n_neighbors': 40,
 'knn__weights': 'uniform'}

My model scored worse than my initial one, with a test accuracy of $56.3 \%$. However, it is not overfit: its training accuracy is only $60.5 \%$.

### KNN on count vectorized data with sentiment analysis

Let's see if adding features for sentiment analysis will improve my model's accuracy in classifying the subreddits

In [44]:
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words=custom_stopwords)),
    ('knn', KNeighborsClassifier())
])

After extensive grid searching, I found the best parameters for my model to be as follows:

n_neighbors: the 30 closest neighbors get to vote on each point's class

weights: each neighbor's vote is weighted by distance

metric: distance is measured using the minkowski metric (with the default p=2, this is the same as euclidean distance)

In [53]:
params = {
    'n_neighbors': [30,40,50],
    'weights': ['uniform', 'distance'],
    'metric': ['minkowski', 'euclidean', 'manhattan']
}

In [46]:
gs = GridSearchCV(KNeighborsClassifier(), params)

In [49]:
gs.fit(X_train_sen,y_train_sen[1])

GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [30, 40, 50], 'weights': ['uniform', 'distance'], 'metric': ['minkowski', 'euclidean', 'manhattan']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [50]:
gs.score(X_train_sen,y_train_sen[1])

0.9624413145539906

In [51]:
gs.score(X_test_sen,y_test_sen[1])

0.5492957746478874

In [52]:
gs.best_params_

{'metric': 'minkowski', 'n_neighbors': 30, 'weights': 'distance'}

My model scored a record low of $54.9 \%$ on the test set, which is only about 5 percentage points better than a coin flip. This is surprising given that we included sentiment analysis. With a training accuracy of $96.2 \%$, my model is also severely overfit in this case
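The coin-flip comparison assumes both subreddits are equally represented in the test set. A quick way to check is to compare against a majority-class baseline, meaning the accuracy of always predicting the most common label. The sketch below uses hypothetical labels, not the real values in `y_test_sen`, and the helper name `majority_baseline_accuracy` is my own.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical test labels (not the real y_test_sen values)
example_labels = ["democrats"] * 6 + ["republicans"] * 4

baseline = majority_baseline_accuracy(example_labels)
model_accuracy = 0.549  # test accuracy reported above
print(baseline, model_accuracy - baseline)
```

With the real data, passing `y_test_sen[1]` to the helper would show whether $54.9 \%$ actually beats the simplest possible classifier.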

### Conclusion

We could not create a k-nearest neighbors model that scored meaningfully better than $60 \%$ on our accuracy metric; the best one, fit on TFIDF data, reached $60.3 \%$ on the test set. Since our models had a difficult time differentiating between the two subreddits' posts, this is further evidence that Democrats and Republicans discuss similar topics. In my next modeling notebook, I will try to fit a random forest model to the same train/test split data. Random forests are an ensemble method: many decision trees are fit on resampled versions of the data, and their votes are combined to determine each class. Because combining many trees tends to reduce variance compared with a single model, a random forest may perform better on our seemingly indiscernible data.