# ***Latent Semantic Analysis: Machine Learning***
### Goals:
* Predictive modeling: classify each page's `category` from its `text`
* Pipeline 1: TFIDF, SVD, KNN
* Pipeline 2: TFIDF, MultinomialNB

### Output:
* Pipeline 1 Score (test accuracy, TFIDF + SVD + KNN): 0.5502248875562219
* Pipeline 2 Score (test accuracy, TFIDF + MultinomialNB): 0.61769115442278866

## 1. Load Data from MongoDB
* Remove duplicates

In [1]:
cd ..

/home/jovyan/dsi/assignments/p4


In [2]:
%run __init__.py



In [3]:
%matplotlib inline

In [4]:
client = pymongo.MongoClient('34.215.225.199', 27016)
db_ref = client.wiki_database
wiki_ref = db_ref.wiki_database

In [5]:
# Pull every document from the collection into a list.
wiki_data = list(wiki_ref.find({}))

In [6]:
wiki_df = pd.DataFrame(wiki_data)
wiki_df.drop_duplicates(subset=['pageid'], inplace=True)
wiki_df.head(10)

| | _id | category | pageid | text |
|---|---|---|---|---|
| 0 | 5a1506d02c74b40013488ead | Machine learning | 43385931 | multiple issuesrefimprovedatejuly footnotesdat... |
| 1 | 5a1506d22c74b40013488eae | Machine learning | 49082762 | use dmy datesdateseptember machine learn barth... |
| 2 | 5a1506d42c74b40013488eaf | Machine learning | 233488 | forthe journalmachine learn journalmachine lea... |
| 3 | 5a1506d52c74b40013488eb0 | Machine learning | 53587467 | attention outline set outline list portalconte... |
| 4 | 5a1506d62c74b40013488eb1 | Machine learning | 3771060 | accuracy paradox predictive analytic state pre... |
| 5 | 5a1506d72c74b40013488eb2 | Machine learning | 43808044 | machine learn baraction model learning abbrevi... |
| 6 | 5a1506d82c74b40013488eb3 | Machine learning | 28801798 | abouta machine learn methodactive learning con... |
| 7 | 5a1506d82c74b40013488eb4 | Machine learning | 45049676 | adversarial machine learning research field li... |
| 8 | 5a1506d92c74b40013488eb5 | Machine learning | 52642349 | infobox artist aiva nationality luxembourgish ... |
| 9 | 5a1506da2c74b40013488eb6 | Machine learning | 30511763 | multiple issuescoidatenovember expert neededwi... |


## 2a. Predictive Model

* Pipeline with TFIDF, SVD, KNN

In [7]:
X_train, X_test, y_train, y_test = train_test_split(wiki_df['text'],
                                                    wiki_df['category'],
                                                    test_size=.25)

In [8]:
pipeline_one = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD()),
    ('clf', KNeighborsClassifier())
])

In [9]:
params_one = {
    'tfidf__min_df'     : [5,10,20],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__stop_words' : ['english'],
    'clf__n_neighbors'  : [5,10,20]
}
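The size of this grid explains the "18 candidates, totalling 90 fits" line that `GridSearchCV` prints further down. As a minimal sketch using only the standard library (the notebook itself lets `GridSearchCV` enumerate the combinations), the count can be checked like this:

```python
from itertools import product

# Same grid as params_one in the notebook.
params_one = {
    'tfidf__min_df'     : [5, 10, 20],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__stop_words' : ['english'],
    'clf__n_neighbors'  : [5, 10, 20],
}

# One candidate = one value picked from each parameter list.
candidates = list(product(*params_one.values()))
n_candidates = len(candidates)   # 3 * 2 * 1 * 3
n_fits = n_candidates * 5        # cv=5 folds per candidate

print(n_candidates, n_fits)
```

The same arithmetic applied to the Naive Bayes grid (3 * 2 * 1 * 4) gives 24 candidates and 120 fits.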

In [10]:
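# NOTE: roc_auc_score, wrapped like this, does not accept a multiclass target.
# 'category' has more than two classes, so the fit below ends in the ValueError shown.
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True)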

gs_one = GridSearchCV(pipeline_one,
                  param_grid=params_one,
                  cv=5,
                  n_jobs=1,
                  verbose=1,
                  scoring=roc_auc_scorer)

In [11]:
gs_one.fit(X_train, y_train), gs_one.score(X_test, y_test)

Fitting 5 folds for each of 18 candidates, totalling 90 fits




ValueError: multiclass format is not supported
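The error is consistent with `category` holding more than two classes: `roc_auc_score`, as wrapped by `make_scorer` above, is scored on hard predictions and was written for binary targets. One possible way around it, sketched here under assumptions, is the built-in `'roc_auc_ovr'` scoring string (one-vs-rest AUC from `predict_proba`, available in scikit-learn 0.22 and later, which may be newer than the version used in this notebook). The synthetic three-class data below is a stand-in for the real features, not the wiki data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data stands in for the SVD-reduced TF-IDF features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.25, stratify=y, random_state=0)

# 'roc_auc_ovr' averages one-vs-rest AUC over classes using predict_proba.
gs = GridSearchCV(KNeighborsClassifier(),
                  param_grid={'n_neighbors': [5, 10, 20]},
                  cv=5,
                  scoring='roc_auc_ovr')
gs.fit(X_train, y_train)
test_auc = gs.score(X_test, y_test)

print(gs.best_params_, round(test_auc, 3))
```

Applied to `pipeline_one`, the same idea would mean passing `scoring='roc_auc_ovr'` in place of `roc_auc_scorer`; whether that is the metric wanted here is a judgment call the notebook does not settle.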

## 2b. Predictive Model

* Pipeline with TFIDF, NB

In [12]:
pipeline_two = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB())
])

In [13]:
params_two = {
    'tfidf__min_df'     : [5,10,20],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__stop_words' : ['english'],
    'clf__alpha'  : [1,.1,.01,.001]
}

In [14]:
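# Same scorer as in cell 10, with the same limitation: it does not accept the
# multiclass 'category' target, so this search also ends in a ValueError.
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True)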

gs_two = GridSearchCV(pipeline_two,
                  param_grid=params_two,
                  cv=5,
                  n_jobs=1,
                  verbose=1,
                  scoring=roc_auc_scorer)

In [15]:
gs_two.fit(X_train, y_train), gs_two.score(X_test, y_test)

Fitting 5 folds for each of 24 candidates, totalling 120 fits




ValueError: multiclass format is not supported
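A simpler alternative for a multiclass target is to score the search on plain accuracy, which is also what `clf.score` reports for a classifier in the step-by-step cells. The sketch below assumes a hypothetical toy corpus with three made-up categories, since the wiki data is not reproduced here. It shows the Naive Bayes pipeline running through `GridSearchCV` with `scoring='accuracy'`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with three categories; not the wiki data.
docs = (['neural network training gradient'] * 6 +
        ['invoice ledger accounting payroll'] * 6 +
        ['protein genome sequence cell'] * 6)
labels = ['ml'] * 6 + ['business'] * 6 + ['biology'] * 6

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

# Accuracy is defined for any number of classes, so no custom scorer is needed.
gs = GridSearchCV(pipe,
                  param_grid={'clf__alpha': [1, .1, .01]},
                  cv=3,
                  scoring='accuracy')
gs.fit(docs, labels)

print(gs.best_params_, gs.best_score_)
```

The toy documents are trivially separable, so the score here says nothing about the real task; the point is only that the search completes.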

## Step by Step, Pipeline 1
* TFIDF, SVD, KNN, fitted one stage at a time

In [16]:
X_train, X_test, y_train, y_test = train_test_split(wiki_df['text'],
                                                    wiki_df['category'],
                                                    test_size=.25)

In [17]:
tfidf       = TfidfVectorizer(min_df=20, ngram_range=(1,3), stop_words='english')
X_tf_train  = tfidf.fit_transform(X_train)
X_tf_test   = tfidf.transform(X_test)

In [18]:
# Reduce the TF-IDF matrix to 100 latent components (the LSA step).
svd = TruncatedSVD(n_components=100)

svd_train = svd.fit_transform(X_tf_train)
svd_test  = svd.transform(X_tf_test)
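What the SVD step does to the TF-IDF matrix can be illustrated with plain NumPy. This is a sketch on a hypothetical 4-document by 5-term matrix, not the notebook's data. `TruncatedSVD` uses a randomized solver by default, so its output can differ in sign and precision from an exact decomposition.

```python
import numpy as np

# Hypothetical 4-document x 5-term weight matrix (rows are documents).
X = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.8, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0, 1.0],
])

# Exact SVD: X = U @ diag(s) @ Vt, singular values sorted largest first.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                        # number of latent components kept
docs_k = U[:, :k] * s[:k]    # documents expressed in the k-dim latent space
X_k = docs_k @ Vt[:k]        # rank-k approximation of X

# Relative Frobenius error of the rank-k approximation.
rel_error = np.linalg.norm(X - X_k) / np.linalg.norm(X)
print(docs_k.shape, round(rel_error, 3))
```

Here `docs_k` plays the role of `svd_train`: each document becomes a short dense vector, and projecting new rows with `X_new @ Vt[:k].T` corresponds to `svd.transform`.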

In [19]:
knn = KNeighborsClassifier(n_neighbors=20)

knn.fit(svd_train, y_train)
knn.score(svd_test, y_test)

0.5502248875562219

## Step by Step, Pipeline 2
* TFIDF, MultinomialNB, fitted one stage at a time

In [20]:
X_train, X_test, y_train, y_test = train_test_split(wiki_df['text'],
                                                    wiki_df['category'],
                                                    test_size=.25)

In [21]:
tfidf       = TfidfVectorizer(min_df=20, ngram_range=(1,3), stop_words='english')
X_tf_train  = tfidf.fit_transform(X_train)
X_tf_test   = tfidf.transform(X_test)

In [22]:
clf = MultinomialNB(alpha=.01)

clf.fit(X_tf_train, y_train)
clf.score(X_tf_test, y_test)

0.61769115442278866
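Cells 16 and 20 each draw a fresh random split with no `random_state`, so the two scores above are measured on different test sets and will shift from run to run. A possible way to make the comparison like for like, sketched here on a hypothetical toy corpus with made-up category names, is to fix one stratified split and score both pipelines on it. The `random_state` values and `n_components=3` are illustrative choices, not settings from the notebook.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; in the notebook these would be
# wiki_df['text'] and wiki_df['category'].
docs = (['neural network training gradient'] * 8 +
        ['invoice ledger accounting payroll'] * 8 +
        ['protein genome sequence cell'] * 8)
labels = ['ml'] * 8 + ['business'] * 8 + ['biology'] * 8

# One fixed, stratified split shared by both pipelines.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=.25, stratify=labels, random_state=42)

pipelines = {
    'tfidf_svd_knn': Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('svd', TruncatedSVD(n_components=3, random_state=42)),
        ('clf', KNeighborsClassifier(n_neighbors=3)),
    ]),
    'tfidf_nb': Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', MultinomialNB(alpha=.01)),
    ]),
}

# Fit each pipeline on the same training set and report test accuracy.
scores = {name: pipe.fit(X_train, y_train).score(X_test, y_test)
          for name, pipe in pipelines.items()}
print(scores)
```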