## Info

This file predicts the credibility of documents using the 10 classifiers created in the Kfold_Train file. As in the training step, the topics are split into 10 validation sets of 5 topics each, and the documents are grouped by the validation set their topic belongs to. Each group of documents is scored with the classifier from the previous step that was trained without that group's topics, so no classifier has seen the topics of the documents it scores.
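
The routing from a topic to its held-out classifier can be sketched as follows. This is a minimal illustration, assuming the fold table has the fold id in column 0 and the topic id in column 1, as the later cells suggest. The toy values and the helper `model_for_topic` are hypothetical; only the file-name pattern comes from this notebook.

```python
import pandas as pd

# Toy fold table (hypothetical values): column 0 = fold id, column 1 = topic id.
splits = pd.DataFrame({0: [1, 1, 2, 2], 1: [101, 102, 103, 104]})

def model_for_topic(splits, topic_id):
    """Return the file name of the classifier whose validation fold contains topic_id."""
    fold = splits.loc[splits[1] == topic_id, 0].iloc[0]
    return 'LOGREG_10fold_%s.joblib' % fold

name = model_for_topic(splits, 103)
print(name)
```

Because each topic belongs to exactly one validation fold, the lookup returns a single classifier per topic.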

In [1]:
import pandas as pd
import os, sys, re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from collections import Counter
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.pipeline import Pipeline
from joblib import dump, load


In [6]:
infecteds = pd.read_csv('infected.txt', header=None, sep = ' ')
infecteds.columns = ['DOCID']

In [None]:
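# TREC run file: with these qrels-style names, REL holds the rank,
# COR the retrieval score, and CRE the run tag (see the bm25run.head() output).
bm25run = pd.read_csv('treceval/UWatMDS_BM25.txt', header=None, sep = ' ')
bm25run.columns = ['TID','QID','DOCID','REL','COR','CRE']
# Drop the documents listed in infected.txt.
bm25run = bm25run[~bm25run.DOCID.isin(infecteds.DOCID)]

qrels = pd.read_csv('qrels_correctness.txt', header=None, sep = ' ')
qrels.columns = ['TID','QID','DOCID','REL','COR','CRE']
qrels = qrels[~qrels.DOCID.isin(infecteds.DOCID)]
# Keep only judgments whose CRE value is 0 or 1.
qrels = qrels[qrels.CRE.isin([0,1])]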
qrels.head()
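
The run file read above follows the standard six-column TREC run layout (topic, the literal `Q0`, document id, rank, score, run tag), so reusing the qrels column names for it can mislead. A small sketch of reading such a file with descriptive names is shown here. The two input lines are made up, and the names `RANK`, `SCORE`, and `RUNTAG` are hypothetical choices, not the ones used elsewhere in this notebook.

```python
import io
import pandas as pd

# Two made-up lines in the six-column TREC run layout (not real data).
toy_run = io.StringIO(
    "1 Q0 doc-a 1 12.5 demo_run\n"
    "1 Q0 doc-b 2 11.0 demo_run\n"
)

# Descriptive names (hypothetical): topic, literal Q0, doc id, rank, score, run tag.
run = pd.read_csv(
    toy_run, header=None, sep=' ',
    names=['TID', 'Q0', 'DOCID', 'RANK', 'SCORE', 'RUNTAG'],
)
print(run)
```

The rest of this notebook keeps the original names, since later cells delete the `REL`, `COR`, and `CRE` columns by name.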

## Load documents

In [4]:
qrels.shape

(4159, 6)

In [None]:
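DOCS_DIR = '/media/ludwig/story/DecisionRunDocs/trec_decision_parts/trec_decision_docs/'
SAVE_DIR = 'model/'
docs = []
# Read the raw text of every unique document in the run, in run order.
for counter, docname in enumerate(bm25run['DOCID'].drop_duplicates(), start=1):
    try:
        with open(DOCS_DIR + docname) as fh:
            docs.append(fh.read())
    except (OSError, UnicodeDecodeError):
        # Missing or unreadable file: keep a placeholder so docs stays aligned with DOCID.
        docs.append('!DOCTYPE')
        print('there is a problem with %s (document %s)' % (docname, counter))
    if counter % 1000 == 0:
        print(counter)  # progress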

In [9]:
mapper = pd.DataFrame(bm25run.DOCID.drop_duplicates())
mapper['DOCS'] = docs

In [7]:
mapper.head()

| | DOCID | DOCS |
|---|---|---|
| 0 | clueweb12-1712wb-84-02961 | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` |
| 1 | clueweb12-1304wb-88-12518 | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` |
| 2 | clueweb12-1711wb-24-23490 | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` |
| 3 | clueweb12-1304wb-51-24177 | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` |
| 4 | clueweb12-1707wb-89-14469 | `<http://gallstonesremedy.info/?p=462>; rel=sho...` |


In [8]:
bm25run = bm25run.merge(mapper, on = 'DOCID', how = 'left')

In [9]:
bm25run.head()

| | TID | QID | DOCID | REL | COR | CRE | DOCS |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Q0 | clueweb12-1712wb-84-02961 | 1 | 42.583 | UWatMDS_BM25 | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` |
| 1 | 1 | Q0 | clueweb12-1304wb-88-12518 | 2 | 42.359 | UWatMDS_BM25 | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` |
| 2 | 1 | Q0 | clueweb12-1711wb-24-23490 | 3 | 42.222 | UWatMDS_BM25 | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` |
| 3 | 1 | Q0 | clueweb12-1304wb-51-24177 | 4 | 42.086 | UWatMDS_BM25 | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` |
| 4 | 1 | Q0 | clueweb12-1707wb-89-14469 | 5 | 42.005 | UWatMDS_BM25 | `<http://gallstonesremedy.info/?p=462>; rel=sho...` |


In [6]:
splts = pd.read_csv('10fold_groups.txt', header=None)
splts.shape

(50, 2)
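
The shape `(50, 2)` matches the 10 folds of 5 topics described in the introduction. A quick structural check along those lines might look like the sketch below. It assumes column 0 holds the fold id and column 1 the topic id; the helper `check_splits` and the toy table are hypothetical.

```python
import pandas as pd

def check_splits(splits, n_folds=10, topics_per_fold=5):
    """Return True if every fold has the expected number of topics and no topic repeats."""
    sizes = splits.groupby(0)[1].nunique()
    return bool(
        len(sizes) == n_folds
        and (sizes == topics_per_fold).all()
        and not splits[1].duplicated().any()
    )

# Toy table (hypothetical values): folds 1..10, five distinct topic ids each.
toy = pd.DataFrame({
    0: [k for k in range(1, 11) for _ in range(5)],
    1: list(range(100, 150)),
})
ok = check_splits(toy)
print(ok)
```

On the real data the call would be `check_splits(splts)`.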

## Classify

In [7]:
from sklearn.linear_model import LogisticRegression 
from sklearn.pipeline import Pipeline
from joblib import dump, load

In [12]:
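k = 1
# Load the classifier for fold k and select the documents of its held-out topics.
pline = load('LOGREG_10fold_%s.joblib' % k)
test_topics = splts[splts[0] == k][1].values
# .copy() makes target an independent frame, so adding a column later does not warn.
target = bm25run[bm25run.TID.isin(test_topics)].copy()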

In [13]:
target.head()

| | TID | QID | DOCID | REL | COR | CRE | DOCS |
|---|---|---|---|---|---|---|---|
| 21985 | 23 | Q0 | clueweb12-0407wb-21-20136 | 1 | 33.863 | UWatMDS_BM25 | `<html>\n<head>\n<title>Gestational Diabetes in...` |
| 21986 | 23 | Q0 | clueweb12-0000wb-47-16597 | 2 | 33.777 | UWatMDS_BM25 | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 S...` |
| 21987 | 23 | Q0 | clueweb12-1604wb-25-31323 | 3 | 33.734 | UWatMDS_BM25 | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` |
| 21988 | 23 | Q0 | clueweb12-0803wb-65-25055 | 4 | 33.689 | UWatMDS_BM25 | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` |
| 21989 | 23 | Q0 | clueweb12-0508wb-71-29330 | 5 | 33.669 | UWatMDS_BM25 | `<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 T...` |


In [17]:
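temp  = []
preds = []
counter = 1
# Score the documents in batches of 1000, printing progress after each full batch.
for doc in target.DOCS.values.tolist():
    temp.append(doc)
    if counter % 1000 == 0:
        # Column 1 is the probability of the class listed second in pline.classes_.
        preds.extend(pline.predict_proba(temp)[:,1])
        temp = []
        print(counter)
    counter += 1
# Score the final partial batch; skip it when the count is a multiple of 1000.
if temp:
    preds.extend(pline.predict_proba(temp)[:,1])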

1000
2000
3000
4000
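
The batching loop above can also be written as a small reusable helper. The sketch below is one possible shape, not the notebook's own code. `predict_in_batches` and `fake_predict_proba` are hypothetical names, and the fake scorer only stands in for `pline.predict_proba` so that the example runs without a trained model.

```python
import numpy as np

def predict_in_batches(predict_proba, docs, batch_size=1000):
    """Call predict_proba on consecutive slices of docs and collect column 1."""
    probs = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        probs.extend(predict_proba(batch)[:, 1])
    return probs

def fake_predict_proba(batch):
    # Hypothetical stand-in for pline.predict_proba: one row per doc, two columns.
    p = np.array([min(len(d), 10) / 10 for d in batch])
    return np.column_stack([1 - p, p])

probs = predict_in_batches(fake_predict_proba, ["a", "bb", "ccc"], batch_size=2)
print(len(probs))
```

Slicing by `range(0, len(docs), batch_size)` never produces an empty batch, so no special case is needed when the number of documents is an exact multiple of the batch size. With the real pipeline the call would be `predict_in_batches(pline.predict_proba, target.DOCS.tolist())`.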


In [19]:
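# Attach the probabilities and keep only the topic id, document id, and PROBS columns.
# assign() and drop() return new frames, which avoids modifying a slice of bm25run.
target = target.assign(PROBS=preds).drop(columns=['DOCS', 'REL', 'CRE', 'COR', 'QID'])
target.to_csv('PROBS_10fold_%s.csv' % k)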



In [20]:
for k in range(2,11):
    # Load the classifier for fold k and select the documents of its held-out topics.
    pline = load('LOGREG_10fold_%s.joblib' % k)
    test_topics = splts[splts[0] == k][1].values
    target = bm25run[bm25run.TID.isin(test_topics)].copy()
    print(test_topics)
    # Score the documents in batches of 1000.
    temp  = []
    preds = []
    counter = 1
    for doc in target.DOCS.values.tolist():
        temp.append(doc)
        if counter % 1000 == 0:
            preds.extend(pline.predict_proba(temp)[:,1])
            temp = []
            print(counter)
        counter += 1
    if temp:  # final partial batch
        preds.extend(pline.predict_proba(temp)[:,1])
    # Save the topic id, document id, and predicted probability.
    target['PROBS'] = preds
    target = target.drop(columns=['DOCS', 'REL', 'CRE', 'COR', 'QID'])
    target.to_csv('PROBS_10fold_%s.csv' % k)

[ 7 44  1  3 15]
1000
2000
3000
4000


[ 9 11 21 26 25]
1000
2000
3000
4000
[16 30 51 32 18]
1000
2000
3000
4000
[35  5 38 20 10]
1000
2000
3000
4000
[12 28 43 13 45]
1000
2000
3000
4000
[19 46  4 47 41]
1000
2000
3000
4000
[36 40  6 50 49]
1000
2000
3000
4000
[17 37  8 22 42]
1000
2000
3000
4000
[27  2 39 29 33]
1000
2000
3000
4000
