# Replicating Textual Analysis

This notebook replicates the text-analysis tutorial provided as a sample by scikit-learn, using text data that has been fully anonymized by our text-obscuring script. At the conclusion, we compare the results that can be derived from the cleansed text with those from the uncleansed text.

Find the original instructions at: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In [1]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)

To ensure that our testing and training datasets have the same mapping of words to numbers, we must process them through our anonymization script together. Therefore we should load the entire dataset and split it later. 
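
The need for joint processing can be illustrated with a toy obscurer. This is only a sketch under stated assumptions: `build_mapping` and `obscure` are hypothetical helpers, not the actual obscuring script, and they assume tokens are drawn from a seeded random generator in order of first appearance. Under that assumption, running the training and testing sets separately gives the same word different tokens.

```python
import random
import string

# Character set assumed from the obscured tokens shown in this notebook.
ALPHABET = string.ascii_letters + string.digits + "-_"


def random_token(rng, length):
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def build_mapping(documents, seed=1, token_length=5):
    """Give each distinct lowercase word a unique random token, in order of first appearance."""
    rng = random.Random(seed)
    mapping = {}
    used = set()
    for doc in documents:
        for word in doc.lower().split():
            if word in mapping:
                continue
            token = random_token(rng, token_length)
            while token in used:  # redraw on the rare collision
                token = random_token(rng, token_length)
            used.add(token)
            mapping[word] = token
    return mapping


def obscure(doc, mapping):
    """Replace every word of a document with its token."""
    return " ".join(mapping[word] for word in doc.lower().split())


train_docs = ["god is love", "the gpu is fast"]
test_docs = ["love is fast"]

# One pass over everything: "love" gets a single token used by both sets.
joint = build_mapping(train_docs + test_docs)

# Two separate passes: "love" ends up with two unrelated tokens.
separate_train = build_mapping(train_docs)
separate_test = build_mapping(test_docs)

print("joint:   ", joint["love"])
print("separate:", separate_train["love"], "vs", separate_test["love"])
```

A classifier trained on `separate_train` tokens would see the `separate_test` tokens as unrelated vocabulary, which is why the whole dataset is loaded here and split only after obscuring.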

In [2]:
# The CSV-obscuring Python script requires CSV input, so export the data in that format.
import pandas as pd
data_df = pd.DataFrame(columns = ["target", "data"])
data_df["target"] = [twenty_train.target_names[i] for i in twenty_train.target]
data_df["data"] = twenty_train.data
data_df.to_csv("newsgroups.csv")
print("A sample of the data in raw form", "\n\n", data_df.head(10))

A sample of the data in raw form 

                    target                                               data
0                 sci.med  From: geb@cs.pitt.edu (Gordon Banks)\nSubject:...
1  soc.religion.christian  From: swf@elsegundoca.ncr.com (Stan Friesen)\n...
2  soc.religion.christian  From: David.Bernard@central.sun.com (Dave Bern...
3           comp.graphics  From: hotopp@ami1.bwi.wec.com (Daniel T. Hotop...
4                 sci.med  From: billc@col.hp.com (Bill Claussen)\nSubjec...
5  soc.religion.christian  From: mauaf@csv.warwick.ac.uk (Mr P D Simmons)...
6                 sci.med  From: lady@uhunix.uhcc.Hawaii.Edu (Lee Lady)\n...
7           comp.graphics  From: dfegan@lescsse.jsc.nasa.gov (Doug Egan)\...
8           comp.graphics  From: tgl+@cs.cmu.edu (Tom Lane)\nSubject: JPE...
9           comp.graphics  From: chu@TorreyPinesCA.ncr.com (Patrick Chu 3...


In [3]:
''' 
Reimporting the data cleansed with these configurations:

file_name = newsgroups_train
output_base = newsgroups
column_name = data
delete_column = Yes
index_num = 0

case_sensitive = No
stemming = No
random_seed = 1
remove_punctuation = Yes
combine_above = 17000
combine_below = 1
'''
anon_data = pd.read_csv("./obscured_newsgroups/newsgroups_data_Salt=exampleString_NoCase_NoStem_NoPunc_AboveNone_BelowNone.csv")
print(anon_data.head(10))

                   target                                      obscured_data
0                 sci.med  b4Seg 10Khk dRaYm LYwJD ujMnQ S7fhI tVqRr aSvx...
1  soc.religion.christian  b4Seg gMgiz v3kDg mwDQ6 fjpY2 H_Vzl BxcvF aSvx...
2  soc.religion.christian  b4Seg VXnD4 HGFtW 0VE5D UNsQQ fjpY2 dx0ia HGFt...
3           comp.graphics  b4Seg ympNG hnGMJ FhNWX WWr2H fjpY2 oa748 5EKh...
4                 sci.med  b4Seg vzC1q -XaO6 7D57- fjpY2 _6ozZ _1RCs aSvx...
5  soc.religion.christian  b4Seg nIIie ujOQs qPpKp Q2CIN 35sOC rvND4 Lvw4...
6                 sci.med  b4Seg P2WMC jCrYJ cZV5M iT3VW ujMnQ 4v_fB P2WM...
7           comp.graphics  b4Seg lGib2 lwttC dulL- BCuBV 1Cnnx tAD8R KVn_...
8           comp.graphics  b4Seg -IKdR dRaYm c8m-p ujMnQ 7cmJN MGCyx aSvx...
9           comp.graphics  b4Seg GAkWy NHVa6 mwDQ6 fjpY2 4Gs1v GAkWy -dgh...


#### Count Vectorizing

Although the words have all been converted into obscured tokens, they are stored as space-separated strings, so the functions designed for words work on the tokens as well.

The first two lines of the next cell split the dataset randomly into training and testing sets. As a result, our training and testing sets differ from those in the example, but splitting after obscuring is a necessary step so that both sets share one token mapping.

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(anon_data["obscured_data"], 
                                                    anon_data["target"], 
                                                    test_size=0.4, 
                                                    random_state=42)

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(2255, 38015)

In [18]:
X_train_counts

<2255x38015 sparse matrix of type '<class 'numpy.int64'>'
	with 395378 stored elements in Compressed Sparse Row format>

In [16]:
# The vocabulary maps each obscured token to its column index. 'b4Seg' is the obscured token for 'From'.
# CountVectorizer lowercases tokens by default, so the lookup must use the lowercase form.
count_vect.vocabulary_.get('b4Seg'.lower())
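
Because the obscured tokens are mixed-case and may contain `-` or `_`, it is worth seeing what the default `CountVectorizer` settings do to them. The sketch below uses only the standard library to mimic the two relevant defaults (`lowercase=True` and the token pattern `(?u)\b\w\w+\b`); the helper names are hypothetical, and the sample tokens are taken from the output above.

```python
import re

# Default CountVectorizer token pattern: runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_style_tokens(text):
    """Mimic the CountVectorizer defaults: lowercase, then apply the default pattern."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())


def whitespace_tokens(text):
    """Keep every space-separated token exactly as the obscuring script wrote it."""
    return text.split()


sample = "b4Seg -XaO6 7D57- fjpY2 _6ozZ"

print(default_style_tokens(sample))
# ['b4seg', 'xao6', '7d57', 'fjpy2', '_6ozz']
print(whitespace_tokens(sample))
# ['b4Seg', '-XaO6', '7D57-', 'fjpY2', '_6ozZ']
```

With the defaults, case is folded and leading or trailing hyphens are dropped, so two distinct obscured tokens could in principle merge into one feature. Passing `lowercase=False` and `token_pattern=r"\S+"` to `CountVectorizer` would keep the tokens intact; the outputs recorded in this notebook were produced with the defaults.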

#### TF-IDF Fitting

In [19]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(2255, 38015)

In [20]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2255, 38015)
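
As a rough illustration of what the two transformers compute, the sketch below reproduces the weighting in plain Python for three tiny obscured documents. It assumes the scikit-learn defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the `tfidf` helper and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs, use_idf=True):
    """Return one {token: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + count)) + 1 for tok, count in df.items()}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = {tok: c * (idf[tok] if use_idf else 1.0) for tok, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({tok: w / norm for tok, w in weights.items()})
    return rows


docs = [
    "b4Seg uYWvU CWUT_",
    "b4Seg fGKDi CWUT_",
    "b4Seg Bb4UV HkaPm",
]

for row in tfidf(docs):
    print({tok: round(weight, 3) for tok, weight in row.items()})
```

The token `b4Seg` occurs in every document and so receives the smallest weight in each row, while tokens unique to one document receive the largest. Nothing in the calculation depends on the tokens being readable words.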

## Training A Classifier 

### Naive Bayes

In [21]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)

Here we can see one difficulty in working with the anonymized data. Without the mapping of the original word to its randomized string, one who is building the classification model cannot use meaningful novel data to check the model performance.

Here, for the sake of demonstration, we can test the classifier by finding the obscured strings equivalent to "God is love" and "OpenGL on the GPU is fast", because we have the mappings exported from the obscuration process. However, we run into another difficulty. Searching the original training data reveals that "GPU" appears only in these email addresses: "C5u5LG.C3G@gpu.utcc.utoronto.ca" and "edwest@gpu.utcc.utoronto.ca", which are split into separate words: "c5u5lg c3g gpu utcc utoronto ca" and "edwest gpu utcc utoronto ca". The word "gpu" is then converted into the string "Bb4UV". So the string "gpu" does not match anything meaningful in our sample data. This insight is only possible with the non-obscured data, and lacking it could lead to a flawed model.
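
Encoding novel text with an exported mapping could look roughly like the sketch below. The `encode` helper is hypothetical, the mapping excerpt is read off the obscured strings used in this notebook (word order is assumed to be preserved), and the preprocessing assumes the configuration above: case-insensitive, with punctuation replaced by spaces.

```python
import string

# Hypothetical excerpt of the exported word-to-token mapping.
mapping = {
    "god": "uYWvU", "is": "CWUT_", "love": "cr2_s",
    "opengl": "fGKDi", "on": "o3qdl", "the": "Zdr9p",
    "gpu": "Bb4UV", "fast": "HkaPm",
}

# Replace every punctuation character with a space, so that an address such as
# "edwest@gpu.utcc.utoronto.ca" splits into separate words.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def encode(text, mapping):
    """Return the obscured string plus the list of words that have no token."""
    words = text.lower().translate(PUNCT_TO_SPACE).split()
    tokens = [mapping[word] for word in words if word in mapping]
    unknown = [word for word in words if word not in mapping]
    return " ".join(tokens), unknown


print(encode("God is love", mapping))
print(encode("OpenGL on the GPU is fast", mapping))
print(encode("CUDA is fast", mapping))
```

Returning the unknown words separately makes it visible when a new document contains vocabulary the model has never seen, which the obscured string alone would hide.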

In [24]:
docs_new = ['uYWvU CWUT_ cr2_s', 'fGKDi o3qdl Zdr9p Bb4UV CWUT_ HkaPm']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)

for doc, category, original in zip(docs_new, predicted, ["God is Love", "OpenGL on the GPU is fast"]):
    print('%r => %r => %s' % (original, doc, category))

'God is Love' => 'uYWvU CWUT_ cr2_s' => soc.religion.christian
'OpenGL on the GPU is fast' => 'fGKDi o3qdl Zdr9p Bb4UV CWUT_ HkaPm' => comp.graphics


The predictions match those obtained with the uncleansed data.
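
To see why opaque tokens are enough for this classifier, here is a minimal multinomial Naive Bayes written in plain Python. It is a sketch, not scikit-learn's implementation: it works on raw token counts rather than TF-IDF weights, `train_nb` and `predict_nb` are hypothetical helpers, and the four training documents are invented from tokens that appear in this notebook.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on token counts with additive smoothing."""
    token_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())

    vocab = {tok for counts in token_counts.values() for tok in counts}
    model = {}
    for label, counts in token_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(class_counts[label] / len(labels)),
            "log_likelihood": {
                tok: math.log((counts[tok] + alpha) / total) for tok in vocab
            },
        }
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    scores = {}
    for label, params in model.items():
        scores[label] = params["log_prior"] + sum(
            params["log_likelihood"][tok]
            for tok in doc.split()
            if tok in params["log_likelihood"]
        )
    return max(scores, key=scores.get)


train_docs = [
    "uYWvU CWUT_ cr2_s",
    "uYWvU uYWvU cr2_s",
    "fGKDi Bb4UV HkaPm",
    "fGKDi CWUT_ HkaPm",
]
train_labels = [
    "soc.religion.christian",
    "soc.religion.christian",
    "comp.graphics",
    "comp.graphics",
]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, "uYWvU cr2_s"))
print(predict_nb(model, "fGKDi HkaPm"))
```

The model only ever counts how often each token co-occurs with each label, so it never needs to know which word a token stands for.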

### Building a Pipeline

In [25]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),])

### Evaluation of the performance on the test set

In [26]:
# A bit of a simpler, built-in way to get the accuracy score.
from sklearn import metrics
text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

0.894946808511


We can see above that, using Naive Bayes on our obscured strings, we achieved an accuracy of 89.5%, comparable to their 83.4%. Our accuracy is not identical for a few reasons. First, the original walkthrough used predivided training and testing data, whereas we split our set using scikit-learn's `train_test_split` function. Second, we made two choices when obscuring our data: to strip punctuation and to treat the text as case-insensitive.

Now we will try the linear support vector machine using the same settings as the tutorial.

In [30]:
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge',
                                           penalty='l2',
                                           alpha=1e-3,
                                           random_state=42,
                                           max_iter=5,
                                           tol=None)),])
# Repeated code from above
text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

0.948803191489


Once again, our classification accuracy above (94.9%) is quite similar to their 91.2%. The classification report and confusion matrix follow.

In [31]:
print(metrics.classification_report(y_test, y_pred))

print(metrics.confusion_matrix(y_test, y_pred))

                        precision    recall  f1-score   support

           alt.atheism       0.99      0.91      0.95       339
         comp.graphics       0.90      0.99      0.95       386
               sci.med       0.98      0.93      0.95       388
soc.religion.christian       0.94      0.96      0.95       391

           avg / total       0.95      0.95      0.95      1504

[[310   3   6  20]
 [  0 383   1   2]
 [  0  26 359   3]
 [  3  12   1 375]]
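
The report and the confusion matrix carry the same information, which can be checked by hand. The sketch below recomputes per-class precision, recall and overall accuracy from the matrix printed above, assuming (as in scikit-learn) that rows are true labels, columns are predicted labels, and classes are in sorted order. The `per_class_metrics` helper is hypothetical.

```python
labels = ["alt.atheism", "comp.graphics", "sci.med", "soc.religion.christian"]

# Confusion matrix copied from the output above: rows = true, columns = predicted.
confusion = [
    [310, 3, 6, 20],
    [0, 383, 1, 2],
    [0, 26, 359, 3],
    [3, 12, 1, 375],
]


def per_class_metrics(matrix):
    """Return a (precision, recall) pair for each class index."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_positive = matrix[k][k]
        predicted_as_k = sum(matrix[i][k] for i in range(n))  # column sum
        actually_k = sum(matrix[k])                           # row sum
        results.append((true_positive / predicted_as_k, true_positive / actually_k))
    return results


metrics_by_class = per_class_metrics(confusion)
accuracy = sum(confusion[k][k] for k in range(len(confusion))) / sum(map(sum, confusion))

for label, (precision, recall) in zip(labels, metrics_by_class):
    print(f"{label:>24}  precision={precision:.2f}  recall={recall:.2f}")
print(f"accuracy={accuracy:.6f}")
```

The column sums also show where the errors concentrate: 26 `sci.med` posts and 12 `soc.religion.christian` posts were predicted as `comp.graphics`, which accounts for that class having the lowest precision.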


### Parameter tuning using grid search

We can perform the same grid search with our classifier. However, we would expect some parameters to no longer behave the same way. For example, once the obscuring configuration has been set, the model builder loses the flexibility to model the target using information below the word level, such as individual letters or groups of punctuation.

In [32]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1,1),(1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),}
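
The grid above expands to every combination of the listed values. As a small illustration, the sketch below enumerates those combinations with the standard library; it mirrors what an exhaustive grid search iterates over but does no fitting. Each key follows the pipeline convention `<step name>__<parameter name>`.

```python
from itertools import product

parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}

# Cartesian product of all value lists, one dict per candidate setting.
names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

print(len(candidates), "candidate settings")
for candidate in candidates:
    print(candidate)
```

Each candidate is fitted once per cross-validation fold, so even this small grid means several dozen fits; that is one reason the search below runs on only the first 400 training documents.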

In [33]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

In [34]:
gs_clf = gs_clf.fit(X_train[:400], y_train[:400])

In [35]:
print(gs_clf.predict(['16572 24157 31156'])[0])

comp.graphics


In [36]:
print(gs_clf.best_score_)
print(gs_clf.best_params_)

0.8975
{'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}


The grid search returned the same parameters as with the original data, with an accuracy score very similar to their 90.0%.

This concludes the example that was provided by scikit-learn. Now, we will look at some of the configuration options for the obscuration script and how they affect accuracy and interpretability.

# Configuring The Obscuration Script

In summary, this walk-through demonstrates that the anonymized text data from our obscuration process retains its predictive power for machine learning classification. However, there are a few drawbacks. The most obvious is that the text is completely opaque to the model builder. This could be problematic when the model picks up on something unintended. For example, if the original data accidentally contained the name of the target variable, the model could predict with 100% accuracy, yet the model builder could not read the data to find the problem.

As mentioned before, there is also an issue with presenting novel data to the model. If the new data is not encoded with the same tokens as the original training data, the model cannot make meaningful predictions. To encode new data with the same tokens, the script in its current state must process the new data together with the old data.
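
One way this constraint could be relaxed is to derive each token deterministically from the word and a secret salt, so that new data can be encoded on its own. The sketch below is an assumption-laden illustration, not a description of how the obscuring script works: `salted_token` and `obscure` are hypothetical helpers, and HMAC-SHA256 with a truncated URL-safe Base64 encoding is simply one possible choice.

```python
import base64
import hashlib
import hmac


def salted_token(word, salt, length=5):
    """Derive a short token from a keyed hash of the lowercased word."""
    digest = hmac.new(
        salt.encode("utf-8"), word.lower().encode("utf-8"), hashlib.sha256
    ).digest()
    # URL-safe Base64 uses letters, digits, '-' and '_'.
    return base64.urlsafe_b64encode(digest).decode("ascii")[:length]


def obscure(text, salt):
    """Encode a whitespace-separated text one word at a time."""
    return " ".join(salted_token(word, salt) for word in text.split())


salt = "exampleString"  # illustrative value; a real salt must be kept secret

old_batch = obscure("God is love", salt)
new_batch = obscure("love is fast", salt)  # encoded later, without the old data

print(old_batch)
print(new_batch)
```

The trade-offs are real: truncating to five characters makes collisions between different words possible, and anyone who learns the salt can test guesses for individual words, so this would need review before being used for actual anonymization.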