# Replicating Textual Analysis

This notebook will replicate the textual analysis provided as a sample by scikit-learn using the fully anonymized text data processed through our text-obscuring script. At the conclusion, we will compare the results that could be derived from the obscured text and the original text, walk through the configuration options of the obscuration script, and discuss some of the challenges with working with obscured data.

Find the original instructions at: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In [4]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)

To minimize the number of steps, we load the entire dataset, obscure it all at once, and split it into training and testing sets later using scikit-learn, instead of loading the presplit data.

In [5]:
import pandas as pd
from sklearn.model_selection import cross_val_score

In [190]:
#In order to use the csv obscuring python script, the data must be in CSV format. Exporting it as such:
data_df = pd.DataFrame(columns = ["target", "data"])
data_df["target"] = [twenty_train.target_names[i] for i in twenty_train.target]
data_df["data"] = twenty_train.data
#data_df.to_csv("newsgroups.csv")
print("A sample of the data in raw form", "\n\n", data_df.head(10))

A sample of the data in raw form 

                    target                                               data
0                 sci.med  From: geb@cs.pitt.edu (Gordon Banks)\nSubject:...
1  soc.religion.christian  From: swf@elsegundoca.ncr.com (Stan Friesen)\n...
2  soc.religion.christian  From: David.Bernard@central.sun.com (Dave Bern...
3           comp.graphics  From: hotopp@ami1.bwi.wec.com (Daniel T. Hotop...
4                 sci.med  From: billc@col.hp.com (Bill Claussen)\nSubjec...
5  soc.religion.christian  From: mauaf@csv.warwick.ac.uk (Mr P D Simmons)...
6                 sci.med  From: lady@uhunix.uhcc.Hawaii.Edu (Lee Lady)\n...
7           comp.graphics  From: dfegan@lescsse.jsc.nasa.gov (Doug Egan)\...
8           comp.graphics  From: tgl+@cs.cmu.edu (Tom Lane)\nSubject: JPE...
9           comp.graphics  From: chu@TorreyPinesCA.ncr.com (Patrick Chu 3...


In [16]:
'''
Importing the data having been cleansed with these configuration options.

file_name = newsgroups
output_base = data
column_name = data
delete_column = Yes
index_num = 0
case_sensitive = No
stemming = No
remove_punctuation = Yes
salt_string = exampleString
concat_hashes = 5
combine_above = None
combine_below = None
stop_words = 
stop_above_words = 
stop_below_words = 
'''

anon_data = pd.read_csv("./obscured_newsgroups/newsgroups_data_Salt=exampleString_NoCase_NoStem_NoPunc_Concat5_AboveNone_BelowNone.csv")
print(anon_data.head(10))

                   target                                      obscured_data
0                 sci.med  b4Seg 10Khk dRaYm LYwJD ujMnQ S7fhI tVqRr aSvx...
1  soc.religion.christian  b4Seg gMgiz v3kDg mwDQ6 fjpY2 H_Vzl BxcvF aSvx...
2  soc.religion.christian  b4Seg VXnD4 HGFtW 0VE5D UNsQQ fjpY2 dx0ia HGFt...
3           comp.graphics  b4Seg ympNG hnGMJ FhNWX WWr2H fjpY2 oa748 5EKh...
4                 sci.med  b4Seg vzC1q -XaO6 7D57- fjpY2 _6ozZ _1RCs aSvx...
5  soc.religion.christian  b4Seg nIIie ujOQs qPpKp Q2CIN 35sOC rvND4 Lvw4...
6                 sci.med  b4Seg P2WMC jCrYJ cZV5M iT3VW ujMnQ 4v_fB P2WM...
7           comp.graphics  b4Seg lGib2 lwttC dulL- BCuBV 1Cnnx tAD8R KVn_...
8           comp.graphics  b4Seg -IKdR dRaYm c8m-p ujMnQ 7cmJN MGCyx aSvx...
9           comp.graphics  b4Seg GAkWy NHVa6 mwDQ6 fjpY2 4Gs1v GAkWy -dgh...


#### Count Vectorizing

Although the words have all been converted into obscured tokens, they are stored as space-separated strings, so the functions used on words will work on the tokens.

The first statements in the cell below split the dataset randomly into training and testing sets. This means that our training and testing sets will differ from the example, but will be split in a statistically sound manner.

In [17]:
from sklearn.model_selection import train_test_split
X = anon_data["obscured_data"]
y = anon_data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.4, 
                                                    random_state=42)

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(2255, 38015)

In [18]:
X_train_counts

<2255x38015 sparse matrix of type '<class 'numpy.int64'>'
	with 395378 stored elements in Compressed Sparse Row format>

In [19]:
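# The vocabulary maps each obscured token to its column index in the count matrix.
# CountVectorizer lowercases tokens by default, so the lookup key must be lowercase too.
# 'b4Seg' is the obscured form of 'from'.
count_vect.vocabulary_.get('b4Seg'.lower())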
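One detail worth checking at this point is how `CountVectorizer` tokenizes the obscured strings. With its defaults it lowercases the input and keeps only runs of two or more word characters, so case-sensitive tokens, or tokens containing `-`, may be altered before they are counted. The sketch below compares the default analyzer with a whitespace-only configuration. Whether the alternative would change any result in this notebook is an assumption that was not tested here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tokens copied from the obscured sample shown earlier in this notebook.
sample = "b4Seg -XaO6 7D57- H_Vzl"

# Default analyzer: lowercases, then keeps runs of 2+ word characters.
default_tokens = CountVectorizer().build_analyzer()(sample)

# Alternative (an assumption, not what this notebook uses): keep case and
# treat every whitespace-delimited string as a single token.
exact_vect = CountVectorizer(lowercase=False, token_pattern=r"\S+")
exact_tokens = exact_vect.build_analyzer()(sample)

print(default_tokens)
print(exact_tokens)
```

If the second behavior is wanted, `exact_vect` could take the place of `CountVectorizer()` in the cells and pipelines of this notebook.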

#### TF-IDF Fitting

In [20]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(2255, 38015)

In [21]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2255, 38015)
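To make the weighting concrete, here is a minimal hand computation of the default `TfidfTransformer` formula (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The three toy documents are made up from tokens in the sample above and are not part of the dataset.

```python
import math
from collections import Counter

# Toy documents built from obscured tokens; illustrative only.
docs = [
    "b4Seg aSvxX ujMnQ",
    "b4Seg fjpY2 fjpY2",
    "b4Seg aSvxX",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each token.
df = Counter(token for tokens in tokenized for token in set(tokens))

def tfidf_row(tokens):
    """Return the L2-normalized tf-idf weights for one document."""
    counts = Counter(tokens)
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t, c in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
print(rows[0])
```

A token that appears in every document, such as `b4Seg`, receives the smallest idf, so rarer tokens dominate each row.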

## Training A Classifier 

### Naive Bayes

In [22]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)

Here we can see one difficulty in working with the anonymized data. Without the mapping of the original word to its randomized string, one who is building the classification model cannot use meaningful novel data to check the model performance.

Here, for the sake of demonstration, it is possible to test the classifier by finding the obscured equivalents of "God is love" and "OpenGL on the GPU is fast" because we have the mappings exported from the obscuration process. However, we run into another difficulty. Searching the original data reveals that "GPU" only appears in these email addresses: "C5u5LG.C3G@gpu.utcc.utoronto.ca" and "edwest@gpu.utcc.utoronto.ca", which are split into separate words: "c5u5lg c3g gpu utcc utoronto ca" and "edwest gpu utcc utoronto ca". "gpu" is then converted into the string "Bb4UV". So the token for "gpu" does not carry the meaning we would expect it to. However, this insight is only possible with the non-obscured data, and the lack of it could lead to a flawed model.

In [23]:
docs_new = ['uYWvU CWUT_ cr2_s', 'fGKDi o3qdl Zdr9p Bb4UV CWUT_ HkaPm']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)

for doc, category, original in zip(docs_new, predicted, ["God is Love", "OpenGL on the GPU is fast"]):
    print('%r => %r => %s' % (original, doc, category))

'God is Love' => 'uYWvU CWUT_ cr2_s' => soc.religion.christian
'OpenGL on the GPU is fast' => 'fGKDi o3qdl Zdr9p Bb4UV CWUT_ HkaPm' => comp.graphics


The predictions are the same as those obtained with the unobscured data.
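The translation of the two test sentences into obscured tokens can be sketched as a dictionary lookup. The entries below are the word and token pairs that appear in this notebook. The `translate` helper is hypothetical, and it assumes input that is already free of punctuation; a real mapping would be loaded from the file exported by the obscuration script.

```python
# Word-to-token pairs taken from the examples in this notebook.
word_to_hash = {
    "god": "uYWvU", "is": "CWUT_", "love": "cr2_s",
    "opengl": "fGKDi", "on": "o3qdl", "the": "Zdr9p",
    "gpu": "Bb4UV", "fast": "HkaPm",
}

def translate(text, mapping):
    """Return the obscured string and the list of words with no mapping."""
    tokens, unknown = [], []
    for word in text.lower().split():
        if word in mapping:
            tokens.append(mapping[word])
        else:
            unknown.append(word)
    return " ".join(tokens), unknown

obscured, missing = translate("God is Love", word_to_hash)
print(obscured, missing)
```

Returning the unmapped words separately makes it visible when a novel sentence contains vocabulary the model has never seen.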

### Building a Pipeline

In [24]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),])

### Evaluation of the performance on the test set

In [25]:
# A bit of a simpler, built-in way to get the accuracy score.
from sklearn import metrics
text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

0.894946808511


We can see above that, using Naive Bayes on our obscured strings, we achieved a classification accuracy of 89.5%, comparable to the 83.4% reported in the original tutorial. Our accuracy is not exactly the same for a number of reasons. First, the original walkthrough used predivided training and testing data, whereas we split our set using scikit-learn's `train_test_split` function. Second, we made two selections when obscuring our data: to strip punctuation and to run as case insensitive.

Now we will try the linear support vector machine using their same presets.

In [26]:
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', SGDClassifier(loss='hinge', 
                                               penalty='l2',
                                               alpha=1e-3, 
                                               random_state=42,
                                               max_iter=5, 
                                               tol=None)),])
# Repeated code from above
text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

0.948803191489


Once again, our classification accuracy above (94.9%) and the report below are quite similar to their 91.2%.

In [27]:
print(metrics.classification_report(y_test, y_pred))

print(metrics.confusion_matrix(y_test, y_pred))

                        precision    recall  f1-score   support

           alt.atheism       0.99      0.91      0.95       339
         comp.graphics       0.90      0.99      0.95       386
               sci.med       0.98      0.93      0.95       388
soc.religion.christian       0.94      0.96      0.95       391

           avg / total       0.95      0.95      0.95      1504

[[310   3   6  20]
 [  0 383   1   2]
 [  0  26 359   3]
 [  3  12   1 375]]


### Parameter tuning using grid search

We can perform the same grid search with our classifier. However, we would expect some parameters to no longer behave in the same way. For example, once the configuration options have been set, the model builder loses the flexibility to model the target on information below the word level, such as letters or groups of punctuation.

In [73]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1,1),(1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),}

In [74]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

In [75]:
gs_clf = gs_clf.fit(X_train[:400], y_train[:400])

In [76]:
print(gs_clf.predict(['16572 24157 31156'])[0])

comp.graphics


In [77]:
print(gs_clf.best_score_)
print(gs_clf.best_params_)

0.8975
{'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}


The grid search has returned the same parameters as with the original data with a very similar accuracy score to their 90.0%.

This concludes the example that was provided by scikit-learn. Now, we will look at some of the configuration options for the obscuration script and how they affect accuracy and interpretability.

# Configuring The Obscuration Script

As remarked upon above, the obscuration process takes a number of configuration options that can affect the ability of an individual to create a useful model with the obscured text. We will begin by isolating these one at a time to notice how it changes the model. Recall the configuration options in the initial example (not case sensitive, not stemming, removing punctuation, no stop words). 
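The obscuration script itself is not shown in this notebook, so the following is only a minimal sketch of how these options could interact. It assumes a SHA-256 hash over the salt plus the word, encoded as URL-safe base64 and truncated to `concat_hashes` characters. The function names `obscure_token` and `obscure_text` are hypothetical, and the tokens they produce will not match the ones in the files used here.

```python
import base64
import hashlib
import re

def obscure_token(word, salt="exampleString", length=5):
    """Hash one word with a salt and keep the first `length` characters."""
    digest = hashlib.sha256((salt + word).encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")[:length]

def obscure_text(text, salt="exampleString", length=5):
    """Lowercase, drop punctuation, and replace every word with its token."""
    # case_sensitive = No, remove_punctuation = Yes
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(obscure_token(w, salt, length) for w in words)

sample = obscure_text("From: geb@cs.pitt.edu (Gordon Banks)")
print(sample)
```

Because the same salt and word always give the same token, word frequencies and co-occurrences survive the transformation, which is what the classifiers above rely on.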

For consistency, we will use these optimized parameters for the SVM model and change the obscuration configurations. Our baseline score is below.

In [102]:
text_clf2 = Pipeline([('vect', CountVectorizer(ngram_range=(1,1))),
                    ('tfidf', TfidfTransformer(use_idf=True)),
                    ('clf', SGDClassifier(loss='hinge', 
                                               penalty='l2',
                                               alpha=.001, 
                                               random_state=42,
                                               max_iter=5, 
                                               tol=None)),])

# Using the optimized parameters, computes a base cross-validated accuracy score for the configurations above.
scores = cross_val_score(text_clf2, X, y, cv=5, scoring='accuracy')
print(sum(scores)/len(scores))

0.958239639118


We will start with case sensitivity.

In [134]:
def load_and_get_accuracy(filename):
    data = pd.read_csv(filename)
    X = data["obscured_data"]
    y = data["target"]
    scores = cross_val_score(text_clf2, X, y, cv=10, scoring='accuracy')
    return sum(scores)/len(scores)

In [131]:
'''
case_sensitive = Yes 
'''
load_and_get_accuracy("./obscured_newsgroups/newsgroups_data_Salt=exampleString_Case_NoStem_NoPunc_AboveNone_BelowNone.csv")

0.95744231955807635

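As we can see, for this particular dataset, the accuracy has very slightly decreased relative to the baseline (0.9574 versus 0.9582). Note that the baseline was computed with 5-fold cross-validation and this score with 10-fold, so the comparison is approximate.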

Next, we will reset case sensitivity and use NLTK stemming, which converts multiple forms of the same word into their common root.

In [135]:
'''
stemming = Yes 
'''
load_and_get_accuracy("./obscured_newsgroups/newsgroups_data_Salt=exampleString_NoCase_Stem_NoPunc_AboveNone_BelowNone.csv")

0.96089482060959774

Here we can see a slight increase in the accuracy score.

Next, we will include punctuation instead of removing it.

In [136]:
'''
remove_punctuation = No
'''
load_and_get_accuracy("./obscured_newsgroups/newsgroups_data_Salt=exampleString_NoCase_NoStem_Punc_AboveNone_BelowNone.csv")

0.95531253608647337

With this parameter, the accuracy has decreased again.
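The differences between these configurations are in the third decimal place, so it can help to look at the spread of the fold scores as well as their mean. The helper below is a rough sketch of such a check, not a formal significance test. The function names are hypothetical, and the example scores are placeholders rather than results from this notebook.

```python
import statistics

def summarize_scores(scores):
    """Return the mean and the standard error of a list of fold scores."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, sem

def clearly_different(scores_a, scores_b):
    """Rough check: do the means differ by more than twice the combined standard error?"""
    mean_a, sem_a = summarize_scores(scores_a)
    mean_b, sem_b = summarize_scores(scores_b)
    return abs(mean_a - mean_b) > 2 * (sem_a ** 2 + sem_b ** 2) ** 0.5

# Placeholder fold scores for illustration only.
example_a = [0.95, 0.96, 0.97, 0.95, 0.96]
example_b = [0.96, 0.95, 0.96, 0.97, 0.95]
print(summarize_scores(example_a), clearly_different(example_a, example_b))
```

To apply it here, `load_and_get_accuracy` would need to return the `scores` array instead of only its mean.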

Now we must consider the numeric configuration options. The first is `concat_hashes`, which sets the length of each obscured token (5 characters in the sample above). Making this figure smaller decreases the file size of the obscured text and further anonymizes the data by making it more likely that two words hash to the same value. However, that can decrease the accuracy of the model. We will try values from 2 to 6 to see the impact they have on accuracy. Given our results above, we will use stemming.

In [140]:
import os
for i in range(2,7):
    filename = "./obscured_newsgroups/newsgroups_data_Salt=exampleString_NoCase_Stem_Punc_Concat" + str(i) + "_AboveNone_BelowNone.csv"
    acc = load_and_get_accuracy(filename)
    size = os.stat(filename).st_size
    print(i, acc, size/1000000)

2 0.911434246821 5.035804
3 0.958240203186 6.695125
4 0.95956929628 8.354446
5 0.960103331308 10.013767
6 0.960104040528 11.673088


Above, we can see the relationship between the hash concatenation, the accuracy of our model, and the file size of the obscured files in MB. Accuracy rises sharply from concat_hashes = 2 to 3 and has plateaued by concat_hashes = 5. However, this is dependent on the number of unique words in our dataset. For this example, we had 37,000 words with stemming. Therefore a reasonable rule might be to use 5 unless there are over 50,000 unique words in the dataset. (The count of unique words is printed by the obscuration script when it is finished.)
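A back-of-the-envelope collision estimate helps explain this shape. The sketch below assumes that tokens are drawn uniformly from a 64-character alphabet (letters, digits, `-` and `_`, as seen in the samples) and that there are about 37,000 unique words. Both are assumptions about the script rather than documented behavior.

```python
import math

def expected_distinct_hashes(n_words, hash_length, alphabet_size=64):
    """Expected number of distinct tokens when n_words are hashed uniformly."""
    n_values = alphabet_size ** hash_length
    # n_values * (1 - (1 - 1/n_values) ** n_words), written in a numerically stable form.
    return -n_values * math.expm1(n_words * math.log1p(-1.0 / n_values))

n_words = 37_000  # approximate unique-word count reported for the stemmed data
for length in range(2, 7):
    distinct = expected_distinct_hashes(n_words, length)
    merged_fraction = 1 - distinct / n_words
    print(length, f"{merged_fraction:.6f}")
```

With only 64² = 4,096 possible tokens, most of the vocabulary must share a token at length 2, whereas the expected share of merged words becomes negligible by length 5.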

Now we will consider the lower and upper bounds for combining words. 

In [147]:
# If combine_above and combine_below = Ask, the obscuration script produces these counts.
'''
Original_Word      Hash   Count
58            the  Zdr9p  51817
20             of  3MVme  30909
17             to  Fsqeu  30297
108             a  KqMIB  23545
49            and  2gcdS  23357
38             is  CWUT_  20500
26             in  0q7Zs  19641
13              i  dhib2  19408
56           that  SgVfw  17890
37             it  HeKam  14817
82            you  5geKC  10862
68            for  WT2-D  10738
74             be  hO9jZ   9627
63            not  oX23J   8957
83            thi  RJ2Pl   8723
0            from  b4Seg   8345
61            are  XIAAS   8058
4             edu  ujMnQ   7663
95           have  fehAw   7489
148             s  Cio4U   7449 


25895           8mb  6Y-Qm      1
25896        181924  NSiAu      1
25897         21026  L_kHO      1
25898       1000000  xy86N      1
12989    entangleth  d2Kr_      1
12988       warreth  btSwj      1
3013           capp  FYM8L      1
25902         oneof  X2Dbh      1
3015          savel  4uCZT      1
25906       diametr  EGceI      1
25907         clees  w0o_V      1
12979      incosist  7fKRI      1
25909         heber  epTk0      1
25910         kenit  aDfKl      1
12978       defraud  PwuiJ      1
25912        scenic  6lVeI      1
12975       geisler  dyTRp      1
25914         novak  dBGUn      1
3017        cerullo  ZljLW      1
35184        truest  kdnwd      1
'''
print("")




In [152]:
for i in (8000,10000,17000,20000,30000):
    filename = "./obscured_newsgroups/newsgroups_data_Salt=exampleString_NoCase_Stem_NoPunc_Concat5_Above" + str(i) +"_Below1.csv"
    acc = load_and_get_accuracy(filename)
    print(i, acc)
print("None", load_and_get_accuracy("./obscured_newsgroups/newsgroups_data_Salt=exampleString_NoCase_Stem_NoPunc_AboveNone_BelowNone.csv"))

8000 0.947334454057
10000 0.952652218446
17000 0.952916055788
20000 0.956640191868
30000 0.960098362946
None 0.96089482061


This shows us the relationship between model accuracy and the `combine_above` threshold. Lower thresholds bundle more of the most frequent words together, and as we can see, accuracy falls as a result.

Now we turn our attention to concatenating the words below a given count. For example, if we select 2, the script will bundle words with counts of 1.

In [162]:
for i in (9,8,7,6,5,4,3,2,"None"):
    filename = "./obscured_newsgroups/newsgroups_data_Salt=exampleString_NoCase_Stem_NoPunc_Concat5_AboveNone_Below" + str(i) + ".csv"
    acc = load_and_get_accuracy(filename)
    print(i, acc)

9 0.961963630076
8 0.963026761961
7 0.961962215399
6 0.962760087739
5 0.963291297175
4 0.961696246635
3 0.962228165291
2 0.961167164888
None 0.96089482061


We can see from these results that the trend is not quite as predictable as with the bundling of the words with the highest counts. In general, it is best to minimize the bundling of words so as not to lose information. Below a certain count, however, words carry little predictive power because they occur so rarely, and bundling them did not reduce accuracy here.
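The bundling behavior can be sketched as follows. This is a hypothetical reimplementation: the placeholder tokens `ABOVE` and `BELOW` are invented for illustration, and treating `combine_above` as a strict "greater than" bound is an assumption. The strict "less than" bound for `combine_below` follows the description above.

```python
from collections import Counter

def bundle_tokens(docs, combine_above=None, combine_below=None,
                  above_token="ABOVE", below_token="BELOW"):
    """Replace very frequent and very rare tokens with shared placeholder tokens."""
    counts = Counter(token for doc in docs for token in doc.split())

    def replace(token):
        if combine_above is not None and counts[token] > combine_above:
            return above_token
        if combine_below is not None and counts[token] < combine_below:
            return below_token
        return token

    return [" ".join(replace(t) for t in doc.split()) for doc in docs]

# Toy corpus: 'Zdr9p' occurs 3 times, 'b4Seg' twice, '6Y-Qm' once.
docs = ["Zdr9p b4Seg Zdr9p", "Zdr9p b4Seg 6Y-Qm"]
bundled = bundle_tokens(docs, combine_above=2, combine_below=2)
print(bundled)
```

Every bundled word becomes indistinguishable from the others in its bundle, which is why aggressive thresholds remove information the model could otherwise use.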

Having looked at how some of the configuration options impact the ability of the model to predict outcomes, we now should turn our attention towards some of the difficulties of using this kind of data for modeling.

# Problems with Obscured Data

In summary, this walk-through demonstrates that the anonymized text data from our obscuration process retains its predictive power for machine learning categorization. However, there are a few drawbacks. The most obvious is the totally opaque nature of the text for the model builder. This could be problematic in cases where the model is picking up on something unintended. For example, if the original data accidentally contained the name of the target variable, the model could predict with nearly 100% accuracy, but the model builder could not read the data to find the problem. To demonstrate, consider the following case:

In [204]:
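# Append the target name as a separate word so it cannot fuse with the last word of the text.
tainted_data = data_df["data"] + " " + data_df["target"]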
tainted_df = pd.concat([data_df["target"], tainted_data], axis=1)
tainted_df.columns = ["target", "data"]
tainted_df.to_csv("tainted_newsgroups.csv")
obs_tainted = pd.read_csv("./obscured_tainted_newsgroups/newsgroups_data_Salt=exampleString_NoCase_NoStem_NoPunc_Concat5_AboveNone_BelowNone.csv")

Now the data column ends with the name of the target variable, tainting the data with unwanted information. If the words are unobscured, the problem is relatively easy to see. But if they are obscured, the problem becomes very difficult to spot.

In [205]:
print(tainted_data[0])

From: geb@cs.pitt.edu (Gordon Banks)
Subject: Re: "CAN'T BREATHE"
Article-I.D.: pitt.19440
Reply-To: geb@cs.pitt.edu (Gordon Banks)
Organization: Univ. of Pittsburgh Computer Science
Lines: 23

In article <1993Mar29.204003.26952@tijc02.uucp> pjs269@tijc02.uucp (Paul Schmidt) writes:
>I think it is important to verify all procedures with proper studies to
>show their worthiness and risk.  I just read an interesting tidbit that 
>80% of the medical treatments are unproven and not based on scientific 
>fact.  For example, many treatments of prostate cancer are unproven and
>the treatment may be more dangerous than the disease (according to the
>article I read.)

Where did you read this?  I don't think this is true.  I think most
medical treatments are based on science, although it is difficult
to prove anything with certitude.  It is true that there are some
things that have just been found "to work", but we have no good
explanation for why.  But almost everything does have a scientific
r

In [206]:
print(obs_tainted.obscured_data[0])

b4Seg 10Khk dRaYm LYwJD ujMnQ S7fhI tVqRr aSvxX 8GOEh ZBoa5 5EKhx PR5Vb bm0jf dhib2 I48NW LYwJD fxCuf za07I Fsqeu 10Khk dRaYm LYwJD ujMnQ S7fhI tVqRr WNXx- VJwA- 3MVme a4S68 6cgjU m8ssc w1yvW vV-hc 0q7Zs bm0jf G9h6y ejl5K QCY7E A-4Hy sI42G O4oSL A-4Hy sI42G BCRN3 -xP3O 4WNbo dhib2 YkSoQ HeKam CWUT_ iZDcx Fsqeu UYWYk VjqvU pYxa4 g0NaX GX8I1 89cUm Fsqeu BMtFB xCiMp IsQ4A 2gcdS nV0dX dhib2 r0gZo gMIMP 5DMKD leHin 1JB89 SgVfw yXB70 3MVme Zdr9p dKOvp kkwzF XIAAS 6LgEE 2gcdS oX23J UVPZD o3qdl YQDPb c60Dm WT2-D dcdTZ kv3Ct kkwzF 3MVme S24xz MhFbz XIAAS 6LgEE 2gcdS Zdr9p Sdg2q ir3Cb hO9jZ aVyIC aXt6a ycYsB Zdr9p O-iFx 2AH89 Fsqeu Zdr9p bm0jf dhib2 gMIMP I7YWm ZP7Fa 5geKC gMIMP v2yHv dhib2 LOhVB 5EKhx YkSoQ v2yHv CWUT_ 8EXrH dhib2 YkSoQ jaG0a dKOvp kkwzF XIAAS UVPZD o3qdl m8ssc olbfQ HeKam CWUT_ pNYpA Fsqeu 01yJO ae0s3 g0NaX Tp8qq HeKam CWUT_ 8EXrH SgVfw vvriI XIAAS CWKnZ BO7DV SgVfw fehAw r0gZo FNqYJ F98lY Fsqeu eJZtM L3jun JLKmm fehAw 7cQw7 rFewE -x9LH WT2-D sxTUc L3jun TejJ0 OG1_V NsFSl fehA

In [229]:
X = obs_tainted["obscured_data"]
y = obs_tainted["target"]
scores = cross_val_score(text_clf2, X, y, cv=10, scoring='accuracy')
print(sum(scores)/len(scores))

0.991758804989


By measuring the accuracy of our same model trained on this tainted data, we can see that the model has become overly accurate because it now has access to data it should not. It is not 100% because the target names have punctuation in them, so they are split into separate words. In the example above, "LWPSa 4NIgL" = "sci.med". And in a small percentage of cases, this information can be overpowered by other indicators of classification in the model.

Of course, this is only one example of an issue with working with completely non-human readable data. The greater takeaway is that diagnosis of problems in the data or the model becomes significantly more difficult. 
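Some checks remain possible without reading the text. One heuristic, sketched below, flags tokens that occur in nearly every document of one class and in no other class, which is the signature a leaked label would leave. The function name, the `min_coverage` threshold, and the token `Qx1Zt` are hypothetical. A flagged token is only a prompt for whoever holds the mapping to investigate, not proof of a leak.

```python
from collections import Counter, defaultdict

def suspicious_tokens(docs, labels, min_coverage=0.95):
    """Find tokens present in almost all documents of one class and absent elsewhere."""
    class_sizes = Counter(labels)
    doc_freq = defaultdict(Counter)  # token -> {label: documents containing it}
    for doc, label in zip(docs, labels):
        for token in set(doc.split()):
            doc_freq[token][label] += 1

    flagged = []
    for token, per_class in doc_freq.items():
        if len(per_class) == 1:
            (label, count), = per_class.items()
            if count / class_sizes[label] >= min_coverage:
                flagged.append((token, label))
    return sorted(flagged)

# Toy data: 'LWPSa' is the first token of "sci.med" from the example above,
# 'Qx1Zt' is a made-up token standing in for another leaked label.
docs = ["b4Seg aSvxX LWPSa", "b4Seg ujMnQ LWPSa", "b4Seg fjpY2 Qx1Zt", "b4Seg aSvxX Qx1Zt"]
labels = ["sci.med", "sci.med", "comp.graphics", "comp.graphics"]
flagged = suspicious_tokens(docs, labels)
print(flagged)
```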

Another consideration is that the obscuration severely limits the post-modeling analysis that is possible for the makers of the model. For example, we can find the word most associated with each category, as shown below, which can be helpful in understanding what the model is using to make its predictions.

In [15]:
text_clf2.fit(X, y)
coef = text_clf2.named_steps["clf"].coef_
vocab = text_clf2.named_steps["vect"].vocabulary_
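# get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it.
feature_names = text_clf2.named_steps["vect"].get_feature_names_out()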
coefs = pd.DataFrame(coef)

In [288]:
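# Series.iteritems() was removed in pandas 2.0; items() is the equivalent.
# Rows of coef_ follow the order of the classifier's classes_ attribute.
for index, value in coefs.idxmax(axis=1).items():
    print(feature_names[value], text_clf2.named_steps["clf"].classes_[index])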

rf alt.atheism
uiezi comp.graphics
htxf3 sci.med
3nuqu soc.religion.christian


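However, this kind of process yields little fruit with non-readable data: the top features above are obscured tokens that cannot be interpreted without the mapping back to the original words.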