# Replicating Textual Analysis

This notebook replicates scikit-learn's sample text analysis using text that has been fully anonymized by our text-obscuring script. At the end, we compare the results obtained from the cleansed text with those obtained from the uncleansed text.

Find the original instructions at: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In [27]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)

To ensure that our testing and training datasets have the same mapping of words to numbers, we must process them through our anonymization script together. Therefore we should load the entire dataset and split it later. 
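A minimal sketch of why joint processing matters, assuming the script assigns each distinct word an arbitrary integer ID. The `build_mapping` and `encode` helpers are hypothetical stand-ins and are not the obscuring script itself. Mappings built separately cover different vocabularies and need not agree on a word's ID, whereas a single mapping built over both sets encodes every shared word consistently.

```python
import random

def build_mapping(texts, seed):
    """Hypothetical stand-in for the obscuring script: give every
    distinct lowercase word a unique random integer ID."""
    words = sorted({w for t in texts for w in t.lower().split()})
    rng = random.Random(seed)
    ids = rng.sample(range(1, 10 * len(words) + 1), len(words))
    return dict(zip(words, ids))

def encode(text, mapping):
    """Replace each word with its ID from the mapping."""
    return " ".join(str(mapping[w]) for w in text.lower().split())

train = ["god is love", "the gpu is fast"]
test = ["love is fast"]

# Built separately: different vocabularies, so IDs need not line up.
train_only = build_mapping(train, seed=1)
test_only = build_mapping(test, seed=1)

# Built together: one mapping covers both sets.
joint = build_mapping(train + test, seed=1)
print(encode(train[0], joint))
print(encode(test[0], joint))
```

With the joint mapping, "love" receives the same ID in the training row and in the testing row, which is what the downstream vectorizer needs.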

In [50]:
# The CSV-obscuring Python script requires CSV input, so export the data in that format.
import pandas as pd
from sklearn.model_selection import train_test_split  # needed for the split below
data_df = pd.DataFrame(columns = ["target", "data"])
data_df["target"] = [twenty_train.target_names[i] for i in twenty_train.target]
data_df["data"] = twenty_train.data
data_df.to_csv("newsgroups_train.csv")
print("A sample of the data in raw form", "\n\n", data_df.head(10))

Orig_X_train, Orig_X_test, Orig_y_train, Orig_y_test = train_test_split(data_df["data"], 
                                                    data_df["target"], 
                                                    test_size=0.4, 
                                                    random_state=42)

A sample of the data in raw form 

                    target                                               data
0                 sci.med  From: geb@cs.pitt.edu (Gordon Banks)\nSubject:...
1  soc.religion.christian  From: swf@elsegundoca.ncr.com (Stan Friesen)\n...
2  soc.religion.christian  From: David.Bernard@central.sun.com (Dave Bern...
3           comp.graphics  From: hotopp@ami1.bwi.wec.com (Daniel T. Hotop...
4                 sci.med  From: billc@col.hp.com (Bill Claussen)\nSubjec...
5  soc.religion.christian  From: mauaf@csv.warwick.ac.uk (Mr P D Simmons)...
6                 sci.med  From: lady@uhunix.uhcc.Hawaii.Edu (Lee Lady)\n...
7           comp.graphics  From: dfegan@lescsse.jsc.nasa.gov (Doug Egan)\...
8           comp.graphics  From: tgl+@cs.cmu.edu (Tom Lane)\nSubject: JPE...
9           comp.graphics  From: chu@TorreyPinesCA.ncr.com (Patrick Chu 3...


In [51]:
Orig_X_train[Orig_X_train.str.contains("gpu", case=False)]

3365    From: geb@cs.pitt.edu (Gordon Banks)\nSubject:...
563     From: edwest@gpu.utcc.utoronto.ca (Dr. Edmund ...
Name: data, dtype: object

In [33]:
''' 
Reimporting the data cleansed with these configurations:

file_name = newsgroups_train
output_base = newsgroups
column_name = data
delete_column = Yes
index_num = 0

case_sensitive = No
stemming = No
random_seed = 1
remove_punctuation = Yes
combine_above = 17000
combine_below = 1
'''
anon_data = pd.read_csv("./cleansed_newsgroups_train/newsgroups_data_NoCase_NoStem_NoPunc_Above17000_Below1.csv")
print(anon_data.head(10))

                   target                                      Cleansed_data
0                 sci.med  38264 24044 34757 8944 36760 21565 23363 10220...
1  soc.religion.christian  38264 18209 14880 32178 36760 21565 21621 4274...
2  soc.religion.christian  38264 47923 24575 1921 36760 21565 35634 43508...
3           comp.graphics  38264 4651 50050 259 16413 36760 50591 44409 6...
4                 sci.med  38264 20181 49030 48691 36760 21565 42010 3438...
5  soc.religion.christian  38264 16304 1643 27435 38107 23277 36760 28401...
6                 sci.med  38264 9026 3578 51496 36760 21565 42836 24157 ...
7           comp.graphics  38264 49775 21454 34522 36760 21565 29508 4335...
8           comp.graphics  38264 1402 4591 21678 36760 3977 12005 7141 66...
9           comp.graphics  38264 17553 7328 42658 25289 36760 39245 50520...


#### Count Vectorizing

Although the words have all been converted into numbers, they are stored as space-separated strings, so the functions used on words also work on the numbers.

The first statement below splits the dataset randomly into training and testing sets. As a result, our training and testing sets differ from those in the scikit-learn example, but the split has to happen after anonymization so that both sets share one word-to-number mapping.
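One caveat is worth checking, sketched here with the standard library only. The default `token_pattern` documented for `CountVectorizer` is `(?u)\b\w\w+\b`, which keeps tokens of two or more word characters. Numeric IDs with two or more digits therefore tokenize like ordinary words, but a single-digit ID would be silently dropped. The sample row below is illustrative and is not taken from the cleansed file.

```python
import re

# Default token pattern documented for sklearn's CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Illustrative row; "7" is a hypothetical one-digit ID.
doc = "38264 24044 7 36760"
tokens = TOKEN_PATTERN.findall(doc)
print(tokens)  # ['38264', '24044', '36760']
```

If single-digit IDs can occur in the cleansed data, passing `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` would keep them.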

In [37]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(anon_data["Cleansed_data"], 
                                                    anon_data["target"], 
                                                    test_size=0.4, 
                                                    random_state=42)

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(2255, 42622)

In [39]:
# The vocabulary maps each numeric token to its column index. Here '38264' is the number assigned to 'From'.
count_vect.vocabulary_.get('38264')

24979

#### TF-IDF Fitting

In [40]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(2255, 42622)

In [41]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2255, 42622)

## Training A Classifier 

### Naive Bayes

In [52]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)

This step exposes one difficulty of working with anonymized data. Without the mapping from each original word to its number, the person building the classification model cannot write meaningful new examples to check the model's behavior.

For the sake of demonstration, we can still test the classifier on "God is love" and "OpenGL on the GPU is fast" by looking up the equivalent numbers, because we have the mappings exported from the cleansing process. This reveals a second difficulty. A search of the original training data shows that "GPU" appears only inside these email addresses: "C5u5LG.C3G@gpu.utcc.utoronto.ca" and "edwest@gpu.utcc.utoronto.ca". They are cleansed into "c5u5lgc3ggpuutccutorontoca" and "edwestgpuutccutorontoca" and assigned the numbers 13444 and 33785. Substituting either number for "GPU" would not be equivalent to classifying on the word "GPU" itself, so the second test string keeps the literal "GPU".
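The lookup can be sketched as follows. The IDs are the ones that appear in this notebook, but the `word_to_id` dictionary and the `encode` helper are hypothetical: the real exported mapping file may be structured differently, and this sketch does not strip punctuation as the cleansing script does.

```python
# IDs as shown in this notebook; the dict is a hypothetical excerpt
# of the mapping exported by the cleansing step.
word_to_id = {
    "god": 16572, "love": 31156, "opengl": 32837,
    "on": 7359, "fast": 43731,
    # "is" and "the" share one ID in this notebook's strings.
    "is": 24157, "the": 24157,
}

def encode(text, mapping):
    """Replace each known word with its ID; leave unknown words as-is."""
    out = []
    for word in text.split():
        key = word.lower()  # mirrors case_sensitive = No
        out.append(str(mapping[key]) if key in mapping else word)
    return " ".join(out)

print(encode("God is love", word_to_id))
print(encode("OpenGL on the GPU is fast", word_to_id))
```

Note that "is" and "the" carry the same number in the notebook's encoded strings, presumably because very frequent words are merged by the `combine_above` setting.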

In [59]:
docs_new = ['16572 24157 31156', '32837 7359 24157 GPU 24157 43731']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)

for doc, category, original in zip(docs_new, predicted, ["God is Love", "OpenGL on the GPU is fast"]):
    print('%r => %r => %s' % (original, doc, category))

'God is Love' => '16572 24157 31156' => soc.religion.christian
'OpenGL on the GPU is fast' => '32837 7359 24157 GPU 24157 43731' => comp.graphics


Nonetheless, the predictions match those made on the uncleansed data, even though the word "GPU" is effectively left out: the fitted vectorizer ignores any token it did not see during training.

### Building a Pipeline

In [71]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),])

### Evaluation of the performance on the test set

In [72]:
# A bit of a simpler, built-in way to get the accuracy score.
from sklearn import metrics
text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

0.873670212766


With Naive Bayes, the anonymized strings give a classification accuracy similar to that achieved with the original data.

Next, we try the linear support vector machine with the same settings as the scikit-learn tutorial.

In [74]:
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge',
                                           penalty='l2',
                                           alpha=1e-3,
                                           random_state=42,
                                           max_iter=5,
                                           tol=None)),])
# Fit and score exactly as for the Naive Bayes pipeline above
text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

0.927526595745


Once again, our classification accuracy and report are quite similar to those achieved with the original data.

In [78]:
print(metrics.classification_report(y_test, y_pred))

print(metrics.confusion_matrix(y_test, y_pred))

                        precision    recall  f1-score   support

           alt.atheism       0.99      0.81      0.89       339
         comp.graphics       0.93      0.99      0.96       386
               sci.med       0.98      0.93      0.95       388
soc.religion.christian       0.84      0.97      0.90       391

           avg / total       0.93      0.93      0.93      1504

[[274   4   3  58]
 [  0 382   1   3]
 [  0  20 359   9]
 [  3   6   2 380]]


### Parameter tuning using grid search

We can run the same grid search on our classifier. However, we would expect some parameters to behave differently. For example, once the cleansing configuration has been set, the model builder loses the flexibility to model the target on information below the word level, such as individual letters or groups of punctuation.
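A small illustration of that loss, using only the standard library. Character n-grams (the kind of feature `CountVectorizer(analyzer='char')` would extract) overlap heavily between related words, but the IDs that replace those words share nothing meaningful. The two IDs below are made-up values for illustration, not entries from the real mapping.

```python
def char_ngrams(token, n=3):
    """Return the set of character n-grams in a token."""
    return {token[i:i + n] for i in range(len(token) - n + 1)}

# Two related words share most of their character trigrams...
words = ("graphics", "graphical")
shared_words = char_ngrams(words[0]) & char_ngrams(words[1])

# ...but their IDs (hypothetical values) share none.
ids = ("40213", "1877")
shared_ids = char_ngrams(ids[0]) & char_ngrams(ids[1])

print(sorted(shared_words))  # ['aph', 'gra', 'hic', 'phi', 'rap']
print(sorted(shared_ids))    # []
```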

In [105]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1,1),(1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),}

In [106]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

In [107]:
gs_clf = gs_clf.fit(X_train[:400], y_train[:400])

In [108]:
print(gs_clf.predict(['16572 24157 31156'])[0])

soc.religion.christian


In [109]:
print(gs_clf.best_score_)
print(gs_clf.best_params_)

0.8075
{'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}


The grid search has returned the same parameters as with the original data, but with a lower best score. Increasing the training size should increase the score. 

In [112]:
# Not present in the original analysis: training on the full training set (2,255 rows)
import timeit
start_time = timeit.default_timer()

gs_clf = gs_clf.fit(X_train, y_train)
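# Note: round() without a digits argument rounds to the nearest whole number,
# so the value printed below is not the precise score; round(gs_clf.best_score_, 3) would show it.
print(round(gs_clf.best_score_))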
print(gs_clf.best_params_)

print("Time to search:", round(timeit.default_timer() - start_time, 1), 'seconds')

1.0
{'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}
Time to search: 12.2 seconds


In summary, this walk-through demonstrates that text anonymized by our obscuration process retains its predictive power for machine learning classification. However, there are a few drawbacks. The most obvious is that the text is totally opaque to the model builder. This could be problematic when the model picks up on something unintended. For example, if the original data accidentally contained the name of the target variable, the model could predict with 100% accuracy, and the model builder could not read the data to find the problem.

As mentioned before, presenting novel data to the model is also an issue. If the new data is not encoded with the same numbers as the original training data, the model cannot predict the outcomes. In its current state, the script can only encode new data with the same numbers by processing it together with the old data.
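Opaque tokens can still be screened statistically. The sketch below is a heuristic that was not part of the original analysis: the hypothetical helper `single_class_tokens` flags tokens that occur in several documents, all with the same label, which is the pattern a leaked target name would produce. The rows and IDs are toy values.

```python
from collections import Counter, defaultdict

def single_class_tokens(docs, labels, min_docs=3):
    """Return tokens that occur in at least `min_docs` documents,
    all of which carry the same label."""
    per_token = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for token in set(doc.split()):
            per_token[token][label] += 1
    return {
        token: next(iter(counts))
        for token, counts in per_token.items()
        if len(counts) == 1 and sum(counts.values()) >= min_docs
    }

# Toy rows with made-up IDs; "999" plays the role of a leaked label token.
docs = ["11 22 999", "33 999 44", "999 55 11",
        "11 66 77", "22 88 66", "33 77 88"]
labels = ["a", "a", "a", "b", "b", "b"]
print(single_class_tokens(docs, labels))  # {'999': 'a'}
```

A flagged token is only a candidate for review. Someone holding the mapping would still have to decode it to confirm whether it is a genuine leak or a legitimately class-specific word.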
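One possible way around this limitation, sketched under the assumption that the word-to-number mapping could be saved to disk and reloaded. The helper `encode_with_saved_mapping`, the JSON layout, and the sequential assignment of new IDs are all hypothetical and are not features of the current script.

```python
import json
import os
import tempfile

def encode_with_saved_mapping(text, path):
    """Encode `text` with the mapping stored at `path`, assigning new
    IDs to unseen words and writing the updated mapping back."""
    with open(path) as f:
        mapping = json.load(f)
    next_id = max(mapping.values(), default=0) + 1
    ids = []
    for word in text.lower().split():
        if word not in mapping:
            mapping[word] = next_id  # hypothetical policy: next free ID
            next_id += 1
        ids.append(str(mapping[word]))
    with open(path, "w") as f:
        json.dump(mapping, f)
    return " ".join(ids)

# Seed a temporary mapping file with IDs shown earlier in this notebook.
path = os.path.join(tempfile.mkdtemp(), "mapping.json")
with open(path, "w") as f:
    json.dump({"god": 16572, "is": 24157, "love": 31156}, f)

print(encode_with_saved_mapping("God is love", path))   # 16572 24157 31156
print(encode_with_saved_mapping("love is kind", path))  # 31156 24157 31157
```

Words first seen after training get an ID the fitted model has never encountered, so they add no predictive signal until the model is retrained. Sequential IDs also reveal the order in which words were added, which a real implementation might want to avoid.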