# A Notebook for Generating Labeled Data for Multi-class NLP Classification
This notebook generates assets (data and a model) that other examples and notebooks can use to demonstrate and debug NLP use cases in Fiddler. We use the public 20 Newsgroups dataset and apply TF-IDF vectorization to obtain embedding vectors for the text. We then split the data into training and test samples and fit a logistic regression model that predicts the probability of each target class for each data point. To simplify the classification task, we group the original targets into more general news categories. Finally, we concatenate the results in a pandas DataFrame and store the labeled training and test data as CSV files. These files can serve as baseline and production data in Fiddler when model artifacts and surrogate models are not required. We also store the trained model as a pickle file for scenarios that need access to the model.

# Fetch the 20 Newsgroups Dataset

First, we retrieve the 20 Newsgroups dataset, which is one of scikit-learn's real-world datasets. The full dataset contains around 18,000 newsgroup posts on 20 topics; here we load only its training subset. The original dataset is available [here](http://qwone.com/~jason/20Newsgroups/).

In [5]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups

In [2]:
data_bunch = fetch_20newsgroups(
    subset = 'train',
    shuffle=True,
    random_state=1,
    remove=('headers','footers','quotes')
)

Each data sample in this dataset is assigned one of 20 target names. You can list all of them by running:
```
data_bunch.target_names
```
However, to keep this example notebook simple, we group similar topics into more general targets as follows:



In [3]:
subcategories = {
    
    'computer': ['comp.graphics',
                 'comp.os.ms-windows.misc',
                 'comp.sys.ibm.pc.hardware',
                 'comp.sys.mac.hardware',
                 'comp.windows.x'],
    
    'politics': ['talk.politics.guns',
                 'talk.politics.mideast',
                 'talk.politics.misc'],
    
    'recreation':['rec.autos',
                  'rec.motorcycles',
                  'rec.sport.baseball',
                  'rec.sport.hockey'],
    
    'science': ['sci.crypt',
                'sci.electronics',
                'sci.med',
                'sci.space',],
    
    'religion': ['soc.religion.christian',
                 'talk.religion.misc',
                 'alt.atheism'],
    
    'forsale':['misc.forsale']
}

main_category = {}
for key,l in subcategories.items():
    for item in l:
        main_category[item] = key
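
The loop above inverts the grouping so that each original topic points to its general category. As a minimal sketch of the same idea, the inversion can also be written as a dictionary comprehension with a sanity check that no topic is listed twice. The two-group dictionary below is a toy subset used only for illustration; the full dictionary above, whose six groups cover all 20 topics, works the same way.

```python
# Toy subset of the grouping above; the full dictionary works the same way.
subcategories = {
    'science': ['sci.crypt', 'sci.med'],
    'forsale': ['misc.forsale'],
}

# Invert {general: [originals]} into {original: general}.
main_category = {
    original: general
    for general, originals in subcategories.items()
    for original in originals
}

# Each original topic must belong to exactly one general category,
# otherwise a later group would silently overwrite an earlier one.
n_topics = sum(len(originals) for originals in subcategories.values())
assert len(main_category) == n_topics

print(main_category['sci.med'])  # science
```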

Finally, we clean the text, drop empty posts and political posts, and store the data in a pandas DataFrame.

In [6]:
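# Flatten newlines, then trim separator characters (newline, comma, =, |, -, space, backslash, ^) from both ends.
data_prep = [s.replace('\n', ' ').strip('\n,=|- \\^') for s in data_bunch.data]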
data_series = pd.Series(data_prep)
df = pd.DataFrame()
df['original_text'] = data_series
df['original_target'] = [data_bunch.target_names[t] for t in data_bunch.target]
df['target'] = [main_category[data_bunch.target_names[t]] for t in data_bunch.target]
# Treat empty posts as missing values and drop them.
df['original_text'] = df['original_text'].replace('', np.nan)
df = df.dropna(axis=0, subset=['original_text'])
df = df[df.target != 'politics']  # drop political posts
df = df.reset_index(drop=True)
df.head(3)

Unnamed: 0,original_text,original_target,target
0,"Yeah, do you expect people to read the FAQ, et...",alt.atheism,religion
1,Notwithstanding all the legitimate fuss about ...,sci.crypt,science
2,"Well, I will have to change the scoring on my ...",rec.sport.hockey,recreation
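
To make the cleanup step concrete, here is a small sketch that applies the same replace-then-strip logic to a made-up post. The helper name `clean_post` and the sample strings are hypothetical, and the strip argument is written without the repeated commas and with the backslash escaped, which trims the same set of characters.

```python
def clean_post(text):
    """Flatten newlines, then trim separator characters from both ends."""
    # Characters trimmed: newline, comma, '=', '|', '-', space, backslash, '^'.
    return text.replace('\n', ' ').strip('\n,=|- \\^')

# Made-up post with separator lines above and below the body.
raw = "-----\nFor sale: 486 board,\nbarely used\n=====\n"
cleaned = clean_post(raw)
print(repr(cleaned))  # 'For sale: 486 board, barely used'

# A post made only of separators collapses to an empty string.
print(repr(clean_post("\n\n---\n")))  # ''
```

Posts that collapse to an empty string are the ones the notebook converts to missing values and drops.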


# TF-IDF Vectorization

Before training a model to predict the targets, we transform the text into a format that standard ML models can process. This transformation step is often called "vectorization", and it embeds the text into a high-dimensional vector space. In this notebook, we use a simple TF-IDF vectorization method.
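
To show what a TF-IDF weight is, the sketch below computes one by hand on three made-up sentences. It assumes the formulation that, as far as I know, scikit-learn applies with `sublinear_tf=True` and its other defaults: term frequency `1 + ln(count)`, smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The corpus, the whitespace tokenizer, and the function name `tfidf_vector` are illustrative assumptions, and stop-word removal and the frequency cut-offs are left out.

```python
import math
from collections import Counter

# Made-up corpus; whitespace tokenization keeps the sketch short.
docs = [
    "the card works in my mac",
    "the team won the game",
    "my mac card is for sale",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Return {term: weight} using sublinear tf, smoothed idf and L2 norm."""
    counts = Counter(tokens)
    weights = {}
    for term, count in counts.items():
        tf = 1.0 + math.log(count)  # sublinear term frequency
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0  # smoothed idf
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector(tokenized[0])
# 'works' appears in one document, 'the' in two, so 'works' gets more weight.
print(vec['works'] > vec['the'])  # True
```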

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
embedding_dimension = 250

In [9]:
vectorizer = TfidfVectorizer(sublinear_tf=True,
                             max_features=embedding_dimension,
                             min_df=0.01,
                             max_df=0.9,
                             stop_words='english',
                             token_pattern=u'(?ui)\\b\\w*[a-z]+\\w*\\b')

tfidf_sparse = vectorizer.fit_transform(df['original_text'])
embedding_cols = vectorizer.get_feature_names_out()
embedding_col_names = ['tfidf_token_{}'.format(t) for t in embedding_cols]
tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf_sparse, columns=embedding_col_names)
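
The `token_pattern` above only keeps tokens that contain at least one letter, so purely numeric tokens never become features. A quick check with the standard `re` module on a made-up sentence illustrates this. The sketch drops the `u` flag, since Unicode matching is already the default for `str` patterns in Python 3, and lowercases the text first, as the vectorizer does by default. Stop-word removal is not applied here.

```python
import re

# Same pattern as the vectorizer, minus the redundant `u` flag.
TOKEN_PATTERN = re.compile(r'(?i)\b\w*[a-z]+\w*\b')

# Made-up sentence mixing words, numbers and alphanumeric tokens.
text = "Call 555 1234 about the 486dx2 cpu, v2.0!"
tokens = TOKEN_PATTERN.findall(text.lower())
print(tokens)  # ['call', 'about', 'the', '486dx2', 'cpu', 'v2']
```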

In [10]:
tfidf_df

Unnamed: 0,tfidf_token_able,tfidf_token_access,tfidf_token_actually,tfidf_token_address,tfidf_token_application,tfidf_token_article,tfidf_token_ask,tfidf_token_available,tfidf_token_b,tfidf_token_bad,...,tfidf_token_work,tfidf_token_works,tfidf_token_world,tfidf_token_wrong,tfidf_token_x,tfidf_token_y,tfidf_token_year,tfidf_token_years,tfidf_token_yes,tfidf_token_z
0,0.0,0.0,0.226770,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.00000,0.0,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.229739,0.0,...,0.00000,0.0,0.000000,0.00000,0.0,0.000000,0.198851,0.000000,0.219193,0.000000
2,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.00000,0.0,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
3,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.00000,0.0,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
4,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.296510,0.0,...,0.00000,0.0,0.000000,0.00000,0.0,0.114576,0.000000,0.000000,0.000000,0.118824
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9471,0.0,0.0,0.000000,0.0,0.0,0.0,0.117575,0.0,0.000000,0.0,...,0.09222,0.0,0.109909,0.19099,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
9472,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.00000,0.0,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
9473,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.00000,0.0,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
9474,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,...,0.00000,0.0,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000


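Now we concatenate the embedding representations with the DataFrame that we generated previously. The concatenation aligns rows by index, which works here because `df` was re-indexed from zero after filtering.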

In [11]:
df_all = pd.concat([df,tfidf_df], axis=1)
df_all

Unnamed: 0,original_text,original_target,target,tfidf_token_able,tfidf_token_access,tfidf_token_actually,tfidf_token_address,tfidf_token_application,tfidf_token_article,tfidf_token_ask,...,tfidf_token_work,tfidf_token_works,tfidf_token_world,tfidf_token_wrong,tfidf_token_x,tfidf_token_y,tfidf_token_year,tfidf_token_years,tfidf_token_yes,tfidf_token_z
0,"Yeah, do you expect people to read the FAQ, et...",alt.atheism,religion,0.0,0.0,0.226770,0.0,0.0,0.0,0.000000,...,0.00000,0.0,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
1,Notwithstanding all the legitimate fuss about ...,sci.crypt,science,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,...,0.00000,0.0,0.000000,0.00000,0.0,0.000000,0.198851,0.000000,0.219193,0.000000
2,"Well, I will have to change the scoring on my ...",rec.sport.hockey,recreation,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,...,0.00000,0.0,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
3,"I read somewhere, I think in Morton Smith's _J...",soc.religion.christian,religion,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,...,0.00000,0.0,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
4,Ok. I have a record that shows a IIsi with an...,comp.sys.mac.hardware,computer,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,...,0.00000,0.0,0.000000,0.00000,0.0,0.114576,0.000000,0.000000,0.000000,0.118824
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9471,I'd like to share my thoughts on this topic of...,soc.religion.christian,religion,0.0,0.0,0.000000,0.0,0.0,0.0,0.117575,...,0.09222,0.0,0.109909,0.19099,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
9472,My sunroof leaks. I've always thought those t...,rec.autos,recreation,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,...,0.00000,0.0,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
9473,I agree. Home runs off Clemens are always mem...,rec.sport.baseball,recreation,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,...,0.00000,0.0,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000
9474,I used HP DeskJet with Orange Micros Grappler ...,comp.sys.mac.hardware,computer,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,...,0.00000,0.0,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000


# Train a Multiclass Classifier

We are now ready to train a classifier to predict the labels assigned to each data sample. We use the logistic regression classifier from scikit-learn for this task. We split the data into training and test subsets, using 25% of the data points to train the model and holding out the remaining 75% for testing.

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [13]:
df_train, df_test = train_test_split(df_all, test_size=0.75, random_state=1)
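
As a rough sketch of what a shuffled holdout split does, the pure-Python function below shuffles row indices and cuts them into two parts. It does not reproduce scikit-learn's permutation, only the idea and the resulting sizes. The function name `shuffle_split` is hypothetical, the row count of 9,475 is read off the DataFrame output shown earlier, and rounding the test share up is, to my understanding, how scikit-learn sizes a fractional `test_size`.

```python
import math
import random

def shuffle_split(n_rows, test_size, seed):
    """Return (train_idx, test_idx) for a shuffled holdout split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_rows)  # test share is rounded up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(n_rows=9475, test_size=0.75, seed=1)
print(len(train_idx), len(test_idx))  # 2368 7107
```

The small training share leaves most rows for the "production" file, which fits the notebook's goal of generating demonstration data rather than the best possible model.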

In [14]:
clf = LogisticRegression(random_state=1).fit(df_train[embedding_col_names], df_train.target)

In [15]:
clf_classes = clf.classes_
prob_col_names = ['prob_%s'%c for c in clf_classes]

For a multi-class problem, the logistic regression classifier returns a probability for each target label. We store the predicted class probabilities and the predicted target for every data point in the training and test sets, and we compute the prediction accuracy on each set.
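
To illustrate how per-class probabilities relate to the predicted target, here is a minimal sketch that assumes the multinomial (softmax) formulation, which I believe is what the default solver uses for multi-class problems. The scores are invented numbers standing in for the linear scores of one post, not output of the trained model.

```python
import math

classes = ['computer', 'forsale', 'recreation', 'religion', 'science']

def softmax(scores):
    """Map real-valued class scores to probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear scores (w_c . x + b_c) for one post.
scores = [0.4, -2.1, -0.3, 0.5, 1.5]
probs = softmax(scores)

# The predicted target is the class with the highest probability.
predicted = classes[probs.index(max(probs))]
print(predicted)  # science
```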

In [18]:
predictions_df_train = pd.DataFrame(index=df_train.index)
predictions_df_train['predicted_target'] = clf.predict(df_train[embedding_col_names])
predicted_probs = clf.predict_proba(df_train[embedding_col_names])
for idx,col in enumerate(predicted_probs.T):
    predictions_df_train[prob_col_names[idx]] = col
baseline_df = pd.concat([predictions_df_train, df_train], axis=1)
acc_baseline = sum(baseline_df['predicted_target'] == baseline_df['target'])/baseline_df.shape[0]
print('accuracy on baseline:{:.2f}'.format(acc_baseline))

accuracy on baseline:0.76


In [19]:
predictions_df_test = pd.DataFrame(index=df_test.index)
predictions_df_test['predicted_target'] = clf.predict(df_test[embedding_col_names])
predicted_probs = clf.predict_proba(df_test[embedding_col_names])
for idx,col in enumerate(predicted_probs.T):
    predictions_df_test[prob_col_names[idx]] = col
production_df = pd.concat([predictions_df_test, df_test], axis=1)
acc_production = sum(production_df['predicted_target'] == production_df['target'])/production_df.shape[0]
print('accuracy on test data:{:.2f}'.format(acc_production))

accuracy on test data:0.67
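
The accuracy above is the share of rows where the predicted target equals the true target. A single number can hide which categories are confused, so a per-class breakdown is often a useful follow-up. The sketch below uses invented toy labels rather than the notebook's data and computes both quantities with the standard library.

```python
from collections import Counter

# Toy labels, invented for illustration only.
y_true = ['science', 'computer', 'religion', 'science', 'religion', 'computer']
y_pred = ['science', 'computer', 'science', 'computer', 'religion', 'computer']

# Overall accuracy: fraction of matching pairs.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-class recall: share of each true class that was predicted correctly.
totals = Counter(y_true)
hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
recall = {label: hits[label] / totals[label] for label in totals}

print(round(accuracy, 2))  # 0.67
print(recall)
```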


# Store Data and Model

In [21]:
baseline_df.to_csv('20newsgroups_baseline.csv',index=False)
production_df.to_csv('20newsgroups_production.csv',index=False)


In [23]:
production_df

Unnamed: 0,predicted_target,prob_computer,prob_forsale,prob_recreation,prob_religion,prob_science,original_text,original_target,target,tfidf_token_able,...,tfidf_token_work,tfidf_token_works,tfidf_token_world,tfidf_token_wrong,tfidf_token_x,tfidf_token_y,tfidf_token_year,tfidf_token_years,tfidf_token_yes,tfidf_token_z
8621,science,0.177161,0.012032,0.095168,0.198660,0.516980,"Really>`? No, gravity is an inherent system....",alt.atheism,religion,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
2707,computer,0.988958,0.004893,0.001215,0.000861,0.004072,Hello folks! I have an Archive XL5580 (intern...,comp.sys.ibm.pc.hardware,computer,0.0,...,0.0,0.303337,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
3660,computer,0.437696,0.176537,0.193054,0.054880,0.137833,In Article <1993Apr16.075822.22121@galileo.cc....,comp.sys.mac.hardware,computer,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
7267,science,0.051104,0.070322,0.089618,0.030233,0.758723,"As a general rule, no relay will cleanly switc...",sci.electronics,science,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.201164,0.0,0.0
2397,computer,0.865962,0.019988,0.033519,0.012982,0.067550,I've seen PGP 2.2 mentioned for the Mac platfo...,sci.crypt,science,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5401,recreation,0.026932,0.011266,0.844288,0.039284,0.078230,"Oh, Your Highness? And exactly why ""should"" ...",alt.atheism,religion,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
9012,science,0.061587,0.016104,0.323868,0.225100,0.373341,"Yeah, the ""Feingold Diet"" is a load of crap. ...",sci.med,science,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
6170,computer,0.439406,0.036730,0.110170,0.134251,0.279442,"No, I don't mean the LR, whatever that is. As...",talk.religion.misc,religion,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
2915,religion,0.139367,0.007453,0.062675,0.706682,0.083823,\tI realize I'm entering this discussion rathe...,soc.religion.christian,religion,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0


In [None]:
import pickle

filename = 'LogisticRegression_clf'
# Use a context manager so the file is flushed and closed after writing.
with open(filename, 'wb') as f:
    pickle.dump(clf, f)
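
A notebook that consumes the stored model would load it back with `pickle.load`. The sketch below shows the round trip under stated assumptions: a plain dictionary stands in for the trained classifier so the example runs on its own, and the file is written to a temporary directory instead of the working directory.

```python
import os
import pickle
import tempfile

# Stand-in for the trained classifier so the sketch is self-contained.
model = {'classes': ['computer', 'forsale'], 'coef': [[0.1, -0.2], [0.3, 0.4]]}

path = os.path.join(tempfile.mkdtemp(), 'LogisticRegression_clf')

# Write the object; the context manager closes the file afterwards.
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Read it back in the consuming notebook.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)  # True
```

Only unpickle files from a source you trust, and load a scikit-learn model with the same library version that produced it where possible.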