# Training a binary classifier 

This notebook trains a binary classifier on a real-world dataset, __[20 news groups](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset)__, from the scikit-learn datasets. The fetch returns a Bunch object, which is then converted to a dataframe of news group posts. The remove parameter strips the post headers, footers, and quotations of other posts.


***
## Data Prep

Read the data with outcomes: <tt><b>X</b><sub><i>n,2</i></sub></tt>. Note that news_groups is n,3 because of the redundant binary targets column (binary_tgts).

Vector <b>y</b><sub>n</sub> is the column that contains binary outcomes. These are converted to ones and zeros.

Data without outcomes: <tt><b>x</b><sub><i>n</i></sub> =  <b>X</b><sub><i>n,2</i></sub> - <b>y</b><sub>n</sub>  </tt>

<tt><b>x_train</b><sub><i>p</i></sub>, <b>x_test</b><sub><i>q</i></sub>, <b>y_train</b><sub><i>p</i></sub>, <b>y_test</b><sub><i>q</i></sub> = train_test_split(<b>x</b><sub><i>n</i></sub>, <b>y</b><sub>n</sub> )  where n = p + q


vectorizer = CountVectorizer().fit(<b>x_train</b><sub><i>p</i></sub>)   

<b>X_train_vectorized</b><sub><i>p,u</i></sub> = vectorizer.transform(<b>x_train</b><sub><i>p</i></sub>)  u: vocabulary size, known only after fitting

<b>X_test_vectorized</b><sub><i>q,u</i></sub> = vectorizer.transform(<b>x_test</b><sub><i>q</i></sub>)  u: the same vocabulary size as the training matrix
</tt>
***

In [1]:
import pandas as pd
import sklearn as sklearn
import sklearn.datasets as datasets
import numpy as np

bunch = datasets.fetch_20newsgroups(subset='all',
                                   remove=('headers',
                                          'footers',
                                          'quotes'))

news_groups = {'data':bunch['data'],
              'targets':bunch['target'],}

news_groups = pd.DataFrame(news_groups)
news_groups.head(2)

Unnamed: 0,data,targets
0,\n\nI am sure some bashers of Pens fans are pr...,10
1,My brother is in the market for a high-perform...,3


***

The news groups data is a multi-class classification problem. We turn this into a binary classification problem by
setting one news group (or a set of related news groups) to 1 and the others to 0.
***
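As a minimal sketch of this relabeling, the example below uses a hypothetical series of class labels. The set of "related" class indices is chosen only for illustration and is not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical multi-class labels in the same 0..19 range as the news group targets.
targets = pd.Series([0, 3, 10, 0, 17, 19])

# One class versus the rest: class 0 becomes 1, everything else becomes 0.
one_vs_rest = np.where(targets == 0, 1, 0)

# A set of classes versus the rest; the indices below are illustrative only.
related = [0, 15, 19]
set_vs_rest = np.where(targets.isin(related), 1, 0)

print(one_vs_rest)
print(set_vs_rest)
```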

In [9]:

# Drop missing values, if there are any.
news_groups.dropna(inplace=True)

def set_binary_targets(data_series):
    #data_series = np.where((((data_series < 6) & data_series != 0)), 1, 0)
    data_series = np.where(data_series == 0, 1, 0)
    return data_series

# There are 20 news groups; label group 0 (alt.atheism) as 1 and all others as 0.
news_groups['binary_tgts'] = set_binary_targets(news_groups['targets'])

news_groups.head(20)

Unnamed: 0,data,targets,binary_tgts
0,\n\nI am sure some bashers of Pens fans are pr...,10,0
1,My brother is in the market for a high-perform...,3,0
2,\n\n\n\n\tFinally you said what you dream abou...,17,0
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,3,0
4,1) I have an old Jasmine drive which I cann...,4,0
5,\n\nBack in high school I worked as a lab assi...,12,0
6,\n\nAE is in Dallas...try 214/241-6060 or 214/...,4,0
7,"\n[stuff deleted]\n\nOk, here's the solution t...",10,0
8,"\n\n\nYeah, it's the second one. And I believ...",10,0
9,\nIf a Christian means someone who believes in...,19,0


In [10]:
tgts = pd.DataFrame(bunch['target_names'], columns=['target names'])
tgts

Unnamed: 0,target names
0,alt.atheism
1,comp.graphics
2,comp.os.ms-windows.misc
3,comp.sys.ibm.pc.hardware
4,comp.sys.mac.hardware
5,comp.windows.x
6,misc.forsale
7,rec.autos
8,rec.motorcycles
9,rec.sport.baseball


In [11]:
# Our target set is a small percentage of the overall data,
# so the classifier will have few positive examples to train on.

news_groups['targets'].value_counts()

targets
10    999
15    997
8     996
9     994
11    991
7     990
13    990
5     988
14    987
2     985
12    984
3     982
6     975
1     973
4     963
17    940
16    910
0     799
18    775
19    628
Name: count, dtype: int64

In [12]:
# Only 5.3% of the data is in 'rec.autos'.
# Only 4.2% of the data is in 'alt.atheism'.
news_groups.describe()

Unnamed: 0,targets,binary_tgts
count,18846.0,18846.0
mean,9.293166,0.042396
std,5.562798,0.201497
min,0.0,0.0
25%,5.0,0.0
50%,9.0,0.0
75%,14.0,0.0
max,19.0,1.0


*** 
Create the training and the test data.
***

In [13]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(news_groups['data'], 
                                                    news_groups['binary_tgts'], 
                                                    random_state=0)
print('\nnews_groups shape: ', news_groups.shape)
print('\nX_train shape: ', X_train.shape, '\ty_train shape: ', y_train.shape)
print('X_test shape: ', X_test.shape, '\t\ty_test shape: ', y_test.shape,'\n')


news_groups shape:  (18846, 3)

X_train shape:  (14134,) 	y_train shape:  (14134,)
X_test shape:  (4712,) 		y_test shape:  (4712,) 
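Because the positive class is only about 4% of the data, one optional variation (not used in this notebook) is a stratified split, which keeps the class proportions similar in both sets. The sketch below uses hypothetical labels and the `stratify` parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 5 positives out of 100 samples.
y = np.array([1] * 5 + [0] * 95)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class proportions as close as possible in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

print(y_tr.mean(), y_te.mean())
```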



## CountVectorizer

The scikit-learn machine learning algorithms
cannot process raw text, so vectorization maps each post to a numerical representation built from the words in the input data. In text processing, the collection of news group posts is called the text <b>corpus</b> and each post is a <b>document</b>. The <b>__[CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)__</b> ignores text structure and only counts occurrences of words to do its vectorization.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
#
# Fit the CountVectorizer to the training data
vect_VC = CountVectorizer().fit(X_train)

In [15]:
# The number of unique words (features) in the training vocabulary.
len(vect_VC.get_feature_names_out())

104069

In [16]:
vect_VC.get_feature_names_out()[::3000]   # look at every 3000th feature

array(['00', '164690', '3049', '5gppfjsq', '8574', 'a1c', 'animals',
       'ballyhoo', 'brouillette', 'christain', 'correlate', 'demonstate',
       'dustribute', 'exaclty', 'forsook', 'graphi', 'holger', 'inguianl',
       'justifiably', 'leans', 'macrocosm', 'minorit', 'napoleonic',
       'olcay', 'peripherals', 'proj', 'reamins', 'roussor', 'shawon',
       'ssa', 'taller', 'truth', 'v6eh', 'wheat', 'xvt'], dtype=object)

***
We have over 100,000 features (unique words in the corpus) and only about 14,000 samples (individual documents in X_train). We can expect that the classifier will not perform well.
***

***
Transform the documents in the training data to a document-term matrix.

* Each row corresponds to a document: 14,134.
* Each column corresponds to a word in the vocabulary: 104,069.
* Each entry is the number of times the word appears in the document; 1,339,229 entries are nonzero and stored.

Since most words do not appear in most documents, the matrix is sparse, with mostly zero entries.
***
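To make the sparsity concrete, the arithmetic below uses the shape and stored-element count reported for the training matrix in this notebook. The variable names are introduced here for illustration.

```python
# Shape and stored-element count as reported for the training matrix in this notebook.
n_docs, n_terms, n_stored = 14134, 104069, 1339229

# Fraction of all matrix cells that hold a nonzero count.
density = n_stored / (n_docs * n_terms)

# Average number of distinct vocabulary words per document.
avg_terms_per_doc = n_stored / n_docs

print(f"density: {density:.4%}")
print(f"average distinct terms per document: {avg_terms_per_doc:.1f}")
```

The density comes out well under 1%, which is why a compressed sparse format is used instead of a dense array.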

In [17]:
# transform the documents in the training data to a document-term matrix
X_train_vectorized_VC = vect_VC.transform(X_train)

X_train_vectorized_VC

<14134x104069 sparse matrix of type '<class 'numpy.int64'>'
	with 1339229 stored elements in Compressed Sparse Row format>

***
The LogisticRegression model works well with high-dimensional, sparse data.
***

In [18]:
# Dense view for inspection only: this allocates the full 14134 x 104069 array in memory.
X_train_vectorized_VC.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [19]:
# LogisticRegression works well for high-dimensional sparse data
from sklearn.linear_model import LogisticRegression

# Train the model
model_VC = LogisticRegression(max_iter=1000)
model_VC.fit(X_train_vectorized_VC, y_train)

***
Now use the model to predict the outcomes for X_test. 

We have the predicted outcomes for X_test, but we also have the actual outcomes, y_test. So now we can compute a measure of the performance of the model, the __[area under the curve (AUC)](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html)__.

An AUC value of 0.5 is the equivalent of random guessing. The score below is only modestly better than that, which is plausible with so many features, relatively few samples, and a positive class that is only about 4% of the data. Note also that the AUC here is computed from hard 0/1 predictions rather than from predicted probabilities.

There are methods to improve the performance.

test_data_and_results is a dataframe where we aggregate data and results for comparison and error checking purposes.
***
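As a hedged sketch of how the AUC input matters, the example below uses synthetic data from `make_classification` (not the news groups data). It compares `roc_auc_score` computed from hard 0/1 labels with the same score computed from `predict_proba` outputs. With hard labels the ROC curve has a single interior point, so the value equals the balanced accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data used only for illustration.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 labels: equivalent to balanced accuracy.
hard = model.predict(X_te)
auc_hard = roc_auc_score(y_te, hard)

# AUC from the predicted probability of the positive class: uses the full ranking.
scores = model.predict_proba(X_te)[:, 1]
auc_scores = roc_auc_score(y_te, scores)

print('AUC (hard labels):  ', auc_hard)
print('AUC (probabilities):', auc_scores)
print('balanced accuracy:  ', balanced_accuracy_score(y_te, hard))
```

Passing probabilities is the usual way to call `roc_auc_score`, since the metric measures how well the model ranks positives above negatives.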

In [20]:
from sklearn.metrics import roc_auc_score

# Predict the transformed test documents

predictions_VC = model_VC.predict(vect_VC.transform(X_test))

test_data_and_results = pd.DataFrame({'x':X_test, 'predictions_VC': predictions_VC}) 
test_data_and_results['y_test'] = y_test

# any words that appear in X_test that are not in X_train are ignored.
print('AUC: ', roc_auc_score(y_test, predictions_VC))
test_data_and_results.describe()
test_data_and_results.head(1)


AUC:  0.6800424272270235


Unnamed: 0,x,predictions_VC,y_test
14736,Uh... slight clarification: That should be ...,0,0


In [21]:
# get the feature names as numpy array
feature_names = vect_VC.get_feature_names_out()

# Sort the coefficients from the model
sorted_coef_index = model_VC.coef_[0].argsort()

print(sorted_coef_index[0:10])

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

[27014 99807 40316 68747 99376 75317 82925 39995 25174 66806]
Smallest Coefs:
['christians' 'wondering' 'fbi' 'off' 'windows' 'provide' 'scripture'
 'fake' 'called' 'next']

Largest Coefs: 
['bobby' 'motto' 'atheism' 'atheists' 'conner' 'atheist' 'keith' 'loans'
 'cruel' 'define']
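The slicing used on the sorted coefficient index can be checked on a small array. The coefficients and feature names below are hypothetical.

```python
import numpy as np

# Hypothetical coefficients and feature names.
coef = np.array([0.2, -1.5, 3.0, 0.0, -0.7])
names = np.array(['alpha', 'beta', 'gamma', 'delta', 'epsilon'])

# Indices that sort the coefficients from most negative to most positive.
order = coef.argsort()

smallest = names[order[:2]]       # the 2 most negative coefficients
largest = names[order[:-3:-1]]    # the 2 most positive, largest first

print(smallest, largest)
```

In the same way, `[:-11:-1]` walks backward from the end and stops after 10 elements.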


***
## TfidfVectorizer: term-frequency times inverse document-frequency

Fit the __[TfidfVectorizer](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)__ to the training data. A minimum document frequency (for example, min_df=5) can be specified to shrink the vocabulary; the cells below use the default settings.

Tf-idf weights terms based on how important they are to a document. High weight is given to terms that appear often in a document but rarely in the rest of the corpus.

min_df: the minimum number of documents in which a word has to appear to be kept in the vocabulary
***

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect_tfid = TfidfVectorizer().fit(X_train)

X_train_vectorized_tfid = vect_tfid.transform(X_train)

# LogisticRegression works well for high-dimensional sparse data
from sklearn.linear_model import LogisticRegression

model_tfid = LogisticRegression(max_iter=1000)
model_tfid.fit(X_train_vectorized_tfid, y_train)

from sklearn.metrics import roc_auc_score
# predictions = model.predict_proba(vect.transform(X_test))
predictions_tfid = model_tfid.predict(vect_tfid.transform(X_test))
#binary_predictions_tfid = set_binary_targets(predictions_tfid)
print('AUC: ', roc_auc_score(y_test, predictions_tfid))

AUC:  0.5460086477944878


In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect_tfid = TfidfVectorizer().fit(X_train)
len(vect_tfid.get_feature_names_out())

104069

In [24]:
vect_tfid.get_feature_names_out()[::3000]

array(['00', '164690', '3049', '5gppfjsq', '8574', 'a1c', 'animals',
       'ballyhoo', 'brouillette', 'christain', 'correlate', 'demonstate',
       'dustribute', 'exaclty', 'forsook', 'graphi', 'holger', 'inguianl',
       'justifiably', 'leans', 'macrocosm', 'minorit', 'napoleonic',
       'olcay', 'peripherals', 'proj', 'reamins', 'roussor', 'shawon',
       'ssa', 'taller', 'truth', 'v6eh', 'wheat', 'xvt'], dtype=object)

In [25]:
X_train_vectorized_tfid = vect_tfid.transform(X_train)
X_train_vectorized_tfid

<14134x104069 sparse matrix of type '<class 'numpy.float64'>'
	with 1339229 stored elements in Compressed Sparse Row format>

In [26]:
# LogisticRegression works well for high-dimensional sparse data
from sklearn.linear_model import LogisticRegression

model_tfid = LogisticRegression(max_iter=1000)
model_tfid.fit(X_train_vectorized_tfid, y_train)

In [27]:
from sklearn.metrics import roc_auc_score
# predictions = model.predict_proba(vect.transform(X_test))
predictions_tfid = model_tfid.predict(vect_tfid.transform(X_test))
#binary_predictions_tfid = set_binary_targets(predictions_tfid)
test_data_and_results['predictions_tfid'] = predictions_tfid

#binary_predictions_tfid = set_binary_targets(predictions_tfid)
print('AUC: ', roc_auc_score(y_test, predictions_tfid))

AUC:  0.5460086477944878


In [28]:
feature_names_tfid = np.array(vect_tfid.get_feature_names_out())

sorted_index_tfid = X_train_vectorized_tfid.max(0).toarray()[0].argsort()

print('Smallest tfidf:\n{}\n'.format(feature_names_tfid[sorted_index_tfid[:10]]))
print('Largest tfidf: \n{}'.format(feature_names_tfid[sorted_index_tfid[:-11:-1]]))

Smallest tfidf:
['yf9f9f9f9f9f9f9f9f9f3t' 'kljn' 'newwj' 'wz4'
 '3v9f9f9f9f9f9f9f9f9f9f9f9f9' 'ewwhj' 'ewwj' '3v9f9f9f9f9f9f9f0' 'pnewwj'
 'nrizwz4']

Largest tfidf: 
['narrative' 'ditto' 'hello' 'anaheim' 'david' 'each' 'why' 'dir'
 'consistently' 'art']


In [29]:
sorted_coef_index = model_tfid.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names_tfid[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names_tfid[sorted_coef_index[:-22:-1]]))

Smallest Coefs:
['the' 'with' 'on' 'get' 'windows' 'use' 'will' 'for' 'off' 'israel']

Largest Coefs: 
['atheism' 'religion' 'atheists' 'islam' 'atheist' 'bobby' 'islamic'
 'morality' 'kent' 'motto' 'god' 'belief' 'is' 'rushdie' 'that' 'cobb'
 'what' 'cruel' 'moral' 'an' 'theists']


In [30]:
# Fit the CountVectorizer to the training data, specifying a minimum
# document frequency of 5 and extracting 1-grams and 2-grams
from sklearn.feature_extraction.text import CountVectorizer
vect_ngram = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

X_train_vectorized = vect_ngram.transform(X_train)

len(vect_ngram.get_feature_names_out())

76480

In [31]:
model_ngram = LogisticRegression(max_iter=1000)
model_ngram.fit(X_train_vectorized, y_train)

predictions_ngram = model_ngram.predict(vect_ngram.transform(X_test))
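# Caution: predictions_ngram is already 0/1, and set_binary_targets maps
# 0 -> 1 and everything else -> 0, so this call flips every prediction.
# The AUC printed below is therefore 1 minus the AUC of predictions_ngram.
binary_predictions_ngram = set_binary_targets(predictions_ngram)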

print('AUC: ', roc_auc_score(y_test, binary_predictions_ngram))

AUC:  0.3392478908148301
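An AUC below 0.5 usually points to inverted labels rather than a model that is worse than chance. The sketch below, on hypothetical labels and predictions, applies the same 0 -> 1 mapping as set_binary_targets to predictions that are already 0/1 and shows that the AUC becomes its mirror image.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and 0/1 predictions.
y_true = np.array([1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1])

# Same mapping as set_binary_targets: 0 -> 1, everything else -> 0.
flipped = np.where(y_pred == 0, 1, 0)

auc = roc_auc_score(y_true, y_pred)
auc_flipped = roc_auc_score(y_true, flipped)

# The two values sum to 1.
print(auc, auc_flipped)
```

Scoring `predictions_ngram` directly, without the extra mapping, would avoid the flip.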


In [32]:
feature_names_ngram = np.array(vect_ngram.get_feature_names_out())

sorted_coef_index_ngram = model_ngram.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names_ngram[sorted_coef_index_ngram[:10]]))
print('Largest Coefs: \n{}'.format(feature_names_ngram[sorted_coef_index_ngram[:-11:-1]]))

Smallest Coefs:
['they were' 'windows' 'off' 'christians' 'wondering' 'you can' 'hi'
 'phone' 'called' 'let']

Largest Coefs: 
['conner' 'atheists' 'atheism' 'keith' 'atheist' 'bobby' 'religion'
 'claim that' 'answered' 'gregg']
