# Binary Sentence Classifier

This notebook demonstrates baseline binary text classification approaches that assign excerpts from the given datasets to class 1 (accountability) or class 0 (not accountability). The datasets are excerpts from news articles about three shooting events. An excerpt belongs to the accountability class if it discusses accountability for the crime.

The excerpts were split into labelled single sentences in order to test how well the approach works as a sentence-based classifier. Three variations of the data will be tested:

1. Only excerpts that were originally single sentences
2. Labelled sentences from excerpts that were fewer than five sentences long
3. Labelled sentences from the full dataset of excerpts

This notebook will also assess the effects of class imbalance by running the same classifiers with balanced class weights, which scikit-learn computes (`class_weight='balanced'`) as:
```python
    n_samples / (n_classes * np.bincount(y))
```
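
As a rough illustration of that expression, the sketch below computes the balanced weights in plain Python for a label vector with the same class counts as the support column of the single-sentence results (1746 negatives, 170 positives). The helper name `balanced_class_weights` is made up for this example; it mirrors the formula rather than calling scikit-learn.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * count_of_class)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Same class counts as the single-sentence support column: 1746 zeros, 170 ones
y = [0] * 1746 + [1] * 170
weights = balanced_class_weights(y)
print(weights)
```

With these counts the minority class is weighted roughly ten times more heavily than the majority class.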

## Run the Classifiers


### Single Sentence Results

In [32]:
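import sys
import numpy as np
import pandas as pd
import tensorflow as tf
# Also provides find_best_classifier, load_data, stem_tokenizer and the scikit-learn/NLTK names used below
from classifiers.binary_classifier import *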

In [4]:
# Single Sentences
single_sents_results = find_best_classifier(["data/single_sents_df.csv"])

Processing classifier: SVC
Classifier: SVC f1: [0.5406698564593301, 0.5243902439024389]
count vector SVC results:
[[1611  135]
 [  57  113]]
              precision    recall  f1-score   support

           0       0.97      0.92      0.94      1746
           1       0.46      0.66      0.54       170

    accuracy                           0.90      1916
   macro avg       0.71      0.79      0.74      1916
weighted avg       0.92      0.90      0.91      1916

Processing classifier: LogisticRegression
Classifier: LogisticRegression f1: [0.5393258426966292, 0.5301204819277109]
count vector LogisticRegression results:
[[1591  155]
 [  50  120]]
              precision    recall  f1-score   support

           0       0.97      0.91      0.94      1746
           1       0.44      0.71      0.54       170

    accuracy                           0.89      1916
   macro avg       0.70      0.81      0.74      1916
weighted avg       0.92      0.89      0.90      1916

Processing classifier: RandomForestClassifier
Classifier: RandomForestClassifier f1: [0.4426229508196722, 0.41803278688524587]
count vector RandomForestClassifier results:
[[1726   20]
 [ 116   54]]
              precision    recall  f1-score   support

           0       0.94      0.99      0.96      1746
           1       0.73      0.32      0.44       170

    accuracy                           0.93      1916
   macro avg       0.83      0.65      0.70      1916
weighted avg       0.92      0.93      0.92      1916
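
The class-1 scores in the SVC report above can be reproduced by hand from its confusion matrix, which helps when reading the remaining reports. This is a minimal sketch assuming the usual scikit-learn layout `[[tn, fp], [fn, tp]]`; `binary_metrics` is a hypothetical helper, not a function from the notebook.

```python
def binary_metrics(confusion):
    """Compute precision, recall and F1 for class 1 from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# SVC confusion matrix from the single-sentence run
precision, recall, f1 = binary_metrics([[1611, 135], [57, 113]])
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

The rounded values agree with the 0.46 / 0.66 / 0.54 row of the report, and the unrounded F1 equals the first value in the `f1:` list printed for SVC.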

### Short Excerpts Results


In [5]:
# Sentences from excerpts with fewer than five sentences
short_ex_results = find_best_classifier(["data/short_excerpts_df.csv"])

Processing classifier: SVC
Classifier: SVC f1: [0.45367412140575086, 0.46356809417495437]
tfidf vector SVC results:
[[7571 2305]
 [ 338 1142]]
              precision    recall  f1-score   support

           0       0.96      0.77      0.85      9876
           1       0.33      0.77      0.46      1480

    accuracy                           0.77     11356
   macro avg       0.64      0.77      0.66     11356
weighted avg       0.88      0.77      0.80     11356

Processing classifier: LogisticRegression
Classifier: LogisticRegression f1: [0.46609172873382987, 0.47326978864484043]
tfidf vector LogisticRegression results:
[[7672 2204]
 [ 338 1142]]
              precision    recall  f1-score   support

           0       0.96      0.78      0.86      9876
           1       0.34      0.77      0.47      1480

    accuracy                           0.78     11356
   macro avg       0.65      0.77      0.67     11356
weighted avg       0.88      0.78      0.81     11356

Processing classifier: RandomForestClassifier
Classifier: RandomForestClassifier f1: [0.5255366395262768, 0.5346831646044244]
tfidf vector RandomForestClassifier results:
[[9402  474]
 [ 767  713]]
              precision    recall  f1-score   support

           0       0.92      0.95      0.94      9876
           1       0.60      0.48      0.53      1480

    accuracy                           0.89     11356
   macro avg       0.76      0.72      0.74     11356
weighted avg       0.88      0.89      0.89     11356
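
Because both datasets are heavily imbalanced, it can help to compare the reported accuracies against a trivial baseline that always predicts class 0. The sketch below derives that baseline from the support counts in the reports; `majority_baseline_accuracy` is a hypothetical helper introduced only for this illustration.

```python
def majority_baseline_accuracy(support_by_class):
    """Accuracy of always predicting the most frequent class."""
    return max(support_by_class.values()) / sum(support_by_class.values())

# Support counts taken from the classification reports
single_sentences = {0: 1746, 1: 170}
short_excerpts = {0: 9876, 1: 1480}

single_baseline = majority_baseline_accuracy(single_sentences)
short_baseline = majority_baseline_accuracy(short_excerpts)
print(round(single_baseline, 3), round(short_baseline, 3))
```

Several of the accuracies above sit at or below these baselines, which is one reason the class-1 F1 score is likely the more informative number for this task.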

### Test With Label Noise Cleaning

### Visualize Results

In [43]:
#!pip install cleanlab
!pip install witwidget
#!pip install tensorflow --ignore-installed

In [2]:
sentences_df = load_data(['data/short_excerpts_df.csv'])
sentences_df.head()

In [12]:
df2 = sentences_df[['ACCOUNT', 'Excerpts']]

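# Clean the excerpt text: decode backslash escape sequences
# (non-ASCII characters are only dropped on the Python 2 path)
comments = df2['Excerpts'].values
proc_comments = []
for c in comments:
    try:
        if sys.version_info >= (3, 0):
            c = bytes(c, 'utf-8')
        c = c.decode('unicode_escape')
        if sys.version_info < (3, 0):
            c = c.encode('ascii', 'ignore')
        proc_comments.append(c.strip())
    except Exception:
        # Fall back to an empty string for values that cannot be decoded (e.g. NaN)
        proc_comments.append('')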

df3 = df2.assign(Excerpts=proc_comments)

label_column = 'ACCOUNT'
#make_label_column_numeric(df3, label_column, lambda val: val)
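
In case dropping non-ASCII characters on Python 3 is the intended cleanup, a minimal sketch of that step follows. `strip_non_ascii` is a hypothetical helper, not part of the notebook, and it does not decode escape sequences the way the cell above does.

```python
def strip_non_ascii(text):
    """Return text with every non-ASCII character dropped and whitespace trimmed."""
    if not isinstance(text, str):
        # Missing values (e.g. NaN floats) become empty strings
        return ''
    return text.encode('ascii', 'ignore').decode('ascii').strip()

print(strip_non_ascii('caf\u00e9 shooting '))
```

It could be applied with something like `df2['Excerpts'].map(strip_non_ascii)` before building `df3`.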


In [33]:
# Converts a dataframe into a list of tf.Example protos.
def df_to_examples(df, columns=None):
    examples = []
    if columns is None:
        columns = df.columns.values.tolist()
    for index, row in df.iterrows():
        example = tf.train.Example()
        for col in columns:
            if df[col].dtype == np.int64:
                example.features.feature[col].int64_list.value.append(int(row[col]))
            elif df[col].dtype == np.float64:
                example.features.feature[col].float_list.value.append(row[col])
            elif row[col] == row[col]:  # NaN != NaN, so this skips missing values
                example.features.feature[col].bytes_list.value.append(row[col].encode('utf-8'))
        examples.append(example)
    return examples
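
The last branch of `df_to_examples` relies on `row[col] == row[col]` to skip missing values. A small standalone illustration of why that comparison works, using only the standard library (`is_missing` is a hypothetical helper for this example):

```python
import math

def is_missing(value):
    """True when value is a float NaN, the marker pandas uses for missing text."""
    return value != value

nan = float('nan')
print(is_missing(nan))          # NaN never compares equal to itself
print(is_missing('some text'))  # ordinary strings equal themselves
print(math.isnan(nan))          # the explicit check for floats
```

The self-comparison has the advantage of working for strings as well, where `math.isnan` would raise a `TypeError`.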

In [34]:
examples = df_to_examples(df3)
examples[1].features.feature['Excerpts']

In [39]:
# train simple classifier
excerpts = df3['Excerpts']
docs = [stem_tokenizer(doc) for doc in excerpts]
count_vectorizer = CountVectorizer(max_features=1000, min_df=10, max_df=0.7,
                                    stop_words=stopwords.words('english'))
stem_count_X = count_vectorizer.fit_transform(docs).toarray()

logreg = LogisticRegression(solver='lbfgs', max_iter=1000, class_weight='balanced')

docs_train, docs_test, y_train, y_test \
= train_test_split(pd.DataFrame(stem_count_X), df3['ACCOUNT'],
                    test_size=0.25, random_state=50)

logreg.fit(docs_train, y_train)
#y_pred = logreg.predict(docs_test)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=1000, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [40]:
# Get raw strings out of tf.Example protos and prepare them as model input
def examples_to_model_in(examples, vectorizer):
    texts = [ex.features.feature['Excerpts'].bytes_list.value[0] for ex in examples]
    if sys.version_info >= (3, 0):
        texts = [t.decode('utf-8') for t in texts]
    # Apply the same stemming and count vectorization used at training time
    docs = [stem_tokenizer(t) for t in texts]
    model_ins = vectorizer.transform(docs).toarray()
    return model_ins



In [41]:
# WIT predict function: returns one list of class scores per example
def custom_predict(examples_to_infer):
    model_ins = examples_to_model_in(examples_to_infer, count_vectorizer)
    preds = logreg.predict_proba(model_ins)
    return preds

In [44]:
#@title Invoke What-If Tool for the data and the trained classifier {display-mode: "form"}
from witwidget.notebook.visualization import WitWidget, WitConfigBuilder
num_datapoints = 1000  #@param {type: "number"}
tool_height_in_px = 720  #@param {type: "number"}

# Setup the tool with the test examples and the trained classifier
config_builder = WitConfigBuilder(examples[:num_datapoints]).set_custom_predict_fn(
  custom_predict)#.set_compare_custom_predict_fn(custom_predict_2)

wv = WitWidget(config_builder, height=tool_height_in_px)

ImportError: cannot import name 'ensure_str'
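
`six.ensure_str` was added in six 1.12.0, which suggests the running kernel imported an older copy of six. Below is a hedged sketch of a check for that. `describe_six` is a hypothetical helper, and the module is passed in as an argument so the check can be exercised without six installed.

```python
import types

def describe_six(six_module):
    """Report the version of a six-like module and whether it offers ensure_str."""
    version = getattr(six_module, '__version__', 'unknown')
    return version, hasattr(six_module, 'ensure_str')

# Stand-in for an outdated install; in the notebook, pass the imported `six` module instead
old_six = types.SimpleNamespace(__version__='1.10.0')
print(describe_six(old_six))
```

If the version seen inside the kernel is older than the one pip reports, restarting the kernel after the upgrade is worth trying.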

In [1]:
import six

In [2]:
six.__file__

'/anaconda3/lib/python3.6/site-packages/six.py'

In [66]:
dir(six)

['BytesIO',
 'Iterator',
 'MAXSIZE',
 'Module_six_moves_urllib',
 'Module_six_moves_urllib_error',
 'Module_six_moves_urllib_parse',
 'Module_six_moves_urllib_request',
 'Module_six_moves_urllib_response',
 'Module_six_moves_urllib_robotparser',
 'MovedAttribute',
 'MovedModule',
 'PY2',
 'PY3',
 'PY34',
 'StringIO',
 '_LazyDescr',
 '_LazyModule',
 '_MovedItems',
 '_SixMetaPathImporter',
 '__author__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_add_doc',
 '_assertCountEqual',
 '_assertRaisesRegex',
 '_assertRegex',
 '_func_closure',
 '_func_code',
 '_func_defaults',
 '_func_globals',
 '_import_module',
 '_importer',
 '_meth_func',
 '_meth_self',
 '_moved_attributes',
 '_urllib_error_moved_attributes',
 '_urllib_parse_moved_attributes',
 '_urllib_request_moved_attributes',
 '_urllib_response_moved_attributes',
 '_urllib_robotparser_moved_attributes',
 'absolute_import',
 'add_metaclass

In [47]:
!pip install --upgrade six

Requirement already up-to-date: six in /anaconda3/lib/python3.6/site-packages (1.12.0)


In [50]:
!ls /anaconda3/lib/python3.6/site-packages/

Babel-2.5.0-py3.6.egg-info
Bottleneck-1.2.1-py3.6.egg-info
Crypto
Cython
Cython-0.26.1-py3.6.egg-info
Flask-0.12.2-py3.6.egg-info
Flask_Cors-3.0.3-py3.6.egg-info
IPython
Jinja2-2.9.6-py3.6.egg-info
Keras_Applications-1.0.8.dist-info
Keras_Preprocessing-1.1.0.dist-info
Mako-1.0.7-py3.6.egg
Mako.pth
Markdown-3.1.1.dist-info
MarkupSafe-1.0-py3.6.egg-info
OleFileIO_PL.py
OpenSSL
PIL
Pillow-6.0.0-py3.6.egg-info
PyQt5
PySocks-1.6.7-py3.6.egg-info
PyWavelets-0.5.2-py3.6.egg-info
PyYAML-3.12-py3.6.egg-info
Pygments-2.2.0-py3.6.egg-info
QtAwesome-0.4.4-py3.6.egg-info
QtPy-1.7.0.dist-info
README.txt
SPARQLWrapper
SPARQLWrapper-1.8.0.dist-info
SQLAlchemy-1.1.13-py3.6.egg-info
Sphin

ruamel_yaml-0.11.14-py3.6.egg-info
s3transfer
s3transfer-0.1.12.dist-info
scikit_image-0.13.0-py3.6.egg-info
scikit_learn-0.21.1.dist-info
scikit_surprise-1.0.4.dist-info
scipy
scipy-1.3.0.dist-info
scripts
seaborn
seaborn-0.8-py3.6.egg-info
setuptools
setuptools-41.0.1.dist-info
setuptools.pth
simplegeneric-0.8.1-py3.6.egg-info
simplegeneric.py
singledispatch-3.4.0.3-py3.6.egg-info
singledispatch.py
singledispatch_helpers.py
sip.pyi
sip.so
sipconfig.py
sipdistutils.py
six-1.10.0-py3.6.egg-info
six-1.12.0.dist-info
six.py
skimage
sklearn
smart_open
smart_open-1.5.5.dist-info
snowballstemmer
snowballstemmer-1.2.1-py3.6.egg-info
socks.py
sockshandler.py
sortedcollections
sortedcollections-0.5.3-py3.6

In [3]:
!cat /anaconda3/lib/python3.6/site-packages/six.py

# Copyright (c) 2010-2018 Benjamin Peterson
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR 