# Binary Sentence Classifier

This notebook demonstrates baseline binary text classification approaches for labelling excerpts from the given datasets as class 1 (accountability) or class 0 (not accountability). The datasets consist of excerpts from news articles about three shooting events. An excerpt belongs to the accountability class if it discusses accountability for the crime.

The excerpts were split into labelled single sentences in order to test how well the approach works as a sentence-based classifier. Three variations of the data will be tested:

1. Only excerpts that were originally single sentences
2. Labelled sentences from excerpts that were less than five sentences long
3. Labelled sentences from the full dataset of excerpts
    
This notebook will also assess the effects of class imbalance by running the same classifiers with balanced class weights, computed with the formula scikit-learn uses for `class_weight='balanced'`:
```python
n_samples / (n_classes * np.bincount(y))
```
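
As a quick check of the formula above, the following sketch computes the balanced weights by hand for a toy label vector and compares them with scikit-learn's `compute_class_weight` helper. The label vector `y_toy` (90 negatives, 10 positives) is made up for illustration and is not taken from the datasets.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical, heavily imbalanced labels: 90 of class 0 and 10 of class 1
y_toy = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_toy)

# Balanced weights computed directly from the formula
manual_weights = len(y_toy) / (len(classes) * np.bincount(y_toy))

# The same weights as computed by scikit-learn
sklearn_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

print(manual_weights)
print(np.allclose(manual_weights, sklearn_weights))
```

The minority class receives the larger weight, so each of its errors counts for more during training.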

## Run the Classifiers


### Single Sentence Results

In [1]:
import sys
import tensorflow as tf
from classifiers.binary_classifier import *
from witwidget.notebook.visualization import WitWidget, WitConfigBuilder

In [4]:
# Single Sentences
single_sents_results = find_best_classifier(["data/single_sents_df.csv"])

Processing classifier: SVC
Classifier: SVC f1: [0.5406698564593301, 0.5243902439024389]
count vector SVC results:
[[1611  135]
 [  57  113]]
              precision    recall  f1-score   support

           0       0.97      0.92      0.94      1746
           1       0.46      0.66      0.54       170

    accuracy                           0.90      1916
   macro avg       0.71      0.79      0.74      1916
weighted avg       0.92      0.90      0.91      1916

Processing classifier: LogisticRegression
Classifier: LogisticRegression f1: [0.5393258426966292, 0.5301204819277109]
count vector LogisticRegression results:
[[1591  155]
 [  50  120]]
              precision    recall  f1-score   support

           0       0.97      0.91      0.94      1746
           1       0.44      0.71      0.54       170

    accuracy                           0.89      1916
   macro avg       0.70      0.81      0.74      1916
weighted avg       0.92      0.89      0.90      1916

Processing classifier: RandomForestClassifier
Classifier: RandomForestClassifier f1: [0.4426229508196722, 0.41803278688524587]
count vector RandomForestClassifier results:
[[1726   20]
 [ 116   54]]
              precision    recall  f1-score   support

           0       0.94      0.99      0.96      1746
           1       0.73      0.32      0.44       170

    accuracy                           0.93      1916
   macro avg       0.83      0.65      0.70      1916
weighted avg       0.92      0.93      0.92      1916
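
The per-class scores in the reports above can be recovered from the confusion matrices. As a worked example, the sketch below recomputes precision, recall and F1 for class 1 from the SVC matrix printed above (rows are true labels, columns are predicted labels, following scikit-learn's convention). The helper name `prf_from_confusion` is an illustrative assumption, not part of the notebook's code.

```python
def prf_from_confusion(matrix, positive=1):
    """Return (precision, recall, f1) for one class of a confusion matrix.

    matrix[i][j] counts samples with true label i that were predicted as j.
    """
    n = len(matrix)
    tp = matrix[positive][positive]
    # False positives: other classes predicted as the positive class
    fp = sum(matrix[i][positive] for i in range(n) if i != positive)
    # False negatives: positive class predicted as something else
    fn = sum(matrix[positive][j] for j in range(n) if j != positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# SVC confusion matrix from the single-sentence run
svc_matrix = [[1611, 135],
              [57, 113]]
precision, recall, f1 = prf_from_confusion(svc_matrix)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.46 0.66 0.54
```

For class 1, precision is noticeably lower than recall: more than half of the sentences flagged as accountability are false positives (135 against 113 true positives).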

### Short Excerpts Results


In [5]:
# sentences from excerpts less than five sentences
short_ex_results = find_best_classifier(["data/short_excerpts_df.csv"])

Processing classifier: SVC
Classifier: SVC f1: [0.45367412140575086, 0.46356809417495437]
tfidf vector SVC results:
[[7571 2305]
 [ 338 1142]]
              precision    recall  f1-score   support

           0       0.96      0.77      0.85      9876
           1       0.33      0.77      0.46      1480

    accuracy                           0.77     11356
   macro avg       0.64      0.77      0.66     11356
weighted avg       0.88      0.77      0.80     11356

Processing classifier: LogisticRegression
Classifier: LogisticRegression f1: [0.46609172873382987, 0.47326978864484043]
tfidf vector LogisticRegression results:
[[7672 2204]
 [ 338 1142]]
              precision    recall  f1-score   support

           0       0.96      0.78      0.86      9876
           1       0.34      0.77      0.47      1480

    accuracy                           0.78     11356
   macro avg       0.65      0.77      0.67     11356
weighted avg       0.88      0.78      0.81     11356

Processing classifier: RandomForestClassifier
Classifier: RandomForestClassifier f1: [0.5255366395262768, 0.5346831646044244]
tfidf vector RandomForestClassifier results:
[[9402  474]
 [ 767  713]]
              precision    recall  f1-score   support

           0       0.92      0.95      0.94      9876
           1       0.60      0.48      0.53      1480

    accuracy                           0.89     11356
   macro avg       0.76      0.72      0.74     11356
weighted avg       0.88      0.89      0.89     11356

### Test With Label Noise Cleaning

### Visualize Results

In [2]:
sentences_df = load_data(['data/short_excerpts_df.csv'])
sentences_df.head()

Unnamed: 0,file,StoryID,Excerpts,ACCOUNT
0,data/short_excerpts_df.csv,NI2599,"Are guns the problem, video\ngames, the increa...",1
1,data/short_excerpts_df.csv,NI2599,Can the increase in gun violence in our school...,1
2,data/short_excerpts_df.csv,NI2951,A 22-year-old student last Friday killed six p...,0
3,data/short_excerpts_df.csv,NI2951,A 22-year-old student last Friday killed six p...,1
4,data/short_excerpts_df.csv,NI2951,Commenters \nblamed the killer's crimes on eve...,1


In [3]:
df2 = sentences_df[['ACCOUNT', 'Excerpts']]

# Decode escape sequences (e.g. a literal "\n") in each excerpt;
# excerpts that fail to decode are replaced with an empty string.
comments = df2['Excerpts'].values
proc_comments = []
for c in comments:
    try:
        if sys.version_info >= (3, 0):
            c = bytes(c, 'utf-8')
        c = c.decode('unicode_escape')
        if sys.version_info < (3, 0):
            # Python 2 only: drop non-ASCII characters
            c = c.encode('ascii', 'ignore')
        proc_comments.append(c.strip())
    except Exception:
        proc_comments.append('')

df3 = df2.assign(Excerpts=proc_comments)

label_column = 'ACCOUNT'
#make_label_column_numeric(df3, label_column, lambda val: val)



In [4]:
# Converts a dataframe into a list of tf.Example protos.
def df_to_examples(df, columns=None):
    examples = []
    if columns is None:
        columns = df.columns.values.tolist()
    for _, row in df.iterrows():
        example = tf.train.Example()
        for col in columns:
            # Compare dtypes by value rather than by identity
            if df[col].dtype == np.int64:
                example.features.feature[col].int64_list.value.append(int(row[col]))
            elif df[col].dtype == np.float64:
                example.features.feature[col].float_list.value.append(row[col])
            elif row[col] == row[col]:  # skip NaN values (NaN != NaN)
                example.features.feature[col].bytes_list.value.append(row[col].encode('utf-8'))
        examples.append(example)
    return examples

In [5]:
examples = df_to_examples(df3)
examples[1].features.feature['Excerpts']

bytes_list {
  value: "Can the increase in gun violence in our schools\nbe a reflection of the generation on children we are raising or a result of the political unrest in our country?"
}

In [6]:
# train simple classifier
excerpts = df3['Excerpts']
docs = [stem_tokenizer(doc) for doc in excerpts]
count_vectorizer = CountVectorizer(max_features=1000, min_df=10, max_df=0.7,
                                    stop_words=stopwords.words('english'))
stem_count_X = count_vectorizer.fit_transform(docs).toarray()

logreg = LogisticRegression(solver='lbfgs', max_iter=1000, class_weight='balanced')

docs_train, docs_test, y_train, y_test \
= train_test_split(pd.DataFrame(stem_count_X), df3['ACCOUNT'],
                    test_size=0.25, random_state=50)

logreg.fit(docs_train, y_train)
#y_pred = logreg.predict(docs_test)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=1000,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

In [7]:
# Get raw strings out of tf.Example protos and prepare them as model input
def examples_to_model_in(examples, vectorizer):
    texts = [ex.features.feature['Excerpts'].bytes_list.value[0] for ex in examples]
    if sys.version_info >= (3, 0):
        texts = [t.decode('utf-8') for t in texts]
    # Apply the same stemming and bag-of-words features used to train logreg
    # (assumes `vectorizer` is the CountVectorizer fitted in the training cell)
    docs = [stem_tokenizer(text) for text in texts]
    model_ins = vectorizer.transform(docs).toarray()
    return model_ins



In [8]:
# WIT predict function: returns one list of class scores per example
def custom_predict(examples_to_infer):
    model_ins = examples_to_model_in(examples_to_infer, count_vectorizer)
    preds = logreg.predict_proba(model_ins)
    return preds
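
As far as the What-If Tool's documentation describes it, a custom predict function for a classification model should return a list of class scores for each example rather than hard labels. The sketch below illustrates that output shape on a tiny invented corpus; `toy_texts`, `toy_labels`, `toy_vectorizer`, `toy_model` and `toy_predict` are hypothetical names used only for this illustration, and the toy data is not drawn from the datasets.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus, invented for illustration only
toy_texts = [
    "officials blamed the shooter",
    "who is responsible for the attack",
    "the weather was sunny that day",
    "the school reopened on monday",
]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_texts)
toy_model = LogisticRegression(solver="lbfgs", class_weight="balanced")
toy_model.fit(toy_X, toy_labels)


def toy_predict(texts):
    # Reuse the fitted vectorizer so features line up with the training columns
    features = toy_vectorizer.transform(texts)
    # One row per example, one column per class (class 0, class 1)
    return toy_model.predict_proba(features)


scores = toy_predict(["who is to blame", "sunny monday"])
print(scores.shape)  # (2, 2)
```

Each row of `scores` sums to 1, which gives the tool class probabilities to threshold instead of a single predicted label.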

In [21]:
num_datapoints = 1000  #@param {type: "number"}
tool_height_in_px = 720  #@param {type: "number"}

# Setup the tool with the test examples and the trained classifier
config_builder = WitConfigBuilder(examples[0:10]).set_custom_predict_fn(
  custom_predict)#.set_compare_custom_predict_fn(custom_predict_2)

wv = WitWidget(config_builder, height=tool_height_in_px)

In [22]:
WitWidget(config_builder, height=tool_height_in_px)

WitWidget(config={'model_type': 'classification', 'label_vocab': [], 'are_sequence_examples': False, 'inferenc…

In [23]:
display(wv)

WitWidget(config={'model_type': 'classification', 'label_vocab': [], 'are_sequence_examples': False, 'inferenc…

In [13]:
dir(wv)

['__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_add_notifiers',
 '_call_widget_constructed',
 '_comm_changed',
 '_compare',
 '_cross_validation_lock',
 '_default_keys',
 '_delete_example',
 '_display_callbacks',
 '_dom_classes',
 '_duplicate_example',
 '_gen_repr_from_keys',
 '_generate_sprite',
 '_get_element_html',
 '_get_eligible_features',
 '_get_embed_state',
 '_handle_custom_msg',
 '_handle_displayed',
 '_handle_msg',
 '_holding_sync',
 '_infer',
 '_infer_mutants',
 '_ipython_display_',
 '_is_numpy',
 '_json_from_tf_examples',
 '_lock_property',
 '_log_default',
 '_model_id',
 '_model_module',
 '_model_module_versio