# Binary Sentence Classifier

This notebook demonstrates baseline binary text classification approaches that classify excerpts from the given datasets as class 1 (accountability) or class 0 (not accountability). The datasets are excerpts from news articles about three shooting events. An excerpt belongs to the accountability class if it discusses accountability for the crime.

The excerpts were processed into labelled single sentences in order to test how effective a sentence-based classifier is. Three variations of the data are tested:

    1) Only testing excerpts that were originally single sentences
    2) Testing labelled sentences from excerpts that were shorter than five sentences
    3) Testing labelled sentences from the full dataset of excerpts
    
This notebook also assesses the effects of class imbalance by rerunning the same classifiers with balanced class weights, computed with the heuristic that scikit-learn applies for `class_weight="balanced"`:
```python
    n_samples / (n_classes * np.bincount(y))
```
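As a small worked example of this formula, the sketch below computes the weights for a label vector with 1333 negatives and 100 positives. These counts are borrowed from the support column of the single-sentence reports purely for illustration; the real weights depend on the training labels, which are not shown in this notebook.

```python
import numpy as np

# Illustrative labels only (assumption): 1333 class-0 and 100 class-1 examples.
y = np.array([0] * 1333 + [1] * 100)

n_samples = len(y)
n_classes = len(np.unique(y))
class_counts = np.bincount(y)  # number of examples per class: [1333, 100]

# Each class is weighted inversely to its frequency.
class_weights = n_samples / (n_classes * class_counts)
print(dict(enumerate(class_weights.round(4))))

# After weighting, both classes contribute the same total weight.
print(class_weights * class_counts)
```

With these counts the minority class gets a weight of about 7.17 and the majority class about 0.54, so each class carries the same total weight (716.5) during training.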

## Run the Classifiers
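The `find_best_classifier` helper used in the cells below is imported from `classifiers.binary_classifier` and its source is not shown here. The output suggests that each classifier is scored on both count and TF-IDF features. As a rough sketch of that kind of comparison (not the actual implementation), the hypothetical `compare_vectorizers` function below fits a class-weighted logistic regression on each feature type and reports the class-1 F1 score. The function name and the toy sentences are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline


def compare_vectorizers(train_texts, train_labels, test_texts, test_labels):
    """Hypothetical helper: class-1 F1 for count vs. TF-IDF features."""
    results = {}
    vectorizers = [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]
    for name, vectorizer in vectorizers:
        # class_weight="balanced" applies the inverse-frequency weights.
        classifier = LogisticRegression(class_weight="balanced", max_iter=1000)
        model = make_pipeline(vectorizer, classifier)
        model.fit(train_texts, train_labels)
        predictions = model.predict(test_texts)
        results[name] = f1_score(
            test_labels, predictions, pos_label=1, zero_division=0
        )
    return results


# Toy data (invented for this sketch, not taken from the datasets).
train_texts = [
    "the suspect was charged and held responsible",
    "officials were blamed for the failure",
    "the court held the shooter accountable",
    "the weather was sunny that morning",
    "the school reopened on monday",
    "neighbors gathered for a vigil",
    "traffic was diverted around the area",
    "the town has a population of five thousand",
]
train_labels = [1, 1, 1, 0, 0, 0, 0, 0]
test_texts = ["the shooter was charged", "the vigil was held on monday"]
test_labels = [1, 0]

scores = compare_vectorizers(train_texts, train_labels, test_texts, test_labels)
print(scores)
```

`SVC` and `RandomForestClassifier` also accept `class_weight="balanced"`, so the same loop could be extended over the three classifiers reported below.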


### Single Sentence Results

In [1]:
from classifiers.binary_classifier import *

In [2]:
# Single Sentences
single_sents_results = find_best_classifier(["data/single_sents_df.csv"])

Results from standard classifiers:

        Processing classifier: SVC
        Classifier: SVC f1: [0.49769585253456217, 0.4246575342465754]
        count vector SVC results:
        [[1270   63]
         [  46   54]]
                      precision    recall  f1-score   support

                   0       0.97      0.95      0.96      1333
                   1       0.46      0.54      0.50       100

            accuracy                           0.92      1433
           macro avg       0.71      0.75      0.73      1433
        weighted avg       0.93      0.92      0.93      1433

        Processing classifier: LogisticRegression
        Classifier: LogisticRegression f1: [0.5287356321839081, 0.27118644067796616]
        count vector LogisticRegression results:
        [[1305   28]
         [  54   46]]
                      precision    recall  f1-score   support

                   0       0.96      0.98      0.97      1333
                   1       0.62      0.46      0.53       100

            accuracy                           0.94      1433
           macro avg       0.79      0.72      0.75      1433
        weighted avg       0.94      0.94      0.94      1433

        Processing classifier: RandomForestClassifier
        Classifier: RandomForestClassifier f1: [0.4275862068965517, 0.44137931034482764]
        tfidf vector RandomForestClassifier results:
        [[1320   13]
         [  68   32]]
                      precision    recall  f1-score   support

                   0       0.95      0.99      0.97      1333
                   1       0.71      0.32      0.44       100

            accuracy                           0.94      1433
           macro avg       0.83      0.66      0.71      1433
        weighted avg       0.93      0.94      0.93      1433
        
        
Results with balanced class weights:

        Processing classifier: SVC
        Classifier: SVC f1: [0.512, 0.4709897610921501]
        count vector SVC results:
        [[1247   86]
         [  36   64]]
                      precision    recall  f1-score   support

                   0       0.97      0.94      0.95      1333
                   1       0.43      0.64      0.51       100

            accuracy                           0.91      1433
           macro avg       0.70      0.79      0.73      1433
        weighted avg       0.93      0.91      0.92      1433

        Processing classifier: LogisticRegression
        Classifier: LogisticRegression f1: [0.5323741007194244, 0.5209003215434084]
        count vector LogisticRegression results:
        [[1229  104]
         [  26   74]]
                      precision    recall  f1-score   support

                   0       0.98      0.92      0.95      1333
                   1       0.42      0.74      0.53       100

            accuracy                           0.91      1433
           macro avg       0.70      0.83      0.74      1433
        weighted avg       0.94      0.91      0.92      1433

        Processing classifier: RandomForestClassifier
        Classifier: RandomForestClassifier f1: [0.5157232704402516, 0.4705882352941176]
        count vector RandomForestClassifier results:
        [[1315   18]
         [  59   41]]
                      precision    recall  f1-score   support

                   0       0.96      0.99      0.97      1333
                   1       0.69      0.41      0.52       100

            accuracy                           0.95      1433
           macro avg       0.83      0.70      0.74      1433
        weighted avg       0.94      0.95      0.94      1433
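
To make the reports easier to read, the class-1 metrics can be recomputed by hand from a confusion matrix. The sketch below uses the matrix from the balanced `RandomForestClassifier` report directly above, where rows are true classes and columns are predicted classes. It also computes the accuracy of always predicting class 0, which shows why accuracy alone says little on data this imbalanced.

```python
import numpy as np

# Confusion matrix copied from the balanced RandomForestClassifier report.
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
confusion = np.array([[1315, 18],
                      [59, 41]])

true_negatives, false_positives = confusion[0]
false_negatives, true_positives = confusion[1]

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (true_positives + true_negatives) / confusion.sum()

# Accuracy of a trivial model that labels every sentence as class 0.
majority_baseline = confusion[0].sum() / confusion.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.4f}")
print(f"accuracy={accuracy:.2f} majority_baseline={majority_baseline:.2f}")
```

The recomputed values agree with the report (precision 0.69, recall 0.41, F1 0.52, accuracy 0.95), while the trivial baseline already reaches 0.93 accuracy. The class-1 F1 score is therefore the more informative number when comparing these classifiers.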


### Short Excerpts Results


In [3]:
# Sentences from excerpts shorter than five sentences
short_ex_results = find_best_classifier(["data/short_excerpts_df.csv"])

Results from standard classifiers:

        Processing classifier: SVC
        Classifier: SVC f1: [0.4807692307692307, 0.4643705463182897]
        count vector SVC results:
        [[7226  165]
         [ 753  425]]
                      precision    recall  f1-score   support

                   0       0.91      0.98      0.94      7391
                   1       0.72      0.36      0.48      1178

            accuracy                           0.89      8569
           macro avg       0.81      0.67      0.71      8569
        weighted avg       0.88      0.89      0.88      8569

        Processing classifier: LogisticRegression
        Classifier: LogisticRegression f1: [0.49586776859504134, 0.46572104018912525]
        count vector LogisticRegression results:
        [[7204  187]
         [ 728  450]]
                      precision    recall  f1-score   support

                   0       0.91      0.97      0.94      7391
                   1       0.71      0.38      0.50      1178

            accuracy                           0.89      8569
           macro avg       0.81      0.68      0.72      8569
        weighted avg       0.88      0.89      0.88      8569

        Processing classifier: RandomForestClassifier
        Classifier: RandomForestClassifier f1: [0.6024937655860348, 0.594930160372478]
        count vector RandomForestClassifier results:
        [[7168  223]
         [ 574  604]]
                      precision    recall  f1-score   support

                   0       0.93      0.97      0.95      7391
                   1       0.73      0.51      0.60      1178

            accuracy                           0.91      8569
           macro avg       0.83      0.74      0.77      8569
        weighted avg       0.90      0.91      0.90      8569
        
        
Results with balanced class weights:

        Processing classifier: SVC
        Classifier: SVC f1: [0.5178236397748593, 0.5298050139275766]
        tfidf vector SVC results:
        [[5930 1461]
         [ 227  951]]
                      precision    recall  f1-score   support

                   0       0.96      0.80      0.88      7391
                   1       0.39      0.81      0.53      1178

            accuracy                           0.80      8569
           macro avg       0.68      0.80      0.70      8569
        weighted avg       0.88      0.80      0.83      8569

        Processing classifier: LogisticRegression
        Classifier: LogisticRegression f1: [0.5212121212121212, 0.5349957495041088]
        tfidf vector LogisticRegression results:
        [[5984 1407]
         [ 234  944]]
                      precision    recall  f1-score   support

                   0       0.96      0.81      0.88      7391
                   1       0.40      0.80      0.53      1178

            accuracy                           0.81      8569
           macro avg       0.68      0.81      0.71      8569
        weighted avg       0.89      0.81      0.83      8569

        Processing classifier: RandomForestClassifier
        Classifier: RandomForestClassifier f1: [0.6190900981266727, 0.6271338724168913]
        tfidf vector RandomForestClassifier results:
        [[7041  350]
         [ 480  698]]
                      precision    recall  f1-score   support

                   0       0.94      0.95      0.94      7391
                   1       0.67      0.59      0.63      1178

            accuracy                           0.90      8569
           macro avg       0.80      0.77      0.79      8569
        weighted avg       0.90      0.90      0.90      8569

### Full Sentences Results

In [4]:
# sentences from all excerpts
#short_ex_results = find_best_classifier(["data/sentences_df.csv"])