# Sentence Classification Using Machine Learning

Various machine learning and statistical techniques can be used to classify sentences into one of several categories.  This note uses supervised learning: a model is trained on a set of pre-classified sentences.

The approach taken here avoids counting specific types of words (such as "question words") as features and instead considers Part-Of-Speech patterns in a sentence.  For a full model, this could be combined with word-count and other features.

**Notebook Process Flow**
1. Load Data
2. Extract Features
3. Build a Model against the Training Data Set and Validate against Test Set
4. Test the Model against a different Data Set

**Dependencies**  

This Python notebook depends on the set of classified sentences in **[sentences.csv](https://github.com/edbullen/NLPBot/releases/download/SupportingFiles/sentences.csv)** (or an equivalent set of data).  A second data-set, **[pythonFAQ.csv](https://github.com/edbullen/NLPBot/releases/download/SupportingFiles/pythonFAQ.csv)**, is used to test the model.  Both files need to be downloaded in advance, and the paths in the notebook must be set manually to point to where they are stored.

To build a classification model we first need to extract some features, and this is the bulk of the effort for the task detailed in this note.  The Python scikit-learn library provides a comprehensive set of pre-packaged machine learning algorithms that can then be applied to the resulting data-set.

The approach for extracting features is demonstrated in part with in-line code in the notebook, but the full functionality is wrapped in a bespoke module, **[features.py](https://github.com/edbullen/NLPBot/releases/download/SupportingFiles/features.py)**.  This module needs to be downloaded in advance, and the path to it set manually in the notebook.  The output of running the functions in features.py is saved in **[featuresDump.csv](https://github.com/edbullen/NLPBot/releases/download/SupportingFiles/featuresDump.csv)**, which also needs to be downloaded in advance and stored in the same location as the other CSV files.

To download these files, either clone the whole GitHub Repo https://github.com/edbullen/NLPBot , or download each file individually:
```
wget https://github.com/edbullen/NLPBot/releases/download/SupportingFiles/sentences.csv
wget https://github.com/edbullen/NLPBot/releases/download/SupportingFiles/pythonFAQ.csv
wget https://github.com/edbullen/NLPBot/releases/download/SupportingFiles/features.py
wget https://github.com/edbullen/NLPBot/releases/download/SupportingFiles/featuresDump.csv
```
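
If `wget` is not available, the same files could be fetched from Python. The following is a minimal sketch using only the standard library; the helper names `file_urls` and `download_all` are illustrative and not part of the NLPBot repository.

```python
import os
import urllib.request

# Release location and file names taken from the links above.
BASE_URL = "https://github.com/edbullen/NLPBot/releases/download/SupportingFiles/"
FILES = ["sentences.csv", "pythonFAQ.csv", "features.py", "featuresDump.csv"]

def file_urls(base_url=BASE_URL, files=FILES):
    """Return (filename, url) pairs for the supporting files."""
    return [(name, base_url + name) for name in files]

def download_all(dest_dir="."):
    """Fetch each supporting file into dest_dir, skipping files already present."""
    os.makedirs(dest_dir, exist_ok=True)
    for name, url in file_urls():
        target = os.path.join(dest_dir, name)
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)
```

Calling `download_all("data")` would place the four files in a `data` folder, which can then be used for the path variables set in the notebook.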


## 1. Load Data ##

First, load some pre-classified data from a CSV file called "sentences.csv".


In [6]:
#load 100 sentences with a classification Q/S/C
import numpy as np
import pandas as pd

CODE_LOC = 'C://Zeldon/Projects/edbullenChatBot/nltk'   # !! Modify to the path of the folder containing "features.py"
DATA_LOC = 'C://Zeldon/Projects/edbullenChatBot/nltk/sentences.csv'  # !! Modify this to the CSV data location

sentences = pd.read_csv(filepath_or_buffer = DATA_LOC)   

In [7]:
sentences.head(10)

Unnamed: 0,SENTENCE,CLASS
0,"Sorry, I don't know about the weather.",S
1,That is a tricky question to answer.,C
2,What does OCM stand for,Q
3,MAX is a Mobile Application Accelerator,S
4,Can a dog see in colour?,Q
5,how are you,C
6,If you deploy a MySQL database in the Oracle c...,Q
7,who is dominic Fakename,Q
8,what's the weather like today?,C
9,Can the OCM host non Oracle software stacks?,Q


In [8]:
sentences.shape

(100, 2)
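
Before extracting features it can be useful to check how the labels are distributed across the three classes. The sketch below does this with the standard library only, using six rows copied from the `head(10)` output above as stand-in data; `class_counts` is an illustrative helper name.

```python
import csv
import io
from collections import Counter

# Six rows copied from the sentences.head(10) output, used as stand-in data.
sample_csv = '''SENTENCE,CLASS
"Sorry, I don't know about the weather.",S
That is a tricky question to answer.,C
What does OCM stand for,Q
MAX is a Mobile Application Accelerator,S
Can a dog see in colour?,Q
how are you,C
'''

def class_counts(csv_text):
    """Count how many sentences carry each class label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["CLASS"] for row in reader)

counts = class_counts(sample_csv)
print(counts)
```

With the full data already loaded into the `sentences` data-frame, `sentences['CLASS'].value_counts()` gives the equivalent summary.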

## 2. Feature Engineering - A Non-Standard, Bespoke Approach ##

Chapter 6 of the NLTK Book has a great deal of background and worked examples for classifying text using machine learning algorithms such as Naive Bayes Classifiers.   A different bespoke approach involving home-grown feature engineering and a scikit-learn Random Forest model is outlined in this note.

The code snippet below is an example of taking a sentence and extracting sets of *POS-tag Triples* from it.  We can use this approach for building up features from a sentence by counting occurrences of triple-patterns (or other POS-tag patterns).

In [9]:
# Extract some patterns of PoS sequences
import nltk
from nltk import word_tokenize

list_of_triple_strings = []  # triple sequence of PoS tags
sentence = "Can a dog see in colour?"

sentenceParsed = word_tokenize(sentence)
pos_tags = nltk.pos_tag(sentenceParsed)
pos = [ i[1] for i in pos_tags ]
print("Words mapped to Part of Speech Tags:",pos_tags)
print("PoS Tags:", pos)

n = len(pos)
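for i in range(0,n-3):       # note: this stops one window early, so the final triple ('IN-NN-.') is not emitted
    t = "-".join(pos[i:i+3]) # take 3 consecutive tags starting at position i and join them into one string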
    list_of_triple_strings.append(t)
    
print("sequences of triples:", list_of_triple_strings)

Words mapped to Part of Speech Tags: [('Can', 'MD'), ('a', 'DT'), ('dog', 'NN'), ('see', 'NN'), ('in', 'IN'), ('colour', 'NN'), ('?', '.')]
PoS Tags: ['MD', 'DT', 'NN', 'NN', 'IN', 'NN', '.']
sequences of triples: ['MD-DT-NN', 'DT-NN-NN', 'NN-NN-IN', 'NN-IN-NN']
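
The same sliding-window idea extends to sequences of any length. The sketch below is a small generalisation (the function name `pos_ngrams` is illustrative, not part of `features.py`). It works on a plain list of tags, so it does not need NLTK, and unlike the `range(0,n-3)` loop above it also emits the final window.

```python
def pos_ngrams(pos_tags, n=3):
    """Join every run of n consecutive PoS tags into a 'TAG-TAG-...' string."""
    return ["-".join(pos_tags[i:i + n]) for i in range(len(pos_tags) - n + 1)]

# Tags copied from the output above for "Can a dog see in colour?"
tags = ['MD', 'DT', 'NN', 'NN', 'IN', 'NN', '.']

triples = pos_ngrams(tags, 3)   # 5 windows, the last one being 'IN-NN-.'
pairs = pos_ngrams(tags, 2)     # 6 windows of two tags each
print(triples)
print(pairs)
```

If the sentence has fewer than `n` tags, the function returns an empty list rather than raising an error.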


### Extracting Features ###
After pre-processing the sentences (using the approach above) we can get a set of triples for each of Questions, Chat and Statements.  There will be a lot of intersection between the sets, but hopefully also some clear patterns that distinguish the classes.

### The `features.py` Features Generator ###
This is a custom Python module to extract features from a sentence, written for this ChatBot demo.

`features.py` is located here: https://github.com/edbullen/NLPBot/releases/download/SupportingFiles/features.py
    
Just
```
import features
```
and call 
```
f = features.features_dict(id, sentence, c)
```

to extract a dictionary of features for the given sentence.  

+ The "id" can be any arbitrary ID value - it is simply passed in and passed back out as an identifier in the resulting dictionary.
+ The "c" value can also be any arbitrary value representing the Class label - the idea is to supply an appropriate label so that the dict that is passed back has all the necessary information in it.

The actual features that are generated, and the logic behind how this is done, are hard-coded in features.py (it is not parameterised, which is a potential enhancement that could be added).

#### `features.py` POS Triples Extract ####

The features.py module includes a function  
```
get_triples(pos)
```  
which returns a list of strings of the form `"POS-POS-POS"`, where "POS" is a Part-Of-Speech tag.

**Example**

In [11]:
import sys
sys.path.append(CODE_LOC)  # set search path to code cloned from GitHub
import features            # bespoke "feature engineering" module

sentence = "Can a dog see in colour?"

sentence = features.strip_sentence(sentence)
print(sentence)
pos = features.get_pos(sentence)
triples = features.get_triples(pos)

print(triples)

Can a dog see in colour
['MD-DT-NN', 'DT-NN-NN', 'NN-NN-IN', 'NN-IN-NN']


#### Process for Identifying Candidate Features - Analysis in SQL ####
The objective is to identify candidate PoS sequences that signify a likelihood of a Statement / Question / Chat sentence.

Approach: dump all triples for each sentence with sentence-type label ("S"/"Q"/"C") recorded for each item into a SQL database.
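
The note does not say which database was used. As a minimal sketch, assuming SQLite and a `triples` table with `label` and `triple` columns (the names used in the queries in this section), the load step could look like this. The rows are hypothetical examples, not the real data.

```python
import sqlite3

# Hypothetical (label, triple) rows; in practice these would come from running
# the triple extraction over every labelled sentence.
rows = [
    ("Q", "WP-VBZ-NNP"),
    ("Q", "WP-VBZ-NNP"),
    ("Q", "MD-DT-NN"),
    ("S", "NNP-VBZ-DT"),
    ("S", "NNP-VBZ-DT"),
    ("C", "WRB-VBP-PRP"),
]

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE triples (label TEXT, triple TEXT)")
conn.executemany("INSERT INTO triples (label, triple) VALUES (?, ?)", rows)

# Total number of triples, then the count per label.
total = conn.execute("SELECT count(*) FROM triples").fetchone()[0]
by_label = dict(conn.execute(
    "SELECT label, count(triple) FROM triples GROUP BY label"))
print(total, by_label)
```

Once the table is populated this way, the queries in this section can be run through `conn.execute(...)` unchanged.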

**Count all triples**
```sql
SELECT count(*) FROM triples;  
> 360
```  

**Break-down by label type**
```sql
SELECT count(triple),label 
FROM triples
GROUP by label;  

 count(triple) label
>          37  C
>         145  Q
>         178  S
```   

**Commonly occurring triple sequences by label type**
```sql
SELECT label, triple, occurences
FROM
    (SELECT triples.label label, triples.triple triple, count(triples.triple) occurences
    FROM triples,
        (select triple, count(triple) occurences
         from triples
         group by triple) counts
     WHERE counts.occurences > 2
     AND triples.triple = counts.triple
     GROUP BY triples.triple, triples.label
     ORDER BY 2,1
     ) triples_by_label
WHERE occurences > 1
 ;
```

<table align="left">
<tr><td>Q</td> <td align="left"> <font color="red"> CD-VB-VBN </font></td> <td align="left">5</td></tr>
<tr><td>S</td> <td align="left"> <font color="green">DT-JJ-NN </font></td> <td align="left">3</td></tr>
<tr><td bgcolor="lightgrey">Q</td> <td bgcolor="lightgrey" align="left">DT-NN-NN</td> <td bgcolor="lightgrey" align="left">3</td></tr>
<tr><td bgcolor="lightgrey">S</td> <td bgcolor="lightgrey" align="left">DT-NN-NN</td> <td bgcolor="lightgrey" align="left">3</td></tr>
<tr><td>S</td> <td align="left"> <font color="green"> DT-NN-VBZ </font> </td> <td align="left">3</td></tr>
<tr><td>S</td> <td align="left">DT-NNP-IN</td> <td align="left">2</td></tr>
<tr bgcolor="lightgrey"><td bgcolor="lightgrey">Q</td> <td bgcolor="lightgrey" align="left">DT-NNP-NN</td> <td bgcolor="lightgrey" align="left">4</td></tr>
<tr bgcolor="lightgrey"><td>S</td> <td align="left">DT-NNP-NN</td> <td align="left">5</td></tr>
<tr><td>S</td> <td align="left"><font color="green"> DT-NNP-NNP </font> </td> <td align="left">4</td></tr>
<tr><td>S</td> <td align="left"> <font color="green"> IN-DT-NN </font> </td> <td align="left">3</td></tr>
<tr><td bgcolor="lightgrey">Q</td> <td bgcolor="lightgrey" align="left">IN-DT-NNP</td> <td bgcolor="lightgrey" align="left">4</td></tr>
<tr><td bgcolor="lightgrey">S</td> <td bgcolor="lightgrey" align="left">IN-DT-NNP</td> <td bgcolor="lightgrey" align="left">3</td></tr>
<tr><td>S</td> <td align="left"> <font color="green">IN-NN-NNS </font> </td> <td align="left">3</td></tr>
<tr><td>Q</td> <td align="left"> <font color="red">MD-PRP-VB </font> </td> <td align="left">5</td></tr>
<tr><td>Q</td> <td align="left"> <font color="red">MD-VB-CD </font> </td> <td align="left">4</td></tr>
<tr><td>S</td> <td align="left"> <font color="green">MD-VB-VBN </font> </td> <td align="left">3</td></tr>
<tr><td>Q</td> <td align="left"> <font color="red"> NN-IN-DT </font> </td> <td align="left">3</td></tr>
<tr><td bgcolor="lightgrey">Q</td> <td bgcolor="lightgrey" align="left">NN-NN-IN</td> <td bgcolor="lightgrey" align="left">2</td></tr>
<tr><td bgcolor="lightgrey">S</td> <td bgcolor="lightgrey" align="left">NN-NN-IN</td> <td bgcolor="lightgrey" align="left">3</td></tr>
<tr><td>S</td> <td align="left"> <font color="green"> NNP-IN-NNP </font> </td> <td align="left">3</td></tr>
<tr><td>S</td> <td align="left"> <font color="green"> NNP-NNP-NNP </font> </td> <td align="left">14</td></tr>
<tr><td bgcolor="lightgrey">Q</td> <td bgcolor="lightgrey" align="left">NNP-NNP-VBZ</td> <td bgcolor="lightgrey"  align="left">2</td></tr>
<tr><td bgcolor="lightgrey">S</td> <td bgcolor="lightgrey" align="left">NNP-NNP-VBZ</td> <td bgcolor="lightgrey" align="left">4</td></tr>
<tr><td>S</td> <td align="left"> <font color="green"> NNP-VBZ-DT </font></td> <td align="left">8</td></tr>
<tr><td>S</td> <td align="left"> <font color="green"> NNP-VBZ-NNP </font></td> <td align="left">5</td></tr>
<tr><td>S</td> <td align="left"> <font color="green"> NNS-IN-DT </font></td> <td align="left">3</td></tr>
<tr><td>Q</td> <td align="left"> <font color="red"> PRP-VB-PRP </font></td> <td align="left">3</td></tr>
<tr><td>Q</td> <td align="left"> <font color="red"> PRP-WP-NNP </font></td> <td align="left">3</td></tr>
<tr><td>Q</td> <td align="left"> <font color="red"> VB-CD-VB </font> </td> <td align="left">4</td></tr>
<tr><td>Q</td> <td align="left"> <font color="red"> VB-PRP-WP </font></td> <td align="left">3</td></tr>
<tr><td>S</td> <td align="left"> <font color="green"> VB-VBN-IN </font></td> <td align="left">3</td></tr>
<tr><td>S</td> <td align="left"> <font color="green"> VBZ-DT-JJ </font></td> <td align="left">3</td></tr>
<tr><td>Q</td> <td align="left"> <font color="red"> VBZ-DT-NN </font> </td> <td align="left">7</td></tr>
<tr><td bgcolor="lightgrey">Q</td> <td bgcolor="lightgrey" align="left">VBZ-DT-NNP</td> <td bgcolor="lightgrey" align="left">2</td></tr>
<tr><td bgcolor="lightgrey">S</td> <td bgcolor="lightgrey" align="left">VBZ-DT-NNP</td> <td bgcolor="lightgrey" align="left">5</td></tr>
<tr><td bgcolor="lightgrey">Q</td> <td bgcolor="lightgrey" align="left">VBZ-NNP-NNP</td> <td bgcolor="lightgrey" align="left">3</td></tr>
<tr><td bgcolor="lightgrey">S</td> <td bgcolor="lightgrey" align="left">VBZ-NNP-NNP</td> <td bgcolor="lightgrey" align="left">5</td></tr>
<tr><td>Q</td> <td align="left"> <font color="red"> WP-VBZ-DT </font> </td> <td align="left">6</td></tr>
<tr><td>Q</td> <td align="left"> <font color="red"> WP-VBZ-NNP </font> </td> <td align="left">9</td></tr>
<tr><td>Q</td> <td align="left"> <font color="red"> WRB-MD-VB </font> </td> <td align="left">4</td></tr>
<tr><td>C</td> <td align="left">WRB-VBP-PRP</td> <td align="left">3</td></tr>
</table>
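
One plausible way to turn the table into numeric features is to count how many of a sentence's triples fall into a class-specific set. This is only a sketch: the scoring logic inside `features.py` is not reproduced in this note, so `qTripleScore` and `sTripleScore` may be computed differently. The sets below are a hand-picked subset of triples that appear under a single label in the table.

```python
# Triples seen only for Questions / only for Statements in the table above
# (a hand-picked subset, chosen for illustration).
QUESTION_TRIPLES = {"WP-VBZ-NNP", "VBZ-DT-NN", "WP-VBZ-DT", "MD-PRP-VB", "CD-VB-VBN"}
STATEMENT_TRIPLES = {"NNP-NNP-NNP", "NNP-VBZ-DT", "NNP-VBZ-NNP", "DT-NNP-NNP"}

def triple_score(triples, indicator_set):
    """Count how many of a sentence's triples belong to the indicator set."""
    return sum(1 for t in triples if t in indicator_set)

# Hypothetical triples for one sentence.
sentence_triples = ["WP-VBZ-NNP", "VBZ-NNP-NNP", "NNP-NNP-NNP"]

q_score = triple_score(sentence_triples, QUESTION_TRIPLES)
s_score = triple_score(sentence_triples, STATEMENT_TRIPLES)
print(q_score, s_score)
```

Triples that occur under both labels (the grey rows) are left out of both sets because they do not help separate the classes.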



In [13]:
#### Bespoke Features Generator Example - Get a Python Dictionary of Features ####
sentences = ["Can a dog see in colour?",
             "Hey, How's it going?",
             "Oracle 12.2 will be released for on-premises users on 15 March 2017",
             "When will Oracle 12 be released"]
id = 1
for s in sentences:
    features_dict = features.features_dict(str(id),s) #takes id, sentence and optional category (default=X) returns a dict of features
    features_string,header = features.get_string(str(id),s)
    print(features_dict)
    #print(features_string)
    id += 1

{'stemmedCount': 4, 'VBZ': 0, 'qVerbCombo': 1, 'stemmedEndNN': 0, 'class': 'X', 'NN': 3, 'verbBeforeNoun': 1, 'startTuple0': 0, 'id': '1', 'qTripleScore': 0, 'PRP': 0, 'qMark': 1, 'NNPS': 0, 'NNS': 0, 'endTuple1': 0, 'VBG': 0, 'endTuple2': 0, 'NNP': 0, 'sTripleScore': 0, 'endTuple0': 1, 'CD': 0, 'wordCount': 6}
{'stemmedCount': 3, 'VBZ': 0, 'qVerbCombo': 1, 'stemmedEndNN': 0, 'class': 'X', 'NN': 0, 'verbBeforeNoun': 0, 'startTuple0': 0, 'id': '2', 'qTripleScore': 0, 'PRP': 1, 'qMark': 1, 'NNPS': 0, 'NNS': 0, 'endTuple1': 0, 'VBG': 1, 'endTuple2': 0, 'NNP': 2, 'sTripleScore': 0, 'endTuple0': 0, 'CD': 0, 'wordCount': 4}
{'stemmedCount': 8, 'VBZ': 0, 'qVerbCombo': 1, 'stemmedEndNN': 0, 'class': 'X', 'NN': 1, 'verbBeforeNoun': 0, 'startTuple0': 0, 'id': '3', 'qTripleScore': 0, 'PRP': 0, 'qMark': 0, 'NNPS': 0, 'NNS': 1, 'endTuple1': 0, 'VBG': 0, 'endTuple2': 0, 'NNP': 1, 'sTripleScore': 2, 'endTuple0': 0, 'CD': 3, 'wordCount': 12}
{'stemmedCount': 4, 'VBZ': 0, 'qVerbCombo': 1, 'stemmedEndNN

In [22]:
print(header.split(","))

['id', 'wordCount', 'stemmedCount', 'qVerbCombo', 'qMark', 'verbBeforeNoun', 'VBG', 'VBZ', 'NNP', 'NN', 'NNS', 'NNPS', 'PRP', 'CD', 'StemmedEndNN', 'startTuple1', 'endTuple1', 'endTuple2', 'endTuple3', 'qTripleScore', 'sTripleScore', 'class']


In [18]:
features_string

'4,6,4,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,4,0,X'

With this approach we can bulk-generate numeric features from a CSV file of sentences. If each sentence has a unique ID and a class label (S/Q/C), we can now try to build an ML classification model and assess its effectiveness.

The script `featuresDump.py` processes a raw `sentences.csv` file with the `features.py` utility and dumps out a file in a format as listed below:

 ```
 id, wordCount, stemmedCount, stemmedEndNN, CD, NN, NNP, NNPS, NNS, PRP, VBG, VBZ, startTuple0, endTuple0, endTuple1, endTuple2, verbBeforeNoun, qMark, qVerbCombo, qTripleScore, sTripleScore, class  
 44d8a78d2ca66b1b, 7, 5, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, S  
 a9133770c79b2c43, 7, 4, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 2, C  
 246cf41a55627762, 5, 3, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, Q  
 53ac5757399632e8, 6, 4, 0, 0, 0, 3, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 2, S  
 78e580bde0b4396e, 6, 4, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, Q  
...  
...  
...  
 036d7e8be25c3108, 4, 2, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, Q  
 b2dd2ca708214c2a, 6, 4, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 4, 0, Q  
 73ebcc1f94f38ddf, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, C  
 617c60a010967c8a, 4, 3, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, C  
 ecef7fa7fcb25f20, 9, 6, 0, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, S  
 16fb4f28223d22a9, 7, 5, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, Q  
 7fea2d04212f8039, 8, 5, 0, 0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, S  
 3df9464caeef89a4, 13, 7, 0, 0, 3, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 3, S  
```
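
The `featuresDump.py` script itself is not shown in this note. The sketch below illustrates the general shape of such a dump: one feature dictionary per sentence, written out in a fixed column order. It uses a hypothetical stand-in for `features.features_dict()` and a shortened column list, and it borrows the MD5-based ID scheme from the FAQ-loading cell later in this note.

```python
import csv
import hashlib
import io

KEYS = ["id", "wordCount", "qMark", "class"]   # shortened column list, for illustration

def sentence_id(sentence):
    """16-hex-character ID derived from the sentence text."""
    return hashlib.md5(sentence.encode("utf-8")).hexdigest()[:16]

def stand_in_features(id_, sentence, label):
    """Hypothetical replacement for features.features_dict(), for illustration only."""
    return {"id": id_,
            "wordCount": len(sentence.split()),
            "qMark": int(sentence.strip().endswith("?")),
            "class": label}

def dump_features(labelled_sentences, out):
    """Write a header row, then one row of features per (sentence, label) pair."""
    writer = csv.DictWriter(out, fieldnames=KEYS)
    writer.writeheader()
    for sentence, label in labelled_sentences:
        writer.writerow(stand_in_features(sentence_id(sentence), sentence, label))

buf = io.StringIO()
dump_features([("Can a dog see in colour?", "Q"), ("how are you", "C")], buf)
print(buf.getvalue())
```

Passing an open file instead of the `StringIO` buffer would write the rows to disk.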


## 3. Build a Machine Learning Model  ##

In this section we load a features CSV file called **`featuresDump.csv`** into a Pandas data-frame.  The data was generated with `features.py` reading in the `sentences.csv` file as described in the previous section.  The featuresDump.csv data is then used to train a Random Forest model to predict whether a sentence is **C**hat, **S**tatement or **Q**uestion.

The `featuresDump.csv` file can be downloaded from here: https://github.com/edbullen/NLPBot/releases/download/SupportingFiles/featuresDump.csv

#### Load the Data ####

In [23]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FNAME = 'C://Zeldon/Projects/edbullenChatBot/nltk/featuresDump.csv' # !! Modify this to the CSV data location

df = pd.read_csv(filepath_or_buffer = FNAME, )   
print(str(len(df)), "rows loaded")

# Strip any leading spaces from col names
df.columns = df.columns[:].str.strip()
df['class'] = df['class'].map(lambda x: x.strip())

width = df.shape[1]

100 rows loaded


#### Split into Test and Training Sets ####

In [24]:
#split into test and training (is_train: True / False col)
np.random.seed(seed=1)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
train, test = df[df['is_train']==True], df[df['is_train']==False]
print(str(len(train)), " rows split into training set,", str(len(test)), "split into test set.")

features = df.columns[1:width-1]  #remove the first ID col and last col=classifier
print("FEATURES = {}".format(features))

77  rows split into training set, 23 split into test set.
FEATURES = Index(['wordCount', 'stemmedCount', 'stemmedEndNN', 'CD', 'NN', 'NNP', 'NNPS',
       'NNS', 'PRP', 'VBG', 'VBZ', 'startTuple0', 'endTuple0', 'endTuple1',
       'endTuple2', 'verbBeforeNoun', 'qMark', 'qVerbCombo', 'qTripleScore',
       'sTripleScore'],
      dtype='object')
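
The uniform-random mask above gives only an approximate 75/25 split (77/23 here) and does not guarantee that each class appears in the same proportion in both sets. With only 100 rows that can matter for the smallest class. The following standard-library sketch shows a stratified alternative; the function name and the class mix are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.75, seed=1):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                          # randomise order within the class
        cut = round(len(idxs) * train_fraction)    # number of rows kept for training
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return sorted(train_idx), sorted(test_idx)

labels = ["Q"] * 40 + ["S"] * 48 + ["C"] * 12      # hypothetical class mix for 100 sentences
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

The index lists could then be applied with `df.iloc[train_idx]` and `df.iloc[test_idx]`. scikit-learn's `train_test_split` offers the same behaviour through its `stratify` argument.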


#### Fit a Model with the Training Data-Set ####

In [25]:
# Fit an RF Model for "class" given features
clf = RandomForestClassifier(n_jobs=2, n_estimators = 100)
clf.fit(train[features], train['class'])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=2, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
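
A fitted Random Forest exposes `feature_importances_`, which gives a rough view of which features the model relies on. The sketch below uses synthetic data in which the label depends only on the first column; the feature names are borrowed from this note purely as labels and the values are not real.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(1)
X = rng.randint(0, 5, size=(100, 3)).astype(float)   # synthetic feature values
y = np.where(X[:, 0] > 2, "Q", "S")                  # label depends only on column 0
feature_names = ["qTripleScore", "wordCount", "NN"]  # names used as labels only

demo_clf = RandomForestClassifier(n_estimators=100, random_state=1)
demo_clf.fit(X, y)

# Pair each feature with its importance and list the most important first.
ranked = sorted(zip(feature_names, demo_clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(name, round(importance, 3))
```

On the real model the equivalent would be `sorted(zip(features, clf.feature_importances_), key=lambda pair: pair[1], reverse=True)`.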

#### Generate Predictions from the Test Data-Set ####

In [26]:


# Predict against test set
preds = clf.predict(test[features])
predout = pd.DataFrame({ 'id' : test['id'], 'predicted' : preds, 'actual' : test['class'] })

In [27]:
print(predout)

   actual                 id predicted
13      Q   31cedeb4e04fba02         Q
20      S   af7dd6b70d544b56         S
21      Q   584d5d4428d60a5f         S
24      Q   9140ee537fbe5390         Q
25      S   cabf9e317ba4a072         S
29      Q   3d25a26134f0e450         Q
32      S   280b0360e0d3ffc1         S
37      Q   0d4a13fc4cce6dab         Q
39      C   35179a54ea587953         C
40      C   8cdda20f1ae22213         C
43      Q   8798ff1fe7ac435d         Q
46      S   bc013bdd28614223         S
68      Q   7055c710336d670c         Q
70      Q   3b416352816dc854         Q
73      S   601fdf6ab85a9875         S
76      S   498b643ac17bcc7d         C
78      S   64e22039495c59bf         S
80      S   cc0c263a455bb702         S
82      C   8b1a9953c4611296         C
85      S   6b2d6039a794fb49         S
87      S   94590dd047fcbfce         S
91      Q   7a0fc645497df2c6         Q
96      S   ecef7fa7fcb25f20         S


#### Basic Validation ####

In [28]:
## Cross-check accuracy ##
print(pd.crosstab(test['class'], preds, rownames=['actual'], colnames=['preds']))
print("\n",pd.crosstab(test['class'], preds, rownames=['actual']
                       , colnames=['preds']).apply(lambda r: round(r/r.sum()*100,2), axis=1) )

from sklearn.metrics import accuracy_score
print("\n\nAccuracy Score: ", round(accuracy_score(test['class'], preds),3) ) # fraction of test sentences classified correctly

preds   C  Q   S
actual          
C       3  0   0
Q       0  8   1
S       1  0  10

 preds        C      Q      S
actual                      
C       100.00   0.00   0.00
Q         0.00  88.89  11.11
S         9.09   0.00  90.91


Accuracy Score:  0.913
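
The accuracy figure and the per-class percentages can be checked by hand from the confusion matrix. The sketch below copies the counts from the crosstab output above and recomputes accuracy, plus recall and precision per class. The helper names are illustrative.

```python
# Confusion matrix copied from the crosstab output: rows = actual, columns = predicted.
labels = ["C", "Q", "S"]
matrix = [
    [3, 0, 0],    # actual C
    [0, 8, 1],    # actual Q
    [1, 0, 10],   # actual S
]

def accuracy(m):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct / total

def recall(m, i):
    """Share of actual class i that was predicted as class i (row-wise)."""
    return m[i][i] / sum(m[i])

def precision(m, i):
    """Share of class-i predictions that were actually class i (column-wise)."""
    return m[i][i] / sum(row[i] for row in m)

print("accuracy:", round(accuracy(matrix), 3))
for i, label in enumerate(labels):
    print(label, "recall:", round(recall(matrix, i), 4),
          "precision:", round(precision(matrix, i), 4))
```

The row percentages printed by the notebook correspond to recall. Precision adds the complementary view: for example, one of the four sentences predicted as Chat was actually a Statement.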


#### Flaws in the Approach and Further Validation ####

The accuracy appears good, but the test set is only 23 sentences, and the approach taken (choosing features by inspecting the labelled sentences) probably means we have over-fitted the feature selection.  In the next section we try out the model on a completely different data-set, taken from the Python FAQ at https://docs.python.org/3/faq/general.html

## 4. Test Model Against the Python FAQ ##

#### Generate Features using the Features Generator ####
A prepared CSV containing Statements and Questions from the Python FAQ site (https://docs.python.org/3/faq/general.html) can be downloaded from here: https://github.com/edbullen/NLPBot/releases/download/SupportingFiles/pythonFAQ.csv

Some random chat statements have been added to the file as well, e.g. "What do you reckon?" and "yeah, whatever".

#### Load Sentence Data and Generate Features ####

In [29]:
# load in some pre-formatted FAQ data in a CSV
FNAME = 'C://Zeldon/Projects/edbullenChatBot/nltk/pythonFAQ.csv' # !! Modify this to the CSV data location

import csv
import hashlib 

import features

fin = open(FNAME, 'rt')
reader = csv.reader(fin)

keys = ["id",
"wordCount",
"stemmedCount",
"stemmedEndNN",
"CD",
"NN",
"NNP",
"NNPS",
"NNS",
"PRP",
"VBG",
"VBZ",
"startTuple0",
"endTuple0",
"endTuple1",
"endTuple2",
"verbBeforeNoun",
"qMark",
"qVerbCombo",
"qTripleScore",
"sTripleScore",
"class"]

rows = []

next(reader)  #Assume we have a header 
for line in reader:
    sentence = line[0]  
    c = line[1]        #class-label
    id = hashlib.md5(str(sentence).encode('utf-8')).hexdigest()[:16] # generate a unique ID
    
    f = features.features_dict(id,sentence, c)
    row = []
    
    for key in keys:
        value = f[key]
        row.append(value)
    rows.append(row)
    
faq = pd.DataFrame(rows, columns=keys)
fin.close()

In [30]:
faq.head()

Unnamed: 0,id,wordCount,stemmedCount,stemmedEndNN,CD,NN,NNP,NNPS,NNS,PRP,...,startTuple0,endTuple0,endTuple1,endTuple2,verbBeforeNoun,qMark,qVerbCombo,qTripleScore,sTripleScore,class
0,e8af070019393e21,3,2,0,0,0,1,0,0,0,...,0,0,0,1,1,1,1,1,0,Q
1,cb2bfe367a7f5bfe,6,4,0,0,0,3,0,0,0,...,0,0,0,0,1,1,1,1,2,Q
2,30c3a9b0a23a6365,9,5,1,0,1,2,0,1,0,...,0,0,0,0,0,1,1,0,2,Q
3,8f285b6ab8472a48,8,5,0,0,1,1,0,0,0,...,0,0,0,0,1,1,1,0,1,Q
4,ce34f3fa53325140,5,3,0,0,0,1,0,0,0,...,0,0,0,0,1,1,1,1,0,Q


#### Predict the Class of Sentence with Previously Built Model ####

In [31]:
# Predict against FAQ test set
featureNames = faq.columns[1:width-1]  #remove the first ID col and last col=classifier
faqPreds = clf.predict(faq[featureNames])

predout = pd.DataFrame({ 'id' : faq['id'], 'predicted' : faqPreds, 'actual' : faq['class'] })

#### Cross-Check Accuracy ####

In [32]:
## Cross-check accuracy ##
print(pd.crosstab(faq['class'], faqPreds, rownames=['actual'], colnames=['preds']))

print("\n",pd.crosstab(faq['class'], faqPreds, rownames=['actual'],
                       colnames=['preds']).apply(lambda r: round(r/r.sum()*100,2), axis=1) )


preds    C   Q   S
actual            
C       12   5   2
Q        0  14   2
S        0   3  13

 preds       C      Q      S
actual                     
C       63.16  26.32  10.53
Q        0.00  87.50  12.50
S        0.00  18.75  81.25


In [33]:
print("Accuracy Score:", round(accuracy_score(faq['class'], faqPreds) ,3) )

Accuracy Score: 0.765


This could be summarised as "OK" but not great.  

Question and Statement predictions are more than 80% accurate (87.50% and 81.25% respectively), while Chat sentences are recognised only 63.16% of the time.  The feature extraction method could easily be expanded on and enhanced.  
Also the training data-set is small.
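
With so few rows, a single train/test split gives a noisy accuracy estimate. One common remedy is k-fold cross-validation, sketched below with scikit-learn on synthetic stand-in data. The arrays `X` and `y` are assumptions for illustration, and the scores they produce say nothing about the real model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(1)
X = rng.randint(0, 5, size=(100, 4)).astype(float)   # synthetic stand-in for df[features]
y = np.where(X[:, 0] + X[:, 1] > 4, "Q", "S")        # synthetic stand-in for df['class']

# Five stratified folds: each fold keeps roughly the same class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1)
scores = cross_val_score(model, X, y, cv=cv)          # one accuracy value per fold

print("fold accuracies:", np.round(scores, 3))
print("mean accuracy:", round(scores.mean(), 3))
```

The spread between the fold scores gives an indication of how much a single reported accuracy could vary with the choice of split.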

## 5. Ad-hoc testing and experiments ##

In [34]:
textout = {'Q': "QUESTION", 'C': "CHAT", 'S':"STATEMENT"}

mySentence = "Scikit-learn is a popular Python library for Machine Learning."
#mySentence = "The cat is dead"
#mySentence = "Is the cat dead"

myFeatures = features.features_dict('1',mySentence, 'X')

values=[]
for key in keys:
    values.append(myFeatures[key])

s = pd.Series(values)
width = len(s)
myFeatures = s[1:width-1]  #Drop the first item (id) and the last item (class), leaving only the feature values
predict = clf.predict([myFeatures])

print("\n\nPrediction is: ", textout[predict[0].strip()])



Prediction is:  STATEMENT
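
For repeated experiments, the steps in the cell above can be wrapped in a small helper. The sketch below takes the feature extractor, the model and the column list as arguments so that it runs on its own. `stub_features` and `StubModel` are hypothetical stand-ins for `features.features_dict` and the trained `clf`.

```python
TEXT_OUT = {"Q": "QUESTION", "C": "CHAT", "S": "STATEMENT"}

def classify(sentence, extract_features, model, keys):
    """Return the readable class name predicted for a single sentence."""
    f = extract_features("1", sentence, "X")   # 'X' = class not known yet
    row = [f[k] for k in keys[1:-1]]           # drop 'id' (first) and 'class' (last)
    return TEXT_OUT[model.predict([row])[0].strip()]

# Hypothetical stand-ins so the sketch runs without features.py or a trained model.
def stub_features(id_, sentence, label):
    return {"id": id_, "qMark": int(sentence.endswith("?")), "class": label}

class StubModel:
    def predict(self, rows):
        # Predict 'Q' whenever the single qMark feature is set, otherwise 'S'.
        return ["Q" if row[0] == 1 else "S" for row in rows]

demo_keys = ["id", "qMark", "class"]
print(classify("Is the cat dead?", stub_features, StubModel(), demo_keys))
```

In the notebook the call would be `classify(mySentence, features.features_dict, clf, keys)`.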
