## Import Libs

In [None]:
import pandas
import sklearn
import numpy as np
from IPython.display import display

import matplotlib.pyplot as plt

## Loading data
First, we load the data from disk into a DataFrame.

A DataFrame is essentially a table: a 2D array (matrix) with a name for each column.
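
To make the idea of a named-column table concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The rows and label values are made up for illustration; only the two column names mirror the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical rows; the real data comes from phone-data.csv
toy_df = pd.DataFrame({
    "Sentence Utterance": ["check my balance", "top up please", "call is dropping"],
    "app tag": ["balance", "topup", "network"],
})

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # the column names
print(toy_df["app tag"])     # selecting one column gives a Series
```

Selecting a column by name, as in the last line, is the same mechanism the notebook uses to pick the input and label columns.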

In [None]:
phone_df = pandas.read_csv('phone-data.csv')

Let's preview the data.

In [None]:
# Show the top 5 rows
display(phone_df.head())
# Summarize the data
phone_df.describe()

## Data cleaning

We first select only the columns we are interested in.

In this example we train a model to predict the "app tag" from the "Sentence Utterance", so we select only these two columns. Other columns could be used as well; these two simply serve as an example.

We call DataFrame.describe() again.
Notice that there are 24 unique labels (classes) for the model to predict.
However, "#ERROR!" is not a class we want to predict.

In [None]:
data_df = phone_df[["Sentence Utterance", "app tag"]]
data_df.columns = ['input', 'label']
display(data_df.describe())
display(data_df.label.unique())

### Removing Nulls and Error

From the output above, we can see that there are 24 unique labels.

We want to remove "#ERROR!", which means that the particular row is not useful data.

We will clean out unwanted data by removing nulls and invalid values (in this case "#ERROR!").
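
As a compact illustration of this kind of cleaning, the sketch below runs on a made-up four-row frame. It uses `DataFrame.dropna` with `subset` instead of separate `notnull()` masks; this is an alternative spelling, not the notebook's own code, and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering each case: valid, null input, "#ERROR!" label, null label
toy_df = pd.DataFrame({
    "input": ["hello", None, "top up", "bye"],
    "label": ["greet", "greet", "#ERROR!", np.nan],
})

# Drop rows where either column is null, then drop the invalid label value
clean_df = toy_df.dropna(subset=["input", "label"])
clean_df = clean_df[clean_df["label"] != "#ERROR!"]

print(len(toy_df), "->", len(clean_df))
```

Only the fully valid row survives here. Printing the row count after each step, as the cell below does, makes it easier to see how much data each rule removes.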

In [None]:
print("Number of rows: %d" % len(data_df))
# Drop rows with a null input
data_df = data_df[data_df.input.notnull()]
print("Removed nulls in input")
print("Number of rows: %d" % len(data_df))
# Drop rows with a null label
data_df = data_df[data_df.label.notnull()]
print("Removed nulls in label")
print("Number of rows: %d" % len(data_df))
# Drop rows labelled with the invalid "#ERROR!" value
data_df = data_df[data_df.label != "#ERROR!"]
print("Removed \"#ERROR!\" in label")
print("Number of rows: %d" % len(data_df))

### Remove duplicate inputs

There are some duplicates in the input of this dataset.

In [None]:
display(data_df.describe())

As you can see above, "#ERROR!" no longer appears in our labels.

However, there are still duplicates in our input: 7 rows contain "สอบถามยอดค้างชำระค่ะ" (Thai, roughly "I would like to ask about my outstanding balance").

We will remove them now, keeping only the first entry.
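
The effect of keeping only the first entry can be seen on a small made-up frame. The values below are hypothetical and chosen so that the same input appears with different labels.

```python
import pandas as pd

# Hypothetical rows: input "a" appears three times, with two different labels
toy_df = pd.DataFrame({
    "input": ["a", "b", "a", "a"],
    "label": ["x", "y", "z", "x"],
})

# Number of rows that repeat an earlier input
n_duplicates = toy_df["input"].duplicated().sum()
print("duplicate rows:", n_duplicates)

# Keep the first row for each distinct input
deduped = toy_df.drop_duplicates("input", keep="first")
print(deduped)
```

Note that when duplicated inputs carry conflicting labels, `keep="first"` silently keeps whichever label happens to come first, so it can be worth inspecting such rows before dropping them.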

In [None]:
data_df = data_df.drop_duplicates("input", keep="first")
display(data_df.describe())

### Substitute Strings in Label
Computers do not actually understand the strings in the label column, so we substitute each unique value with a number.
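
For comparison, pandas can build the same kind of mapping in one call with `pd.factorize`. The sketch below is an alternative to the hand-built dictionaries used in this notebook, shown on made-up label names.

```python
import pandas as pd

# Hypothetical label values
labels = pd.Series(["billing", "promo", "billing", "network"])

# codes: one integer per row; uniques: the label for each integer,
# numbered in order of first appearance
codes, uniques = pd.factorize(labels)
print(codes)
print(uniques)

# Mapping the numbers back to the original strings
decoded = uniques[codes]
print(list(decoded))
```

Building the two dictionaries by hand, as the next cell does, has the advantage that both directions of the mapping stay visible and easy to display.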

In [None]:
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
data = data_df.to_numpy(copy=True)

unique_label = data_df.label.unique()

label_2_num_map = dict(zip(unique_label, range(len(unique_label))))
num_2_label_map = dict(zip(range(len(unique_label)), unique_label))

print("Create Mappings")
display(num_2_label_map)
display(label_2_num_map)

print("Before Mappings")
display(data[:, 1])
data[:,1] = np.vectorize(label_2_num_map.get)(data[:,1])

print("After Mappings")
display(data[:, 1])

### String cleaning
Trim leading and trailing whitespace from each input string.
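
As a small illustration of what trimming does, the sketch below applies the pandas string accessor `Series.str.strip()` to a few made-up strings. This is an alternative to the `np.vectorize` approach used in this notebook, not a replacement for it.

```python
import pandas as pd

# Hypothetical strings with spaces, a newline and a tab around the text
raw = pd.Series(["  hello ", "world\n", "\tok"])

# strip() removes whitespace at both ends and leaves inner whitespace alone
stripped = raw.str.strip()
print(stripped.tolist())
```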

In [None]:
def strip_str(string):
    return string.strip()
     
# Trim extra leading and trailing whitespace in the string
print("Before")
print(data)
data[:,0] = np.vectorize(strip_str)(data[:,0])
print("After")
print(data)

### Visualize Class Count

We will now visualize the class imbalance. Note that training directly on an imbalanced dataset can yield poor results, because the model tends to favor the most frequent classes.
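
Before plotting, the imbalance can also be summarized numerically. The following sketch uses `Series.value_counts()` on made-up numeric labels; the ratio between the largest and smallest class is just one rough indicator, chosen here for illustration.

```python
import pandas as pd

# Hypothetical numeric labels: class 0 is far more common than class 2
labels = pd.Series([0, 0, 0, 0, 1, 1, 2])

# Counts per class, sorted from most to least frequent
counts = labels.value_counts()
print(counts)

# Share of the data set taken by each class
share = counts / counts.sum()
print(share)

# Ratio between the largest and the smallest class
imbalance_ratio = counts.max() / counts.min()
print("largest / smallest class:", imbalance_ratio)
```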

In [None]:
def plot(label, count):
    fig, ax = plt.subplots()
    ind = np.arange(len(count))
    rects1 = ax.bar(ind, count, 0.5)

    ax.set_ylabel('Count')
    ax.set_title('Count for each class')
    ax.set_xticks(ind)
    ax.set_xticklabels(label)

    plt.show()
    
label, count = np.unique(data[:, 1], return_counts=True)
plot(label, count)

# pack the label and count together
bundle = list(zip(label, count))
# sort them by count
bundle = sorted(bundle, key=lambda e: e[1], reverse=True) 
# unpack the values
label, count = zip(*bundle)
plot(label, count)

Now we have our training data with inputs and labels.

## Feature Engineering

This is just a fancy term for making the input work with our model.

The models we are going to tackle do not accept variable-size input, so we have to transform our input so that it has a fixed size while still retaining useful information.

### Feature #1: Char count

#### Finding the Chars
We first need the list of possible characters in the dataset. We could look up all the characters in Thai and English, or simply obtain the list from the dataset. The code below does the latter.
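
The whole idea, from collecting the characters to producing a fixed-size vector, can be sketched on three made-up strings using only the standard library. The names `vocab`, `char_index` and `char_counts` are hypothetical and do not appear elsewhere in this notebook.

```python
from collections import Counter

# Hypothetical inputs of different lengths
texts = ["abca", "cab", "dd"]

# Every distinct character in the data set, in sorted order
vocab = sorted(set("".join(texts)))
char_index = {ch: i for i, ch in enumerate(vocab)}

def char_counts(text):
    """Return one count per vocabulary character, so every vector has the same length."""
    vec = [0] * len(vocab)
    for ch, n in Counter(text).items():
        vec[char_index[ch]] = n
    return vec

features = [char_counts(t) for t in texts]
print(vocab)
print(features)
```

Every string maps to a vector of length `len(vocab)` regardless of its own length. One limitation of taking the characters from the data set: a string containing a character that was never seen would raise a `KeyError` in this sketch.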

In [None]:
all_the_string = "".join(data[:, 0])

np_str = np.array(list(all_the_string))
all_char = np.unique(np_str)

# np.unique already returns the characters in sorted order
print("There are %d unique chars in the data set" % len(all_char))
print(all_char)
char_map = dict(zip(all_char, range(len(all_char))))

In [None]:
def count_str(string):
    global all_char, char_map
    result = np.zeros(len(all_char))
    np_str = np.array(list(string))
    str_char, str_char_count = np.unique(np_str, return_counts=True)
    for char, count in zip(str_char, str_char_count):
        result[char_map[char]] = count
    return result

# run example feature transformation
print("Example String to feature conversion")
display(data[0, 0])
display(count_str(data[0, 0]))

In [None]:
# run on data set
temp = np.vectorize(count_str, otypes=[object])(data[:, 0])
x_f1 = np.array([[e for e in sl] for sl in temp.tolist()])
label = data[:, 1].astype(int)
print("Data")
print("Data shape", x_f1.shape)
print("label shape", label.shape)

### Feature #2: Keyword Detection

The code below shows the first 3 entries for each class.

Use this to find keywords that you believe will be useful for the classifier.

In [None]:
def show_first_in_label(first, select_label):
    print("Showing label \"%s\"" % num_2_label_map[select_label])
    select = data[data[:, 1] == select_label, 0]
    for i in range(min(first, len(select))):
        print(i, select[i])
    print("")
        
first_three = 3
number_of_classes = len(num_2_label_map)
for i in range(number_of_classes):
    show_first_in_label(first_three, i)

Here are some of the entries used to find keywords.

In [None]:
index = 30
display(data[index, 0], num_2_label_map[data[index, 1]])

index = 40
display(data[index, 0], num_2_label_map[data[index, 1]])

index = 80
display(data[index, 0], num_2_label_map[data[index, 1]])

Add keywords here; some are already added for you as an example.
The transformed features (keywords) should differentiate the classes from one another. See below how the 3 entries, one from each of 3 classes, can be differentiated using the example keywords.

In [None]:
keywords = ["โปร", "โทร", "ไม่ได้", "iservice"]

The "has_keyword" function only detects whether a keyword is present. You can also experiment with counting the occurrences of each keyword by modifying the function below.
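
One possible shape for the counting variant is sketched below. The function name `count_keywords` is hypothetical, and it takes the keyword list as a parameter instead of reading a global; the example string is made up from two of the sample keywords.

```python
import numpy as np

def count_keywords(string, keywords):
    """Return how many times each keyword occurs in the string."""
    # str.count counts non-overlapping occurrences of the substring
    return np.array([string.count(keyword) for keyword in keywords], dtype=float)

example_keywords = ["โปร", "โทร"]
example = count_keywords("โปรโปรโทร", example_keywords)
print(example)
```

Counts are non-negative, so this variant can still be fed to the same kind of model as the 0/1 detection features.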

In [None]:
def has_keyword(string):
    global keywords
    result = np.zeros(len(keywords))
    for index, keyword in enumerate(keywords):
        if keyword in string:
            result[index] = 1
    return result

def preview(string_ind):
    print("Entry")
    display(data[string_ind, 0])
    print("Feature")
    print(has_keyword(data[string_ind, 0]), "->", num_2_label_map[data[string_ind, 1]])
    print("")

# run example feature transformation
print("Example String to feature conversion\n\n")
preview(30)
preview(40)
preview(80)

In [None]:
# run on data set
temp = np.vectorize(has_keyword, otypes=[object])(data[:, 0])
x_f2 = np.array([[e for e in sl] for sl in temp.tolist()])
label = data[:, 1].astype(int)
print("Data")
print("Data shape", x_f2.shape)
print("label shape", label.shape)

### Testing
See how well the model can fit our current data. This is a quick-and-dirty way of testing hand-crafted features: the model will not generalize if it cannot fit the training data in the first place. We (you) will do proper training later.

In [None]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_f1, label)
y_pred = model.predict(x_f1)
print("Model Acc. on train data %f%%"
       % ((label == y_pred).sum() / x_f1.shape[0] * 100))

In [None]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_f2, label)
y_pred = model.predict(x_f2)
print("Model Acc. on train data %f%%"
       % ((label == y_pred).sum() / x_f2.shape[0] * 100))

### Training

#### Feature #1
##### Split data into train-data and test-data

##### Train model

##### Display model performance
Accuracy, confusion-matrix, etc.
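
As a starting point for these three steps, here is a minimal sketch on synthetic data. The arrays `toy_x` and `toy_y` are random stand-ins for the real features and labels, and the 80/20 split, the fixed `random_state` and the use of `stratify` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in data: 200 rows of non-negative counts, 3 classes
rng = np.random.default_rng(0)
toy_y = rng.integers(0, 3, size=200)
toy_x = rng.poisson(lam=1.0, size=(200, 5)).astype(float)
toy_x[np.arange(200), toy_y] += 3  # make each class favour one column

# Hold out 20% of the rows for testing, keeping the class proportions
x_train, x_test, y_train, y_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=0, stratify=toy_y)

# Fit on the training rows only, then predict the held-out rows
model = MultinomialNB()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Test accuracy: %.3f" % accuracy)
print(cm)  # rows: true class, columns: predicted class
```

On the real data, `stratify` raises an error if any class has fewer than two rows, so very rare classes may need to be handled first or the argument dropped.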

#### Feature #2
##### Split data into train-data and test-data

##### Train model

##### Display model performance

In [None]:
print(data)

#### Try combining the 2 features
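
One simple way to combine two feature sets is to place their columns side by side with `np.hstack`. The sketch below uses two small made-up arrays in place of the real feature matrices; both must have the same number of rows, in the same order.

```python
import numpy as np

# Hypothetical stand-ins: 2 rows with 3 char-count columns and 2 keyword columns
toy_f1 = np.array([[2.0, 1.0, 0.0],
                   [0.0, 3.0, 1.0]])
toy_f2 = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

# Concatenate along the column axis: each row keeps all features from both sets
combined = np.hstack([toy_f1, toy_f2])
print(combined.shape)
print(combined)
```

The combined matrix can then go through the same split, train and evaluate steps as each feature set on its own.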