
In [4]:
!pip install "git+https://github.com/facebookresearch/fastText.git"

Collecting git+https://github.com/facebookresearch/fastText.git
  Cloning https://github.com/facebookresearch/fastText.git to /tmp/pip-req-build-7nr280vq
  Running command git clone -q https://github.com/facebookresearch/fastText.git /tmp/pip-req-build-7nr280vq
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp36-cp36m-linux_x86_64.whl size=3005626 sha256=6b022b86bff816a346099f0d883372f6c7aef817384a36e9e52067638f16bab8
  Stored in directory: /tmp/pip-ephem-wheel-cache-4fcyvf47/wheels/69/f8/19/7f0ab407c078795bc9f86e1f6381349254f86fd7d229902355
Successfully built fasttext


Import the necessary packages

In [0]:
import fasttext.util
import numpy as np
import pandas as pd
import re

from keras import layers
from keras.layers import Dropout 
from keras.models import Sequential
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split

Read in the data and look at the first row

In [6]:
df = pd.read_csv('ModelData.csv', names=['text','issue'], sep=',')
print(df.iloc[0])

text     install sent otdl
issue                    0
Name: 0, dtype: object


Create labels

In [0]:
labels = df['issue'].values

Load the fasttext model

In [8]:
fasttext.util.download_model('en', if_exists='ignore') 
ft = fasttext.load_model('cc.en.300.bin')

Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz





Set the parameters

In [0]:
review_length = 150
data_count = len(df)
dims = ft.get_dimension()

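Function to take the text from the text column, clean it, split it into individual words, then convert each word to a vector. Texts shorter than review_length are zero-padded; longer texts keep only their last review_length words.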

In [0]:
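def text_to_vector(text):
    # Spell out symbols that carry meaning before stripping punctuation
    text = text.replace('&', ' and ')
    text = text.replace('@', ' at ')
    # Replace every character outside 0x41-0x7f ('A' through DEL) with a space;
    # this removes digits and most punctuation
    text = re.sub(r'[^\x41-\x7f]', r' ', text)
    text = text.lower().split()

    # Keep at most the last review_length words
    window = text[-review_length:]

    # Rows beyond len(window) stay zero, which pads short texts
    vectors = np.zeros((review_length, dims))

    for i, word in enumerate(window):
        vectors[i, :] = ft.get_word_vector(word).astype('float32')

    return vectors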
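
The cleaning step can be tried on its own, without loading the fastText model. The sketch below copies the cleaning lines of text_to_vector into a helper called clean_text (a name introduced here for illustration only) and runs it on made-up comments. One thing it makes visible: the range \x41-\x7f also keeps characters such as _ and [ ], so they survive inside tokens.

```python
import re

def clean_text(text):
    # Same cleaning as text_to_vector, returned as a list of tokens
    text = text.replace('&', ' and ')
    text = text.replace('@', ' at ')
    text = re.sub(r'[^\x41-\x7f]', ' ', text)
    return text.lower().split()

# Made-up comment: digits, commas and '!' are replaced by spaces
tokens = clean_text("Install sent @ OTDL & 2 parts, 3pm!")
print(tokens)

# Characters between 'A' and DEL that are not letters are kept as-is
kept = clean_text("a_b [x]")
print(kept)
```

If tokens like a_b are not wanted, a stricter pattern would be needed; that would be a change to the preprocessing and would require retraining.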


Function to create the word embedding for every row of the dataframe

In [0]:
def create_word_embedding(df):

    word_embedding = np.zeros((len(df), review_length, dims), dtype='float32')

    for i, review in enumerate(df['text'].values):
        word_embedding[i, :] = text_to_vector(review)

    return word_embedding
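
To check the padding, truncation and final array shape without downloading cc.en.300.bin, the following sketch swaps in a hypothetical fake_word_vector for ft.get_word_vector and uses deliberately small values for review_length and dims. The helper tokens_to_matrix is also an illustration name; it mirrors the windowing part of text_to_vector only.

```python
import numpy as np

review_length = 5  # illustration only; the notebook uses 150
dims = 3           # illustration only; cc.en.300.bin has 300 dimensions

def fake_word_vector(word):
    # Hypothetical stand-in for ft.get_word_vector: every entry is the word's length
    return np.full(dims, len(word), dtype='float32')

def tokens_to_matrix(tokens):
    # Same windowing and zero-padding as text_to_vector, minus the text cleaning
    window = tokens[-review_length:]
    vectors = np.zeros((review_length, dims), dtype='float32')
    for i, word in enumerate(window):
        vectors[i, :] = fake_word_vector(word)
    return vectors

short = tokens_to_matrix(['install', 'sent'])
long = tokens_to_matrix(['a', 'bb', 'ccc', 'dddd', 'eeeee', 'ffffff', 'ggggggg'])
batch = np.stack([short, long])

print(batch.shape)   # (2, 5, 3): one (review_length, dims) matrix per text
print(short[:, 0])   # two filled rows, then zero padding at the end
print(long[:, 0])    # only the last five tokens are kept
```

Padding sits at the end of each matrix, while truncation drops words from the start of the text.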

Create the embedding

In [0]:
embedding = create_word_embedding(df)

Create the training and test set

In [0]:
X_train, X_test, y_train, y_test = train_test_split(embedding, labels, test_size=0.30, random_state=42)

Create the CNN

In [0]:
def cnn_text_classifier():

    model = Sequential()
    model.add(layers.Conv1D(filters=128, kernel_size=3, activation='relu', input_shape=(review_length, dims)))
    model.add(layers.Conv1D(filters=64, kernel_size=3, activation='relu'))
    model.add(layers.GlobalAveragePooling1D())
    model.add(Dropout(0.5))
    model.add(layers.Dense(10, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()
    return model

Build and train the model, then save it

In [43]:
model = cnn_text_classifier()
history = model.fit(X_train, y_train, epochs=20, verbose=False, validation_data=(X_test, y_test), batch_size=10)
model.save("model.h5")

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_4 (Conv1D)            (None, 148, 128)          115328    
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 146, 64)           24640     
_________________________________________________________________
global_average_pooling1d_4 ( (None, 64)                0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 10)                650       
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 11        
Total params: 140,629
Trainable params: 140,629
Non-trainable params: 0
________________________________________________
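
The output shapes and parameter counts in the summary can be reproduced by hand. The sketch below assumes the Conv1D defaults (stride 1, 'valid' padding) and dims = 300, which is what the first layer's parameter count implies; the helper names are introduced here for illustration.

```python
review_length, dims = 150, 300

def conv1d_out_len(in_len, kernel_size):
    # 'valid' padding with stride 1: the kernel fits in_len - kernel_size + 1 times
    return in_len - kernel_size + 1

def conv1d_params(in_channels, filters, kernel_size):
    # One weight per (kernel position, input channel, filter) plus one bias per filter
    return kernel_size * in_channels * filters + filters

def dense_params(in_units, out_units):
    # One weight per (input, output) pair plus one bias per output
    return in_units * out_units + out_units

len1 = conv1d_out_len(review_length, 3)
len2 = conv1d_out_len(len1, 3)
params = [
    conv1d_params(dims, 128, 3),   # first Conv1D
    conv1d_params(128, 64, 3),     # second Conv1D
    dense_params(64, 10),          # Dense(10) after pooling to 64 features
    dense_params(10, 1),           # Dense(1) sigmoid output
]
total = sum(params)

print(len1, len2)      # 148 146
print(params, total)   # [115328, 24640, 650, 11] 140629
```

GlobalAveragePooling1D and Dropout have no weights, which is why they show 0 parameters.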

In [0]:
#from google.colab import files

#files.download('model.h5')



Check the accuracy

In [44]:
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))


Training Accuracy: 0.9111
Testing Accuracy:  0.8630


In [45]:
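from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix
y_pred1 = model.predict(X_test)
# The model has a single sigmoid output, so threshold the probability at 0.5.
# np.argmax(y_pred1, axis=1) over one column is always 0 and would predict class 0 for every row.
y_pred = (y_pred1 > 0.5).astype(int).ravel()

# Print f1, precision, and recall scores
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall:", recall_score(y_test, y_pred, average="macro"))
print("F1 score:", f1_score(y_test, y_pred, average="macro"))

Precision: 0.3798932384341637
Recall: 0.5
F1 score: 0.43174924165824063


  _warn_prf(average, modifier, msg_start, len(result))


Because the classes are imbalanced, the F1 score gives a better picture of how the model is performing than accuracy does. It should be as close to 1 as possible, which indicates that issues are being correctly identified as issues.
The scores recorded above were produced with y_pred = np.argmax(y_pred1, axis=1), which returns 0 for every row of a single-column output. They therefore describe a classifier that never predicts an issue (hence the macro recall of exactly 0.5 and the _warn_prf warning), so they need to be regenerated with the thresholded predictions.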
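
A small numeric check helps when reading macro-averaged scores for a model with one sigmoid output. The sketch below uses made-up probabilities and labels, and a hand-written macro_scores helper (hypothetical, not scikit-learn's implementation; it scores an undefined precision as 0). It shows that np.argmax over a single column always yields class 0, and that all-zero predictions give a macro recall of exactly 0.5 on a two-class problem.

```python
import numpy as np

def macro_scores(y_true, y_pred):
    # Unweighted mean of per-class precision, recall and F1 for classes 0 and 1
    precisions, recalls, f1s = [], [], []
    for cls in (0, 1):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return float(np.mean(precisions)), float(np.mean(recalls)), float(np.mean(f1s))

# Made-up sigmoid outputs with shape (n, 1), as returned by model.predict
probs = np.array([[0.9], [0.2], [0.7], [0.1]])
y_true = np.array([1, 0, 1, 0])

by_argmax = np.argmax(probs, axis=1)               # index of the only column: always 0
by_threshold = (probs > 0.5).astype(int).ravel()   # 1 where probability exceeds 0.5

print(by_argmax, macro_scores(y_true, by_argmax))
print(by_threshold, macro_scores(y_true, by_threshold))
```

With argmax the macro precision is half the share of class 0 in the labels, which is the same pattern as a precision near 0.38 alongside a recall of 0.5.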

In [0]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')

def plot_history(training):
    acc = training.history['accuracy']
    val_acc = training.history['val_accuracy']
    loss = training.history['loss']
    val_loss = training.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(14, 8))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'g', label='Training loss')
    plt.plot(x, val_loss, 'y', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()


In [0]:
plot_history(history)

In [0]:
def model_predict(predictdf):
    # Embed each comment and score it with the trained model
    comment_vector = create_word_embedding(predictdf)
    issue_value = model.predict(comment_vector)
    # issue_value has shape (n, 1); flatten it and threshold the probability at 0.5
    predictdf['value'] = np.where(issue_value.ravel() > 0.5, 'Issue', 'No Issue')
    predictdf.to_csv('results.csv')
