### 1-D Convolution
- one obvious application: speech recognition
- speech is a 1-D signal
- automatic text transcription


- Another application: treating a sequence of word vectors as a 1-D signal

- a sentence of length T with word vectors of size D -> a T x D signal (1-D along the word positions, with D values per position)  
![](https://cn.bing.com/th?id=OIP.2Z4scweLbrmB4QZ5AXwJfwHaHY&pid=Api&rs=1&p=0)
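A minimal NumPy sketch of what a 1-D convolution over a T x D sentence does, assuming no padding and a stride of 1. The function name `conv1d_valid` and the toy sizes are made up for illustration; like most deep learning libraries, it slides the kernel without flipping it (cross-correlation).

```python
import numpy as np

def conv1d_valid(signal, kernels):
    """Slide each kernel along the time axis of a (T, D) signal.

    signal:  (T, D) array, one D-dimensional word vector per position
    kernels: (K, D, F) array, F filters that each span K positions
    returns: (T - K + 1, F) array of filter responses
    """
    T, D = signal.shape
    K, D_k, F = kernels.shape
    assert D == D_k, "kernel depth must match the word vector size"
    out = np.zeros((T - K + 1, F))
    for t in range(T - K + 1):
        window = signal[t:t + K]  # (K, D) slice of consecutive words
        # sum over both the K positions and the D vector components
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1]))
    return out

rng = np.random.default_rng(0)
sentence = rng.normal(size=(7, 4))    # T=7 words, D=4 dimensions
filters = rng.normal(size=(3, 4, 2))  # K=3 words per window, F=2 filters
features = conv1d_valid(sentence, filters)
print(features.shape)  # (5, 2)
```

The kernel only moves along the word axis, which is why this counts as a 1-D convolution even though the input is a matrix.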

### 1-D CNN word embedding Architecture
![](https://cn.bing.com/th?id=OIP.0LXL9feRDVvbjxVOWiMGggHaC_&pid=Api&rs=1&p=0)

### Get data
[kaggle toxic data](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)

"id","comment_text","toxic","severe_toxic","obscene","threat","insult","identity_hate"  
"0000997932d777bf","Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0,0,0,0,0,0
"000103f0d9cfb60f","D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)",0,0,0,0,0,0


- Multi-label: a picture of a "car" can also be a picture of a "red object"
- similarly, a comment may carry several labels at once, e.g. it can be both a threat and toxic

### Multi-label
- treat it like 6 different binary classification problems
  - given comment -> is it toxic or not toxic?
  - given comment -> is it a threat or not a threat?
- It is like having a neural network with 6 separate binary logistic regressions at the end
- __Total loss is just the average binary cross-entropy over the 6 labels__
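A small NumPy sketch of that loss, assuming predictions are probabilities in an N x 6 array. The helper name `mean_binary_cross_entropy` and the clipping constant are illustrative choices, not taken from the notebook.

```python
import numpy as np

def mean_binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Average binary cross-entropy over all samples and all labels."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    per_label = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return per_label.mean()

# two comments, six labels each
y_true = np.array([[1, 0, 1, 0, 1, 0],
                   [0, 0, 0, 0, 0, 0]], dtype=float)
y_prob = np.full_like(y_true, 0.5)  # a model that always answers "50/50"
loss = mean_binary_cross_entropy(y_true, y_prob)
print(loss)  # log(2), about 0.693
```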


#### Architecture

- neural network -> 6 X linear classifier
- features = feature_extractor.transform(input_data)
- model = LogisticRegression()
- model.fit(features, labels), with features computed from the CSV comments and labels taken from one label column
- do this 6 times, once for each of the 6 labels
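The recipe above can be sketched with scikit-learn as follows. The random features and synthetic labels are stand-ins (assumptions) for the output of a real feature extractor and the six CSV label columns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8))  # stand-in for extracted features
noise = 0.5 * rng.normal(size=(200, 6))
labels = (features[:, :6] + noise > 0).astype(int)  # 6 synthetic binary columns

# one independent binary classifier per label column
models = []
for j in range(labels.shape[1]):
    clf = LogisticRegression()
    clf.fit(features, labels[:, j])
    models.append(clf)

# column j holds P(label j = 1) for every sample
probs = np.column_stack([m.predict_proba(features)[:, 1] for m in models])
print(probs.shape)  # (200, 6)
```

The neural network version shares one feature extractor across all six classifiers and trains everything jointly, instead of fitting six unrelated models.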

#### In keras
- x = output from previous layers
- outputs = Dense(6,activation="sigmoid")(x)
- model = Model(inputs,outputs)
- model.compile(loss='binary_crossentropy', optimizer='rmsprop')
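A short NumPy comparison of why the last layer uses sigmoid rather than softmax. The logits are invented numbers: sigmoid scores each label independently, so several labels can be "on" at once, while softmax forces the six scores to compete and sum to 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([3.0, -2.0, 2.5, 2.0, -1.0, -3.0])  # one score per label
p_sigmoid = sigmoid(logits)
p_softmax = softmax(logits)

print((p_sigmoid > 0.5).sum())  # 3 labels predicted at the same time
print(p_softmax.sum())          # softmax probabilities always sum to 1
```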

### Code


In [1]:
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, GlobalMaxPooling1D
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model
from sklearn.metrics import roc_auc_score


Using TensorFlow backend.


In [9]:
# some configuration
MAX_SEQUENCE_LENGTH = 100
MAX_VOCAB_SIZE = 1000  # alternative: 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2
BATCH_SIZE = 128
EPOCHS = 10


In [7]:
# load in pre-trained word vectors
print('Loading word vectors...')
word2vec = {}
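# each line of the GloVe file is: word followed by EMBEDDING_DIM space-separated floats
with open('./large_files/glove.6B/glove.6B.%sd.txt' % EMBEDDING_DIM, encoding='utf-8') as f:
    count = 0
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        word2vec[word] = vec

        count += 1
        # print the first few vectors as a sanity check
        if count < 10:
            print("word {} vec {} len:{}".format(word,vec,len(vec)))

        # stop early: only the first 1000 vectors are loaded here
        if count >= 1000:
            break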
print('Found %s word vectors.' % len(word2vec))


Loading word vectors...
word the vec [-0.038194 -0.24487   0.72812  -0.39961   0.083172  0.043953 -0.39141
  0.3344   -0.57545   0.087459  0.28787  -0.06731   0.30906  -0.26384
 -0.13231  -0.20757   0.33395  -0.33848  -0.31743  -0.48336   0.1464
 -0.37304   0.34577   0.052041  0.44946  -0.46971   0.02628  -0.54155
 -0.15518  -0.14107  -0.039722  0.28277   0.14393   0.23464  -0.31021
  0.086173  0.20397   0.52624   0.17164  -0.082378 -0.71787  -0.41531
  0.20335  -0.12763   0.41367   0.55187   0.57908  -0.33477  -0.36559
 -0.54857  -0.062892  0.26584   0.30205   0.99775  -0.80481  -3.0243
  0.01254  -0.36942   2.2167    0.72201  -0.24978   0.92136   0.034514
  0.46745   1.1079   -0.19358  -0.074575  0.23353  -0.052062 -0.22044
  0.057162 -0.15806  -0.30798  -0.41625   0.37972   0.15006  -0.53212
 -0.2055   -1.2526    0.071624  0.70565   0.49744  -0.42063   0.26148
 -1.538    -0.30223  -0.073438 -0.28312   0.37104  -0.25217   0.016215
 -0.017099 -0.38984   0.87424  -0.72569  -0.51058  -0

In [8]:
word2vec.keys()

dict_keys(['is', 'run', 'ii', 'again', 'february', 'been', 'am', 'every', 'my', '-', 'between', 'nothing', 'international', 'previous', 'reports', 'close', 'into', 'vice', 'ministry', 'monday', 'least', 'evidence', 'provide', 'same', 'since', 'having', 'ca', 'tax', "''", 'attacks', 'hit', 'continue', 'deal', 'fire', 'committee', 'talks', 'future', 'church', 'today', 'california', 'common', 'something', 'used', ';', 'free', 'i', 'canada', 'allowed', 'comes', 'growing', 'japanese', 'republican', 'months', 'remains', 'won', 'authorities', 'received', 'light', 'commission', 'construction', 'first', 'popular', 'scored', 'life', '2009', 'closed', 'clear', 'reporters', 'reached', 'them', 'days', 'election', 'against', 'cut', 'or', 'family', 'percent', 'five', '2005', 'they', 'vote', 'could', 'party', 'stocks', 'billion', 'face', 'much', 'just', 'important', 'french', "'ve", 'development', 'such', 'further', 'released', 'art', '12', 'military', 'southern', 'u.s.', '24', 'father', 'young', 'ope

In [28]:
# prepare text samples and their labels
print('Loading in comments...')

train = pd.read_csv("./large_files/jigsaw-toxic-comment-classification-challenge/train.csv",nrows = 200)
sentences = train["comment_text"].fillna("DUMMY_VALUE").values
possible_labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
targets = train[possible_labels].values


Loading in comments...


In [29]:
sentences[0]

"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"

In [30]:
targets[0]

array([0, 0, 0, 0, 0, 0])

In [31]:
# convert the sentences (strings) into integers
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
# print("sequences:", sequences); exit()


In [32]:
len(sequences[0]),sequences[0][:10],len(sequences),sentences.shape

(33, [347, 84, 1, 102, 89, 148, 31, 348, 65, 288], 200, (200,))

In [33]:
print("max sequence length:", max(len(s) for s in sequences))
print("min sequence length:", min(len(s) for s in sequences))
s = sorted(len(s) for s in sequences)
print("median sequence length:", s[len(s) // 2])


max sequence length: 467
min sequence length: 1
median sequence length: 31


In [34]:
# get word -> integer mapping
word2idx = tokenizer.word_index
print('Found %s unique tokens.' % len(word2idx))


Found 3768 unique tokens.


In [19]:
word2idx.keys()



In [35]:
word2idx['explanation']

347
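A rough pure-Python sketch of what the tokenizer step does: lowercase, split into words, index words by descending frequency starting from 1 (0 stays free for padding), and keep only indices below `num_words`. The regex and helper names are simplifications of my own; the real Keras `Tokenizer` uses its own filter list.

```python
import re
from collections import Counter

def tokenize(text):
    # simplified word splitting; an assumption, not the Keras filter set
    return re.findall(r"[a-z0-9']+", text.lower())

def build_word_index(texts):
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # most frequent word gets index 1; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    # words with index >= num_words are dropped, not replaced
    return [[word_index[t] for t in tokenize(text)
             if t in word_index and word_index[t] < num_words]
            for text in texts]

toy = ["the cat sat", "the cat ran", "the dog"]
toy_index = build_word_index(toy)
toy_seqs = texts_to_sequences(toy, toy_index, num_words=3)
print(toy_index["the"], toy_index["cat"])  # 1 2
print(toy_seqs)  # [[1, 2], [1, 2], [1]]
```

This also explains why `word2idx` can hold more entries than MAX_VOCAB_SIZE: the index covers every word seen, and the cut-off is only applied when converting text to sequences.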

In [36]:
# pad sequences so that we get a N x T matrix
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', data.shape)


Shape of data tensor: (200, 100)


In [37]:
data[0]

array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0, 347,  84,   1, 102,  89, 148,  31, 348,  65, 288,  53,
        47,  16,  69,  90,   8, 810,  26, 124,   6,  54,  40, 211,   1,
       243,  27,   1,  28,  29, 212,  48,  95, 811], dtype=int32)
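The leading zeros above come from the default behaviour of `pad_sequences`, which pads at the front and, for sequences longer than `maxlen`, also truncates from the front. A NumPy sketch of that default (the helper name `pad_pre` is hypothetical):

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros; keep only the LAST maxlen tokens of long sequences."""
    out = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all zeros
        trimmed = seq[-maxlen:]
        out[row, -len(trimmed):] = trimmed
    return out

padded = pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
# [[0 0 5 6]
#  [3 4 5 6]]
```

With a maximum comment length of 467 tokens and MAX_SEQUENCE_LENGTH = 100, the longest comments lose their beginning rather than their end.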

- word2idx dict maps each word (key) to its integer index (value)  
- embedding_matrix is indexed by that integer: row i holds the GloVe vector of the word with index i, or all zeros if the word has no GloVe vector

In [40]:
# prepare embedding matrix
print('Filling pre-trained embeddings...')
num_words = min(MAX_VOCAB_SIZE, len(word2idx) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word2idx.items():
    if i < MAX_VOCAB_SIZE:
        embedding_vector = word2vec.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all zeros.
            embedding_matrix[i] = embedding_vector

embedding_matrix.shape

Filling pre-trained embeddings...


(1000, 100)
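What the Embedding layer does with this matrix can be pictured as plain row lookup. A toy NumPy sketch with made-up numbers (4 words, 2 dimensions):

```python
import numpy as np

toy_matrix = np.array([[0.0, 0.0],   # row 0: padding index
                       [0.1, 0.2],   # word with index 1
                       [0.3, 0.4],   # word with index 2
                       [0.5, 0.6]])  # word with index 3
toy_batch = np.array([[0, 0, 1, 3],
                      [0, 2, 2, 1]])  # N=2 padded sequences, T=4

# integer array indexing turns N x T indices into N x T x D vectors
embedded = toy_matrix[toy_batch]
print(embedded.shape)  # (2, 4, 2)
```

Applied to the real data, the same lookup turns the (200, 100) integer matrix into a 200 x 100 x 100 tensor, which is the T x D signal per comment that the convolution layers consume.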

In [41]:
# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(
  num_words,
  EMBEDDING_DIM,
  weights=[embedding_matrix],
  input_length=MAX_SEQUENCE_LENGTH,
  trainable=False
)


In [None]:
print('Building model...')

# train a 1D convnet with global maxpooling
input_ = Input(shape=(MAX_SEQUENCE_LENGTH,))
x = embedding_layer(input_)
x = Conv1D(128, 3, activation='relu')(x) # 128 output filters, 3 kernel size
x = MaxPooling1D(3)(x)
x = Conv1D(128, 3, activation='relu')(x)
x = MaxPooling1D(3)(x)
x = Conv1D(128, 3, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
output = Dense(len(possible_labels), activation='sigmoid')(x)

model = Model(input_, output)
model.compile(
  loss='binary_crossentropy',
  optimizer='rmsprop',
  metrics=['accuracy']
)
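To see how the sequence length shrinks through this stack, here is a small arithmetic sketch. It assumes the Keras defaults used above: no padding, convolution stride 1, and a pooling stride equal to the pool size.

```python
def conv_len(length, kernel_size):
    # "valid" convolution with stride 1
    return length - kernel_size + 1

def pool_len(length, pool_size):
    # "valid" max pooling with stride equal to pool_size
    return (length - pool_size) // pool_size + 1

steps = 100  # MAX_SEQUENCE_LENGTH
trace = [steps]
for layer in ["conv", "pool", "conv", "pool", "conv"]:
    steps = conv_len(steps, 3) if layer == "conv" else pool_len(steps, 3)
    trace.append(steps)
print(trace)  # [100, 98, 32, 30, 10, 8]
```

GlobalMaxPooling1D then takes the maximum over the remaining 8 positions for each of the 128 filters, leaving one 128-dimensional vector per comment.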


In [None]:

print('Training model...')
r = model.fit(
  data,
  targets,
  batch_size=BATCH_SIZE,
  epochs=EPOCHS,
  validation_split=VALIDATION_SPLIT
)


# plot some data
plt.plot(r.history['loss'], label='loss')
plt.plot(r.history['val_loss'], label='val_loss')
plt.legend()
plt.show()

# accuracies (newer Keras versions name these history keys 'accuracy' and 'val_accuracy')
plt.plot(r.history['acc'], label='acc')
plt.plot(r.history['val_acc'], label='val_acc')
plt.legend()
plt.show()

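# print the mean ROC AUC over the labels
# note: this is computed on the full data array (training and validation rows together)
p = model.predict(data)
aucs = []
for j in range(len(possible_labels)):
    auc = roc_auc_score(targets[:,j], p[:,j])
    aucs.append(auc)
print(np.mean(aucs))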
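For intuition, ROC AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counting as half. A pure-Python sketch of that definition (the function name `roc_auc` is my own; it is quadratic in the number of samples, so it is for illustration only):

```python
def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc_toy = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_toy)  # 0.75
```

The undefined case matters here: with only 200 rows loaded, a rare label column may contain no positive examples at all, so its AUC cannot be computed.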


![](https://blogs.mcgill.ca/cambam/files/2012/07/fig1.jpg)

[sklearn roc_auc_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)  
[ROC explanation](https://jeongchul.tistory.com/545)