text classification:
    classify the text into two labels: loss, waste
    /data
        /train
            /loss           #label
                file.txt
            /waste          #label
                file.txt
        /test
            /loss
                files.txt
            /waste
                files.txt
loading_dataset:
    train_text -> train_label
    test_text -> test_label
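
A minimal loading sketch, assuming the /data layout above (the load_texts helper is illustrative, not from the notes):

import os

def load_texts(split_dir):
    texts, labels = [], []
    for label in sorted(os.listdir(split_dir)):            # /loss, /waste
        label_dir = os.path.join(split_dir, label)
        for fname in os.listdir(label_dir):
            with open(os.path.join(label_dir, fname), encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(label)
    return texts, labels

train_text, train_label = load_texts('data/train')
test_text,  test_label  = load_texts('data/test')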
explore_data:
    x axis - length of each doc/row in words (text from each website)
    y axis - how many rows have the same length
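
A plotting sketch of that length histogram, continuing the loading sketch above (matplotlib assumed available):

import matplotlib.pyplot as plt

lengths = [len(t.split()) for t in train_text]    # words per sample
plt.hist(lengths, bins=50)
plt.xlabel('length of row (words)')
plt.ylabel('number of rows with that length')
plt.show()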
samples / number of words per sample ratio = 838860 / 14.0 ≈ 59918.6
ratio > 1500, so:
    tokenize the text as sequences
    use a sepCNN model to classify them
a. Split the samples into words; select the top 20K words based on their frequency.
b. Convert the samples into word sequence vectors.
c. If the original samples / number of words per sample ratio is less than 15K,
   using a fine-tuned pre-trained embedding with the sepCNN model will likely
   provide the best results.
This is why we use a sequence model with pre-trained embeddings, fine-tuned for
text classification, when the ratio of number of samples to number of words per
sample is neither small nor very large (roughly > 1500 and < 15K).
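
Steps a and b as a tf.keras sketch (TOP_K and MAX_LEN are illustrative choices, not values from the notes):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

TOP_K = 20000      # keep the top 20K words by frequency
MAX_LEN = 500      # pad/truncate every sequence to this length

tokenizer = Tokenizer(num_words=TOP_K)
tokenizer.fit_on_texts(train_text)
x_train = pad_sequences(tokenizer.texts_to_sequences(train_text), maxlen=MAX_LEN)
x_test  = pad_sequences(tokenizer.texts_to_sequences(test_text),  maxlen=MAX_LEN)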
vectorization:
    vectorized train data,
    vectorized test data
    #numerical values -> to machine learning algorithm
    two options:
        one-hot encoding
        word embeddings
word embedding:
    Words have meaning(s) associated with them.
    As a result, we can represent word tokens in a dense vector space (~few hundred real numbers),
    where the location and distance between words indicate how similar they are semantically.
model :
first layer - embedding
This layer learns to turn word index sequences into word embedding vectors during the training process,
such that each word index gets mapped to a dense vector of real values representing that word’s location in semantic space
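
A small sketch of that first layer (the vocabulary size and vector size reuse the illustrative TOP_K=20000 and 200 dimensions from elsewhere in these notes):

from tensorflow.keras.layers import Embedding

# maps each word index in a sequence to a dense 200-dimensional vector;
# input shape (samples, MAX_LEN) -> output shape (samples, MAX_LEN, 200)
embedding = Embedding(input_dim=20000, output_dim=200)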
sequence vectors:
    - order is important
model_building:
    model hyperparameters:
        no. of units?
        no. of layers?
        batch size?
        learning rate
        epochs
        dropout rate
    data: collected by scraping
- sequence models: models that can learn from the adjacency of tokens;
  includes the CNN and RNN classes of models
- Data is pre-processed as sequence vectors for these models.
- They have to learn a large number of parameters.
- first layer:
    embedding layer
    learns the relationship between the words in a dense vector space
    *Learning word relationships works best over many samples
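
A minimal sepCNN-style model sketch, continuing the tokenization sketch above; the filter counts, kernel size, dropout rate, and NUM_CLASSES value are illustrative assumptions, not values from the notes:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Embedding, Dropout, SeparableConv1D,
                                     MaxPooling1D, GlobalAveragePooling1D, Dense)

NUM_CLASSES = 2    # loss / waste; use 13 for the category label set listed below

model = Sequential([
    Embedding(input_dim=20000, output_dim=200),     # first layer: embedding
    Dropout(0.2),                                   # dropout rate
    SeparableConv1D(64, 3, activation='relu', padding='same'),
    SeparableConv1D(64, 3, activation='relu', padding='same'),
    MaxPooling1D(),
    SeparableConv1D(128, 3, activation='relu', padding='same'),
    GlobalAveragePooling1D(),
    Dropout(0.2),
    Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])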
- pre-trained embeddings:
    Words in a given dataset are most likely not unique to that dataset.
    We can learn the relationship between the words in our dataset using other dataset(s):
    we can transfer an embedding learned from another dataset into our embedding layer.
    Such an embedding is known as a pre-trained embedding.
    It gives the model a head start in the learning process.
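
A sketch of transferring GloVe vectors into the embedding layer and fine-tuning them (the glove.6B.200d.txt filename is an assumption; tokenizer comes from the tokenization sketch above):

import numpy as np
from tensorflow.keras.layers import Embedding

embedding_index = {}
with open('glove.6B.200d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embedding_index[values[0]] = np.asarray(values[1:], dtype='float32')

embedding_matrix = np.zeros((20000, 200))
for word, i in tokenizer.word_index.items():
    if i < 20000 and word in embedding_index:
        embedding_matrix[i] = embedding_index[word]

# the pre-trained weights give the model a head start;
# trainable=True lets training fine-tune them
embedding = Embedding(20000, 200, weights=[embedding_matrix], trainable=True)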
['Arts', 'Business', 'Computers', 'Games', 'Health', 'Home', 'News', 'Recreation', 'Reference', 'Science', 'Shopping']
['Shopping', 'Society']
total labels / number of classes: 13
training:
    track accuracy
    the fit function:
        trains (adjusts the weights)
        validates
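
A training sketch continuing the earlier ones (epochs and batch_size are illustrative; labels are converted to integer indices for the sparse loss):

import numpy as np

label_names = sorted(set(train_label))
y_train = np.array([label_names.index(l) for l in train_label])
y_test  = np.array([label_names.index(l) for l in test_label])

history = model.fit(x_train, y_train,
                    validation_data=(x_test, y_test),
                    epochs=10, batch_size=128)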
Embedding dimensions: the number of dimensions we want to use to represent word embeddings,
i.e., the size of each word vector.
Recommended values: 50-300.
In our experiments, we used GloVe embeddings with 200 dimensions with a pre-trained embedding layer.
underfitting: