![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, labeled by sentiment positive (1) or negative (0)
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
- As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data (4 Marks)
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [1]:
#### Add your code here ####
from tensorflow.keras.datasets import imdb
import warnings
warnings.filterwarnings("ignore")
data = imdb.load_data()

In [2]:
# Get train and test set
import numpy as np
import random
random.seed(0)

features = np.concatenate((data[0][0], data[1][0]), axis=0)
labels = np.concatenate((data[0][1], data[1][1]), axis=0)

# Take 10000 most frequent words
features_updated = []
for f in features:
    updated_f = []
    for i in f:        
        if i < 10000:
            updated_f.append(i)
    features_updated.append(updated_f)


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_updated, labels, test_size=0.20, random_state = 42)

### Print shape of features & labels (4 Marks)

Number of review, number of words in each review

In [3]:
#### Add your code here ####
# Number of reviews
print(np.asarray(features_updated).shape[0])

50000


In [4]:
#### Add your code here ####
# Number of words in each review
for f in features_updated:
    print(len(f))

212
182
139
499
146
40
115
525
228
128
430
99
115
225
99
121
157
624
208
169
119
116
249
783
90
133
201
187
155
214
163
569
232
159
555
212
51
311
136
220
676
137
811
120
115
492
52
206
97
178
107
166
416
130
292
697
335
140
59
182
103
150
177
402
147
653
358
307
247
254
707
207
304
112
358
120
598
286
156
106
92
149
138
81
482
304
189
162
57
141
187
618
385
484
426
191
202
169
105
167
154
138
746
248
233
61
120
877
701
218
190
266
200
82
953
788
166
200
49
113
133
136
58
55
403
392
206
138
495
297
198
530
126
150
54
87
164
179
165
156
76
177
150
218
126
61
350
79
54
165
152
57
140
148
210
186
140
126
191
19
94
432
499
128
313
142
358
205
463
131
123
341
232
298
110
199
270
781
138
232
145
181
167
119
121
101
459
323
243
148
199
314
177
80
171
315
115
119
225
163
154
91
59
319
477
131
206
143
198
529
143
130
145
119
124
190
423
174
133
292
128
191
535
122
488
753
857
509
160
601
129
51
257
164
179
203
80
203
126
150
266
569
152
510
268
84
276
152
329
42
216
115
173
112
238
355
126
143


675
241
397
484
295
156
312
464
213
271
142
159
413
190
123
330
62
147
130
78
392
140
115
213
154
129
141
153
273
161
122
402
193
251
265
137
151
188
612
67
167
284
223
119
113
431
151
229
48
122
45
299
178
132
180
186
417
115
251
49
468
241
286
88
123
766
211
78
491
393
562
259
116
159
207
345
374
141
57
504
166
107
185
470
407
404
243
142
116
122
203
77
802
291
183
969
494
207
52
125
192
105
536
111
243
138
131
181
152
247
247
130
158
212
170
726
120
489
166
132
562
63
70
258
122
568
66
134
190
168
633
139
255
221
163
119
405
142
260
326
69
295
185
117
127
501
260
117
106
122
172
43
137
168
198
243
57
221
28
496
203
81
208
88
46
162
268
329
657
256
191
118
138
68
173
229
219
323
279
122
153
93
257
169
561
147
165
447
384
84
283
121
352
200
102
454
921
59
183
152
430
220
91
181
121
183
280
193
544
137
140
126
187
168
106
232
583
486
333
770
156
332
412
338
375
327
58
402
161
91
444
575
649
106
284
330
195
121
62
455
171
110
278
193
582
267
156
968
111
154
384
69
162
223
133
238
399
17

297
253
179
170
48
113
274
75
194
142
99
186
137
175
257
111
356
436
601
102
44
118
146
459
237
117
167
324
613
95
653
509
109
141
763
180
519
284
172
59
214
150
75
36
119
133
105
99
152
360
214
171
235
92
143
145
81
95
905
552
125
197
419
831
141
175
138
86
63
118
288
271
145
240
340
230
196
496
255
198
364
241
116
209
141
118
106
57
123
241
190
52
71
420
125
130
43
83
222
131
169
164
119
177
50
174
109
163
476
124
153
113
216
233
124
268
143
77
917
189
175
158
189
509
96
853
433
174
140
178
213
183
54
785
109
131
176
233
114
57
132
215
201
493
162
150
151
574
191
138
134
191
119
465
220
225
119
148
493
106
112
284
194
133
506
122
144
161
83
160
150
98
235
385
128
216
222
407
43
171
129
226
90
410
233
140
92
191
97
320
98
146
121
104
137
321
120
142
144
163
138
267
264
377
115
271
206
31
146
220
452
255
139
480
124
141
181
217
117
73
195
79
138
153
126
303
36
399
133
238
238
210
114
138
179
155
182
189
37
216
135
127
143
500
133
227
113
131
156
118
349
175
96
11
204
382
85
787
320
259

142
169
134
178
237
218
127
485
158
104
139
80
261
789
285
194
305
161
202
157
161
424
1466
105
124
224
180
96
274
123
188
130
103
539
224
229
137
140
205
421
115
150
160
220
194
135
64
274
398
117
150
182
451
121
105
78
373
129
200
79
33
120
120
275
107
110
211
496
328
191
129
612
257
885
84
180
37
77
262
136
137
175
280
291
265
812
199
80
295
96
214
246
184
66
80
281
313
286
204
253
333
177
136
221
157
67
116
160
181
170
368
62
101
436
117
352
55
831
764
270
143
310
284
140
50
114
122
236
97
100
139
185
247
129
264
138
212
168
108
201
188
493
462
425
385
300
238
128
239
126
141
298
119
187
257
326
47
191
64
789
129
113
121
202
154
126
239
67
177
63
176
193
121
243
133
199
250
185
100
292
126
108
151
150
70
141
306
171
123
204
457
255
81
40
129
48
159
259
158
119
68
122
190
149
414
173
120
145
313
218
103
288
215
134
167
78
253
189
189
320
216
593
123
130
78
134
321
68
244
130
131
91
128
130
493
152
83
152
163
145
573
326
224
582
174
492
461
154
200
128
165
214
62
154
118
390
1077
123

183
390
170
408
347
351
88
286
140
257
86
605
182
57
378
62
142
42
116
155
194
228
125
113
123
101
241
124
148
156
95
662
405
259
138
477
116
302
117
137
429
666
318
154
83
195
75
163
119
132
85
282
297
221
130
226
161
164
134
339
93
296
93
181
385
481
570
722
171
239
188
179
133
596
128
244
131
131
308
218
122
353
128
54
181
100
178
377
480
574
101
150
693
144
351
202
144
257
293
238
287
102
187
108
430
241
217
162
231
275
250
178
585
679
110
151
236
284
160
115
161
106
1215
386
85
177
116
131
37
789
162
135
196
178
152
107
808
191
169
127
303
95
59
451
380
200
379
300
259
111
106
215
151
399
174
158
164
58
376
75
297
110
282
128
281
464
119
225
384
153
183
107
532
121
103
232
578
146
665
186
527
80
133
381
723
295
155
572
44
478
231
198
140
139
294
274
199
142
134
164
41
104
217
631
204
123
18
128
56
65
117
127
163
198
148
90
416
128
67
92
154
244
344
48
111
97
322
402
135
393
226
484
68
111
201
171
38
77
69
127
139
206
650
160
208
81
63
145
201
125
89
237
161
741
528
196
800
266
746

328
131
62
138
76
121
257
197
175
213
637
169
340
79
1336
474
112
112
313
190
167
155
168
398
68
154
387
590
117
423
114
227
116
127
142
288
207
548
125
195
525
122
1138
130
155
238
138
62
181
140
245
172
104
137
235
49
380
189
69
150
127
54
418
149
483
545
131
466
89
168
119
166
215
591
159
179
130
104
186
143
121
370
438
100
58
121
250
263
352
534
402
482
732
103
51
336
177
570
235
412
330
189
130
114
242
343
118
496
177
183
828
265
141
204
75
179
294
485
193
114
303
308
182
167
128
174
296
145
249
115
304
117
126
332
181
160
157
90
134
119
431
134
197
406
843
326
120
249
77
60
262
291
146
128
71
315
70
210
748
55
230
131
109
114
144
50
144
104
125
118
356
107
230
259
446
159
120
122
404
287
287
222
134
121
128
88
483
406
283
392
123
164
307
380
124
99
103
222
129
296
280
147
81
145
572
265
493
109
69
268
119
80
112
213
144
49
96
145
123
217
619
404
120
150
145
84
206
562
164
124
217
193
785
189
147
705
278
289
380
516
144
274
203
100
263
123
341
180
358
124
272
156
260
105
90
140
11

318
175
138
415
194
661
313
474
121
179
121
233
552
69
91
80
46
150
115
690
143
142
82
226
131
131
221
91
248
485
70
127
151
33
141
182
122
53
218
215
276
139
129
136
388
97
191
160
64
142
111
486
124
104
129
222
147
191
86
343
83
466
258
141
343
424
124
116
189
191
167
122
188
143
289
71
213
395
42
245
131
163
171
481
173
116
832
154
209
138
114
247
117
108
380
107
134
116
628
440
197
455
41
122
123
133
265
140
57
113
112
157
388
502
315
969
81
205
269
296
106
37
365
176
72
117
71
365
171
144
176
157
122
178
89
239
160
81
364
305
866
63
325
164
272
214
145
152
66
283
847
150
472
265
161
38
144
317
332
303
221
211
258
115
80
250
141
210
172
203
200
307
45
119
122
96
129
254
138
79
465
181
208
790
245
129
176
288
138
125
325
234
210
882
807
108
181
170
100
149
291
128
253
379
52
546
167
275
175
179
141
175
106
39
48
223
369
433
111
105
147
144
890
214
103
91
97
84
341
194
288
71
125
203
257
146
711
264
512
403
873
129
396
84
201
137
356
796
126
320
205
77
315
279
174
125
553
441
168
491

134
154
217
150
159
263
9
184
319
37
150
128
144
222
180
179
136
243
571
836
111
266
506
829
547
321
166
138
258
170
183
78
93
132
235
143
163
54
49
140
111
121
136
498
292
117
297
64
212
126
250
164
206
113
349
278
82
188
243
31
227
510
104
138
163
408
101
898
239
170
417
120
190
380
134
312
173
498
215
248
230
283
195
425
159
392
158
134
470
457
113
292
144
301
244
218
105
206
106
168
144
174
131
125
254
199
167
69
141
105
211
135
185
127
521
193
79
149
422
168
55
92
273
304
260
256
223
145
95
158
648
56
78
170
128
819
138
451
79
460
94
102
119
164
123
554
167
133
273
90
303
222
419
206
250
61
286
185
377
116
173
116
121
43
116
299
138
230
142
107
253
110
626
246
167
151
135
93
131
166
546
108
201
150
289
359
443
611
104
290
126
403
153
61
21
51
225
266
156
266
97
190
100
221
191
71
197
640
153
91
131
147
292
93
153
178
131
339
87
506
128
154
246
161
259
130
157
130
119
189
216
273
102
132
138
326
167
164
856
276
103
354
61
43
124
200
137
839
200
161
283
349
36
284
136
159
227
140
14

123
162
60
115
225
109
348
723
171
93
194
134
139
109
317
281
51
320
117
198
198
141
667
36
208
158
166
101
87
138
133
143
129
149
124
85
238
51
136
161
198
291
252
31
516
174
220
126
418
138
389
157
158
105
341
80
127
284
98
133
174
174
135
114
228
114
152
84
260
207
120
310
260
80
215
145
190
127
136
339
455
161
104
86
146
83
58
272
100
718
470
137
125
185
103
171
120
891
118
271
159
369
614
213
180
347
228
45
116
38
90
106
109
279
205
442
132
197
160
227
368
120
861
169
189
117
108
310
132
125
510
236
57
71
204
325
150
286
57
564
158
211
452
652
183
308
118
295
535
204
108
168
253
202
171
128
129
327
254
253
137
346
146
176
149
439
242
215
53
55
160
200
145
166
410
958
102
115
331
211
228
320
489
150
117
189
492
728
865
194
83
109
134
616
298
137
48
107
125
407
214
216
198
149
195
47
296
169
99
110
155
182
144
53
112
45
237
253
141
88
217
464
219
69
184
55
50
129
133
144
125
195
234
108
246
136
315
581
244
162
100
78
306
185
506
202
211
515
191
183
188
90
126
260
619
363
179
146
112

90
925
192
155
234
727
126
252
183
152
259
134
107
100
128
34
325
357
292
131
131
228
48
130
146
161
138
603
146
478
119
50
148
322
598
396
162
249
98
190
224
140
229
295
146
152
320
133
72
220
70
42
149
109
158
455
362
112
244
246
293
93
140
202
140
111
318
138
469
151
200
130
153
174
242
244
138
193
239
135
155
264
310
254
163
119
174
167
210
139
144
268
303
49
200
98
131
185
350
139
119
270
500
98
96
119
154
290
94
231
140
145
104
203
224
143
162
199
124
125
141
149
207
230
153
242
60
84
87
215
889
225
129
192
410
139
160
58
150
293
41
159
193
332
115
119
527
137
157
517
113
149
232
197
140
171
272
117
143
337
150
230
231
257
170
267
225
197
123
208
117
413
209
205
126
109
53
777
167
136
157
691
331
370
166
159
117
194
150
102
68
140
196
143
346
168
121
234
150
627
541
107
158
160
154
274
108
385
156
150
85
135
784
127
63
159
109
297
140
209
83
390
411
487
149
260
99
127
97
692
89
447
116
91
251
583
38
376
213
145
108
326
139
201
181
168
405
187
115
180
275
117
139
327
266
498
312
1

230
164
62
92
230
127
547
168
44
137
195
70
143
638
122
135
723
279
278
190
107
681
129
299
157
52
63
142
208
158
208
101
152
57
178
165
152
319
224
178
855
150
115
318
148
175
101
49
443
131
124
351
53
481
140
135
107
121
633
120
292
103
122
201
297
184
100
177
228
127
308
121
163
40
192
141
203
226
165
581
234
238
119
121
183
63
209
192
467
119
581
208
85
134
53
196
125
119
295
41
41
225
267
238
120
629
75
335
185
283
128
114
160
343
127
70
212
142
324
130
709
98
184
479
113
109
307
62
342
171
245
50
36
209
43
113
105
108
126
135
144
124
93
216
240
75
162
179
284
174
116
120
56
203
191
163
59
191
131
332
185
245
299
164
167
179
136
327
160
135
186
265
174
155
120
153
121
166
142
196
391
314
128
211
323
243
332
516
127
66
74
279
338
139
359
324
120
278
284
195
382
42
181
177
71
131
151
186
336
124
458
126
721
135
562
144
196
142
113
235
148
681
303
293
112
355
191
261
133
199
115
529
124
105
138
131
91
142
229
241
185
569
245
208
131
124
395
187
212
114
147
398
133
583
90
209
119
128


136
151
71
208
124
168
127
52
187
162
528
125
111
136
311
337
421
218
254
255
98
286
761
97
217
133
143
98
306
111
146
575
762
300
190
128
86
98
127
110
138
54
109
348
60
402
182
265
54
413
163
156
244
172
255
653
391
111
81
93
330
105
219
144
137
114
45
551
89
108
313
289
182
188
131
105
89
138
178
286
534
130
200
136
92
111
703
132
76
248
135
122
130
108
121
117
147
53
143
151
162
185
363
129
42
248
107
611
400
124
123
184
697
129
160
59
111
141
159
304
111
366
398
141
119
196
68
148
317
57
105
37
223
137
109
199
47
194
748
47
551
298
44
49
192
241
363
213
201
198
116
98
46
123
126
108
117
78
130
383
250
356
416
61
178
321
123
153
55
233
230
117
91
145
511
289
99
434
85
166
44
159
190
202
269
61
42
234
135
43
186
166
165
625
64
285
122
246
432
138
232
177
483
109
131
154
137
108
136
42
695
454
151
306
449
130
191
535
163
168
147
282
335
251
166
157
342
113
312
137
704
138
66
258
209
224
146
155
54
110
218
93
217
162
141
154
62
209
176
122
167
246
139
265
139
158
490
206
344
156
132
4

271
137
199
521
138
127
64
502
521
426
120
227
184
292
190
157
134
113
132
225
449
172
182
79
133
206
335
370
279
123
655
137
275
154
176
200
58
144
155
141
70
104
192
247
343
197
168
149
174
248
45
153
399
106
65
131
121
162
136
306
174
149
136
158
135
73
233
137
548
126
264
140
269
113
61
134
86
307
311
135
173
46
140
119
130
91
130
277
91
46
316
114
52
151
267
157
96
127
148
116
472
210
99
295
124
255
171
580
179
747
115
287
214
201
48
225
64
208
327
318
185
105
139
210
95
267
291
157
238
82
867
215
346
215
82
122
155
425
111
274
350
129
311
555
347
159
294
39
808
131
151
271
631
279
429
44
54
193
495
330
469
186
186
153
172
289
183
580
332
461
198
194
164
82
291
163
609
133
399
140
124
304
214
107
107
135
335
215
45
508
209
118
268
165
343
140
51
233
98
182
206
143
301
185
213
229
276
426
262
459
198
139
40
457
172
27
116
853
138
228
234
308
315
70
40
233
48
158
463
144
230
92
394
105
66
119
171
120
62
157
128
95
265
118
213
156
192
155
174
108
330
80
451
41
203
162
260
116
169
116

Number of labels

In [5]:
#### Add your code here ####
print(labels.shape[0])

50000


# Pad each sentence to be of same length (4 Marks)
- Take maximum sequence length as 300

In [6]:
#### Add your code here ####
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_len = 300
features_updated = pad_sequences(maxlen=max_len, sequences=features_updated, padding="post", value=0, truncating="post")

In [7]:
X_train, X_test, y_train, y_test = train_test_split(features_updated, labels, test_size=0.20, random_state = 42)

### Print value of any one feature and it's label (4 Marks)

Feature value

In [8]:
#### Add your code here ####
d = 314
print(features_updated[d])

[   1    4  293  439    7    4   86 2109   20    9   15  600    7    4
  105   71 2259 3765 2051   39   27 1323    5   14   58  343    6 1451
 1362 2243    5    6 3389 1362    4  668 1362  434    9    4 7849 1362
  237 7744  472  137   50   26   49   52 5867   40    6  646  550 1270
 3756  136  138  161 3765  361   53    7   14   14   20    9 1282  751
   17   52   17   12  100   28   77   13  873    8   67  565 2109   11
  206    5   33  222   31 1207 5863 1780 4331  548  720   18  463 1004
    6  543    5   16  317  643  685  137 3653 4757   33    6 4291 1062
   74   94 6054   12  131 3465    6  117   11   49  531  151 1282  751
   17   78   17 2109  122   76   40 2109  190   14 1327 1330  751    4
  130    9   24   55 1596   10   10   91    7    4  752  203  481   40
 3653    9    6   78   20   21   12  407  218   12    9 1228  737    5
 3897  793    4  128 7609    8    4  512   12    9 1082   35 4981 4001
 1034   40  220  175   85 1362   20   46   50  885    9   12   35  206
 6535 

Label value

In [9]:
#### Add your code here ####
print(labels[d])

1


### Decode the feature value to get original sentence (4 Marks)

First, retrieve a dictionary that contains mapping of words to their index in the IMDB dataset

In [10]:
#### Add your code here ####
word_index = imdb.get_word_index()
words = dict((v,k) for k,v in word_index.items())

Now use the dictionary to get the original words from the encodings, for a particular sentence

In [11]:
#### Add your code here ####
sentence = ""
for i in features_updated[d]:
    if i != 0:
        if words[i] != 'br':
            sentence = sentence + words[i] + " "
        else:
            sentence = sentence + "." + " "
print(sentence)

the of watched she's . of how gangster on it for score . of films than ruined turner window or be themes to as my short is falling ride fame to is surrounding ride of power ride waste it of reagan ride he's isabelle  go more he good very shaking just is relationship anyway standard explicit scenes such nothing turner low up . as as on it humour nearly movie very movie that after one will was nature in can police gangster this without to they there's by anti schools jump waters matter paul but despite reasons is myself to with half cool due go revealing prisoners they is sh 'the been make contempt that these identify is over this good hour old humour nearly movie do movie gangster off get just gangster take as climax club nearly of here it his time hidden i i its . of sequel action totally just revealing it is do on not that itself interesting that it stick talk to abc re of still climbing in of soon that it sees so suspend orders further just family us because ride on some more questi

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [12]:
#### Add your code here ####
if labels[d] == 1:
    print("positive (1)")
else:
    print("negative (0)")

positive (1)


### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - `tensorflow.keras` embedding layer doesn't require us to onehot encode our words, instead we have to give each word a unique integer number as an id. For the imdb dataset we've loaded this has already been done, but if this wasn't the case we could use sklearn LabelEncoder.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer

In [13]:
#### Add your code here ####
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten, TimeDistributed

model = Sequential()
model.add(Embedding(10000,100,input_length=300))
model.add(LSTM(196,return_sequences=True))
model.add(TimeDistributed(Dense(100)))
model.add(Flatten())
model.add(Dense(1,activation='softmax'))

### Compile the model (4 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics

In [14]:
#### Add your code here ####
model.compile(optimizer='adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

### Print model summary (4 Marks)

In [15]:
#### Add your code here ####
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 100)          1000000   
_________________________________________________________________
lstm (LSTM)                  (None, 300, 196)          232848    
_________________________________________________________________
time_distributed (TimeDistri (None, 300, 100)          19700     
_________________________________________________________________
flatten (Flatten)            (None, 30000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 30001     
Total params: 1,282,549
Trainable params: 1,282,549
Non-trainable params: 0
_________________________________________________________________
None


### Fit the model (4 Marks)

In [16]:
#### Add your code here ####
batch_size = 32
model.fit(X_train, y_train, epochs = 1, batch_size=batch_size, verbose = 2)

Train on 40000 samples
40000/40000 - 1912s - loss: 7.7004 - accuracy: 0.4978


<tensorflow.python.keras.callbacks.History at 0x24fbf59bb48>

### Evaluate model (4 Marks)

In [18]:
#### Add your code here ####
score,acc = model.evaluate(X_test, y_test, verbose = 2, batch_size = batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

10000/1 - 48s - loss: 7.1200 - accuracy: 0.5088
score: 7.53
acc: 0.51


### Predict on one sample (4 Marks)

In [77]:
#### Add your code here ####
e = 10
result = int(model.predict(X_test[e].reshape(1,X_test[e].shape[0]),batch_size=1)[0][0])

if result == 1:
    print("positive (1)")
else:
    print("negative (0)")       


positive (1)
