{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Wordnet sandbox\n",
"\n",
"Maintained by David J. Birnbaum, [djbpitt@gmail.com](mailto:djbpitt@gmail.com), http://www.obdurodon.org\n",
"\n",
"## Preface\n",
"\n",
"This tutorial illustrates the use of Wordnet for the types of exploration to be conducted in the [Dante’s _Inferno_](http://dante.obdurodon.org) and [Victorian ghost stories](http://ghost.obdurodon.org) research projects that were part of a [Computational methods in the humanities](http://dh.obdurodon.org) course in the autumn 2016 academic semester. Thanks to Na-Rae Han for discussion and suggestions.\n",
"\n",
"Students completing [Computational methods in the humanities](http://dh.obdurodon.org) to satisfy the “methods” requirement for the Linguistics major need to perform some linguistic tasks with their data, and Wordnet is one way to do that. Below, after an introduction to how Wordnet works, we describe how to add Wordnet-related markup to your XML and how to use that markup to explore your data. You do not need to add Wordnet-related markup to all of your data (which would not be feasible within the context of a semester-long course because some of the work must be performed manually and your documents may be long), but you should do enough of it to be able to experiment a bit with how it works. You also do not have to perform all of the tasks we describe below (which also would not be feasible in the available time); pick one or two that sound interesting and see what you’re able to learn about your documents by implementing them. Ask your instructors should you have any questions about either the content of this tutorial (that is, about how to use Wordnet) or the scope of the assignment.\n",
"\n",
"**tl;dr:** Use Wordnet as described below to add semantic markup to some (not all) of your data. Then perform some (not all) of the tasks below to explore how meaning is represented in your texts.\n",
"\n",
"## Introduction\n",
"\n",
"In Real Life you’ll export the words you care about from your XML using XSLT and then read the list into your Python program, but to start, let’s concentrate on learning how Wordnet works. We’re writing this tutorial in the **Jupyter notebook** interface, which allows us to break up the code into pieces that are interspersed with discussion. Because the code is fragmented, in order to run the statements at the bottom of the page you need to have run at least some of the ones at the top. For example, we import Wordnet at the beginning with `from nltk.corpus import wordnet as wn`, and later code depends on our having done that. This means that if you copy and try to run something below without having done the import, you’ll throw an error. We also create some variables near the top that we use below without redeclaring them. You don’t need to use Jupyter notebook for your own development; we’ve used it here because the combination of code cells and text cells is convenient for tutorial purposes.\n",
"\n",
"**tl;dr:** Run the code from the top of this notebook to the bottom, and not just in a single cell.\n",
"\n",
"## How Wordnet is organized\n",
"\n",
"Wordnet is a hierarchical organization of units of meaning, called **synsets**. Synsets are represented in texts by **words**, and a combination of a **lexeme** (represented by the dictionary form of a word) with a specific synset is called a **lemma**. Synsets are identified within Wordnet by three dot-separated parts:\n",
"\n",
"1. A representative word, that is, a word that conveys the meaning of the synset. This representative word may not be the only word that conveys that meaning, and it may also be able to convey other meanings. We’ll see below that the lexeme “ghost” can represent several different meanings (that is, is associated with multiple synsets), and that each of those meanings can alternatively be conveyed by lexemes other than “ghost”.\n",
"1. A part of speech (POS) identifier, like “n” for ‘noun’ or “v” for ‘verb’.\n",
"1. A two-digit number that distinguishes different synsets that may have the same head word and the same POS, but that convey different meanings. For example, the synsets 'ghost.n.01' and 'ghost.n.02' are two different nominal meanings that can be expressed by the lexeme “ghost”.\n",
"\n",
"### Exploring synsets\n",
"\n",
"There’s a lot more organization within Wordnet, but for the purpose of this tutorial we’re going to stick to the information conveyed through synsets. Let’s explore that with the synset 'koala.n.01', which is a noun that represents a particular arboreal Australian marsupial. Here’s how it looks when we ask Python about it:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Synset('koala.n.01')"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from nltk.corpus import wordnet as wn # import Wordnet and call it just “wn” for brevity\n",
"wn.synset('koala.n.01')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output above tells us that the synset 'koala.n.01' is a synset that Wordnet calls 'koala.n.01'. That tautology isn’t very useful, so the only point of the code snippet above is to determine whether such a synset exists. If it doesn’t, we’ll get an error. You can test this by running the cell below, which will raise an error because there is no 'koala.n.02' synset in Wordnet (your error message may differ from ours):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "WordNetError",
"evalue": "lemma 'koala' with part of speech 'n' has only 1 sense",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/Users/djb/anaconda/lib/python3.5/site-packages/nltk/corpus/reader/wordnet.py\u001b[0m in \u001b[0;36msynset\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m 1233\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1234\u001b[0;31m \u001b[0moffset\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_lemma_pos_offset_map\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mlemma\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpos\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0msynset_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1235\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mIndexError\u001b[0m: list index out of range",
"\nDuring handling of the above exception, another exception occurred:\n",
"\u001b[0;31mWordNetError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-3-5158dbbef11c>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mwn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msynset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'koala.n.02'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/Users/djb/anaconda/lib/python3.5/site-packages/nltk/corpus/reader/wordnet.py\u001b[0m in \u001b[0;36msynset\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m 1243\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1244\u001b[0m \u001b[0mtup\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlemma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpos\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mn_senses\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"senses\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1245\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mWordNetError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmessage\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mtup\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1246\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1247\u001b[0m \u001b[0;31m# load synset information from the appropriate file\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mWordNetError\u001b[0m: lemma 'koala' with part of speech 'n' has only 1 sense"
]
}
],
"source": [
"wn.synset('koala.n.02')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How can we know that there is a 'koala.n.01' synset but no 'koala.n.02' synset without asking for the latter and raising an error? We can ask Wordnet to tell us about all of the synsets associated with the word ‘koala’ by using the `wn.synsets()` function:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[Synset('koala.n.01')]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synsets('koala')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The preceding code tells us that there is exactly one synset associated with the word ‘koala’, and that the synset is called 'koala.n.01'."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting the definition of a synset\n",
"\n",
"Synsets are units of meaning, and we can ask for a definition of a synset by using the `.definition()` method:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'sluggish tailless Australian arboreal marsupial with grey furry ears and coat; feeds on eucalyptus leaves and bark'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synset('koala.n.01').definition()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting the lexemes associated with a synset\n",
"\n",
"As we write above, synsets, as units of meaning, are represented in a text by lexemes, and the combination of a synset (a meaning) plus a lexeme (a word) is called a **lemma**. We can get the lemmata for a particular synset by asking for them with the `.lemmas()` method:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[Lemma('koala.n.01.koala'),\n",
" Lemma('koala.n.01.koala_bear'),\n",
" Lemma('koala.n.01.kangaroo_bear'),\n",
" Lemma('koala.n.01.native_bear'),\n",
" Lemma('koala.n.01.Phascolarctos_cinereus')]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synset('koala.n.01').lemmas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that a lemma like 'koala.n.01.koala' combines the synset representation (“koala.n.01”) with a lexeme that expresses that meaning (“koala”). You can get just the lexical part, without the synset prefix, by applying the `.name()` method to a lemma. Here we ask for the first (zeroth in Python enumeration) lemma associated with our synset and return just its name:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'koala'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synset('koala.n.01').lemmas()[0].name()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What about inflected forms?\n",
"\n",
"As noted above, we can identify all of the synsets associated with a word by using the `wn.synsets()` function:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[Synset('koala.n.01')]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synsets('koala')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The word that we use as an argument to the `wn.synsets()` function doesn’t have to be the dictionary form, which for nouns is typically the singular. We’ll get the same result if we ask for the synsets associated with the plural:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[Synset('koala.n.01')]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synsets('koalas')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see above that the lexeme “koala” (whether represented by its singular or plural form) belongs to only one synset. The word “ghost”, though, belongs to seven, four of which are nouns and three of which are verbs:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[Synset('ghost.n.01'),\n",
" Synset('ghostwriter.n.01'),\n",
" Synset('ghost.n.03'),\n",
" Synset('touch.n.03'),\n",
" Synset('ghost.v.01'),\n",
" Synset('haunt.v.02'),\n",
" Synset('ghost.v.03')]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synsets('ghost')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Synset summary\n",
"\n",
"* A word may represent multiple meanings, and we get the meanings with `wn.synsets()`.\n",
"* We can get a definition of a synset with `.definition()`.\n",
"* We can get the lemmata (combination of a lexeme with a meaning) associated with a synset with `.lemmas()`.\n",
"* We can get just the lexical part of a lemma with `.name()`."
]
},
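{
"cell_type": "markdown",
"metadata": {},
"source": [
"The four bullet points above can be combined into a single sketch. Nothing here is new; it just gathers, in one place, the calls demonstrated earlier on the 'koala.n.01' example:\n",
"\n",
"```python\n",
"from nltk.corpus import wordnet as wn\n",
"synsets = wn.synsets('koala')                         # all synsets for the word\n",
"first = synsets[0]                                    # Synset('koala.n.01')\n",
"definition = first.definition()                       # the gloss of the synset\n",
"lexemes = [lemma.name() for lemma in first.lemmas()]  # just the lexical parts\n",
"```"
]
},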
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Wordnet to explore course project data\n",
"\n",
"For this tutorial, assume that we’re interested in words that express scary concepts. This is close to the actual focus of [Victorian ghost stories](http://ghost.obdurodon.org); for the other project this semester, [Dante’s _Inferno_](http://dante.obdurodon.org), assume that we’re interested in painful concepts instead of scary ones. Pitt-Greensburg students on the [Eldritch team](https://github.com/PPH3/Eldritch) are investigating words in H. P. Lovecraft’s writings that convey an impression of the bizarre and arcane. The project teams have already tagged the interesting words using manual methods, but we’re assuming that they are all tagged only in a simple way, along the lines of `<spooky_word>ghost</spooky_word>`. This initial markup makes it possible to find the words we care about easily, but it doesn’t tell us what they mean beyond the fact that they’re associated with scariness.\n",
"\n",
"We can begin our richer exploration of meaning by compiling a list of sample words and examining their synsets. In the example below we’ve included four spooky words plus one non-spooky control item:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[[Synset('panic.n.02'),\n",
" Synset('scare.n.02'),\n",
" Synset('frighten.v.01'),\n",
" Synset('daunt.v.01')],\n",
" [Synset('ghost.n.01'),\n",
" Synset('ghostwriter.n.01'),\n",
" Synset('ghost.n.03'),\n",
" Synset('touch.n.03'),\n",
" Synset('ghost.v.01'),\n",
" Synset('haunt.v.02'),\n",
" Synset('ghost.v.03')],\n",
" [Synset('fear.n.01'), Synset('frighten.v.01')],\n",
" [Synset('creep.n.01'), Synset('ghost.n.01'), Synset('spook.v.01')],\n",
" [Synset('koala.n.01')]]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from nltk.corpus import wordnet as wn # import Wordnet and call it just “wn” for brevity\n",
"words = ['scare', 'ghost', 'fright', 'spook', 'koala'] # create a list of words to examine\n",
"synset_list = [wn.synsets(word) for word in words] # get the synsets for each word\n",
"synset_list # display them"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output above is a list of lists, where each of the inner lists contains the synsets that pertain to a particular word form. We can see that the first inner list shows the four synsets associated with the word “scare”, the second inner list shows the seven synsets associated with the word “ghost”, etc. Our assumption is that each word taken from a text is associated, _in the context in which it occurs_, with exactly one meaning represented by one of the available synsets. The part about context matters; the same lexeme may occur in different contexts with different meanings within the same text. For example, as noted above, the word “scare” may be a noun in one place and a verb in another.\n",
"\n",
"Occasionally your texts may contain words that are not included in Wordnet, or words that are used with meanings that are not represented in Wordnet. You cannot add anything to Wordnet, so when that happens, make a note of it; you’ll have to exclude those words from your Wordnet processing."
]
},
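{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can spot such out-of-Wordnet words programmatically, because `wn.synsets()` returns an empty list for a word form it cannot match. A minimal sketch (the word list here is hypothetical, and “eldritchness” is an invented form chosen so that it will miss):\n",
"\n",
"```python\n",
"from nltk.corpus import wordnet as wn\n",
"words = ['ghost', 'eldritchness']  # hypothetical sample; 'eldritchness' is made up\n",
"unknown = [word for word in words if not wn.synsets(word)]\n",
"unknown  # note these words and exclude them from your Wordnet processing\n",
"```"
]
},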
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Add synset markup to your documents\n",
"\n",
"So far your data contains nothing more than a tag that identifies spooky words, e.g., `<spooky_word>ghost</spooky_word>`. Your goal here is to identify the synset represented by the word in its context and add an attribute (`@synset`) to the markup, using a value that identifies the synset. This task requires human analysis, since although Wordnet can tell you the possible synsets for a particular lexeme, it can’t tell which of those available meanings the lexeme has at a particular location in the text. Remember that the same word form may represent different synsets in different locations. For example, as noted above, “scare” could be a noun in one place and a verb in a different place, and those are different synsets. You don’t need to do this for your entire corpus, which wouldn’t be realistic given the fifteen-week semester and the size of the corpus, but you’ll want to do enough to get a sense of the relationship between word forms in your corpus and the synsets that Wordnet uses to represent units of meaning.\n",
"\n",
"The procedure for adding synset markup to the document has three steps:\n",
"\n",
"1. Get the definitions of each synset for each scary word in your corpus or selection. You can use Python to do this.\n",
"1. Choose the appropriate synset for each scary word in your corpus or selection. This requires human decisions, since Python doesn’t understand the context.\n",
"1. Write the correct synset into the markup as a new `@synset` attribute. You have to do this manually, as well."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Get the definitions of each synset for each word\n",
"\n",
"You can get the definition of a synset like `Synset('panic.n.02')` with:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'sudden mass fear and anxiety over anticipated events'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synset('panic.n.02').definition()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A lexeme like “scare” is associated with four synsets:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[Synset('panic.n.02'),\n",
" Synset('scare.n.02'),\n",
" Synset('frighten.v.01'),\n",
" Synset('daunt.v.01')]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synsets('scare')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For each occurrence of some form of “scare” in our texts (it might be ‘scare’ or ‘scares’ or some other inflected form), we want to add an attribute to our XML that indicates the appropriate synset. To tell the synsets apart (in case the sample word that’s part of the synset identifier is not sufficiently clear by itself), we can get their definitions. The code below outputs each synset and its definition:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[\"Synset('panic.n.02') means: sudden mass fear and anxiety over anticipated events\",\n",
" \"Synset('scare.n.02') means: a sudden attack of fear\",\n",
" \"Synset('frighten.v.01') means: cause fear in\",\n",
" \"Synset('daunt.v.01') means: cause to lose courage\"]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[str(item) + ' means: ' + item.definition() for item in wn.synsets('scare')]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the `str()` function above to stringify the synset (represented by the variable `item`) so that we can concatenate it with the other strings for output."
]
},
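{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer, string formatting can do the stringification for you, since `.format()` calls `str()` on its arguments implicitly. This sketch produces the same kind of output as the cell above:\n",
"\n",
"```python\n",
"from nltk.corpus import wordnet as wn\n",
"messages = ['{} means: {}'.format(synset, synset.definition()) for synset in wn.synsets('scare')]\n",
"messages\n",
"```"
]
},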
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Choose the appropriate synset for each spooky word _in context_\n",
"\n",
"Once you know the synsets that are available for each word in your document, look at your XML and choose the appropriate synset for each word _in context_. For example, if “scare” occurs as a verb that means ‘cause fear in’ in one place, the synset you’d choose from above would be 'frighten.v.01'. If it occurs as a noun that means ‘a sudden attack of fear’ in another, you’d choose 'scare.n.02'."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Write the synset information back into the XML\n",
"\n",
"You can’t write the synset information back into the XML automatically because the same word form in the XML might belong to different synsets in different locations (like the use of ‘scare’ as a verb or as a noun, described above). For that reason, you’ll want to add the synset value manually to the tagged words in your XML. For example, if you have:\n",
"\n",
"```xml\n",
"<p>He <spooky_word>scared</spooky_word> them.</p>\n",
"```\n",
"\n",
"You would expand the markup to:\n",
"\n",
"```xml\n",
"<p>He <spooky_word synset=\"frighten.v.01\">scared</spooky_word> them.</p>\n",
"```\n",
"\n",
"The easiest way to add this type of markup is to load the document into <oXygen/> and do a search and replace, searching for the string\n",
"\n",
"```xml\n",
"<spooky_word\n",
"```\n",
"\n",
"and replace it with\n",
"\n",
"```xml\n",
"<spooky_word synset=\"\"\n",
"```\n",
"\n",
"This will write the `@synset` attribute into the start tag with a null value, and you can then use the XPath browser box to find all `<spooky_word>` elements (using the XPath expression `//spooky_word`) and type in the attribute values. You’ll want to modify your schema so that this new attribute will be valid."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Examine the lemmata for each synset\n",
"\n",
"At the moment this is just for curiosity. Below we construct a list of two synsets and for each of them we print the Wordnet synset identifier and a list of the lexemes associated with it. As described above, we use the `.lemmas()` method to get the lemmata associated with the synset and we use the `.name()` method to keep only the lexical part of the lemma:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Synset('scare.n.02') has the following lemmata: ['scare', 'panic_attack']\n",
"Synset('frighten.v.01') has the following lemmata: ['frighten', 'fright', 'scare', 'affright']\n"
]
}
],
"source": [
"scare_synsets = [wn.synset('scare.n.02'), wn.synset('frighten.v.01')]\n",
"for synset in scare_synsets:\n",
" print(str(synset) + ' has the following lemmata: ' + str([lemma.name() for lemma in synset.lemmas()]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tasks\n",
"\n",
"Once you’ve added the new `@synset` attributes to your XML, as described above, here are some tasks you can perform to explore them. Linguistics students who need to meet the Linguistics Department “methods” requirement should choose one or two of the following tasks. You don’t need to process your entire corpus, which wouldn’t be realistic in the context of a one-semester course, and for the same reason you don’t need to implement all of the suggested tasks. But you want to do enough to get a sense of what Wordnet can tell you about the semantics of your documents."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Explore lexical ambiguity\n",
"\n",
"Word forms in your text will belong to zero or more synsets, although an _occurrence_ of a word form will belong to only one synset _in its particular context_. You can quantify the degree of lexical ambiguity, and thus the extent to which the meaning of the word depends on context, by retrieving the number of synsets for each word form in your data. Note that the focus here is on lexical ambiguity, that is, the meanings that a word could have in isolation. This is different from the contextual ambiguity that might interest scholars of literature, where ambiguity _in a specific context_ (that is, ambiguity that persists even in a particular context) might be used to express irony or for other rhetorical purposes.\n",
"\n",
"One way to think about lexical ambiguity from a Wordnet perspective is to count the synsets that a word can represent. Here’s how to do that (using the `words` variable we created above, which is equal to a list of five specific words):"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The word \"scare\" belongs to 4 synsets\n",
"The word \"ghost\" belongs to 7 synsets\n",
"The word \"fright\" belongs to 2 synsets\n",
"The word \"spook\" belongs to 3 synsets\n",
"The word \"koala\" belongs to 1 synsets\n"
]
}
],
"source": [
"for word in words:\n",
" synset_count = len(wn.synsets(word))\n",
" print('The word \"' + word + '\" belongs to ' + str(synset_count) + ' synsets')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The preceding output is fine for humans, but we want to write these counts back into our XML. We can do that automatically in three steps:\n",
"\n",
"1. Use XSLT to export a plain text list of words you’ve tagged (e.g., spooky words) from your XML data files.\n",
"1. Use Python to create an XML auxiliary document that maps each of those words to its synset count. The Python script will read the exported plain text list, use Wordnet to count the number of synsets associated with each of them, and write the word plus the count into the new XML document.\n",
"1. Use an XSLT _identity transformation_ to write the synset count into the XML as new content. Your XSLT transformation will transform each of your XML data files to itself (that is, the output will be identical to the input), except that it will insert an additional `@synset_count` attribute that includes the count of synsets associated with the word form.\n",
"\n",
"Here’s how that works:\n",
"\n",
"#### Step 1: Export a plain text list of words you’ve tagged (e.g., spooky words)\n",
"\n",
"Here’s some original sample XML:\n",
"\n",
"```xml\n",
"<root>\n",
" <p>The <spooky_word>ghost</spooky_word> <spooky_word>scared</spooky_word> \n",
" him by giving him a <spooky_word>scare</spooky_word>.</p>\n",
"</root>\n",
"```\n",
"\n",
"In this XML we have tagged the spooky words. We then manually add the synset markup, as described above:\n",
"\n",
"```xml\n",
"<root>\n",
" <p>The <spooky_word synset=\"ghost.n.03\">ghost</spooky_word> \n",
" <spooky_word synset=\"frighten.v.01\">scared</spooky_word> \n",
" him by giving him a \n",
" <spooky_word synset=\"scare.n.02\">scare</spooky_word>.</p>\n",
"</root>\n",
"```\n",
"\n",
"We then run the following XSLT transformation, outputting plain text:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n",
" xmlns:xs=\"http://www.w3.org/2001/XMLSchema\" exclude-result-prefixes=\"xs\" version=\"2.0\">\n",
" <xsl:output method=\"text\" indent=\"yes\"/>\n",
" <xsl:template match=\"/\">\n",
" <xsl:apply-templates select=\"//spooky_word\"/>\n",
" </xsl:template>\n",
" <xsl:template match=\"spooky_word\">\n",
"        <xsl:value-of select=\"concat(.,'&#x0A;')\"/>\n",
" </xsl:template>\n",
"</xsl:stylesheet>\n",
"```\n",
"\n",
"Note that the value of the `@method` attribute on the `<xsl:output>` element is \"text\" because we’re creating plain text. We apply templates to the `<spooky_word>` elements, and in the template that matches those elements, we output the result of concatenating the content of the element (the word itself) with a newline character (spelled `&#x0A;`, which is the _numeric character reference_ for a newline). The output looks like:\n",
"\n",
" ghost\n",
" scared\n",
" scare\n",
"\n",
"We can save that to a file (let’s call it “spooky_words.txt”), so that we can access it later with Python.\n",
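"\n",
"If you would rather stay in Python for this step, the standard-library `xml.etree.ElementTree` module can extract the same list. Here is a minimal sketch, with the sample XML inlined as a string for illustration (with real data you would parse your file instead):\n",
"\n",
"```python\n",
"import xml.etree.ElementTree as ET  # standard library, no installation needed\n",
"\n",
"xml_data = '''<root>\n",
"  <p>The <spooky_word>ghost</spooky_word> <spooky_word>scared</spooky_word>\n",
"  him by giving him a <spooky_word>scare</spooky_word>.</p>\n",
"</root>'''\n",
"root = ET.fromstring(xml_data)  # parse the XML from the string\n",
"words = [word.text for word in root.iter('spooky_word')]  # tagged words in document order\n",
"print(words)  # ['ghost', 'scared', 'scare']\n",
"```\n",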
"\n",
"#### Step 2: Access that file with Python and create a new XML file that maps each word form to its synset count"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"with open('spooky_words.txt', 'r') as infile: # open the plain text file that contains the list of words\n",
"    wordlist = infile.read().split() # read the words into a list, splitting on whitespace (here, the newlines)\n",
"with open('synset_counts.xml', 'w') as outfile: # open a file to hold the XML output\n",
" outfile.write('<root>') # create a start tag for the root element in the output XML file\n",
" for word in wordlist: # create output for each word\n",
" synset_count = len(wn.synsets(word)) # for each word, count the number of synsets to which it belongs\n",
" outfile.write('<word><form>' + word + '</form><count>' + str(synset_count) + '</count></word>') # write it out\n",
" outfile.write('</root>') # create the end tag for the root element"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We saved the output to a file called synset\\_counts.xml, so we don’t see it here in the notebook, but we can now use Python to read it. This is just for human inspection, to make sure that it looks the way we want. It isn’t pretty-printed, but we can still see how it looks:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<root><word><form>ghost</form><count>7</count></word><word><form>scared</form><count>4</count></word><word><form>scare</form><count>4</count></word></root>\n"
]
}
],
"source": [
"with open('synset_counts.xml') as infile:\n",
" print(infile.read())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Step 3: To write the counts back into the XML, use an _identity transformation_, reading in the new count file with the XPath `document()` function\n",
"\n",
"Assume that we’ve saved our original XML (with the synsets, but without the counts) as original.xml. It looks like:\n",
"\n",
"```xml\n",
"<root>\n",
" <p>The <spooky_word synset=\"ghost.n.03\">ghost</spooky_word> \n",
" <spooky_word synset=\"frighten.v.01\">scared</spooky_word> \n",
" him by giving him a \n",
" <spooky_word synset=\"scare.n.02\">scare</spooky_word>.</p>\n",
"</root>\n",
"```\n",
"\n",
"Transform it with the following XSLT:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n",
" xmlns:xs=\"http://www.w3.org/2001/XMLSchema\"\n",
" exclude-result-prefixes=\"xs\"\n",
" version=\"2.0\">\n",
" <xsl:variable name=\"count_file\" as=\"document-node()\" select=\"document('synset_counts.xml')\"/>\n",
" <xsl:template match=\"node()|@*\">\n",
" <xsl:copy>\n",
" <xsl:apply-templates select=\"@*|node()\"/>\n",
" </xsl:copy>\n",
" </xsl:template>\n",
" <xsl:template match=\"spooky_word\">\n",
" <xsl:copy>\n",
" <xsl:attribute name=\"synset_count\" select=\"$count_file//word[form eq current()]/count\"/>\n",
" <xsl:apply-templates select=\"@*|node()\"/>\n",
" </xsl:copy>\n",
" </xsl:template>\n",
"</xsl:stylesheet>\n",
"```\n",
"\n",
"The `document()` function opens synset\\_counts.xml (which we created with Python in Step #2) so that we can access it (through the variable `$count_file`) while we’re transforming original.xml. The first template is an _identity transformation_, which you can read about at https://en.wikipedia.org/wiki/Identity_transform. In an identity transformation, the identity template copies everything unchanged (that is, the output is a copy of the input), and you write separate templates only for the bits that you want to change. In this case, we’re adding a new `@synset_count` attribute to each `<spooky_word>` element, copying its value from the auxiliary file that we created with Python in the preceding step.\n",
"\n",
"Here’s the output of that last transformation:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<root>\n",
" <p>The <spooky_word synset_count=\"7\" synset=\"ghost.n.03\">ghost</spooky_word> \n",
" <spooky_word synset_count=\"4\" synset=\"frighten.v.01\">scared</spooky_word> \n",
" him by giving him a \n",
" <spooky_word synset_count=\"4\" synset=\"scare.n.02\">scare</spooky_word>.</p>\n",
"</root>\n",
"```\n",
"\n",
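"The same augmentation can also be sketched in plain Python with `xml.etree.ElementTree`, if XSLT isn’t available. This is an illustrative stand-in, not the pipeline above: both synset\\_counts.xml and a reduced original.xml are inlined as strings:\n",
"\n",
"```python\n",
"import xml.etree.ElementTree as ET  # standard library\n",
"\n",
"counts_xml = ('<root><word><form>ghost</form><count>7</count></word>'\n",
"              '<word><form>scared</form><count>4</count></word>'\n",
"              '<word><form>scare</form><count>4</count></word></root>')\n",
"# map each word form to its count, mirroring the $count_file lookup in the XSLT\n",
"counts = {w.find('form').text: w.find('count').text\n",
"          for w in ET.fromstring(counts_xml).iter('word')}\n",
"\n",
"original = ET.fromstring(\n",
"    '<root><p>The <spooky_word synset=\"ghost.n.03\">ghost</spooky_word> scared him.</p></root>')\n",
"for word in original.iter('spooky_word'):\n",
"    word.set('synset_count', counts[word.text])  # add the attribute in place\n",
"print(ET.tostring(original, encoding='unicode'))\n",
"```\n",
"\n",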
"We can then calculate the extent of ambiguity for the entire document or for each individual paragraph. Note that we do not need to have identified a particular synset for each spooky word in order to determine the extent of ambiguity; that is, the `@synset` attributes shown in the example are not actually needed. All that is required is a count of all the available synsets for each word. We might decide that the ambiguity of a paragraph is the average of all of the `@synset_count` values in that paragraph, so that for the sole paragraph here it would be 5, that is, the sum of the three values (15) divided by the number of values (3). We could graph this with SVG to examine whether there’s a pattern to the ambiguity, that is, whether it’s higher in some locations of the story than in others. We could also look for correlations between, say, the number of spooky words and the degree of ambiguity. Or we could compare stories or authors to see whether there is any regularity or other pattern in the ambiguity."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Determine the number of representations of each synset in each document\n",
"\n",
"You can use XSLT to determine which synsets are favored in which texts or by which authors or at which periods. Consider the following input document:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<root>\n",
" <p>The <spooky_word synset=\"ghost.n.03\">ghost</spooky_word>\n",
" <spooky_word synset=\"frighten.v.01\">scared</spooky_word> and <spooky_word\n",
" synset=\"frighten.v.01\">frightened</spooky_word> him by giving him a <spooky_word\n",
" synset=\"scare.n.02\">scare</spooky_word>.</p>\n",
"</root>\n",
"```\n",
"\n",
"This has four spooky words representing three different synsets. We can count the number of occurrences of each synset using XSLT:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n",
" xmlns:xs=\"http://www.w3.org/2001/XMLSchema\" exclude-result-prefixes=\"xs\" version=\"2.0\">\n",
" <xsl:output method=\"xml\" indent=\"yes\"/>\n",
" <xsl:variable name=\"root\" select=\"/\"/>\n",
" <xsl:template match=\"/\">\n",
" <data>\n",
" <xsl:for-each select=\"distinct-values(//spooky_word/@synset)\">\n",
" <synset_count>\n",
" <synset>\n",
" <xsl:value-of select=\"current()\"/>\n",
" </synset>\n",
" <count>\n",
" <xsl:value-of select=\"count($root//spooky_word[@synset eq current()])\"/>\n",
" </count>\n",
" </synset_count>\n",
" </xsl:for-each>\n",
" </data>\n",
" </xsl:template>\n",
"</xsl:stylesheet>\n",
"```\n",
"\n",
"We set a variable called `$root` because when we do `<xsl:for-each>` over distinct values, we cut ourselves off from the tree, so if we want to get back to it, we need to access it through that variable. Here we get each distinct `@synset` value and count the number of `<spooky_word>` elements that have a `@synset` attribute with that value. In this case the output is:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<data>\n",
" <synset_count>\n",
" <synset>ghost.n.03</synset>\n",
" <count>1</count>\n",
" </synset_count>\n",
" <synset_count>\n",
" <synset>frighten.v.01</synset>\n",
" <count>2</count>\n",
" </synset_count>\n",
" <synset_count>\n",
" <synset>scare.n.02</synset>\n",
" <count>1</count>\n",
" </synset_count>\n",
"</data>\n",
"```\n",
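"\n",
"For comparison, the same tally takes a few lines of Python with `collections.Counter`; a sketch with the four-word sample document inlined as a string:\n",
"\n",
"```python\n",
"from collections import Counter\n",
"import xml.etree.ElementTree as ET\n",
"\n",
"doc = ET.fromstring(\n",
"    '<root><p>The <spooky_word synset=\"ghost.n.03\">ghost</spooky_word> '\n",
"    '<spooky_word synset=\"frighten.v.01\">scared</spooky_word> and '\n",
"    '<spooky_word synset=\"frighten.v.01\">frightened</spooky_word> him by giving him a '\n",
"    '<spooky_word synset=\"scare.n.02\">scare</spooky_word>.</p></root>')\n",
"tallies = Counter(w.get('synset') for w in doc.iter('spooky_word'))\n",
"for synset, count in tallies.items():  # insertion order: first occurrence first\n",
"    print(synset, count)\n",
"```\n",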
"\n",
"We could transform that to HTML or SVG for display. The counts let us ask: do some works or authors show a preference for certain synset expressions of spookiness?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Explore the richness of the expression of spookiness\n",
"\n",
"Since we’ve already assigned a synset to each spooky word in our text, we can count the number of different synsets in the text. Do some writers represent spookiness with a greater range of spooky-related meanings, that is, with more synsets, than other writers? Because texts may be of different lengths, we might want not just to count the number of different synsets, but to express the value as the result of dividing the number of distinct synsets by the number of spooky word instances. We can do that with XSLT and write the result into the document as metadata, performing another identity transformation and this time just adding the count in a new element. Assume our input is the output of the last operation, that is:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<root>\n",
" <p>The <spooky_word synset_count=\"7\" synset=\"ghost.n.03\">ghost</spooky_word> \n",
" <spooky_word synset_count=\"4\" synset=\"frighten.v.01\">scared</spooky_word> \n",
" him by giving him a \n",
" <spooky_word synset_count=\"4\" synset=\"scare.n.02\">scare</spooky_word>.</p>\n",
"</root>\n",
"```\n",
"\n",
"Apply the following XSLT transformation:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n",
" xmlns:xs=\"http://www.w3.org/2001/XMLSchema\" exclude-result-prefixes=\"xs\" version=\"2.0\">\n",
" <xsl:output method=\"xml\" indent=\"yes\"/>\n",
" <xsl:template match=\"node() | @*\">\n",
" <xsl:copy>\n",
" <xsl:apply-templates select=\"@* | node()\"/>\n",
" </xsl:copy>\n",
" </xsl:template>\n",
" <xsl:template match=\"root\">\n",
" <xsl:copy>\n",
" <meta>\n",
" <spookiness_ratio>\n",
" <xsl:value-of\n",
" select=\"count(distinct-values(//spooky_word/@synset)) div count(//spooky_word)\"\n",
" />\n",
" </spookiness_ratio>\n",
" </meta>\n",
" <xsl:apply-templates/>\n",
" </xsl:copy>\n",
" </xsl:template>\n",
"</xsl:stylesheet>\n",
"```\n",
"\n",
"We start with the identity transformation, but when we match our root element (which we’ve arbitrarily called `<root>`), before we apply templates (that is, process its contents) we create a new `<meta>` child, which contains a `<spookiness_ratio>` element, and we calculate and insert the value there. In this case it turns out to be 1 because there are three `<spooky_word>` elements and three distinct `@synset` values. The fewer distinct synsets there are, the lower the value will be. If we use our sample input from above:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<root>\n",
" <p>The <spooky_word synset=\"ghost.n.03\">ghost</spooky_word>\n",
" <spooky_word synset=\"frighten.v.01\">scared</spooky_word> and <spooky_word\n",
" synset=\"frighten.v.01\">frightened</spooky_word> him by giving him a <spooky_word\n",
" synset=\"scare.n.02\">scare</spooky_word>.</p>\n",
"</root>\n",
"```\n",
"\n",
"and run the same transformation, the value is 0.75 because there are four spooky words and three distinct synsets. Note that if we apply this transformation to a document with no spooky words, it will throw an error because we would be dividing by zero. If we know that we won’t apply our transformation to any such documents, we can ignore the risk, but a less brittle strategy might trap the error, report it gracefully, and terminate cleanly, instead of falling back on the XSLT processor’s default error handling.\n",
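"\n",
"The same ratio, guard included, can be sketched in a few lines of Python over a list of `@synset` values (hypothetical sample data matching the four-word example):\n",
"\n",
"```python\n",
"synsets = ['ghost.n.03', 'frighten.v.01', 'frighten.v.01', 'scare.n.02']\n",
"if synsets:  # guard against dividing by zero when there are no spooky words\n",
"    ratio = len(set(synsets)) / len(synsets)  # distinct synsets / spooky words\n",
"    print(ratio)  # 0.75\n",
"else:\n",
"    print('No spooky words; the ratio is undefined')\n",
"```\n",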
"\n",
"Dividing the number of distinct synsets by the count of spooky words can be analogized to the _type/token ratio_ in corpus linguistics. In a type/token ratio, types are _distinct_ items (such as _different_ words in a text) and tokens are all of the items (such as every word in the same text, regardless of whether it duplicates a word we’ve already seen). A high type/token ratio means that the text is lexically varied, with little repetition of words; a low ratio means a less varied vocabulary. In this case the number of distinct synsets is our type count and the number of spooky words is our token count. A high ratio means that spookiness is expressed in a wider variety of ways; a value of 1 would mean that no synset is repeated. A low ratio means less variety; the value cannot be 0 if there’s any spookiness at all, but the least varied possibility is that there are many spooky words and they all represent the same synset.\n",
"\n",
"Type/token ratios are sensitive to text length. This is easiest to see at the extreme: the number of distinct words in a language may be very large, but it isn’t infinite (at least, not in any real language context), while texts can be arbitrarily long. That means that once a text reaches a certain length, there are no words left that haven’t been used already, so making the text any longer requires repeating words. Because the type/token ratio depends on text length, you can meaningfully compare type/token ratios only for texts of the same length. For that reason, if you want to compare our spookiness analogy across texts, you should use texts of the same length."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Explore the richness of the vocabulary (by writer or by text)\n",
"\n",
"Synsets are represented by one or more lemmata, which you can retrieve with the `lemmas()` method, as in:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Synset('ghost.n.01') means \"a mental representation of some haunting experience\" and has 6 lemmata: ['ghost', 'shade', 'spook', 'wraith', 'specter', 'spectre']\n",
"Synset('ghostwriter.n.01') means \"a writer who gives the credit of authorship to someone else\" and has 2 lemmata: ['ghostwriter', 'ghost']\n",
"Synset('ghost.n.03') means \"the visible disembodied soul of a dead person\" and has 1 lemmata: ['ghost']\n",
"Synset('touch.n.03') means \"a suggestion of some quality\" and has 3 lemmata: ['touch', 'trace', 'ghost']\n",
"Synset('ghost.v.01') means \"move like a ghost\" and has 1 lemmata: ['ghost']\n",
"Synset('haunt.v.02') means \"haunt like a ghost; pursue\" and has 3 lemmata: ['haunt', 'obsess', 'ghost']\n",
"Synset('ghost.v.03') means \"write for someone else\" and has 2 lemmata: ['ghost', 'ghostwrite']\n"
]
}
],
"source": [
"synsets = wn.synsets('ghost')\n",
"for synset in synsets:\n",
" lemmata = synset.lemmas()\n",
" print(str(synset) + ' means \"' + synset.definition() + '\" and has ' + str(len(lemmata)) + ' lemmata: ' + \\\n",
" str([lemma.name() for lemma in lemmata]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the `name()` method to get just the lexical part of the lemma. (We took a lazy way out and used the plural “lemmata” even after the value 1, although “has 1 lemmata” should really read “has 1 lemma”. If we intended to use this code to produce final output for end-users, we’d include additional code to control for that difference.)\n",
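"\n",
"Controlling for that difference takes only a conditional expression; a minimal sketch:\n",
"\n",
"```python\n",
"def lemma_phrase(n):\n",
"    # pick the grammatically correct noun for the count\n",
"    return str(n) + ' ' + ('lemma' if n == 1 else 'lemmata')\n",
"\n",
"print(lemma_phrase(1))  # 1 lemma\n",
"print(lemma_phrase(3))  # 3 lemmata\n",
"```\n",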
"\n",
"A writer or text that uses the synset 'ghost.n.01' has six lemmata available to express that meaning. What proportion of the available vocabulary does your writer or text use? \n",
"\n",
"That would be easy to calculate if the writer always used the exact form provided by the `name()` method of lemmata. For example, you might find that a particular text contains the following mappings of lemmata and word forms:\n",
"\n",
"Synset | Word form\n",
"--- | ---\n",
"ghost.n.01 | ghost\n",
"ghost.n.01 | shade\n",
"ghost.n.01 | spook\n",
"\n",
"You can count up the number of word forms associated with each synset, and because each word form corresponds to a different one of the 6 lemmata for that synset, you’ll determine correctly that the writer or text uses 50% of the available lemmata."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 6 lemmata for ghost.n.01 and they are: ['ghost', 'shade', 'spook', 'wraith', 'specter', 'spectre']\n",
"The 3 lemmata for ghost.n.01 used in the document are ['ghost', 'shade', 'spook']\n",
"The ratio of used (3) divided by available (6) = 0.5\n"
]
}
],
"source": [
"available = [lemma.name() for lemma in wn.synset('ghost.n.01').lemmas()]\n",
"print('There are ' + str(len(available)) + ' lemmata for ghost.n.01 and they are: ' + str(available))\n",
"used = ['ghost', 'shade', 'spook']\n",
"print('The 3 lemmata for ghost.n.01 used in the document are ' + str(used))\n",
"print('The ratio of used (' + str(len(used)) + ') divided by available (' + \\\n",
" str(len(available)) + ') = ' + str(len(used) / len(available)))"