# Goals

We often need to search corpora for examples of a particular construction. Some constructions are fairly easy to find because they are marked by a single word that we can look up directly (e.g. aspect 咗, negation 冇). Other expressions are trickier because they involve multiple elements, which may or may not be contiguous:

- 連 … 都
- 如果 … 就
- 係 唔係


Let's install the PyCantonese library first.


In [None]:
!pip install pycantonese==3.2.4

Let's find all the examples of 連... 都. 

If we were looking for examples of 連... 都 by eye, we would probably skim for 連 first, then check whether there is a 都 nearby.

We may also add that 連 should come before 都 to avoid noise in our data. 

In [None]:
import pycantonese as pc 

corpus = pc.hkcancor()
utterances = corpus.words(by_utterances=True)  # by_utterances=True keeps the utterance boundaries

for u in utterances:
  if '連' in u and '都' in u and u.index('連') < u.index('都'):
    print(u)

If we look at the first 2 results, they do not seem to be good examples for the 連...都 construction. We might want to limit the distance between 連 and 都 to avoid false positives.  

In [None]:
# Update to specify distance 

import pycantonese as pc 

corpus = pc.hkcancor()
utterances = corpus.words(by_utterances=True)  # by_utterances=True keeps the utterance boundaries

window_size = 5 # Here we can specify a window_size to avoid false positives

for u in utterances:
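  # 0 < ... keeps the requirement that 連 comes before 都; a negative distance would otherwise pass the test
  if '連' in u and '都' in u and 0 < u.index('都') - u.index('連') < window_size: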
    print(u)
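
The window condition can be checked on a few hand-made token lists before running it on the whole corpus. The sketch below is illustrative only: the toy utterances and the helper name `within_window` are invented here and are not taken from HKCanCor or PyCantonese. It keeps the ordering requirement explicit, because a bare `distance < window_size` test is also true when the distance is negative, that is, when 都 precedes 連.

```python
def within_window(utterance, begin_word, end_word, window_size):
    """Return True if the first end_word follows the first begin_word
    by fewer than window_size positions (hypothetical helper)."""
    if begin_word not in utterance or end_word not in utterance:
        return False
    distance = utterance.index(end_word) - utterance.index(begin_word)
    return 0 < distance < window_size

# Invented utterances, already split into words.
toy_utterances = [
    ['連', '佢', '都', '唔', '知'],                                   # distance 2
    ['我', '都', '唔', '知', '連', '佢', '去', '咗'],                  # distance -3
    ['連', '住', '落', '咗', '三', '日', '雨', '我', '都', '冇', '出街'],  # distance 8
]

matches = [u for u in toy_utterances if within_window(u, '連', '都', 5)]
for u in matches:
    print(' '.join(u))
```

Only the first toy utterance passes with a window of 5; the third one would need a wider window, and the second is rejected at any window size because of the word order.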

# Defining a custom function 
We are pretty sure we will reuse this idea. 

We can define a function so we don't have to copy and paste the whole thing over and over again. There are 3 things we need to specify each time: 

  1. The beginning word, given as a string
  2. The end word, also a string
  3. The window size, which should be an integer 

In [None]:
def find_mwe(text, begin_word, end_word, window_size):
  # Find multiword expressions. Takes 4 arguments: the text to look up (a list of
  # utterances), the beginning word, the end word, and the search window size.
  for u in text:
    # 0 < ... keeps only utterances where begin_word comes before end_word
    if begin_word in u and end_word in u and 0 < u.index(end_word) - u.index(begin_word) < window_size:
      print(u)

In [None]:
# We have used the name 'utterances' for our text 
# Beginning word: '連'
# End word: '都' 
# Window size: 5 

find_mwe(utterances,'連','都',5) 
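
`list.index` returns only the first position of a word, so `find_mwe` compares the first 連 with the first 都 and can miss a later pair in the same utterance. A minimal sketch that compares every pair of positions is given below; the name `find_mwe_all` and the toy utterance are assumptions made for illustration, not part of PyCantonese.

```python
def find_mwe_all(text, begin_word, end_word, window_size):
    """Return utterances in which some begin_word is followed by some
    end_word fewer than window_size positions later (hypothetical variant)."""
    results = []
    for u in text:
        begin_positions = [i for i, w in enumerate(u) if w == begin_word]
        end_positions = [i for i, w in enumerate(u) if w == end_word]
        # Accept the utterance if any (begin, end) pair fits the window.
        if any(0 < e - b < window_size
               for b in begin_positions for e in end_positions):
            results.append(u)
    return results

# Invented example: the first 都 precedes 連, but a second 都 follows it.
toy = [['我', '都', '覺得', '連', '佢', '都', '唔', '識']]
print(find_mwe_all(toy, '連', '都', 5))
```

Comparing all pairs is slower than two `index` calls, but utterances are short, so the extra cost is unlikely to matter for a corpus of this kind.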

## Exercise 1

Let's try the same thing with some other constructions. Here are some candidates, but feel free to come up with your own search queries! Note that the appropriate window size depends largely on the construction. 
 
1. 如果 ... 就 
2. 咪 ... 囖

3. 係唔係 
  - We may want to make sure '係' and '唔係' are next to each other. We can specify the index to ensure that. In the current segmentation, '係' is one word and '唔係' is another. 
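
One way to read "next to each other" is that the two words occupy consecutive positions in the utterance. The sketch below checks this by pairing every word with its successor; the helper name `has_adjacent` and the toy utterances are invented for illustration and are not part of PyCantonese.

```python
def has_adjacent(utterance, first_word, second_word):
    """Return True if second_word immediately follows first_word
    somewhere in the utterance (hypothetical helper)."""
    return any(a == first_word and b == second_word
               for a, b in zip(utterance, utterance[1:]))

# Invented utterances, already split into words.
toy_utterances = [
    ['你', '係', '唔係', '學生', '啊'],          # 係 directly followed by 唔係
    ['佢', '係', '學生', ',', '唔係', '老師'],    # both present, but not adjacent
]

for u in toy_utterances:
    print(has_adjacent(u, '係', '唔係'), ' '.join(u))
```

Because every consecutive pair is inspected, this check does not depend on which occurrence of 係 comes first in the utterance.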

In [None]:
# Use '如果' and '就' for the keywords, set the window size to 10  
# Don't forget to put '如果' and '就' in quotes!  

find_mwe(utterances, , , )

In [None]:
# Use '咪' and '囖' for the keywords, set the window size to 5 
# Don't forget to put '咪' and '囖' in quotes!  

find_mwe(utterances, , , )


## Exercise 2

Since there are over 300 tokens of '係 唔係', it might be useful to store the results in a list rather than print them. 

We can update the `find_mwe` function as `find_mwe_as_list`. 

In [None]:
def find_mwe_as_list(text, begin_word, end_word, window_size):
  # Find multiword expressions and return them as a list. Takes 4 arguments:
  # the text to look up, the beginning word, the end word, and the search window size.
  results = []
  for u in text:
    # 0 < ... keeps only utterances where begin_word comes before end_word
    if begin_word in u and end_word in u and 0 < u.index(end_word) - u.index(begin_word) < window_size:
      results.append(u)
  return results

In [None]:
# Modify the line below to get '係' '唔係', which are contiguous! 
hai6m4hai6 = find_mwe_as_list(utterances, , , )

len(hai6m4hai6)

# Variable Keywords

Some constructions involve repetition of lexical items, rather than specific characters. 

For example, reduplications like 睇一睇 cannot be found directly through keyword search. 

Note that there are cases like '起 一 起壇' or '褪 一 褪後', where the elements before and after 一 are not identical. 
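
The difference between identity and containment can be made concrete with trigrams built from consecutive words. The following sketch uses plain `zip` on invented token lists (the segmentation shown is an assumption for illustration): testing with `==` keeps only exact repetition, whereas `in` also accepts cases where the first element is contained in the third.

```python
def trigrams(utterance):
    """Return an iterator over every run of three consecutive words."""
    return zip(utterance, utterance[1:], utterance[2:])

# Invented utterances, already split into words.
toy_utterances = [
    ['你', '睇', '一', '睇', '先'],
    ['褪', '一', '褪後', '啦'],
]

identical, contained = [], []
for u in toy_utterances:
    for first, middle, last in trigrams(u):
        if middle != '一':
            continue
        if first == last:   # exact repetition only
            identical.append((first, middle, last))
        if first in last:   # repetition or containment
            contained.append((first, middle, last))

print(identical)
print(contained)
```

Here 睇一睇 satisfies both tests, while 褪一褪後 is only caught by the containment test.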


In [None]:
!pip install --upgrade nskipgrams # To work with ngrams (i.e. strings of multiple words), we can use the `nskipgrams` package

Requirement already up-to-date: nskipgrams in /usr/local/lib/python3.7/dist-packages (0.3.0)


In [None]:
import nskipgrams

utterances = corpus.words(by_utterances=True)

results = []

for u in utterances:
  for trigram in nskipgrams.ngrams_from_seq(u, n=3):
    if "一" == trigram[1] and trigram[0] in trigram[2]:
      results.append(trigram)

len(results)

33

In [None]:
#Let's look at some results 
results

NB: There are false positives!

- 佢 啲 signal 係 **一 Group 一 Group** 𡃉 . (from HKCanCor ``FC-025_v.txt``, line 277)
- 講真 揾 唔 到 工 哩 啲 嘢 , 都 係 即係 **- 一 -** 係 - 點 講 啊 ? (from HKCanCor ``FC-105_v2.txt``, line 370)
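
One possible way to screen out these two kinds of false positive is sketched below. The helper name `find_x_one_x` and both filters are heuristics assumed here for illustration; they are not part of the original search and could also discard genuine examples, so the output still needs manual checking. The sketch walks through each utterance by position so that the word before the trigram is available.

```python
def find_x_one_x(utterance):
    """Return (X, 一, X...) trigrams, skipping two known kinds of noise
    (hypothetical helper; the filters are heuristics)."""
    hits = []
    for i in range(len(utterance) - 2):
        first, middle, last = utterance[i:i + 3]
        if middle != '一' or first not in last:
            continue
        # Skip punctuation-only elements such as '-'.
        if not any(ch.isalnum() for ch in first):
            continue
        # Skip the '一 X 一 X' pattern (e.g. 一 Group 一 Group).
        if i > 0 and utterance[i - 1] == '一':
            continue
        hits.append((first, middle, last))
    return hits

# The two false positives quoted above, plus one invented genuine example.
group_example = ['佢', '啲', 'signal', '係', '一', 'Group', '一', 'Group', '𡃉', '.']
dash_example = ['都', '係', '即係', '-', '一', '-', '係', '-', '點', '講', '啊', '?']
genuine_example = ['你', '睇', '一', '睇', '先']

for u in (group_example, dash_example, genuine_example):
    print(find_x_one_x(u))
```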




## Exercise 3

Using the same approach, we can identify more complex strings like 'V 來 V 去'. Modify the code below to find 'V 來 V 去'. 

In [None]:
results = []

for u in utterances:
  # The 'V 來 V 去' combination should form a 4-gram, not 3. We can name it 'fourgram' instead of 'trigram'.  
  for trigram in nskipgrams.ngrams_from_seq(u, n=3):
    # We want to specify that (1) 來 is the second element, (2) 去 is the fourth element, 
    # and (3) the first and the third elements are identical
    if "一" == trigram[1] and trigram[0] in trigram[2]:
      results.append(trigram)

len(results)

In [None]:
results

## Exercise 4

- Feel free to play with other constructions involving reduplicative elements. 

- Since different constructions might be subject to very different restrictions, defining one dedicated function for all of them might not be as efficient as adapting the loop each time. 

  1. AA哋 (e.g. 肥肥哋)
  2. V下V下 (e.g. 爬下爬下)
  3. A-not-A (e.g. 會唔會, 可唔可以)

In [None]:
# 哋
results = []

for u in utterances:
  for fourgram in nskipgrams.ngrams_from_seq(u, n=3): #Try changing n to 4 here! 
    #Modify the following line
    if "" == fourgram[2] and fourgram[0] in fourgram[1]:
      results.append(fourgram)

print(len(results))
results

In [None]:
# V 下 V 下

In [None]:
# A 唔 A

# Epilogue

- These constructions show that linguistic knowledge of each construction is vital when writing the search code (e.g. '褪一褪後' is possible but *'褪後一褪' isn't). 

- In other cases, we want to specify whether two elements are identical or one element is found within another one (e.g. 會唔會, 可唔可以, *可唔可). 

- Widening the window is often helpful in exploration. 