The Tatoeba platform will be used to source Japanese sentences that have English translations. The WaniKani platform will be used to filter those sentences by the kanji they contain, based on user preferences. Both platforms expose RESTful APIs for data retrieval.

https://sakubun.xyz/known_kanji offers a very good tool designed for the same purpose. I think this project could benefit from a more user-friendly design that does not require multiple steps to extract sentences. The main interest lies in developing middleware that can be easily adapted to different data formats from different APIs.

# Interacting with WaniKani API
We first have to interact with the WaniKani API and extract user data. We will use the Python requests package to explore the correct requests for this context. Later versions should use a code base better optimized for it. The requests below use my personal authorization.

In [8]:
import requests
# Import the API key from a file; strip the trailing newline so the header value stays valid
with open('api_key.txt') as f:
    headers = {
        'Authorization': f'Bearer {f.read().strip()}',
        'Wanikani-Revision': '20170710',
        'If-Modified-Since': 'Fri, 11 Nov 2011 11:11:11 GMT',
    }

### Singular Resources
The following are **singular resources**:
* assignment
* kanji
* level_progression
* radical
* reset
* review_statistic
* review
* spaced_repetition_system
* study_material
* user
* vocabulary

Each subject ID corresponds to a specific resource. We are interested in radicals, kanji, and vocabulary. With the ID, we can retrieve the resource and deserialize it.

In [9]:
# Resources by ID.
resource='subjects'
resource_id='/2505'
base_url='https://api.wanikani.com/v2/'
endpt=f'{base_url}{resource}{resource_id}'

response = requests.get(endpt, headers=headers)
if response.status_code == 200:
    payload = response.json() # deserialized response body
else:
    raise Exception(f'{response} with endpoint {endpt}')
characters=payload['data']['characters']

In [10]:
characters

'ふじ山'

### Retrieving Subjects from Collections

We can also extract **collections** of resources. The main source should be assignments filtered by criteria set by the user. Pagination and caching optimizations come later. Retrieve the collection and deserialize it.

In [11]:
# Filtered assignments.
base_url='https://api.wanikani.com/v2/'
resource='assignments'
resource_id='?subject_types=vocabulary'
endpt=f'{base_url}{resource}{resource_id}'

response = requests.get(endpt, headers=headers)
if response.status_code == 200:
    payload = response.json() # deserialized response body
else:
    raise Exception(f'{response} with endpoint {endpt}')

assignments=payload['data']
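
The pagination deferred above could eventually look something like the sketch below. It assumes that each collection response carries a `pages` object whose `next_url` is null on the last page, as the WaniKani API documentation describes. `iter_collection` and the injected `fetch` callable are hypothetical names introduced here, not part of the notebook.

```python
from typing import Callable, Iterator


def iter_collection(first_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every resource in a paginated collection, one page at a time.

    `fetch` takes a URL and returns the deserialized response body. It is
    injected so the loop can be exercised without network access.
    """
    url = first_url
    while url:
        page = fetch(url)
        yield from page['data']
        # Assumption: 'pages' -> 'next_url' is None (or absent) on the last page.
        url = (page.get('pages') or {}).get('next_url')
```

With the `headers` defined earlier, `fetch` could be something like `lambda url: requests.get(url, headers=headers).json()`.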

Extract the IDs of the assigned subjects from the deserialized list of assignments. We assume the assignments contain no duplicate subjects.

In [12]:
subject_ids=[a['data']['subject_id'] for a in assignments]

Make a new request, passing in the subject IDs, to retrieve specific vocabulary information. This may take some time, depending on how many IDs are passed.

In [17]:
base_url='https://api.wanikani.com/v2/'
resource='subjects'
resource_id=f'?ids={",".join(map(str,subject_ids))}' # comma-separated list of subject IDs

endpt=f'{base_url}{resource}{resource_id}'
response = requests.get(endpt, headers=headers)
if response.status_code == 200:
    payload = response.json() # deserialized response body
else:
    raise Exception(f'{response} with endpoint {endpt}')

subjects=payload['data']
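
Because the request grows with the number of IDs, one option is to split the IDs into batches and issue one request per batch. The following is a minimal sketch of that idea. The helper names `chunked` and `subject_urls` are hypothetical, and the default batch size of 100 is an arbitrary assumption, not a documented API limit.

```python
from typing import Iterator, List, Sequence


def chunked(ids: Sequence[int], size: int) -> Iterator[List[int]]:
    """Split a sequence of IDs into consecutive batches of at most `size`."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def subject_urls(ids: Sequence[int], size: int = 100,
                 base_url: str = 'https://api.wanikani.com/v2/') -> List[str]:
    """Build one subjects URL per batch of IDs."""
    return [
        f'{base_url}subjects?ids={",".join(map(str, batch))}'
        for batch in chunked(ids, size)
    ]
```

Each URL can then be requested in turn and the resulting `data` lists concatenated.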

We have now extracted a list of vocabulary words based on the assignments on WaniKani. We can pull specific characters out and use them to search other websites for sentences containing those characters.

### Pre-processing by Parts of Speech
Each vocabulary word is tagged with one or more parts of speech, each of which has different rules for downstream processing.

In [78]:
[(i,s['data']['characters'],s['data']['parts_of_speech']) for i,s in enumerate(subjects)][30:40]

[(30, '口', ['noun']),
 (31, '入り口', ['noun']),
 (32, '大きい', ['い adjective']),
 (33, '大きさ', ['noun']),
 (34, '大した', ['adjective']),
 (35, '大人', ['noun', 'な adjective', 'の adjective']),
 (36, '女', ['noun']),
 (37, '山', ['noun']),
 (38, 'ふじ山', ['proper noun']),
 (39, '川', ['noun'])]

* **Verbs** - Conjugation modifies the word destructively within a sentence, so we need to reduce it to its root without destroying meaning. For now we will simply remove the last character of the vocabulary word, but this is by no means robust.
    * Transitive, intransitive
    * Godan, ichidan
* **Adjectives** - Changes are sometimes additive.
* **Nouns** - No major modifications.
* **Adverbs** - No major modifications.
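
The rules above can be sketched as a small helper that produces the string to search for. This is only an illustration under stated assumptions: it assumes verb labels in `parts_of_speech` end with the word "verb" (for example "godan verb"), and it adds a guard that only a trailing hiragana character is dropped, so that a word tagged as a する verb but written entirely in kanji is left intact. The names `is_verb` and `search_stem` are hypothetical.

```python
from typing import Sequence


def is_verb(parts_of_speech: Sequence[str]) -> bool:
    """Return True if any label is a verb label.

    Compares the last word of each label so that 'adverb' is not
    mistaken for a verb.
    """
    return any(pos.split()[-1] == 'verb' for pos in parts_of_speech if pos.strip())


def is_hiragana(ch: str) -> bool:
    """Return True if the character falls in the Hiragana Unicode block."""
    return '\u3040' <= ch <= '\u309f'


def search_stem(characters: str, parts_of_speech: Sequence[str]) -> str:
    """Return the string to search for, dropping a verb's final kana.

    Nouns, adjectives and adverbs are returned unchanged.
    """
    if is_verb(parts_of_speech) and len(characters) > 1 and is_hiragana(characters[-1]):
        return characters[:-1]
    return characters
```

Applied to the `subjects` list, this would be called as `search_stem(s['data']['characters'], s['data']['parts_of_speech'])`.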

# Interacting with Tatoeba
In the interest of development speed, we will query Tatoeba directly whenever we need to generate sentences from assignments. A deployed version should host its own database of sentences. This would also benefit from caching and modification-datetime filters. Correctness is limited by what the site provides, though we can apply other checks such as requiring contributions by native speakers. We sort by fewest words to decrease the likelihood of irrelevant or unlearned vocabulary within a sentence.

Query for pagination data.

In [165]:
query='"字"' # inner quotes request an exact match
sort='words' # fewest words, can also do relevance
endpt=f'https://tatoeba.org/eng/api_v0/search?from=jpn&has_audio=&native=yes&orphans=no&query={query}&sort={sort}&sort_reverse=&tags=&to=eng&trans_filter=limit&trans_has_audio=&trans_link=&trans_orphan=&trans_to=eng&trans_unapproved=&trans_user=&unapproved=no&user=&'
response = requests.get(endpt)
if response.status_code == 200:
    payload = response.json() # deserialized response body
else:
    raise Exception(f'{response} with endpoint {endpt}')
paging=payload['paging']
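
The long search URL can also be assembled from a dictionary of parameters, which keeps the query readable and handles URL encoding of the quoted kanji. The sketch below reuses the parameter names from the URL above. It assumes that leaving out the empty parameters behaves the same as sending them empty, which has not been verified here. `TATOEBA_SEARCH` and `search_url` are hypothetical names.

```python
from typing import Optional
from urllib.parse import urlencode

TATOEBA_SEARCH = 'https://tatoeba.org/eng/api_v0/search'


def search_url(query: str, sort: str = 'words', page: Optional[int] = None) -> str:
    """Build a Tatoeba search URL for Japanese sentences with English translations."""
    params = {
        'from': 'jpn',
        'to': 'eng',
        'trans_to': 'eng',
        'trans_filter': 'limit',
        'native': 'yes',
        'orphans': 'no',
        'unapproved': 'no',
        'query': query,
        'sort': sort,
    }
    # Only add the page number when a specific page is requested.
    if page is not None:
        params['page'] = page
    return f'{TATOEBA_SEARCH}?{urlencode(params)}'
```

The same function covers both the first request and the paginated follow-ups, for example `search_url('"字"', page=2)`.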

After the first request, we can loop through the pages to retrieve all sentences.

In [167]:
from tqdm.notebook import tqdm # progress bar package import
import time # for requests delay

In [168]:
sentences=[]
nPages=paging['Sentences']['pageCount']
for page in tqdm(range(1,nPages+1),desc="Extracting..."):
    endpt=f'https://tatoeba.org/eng/api_v0/search?from=jpn&has_audio=&native=yes&orphans=no&query={query}&sort={sort}&sort_reverse=&tags=&to=eng&trans_filter=limit&trans_has_audio=&trans_link=&trans_orphan=&trans_to=eng&trans_unapproved=&trans_user=&unapproved=no&user=&page={page}'
    response = requests.get(endpt)
    if response.status_code == 200:
        payload = response.json() # deserialized response body
    else:
        raise Exception(f'{response} with endpoint {endpt}')
    sentences.extend(payload['results'])
    # Pause between requests so we don't overload the server. Most wait time is server-side, optimize by linking database during deployment.
    time.sleep(0.25)
print(f'Finished extracting {len(sentences)} sentences matching {query}.')

Extracting...:   0%|          | 0/18 [00:00<?, ?it/s]

Finished extracting 171 sentences matching "字".
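
The caching mentioned earlier could be prototyped with a simple in-memory wrapper before any database exists. This is a minimal sketch, assuming the same URL always yields the same results during a session. `make_cached_fetch` is a hypothetical helper, and `fetch` stands for any function that takes a URL and returns the deserialized response.

```python
from typing import Callable, Dict


def make_cached_fetch(fetch: Callable[[str], dict]) -> Callable[[str], dict]:
    """Wrap `fetch` so each distinct URL is only requested once."""
    cache: Dict[str, dict] = {}

    def cached(url: str) -> dict:
        # Only call the underlying fetch on a cache miss.
        if url not in cache:
            cache[url] = fetch(url)
        return cache[url]

    return cached
```

A cache like this only lives as long as the notebook session; a deployed version would still need the persistent sentence store described above.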


With the sentences retrieved, we want to check the results. We can filter out grammatical syntax and sentences that contain subjects that have not been learned. This is a good point to pause and reflect on how best to handle bulk user input, where many vocabulary words are queried at once and redundant sentences are possible. This may require a dedicated sentence database to improve efficiency.

A more efficient query would involve stripping all sentences of grammatical syntax and searching through all relevant sentences to see which ones ONLY contain subjects within the user's assignments.
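
One way to approximate the "ONLY contain" check is to work at the level of individual kanji and ignore kana and punctuation entirely. The sketch below is a simplification under that assumption: it treats a sentence as acceptable when every kanji in it belongs to a set of known kanji. The helper names are hypothetical, and only the main CJK Unified Ideographs block is treated as kanji.

```python
from typing import Iterable, List, Set


def is_kanji(ch: str) -> bool:
    """Return True for characters in the CJK Unified Ideographs block.

    Rarer extension blocks are ignored in this sketch.
    """
    return '\u4e00' <= ch <= '\u9fff'


def only_known_kanji(text: str, known: Set[str]) -> bool:
    """Return True if every kanji in `text` is in `known`."""
    return all(ch in known for ch in text if is_kanji(ch))


def filter_sentences(sentences: Iterable[dict], known: Set[str]) -> List[dict]:
    """Keep only the sentences whose kanji are all known."""
    return [s for s in sentences if only_known_kanji(s['text'], known)]
```

The `known` set could be built from the vocabulary characters, for example `{ch for v in vocab for ch in v if is_kanji(ch)}`.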

### Strip grammatical syntax
Strip select characters that are used for grammar.

In [198]:
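def linestrip(line):
    # Characters used purely for grammar; left empty until the set is decided
    chars_grammar=''
    translation_table = dict.fromkeys(map(ord, chars_grammar), None)
    # Remove every character listed in chars_grammar and return the result
    return line.translate(translation_table)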

### Scoring Sentences Based on Vocabulary Relevance
Compare a sample of sentences to a sample of the vocabulary subjects being tested, with a higher score denoting a higher proportion of sampled subjects and therefore a higher likelihood that the participant can guess the sentence's meaning. This works around some of the particle and conjugation issues if we remember to drop the endings where needed (some verb types).

In [169]:
vocab=[s['data']['characters'] for s in subjects]

In [186]:
# vocabulary words found in the fifth sentence
test=[v for v in vocab if (v in sentences[4]['text'])]
test

['女', '手', '上手', '字']

Start by scoring each sentence on how many times an assigned vocab word appears. 

In [192]:
scoring=[]
for s in sentences:
    matches=[v for v in vocab if (v in s['text'])]
    # drop single kanji that are already covered by a longer matched word
    # for now only worry about two-kanji vocab
    # this code is inelegant and slow pls phase out at some point
    for v in list(matches): # iterate over a copy, since matches is modified in the loop
        if len(v)>1:
            for char in v:
                if char in matches:
                    matches.remove(char)
    # extend flattens the pair, so scoring alternates sentence text and match list
    scoring.extend([s['text'],matches])

In [199]:
scoring[0:5]

['彼は字が下手だ。', ['下手', '字'], '渡辺が名字です。', ['名字'], '何の略字ですか。']
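
Since the flat list alternates sentence text and matches, a paired structure may be easier to sort and inspect. The sketch below keeps each sentence together with its matches and ranks sentences by the number of matched vocabulary words. It is one possible reading of the scoring idea described above, not a settled design, and `match_vocab` and `rank_sentences` are hypothetical names.

```python
from typing import Iterable, List, Sequence, Tuple


def match_vocab(text: str, vocab: Sequence[str]) -> List[str]:
    """Return the vocab found in `text`, without single characters that
    are already part of a longer matched word."""
    matches = [v for v in vocab if v in text]
    components = {ch for v in matches if len(v) > 1 for ch in v}
    return [v for v in matches if v not in components]


def rank_sentences(sentences: Iterable[dict],
                   vocab: Sequence[str]) -> List[Tuple[str, List[str]]]:
    """Pair each sentence with its matches, most matches first."""
    scored = [(s['text'], match_vocab(s['text'], vocab)) for s in sentences]
    return sorted(scored, key=lambda pair: len(pair[1]), reverse=True)
```

Counting matches is the simplest possible score; dividing by sentence length, or combining it with a known-kanji filter, are possible refinements.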