Tatoeba will be used to source Japanese sentences that have English translations. WaniKani will be used to filter those sentences by the kanji they contain, based on user preferences. Both platforms expose RESTful APIs for data retrieval.

https://sakubun.xyz/known_kanji has a very good tool designed for the same purpose. I think this project could benefit from a more user-friendly design that does not require multiple steps to extract sentences. The main interest lies in developing middleware that can easily be adapted to different data formats from different APIs.

## Interacting with WaniKani API
We first have to interact with the WaniKani API and extract the user's data. We will use the Python requests package to explore which calls are needed; later versions should use a code base better suited to the task. The requests below use my personal authorization token.

In [48]:
import requests
import json
headers = {
    'Authorization': 'Bearer <your-api-token>',  # personal token redacted
    'Wanikani-Revision': '20170710',
    'If-Modified-Since': 'Fri, 11 Nov 2011 11:11:11 GMT',
}
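
Rather than keeping the token in the notebook, the headers could be built from an environment variable. The following is a minimal sketch under that assumption: `build_headers` and the variable name `WANIKANI_API_TOKEN` are hypothetical names introduced here, and the conditional `If-Modified-Since` header is left out for brevity.

```python
import os


def build_headers(token=None):
    """Build WaniKani request headers, reading the token from the environment by default."""
    # WANIKANI_API_TOKEN is a hypothetical variable name chosen for this sketch.
    token = token or os.environ.get("WANIKANI_API_TOKEN")
    if not token:
        raise RuntimeError("Set WANIKANI_API_TOKEN or pass a token explicitly")
    return {
        "Authorization": f"Bearer {token}",
        "Wanikani-Revision": "20170710",
    }
```

Calling `build_headers()` with the variable set would then stand in for the literal `headers` dictionary.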

### Singular Resources
The following are **singular resources**:
* assignment
* kanji
* level_progression
* radical
* reset
* review_statistic
* review
* spaced_repetition_system
* study_material
* user
* vocabulary

We will retrieve resources based on the user's vocabulary. Each resource is addressed by its ID, and it seems that radicals, kanji, and vocabulary all share a single ID space as subjects.

In [113]:
# Resources by ID. Never updated.
resource='subjects'
resource_id='/2505'

In [122]:
base_url='https://api.wanikani.com/v2/'
endpt=f'{base_url}{resource}{resource_id}'
response = requests.get(endpt, headers=headers)
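if response.status_code == 200:
    # Named `payload` so the imported `json` module is not shadowed.
    payload = response.json()
else:
    raise Exception(f'{response} with endpoint {endpt}')

With the ID, we can retrieve the resource and parse the JSON response into a Python dictionary. Note that the output below comes from an assignment lookup (`assignments/193484823`) rather than the subject request above, as its `object` and `url` fields show.

In [121]:
payload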

{'id': 193484823,
 'object': 'assignment',
 'url': 'https://api.wanikani.com/v2/assignments/193484823',
 'data_updated_at': '2020-10-14T19:05:35.276847Z',
 'data': {'created_at': '2020-08-21T02:38:41.279284Z',
  'subject_id': 2506,
  'subject_type': 'vocabulary',
  'srs_stage': 8,
  'unlocked_at': '2020-08-21T02:38:41.261784Z',
  'started_at': '2020-08-21T07:46:24.100698Z',
  'passed_at': '2020-08-23T13:10:46.547967Z',
  'burned_at': None,
  'available_at': '2021-02-11T18:00:00.000000Z',
  'resurrected_at': None,
  'hidden': False}}
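
The request pattern used in this section repeats for every resource, so it may be worth wrapping. The sketch below is one possible shape, not a definitive implementation: `build_endpoint` and `fetch_json` are hypothetical helpers, and the `get` parameter exists only so the function can be exercised without network access.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.wanikani.com/v2/"


def build_endpoint(resource, resource_id=None, params=None):
    """Compose a URL such as .../subjects/2505 or .../assignments?subject_types=vocabulary."""
    url = f"{BASE_URL}{resource}"
    if resource_id is not None:
        url = f"{url}/{resource_id}"
    if params:
        url = f"{url}?{urlencode(params)}"
    return url


def fetch_json(endpoint, headers, get=None):
    """GET an endpoint and return the decoded JSON body, raising on HTTP errors."""
    if get is None:
        import requests  # imported lazily so the URL helper works without it
        get = requests.get
    response = get(endpoint, headers=headers)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.json()
```

For example, `fetch_json(build_endpoint('subjects', 2505), headers)` would correspond to the subject request shown earlier.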

### Retrieving Subjects from Collections

We can also extract **collections** of resources. The main source should be assignments filtered by criteria that the user sets. Pagination and caching optimizations will come later.

In [125]:
# Filtered assignments.
resource='assignments'
resource_id='?subject_types=vocabulary'

In [119]:
# Specific assignment, updated moderately.
resource='assignments'
resource_id='/193484823'

In [99]:
# Reviews
resource='reviews'
resource_id=''

Retrieve the collection and parse the response.

In [141]:
base_url='https://api.wanikani.com/v2/'
resource='assignments'
resource_id='?subject_types=vocabulary'

endpt=f'{base_url}{resource}{resource_id}'
response = requests.get(endpt, headers=headers)
if response.status_code == 200:
    # Named `payload` so the imported `json` module is not shadowed.
    payload = response.json()
else:
    raise Exception(f'{response} with endpoint {endpt}')

assignments = payload['data']
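
The request above reads only the first page of the collection. A minimal pagination sketch follows, assuming each collection response carries a `pages` object whose `next_url` is `None` on the last page (this should be checked against the WaniKani API reference). `fetch_all_pages` and its `fetch_page` callable are hypothetical names; `fetch_page` takes a URL and returns the parsed response.

```python
def fetch_all_pages(first_url, fetch_page):
    """Collect `data` items from every page, following pages.next_url until it is None."""
    items = []
    url = first_url
    while url:
        page = fetch_page(url)
        items.extend(page["data"])
        # A missing or empty `pages` object is treated as the last page.
        url = (page.get("pages") or {}).get("next_url")
    return items
```

With requests, `fetch_page` could be `lambda url: requests.get(url, headers=headers).json()`.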

Extract the IDs of the assigned subjects from the parsed list of assignments. We can assume that the assignments contain no duplicate subjects.

In [151]:
subject_ids = [assignment['data']['subject_id'] for assignment in assignments]
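
If the no-duplicates assumption ever fails (for example, when merging several pages or filters), the extraction can be made defensive. This is a small sketch; `extract_subject_ids` is a hypothetical helper name.

```python
def extract_subject_ids(assignments):
    """Return subject IDs in first-seen order, dropping any repeats."""
    ids = (assignment["data"]["subject_id"] for assignment in assignments)
    # dict.fromkeys keeps insertion order, so this de-duplicates without sorting.
    return list(dict.fromkeys(ids))
```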

Make a new request, passing in the subject IDs, to retrieve the specific vocabulary information. This may take some time, depending on how many IDs are passed.

In [169]:
base_url='https://api.wanikani.com/v2/'
resource='subjects'
resource_id=f'?ids={",".join(map(str,subject_ids))}'

endpt=f'{base_url}{resource}{resource_id}'
response = requests.get(endpt, headers=headers)
if response.status_code == 200:
    # Named `payload` so the imported `json` module is not shadowed.
    payload = response.json()
else:
    raise Exception(f'{response} with endpoint {endpt}')

subjects = payload['data']
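
Because all IDs go into a single query string, a long list produces a very long URL. One option is to split the IDs into batches and issue one request per batch. The sketch below assumes a batch size of 100, which is an arbitrary choice rather than a documented limit; `chunked` and `subject_queries` are hypothetical helpers.

```python
def chunked(ids, size=100):
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def subject_queries(subject_ids, size=100):
    """Build one `?ids=...` query string per batch of subject IDs."""
    return [f"?ids={','.join(map(str, batch))}" for batch in chunked(subject_ids, size)]
```

Each returned string can take the place of `resource_id` in the cell above, with the `data` lists concatenated afterwards.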

In [239]:
subjects

[{'id': 2467,
  'object': 'vocabulary',
  'url': 'https://api.wanikani.com/v2/subjects/2467',
  'data_updated_at': '2022-11-10T09:46:33.119408Z',
  'data': {'created_at': '2012-02-28T08:04:47.000000Z',
   'level': 1,
   'slug': '一',
   'hidden_at': None,
   'document_url': 'https://www.wanikani.com/vocabulary/%E4%B8%80',
   'characters': '一',
   'meanings': [{'meaning': 'One', 'primary': True, 'accepted_answer': True}],
   'auxiliary_meanings': [{'type': 'whitelist', 'meaning': '1'}],
   'readings': [{'primary': True, 'reading': 'いち', 'accepted_answer': True}],
   'parts_of_speech': ['numeral'],
   'component_subject_ids': [440],
   'meaning_mnemonic': 'As is the case with most vocab words that consist of a single kanji, this vocab word has the same meaning as the kanji it parallels, which is <vocabulary>one</vocabulary>.',
   'reading_mnemonic': "When a vocab word is all alone and has no okurigana (hiragana attached to kanji) connected to it, it usually uses the kun'yomi reading. Numb

We have now extracted a list of vocabulary words based on the user's assignments on WaniKani. We can pull the characters out of each subject and use them to search other websites for sentences containing those characters.
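
Pulling the characters out could look like the sketch below, which assumes each subject has the `object` and `data.characters` fields seen in the output above. `vocabulary_characters` is a hypothetical helper name.

```python
def vocabulary_characters(subjects):
    """Return the `characters` string of every vocabulary subject that has one."""
    return [
        subject["data"]["characters"]
        for subject in subjects
        if subject.get("object") == "vocabulary" and subject["data"].get("characters")
    ]
```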

## Interacting with Tatoeba
To avoid unnecessary work, we should query Tatoeba on demand, only when sentences need to be generated from assignments. This would also benefit from caching and from filtering by modification datetime. Correctness is limited by what the site provides, but we can apply other checks, such as restricting results to contributions by native speakers. We sort by fewest words to reduce the likelihood of irrelevant or unlearned vocabulary appearing in a sentence.
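
The caching idea can be sketched as a small in-memory cache keyed by query and page. This is only an illustration: `SentenceCache` is a hypothetical class, the `fetch` callable stands in for whatever performs the real request, and nothing here handles expiry or modification dates.

```python
class SentenceCache:
    """Remember search responses so repeated queries do not hit the API again."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable taking (query, page) and returning parsed JSON
        self._store = {}

    def get(self, query, page=1):
        key = (query, page)
        if key not in self._store:
            self._store[key] = self._fetch(query, page)
        return self._store[key]
```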

In [236]:
page=1
query='"一つ"' # quotes request an exact phrase match
sort='words' # fewest words first; 'relevance' is also available
endpt=f'https://tatoeba.org/eng/api_v0/search?from=jpn&has_audio=&native=yes&orphans=no&query={query}&sort={sort}&sort_reverse=&tags=&to=eng&trans_filter=limit&trans_has_audio=&trans_link=&trans_orphan=&trans_to=eng&trans_unapproved=&trans_user=&unapproved=no&user=&page={page}'
response = requests.get(endpt)
if response.status_code == 200:
    # Named `payload` so the imported `json` module is not shadowed.
    payload = response.json()
else:
    raise Exception(f'{response} with endpoint {endpt}')

sentences = payload['results']
paging = payload['paging']
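
The long search URL can also be assembled from a dictionary so that the query is percent-encoded explicitly. The sketch below keeps the same parameters as the URL above, including the empty ones, since it is not clear whether they can be dropped. `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://tatoeba.org/eng/api_v0/search"

# Parameters that are present but empty in the original URL.
EMPTY_KEYS = (
    "has_audio", "sort_reverse", "tags", "trans_has_audio", "trans_link",
    "trans_orphan", "trans_unapproved", "trans_user", "user",
)


def build_search_url(query, page=1, sort="words"):
    """Return the search URL for one query, with all values percent-encoded."""
    params = {
        "from": "jpn",
        "to": "eng",
        "native": "yes",
        "orphans": "no",
        "unapproved": "no",
        "trans_filter": "limit",
        "trans_to": "eng",
        "query": query,
        "sort": sort,
        "page": page,
    }
    params.update({key: "" for key in EMPTY_KEYS})
    return f"{SEARCH_URL}?{urlencode(params)}"
```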

With the sentences retrieved, we want to check the results. We can filter out grammatical syntax and drop sentences that contain subjects the user has not yet learned. This is a good point to pause and consider how best to handle bulk user input, where many vocabulary items are queried at once and the same sentence may be returned more than once. This may require a dedicated database for the sentences in order to improve efficiency.
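
Redundant sentences across queries can be removed by sentence `id`, and the English text pulled from the nested `translations` lists. The sketch assumes the result structure shown in this notebook's output, where `translations` is a list of lists of dictionaries. `merge_results` and `english_translations` are hypothetical helpers.

```python
def merge_results(result_lists):
    """Merge several `results` lists, keeping the first copy of each sentence id."""
    merged = {}
    for results in result_lists:
        for sentence in results:
            merged.setdefault(sentence["id"], sentence)
    return list(merged.values())


def english_translations(sentence):
    """Return the text of every English translation attached to a sentence."""
    return [
        translation["text"]
        for group in sentence.get("translations", [])
        for translation in group
        if translation.get("lang") == "eng"
    ]
```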

A more efficient approach would be to strip the grammatical syntax from all sentences and then search the relevant sentences for those that contain ONLY subjects within the user's assignments.
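
As a simplified sketch of that idea, the filter below works at the kanji level rather than on whole vocabulary subjects: it keeps a sentence only if every kanji in it is in a known set. It assumes kanji fall in the CJK Unified Ideographs block (U+4E00 to U+9FFF), which misses rarer characters and the iteration mark 々. `kanji_in` and `only_known_kanji` are hypothetical helpers.

```python
import re

# CJK Unified Ideographs block; a simplification, see the note above.
KANJI = re.compile(r"[\u4e00-\u9fff]")


def kanji_in(text):
    """Return the set of kanji characters found in `text`."""
    return set(KANJI.findall(text))


def only_known_kanji(sentences, known_kanji):
    """Keep sentences whose kanji are all contained in `known_kanji`."""
    return [s for s in sentences if kanji_in(s["text"]) <= known_kanji]
```

Kana and punctuation are ignored by the check, so a sentence written entirely in kana always passes.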

In [240]:
sentences

[{'id': 3507479,
  'text': '一つは青。',
  'lang': 'jpn',
  'correctness': 0,
  'script': None,
  'license': 'CC BY 2.0 FR',
  'translations': [[{'id': 2249741,
     'text': 'One is blue.',
     'lang': 'eng',
     'correctness': 0,
     'script': None,
     'transcriptions': [],
     'audios': [{'id': 115904,
       'author': 'CK',
       'attribution_url': '/en/user/profile/CK',
       'license': None}],
     'isDirect': True,
     'lang_name': 'English',
     'dir': 'ltr',
     'lang_tag': 'en'}],
   []],
  'transcriptions': [{'id': 1547731,
    'sentence_id': 3507479,
    'script': 'Hrkt',
    'text': '[一|ひと]つは[青|あお]。',
    'user_id': 81071,
    'needsReview': False,
    'modified': '2019-10-24T23:24:02+00:00',
    'user': {'username': 'Yorwba'},
    'readonly': False,
    'type': 'altscript',
    'html': '<ruby>一<rp>（</rp><rt>ひと</rt><rp>）</rp></ruby>つは<ruby>青<rp>（</rp><rt>あお</rt><rp>）</rp></ruby>。',
    'markup': None,
    'info_message': 'The furigana was last edited by Yorwba on Octo