# Vocab Analysis 
## Section 2: Prepare the Data

# Todo
- ~~create dedicated github project repository~~ https://github.com/avidrucker/vocab-analysis
- break up notebook into 5 sections/partitions as per "DataScienceProcessTips.pdf" **in_progress**
- save out csv after each section, label "section_#_output", import into the following section
    - ~~section 2 notes~~
    - ~~section 2 cards~~
    - section 2 revlogs
    - section 2 note_card_combo
        - binary only
        - non-binary only
        - full combo
- consolidate all paper todos, critique advice, etc. into this document
- upload this document to the dedicated github project repository
- import in Section 1 from personal laptop where question has been identified
- import in review (revlog) data (see model info here: https://github.com/ankidroid/Anki-Android/wiki/Database-Structure )
- combine revlogs w/ combo dataframe to tabulate review time, review count, lapse count, for both per card and per note
- add todos from email reminder (see email inbox)
- clearly name card, note, and revlog dataframes as numbered steps **in_progress**
    - cards
    - ~~notes~~
    - revlogs
- suffix "final" to card, note, and revlog dataframes to clearly denote analysis readiness
    - cards
    - ~~notes~~
    - revlogs
- mark cells for export to sections 3, 4 **in_progress**
- export section 3 & section 4 cells to next documents
- ~~link to presentation~~: https://docs.google.com/presentation/d/1UntQmGL2uhH9POCQzyVEh2j7qviHmrZjjyHJttlvVWU/edit?usp=sharing
- create new graphs (use broad survey & quick correlation graphs from recent exercises/modules)
- update presentation with new graphs
- assign (in section 2) JLPT "N" levels to each word w/ a JLPT "N" tag
- fix date conversions to occur before dataframe merge, not after **in_progress**
- re-export Anki collection from PC into project, then unzip & rerun with this notebook
- export all utility functions to be used in other notebooks **in_progress**
- Remove sentences, questions & phrases to prevent skewing of data & to hone in on vocabulary trends first. The above can be added back, or analyzed separately, after. (remove sentences, questions & phrases, idioms too)
- ~~remove words with 0 reps from primary graphing, as they dilute study consequence correlations/readings/analyses~~
- Before making graphs & charts, prepare a dataframe with only numerical data (binary, non-binary, and (?) both)
    - ie. remove: 'Yomi1', 'Translation','Translation2', 'Translation3', 'AlternateForms', 'PartOfSpeech', 'Sound', 'Sound2', 'Sound3', 'Examples', 'ExamplesAudio', 'AtoQ','AtoQaudio', 'AtoQkana', 'AtoQtranslation', 'QandApicture','answerPicture', 'Meaning1', 'SimilarWords', 'RelatedWords','Breakdown1', 'Comparison', 'Usage', 'Prompt1', 'Prompt2','KakuMCD', 'IuMCD', 'ExtraMemo', 'Yomi2', 'Meaning2', 'Breakdown2','Picture1', 'Picture2', 'Picture3', 'Picture4', 'HinshiMarker','Hint', 'Term2', 'ArabicNumeral', 'CounterKanji', 'Mnemonic','SameSoundWords', 'Yomi3', 'gChap', 'gBook', 'semester', 'gNumber','Transliteration', 'SoloLookCards', 'TagOverflow', 'blank1','blank2',
- add 'commonword' tag to tags from 'PartOfSpeech' column
- fix ordering (1) tag removal, renaming & confirmation (tag frequency), then (2) dataframe merges, then (3) addition of binary columns

for next time:
- we want to understand the words conceptually: abstract vs concrete, verbs vs nouns vs adjectives
- conduct data entry to add concrete boolean for each note
- inspect mod as a marker for 'freshness'
- notes: logistic regression: classification/categorization

Further Ideas:
- generate power as work (reps) over time
- use lapses to calculate efficiency per word
- generate stress as ratio of lapses to reps, compare with ease

### 1. Import Libraries

In [277]:
import pandas as pd
import numpy as np
import sqlite3
import json
from datetime import datetime, timedelta, date
import time

### 2. Import Data

In [382]:
location = "datasets/collection.anki2"
cnx = sqlite3.connect(location) # create sql file connection

In [383]:
# TDD backbone assertion to confirm a function call returns the desired result
def assertEquals(actual, expected, desc):
    assert(actual==expected), desc + " result: " + str(actual) + ", expected: " + str(expected)
    return "OK"

### A word about how the data was stored

Anki, the open-source spaced repetition software (and app and service), saves a student's data in a few places. The "Notes" are the raw information used to make cards (fields of vocab data, meta tags, trivia facts, images, audio, etc.). The "Cards" are the items actually studied, and they hold the study overview data, such as study date-times, repetitions (reviews), intervals (how long a card is expected to be remembered), and lapses (forgets and the interval resets that follow). Data concerning the entire collection is stored in a table cryptically named "col" (short for collection). Lastly, the "RevLog" contains the detailed study data for each individual repetition (study datetime, card studied, etc.). This document was critical to piecing together the puzzle: https://github.com/ankidroid/Anki-Android/wiki/Database-Structure
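To make the layout described above concrete, here is a minimal sketch that lists the tables in a SQLite file. It runs against a toy in-memory database rather than the real `collection.anki2`; only the four table names mirror Anki's schema, and `cnx_demo` and `list_tables` are hypothetical names introduced for illustration.

```python
import sqlite3

# Toy in-memory stand-in for collection.anki2 (assumption: only the table
# names are realistic; the single 'id' column is a placeholder).
cnx_demo = sqlite3.connect(":memory:")
for table in ("col", "notes", "cards", "revlog"):
    cnx_demo.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY)")


def list_tables(connection):
    """Return the names of all tables in a SQLite database, sorted by name."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


print(list_tables(cnx_demo))
```

Running the same helper against the real connection (`cnx`) should be a quick way to confirm which tables are available before querying them.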

### 3. Extract Deck Creation Date

In [384]:
df_c = pd.read_sql_query("SELECT * FROM col", cnx)
crt = df_c['crt'][0] # save collection creation date (in epoch time)
pd_crt = pd.to_datetime(crt, unit = 's')
print(pd_crt)

assertEquals(str(pd_crt), "2013-01-08 09:00:00", "Collection Creation Date")

2013-01-08 09:00:00


'OK'

### 4. Extract field names to label columns

In [385]:
field_names = []
for row_index, blob in df_c['models'].items():
    for model_id, data in json.loads(blob).items():
        field_names += list(map(lambda fld: fld['name'], data['flds']))
field_names.append('Tags')
expected_names = ['Term', 'Yomi1', 'Translation', 'Translation2', 'Translation3', 'AlternateForms', 'PartOfSpeech', 'Sound', 'Sound2', 'Sound3', 'Examples', 'ExamplesAudio', 'AtoQ', 'AtoQaudio', 'AtoQkana', 'AtoQtranslation', 'QandApicture', 'answerPicture', 'Meaning1', 'SimilarWords', 'RelatedWords', 'Breakdown1', 'Comparison', 'Usage', 'Prompt1', 'Prompt2', 'KakuMCD', 'IuMCD', 'ExtraMemo', 'Yomi2', 'Meaning2', 'Breakdown2', 'Picture1', 'Picture2', 'Picture3', 'Picture4', 'HinshiMarker', 'Hint', 'Term2', 'ArabicNumeral', 'CounterKanji', 'Mnemonic', 'SameSoundWords', 'Yomi3', 'gChap', 'gBook', 'semester', 'gNumber', 'Transliteration', 'SoloLookCards', 'TagOverflow', 'blank1', 'blank2', 'Tags']

In [386]:
assertEquals(field_names, expected_names, "Field Names")

'OK'
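The `models` column holds one JSON object keyed by model ID, and each model carries a `flds` list whose entries have a `name`. The sketch below walks a tiny, made-up blob of that shape; the model ID, the note type name, and the helper `field_names_from_models` are assumptions for illustration, not values from the real collection.

```python
import json

# Hypothetical, heavily trimmed 'models' blob: one note type with three fields.
models_blob = json.dumps({
    "1000000000000": {
        "name": "ExampleNoteType",
        "flds": [{"name": "Term"}, {"name": "Yomi1"}, {"name": "Translation"}],
    }
})


def field_names_from_models(blob):
    """Collect field names from every model in the JSON blob, in stored order."""
    names = []
    for model in json.loads(blob).values():
        names.extend(fld["name"] for fld in model["flds"])
    return names


print(field_names_from_models(models_blob))
```

If a collection contains more than one note type, this (like the cell above) concatenates all of their fields, so the resulting list only lines up with `flds` when a single note type is in use.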

### 5. Import card study data into data frame "df_cards"

In [387]:
# Step 5: take in card study data from the Anki collection
df_cards = pd.read_sql_query("SELECT * FROM cards", cnx)
assertEquals(df_cards.shape[0],19315,"Rows")#6386, 21979, 19363
assertEquals(df_cards.shape[1],18,"Columns")

'OK'

### 6. Confirm that card data model matches expected format

In [388]:
expected_columns_1 = ['id', 'nid', 'did', 'ord', 'mod', 'usn', 'type', 'queue', 'due', 'ivl', 'factor',
 'reps', 'lapses', 'left', 'odue', 'odid', 'flags', 'data']

def lists_equal(a,b):
    return (a == b).all()

assertEquals(lists_equal(df_cards.columns.values, expected_columns_1), True, "Card Columns Import")

'OK'
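One caveat with `(a == b).all()`: it relies on an element-wise comparison, which only behaves well when both sequences have the same length. When a column is missing or extra, the comparison can fail in ways that depend on the NumPy version instead of simply returning `False`. A possible alternative, sketched here with the hypothetical name `columns_match`, is to compare plain lists:

```python
import numpy as np


def columns_match(actual, expected):
    """True only when both sequences hold the same labels in the same order."""
    return list(actual) == list(expected)


same = columns_match(np.array(["id", "nid", "ord"]), ["id", "nid", "ord"])
shorter = columns_match(np.array(["id", "nid"]), ["id", "nid", "ord"])
print(same, shorter)
```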

### 7. Shallow check for duplicates (matching rows)

In [389]:
def has_dupes(df_in):
    dupe = df_in.duplicated()
    return df_in.loc[dupe].shape[0] != 0

In [390]:
assertEquals(has_dupes(df_cards), False, "Duplicates Not Found")

'OK'

### 8. Remove unneeded card dataframe columns

In [391]:
def print_line_break():
    print("-"*75)

In [392]:
def print_before_after(b, a, t=""):
    if t != "":
        print_line_break()
        print(t)
    print_line_break()
    print("Before: " + str(b))
    print_line_break()
    print("After: " + str(a))
    print_line_break()

In [393]:
df_cards_001_less_cols = df_cards.copy()
df_cards_001_less_cols = df_cards_001_less_cols.drop(['did','usn','type','mod','left','odue','odid','flags','data'],axis=1)
expected_columns_2 = ['id', 'nid', 'ord', 'queue', 'due', 'ivl', 'factor', 'reps','lapses']

print_before_after(df_cards.columns.values, df_cards_001_less_cols.columns.values,"Card Columns:")

assertEquals(lists_equal(df_cards_001_less_cols.columns.values, expected_columns_2), True, "Card Model Slimmed")

---------------------------------------------------------------------------
Card Columns:
---------------------------------------------------------------------------
Before: ['id' 'nid' 'did' 'ord' 'mod' 'usn' 'type' 'queue' 'due' 'ivl' 'factor'
 'reps' 'lapses' 'left' 'odue' 'odid' 'flags' 'data']
---------------------------------------------------------------------------
After: ['id' 'nid' 'ord' 'queue' 'due' 'ivl' 'factor' 'reps' 'lapses']
---------------------------------------------------------------------------


'OK'

### 9. Import notes (words) into data frame "df_notes"

In [394]:
# let's take in the 'notes' table, and explicitly save the note id ("nid") 
df_notes = pd.read_sql_query("SELECT * FROM notes", cnx)
df_notes = df_notes.rename(columns={'id':'nid'})

In [395]:
assertEquals(df_notes.shape[0],8384,"Rows") # 2791, 9784, 8403
assertEquals(df_notes.shape[1],11,"Columns")

'OK'

### 10. Remove (drop) unneeded fields (columns)

In [396]:
df_notes_old_col_vals = df_notes.columns.values
df_notes = df_notes.drop(['guid','mid','usn','sfld','csum','flags','data'],axis=1)
#print(df_notes.columns.values)
print_before_after(df_notes_old_col_vals, df_notes.columns.values)

---------------------------------------------------------------------------
Before: ['nid' 'guid' 'mid' 'mod' 'usn' 'tags' 'flds' 'sfld' 'csum' 'flags' 'data']
---------------------------------------------------------------------------
After: ['nid' 'mod' 'tags' 'flds']
---------------------------------------------------------------------------


### 11. Split "fields" column into multiple, assign field names, drop combined col

In [397]:
def time_it(func, *args, **kwargs):
    start = time.time()
    func(*args, **kwargs)
    end = time.time()
    # https://stackoverflow.com/questions/8885663/how-to-format-a-floating-number-to-fixed-width-in-python
    print("{:.0f}".format((end - start)*1000) + " milliseconds")

In [398]:
# Anki joins a note's fields with the 0x1f (unit separator) character
for i in range(0,len(expected_names)-1):
    df_notes[expected_names[i]] = df_notes.flds.str.split('\x1f').str.get(i)
assertEquals('flds' in df_notes.columns.values, True, "'flds' Column Found")
df_notes = df_notes.drop(['flds'],axis=1)
assertEquals('flds' not in df_notes.columns.values, True, "'flds' Column Not Found")
print(df_notes.columns.values)

['nid' 'mod' 'tags' 'Term' 'Yomi1' 'Translation' 'Translation2'
 'Translation3' 'AlternateForms' 'PartOfSpeech' 'Sound' 'Sound2' 'Sound3'
 'Examples' 'ExamplesAudio' 'AtoQ' 'AtoQaudio' 'AtoQkana'
 'AtoQtranslation' 'QandApicture' 'answerPicture' 'Meaning1'
 'SimilarWords' 'RelatedWords' 'Breakdown1' 'Comparison' 'Usage' 'Prompt1'
 'Prompt2' 'KakuMCD' 'IuMCD' 'ExtraMemo' 'Yomi2' 'Meaning2' 'Breakdown2'
 'Picture1' 'Picture2' 'Picture3' 'Picture4' 'HinshiMarker' 'Hint' 'Term2'
 'ArabicNumeral' 'CounterKanji' 'Mnemonic' 'SameSoundWords' 'Yomi3'
 'gChap' 'gBook' 'semester' 'gNumber' 'Transliteration' 'SoloLookCards'
 'TagOverflow' 'blank1' 'blank2']
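As a possible alternative to splitting once per field, the whole `flds` string can be split a single time with `expand=True`. The sketch below uses toy data and assumes the fields are joined by the `\x1f` unit separator; `toy_notes`, `toy_names`, and `toy_wide` are made-up names.

```python
import pandas as pd

FIELD_SEPARATOR = "\x1f"  # unit separator between a note's fields
toy_names = ["Term", "Yomi1", "Translation"]
toy_notes = pd.DataFrame({
    "nid": [1, 2],
    "flds": [
        "郵便\x1fゆうびん\x1fmail, postal service",
        "隙間\x1fすきま\x1fgap",
    ],
})

# Split every row once; expand=True returns one column per field.
split_fields = toy_notes["flds"].str.split(FIELD_SEPARATOR, expand=True)
split_fields.columns = toy_names

# Attach the named field columns and drop the combined column.
toy_wide = pd.concat([toy_notes.drop(columns="flds"), split_fields], axis=1)
print(toy_wide)
```

Assigning `split_fields.columns` only works when the number of split pieces equals the number of names, which doubles as a check that the field list matches the data.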


### 12. Check notes for duplicates (shallow check)

In [399]:
assertEquals(has_dupes(df_notes), False, "Duplicates Not Found")

'OK'

### 13. Check for duplicates by term field in notes data frame

In [400]:
def has_dupe_terms(df_in):
    location = df_in['Term'].duplicated()
    return df_in.loc[location].shape[0] != 0

In [401]:
assertEquals(has_dupe_terms(df_notes), False, "Duplicates Found")

'OK'

### 14. Confirm that duplicates dataframe is empty (no dups exist)

In [402]:
dupe = df_notes['Term'].duplicated() #creates list of True/False values
print(df_notes[dupe].shape)
assertEquals(df_notes[dupe].shape[0], 0, "Duplicates dataframe is empty.")

(0, 56)


'OK'

### 15. Inspect an individual note by its term

In [403]:
# Postal service
def inspect_note(df_in, term):
    return df_in[df_in['Term']==term]

In [404]:
sel1 = inspect_note(df_notes,'郵便')
sel1

Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,Yomi3,gChap,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2
3282,1361674609381,1555887839,Japanese Marked abaCheckNuance addDefinition ...,郵便,ゆうびん,"mail, postal service",,,〒,"Common word, Noun",...,,,,,,,,,,


### Save Point, Commit, Bonfire (for you Souls fans)*

At the point in the data extraction where the (meta) tag information becomes available, we can clean it up both to clarify (rename poorly worded tags) and to reduce (delete unneeded tags). Since all fields are now split into their own columns, we can clean up (modify & improve) the columns too, in a two-step process: (1) fix the tags, then (2) fix the columns.

\* https://en.wikipedia.org/wiki/Souls_(series)

In [405]:
def shorten_list(takeIn, takeOut):
    temp = takeIn.lower().split() # split all the words into a list
    temp2 = [word for word in temp if word.lower() not in takeOut] # create a shorter list of words minus the take-outs
    return ' '.join(temp2) # return that shorter list as a string

In [406]:
tag_remove_list = ['japanese', 'checkpicture', 'complete', 'haspicture', 'nomemo',
                   'researched', 'aaaeditthis', 'addaudio', 'addaudio2',
                   'addmore','adjustformatting', 'hascomparison', 'hasmnemonic',
                   'customediting','wikidefinition', 'givewill','addaudionow','addprompt',
                   'checknuance','giveyaneury','hastextimage', 'marked', 'addpicture',
                   'addexampletranslation','basicnumeric', 'genkiplus', 'hasaudio',
                   'nativeaudio', 'adddefinition','addexamples', 'addjapaneseprompt',
                   'computervoice','haspoliteprefix','nongoo','customdefinition','hashint',
                   'abahipriorityfix','kaki','mcd','nobodyknows+','missingwordtype',
                   'image','duplicate', 'hasprompt', 'ninshiki','abachecknuance',
                   'hasflag','things', 'jim', 'hasunicode', 'editthis','aaahipriority',
                   'hassimpledef', 'givecodie', 'forjimmy', 'hasnativeaudio', 'givejimmy2',
                   'checkaudio', 'checkwriting', 'hasjlptlevel', 'makekaki', 'checknuance2',
                   'checkagain', 'newaudio', 'mail', 'checkexamples','elementaryschool',
                   'nvc', 'checkprompt', 'gavejimmy', 'addnativeaudio','checkreading',
                   'givecodieapril', 'activated', 'fixformatting','hasplacesuffix',
                   'hassuffix','addtranslation','addnewcardtype','addnuance','addtextimage',
                   'semicomplete', 'removeroboaudio','fixaudio','hasgramconj', 
                   'hasquestion', 'addkanji','changenotetype', 'famous', 'challenging',
                   'kuverb', 'givwill','karutapoems', 'map', 'hasvisualcomparison',
                   'picturekaki', 'jyugemu', '2018', 'type1', 'hasslang', 'apologies',
                   'month', 'definitionresearched','soundshift', 'basics1', 'tsuverb',
                   'facebook', 'uverb', 'checkfrequency', 'degree', 'hasdefinition',
                   'addtransliteration', 'dnd', 'introductions', 'adjustprompt',
                   'job', 'particle', 'services', 'mature', 'splitpictures', 
                   'egaki', 'type5k', 'intimate','extrainfo', 'irregular', 'unlisted',
                   'fromwiki', 'checkdifference','addpronunciationdiagram', 'reset',
                   'currentevents', 'doubletextimage', 'comparison', 'verbscompoundpast2',
                   'attention', 'addmemo', 'averb', 'radio','hasascii', 'fontadjusted',
                   'haspronunciation', 'borroweddefinition','alphabet', 'graphics',
                   'chiebukuro', 'duolingo', 'ateji', 'fact','type5s', 'fixpicture',
                   'politebydefault', 'objects','sensitive', 'groupword', 'addmnemonic',
                   'hasmore', 'quote', 'checkformatting','overlap', 'kotobankdef',
                   'hasrudeness', 'changedeck', 'specialformatting','yoga',
                   'hasjapaneseprompt', 'hasprefix','questionword', 'business', 
                   'postoffice', 'firstten', 'money', 'robotvoice2', 'ichidan', 'godan',
                   'weather','count', 'nodefinition', 'muverb', 'addcomparisonchart', 
                   'ruverb', 'phone', 'conjugated','haddiv','vulgar','fromkaruta',
                   'karutamanual', 'teform', 'qanda', '2019'
                  ]

### 16. Remove unneeded tags (meta-data)

In [407]:
# survey a few notes to see example tag data
df_notes.head(3)

Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,Yomi3,gChap,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2
0,1331799797110,1511481489,Japanese Marked abaCheckNuance checkNuance co...,臨機応変,りんきおうへん,adapting oneself to the requirements of the mo...,,,,"Noun, No-adjective",...,,,,,,,,,,
1,1331799797112,1511481489,Japanese complete noMemo researched wwwjdic,隙間,すきま,<div>crevice; crack; gap; opening</div>,,,,"<div>Common word, Noun</div>",...,,,,,,,,,,
2,1331799797113,1511481489,Japanese Marked abaCheckNuance checkNuance co...,苦汁,にがり,bittern; concentrated solution of salts (esp. ...,,,,Noun,...,,,,,,,,,,


In [408]:
# likely useful tags: katakana, Waseigo, Food, Phrases, casual, restaurant, travel, commonWord, noun, suruVerb

df_notes_001_less_tags = df_notes.copy() #originally "df_notes_less_tags"
df_notes_001_less_tags['tags'] = df_notes_001_less_tags['tags'].apply(lambda x: shorten_list(str(x), tag_remove_list))

print_before_after(df_notes['tags'].iloc[0], df_notes_001_less_tags['tags'].iloc[0],"Tags for " + df_notes['Term'].iloc[0])

assertEquals("Japanese" in df_notes['tags'].iloc[0].split(), True, "Contains Tag 'Japanese'")
assertEquals("Japanese" in df_notes_001_less_tags['tags'].iloc[0].split(), False, "Contains Tag 'Japanese'")

---------------------------------------------------------------------------
Tags for 臨機応変
---------------------------------------------------------------------------
Before:  Japanese Marked abaCheckNuance checkNuance complete noMemo researched wwwjdic yojijukugo 
---------------------------------------------------------------------------
After: wwwjdic yojijukugo
---------------------------------------------------------------------------


'OK'

### 17. Rename useful tags (meta-data) that were poorly named

In [409]:
# replace list (formerly named 'tag_replace_list')
tag_rename_dict = {
    'aalowfrequency':'rare checked', 'aatechnical':'technical checked', 'aaanonkaiwa':'nonconvo checked',
    'wwwjdic':'fromdict', 'expression':'phrases', 'numberonly':'number',
    'grammarpoint':'grammar', 'jisho':'fromdict', 'pointingword':'directions',
    'geometry':'math technical', 'genki':'textbook', 'jpn202':'college',
    'jpn201':'college', 'jpn101':'college', 'jpn102':'college', 'kentei':'fromexam',
    'proficiencytest':'fromexam', 'bodypart':'body', '5kyuu':'fromexam',
    'linguisticreference':'technical', 'conversation':'convo',
    'fromconvo':'convo', 'culturepoint':'culture', 'checkednuance':'checked',
    'checkedpictures':'checked', 'checkednuance':'checked', 'medical':'technical',
    'anatomy':'body', 'places':'place', 'animals':'animal',
    'newspaperterm':'fromnewspaper', 'checkedreading':'checked',
    'abbreviation':'abbr','firstsemester':'semester1','onecharacter':'len1',
    'sentence':'phrase', 'verbs':'verb', 'convook':'checked convo','inuse':'checked',
    'nuancechecked':'checked','insects':'animal insect','sightseeing':'travel',
    'accessories':'clothing', 'grammarsuffix':'suffix', 'oceanlife':'animal ocean',
    'science':'technical', 'written':'nonconvo', 'notrare':'checked',
    'aajoke':'silly', 'intonationcompare':'hassimilar', 'ij':'textbook',
    'goodcard':'inspect','aahilevel':'challenging inspect', 'ijvocab':'textbook',
    'cliothing':'clothing','unused':'nonconvo rare checked',
    'aaunused':'nonconvo rare checked', 'samesound':'hassame','animals':'animal',
    'dictionary':'fromdict','usuallywritteninkana':'kana',
    'abVeryRare':'rare checked', 'yojijukugo':'rare idiom', 'abcasual':'casual checked convo',
    'literaryform':'nonconvo', 'onomatopoeiclike':'onomatopoeic','kenjo':'humble',
    'colors':'color', 'forest':'nature','flower':'plant nature', 'aaok':'checked',
    'questions': 'question', 'adverbs':'adverb','book2':'textbook',
    'book1':'textbook','proficiencytest':'fromtest','animalscomplete':'animal',
    'sonkei':'respectful','eating':'food','fruit':'food','neverused':'nonconvo rare',
    'domainspecific':'technical','seaons':'season','seasons':'season',
    'prefecture':'place','plantpart':'plant', "hakataben":"dialect", "fish":"animal fish",
    "transitive":"transitive verb", "intransitive":"intransitive verb",
    "aaunecessary":"nonconvo checked", "vegetables":"vegetable food plant",
    "counters":"counter", "senmonyougo":"technical", "countries":"country place",
    "date":"datesandtime", "rarelyused":"rare", "aaakaiwa":"convo checked", "cool":"inspect",
    "investigate":"inspect"
}

#todo: investigate:
#editformatting,  datesandtime, linguistics, reference, adult, adjustpicture, checkpronunciation, addhint, challenging, inspect
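Because `replace_list` lowercases every tag before looking it up, any dictionary key that contains an uppercase letter (such as `'abVeryRare'` above) can never match. A small check along these lines may help catch such keys; `unreachable_keys` and `sample_renames` are hypothetical names used only for this sketch.

```python
def unreachable_keys(rename_dict):
    """Return keys that a lowercased tag can never equal (keys not already lowercase)."""
    return [key for key in rename_dict if key != key.lower()]


# Toy dictionary with one reachable and one unreachable key.
sample_renames = {"wwwjdic": "fromdict", "abVeryRare": "rare checked"}
print(unreachable_keys(sample_renames))
```

Note that this cannot detect keys repeated inside the dictionary literal itself: Python silently keeps the last value for a repeated key, so those have to be spotted by reading the literal.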

In [410]:
def replace_list(takeIn, replaceDict):
    temp = takeIn.lower().split()
    temp2 = []
    for word in temp:
        if word in replaceDict:
            temp2.append(replaceDict.get(word)) # if the word exists in the dictionary, replace it
        else:
            temp2.append(word) # if the word doesn't exist in the dictionary, leave it alone
    return ' '.join(temp2) # return the renamed tags as a string

# inspect further:
# multiwriting, multimeaning, multipicture, multiterm, multireading, mergeterms, checkpronunciation, customterm,
# goodcard, personalized, silly, addjlptlevel, checkpronunciation, mergeterms, customterm, transportation vs travel

# categorize: iadjective, naajective, verb, counter, commonword, suruverb, pronoun, question, phrases, kuverb, godan, ichidan, intransitive, transitive, noun, adverbialnoun

In [411]:
df_notes_002_better_tags = df_notes_001_less_tags.copy() # originally "df_notes_better_tags"
df_notes_002_better_tags['tags'] = df_notes_002_better_tags['tags'].apply(lambda x: replace_list(str(x), tag_rename_dict))

print_before_after(df_notes_001_less_tags['tags'].iloc[0], df_notes_002_better_tags['tags'].iloc[0], "Tags for " + df_notes_002_better_tags['Term'].iloc[0])

assertEquals("wwwjdic" in df_notes_001_less_tags['tags'].iloc[0].split(), True, "Contains Tag 'wwwjdic'")
assertEquals("wwwjdic" in df_notes_002_better_tags['tags'].iloc[0].split(), False, "Contains Tag 'wwwjdic'")
assertEquals("fromdict" in df_notes_001_less_tags['tags'].iloc[0].split(), False, "Contains Tag 'fromdict'")
assertEquals("fromdict" in df_notes_002_better_tags['tags'].iloc[0].split(), True, "Contains Tag 'fromdict'")

---------------------------------------------------------------------------
Tags for 臨機応変
---------------------------------------------------------------------------
Before: wwwjdic yojijukugo
---------------------------------------------------------------------------
After: fromdict rare idiom
---------------------------------------------------------------------------


'OK'

In [412]:
df_notes_002_better_tags['tags'].value_counts()[:20]

                                       1622
fromdict                                796
fromtest textbook                       600
textbook textbook                       453
college textbook textbook               241
verb                                    200
fromdict verb                           144
fromexam                                126
len1                                    122
hiragana college textbook textbook      107
counter numeric                          97
numeric                                  81
addsimilar                               81
fromdict media                           72
college textbook semester1 textbook      71
fromexam textbook                        65
fromdict lyrics                          63
convo                                    61
n3 fromdict transitive verb verb         58
college textbook textbook katakana       54
Name: tags, dtype: int64

We can attempt to inspect which tags are most common, in which combinations, and which words would be ideal candidates for additional metadata. However, **our tags are still lumped together** at this point. There is also reason to believe that **some tags show up multiple times in the same tag string** (note "textbook textbook" in the counts above). To count tag frequency properly, duplicates must first be found and removed. The occurrence (frequency) of each tag can then be summed across the tags column.
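One way to count individual tags rather than whole tag strings is to split each string into a list, explode the lists into one row per tag, and then count. This is a sketch on toy data; `tag_frequency` and its `unique_per_note` switch are assumptions introduced here, not part of the notebook.

```python
import pandas as pd

toy_tags = pd.Series(["fromexam color fromexam len1", "fromdict verb", ""])


def tag_frequency(tag_strings, unique_per_note=True):
    """Count how many times each tag occurs across a Series of tag strings."""
    tag_lists = tag_strings.str.split()
    if unique_per_note:
        # Count a tag at most once per note, keeping first-seen order.
        tag_lists = tag_lists.apply(lambda tags: list(dict.fromkeys(tags)))
    # explode() yields NaN for notes with no tags, so drop those rows.
    return tag_lists.explode().dropna().value_counts()


print(tag_frequency(toy_tags))
```

With `unique_per_note=False` the repeated `fromexam` is counted twice, which shows how in-string duplicates would inflate the totals.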

### 18. Inspect a note that you suspect has tag duplication

In [413]:
def inspect_note_by_id(df_in, nid):
    return df_in[df_in['nid']==nid]

In [414]:
# confirm that a particular note has tag duplicates
# crimson note id: 1369286386384
note_id_1 = 1369286386384
assertEquals(inspect_note_by_id(df_notes_002_better_tags,note_id_1).tags.values[0],"fromexam color fromexam len1","Four tags total with two duplicates exist")

'OK'

In [415]:
# example of item with tag duplication
sel2 = inspect_note_by_id(df_notes_002_better_tags,note_id_1)
sel2

Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,Yomi3,gChap,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2
3847,1369286386384,1511481489,fromexam color fromexam len1,紅,くれない,<div>deep red; crimson</div><div><br /></div><...,,,,"<div>Common word, Noun</div><div>Common word, ...",...,,,,,,,,,,


### 19. Remove duplicate tags (convert tag strings > lists > sets > strings)

In [416]:
# Converts a tag string to a list to a set back to a string (this removes the duplicates)
def remove_dupes(t):
    temp = list(set(t.lower().split()))
    return ' '.join(temp) # return as string

In [417]:
df_notes_003_tags_no_dups = df_notes_002_better_tags.copy()
df_notes_003_tags_no_dups['tags'] = df_notes_003_tags_no_dups['tags'].apply(lambda x: remove_dupes(str(x)))

In [418]:
# determines if an individual tag substring exists in a larger tags list string
def tag_exists(tags, tag):
    return 1 if tag in tags.split() else 0

In [419]:
print(inspect_note_by_id(df_notes_003_tags_no_dups,note_id_1).tags.values[0])
assertEquals(tag_exists(inspect_note_by_id(df_notes_003_tags_no_dups,note_id_1).tags.values[0],"len1"), 1, "tag 'len1' remains")
assertEquals(tag_exists(inspect_note_by_id(df_notes_003_tags_no_dups,note_id_1).tags.values[0],"fromexam"), 1, "tag 'fromexam' remains")

color len1 fromexam


'OK'
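Converting to a `set` removes duplicates but does not keep the tags in their original order, and the resulting order can differ between runs. If a stable order matters (for example, when comparing exported CSVs), an order-preserving variant is possible with `dict.fromkeys`. The function name `remove_dupes_keep_order` is hypothetical.

```python
def remove_dupes_keep_order(tag_string):
    """Drop repeated tags while keeping the first occurrence of each, in order."""
    # dict keys are unique and remember insertion order.
    return " ".join(dict.fromkeys(tag_string.lower().split()))


print(remove_dupes_keep_order("fromexam color fromexam len1"))
```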

It appears we have most, if not all, of the data we need to start. The format of the dates though is not yet human readable. Let's fix that.

### 20. Convert (& preserve) note ID to note creation date

In [420]:
#dueNum = 782 # this represents days from collection creation date
#crt = 1357635600 # this represents the collection creation date #todo: query dynamically from database
#print("mid 'model id': " + time.ctime(int("1768161991"))) # 1 day = 86400 seconds

df_notes_004_with_date = df_notes_003_tags_no_dups.copy()
df_notes_004_with_date['NoteCreated']= pd.to_datetime(df_notes_004_with_date['nid'],unit='ms')
df_notes_004_with_date['NoteCreated'] = df_notes_004_with_date['NoteCreated'].dt.date
df_notes_004_with_date.head()

print_before_after(df_notes_003_tags_no_dups['nid'].iloc[0], df_notes_004_with_date['NoteCreated'].iloc[0],"Term " + df_notes_004_with_date['Term'].iloc[0])

assertEquals(df_notes_004_with_date['nid'].iloc[0], 1331799797110, "Note ID is in Epoch Units")
assertEquals(str(df_notes_004_with_date['NoteCreated'].iloc[0]), "2012-03-15", "Note ID is in datetime date format year-month-day")

---------------------------------------------------------------------------
Term 臨機応変
---------------------------------------------------------------------------
Before: 1331799797110
---------------------------------------------------------------------------
After: 2012-03-15
---------------------------------------------------------------------------


'OK'
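The unit matters in these conversions: note IDs are read as epoch milliseconds, while the collection creation time (`crt`) and `mod` are read as epoch seconds. A minimal sketch using two values that appear in this notebook:

```python
import pandas as pd

note_id_ms = 1331799797110  # first note's ID, epoch milliseconds
crt_s = 1357635600          # collection creation time, epoch seconds

note_created = pd.to_datetime(note_id_ms, unit="ms")
collection_created = pd.to_datetime(crt_s, unit="s")

print(note_created.date())
print(collection_created)
```

Passing a millisecond value with `unit="s"` would point far outside the supported date range, so a failed conversion is often a sign that the wrong unit was chosen.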

### 21. Generate Note Last Modified Date from "Mod" ID

In [421]:
df_notes_005_last_modified = df_notes_004_with_date.copy()
df_notes_005_last_modified['mod'] = pd.to_datetime(df_notes_005_last_modified['mod'],unit='s')
df_notes_005_last_modified['mod'] = df_notes_005_last_modified['mod'].dt.date

assertEquals(str(df_notes_005_last_modified['mod'].iloc[0]), "2017-11-23", "Note last modified is in datetime date format year-month-day")

'OK'

### 22. Create df_notes_final data frame for export & further usage

In [422]:
df_notes_final = df_notes_005_last_modified.copy()

### 23. Export df_notes_section_2_final

In [423]:
df_notes_final.to_csv('datasets/notes_section_2_final.csv')

### 24. Generate Card Creation Date from Card ID

In [424]:
df_cards_002_created_date = df_cards_001_less_cols.copy()
df_cards_002_created_date['CardCreated'] = pd.to_datetime(df_cards_002_created_date['id'],unit='ms')
df_cards_002_created_date['CardCreated'] = df_cards_002_created_date['CardCreated'].dt.date

assertEquals(str(df_cards_002_created_date['CardCreated'].iloc[0]), "2012-03-15", "Card ID is in datetime date format year-month-day")

'OK'

In [425]:
#queue           integer not null,
#      -- -3=sched buried, -2=user buried, -1=suspended,
#      -- 0=new, 1=learning, 2=due (as for type)

df_cards_003_no_new = df_cards_002_created_date.copy()
df_cards_003_no_new = df_cards_003_no_new[df_cards_003_no_new['queue']!=0] # remove cards marked as new
df_cards_003_no_new = df_cards_003_no_new[df_cards_003_no_new['reps']!=0] # remove cards that have not been reviewed
df_cards_003_no_new = df_cards_003_no_new[df_cards_003_no_new['queue']!=-1] # remove cards that are currently suspended
# https://stackoverflow.com/questions/18196203/how-to-conditionally-update-dataframe-column-in-pandas
# cards "in learning" store 'due' as a large timestamp rather than a day offset; reset those to 0
# todo: update w/ last studied date from revlog
# todo: comment this line out once you have updated the collection import
df_cards_003_no_new.loc[df_cards_003_no_new['due'] > 10000, 'due'] = 0

print_before_after(df_cards_002_created_date.shape[0], df_cards_003_no_new.shape[0],"Card Rows:")

df_cards_003_no_new.tail(5)

---------------------------------------------------------------------------
Card Rows:
---------------------------------------------------------------------------
Before: 19315
---------------------------------------------------------------------------
After: 8231
---------------------------------------------------------------------------


Unnamed: 0,id,nid,ord,queue,due,ivl,factor,reps,lapses
18909,2019-02-06,1549184119039,2,2,2245,10,2410,2,0
18910,2019-02-17,1550402953788,0,2,2232,1,2410,3,0
18911,2019-02-17,1550402953788,2,2,2236,1,2210,5,1
18912,2019-02-17,1550403040864,0,2,2232,1,2410,2,0
18913,2019-02-17,1550403040864,2,2,2240,5,2410,3,0


In [426]:
# todo: remove this cell once the newest collection has been imported
# confirm that the three cards "in learning" have their due dates reset back to 0 for date transformation
sel3 = df_cards_003_no_new[df_cards_003_no_new['due'] == 0]
sel3.head()

Unnamed: 0,id,nid,ord,queue,due,ivl,factor,reps,lapses
3366,2013-03-11,1362961413265,0,1,0,1,1300,18,1
11245,2017-01-21,1483483650784,4,1,0,1,2160,8,1
11274,2017-02-26,1346220179900,4,1,0,1,1760,12,3


### 25. Generate Due Date from Due Value

In [427]:
df_cards_004_due_date = df_cards_003_no_new.copy()
df_cards_004_due_date['due'] = pd_crt + df_cards_004_due_date['due'].map(timedelta)
df_cards_004_due_date['due'] = df_cards_004_due_date['due'].dt.date

assertEquals(str(df_cards_004_due_date['due'].iloc[0]), "2015-03-08", "Card due date is in datetime date format year-month-day")

'OK'
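The same day-offset arithmetic can also be written without `map(timedelta)` by using `pd.to_timedelta`, which converts the whole column at once. This sketch assumes, as the cell above does, that `due` counts days since the collection creation date; the toy values are made up.

```python
import pandas as pd

collection_created = pd.Timestamp("2013-01-08 09:00:00")
toy_due = pd.Series([0, 1, 365])  # days since collection creation

# Convert the day offsets in one step, add them to the creation timestamp,
# then keep only the calendar date.
due_dates = (collection_created + pd.to_timedelta(toy_due, unit="D")).dt.date
print(due_dates.tolist())
```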

### 26. Create df_cards_final data frame for export & further usage

In [428]:
df_cards_final = df_cards_004_due_date.copy()

### 27. Export df_cards_section_2_final

In [429]:
df_cards_final.to_csv('datasets/cards_section_2_final.csv')

### 28. Merge card & note data frames to conduct cross analysis

In [444]:
# now that we have note id's for all the words, we can
# join together these separate dataframes
df_combo = pd.merge(df_notes_final, df_cards_final, on='nid')
print(df_combo.shape)
df_combo.head()

(8231, 65)


Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,blank2,NoteCreated,id,ord,queue,due,ivl,factor,reps,lapses
0,1331799797110,2017-11-23,idiom fromdict rare,臨機応変,りんきおうへん,adapting oneself to the requirements of the mo...,,,,"Noun, No-adjective",...,,2012-03-15,2012-03-15,0,2,2015-03-08,65,1680,10,1
1,1331799797112,2017-11-23,fromdict,隙間,すきま,<div>crevice; crack; gap; opening</div>,,,,"<div>Common word, Noun</div>",...,,2012-03-15,2012-03-15,0,2,2015-03-03,149,2080,8,1
2,1331799797114,2017-11-23,fromdict,移籍,いせき,<div>changing household registry; transfer (e....,,,,"<div>Common word, Noun, Suru verb</div>",...,,2012-03-15,2012-03-15,0,2,2015-02-04,99,1980,7,0
3,1331799797117,2017-11-23,fromdict verb,吊るす,つるす,to hang,,,,,...,,2012-03-15,2012-03-15,0,2,2015-03-17,143,2130,6,1
4,1331799797118,2017-11-23,convo checked fromdict,和やか,なごやか,"harmonious, peaceful",,,,,...,,2012-03-15,2012-03-15,0,2,2015-02-06,74,1880,15,3


It appears that card types are rendered as numbers, which makes them less human-readable. We will fix this. Additionally, our card model has a number of columns (fields) that contain no values at all. These can be removed for the data analysis.

In [447]:
def is_blank (s):
    return not (s and s.strip())

In [448]:
def get_frame_of_cards_by_term(df, t):
    return df.loc[df['Term']==t]

In [449]:
# let's look at a small slice of the data (all cards for a single term) to get a
# broad overview of the dataset and more quickly isolate candidate columns for removal
s = get_frame_of_cards_by_term(df_combo, '発明')
s.head()

Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,blank2,NoteCreated,id,ord,queue,due,ivl,factor,reps,lapses
2893,1354094556789,2018-12-03,n3 fromtest noun commonword textbook suruverb,発明,はつめい,"<span style=""""><div>invention</div></span>",,,,"Common word, Noun, Suru verb",...,,2012-11-28,2012-11-28,0,2,2015-06-30,80,1300,73,9
2894,1354094556789,2018-12-03,n3 fromtest noun commonword textbook suruverb,発明,はつめい,"<span style=""""><div>invention</div></span>",,,,"Common word, Noun, Suru verb",...,,2012-11-28,2013-06-21,4,2,2021-10-24,1056,1300,20,0


### 29. Determine which columns (fields) are unused & can be safely removed

In [453]:
col_names = df_combo.columns.values
#print(is_blank(df_combo['Translation2'].iloc[0])) # see that this cell for this row is indeed blank

row_cnt = df_combo.shape[0] # number of rows in df_combo

# https://stackoverflow.com/questions/49677060/pandas-count-empty-strings-in-a-column
empty_strings = pd.DataFrame(df_combo.values == '',columns=col_names) # DataFrame of booleans: True wherever a cell holds an empty string
temp_dict = (empty_strings.sum()).to_dict()  # column name -> number of empty strings in that column
removal_candidates = []
for col, empty_cnt in temp_dict.items():
    if empty_cnt == row_cnt: # every row of this column is empty
        removal_candidates.append(col)
print("Removal candidates:", removal_candidates)

Removal candidates: ['Sound3', 'AtoQ', 'AtoQaudio', 'AtoQkana', 'AtoQtranslation', 'QandApicture', 'answerPicture', 'blank1', 'blank2']
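
The same check can be written more compactly. The sketch below is an alternative, not the notebook's method: it uses `DataFrame.eq` and `all` on a made-up three-column frame to find columns whose every cell is an empty string.

```python
import pandas as pd

# Toy frame (hypothetical values): 'blank1' is empty in every row
df = pd.DataFrame({
    "Term": ["a", "b"],
    "blank1": ["", ""],
    "Sound": ["x", ""],
})

# True for a column only if every cell in it equals ''
all_empty = df.eq("").all()
removal_candidates = all_empty[all_empty].index.tolist()

print("Removal candidates:", removal_candidates)
```

A column like `Sound`, which is empty in only some rows, is kept.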


### 30. Trim unneeded (empty) columns from combo data frame

In [455]:
df_combo_001_less_cols = df_combo.copy()

removal_list = list(removal_candidates + ['queue'])

df_combo_001_less_cols = df_combo_001_less_cols.drop(removal_list,axis=1)

print_before_after(df_combo.shape, df_combo_001_less_cols.shape)
print_before_after(df_combo.columns.values, df_combo_001_less_cols.columns.values)

---------------------------------------------------------------------------
Before: (8231, 65)
---------------------------------------------------------------------------
After: (8231, 55)
---------------------------------------------------------------------------
---------------------------------------------------------------------------
Before: ['nid' 'mod' 'tags' 'Term' 'Yomi1' 'Translation' 'Translation2'
 'Translation3' 'AlternateForms' 'PartOfSpeech' 'Sound' 'Sound2' 'Sound3'
 'Examples' 'ExamplesAudio' 'AtoQ' 'AtoQaudio' 'AtoQkana'
 'AtoQtranslation' 'QandApicture' 'answerPicture' 'Meaning1'
 'SimilarWords' 'RelatedWords' 'Breakdown1' 'Comparison' 'Usage' 'Prompt1'
 'Prompt2' 'KakuMCD' 'IuMCD' 'ExtraMemo' 'Yomi2' 'Meaning2' 'Breakdown2'
 'Picture1' 'Picture2' 'Picture3' 'Picture4' 'HinshiMarker' 'Hint' 'Term2'
 'ArabicNumeral' 'CounterKanji' 'Mnemonic' 'SameSoundWords' 'Yomi3'
 'gChap' 'gBook' 'semester' 'gNumber' 'Transliteration' 'SoloLookCards'
 'TagOverflow' 'blank1' 'blank2

### 31. Label card types by their names

In [461]:
# ord stands for 'ordinal' : identifies which of the card templates it corresponds to
print(df_combo_001_less_cols['ord'].value_counts()) # these are the card vectors

# since only two cards in our dataset belong to card vector 11, & the card vectors
# aren't named/labeled, let's remove the outliers & add the names
df_combo_002_types_labeled = df_combo_001_less_cols.copy()
df_combo_002_types_labeled = df_combo_002_types_labeled.drop(df_combo_002_types_labeled[df_combo_002_types_labeled['ord'] == 11].index)

df_combo_002_types_labeled['ord'].value_counts() # the check shall pass

# now, to map the names onto the card vectors
df_combo_002_types_labeled['ord'] = df_combo_002_types_labeled['ord'].map({0:'read', 2:'recall',4:'look',7:'listen'})
df_combo_002_types_labeled['ord'].value_counts()

0     6842
4     1109
7      267
2       11
11       2
Name: ord, dtype: int64


read      6842
look      1109
listen     267
recall      11
Name: ord, dtype: int64
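
One reason to drop the unnamed template before mapping: `Series.map` with a dict returns `NaN` for any key that is missing from the dict. The sketch below shows this on made-up `ord` values and filters to the known templates first.

```python
import pandas as pd

ord_names = {0: "read", 2: "recall", 4: "look", 7: "listen"}
cards = pd.DataFrame({"ord": [0, 4, 7, 2, 11]})  # hypothetical values

# Mapping directly leaves NaN for the unmapped template 11
unmapped_count = cards["ord"].map(ord_names).isna().sum()

# Keep only rows whose template has a name, then map
known = cards[cards["ord"].isin(list(ord_names))].copy()
known["ord"] = known["ord"].map(ord_names)

print(unmapped_count)
print(known["ord"].value_counts())
```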

### 32. Create binary exists/not columns based on presence of a given tag

In [462]:
def add_column_by_tag(df, tag):
    df[tag] = df['tags'].apply(lambda x: tag_exists(str(x), tag))

In [463]:
df_combo_003_with_binary = df_combo_002_types_labeled.copy()
inspect_list = ["commonword", "clothing", "animal", "body", "food", "place",
                "textbook", "college", "fromdict", "fromexam",
                "len1", "n1", "n2", "n3", "n4", "n5"
               ]
for item in inspect_list:
    add_column_by_tag(df_combo_003_with_binary, item)

In [480]:
df_combo_003_with_binary.dtypes.value_counts()

object    50
int64     21
dtype: int64
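
`tag_exists` is defined elsewhere in this notebook and is not shown in this section. As a hedged illustration only, the hypothetical helper `has_tag` below matches whole, space-separated tags. This matters for the list above because a plain substring test would find `n1` inside `len1`.

```python
import pandas as pd

def has_tag(tags: str, tag: str) -> int:
    """Return 1 if `tag` is one of the space-separated tags, else 0.
    Hypothetical helper; the notebook's own tag_exists may differ."""
    return int(tag in tags.split())

# Made-up tag strings in the same space-separated style as the 'tags' column
df = pd.DataFrame({"tags": ["n3 fromtest noun", "len1 fromdict", ""]})

for tag in ["n1", "n3", "len1"]:
    df[tag] = df["tags"].apply(lambda t: has_tag(str(t), tag))

print(df)
```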

### 33. Create interval quintile sections for visualization purposes

In [None]:
# qcut: Quantile-based discretization function. Discretize variable into equal-sized buckets
# based on rank or based on sample quantiles. For example 1000 values for 10 quantiles would
# produce a Categorical object indicating quantile membership for each data point.
# http://www.datasciencemadesimple.com/quantile-decile-rank-column-pandas-python-2/
df_combo_003_with_binary['ivl_q'] = pd.qcut(df_combo_003_with_binary['ivl'],5,labels=False)
df_combo_003_with_binary.head()
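
A small sketch of what `pd.qcut` returns with `labels=False`, on made-up interval values. The second part is an assumption about a likely edge case rather than something observed in this notebook: when many values tie (for example many cards with an interval of 1), several quantile edges can coincide, and `duplicates="drop"` merges them at the cost of fewer buckets.

```python
import pandas as pd

# Ten evenly spread values split into 5 equal-sized buckets (0..4)
ivl = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
ivl_q = pd.qcut(ivl, 5, labels=False)
print(ivl_q.tolist())

# Heavily tied values: several quantile edges coincide, so qcut raises
# a ValueError unless the duplicate edges are dropped
tied = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 5, 10])
tied_q = pd.qcut(tied, 5, labels=False, duplicates="drop")
print(tied_q.nunique())
```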

Let's further refine the dataframe entries to represent which notes have (1) visual data, (2) audio data, and (3) an L1 ("first language", English in this case) translation. We can represent these with binary values (0 for doesn't exist, 1 for exists).

### 35. Create boolean columns for predictive models

In [None]:
# Laura calls this process "Data Enriching"
intify_list = ['hasPOS','hasVisual','hasAudio','hasMultiMeaning','hasMultiReading','hasSimilar','hasHomophone','hasAltForm']

In [None]:
# https://stackoverflow.com/questions/17383094/how-can-i-map-true-false-to-1-0-in-a-pandas-dataframe
df_combo_003_with_binary['hasPOS'] = df_combo_003_with_binary['PartOfSpeech']!="" #todo: expand upon this, by tagify
df_combo_003_with_binary['hasVisual'] = df_combo_003_with_binary['Picture1']!=""
df_combo_003_with_binary['hasAudio'] = df_combo_003_with_binary['Sound']!=""
df_combo_003_with_binary['hasMultiMeaning'] = df_combo_003_with_binary[['Translation2','Translation3','Meaning2']].ne("").any(axis=1) # True if any of the extra translation/meaning fields is filled
df_combo_003_with_binary['hasMultiReading'] = df_combo_003_with_binary['Yomi2']!="" # todo: inspect & incorporate venn diagram: https://commons.wikimedia.org/wiki/File:Homograph_homophone_venn_diagram.png
df_combo_003_with_binary['hasSimilar'] = df_combo_003_with_binary['SimilarWords']!=""
df_combo_003_with_binary['hasHomophone'] = df_combo_003_with_binary['SameSoundWords']!="" # write function, detect homophones
df_combo_003_with_binary['hasAltForm'] = df_combo_003_with_binary[['Term2','AlternateForms']].ne("").any(axis=1) # True if either field is filled
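
When a flag should depend on several columns at once, note that a Python expression such as `'A' and 'B'` evaluates to `'B'`, so indexing a dataframe with it selects a single column. The sketch below, on made-up data, contrasts that with selecting a list of columns and reducing across them with `any(axis=1)`.

```python
import pandas as pd

# Toy frame (hypothetical values)
df = pd.DataFrame({
    "Translation2": ["x", "", ""],
    "Meaning2": ["", "", "y"],
})

# `and` between two non-empty strings returns the last one
selected = "Translation2" and "Meaning2"

# This therefore inspects only the 'Meaning2' column
only_last = df["Translation2" and "Meaning2"] != ""

# Inspect both columns: True if at least one of them is non-empty
any_filled = df[["Translation2", "Meaning2"]].ne("").any(axis=1)

print(selected)
print(only_last.tolist())
print(any_filled.tolist())
```

Use `all(axis=1)` instead of `any(axis=1)` if every listed field must be filled.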

### 36. Drop non-numerical columns from combo data frame

In [None]:
df_combo_004_less_cols = df_combo_003_with_binary.copy()
df_combo_004_less_cols = df_combo_004_less_cols.drop(['Examples','ExamplesAudio',
                            'Meaning1','RelatedWords','Breakdown1','Comparison',
                           'Usage','Prompt1','Prompt2','KakuMCD','IuMCD','ExtraMemo',
                           'Breakdown2','Picture2','Picture3','Picture4','Mnemonic',
                            'Yomi3','gChap','gBook','semester','gNumber','ArabicNumeral',
                            'CounterKanji','SoloLookCards','HinshiMarker','Hint',
                            'mod','Transliteration'],axis=1)

In [None]:
# casts the given column (e.g. a boolean flag column) to type int in place, use with caution
def intify_bools(df, col):
    df[col] = df[col].astype(int)

### 37. Ensure numerical/boolean types are encoded properly

In [None]:
df_combo_004_less_cols.dtypes.value_counts()

In [None]:
for item in intify_list:
    intify_bools(df_combo_004_less_cols,item)

In [None]:
df_combo_004_less_cols.dtypes.value_counts()
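
As an alternative sketch, assuming the flag columns have dtype `bool`, they can be picked up automatically with `select_dtypes` instead of being listed by hand. The frame below is made-up illustration data.

```python
import pandas as pd

# Toy frame (hypothetical values) with two boolean flags and one text column
df = pd.DataFrame({
    "Term": ["a", "b"],
    "hasAudio": [True, False],
    "hasVisual": [False, False],
})

# Find every boolean column and cast it to int (True -> 1, False -> 0)
bool_cols = df.select_dtypes(include="bool").columns
df = df.astype({col: int for col in bool_cols})

print(df.dtypes.value_counts())
```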

### 38. Further reduce columns not in use 

In [None]:
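# several of these columns (the empty ones) were already dropped in step 30,
# so ignore labels that are no longer present instead of raising a KeyError
df_combo_004_less_cols = df_combo_004_less_cols.drop(['Picture1','Sound','Sound2','Sound3','AtoQ','AtoQaudio',
                              'AtoQkana','AtoQtranslation','QandApicture','answerPicture',
                              'TagOverflow','Translation2','blank1','blank2',
                              'Meaning2','Yomi2','Term2','SameSoundWords','hasPOS',
                             'SimilarWords','AlternateForms','Translation3'],axis=1,errors='ignore')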

df_combo_004_less_cols.head(35)[30:]

#selection2 = df_binary.loc[df_binary['hasMultiMeaning']==1]
#selection2.head()

### 39. Count syllables & characters for each term

In [None]:
df_combo_005_with_len = df_combo_004_less_cols.copy()

df_combo_005_with_len['TermLen'] = df_combo_005_with_len['Term'].str.len()
df_combo_005_with_len['Syllables'] = df_combo_005_with_len['Yomi1'].str.len()
df_combo_005_with_len.loc[df_combo_005_with_len['Syllables'] == 0, 'Syllables'] = df_combo_005_with_len['TermLen']

bins = [0,1,2,4,8,128]
labels = ["[1]","[2]","[3:4]","[5:8]","[9: ]"]
# https://stackoverflow.com/questions/45273731/binning-column-with-python-pandas
df_combo_005_with_len['TermLenGroup'] = pd.cut(df_combo_005_with_len['TermLen'], bins=bins, labels=labels)

#df.loc[df['Grades'] <= 77, 'Grades'] = 100
# https://stackoverflow.com/questions/42815768/pandas-adding-column-with-the-length-of-other-column-as-value
#df_binary2.head(35)[30:]
df_combo_005_with_len.tail(20)[:10]
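
To make the bin edges concrete, the sketch below applies the same `bins` and `labels` to a few made-up lengths. `pd.cut` uses right-closed intervals by default, so the edges `[0,1,2,4,8,128]` mean (0,1], (1,2], (2,4], (4,8] and (8,128]. A length of 0 falls outside the first interval and becomes `NaN`.

```python
import pandas as pd

bins = [0, 1, 2, 4, 8, 128]
labels = ["[1]", "[2]", "[3:4]", "[5:8]", "[9: ]"]

# Hypothetical term lengths covering every bucket
lengths = pd.Series([1, 2, 3, 4, 5, 8, 9])
groups = pd.cut(lengths, bins=bins, labels=labels)
print(groups.astype(str).tolist())

# A zero length is not covered by (0, 1] and is left unassigned
zero_group = pd.cut(pd.Series([0]), bins=bins, labels=labels)
print(zero_group.isna().tolist())
```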

In [None]:
# inspect the many syllable entries
df_many_syl = df_combo_005_with_len.copy()
many_syl = df_many_syl['Syllables'] > 20
df_many_syl.loc[many_syl]

# Further Information

The Spaced Repetition Software ("SRS") used for the study of Japanese by student "A" is an open source program called Anki. The algorithm it uses to "graduate" (also referred to as "maturing") study items (called cards), so that subsequent reviews are spaced further into the future, is referred to as SM-2. [More information on the SM-2 algorithm used in Anki is available in the Anki manual.](https://apps.ankiweb.net/docs/manual.html#what-algorithm)
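
For orientation, here is a minimal sketch of one review step following the published SuperMemo SM-2 description. It is not Anki's implementation, which modifies SM-2 in several ways, and the function name and argument layout are assumptions made for illustration. The ease value corresponds loosely to the `factor` column above, which appears to hold the ease multiplied by 1000 (for example 1300 for the minimum of 1.3).

```python
def sm2_step(quality, reps, interval, ease):
    """One review update per the published SM-2 description (not Anki's variant).

    quality:  self-graded answer quality, 0 (blackout) to 5 (perfect)
    reps:     number of consecutive successful reviews so far
    interval: current interval in days
    ease:     current easiness factor (E-Factor), never below 1.3
    Returns the new (reps, interval, ease).
    """
    if quality < 3:
        # Failed review: restart the repetition sequence, ease unchanged
        return 0, 1, ease
    if reps == 0:
        interval = 1
    elif reps == 1:
        interval = 6
    else:
        interval = round(interval * ease)
    # Adjust ease by answer quality and clamp it at the 1.3 minimum
    ease = max(1.3, ease + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02)))
    return reps + 1, interval, ease


# Example: a card with two successful reviews, a 6-day interval and ease 2.5
print(sm2_step(5, 2, 6, 2.5))
```

A failed review resets the repetition count, which corresponds roughly to what the `lapses` column counts per card.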

In [None]:
# logistic regression: classification/categorization