<a name="top"></a>Vocab Analysis 
***
# Section 2: Prepare the Data
## [A. Cards](#cards)
- 2.1.1. Import card data into data frame "df_cards"
- 2.1.2. Confirm that card data model matches expected format
- 2.1.3. Shallow check for duplicates (matching rows)
- 2.1.4. Remove unneeded card dataframe columns, rename 'id' to 'cid' (card id)
- 2.1.5. Generate Card Creation Date from Card ID
- 2.1.6. Remove cards with no study data associated with them, cards that have been suspended from study
- 2.1.7. Confirm that no cards considered 'in learning' are present
- 2.1.8. Generate Due Date from Due Value
- 2.1.9. Label card types by their names, & drop outlier
- 2.1.10. Create interval quantile buckets for visualization purposes
- 2.1.11. Create dummy variables for card types
- 2.1.12. Create df_cards_008_mid_section_2 data frame for progress saving
- 2.1.13. Export df_cards_008_mid_section_2

## [B. Notes](#notes)


## [C. Combo](#combo)


### [bottom of page](#bottom)

### 2.0.1. Import libraries

In [1]:
import pandas as pd
import sqlite3
import json
from datetime import datetime, timedelta, date
import time

### 2.0.2. Import Data

In [2]:
location = "datasets/collection.anki2"
cnx = sqlite3.connect(location) # create sql file connection

In [3]:
# TDD backbone assertion to confirm a function call returns the desired result
def assertEquals(actual, expected, desc):
    assert(actual==expected), desc + " result: " + str(actual) + ", expected: " + str(expected)
    return "OK"

### 2.0.3. Extract Deck Creation Date

In [4]:
df_c = pd.read_sql_query("SELECT * FROM col", cnx)
crt = df_c['crt'][0] # save collection creation date (in epoch time)
pd_crt = pd.to_datetime(crt, unit = 's')
print(pd_crt)

assertEquals(str(pd_crt), "2013-01-08 09:00:00", "Collection Creation Date")

2013-01-08 09:00:00


'OK'
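
The `crt` value is a Unix timestamp in seconds, so the conversion can be cross-checked without pandas. A minimal sketch, assuming the value 1357635600 that appears in a commented-out line later in this notebook; the variable names are illustrative only:

```python
from datetime import datetime, timezone

# Assumed value, copied from a commented-out line later in this notebook.
crt_seconds = 1357635600

# Interpret the integer as seconds since 1970-01-01 00:00:00 UTC.
created = datetime.fromtimestamp(crt_seconds, tz=timezone.utc)
created_text = created.strftime("%Y-%m-%d %H:%M:%S")
print(created_text)
```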

### 2.0.4. Extract field names to label columns

In [5]:
field_names = []
for row_index, blob in df_c['models'].items():
    for model_id, data in json.loads(blob).items():
        field_names += list(map(lambda fld: fld['name'], data['flds']))
field_names.append('Tags')
expected_names = ['Term', 'Yomi1', 'Translation', 'Translation2', 'Translation3', 'AlternateForms',
    'PartOfSpeech', 'Sound', 'Sound2', 'Sound3', 'Examples', 'ExamplesAudio', 'AtoQ', 'AtoQaudio',
    'AtoQkana', 'AtoQtranslation', 'QandApicture', 'answerPicture', 'Meaning1', 'SimilarWords',
    'RelatedWords', 'Breakdown1', 'Comparison', 'Usage', 'Prompt1', 'Prompt2', 'KakuMCD', 'IuMCD',
    'ExtraMemo', 'Yomi2', 'Meaning2', 'Breakdown2', 'Picture1', 'Picture2', 'Picture3', 'Picture4',
    'HinshiMarker', 'Hint', 'Term2','ArabicNumeral', 'CounterKanji', 'Mnemonic', 'SameSoundWords',
    'Yomi3', 'gChap', 'gBook', 'semester', 'gNumber', 'Transliteration','SoloLookCards',
    'TagOverflow', 'blank1', 'blank2', 'Tags']

In [6]:
assertEquals(field_names, expected_names, "Field Names")

'OK'
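
For readers without the collection at hand, the shape of the `models` blob can be sketched with a tiny stand-in. The model id, model name, and two-field layout below are made up for illustration; only the nesting (model id, then `flds`, then `name`) mirrors what the loop above relies on.

```python
import json

# Made-up stand-in for one value of col.models; real blobs hold many more keys.
blob = json.dumps({
    "1342697561419": {
        "name": "Vocab",
        "flds": [{"name": "Term", "ord": 0}, {"name": "Yomi1", "ord": 1}],
    }
})

names = []
for model_id, model in json.loads(blob).items():
    # collect the field names of every model, in order
    names += [fld["name"] for fld in model["flds"]]
names.append("Tags")  # tags live in their own column, so the label is added by hand
print(names)
```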

***
- [Back to the top](#top)
- [Next section: Notes](#notes)
***

# <a name="cards"></a> Cards

### 2.1.1. Import card data into data frame "df_cards"

In [7]:
# Step 6: Take in study data from Anki collection
df_cards = pd.read_sql_query("SELECT * FROM cards", cnx)
assertEquals(df_cards.shape[0],19287,"Rows")#6386, 21979, 19363, 19314
assertEquals(df_cards.shape[1],18,"Columns")

'OK'

### 2.1.2. Confirm that card data model matches expected format

In [8]:
expected_columns_1 = ['id', 'nid', 'did', 'ord', 'mod', 'usn', 'type', 'queue', 'due', 'ivl', 'factor',
 'reps', 'lapses', 'left', 'odue', 'odid', 'flags', 'data']

def lists_equal(a,b):
    # compare as plain lists so a length mismatch yields False rather than an error
    return list(a) == list(b)

assertEquals(lists_equal(df_cards.columns.values, expected_columns_1), True, "Card Columns Import")

'OK'

### 2.1.3. Shallow check for duplicates (matching rows)

In [9]:
def has_dupes(df_in):
    dupe = df_in.duplicated()
    return df_in.loc[dupe].shape[0] != 0

In [10]:
assertEquals(has_dupes(df_cards), False, "Duplicates Not Found")

'OK'

### 2.1.4. Remove unneeded card dataframe columns, rename 'id' to 'cid' (card id)

In [11]:
def print_line_break():
    print("-"*75)

In [12]:
def print_before_after(b, a, t=""):
    if t != "":
        print_line_break()
        print(t)
    print_line_break()
    print("Before: " + str(b))
    print_line_break()
    print("After: " + str(a))
    print_line_break()

In [13]:
df_cards_001_less_cols = df_cards.copy()
df_cards_001_less_cols = df_cards_001_less_cols.drop(['did','usn','type','mod','left','odue','odid','flags','data'],axis=1)
df_cards_001_less_cols = df_cards_001_less_cols.rename(columns={'id':'cid'})
expected_columns_2 = ['cid', 'nid', 'ord', 'queue', 'due', 'ivl', 'factor', 'reps','lapses']

print_before_after(df_cards.columns.values, df_cards_001_less_cols.columns.values,"Card Columns:")

assertEquals(lists_equal(df_cards_001_less_cols.columns.values, expected_columns_2), True, "Card Model Slimmed")

---------------------------------------------------------------------------
Card Columns:
---------------------------------------------------------------------------
Before: ['id' 'nid' 'did' 'ord' 'mod' 'usn' 'type' 'queue' 'due' 'ivl' 'factor'
 'reps' 'lapses' 'left' 'odue' 'odid' 'flags' 'data']
---------------------------------------------------------------------------
After: ['cid' 'nid' 'ord' 'queue' 'due' 'ivl' 'factor' 'reps' 'lapses']
---------------------------------------------------------------------------


'OK'

### 2.1.5. Generate Card Creation Date from Card ID

In [14]:
df_cards_002_created_date = df_cards_001_less_cols.copy()
df_cards_002_created_date['CardCreated'] = pd.to_datetime(df_cards_002_created_date['cid'],unit='ms')
#df_cards_002_created_date['CardCreated'] = df_cards_002_created_date['CardCreated'].dt.date

#assertEquals(str(df_cards_002_created_date['CardCreated'].iloc[0]), "2012-03-15", "Card ID is in datetime date format year-month-day")
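
Anki card ids are creation timestamps in milliseconds since the Unix epoch, which is why `unit='ms'` works above. A quick standard-library check, using the first card id from the sample output in this notebook; the variable names are illustrative:

```python
from datetime import datetime, timedelta

cid = 1331799797110  # card id taken from the sample output in this notebook

# Add the id, read as milliseconds, to the epoch (naive datetime, treated as UTC).
card_created = datetime(1970, 1, 1) + timedelta(milliseconds=cid)
card_created_text = card_created.isoformat(sep=" ", timespec="milliseconds")
print(card_created_text)
```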

### 2.1.6. Remove cards with no study data associated with them, cards that have been suspended from study

In [15]:
#queue           integer not null,
#      -- -3=sched buried, -2=user buried, -1=suspended,
#      -- 0=new, 1=learning, 2=due (as for type)

df_cards_003_no_new = df_cards_002_created_date.copy()
df_cards_003_no_new = df_cards_003_no_new[df_cards_003_no_new['queue']!=0] # remove cards marked as new
df_cards_003_no_new = df_cards_003_no_new[df_cards_003_no_new['reps']!=0] # remove cards that have not been reviewed
df_cards_003_no_new = df_cards_003_no_new[df_cards_003_no_new['queue']!=-1] # remove cards that are currently suspended
# https://stackoverflow.com/questions/18196203/how-to-conditionally-update-dataframe-column-in-pandas
# assign 0 to any due value above 10000
# todo: update w/ last studied date from revlog
# todo: comment this line out once you have updated the collection import
df_cards_003_no_new.loc[df_cards_003_no_new['due'] > 10000, 'due'] = 0

print_before_after(df_cards_002_created_date.shape[0], df_cards_003_no_new.shape[0],"Card Rows:")

df_cards_003_no_new.tail(5)

---------------------------------------------------------------------------
Card Rows:
---------------------------------------------------------------------------
Before: 19287
---------------------------------------------------------------------------
After: 8256
---------------------------------------------------------------------------


Unnamed: 0,cid,nid,ord,queue,due,ivl,factor,reps,lapses,CardCreated
18869,1549415185907,1549184119039,2,2,2388,87,2410,3,0,2019-02-06 01:06:25.907
18870,1550403009251,1550402953788,0,2,2399,91,2410,4,0,2019-02-17 11:30:09.251
18871,1550403009269,1550402953788,2,2,2317,9,2010,9,2,2019-02-17 11:30:09.269
18872,1550403084990,1550403040864,0,2,2383,82,2410,3,0,2019-02-17 11:31:24.990
18873,1550403085003,1550403040864,2,2,2401,93,2410,4,0,2019-02-17 11:31:25.003


In [16]:
df_cards_003_no_new.dtypes

cid                     int64
nid                     int64
ord                     int64
queue                   int64
due                     int64
ivl                     int64
factor                  int64
reps                    int64
lapses                  int64
CardCreated    datetime64[ns]
dtype: object

### 2.1.7. Confirm that no cards considered 'in learning' are present

In [17]:
sel3 = df_cards_003_no_new[df_cards_003_no_new['due'] == 0]

assertEquals(sel3.shape[0],0,"There are no cards currently in 'learning'.")

'OK'

### 2.1.8. Generate Due Date from Due Value

In [18]:
df_cards_004_due_date = df_cards_003_no_new.copy()
df_cards_004_due_date['DueDate'] = pd_crt + df_cards_004_due_date['due'].map(timedelta)
#df_cards_004_due_date['DueDate'] = df_cards_004_due_date['DueDate'].dt.date

#assertEquals(str(df_cards_004_due_date['DueDate'].iloc[0]), "2015-03-08", "Card due date is in datetime date format year-month-day")

df_cards_004_due_date.head()

Unnamed: 0,cid,nid,ord,queue,due,ivl,factor,reps,lapses,CardCreated,DueDate
0,1331799797110,1331799797110,0,2,789,65,1680,10,1,2012-03-15 08:23:17.110,2015-03-08 09:00:00
1,1331799797112,1331799797112,0,2,784,149,2080,8,1,2012-03-15 08:23:17.112,2015-03-03 09:00:00
3,1331799797114,1331799797114,0,2,757,99,1980,7,0,2012-03-15 08:23:17.114,2015-02-04 09:00:00
5,1331799797117,1331799797117,0,2,798,143,2130,6,1,2012-03-15 08:23:17.117,2015-03-17 09:00:00
6,1331799797118,1331799797118,0,2,759,74,1880,15,3,2012-03-15 08:23:17.118,2015-02-06 09:00:00


In [19]:
df_cards_004_due_date.dtypes

cid                     int64
nid                     int64
ord                     int64
queue                   int64
due                     int64
ivl                     int64
factor                  int64
reps                    int64
lapses                  int64
CardCreated    datetime64[ns]
DueDate        datetime64[ns]
dtype: object
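
For review cards, `due` counts days from the collection creation date, so the due date is a plain date addition. A minimal sketch with the standard library, assuming the creation date printed in 2.0.3 and the `due` value 789 from the first row of the table above:

```python
from datetime import datetime, timedelta

collection_created = datetime(2013, 1, 8, 9, 0, 0)  # value printed in 2.0.3
due_days = 789                                       # 'due' of the first row above

# due date = collection creation date + 'due' days
due_date = collection_created + timedelta(days=due_days)
print(due_date)
```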

It appears that card types are being rendered as numbers, which makes them less human-readable. We will fix this.

### 2.1.9. Label card types by their names, & drop outlier

In [20]:
df_cards_005_named_types = df_cards_004_due_date.copy()
# 'ord' stands for 'ordinal': it identifies which card template a card was generated from
print(df_cards_005_named_types['ord'].value_counts()) # card counts per template

# Only two cards use template 11, so let's drop them as outliers
df_cards_005_named_types = df_cards_005_named_types[df_cards_005_named_types['ord'] != 11]

# confirm that the outlier template is gone
assertEquals(11 in df_cards_005_named_types['ord'].values, False, "Outlier template removed")

# now, map names onto the templates # read:JapaneseReading, recall:EngToJpnTranslate, look:PictureLook, listen:AudioListening
df_cards_005_named_types['CardType'] = df_cards_005_named_types['ord'].map(
    {0:'read', 2:'recall', 4:'look', 7:'listen'})
df_cards_005_named_types['CardType'].value_counts()

0     6845
4     1121
7      274
2       14
11       2
Name: ord, dtype: int64


read      6845
look      1121
listen     274
recall      14
Name: CardType, dtype: int64

### 2.1.10. Create interval quantile buckets for visualization purposes

In [21]:
df_cards_006_ivl_buckets = df_cards_005_named_types.copy()
# qcut: Quantile-based discretization function. Discretize variable into equal-sized buckets
# based on rank or based on sample quantiles. For example 1000 values for 10 quantiles would
# produce a Categorical object indicating quantile membership for each data point.
# http://www.datasciencemadesimple.com/quantile-decile-rank-column-pandas-python-2/
df_cards_006_ivl_buckets['ivl_q'] = pd.qcut(df_cards_006_ivl_buckets['ivl'],5,labels=False)
df_cards_006_ivl_buckets['factor_q'] = pd.qcut(df_cards_006_ivl_buckets['factor'],3,labels=False)
df_cards_006_ivl_buckets.head()

Unnamed: 0,cid,nid,ord,queue,due,ivl,factor,reps,lapses,CardCreated,DueDate,CardType,ivl_q,factor_q
0,1331799797110,1331799797110,0,2,789,65,1680,10,1,2012-03-15 08:23:17.110,2015-03-08 09:00:00,read,0,1
1,1331799797112,1331799797112,0,2,784,149,2080,8,1,2012-03-15 08:23:17.112,2015-03-03 09:00:00,read,0,2
3,1331799797114,1331799797114,0,2,757,99,1980,7,0,2012-03-15 08:23:17.114,2015-02-04 09:00:00,read,0,1
5,1331799797117,1331799797117,0,2,798,143,2130,6,1,2012-03-15 08:23:17.117,2015-03-17 09:00:00,read,0,2
6,1331799797118,1331799797118,0,2,759,74,1880,15,3,2012-03-15 08:23:17.118,2015-02-06 09:00:00,read,0,1
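
To make the bucketing concrete, here is a rough standard-library sketch of quantile-based binning on ten made-up interval values. It is only an illustration: `pd.qcut` may treat ties and bin edges differently, and the helper name `quantile_bucket` is hypothetical.

```python
from bisect import bisect_left
from statistics import quantiles

intervals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up 'ivl' values

# Four cut points split the data into five equal-sized groups.
cut_points = quantiles(intervals, n=5, method="inclusive")

def quantile_bucket(value, cuts):
    """Return the 0-based bucket index; a value equal to a cut point stays in the lower bucket."""
    return bisect_left(cuts, value)

buckets = [quantile_bucket(v, cut_points) for v in intervals]
print(buckets)
```

Each bucket ends up with the same number of values, which is the point of quantile binning: the bucket widths adapt to the data instead of being fixed.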


### 2.1.11. Create dummy variables for card types

In [22]:
df_cards_007_dummies = df_cards_006_ivl_buckets.copy()

df_cards_007_dummies = pd.get_dummies(df_cards_007_dummies, columns=['CardType'])

In [23]:
df_cards_007_dummies.tail(10)[:5]

Unnamed: 0,cid,nid,ord,queue,due,ivl,factor,reps,lapses,CardCreated,DueDate,ivl_q,factor_q,CardType_listen,CardType_look,CardType_read,CardType_recall
18827,1523893083493,1523892839900,0,2,2728,420,2210,7,1,2018-04-16 15:38:03.493,2020-06-28 09:00:00,3,2,0,0,1,0
18828,1523893083509,1523892839900,7,2,2823,515,2410,5,0,2018-04-16 15:38:03.509,2020-10-01 09:00:00,3,2,1,0,0,0
18829,1523893129423,1523892839900,2,2,2713,405,2210,7,1,2018-04-16 15:38:49.423,2020-06-13 09:00:00,3,2,0,0,0,1
18830,1524841320859,1523892839900,4,2,2308,4,2050,14,2,2018-04-27 15:02:00.859,2019-05-05 09:00:00,0,1,0,1,0,0
18868,1549184129288,1549184119039,0,2,2416,108,2410,5,0,2019-02-03 08:55:29.288,2019-08-21 09:00:00,0,2,0,0,1,0
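
The idea behind dummy variables can be sketched in plain Python: each distinct card type becomes its own 0/1 column, and every row has exactly one of them set. The sample list below is made up, and this is an illustration of the concept rather than how pandas implements it.

```python
card_types = ["read", "listen", "recall", "look", "read"]  # made-up sample

categories = sorted(set(card_types))  # one output column per distinct type

# Build one dict per card, with a 1 in the column matching its type.
rows = [
    {"CardType_" + c: int(t == c) for c in categories}
    for t in card_types
]
for row in rows:
    print(row)
```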


### 2.1.12. Create df_cards_008_mid_section_2 data frame for progress saving

In [24]:
df_cards_008_mid_section_2 = df_cards_007_dummies.copy()
# we will also drop a few columns that aren't needed anymore
df_cards_008_mid_section_2 = df_cards_008_mid_section_2.drop(['ord','queue','due'],axis=1)
print(df_cards_008_mid_section_2.shape)
df_cards_008_mid_section_2.head()

(8254, 14)


Unnamed: 0,cid,nid,ivl,factor,reps,lapses,CardCreated,DueDate,ivl_q,factor_q,CardType_listen,CardType_look,CardType_read,CardType_recall
0,1331799797110,1331799797110,65,1680,10,1,2012-03-15 08:23:17.110,2015-03-08 09:00:00,0,1,0,0,1,0
1,1331799797112,1331799797112,149,2080,8,1,2012-03-15 08:23:17.112,2015-03-03 09:00:00,0,2,0,0,1,0
3,1331799797114,1331799797114,99,1980,7,0,2012-03-15 08:23:17.114,2015-02-04 09:00:00,0,1,0,0,1,0
5,1331799797117,1331799797117,143,2130,6,1,2012-03-15 08:23:17.117,2015-03-17 09:00:00,0,2,0,0,1,0
6,1331799797118,1331799797118,74,1880,15,3,2012-03-15 08:23:17.118,2015-02-06 09:00:00,0,1,0,0,1,0


### 2.1.13. Export df_cards_008_mid_section_2

In [25]:
df_cards_008_mid_section_2.to_csv('datasets/df_cards_008_mid_section_2.csv')

***
- [Previous section: Cards](#cards)
- [Next section: Combo](#combo)
***

# <a name="notes"></a> Notes

### 2.2.1. Import notes (terms/words) into data frame "df_notes"

In [26]:
# let's take in the 'notes' table, and explicitly save the note id ("nid") 
df_notes = pd.read_sql_query("SELECT * FROM notes", cnx)
df_notes = df_notes.rename(columns={'id':'nid'})

In [27]:
assertEquals(df_notes.shape[0],8381,"Rows") # 2791, 9784, 8403
assertEquals(df_notes.shape[1],11,"Columns")

'OK'

### 2.2.2. Remove (drop) unneeded fields (columns)

In [28]:
df_notes_old_col_vals = df_notes.columns.values
df_notes = df_notes.drop(['guid','mid','usn','sfld','csum','flags','data'],axis=1)
#print(df_notes.columns.values)
print_before_after(df_notes_old_col_vals, df_notes.columns.values)

---------------------------------------------------------------------------
Before: ['nid' 'guid' 'mid' 'mod' 'usn' 'tags' 'flds' 'sfld' 'csum' 'flags' 'data']
---------------------------------------------------------------------------
After: ['nid' 'mod' 'tags' 'flds']
---------------------------------------------------------------------------


### 2.2.3. Split "fields" column into multiple, assign field names, drop combined col

In [29]:
def time_it(func, *args, **kwargs):
    start = time.time()
    func(*args, **kwargs)
    end = time.time()
    # https://stackoverflow.com/questions/8885663/how-to-format-a-floating-number-to-fixed-width-in-python
    print("{:.0f}".format((end - start)*1000) + " milliseconds")

In [30]:
# Anki separates a note's fields with the unit separator character (0x1f)
for i in range(0,len(expected_names)-1):
    df_notes[expected_names[i]] = df_notes.flds.str.split('\x1f').str.get(i)
assertEquals('flds' in df_notes.columns.values, True, "'flds' Column Found")
df_notes = df_notes.drop(['flds'],axis=1)
assertEquals('flds' not in df_notes.columns.values, True, "'flds' Column Not Found")
print(df_notes.columns.values)

['nid' 'mod' 'tags' 'Term' 'Yomi1' 'Translation' 'Translation2'
 'Translation3' 'AlternateForms' 'PartOfSpeech' 'Sound' 'Sound2' 'Sound3'
 'Examples' 'ExamplesAudio' 'AtoQ' 'AtoQaudio' 'AtoQkana'
 'AtoQtranslation' 'QandApicture' 'answerPicture' 'Meaning1'
 'SimilarWords' 'RelatedWords' 'Breakdown1' 'Comparison' 'Usage' 'Prompt1'
 'Prompt2' 'KakuMCD' 'IuMCD' 'ExtraMemo' 'Yomi2' 'Meaning2' 'Breakdown2'
 'Picture1' 'Picture2' 'Picture3' 'Picture4' 'HinshiMarker' 'Hint' 'Term2'
 'ArabicNumeral' 'CounterKanji' 'Mnemonic' 'SameSoundWords' 'Yomi3'
 'gChap' 'gBook' 'semester' 'gNumber' 'Transliteration' 'SoloLookCards'
 'TagOverflow' 'blank1' 'blank2']
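
Anki stores all of a note's fields in the single `flds` string, separated by the unit separator character (`\x1f`). The split step can be sketched in plain Python. The sample string below is assembled by hand from the 発明 note inspected elsewhere in this notebook, and the three field names are a shortened, illustrative subset.

```python
FIELD_SEPARATOR = "\x1f"  # unit separator used between fields in 'flds'

sample_names = ["Term", "Yomi1", "Translation"]
sample_flds = FIELD_SEPARATOR.join(["発明", "はつめい", "invention"])

# Split the combined string and pair each piece with its field name.
parts = sample_flds.split(FIELD_SEPARATOR)
note = dict(zip(sample_names, parts))
print(note)
```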


### 2.2.4. Confirm all HTML tags have been removed from note terms & readings

In [31]:
assertEquals(df_notes[df_notes['Term'].str.contains("</div>")].shape[0],0,"HTML tags have been removed")
assertEquals(df_notes[df_notes['Term'].str.contains("<div>")].shape[0],0,"HTML tags have been removed")
assertEquals(df_notes[df_notes['Term'].str.contains("anki")].shape[0],0,"HTML tags have been removed")
assertEquals(df_notes[df_notes['Yomi1'].str.contains("</span>")].shape[0],0,"HTML tags have been removed")
assertEquals(df_notes[df_notes['Yomi1'].str.contains("</div>")].shape[0],0,"HTML tags have been removed")
assertEquals(df_notes[df_notes['Yomi1'].str.contains("anki")].shape[0],0,"HTML tags have been removed")

'OK'

In [32]:
# todo: create function for this
# inspect notes that have spaces in the term field
# df_notes[df_notes['Term'].str.contains(" ")]

### 2.2.5. Check notes for duplicates (shallow check)

In [33]:
assertEquals(has_dupes(df_notes), False, "Duplicates Not Found")

'OK'

### 2.2.6. Check for duplicates by term field in notes data frame

In [34]:
def has_dupe_terms(df_in):
    location = df_in['Term'].duplicated()
    return df_in.loc[location].shape[0] != 0

In [35]:
assertEquals(has_dupe_terms(df_notes), False, "Duplicates Found")

'OK'

### 2.2.7. Confirm that duplicates dataframe is empty (no dups exist)

In [36]:
dupe = df_notes['Term'].duplicated() #creates list of True/False values
print(df_notes[dupe].shape)
assertEquals(df_notes[dupe].shape[0], 0, "Duplicates dataframe is empty.")

(0, 56)


'OK'

### 2.2.8. Inspect an individual note by its term

In [38]:
def get_rows_by_value_in_col(df_in, value, col):
    return df_in.loc[df_in[col]==value]

In [39]:
# 発明 (invention)
sel1 = get_rows_by_value_in_col(df_notes, '発明','Term')
sel1

Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,Yomi3,gChap,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2
2696,1354094556789,1543874647,Japanese MCD N3 Noun commonWord complete edit...,発明,はつめい,"<span style=""""><div>invention</div></span>",,,,"Common word, Noun, Suru verb",...,,,,,,,,,,


### Save Point, Commit, Bonfire (for you Souls fans)*

Now that the (meta) tag information is available, we can treat it to both clarify (rename poorly worded tags) and reduce (delete unneeded tags). Since all fields are now split into their own columns, we can treat (modify & improve) the columns as well. This is a two-step process: (1) fix the tags, then (2) fix the columns.

*https://en.wikipedia.org/wiki/Souls_(series)

In [40]:
def shorten_list(takeIn, takeOut):
    temp = takeIn.lower().split() # lowercase the string & split all the words into a list
    temp2 = [word for word in temp if word not in takeOut] # keep only the words that are not take-outs (take-outs must be lowercase to match)
    return ' '.join(temp2) # return that shorter list as a string

In [41]:
tag_remove_list = ['japanese', 'checkpicture', 'complete', 'haspicture', 'nomemo',
                   'researched', 'aaaeditthis', 'addaudio', 'addaudio2', 'addaudioNow',
                   'addmore','adjustformatting', 'hascomparison', 'hasmnemonic',
                   'customediting','wikidefinition', 'givewill','addaudionow','addprompt',
                   'checknuance','giveyaneury','hastextimage', 'marked', 'addpicture',
                   'addexampletranslation','basicnumeric', 'genkiplus', 'hasaudio',
                   'nativeaudio', 'adddefinition','addexamples', 'addjapaneseprompt',
                   'computervoice','haspoliteprefix','nongoo','customdefinition','hashint',
                   'abahipriorityfix','kaki','mcd','nobodyknows+','missingwordtype',
                   'image','duplicate', 'hasprompt', 'ninshiki','abachecknuance',
                   'hasflag','things', 'jim', 'hasunicode', 'editthis','aaahipriority',
                   'hassimpledef', 'givecodie', 'forjimmy', 'hasnativeaudio', 'givejimmy2',
                   'checkaudio', 'checkwriting', 'hasjlptlevel', 'makekaki', 'checknuance2',
                   'checkagain', 'newaudio', 'mail', 'checkexamples','elementaryschool',
                   'nvc', 'checkprompt', 'gavejimmy', 'addnativeaudio','checkreading',
                   'givecodieapril', 'activated', 'fixformatting','hasplacesuffix',
                   'hassuffix','addtranslation','addnewcardtype','addnuance','addtextimage',
                   'semicomplete', 'removeroboaudio','fixaudio','hasgramconj', 
                   'addkanji','changenotetype', 'famous', 'kuverb',
                   'givwill','karutapoems', 'map', 'hasvisualcomparison','picturekaki',
                   'jyugemu', '2018', 'type1', 'hasslang', 'apologies',
                   'month', 'definitionresearched','soundshift', 'basics1', 'tsuverb',
                   'facebook', 'uverb', 'checkfrequency', 'degree', 'hasdefinition',
                   'addtransliteration', 'dnd', 'introductions', 'adjustprompt',
                   'job', 'particle', 'services', 'mature', 'splitpictures', 
                   'egaki', 'type5k', 'intimate','extrainfo', 'irregular', 'unlisted',
                   'fromwiki', 'checkdifference','addpronunciationdiagram', 'reset',
                   'currentevents', 'doubletextimage', 'comparison', 'verbscompoundpast2',
                   'attention', 'addmemo', 'averb', 'radio','hasascii', 'fontadjusted',
                   'haspronunciation', 'borroweddefinition','alphabet', 'graphics',
                   'chiebukuro', 'duolingo', 'ateji', 'fact','type5s', 'fixpicture',
                   'politebydefault', 'objects','sensitive', 'groupword', 'addmnemonic',
                   'hasmore', 'quote', 'checkformatting','overlap', 'kotobankdef',
                   'hasrudeness', 'changedeck', 'specialformatting','yoga',
                   'hasjapaneseprompt', 'hasprefix','questionword', 'business', 
                   'postoffice', 'firstten', 'money', 'robotvoice2', 'ichidan', 'godan',
                   'weather','count', 'nodefinition', 'muverb', 'addcomparisonchart', 
                   'ruverb', 'phone', 'conjugated','haddiv','vulgar','fromkaruta',
                   'karutamanual', 'teform', '2019'
                  ]

### 2.2.9. Remove unneeded tags (meta-data) from notes

In [42]:
# survey a few notes to see example tag data
df_notes.head(3)

Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,Yomi3,gChap,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2
0,1331799797110,1511481489,Japanese Marked abaCheckNuance checkNuance co...,臨機応変,りんきおうへん,adapting oneself to the requirements of the mo...,,,,"Noun, No-adjective",...,,,,,,,,,,
1,1331799797112,1511481489,Japanese complete noMemo researched wwwjdic,隙間,すきま,<div>crevice; crack; gap; opening</div>,,,,"<div>Common word, Noun</div>",...,,,,,,,,,,
2,1331799797113,1511481489,Japanese Marked abaCheckNuance checkNuance co...,苦汁,にがり,bittern; concentrated solution of salts (esp. ...,,,,Noun,...,,,,,,,,,,


In [43]:
# likely useful tags: katakana, Waseigo, Food, Phrases, casual, restaurant, travel, commonWord, noun, suruVerb

df_notes_001_less_tags = df_notes.copy() #originally "df_notes_less_tags"
df_notes_001_less_tags['tags'] = df_notes_001_less_tags['tags'].apply(lambda x: shorten_list(str(x), tag_remove_list))

print_before_after(df_notes['tags'].iloc[0], df_notes_001_less_tags['tags'].iloc[0],"Tags for " + df_notes['Term'].iloc[0])

assertEquals("Japanese" in df_notes['tags'].iloc[0].split(), True, "Contains Tag 'Japanese'")
assertEquals("Japanese" in df_notes_001_less_tags['tags'].iloc[0].split(), False, "Contains Tag 'Japanese'")

---------------------------------------------------------------------------
Tags for 臨機応変
---------------------------------------------------------------------------
Before:  Japanese Marked abaCheckNuance checkNuance complete noMemo researched wwwjdic yojijukugo 
---------------------------------------------------------------------------
After: wwwjdic yojijukugo
---------------------------------------------------------------------------


'OK'

### 2.2.10. Rename useful tags (meta-data) that were poorly named (still on notes)

In [44]:
# replace list (formerly named 'tag_replace_list')
tag_rename_dict = {
    'aalowfrequency':'rare checked', 'aatechnical':'technical checked', 'aaanonkaiwa':'nonconvo checked',
    'wwwjdic':'fromdict', 'expression':'phrase', 'numberonly':'number',
    'grammarpoint':'grammar', 'jisho':'fromdict', 'pointingword':'directions',
    'geometry':'math technical', 'genki':'textbook', 'jpn202':'college',
    'jpn201':'college', 'jpn101':'college', 'jpn102':'college', 'kentei':'fromexam',
    'bodypart':'body', '5kyuu':'fromexam', # 'proficiencytest' is mapped once, to 'fromtest', further down
    'linguisticreference':'technical', 'conversation':'convo',
    'fromconvo':'convo', 'culturepoint':'culture', 'checkednuance':'checked',
    'checkedpictures':'checked', 'checkednuance':'checked', 'medical':'technical',
    'anatomy':'body', 'places':'place', 'animals':'animal',
    'newspaperterm':'fromnewspaper', 'checkedreading':'checked',
    'abbreviation':'abbr','firstsemester':'semester1','onecharacter':'len1',
    'sentence':'phrase', 'verbs':'verb', 'convook':'checked convo','inuse':'checked',
    'nuancechecked':'checked','insects':'animal insect','sightseeing':'travel',
    'accessories':'clothing', 'grammarsuffix':'suffix', 'oceanlife':'animal ocean',
    'science':'technical', 'written':'nonconvo', 'notrare':'checked',
    'aajoke':'silly', 'intonationcompare':'hassimilar', 'ij':'textbook',
    'goodcard':'inspect','aahilevel':'challenging inspect', 'ijvocab':'textbook',
    'cliothing':'clothing','unused':'nonconvo rare checked',
    'aaunused':'nonconvo rare checked', 'samesound':'hassame','animals':'animal',
    'dictionary':'fromdict','usuallywritteninkana':'kana',
    'abveryrare':'rare checked', 'yojijukugo':'rare idiom', 'abcasual':'casual checked convo', # keys must be lowercase to match
    'literaryform':'nonconvo', 'onomatopoeiclike':'onomatopoeic','kenjo':'humble',
    'colors':'color', 'forest':'nature','flower':'plant nature', 'aaok':'checked',
    'questions': 'question', 'adverbs':'adverb','book2':'textbook',
    'book1':'textbook','proficiencytest':'fromtest','animalscomplete':'animal',
    'sonkei':'respectful','eating':'food','fruit':'food','neverused':'nonconvo rare',
    'domainspecific':'technical','seaons':'season','seasons':'season',
    'prefecture':'place','plantpart':'plant', "hakataben":"dialect", "fish":"animal fish",
    "transitive":"transitive verb", "intransitive":"intransitive verb",
    "aaunecessary":"nonconvo checked", "vegetables":"vegetable food plant",
    "counters":"counter", "senmonyougo":"technical", "countries":"country place",
    "date":"datesandtime", "rarelyused":"rare", "aaakaiwa":"convo checked", "cool":"inspect",
    "investigate":"inspect","challenging":"inspect","names":"name",'qanda':'question',
    'hasquestion':'question', "感情のもとにあったニーズ":"phrase rare","phrases":'phrase'
}

#todo: investigate:
#editformatting,  datesandtime, linguistics, reference, adult, adjustpicture, checkpronunciation, addhint, challenging, inspect
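
One thing worth knowing about a long dict literal like this: when a key is written twice, Python silently keeps the last value. The two-entry dict below is a cut-down illustration, not the full mapping, and `find_duplicates` is a hypothetical helper that works on a list of pairs instead, where repeats are still visible.

```python
from collections import Counter

# A repeated key in a dict literal: the later value wins.
demo = {'proficiencytest': 'fromexam', 'proficiencytest': 'fromtest'}
print(demo)

def find_duplicates(pairs):
    """Return the keys that occur more than once in a list of (key, value) pairs."""
    counts = Counter(key for key, _ in pairs)
    return sorted(key for key, n in counts.items() if n > 1)

pairs = [('animals', 'animal'), ('verbs', 'verb'), ('animals', 'animal')]
duplicates = find_duplicates(pairs)
print(duplicates)
```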

In [45]:
def replace_list(takeIn, replaceDict):
    temp = takeIn.lower().split()
    # if the word exists in the dictionary, replace it; otherwise leave it alone
    temp2 = [replaceDict.get(word, word) for word in temp]
    return ' '.join(temp2) # return the renamed tags as a string

# inspect further:
# multiwriting, multimeaning, multipicture, multiterm, multireading, mergeterms, checkpronunciation, customterm,
# goodcard, personalized, silly, addjlptlevel, checkpronunciation, mergeterms, customterm, transportation vs travel

# categorize: iadjective, naajective, verb, counter, commonword, suruverb, pronoun, question, phrases, kuverb, godan, ichidan, intransitive, transitive, noun, adverbialnoun

In [46]:
df_notes_002_better_tags = df_notes_001_less_tags.copy() # originally "df_notes_better_tags"
df_notes_002_better_tags['tags'] = df_notes_002_better_tags['tags'].apply(lambda x: replace_list(str(x), tag_rename_dict))

print_before_after(df_notes_001_less_tags['tags'].iloc[0], df_notes_002_better_tags['tags'].iloc[0], "Tags for " + df_notes_002_better_tags['Term'].iloc[0])

assertEquals("wwwjdic" in df_notes_001_less_tags['tags'].iloc[0].split(), True, "Contains Tag 'wwwjdic'")
assertEquals("wwwjdic" in df_notes_002_better_tags['tags'].iloc[0].split(), False, "Contains Tag 'wwwjdic'")
assertEquals("fromdict" in df_notes_001_less_tags['tags'].iloc[0].split(), False, "Contains Tag 'fromdict'")
assertEquals("fromdict" in df_notes_002_better_tags['tags'].iloc[0].split(), True, "Contains Tag 'fromdict'")

---------------------------------------------------------------------------
Tags for 臨機応変
---------------------------------------------------------------------------
Before: wwwjdic yojijukugo
---------------------------------------------------------------------------
After: fromdict rare idiom
---------------------------------------------------------------------------


'OK'

### 2.2.11. Inspect current tag strings, notice duplicate occurrences

In [47]:
df_notes_002_better_tags['tags'].value_counts()[:5]

                                     1501
fromdict                              777
fromtest textbook                     581
textbook textbook                     428
college textbook textbook hasrobo     215
Name: tags, dtype: int64

### 2.2.12. Add "hasnotags" tag to notes w/o any meta-tag data

In [48]:
df_notes_002_better_tags['tags'] = df_notes_002_better_tags['tags'].apply(lambda x: "hasnotags" if x == '' else x)

In [49]:
df_notes_002_better_tags['tags'].value_counts()[:5]

hasnotags                            1501
fromdict                              777
fromtest textbook                     581
textbook textbook                     428
college textbook textbook hasrobo     215
Name: tags, dtype: int64

We can now start to inspect which tags are most common, in which combinations, and which words would benefit from additional metadata. However, **our tags are still lumped together** at this point. Also, there is reason to believe that **some tags are showing up multiple times in the same tag string**. To count tag frequency properly, the duplicates must first be found and removed. Then the occurrences (word frequency) of each tag can be summed over the tags column.
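
Once duplicates within a tag string are gone, counting is straightforward. A small sketch with `collections.Counter`, using three hand-picked tag strings rather than the real column; de-duplication is done inline here so the example stands alone:

```python
from collections import Counter

tag_strings = ["fromdict", "fromtest textbook", "textbook textbook"]

tag_counts = Counter()
for tags in tag_strings:
    # a set keeps each tag once per note, so repeats inside one string are not double counted
    tag_counts.update(set(tags.split()))
print(tag_counts.most_common())
```

Here "textbook" is counted twice (once per note that carries it), not three times.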

### 2.2.13. Inspect a note suspected for tag duplication

In [50]:
# confirm that a particular note has tag duplicates
# crimson note id: 1369286386384
note_id_1 = 1369286386384
assertEquals(get_rows_by_value_in_col(df_notes_002_better_tags, note_id_1,'nid').tags.values[0],
             "fromexam color fromexam len1","Four tags total, including a duplicated 'fromexam'") #todo: count occurrences of 'fromexam' instead



'OK'

In [51]:
# example of item with tag duplication
sel2 = get_rows_by_value_in_col(df_notes_002_better_tags, note_id_1,'nid')
sel2

Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,Yomi3,gChap,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2
3845,1369286386384,1511481489,fromexam color fromexam len1,紅,くれない,<div>deep red; crimson</div><div><br /></div><...,,,,"<div>Common word, Noun</div><div>Common word, ...",...,,,,,,,,,,


### 2.2.14. Remove duplicate tags (convert tag strings > lists > sets > strings)

In [52]:
# Lowercases a tag string and removes duplicate tags by passing the tags through a set
# Note: a set is unordered, so the tags may come back in a different order
def remove_dupes(t):
    unique_tags = set(t.lower().split())
    return ' '.join(unique_tags) # return as string

In [53]:
df_notes_003_tags_no_dups = df_notes_002_better_tags.copy()
df_notes_003_tags_no_dups['tags'] = df_notes_003_tags_no_dups['tags'].apply(lambda x: remove_dupes(str(x)))

In [54]:
# determines if an individual tag exists as a whole token in a space-separated tags string
def tag_exists(tags, tag):
    return 1 if tag in tags.split() else 0

In [55]:
print(get_rows_by_value_in_col(df_notes_003_tags_no_dups, note_id_1,'nid').tags.values[0])
assertEquals(tag_exists(get_rows_by_value_in_col(df_notes_003_tags_no_dups, note_id_1,'nid').tags.values[0],"len1"), 1, "tag 'len1' remains")
assertEquals(tag_exists(get_rows_by_value_in_col(df_notes_003_tags_no_dups, note_id_1,'nid').tags.values[0],"fromexam"), 1, "tag 'fromexam' remains")

fromexam color len1


'OK'
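Because a Python `set` has no defined order, the deduplicated tags can come back in a different order than they were entered. If a stable order matters, one possible variant relies on `dict.fromkeys`, which keeps first-seen order. The function name `remove_dupes_keep_order` is hypothetical and is not used elsewhere in this notebook.

```python
def remove_dupes_keep_order(tag_string):
    """Lowercase, split on whitespace, and drop repeated tags, keeping first-seen order."""
    unique_tags = dict.fromkeys(tag_string.lower().split())
    return " ".join(unique_tags)


print(remove_dupes_keep_order("fromexam color fromexam len1"))
```

An empty tag string simply yields an empty string, so the function is safe to apply to every row.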

It appears we have most, if not all, of the data we need to start. The dates, though, are still stored as epoch values and are not yet human readable. Let's fix that.

### 2.2.15. Convert (& preserve) note ID to note creation date

In [56]:
#dueNum = 782 # this represents days from collection creation date
#crt = 1357635600 # this represents the collection creation date #todo: query dynamically from database
#print("mid 'model id': " + time.ctime(int("1768161991"))) # 1 day = 86400 seconds

df_notes_004_with_date = df_notes_003_tags_no_dups.copy()
df_notes_004_with_date['NoteCreated']= pd.to_datetime(df_notes_004_with_date['nid'],unit='ms')
#df_notes_004_with_date['NoteCreated'] = df_notes_004_with_date['NoteCreated'].dt.date
df_notes_004_with_date.head()

print_before_after(df_notes_003_tags_no_dups['nid'].iloc[0], df_notes_004_with_date['NoteCreated'].iloc[0],"Term " + df_notes_004_with_date['Term'].iloc[0])

#assertEquals(df_notes_004_with_date['nid'].iloc[0], 1331799797110, "Note ID is in Epoch Units")
#assertEquals(str(df_notes_004_with_date['NoteCreated'].iloc[0]), "2012-03-15", "Note ID is in datetime date format year-month-day")

---------------------------------------------------------------------------
Term 臨機応変
---------------------------------------------------------------------------
Before: 1331799797110
---------------------------------------------------------------------------
After: 2012-03-15 08:23:17.110000
---------------------------------------------------------------------------


In [57]:
df_notes_004_with_date.dtypes

nid                         int64
mod                         int64
tags                       object
Term                       object
Yomi1                      object
Translation                object
Translation2               object
Translation3               object
AlternateForms             object
PartOfSpeech               object
Sound                      object
Sound2                     object
Sound3                     object
Examples                   object
ExamplesAudio              object
AtoQ                       object
AtoQaudio                  object
AtoQkana                   object
AtoQtranslation            object
QandApicture               object
answerPicture              object
Meaning1                   object
SimilarWords               object
RelatedWords               object
Breakdown1                 object
Comparison                 object
Usage                      object
Prompt1                    object
Prompt2                    object
KakuMCD       

### 2.2.15. Generate Note Last Modified Date from "Mod" ID

In [58]:
df_notes_005_last_modified = df_notes_004_with_date.copy()
df_notes_005_last_modified['LastModified'] = pd.to_datetime(df_notes_005_last_modified['mod'],unit='s')
#df_notes_005_last_modified['LastModified'] = df_notes_005_last_modified['LastModified'].dt.date

#assertEquals(str(df_notes_005_last_modified['LastModified'].iloc[0]), "2017-11-23", "Note last modified is in datetime date format year-month-day")
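The two conversions above differ only in their unit: note IDs are epoch milliseconds, while `mod` values are epoch seconds. A minimal sketch using the first note's values from the output above (the variable names are illustrative):

```python
import pandas as pd

note_id = 1331799797110   # nid: milliseconds since the Unix epoch
last_mod = 1511481489     # mod: seconds since the Unix epoch

created = pd.to_datetime(note_id, unit="ms")
modified = pd.to_datetime(last_mod, unit="s")
print(created, modified)

# keep only the calendar date when the time of day is not needed
print(created.date(), modified.date())
```

Passing the wrong unit does not raise an error for these values; it silently produces a wrong date, so a spot check against a known note is worthwhile.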

### 2.2.16. Remove rare words, phrases, expressions, questions & sentences from notes

In [59]:
df_notes_006_only_vocab_no_rare = df_notes_005_last_modified.copy()
print(df_notes_006_only_vocab_no_rare.shape)
df_notes_006_only_vocab_no_rare.head(3)

(8381, 58)


Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2,NoteCreated,LastModified
0,1331799797110,1511481489,idiom rare fromdict,臨機応変,りんきおうへん,adapting oneself to the requirements of the mo...,,,,"Noun, No-adjective",...,,,,,,,,,2012-03-15 08:23:17.110,2017-11-23 23:58:09
1,1331799797112,1511481489,fromdict,隙間,すきま,<div>crevice; crack; gap; opening</div>,,,,"<div>Common word, Noun</div>",...,,,,,,,,,2012-03-15 08:23:17.112,2017-11-23 23:58:09
2,1331799797113,1511481489,fromdict,苦汁,にがり,bittern; concentrated solution of salts (esp. ...,,,,Noun,...,,,,,,,,,2012-03-15 08:23:17.113,2017-11-23 23:58:09


In [60]:
sel4 = df_notes_006_only_vocab_no_rare[df_notes_006_only_vocab_no_rare['tags'].str.contains("rare")]
# https://stackoverflow.com/questions/37313691/how-to-remove-a-pandas-dataframe-from-another-dataframe
# remove rare words only first
df_notes_006_only_vocab_no_rare = pd.concat([df_notes_006_only_vocab_no_rare, sel4]).drop_duplicates(keep=False)

print(df_notes_006_only_vocab_no_rare.shape)
df_notes_006_only_vocab_no_rare.head(3)

# todo: assert that no rare words remain in 'df_notes_006_only_vocab_no_rare' by using 'contain("rare")'
# for selection, assert that selection has a row size of 0

(8268, 58)


Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2,NoteCreated,LastModified
1,1331799797112,1511481489,fromdict,隙間,すきま,<div>crevice; crack; gap; opening</div>,,,,"<div>Common word, Noun</div>",...,,,,,,,,,2012-03-15 08:23:17.112,2017-11-23 23:58:09
2,1331799797113,1511481489,fromdict,苦汁,にがり,bittern; concentrated solution of salts (esp. ...,,,,Noun,...,,,,,,,,,2012-03-15 08:23:17.113,2017-11-23 23:58:09
3,1331799797114,1511481489,fromdict,移籍,いせき,<div>changing household registry; transfer (e....,,,,"<div>Common word, Noun, Suru verb</div>",...,,,,,,,,,2012-03-15 08:23:17.114,2017-11-23 23:58:09
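An alternative to the concat-and-drop approach is a boolean mask, which also makes the check from the todo above easy to write. Two caveats motivate the sketch: `str.contains("rare")` matches substrings (so a tag such as `rarely` would also match), and `drop_duplicates(keep=False)` would additionally discard any rows that were already exact duplicates. The data and the helper `has_tag` below are hypothetical.

```python
import pandas as pd

# hypothetical notes; 'rarely' shows why a substring test can over-match
notes = pd.DataFrame({
    "nid": [1, 2, 3],
    "tags": ["fromdict rare idiom", "fromdict", "rarely fromtest"],
})


def has_tag(tag_string, tag):
    """True when `tag` is one of the whitespace-separated tokens."""
    return tag in tag_string.split()


is_rare = notes["tags"].apply(lambda t: has_tag(t, "rare"))
notes_no_rare = notes[~is_rare]

# the check described in the todo: no remaining row carries the tag
assert not notes_no_rare["tags"].apply(lambda t: has_tag(t, "rare")).any()
print(notes_no_rare["nid"].tolist())
```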


### 2.2.17. Remove phrases, sentences & questions now all at once

In [61]:
sel5 = df_notes_006_only_vocab_no_rare[df_notes_006_only_vocab_no_rare['tags'].str.contains("phrase")]
sel6 = df_notes_006_only_vocab_no_rare[df_notes_006_only_vocab_no_rare['tags'].str.contains("sentence")]
sel7 = df_notes_006_only_vocab_no_rare[df_notes_006_only_vocab_no_rare['tags'].str.contains("question")]
df_notes_006_only_vocab_no_rare = pd.concat([df_notes_006_only_vocab_no_rare, sel5, sel6, sel7]).drop_duplicates(keep=False)

print(df_notes_006_only_vocab_no_rare.shape)
df_notes_006_only_vocab_no_rare.head(3)

(8047, 58)


Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2,NoteCreated,LastModified
1,1331799797112,1511481489,fromdict,隙間,すきま,<div>crevice; crack; gap; opening</div>,,,,"<div>Common word, Noun</div>",...,,,,,,,,,2012-03-15 08:23:17.112,2017-11-23 23:58:09
2,1331799797113,1511481489,fromdict,苦汁,にがり,bittern; concentrated solution of salts (esp. ...,,,,Noun,...,,,,,,,,,2012-03-15 08:23:17.113,2017-11-23 23:58:09
3,1331799797114,1511481489,fromdict,移籍,いせき,<div>changing household registry; transfer (e....,,,,"<div>Common word, Noun, Suru verb</div>",...,,,,,,,,,2012-03-15 08:23:17.114,2017-11-23 23:58:09


The note model has many columns (fields) that contain no values. These can be dropped before analysis.

In [62]:
# let's look at a small slice of the data to see what we can infer
# a broad look at the dataset helps isolate candidates for removal more quickly
s = get_rows_by_value_in_col(df_notes_006_only_vocab_no_rare, '発明','Term')
s.head()

Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2,NoteCreated,LastModified
2696,1354094556789,1543874647,fromtest suruverb commonword n3 textbook noun,発明,はつめい,"<span style=""""><div>invention</div></span>",,,,"Common word, Noun, Suru verb",...,,,,,,,,,2012-11-28 09:22:36.789,2018-12-03 22:04:07


### 2.2.18. Determine which columns (fields) are unused & can be safely removed

In [63]:
# returns True for None, an empty string, or a whitespace-only string
def is_blank(s):
    return not (s and s.strip())

In [64]:
col_names = df_notes_006_only_vocab_no_rare.columns.values
# see that this cell for this row is indeed blank
#print(is_blank(df_notes_006_only_vocab_no_rare['Translation2'].iloc[0]))

row_cnt = df_notes_006_only_vocab_no_rare.shape[0] # number of rows in df_notes_006_only_vocab_no_rare

# https://stackoverflow.com/questions/49677060/pandas-count-empty-strings-in-a-column
# build a DataFrame of booleans marking every cell that holds an empty string
empty_strings = pd.DataFrame(df_notes_006_only_vocab_no_rare.values == '',columns=col_names)
temp_dict = (empty_strings.sum()).to_dict()  # number of empty strings per column
removal_candidates = []
for col_name, empty_cnt in temp_dict.items():
    if empty_cnt == row_cnt: # every row in this column is empty
        removal_candidates.append(col_name)
print("Removal candidates:", removal_candidates)

Removal candidates: ['Sound3', 'AtoQ', 'AtoQaudio', 'AtoQkana', 'AtoQtranslation', 'QandApicture', 'answerPicture', 'blank1', 'blank2']
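The same idea can be expressed more compactly by comparing the whole frame against the empty string and asking which columns are empty in every row. This is a sketch on made-up data and assumes the columns being tested hold text.

```python
import pandas as pd

# hypothetical slice of the notes frame, text columns only
notes = pd.DataFrame({
    "Term": ["隙間", "苦汁"],
    "Sound3": ["", ""],
    "blank1": ["", ""],
    "Yomi2": ["", "にがり"],
})

all_empty = (notes == "").all()   # one boolean per column
removal_candidates = all_empty[all_empty].index.tolist()
print(removal_candidates)
```

`Yomi2` is kept because at least one row has a value in it.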


### 2.2.19. Drop empty columns from notes data frame

In [65]:
df_notes_007_less_cols = df_notes_006_only_vocab_no_rare.copy()

df_notes_007_less_cols = df_notes_007_less_cols.drop(removal_candidates,axis=1)

print_before_after(df_notes_006_only_vocab_no_rare.shape, df_notes_007_less_cols.shape)
print_before_after(df_notes_006_only_vocab_no_rare.columns.values, df_notes_007_less_cols.columns.values)

---------------------------------------------------------------------------
Before: (8047, 58)
---------------------------------------------------------------------------
After: (8047, 49)
---------------------------------------------------------------------------
---------------------------------------------------------------------------
Before: ['nid' 'mod' 'tags' 'Term' 'Yomi1' 'Translation' 'Translation2'
 'Translation3' 'AlternateForms' 'PartOfSpeech' 'Sound' 'Sound2' 'Sound3'
 'Examples' 'ExamplesAudio' 'AtoQ' 'AtoQaudio' 'AtoQkana'
 'AtoQtranslation' 'QandApicture' 'answerPicture' 'Meaning1'
 'SimilarWords' 'RelatedWords' 'Breakdown1' 'Comparison' 'Usage' 'Prompt1'
 'Prompt2' 'KakuMCD' 'IuMCD' 'ExtraMemo' 'Yomi2' 'Meaning2' 'Breakdown2'
 'Picture1' 'Picture2' 'Picture3' 'Picture4' 'HinshiMarker' 'Hint' 'Term2'
 'ArabicNumeral' 'CounterKanji' 'Mnemonic' 'SameSoundWords' 'Yomi3'
 'gChap' 'gBook' 'semester' 'gNumber' 'Transliteration' 'SoloLookCards'
 'TagOverflow' 'blank1' 'blank2

### 2.2.20. Create binary exists/not columns based on presence of a given tag in notes data frame

In [66]:
def add_column_by_tag(df, tag):
    df[tag] = df['tags'].apply(lambda x: tag_exists(str(x), tag))

In [67]:
df_notes_008_binary_tags = df_notes_007_less_cols.copy()
inspect_list = ["commonword", "clothing", "animal", "body", "food", "place",
                "textbook", "college", "fromdict", "fromexam",
                "len1", "n1", "n2", "n3", "n4", "n5", 'katakana','hiragana',
                'noun', 'verb', 'convo', 'hasnotags'
               ]
for item in inspect_list:
    add_column_by_tag(df_notes_008_binary_tags, item)

In [68]:
df_notes_008_binary_tags.dtypes.value_counts()

object            45
int64             24
datetime64[ns]     2
dtype: int64

### 2.2.21. Create boolean columns in notes data frame for predictive models

In [69]:
# https://stackoverflow.com/questions/17383094/how-can-i-map-true-false-to-1-0-in-a-pandas-dataframe
df_notes_008_binary_tags['hasPOS'] = df_notes_008_binary_tags['PartOfSpeech']!="" #todo: expand upon this, by tagify
df_notes_008_binary_tags['hasVisual'] = df_notes_008_binary_tags['Picture1']!=""
df_notes_008_binary_tags['hasAudio'] = df_notes_008_binary_tags['Sound']!=""
# NOTE: only 'Meaning2' is tested here; 'Translation2' and 'Translation3' are not yet included
# (todo: combine per-column masks with | to cover all three)
df_notes_008_binary_tags['hasMultiMeaning'] = df_notes_008_binary_tags['Meaning2']!=""
df_notes_008_binary_tags['hasMultiReading'] = df_notes_008_binary_tags['Yomi2']!="" # todo: inspect & incorporate venn diagram: https://commons.wikimedia.org/wiki/File:Homograph_homophone_venn_diagram.png
df_notes_008_binary_tags['hasSimilar'] = df_notes_008_binary_tags['SimilarWords']!=""
df_notes_008_binary_tags['hasHomophone'] = df_notes_008_binary_tags['SameSoundWords']!="" # write function, detect homophones
# NOTE: only 'AlternateForms' is tested here; 'Term2' is not yet included
df_notes_008_binary_tags['hasAltForm'] = df_notes_008_binary_tags['AlternateForms']!= ""
# NOTE: only 'ExamplesAudio' is tested here; 'Examples' is not yet included
df_notes_008_binary_tags['hasRichExamples'] = df_notes_008_binary_tags['ExamplesAudio']!=""
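When a flag should depend on several columns, note that indexing with `'A' and 'B'` selects only `'B'`, because Python evaluates the `and` chain to its last string before pandas ever sees it. The sketch below shows one way to combine columns explicitly. The data is hypothetical, and whether "any" or "all" is the right rule for a given flag is a modelling decision that this notebook does not settle.

```python
import pandas as pd

# a chain of strings joined by `and` evaluates to the last one
assert ("Translation2" and "Translation3" and "Meaning2") == "Meaning2"

notes = pd.DataFrame({
    "Translation2": ["", "gap", ""],
    "Translation3": ["", "", ""],
    "Meaning2": ["", "", "second sense"],
})

meaning_cols = ["Translation2", "Translation3", "Meaning2"]
filled = notes[meaning_cols] != ""

has_any = filled.any(axis=1)   # at least one of the fields is filled in
has_all = filled.all(axis=1)   # every field is filled in
print(has_any.tolist(), has_all.tolist())
```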

In [70]:
# Laura calls this process "Data Enriching"
# todo: confirm that intify_list is to be different/same than inspect_list
intify_list = ['hasVisual','hasAudio','hasMultiMeaning','hasMultiReading','hasSimilar','hasHomophone','hasAltForm','hasnotags','hasRichExamples']

### 2.2.22. Drop non-numerical columns from notes data frame

In [71]:
df_notes_009_less_cols = df_notes_008_binary_tags.copy()
df_notes_009_less_cols = df_notes_009_less_cols.drop(['Examples','ExamplesAudio',
                            'Meaning1','RelatedWords','Breakdown1','Comparison',
                           'Usage','Prompt1','Prompt2','KakuMCD','IuMCD','ExtraMemo',
                           'Breakdown2','Picture2','Picture3','Picture4','Mnemonic',
                            'Yomi3','gChap','gBook','semester','gNumber','ArabicNumeral',
                            'CounterKanji','SoloLookCards','HinshiMarker','Hint',
                            'mod','Transliteration','Picture1','Sound','Sound2',
                            'TagOverflow','Translation2', 'Meaning2','Yomi2','Term2',
                            'SameSoundWords','hasPOS','SimilarWords','AlternateForms',
                            'Translation3','Translation','PartOfSpeech'],axis=1)
# todo: explore 'mod' (last modified date) as freshness metric

In [72]:
# casts columns of type object to types (such as int) as directed, use with caution
def cast_to_typ(df, col, typ):
    df[col] = df[col].astype(typ)

### 2.2.23. Enforce proper numerical boolean type encoding in notes data frame

In [73]:
df_notes_009_less_cols.dtypes.value_counts()

int64             23
bool               8
object             3
datetime64[ns]     2
dtype: int64

In [74]:
for item in intify_list:
    cast_to_typ(df_notes_009_less_cols,item, int)

In [75]:
df_notes_009_less_cols.dtypes.value_counts()

int64             22
int32              9
object             3
datetime64[ns]     2
dtype: int64
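The `int32` in the output above is probably platform-dependent: on some setups (for example Windows with older NumPy versions) the plain `int` type maps to a 32-bit integer. If a consistent width is wanted, one option is to name the dtype explicitly. This is a sketch with made-up flag columns:

```python
import pandas as pd

flags = pd.DataFrame({"hasAudio": [True, False], "hasVisual": [False, False]})
flag_cols = ["hasAudio", "hasVisual"]

# cast several boolean columns to 0/1 integers of a fixed width in one call
flags = flags.astype({col: "int64" for col in flag_cols})
print(flags.dtypes)
```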

In [76]:
df_notes_009_less_cols.head(35)[30:]

#selection2 = df_binary.loc[df_binary['hasMultiMeaning']==1]
#selection2.head()

Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,...,convo,hasnotags,hasVisual,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilar,hasHomophone,hasAltForm,hasRichExamples
36,1342506824728,textbook fromtest fromdict,方法,ほうほう,2012-07-17 06:33:44.728,2017-11-23 23:58:09,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
37,1342506824729,fromdict,行為,こうい,2012-07-17 06:33:44.729,2017-11-23 23:58:09,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
38,1342506824730,textbook fromtest,行動,こうどう,2012-07-17 06:33:44.730,2017-11-23 23:58:09,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39,1342506824731,hassame hassimilar fromdict,事態,じたい,2012-07-17 06:33:44.731,2019-05-05 17:56:51,0,0,0,0,...,0,0,0,1,0,0,1,1,0,0
40,1342506824732,textbook fromtest suffix len1,形,かたち,2012-07-17 06:33:44.732,2017-11-23 23:58:09,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


### 2.2.24. Count syllables & character length for each term in notes data frame

In [77]:
df_notes_010_with_len = df_notes_009_less_cols.copy()

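df_notes_010_with_len['TermLen'] = df_notes_010_with_len['Term'].str.len()
# 'Syllables' is approximated by the number of kana characters in the reading (Yomi1)
df_notes_010_with_len['Syllables'] = df_notes_010_with_len['Yomi1'].str.len()
# terms without a reading (e.g. katakana words) fall back to the term length
df_notes_010_with_len.loc[df_notes_010_with_len['Syllables'] == 0, 'Syllables'] = df_notes_010_with_len['TermLen']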

bins = [0,1,2,4,8,128]
labels = ["[1]","[2]","[3:4]","[5:8]","[9: ]"]
# https://stackoverflow.com/questions/45273731/binning-column-with-python-pandas
df_notes_010_with_len['TermLenGroup'] = pd.cut(df_notes_010_with_len['TermLen'], bins=bins, labels=labels)

# example: df.loc[df['Grades'] <= 77, 'Grades'] = 100
# https://stackoverflow.com/questions/42815768/pandas-adding-column-with-the-length-of-other-column-as-value
df_notes_010_with_len.tail(20)[:5]

Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,...,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilar,hasHomophone,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup
8351,1485703825555,n5 katakana commonword noun multiintonation cl...,ズボン,,2017-01-29 15:30:25.555,2019-04-28 15:43:50,1,1,0,0,...,1,0,0,0,0,0,0,3,3,[3:4]
8352,1485705343576,katakana commonword n3 gairaigo clothing noun,ベルト,,2017-01-29 15:55:43.576,2019-05-06 03:48:16,1,1,0,0,...,1,0,0,0,0,0,0,3,3,[3:4]
8353,1485705402623,katakana commonword gairaigo clothing noun,ブラジャー,,2017-01-29 15:56:42.623,2019-04-28 16:03:26,1,1,0,0,...,0,0,0,0,0,0,0,5,5,[5:8]
8354,1489373157595,hasnotags,細切り,ほそぎり,2017-03-13 02:45:57.595,2017-11-23 23:58:10,0,0,0,0,...,0,0,0,0,0,0,0,3,4,[3:4]
8355,1489756408272,hasnotags,離陸,りりく,2017-03-17 13:13:28.272,2017-11-23 23:58:10,0,0,0,0,...,0,0,0,0,0,0,0,2,3,[2]
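To make the binning rule concrete: `pd.cut` treats each bin as open on the left and closed on the right by default, so the edges `[0,1,2,4,8,128]` produce the intervals (0,1], (1,2], (2,4], (4,8] and (8,128]. A small sketch with made-up lengths:

```python
import pandas as pd

bins = [0, 1, 2, 4, 8, 128]
labels = ["[1]", "[2]", "[3:4]", "[5:8]", "[9: ]"]

lengths = pd.Series([1, 2, 3, 4, 5, 8, 9, 13])
groups = pd.cut(lengths, bins=bins, labels=labels)
print(groups.tolist())

# a length of 0 falls outside the first bin and becomes NaN
print(pd.cut(pd.Series([0]), bins=bins, labels=labels).isna().tolist())
```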


### 2.2.25. Inspect the longest syllable entries in notes data frame

In [78]:
df_many_syl = df_notes_010_with_len.copy()
many_syl = df_many_syl['Syllables'] > 16
df_many_syl.loc[many_syl] #todo: check nid of 1391477462767

Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,...,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilar,hasHomophone,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup
242,1346057958628,fromnewspaper inspect fromdict history culture,東京電力福島・第１原発事故,とうきょうでんりょくふくしま・だいいちげんぱつじこ,2012-08-27 08:59:18.628,2019-05-05 16:13:58,0,0,0,0,...,0,0,0,0,0,0,0,13,25,[9: ]
308,1346215143756,numeric fromdict datesandtime,1837～1901年,せんはっぴゃくさんじゅうななねんからせんきゅうひゃくいちねん,2012-08-29 04:39:03.756,2019-04-21 16:58:53,0,0,0,0,...,0,0,0,0,0,0,0,10,30,[9: ]
421,1346216471844,fromdict numeric counter datesandtime,千九百八十九年,せんきゅうひゃくはちじゅうきゅうねん,2012-08-29 05:01:11.844,2017-12-20 02:00:49,0,0,0,0,...,0,0,0,0,0,0,0,7,18,[5:8]
5542,1387411183585,numeric datesandtime,千九百八十七年,せんきゅうひゃくはちじゅうななねん,2013-12-18 23:59:43.585,2017-12-20 02:00:49,0,0,0,0,...,0,0,0,0,0,0,0,7,17,[5:8]


In [79]:
df_notes_010_with_len.columns.values

array(['nid', 'tags', 'Term', 'Yomi1', 'NoteCreated', 'LastModified',
       'commonword', 'clothing', 'animal', 'body', 'food', 'place',
       'textbook', 'college', 'fromdict', 'fromexam', 'len1', 'n1', 'n2',
       'n3', 'n4', 'n5', 'katakana', 'hiragana', 'noun', 'verb', 'convo',
       'hasnotags', 'hasVisual', 'hasAudio', 'hasMultiMeaning',
       'hasMultiReading', 'hasSimilar', 'hasHomophone', 'hasAltForm',
       'hasRichExamples', 'TermLen', 'Syllables', 'TermLenGroup'],
      dtype=object)

In [80]:
# labels terms by their JLPT level.
# bear in mind that some terms are tagged with multiple JLPT levels.
# this function assigns the easiest level present (N5 is checked first, N1 last).
def label_jlpt_lvl(row):
    if row['n5'] == 1:
        return 5
    elif row['n4'] == 1:
        return 4
    elif row['n3'] == 1:
        return 3
    elif row['n2'] == 1:
        return 2
    elif row['n1'] == 1:
        return 1
    else:
        return None

### 2.2.26. Assign JLPT number to words with JLPT "N" levels in notes data frame

In [81]:
df_notes_011_jptl_lvl = df_notes_010_with_len.copy()
df_notes_011_jptl_lvl['jlpt_lvl_d'] = df_notes_011_jptl_lvl.apply(label_jlpt_lvl, axis=1)

In [82]:
df_notes_011_jptl_lvl.columns.values

array(['nid', 'tags', 'Term', 'Yomi1', 'NoteCreated', 'LastModified',
       'commonword', 'clothing', 'animal', 'body', 'food', 'place',
       'textbook', 'college', 'fromdict', 'fromexam', 'len1', 'n1', 'n2',
       'n3', 'n4', 'n5', 'katakana', 'hiragana', 'noun', 'verb', 'convo',
       'hasnotags', 'hasVisual', 'hasAudio', 'hasMultiMeaning',
       'hasMultiReading', 'hasSimilar', 'hasHomophone', 'hasAltForm',
       'hasRichExamples', 'TermLen', 'Syllables', 'TermLenGroup',
       'jlpt_lvl_d'], dtype=object)

In [83]:
df_notes_011_jptl_lvl['jlpt_lvl_d'].value_counts()

3.0    160
5.0    100
4.0     84
1.0     39
2.0     36
Name: jlpt_lvl_d, dtype: int64
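The levels print as `3.0`, `5.0` and so on because rows without a JLPT tag return `None`, which pandas stores as `NaN` in a float column. If whole-number labels are preferred, one option is pandas' nullable integer dtype `Int64`. The sketch below uses made-up flags and a hypothetical helper name, `easiest_jlpt_level`, that mirrors the N5-first rule.

```python
import pandas as pd

tag_flags = pd.DataFrame({"n1": [0, 1, 0], "n3": [1, 0, 0], "n5": [1, 0, 0]})


def easiest_jlpt_level(row):
    """Return the easiest JLPT level flagged on the row (N5 first), or None."""
    for level in (5, 4, 3, 2, 1):
        if row.get(f"n{level}", 0) == 1:
            return level
    return None


levels = tag_flags.apply(easiest_jlpt_level, axis=1).astype("Int64")
print(levels.value_counts())
```

Rows with no JLPT flag stay missing (`<NA>`) rather than being forced to a number.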

### 2.2.27. Create df_notes_012_mid_section_2 data frame for progress saving

In [84]:
#df_combo_005_notes_galore
df_notes_012_mid_section_2 = df_notes_011_jptl_lvl.copy()
print(df_notes_012_mid_section_2.shape)
df_notes_012_mid_section_2.head()

(8047, 40)


Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,...,hasMultiMeaning,hasMultiReading,hasSimilar,hasHomophone,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,jlpt_lvl_d
1,1331799797112,fromdict,隙間,すきま,2012-03-15 08:23:17.112,2017-11-23 23:58:09,0,0,0,0,...,0,0,1,0,0,0,2,3,[2],
2,1331799797113,fromdict,苦汁,にがり,2012-03-15 08:23:17.113,2017-11-23 23:58:09,0,0,0,0,...,0,0,0,0,0,0,2,3,[2],
3,1331799797114,fromdict,移籍,いせき,2012-03-15 08:23:17.114,2017-11-23 23:58:09,0,0,0,0,...,0,0,0,0,0,0,2,3,[2],
5,1331799797117,verb fromdict,吊るす,つるす,2012-03-15 08:23:17.117,2017-11-23 23:58:09,0,0,0,0,...,0,0,1,0,0,0,3,3,[3:4],
6,1331799797118,checked convo fromdict,和やか,なごやか,2012-03-15 08:23:17.118,2017-11-23 23:58:09,0,0,0,0,...,0,0,0,0,0,0,3,4,[3:4],


### 2.2.28. Export df_notes_012_mid_section_2

In [85]:
df_notes_012_mid_section_2.to_csv('datasets/df_notes_012_mid_section_2.csv')

***
- [Previous section: Notes](#notes)
- [Next section: Review Log](#revlog)
***
# <a name="combo"></a> Combo of Notes & Cards

### 2.3.1. Merge card & note data frames to conduct cross analysis

In [87]:
# now that we have note id's for all the words, we can
# join together these separate dataframes
df_combo = pd.merge(df_notes_012_mid_section_2, df_cards_008_mid_section_2, on='nid')
print(df_combo.shape)
df_combo.head()

(7964, 53)


Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,...,reps,lapses,CardCreated,DueDate,ivl_q,factor_q,CardType_listen,CardType_look,CardType_read,CardType_recall
0,1331799797112,fromdict,隙間,すきま,2012-03-15 08:23:17.112,2017-11-23 23:58:09,0,0,0,0,...,8,1,2012-03-15 08:23:17.112,2015-03-03 09:00:00,0,2,0,0,1,0
1,1331799797114,fromdict,移籍,いせき,2012-03-15 08:23:17.114,2017-11-23 23:58:09,0,0,0,0,...,7,0,2012-03-15 08:23:17.114,2015-02-04 09:00:00,0,1,0,0,1,0
2,1331799797117,verb fromdict,吊るす,つるす,2012-03-15 08:23:17.117,2017-11-23 23:58:09,0,0,0,0,...,6,1,2012-03-15 08:23:17.117,2015-03-17 09:00:00,0,2,0,0,1,0
3,1331799797118,checked convo fromdict,和やか,なごやか,2012-03-15 08:23:17.118,2017-11-23 23:58:09,0,0,0,0,...,15,3,2012-03-15 08:23:17.118,2015-02-06 09:00:00,0,1,0,0,1,0
4,1331799797121,fromdict,営業日,えいぎょうび,2012-03-15 08:23:17.121,2017-11-23 23:58:09,0,0,0,0,...,6,1,2012-03-15 08:23:17.121,2015-03-03 09:00:00,0,2,0,0,1,0
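The merge above is an inner join, so notes without any remaining card drop out (8047 notes go in, 7964 rows come out), and a note with several cards appears once per card. A hedged sketch of how that relationship could be checked, using the `validate` and `indicator` options of `pd.merge` on made-up frames:

```python
import pandas as pd

notes = pd.DataFrame({"nid": [1, 2, 3], "Term": ["隙間", "苦汁", "移籍"]})
cards = pd.DataFrame({"cid": [10, 11, 12], "nid": [1, 1, 3], "reps": [8, 4, 7]})

# inner join (the default): one output row per card whose note is present;
# validate raises if a note id is repeated on the notes side
combo = pd.merge(notes, cards, on="nid", validate="one_to_many")

# left join with an indicator column to find notes that have no cards at all
audit = pd.merge(notes, cards, on="nid", how="left", indicator=True)
notes_without_cards = audit.loc[audit["_merge"] == "left_only", "nid"].tolist()
print(combo.shape, notes_without_cards)
```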


Let's further refine the dataframe entries to represent which notes have (1) visual data, (2) audio data, and (3) an L1 ("first language", English in this case) translation. We can represent these with binary values (0 for absent, 1 for present).

### 2.3.2. Group notes by ID to determine card type overlap, simple totals per note

In [88]:
# https://stackoverflow.com/questions/23919563/merge-rows-of-a-dataframe-in-pandas-based-on-a-column
# https://stackoverflow.com/questions/13851535/how-to-delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression
df_combo_001_grouped_notes = df_combo.copy()
df_combo_001_grouped_notes = df_combo_001_grouped_notes.drop(
    ['cid','hasAltForm','TermLen','Syllables','jlpt_lvl_d','katakana','hiragana','noun','verb',
     'convo','commonword','clothing','animal','body','food','place','textbook','college','fromdict',
     'fromexam','len1','n1','n2','n3','n4','n5','ivl_q','factor_q','hasVisual','hasAudio','hasMultiMeaning',
     'hasMultiReading','hasSimilar','hasHomophone','hasRichExamples','hasnotags'],axis=1)
df_combo_001_grouped_notes = df_combo_001_grouped_notes.groupby(['nid']).sum()
df_combo_001_grouped_notes.head()

nid,ivl,factor,reps,lapses,CardType_listen,CardType_look,CardType_read,CardType_recall
1331799797112,149,2080,8,1,0,0,1,0
1331799797114,99,1980,7,0,0,0,1,0
1331799797117,143,2130,6,1,0,0,1,0
1331799797118,74,1880,15,3,0,0,1,0
1331799797121,132,2130,6,1,0,0,1,0


In [89]:
# this data frame will provide total reps per term, total lapses per term, and vectors (card types) per term 
df_combo_001_grouped_notes.tail(20)[-5:]

df_combo_002_note_totals = df_combo_001_grouped_notes.copy()
df_combo_002_note_totals = df_combo_002_note_totals.drop(['ivl','factor'],axis=1)
df_combo_002_note_totals.tail(20)[-5:]

nid,reps,lapses,CardType_listen,CardType_look,CardType_read,CardType_recall
1517489889767,2,0,0,1,0,0
1523892839900,33,4,1,1,1,1
1549184119039,8,0,0,0,1,1
1550402953788,13,2,0,0,1,1
1550403040864,7,0,0,0,1,1


In [90]:
df_combo_002_note_totals = df_combo_002_note_totals.rename(
    columns={
        'reps':'reps_total', 'lapses':'lapses_total', 'CardType_listen':'hasListenCard',
        'CardType_recall':'hasTranslateCard', 'CardType_read':'hasReadCard',
        'CardType_look':'hasPictureCard'
    }
)

In [91]:
df_combo_002_note_totals.tail(20)[-5:]

nid,reps_total,lapses_total,hasListenCard,hasPictureCard,hasReadCard,hasTranslateCard
1517489889767,2,0,0,1,0,0
1523892839900,33,4,1,1,1,1
1549184119039,8,0,0,0,1,1
1550402953788,13,2,0,0,1,1
1550403040864,7,0,0,0,1,1


### 2.3.3. Group notes by ID to find means per note

In [92]:
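# NOTE: df_combo_001_grouped_notes already holds one summed row per nid, so each group
# below contains a single row and the resulting "means" equal the per-note sums
# todo: average over the card-level rows in df_combo instead
df_combo_003_note_means = df_combo_001_grouped_notes.copy()
df_combo_003_note_means = df_combo_003_note_means.groupby(['nid']).mean()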

In [93]:
df_combo_003_note_means = df_combo_003_note_means.drop(['CardType_listen','CardType_recall',
    'CardType_read', 'CardType_look'],axis=1)
df_combo_003_note_means = df_combo_003_note_means.rename(
    columns={'ivl':'mean_ivl','factor':'mean_factor','reps':'mean_reps','lapses':'mean_lapses'})
df_combo_003_note_means.tail(20)[-5:]

nid,mean_ivl,mean_factor,mean_reps,mean_lapses
1517489889767,1,2410,2,0
1523892839900,1344,8880,33,4
1549184119039,195,4820,8,0
1550402953788,100,4420,13,2
1550403040864,175,4820,7,0
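In the output above, the mean values equal the totals shown earlier (33 reps for note 1523892839900 in both tables), which is what happens when a mean is taken over data that has already been reduced to one row per note. A possible way to get totals and true per-card averages in one pass is named aggregation on the card-level rows. The frame below is made up for illustration.

```python
import pandas as pd

# hypothetical card-level rows: note 1 has two cards, note 2 has one
cards = pd.DataFrame({
    "nid": [1, 1, 2],
    "reps": [10, 20, 6],
    "lapses": [1, 3, 0],
})

per_note = cards.groupby("nid").agg(
    reps_total=("reps", "sum"),
    mean_reps=("reps", "mean"),
    lapses_total=("lapses", "sum"),
    mean_lapses=("lapses", "mean"),
)
print(per_note)
```

For note 1 the total is 30 reps while the mean is 15 per card, so the two columns now carry different information.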


### 2.3.4. Combine note totals, note means & general notes

In [94]:
df_combo_004_notes_galore = pd.merge(df_combo_003_note_means, df_combo_002_note_totals,on='nid')
df_combo_005_notes_galore = pd.merge(df_combo,df_combo_004_notes_galore,on='nid')
# https://stackoverflow.com/questions/47022070/display-all-dataframe-columns-in-a-jupyter-python-notebook
pd.options.display.max_columns = None
# drop card specific columns, these are no longer valid
df_combo_005_notes_galore = df_combo_005_notes_galore.drop(['cid','ivl','factor','reps','lapses','CardCreated','DueDate','ivl_q','factor_q','CardType_listen','CardType_look','CardType_read','CardType_recall'],axis=1)
df_combo_005_notes_galore.head(10)

Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,food,place,textbook,college,fromdict,fromexam,len1,n1,n2,n3,n4,n5,katakana,hiragana,noun,verb,convo,hasnotags,hasVisual,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilar,hasHomophone,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,jlpt_lvl_d,mean_ivl,mean_factor,mean_reps,mean_lapses,reps_total,lapses_total,hasListenCard,hasPictureCard,hasReadCard,hasTranslateCard
0,1331799797112,fromdict,隙間,すきま,2012-03-15 08:23:17.112,2017-11-23 23:58:09,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,2,3,[2],,149,2080,8,1,8,1,0,0,1,0
1,1331799797114,fromdict,移籍,いせき,2012-03-15 08:23:17.114,2017-11-23 23:58:09,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,3,[2],,99,1980,7,0,7,0,0,0,1,0
2,1331799797117,verb fromdict,吊るす,つるす,2012-03-15 08:23:17.117,2017-11-23 23:58:09,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,3,3,[3:4],,143,2130,6,1,6,1,0,0,1,0
3,1331799797118,checked convo fromdict,和やか,なごやか,2012-03-15 08:23:17.118,2017-11-23 23:58:09,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,3,4,[3:4],,74,1880,15,3,15,3,0,0,1,0
4,1331799797121,fromdict,営業日,えいぎょうび,2012-03-15 08:23:17.121,2017-11-23 23:58:09,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,6,[3:4],,132,2130,6,1,6,1,0,0,1,0
5,1331799797122,fromdict,在庫,ざいこ,2012-03-15 08:23:17.122,2017-11-23 23:58:09,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,3,[2],,224,2130,5,0,5,0,0,0,1,0
6,1331799797126,fromdict,有能,ゆうのう,2012-03-15 08:23:17.126,2019-03-23 22:24:15,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,2,4,[2],,248,2130,9,0,9,0,0,0,1,0
7,1331799797127,waseigo katakana fromdict,公衆トイレ,こうしゅうトイレ,2012-03-15 08:23:17.127,2019-03-26 14:13:19,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,5,8,[5:8],,815,4490,18,0,18,0,0,1,1,0
8,1331799797127,waseigo katakana fromdict,公衆トイレ,こうしゅうトイレ,2012-03-15 08:23:17.127,2019-03-26 14:13:19,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,5,8,[5:8],,815,4490,18,0,18,0,0,1,1,0
9,1331799797128,fromdict,送り賃,おくりちん,2012-03-15 08:23:17.128,2017-11-23 23:58:09,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,5,[3:4],,178,2120,8,0,8,0,0,0,1,0
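Rows 7 and 8 above are identical: the note has more than one card, so merging on `nid` repeats it once per card, and dropping the card-specific columns leaves exact copies behind. If the goal is one row per note, a sketch along these lines could follow the drop (the data here is illustrative):

```python
import pandas as pd

combo = pd.DataFrame({
    "nid": [7, 7, 9],
    "Term": ["公衆トイレ", "公衆トイレ", "送り賃"],
    "reps_total": [18, 18, 8],
})

# keep the first row for each note id and renumber the index
one_row_per_note = combo.drop_duplicates(subset="nid").reset_index(drop=True)
assert one_row_per_note["nid"].is_unique
print(one_row_per_note)
```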


### 2.3.5. Inspect combo dtypes

In [95]:
# strategies to fix dtypes: https://stackoverflow.com/questions/28910851/python-pandas-changing-some-column-types-to-categories
print(df_combo_005_notes_galore.dtypes.value_counts())
df_combo_005_notes_galore.dtypes

int64             30
int32              9
uint8              4
object             3
datetime64[ns]     2
category           1
float64            1
dtype: int64


nid                          int64
tags                        object
Term                        object
Yomi1                       object
NoteCreated         datetime64[ns]
LastModified        datetime64[ns]
commonword                   int64
clothing                     int64
animal                       int64
body                         int64
food                         int64
place                        int64
textbook                     int64
college                      int64
fromdict                     int64
fromexam                     int64
len1                         int64
n1                           int64
n2                           int64
n3                           int64
n4                           int64
n5                           int64
katakana                     int64
hiragana                     int64
noun                         int64
verb                         int64
convo                        int64
hasnotags                    int32
hasVisual           

### 2.3.6. Fix combo dtypes

In [96]:
convert_bool_list = ['hasRichExamples']

for item in convert_bool_list:
    cast_to_typ(df_combo_005_notes_galore, item, int)

convert_category_list = ['jlpt_lvl_d']

# https://stackoverflow.com/questions/28910851/python-pandas-changing-some-column-types-to-categories
for col in convert_category_list:
    df_combo_005_notes_galore[col] = df_combo_005_notes_galore[col].astype('category')

convert_dates_list = ['NoteCreated','LastModified']

# pd.to_datetime instead of astype('datetime64'): recent pandas versions reject the unit-less dtype
for col in convert_dates_list:
    df_combo_005_notes_galore[col] = pd.to_datetime(df_combo_005_notes_galore[col])
    
# todo: use the following three lists as a rough guide to start prepping dataframes
# for export & analysis in section 3!!! ^_^
binary_list = ['commonword','clothing','animal','body','food','place','textbook','college',
    'fromdict','fromexam','len1','n1','n2','n3','n4','n5','katakana','hiragana','noun','verb',
    'convo','hasnotags','hasVisual','hasAudio','hasMultiMeaning','hasMultiReading','hasSimilar',
    'hasHomophone','hasAltForm','hasRichExamples','hasListenCard','hasPictureCard','hasReadCard',
    'hasTranslateCard']
continuous_list = ['TermLen','Syllables','mean_ivl','mean_factor','mean_reps','mean_lapses',
                   'reps_total','lapses_total']
discrete_non_binary_list = ['NoteCreated','LastModified','TermLenGroup','jlpt_lvl_d']
    
df_combo_005_notes_galore.dtypes

nid                          int64
tags                        object
Term                        object
Yomi1                       object
NoteCreated         datetime64[ns]
LastModified        datetime64[ns]
commonword                   int64
clothing                     int64
animal                       int64
body                         int64
food                         int64
place                        int64
textbook                     int64
college                      int64
fromdict                     int64
fromexam                     int64
len1                         int64
n1                           int64
n2                           int64
n3                           int64
n4                           int64
n5                           int64
katakana                     int64
hiragana                     int64
noun                         int64
verb                         int64
convo                        int64
hasnotags                    int32
hasVisual           
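
The three kinds of conversion used in this step (flag to integer, label to category, text to datetime) can be tried on a toy frame first. The sketch below also checks a column-name list against the frame, which catches misspelled names before they are used for selection. The `toy` frame and the deliberately misspelled name are assumptions for illustration.

```python
import pandas as pd

# Toy frame with one column per conversion kind (values are illustrative)
toy = pd.DataFrame({
    "hasRichExamples": [True, False, True],
    "jlpt_lvl_d": [5.0, None, 5.0],
    "NoteCreated": ["2018-04-16 15:33:59.900",
                    "2019-02-03 08:55:19.039",
                    "2019-02-17 11:29:13.788"],
})

toy["hasRichExamples"] = toy["hasRichExamples"].astype(int)   # True/False -> 1/0
toy["jlpt_lvl_d"] = toy["jlpt_lvl_d"].astype("category")      # NaN stays missing
toy["NoteCreated"] = pd.to_datetime(toy["NoteCreated"])       # text -> datetime

# Sanity check: every name in a column list should exist in the frame.
# The second name carries a deliberate trailing space to show what gets caught.
names_to_check = ["hasRichExamples", "hasRichExamples "]
missing = [name for name in names_to_check if name not in toy.columns]
print(missing)
```

Running the same membership check over `binary_list`, `continuous_list` and `discrete_non_binary_list` before section 3 would flag any name that does not match a real column.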

### 2.3.7. Create df_combo_006_final_section_2 data frame for export

In [97]:
df_combo_006_final_section_2 = df_combo_005_notes_galore.copy()
print(df_combo_006_final_section_2.shape)
df_combo_006_final_section_2.tail(10)

(7964, 50)


Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,food,place,textbook,college,fromdict,fromexam,len1,n1,n2,n3,n4,n5,katakana,hiragana,noun,verb,convo,hasnotags,hasVisual,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilar,hasHomophone,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,jlpt_lvl_d,mean_ivl,mean_factor,mean_reps,mean_lapses,reps_total,lapses_total,hasListenCard,hasPictureCard,hasReadCard,hasTranslateCard
7954,1523892839900,n5 noun commonword,万年筆,まんねんひつ,2018-04-16 15:33:59.900,2018-05-14 17:58:39,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,1,0,0,0,0,0,0,3,6,[3:4],5.0,1344,8880,33,4,33,4,1,1,1,1
7955,1523892839900,n5 noun commonword,万年筆,まんねんひつ,2018-04-16 15:33:59.900,2018-05-14 17:58:39,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,1,0,0,0,0,0,0,3,6,[3:4],5.0,1344,8880,33,4,33,4,1,1,1,1
7956,1523892839900,n5 noun commonword,万年筆,まんねんひつ,2018-04-16 15:33:59.900,2018-05-14 17:58:39,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,1,0,0,0,0,0,0,3,6,[3:4],5.0,1344,8880,33,4,33,4,1,1,1,1
7957,1523892839900,n5 noun commonword,万年筆,まんねんひつ,2018-04-16 15:33:59.900,2018-05-14 17:58:39,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,1,0,0,0,0,0,0,3,6,[3:4],5.0,1344,8880,33,4,33,4,1,1,1,1
7958,1549184119039,hasnotags,閏年,うるうどし,2019-02-03 08:55:19.039,2019-02-06 01:06:25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2,5,[2],,195,4820,8,0,8,0,0,0,1,1
7959,1549184119039,hasnotags,閏年,うるうどし,2019-02-03 08:55:19.039,2019-02-06 01:06:25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2,5,[2],,195,4820,8,0,8,0,0,0,1,1
7960,1550402953788,suruverb commonword fromdict noun convo,段取り,だんどり,2019-02-17 11:29:13.788,2019-04-21 19:20:46,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,3,4,[3:4],,100,4420,13,2,13,2,0,0,1,1
7961,1550402953788,suruverb commonword fromdict noun convo,段取り,だんどり,2019-02-17 11:29:13.788,2019-04-21 19:20:46,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,3,4,[3:4],,100,4420,13,2,13,2,0,0,1,1
7962,1550403040864,hasnotags,触角,しょっかく,2019-02-17 11:30:40.864,2019-02-17 11:31:24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2,5,[2],,175,4820,7,0,7,0,0,0,1,1
7963,1550403040864,hasnotags,触角,しょっかく,2019-02-17 11:30:40.864,2019-02-17 11:31:24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2,5,[2],,175,4820,7,0,7,0,0,0,1,1


### 2.3.8. Export df_combo_006_final_section_2

In [98]:
df_combo_006_final_section_2.to_csv('datasets/df_combo_006_final_section_2.csv')
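
CSV keeps neither the row index nor the datetime dtype, so it is worth knowing how to read the export back. A minimal round-trip sketch, using an in-memory buffer instead of a file and a made-up two-column `toy` frame:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "nid": [1549184119039, 1550403040864],
    "NoteCreated": pd.to_datetime(["2019-02-03 08:55:19.039",
                                   "2019-02-17 11:30:40.864"]),
})

buffer = io.StringIO()
toy.to_csv(buffer)   # default index=True writes the row index as an unnamed first column
buffer.seek(0)

# index_col=0 restores the index instead of creating an "Unnamed: 0" column;
# parse_dates converts the timestamp text back into datetimes
restored = pd.read_csv(buffer, index_col=0, parse_dates=["NoteCreated"])
print(restored.dtypes)
```

The category dtype is also lost in a CSV round trip, so a column such as `jlpt_lvl_d` would need `astype('category')` again after loading.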

***
- [Previous section: Combo of Notes & Cards](#combo)
- [Go to bottom](#bottom)
***
# <a name="revlog"></a> Review Log

### 2.4.1. Import Review Log data

In [99]:
df_revlog = pd.read_sql_query("SELECT * FROM revlog", cnx)

In [100]:
print(df_revlog.shape)
df_revlog.head()

(114257, 9)


Unnamed: 0,id,cid,usn,ease,ivl,lastIvl,factor,time,type
0,1332393018515,1331799797110,0,1,0,0,2500,6673,0
1,1333279992123,1331799797110,0,4,8,0,2600,11656,0
2,1333280001016,1331799797112,0,4,8,0,2600,8887,0
3,1333280097922,1331799797113,0,1,0,0,2500,29162,0
4,1333280107916,1331799797114,0,4,8,0,2600,9987,0
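
`SELECT *` loads the full table; when only one card's history is needed, the filter can be pushed into the query instead. The sketch below uses an in-memory SQLite database with a reduced, made-up `revlog` schema (three columns only), so it is an illustration of `pd.read_sql_query` with parameters rather than a copy of the Anki table.

```python
import sqlite3
import pandas as pd

# In-memory database with a reduced stand-in for the revlog table
cnx_demo = sqlite3.connect(":memory:")
cnx_demo.execute("CREATE TABLE revlog (id INTEGER, cid INTEGER, ease INTEGER)")
cnx_demo.executemany(
    "INSERT INTO revlog VALUES (?, ?, ?)",
    [
        (1332393018515, 1331799797110, 1),
        (1333279992123, 1331799797110, 4),
        (1333280001016, 1331799797112, 4),
    ],
)
cnx_demo.commit()

# "?" placeholders keep the card id out of the SQL string itself
df_demo = pd.read_sql_query(
    "SELECT id, cid, ease FROM revlog WHERE cid = ? ORDER BY id",
    cnx_demo,
    params=(1331799797110,),
)
cnx_demo.close()
print(df_demo)
```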


In [101]:
df_revlog_001_review_date = df_revlog.copy()
df_revlog_001_review_date = df_revlog_001_review_date.rename(columns={'id':'rid'})
df_revlog_001_review_date['ReviewDate']= pd.to_datetime(df_revlog_001_review_date['rid'],unit='ms')
#df_revlog_001_review_date['ReviewDate'] = df_revlog_001_review_date['ReviewDate'].dt.date
df_revlog_001_review_date.head()

#assertEquals(df_revlog_001_review_date['rid'].iloc[0], 1332393018515, "Review ID is in epoch milliseconds")
#assertEquals(str(df_revlog_001_review_date['ReviewDate'].iloc[0].date()), "2012-03-22", "Review date is in year-month-day format")

Unnamed: 0,rid,cid,usn,ease,ivl,lastIvl,factor,time,type,ReviewDate
0,1332393018515,1331799797110,0,1,0,0,2500,6673,0,2012-03-22 05:10:18.515
1,1333279992123,1331799797110,0,4,8,0,2600,11656,0,2012-04-01 11:33:12.123
2,1333280001016,1331799797112,0,4,8,0,2600,8887,0,2012-04-01 11:33:21.016
3,1333280097922,1331799797113,0,1,0,0,2500,29162,0,2012-04-01 11:34:57.922
4,1333280107916,1331799797114,0,4,8,0,2600,9987,0,2012-04-01 11:35:07.916
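
The review ID doubles as a timestamp in epoch milliseconds, which is why `unit='ms'` is needed. A small sketch with the first two IDs from the output above; the `review_day` column is an added illustration of how to keep only the calendar day while staying in a datetime dtype.

```python
import pandas as pd

rids = pd.Series([1332393018515, 1333279992123], name="rid")

# Epoch milliseconds -> timezone-naive timestamps (UTC-based)
review_ts = pd.to_datetime(rids, unit="ms")

# normalize() zeroes the time of day but keeps the datetime dtype,
# whereas .dt.date would produce plain Python date objects
review_day = review_ts.dt.normalize()

print(review_ts.iloc[0])
print(review_day.iloc[0].date())
```

Keeping the datetime dtype makes later grouping by day, week or month possible through the `.dt` accessor.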


In [102]:
current_note_id = get_rows_by_value_in_col(df_cards_008_mid_section_2, df_revlog['cid'].iloc[0], 'cid')['nid'].iloc[0]

# get_rows_by_value_in_col(df_notes_003_tags_no_dups, note_id_1,'nid')
#print("Note ID: ", get_rows_by_value_in_col(df_notes_012_mid_section_2,current_note_id,'nid'))

#get_rows_by_value_in_col(df_notes_012_mid_section_2,1331799797110,'nid').shape

#print("Term: ", get_rows_by_value_in_col(df_notes_012_mid_section_2,current_note_id,'nid')['Term'].iloc[0])
#print("Translation: ", get_rows_by_value_in_col(df_notes_012_mid_section_2,current_note_id,'nid')['Translation'].iloc[0])

#inspect_card_by_id(df_cards_008_mid_section_2, df_revlog['cid'].iloc[0], 'id')
#get_rows_by_value_in_col(df_cards_008_mid_section_2, df_revlog['cid'].iloc[0],'cid')

In [103]:
get_rows_by_value_in_col(df_revlog_001_review_date, df_revlog['cid'].iloc[0], 'cid')

Unnamed: 0,rid,cid,usn,ease,ivl,lastIvl,factor,time,type,ReviewDate
0,1332393018515,1331799797110,0,1,0,0,2500,6673,0,2012-03-22 05:10:18.515
1,1333279992123,1331799797110,0,4,8,0,2600,11656,0,2012-04-01 11:33:12.123
80368,1397571358201,1331799797110,4480,1,-60,-60,2500,4292,0,2014-04-15 14:15:58.201
80369,1397571360841,1331799797110,4480,2,-600,-60,2500,2636,0,2014-04-15 14:16:00.841
80370,1397571363081,1331799797110,4480,2,1,-600,2280,2238,0,2014-04-15 14:16:03.081
80377,1397622541113,1331799797110,4490,3,2,1,2280,4023,1,2014-04-16 04:29:01.113
83544,1400914850867,1331799797110,4958,2,12,2,2130,3323,1,2014-05-24 07:00:50.867
93052,1410177777778,1331799797110,6257,2,44,12,1980,2300,1,2014-09-08 12:02:57.778
98902,1414062295845,1331799797110,6748,2,51,44,1830,16176,1,2014-10-23 11:04:55.845
104064,1420285596480,1331799797110,7154,2,65,51,1680,11880,1,2015-01-03 11:46:36.480
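
`get_rows_by_value_in_col` is defined earlier in the notebook. Judging only from how it is called here (frame, value, column name), it behaves like a boolean-mask filter. The function below, `rows_where_equal`, is a hypothetical stand-in written for illustration and may differ from the real helper.

```python
import pandas as pd

def rows_where_equal(df, value, col):
    """Return the rows of df whose column `col` equals `value` (hypothetical stand-in)."""
    return df.loc[df[col] == value]

# Toy review log: two reviews of one card, one review of another
toy_revlog = pd.DataFrame({
    "rid": [1332393018515, 1333279992123, 1333280001016],
    "cid": [1331799797110, 1331799797110, 1331799797112],
    "ease": [1, 4, 4],
})

history = rows_where_equal(toy_revlog, 1331799797110, "cid")
print(history)
```

The original row labels are kept in the result, which matches the non-contiguous index values (0, 1, 80368, ...) in the output above.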


In [104]:
# todo: put all revlog data per card in a cell alongside each card in the cards data frame
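
One possible way to approach this todo is to collapse the review log to one row per card with `groupby`, then left-merge that onto the cards frame. This is only a sketch under assumptions: the toy frames, the small card IDs, and the column names `review_count` and `ease_history` are all made up for illustration.

```python
import pandas as pd

# Toy review log: card 1 reviewed twice, card 2 once, card 3 never
toy_revlog = pd.DataFrame({
    "rid": [1332393018515, 1333279992123, 1333280001016],
    "cid": [1, 1, 2],
    "ease": [1, 4, 4],
})
toy_cards = pd.DataFrame({"cid": [1, 2, 3]})

# One row per card: number of reviews plus the ease answers in review order
per_card = (
    toy_revlog.sort_values("rid")
    .groupby("cid")
    .agg(review_count=("rid", "size"), ease_history=("ease", list))
    .reset_index()
)

# Left merge keeps cards without any reviews (their new columns are NaN)
cards_with_history = toy_cards.merge(per_card, on="cid", how="left")
print(cards_with_history)
```

Storing lists in cells is convenient for inspection but awkward for analysis, so summary columns such as counts or means may be the better fit for section 3.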

### <a id="bottom"></a> Hi there! Want to go back [to the top](#top)?