<a name="top"></a>Vocab Analysis 
***
# Section 2: Prepare the Data
## [A. Cards](#cards)
- 2.1.1. Import card data into data frame "df_cards"
- 2.1.2. Confirm that card data model matches expected format
- 2.1.3. Shallow check for duplicates (matching rows)
- 2.1.4. Remove unneeded card dataframe columns, rename 'id' to 'cid' (card id)
- 2.1.5. Generate Card Creation Date from Card ID
- 2.1.6. Remove cards with no study data associated with them, cards that have been suspended from study
- 2.1.7. Confirm that no cards considered 'in learning' are present
- 2.1.8. Generate Due Date from Due Value
- 2.1.9. Label card types by their names, & drop outlier
- 2.1.10. Create interval quantile buckets for visualization purposes
- 2.1.11. Create dummy variables for card types
- 2.1.12. Create df_cards_008_mid_section_2 data frame for progress saving
- 2.1.13. Export df_cards_012_mid_section_2

## [B. Notes](#notes)


## [C. Combo](#combo)


### [bottom of page](#bottom)

### 2.0.1. Import libraries

In [1]:
import pandas as pd
import sqlite3
import json
from datetime import datetime, timedelta, date
import time

### 2.0.2. Import Data

In [2]:
location = "datasets/collection.anki2"
cnx = sqlite3.connect(location) # create sql file connection

In [3]:
# TDD backbone assertion to confirm a function call returns the desired result
def assertEquals(actual, expected, desc):
    assert(actual==expected), desc + " result: " + str(actual) + ", expected: " + str(expected)
    return "OK"

### 2.0.3. Extract Deck Creation Date

In [4]:
df_c = pd.read_sql_query("SELECT * FROM col", cnx)
crt = df_c['crt'][0] # save collection creation date (in epoch time)
pd_crt = pd.to_datetime(crt, unit = 's')
print(pd_crt)

assertEquals(str(pd_crt), "2013-01-08 09:00:00", "Collection Creation Date")

2013-01-08 09:00:00


'OK'

### 2.0.4. Extract field names to label columns

In [5]:
field_names = []
for row_index, blob in df_c['models'].items():
    for model_id, data in json.loads(blob).items():
        field_names += list(map(lambda fld: fld['name'], data['flds']))
field_names.append('Tags')
expected_names = ['Term', 'Yomi1', 'Translation', 'Translation2', 'Translation3', 'AlternateForms',
    'PartOfSpeech', 'Sound', 'Sound2', 'Sound3', 'Examples', 'ExamplesAudio', 'AtoQ', 'AtoQaudio',
    'AtoQkana', 'AtoQtranslation', 'QandApicture', 'answerPicture', 'Meaning1', 'SimilarWords',
    'RelatedWords', 'Breakdown1', 'Comparison', 'Usage', 'Prompt1', 'Prompt2', 'KakuMCD', 'IuMCD',
    'ExtraMemo', 'Yomi2', 'Meaning2', 'Breakdown2', 'Picture1', 'Picture2', 'Picture3', 'Picture4',
    'HinshiMarker', 'Hint', 'Term2','ArabicNumeral', 'CounterKanji', 'Mnemonic', 'SameSoundWords',
    'Yomi3', 'gChap', 'gBook', 'semester', 'gNumber', 'Transliteration','SoloLookCards',
    'TagOverflow', 'blank1', 'blank2', 'Tags']

In [6]:
assertEquals(field_names, expected_names, "Field Names")

'OK'
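To make the loop above easier to follow, here is a minimal sketch that runs the same extraction on a hand-made stand-in for the `models` blob. The model id, the model name, and the two-field layout are hypothetical; a real collection stores many more keys per model.

```python
import json

# Hypothetical, heavily trimmed stand-in for the JSON stored in col.models
models_blob = json.dumps({
    "1342697561419": {
        "name": "Vocab",
        "flds": [{"name": "Term", "ord": 0}, {"name": "Yomi1", "ord": 1}],
    }
})

# Collect every field name, model by model, in stored order
sample_field_names = [
    fld["name"]
    for model in json.loads(models_blob).values()
    for fld in model["flds"]
]
sample_field_names.append("Tags")  # tags live in their own column, not in 'flds'
print(sample_field_names)
```

With more than one model in the blob, the names of all models are concatenated, so the expected list only lines up when the collection holds a single note type.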

***
- [Back to the top](#top)
- [Next section: Notes](#notes)
***

# <a name="cards"></a> Cards

### 2.1.1. Import card data into data frame "df_cards"

In [7]:
# Take in study data (cards) from the Anki collection
df_cards = pd.read_sql_query("SELECT * FROM cards", cnx)
assertEquals(df_cards.shape[0],19514,"Rows")#6386, 21979, 19363, 19314
assertEquals(df_cards.shape[1],18,"Columns")

'OK'

### 2.1.2. Confirm that card data model matches expected format

In [8]:
expected_columns_1 = ['id', 'nid', 'did', 'ord', 'mod', 'usn', 'type', 'queue', 'due', 'ivl', 'factor',
 'reps', 'lapses', 'left', 'odue', 'odid', 'flags', 'data']

def lists_equal(a,b):
    return (a == b).all()

assertEquals(lists_equal(df_cards.columns.values, expected_columns_1), True, "Card Columns Import")

'OK'

### 2.1.3.  Shallow check for duplicates (matching rows)

In [9]:
def has_dupes(df_in):
    dupe = df_in.duplicated()
    return df_in.loc[dupe].shape[0] != 0

In [10]:
assertEquals(has_dupes(df_cards), False, "Duplicates Not Found")

'OK'

### 2.1.4.  Remove unneeded card dataframe columns, rename 'id' to 'cid' (card id)

In [11]:
def print_line_break():
    print("-"*75)

In [12]:
def print_before_after(b, a, t=""):
    if t != "":
        print_line_break()
        print(t)
    print_line_break()
    print("Before: " + str(b))
    print_line_break()
    print("After: " + str(a))
    print_line_break()

In [13]:
df_cards_001_less_cols = df_cards.copy()
df_cards_001_less_cols = df_cards_001_less_cols.drop(['did','usn','type','mod','left','odue','odid','flags','data'],axis=1)
df_cards_001_less_cols = df_cards_001_less_cols.rename(columns={'id':'cid'})
expected_columns_2 = ['cid', 'nid', 'ord', 'queue', 'due', 'ivl', 'factor', 'reps','lapses']

print_before_after(df_cards.columns.values, df_cards_001_less_cols.columns.values,"Card Columns:")

assertEquals(lists_equal(df_cards_001_less_cols.columns.values, expected_columns_2), True, "Card Model Slimmed")

---------------------------------------------------------------------------
Card Columns:
---------------------------------------------------------------------------
Before: ['id' 'nid' 'did' 'ord' 'mod' 'usn' 'type' 'queue' 'due' 'ivl' 'factor'
 'reps' 'lapses' 'left' 'odue' 'odid' 'flags' 'data']
---------------------------------------------------------------------------
After: ['cid' 'nid' 'ord' 'queue' 'due' 'ivl' 'factor' 'reps' 'lapses']
---------------------------------------------------------------------------


'OK'

### 2.1.5. Generate Card Creation Date from Card ID

In [14]:
df_cards_002_created_date = df_cards_001_less_cols.copy()
df_cards_002_created_date['CardCreated'] = pd.to_datetime(df_cards_002_created_date['cid'],unit='ms')
#df_cards_002_created_date['CardCreated'] = df_cards_002_created_date['CardCreated'].dt.date

#assertEquals(str(df_cards_002_created_date['CardCreated'].iloc[0]), "2012-03-15", "Card ID is in datetime date format year-month-day")
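As a quick sanity check of the conversion above, the sketch below decodes a single card id with the standard library only. It assumes, as the `unit='ms'` argument does, that a card id is an epoch timestamp in milliseconds; the id is one that appears in this notebook's output tables.

```python
from datetime import datetime, timedelta

cid = 1331799797112  # card id taken from the output tables in this notebook

# Integer millisecond arithmetic avoids any float rounding
created = datetime(1970, 1, 1) + timedelta(milliseconds=cid)
print(created)
```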

### 2.1.6. Remove cards with no study data associated with them, cards that have been suspended from study

In [15]:
#queue           integer not null,
#      -- -3=sched buried, -2=user buried, -1=suspended,
#      -- 0=new, 1=learning, 2=due (as for type)

df_cards_003_no_new = df_cards_002_created_date.copy()
df_cards_003_no_new = df_cards_003_no_new[df_cards_003_no_new['queue']!=0] # remove cards marked as new
df_cards_003_no_new = df_cards_003_no_new[df_cards_003_no_new['reps']!=0] # remove cards that have not been reviewed
df_cards_003_no_new = df_cards_003_no_new[df_cards_003_no_new['queue']!=-1] # remove cards that are currently suspended
# https://stackoverflow.com/questions/18196203/how-to-conditionally-update-dataframe-column-in-pandas
# assign 0 to 'due' values above 10000
# todo: update w/ last studied date from revlog
# todo: comment the next line out once you have updated the collection import
df_cards_003_no_new.loc[df_cards_003_no_new['due'] > 10000, 'due'] = 0

print_before_after(df_cards_002_created_date.shape[0], df_cards_003_no_new.shape[0],"Card Rows:")

df_cards_003_no_new.tail(5)

---------------------------------------------------------------------------
Card Rows:
---------------------------------------------------------------------------
Before: 19514
---------------------------------------------------------------------------
After: 8403
---------------------------------------------------------------------------


Unnamed: 0,cid,nid,ord,queue,due,ivl,factor,reps,lapses,CardCreated
19450,1559881952278,1371264340112,2,2,2346,3,2410,1,0,2019-06-07 04:32:32.278
19451,1559886801010,1403677369202,2,2,2344,1,2410,2,0,2019-06-07 05:53:21.010
19456,1560027314586,1371719856439,2,2,2348,5,2410,1,0,2019-06-08 20:55:14.586
19473,1560034184168,1392551376296,2,2,2346,3,2410,1,0,2019-06-08 22:49:44.168
19487,1560114782832,1413131857495,2,2,2347,4,2410,1,0,2019-06-09 21:13:02.832


In [16]:
df_cards_003_no_new.dtypes

cid                     int64
nid                     int64
ord                     int64
queue                   int64
due                     int64
ivl                     int64
factor                  int64
reps                    int64
lapses                  int64
CardCreated    datetime64[ns]
dtype: object

### 2.1.7. Confirm that no cards considered 'in learning' are present

In [17]:
sel3 = df_cards_003_no_new[df_cards_003_no_new['due'] == 0]

assertEquals(sel3.shape[0],0,"There are no cards currently in 'learning'.")

'OK'

### 2.1.8. Generate Due Date from Due Value

In [18]:
df_cards_004_due_date = df_cards_003_no_new.copy()
df_cards_004_due_date['DueDate'] = pd_crt + df_cards_004_due_date['due'].map(timedelta)
#df_cards_004_due_date['DueDate'] = df_cards_004_due_date['DueDate'].dt.date

#assertEquals(str(df_cards_004_due_date['DueDate'].iloc[0]), "2015-03-08", "Card due date is in datetime date format year-month-day")

df_cards_004_due_date.head()

Unnamed: 0,cid,nid,ord,queue,due,ivl,factor,reps,lapses,CardCreated,DueDate
1,1331799797112,1331799797112,0,2,784,149,2080,8,1,2012-03-15 08:23:17.112,2015-03-03 09:00:00
3,1331799797114,1331799797114,0,2,757,99,1980,7,0,2012-03-15 08:23:17.114,2015-02-04 09:00:00
4,1331799797116,1331799797116,0,2,744,54,1680,20,4,2012-03-15 08:23:17.116,2015-01-22 09:00:00
5,1331799797117,1331799797117,0,2,798,143,2130,6,1,2012-03-15 08:23:17.117,2015-03-17 09:00:00
6,1331799797118,1331799797118,0,2,759,74,1880,15,3,2012-03-15 08:23:17.118,2015-02-06 09:00:00
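The `map(timedelta)` call above treats each `due` value as a number of days after the collection creation date. A possible vectorized alternative is `pd.to_timedelta`, sketched here on three `due` values from the rows above. The variable names are for illustration only, and the creation timestamp is the one printed in step 2.0.3.

```python
import pandas as pd

collection_created = pd.Timestamp("2013-01-08 09:00:00")  # value printed in step 2.0.3
due = pd.Series([784, 757, 744])  # 'due' values from the first rows above

# Treat each 'due' value as a day offset from the collection creation date
due_dates = collection_created + pd.to_timedelta(due, unit="D")
print(due_dates)
```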


In [19]:
df_cards_004_due_date.dtypes

cid                     int64
nid                     int64
ord                     int64
queue                   int64
due                     int64
ivl                     int64
factor                  int64
reps                    int64
lapses                  int64
CardCreated    datetime64[ns]
DueDate        datetime64[ns]
dtype: object

Card types are currently stored as numbers (the 'ord' column), which makes them less human-readable. We will fix this.

### 2.1.9. Label card types by their names, & drop outlier

In [20]:
df_cards_005_named_types = df_cards_004_due_date.copy()
# ord stands for 'ordinal' : identifies which of the card templates it corresponds to
print(df_cards_005_named_types['ord'].value_counts()) # these are the card vectors

# The dataset contains only two cards of a single card vector: let's drop them as outliers
df_cards_005_named_types = df_cards_005_named_types.drop(df_cards_005_named_types[df_cards_005_named_types['ord'] == 11].index)

df_cards_005_named_types['ord'].value_counts() # the check shall pass

# now, to map the names onto the card vectors # read:JapaneseReading, recall:EngToJpnTranslate, look:PictureLook, listen:AudioListening
df_cards_005_named_types['CardType'] = df_cards_005_named_types['ord'].map(
    {0:'read', 2:'recall', 4:'look', 7:'listen'})
df_cards_005_named_types['CardType'].value_counts()

0     7034
4     1204
7      122
2       41
11       2
Name: ord, dtype: int64


read      7034
look      1204
listen     122
recall      41
Name: CardType, dtype: int64
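One detail worth seeing in isolation: `Series.map` with a dictionary leaves any value without an entry as NaN, which is why the two `ord == 11` cards are dropped before the mapping. A minimal sketch on made-up data:

```python
import pandas as pd

ord_to_type = {0: "read", 2: "recall", 4: "look", 7: "listen"}
ords = pd.Series([0, 2, 4, 7, 11])  # 11 has no entry in the mapping

card_types = ords.map(ord_to_type)
print(card_types)
print("unmapped rows:", int(card_types.isna().sum()))
```

Counting the NaN values after mapping is a cheap way to notice a new card template that the dictionary does not cover yet.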

### 2.1.10. Create interval quantile buckets for visualization purposes

In [21]:
df_cards_006_ivl_buckets = df_cards_005_named_types.copy()
# qcut: Quantile-based discretization function. Discretize variable into equal-sized buckets
# based on rank or based on sample quantiles. For example 1000 values for 10 quantiles would
# produce a Categorical object indicating quantile membership for each data point.
# http://www.datasciencemadesimple.com/quantile-decile-rank-column-pandas-python-2/
df_cards_006_ivl_buckets['c_ivl_q'] = pd.qcut(df_cards_006_ivl_buckets['ivl'],5,labels=False)
df_cards_006_ivl_buckets['c_factor_q'] = pd.qcut(df_cards_006_ivl_buckets['factor'],3,labels=False)
df_cards_006_ivl_buckets.head()

Unnamed: 0,cid,nid,ord,queue,due,ivl,factor,reps,lapses,CardCreated,DueDate,CardType,c_ivl_q,c_factor_q
1,1331799797112,1331799797112,0,2,784,149,2080,8,1,2012-03-15 08:23:17.112,2015-03-03 09:00:00,read,1,2
3,1331799797114,1331799797114,0,2,757,99,1980,7,0,2012-03-15 08:23:17.114,2015-02-04 09:00:00,read,0,1
4,1331799797116,1331799797116,0,2,744,54,1680,20,4,2012-03-15 08:23:17.116,2015-01-22 09:00:00,read,0,1
5,1331799797117,1331799797117,0,2,798,143,2130,6,1,2012-03-15 08:23:17.117,2015-03-17 09:00:00,read,0,2
6,1331799797118,1331799797118,0,2,759,74,1880,15,3,2012-03-15 08:23:17.118,2015-02-06 09:00:00,read,0,1
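To show what `pd.qcut` with `labels=False` returns, here is a small sketch on six made-up interval values split into three equal-sized buckets. The numbers are illustrative only and are not taken from the collection.

```python
import pandas as pd

intervals = pd.Series([1, 2, 3, 4, 5, 6])  # made-up interval values

# labels=False returns the integer bucket index instead of an interval label
buckets = pd.qcut(intervals, 3, labels=False)
print(buckets.tolist())
```

If many cards share the same value, two bucket edges can coincide and `qcut` raises an error; passing `duplicates="drop"` is one way around that, at the cost of fewer buckets.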


### 2.1.11. Create dummy variables for card types

In [22]:
df_cards_007_dummies = df_cards_006_ivl_buckets.copy()

df_cards_007_dummies = pd.get_dummies(df_cards_007_dummies, columns=['CardType'])
df_cards_007_dummies['cardtype'] = df_cards_006_ivl_buckets['CardType']

In [23]:
df_cards_007_dummies.tail(10)[:5]

Unnamed: 0,cid,nid,ord,queue,due,ivl,factor,reps,lapses,CardCreated,DueDate,c_ivl_q,c_factor_q,CardType_listen,CardType_look,CardType_read,CardType_recall,cardtype
19395,1558987755646,1372725008050,2,2,2344,1,2410,3,0,2019-05-27 20:09:15.646,2019-06-10 09:00:00,0,2,0,0,0,1,recall
19400,1558993207968,1355381898927,4,2,2344,1,2410,2,0,2019-05-27 21:40:07.968,2019-06-10 09:00:00,0,2,0,1,0,0,look
19419,1559002146834,1374547817844,2,2,2344,1,2410,2,0,2019-05-28 00:09:06.834,2019-06-10 09:00:00,0,2,0,0,0,1,recall
19437,1559003056201,1387004066653,2,2,2344,1,2410,3,0,2019-05-28 00:24:16.201,2019-06-10 09:00:00,0,2,0,0,0,1,recall
19442,1559006530728,1371777687981,7,2,2347,4,2410,1,0,2019-05-28 01:22:10.728,2019-06-13 09:00:00,0,2,1,0,0,0,listen
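A minimal sketch of the dummy-variable step on a toy frame follows. The `dtype=int` argument is an addition that is not in the cell above: newer pandas releases may return True/False columns by default, so requesting integers explicitly keeps the 0/1 layout shown in the output.

```python
import pandas as pd

toy = pd.DataFrame({"cid": [1, 2, 3, 4],
                    "CardType": ["read", "look", "listen", "recall"]})

# dtype=int requests 0/1 columns explicitly rather than relying on the default dtype
dummies = pd.get_dummies(toy, columns=["CardType"], dtype=int)
print(dummies)
```

Because `get_dummies` replaces the original column, the cell above copies the text label back in as 'cardtype' afterwards.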


### 2.1.12. Create df_cards_008_mid_section_2 data frame for progress saving

In [24]:
df_cards_008_mid_section_2 = df_cards_007_dummies.copy()
# we will also drop a few columns that aren't needed anymore
df_cards_008_mid_section_2 = df_cards_008_mid_section_2.drop(['ord','queue','due'],axis=1)
print(df_cards_008_mid_section_2.shape)
df_cards_008_mid_section_2.head()

(8401, 15)


Unnamed: 0,cid,nid,ivl,factor,reps,lapses,CardCreated,DueDate,c_ivl_q,c_factor_q,CardType_listen,CardType_look,CardType_read,CardType_recall,cardtype
1,1331799797112,1331799797112,149,2080,8,1,2012-03-15 08:23:17.112,2015-03-03 09:00:00,1,2,0,0,1,0,read
3,1331799797114,1331799797114,99,1980,7,0,2012-03-15 08:23:17.114,2015-02-04 09:00:00,0,1,0,0,1,0,read
4,1331799797116,1331799797116,54,1680,20,4,2012-03-15 08:23:17.116,2015-01-22 09:00:00,0,1,0,0,1,0,read
5,1331799797117,1331799797117,143,2130,6,1,2012-03-15 08:23:17.117,2015-03-17 09:00:00,0,2,0,0,1,0,read
6,1331799797118,1331799797118,74,1880,15,3,2012-03-15 08:23:17.118,2015-02-06 09:00:00,0,1,0,0,1,0,read


### Derive card waste & ROI from lapses, reps & interval

- waste ((lapses + 1)/reps) (formerly labeled 'efficiency')
- ROI (interval/reps) (formerly labeled 'durability')

In [25]:
df_cards_009_mid_section_2 = df_cards_008_mid_section_2.copy()
df_cards_009_mid_section_2['waste'] = (df_cards_009_mid_section_2['lapses'] + 1) / df_cards_009_mid_section_2['reps']
df_cards_009_mid_section_2['roi'] = df_cards_009_mid_section_2['ivl'] / df_cards_009_mid_section_2['reps']

In [26]:
df_cards_009_mid_section_2.head()

Unnamed: 0,cid,nid,ivl,factor,reps,lapses,CardCreated,DueDate,c_ivl_q,c_factor_q,CardType_listen,CardType_look,CardType_read,CardType_recall,cardtype,waste,roi
1,1331799797112,1331799797112,149,2080,8,1,2012-03-15 08:23:17.112,2015-03-03 09:00:00,1,2,0,0,1,0,read,0.25,18.625
3,1331799797114,1331799797114,99,1980,7,0,2012-03-15 08:23:17.114,2015-02-04 09:00:00,0,1,0,0,1,0,read,0.142857,14.142857
4,1331799797116,1331799797116,54,1680,20,4,2012-03-15 08:23:17.116,2015-01-22 09:00:00,0,1,0,0,1,0,read,0.25,2.7
5,1331799797117,1331799797117,143,2130,6,1,2012-03-15 08:23:17.117,2015-03-17 09:00:00,0,2,0,0,1,0,read,0.333333,23.833333
6,1331799797118,1331799797118,74,1880,15,3,2012-03-15 08:23:17.118,2015-02-06 09:00:00,0,1,0,0,1,0,read,0.266667,4.933333
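As a worked check of the two ratios, the sketch below recomputes them for the first card in the table above. Note that the cell adds 1 to `lapses` before dividing, so a card with no lapses still gets a non-zero waste value.

```python
# Values for card 1331799797112, taken from the table above
lapses, reps, ivl = 1, 8, 149

waste = (lapses + 1) / reps  # the +1 matches the cell above
roi = ivl / reps             # days of interval earned per review

print(waste, roi)
```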


### Remove card outliers by roi

In [27]:
#Interquartile Range Method
def df_wo_outliers(df, field):
    q1 = df[field].quantile(.25)
    q3 = df[field].quantile(.75)
    iqr = q3-q1
    toprange = q3 + iqr * 1.5
    botrange = q1 - iqr * 1.5

    newdf = df.copy()
    newdf = newdf.drop(newdf[newdf[field] > toprange].index)
    newdf = newdf.drop(newdf[newdf[field] < botrange].index)
    
    return newdf
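To see the interquartile range fences at work, here is a small sketch on made-up values with one obvious outlier. The helper name `iqr_bounds` and the boolean-mask filtering are illustrative choices, not part of the notebook; the mask keeps the same rows as the two `drop` calls above.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences used by the interquartile range method."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

toy = pd.DataFrame({"roi": [1, 2, 3, 4, 5, 100]})  # made-up values, one obvious outlier
low, high = iqr_bounds(toy["roi"])

# Keep rows inside the fences (both ends inclusive, like the two drops above)
kept = toy[toy["roi"].between(low, high)]
print(low, high, kept["roi"].tolist())
```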

In [28]:
old_df = df_cards_009_mid_section_2.copy()
df_cards_010_mid_section_2 = df_wo_outliers(old_df, 'roi')
new_df = df_cards_010_mid_section_2.copy()

# print out before and after stats
print_before_after(old_df.shape, new_df.shape, "shape")
print("diff:",old_df.shape[0]-new_df.shape[0],"cards")

---------------------------------------------------------------------------
shape
---------------------------------------------------------------------------
Before: (8401, 17)
---------------------------------------------------------------------------
After: (7401, 17)
---------------------------------------------------------------------------
diff: 1000 cards


In [29]:
card_cols = df_cards_009_mid_section_2.columns.values

In [30]:
# inspect removed cards
df_both = pd.merge(old_df, new_df, on=list(card_cols), how="outer", indicator=True
              ).query('_merge=="left_only"')
print(df_both.shape)
df_both.head()

(1000, 18)


Unnamed: 0,cid,nid,ivl,factor,reps,lapses,CardCreated,DueDate,c_ivl_q,c_factor_q,CardType_listen,CardType_look,CardType_read,CardType_recall,cardtype,waste,roi,_merge
178,1346057958557,1346057958557,1729,2150,10,0,2012-08-27 08:59:18.557,2021-10-09 09:00:00,4,2,0,0,1,0,read,0.1,172.9,left_only
180,1346057958559,1346057958559,1625,2150,11,0,2012-08-27 08:59:18.559,2021-07-24 09:00:00,4,2,0,0,1,0,read,0.090909,147.727273,left_only
205,1346057958594,1346057958594,2064,2170,11,0,2012-08-27 08:59:18.594,2023-08-15 09:00:00,4,2,0,0,1,0,read,0.090909,187.636364,left_only
212,1346057958604,1346057958604,1510,2077,10,0,2012-08-27 08:59:18.604,2021-03-03 09:00:00,4,2,0,0,1,0,read,0.1,151.0,left_only
216,1346057958609,1346057958609,2753,2327,11,0,2012-08-27 08:59:18.609,2025-11-29 09:00:00,4,2,0,0,1,0,read,0.090909,250.272727,left_only
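The merge-with-indicator pattern above acts as an anti-join: rows marked `left_only` exist in the old frame but not in the new one. A sketch on toy data follows, together with an index-based alternative. Both frames and their values are made up; the alternative assumes the filtered frame still carries the original index labels.

```python
import pandas as pd

old = pd.DataFrame({"cid": [10, 20, 30], "roi": [1.0, 2.0, 99.0]}, index=[5, 6, 7])
new = old[old["roi"] < 50]  # pretend the last card was filtered out

# Variant used above: outer merge plus indicator; the result gets a fresh integer index
removed_by_merge = (pd.merge(old, new, on=["cid", "roi"], how="outer", indicator=True)
                      .query('_merge == "left_only"'))

# Alternative: compare index labels, which keeps the original row labels
removed_by_index = old.loc[~old.index.isin(new.index)]

print(removed_by_merge)
print(removed_by_index)
```

The merge variant renumbers the rows, which is why the index values in the inspection tables do not match the card frame's own index.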


### Remove card outliers by waste

In [31]:
old_df = df_cards_010_mid_section_2.copy()
df_cards_011_mid_section_2 = df_wo_outliers(old_df, 'waste')
new_df = df_cards_011_mid_section_2.copy()

# print out before and after stats
print_before_after(old_df.shape, new_df.shape, "shape")
print("diff:",old_df.shape[0]-new_df.shape[0],"cards")

---------------------------------------------------------------------------
shape
---------------------------------------------------------------------------
Before: (7401, 17)
---------------------------------------------------------------------------
After: (7024, 17)
---------------------------------------------------------------------------
diff: 377 cards


In [32]:
# inspect removed cards
df_both = pd.merge(old_df, new_df, on=list(card_cols), how="outer", indicator=True
              ).query('_merge=="left_only"')
print(df_both.shape)
df_both.head()

(377, 18)


Unnamed: 0,cid,nid,ivl,factor,reps,lapses,CardCreated,DueDate,c_ivl_q,c_factor_q,CardType_listen,CardType_look,CardType_read,CardType_recall,cardtype,waste,roi,_merge
0,1331799797112,1331799797112,149,2080,8,1,2012-03-15 08:23:17.112,2015-03-03 09:00:00,1,2,0,0,1,0,read,0.25,18.625,left_only
2,1331799797116,1331799797116,54,1680,20,4,2012-03-15 08:23:17.116,2015-01-22 09:00:00,0,1,0,0,1,0,read,0.25,2.7,left_only
3,1331799797117,1331799797117,143,2130,6,1,2012-03-15 08:23:17.117,2015-03-17 09:00:00,0,2,0,0,1,0,read,0.333333,23.833333,left_only
4,1331799797118,1331799797118,74,1880,15,3,2012-03-15 08:23:17.118,2015-02-06 09:00:00,0,1,0,0,1,0,read,0.266667,4.933333,left_only
5,1331799797121,1331799797121,132,2130,6,1,2012-03-15 08:23:17.121,2015-03-03 09:00:00,0,2,0,0,1,0,read,0.333333,22.0,left_only


In [33]:
# casts columns of type object to types (such as int) as directed, use with caution
def cast_to_typ(df, col, typ):
    df[col] = df[col].astype(typ)

### Tag cards that have been reviewed 5 times or more as sufficiently reviewed

In [34]:
df_cards_011_mid_section_2.loc[df_cards_011_mid_section_2.reps > 4, 'c_suff_reviewed'] = 1
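# assign the filled column back instead of calling fillna(inplace=True) on a column selection
df_cards_011_mid_section_2['c_suff_reviewed'] = df_cards_011_mid_section_2['c_suff_reviewed'].fillna(0)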

#convert_bool_list = ['c_suff_reviewed']
#for item in convert_bool_list:
#    cast_to_typ(df_cards_011_mid_section_2, item, int)
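The flag can also be built in one step by casting the comparison itself to integers, which avoids the intermediate NaN values and the float dtype seen in the output. This is a sketch of an alternative on made-up review counts, not the approach used in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({"reps": [1, 4, 5, 13]})  # made-up review counts

# True/False comparison cast to 1/0; same threshold as the cell above (reps > 4)
toy["c_suff_reviewed"] = (toy["reps"] > 4).astype(int)
print(toy)
```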

In [35]:
df_cards_011_mid_section_2.tail()

Unnamed: 0,cid,nid,ivl,factor,reps,lapses,CardCreated,DueDate,c_ivl_q,c_factor_q,CardType_listen,CardType_look,CardType_read,CardType_recall,cardtype,waste,roi,c_suff_reviewed
11268,1511481491077,1342506824721,35,2210,13,1,2017-11-23 23:58:11.077,2019-07-13 09:00:00,0,2,0,0,0,1,recall,0.153846,2.692308,1.0
11278,1511481491087,1342506824731,14,2010,15,2,2017-11-23 23:58:11.087,2019-06-10 09:00:00,0,1,0,0,0,1,recall,0.2,0.933333,1.0
14991,1511481494806,1369382360781,19,2170,10,1,2017-11-23 23:58:14.806,2019-06-17 09:00:00,0,2,0,0,0,1,recall,0.2,1.9,1.0
16691,1511481497879,1389234637100,443,2370,7,0,2017-11-23 23:58:17.879,2020-08-02 09:00:00,3,2,0,0,0,1,recall,0.142857,63.285714,1.0
18825,1549184129288,1549184119039,108,2410,5,0,2019-02-03 08:55:29.288,2019-08-21 09:00:00,0,2,0,0,1,0,read,0.2,21.6,1.0


In [36]:
# Make separate dataframe to hold sufficiently reviewed cards (for analysis)
print(df_cards_011_mid_section_2.shape)
df_cards_012_mid_section_2 = df_cards_011_mid_section_2.copy()
df_cards_012_mid_section_2 = df_cards_012_mid_section_2.drop(
    df_cards_012_mid_section_2[df_cards_012_mid_section_2.c_suff_reviewed != 1].index)
print(df_cards_012_mid_section_2.shape)

(7024, 18)
(7024, 18)


In [37]:
df_cards_012_mid_section_2.tail(20)[10:]

Unnamed: 0,cid,nid,ivl,factor,reps,lapses,CardCreated,DueDate,c_ivl_q,c_factor_q,CardType_listen,CardType_look,CardType_read,CardType_recall,cardtype,waste,roi,c_suff_reviewed
11215,1490047770781,1483483650705,562,2560,5,0,2017-03-20 22:09:30.781,2020-11-10 09:00:00,3,2,0,1,0,0,look,0.2,112.4,1.0
11236,1508007825484,1508007536188,426,2410,5,0,2017-10-14 19:03:45.484,2020-07-04 09:00:00,3,2,1,0,0,0,listen,0.2,85.2,1.0
11246,1511481491055,1331799797116,29,2410,7,0,2017-11-23 23:58:11.055,2019-07-05 09:00:00,0,2,0,0,0,1,recall,0.142857,4.142857,1.0
11247,1511481491056,1331799797117,28,2410,6,0,2017-11-23 23:58:11.056,2019-07-04 09:00:00,0,2,0,0,0,1,recall,0.166667,4.666667,1.0
11248,1511481491057,1331799797118,14,2410,6,0,2017-11-23 23:58:11.057,2019-06-10 09:00:00,0,2,0,0,0,1,recall,0.166667,2.333333,1.0
11268,1511481491077,1342506824721,35,2210,13,1,2017-11-23 23:58:11.077,2019-07-13 09:00:00,0,2,0,0,0,1,recall,0.153846,2.692308,1.0
11278,1511481491087,1342506824731,14,2010,15,2,2017-11-23 23:58:11.087,2019-06-10 09:00:00,0,1,0,0,0,1,recall,0.2,0.933333,1.0
14991,1511481494806,1369382360781,19,2170,10,1,2017-11-23 23:58:14.806,2019-06-17 09:00:00,0,2,0,0,0,1,recall,0.2,1.9,1.0
16691,1511481497879,1389234637100,443,2370,7,0,2017-11-23 23:58:17.879,2020-08-02 09:00:00,3,2,0,0,0,1,recall,0.142857,63.285714,1.0
18825,1549184129288,1549184119039,108,2410,5,0,2019-02-03 08:55:29.288,2019-08-21 09:00:00,0,2,0,0,1,0,read,0.2,21.6,1.0


### 2.1.13. Export df_cards_012_mid_section_2

In [38]:
#df_cards_011_mid_section_2.to_csv('datasets/df_cards_011_mid_section_2.csv')
df_cards_012_mid_section_2.to_csv('datasets/df_cards_012_mid_section_2.csv')

In [39]:
# todo: assert that no rare, phrase, sentence or question cards remain in the collection of cards (as down below with notes)

***
- [Previous section: Cards](#cards)
- [Next section: Combo](#combo)
***

# <a name="notes"></a> Notes

### 2.2.1. Import notes (terms/words) into data frame "df_notes"

In [40]:
# let's take in the 'notes' table, and explicitly save the note id ("nid") 
df_notes = pd.read_sql_query("SELECT * FROM notes", cnx)
df_notes = df_notes.rename(columns={'id':'nid'})

In [41]:
assertEquals(df_notes.shape[0],8377,"Rows") # 2791, 9784, 8403
assertEquals(df_notes.shape[1],11,"Columns")

'OK'

### 2.2.2. Remove (drop) unneeded fields (columns)

In [42]:
df_notes_old_col_vals = df_notes.columns.values
df_notes = df_notes.drop(['guid','mid','usn','sfld','csum','flags','data'],axis=1)
#print(df_notes.columns.values)
print_before_after(df_notes_old_col_vals, df_notes.columns.values)

---------------------------------------------------------------------------
Before: ['nid' 'guid' 'mid' 'mod' 'usn' 'tags' 'flds' 'sfld' 'csum' 'flags' 'data']
---------------------------------------------------------------------------
After: ['nid' 'mod' 'tags' 'flds']
---------------------------------------------------------------------------


### 2.2.3. Split "fields" column into multiple, assign field names, drop combined col

In [43]:
def time_it(func, *args, **kwargs):
    start = time.time()
    func(*args, **kwargs)
    end = time.time()
    # https://stackoverflow.com/questions/8885663/how-to-format-a-floating-number-to-fixed-width-in-python
    print("{:.0f}".format((end - start)*1000) + " milliseconds")

In [44]:
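# Anki joins a note's fields with the unit separator character (0x1f);
# the last entry of expected_names ('Tags') is not stored in 'flds', hence the -1
for i in range(0,len(expected_names)-1):
    df_notes[expected_names[i]] = df_notes.flds.str.split('\x1f').str.get(i)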
assertEquals('flds' in df_notes.columns.values, True, "'flds' Column Found")
df_notes = df_notes.drop(['flds'],axis=1)
assertEquals('flds' not in df_notes.columns.values, True, "'flds' Column Not Found")
print(df_notes.columns.values)

['nid' 'mod' 'tags' 'Term' 'Yomi1' 'Translation' 'Translation2'
 'Translation3' 'AlternateForms' 'PartOfSpeech' 'Sound' 'Sound2' 'Sound3'
 'Examples' 'ExamplesAudio' 'AtoQ' 'AtoQaudio' 'AtoQkana'
 'AtoQtranslation' 'QandApicture' 'answerPicture' 'Meaning1'
 'SimilarWords' 'RelatedWords' 'Breakdown1' 'Comparison' 'Usage' 'Prompt1'
 'Prompt2' 'KakuMCD' 'IuMCD' 'ExtraMemo' 'Yomi2' 'Meaning2' 'Breakdown2'
 'Picture1' 'Picture2' 'Picture3' 'Picture4' 'HinshiMarker' 'Hint' 'Term2'
 'ArabicNumeral' 'CounterKanji' 'Mnemonic' 'SameSoundWords' 'Yomi3'
 'gChap' 'gBook' 'semester' 'gNumber' 'Transliteration' 'SoloLookCards'
 'TagOverflow' 'blank1' 'blank2']
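Anki stores all of a note's fields in one string joined by the unit separator character (0x1f). The sketch below splits two made-up `flds` strings in a single pass with `expand=True`, as a possible alternative to splitting once per field name. The three-name list is shortened for illustration.

```python
import pandas as pd

names = ["Term", "Yomi1", "Translation"]  # shortened list for illustration
toy_notes = pd.DataFrame({"flds": ["発明\x1fはつめい\x1finvention",
                                   "隙間\x1fすきま\x1fcrevice"]})

# expand=True spreads the pieces across columns in a single pass
fields = toy_notes["flds"].str.split("\x1f", expand=True)
fields.columns = names
print(fields)
```

This only works as written when every note has exactly as many fields as there are names; otherwise assigning the column names fails and the per-name loop is the safer choice.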


### 2.2.4. Confirm all HTML tags have been removed from note terms & readings

In [45]:
assertEquals(df_notes[df_notes['Term'].str.contains("</div>")].shape[0],0,"HTML tags have been removed")
assertEquals(df_notes[df_notes['Term'].str.contains("<div>")].shape[0],0,"HTML tags have been removed")
assertEquals(df_notes[df_notes['Term'].str.contains("anki")].shape[0],0,"HTML tags have been removed")
assertEquals(df_notes[df_notes['Yomi1'].str.contains("</span>")].shape[0],0,"HTML tags have been removed")
assertEquals(df_notes[df_notes['Yomi1'].str.contains("</div>")].shape[0],0,"HTML tags have been removed")
assertEquals(df_notes[df_notes['Yomi1'].str.contains("anki")].shape[0],0,"HTML tags have been removed")

'OK'

In [46]:
# todo: create function for this
# inspect notes that have spaces in the reading field
# df_notes[df_notes['Term'].str.contains(" ")]

### 2.2.5. Check notes for duplicates (shallow check)

In [47]:
assertEquals(has_dupes(df_notes), False, "Duplicates Not Found")

'OK'

### 2.2.6. Check for duplicates by term field in notes data frame

In [48]:
def has_dupe_terms(df_in):
    location = df_in['Term'].duplicated()
    return df_in.loc[location].shape[0] != 0

In [49]:
assertEquals(has_dupe_terms(df_notes), False, "Duplicates Found")

'OK'

### 2.2.7. Confirm that duplicates dataframe is empty (no dups exist)

In [50]:
dupe = df_notes['Term'].duplicated() #creates list of True/False values
print(df_notes[dupe].shape)
assertEquals(df_notes[dupe].shape[0], 0, "Duplicates dataframe is empty.")

(0, 56)


'OK'

### 2.2.8. Inspect an individual note by its term

In [51]:
def get_rows_by_value_in_col(df_in, value, col):
    return df_in.loc[df_in[col]==value]

In [52]:
# Invention
sel1 = get_rows_by_value_in_col(df_notes, '発明','Term')
sel1

Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,Yomi3,gChap,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2
2696,1354094556789,1558184056,MCD N3 Noun commonWord complete editThis forJ...,発明,はつめい,"<span style=""""><div>invention</div></span>",,,,"Common word, Noun, Suru verb",...,,,,,,,,,,


### Save Point, Commit, Bonfire (for you Souls fans)*

Now that the (meta) tag information is available, we can clean it up: clarify (rename poorly worded tags) and reduce (delete unneeded tags). Since all fields are now split into their own columns, we can also modify and improve the columns. This is a two-step process: (1) fix the tags and (2) fix the columns.

*https://en.wikipedia.org/wiki/Souls_(series)

In [53]:
def shorten_list(takeIn, takeOut):
    temp = takeIn.lower().split() # split all the words into a list
    temp2 = [word for word in temp if word.lower() not in takeOut] # create a shorter list of words minus the take-outs
    return ' '.join(temp2) # return that shorter list as a string
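A short usage sketch of `shorten_list` follows, using the tag string of the first note shown in this notebook and a four-entry removal list chosen for illustration. The function is repeated so the example runs on its own.

```python
def shorten_list(takeIn, takeOut):
    temp = takeIn.lower().split()  # lowercase, then split into words
    temp2 = [word for word in temp if word not in takeOut]  # drop the take-outs
    return ' '.join(temp2)

tags = " Noun checkNuance complete kanji naAdjective noAdjective noMemo researched wwwjdic yojijukugo "
remove = ['checknuance', 'complete', 'nomemo', 'researched']  # lowercase on purpose

print(shorten_list(tags, remove))
```

Because the input is lowercased before the comparison, removal entries must be lowercase too: an entry with capitals, such as 'addaudioNow' in the list below, can never match.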

In [54]:
tag_remove_list = ['japanese', 'checkpicture', 'complete', 'haspicture', 'nomemo',
                   'researched', 'aaaeditthis', 'addaudio', 'addaudio2', 'addaudioNow',
                   'addmore','adjustformatting', 'hascomparison', 'hasmnemonic',
                   'customediting','wikidefinition', 'givewill','addaudionow','addprompt',
                   'checknuance','giveyaneury','hastextimage', 'marked', 'addpicture',
                   'addexampletranslation','basicnumeric', 'genkiplus', 'hasaudio',
                   'nativeaudio', 'adddefinition','addexamples', 'addjapaneseprompt',
                   'computervoice','haspoliteprefix','nongoo','customdefinition','hashint',
                   'abahipriorityfix','kaki','mcd','nobodyknows+','missingwordtype',
                   'image','duplicate', 'hasprompt', 'ninshiki','abachecknuance',
                   'hasflag','things', 'jim', 'hasunicode', 'editthis','aaahipriority',
                   'hassimpledef', 'givecodie', 'forjimmy', 'hasnativeaudio', 'givejimmy2',
                   'checkaudio', 'checkwriting', 'hasjlptlevel', 'makekaki', 'checknuance2',
                   'checkagain', 'newaudio', 'mail', 'checkexamples','elementaryschool',
                   'nvc', 'checkprompt', 'gavejimmy', 'addnativeaudio','checkreading',
                   'givecodieapril', 'activated', 'fixformatting','hasplacesuffix',
                   'hassuffix','addtranslation','addnewcardtype','addnuance','addtextimage',
                   'semicomplete', 'removeroboaudio','fixaudio','hasgramconj', 
                   'addkanji','changenotetype', 'famous', 'kuverb',
                   'givwill','karutapoems', 'map', 'hasvisualcomparison','picturekaki',
                   'jyugemu', '2018', 'type1', 'hasslang', 'apologies',
                   'month', 'definitionresearched','soundshift', 'basics1', 'tsuverb',
                   'facebook', 'uverb', 'checkfrequency', 'degree', 'hasdefinition',
                   'addtransliteration', 'dnd', 'introductions', 'adjustprompt',
                   'job', 'particle', 'services', 'mature', 'splitpictures', 
                   'egaki', 'type5k', 'intimate','extrainfo', 'irregular', 'unlisted',
                   'fromwiki', 'checkdifference','addpronunciationdiagram', 'reset',
                   'currentevents', 'doubletextimage', 'comparison', 'verbscompoundpast2',
                   'attention', 'addmemo', 'averb', 'radio','hasascii', 'fontadjusted',
                   'haspronunciation', 'borroweddefinition','alphabet', 'graphics',
                   'chiebukuro', 'duolingo', 'ateji', 'fact','type5s', 'fixpicture',
                   'politebydefault', 'objects','sensitive', 'groupword', 'addmnemonic',
                   'hasmore', 'quote', 'checkformatting','overlap', 'kotobankdef',
                   'hasrudeness', 'changedeck', 'specialformatting','yoga',
                   'hasjapaneseprompt', 'hasprefix','questionword', 'business', 
                   'postoffice', 'firstten', 'money', 'robotvoice2', 'ichidan', 'godan',
                   'weather','count', 'nodefinition', 'muverb', 'addcomparisonchart', 
                   'ruverb', 'phone', 'conjugated','haddiv','vulgar','fromkaruta',
                   'karutamanual', 'teform', '2019', 'onecharacter','checkpronunciation',
                   'basics','verbsinf'
                  ]

### 2.2.9. Remove unneeded tags (meta-data) from notes

In [55]:
# survey a few notes to see example tag data
df_notes.head(3)

Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,Yomi3,gChap,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2
0,1331799797110,1558125897,Noun checkNuance complete kanji naAdjective n...,臨機応変,りんきおうへん,adapting oneself to the requirements of the mo...,,,,"Noun, No-adjective",...,,,,,,,,,,
1,1331799797112,1558189992,N2 Noun commonWord complete kanji noMemo rese...,隙間,すきま,<div>crevice; crack; gap; opening</div>,,,,"<div>Common word, Noun</div>",...,,,,,,,,,,
2,1331799797113,1558545381,Noun checkNuance complete kanji noMemo resear...,苦汁,にがり,bittern; concentrated solution of salts (esp. ...,,,,Noun,...,,,,,,,,,,


In [56]:
# likely useful tags: katakana, Waseigo, Food, Phrases, casual, restaurant, travel, commonWord, noun, suruVerb

df_notes_001_less_tags = df_notes.copy() #originally "df_notes_less_tags"
df_notes_001_less_tags['tags'] = df_notes_001_less_tags['tags'].apply(lambda x: shorten_list(str(x), tag_remove_list))

print_before_after(df_notes['tags'].iloc[0], df_notes_001_less_tags['tags'].iloc[0],"Tags for " + df_notes['Term'].iloc[0])

assertEquals("researched" in df_notes['tags'].iloc[0].split(), True, "Contains Tag 'researched'")
assertEquals("researched" in df_notes_001_less_tags['tags'].iloc[0].split(), False, "Contains Tag 'researched'")

---------------------------------------------------------------------------
Tags for 臨機応変
---------------------------------------------------------------------------
Before:  Noun checkNuance complete kanji naAdjective noAdjective noMemo researched wwwjdic yojijukugo 
---------------------------------------------------------------------------
After: noun kanji naadjective noadjective wwwjdic yojijukugo
---------------------------------------------------------------------------


'OK'

### 2.2.10. Rename useful tags (meta-data) that were poorly named (still on notes)

In [57]:
# replace list (formerly named 'tag_replace_list')
tag_rename_dict = {
    'aalowfrequency':'rare checked', 'aatechnical':'technical checked', 'aaanonkaiwa':'nonconvo checked',
    'wwwjdic':'fromdict', 'expression':'phrase', 'numberonly':'number',
    'grammarpoint':'grammar', 'jisho':'fromdict', 'pointingword':'directions',
    'geometry':'math technical', 'genki':'textbook', 'jpn202':'college',
    'jpn201':'college', 'jpn101':'college', 'jpn102':'college', 'kentei':'fromexam',
    'proficiencytest':'fromexam', 'bodypart':'body', '5kyuu':'fromexam',
    'linguisticreference':'technical', 'conversation':'convo',
    'fromconvo':'convo', 'culturepoint':'culture', 'checkednuance':'checked',
    'checkedpictures':'checked', 'medical':'technical',
    'anatomy':'body', 'places':'place', 'animals':'animal',
    'newspaperterm':'fromnewspaper', 'checkedreading':'checked',
    'abbreviation':'abbr','firstsemester':'semester1', 'verbs':'verb',
    'convook':'checked convo','inuse':'checked',
    'nuancechecked':'checked','insects':'animal insect','sightseeing':'travel',
    'accessories':'clothing', 'grammarsuffix':'suffix', 'oceanlife':'animal ocean',
    'science':'technical', 'written':'nonconvo', 'notrare':'checked',
    'aajoke':'silly', 'intonationcompare':'hassimilar', 'ij':'textbook',
    'goodcard':'inspect','aahilevel':'challenging inspect', 'ijvocab':'textbook',
    'cliothing':'clothing','unused':'nonconvo rare checked',
    'aaunused':'nonconvo rare checked', 'samesound':'hassame',
    'dictionary':'fromdict',
    'abVeryRare':'rare checked', 'yojijukugo':'rare idiom', 'abcasual':'casual checked convo',
    'literaryform':'nonconvo', 'onomatopoeiclike':'onomatopoeic',
    "onomatopoeia":"onomatopoeic",'kenjo':'humble',
    'colors':'color', 'forest':'nature','flower':'plant nature', 'aaok':'checked',
    'questions': 'question', 'adverbs':'adverb','book2':'textbook',
    'book1':'textbook','proficiencytest':'fromtest','animalscomplete':'animal',
    'sonkei':'respectful','eating':'food','fruit':'food','neverused':'nonconvo rare',
    'domainspecific':'technical','seaons':'season','seasons':'season',
    'prefecture':'place','plantpart':'plant', "hakataben":"dialect", "fish":"animal fish",
    "transitive":"transitive verb", "intransitive":"intransitive verb",
    "aaunecessary":"nonconvo checked", "vegetables":"vegetable food plant",
    "counters":"counter", "senmonyougo":"technical", "countries":"country place",
    "date":"datesandtime", "rarelyused":"rare", "aaakaiwa":"convo checked", "cool":"inspect",
    "investigate":"inspect","challenging":"inspect","names":"name",'qanda':'question',
    'hasquestion':'question', "感情のもとにあったニーズ":"phrase rare","phrases":'phrase',
    'iadjective':'iadj adj', 'naajective':'naadj adj', 'adverbs':'adv', 'adverb':'adv',
    "sweets":"food",'holiday':'culture'
} # 'onecharacter':'len1', 'usuallywritteninkana':'kana',

#todo: investigate:
#editformatting,  datesandtime, linguistics, reference, adult, adjustpicture, checkpronunciation, addhint, challenging, inspect

In [58]:
def replace_list(takeIn, replaceDict):
    temp = takeIn.lower().split() # lowercase the tag string, then split it on whitespace
    temp2 = []
    for word in temp:
        # if the tag is a key in the dictionary, swap in its replacement; otherwise leave it alone
        # (tags are lowercased first, so only lowercase keys can ever match)
        temp2.append(replaceDict.get(word, word))
    return ' '.join(temp2) # return the renamed tags as one string (a replacement may hold several tags)

# inspect further:
# multiwriting, multimeaning, multipicture, multiterm, multireading, mergeterms, checkpronunciation, customterm,
# goodcard, personalized, silly, addjlptlevel, checkpronunciation, mergeterms, customterm, transportation vs travel

# categorize: iadjective, naajective, verb, counter, commonword, suruverb, pronoun, question, phrases, kuverb, godan, ichidan, intransitive, transitive, noun, adverbialnoun

In [59]:
df_notes_002_better_tags = df_notes_001_less_tags.copy() # originally "df_notes_better_tags"
df_notes_002_better_tags['tags'] = df_notes_002_better_tags['tags'].apply(lambda x: replace_list(str(x), tag_rename_dict))

print_before_after(df_notes_001_less_tags['tags'].iloc[0], df_notes_002_better_tags['tags'].iloc[0], "Tags for " + df_notes_002_better_tags['Term'].iloc[0])

assertEquals("wwwjdic" in df_notes_001_less_tags['tags'].iloc[0].split(), True, "Contains Tag 'wwwjdic'")
assertEquals("wwwjdic" in df_notes_002_better_tags['tags'].iloc[0].split(), False, "Contains Tag 'wwwjdic'")
assertEquals("fromdict" in df_notes_001_less_tags['tags'].iloc[0].split(), False, "Contains Tag 'fromdict'")
assertEquals("fromdict" in df_notes_002_better_tags['tags'].iloc[0].split(), True, "Contains Tag 'fromdict'")

---------------------------------------------------------------------------
Tags for 臨機応変
---------------------------------------------------------------------------
Before: noun kanji naadjective noadjective wwwjdic yojijukugo
---------------------------------------------------------------------------
After: noun kanji naadjective noadjective fromdict rare idiom
---------------------------------------------------------------------------


'OK'
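
One side effect of the renaming step is worth a quick illustration: several old tags map onto the same new tag, so a note that carried two of them ends up with a repeated tag. Below is a minimal sketch using a three-entry excerpt of `tag_rename_dict` and a made-up tag string. `rename_tags` and `rename_excerpt` are hypothetical names, used here only as a stand-in for `replace_list`.

```python
# Excerpt of tag_rename_dict: two textbook-related tags share one target.
rename_excerpt = {'genki': 'textbook', 'ij': 'textbook', 'jpn101': 'college'}

def rename_tags(tag_string, rename_dict):
    """Lowercase and split the tag string, then rename each tag found in the dict."""
    return ' '.join(rename_dict.get(tag, tag) for tag in tag_string.lower().split())

# Made-up example note that was tagged with both 'genki' and 'ij'.
renamed = rename_tags("Genki kanji ij", rename_excerpt)
print(renamed)  # textbook kanji textbook
```

This is the kind of repeated tag that the following steps look for and remove.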

### 2.2.11. Inspect current tag strings, notice duplicate occurrences

In [60]:
df_notes_002_better_tags['tags'].value_counts()[:5]

kanji                      1343
kanji fromtest textbook     535
fromdict kanji              407
kanji fromdict              297
textbook kanji textbook     205
Name: tags, dtype: int64

### 2.2.12. Add "metalite" tag to notes w/o any other meta-tag data

In [61]:
# Notes whose only tag is 'kanji' carry no other meta data. Since all
# words with kanji were tagged 'kanji', 'hasnotags' would not be accurate...
# How about 'metalite' to indicate that the meta data is sparse?
df_notes_002_better_tags['tags'] = df_notes_002_better_tags['tags'].apply(lambda x: "metalite" if x == 'kanji' else x)

In [62]:
df_notes_002_better_tags['tags'].value_counts()[:10]

metalite                                   1343
kanji fromtest textbook                     535
fromdict kanji                              407
kanji fromdict                              297
textbook kanji textbook                     205
college textbook textbook hasrobo kanji     192
kanji verb                                  183
textbook textbook kanji                     159
fromdict kanji verb                         137
kanji fromexam                              113
Name: tags, dtype: int64

We can now start to inspect which tags are most common, which combinations they occur in, and which words would benefit from additional metadata. However, **our tags are still lumped together** in one string per note, so the same tags in a different order (`fromdict kanji` vs. `kanji fromdict`) are counted as different values. There is also
reason to believe that **some tags are showing up multiple times in the same tag string** (see `textbook kanji textbook` above). In order to count tag frequency properly, the duplicates must first be found and removed. Then the occurrences of each tag can be summed over the tags column.
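
Counting tag frequency as described above could look roughly like this once each tag string is split into individual tags. This is only a sketch on three made-up tag strings, not on the notes data, and `count_tags` is a hypothetical helper that does not appear elsewhere in this notebook.

```python
from collections import Counter

def count_tags(tag_strings):
    """Count in how many tag strings each tag appears (repeats within one string count once)."""
    counts = Counter()
    for tag_string in tag_strings:
        counts.update(set(tag_string.split()))  # set() drops repeats within one note
    return counts

sample_tags = ["fromdict kanji", "kanji fromdict", "textbook kanji textbook"]
tag_counts = count_tags(sample_tags)
print(tag_counts.most_common(1))  # [('kanji', 3)]
```

Splitting before counting makes tag order irrelevant, and the `set()` call keeps a repeated tag from inflating the count for a single note.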

### 2.2.13. Inspect a note suspected for tag duplication

In [63]:
# confirm that a particular note has tag duplicates
# crimson note id: 1369286386384
note_id_1 = 1369286386384
# split the tag string first, so that whole tags are counted rather than substrings
assertEquals(get_rows_by_value_in_col(df_notes_002_better_tags, note_id_1,'nid').tags.values[0].split().count('fromexam'),
             2,"Two occurrences of the tag 'fromexam' exist")

'OK'

In [64]:
# example of item with tag duplication
sel2 = get_rows_by_value_in_col(df_notes_002_better_tags, note_id_1,'nid')
sel2

Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,Yomi3,gChap,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2
3845,1369286386384,1558184056,fromexam color kanji fromexam,紅,くれない,<div>deep red; crimson</div><div><br /></div><...,,,,"<div>Common word, Noun</div><div>Common word, ...",...,,,,,,,,,,


### 2.2.14. Remove duplicate tags (convert tag strings > lists > sets > strings)

In [65]:
# Converts a tag string to a list, to a set, and back to a string (this removes the duplicates)
# note: a set is unordered, so the original order of the tags is not preserved
def remove_dupes(t):
    temp = list(set(t.lower().split()))
    return ' '.join(temp) # return as string

In [66]:
df_notes_003_tags_no_dups = df_notes_002_better_tags.copy()
df_notes_003_tags_no_dups['tags'] = df_notes_003_tags_no_dups['tags'].apply(lambda x: remove_dupes(str(x)))

In [67]:
# returns 1 if the tag appears as a whole (space-separated) tag in the tags string, otherwise 0
def tag_exists(tags, tag):
    return 1 if tag in tags.split() else 0

In [68]:
print(get_rows_by_value_in_col(df_notes_003_tags_no_dups, note_id_1,'nid').tags.values[0])
assertEquals(tag_exists(get_rows_by_value_in_col(df_notes_003_tags_no_dups, note_id_1,'nid').tags.values[0],"color"), 1, "tag 'color' remains")
assertEquals(tag_exists(get_rows_by_value_in_col(df_notes_003_tags_no_dups, note_id_1,'nid').tags.values[0],"fromexam"), 1, "tag 'fromexam' remains")
assertEquals(tag_exists(get_rows_by_value_in_col(df_notes_003_tags_no_dups, note_id_1,'nid').tags.values[0],"kanji"), 1, "tag 'kanji' remains")

color fromexam kanji


'OK'
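
Because a `set` has no defined order, the tags may come back in a different order than they went in. If the original order ever matters, an order-preserving variant is possible with `dict.fromkeys`. The sketch below is an alternative, not what this notebook uses, and `remove_dupes_ordered` is a hypothetical name.

```python
def remove_dupes_ordered(tag_string):
    """Drop repeated tags while keeping the first occurrence of each tag in place."""
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7)
    return ' '.join(dict.fromkeys(tag_string.lower().split()))

deduped = remove_dupes_ordered("fromexam color kanji fromexam")
print(deduped)  # fromexam color kanji
```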

It appears we have most, if not all, of the data we need to start. The dates, though, are still stored as epoch timestamps and are not yet human-readable. Let's fix that.

### 2.2.15. Convert (& preserve) note ID to note creation date

In [69]:
#dueNum = 782 # this represents days from collection creation date
#crt = 1357635600 # this represents the collection creation date #todo: query dynamically from database
#print("mid 'model id': " + time.ctime(int("1768161991"))) # 1 day = 86400 seconds

df_notes_004_with_date = df_notes_003_tags_no_dups.copy()
df_notes_004_with_date['NoteCreated']= pd.to_datetime(df_notes_004_with_date['nid'],unit='ms')
#df_notes_004_with_date['NoteCreated'] = df_notes_004_with_date['NoteCreated'].dt.date
df_notes_004_with_date.head()

print_before_after(df_notes_003_tags_no_dups['nid'].iloc[0], df_notes_004_with_date['NoteCreated'].iloc[0],"Term " + df_notes_004_with_date['Term'].iloc[0])

#assertEquals(df_notes_004_with_date['nid'].iloc[0], 1331799797110, "Note ID is in Epoch Units")
#assertEquals(str(df_notes_004_with_date['NoteCreated'].iloc[0]), "2012-03-15", "Note ID is in datetime date format year-month-day")

---------------------------------------------------------------------------
Term 臨機応変
---------------------------------------------------------------------------
Before: 1331799797110
---------------------------------------------------------------------------
After: 2012-03-15 08:23:17.110000
---------------------------------------------------------------------------
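
The same conversion can be checked by hand with the standard library. The sketch below assumes the timestamps count from the Unix epoch in UTC, which matches what `pd.to_datetime` does by default: note IDs are in milliseconds (`unit='ms'`), while the `mod` column used in the next step is in seconds (`unit='s'`). The helper names are hypothetical, and the two input values are the `nid` and `mod` of the first note.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def from_epoch_ms(ms):
    """Convert milliseconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)

def from_epoch_s(s):
    """Convert seconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=s)

note_created = from_epoch_ms(1331799797110)  # nid of the first note
last_modified = from_epoch_s(1558125897)     # mod of the first note
print(note_created)   # 2012-03-15 08:23:17.110000+00:00
print(last_modified)  # 2019-05-17 20:44:57+00:00
```

Passing the wrong unit is an easy mistake to make here: reading a millisecond value as seconds gives a date far in the future.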


In [70]:
df_notes_004_with_date.dtypes

nid                         int64
mod                         int64
tags                       object
Term                       object
Yomi1                      object
Translation                object
Translation2               object
Translation3               object
AlternateForms             object
PartOfSpeech               object
Sound                      object
Sound2                     object
Sound3                     object
Examples                   object
ExamplesAudio              object
AtoQ                       object
AtoQaudio                  object
AtoQkana                   object
AtoQtranslation            object
QandApicture               object
answerPicture              object
Meaning1                   object
SimilarWords               object
RelatedWords               object
Breakdown1                 object
Comparison                 object
Usage                      object
Prompt1                    object
Prompt2                    object
KakuMCD       

### 2.2.15. Generate Note Last Modified Date from "Mod" ID

In [71]:
df_notes_005_last_modified = df_notes_004_with_date.copy()
df_notes_005_last_modified['LastModified'] = pd.to_datetime(df_notes_005_last_modified['mod'],unit='s')
#df_notes_005_last_modified['LastModified'] = df_notes_005_last_modified['LastModified'].dt.date

#assertEquals(str(df_notes_005_last_modified['LastModified'].iloc[0]), "2017-11-23", "Note last modified is in datetime date format year-month-day")

### 2.2.16. Remove rare words, phrases, expressions, questions & sentences from notes

In [72]:
df_notes_006_only_vocab_no_rare = df_notes_005_last_modified.copy()
print(df_notes_006_only_vocab_no_rare.shape)
df_notes_006_only_vocab_no_rare.head(3)

(8377, 58)


Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2,NoteCreated,LastModified
0,1331799797110,1558125897,noun naadjective kanji noadjective rare idiom ...,臨機応変,りんきおうへん,adapting oneself to the requirements of the mo...,,,,"Noun, No-adjective",...,,,,,,,,,2012-03-15 08:23:17.110,2019-05-17 20:44:57
1,1331799797112,1558189992,kanji commonword noun n2,隙間,すきま,<div>crevice; crack; gap; opening</div>,,,,"<div>Common word, Noun</div>",...,,,,,,,,,2012-03-15 08:23:17.112,2019-05-18 14:33:12
2,1331799797113,1558545381,kanji fromdict noun,苦汁,にがり,bittern; concentrated solution of salts (esp. ...,,,,Noun,...,,,,,,,,,2012-03-15 08:23:17.113,2019-05-22 17:16:21


In [73]:
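# select every note whose tag string contains "rare" (substring match)
sel4 = df_notes_006_only_vocab_no_rare[df_notes_006_only_vocab_no_rare['tags'].str.contains("rare")]
# https://stackoverflow.com/questions/37313691/how-to-remove-a-pandas-dataframe-from-another-dataframe
# remove rare words only first: appending the selection makes every rare row appear twice,
# and drop_duplicates(keep=False) then discards all rows that appear more than once
df_notes_006_only_vocab_no_rare = pd.concat([df_notes_006_only_vocab_no_rare, sel4]).drop_duplicates(keep=False)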

print(df_notes_006_only_vocab_no_rare.shape)
df_notes_006_only_vocab_no_rare.head(3)

# todo: assert that no rare words remain in 'df_notes_006_only_vocab_no_rare' by using 'contain("rare")'
# for selection, assert that selection has a row size of 0

(8258, 58)


Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2,NoteCreated,LastModified
1,1331799797112,1558189992,kanji commonword noun n2,隙間,すきま,<div>crevice; crack; gap; opening</div>,,,,"<div>Common word, Noun</div>",...,,,,,,,,,2012-03-15 08:23:17.112,2019-05-18 14:33:12
2,1331799797113,1558545381,kanji fromdict noun,苦汁,にがり,bittern; concentrated solution of salts (esp. ...,,,,Noun,...,,,,,,,,,2012-03-15 08:23:17.113,2019-05-22 17:16:21
3,1331799797114,1560123245,commonword noun kanji suruverb fromdict,移籍,いせき,<div>changing household registry; transfer (e....,,,,"<div>Common word, Noun, Suru verb</div>",...,,,,,,,,,2012-03-15 08:23:17.114,2019-06-09 23:34:05
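
A more direct way to drop rows is to build a boolean mask and negate it. The sketch below uses a made-up three-row data frame (the tag `rarebird` is invented for the example) and also shows how a substring match differs from a whole-tag match. It is an alternative for comparison, not the method used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'Term': ['a', 'b', 'c'],
    'tags': ['kanji rare idiom', 'kanji noun', 'rarebird kanji'],
})

# Substring match, as str.contains does: also catches 'rarebird'.
is_rare_substring = toy['tags'].str.contains('rare', regex=False)
# Whole-tag match: only rows where 'rare' is a tag of its own.
is_rare_tag = toy['tags'].apply(lambda tags: 'rare' in tags.split())

kept_substring = toy[~is_rare_substring]  # ~ negates the mask
kept_tag = toy[~is_rare_tag]
print(kept_substring['Term'].tolist())  # ['b']
print(kept_tag['Term'].tolist())        # ['b', 'c']
```

Which of the two matches is wanted depends on whether tags that merely contain the word should be removed as well.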


### 2.2.17. Remove phrases, sentences, questions, & grammar (hasTilde) cards all at once

In [74]:
sel5 = df_notes_006_only_vocab_no_rare[df_notes_006_only_vocab_no_rare['tags'].str.contains("phrase")]
sel6 = df_notes_006_only_vocab_no_rare[df_notes_006_only_vocab_no_rare['tags'].str.contains("sentence")]
sel7 = df_notes_006_only_vocab_no_rare[df_notes_006_only_vocab_no_rare['tags'].str.contains("question")]
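# note: the tags were lowercased in the earlier steps, so the mixed-case pattern "hasTilde" matches no rows as written
sel01 = df_notes_006_only_vocab_no_rare[df_notes_006_only_vocab_no_rare['tags'].str.contains("hasTilde")]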
df_notes_006_only_vocab_no_rare = pd.concat([df_notes_006_only_vocab_no_rare, sel5, sel6, sel7, sel01]).drop_duplicates(keep=False)

print(df_notes_006_only_vocab_no_rare.shape)
df_notes_006_only_vocab_no_rare.head(3)

# todo: assert that no rare, phrase, sentence or question cards remain in the collection of notes

(8021, 58)


Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2,NoteCreated,LastModified
1,1331799797112,1558189992,kanji commonword noun n2,隙間,すきま,<div>crevice; crack; gap; opening</div>,,,,"<div>Common word, Noun</div>",...,,,,,,,,,2012-03-15 08:23:17.112,2019-05-18 14:33:12
2,1331799797113,1558545381,kanji fromdict noun,苦汁,にがり,bittern; concentrated solution of salts (esp. ...,,,,Noun,...,,,,,,,,,2012-03-15 08:23:17.113,2019-05-22 17:16:21
3,1331799797114,1560123245,commonword noun kanji suruverb fromdict,移籍,いせき,<div>changing household registry; transfer (e....,,,,"<div>Common word, Noun, Suru verb</div>",...,,,,,,,,,2012-03-15 08:23:17.114,2019-06-09 23:34:05


The note model has many columns (fields) that contain no values at all. These can be dropped before data analysis.

In [75]:
# let's look at a small slice of the data (a single note) to get a
# quick overview of the fields and isolate candidates for removal
s = get_rows_by_value_in_col(df_notes_006_only_vocab_no_rare, '発明','Term')
s.head()

Unnamed: 0,nid,mod,tags,Term,Yomi1,Translation,Translation2,Translation3,AlternateForms,PartOfSpeech,...,gBook,semester,gNumber,Transliteration,SoloLookCards,TagOverflow,blank1,blank2,NoteCreated,LastModified
2696,1354094556789,1558184056,fromtest commonword noun kanji n3 textbook sur...,発明,はつめい,"<span style=""""><div>invention</div></span>",,,,"Common word, Noun, Suru verb",...,,,,,,,,,2012-11-28 09:22:36.789,2019-05-18 12:54:16


### 2.2.18. Determine which columns (fields) are unused & can be safely removed

In [76]:
def is_blank (s):
    return not (s and s.strip())

In [77]:
col_names = df_notes_006_only_vocab_no_rare.columns.values
# see that this cell for this row is indeed blank
#print(is_blank(df_notes_006_only_vocab_no_rare['Translation2'].iloc[0]))

row_cnt = df_notes_006_only_vocab_no_rare.shape[0] # number of rows in df_notes_006_only_vocab_no_rare

# https://stackoverflow.com/questions/49677060/pandas-count-empty-strings-in-a-column
empty_strings = pd.DataFrame(df_notes_006_only_vocab_no_rare.values == '',columns=col_names) # boolean DataFrame marking every cell that holds an empty string
temp_dict = (empty_strings.sum()).to_dict()  # number of empty strings per column
removal_candidates = []
for col_name, empty_cnt in temp_dict.items():
    if empty_cnt == row_cnt: # the column is empty in every row
        removal_candidates.append(col_name)
print("Removal candidates:", removal_candidates)

Removal candidates: ['Sound3', 'AtoQ', 'AtoQaudio', 'AtoQkana', 'AtoQtranslation', 'QandApicture', 'answerPicture', 'blank1', 'blank2']
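
The same check can be written more compactly by comparing the whole data frame to `''` and asking whether every cell in a column matched. This is a sketch on a made-up two-row data frame; the column names are borrowed from the notes model, but the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'Term': ['x', 'y'],
    'Yomi2': ['', 'z'],
    'blank1': ['', ''],
})

# A column is a removal candidate when every one of its cells equals ''.
all_empty = (toy == '').all(axis=0)
candidates = all_empty[all_empty].index.tolist()
print(candidates)  # ['blank1']
```

Note that `Yomi2` is not a candidate here, since it is only empty in some rows.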


### 2.2.19. Drop empty columns from notes data frame

In [78]:
df_notes_007_less_cols = df_notes_006_only_vocab_no_rare.copy()

df_notes_007_less_cols = df_notes_007_less_cols.drop(removal_candidates,axis=1)

print_before_after(df_notes_006_only_vocab_no_rare.shape, df_notes_007_less_cols.shape)
print_before_after(df_notes_006_only_vocab_no_rare.columns.values, df_notes_007_less_cols.columns.values)

---------------------------------------------------------------------------
Before: (8021, 58)
---------------------------------------------------------------------------
After: (8021, 49)
---------------------------------------------------------------------------
---------------------------------------------------------------------------
Before: ['nid' 'mod' 'tags' 'Term' 'Yomi1' 'Translation' 'Translation2'
 'Translation3' 'AlternateForms' 'PartOfSpeech' 'Sound' 'Sound2' 'Sound3'
 'Examples' 'ExamplesAudio' 'AtoQ' 'AtoQaudio' 'AtoQkana'
 'AtoQtranslation' 'QandApicture' 'answerPicture' 'Meaning1'
 'SimilarWords' 'RelatedWords' 'Breakdown1' 'Comparison' 'Usage' 'Prompt1'
 'Prompt2' 'KakuMCD' 'IuMCD' 'ExtraMemo' 'Yomi2' 'Meaning2' 'Breakdown2'
 'Picture1' 'Picture2' 'Picture3' 'Picture4' 'HinshiMarker' 'Hint' 'Term2'
 'ArabicNumeral' 'CounterKanji' 'Mnemonic' 'SameSoundWords' 'Yomi3'
 'gChap' 'gBook' 'semester' 'gNumber' 'Transliteration' 'SoloLookCards'
 'TagOverflow' 'blank1' 'blank2

### 2.2.20. Create binary exists/not columns based on presence of a given tag in notes data frame

In [79]:
def add_column_by_tag(df, tag):
    df[tag] = df['tags'].apply(lambda x: tag_exists(str(x), tag))

In [80]:
df_notes_008_binary_tags = df_notes_007_less_cols.copy()
inspect_list = ["commonword",
                "clothing", "animal", "body", "food", "place",
                "textbook", "college", "fromdict", "fromexam",
                "n1", "n2", "n3", "n4", "n5",
                'katakana','hiragana','kanji',
                'adv', 'adj', 'noun', 'verb',
                'nonconvo', 'convo','metalite',
                'hassimilar','hassame'
               ] # todo: for next time, inspect "rare" tag
for item in inspect_list:
    add_column_by_tag(df_notes_008_binary_tags, item)
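
For comparison, pandas can build all of the 0/1 tag columns in one call with `Series.str.get_dummies`. The sketch below is an alternative to the loop above, run on made-up tag strings. `reindex` keeps only the tags of interest and fills in zeros for a tag that never occurs (`n5` in this example).

```python
import pandas as pd

toy = pd.DataFrame({'tags': ['kanji noun n2', 'hiragana', 'kanji verb']})

# One 0/1 column per distinct tag, splitting each tag string on spaces.
dummies = toy['tags'].str.get_dummies(sep=' ')

wanted = ['kanji', 'noun', 'verb', 'n5']                 # tags of interest
dummies = dummies.reindex(columns=wanted, fill_value=0)  # 'n5' never occurs -> all zeros
toy_binary = toy.join(dummies)

print(toy_binary['kanji'].tolist())  # [1, 0, 1]
print(toy_binary['n5'].tolist())     # [0, 0, 0]
```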

In [81]:
df_notes_008_binary_tags.columns.values

array(['nid', 'mod', 'tags', 'Term', 'Yomi1', 'Translation',
       'Translation2', 'Translation3', 'AlternateForms', 'PartOfSpeech',
       'Sound', 'Sound2', 'Examples', 'ExamplesAudio', 'Meaning1',
       'SimilarWords', 'RelatedWords', 'Breakdown1', 'Comparison',
       'Usage', 'Prompt1', 'Prompt2', 'KakuMCD', 'IuMCD', 'ExtraMemo',
       'Yomi2', 'Meaning2', 'Breakdown2', 'Picture1', 'Picture2',
       'Picture3', 'Picture4', 'HinshiMarker', 'Hint', 'Term2',
       'ArabicNumeral', 'CounterKanji', 'Mnemonic', 'SameSoundWords',
       'Yomi3', 'gChap', 'gBook', 'semester', 'gNumber',
       'Transliteration', 'SoloLookCards', 'TagOverflow', 'NoteCreated',
       'LastModified', 'commonword', 'clothing', 'animal', 'body', 'food',
       'place', 'textbook', 'college', 'fromdict', 'fromexam', 'n1', 'n2',
       'n3', 'n4', 'n5', 'katakana', 'hiragana', 'kanji', 'adv', 'adj',
       'noun', 'verb', 'nonconvo', 'convo', 'metalite', 'hassimilar',
       'hassame'], dtype=object)

In [82]:
df_notes_008_binary_tags = df_notes_008_binary_tags.rename(
    columns={
        'hassimilar':'hasSimilarSound','hassame':'hasSameSound'
    }
)

In [83]:
df_notes_008_binary_tags.dtypes.value_counts()

object            45
int64             29
datetime64[ns]     2
dtype: int64

### 2.2.21. Create boolean columns in notes data frame for predictive models

In [84]:
# https://stackoverflow.com/questions/17383094/how-can-i-map-true-false-to-1-0-in-a-pandas-dataframe
#df_notes_008_binary_tags['hasPOS'] = df_notes_008_binary_tags['PartOfSpeech']!="" #todo: expand upon this, by tagify
df_notes_008_binary_tags['hasVisual'] = df_notes_008_binary_tags['Picture1']!=""
#df_notes_008_binary_tags['hasReading'] = df_notes_008_binary_tags['Yomi1']!="" # todo: replace via 'kanji' tag
df_notes_008_binary_tags['hasAudio'] = df_notes_008_binary_tags['Sound']!=""
# note: in Python, 'Translation2' and 'Translation3' and 'Meaning2' evaluates to just 'Meaning2', so only that column is tested
df_notes_008_binary_tags['hasMultiMeaning'] = df_notes_008_binary_tags['Meaning2']!="" # todo: also test 'Translation2' & 'Translation3'
df_notes_008_binary_tags['hasMultiReading'] = df_notes_008_binary_tags['Yomi2']!="" # todo: inspect & incorporate venn diagram: https://commons.wikimedia.org/wiki/File:Homograph_homophone_venn_diagram.png
df_notes_008_binary_tags['hasSimilarMeaning'] = df_notes_008_binary_tags['SimilarWords']!=""
#df_notes_008_binary_tags['hasHomophone'] = df_notes_008_binary_tags['SameSoundWords']!="" # todo: write function, detect homophones # note: currently using meta tag instead to label
df_notes_008_binary_tags['hasAltForm'] = df_notes_008_binary_tags['AlternateForms']!= "" # 'Term2' and 'AlternateForms' evaluates to 'AlternateForms'; todo: also test 'Term2'
df_notes_008_binary_tags['hasRichExamples'] = df_notes_008_binary_tags['ExamplesAudio']!="" # 'Examples' and 'ExamplesAudio' evaluates to 'ExamplesAudio'; todo: also test 'Examples'
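
A word of caution on column selections of the form `df['A' and 'B']`: between two non-empty strings, Python's `and` simply returns the last operand, so only one column is ever looked at. The sketch below shows the pitfall and two ways to test several columns at once. The data is made up, and whether "any field filled" or "all fields filled" is the right rule for each flag is an open choice, not something this notebook settles.

```python
import pandas as pd

# `and` between non-empty strings returns the last operand.
picked = 'Translation2' and 'Translation3' and 'Meaning2'
print(picked)  # Meaning2

toy = pd.DataFrame({
    'Translation2': ['second sense', '', ''],
    'Translation3': ['', '', ''],
    'Meaning2':     ['', 'another sense', ''],
})

cols = ['Translation2', 'Translation3', 'Meaning2']
has_any = (toy[cols] != '').any(axis=1)  # at least one of the fields is filled
has_all = (toy[cols] != '').all(axis=1)  # every one of the fields is filled
print(has_any.astype(int).tolist())  # [1, 1, 0]
print(has_all.astype(int).tolist())  # [0, 0, 0]
```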

In [85]:
# Laura calls this process "Data Enriching"
# todo: confirm that intify_list is to be different/same than inspect_list
intify_list = ['hasVisual','hasAudio','hasMultiMeaning','hasMultiReading','hasSimilarMeaning',
               'hasSimilarSound','hasSameSound','hasAltForm','metalite','hasRichExamples'] # 'hasReading',

### 2.2.22. Drop non-numerical columns from notes data frame

In [86]:
df_notes_009_less_cols = df_notes_008_binary_tags.copy()
df_notes_009_less_cols = df_notes_009_less_cols.drop(['Examples','ExamplesAudio',
                            'Meaning1','RelatedWords','Breakdown1','Comparison',
                           'Usage','Prompt1','Prompt2','KakuMCD','IuMCD','ExtraMemo',
                           'Breakdown2','Picture2','Picture3','Picture4','Mnemonic',
                            'Yomi3','gChap','gBook','semester','gNumber','ArabicNumeral',
                            'CounterKanji','SoloLookCards','HinshiMarker','Hint',
                            'mod','Transliteration','Picture1','Sound','Sound2',
                            'TagOverflow','Translation2', 'Meaning2','Yomi2','Term2',
                            'SameSoundWords','SimilarWords','AlternateForms',
                            'Translation3','Translation','PartOfSpeech'],axis=1) #'hasPOS',
# todo: explore 'mod' (last modified date) as freshness metric

### 2.2.23. Enforce proper numerical boolean type encoding in notes data frame

In [87]:
df_notes_009_less_cols.dtypes.value_counts()

int64             28
bool               7
object             3
datetime64[ns]     2
dtype: int64

In [88]:
for item in intify_list:
    cast_to_typ(df_notes_009_less_cols,item, int)

In [89]:
df_notes_009_less_cols.dtypes.value_counts()

int64             35
object             3
datetime64[ns]     2
dtype: int64

In [90]:
df_notes_009_less_cols.head(35)[30:]

#selection2 = df_binary.loc[df_binary['hasMultiMeaning']==1]
#selection2.head()

Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,...,metalite,hasSimilarSound,hasSameSound,hasVisual,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilarMeaning,hasAltForm,hasRichExamples
37,1342506824729,kanji fromdict,行為,こうい,2012-07-17 06:33:44.729,2019-05-18 14:33:12,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
38,1342506824730,kanji fromtest textbook,行動,こうどう,2012-07-17 06:33:44.730,2019-05-18 12:54:16,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39,1342506824731,kanji hassame fromdict,事態,じたい,2012-07-17 06:33:44.731,2019-05-18 14:33:12,0,0,0,0,...,0,0,1,0,1,0,0,1,0,0
40,1342506824732,kanji fromtest textbook suffix,形,かたち,2012-07-17 06:33:44.732,2019-05-18 12:54:16,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
41,1342506824733,kanji fromtest textbook,様子,ようす,2012-07-17 06:33:44.733,2019-05-18 14:33:12,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


### 2.2.24. Count syllables & character length for each term in notes data frame

In [91]:
df_notes_010_with_len = df_notes_009_less_cols.copy()

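df_notes_010_with_len['TermLen'] = df_notes_010_with_len['Term'].str.len() # number of characters in the term
df_notes_010_with_len['Syllables'] = df_notes_010_with_len['Yomi1'].str.len() # approximate the syllable count by the number of kana in the reading
# terms with no reading (e.g. terms written in kana) fall back to the term length
df_notes_010_with_len.loc[df_notes_010_with_len['Syllables'] == 0, 'Syllables'] = df_notes_010_with_len['TermLen']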

bins = [0,1,2,4,8,128]
labels = ["[1]","[2]","[3:4]","[5:8]","[9: ]"]
# https://stackoverflow.com/questions/45273731/binning-column-with-python-pandas
df_notes_010_with_len['TermLenGroup'] = pd.cut(df_notes_010_with_len['TermLen'], bins=bins, labels=labels)
df_notes_010_with_len['SyllablesGroup'] = pd.cut(df_notes_010_with_len['Syllables'], bins=bins, labels=labels)

# example: df.loc[df['Grades'] <= 77, 'Grades'] = 100
# https://stackoverflow.com/questions/42815768/pandas-adding-column-with-the-length-of-other-column-as-value
df_notes_010_with_len.tail(20)[:5]

Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,...,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilarMeaning,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,SyllablesGroup
8347,1489373157595,metalite,細切り,ほそぎり,2017-03-13 02:45:57.595,2019-05-18 12:54:16,0,0,0,0,...,0,0,0,0,0,0,3,4,[3:4],[3:4]
8348,1489756408272,metalite,離陸,りりく,2017-03-17 13:13:28.272,2019-05-18 12:54:16,0,0,0,0,...,0,0,0,0,0,0,2,3,[2],[3:4]
8350,1496869788801,commonword hasexamples suffix hiragana grammar,しか,,2017-06-07 21:09:48.801,2019-05-18 12:54:16,1,0,0,0,...,0,0,0,0,0,1,2,2,[2],[2]
8355,1508004573617,casual hiragana greetings,じゃあね,,2017-10-14 18:09:33.617,2019-05-18 12:54:16,0,0,0,0,...,1,0,0,0,1,1,4,4,[3:4],[3:4]
8356,1508004608171,casual hiragana greetings,やあ,,2017-10-14 18:10:08.171,2019-05-18 12:54:16,0,0,0,0,...,0,0,0,0,0,0,2,2,[2],[2]
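
To make the bin edges concrete: `pd.cut` treats each bin as open on the left and closed on the right by default, so the edges `[0,1,2,4,8,128]` define the ranges (0,1], (1,2], (2,4], (4,8] and (8,128]. A small sketch with a few made-up lengths:

```python
import pandas as pd

bins = [0, 1, 2, 4, 8, 128]
labels = ["[1]", "[2]", "[3:4]", "[5:8]", "[9: ]"]

lengths = pd.Series([1, 2, 3, 4, 5, 8, 9, 30])
groups = pd.cut(lengths, bins=bins, labels=labels)  # right=True is the default
print(groups.astype(str).tolist())
# ['[1]', '[2]', '[3:4]', '[3:4]', '[5:8]', '[5:8]', '[9: ]', '[9: ]']
```

A value of 0 would fall outside the first bin and come back as missing, and so would anything above 128.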


### 2.2.25. Inspect the longest syllable entries in notes data frame

In [92]:
df_many_syl = df_notes_010_with_len.copy()
many_syl = df_many_syl['Syllables'] > 16
df_many_syl.loc[many_syl] #todo: check nid of 1391477462767

Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,...,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilarMeaning,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,SyllablesGroup
242,1346057958628,history fromnews inspect kanji japan1st cultur...,東京電力福島・第１原発事故,とうきょうでんりょくふくしま・だいいちげんぱつじこ,2012-08-27 08:59:18.628,2019-05-22 17:03:12,0,0,0,0,...,0,0,0,0,0,0,13,25,[9: ],[9: ]
308,1346215143756,datesandtime numeric,1837～1901年,せんはっぴゃくさんじゅうななねんからせんきゅうひゃくいちねん,2012-08-29 04:39:03.756,2019-05-13 20:58:19,0,0,0,0,...,0,0,0,0,0,0,10,30,[9: ],[9: ]
421,1346216471844,datesandtime numeric kanji counter fromdict,千九百八十九年,せんきゅうひゃくはちじゅうきゅうねん,2012-08-29 05:01:11.844,2019-05-13 20:00:43,0,0,0,0,...,0,0,0,0,0,0,7,18,[5:8],[9: ]
5542,1387411183585,datesandtime numeric kanji,千九百八十七年,せんきゅうひゃくはちじゅうななねん,2013-12-18 23:59:43.585,2019-05-13 20:00:52,0,0,0,0,...,0,0,0,0,0,0,7,17,[5:8],[9: ]


In [93]:
df_notes_010_with_len.columns.values

array(['nid', 'tags', 'Term', 'Yomi1', 'NoteCreated', 'LastModified',
       'commonword', 'clothing', 'animal', 'body', 'food', 'place',
       'textbook', 'college', 'fromdict', 'fromexam', 'n1', 'n2', 'n3',
       'n4', 'n5', 'katakana', 'hiragana', 'kanji', 'adv', 'adj', 'noun',
       'verb', 'nonconvo', 'convo', 'metalite', 'hasSimilarSound',
       'hasSameSound', 'hasVisual', 'hasAudio', 'hasMultiMeaning',
       'hasMultiReading', 'hasSimilarMeaning', 'hasAltForm',
       'hasRichExamples', 'TermLen', 'Syllables', 'TermLenGroup',
       'SyllablesGroup'], dtype=object)

In [94]:
# labels terms by their JLPT level.
# bear in mind that some terms have multiple JLPT levels.
# this function checks N5 first, so it assigns the easiest (highest-numbered) JLPT level associated with a term.
def label_jlpt_lvl (row):
    if row['n5'] == 1 :
        return 5
    elif row['n4'] == 1:
        return 4
    elif row['n3'] == 1:
        return 3
    elif row['n2'] == 1:
        return 2
    elif row['n1'] == 1:
        return 1
    else:
        return None

### 2.2.26. Assign JLPT number to words with JLPT "N" levels in notes data frame

In [95]:
df_notes_011_jptl_lvl = df_notes_010_with_len.copy()
df_notes_011_jptl_lvl['jlpt_lvl_d'] = df_notes_011_jptl_lvl.apply(label_jlpt_lvl, axis=1)

In [96]:
df_notes_011_jptl_lvl.columns.values

array(['nid', 'tags', 'Term', 'Yomi1', 'NoteCreated', 'LastModified',
       'commonword', 'clothing', 'animal', 'body', 'food', 'place',
       'textbook', 'college', 'fromdict', 'fromexam', 'n1', 'n2', 'n3',
       'n4', 'n5', 'katakana', 'hiragana', 'kanji', 'adv', 'adj', 'noun',
       'verb', 'nonconvo', 'convo', 'metalite', 'hasSimilarSound',
       'hasSameSound', 'hasVisual', 'hasAudio', 'hasMultiMeaning',
       'hasMultiReading', 'hasSimilarMeaning', 'hasAltForm',
       'hasRichExamples', 'TermLen', 'Syllables', 'TermLenGroup',
       'SyllablesGroup', 'jlpt_lvl_d'], dtype=object)

In [97]:
df_notes_011_jptl_lvl['jlpt_lvl_d'].value_counts()

3.0    243
5.0    171
2.0    129
1.0    124
4.0    116
Name: jlpt_lvl_d, dtype: int64
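
The row-wise `apply` above can also be written as a vectorized lookup. The following is a minimal sketch on a toy frame, not the notebook's own code: the `notes` frame and its values are made up for illustration, and only the `n1`..`n5` and `jlpt_lvl_d` column names come from the notebook. It keeps the same precedence (N5 is checked first) and uses a nullable integer dtype so levels show as `5` rather than `5.0`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notes frame (hypothetical data); only the n1..n5 flags matter here.
notes = pd.DataFrame({
    "Term": ["a", "b", "c", "d"],
    "n1": [0, 1, 0, 0],
    "n2": [1, 0, 0, 0],
    "n3": [0, 0, 0, 0],
    "n4": [0, 0, 0, 0],
    "n5": [1, 0, 0, 1],
})

levels = [5, 4, 3, 2, 1]  # checked in this order, so N5 wins when several flags are set
conditions = [notes[f"n{lvl}"] == 1 for lvl in levels]

# np.select takes the first matching condition per row; rows with no flag get NaN.
jlpt = pd.Series(np.select(conditions, levels, default=np.nan), index=notes.index)

# Optional: the nullable Int64 dtype stores missing levels as <NA> and keeps integers intact.
notes["jlpt_lvl_d"] = jlpt.astype("Int64")
print(notes[["Term", "jlpt_lvl_d"]])
```

Whether the nullable dtype is worth it depends on the later steps; the category conversion further down would work with either representation.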

In [98]:
# 2.2.26.2 Assign script (char) type to each note
df_notes_012_script_type = df_notes_011_jptl_lvl.copy()

char_list = ['katakana','hiragana'] # ,'hasReading'

In [99]:
def label_script (row):
    if row['katakana'] == 1 :
        return 'katakana'
    elif row['hiragana'] == 1:
        return 'hiragana'
    elif row['kanji'] == 1:
        return 'kanji'
    else:
        return None  # todo: restore 'otherScript' after testing out ''; todo: add 'romaji' type

In [100]:
df_notes_012_script_type['script'] = df_notes_012_script_type.apply(label_script, axis=1)

In [101]:
df_notes_012_script_type.head()

Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,...,hasMultiReading,hasSimilarMeaning,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,SyllablesGroup,jlpt_lvl_d,script
1,1331799797112,kanji commonword noun n2,隙間,すきま,2012-03-15 08:23:17.112,2019-05-18 14:33:12,1,0,0,0,...,0,1,0,0,2,3,[2],[3:4],2.0,kanji
2,1331799797113,kanji fromdict noun,苦汁,にがり,2012-03-15 08:23:17.113,2019-05-22 17:16:21,0,0,0,0,...,0,0,0,0,2,3,[2],[3:4],,kanji
3,1331799797114,commonword noun kanji suruverb fromdict,移籍,いせき,2012-03-15 08:23:17.114,2019-06-09 23:34:05,1,0,0,0,...,0,0,0,0,2,3,[2],[3:4],,kanji
5,1331799797117,verb commonword n2 kanji transitive fromdict,吊るす,つるす,2012-03-15 08:23:17.117,2019-05-18 14:33:12,1,0,0,0,...,0,1,0,0,3,3,[3:4],[3:4],2.0,kanji
6,1331799797118,commonword naadjective checked kanji n1 convo,和やか,なごやか,2012-03-15 08:23:17.118,2019-05-27 21:33:22,1,0,0,0,...,0,0,0,0,3,4,[3:4],[3:4],1.0,kanji


### 2.2.27. Create df_notes_013_mid_section_2 data frame for progress saving

In [102]:
#df_combo_005_notes_galore
df_notes_013_mid_section_2 = df_notes_012_script_type.copy()
print(df_notes_013_mid_section_2.shape)
df_notes_013_mid_section_2.head()

(8021, 46)


Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,...,hasMultiReading,hasSimilarMeaning,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,SyllablesGroup,jlpt_lvl_d,script
1,1331799797112,kanji commonword noun n2,隙間,すきま,2012-03-15 08:23:17.112,2019-05-18 14:33:12,1,0,0,0,...,0,1,0,0,2,3,[2],[3:4],2.0,kanji
2,1331799797113,kanji fromdict noun,苦汁,にがり,2012-03-15 08:23:17.113,2019-05-22 17:16:21,0,0,0,0,...,0,0,0,0,2,3,[2],[3:4],,kanji
3,1331799797114,commonword noun kanji suruverb fromdict,移籍,いせき,2012-03-15 08:23:17.114,2019-06-09 23:34:05,1,0,0,0,...,0,0,0,0,2,3,[2],[3:4],,kanji
5,1331799797117,verb commonword n2 kanji transitive fromdict,吊るす,つるす,2012-03-15 08:23:17.117,2019-05-18 14:33:12,1,0,0,0,...,0,1,0,0,3,3,[3:4],[3:4],2.0,kanji
6,1331799797118,commonword naadjective checked kanji n1 convo,和やか,なごやか,2012-03-15 08:23:17.118,2019-05-27 21:33:22,1,0,0,0,...,0,0,0,0,3,4,[3:4],[3:4],1.0,kanji


### 2.2.28. ~~Export df_notes_midway~~

In [103]:
#df_notes_013_mid_section_2.to_csv('datasets/df_notes_013_mid_section_2.csv')

***
- [Previous section: Notes](#notes)
- [Next section: Review Log](#revlog)
***
# <a name="combo"></a> Combo of Notes & Cards

### 2.3.1. Merge card & note data frames to conduct cross analysis

In [104]:
# now that we have note IDs for all the words, we can
# join these separate dataframes together
df_combo = pd.merge(df_notes_013_mid_section_2, df_cards_012_mid_section_2, on='nid')
print(df_combo.shape)

# remove from df_combo any fields that belong only to cards, such as c_ivl_q, c_factor_q
df_combo = df_combo.drop(['c_ivl_q','c_factor_q'], axis=1)
df_combo.head()

(6732, 63)


Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,...,CardCreated,DueDate,CardType_listen,CardType_look,CardType_read,CardType_recall,cardtype,waste,roi,c_suff_reviewed
0,1331799797114,commonword noun kanji suruverb fromdict,移籍,いせき,2012-03-15 08:23:17.114,2019-06-09 23:34:05,1,0,0,0,...,2012-03-15 08:23:17.114,2015-02-04 09:00:00,0,0,1,0,read,0.142857,14.142857,1.0
1,1331799797117,verb commonword n2 kanji transitive fromdict,吊るす,つるす,2012-03-15 08:23:17.117,2019-05-18 14:33:12,1,0,0,0,...,2017-11-23 23:58:11.056,2019-07-04 09:00:00,0,0,0,1,recall,0.166667,4.666667,1.0
2,1331799797118,commonword naadjective checked kanji n1 convo,和やか,なごやか,2012-03-15 08:23:17.118,2019-05-27 21:33:22,1,0,0,0,...,2017-11-23 23:58:11.057,2019-06-10 09:00:00,0,0,0,1,recall,0.166667,2.333333,1.0
3,1331799797122,kanji hassame fromdict,在庫,ざいこ,2012-03-15 08:23:17.122,2019-05-18 12:54:16,0,0,0,0,...,2012-03-15 08:23:17.122,2015-07-04 09:00:00,0,0,1,0,read,0.2,44.8,1.0
4,1331799797126,kanji fromdict,有能,ゆうのう,2012-03-15 08:23:17.126,2019-05-27 20:00:11,0,0,0,0,...,2012-03-15 08:23:17.126,2015-09-04 09:00:00,0,0,1,0,read,0.111111,27.555556,1.0
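
The merge above relies on each note appearing once on the notes side and possibly several times on the cards side. A minimal sketch of how that assumption could be checked is shown here; the `notes` and `cards` frames are toy data invented for illustration, while `validate` is a standard `pd.merge` parameter.

```python
import pandas as pd

# Toy frames (hypothetical data): note 1 has two cards, note 3 has none.
notes = pd.DataFrame({"nid": [1, 2, 3], "Term": ["a", "b", "c"]})
cards = pd.DataFrame({"cid": [10, 11, 12], "nid": [1, 1, 2], "reps": [7, 6, 5]})

# validate="one_to_many" raises pandas.errors.MergeError if "nid" repeats on the notes side.
combo = pd.merge(notes, cards, on="nid", how="inner", validate="one_to_many")

# With the default inner join, notes that have no cards drop out of the result.
notes_without_cards = notes.loc[~notes["nid"].isin(cards["nid"]), "nid"].tolist()
print(combo.shape, notes_without_cards)
```

This also illustrates why the merged frame can have fewer rows than the notes frame: notes without any remaining cards are not carried over by an inner join.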


Let's further refine the dataframe entries to represent which notes have (1) visual data, (2) audio data, and (3) an L1 ("first language", English in this case) translation. We can represent these with binary values (0 for absent, 1 for present).

### 2.3.2. Group notes by ID to determine card type overlap, simple totals per note

In [105]:
# https://stackoverflow.com/questions/23919563/merge-rows-of-a-dataframe-in-pandas-based-on-a-column
# https://stackoverflow.com/questions/13851535/how-to-delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression
df_combo_001_grouped_notes = df_combo.copy()
# remove from this temporary grouped-notes dataframe the card id, the per-card ratios (waste, roi), and the note-level columns that are not aggregated here
df_combo_001_grouped_notes = df_combo_001_grouped_notes.drop(
    ['cid','hasAltForm','TermLen','Syllables','jlpt_lvl_d','katakana','hiragana',
     'noun','verb','convo','commonword','clothing','animal','body','food','place','textbook',
     'college','fromdict','fromexam','n1','n2','n3','n4','n5','hasVisual','hasAudio',
     'hasMultiMeaning','hasMultiReading','hasSimilarSound','hasSameSound',
     'hasSimilarMeaning','hasRichExamples','metalite','waste','roi'],axis=1) # ,'hasReading','len1'

# create data frame for totals per term: reps, lapses, and all vectors (card types) 
df_combo_002_note_totals = df_combo_001_grouped_notes.copy()
df_combo_002_note_totals = df_combo_002_note_totals.groupby(['nid']).sum()

# drop numerical fields that don't logically add up when summed (interval and ease factor); their quantile buckets were already dropped from df_combo above
df_combo_002_note_totals = df_combo_002_note_totals.drop(['ivl','factor'],axis=1) # 'c_ivl_q','c_factor_q'
df_combo_002_note_totals.tail(20)[-5:]

Unnamed: 0_level_0,kanji,adv,adj,nonconvo,reps,lapses,CardType_listen,CardType_look,CardType_read,CardType_recall,c_suff_reviewed
nid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1483483650760,0,0,0,0,6,0,0,0,1,0,1.0
1483483650782,0,0,0,0,12,1,0,1,0,0,1.0
1483483650784,1,0,0,0,10,1,0,1,0,0,1.0
1483483650793,1,0,0,0,5,0,0,0,1,0,1.0
1549184119039,0,0,0,0,5,0,0,0,1,0,1.0


In [106]:
df_combo_002_note_totals = df_combo_002_note_totals.rename(
    columns={
        'reps':'total_reps', 'lapses':'total_lapses', 'CardType_listen':'hasListenCard',
        'CardType_recall':'hasTranslateCard', 'CardType_read':'hasReadCard',
        'CardType_look':'hasPictureCard'
    }
)

In [107]:
df_combo_002_note_totals.tail(20)[-5:]

Unnamed: 0_level_0,kanji,adv,adj,nonconvo,total_reps,total_lapses,hasListenCard,hasPictureCard,hasReadCard,hasTranslateCard,c_suff_reviewed
nid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1483483650760,0,0,0,0,6,0,0,0,1,0,1.0
1483483650782,0,0,0,0,12,1,0,1,0,0,1.0
1483483650784,1,0,0,0,10,1,0,1,0,0,1.0
1483483650793,1,0,0,0,5,0,0,0,1,0,1.0
1549184119039,0,0,0,0,5,0,0,0,1,0,1.0
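
One thing to watch with a plain `sum()` is that the renamed `has...Card` columns are counts rather than strict 0/1 flags: a note with two cards of the same type would get a 2. The sketch below shows one way to keep totals and flags separate using named aggregation. The toy `cards` frame is hypothetical; the output column names mirror the ones used above.

```python
import pandas as pd

# Toy card rows (hypothetical data): note 1 has two read cards.
cards = pd.DataFrame({
    "nid": [1, 1, 2],
    "reps": [5, 5, 6],
    "lapses": [0, 0, 1],
    "CardType_read": [1, 1, 1],
    "CardType_look": [0, 0, 0],
})

# Sum the counters, but take the max of the dummy columns so they stay binary.
per_note = cards.groupby("nid").agg(
    total_reps=("reps", "sum"),
    total_lapses=("lapses", "sum"),
    hasReadCard=("CardType_read", "max"),
    hasPictureCard=("CardType_look", "max"),
)
print(per_note)
```

If a per-type card count is what the later analysis needs, the plain sum is the right choice and only the `has...` naming is misleading.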


### 2.3.3. Group notes by ID to find simple means per note

In [108]:
df_combo_001_grouped_notes.tail(20)[-5:]

Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,kanji,adv,adj,nonconvo,...,reps,lapses,CardCreated,DueDate,CardType_listen,CardType_look,CardType_read,CardType_recall,cardtype,c_suff_reviewed
6727,1483483650760,commonword college noun semester1 n4 textbook ...,さっき,,2017-01-03 22:47:30.760,2019-05-18 12:54:16,0,0,0,0,...,6,0,2017-01-03 22:47:31.360,2017-05-15 09:00:00,0,0,1,0,read,1.0
6728,1483483650782,commonword sports college noun katakana textbo...,サーフィン,,2017-01-03 22:47:30.782,2019-05-18 12:54:16,0,0,0,0,...,12,1,2017-01-21 20:10:31.537,2021-02-14 09:00:00,0,1,0,0,look,1.0
6729,1483483650784,commonword college noun kanji n5 textbook seme...,葉書,はがき,2017-01-03 22:47:30.784,2019-05-18 12:54:16,1,0,0,0,...,10,1,2017-01-21 20:12:07.559,2019-05-04 09:00:00,0,1,0,0,look,1.0
6730,1483483650793,college naadjective kanji textbook semester1,好き,すき,2017-01-03 22:47:30.793,2019-05-18 12:54:16,1,0,0,0,...,5,0,2017-01-03 22:47:31.393,2019-11-21 09:00:00,0,0,1,0,read,1.0
6731,1549184119039,metalite,閏年,うるうどし,2019-02-03 08:55:19.039,2019-05-13 20:00:56,0,0,0,0,...,5,0,2019-02-03 08:55:29.288,2019-08-21 09:00:00,0,0,1,0,read,1.0


In [109]:
df_combo_003_note_means = df_combo_001_grouped_notes.copy()
df_combo_003_note_means = df_combo_003_note_means.groupby(['nid']).mean()
df_combo_003_note_means.tail(20)[-5:]

Unnamed: 0_level_0,kanji,adv,adj,nonconvo,ivl,factor,reps,lapses,CardType_listen,CardType_look,CardType_read,CardType_recall,c_suff_reviewed
nid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1483483650760,0.0,0.0,0.0,0.0,66.0,2420.0,6.0,0.0,0.0,0.0,1.0,0.0,1.0
1483483650782,0.0,0.0,0.0,0.0,658.0,2160.0,12.0,1.0,0.0,1.0,0.0,0.0,1.0
1483483650784,1.0,0.0,0.0,0.0,3.0,2160.0,10.0,1.0,0.0,1.0,0.0,0.0,1.0
1483483650793,1.0,0.0,0.0,0.0,552.0,2270.0,5.0,0.0,0.0,0.0,1.0,0.0,1.0
1549184119039,0.0,0.0,0.0,0.0,108.0,2410.0,5.0,0.0,0.0,0.0,1.0,0.0,1.0
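
The output above contains only numeric columns even though the grouped frame also holds text and date columns, so the pandas version used here evidently dropped them silently. As far as I am aware, recent pandas releases (2.0 and later) raise an error instead, so passing `numeric_only=True` makes the intent explicit. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Toy card rows (hypothetical data) with one text column that cannot be averaged.
cards = pd.DataFrame({
    "nid": [1, 1, 2],
    "Term": ["a", "a", "b"],
    "ivl": [100, 200, 30],
    "factor": [2000, 2400, 2100],
})

# numeric_only=True restricts the aggregation to numeric columns.
note_means = cards.groupby("nid").mean(numeric_only=True)
print(note_means)
```

The same parameter is available on `sum()`, which is relevant to the totals computed in 2.3.2.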


In [110]:
df_combo_003_note_means = df_combo_003_note_means.drop(['CardType_listen','CardType_recall',
    'CardType_read', 'CardType_look'],axis=1)
df_combo_003_note_means = df_combo_003_note_means.rename(
    columns={'ivl':'mean_ivl','factor':'mean_factor','reps':'mean_reps','lapses':'mean_lapses'})
df_combo_003_note_means.tail(20)[-5:]

Unnamed: 0_level_0,kanji,adv,adj,nonconvo,mean_ivl,mean_factor,mean_reps,mean_lapses,c_suff_reviewed
nid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1483483650760,0.0,0.0,0.0,0.0,66.0,2420.0,6.0,0.0,1.0
1483483650782,0.0,0.0,0.0,0.0,658.0,2160.0,12.0,1.0,1.0
1483483650784,1.0,0.0,0.0,0.0,3.0,2160.0,10.0,1.0,1.0
1483483650793,1.0,0.0,0.0,0.0,552.0,2270.0,5.0,0.0,1.0
1549184119039,0.0,0.0,0.0,0.0,108.0,2410.0,5.0,0.0,1.0


### 2.3.4. Combine note totals, note means & general notes

In [111]:
df_temp = df_combo_002_note_totals.copy()
df_temp = df_temp.drop(['kanji','adv','adj','nonconvo'],axis=1)

df_temp_2 = df_combo_003_note_means.copy()
df_temp_2 = df_temp_2.drop(['kanji','adv','adj','nonconvo'],axis=1) # 'ivl_q','factor_q'

df_combo_004_notes_galore = pd.merge(df_temp_2, df_temp,on='nid')
#print(df_combo_004_notes_galore.columns.values)

df_combo_005_notes_galore = pd.merge(df_combo,df_combo_004_notes_galore,on='nid')
# https://stackoverflow.com/questions/47022070/display-all-dataframe-columns-in-a-jupyter-python-notebook
pd.options.display.max_columns = None

#print(df_combo_005_notes_galore.columns.values)

# drop card specific columns, these are no longer valid
df_combo_005_notes_galore = df_combo_005_notes_galore.drop(['cid','ivl','factor','reps','lapses','CardCreated','DueDate','CardType_listen','CardType_look','CardType_read','CardType_recall'],axis=1)
df_combo_005_notes_galore.head(10)

Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,food,place,textbook,college,fromdict,fromexam,n1,n2,n3,n4,n5,katakana,hiragana,kanji,adv,adj,noun,verb,nonconvo,convo,metalite,hasSimilarSound,hasSameSound,hasVisual,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilarMeaning,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,SyllablesGroup,jlpt_lvl_d,script,cardtype,waste,roi,c_suff_reviewed,mean_ivl,mean_factor,mean_reps,mean_lapses,c_suff_reviewed_x,total_reps,total_lapses,hasListenCard,hasPictureCard,hasReadCard,hasTranslateCard,c_suff_reviewed_y
0,1331799797114,commonword noun kanji suruverb fromdict,移籍,いせき,2012-03-15 08:23:17.114,2019-06-09 23:34:05,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2,3,[2],[3:4],,kanji,read,0.142857,14.142857,1.0,99.0,1980.0,7.0,0.0,1.0,7,0,0,0,1,0,1.0
1,1331799797117,verb commonword n2 kanji transitive fromdict,吊るす,つるす,2012-03-15 08:23:17.117,2019-05-18 14:33:12,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,3,3,[3:4],[3:4],2.0,kanji,recall,0.166667,4.666667,1.0,28.0,2410.0,6.0,0.0,1.0,6,0,0,0,0,1,1.0
2,1331799797118,commonword naadjective checked kanji n1 convo,和やか,なごやか,2012-03-15 08:23:17.118,2019-05-27 21:33:22,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,3,4,[3:4],[3:4],1.0,kanji,recall,0.166667,2.333333,1.0,14.0,2410.0,6.0,0.0,1.0,6,0,0,0,0,1,1.0
3,1331799797122,kanji hassame fromdict,在庫,ざいこ,2012-03-15 08:23:17.122,2019-05-18 12:54:16,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,3,[2],[3:4],,kanji,read,0.2,44.8,1.0,224.0,2130.0,5.0,0.0,1.0,5,0,0,0,1,0,1.0
4,1331799797126,kanji fromdict,有能,ゆうのう,2012-03-15 08:23:17.126,2019-05-27 20:00:11,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,4,[2],[3:4],,kanji,read,0.111111,27.555556,1.0,248.0,2130.0,9.0,0.0,1.0,9,0,0,0,1,0,1.0
5,1331799797127,transportation noun travel mixedscript haskanj...,公衆トイレ,こうしゅうトイレ,2012-03-15 08:23:17.127,2019-05-27 20:59:39,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,5,8,[5:8],[5:8],,,read,0.111111,25.444444,1.0,229.0,2270.0,9.0,0.0,1.0,9,0,0,0,1,0,1.0
6,1331799797128,kanji fromdict,送り賃,おくりちん,2012-03-15 08:23:17.128,2019-05-18 12:54:16,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,5,[3:4],[5:8],,kanji,read,0.125,22.25,1.0,178.0,2120.0,8.0,0.0,1.0,8,0,0,0,1,0,1.0
7,1331799797130,technical kanji fromdict noun,量子物理学,りょうしぶつりがく,2012-03-15 08:23:17.130,2019-05-28 00:40:16,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,5,9,[5:8],[9: ],,kanji,read,0.142857,29.142857,1.0,204.0,2270.0,7.0,0.0,1.0,7,0,0,0,1,0,1.0
8,1331799797132,kanji fromdict,抽象的,ちゅうしょうてき,2012-03-15 08:23:17.132,2019-05-18 12:54:16,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,8,[3:4],[5:8],,kanji,read,0.105263,5.157895,1.0,98.0,1920.0,19.0,1.0,1.0,19,1,0,0,1,0,1.0
9,1331799797133,kanji fromdict,理想的,りそうてき,2012-03-15 08:23:17.133,2019-05-18 12:54:16,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,5,[3:4],[5:8],,kanji,read,0.142857,70.714286,1.0,495.0,2270.0,7.0,0.0,1.0,7,0,0,0,1,0,1.0
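
The `c_suff_reviewed_x` and `c_suff_reviewed_y` columns in the output above come from merging two frames that both carry `c_suff_reviewed`; pandas appends suffixes to keep them apart. A minimal sketch of one way to avoid the duplicates is to drop the overlapping columns from one side before merging. The `means` and `totals` frames below are toy stand-ins, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the per-note means and totals (hypothetical values).
means = pd.DataFrame({"nid": [1, 2], "mean_reps": [6.0, 5.0], "c_suff_reviewed": [1.0, 1.0]})
totals = pd.DataFrame({"nid": [1, 2], "total_reps": [12, 5], "c_suff_reviewed": [2.0, 1.0]})

# Columns present on both sides (other than the join key) are what trigger the _x/_y suffixes.
overlap = means.columns.intersection(totals.columns).difference(["nid"])

# Keep the shared column on one side only, here the totals side.
merged = pd.merge(means.drop(columns=overlap), totals, on="nid")
print(list(merged.columns))
```

Which side to keep is a judgment call: the mean and the sum of `c_suff_reviewed` carry different information, so the choice should follow what section 3 needs.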


### 2.3.5. Inspect combo dtypes

In [112]:
# strategies to fix dtypes: https://stackoverflow.com/questions/28910851/python-pandas-changing-some-column-types-to-categories
print(df_combo_005_notes_galore.dtypes.value_counts())
df_combo_005_notes_galore.dtypes

int64             39
float64           10
object             5
uint8              4
category           2
datetime64[ns]     2
dtype: int64


nid                           int64
tags                         object
Term                         object
Yomi1                        object
NoteCreated          datetime64[ns]
LastModified         datetime64[ns]
commonword                    int64
clothing                      int64
animal                        int64
body                          int64
food                          int64
place                         int64
textbook                      int64
college                       int64
fromdict                      int64
fromexam                      int64
n1                            int64
n2                            int64
n3                            int64
n4                            int64
n5                            int64
katakana                      int64
hiragana                      int64
kanji                         int64
adv                           int64
adj                           int64
noun                          int64
verb                        

### 2.3.6. Fix combo dtypes

In [113]:
convert_bool_list = ['hasRichExamples']

for item in convert_bool_list:
    cast_to_typ(df_combo_005_notes_galore,item, int)

convert_category_list = ['jlpt_lvl_d','script','cardtype']

for col in convert_category_list:
    df_combo_005_notes_galore[col] = df_combo_005_notes_galore[col].astype('category')

convert_dates_list = ['NoteCreated','LastModified']

# https://stackoverflow.com/questions/28910851/python-pandas-changing-some-column-types-to-categories
# pd.to_datetime is used here because newer pandas releases reject the unit-less 'datetime64' dtype in astype
for col in convert_dates_list:
    df_combo_005_notes_galore[col] = pd.to_datetime(df_combo_005_notes_galore[col])
    
# todo: use the following three lists as a rough guide to start prepping dataframes
# for export & analysis in section 3!!! ^_^
binary_list = ['commonword','clothing','animal','body','food','place','textbook','college',
    'fromdict','fromexam','n1','n2','n3','n4','n5','katakana','hiragana','noun','verb',
    'convo','metalite','hasVisual','hasAudio','hasMultiMeaning','hasMultiReading','hasSimilar',
    'hasHomophone','hasAltForm','hasRichExamples','hasListenCard','hasPictureCard','hasReadCard',
    'hasTranslateCard'] # 'len1'
continuous_list = ['TermLen','Syllables','mean_ivl','mean_factor','mean_reps','mean_lapses',
                   'total_reps','total_lapses']
discrete_non_binary_list = ['NoteCreated','LastModified','TermLenGroup','jlpt_lvl_d']
    
df_combo_005_notes_galore.dtypes

nid                           int64
tags                         object
Term                         object
Yomi1                        object
NoteCreated          datetime64[ns]
LastModified         datetime64[ns]
commonword                    int64
clothing                      int64
animal                        int64
body                          int64
food                          int64
place                         int64
textbook                      int64
college                       int64
fromdict                      int64
fromexam                      int64
n1                            int64
n2                            int64
n3                            int64
n4                            int64
n5                            int64
katakana                      int64
hiragana                      int64
kanji                         int64
adv                           int64
adj                           int64
noun                          int64
verb                        

### 2.3.7. Create df_combo_006_final_section_2 data frame for export

In [114]:
df_combo_006_final_section_2 = df_combo_005_notes_galore.copy()
print(df_combo_006_final_section_2.shape)
df_combo_006_final_section_2.tail(10)[:5]

(6732, 62)


Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,food,place,textbook,college,fromdict,fromexam,n1,n2,n3,n4,n5,katakana,hiragana,kanji,adv,adj,noun,verb,nonconvo,convo,metalite,hasSimilarSound,hasSameSound,hasVisual,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilarMeaning,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,SyllablesGroup,jlpt_lvl_d,script,cardtype,waste,roi,c_suff_reviewed,mean_ivl,mean_factor,mean_reps,mean_lapses,c_suff_reviewed_x,total_reps,total_lapses,hasListenCard,hasPictureCard,hasReadCard,hasTranslateCard,c_suff_reviewed_y
6722,1483483650705,college n5 katakana textbook semester1,テープ,,2017-01-03 22:47:30.705,2019-05-18 12:54:16,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,3,3,[3:4],[3:4],5.0,katakana,read,0.2,108.8,1.0,553.0,2560.0,5.0,0.0,1.0,10,0,0,1,1,0,2.0
6723,1483483650705,college n5 katakana textbook semester1,テープ,,2017-01-03 22:47:30.705,2019-05-18 12:54:16,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,3,3,[3:4],[3:4],5.0,katakana,look,0.2,112.4,1.0,553.0,2560.0,5.0,0.0,1.0,10,0,0,1,1,0,2.0
6724,1483483650706,college gairaigo katakana clothing textbook se...,トレーナー,,2017-01-03 22:47:30.706,2019-05-18 12:54:16,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,5,5,[5:8],[5:8],,katakana,read,0.2,96.8,1.0,484.0,2410.0,5.0,0.0,1.0,5,0,0,0,1,0,1.0
6725,1483483650709,textbook haskanji semester1 college,お手洗い,おてあらい,2017-01-03 22:47:30.709,2019-05-18 12:54:16,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,4,5,[3:4],[5:8],,,read,0.2,17.8,1.0,89.0,2410.0,5.0,0.0,1.0,5,0,0,0,1,0,1.0
6726,1483483650759,college food shopping katakana textbook semester1,レストラン,,2017-01-03 22:47:30.759,2019-05-18 12:54:16,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,5,5,[5:8],[5:8],,katakana,read,0.166667,20.666667,1.0,124.0,2420.0,6.0,0.0,1.0,6,0,0,0,1,0,1.0


### 2.3.8. ~~Export df_combo_006_final_section_2~~

In [115]:
#df_combo_006_final_section_2.to_csv('datasets/df_combo_006_final_section_2.csv')

In [116]:
# todo: recombine the mid-section notes dataframe w/ df_combo_006_final_section_2
# the goal here is to avoid unnecessary row data duplication & keep all columns

In [117]:
#df_temp_3 = df_notes_013_mid_section_2.copy()
#df_temp_3 = df_temp_3.drop(['tags','Term','Yomi1','NoteCreated','LastModified','commonword',
#    'clothing', 'animal', 'body', 'food', 'place','textbook', 'college', 'fromdict', 'fromexam',
#    'len1', 'n1', 'n2','n3', 'n4', 'n5', 'katakana', 'hiragana', 'noun', 'verb', 'convo',
#    'hasnotags', 'hasVisual', 'hasReading','hasAudio','hasMultiMeaning', 'hasMultiReading',
#    'hasSimilar', 'hasHomophone','hasAltForm', 'hasRichExamples', 'TermLen', 'Syllables',
#    'TermLenGroup','jlpt_lvl_d'],axis=1)

In [118]:
#df_temp_3.head()

In [119]:
# todo: remove extraneous cell & variable renaming
df_notes_013_final_section_2 = df_combo_006_final_section_2.copy()

In [120]:
df_notes_014_final_section_2 = df_notes_013_final_section_2.copy()
df_notes_014_final_section_2 = df_notes_014_final_section_2.drop(['waste','roi','cardtype'],axis=1) #'ivl_q','factor_q'
df_notes_014_final_section_2 = df_notes_014_final_section_2.drop_duplicates(['nid'], keep='first')

In [121]:
df_notes_014_final_section_2.tail(20)[-5:] ####

Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,food,place,textbook,college,fromdict,fromexam,n1,n2,n3,n4,n5,katakana,hiragana,kanji,adv,adj,noun,verb,nonconvo,convo,metalite,hasSimilarSound,hasSameSound,hasVisual,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilarMeaning,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,SyllablesGroup,jlpt_lvl_d,script,c_suff_reviewed,mean_ivl,mean_factor,mean_reps,mean_lapses,c_suff_reviewed_x,total_reps,total_lapses,hasListenCard,hasPictureCard,hasReadCard,hasTranslateCard,c_suff_reviewed_y
6727,1483483650760,commonword college noun semester1 n4 textbook ...,さっき,,2017-01-03 22:47:30.760,2019-05-18 12:54:16,1,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,3,3,[3:4],[3:4],4.0,hiragana,1.0,66.0,2420.0,6.0,0.0,1.0,6,0,0,0,1,0,1.0
6728,1483483650782,commonword sports college noun katakana textbo...,サーフィン,,2017-01-03 22:47:30.782,2019-05-18 12:54:16,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,5,5,[5:8],[5:8],,katakana,1.0,658.0,2160.0,12.0,1.0,1.0,12,1,0,1,0,0,1.0
6729,1483483650784,commonword college noun kanji n5 textbook seme...,葉書,はがき,2017-01-03 22:47:30.784,2019-05-18 12:54:16,1,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,2,3,[2],[3:4],5.0,kanji,1.0,3.0,2160.0,10.0,1.0,1.0,10,1,0,1,0,0,1.0
6730,1483483650793,college naadjective kanji textbook semester1,好き,すき,2017-01-03 22:47:30.793,2019-05-18 12:54:16,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,2,[2],[2],,kanji,1.0,552.0,2270.0,5.0,0.0,1.0,5,0,0,0,1,0,1.0
6731,1549184119039,metalite,閏年,うるうどし,2019-02-03 08:55:19.039,2019-05-13 20:00:56,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,5,[2],[5:8],,,1.0,108.0,2410.0,5.0,0.0,1.0,5,0,0,0,1,0,1.0


**reps** = work done to remember a card  
**interval** = memory length, the output of the memorization work done  
**ease/factor** = indicator of the effort needed to retrieve & store a memory  
**lapses** = result of a memory deficit; a common side effect & indicator of inefficient memorization effort  

**lapses/reps ratio** (waste ratio) = the closer to 0, the better ("low waste"); the higher this is, the worse ("high waste")  
**interval/reps ratio** (ROI ratio) = the higher, the better ("low effort", "sticky"); the lower this is, the worse ("high effort", "slippery")  
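
As a worked example of the two ratios, the sketch below plugs in the note-level values for 図書館 (mean interval 1273, mean reps 14, mean lapses 1), which appear in the removed-notes output later in this section. The variable names are chosen for illustration only.

```python
# Note-level values for one note, taken from this notebook's output.
mean_ivl = 1273.0
mean_reps = 14.0
mean_lapses = 1.0

waste = mean_lapses / mean_reps  # lapses per review; closer to 0 is better
roi = mean_ivl / mean_reps       # interval gained per review; higher is better

print(round(waste, 6), round(roi, 6))  # prints 0.071429 90.928571
```

Both results match the `mean_note_waste` and `mean_note_roi` values listed for that note.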

In [122]:
df_notes_015_final_section_2 = df_notes_014_final_section_2.copy()

# waste denotes the ratio of lapses to reviews (reps),
# where higher numbers indicate waste / wasted effort
df_notes_015_final_section_2['mean_note_waste'] = df_notes_015_final_section_2['mean_lapses'] / df_notes_015_final_section_2['mean_reps']

# ROI denotes the ratio of interval (ivl) to reviews (reps),
# where higher numbers indicate a better return on review effort
df_notes_015_final_section_2['mean_note_roi'] = df_notes_015_final_section_2['mean_ivl'] / df_notes_015_final_section_2['mean_reps']

# adjusted ROI factors in the total number of repetitions per note,
# so it most closely accounts for the total work (reviews) for the gains (interval)
#df_notes_015_final_section_2['adj_note_roi'] = df_notes_015_final_section_2['mean_ivl'] / df_notes_015_final_section_2['total_reps']

df_notes_015_final_section_2['n_ivl_q'] = pd.qcut(df_notes_015_final_section_2['mean_ivl'],5,labels=False)
df_notes_015_final_section_2['n_factor_q'] = pd.qcut(df_notes_015_final_section_2['mean_factor'],3,labels=False)
# note: q=1 places every note in a single waste bucket, so n_waste_q is 0 for all rows
df_notes_015_final_section_2['n_waste_q'] = pd.qcut(df_notes_015_final_section_2['mean_note_waste'],1,labels=False)
df_notes_015_final_section_2['n_roi_q'] = pd.qcut(df_notes_015_final_section_2['mean_note_roi'],5,labels=False)
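
The waste column is bucketed with a single quantile above, which leaves `n_waste_q` at 0 for every note. One plausible reason (an assumption, since the notebook does not say) is that most notes have a waste of 0, so several quantile edges coincide and `pd.qcut` refuses to build the bins. A minimal sketch of the `duplicates="drop"` option on made-up toy data:

```python
import pandas as pd

# Toy waste ratios (hypothetical): mostly zero, as happens when most notes have no lapses.
waste = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 0.25, 0.5])

# With q=5 several quantile edges coincide at 0.0; duplicates="drop" merges them
# instead of raising, so fewer than five buckets may come back.
buckets = pd.qcut(waste, 5, labels=False, duplicates="drop")
print(buckets.value_counts().sort_index())
```

The trade-off is that the number of buckets is no longer fixed, which matters if later plots assume a specific count.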

In [123]:
print(df_notes_015_final_section_2.columns.values)
df_notes_015_final_section_2.head()

['nid' 'tags' 'Term' 'Yomi1' 'NoteCreated' 'LastModified' 'commonword'
 'clothing' 'animal' 'body' 'food' 'place' 'textbook' 'college' 'fromdict'
 'fromexam' 'n1' 'n2' 'n3' 'n4' 'n5' 'katakana' 'hiragana' 'kanji' 'adv'
 'adj' 'noun' 'verb' 'nonconvo' 'convo' 'metalite' 'hasSimilarSound'
 'hasSameSound' 'hasVisual' 'hasAudio' 'hasMultiMeaning' 'hasMultiReading'
 'hasSimilarMeaning' 'hasAltForm' 'hasRichExamples' 'TermLen' 'Syllables'
 'TermLenGroup' 'SyllablesGroup' 'jlpt_lvl_d' 'script' 'c_suff_reviewed'
 'mean_ivl' 'mean_factor' 'mean_reps' 'mean_lapses' 'c_suff_reviewed_x'
 'total_reps' 'total_lapses' 'hasListenCard' 'hasPictureCard'
 'hasReadCard' 'hasTranslateCard' 'c_suff_reviewed_y' 'mean_note_waste'
 'mean_note_roi' 'n_ivl_q' 'n_factor_q' 'n_waste_q' 'n_roi_q']


Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,food,place,textbook,college,fromdict,fromexam,n1,n2,n3,n4,n5,katakana,hiragana,kanji,adv,adj,noun,verb,nonconvo,convo,metalite,hasSimilarSound,hasSameSound,hasVisual,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilarMeaning,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,SyllablesGroup,jlpt_lvl_d,script,c_suff_reviewed,mean_ivl,mean_factor,mean_reps,mean_lapses,c_suff_reviewed_x,total_reps,total_lapses,hasListenCard,hasPictureCard,hasReadCard,hasTranslateCard,c_suff_reviewed_y,mean_note_waste,mean_note_roi,n_ivl_q,n_factor_q,n_waste_q,n_roi_q
0,1331799797114,commonword noun kanji suruverb fromdict,移籍,いせき,2012-03-15 08:23:17.114,2019-06-09 23:34:05,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2,3,[2],[3:4],,kanji,1.0,99.0,1980.0,7.0,0.0,1.0,7,0,0,0,1,0,1.0,0.0,14.142857,0,2,0,1
1,1331799797117,verb commonword n2 kanji transitive fromdict,吊るす,つるす,2012-03-15 08:23:17.117,2019-05-18 14:33:12,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,3,3,[3:4],[3:4],2.0,kanji,1.0,28.0,2410.0,6.0,0.0,1.0,6,0,0,0,0,1,1.0,0.0,4.666667,0,2,0,0
2,1331799797118,commonword naadjective checked kanji n1 convo,和やか,なごやか,2012-03-15 08:23:17.118,2019-05-27 21:33:22,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,3,4,[3:4],[3:4],1.0,kanji,1.0,14.0,2410.0,6.0,0.0,1.0,6,0,0,0,0,1,1.0,0.0,2.333333,0,2,0,0
3,1331799797122,kanji hassame fromdict,在庫,ざいこ,2012-03-15 08:23:17.122,2019-05-18 12:54:16,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,3,[2],[3:4],,kanji,1.0,224.0,2130.0,5.0,0.0,1.0,5,0,0,0,1,0,1.0,0.0,44.8,2,2,0,4
4,1331799797126,kanji fromdict,有能,ゆうのう,2012-03-15 08:23:17.126,2019-05-27 20:00:11,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,4,[2],[3:4],,kanji,1.0,248.0,2130.0,9.0,0.0,1.0,9,0,0,0,1,0,1.0,0.0,27.555556,2,2,0,3


# Compare note to card export columns

In [124]:
note_cols = df_notes_015_final_section_2.columns.values

In [125]:
# https://www.geeksforgeeks.org/python-intersection-two-lists/
def intersection(lst1, lst2):
    temp = set(lst2) 
    lst3 = [value for value in lst1 if value in temp] 
    return lst3

In [126]:
def difference(list1, list2):
    s = set(list2)
    return [x for x in list1 if x not in s]

In [127]:
# See what matches between CARD and NOTE dataframes

In [128]:
print("both cards and notes have:")
print(intersection(note_cols, card_cols))

both cards and notes have:
['nid']


In [129]:
# See what differs between CARD and NOTE dataframes

In [130]:
diff = difference(note_cols, card_cols)
diff2 = difference(card_cols, note_cols)
print("only cards have:")
print(diff2)
print("only notes have:")
print(diff)

only cards have:
['cid', 'ivl', 'factor', 'reps', 'lapses', 'CardCreated', 'DueDate', 'c_ivl_q', 'c_factor_q', 'CardType_listen', 'CardType_look', 'CardType_read', 'CardType_recall', 'cardtype', 'waste', 'roi']
only notes have:
['tags', 'Term', 'Yomi1', 'NoteCreated', 'LastModified', 'commonword', 'clothing', 'animal', 'body', 'food', 'place', 'textbook', 'college', 'fromdict', 'fromexam', 'n1', 'n2', 'n3', 'n4', 'n5', 'katakana', 'hiragana', 'kanji', 'adv', 'adj', 'noun', 'verb', 'nonconvo', 'convo', 'metalite', 'hasSimilarSound', 'hasSameSound', 'hasVisual', 'hasAudio', 'hasMultiMeaning', 'hasMultiReading', 'hasSimilarMeaning', 'hasAltForm', 'hasRichExamples', 'TermLen', 'Syllables', 'TermLenGroup', 'SyllablesGroup', 'jlpt_lvl_d', 'script', 'c_suff_reviewed', 'mean_ivl', 'mean_factor', 'mean_reps', 'mean_lapses', 'c_suff_reviewed_x', 'total_reps', 'total_lapses', 'hasListenCard', 'hasPictureCard', 'hasReadCard', 'hasTranslateCard', 'c_suff_reviewed_y', 'mean_note_waste', 'mean_note_roi

# Remove note outliers by ROI

In [131]:
old_df = df_notes_015_final_section_2.copy()
df_notes_016_final_section_2 = df_wo_outliers(old_df, 'mean_note_roi')
new_df = df_notes_016_final_section_2.copy()

# print out before and after stats
print_before_after(old_df.shape, new_df.shape, "shape")
print("diff:",old_df.shape[0]-new_df.shape[0],"notes")

---------------------------------------------------------------------------
shape
---------------------------------------------------------------------------
Before: (6092, 65)
---------------------------------------------------------------------------
After: (5798, 65)
---------------------------------------------------------------------------
diff: 294 notes
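
`df_wo_outliers` is defined earlier in the notebook and its exact rule is not repeated here. As a rough illustration only, one common way to drop outliers on a single column is the 1.5 × IQR fence. The function name `drop_iqr_outliers_sketch`, the factor `k`, and the toy data are assumptions, and this may well differ from what `df_wo_outliers` actually does.

```python
import pandas as pd

def drop_iqr_outliers_sketch(df, col, k=1.5):
    """Keep rows whose `col` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    keep = df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]

# Toy data: seven ordinary ROI values and one extreme value
demo = pd.DataFrame({
    "nid": range(1, 9),
    "mean_note_roi": [10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 10.5, 95.0],
})
trimmed = drop_iqr_outliers_sketch(demo, "mean_note_roi")
print(demo.shape, "->", trimmed.shape)
```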


# Inspect removed notes

In [132]:
# https://stackoverflow.com/questions/50543326/how-to-do-left-outer-join-exclusion-in-pandas
df_both = pd.merge(old_df, new_df, on=list(note_cols), how="outer", indicator=True
              ).query('_merge=="left_only"')
print(df_both.shape)
df_both.head()

(294, 66)


Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,food,place,textbook,college,fromdict,fromexam,n1,n2,n3,n4,n5,katakana,hiragana,kanji,adv,adj,noun,verb,nonconvo,convo,metalite,hasSimilarSound,hasSameSound,hasVisual,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilarMeaning,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,SyllablesGroup,jlpt_lvl_d,script,c_suff_reviewed,mean_ivl,mean_factor,mean_reps,mean_lapses,c_suff_reviewed_x,total_reps,total_lapses,hasListenCard,hasPictureCard,hasReadCard,hasTranslateCard,c_suff_reviewed_y,mean_note_waste,mean_note_roi,n_ivl_q,n_factor_q,n_waste_q,n_roi_q,_merge
163,1346057958558,kanji fromdict,小学生,しょうがくせい,2012-08-27 08:59:18.558,2019-05-18 12:54:16,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,7,[3:4],[5:8],,kanji,1.0,638.0,2450.0,8.0,0.0,1.0,8,0,0,0,1,0,1.0,0.0,79.75,4,2,0,4,left_only
181,1346057958590,kanji textbook semester1 college,図書館,としょかん,2012-08-27 08:59:18.590,2019-05-18 12:54:16,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,3,5,[3:4],[5:8],,kanji,1.0,1273.0,1920.0,14.0,1.0,1.0,14,1,0,0,1,0,1.0,0.071429,90.928571,4,2,0,4,left_only
196,1346057958615,mergeterms kanji fromdict,大きさ,おおきさ,2012-08-27 08:59:18.615,2019-05-13 20:00:43,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,4,[3:4],[3:4],,kanji,1.0,886.0,2377.0,8.0,0.0,1.0,8,0,0,0,1,0,1.0,0.0,110.75,4,2,0,4,left_only
207,1346057958630,college kanji textbook hasrobo grammar,雨が降る,あめがふる,2012-08-27 08:59:18.630,2019-05-18 12:54:16,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,4,5,[3:4],[5:8],,kanji,1.0,1149.0,1777.0,14.0,0.0,1.0,14,0,0,0,1,0,1.0,0.0,82.071429,4,1,0,4,left_only
225,1346057958660,technical kanji culture grammarconju fromdict,知っている,しっている,2012-08-27 08:59:18.660,2019-05-18 12:54:16,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,5,[5:8],[5:8],,kanji,1.0,863.0,2329.0,10.0,0.0,1.0,10,0,0,0,1,0,1.0,0.0,86.3,4,2,0,4,left_only
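
The "left only" merge used above to recover the removed notes can be tried on a toy frame. The data and the cut-off of 50 are made up for illustration. Because the "after" frame is a subset of the "before" frame, `how="left"` behaves the same as `how="outer"` here.

```python
import pandas as pd

# Toy "before" frame and a filtered "after" frame (made-up values)
before = pd.DataFrame({"nid": [1, 2, 3, 4],
                       "mean_note_roi": [5.0, 80.0, 12.0, 110.0]})
after = before[before["mean_note_roi"] < 50]

# indicator=True adds a _merge column saying where each row was found
merged = before.merge(after, on=list(before.columns), how="left", indicator=True)
removed = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Cross-check, assuming nid is unique per row: filter on the id alone
removed_by_id = before[~before["nid"].isin(after["nid"])]

print(removed)
```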


# Remove note outliers by waste

In [133]:
old_df = df_notes_016_final_section_2.copy()
df_notes_017_final_section_2 = df_wo_outliers(old_df, 'mean_note_waste')
new_df = df_notes_017_final_section_2.copy()

In [134]:
# print out before and after stats
print_before_after(old_df.shape, new_df.shape, "shape")
print("diff:",old_df.shape[0]-new_df.shape[0],"notes")

---------------------------------------------------------------------------
shape
---------------------------------------------------------------------------
Before: (5798, 65)
---------------------------------------------------------------------------
After: (5743, 65)
---------------------------------------------------------------------------
diff: 55 notes


In [135]:
df_both = pd.merge(old_df, new_df, on=list(note_cols), how="outer", indicator=True
              ).query('_merge=="left_only"')
print(df_both.shape)
df_both.head()

(55, 66)


Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,food,place,textbook,college,fromdict,fromexam,n1,n2,n3,n4,n5,katakana,hiragana,kanji,adv,adj,noun,verb,nonconvo,convo,metalite,hasSimilarSound,hasSameSound,hasVisual,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilarMeaning,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,SyllablesGroup,jlpt_lvl_d,script,c_suff_reviewed,mean_ivl,mean_factor,mean_reps,mean_lapses,c_suff_reviewed_x,total_reps,total_lapses,hasListenCard,hasPictureCard,hasReadCard,hasTranslateCard,c_suff_reviewed_y,mean_note_waste,mean_note_roi,n_ivl_q,n_factor_q,n_waste_q,n_roi_q,_merge
36,1342693717935,kanji fromdict,現象,げんしょう,2012-07-19 10:28:37.935,2019-05-18 12:54:16,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,5,[2],[5:8],,kanji,1.0,424.0,1400.0,25.0,4.0,1.0,25,4,0,0,1,0,1.0,0.16,16.96,3,0,0,2,left_only
118,1344946923715,kanji textbook addsimilar hasrobo,粋,いき,2012-08-14 12:22:03.715,2019-05-18 14:33:12,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,2,[1],[2],,kanji,1.0,152.0,1300.0,32.0,5.0,1.0,32,5,0,0,1,0,1.0,0.15625,4.75,1,0,0,0,left_only
147,1345878204694,kanji fromdict,劣悪,れつあく,2012-08-25 07:03:24.694,2019-05-18 12:54:16,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,4,[2],[3:4],,kanji,1.0,237.0,1300.0,26.0,4.0,1.0,26,4,0,0,1,0,1.0,0.153846,9.115385,2,0,0,1,left_only
171,1346057958578,kanji fromdict,開設,かいせつ,2012-08-27 08:59:18.578,2019-05-18 14:33:12,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,4,[2],[3:4],,kanji,1.0,235.0,1300.0,44.0,8.0,1.0,44,8,0,0,1,0,1.0,0.181818,5.340909,2,0,0,0,left_only
179,1346057958589,kanji fromdict,改築,かいちく,2012-08-27 08:59:18.589,2019-05-18 12:54:16,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,4,[2],[3:4],,kanji,1.0,60.0,1300.0,59.0,9.0,1.0,59,9,0,0,1,0,1.0,0.152542,1.016949,0,0,0,0,left_only


In [136]:
def cat_note(row):
    if(row.suff_reviewed):
        if(row.high_roi):
            return "sticky"
        elif(row.high_waste and not row.few_lapses):
            return "slippery"
        else:
            return ""
    else:
        return "n/a yet"

In [137]:
# Tag sticky and slippery notes: flag notes in the top 5% by mean_note_roi, notes at or above the 5th percentile of mean_reps, and notes with mean_note_waste >= 0.1
df_notes_018_sticky = df_notes_017_final_section_2.copy()
df_notes_018_sticky.loc[df_notes_018_sticky['mean_note_roi']>=df_notes_018_sticky.mean_note_roi.quantile(0.95), 'high_roi'] = True
df_notes_018_sticky.loc[df_notes_018_sticky['mean_reps']>=df_notes_018_sticky.mean_reps.quantile(0.05), 'suff_reviewed'] = True
df_notes_018_sticky.loc[df_notes_018_sticky['mean_note_waste']>=0.1, 'high_waste'] = True
df_notes_018_sticky.loc[df_notes_018_sticky['mean_note_waste']==0, 'no_waste'] = True
df_notes_018_sticky.loc[df_notes_018_sticky['mean_lapses'] < 3, 'few_lapses'] = True
# rows that did not meet a threshold are still NaN; assign the filled column back instead of using inplace on a column selection
for flag in ['high_roi', 'suff_reviewed', 'high_waste', 'no_waste', 'few_lapses']:
    df_notes_018_sticky[flag] = df_notes_018_sticky[flag].fillna(False)

#df_notes_018_sticky = df_notes_018_sticky.sort_values(by=['mean_note_waste'], ascending=[True])

df_notes_018_sticky['analysis_cat'] = df_notes_018_sticky.apply(cat_note, axis=1) #####

df_notes_018_sticky = df_notes_018_sticky.drop(['high_roi','high_waste','suff_reviewed','few_lapses'],axis=1)

convert_bool_list = ['no_waste']
for item in convert_bool_list:
    cast_to_typ(df_notes_018_sticky,item, int)

convert_category_list = ['analysis_cat']
for col in convert_category_list:
    df_notes_018_sticky[col] = df_notes_018_sticky[col].astype('category')

print("df_notes_018_sticky.mean_note_roi.quantile(0.9):",df_notes_018_sticky.mean_note_roi.quantile(0.9))
print("df_notes_018_sticky.mean_reps.quantile(0.1):",df_notes_018_sticky.mean_reps.quantile(0.1))
    
print(df_notes_018_sticky.shape)
#df_notes_018_sticky.head(10)
df_notes_018_sticky.dtypes

df_notes_018_sticky.mean_note_roi.quantile(0.9): 50.38666666666668
df_notes_018_sticky.mean_reps.quantile(0.1): 8.0
(5743, 67)


nid                           int64
tags                         object
Term                         object
Yomi1                        object
NoteCreated          datetime64[ns]
LastModified         datetime64[ns]
commonword                    int64
clothing                      int64
animal                        int64
body                          int64
food                          int64
place                         int64
textbook                      int64
college                       int64
fromdict                      int64
fromexam                      int64
n1                            int64
n2                            int64
n3                            int64
n4                            int64
n5                            int64
katakana                      int64
hiragana                      int64
kanji                         int64
adv                           int64
adj                           int64
noun                          int64
verb                        

In [138]:
df_notes_018_sticky.analysis_cat.value_counts()

            5061
n/a yet      272
sticky       230
slippery     180
Name: analysis_cat, dtype: int64
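
The row-wise `cat_note` function can also be written as a vectorized lookup. The sketch below uses `numpy.select` on a made-up frame of boolean flags. The conditions are listed in priority order so that they mirror the `if`/`elif` chain: not sufficiently reviewed first, then high ROI, then high waste without few lapses.

```python
import numpy as np
import pandas as pd

# Made-up flags for five notes
demo = pd.DataFrame({
    "suff_reviewed": [True, True, True, False, True],
    "high_roi":      [True, False, False, True, True],
    "high_waste":    [False, True, True, True, True],
    "few_lapses":    [True, False, True, False, False],
})

# np.select takes the first condition that is True for each row
conditions = [
    ~demo["suff_reviewed"],
    demo["high_roi"],
    demo["high_waste"] & ~demo["few_lapses"],
]
choices = ["n/a yet", "sticky", "slippery"]
demo["analysis_cat"] = np.select(conditions, choices, default="")

print(demo["analysis_cat"].tolist())
```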

In [139]:
df_notes_019_final_section_2 = df_notes_018_sticky.copy()

# make a new dataframe & drop the n/a items from this one, export it too for analysis
df_notes_020_final_section_2 = df_notes_018_sticky.copy() ######

print(df_notes_019_final_section_2.shape)
df_notes_020_final_section_2 = df_notes_020_final_section_2.drop(
    df_notes_020_final_section_2[df_notes_020_final_section_2.analysis_cat == "n/a yet"].index)
print(df_notes_020_final_section_2.shape)


(5743, 67)
(5471, 67)


# Export actual final notes

In [140]:
#df_notes_019_final_section_2.to_csv('datasets/df_notes_019_final_section_2.csv')
df_notes_020_final_section_2.to_csv('datasets/df_notes_020_final_section_2.csv')

***
- [Previous section: Combo of Notes & Cards](#combo)
- [Go to bottom](#bottom)
***
# <a name="revlog"></a> Review Log

### 2.4.1. Import Review Log data

In [141]:
df_revlog = pd.read_sql_query("SELECT * FROM revlog", cnx)

In [142]:
print(df_revlog.shape)
df_revlog.head()

(116126, 9)


Unnamed: 0,id,cid,usn,ease,ivl,lastIvl,factor,time,type
0,1332393018515,1331799797110,0,1,0,0,2500,6673,0
1,1333279992123,1331799797110,0,4,8,0,2600,11656,0
2,1333280001016,1331799797112,0,4,8,0,2600,8887,0
3,1333280097922,1331799797113,0,1,0,0,2500,29162,0
4,1333280107916,1331799797114,0,4,8,0,2600,9987,0


# Confirm review count of 116126 

In [143]:
assertEquals(df_revlog.shape[0],116126,"116126 Study reviews")

'OK'

In [144]:
df_revlog_001_review_date = df_revlog.copy()
df_revlog_001_review_date = df_revlog_001_review_date.rename(columns={'id':'rid'})
df_revlog_001_review_date['ReviewDate']= pd.to_datetime(df_revlog_001_review_date['rid'],unit='ms')
#df_revlog_001_review_date['ReviewDate'] = df_revlog_001_review_date['ReviewDate'].dt.date
df_revlog_001_review_date.head()

#assertEquals(df_revlog_001_review_date['rid'].iloc[0], 1332393018515, "Note ID is in Epoch Units")
#assertEquals(str(df_revlog_001_review_date['ReviewDate'].iloc[0]), "2012-03-22", "Note ID is in datetime date format year-month-day")

Unnamed: 0,rid,cid,usn,ease,ivl,lastIvl,factor,time,type,ReviewDate
0,1332393018515,1331799797110,0,1,0,0,2500,6673,0,2012-03-22 05:10:18.515
1,1333279992123,1331799797110,0,4,8,0,2600,11656,0,2012-04-01 11:33:12.123
2,1333280001016,1331799797112,0,4,8,0,2600,8887,0,2012-04-01 11:33:21.016
3,1333280097922,1331799797113,0,1,0,0,2500,29162,0,2012-04-01 11:34:57.922
4,1333280107916,1331799797114,0,4,8,0,2600,9987,0,2012-04-01 11:35:07.916
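
The review id doubles as a timestamp in epoch milliseconds, which is why `unit='ms'` is enough to get a date. A quick check on the first `rid` from the table above (the result is a plain timestamp with no timezone adjustment applied):

```python
import pandas as pd

rid = 1332393018515  # first revlog id shown above, in epoch milliseconds
stamp = pd.to_datetime(rid, unit="ms")

print(stamp)         # full timestamp
print(stamp.date())  # calendar date only
```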


# Let's inspect a note w/ only one card associated with it

In [145]:
#current_note_id = get_rows_by_value_in_col(df_cards_012_mid_section_2, df_revlog_001_review_date['cid'].iloc[4], 'cid')['nid'].iloc[0]
# df_cards_012_mid_section_2
# df_notes_020_final_section_2
# df_revlog_001_review_date

def get_review_data_from_card_id(cid):
    df_c_1 = get_rows_by_value_in_col(df_cards_011_mid_section_2, cid, 'cid') # look up the card row; e.g. index 9 returns "koto"
    c_count_1 = df_c_1.shape[0]
    nid_1 = df_c_1.nid.iloc[0]
    cid_1 = df_c_1.cid.iloc[0]
    df_n_1 = get_rows_by_value_in_col(df_notes_019_final_section_2, nid_1, 'nid')
    term_1 = df_n_1.Term.iloc[0]
    df_rl_1 = get_rows_by_value_in_col(df_revlog_001_review_date, cid_1, 'cid') # all review log rows for this card
    r_count_1 = df_rl_1.shape[0]

    s = f"""
    note: {nid_1}
    card: {cid_1}
    term: {term_1}
    card count: {c_count_1}
    review count: {r_count_1}
    """

    print(s)
    
    #print(get_rows_by_value_in_col(df_cards_011_mid_section_2, cid_1, 'cid'))
    
    return df_rl_1
    
df_rl_1 = get_review_data_from_card_id(df_revlog_001_review_date['cid'].iloc[9])
df_rl_1


    note: 1342506824718
    card: 1342506824718
    term: 事
    card count: 1
    review count: 15
    


Unnamed: 0,rid,cid,usn,ease,ivl,lastIvl,factor,time,type,ReviewDate
9,1342508005342,1342506824718,0,1,0,0,2500,60000,0,2012-07-17 06:53:25.342
12,1342511101453,1342506824718,0,2,1,0,2500,60000,0,2012-07-17 07:45:01.453
16,1342517532327,1342506824718,0,2,1,0,2500,13309,0,2012-07-17 09:32:12.327
50,1342601697172,1342506824718,0,3,4,0,2500,3162,2,2012-07-18 08:54:57.172
94,1343047493883,1342506824718,0,2,6,4,2500,60000,1,2012-07-23 12:44:53.883
138,1344690513604,1342506824718,0,3,31,6,2500,2603,1,2012-08-11 13:08:33.604
524,1347420506456,1342506824718,0,2,38,31,2350,4430,1,2012-09-12 03:28:26.456
5199,1350875830303,1342506824718,0,2,46,38,2200,2121,1,2012-10-22 03:17:10.303
10292,1355057784263,1342506824718,0,2,54,46,2050,4069,1,2012-12-09 12:56:24.263
17371,1360388282758,1342506824718,108,2,67,54,1900,8113,1,2013-02-09 05:38:02.758


# Recipe
0. Find the first entry whose lastIvl is not zero. In this case that is entry index 94, with a lastIvl of 4.
1. Calculate the number of days between the current review and the preceding one ("SinceLast").
2. Calculate the ratio of "SinceLast" (actual) to "lastIvl" (expected), stored as "DiffPercent". A value of 1.0 means the review happened exactly on schedule.
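
The recipe can be sketched on three of the rows shown above (indices 50, 94 and 138). This version uses `.diff()` and `.dt.days` and masks non-positive `lastIvl` values so that only day-based intervals are compared. The masking step and the column name `Ratio` are choices made for this sketch, not part of the notebook's own code.

```python
import pandas as pd

# ReviewDate and lastIvl copied from rows 50, 94 and 138 above
reviews = pd.DataFrame({
    "ReviewDate": pd.to_datetime([
        "2012-07-18 08:54:57.172",
        "2012-07-23 12:44:53.883",
        "2012-08-11 13:08:33.604",
    ]),
    "lastIvl": [0, 4, 6],
})

# Step 1: whole days since the previous review (NaN for the first row)
reviews["SinceLast"] = reviews["ReviewDate"].diff().dt.days

# Step 0: ignore rows where lastIvl is zero or negative
scheduled = reviews["lastIvl"].where(reviews["lastIvl"] > 0)

# Step 2: days late and actual/expected ratio
reviews["DaysLate"] = reviews["SinceLast"] - scheduled
reviews["Ratio"] = reviews["SinceLast"] / scheduled

print(reviews)
```

For index 94 this gives 5 days since the last review against a scheduled 4, so 1 day late and a ratio of 1.25.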

In [146]:
# to find time diffs between two rows/entries
# https://datascience.stackexchange.com/questions/42156/how-to-caculate-time-difference-in-between-rows-using-loop-in-panda-python

# https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
pd.options.mode.chained_assignment = None  # default='warn'

def expand_review_data(df_rl):
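    # whole hours between consecutive reviews; total_seconds() avoids astype('timedelta64[h]'), which pandas 2.x no longer accepts
    df_rl['SinceLast'] = (df_rl['ReviewDate'] - df_rl['ReviewDate'].shift(1)).dt.total_seconds() // 3600
    df_rl['SinceLast'] = df_rl['SinceLast']//24 # divide SinceLast hours by 24 to get days difference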
    df_rl['DaysLate'] = df_rl['SinceLast'] - df_rl['lastIvl']
    df_rl['DiffPercent'] = df_rl['SinceLast']/df_rl['lastIvl']
    return df_rl

df_rl_1_expanded = expand_review_data(df_rl_1)
df_rl_1_expanded

Unnamed: 0,rid,cid,usn,ease,ivl,lastIvl,factor,time,type,ReviewDate,SinceLast,DaysLate,DiffPercent
9,1342508005342,1342506824718,0,1,0,0,2500,60000,0,2012-07-17 06:53:25.342,,,
12,1342511101453,1342506824718,0,2,1,0,2500,60000,0,2012-07-17 07:45:01.453,-0.0,-0.0,
16,1342517532327,1342506824718,0,2,1,0,2500,13309,0,2012-07-17 09:32:12.327,0.0,0.0,
50,1342601697172,1342506824718,0,3,4,0,2500,3162,2,2012-07-18 08:54:57.172,0.0,0.0,
94,1343047493883,1342506824718,0,2,6,4,2500,60000,1,2012-07-23 12:44:53.883,5.0,1.0,1.25
138,1344690513604,1342506824718,0,3,31,6,2500,2603,1,2012-08-11 13:08:33.604,19.0,13.0,3.166667
524,1347420506456,1342506824718,0,2,38,31,2350,4430,1,2012-09-12 03:28:26.456,31.0,0.0,1.0
5199,1350875830303,1342506824718,0,2,46,38,2200,2121,1,2012-10-22 03:17:10.303,39.0,1.0,1.026316
10292,1355057784263,1342506824718,0,2,54,46,2050,4069,1,2012-12-09 12:56:24.263,48.0,2.0,1.043478
17371,1360388282758,1342506824718,108,2,67,54,1900,8113,1,2013-02-09 05:38:02.758,61.0,7.0,1.12963


In [147]:
def get_total_review_time(df_rl):
    time_total = df_rl['time'].sum()/1000

    s = f"""
    total card review time: {time_total} seconds
    total note review time: {time_total} seconds
    """

    print(s)
    
get_total_review_time(df_rl_1)


    total card review time: 314.117 seconds
    total note review time: 314.117 seconds
    


In [148]:
get_rows_by_value_in_col(df_notes_019_final_section_2, 1342506824718, 'nid')

Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,food,place,textbook,college,fromdict,fromexam,n1,n2,n3,n4,n5,katakana,hiragana,kanji,adv,adj,noun,verb,nonconvo,convo,metalite,hasSimilarSound,hasSameSound,hasVisual,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilarMeaning,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,SyllablesGroup,jlpt_lvl_d,script,c_suff_reviewed,mean_ivl,mean_factor,mean_reps,mean_lapses,c_suff_reviewed_x,total_reps,total_lapses,hasListenCard,hasPictureCard,hasReadCard,hasTranslateCard,c_suff_reviewed_y,mean_note_waste,mean_note_roi,n_ivl_q,n_factor_q,n_waste_q,n_roi_q,no_waste,analysis_cat
16,1342506824718,kanji textbook,事,こと,2012-07-17 06:33:44.718,2019-05-18 14:33:12,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,2,[1],[2],,kanji,1.0,472.0,1600.0,13.0,0.0,1.0,13,0,0,0,1,0,1.0,0.0,36.307692,4,1,0,3,1,


In [149]:
# todo: only count reviews of type 1 (review), not type 0 (learning) or type 2 (relearning)

In [150]:
get_rows_by_value_in_col(df_cards_011_mid_section_2, 1331799797110, 'cid')

Unnamed: 0,cid,nid,ivl,factor,reps,lapses,CardCreated,DueDate,c_ivl_q,c_factor_q,CardType_listen,CardType_look,CardType_read,CardType_recall,cardtype,waste,roi,c_suff_reviewed


In [151]:
# Inspect all cards, and review data, for hatsumei, nid: 1354094556789

In [152]:
# Now, let's see all the associated study review data for the one card

# Let's inspect another note now...

...from the review data where there are multiple cards associated with one note.

The goal here is to calculate all the review data for a single note
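
One way to get there is to join the review log to the cards on `cid` and then group by `nid`. The sketch below uses a tiny made-up subset (two cards of one note and three review rows) and hypothetical frame names `cards_demo` and `revlog_demo`; the `time` column is treated as milliseconds, as in `get_total_review_time` above.

```python
import pandas as pd

# Two cards that belong to the same note
cards_demo = pd.DataFrame({
    "cid": [1354094556789, 1371807076626],
    "nid": [1354094556789, 1354094556789],
    "cardtype": ["read", "look"],
})

# Three review rows spread across those two cards
revlog_demo = pd.DataFrame({
    "cid": [1354094556789, 1354094556789, 1371807076626],
    "time": [5542, 5122, 60000],  # milliseconds per review
})

# Attach the note id to every review, then summarize per note
joined = revlog_demo.merge(cards_demo, on="cid", how="inner")
per_note = joined.groupby("nid").agg(
    review_count=("time", "size"),
    total_seconds=("time", lambda ms: ms.sum() / 1000),
)

print(per_note)
```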

# Get hatsumei note

In [153]:
current_note_id = 1354094556789

In [154]:
get_rows_by_value_in_col(df_notes_020_final_section_2,current_note_id,'nid')

Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,body,food,place,textbook,college,fromdict,fromexam,n1,n2,n3,n4,n5,katakana,hiragana,kanji,adv,adj,noun,verb,nonconvo,convo,metalite,hasSimilarSound,hasSameSound,hasVisual,hasAudio,hasMultiMeaning,hasMultiReading,hasSimilarMeaning,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,SyllablesGroup,jlpt_lvl_d,script,c_suff_reviewed,mean_ivl,mean_factor,mean_reps,mean_lapses,c_suff_reviewed_x,total_reps,total_lapses,hasListenCard,hasPictureCard,hasReadCard,hasTranslateCard,c_suff_reviewed_y,mean_note_waste,mean_note_roi,n_ivl_q,n_factor_q,n_waste_q,n_roi_q,no_waste,analysis_cat
2522,1354094556789,fromtest commonword noun kanji n3 textbook sur...,発明,はつめい,2012-11-28 09:22:36.789,2019-05-18 12:54:16,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,2,4,[2],[3:4],3.0,kanji,1.0,568.0,1300.0,46.5,4.5,1.0,93,9,0,1,1,0,2.0,0.096774,12.215054,4,0,0,1,0,


# Get 'hatsumei' cards

In [155]:
df_sel_cards_1 = get_rows_by_value_in_col(df_cards_012_mid_section_2,current_note_id,'nid')
list_sel_cards_1 = list(df_sel_cards_1.cid) # use this list next
list_sel_cards_1_type = list(df_sel_cards_1.cardtype)
df_sel_cards_1

Unnamed: 0,cid,nid,ivl,factor,reps,lapses,CardCreated,DueDate,c_ivl_q,c_factor_q,CardType_listen,CardType_look,CardType_read,CardType_recall,cardtype,waste,roi,c_suff_reviewed
2692,1354094556789,1354094556789,80,1300,73,9,2012-11-28 09:22:36.789,2015-06-30 09:00:00,0,0,0,0,1,0,read,0.136986,1.09589,1.0
4395,1371807076626,1354094556789,1056,1300,20,0,2013-06-21 09:31:16.626,2021-10-24 09:00:00,4,0,0,1,0,0,look,0.05,52.8,1.0


# Get 'hatsumei' review logs for its 'read' card

In [156]:
df_sel_revlog_1 = get_rows_by_value_in_col(df_revlog_001_review_date, list_sel_cards_1[0], 'cid')
print(f"hatsumei card type '{list_sel_cards_1_type[0]}' review count: {df_sel_revlog_1.shape[0]}")
df_sel_revlog_1.tail()

hatsumei card type 'read' review count: 68


Unnamed: 0,rid,cid,usn,ease,ivl,lastIvl,factor,time,type,ReviewDate
97366,1412814834439,1354094556789,6614,3,27,20,1300,28826,1,2014-10-09 00:33:54.439
99982,1415621351955,1354094556789,6861,2,35,27,1300,10036,1,2014-11-10 12:09:11.955
102482,1419135382168,1354094556789,7124,2,44,35,1300,9545,1,2014-12-21 04:16:22.168
105524,1423035255990,1354094556789,7320,3,56,44,1300,5542,1,2015-02-04 07:34:15.990
107057,1428824791878,1354094556789,0,3,80,56,1300,5122,1,2015-04-12 07:46:31.878


# Get 'hatsumei' review logs for its 'look' card

In [157]:
df_sel_revlog_2 = get_rows_by_value_in_col(df_revlog_001_review_date, list_sel_cards_1[1], 'cid')
print(f"hatsumei card type '{list_sel_cards_1_type[1]}' review count: {df_sel_revlog_2.shape[0]}")
df_sel_revlog_2.tail()

hatsumei card type 'look' review count: 20


Unnamed: 0,rid,cid,usn,ease,ivl,lastIvl,factor,time,type,ReviewDate
77157,1395545294330,1371807076626,4162,2,111,75,1300,5889,1,2014-03-23 03:28:14.330
87269,1405129167575,1371807076626,5617,3,172,111,1300,23207,1,2014-07-12 01:39:27.575
103255,1420000648823,1371807076626,7154,3,279,172,1300,3213,1,2014-12-31 04:37:28.823
109352,1447993800423,1371807076626,8190,3,475,279,1300,3315,1,2015-11-20 04:30:00.423
113688,1543874754300,1371807076626,10506,3,1056,475,1300,60000,1,2018-12-03 22:05:54.300


In [158]:
#print("Term: ", get_rows_by_value_in_col(df_notes_015_final_section_2,current_note_id,'nid')['Term'].iloc[0])
#print("Translation: ", get_rows_by_value_in_col(df_notes_015_final_section_2,current_note_id,'nid')['Translation'].iloc[0])

#inspect_card_by_id(df_cards_009_mid_section_2, df_revlog['cid'].iloc[0], 'id')
#get_rows_by_value_in_col(df_cards_009_mid_section_2, df_revlog['cid'].iloc[0],'cid')

In [159]:
get_rows_by_value_in_col(df_revlog_001_review_date, df_revlog['cid'].iloc[0], 'cid')

Unnamed: 0,rid,cid,usn,ease,ivl,lastIvl,factor,time,type,ReviewDate
0,1332393018515,1331799797110,0,1,0,0,2500,6673,0,2012-03-22 05:10:18.515
1,1333279992123,1331799797110,0,4,8,0,2600,11656,0,2012-04-01 11:33:12.123
80362,1397571358201,1331799797110,4480,1,-60,-60,2500,4292,0,2014-04-15 14:15:58.201
80363,1397571360841,1331799797110,4480,2,-600,-60,2500,2636,0,2014-04-15 14:16:00.841
80364,1397571363081,1331799797110,4480,2,1,-600,2280,2238,0,2014-04-15 14:16:03.081
80371,1397622541113,1331799797110,4490,3,2,1,2280,4023,1,2014-04-16 04:29:01.113
83538,1400914850867,1331799797110,4958,2,12,2,2130,3323,1,2014-05-24 07:00:50.867
93044,1410177777778,1331799797110,6257,2,44,12,1980,2300,1,2014-09-08 12:02:57.778
98887,1414062295845,1331799797110,6748,2,51,44,1830,16176,1,2014-10-23 11:04:55.845
104047,1420285596480,1331799797110,7154,2,65,51,1680,11880,1,2015-01-03 11:46:36.480


In [160]:
# todo: put all revlog data per card in a cell alongside each card in the cards data frame

In [161]:
df_revlog_001_review_date.to_csv('datasets/df_revlog_001_review_date.csv')

### <a id="bottom"></a> Hi there! Want to go back [to the top](#top)?

In [162]:
# df_cards_012_mid_section_2
# df_notes_020_final_section_2
# df_revlog_001_review_date
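print(2.861709e+02/365) # standard deviation of ivl (in days), expressed in years
df_cards_012_mid_section_2.std(axis = 0, numeric_only=True) # numeric_only skips the date and text columns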

0.7840298630136987


cid                2.095160e+10
nid                1.979267e+10
ivl                2.861709e+02
factor             3.868660e+02
reps               9.231241e+00
lapses             1.218840e+00
c_ivl_q            1.276237e+00
c_factor_q         7.694637e-01
CardType_listen    7.431251e-02
CardType_look      2.945528e-01
CardType_read      3.033338e-01
CardType_recall    3.155522e-02
waste              3.586785e-02
roi                2.513212e+01
c_suff_reviewed    0.000000e+00
dtype: float64

In [163]:
df_cards_012_mid_section_2.describe()

Unnamed: 0,cid,nid,ivl,factor,reps,lapses,c_ivl_q,c_factor_q,CardType_listen,CardType_look,CardType_read,CardType_recall,waste,roi,c_suff_reviewed
count,7024.0,7024.0,7024.0,7024.0,7024.0,7024.0,7024.0,7024.0,7024.0,7024.0,7024.0,7024.0,7024.0,7024.0,7024.0
mean,1370454000000.0,1368934000000.0,328.458144,1712.236902,15.954869,0.653189,1.802107,0.826737,0.005552,0.095957,0.897494,0.000997,0.106348,27.550559,1.0
std,20951600000.0,19792670000.0,286.170879,386.865972,9.231241,1.21884,1.276237,0.769464,0.074313,0.294553,0.303334,0.031555,0.035868,25.132115,0.0
min,1331800000000.0,1331800000000.0,1.0,1300.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.00885,1.0
25%,1351670000000.0,1350448000000.0,162.0,1300.0,10.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.076923,9.454545,1.0
50%,1369467000000.0,1367645000000.0,235.0,1639.5,14.0,0.0,2.0,1.0,0.0,0.0,1.0,0.0,0.1,19.230769,1.0
75%,1387854000000.0,1387197000000.0,402.0,2050.0,19.0,1.0,3.0,1.0,0.0,0.0,1.0,0.0,0.125,38.511364,1.0
max,1549184000000.0,1549184000000.0,2148.0,2710.0,113.0,16.0,4.0,2.0,1.0,1.0,1.0,1.0,0.210526,116.090909,1.0


In [164]:
c_stats=pd.DataFrame()
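# numeric_only=True skips the date and text columns, which newer pandas versions no longer drop silently
c_stats["mean"]=df_cards_012_mid_section_2.mean(numeric_only=True)
c_stats["Std.Dev"]=df_cards_012_mid_section_2.std(numeric_only=True)
c_stats["Var"]=df_cards_012_mid_section_2.var(numeric_only=True)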
c_stats.T

Unnamed: 0,cid,nid,ivl,factor,reps,lapses,c_ivl_q,c_factor_q,CardType_listen,CardType_look,CardType_read,CardType_recall,waste,roi,c_suff_reviewed
mean,1370454000000.0,1368934000000.0,328.458144,1712.236902,15.954869,0.653189,1.802107,0.826737,0.005552,0.095957,0.897494,0.000997,0.106348,27.550559,1.0
Std.Dev,20951600000.0,19792670000.0,286.170879,386.865972,9.231241,1.21884,1.276237,0.769464,0.074313,0.294553,0.303334,0.031555,0.035868,25.132115,0.0
Var,4.389697e+20,3.917499e+20,81893.772134,149665.28062,85.215819,1.485571,1.628781,0.592074,0.005522,0.086761,0.092011,0.000996,0.001287,631.623226,0.0


# Anki Database Structure

https://github.com/ankidroid/Anki-Android/wiki/Database-Structure