# Vocab Analysis
## Section 3: Analyze the Data

### 1. Import necessary libraries

In [7]:
import pandas as pd

### 2. Import necessary datasets

In [8]:
# import notes
notes_location = "datasets/df_notes_012_mid_section_2.csv"
df_notes = pd.read_csv(notes_location)

# import cards
cards_location = "datasets/df_cards_008_mid_section_2.csv"
df_cards = pd.read_csv(cards_location)

# import combo
combo_location = "datasets/df_combo_006_final_section_2.csv"
df_combo = pd.read_csv(combo_location)

# import revlog

### 3. Observe Metadata (tag) Frequency:

In [9]:
df_notes

Unnamed: 0.1,Unnamed: 0,nid,tags,Term,Yomi1,NoteCreated,LastModified,commonword,clothing,animal,...,hasMultiMeaning,hasMultiReading,hasSimilar,hasHomophone,hasAltForm,hasRichExamples,TermLen,Syllables,TermLenGroup,jlpt_lvl_d
0,1,1331799797112,fromdict,隙間,すきま,2012-03-15 08:23:17.112,2017-11-23 23:58:09.000,0,0,0,...,0,0,1,0,0,0,2,3,[2],
1,2,1331799797113,fromdict,苦汁,にがり,2012-03-15 08:23:17.113,2017-11-23 23:58:09.000,0,0,0,...,0,0,0,0,0,0,2,3,[2],
2,3,1331799797114,fromdict,移籍,いせき,2012-03-15 08:23:17.114,2017-11-23 23:58:09.000,0,0,0,...,0,0,0,0,0,0,2,3,[2],
3,5,1331799797117,fromdict verb,吊るす,つるす,2012-03-15 08:23:17.117,2017-11-23 23:58:09.000,0,0,0,...,0,0,1,0,0,0,3,3,[3:4],
4,6,1331799797118,fromdict convo checked,和やか,なごやか,2012-03-15 08:23:17.118,2017-11-23 23:58:09.000,0,0,0,...,0,0,0,0,0,0,3,4,[3:4],
5,7,1331799797121,fromdict,営業日,えいぎょうび,2012-03-15 08:23:17.121,2017-11-23 23:58:09.000,0,0,0,...,0,0,0,0,0,0,3,6,[3:4],
6,8,1331799797122,fromdict,在庫,ざいこ,2012-03-15 08:23:17.122,2017-11-23 23:58:09.000,0,0,0,...,0,0,0,1,0,0,2,3,[2],
7,10,1331799797126,fromdict,有能,ゆうのう,2012-03-15 08:23:17.126,2019-03-23 22:24:15.000,0,0,0,...,0,0,1,0,0,0,2,4,[2],
8,11,1331799797127,waseigo fromdict katakana,公衆トイレ,こうしゅうトイレ,2012-03-15 08:23:17.127,2019-03-26 14:13:19.000,0,0,0,...,0,0,0,0,0,0,5,8,[5:8],
9,12,1331799797128,fromdict,送り賃,おくりちん,2012-03-15 08:23:17.128,2017-11-23 23:58:09.000,0,0,0,...,0,0,0,0,0,0,3,5,[3:4],


In [17]:
tag_freq = pd.Series(' '.join(df_notes.tags).split()).value_counts()

In [18]:
tag_freq.head(20)

textbook      2625
fromdict      1898
hasnotags     1501
hasrobo       1035
verb           954
fromtest       940
college        879
len1           518
fromexam       369
hiragana       357
numeric        350
semester1      318
katakana       303
commonword     275
kana           253
checked        245
addsimilar     244
noun           243
convo          217
media          199
dtype: int64

# Initial Observations

Looks like our data is ready for some proper inspection! What questions might we ask of this dataset? We could start with some broad, basic observations about the (condensed) dataset, such as:
- How many terms (unique notes) exist?
- How many study vectors (unique card types) exist, i.e. were used by student A?
- When did student A first start studying?
- What is the distribution of the reps count? Of the lapses count?
- Of the terms that exist, how many had audio data?
- Of the terms that exist, how many had image data?
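
As a rough sketch of how the distribution and media questions above might be answered, the snippet below uses a tiny made-up DataFrame rather than the real data. The column names `reps`, `lapses`, `hasAudio`, and `hasVisual` are assumptions about the real tables, and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real data; column names are assumptions
# and the numbers are invented for illustration only.
df_demo = pd.DataFrame({
    "Term": ["隙間", "苦汁", "移籍", "在庫"],
    "reps": [12, 30, 7, 18],
    "lapses": [1, 5, 0, 2],
    "hasAudio": [1, 0, 1, 1],
    "hasVisual": [0, 0, 1, 0],
})

# Summary statistics (count, mean, std, min, quartiles, max) for reps and lapses
dist_summary = df_demo[["reps", "lapses"]].describe()

# Number of terms flagged as having audio / image data (flags are 0 or 1)
n_audio = int(df_demo["hasAudio"].sum())
n_visual = int(df_demo["hasVisual"].sum())

print(dist_summary)
print(f"terms with audio: {n_audio}, terms with image: {n_visual}")
```

Summing a 0/1 flag column counts the flagged rows directly, which avoids a separate filter step.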

In [19]:
# unique terms in the condensed dataset
len(df_notes['Term'].unique())

8047

In [24]:
# confirm what card types exist
#df_combo['CardType'].value_counts()

In [26]:
#pd_crt # datetime of collection creation (studying commenced from this date)
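
Since `pd_crt` is not available in this section, one possible proxy for the start date is the earliest note creation time. The sketch below assumes `nid` is a Unix timestamp in milliseconds, which appears consistent with the `NoteCreated` column in the `df_notes` preview. Note that the earliest note is only an approximation of when studying began, not the collection creation time itself.

```python
import pandas as pd

# Two nid values copied from the df_notes preview above
nids = pd.Series([1331799797112, 1331799797113])

# Assumption: nid encodes the note creation time as epoch milliseconds
created = pd.to_datetime(nids, unit="ms")

# Earliest note creation time, used here as a proxy for the study start date
first_note = created.min()
print(first_note)
```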

In [27]:
print(df_cards.shape)

(8254, 15)


In [28]:
print(df_notes.shape)

(8047, 41)


In [29]:
print(df_combo.shape)

(7964, 51)


# Initial Analysis

There appears to be a linear relationship between lapses and reps. This makes sense and is worth keeping in mind: each lapse seems to incur a cost in additional reps. However, whether it reflects simple correlation or actual causation, this finding doesn't seem (to the author) directly actionable. The primary focus is what can be done to optimize studying.
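
One way to put a number on such a relationship is the Pearson correlation coefficient. The sketch below is illustrative only: the data are invented, and `reps` and `lapses` are assumed column names.

```python
import pandas as pd

# Invented per-card counts; `reps` and `lapses` are assumed column names
df_demo = pd.DataFrame({
    "reps": [5, 10, 15, 20, 25],
    "lapses": [0, 1, 1, 3, 4],
})

# Pearson correlation: values near 1 suggest a strong increasing linear relationship
r = df_demo["reps"].corr(df_demo["lapses"])
print(f"Pearson r between reps and lapses: {r:.3f}")
```

A high coefficient would support the "linear relationship" reading, though it says nothing about which variable drives the other.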

# Topical Analysis

After doing some basic assessments of the data, we can dig a bit deeper:
- Is there a correlation between words having multiple readings ("yomi") and their forget rate\*?
- Is there a correlation between words having same/similar sounding words and their forget rate\*?
- What might the effect of word length be on memorability? \*\*, \*\*\*

> \* Forget rate can be understood in several ways, such as the ratio of lapses to reps, the raw lapse count, the average interval, and other numbers/ratios to be determined. I will attempt to clarify this in the process.  
\*\* Memorability is loosely correlated with forget rate, where memorability could be understood as a word/term's intrinsic "stickiness" in the brain, as opposed to an individual's or group's capacity to keep words/terms in their head. Sources pending.  
\*\*\* A huge caveat here: this dataset has a sample size of 1 (for both student and language), so all observations, interpretations, and understandings must be taken with more than a few grains of salt (and tested further with larger sample sizes, of at least 200 students and 5 or more languages).
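
To make one of the candidate definitions concrete, the sketch below treats forget rate as lapses divided by reps and compares its mean between terms with and without multiple readings. This is only one possible definition, the data are invented, and `reps` and `lapses` are assumed column names (`hasMultiReading` appears in `df_notes`).

```python
import pandas as pd

# Invented data; `reps` and `lapses` are assumed column names
df_demo = pd.DataFrame({
    "Term": ["a", "b", "c", "d"],
    "hasMultiReading": [0, 0, 1, 1],
    "reps": [10, 20, 10, 40],
    "lapses": [1, 2, 3, 8],
})

# One candidate definition of forget rate: lapses per review.
# Rows with reps == 0 would need special handling (division by zero).
df_demo["forget_rate"] = df_demo["lapses"] / df_demo["reps"]

# Mean forget rate for terms without (0) and with (1) multiple readings
mean_by_group = df_demo.groupby("hasMultiReading")["forget_rate"].mean()
print(mean_by_group)
```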

In [None]:
# show correlation of stats via heatmap
import matplotlib.pyplot as plt
import seaborn as sns

# NOTE: df_binary2 is not defined in this section; it must be loaded or built beforehand
df_worked = df_binary2.copy() # 'ivl','factor','lapses'
# columns to exclude from the correlation matrix
graph_drop_cols_1 = [
    'nid','commonword','clothing','animal','body','food','textbook','college','place',
    'fromdict','fromexam','onechar','n1','n2','n3','n4','n5','hasVisual',
    'hasAudio','hasSimilar','hasAltForm','TermLen','Syllables','ivl_q']
df_worked = df_worked.drop(columns=graph_drop_cols_1)
corr = df_worked.corr()
# https://stackoverflow.com/questions/38913965/make-the-size-of-a-heatmap-bigger-with-seaborn
fig, ax = plt.subplots(figsize=(10, 10))  # figsize in inches
# draw on the enlarged axes, with the color scale fixed to the full correlation range
sns.heatmap(corr, vmin=-1, vmax=1, annot=True, ax=ax)

In [None]:
# todo: move this up one cell at least, if not up to the very top of initial graphing
# Basic correlogram
sns.pairplot(df_explore)
plt.show()

# Further Analysis

For a deeper understanding of what it means to acquire new terminology, the researcher believes it best to analyze term acquisition by merging the multiple vectors (individual cards) of a single term into a single entry, where the values for each vector (such as review count, lapse count, etc.) are encoded as separate columns on that entry. This would enable inspection and correlation analysis of:
- total reviews per term
- average ratio of reviews per term per vector (look vs hear vs recall vs read)
- where lapses are most likely to occur (per word, per vector, etc.)
- how word length and the presence of kanji, katakana, hiragana, or a combination thereof may affect the above counts and ratios
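
The merge described above could be sketched with a pivot, as below. This is a minimal illustration on invented data, not the author's implementation: the column names `reps` and `lapses` and the derived names such as `reps_look` are assumptions, while `CardType` and the vector names are taken from this notebook.

```python
import pandas as pd

# Invented long-format data: one row per card (vector) of a term
df_long = pd.DataFrame({
    "Term": ["隙間", "隙間", "移籍", "移籍"],
    "CardType": ["look", "recall", "look", "recall"],
    "reps": [10, 14, 6, 9],
    "lapses": [1, 3, 0, 2],
})

# Reshape to one row per term, with one column per (stat, card type) pair
df_wide = df_long.pivot_table(
    index="Term",
    columns="CardType",
    values=["reps", "lapses"],
    aggfunc="sum",
)

# Flatten the two-level column index into names such as "reps_look"
df_wide.columns = [f"{stat}_{card}" for stat, card in df_wide.columns]

# Total reviews per term across all vectors
df_wide["reps_total"] = df_wide[["reps_look", "reps_recall"]].sum(axis=1)
print(df_wide)
```

With one row per term, per-vector ratios (for example `reps_look / reps_total`) become simple column arithmetic.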

# Further Information

The Spaced Repetition Software ("SRS") used by student "A" to study Japanese is an open source program called Anki. The algorithm it uses to "graduate" (also referred to as "maturing") study items (called cards), so that subsequent reviews are spaced further into the future, is referred to as SM-2. [Please click here for more information on the SM-2 algorithm used in Anki.](https://apps.ankiweb.net/docs/manual.html#what-algorithm)
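
For orientation, here is a simplified sketch of one common reading of the originally published SuperMemo SM-2 rules. It is not Anki's implementation, which modifies the algorithm in several ways. The function name `sm2_update` and its parameters are hypothetical, chosen only for this illustration.

```python
def sm2_update(quality, repetitions, interval, ease):
    """One review step under a simplified reading of the original SM-2 rules.

    quality:     self-assessed recall grade, 0 (blackout) to 5 (perfect)
    repetitions: number of consecutive successful reviews so far
    interval:    current interval in days
    ease:        current easiness factor (conventionally starts at 2.5)
    Returns the updated (repetitions, interval, ease).
    """
    if quality < 3:
        # Failed recall: restart the repetition sequence, ease left unchanged
        return 0, 1, ease

    if repetitions == 0:
        interval = 1
    elif repetitions == 1:
        interval = 6
    else:
        # Later intervals grow multiplicatively with the easiness factor
        interval = round(interval * ease)

    # Adjust the easiness factor by grade, never letting it drop below 1.3
    ease = ease + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    ease = max(ease, 1.3)
    return repetitions + 1, interval, ease


# Three perfect reviews in a row, followed by a failed one
state = (0, 0, 2.5)
for grade in (5, 5, 5, 2):
    state = sm2_update(grade, *state)
    print(state)
```

The growing interval after each successful review is what spaces later reviews further into the future, while a failed review sends the card back to a one-day interval.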