# SECTION I

> **Try to find the most common words in the German Language, and as a result come up with a better frequency dictionary or your personal vocabulary list.**

In almost every natural language, there are some patterns, be it vocabulary, or grammatical constructs, or even the letters, that are present in a disproportionate amount. In other words, some words and patterns appear more often than others. For an enhanced approach to memorizing a high-return vocabulary, one might find these helpful. After all, why learn an arcane word early on when you can memorize frequent ones?

As a Python programmer, you'll find it even more joyful and rewarding to go through this analysis on your own. If you're interested, you may read more [here](https://github.com/davidahmed/TJoP) as to why this notebook exists.

In this section, you will encounter:
* basic text-file reading
* mapping lists (aka functors)
* very frequent functional operations such as map, reduce, sum

In [2]:
# The corpus was downloaded from here: 
#  https://wortschatz.uni-leipzig.de/en/download/

CORPUS_PATH = './corpus/deu_news_2015_300K-{corpus_type}.txt'

CORPUS_WORDS_PATH = CORPUS_PATH.format(corpus_type='words')

In [58]:
# Any command prefixed with an exclamation mark (!) is 
#  interpreted as a shell command.

!head {CORPUS_WORDS_PATH}

1	!	!	6010
2	"	"	57894
4	$	$	87
5	%	%	229
6	&	&	1265
7	'	'	1477
8	(	(	18870
9	)	)	18740
10	*	*	573
11	+	+	789


In [59]:
words = []

with open(CORPUS_WORDS_PATH) as r:
    for line in r:
        line = line.split()
        words.append((line[2], int(line[-1])))  

In [60]:
words.sort(key=lambda x: x[1], reverse=True)

In [61]:
words = list(filter(lambda _: len(_[0]) != 1, words))

In [62]:
total_count = sum(map(lambda _: _[-1], words))

In [63]:
import math

Ns = [10, 100, 200, 500, 1000, 2000, 5000]

for n in Ns:
    prozent = sum(map(lambda _: _[1], words[:n]))/total_count*100
    print("Top {n} words cover {prozent}% of the corpus!".format(n=n, 
                                                                prozent=math.floor(prozent)))

Top 10 words cover 13% of the corpus!
Top 100 words cover 36% of the corpus!
Top 200 words cover 42% of the corpus!
Top 500 words cover 50% of the corpus!
Top 1000 words cover 57% of the corpus!
Top 2000 words cover 63% of the corpus!
Top 5000 words cover 72% of the corpus!


### Discussion
At this point, we are counting the 'raw/textual' word-frequency. In German, like almost in any other language, the same lemma has various lexemes. For instance, the word der, die, das are lexemes used as the English language article *the* in the German language. [Lemma](https://en.wikipedia.org/wiki/Lemma_(morphology)), thus, is the canonical form of lexemes; whereas a [lexeme](https://en.wikipedia.org/wiki/Lexeme) is just the smallest and independent unit of 'meaning'.

In other words, if you were to look up the German words *kommt*, *kommen*, *komme*, etc., they would all be listed under the same lemma (kommen) in the dictionary. In order to have a mapping of our lexemes in the `words`, we would need some Natural Language Processing. Assuming it's a blackbox which tells us the mapping, so we may finish the task at hand. I personally would rather know the lemma and it's most common lexemes, than not know it.

In [64]:
import spacy
nlp = spacy.load('de_core_news_md')

In [65]:
%%time
# WARNING: I'm finding the lemmas for first 10_000 words only.
#   I'm sure there are faster ways to do this rather than 
#   creating nlp objects for each word. But I'm happy so far.

limit = 10_000
lemmas = list(map(lambda _: (nlp(_[0])[0].lemma_, _[-1]), words[:10_000]))

CPU times: user 1min 16s, sys: 799 ms, total: 1min 16s
Wall time: 1min 20s


In [66]:
unique_lemmas = {}

for lemma in lemmas:
    if lemma[0] in unique_lemmas:
        unique_lemmas[lemma[0]] += lemma[-1]
    else:
        unique_lemmas[lemma[0]] = lemma[-1]

In [67]:
unique_lemmas = list(unique_lemmas.items())

In [68]:
words.sort(key=lambda x: x[1], reverse=True)

In [69]:
Ns = [10, 100, 200, 500, 1000, 2000, 5000]

for n in Ns:
    prozent = sum(map(lambda _: _[1], unique_lemmas[:n]))/total_count*100
    print("Top {n} lemmas cover {prozent}% of the corpus!".format(n=n, 
                                                                prozent=math.floor(prozent)))

Top 10 lemmas cover 20% of the corpus!
Top 100 lemmas cover 43% of the corpus!
Top 200 lemmas cover 49% of the corpus!
Top 500 lemmas cover 57% of the corpus!
Top 1000 lemmas cover 63% of the corpus!
Top 2000 lemmas cover 69% of the corpus!
Top 5000 lemmas cover 76% of the corpus!


## Discussion
We notice that the coverage goes up after we reduce the lexemes to their canonical form (i.e. lemmas). Note that this step is called **lemmatization**. Of course, this is all a coarse approximation, but nonetheless, could be helpful to 'hacking' vocabulary learning.

## Further Explorations:

> 1. Try coming up with the most common lexemes. For instance, try finding which form (i.e. conjugation) of a verb is most common.

> 2. Try a different genre/corpus/time. Do you notice a difference in the distribution of most common words? TIP: You may be able to get txt of old books (genres) from [Project Gutenberg](https://www.gutenberg.org/).

> 3. Can you come up with a better (by your own subjective judgement) frequency dictionary than [these](https://www.routledge.com/Routledge-Frequency-Dictionaries/book-series/RFD?gclid=EAIaIQobChMI046nt5Si6gIVDhoYCh1pnwAlEAAYASAAEgIGZPD_BwE).


# Section II 

> In this section we will get into n-gram analysis, particularly, bigrams.

You'll encounter: 
* some data cleaning, also referred to as **data wrangling**
* regex operations
* Python Counters

In [291]:
! head {CORPUS_SENTENCES_PATH}

1	00.48 Uhr: Und sie liegt nur 48 Kilometer entfernt.
2	00.56 Uhr: Der SRF-Moderator nützt die Nationalhymnen, um sich in Position zu bringen und übernimmt nach den Landesliedern als Kommentator.
3	007: Spectre-Weltpremiere in London Sogar die Royals kamen, um sich die Weltpremiere des neune James Bond Films..
4	01.00 Uhr - Am Dienstag geht das Tor zu an der ungarisch-serbischen Grenze.
5	01.05.2015 – 14:12 Fernsehen München (ots) - Das Thema: Deutschlands Löhne - Was ist unsere Arbeit wert?
6	01.07.15 – 01:31 min Mediathek Möglicher Grexit Kurzfristige Versorgung mit Drachmen ist nicht möglich "Die Griechen leben seit Jahren über ihre Verhältnisse", so Sinn.
7	01.07.2015 – 14:46 Uslar (ots) - USLAR (zi.) Am Sonntag, 28.06.2015, gegen 01.55 Uhr, wurde in der Langen Straße, in Höhe der Stadtschänke, ein 49-jähriger Taxifahrer von einer bisher unbekannten Person unvermittelt angegriffen und ins Gesicht geschlagen.
8	01.09.2015 08:00 Uhr Wolfgang Stieler vorlesen Mit Genmodifikatio

In [70]:
! file -I {CORPUS_SENTENCES_PATH}

./corpus/deu_news_2015_300K-sentences.txt: text/plain; charset=utf-8


In [71]:
import re
import math
from collections import Counter

CORPUS_PATH = './corpus/deu_news_2015_300K-{corpus_type}.txt'

CORPUS_SENTENCES_PATH = CORPUS_PATH.format(corpus_type='sentences')

In [72]:
sentences = open(CORPUS_SENTENCES_PATH, 'r', encoding='utf-8').readlines()

In [73]:
def sanitizeTimestamp(sentence):
    match = re.search(r'\sUhr[\:]*\s', sentence)
    if match:
        return sentence[match.end():].strip()
    else: 
        return sentence.split('\t')[-1].strip()

In [74]:
sentences = list(map(lambda _: sanitizeTimestamp(_), sentences))

In [75]:
sentences = ' '.join(sentences)

In [76]:
words = sentences.split(' ')
bigrams = []

for _, word in enumerate(words[:-1]):
    bigrams.append(' '.join(words[_:_+2]).lower())

In [77]:
bigrams = Counter(bigrams)

In [78]:
N = 20

print('Top {N} bigrams in the German language:\n'.format(N=N))

for bigram in bigrams.most_common(N):
    print(bigram[0], end='\n')

Top 20 bigrams in the German language:

in der
in den
für die
mit dem
auf die
und die
in die
bei der
mit der
auf der
auf dem
für den
an der
von der
mit einem
auf den
dass die
aus dem
sich die
gibt es


## Discussion
If you're new to the German language, now you shouldn't be surprised to see the bigrams like used so frequently in written text.

Also if you noticed, I didn't attempt to calculate the percentage (*prozent* in German) for the bigrams. I did so because percentage really doesn't make sense here- precisely because there are so many ways to calculate it, and also, the total number of possible bigrams would be a huge number (size_of_vocab^2). Relative frequency (or probability weights or the counts as I did here) would be all rather a better approach and thinking models.

I personally found that looking at and learning n-grams (i.e. bigrams, trigrams, quadrams, etc.) is usually a very interesting alternate way to learning the vocabulary of a language. For instance, *gibt es*, 20-th most common bigram in our corpus could actually be quite handy. It roughly means 'is there'; a handy phrase to ask in situations like- "(Do you sell/have)/(Is there) mayonnaise?". You'd say, "Gibt es Mayonnaise?". Or *Gibt es hier eine Zukunft?*. 

## Further Explorations:

> Try exploring other n-grams. You'd find that the higer your n goes, the little do things make sense. Why? 
>
> If you're interested, you might want to look into what Claude Shannon (a great computer scientist that you should know of) showed how a generative model of a language 
> can be created using n-gram analysis alone. ([Here's](https://web.stanford.edu/~jurafsky/slp3/3.pdf) a good reading)

> If you remember, I used `spacy` library above to do **lemmatization**. You can actually do something very interesting- explore what are the most common sentence structures in the German language!

> If you're a language enthusiast, you might find exploring other languages. A language like Arabic could be very rewarding if you've never encoundered it in NLP context before- it's (MSA/ Classical Arabic) morphology and templatic pattern might hook you for a lifetime. But of course, all languages are just beautiful and mesmerizing and divine. Aren't they?