# Guessing the author of a novel, 太宰治 or 宮沢賢治?

Let's implement a method that guesses whether a novel whose author is unknown was written by 太宰治 (Osamu Dazai) or 宮沢賢治 (Kenji Miyazawa).

In this project, we use texts from [Aozora Bunko](https://www.aozora.gr.jp/index.html).
We have modified the texts by removing ruby annotations and the notes added by the people who typed the texts.
We have also adjusted the number of novels by 太宰治 so that
the total numbers of letters in the novels of the two authors are nearly equal.

We provide 81 novels by 太宰治, 148 novels by 宮沢賢治, and 10 novels by unknown authors.
Observe the characteristics of the texts by 太宰治 and 宮沢賢治, and for each of the novels by unknown authors,
guess who wrote it, 太宰治 or 宮沢賢治.
Among the known features for classifying texts, try to use statistical features of letter n-grams.

For example, in the novels by 宮沢賢治, sentences tend to end with the です-ます style.
On the other hand, in the novels by 太宰治, sentences tend to end with the である style.
Therefore, once the bi-grams in the novels by each author have been counted,
the probability of occurrences of `です` or `す。` should be high in the novels by 宮沢賢治,
and that of `ある` or `る。` should be high in those by 太宰治.
If the probability of `です` or `す。` in a novel by an unknown author is higher than that of `ある` or `る。`,
you could then conclude that the author is 宮沢賢治.
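
The bi-gram heuristic can be tried on a small scale. The following is a minimal sketch that uses two made-up sample sentences instead of the actual dataset; the helper name `bigram_probability` is hypothetical and is not part of the exercises.

```python
def bigram_probability(text, bigram):
    """Return the fraction of bi-grams in `text` that are equal to `bigram`."""
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    if not bigrams:
        return 0.0
    return bigrams.count(bigram) / len(bigrams)

# Made-up sample sentences (assumption: not taken from the dataset).
polite = 'これは本です。あれは山です。'      # です-ます style
plain = 'これは本である。あれは山である。'  # である style

print(bigram_probability(polite, 'です') > bigram_probability(polite, 'ある'))
print(bigram_probability(plain, 'ある') > bigram_probability(plain, 'です'))
```

Both comparisons print `True` for these samples, which is the pattern the heuristic relies on.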

However, some novels by 太宰治 are also written in the です-ます style, so this simple rule can fail.
Therefore, compute the probability distribution of all the letter n-grams of each of the authors,
and compare the probability distribution of letter n-grams of an unknown author
with that of 太宰治 and with that of 宮沢賢治.

## Preparation

A dataset of novels is stored in a table.
The table of novel data consists of the columns named `author`, `title`, and `text`,
which contain author names in Japanese, titles, and texts of novels, respectively.
Each row of the table stores the information of one novel;
we can identify the author of the novel by the `author` value.  

In the following Exercises 1 and 5, such a table of type `DataFrame` is given to functions.

Note that a for-statement can iterate over a column extracted from a `DataFrame`.
For example, the following for-statement
```Python
for text in novels['text']:
    print(text)
```
prints the texts of all the novels stored in `novels`.

In [1]:
def multiline_ngrams(n, text):
    l = []
    for sentence in text.split('\n'):
        for i in range(len(sentence)-n+1):
            l.append(sentence[i:i+n])
    return l

In [2]:
def multiline_ngrams_gen(n, text):
    for sentence in text.split('\n'):
        for i in range(len(sentence)-n+1):
            yield sentence[i:i+n]
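
Both helpers above split the text at newlines first, so no n-gram spans two lines. The sketch below illustrates this on a made-up two-line string; it repeats compact versions of the helpers only so that it runs on its own.

```python
def multiline_ngrams(n, text):
    # Same behaviour as the list-based helper: n-grams are taken line by line.
    return [line[i:i + n]
            for line in text.split('\n')
            for i in range(len(line) - n + 1)]

def multiline_ngrams_gen(n, text):
    # Generator variant: yields the same n-grams lazily, one at a time.
    for line in text.split('\n'):
        for i in range(len(line) - n + 1):
            yield line[i:i + n]

sample = 'abc\nde'  # made-up text with two lines
print(multiline_ngrams(2, sample))  # ['ab', 'bc', 'de'] ('cd' is absent)
print(multiline_ngrams(3, sample))  # ['abc'] (the line 'de' is too short)
print(list(multiline_ngrams_gen(2, sample)) == multiline_ngrams(2, sample))  # True
```

The generator version avoids building the whole list in memory, which may matter for the large dataset.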

## 1: Extraction of n-grams

Define a function `author_ngrams(n, author, novels)` which returns a list or iterator of all n-grams
that appear in the texts of the novels in `novels` written by `author`.
The parameter `author` is 太宰治 or 宮沢賢治.

Each n-gram must appear in the list or iterator as many times as it appears in all the novels by the author.

In [3]:
import pandas as pd

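def author_ngrams(n, author, novels):
    # Keep only the rows whose `author` column matches the requested author.
    sub_df = novels[novels['author'] == author]
    ngrams = []

    # Collect the n-grams of every novel, keeping duplicates.
    for text in sub_df['text']:
        ngrams.extend(multiline_ngrams(n, text))

    return ngrams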

In [4]:
def tester():
    import pandas as pd
    TEST_NOVELS = [['太宰治', '政治家と家庭', '頭の禿げた善良そうな記者君が何度も来て、書け書け、と頭の汗を拭きながらおっしゃるので、書きます。\n佐倉宗五郎子別れの場、という芝居があります。ととさまえのう、と泣いて慕う子を振り切って、宗五郎は吹雪の中へ走って消えます。あれを、どうお思いでしょうか。アメリカ人が見たら、あれをどう感ずるでしょうか。ロシヤ人が見たら、何と判断するでしょうか。\nしかし私たち日本人、殊に男が何か仕事に打ち込んだ場合、たいていこの宗五郎のようになってしまいます。\n家族は、捨ててよいものでしょうか。日本の政治家たちは、たいてい家庭を捨てているようです。ひどいのになると、独身だか妻帯者だか、わからない人物もあります。しつけの良い家庭を営んでいる政治家は、少いように思われます。\nしつけのよい家庭を維持しながら、よい仕事も出来るという政治家もあってよいと思います。これこそ、至難の事業であります。けれども、兄は、それが出来るかも知れない極めて少数のひとの一人だと思います。\n無理なお願いでしょうけれどもお願いしてみます。私の為のお願いではありません。\n'],
                   ['宮沢賢治', '会計課', '九時六分のかけ時計\nその青じろき盤面に\nにはかに雪の反射来て\nパンのかけらは床に落ち\nインクの雫かわきたり\n'],
                  ]
    TEST_NOVELS = pd.DataFrame(TEST_NOVELS, columns=['author', 'title', 'text'])
    small_2_dazai = list(author_ngrams(2, '太宰治', TEST_NOVELS))
    small_2_miyazawa = list(author_ngrams(2, '宮沢賢治', TEST_NOVELS))
    print(len(small_2_dazai) == 452)
    print(len(small_2_miyazawa) == 44)
    print(small_2_dazai.count('す。') == 11)
    print(small_2_miyazawa.count('のか') == 2)

    novels = pd.read_csv('known_novels.csv', encoding='utf-8')
    large_3_dazai = list(author_ngrams(3, '太宰治', novels))
    large_3_miyazawa = list(author_ngrams(3, '宮沢賢治', novels))
    print(len(large_3_dazai) == 899275)
    print(len(large_3_miyazawa) == 868498)
    print(large_3_dazai.count('である') == 2891)
    print(large_3_miyazawa.count('である') == 290)
tester()

True
True
True
True
True
True
True
True


## 2: Occurrences of n-grams

Define a function `histogram(ngs)`,
which is given a list or iterator of n-grams and returns a dictionary whose keys are n-grams
with the number of their occurrences as their value.
The parameter `ngs` is a list or iterator of n-grams.
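
As a point of comparison, the standard library already provides this kind of counting dictionary. The sketch below is an alternative, not the solution the exercise asks for; the function name `histogram_with_counter` and the sample list are made up for illustration.

```python
import collections

def histogram_with_counter(ngs):
    """Count n-grams with collections.Counter (works for lists and iterators)."""
    return collections.Counter(ngs)

sample_bigrams = ['です', 'す。', 'です', 'ある']  # made-up input
hist = histogram_with_counter(sample_bigrams)
print(hist['です'])  # 2
print(hist['ない'])  # 0, because a missing key counts as zero occurrences
```

Looking up an n-gram that never occurred returns 0 without adding the key to the dictionary.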

In [5]:
class Counter(dict):
    def __missing__(self, k):
        return 0
    
def histogram(ngs):
    hist = Counter()
    
    for word in ngs:
        hist[word] += 1
    return hist

In [6]:
def tester():
    import pandas as pd
    novels = pd.read_csv('known_novels.csv', encoding='utf-8')
    dazai_histogram = histogram(author_ngrams(3, '太宰治', novels))
    miyazawa_histogram = histogram(author_ngrams(3, '宮沢賢治', novels))
    unknown_novels = pd.read_csv('unknown_novels.csv', encoding='utf-8')
    un0_histogram = histogram(multiline_ngrams(3, unknown_novels.loc[0,'text']))
    print(len(dazai_histogram) == 268576)
    print(len(miyazawa_histogram) == 245770)
    print(dazai_histogram['である'] == 2891)
    print(miyazawa_histogram['である'] == 290)
    print(dazai_histogram['です。'] == 1203)
    print(miyazawa_histogram['です。'] == 1875)
    print(un0_histogram['である'] == 4)
    print(un0_histogram['です。'] == 18)
tester()

True
True
True
True
True
True
True
True


## 3: Probability distributions of n-grams

Define a function `probability_distribution(hist)`,
which is given a distribution of occurrences of n-grams `hist` computed by the function `histogram` in Exercise 2,
and returns the probability distribution of n-grams.
The number of occurrences of each n-gram is divided by the total number of occurrences of all the n-grams.
The function `probability_distribution` returns a dictionary
whose keys are n-grams with the probability of their occurrences as their value.
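
The normalization can be checked by hand on a tiny histogram. The following sketch uses a made-up histogram with four occurrences in total; the name `normalize` is hypothetical.

```python
def normalize(hist):
    """Divide every count by the total number of occurrences."""
    total = sum(hist.values())
    return {ngram: count / total for ngram, count in hist.items()}

toy_hist = {'です': 3, 'ある': 1}  # made-up counts, 4 occurrences in total
probs = normalize(toy_hist)
print(probs['です'])        # 0.75
print(probs['ある'])        # 0.25
print(sum(probs.values()))  # 1.0
```

The probabilities of a distribution sum to 1, up to floating-point rounding for larger histograms.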

In [7]:
def probability_distribution(hist):
    prob = {}
    total = 0
    
    for word in hist:
        total += hist[word]
    
    for word in hist:
        prob[word] = hist[word] / total
    
    return prob

In [8]:
def tester():
    import pandas as pd
    novels = pd.read_csv('known_novels.csv', encoding='utf-8')
    dazai_histogram = histogram(author_ngrams(3, '太宰治', novels))
    miyazawa_histogram = histogram(author_ngrams(3, '宮沢賢治', novels))
    print(round(probability_distribution(dazai_histogram)['である']*10**8) == 321481)
    print(round(probability_distribution(miyazawa_histogram)['である']*10**8) == 33391)
tester()

True
True


## 4: Distance between probability distributions of n-grams

You now compute the distance between probability distributions of n-grams in different texts.
Suppose that two n-gram probability distributions $d_1$ and $d_2$ are given.
In the following mathematical expressions, $d_i(x)$ denotes the probability of an n-gram $x$ in $d_i$.

The Tankard distance between $d_1$ and $d_2$ is obtained by summing the absolute differences of the probabilities
of each n-gram that appears in both texts, and then averaging over those n-grams.
The larger the differences of the probabilities are, the more different the two texts are considered.
The Tankard distance is therefore defined as follows.

$\mbox{Tankard}(d_1, d_2) =
\frac{1}{\mbox{card}(C)} \sum_{x \in C} {|d_1(x) - d_2(x)|}$,

in which $C$ denotes the set of n-grams whose probabilities in $d_1$ and $d_2$ are both positive, that is,

$C = \{~x~|~d_1(x)>0~\mbox{ and }~d_2(x) > 0 \}$,

and $\mbox{card}(C)$ denotes the number of elements in $C$.

If an n-gram probability distribution is represented by a dictionary,
the probability of an n-gram that is not stored in the dictionary is considered 0.
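
A small worked example may make the definition concrete. In the sketch below, the two distributions are made up, the name `tankard_distance` is hypothetical, and raising an error when $C$ is empty is an assumption, since the definition does not cover that case.

```python
def tankard_distance(d1, d2):
    """Mean absolute difference over the n-grams present in both dictionaries."""
    # The set C, assuming every stored probability is positive.
    common = [x for x in d1 if x in d2]
    if not common:
        # Assumption: the formula is undefined when C is empty, so raise.
        raise ValueError('the distributions share no n-gram')
    return sum(abs(d1[x] - d2[x]) for x in common) / len(common)

d1 = {'ab': 0.5, 'bc': 0.25, 'cd': 0.25}   # made-up distribution
d2 = {'ab': 0.25, 'bc': 0.25, 'xy': 0.5}   # made-up distribution
# C = {'ab', 'bc'}; the distance is (|0.5 - 0.25| + |0.25 - 0.25|) / 2
print(tankard_distance(d1, d2))  # 0.125
```

Note that `'cd'` and `'xy'` do not contribute at all, because each appears in only one of the two distributions.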

Now, define a function `Tankard(d1,d2)` that returns the Tankard distance
between the two n-gram probability distributions `d1` and `d2` that are given as dictionaries.

In [9]:
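def Tankard(d1, d2):
    dist = 0    # sum of |d1(x) - d2(x)| over the common n-grams
    denom = 0   # card(C): number of n-grams present in both d1 and d2

    for word in d1:
        if word in d2:
            dist += abs(d1[word] - d2[word])
            denom += 1

    # Note: this raises ZeroDivisionError if d1 and d2 share no n-gram.
    return dist/denom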

In [10]:
def tester():
    import pandas as pd
    novels = pd.read_csv('known_novels.csv', encoding='utf-8')
    dazai_histogram = histogram(author_ngrams(3, '太宰治', novels))
    miyazawa_histogram = histogram(author_ngrams(3, '宮沢賢治', novels))
    print(round(Tankard(probability_distribution(dazai_histogram),probability_distribution(miyazawa_histogram))*10**8) == 855)
tester()

True


## 5: Guessing the author

Now, let's guess who wrote unknown novels, 太宰治 or 宮沢賢治.
Define a function `which_author(n, known_novels, unknown_novels)` that
* takes a positive integer `n`, a dataset of known novels `known_novels`, and a dataset of unknown novels `unknown_novels`, and 
* returns a list of guessed results for all the novels in `unknown_novels` in order. 

The author of a novel is guessed as follows:
* Calculate the n-gram probability distribution of a given unknown novel;
* Calculate its Tankard distance to that of the novels of 太宰治 in `known_novels`;
* Calculate its Tankard distance to that of 宮沢賢治 in `known_novels`;
* Conclude that the author is `'太宰治'` if the distance to 太宰治 is smaller than that to 宮沢賢治,
* Conclude that the author is `'宮沢賢治'` otherwise.

Note that `unknown_novels` contains author information (i.e., true solutions) in the `author` column, but you are not allowed to use it for guessing.

In [11]:
def ngrams(n, novel):
    ngrams = []
    for text in novel:  
        for word in multiline_ngrams(n, text):
            ngrams.append(word)    
    return (ngrams)


def which_author(n, known_novels, unknown_novels):
    guess = []
    
    dazai_prob = probability_distribution(histogram(author_ngrams(n, '太宰治', known_novels)))
    miyazawa_prob = probability_distribution(histogram(author_ngrams(n, '宮沢賢治', known_novels)))
    
    for novel in unknown_novels['text']:
        novel_prob = probability_distribution(histogram(multiline_ngrams(n, novel)))
        dazai_tankard = Tankard(novel_prob, dazai_prob)
        miyazawa_tankard = Tankard(novel_prob, miyazawa_prob)
        
        # Strictly smaller, as specified: a tie is attributed to 宮沢賢治.
        if dazai_tankard < miyazawa_tankard:
            guess.append('太宰治')
        else:
            guess.append('宮沢賢治')
    
    return guess

In [12]:
# Inference

def tester():
    import pandas as pd
    known_novels = pd.read_csv('known_novels.csv', encoding='utf-8')
    unknown_novels = pd.read_csv('unknown_novels.csv', encoding='utf-8')
    print(which_author(3, known_novels, unknown_novels) == ['太宰治', '太宰治', '太宰治', '太宰治', '太宰治', '宮沢賢治', '宮沢賢治', '宮沢賢治', '宮沢賢治', '宮沢賢治'])
tester()

True
