# Language Modeling

We will work with a corpus of Shakespeare's texts. To download it, run:
```python
!wget http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt
```

**Complete the following exercises:**
1. Split the text into words.
2. Convert everything to lowercase.
3. Count the frequency of every word.
4. Replace words with a frequency below 2 with UNK.
5. Build a dictionary in which key _i_ holds a dictionary of frequencies for _n_-grams of length _i_.
6. Write a function that uses this dictionary to estimate sentence probabilities with Laplace smoothing.
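
The steps above could be sketched roughly as follows. This is only one possible reading of the exercise: the function names (`tokenize`, `replace_rare`, `count_ngrams`, `sentence_logprob`), the regex tokenizer, and the choice to skip the first n-1 words of a sentence instead of padding it are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Steps 1-2: lowercase, then treat runs of letters/apostrophes as words
    # (a deliberately simplistic, assumed tokenization rule).
    return re.findall(r"[a-z']+", text.lower())

def replace_rare(tokens, min_count=2, unk="UNK"):
    # Steps 3-4: count word frequencies and map words seen
    # fewer than min_count times to the UNK token.
    freqs = Counter(tokens)
    return [t if freqs[t] >= min_count else unk for t in tokens]

def count_ngrams(tokens, max_n=2):
    # Step 5: counts[i] maps each n-gram of length i (a tuple of words) to its frequency.
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(1, max_n + 1)
    }

def sentence_logprob(sentence, counts, n=2, unk="UNK"):
    # Step 6: add-one (Laplace) estimate P(w | h) = (c(h, w) + 1) / (c(h) + V),
    # where V is the vocabulary size. The first n-1 words are not scored.
    vocab = {w for (w,) in counts[1]}
    words = [w if w in vocab else unk for w in tokenize(sentence)]
    logp = 0.0
    for i in range(n - 1, len(words)):
        history = tuple(words[i - n + 1:i])
        ngram = history + (words[i],)
        # For unigrams the "history count" is the total number of tokens.
        history_count = counts[n - 1][history] if n > 1 else sum(counts[1].values())
        logp += math.log((counts[n][ngram] + 1) / (history_count + len(vocab)))
    return logp

tokens = replace_rare(tokenize("the cat sat the cat ran the dog"))
counts = count_ngrams(tokens, max_n=2)
# P(cat | the) = (2 + 1) / (3 + 3) under add-one smoothing with V = 3.
print(math.exp(sentence_logprob("the cat", counts)))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied for a long sentence.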

## Generating Random Text (Homework)

To generate random text, you need two things:

1. A training corpus.
2. A language model.

Following the comments, write a class that implements a very simple **character-level** language model.

In [14]:
from collections import Counter, defaultdict

class LanguageModel:
    def __init__(self, data, order=4):
        self.order = order
        self.ngrams = defaultdict(Counter)
        pad = '~' * order
        data = pad + data
        ### YOUR CODE HERE
        # For each n-gram in data, count all characters that follow it.
        # For instance, for order = 2 and data = 'abcbcb', self.ngrams should be:
        # self.ngrams['~~']['a'] == 1
        # self.ngrams['~a']['b'] == 1
        # self.ngrams['ab']['c'] == 1
        # self.ngrams['bc']['b'] == 2
        # self.ngrams['cb']['c'] == 1
        for i in range(len(data) - order):
            history = data[i:i + order]
            self.ngrams[history][data[i + order]] += 1
        ### END YOUR CODE
        self.lm = {history: self.normalize(chars) for history, chars in self.ngrams.items()}
    
    def normalize(self, counter):
        ### YOUR CODE HERE
        # Normalize the counts in counter so that they sum to 1.
        # For instance, for Counter(['a', 'b', 'a', 'a'])
        # this returns the following list of [char, probability] pairs:
        # [['a', 0.75], ['b', 0.25]]
        total = sum(counter.values())
        return [[char, count / total] for char, count in counter.items()]
        ### END YOUR CODE
    
    def __getitem__(self, history):
        return self.lm[history]
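
To check the counting loop by hand, it may help to run it on a tiny string. The sketch below repeats the same loop in a standalone function; `count_followers` is a hypothetical helper introduced here only for illustration and is not part of the class.

```python
from collections import Counter, defaultdict

def count_followers(data, order):
    # Pad the start so the first real character also has a full-length history.
    padded = '~' * order + data
    ngrams = defaultdict(Counter)
    for i in range(len(padded) - order):
        history = padded[i:i + order]
        ngrams[history][padded[i + order]] += 1
    return ngrams

ngrams = count_followers('abcbcb', order=2)
for history, followers in ngrams.items():
    print(repr(history), dict(followers))
```

Each history maps to a `Counter` of the characters seen right after it, which is the structure that `normalize` later turns into probabilities.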

Well, it's time to train the language model and check the results.

In [15]:
with open('shakespeare_input.txt', 'r') as fin:
    lm = LanguageModel(fin.read())

In [16]:
lm['ello']

[['w', 0.817717206132879],
 ["'", 0.017035775127768313],
 [',', 0.027257240204429302],
 [' ', 0.013628620102214651],
 ['u', 0.03747870528109029],
 ['?', 0.0068143100511073255],
 ['n', 0.0017035775127768314],
 [':', 0.005110732538330494],
 ['r', 0.059625212947189095],
 ['!', 0.0068143100511073255],
 ['.', 0.0068143100511073255]]

In [17]:
lm['Firs']

[['t', 1.0]]

Now let's write a function for generating random text!

In [18]:
from numpy.random import choice

def generate_letter(lm, history):
    history = history[-lm.order:]
    ### YOUR CODE HERE
    # Generate the next character according to the history.
    # Don't forget to use your probabilities!
    # Split the [char, probability] pairs into two parallel sequences,
    # then sample one character from that distribution.
    chars, probs = zip(*lm[history])
    return choice(chars, p=probs)
    ### END YOUR CODE
        
def generate_text(lm, n_letters=1000):
    history = '~' * lm.order
    out = []
    ### YOUR CODE HERE
    # Generate random text and stash it into out variable.
    for _ in range(n_letters):
        letter = generate_letter(lm, history)
        # Slide the history window forward by one character.
        history = history[1:] + letter
        out.append(letter)
    ### END YOUR CODE
    return ''.join(out)
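
The sampling step can also be tried in isolation on a toy distribution. The sketch below uses `random.choices` from the standard library instead of `numpy.random.choice`; `sample_char` and the toy distribution are hypothetical and only mirror the `[char, probability]` format that `lm[history]` returns.

```python
import random
from collections import Counter

def sample_char(distribution, rng=random):
    # distribution is a list of [char, probability] pairs.
    chars = [c for c, _ in distribution]
    probs = [p for _, p in distribution]
    # random.choices draws with replacement according to the given weights.
    return rng.choices(chars, weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
toy = [['a', 0.75], ['b', 0.25]]
samples = Counter(sample_char(toy, rng) for _ in range(10_000))
print(samples['a'] / 10_000)  # should land near 0.75
```

Over many draws the empirical frequencies approach the probabilities in the distribution, which is what makes the generated text follow the statistics of the corpus.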

In [19]:
with open('shakespeare_input.txt', 'r') as fin:
    lm = LanguageModel(fin.read())
    
print(generate_text(lm, 1000))

First? it not.

TITUS ANDRONICUS:
Shall I fear the night you that I countrymen by your kindness.

ROSENCRANTZ:
We are now! whose winds to my beares I have the France Bassio is to see hair: take his very weakes meet
And can ready; with thee, wherefore,
That's sick and worship in here, trust at Richarge.

HORATIAN:
Now farth?

APEMANTUS:
For spherd were you,
And your to our of dear me, Signior Hercules,
Fully villain;
Am not so bid and prite loved close ignoming hell
And she can made that othersed.
Stirring Talbot sound a many thin meets.
You hoistem to more cavilion miscarceness will I heaving to save and that way
We that is: finds tongue of this language?

EMILIA:
O, steward.

Second constructed a those
Whithee
that arranteth must neither, well.

KATHARINE:
Do famills:
How shalt study, and thing varleman: first in English to a limb stews, which evenied old my mother sword's her,
So soul! see were in arms,
With ready are both York. What, did lease.
Where as coward to it will country
The

In [20]:
with open('shakespeare_input.txt', 'r') as fin:
    lm = LanguageModel(fin.read(), 8)
    
print(generate_text(lm, 2000))

First Citizen:
Before them. Provide this man's invention and me! for even here, but it was
Jove's case. From a pound shall commands he?

PAGE:
Well, good night, when he's not for the
lanthorn doth thy
husband says, is muddied,
This would revive the lottery of you, and lie under the babbling goss and hear him, we respect, my gracious liege, mine honours me,
But he that gave Amamon the
basket again?

GHOST:
Thy evil spirits.
Never did deny nothing wicked sir, halfpenny farthingale? didst
not see me here,
I ask you.

MENENIUS:
A noble duke is a
prave man!

TALBOT:
What need she, but I do not quantity of certain; and I said 'Good morrows to make
What cannot overboard, not perceive you long sojourn till now ripe in fortune's annual feast him like,
To fall into a man. I speak no more sons to
younger brother.

CORIOLANUS:
I have wish'd thee.

Tailor:
But seeming
He acts thy wife. Youth in us
At whose nature shall be a wall to expect.

Lord:
Hence from Venice,
And to what end?
Who dare scarcel

In [21]:
with open('shakespeare_input.txt', 'r') as fin:
    lm = LanguageModel(fin.read(), 16)
    
print(generate_text(lm, 2000))

First Citizen:
Before the time be out? no more!

ARIEL:
I prithee,
Remember I have done thee stir
Afresh within me, and these thy sons,
Thy kinsman and thy friends, I'll have more lives
Than drops of blood were in my father's signet in my purse,
Which was the model of thy father's life.

MARCUS ANDRONICUS:
Long have I been forlorn, and all for thee:
Welcome, dread Fury, to my woful house:
Rapine and Murder there.

TAMORA:
These are my ministers, and come with me.

TITUS ANDRONICUS:
Patience, dear niece. Good Titus, dry thine eyes.

TITUS ANDRONICUS:
What means my niece Lavinia by these signs?

TITUS ANDRONICUS:
I know thou dost but sigh
That thou no more shalt see this knack, as never
I mean thou shalt, we'll bar thee from succession;
Not hold thee of our blood, no, not our kin,
Far than Deucalion off: mark thou my words:
Follow us to the court. Thou churl, for this time,
Though full of our displeasure pieced,
And nothing more, may fitly like your grace
To let my tongue excuse all. Wha