Text generation: automatic paper writing. We train a language model on papers from [ArXiv](http://arxiv.org/) and use it to generate a new paper.

[Corpus location](https://www.kaggle.com/neelshah18/arxivdataset/)

## Data description
#### Context
Collection of 31000+ paper meta data.
#### Content
This data contains all paper related to ML, CL, NER, AI and CV field publish between 1992 to 2018-Feb.
#### Acknowledgements
arXiv is open source library for research papers. Thanks to arXiv for spreading knowledge.
#### Inspiration
To know what research is going on in the computer science all around the world.

Looking at the dataset, we only need the `title` and `summary` fields.
<img src="./picture1.png">

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### Step 1: Load the text and build the corpus

In [2]:
data = pd.read_json("./arxivData.json")
data.head()

Unnamed: 0,author,day,id,link,month,summary,tag,title,year
0,"[{'name': 'Ahmed Osman'}, {'name': 'Wojciech S...",1,1802.00209v1,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",2,We propose an architecture for VQA which utili...,"[{'term': 'cs.AI', 'scheme': 'http://arxiv.org...",Dual Recurrent Attention Units for Visual Ques...,2018
1,"[{'name': 'Ji Young Lee'}, {'name': 'Franck De...",12,1603.03827v1,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",3,Recent approaches based on artificial neural n...,"[{'term': 'cs.CL', 'scheme': 'http://arxiv.org...",Sequential Short-Text Classification with Recu...,2016
2,"[{'name': 'Iulian Vlad Serban'}, {'name': 'Tim...",2,1606.00776v2,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",6,We introduce the multiresolution recurrent neu...,"[{'term': 'cs.CL', 'scheme': 'http://arxiv.org...",Multiresolution Recurrent Neural Networks: An ...,2016
3,"[{'name': 'Sebastian Ruder'}, {'name': 'Joachi...",23,1705.08142v2,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",5,Multi-task learning is motivated by the observ...,"[{'term': 'stat.ML', 'scheme': 'http://arxiv.o...",Learning what to share between loosely related...,2017
4,"[{'name': 'Iulian V. Serban'}, {'name': 'Chinn...",7,1709.02349v2,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",9,We present MILABOT: a deep reinforcement learn...,"[{'term': 'cs.CL', 'scheme': 'http://arxiv.org...",A Deep Reinforcement Learning Chatbot,2017


In [11]:
lines = data.apply(lambda row: row['title'] + ' ; ' + row['summary'], axis=1).tolist()

sorted(lines, key=len)[:3]

['Differential Contrastive Divergence ; This paper has been retracted.',
 'What Does Artificial Life Tell Us About Death? ; Short philosophical essay',
 'P=NP ; We claim to resolve the P=?NP problem via a formal argument for P=NP.']

In [12]:
print(len(lines))

41000


### Step 2: Tokenization

In [27]:
from nltk import WordPunctTokenizer
wpt = WordPunctTokenizer()
process = lambda text: ' '.join(wpt.tokenize(text.lower()))
lines = list(map(process,lines))

In [30]:
sorted(lines, key=len)[:3]

['differential contrastive divergence ; this paper has been retracted .',
 'what does artificial life tell us about death ? ; short philosophical essay',
 'p = np ; we claim to resolve the p =? np problem via a formal argument for p = np .']
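
The same splitting can be approximated with the standard library alone. The sketch below assumes that `WordPunctTokenizer` splits on the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation). `TOKEN_PATTERN` and `simple_process` are hypothetical names introduced for illustration, not part of the notebook.

```python
import re

# Assumed equivalent of WordPunctTokenizer: runs of word characters,
# or runs of non-space punctuation characters.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

def simple_process(text):
    """Lowercase the text and join its tokens with single spaces."""
    return ' '.join(TOKEN_PATTERN.findall(text.lower()))

example = simple_process('P=NP ; We claim to resolve the P=?NP problem.')
print(example)
```

Note that adjacent punctuation such as `=?` stays together as one token, which matches the third line of the output above.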

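### Defining the n-gram function
For an explanation of n-grams, this article is a good reference: https://blog.csdn.net/ahmanz/article/details/51273500  
The n-gram function counts how many times each word follows each (n-1)-word prefix.  
Two points need attention:  
- If the (n-1)-word prefix reaches past the start of the sentence, pad it with UNK. For example, with n=3:  
empty prefix: "" -> (UNK, UNK)  
short prefix: "the" -> (UNK, the)  
long prefix: "the new approach" -> (new, approach)  
- Append an EOS marker to the end of each sentence to signal that the sentence is over:  
"... with deep neural networks ." -> (..., with, deep, neural, networks, ., EOS)  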

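The padding rule can be sketched in isolation. `pad_prefix` is a hypothetical helper written only for this illustration; the notebook's own `count_ngrams` builds the same prefixes inline.

```python
UNK, EOS = "_UNK_", "_EOS_"

def pad_prefix(tokens, n):
    """Return the (n-1)-token context for the next word, left-padded with UNK."""
    # Keep at most the last n-1 tokens (none at all for a unigram model)
    context = tokens[-(n - 1):] if n > 1 else []
    return tuple([UNK] * (n - 1 - len(context)) + context)

print(pad_prefix([], 3))
print(pad_prefix(['the'], 3))
print(pad_prefix(['the', 'new', 'approach'], 3))
```

The three calls correspond to the empty, short and long prefix cases listed above.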
In [31]:
from collections import defaultdict, Counter


UNK, EOS = "_UNK_", "_EOS_"

def count_ngrams(lines, n):
    '''
    Count how many times each word follows each (n-1)-word prefix.
    :param lines: list of sentences whose tokens are separated by spaces
    :returns: { tuple(previous n-1 words): {next_word_1: count_1, next_word_2: count_2} }
    '''
    counts = defaultdict(Counter)
    # counts[(word1, word2)][word3] = how many times word3 occurred after (word1, word2)

    for line in lines:
        line_list = line.split()
        len_line = len(line_list)
        for i,word in enumerate(line_list):
            key =[]
            for j in range(1,n):
                if i-j >= 0:
                    key.append(line_list[i-j])
                else:
                    key.append(UNK)
            key = tuple(key[::-1])
            counts[key][word]+=1
            if i == len_line-1:
                key2= list(key[1:])
                key2.append(word)
                key2=tuple(key2)
                counts[key2][EOS]+=1
    return counts

In [32]:
dummy_lines = sorted(lines, key=len)[:100]
dummy_counts = count_ngrams(dummy_lines, n=3)
print(dummy_lines[2])
print(dummy_counts[('_UNK_', 'a')]['note'])

p = np ; we claim to resolve the p =? np problem via a formal argument for p = np .
3


In [33]:
print(dummy_counts[('_UNK_', '_UNK_')])

Counter({'a': 13, 'the': 3, 'on': 3, 'using': 2, 'learning': 2, 'automatic': 2, 'why': 2, 'proceedings': 2, 'piecewise': 2, 'differential': 1, 'what': 1, 'p': 1, 'computational': 1, 'weak': 1, 'creating': 1, 'defeasible': 1, 'essence': 1, 'deep': 1, 'statistical': 1, 'complex': 1, 'serious': 1, 'preprocessing': 1, 'liquid': 1, 'mining': 1, 'towards': 1, 'icon': 1, 'recognition': 1, 'glottochronologic': 1, 'utility': 1, 'temporized': 1, 'backpropagation': 1, 'random': 1, 'network': 1, 'glottochronology': 1, 'time': 1, 'convolutional': 1, 'fitness': 1, 'flip': 1, 'autonomous': 1, 'activitynet': 1, 'decision': 1, 'text': 1, 'discrimination': 1, 'are': 1, 'extraction': 1, 'comments': 1, 'resource': 1, 'advances': 1, 'exploration': 1, 'quantified': 1, 'in': 1, 'introduction': 1, 'beyond': 1, 'norm': 1, 'about': 1, 'unary': 1, 'some': 1, 'convex': 1, 'neurocontrol': 1, 'philosophy': 1, 'parallels': 1, 'an': 1, 'calculate': 1, 'group': 1, 'entropy': 1, 'word': 1, 'guarded': 1, 'cornell': 1, '

The probability that the preceding (n-1) words (the prefix) are followed by w_t is:

$$ P(w_t | prefix) = { Count(prefix, w_t) \over \sum_{\hat w} Count(prefix, \hat w) } $$
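
As a small worked example of this formula, the sketch below normalizes the counts observed after a single prefix. The numbers in `counts_after_prefix` are made up for illustration and are not taken from the corpus.

```python
from collections import Counter

# Toy counts of the words seen after one fixed prefix (illustrative numbers only)
counts_after_prefix = Counter({'networks': 6, 'models': 3, 'data': 1})

# Denominator of the formula: total count of all words seen after the prefix
total = sum(counts_after_prefix.values())

# Numerator / denominator for every observed word
probs = {word: count / total for word, count in counts_after_prefix.items()}
print(probs)
```

Any word that never followed the prefix gets no entry at all, which is the same as probability 0.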

In [34]:
class NGramLanguageModel:    
    def __init__(self, lines, n):
        '''
        Train a count-based language model:
        estimate the n-gram probabilities P(w_t | prefix).
        
        :param n: order of the n-gram
        :param lines: list of sentences whose tokens are separated by spaces
        '''
        assert n >= 1
        self.n = n
    
        counts = count_ngrams(lines, self.n)
        
        # probs[(word1, word2)][word3] = P(word3 | word1, word2)
        self.probs = defaultdict(Counter)
        
        # Turn the counts into probabilities, implementing the formula from the previous cell
        for prefix, counts_pre in counts.items():
            sum_prefix = sum(counts_pre.values())
            for word, num in counts_pre.items():
                self.probs[prefix][word] = num / sum_prefix
            
    def get_possible_next_tokens(self, prefix):
        '''
        :returns: dictionary of all possible next tokens: {token: its probability}
        '''
        prefix = prefix.split()
        prefix = prefix[max(0, len(prefix) - self.n + 1):]
        prefix = [ UNK ] * (self.n - 1 - len(prefix)) + prefix
        return self.probs[tuple(prefix)]
    
    def get_next_token_prob(self, prefix, next_token):
        '''
        Probability of the given token after the given prefix.
        :returns: P(next_token|prefix), 0 <= P <= 1
        '''
        return self.get_possible_next_tokens(prefix).get(next_token, 0)

In [35]:
dummy_lm = NGramLanguageModel(dummy_lines, n=3)

p_initial = dummy_lm.get_possible_next_tokens('') # '' -> ['_UNK_', '_UNK_']

print(p_initial['learning'])
print(p_initial['a'])

0.02
0.13


In [37]:
p_a = dummy_lm.get_possible_next_tokens('a') # 'a' -> ['_UNK_', 'a']
print(p_a['note'])
print(p_a.get('the', 0))

0.23076923076923078
0


Steps for generating a sentence:
```
X = []
w_next = None
while w_next != EOS:
    w_next = sample from P(w_next | prefix)
    prefix = prefix + w_next
    X.append(w_next)
```

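To control how strongly sampling favors the words that are most frequent in the corpus, we introduce a temperature: each probability is raised to the power 1/temperature before sampling, so a temperature below 1 makes frequent words even more likely and a temperature above 1 flattens the distribution. This blog post analyzes it in detail:
https://www.jianshu.com/p/e054cd99089e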
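
To make the effect concrete, here is a minimal sketch that rescales a toy distribution and renormalizes it. `apply_temperature` and the `toy` distribution are hypothetical and used only for illustration; the notebook's sampler works on unnormalized weights, which yields the same sampling probabilities.

```python
def apply_temperature(probs, temperature):
    """Raise each probability to 1/temperature and renormalize to sum to 1."""
    scaled = {word: p ** (1.0 / temperature) for word, p in probs.items()}
    total = sum(scaled.values())
    return {word: value / total for word, value in scaled.items()}

toy = {'been': 0.9, 'not': 0.1}        # illustrative distribution
sharp = apply_temperature(toy, 0.5)    # temperature < 1 sharpens the distribution
flat = apply_temperature(toy, 2.0)     # temperature > 1 flattens it
print(sharp['been'], flat['been'])
```

With a temperature of 0.5 the dominant word takes almost all of the probability mass, while a temperature of 2.0 gives the rare word a noticeably larger share.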

In [47]:
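import random

def get_prob_token(probs, temperature):
    '''Sample a token from probs after rescaling each probability to p ** (1 / temperature).'''
    # Clamp the temperature so that 1 / temperature stays finite
    temperature = max(temperature, 0.01)
    weights = {word: prob ** (1 / temperature) for word, prob in probs.items()}
    # Draw a point in [0, sum of weights] and walk the tokens until it is covered
    n = random.uniform(0, sum(weights.values()))
    word = ''
    for word, weight in weights.items():
        if n < weight:
            break
        n -= weight
    return word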

In [48]:
def get_next_token(lm, prefix, temperature=1.0):
    # sample the next token from P(token | prefix), rescaled by temperature
    token = get_prob_token(lm.get_possible_next_tokens(prefix),temperature)
    return token

In [41]:
lm = NGramLanguageModel(lines, n=3)

In [51]:
test_freqs = Counter([get_next_token(lm, 'there have') for _ in range(10000)])
print(test_freqs)

Counter({'been': 9042, 'not': 384, 'also': 179, 'only': 159, 'occurred': 83, 'lately': 79, 'very': 74})


In [52]:
test_freqs = Counter([get_next_token(lm, 'there have',temperature=0.5) for _ in range(10000)])
print(test_freqs)

Counter({'been': 9968, 'not': 25, 'also': 3, 'only': 2, 'lately': 2})


Now we can use our model to generate a paper.

In [53]:
prefix = 'hello' 

for i in range(100):
    prefix += ' ' + get_next_token(lm, prefix)
    if prefix.endswith(EOS) or len(lm.get_possible_next_tokens(prefix)) == 0:
        break
        
print(prefix)

hello edge : learning canonical appearance transformations applied to genet . finally , the correlations in texts accompanied by rich semantic model to focus on one important problem in visual dialog scenario . _EOS_


In [56]:
prefix = 'bridging the' 

for i in range(100):
    prefix += ' ' + get_next_token(lm, prefix, temperature=0.5)
    if prefix.endswith(EOS) or len(lm.get_possible_next_tokens(prefix)) == 0:
        break
        
print(prefix)

bridging the gap between the two - stage approach of first - person tracking in real - world applications , such as the number of parameters in the total number of clusters is proposed . a comprehensive study on the other hand , the proposed method is proposed for the first time , we propose a new method for using dynamic programming algorithms under consideration for acceptance in upcoming conferences ; the use of the data . the results of a deep learning methods and show that the proposed method is more robust to the best performance . _EOS_


Computing perplexity. For an introduction, see https://www.itread01.com/content/1542392710.html
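
As a quick intuition check, perplexity is the exponential of the negative mean log-probability that the model assigns to the observed tokens, so a model that spreads its probability uniformly over V words has perplexity V. The sketch below uses a hypothetical helper, `perplexity_from_probs`, that takes the per-token probabilities directly instead of a language model.

```python
import math

def perplexity_from_probs(token_probs):
    """exp of the negative mean log-probability of the observed tokens."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))

# Every token predicted with probability 1/50: perplexity is about 50
uniform = perplexity_from_probs([1 / 50] * 8)
# Every token predicted with probability 0.9: perplexity is about 1.11
confident = perplexity_from_probs([0.9] * 8)
print(uniform, confident)
```

Lower values mean the model is less "surprised" by the text, and a single token with a probability near 0 is enough to inflate the result dramatically.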

In [57]:
def perplexity(lm, lines, min_logprob=np.log(10 ** -50.)):
    '''
    Perplexity of lm on lines: exp(-(sum of log P(token | prefix)) / N),
    where N counts every token plus one EOS per line.
    Log-probabilities are clipped at min_logprob so that zero-probability tokens stay finite.
    '''
    log_likelihood = 0
    count_words = 0
    for line in lines:
        # Score every token of the sentence, then the end-of-sentence marker
        tokens = line.split() + [EOS]
        prefix = [UNK] * (lm.n - 1)
        for token in tokens:
            prob = lm.get_next_token_prob(' '.join(prefix), token)
            if prob > 0:
                log_likelihood += max(np.log(prob), min_logprob)
            else:
                log_likelihood += min_logprob
            # Slide the context window forward by one token
            prefix = (prefix + [token])[1:]
        count_words += len(tokens)
    return np.exp(-log_likelihood / count_words)

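Running the code below shows that the N-gram model assigns probability 0 to any n-gram it has not seen, which makes the perplexity enormous. In practice the corpus may simply be too small, and an n-gram that is missing from it may well appear later, so we need a smoothing technique that gives unseen n-grams a probability greater than 0. Specific smoothing methods are described here: https://blog.csdn.net/qjf42/article/details/79761786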

In [62]:
from sklearn.model_selection import train_test_split
train_lines, test_lines = train_test_split(lines, test_size=0.25, random_state=42)

for n in (1, 2, 3):
    lm = NGramLanguageModel(n=n, lines=train_lines)
    ppx = perplexity(lm, test_lines)
    print("N = %i, Perplexity = %.5f" % (n, ppx))

N = 1, Perplexity = 3246.38501
N = 2, Perplexity = 85653987.28774
N = 3, Perplexity = 61999196259043346743296.00000


Laplace smoothing
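
As a reading aid, the add-delta estimate that the class below appears to implement can be written in the same notation as the unsmoothed formula. Here $\delta$ stands for the `delta` parameter and $|V|$ for the vocabulary size `len(self.vocab)`; these symbols are introduced for this sketch.

$$ P_\delta(w_t | prefix) = { Count(prefix, w_t) + \delta \over \sum_{\hat w} Count(prefix, \hat w) + \delta \cdot |V| } $$

A word never seen after the prefix has $Count(prefix, w_t) = 0$ and so receives $\delta$ divided by the same denominator. For a prefix that was never seen at all, every count is 0 and the estimate reduces to the uniform value $1 / |V|$.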

In [63]:
class LaplaceLanguageModel(NGramLanguageModel): 
    def __init__(self, lines, n, delta=1.0):
        self.n = n
        counts = count_ngrams(lines, self.n)
        self.vocab = set(token for token_counts in counts.values() for token in token_counts)
        self.probs = defaultdict(Counter)

        for prefix in counts:
            token_counts = counts[prefix]
            total_count = sum(token_counts.values()) + delta * len(self.vocab)
            self.probs[prefix] = {token: (token_counts[token] + delta) / total_count
                                          for token in token_counts}
    def get_possible_next_tokens(self, prefix):
        token_probs = super().get_possible_next_tokens(prefix)
        missing_prob_total = 1.0 - sum(token_probs.values())
        missing_prob = missing_prob_total / max(1, len(self.vocab) - len(token_probs))
        return {token: token_probs.get(token, missing_prob) for token in self.vocab}
    
    def get_next_token_prob(self, prefix, next_token):
        token_probs = super().get_possible_next_tokens(prefix)
        if next_token in token_probs:
            return token_probs[next_token]
        else:
            missing_prob_total = 1.0 - sum(token_probs.values())
            missing_prob_total = max(0, missing_prob_total) # prevent rounding errors
            return missing_prob_total / max(1, len(self.vocab) - len(token_probs))

In [64]:
for n in (1, 2, 3):
    lm = LaplaceLanguageModel(train_lines, n=n, delta=0.1)
    ppx = perplexity(lm, test_lines)
    print("N = %i, Perplexity = %.5f" % (n, ppx))

N = 1, Perplexity = 966.12894
N = 2, Perplexity = 470.48021
N = 3, Perplexity = 3679.44765
