Basic munroe engine:
- Takes raw text
- Tokenises it
- Stems the tokens using the Porter stemmer (removes 'ing', 's', ...)
- Counts how many stems match the (stemmed) 1000 most common words
- Returns the score as a float between 0 and 1 (the fraction of tokens that matched)

List of common words from http://www.ef.co.uk/english-resources/english-vocabulary/top-1000-words/
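
The steps listed above can be tried out without NLTK or the word-list file. The sketch below is only an illustration: `TOY_COMMON`, `toy_stem` and `toy_score` are hypothetical names, the five-word set stands in for the real 1000-word list, and the crude suffix stripper is NOT the Porter stemmer used in the notebook.

```python
import re

# Hypothetical stand-in for the common-word list (the real list has 1000 entries)
TOY_COMMON = {"the", "cat", "sit", "on", "mat"}

def toy_stem(word):
    """Crude suffix stripper for illustration only; this is not the Porter stemmer."""
    for suffix in ("ing", "s"):
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_score(text, common=TOY_COMMON):
    """Fraction of tokens whose stem appears in the stemmed common-word set."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]{2,}", text)]
    if not tokens:
        return 0.0
    stems = [toy_stem(t) for t in tokens]
    common_stems = {toy_stem(w) for w in common}
    hits = sum(1 for s in stems if s in common_stems)
    return hits / len(stems)

# 7 tokens, of which 6 match after stemming ("quietly" does not)
print(toy_score("The cats sit on the mat quietly"))
```

The structure mirrors the real engine: tokenise, stem both the text and the word list, then divide the number of matches by the number of tokens.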

In [7]:
import re
import nltk


            
def tokenise(text):
    '''
    Takes raw text and returns a list of tokens
    
    At present, tokens are:
    - 2+ characters long
    - Alphabetic (numbers are automatically excluded)
    '''
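    # Drop any non-ASCII characters (accented letters, curly quotes, ...)
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Keep runs of two or more letters; digits and punctuation act as separators
    tokens = re.findall(r"[A-Za-z]{2,}", text)
    return [t.lower() for t in tokens]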

def stem(tokens):
    '''
    Takes a list of tokens, applies a stemming algorithm (reduces words to a standardised form, e.g. by
    removing 'ing', 's', ...) and returns a list of stemmed words
    
    At present:
    - We use the Porter stemmer (the least aggressive option - alternatives are Snowball and Lancaster)
    '''
    stemmer = nltk.stem.PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

def get_common():
    '''
    Opens the text file containing the list of the 1000 most common words found at
    http://www.ef.co.uk/english-resources/english-vocabulary/top-1000-words/,
    strips the trailing newlines and returns the words as a list.
    '''
    with open('1000common.txt', 'r') as f:
        # One entry per line; surrounding whitespace is stripped and blank lines are skipped
        return [line.strip() for line in f if line.strip()]

def read_file(filename):
    '''
    Open a file, read text, return a string
    Line breaks are replaced with spaces so that words on adjacent lines are not merged
    '''
    with open(filename, 'r') as f:
        return ' '.join(line.rstrip('\n') for line in f)


def munroe_score(text, verbose=True):
    '''
    Takes raw text, tokenises and stems it, and compares the stems to the set of the stemmed 1000 most common words
    Returns the fraction of words that were in the list of common words, as a float between 0 and 1
    
    e.g. if output is 0.61, 61% of words were in the list of the 1000 most common. 
    '''
    tokens = tokenise(text)
    stems = stem(tokens)

    common = get_common()
    stemmed_common = set(stem(common))

    # Count the stems that appear in the stemmed common-word set
    munroe = sum(1 for s in stems if s in stemmed_common)
    # Guard against input with no tokens, which would otherwise divide by zero
    score = munroe/len(stems) if stems else 0.0

    if verbose:
        print('You have '+ str(len(stems)) + ' words in your document')
        print('Of these, '+str(munroe)+' are in the most common 1000 words!')
        print('Score: '+str(score)+'%')
    return score

In [8]:
text= """The bioethics of human embryonic stem cell research (hESR) is controversial, including in Asia. After the 2001 US-moratorium on the federal funding of hESR, some Asian countries jumped into the 'bioethical vacuum', claiming that Asian countries do not suffer from Western religious scruples about using human embryos in research. Nevertheless, controversies around the donation of oocytes, the trade and barter of embryos, stem cell research trials, and human embryonic cloning in Asia have attracted global media attention. International guidelines are being adopted into diverging economic, political and socio-cultural contexts in Asia.


This comparative research asks on what basis these guidelines are adopted in a socialist developing country such as China (PRC) and in a wealthy, democratic bureaucracy such as Japan. It investigates the formulation and implementation of regulations by visiting laboratories and clinics, interviewing donors of embryos and oocytes, observing scientists that handle the ‘materials’ and analysing public debates. Studying how bioethics guidelines created by governments, medical associations and private companies impact research and international research cooperation, the research expects to provide insights into how scientists, publics and governments deal with regulatory and bioethical problems in very different economic, political and cultural contexts."""
print(text)


In [9]:
munroe_score(text, verbose=True)

You have 192 words in your document
Of these, 127 are in the most common 1000 words!
Score: 0.6614583333333334%


0.6614583333333334
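
The returned value is a fraction between 0 and 1, so the 0.66... above corresponds to roughly 66% of the words, despite the '%' sign in the printed line. A minimal sketch of the conversion, using the counts from the output above; `as_percent` is a hypothetical helper, not part of the notebook:

```python
def as_percent(fraction, digits=1):
    """Format a fraction in [0, 1] as a percentage string (hypothetical helper)."""
    return f"{fraction * 100:.{digits}f}%"

matched, total = 127, 192  # counts reported in the output above
print(as_percent(matched / total))  # prints 66.1%
```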

In [11]:
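import os

test_dir = '../Tests'
for test in os.listdir(test_dir):
    print(test)
    # os.listdir returns bare file names, so join each with its directory before opening
    test_text = read_file(os.path.join(test_dir, test))
    munroe_score(test_text, verbose=True)
    print('-'*50)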
    