Simplified text processing for Python 2 and 3.
- Python >= 2.6 or >= 3.3
pip install -U textblob
curl https://raw.github.com/sloria/TextBlob/master/download_corpora.py | python
This will install textblob and download the necessary NLTK corpora.
NOTE: If you don't have pip (you should), run this first:
curl https://raw.github.com/pypa/pip/master/contrib/get-pip.py | python
Simple.
from text.blob import TextBlob
wikitext = '''
Python is a widely used general-purpose, high-level programming language.
Its design philosophy emphasizes code readability, and its syntax allows
programmers to express concepts in fewer lines of code than would be
possible in languages such as C.
'''
wiki = TextBlob(wikitext)
Part-of-speech tags and noun phrases are just properties.
wiki.pos_tags # [(Word('Python'), 'NNP'), (Word('is'), 'VBZ'),
# (Word('a'), 'DT'), (Word('widely'), 'RB')...]
wiki.noun_phrases # WordList(['python', 'design philosophy', 'code readability'])
Note: The first access to noun_phrases might take a few seconds because the noun phrase chunker needs to be trained. Subsequent calls to noun_phrases will be quick, however, since all TextBlobs share the same instance of a noun phrase chunker.
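The note above describes a lazily built, shared chunker. The following is only a sketch of that general pattern, not TextBlob's actual code: ToyChunker and ToyBlob are hypothetical names, and the "training" step is simulated by a counter.

```python
class ToyChunker(object):
    """Hypothetical stand-in for a chunker that is expensive to train."""
    times_trained = 0  # counts how often the costly setup ran

    def __init__(self):
        # Real training would happen here; this sketch only counts it.
        ToyChunker.times_trained += 1

    def extract(self, text):
        # Toy rule: every word longer than six characters is a "phrase".
        words = [w.strip(".,").lower() for w in text.split()]
        return [w for w in words if len(w) > 6]


class ToyBlob(object):
    """Hypothetical blob whose instances share one lazily built chunker."""
    _shared_chunker = None  # class-level cache shared by all instances

    def __init__(self, text):
        self.text = text

    @property
    def noun_phrases(self):
        if ToyBlob._shared_chunker is None:  # only the first access pays
            ToyBlob._shared_chunker = ToyChunker()
        return ToyBlob._shared_chunker.extract(self.text)


first = ToyBlob("Readability counts.").noun_phrases
second = ToyBlob("Simple is better than complex.").noun_phrases
print(ToyChunker.times_trained)  # the chunker was built only once
```

Because the cache lives on the class rather than on each instance, the second blob reuses the chunker created by the first.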
The sentiment property returns a tuple of the form (polarity, subjectivity), where polarity ranges from -1.0 to 1.0 and subjectivity ranges from 0.0 to 1.0.
testimonial = TextBlob("Textblob is amazingly simple to use. What great fun!")
testimonial.sentiment # (0.4583333333333333, 0.4357142857142857)
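Since the result is a plain tuple, it can be unpacked directly. The sketch below reuses the values from the example above instead of importing TextBlob; label_polarity and its threshold are illustrative assumptions, not part of the library.

```python
# Values copied from the example above; no TextBlob import is needed here.
sentiment = (0.4583333333333333, 0.4357142857142857)
polarity, subjectivity = sentiment


def label_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a coarse label.

    The threshold is an arbitrary illustrative choice.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


print(label_polarity(polarity))  # positive
```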
zen = TextBlob("Beautiful is better than ugly. "
"Explicit is better than implicit. "
"Simple is better than complex.")
zen.words # WordList(['Beautiful', 'is', 'better'...])
zen.sentences # [Sentence('Beautiful is better than ugly.'),
# Sentence('Explicit is better than implicit.'),
# ...]
for sentence in zen.sentences:
    print(sentence.sentiment)
Each word in TextBlob.words or Sentence.words is a Word object (a subclass of unicode) with useful methods, e.g. for word inflection.
sentence = TextBlob('Use 4 spaces per indentation level.')
sentence.words
# OUT: WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])
sentence.words[2].singularize()
# OUT: 'space'
sentence.words[-1].pluralize()
# OUT: 'levels'
wiki.word_counts['its'] # 2 (not case-sensitive by default)
wiki.words.count('its') # Same thing
wiki.words.count('its', case_sensitive=True) # 1
wiki.noun_phrases.count('code readability') # 1
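To make the case-sensitivity difference concrete, here is a standard-library sketch of the same two counts. The regular-expression tokenizer is a simplifying assumption and is not how TextBlob tokenizes.

```python
import re
from collections import Counter

text = """
Python is a widely used general-purpose, high-level programming language.
Its design philosophy emphasizes code readability, and its syntax allows
programmers to express concepts in fewer lines of code than would be
possible in languages such as C.
"""

# Simplified tokenizer (an assumption): runs of word characters or hyphens.
tokens = re.findall(r"[\w-]+", text)

case_insensitive = Counter(token.lower() for token in tokens)
case_sensitive = Counter(tokens)

print(case_insensitive["its"])  # 2
print(case_sensitive["its"])    # 1
```

Lowercasing before counting merges "Its" and "its", which is why the default count is 2 while the case-sensitive count is 1.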
zen[0:19] # TextBlob("Beautiful is better")
zen.upper() # TextBlob("BEAUTIFUL IS BETTER THAN UGLY...")
zen.find("Simple") # 65
apple_blob = TextBlob('apples')
banana_blob = TextBlob('bananas')
apple_blob < banana_blob # True
apple_blob + ' and ' + banana_blob # TextBlob('apples and bananas')
"{0} and {1}".format(apple_blob, banana_blob) # 'apples and bananas'
Use sentence.start and sentence.end to get the start and end indices of a sentence within the blob. This can be useful for sentence highlighting, for example.
for sentence in zen.sentences:
    print(sentence) # Beautiful is better than ugly
    # Explicit field numbers keep this working on Python 2.6
    print("---- Starts at index {0}, Ends at index {1}"
          .format(sentence.start, sentence.end)) # 0, 30
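As a sketch of the highlighting use case, the helper below wraps a slice of the original string in markers. It uses plain strings and the indices reported for the first sentence above; highlight and its marker parameter are hypothetical names, not part of TextBlob.

```python
text = ("Beautiful is better than ugly. "
        "Explicit is better than implicit. "
        "Simple is better than complex.")


def highlight(text, start, end, marker="**"):
    """Wrap text[start:end] in markers, leaving the rest untouched."""
    return text[:start] + marker + text[start:end] + marker + text[end:]


# Indices of the first sentence, as reported in the loop above.
start, end = 0, 30
print(highlight(text, start, end))
```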
zen.json # '[{"sentiment": [0.2166666666666667, 0.8333333333333334],
         #    "stripped": "beautiful is better than ugly",
         #    "noun_phrases": ["beautiful"],
         #    "raw": "Beautiful is better than ugly. ",
         #    "end_index": 30, "start_index": 0}
         #   ...]'
TextBlob currently has two noun phrase chunker implementations: text.np_extractors.FastNPExtractor (default, based on Shlomi Babluki's implementation from this blog post) and text.np_extractors.ConllExtractor, which uses the CoNLL 2000 corpus to train a tagger.
You can change the chunker implementation (or even use your own) by explicitly passing an instance of a noun phrase extractor to a TextBlob's constructor.
from text.blob import TextBlob
from text.np_extractors import ConllExtractor
extractor = ConllExtractor()
blob = TextBlob("Extract my noun phrases.", np_extractor=extractor)
blob.noun_phrases # This will use the Conll2000 noun phrase extractor
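For the "use your own" case, a custom extractor might look like the sketch below. It assumes that TextBlob only needs an object with an extract(text) method returning a list of strings; that interface is an assumption here, so check the base class in text.np_extractors before relying on it. CapitalizedExtractor is a hypothetical name and its rule is deliberately naive.

```python
import re


class CapitalizedExtractor(object):
    """Hypothetical extractor: treats runs of capitalized words as phrases.

    Assumption: the caller only invokes ``extract(text)`` and expects a
    list of strings back.
    """

    def extract(self, text):
        # One or more capitalized words separated by single spaces.
        pattern = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
        return [match.lower() for match in re.findall(pattern, text)]


extractor = CapitalizedExtractor()
print(extractor.extract("I saw Monty Python in New York."))
# ['monty python', 'new york']
```

If the assumed interface holds, an instance could be passed to the constructor in the same way as ConllExtractor, via the np_extractor argument.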
TextBlob currently has two POS tagger implementations, located in text.taggers. The default is the PatternTagger, which uses the same implementation as the excellent pattern library. The second implementation is NLTKTagger, which uses NLTK's TreeBank tagger. It requires numpy and only works on Python 2.
Similar to the noun phrase chunkers, you can explicitly specify which POS tagger to use by passing a tagger instance to the constructor.
from text.blob import TextBlob
from text.taggers import NLTKTagger
nltk_tagger = NLTKTagger()
blob = TextBlob("Tag! You're It!", pos_tagger=nltk_tagger)
blob.pos_tags
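Since pos_tags yields (word, tag) pairs, downstream code can work on those pairs without caring which tagger produced them. The sketch below uses sample pairs copied from the pos_tags output shown earlier, with plain strings standing in for Word objects.

```python
from collections import defaultdict

# Sample pairs copied from the earlier pos_tags output (plain strings
# stand in for Word objects).
pos_tags = [("Python", "NNP"), ("is", "VBZ"), ("a", "DT"), ("widely", "RB")]

# Group words under their tag.
words_by_tag = defaultdict(list)
for word, tag in pos_tags:
    words_by_tag[tag].append(word)

# Penn Treebank noun tags all start with "NN".
nouns = [word for word, tag in pos_tags if tag.startswith("NN")]
print(nouns)  # ['Python']
```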
Run python run_tests.py to run all tests.
TextBlob is licensed under the MIT license. See the bundled LICENSE file for more details.