### Working with Text

In [1]:
text1 = "If you've ever spent hours renaming files or updating hundreds of spreadsheet cells, you know how tedious tasks like these can be. But what if you could have your computer do them for you?"

In [2]:
len(text1)

188

In [4]:
text2 = text1.split() # Return a list of the words in text1, splitting on whitespace.
len(text2)

34
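
Calling `split()` with no argument and calling `split(' ')` are not equivalent, which matters once text contains tabs, newlines, or repeated spaces. A small sketch of the difference, using a made-up `sample` string that is not part of the original notebook:

```python
# `sample` is a hypothetical string chosen to contain a double space and a tab.
sample = "two  spaces\tand a tab"

# No argument: split on any run of whitespace and drop empty strings.
by_whitespace = sample.split()

# Explicit ' ': split on every single space, so a double space yields ''
# and the tab stays inside its token.
by_space = sample.split(' ')

print(by_whitespace)  # ['two', 'spaces', 'and', 'a', 'tab']
print(by_space)       # ['two', '', 'spaces\tand', 'a', 'tab']
```

For text1 both calls give the same 34 words, because its words are separated by single spaces only.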

A list comprehension lets us pick out specific words:

In [10]:
print([w for w in text2 if len(w) > 3]) # Words in text2 longer than 3 characters (attached punctuation counts)

["you've", 'ever', 'spent', 'hours', 'renaming', 'files', 'updating', 'hundreds', 'spreadsheet', 'cells,', 'know', 'tedious', 'tasks', 'like', 'these', 'what', 'could', 'have', 'your', 'computer', 'them', 'you?']


In [6]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

['If', 'But']

In [7]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

['hours', 'files', 'hundreds', 'tedious', 'tasks']

We can find unique words using set().

In [9]:
print(set(text2))

{'of', 'be.', 'know', 'what', 'your', 'tasks', 'files', 'if', 'But', 'for', 'spreadsheet', 'updating', 'these', 'have', 'hours', 'or', 'them', 'hundreds', 'spent', 'ever', 'tedious', 'can', 'like', "you've", 'If', 'you', 'you?', 'could', 'how', 'computer', 'renaming', 'cells,', 'do'}


In [12]:
[w.upper() for w in text2] # Convert every word in text2 to uppercase

['IF',
 "YOU'VE",
 'EVER',
 'SPENT',
 'HOURS',
 'RENAMING',
 'FILES',
 'OR',
 'UPDATING',
 'HUNDREDS',
 'OF',
 'SPREADSHEET',
 'CELLS,',
 'YOU',
 'KNOW',
 'HOW',
 'TEDIOUS',
 'TASKS',
 'LIKE',
 'THESE',
 'CAN',
 'BE.',
 'BUT',
 'WHAT',
 'IF',
 'YOU',
 'COULD',
 'HAVE',
 'YOUR',
 'COMPUTER',
 'DO',
 'THEM',
 'FOR',
 'YOU?']

In [14]:
set([w.upper() for w in text2])

{'BE.',
 'BUT',
 'CAN',
 'CELLS,',
 'COMPUTER',
 'COULD',
 'DO',
 'EVER',
 'FILES',
 'FOR',
 'HAVE',
 'HOURS',
 'HOW',
 'HUNDREDS',
 'IF',
 'KNOW',
 'LIKE',
 'OF',
 'OR',
 'RENAMING',
 'SPENT',
 'SPREADSHEET',
 'TASKS',
 'TEDIOUS',
 'THEM',
 'THESE',
 'UPDATING',
 'WHAT',
 'YOU',
 "YOU'VE",
 'YOU?',
 'YOUR'}

In [17]:
len(set([w.upper() for w in text2]))

32
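
Normalizing case merges 'If' and 'if', but tokens such as 'YOU' and 'YOU?' still count as different words because the punctuation stays attached. One possible refinement, sketched here as an assumption rather than as part of the original notebook, is to trim punctuation from both ends of each word with `str.strip` and `string.punctuation`. The names `normalized` and `unique_words` are illustrative.

```python
import string

text1 = ("If you've ever spent hours renaming files or updating hundreds of "
         "spreadsheet cells, you know how tedious tasks like these can be. "
         "But what if you could have your computer do them for you?")

# Lowercase each word and trim punctuation from its ends only, so the
# apostrophe inside "you've" is kept while "you?" becomes "you".
normalized = [w.lower().strip(string.punctuation) for w in text1.split()]

unique_words = set(normalized)
print(len(unique_words))  # 31
```

The count drops from 32 to 31 because 'you?' now collapses into 'you'.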

### Processing free-text

In [18]:
text3 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text4 = text3.split(' ')

text4

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

Finding hashtags:

In [19]:
[w for w in text4 if w.startswith('#')]

['#UNSG']

Finding callouts:

In [20]:
[w for w in text4 if w.startswith('@')]

['@']
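
The bare '@' is returned because `startswith('@')` only checks the first character. One simple way to skip it is to also require at least one more character. A minimal sketch on a hypothetical token list (`tokens` and its '@UN' entry are made up for illustration):

```python
# Hypothetical tokens: a hashtag, a bare '@', a plain word, and a real callout.
tokens = ['#UNSG', '@', 'NY', '@UN']

# Require '@' plus at least one following character.
callouts = [w for w in tokens if w.startswith('@') and len(w) > 1]
print(callouts)  # ['@UN']
```

This check says nothing about which characters follow the '@', so it is only a rough filter.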

In [21]:
text5 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')

We can use regular expressions to help us with more complex parsing.

For example, the pattern '@[A-Za-z0-9_]+' matches an '@' followed by one or more of the following characters:

- an uppercase letter ('A-Z')
- a lowercase letter ('a-z')
- a digit ('0-9')
- an underscore ('_')

In [22]:
import re # re is the standard-library module for regular expressions

# Keep every word in text6 that contains a match for the pattern
[w for w in text6 if re.search(r'@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']
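
Two variations on the cell above may be worth knowing; both are sketches, not the notebook's own method. First, `re.findall` can pull the matches straight out of the full string, with no need to split it first. Second, `re.search` finds the pattern anywhere in a word, whereas `re.match` only accepts a match at the start. The `tokens` list below, including 'mail@example', is hypothetical.

```python
import re

text5 = ('@UN @UN_Women "Ethics are built right into the ideals and '
         'objectives of the United Nations" '
         '#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# Extract callouts and hashtags directly from the unsplit string.
mentions = re.findall(r'@[A-Za-z0-9_]+', text5)
hashtags = re.findall(r'#[A-Za-z0-9_]+', text5)
print(mentions)  # ['@UN', '@UN_Women']
print(hashtags)  # ['#UNSG']

# Hypothetical tokens showing how search and match differ.
tokens = ['@UN', 'mail@example', '@']
found_anywhere = [w for w in tokens if re.search(r'@[A-Za-z0-9_]+', w)]
found_at_start = [w for w in tokens if re.match(r'@[A-Za-z0-9_]+', w)]
print(found_anywhere)  # ['@UN', 'mail@example']
print(found_at_start)  # ['@UN']
```

If only words that begin with '@' should count as callouts, `re.match` is the closer fit.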