---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Working With Text

In [0]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1) # The length of text1

76

In [0]:
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' '.

len(text2)

14

In [0]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']
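
The empty string at the end of the list comes from the trailing space in `text1`: `split(' ')` cuts at every single space, so the space after "Nations" produces one more (empty) item. As a minimal sketch, calling `split()` with no argument splits on runs of whitespace instead and drops empty strings. The names `words_single_space` and `words_whitespace` are illustrative only.

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

# Splitting on a single space keeps the empty string after the trailing space.
words_single_space = text1.split(' ')

# Splitting with no argument treats any run of whitespace as one separator.
words_whitespace = text1.split()

print(len(words_single_space))  # 14
print(len(words_whitespace))    # 13
print(words_whitespace[-1])     # Nations
```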

<br>
List comprehension allows us to find specific words:

In [0]:
[w for w in text2 if len(w) > 3] # Words in text2 that are more than 3 letters long

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

In [0]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

['Ethics', 'United', 'Nations']

In [0]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

['Ethics', 'ideals', 'objectives', 'Nations']

<br>
We can find unique words using `set()`.

In [0]:
text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)

6

In [0]:
len(set(text4))

5

In [0]:
set(text4)

{'To', 'be', 'not', 'or', 'to'}

In [0]:
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.

4

In [0]:
set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}
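
The same result can be written with a set comprehension, which avoids building the intermediate list. Sets are unordered, so this sketch sorts the result before printing to get a stable display. The name `unique_lower` is illustrative only.

```python
text3 = 'To be or not to be'

# Build the set of lowercased words directly with a set comprehension.
unique_lower = {w.lower() for w in text3.split(' ')}

print(len(unique_lower))     # 4
print(sorted(unique_lower))  # ['be', 'not', 'or', 'to']
```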

### Processing free-text

In [0]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')

text6

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']

<br>
Finding hashtags:

In [0]:
[w for w in text6 if w.startswith('#')]

['#UNSG']

<br>
Finding callouts:

In [0]:
[w for w in text6 if w.startswith('@')]

['@']

In [0]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

<br>

We can use regular expressions to help us with more complex parsing. 

For example, used with `re.search`, `'@[A-Za-z0-9_]+'` will return all words that contain `'@'` followed by at least one character from this set:
* capital letters (`'A-Z'`)
* lowercase letters (`'a-z'`)
* digits (`'0-9'`)
* the underscore (`'_'`)

In [0]:
import re # import re - a module that provides support for regular expressions

[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']
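
As an alternative sketch, `re.findall` can be applied to the whole string rather than to each word after splitting. It returns every non-overlapping match, so the lone `'@'` (which is followed by a space) is skipped. The hashtag pattern and the names `callouts` and `hashtags` are assumptions added for illustration; real social-media handles may follow different rules.

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and '
         'objectives of the United Nations" '
         '#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# Search the raw string directly; no split() is needed.
callouts = re.findall(r'@[A-Za-z0-9_]+', text7)
hashtags = re.findall(r'#[A-Za-z0-9_]+', text7)

print(callouts)  # ['@UN', '@UN_Women']
print(hashtags)  # ['#UNSG']
```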

In [0]:
# Finding specific characters
text12 = 'ouagadougou'
re.findall(r'[aeiou]', text12)
# ['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u']
re.findall(r'[^aeiou]', text12)
# ['g', 'd', 'g']

['g', 'd', 'g']

In [0]:
# Cleaning Text
# text8 starts with two spaces, a tab and a space, and ends with a space
text8 = '  \t A quick brown fox jumped over the lazy dog. '

text8.split(' ')
# ['', '', '\t', 'A', 'quick', 'brown', 'fox', 'jumped', 'over',
# 'the', 'lazy', 'dog.', '']
text9 = text8.strip() # Remove leading and trailing whitespace
text9.split(' ')
# ['A', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy',
# 'dog.']

['A', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']
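
`strip()` has two one-sided relatives, `lstrip()` and `rstrip()`, and all three accept an optional string of characters to remove instead of whitespace. A short sketch follows; the sample string `raw` and the other variable names are illustrative only.

```python
raw = '  \t A quick brown fox jumped over the lazy dog. '

# Remove whitespace (spaces, tabs, newlines) from one side only.
left_cleaned = raw.lstrip()
right_cleaned = raw.rstrip()

# Passing characters strips those characters instead of whitespace.
no_period = 'dog.'.rstrip('.')

print(repr(left_cleaned))   # 'A quick brown fox jumped over the lazy dog. '
print(repr(right_cleaned))  # '  \t A quick brown fox jumped over the lazy dog.'
print(no_period)            # dog
```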

In [0]:
# Changing Text
# • Find and replace
print(text9)
# 'A quick brown fox jumped over the lazy dog.'
print(text9.find('o'))
# 10
print(text9.rfind('o'))
# 40
print(text9.replace('o', 'O'))
# 'A quick brOwn fOx jumped Over the lazy dOg.'

A quick brown fox jumped over the lazy dog.
10
40
A quick brOwn fOx jumped Over the lazy dOg.
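
`find()` and `rfind()` report only the first and last position, and both return `-1` when the substring is absent. A minimal sketch of two related tasks: listing every position of a character, and replacing only the first occurrence by using the optional count argument of `replace()`. The names `positions` and `first_only` are illustrative only.

```python
text9 = 'A quick brown fox jumped over the lazy dog.'

# Collect the index of every 'o' in the string.
positions = [i for i, ch in enumerate(text9) if ch == 'o']
print(positions)  # [10, 15, 25, 40]

# find() returns -1 when nothing matches (the search is case-sensitive).
print(text9.find('Q'))  # -1

# The third argument of replace() limits the number of replacements.
first_only = text9.replace('o', 'O', 1)
print(first_only)  # A quick brOwn fox jumped over the lazy dog.
```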


In [0]:
# Processing Free-text
text10 = '"Ethics are built right into the ideals and objectives of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women'
text11 = text10.split(' ')
text11
# ['"Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals',
# 'and', 'objectives', 'of', 'the', 'United', 'Nations"', '#UNSG',
# '@', 'NY', 'Society', 'for', 'Ethical', 'Culture',
# 'bit.ly/2guVelr', '@UN', '@UN_Women']
# • How do you find all Hashtags? Callouts?

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr',
 '@UN',
 '@UN_Women']

In [0]:
# Finding Specific Words
# • Hashtags
print([w for w in text11 if w.startswith('#')])
# ['#UNSG']
# • Callouts
print([w for w in text11 if w.startswith('@')])
# ['@', '@UN', '@UN_Women']

['#UNSG']
['@', '@UN', '@UN_Women']
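
The bare `'@'` appears in the callout list because `startswith('@')` also accepts a word that consists of the symbol alone. One simple way to exclude it, sketched here without regular expressions, is to also require at least one character after the symbol. The length check and the names `callouts` and `hashtags` are assumptions for illustration; this filter does not check which characters follow the symbol.

```python
text10 = ('"Ethics are built right into the ideals and objectives of the '
          'United Nations" #UNSG @ NY Society for Ethical Culture '
          'bit.ly/2guVelr @UN @UN_Women')
text11 = text10.split(' ')

# Keep only words that have something after the leading symbol.
callouts = [w for w in text11 if w.startswith('@') and len(w) > 1]
hashtags = [w for w in text11 if w.startswith('#') and len(w) > 1]

print(callouts)  # ['@UN', '@UN_Women']
print(hashtags)  # ['#UNSG']
```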


In [0]:
text10 = '"Ethics are built right into the ideals and objectives of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women'
text11 = text10.split(' ')
[w for w in text11 if w.startswith('@')]
# ['@', '@UN', '@UN_Women']
# Import regular expressions first!
import re
[w for w in text11 if re.search('@[A-Za-z0-9_]+', w)]
# ['@UN', '@UN_Women']
# \w matches any word character (letters, digits and the underscore)
[w for w in text11 if re.search(r'@\w+', w)]
# ['@UN', '@UN_Women']
