---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Working With Text

In [1]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1) # The length of text1

76

In [2]:
text2 = text1.split(' ') # Return a list of the words in text1, separating by ' '.

len(text2)

14

In [3]:
text2

['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']
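The trailing `''` appears because `text1` ends with a space and `split(' ')` treats every single space as a separator. As a minimal sketch of the alternative, calling `split()` with no argument splits on runs of whitespace and drops empty strings (the variable names `words_space` and `words_any` are illustrative only):

```python
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

words_space = text1.split(' ')  # splits on every single space; keeps a trailing ''
words_any = text1.split()       # splits on runs of whitespace; no empty strings

print(len(words_space))  # 14
print(len(words_any))    # 13
print(words_any[-1])     # Nations
```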

<br>
List comprehension allows us to find specific words:

In [4]:
[w for w in text2 if len(w) > 3] # Words in text2 that are longer than 3 letters

['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

In [5]:
[w for w in text2 if w.istitle()] # Capitalized words in text2

['Ethics', 'United', 'Nations']

In [6]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'

['Ethics', 'ideals', 'objectives', 'Nations']

<br>
We can find unique words using `set()`.

In [7]:
text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)

6

In [8]:
len(set(text4))

5

In [9]:
set(text4)

{'To', 'be', 'not', 'or', 'to'}

In [10]:
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.

4

In [11]:
set([w.lower() for w in text4])

{'be', 'not', 'or', 'to'}

### Processing free-text

In [12]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')

text6

['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']
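Splitting on spaces leaves the quotation marks attached to `'"Ethics'` and `'Nations"'`. One possible cleanup, sketched here under the assumption that only double quotes need removing, is to strip that character from both ends of each token (the name `tokens` is illustrative):

```python
text5 = ('"Ethics are built right into the ideals and objectives of the United Nations" '
         '#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# strip('"') removes double quotes from both ends of each token and nothing else
tokens = [w.strip('"') for w in text5.split(' ')]

print(tokens[0])   # Ethics
print(tokens[12])  # Nations
```

Markers such as `#` and `@` are left in place, so the hashtag and callout searches still work on the cleaned tokens.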

### REGEX

<br>
Finding hashtags:

In [13]:
[w for w in text6 if w.startswith('#')]

['#UNSG']

<br>
Finding callouts:

In [14]:
[w for w in text6 if w.startswith('@')]

['@']

In [15]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')

In [16]:
import re # import re - a module that provides support for regular expressions

[w for w in text8 if re.search('@[A-Za-z0-9_]?', w)]

['@UN', '@UN_Women', '@']

<br>

We can use regular expressions to help us with more complex parsing. 

For example, `'@[A-Za-z0-9_]+'` matches every word that contains `'@'` followed by at least one:
* capital letter (`'A-Z'`)
* lowercase letter (`'a-z'`)
* number (`'0-9'`)
* or underscore (`'_'`)

In [17]:
[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]

['@UN', '@UN_Women']
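Splitting into words first is not strictly required. As a sketch, `re.findall` can scan the whole string and return just the matching substrings; the hashtag pattern below is an assumption that reuses the same character class:

```python
import re

text7 = ('@UN @UN_Women "Ethics are built right into the ideals and objectives '
         'of the United Nations" #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr')

# findall returns every non-overlapping match in the string, left to right
mentions = re.findall(r'@[A-Za-z0-9_]+', text7)
hashtags = re.findall(r'#[A-Za-z0-9_]+', text7)

print(mentions)  # ['@UN', '@UN_Women']
print(hashtags)  # ['#UNSG']
```

Because the pattern is not tied to word starts, it would also pick up the `@example` part of an email address such as `user@example.com`, so real data may need a stricter pattern.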

#### Meta-characters for REGEX

. : wildcard, matches any single character (except a newline by default)

^ : matches the start of the string

$ : matches the end of the string

[] : matches one of the set of characters within

[a-z] : matches one character in the range a,b,c,...x,y,z

[^abc] : matches a character that is NOT a,b or c

a|b : matches a or b, where a and b are strings

( ) : scoping for operators

\ : Escape character for special characters (\t,\n,\b)

\b : Matches a word boundary

\d : Matches any digit, equivalent to [0-9]

\D : Matches any non-digit, equivalent to [^0-9]

\s : Any whitespace, same as [ \t\n\r\f\v]

\S : Any non-whitespace, same as [^ \t\n\r\f\v]

\w : Alphanumeric character, same as [a-zA-Z0-9_]

\W : non-Alphanumeric Character, same as [^a-zA-Z0-9_]

\* : matches zero or more occurrences

\+ : matches one or more occurrences

? : matches zero or one occurrence

{n} : matches exactly n times

{n,} : at least n repetitions

{,n} : at most n repetitions

{m,n} : at least m and at most n times
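A few of these meta-characters in action, as a small sketch. The `sample` string is made up for illustration and does not come from the course data:

```python
import re

sample = "Room 42, floor 3: call 555-0199 before 9am"

digits = re.findall(r'\d+', sample)            # one or more digits
five_letter = re.findall(r'\b\w{5}\b', sample) # whole words of exactly 5 word characters
first_word = re.findall(r'^\w+', sample)       # word characters at the start of the string
last_word = re.findall(r'\w+$', sample)        # word characters at the end of the string
phone = re.findall(r'\d{3}-\d{4}', sample)     # 3 digits, a hyphen, 4 digits

print(digits)       # ['42', '3', '555', '0199', '9']
print(five_letter)  # ['floor']
print(first_word)   # ['Room']
print(last_word)    # ['9am']
print(phone)        # ['555-0199']
```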

In [24]:
text20 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr @UN @UN_Women'

In [25]:
text21 = text20.split(" ")

In [31]:
# Demonstrate that \w is the same as [a-zA-Z0-9_] (raw string so the backslash reaches the regex engine)
[w for w in text21 if re.search(r'@\w+', w)]

['@UN', '@UN_Women']

In [34]:
[w for w in text21 if re.search(r'@[a-zA-Z0-9_]+', w)]

['@UN', '@UN_Women']

In [33]:
text22 = 'ouagadougou'

In [39]:
# find all vowels
re.findall(r'[aeiou]', text22)

['o', 'u', 'a', 'a', 'o', 'u', 'o', 'u']

In [40]:
# find all consonants by matching non-vowels
re.findall(r'[^aeiou]', text22)

['g', 'd', 'g']

### Regex and Dates
- Dates are tricky because the same date can be written in many formats

In [43]:
# Examples of 23rd October 2002
dateStr = '''23-10-2002
23/10/2002
23/10/02
10/23/2002
23 Oct 2002
23 October 2002
Oct 23, 2002
October 23, 2002
'''

In [44]:
# first pass at a regex to find ALL date formats
re.findall(r'\d{2}[/-]\d{2}[/-]\d{4}', dateStr)

['23-10-2002', '23/10/2002', '10/23/2002']

In [45]:
# Second iteration to find the 23/10/02 format
re.findall(r'\d{2}[/-]\d{2}[/-]\d{2,4}', dateStr)

['23-10-2002', '23/10/2002', '23/10/02', '10/23/2002']

In [46]:
# Third iteration: also allow one-digit days and months
re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', dateStr)

['23-10-2002', '23/10/2002', '23/10/02', '10/23/2002']

In [53]:
# Find dates with month names; with a capturing group (), findall returns only the captured part
re.findall(r'\d{2} (Jan|Feb|Mar|Apr|May|June|July|Aug|Sep|Oct|Nov|Dec) \d{4}', dateStr)

['Oct']

In [54]:
# to return the entire match, make the group non-capturing by putting ?: at the beginning of the ()
re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|June|July|Aug|Sep|Oct|Nov|Dec) \d{4}', dateStr)

['23 Oct 2002']

In [72]:
# alternate, convoluted solution without (?:...): capture every part and join the pieces
a = re.findall(r'(\d{2}) (Jan|Feb|Mar|Apr|May|June|July|Aug|Sep|Oct|Nov|Dec) (\d{4})', dateStr)[0]
" ".join(a)

'23 Oct 2002'
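Another option, sketched here, is `re.finditer`, which yields match objects instead of strings. On a match object, `group(0)` is the whole match and the numbered groups are the captured parts, so capturing groups no longer change what comes back. The shortened `dateStr` and the list names are illustrative assumptions:

```python
import re

dateStr = '23 Oct 2002\n23 October 2002\nOct 23, 2002\n'
pattern = r'(\d{2}) (Jan|Feb|Mar|Apr|May|June|July|Aug|Sep|Oct|Nov|Dec) (\d{4})'

full_matches = []
months = []
for m in re.finditer(pattern, dateStr):
    full_matches.append(m.group(0))  # the entire matched text
    months.append(m.group(2))        # only the captured month abbreviation

print(full_matches)  # ['23 Oct 2002']
print(months)        # ['Oct']
```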

In [75]:
# another iteration, this time with [a-z]* to grab any extended names of months
re.findall(r'\d{2} (?:Jan|Feb|Mar|Apr|May|June|July|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4}', dateStr)

['23 Oct 2002', '23 October 2002']

In [76]:
# to also grab dates with the month at the beginning, make both day positions optional
re.findall(r'(?:\d{2} )?(?:Jan|Feb|Mar|Apr|May|June|July|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{2}, )?\d{4}', dateStr)

['23 Oct 2002', '23 October 2002', 'Oct 23, 2002', 'October 23, 2002']

In [77]:
# Third iteration of the numeric pattern, shown again: together with the previous cell it covers all eight formats
re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', dateStr)

['23-10-2002', '23/10/2002', '23/10/02', '10/23/2002']
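The numeric and month-name patterns can also be joined with `|` so that a single call finds every format. This is a sketch that reuses the two patterns from the cells above; the names `numeric`, `month`, `textual`, and `all_dates` are illustrative:

```python
import re

dateStr = '''23-10-2002
23/10/2002
23/10/02
10/23/2002
23 Oct 2002
23 October 2002
Oct 23, 2002
October 23, 2002
'''

numeric = r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}'
month = r'(?:Jan|Feb|Mar|Apr|May|June|July|Aug|Sep|Oct|Nov|Dec)[a-z]*'
textual = r'(?:\d{2} )?' + month + r' (?:\d{2}, )?\d{4}'

# Wrap each alternative in a non-capturing group so findall returns whole matches
all_dates = re.compile('(?:' + numeric + ')|(?:' + textual + ')')
found = all_dates.findall(dateStr)

print(len(found))  # 8
print(found == dateStr.splitlines())  # True
```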

In [29]:
f = open("../udhr.txt", 'r')
f.readline()

'Preamble\n'

In [30]:
text12 = f.readlines()

In [31]:
len(text12)

154

In [32]:
# \n is kept at the end of each line returned by .readlines()
text12[0:2]

['Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world,\n',
 '\n']

In [21]:
# \n is removed by .splitlines(); it is a string method, so join the list of lines first
text13 = ''.join(text12).splitlines()
text13[0:2]

['Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world,',
 '']

In [23]:
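# .readlines() is a method of file objects only; calling it on text that was already read raises an AttributeError
text14 = text12.readlines()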

AttributeError: 'str' object has no attribute 'readlines'

In [13]:
text14

''

### File Operations

In [33]:
'''
f = open(filename, mode)
f.readline(), f.read(), f.read(n) where n = number of characters to read in
for line in f:
    doSomething(line)
f.seek(n) - resets the reading position to offset n
f.write(message)
f.close()
f.closed to check if a file has been closed
'''

'\nf = open(filename, mode)\nf.readline(), f.read(), f.read(n) where n = number of characters to read in\nfor line in f:\n    doSomething(line)\nf.seek(n) - resets the reading position to offset n\nf.write(message)\nf.close()\nf.closed to check if a file has been closed\n'
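The operations listed above can be tried end to end with a `with` block, which closes the file automatically. This sketch writes a small throwaway file first so that it runs without `udhr.txt`; the file contents and variable names are made up for illustration:

```python
import os
import tempfile

# Create a small temporary file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Preamble\nWhereas recognition of the inherent dignity\n\n')
    path = tmp.name

with open(path, 'r') as f:
    first = f.readline()      # reads one line, including its '\n'
    rest = f.readlines()      # remaining lines, each keeping its '\n'
    leftover = f.readlines()  # empty list: the reading position is already at the end
    f.seek(0)                 # move the reading position back to the start
    whole = f.read()          # the full contents as a single string

print(f.closed)               # True: the with block closed the file
lines = whole.splitlines()    # split on line breaks and drop them
print(lines)
os.remove(path)               # clean up the temporary file
```

The empty `leftover` list shows why a second call to `readlines()` on the same open file returns nothing unless `seek(0)` is called first.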

f = open('udhr.txt', 'r')

text14 = f.readline()

returns => 'Universal Declaration... Rights \n

- How to remove the last newline character?

In [37]:
f = open("../udhr.txt", 'r')
text15 = f.readline()
text15


'Preamble\n'

In [40]:
text15.rstrip()

'Preamble'
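With no argument, `rstrip()` removes all trailing whitespace, not only the newline. A short sketch of the variants, using a made-up `line` that has a trailing space and tab before the newline:

```python
line = 'Preamble \t\n'

no_trailing_ws = line.rstrip()        # removes trailing spaces, tabs and newlines
no_newline = line.rstrip('\n')        # removes only trailing newline characters
both_ends = '  Preamble\n'.strip()    # removes whitespace from both ends

print(repr(no_trailing_ws))  # 'Preamble'
print(repr(no_newline))      # 'Preamble \t'
print(repr(both_ends))       # 'Preamble'
```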