# Basic Python
### Beginning Python

In [2]:
def hello(who):                           # 1
    """Greet somebody"""                  # 2
    print("Hello " + who + "!")           # 3

1 Defines a new function/procedure called `hello` which takes a single argument. Note that Python variables are not declared with a type, so `who` could be a string, an integer, a list and so on. The line ends with a colon (:), which means we're beginning an indented code block.

2 Firstly, note that there are no brackets delimiting the body of the procedure; Python instead uses indentation to delimit code blocks. So getting the indentation right is crucial!

This line (2) is a documentation string for the procedure, which gets associated with it in the Python environment. The three double quotes delimit a multi-line string (a single `'` or `"` would also work here, since the string fits on one line).

3 This is the body of the procedure. `print` is a built-in function in Python 3. Note that in Python 2.x `print` is a statement written without round brackets; this is a major difference from Python 3.x. We also see here the `+` operator used on strings (I'm assuming `who` is a string) to perform concatenation. Thus we have operator overloading based on object type, just like other OO languages.
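
Because `who` is not typed, nothing stops a caller from passing a non-string. The sketch below shows what then happens with `+`. `hello_any` is a hypothetical variant, not part of the original notebook, that builds the message with an f-string instead.

```python
def hello(who):
    """Greet somebody"""
    print("Hello " + who + "!")

def hello_any(who):
    """Greet somebody, whatever type `who` is (hypothetical variant)."""
    # An f-string formats the argument as text before inserting it
    message = f"Hello {who}!"
    print(message)
    return message

try:
    hello(42)            # str + int is not defined
except TypeError as err:
    print("TypeError:", err)

hello_any(42)
```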


In [3]:
help(hello)

Help on function hello in module __main__:

hello(who)
    Greet somebody



In [4]:
hello("Steve")

Hello Steve!


Here I'm calling the new procedure with a literal string argument delimited by `"`.

In [5]:
hello('world')

Hello world!


And here delimited by `'`. Both of these delimiters are equivalent; use one if you want to include the other in the string, e.g. `"Steve's"`.
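
A short sketch of the quoting rule; the variable names and values are illustrative only.

```python
# Double quotes let the string contain an apostrophe without escaping
possessive = "Steve's"
# Single quotes let the string contain double quotes
quoted = 'She said "hello"'
# A backslash escapes the delimiter when both kinds are needed
both = 'Steve\'s "hello"'
print(possessive, quoted, both, sep="\n")
# The choice of delimiter does not change the value
same = 'world' == "world"
print(same)
```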

In [6]:
people =  ['Steve', "Mark", 'Diego']      # 6
for person in people:                     # 7
    hello(person)                         # 8

Hello Steve!
Hello Mark!
Hello Diego!


6 This defines a variable `people` whose value is a list of strings. Lists are 1-D arrays, and the elements can be any Python object (including lists).

7 A `for` loop over the elements of the list. Again the line ends with a colon indicating a code block to follow.

8 Call the procedure with the loop variable, which is bound to successive elements of the list.


### Core Data Types
1. Strings
2. Numbers (integers, float, complex)
3. Lists
4. Tuples (immutable sequences)
5. Dictionaries (associative arrays)
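
As a quick orientation, the sketch below writes one literal of each core type and asks Python for its type name. The sample values are made up.

```python
samples = [
    "text",            # string
    42,                # integer
    3.14,              # float
    2 + 3j,            # complex
    [1, 2, 3],         # list (mutable sequence)
    (1, 2, 3),         # tuple (immutable sequence)
    {"one": 1},        # dictionary (associative array)
]
# type(value).__name__ gives the name of the value's type
type_names = [type(value).__name__ for value in samples]
for name, value in zip(type_names, samples):
    print(name, value)
```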

### Lists

In [7]:
a = ['one', 'two', 3, 'four']

In [8]:
a[0]

'one'

In [9]:
a[-1]

'four'

In [10]:
a[0:3]

['one', 'two', 3]

In [11]:
len(a)

4

In [12]:
len(a[0])

3

In [13]:
a[1] = 2
a

['one', 2, 3, 'four']

In [14]:
a.append('five')
a

['one', 2, 3, 'four', 'five']

In [15]:
top = a.pop()
a

['one', 2, 3, 'four']

In [16]:
top

'five'

### List Comprehensions
List comprehensions are a very powerful feature of Python. They reduce the need to write simple loops.

In [17]:
a = ['one', 'two', 'three', 'four']
len(a[0])

3

In [18]:
b = [w for w in a if len(w) > 3]
b

['three', 'four']

In [19]:
c = [[1,'one'],[2,'two'],[3,'three']]
d = [w for [n,w] in c]
d

['one', 'two', 'three']
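
To see what a comprehension replaces, the filter used for `b` can be written as an explicit loop. This is a sketch; `b_loop` and `lengths` are names introduced here for illustration.

```python
a = ['one', 'two', 'three', 'four']

# Explicit loop equivalent to [w for w in a if len(w) > 3]
b_loop = []
for w in a:
    if len(w) > 3:
        b_loop.append(w)

b = [w for w in a if len(w) > 3]
print(b_loop == b)

# The expression before `for` can also transform each element
lengths = [len(w) for w in a]
print(lengths)
```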

### Tuples
Tuples are a sequence data type like lists but are immutable:
* Once created, elements cannot be added or modified.

Create tuples as literals using parentheses:


In [20]:
a  = ('one', 'two', 'three')

Or from another sequence type:

In [21]:
a = ['one', 'two', 'three']
b = tuple(a)

Use tuples as fixed-length sequences; they typically need less memory than a list of the same elements.
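
A minimal sketch of what immutability means in practice. The size comparison at the end assumes CPython; exact byte counts are implementation-specific, so treat the numbers as indicative only.

```python
import sys

a = ('one', 'two', 'three')
try:
    a[0] = 'zero'          # item assignment is not allowed on a tuple
except TypeError as err:
    print("TypeError:", err)

# Tuples have no append method either
can_append = hasattr(a, 'append')
print(can_append)

# Compare the container sizes of a tuple and a list with the same items
as_list = list(a)
print(sys.getsizeof(a), sys.getsizeof(as_list))
```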

### Dictionaries
* Associative array datatype (hash)
* Store values under some hash key
* Key can be any immutable (hashable) type: string, number, tuple

In [22]:
names = dict()
names['madonna'] = 'Madonna'
names['john'] = ['Dr.', 'John', 'Marshall']
names.keys()

dict_keys(['madonna', 'john'])

In [23]:
list(names.keys())

['madonna', 'john']

In [24]:
ages = {'steve':41, 'john':22}
'john' in ages

True

In [25]:
41 in ages

False

In [26]:
'john' in ages.keys()

True

In [27]:
for k in ages:
    print(k, ages[k])

steve 41
john 22
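
Three related idioms, sketched with the `ages` dictionary:

- `items()` yields key/value pairs directly.
- `get()` returns a default instead of raising `KeyError` for a missing key.
- `values()` lets you test for a value, since `in` only checks keys.

The last part illustrates the restriction on keys. The name `grid` and its contents are made up.

```python
ages = {'steve': 41, 'john': 22}

for name, age in ages.items():
    print(name, age)

# get() returns the default when the key is missing
unknown = ages.get('mark', 0)
print(unknown)

# `in` tests keys; use values() to test for a value
print(41 in ages.values())

grid = {}
grid[(0, 1)] = 'tuple key works'
try:
    grid[[0, 1]] = 'list key fails'   # lists are mutable, so not hashable
except TypeError as err:
    print("TypeError:", err)
```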


### Organising Source Code: Modules
* In Python, a module is a single source file which defines one or more procedures or classes.
* Load a module with the `import` directive.
* After importing the module, all functions are grouped in the module namespace.
* Python provides many useful modules.


In [28]:
import math
20 * math.log(3)

21.972245773362197
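
`import` has a few variants that change which names end up in your namespace. A brief sketch using the standard `math` module:

```python
import math                 # names are reached as math.<name>
import math as m            # same module under a shorter alias
from math import log, pi    # selected names copied into this namespace

all_same = math.log(3) == m.log(3) == log(3)
print(all_same)

# log takes an optional base as its second argument
log2_of_8 = log(8, 2)
print(log2_of_8)
print(round(pi, 2))
```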

### Defining Modules
* A module is a source file containing Python code
    * Usually class/function definitions.
* The first non-comment item can be a docstring for the module.

```python
# my python module
"""This is a python module to
do something interesting"""

def foo(x):
   'foo the x'
   print('the foo is ' + str(x))
```
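
To try this without leaving the notebook, the sketch below writes the module source to a temporary directory and imports it. The file name `mymodule.py` is a hypothetical choice; in normal use you would simply save the file next to your script and write `import mymodule`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

source = '''# my python module
"""This is a python module to
do something interesting"""

def foo(x):
    'foo the x'
    print('the foo is ' + str(x))
'''

# Write the module to a temporary directory (hypothetical file name)
module_dir = tempfile.mkdtemp()
Path(module_dir, "mymodule.py").write_text(source)

# Make the directory importable, then import the module by its file name
sys.path.insert(0, module_dir)
importlib.invalidate_caches()
mymodule = importlib.import_module("mymodule")

mymodule.foo(3)
print(mymodule.__doc__)       # the module docstring
print(mymodule.foo.__doc__)   # the function docstring
```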

# NLTK
NLTK (the Natural Language Toolkit) is a Python module for natural language processing.

In [29]:
import nltk

Let's compute some simple statistics on the Gutenberg corpus.

In [30]:
nltk.download('gutenberg')
nltk.corpus.gutenberg.fileids()

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/gideonsacks/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [31]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
len(emma)

192427

In [32]:
nltk.download('punkt')
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    # characters per word, words per sentence, words per distinct (lowercased) word
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/gideonsacks/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 18 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
4 12 7 shakespeare-macbeth.txt
5 36 12 whitman-leaves.txt


### Counting Words

In [33]:
import collections
emma_counter = collections.Counter(emma)
emma_counter.most_common(10)

[(',', 11454),
 ('.', 6928),
 ('to', 5183),
 ('the', 4844),
 ('and', 4672),
 ('of', 4279),
 ('I', 3178),
 ('a', 3004),
 ('was', 2385),
 ('her', 2381)]

In [34]:
emma_counter['Emma']

865
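
`Counter` behaves like a dictionary from items to counts, and counting is case-sensitive. The sketch below uses a made-up token list, not the corpus, to show how lowercasing merges counts.

```python
import collections

tokens = ['The', 'cat', 'saw', 'the', 'other', 'cat', '.']

raw_counts = collections.Counter(tokens)
print(raw_counts['the'], raw_counts['The'])

# Lowercase first so that 'The' and 'the' count as one word
folded = collections.Counter(t.lower() for t in tokens)
print(folded.most_common(2))

# A missing item has count 0 rather than raising KeyError
print(folded['dog'])
```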

### Exercises
1. Identify the 10 most common words in each file of the Gutenberg corpus. Can you see any similarities among them?
2. Find the most frequent word with length of at least 7 characters.
3. Find the words that are longer than 7 characters and occur more than 7 times.

### Count Bigrams
A bigram is a sequence of two adjacent words.

In [35]:
list(nltk.bigrams([1,2,3,4,5,6]))

[(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]

In [36]:
list(nltk.bigrams(emma))[:5]

[('[', 'Emma'),
 ('Emma', 'by'),
 ('by', 'Jane'),
 ('Jane', 'Austen'),
 ('Austen', '1816')]

* A bigram is an ngram where n is 2
* A trigram is an ngram where n is 3

In [37]:
list(nltk.ngrams(emma,4))[:5]

[('[', 'Emma', 'by', 'Jane'),
 ('Emma', 'by', 'Jane', 'Austen'),
 ('by', 'Jane', 'Austen', '1816'),
 ('Jane', 'Austen', '1816', ']'),
 ('Austen', '1816', ']', 'VOLUME')]
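
For intuition, n-grams can be built in plain Python by zipping a sequence against shifted copies of itself. This is a sketch of the idea, not NLTK's implementation; `simple_ngrams` is a name introduced here.

```python
def simple_ngrams(sequence, n):
    """Return the list of n-grams (as tuples) of a sequence."""
    items = list(sequence)
    # zip stops at the shortest slice, giving len(items) - n + 1 tuples
    return list(zip(*(items[i:] for i in range(n))))

print(simple_ngrams([1, 2, 3, 4, 5, 6], 2))
print(simple_ngrams("to be or not to be".split(), 3))
```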

### Exercises
1. Find the most frequent bigram in Austen's Emma.
2. Find the most frequent bigram that begins with 'the'.

# Text Processing in Python

### Sorting
* The function `sorted()` returns a sorted copy.
* Sequences can be sorted in place with the `sort()` method.
* Python 3 does not support sorting lists whose elements cannot be compared with each other, such as a mix of numbers and strings.

In [38]:
foo = [2,5,9,1,11]
sorted(foo)

[1, 2, 5, 9, 11]

In [39]:
foo

[2, 5, 9, 1, 11]

In [40]:
foo.sort()

In [41]:
foo

[1, 2, 5, 9, 11]

In [42]:
foo2 = [2,5,6,1,'a']
sorted(foo2)

TypeError: '<' not supported between instances of 'str' and 'int'
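
If mixed contents must be sorted anyway, one option is a `key` function that maps every element to something comparable. The choice of key decides the order, so this sketch shows two possibilities rather than a general fix. The list `mixed` is made up.

```python
mixed = [10, 2, 'b', 1, 'a']

# Compare everything by its string form: '10' sorts before '2'
by_text = sorted(mixed, key=str)
print(by_text)

# Or put numbers before strings, each group in its natural order.
# The first tuple component separates the groups, so an int is never
# compared with a str.
by_group = sorted(mixed, key=lambda x: (isinstance(x, str), x))
print(by_group)
```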

### Sorting with a custom sorting criterion

In [1]:
l = ['a','abc','b','c','aa','bb','cc']

In [2]:
sorted(l)

['a', 'aa', 'abc', 'b', 'bb', 'c', 'cc']

In [3]:
sorted(l,key=len)

['a', 'b', 'c', 'aa', 'bb', 'cc', 'abc']

In [4]:
sorted(l,key=len,reverse=True)

['abc', 'aa', 'bb', 'cc', 'a', 'b', 'c']

In [5]:
def my_len(x):
    return -len(x)

In [6]:
sorted(l,key=my_len)

['abc', 'aa', 'bb', 'cc', 'a', 'b', 'c']

In [7]:
sorted(l,key = lambda x: -len(x))

['abc', 'aa', 'bb', 'cc', 'a', 'b', 'c']
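
Python's sort is stable, so elements with equal keys keep their original relative order; that is why the one-letter strings above stay in the order `a`, `b`, `c`. When the input is not already in a convenient order, a key function can return a tuple, which sorts by the first component and breaks ties with the next. A sketch with a made-up list:

```python
shuffled = ['cc', 'b', 'abc', 'aa', 'c', 'bb', 'a']

# Length only: ties keep their input order
by_len = sorted(shuffled, key=len, reverse=True)
print(by_len)

# Longest first, then alphabetical within each length
by_len_then_alpha = sorted(shuffled, key=lambda x: (-len(x), x))
print(by_len_then_alpha)
```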

### Exercises
You're given data of the following form:

```python
namedat = dict()
namedat['mc'] = ('Madonna', 45)
namedat['sc'] = ('Steve', 41)
```

1. How would you print a list ordered by name?
2. How would you print a list ordered by age?

### Strings in Python
* `str` is a built-in type.
* Strings are sequences and support the same indexing and slicing operations as lists or tuples.

In [None]:
foo = "A string"
len(foo)

In [None]:
foo[0]

In [None]:
foo[0:3]

In [None]:
multifoo = """A multiline 
string"""

In [None]:
multifoo

In [None]:
"my string".capitalize()

In [None]:
capitalize("my string")  # NameError: capitalize is a method of str, not a built-in function

In [None]:
"my string".upper()

In [None]:
"My String".lower()

In [None]:
a = "my string with my other text"
a.count("my")

In [None]:
a.find("with")

In [None]:
a.find("nothing")

### Split
* `split(sep)` is a central string operation.
* It splits a string wherever `sep` occurs (any run of whitespace by default)
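
The default and an explicit `' '` separator behave differently when spaces repeat. A small sketch with made-up strings; the optional second argument, `maxsplit`, limits the number of splits.

```python
line = "one  two"              # two spaces between the words

default_split = line.split()   # any run of whitespace is one separator
space_split = line.split(' ')  # every single space is a separator
print(default_split)
print(space_split)

# Split only at the first '='
first_only = "a=b=c".split('=', 1)
print(first_only)
```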

In [None]:
foo = "one :: two :: three"
foo.split()

In [None]:
foo.split('::')

In [None]:
foo.split(' :: ')

In [None]:
"this is a test".split()

### Join
* `join` is another useful string method.
* It is called on the delimiter string and joins the elements of a list of strings with it.


In [None]:
text="this is some text to analyse"
words=text.split()
print(words)
words.sort()
print(words)
print(", ".join(words))
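
`join` only accepts strings, so other values have to be converted first. A minimal sketch with a made-up list of numbers:

```python
numbers = [3, 1, 2]
try:
    ", ".join(numbers)            # join rejects non-string items
except TypeError as err:
    print("TypeError:", err)

# Convert each element to a string before joining
joined = ", ".join(str(n) for n in numbers)
print(joined)

# For a fixed separator, split followed by join restores the text
text = "this is some text"
round_trip = " ".join(text.split(" "))
print(round_trip == text)
```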

### Replace

In [None]:
def censor(text):
    """replace bad words in a text with XXX"""
    badwords = ['poo', 'bottom']
    for b in badwords:
        text = text.replace(b, 'XXX')
    return text

In [None]:
censor("this is all poo and more poo")
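
Note that `replace` works on substrings, so it also changes words that merely contain a bad word. One possible refinement, sketched here with the `re` module, matches whole words only. `censor_words` is a hypothetical variant, not part of the original notebook.

```python
import re

def censor_words(text, badwords=('poo', 'bottom')):
    """Replace whole bad words only (hypothetical variant of censor)."""
    # \b matches a word boundary; re.escape guards special characters
    pattern = r'\b(?:' + '|'.join(re.escape(b) for b in badwords) + r')\b'
    return re.sub(pattern, 'XXX', text)

# Plain replace also hits 'bottomless' and 'pool'
substring_result = "a bottomless pool".replace('poo', 'XXX').replace('bottom', 'XXX')
print(substring_result)

whole_word_result = censor_words("a bottomless pool of poo")
print(whole_word_result)
```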

### Text Preprocessing with NLTK

In [8]:
import nltk
nltk.download("punkt")
text = "This is a sentence. This is another sentence."
nltk.sent_tokenize(text)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/gideonsacks/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['This is a sentence.', 'This is another sentence.']

In [9]:
for s in nltk.sent_tokenize(text):
    for w in nltk.word_tokenize(s):
        print(w)
    print()

This
is
a
sentence
.

This
is
another
sentence
.



![WordPosPipeline](WordPosPipeline.png)

In [10]:
nltk.download("averaged_perceptron_tagger")
nltk.pos_tag(["this", "is", "a", "test"])

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/gideonsacks/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN')]

In [11]:
nltk.download("universal_tagset")
nltk.pos_tag(["this", "is", "a", "test"], tagset="universal")

[nltk_data] Downloading package universal_tagset to
[nltk_data]     /Users/gideonsacks/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


[('this', 'DET'), ('is', 'VERB'), ('a', 'DET'), ('test', 'NOUN')]

In [12]:
nltk.pos_tag(nltk.word_tokenize("this is a test"), tagset="universal")

[('this', 'DET'), ('is', 'VERB'), ('a', 'DET'), ('test', 'NOUN')]

![SentPosPipeline](SentPosPipeline.png)

In [13]:
text = "This is a sentence. This is another sentence."
text_sent_tokens = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
text_sent_tokens

[['This', 'is', 'a', 'sentence', '.'],
 ['This', 'is', 'another', 'sentence', '.']]

In [14]:
nltk.pos_tag_sents(text_sent_tokens, tagset="universal")

[[('This', 'DET'),
  ('is', 'VERB'),
  ('a', 'DET'),
  ('sentence', 'NOUN'),
  ('.', '.')],
 [('This', 'DET'),
  ('is', 'VERB'),
  ('another', 'DET'),
  ('sentence', 'NOUN'),
  ('.', '.')]]

### Exercises
1.  What is the sentence with the largest number of tokens
    in Austen's "Emma"?
2. What is the most frequent part of speech in Austen's "Emma"?
3. What is the number of distinct stems in Austen's "Emma"?
4. What is the most ambiguous stem in Austen's "Emma"?
    (meaning, which stem in Austen's "Emma" maps to the
    largest number of distinct tokens?)

# Regular Expressions

In [15]:
import re
def is_there_a_nine(text):
    """return True if there is a 9 in the string"""
    # the parameter is called text so it does not shadow the built-in str
    return re.search('9', text) is not None

def find_digits(text):
    """return any sequences of digits in text as a list of strings"""
    return re.findall(r'[0-9]+', text)

In [None]:
is_there_a_nine("12 3 4 9 10")

In [None]:
is_there_a_nine("12 3 4 nine 10")

In [None]:
find_digits("12 3 4 9 10")

In [None]:
re.split(r'\W',"words, words, words")

In [None]:
re.split(r'\W+',"words, words, words")

In [None]:
re.findall(r'\(.*\)','y(aba)+(daba)+doo)')

In [None]:
re.findall(r'\(.*?\)','y(aba)+(daba)+doo)')
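
The two calls above differ only in `.*` versus `.*?`. The first is greedy and matches as much as possible; the second is non-greedy and stops at the first closing delimiter. A sketch of the same contrast on a made-up string, plus a negated character class as a common alternative:

```python
import re

sample = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r'<.*>', sample)       # runs to the last '>'
lazy = re.findall(r'<.*?>', sample)        # stops at the first '>'
negated = re.findall(r'<[^>]*>', sample)   # anything except '>' inside

print(greedy)
print(lazy)
print(negated)
```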

In [None]:
def getdate(text):
    """find a date day/month/year in text; return a list [day, month, year], or None if no date is found"""
    p = re.compile(r'([0-9]+)/([0-9]+)/([0-9]+)')
    m = p.search(text)
    if m is not None:
        return [m.group(1), m.group(2), m.group(3)]
    return None

In [None]:
getdate('2/08/2018')

In [None]:
getdate('there are no dates here')

In [None]:
def ozifydates(text):
    """map US dates (mm/dd/yy) to Australian format (dd/mm/yy)"""
    p = re.compile(r'([0-9]+)/([0-9]+)/([0-9]+)')
    return p.sub(r'\2/\1/\3', text)

In [None]:
ozifydates('03/31/16')
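
The backreferences `\1`, `\2` and `\3` refer to groups by position. The sketch below does the same swap with named groups, which some find easier to read. `ozifydates_named` is a hypothetical name; like the original, it does not check that the input really is a US-format date.

```python
import re

# (?P<name>...) gives a group a name; \g<name> refers to it in the replacement
DATE = re.compile(r'(?P<month>[0-9]+)/(?P<day>[0-9]+)/(?P<year>[0-9]+)')

def ozifydates_named(text):
    """Swap month and day in every date found in text."""
    return DATE.sub(r'\g<day>/\g<month>/\g<year>', text)

converted = ozifydates_named('due 03/31/16 and 12/25/16')
print(converted)
```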