# Module 1.3

In this module, we will cover
- functions;
- regular expressions;
- spaCy;

## Functions

## Regular expressions

### Pre-module quiz

This section builds heavily on this [tutorial](https://docs.python.org/3.6/howto/regex.html#regex-howto), and students are strongly encouraged to work through it. More comprehensive documentation of Python regular expressions can be found [here](https://docs.python.org/3.6/library/re.html).

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

Python uses the **raw string** notation for RE patterns, and backslashes '\' are not handled in any special way in a string literal prefixed with 'r'. So r'\n' is a two-character string containing '\' and 'n', while '\n' is a one-character string containing a newline. 
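
The difference between the two literals can be checked directly. The short sketch below only uses built-ins; the variable names `raw` and `cooked` are chosen here for illustration.

```python
# Compare a raw string literal with a regular one.
raw = r'\n'      # two characters: a backslash and the letter n
cooked = '\n'    # one character: a newline

print(len(raw), len(cooked))   # 2 1
print(raw == '\\n')            # True: r'\n' is the same string as '\\n'
```

Without the `r` prefix, every backslash meant for the regular expression engine would have to be doubled, which quickly becomes hard to read.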

We’ll use the raw string notation for the rest of the section. We will also write REs in highlight style, usually without quotes (i.e. r'\n' is written as `\n`), and strings to be matched 'in single quotes'.

REs can contain both special and ordinary characters. Most ordinary characters, like `A`, `a`, or `0`, are the simplest REs; they simply match themselves. You can concatenate ordinary characters to match more complex sequences. For example, `test` will match the string 'test' exactly.

Some characters are special metacharacters, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning.

Here’s a complete list of the metacharacters and their explanations:

- `.`             Matches any character except a newline.
- `^`             Matches the start of the string.
- `$`             Matches the end of the string or just before the newline at the end of the string.
- `*`             Matches 0 or more (greedy) repetitions of the preceding RE. Greedy means that it will match as many repetitions as possible.
- `+`             Matches 1 or more (greedy) repetitions of the preceding RE.
- `?`             Matches 0 or 1 (greedy) of the preceding RE.
- `*?, +?, ??`    Non-greedy versions of the previous three special characters.
- `{m,n}`         Matches from m to n repetitions of the preceding RE.
- `{m,n}?`        Non-greedy version of the above.
- `\`             Either escapes special characters or signals a special sequence.
- `[]`            Indicates a set of characters. A "^" as the first character indicates a complementing set.
- `|`             A|B, creates an RE that will match either A or B.
- `(...)`         Matches the RE inside the parentheses. The contents can be retrieved or matched later in the string.
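
A few of the metacharacters above are easier to grasp on short strings. The following is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

s = '<b>bold</b> and <i>italic</i>'

# Greedy: .* grabs as much as possible, so the match runs to the last '>'.
greedy = re.search(r'<.*>', s).group()

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.search(r'<.*?>', s).group()

# {m,n}, a complementing set, and alternation on short strings.
repeats = re.findall(r'a{2,3}', 'a aa aaa aaaa')
no_vowels = re.findall(r'[^aeiou ]', 'regex')
either = re.findall(r'cat|dog', 'cat, dog, bird')

print(greedy)     # <b>bold</b> and <i>italic</i>
print(lazy)       # <b>
print(repeats)    # ['aa', 'aaa', 'aaa']
print(no_vowels)  # ['r', 'g', 'x']
print(either)     # ['cat', 'dog']
```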


The special sequences consist of "\\" and a character from the list below:
- `\number`  Matches the contents of the group of the same number.
- `\A`       Matches only at the start of the string.
- `\Z`       Matches only at the end of the string.
- `\b`       Matches the empty string, but only at the start or end of a word.
- `\B`       Matches the empty string, but not at the start or end of a word.
- `\d`       Matches any decimal digit; equivalent to the set `[0-9]`.
- `\D`       Matches any non-digit character; equivalent to `[^\d]`.
- `\s`       Matches any whitespace character; equivalent to `[ \t\n\r\f\v]`.
- `\S`       Matches any non-whitespace character; equivalent to `[^\s]`.
- `\w`       Matches any alphanumeric character; equivalent to `[a-zA-Z0-9_]`.
- `\W`       Matches the complement of `\w`.
- `\\`       Matches a literal backslash.
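
The special sequences can be tried out in the same way. This sketch uses invented sample strings; only functions from the standard `re` module are called.

```python
import re

s = 'Order 66 was issued in 1609, not in 16090.'

# \d with a repetition count, bounded by \b so longer numbers are skipped.
four_digit = re.findall(r'\b\d{4}\b', s)

# \w+ matches runs of word characters; \s+ splits on any whitespace.
words = re.findall(r'\w+', 'to be, or not')
parts = re.split(r'\s+', 'to   be\tor\nnot')

# \1 refers back to group 1: find an immediately repeated word.
doubled = re.search(r'\b(\w+) \1\b', 'Thy self thy thy foe').group()

print(four_digit)  # ['1609']
print(words)       # ['to', 'be', 'or', 'not']
print(parts)       # ['to', 'be', 'or', 'not']
print(doubled)     # thy thy
```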

In [1]:
# import built-in regular expression module in python
import re

In [2]:
text = ("""1609

THE SONNETS

by William Shakespeare



                     1
  From fairest creatures we desire increase, That thereby beauty's rose might never die, """
  "But as the riper should by time decease, "
  "His tender heir might bear his memory: "
  "But thou contracted to thine own bright eyes, "
  "Feed'st thy light's flame with self-substantial fuel, "
  "Making a famine where abundance lies, "
  "Thy self thy foe, to thy sweet self too cruel: "
  "Thou that art now the world's fresh ornament, "
  "And only herald to the gaudy spring, " 
  "Within thine own bud buriest thy content, "
  "And tender churl mak'st waste in niggarding: "
  "Pity the world, or else this glutton be, "
  """To eat the world's due, by the grave and thee.


                     2
  When forty winters shall besiege thy brow, """
  "And dig deep trenches in thy beauty's field, "
  "Thy youth's proud livery so gazed on now, "
  "Will be a tattered weed of small worth held: "
  "Then being asked, where all thy beauty lies, "
  "Where all the treasure of thy lusty days; "
  "To say within thine own deep sunken eyes, "
  "Were an all-eating shame, and thriftless praise. "
  "How much more praise deserved thy beauty's use, "
  "If thou couldst answer 'This fair child of mine "
  "Shall sum my count, and make my old excuse' "
  "Proving his beauty by succession thine. "
  "This were to be new made when thou art old, "
  "And see thy blood warm when thou feel'st it cold.")

print(text)

1609

THE SONNETS

by William Shakespeare



                     1
  From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine where abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee.


                     2
  When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tattered weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say within thine own deep su

We will mainly concentrate on two essential uses of regular expressions: **searching** and **substituting** text.

### Searching text

Let's first find the word 'where', regardless of case.

The **re** module provides an interface to the regular expression engine, allowing you to compile REs into objects and then perform matches with them.

In [3]:
p = re.compile(r'\b[Ww]here\b')
p

re.compile(r'\b[Ww]here\b', re.UNICODE)

A compiled pattern object has several methods and attributes. For **searching** text, there are four main methods:
- *match()*: determines if the RE matches at the beginning of the string.
- *search()*: scans through a string, looking for any location where this RE matches.
- *findall()*: finds all substrings where the RE matches, and returns them as a list.
- *finditer()*: finds all substrings where the RE matches, and returns them as an iterator.

Please consult the **re** documentation for a complete listing.

In [4]:
m = p.match(text)
print(m)

None


In [5]:
m = p.search(text)
print(m)

<_sre.SRE_Match object; span=(353, 358), match='where'>


The functions *match()* and *search()* return None if no match can be found. If they’re successful, a match object instance is returned, containing information about the match: where it starts and ends, the substring it matched, and more.

In this example, the text doesn't start with the word 'where', so *match()* finds no match, whereas *search()* looks for any location where the RE matches.

We can further query the match object for information about the matching string.

In [6]:
m.group()

'where'

In [7]:
m.start(), m.end()

(353, 358)

In [8]:
m.span()

(353, 358)

Now what if we want to find all matches, i.e. all words 'where'? We can use *findall()* and *finditer()*.

The *findall()* returns a list of matching strings.

In [9]:
m = p.findall(text)
print(m)

['where', 'where', 'Where']


The *finditer()* returns a sequence of match object instances as an iterator.

In [10]:
iterator = p.finditer(text)
for match in iterator:
    print(match)

<_sre.SRE_Match object; span=(353, 358), match='where'>
<_sre.SRE_Match object; span=(900, 905), match='where'>
<_sre.SRE_Match object; span=(927, 932), match='Where'>


You don’t have to create a pattern object and call its functions; the **re** module also provides the same top-level functions.

In [11]:
print(re.search(r'\b[Ww]here\b', text))

<_sre.SRE_Match object; span=(353, 358), match='where'>


In [12]:
print(re.findall(r'\b[Ww]here\b', text))

['where', 'where', 'Where']


Notice that when we compiled the RE, a **re.UNICODE** item appeared when we printed the object. It is a compilation flag, which lets you modify some aspects of how the RE works. The full flag list is available in the module documentation. For example, we can use **re.IGNORECASE** or **re.I** to do case-insensitive matches, so that we do not have to take care of capital letters.

In [13]:
p = re.compile(r'\bwhere\b', re.I)
p

re.compile(r'\bwhere\b', re.IGNORECASE|re.UNICODE)

And we will still have the same results.

In [14]:
m = p.findall(text)
print(m)

['where', 'where', 'Where']


Now try to come up with your regular expressions and search in the text to see if they will work.

#### Groupings

#### Identifying dates
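
As a starting point for these two topics, here is a minimal sketch of how groups can pull the parts out of a date. The sample sentence and the `DD/MM/YYYY` layout are assumptions made for illustration, and the pattern does not check that the day and month are valid.

```python
import re

# Hypothetical sample text; the DD/MM/YYYY layout is an assumption.
sample = 'Meeting on 03/02/2021, deadline 15/11/2021.'

# Three groups capture day, month and year.
date_re = re.compile(r'\b(\d{2})/(\d{2})/(\d{4})\b')

m = date_re.search(sample)
print(m.group(0))   # whole match: 03/02/2021
print(m.group(3))   # third group: 2021
print(m.groups())   # ('03', '02', '2021')

# With several groups, findall() returns one tuple per match.
print(date_re.findall(sample))
```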

### Substituting text

We will use the following functions for substitution:
- *sub()*: find all substrings where the RE matches, and replace them with a different string.
- *subn()*: does the same thing as sub(), but returns the new string and the number of replacements.

Again, you are encouraged to refer to the docs for more functions.

***sub*** *(replacement, string[, count=0])*

Returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in *string* by the replacement *replacement*. If the pattern isn’t found, string is returned unchanged.

The optional argument *count* is the maximum number of pattern occurrences to be replaced; *count* must be a non-negative integer. The default value of 0 means to replace all occurrences.

For the same example, let's replace every word 'where' with '------'.

In [15]:
p = re.compile(r'\b[Ww]here\b')
print(p.sub('------', text))

1609

THE SONNETS

by William Shakespeare



                     1
  From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine ------ abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee.


                     2
  When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tattered weed of small worth held: Then being asked, ------ all thy beauty lies, ------ all the treasure of thy lusty days; To say within thine own deep

In [16]:
print(p.sub('------', text, count = 2))

1609

THE SONNETS

by William Shakespeare



                     1
  From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine ------ abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee.


                     2
  When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tattered weed of small worth held: Then being asked, ------ all thy beauty lies, Where all the treasure of thy lusty days; To say within thine own deep 

The *subn()* method does the same work, but returns a 2-tuple containing the new string value and the number of replacements that were performed:

In [17]:
print(p.subn('------', text))

("1609\n\nTHE SONNETS\n\nby William Shakespeare\n\n\n\n                     1\n  From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine ------ abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee.\n\n\n                     2\n  When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tattered weed of small worth held: Then being asked, ------ all thy beauty lies, ------ all the treasure of thy lusty days; To say within

In [18]:
p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
p.sub(r'subsection{\g<1>}','section{First} section{second}')

'subsection{First} subsection{second}'
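
The cell above combines three features: **re.VERBOSE** makes the engine ignore the spaces inside the pattern, `(?P<name>...)` defines a named group, and `\g<1>` in the replacement refers to group 1. The sketch below repeats the same pattern to show that `\g<name>` gives the same result; the variable names are chosen here for illustration.

```python
import re

# Same pattern as above: re.VERBOSE makes the regex engine ignore the spaces.
p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
source = 'section{First} section{second}'

by_number = p.sub(r'subsection{\g<1>}', source)
by_name = p.sub(r'subsection{\g<name>}', source)

print(by_number)             # subsection{First} subsection{second}
print(by_number == by_name)  # True

# The captured text is also available on a match object.
print(p.search(source).group('name'))  # First
```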

## SpaCy

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

There are many useful resources online for spaCy, among which the official [tutorial](https://spacy.io/usage/spacy-101) and the [course](https://course.spacy.io/) are extremely useful.

### Installation

Follow the link [here](https://spacy.io/usage) to install spaCy with pip. Then download and install the English model by running:
```
python -m spacy download en_core_web_sm
```

spaCy provides many different features, introduced [here](https://spacy.io/usage/spacy-101#features); we will focus on the simplest and most direct ones:
- tokenization;
- stemming; (not supported by spaCy)
- lemmatization;
- part-of-speech (POS) tagging;
- sentence boundary detection and sentence segmentation;
- dependency parsing;

In [19]:
# import spacy module
import spacy

# load english model
nlp = spacy.load("en_core_web_sm")
# read the text and process
doc = nlp(text)

In [20]:
# Tokenization
# get all separated tokens in doc
tokens = [token.text for token in doc]
print(tokens)

['1609', '\n\n', 'THE', 'SONNETS', '\n\n', 'by', 'William', 'Shakespeare', '\n\n\n\n                     ', '1', '\n  ', 'From', 'fairest', 'creatures', 'we', 'desire', 'increase', ',', 'That', 'thereby', 'beauty', "'s", 'rose', 'might', 'never', 'die', ',', 'But', 'as', 'the', 'riper', 'should', 'by', 'time', 'decease', ',', 'His', 'tender', 'heir', 'might', 'bear', 'his', 'memory', ':', 'But', 'thou', 'contracted', 'to', 'thine', 'own', 'bright', 'eyes', ',', "Feed'st", 'thy', 'light', "'s", 'flame', 'with', 'self', '-', 'substantial', 'fuel', ',', 'Making', 'a', 'famine', 'where', 'abundance', 'lies', ',', 'Thy', 'self', 'thy', 'foe', ',', 'to', 'thy', 'sweet', 'self', 'too', 'cruel', ':', 'Thou', 'that', 'art', 'now', 'the', 'world', "'s", 'fresh', 'ornament', ',', 'And', 'only', 'herald', 'to', 'the', 'gaudy', 'spring', ',', 'Within', 'thine', 'own', 'bud', 'buriest', 'thy', 'content', ',', 'And', 'tender', 'churl', "mak'st", 'waste', 'in', 'niggarding', ':', 'Pity', 'the', 'world

You can see that each token is stored as a separate entry of the list *tokens*.

In [21]:
# Lemmatization
lemmas = [token.lemma_ for token in doc]
print(lemmas)

['1609', '\n\n', 'the', 'sonnet', '\n\n', 'by', 'William', 'Shakespeare', '\n\n\n\n                     ', '1', '\n  ', 'from', 'fair', 'creature', '-PRON-', 'desire', 'increase', ',', 'that', 'thereby', 'beauty', "'s", 'rose', 'may', 'never', 'die', ',', 'but', 'as', 'the', 'riper', 'should', 'by', 'time', 'decease', ',', '-PRON-', 'tender', 'heir', 'may', 'bear', '-PRON-', 'memory', ':', 'but', 'thou', 'contract', 'to', 'thine', 'own', 'bright', 'eye', ',', "Feed'st", 'thy', 'light', "'s", 'flame', 'with', 'self', '-', 'substantial', 'fuel', ',', 'make', 'a', 'famine', 'where', 'abundance', 'lie', ',', 'Thy', 'self', 'thy', 'foe', ',', 'to', 'thy', 'sweet', 'self', 'too', 'cruel', ':', 'thou', 'that', 'art', 'now', 'the', 'world', "'s", 'fresh', 'ornament', ',', 'and', 'only', 'herald', 'to', 'the', 'gaudy', 'spring', ',', 'within', 'thine', 'own', 'bud', 'buri', 'thy', 'content', ',', 'and', 'tender', 'churl', "mak'st", 'waste', 'in', 'niggarde', ':', 'Pity', 'the', 'world', ',', 'o

Did you notice that almost all pronouns are substituted with `-PRON-`?

Unlike verbs and common nouns, there’s no clear base form of a personal pronoun. Should the lemma of “me” be “I”, or should we normalize person as well, giving “it”, or maybe “he”? The solution in spaCy v2, which produced the output above, is to introduce a novel symbol, `-PRON-`, used as the lemma for all personal pronouns. spaCy v3 dropped this symbol, so newer versions will not show `-PRON-`.

In [22]:
# POS tagging
poss = [token.pos_ for token in doc]
print(poss)

['NUM', 'SPACE', 'DET', 'NOUN', 'SPACE', 'ADP', 'PROPN', 'PROPN', 'SPACE', 'NUM', 'SPACE', 'ADP', 'ADJ', 'NOUN', 'PRON', 'VERB', 'NOUN', 'PUNCT', 'DET', 'ADV', 'NOUN', 'PART', 'NOUN', 'VERB', 'ADV', 'VERB', 'PUNCT', 'CCONJ', 'ADP', 'DET', 'NOUN', 'VERB', 'ADP', 'NOUN', 'NOUN', 'PUNCT', 'DET', 'NOUN', 'NOUN', 'VERB', 'VERB', 'DET', 'NOUN', 'PUNCT', 'CCONJ', 'NOUN', 'VERB', 'PART', 'VERB', 'ADJ', 'ADJ', 'NOUN', 'PUNCT', 'PROPN', 'DET', 'NOUN', 'PART', 'NOUN', 'ADP', 'NOUN', 'PUNCT', 'ADJ', 'NOUN', 'PUNCT', 'VERB', 'DET', 'NOUN', 'ADV', 'NOUN', 'VERB', 'PUNCT', 'PROPN', 'NOUN', 'DET', 'NOUN', 'PUNCT', 'ADP', 'DET', 'ADJ', 'NOUN', 'ADV', 'ADJ', 'PUNCT', 'NOUN', 'DET', 'NOUN', 'ADV', 'DET', 'NOUN', 'PART', 'ADJ', 'NOUN', 'PUNCT', 'CCONJ', 'ADV', 'NOUN', 'ADP', 'DET', 'NOUN', 'NOUN', 'PUNCT', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'ADJ', 'DET', 'NOUN', 'PUNCT', 'CCONJ', 'NOUN', 'NOUN', 'NUM', 'NOUN', 'ADP', 'VERB', 'PUNCT', 'PROPN', 'DET', 'NOUN', 'PUNCT', 'CCONJ', 'ADV', 'DET', 'NOUN', 'VERB', 'PUNCT

Compare the previous three lists' outputs.

Now let's move to sentence level features, i.e. sentence boundary detection and sentence segmentation. Unlike other libraries, spaCy uses the dependency parse to determine sentence boundaries. This is usually more accurate than a rule-based approach, but it also means you’ll need a statistical model and accurate predictions.
If your texts are closer to general-purpose news or web text, this should work well out-of-the-box. 

In [23]:
sentences = list(doc.sents)
sent_texts = [sentence.text for sentence in sentences]
print(sent_texts)
print(len(sent_texts))

['1609\n\nTHE SONNETS\n\nby William Shakespeare\n\n\n\n                     ', '1\n  ', "From fairest creatures we desire increase, That thereby beauty's rose might never die,", "But as the riper should by time decease, His tender heir might bear his memory: But thou contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine where abundance lies, Thy self thy foe, to thy sweet self too cruel:", "Thou that art now the world's fresh ornament, And only herald to the gaudy spring,", "Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding:", "Pity the world, or else this glutton be, To eat the world's due, by the grave and thee.\n\n\n                     ", '2\n  ', "When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tattered weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy 

However, for other texts like social media texts, your application may benefit from a custom rule-based implementation. You can either use the built-in Sentencizer or plug an entirely custom rule-based function into your processing pipeline.

Let's build a rule-based sentence segmentor.

In [24]:
from spacy.lang.en import English

nlp_rule = English()  # just the language with no model
# spaCy v2 API; in spaCy v3 the next two lines become nlp_rule.add_pipe("sentencizer")
sentencizer = nlp_rule.create_pipe("sentencizer")
nlp_rule.add_pipe(sentencizer)
doc_rule = nlp_rule(text)
sent_rule_texts = [sent.text for sent in doc_rule.sents]
print(sent_rule_texts)
print(len(sent_rule_texts))

["1609\n\nTHE SONNETS\n\nby William Shakespeare\n\n\n\n                     1\n  From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine where abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee.", "\n\n\n                     2\n  When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tattered weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say withi

You can clearly see the difference between the two methods: the sentence segmentor based on the statistical dependency parse generates 12 sentences, whereas the rule-based sentence segmentor yields only 4.
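
To make the idea of rule-based segmentation concrete, here is a minimal sketch that uses only the `re` module and is independent of spaCy. It is not how the spaCy Sentencizer is implemented; the function name `split_sentences` and the sample text are hypothetical, and the single rule (split after '.', '!' or '?' followed by whitespace) will fail on abbreviations such as 'e.g.'.

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    # (?<=...) is a lookbehind: the punctuation stays with its sentence.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

sample = 'Pity the world. Who will believe my verse? Look in thy glass!'
print(split_sentences(sample))
# ['Pity the world.', 'Who will believe my verse?', 'Look in thy glass!']
```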

Finally, let's have a look at the dependencies, i.e. the dependency relations between tokens, which are also used by the first model to segment sentences.

In [25]:
# Dependencies
deps = [token.dep_ for token in doc]
print(deps)

['ROOT', '', 'det', 'appos', '', 'prep', 'compound', 'pobj', '', 'ROOT', '', 'prep', 'amod', 'pobj', 'nsubj', 'compound', 'relcl', 'punct', 'nsubj', 'advmod', 'poss', 'case', 'amod', 'aux', 'neg', 'ROOT', 'punct', 'cc', 'mark', 'det', 'nsubj', 'aux', 'prep', 'compound', 'pobj', 'punct', 'poss', 'compound', 'nsubj', 'aux', 'advcl', 'poss', 'dobj', 'punct', 'cc', 'nsubj', 'ROOT', 'aux', 'xcomp', 'amod', 'amod', 'dobj', 'punct', 'amod', 'poss', 'poss', 'case', 'dobj', 'prep', 'npadvmod', 'punct', 'amod', 'pobj', 'punct', 'advcl', 'det', 'dobj', 'advmod', 'nsubj', 'relcl', 'punct', 'compound', 'compound', 'compound', 'dobj', 'punct', 'aux', 'amod', 'amod', 'pobj', 'advmod', 'advcl', 'punct', 'ROOT', 'nsubj', 'dobj', 'advmod', 'det', 'poss', 'case', 'amod', 'dobj', 'punct', 'cc', 'advmod', 'conj', 'prep', 'det', 'compound', 'pobj', 'punct', 'ROOT', 'amod', 'amod', 'compound', 'amod', 'compound', 'pobj', 'punct', 'cc', 'compound', 'conj', 'nsubj', 'dobj', 'prep', 'pcomp', 'punct', 'ROOT', 'd

Using spaCy’s built-in [displaCy visualizer](https://spacy.io/usage/visualizers), here’s what our example sentences and their dependencies look like.

In [29]:
from spacy import displacy
displacy.render(sentences, style = "dep")

We can store data, for example numbers, into **variables**, so we can use them in larger expressions.

We assign data to variables `var_name = value`, e.g. `a = 5`.

We can find out the type of a variable with the function `type(var)`.

In [None]:
a = 5 # we create variable a
b = 2 # we create variable b

In [None]:
type(a)

In [None]:
a + 5 # addition 5 + 5

In [None]:
(a + b) / 3

In [None]:
a ** b

Content of a variable can change over time!

In [None]:
a = 3
d = a + 3
d

In [None]:
a = 1
d = a + 3
d

Variables are first created when they are assigned. The expressions above work because the variables `a` and `b` had been previously assigned.

Look what happens if we try to use a variable which has never been assigned:

In [None]:
c + 5

We can freely assign new content to the same variable. Its type then changes automatically.

In [None]:
x = 3
print(type(x))

x = 3.5
print(type(x))

We can compare numerical values using the classic comparison operations:

+ less than `<` and greater than `>`
+ less than or equal `<=` and greater than or equal `>=`
+ equal `==` and not equal `!=`

Results of comparison operations are of type **boolean**.

In [None]:
a = 4
b = 3
c = 3

print(a < b)
print(b < c)
print(b <= c)

print(a == b)
print(a != b)

In [None]:
d = (a == b)
print(type(d))

### Strings

Strings represent immutable ordered sequences of characters.
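
Two points are worth checking early: what "immutable" means in practice, and how special characters such as `\n` (newline) and `\t` (tab) behave. The sketch below uses only built-ins; the variable names are chosen for illustration.

```python
s = 'hello'

# Strings are immutable: item assignment raises a TypeError.
try:
    s[0] = 'j'
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)

# Escape sequences: \n is a newline and \t is a tab, one character each.
two_lines = 'hello\nmum'
tabbed = 'hello\tmum'
print(two_lines)
print(tabbed)
print(len(two_lines))  # 9: five letters + newline + three letters
```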


In [None]:
s1 = '' # empty string
s2 = 'hello'
s3 = 'mum'
s4 = 'hello mum'
s5 = '2345'

We can perform a wide range of operations over strings:

In [None]:
# print the length of the string

print(len(s1))
print(len(s2))
print(len(s3))
print(len(s5))

Strings can be concatenated and repeated:

In [None]:
print(s2 + s3) # we concatenate s2 and s3
print(s2 * 3)  # we repeat s2 three times
print(s2 + " " + s3)

Notice that `s2 + " " + s3` is equal to the string `"hello mum"`:

In [None]:
(s2 + " " + s3) == s4

Other common operations over strings are *indexing* and *slicing*.

Notice that in Python (as in many other programming languages) indexing starts from zero!

In [None]:
s = 'this is an example'

# indexing (to access a character in the string)
print(s[0]) # print the first character of the string
print(s[1]) # print the second character of the string
print(s[-1]) # print the last character of the string (backwards indexing)

# slicing (to extract substrings)
print(s[0:2])
print(s[:2]) # the same as s[0:2]
print(s[1:2])

We can find specific substrings in it using the `find` operation:

In [None]:
print(s.find("example"))  # returns the index where the given substring starts
print(s.find("examples")) # returns -1: the substring has not been found

We can also build a modified copy of it with the `replace` operation:

In [None]:
d = s.replace("example", "book")

print(d) # modified string is stored in the variable d
print(s) # Notice that the string assigned to variable s is not modified! Strings are immutable

The `replace` operation will replace _all_ mentions of the given substring:

In [None]:
s = s.replace(" ", "_")
print(s)

Other useful operations on strings are:

In [None]:
x = "This is a nice University"

# convert to upper/lowercase
print(x.upper())
print(x.lower())

# join the characters of x with a given delimiter
print("-".join(x))
print("*".join(x))

# splits string at delimiter.
# creates a list (see below) with the obtained substrings
print(x.split("nice"))  
print(x.split(" "))     # delimiter found multiple times.
print(x.split("x"))     # delimiter not found. Creates a list with the entire string as the only element

**Notes on conversion**

We can always convert an integer to its string representation:

In [None]:
a = 142
d = str(a)

print(a)
print(d)

print(type(a))
print(type(d))

print("142" == 142)

We usually cannot mix strings and numbers in operations, and when an operation is allowed the result may differ from what we expect. The following addition raises a `TypeError`:

In [None]:
"142" + 4

Before combining them, we need to convert the values to the desired data type!

In [None]:
int("142") + 4

It is important to always remember the data type of our variables.

If not, we could obtain results different than expected...

In [None]:
a = "3"
b = "4"

print(a + b)
print(int(a) + int(b))

### Lists and Sets

Lists are positionally ordered collections of arbitrarily typed objects.

In [None]:
# a list of three different object types
L=[1,'a',True]
print(L)
# the length of the list
len(L)

We can index, slice, and so on, just as for strings:

In [None]:
#index
print(L[0])

#slice
print (L[:-1])
#concatenate with another list
print(L+[2,3,4])

Further, lists have no fixed size. That is, they can grow and shrink on demand, in response to list-specific operations:

In [None]:
L=[1,'a',True]

#append at the end
L.append('b')
print (L)

# delete the item at index 1 and return the deleted item
print (L.pop(1))
# delete an item at index 0
del L[0]
print (L)
# delete the first matching item by value in a list:
L.remove('b')
print(L) # removes 'b'

#insert: L.insert (position, item): insert an item at position of L
L.insert(0,'a')
print (L)




In [None]:
# sort the list by ascending order
L=[1,4,3]
L.sort()
print (L)
# reverse the sorted list to get descending order
L.reverse()
print (L)

Note: Because lists are mutable, most list methods modify the list directly instead of creating a new one (compare this with strings).

Advanced: nesting and list comprehensions
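
A short sketch of nesting and list comprehensions, together with the contrast between mutable lists and immutable strings. The names `matrix`, `squares` and `evens` are made up for illustration.

```python
# Nesting: a list can contain other lists.
matrix = [[1, 2, 3], [4, 5, 6]]
print(matrix[1][0])   # 4: row 1, column 0

# List comprehension: build a new list from an existing sequence.
squares = [n ** 2 for n in [1, 4, 3]]
print(squares)        # [1, 16, 9]

# A comprehension can loop over nested lists and filter with a condition.
evens = [n for row in matrix for n in row if n % 2 == 0]
print(evens)          # [2, 4, 6]

# sort() changes the list itself, while str.upper() returns a new string.
L = [3, 1, 2]
L.sort()
s = 'abc'
s.upper()
print(L, s)           # [1, 2, 3] abc
```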

Sets are unordered collections of unique and immutable objects. 

In [None]:
# create a set:
a={1,'a','b'}
# create a set from a list: will only maintain the unique items
b=set([1,'a','a','b'])
print (a)
print (b)

### Dictionaries

Dictionaries are a mutable mapping type that maps keys to their associated values.

In [None]:
D = {'food': 'Spam', 'quantity': 4, 'color': 'pink'}
print (D['food']) # Fetch value of key 'food'
D['quantity'] = 1 # assign a new value
D['size']=10 #create a new key by assignment
print (D)

We will encounter a key error if we fetch a key that does not exist. 

In [None]:
D['shape']

We can use the `.get()` method to return a default value instead:

In [None]:
print(D.get('color'))
print(D.get('shape',0))

Advanced: sorting the keys
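
One common way to visit the keys in order is the built-in `sorted()` function, sketched below on the dictionary from the earlier cell.

```python
D = {'food': 'Spam', 'quantity': 4, 'color': 'pink'}

# sorted() on a dictionary returns a list of its keys in sorted order.
for key in sorted(D):
    print(key, '=>', D[key])

keys = sorted(D)
print(keys)   # ['color', 'food', 'quantity']
```

The dictionary itself is left unchanged; `sorted()` builds a new list.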

### Tuples

Tuples are sequences, like lists, but they are immutable, like strings.

Tuples support only two methods, `index` and `count`:

In [None]:
T = (1, 2, 3, 4,2)
print(T.index(4)) # the index of the first matching 4 in the tuple 
print (T.count(2)) # the number of times 2 occurs in the tuple

Because tuples are immutable, we cannot change them (e.g. by item assignment or appending) once they are created.

In [None]:
T[0]=2 # raises a TypeError

### Files

File objects are Python code’s main interface to external files on your computer. To create a file object, you call the built-in open function, passing in an external filename and a processing mode as strings. 

In [None]:
# create a file with 'w' (write) processing mode
f=open('test', 'w')
#write some strings to the file
f.write('hello world\n')
# don't forget to close the file
f.close()

In [None]:
# Even better, a with statement closes the file automatically
with open('test','w') as f:
    f.write('hello world\n')
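
Reading works the same way with the `'r'` mode. The sketch below writes to a temporary directory (an assumption made here so that no existing file is overwritten) and then reads the content back.

```python
import os
import tempfile

# Write to a temporary directory so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'test')

with open(path, 'w') as f:
    f.write('hello world\n')
    f.write('second line\n')

# 'r' (read) is the default mode; read() returns the whole file as one string.
with open(path, 'r') as f:
    content = f.read()

# Iterating over a file object yields one line at a time.
with open(path) as f:
    lines = [line.rstrip('\n') for line in f]

print(repr(content))  # 'hello world\nsecond line\n'
print(lines)          # ['hello world', 'second line']
```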

## Statements and Loops

In Python, we normally code one statement per line and indent all the statements in a nested block by the same amount.

Assignment Statement

In [None]:
# basic assignment
count = 0
print (count)

# augmented assignment
count+=1 # equivalent to count=count+1
print (count)

# sequence assignment
a, b, c, d = 'spam' 
print (a)

#multiple target assignment
spam=ham='spam'
print (spam)
print (ham)

In Python, you can also use an expression as a statement, that is, on a line by itself. This is typical for calls made for their effect rather than their result, such as in-place list methods, which return None.

In [None]:
# for example, in-place list methods return None
L=[1,2,3]
a=L.sort()
print (a)

# compare with sorted() function
b=sorted(L)
print (b)

### IF statements

In simple terms, the Python if statement selects actions to perform. 

It takes the form of an if test, followed by one or more optional elif (“else if”) tests and a final optional else block. The tests and the else part each have an associated block of nested statements, indented under a header line.

In [None]:
#check if x is negative, 0 or positive, and print accordingly
x=1
if x<0:
    print ('negative')
elif x==0:
    print ('0')
else:
    print ('positive')

### FOR and WHILE loops

For and while loops are statements that repeat an action.

The first of these, the while statement, provides a way to code general loops. 

In [None]:
# a loop that strips the first character of a string on each iteration
x = 'spam'
while x: # While x is not empty
    print(x)
    x = x[1:] # Strip first character off x

We can add `break` and `continue` statements to the loop:

In [None]:
x = 'spam'
while x: # While x is not empty
    print(x)
    x = x[1:] # Strip first character off x
    if len(x)<2:
        break # stop when the length of the string is shorter than 2
    else:
        continue # else, continue to the next iteration in the loop

The second, the for statement, is designed for stepping through the items in a sequence object and running a block of code for each. The built-in range function produces a series of successively higher integers, which can be used as indexes in a for loop:

In [None]:
# Let's rewrite the previous while loop as a for loop:
x='spam'
for i in range(len(x)): # we need to specify the number of iterations in the loop.
    print (x)
    x=x[1:]
    if len(x)<2:
        break
    else:
        continue
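
Because a for loop steps through the items of a sequence, an index is often not needed at all. The sketch below shows direct iteration, `enumerate()` for when the position matters, and `range()` used to rebuild the suffixes printed by the first while loop; the name `suffixes` is chosen for illustration.

```python
x = 'spam'

# Iterate over the characters directly; no index is needed.
for ch in x:
    print(ch)

# enumerate() yields (index, item) pairs when the position is needed too.
for i, ch in enumerate(x):
    print(i, ch)

# The successive suffixes of x, built with range().
suffixes = [x[i:] for i in range(len(x))]
print(suffixes)   # ['spam', 'pam', 'am', 'm']
```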

## Functions

## How to import Modules

Some useful packages for NLP:

+ NLTK
+ numpy
+ gensim (?)

## How to install Python and use a console

## Further Readings

+ [Learning Python](http://shop.oreilly.com/product/0636920028154.do)
+ [Python Cookbook](https://www.oreilly.com/library/view/python-cookbook-3rd/9781449357337/)
+ [Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit](http://shop.oreilly.com/product/9780596516499.do) (https://www.nltk.org/book/)