# Python basics 2: Text

This notebook contains more basics of Python. Use it as a reference whenever needed.

## Python Syntax

### The Significant Whitespace

Most programming languages use characters (e.g. `{...}`) or keywords (e.g. `begin ... end`) to delimit blocks of code. When writing Python code, however, you rely on **indentation** to structure your programs.

All programming languages allow you to indent (and you should!), but in Python you **have to.** Otherwise, you'll receive an `IndentationError` and your code won't work!

#### How Indentation Works

- All statements at the same distance from the left margin belong to the same block of code. This is related to the _scope_ of your code.
- Sub-blocks are indented further; a block ends at the first line that is indented less.
- It is recommended to use **4 spaces** per indentation level. However, a tab character (`\t`) can also be used. Most editors have a setting that automatically translates tabs to 4 spaces.
- It's good practice to avoid lines of code longer than 79 characters. When a statement is too long, it can be split by ending the line with a backslash (`\`).
- **Never mix** spaces and tabs in a single source file. Python 3 raises an error when the indentation is inconsistent, and the problem is very hard to spot by eye. To help you, you can set your text editor to display whitespace characters.
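
The rules above can be tried out in a small sketch. The variable names are made up for illustration, and the built-in `compile()` function is used here only to show that badly indented source is rejected before it runs.

```python
# Two equivalent ways to split a long statement over several lines.
total_backslash = 1 + 2 + 3 + \
    4 + 5

# Inside parentheses no backslash is needed.
total_parentheses = (1 + 2 + 3
                     + 4 + 5)

print(total_backslash == total_parentheses)

# The body of the `if` below is not indented, so Python refuses it.
bad_source = "if True:\nprint('not indented')"
try:
    compile(bad_source, "<example>", "exec")
except IndentationError as error:
    print("IndentationError:", error.msg)
```

PEP 8 generally prefers the parentheses variant over the backslash when both are possible.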


#### Recommended Reading on Python syntax and style
[PEP 8 - Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/)

##### Indentation makes the code more readable:


```python
# input() reads from standard input (e.g. keyboard)
n_string = input('Enter a number, please:')

if not n_string.isdigit():
    print("This isn't a number...")
else:
    n = int(n_string)
    if n == 0:
        print("Zero? Why zero?")
    elif n % 2 == 0:
        print("Even")
    else:
        print("Odd")
```

### Strings

Strings are **immutable** sequences of characters (so they are **homogeneous**).

In Python, strings are delimited by single (`'`) or double (`"`) quotes, and their elements are contiguous.

In [None]:
demo_string = "Does Cersei have any friends?"
print(demo_string)

In [None]:
# Single and double quotes are interchangeable
demo_string_single = 'Does Cersei have any friends?'
demo_string == demo_string_single

In [None]:
# Remember: Digits are not numbers! (frequent source of debugging frustration)

print(type("1979") == type(1979))

In [None]:
print('"1979" type is: ' + str(type("1979")))

In [None]:
print('1979 type is: ' + str(type(1979)))
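
A short sketch of why the difference matters in practice: the `+` operator does something different for each type. The variable names are only illustrative.

```python
# Adding strings glues them together; adding integers does arithmetic.
as_text = "19" + "79"
as_numbers = int("19") + int("79")

print(as_text)
print(as_numbers)
```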

In some ways, strings behave the same as lists: being sequences, strings support the common sequence operations. However, because they are immutable, they do not support item assignment.
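
As a minimal sketch (the variable names are made up for illustration), indexing and slicing work on a string, but assigning to an index does not. To "change" a string you build a new one.

```python
word = "Cersei"

print(word[0])    # indexing works, as with lists
print(word[:3])   # so does slicing

try:
    word[0] = "K"  # item assignment is not allowed on strings
except TypeError as error:
    print("TypeError:", error)

# Build a new string instead of modifying the old one.
new_word = "K" + word[1:]
print(new_word)
```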

In [None]:
# You can check the length (in characters, whitespaces count!) of a string
len(demo_string)

In [None]:
# Strings can be looked for in other strings
"Cersei" in demo_string

In [None]:
# How many 'a's do we have in our string?
demo_string.count("a")

In [None]:
# Concatenation is possible
demo_string += "\nNoway!"
demo_string

In [None]:
print(demo_string)

The escape sequence `\n` indicates the end of the line. 

Python escape sequences are introduced by the escape character `\`, whose goal is to signal the interpreter that the following character has an "unusual" interpretation. Here is a partial list:

| Escape Sequence|         Meaning|
|:--------------:|:---------------:|
|              \\\\|  backslash|
|              \\'|  single quote|
|              \\b|  backspace|
|              \\n|  new line|
|              \\t|  horizontal tab|


In [None]:
# Nicely print our string!
print(demo_string)

In [None]:
# What does a backspace do?
print(demo_string + "\b")

In [None]:
# The escape character may be useful when you need single quotes inside a single quote-marked string...
'Can any of you pronounce \'s-Hertogenbosch?'

In [None]:
# but this solution is preferred (when possible)
"Can any of you pronounce 's-Hertogenbosch?"

Strings that span multiple lines can be written in a readable form by using triple quotes (`"""`) as delimiters.

In [None]:
print("""Unsealed, on a porch a letter sat
Then you said, "I wanna leave it again"
Once I saw her on a beach of weathered sand
And on the sand I wanna leave her again""")

In some ways, strings behave the same as lists. For instance, you can check the length of a string with `len()`, or check whether a character sequence is part of a string with `in`.

### Quiz

Run the cell below. Where do all the '\n's come from?

In [None]:
""""On a weekend I wanna wish it all away, yeah
And they called and I said that I'll go
And I said that I'll call out again
And the reason I ought ta leave her calm, I know
I said, "I don't know whether I'm the boxer or the bag"""""

#### String Methods

Strings have a bunch of dedicated methods (see the [documentation](https://docs.python.org/3.8/library/stdtypes.html) for a complete list) that allow them to be inspected or manipulated (the original string is not modified; rather, a **new object** is returned).
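
To see that a method returns a new object and leaves the original alone, here is a minimal sketch with made-up variable names:

```python
original = "winter is coming"
shouted = original.upper()

print(original)  # unchanged
print(shouted)   # the new string returned by .upper()

# To keep the result, assign it to a name (possibly the same one).
original = original.capitalize()
print(original)
```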

The following are used the most:

In [None]:
# Is the string composed solely of 1. digits 2. alphabetic characters 3. both?
print('100'.isdigit())
print('cat'.isalpha())
print('my cat is 100'.isalnum())  # Why False?

In [None]:
print(demo_string)

The `.startswith()` and `.endswith()` methods are useful:

In [None]:
# Does the string start or end with a given sequence of characters?
print(demo_string.startswith("d"))  # it is case sensitive !!!
print(demo_string.endswith("Noway!"))

If you want to compare strings, or want to have them stored in a normalized way, you can use the `.upper()` and `.lower()` methods:

In [None]:
# Change case to all the characters of a string
print(demo_string_single.upper())
print(demo_string_single.lower())

If you want to get rid of unwanted characters at the beginning or at the end of a string, you can use `.strip()`. This is commonly used to remove whitespace (i.e. spaces, linebreaks, tabs) from both ends of the string.

In [None]:
# Remove a given character (default is any whitespace) from the beginning and the end of a string

text1 = "Twice minus: - before and after -"
text2 = "  \t  Too much space?"

print("Before:")
print(text1)
print(text2)
print()

print("After:")
print(text1.strip("-"))
print(text2.strip())
print()

You can replace or remove one or more character sequences in a string by using `.replace()`:

In [None]:
# Replace a given sequence of characters with another 
print(demo_string.replace("Cersei", "Melisandre"))

In [None]:
# Or if you want to completely remove a character or series of characters,
# simply replace it by an empty string! You can also chain these. 
print(demo_string.replace("?", "").replace("!", ""))

#### From strings to lists

A string can be transformed into a list of strings by splitting it on a given separator. This is done through the `.split()` method.

In [None]:
# By whitespace
demo_string.split(" ")

In [None]:
# However, the default separator is any run of whitespace (that's convenient!)
demo_string.split()

In [None]:
# The maximum number of splits can be specified
demo_string.split(" ", 2)
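
The difference between `.split(" ")` and `.split()` shows up as soon as a string contains double spaces, tabs or newlines. A small sketch with an invented example string:

```python
spaced = "one  two\tthree\nfour"

# Splitting on a single space keeps empty strings and ignores tabs/newlines.
on_space = spaced.split(" ")

# Splitting without an argument treats any run of whitespace as one separator.
on_whitespace = spaced.split()

print(on_space)
print(on_whitespace)
```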

#### From lists to strings

The inverse operation is also possible: a list of strings can be joined with a separator string by calling `.join()` on that separator. Its argument is the list.

In [None]:
# A whitespace
" ".join(["One","Two","Three"])

In [None]:
# A hyphen
"-".join(["One","Two","Three"])

In [None]:
# Any string basically
predefined_joinchar = "🌞"
predefined_joinchar.join(["Eins","Zwei","Drei"])

In [None]:
# Or no characters at all
"".join(["Super", "cali", "fragilistic", "expiali", "docious"] )

Keep in mind that you can only join elements of type string. 
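
If a list holds items that are not strings, one possible workaround is to convert each item with `str()` before joining. This is only a sketch; `map()` is a built-in that applies a function to every item, and the variable names are illustrative.

```python
parts = ["This", "is", "notebook", "number", 1]

# Convert every item to a string first, then join.
sentence = " ".join(map(str, parts))

print(sentence)
```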

In [None]:
# Raises a TypeError
" ".join(["This", "is", "notebook", "number", 1])

### Dictionaries

Dictionaries are **associative arrays** mapping keys of **immutable** types (strings, numbers, tuples...) to arbitrary objects of any kind (all the datatypes you saw earlier, variables, functions, modules...). Intuitively, they can be thought of as collections of objects that we can recall by means of a unique key.
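
A small sketch of what "immutable keys" means in practice. The `opening_hours` dictionary is invented for illustration.

```python
opening_hours = {
    "Monday": "9-17",        # a string as key
    7: "closed",             # a number as key
    ("Dec", 25): "closed",   # a tuple as key
}

try:
    opening_hours[["Dec", 26]] = "closed"  # a list is mutable, so not allowed
except TypeError as error:
    print("TypeError:", error)

print(opening_hours[("Dec", 25)])
```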

To visualize a Python dictionary you can think of a telephone book, in which people's names are the unique keys that you use to retrieve different kinds of information (phone numbers, street address, mail address...). The same telephone number, street address or other information can be present in the entries of several people, but a key cannot be associated with more than one entry.

In text processing, dictionaries are often used for counting words, where each word is a key and the number of times it occurs in the text is the value.
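
Word counting could look roughly like the sketch below. It assumes you have already met `for` loops and `if` statements, and the sentence is made up for illustration.

```python
sentence = "the cat and the dog and the bird"

word_counts = {}
for word in sentence.split():
    if word in word_counts:
        word_counts[word] += 1   # seen before: add one to the count
    else:
        word_counts[word] = 1    # first occurrence: create the key

print(word_counts)
```

The standard library also has `collections.Counter` for this job, but writing it out by hand shows how keys and values work together.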

In Python, dictionaries are defined by curly brackets `{}`, in which key-value pairs are separated by commas and each key is joined to its value by a colon.

In [None]:
# An English-Dutch dictionary of colors

color2kleur = {
    "black": "zwart",
    "white": "wit",
    "red": "rood",
    "yellow": "geel"
}

print(color2kleur)

In [None]:
# Values can be recalled by their keys
color2kleur["white"]

In [None]:
# We can change a value associated with a key
color2kleur["white"] = "sneeuwwit"

print(color2kleur)

In [None]:
# If the key is missing a new key: value pair is added
color2kleur["blue"] = "blauw"

print(color2kleur)

In [None]:
# A key: value pair can be deleted with the "del" statement
del color2kleur["blue"]
print(color2kleur)

In [None]:
# Check if a dictionary has a given key
"blue" in color2kleur

In [None]:
# Count the number of entries in a dictionary
len(color2kleur)

### Quiz

When you use `in` on a dictionary, as in the `"blue" in color2kleur` check above, Python looks at the dictionary's keys, much like using the `.keys()` method. Try using the following methods below on the `color2kleur` dictionary:

1. `.keys()` vs. just calling the dictionary by `color2kleur`
2. `.values()`
3. `.items()`
    
What is the datatype that is returned for each of these methods? And what is the datatype inside these 'iterables'?

In [None]:
# Your code here

#### Iterating over a Dictionary

In [None]:
# iterate over dictionary keys:
print(list(color2kleur))
print(list(color2kleur.keys()))

In [None]:
# iterate over dictionary values:
print(list(color2kleur.values()))

In [None]:
# iterate over dictionary key-value pairs:
print(list(color2kleur.items()))
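
In practice you will often loop over a dictionary with `for` rather than convert it to a list. A minimal sketch, using a shortened copy of the color dictionary under the hypothetical name `kleuren`:

```python
kleuren = {"black": "zwart", "white": "wit"}

lines = []
# .items() yields (key, value) tuples that can be unpacked in the loop header.
for english, dutch in kleuren.items():
    lines.append(english + " -> " + dutch)

print(lines)
```

Dictionaries keep their insertion order (Python 3.7 and later), so the pairs come out in the order they were added.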

## String formatting

Instead of calling `print()` with multiple arguments, or concatenating strings with `+`, you can also use Python's formatted string literals, called `f-strings`. More information can be read in PEP 498: https://www.python.org/dev/peps/pep-0498/

You can define a string as a template by inserting `{ }` characters with a variable name or expression in between. For this to work, you have to type an `f` in front of the `'`, `"` or `"""` that starts the string definition. When the string is evaluated, each placeholder is replaced by the string value of the variable or expression.

```python
name = "Joe"
text = f"My name is {name}."
```
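
To compare the two approaches, here is a small sketch. The variable names are illustrative. Note that concatenation needs an explicit `str()` for numbers, while the f-string converts the value for you.

```python
n_messages = 8

by_concatenation = "You have " + str(n_messages) + " unread messages."
by_fstring = f"You have {n_messages} unread messages."

print(by_concatenation)
print(by_fstring)
```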

Again, if you need a `'` or `"` in your expression, use the other variant in the Python source code to declare the string. Writing:

```python
f'This is my {example}.'
```

is equivalent to:

```python
f"This is my {example}."
```

In [None]:
name = "Joe"
text = f"My name is {name}."

print(text)

In [None]:
day = "Monday"
weather = "Sunny"
n_messages = 8

test_dict = {'test': 'test_value'}

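# The `=` inside the last placeholder (Python 3.8+) prints the expression as well as its value
text = f"""
Today is {day}.
The weather is {weather.lower()} and you have {n_messages} unread messages.
The first three letters of the weekday: {day[:3]}
An example expression is: {15 ** 2 = }
"""

print(text)

# Use the other quote type for a dictionary key inside the braces
text = f'Test by selecting key: {test_dict["test"]}'

print(text)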

---

## Regular expressions

Regular expressions can be very useful when working with texts. They are a powerful search mechanism that lets you search for patterns instead of 'exact matches'. They can, however, be difficult to grasp at first sight.

A **regular expression** allows you, for instance, to substitute all digits that follow another text sequence, or to find all URLs, phone numbers, or email addresses in a text. Or any text that meets a particular condition.

See the Python manual for the `re` module for more info: https://docs.python.org/3/library/re.html

You can/should use a cheatsheet when writing a regular expression. A nice website to write and test them is: https://regex101.com/. 

Some examples of commonly used expressions:

* `\d` for all digits 0-9
* `\w` for any word character
* `[abc]` for a set of characters (here: a, b, c)
* `.` any character
* `?` the preceding character/pattern 0 or 1 times
* `*` the preceding character/pattern 0 or more times
* `+` the preceding character/pattern 1 or more times
* `{1,2}` 1 or 2 times
* `^` the start of the string
* `$` the end of the string
* `|` or
* `()` capture group (only return this part)
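
A few of the patterns listed above can be tried out with `re.findall()`, as in the sketch below. The sample sentence and the variable names are invented for illustration.

```python
import re

sample = "Call 06-12345678 or mail info@example.org before 17:30."

digit_runs = re.findall(r"\d+", sample)             # one or more digits
times = re.findall(r"\d{2}:\d{2}", sample)          # two digits, a colon, two digits
spellings = re.findall(r"colou?r", "color colour")  # the "u" is optional
pets = re.findall(r"cat|dog", "cat dog bird")       # either alternative
user_names = re.findall(r"(\w+)@", sample)          # only the captured group is returned
starts_with_call = re.findall(r"^Call", sample)     # anchored at the start of the string

print(digit_runs, times, spellings, pets, user_names, starts_with_call)
```

The `r` in front of each pattern makes it a raw string, so the backslashes reach the `re` module untouched.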

In many text editors (e.g. VSCode) there is also an option to search (and replace) with the help of regular expressions. 

Python has a regex module built in. When working with a regular expression, you have to import it first:

In [None]:
import re

You can use a regular expression for **finding** occurrences in a text. Let's say we want to extract all web URLs from a text:

In [None]:
text = """
There are various search engines on the web. 
There is https://www.google.com/, but also https://www.bing.com/. 
A more privacy friendly alternative is https://duckduckgo.com/. 
And who remembers http://www.altavista.com/?
"""

re.findall(r'https?://.+?/', text)

In [None]:
# Copied from https://www.imdb.com/search/title/?groups=top_250&sort=user_rating

text = """
1. The Shawshank Redemption (1994)
12 | 142 min | Drama

 9,3  Rate this 80 Metascore
Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.

Director: Frank Darabont | Stars: Tim Robbins, Morgan Freeman, Bob Gunton, William Sadler

Votes: 2.355.643 | Gross: $28.34M

2. The Godfather (1972)
16 | 175 min | Crime, Drama

 9,2  Rate this 100 Metascore
An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.

Director: Francis Ford Coppola | Stars: Marlon Brando, Al Pacino, James Caan, Diane Keaton

Votes: 1.630.157 | Gross: $134.97M

3. The Dark Knight (2008)
16 | 152 min | Action, Crime, Drama

 9,0  Rate this 84 Metascore
When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.

Director: Christopher Nolan | Stars: Christian Bale, Heath Ledger, Aaron Eckhart, Michael Caine

Votes: 2.315.134 | Gross: $534.86M
"""

titles = re.findall(r'\d{1,2}\. (.+)', text)
titles

### Quiz
Try to get a list of all directors. And the gross income. 

In [None]:
# All directors


In [None]:
# Gross income


Or, you can use a regular expression to **replace** a character sequence. This is similar to the `.replace()` string method, but allows more flexible matching.

In [None]:
text = """
Tim Robbins
Morgan Freeman
Bob Gunton
William Sadler
Marlon Brando
Al Pacino
James Caan
Diane Keaton
Christian Bale
Heath Ledger
Aaron Eckhart
Michael Caine
"""

# Hint: test this with https://regex101.com/
new_text = re.sub(r"(?:(\w)\w+) (\w+)", r"\1. \2", text)

print(new_text)
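
Two smaller `re.sub()` calls may make the mechanics easier to follow. This is only a sketch with made-up variable names: the first call replaces every digit, the second reuses captured groups through `\1` and `\2` in the replacement.

```python
import re

# Replace every single digit by a hash sign.
masked = re.sub(r"\d", "#", "Votes: 2.355.643")

# Group 1 captures the first letter, group 2 the last name.
initial = re.sub(r"(\w)\w+ (\w+)", r"\1. \2", "Tim Robbins")

print(masked)
print(initial)
```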

---

# Exercises

## Reading
* Have another look at Python's built-in types: https://docs.python.org/3.8/library/stdtypes.html. Be sure to check out the part on [string methods](https://docs.python.org/3.8/library/stdtypes.html#string-methods) in particular. You can skip the other sections. 

## Tasks
* Try out the quizzes that we did not get to in the lecture!
* Try the exercises below.

---

### Exercise 1

Italian nobles tend to have an awful lot of names. For instance, "Vittorio Emanuele di Savoia" (or "Vittorio" for close friends) has the 12 names listed in `full_name`.

Can you find a pythonic way to eliminate the less used names from this string?

In [None]:
full_name = "Vittorio Emanuele Alberto Carlo Teodoro Umberto Bonifacio Amedeo Damiano Bernardino Gennaro Maria di Savoia"

In [None]:
# Your code here


### Exercise 2

The code in the next cell creates a variable called `zen_text`, and assigns it a nicely formatted version of the textual elements of the Zen of Python.

* Count the number of the following in this manifesto:
    1. Characters
    2. Words
    3. Unique words
    4. Non-empty lines
    

In [None]:
import this
zen_text = ''.join(this.d.get(el, el) for el in this.s)  # forget this complicated pattern for now

In [None]:
# Your code here



### Exercise 3
Remember the quiz from the previous notebook with the following code:
    
```python
homogeneous_list = ["1", "56", "33", "8", "220", "9"]
homogeneous_list.sort()
homogeneous_list
```

Try to fix this so that it sorts the list as if its elements were integers!

Can you then do the same for this list, so that Elephant is after canary?

```python
homogeneous_list = ["canary", "hippo", "kangaroo", "narwhal", "Elephant", "raccoon", "yak", "ant"]
homogeneous_list.sort()
homogeneous_list
```

In [None]:
# Your code here

### Exercise 4

The dictionaries in the following cell give, for each Scrubs character, the age of the character (`scrubs2age`) and its credentials (`scrubs2cred`).

Write code to answer the following questions:

- What are the **names of the characters** in the cast?

- What is the **average age** of the characters?

- How **many M.D.s** are there in the main cast?

- Which character is not listed in the credentials dictionary?

Hint: check the dictionary and set methods

In [None]:
scrubs2age = {
     'Bob Kelso': 70,
     'Carla Espinosa-Turk': 36,
     'Christopher Turk': 31,
     'Elliot Reid': 29,
     'J.D.': 31,
     'Janitor': 40,
     'Perry Cox': 45
  }

scrubs2cred = {
     'Bob Kelso': 'M.D.',
     'Carla Espinosa-Turk': 'RN',
     'Christopher Turk': 'M.D.',
     'Elliot Reid': 'M.D.',
     'J.D.': 'M.D.',
     'Perry Cox': 'M.D.'
  }

In [None]:
# Your code here

### Exercise 5

A string is stored in the `coe_books` variable. 

* How many different books by Jonathan Coe are listed?
* Check if the string starts with 'the accidental woman'
* Print this text titlecased (every word starts with a capital letter, ignore the prepositions for now). And uppercased. And capitalized.
* Convert the string to a list of words. Two of the book titles contain digits: 58 and 11. Find their indices in the list and replace each string with the number as an integer. Print the list.

In [None]:
coe_books = """the accidental woman
a touch of love
the dwarves of death
what a carve up! or the winshaw legacy viking
the house of sleep
the rotters' club
the closed circle
the rain before it falls
the terrible privacy of maxwell sim
expo 58
number 11
"""

In [None]:
# Your code here





## Graded Assignment Week 2

### To be announced on Canvas