# Text processing, web scraping (1)

## Basic string processing

In [146]:
# y'all remember that, right ?

In [1]:
# string normalization:
# .lower
# .strip

s = " Blois tours  amiens\n"

In [2]:
s.strip()

'Blois tours  amiens'

In [9]:
[str.lower(e) for e in s.strip().split(" ") if e!=""]

['blois', 'tours', 'amiens']
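The two steps above can be folded into a small helper. This is only a minimal sketch, and the name `normalize` is ours, not part of the notebook. It relies on `str.split()` called without an argument, which splits on any run of whitespace and ignores leading and trailing whitespace, so neither `strip` nor the empty-string filter is needed.

```python
def normalize(s):
    """Lowercase a string and split it into whitespace-separated tokens."""
    # split() with no argument collapses runs of whitespace and drops
    # leading/trailing whitespace (including the final newline)
    return s.lower().split()

tokens = normalize(" Blois tours  amiens\n")
print(tokens)  # ['blois', 'tours', 'amiens']
```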

## String similarity measure

Given two strings, how do we decide whether they are the same?

- Hamming distance: number of positions where characters differ (defined for two strings of the same length)
- edit distance (Levenshtein distance): minimum number of single-character edits (insertion, deletion, substitution) needed to transform one string into the other.


In [10]:
t1 = "londres"
t2 = "london"
t3 = 'moscow'

#1: write a function which computes the Hamming distance between two strings of the same length



In [12]:
def compare_1(s1,s2):
    n = len(s1)
    c = 0
    for i in range(n):
        if s1[i] != s2[i]:
            c +=1
    return c

In [14]:
compare_1(t2, t3)

4

In [15]:
compare_1(t2, t1)  # lengths differ (6 vs 7): only the first 6 characters are compared

2

In [16]:
import numpy as np

In [24]:
def compare_2(s1, s2):
    # vectorized version: compare the two character arrays element-wise
    return int(np.sum(np.array(list(s1)) != np.array(list(s2))))

In [25]:
compare_2(t3,t2)

4
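Both versions above silently accept strings of different lengths (or fail in a confusing way), although the Hamming distance is only defined for equal lengths. As a hedged alternative, here is a pure-Python sketch that checks the lengths explicitly; the name `hamming` and the choice of raising `ValueError` are our assumptions, not part of the notebook.

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs strings of equal length")
    # zip pairs up characters position by position; True counts as 1 in sum
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming("london", "moscow"))  # 4

try:
    hamming("london", "londres")
except ValueError as err:
    print("error:", err)
```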

#2: write a function which computes the Levenshtein distance
(compare with the Hamming distance)


In [27]:
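from difflib import SequenceMatcher
# Note: ratio() is a similarity score in [0, 1] (1.0 means identical),
# not the Levenshtein distance itself.
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()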

In [28]:
similar(t1,t2)

0.6153846153846154

In [29]:
similar(t2, t3)

0.3333333333333333

In [31]:
s1 = "Francois Hollande"
s2 = "Hollande Francois"
s3 = "Theresa May"

In [32]:
similar(s1,s2)

0.47058823529411764

In [33]:
similar(s1,s3)

0.2857142857142857
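`SequenceMatcher.ratio()` returns a similarity score between 0 and 1, not a count of edits. For an actual edit count, here is a minimal sketch of the classic dynamic-programming computation of the Levenshtein distance. The function name `levenshtein` is ours. It keeps only the previous row of the table, so memory stays proportional to the length of one string.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] = distance between the current prefix of a and b[:j];
    # for the empty prefix of a, that is simply j insertions
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("londres", "london"))  # 3
print(levenshtein("london", "moscow"))   # 4
print(levenshtein("kitten", "sitting"))  # 3
```

Unlike the Hamming distance, this works for strings of different lengths.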

#3: show the difference between the Hamming and Levenshtein distances on a simple example

In [34]:
s1 = "0123456"
s2 = "1234560"
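A cyclic shift is a good test case: every position changes, yet only two edits are needed. The sketch below is self-contained. `hamming` and `lev` are illustrative names, and `lev` follows the recursive definition of the edit distance directly (memoized with `functools.lru_cache`) rather than being an optimized implementation.

```python
from functools import lru_cache

def hamming(s1, s2):
    # assumes both strings have the same length
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def lev(a, b):
    @lru_cache(maxsize=None)
    def d(i, j):
        # edit distance between the suffixes a[i:] and b[j:]
        if i == len(a):
            return len(b) - j   # insert the rest of b
        if j == len(b):
            return len(a) - i   # delete the rest of a
        return min(
            d(i + 1, j) + 1,                    # delete a[i]
            d(i, j + 1) + 1,                    # insert b[j]
            d(i + 1, j + 1) + (a[i] != b[j]),   # substitute, or keep if equal
        )
    return d(0, 0)

s1 = "0123456"
s2 = "1234560"
print(hamming(s1, s2))  # 7: every position differs
print(lev(s1, s2))      # 2: delete the leading "0", append "0" at the end
```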

## Regular expression

### Context

- Regular Expression: Concise way to describe a set of strings
- Deterministic Finite State Automaton: Machine to recognize whether a given string is in a given set.

Duality theorem: for any DFA, there exists a regular expression to describe
the same set of strings; for any regular expression, there exists a
DFA that recognizes the same set.

Most programming languages feature a regular expression engine that converts a pattern into an automaton. This compiled form is very efficient.

_Example_: the state of an automaton changes as a function of the current character. A string is accepted if the automaton ends in a prespecified end state. The following automaton recognizes only three characters: `a`, `b` and `c`. Can you predict all the strings recognized by the automaton?

![](automation.dio.png)

In [37]:
import re

In [38]:
regex = re.compile(r"a(?:ab*|ba*)bc*") # we'll explain the syntax fully below
# * means repeat zero or more times
# | means or
# (?: ... ) creates a (non-capturing) group

In [39]:
regex.match("abbc") # returns a match object

<re.Match object; span=(0, 4), match='abbc'>

In [40]:
regex.match("acb") # returns None (no match), so nothing is displayed
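To connect this pattern with the automaton view, the sketch below hand-codes a DFA for the same language as a transition table and checks it against `re`. The state names (`q0` to `q6`) are our own and need not match the figure above; a missing transition means the string is rejected. The comparison uses `fullmatch`, because `match` only requires a prefix of the string to fit the pattern.

```python
import re
from itertools import product

# transitions[state][char] -> next state (state labels are hypothetical)
transitions = {
    "q0": {"a": "q1"},
    "q1": {"a": "q2", "b": "q3"},   # branch on the second character
    "q2": {"b": "q4"},              # after "aa": at least one b is required
    "q3": {"a": "q3", "b": "q6"},   # after "ab": any number of a, then one b
    "q4": {"b": "q4", "c": "q5"},
    "q5": {"c": "q5"},              # trailing c's
    "q6": {"c": "q5"},
}
accepting = {"q4", "q5", "q6"}

def accepts(s):
    state = "q0"
    for ch in s:
        state = transitions[state].get(ch)
        if state is None:   # no transition defined: reject
            return False
    return state in accepting

regex = re.compile(r"a(?:ab*|ba*)bc*")

# exhaustive comparison on every string over {a, b, c} of length 0 to 6
words = ("".join(p) for n in range(7) for p in product("abc", repeat=n))
assert all(accepts(w) == bool(regex.fullmatch(w)) for w in words)

print(accepts("abbc"), accepts("acb"))  # True False
```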

It is also possible to find one or many elements matching a pattern inside a longer string:

In [45]:
txt = "This is a text containing aabbc and some random ababreviations"

In [46]:
# one element
m = regex.search(txt)

In [48]:
m

<re.Match object; span=(26, 31), match='aabbc'>

In [49]:
# many elements:
regex.findall(txt)

['aabbc', 'abab']
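The differences between the matching methods are easy to mix up, so here is a small side-by-side sketch using the same pattern and text. `match` only tries at position 0, `search` scans for the first occurrence, `fullmatch` requires the whole string to fit, and `finditer` yields one match object per occurrence, which is useful when positions are needed (`findall` only returns the matched text).

```python
import re

regex = re.compile(r"a(?:ab*|ba*)bc*")
txt = "This is a text containing aabbc and some random ababreviations"

print(regex.match(txt))              # None: the text does not start with the pattern
print(regex.match("abbc!").group())  # 'abbc': a matching prefix is enough
print(regex.fullmatch("abbc!"))      # None: the trailing "!" is not covered

# finditer gives match objects, so both the text and its position are available
found = [(m.start(), m.group()) for m in regex.finditer(txt)]
print(found)
```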

### Generic regular expression syntax

There are many tutorials to learn the syntax, some of them interactive.
See among others
- https://regexone.com/
- https://developers.google.com/edu/python/regular-expressions

When designing a regular expression, one can use an interactive tool to get immediate feedback, for instance https://regex101.com/

#### Basic Patterns

(The following is taken from developers.google.com.)

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

- `a`, `X`, `9`, `<` -- ordinary characters just match themselves exactly. The meta-characters, which do not match themselves because they have special meanings, are: `. ^ $ * + ? { [ ] \ | ( )` (details below)

- `.` (a period) -- matches any single character except newline `'\n'`

- `\w` -- (lowercase w) matches a "word" character: a letter, digit or underscore `[a-zA-Z0-9_]`. Note that although "word" is the mnemonic for this, it only matches a single word character, not a whole word. `\W` (uppercase W) matches any non-word character.

- `\b` -- boundary between word and non-word

- `\s` -- (lowercase s) matches a single whitespace character: space, newline, return, tab, form feed `[ \n\r\t\f]`. `\S` (uppercase S) matches any non-whitespace character.

- `\t`, `\n`, `\r` -- tab, newline, return

- `\d` -- decimal digit `[0-9]` (some older regex utilities do not support `\d`, but they all support `\w` and `\s`)

- `^` = start, `$` = end -- match the start or end of the string

- `\` -- inhibit the "specialness" of a character. So, for example, use `\.` to match a period or `\\` to match a backslash. If you are unsure whether a character has special meaning, such as `@`, you can put a backslash in front of it, `\@`, to make sure it is treated just as a character.
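A few of these patterns in action. This is a small sketch on a made-up sentence (the phone number and email address are invented for illustration). Patterns are written as raw strings (`r"..."`) so that backslashes reach the `re` module unchanged.

```python
import re

text = "Call 555-1234 or email bob_42@example.com today."

print(re.findall(r"\d+", text))           # ['555', '1234', '42']
print(re.findall(r"\w+@\w+\.\w+", text))  # ['bob_42@example.com']
print(re.findall(r"\.", text))            # ['.', '.'] (an escaped dot is literal)
print(re.findall(r"\bor\b", text))        # ['or'] (whole word only)
print(bool(re.search(r"^Call", text)))    # True: "Call" is at the start
print(bool(re.search(r"^email", text)))   # False: "email" is not at the start

# \s+ matches any run of whitespace, handy for tokenizing
print(re.split(r"\s+", "Blois tours  amiens"))  # ['Blois', 'tours', 'amiens']
```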

#### Repetition

Things get more interesting when you use + and * to specify repetition in the pattern

- `+` -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
- `*` -- 0 or more occurrences of the pattern to its left
- `?` -- match 0 or 1 occurrences of the pattern to its left

#### Leftmost & Largest

- First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").

- The non-greedy equivalents are `+?` and `*?` respectively.
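The repetition operators and the greedy versus non-greedy behaviour can be checked on a few short strings. The inputs below are illustrative only.

```python
import re

# repetition
print(re.findall(r"i+", "piiig"))               # ['iii']
print(re.findall(r"colou?r", "color colour"))   # ['color', 'colour']
print(re.findall(r"ab*", "a ab abbb"))          # ['a', 'ab', 'abbb']

# leftmost wins: the first run of i's is returned, not the longest one
print(re.search(r"i+", "piigiiii").group())     # 'ii'

# greedy vs non-greedy
html = "<b>bold</b> and <i>italic</i>"
print(re.findall(r"<.+>", html))    # ['<b>bold</b> and <i>italic</i>']
print(re.findall(r"<.+?>", html))   # ['<b>', '</b>', '<i>', '</i>']
```

The greedy `.+` runs to the last `>` in the string, while `.+?` stops at the first one it can.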


#### Square Brackets

Square brackets can be used to indicate a set of characters, so `[abc]` matches 'a' or 'b' or 'c'. The codes `\w`, `\s` etc. work inside square brackets too, with the one exception that dot (`.`) just means a literal dot.
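As an illustration, a character set makes it possible to allow dots and dashes inside an email-like token. The sample sentence follows the spirit of the Google tutorial; treat the pattern as a sketch, not a real email validator.

```python
import re

s = "purple alice-b@google.com monkey dishwasher"

# \w alone stops at the dash and at the dot
print(re.findall(r"\w+@\w+", s))            # ['b@google']

# a set accepting word characters, dots and dashes
print(re.findall(r"[\w.-]+@[\w.-]+", s))    # ['alice-b@google.com']

# inside brackets, the dot is a literal dot
print(re.findall(r"a[.]b", "a.b axb"))      # ['a.b']
print(re.findall(r"a.b", "a.b axb"))        # ['a.b', 'axb']
```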

### Grouping

One can capture groups at the same time as the matching is performed:

In [50]:
txt = "id:34, name:gilles"

In [167]:
reg = re.compile("id:(.*), name:(.*)")

In [168]:
reg.match(txt).groups()

('34', 'gilles')

In order to group subexpressions, without capturing, one needs to start the group with `?:`

In [169]:
reg = re.compile("id:(?:.*), name:(?:.*)")

In [170]:
reg.match(txt)

<re.Match object; span=(0, 18), match='id:34, name:gilles'>

In [171]:
reg.match(txt).groups()

()
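Non-capturing groups matter most when parentheses are needed only for structure, for example around an alternation. The sketch below (with an invented sentence) shows how the choice changes what `re.findall` returns: with capturing groups it returns the groups, not the whole match.

```python
import re

s = "Mr Smith met Ms Jones"

# two capturing groups: findall returns one tuple per match
print(re.findall(r"(Mr|Ms) (\w+)", s))     # [('Mr', 'Smith'), ('Ms', 'Jones')]

# the title is grouped but not captured: only the name comes back
print(re.findall(r"(?:Mr|Ms) (\w+)", s))   # ['Smith', 'Jones']

# no capturing group at all: findall returns the whole match
print(re.findall(r"(?:Mr|Ms) \w+", s))     # ['Mr Smith', 'Ms Jones']
```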

One can also name the groups (the `(?P<name>...)` syntax originated in Python; some other engines use `(?<name>...)` instead):

In [162]:
reg = re.compile("id:(?P<age>.*), name:(?P<name>.*)")

In [163]:
m = reg.match(txt)
(m.group('age'), m.group('name'))

('34', 'gilles')
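Named groups combine well with `groupdict()`, which returns all named captures as a dictionary. The sketch below parses several records that way; the extra input lines and the stricter `\d+` / `\w+` sub-patterns are our assumptions, chosen so that malformed lines are skipped instead of being swallowed by a greedy `.*`.

```python
import re

record = re.compile(r"id:(?P<id>\d+), name:(?P<name>\w+)")
lines = ["id:34, name:gilles", "id:7, name:ada", "not a record"]

parsed = []
for line in lines:
    m = record.match(line)
    if m is not None:              # skip lines that do not fit the pattern
        parsed.append(m.groupdict())

print(parsed)  # [{'id': '34', 'name': 'gilles'}, {'id': '7', 'name': 'ada'}]
```

Captured values are always strings; convert with `int(...)` where a number is needed.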