But for today, the tasks are:

- Learn about regular expressions
- Learn about Pandas dataframes
- Put together some statistics about the composition of the US Congress in the past 6 years
- Download and store (for later use) all the politicians' pages from Wikipedia
- Extract all the internal Wikipedia links that connect the politician pages on Wikipedia
- Generate the network of politicians on Wikipedia
- Calculate some simple network statistics


# Prelude: Regular expressions
## Round one
Tutorial Examples

In [2]:
import re
text = "an example word: cat!!"
match = re.search(r'word: \w\w\w', text)
# search() returns None on failure, so test the result before using it
if match:
    print('found', match.group())  # 'found word: cat'
else:
    print('did not find')

found word: cat


In [23]:
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
match = re.search(r'iii', 'piiig')  # => found, match.group() == "iii"
if match:
    print('found', match.group())
else:
    print('did not find')

match = re.search(r'igs', 'piiig')  # => not found, match == None
if match:
    print('found', match.group())
else:
    print('did not find')

## . = any char but \n
match = re.search(r'..g', 'piiig')  # => found, match.group() == "iig"
if match:
    print('found', match.group())
else:
    print('did not find')

## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g')  # => found, match.group() == "123"
if match:
    print('found', match.group())
else:
    print('did not find')

match = re.search(r'\w\w\w', '@@abcd!!')  # => found, match.group() == "abc"
if match:
    print('found', match.group())
else:
    print('did not find')

found iii
did not find
found iig
found 123
found abc


In [26]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig')  # => found, match.group() == "piii"
if match:
    print('found', match.group())
else:
    print('did not find')

## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii')  # => found, match.group() == "ii"
if match:
    print('found', match.group())
else:
    print('did not find')

## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')  # => found, match.group() == "1 2   3"
if match:
    print('found', match.group())
else:
    print('did not find')

match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx')  # => found, match.group() == "12  3"
if match:
    print('found', match.group())
else:
    print('did not find')

match = re.search(r'\d\s*\d\s*\d', 'xx123xx')  # => found, match.group() == "123"
if match:
    print('found', match.group())
else:
    print('did not find')

## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar')  # => not found, match == None
if match:
    print('found', match.group())
else:
    print('did not find')

## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar')  # => found, match.group() == "bar"
if match:
    print('found', match.group())
else:
    print('did not find')

found piii
found ii
found 1 2   3
found 12  3
found 123
did not find
found bar
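
The "leftmost and largest" behaviour noted in the cell above is easiest to see next to its alternatives. The sketch below reuses the string `'piigiiii'` from that cell; the lazy quantifier `+?` and the `findall` call are additions for comparison, not part of the tutorial.

```python
import re

sample = 'piigiiii'  # same string as in the cell above

# Greedy: leftmost match, extended as far as possible.
greedy = re.search(r'i+', sample).group()

# Lazy (non-greedy): leftmost match, kept as short as possible.
lazy = re.search(r'i+?', sample).group()

# findall() keeps scanning after each match, so it also reaches the second run.
all_runs = re.findall(r'i+', sample)

print(greedy)    # ii
print(lazy)      # i
print(all_runs)  # ['ii', 'iiii']
```

`search()` stops at the first match, which is why the second run of i's only shows up with `findall()`.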


In [32]:
text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', text)
if match:
    print(match.group())  # 'b@google': \w matches neither '-' nor '.'

b@google


In [39]:
# The character class [\w.-] also accepts dots and hyphens
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    print(match.group())   # whole match
    print(match.group(1))  # username
    print(match.group(2))  # host

alice-b@google.com
alice-b
google.com


In [41]:
## Suppose we have a text with many email addresses
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)  ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)

alice@google.com
bob@abc.com


In [43]:
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## With groups in the pattern, findall() returns a list of (username, host) tuples
pairs = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)
print(pairs)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
for username, host in pairs:
    print(username)
    print(host)

[('alice', 'google.com'), ('bob', 'abc.com')]
alice
google.com
bob
abc.com


In [50]:
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## re.sub(pat, replacement, text) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', text))
## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher

purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher


## What are regular expressions?
- In my words, a regular expression is a pattern, written in a compact syntax, that describes a set of strings. It can be used to search for, extract, and replace the parts of a text that fit the pattern.
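
One way to make "a pattern that describes a set of strings" concrete is to test whole strings against it. This is a minimal sketch; the pattern and the candidate strings are made up for illustration.

```python
import re

# Hypothetical pattern: an uppercase letter followed by exactly three digits.
pattern = re.compile(r'[A-Z]\d{3}')

candidates = ['A123', 'B007', 'a123', 'A12', 'A1234']

# fullmatch() succeeds only if the whole string fits the pattern.
accepted = [s for s in candidates if pattern.fullmatch(s)]
print(accepted)  # ['A123', 'B007']
```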

Provide an example of a regex to match 4-digit numbers (by this, I mean precisely 4 digits; you should not match any part of numbers with e.g. 5 digits). In your notebook, use `findall` to show that your regex works on this test-text. Hint: a great place to test out regular expressions is: https://regex101.com.

In [66]:
import requests
testText = requests.get('https://raw.githubusercontent.com/suneman/socialgraphs2017/master/files/test.txt').text
match = re.findall(r'\d\d\d\d', testText)
print(match)

['1234', '9999', '2345']
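
Note that `\d\d\d\d` on its own also matches the first four digits of a longer number, which the exercise rules out. One possible way to require precisely four digits is to add lookarounds that forbid a digit on either side. The sample string below is hypothetical and is not the course test file, so the result on `testText` may differ.

```python
import re

# Hypothetical sample, not the course test file.
sample = 'ids: 1234, 56789, x2345, 999 and 9999.'

# Four digits anywhere, including inside longer numbers.
naive = re.findall(r'\d\d\d\d', sample)

# Four digits with no digit directly before or after.
exact = re.findall(r'(?<!\d)\d{4}(?!\d)', sample)

print(naive)  # ['1234', '5678', '2345', '9999']
print(exact)  # ['1234', '2345', '9999']
```

The lookarounds `(?<!\d)` and `(?!\d)` are used here instead of `\b` because `\b` would also reject a number glued to a letter, such as `x2345`. Whether that case should count is a judgment call.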


In [81]:
print(re.findall(r'super\w+', testText))

['superpolaroid', 'supertaxidermy', 'superbeer']


### Exercise
Exercise: Regular expressions round 2.

Show that you can extract the wiki-links from the test-text.

Perhaps you can find inspiration on stack overflow or similar.

**Hint**: Try to solve this exercise on your own (that's what you will get the most out of - learning wise), but if you get stuck ... you will find the solution in one of the video lectures below.


In [110]:
# Plain links such as [[gentrify]]
match = re.findall(r'(\[\[\w+\]\])', testText)
# Piped links such as [[target|label]]
match.extend(re.findall(r'(\[\[[a-zA-Z0-9_\s\(\)\-]+\|[a-zA-Z0-9_\s\(\)\-]+\]\])', testText))

In [115]:
print(match)

['[[gentrify]]', '[[hashtag]]', '[[Bicycle|Bicycle(two-wheeled type)]]', '[[Pitchfork|Pitchfork Magazine]]']
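
The two patterns above can also be folded into one, with the link target and the optional display text captured separately. This is a sketch under assumptions: the sample text is hypothetical (not the course test file), and the pattern assumes link targets contain neither `]` nor `|`.

```python
import re

# Hypothetical sample text, not the course test file.
sample = ('Some [[gentrify]] text, a [[Bicycle|Bicycle(two-wheeled type)]] '
          'and a [[Pitchfork (website)|Pitchfork Magazine]] link.')

# Group 1: target page; group 2: optional display text after the pipe.
link_pattern = re.compile(r'\[\[([^\]|]+)(?:\|([^\]]+))?\]\]')

# findall() returns one (target, label) tuple per link; label is '' if absent.
links = link_pattern.findall(sample)
targets = [target for target, _label in links]

print(links)
print(targets)  # ['gentrify', 'Bicycle', 'Pitchfork (website)']
```

Keeping the target in its own group is convenient later on, since it is the target, not the display text, that names the linked Wikipedia page.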
