# Week 4
## Notes


### Regular Expressions in Python
More about [regular expressions](https://developers.google.com/edu/python/regular-expressions)

In [3]:
import re # import regular expression module

In [4]:
# Basic
str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str)
# If-statement after search() tests if it succeeded
if match:                      
  print 'found', match.group()
else:
  print 'did not find'

found word:cat


#### Findall

In [None]:
## Suppose we have a text with many email addresses
  str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

  ## Here re.findall() returns a list of all the found email strings
  emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['alice@google.com', 'bob@abc.com']
  for email in emails:
    # do something with each found email string
    print email

## Exercises

**What are regular expressions?**

Regular expressions uses pattern matching to find customized strings, substrings, repetitions etc. in texts and other material. There are many different ways to specify what you want.

It is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching. It is the same as what is used when finding and replacing in Word etc.

**Regex to match 4 digit number**

[Place to test regular expressions](https://regex101.com/)

**An even better place to [test regex'es](https://regexr.com/)**

In [66]:
# Match exatcly 4 digits, no more no less.
# \d = digit char, \w = word char
# ''r' indicates a 'raw' string

# Fetching test content (for all examples here)
import urllib2
response = urllib2.urlopen('https://raw.githubusercontent.com/suneman/socialgraphs2016/master/files/test.txt')
html = response.read()

str = "hello54o oo  5688oo 63855"
# Matches 4 digits, possibly separated by whitespace
# Using lookbehind and lookahead
match = re.findall(r'(?<!\d)[0-9]{4}(?!\d)', html)
# match = re.search(r'^[0-9]{4}', str) 

if match:                      
  print 'Sequence matching pattern:', match
else:
  print 'No sequence matching pattern.'

Sequence matching pattern: ['1234', '9999']


**Provide an example of a regex to match words starting with "super".** 

Show that it works on the [test-text](https://raw.githubusercontent.com/suneman/socialgraphs2016/master/files/test.txt).

In [75]:
# Regex matching words STARTING with super
match = re.findall(r'super\w+', html)

if match:                      
  print 'Sequence matching pattern:', match
else:
  print 'No sequence matching pattern.'

Sequence matching pattern: ['superpolaroid', 'supertaxidermy', 'superbeer']


### Regular expression pt. 2
Getting the Wiki-links

In [86]:
# Show that you can extract the wiki-links from the test-text. 
match = re.findall(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]', html)

if match:                      
  print 'Sequence matching pattern:', match
else:
  print 'No sequence matching pattern.'

Sequence matching pattern: ['drinking vinegar', 'gentrify', 'hashtag', 'Bicycle(two-wheeled type)', 'Pitchfork Magazine']


## A Fun with Wikipedia

How to get the pages/markup through the Wiki-API

### Basic Statistics - philosopher's

In [3]:
import urllib2
import re
import json
response = urllib2.urlopen('https://en.wikipedia.org/wiki/List_of_aestheticians')
html = response.read()

In [6]:
# Getting the philosopher informatiom
baseurl    = "https://en.wikipedia.org/w/api.php?"
action     = "action=query"
title      = "titles="
content    = "prop=revisions&rvprop=content"
dataformat = "format=json"

# List of all Philosophers by core area
list_phil = ['List_of_aestheticians','List_of_epistemologists','List_of_ethicists','List_of_logicians','List_of_metaphysicians','List_of_social_and_political_philosophers']
query = []
phil_count = 0
all_phil = []

# constructing query for getting all Philosopher data
for core_area in list_phil:
    query = "%s%s&%s&%s&%s&utf8=" % (baseurl,action,title+core_area,content,dataformat)

    f = urllib2.urlopen(query)
    area_links = re.findall(r'\*.?\[\[([^\]|]+)\|?[^\]]*\]\]',f.read())
    trim_area_links = [item for item in area_links if not item.lower().startswith('list of')]
    all_phil.append(trim_area_links)
    print "Query for:", core_area
    print (query)
    print "Number of philosophers: ", len(trim_area_links)
    print
    #print trim_area_links
    
    # Accumulate number of philosophers
    phil_count = phil_count + len(trim_area_links)

print "Philosophers in all: ", phil_count # each core area not accouting for duplicates

Query for: List_of_aestheticians
https://en.wikipedia.org/w/api.php?action=query&titles=List_of_aestheticians&prop=revisions&rvprop=content&format=json&utf8=
Number of philosophers:  124

Query for: List_of_epistemologists
https://en.wikipedia.org/w/api.php?action=query&titles=List_of_epistemologists&prop=revisions&rvprop=content&format=json&utf8=
Number of philosophers:  98

Query for: List_of_ethicists
https://en.wikipedia.org/w/api.php?action=query&titles=List_of_ethicists&prop=revisions&rvprop=content&format=json&utf8=
Number of philosophers:  272

Query for: List_of_logicians
https://en.wikipedia.org/w/api.php?action=query&titles=List_of_logicians&prop=revisions&rvprop=content&format=json&utf8=
Number of philosophers:  271

Query for: List_of_metaphysicians
https://en.wikipedia.org/w/api.php?action=query&titles=List_of_metaphysicians&prop=revisions&rvprop=content&format=json&utf8=
Number of philosophers:  96

Query for: List_of_social_and_political_philosophers
https://en.wikipedi

**Which is the largest branch of philosophy?**

Judging by the number of philosophers in the different groups, the philosophers within social and political philosophy is the greatest branch. They are closely followed by Logicians and Ethicists.

**Philosophers in several lists?**

To solve this problem it is easiest to use the list of all philosophers. To see whether they are in several lists, one need to look at the list with all names, where none has been removed to avoid duplication.

In [7]:
from collections import Counter
in_several_lists = Counter(brac)
#print in_several_lists
print "Philosopher in most lists: "
print in_several_lists.most_common(1)

TypeError: unhashable type: 'list'

**List of philosophers in more than one list**

In [299]:
import operator

popular_phil = {}
temp = dict(Counter(in_several_lists))

# Getting the values of philosophers in lists > 1
for key, value in temp.iteritems():
    if value > 1:
        popular_phil[key] = value

sorted_phil = sorted(popular_phil.items(), key=lambda x: (-x[1], x[0]))
print sorted_phil

[(u'Aristotle', 6), (u'Bertrand Russell', 5), (u'Immanuel Kant', 5), (u'Plato', 5), (u'Ayn Rand', 4), (u'Arthur Schopenhauer', 3), (u'David Hume', 3), (u'Georg Wilhelm Friedrich Hegel', 3), (u'Gottfried Leibniz', 3), (u'John Locke', 3), (u'John Stuart Mill', 3), (u'Judith Butler', 3), (u'Ludwig Wittgenstein', 3), (u'Mario Bunge', 3), (u'Nelson Goodman', 3), (u'Ruth Barcan Marcus', 3), (u'Susan Haack', 3), (u'S\xf8ren Kierkegaard', 3), (u'Thomas Aquinas', 3), (u'Abraham Joshua Heschel', 2), (u'Alain Badiou', 2), (u'Alfred North Whitehead', 2), (u'Alvin Plantinga', 2), (u'B. R. Ambedkar', 2), (u'Baruch Spinoza', 2), (u'Berit Brogaard', 2), (u'Catherine Elgin', 2), (u'Christian Wolff (philosopher)', 2), (u'Confucius', 2), (u'Constantin R\u0103dulescu-Motru', 2), (u'David Chalmers', 2), (u'David Kolb', 2), (u'Edward Said', 2), (u'Emma Goldman', 2), (u'Francis Bacon', 2), (u'Francis Hutcheson (philosopher)', 2), (u'Friedrich Nietzsche', 2), (u'Friedrich Schiller', 2), (u'G. E. Moore', 2), (

**'Top 5 guys'**

In [294]:
print in_several_lists.most_common(5)

[(u'Aristotle', 7), (u'Plato', 6), (u'Bertrand Russell', 6), (u'Immanuel Kant', 6), (u'Ayn Rand', 5)]


Yes, I know Aristotle, Plato, Russell and Kant.

### Creating the Philosopher's network

In [295]:
# Work on all collected data (saved in one file)
import io
import re
import numpy as np

f = io.open('./markup.txt','r',encoding='utf-8')
brac = re.findall(r'\*\s*\[\[([^\]|]+)\|?[^\]]*\]\]', f.read())



In [296]:
#Getting only unique values
unique_brac = np.unique(brac)

# Remove unwanted links 
trim_phil = [item for item in unique_brac if not item.lower().startswith('list of')]

# Printing the values
for link in trim_phil:
    print link



:nl:Martinus Dorpius
A. J. Ayer
Abraham Fraenkel
Abraham Joshua Heschel
Abraham Robinson
Abraham ibn Daud
Abul Kalam Azad
Adam Müller
Adam Smith
Adolf Lindenbaum
Adrian Johnston (philosopher)
Ahmed Raza Khan
Alain Badiou
Alan Bundy
Alan Carter (philosopher)
Alan Gewirth
Alan Ross Anderson
Alan Ryan
Alan Turing
Alasdair MacIntyre
Alasdair Urquhart
Alastair Norcross
Albert Camus
Albert Schweitzer
Albert of Saxony (philosopher)
Alcuin
Aldo Leopold
Alexander Bain
Alexander Campbell Fraser
Alexander Esenin-Volpin
Alexander Gerard
Alexander Gottlieb Baumgarten
Alexander S. Kechris
Alexander Zinoviev
Alexis de Tocqueville
Alfred Horn
Alfred North Whitehead
Alfred Rosenberg
Alfred Tarski
Algernon Charles Swinburne
Ali Shariati
Alice Crary
Alija Izetbegović
Allan Gibbard
Alon Ben-Meir
Alonzo Church
Alvin Goldman
Alvin Plantinga
Amartya Sen
Ambrose
Anandavardhana
Anaximander
Andrea Bonomì
Andreas Linder
Andrei Marga
Andronicus of Rhodes
Andrzej Mostowski
Andrzej Tadeusz Kijowski
André Malraux
An

In [297]:
import pickle
import urllib

# Run through all philosophers' links
all_phil = {}

for phil in trim_phil:
    baseurl    = "https://en.wikipedia.org/w/api.php?" # ensure English wiki
    if(phil.startswith(":")):
        # Redirect to right page if no English available
        cc = re.search(r'^:([a-zA-Z]+):',phil).group(1)
        phil = re.sub(r'^:([a-zA-Z]+):','',phil)
        baseurl = "https://%s.wikipedia.org/w/api.php?" % (cc)
    
    # Create query
    print phil
    phil_q = urllib.quote(phil.replace(" ","_"))
    query = "%s%s&%s&%s&%s&utf8=" % (baseurl,action,title+phil_q,content,dataformat)
    print query
    f = urllib2.urlopen(query)
    phil_all_links = re.findall(r'\[\[([^\]|]+)\|?[^\]]*\]\]',f.read())
    
    # Drop all but philosopher's links 
    # phil_links = [val for val in phil_all_links if val in trim_phil]
    for item in phil_all_links:
        for item2 in trim_phil:
            try:
                if item == item2.astype('U'):
                    item = item2
            except UnicodeDecodeError:
                print item, type(item)
                print item2, type(item2)
                
    
    # Save Philosopher's links
    all_phil[phil] = phil_links

phil_out = open('philosophers_associativity_list.pkl','wb')
pickle.dump(all_phil, phil_out, pickle.HIGHEST_PROTOCOL)

Martinus Dorpius
https://nl.wikipedia.org/w/api.php?action=query&titles=Martinus_Dorpius&prop=revisions&rvprop=content&format=json&utf8=
A. J. Ayer
https://en.wikipedia.org/w/api.php?action=query&titles=A._J._Ayer&prop=revisions&rvprop=content&format=json&utf8=




Abraham Fraenkel
https://en.wikipedia.org/w/api.php?action=query&titles=Abraham_Fraenkel&prop=revisions&rvprop=content&format=json&utf8=
Abraham Joshua Heschel
https://en.wikipedia.org/w/api.php?action=query&titles=Abraham_Joshua_Heschel&prop=revisions&rvprop=content&format=json&utf8=
Abraham Robinson
https://en.wikipedia.org/w/api.php?action=query&titles=Abraham_Robinson&prop=revisions&rvprop=content&format=json&utf8=
Abraham ibn Daud
https://en.wikipedia.org/w/api.php?action=query&titles=Abraham_ibn_Daud&prop=revisions&rvprop=content&format=json&utf8=
Abul Kalam Azad
https://en.wikipedia.org/w/api.php?action=query&titles=Abul_Kalam_Azad&prop=revisions&rvprop=content&format=json&utf8=
Adam Müller


KeyError: u'\xfc'

## Other