### Import example in PostgreSQL, Riak, Cassandra, MongoDB and Neo4j
#### Purpose
* This exercise aims to build familiarity with CRUD operations in the databases listed above by inserting and querying a previously chosen dataset in each of them. The dataset has to be processed first, and that preprocessing is also an important part of the exercise.

#### Dataset
* The data source is https://nobelprize.readme.io/docs/: the Nobel Prizes. It gives access to three different datasets. Two of them (prize.json and laureate.json) contain essentially the same data: the first lists each prize with a reference to its laureates, and the second lists each laureate with a reference to their prize(s). The third dataset maps countries to their current two-character ISO codes (in a few cases, the laureates' places of birth or death no longer exist as such or have been annexed by another country).
* Initially the idea was to use all three files, but after a thorough review it became clear that laureate.json covers the content of the other two, so only this one was used: https://nobelprize.readme.io/docs/laureate
 
#### Use case
* In the end the chosen data source is laureate.json, which yields the following information of interest:
 * Nobel Prize laureates, modeled as the **Laureate** entity
 * Prizes, modeled as the **Prize** entity
 * Alma mater: the universities or other institutions the laureates were affiliated with at the time of the award, modeled as **Affiliation**
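
As a rough illustration of this model, the sketch below splits one laureate record into the three entities. The field names (`id`, `firstname`, `prizes`, `affiliations`, and so on) are the ones the loading code in this notebook reads; the record is trimmed by hand from rows shown later in the notebook, its exact layout is an assumption, and `split_record` is a hypothetical helper.

```python
from collections import namedtuple

# Hypothetical in-memory shapes for the three entities.
Laureate = namedtuple("Laureate", "laureate_id firstname surname gender")
Prize = namedtuple("Prize", "year category laureate_id")
Affiliation = namedtuple("Affiliation", "year category laureate_id name city country")

# Hand-trimmed record; assumed to mirror the layout of a laureate.json entry.
record = {
    "id": "1",
    "firstname": "Wilhelm Conrad",
    "surname": "Röntgen",
    "gender": "male",
    "prizes": [
        {
            "year": "1901",
            "category": "physics",
            "affiliations": [
                {"name": "Munich University", "city": "Munich", "country": "Germany"}
            ],
        }
    ],
}


def split_record(rec):
    """Return (laureate, prizes, affiliations) for one record."""
    laureate = Laureate(rec["id"], rec["firstname"], rec.get("surname"), rec["gender"])
    prizes, affiliations = [], []
    for p in rec["prizes"]:
        prizes.append(Prize(p["year"], p["category"], rec["id"]))
        for a in p.get("affiliations", []):
            # Entries without a name are skipped, as the notebook's loader does.
            if "name" in a:
                affiliations.append(
                    Affiliation(p["year"], p["category"], rec["id"],
                                a["name"], a.get("city"), a.get("country"))
                )
    return laureate, prizes, affiliations


laureate, prizes, affiliations = split_record(record)
print(laureate.surname, len(prizes), affiliations[0].name)
```

Each affiliation carries the year, category and laureate id of its prize, which is the same composite key the relational schema uses further down.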
        
#### Choice of queries
* The chosen queries are meant to extract some curiosities or interesting facts from the dataset. Each one was modeled for every type of database, in order to evaluate how laborious it is to do the same thing with the tools each database offers.
 * *Men, women and...*: count how the prizes have been distributed among men, women and organizations.
 * *Where do the Nobel laureates come from?*: the institutions linked to the largest number of laureates.
 * *One more, please...*: the laureates who have received more than one prize.
 * *Hyperactive...*: the laureate who has been linked to the largest number of institutions.
 * *The most fertile states...*: the US states where the most laureates were born.
 * *They could not be left out...*: Spanish laureates and their prizes.


  
#### Alejandro Manuel Arranz López: alejandro.arranz@datahack.es

# POSTGRESQL

## Cleaning up existing data...

In [1]:
!echo 'learner' | sudo -S -u postgres dropdb nobel

[sudo] password for learner: 

In [2]:
!echo 'learner' | sudo -S -u postgres createdb nobel -O learner

[sudo] password for learner: 

## Connecting to the new database...

In [3]:
%load_ext sql

In [4]:
%sql postgresql://learner:learner@localhost/nobel

u'Connected: learner@nobel'

## Identifying the PostgreSQL version...

In [5]:
%sql SELECT version()

1 rows affected.


version
"PostgreSQL 9.3.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit"


In [6]:
# To ensure the tables are created from scratch, initial drops are run.
# On a freshly created database they fail with "does not exist", which is expected.
%sql drop table laureates
%sql drop table prizes
%sql drop table affiliations

(psycopg2.ProgrammingError) table "laureates" does not exist
 [SQL: 'drop table laureates']
(psycopg2.ProgrammingError) table "prizes" does not exist
 [SQL: 'drop table prizes']
(psycopg2.ProgrammingError) table "affiliations" does not exist
 [SQL: 'drop table affiliations']


## Creating tables...

In [7]:
%%sql 
CREATE TABLE laureates (
    laureate_id     int not null PRIMARY KEY,
    firstName       varchar(100) not null,
    surname         varchar(75) null,
    gender          varchar(10) not null,
    born            date null, 
    bornCountry     varchar(60) null, 
    bornCountryCode varchar(2) null, 
    bornCity        varchar(75) null, 
    died            date null, 
    diedCountry     varchar(60) null, 
    diedCountryCode varchar(2) null, 
    diedCity        varchar(75) null
);

Done.


[]

In [8]:
%%sql 
CREATE TABLE prizes (
    year        smallint not null,
    category    varchar(20) not null, 
    laureate_id int not null REFERENCES laureates(laureate_id),
    motivation  varchar(500) null,    
    PRIMARY KEY (year, category, laureate_id)
);

Done.


[]

In [9]:
%%sql 
CREATE TABLE affiliations (
    year        smallint not null, 
    category    varchar(20) not null,
    laureate_id int not null,
    name        varchar(150) not null,
    city        varchar(75) null,
    country     varchar(60) null,
    PRIMARY KEY (year, category, laureate_id, name),
    FOREIGN KEY (year, category, laureate_id) REFERENCES prizes(year, category, laureate_id)
);

Done.


[]
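
The foreign keys chain the three tables together: an affiliation row needs its prize row, which in turn needs its laureate row, which is also why child tables have to be emptied before their parents. The sketch below reproduces that chain with Python's built-in `sqlite3` module purely as a stand-in, so it runs without a PostgreSQL server. The column lists are trimmed, the `'Nowhere'` row is made up for the failure case, and SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# SQLite-specific switch; PostgreSQL enforces foreign keys by default.
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE laureates (
    laureate_id INTEGER NOT NULL PRIMARY KEY,
    firstName   TEXT NOT NULL
);
CREATE TABLE prizes (
    year        INTEGER NOT NULL,
    category    TEXT NOT NULL,
    laureate_id INTEGER NOT NULL REFERENCES laureates(laureate_id),
    PRIMARY KEY (year, category, laureate_id)
);
CREATE TABLE affiliations (
    year        INTEGER NOT NULL,
    category    TEXT NOT NULL,
    laureate_id INTEGER NOT NULL,
    name        TEXT NOT NULL,
    PRIMARY KEY (year, category, laureate_id, name),
    FOREIGN KEY (year, category, laureate_id)
        REFERENCES prizes(year, category, laureate_id)
);
""")

# Parents first: laureate, then prize, then affiliation.
con.execute("INSERT INTO laureates VALUES (1, 'Wilhelm Conrad')")
con.execute("INSERT INTO prizes VALUES (1901, 'physics', 1)")
con.execute("INSERT INTO affiliations VALUES (1901, 'physics', 1, 'Munich University')")

# An affiliation pointing at a prize that does not exist is rejected.
try:
    con.execute("INSERT INTO affiliations VALUES (1999, 'physics', 1, 'Nowhere')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting a parent row that still has children is rejected too.
try:
    con.execute("DELETE FROM laureates")
    parent_delete_rejected = False
except sqlite3.IntegrityError:
    parent_delete_rejected = True

print(orphan_rejected, parent_delete_rejected)
```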

## Loading data...

In [10]:
import psycopg2

In [11]:
# A connection to the database is created
con = psycopg2.connect(database='nobel', user='learner')

In [12]:
# A cursor is bound to the new connection; it will help us
# handle the returned data.
cur = con.cursor()
# As said, this cursor will fetch the returned
# records for us, for instance...
cur.execute('SELECT version()')
ver = cur.fetchone()
print(ver)

('PostgreSQL 9.3.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit',)


In [13]:
# Now we'll ensure the tables are empty. Child tables go first
# so that the foreign keys are not violated.
cur.execute("DELETE FROM affiliations")
cur.execute("DELETE FROM prizes")
cur.execute("DELETE FROM laureates")
con.commit()

In [17]:
# This function checks whether a field is present in
# the json element before attempting to extract it.
# param1=> list_name to which the element will be appended.
# param2=> json element that should contain the field to check.
# param3=> field_name to be checked.
# effect=> appends None if the field is not present, or the field value if it is.
def append_field(list_name, json, field_name):
    if field_name in json:
        list_name.append(json[field_name])
    else: 
        list_name.append(None)

In [18]:
# This function checks whether a date field is present in
# the json element and its format is correct (the dataset contains
# many dates with 0 values for the year, month or day, used when
# the laureate is still alive or the birth/death date is unknown).
# param1=> list_name to which the element will be appended.
# param2=> json element that should contain the field to check.
# param3=> field_date to be checked.
# effect=> appends None if the field is not present or its values are not allowed,
# or the field value if it is present and its values are allowed.
def append_date(list_name, json, field_date):
    # The presence check must come first; otherwise a missing field raises KeyError.
    if (field_date in json) and ('00' not in json[field_date].split("-")):
        list_name.append(json[field_date])
    else: 
        list_name.append(None)
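
The check above hinges on zero-valued date parts. A stricter variant, sketched here as an assumption rather than as the notebook's method, lets `datetime.strptime` decide whether the string is a real calendar date. `clean_date` is a hypothetical helper name, and the placeholder strings are assumed examples of the zero-valued dates described in the comment.

```python
from datetime import datetime


def clean_date(record, field):
    """Return record[field] if it is a valid YYYY-MM-DD date, else None."""
    value = record.get(field)
    if not value:
        return None
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        # Covers placeholders such as '0000-00-00' or a zero month/day.
        return None
    return value


print(clean_date({"born": "1845-03-27"}, "born"))
print(clean_date({"died": "0000-00-00"}, "died"))
print(clean_date({"born": "1943-00-00"}, "born"))
print(clean_date({}, "born"))
```

Unlike the split-based check, this version also rejects impossible dates such as a 31st of February, at the cost of parsing every value.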

In [16]:
# This function checks whether the laureate had any affiliation
# when awarded; if so, it retrieves the fields required by the table schema.
# param1-3=> the foreign key shared with prizes and (in part) with laureates.
# param4=> json element containing the affiliations.
# returns=> a list with one entry per affiliation: the affiliation record
# if it has a name, or None otherwise.
def parse_affiliation(f_key1, f_key2, f_key3, json):
    affiliations = []
    affiliation = []
    for a in json:
        if "name" in a:
            affiliation.append(f_key1)
            affiliation.append(f_key2)
            affiliation.append(f_key3)
            append_field(affiliation, a, "name")
            append_field(affiliation, a, "city")
            append_field(affiliation, a, "country")
        else:
            affiliation = None
            
        affiliations += [affiliation]
        affiliation = []
    
    return affiliations
        
        

In [17]:
# This function retrieves the table schema required fields
# for the prize/s awarded to the laureate; since affiliations
# are nested under prizes the function takes care of invoking
# the necessary utilities to retrieve them too.
# param1=> foreign key pointing to laureates table.
# param2=> json containing the prizes element.
# returns=> both prizes and affiliations related to the laureate
# being processed.

def parse_prize(f_key, json):
    prizes = []
    affiliations = []
    prize = []
    for p in json:
        prize.append(p["year"])
        prize.append(p["category"])
        append_field(prize, p, "motivation")
        prize.append(f_key)
        prizes += [prize]
        affiliations += parse_affiliation(p["year"], p["category"], f_key, p["affiliations"])
        prize = []
        

    return prizes, affiliations
        

In [18]:
# This function centralizes the parsing of each main element (laureate).
# Every field fitting the laureates table schema is retrieved here;
# furthermore, the processing of the associated prizes
# is managed from here.
# param1=> json comprising an entire row of the given json file.
# returns=> the processed laureate record, together with its prize and affiliation records.

def parse_laureate(json):
    laureate = []
    prizes = []
    affiliations = []
    
    if ('id' in json) and ('firstname' in json) and ('gender' in json):
        laureate.append(json["id"])
        laureate.append(json["firstname"])
        append_field(laureate, json, "surname")
        laureate.append(json["gender"])
        append_date(laureate, json, "born")
        append_field(laureate, json, "bornCountry")
        append_field(laureate, json, "bornCountryCode")
        append_field(laureate, json, "bornCity")
        append_date(laureate, json, "died")
        append_field(laureate, json, "diedCountry")
        append_field(laureate, json, "diedCountryCode")
        append_field(laureate, json, "diedCity")
        
        prizes, affiliations = parse_prize(json["id"], json["prizes"])
        
        return laureate, prizes, affiliations
    else:
        return None, None, None

In [19]:
import json
import sys

# This function loads the json file and provides an entry point
# to process each of its lines or records. It also collects the
# processed output and adds it to the global lists
# that will later feed the initial database load.
# param1=> nobels_data_path the path where the json file is located.
# param2=> laureates the global list where the processed laureates containing
# only the table schema fields, will be stored.
# param3=> prizes the global list where the processed prizes containing
# only the table schema fields, will be stored.
# param4=> affiliations the global list where the processed affiliations containing
# only the table schema fields, will be stored.


def load_file(nobels_data_path, laureates, prizes, affiliations):
    # The with block closes the file even if parsing fails.
    with open(nobels_data_path, "r") as nobels_file:
        for line in nobels_file:
            try:
                nobel = json.loads(line)
                laureate, prize, affiliation = parse_laureate(nobel)
                if laureate is not None:
                    laureates += [laureate]
                    prizes += prize
                    # When loading the data the possible "None" affiliations
                    # will be filtered out.
                    affiliations += affiliation
            except Exception:
                print("Unexpected error: %s" % sys.exc_info()[0])
                raise

In [20]:
import json

laureates = []
prizes = []
affiliations = []
load_file('../data/laureate.json', laureates, prizes, affiliations)
print(len(laureates))
print(len(prizes))
print(len(affiliations))

904
911
969


In [21]:
query = """INSERT INTO laureates (laureate_id, 
            firstName, 
            surname,
            gender,
            born,
            bornCountry,
            bornCountryCode,
            bornCity,
            died,
            diedCountry,
            diedCountryCode,
            diedCity)
        VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""

cur.executemany(query, laureates)

In [22]:
query = """INSERT INTO prizes (year, 
            category,
            motivation,
            laureate_id)
        VALUES (%s,%s,%s,%s)"""

cur.executemany(query, prizes)

In [23]:
query = """INSERT INTO affiliations (year, 
            category,
            laureate_id,
            name,
            city,
            country)
        VALUES (%s,%s,%s,%s,%s,%s)"""

cur.executemany(query, [x for x in affiliations if x is not None])

In [24]:
con.commit()
con.close()

## Querying the data...

In [25]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [26]:
%sql postgresql://learner:learner@localhost/nobel

u'Connected: learner@nobel'

### First, we check that the data looks as expected

In [27]:
%sql select count(*) from laureates

1 rows affected.


count
904


In [28]:
%%sql 
SELECT * 
FROM laureates
LIMIT 10

10 rows affected.


laureate_id,firstname,surname,gender,born,borncountry,borncountrycode,borncity,died,diedcountry,diedcountrycode,diedcity
1,Wilhelm Conrad,Röntgen,male,1845-03-27,Prussia (now Germany),DE,Lennep (now Remscheid),1923-02-10,Germany,DE,Munich
2,Hendrik Antoon,Lorentz,male,1853-07-18,the Netherlands,NL,Arnhem,1928-02-04,the Netherlands,NL,Haarlem
3,Pieter,Zeeman,male,1865-05-25,the Netherlands,NL,Zonnemaire,1943-10-09,the Netherlands,NL,Amsterdam
4,Antoine Henri,Becquerel,male,1852-12-15,France,FR,Paris,1908-08-25,France,FR,Le Croisic
5,Pierre,Curie,male,1859-05-15,France,FR,Paris,1906-04-19,France,FR,Paris
6,Marie,"Curie, née Sklodowska",female,1867-11-07,Russian Empire (now Poland),PL,Warsaw,1934-07-04,France,FR,Sallanches
8,Lord Rayleigh,(John William Strutt),male,1842-11-12,United Kingdom,GB,"Langford Grove, Maldon, Essex",1919-06-30,United Kingdom,GB,Witham
9,Philipp Eduard Anton,von Lenard,male,1862-06-07,Hungary (now Slovakia),SK,Pressburg (now Bratislava),1947-05-20,Germany,DE,Messelhausen
10,Joseph John,Thomson,male,1856-12-18,United Kingdom,GB,"Cheetham Hill, near Manchester",1940-08-30,United Kingdom,GB,Cambridge
11,Albert Abraham,Michelson,male,1852-12-19,Prussia (now Poland),PL,Strelno (now Strzelno),1931-05-09,USA,US,"Pasadena, CA"


In [29]:
%sql select count(*) from prizes

1 rows affected.


count
911


In [30]:
%%sql 
SELECT * 
FROM prizes
LIMIT 10

10 rows affected.


year,category,laureate_id,motivation
1901,physics,1,in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him
1902,physics,2,in recognition of the extraordinary service they rendered by their researches into the influence of magnetism upon radiation phenomena
1902,physics,3,in recognition of the extraordinary service they rendered by their researches into the influence of magnetism upon radiation phenomena
1903,physics,4,in recognition of the extraordinary services he has rendered by his discovery of spontaneous radioactivity
1903,physics,5,in recognition of the extraordinary services they have rendered by their joint researches on the radiation phenomena discovered by Professor Henri Becquerel
1903,physics,6,in recognition of the extraordinary services they have rendered by their joint researches on the radiation phenomena discovered by Professor Henri Becquerel
1911,chemistry,6,"in recognition of her services to the advancement of chemistry by the discovery of the elements radium and polonium, by the isolation of radium and the study of the nature and compounds of this remarkable element"
1904,physics,8,for his investigations of the densities of the most important gases and for his discovery of argon in connection with these studies
1905,physics,9,for his work on cathode rays
1906,physics,10,in recognition of the great merits of his theoretical and experimental investigations on the conduction of electricity by gases


In [31]:
%sql select count(*) from affiliations

1 rows affected.


count
722


In [32]:
%%sql 
SELECT * 
FROM affiliations
LIMIT 10

10 rows affected.


year,category,laureate_id,name,city,country
1901,physics,1,Munich University,Munich,Germany
1902,physics,2,Leiden University,Leiden,the Netherlands
1902,physics,3,Amsterdam University,Amsterdam,the Netherlands
1903,physics,4,École Polytechnique,Paris,France
1903,physics,5,École municipale de physique et de chimie industrielles (Municipal School of Industrial Physics and Chemistry),Paris,France
1911,chemistry,6,Sorbonne University,Paris,France
1904,physics,8,Royal Institution of Great Britain,London,United Kingdom
1905,physics,9,Kiel University,Kiel,Germany
1906,physics,10,University of Cambridge,Cambridge,United Kingdom
1907,physics,11,University of Chicago,"Chicago, IL",USA


### (PostgreSQL) Men, women and...

In [33]:
%%sql 
SELECT gender,COUNT(*) FROM laureates 
GROUP BY gender
ORDER BY COUNT(*) DESC

3 rows affected.


gender,count
male,833
female,48
org,23


### (PostgreSQL) Where do the Nobel laureates come from?

In [34]:
%%sql 
SELECT name,COUNT(*) AS nNobels FROM affiliations 
GROUP BY name 
ORDER BY nNobels DESC LIMIT 10

10 rows affected.


name,nnobels
University of California,34
Harvard University,27
Stanford University,18
Massachusetts Institute of Technology (MIT),18
University of Cambridge,17
California Institute of Technology (Caltech),17
University of Chicago,17
Columbia University,16
Princeton University,14
Howard Hughes Medical Institute,11


### (PostgreSQL) One more, please...

In [35]:
%%sql
SELECT l.firstname, l.surname, COUNT(*) AS nNobels 
FROM laureates l INNER JOIN prizes p ON l.laureate_id=p.laureate_id 
GROUP BY l.laureate_id 
HAVING COUNT(*)>1
ORDER BY nNobels DESC

6 rows affected.


firstname,surname,nnobels
Comité international de la Croix Rouge (International Committee of the Red Cross),,3
John,Bardeen,2
Office of the United Nations High Commissioner for Refugees (UNHCR),,2
Linus Carl,Pauling,2
Marie,"Curie, née Sklodowska",2
Frederick,Sanger,2


### (PostgreSQL) Hyperactive...

In [36]:
%%sql
SELECT l.firstname, l.surname, a.name
FROM (SELECT laureate_id, COUNT(*) AS nAffiliations
      FROM affiliations
      GROUP BY laureate_id
      ORDER BY nAffiliations DESC
      LIMIT 1) q
INNER JOIN laureates l ON l.laureate_id = q.laureate_id
INNER JOIN affiliations a ON a.laureate_id = l.laureate_id

3 rows affected.


firstname,surname,name
Jack W.,Szostak,Harvard Medical School
Jack W.,Szostak,Howard Hughes Medical Institute
Jack W.,Szostak,Massachusetts General Hospital


### (PostgreSQL) The most fertile states...

In [37]:
%%sql
SELECT substring(borncity from '..$') AS state, count(substring(borncity from '..$')) AS nborns FROM laureates
WHERE borncountrycode='US'
GROUP BY state
ORDER BY nborns DESC
LIMIT 5

5 rows affected.


state,nborns
NY,64
IL,25
MA,22
CA,16
PA,14
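
The pattern `'..$'` simply keeps the last two characters of `borncity`, which relies on US places being written in the form `City, ST`, as in values such as `Pasadena, CA` seen in the outputs above. The same idea can be sketched in Python with a slightly stricter pattern that requires the comma and two capital letters. The pattern and the helper `state_of` are hypothetical; the sample strings are city values that appear in this notebook's outputs.

```python
import re
from collections import Counter

# Stricter than '..$': a comma, optional spaces, then two capital letters at the end.
STATE_RE = re.compile(r",\s*([A-Z]{2})$")


def state_of(city):
    """Return the two-letter state suffix of a 'City, ST' string, or None."""
    match = STATE_RE.search(city or "")
    return match.group(1) if match else None


# 'Munich' has no state suffix, so it is left out of the count.
cities = ["Pasadena, CA", "Chicago, IL", "Munich"]
counts = Counter(s for s in map(state_of, cities) if s)
print(counts.most_common())
```

With the plain `'..$'` pattern a value like `Munich` would yield `ch`; the `borncountrycode='US'` filter in the query is what keeps such values out of the result.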


### (PostgreSQL) They could not be left out...

In [38]:
%%sql
SELECT l.firstname, l.surname, p.category, p.year FROM laureates l INNER JOIN prizes p ON l.laureate_id=p.laureate_id 
WHERE borncountrycode='ES'

7 rows affected.


firstname,surname,category,year
Santiago,Ramón y Cajal,medicine,1906
Severo,Ochoa,medicine,1959
José,Echegaray y Eizaguirre,literature,1904
Jacinto,Benavente,literature,1922
Juan Ramón,Jiménez,literature,1956
Vicente,Aleixandre,literature,1977
Camilo José,Cela,literature,1989


# RIAK

In [39]:
import riak
import json

In [41]:
# Connect to database
myClient = riak.RiakClient()
myClient.ping()

True

In [42]:
# Buckets are defined
BUCKET_LAUREATES = 'laureates'
BUCKET_PRIZES = 'prizes'
BUCKET_AFFILIATIONS = 'affiliations'

laureates = myClient.bucket(BUCKET_LAUREATES)
prizes = myClient.bucket(BUCKET_PRIZES)
affiliations = myClient.bucket(BUCKET_AFFILIATIONS)

In [43]:
import unicodedata

# This function is meant to process the affiliations associated to each prize, hence,
# associated to each laureate. This function indexes each affiliation so that it can be 
# searched by prize and laureate.
# param1=> key_prize the id of the current prize record.
# param2=> key_laureate the id of the current laureate record.
# param3=> affiliations_json the part of the record corresponding to the affiliations.
# returns=> list containing the affiliation keys.

def insert_affiliations(key_prize, key_laureate, affiliations_json):
    key_affiliations = []
    for a in affiliations_json:
        if "name" in a:
            key_affiliation = unicodedata.normalize('NFKD', a["name"]).encode('ascii','ignore')
            if affiliations.get(key_affiliation).exists:
                affiliation = affiliations.get(key_affiliation)
                affiliation.add_index('idx_aff_prz_bin', key_prize)
                affiliation.add_index('idx_aff_lau_bin', key_laureate)   
                affiliation.store() 
            else:
                affiliation = affiliations.new(key_affiliation, a)
                affiliation.add_index('idx_aff_prz_bin', key_prize)
                affiliation.add_index('idx_aff_lau_bin', key_laureate)
                affiliation.store()            
        else:
            key_affiliation = None
        
        key_affiliations += [key_affiliation]
    
    return key_affiliations

In [44]:
# This function is meant to process the prizes awarded to the laureate
# represented by each json record. It's an entry point from where affiliations
# are processed too.
# param1=> key_laureate the id of the current laureate record.
# param2=> prizes_json the part of the record corresponding to the prizes.
# returns=> two lists containing the prizes keys and the affiliation keys, both
# are useful in case laureates want to be indexed by one of these two keys.

def insert_prizes(key_laureate, prizes_json):
    key_prizes = []
    curr_key_affiliations = []
    key_affiliations = []
    for p in prizes_json:
        key_prize = p["year"]+p["category"]
        prize = prizes.new(key_prize, p)
        prize.store()
        key_prizes += [key_prize]
        curr_key_affiliations = insert_affiliations(key_prize, key_laureate, p["affiliations"])
        key_affiliations += curr_key_affiliations
        
    return key_prizes, key_affiliations
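
Because the prize key is just `year + category`, laureates who share a prize also share its key. Presumably each `store()` then overwrites the object previously saved under that key, and it is the secondary index on the laureate side that keeps the link to every winner. The sketch below imitates this with plain dictionaries; it is a stand-in and not the Riak client API, and `bucket` and `index` are hypothetical names.

```python
# Plain-dict stand-ins: 'bucket' plays the prizes bucket, 'index' plays
# the idx_lau_prz_bin secondary index (prize key -> laureate keys).
bucket = {}
index = {}

# (laureate id, year, category) triples taken from the prizes table shown earlier.
awards = [
    ("2", "1902", "physics"),
    ("3", "1902", "physics"),
    ("6", "1911", "chemistry"),
]

for laureate_id, year, category in awards:
    key_prize = year + category  # same key construction as insert_prizes
    # A later write under the same key replaces the earlier object.
    bucket[key_prize] = {"year": year, "category": category}
    index.setdefault(key_prize, set()).add(laureate_id)

print(len(awards), "awards ->", len(bucket), "prize keys")
print(sorted(index["1902physics"]))
```

This is why the prize-based queries further down iterate over prize keys and then follow the index back to the laureates, instead of counting objects in the prizes bucket.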

In [45]:
# This function is meant to process each of the json file records, which
# is each laureate. It checks whether it's got the mandatory fields and also
# indexes each laureate so that it can be accessed by using its prizes keys.
# param1=> laureate_json a record from the json file.
# returns=> 1 if the laureate has all the mandatory fields, 0 if it has not.

def insert_laureate(laureate_json):
    if ('id' in laureate_json) and ('firstname' in laureate_json) and ('gender' in laureate_json):
        laureate = laureates.new(laureate_json["id"], laureate_json)
        l_prizes, l_affiliations = insert_prizes(laureate_json["id"], laureate_json['prizes'])
        for l_p_key in l_prizes:
            laureate.add_index('idx_lau_prz_bin', l_p_key)
            
        laureate.store()
        
        return 1
    else:
        return 0
        
    

In [46]:
nobels_data_path = '../data/laureate.json'

i = 0
# The with block closes the file once every line has been processed.
with open(nobels_data_path, "r") as nobels_file:
    for line in nobels_file:
        laureate_json = json.loads(line)
        # insert_laureate returns 1 for a stored record and 0 for a skipped one.
        i += insert_laureate(laureate_json)

print("%s records loaded" % i)

904 records loaded


### (Riak) Men, women and...

In [47]:
n_male = 0
n_female = 0
n_org = 0
for keys in laureates.stream_keys():
    for key in keys:
        l_gender = laureates.get(key).data["gender"]
        if(l_gender=="male"):
            n_male += 1
        elif(l_gender=="female"):
            n_female += 1
        else:
            n_org += 1
    
print("Male: "+str(n_male))
print("Female: "+str(n_female))
print("Org: "+str(n_org))

Male: 833
Female: 48
Org: 23


### (Riak) Where do the Nobel laureates come from?

In [48]:
import operator

affs_dict = {}
for keys in laureates.stream_keys():
    for key in keys:
        key_affs = affiliations.stream_index('idx_aff_lau_bin', key)
        for key_a in key_affs:
            for k_a in key_a:
                try:
                    affs_dict[k_a] += 1
                except:
                    affs_dict[k_a] = 1
                    
sorted_affs_dict = sorted(affs_dict.items(), key=operator.itemgetter(1), reverse=True)
# Caltech shows one appearance fewer than in the PostgreSQL result because Pauling (217)
# was affiliated with it for both of his Nobel prizes; here that affiliation is counted once per laureate.
sorted_affs_dict[:10]

[('University of California', 34),
 ('Harvard University', 27),
 ('Stanford University', 18),
 ('Massachusetts Institute of Technology (MIT)', 18),
 ('University of Chicago', 17),
 ('University of Cambridge', 17),
 ('Columbia University', 16),
 ('California Institute of Technology (Caltech)', 16),
 ('Princeton University', 14),
 ('Howard Hughes Medical Institute', 11)]
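
The `try`/`except` counting pattern used in these Riak queries can also be written with `collections.Counter`. A minimal sketch over a hand-made list of index hits (the list is illustrative, not real query output):

```python
from collections import Counter

# Illustrative stand-in for the affiliation keys returned by the index lookups.
index_hits = [
    "Harvard University",
    "University of California",
    "Harvard University",
    "Stanford University",
    "University of California",
    "University of California",
]

affs_counter = Counter(index_hits)
# most_common(n) returns the n highest counts, already sorted in descending order.
top = affs_counter.most_common(2)
print(top)
```

`Counter` removes both the bare `except` and the separate `sorted(..., key=operator.itemgetter(1), reverse=True)` step.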

### (Riak) One more, please...

In [49]:
import operator
laurs_dict = {}

for keys in prizes.stream_keys():
    for key in keys:
        key_laurs = laureates.stream_index('idx_lau_prz_bin', key)
        for key_l in key_laurs:
            for k_l in key_l:
                try:
                    laurs_dict[k_l] += 1
                except:
                    laurs_dict[k_l] = 1
    
sorted_laurs_dict = sorted(laurs_dict.items(), key=operator.itemgetter(1), reverse=True)
for lau_tup in sorted_laurs_dict:
    nNobels = lau_tup[1]
    lau_tup_key = lau_tup[0]
    if(nNobels>1):
        try:
            print(laureates.get(lau_tup_key).data["firstname"] +" "+laureates.get(lau_tup_key).data["surname"] + " " + str(nNobels))
        except:
            print(laureates.get(lau_tup_key).data["firstname"] +" " + str(nNobels))

Comité international de la Croix Rouge (International Committee of the Red Cross) 3
Marie Curie, née Sklodowska 2
Office of the United Nations High Commissioner for Refugees (UNHCR) 2
John Bardeen 2
Linus Carl Pauling 2
Frederick Sanger 2


### (Riak) Hyperactive...

In [50]:
import operator
aff_lau_dict = {}

for keys in laureates.stream_keys():
    for key in keys:
        key_affs = affiliations.stream_index('idx_aff_lau_bin', key)        
        for key_a in key_affs:
            try:
                aff_lau_dict[key] += 1
            except:
                aff_lau_dict[key] = 1
                
sorted_aff_lau_dict = sorted(aff_lau_dict.items(), key=operator.itemgetter(1), reverse=True)
top_lau_aff = sorted_aff_lau_dict[:1][0][0]
try:
    print(laureates.get(top_lau_aff).data["firstname"] + " " + laureates.get(top_lau_aff).data["surname"])
except:
    print(laureates.get(top_lau_aff).data["firstname"])
top_aff_lau_keys = affiliations.stream_index('idx_aff_lau_bin', top_lau_aff)        
for key_tal in top_aff_lau_keys:
    print(key_tal)

Jack W. Szostak
['Harvard Medical School']
['Howard Hughes Medical Institute']
['Massachusetts General Hospital']


### (Riak) The most fertile states...

In [52]:
import operator
import re
state_dict = {}

for keys in laureates.stream_keys():
    for key in keys:
        lauData = laureates.get(key).data
        if(lauData["gender"]!="org"):
            if(lauData["bornCountryCode"]=="US"):
                state = re.search('..$', lauData["bornCity"])
                if(state):
                    try:
                        state_dict[state.group(0)] += 1
                    except:
                        state_dict[state.group(0)] = 1
                        
sorted_state_dict = sorted(state_dict.items(), key=operator.itemgetter(1), reverse=True)
sorted_state_dict[:5]

[(u'NY', 64), (u'IL', 25), (u'MA', 22), (u'CA', 16), (u'PA', 14)]

### (Riak) They could not be left out...

In [53]:
for keys in prizes.stream_keys():
    for key in keys:
        key_laurs = laureates.stream_index('idx_lau_prz_bin', key)
        for key_l in key_laurs:
            for k_l in key_l:
                l_data = laureates.get(k_l).data
                if(l_data["gender"]!="org"):
                    if(l_data["bornCountryCode"]=="ES"):
                        p_data = prizes.get(key).data
                        print(l_data["firstname"] + " " + l_data["surname"] + " " + p_data["category"] + " " + p_data["year"])

Camilo José Cela literature 1989
Jacinto Benavente literature 1922
José Echegaray y Eizaguirre literature 1904
Juan Ramón Jiménez literature 1956
Vicente Aleixandre literature 1977
Severo Ochoa medicine 1959
Santiago Ramón y Cajal medicine 1906


# CASSANDRA

## Cleaning up existing data...

In [27]:
%load_ext cql

The cql extension is already loaded. To reload it, use:
  %reload_ext cql


In [28]:
%%cql
DROP KEYSPACE IF EXISTS nobel

'No results.'

In [29]:
%%cql
CREATE KEYSPACE nobel 
WITH replication = {'class':'SimpleStrategy', 'replication_factor': 1};

'No results.'

## Making sure the keyspace in use is the newly created one

In [30]:
%keyspace nobel

Using keyspace nobel


## Creating tables

In [31]:
%%cql 
CREATE TABLE laureates (
    laureate_id     int,
    firstName       text,
    surname         text,
    gender          text,
    born            date, 
    bornCountry     text, 
    bornCountryCode text, 
    bornCity        text, 
    died            date, 
    diedCountry     text, 
    diedCountryCode text, 
    diedCity        text,
    prizes_awarded list<text>,
    affiliations_lau list<text>,
    PRIMARY KEY (laureate_id)
);

'No results.'

In [32]:
%%cql
CREATE INDEX laureate_bcc ON laureates(bornCountryCode)

'No results.'

In [33]:
%%cql
CREATE TABLE prizes (
    year        smallint,
    category    text, 
    motivation  text,    
    PRIMARY KEY (year, category)
);

'No results.'

In [34]:
%%cql 
CREATE TABLE affiliations (
    name        text,
    city        text,
    country     text,
    ncount      int,
    PRIMARY KEY (name)
);

'No results.'

## Loading data...

In [35]:
from cassandra.cluster import Cluster, BatchStatement, ConsistencyLevel
cluster = Cluster()
session = cluster.connect('nobel')

In [36]:
# Once the whole file has been processed, 
# this function takes care of inserting each
# affiliation record into its corresponding table.
# Furthermore, it retrieves the appearances of each institution
# and adds the resulting count to each record.
# param1=> The list of the affiliations to insert.
# param2=> A dictionary containing each affiliation key and its number of appearances.

def insert_affiliations_cs(affiliations, affiliation_keys):
    affiliations_with_count = []
    for a in affiliations:
        a_w_c = []
        a_w_c.extend(a)
        a_w_c.extend([affiliation_keys[a[0]]])
        affiliations_with_count += [a_w_c]

    query = """INSERT INTO affiliations (name, 
            city, 
            country,
            ncount)
        VALUES (%s,%s,%s,%s)"""

    for a in affiliations_with_count:
        session.execute(query, a)

In [37]:
# Once the whole file has been processed, 
# this function takes care of inserting each
# prize record into its corresponding table.
# param1=> The list of the prizes to insert.

def insert_prizes_cs(prizes):
    query = """INSERT INTO prizes (year, 
            category, 
            motivation)
        VALUES (%s,%s,%s)"""

    for p in prizes:
        session.execute(query, p)

In [38]:
# Once the whole file has been processed, 
# this function takes care of inserting each
# laureate record into its corresponding table.
# param1=> The list of the laureates to insert.

def insert_laureates_cs(laureates):
    query = """INSERT INTO laureates (laureate_id, 
            firstName, 
            surname,
            gender,
            born,
            bornCountry,
            bornCountryCode,
            bornCity,
            died,
            diedCountry,
            diedCountryCode,
            diedCity,
            prizes_awarded,
            affiliations_lau)
        VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""

    for l in laureates:
        session.execute(query, l)
        

In [39]:
import unicodedata

# Converts unicode to ASCII by stripping diacritics.
# param1=> json containing the field to convert.
# param2=> field to be converted.
# returns=> If the field exists, its value in ASCII format, otherwise None will be returned.
def encode_ascii(json, field_name):
    if field_name in json:
        return unicodedata.normalize('NFKD', json[field_name]).encode('ascii','ignore')
    else: 
        return None
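
To illustrate what the NFKD normalization plus `encode('ascii', 'ignore')` does, here is a minimal stand-alone sketch. `to_ascii` is a hypothetical variant that works on a plain string rather than on a JSON field, and it decodes the result back to text so that it behaves the same under Python 3.

```python
import unicodedata


def to_ascii(text):
    """Strip accents from a unicode string (hypothetical stand-alone variant)."""
    if text is None:
        return None
    # NFKD splits an accented letter into the base letter plus a combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # 'ignore' silently drops everything that is not ASCII, including the marks.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii(u"Santiago Ram\u00f3n y Cajal"))  # Santiago Ramon y Cajal
print(to_ascii(u"S\u00f8ren"))                   # Sren
```

The second call shows a side effect worth keeping in mind: letters such as `ø` have no decomposition, so they are dropped entirely rather than replaced.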

In [40]:
# This function is meant to process each of the JSON records corresponding to a
# laureate; furthermore it takes care of processing the awarded prizes and the affiliations.
# It uses two functions (append_field and append_date) which were defined for the PostgreSQL
# implementation.
# param1=> laureate_json a record from the json file.
# returns=> the laureate record, a list with the laureate prizes, a list with the laureate prizes keys,
# a list containing the laureate affiliations and a list with the laureate affiliations keys.

def parse_laureate_cs(laureate_json):   
    if ('id' in laureate_json) and ('firstname' in laureate_json) and ('gender' in laureate_json):
        prizes_awarded = []
        affiliations_lau = []
        prize_keys = []
        affiliation_keys = []
        prizes_lau = []
        aff_lau = []
        firstname = encode_ascii(laureate_json, "firstname")
        surname = encode_ascii(laureate_json, "surname")
        for p in laureate_json["prizes"]:
            prize_key = p["year"]+p["category"]
            prizes_awarded.append(prize_key)
            prize = []
            prize.append(int(p["year"]))
            prize.append(p["category"])
            append_field(prize, p, "motivation")
            prizes_lau += [prize]
            prize_keys += [prize_key]            
            for a in p["affiliations"]:
                if "name" in a:
                    name = encode_ascii(a, "name")
                    affiliation_key = name
                    affiliations_lau.append(affiliation_key)
                    affiliation = []
                    affiliation.append(affiliation_key)
                    city = encode_ascii(a, "city")
                    affiliation.append(city)
                    append_field(affiliation, a, "country")
                    aff_lau += [affiliation]
                    affiliation_keys += [affiliation_key]
                else:
                    continue
                
           
        laureate = []
        laureate.append(int(laureate_json["id"]))
        laureate.append(firstname)
        laureate.append(surname)
        laureate.append(laureate_json["gender"])
        append_date(laureate, laureate_json, "born")
        append_field(laureate, laureate_json, "bornCountry")
        append_field(laureate, laureate_json, "bornCountryCode")
        laureate.append(encode_ascii(laureate_json, "bornCity"))
        append_date(laureate, laureate_json, "died")
        append_field(laureate, laureate_json, "diedCountry")
        append_field(laureate, laureate_json, "diedCountryCode")
        laureate.append(encode_ascii(laureate_json, "diedCity"))
        laureate.append(prizes_awarded)
        laureate.append(affiliations_lau)
        
        
        return laureate, prizes_lau, prize_keys, aff_lau, affiliation_keys
    else:
        return None, None, None, None, None
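
The prize key used throughout is the concatenation of the `year` and `category` strings. A small sketch of just that step, using a hypothetical and heavily trimmed record in the shape of laureate.json (`prize_keys` is an illustrative helper, not a function from the notebook):

```python
def prize_keys(laureate_json):
    """Return the 'year' + 'category' key of every prize in one record."""
    return [p["year"] + p["category"] for p in laureate_json.get("prizes", [])]


# Hypothetical, trimmed record: only the fields needed for the key are present.
sample = {
    "firstname": "Marie",
    "prizes": [
        {"year": "1903", "category": "physics"},
        {"year": "1911", "category": "chemistry"},
    ],
}

print(prize_keys(sample))  # ['1903physics', '1911chemistry']
```

Because `year` arrives as a string in the JSON, the concatenation needs no conversion; the integer cast is only applied when the year is stored in the prize row.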

In [41]:
import json
from pprintpp import pprint as pp
import sys

nobels_data_path = '../data/laureate.json'
laureates = []
prizes = []
prizes_keys = []
affiliations = []
affiliations_keys = {}
laureates_by_prize = []
affiliations_by_prize = []
affiliations_by_laureate = []

nobels_file = open(nobels_data_path, "r")
count = 0
for line in nobels_file:
    laureate_json = json.loads(line)
    lau, pr, pr_keys, aff, aff_keys = parse_laureate_cs(laureate_json)
    if(lau is not None):
        laureates += [lau]
        for idx,p_key in enumerate(pr_keys):
            if(p_key not in prizes_keys):
                prizes_keys += [p_key]
                prizes += [pr[idx]]
        for idx,a_key in enumerate(aff_keys):
            if(a_key not in affiliations_keys):
                affiliations += [aff[idx]]
                affiliations_keys[a_key] = 1
            else:
                affiliations_keys[a_key] += 1

insert_laureates_cs(laureates)
insert_prizes_cs(prizes)
insert_affiliations_cs(affiliations, affiliations_keys)
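
The loop above deduplicates prizes and affiliations while counting how often each affiliation appears. The same idea can be sketched with a `set` for membership tests and a `Counter` for the tallies. `dedupe_and_count` and the sample rows are hypothetical and only illustrate the pattern.

```python
from collections import Counter


def dedupe_and_count(keyed_rows):
    """Keep the first row seen for each key and count how often each key appears.

    keyed_rows: iterable of (key, row) pairs.
    Returns (unique_rows, counts), where counts is a Counter indexed by key.
    """
    seen = set()
    unique_rows = []
    counts = Counter()
    for key, row in keyed_rows:
        counts[key] += 1
        if key not in seen:      # constant-time lookup, unlike `in` on a list
            seen.add(key)
            unique_rows.append(row)
    return unique_rows, counts


# Hypothetical sample input, not taken from the dataset.
rows = [
    ("Harvard University", ["Harvard University", "Cambridge, MA", "USA"]),
    ("Stanford University", ["Stanford University", "Stanford, CA", "USA"]),
    ("Harvard University", ["Harvard University", "Cambridge, MA", "USA"]),
]
unique_rows, counts = dedupe_and_count(rows)
print(len(unique_rows))              # 2
print(counts["Harvard University"])  # 2
```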

In [42]:
print "Laureates", session.execute("SELECT count(*) from laureates")[0].count
print "Prizes", session.execute("SELECT count(*) from prizes")[0].count
print "Affiliations", session.execute("SELECT count(*) from affiliations")[0].count

Laureates 904
Prizes 579
Affiliations 315




### (Cassandra) Men, women and...

In [43]:
result_cursor = session.execute("select * from laureates")

nMale = 0
nFemale = 0
nOrg = 0
for row in result_cursor:
    row_gender = row.gender
    if(row_gender=="male"):
        nMale+=1
    elif(row_gender=="female"): 
        nFemale+=1
    else:
        nOrg+=1

print("Male - "+str(nMale))
print("Female - "+str(nFemale))
print("Org - "+str(nOrg))


Male - 833
Female - 48
Org - 23
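
Since the grouping here happens on the client, the three counters can also be expressed with `collections.Counter`. The sketch below uses a `namedtuple` as a stand-in for the driver's row objects and hypothetical sample rows, so it runs without a cluster.

```python
from collections import Counter, namedtuple

# Stand-in for the driver's row objects; only the `gender` attribute matters here.
Row = namedtuple("Row", ["gender"])


def tally_genders(rows):
    """Count rows per gender value, e.g. 'male', 'female' or 'org'."""
    return Counter(row.gender for row in rows)


# Hypothetical sample rows, not the real table contents.
sample_rows = [Row("male"), Row("female"), Row("male"), Row("org")]
for gender, n in tally_genders(sample_rows).most_common():
    print("%s - %d" % (gender, n))
```

With the real session, `tally_genders(session.execute("select gender from laureates"))` would be the equivalent call, assuming the same table as above.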


### (Cassandra) Where do the Nobel laureates come from?

In [44]:
result_cursor = session.execute("select * from affiliations")

def getCount(item):
    return item.ncount

for row in sorted(result_cursor, key=getCount, reverse=True)[0:10]:
    print row.name, "-", row.ncount

University of California - 34
Harvard University - 27
Massachusetts Institute of Technology (MIT) - 18
Stanford University - 18
California Institute of Technology (Caltech) - 17
University of Chicago - 17
University of Cambridge - 17
Columbia University - 16
Princeton University - 14
Howard Hughes Medical Institute - 11
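
Sorting the whole result set just to keep ten rows works fine at this size. For larger tables, `heapq.nlargest` gives the same top-N without a full sort. This is a sketch with a `namedtuple` standing in for the driver rows and hypothetical values.

```python
import heapq
from collections import namedtuple

# Stand-in for a driver row with the two columns used by the query.
Affiliation = namedtuple("Affiliation", ["name", "ncount"])


def top_affiliations(rows, n=10):
    """Return the n rows with the highest ncount, largest first."""
    return heapq.nlargest(n, rows, key=lambda row: row.ncount)


# Hypothetical sample values for illustration only.
sample = [Affiliation("A", 3), Affiliation("B", 7), Affiliation("C", 5)]
print([row.name for row in top_affiliations(sample, n=2)])  # ['B', 'C']
```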


### (Cassandra) One more, please...

In [45]:
result_cursor = session.execute("select * from laureates")

def getCount(item):
    return len(item.prizes_awarded)
    
for row in sorted(result_cursor, key=getCount, reverse=True)[0:10]:
    print row.firstname,row.surname, "-", len(row.prizes_awarded)

Comite international de la Croix Rouge (International Committee of the Red Cross) None - 3
Frederick Sanger - 2
Linus Carl Pauling - 2
Office of the United Nations High Commissioner for Refugees (UNHCR) None - 2
Marie Curie, nee Sklodowska - 2
John Bardeen - 2
Peter Agre - 1
Max Karl Ernst Ludwig Planck - 1
Abdus Salam - 1
William Golding - 1


### (Cassandra) Hyperactive...

In [46]:
result_cursor = session.execute("select * from laureates")

def getCount(item):
    # affiliations_lau may be None for laureates without affiliations
    try:
        return len(item.affiliations_lau)
    except TypeError:
        return 0
    
for row in sorted(result_cursor, key=getCount, reverse=True)[0:3]:
    print row.firstname,row.surname, "-", row.affiliations_lau

Jack W. Szostak - [u'Harvard Medical School', u'Massachusetts General Hospital', u'Howard Hughes Medical Institute']
Abdus Salam - [u'International Centre for Theoretical Physics', u'Imperial College']
Robert J. Lefkowitz - [u'Howard Hughes Medical Institute', u'Duke University Medical Center']


### (Cassandra) The most fertile states...

In [47]:
import re
from collections import Counter

result_cursor = session.execute("SELECT * FROM laureates WHERE borncountrycode='US'")

def getState(city):
    state = re.search('..$', city)
    if(state):
        return state.group(0)
    else:
        return None
print(Counter([getState(row.borncity) for row in result_cursor]).most_common(5))

[(u'NY', 64), (u'IL', 25), (u'MA', 22), (u'CA', 16), (u'PA', 14)]
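
Taking the last two characters assumes every birth city ends in a state code and is never empty. A slightly more defensive sketch, assuming city values follow the `'Boston, MA'` pattern, only accepts a comma followed by two capital letters and skips missing values. `get_state` and the sample list are hypothetical.

```python
import re
from collections import Counter

# A US state suffix as it appears in values such as 'Boston, MA'.
STATE_SUFFIX = re.compile(r",\s*([A-Z]{2})$")


def get_state(city):
    """Return the two-letter state code, or None if the value has no such suffix."""
    if not city:
        return None
    match = STATE_SUFFIX.search(city)
    return match.group(1) if match else None


# Hypothetical sample values, not taken from the table.
cities = ["New York, NY", "Boston, MA", "Brooklyn, NY", None, "Washington"]
states = Counter(s for s in map(get_state, cities) if s is not None)
print(states.most_common(1))  # [('NY', 2)]
```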


### (Cassandra) We could not leave them out...

In [48]:
result_cursor = session.execute("SELECT * FROM laureates WHERE borncountrycode='ES'")

for row in result_cursor:
    print row.firstname, row.surname, row.prizes_awarded

Jacinto Benavente [u'1922literature']
Juan Ramon Jimenez [u'1956literature']
Camilo Jose Cela [u'1989literature']
Severo Ochoa [u'1959medicine']
Santiago Ramon y Cajal [u'1906medicine']
Vicente Aleixandre [u'1977literature']
Jose Echegaray y Eizaguirre [u'1904literature']


# MongoDB

In [49]:
import pymongo 
from pymongo import MongoClient
import json

### Connecting to the database...

In [50]:
connection = MongoClient('localhost', 27017)

### Dropping and creating a new database...

In [51]:
connection.drop_database("nobel")

In [52]:
db = connection.nobel

## Loading data...

In [53]:
# This function processes each json record so that for every
# laureate, prize and affiliation an _id field is set.
# It takes care of assigning every single affiliation related to
# a prize to it, furthermore makes some calculations such as the number
# of affiliations related to a certain laureate to make subsequent queries
# easier.
# param1=> laureate_json a record from the json file.

def insert_laureate_mn(laureate_json):
    if ('id' in laureate_json) and ('firstname' in laureate_json) and ('gender' in laureate_json):
        laureates_prizes_list = []
        n_affiliations_laureate = 0
        for p in laureate_json['prizes']:
            p["_id"] = p["year"] + p["category"]
            prize_affiliations_list = []
            for a in p["affiliations"]:
                if "name" in a:
                    a["_id"] = a["name"]
                    prize_affiliations_list.append(a)
                    if(db.affiliations.find({"_id":a["name"]}).count()==0):
                        db.affiliations.insert_one(a)
                   
                    db.affiliations_laureates.find_and_modify(query = {"_id" : a["_id"] }, 
                                                              update ={ "$inc": { "count": 1 } } , 
                                                              upsert = True)                    
            n_prize_affiliations = len(prize_affiliations_list)
            if(db.prizes.find({"_id":p["_id"]}).count()==0):
                if(n_prize_affiliations>0):
                    p["affiliations"] = prize_affiliations_list
                db.prizes.insert_one(p)
            else:
                db.prizes.update({"_id":p["_id"]},{"$addToSet":{"affiliations":{"$each":prize_affiliations_list}}}, upsert=True)
             
            n_affiliations_laureate += n_prize_affiliations
            laureates_prizes_list.append(p)
        
        laureate_json["prizes"] = laureates_prizes_list
        laureate_json["nAffiliations"] = n_affiliations_laureate
        laureate_json["_id"] = laureate_json["id"]
        db.laureates.insert_one(laureate_json)
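
The function above relies on `Cursor.count`, `Collection.update` and `Collection.find_and_modify`, which were deprecated in PyMongo 3 and are not available in PyMongo 4. As a sketch of how the affiliation step might look with `update_one` and `upsert=True`, assuming the same `affiliations` and `affiliations_laureates` collections (`register_affiliation` is a hypothetical helper):

```python
def register_affiliation(db, affiliation):
    """Insert the affiliation if it is new and bump its laureate counter."""
    name = affiliation["name"]
    # _id is taken from the filter on insert, so it is left out of the payload.
    fields = dict((k, v) for k, v in affiliation.items() if k != "_id")
    # Upsert with $setOnInsert: creates the document the first time it is seen
    # and leaves an existing one untouched.
    db.affiliations.update_one({"_id": name},
                               {"$setOnInsert": fields},
                               upsert=True)
    # Upsert with $inc: starts the counter at 1 or increments it.
    db.affiliations_laureates.update_one({"_id": name},
                                         {"$inc": {"count": 1}},
                                         upsert=True)
```

This replaces the "find, then insert if the count is zero" pair with a single round trip per collection; counting documents would use `count_documents({})` instead of `count()`.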

In [54]:
nobels_data_path = '../data/laureate.json'

nobels_file = open(nobels_data_path, "r")
count = 0
for line in nobels_file:
    laureate_json = json.loads(line)
    insert_laureate_mn(laureate_json)
    

In [55]:
print "Number of laureates: " , db.laureates.count()
print "Number of prizes: " , db.prizes.count()
print "Number of affiliations: " , db.affiliations.count()

Number of laureates:  904
Number of prizes:  579
Number of affiliations:  315


### (MongoDB) Men, women and...

In [56]:
result_cursor = db.laureates.aggregate([{"$group":{"_id": "$gender", "count": { "$sum" : 1}}}, {"$sort":{"count": -1}}])
for g in result_cursor:
    print g["_id"], str(g["count"])

male 833
female 48
org 23


### (MongoDB) Where do the Nobel laureates come from?

In [57]:
result_cursor = db.affiliations_laureates.find().sort("count", pymongo.DESCENDING).limit(11)
for a in result_cursor:
    print(a)

{u'count': 34, u'_id': u'University of California'}
{u'count': 27, u'_id': u'Harvard University'}
{u'count': 18, u'_id': u'Stanford University'}
{u'count': 18, u'_id': u'Massachusetts Institute of Technology (MIT)'}
{u'count': 17, u'_id': u'University of Cambridge'}
{u'count': 17, u'_id': u'University of Chicago'}
{u'count': 17, u'_id': u'California Institute of Technology (Caltech)'}
{u'count': 16, u'_id': u'Columbia University'}
{u'count': 14, u'_id': u'Princeton University'}
{u'count': 11, u'_id': u'Rockefeller University'}
{u'count': 11, u'_id': u'Howard Hughes Medical Institute'}


### (MongoDB) One more, please...

In [58]:
result_cursor = db.laureates.find( {"prizes.1": {"$exists": True }})
for l in result_cursor:
    try:
        print (l["firstname"] + " " + l["surname"] + " " + str(len(l["prizes"])))
    except:
        print (l["firstname"] + " " + str(len(l["prizes"])))

Marie Curie, née Sklodowska 2
John Bardeen 2
Linus Carl Pauling 2
Frederick Sanger 2
Comité international de la Croix Rouge (International Committee of the Red Cross) 3
Office of the United Nations High Commissioner for Refugees (UNHCR) 2


### (MongoDB) Hyperactive...

In [59]:
result_cursor = db.laureates.aggregate([{"$group":{"_id":"$_id", "count":{"$max":"$nAffiliations"}}},
                                        {"$sort":{"count":-1}},
                                        {"$limit":1}])
for l in result_cursor:
    print(l["_id"])
    matching_laureate = db.laureates.find({"_id":l["_id"]})
    for m in matching_laureate:
        print(m["firstname"] + " " + m["surname"])
        for p in m["prizes"]:
            print(p["affiliations"])

837
Jack W. Szostak
[{u'city': u'Boston, MA', u'_id': u'Harvard Medical School', u'name': u'Harvard Medical School', u'country': u'USA'}, {u'city': u'Boston, MA', u'_id': u'Massachusetts General Hospital', u'name': u'Massachusetts General Hospital', u'country': u'USA'}, {u'_id': u'Howard Hughes Medical Institute', u'name': u'Howard Hughes Medical Institute'}]


### (MongoDB) The most fertile states...

In [60]:
import operator
result_cursor = db.laureates.aggregate([{"$match":{"bornCountryCode":"US"}},
                                        {"$project":{"_id" : 0, "bornCity": 1}}])
                                        #{"$unwind" : "$state" },
                                        #{"$match":{"$state","/[A-Z]{2}/"}}])
states = {}
for l in result_cursor:
    try:
        states[l["bornCity"][-2:]] += 1
    except:
        states[l["bornCity"][-2:]] = 1

states_sorted = sorted(states.items(), key=operator.itemgetter(1), reverse=True)[:5]
print(states_sorted)


[(u'NY', 64), (u'IL', 25), (u'MA', 22), (u'CA', 16), (u'PA', 14)]


### (MongoDB) We could not leave them out...

In [61]:
result_cursor = db.laureates.aggregate([{"$match":{"bornCountryCode":"ES"}},
                                        {"$project":{"_id":0, "firstname":1, "surname":1, "prizes":{"_id":1}}}])

for l in result_cursor:
    print(l["firstname"] + " " + l["surname"])
    for p in l["prizes"]:
        print(p["_id"])

Santiago Ramón y Cajal
1906medicine
Severo Ochoa
1959medicine
José Echegaray y Eizaguirre
1904literature
Jacinto Benavente
1922literature
Juan Ramón Jiménez
1956literature
Vicente Aleixandre
1977literature
Camilo José Cela
1989literature


# Neo4j

### Cleaning up existing data...

In [1]:
%load_ext cypher

In [51]:
%%cypher
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE n,r

28 relationship deleted.
25 nodes deleted.


### Creating constraints...

In [52]:
%%cypher
CREATE CONSTRAINT ON (l:Laureate) ASSERT l.id IS UNIQUE

0 rows affected.


In [53]:
%%cypher
CREATE CONSTRAINT ON (p:Prize) ASSERT p.id IS UNIQUE

0 rows affected.


In [54]:
%%cypher
CREATE CONSTRAINT ON (a:Affiliation) ASSERT a.name IS UNIQUE

0 rows affected.


### Loading data

In [55]:
from py2neo import Graph, Relationship
import json

graph = Graph()

In [56]:
# This function appends a new property (if it exists) to a given node.
# param1=> node to which the property will be appended.
# param2=> json from which the property value will be retrieved.
# param3=> prop whose value will be retrieved from the json and assigned to the node.
def append_property(node, json, prop):
    if prop in json:
        node[prop] = json[prop]

In [57]:
# This function turns each prize into a node,
# along with its corresponding affiliations (if any).
# It establishes the relationships between each prize and affiliation,
# between each laureate and its prizes, and between each laureate and its affiliations.
# param1=> prize_json is the part of the laureate JSON record corresponding
# to the prizes.
# param2=> laureate is the node with which the relationships will be established.

def insert_prize_n4j(prize_json, laureate):
    for p in prize_json:
        p["id"] = p["year"] + p["category"]
        prize = graph.merge_one("Prize","id",p["id"])
        
        append_property(prize, p, "year")
        append_property(prize, p, "category")
        append_property(prize, p, "motivation")
        
        prize.push()
        affiliations = p["affiliations"]
        for a in affiliations:
            if "name" in a:
                affiliation = graph.merge_one("Affiliation","name",a["name"])
                
                append_property(affiliation, a, "city")
                append_property(affiliation, a, "country")
                
                affiliation.push()
                
                graph.create_unique(Relationship(prize,"RELATED_TO",affiliation))
                graph.create_unique(Relationship(laureate,"AFFILIATED_TO", affiliation))
        
        graph.create_unique(Relationship(laureate,"AWARDED",prize))
    
    
                

In [58]:
# This function creates a new laureate node, fills in
# all its properties and then pushes it to the graph.
# param1=> laureate_json a record from the json file.

def insert_laureate_n4j(laureate_json):
    if ('id' in laureate_json) and ('firstname' in laureate_json) and ('gender' in laureate_json):
        laureate = graph.merge_one("Laureate","id",laureate_json["id"])
        
        append_property(laureate, laureate_json, "firstname")
        append_property(laureate, laureate_json, "surname")
        append_property(laureate, laureate_json, "born")
        append_property(laureate, laureate_json, "died")
        append_property(laureate, laureate_json, "bornCountry")
        append_property(laureate, laureate_json, "bornCountryCode")
        append_property(laureate, laureate_json, "bornCity")
        append_property(laureate, laureate_json, "diedCountry")
        append_property(laureate, laureate_json, "diedCountryCode")
        append_property(laureate, laureate_json, "diedCity")
        append_property(laureate, laureate_json, "gender")
        
        laureate.push()
        
        insert_prize_n4j(laureate_json["prizes"], laureate)

In [59]:
nobels_data_path = "../data/laureate.json"

nobels_file = open(nobels_data_path, "r")

for line in nobels_file:
    laureate_json = json.loads(line)
    insert_laureate_n4j(laureate_json)

### (Neo4j) Men, women and...

In [60]:
%%cypher 
MATCH (l:Laureate)
RETURN l.gender, COUNT(*) AS NUMBER 
ORDER BY NUMBER DESC

3 rows affected.


l.gender,NUMBER
male,833
female,48
org,23


### (Neo4j) Where do the Nobel laureates come from?

In [61]:
%%cypher
MATCH (l:Laureate)-->(a:Affiliation)
RETURN DISTINCT a.name, COUNT(a.name) AS nAffiliates
ORDER BY nAffiliates DESC
LIMIT 10

10 rows affected.


a.name,nAffiliates
University of California,34
Harvard University,27
Massachusetts Institute of Technology (MIT),18
Stanford University,18
University of Chicago,17
University of Cambridge,17
Columbia University,16
California Institute of Technology (Caltech),16
Princeton University,14
Rockefeller University,11


In [62]:
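# Caltech shows one appearance fewer here (16 rather than 17) because Pauling (217) was
# affiliated with it for both of his Nobel prizes: his record lists Caltech twice, but only
# one AFFILIATED_TO relationship is created for that laureate-affiliation pair.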
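
The difference between counting per prize and counting per unique relationship can be reproduced in a few lines of plain Python. The pairs below are a hypothetical, trimmed input (only the Pauling id comes from the note above; the second laureate id is made up).

```python
from collections import Counter

CALTECH = "California Institute of Technology (Caltech)"

# One (laureate_id, affiliation) pair per prize.
pairs = [
    (217, CALTECH),    # Pauling, first prize
    (217, CALTECH),    # Pauling, second prize
    (-1, CALTECH),     # hypothetical second laureate
]

# Counting every occurrence: a laureate with two prizes is counted twice.
per_prize = Counter(name for _, name in pairs)

# Counting each laureate-affiliation pair once, as a unique relationship does.
per_relationship = Counter(name for _, name in set(pairs))

print(per_prize[CALTECH])         # 3
print(per_relationship[CALTECH])  # 2
```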

### (Neo4j) One more, please...

In [63]:
%%cypher
MATCH (l:Laureate)-->(p:Prize)
WITH l, COUNT(p) AS nPrizes
WHERE nPrizes>1
RETURN l.firstname, l.surname, nPrizes


6 rows affected.


l.firstname,l.surname,nPrizes
Marie,"Curie, née Sklodowska",2
Frederick,Sanger,2
John,Bardeen,2
Comité international de la Croix Rouge (International Committee of the Red Cross),,3
Office of the United Nations High Commissioner for Refugees (UNHCR),,2
Linus Carl,Pauling,2


### (Neo4j) Hyperactive...

In [64]:
%%cypher
MATCH (l:Laureate)-->(a:Affiliation)
WITH l.firstname AS firstname, l.surname AS surname, COUNT(a) AS nAffls, COLLECT(a.name) AS affls
RETURN firstname, surname, nAffls, affls
ORDER BY nAffls DESC
LIMIT 1



1 rows affected.


firstname,surname,nAffls,affls
Jack W.,Szostak,3,"[u'Howard Hughes Medical Institute', u'Harvard Medical School', u'Massachusetts General Hospital']"


### (Neo4j) The most fertile states...

In [65]:
%%cypher
MATCH (l:Laureate)
WHERE l.bornCountryCode = 'US'
WITH right(l.bornCity, 2) AS state
RETURN state, count(state) AS nNobels
ORDER BY nNobels DESC
LIMIT 5

5 rows affected.


state,nNobels
NY,64
IL,25
MA,22
CA,16
PA,14


### (Neo4j) We could not leave them out...

In [66]:
%%cypher
MATCH (l:Laureate)-->(p:Prize)
WITH l.firstname AS firstname, l.surname AS surname, l.bornCountryCode AS Country, COLLECT(p.id) AS prizes
WHERE Country = 'ES'
RETURN firstname, surname, prizes

7 rows affected.


firstname,surname,prizes
Jacinto,Benavente,[u'1922literature']
José,Echegaray y Eizaguirre,[u'1904literature']
Camilo José,Cela,[u'1989literature']
Juan Ramón,Jiménez,[u'1956literature']
Vicente,Aleixandre,[u'1977literature']
Santiago,Ramón y Cajal,[u'1906medicine']
Severo,Ochoa,[u'1959medicine']
