# Pattern example: Inverted index

In this exercise we will implement an inverted index over the words that appear in the body text of the posts in the provided dataset.

Each post consists of a series of tab-separated fields. The fields are:

"id", "title", "tagnames", "author_id", "body", "node_type", "parent_id", "abs_parent_id", "added_at", "score", "state_string", "last_edited_id", "last_activity_by_id", "last_activity_at", "active_revision_id", "extra", "extra_ref_id", "extra_count", "marked" 

For each word that appears in the body field, the index must contain a list of the identifiers (id field) of the posts in which it appears, together with a counter indicating in how many posts it appears. To split the text into words, we can use whitespace as well as the following characters: .!?:;"()<>[]#$=~/
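
As a small illustration of the splitting rule, the following sketch tokenizes a body with a regular expression built from the separators listed above and collects a tiny in-memory index. The names `SEPARATORS` and `tokenize`, the lowercasing step, and the two sample posts are assumptions made for illustration; the exercise does not prescribe them.

```python
import re
from collections import defaultdict

# Whitespace plus the separator characters listed in the exercise: .!?:;"()<>[]#$=~/
SEPARATORS = re.compile(r'[\s.!?:;"()<>\[\]#$=~/]+')


def tokenize(body):
    """Split a post body into lowercase words, dropping empty tokens."""
    return [w.lower() for w in SEPARATORS.split(body) if w]


# Hypothetical sample posts as (id, body) pairs.
posts = [(1, "Hello world! Hello?"), (2, "world of (MapReduce)")]

# word -> list of post ids, one entry per occurrence
index = defaultdict(list)
for post_id, body in posts:
    for word in tokenize(body):
        index[word].append(post_id)

for word in sorted(index):
    print(word, len(index[word]), index[word])
```

Note that characters outside the separator list, such as commas or apostrophes, remain part of the word under this rule.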

This index should make it possible to answer questions such as:

 - How many times does the word "frase" appear in the forums?

 - In which posts does the word "frase" appear (in ascending order)?
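
Once the index exists, both questions reduce to a lookup. The sketch below assumes index lines of the form `word<TAB>count<TAB>[id, id, ...]`, which is the layout of the sample output shown later in this notebook. The function name `load_index` is hypothetical, and the two sample lines are copied from that output.

```python
import ast


def load_index(lines):
    """Parse 'word<TAB>count<TAB>[ids]' lines into {word: (count, ids)}."""
    index = {}
    for line in lines:
        word, count, ids = line.rstrip("\n").split("\t")
        # ast.literal_eval safely turns the "[1, 2, 3]" text into a list of ints
        index[word] = (int(count), ast.literal_eval(ids))
    return index


sample = [
    "zzz\t6\t[499, 8385, 44153, 44153, 1035273, 8001021]\n",
    "zzzzz\t2\t[1007093, 5006080]\n",
]
index = load_index(sample)

count, ids = index["zzz"]
print(count)             # number of occurrences recorded for "zzz"
print(sorted(set(ids)))  # distinct posts containing "zzz", in ascending order
```

For the full output file this loads everything into memory; for a single query, scanning the file for the matching word would be enough.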


In [1]:
! mkdir -p ejemplo-patrones/indiceinvertido

In [2]:
import os
os.chdir("/media/notebooks/ejemplo-patrones/indiceinvertido")

In [3]:
! pwd

/media/notebooks/ejemplo-patrones/indiceinvertido


In [4]:
%%writefile mapper.py
#!/usr/bin/env python

import sys
import csv
import re

# Use whitespace and these characters to separate words: .!?:;"()<>[]#$=~/


reader = csv.reader(sys.stdin, delimiter='\t')
writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

for line in reader:

    # PUT YOUR CODE HERE
    pass  # placeholder so the template is valid Python


Overwriting mapper.py
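
One possible way to fill in the mapper is sketched below; it is not the reference solution. It assumes that `id` is the first field and `body` the fifth (as in the field list above), that a header row can be recognised by its non-numeric id, and that words should be lowercased. The function name `map_posts` and the sample rows are hypothetical. Because the sample output in this notebook shows plain, unquoted `word<TAB>id` lines, the sketch yields pairs to be printed that way instead of going through the `QUOTE_ALL` writer.

```python
import csv
import re

# Whitespace plus the separator characters listed in the exercise: .!?:;"()<>[]#$=~/
SEPARATORS = re.compile(r'[\s.!?:;"()<>\[\]#$=~/]+')


def map_posts(lines):
    """Yield (word, post_id) for every word in the body of every post."""
    for row in csv.reader(lines, delimiter='\t'):
        # Skip short rows and the header (assumption: real ids are numeric).
        if len(row) < 5 or not row[0].isdigit():
            continue
        post_id, body = row[0], row[4]
        for word in SEPARATORS.split(body):
            if word:
                yield word.lower(), post_id


# Hypothetical input: a header row and one post, truncated to five fields.
sample = [
    '"id"\t"title"\t"tagnames"\t"author_id"\t"body"\n',
    '"7"\t"Greeting"\t"intro"\t"42"\t"Hello world! Hello?"\n',
]
for word, post_id in map_posts(sample):
    print("{0}\t{1}".format(word, post_id))
```

In `mapper.py` the same function would be fed `sys.stdin` instead of the sample list.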


In [5]:
%%writefile reducer.py
#!/usr/bin/env python

import sys


prevWord = None
nodes = []

for line in sys.stdin:
    # PUT YOUR CODE HERE
    pass  # placeholder so the template is valid Python


Overwriting reducer.py
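
A matching reducer could look like the sketch below, again offered as one option and not as the reference solution. It assumes the input arrives sorted by word as `word<TAB>id` lines, keeps one list entry per occurrence, and sorts the ids numerically, which is consistent with the sample output in this notebook. The function name `reduce_pairs` is hypothetical; the sample lines are taken from the mapper output shown below.

```python
def reduce_pairs(lines):
    """Group sorted 'word<TAB>id' lines into 'word<TAB>count<TAB>[ids]' lines."""
    prev_word = None
    nodes = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # ignore malformed lines
        word, node = parts
        if prev_word is not None and word != prev_word:
            # The word changed, so the previous group is complete.
            yield "{0}\t{1}\t{2}".format(prev_word, len(nodes), sorted(nodes))
            nodes = []
        prev_word = word
        nodes.append(int(node))
    if prev_word is not None:
        # Flush the last group.
        yield "{0}\t{1}\t{2}".format(prev_word, len(nodes), sorted(nodes))


sample = ["zzzzz\t1007093\n", "zzzzz\t5006080\n", "zzzzzzzz\t60416\n"]
for out_line in reduce_pairs(sample):
    print(out_line)
```

Converting the ids to integers before sorting matters because the shell `sort` orders them as text, so `1007745` would otherwise come before `18342`.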


In [6]:
! hdfs dfs -mkdir /tmp/forumdata

mkdir: `/tmp/forumdata': File exists


## The files forum_node.tsv and forum_users.tsv must be downloaded to the folder /media/notebooks/forumdata/

First we test our code locally, running only the map function.

In [7]:
! cat /media/notebooks/forumdata/forum_node.tsv | python mapper.py | sort > salmap

In [8]:
! tail salmap

zzzz	14790
zzzz	30278
zzzz	30278
zzzz	30278
zzzz	6012819
zzzz	6013087
zzzzz	1007093
zzzzz	5006080
zzzzzzzz	60416
zzzzzzzzzzzzzzz	8353


Now we test our code locally, running both the map and reduce functions.

In [10]:
! cat salmap | python reducer.py > sal

In [11]:
! tail sal

zyxwvutsrqponmlkjihgfedcba	2	[2006416, 2006973]
zz	14	[18342, 18342, 1007745, 1007745, 3001761, 3001761, 6012667, 6017785, 6026138, 8001021, 8001116, 10011348, 10011348, 10011348]
zzpayr68yq0	2	[66767, 66767]
zzyov	5	[9001731, 9001731, 9001978, 9001978, 9001981]
zzz	6	[499, 8385, 44153, 44153, 1035273, 8001021]
zzzax	1	[11004608]
zzzz	8	[14790, 14790, 14790, 30278, 30278, 30278, 6012819, 6013087]
zzzzz	2	[1007093, 5006080]
zzzzzzzz	1	[60416]
zzzzzzzzzzzzzzz	1	[8353]


In [12]:
! cat /media/notebooks/forumdata/forum_node.tsv | python mapper.py | \
sort | python reducer.py > sal 

In [13]:
! tail sal

zyxwvutsrqponmlkjihgfedcba	2	[2006416, 2006973]
zz	14	[18342, 18342, 1007745, 1007745, 3001761, 3001761, 6012667, 6017785, 6026138, 8001021, 8001116, 10011348, 10011348, 10011348]
zzpayr68yq0	2	[66767, 66767]
zzyov	5	[9001731, 9001731, 9001978, 9001978, 9001981]
zzz	6	[499, 8385, 44153, 44153, 1035273, 8001021]
zzzax	1	[11004608]
zzzz	8	[14790, 14790, 14790, 30278, 30278, 30278, 6012819, 6013087]
zzzzz	2	[1007093, 5006080]
zzzzzzzz	1	[60416]
zzzzzzzzzzzzzzz	1	[8353]


Now we test it on Hadoop.

In [14]:
! hdfs dfs -put /media/notebooks/forumdata/* /tmp/forumdata

put: `/tmp/forumdata/forum1.tsv': File exists
put: `/tmp/forumdata/forum_node.tsv': File exists
put: `/tmp/forumdata/forum_users.tsv': File exists


In [15]:
! hdfs dfs -ls /tmp/forumdata

Found 3 items
-rw-r--r--   3 root supergroup       1774 2019-08-12 08:03 /tmp/forumdata/forum1.tsv
-rw-r--r--   3 root supergroup  120313812 2019-08-12 08:03 /tmp/forumdata/forum_node.tsv
-rw-r--r--   3 root supergroup     530989 2019-08-12 08:03 /tmp/forumdata/forum_users.tsv


Before running the job, we delete the folder that stores the output.

In [16]:
! hdfs dfs -rm /tmp/salida-indice/*
! hdfs dfs -rmdir /tmp/salida-indice

Deleted /tmp/salida-indice/_SUCCESS
Deleted /tmp/salida-indice/part-00000


In [17]:
! hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py \
-input /tmp/forumdata/forum_node.tsv -output /tmp/salida-indice

packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.15.1.jar] /tmp/streamjob7080822615855652850.jar tmpDir=null
19/08/12 09:30:39 INFO client.RMProxy: Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
19/08/12 09:30:39 INFO client.RMProxy: Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
19/08/12 09:30:39 INFO mapred.FileInputFormat: Total input paths to process : 1
19/08/12 09:30:39 INFO mapreduce.JobSubmitter: number of splits:2
19/08/12 09:30:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1565590814059_0010
19/08/12 09:30:40 INFO impl.YarnClientImpl: Submitted application application_1565590814059_0010
19/08/12 09:30:40 INFO mapreduce.Job: The url to track the job: http://yarnmaster:8088/proxy/application_1565590814059_0010/
19/08/12 09:30:40 INFO mapreduce.Job: Running job: job_1565590814059_0010
19/08/12 09:30:45 INFO mapreduce.Job: Job job_1565590814059_0010 running in uber mode : false
19/08/12 09:30:45 INFO mapreduce.Job

In [18]:
! hdfs dfs -tail /tmp/salida-indice/part-00000

n	1	[6382]
zylaijejwx	2	[2009935, 2009935]
zyrc	2	[6003614, 6003614]
zyrcster	41	[9419, 31688, 38541, 38541, 38541, 38541, 38541, 38541, 38541, 38541, 38541, 38541, 41733, 41733, 41733, 61855, 61855, 61855, 63196, 63196, 63196, 6007347, 6007347, 6007347, 6008775, 6016174, 6016174, 6016470, 6016473, 7000024, 7000089, 7000089, 7000089, 7000514, 7000738, 7000738, 7000738, 7001187, 7001187, 7001187, 7002685]
zyrcster's	1	[9247]
zyrcstr	1	[62785]
zyrcter	1	[11610]
zyrcwords	1	[6020095]
zytrax	3	[6028725, 6028725, 7002663]
zyvex	1	[944]
zyx	1	[8004310]
zyxwvutsrqponmlkjihgfedcba	2	[2006416, 2006973]
zz	14	[18342, 18342, 1007745, 1007745, 3001761, 3001761, 6012667, 6017785, 6026138, 8001021, 8001116, 10011348, 10011348, 10011348]
zzpayr68yq0	2	[66767, 66767]
zzyov	5	[9001731, 9001731, 9001978, 9001978, 9001981]
zzz	6	[499, 8385, 44153, 44153, 1035273, 8001021]
zzzax	1	[11004608]
zzzz	8	[14790, 14790, 14790, 30278, 30278, 30278, 6012819, 6013087]
zzzzz	2	[1007093, 5006080]
z