# Patterns example: Short posts

In this exercise we start from a dataset of posts submitted to an Internet forum. The task is to filter the posts so that only those consisting of a single sentence remain in the result. To decide whether a post has one or more sentences, we can look for the characters '.', '?' and '!': a post has a single sentence if its body contains at most one of these characters, and only as its last character.
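
The rule can be written as a small predicate. This is only a minimal sketch, not the reference solution of the exercise: the name `is_single_sentence` is hypothetical, and ignoring trailing whitespace before checking the last character is an assumption that the statement does not spell out.

```python
import re

# The three characters the exercise treats as sentence terminators.
TERMINATORS = re.compile(r"[.?!]")


def is_single_sentence(body):
    """Return True if body has no terminator, or exactly one at the very end."""
    # Assumption: trailing whitespace is not significant.
    text = body.rstrip()
    positions = [m.start() for m in TERMINATORS.finditer(text)]
    if not positions:
        return True
    return len(positions) == 1 and positions[0] == len(text) - 1


for body in ["Este es un mensaje de 1 frase.", "Una frase. Otra frase.", "Sin puntuacion"]:
    print(is_single_sentence(body), repr(body))
```

Note that a body with no terminator at all also counts as a single sentence under this rule.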

A post consists of a series of tab-separated fields. The fields are the following:

"id", "title", "tagnames", "author_id", "body", "node_type", "parent_id", "abs_parent_id", "added_at", "score", "state_string", "last_edited_id", "last_activity_by_id", "last_activity_at", "active_revision_id", "extra", "extra_ref_id", "extra_count", "marked" 

For this exercise we are only interested in the "body" field, which contains the text of the post.
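
As an illustration of how the "body" field can be located, the sketch below writes one record in the quoted, tab-separated format and reads it back with the `csv` module. The names `FIELDS` and `BODY_INDEX` are hypothetical helpers, and the code assumes Python 3; the record values are taken from the small test file used later in this notebook.

```python
import csv
import io

# Field names in the order given above.
FIELDS = [
    "id", "title", "tagnames", "author_id", "body", "node_type", "parent_id",
    "abs_parent_id", "added_at", "score", "state_string", "last_edited_id",
    "last_activity_by_id", "last_activity_at", "active_revision_id", "extra",
    "extra_ref_id", "extra_count", "marked",
]
BODY_INDEX = FIELDS.index("body")

record = [
    "0202", "Titulo", "tags", "9191", "Este es un mensaje de 1 frase.",
    "question", "\\N", "\\N", "2012-02-27 15:09:11.184434+00", "0", "",
    "\\N", "100003268", "2012-02-27 15:09:11.184434+00", "9322", "\\N",
    "\\N", "106", "f",
]

# Serialize the record as one quoted, tab-separated line.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", quotechar='"', quoting=csv.QUOTE_ALL).writerow(record)

# Parse it back and pick out the body by position.
buffer.seek(0)
parsed = next(csv.reader(buffer, delimiter="\t"))
body = parsed[BODY_INDEX]
print(BODY_INDEX, body)
```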

In [1]:
! mkdir -p /media/notebooks/ejemplo-patrones/postscortos

In [2]:
import os
os.chdir("/media/notebooks/ejemplo-patrones/postscortos")

In [3]:
! pwd

/media/notebooks/ejemplo-patrones/postscortos


In [4]:
%%writefile mapper.py
#!/usr/bin/env python
import sys
import csv
import re


reader = csv.reader(sys.stdin, delimiter='\t')
writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

# PUT YOUR CODE HERE

Overwriting mapper.py
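
One possible way to fill in the placeholder is sketched below. It is not the official solution: the function name `filter_short_posts` and the constant names are hypothetical, the code assumes Python 3, and it assumes that trailing whitespace after the final terminator can be ignored. The filter takes its streams as arguments so that it can be tried in memory; inside `mapper.py` it would be called with `sys.stdin` and `sys.stdout`.

```python
import csv
import io
import re

BODY_INDEX = 4  # position of "body" in the field list

# Any run of non-terminators, then at most one terminator, then optional whitespace.
SINGLE_SENTENCE = re.compile(r"[^.?!]*[.?!]?\s*")


def filter_short_posts(infile, outfile):
    """Copy to outfile only the rows whose body is a single sentence."""
    reader = csv.reader(infile, delimiter="\t")
    writer = csv.writer(outfile, delimiter="\t", quotechar='"', quoting=csv.QUOTE_ALL)
    for row in reader:
        if len(row) <= BODY_INDEX:
            continue  # skip malformed or partial lines
        if SINGLE_SENTENCE.fullmatch(row[BODY_INDEX]):
            writer.writerow(row)


# In-memory check with two shortened, made-up records.
sample = io.StringIO(
    '"1"\t"t"\t"tags"\t"9"\t"Una sola frase."\t"question"\n'
    '"2"\t"t"\t"tags"\t"9"\t"Dos frases. Aqui va otra."\t"question"\n'
)
out = io.StringIO()
filter_short_posts(sample, out)
print(out.getvalue())
```

Since this is a pure filter, no reducer is needed: the mapper output is already the final result.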


In [None]:
! hdfs dfs -mkdir /tmp/forumdata

## The files forum_node.tsv and forum_users.tsv must already be downloaded into the folder /media/notebooks/forumdata/

In [5]:
! hdfs dfs -put /media/notebooks/forumdata/* /tmp/forumdata

put: `/tmp/forumdata/forum1.tsv': File exists
put: `/tmp/forumdata/forum_node.tsv': File exists
put: `/tmp/forumdata/forum_users.tsv': File exists


In [6]:
! hdfs dfs -ls /tmp/forumdata

Found 3 items
-rw-r--r--   3 root supergroup       1774 2019-08-12 08:03 /tmp/forumdata/forum1.tsv
-rw-r--r--   3 root supergroup  120313812 2019-08-12 08:03 /tmp/forumdata/forum_node.tsv
-rw-r--r--   3 root supergroup     530989 2019-08-12 08:03 /tmp/forumdata/forum_users.tsv


The following cells run a test with a small posts file called forum1.tsv, which makes it easy to check that the code works.

In [15]:
! hdfs dfs -rm /tmp/salida-postscortosTest/*
! hdfs dfs -rmdir /tmp/salida-postscortosTest

Deleted /tmp/salida-postscortosTest/_SUCCESS
Deleted /tmp/salida-postscortosTest/part-00000


In [16]:
! hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files mapper.py -mapper mapper.py -input /tmp/forumdata/forum1.tsv \
-output /tmp/salida-postscortosTest

packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.15.1.jar] /tmp/streamjob6761638013024166305.jar tmpDir=null
19/08/12 09:45:25 INFO client.RMProxy: Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
19/08/12 09:45:25 INFO client.RMProxy: Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
19/08/12 09:45:25 INFO mapred.FileInputFormat: Total input paths to process : 1
19/08/12 09:45:25 INFO mapreduce.JobSubmitter: number of splits:2
19/08/12 09:45:25 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1565590814059_0014
19/08/12 09:45:25 INFO impl.YarnClientImpl: Submitted application application_1565590814059_0014
19/08/12 09:45:26 INFO mapreduce.Job: The url to track the job: http://yarnmaster:8088/proxy/application_1565590814059_0014/
19/08/12 09:45:26 INFO mapreduce.Job: Running job: job_1565590814059_0014
19/08/12 09:45:31 INFO mapreduce.Job: Job job_1565590814059_0014 running in uber mode : false
19/08/12 09:45:31 INFO mapreduce.Job

In [17]:
! hdfs dfs -cat /tmp/salida-postscortosTest/part-00000

"0202"	"Titulo"	"tags"	"9191"	"Este es un mensaje de 1 frase."	"question"	"\N"	"\N"	"2012-02-27 15:09:11.184434+00"	"0"	""	"\N"	"100003268"	"2012-02-27 15:09:11.184434+00"	"9322"	"\N"	"\N"	"106"	"f"
"0204"	"Titulo"	"tags"	"9191"	"Linea 1\n Linea2\n linea3\n Linea 4."	"question"	"\N"	"\N"	"2012-02-27 15:09:11.184434+00"	"0"	""	"\N"	"100003268"	"2012-02-27 15:09:11.184434+00"	"9322"	"\N"	"\N"	"106"	"f"
"0205"	"Titulo"	"tags"	"9191"	"Este es un mensaje de 3\n lineas\n pero solo 1 frase."	"question"	"\N"	"\N"	"2012-02-27 15:09:11.184434+00"	"0"	""	"\N"	"100003268"	"2012-02-27 15:09:11.184434+00"	"9322"	"\N"	"\N"	"106"	"f"
"0206"	"Titulo"	"tags"	"9191"	"Este es un mensaje de 5\n  lineas\n pero\n solo \n 1 frase."	"question"	"\N"	"\N"	"2012-02-27 15:09:11.184434+00"	"0"	""	"\N"	"100003268"	"2012-02-27 15:09:11.184434+00"	"9322"	"\N"	"\N"	"106"	"f"
"0207"	"Titulo"	"tags"	"9191"	"Este es\n un mensaje de 6\n lineas\n pero \nsolo 1 \nfrase."	"question"	"\N"	"\N"	"2012-02-27 15:09:11.184434+0

In [18]:
! hdfs dfs -rm /tmp/salida-postscortos/*
! hdfs dfs -rmdir /tmp/salida-postscortos

Deleted /tmp/salida-postscortos/_SUCCESS
Deleted /tmp/salida-postscortos/part-00000


In [19]:
! hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files mapper.py -mapper mapper.py -input /tmp/forumdata/forum_node.tsv \
-output /tmp/salida-postscortos

packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.15.1.jar] /tmp/streamjob7310739197634376744.jar tmpDir=null
19/08/12 09:46:10 INFO client.RMProxy: Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
19/08/12 09:46:10 INFO client.RMProxy: Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
19/08/12 09:46:11 INFO mapred.FileInputFormat: Total input paths to process : 1
19/08/12 09:46:11 INFO mapreduce.JobSubmitter: number of splits:2
19/08/12 09:46:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1565590814059_0015
19/08/12 09:46:11 INFO impl.YarnClientImpl: Submitted application application_1565590814059_0015
19/08/12 09:46:11 INFO mapreduce.Job: The url to track the job: http://yarnmaster:8088/proxy/application_1565590814059_0015/
19/08/12 09:46:11 INFO mapreduce.Job: Running job: job_1565590814059_0015
19/08/12 09:46:16 INFO mapreduce.Job: Job job_1565590814059_0015 running in uber mode : false
19/08/12 09:46:16 INFO mapreduce.Job

In [20]:
! hdfs dfs -tail /tmp/salida-postscortos/part-00000

39+00"	"0"	""	"\N"	"100002528"	"2012-02-24 21:41:27.808585+00"	"6407"	"\N"	"\N"	"0"	"f"
“Something bad happened trying to communicate with the website”</p>"	"comment"	"20623"	"20620"	"2012-03-06 04:35:52.670198+00"	"0"	""	"\N"	"100001360"	"2012-03-06 04:35:52.670198+00"	"27315"	"\N"	"\N"	"0"	"f"
‹ Prev 20 1-2 Next 20 ›<br>	
⇒S=p⋅2^p−(2^p−1)<br>	
⇒T(n)=4T(n/4)+theta(logn)+theta(2logn/2)<br>	
⇒T(n)=theta(2^p+1−p−2)<br>	
⇒T(n)=theta(2n)<br>	
⇒T(n)=theta(2n−logn−2)<br>	
⇒T(n)=theta(logn+2logn/2+4logn/4+8logn/8+…[logn terms])<br>	
⇒T(n)=theta(p+2<em>(p−1)+4</em>(p−2)+8<em>(p−3)+…[p terms])<br>	
⇒T(n)=theta(p</em>(1+2^1+2^2+…+2^p)−2(1+2⋅2^1+3⋅2^2+…+p⋅2^p−1))<br>	
⇒T(n)=theta(p⋅(2^p+1−1)−2S) …… (1)<br>	
⇒T(n/2)=2T(n/4)+theta(logn/2)<br>	
∴S=(p−1)⋅2^p+1 …… (2)<br>	
∴T(n)=theta(n)</p>"	"answer"	"11001437"	"11001437"	"2012-07-07 15:11:15.549324+00"	"0"	""	"\N"	"100044448"	"2012-07-07 15:11:15.549324+00"	"11002434"	"\N"	"\N"	"0"	"f"
