# mrjob: Most frequent word

In this example we return the word with the most occurrences in an input file. To do so, we need to define a job made up of several MapReduce steps. A step is a combination of a mapper, a combiner and a reducer; none of these functions is mandatory, but each step must have at least one. mrjob chains the output of each step into the input of the next transparently, which makes it possible to implement more complex jobs.
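
To make the chaining concrete, the following is a minimal plain-Python sketch that mimics two steps without using mrjob at all. The function names `step_one` and `step_two` and the sample input are made up for illustration; only the word pattern is taken from the job defined in this notebook. It runs everything in memory, so it shows the data flow rather than how mrjob or Hadoop actually distribute the work.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")  # same pattern as the job in this notebook


def step_one(lines):
    """Mimics step 1: count each word, then emit (None, (count, word))."""
    counts = defaultdict(int)
    for line in lines:
        for word in WORD_RE.findall(line):
            counts[word.lower()] += 1
    # Every pair shares the key None, so step 2 receives them as one group
    return [(None, (count, word)) for word, count in counts.items()]


def step_two(pairs):
    """Mimics step 2: keep the (count, word) pair with the highest count."""
    return max(value for _key, value in pairs)


sample_lines = ["The cat and the dog", "the end"]  # made-up input
intermediate = step_one(sample_lines)  # output of step 1
result = step_two(intermediate)        # step 1 output becomes step 2 input
print(result)
```

The second step has no mapper of its own, which is allowed because a step only needs one of the three functions.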

In [1]:
! mkdir -p mrjob/palabramasfrecuente

In [2]:
import os
os.chdir("/media/notebooks/mrjob/palabramasfrecuente")

In [3]:
! pwd

/media/notebooks/mrjob/palabramasfrecuente


In [4]:
%%writefile mrjob-ejercicio.py

from mrjob.job import MRJob 
from mrjob.step import MRStep 
import re 
 

WORD_RE = re.compile(r"[\w']+") 

class MRMostUsedWord(MRJob): 
    def mapper_get_words(self, _, line): 
        # For each word in the line, emit a <word, 1> pair
        for word in WORD_RE.findall(line): 
            yield (word.lower(), 1) 

    def combiner_count_words(self, word, counts): 
        # Sum the counts seen so far for this word
        yield (word, sum(counts)) 

 
    def reducer_count_words(self, word, counts): 
        # Send every <num_occurrences, word> pair to the same reducer:
        # the key we emit is None, which is the same for every item
        yield None, (sum(counts), word) 

 
    # Discard the key; it is always None
    def reducer_find_max_word(self, _, word_count_pairs):
        # Each item of word_count_pairs is (count, word),
        # so yielding the maximum gives the word with the most occurrences
        yield max(word_count_pairs) 

 
    def steps(self): 
        return [ 
            MRStep(mapper=self.mapper_get_words, 
                   combiner=self.combiner_count_words, 
                   reducer=self.reducer_count_words), 
            MRStep(reducer=self.reducer_find_max_word) 
        ] 

if __name__ == '__main__': 
    MRMostUsedWord.run() 

Overwriting mrjob-ejercicio.py
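
The second reducer relies on how Python compares sequences: element by element, so the count decides first and the word only breaks ties. The sketch below uses made-up pairs to show this. It also shows that a pair that has gone through a JSON round trip becomes a list, which compares in the same way; this matters under the assumption that the job keeps mrjob's default JSON-based protocols between steps. The alternative tie-break at the end is an illustration, not what the job above does.

```python
import json

# Hypothetical (count, word) pairs, as the second reducer might receive them
word_count_pairs = [(3, "the"), (2, "a"), (3, "zebra")]

# Count is compared first; on a tie, the greater word wins
best = max(word_count_pairs)
print(best)

# After a JSON round trip a tuple comes back as a list, compared the same way
as_lists = [json.loads(json.dumps(pair)) for pair in word_count_pairs]
best_list = max(as_lists)
print(best_list)

# Alternative tie-break (not the job's behaviour): highest count first,
# then the alphabetically first word
best_alpha = min(word_count_pairs, key=lambda pair: (-pair[0], pair[1]))
print(best_alpha)
```

With equal counts, `max` returns the word that sorts last, so a different tie-break needs an explicit key like the one above.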


In [5]:
! python mrjob-ejercicio.py /media/notebooks/marktwain.txt  > ouputlocal

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/mrjob-ejercicio.root.20190812.104209.840901
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/mrjob-ejercicio.root.20190812.104209.840901/output
Streaming final output from /tmp/mrjob-ejercicio.root.20190812.104209.840901/output...
Removing temp directory /tmp/mrjob-ejercicio.root.20190812.104209.840901...


In [6]:
! tail ouputlocal

155184	"the"
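
Each output line holds the key and the value separated by a tab. Assuming the job keeps mrjob's default output protocol, both sides are JSON-encoded, which is why the word appears in double quotes. A minimal sketch for reading such a line back into Python values follows; `parse_output_line` is a hypothetical helper, not part of mrjob.

```python
import json


def parse_output_line(line):
    """Split a 'key<TAB>value' line and decode both sides as JSON."""
    key_json, value_json = line.rstrip("\n").split("\t", 1)
    return json.loads(key_json), json.loads(value_json)


# The line produced by the local run above
count, word = parse_output_line('155184\t"the"\n')
print(count, word)
```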


In [11]:
! hdfs dfs -rm /tmp/carpeta/mrjob-palabramasfrecuente-output/*
! hdfs dfs -rmdir /tmp/carpeta/mrjob-palabramasfrecuente-output

Deleted /tmp/carpeta/mrjob-palabramasfrecuente-output/_SUCCESS
Deleted /tmp/carpeta/mrjob-palabramasfrecuente-output/part-00000


In [12]:
! python mrjob-ejercicio.py hdfs:///tmp/carpeta/marktwain.txt -r hadoop --python-bin /opt/anaconda/bin/python3.7 \
--output-dir hdfs:///tmp/carpeta/mrjob-palabramasfrecuente-output

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /usr/lib/hadoop/bin...
Found hadoop binary: /usr/lib/hadoop/bin/hadoop
Using Hadoop version 2.6.0
Looking for Hadoop streaming jar in /usr/lib/hadoop...
Looking for Hadoop streaming jar in /usr/lib/hadoop-mapreduce...
Found Hadoop streaming jar: /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
Creating temp directory /tmp/mrjob-ejercicio.root.20190812.104945.821766
uploading working dir files to hdfs:///user/root/tmp/mrjob/mrjob-ejercicio.root.20190812.104945.821766/files/wd...
Copying other local files to hdfs:///user/root/tmp/mrjob/mrjob-ejercicio.root.20190812.104945.821766/files/
Running step 1 of 2...
  packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.15.1.jar] /tmp/streamjob6228870573217200255.jar tmpDir=null
  Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
  Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
  Total 

In [13]:
! hdfs dfs -tail /tmp/carpeta/mrjob-palabramasfrecuente-output/part-00000


155184	"the"
