# TP1 - Adrian Jose Zapater Reig

# Exercise 1.2: Country with the highest number of good clients.
A MapReduce example in Python that returns the country with the most clients classified as "good" (`bueno` in the data).

## Output:
num_clients    ["country"]

## Design

How many MapReduce steps are needed?

What does each function in each step do?

What data is passed from one function to the next?
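
The job defined later in this notebook uses three steps: a join of good clients with countries, a count per country, and a final maximum. As a rough illustration of what flows between them, here is a plain-Python sketch that mimics the shuffle with in-memory dictionaries. The sample records and the helper names (`shuffle`, `step1_map`, `step1_reduce`, `step2_reduce`, `step3_reduce`, `run`) are made up for illustration and are not part of mrjob.

```python
from collections import defaultdict

# Hypothetical sample input, in the same shape as countries.csv and clients.csv
COUNTRIES = [["Spain", "ES"], ["Canada", "CA"]]
CLIENTS = [["Ann", "bueno", "ES"], ["Bob", "malo", "ES"],
           ["Cy", "bueno", "ES"], ["Dee", "bueno", "CA"]]

def shuffle(pairs):
    """Group values by key and sort each group (a stand-in for sorted values)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in groups.items()}

def step1_map(records):
    # Tag countries with 'A' and good clients with 'B'; key both by country code
    for rec in records:
        if len(rec) == 2:
            yield rec[1], ["A", rec]
        elif rec[1].lower() == "bueno":
            yield rec[2], ["B", rec]

def step1_reduce(code, values):
    # Countries ('A') arrive first, so the name is known when clients ('B') follow
    names = []
    for tag, rec in values:
        if tag == "A":
            names.append(rec[0])
        else:
            for name in names:
                yield name, 1

def step2_reduce(name, ones):
    # One total per country, all emitted under the same key for the final step
    yield None, (sum(ones), name)

def step3_reduce(_, totals):
    # Pick the (count, country) pair with the largest count
    yield max(totals)

def run():
    out1 = [kv for code, vals in shuffle(step1_map(COUNTRIES + CLIENTS)).items()
            for kv in step1_reduce(code, vals)]
    out2 = [kv for name, ones in shuffle(out1).items()
            for kv in step2_reduce(name, ones)]
    return [kv for key, totals in shuffle(out2).items()
            for kv in step3_reduce(key, totals)]

print(run())
```

In this sketch, step 1 passes `(country name, 1)` pairs to step 2, and step 2 passes `(count, country name)` pairs under a single key to step 3.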

#### Note:
The data must be located at: /media/notebook/datos/

The required source files are countries.csv and clients.csv.

The working directory is /media/notebook/tp1-notebooks/mrjob

In [1]:
! mkdir -p /media/notebook/tp1-notebooks/mrjob

In [2]:
import os
os.chdir("/media/notebook/tp1-notebooks/mrjob")

In [3]:
! pwd

/media/notebook/tp1-notebooks/mrjob


## The files countries.csv and clients.csv are loaded from the folder /media/notebook/datos/

In [4]:
cat /media/notebook/datos/countries.csv

Name,Code
Afghanistan,AF
Åland Islands,AX
Albania,AL
Algeria,DZ
American Samoa,AS
Andorra,AD
Angola,AO
Anguilla,AI
Antarctica,AQ
Antigua and Barbuda,AG
Argentina,AR
Armenia,AM
Aruba,AW
Australia,AU
Austria,AT
Azerbaijan,AZ
Bahamas,BS
Bahrain,BH
Bangladesh,BD
Barbados,BB
Belarus,BY
Belgium,BE
Belize,BZ
Benin,BJ
Bermuda,BM
Bhutan,BT
"Bolivia, Plurinational State of",BO
"Bonaire, Sint Eustatius and Saba",BQ
Bosnia and Herzegovina,BA
Botswana,BW
Bouvet Island,BV
Brazil,BR
British Indian Ocean Territory,IO
Brunei Darussalam,BN
Bulgaria,BG
Burkina Faso,BF
Burundi,BI
Cambodia,KH
Cameroon,CM
Canada,CA
Cape Verde,CV
Cayman Islands,KY
Central African Republic,CF
Chad,TD
Chile,CL
China,CN
Christmas Island,CX
Cocos (Keeling) Islands,CC
Colombia,CO
Comoros,KM
Congo,CG
"Congo, the Democratic Republic of the",CD
Cook Islands,CK
Costa Rica,CR
Côte d'Ivoire,CI
Croatia,HR
Cuba,CU
Curaçao,C
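
Some names in this listing are quoted because they contain commas, for example `"Bolivia, Plurinational State of",BO`. The following minimal sketch, using Python's standard `csv` module, shows how such a line differs when split naively on commas versus parsed as CSV. The `parse_line` helper is a hypothetical name used only here.

```python
import csv

def parse_line(line):
    """Parse a single CSV line into a list of fields, honoring quoted commas."""
    return next(csv.reader([line]))

line = '"Bolivia, Plurinational State of",BO'

naive = line.split(",")    # breaks the quoted name into two pieces
parsed = parse_line(line)  # keeps the quoted name as one field

print(len(naive), naive)
print(len(parsed), parsed)
```

With the naive split the line has three fields, so a check such as "two fields means country" would not recognize it as a country record.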

In [5]:
cat /media/notebook/datos/clients.csv

Bertram Pearcy  ,bueno,SO
Steven Ulman  ,regular,ZA
Enid Follansbee  ,malo,GS
Candie Jacko  ,malo,SS
Alana Zufelt  ,regular,ES
Craig Pinkett  ,malo,LK
Carson Levey  ,bueno,GU
Reanna Calabrese  ,regular,GT
Elliott Kosak  ,malo,GG
Yuette Steinman  ,bueno,GN
Grisel Wines  ,regular,GW
Kathryne Dieguez  ,regular,AE
Donna Raabe  ,malo,GB
Norine Mundt  ,bueno,US
Brittaney Amaro  ,bueno,ES
Penni Husted  ,bueno,ES
Delmer Semon  ,malo,IT
Lennie Dunkerson  ,bueno,CA
Mayra Bobb  ,regular,IT
Altagracia Merced  ,regular,CA
Verda Belgrave  ,malo,GB
Jonnie Urban  ,malo,US
Chung Frankum  ,malo,ES
Vincenzo Samples  ,regular,TT
Dominick Barkan  ,bueno,GU
Carisa Ellingwood  ,bueno,TR
Garret Wess  ,regular,TM
Zoraida Muise  ,bueno,GU
Samantha Cusson  ,bueno,PT
Jenine Greenburg  ,regular,PR
Geri Paddock  ,bueno,QA
Antonia Klosterman  ,regular,RE
Moriah Galey  ,malo,RO
Nyla Eckard  ,malo,GB
Arlean Harries  ,malo,US
Kenyatta Lippold  ,malo,ES
Samuel Knipe  ,malo,MV
Jamison

In [6]:
%%writefile mrjob-ejercicio_1_2.py
import csv
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRPaisMaxClientesBuenos(MRJob):

    # Enable secondary sort: each reducer receives its values in sorted order
    SORT_VALUES = True

    # Same as in exercise 1.1
    def map_and_filter(self, _, line):
        # Parse with csv so that quoted names containing commas stay in one field
        splits = next(csv.reader([line.rstrip("\n")]))

        if len(splits) == 2: # country data
            symbol = 'A' # sort countries before client records
            country2digit = splits[1]
            yield country2digit, [symbol, splits]
        else: # client data
            # Keep only the clients classified as "bueno" (good)
            if splits[1].lower() == "bueno":
                symbol = 'B'
                country2digit = splits[2]
                yield country2digit, [symbol, splits]
                
    # Same as in exercise 1.1
    def reducer_join_clients_country(self, key, values):
        countries = [] # countries arrive first because they carry the 'A' tag
        for value in values:
            if value[0] == 'A':
                countries.append(value)
            if value[0] == 'B':
                # Emit one (country name, 1) pair per good client
                for country in countries:
                    countryName = country[1][0]
                    yield [countryName], 1
    
    # This reducer returns each country together with its total number of good clients.
    # Because every result of this reducer must reach the same reducer in the next step,
    # all of them have to be emitted under a single key. The value of that key does not
    # matter, so we use None.
    # We emit a pair (total number of good clients, country).
    def reducer_count_clients(self, country, counts):
        yield None, (sum(counts), country)
    
    
    # This reducer returns the maximum number of good clients and the matching country.
    # It applies Python's max function to the (count, country) pairs and returns the pair
    # with the most good clients.
    # Note: if several countries are tied, only one of them is returned.
    def reducer_max_clients_bueno(self, _, country_pairs):
        count, country = max(country_pairs)
        yield count, country
        
        
    # We use steps to define the order of the mappers and reducers.
    def steps(self):
        return [
            MRStep(
                mapper=self.map_and_filter,
                reducer=self.reducer_join_clients_country),
            MRStep(
                reducer=self.reducer_count_clients
            ),
            MRStep(
                reducer=self.reducer_max_clients_bueno
            )
        ]
    
    
if __name__ == '__main__':
    MRPaisMaxClientesBuenos.run()

Overwriting mrjob-ejercicio_1_2.py
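
As the note on the last reducer says, `max` returns a single pair even when several countries share the top count. Below is a small sketch of that behavior, assuming the pairs reach the reducer as `[count, [country]]` lists. The sample values and the `all_maxima` helper are hypothetical and are not part of the job above.

```python
def all_maxima(pairs):
    """Return every [count, country] pair whose count equals the maximum."""
    pairs = list(pairs)
    top = max(count for count, _ in pairs)
    return sorted(pair for pair in pairs if pair[0] == top)

# Hypothetical totals in which two countries are tied
pairs = [[3, ["Spain"]], [3, ["Guam"]], [1, ["Canada"]]]

# Lists compare element by element, so a tie on the count is
# broken by the country name: "Spain" > "Guam"
print(max(pairs))

# A variant that keeps all tied countries instead of a single one
print(all_maxima(pairs))
```

If reporting every tied country matters, the final reducer could emit one line per pair returned by a helper like this instead of a single `max`.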


First we run the code locally and then on Hadoop.

In [7]:
! python mrjob-ejercicio_1_2.py /media/notebook/datos/countries.csv  \
/media/notebook/datos/clients.csv > ouputlocal

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/mrjob-ejercicio_1_2.root.20191115.230416.118125
Running step 1 of 3...
Running step 2 of 3...
Running step 3 of 3...
job output is in /tmp/mrjob-ejercicio_1_2.root.20191115.230416.118125/output
Streaming final output from /tmp/mrjob-ejercicio_1_2.root.20191115.230416.118125/output...
Removing temp directory /tmp/mrjob-ejercicio_1_2.root.20191115.230416.118125...


In [8]:
! cat ouputlocal

3	["Spain"]
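
Each output line holds a JSON-encoded key and a JSON-encoded value separated by a tab, which is mrjob's default output format. A minimal sketch for reading such a line back into Python objects follows; the `parse_output_line` helper is a hypothetical name.

```python
import json

def parse_output_line(line):
    """Split a 'key<TAB>value' line and decode both sides as JSON."""
    key, value = line.rstrip("\n").split("\t", 1)
    return json.loads(key), json.loads(value)

# The line produced by the local run above
count, country = parse_output_line('3\t["Spain"]\n')
print(count, country[0])
```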


In [9]:
! hdfs dfs -mkdir -p /tmp/mrjoin
! hdfs dfs -put -f /media/notebook/datos/countries.csv  /tmp/mrjoin
! hdfs dfs -put -f /media/notebook/datos/clients.csv  /tmp/mrjoin

In [10]:
! hdfs dfs -ls  /tmp/mrjoin

Found 2 items
-rw-r--r--   3 root supergroup       1289 2019-11-15 23:04 /tmp/mrjoin/clients.csv
-rw-r--r--   3 root supergroup       4120 2019-11-15 23:04 /tmp/mrjoin/countries.csv


We delete the HDFS folder where the program's output will be written, along with its contents.

In [11]:
! hdfs dfs -rm /tmp/carpeta/mrjob-join-output/*
! hdfs dfs -rmdir /tmp/carpeta/mrjob-join-output

Deleted /tmp/carpeta/mrjob-join-output/_SUCCESS
Deleted /tmp/carpeta/mrjob-join-output/part-00000


In [12]:
! python mrjob-ejercicio_1_2.py hdfs:///tmp/mrjoin/* -r hadoop --python-bin /opt/anaconda/bin/python3.7 \
--output-dir hdfs:///tmp/carpeta/mrjob-join-output

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /usr/lib/hadoop/bin...
Found hadoop binary: /usr/lib/hadoop/bin/hadoop
Using Hadoop version 2.6.0
Looking for Hadoop streaming jar in /usr/lib/hadoop...
Looking for Hadoop streaming jar in /usr/lib/hadoop-mapreduce...
Found Hadoop streaming jar: /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
Creating temp directory /tmp/mrjob-ejercicio_1_2.root.20191115.230438.143364
uploading working dir files to hdfs:///user/root/tmp/mrjob/mrjob-ejercicio_1_2.root.20191115.230438.143364/files/wd...
Copying other local files to hdfs:///user/root/tmp/mrjob/mrjob-ejercicio_1_2.root.20191115.230438.143364/files/
Running step 1 of 3...
  packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.15.1.jar] /tmp/streamjob2905431204285557816.jar tmpDir=null
  Connecting to ResourceManager at yarnmaster/172.22.0.2:8032
  Connecting to ResourceManager at yarnmaster/172.22.0.2:8

Removing temp directory /tmp/mrjob-ejercicio_1_2.root.20191115.230438.143364...


In [13]:
! hdfs dfs -cat /tmp/carpeta/mrjob-join-output/part-00000

3	["Spain"]


## Conclusion
