# Template for online assignment BDA02

# David Carlón Cembranos

In this assignment you must complete the cells that are incomplete. The expected result of each run is shown; your job is to implement a MapReduce process that produces it. You may implement the MapReduce process in whatever language and library you prefer (`Bash`, Python, `mrjob`, ...). The input data are only examples: the process you implement should work with them and with any other input file that has the same structure.

## 1.- Starting from the `notas.txt` file, compute the highest mark obtained by each student with a MapReduce process.

That is, given the following marks file:

In [23]:
%%writefile notas.txt
pedro 6 7
luis 0 4
ana 7
pedro 8 1 3
ana 5 6 7
ana 10
luis 3

Overwriting notas.txt


The following result is expected:

![solution 1](./img/1.png)
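
Before moving to `mrjob`, the three phases of the task can be traced by hand. The following is a minimal plain-Python sketch, not part of the assignment solution: the helper names `map_line`, `shuffle` and `reduce_marks` are hypothetical, and the grouping step only imitates what a MapReduce framework does between the map and reduce phases.

```python
from collections import defaultdict

# Sample input copied from notas.txt above.
SAMPLE = """pedro 6 7
luis 0 4
ana 7
pedro 8 1 3
ana 5 6 7
ana 10
luis 3"""


def map_line(line):
    """Map phase: emit one (name, mark) pair per mark on the line."""
    name, *marks = line.split()
    for mark in marks:
        yield name, float(mark)


def shuffle(pairs):
    """Shuffle phase: group all values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_marks(name, marks):
    """Reduce phase: keep only the highest mark for each student."""
    return name, max(marks)


pairs = [pair for line in SAMPLE.splitlines() for pair in map_line(line)]
grouped = shuffle(pairs)
highest = dict(reduce_marks(name, marks) for name, marks in sorted(grouped.items()))
print(highest)  # {'ana': 10.0, 'luis': 4.0, 'pedro': 8.0}
```

Because `max` is associative, the same function could also serve as a combiner, although with an input this small it would make no practical difference.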

In [24]:
%%writefile marksMR.py
#!/usr/bin/python3

from mrjob.job import MRJob

# Define an MRJob class
class MarksMR(MRJob):

    # Mapper: at this stage there is no key yet (_); the value arrives in the variable line
    def mapper(self, _, line):
        # Split each line into its whitespace-separated fields: the name, then the marks
        name, *marks = line.split()
        for mark in marks:
            yield name, float(mark)

    # Reducer: the key is the name and the values are that student's marks
    def reducer(self, name, marks):
        yield name, max(marks)
        
if __name__=='__main__':
    MarksMR.run()

Overwriting marksMR.py


In [25]:
! chmod ugo+x marksMR.py

In [26]:
! python3 marksMR.py notas.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/marksMR.root.20240120.120444.034067
Running step 1 of 1...
job output is in /tmp/marksMR.root.20240120.120444.034067/output
Streaming final output from /tmp/marksMR.root.20240120.120444.034067/output...
"ana"	10.0
"luis"	4.0
"pedro"	8.0
Removing temp directory /tmp/marksMR.root.20240120.120444.034067...


In [28]:
! hadoop fs -copyFromLocal notas.txt /user/root/

In [29]:
! python3 marksMR.py -r hadoop hdfs:///user/root/notas.txt

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /app/hadoop-3.3.1/bin...
Found hadoop binary: /app/hadoop-3.3.1/bin/hadoop
Using Hadoop version 3.3.1
Looking for Hadoop streaming jar in /app/hadoop-3.3.1...
Found Hadoop streaming jar: /app/hadoop-3.3.1/share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar
Creating temp directory /tmp/marksMR.root.20240120.121356.073349
uploading working dir files to hdfs:///user/root/tmp/mrjob/marksMR.root.20240120.121356.073349/files/wd...
Copying other local files to hdfs:///user/root/tmp/mrjob/marksMR.root.20240120.121356.073349/files/
Running step 1 of 1...
  packageJobJar: [/tmp/hadoop-unjar4757324624452854386/] [] /tmp/streamjob1839655337540389898.jar tmpDir=null
  Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
  Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
  Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1705747935772_0

## 2.- Using a MapReduce process, show the 10 most frequently used words in `El Quijote`.

First, download El Quijote:

In [16]:
! wget -O '2000-0.txt' https://www.gutenberg.org/files/2000/2000-0.txt

--2024-01-17 19:46:36--  https://www.gutenberg.org/files/2000/2000-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2226045 (2.1M) [text/plain]
Saving to: ‘2000-0.txt’


2024-01-17 19:46:37 (2.40 MB/s) - ‘2000-0.txt’ saved [2226045/2226045]



As in the first practical exercise, we remove the lines that are metadata and not part of the work, then overwrite the file without them.

In [6]:
with open('2000-0.txt') as f:
    lines = f.readlines()

head = 24
tail = 360
book = lines[head:-tail]

# Lines returned by readlines() already end with a newline, so write them back unchanged
with open('2000-0.txt', 'w') as f:
    f.writelines(book)
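
The fixed `head` and `tail` offsets above depend on the exact file that was downloaded. As an alternative, here is a minimal sketch that cuts the text at marker lines instead. This is a different technique from the one used in the cell above: the function name `strip_gutenberg_boilerplate` is hypothetical, and the marker prefixes `*** START OF` and `*** END OF` are assumptions about the file's layout that should be checked against the real download. It may also keep a slightly different range of lines than the fixed offsets, so the word counts could differ.

```python
def strip_gutenberg_boilerplate(lines, start_marker="*** START OF", end_marker="*** END OF"):
    """Return only the lines strictly between the start and end marker lines.

    The marker strings are assumptions about the file's layout; if either
    marker is missing, the corresponding edge of the file is kept.
    """
    start = 0
    end = len(lines)
    for index, line in enumerate(lines):
        if line.startswith(start_marker):
            start = index + 1
        elif line.startswith(end_marker):
            end = index
            break
    return lines[start:end]


# Tiny made-up sample, not the real file.
sample = [
    "Some licence header\n",
    "*** START OF THE EXAMPLE EBOOK ***\n",
    "En un lugar de la Mancha\n",
    "*** END OF THE EXAMPLE EBOOK ***\n",
    "Some licence footer\n",
]
body = strip_gutenberg_boilerplate(sample)
print(body)  # ['En un lugar de la Mancha\n']
```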


The result should match the one we obtained in the first practical exercise.

![solution 2](./img/2.png)

In [7]:
%%writefile most_used_words.py
#!/usr/bin/python3

from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_PATTERN = re.compile(r"[\w']+")

class MRCommonWord(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.map_words,
                   combiner=self.combine_counts,
                   reducer=self.reduce_counts),
            MRStep(reducer=self.reduce_frequency)
        ]

    def map_words(self, _, line):
        for word in WORD_PATTERN.findall(line):
            yield (word.lower(), 1)

    def combine_counts(self, word, counts):
        yield (word, sum(counts))

    def reduce_counts(self, word, counts):
        yield None, (sum(counts), word)
        
    def reduce_frequency(self, _, word_counts):
        sorted_counts = sorted(word_counts, key=lambda x: x[0], reverse=True)
        yield None, list(sorted_counts[:10])

if __name__ == '__main__':
    MRCommonWord.run()

Overwriting most_used_words.py
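
The last step of the job sorts every `(count, word)` pair and then slices off the first ten. A possible variation, sketched here in plain Python on a made-up sentence rather than on the book, is to select the top items with `heapq.nlargest`, which avoids sorting the whole list. The variable names and the sample text are illustrative only.

```python
import heapq
import re
from collections import Counter

WORD_PATTERN = re.compile(r"[\w']+")

# Made-up sample text, not taken from the book.
text = "que de que y de que la"

# Count lowercase words, as the map and first reduce steps do together.
counts = Counter(word.lower() for word in WORD_PATTERN.findall(text))

# Same (count, word) pair layout that reduce_counts emits in the job above.
pairs = [(count, word) for word, count in counts.items()]

# nlargest keeps only the top n items instead of sorting every pair.
top_two = heapq.nlargest(2, pairs, key=lambda pair: pair[0])
print(top_two)  # [(3, 'que'), (2, 'de')]
```

Inside the job, the same call could replace the `sorted(...)[:10]` expression with `n` set to 10; the single-reducer bottleneck of the second step would remain.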


In [8]:
! chmod ugo+x most_used_words.py

In [9]:
! python3 most_used_words.py 2000-0.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/most_used_words.root.20240120.112256.574476
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/most_used_words.root.20240120.112256.574476/output
Streaming final output from /tmp/most_used_words.root.20240120.112256.574476/output...
null	[[20651, "que"], [18276, "de"], [18150, "y"], [10433, "la"], [9816, "a"], [8236, "en"], [8204, "el"], [6304, "no"], [4737, "los"], [4723, "se"]]
Removing temp directory /tmp/most_used_words.root.20240120.112256.574476...


In [11]:
! hadoop fs -copyFromLocal 2000-0.txt /user/root/

In [12]:
! python3 most_used_words.py -r hadoop hdfs:///user/root/2000-0.txt

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /app/hadoop-3.3.1/bin...
Found hadoop binary: /app/hadoop-3.3.1/bin/hadoop
Using Hadoop version 3.3.1
Looking for Hadoop streaming jar in /app/hadoop-3.3.1...
Found Hadoop streaming jar: /app/hadoop-3.3.1/share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar
Creating temp directory /tmp/most_used_words.root.20240120.112311.453859
uploading working dir files to hdfs:///user/root/tmp/mrjob/most_used_words.root.20240120.112311.453859/files/wd...
Copying other local files to hdfs:///user/root/tmp/mrjob/most_used_words.root.20240120.112311.453859/files/
Running step 1 of 2...
  packageJobJar: [/tmp/hadoop-unjar7679586938635845343/] [] /tmp/streamjob3618276325474358801.jar tmpDir=null
  Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
  Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
  Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.sta

## 3.- Show the La Liga standings for the 2021/2022 season, counting only the points obtained as the away team.

On [this website](https://resultados.as.com/resultados/futbol/primera/2021_2022/clasificacion/) you can check how many points each team obtained away from home.

We start by downloading the results file for the 2021/2022 season, saving it as `laliga2122.csv`.

In [23]:
! wget -O laliga2122.csv https://www.football-data.co.uk/mmz4281/2122/SP1.csv

--2022-12-05 10:25:13--  https://www.football-data.co.uk/mmz4281/2122/SP1.csv
Resolving www.football-data.co.uk (www.football-data.co.uk)... 217.160.0.246
Connecting to www.football-data.co.uk (www.football-data.co.uk)|217.160.0.246|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 172174 (168K) [text/csv]
Saving to: ‘laliga2122.csv’


2022-12-05 10:25:14 (625 KB/s) - ‘laliga2122.csv’ saved [172174/172174]



This result is expected:

![solution 3](./img/3.png)
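
The scoring rule behind this table can be checked on a few rows before writing the job. The sketch below is plain Python over made-up matches: the team names and scores are invented, the header names other than `HomeTeam` are assumptions, and only the column positions (home team at index 3, away team at index 4, full-time result at index 7) follow the layout that the solution relies on.

```python
import csv
import io
from collections import Counter

# Made-up matches; only the column positions mirror the real CSV.
SAMPLE_CSV = """Div,Date,Time,HomeTeam,AwayTeam,FTHG,FTAG,FTR
SP1,01/01/2022,20:00,Alpha,Beta,0,2,A
SP1,02/01/2022,20:00,Beta,Gamma,1,1,D
SP1,03/01/2022,20:00,Gamma,Alpha,3,0,H
SP1,04/01/2022,20:00,Gamma,Beta,0,1,A
"""

# Points earned by the away team for each full-time result code.
AWAY_POINTS = {"A": 3, "D": 1, "H": 0}


def away_points(rows):
    """Sum away points per team and return them sorted, highest first."""
    totals = Counter()
    for row in rows:
        home_team, away_team, result = row[3], row[4], row[7]
        if home_team == "HomeTeam":  # skip the header row
            continue
        totals[away_team] += AWAY_POINTS.get(result, 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


table = away_points(csv.reader(io.StringIO(SAMPLE_CSV)))
print(table)  # [('Beta', 6), ('Gamma', 1), ('Alpha', 0)]
```

This sketch also lists teams with zero away points. A mapper that only emits pairs for draws and away wins would leave such teams out of the table.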

In [1]:
%%writefile ligavisitante.py
#!/usr/bin/python3

from mrjob.job import MRJob
from mrjob.step import MRStep
    
class LaLigaMR(MRJob):
        
    # Mapper: at this stage there is no key yet (_); the value arrives in the variable line
    def mapper_points(self, _, line):
        # Split each line into its comma-separated columns
        _, _, _, home_team, away_team, _, _, result, *rest = line.split(',')

        # Skip the CSV header row
        if home_team == "HomeTeam":
            return

        # Away team: 1 point for a draw ('D'), 3 for an away win ('A')
        if result == 'D':
            yield away_team, 1
        elif result == 'A':
            yield away_team, 3
            
    def combiner_points(self, team, points):
        yield team, sum(points)
            
    def reducer_points(self, team, points):
        yield None, (team, sum(points))
        
    def reducer_classification(self, _, points):
        yield None, sorted(points, key=lambda t: t[1], reverse=True)
        
    def steps(self):
        return [
            MRStep(mapper=self.mapper_points,
                   combiner=self.combiner_points,
                   reducer=self.reducer_points),
            MRStep(reducer=self.reducer_classification)
        ]
         
if __name__=='__main__':
    LaLigaMR.run()

Overwriting ligavisitante.py


In [2]:
! python3 ligavisitante.py laliga2122.csv

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/ligavisitante.root.20240120.105717.637809
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/ligavisitante.root.20240120.105717.637809/output
Streaming final output from /tmp/ligavisitante.root.20240120.105717.637809/output...
null	[["Real Madrid", 42], ["Barcelona", 35], ["Betis", 33], ["Ath Madrid", 30], ["Sevilla", 28], ["Sociedad", 27], ["Osasuna", 25], ["Villarreal", 23], ["Valencia", 22], ["Ath Bilbao", 21], ["Celta", 21], ["Cadiz", 21], ["Granada", 16], ["Elche", 15], ["Levante", 13], ["Vallecano", 13], ["Mallorca", 12], ["Getafe", 11], ["Espanol", 9], ["Alaves", 6]]
Removing temp directory /tmp/ligavisitante.root.20240120.105717.637809...


In [4]:
! python3 ligavisitante.py -r hadoop laliga2122.csv

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /app/hadoop-3.3.1/bin...
Found hadoop binary: /app/hadoop-3.3.1/bin/hadoop
Using Hadoop version 3.3.1
Looking for Hadoop streaming jar in /app/hadoop-3.3.1...
Found Hadoop streaming jar: /app/hadoop-3.3.1/share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar
Creating temp directory /tmp/ligavisitante.root.20240120.105742.374952
uploading working dir files to hdfs:///user/root/tmp/mrjob/ligavisitante.root.20240120.105742.374952/files/wd...
Copying other local files to hdfs:///user/root/tmp/mrjob/ligavisitante.root.20240120.105742.374952/files/
Running step 1 of 2...
  packageJobJar: [/tmp/hadoop-unjar3080984501314666200/] [] /tmp/streamjob3575331902094351241.jar tmpDir=null
  Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
  Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
  Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/j

## 4.- Show the goal difference between the team that scored the most goals and the team that scored the fewest in the 2021/2022 La Liga season.

The MapReduce process is expected to produce output similar to the following:

![solution 4](./img/4.png)
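
The calculation itself is small: total the goals per team, then compare the largest and smallest totals. Here is a minimal plain-Python sketch on made-up matches (the team names and scores are invented). It uses `max` and `min` instead of a full sort, which is an illustrative choice and not necessarily how the job has to do it.

```python
from collections import Counter

# Made-up (home_team, away_team, home_goals, away_goals) tuples.
matches = [
    ("Alpha", "Beta", 2, 1),
    ("Beta", "Gamma", 0, 0),
    ("Gamma", "Alpha", 1, 3),
]

# Every match contributes goals to both teams, whatever the result.
goals = Counter()
for home_team, away_team, home_goals, away_goals in matches:
    goals[home_team] += home_goals
    goals[away_team] += away_goals

# A single pass each for the highest and lowest scorer.
top_team = max(goals, key=goals.get)
bottom_team = min(goals, key=goals.get)
difference = goals[top_team] - goals[bottom_team]
print(f"{top_team} vs {bottom_team}: {difference}")  # Alpha vs Beta: 4
```

When two teams are tied, `max` and `min` return the first one encountered, so the team named in the output can depend on input order even though the difference does not.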

In [13]:
%%writefile ligagoals.py
#!/usr/bin/python3

from mrjob.job import MRJob
from mrjob.step import MRStep

class LaLigaAnalysis(MRJob):

    def mapper_score(self, _, line):
        _, _, _, home_team, away_team, home_goals, away_goals, result, *rest = line.split(',')
        
        if home_team == "HomeTeam":
            return

        # Every match adds goals for both teams, whatever the result
        yield home_team, int(home_goals)
        yield away_team, int(away_goals)

    def combiner_score(self, team, goals):
        yield team, sum(goals)
            
    def reducer_score(self, team, goals):
        yield None, (team, sum(goals))
        
    def reducer_score_diff(self, _, teams):
        sorted_teams = sorted(teams, key=lambda t: t[1], reverse=True)
        top_team = sorted_teams[0][0]
        bottom_team = sorted_teams[-1][0]
        score_diff = sorted_teams[0][1] - sorted_teams[-1][1]
        
        match_up = top_team + " vs " + bottom_team
        diff_str = "Diferencia de goles " + str(score_diff)
        
        yield match_up, diff_str
        
    def steps(self):
        return [
            MRStep(mapper=self.mapper_score,
                   combiner=self.combiner_score,
                   reducer=self.reducer_score),
            MRStep(reducer=self.reducer_score_diff)
        ]

if __name__=='__main__':
    LaLigaAnalysis.run()


Overwriting ligagoals.py


In [14]:
! python3 ligagoals.py laliga2122.csv

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/ligagoals.root.20240120.112639.033178
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/ligagoals.root.20240120.112639.033178/output
Streaming final output from /tmp/ligagoals.root.20240120.112639.033178/output...
"Real Madrid vs Alaves"	"Diferencia de goles 49"
Removing temp directory /tmp/ligagoals.root.20240120.112639.033178...


In [15]:
! python3 ligagoals.py -r hadoop laliga2122.csv

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /app/hadoop-3.3.1/bin...
Found hadoop binary: /app/hadoop-3.3.1/bin/hadoop
Using Hadoop version 3.3.1
Looking for Hadoop streaming jar in /app/hadoop-3.3.1...
Found Hadoop streaming jar: /app/hadoop-3.3.1/share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar
Creating temp directory /tmp/ligagoals.root.20240120.112710.268554
uploading working dir files to hdfs:///user/root/tmp/mrjob/ligagoals.root.20240120.112710.268554/files/wd...
Copying other local files to hdfs:///user/root/tmp/mrjob/ligagoals.root.20240120.112710.268554/files/
Running step 1 of 2...
  packageJobJar: [/tmp/hadoop-unjar137519969768994010/] [] /tmp/streamjob1411855678605371120.jar tmpDir=null
  Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
  Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
  Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1705747935

## 5.- Compute the form over the last five matches for each team in the final La Liga standings of the 2021/2022 season.

[Note](https://www.google.com/search?q=clasificacion+liga+2021+2022&oq=clasificacion+liga+2021+2022#sie=lg) that the last columns of the standings show the result of each team's last 5 matches.

![standings](./img/clasificacion.png)

The goal is to show the final standings together with the results of the last 5 matches. This exercise is somewhat harder and more laborious than the others. If you use `mrjob`, you will probably find [secondary sort by value](https://mrjob.readthedocs.io/en/latest/job.html#secondary-sort) useful, although it can also be solved without it.
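
To see what sorting by value buys here, the idea can be sketched in plain Python: group `(date, points)` pairs by team, sort each group by date, and keep the last five. The records below are made up and the helper name `sortable` is hypothetical; the explicit `values.sort()` call stands in for the ordering that a secondary sort would provide before the reducer runs.

```python
from collections import defaultdict
from datetime import datetime

# Made-up (team, date, points) records in arbitrary order.
records = [
    ("Alpha", "15/05/2022", 3),
    ("Alpha", "01/05/2022", 0),
    ("Alpha", "22/05/2022", 1),
    ("Alpha", "08/05/2022", 3),
    ("Alpha", "24/04/2022", 1),
    ("Alpha", "17/04/2022", 0),
]


def sortable(date_text):
    """Rewrite dd/mm/YYYY as YYYY/mm/dd so string order is chronological."""
    return datetime.strptime(date_text, "%d/%m/%Y").strftime("%Y/%m/%d")


# Group the (sortable date, points) pairs by team.
by_team = defaultdict(list)
for team, date_text, points in records:
    by_team[team].append((sortable(date_text), points))

streaks = {}
for team, values in by_team.items():
    values.sort()  # stands in for the framework's secondary sort
    last_five = [points for _, points in values[-5:]]
    streaks[team] = last_five[::-1]  # most recent match first

print(streaks)  # {'Alpha': [1, 3, 3, 0, 1]}
```

Without the date rewrite, strings such as `01/05/2022` and `15/04/2022` would sort by day first and the streak would come out in the wrong order.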

This result is expected:

![solution 5](./img/5.png)

In [16]:
%%writefile laliga5.py
#!/usr/bin/python3

from mrjob.job import MRJob
from mrjob.step import MRStep
from datetime import datetime

class LaLigaRecentPerformance(MRJob):

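    # Ask mrjob to sort each key's values, so a team's (date, points) pairs reach the reducer in date order
    SORT_VALUES = True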

    def mapper_score(self, _, line):
        _, date, _, home_team, away_team, _, _, result, *rest = line.split(',')
        
        if home_team == "HomeTeam":
            return

        # Rewrite dd/mm/YYYY as YYYY/mm/dd so that string order matches chronological order
        date = datetime.strptime(date, "%d/%m/%Y").strftime("%Y/%m/%d")

        if result == 'D':            
            yield home_team, (date, 1)
            yield away_team, (date, 1)
        elif result == 'H':
            yield home_team, (date, 3)
            yield away_team, (date, 0)
        else:
            yield home_team, (date, 0)
            yield away_team, (date, 3)

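    def reducer_score(self, team, scores):
        # With SORT_VALUES the (date, points) pairs arrive sorted by date, oldest first
        points = [s for date, s in scores]
        # Keep the last five results, most recent first
        recent_five_scores = points[-5:]
        recent_five_scores.reverse()
        yield None, (team, sum(points), recent_five_scores)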

    def reducer_ranking(self, _, scores):
        yield None, sorted(scores, key=lambda t: t[1], reverse=True)

    def steps(self):
        return [
            MRStep(mapper=self.mapper_score, reducer=self.reducer_score),
            MRStep(reducer=self.reducer_ranking)
        ]

if __name__=='__main__':
    LaLigaRecentPerformance.run()

Writing laliga5.py


In [17]:
! python3 laliga5.py laliga2122.csv

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/laliga5.root.20240120.113849.267838
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/laliga5.root.20240120.113849.267838/output
Streaming final output from /tmp/laliga5.root.20240120.113849.267838/output...
null	[["Real Madrid", 86, [1, 1, 3, 0, 3]], ["Barcelona", 73, [0, 1, 3, 3, 3]], ["Ath Madrid", 71, [3, 1, 3, 3, 0]], ["Sevilla", 70, [3, 1, 1, 1, 1]], ["Betis", 65, [1, 3, 3, 0, 1]], ["Sociedad", 62, [0, 3, 3, 0, 1]], ["Villarreal", 59, [3, 0, 3, 1, 0]], ["Ath Bilbao", 55, [0, 3, 0, 1, 3]], ["Valencia", 48, [3, 1, 0, 1, 1]], ["Osasuna", 47, [0, 0, 1, 1, 1]], ["Celta", 46, [0, 3, 0, 3, 1]], ["Elche", 42, [3, 0, 0, 0, 1]], ["Espanol", 42, [1, 1, 0, 1, 0]], ["Vallecano", 42, [0, 0, 0, 1, 1]], ["Cadiz", 39, [3, 1, 0, 3, 1]], ["Getafe", 39, [0, 1, 1, 1, 1]], ["Mallorca", 39, [3, 3, 1, 0, 0]], ["Granada", 38, [1, 0, 3, 3, 1]], ["Levante", 35, [3, 3, 0

In [18]:
! python3 laliga5.py -r hadoop laliga2122.csv

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /app/hadoop-3.3.1/bin...
Found hadoop binary: /app/hadoop-3.3.1/bin/hadoop
Using Hadoop version 3.3.1
Looking for Hadoop streaming jar in /app/hadoop-3.3.1...
Found Hadoop streaming jar: /app/hadoop-3.3.1/share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar
Creating temp directory /tmp/laliga5.root.20240120.113859.855856
uploading working dir files to hdfs:///user/root/tmp/mrjob/laliga5.root.20240120.113859.855856/files/wd...
Copying other local files to hdfs:///user/root/tmp/mrjob/laliga5.root.20240120.113859.855856/files/
Running step 1 of 2...
  packageJobJar: [/tmp/hadoop-unjar5213089982492104415/] [] /tmp/streamjob139541007877437592.jar tmpDir=null
  Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
  Connecting to ResourceManager at yarnmaster/172.21.0.3:8032
  Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1705747935772_00