In [1]:
## Test files
##   Three test files are generated below to test the system. You can also create the files directly with
##   operating-system commands in the Terminal and the pico text editor.

In [2]:
## Create the input directory
!rm -rf input output
!mkdir input

In [3]:
%%writefile input/text0.txt
Analytics is the discovery, interpretation, and communication of meaningful patterns
in data. Especially valuable in areas rich with recorded information, analytics relies
on the simultaneous application of statistics, computer programming and operations research
to quantify performance.

Organizations may apply analytics to business data to describe, predict, and improve business
performance. Specifically, areas within analytics include predictive analytics, prescriptive
analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big
Data Analytics, retail analytics, store assortment and stock-keeping unit optimization,
marketing optimization and marketing mix modeling, web analytics, call analytics, speech
analytics, sales force sizing and optimization, price and promotion modeling, predictive
science, credit risk analysis, and fraud analytics. Since analytics can require extensive
computation (see big data), the algorithms and software used for analytics harness the most
current methods in computer science, statistics, and mathematics.

Writing input/text0.txt


In [4]:
%%writefile input/text1.txt
The field of data analysis. Analytics often involves studying past historical data to
research potential trends, to analyze the effects of certain decisions or events, or to
evaluate the performance of a given tool or scenario. The goal of analytics is to improve
the business by gaining knowledge which can be used to make improvements or changes.

Writing input/text1.txt


In [5]:
%%writefile input/text2.txt
Data analytics (DA) is the process of examining data sets in order to draw conclusions
about the information they contain, increasingly with the aid of specialized systems
and software. Data analytics technologies and techniques are widely used in commercial
industries to enable organizations to make more-informed business decisions and by
scientists and researchers to verify or disprove scientific models, theories and
hypotheses.

Writing input/text2.txt


In [6]:
## Apache Pig code
##   Note. Two hyphens (--) start a single-line comment, and /* … */ delimits multi-line comments.

In [7]:
%%writefile script.pig

-- create the input folder in HDFS
fs -mkdir input

-- copy the files from the local file system to HDFS
fs -put input/ .

-- load the data
lines = LOAD 'input/text*.txt' AS (line:CHARARRAY);

-- generate a table called words with one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- group the records that have the same word
grouped = GROUP words BY word;

-- generate a variable that counts the occurrences in each group
wordcount = FOREACH grouped GENERATE group, COUNT(words);

-- select the first 15 words
s = LIMIT wordcount 15;

-- write the output file
STORE s INTO 'output';

-- copy the files from HDFS to the local file system
fs -get output/ .

Writing script.pig
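
## For readers more familiar with Python, the following is a minimal sketch of what the script computes,
## not a replacement for it. It assumes that TOKENIZE, when called without a delimiter argument, splits on
## space, double quote, comma, parentheses and asterisk; check the Pig documentation for your version.
## The names tokenize, word_count and sample are illustrative only.

```python
import re
from collections import Counter

# Assumed default TOKENIZE delimiters: space, double quote, comma,
# parentheses and asterisk (an assumption, not taken from the notebook).
DELIMITERS = re.compile(r'[ ",()*]+')


def tokenize(line):
    """Split a line into words and drop empty strings."""
    return [token for token in DELIMITERS.split(line) if token]


def word_count(lines):
    """Count the occurrences of each word (GROUP ... BY word plus COUNT)."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


# Sample line copied from input/text2.txt
sample = ["Data analytics (DA) is the process of examining data sets in order to draw conclusions"]
counts = word_count(sample)
print(counts["DA"], counts["Data"], counts["data"])
```

## Under these assumptions the parentheses around DA are removed, while "Data" and "data" are counted
## as different words because the comparison is case-sensitive.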


In [8]:
## Running the script in batch mode

In [9]:
!pig -execute 'run script.pig'

2019-11-28 00:43:51,611 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2019-11-28 00:43:55,475 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.textoutputformat.separator is deprecated. Instead, use mapreduce.output.textoutputformat.separator
2019-11-28 00:43:56,004 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2019-11-28 00:43:56,260 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2019-11-28 00:43:56,262 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2019-11-28 00:43:56,273 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use

In [10]:
## Viewing the results in HDFS

In [11]:
!hadoop fs -ls output/*

-rw-r--r--   1 root supergroup          0 2019-11-28 00:44 output/_SUCCESS
-rw-r--r--   1 root supergroup         81 2019-11-28 00:44 output/part-r-00000


In [12]:
!hadoop fs -cat output/part-r-00000

a	1
DA	1
be	1
by	2
in	5
is	3
of	8
on	1
or	5
to	12
Big	1
The	2
aid	1
and	15
are	1


In [13]:
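## Moving the results to the local file system
##   Note. script.pig already copied output/ to the local file system with fs -get, so copying into the
##   existing local output directory creates the nested folder output/output.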

In [14]:
!hadoop fs -copyToLocal output output
!ls output/*

output/_SUCCESS  output/part-r-00000

output/output:
_SUCCESS  part-r-00000


In [15]:
## Viewing the results on the local file system

In [16]:
!ls -1 output/*

output/_SUCCESS
output/part-r-00000

output/output:
_SUCCESS
part-r-00000


In [17]:
!cat output/part-r-00000

a	1
DA	1
be	1
by	2
in	5
is	3
of	8
on	1
or	5
to	12
Big	1
The	2
aid	1
and	15
are	1
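
## Each line of the output file holds a word and its count separated by a tab. A minimal sketch of loading
## that format into a Python dictionary follows; parse_wordcount and sample are illustrative names, and the
## sample string reproduces only the first rows of the output above.

```python
def parse_wordcount(text):
    """Parse 'word<TAB>count' lines into a dict of word -> int."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, count = line.rsplit("\t", 1)
        result[word] = int(count)
    return result


# First rows of output/part-r-00000 as shown above
sample = "a\t1\nDA\t1\nbe\t1\nby\t2\nin\t5\n"
counts = parse_wordcount(sample)
most_frequent = max(counts, key=counts.get)
print(most_frequent, counts[most_frequent])

# To read the real file instead:
# with open("output/part-r-00000") as handle:
#     counts = parse_wordcount(handle.read())
```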


In [18]:
## Suppressing verbose output
##   Apache Pig prints a lot of information about its own execution and about Hadoop to the screen. To control the amount of output, copy the file ./conf/log4j.properties from the Pig installation folder to the current directory.
##   Then edit log4j.properties so that only the error messages from Pig and Hadoop are printed. To do so, change the corresponding line so that it reads:
##     log4j.logger.org.apache.pig=error, A
##   and add the following line to change the level of information reported by Hadoop:
##     log4j.logger.org.apache.hadoop=error, A
##   Pig is then invoked with:
##     pig -4 log4j.properties
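
## The edit described above can also be scripted. The sketch below is one possible way to do it, assuming
## the properties file uses plain key=value lines; quiet_log4j and the sample content are hypothetical and
## do not reproduce the file shipped with Pig.

```python
PIG_KEY = "log4j.logger.org.apache.pig"
HADOOP_KEY = "log4j.logger.org.apache.hadoop"


def quiet_log4j(properties_text):
    """Return the properties text with the Pig and Hadoop loggers set to error."""
    wanted = {PIG_KEY: "error, A", HADOOP_KEY: "error, A"}
    seen = set()
    output = []
    for line in properties_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in wanted:
            # Rewrite an existing logger line
            output.append(f"{key}={wanted[key]}")
            seen.add(key)
        else:
            output.append(line)
    # Append any logger line that was missing
    for key, value in wanted.items():
        if key not in seen:
            output.append(f"{key}={value}")
    return "\n".join(output) + "\n"


# Hypothetical file content, for illustration only
sample = "log4j.logger.org.apache.pig=info, A\nlog4j.appender.A=org.apache.log4j.ConsoleAppender\n"
print(quiet_log4j(sample))
```

## The result would be written to log4j.properties in the current directory before running pig -4 log4j.properties.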

In [19]:
########################################################################################################
## Running Pig in Jupyter
##    The following describes how to run Pig commands in Jupyter using the bigdata Jupyter extension.

In [20]:
%load_ext bigdata

In [21]:
%timeout 300

In [22]:
%pig_start

In [23]:
## WordCount in Apache Pig
##   The files are loaded into Apache Pig.

In [24]:
%%pig
lines = LOAD 'input/text*.txt' AS (line:CHARARRAY);
DUMP lines;

 lines = LOAD 'input/text*.txt' AS (line:CHARARRAY);
 DUMP lines;
2019-11-28 00:48:20,051 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2019-11-28 00:48:20,391 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2019-11-28 00:48:20,397 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2019-11-28 00:48:20,415 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.submit.replication is deprecated. Instead, use mapreduce.client.submit.file.replication
2019-11-28 00:48:21,286 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker.http.address is deprecated. Instead, use mapreduce.jobtracker.http.address
2019-11-28 00:48:21,308 [JobControl] INFO  org.apache.hadoop.y

In [25]:
## Perform the word count.

In [26]:
%%pig
-- generate a table called words with one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
DUMP words;

 -- generate a table called words with one word per record
 words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
 DUMP words;
2019-11-28 00:48:44,192 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2019-11-28 00:48:44,425 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2019-11-28 00:48:44,454 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2019-11-28 00:48:44,496 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 3
2019-11-28 00:48:44,563 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2019-11-28 00:48:44,619 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1574900907551_0004
2019-11-28 00:48:44,632 [JobControl] INFO  org.apache.

In [27]:
%%pig
-- group the records that have the same word
grouped = GROUP words BY word;
DUMP grouped;

 -- group the records that have the same word
 grouped = GROUP words BY word;
 DUMP grouped;
2019-11-28 00:49:13,357 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2019-11-28 00:49:14,016 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2019-11-28 00:49:14,046 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2019-11-28 00:49:14,067 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 3
2019-11-28 00:49:14,135 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2019-11-28 00:49:14,177 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1574900907551_0005
2019-11-28 00:49:14,181 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not 
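
## Because the log above is truncated, the shape of the grouped relation is not visible. As a rough
## analogy in Python (a sketch, not Pig's implementation), GROUP produces one record per distinct word,
## holding the key and a bag with every matching record; COUNT then returns the size of that bag.
## The words list below is made up for illustration.

```python
from collections import defaultdict


def group_by_word(words):
    """Map each distinct word to the list (bag) of its occurrences."""
    groups = defaultdict(list)
    for word in words:
        groups[word].append(word)
    return dict(groups)


# Hypothetical input: one word per record
words = ["to", "data", "to", "analytics", "data", "to"]
grouped = group_by_word(words)

# Counting each group is the size of its bag
wordcount = {group: len(bag) for group, bag in grouped.items()}
print(grouped)
print(wordcount)
```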

In [28]:
%%pig
-- generate a variable that counts the occurrences in each group
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;

 -- generate a variable that counts the occurrences in each group
 wordcount = FOREACH grouped GENERATE group, COUNT(words);
 DUMP wordcount;
2019-11-28 00:49:51,321 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2019-11-28 00:49:51,922 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2019-11-28 00:49:51,946 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2019-11-28 00:49:51,963 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 3
2019-11-28 00:49:52,423 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2019-11-28 00:49:52,879 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1574900907551_0006
2019-11-28 00:49:52,883 [JobControl] INFO  org.apache.

In [29]:
%%pig
-- select the first 15 words
s = LIMIT wordcount 15;
DUMP s;

 -- select the first 15 words
 s = LIMIT wordcount 15;
 DUMP s;
2019-11-28 00:51:37,071 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2019-11-28 00:51:37,590 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2019-11-28 00:51:37,615 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2019-11-28 00:51:37,630 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 3
2019-11-28 00:51:37,685 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2019-11-28 00:51:37,731 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1574900907551_0007
2019-11-28 00:51:37,743 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any ja
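
## Note that LIMIT keeps 15 records but does not rank them, so the result is not the 15 most frequent
## words; in Pig the relation would have to be ordered by count before applying LIMIT. The Python sketch
## below illustrates the difference with a few (word, count) pairs taken from the batch output shown
## earlier; limit and top_n are illustrative helper names.

```python
from itertools import islice


def limit(records, n):
    """Keep the first n records in whatever order they arrive."""
    return list(islice(records, n))


def top_n(records, n):
    """Keep the n records with the highest count."""
    return sorted(records, key=lambda record: record[1], reverse=True)[:n]


# (word, count) pairs copied from the batch output above
records = [("a", 1), ("DA", 1), ("be", 1), ("by", 2), ("in", 5), ("is", 3), ("of", 8)]

print(limit(records, 3))
print(top_n(records, 3))
```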

In [30]:
%pig_quit

In [31]:
## Cleaning up HDFS and the local machine

In [32]:
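## Remove the input and output directories from HDFS, if they exist.
## Note: the shell expands output/* against the local output directory, so hadoop fs -rm also receives
## output/output, which does not exist in HDFS; that explains the error message in the output.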
!hadoop fs -rm input/*
!hadoop fs -rm output/*
!hadoop fs -rmdir input output

Deleted input/text0.txt
Deleted input/text1.txt
Deleted input/text2.txt
Deleted output/_SUCCESS
rm: `output/output': No such file or directory
Deleted output/part-r-00000


In [33]:
!rm -rf input output