In [1]:
%load_ext bigdata


In [2]:
%hive_start


In [3]:
%timeout 300

In [4]:
##
## Create the wordcount directory in the current working directory;
## the three test files are written into it below.
##
!mkdir -p wordcount/

In [9]:
## Next, three test files are generated and stored in the wordcount/ directory.

In [6]:
%%writefile wordcount/text0.txt
Analytics is the discovery, interpretation, and communication of meaningful patterns
in data. Especially valuable in areas rich with recorded information, analytics relies
on the simultaneous application of statistics, computer programming and operations research
to quantify performance.

Organizations may apply analytics to business data to describe, predict, and improve business
performance. Specifically, areas within analytics include predictive analytics, prescriptive
analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big
Data Analytics, retail analytics, store assortment and stock-keeping unit optimization,
marketing optimization and marketing mix modeling, web analytics, call analytics, speech
analytics, sales force sizing and optimization, price and promotion modeling, predictive
science, credit risk analysis, and fraud analytics. Since analytics can require extensive
computation (see big data), the algorithms and software used for analytics harness the most
current methods in computer science, statistics, and mathematics.

Writing wordcount/text0.txt


In [7]:
%%writefile wordcount/text1.txt
The field of data analysis. Analytics often involves studying past historical data to
research potential trends, to analyze the effects of certain decisions or events, or to
evaluate the performance of a given tool or scenario. The goal of analytics is to improve
the business by gaining knowledge which can be used to make improvements or changes.

Writing wordcount/text1.txt


In [8]:
%%writefile wordcount/text2.txt
Data analytics (DA) is the process of examining data sets in order to draw conclusions
about the information they contain, increasingly with the aid of specialized systems
and software. Data analytics technologies and techniques are widely used in commercial
industries to enable organizations to make more-informed business decisions and by
scientists and researchers to verify or disprove scientific models, theories and
hypotheses.

Writing wordcount/text2.txt


In [10]:
## This application uses two tables:
##   docs: holds the contents of the text files, where each line is one record.
##   word_counts: holds each word together with its count.
## The statements below drop both tables if they already exist and then create the docs table with a single STRING column.

In [11]:
%%hive
DROP TABLE IF EXISTS docs;
DROP TABLE IF EXISTS word_counts;
CREATE TABLE docs (line STRING);

DROP TABLE IF EXISTS docs;
OK
Time taken: 12.205 seconds
DROP TABLE IF EXISTS word_counts;
OK
Time taken: 0.013 seconds
CREATE TABLE docs (line STRING);
OK
Time taken: 1.229 seconds


In [12]:
## The following code loads all the files in the wordcount directory directly into the docs table. It then prints the first five records of the table to verify that the data was read correctly.
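## As a rough illustration of what the load does, the Python sketch below collects every line of every file in a directory into a single list, one record per line. The helper load_docs and the two sample files are hypothetical stand-ins; they are not part of Hive or of this notebook's data.

```python
import tempfile
from pathlib import Path


def load_docs(directory):
    """Return every line of every file in `directory`, one record per line."""
    records = []
    # Sort the paths so the result is deterministic in this sketch.
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            # splitlines() drops the newline terminators but keeps blank lines.
            records.extend(path.read_text(encoding="utf-8").splitlines())
    return records


# Hypothetical two-file directory standing in for wordcount/.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first line\n\nthird line\n", encoding="utf-8")
    Path(tmp, "b.txt").write_text("only line\n", encoding="utf-8")
    docs = load_docs(tmp)

print(docs[:5])  # analogous to SELECT * FROM docs LIMIT 5
```

## Note that the blank line in the sample file becomes a record of its own, just as the fifth row fetched from docs is an empty line.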

In [13]:
%%hive
LOAD DATA LOCAL INPATH "wordcount/" OVERWRITE INTO TABLE docs;
SELECT * FROM docs LIMIT 5;

LOAD DATA LOCAL INPATH "wordcount/" OVERWRITE INTO TABLE docs;
Loading data to table default.docs
OK
Time taken: 2.679 seconds
SELECT * FROM docs LIMIT 5;
OK
Analytics is the discovery, interpretation, and communication of meaningful patterns
in data. Especially valuable in areas rich with recorded information, analytics relies
on the simultaneous application of statistics, computer programming and operations research
to quantify performance.

Time taken: 2.331 seconds, Fetched: 5 row(s)


In [14]:
## Once the files are loaded, each line is split into words with split(line, '\\s');
## the regular expression \\s matches a single whitespace character, so split() returns an array of words.
## Hive's explode() function, used in a SELECT, emits one new record for each element of that array.
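## The same split-then-explode idea can be sketched in plain Python, assuming re.split behaves like Hive's split() for this pattern. The function explode_words and the sample lines are hypothetical and only serve as an illustration.

```python
import re


def explode_words(lines):
    """Yield one token per split piece, mirroring explode(split(line, '\\s'))."""
    for line in lines:
        # r"\s" matches exactly one whitespace character, so consecutive
        # spaces and empty lines both produce empty-string tokens.
        for token in re.split(r"\s", line):
            yield token


sample = ["Analytics is the discovery,", "", "to  quantify"]
words = list(explode_words(sample))
print(words)
```

## In this sketch the empty line and the double space each yield an empty token, which is likely why an empty word shows up in the word_counts table. Punctuation also stays attached to the words, as in "discovery,".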

In [15]:
%%hive
SELECT explode(split(line, '\\s')) AS word FROM docs LIMIT 5;

SELECT explode(split(line, '\\s')) AS word FROM docs LIMIT 5;
OK
Analytics
is
the
discovery,
interpretation,
Time taken: 0.603 seconds, Fetched: 5 row(s)


In [16]:
## To perform the count, the expression SELECT word, count(1) AS count ... GROUP BY word produces a table with two columns,
##  where the first column (word) holds each distinct word in the text, and the second column (count) holds the number of times that word appears in the records generated
##  by the expression SELECT explode(split(line, '\\s')) AS word FROM docs.
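## A minimal Python sketch of the same grouping and ordering, using collections.Counter in place of GROUP BY and sorted() in place of ORDER BY. The function word_counts and its input list are hypothetical examples, not data from the notebook.

```python
from collections import Counter


def word_counts(words):
    """Return (word, count) pairs sorted by word, like GROUP BY + ORDER BY."""
    counts = Counter(words)  # one entry per distinct word
    # sorted() compares strings by code point, so "(" and uppercase
    # letters come before lowercase letters.
    return sorted(counts.items())


rows = word_counts(["data", "Data", "analytics", "data", "(DA)", "Data"])
for word, count in rows:
    print(f"{word}\t{count}")
```

## The comparison is case-sensitive, so "Data" and "data" are counted separately; the Hive output appears to follow the same ordering, with "(DA)" listed before the capitalized words.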

In [17]:
%%hive
CREATE TABLE word_counts
AS
    SELECT word, count(1) AS count
    FROM
        (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY
    word
ORDER BY
    word;

CREATE TABLE word_counts
AS
    SELECT word, count(1) AS count
    FROM
        (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY
    word
ORDER BY
    word;
Query ID = root_20191127201718_42add91d-6b46-41ad-a705-27d2ab07fae8
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1574885052112_0001, Tracking URL = http://499bf3956267:8088/proxy/application_1574885052112_0001/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1574885052112_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-11-27 20:17:34,770 Stage-1 map = 0%,  reduce = 0%
2019-11-27 20:17:41,587 Stage-

In [18]:
## To inspect the results, run a SELECT on the word_counts table.

In [19]:
%%hive
SELECT * FROM word_counts LIMIT 10;

SELECT * FROM word_counts LIMIT 10;
OK
	1
(DA)	1
(see	1
Analytics	2
Analytics,	1
Big	1
Data	3
Especially	1
Organizations	1
Since	1
Time taken: 0.218 seconds, Fetched: 10 row(s)


In [20]:
## Finally, once the code has been debugged, the Hive interpreter that was started in the background is shut down.

In [21]:
%hive_quit