# Working with serializers/deserializers

In this exercise we work with serializers/deserializers (SerDes). We start from a data file containing the access log of a web server. The file follows the Apache Common Log Format, which is widely used by Internet servers. Below is an example log line in this format:

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
    
Each field has the following meaning:

•	127.0.0.1 is the IP address of the client that made the request to the server.

•	user-identifier is the RFC 1413 identity of the client.

•	frank is the user ID of the person requesting the document.

•	[10/Oct/2000:13:55:36 -0700] is the date, time, and time zone at which the request was received.

•	"GET /apache_pb.gif HTTP/1.0" is the request line sent by the client.

•	200 is the HTTP status code returned to the client.

•	2326 is the size of the object returned to the client, in bytes.
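
As a rough illustration of how these seven fields can be pulled out of a line, the sketch below applies the regular expression that this exercise later passes to Hive's RegexSerDe, written as a Python raw string (that is, without the HiveQL string escaping). The helper name `parse_clf_line` and the `FIELDS` tuple are assumptions made for this example; they are not part of the exercise.

```python
import re

# Common Log Format pattern: the same expression the exercise gives to
# RegexSerDe, written here as a Python raw string.
CLF_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)'
)

# Field names mirror the columns of the apache_common_log table.
FIELDS = ("host", "identity", "user", "time", "request", "status", "size")


def parse_clf_line(line):
    """Return a dict with the seven fields, or None if the line does not match."""
    match = CLF_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


sample = ('127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_clf_line(sample)
for name in FIELDS:
    print(f"{name:9}{record[name]}")
```

Each pair of parentheses in the pattern captures one field. The captured `time` and `request` values keep their surrounding brackets and quotes, because those characters sit inside the capture groups.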

Starting from this data file, we create a database and a table that holds this information. We then create a second table that holds the same data except the user.



In [4]:
! mkdir -p hive-serdes
import os
os.chdir("hive-serdes")

In [5]:
! pwd

/home/cloudera/FrameworkHadoop-privado/Hive/Notebooks/hive-serdes


In [6]:
%%writefile ejerciciohive.hql
create database if not exists bdlogs
Comment 'BD delogs'
Location '/user/cloudera/bdlogs'
With dbproperties ('Creada por'='User','Creada el'='26-Dic-2017');

Writing ejerciciohive.hql


In [7]:
! beeline -u "jdbc:hive2://localhost:10000/default" -f ejerciciohive.hql

scan complete in 2ms
Connecting to jdbc:hive2://localhost:10000/default
Connected to: Apache Hive (version 1.1.0-cdh5.12.0)
Driver: Hive JDBC (version 1.1.0-cdh5.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000/default> create database if not exists bdlogs
. . . . . . . . . . . . . . . . . . . .> Comment 'BD delogs'
. . . . . . . . . . . . . . . . . . . .> Location '/user/cloudera/bdlogs'
. . . . . . . . . . . . . . . . . . . .> With dbproperties ('Creada por'='User','Creada el'='26-Dic-2017');
INFO  : Compiling command(queryId=hive_20171226001414_25c8d495-e38e-44fe-a029-ad32e1a9ae38): create database if not exists bdlogs
Comment 'BD delogs'
Location '/user/cloudera/bdlogs'
With dbproperties ('Creada por'='User','Creada el'='26-Dic-2017')
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO  : Completed compiling command(queryId=hive_20171226001414_25c8d495-e38e-44fe-a029-ad32e1a9ae38); Time taken: 0.033 second

In [8]:
! hadoop fs -put ../common_access_log.txt /user/cloudera

In [11]:
%%writefile ejerciciohive.hql
CREATE TABLE apache_common_log (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING
  )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)"
);
LOAD DATA INPATH '/user/cloudera/common_access_log.txt' INTO TABLE apache_common_log;

Overwriting ejerciciohive.hql


In [12]:
! beeline -u "jdbc:hive2://localhost:10000/bdlogs" -f ejerciciohive.hql

scan complete in 2ms
Connecting to jdbc:hive2://localhost:10000/bdlogs
Connected to: Apache Hive (version 1.1.0-cdh5.12.0)
Driver: Hive JDBC (version 1.1.0-cdh5.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000/bdlogs> CREATE TABLE apache_common_log (
. . . . . . . . . . . . . . . . . . . >   host STRING,
. . . . . . . . . . . . . . . . . . . >   identity STRING,
. . . . . . . . . . . . . . . . . . . >   user STRING,
. . . . . . . . . . . . . . . . . . . >   time STRING,
. . . . . . . . . . . . . . . . . . . >   request STRING,
. . . . . . . . . . . . . . . . . . . >   status STRING,
. . . . . . . . . . . . . . . . . . . >   size STRING
. . . . . . . . . . . . . . . . . . . >   )
. . . . . . . . . . . . . . . . . . . > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
. . . . . . . . . . . . . . . . . . . > WITH SERDEPROPERTIES (
. . . . . . . . . . . . . . . . . . . >   "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)"
. . . . . . . . . . . . . . . . . . . > );
IN

In [13]:
! beeline -u "jdbc:hive2://localhost:10000/bdlogs" -e "select count(distinct host) from apache_common_log ;"

scan complete in 2ms
Connecting to jdbc:hive2://localhost:10000/bdlogs
Connected to: Apache Hive (version 1.1.0-cdh5.12.0)
Driver: Hive JDBC (version 1.1.0-cdh5.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=hive_20171226002222_ce50375c-6934-440a-8c2f-3175b911daa7): select count(distinct host) from apache_common_log
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20171226002222_ce50375c-6934-440a-8c2f-3175b911daa7); Time taken: 0.094 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20171226002222_ce50375c-6934-440a-8c2f-3175b911daa7): select count(distinct host) from apache_common_log
INFO  : Query ID = hive_20171226002222_ce50375c-6934-440a-8c2f-3175b911daa7
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Sta

In [14]:
! beeline -u "jdbc:hive2://localhost:10000/bdlogs" -e "select count(host) from apache_common_log ;"

scan complete in 2ms
Connecting to jdbc:hive2://localhost:10000/bdlogs
Connected to: Apache Hive (version 1.1.0-cdh5.12.0)
Driver: Hive JDBC (version 1.1.0-cdh5.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=hive_20171226002222_fa900db5-17b6-453a-9803-913aa48956a0): select count(host) from apache_common_log
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20171226002222_fa900db5-17b6-453a-9803-913aa48956a0); Time taken: 0.083 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20171226002222_fa900db5-17b6-453a-9803-913aa48956a0): select count(host) from apache_common_log
INFO  : Query ID = hive_20171226002222_fa900db5-17b6-453a-9803-913aa48956a0
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-
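
The two queries above answer different questions: `count(host)` counts rows whose `host` is not NULL, while `count(distinct host)` counts how many different clients appear. The minimal sketch below mimics both on a made-up list of hosts; the values are hypothetical and are not taken from `common_access_log.txt`. `None` stands for the NULL that RegexSerDe is documented to produce for every column of a line that does not match `input.regex`.

```python
# Hypothetical host values as a query would see them; None plays the role of
# the NULL produced for a log line that does not match input.regex.
hosts = ["127.0.0.1", "10.0.0.7", "127.0.0.1", None, "10.0.0.7", "192.168.1.5"]

# count(host): number of rows whose host is not NULL.
count_host = sum(1 for h in hosts if h is not None)

# count(distinct host): number of different non-NULL host values.
count_distinct_host = len({h for h in hosts if h is not None})

print(count_host, count_distinct_host)
```

With this sample the script prints `5 3`: five parsed rows coming from three distinct clients.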

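Now we create the second table, which holds the same information except the user. Because LOAD DATA INPATH moves the source file out of /user/cloudera and into the table's storage location, we upload the file again first.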

In [15]:
! hadoop fs -put ../common_access_log.txt /user/cloudera

In [16]:
%%writefile ejerciciohive.hql
CREATE TABLE apache_common_log_nouser (
  host STRING,
  identity STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING
  )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) [^ ]* (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)"
);
LOAD DATA INPATH '/user/cloudera/common_access_log.txt' INTO TABLE apache_common_log_nouser;

Overwriting ejerciciohive.hql
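
The `input.regex` of `apache_common_log_nouser` differs from the first table's only in the third field: `[^ ]*` is no longer wrapped in parentheses, so the user is still matched but not captured, and the six remaining groups line up with the six columns. The sketch below checks this in Python using the same expression as a raw string; the variable names are assumptions made for this example.

```python
import re

# Pattern of apache_common_log_nouser as a Python raw string: the third field
# is matched by [^ ]* but, lacking parentheses, it is not captured.
NOUSER_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) [^ ]* (-|\[[^\]]*\]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)'
)

# Column names mirror the CREATE TABLE statement of the second table.
COLUMNS = ("host", "identity", "time", "request", "status", "size")

sample = ('127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')

match = NOUSER_PATTERN.fullmatch(sample)
row = dict(zip(COLUMNS, match.groups()))

print(NOUSER_PATTERN.groups)  # number of capture groups in the pattern
print(row)
```

The pattern reports six capture groups, and `frank` does not appear in any of the resulting columns.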


In [17]:
! beeline -u "jdbc:hive2://localhost:10000/bdlogs" -f ejerciciohive.hql

scan complete in 2ms
Connecting to jdbc:hive2://localhost:10000/bdlogs
Connected to: Apache Hive (version 1.1.0-cdh5.12.0)
Driver: Hive JDBC (version 1.1.0-cdh5.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000/bdlogs> CREATE TABLE apache_common_log_nouser (
. . . . . . . . . . . . . . . . . . . >   host STRING,
. . . . . . . . . . . . . . . . . . . >   identity STRING,
. . . . . . . . . . . . . . . . . . . >   time STRING,
. . . . . . . . . . . . . . . . . . . >   request STRING,
. . . . . . . . . . . . . . . . . . . >   status STRING,
. . . . . . . . . . . . . . . . . . . >   size STRING
. . . . . . . . . . . . . . . . . . . >   )
. . . . . . . . . . . . . . . . . . . > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
. . . . . . . . . . . . . . . . . . . > WITH SERDEPROPERTIES (
. . . . . . . . . . . . . . . . . . . >   "input.regex" = "([^ ]*) ([^ ]*) [^ ]* (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)"
. . . . . . . . . . . . . . . . . . . > );
INFO  : Compiling command(queryId=hive_20171226002

In [18]:
! beeline -u "jdbc:hive2://localhost:10000/bdlogs" -e "select count(host) from apache_common_log_nouser ;"

scan complete in 2ms
Connecting to jdbc:hive2://localhost:10000/bdlogs
Connected to: Apache Hive (version 1.1.0-cdh5.12.0)
Driver: Hive JDBC (version 1.1.0-cdh5.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=hive_20171226002626_632b51d0-25a6-40cc-ad86-f4a28daf1839): select count(host) from apache_common_log_nouser
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20171226002626_632b51d0-25a6-40cc-ad86-f4a28daf1839); Time taken: 0.091 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20171226002626_632b51d0-25a6-40cc-ad86-f4a28daf1839): select count(host) from apache_common_log_nouser
INFO  : Query ID = hive_20171226002626_632b51d0-25a6-40cc-ad86-f4a28daf1839
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Startin

In [19]:
! beeline -u "jdbc:hive2://localhost:10000/bdlogs" -e "select count(distinct host) from apache_common_log_nouser;"

scan complete in 1ms
Connecting to jdbc:hive2://localhost:10000/bdlogs
Connected to: Apache Hive (version 1.1.0-cdh5.12.0)
Driver: Hive JDBC (version 1.1.0-cdh5.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO  : Compiling command(queryId=hive_20171226002727_52202781-a27d-4c9b-b7b7-cf5627a3e513): select count(distinct host) from apache_common_log_nouser
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20171226002727_52202781-a27d-4c9b-b7b7-cf5627a3e513); Time taken: 0.088 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20171226002727_52202781-a27d-4c9b-b7b7-cf5627a3e513): select count(distinct host) from apache_common_log_nouser
INFO  : Query ID = hive_20171226002727_52202781-a27d-4c9b-b7b7-cf5627a3e513
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of