GitHub - cfcastro24/visa_auto

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
bin		bin
ext		ext
logs		logs
src		src
README		README
run_etl1.sh		run_etl1.sh
run_etl2.sh		run_etl2.sh
run_etl3.sh		run_etl3.sh
run_etl4.sh		run_etl4.sh
visa_auto.py		visa_auto.py

Repository files navigation

*************************************************************************
README for script visa_auto.py

USAGE: python visa_auto.py -d <dataset> -m <monthday> -o <options {1-5}>
where options are:
        1 - Rezip
        2 - Load HDFS
        3 - Run MapReduce
        4 - Create External Tables
        5 - Load GPDB

DESCRIPTION:    
The visa_auto.py script is designed to automate the flow of data logs from ETL, 
HDFS and GPDB. This process includes file management, log parsing via map/reduce, 
transformation and data loads to GPDB system for the following datasets: bluecoat,
vpn, active directory, netscout, dns, dhcp, fwcheckpoint and fwpix.

The script can be run in its entirely or by individual steps as indicated in the
options parameter. below are some examples of its usage:

EXAMPLES:
Run full data flow for vpn for October 2010
 python visa_auto.py -d vpn -m 201210

Run map reduce jobs for vpn for October 2010
 python visa_auto.py -d vpn -m 201210 -o 3

Run full data flow for vpn for October 2010 in the background and save log
 python -u visa_auto.py -d vpn -m 201210 > logs/visa_auto.vpn_201210.log 2>&1 &

The script uses and expects the following scripts/directories for it to work. Make
sure the paths are consistent on all etl servers and hadoop cluster

Folder Structure

Scripts:	

visa_auto.py home directory		/data1/auto_scripts/
directory for rezip scripts		/data1/auto_scripts/bin/
directory for MapReduce jar & scripts	/data1/auto_scripts/bin/
directory for external tables		/data1/auto_scripts/ext/
directory for gpdb load scripts		/data1/auto_scripts/ext/
directory for source code including ddl
			 and map reduce	/data1/auto_scripts/src/

Source Data: The source data folders will be spread on all etl servers in the 
following manner:

Active Directory: ETL1
	input: /data1/input/active_directory/
	output: /data2/output/active_directory/
VPN: ETL1
        input: /data1/input/vpn/
        output: /data2/output/vpn/
FW Pix: ETL2
        input: /data1/input/fwpix/
        output: /data2/output/fwpix/
FW Checkpoint: ETL2
        input: /data1/input/fwcheckpoint/
        output: /data2/output/fwcheckpoint/
DNS: ETL3
        input: /data1/input/dns/
        output: /data2/output/dns/
DHCP: ETL3
        input: /data1/input/dhcp/
        output: /data2/output/dhcp/
Bluecoat: ETL4
        input: /data1/input/bluecoat/
        output: /data2/output/bluecoat/
Netscout: ETL4
        input: /data1/input/netscout/
        output: /data2/output/netscout/

Hadoop Data: hdfs will have input and output directories for each dataset in
the following manner

INPUT:
hdfs:/data/input/<dataset>/<monthday>/
hdfs:/data/output/<dataset>/<monthday>/
*************************************************************************