# ETL Project

A simple ETL project interlinking various components of the Hadoop ecosystem: data is created in MySQL, ingested into Hive with Sqoop, partitioned in Hive, converted to JSON with Pig, read into Spark, and finally imported into MongoDB.


## Creating a simple dataset in MySQL

- Create the table in MySQL. Note that `phone` must be `BIGINT`: 10-digit phone numbers such as 9848022337 overflow a 32-bit `INT`.

```sql
CREATE TABLE studdetails(
  id        INT,
  firstname VARCHAR(20),
  lastname  VARCHAR(30),
  age       INT,
  phone     BIGINT,
  location  VARCHAR(30)
);
```
- Insert some sample rows:

```sql
INSERT INTO studdetails VALUES (001, 'Rajiv',    'Reddy',       21, 9848022337, 'Hyderabad');
INSERT INTO studdetails VALUES (002, 'siddarth', 'Battacharya', 22, 9848022338, 'Kolkata');
INSERT INTO studdetails VALUES (003, 'Rajesh',   'Khanna',      22, 9848022339, 'Delhi');
INSERT INTO studdetails VALUES (004, 'Preethi',  'Agarwal',     21, 9848022330, 'Pune');
INSERT INTO studdetails VALUES (005, 'Trupthi',  'Mohanthy',    23, 9848022336, 'Bhuwaneshwar');
INSERT INTO studdetails VALUES (006, 'Archana',  'Mishra',      23, 9848022335, 'Chennai');
INSERT INTO studdetails VALUES (007, 'Komal',    'Nayak',       24, 9848022334, 'trivendram');
INSERT INTO studdetails VALUES (008, 'Bharathi', 'Nambiayar',   24, 9848022333, 'Chennai');
```
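
Before exporting, it's worth a quick sanity check that all eight rows landed:

```sql
-- expect 8 rows with ages between 21 and 24
SELECT COUNT(*), MIN(age), MAX(age) FROM studdetails;
SELECT * FROM studdetails;
```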

## Use Sqoop to ingest the data and import it into Hive

```bash
sqoop import \
  --connect jdbc:mysql://localhost/appu1 \
  --driver com.mysql.jdbc.Driver \
  --username user \
  --password 12345 \
  --table studdetails \
  --hive-import \
  -m 1 \
  --hive-table appu.studdetails1
```
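
Once the import finishes, a quick row count in Hive confirms the data arrived (a minimal check, assuming the `hive` CLI is on your path; Beeline works the same way):

```bash
hive -e "SELECT COUNT(*) FROM appu.studdetails1;"
```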

## Create a partitioned Hive table

```sql
CREATE TABLE appu.studentpart(
  id        INT,
  firstname STRING,
  lastname  STRING,
  phone     BIGINT,
  loc       STRING
)
PARTITIONED BY (age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- enable dynamic partitioning; we partition on age
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- the source column is named location, and the dynamic partition
-- column (age) must come last in the SELECT
INSERT OVERWRITE TABLE appu.studentpart PARTITION(age)
SELECT id, firstname, lastname, phone, location, age FROM appu.studdetails1;

-- check the partitions
SHOW PARTITIONS appu.studentpart;

-- view data for a single partition
SELECT * FROM appu.studentpart WHERE age = 21;
```
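
Each partition value becomes its own HDFS directory. As a sketch, assuming the default warehouse location (an assumption; your metastore may be configured differently):

```bash
# each age value gets its own age=NN subdirectory
hdfs dfs -ls /user/hive/warehouse/appu.db/studentpart
```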

## Loading data into Pig and converting to JSON

Start Pig with HCatalog support so it can read Hive tables:

```bash
pig -useHCatalog
```

```pig
-- load the partitioned Hive table through HCatalog
A = LOAD 'appu.studentpart' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- keep only the age-24 partition
B = FILTER A BY age == 24;

-- write the result to HDFS as JSON
STORE B INTO 'studentjson' USING JsonStorage();
```
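
The `STORE` writes one or more part files into the `studentjson` directory on HDFS. To peek at the JSON before moving on (the exact part-file names are an assumption; check the listing first):

```bash
hdfs dfs -ls studentjson
hdfs dfs -cat 'studentjson/part-*' | head
```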

## Loading data into Spark

```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder().appName("test").master("local").getOrCreate()

// read the JSON part file written by the Pig job
val df = session.read.json("part-m-000000")

df.show()
```
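
Since the DataFrame picks up the columns Pig emitted (`id`, `firstname`, `lastname`, `phone`, `loc`, `age`), ordinary DataFrame operations work on it. A small sketch, assuming those column names survived the round trip:

```scala
// how many students per location, among the age-24 records we exported
df.groupBy("loc").count().show()

// or register a temp view and query it with SQL
df.createOrReplaceTempView("students")
session.sql("SELECT firstname, lastname FROM students ORDER BY id").show()
```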

## Import the same data into MongoDB

First create a database and a collection in the mongo shell:

```js
use student
db.createCollection('studentdetails')
```

Then exit the shell and run `mongoimport` from the command line:

```bash
mongoimport -d student -c studentdetails part-m-000000
```
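
Finally, you can confirm the documents landed with standard mongo shell calls:

```js
// back in the mongo shell
use student
db.studentdetails.count()
db.studentdetails.findOne()
```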