# Hands-on Course: Big Data Essentials

## Project Context
The H-1B is an employment-based, non-immigrant visa category for temporary foreign workers in the United States. For a foreign national to apply for an H-1B visa, a US employer must offer a job and petition for the visa with the US immigration department. It is the most common visa status applied for and held by international students who complete college or higher education (Masters, PhD) and move into a full-time position.
The Office of Foreign Labor Certification (OFLC) publishes program data on immigration programs, including the H-1B visa. The disclosure data is updated annually and is available on the OFLC's official website.

### Data Set:
The dataset contains the following columns:

#### CASE_STATUS:
Status associated with the last significant event or decision on the Labor Condition Application (LCA). Valid values are "Certified", "Certified-Withdrawn", "Denied" and "Withdrawn".

1. **Certified**: The employer filed the LCA, and the Department of Labor (DOL) approved it.
2. **Certified-Withdrawn**: The LCA was approved but later withdrawn by the employer.
3. **Withdrawn**: The LCA was withdrawn by the employer before it was approved.
4. **Denied**: The LCA was denied by the DOL.

#### EMPLOYER_NAME:
Name of the employer submitting the labor condition application.

#### SOC_NAME:
The occupational name associated with the SOC_CODE. SOC_CODE is the occupational code for the job requested under the temporary labor condition, as classified by the Standard Occupational Classification (SOC) System.

#### JOB_TITLE:
Title of the job.

#### FULL_TIME_POSITION:
Y = full-time position; N = part-time position.

#### PREVAILING_WAGE: 
Prevailing wage for the job requested under the temporary labor condition, expressed as an annual amount in USD. The prevailing wage for a position is defined as the average wage paid to similarly employed workers in the requested occupation in the area of intended employment. It is based on the employer’s minimum requirements for the position.

#### YEAR:
Year in which the H-1B visa petition was filed.

#### WORKSITE:
City and state of the foreign worker’s intended area of employment.

#### lon:
Longitude of the worksite.

#### lat:
Latitude of the worksite.
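
Because WORKSITE holds both a city and a state, and employer names may contain commas, splitting each line on every comma can misalign the columns. The following is a minimal sketch of a safer parse with Python's `csv` module. It assumes the file has a header row with the column names listed above, that fields containing commas are quoted, and that WORKSITE looks like `CITY, STATE`. The function name `parse_rows` is hypothetical, and the two sample rows are made up for illustration; they are not taken from `h1b_data.csv`.

```python
import csv
import io

# Made-up sample in the assumed layout (header row, quoted fields); not real data.
SAMPLE = '''CASE_STATUS,EMPLOYER_NAME,SOC_NAME,JOB_TITLE,FULL_TIME_POSITION,PREVAILING_WAGE,YEAR,WORKSITE,lon,lat
Certified,"EXAMPLE CORP, INC.",COMPUTER OCCUPATIONS,DATA ENGINEER,Y,90000,2016,"SPRINGFIELD, ILLINOIS",-89.65,39.78
Denied,SAMPLE LLC,ENGINEERS,HARDWARE ENGINEER,N,NA,2015,"AUSTIN, TEXAS",-97.74,30.27
'''

def parse_rows(text_stream):
    """Yield one dict per petition, with numeric columns converted where possible."""
    for row in csv.DictReader(text_stream):
        # Treat an unparsable wage (for example "NA") as missing rather than failing.
        try:
            row["PREVAILING_WAGE"] = float(row["PREVAILING_WAGE"])
        except ValueError:
            row["PREVAILING_WAGE"] = None
        row["YEAR"] = int(row["YEAR"])
        # WORKSITE is assumed to look like "CITY, STATE"; split on the last comma.
        city, _, state = row["WORKSITE"].rpartition(",")
        row["CITY"], row["STATE"] = city.strip(), state.strip()
        yield row

rows = list(parse_rows(io.StringIO(SAMPLE)))
for r in rows:
    print(r["YEAR"], r["EMPLOYER_NAME"], r["STATE"], r["PREVAILING_WAGE"])
```

To read a real file, the same function could be given an open file handle in place of the in-memory sample.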

### Data Source:
| File Name | Format | Size | Location |
| --- | --- | --- | --- |
| h1b_data.csv | CSV | 470 | SharedLocation |

Note: Please don’t delete the CSV file once you download it from the shared location.


### Big Data Technologies to be applied:
#### HDFS:
The input CSV file is loaded into HDFS in the respective cloud lab. The output is stored on HDFS in directories created for that purpose.

#### YARN and MapReduce:
MapReduce is a processing framework that runs on YARN. A MapReduce job usually splits the input data set into independent chunks, which the map tasks process in parallel. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
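
The map, sort, and reduce phases described above can be imitated in a few lines of plain Python. This is only a single-process sketch of the data flow, not Hadoop code; the function names and the `(year, job_title)` input records are assumptions made up for the illustration.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: emit a (year, 1) pair for every input record."""
    for year, _job_title in records:
        yield (year, 1)

def shuffle_and_sort(pairs):
    """Shuffle/sort: order the pairs by key and group the values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _key, value in group]

def reduce_phase(grouped):
    """Reduce: sum the values for each key."""
    for key, values in grouped:
        yield key, sum(values)

# Made-up records, used only to show the flow.
records = [(2016, "DATA ENGINEER"), (2015, "ANALYST"), (2016, "ANALYST")]
counts = dict(reduce_phase(shuffle_and_sort(map_phase(records))))
print(counts)
```

In a real job the map and reduce functions run on different machines, and the framework performs the sort and grouping step between them.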

#### Hive:
Hive is a data warehouse tool for Apache Hadoop. Its SQL-like query language, HiveQL, is often used as the interface to a Hadoop-based data warehouse, and it is familiar to users who already query data with SQL.

#### Pig:
A scripting platform for processing and analyzing large data sets. Apache Pig allows Apache Hadoop users to write complex MapReduce transformations using a simple scripting language called Pig Latin.

#### HBase:
HBase is a non-relational (NoSQL) database that runs on top of HDFS. It is natively integrated with Hadoop and works alongside other data access engines through YARN.

#### Spark:
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.

### Requirements/Use cases/questions

1. Is the number of petitions with the Data Engineer job title increasing over time?
2. Find the top 5 job titles with the highest growth in applications.
3. Which part of the US has the most Hardware Engineer jobs for each year?
4. Find the top 5 locations in the US with the most certified visas for each year.
5. Which industry has the largest number of Data Scientist positions?
6. Which 5 employers file the most petitions each year?
7. Find the 10 most popular job positions for H-1B visa applications for each year.
8. Find the count of each case status and its percentage of total applications for each year.
9. Find the average prevailing wage for each job for each year (treat part-time and full-time positions separately). Arrange the output in descending order.
10. Which employers have a success rate above 70% and more than 1000 petitions filed in total? List them along with their number of petitions.
11. Which job positions have a success rate above 70% and more than 1000 petitions filed in total? List them along with their number of petitions.

### Solution expectation:
* Step 1: Load datasets to HDFS 
* Step 2: Write MapReduce program for questions: 1, 2 & 3 
* Step 3: Write Hive based queries for questions: 4 & 5 
* Step 4: Write Pig scripting for questions: 6 & 7 
* Step 5: Write HBase queries for questions: 8 & 9
* Step 6: Write Spark based queries for questions: 10 & 11

### Procedure to submit the solution:
1. Submit a solution document for each question, along with a screen capture of the output.
2. Each solution document should contain the program, query, or script for the corresponding question.
3. Submit your solution as per the guidelines shared by the program management team.

### Step 1: Load datasets to HDFS

In [20]:
!hdfs dfs -ls /user/bdhfeb201

Found 5 items
drwx------   - bdhfeb201 bdhfeb201          0 2020-02-27 07:00 /user/bdhfeb201/.Trash
drwx------   - bdhfeb201 bdhfeb201          0 2020-02-27 10:49 /user/bdhfeb201/.staging
drwxr-xr-x   - bdhfeb201 bdhfeb201          0 2020-02-26 06:21 /user/bdhfeb201/h-1b-visa
drwxr-xr-x   - bdhfeb201 bdhfeb201          0 2020-02-27 10:49 /user/bdhfeb201/wordcount
-rw-r--r--   2 bdhfeb201 bdhfeb201         28 2020-02-27 10:47 /user/bdhfeb201/wordcount_test.txt


In [21]:
#!hdfs dfs -copyFromLocal h-1b-visa/ /user/bdhfeb201/

In [22]:
!hdfs dfs -ls /user/bdhfeb201

Found 5 items
drwx------   - bdhfeb201 bdhfeb201          0 2020-02-27 07:00 /user/bdhfeb201/.Trash
drwx------   - bdhfeb201 bdhfeb201          0 2020-02-27 10:49 /user/bdhfeb201/.staging
drwxr-xr-x   - bdhfeb201 bdhfeb201          0 2020-02-26 06:21 /user/bdhfeb201/h-1b-visa
drwxr-xr-x   - bdhfeb201 bdhfeb201          0 2020-02-27 10:49 /user/bdhfeb201/wordcount
-rw-r--r--   2 bdhfeb201 bdhfeb201         28 2020-02-27 10:47 /user/bdhfeb201/wordcount_test.txt


### Step 2: Write MapReduce program for questions: 1, 2 & 3
        

1. Is the number of petitions with the Data Engineer job title increasing over time?

In [1]:
!cat Q1DataEngineerIncreasing.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Q1DataEngineerIncreasing {

  public static class DataEngineerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      String[] arrOfStr = value.toString()
      while (itr.hasMoreToke

2. Find the top 5 job titles with the highest growth in applications.

3. Which part of the US has the most Hardware Engineer jobs for each year?

### Step 3: Write Hive based queries for questions: 4 & 5

4. Find the top 5 locations in the US with the most certified visas for each year.

5. Which industry has the largest number of Data Scientist positions?

### Step 4: Write Pig scripting for questions: 6 & 7

6. Which 5 employers file the most petitions each year?

7. Find the 10 most popular job positions for H-1B visa applications for each year.
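
Questions 6 and 7 both ask for the top N values of one column within each year. The grouping logic is sketched below in plain Python rather than Pig Latin. The function `top_n_per_year` and the sample tuples are hypothetical, and ties are broken alphabetically as an arbitrary choice.

```python
from collections import Counter, defaultdict

def top_n_per_year(records, n):
    """Return {year: [(name, count), ...]} with the n most frequent names per year."""
    per_year = defaultdict(Counter)
    for year, name in records:
        per_year[year][name] += 1
    result = {}
    for year, counter in per_year.items():
        # Sort by descending count, then by name so that ties are deterministic.
        ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
        result[year] = ranked[:n]
    return result

# Made-up (YEAR, EMPLOYER_NAME) pairs for illustration only.
records = [
    (2015, "A CORP"), (2015, "A CORP"), (2015, "B LLC"),
    (2016, "B LLC"), (2016, "C INC"), (2016, "B LLC"), (2016, "A CORP"),
]
top = top_n_per_year(records, n=2)
print(top)
```

Passing `(YEAR, JOB_TITLE)` pairs with `n=10` gives the shape of answer that question 7 asks for.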

### Step 5: Write HBase queries for questions: 8 & 9

8. Find the count of each case status and its percentage of total applications for each year.

9. Find the average prevailing wage for each job for each year (treat part-time and full-time positions separately). Arrange the output in descending order.
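
Question 8 comes down to a count per (year, status) pair divided by the total for that year. A minimal sketch of that arithmetic in plain Python (not HBase code) follows; the function name and the sample records are assumptions for illustration.

```python
from collections import Counter

def status_share_per_year(records):
    """Return {(year, status): (count, percentage of that year's petitions)}."""
    records = list(records)
    by_year_status = Counter(records)
    by_year = Counter(year for year, _status in records)
    return {
        (year, status): (count, 100.0 * count / by_year[year])
        for (year, status), count in by_year_status.items()
    }

# Made-up (YEAR, CASE_STATUS) pairs for illustration only.
records = [
    (2016, "Certified"), (2016, "Certified"), (2016, "Certified"), (2016, "Denied"),
    (2015, "Withdrawn"), (2015, "Certified"),
]
shares = status_share_per_year(records)
for key in sorted(shares):
    count, pct = shares[key]
    print(key, count, f"{pct:.1f}%")
```

Within each year the percentages add up to 100, which is a quick sanity check on any real implementation.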

### Step 6: Write Spark based queries for questions: 10 & 11

10. Which employers have a success rate above 70% and more than 1000 petitions filed in total? List them along with their number of petitions.

11. Which job positions have a success rate above 70% and more than 1000 petitions filed in total? List them along with their number of petitions.
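
Questions 10 and 11 apply the same filter to different grouping keys (EMPLOYER_NAME or JOB_TITLE). The sketch below shows the filter in plain Python rather than the Spark API. It assumes that "success" means a status of Certified or Certified-Withdrawn, which the project text does not define. The function name, its parameters, and the sample data are likewise hypothetical.

```python
from collections import Counter

# Assumption: these statuses count as a successful petition.
SUCCESS_STATUSES = {"Certified", "Certified-Withdrawn"}

def high_success_groups(records, min_rate=70.0, min_total=1000):
    """Return {key: (total, success rate in %)} for keys that pass both thresholds.

    Both thresholds are strict ("more than"), matching the wording of the questions.
    """
    total = Counter()
    success = Counter()
    for key, status in records:
        total[key] += 1
        if status in SUCCESS_STATUSES:
            success[key] += 1
    result = {}
    for key, n in total.items():
        rate = 100.0 * success[key] / n
        if rate > min_rate and n > min_total:
            result[key] = (n, rate)
    return result

# Made-up (EMPLOYER_NAME, CASE_STATUS) pairs; min_total is lowered to fit the tiny sample.
records = (
    [("A CORP", "Certified")] * 4 + [("A CORP", "Denied")]
    + [("B LLC", "Certified")] * 2 + [("B LLC", "Withdrawn")] * 2
)
result = high_success_groups(records, min_rate=70.0, min_total=3)
print(result)
```

With the default thresholds the same call answers question 10, and swapping the key to JOB_TITLE answers question 11.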