### **Steps 1 & 2 :** Download the *DOS/Windows for Dec.* zip file and extract it
+ Visit https://www.census.gov/data/datasets/2017/demo/cps/cps-basic-2017.html to download the file.
+ The extracted data file is *dec17pub.dat*, available in the project root folder.
+ Load the essential PySpark libraries and initialize the Spark session.
+ Display the Spark engine version.

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName('Data Engineer - Take Home Project').getOrCreate()
print(f'The spark version is : {spark.version}')

The spark version is : 3.5.1
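
The extraction step itself is not shown in the notebook. A minimal sketch of it using only the standard library follows; the helper name `extract_member` and the archive name `dec17pub.zip` used in the comment are assumptions, not taken from the project.

```python
import zipfile
from pathlib import Path


def extract_member(zip_path, member, dest_dir="."):
    """Extract a single member from a zip archive and return its path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # extract() returns the path of the file it wrote
        return Path(archive.extract(member, path=dest))


# Hypothetical usage (archive name is an assumption):
# extract_member("dec17pub.zip", "dec17pub.dat")
```

If the member name inside the archive differs from `dec17pub.dat`, `zipfile.ZipFile(...).namelist()` shows what the archive actually contains.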


### **Step 3 :** Show a sample of the *DOS/Windows for Dec.* data file
+ Open the file and map the fixed-width columns of each line to variables
+ Append each mapped record to a list
+ Create the dataframe *df_master* from the list
+ Display sample records from *df_master*

In [2]:
# Create a list to hold the records parsed from the file
rows = []

# Open the file for read operation
with open('dec17pub.dat','r') as file:
    
    # Read each line and map the columns to variables
    for line in file:
        full_household_identifier = line[:15]
        time_of_interview = line[17:21] + '/' + line[15:17]
        final_outcome_of_survey = line[23:26]
        type_of_housing_unit = line[31:32]
        household_type = line[61:62]
        household_has_telephone = line[33:34]
        household_can_access_telephone = line[35:36]
        is_telephone_interview_acceptable = line[37:38]
        type_of_interview = line[65:66]
        family_income_range = line[39:40]
        division_location =  line[90:91]
        race =  line[138:140]

        # Build a record from the variables above and append it to the list
        item = (full_household_identifier,
                time_of_interview,
                final_outcome_of_survey,
                type_of_housing_unit,
                household_type,
                household_has_telephone,
                household_can_access_telephone,
                is_telephone_interview_acceptable,
                type_of_interview,
                family_income_range,    
                division_location,
                race)
        rows.append(item)

# Create a dataframe from the parsed rows and display sample records
from schema import master_schema
df_master = spark.createDataFrame(rows, master_schema)
df_master.show(5, truncate=False)

+-------------------------+-----------------+----------------------------+-------------------------+-------------------+----------------------------+-----------------------------------+--------------------------------------+----------------------+------------------------+----------------------+---------+
|full_household_identifier|time_of_interview|final_outcome_of_survey_code|type_of_housing_unit_code|household_type_code|household_has_telephone_code|household_can_access_telephone_code|is_telephone_interview_acceptable_code|type_of_interview_code|family_income_range_code|division_location_code|race_code|
+-------------------------+-----------------+----------------------------+-------------------------+-------------------+----------------------------+-----------------------------------+--------------------------------------+----------------------+------------------------+----------------------+---------+
|000004795110719          |2017/12          |201                         |1       
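
The column mapping in the cell above can also be written as a table-driven parser, which keeps every field position in one place. This is a sketch, not the project's code: `LAYOUT` and `parse_line` are hypothetical names, and the slice boundaries are copied unchanged from the notebook. They are worth checking against the CPS data dictionary, because a slice narrower than the documented field silently drops part of the value (a one-character slice over the second position of `-1` would read `1`).

```python
# Field layout copied from the notebook's slices (0-based, end-exclusive).
LAYOUT = {
    "full_household_identifier": slice(0, 15),
    "final_outcome_of_survey": slice(23, 26),
    "type_of_housing_unit": slice(31, 32),
    "household_type": slice(61, 62),
    "household_has_telephone": slice(33, 34),
    "household_can_access_telephone": slice(35, 36),
    "is_telephone_interview_acceptable": slice(37, 38),
    "type_of_interview": slice(65, 66),
    "family_income_range": slice(39, 40),
    "division_location": slice(90, 91),
    "race": slice(138, 140),
}


def parse_line(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of raw string fields."""
    record = {name: line[span] for name, span in layout.items()}
    # Interview time is assembled as year/month, as in the notebook.
    record["time_of_interview"] = line[17:21] + "/" + line[15:17]
    return record
```

Changing a field position then means editing one entry in `LAYOUT` rather than hunting through the loop body.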

### **Step 4 :** Answers to Questions 1 - 4


#### **Question 1 :** What is the count of responders per family income range (show all)?
+ Create dataframe **df_family_income** to decode the family income range
+ Select only the required field from *df_master* to keep the join small
+ Join the dataframes and generate the result
+ Note: *the result contains a NULL group; with a left join, any income code without a match in df_family_income gets a NULL label*

In [3]:
# Load schema and data for family income data structure
from schema import family_income_range_schema, family_income_range_data
df_family_income = spark.createDataFrame(family_income_range_data, family_income_range_schema)

# Create the join and run the analysis
Question_1 = df_master.select('family_income_range_code')\
    .join(df_family_income, df_master.family_income_range_code == df_family_income.family_code, 'left')\
        .groupBy('family_income_range').count().orderBy('count', ascending = False)

# Format and display the result for Question one
Question_1.toDF('FAMILY_INCOME_RANGE','COUNT_OF_RESPONDERS').show(truncate = False)

+-------------------+-------------------+
|FAMILY_INCOME_RANGE|COUNT_OF_RESPONDERS|
+-------------------+-------------------+
|LESS THAN $5,000   |33315              |
|12,00 TO 14,999    |20408              |
|15,000 TO 19,999   |20222              |
|10,000 TO 12,999   |19718              |
|7,500 TO 9,999     |15719              |
|5,000 TO 7,499     |11596              |
|30,000 TO 34,      |6743               |
|NULL               |6620               |
|20,000 TO 24,999   |6312               |
|25,000 TO 29,999   |5803               |
+-------------------+-------------------+
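
To see which codes end up in the NULL group, it helps to list the codes that have no key in the lookup table. The sketch below does this in plain Python, independent of Spark; the function name and the sample data are hypothetical and only illustrate the check.

```python
from collections import Counter


def unmatched_code_counts(codes, lookup):
    """Count occurrences of each code that has no key in the lookup table."""
    return Counter(code for code in codes if code not in lookup)


# Hypothetical data, for illustration only.
sample_lookup = {"1": "LESS THAN $5,000", "2": "5,000 TO 7,499"}
sample_codes = ["1", "2", "0", "0", "2", "-"]
print(unmatched_code_counts(sample_codes, sample_lookup))
```

In Spark the same question can be asked with a `left_anti` join between the fact dataframe and the lookup dataframe, followed by a `groupBy` on the code column.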



#### **Question 2 :** What is the count of responders per geographical division/location and race (show top 10)?
+ Create dataframes **df_div_location** and **df_race** to decode location and race
+ Select only the required fields from *df_master* to keep the joins small
+ Join the dataframes and generate the result
+ Note: *most rows have a NULL race; with a left join, any race code without a match in df_race gets a NULL label*

In [6]:
from schema import division_location_schema, race_schema, division_location_data, race_data
df_div_location = spark.createDataFrame(division_location_data, division_location_schema)
df_race = spark.createDataFrame(race_data, race_schema)

# Create the join and run the analysis
Question_2 = df_master.select('division_location_code','race_code')\
    .join(df_div_location, df_master.division_location_code == df_div_location.div_loc_code, 'left')\
        .join(df_race, df_master.race_code == df_race.race_code, 'left')\
            .groupBy('division_location','race').count().orderBy('count', ascending = False)

# Format and display Top 10 only
Question_2.toDF('DIVISION_LOCATION', 'RACE', 'COUNT_OF_RESPONDERS').show(10)

+------------------+--------+-------------------+
| DIVISION_LOCATION|    RACE|COUNT_OF_RESPONDERS|
+------------------+--------+-------------------+
|    SOUTH ATLANTIC|    NULL|              27609|
|           PACIFIC|    NULL|              20659|
|          MOUNTAIN|    NULL|              18470|
|WEST SOUTH CENTRAL|    NULL|              16498|
|EAST NORTH CENTRAL|    NULL|              15296|
|WEST NORTH CENTRAL|    NULL|              13052|
|   MIDDLE ATLANTIC|    NULL|              12756|
|       NEW ENGLAND|    NULL|              11281|
|EAST SOUTH CENTRAL|    NULL|              10345|
|           PACIFIC|Asian-HP|                 70|
+------------------+--------+-------------------+
only showing top 10 rows
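
One possible cause of the NULL races, stated here as a hypothesis only, is that the two-character race field carries padding (for example `" 1"` or `"01"`) while the lookup keys do not. The contents of `race_data` are not shown, so this is an assumption to verify. A small normalizer such as the hypothetical `normalize_code` below would make both sides comparable before the join.

```python
def normalize_code(raw):
    """Strip whitespace padding and leading zeros from a fixed-width code."""
    text = raw.strip()
    # Separate an optional minus sign so that codes such as "-1" survive.
    sign, body = ("-", text[1:]) if text.startswith("-") else ("", text)
    if not body.isdigit():
        # Empty or non-numeric fields are returned with whitespace removed.
        return text
    return sign + (body.lstrip("0") or "0")
```

In Spark, whitespace padding alone can be removed with `pyspark.sql.functions.trim` on the code column before joining.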



#### **Question 3 :** How many responders do not have a telephone in their house but can access one elsewhere, and accept a telephone interview?
+ Filter *df_master* directly; no decoding is required

In [7]:
Question_3 = df_master.where(
    (col('household_has_telephone_code') == lit('2')) &
    (col('household_can_access_telephone_code')  == lit('1')) &
    (col('is_telephone_interview_acceptable_code') == lit('1'))
).count()

print(f'The answer to Question (3) is : {Question_3}')

The answer to Question (3) is : 635


#### **Question 4 :** How many responders can access a telephone but do not accept a telephone interview?
+ Filter *df_master* directly; no decoding is required
+ Observation: the values parsed for '*Is telephone interview acceptable*' are (0, 1) instead of the expected (1, 2), so the filter on '2' matches no rows.

In [10]:
df_master.where(
        (col('household_can_access_telephone_code')  == lit('1')) &
        (col('is_telephone_interview_acceptable_code') == lit('2'))
    ).count()

0
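
Before filtering on a specific code, the set of values actually present at a field position can be tabulated straight from the raw lines. The sketch below uses a hypothetical helper, `slice_frequencies`; the positions in the usage comment are the ones from the notebook.

```python
from collections import Counter


def slice_frequencies(lines, start, stop):
    """Tabulate the raw values found at line[start:stop] across all lines."""
    return Counter(line[start:stop] for line in lines)


# Hypothetical usage against the data file:
# with open("dec17pub.dat", "r") as file:
#     print(slice_frequencies(file, 37, 38))
```

Running the same tabulation over a wider window, for example positions 36 to 38, is one way to check whether the slice boundaries match the field width given in the data dictionary. Treat that window as an assumption until it is confirmed there.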