## Problem 01

Weightage: 10

Get the details of all members who do not have phone numbers.

## Data Description

All of the address data is available under **/public/addresses**. Here is the schema.
```
root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- postal_code: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- street: string (nullable = true)
 |-- email: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: long (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_numbers: array (nullable = true)
 |    |-- element: string (containsNull = true)
```
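Because `phone_numbers` is a nullable array, "no phone numbers" can be read in two ways: the field is null, or the field is an empty array. The solution in this document filters on null only, and its count matches the expected value. The plain-Python sketch below illustrates the difference on a few hypothetical records (invented for illustration, not taken from the dataset); whether the real data contains empty arrays at all is not stated here.

```python
# Hypothetical sample records shaped like the schema above (not real data).
records = [
    {"id": 1, "first_name": "Ann", "last_name": "Lee",
     "email": "ann@example.com", "phone_numbers": ["555-0100"]},
    {"id": 4, "first_name": "Bo", "last_name": "Kim",
     "email": "bo@example.com", "phone_numbers": None},
    {"id": 2, "first_name": "Cy", "last_name": "Roe",
     "email": "cy@example.com", "phone_numbers": []},
    {"id": 3, "first_name": "Di", "last_name": "Poe",
     "email": "di@example.com", "phone_numbers": None},
]


def has_no_phone_numbers(record, treat_empty_as_missing=False):
    """Return True when the record counts as having no phone numbers."""
    phones = record.get("phone_numbers")
    if phones is None:
        return True
    # Optionally count an empty array as "no phone numbers" as well.
    return treat_empty_as_missing and len(phones) == 0


def members_without_phones(records, treat_empty_as_missing=False):
    """Filter, sort by id ascending, and keep only the four output columns."""
    selected = [r for r in records
                if has_no_phone_numbers(r, treat_empty_as_missing)]
    selected.sort(key=lambda r: r["id"])
    return [(r["id"], r["first_name"], r["last_name"], r["email"])
            for r in selected]


null_only = members_without_phones(records)
null_or_empty = members_without_phones(records, treat_empty_as_missing=True)
print(null_only)
print(null_or_empty)
```

If the two readings give different counts on the real data, the expected count in the validation section indicates which one is intended.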

## Output Requirements

* Place the result in the following HDFS directory:
```
/user/`whoami`/mock_test_02/problem01/solution
```
* Use the Parquet file format to save the output. The output should be saved in 2 files.
* Here are the column names. Data types should be the same as in the input data.
```
 |-- id: long
 |-- first_name: string
 |-- last_name: string
 |-- email: string
```
* The data should be sorted in ascending order by id.
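The `whoami` placeholder in the output directory is shell syntax, so it has to be resolved before the path can be used from Python. A minimal sketch, assuming a hypothetical helper named `solution_path` (not part of the exercise) and using the username that appears in the sample listing later in this document:

```python
def solution_path(username, test_name="mock_test_02", problem="problem01"):
    """Build the HDFS output directory for a given user (hypothetical helper)."""
    return f"/user/{username}/{test_name}/{problem}/solution"


# 'itv002461' is the user from the sample listing in this document.
# In a notebook, getpass.getuser() would normally supply this value.
path = solution_path("itv002461")
print(path)
```

Building the path in one place keeps the write step and the validation step pointed at the same directory.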

## Validation

Here are the self-validation steps:
* Run the following code to create a DataFrame.
```
import getpass
username = getpass.getuser()
path = f'/user/{username}/mock_test_02/problem01/solution'
data = spark. \
    read. \
    parquet(path)
```
* Get the schema by running `data.printSchema()`. The output should be as below. Ignore nullability if it does not match exactly.
```
root
 |-- id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)
```
* Get the count by running `data.count()`. It should return **258160**.
* Run `data.orderBy('id').show()` to validate the data. The output should look like this.

| id|first_name|last_name|               email|
|---|----------|---------|--------------------|
| 16|  Eleonore|   Cordle|ecordlef@printfri...|
| 18|     Heddi|   Sackes|hsackesh@business...|
| 23|       Zak|    Rigts| zrigtsm@cornell.edu|
| 25|     Wiatt|     Wane|    wwaneo@tmall.com|
| 26|    Aubrie| Ashworth|aashworthp@networ...|
| 28|    Lindsy|  Kellart|lkellartr@istockp...|
| 30|    Harman|   Birley|hbirleyt@deliciou...|
| 33|     Randa|   Eberst|   reberstw@tamu.edu|
| 34|    Stinky| Penniall|spenniallx@domain...|
| 35|     Marya|   Rahlof|mrahlofy@oaic.gov.au|
| 42|     Peder|  Harring|pharring15@list-m...|
| 54|       Row|    Anker|ranker1h@squidoo.com|
| 57|    Morgun|      Loy|mloy1k@deviantart...|
| 60|  Geoffrey|Ashbridge|gashbridge1n@wufo...|
| 61|     Nance|  Gladdis|ngladdis1o@weathe...|
| 62|     Allyn|    Monni| amonni1p@devhub.com|
| 64|     Kleon|  Tolchar|ktolchar1r@angelf...|
| 66|  Georgena|    Ingre|gingre1t@marriott...|
| 69|   Belicia|    Trigg|   btrigg1w@army.mil|
| 79|  Courtnay|  Umpleby|cumpleby26@trelli...|
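The validation steps above can also be collected into one check. The sketch below is an assumption about how that might look, not part of the exercise: `validate` is a hypothetical helper that takes plain Python values, so it runs without a Spark session. It compares column names and types only, which ignores nullability. PySpark's `DataFrame.dtypes` reports `long` columns as `bigint`, which is why that name is used here.

```python
# Expected (name, type) pairs, in the form returned by DataFrame.dtypes.
EXPECTED_COLUMNS = [
    ("id", "bigint"),
    ("first_name", "string"),
    ("last_name", "string"),
    ("email", "string"),
]
EXPECTED_COUNT = 258160
# First five ids from the expected output table above.
EXPECTED_FIRST_IDS = [16, 18, 23, 25, 26]


def validate(dtypes, row_count, first_ids):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if list(dtypes) != EXPECTED_COLUMNS:
        problems.append(f"unexpected columns: {list(dtypes)}")
    if row_count != EXPECTED_COUNT:
        problems.append(f"unexpected count: {row_count}")
    if list(first_ids) != EXPECTED_FIRST_IDS:
        problems.append(f"unexpected leading ids: {list(first_ids)}")
    return problems


print(validate(EXPECTED_COLUMNS, 258160, [16, 18, 23, 25, 26]))
```

In a notebook, the arguments could come from `data.dtypes`, `data.count()`, and `[row.id for row in data.orderBy('id').take(5)]`.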

In [1]:
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    appName(f'Problem 01 | {username}'). \
    master('yarn'). \
    getOrCreate()

In [2]:
# Read the JSON address data; Spark infers the schema from the files.
df = spark.read.json('/public/addresses')

In [3]:
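# Keep members whose phone_numbers array is null, project the required
# columns, and sort by id in ascending order.
output = df. \
    filter('phone_numbers is null'). \
    select('id', 'first_name', 'last_name', 'email'). \
    orderBy('id')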

In [4]:
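# coalesce(2) reduces the sorted data to 2 partitions, so 2 part files are written.
# compression='none' writes uncompressed Parquet.
output. \
    coalesce(2). \
    write. \
    parquet(f'/user/{username}/mock_test_02/problem01/solution', mode='overwrite', compression='none')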

In [5]:
%%sh

hdfs dfs -ls /user/${USER}/mock_test_02/problem01/solution

Found 3 items
-rw-r--r--   3 itv002461 supergroup          0 2022-06-28 10:27 /user/itv002461/mock_test_02/problem01/solution/_SUCCESS
-rw-r--r--   3 itv002461 supergroup    5640715 2022-06-28 10:27 /user/itv002461/mock_test_02/problem01/solution/part-00000-97fe7195-20a9-446a-9157-71ac68b8be4a-c000.parquet
-rw-r--r--   3 itv002461 supergroup    5620287 2022-06-28 10:27 /user/itv002461/mock_test_02/problem01/solution/part-00001-97fe7195-20a9-446a-9157-71ac68b8be4a-c000.parquet
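The listing above is how the "2 files" requirement gets confirmed by eye. As a rough sketch, the same check can be done by parsing the listing text. `parquet_part_files` is a hypothetical helper, and it assumes the eight-column layout shown in the listing above.

```python
# Listing text copied from the output above.
listing = """Found 3 items
-rw-r--r--   3 itv002461 supergroup          0 2022-06-28 10:27 /user/itv002461/mock_test_02/problem01/solution/_SUCCESS
-rw-r--r--   3 itv002461 supergroup    5640715 2022-06-28 10:27 /user/itv002461/mock_test_02/problem01/solution/part-00000-97fe7195-20a9-446a-9157-71ac68b8be4a-c000.parquet
-rw-r--r--   3 itv002461 supergroup    5620287 2022-06-28 10:27 /user/itv002461/mock_test_02/problem01/solution/part-00001-97fe7195-20a9-446a-9157-71ac68b8be4a-c000.parquet
"""


def parquet_part_files(listing_text):
    """Return (file_name, size_in_bytes) for each Parquet part file listed."""
    files = []
    for line in listing_text.splitlines():
        fields = line.split()
        # File rows have 8 fields: perms, replication, owner, group,
        # size, date, time, path. The "Found N items" header is skipped.
        if len(fields) < 8:
            continue
        name = fields[-1].rsplit("/", 1)[-1]
        if name.startswith("part-") and name.endswith(".parquet"):
            files.append((name, int(fields[4])))
    return files


part_files = parquet_part_files(listing)
print(len(part_files), [size for _, size in part_files])
```

The `_SUCCESS` marker is excluded on purpose, since it is a zero-byte job marker rather than a data file.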


In [6]:
import getpass
username = getpass.getuser()
path = f'/user/{username}/mock_test_02/problem01/solution'
data = spark. \
  read. \
  parquet(path)

In [7]:
data.printSchema()

root
 |-- id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)



In [8]:
data.count()

258160

In [9]:
data.orderBy('id').show()

+---+----------+---------+--------------------+
| id|first_name|last_name|               email|
+---+----------+---------+--------------------+
| 16|  Eleonore|   Cordle|ecordlef@printfri...|
| 18|     Heddi|   Sackes|hsackesh@business...|
| 23|       Zak|    Rigts| zrigtsm@cornell.edu|
| 25|     Wiatt|     Wane|    wwaneo@tmall.com|
| 26|    Aubrie| Ashworth|aashworthp@networ...|
| 28|    Lindsy|  Kellart|lkellartr@istockp...|
| 30|    Harman|   Birley|hbirleyt@deliciou...|
| 33|     Randa|   Eberst|   reberstw@tamu.edu|
| 34|    Stinky| Penniall|spenniallx@domain...|
| 35|     Marya|   Rahlof|mrahlofy@oaic.gov.au|
| 42|     Peder|  Harring|pharring15@list-m...|
| 54|       Row|    Anker|ranker1h@squidoo.com|
| 57|    Morgun|      Loy|mloy1k@deviantart...|
| 60|  Geoffrey|Ashbridge|gashbridge1n@wufo...|
| 61|     Nance|  Gladdis|ngladdis1o@weathe...|
| 62|     Allyn|    Monni| amonni1p@devhub.com|
| 64|     Kleon|  Tolchar|ktolchar1r@angelf...|
| 66|  Georgena|    Ingre|gingre1t@marri