## Problem 04

Weightage: 10

Get the number of members per city. The same city name can appear in more than one state, so include the state along with the city when computing the count.
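To see why the state has to be part of the grouping key, here is a minimal plain-Python sketch (not Spark) over a few made-up records. The `records` list and its values are purely illustrative assumptions and are not taken from the addresses data.

```python
from collections import Counter

# Hypothetical (state, city) pairs for illustration only; not real data.
records = [
    ("Illinois", "Springfield"),
    ("Illinois", "Springfield"),
    ("Missouri", "Springfield"),
    ("Illinois", "Chicago"),
]

# Counting by city alone merges the two different Springfields.
by_city = Counter(city for _state, city in records)

# Counting by (state, city) keeps them apart.
by_state_city = Counter(records)

print(by_city["Springfield"])                      # 3
print(by_state_city[("Illinois", "Springfield")])  # 2
print(by_state_city[("Missouri", "Springfield")])  # 1
```

Grouping on both columns in Spark has the same effect: each distinct state and city pair gets its own count.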

## Data Description

All of the address data is available under **/public/addresses**. Here is the schema.
```
root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- postal_code: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- street: string (nullable = true)
 |-- email: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: long (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_numbers: array (nullable = true)
 |    |-- element: string (containsNull = true)
```

## Output Requirements
* Place the result in the following HDFS directory:
```
/user/`whoami`/mock_test_02/problem04/solution
```
* Use Parquet with Snappy compression. Make sure that data is saved in exactly one file.
* Here are the column names. Data types should be the same as in the input data.
```
 |-- state: string
 |-- city: string
 |-- member_count: long
```
* Data should be sorted in ascending order by `state` and then in descending order by `member_count`.
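The required ordering can be illustrated in plain Python, assuming the aggregated rows are available as `(state, city, member_count)` tuples. The rows reuse a few values from the sample output in the Validation section; `rows` and `ordered` are names chosen for this sketch only.

```python
# A few (state, city, member_count) rows in arbitrary order.
rows = [
    ("Alaska", "Juneau", 530),
    ("Alabama", "Mobile", 4236),
    ("Alaska", "Anchorage", 2684),
    ("Alabama", "Birmingham", 7987),
]

# Sort by state ascending, then by member_count descending.
# Negating the count reverses its order while state stays ascending.
ordered = sorted(rows, key=lambda r: (r[0], -r[2]))

for row in ordered:
    print(row)
```

In Spark the same intent is expressed by passing the state column as is and the count column with `.desc()` to `orderBy`.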

## Validation

Here are the self-validation steps:
* Run the following command and check the file extension. The file name should contain `snappy` and `parquet`.
```
hdfs dfs -ls /user/`whoami`/mock_test_02/problem04/solution
```
* Run the following code to create a DataFrame.
```
import getpass
username = getpass.getuser()
path = f'/user/{username}/mock_test_02/problem04/solution'
data = spark. \
    read. \
    parquet(path)
```
* Get the schema by running `data.printSchema()`. The output should be as below. Ignore nullability if it does not match exactly.
```
root
 |-- state: string (nullable = true)
 |-- city: string (nullable = true)
 |-- member_count: long (nullable = true)
```
* Get count by running `data.count()`. It should return **494**.
* Run the code below to validate the data.
```
from pyspark.sql.functions import col
data.orderBy(col('state'), col('member_count').desc()).show()
```
* Sample output

|  state|           city|member_count|
|-------|---------------|------------|
|Alabama|     Birmingham|        7987|
|Alabama|     Montgomery|        4279|
|Alabama|         Mobile|        4236|
|Alabama|     Huntsville|        2139|
|Alabama|     Tuscaloosa|        1071|
|Alabama|        Gadsden|         500|
|Alabama|       Anniston|         499|
| Alaska|      Anchorage|        2684|
| Alaska|      Fairbanks|        1105|
| Alaska|         Juneau|         530|
|Arizona|        Phoenix|        8625|
|Arizona|         Tucson|        5289|
|Arizona|     Scottsdale|        1559|
|Arizona|           Mesa|        1551|
|Arizona|       Glendale|        1077|
|Arizona|        Gilbert|         561|
|Arizona|       Chandler|         537|
|Arizona|Apache Junction|         527|
|Arizona|         Peoria|         516|
|Arizona|       Prescott|         512|
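
The `orderBy` in the validation code re-sorts the data before showing it, so it does not by itself confirm that the saved file is already in the required order. A small helper like the one below could check that. It is a hypothetical addition, not part of the original exercise, and it assumes the rows have been collected as `(state, city, member_count)` tuples with no null states, for example via `[(r.state, r.city, r.member_count) for r in data.collect()]`.

```python
def is_ordered(rows):
    """Return True if rows are sorted by state ascending,
    then by member_count descending within each state."""
    # Negating the count turns "descending" into an ascending comparison.
    keys = [(state, -member_count) for state, _city, member_count in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


good = [
    ("Alabama", "Birmingham", 7987),
    ("Alabama", "Montgomery", 4279),
    ("Alaska", "Anchorage", 2684),
]
bad = [
    ("Alabama", "Montgomery", 4279),
    ("Alabama", "Birmingham", 7987),
]

print(is_ordered(good))  # True
print(is_ordered(bad))   # False
```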


In [1]:
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    appName(f'Problem 04 | {username}'). \
    master('yarn'). \
    getOrCreate()

In [2]:
read_df = spark.read.json('/public/addresses')

In [3]:
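from pyspark.sql.functions import col, count, lit

# Pull state and city out of the nested address struct, count members
# per (state, city), then sort by state ascending and count descending.
output = read_df. \
    select(col('address').state.alias('state'), col('address').city.alias('city')). \
    groupBy('state', 'city'). \
    agg(count(lit(1)).alias('member_count')). \
    orderBy('state', col('member_count').desc())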

In [9]:
# coalesce(1) writes a single output file; username comes from In [1].
output. \
    coalesce(1). \
    write. \
    parquet(f'/user/{username}/mock_test_02/problem04/solution', mode='overwrite', compression='snappy')

In [10]:
%%sh
hdfs dfs -ls /user/${USER}/mock_test_02/problem04/solution

Found 2 items
-rw-r--r--   3 itv002461 supergroup          0 2022-06-28 10:35 /user/itv002461/mock_test_02/problem04/solution/_SUCCESS
-rw-r--r--   3 itv002461 supergroup       8142 2022-06-28 10:35 /user/itv002461/mock_test_02/problem04/solution/part-00000-4af80203-202c-4f5c-bfbf-022009e3c8e4-c000.snappy.parquet


In [11]:
import getpass
username = getpass.getuser()
path = f'/user/{username}/mock_test_02/problem04/solution'
data = spark. \
  read. \
  parquet(path)

In [12]:
data.printSchema()

root
 |-- state: string (nullable = true)
 |-- city: string (nullable = true)
 |-- member_count: long (nullable = true)



In [13]:
data.count()

494

In [14]:
from pyspark.sql.functions import col
data.orderBy(col('state'), col('member_count').desc()).show()

+-------+---------------+------------+
|  state|           city|member_count|
+-------+---------------+------------+
|Alabama|     Birmingham|        7987|
|Alabama|     Montgomery|        4279|
|Alabama|         Mobile|        4236|
|Alabama|     Huntsville|        2139|
|Alabama|     Tuscaloosa|        1071|
|Alabama|        Gadsden|         500|
|Alabama|       Anniston|         499|
| Alaska|      Anchorage|        2684|
| Alaska|      Fairbanks|        1105|
| Alaska|         Juneau|         530|
|Arizona|        Phoenix|        8625|
|Arizona|         Tucson|        5289|
|Arizona|     Scottsdale|        1559|
|Arizona|           Mesa|        1551|
|Arizona|       Glendale|        1077|
|Arizona|        Gilbert|         561|
|Arizona|       Chandler|         537|
|Arizona|Apache Junction|         527|
|Arizona|         Peoria|         516|
|Arizona|       Prescott|         512|
+-------+---------------+------------+
only showing top 20 rows

