## Problem 05

Weightage: 15

Get the states with the top 10 member counts. Several states can share a rank when their counts are equal, so you need to return every state whose count falls within the top 10 ranks. The result can therefore contain more than 10 states.
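
The tie rule can be illustrated without Spark. The following is a minimal sketch in plain Python, not the intended solution: the function name `top_states_with_ties` and the toy data are hypothetical, and it assumes "top 10" means the 10 highest distinct counts, so tied states share a rank.

```python
from collections import Counter

def top_states_with_ties(states, n=10):
    """Return (state, count) pairs whose count is among the n highest distinct counts."""
    counts = Counter(states)
    # The n highest distinct count values; tied states share a value, so they share a rank.
    top_values = sorted(set(counts.values()), reverse=True)[:n]
    if not top_values:
        return []
    cutoff = top_values[-1]
    # Keep every state at or above the cutoff, highest count first.
    kept = [(s, c) for s, c in counts.items() if c >= cutoff]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical toy data: "B" and "C" tie for second place.
sample = ["A"] * 3 + ["B"] * 2 + ["C"] * 2 + ["D"]
result = top_states_with_ties(sample, n=2)
print(result)  # [('A', 3), ('B', 2), ('C', 2)]
```

Asking for the top 2 returns three states here because the tie at second place keeps both "B" and "C".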

## Data Description

All of the address data is available under **/public/addresses**. Here is the schema.
```
root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- postal_code: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- street: string (nullable = true)
 |-- email: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: long (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_numbers: array (nullable = true)
 |    |-- element: string (containsNull = true)
```
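
The state sits inside the `address` struct, so it has to be reached through its parent field. As a rough illustration, the sketch below parses a single record with the standard `json` module. It assumes each input line is one JSON object, and the record itself is made up to match the schema, not taken from the data set.

```python
import json

# Hypothetical record shaped like the schema above; all values are invented.
line = (
    '{"address": {"city": "Austin", "postal_code": "78701", '
    '"state": "Texas", "street": "1 Main St"}, '
    '"id": 1, "phone_numbers": ["555-0100", null]}'
)

record = json.loads(line)
# Every field is nullable, so guard against a missing or null address.
address = record.get("address") or {}
state = address.get("state")
print(state)  # Texas
```

In Spark, the same nested field is addressed as `address.state`.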

## Output Requirements
* Place the result in the following HDFS directory.
```
/user/`whoami`/mock_test_02/problem05/solution
```
* Use CSV and save the output to exactly one file. Make sure to preserve the header.
* Here are the column names. Data types should be the same as in the input data.
```
 |-- state: string
 |-- member_count: long
```
* Data should be sorted in descending order by `member_count`.

## Validation

Here are the self-validation steps:
* Run the following to check the number of files. There should be exactly one data file.
```
hdfs dfs -ls /user/`whoami`/mock_test_02/problem05/solution
```
* Run the following to validate the data. It should show 11 or more records, including the header. Compare the result with the expected output.
```
hdfs dfs -cat /user/`whoami`/mock_test_02/problem05/solution/part*
```
* Expected output
```
state,member_count
California,109817
Texas,109346
Florida,82625
New York,55343
Ohio,36316
Virginia,35214
District of Columbia,32289
Pennsylvania,32226
Georgia,28814
Illinois,24943
```
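
The checks above can also be scripted. This is a minimal sketch, assuming the file content has already been captured as text (for example from the `hdfs dfs -cat` command); `check_output` is a hypothetical helper, not part of the exercise.

```python
import csv
import io

def check_output(csv_text, min_rows=10):
    """Return a list of problems found in the CSV text; an empty list means it looks fine."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    if not rows or rows[0] != ["state", "member_count"]:
        return ["header must be state,member_count"]
    problems = []
    data = rows[1:]
    if len(data) < min_rows:
        problems.append(f"expected at least {min_rows} data rows, got {len(data)}")
    counts = [int(r[1]) for r in data]
    if counts != sorted(counts, reverse=True):
        problems.append("member_count is not in descending order")
    return problems

# The expected output listed above.
expected = """state,member_count
California,109817
Texas,109346
Florida,82625
New York,55343
Ohio,36316
Virginia,35214
District of Columbia,32289
Pennsylvania,32226
Georgia,28814
Illinois,24943
"""
print(check_output(expected))  # []
```

The helper checks only the header, the row count, and the sort order. It does not confirm that the counts themselves are correct.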


In [8]:
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    appName(f'Problem 05 | {username}'). \
    master('yarn'). \
    getOrCreate()

In [9]:
# Read the address records as JSON
read_df = spark.read.json('/public/addresses')

In [17]:
from pyspark.sql.functions import dense_rank, col, count, lit
from pyspark.sql.window import Window

In [29]:
# Pull the nested state out of the address struct, then count members per state
working_df = read_df. \
    select(col('address').state.alias('state')). \
    groupBy('state'). \
    agg(count(lit(1)).alias('member_count'))

In [30]:
# Rank over the whole result (no partitionBy); acceptable for one row per state
spec = Window.orderBy(col('member_count').desc())

In [46]:
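# Keep every state within the top 10 ranks (tied counts share a rank),
# then sort last so the written file is in descending member_count order
output = working_df. \
    withColumn('dense_rank', dense_rank().over(spec)). \
    filter(col('dense_rank') <= 10). \
    drop('dense_rank'). \
    orderBy(col('member_count').desc())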

In [47]:
output. \
    coalesce(1). \
    write. \
    csv(f'/user/{username}/mock_test_02/problem05/solution',
        header=True,
        mode='overwrite'
       )

In [48]:
%%sh
hdfs dfs -ls /user/${USER}/mock_test_02/problem05/solution

Found 2 items
-rw-r--r--   3 itv002461 supergroup          0 2022-06-29 03:11 /user/itv002461/mock_test_02/problem05/solution/_SUCCESS
-rw-r--r--   3 itv002461 supergroup        180 2022-06-29 03:11 /user/itv002461/mock_test_02/problem05/solution/part-00000-8550965c-45f0-437c-9139-b371bf6acb4b-c000.csv


In [49]:
%%sh
hdfs dfs -cat /user/${USER}/mock_test_02/problem05/solution/part*

state,member_count
California,109817
Texas,109346
Florida,82625
New York,55343
Ohio,36316
Virginia,35214
District of Columbia,32289
Pennsylvania,32226
Georgia,28814
Illinois,24943
