## Problem 06

Weightage: 25

Get the cities with the ten highest female member counts in each state. Cities with the same count share the same rank, so a state can return more than ten cities. Include every city whose count is among the ten highest counts in its state.
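
To make the tie rule concrete, here is a minimal pure-Python sketch of dense ranking within a single state. The city names and counts are made up for illustration, and `top_n_dense` is a hypothetical helper, not part of the problem or of Spark. It keeps every city whose count is among the `n` highest distinct counts, so ties can push the result past `n` rows.

```python
def top_n_dense(city_counts, n):
    """Return (city, count) pairs whose count is among the n highest distinct counts."""
    # Distinct counts, highest first; an index in this list is the dense rank minus 1.
    distinct = sorted(set(city_counts.values()), reverse=True)
    keep = set(distinct[:n])
    # Sort by count descending, then by city name for a stable, readable order.
    return sorted(
        ((city, cnt) for city, cnt in city_counts.items() if cnt in keep),
        key=lambda pair: (-pair[1], pair[0]),
    )

# Made-up counts for a single state (illustrative only).
sample = {"Alpha": 5, "Bravo": 5, "Charlie": 3, "Delta": 2}
top_two = top_n_dense(sample, 2)
print(top_two)
```

With `n=2`, three cities come back because the two cities tied at the highest count share rank 1.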

## Data Description

All of the address data is available under **/public/addresses**. Here is the schema.
```
root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- postal_code: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- street: string (nullable = true)
 |-- email: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: long (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_numbers: array (nullable = true)
 |    |-- element: string (containsNull = true)
```

## Output Requirements
* Place the result in the following HDFS directory:
```
/user/`whoami`/mock_test_02/problem06/solution
```
* Use CSV and save the output to exactly one file. Make sure to preserve the header.
* Here are the column names. Data types should be the same as in the input data.
```
 |-- state: string
 |-- city: string
 |-- female_count: long
```
* Data should be sorted in ascending order by `state` and then in descending order by `female_count`.
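
The required ordering can be expressed as a composite sort key: state ascending, then count descending. Here is a minimal sketch in plain Python, using rows whose counts are made up purely for illustration:

```python
# (state, city, female_count) rows with made-up counts.
rows = [
    ("Texas", "Austin", 4),
    ("Alabama", "Mobile", 2),
    ("Texas", "Dallas", 9),
    ("Alabama", "Birmingham", 7),
]
# Negating the count sorts it in descending order within each state.
ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
for row in ordered:
    print(row)
```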

## Validation

Here are the self-validation steps:
* Run the following to check the number of files. There should be exactly one part file.
```
hdfs dfs -ls /user/`whoami`/mock_test_02/problem06/solution
```
* Run the following to validate the data. Review the output to confirm it is sorted in ascending order by `state` and then in descending order by `female_count`.
```
hdfs dfs -cat /user/`whoami`/mock_test_02/problem06/solution/part*
```
* Run this command to get the count including header. Result should be 320.
```
hdfs dfs -cat /user/`whoami`/mock_test_02/problem06/solution/part*|wc -l
```
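
The manual review of the sort order can also be automated. The following is a rough sketch, assuming the part file's contents have already been read into a Python string (for example, captured from `hdfs dfs -cat`). `is_sorted_output` is a hypothetical helper, and the sample text is made up.

```python
import csv
import io

def is_sorted_output(csv_text):
    """Check the header and the (state asc, female_count desc) ordering."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    if header != ["state", "city", "female_count"]:
        return False
    # Negate the count so a plain ascending sort matches the required order.
    keys = [(state, -int(cnt)) for state, _city, cnt in reader]
    return keys == sorted(keys)

# Made-up sample data: one correctly ordered file and one that is not.
good = "state,city,female_count\nAlabama,Birmingham,7\nAlabama,Mobile,2\nTexas,Dallas,9\n"
bad = "state,city,female_count\nAlabama,Mobile,2\nAlabama,Birmingham,7\n"
print(is_sorted_output(good), is_sorted_output(bad))
```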


In [2]:
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    appName(f'Problem 06 | {username}'). \
    master('yarn'). \
    getOrCreate()

In [3]:
read_df=spark.read.json('/public/addresses')

In [6]:
from pyspark.sql.functions import col,lit,count,dense_rank
from pyspark.sql.window import Window
# Rank cities within each state by female_count, highest first (state is already the partition key).
spec=Window. \
        partitionBy('state'). \
        orderBy(col('female_count').desc())

In [7]:
# Count female members per (state, city); sorting is deferred to the final output.
df=read_df. \
    filter(col('gender')=='Female'). \
    select(col('address').state.alias('state'),col('address').city.alias('city')). \
    groupBy('state','city'). \
    agg(count(lit(1)).alias('female_count'))

In [21]:
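# Keep cities in the top ten dense ranks per state, then sort last so the written file keeps the order.
output=df. \
        withColumn('dense_rank',dense_rank().over(spec)). \
        filter(col('dense_rank')<=10). \
        drop('dense_rank'). \
        orderBy('state',col('female_count').desc())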

In [23]:
output. \
    coalesce(1). \
    write. \
    csv(f'/user/{username}/mock_test_02/problem06/solution',
        header=True,
        mode='overwrite'
       )

In [24]:
%%sh
hdfs dfs -ls /user/${USER}/mock_test_02/problem06/solution

Found 2 items
-rw-r--r--   3 itv002461 supergroup          0 2022-06-29 05:17 /user/itv002461/mock_test_02/problem06/solution/_SUCCESS
-rw-r--r--   3 itv002461 supergroup       7643 2022-06-29 05:17 /user/itv002461/mock_test_02/problem06/solution/part-00000-ec49dc85-dd9c-4021-8fe7-3228952d6b78-c000.csv


In [None]:
%%sh
hdfs dfs -cat /user/`whoami`/mock_test_02/problem06/solution/part*

In [26]:
%%sh
hdfs dfs -cat /user/`whoami`/mock_test_02/problem06/solution/part*|wc -l

320


In [35]:
# df. \
#     select('state').\
#     groupBy('state'). \
#     agg(count(lit(1)).alias('city_count')). \
#     orderBy('state')


state,city_count
Alabama,7
Alaska,3
Arizona,11
Arkansas,4
California,65
Colorado,11
Connecticut,9
Delaware,2
District of Columbia,1
Florida,48
