## Problem 07

Weightage: 25

Get the station name, latitude, longitude, and number of bike rides started from each station (identified by **startstationid**) for each day in the data set.

## Data Description
All of the citibike trip data is available under **/public/citibike/trips**. It contains multiple folders, one for each month. Here is the schema.

```
root
 |-- tripduration: integer (nullable = true)
 |-- starttime: timestamp (nullable = true)
 |-- stoptime: timestamp (nullable = true)
 |-- startstationid: string (nullable = true)
 |-- endstationid: string (nullable = true)
 |-- bikeid: integer (nullable = true)
 |-- usertype: string (nullable = true)
 |-- birthyear: string (nullable = true)
 |-- gender: integer (nullable = true)
 |-- month: integer (nullable = true)
```
All of the citibike station data is available under **/public/citibike/stations**. Here is the schema.
```
root
 |-- stationid: long (nullable = true)
 |-- stationlatitude: string (nullable = true)
 |-- stationlongitude: string (nullable = true)
 |-- stationname: string (nullable = true)
```

## Output Requirements
* Place the result in the following HDFS directory:
```
/user/`whoami`/mock_test_02/problem07/solution
```
* Use CSV and save the output to exactly 2 files. Make sure to preserve the header.
* Here are the column names. Data types should be as below.
```
 |-- stationname: string (nullable = true)
 |-- stationlatitude: double (nullable = true)
 |-- stationlongitude: double (nullable = true)
 |-- ridestartdate: integer (nullable = true)
 |-- ridecount: integer (nullable = true)
```
* Data should be sorted in ascending order by ridestartdate and then in descending order by ridecount.
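
The required ordering combines two sort directions. As a minimal pure-Python sketch of that rule (the sample rows and station names below are made up for illustration and do not come from the data set), sorting on the date and on the negated count gives ascending dates with descending counts inside each date:

```python
# Hypothetical sample rows: (stationname, ridestartdate, ridecount).
rows = [
    ("Station B", 20170102, 5),
    ("Station A", 20170101, 7),
    ("Station C", 20170101, 9),
    ("Station D", 20170102, 8),
]

# Ascending by ridestartdate, then descending by ridecount.
# Negating the count flips the direction for that part of the key only.
ordered = sorted(rows, key=lambda r: (r[1], -r[2]))

for row in ordered:
    print(row)
```

In PySpark the same rule is typically written as `orderBy('ridestartdate', col('ridecount').desc())`.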

## Validation

Here are the self-validation steps:
* Run the following to check the number of files.
```
hdfs dfs -ls /user/`whoami`/mock_test_02/problem07/solution
```
* Run this code to create a DataFrame named `data`.
```
import getpass
username = getpass.getuser()
data = spark.read. \
    csv(f'/user/{username}/mock_test_02/problem07/solution',
        header=True,
        inferSchema=True
       )
```
* Run `data.printSchema()` to validate the data types of the fields.
```
root
 |-- stationname: string (nullable = true)
 |-- stationlatitude: double (nullable = true)
 |-- stationlongitude: double (nullable = true)
 |-- ridestartdate: integer (nullable = true)
 |-- ridecount: integer (nullable = true)
```
* Run `data.count()` to validate the number of records. It should be **785303**.
* Run `data.orderBy(col('ridestartdate'), col('ridecount').desc()).show()` to preview the sample output. This requires `col` to be imported from `pyspark.sql.functions`.

|         stationname|  stationlatitude| stationlongitude|ridestartdate|ridecount|
|--------------------|-----------------|-----------------|-------------|---------|
|Central Park S & ...|      40.76590936|     -73.97634151|     20170101|      160|
|Centre St & Chamb...|      40.71273266|      -74.0046073|     20170101|      126|
|  Broadway & W 60 St|      40.76915505|     -73.98191841|     20170101|      114|
|  Broadway & E 14 St|           40.734|          -73.992|     20170101|      110|
|Central Park West...|40.77579376683666|-73.9762057363987|     20170101|      103|
|West St & Chamber...|      40.71754834|     -74.01322069|     20170101|      101|
|Central Park Nort...|        40.799484|       -73.955613|     20170101|       98|
|  Carmine St & 6 Ave|      40.73038599|     -74.00214988|     20170101|       97|
|Allen St & Stanto...|        40.722055|       -73.989111|     20170101|       96|
|     9 Ave & W 22 St|       40.7454973|     -74.00197139|     20170101|       96|
|     5 Ave & E 88 St|           40.782|          -73.959|     20170101|       94|
|Grand Army Plaza ...|           40.764|          -73.974|     20170101|       93|
|     5 Ave & E 73 St|      40.77282817|     -73.96685276|     20170101|       89|
|Christopher St & ...|      40.73291553|     -74.00711384|     20170101|       89|
|Central Park West...|      40.78472675|     -73.96961715|     20170101|       85|
|Grand Army Plaza ...|       40.6729679|     -73.97087984|     20170101|       82|
|Grand St & Elizab...|        40.718822|        -73.99596|     20170101|       82|
|Greenwich Ave & 8...|    40.7390169121|   -74.0026376103|     20170101|       82|
|Central Park West...|       40.7734066|     -73.97782542|     20170101|       80|
|    12 Ave & W 40 St|      40.76087502|     -74.00277668|     20170101|       80|


In [16]:
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    appName(f'Problem 07 | {username}'). \
    master('yarn'). \
    getOrCreate()

In [17]:
trips_df = spark.read.csv('/public/citibike/trips/', header=True)

In [18]:
station_df = spark.read.json('/public/citibike/stations')

In [19]:
# Cast both keys to int so the string trip key lines up with the numeric station id.
final_df = trips_df.join(
    station_df,
    on=trips_df.startstationid.cast('int') == station_df.stationid.cast('int'),
    how='inner'
)
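
The join above matches a string key (`startstationid`) against a numeric one (`stationid`) and, being an inner join, keeps only trips whose start station exists in the station data. The following pure-Python sketch illustrates both effects on hypothetical miniature data; the ids and names are invented, and Spark itself may coerce mismatched types implicitly depending on its configuration, so treat this as an analogy rather than a description of Spark internals.

```python
# Hypothetical miniature data: trip keys are strings, station ids are integers.
trip_station_ids = ["72", "79", "72", "9999"]
stations = {72: "Station A", 79: "Station B"}

# Cast the trip key to int before the lookup, and keep only trips whose
# station exists (the inner-join behaviour). Trip "9999" has no station
# record, so it is dropped from the result.
joined = [
    (sid, stations[int(sid)])
    for sid in trip_station_ids
    if int(sid) in stations
]

print(joined)
```

If unmatched trips needed to be counted as well, a left join would be the usual alternative, but the problem statement asks for station attributes, which only matched rows can supply.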

In [20]:
from pyspark.sql.functions import col, count, date_format, lit

# Count rides per station per day. startstationid stays in the grouping key
# so that distinct stations are never merged, and is dropped at the end.
output = final_df. \
    select('startstationid', 'stationname', 'stationlatitude', 'stationlongitude',
           date_format('starttime', 'yyyyMMdd').alias('ridestartdate')). \
    groupBy('startstationid', 'stationname', 'stationlatitude', 'stationlongitude', 'ridestartdate'). \
    agg(count(lit(1)).alias('ridecount')). \
    orderBy('ridestartdate', col('ridecount').desc()). \
    drop('startstationid')
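
To make the aggregation in the cell above concrete, here is a small pure-Python sketch of the same idea: derive an integer date such as `20170101` from each start time, then count rides per (station, date) pair. The sample trips and the `%Y-%m-%d %H:%M:%S` timestamp layout are assumptions for illustration only, and `ride_start_date` is a hypothetical helper, not part of the solution.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample of (startstationid, starttime) pairs.
trips = [
    ("72", "2017-01-01 00:05:12"),
    ("72", "2017-01-01 09:30:00"),
    ("79", "2017-01-01 10:00:00"),
    ("72", "2017-01-02 08:15:45"),
]


def ride_start_date(starttime: str) -> int:
    """Return the date part of a timestamp string as an integer like 20170101."""
    parsed = datetime.strptime(starttime, "%Y-%m-%d %H:%M:%S")
    return int(parsed.strftime("%Y%m%d"))


# Count rides per (station id, ride start date), mirroring groupBy + count.
ride_counts = Counter((sid, ride_start_date(ts)) for sid, ts in trips)

for (sid, day), n in sorted(ride_counts.items()):
    print(sid, day, n)
```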

In [21]:
output. \
    coalesce(2). \
    write. \
    csv(f'/user/{username}/mock_test_02/problem07/solution',
        header=True,
        mode='overwrite'
       )

In [22]:
%%sh
hdfs dfs -ls /user/${USER}/mock_test_02/problem07/solution

Found 3 items
-rw-r--r--   3 itv002461 supergroup          0 2022-06-29 15:08 /user/itv002461/mock_test_02/problem07/solution/_SUCCESS
-rw-r--r--   3 itv002461 supergroup   22404461 2022-06-29 15:08 /user/itv002461/mock_test_02/problem07/solution/part-00000-c76f73fe-d48c-4649-8b0b-f79c15554608-c000.csv
-rw-r--r--   3 itv002461 supergroup   22345192 2022-06-29 15:08 /user/itv002461/mock_test_02/problem07/solution/part-00001-c76f73fe-d48c-4649-8b0b-f79c15554608-c000.csv
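
The "exactly 2 files" requirement refers to the `part-` files; the `_SUCCESS` marker does not count. As a rough sketch, the listing text can be checked programmatically. `count_part_files` is a hypothetical helper written for this illustration, and it assumes the path is the last whitespace-separated field on each line, as in the listing above.

```python
# The listing printed above, pasted as text.
listing = """Found 3 items
-rw-r--r--   3 itv002461 supergroup          0 2022-06-29 15:08 /user/itv002461/mock_test_02/problem07/solution/_SUCCESS
-rw-r--r--   3 itv002461 supergroup   22404461 2022-06-29 15:08 /user/itv002461/mock_test_02/problem07/solution/part-00000-c76f73fe-d48c-4649-8b0b-f79c15554608-c000.csv
-rw-r--r--   3 itv002461 supergroup   22345192 2022-06-29 15:08 /user/itv002461/mock_test_02/problem07/solution/part-00001-c76f73fe-d48c-4649-8b0b-f79c15554608-c000.csv
"""


def count_part_files(ls_output: str) -> int:
    """Count lines of `hdfs dfs -ls` output whose file name starts with 'part-'."""
    total = 0
    for line in ls_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The path is the last field; keep only its final component.
        file_name = fields[-1].rsplit("/", 1)[-1]
        if file_name.startswith("part-"):
            total += 1
    return total


print(count_part_files(listing))
```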


In [23]:
import getpass
username = getpass.getuser()
data = spark.read. \
  csv(f'/user/{username}/mock_test_02/problem07/solution',
      header=True,
      inferSchema=True
     )

In [24]:
data.printSchema()

root
 |-- stationname: string (nullable = true)
 |-- stationlatitude: double (nullable = true)
 |-- stationlongitude: double (nullable = true)
 |-- ridestartdate: integer (nullable = true)
 |-- ridecount: integer (nullable = true)



In [25]:
data.count()

785303

In [26]:
data.orderBy(col('ridestartdate'), col('ridecount').desc()).show()

+--------------------+-----------------+-----------------+-------------+---------+
|         stationname|  stationlatitude| stationlongitude|ridestartdate|ridecount|
+--------------------+-----------------+-----------------+-------------+---------+
|Central Park S & ...|      40.76590936|     -73.97634151|     20170101|      160|
|Centre St & Chamb...|      40.71273266|      -74.0046073|     20170101|      126|
|  Broadway & W 60 St|      40.76915505|     -73.98191841|     20170101|      114|
|  Broadway & E 14 St|           40.734|          -73.992|     20170101|      110|
|Central Park West...|40.77579376683666|-73.9762057363987|     20170101|      103|
|West St & Chamber...|      40.71754834|     -74.01322069|     20170101|      101|
|Central Park Nort...|        40.799484|       -73.955613|     20170101|       98|
|  Carmine St & 6 Ave|      40.73038599|     -74.00214988|     20170101|       97|
|     9 Ave & W 22 St|       40.7454973|     -74.00197139|     20170101|       96|
|All