### Step 1: Create the DataFrame from the userdetails.json file

In [2]:
df = spark.read.json("/user/datacouch21/data/userdetails.json")
df.show(truncate=False)

+------------------------+----------------------------+----------+------+---+---------------+---------+---------------+----------+
|city                    |email                       |first_name|gender|id |ip_address     |last_name|race           |timestamp |
+------------------------+----------------------------+----------+------+---+---------------+---------+---------------+----------+
|Miami                   |wbell0@tumblr.com           |Wayne     |Male  |1  |77.152.229.131 |Bell     |Costa Rican    |1463468947|
|Xiejia                  |aallen1@state.gov           |Anthony   |Male  |2  |62.49.195.170  |Allen    |Comanche       |1447538277|
|Illintsi                |ehenderson2@theguardian.com |Eric      |Male  |3  |136.129.71.231 |Henderson|Comanche       |1476462811|
|Tugusirna               |jsmith3@spiegel.de          |Jimmy     |Male  |4  |117.176.172.187|Smith    |Native Hawaiian|1445804900|
|Nasīrābād               |dwheeler4@vkontakte.ru      |Diana     |Female|5  |119.22

### Step 2: Print the schema in tree format

In [3]:
df.printSchema()

root
 |-- city: string (nullable = true)
 |-- email: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: long (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- race: string (nullable = true)
 |-- timestamp: string (nullable = true)
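
The schema reports `timestamp` as a string, while the values shown in Step 1 look like Unix epoch seconds. That reading is an assumption, since the source does not document the field. A minimal standard-library sketch for sanity-checking a single value (the helper name `epoch_to_utc` is hypothetical):

```python
from datetime import datetime, timezone


def epoch_to_utc(raw: str) -> datetime:
    """Interpret a string of epoch seconds as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(raw), tz=timezone.utc)


# First value in the timestamp column of the Step 1 output.
first_ts = epoch_to_utc("1463468947")
print(first_ts.isoformat())  # 2016-05-17T07:09:07+00:00
```

Inside Spark, `pyspark.sql.functions.from_unixtime` applied to the column cast to `long` performs a comparable conversion column-wise, formatting the result in the session time zone.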



### Step 3: Select only the first_name column.

In [4]:
df.select("first_name").show()

+----------+
|first_name|
+----------+
|     Wayne|
|   Anthony|
|      Eric|
|     Jimmy|
|     Diana|
|     Karen|
|    Philip|
|    Ashley|
|     Norma|
|     Helen|
|     Larry|
|     Emily|
|   Patrick|
|    Eugene|
|   Phyllis|
|       Roy|
|    Albert|
|   Antonio|
|    Robert|
|   Dorothy|
+----------+
only showing top 20 rows



### Step 4: Count the number of people by gender.

In [5]:
df.groupBy("gender").count().show()

+------+-----+
|gender|count|
+------+-----+
|Female|  507|
|  Male|  493|
+------+-----+



### Step 5: Select the people whose id is greater than 30 and write them to CSV

In [14]:
details = df.filter(df['id'] > 30)

In [15]:
details.show()

+--------------------+--------------------+----------+------+---+---------------+---------+--------------------+----------+
|                city|               email|first_name|gender| id|     ip_address|last_name|                race| timestamp|
+--------------------+--------------------+----------+------+---+---------------+---------+--------------------+----------+
|           Sembakung|wleeu@chronoengin...|     Wanda|Female| 31|  155.210.60.27|      Lee|              Korean|1459621486|
|        Karanggeneng|   brayv@example.com|    Bonnie|Female| 32|195.232.235.145|      Ray|            Delaware|1436824502|
|              Baiyan|aaustinw@chicagot...|    Angela|Female| 33|142.235.215.210|   Austin|           Cambodian|1474366455|
|              Plavsk|    ageorgex@hhs.gov|       Amy|Female| 34|  60.112.106.36|   George|            Seminole|1446059690|
|            Plumtree|    ideany@salon.com|     Irene|Female| 35|  150.69.30.179|     Dean|              Pueblo|1458412193|
|       

In [16]:
details.write.mode("overwrite").csv('/user/datacouch21/persons_details')

In [17]:
!hdfs dfs -ls /user/datacouch21/persons_details

Found 2 items
-rw-r--r--   1 root hadoop          0 2020-04-26 08:33 /user/datacouch21/persons_details/_SUCCESS
-rw-r--r--   1 root hadoop      88888 2020-04-26 08:33 /user/datacouch21/persons_details/part-00000-5d9f30f6-dd0f-4c63-a0f8-97f4f3de2ac7-c000.csv


In [18]:
!hdfs dfs -cat /user/datacouch21/persons_details/part-* | head

Sembakung,wleeu@chronoengine.com,Wanda,Female,31,155.210.60.27,Lee,Korean,1459621486
Karanggeneng,brayv@example.com,Bonnie,Female,32,195.232.235.145,Ray,Delaware,1436824502
Baiyan,aaustinw@chicagotribune.com,Angela,Female,33,142.235.215.210,Austin,Cambodian,1474366455
Plavsk,ageorgex@hhs.gov,Amy,Female,34,60.112.106.36,George,Seminole,1446059690
Plumtree,ideany@salon.com,Irene,Female,35,150.69.30.179,Dean,Pueblo,1458412193
Barinitas,klynchz@4shared.com,Kimberly,Female,36,47.198.228.160,Lynch,Native Hawaiian,1477826863
Uthai,jmedina10@icio.us,Jane,Female,37,101.35.64.93,Medina,Native Hawaiian and Other Pacific Islander (NHPI),1445940255
Suchań,mjenkins11@last.fm,Matthew,Male,38,4.243.194.186,Jenkins,Paiute,1478116071
Triolet,pbutler12@blogtalkradio.com,Pamela,Female,39,73.90.6.34,Butler,Korean,1448070262
Mesa,hburton13@bandcamp.com,Harry,Male,40,43.210.10.124,Burton,Yuman,1452745031
cat: Unable to write to output stream.
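
The part file above contains data rows only, because the write did not request a header (`DataFrameWriter.csv` accepts `header=True` for that). Anyone reading the file outside Spark therefore has to supply the column names. A minimal standard-library sketch, assuming the column order matches the schema printed in Step 2; the two sample lines are copied from the `hdfs dfs -cat` output above, and the names `COLUMNS` and `records` are hypothetical:

```python
import csv
import io

# Column order taken from the Step 2 schema; the part file has no header row.
COLUMNS = [
    "city", "email", "first_name", "gender", "id",
    "ip_address", "last_name", "race", "timestamp",
]

# Two lines copied from the part file output above.
sample = (
    "Sembakung,wleeu@chronoengine.com,Wanda,Female,31,155.210.60.27,Lee,Korean,1459621486\n"
    "Karanggeneng,brayv@example.com,Bonnie,Female,32,195.232.235.145,Ray,Delaware,1436824502\n"
)

# DictReader maps each headerless row onto the supplied field names.
reader = csv.DictReader(io.StringIO(sample), fieldnames=COLUMNS)
records = [dict(row) for row in reader]

for record in records:
    print(record["id"], record["first_name"], record["city"])
```

All values come back as strings, so numeric fields such as `id` need an explicit `int(...)` before comparison.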

