# Ex3 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ex3").getOrCreate()
spark

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

### Step 3. Assign it to a variable called users and use the 'user_id' as index

In [2]:
from pyspark import SparkFiles

In [4]:
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"

# Download the file into Spark's working directory; SparkFiles.get() below resolves its local path
spark.sparkContext.addFile(url)

users = spark.read.csv(SparkFiles.get("u.user"), header=True, inferSchema=True, sep='|')

### Step 4. See the first 25 entries

In [6]:
users.show(25)

+-------+---+------+-------------+--------+
|user_id|age|gender|   occupation|zip_code|
+-------+---+------+-------------+--------+
|      1| 24|     M|   technician|   85711|
|      2| 53|     F|        other|   94043|
|      3| 23|     M|       writer|   32067|
|      4| 24|     M|   technician|   43537|
|      5| 33|     F|        other|   15213|
|      6| 42|     M|    executive|   98101|
|      7| 57|     M|administrator|   91344|
|      8| 36|     M|administrator|   05201|
|      9| 29|     M|      student|   01002|
|     10| 53|     M|       lawyer|   90703|
|     11| 39|     F|        other|   30329|
|     12| 28|     F|        other|   06405|
|     13| 47|     M|     educator|   29206|
|     14| 45|     M|    scientist|   55106|
|     15| 49|     F|     educator|   97301|
|     16| 21|     M|entertainment|   10309|
|     17| 30|     M|   programmer|   06355|
|     18| 35|     F|        other|   37212|
|     19| 40|     M|    librarian|   02138|
|     20| 42|     F|    homemake

### Step 5. See the last 10 entries

In [7]:
users.tail(10)

[Row(user_id=934, age=61, gender='M', occupation='engineer', zip_code='22902'),
 Row(user_id=935, age=42, gender='M', occupation='doctor', zip_code='66221'),
 Row(user_id=936, age=24, gender='M', occupation='other', zip_code='32789'),
 Row(user_id=937, age=48, gender='M', occupation='educator', zip_code='98072'),
 Row(user_id=938, age=38, gender='F', occupation='technician', zip_code='55038'),
 Row(user_id=939, age=26, gender='F', occupation='student', zip_code='33319'),
 Row(user_id=940, age=32, gender='M', occupation='administrator', zip_code='02215'),
 Row(user_id=941, age=20, gender='M', occupation='student', zip_code='97229'),
 Row(user_id=942, age=48, gender='F', occupation='librarian', zip_code='78209'),
 Row(user_id=943, age=22, gender='M', occupation='student', zip_code='77841')]

### Step 6. What is the number of observations in the dataset?

In [8]:
print(users.count())

943


### Step 7. What is the number of columns in the dataset?

In [9]:
print(len(users.columns))

5


### Step 8. Print the name of all the columns.

In [10]:
users.columns

['user_id', 'age', 'gender', 'occupation', 'zip_code']

### Step 9. How is the dataset indexed?

In [11]:
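# Spark DataFrames have no row index (unlike pandas): rows carry no positional labels,
# so user_id is just a regular column that can serve as the lookup key.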

### Step 10. What is the data type of each column?

In [12]:
users.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- zip_code: string (nullable = true)



### Step 11. Print only the occupation column

In [13]:
users.select(users["occupation"]).show(5)

+----------+
|occupation|
+----------+
|technician|
|     other|
|    writer|
|technician|
|     other|
+----------+
only showing top 5 rows



### Step 12. How many different occupations are in this dataset?

In [14]:
from pyspark.sql.functions import countDistinct

In [20]:
users.select(countDistinct(users["occupation"])).head(1)[0][0]

21

### Step 13. What is the most frequent occupation?

In [21]:
users.show(5)

+-------+---+------+----------+--------+
|user_id|age|gender|occupation|zip_code|
+-------+---+------+----------+--------+
|      1| 24|     M|technician|   85711|
|      2| 53|     F|     other|   94043|
|      3| 23|     M|    writer|   32067|
|      4| 24|     M|technician|   43537|
|      5| 33|     F|     other|   15213|
+-------+---+------+----------+--------+
only showing top 5 rows



In [27]:
frequent_occupation = users.groupBy(users["occupation"]).count()

In [33]:
# Sort once and reuse the top row instead of running the same job twice
top_occupation = frequent_occupation.orderBy("count", ascending=False).first()
print("Occupation:", top_occupation["occupation"])
print("Count:", top_occupation["count"])

Occupation: student
Count: 196


### Step 14. Summarize the DataFrame.

In [35]:
users.describe().show()

+-------+-----------------+-----------------+------+-------------+------------------+
|summary|          user_id|              age|gender|   occupation|          zip_code|
+-------+-----------------+-----------------+------+-------------+------------------+
|  count|              943|              943|   943|          943|               943|
|   mean|            472.0|34.05196182396607|  null|         null| 50868.78810810811|
| stddev|272.3649512449549|12.19273973305903|  null|         null|30891.373254138176|
|    min|                1|                7|     F|administrator|             00000|
|    max|              943|               73|     M|       writer|             Y1A6B|
+-------+-----------------+-----------------+------+-------------+------------------+



### Step 15. Summarize all the columns

In [36]:
users.describe().show()

+-------+-----------------+-----------------+------+-------------+------------------+
|summary|          user_id|              age|gender|   occupation|          zip_code|
+-------+-----------------+-----------------+------+-------------+------------------+
|  count|              943|              943|   943|          943|               943|
|   mean|            472.0|34.05196182396607|  null|         null| 50868.78810810811|
| stddev|272.3649512449549|12.19273973305903|  null|         null|30891.373254138176|
|    min|                1|                7|     F|administrator|             00000|
|    max|              943|               73|     M|       writer|             Y1A6B|
+-------+-----------------+-----------------+------+-------------+------------------+



### Step 16. Summarize only the occupation column

In [38]:
users.describe(["occupation"]).show()

+-------+-------------+
|summary|   occupation|
+-------+-------------+
|  count|          943|
|   mean|         null|
| stddev|         null|
|    min|administrator|
|    max|       writer|
+-------+-------------+



### Step 17. What is the mean age of users?

In [42]:
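# describe() yields the rows count, mean, stddev, min, max; [1][1] picks the mean, which comes back as a string
users.describe(['age']).head(2)[1][1]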

'34.05196182396607'

### Step 18. What is the age with least occurrence?

In [44]:
age_count = users.groupBy(users['age']).count()

In [47]:
age_count.orderBy("count").show()

+---+-----+
|age|count|
+---+-----+
| 73|    1|
| 10|    1|
| 11|    1|
|  7|    1|
| 66|    1|
| 69|    2|
| 62|    2|
| 64|    2|
| 68|    2|
| 14|    3|
| 58|    3|
| 61|    3|
| 65|    3|
| 63|    3|
| 70|    3|
| 59|    3|
| 54|    4|
| 13|    5|
| 16|    5|
| 15|    6|
+---+-----+
only showing top 20 rows

