# K-Means Quiz
Use this Jupyter notebook to find the answers to the quiz in the previous section. There is an answer key in the next part of the lesson.

Let's look at the distribution of the Title+Body length feature we used before. Instead of using the raw number of words, we can create categories based on this length: short, longer, ..., super long.

In the questions below, I'll refer to the length of the combined Title and Body fields as the Description length. Here, length means the number of words produced when the text is tokenized with pattern="\W".
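
As a rough illustration of what this word count means, the sketch below mimics the tokenizer in plain Python. It assumes the tokenizer lowercases the text, splits on every non-word character, and drops empty tokens, which is my understanding of RegexTokenizer's defaults. `description_length` is a hypothetical helper, not part of the notebook, and Python's `\W` only approximates the Java regex that Spark uses.

```python
import re

def description_length(title, body):
    """Count word tokens in the combined title and body (hypothetical helper)."""
    desc = f"{title} {body}"
    # Split on every non-word character; re.ASCII approximates Java's default \W.
    tokens = re.split(r"\W", desc.lower(), flags=re.ASCII)
    # Consecutive separators produce empty strings, which do not count as words.
    return len([t for t in tokens if t])

print(description_length("How do I replace special characters in a URL?", "<p>Short body</p>"))
```

Note that HTML tag names such as `p` or `pre` survive the split and are counted as words, so the Description length includes markup from the Body field.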

In [1]:
from pyspark.sql import SparkSession

# TODOS:
# 1) import any other libraries you might need
# 2) run the cells below to read the dataset and extract description length features
# 3) write code to answer the quiz questions

# Note: importing max and min from pyspark.sql.functions shadows Python's built-ins in this notebook.
from pyspark.sql.functions import col, concat, lit, max, min, avg, stddev, count, udf
from pyspark.ml.feature import RegexTokenizer, VectorAssembler
from pyspark.sql.types import IntegerType
from pyspark.ml.clustering import KMeans

In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Creating Features") \
    .getOrCreate()


### Read Dataset

In [3]:
stack_overflow_data = 'Train_onetag_small.json'

In [4]:
df = spark.read.json(stack_overflow_data)
df.persist()

DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]

### Build Description Length Features

In [5]:
df = df.withColumn("Desc", concat(col("Title"), lit(' '), col("Body")))

In [6]:
regexTokenizer = RegexTokenizer(inputCol="Desc", outputCol="words", pattern="\\W")
df = regexTokenizer.transform(df)

In [7]:
body_length = udf(lambda x: len(x), IntegerType())
df = df.withColumn("DescLength", body_length(df.words))

In [8]:
assembler = VectorAssembler(inputCols=["DescLength"], outputCol="DescVec")
df = assembler.transform(df)

In [9]:
number_of_tags = udf(lambda x: len(x.split(" ")), IntegerType())
df = df.withColumn("NumTags", number_of_tags(df.Tags))

In [10]:
df.select("*").limit(5).toPandas()


| | Body | Id | Tags | Title | oneTag | Desc | words | DescLength | DescVec | NumTags |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `<p>I'd like to check if an uploaded file is an...` | 1 | php image-processing file-upload upload mime-t... | How to check if an uploaded file is an image w... | php | How to check if an uploaded file is an image w... | [how, to, check, if, an, uploaded, file, is, a... | 96 | [96.0] | 5 |
| 1 | `<p>In my favorite editor (vim), I regularly us...` | 2 | firefox | How can I prevent firefox from closing when I ... | firefox | How can I prevent firefox from closing when I ... | [how, can, i, prevent, firefox, from, closing,... | 83 | [83.0] | 1 |
| 2 | `<p>I am import matlab file and construct a dat...` | 3 | r matlab machine-learning | R Error Invalid type (list) for variable | r | `R Error Invalid type (list) for variable <p>I ...` | [r, error, invalid, type, list, for, variable,... | 3168 | [3168.0] | 3 |
| 3 | `<p>This is probably very simple, but I simply ...` | 4 | c# url encoding | How do I replace special characters in a URL? | c# | How do I replace special characters in a URL? ... | [how, do, i, replace, special, characters, in,... | 124 | [124.0] | 3 |
| 4 | `<pre><code>function modify(.......)\n{\n $mco...` | 5 | php api file-get-contents | How to modify whois contact details? | php | `How to modify whois contact details? <pre><cod...` | [how, to, modify, whois, contact, details, pre... | 154 | [154.0] | 3 |


# Question 1
How many times greater is the Description Length of the longest question than the Description Length of the shortest question (rounded to the nearest whole number)?

Tip: Don't forget to import Spark SQL's aggregate functions that can operate on DataFrame columns.

In [11]:
# TODO: write your code to answer this question
df.agg(max("DescLength"), min("DescLength")).show()

+---------------+---------------+
|max(DescLength)|min(DescLength)|
+---------------+---------------+
|           7532|             10|
+---------------+---------------+

In [12]:
7532 / 10

753.2
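
The question asks for a whole number, so the ratio still needs rounding. A minimal sketch that does both steps, using the maximum and minimum from the aggregate output above (`length_ratio` is a hypothetical helper name):

```python
def length_ratio(longest, shortest):
    """Return how many times longer the longest description is, rounded to a whole number."""
    return round(longest / shortest)

# Values taken from the aggregate output above.
print(length_ratio(7532, 10))
```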

# Question 2
What are the mean and standard deviation of the Description length?

In [13]:
# TODO: write your code to answer this question
df.agg(avg("DescLength"), stddev("DescLength")).show()

+---------------+-----------------------+
|avg(DescLength)|stddev_samp(DescLength)|
+---------------+-----------------------+
|      180.28187|     192.10819533505023|
+---------------+-----------------------+
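
The `stddev_samp` column name shows that Spark's `stddev` computes the sample standard deviation, which divides by n - 1 rather than n. As a small plain-Python check of what that means, this sketch uses only the five DescLength values from the preview rows shown earlier, not the full dataset, so its numbers will not match the table above.

```python
import statistics

# DescLength values of the five preview rows (not the full dataset).
sample_lengths = [96, 83, 3168, 124, 154]

mean_length = statistics.mean(sample_lengths)
# statistics.stdev divides by n - 1, like Spark's stddev_samp.
sample_std = statistics.stdev(sample_lengths)
# statistics.pstdev divides by n, like Spark's stddev_pop.
population_std = statistics.pstdev(sample_lengths)

print(mean_length, sample_std, population_std)
```

With only five rows the two estimates differ noticeably; on a dataset of this notebook's size the difference is negligible.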



                                                                                

# Question 3
Let's use K-means to create 5 clusters of Description lengths. Set the random seed to 42 and fit a 5-cluster K-means model on the Description length column (you can use KMeans().setParams(...)). What length is the center of the cluster representing the longest questions?
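
To build intuition for what the model does on a single numeric feature, here is a minimal plain-Python sketch of the standard K-means loop (Lloyd's algorithm) in one dimension. It is not Spark's implementation, which by default uses a different initialization (k-means||) and runs distributed. `kmeans_1d` and the toy data are hypothetical. In the notebook itself, the fitted model's `clusterCenters()` method returns the centers directly.

```python
import builtins  # the notebook imports Spark's min/max, which shadow the built-ins
import random

def kmeans_1d(values, k, seed=42, max_iter=100):
    """Cluster 1-D values with Lloyd's algorithm; returns (centers, clusters)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centers = rng.sample(sorted(set(values)), k)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each value joins the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = builtins.min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its members.
        new_centers = [
            sum(c) / len(c) if c else centers[i]  # keep the old center if a cluster is empty
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: the centers stopped moving
        centers = new_centers
    return centers, clusters

# Toy description lengths (hypothetical): three short and three long questions.
toy_lengths = [1, 2, 3, 100, 101, 102]
centers, clusters = kmeans_1d(toy_lengths, k=2)
print(sorted(centers))
```

The seed only controls the random starting centers. On well-separated data like this toy list the result does not depend on it, but on real data different seeds can give different clusters, which is why the question fixes the seed to 42.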

In [14]:
# TODO: write your code to answer this question
kmeans = KMeans(seed=42, k=5, featuresCol="DescVec", predictionCol="DescGroup")
kmeans_model = kmeans.fit(df)
df = kmeans_model.transform(df)

In [15]:
df.select("*").limit(5).toPandas()


| | Body | Id | Tags | Title | oneTag | Desc | words | DescLength | DescVec | NumTags | DescGroup |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `<p>I'd like to check if an uploaded file is an...` | 1 | php image-processing file-upload upload mime-t... | How to check if an uploaded file is an image w... | php | How to check if an uploaded file is an image w... | [how, to, check, if, an, uploaded, file, is, a... | 96 | [96.0] | 5 | 0 |
| 1 | `<p>In my favorite editor (vim), I regularly us...` | 2 | firefox | How can I prevent firefox from closing when I ... | firefox | How can I prevent firefox from closing when I ... | [how, can, i, prevent, firefox, from, closing,... | 83 | [83.0] | 1 | 0 |
| 2 | `<p>I am import matlab file and construct a dat...` | 3 | r matlab machine-learning | R Error Invalid type (list) for variable | r | `R Error Invalid type (list) for variable <p>I ...` | [r, error, invalid, type, list, for, variable,... | 3168 | [3168.0] | 3 | 2 |
| 3 | `<p>This is probably very simple, but I simply ...` | 4 | c# url encoding | How do I replace special characters in a URL? | c# | How do I replace special characters in a URL? ... | [how, do, i, replace, special, characters, in,... | 124 | [124.0] | 3 | 0 |
| 4 | `<pre><code>function modify(.......)\n{\n $mco...` | 5 | php api file-get-contents | How to modify whois contact details? | php | `How to modify whois contact details? <pre><cod...` | [how, to, modify, whois, contact, details, pre... | 154 | [154.0] | 3 | 0 |


In [16]:
df.groupBy("DescGroup").agg(avg("DescLength"), avg("NumTags"), count("DescLength")).orderBy("avg(DescLength)").show()

+---------+------------------+------------------+-----------------+
|DescGroup|   avg(DescLength)|      avg(NumTags)|count(DescLength)|
+---------+------------------+------------------+-----------------+
|        0| 96.02297592997812|2.7428884026258205|            63066|
|        4|238.22969197457567|3.0864357058042886|            28634|
|        1|    492.6833982403|3.2330881292369824|             6933|
|        3|1062.4118629908103|3.2957393483709274|             1197|
|        2|2726.1882352941175|3.4235294117647057|              170|
+---------+------------------+------------------+-----------------+
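
Note that the cluster IDs are arbitrary: cluster 2 holds the longest questions and cluster 0 the shortest. To turn the clusters into the ordered length categories described at the top of the notebook, one option is to rank the clusters by their average length. This sketch uses the averages from the table above, rounded to two decimals. The five label names are hypothetical.

```python
# Average DescLength per cluster, copied from the table above (rounded).
avg_length_by_cluster = {0: 96.02, 4: 238.23, 1: 492.68, 3: 1062.41, 2: 2726.19}

# Hypothetical category names, ordered from shortest to longest.
labels = ["short", "medium", "long", "very long", "super long"]

# Rank cluster IDs by average length, then pair each with a label.
ranked_clusters = sorted(avg_length_by_cluster, key=avg_length_by_cluster.get)
label_by_cluster = dict(zip(ranked_clusters, labels))

print(label_by_cluster)
```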



                                                                                