# Creating Features Quiz
Use this Jupyter notebook to find the answers to the quiz in the previous section. There is an answer key in the next part of the lesson.

In [48]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, CountVectorizer, \
    IDF, StringIndexer
from pyspark.ml.feature import VectorAssembler, Normalizer, StandardScaler, MinMaxScaler
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

import re


In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Creating Features") \
    .getOrCreate()

### Read Dataset

In [3]:
stack_overflow_data = 'Train_onetag_small.json'

In [4]:
df = spark.read.json(stack_overflow_data)
df.persist()

DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]

In [5]:
df.printSchema()

root
 |-- Body: string (nullable = true)
 |-- Id: long (nullable = true)
 |-- Tags: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- oneTag: string (nullable = true)



### Build Body Length Feature

In [6]:
regexTokenizer = RegexTokenizer(inputCol="Body", outputCol="words", pattern="\\W")
df = regexTokenizer.transform(df)

In [7]:
body_length = udf(lambda x: len(x), IntegerType())
df = df.withColumn("BodyLength", body_length(df.words))
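
As a rough illustration of what the tokenizer step does to a single string, here is a plain-Python sketch. It assumes the RegexTokenizer defaults (lowercasing, splitting on the pattern, dropping empty tokens); `regex_tokenize` is a hypothetical helper for illustration, not part of PySpark, and Python's regex engine may differ from Spark's in edge cases.

```python
import re

def regex_tokenize(text, pattern=r"\W", min_token_length=1):
    """Plain-Python stand-in for the tokenizer step (assumed default settings)."""
    # Lowercase first, then split on every single non-word character.
    pieces = re.split(pattern, text.lower(), flags=re.ASCII)
    # Adjacent separators leave empty strings behind; drop them.
    return [p for p in pieces if len(p) >= min_token_length]

tokens = regex_tokenize("<p>I'd like to check")
print(tokens)       # ['p', 'i', 'd', 'like', 'to', 'check']
print(len(tokens))  # 6, the same count the body_length UDF would return
```

Note that the HTML tag name `p` and the pieces of `I'd` are counted as words, so BodyLength is an approximate word count rather than an exact one.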

In [8]:
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

In [9]:
df.show()

+--------------------+---+--------------------+--------------------+----------------+--------------------+----------+
|                Body| Id|                Tags|               Title|          oneTag|               words|BodyLength|
+--------------------+---+--------------------+--------------------+----------------+--------------------+----------+
|<p>I'd like to ch...|  1|php image-process...|How to check if a...|             php|[p, i, d, like, t...|        83|
|<p>In my favorite...|  2|             firefox|How can I prevent...|         firefox|[p, in, my, favor...|        71|
|<p>I am import ma...|  3|r matlab machine-...|R Error Invalid t...|               r|[p, i, am, import...|      3161|
|<p>This is probab...|  4|     c# url encoding|How do I replace ...|              c#|[p, this, is, pro...|       115|
|<pre><code>functi...|  5|php api file-get-...|How to modify who...|             php|[pre, code, funct...|       148|
|<p>I am using a m...|  6|proxy active-dire...|setting p

# Question 1
Select the question with Id = 1112. How many words does its body contain (check the BodyLength column)?

In [14]:
df.select(["Id", "BodyLength"]).filter(df.Id==1112).show()

+----+----------+
|  Id|BodyLength|
+----+----------+
|1112|        63|
+----+----------+



# Question 2
Create a new column that concatenates the question title and body. Apply the same functions we used before to compute the number of words in this combined column. What's the value in this new column for Id = 5123?

In [19]:
concat = udf(lambda x, y: x + ' ' + y)
df = df.withColumn("body_title", concat(df.Body, df.Title))
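# another way, using built-in column functions instead of a UDF
# (needs: from pyspark.sql.functions import concat, col, lit; the UDF above must not shadow `concat`)
# df = df.withColumn("Desc", concat(col("Title"), lit(' '), col("Body")))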
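
One thing to watch with the lambda above: the schema marks both Body and Title as nullable, and `x + ' ' + y` raises a TypeError in Python if either side is None. A minimal null-safe sketch is below; `join_text` is a hypothetical helper name, and it could be wrapped in `udf(...)` the same way as the lambda.

```python
def join_text(*parts, sep=" "):
    """Join the non-None parts with sep; returns '' if every part is None."""
    return sep.join(p for p in parts if p is not None)

print(join_text("<p>body</p>", "A title"))  # <p>body</p> A title
print(join_text(None, "A title"))           # A title
```

This particular dataset may not contain nulls at all, in which case the simple lambda is fine.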

In [22]:
df.select("body_title").head()

Row(body_title="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n How to check if an uploaded file is an image without mime type?")

In [23]:
df.select("Body").head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n")

In [24]:
df.select("Title").head()

Row(Title='How to check if an uploaded file is an image without mime type?')

In [25]:
# df = df.withColumn("body_title_length", body_length(df.body_title))

In [27]:
regexTokenizer2 = RegexTokenizer(inputCol="body_title", outputCol="words2", pattern="\\W")
df = regexTokenizer2.transform(df)
df = df.withColumn("body_title_length", body_length(df.words2))
df.select(["Id", "BodyLength", "body_title_length"]).filter(df.Id==5123).show()

+----+----------+-----------------+
|  Id|BodyLength|body_title_length|
+----+----------+-----------------+
|5123|       132|              135|
+----+----------+-----------------+



# Create a Vector
Create a vector from the combined Title + Body length column. In the next few questions, you'll try different normalizer/scaler methods on this new column.

In [30]:
assembler = VectorAssembler(inputCols=["body_title_length"], outputCol="NumFeatures")
df = assembler.transform(df)

In [32]:
df.show(2)

+--------------------+---+--------------------+--------------------+-------+--------------------+----------+--------------------+-----------------+--------------------+-----------+
|                Body| Id|                Tags|               Title| oneTag|               words|BodyLength|          body_title|body_title_length|              words2|NumFeatures|
+--------------------+---+--------------------+--------------------+-------+--------------------+----------+--------------------+-----------------+--------------------+-----------+
|<p>I'd like to ch...|  1|php image-process...|How to check if a...|    php|[p, i, d, like, t...|        83|<p>I'd like to ch...|               96|[p, i, d, like, t...|     [96.0]|
|<p>In my favorite...|  2|             firefox|How can I prevent...|firefox|[p, in, my, favor...|        71|<p>In my favorite...|               83|[p, in, my, favor...|     [83.0]|
+--------------------+---+--------------------+--------------------+-------+-------------------

# Question 3
Using the Normalizer method, what's the normalized value for question Id = 512?

In [33]:
scaler = Normalizer(inputCol="NumFeatures", outputCol="ScaledNumFeatures")
df = scaler.transform(df)

In [39]:
df.select(["Id", "NumFeatures", "ScaledNumFeatures"]).filter(df.Id==512).show()

+---+-----------+-----------------+
| Id|NumFeatures|ScaledNumFeatures|
+---+-----------+-----------------+
|512|     [57.0]|            [1.0]|
+---+-----------+-----------------+



In [38]:
df.select(["ScaledNumFeatures"]).dropDuplicates().show()

+-----------------+
|ScaledNumFeatures|
+-----------------+
|            [1.0]|
+-----------------+



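### Not exactly sure why this does not work as expected. Is it because the value is in [] (vector) form?
Most likely yes: Normalizer rescales each row's vector to unit p-norm, so a one-element vector with a positive value always becomes [1.0].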

In [52]:
normalizer = Normalizer(inputCol="NumFeatures", outputCol="ScaledNumFeatures_p1", p=1.0)
df = normalizer.transform(df)

In [53]:
df.select(["Id", "NumFeatures", "ScaledNumFeatures_p1"]).filter(df.Id==512).show()

+---+-----------+--------------------+
| Id|NumFeatures|ScaledNumFeatures_p1|
+---+-----------+--------------------+
|512|     [57.0]|               [1.0]|
+---+-----------+--------------------+
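
The [1.0] result can be reproduced by hand. Below is a plain-Python sketch of per-row p-norm normalization; `normalize` is a hypothetical helper written for this note, and the zero-vector guard is a choice made for the sketch rather than a statement about Spark's behavior.

```python
def normalize(vec, p=2.0):
    """Divide every component of one row vector by that vector's p-norm."""
    norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    if norm == 0:
        return list(vec)  # sketch choice: leave an all-zero vector unchanged
    return [x / norm for x in vec]

# A one-element vector is divided by its own absolute value.
print([round(x, 4) for x in normalize([57.0], p=1.0)])
# With several components the result is more interesting.
print([round(x, 4) for x in normalize([1.0, 0.5, -1.0], p=1.0)])
print([round(x, 4) for x in normalize([1.0, 0.5, -1.0], p=2.0)])
```

Because the norm is computed per row and not per column, Normalizer cannot compare question 512 against the other questions; that needs a column-wise scaler such as StandardScaler or MinMaxScaler.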



# Question 4
Using the StandardScaler method (**scaling both the mean and the standard deviation**), what's the normalized value for question Id = 512?

In [44]:
df = df.drop("ScaledNumFeatures2")
scaler2 = StandardScaler(inputCol="NumFeatures", outputCol="ScaledNumFeatures2", withStd=True, withMean=True)
scalerModel = scaler2.fit(df)
df = scalerModel.transform(df)

In [45]:
df.select(["Id", "NumFeatures", "ScaledNumFeatures2"]).filter(df.Id==512).show()

+---+-----------+--------------------+
| Id|NumFeatures|  ScaledNumFeatures2|
+---+-----------+--------------------+
|512|     [57.0]|[-0.6417314460998...|
+---+-----------+--------------------+



# Question 5
Using the MinMaxScaler method, what's the normalized value for question Id = 512?

In [49]:
scaler = MinMaxScaler(inputCol="NumFeatures", outputCol="ScaledNumFeatures3")
scalerModel = scaler.fit(df)
df = scalerModel.transform(df)

In [50]:
df.select(["Id", "NumFeatures", "ScaledNumFeatures3"]).filter(df.Id==512).show()

+---+-----------+--------------------+
| Id|NumFeatures|  ScaledNumFeatures3|
+---+-----------+--------------------+
|512|     [57.0]|[0.00624833820792...|
+---+-----------+--------------------+



In [58]:
df.select("body_title_length").describe().show() # describe can't be done on vector form, need the raw number

+-------+------------------+
|summary| body_title_length|
+-------+------------------+
|  count|            100000|
|   mean|         180.28187|
| stddev|192.10819533505023|
|    min|                10|
|    max|              7532|
+-------+------------------+



In [None]:
# correct : (57-10)/(7532-10)=0.0062
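
The same kind of hand check works for the StandardScaler answer in Question 4. The sketch below plugs in the mean and stddev reported by `describe()` above; it assumes that both `describe()` and StandardScaler use the sample standard deviation, which is consistent with the two numbers agreeing.

```python
mean = 180.28187             # mean of body_title_length from describe()
stddev = 192.10819533505023  # stddev of body_title_length from describe()
value = 57.0                 # body_title_length for Id = 512

# Standard score: subtract the mean (withMean=True), divide by the std (withStd=True).
z = (value - mean) / stddev
print(round(z, 4))  # -0.6417
```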

In [62]:
from pyspark.ml.feature import Normalizer
from pyspark.ml.linalg import Vectors

dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.5, -1.0]),),
    (1, Vectors.dense([2.0, 1.0, 1.0]),),
    (2, Vectors.dense([4.0, 10.0, 2.0]),)
], ["id", "features"])

In [63]:
normalizer = Normalizer(inputCol="features", outputCol="normFeatures", p=1.0)
l1NormData = normalizer.transform(dataFrame)
print("Normalized using L^1 norm")
l1NormData.show()

Normalized using L^1 norm
+---+--------------+------------------+
| id|      features|      normFeatures|
+---+--------------+------------------+
|  0|[1.0,0.5,-1.0]|    [0.4,0.2,-0.4]|
|  1| [2.0,1.0,1.0]|   [0.5,0.25,0.25]|
|  2|[4.0,10.0,2.0]|[0.25,0.625,0.125]|
+---+--------------+------------------+



In [66]:
normalizer = Normalizer(inputCol="features", outputCol="normFeatures2", p=2.0)
l2NormData = normalizer.transform(dataFrame)
print("Normalized using L^2 norm")
l2NormData.head()

Normalized using L^2 norm


Row(id=0, features=DenseVector([1.0, 0.5, -1.0]), normFeatures2=DenseVector([0.6667, 0.3333, -0.6667]))

In [67]:
dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0]),),
    (1, Vectors.dense([2.0]),),
    (2, Vectors.dense([4.0]),)
], ["id", "features"])
normalizer = Normalizer(inputCol="features", outputCol="normFeatures3", p=1.0)
l1NormData = normalizer.transform(dataFrame)
print("Normalized using L^1 norm")
l1NormData.head()

Normalized using L^1 norm


Row(id=0, features=DenseVector([1.0]), normFeatures3=DenseVector([1.0]))

In [71]:
dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0]),),
    (1, Vectors.dense([2.0]),),
    (2, Vectors.dense([4.0]),)
], ["id", "features"])
scaler = MinMaxScaler(inputCol="features", outputCol="normFeatures4")
scalerModel = scaler.fit(dataFrame)
minMaxData = scalerModel.transform(dataFrame)
print("Normalized using min_max norm")
minMaxData.show()

Normalized using min_max norm
+---+--------+--------------------+
| id|features|       normFeatures4|
+---+--------+--------------------+
|  0|   [1.0]|               [0.0]|
|  1|   [2.0]|[0.3333333333333333]|
|  2|   [4.0]|               [1.0]|
+---+--------+--------------------+
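
For comparison, here is a plain-Python sketch of the column-wise min-max rescaling that produces the table above. `min_max_scale` is a hypothetical helper; the constant-column branch follows my reading of the MinMaxScaler documentation and should be treated as an assumption.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Rescale a column of numbers linearly so its min maps to lo and its max to hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Assumed behavior for a constant column: map everything to the midpoint.
        return [0.5 * (lo + hi) for _ in values]
    return [(v - v_min) / (v_max - v_min) * (hi - lo) + lo for v in values]

print(min_max_scale([1.0, 2.0, 4.0]))  # [0.0, 0.3333333333333333, 1.0]
```

Unlike Normalizer, the minimum and maximum come from the whole column, which is why MinMaxScaler needs a `fit` step before `transform`.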

