# <center>Feature Engineering (v3)</center>

<br>
<br>
<p>In this final version, we implement the model with Spark through PySpark. First we create a Spark session and read the training and test data files.</p>
<br>
<br>

In [3]:

import pandas as pd
from pyspark.sql import SparkSession

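# Start (or reuse) a Spark session for this notebook
spark = SparkSession.builder.appName('ml_tweet_sentiment').getOrCreate()
# Read the training set; inferSchema makes Spark detect the column types
df = spark.read.csv('training_data.csv', header=True, inferSchema=True)
df.printSchema()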

root
 |-- _c0: integer (nullable = true)
 |-- target: integer (nullable = true)
 |-- text: string (nullable = true)



In [4]:
df.show(5)

+---+------+--------------------+
|_c0|target|                text|
+---+------+--------------------+
|  0|     0|@switchfoot http:...|
|  1|     0|is upset that he ...|
|  2|     0|@Kenichan I dived...|
|  3|     0|my whole body fee...|
|  4|     0|@nationwideclass ...|
+---+------+--------------------+
only showing top 5 rows



In [8]:
# Read the test set with the same options as the training set
ts = spark.read.csv('test_data.csv', header=True, inferSchema=True)
ts.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- target: integer (nullable = true)
 |-- text: string (nullable = true)



In [9]:
ts.show(5)

+---+------+--------------------+
|_c0|target|                text|
+---+------+--------------------+
|  0|     1|@stellargirl I lo...|
|  1|     1|Reading my kindle...|
|  2|     1|Ok, first assesme...|
|  3|     1|@kenburbary You'l...|
|  4|     1|@mikefish  Fair e...|
+---+------+--------------------+
only showing top 5 rows



<br>
<br>
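<p>We will apply a Tokenizer to split each text into words. Then we'll apply HashingTF and IDF to turn the words into numeric vectors: HashingTF hashes each word into one of a fixed number of buckets and counts the occurrences per bucket, and IDF rescales those counts so that words appearing in many texts carry less weight. We will use a Pipeline to run the same steps over the two datasets.</p>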
<br>
<br>
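<p>To make the three steps concrete, here is a minimal pure-Python sketch of what they compute, on a few made-up texts. It is an illustration under assumptions, not Spark's implementation: the function names and toy texts are hypothetical, and <code>zlib.crc32</code> stands in for the MurmurHash3 function that Spark uses, so the bucket indices will not match Spark's. The IDF weight follows the smoothed formula from the Spark documentation, log((m + 1) / (df + 1)), where m is the number of texts and df the number of texts that hit a given bucket.</p>

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 500  # same bucket count as the HashingTF stage in this notebook


def tokenize(text):
    # Lowercase and split on whitespace, roughly what Spark's Tokenizer does
    return text.lower().split()


def hashing_tf(words, num_features=NUM_FEATURES):
    # Map each word to a bucket index and count occurrences per bucket.
    # Assumption: crc32 is a stand-in hash; Spark uses MurmurHash3.
    return Counter(zlib.crc32(w.encode("utf-8")) % num_features for w in words)


def fit_idf(tf_vectors, num_features=NUM_FEATURES):
    # One weight per bucket: log((m + 1) / (df + 1))
    m = len(tf_vectors)
    doc_freq = Counter(idx for vec in tf_vectors for idx in vec)
    return [math.log((m + 1) / (doc_freq[idx] + 1)) for idx in range(num_features)]


def tf_idf(tf_vector, idf):
    # Scale each bucket count by the weight learned for that bucket
    return {idx: count * idf[idx] for idx, count in tf_vector.items()}


# Hypothetical toy texts, for illustration only
train_docs = ["good morning twitter", "bad morning", "good good day"]
train_tf = [hashing_tf(tokenize(d)) for d in train_docs]
idf = fit_idf(train_tf)
train_features = [tf_idf(vec, idf) for vec in train_tf]

# An unseen text is weighted with the IDF values learned above
test_features = tf_idf(hashing_tf(tokenize("good day")), idf)
```

<p>In this sketch the IDF weights are computed once from the toy training texts and reused for the unseen text, which keeps both sets of vectors on the same scale. With only 500 buckets, different words can collide in the same bucket; that is the trade-off for a fixed-size vector.</p>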

In [5]:
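from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml import Pipeline

# Split each tweet into lowercase words
tok = Tokenizer(inputCol="text", outputCol="words")
# Hash the words into 500 buckets and count occurrences per bucket
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=500)
# Rescale the counts by inverse document frequency
idf = IDF(inputCol="tf", outputCol="features")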


In [7]:
feat_pipeline = Pipeline(stages=[tok, tf, idf])

feat_model = feat_pipeline.fit(df)
features = feat_model.transform(df)

In [11]:

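# Reuse the pipeline fitted on the training data so the test set gets the
# same IDF weights; refitting on the test set would leak its statistics
test = feat_model.transform(ts)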

<br>
<br>
<p>The data is now ready to train and test the new model.</p>
<br>
<br>