# NLP Pipeline - Tokenization

In this notebook, we will see how to use PySpark's `Tokenizer`. The tokenizer converts each input string to lowercase and then splits it on whitespace.

[Docs](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.Tokenizer.html)
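To make the "lowercase, then split on whitespace" behaviour concrete before bringing in Spark, here is a minimal plain-Python sketch. The helper name `approximate_tokenize` is hypothetical and this is only an approximation for illustration, not Spark's implementation; edge cases such as repeated spaces may be handled differently by the real `Tokenizer`.

```python
def approximate_tokenize(sentence):
    """Lowercase a sentence and split it on whitespace.

    Hypothetical helper: a plain-Python approximation of the behaviour
    described above, not the code that Spark actually runs.
    """
    # str.split() with no arguments splits on any run of whitespace
    return sentence.lower().split()


# The same sample sentences that are used later in this notebook
sentences = [
    "Spark is great",
    "We are learning Spark",
    "Spark is better than hadoop no doubt",
]

for sentence in sentences:
    print(approximate_tokenize(sentence))
```

Each printed list should line up with the `words` column that the Spark `Tokenizer` produces for the same sentence further down.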


## 1. Setup

If you are using Google Colab, do not forget to install Spark first, for example with `pip install pyspark findspark`!

In [1]:
# Locate the local Spark installation and make PySpark importable
import findspark
findspark.init()

In [2]:
# Start Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("NLPTokens").getOrCreate()



In [3]:
from pyspark.ml.feature import Tokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

## 2. Create sample data

In [4]:
# Create sample DataFrame
dataframe = spark.createDataFrame([
    (0, "Spark is great"),
    (1, "We are learning Spark"),
    (2, "Spark is better than hadoop no doubt")
], ["id", "sentence"])

In [5]:
# Show DataFrame (by default, strings longer than 20 characters are truncated)
dataframe.show()

+---+--------------------+
| id|            sentence|
+---+--------------------+
|  0|      Spark is great|
|  1|We are learning S...|
|  2|Spark is better t...|
+---+--------------------+



## 3. Apply tokenizer

In [6]:
# Create a Tokenizer that reads the "sentence" column and writes the tokens to "words"
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenizer

Tokenizer_59588399feaa

In [7]:
# Transform and show DataFrame
tokenized = tokenizer.transform(dataframe)
tokenized.show()

+---+--------------------+--------------------+
| id|            sentence|               words|
+---+--------------------+--------------------+
|  0|      Spark is great|  [spark, is, great]|
|  1|We are learning S...|[we, are, learnin...|
|  2|Spark is better t...|[spark, is, bette...|
+---+--------------------+--------------------+



In [8]:
# Show the entire column content
tokenized.show(truncate=False)

+---+------------------------------------+--------------------------------------------+
|id |sentence                            |words                                       |
+---+------------------------------------+--------------------------------------------+
|0  |Spark is great                      |[spark, is, great]                          |
|1  |We are learning Spark               |[we, are, learning, spark]                  |
|2  |Spark is better than hadoop no doubt|[spark, is, better, than, hadoop, no, doubt]|
+---+------------------------------------+--------------------------------------------+
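The setup cell imports `col`, `udf` and `IntegerType`, which are not used above. One plausible use for them is counting the tokens in each row, sketched below. The names `count_tokens`, `add_token_counts` and the output column `tokens` are assumptions made for this illustration, and the function expects a DataFrame with a `words` array column such as `tokenized`.

```python
def count_tokens(words):
    """Return the number of tokens in one row's `words` list."""
    return len(words)


def add_token_counts(tokenized_df):
    """Return tokenized_df with an extra integer column named "tokens".

    The column name "tokens" is a hypothetical choice for this sketch.
    """
    # Imported here so that count_tokens can be tried without a Spark install
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType

    # Wrap the plain Python function as a Spark UDF that returns an integer
    count_tokens_udf = udf(count_tokens, IntegerType())
    return tokenized_df.withColumn("tokens", count_tokens_udf(col("words")))


# In this notebook, assuming `tokenized` from the cells above:
# add_token_counts(tokenized).select("sentence", "words", "tokens").show(truncate=False)

# Quick check of the counting logic on its own
print(count_tokens(["spark", "is", "great"]))
```

A Python UDF is used here only because the notebook already imports `udf`; the built-in `pyspark.sql.functions.size` also returns the length of an array column and avoids the Python round trip.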

