## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html), the Databricks File System, lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. However, you can switch languages for a single cell by starting it with the `%LANGUAGE` syntax (for example, `%sql`). Python, Scala, SQL, and R are all supported.

In [2]:
%python
from pyspark.sql.functions import regexp_replace, concat_ws
from pyspark.ml.feature import StopWordsRemover, Tokenizer
# File location: every CSV file uploaded to FileStore/tables
file_path = "dbfs:///FileStore/tables/*.csv"
# No separate stop-word list is built here: StopWordsRemover (below) defaults to its built-in English list
df = spark.read.csv(file_path, header="true", inferSchema="true").select("id", "title", "content")
df = df.withColumn('content', regexp_replace('content', r'[^0-9a-zA-Z]+', ' '))  # replace runs of special characters with a space
df = df.withColumn('content', regexp_replace('content', r'(?:^| )\w(?:$| )', ' '))  # remove single-character words

# Lowercase the text and split it on whitespace into an array of words
tokenizer = Tokenizer(inputCol="content", outputCol="words")
tokenized = tokenizer.transform(df).select('id', 'title', 'words')

# Drop stop words, then join the remaining words back into a single string
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
cleanedDataFrame = remover.transform(tokenized).select('id', 'title', 'filtered')
cleanedDataFrame = cleanedDataFrame.withColumn('filtered', concat_ws(' ', cleanedDataFrame.filtered))
#display(cleanedDataFrame)

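# collect() brings the rows to the driver; looping over the DataFrame object itself does not yield rows
for i, row in enumerate(cleanedDataFrame.select('filtered').collect()):
  print(i, row['filtered'])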
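
To see what the cleaning steps in the cell above do to a single string, here is a minimal pure-Python sketch that applies the same two regular expressions with the standard `re` module. It only approximates the Spark pipeline. The helper names (`strip_special`, `strip_single_chars`, `clean`) and the tiny `STOP_WORDS` set are hypothetical stand-ins for illustration, not the list `StopWordsRemover` uses. `str.split()` stands in for `Tokenizer`.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration only)
STOP_WORDS = {"this", "is", "a", "the"}

def strip_special(text):
    # Same pattern as the first regexp_replace: runs of non-alphanumerics become one space
    return re.sub(r'[^0-9a-zA-Z]+', ' ', text)

def strip_single_chars(text):
    # Same pattern as the second regexp_replace: a lone character bounded by spaces or string ends
    return re.sub(r'(?:^| )\w(?:$| )', ' ', text)

def clean(text):
    text = strip_single_chars(strip_special(text))
    words = text.lower().split()  # rough stand-in for Tokenizer
    return ' '.join(w for w in words if w not in STOP_WORDS)

print(clean("Hello, world! This is a test: x = 42."))
print(strip_single_chars("go a b c now"))
```

The second call illustrates a property of the pattern itself: each match consumes the space that follows it, so when several single-character words appear in a row, every other one can survive a single pass.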