# Author Identification

Input dataset:

`https://www.kaggle.com/c/spooky-author-identification/data`

The task is to predict the author of a sentence, given a large corpus of sample sentences labeled by author.

The data was loaded into our cluster through the Databricks UI, so we can now select it into a DataFrame.

In [3]:
df = spark.sql("SELECT text, author FROM pandas_train_csv")
df.printSchema()

Let's start by making sure our `author` column contains exactly the three expected values.

In [5]:
display(df.select('author').distinct())

Our DataFrame now looks ready to feed into our pipeline.

In [7]:
display(df.limit(10))
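
The feature stages of the pipeline below turn each sentence into term counts and then reweight those counts by inverse document frequency. The following is a minimal pure-Python sketch of that idea, not Spark's implementation. It assumes the smoothed formula `log((N + 1) / (df + 1))` described in the Spark MLlib documentation for `IDF`, and the helper names `fit_idf` and `tf_idf` as well as the toy sentences are hypothetical.

```python
import math
from collections import Counter


def fit_idf(tokenized_docs):
    """Return a term -> idf weight map using log((N + 1) / (df + 1))."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each term once per document
    return {t: math.log((n_docs + 1.0) / (df + 1.0)) for t, df in doc_freq.items()}


def tf_idf(doc, idf):
    """Weight the raw term counts of one document by the fitted idf values."""
    counts = Counter(doc)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


# Hypothetical toy sentences, lowercased and split on whitespace.
sentences = [
    "the night was dark",
    "the house was silent",
    "the night grew cold",
]
docs = [s.lower().split() for s in sentences]
idf = fit_idf(docs)
print(tf_idf(docs[0], idf))
```

A term that occurs in every sentence (here `the`) gets weight zero, while rarer terms such as `dark` are weighted up, which is why the IDF stage helps the classifier focus on author-specific vocabulary.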

In [8]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, Tokenizer, IDF, CountVectorizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

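# Map author names to numeric class labels.
labelIndexer = StringIndexer(inputCol="author", outputCol="label")

# Lowercase each sentence and split it on whitespace.
tokenizer = Tokenizer(inputCol="text", outputCol="words")

# Count term occurrences, keeping only terms that appear in at least 2 documents.
countv = CountVectorizer(inputCol="words", outputCol="rawFeatures", vocabSize=3000, minDF=2.0)

# Reweight the raw counts by inverse document frequency.
idf = IDF(inputCol="rawFeatures", outputCol="features")

nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

evaluator = MulticlassClassificationEvaluator(metricName="accuracy")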

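# Chain feature extraction and the classifier into a single estimator.
pipeline = Pipeline(stages=[labelIndexer, tokenizer, countv, idf, nb])

paramGrid = ParamGridBuilder()\
  .addGrid(nb.smoothing, [0.4, 0.8, 1.0, 2.0])\
  .addGrid(countv.vocabSize, [1000, 3000, 5000, 700])\
  .build()

# Cross-validate the whole pipeline so the vocabSize grid actually reaches CountVectorizer.
cv = CrossValidator(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=paramGrid, numFolds=4)

train, test = df.randomSplit([0.8, 0.2])
model = cv.fit(train)  # CrossValidatorModel wrapping the best pipeline
predictions = model.transform(test)
accuracy = evaluator.evaluate(predictions)

print("Accuracy on our test set: %g" % accuracy)

In [9]:
# Show the parameter combination with the best average cross-validation accuracy.
best_metric, best_params = max(zip(model.avgMetrics, paramGrid), key=lambda pair: pair[0])
{param.name: value for param, value in best_params.items()}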

In [10]:
display(predictions.select('text', 'label', 'prediction'))
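
To make the role of the `smoothing` parameter tuned above more concrete, here is a simplified pure-Python sketch of a multinomial Naive Bayes classifier with additive smoothing. It is an illustration only: Spark's `NaiveBayes` works on the TF-IDF feature vectors rather than raw token counts and differs in details. The function names `train_multinomial_nb` and `predict` and the toy sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, smoothing=1.0):
    """Fit log-priors and smoothed per-class token log-likelihoods."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(float(n) / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Additive smoothing: every vocabulary term gets `smoothing` pseudo-counts.
        total = sum(token_counts[c].values()) + smoothing * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + smoothing) / total) for tok in vocab
        }
    return log_prior, log_likelihood


def predict(model, doc):
    """Return the class with the highest posterior log-score for a tokenized doc."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][t] for t in doc if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy corpus with two author labels.
docs = [
    "the raven croaked at midnight".split(),
    "a raven sat above the door".split(),
    "the creature opened its eyes".split(),
    "my creature fled the laboratory".split(),
]
labels = ["EAP", "EAP", "MWS", "MWS"]
model = train_multinomial_nb(docs, labels, smoothing=1.0)
print(predict(model, "the raven at the door".split()))
```

Without smoothing, a word that never appears with a given author would drive that author's probability to zero. Larger `smoothing` values flatten the per-author word distributions, which is the trade-off the parameter grid explores.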