## Assignment

**1. Build a Classification Model**

In this exercise, you will fit a binary logistic regression model to the baby name dataset you used in the previous exercise. This model will predict the sex of a person based on their age, name, and state they were born in. To train the model, you will use the data found in baby-names/names-classifier.

a. Prepare the Input Features

First, you will need to prepare each of the input features. While age is a numeric feature, state and name are not. These need to be converted into numeric vectors before you can train the model. Use a StringIndexer along with the OneHotEncoderEstimator to convert the name, state, and sex columns into numeric vectors. Use the VectorAssembler to combine the name, state, and age vectors into a single features vector. Your final dataset should contain a column called features containing the prepared vector and a column called label containing the sex of the person.
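To make the encoding steps concrete, here is a minimal pure-Python sketch of what string indexing and one-hot encoding do conceptually. It is not Spark code. The helper names `fit_string_index` and `one_hot` are hypothetical, and the sketch assumes Spark's default behavior of ordering labels by descending frequency and dropping the last category from the one-hot vector.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent first (hypothetical helper)."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically so ties are deterministic
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# Toy data, not taken from the baby name dataset
states = ["AK", "CA", "CA", "NY", "CA", "NY"]
state_index = fit_string_index(states)
year = 1910.0

# Assemble one row: encoded state followed by the numeric year
features = one_hot(state_index["AK"], len(state_index)) + [year]
```

In the notebook itself, the same three steps are handled by `StringIndexer`, the one-hot encoder, and `VectorAssembler`, which produce sparse vectors rather than Python lists.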

**2. Fit and Evaluate the Model**

Fit a logistic regression model with the following parameters: `LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)`. Report the area under the ROC curve for the model.
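As a rough illustration of how `regParam` and `elasticNetParam` interact, the sketch below computes an elastic-net penalty for a given weight vector. It assumes the form described in Spark's documentation, where `elasticNetParam` mixes the L1 and L2 terms and `regParam` scales the total. The function name `elastic_net_penalty` and the example weights are hypothetical.

```python
def elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.8):
    """Penalty = reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )

# Made-up weights, for illustration only
example_weights = [0.5, -1.0, 0.0]
penalty = elastic_net_penalty(example_weights)
```

With `elasticNetParam=0.8` the penalty is dominated by the L1 term, which tends to push many coefficients to exactly zero. That matters here because the one-hot encoded name column produces a very wide feature vector.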

In [2]:
dbutils.library.installPyPI("matplotlib")
dbutils.library.restartPython()

In [3]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, OneHotEncoderEstimator, VectorAssembler
from pyspark.sql.functions import col
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

In [4]:
# File location and type
file_location = "/FileStore/tables/baby_names-b9fc6.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
babynames = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

# Take a look at the data
display(babynames)

state,sex,year,name,count
AK,F,1910,Mary,14
AK,F,1910,Annie,12
AK,F,1910,Anna,10
AK,F,1910,Margaret,8
AK,F,1910,Helen,7
AK,F,1910,Elsie,6
AK,F,1910,Lucy,6
AK,F,1910,Dorothy,5
AK,F,1911,Mary,12
AK,F,1911,Margaret,7


In [5]:
babynames.printSchema()

As we can see, `year` and `count` are numeric and do not need any transformation. However, `state`, `sex`, and `name` are strings, so we need to encode them before they can be used in the model.

In [7]:
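# Index each string column (every column except the numeric year and count)
categorical_cols = [c for c in babynames.columns if c not in ("year", "count")]
indexers = [StringIndexer(inputCol=c, outputCol=c + "_i").fit(babynames) for c in categorical_cols]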

# Convert the strings into numeric vectors
encoder = OneHotEncoderEstimator(
    inputCols=[indexer.getOutputCol() for indexer in indexers],
    outputCols=[
        "{0}_encoded".format(indexer.getOutputCol()) for indexer in indexers]
)

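# Combine into a single feature column. The encoded sex column is excluded because
# sex is the label (including it would leak the answer into the features), and
# year is added as the numeric stand-in for age.
assembler = VectorAssembler(
    inputCols=[c for c in encoder.getOutputCols() if not c.startswith("sex")] + ["year"],
    outputCol="features"
)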

# Sequence stages as pipeline
pipeline = Pipeline(stages=indexers + [encoder, assembler])

# Store the feature-engineered data in a DataFrame
babynames_ftr = pipeline.fit(babynames).transform(babynames)
babynames_ftr.show()

In [8]:
# Keep required columns to prepare data for model
babynames_mdl = babynames_ftr.select(col("sex_i").alias("label"), col("features"))
babynames_mdl.show()

In [9]:
# Split data into test and train
train, test = babynames_mdl.randomSplit([0.7, 0.3], seed = 123)
print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))

In [10]:
# Create and fit the model on the training set
lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrmodel = lr.fit(train)

# Predict on the test set and tabulate label vs. prediction counts
predict = lrmodel.transform(test)
predict.groupby('label', 'prediction').count().show()

In [11]:
# evaluate and calculate auc
bin_eval = BinaryClassificationEvaluator()
auc = bin_eval.evaluate(predict, {bin_eval.metricName:"areaUnderROC"})
auc
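
For intuition about what the reported number means, the area under the ROC curve can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below computes it that way in plain Python on made-up scores. The function name `roc_auc` is hypothetical, and this is not how `BinaryClassificationEvaluator` computes the metric internally, so values may differ slightly from Spark's.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up labels and scores, for illustration only
labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.2]
toy_auc = roc_auc(labels, scores)
```

A value of 0.5 corresponds to random ranking, and 1.0 means every positive example is scored above every negative one.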