## **PRODUCT CATEGORY CLASSIFICATION**

### Here, we want to develop an automatic and scalable first prototype that correctly assigns a new product to one of the available categories when it arrives.

This first prototype will help us identify which elements need deeper work in order to get the best model.


https://towardsdatascience.com/multi-class-text-classification-with-pyspark-7d78d022ed35

## **Summary of requirements**

1. Train a model that predicts the product category for Software, Digital Software, and
Digital Video Games products using the Amazon Customer Reviews dataset.
2. Evaluate and validate your model.

In [1]:
# I checked the warnings, but for the final report I prefer to ignore those
# that do not really affect the results (library warnings, etc.)
import warnings
warnings.simplefilter('ignore')

In [2]:
#my own functions
%load_ext autoreload
%autoreload 2

from utils.py_functions import *
from utils.cleaning_functions import *

In [3]:
#%%info

## **Load all the required libraries**

In [4]:
!pip install wordcloud



In [5]:
from pyspark import SparkContext
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from functools import reduce
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt
from wordcloud import WordCloud 
import pandas as pd
import re
import string

## **Create the Spark session and set the master (`local[*]` here) and the application name.**

In [6]:
# Configuration properties of Apache Spark
#sc.stop()
from pyspark import SparkConf
from pyspark.sql import SparkSession

APP_NAME = 'pyspark_python'
MASTER = 'local[*]'

conf = SparkConf().setAppName(APP_NAME)
conf = conf.setMaster(MASTER)
spark = SparkSession.builder.config(conf = conf).getOrCreate()
sc = spark.sparkContext

## **Load data.**

In [7]:

schema = StructType([
    StructField("marketplace",  StringType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("review_id", StringType(), True),  # third column of the TSV files
    StructField("product_id",  StringType(), True),
    StructField("product_parent", IntegerType(), True),
    StructField("product_title", StringType(), True),
    StructField("product_category", StringType(), True),
    StructField("star_rating", IntegerType(), True),
    StructField("helpful_votes", IntegerType(), True),
    StructField("total_votes", IntegerType(), True),
    StructField("vine", StringType(), True),
    StructField("verified_purchase", StringType(), True),
    StructField("review_headline", StringType(), True),
    StructField("review_body", StringType(), True),
    StructField("review_date", StringType(), True)])
'''
df_video_games = spark.read\
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
    .option("delimiter",",")\
    .option("inferSchema", "True")\
    .option("header", "True")\
    .load('data/amazon_reviews_us_Digital_Video_Games_v1_00.tsv')
'''

df_video_games = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option("delimiter","\t")\
  .option('header', 'true')\
  .option('inferSchema', 'true')\
  .load('data/amazon_reviews_us_Digital_Video_Games_v1_00.tsv')

df_software = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option("delimiter","\t")\
  .option('header', 'true')\
  .option('inferSchema', 'true')\
  .load('data/amazon_reviews_us_Software_v1_00.tsv')

df_digital_software = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option("delimiter","\t")\
  .option('header', 'true')\
  .option('inferSchema', 'true')\
  .load('data/amazon_reviews_us_Digital_Software_v1_00.tsv')

## **MERGE DATA**

In [8]:
# union() matches columns by position; the three files share the same column order
df = df_digital_software.union(df_software)
df = df.union(df_video_games)

In [9]:
df.show()

+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-------------------+
|marketplace|customer_id|     review_id|product_id|product_parent|       product_title|product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|        review_date|
+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-------------------+
|         US|   17747349|R2EI7QLPK4LF7U|B00U7LCE6A|     106182406|CCleaner Free [Do...|Digital_Software|          4|            0|          0|   N|                Y|          Four Stars|      So far so good|2015-08-31 00:00:00|
|         US|   10956619|R1W5OMFK1Q3I3O|B00HRJMOM4|     162269768|ResumeMaker Profe...|D

In [10]:
'''
names = ['marketplace', 'customer_id',  'product_id',
       'product_parent', 'product_title', 'product_category', 'star_rating',
       'helpful_votes', 'total_votes', 'vine', 'verified_purchase',
       'review_headline', 'review_body', 'review_date'] #'review_id'
df_x = df.toDF(*names)
df_x.toPandas().head()
'''

## **Top 20 PRODUCTS**

In [11]:
# from pyspark.sql.functions import col
df.groupBy("product_title") \
    .count() \
    .orderBy(col("count").desc()) \
    .show(truncate=False)

+---------------------------------------------------------+-----+
|product_title                                            |count|
+---------------------------------------------------------+-----+
|Playstation Network Card                                 |13642|
|Avast Free Antivirus 2015 [Download]                     |9470 |
|TurboTax Deluxe Fed + Efile + State                      |8965 |
|Xbox Live Subscription                                   |7307 |
|Playstation Plus Subscription                            |4712 |
|Quicken Deluxe 20                                        |4020 |
|TurboTax Deluxe Federal + E-File + State 2012            |3831 |
|Turbo Tax Parent V2                                      |3708 |
|Block Financial H&R Block Tax Software 14 Deluxe + State |3682 |
|TurboTax Deluxe Fed, Efile and State 2013                |3658 |
|Xbox Live Gift Card                                      |3438 |
|SimCity - Limited Edition                                |3421 |
|Norton 36

In [12]:
df.groupBy("review_headline").count().orderBy(col("count").desc()).show(truncate=False)

+---------------+-----+
|review_headline|count|
+---------------+-----+
|Five Stars     |46735|
|Four Stars     |9027 |
|One Star       |7578 |
|Three Stars    |4261 |
|Two Stars      |2415 |
|Great          |1603 |
|Great Product  |1588 |
|Great product  |1308 |
|Excellent      |1012 |
|Great game     |998  |
|Awesome        |897  |
|Great Game     |856  |
|great          |733  |
|Easy to use    |714  |
|Good           |688  |
|Disappointed   |676  |
|Great!         |670  |
|Good product   |621  |
|good           |550  |
|Love it        |543  |
+---------------+-----+
only showing top 20 rows



In [13]:
df.groupBy("review_body").count().orderBy(col("count").desc()).show(truncate=False)

+-------------+-----+
|review_body  |count|
+-------------+-----+
|good         |948  |
|Good         |931  |
|Great        |861  |
|great        |602  |
|Excellent    |559  |
|ok           |457  |
|Perfect      |275  |
|excellent    |256  |
|Great!       |244  |
|Awesome      |243  |
|very good    |240  |
|Nice         |237  |
|Very good    |234  |
|love it      |230  |
|Ok           |212  |
|Great product|211  |
|Thanks       |198  |
|Love it      |188  |
|Great game   |159  |
|nice         |158  |
+-------------+-----+
only showing top 20 rows



## **DROP DATA**

In [14]:
drop_list = ['marketplace', 'customer_id',  'product_id',
       'product_parent', 'star_rating',
       'helpful_votes', 'total_votes', 'vine', 'verified_purchase',
        'review_date', 'review_headline']  # review_id and review_body are kept
df = df.select([column for column in df.columns if column not in drop_list])
df.show(10)

+--------------+--------------------+----------------+--------------------+
|     review_id|       product_title|product_category|         review_body|
+--------------+--------------------+----------------+--------------------+
|R2EI7QLPK4LF7U|CCleaner Free [Do...|Digital_Software|      So far so good|
|R1W5OMFK1Q3I3O|ResumeMaker Profe...|Digital_Software|Needs a little mo...|
| RPZWSYWRP92GI|Amazon Drive Desk...|Digital_Software|      Please cancel.|
|R2WQWM04XHD9US|Norton Internet S...|Digital_Software|  Works as Expected!|
|R1WSPK2RA2PDEF|SecureAnywhere In...|Digital_Software|I've had Webroot ...|
|R11JVGRZRHTDAS|Pc Matic Performa...|Digital_Software|EXCELLENT softwar...|
|R2B8468OKXXYE2|Microsoft OneNote...|Digital_Software|The variations cr...|
|R2HGGCCZSSNUCB|Intuit Quicken Re...|Digital_Software|Horrible!  Would ...|
| REEE4LHSVPRV9|Avast Free Antivi...|Digital_Software|     Waste of time .|
|R25OMUUILFFHI9|Apache OpenOffice...|Digital_Software|Work as easy as o...|
+-----------

In [15]:
df.printSchema()

root
 |-- review_id: string (nullable = true)
 |-- product_title: string (nullable = true)
 |-- product_category: string (nullable = true)
 |-- review_body: string (nullable = true)



## **ELIMINATE DUPLICATED DATA**

In [16]:
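# Keep one row per (product_title, product_category) pair.
# count() avoids collecting the whole DataFrame to the driver just to read its shape.
print((df.count(), len(df.columns)))
df = df.dropDuplicates(subset= ['product_title', 'product_category'])
print((df.count(), len(df.columns)))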

(589446, 4)
(37354, 4)
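
The drop from 589446 to 37354 rows happens because only one review survives per (title, category) pair, and the call does not let us choose which one. A minimal plain-Python sketch of the same idea, assuming a hypothetical `drop_duplicates` helper and invented rows (it keeps the first row seen, which Spark does not promise):

```python
def drop_duplicates(rows, subset):
    """Keep the first row seen for each combination of the `subset` keys."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[name] for name in subset)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Invented rows, for illustration only.
reviews = [
    {"product_title": "Game A", "product_category": "Digital_Video_Games", "review_body": "fun"},
    {"product_title": "Game A", "product_category": "Digital_Video_Games", "review_body": "boring"},
    {"product_title": "Tool B", "product_category": "Software", "review_body": "works"},
]
unique_products = drop_duplicates(reviews, ["product_title", "product_category"])
print(len(reviews), "->", len(unique_products))
```

Since the surviving `review_body` later feeds the text features, the choice of which review is kept may matter for the model.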


In [17]:
df.show()

+--------------+--------------------+-------------------+--------------------+
|     review_id|       product_title|   product_category|         review_body|
+--------------+--------------------+-------------------+--------------------+
| RP4MUWMW5PDMH|2014 World Book M...|           Software|available for win...|
| RLXRY2WDMC7F1|2017 TKT Payroll ...|           Software|It say 2015 but t...|
| RUBVRR7OYC2VG|4 Step Loan Modif...|           Software|I would like to f...|
|R167B987XQ467P|900 Vector Art De...|           Software|             Exelent|
|R31TIOVO2HIPC7|A Love of Art 3 C...|           Software|If you're looking...|
|R1OPMK9DQ1MF3U|ACCPAC INTERNATIO...|           Software|Will not run on m...|
|R2IIPA0W00VQV5|ACT! Pro v16 Soft...|           Software|Delivered on time...|
|R3LJECXTVSR7EX|ASA 2014 Airframe...|           Software|               Ismok|
|R2H6ZULQI27QXH|Adobe Acrobat 6.0...|           Software|This version does...|
|R3VKJWMSO23TLF|Adobe CS6 Master ...|           Soft

## **FILL MISSING VALUES**

In [18]:
df = df.toPandas()

In [19]:
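# fill missing review bodies with the product title
# (assignment instead of inplace=True on an attribute, which newer pandas versions may not apply)
df['review_body'] = df['review_body'].fillna(df['product_title'])
df = spark.createDataFrame(df)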

## **NUMERICAL VARIABLES**

In [20]:
# Functions applied to each product title (one output column per function).
transform_functions = [
    lambda x: len(x),                                      # title length in characters
    lambda x: x.count(" "),                                # number of spaces
    lambda x: x.count("."),                                # number of periods
    lambda x: x.count("!"),                                # number of exclamation marks
    lambda x: x.count("?"),                                # number of question marks
    lambda x: len(x) / (x.count(" ") + 1),                 # characters per space-separated word
    lambda x: x.count(" ") / (x.count(".") + 1),           # spaces per period-delimited segment
    lambda x: len(re.findall(r"CD|DVD", x)),               # mentions of CD or DVD
    lambda x: len(re.findall(r"\d+(?:st|nd|rd|th)", x)),   # ordinals: 1st, 2nd, 3rd, 4th
    lambda x: len(re.findall(r"[A-Z]", x)),                # number of uppercase letters
    lambda x: len(re.findall(r"[0-9]", x)),                # number of digits
    lambda x: len(re.findall(r"\d{4}", x)),                # four-digit runs (years)
    lambda x: len(re.findall(r"\d$", x)),                  # ends with a digit
    lambda x: len(re.findall(r"^\d", x)),                  # starts with a digit
    lambda x: len(re.findall(r"\w+-\w+", x)),              # words separated with -
    lambda x: len(re.findall(r"OLD VERSION|Old Version|old version", x)),  # old version
]

transform_functions_len = [
    lambda x: len(x)
]
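
The anonymous functions above are easier to audit when each one carries the name of the column it produces. A minimal sketch, assuming a hypothetical `TITLE_FEATURES` mapping and `title_features` helper (neither is part of the notebook), an invented title, and only a subset of the features:

```python
import re

# Hypothetical name -> extractor mapping; the names mirror a few of the
# columns later assigned to df_num_2.
TITLE_FEATURES = {
    "title_len": len,                                          # characters
    "title_words": lambda t: t.count(" "),                     # spaces
    "title_cd": lambda t: len(re.findall(r"CD|DVD", t)),       # CD or DVD
    "title_numbers": lambda t: len(re.findall(r"[0-9]", t)),   # digits
    "title_years": lambda t: len(re.findall(r"\d{4}", t)),     # four-digit runs
    "starts_number": lambda t: len(re.findall(r"^\d", t)),     # leading digit
    "end_number": lambda t: len(re.findall(r"\d$", t)),        # trailing digit
}

def title_features(title):
    """Return a dict of feature name -> value for one product title."""
    return {name: func(title) for name, func in TITLE_FEATURES.items()}

# Invented title, for illustration only.
features = title_features("Quicken Deluxe 2014 DVD")
print(features)
```

Keeping name and function together removes the need to rename the columns in a separate step, where a misplaced name would go unnoticed.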

In [21]:
df_num_2 = df.toPandas()

In [22]:
df_num = df_num_2[['product_title']]
df_num_2 = df_num_2[['review_id']]
# append one column per transform function (columns are renamed in the next cell)
for func in transform_functions:
    df_num_2 = pd.concat([df_num_2, df_num['product_title'].apply(func)], axis=1)

In [23]:
df_num_2.columns = ['review_id', 'title_len', 'title_words', 'title_points',
                  'title_exc', 'title_int', 'ratio_spaces_point', 'ratio_len_points', 
                    'title_cd','title_th', 'title_upper_letters', 'title_numbers',
                    'title_years', 'end_number', 'starts_number', 'word_sep', 
                  'title_old_version']

In [24]:
df_num_2.head()

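| | review_id | title_len | title_words | title_points | title_exc | title_int | ratio_spaces_point | ratio_len_points | title_cd | title_th | title_upper_letters | title_numbers | title_years | end_number | starts_number | word_sep | title_old_version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RP4MUWMW5PDMH | 46 | 6 | 0 | 0 | 0 | 6.571429 | 6.0 | 1 | 0 | 7 | 4 | 1 | 0 | 1 | 0 | 0 |
| 1 | RLXRY2WDMC7F1 | 46 | 6 | 0 | 0 | 0 | 6.571429 | 6.0 | 0 | 0 | 7 | 4 | 1 | 0 | 1 | 0 | 0 |
| 2 | RUBVRR7OYC2VG | 50 | 8 | 0 | 0 | 0 | 5.555556 | 8.0 | 0 | 0 | 7 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | R167B987XQ467P | 68 | 12 | 0 | 0 | 0 | 5.230769 | 12.0 | 0 | 0 | 11 | 4 | 0 | 0 | 1 | 1 | 0 |
| 4 | R31TIOVO2HIPC7 | 39 | 8 | 0 | 0 | 0 | 4.333333 | 8.0 | 1 | 0 | 11 | 1 | 0 | 0 | 0 | 1 | 0 |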


## **CLEAN DATA**

In [25]:
def product_title_cleaning(df):
    """Clean the review_body column; product_title is passed through unchanged here."""
    columns = ["review_id", "product_category", "product_title", "review_body"]
    # Cleaning steps imported from the utils modules, applied in this order.
    steps = [
        fix_abbreviation,   # expand contractions: I'm -> I am
        tag_and_remove,     # keep only the nouns in the text
        lemitizeWords,      # lemmatization
        clean_text,         # clean text
        spell_correction,   # spelling correction
    ]

    def clean_body(text):
        for step in steps:
            text = step(text)
        return text

    # A single pass over the RDD instead of one RDD -> DataFrame round trip per step.
    rdd = df.rdd.map(lambda x: (x["review_id"], x["product_category"],
                                x["product_title"], clean_body(x["review_body"])))
    return spark.createDataFrame(rdd, schema=columns)

In [26]:
df = product_title_cleaning(df)
df.show(25, truncate=False)

+--------------+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|review_id     |product_category   |product_title                                                                                                                                                                           |review_body                                                                                                                                                                                   


In [28]:
#clean text
df = df.rdd.map(lambda x: (x["review_id"], x["product_category"],  clean_text(x["product_title"]), x["review_body"]))
df=spark.createDataFrame(df, schema = ["review_id", "product_category", "product_title", "review_body"])

In [29]:
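# concatenate title and review text, separated by a space so that the
# last word of the title does not fuse with the first word of the review

from pyspark.sql.functions import concat_ws
col_list = ['product_title','review_body']
df = df.withColumn('product_title', concat_ws(' ', *col_list))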


In [30]:
df.show(5)

+--------------+----------------+--------------------+--------------------+
|     review_id|product_category|       product_title|         review_body|
+--------------+----------------+--------------------+--------------------+
| RP4MUWMW5PDMH|        Software|world book multim...|         windows mac|
| RLXRY2WDMC7F1|        Software|tkt payroll softw...|form print data p...|
| RUBVRR7OYC2VG|        Software|step loan modific...|i reviewer indivi...|
|R167B987XQ467P|        Software|vector art design...|             exedent|
|R31TIOVO2HIPC7|        Software|a love of art cdr...|software time egg...|
+--------------+----------------+--------------------+--------------------+
only showing top 5 rows



## **MERGE DATA**

In [31]:
df_num_2 = spark.createDataFrame(df_num_2)

In [32]:
# drop the column that we do not need anymore
drop_list = ['review_body'] 
df = df.select([column for column in df.columns if column not in drop_list])

In [33]:
df.show(5)

+--------------+----------------+--------------------+
|     review_id|product_category|       product_title|
+--------------+----------------+--------------------+
| RP4MUWMW5PDMH|        Software|world book multim...|
| RLXRY2WDMC7F1|        Software|tkt payroll softw...|
| RUBVRR7OYC2VG|        Software|step loan modific...|
|R167B987XQ467P|        Software|vector art design...|
|R31TIOVO2HIPC7|        Software|a love of art cdr...|
+--------------+----------------+--------------------+
only showing top 5 rows



In [34]:
#df2 = df.join(df_num_2, df.review_id == df_num_2.review_id, 'left')

In [35]:
#df2.show(5)

In [36]:
#df2.printSchema()


## **MODEL PIPELINE**
* regexTokenizer: Tokenization (with Regular Expression)
* stopwordsRemover: Remove Stop Words
* countVectors: Count vectors (“document-term vectors”)
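
The three stages listed above can be imitated in plain Python to see what each one contributes. This is a sketch under stated assumptions, not Spark's implementation: `tokenize`, `remove_stop_words`, `fit_vocabulary` and `count_vector` are hypothetical helpers, `min_df` and `vocab_size` only mimic the `minDF` and `vocabSize` parameters, and the titles are invented.

```python
import re
from collections import Counter

STOP_WORDS = {"http", "https", "amp", "rt", "t", "c", "the"}  # list used in the notebook

def tokenize(text):
    """Lowercase and split on non-word characters, dropping empty tokens."""
    return [tok for tok in re.split(r"\W", text.lower()) if tok]

def remove_stop_words(tokens):
    return [tok for tok in tokens if tok not in STOP_WORDS]

def fit_vocabulary(docs, vocab_size, min_df):
    """Keep terms present in at least `min_df` documents, most frequent first."""
    term_freq = Counter(tok for doc in docs for tok in doc)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    kept = [term for term in term_freq if doc_freq[term] >= min_df]
    kept.sort(key=lambda term: (-term_freq[term], term))  # ties broken alphabetically
    return kept[:vocab_size]

def count_vector(doc, vocabulary):
    """Number of occurrences of each vocabulary term in one document."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

# Invented titles, for illustration only.
titles = ["The Sims 3", "Norton Antivirus: the 2014 edition", "Sims 3 expansion pack"]
docs = [remove_stop_words(tokenize(title)) for title in titles]
vocabulary = fit_vocabulary(docs, vocab_size=10, min_df=2)
vectors = [count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

With `min_df=2`, terms that appear in a single title are discarded, so the second title ends up as an all-zero vector. The same effect occurs with `minDF=5` for products whose titles use only rare words.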

In [39]:
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import LogisticRegression
# regular expression tokenizer
#-------
#title 
#------
regexTokenizer = RegexTokenizer(inputCol="product_title", outputCol="words", pattern="\\W")  # split text into words
# stop words (setStopWords replaces the default English list with exactly these words)
add_stopwords = ["http","https","amp","rt","t","c","the"] 
stopwordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered").setStopWords(add_stopwords)  # remove stop words
# bag of words count
countVectors = CountVectorizer(inputCol="filtered", outputCol="features_1", vocabSize=10000, minDF=5)  # convert text into vectors of token counts
#xx =  polarity_txt(x["product_title"])
# scikit-learn's SelectKBest is not a Spark ML stage and cannot be added to a Spark Pipeline;
# pyspark.ml.feature.ChiSqSelector is a Spark-native chi-squared selector
#selector =SelectKBest(chi2, k=1000) 

In [40]:
###-----------
# review_body
#-------------
# These stages are not part of the pipeline below: review_body is merged
# into product_title and then dropped before the pipeline is fitted.

regexTokenizer_2 = RegexTokenizer(inputCol='review_body', outputCol="words_2", pattern="\\W") 
# stop words
stopwordsRemover_2 = StopWordsRemover(inputCol="words_2", outputCol="filtered_2")
# bag of words count
countVectors_2 = CountVectorizer(inputCol="filtered_2", outputCol="features_2", vocabSize=10000, minDF=5)


## **PIPELINES OF SEVERAL VARIABLES**

https://medium.com/@armandj.olivares/a-basic-nlp-tutorial-for-news-multiclass-categorization-82afa6d46aa5

## **StringIndexer**
StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0.
In our case, the label column (product_category) has three values, so it will be encoded to label indices 0 to 2; the most frequent category will be indexed as 0.
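
A plain-Python sketch of the frequency ordering described above, assuming a hypothetical `fit_label_index` helper. It is not Spark's implementation; the alphabetical tie-break and the label counts are choices made for this illustration only.

```python
from collections import Counter

def fit_label_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}

# Invented label frequencies; they do not reflect the real dataset.
categories = ["Software", "Digital_Video_Games", "Software",
              "Digital_Software", "Software", "Digital_Video_Games"]
label_index = fit_label_index(categories)
print(label_index)
```

The indices are floats because the `label` column produced by the indexer is numeric, which is what the classifiers expect.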

In [41]:
'''
from pyspark.ml.feature import *
from pyspark.ml import Pipeline
tok = Tokenizer(inputCol="text", outputCol="words")
htf = HashingTF(inputCol="words", outputCol="tf", numFeatures=200)
w2v = Word2Vec(inputCol="text", outputCol="w2v")
ohe = OneHotEncoder(inputCol="userGroup", outputCol="ug")
va = VectorAssembler(inputCols=["tf", "w2v", "ug"], outputCol="features")
pipeline = Pipeline(stages=[tok,htf,w2v,ohe,va])
'''

In [42]:
fea_col = ["features_1"]
       #, 'title_len', 'title_words', 'title_points', 'title_exc', 'title_int',
       #'ratio_spaces_point', 'ratio_len_points', 'title_cd', 'title_th',
       #'title_upper_letters', 'title_numbers', 'title_years', 'end_number',
       #'starts_number', 'word_sep', 'title_old_version']

In [43]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
label_stringIdx = StringIndexer(inputCol = "product_category", outputCol = "label")  # encode the category column as a column of label indices
va = VectorAssembler(inputCols=fea_col, outputCol="features")
pipeline = Pipeline(stages = [regexTokenizer, stopwordsRemover, countVectors,
                             label_stringIdx, va])


In [44]:
# Fit the pipeline to training documents.
pipelineFit = pipeline.fit(df)
dataset = pipelineFit.transform(df)
#dataset.show(5)

TypeError: Cannot recognize a pipeline stage of type <class 'sklearn.feature_selection.univariate_selection.SelectKBest'>.

In [None]:
# set seed for reproducibility
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed = 100)
print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

In [None]:
trainingData.schema

In [None]:
lr = LogisticRegression(labelCol = 'label', featuresCol = 'features', maxIter=20, regParam=0.3, elasticNetParam=0)
lrModel = lr.fit(trainingData)
predictions = lrModel.transform(testData)
predictions.filter(predictions['prediction'] == 0) \
    .select("product_title","product_category","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
evaluator.evaluate(predictions)
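
The evaluator is created without a `metricName`, so it falls back to its default, the F1 score averaged over the classes and weighted by how often each class appears among the true labels. A minimal plain-Python sketch of that metric, assuming a hypothetical `weighted_f1` helper and invented labels:

```python
from collections import Counter

def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights equal to each class's share of the true labels."""
    total = len(labels)
    support = Counter(labels)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        n_pred = sum(1 for p in predictions if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += f1 * n_true / total
    return score

# Invented labels and predictions, for illustration only.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

Because the weights follow class frequency, a model that does well on the largest category can score high even if it does poorly on the smallest one, so per-class metrics are worth checking as well.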

## **Logistic Regression using TF-IDF Features**

In [None]:
from pyspark.ml.feature import HashingTF, IDF
hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=10000)
idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5) #minDocFreq: remove sparse terms
pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover, hashingTF, idf, label_stringIdx])
pipelineFit = pipeline.fit(df)
dataset = pipelineFit.transform(df)
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed = 100)
lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)
lrModel = lr.fit(trainingData)
predictions = lrModel.transform(testData)
predictions.filter(predictions['prediction'] == 0) \
    .select("product_title","product_category","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)
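
The two feature stages in this cell can be sketched in plain Python. `HashingTF` maps each token to one of `numFeatures` buckets with a hash function, and `IDF` rescales each bucket by log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents that hit the bucket. The helpers `hashing_tf` and `fit_idf` are hypothetical, the documents are invented, and CRC32 is used here only as a stand-in for the hash function Spark uses.

```python
import math
import zlib

def hashing_tf(tokens, num_features):
    """Term-frequency vector built with the hashing trick (CRC32 as a stand-in hash)."""
    vector = [0.0] * num_features
    for tok in tokens:
        vector[zlib.crc32(tok.encode("utf-8")) % num_features] += 1.0
    return vector

def fit_idf(tf_vectors, min_doc_freq=0):
    """Smoothed IDF per bucket; buckets seen in fewer than min_doc_freq documents get 0."""
    n_docs = len(tf_vectors)
    n_features = len(tf_vectors[0])
    weights = []
    for j in range(n_features):
        doc_freq = sum(1 for vec in tf_vectors if vec[j] > 0)
        if doc_freq < min_doc_freq:
            weights.append(0.0)
        else:
            weights.append(math.log((n_docs + 1) / (doc_freq + 1)))
    return weights

# Invented token lists, for illustration only.
docs = [["tax", "software"], ["tax", "return", "tax"], ["game", "card"]]
tf = [hashing_tf(doc, num_features=16) for doc in docs]
idf = fit_idf(tf)
tfidf = [[count * weight for count, weight in zip(vec, idf)] for vec in tf]
print(tfidf[1])
```

Unlike `CountVectorizer`, hashing needs no vocabulary pass over the data, at the cost of possible collisions between tokens and of losing the mapping from bucket back to word.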

In [None]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
evaluator.evaluate(predictions)

## **Cross-Validation**

In [None]:
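# countVectors writes to "features_1"; the assembler (va) is needed to produce the
# "features" column that LogisticRegression reads by default
pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover, countVectors, label_stringIdx, va])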
pipelineFit = pipeline.fit(df)
dataset = pipelineFit.transform(df)
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed = 100)
lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
# Create ParamGrid for Cross Validation
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.1, 0.3, 0.5]) # regularization parameter
             .addGrid(lr.elasticNetParam, [0.0, 0.1, 0.2]) # Elastic Net Parameter (Ridge = 0)
#            .addGrid(model.maxIter, [10, 20, 50]) #Number of iterations
#            .addGrid(idf.numFeatures, [10, 100, 1000]) # Number of features
             .build())
# Create 5-fold CrossValidator
cv = CrossValidator(estimator=lr, \
                    estimatorParamMaps=paramGrid, \
                    evaluator=evaluator, \
                    numFolds=5)
cvModel = cv.fit(trainingData)

predictions = cvModel.transform(testData)
# Evaluate best model
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
evaluator.evaluate(predictions)
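
The cost of this cell grows with the size of the grid times the number of folds, because every parameter combination is fitted once per fold before the best one is refitted on the whole training set. A minimal plain-Python sketch of that bookkeeping, assuming hypothetical `param_grid` and `k_fold_indices` helpers. The folds here are contiguous for readability, whereas Spark assigns rows to folds at random.

```python
from itertools import product

def param_grid(**grid):
    """Every combination of the listed parameter values, as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]

def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, validation
        start += size

grid = param_grid(regParam=[0.1, 0.3, 0.5], elasticNetParam=[0.0, 0.1, 0.2])
folds = list(k_fold_indices(n_rows=10, k=5))
n_fits = len(grid) * len(folds)
print(len(grid), "combinations x", len(folds), "folds =", n_fits, "model fits")
```

With the grid used in the notebook this gives 9 combinations and 45 fits, so uncommenting the extra `addGrid` lines would multiply the run time accordingly.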

## **Naive Bayes**

In [None]:
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes(smoothing=1)
model = nb.fit(trainingData)
predictions = model.transform(testData)
predictions.filter(predictions['prediction'] == 0) \
    .select("product_title","product_category","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)
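
`smoothing=1` adds one to every term count, so a word never seen in a category does not force that category's probability to zero. A textbook multinomial sketch in plain Python, assuming hypothetical `fit_naive_bayes` and `predict` helpers and an invented three-word vocabulary. It illustrates the technique and is not Spark's exact implementation.

```python
import math
from collections import defaultdict

def fit_naive_bayes(vectors, labels, smoothing=1.0):
    """Return log priors and smoothed log term probabilities per class."""
    n_features = len(vectors[0])
    class_docs = defaultdict(int)
    class_term_counts = defaultdict(lambda: [0.0] * n_features)
    for vec, label in zip(vectors, labels):
        class_docs[label] += 1
        for j, count in enumerate(vec):
            class_term_counts[label][j] += count
    log_prior, log_theta = {}, {}
    for label, n_docs in class_docs.items():
        log_prior[label] = math.log(n_docs / len(labels))
        total = sum(class_term_counts[label]) + smoothing * n_features
        log_theta[label] = [math.log((c + smoothing) / total)
                            for c in class_term_counts[label]]
    return log_prior, log_theta

def predict(vec, log_prior, log_theta):
    """Pick the class with the highest log posterior (up to a constant)."""
    scores = {
        label: log_prior[label] + sum(c * lt for c, lt in zip(vec, log_theta[label]))
        for label in log_prior
    }
    return max(scores, key=scores.get)

# Invented count vectors over the vocabulary ["tax", "game", "card"].
vectors = [[2, 0, 0], [1, 0, 0], [0, 2, 1]]
labels = ["Software", "Software", "Digital_Video_Games"]
log_prior, log_theta = fit_naive_bayes(vectors, labels, smoothing=1.0)
print(predict([0, 1, 1], log_prior, log_theta))
print(predict([1, 0, 0], log_prior, log_theta))
```

In this toy example "game" never occurs in a Software title, yet its smoothed probability in that class is 1/6 rather than 0.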

In [None]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
evaluator.evaluate(predictions)

## **Random Forest**

In [None]:
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(labelCol="label", \
                            featuresCol="features", \
                            numTrees = 100, \
                            maxDepth = 4, \
                            maxBins = 32)
# Train model with Training Data
rfModel = rf.fit(trainingData)
predictions = rfModel.transform(testData)
predictions.filter(predictions['prediction'] == 0) \
    .select("product_title","product_category","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)

In [None]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
evaluator.evaluate(predictions)