# **Count Vectorizer**

## **Run required Utilities**

In [1]:
%run ../utilities/CassandraUtility.ipynb
%run ../utilities/S3Utility.ipynb



## **Read database from Keyspaces using PySpark**

**1.** Download the required jar files (`spark-cassandra-connector_2.12-3.3.0.jar` and `spark-cassandra-connector-assembly_2.12-3.3.0.jar`).

**2.** Download your `cassandra_truststore.jks` file.

**3.** Create an `application.conf` file.
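
The helper `createSparkSessionWithCassandraConf` is defined in `CassandraUtility.ipynb` and is not shown here. As a rough sketch of how the files from the steps above are commonly wired into a Spark session, the snippet below collects the relevant settings in a dictionary. The function name `cassandra_spark_conf` and the `jars/` and `conf/` paths are hypothetical; the configuration keys are standard Spark and Spark Cassandra Connector settings, but the actual helper may set them differently.

```python
import os


def cassandra_spark_conf(jar_paths, profile_path="application.conf"):
    """Collect Spark settings for reading Keyspaces through the Cassandra connector.

    Hypothetical helper for illustration; not the notebook's actual utility.
    """
    return {
        # Put the connector jars on the driver and executor classpaths
        "spark.jars": ",".join(jar_paths),
        # Ship application.conf to the working directory of each executor
        "spark.files": profile_path,
        # Tell the connector to take its driver settings from that file
        "spark.cassandra.connection.config.profile.path": os.path.basename(profile_path),
    }


# Assumed file locations, for illustration only
conf = cassandra_spark_conf(
    [
        "jars/spark-cassandra-connector_2.12-3.3.0.jar",
        "jars/spark-cassandra-connector-assembly_2.12-3.3.0.jar",
    ],
    profile_path="conf/application.conf",
)
for key, value in conf.items():
    print(f"{key} = {value}")
```

Each pair could then be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`. The truststore is typically referenced from inside `application.conf` rather than from the Spark settings.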

In [2]:
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession

In [3]:
spark=createSparkSessionWithCassandraConf("CountVectorizer")
# Spark version
spark.version

23/10/02 16:35:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


'3.3.0'

In [4]:
articles=createDataFrameFromTable(spark, "GFGArticles", "BasicPreprocessedGFGArticles")
articles.show(5)

23/10/02 16:35:57 WARN CassandraConnectionFactory: Ignoring all programmatic configuration, only using configuration from application.conf



+-----+--------------------+
|   ID| PreprocessedContent|
+-----+--------------------+
| 8772|consider follow p...|
|11346|amcat amcat aspir...|
|23825|online code round...|
|23790|samsung r institu...|
|13740|man command linux...|
+-----+--------------------+
only showing top 5 rows




## **Convert the preprocessed content into tokens**

In [5]:
from pyspark.ml.feature import Tokenizer

In [6]:
tokenizer=Tokenizer(inputCol="PreprocessedContent", outputCol="Tokens")
# transform() appends the Tokens column; the existing columns keep their names
articles=tokenizer.transform(articles)
articles.show(5)

+-----+--------------------+--------------------+
|   ID| PreprocessedContent|              Tokens|
+-----+--------------------+--------------------+
| 8772|consider follow p...|[consider, follow...|
|11346|amcat amcat aspir...|[amcat, amcat, as...|
|23825|online code round...|[online, code, ro...|
|23790|samsung r institu...|[samsung, r, inst...|
|13740|man command linux...|[man, command, li...|
+-----+--------------------+--------------------+
only showing top 5 rows
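
Spark's `Tokenizer` lowercases the input string and splits it on whitespace. A minimal pure-Python approximation is sketched below; `simple_tokenize` is a hypothetical name, and its handling of repeated whitespace may differ from Spark's.

```python
def simple_tokenize(text):
    """Lowercase the text and split it on runs of whitespace."""
    return text.lower().split()


print(simple_tokenize("Man command Linux"))  # ['man', 'command', 'linux']
```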



## **CountVectorizer**

In [7]:
from pyspark.ml.feature import CountVectorizer

In [8]:
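# vocabSize caps the vocabulary at the 50,000 most frequent terms in the corpus
countVec=CountVectorizer(inputCol="Tokens", outputCol="Counts", vocabSize=50000)
countVecModel=countVec.fit(articles)
# Rename the model's "Counts" output column to "CountVector"
articles=countVecModel.transform(articles).withColumnRenamed("Counts", "CountVector")
articles.show(5)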


+-----+--------------------+--------------------+--------------------+
|   ID| PreprocessedContent|              Tokens|         CountVector|
+-----+--------------------+--------------------+--------------------+
| 8772|consider follow p...|[consider, follow...|(50000,[8,13,20,2...|
|11346|amcat amcat aspir...|[amcat, amcat, as...|(50000,[9,14,16,1...|
|23825|online code round...|[online, code, ro...|(50000,[0,1,2,4,6...|
|23790|samsung r institu...|[samsung, r, inst...|(50000,[0,1,2,4,6...|
|13740|man command linux...|[man, command, li...|(50000,[0,1,2,4,5...|
+-----+--------------------+--------------------+--------------------+
only showing top 5 rows
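
Each `CountVector` is printed in sparse form: the vector size (the vocabulary size), the indices of the terms that occur in the document, and their counts. The pure-Python sketch below illustrates the idea on a toy corpus. The function names are hypothetical, and the alphabetical tie-breaking is an assumption made to keep the example deterministic, not Spark's documented behaviour.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size):
    """Return the vocab_size most frequent terms across all documents."""
    corpus_counts = Counter(term for doc in docs for term in doc)
    # Sort by descending corpus frequency; ties broken alphabetically (an assumption)
    ranked = sorted(corpus_counts, key=lambda term: (-corpus_counts[term], term))
    return ranked[:vocab_size]


def count_vector(doc, vocabulary):
    """Return (size, indices, values) for one tokenized document."""
    index_of = {term: i for i, term in enumerate(vocabulary)}
    # Terms outside the vocabulary are ignored
    counts = Counter(index_of[term] for term in doc if term in index_of)
    indices = sorted(counts)
    return len(vocabulary), indices, [counts[i] for i in indices]


docs = [["a", "b", "a"], ["b", "c", "b"]]
vocabulary = fit_vocabulary(docs, vocab_size=3)
print(vocabulary)                         # ['b', 'a', 'c']
print(count_vector(docs[0], vocabulary))  # (3, [0, 1], [1, 2])
```

Index 0 therefore corresponds to the most frequent term in the corpus, which is why low indices appear in most of the rows above.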



In [9]:
# Look into the schema
articles.printSchema()

root
 |-- ID: integer (nullable = false)
 |-- PreprocessedContent: string (nullable = true)
 |-- Tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- CountVector: vector (nullable = true)



In [10]:
# The same schema as a StructType object
articles.schema

StructType([StructField('ID', IntegerType(), False), StructField('PreprocessedContent', StringType(), True), StructField('Tokens', ArrayType(StringType(), True), True), StructField('CountVector', VectorUDT(), True)])

## **Transform the `CountVector` into separate columns**

In [11]:
from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import ArrayType, IntegerType

In [12]:
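# articles=articles.withColumn("FeaturesCount", udf(lambda countVector : countVector.size, 
#                                          IntegerType())(col("CountVector")))
# Split each sparse CountVector into two plain integer-array columns:
# the indices of the non-zero entries and the corresponding counts.
indicesUdf=udf(lambda countVector : countVector.indices.tolist(), ArrayType(IntegerType()))
# The vector stores counts as floats, so cast them to integers first
valuesUdf=udf(lambda countVector : countVector.values.astype(np.int32).tolist(), ArrayType(IntegerType()))
articles=articles.withColumn("FeaturesIndices", indicesUdf(col("CountVector")))
articles=articles.withColumn("FeaturesValues", valuesUdf(col("CountVector")))
articles.show(5)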


+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|   ID| PreprocessedContent|              Tokens|         CountVector|     FeaturesIndices|      FeaturesValues|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|17036|b b c c answer ex...|[b, b, c, c, answ...|(50000,[3,4,6,7,8...|[3, 4, 6, 7, 8, 9...|[13, 1, 1, 5, 1, ...|
|16625|give two array ta...|[give, two, array...|(50000,[0,1,2,3,4...|[0, 1, 2, 3, 4, 5...|[6, 5, 3, 7, 5, 4...|
|  188|c strcat function...|[c, strcat, funct...|(50000,[10,13,14,...|[10, 13, 14, 15, ...|[2, 1, 2, 1, 3, 1...|
|  564|world programming...|[world, programmi...|(50000,[0,1,2,3,4...|[0, 1, 2, 3, 4, 6...|[3, 4, 3, 7, 2, 9...|
|11971|article know appr...|[article, know, a...|(50000,[0,1,2,4,8...|[0, 1, 2, 4, 8, 1...|[3, 4, 1, 1, 12, ...|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
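
The two array columns, together with the vector size (50000 here), are enough to rebuild a count vector later. The size itself is not stored, since the `FeaturesCount` column is commented out, so a reader of the table has to supply it. A minimal sketch, with the hypothetical helper `to_dense`:

```python
def to_dense(size, indices, values):
    """Rebuild a dense count list from parallel index and value arrays."""
    dense = [0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


print(to_dense(6, [0, 3, 4], [2, 1, 5]))  # [2, 0, 0, 1, 5, 0]
```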


## **Write to a new table in Keyspaces**

In [13]:
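# Rows per write batch; presumably read by saveDataFrameToTable (defined in CassandraUtility.ipynb)
BATCH_SIZE=1024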

In [14]:
saveDataFrameToTable(articles.select("ID", "FeaturesIndices", "FeaturesValues"), "GFGArticles", "CountVectorGFGArticles")



Batch [33792, 34815] saved to GFGArticles.CountVectorGFGArticles.
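
The log line reports an inclusive range of 1024 row positions, which matches `BATCH_SIZE=1024`. The implementation of `saveDataFrameToTable` is not shown in this notebook; the sketch below only illustrates how such ranges could be computed, and `batch_ranges` is a hypothetical name.

```python
def batch_ranges(total_rows, batch_size):
    """Yield inclusive (first, last) row positions for consecutive batches."""
    for first in range(0, total_rows, batch_size):
        # The final batch may be shorter than batch_size
        last = min(first + batch_size, total_rows) - 1
        yield first, last


print(list(batch_ranges(2500, 1024)))
# [(0, 1023), (1024, 2047), (2048, 2499)]
```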



## **Save the model**

In [15]:
# Save the model. A folder with the given name will be created.
countVecModel.write().overwrite().save("CountVectorizer")
# Save the model to S3.
saveModelToS3("CountVectorizer", "shivmlstorage", "eshivani/GFG-Articles-Summarizer-Models")
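
Because the model is saved as a folder rather than a single file, copying it to S3 means uploading every file under that folder. `saveModelToS3` is defined in `S3Utility.ipynb` and is not shown here; the sketch below is only an assumption about what such a helper might need to do. It maps each local file to an object key, using the hypothetical function `s3_upload_plan` and a made-up `models` prefix.

```python
import os
import tempfile


def s3_upload_plan(local_dir, key_prefix):
    """Return sorted (local_path, s3_key) pairs for every file under local_dir."""
    plan = []
    base = os.path.basename(os.path.normpath(local_dir))
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            # S3 keys always use forward slashes, whatever the local separator is
            relative = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            plan.append((local_path, f"{key_prefix}/{base}/{relative}"))
    return sorted(plan)


# Demo on a throwaway directory standing in for a saved model folder
with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "CountVectorizer")
    os.makedirs(os.path.join(model_dir, "metadata"))
    with open(os.path.join(model_dir, "metadata", "part-00000"), "w") as handle:
        handle.write("{}")
    demo_keys = [key for _path, key in s3_upload_plan(model_dir, "models")]

print(demo_keys)  # ['models/CountVectorizer/metadata/part-00000']
```

Each pair could then be handed to the `upload_file(local_path, bucket, key)` method of a boto3 S3 client.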
