# **TFIDF Vectorizer**

## **Run required Utilities**

In [1]:
%run ../utilities/CassandraUtility.ipynb
%run ../utilities/S3Utility.ipynb



## **Read database from Keyspaces using PySpark**

**1.** Download the required jar files (`spark-cassandra-connector_2.12-3.3.0.jar` and `spark-cassandra-connector-assembly_2.12-3.3.0.jar`).

**2.** Download your `cassandra_truststore.jks` file.

**3.** Create an `application.conf` file.

In [2]:
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession

In [3]:
spark=createSparkSessionWithCassandraConf("TFIDFVectorizer")
# Spark version
spark.version

23/10/02 16:36:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/10/02 16:36:53 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/10/02 16:36:53 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


'3.3.0'

In [4]:
articles=createDataFrameFromTable(spark, "GFGArticles", "BasicPreprocessedGFGArticles")
articles.show(5)

23/10/02 16:36:56 WARN CassandraConnectionFactory: Ignoring all programmatic configuration, only using configuration from application.conf


                                                                                

+-----+--------------------+
|   ID| PreprocessedContent|
+-----+--------------------+
|29550|functional progra...|
|28635|give integer n ta...|
|24783|give integer n ar...|
|23435|give integer n k ...|
| 4382|problem use switc...|
+-----+--------------------+
only showing top 5 rows



## **Convert the preprocessed content into tokens**

In [5]:
from pyspark.ml.feature import Tokenizer

In [6]:
tokenizer=Tokenizer(inputCol="PreprocessedContent", outputCol="Tokens")
articles=tokenizer.transform(articles).toDF("ID", "PreprocessedContent", "Tokens")
articles.show(5)

+-----+--------------------+--------------------+
|   ID| PreprocessedContent|              Tokens|
+-----+--------------------+--------------------+
| 8772|consider follow p...|[consider, follow...|
|11346|amcat amcat aspir...|[amcat, amcat, as...|
|23825|online code round...|[online, code, ro...|
|23790|samsung r institu...|[samsung, r, inst...|
|13740|man command linux...|[man, command, li...|
+-----+--------------------+--------------------+
only showing top 5 rows
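
Spark's `Tokenizer` lowercases the input string and splits it on whitespace. The sketch below is a rough pure-Python approximation of that behaviour. The function name `simple_tokenize` is hypothetical, and its handling of repeated whitespace may differ from Spark's.

```python
def simple_tokenize(text):
    """Lowercase the text and split it on runs of whitespace."""
    return text.lower().split()


# Illustrative input, not taken from the Keyspaces table.
tokens = simple_tokenize("Man Command Linux")
print(tokens)
```

Because the content was already preprocessed upstream, this step mostly just turns each string into a list of words.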



## **HashingTF and IDF Vectorizer**

In [7]:
from pyspark.ml.feature import HashingTF, IDF

In [8]:
tfVec=HashingTF(inputCol="Tokens", outputCol="TFVector", numFeatures=50000)
tfArticles=tfVec.transform(articles)
tfArticles.show(5)

+-----+--------------------+--------------------+--------------------+
|   ID| PreprocessedContent|              Tokens|            TFVector|
+-----+--------------------+--------------------+--------------------+
|17036|b b c c answer ex...|[b, b, c, c, answ...|(50000,[922,2189,...|
|16625|give two array ta...|[give, two, array...|(50000,[223,1487,...|
|  188|c strcat function...|[c, strcat, funct...|(50000,[453,573,1...|
|  564|world programming...|[world, programmi...|(50000,[7,440,922...|
|11971|article know appr...|[article, know, a...|(50000,[3095,5133...|
+-----+--------------------+--------------------+--------------------+
only showing top 5 rows
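
`HashingTF` uses the hashing trick: each token is hashed to a bucket index in `[0, numFeatures)` and the vector stores how many tokens fell into each bucket. The sketch below illustrates the idea in plain Python. It uses `zlib.crc32` as a stand-in hash (Spark uses MurmurHash3), so the indices it produces will not match the `TFVector` column above. The function name `hashing_tf` is hypothetical.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=50000):
    """Map each token to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for token in tokens:
        # Stand-in hash: Spark uses MurmurHash3, so real indices will differ.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    # Return sorted indices with their counts, like a sparse vector.
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return indices, values


indices, values = hashing_tf(["b", "b", "c", "c", "answer"])
print(indices, values)
```

Different tokens can land in the same bucket. A larger `numFeatures` lowers the chance of such collisions at the cost of a wider vector.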



In [9]:
idfVec=IDF(inputCol="TFVector", outputCol="TFIDFVector")
idfVecModel=idfVec.fit(tfArticles)
articles=idfVecModel.transform(tfArticles).toDF("ID", "PreprocessedContent", "Tokens", "TFVector", "TFIDFVector")
articles.show(5)

                                                                                

+-----+--------------------+--------------------+--------------------+--------------------+
|   ID| PreprocessedContent|              Tokens|            TFVector|         TFIDFVector|
+-----+--------------------+--------------------+--------------------+--------------------+
| 2218|matplotlib highly...|[matplotlib, high...|(50000,[564,1119,...|(50000,[564,1119,...|
|20243|give string str l...|[give, string, st...|(50000,[790,1659,...|(50000,[790,1659,...|
|17714|c variable always...|[c, variable, alw...|(50000,[223,573,9...|(50000,[223,573,9...|
|19583|want share interv...|[want, share, int...|(50000,[573,1264,...|(50000,[573,1264,...|
|16870|string basic cove...|[string, basic, c...|(50000,[453,564,5...|(50000,[453,564,5...|
+-----+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows
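
The `IDF` estimator down-weights features that occur in many documents. Spark's documentation gives the weight as `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the feature. Each term frequency is then multiplied by that weight. A minimal pure-Python sketch follows, with made-up counts and hypothetical function names.

```python
import math


def idf_weights(doc_term_counts):
    """doc_term_counts: list of dicts mapping feature index -> term frequency."""
    num_docs = len(doc_term_counts)
    doc_freq = {}
    for counts in doc_term_counts:
        for index in counts:
            doc_freq[index] = doc_freq.get(index, 0) + 1
    return {i: math.log((num_docs + 1) / (df + 1)) for i, df in doc_freq.items()}


def tfidf(counts, idf):
    """Scale each term frequency by the IDF weight of its feature."""
    return {i: tf * idf[i] for i, tf in counts.items()}


# Made-up term-frequency vectors for three documents.
docs = [{7: 2.0, 922: 1.0}, {922: 3.0}, {7: 1.0, 922: 1.0, 453: 4.0}]
idf = idf_weights(docs)
weights = tfidf(docs[2], idf)
print(weights)
```

A feature that appears in every document (index 922 here) gets a weight of zero, so it contributes nothing to the TF-IDF vector.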



In [10]:
# Look into the schema
articles.printSchema()

root
 |-- ID: integer (nullable = false)
 |-- PreprocessedContent: string (nullable = true)
 |-- Tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- TFVector: vector (nullable = true)
 |-- TFIDFVector: vector (nullable = true)



In [11]:
# Inspect the schema as a StructType object
articles.schema

StructType([StructField('ID', IntegerType(), False), StructField('PreprocessedContent', StringType(), True), StructField('Tokens', ArrayType(StringType(), True), True), StructField('TFVector', VectorUDT(), True), StructField('TFIDFVector', VectorUDT(), True)])

## **Transform the `TFIDFVector` into separate columns**

In [12]:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, IntegerType

In [13]:
# articles=articles.withColumn("FeaturesCount", udf(lambda tfidfVector : tfidfVector.size, 
#                                          IntegerType())(col("TFIDFVector")))
# Indices of the non-zero entries of the sparse TF-IDF vector.
articles=articles.withColumn("FeaturesIndices", udf(lambda tfidfVector : tfidfVector.indices.tolist(), 
                                           ArrayType(IntegerType()))(col("TFIDFVector")))
# Matching weights. Note: astype(np.int32) drops the fractional part of each TF-IDF weight.
articles=articles.withColumn("FeaturesValues", udf(lambda tfidfVector : tfidfVector.values.astype(np.int32).tolist(), 
                                           ArrayType(IntegerType()))(col("TFIDFVector")))
articles.show(5)


+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|   ID| PreprocessedContent|              Tokens|            TFVector|         TFIDFVector|     FeaturesIndices|      FeaturesValues|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|29550|functional progra...|[functional, prog...|(50000,[573,585,1...|(50000,[573,585,1...|[573, 585, 1069, ...|[1, 3, 7, 2, 14, ...|
|28635|give integer n ta...|[give, integer, n...|(50000,[86,585,92...|(50000,[86,585,92...|[86, 585, 922, 28...|[4, 1, 5, 2, 2, 1...|
|24783|give integer n ar...|[give, integer, n...|(50000,[163,330,5...|(50000,[163,330,5...|[163, 330, 592, 1...|[6, 3, 3, 19, 2, ...|
|23435|give integer n k ...|[give, integer, n...|(50000,[585,743,1...|(50000,[585,743,1...|[585, 743, 1364, ...|[4, 3, 5, 3, 9, 1...|
| 4382|problem use switc...|[problem, use, sw...|(50000,[190,6
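
The cell above stores each sparse vector as two parallel lists, one of indices and one of values, and casts the values to integers. The sketch below shows the effect of that cast on made-up weights using plain Python. The function name `split_sparse` and the sample values are hypothetical. Whether float values could be stored instead depends on the target table's schema, which is not shown here.

```python
def split_sparse(vector, as_int=False):
    """vector: dict of index -> weight. Returns parallel (indices, values) lists."""
    indices = sorted(vector)
    values = [vector[i] for i in indices]
    if as_int:
        # int() truncates toward zero, like the integer cast in the cell above.
        values = [int(v) for v in values]
    return indices, values


# Made-up TF-IDF weights for illustration only.
tfidf_vector = {573: 1.79, 585: 3.21, 1069: 7.94}
print(split_sparse(tfidf_vector))
print(split_sparse(tfidf_vector, as_int=True))
```

With the cast, weights such as 1.79 and 1.05 both become 1, so features with different importance can end up indistinguishable.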

                                                                                

## **Write to a new table in Keyspaces**

In [14]:
# Rows per write batch; presumably read by saveDataFrameToTable from the utility notebook.
BATCH_SIZE=1024

In [15]:
saveDataFrameToTable(articles.select("ID", "FeaturesIndices", "FeaturesValues"), "GFGArticles", "TFIDFVectorGFGArticles")



Batch [33792, 34815] saved to GFGArticles.TFIDFVectorGFGArticles.
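
The logged range `[33792, 34815]` spans exactly 1024 rows, which matches `BATCH_SIZE`. The implementation of `saveDataFrameToTable` lives in the utility notebook and is not shown here, so the sketch below is only an assumption about how such inclusive batch ranges could be generated. The function name `batch_ranges` and the total row count are hypothetical.

```python
def batch_ranges(num_rows, batch_size=1024):
    """Yield inclusive (start, end) row ranges that cover num_rows rows."""
    for start in range(0, num_rows, batch_size):
        end = min(start + batch_size, num_rows) - 1
        yield start, end


# Hypothetical row count, chosen so the last range matches the log line above.
ranges = list(batch_ranges(34816))
print(ranges[-1])
```

Writing in fixed-size batches keeps each write request small, which can help when the target store limits request size or throughput.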


                                                                                

## **Save the model**

In [16]:
# Save the fitted IDF model locally. Spark creates a folder with the given name.
idfVecModel.write().overwrite().save("TFIDFVectorizer")
# Upload the saved model folder to S3.
saveModelToS3("TFIDFVectorizer", "shivmlstorage", "eshivani/GFG-Articles-Summarizer-Models")

                                                                                