# **TFIDF Vectorizer**

## **Run required Utilities**

In [1]:
%run ../utilities/CassandraUtility.ipynb

## **Read database from Keyspaces using PySpark**

**1.** Download the required jar files (`spark-cassandra-connector_2.12-3.3.0.jar` and `spark-cassandra-connector-assembly_2.12-3.3.0.jar`).

**2.** Download your `cassandra_truststore.jks` file.

**3.** Create an `application.conf` file.

**4.** Create a `SparkSession` and configure it to connect to Keyspaces using service-specific credentials.

**5.** Read all rows from the `BasicPreprocessedGFGArticles` table in the `GFGArticles` keyspace into a PySpark DataFrame.
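
The helpers used in the cells that follow (`createSparkSessionWithCassandraConf`, `createDataFrameFromTable`) are defined in `CassandraUtility.ipynb` and are not shown in this notebook. As a rough sketch of what steps 4 and 5 might look like, the functions below are hypothetical stand-ins: the function names, the jar paths and the default `application.conf` location are assumptions, while the configuration keys and the `org.apache.spark.sql.cassandra` data source come from the Spark Cassandra Connector.

```python
def cassandra_spark_conf(jar_paths, profile_path="application.conf"):
    """Return Spark settings for the Cassandra connector (hypothetical helper)."""
    return {
        # Comma-separated list of connector jars to put on the classpath
        "spark.jars": ",".join(jar_paths),
        # Driver profile that holds the endpoint, truststore and credentials
        "spark.cassandra.connection.config.profile.path": profile_path,
        # Registers the connector's Spark SQL extensions
        "spark.sql.extensions": "com.datastax.spark.connector.CassandraSparkExtensions",
    }


def create_session(app_name, conf):
    """Build a SparkSession from a dict of settings (hypothetical helper)."""
    # Imported lazily so the settings above can be inspected without Spark installed
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def read_table(spark, keyspace, table):
    """Load every row of keyspace.table into a DataFrame (hypothetical helper)."""
    return (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .load()
    )
```

With these in place, `read_table(create_session("TFIDFVectorizer", conf), "GFGArticles", "BasicPreprocessedGFGArticles")` would play the role of the two utility calls below, assuming the jars and `application.conf` are reachable from the driver.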

In [2]:
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession

In [3]:
spark=createSparkSessionWithCassandraConf("TFIDFVectorizer")
# Spark version
spark.version

23/09/27 02:37:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


'3.3.0'

In [4]:
articles=createDataFrameFromTable(spark, "GFGArticles", "BasicPreprocessedGFGArticles")
articles.show(5)

23/09/27 02:37:13 WARN CassandraConnectionFactory: Ignoring all programmatic configuration, only using configuration from application.conf



+-----+--------------------+
|   ID| PreprocessedContent|
+-----+--------------------+
| 2218|matplotlib highly...|
|20243|give string str l...|
|17714|c variable always...|
|19583|want share interv...|
|16870|string basic cove...|
+-----+--------------------+
only showing top 5 rows

## **Convert the preprocessed content into tokens**

In [5]:
from pyspark.ml.feature import Tokenizer

In [6]:
tokenizer=Tokenizer(inputCol="PreprocessedContent", outputCol="Tokens")
articles=tokenizer.transform(articles).toDF("ID", "PreprocessedContent", "Tokens")
articles.show(5)

+-----+--------------------+--------------------+
|   ID| PreprocessedContent|              Tokens|
+-----+--------------------+--------------------+
|29550|functional progra...|[functional, prog...|
|28635|give integer n ta...|[give, integer, n...|
|24783|give integer n ar...|[give, integer, n...|
|23435|give integer n k ...|[give, integer, n...|
| 4382|problem use switc...|[problem, use, sw...|
+-----+--------------------+--------------------+
only showing top 5 rows
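
Spark's `Tokenizer` lowercases each string and splits it on whitespace. The plain-Python function below is only an approximation of that behaviour for single-space-separated text like the preprocessed content here; `tokenize` is a hypothetical name, not part of PySpark.

```python
def tokenize(text):
    """Approximate Tokenizer: lowercase the text, then split on whitespace."""
    return text.lower().split()


sample = "Give Integer n task"
print(tokenize(sample))
```

Because the content was already lowercased and cleaned in an earlier preprocessing step, the tokenizer here mostly just turns each string into an array of words.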



## **HashingTF and IDF Vectorizer**

In [7]:
from pyspark.ml.feature import HashingTF, IDF

In [8]:
tfVec=HashingTF(inputCol="Tokens", outputCol="TFVector", numFeatures=50000)
tfArticles=tfVec.transform(articles)
tfArticles.show(5)

+-----+--------------------+--------------------+--------------------+
|   ID| PreprocessedContent|              Tokens|            TFVector|
+-----+--------------------+--------------------+--------------------+
|23474|resizable propert...|[resizable, prope...|(50000,[459,2143,...|
| 3531|sometimes require...|[sometimes, requi...|(50000,[480,564,1...|
|22273|javascript synchr...|[javascript, sync...|(50000,[573,668,2...|
|24847|give non negative...|[give, non, negat...|(50000,[3095,4265...|
|23798|non homogeneous p...|[non, homogeneous...|(50000,[453,922,1...|
+-----+--------------------+--------------------+--------------------+
only showing top 5 rows
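
Each `TFVector` above is a sparse vector of the form `(size, [indices], [counts])`: every token is hashed to a bucket in `[0, numFeatures)` and the bucket's value is the number of tokens that landed there. The sketch below illustrates the idea in plain Python. It is not Spark's implementation: `zlib.crc32` stands in for the MurmurHash3 function that `HashingTF` uses, so the indices it produces will not match the table.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=50000):
    """Map tokens to (size, indices, counts) with the hashing trick."""
    counts = Counter()
    for token in tokens:
        # Stand-in hash; Spark's HashingTF uses MurmurHash3 instead
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return num_features, indices, values


size, indices, values = hashing_tf(["give", "integer", "n", "give"])
print(size, indices, values)
```

Two different tokens can hash to the same bucket, in which case their counts are added together. A larger `numFeatures` makes such collisions less likely at the cost of a wider vector.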



In [9]:
idfVec=IDF(inputCol="TFVector", outputCol="TFIDFVector")
idfVecModel=idfVec.fit(tfArticles)
articles=idfVecModel.transform(tfArticles).toDF("ID", "PreprocessedContent", "Tokens", "TFVector", "TFIDFVector")
articles.show(5)

+-----+--------------------+--------------------+--------------------+--------------------+
|   ID| PreprocessedContent|              Tokens|            TFVector|         TFIDFVector|
+-----+--------------------+--------------------+--------------------+--------------------+
|21711|parametric method...|[parametric, meth...|(50000,[572,1195,...|(50000,[572,1195,...|
|24672|give two positive...|[give, two, posit...|(50000,[564,1659,...|(50000,[564,1659,...|
| 7520|maximum number sq...|[maximum, number,...|(50000,[223,743,1...|(50000,[223,743,1...|
|14521|10th september tc...|[10th, september,...|(50000,[440,585,6...|(50000,[440,585,6...|
|16910|generator python ...|[generator, pytho...|(50000,[453,666,9...|(50000,[453,666,9...|
+-----+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows
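
`IDF.fit` computes one weight per feature index, and `transform` multiplies each term frequency by that weight. Spark documents the weight as $\log\frac{m + 1}{\mathrm{df}(t) + 1}$, where $m$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing feature $t$. The following plain-Python sketch applies that formula to a tiny made-up corpus; the function names and the example documents are illustrative only.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Compute log((m + 1) / (df + 1)) for every feature index in docs.

    Each document is a dict mapping feature index -> term frequency.
    """
    m = len(docs)
    doc_freq = Counter()
    for doc in docs:
        for index in doc:
            doc_freq[index] += 1
    return {i: math.log((m + 1) / (df + 1)) for i, df in doc_freq.items()}


def tfidf(doc, idf):
    """Scale each term frequency by the IDF weight of its index."""
    return {i: tf * idf[i] for i, tf in doc.items()}


# Made-up term-frequency vectors for three documents
docs = [{0: 2.0, 1: 1.0, 3: 1.0}, {0: 1.0, 3: 2.0}, {2: 3.0, 3: 1.0}]
idf = idf_weights(docs)
print(tfidf(docs[0], idf))
```

A feature that occurs in every document (index `3` in this example) gets a weight of `log(1) = 0`, so it contributes nothing to the TF-IDF vector.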



In [10]:
# Look into the schema
articles.printSchema()

root
 |-- ID: integer (nullable = false)
 |-- PreprocessedContent: string (nullable = true)
 |-- Tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- TFVector: vector (nullable = true)
 |-- TFIDFVector: vector (nullable = true)



In [11]:
# Inspect the same schema as a StructType object
articles.schema

StructType([StructField('ID', IntegerType(), False), StructField('PreprocessedContent', StringType(), True), StructField('Tokens', ArrayType(StringType(), True), True), StructField('TFVector', VectorUDT(), True), StructField('TFIDFVector', VectorUDT(), True)])

## **Transform the `TFIDFVector` into separate columns**

In [12]:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, IntegerType

In [13]:
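# articles=articles.withColumn("FeaturesCount", udf(lambda tfidfVector : tfidfVector.size, 
#                                          IntegerType())(col("TFIDFVector")))
# Indices of the non-zero entries of each sparse TF-IDF vector
articles=articles.withColumn("FeaturesIndices", udf(lambda tfidfVector : tfidfVector.indices.tolist(), 
                                           ArrayType(IntegerType()))(col("TFIDFVector")))
# Matching TF-IDF weights. Note: astype(np.int32) truncates each weight toward zero, so the
# fractional part is lost; store them as ArrayType(DoubleType()) if full precision is needed.
articles=articles.withColumn("FeaturesValues", udf(lambda tfidfVector : tfidfVector.values.astype(np.int32).tolist(), 
                                           ArrayType(IntegerType()))(col("TFIDFVector")))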
articles.show(5)


+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|   ID| PreprocessedContent|              Tokens|            TFVector|         TFIDFVector|     FeaturesIndices|      FeaturesValues|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|31599|give string conta...|[give, string, co...|(50000,[453,1143,...|(50000,[453,1143,...|[453, 1143, 2037,...|[5, 4, 4, 2, 5, 3...|
|14638|primary memory li...|[primary, memory,...|(50000,[417,573,5...|(50000,[417,573,5...|[417, 573, 586, 1...|[4, 8, 23, 3, 6, ...|
| 5992|database offer nu...|[database, offer,...|(50000,[261,814,2...|(50000,[261,814,2...|[261, 814, 2093, ...|[4, 4, 8, 3, 1, 6...|
|29258|article learn det...|[article, learn, ...|(50000,[564,1652,...|(50000,[564,1652,...|[564, 1652, 1998,...|[3, 2, 20, 2, 22,...|
| 7998|give two string x...|[give, two, strin...|(50000,[922,4
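
Storing only `FeaturesIndices` and `FeaturesValues` keeps each row compact, since the vector size is fixed by `numFeatures` (50000 here) and every index that is not listed is zero. As a minimal sketch of how a consumer of the table could rebuild a dense vector from the two arrays, the hypothetical `to_dense` function below uses a small size and made-up values for readability.

```python
def to_dense(size, indices, values):
    """Rebuild a dense list from sparse (indices, values) arrays."""
    dense = [0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# Made-up example: a vector of size 8 with three non-zero entries
print(to_dense(8, [1, 4, 6], [5, 2, 3]))
```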

                                                                                

## **Write to a new table in Keyspaces**

In [14]:
# Rows per write batch; the batch range in the output below spans 1024 rows
BATCH_SIZE=1024

In [15]:
saveDataFrameToTable(articles.select("ID", "FeaturesIndices", "FeaturesValues"), "GFGArticles", "TFIDFVectorGFGArticles")



Batch [33792, 34815] saved to GFGArticles.TFIDFVectorGFGArticles.
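
`saveDataFrameToTable` lives in `CassandraUtility.ipynb`, so its implementation is not visible here. The log line above suggests it writes rows in inclusive ranges of `BATCH_SIZE` rows. A minimal sketch of how such ranges could be generated is shown below; `batch_ranges` is a hypothetical helper, and the total of 34816 rows is only an example chosen so that the last range matches the log line, not the real row count.

```python
def batch_ranges(total_rows, batch_size=1024):
    """Return inclusive (first, last) row positions for each write batch."""
    return [
        (start, min(start + batch_size, total_rows) - 1)
        for start in range(0, total_rows, batch_size)
    ]


ranges = batch_ranges(34816)  # example total, not the actual table size
print(len(ranges), ranges[0], ranges[-1])
```

Writing in fixed-size batches keeps each request small, which can help when the target service limits write throughput.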


                                                                                