# Chi Sq Selector

The Chi Sq Selector (`ChiSqSelector` in Spark ML) is a feature selection method used in machine learning. It is designed for labeled data with categorical features.

The Chi Sq Selector calculates the chi-square statistic between each feature and the target variable, which tests whether the two are independent. The higher the statistic, the stronger the evidence that the feature and the target variable are dependent.
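
To make the statistic concrete, here is a minimal pure-Python sketch of the Pearson chi-square statistic for one categorical feature against a categorical label. The function name `chi_square_statistic` and the toy data are assumptions made up for illustration; they are not part of Spark or of this notebook's dataset.

```python
from collections import Counter


def chi_square_statistic(feature, label):
    """Pearson chi-square statistic for two equal-length categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, label))  # joint counts per (feature, label) pair
    feature_totals = Counter(feature)        # row totals
    label_totals = Counter(label)            # column totals
    stat = 0.0
    for f_value, f_count in feature_totals.items():
        for l_value, l_count in label_totals.items():
            # Count expected in this cell if feature and label were independent
            expected = f_count * l_count / n
            stat += (observed[(f_value, l_value)] - expected) ** 2 / expected
    return stat


# Toy data, invented for illustration only
transmission = [0, 0, 0, 0, 1, 1, 1, 1]
label_dependent = ["low", "low", "low", "low", "high", "high", "high", "high"]
label_unrelated = ["low", "high", "low", "high", "low", "high", "low", "high"]

print(chi_square_statistic(transmission, label_dependent))  # 8.0
print(chi_square_statistic(transmission, label_unrelated))  # 0.0
```

A feature that fully determines the label gets a large statistic, while a feature whose values are spread evenly across the label classes gets zero.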

The Chi Sq Selector identifies the categorical features most strongly related to the target variable and keeps only those, which reduces the dimensionality of the dataset. Dropping irrelevant features removes noise and can improve the performance of machine learning models.

In [1]:
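import findspark
# Locate the Spark installation and add pyspark to sys.path; run this before importing pyspark
findspark.init()
from pyspark.sql import SparkSession
# Start a Spark session, or reuse one that is already running
spark = SparkSession.builder.appName("chisqselector").getOrCreate()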



In [2]:
from pyspark.ml.feature import RFormula, ChiSqSelector

In [3]:
cars = spark.read.csv("../0_data/Carros.csv", header=True, inferSchema=True, sep=";")
cars.show(5)

+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|Consumo|Cilindros|Cilindradas|RelEixoTraseiro|Peso|Tempo|TipoMotor|Transmissao|Marchas|Carburadors| HP|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
|     21|        6|        160|             39| 262| 1646|        0|          1|      4|          4|110|
|     21|        6|        160|             39|2875| 1702|        0|          1|      4|          4|110|
|    228|        4|        108|            385| 232| 1861|        1|          1|      4|          1| 93|
|    214|        6|        258|            308|3215| 1944|        1|          0|      3|          1|110|
|    187|        8|        360|            315| 344| 1702|        0|          0|      3|          2|175|
+-------+---------+-----------+---------------+----+-----+---------+-----------+-------+-----------+---+
only showing top 5 rows



In [4]:
# "HP ~ ." makes HP the label and packs every other column into one feature vector
r_formula = RFormula(formula="HP ~ .", featuresCol="independant", labelCol="dependant")
cars_rf = r_formula.fit(cars).transform(cars)
cars_rf.select("independant", "dependant").show(10, truncate=False)

+-----------------------------------------------------+---------+
|independant                                          |dependant|
+-----------------------------------------------------+---------+
|[21.0,6.0,160.0,39.0,262.0,1646.0,0.0,1.0,4.0,4.0]   |110.0    |
|[21.0,6.0,160.0,39.0,2875.0,1702.0,0.0,1.0,4.0,4.0]  |110.0    |
|[228.0,4.0,108.0,385.0,232.0,1861.0,1.0,1.0,4.0,1.0] |93.0     |
|[214.0,6.0,258.0,308.0,3215.0,1944.0,1.0,0.0,3.0,1.0]|110.0    |
|[187.0,8.0,360.0,315.0,344.0,1702.0,0.0,0.0,3.0,2.0] |175.0    |
|[181.0,6.0,225.0,276.0,346.0,2022.0,1.0,0.0,3.0,1.0] |105.0    |
|[143.0,8.0,360.0,321.0,357.0,1584.0,0.0,0.0,3.0,4.0] |245.0    |
|[244.0,4.0,1467.0,369.0,319.0,20.0,1.0,0.0,4.0,2.0]  |62.0     |
|[228.0,4.0,1408.0,392.0,315.0,229.0,1.0,0.0,4.0,2.0] |95.0     |
|[192.0,6.0,1676.0,392.0,344.0,183.0,1.0,0.0,4.0,4.0] |123.0    |
+-----------------------------------------------------+---------+
only showing top 10 rows
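
The formula `"HP ~ ."` reads as "HP explained by all other columns". The sketch below mimics only that column selection in plain Python, using the column names from the `cars.show(5)` header. `expand_formula` is a hypothetical helper written for this illustration, not a Spark API, and it ignores the other operators that RFormula supports.

```python
def expand_formula(formula, columns):
    """Split an R-style formula into (label, feature columns).

    Simplified illustration: handles only "label ~ ." and "label ~ a + b".
    """
    label, rhs = [part.strip() for part in formula.split("~")]
    if rhs == ".":
        # "." stands for every column except the label
        features = [c for c in columns if c != label]
    else:
        features = [term.strip() for term in rhs.split("+")]
    return label, features


columns = ["Consumo", "Cilindros", "Cilindradas", "RelEixoTraseiro", "Peso", "Tempo",
           "TipoMotor", "Transmissao", "Marchas", "Carburadors", "HP"]
label, features = expand_formula("HP ~ .", columns)
print(label)          # HP
print(len(features))  # 10
```

The ten feature names come out in the same order as the ten values in each `independant` vector shown above.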



In [6]:
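# selectorType="fdr" keeps the features whose chi-square p-values pass the
# Benjamini-Hochberg procedure at a false discovery rate of 1%
selector = ChiSqSelector(selectorType="fdr",
                         fdr=0.01,
                         featuresCol="independant",
                         outputCol="selected",
                         labelCol="dependant")
cars_selected = selector.fit(cars_rf).transform(cars_rf)
cars_selected.select("selected", "dependant").show(10, truncate=False)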

+--------------+---------+
|selected      |dependant|
+--------------+---------+
|[160.0,39.0]  |110.0    |
|[160.0,39.0]  |110.0    |
|[108.0,385.0] |93.0     |
|[258.0,308.0] |110.0    |
|[360.0,315.0] |175.0    |
|[225.0,276.0] |105.0    |
|[360.0,321.0] |245.0    |
|[1467.0,369.0]|62.0     |
|[1408.0,392.0]|95.0     |
|[1676.0,392.0]|123.0    |
+--------------+---------+
only showing top 10 rows
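
In the output above, the two values kept in each row match the Cilindradas and RelEixoTraseiro columns of the feature vector. With `selectorType="fdr"`, Spark documents the selection as a Benjamini-Hochberg procedure applied to the per-feature p-values. The sketch below shows that procedure in plain Python; the function name `benjamini_hochberg` and the p-values are assumptions invented for illustration, not values computed from `Carros.csv`.

```python
def benjamini_hochberg(p_values, fdr):
    """Return the sorted indices of features kept at the given false discovery rate."""
    m = len(p_values)
    # Feature indices ordered from smallest to largest p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Remember the largest rank k with p_(k) <= fdr * k / m
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    # Keep every feature ranked at or before that cutoff
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for five features
p_values = [0.20, 0.0004, 0.003, 0.5, 0.04]
print(benjamini_hochberg(p_values, fdr=0.01))  # [1, 2]
```

Lowering `fdr` makes the thresholds stricter and keeps fewer features, while raising it keeps more.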

