# Handling missing values

Let us read a CSV file that contains missing values.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Missing_values').getOrCreate()

In [2]:
# Reading the file with missing values
df_pyspark = spark.read.csv('data/names_and_ages_missing_val.csv', header = True , inferSchema = True , sep = ';')
df_pyspark.show()

+-------+----+----------+-----------+------+--------------------+
|   Name| Age|Experience|Salary(USD)|ID job|    Current Position|
+-------+----+----------+-----------+------+--------------------+
|  Alice|  25|         2|       9666|     3|     Project Manager|
|    Bob|  30|         4|       7226|     5|   Marketing Manager|
|Charlie|  22|         7|       7484|    10|  Operations Manager|
|  David|  35|        12|       7993|  NULL|                NULL|
|   Emma|  28|         9|       3170|     8|Customer Service ...|
|  Frank|NULL|      NULL|       NULL|     1|   Software Engineer|
|  Grace|  23|         3|       4815|     8|Customer Service ...|
|  Henry|  32|        14|       8611|     9|    Graphic Designer|
|  Irene|NULL|        25|       2896|  NULL|                NULL|
|   Jack|  33|      NULL|       NULL|     9|    Graphic Designer|
|  Karen|  26|         3|       5347|     6|   Financial Analyst|
|    Leo|  29|         1|       6668|     4|     Sales Associate|
|   NULL| 

## Using the method .drop

In [3]:
# Drop every row that contains at least one NULL value (the default is how='any')
df_pyspark.na.drop().show()

+--------+---+----------+-----------+------+--------------------+
|    Name|Age|Experience|Salary(USD)|ID job|    Current Position|
+--------+---+----------+-----------+------+--------------------+
|   Alice| 25|         2|       9666|     3|     Project Manager|
|     Bob| 30|         4|       7226|     5|   Marketing Manager|
| Charlie| 22|         7|       7484|    10|  Operations Manager|
|    Emma| 28|         9|       3170|     8|Customer Service ...|
|   Grace| 23|         3|       4815|     8|Customer Service ...|
|   Henry| 32|        14|       8611|     9|    Graphic Designer|
|   Karen| 26|         3|       5347|     6|   Financial Analyst|
|     Leo| 29|         1|       6668|     4|     Sales Associate|
|    Paul| 38|         1|       3162|     1|   Software Engineer|
|   Quinn| 21|        14|       8665|     5|   Marketing Manager|
|  Rachel| 34|         3|       5624|     8|Customer Service ...|
|Victoria| 20|         6|       1763|     8|Customer Service ...|
|  Xander|

In [4]:
# Drop only the rows in which every value is NULL.
# Rows with one, two or three NULL entries are kept,
# as long as at least one value in the row is not NULL.
df_pyspark.na.drop(how = 'all').show()

+-------+----+----------+-----------+------+--------------------+
|   Name| Age|Experience|Salary(USD)|ID job|    Current Position|
+-------+----+----------+-----------+------+--------------------+
|  Alice|  25|         2|       9666|     3|     Project Manager|
|    Bob|  30|         4|       7226|     5|   Marketing Manager|
|Charlie|  22|         7|       7484|    10|  Operations Manager|
|  David|  35|        12|       7993|  NULL|                NULL|
|   Emma|  28|         9|       3170|     8|Customer Service ...|
|  Frank|NULL|      NULL|       NULL|     1|   Software Engineer|
|  Grace|  23|         3|       4815|     8|Customer Service ...|
|  Henry|  32|        14|       8611|     9|    Graphic Designer|
|  Irene|NULL|        25|       2896|  NULL|                NULL|
|   Jack|  33|      NULL|       NULL|     9|    Graphic Designer|
|  Karen|  26|         3|       5347|     6|   Financial Analyst|
|    Leo|  29|         1|       6668|     4|     Sales Associate|
|   NULL| 

In [5]:
# Drop all the rows that contain at least one NULL value
df_pyspark.na.drop(how = 'any').show()

+--------+---+----------+-----------+------+--------------------+
|    Name|Age|Experience|Salary(USD)|ID job|    Current Position|
+--------+---+----------+-----------+------+--------------------+
|   Alice| 25|         2|       9666|     3|     Project Manager|
|     Bob| 30|         4|       7226|     5|   Marketing Manager|
| Charlie| 22|         7|       7484|    10|  Operations Manager|
|    Emma| 28|         9|       3170|     8|Customer Service ...|
|   Grace| 23|         3|       4815|     8|Customer Service ...|
|   Henry| 32|        14|       8611|     9|    Graphic Designer|
|   Karen| 26|         3|       5347|     6|   Financial Analyst|
|     Leo| 29|         1|       6668|     4|     Sales Associate|
|    Paul| 38|         1|       3162|     1|   Software Engineer|
|   Quinn| 21|        14|       8665|     5|   Marketing Manager|
|  Rachel| 34|         3|       5624|     8|Customer Service ...|
|Victoria| 20|         6|       1763|     8|Customer Service ...|
|  Xander|
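
The difference between `how = 'any'` and `how = 'all'` can be modeled in plain Python. This is only a sketch of the rule, not Spark's implementation: rows are assumed to be dictionaries, `None` stands for NULL, `keep_row` is a hypothetical helper, and the all-NULL row is made up for illustration.

```python
# Plain-Python model of the `how` rule; None stands for NULL.
def keep_row(row, how="any"):
    """Return True if the row would survive a drop with the given `how`."""
    null_count = sum(value is None for value in row.values())
    if how == "any":
        return null_count == 0        # dropped as soon as one value is NULL
    if how == "all":
        return null_count < len(row)  # dropped only if every value is NULL
    raise ValueError("how must be 'any' or 'all'")


rows = [
    {"Name": "Alice", "Age": 25, "Experience": 2},
    {"Name": "Frank", "Age": None, "Experience": None},
    {"Name": None, "Age": None, "Experience": None},  # hypothetical all-NULL row
]

kept_any = [r["Name"] for r in rows if keep_row(r, "any")]
kept_all = [r["Name"] for r in rows if keep_row(r, "all")]
print(kept_any)  # ['Alice']
print(kept_all)  # ['Alice', 'Frank']
```

With `'any'` a single NULL is enough to discard a row, whereas `'all'` discards a row only when nothing in it is known.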

In [6]:
# thresh = 1 keeps the rows with at least ONE non-null value (thresh overrides how)
df_pyspark.na.drop(how = 'any' , thresh = 1).show()

+-------+----+----------+-----------+------+--------------------+
|   Name| Age|Experience|Salary(USD)|ID job|    Current Position|
+-------+----+----------+-----------+------+--------------------+
|  Alice|  25|         2|       9666|     3|     Project Manager|
|    Bob|  30|         4|       7226|     5|   Marketing Manager|
|Charlie|  22|         7|       7484|    10|  Operations Manager|
|  David|  35|        12|       7993|  NULL|                NULL|
|   Emma|  28|         9|       3170|     8|Customer Service ...|
|  Frank|NULL|      NULL|       NULL|     1|   Software Engineer|
|  Grace|  23|         3|       4815|     8|Customer Service ...|
|  Henry|  32|        14|       8611|     9|    Graphic Designer|
|  Irene|NULL|        25|       2896|  NULL|                NULL|
|   Jack|  33|      NULL|       NULL|     9|    Graphic Designer|
|  Karen|  26|         3|       5347|     6|   Financial Analyst|
|    Leo|  29|         1|       6668|     4|     Sales Associate|
|   NULL| 

In [7]:
# thresh = 2 keeps the rows with at least TWO non-null values (thresh overrides how)
df_pyspark.na.drop(how = 'any' , thresh = 2).show()

+-------+----+----------+-----------+------+--------------------+
|   Name| Age|Experience|Salary(USD)|ID job|    Current Position|
+-------+----+----------+-----------+------+--------------------+
|  Alice|  25|         2|       9666|     3|     Project Manager|
|    Bob|  30|         4|       7226|     5|   Marketing Manager|
|Charlie|  22|         7|       7484|    10|  Operations Manager|
|  David|  35|        12|       7993|  NULL|                NULL|
|   Emma|  28|         9|       3170|     8|Customer Service ...|
|  Frank|NULL|      NULL|       NULL|     1|   Software Engineer|
|  Grace|  23|         3|       4815|     8|Customer Service ...|
|  Henry|  32|        14|       8611|     9|    Graphic Designer|
|  Irene|NULL|        25|       2896|  NULL|                NULL|
|   Jack|  33|      NULL|       NULL|     9|    Graphic Designer|
|  Karen|  26|         3|       5347|     6|   Financial Analyst|
|    Leo|  29|         1|       6668|     4|     Sales Associate|
|   NULL| 

In [8]:
# thresh = 3 keeps the rows with at least THREE non-null values (thresh overrides how)
df_pyspark.na.drop(how = 'any' , thresh = 3).show()

+-------+----+----------+-----------+------+--------------------+
|   Name| Age|Experience|Salary(USD)|ID job|    Current Position|
+-------+----+----------+-----------+------+--------------------+
|  Alice|  25|         2|       9666|     3|     Project Manager|
|    Bob|  30|         4|       7226|     5|   Marketing Manager|
|Charlie|  22|         7|       7484|    10|  Operations Manager|
|  David|  35|        12|       7993|  NULL|                NULL|
|   Emma|  28|         9|       3170|     8|Customer Service ...|
|  Frank|NULL|      NULL|       NULL|     1|   Software Engineer|
|  Grace|  23|         3|       4815|     8|Customer Service ...|
|  Henry|  32|        14|       8611|     9|    Graphic Designer|
|  Irene|NULL|        25|       2896|  NULL|                NULL|
|   Jack|  33|      NULL|       NULL|     9|    Graphic Designer|
|  Karen|  26|         3|       5347|     6|   Financial Analyst|
|    Leo|  29|         1|       6668|     4|     Sales Associate|
|   NULL| 
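
The `thresh` argument counts non-null values, not nulls: a row is kept when it has at least `thresh` non-null entries, and `how` is ignored once `thresh` is given. The sketch below models that rule in plain Python, assuming rows are dictionaries with `None` for NULL. `keep_row_thresh` is a hypothetical helper, and the two rows are copied from the table above.

```python
# Plain-Python model of the `thresh` rule; None stands for NULL.
def keep_row_thresh(row, thresh):
    """Keep the row if it has at least `thresh` non-null values."""
    non_null = sum(value is not None for value in row.values())
    return non_null >= thresh


rows = [
    {"Name": "Frank", "Age": None, "Experience": None,
     "Salary(USD)": None, "ID job": 1, "Current Position": "Software Engineer"},
    {"Name": "Irene", "Age": None, "Experience": 25,
     "Salary(USD)": 2896, "ID job": None, "Current Position": None},
]

# Both rows have exactly three non-null values.
kept_at_3 = [r["Name"] for r in rows if keep_row_thresh(r, 3)]
kept_at_4 = [r["Name"] for r in rows if keep_row_thresh(r, 4)]
print(kept_at_3)  # ['Frank', 'Irene']
print(kept_at_4)  # []
```

This is why Frank and Irene still appear in the outputs for thresholds 1, 2 and 3: each of them has three non-null values.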

In [9]:
# Drop the rows that have a NULL in any column listed in subset (here only 'Experience')
df_pyspark.na.drop(how = 'any', subset = ['Experience']).show()

+--------+----+----------+-----------+------+--------------------+
|    Name| Age|Experience|Salary(USD)|ID job|    Current Position|
+--------+----+----------+-----------+------+--------------------+
|   Alice|  25|         2|       9666|     3|     Project Manager|
|     Bob|  30|         4|       7226|     5|   Marketing Manager|
| Charlie|  22|         7|       7484|    10|  Operations Manager|
|   David|  35|        12|       7993|  NULL|                NULL|
|    Emma|  28|         9|       3170|     8|Customer Service ...|
|   Grace|  23|         3|       4815|     8|Customer Service ...|
|   Henry|  32|        14|       8611|     9|    Graphic Designer|
|   Irene|NULL|        25|       2896|  NULL|                NULL|
|   Karen|  26|         3|       5347|     6|   Financial Analyst|
|     Leo|  29|         1|       6668|     4|     Sales Associate|
|    NULL|  31|         0|       2706|     5|   Marketing Manager|
|    NULL|  24|         3|       6806|     2|      Data Scient
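
The effect of `subset` can be sketched the same way: only the listed columns are inspected when deciding whether a row has a NULL. This is a plain-Python model under the assumption that rows are dictionaries and `None` stands for NULL. `drop_any` is a hypothetical helper, and the rows reuse a few columns from the table above.

```python
# Plain-Python model of how='any' restricted to a subset of columns.
def drop_any(rows, subset=None):
    """Keep the rows that have no None in the inspected columns."""
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        if all(row[column] is not None for column in columns):
            kept.append(row)
    return kept


rows = [
    {"Name": "Alice", "Experience": 2, "ID job": 3},
    {"Name": "David", "Experience": 12, "ID job": None},
    {"Name": "Jack", "Experience": None, "ID job": 9},
]

names_all_columns = [r["Name"] for r in drop_any(rows)]
names_experience = [r["Name"] for r in drop_any(rows, subset=["Experience"])]
print(names_all_columns)  # ['Alice']
print(names_experience)   # ['Alice', 'David']
```

David survives the second call because his only NULL is in a column outside the subset.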

## Filling the missing values

In [10]:
df_pyspark.dtypes

[('Name', 'string'),
 ('Age', 'int'),
 ('Experience', 'int'),
 ('Salary(USD)', 'int'),
 ('ID job', 'int'),
 ('Current Position', 'string')]

If we want to fill the NULL values, we can use the function .na.fill(missing_value). Note that the type of missing_value determines which columns are filled: a string fills only string columns and a number fills only numeric columns, as the next cases show:

In [11]:
# Fill the NULL values in the string columns only; numeric columns are left untouched
missing_value = 'Missing Values'
df_pyspark.na.fill(missing_value).show()

+--------------+----+----------+-----------+------+--------------------+
|          Name| Age|Experience|Salary(USD)|ID job|    Current Position|
+--------------+----+----------+-----------+------+--------------------+
|         Alice|  25|         2|       9666|     3|     Project Manager|
|           Bob|  30|         4|       7226|     5|   Marketing Manager|
|       Charlie|  22|         7|       7484|    10|  Operations Manager|
|         David|  35|        12|       7993|  NULL|      Missing Values|
|          Emma|  28|         9|       3170|     8|Customer Service ...|
|         Frank|NULL|      NULL|       NULL|     1|   Software Engineer|
|         Grace|  23|         3|       4815|     8|Customer Service ...|
|         Henry|  32|        14|       8611|     9|    Graphic Designer|
|         Irene|NULL|        25|       2896|  NULL|      Missing Values|
|          Jack|  33|      NULL|       NULL|     9|    Graphic Designer|
|         Karen|  26|         3|       5347|     6|

In [12]:
# Fill the NULL values in the numeric columns only; string columns are left untouched
missing_value = 10000
df_pyspark.na.fill(missing_value).show()

+-------+-----+----------+-----------+------+--------------------+
|   Name|  Age|Experience|Salary(USD)|ID job|    Current Position|
+-------+-----+----------+-----------+------+--------------------+
|  Alice|   25|         2|       9666|     3|     Project Manager|
|    Bob|   30|         4|       7226|     5|   Marketing Manager|
|Charlie|   22|         7|       7484|    10|  Operations Manager|
|  David|   35|        12|       7993| 10000|                NULL|
|   Emma|   28|         9|       3170|     8|Customer Service ...|
|  Frank|10000|     10000|      10000|     1|   Software Engineer|
|  Grace|   23|         3|       4815|     8|Customer Service ...|
|  Henry|   32|        14|       8611|     9|    Graphic Designer|
|  Irene|10000|        25|       2896| 10000|                NULL|
|   Jack|   33|     10000|      10000|     9|    Graphic Designer|
|  Karen|   26|         3|       5347|     6|   Financial Analyst|
|    Leo|   29|         1|       6668|     4|     Sales Associ
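
The two outputs above suggest a simple rule: a fill value is applied only to columns whose type matches it. The sketch below models that rule in plain Python for string and int columns only. It is an illustration, not Spark's implementation. `fill_nulls` is a hypothetical helper, `None` stands for NULL, and the row and column types are taken from the tables above.

```python
# Plain-Python model of a type-matched fill; None stands for NULL.
def fill_nulls(rows, value, dtypes, subset=None):
    """Fill None with `value`, but only in columns whose type matches it."""
    value_is_string = isinstance(value, str)
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows unchanged
        for column, dtype in dtypes.items():
            if subset is not None and column not in subset:
                continue
            type_matches = (dtype == "string") == value_is_string
            if new_row[column] is None and type_matches:
                new_row[column] = value
        filled.append(new_row)
    return filled


dtypes = {"Name": "string", "Age": "int", "ID job": "int",
          "Current Position": "string"}
rows = [{"Name": "David", "Age": 35, "ID job": None, "Current Position": None}]

with_text = fill_nulls(rows, "Missing Values", dtypes)[0]
with_number = fill_nulls(rows, 10000, dtypes)[0]
print(with_text)    # 'Current Position' is filled, 'ID job' stays None
print(with_number)  # 'ID job' is filled, 'Current Position' stays None
```

Passing `subset` narrows the fill further to the listed columns, which is what the next cells do.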

We can also fill several columns at the same time with one value by passing a list of column names.

In [13]:
# Fill the NULL values of two string columns, 'Name' and 'Current Position'
df_pyspark.na.fill('Missing Values', ['Name' , 'Current Position']).show()

+--------------+----+----------+-----------+------+--------------------+
|          Name| Age|Experience|Salary(USD)|ID job|    Current Position|
+--------------+----+----------+-----------+------+--------------------+
|         Alice|  25|         2|       9666|     3|     Project Manager|
|           Bob|  30|         4|       7226|     5|   Marketing Manager|
|       Charlie|  22|         7|       7484|    10|  Operations Manager|
|         David|  35|        12|       7993|  NULL|      Missing Values|
|          Emma|  28|         9|       3170|     8|Customer Service ...|
|         Frank|NULL|      NULL|       NULL|     1|   Software Engineer|
|         Grace|  23|         3|       4815|     8|Customer Service ...|
|         Henry|  32|        14|       8611|     9|    Graphic Designer|
|         Irene|NULL|        25|       2896|  NULL|      Missing Values|
|          Jack|  33|      NULL|       NULL|     9|    Graphic Designer|
|         Karen|  26|         3|       5347|     6|

In [14]:
# Fill the NULL values of two int columns, 'Experience' and 'Salary(USD)', with 0
df_pyspark.na.fill(0, ['Experience' , 'Salary(USD)']).show()

+-------+----+----------+-----------+------+--------------------+
|   Name| Age|Experience|Salary(USD)|ID job|    Current Position|
+-------+----+----------+-----------+------+--------------------+
|  Alice|  25|         2|       9666|     3|     Project Manager|
|    Bob|  30|         4|       7226|     5|   Marketing Manager|
|Charlie|  22|         7|       7484|    10|  Operations Manager|
|  David|  35|        12|       7993|  NULL|                NULL|
|   Emma|  28|         9|       3170|     8|Customer Service ...|
|  Frank|NULL|         0|          0|     1|   Software Engineer|
|  Grace|  23|         3|       4815|     8|Customer Service ...|
|  Henry|  32|        14|       8611|     9|    Graphic Designer|
|  Irene|NULL|        25|       2896|  NULL|                NULL|
|   Jack|  33|         0|          0|     9|    Graphic Designer|
|  Karen|  26|         3|       5347|     6|   Financial Analyst|
|    Leo|  29|         1|       6668|     4|     Sales Associate|
|   NULL| 

In [15]:
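# The string '123a' is not a valid integer, so, as the output shows,
# the NULL values in the int column 'Age' are left unchanged
df_pyspark.fillna({'Age': '123a'}).show()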

+-------+----+----------+-----------+------+--------------------+
|   Name| Age|Experience|Salary(USD)|ID job|    Current Position|
+-------+----+----------+-----------+------+--------------------+
|  Alice|  25|         2|       9666|     3|     Project Manager|
|    Bob|  30|         4|       7226|     5|   Marketing Manager|
|Charlie|  22|         7|       7484|    10|  Operations Manager|
|  David|  35|        12|       7993|  NULL|                NULL|
|   Emma|  28|         9|       3170|     8|Customer Service ...|
|  Frank|NULL|      NULL|       NULL|     1|   Software Engineer|
|  Grace|  23|         3|       4815|     8|Customer Service ...|
|  Henry|  32|        14|       8611|     9|    Graphic Designer|
|  Irene|NULL|        25|       2896|  NULL|                NULL|
|   Jack|  33|      NULL|       NULL|     9|    Graphic Designer|
|  Karen|  26|         3|       5347|     6|   Financial Analyst|
|    Leo|  29|         1|       6668|     4|     Sales Associate|
|   NULL| 

We can also use the function .fillna() with a dictionary that maps each column name to its own fill value.

In [16]:
# Fill null values per column: 'Name' with 'Missing Values' and 'Experience' with 0
df_pyspark.fillna({'Name': 'Missing Values', 'Experience': 0}).show()

+--------------+----+----------+-----------+------+--------------------+
|          Name| Age|Experience|Salary(USD)|ID job|    Current Position|
+--------------+----+----------+-----------+------+--------------------+
|         Alice|  25|         2|       9666|     3|     Project Manager|
|           Bob|  30|         4|       7226|     5|   Marketing Manager|
|       Charlie|  22|         7|       7484|    10|  Operations Manager|
|         David|  35|        12|       7993|  NULL|                NULL|
|          Emma|  28|         9|       3170|     8|Customer Service ...|
|         Frank|NULL|         0|       NULL|     1|   Software Engineer|
|         Grace|  23|         3|       4815|     8|Customer Service ...|
|         Henry|  32|        14|       8611|     9|    Graphic Designer|
|         Irene|NULL|        25|       2896|  NULL|                NULL|
|          Jack|  33|         0|       NULL|     9|    Graphic Designer|
|         Karen|  26|         3|       5347|     6|

## Imputer

The `Imputer` class is used for imputing missing values in DataFrame columns. Imputation is the process of replacing missing values with substituted values, typically based on statistical measures such as mean, median, or mode.
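
To make the idea concrete, here is a minimal plain-Python sketch of mean and median imputation on a single column. It only illustrates the concept and is not how Spark computes its statistics. `impute` is a hypothetical helper, `None` stands for a missing value, and the `ages` list is made-up data.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace None with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError("strategy must be 'mean' or 'median'")
    return [fill if v is None else v for v in values]


ages = [20, 22, None, 36]  # illustrative data, not taken from the CSV file

imputed_mean = impute(ages, "mean")
imputed_median = impute(ages, "median")
print(imputed_mean)    # [20, 22, 26, 36]
print(imputed_median)  # [20, 22, 22, 36]
```

The statistic is computed from the observed values only, so the missing entries do not influence the value that replaces them.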

In [17]:
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=['Age', 'Experience', 'Salary(USD)'],
    outputCols=['{}_imputed'.format(c) for c in ['age', 'Experience', 'Salary(USD)']]
).setStrategy('mean')

Using the Imputer class, we can create new columns in which the missing values of each numeric input column are replaced by the mean of that column.

In [18]:
# To add the imputed columns to df_pyspark, we fit the imputer and then transform the DataFrame
imputer.fit(df_pyspark).transform(df_pyspark).show()

+-------+----+----------+-----------+------+--------------------+-----------+------------------+-------------------+
|   Name| Age|Experience|Salary(USD)|ID job|    Current Position|age_imputed|Experience_imputed|Salary(USD)_imputed|
+-------+----+----------+-----------+------+--------------------+-----------+------------------+-------------------+
|  Alice|  25|         2|       9666|     3|     Project Manager|         25|                 2|               9666|
|    Bob|  30|         4|       7226|     5|   Marketing Manager|         30|                 4|               7226|
|Charlie|  22|         7|       7484|    10|  Operations Manager|         22|                 7|               7484|
|  David|  35|        12|       7993|  NULL|                NULL|         35|                12|               7993|
|   Emma|  28|         9|       3170|     8|Customer Service ...|         28|                 9|               3170|
|  Frank|NULL|      NULL|       NULL|     1|   Software Engineer