<h1 align="center"> Prepare Data for Modeling</h1>

# Table of Contents

- [Duplicates](#Duplicates)
- [Missing Observations](#Missing-Observations)

All data is dirty, irrespective of what the source of the data might lead you to believe: it might be your colleague, a telemetry system that monitors your environment, a dataset you download from the web, or some other source. Until you have tested and proven to yourself that your data is in a clean state, you should neither trust it nor use it for modeling.

Your data can be stained with duplicates, missing observations and outliers, non-existent addresses, wrong phone numbers and area codes, inaccurate geographical coordinates, wrong dates, incorrect labels, mixtures of upper and lower cases, trailing spaces, and many other more subtle problems. It is your job to clean it, irrespective of whether you are a data scientist or data engineer, so you can build a statistical or machine learning model.


Your dataset is considered technically clean if none of the aforementioned problems can be found. However, to clean the dataset for modeling purposes, you also need to check the distributions of your features and confirm they fit the predefined criteria.

As a data scientist, you can expect to spend 80-90% of your time **massaging** your data and getting familiar with all the features. This chapter will guide you through that process, leveraging Spark capabilities.

# Duplicates
[back to top](#Table-of-Contents)

Duplicates are observations that appear as distinct rows in your dataset, but which, upon closer inspection, look the same. That is, if you looked at them side by side, all the features in these two (or more) rows would have exactly the same values.

On the other hand, if your data has some form of an ID to distinguish between records (or associate them with certain users, for example), then what might initially appear as a duplicate may not be; sometimes systems fail and produce erroneous IDs. In such a situation, you need to either check whether the same ID is a real duplicate, or you need to come up with a new ID system.

In [5]:
df = spark.createDataFrame([
        (1, 144.5, 5.9, 33, 'M'),
        (2, 167.2, 5.4, 45, 'M'),
        (3, 124.1, 5.2, 23, 'F'),
        (4, 144.5, 5.9, 33, 'M'),
        (5, 133.2, 5.7, 54, 'F'),
        (3, 124.1, 5.2, 23, 'F'),
        (5, 129.2, 5.3, 42, 'M'),
    ], ['id', 'weight', 'height', 'age', 'gender'])

In [6]:
# Check for duplicates
print('Count of rows: {0}'.format(df.count()))
print('Count of distinct rows: {0}'.format(df.distinct().count()))

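The two counts differ (7 rows, but only 6 distinct ones), so at least one row is an exact copy of another. We can drop such rows by using the ```.dropDuplicates(...)``` method.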

In [8]:
df = df.dropDuplicates()
df.show()

In [9]:
# Compare the row count with the number of distinct rows when id is ignored
print('Count of rows: {0}'.format(df.count()))
print('Count of distinct rows, ignoring id: {0}'.format(df.select([c for c in df.columns if c != 'id']).distinct().count()))

We still have one more duplicate: two rows that differ only in their id. We call ```.dropDuplicates(...)``` again, this time with the ```subset``` parameter.

The ```subset``` parameter instructs the ```.dropDuplicates(...)``` method to look for duplicated rows using only the columns specified; in the following example, we drop the records that share the same weight, height, age, and gender, regardless of their id.

In [11]:
df = df.dropDuplicates(subset=[c for c in df.columns if c != 'id'])
df.show()
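To make the rule concrete without a cluster, here is a minimal plain-Python sketch of de-duplicating on a subset of columns. The helper `drop_duplicates` is hypothetical and written only for this illustration; it keeps the first row it sees for each key, which is an assumption of the sketch, since Spark should not be relied on to keep any particular one of the matching rows.

```python
# Plain-Python sketch (no Spark needed); `drop_duplicates` is a hypothetical helper.
columns = ['id', 'weight', 'height', 'age', 'gender']

# The rows left after the exact duplicate of id 3 has been removed.
rows = [
    (1, 144.5, 5.9, 33, 'M'),
    (2, 167.2, 5.4, 45, 'M'),
    (3, 124.1, 5.2, 23, 'F'),
    (4, 144.5, 5.9, 33, 'M'),
    (5, 133.2, 5.7, 54, 'F'),
    (5, 129.2, 5.3, 42, 'M'),
]


def drop_duplicates(rows, columns, subset):
    """Keep the first row seen for each distinct combination of `subset` values."""
    positions = [columns.index(c) for c in subset]
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[i] for i in positions)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Compare rows on everything except the id column.
subset = [c for c in columns if c != 'id']
deduplicated = drop_duplicates(rows, columns, subset)
```

Rows 1 and 4 share the same weight, height, age, and gender, so only one of them survives; the two rows with id 5 differ in their features, so both stay.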

To calculate the total and distinct number of IDs in one step, we can use the ```.agg(...)``` method.

In [13]:
import pyspark.sql.functions as fn

df.agg(
    fn.count('id').alias('count'),
    fn.countDistinct('id').alias('distinct')
).show()

We use ```.count(...)``` and ```.countDistinct(...)``` to calculate, respectively, the number of rows and the number of distinct ids in our DataFrame. The ```.alias(...)``` method lets us give the returned column a friendly name.

As you can see, we have five rows in total, but only four distinct IDs. Since we have already dropped all the duplicates, it is reasonable to assume that this is just a fluke in our ID data, so we will give each row a unique ID:

In [15]:
# Give each row a unique ID.
df.withColumn('new_id', fn.monotonically_increasing_id()).show()
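The generated IDs are unique and increasing, but not consecutive. According to the Spark documentation's description of the current implementation, the partition index goes into the upper 31 bits and the record number within the partition into the lower 33 bits. The plain-Python sketch below mimics that layout; `sketch_monotonic_id` is a hypothetical name, and the two partitions of three rows each are made up for illustration.

```python
# Plain-Python sketch of the documented bit layout; not Spark's actual code.
def sketch_monotonic_id(partition_index, row_index_in_partition):
    """Partition index in the upper bits, per-partition row number in the lower 33 bits."""
    return (partition_index << 33) + row_index_in_partition


# Two hypothetical partitions holding three rows each.
ids = [
    sketch_monotonic_id(partition, row)
    for partition in range(2)
    for row in range(3)
]
# ids is [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

This is why the new IDs can jump by billions between rows: the jump marks the boundary between partitions, not missing data.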

# Missing Observations
[back to top](#Table-of-Contents)

Missing values can occur for a variety of reasons: system failures, human error, and data schema changes, to name just a few.

The simplest way to deal with missing values, if your data can afford it, is to drop the whole observation when any missing value is found. You have to be careful not to drop too many: depending on the distribution of the missing values across your dataset it might severely affect the usability of your dataset. If, after dropping the rows, I end up with a very small dataset, or find that the reduction in data size is more than 50%, I start checking my data to see what features have the most holes in them and perhaps exclude those altogether; if a feature has most of its values missing (unless a missing value bears a meaning), from a modeling point of view, it is fairly useless.

The other way to deal with observations that have missing values is to impute some value in place of each ```None```. Depending on the type of your data, you have several options to choose from:



- If your data is a discrete Boolean, you can turn it into a categorical variable by adding a third category — Missing
- If your data is already categorical, you can simply extend the number of levels and add the Missing category as well
- If you're dealing with ordinal or numerical data, you can impute either mean, median, or some other predefined value (for example, first or third quartile, depending on the distribution shape of your data)
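The three options above can be sketched in plain Python before we turn to Spark. The columns `is_member`, `gender`, and `age` below are hypothetical values made up for this illustration, and `impute` is a helper written only for the sketch.

```python
from statistics import mean, median


def impute(values, fill):
    """Replace every None in `values` with `fill`."""
    return [fill if v is None else v for v in values]


# Hypothetical columns, made up for this illustration.
is_member = [True, None, False]          # Boolean
gender = ['M', None, 'F']                # categorical
age = [28, 45, None, 33, 54, None, 42]   # numerical

# Boolean: turn it into a three-level categorical ('True', 'False', 'Missing').
is_member_cat = impute(
    [None if v is None else str(v) for v in is_member], 'Missing')

# Categorical: add 'Missing' as one more level.
gender_cat = impute(gender, 'Missing')

# Numerical: compute the statistic on the observed values only, then fill.
observed_age = [v for v in age if v is not None]
age_mean = impute(age, mean(observed_age))
age_median = impute(age, median(observed_age))
```

Which statistic to use for numerical data depends on the shape of the distribution; the median is less sensitive to extreme values than the mean.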

In [17]:
df_miss = spark.createDataFrame([
        (1, 143.5, 5.6, 28,   'M',  100000),
        (2, 167.2, 5.4, 45,   'M',  None),
        (3, None , 5.2, None, None, None),
        (4, 144.5, 5.9, 33,   'M',  None),
        (5, 133.2, 5.7, 54,   'F',  None),
        (6, 124.1, 5.2, None, 'F',  None),
        (7, 129.2, 5.3, 42,   'M',  76000),
    ], ['id', 'weight', 'height', 'age', 'gender', 'income'])

To find the number of missing observations per row we can use the following snippet.

In [19]:
# Count the None values in each row, keyed by the row's id
df_miss.rdd.map(
    lambda row: (row['id'], sum(c is None for c in row))
).collect()
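For reference, the same per-row count can be reproduced in plain Python on the seven tuples used to build ```df_miss```. This is only a cross-check of the logic, assuming the data shown above; it does not touch Spark.

```python
# Plain-Python cross-check of the per-row count of missing values.
rows = [
    (1, 143.5, 5.6, 28,   'M',  100000),
    (2, 167.2, 5.4, 45,   'M',  None),
    (3, None,  5.2, None, None, None),
    (4, 144.5, 5.9, 33,   'M',  None),
    (5, 133.2, 5.7, 54,   'F',  None),
    (6, 124.1, 5.2, None, 'F',  None),
    (7, 129.2, 5.3, 42,   'M',  76000),
]

# The id is the first field of every tuple.
missing_per_row = [(row[0], sum(value is None for value in row)) for row in rows]
# missing_per_row is [(1, 0), (2, 1), (3, 4), (4, 1), (5, 1), (6, 2), (7, 0)]
```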

The row with ID 3 has four missing observations, more than any other. Let's see which values are missing, so that when we count the missing observations in each column we can decide whether to drop the observation altogether or impute some of its values.

In [21]:
df_miss.where('id == 3').show()

Here is what we get:
![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_04_08.jpg)

Let's now check what percentage of observations is missing in each column:

In [24]:
df_miss.agg(*[
    (1 - (fn.count(c) / fn.count('*'))).alias(c + '_missing')
    for c in df_miss.columns
  ]).show()

This is what we get:
![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_04_09.jpg)

**Note:**

The ```*``` argument to the ```.count(...)``` method (in place of a column name) instructs the method to count all rows. On the other hand, the ```*``` preceding the list declaration instructs the ```.agg(...)``` method to treat the list as a set of separate parameters passed to the function.
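The arithmetic behind these fractions is simple enough to check by hand: for each column, divide the number of ```None``` values by the number of rows. A plain-Python sketch, assuming the seven tuples used to build ```df_miss```:

```python
# Plain-Python check of the fraction of missing values per column.
columns = ['id', 'weight', 'height', 'age', 'gender', 'income']
rows = [
    (1, 143.5, 5.6, 28,   'M',  100000),
    (2, 167.2, 5.4, 45,   'M',  None),
    (3, None,  5.2, None, None, None),
    (4, 144.5, 5.9, 33,   'M',  None),
    (5, 133.2, 5.7, 54,   'F',  None),
    (6, 124.1, 5.2, None, 'F',  None),
    (7, 129.2, 5.3, 42,   'M',  76000),
]

total = len(rows)

# Same quantity as 1 - count(column) / count(*): the share of None values.
missing_fraction = {
    name: sum(row[i] is None for row in rows) / total
    for i, name in enumerate(columns)
}
```

With seven rows, one missing value is 1/7 (about 14%), two are 2/7 (about 29%), and five are 5/7 (about 71%).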

So, about 14% of the observations are missing in the weight and gender columns, twice as much in the age column, and roughly 71% in the income column; the id and height columns are complete. Now we know what to do. First, we will drop the 'income' feature, as most of its values are missing.

In [26]:
df_miss_no_income = df_miss.select([
    c for c in df_miss.columns if c != 'income'
  ])
df_miss_no_income.show()

If you decide to drop the observations instead, you can use the ```.dropna(...)``` method, as shown here. Here, we will also use the ```thresh``` parameter, which sets the minimum number of non-missing values a row must have in order to be kept; rows with fewer are dropped. This is useful if you have a dataset with tens or hundreds of features and you only want to drop the rows that are missing too many of them:

In [28]:
df_miss_no_income.dropna(thresh=3).show()

Here is the output:
![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_04_10.jpg)
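Because ```thresh``` counts the values that are present rather than the ones that are missing, it is easy to read backwards. The plain-Python sketch below applies the same rule to the rows of ```df_miss_no_income```; `dropna_thresh` is a hypothetical helper written for this illustration, not Spark's implementation.

```python
# Plain-Python sketch of the thresh rule; `dropna_thresh` is a hypothetical helper.
rows = [
    (1, 143.5, 5.6, 28,   'M'),
    (2, 167.2, 5.4, 45,   'M'),
    (3, None,  5.2, None, None),
    (4, 144.5, 5.9, 33,   'M'),
    (5, 133.2, 5.7, 54,   'F'),
    (6, 124.1, 5.2, None, 'F'),
    (7, 129.2, 5.3, 42,   'M'),
]


def dropna_thresh(rows, thresh):
    """Keep only the rows that have at least `thresh` non-None values."""
    return [
        row for row in rows
        if sum(value is not None for value in row) >= thresh
    ]


kept = dropna_thresh(rows, thresh=3)
kept_ids = [row[0] for row in kept]
# kept_ids is [1, 2, 4, 5, 6, 7]
```

Row 3 has only two values present (id and height), which is below the threshold of three, so it is the only row dropped; row 6 is missing its age but still has four values and is kept.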

On the other hand, if you want to impute the observations, you can use the ```.fillna(...)``` method. This method accepts a single integer (long is also accepted), float, or string; missing values in every column of a matching type are then filled in with that value (a number fills only numeric columns, a string only string columns). You can also pass a dictionary of the form ```{'<colName>': <value_to_impute>}```. This has the same limitation, in that, as the ```<value_to_impute>```, you can only pass an integer, float, or string.

If you want to impute a mean, median, or other calculated value, you need to first calculate the value, create a dictionary with such values, and then pass it to the ```.fillna(...)``` method.

In [30]:
means = df_miss_no_income.agg(
          *[fn.mean(c).alias(c)
           for c in df_miss_no_income.columns if c != 'gender']
          ).toPandas().to_dict('records')[0]


means['gender'] = 'missing'

df_miss_no_income.fillna(means).show()

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_04_11.jpg)

**Note that** calling ```.toPandas()``` can be problematic, as the method works in essentially the same way as ```.collect()``` on RDDs: it collects all the information from the workers and brings it over to the driver. It is unlikely to be a problem here, because the aggregated result is a single row whose width grows only with the number of features; you would need thousands upon thousands of features for it to matter.