<h1 align="center"> Prepare Data for Modeling</h1>

# Table of Contents

- [Duplicates](#Duplicates)

Assume that all data is dirty, whatever its source might lead you to believe, whether that source is a colleague, a telemetry system that monitors your environment, a dataset you download from the web, or something else. Until you have tested and proven to yourself that your data is in a clean state, you should neither trust it nor use it for modeling.

Your data can be stained with duplicates, missing observations and outliers, non-existent addresses, wrong phone numbers and area codes, inaccurate geographical coordinates, wrong dates, incorrect labels, mixtures of upper and lower cases, trailing spaces, and many other more subtle problems. It is your job to clean it, irrespective of whether you are a data scientist or data engineer, so you can build a statistical or machine learning model.


Your dataset is considered technically clean if none of the aforementioned problems can be found. However, to clean the dataset for modeling purposes, you also need to check the distributions of your features and confirm that they meet the criteria you have defined for your model.

As a data scientist, you can expect to spend 80-90% of your time **massaging** your data and getting familiar with all the features. This chapter will guide you through that process, leveraging Spark capabilities.

# Duplicates
[back to top](#Table-of-Contents)

Duplicates are observations that appear as distinct rows in your dataset, but which, upon closer inspection, look the same. That is, if you looked at them side by side, all the features in these two (or more) rows would have exactly the same values.

If your data carries some form of ID that distinguishes records (or associates them with particular users, for example), the picture is less clear-cut: rows that share an ID may not be duplicates at all, because systems sometimes fail and produce erroneous IDs. In that situation, you need to either check whether the rows sharing an ID are real duplicates or come up with a new ID system.
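
One way to tell the two cases apart is to group the records by ID and look at how many distinct sets of feature values each ID carries. The following is a minimal plain-Python sketch of that check, not Spark code; the sample records and all variable names are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample records: (id, weight, height)
records = [
    (1, 144.5, 5.9),
    (3, 124.1, 5.2),
    (3, 124.1, 5.2),  # same id, same values: a real duplicate
    (5, 133.2, 5.7),
    (5, 129.2, 5.3),  # same id, different values: a suspect id
]

# How many times each id occurs
occurrences = Counter(rec[0] for rec in records)

# Which distinct feature tuples each id is attached to
features_by_id = defaultdict(set)
for rec_id, *features in records:
    features_by_id[rec_id].add(tuple(features))

# Repeated id with a single set of values: the rows are real duplicates
real_duplicates = sorted(
    i for i, n in occurrences.items()
    if n > 1 and len(features_by_id[i]) == 1
)

# One id attached to several different sets of values: the id is unreliable
conflicting_ids = sorted(
    i for i, feats in features_by_id.items() if len(feats) > 1
)

print(real_duplicates)   # [3]
print(conflicting_ids)   # [5]
```

IDs in the first list can be deduplicated safely; IDs in the second list point at records that need a closer look or a new ID.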

In [5]:
df = spark.createDataFrame([
        (1, 144.5, 5.9, 33, 'M'),
        (2, 167.2, 5.4, 45, 'M'),
        (3, 124.1, 5.2, 23, 'F'),
        (4, 144.5, 5.9, 33, 'M'),
        (5, 133.2, 5.7, 54, 'F'),
        (3, 124.1, 5.2, 23, 'F'),
        (5, 129.2, 5.3, 42, 'M'),
    ], ['id', 'weight', 'height', 'age', 'gender'])

In [6]:
# Check for duplicates
print('Count of rows: {0}'.format(df.count()))
print('Count of distinct rows: {0}'.format(df.distinct().count()))

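If the two counts differ, some rows are exact copies of others. Here we have seven rows but only six distinct ones, so one row is an exact duplicate. We can drop such rows by using the ```.dropDuplicates(...)``` method.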

In [8]:
df = df.dropDuplicates()
df.show()

In [9]:
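# Let's confirm: compare the number of rows with the number of
# distinct rows when the 'id' column is ignored
print('Count of rows: {0}'.format(df.count()))
print('Count of distinct rows, ignoring id: {0}'.format(df.select([c for c in df.columns if c != 'id']).distinct().count()))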

We still have one more duplicate: two rows that are identical except for their ```id```. We can use ```.dropDuplicates(...)``` again, this time with the ```subset``` parameter.

The ```subset``` parameter instructs the ```.dropDuplicates(...)``` method to look for duplicated rows using only the columns specified; in the following example, we drop the records that share the same weight, height, age, and gender, regardless of their id.

In [11]:
df = df.dropDuplicates(subset=[c for c in df.columns if c != 'id'])
df.show()
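
To make the difference between comparing whole rows and comparing a subset of columns concrete, here is a minimal plain-Python sketch that mimics the two steps on the same seven records. It is only a model of the behavior described above, not how Spark implements it: the ```drop_duplicates``` helper is hypothetical, and it always keeps the first record it sees, whereas you should not rely on Spark keeping any particular one of a set of duplicates.

```python
# The same seven records as in the DataFrame, as plain tuples:
# (id, weight, height, age, gender)
rows = [
    (1, 144.5, 5.9, 33, 'M'),
    (2, 167.2, 5.4, 45, 'M'),
    (3, 124.1, 5.2, 23, 'F'),
    (4, 144.5, 5.9, 33, 'M'),
    (5, 133.2, 5.7, 54, 'F'),
    (3, 124.1, 5.2, 23, 'F'),
    (5, 129.2, 5.3, 42, 'M'),
]


def drop_duplicates(records, key=lambda r: r):
    """Keep the first record seen for each key (hypothetical helper)."""
    seen = set()
    kept = []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            kept.append(record)
    return kept


# Step 1: compare whole rows, id included
exact = drop_duplicates(rows)

# Step 2: compare every column except id (position 0)
by_features = drop_duplicates(exact, key=lambda r: r[1:])

print(len(rows), len(exact), len(by_features))  # 7 6 5
```

The three numbers are the row counts before any cleaning, after dropping exact duplicates, and after dropping rows that differ only in their id.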

To calculate the total and the distinct number of IDs in one step, we can use the ```.agg(...)``` method.

In [13]:
import pyspark.sql.functions as fn

df.agg(
    fn.count('id').alias('count'),
    fn.countDistinct('id').alias('distinct')
).show()

We use ```.count(...)``` and ```.countDistinct(...)``` to calculate, respectively, the number of rows and the number of distinct IDs in our DataFrame. (Strictly speaking, ```.count('id')``` counts the non-null ```id``` values, which here is the same as the number of rows.) The ```.alias(...)``` method lets us give the returned column a friendly name.

As you can see, we have five rows in total, but only four distinct IDs. Since we have already dropped all the duplicates, we can assume that the repeated ID is simply a fluke in our ID data, so we will give each row a unique ID:

In [15]:
# Give each row a unique ID.
df.withColumn('new_id', fn.monotonically_increasing_id()).show()
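
Note that the IDs produced this way are unique and increasing, but not consecutive. The Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The plain-Python sketch below models that layout for a hypothetical DataFrame with two partitions of three rows each; ```sketch_monotonic_id``` is an illustrative name, not a Spark function.

```python
def sketch_monotonic_id(partition_id, row_in_partition):
    """Model of the documented layout: partition ID in the upper bits,
    record number within the partition in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition


# Assumption for illustration: 2 partitions, 3 rows in each
ids = [
    sketch_monotonic_id(partition, row)
    for partition in range(2)
    for row in range(3)
]

print(ids)
# [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

The jump between the third and fourth values marks the boundary between partitions, so a large gap in ```new_id``` does not mean that rows are missing.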