
# The data preprocessing pipeline

Like all data scientists, we need to preprocess (clean) our data before building a machine learning model. In fact, most of the work consists of understanding the data through **exploratory analysis**, **visualizations**, and **cleaning**. This includes plotting the data to see what they look like and extracting descriptive statistics to get a better feel for what is going on.

Preparation of data can be done in many ways and will depend on the features available in the data. For example, we need to decide how to handle **missing values**: sometimes we remove the rows or columns with missing data, while other times we impute the missing values, that is, fill them in with an educated guess based on the other data we have.
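
As a minimal sketch of these two options, the snippet below drops or imputes a missing value in a small made-up DataFrame. The column names `age` and `income` and all values are invented for illustration, and mean imputation is only one of several possible choices.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): one missing value in "age"
toy = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0],
                    "income": [50.0, 62.0, 58.0, 45.0]})

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: impute, here by replacing NaN with the column mean
imputed = toy.fillna({"age": toy["age"].mean()})

print(dropped.shape)            # (3, 2)
print(imputed["age"].tolist())  # [25.0, 32.0, 40.0, 31.0]
```

Dropping loses the whole row, including the observed `income`, while imputing keeps every row at the cost of introducing a guessed value.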

Other data preparation may include **engineering new features** out of existing features, or in the case of numerical features, **scaling** the data. Scaling is important when different features are measured on different scales. It puts all features on the same scale, because many algorithms (for example, those based on distances or gradient descent) can predict poorly when their inputs are on very different scales.
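
Two common ways to scale are standardization and min-max scaling. The sketch below applies both with plain `pandas` to a made-up DataFrame; the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Toy data (made up): two features on very different scales
toy = pd.DataFrame({"height_cm": [150.0, 160.0, 170.0, 180.0],
                    "income_usd": [20000.0, 40000.0, 60000.0, 80000.0]})

# Standardization: subtract the column mean, divide by the column
# standard deviation (population version, ddof=0)
standardized = (toy - toy.mean()) / toy.std(ddof=0)

# Min-max scaling: map each column linearly onto [0, 1]
minmax = (toy - toy.min()) / (toy.max() - toy.min())

print(standardized.round(3))
print(minmax.round(3))
```

After either transformation the two columns are directly comparable, even though the raw values differ by orders of magnitude.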

Preparation of categorical data differs from that of numerical data. For instance, we cannot put words into a mathematical model and must therefore represent them numerically before using them in an algorithm. However, categories are not on a scale, even if they are coded as numbers. For example, suppose a feature records color preferences and takes the values blue, red, and green. We cannot simply convert the colors into the numbers 1, 2, and 3, because those numbers carry an order and distances that the colors do not have.

The distance between 1 and 2 is equal to the distance between 2 and 3. But what would that mean for colors? Would it mean that someone likes blue better than red, or red better than green, and by the same amount? Clearly, neither the order nor the distances make sense here.

Therefore, we need to not only represent categorical variables with numbers, but do so in a way that makes sense as well. Depending on the nature of your data, this may be as simple as **one-hot encoding**, or something more complicated such as **TF-IDF** for text. If we are doing natural language processing, we may have further work to do, such as **removing stop words** and **stemming**.
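
As a small sketch of one-hot encoding, the snippet below encodes the color example with `pd.get_dummies`. The DataFrame and its `color` column are made up for illustration.

```python
import pandas as pd

# Toy data (made up): the color preference example from the text
toy = pd.DataFrame({"color": ["blue", "red", "green", "blue"]})

# One-hot encoding: one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=["color"], dtype=int)

print(encoded.columns.tolist())
# ['color_blue', 'color_green', 'color_red']
print(encoded["color_blue"].tolist())
# [1, 0, 0, 1]
```

Each row has exactly one indicator set to 1, so no category is treated as larger than or closer to another.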

In the following, we will try to build a data preprocessing pipeline for analysis. We will use a simple dataset that has certain features so we can experiment with ways to process data. In the future, this can be used and modified in accordance with your data needs.

## UCI Breast Cancer Data

The first data set is the [UCI Breast Cancer data](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic). The data only have numerical values.

Let's first import the data directly from the web as a `pandas` DataFrame.


In [29]:
import pandas as pd
import numpy as np

In [None]:
!pip install dfply

In [5]:
from dfply import *

In [52]:
df1 = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",
                  header = None,
                  names=['ID', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'y'],
                  usecols=['ID', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'y'])

In [10]:
# Check the first few lines of the data frame
df1.head()

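| | ID | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |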


In [12]:
# Check variable types
df1.dtypes

ID     int64
x1     int64
x2     int64
x3     int64
x4     int64
x5     int64
x6    object
x7     int64
x8     int64
x9     int64
y      int64
dtype: object

In [13]:
# Check some summary stats for numerical variables
df1.describe()

| | ID | x1 | x2 | x3 | x4 | x5 | x7 | x8 | x9 | y |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 1071704.0 | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 617095.7 | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 61634.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 870688.5 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 1171710.0 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 1238298.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 13454350.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |


The first thing we want to do is to check that the patient IDs (`ID`) are unique. In the code below, we find that there are 54 duplicates.

Since it is unclear why there are duplicates, we *decide* to remove the duplicated IDs.

In [53]:
# Use dfply to see whether IDs are unique
# Remove them if there are duplicated IDs
(df1
 >> summarize(distinct_id = n_distinct(X.ID),
              sum_id = n(X.ID))
 >> mutate(diff = X.sum_id - X.distinct_id)
 )

| | distinct_id | sum_id | diff |
|---|---|---|---|
| 0 | 645 | 699 | 54 |


In [54]:
df1 = df1 >> distinct(X.ID)
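
The same check and removal can also be written in plain `pandas`. The sketch below uses a made-up DataFrame standing in for `df1`, and it assumes that keeping the first row for each ID matches what `distinct(X.ID)` does.

```python
import pandas as pd

# Toy frame (made up) standing in for df1: ID 2 appears twice
toy = pd.DataFrame({"ID": [1, 2, 2, 3], "x1": [5, 3, 4, 6]})

# Compare the number of rows with the number of distinct IDs
n_rows = len(toy)
n_distinct = toy["ID"].nunique()
print(n_rows - n_distinct)  # 1

# Keep only the first row for each ID (assumed to mirror distinct(X.ID))
deduped = toy.drop_duplicates(subset="ID", keep="first")
print(deduped["ID"].tolist())  # [1, 2, 3]
```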

The feature `x6` is listed as dtype object above. All other features are listed as integers. So we begin by selecting `x6` only and listing all its distinct values.

In [55]:
df1 >> group_by(X.x6) >> summarize(n = n(X.x6))

| | x6 | n |
|---|---|---|
| 0 | 1 | 362 |
| 1 | 10 | 126 |
| 2 | 2 | 27 |
| 3 | 3 | 28 |
| 4 | 4 | 19 |
| 5 | 5 | 28 |
| 6 | 6 | 4 |
| 7 | 7 | 8 |
| 8 | 8 | 19 |
| 9 | 9 | 8 |


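The feature `x6` contains 16 entries recorded as `?`, which is why it is read as dtype object. (The counts listed above sum to 629, leaving 16 of the 645 rows unaccounted for.)

We have more than 600 observations in the data set, so fewer than 5% of the rows have a question mark in feature `x6` (16 of 645, or about 2.5%).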

+ We would be comfortable dropping these rows.

+ Imputing the missing data would be fine as well.

+ If we can afford to do so computationally and time-wise, it likely pays to run the model twice: once with the data dropped, and once with them imputed.

In the following, we will try to impute these missing values.

First, we convert the features to a numeric type. The code below applies `pd.to_numeric` to every column; setting `errors = "coerce"` forces any non-numeric value, such as `?`, to become `NaN`.

In [56]:
df1 = df1.apply(pd.to_numeric, errors = "coerce")
df1.dtypes

ID      int64
x1      int64
x2      int64
x3      int64
x4      int64
x5      int64
x6    float64
x7      int64
x8      int64
x9      int64
y       int64
dtype: object
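
To see the coercion in isolation, here is a small sketch on a made-up Series of strings, one of which is `?`. The values are invented for illustration.

```python
import pandas as pd

# Toy column (made up): numbers stored as text, plus one "?"
raw = pd.Series(["1", "10", "?", "3"])

# errors="coerce" turns anything that cannot be parsed into NaN
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```

The column becomes `float64` rather than `int64` because `NaN` is a floating-point value, which matches the dtype of `x6` shown above.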

In [33]:
df1.describe()

| | ID | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 645.0 | 645.0 | 645.0 | 645.0 | 645.0 | 645.0 | 629.0 | 645.0 | 645.0 | 645.0 | 645.0 |
| mean | 1074419.0 | 4.471318 | 3.182946 | 3.269767 | 2.893023 | 3.275969 | 3.624801 | 3.497674 | 2.955039 | 1.613953 | 2.71938 |
| std | 637262.7 | 2.858115 | 3.059049 | 2.985748 | 2.918036 | 2.247455 | 3.670647 | 2.459374 | 3.120682 | 1.744056 | 0.960564 |
| min | 61634.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 871549.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 1171795.0 | 4.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 1238186.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 7.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 13454350.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |


We can see that every column has a count of 645 except for `x6`, which has 629, indicating that it is likely the only feature with missing values.

To further examine the missing data, we can use the `isnull()` method:

In [34]:
df1.isnull().sum()

ID     0
x1     0
x2     0
x3     0
x4     0
x5     0
x6    16
x7     0
x8     0
x9     0
y      0
dtype: int64

Next, we impute the missing values, replacing each `NaN` in `x6` with the column mean.

In [50]:
# Work on a copy so that df1 keeps its NaNs
df2 = df1.copy()
x6_mean = df2["x6"].mean()
df2["x6"] = [x6_mean if np.isnan(a) else a for a in df2["x6"]]

In [51]:
df2.isnull().sum()

ID    0
x1    0
x2    0
x3    0
x4    0
x5    0
x6    0
x7    0
x8    0
x9    0
y     0
dtype: int64

We can also write a function to impute the missing values. Let's name it `impute_missing()`.

In [49]:
def impute_missing(df, column):
  # Work on a copy so the caller's DataFrame is left unchanged
  df = df.copy()
  # Calculate the mean of that column (NaNs are skipped)
  mean = df[column].mean()
  # Replace NaN with the mean
  df[column] = df[column].fillna(mean)
  # Return the new data frame
  return df

To use the function, simply pass the DataFrame and the column name (in quotes) as arguments:

In [57]:
df3 = impute_missing(df1, "x6")
df3.isnull().sum()

ID    0
x1    0
x2    0
x3    0
x4    0
x5    0
x6    0
x7    0
x8    0
x9    0
y     0
dtype: int64
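
The summary statistics above show that `x6` has a mean of about 3.62 but a median of 1, so the mean may not be the most representative fill value. As a sketch of one possible variation, the hypothetical helper `impute_column()` below adds a `strategy` argument and returns a copy instead of modifying its input. The function name, the argument, and the toy data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def impute_column(df, column, strategy="mean"):
    """Return a copy of df with NaNs in `column` filled by its mean or median."""
    if strategy == "mean":
        fill_value = df[column].mean()
    elif strategy == "median":
        fill_value = df[column].median()
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    out = df.copy()  # leave the caller's DataFrame untouched
    out[column] = out[column].fillna(fill_value)
    return out

# Toy data (made up): a skewed column with one missing value
toy = pd.DataFrame({"x6": [1.0, 1.0, np.nan, 10.0]})

by_mean = impute_column(toy, "x6")
by_median = impute_column(toy, "x6", strategy="median")

print(by_mean["x6"].tolist())    # [1.0, 1.0, 4.0, 10.0]
print(by_median["x6"].tolist())  # [1.0, 1.0, 1.0, 10.0]
```

Which strategy is preferable depends on the data and the model; comparing results under both is a reasonable check.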