# Introduction

If you search through a non-trivial number of datasets, you'll likely notice inconsistencies, missing values, and various other issues. Depending on what you want to do with a dataset, the best strategy might be to first "edit" your dataset. This process is referred to as *data cleaning* and *data normalization*.

# Data Normalization

We'll begin this tutorial by reviewing data normalization techniques. Data normalization refers to   

## Dropping Observations

We've established that our data isn't always perfect. Sometimes that means dropping values altogether. In this example, we'll look at dorm data. We begin by loading the data into `pandas`.


In [13]:
import pandas as pd
houses = pd.read_csv("./housing.csv")
print(houses)

          Dorm            Name
0  East Campus      Helen Chen
1     Broadway   Danielle Jing
2      Shapiro    Craig Rhodes
3         Watt  Lesley Cordero
4  East Campus    Martin Perez
5     Broadway   Menna Elsayed
6      Wallach   Will Essilfie


For this example, we'll get rid of the first two rows, which we can easily do with the `drop()` method:

In [14]:
houses = houses.drop([0,1])
print(houses)

          Dorm            Name
2      Shapiro    Craig Rhodes
3         Watt  Lesley Cordero
4  East Campus    Martin Perez
5     Broadway   Menna Elsayed
6      Wallach   Will Essilfie


Now, let's say one of the students graduated and moved out. We no longer want them in our dataset, so we filter them out with a condition:

In [16]:
houses = houses[houses.Name != "Lesley Cordero"]
print(houses)

          Dorm           Name
2      Shapiro   Craig Rhodes
4  East Campus   Martin Perez
5     Broadway  Menna Elsayed
6      Wallach  Will Essilfie
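The two techniques above can also be tried without the CSV file. The sketch below rebuilds a few rows in memory (values copied from the printed table) and applies both steps. The names `trimmed` and `filtered` and the final `reset_index` call are additions for illustration, not part of the original example.

```python
import pandas as pd

# Rebuild a few rows of the dorm table in memory so the example is self-contained.
houses = pd.DataFrame({
    "Dorm": ["East Campus", "Broadway", "Shapiro", "Watt", "East Campus"],
    "Name": ["Helen Chen", "Danielle Jing", "Craig Rhodes", "Lesley Cordero", "Martin Perez"],
})

# Drop rows by index label: removes the rows labeled 0 and 1.
trimmed = houses.drop([0, 1])

# Drop rows by condition: keep only the rows whose Name is not "Lesley Cordero".
filtered = trimmed[trimmed["Name"] != "Lesley Cordero"]

# Optionally renumber the remaining rows from 0 (an extra step, assumed useful here).
filtered = filtered.reset_index(drop=True)
print(filtered)
```

Note that `drop()` and boolean filtering both return a new DataFrame, which is why the result is assigned back to a variable each time.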


## Strings


### Lower and Upper

The `upper()` and `lower()` string methods return a new string where all the letters in the original string have been converted to uppercase or lowercase, respectively. Non-letter characters in the string remain unchanged.


In [20]:
spam = 'Hello World!'
spam = spam.upper()
print(spam)

HELLO WORLD!


Likewise, there's a `lower()` method that you can use:

In [22]:
spam = spam.lower()
print(spam)

hello world!


These methods don't change the string itself but return new strings. If you want to change the original string, you have to call `upper()` or `lower()` on the string and then assign the new string to the variable where the original was stored. This is why you must use `spam = spam.upper()` to change the string in `spam` instead of simply `spam.upper()`.

The `upper()` and `lower()` methods are helpful if you need to make a case-insensitive comparison. The strings 'great' and 'GREat' are not equal to each other. But in many instances, it should not matter whether the user types Great, GREAT, or grEAT. If the string is first converted to lowercase, all of these compare equal.
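As a minimal sketch of such a comparison, the helper below lowercases both sides before comparing them. The function name `same_ignoring_case` is hypothetical and only used for illustration.

```python
def same_ignoring_case(a, b):
    """Return True if the two strings match once both are lowercased."""
    return a.lower() == b.lower()

print('great' == 'GREat')                    # exact comparison
print(same_ignoring_case('great', 'GREat'))  # case-insensitive comparison
```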

### StartsWith and EndsWith

The `startswith()` and `endswith()` methods return `True` if the string value they are called on begins or ends (respectively) with the string passed to the method; otherwise, they return `False`.

In [24]:
print('Hello world!'.startswith('Hello'))
print('Hello world!'.endswith('world'))

True
False


Now, here's an example that returns `False`:

In [53]:
print('abc123'.startswith('abcdef'))

False


These methods are useful alternatives to the `==` operator if you need to check only whether the first or last part of the string, rather than the whole thing, is equal to another string.

### Join and Split

The `join()` method is useful when you have a list of strings that need to be joined together into a single string value. The `join()` method is called on a separator string, gets passed a list of strings, and returns a string. The returned string is the concatenation of each string in the passed-in list, with the separator placed between neighboring items.

In [54]:
print(', '.join(['python', 'R', 'Java']))

python, R, Java


Conversely, you can split a sentence into its word components. In natural language processing, this is called <b>word tokenization</b>.

In [26]:
print('My name is Lesley'.split())

['My', 'name', 'is', 'Lesley']
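Combining the two methods is a common way to tidy up messy text fields. The sketch below uses a made-up string with stray whitespace and a made-up comma-separated string. It is one possible cleanup pattern rather than the only one.

```python
raw = '  East   Campus  '

# split() with no arguments splits on any run of whitespace and drops empty pieces.
words = raw.split()

# join() glues the pieces back together with a single space between them.
cleaned = ' '.join(words)
print(words)
print(cleaned)

# split() also accepts an explicit separator.
fields = 'python,R,Java'.split(',')
print(fields)
```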


## Missing Values

Missing data can be a huge hindrance in data science, and taking missing values into account isn't always simple. We'll now go over different approaches to handling them.

If the amount of missing data is very small relative to the size of the dataset, leaving out the few samples with missing features may be the best strategy to prevent biasing the analysis. 

Leaving out available datapoints, however, deprives the data of some amount of information. Depending on the situation, you may want to look for other fixes before deleting potentially useful datapoints from your dataset.

While some quick fixes such as mean substitution may be fine in some cases, such simple approaches usually introduce bias into the data. For instance, applying mean substitution leaves the mean unchanged but decreases the variance.
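The effect of mean substitution on the variance is easy to check on a small example. The sketch below uses a hypothetical series with two missing entries and compares the mean and variance before and after filling the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with two missing entries.
values = pd.Series([2.0, 4.0, np.nan, 8.0, np.nan, 10.0])

# Mean substitution: fill every gap with the mean of the observed values.
filled = values.fillna(values.mean())

# pandas skips NaN by default, so these compare observed-only data with filled data.
print("mean:    ", values.mean(), "->", filled.mean())
print("variance:", values.var(), "->", filled.var())
```

The filled-in values sit exactly at the mean, so they add nothing to the spread while increasing the sample count, which is why the variance shrinks.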

#### Mice

The `mice` package in R helps you impute missing values with plausible data values. These plausible values are drawn from a distribution specifically designed for each missing datapoint.

We'll now proceed with an example using the airquality dataset available in R:

Let's remove some datapoints to work with in this tutorial:

Imputing categorical variables is usually not a good idea. Some data scientists replace missing categorical values with the mode of the observed ones, but that is not always the best choice.

Here, we'll remove the categorical variables for simplicity. Then we look at the data using `summary()`.

Ozone seems to be the variable with the most missing datapoints. 

#### Missing Data Classification 

Understanding the reasons why data are missing is important to correctly handle the remaining data. If values are missing completely at random, the data sample is likely still representative of the population. But if the values are missing systematically, analysis may be biased. Here, we'll go into the different types of missing data:

<b>Missing Completely at Random (MCAR)</b> means the chance that a value is missing is unrelated to any variable in the dataset, observed or not. This is the best-case scenario when it comes to missing data.

<b>Missing at Random (MAR)</b> means the missingness is not completely random, but it can be accounted for when you take into account other variables that you did observe.

<b>Missing Not at Random (MNAR)</b> means the missingness depends on information you did not observe, such as the missing value itself. This is a much more serious issue because the reason why there's missing data is usually unknown.

MCAR data is the best scenario, but even that can pose a problem if there's too much missing data. As a rule of thumb, the maximum threshold for missing data is 5% of the total for large datasets. If it goes beyond that, it's probably a good idea to leave that feature or sample out.
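To make the three missingness types more concrete, the sketch below simulates each one on made-up data. The variables `age` and `income`, the missing rates, and the random seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(20, 70, size=n)                    # always observed
income = 1_000 * age + rng.normal(0, 5_000, size=n)  # variable that will have gaps

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of a missing income depends only on the observed variable (age).
mar_mask = rng.random(n) < np.where(age > 50, 0.4, 0.05)

# MNAR: the chance of a missing income depends on the income value itself.
mnar_mask = rng.random(n) < np.where(income > np.median(income), 0.4, 0.05)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = income[~mask].mean()
    print(f"{name}: {mask.mean():.1%} missing, observed mean income {observed_mean:,.0f}")
print(f"True mean income: {income.mean():,.0f}")
```

In this toy setup the observed mean stays close to the true mean under MCAR. Under MAR it shifts, but the shift can be corrected using `age`, whereas under MNAR the information needed for the correction is exactly what is missing.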

With that said, we'll check to make sure we have sufficient data:

Yikes. Ozone is missing almost 25% of its datapoints. This means we should drop it. 
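A comparable check can be sketched in Python with `pandas`. The table below is hypothetical: the column names echo the airquality example, but the values are made up, so the percentages will not match the ones quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical table; the values are made up for illustration.
df = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, np.nan, 28.0, 23.0, np.nan, 19.0],
    "Wind":  [7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8],
})

# isna() marks missing cells; the column-wise mean of those booleans is the missing fraction.
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Columns above the 5% rule of thumb.
too_sparse = missing_pct[missing_pct > 5].index.tolist()
print(too_sparse)
```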

### Missing Data Pattern

The `mice` package provides the `md.pattern()` function to get a better understanding of the pattern of missing data: 

This tells us that 104 samples are complete, 34 samples miss only the Ozone measurement, 4 samples miss only the Solar.R value and so on.

Just to make sense of what this means, let's try a visual representation using the `VIM` package:

The plot helps us understand that almost 70% of the samples are not missing any information, 22% are missing the Ozone value, and the remaining samples show other missing patterns.

#### Imputation

The `mice()` function takes care of the imputing process, like this:

Here, `m = 5` refers to the number of imputed datasets; 5 is just the default value. `meth = 'pmm'` refers to the imputation <b>method</b>. In this example, we're using predictive mean matching. If you want to check out what other methods exist, type:

## Outlier Detection

An outlier is an observation or point that is distant from other observations or points. Outliers can also be described as observations whose probability of occurring is very low. They are important because they can affect the accuracy of predictive models.

### Causes

Often, an outlier is present due to measurement error. Therefore, one of the most important tasks in data analysis is to identify outliers and to remove them only if necessary.


### Representative vs Nonrepresentative

There are two main types of outliers: representative and nonrepresentative. A representative outlier is one that is correct and not unique, and therefore should not be removed from its dataset.

A nonrepresentative outlier, then, is one that's incorrect, either because it is due to an error or because there are no values like it in the rest of the population. These should typically be excluded.


### Novelty vs Outlier

There are two general approaches to detecting abnormal data: novelty detection and outlier detection.

Novelty detection is used when your dataset is <b>not</b> polluted by outliers and your goal is to decide whether new datapoints are abnormal. On the other hand, if your dataset already contains outliers and the goal is to distinguish inliers from outliers, then you're performing outlier detection.
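The difference shows up in how a detector is fitted and applied. The sketch below uses scikit-learn's `LocalOutlierFactor`, which supports both modes. The choice of detector, the synthetic data, and `n_neighbors=20` are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
clean = rng.normal(0, 1, size=(200, 2))           # training data assumed free of outliers
new_points = np.array([[0.1, -0.2], [8.0, 8.0]])  # one typical point, one far away

# Novelty detection: fit on clean data, then judge previously unseen points.
novelty = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(clean)
print(novelty.predict(new_points))   # +1 means inlier, -1 means outlier

# Outlier detection: the data being fitted already contains the outliers.
polluted = np.vstack([clean, [[8.0, 8.0], [-9.0, 7.0]]])
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(polluted)
print(np.where(labels == -1)[0])     # row positions flagged as outliers
```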

### Example 1

Outlier detection varies between a single dataset and multiple datasets. There isn't a concrete definition of what constitutes an outlier, so there are different methodologies for outlier detection.

Two methods we'll focus on are Median Absolute Deviation (MAD) and Standard Deviation (SD). Though MAD and SD give different results, they serve the same purpose.

Let's generate a sample dataset:

In [56]:
import numpy as np

# Sample data stored as a NumPy array so that arithmetic applies element-wise
x = np.array([10, 9, 13, 14, 15, 8, 9, 10, 11, 12, 9, 0, 8, 8, 25, 9, 11, 10])

#### Median Absolute Deviation

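In [1]:
import numpy as np

x = np.asarray(x)  # the sample data defined above
# Median absolute deviation: the median of the absolute deviations from the median
mad = np.median(np.abs(x - np.median(x)))
# Score each point by its distance from the median, measured in units of MAD
scores = np.abs(x - np.median(x)) / mad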

#### Standard Deviation
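A minimal sketch of the standard deviation approach on the same sample data is shown below. It scores each value by how many standard deviations it lies from the mean. The cutoff of 2 is an illustrative assumption; 2.5 and 3 are also common choices.

```python
import numpy as np

x = np.array([10, 9, 13, 14, 15, 8, 9, 10, 11, 12, 9, 0, 8, 8, 25, 9, 11, 10])

# z-score: distance from the mean measured in standard deviations
z = (x - x.mean()) / x.std()

# Assumed cutoff; values farther than this many standard deviations are flagged.
threshold = 2
outliers = x[np.abs(z) > threshold]
print(outliers)
```

Because the mean and the standard deviation are themselves pulled by extreme values, this approach tends to be less robust than MAD on small samples.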

### Extreme Value Analysis

Extreme value analysis is the simplest form of outlier detection, but it is only good for 1D data, since it relies on the assumption that values which are too large or too small are outliers. Two common tests for this sort of analysis are z-tests and Student's t-tests.

### Probabilistic and Statistical Models

Working off the assumption of specific types of distributions, these models use the expectation-maximization methods to estimate the model parameters. Once done, the membership probabilities of each data point are calculated and the points with low probabilities are marked as outliers. 
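As one possible illustration, the sketch below fits a two-component Gaussian mixture with scikit-learn, which estimates the parameters by expectation-maximization, and flags the least likely points. The synthetic data, the number of components, and the 1% cutoff are all assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two hypothetical clusters plus two points that belong to neither.
cluster_a = rng.normal(loc=[0, 0], scale=0.5, size=(100, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.5, size=(100, 2))
strays = np.array([[2.5, -4.0], [10.0, 0.0]])
X = np.vstack([cluster_a, cluster_b, strays])

# GaussianMixture estimates its parameters with expectation-maximization.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# score_samples returns the log-likelihood of each point under the fitted mixture.
log_likelihood = gmm.score_samples(X)

# Flag the least likely 1% of points (an assumed cutoff).
cutoff = np.percentile(log_likelihood, 1)
outlier_rows = np.where(log_likelihood <= cutoff)[0]
print(outlier_rows)
```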

### Linear Models

This method begins by modeling the data in lower-dimensional subspaces through the use of linear correlations. The distance of each point to the plane that fits this subspace is then calculated and used to find outliers.
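A small sketch of this idea, assuming made-up 2D data that lies close to a line: the line is fitted with a singular value decomposition (the computation behind principal component analysis), and each point is scored by its distance to that line.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2D data lying close to the line y = 2x, plus one point far off that line.
t = rng.uniform(-5, 5, size=100)
X = np.column_stack([t, 2 * t + rng.normal(0, 0.2, size=100)])
X = np.vstack([X, [[0.0, 6.0]]])

# Fit a 1D subspace: center the data and take the first principal direction via SVD.
center = X.mean(axis=0)
centered = X - center
_, _, vt = np.linalg.svd(centered, full_matrices=False)
direction = vt[0]

# Distance of each point to the fitted line: the part not explained by the direction.
projection = np.outer(centered @ direction, direction)
residual = np.linalg.norm(centered - projection, axis=1)

print(residual.argmax())   # row position of the point farthest from the fitted line
```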

### Proximity Based Models

This strategy of detecting outliers models outliers as points that are isolated from the rest of the observations. 

#### Cluster Based

Cluster-based methods work by assigning data points to clusters and treating the points that are not members of any cluster as outliers.

In [58]:
from ggplot import diamonds
from sklearn.neighbors import NearestNeighbors

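In [62]:
m = 15  # larger neighborhood size used in the second fit below
# NearestNeighbors needs numeric input; string columns such as `clarity`
# (with values like 'SI2') raise a ValueError, so keep only the numeric columns.
numeric = diamonds.select_dtypes(include="number")
neighs = NearestNeighbors(n_neighbors=3)
neighs.fit(numeric)
distances, indices = neighs.kneighbors(numeric)

In [None]:
# Repeat with the larger neighborhood for comparison
neighs1 = NearestNeighbors(n_neighbors=m)
neighs1.fit(numeric)
distances1, indices1 = neighs1.kneighbors(numeric)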
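As a smaller illustration of the cluster-based idea itself, the sketch below uses scikit-learn's `DBSCAN`, which leaves points that fit no cluster unassigned. DBSCAN is swapped in here as one example of a clustering algorithm with this property; the synthetic data and the `eps` and `min_samples` values are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two hypothetical tight clusters and one point far from both.
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(50, 2)),
    [[2.5, 10.0]],
])

# eps and min_samples are illustrative values chosen for this toy data.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# DBSCAN labels points that belong to no cluster as -1.
print(np.where(labels == -1)[0])
```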

#### Distance Based

Distance based methods, on the other hand, use distance between individual points to find outliers. 
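One common way to turn this into a score is the distance from each point to its k-th nearest neighbor: isolated points have large values. The sketch below computes it with plain NumPy on made-up data. The choice of `k = 5` is an assumption, and the full pairwise distance matrix is only practical for small datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2D points: one dense group plus a single distant point.
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[10.0, 10.0]]])

# Pairwise Euclidean distances between all points.
diffs = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=-1))

# Distance from each point to its k-th nearest neighbor.
k = 5
kth_distance = np.sort(dist, axis=1)[:, k]   # column 0 is the point itself (distance 0)

print(kth_distance.argmax())   # row position of the most isolated point
```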

#### Density Based