# 06. Duplicates

### Introduction
Your dataset may contain duplicated rows, where every value in one row equals the corresponding value in another row. Duplicates are not necessarily bad data, but they are worth investigating further.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

diamonds = pd.read_csv('../data/diamonds.csv')

new_order = ['cut', 'color', 'clarity', 'carat', 'price', 'x', 'y', 'z', 'depth', 'table']
diamonds = diamonds[new_order]

order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
diamonds['cut'] = pd.Categorical(diamonds['cut'], ordered=True, categories=order)

order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
diamonds['color'] = pd.Categorical(diamonds['color'], ordered=True, categories=order)

order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
diamonds['clarity'] = pd.Categorical(diamonds['clarity'], ordered=True, categories=order)

x_out = diamonds['x'] < 3
y_out = diamonds['y'] > 20
carat_out = diamonds['carat'] > 4
depth_out = (diamonds['depth'] < 45) | (diamonds['depth'] > 75)
table_out = (diamonds['table'] < 40) | (diamonds['table'] > 90)

d = {'x': x_out, 
     'y': y_out, 
     'carat': carat_out, 
     'depth': depth_out, 
     'table': table_out}

outliers = pd.DataFrame(d)

### Duplicated rows
The `duplicated` method returns a boolean Series indicating which rows are duplicates. By default (`keep='first'`), the first occurrence of each duplicated row is marked False and only the later occurrences are marked True. Setting `keep` to False marks every occurrence as True, including the first.
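
To make the effect of `keep` concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` data is hypothetical and not taken from the diamonds file). It compares the three accepted values of `keep` side by side.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are identical
toy = pd.DataFrame({'cut': ['Ideal', 'Good', 'Ideal', 'Ideal'],
                    'price': [326, 327, 326, 326]})

first = toy.duplicated()             # keep='first' (default): first occurrence is False
last = toy.duplicated(keep='last')   # last occurrence is False
every = toy.duplicated(keep=False)   # every member of a duplicate group is True

# Side-by-side view of the three boolean Series
comparison = pd.DataFrame({'first': first, 'last': last, 'all': every})
comparison
```

Row 1 is unique, so it is False in all three columns.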

In [None]:
filt = diamonds.duplicated(keep=False)
filt.head()

In [None]:
dupes = diamonds[filt]
dupes.head(10)

In [None]:
dupes.shape

### Must investigate these further

We have 289 duplicated rows. We must investigate these further to see if they are valid or need to be dropped.
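
One possible way to start that investigation is sketched below on hypothetical toy data: count how many copies of each duplicated row exist, then drop the extra copies with `drop_duplicates` if they turn out to be errors. The column name `n_copies` is made up for this example. On a DataFrame with categorical columns, such as `diamonds`, passing `observed=True` to `groupby` is likely needed to avoid counting unused category combinations.

```python
import pandas as pd

# Hypothetical toy data: the 'Ideal' row appears three times
toy = pd.DataFrame({'cut': ['Ideal', 'Good', 'Ideal', 'Ideal', 'Good'],
                    'carat': [0.3, 0.5, 0.3, 0.3, 0.5],
                    'price': [326, 327, 326, 326, 400]})

# All rows that have at least one identical twin
dupes = toy[toy.duplicated(keep=False)]

# Count how many times each distinct duplicated row appears
counts = (dupes.groupby(list(dupes.columns))
               .size()
               .rename('n_copies')
               .reset_index())

# If the duplicates are judged to be errors, keep one copy of each row
deduped = toy.drop_duplicates()
```

Whether dropping is appropriate depends on the data: two different diamonds can legitimately share every recorded measurement.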

## Finding duplicate columns
There are a couple of different methods for finding duplicate columns. First, we can call the `corr` method, which computes the correlation between every pair of numeric columns. If a correlation is exactly 1, the two columns move together perfectly, but their values are not necessarily identical. For instance, one column could be exactly twice another: their correlation would be 1, but their values would differ.
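
The distinction between perfect correlation and identical values can be checked on a small, hypothetical example. In this sketch, column `b` is twice column `a`, and column `c` is an exact copy of `a`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0]})
toy['b'] = toy['a'] * 2   # perfectly correlated with 'a', but different values
toy['c'] = toy['a']       # an exact copy of 'a'

corr = toy.corr()
corr_ab = corr.loc['a', 'b']   # approximately 1, subject to floating point error

# Element-wise comparison tells true duplicates apart from scaled copies
same_ab = bool((toy['a'] == toy['b']).all())
same_ac = bool((toy['a'] == toy['c']).all())
```

A correlation of 1 is therefore only a hint: confirm a suspected duplicate by comparing the values directly.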

### Duplicate columns are problematic for machine learning
When two columns are exactly the same, only one is needed to carry the information to a machine learning algorithm. Keeping both can cause problems during training (for example, perfectly collinear features in a linear model), so it is advisable to remove duplicate columns.

In [None]:
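# Restrict to numeric columns; the categorical columns cannot be correlated
diamonds.select_dtypes('number').corr()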

### Transposing the dataset
The `duplicated` method only compares rows, and it has no `axis` parameter to change the direction of the operation. To compare columns instead, you can transpose the dataset and then call `duplicated`. Be warned that this can be very slow on large datasets.

In [None]:
diamonds.T.duplicated()
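
On a small, hypothetical DataFrame the result of the transpose approach is easier to read, and the boolean Series it returns can be used to keep only the first copy of each column. This is a sketch under those assumptions.

```python
import pandas as pd

# Hypothetical toy data: 'y' holds the same values as 'x'
toy = pd.DataFrame({'x': [1, 2, 3],
                    'y': [1, 2, 3],
                    'z': [3, 2, 1]})

# After transposing, each original column is a row, so duplicated() compares columns
dup_cols = toy.T.duplicated()

# Keep only the columns that are not flagged as duplicates
unique_cols = toy.loc[:, ~dup_cols]
```

Here `dup_cols` is indexed by the original column names, so it aligns with the columns of `toy` when passed to `loc`.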

### Using `crosstab` to find duplicate columns
This approach is easier to see with a simple two-column dataset. If we cross-tabulate the values of two columns and find a one-to-one mapping from each value in one column to a value in the other, the two columns carry the same information and one is effectively a duplicate. The following DataFrame shows this exact scenario.

In [None]:
df = pd.DataFrame({'col_1':['w', 'z', 'x', 'w', 'z', 'x', 'x', 'y', 'x'], 
              'col_2':['c', 'b', 'd', 'c', 'b', 'd', 'd', 'a', 'd']})
df

In the crosstab below, notice that each row and each column has only one non-zero value. For instance, 'w' is always paired with 'c'.

In [None]:
pd.crosstab(index=df['col_1'], columns=df['col_2'])
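
The visual check above can also be automated. The helper below, `is_one_to_one`, is a hypothetical function written for this sketch and is not part of pandas. It returns True when every row and every column of the crosstab has exactly one non-zero cell.

```python
import pandas as pd

def is_one_to_one(s1, s2):
    """Return True if the values of s1 and s2 map one-to-one (hypothetical helper)."""
    ct = pd.crosstab(index=s1, columns=s2)
    nonzero = ct > 0
    one_per_column = (nonzero.sum(axis=0) == 1).all()
    one_per_row = (nonzero.sum(axis=1) == 1).all()
    return bool(one_per_column and one_per_row)

df = pd.DataFrame({'col_1': ['w', 'z', 'x', 'w', 'z', 'x', 'x', 'y', 'x'],
                   'col_2': ['c', 'b', 'd', 'c', 'b', 'd', 'd', 'a', 'd']})

mapped = is_one_to_one(df['col_1'], df['col_2'])

# Changing the last value breaks the mapping: 'x' is now paired with both 'd' and 'a'
other = pd.Series(['c', 'b', 'd', 'c', 'b', 'd', 'd', 'a', 'a'])
not_mapped = is_one_to_one(df['col_1'], other)
```

Looping this helper over every pair of categorical columns would extend the check to a whole dataset.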

We can apply the same logic to our dataset. Since the crosstab below shows no one-to-one mapping, `cut` and `color` are not duplicates of each other. The same check can be repeated for the other pairs of categorical columns.

In [None]:
pd.crosstab(index=diamonds['cut'], columns=diamonds['color'])

# Exercise
Replicate this analysis on your own dataset.