<h1>Cleaning Not-Null Values</h1>
<p>Methods like <code>isnull</code>, <code>fillna</code>, and <code>dropna</code> typically take care of missing data. However, sometimes you have invalid values that are not simply 'missing data' (<code>None</code> or <code>NaN</code>). For example:</p>

In [4]:
import numpy as np
import pandas as pd

In [3]:
df = pd.DataFrame({
    'Sex': ['M','F','F','D','?'],
    'Age': [29,30,24,290,25]
})

df

Unnamed: 0,Sex,Age
0,M,29
1,F,30
2,F,24
3,D,290
4,?,25


While this DataFrame isn't missing any values, some of its values are still invalid. For instance, 'D' and '?' aren't known values for Sex, and an Age of 290 is not plausible.
<br><br>
<b>Finding Unique Values</b><br>
- Notice Them
- Identify Them
- Handle Them (remove, replace, etc...)

We know that the 'Sex' field should take only two discrete values ('M'/'F'). We start by inspecting which values are actually present with the <code>unique()</code> method, and how often each one occurs with <code>value_counts()</code>:

In [4]:
df['Sex'].unique()

array(['M', 'F', 'D', '?'], dtype=object)

In [5]:
df['Sex'].value_counts()

F    2
M    1
D    1
?    1
Name: Sex, dtype: int64
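
When the set of valid categories is known in advance, one way to locate the offending rows is a boolean mask built with <code>isin</code>. The sketch below rebuilds the example DataFrame; the <code>valid_sex</code> list and the variable names are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 290, 25]
})

# Assumption: only these two codes are considered valid for this column.
valid_sex = ['M', 'F']

# True for every row whose 'Sex' value is NOT one of the valid codes.
invalid_mask = ~df['Sex'].isin(valid_sex)

# Select just the rows that need attention.
invalid_rows = df[invalid_mask]
print(invalid_rows)
```

Filtering with a mask keeps the original index, which makes it easy to trace each bad row back to its source.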

When analyzing the data, values like 'D' and '?' stand out. What should we do with them? If we know what a bad value was meant to be, the <code>replace</code> method can be handy:

In [6]:
df['Sex'].replace('D','F')

0    M
1    F
2    F
3    F
4    ?
Name: Sex, dtype: object

It can also accept a dictionary of values to replace. For example, you may also need to replace any 'N's with 'M'. Keys that do not occur in the data (such as 'N' here) are simply ignored:

In [8]:
df['Sex'].replace({'D':'F', 'N':'M'})

0    M
1    F
2    F
3    F
4    ?
Name: Sex, dtype: object
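
Keep in mind that <code>replace</code> returns a new Series and leaves <code>df</code> untouched, so the result has to be assigned back to persist. A minimal sketch follows; treating '?' as missing data (rather than guessing a value) is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 290, 25]
})

# replace() returns a new Series; assign it back to keep the change.
# Assumption: 'D' was meant to be 'F', and '?' is best treated as missing.
df['Sex'] = df['Sex'].replace({'D': 'F', '?': np.nan})

# The unknown value is now ordinary missing data.
print(df['Sex'].isnull().sum())
```

Once the unknown value is a <code>NaN</code>, the usual <code>fillna</code>/<code>dropna</code> tools apply to it.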

<b>NOTE:</b> If you have values to replace in several columns, you can apply <code>replace</code> at the DataFrame level with a nested dictionary keyed by column name:

In [9]:
df.replace({
    'Sex': {
        'D': 'F',
        'N': 'M',
    },
    'Age': {
        290: 29
    }
})

Unnamed: 0,Sex,Age
0,M,29
1,F,30
2,F,24
3,F,29
4,?,25
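
Listing every bad value in a dictionary becomes impractical for numeric columns. As a sketch of an alternative, a range check can flag implausible ages. The threshold of 100 and the guess that the error is an extra trailing zero are both assumptions for this toy data, not general rules.

```python
import pandas as pd

df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 290, 25]
})

# Assumption: ages above this limit are data-entry errors.
MAX_PLAUSIBLE_AGE = 100

# Boolean mask marking the rows with an implausible age.
too_old = df['Age'] > MAX_PLAUSIBLE_AGE

# Assumption: the error is an extra trailing zero, so integer-divide by 10.
df.loc[too_old, 'Age'] = df.loc[too_old, 'Age'] // 10
print(df)
```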


<h2>Duplicates</h2>
Checking for duplicate values is simple, though it behaves differently for Series and DataFrames. For example, suppose a party is being thrown for ambassadors from Europe, but only one ambassador per country may attend. The guest list below contains several ambassadors for some countries, so let's fix that:

<b>Duplicates in Series</b>

In [11]:
amb = pd.Series([
    'France',
    'UK',
    'UK',
    'Italy', 
    'Germany',
    'Germany', 
    'Germany'],
    index = [
        'Gerard Araud',
        'Kim Darroch',
        'Peter Westmacott', 
        'Armando Varricchio', 
        'Peter Wittig', 
        'Peter Ammon', 
        'Klaus Scharioth'    
])

amb

Gerard Araud           France
Kim Darroch                UK
Peter Westmacott           UK
Armando Varricchio      Italy
Peter Wittig          Germany
Peter Ammon           Germany
Klaus Scharioth       Germany
dtype: object

The two most important methods for dealing with duplicates are <code>duplicated</code> (tells you which values are duplicates) and <code>drop_duplicates</code> (removes the duplicates):

In [12]:
amb.duplicated() #Tells which are duplicates

Gerard Araud          False
Kim Darroch           False
Peter Westmacott       True
Armando Varricchio    False
Peter Wittig          False
Peter Ammon            True
Klaus Scharioth        True
dtype: bool

By default, <code>duplicated()</code> marks the first occurrence of each value as <code>False</code> (<code>Kim Darroch</code>, for example) and every later occurrence as <code>True</code>. You can change this with the <code>keep=</code> parameter:

In [13]:
amb.duplicated(keep='last')

Gerard Araud          False
Kim Darroch            True
Peter Westmacott      False
Armando Varricchio    False
Peter Wittig           True
Peter Ammon            True
Klaus Scharioth       False
dtype: bool

With <code>keep='last'</code>, the last occurrence is the one kept, so <code>Kim Darroch</code> and the other earlier occurrences are now marked as duplicates. You can also mark every repeated value as a duplicate by setting <code>keep=False</code>:

In [14]:
amb.duplicated(keep=False)

Gerard Araud          False
Kim Darroch            True
Peter Westmacott       True
Armando Varricchio    False
Peter Wittig           True
Peter Ammon            True
Klaus Scharioth        True
dtype: bool
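
The same <code>keep=</code> options apply to <code>drop_duplicates</code>, which returns the Series without the entries that <code>duplicated</code> would flag. The sketch below rebuilds the ambassador list; the result variable names are illustrative.

```python
import pandas as pd

amb = pd.Series(
    ['France', 'UK', 'UK', 'Italy', 'Germany', 'Germany', 'Germany'],
    index=[
        'Gerard Araud',
        'Kim Darroch',
        'Peter Westmacott',
        'Armando Varricchio',
        'Peter Wittig',
        'Peter Ammon',
        'Klaus Scharioth',
    ])

# Default keep='first': the first ambassador listed per country stays.
first_per_country = amb.drop_duplicates()

# keep='last': the last ambassador listed per country stays.
last_per_country = amb.drop_duplicates(keep='last')

# keep=False: drop every country that appears more than once.
unique_countries_only = amb.drop_duplicates(keep=False)

print(first_per_country)
```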

<b>Duplicates in DataFrames</b>
<br>
<br>Conceptually, duplicates in DataFrames happen at the row level: two rows with exactly the same values are considered duplicates:

In [5]:
players = pd.DataFrame({
    'Name': [
        'Kobe Bryant',
        'Lebron James',
        'Kobe Bryant',
        'Carmelo Anthony', 
        'Kobe Bryant',],
    'Pos': [
        'SG',
        'SF',
        'SG',
        'SF',
        'SF' ]
})

players.duplicated()

0    False
1    False
2     True
3    False
4    False
dtype: bool

Here, <code>Kobe Bryant</code> isn't registered as a duplicate at index 4. This is because, by default, <code>duplicated</code> requires <b>ALL</b> column values in a row to match, and the 'Pos' value at index 4 differs. We can override this by passing the columns to compare in the <code>subset=[]</code> parameter. The rules of the <code>keep=</code> parameter remain the same across both data structures:

In [8]:
players.duplicated(subset=['Name'], keep='last')

0     True
1    False
2     True
3    False
4    False
dtype: bool

The <code>drop_duplicates</code> method takes the same parameters:

In [9]:
players.drop_duplicates(subset=['Name'], keep='last')

Unnamed: 0,Name,Pos
1,Lebron James,SF
3,Carmelo Anthony,SF
4,Kobe Bryant,SF
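
For comparison, here is a sketch of how the default arguments behave on the same data: without <code>subset</code>, only fully identical rows are dropped, and with the default <code>keep='first'</code> the earliest row per name survives. The result variable names are illustrative.

```python
import pandas as pd

players = pd.DataFrame({
    'Name': ['Kobe Bryant', 'Lebron James', 'Kobe Bryant',
             'Carmelo Anthony', 'Kobe Bryant'],
    'Pos': ['SG', 'SF', 'SG', 'SF', 'SF']
})

# No subset: only index 2 (identical to index 0 in every column) is dropped.
full_row_dedup = players.drop_duplicates()

# subset=['Name'] with the default keep='first': the first row per name stays.
first_per_name = players.drop_duplicates(subset=['Name'])

# Count how many rows repeat an earlier name.
n_repeated_names = players.duplicated(subset=['Name']).sum()

print(first_per_name)
```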


<br>

<h2>Text Handling - Cleaning Text Values</h2>
Cleaning text values can be incredibly hard, but it is often unavoidable. Here are some common cases.

<b>Splitting Columns</b><br><br>The following is a survey case:

In [10]:
df = pd.DataFrame({
    'Data': [
        '1987_M_US _1',
        '1990?_M_UK_1',
        '1992_F_US_2',
        '1970?_M_   IT_1',
        '1985_F_I  T_2'
]})

df

Unnamed: 0,Data
0,1987_M_US _1
1,1990?_M_UK_1
2,1992_F_US_2
3,1970?_M_ IT_1
4,1985_F_I T_2


The <b><code>.split()</code></b> method:

In [16]:
df['Data'].str.split('_') #Splits data into lists

0       [1987, M, US , 1]
1       [1990?, M, UK, 1]
2        [1992, F, US, 2]
3    [1970?, M,    IT, 1]
4      [1985, F, I  T, 2]
Name: Data, dtype: object

In [15]:
df['Data'].str.split('_', expand=True) #Creates Table View

Unnamed: 0,0,1,2,3
0,1987,M,US,1
1,1990?,M,UK,1
2,1992,F,US,2
3,1970?,M,IT,1
4,1985,F,I T,2


In [17]:
df = df['Data'].str.split('_',expand=True) #Updates the original DataFrame with the split version

In [18]:
df.columns = ['Year', 'Sex', 'Country', 'No. Children'] #Creates column headers to replace 0-3

In [19]:
df

Unnamed: 0,Year,Sex,Country,No. Children
0,1987,M,US,1
1,1990?,M,UK,1
2,1992,F,US,2
3,1970?,M,IT,1
4,1985,F,I T,2


Next, we find abnormal values by using the <code>.str.contains()</code> method:

In [21]:
df['Year'].str.contains(r'\?') # the first argument is a regex pattern, so '?' must be escaped

0    False
1     True
2    False
3     True
4    False
Name: Year, dtype: bool
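
The boolean Series returned by <code>.str.contains()</code> can be used as a mask to pull out the suspicious entries. In this sketch, <code>regex=False</code> is passed so that '?' is matched literally, which avoids the need to escape it. The standalone <code>years</code> Series stands in for <code>df['Year']</code> and is an illustrative name.

```python
import pandas as pd

# Stand-in for df['Year'] from the survey example.
years = pd.Series(['1987', '1990?', '1992', '1970?', '1985'], name='Year')

# regex=False treats the pattern as a plain substring, so no escaping is needed.
has_question_mark = years.str.contains('?', regex=False)

# Keep only the entries that carry the uncertain-year marker.
suspicious = years[has_question_mark]
print(suspicious)
```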

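Next, we can use the <code>.str.strip()</code> and <code>.str.replace()</code> methods to remove whitespace and replace bad values. Note that <code>strip()</code> only removes leading and trailing whitespace, so 'I  T' is left unchanged: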

In [22]:
df['Country'].str.strip()

0      US
1      UK
2      US
3      IT
4    I  T
Name: Country, dtype: object

In [24]:
df['Country'].str.replace(' ', '') # can also take a regex pattern (pass regex=True in pandas 2.0+)

0    US
1    UK
2    US
3    IT
4    IT
Name: Country, dtype: object

Since both <code>.str.replace()</code> and <code>.str.contains()</code> accept regex patterns, it is easy to fix values in bulk. Here a named group captures the four-digit year and the trailing '?' is dropped:

In [25]:
df['Year'].str.replace(r'(?P<year>\d{4})\?', lambda m: m.group('year'), regex=True) # regex=True is required for a callable replacement in pandas 2.0+


0    1987
1    1990
2    1992
3    1970
4    1985
Name: Year, dtype: object
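
Putting the steps of this section together, the following is one possible end-to-end sketch rather than the only way to clean this data. It assigns each cleaned column back to the DataFrame and converts the numeric fields to integers. The names <code>raw</code> and <code>clean</code> are illustrative, and dropping the '?' marker from the year (instead of flagging those rows) is an assumption.

```python
import pandas as pd

raw = pd.DataFrame({
    'Data': [
        '1987_M_US _1',
        '1990?_M_UK_1',
        '1992_F_US_2',
        '1970?_M_   IT_1',
        '1985_F_I  T_2'
    ]})

# Split the packed string into one column per field.
clean = raw['Data'].str.split('_', expand=True)
clean.columns = ['Year', 'Sex', 'Country', 'No. Children']

# Assumption: the '?' marker is simply dropped; then convert to a number.
clean['Year'] = pd.to_numeric(clean['Year'].str.replace('?', '', regex=False))

# Remove every space, including the ones inside 'I  T'.
clean['Country'] = clean['Country'].str.replace(' ', '', regex=False)

# The child count is a plain integer once split out.
clean['No. Children'] = clean['No. Children'].astype(int)

print(clean)
```

Using <code>regex=False</code> for these literal replacements keeps the behavior the same regardless of the default regex setting of the installed pandas version.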