## Masking Data

Let's create a pandas Series

In [1]:
import pandas as pd
import numpy as np

In [2]:
sample_list = pd.Series(range(10))

In [3]:
sample_list

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

Selecting the entries less than 3 with a boolean condition

In [4]:
sample_list[sample_list < 3]

0    0
1    1
2    2
dtype: int64

#### The where() method of a Pandas Series
Use `where()` if you want a series of the same shape as the original. It returns an object of the same shape as the caller, keeping the entries where the specified condition is True. Entries that do not match the condition are replaced by the value of the `other` argument (NaN by default).

In [5]:
sample_list.where(sample_list < 3)

0    0.0
1    1.0
2    2.0
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
dtype: float64
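As a minimal sketch of the `other` argument described above: passing a numeric fill value replaces the non-matching entries without introducing NaN, so the integer dtype is preserved. The name `s` and the sentinel value `-1` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

s = pd.Series(range(10))

# Keep entries below 3; replace every other entry with the sentinel -1.
filled = s.where(s < 3, other=-1)

print(filled.tolist())
print(filled.dtype)
```

Because no NaN is inserted, the result stays an integer series instead of being upcast to float64.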

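The `mask()` method does the opposite of `where()`: it replaces the entries where the condition is True (with NaN by default). We store the result in a new series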

In [6]:
masked_list = sample_list.mask(sample_list < 3)

masked_list

0    NaN
1    NaN
2    NaN
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    8.0
9    9.0
dtype: float64
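The relationship between the two methods can be checked directly. The sketch below assumes nothing beyond the series used above; the names `s`, `cond`, `masked`, and `kept_elsewhere` are illustrative.

```python
import pandas as pd

s = pd.Series(range(10))
cond = s < 3

masked = s.mask(cond)            # NaN where cond is True
kept_elsewhere = s.where(~cond)  # NaN where the negated condition is False

# Series.equals treats NaN values in the same position as equal.
print(masked.equals(kept_elsewhere))
```

In other words, masking with a condition gives the same result as `where()` with the negated condition.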

Now that the elements below 3 are masked, let's apply the same condition to the masked series. Comparisons with NaN evaluate to False, so no entry satisfies the condition and every entry comes back as NaN

In [7]:
masked_list.where(masked_list<3)

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
6   NaN
7   NaN
8   NaN
9   NaN
dtype: float64
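The all-NaN result follows from how NaN behaves in comparisons. A short sketch, using illustrative names (`s`, `masked`, `cond`), shows that the condition is False everywhere and that `isna()` is the way to locate the masked entries.

```python
import pandas as pd

s = pd.Series(range(10))
masked = s.mask(s < 3)

# NaN < 3 is False, and the surviving values 3..9 are not below 3 either.
cond = masked < 3
print(cond.any())

# To find the masked positions, test for NaN explicitly.
masked_positions = masked[masked.isna()].index.tolist()
print(masked_positions)
```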

The original series is unchanged, since `where()` and `mask()` return new objects rather than modifying it in place

In [8]:
sample_list

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [9]:
sample_list.where(sample_list > 3)

0    NaN
1    NaN
2    NaN
3    NaN
4    4.0
5    5.0
6    6.0
7    7.0
8    8.0
9    9.0
dtype: float64

We can also replace the non-matching entries with a specified value, here the character 'x', by passing it as the second argument

In [10]:
sample_list.where(sample_list > 3, 'x')

0    x
1    x
2    x
3    x
4    4
5    5
6    6
7    7
8    8
9    9
dtype: object

In [11]:
pd.Series(range(10))

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

Comparing this masked series with the original series element by element

In [12]:
sample_list.where(sample_list > 3, 'x') == pd.Series(range(10))

0    False
1    False
2    False
3    False
4     True
5     True
6     True
7     True
8     True
9     True
dtype: bool

## Removing duplicate data

In [13]:
df = pd.DataFrame({'color':['red','yellow','green','yellow',
                            'blue','green','green'],
                   'vehicle':['truck','bus','van','truck',
                              'bus','van','car']})

In [14]:
df

| | color | vehicle |
|---|---|---|
| 0 | red | truck |
| 1 | yellow | bus |
| 2 | green | van |
| 3 | yellow | truck |
| 4 | blue | bus |
| 5 | green | van |
| 6 | green | car |


Flagging rows whose `color` value has already appeared in an earlier row

In [15]:
df.duplicated('color')

0    False
1    False
2    False
3     True
4    False
5     True
6     True
dtype: bool

Flagging rows whose `vehicle` value has already appeared in an earlier row

In [16]:
df.duplicated('vehicle')

0    False
1    False
2    False
3     True
4     True
5     True
6    False
dtype: bool
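The boolean series returned by `duplicated()` can be used to select rows. As a sketch, passing `keep=False` marks every member of a duplicate group rather than only the later repeats, which makes it easy to inspect all rows that share a color. The variable name `repeated_colors` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'yellow', 'green', 'yellow',
                             'blue', 'green', 'green'],
                   'vehicle': ['truck', 'bus', 'van', 'truck',
                               'bus', 'van', 'car']})

# keep=False flags every row whose color occurs more than once.
repeated_colors = df[df.duplicated('color', keep=False)]
print(repeated_colors)
```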

Dropping the rows with a duplicated `color` entry

In [17]:
df.drop_duplicates('color')

| | color | vehicle |
|---|---|---|
| 0 | red | truck |
| 1 | yellow | bus |
| 2 | green | van |
| 4 | blue | bus |


Keeping the first of the duplicated rows and dropping the rest, which is the default behavior

In [18]:
df.drop_duplicates('color',
                   keep = 'first')

| | color | vehicle |
|---|---|---|
| 0 | red | truck |
| 1 | yellow | bus |
| 2 | green | van |
| 4 | blue | bus |


Keeping the last of the duplicated rows and dropping the earlier ones

In [19]:
df.drop_duplicates('color',
                   keep = 'last')

| | color | vehicle |
|---|---|---|
| 0 | red | truck |
| 3 | yellow | truck |
| 4 | blue | bus |
| 6 | green | car |


Dropping every row whose `color` value occurs more than once

In [20]:
df.drop_duplicates('color',
                   keep = False)

| | color | vehicle |
|---|---|---|
| 0 | red | truck |
| 4 | blue | bus |
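Note that `drop_duplicates()` keeps the original index labels, which leaves gaps in the index. If a contiguous index is wanted, one option is to chain `reset_index(drop=True)`; the sketch below does this for the `keep='last'` case, with `deduped` as an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'yellow', 'green', 'yellow',
                             'blue', 'green', 'green'],
                   'vehicle': ['truck', 'bus', 'van', 'truck',
                               'bus', 'van', 'car']})

# Keep the last row for each color, then renumber the rows from 0.
deduped = (df.drop_duplicates(subset='color', keep='last')
             .reset_index(drop=True))
print(deduped)
```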


Finding rows that are duplicated in their entirety, by checking both columns, and dropping them

In [21]:
df.duplicated(['color','vehicle'])

0    False
1    False
2    False
3    False
4    False
5     True
6    False
dtype: bool

In [22]:
df.drop_duplicates(['color','vehicle'])

| | color | vehicle |
|---|---|---|
| 0 | red | truck |
| 1 | yellow | bus |
| 2 | green | van |
| 3 | yellow | truck |
| 4 | blue | bus |
| 6 | green | car |


With no column subset and `keep = False`, every copy of a fully duplicated row is dropped

In [23]:
df.drop_duplicates(keep = False)

| | color | vehicle |
|---|---|---|
| 0 | red | truck |
| 1 | yellow | bus |
| 3 | yellow | truck |
| 4 | blue | bus |
| 6 | green | car |


## A few other interesting functions

#### Crosstab - get a feel for the data

In [24]:
pd.crosstab(df['color'], 
            df['vehicle'],
            margins = True)

| color / vehicle | bus | car | truck | van | All |
|---|---|---|---|---|---|
| blue | 1 | 0 | 0 | 0 | 1 |
| green | 0 | 1 | 0 | 2 | 3 |
| red | 0 | 0 | 1 | 0 | 1 |
| yellow | 1 | 0 | 1 | 0 | 2 |
| All | 2 | 1 | 2 | 2 | 7 |


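The unique entries of the `vehicle` column become the columns of this new dataframe, and the unique entries of the `color` column become its rows. With `margins = True`, the `All` row and column hold the totals.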

We can read off how many vehicles of each color and type are present in this dataframe. For example, there are two green vans, but no blue trucks and no red cars.
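Individual counts can also be looked up programmatically with `.loc`. The sketch below additionally builds the same counts with `groupby`, `size`, and `unstack`, as one possible way to see what a crosstab computes; the names `ct` and `counts` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'yellow', 'green', 'yellow',
                             'blue', 'green', 'green'],
                   'vehicle': ['truck', 'bus', 'van', 'truck',
                               'bus', 'van', 'car']})

# Cross-tabulation without the margin totals.
ct = pd.crosstab(df['color'], df['vehicle'])
print(ct.loc['green', 'van'])   # count of green vans
print(ct.loc['blue', 'truck'])  # count of blue trucks

# The same counts via groupby: count each (color, vehicle) pair,
# then pivot vehicle into columns, filling absent pairs with 0.
counts = df.groupby(['color', 'vehicle']).size().unstack(fill_value=0)
print(counts.loc['green', 'van'])
```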