# Data Transformation

This section deals with duplicate or invalid values, which can be either removed or replaced.

In [1]:
import numpy as np
import pandas as pd

---

## Removing Duplicates

In [2]:
dframe = pd.DataFrame({'color': ['white','white','red','red','white'],
                        'value': [2,1,3,3,2]})

In [3]:
dframe

Unnamed: 0,color,value
0,white,2
1,white,1
2,red,3
3,red,3
4,white,2


In [4]:
dframe.duplicated()

0    False
1    False
2    False
3     True
4     True
dtype: bool

In [5]:
dframe[dframe.duplicated()]
# select only the rows flagged as duplicates

Unnamed: 0,color,value
3,red,3
4,white,2


In [7]:
dframe.drop_duplicates()

Unnamed: 0,color,value
0,white,2
1,white,1
2,red,3
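
By default, drop_duplicates() compares whole rows and keeps the first occurrence. The sketch below rebuilds the same dframe and shows the `keep` and `subset` parameters; the names `last_kept` and `by_color` are illustrative only.

```python
import pandas as pd

dframe = pd.DataFrame({'color': ['white', 'white', 'red', 'red', 'white'],
                       'value': [2, 1, 3, 3, 2]})

# keep='last' retains the last occurrence of each duplicated row
last_kept = dframe.drop_duplicates(keep='last')

# subset limits the comparison to the listed columns
by_color = dframe.drop_duplicates(subset=['color'])

print(last_kept)
print(by_color)
```

With `keep='last'`, rows 0 and 2 are dropped instead of rows 3 and 4.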


## Mapping

Mapping is nothing more than the creation of a list of matches between two
different values, with the ability to bind a value to a particular label or string.

The functions that you will see in this section perform specific operations, but they all
accept a dict object holding the matches as an argument.
- replace(): replaces values
- map(): creates a new column
- rename(): replaces the index values

## Replacing Values via Mapping

In [8]:
frame = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
                        'color':['white','rosso','verde','black','yellow'],
                        'price':[5.56,4.20,1.30,0.56,2.75]})

In [9]:
frame

Unnamed: 0,item,color,price
0,ball,white,5.56
1,mug,rosso,4.2
2,pen,verde,1.3
3,pencil,black,0.56
4,ashtray,yellow,2.75


In [10]:
newcolors = {'rosso': 'red', 'verde': 'green'}

In [11]:
frame.replace(newcolors)

Unnamed: 0,item,color,price
0,ball,white,5.56
1,mug,red,4.2
2,pen,green,1.3
3,pencil,black,0.56
4,ashtray,yellow,2.75


In [12]:
ser = pd.Series([1, 3, np.nan, 4, 6, np.nan, 3])

In [13]:
ser

0    1.0
1    3.0
2    NaN
3    4.0
4    6.0
5    NaN
6    3.0
dtype: float64

In [14]:
ser.replace({np.nan:0})

0    1.0
1    3.0
2    0.0
3    4.0
4    6.0
5    0.0
6    3.0
dtype: float64

In [15]:
ser.replace(np.nan, 0)

0    1.0
1    3.0
2    0.0
3    4.0
4    6.0
5    0.0
6    3.0
dtype: float64
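
A dict passed to replace() is applied to every column. As a hedged sketch, a nested dict restricts the replacement to one column, and fillna() is a dedicated alternative for missing values. The shortened frame and the names `fixed` and `filled` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'item': ['ball', 'mug', 'pen'],
                      'color': ['white', 'rosso', 'verde']})

# nested dict: only the 'color' column is searched for the keys
fixed = frame.replace({'color': {'rosso': 'red', 'verde': 'green'}})

ser = pd.Series([1, 3, np.nan, 4])

# fillna() gives the same result as ser.replace(np.nan, 0)
filled = ser.fillna(0)

print(fixed)
print(filled)
```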

## Adding Values via Mapping

In [16]:
frame = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
                        'color':['white','red','green','black','yellow']})

In [17]:
frame

Unnamed: 0,item,color
0,ball,white
1,mug,red
2,pen,green
3,pencil,black
4,ashtray,yellow


In [19]:
price = {'ball' : 5.56, 'mug' : 4.20, 'bottle' : 1.30, 
         'scissors' : 3.41, 'pen' : 1.30, 
         'pencil' : 0.56, 'ashtray' : 2.75}

In [23]:
frame['price'] = frame['item'].map(price) 
# map() looks up each item in the price dict and returns the matching value
# (the image of the item under the mapping, in mathematical terms)

In [24]:
frame

Unnamed: 0,item,color,price
0,ball,white,5.56
1,mug,red,4.2
2,pen,green,1.3
3,pencil,black,0.56
4,ashtray,yellow,2.75
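
Every item above has a key in the price dict. The sketch below shows what map() does when a key is missing; 'stapler' is a hypothetical item added only for this illustration.

```python
import pandas as pd

items = pd.Series(['ball', 'mug', 'stapler'])  # 'stapler' is hypothetical
price = {'ball': 5.56, 'mug': 4.20}

# elements without a matching key are mapped to NaN
mapped = items.map(price)

print(mapped)
```

Checking `mapped.isna()` after a map() call is a quick way to spot keys that the dict does not cover.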


## Renaming the Indexes of the Axes

Replace the index labels or column labels.

In [27]:
frame.rename(index={0:'first', 1:'second', 2:'third', 3:'fourth', 
             4:'fifth'})

Unnamed: 0,item,color,price
first,ball,white,5.56
second,mug,red,4.2
third,pen,green,1.3
fourth,pencil,black,0.56
fifth,ashtray,yellow,2.75


In [29]:
frame.rename(index={0:'first', 1:'second', 2:'third', 3:'fourth', 
             4:'fifth'},
             columns={'item':'obj', 'color':'colour'})

Unnamed: 0,obj,colour,price
first,ball,white,5.56
second,mug,red,4.2
third,pen,green,1.3
fourth,pencil,black,0.56
fifth,ashtray,yellow,2.75
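
Besides a dict, rename() accepts a function that is applied to every label. A minimal sketch on a shortened frame (the name `upper` is illustrative); it also shows that rename() returns a new object and leaves the original untouched.

```python
import pandas as pd

frame = pd.DataFrame({'item': ['ball', 'mug'], 'color': ['white', 'red']})

# a callable is applied to each column label
upper = frame.rename(columns=str.upper)

print(list(upper.columns))
print(list(frame.columns))  # the original labels are unchanged
```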


---

# Discretization and Binning

A more complex transformation that you will see in this section is discretization. Sometimes, especially in experimental settings, you have to handle large quantities of data generated in sequence. To analyze such data, it is often necessary to transform it into discrete categories, for example by dividing the range of the readings into smaller intervals and counting the occurrences, or computing statistics, within each of them.

Another case is a huge number of samples resulting from precise readings on a population. Here too, to facilitate the analysis, you divide the range of values into categories and then analyze the occurrences and statistics related to each of them.

In [30]:
results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]

In [31]:
bins = [0,25,50,75,100]

In [32]:
cat = pd.cut(results, bins)

In [33]:
cat

[(0, 25], (25, 50], (50, 75], (50, 75], (25, 50], ..., (75, 100], (0, 25], (25, 50], (75, 100], (75, 100]]
Length: 17
Categories (4, interval[int64]): [(0, 25] < (25, 50] < (50, 75] < (75, 100]]

The object returned by the cut() function is a special object of Categorical type. You can think of it as
an array of strings naming the bin of each element. Internally it contains a categories array holding the
distinct bins and a codes array holding one integer for each element of
results (i.e., the array subjected to binning). Each integer is the position, within categories, of the bin to
which the corresponding element of results is assigned.

In [36]:
cat.codes # position of the bin each element falls into

array([0, 1, 2, 2, 1, 3, 3, 0, 0, 2, 2, 1, 3, 0, 1, 3, 3], dtype=int8)

In [42]:
cat.categories # the bin intervals themselves

IntervalIndex([(0, 25], (25, 50], (50, 75], (75, 100]],
              closed='right',
              dtype='interval[int64]')
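
To make the link between the two attributes concrete, the sketch below looks up each code in categories. The shortened results list and the value 150 are assumptions chosen for illustration.

```python
import pandas as pd

results = [12, 34, 67, 55]
bins = [0, 25, 50, 75, 100]
cat = pd.cut(results, bins)

# each code is a position in cat.categories
intervals = [cat.categories[code] for code in cat.codes]
print(intervals)

# a value outside every bin becomes NaN and gets the code -1
outside = pd.cut([150], bins)
print(outside.codes)
```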

In [43]:
pd.value_counts(cat)  # sorted by count; deprecated since pandas 2.1 in favor of pd.Series(cat).value_counts()

(75, 100]    5
(50, 75]     4
(25, 50]     4
(0, 25]      4
dtype: int64

In [44]:
cat.value_counts()

(0, 25]      4
(25, 50]     4
(50, 75]     4
(75, 100]    5
dtype: int64

In [45]:
bin_names = ['unlikely','less likely','likely','highly likely']

In [51]:
list(pd.cut(results, bins, labels=bin_names))

['unlikely',
 'less likely',
 'likely',
 'likely',
 'less likely',
 'highly likely',
 'highly likely',
 'unlikely',
 'unlikely',
 'likely',
 'likely',
 'less likely',
 'highly likely',
 'unlikely',
 'less likely',
 'highly likely',
 'highly likely']

If you pass the cut() function an integer instead of explicit bin edges, it divides the range of values
of the array into that many intervals of equal width.

In [52]:
pd.cut(results, 5)

[(2.904, 22.2], (22.2, 41.4], (60.6, 79.8], (41.4, 60.6], (22.2, 41.4], ..., (79.8, 99.0], (22.2, 41.4], (41.4, 60.6], (79.8, 99.0], (79.8, 99.0]]
Length: 17
Categories (5, interval[float64]): [(2.904, 22.2] < (22.2, 41.4] < (41.4, 60.6] < (60.6, 79.8] < (79.8, 99.0]]

In [54]:
list(pd.cut(results, 5, right=False))

[Interval(3.0, 22.2, closed='left'),
 Interval(22.2, 41.4, closed='left'),
 Interval(60.6, 79.8, closed='left'),
 Interval(41.4, 60.6, closed='left'),
 Interval(22.2, 41.4, closed='left'),
 Interval(79.8, 99.096, closed='left'),
 Interval(79.8, 99.096, closed='left'),
 Interval(3.0, 22.2, closed='left'),
 Interval(3.0, 22.2, closed='left'),
 Interval(41.4, 60.6, closed='left'),
 Interval(60.6, 79.8, closed='left'),
 Interval(41.4, 60.6, closed='left'),
 Interval(79.8, 99.096, closed='left'),
 Interval(22.2, 41.4, closed='left'),
 Interval(41.4, 60.6, closed='left'),
 Interval(79.8, 99.096, closed='left'),
 Interval(79.8, 99.096, closed='left')]
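
To inspect the edges that cut() computes from an integer, you can ask for them with `retbins=True`. This is a small sketch on the same results list; the names `edges` and `widths` are illustrative.

```python
import numpy as np
import pandas as pd

results = [12, 34, 67, 55, 28, 90, 99, 12, 3, 56, 74, 44, 87, 23, 49, 89, 87]

# retbins=True returns the computed bin edges alongside the Categorical
cat, edges = pd.cut(results, 5, retbins=True)

# distance between consecutive edges
widths = np.diff(edges)

print(edges)
print(widths)
```

The lowest edge is pushed slightly below the minimum of the data so that the minimum itself falls inside the first right-closed interval, which is why the first bin is marginally wider than the others.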

In addition to cut(), pandas provides another method for binning: qcut(). This function divides the
sample into quantiles (quintiles when you ask for five bins). With cut(), the number of occurrences in each
bin depends on the distribution of the data sample. qcut() instead chooses the bin edges so that each bin holds
as nearly the same number of occurrences as possible, which means that the edges vary with the data.

In [56]:
quintiles = pd.qcut(results, 5)

In [58]:
list(quintiles)

[Interval(2.999, 24.0, closed='right'),
 Interval(24.0, 46.0, closed='right'),
 Interval(62.6, 87.0, closed='right'),
 Interval(46.0, 62.6, closed='right'),
 Interval(24.0, 46.0, closed='right'),
 Interval(87.0, 99.0, closed='right'),
 Interval(87.0, 99.0, closed='right'),
 Interval(2.999, 24.0, closed='right'),
 Interval(2.999, 24.0, closed='right'),
 Interval(46.0, 62.6, closed='right'),
 Interval(62.6, 87.0, closed='right'),
 Interval(24.0, 46.0, closed='right'),
 Interval(62.6, 87.0, closed='right'),
 Interval(2.999, 24.0, closed='right'),
 Interval(46.0, 62.6, closed='right'),
 Interval(87.0, 99.0, closed='right'),
 Interval(62.6, 87.0, closed='right')]

In [59]:
quintiles.value_counts()

(2.999, 24.0]    4
(24.0, 46.0]     3
(46.0, 62.6]     3
(62.6, 87.0]     4
(87.0, 99.0]     3
dtype: int64

In [60]:
pd.value_counts(quintiles)  # deprecated since pandas 2.1; pd.Series(quintiles).value_counts() is the replacement

(62.6, 87.0]     4
(2.999, 24.0]    4
(87.0, 99.0]     3
(46.0, 62.6]     3
(24.0, 46.0]     3
dtype: int64

In [61]:
pd.cut(results, 5).value_counts()

(2.904, 22.2]    3
(22.2, 41.4]     3
(41.4, 60.6]     4
(60.6, 79.8]     2
(79.8, 99.0]     5
dtype: int64
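
qcut() also accepts a list of quantile boundaries instead of a count. The sketch below splits the same results into quartiles; the labels 'q1' to 'q4' are hypothetical names chosen for this example.

```python
import pandas as pd

results = [12, 34, 67, 55, 28, 90, 99, 12, 3, 56, 74, 44, 87, 23, 49, 89, 87]

# explicit quantile boundaries: 0%, 25%, 50%, 75%, 100%
quartiles = pd.qcut(results, [0, 0.25, 0.5, 0.75, 1.0],
                    labels=['q1', 'q2', 'q3', 'q4'])

counts = quartiles.value_counts()
print(counts)
```

With 17 values, some of them tied, the bins cannot all have the same size, so the counts are only approximately equal.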

## Detecting and Filtering Outliers

Detecting abnormal values or outliers

In [62]:
randframe = pd.DataFrame(np.random.randn(1000, 3))

In [63]:
randframe.describe()

Unnamed: 0,0,1,2
count,1000.0,1000.0,1000.0
mean,-0.011984,-0.002081,0.008907
std,1.052053,0.981189,1.008527
min,-3.321166,-3.667491,-2.589661
25%,-0.733104,-0.633384,-0.699882
50%,-0.052974,-0.042776,-0.007303
75%,0.707233,0.65132,0.68069
max,3.636879,3.250588,3.489639


In [64]:
randframe.std()

0    1.052053
1    0.981189
2    1.008527
dtype: float64

Now filter the DataFrame by comparing the absolute value of every element with three times the standard
deviation of its column. Thanks to the any() function applied along the rows, the result keeps every row in
which at least one column exceeds its threshold.

In [71]:
randframe[(np.abs(randframe) > 3 * randframe.std()).any(axis=1)]

Unnamed: 0,0,1,2
110,0.472035,1.52275,3.45019
235,1.173894,3.250588,-0.109475
533,-0.644063,3.195031,-0.878147
640,-3.175203,-2.006558,0.953543
712,-3.321166,-0.631173,-0.226828
781,1.92069,-3.667491,-0.522527
810,3.199387,-1.153079,-1.056149
884,3.636879,1.416868,-1.098889
910,-0.782313,1.517145,3.489639
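
Once the outlier rows are identified, two common follow-ups are dropping them or capping the values at the threshold. The sketch below shows both under stated assumptions: the seed is arbitrary, and clip() is one possible way to cap, not something the text above prescribes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
randframe = pd.DataFrame(rng.standard_normal((1000, 3)))

limit = 3 * randframe.std()                    # one threshold per column
mask = (randframe.abs() > limit).any(axis=1)   # rows with at least one outlier

# option 1: drop the rows that contain an outlier
filtered = randframe[~mask]

# option 2: cap every value at +/- its column threshold
capped = randframe.clip(lower=-limit, upper=limit, axis=1)

print(len(filtered), int(mask.sum()))
```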


---

# Permutation

The operations of permutation (random reordering) of a Series or the rows of a DataFrame are easy to do
using the numpy.random.permutation() function.

In [73]:
nframe = pd.DataFrame(np.arange(25).reshape(5, 5))

In [74]:
nframe

Unnamed: 0,0,1,2,3,4
0,0,1,2,3,4
1,5,6,7,8,9
2,10,11,12,13,14
3,15,16,17,18,19
4,20,21,22,23,24


In [75]:
new_order = np.random.permutation(5)

In [76]:
new_order

array([2, 0, 4, 1, 3])

In [77]:
nframe.take(new_order) # rearranging the index (row)

Unnamed: 0,0,1,2,3,4
2,10,11,12,13,14
0,0,1,2,3,4
4,20,21,22,23,24
1,5,6,7,8,9
3,15,16,17,18,19


In [78]:
new_order1 = [3, 4, 2]
nframe.take(new_order1)

Unnamed: 0,0,1,2,3,4
3,15,16,17,18,19
4,20,21,22,23,24
2,10,11,12,13,14


## Random Sampling

In [79]:
sample = np.random.randint(0, len(nframe), size=3)

In [80]:
sample

array([0, 0, 3])

In [81]:
nframe.take(sample)

Unnamed: 0,0,1,2,3,4
0,0,1,2,3,4
0,0,1,2,3,4
3,15,16,17,18,19
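
Note that np.random.randint() samples with replacement, which is why row 0 appears twice above. As an alternative sketch, pandas offers DataFrame.sample(), which covers both shuffling and sampling; the `random_state` values here are arbitrary.

```python
import numpy as np
import pandas as pd

nframe = pd.DataFrame(np.arange(25).reshape(5, 5))

# frac=1 returns all rows in random order (a permutation)
shuffled = nframe.sample(frac=1, random_state=0)

# n=3 draws three distinct rows (without replacement by default)
subset = nframe.sample(n=3, random_state=0)

# replace=True allows the same row to be drawn more than once
with_repl = nframe.sample(n=3, replace=True, random_state=0)

print(shuffled)
```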


---

# String Manipulation

## Built-in Methods for Manipulation of Strings

In [82]:
name = "sekardayu hana pradiani"

In [83]:
name.split(' ')

['sekardayu', 'hana', 'pradiani']

In [84]:
text = "sekar, saskia, arifa"

In [85]:
text.split(',')

['sekar', ' saskia', ' arifa']

In [86]:
text = '16 Bolton Avenue , Boston'

In [91]:
text.split(',')

['16 Bolton Avenue ', ' Boston']

In [89]:
# strip: trim the white spaces and newlines
tokens = [s.strip() for s in text.split(',')]

In [90]:
tokens

['16 Bolton Avenue', 'Boston']

In [92]:
address, city = [s.strip() for s in text.split(',')]

In [93]:
address

'16 Bolton Avenue'

In [94]:
city

'Boston'

In [95]:
# the most intuitive way
address + ', ' + city

'16 Bolton Avenue, Boston'

In [96]:
strings = ['A+','A','A-','B','BB','BBB','C+']

In [97]:
';'.join(strings)

'A+;A;A-;B;BB;BBB;C+'

In [98]:
'Boston' in text

True

In [99]:
text.index('Boston')

19

In [100]:
text.find('Boston')

19

In [101]:
text.count('e')

2

In [102]:
text.count('Avenue')

1

In [104]:
text.replace('Avenue', 'Street')

'16 Bolton Street , Boston'

In [105]:
text.replace('1', '')

'6 Bolton Avenue , Boston'
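
index() and find() return the same position when the substring is present, but they differ when it is absent. A short sketch, using 'Chicago' as a hypothetical substring that does not occur in text:

```python
text = '16 Bolton Avenue , Boston'

# find() returns -1 when the substring is absent
missing = text.find('Chicago')

# index() raises ValueError in the same situation
try:
    text.index('Chicago')
    raised = False
except ValueError:
    raised = True

print(missing, raised)
```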

## Regular Expressions

In [106]:
import re

The re module provides a set of functions that can be divided into three different categories:
- pattern matching
- substitution
- splitting

Start with a few examples. The regex for a sequence of one or
more whitespace characters is \s+.

In [107]:
re.split(r'\s+', text)  # raw string, so the backslash reaches the regex engine unchanged

['16', 'Bolton', 'Avenue', ',', 'Boston']

Now look more closely at how the re module works. When you call the re.split() function, the
regular expression is first compiled, and then the split() function is called on the text argument. You
can compile the regex yourself with the re.compile() function, obtaining a reusable regex object and
saving the repeated compilation step.
This matters most when you search iteratively for a substring in a set or an array of strings.

In [108]:
regex = re.compile(r'\s+')

In [109]:
regex.split(text)

['16', 'Bolton', 'Avenue', ',', 'Boston']
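
A compiled regex pays off when the same pattern is applied to many strings. The sketch below reuses one compiled object over a hypothetical list of addresses. Note that the re module also caches recently used patterns internally, so the gain may be modest in practice.

```python
import re

regex = re.compile(r'\s+')

# hypothetical list of address strings
addresses = ['16 Bolton Avenue , Boston',
             '221B  Baker Street ,\tLondon']

# the pattern is compiled once and reused for every string
tokens = [regex.split(a) for a in addresses]

print(tokens)
```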

To find the substrings of a text that match a regex pattern, you can use the
findall() function. It returns a list of all the substrings in the text that meet the requirements of the regex.

For example, to find all the words in a string that start with an uppercase “A”, or with
“a” regardless of case, enter the following:

In [110]:
text = 'This is my address: 16 Bolton Avenue, Boston'

In [111]:
re.findall(r'A\w+', text)

['Avenue']

In [113]:
re.findall(r'[Aa]\w+', text)  # no comma inside the class: [A,a] would also match a literal ','

['address', 'Avenue']
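
The patterns above match an “A” anywhere, not only at the start of a word. As a hedged refinement, the sketch below anchors the match with `\b` and uses the `re.IGNORECASE` flag; it also shows re.sub(), the substitution category listed earlier. The '#' placeholder is an arbitrary choice.

```python
import re

text = 'This is my address: 16 Bolton Avenue, Boston'

# \b anchors the match at a word boundary; IGNORECASE covers 'A' and 'a'
words = re.findall(r'\ba\w+', text, flags=re.IGNORECASE)

# substitution: replace every run of digits with '#'
masked = re.sub(r'\d+', '#', text)

print(words)
print(masked)
```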

---

# Important Points

- Removing duplicates:
    - obj.duplicated(): returns a boolean Series, True where a row repeats an earlier row
    - obj.drop_duplicates(): removes duplicate rows
- Mapping:
    - obj.replace():
        - takes a dictionary: key is the data to be replaced, value is the replacement
        - or takes two arguments: old data, new data
    - obj.map(): can be used to add new features
        - returns the mapped value (the image) of each element
        - takes a dictionary
    - obj.rename(): renames index labels or column labels
- Discretization and Binning:
    - pandas.cut():
        - parameters: data, bins (number or list of edges), labels
        - returns a Categorical object, which records the interval each data point belongs to
        - attributes:
            - categories: the intervals
            - codes: the position of each data point's interval
    - pandas.qcut(): quantile-based binning
        - parameters: data, q (number of quantiles or list of quantile boundaries), labels
        - returns a Categorical object
        - the data is distributed as evenly as possible across the intervals
- Detecting and Filtering Outliers:
    - no new functions; combine existing ones
    - ex: values beyond 3 times the standard deviation are considered outliers
    - randframe[(np.abs(randframe) > 3 * randframe.std()).any(axis=1)]
- Permutation: 
    - obj.take(): rearranges the rows into the order of the positions passed to it
- String manipulation:
    - obj.strip(): trims leading and trailing whitespace, including newlines
    - obj.split(): splits the string at the given delimiter
    - delimiter.join(list_of_strings): joins the strings with the delimiter between them