# Chapter 7 - Data Cleaning and Preparation

## 7.2 Data Transformation (1)

In [1]:
import pandas as pd
import numpy as np

- Identify and remove duplicates

- Use `.map()` to map values from one domain to another

- Use `.apply()` to pass a `Series` or `DataFrame` to a function

- Replace values

- Rename axis indices

<hr>

In [2]:
df = pd.read_csv('dataset-H-videos.csv')
display(df)

Unnamed: 0,video_id,title,views
0,-EL8TuMsb-k,Full Face TESTING BEAUTY GURUS Makeup FAVORITE...,565288
1,-EL8TuMsb-k,Full Face TESTING BEAUTY GURUS Makeup FAVORITE...,565288
2,0lr497GhD04,"10,000 Calorie Baconator Challenge!! (11 Burgers)",1752464
3,1purAy2MsOc,22 Years of Life Milestone,1092447
4,1purAy2MsOc,22 Years of Life Milestone,1092450
5,2QK6Usg2KT0,Is McDonald's Garlic White Cheddar Burger Real...,1312727
6,2QK6Usg2KT0,Is McDonald's Garlic White Cheddar Burger Real...,1312727
7,4vZJn6r0dRw,YOUTUBERS REACT TO TOP 10 VEVO CHANNELS OF ALL...,1595752
8,Ga7thDlGbs8,President Trump on bump stocks (C-SPAN),37389
9,4ykXguKWqy4,People Living With Disabilities Review Charact...,677579


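To identify duplicates, use `.duplicated()`. It returns a boolean `Series` in which a row is `True` if an identical row has already appeared earlier in the `DataFrame` (rows are scanned from top to bottom). Rows 3 and 4 share a `video_id`, but their `views` differ, so row 4 is not flagged.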

In [3]:
df.duplicated()

0     False
1      True
2     False
3     False
4     False
5     False
6      True
7     False
8     False
9     False
10     True
11    False
12    False
13     True
14     True
15     True
dtype: bool
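
As a minimal sketch of how the scan direction affects the result, the `keep` parameter of `.duplicated()` controls which occurrence is left unflagged. The `toy` frame below is hypothetical data made up for illustration, not the videos dataset.

```python
import pandas as pd

# Hypothetical toy data (not the videos dataset)
toy = pd.DataFrame({
    'video_id': ['a', 'a', 'b', 'a'],
    'views': [10, 10, 20, 10],
})

first = toy.duplicated()             # default keep='first': first occurrence is not flagged
last = toy.duplicated(keep='last')   # last occurrence is not flagged
every = toy.duplicated(keep=False)   # every member of a repeated group is flagged

print(first.tolist())   # [False, True, False, True]
print(last.tolist())    # [True, True, False, False]
print(every.tolist())   # [True, True, False, True]
```

`keep=False` is handy for inspecting all copies of a repeated row before deciding which one to retain.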

To remove them, use `.drop_duplicates()`. Every row flagged as a duplicate is dropped and the first occurrence is kept. Note that the index is not renumbered: the labels of the dropped rows (1, 6, 10, ...) simply no longer appear in the result.

In [4]:
display(df.drop_duplicates())

Unnamed: 0,video_id,title,views
0,-EL8TuMsb-k,Full Face TESTING BEAUTY GURUS Makeup FAVORITE...,565288
2,0lr497GhD04,"10,000 Calorie Baconator Challenge!! (11 Burgers)",1752464
3,1purAy2MsOc,22 Years of Life Milestone,1092447
4,1purAy2MsOc,22 Years of Life Milestone,1092450
5,2QK6Usg2KT0,Is McDonald's Garlic White Cheddar Burger Real...,1312727
7,4vZJn6r0dRw,YOUTUBERS REACT TO TOP 10 VEVO CHANNELS OF ALL...,1595752
8,Ga7thDlGbs8,President Trump on bump stocks (C-SPAN),37389
9,4ykXguKWqy4,People Living With Disabilities Review Charact...,677579
11,5MJpurZ9ShI,Do Essential Oils Really Work? And Why?,326617
12,BPmAxzEJpWA,Snoop Dogg - One More Day (feat. Charlie Wilso...,393494
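
The gaps in the index can be closed afterwards if a contiguous index is wanted. The sketch below uses a hypothetical `toy` frame and `reset_index(drop=True)`, which is one possible way to renumber the rows.

```python
import pandas as pd

# Hypothetical toy data (not the videos dataset)
toy = pd.DataFrame({'video_id': ['a', 'a', 'b'], 'views': [10, 10, 20]})

deduped = toy.drop_duplicates()
print(deduped.index.tolist())      # [0, 2]: label 1 is gone, nothing is renumbered

# drop=True discards the old labels instead of keeping them as a new column
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1]
```

`toy` itself is left unchanged, because `.drop_duplicates()` returns a new `DataFrame` by default.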


To consider only a subset of columns when looking for duplicates, pass them as a list: `.drop_duplicates([col1, col2])`. Two rows count as duplicates if they agree on the selected columns, whatever the other columns contain. The rows that remain keep all of their columns.

In [5]:
display(df.drop_duplicates(['video_id', 'views']))

Unnamed: 0,video_id,title,views
0,-EL8TuMsb-k,Full Face TESTING BEAUTY GURUS Makeup FAVORITE...,565288
2,0lr497GhD04,"10,000 Calorie Baconator Challenge!! (11 Burgers)",1752464
3,1purAy2MsOc,22 Years of Life Milestone,1092447
4,1purAy2MsOc,22 Years of Life Milestone,1092450
5,2QK6Usg2KT0,Is McDonald's Garlic White Cheddar Burger Real...,1312727
7,4vZJn6r0dRw,YOUTUBERS REACT TO TOP 10 VEVO CHANNELS OF ALL...,1595752
8,Ga7thDlGbs8,President Trump on bump stocks (C-SPAN),37389
9,4ykXguKWqy4,People Living With Disabilities Review Charact...,677579
11,5MJpurZ9ShI,Do Essential Oils Really Work? And Why?,326617
12,BPmAxzEJpWA,Snoop Dogg - One More Day (feat. Charlie Wilso...,393494
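
The choice of columns matters when the same video appears with slightly different view counts, as in rows 3 and 4 above. The following sketch, on hypothetical toy data, compares two subsets and also uses `keep='last'` to retain the most recent row for each `video_id`.

```python
import pandas as pd

# Hypothetical toy data: video 'x' was recorded twice with different view counts
toy = pd.DataFrame({
    'video_id': ['x', 'x', 'y'],
    'title': ['T1', 'T1', 'T2'],
    'views': [100, 103, 50],
})

# Rows 0 and 1 differ in 'views', so neither is treated as a duplicate
by_id_and_views = toy.drop_duplicates(subset=['video_id', 'views'])

# Only 'video_id' is compared, so one of rows 0 and 1 is dropped
by_id_first = toy.drop_duplicates(subset=['video_id'])              # keeps row 0
by_id_last = toy.drop_duplicates(subset=['video_id'], keep='last')  # keeps row 1

print(len(by_id_and_views))          # 3
print(by_id_first['views'].tolist()) # [100, 50]
print(by_id_last['views'].tolist())  # [103, 50]
```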


<hr>

In [6]:
df2 = pd.read_csv('dataset-H2-videos.csv')
df2 = df2[['video_id', 'title', 'category_id']]
display(df2)

Unnamed: 0,video_id,title,category_id
0,1qIj0m7-sHI,Top 10 NFL Rookies of the 2017 Season | NFL Hi...,17
1,k1xvol1SCx8,"Dua Lipa - IDGAF ft. Charli XCX, Zara Larsson,...",10
2,8sg8lY-leE8,Vermilion Parish teacher gets arrested at Verm...,23
3,Y6zucdAzNi4,Thank You Peter Capaldi | Doctor Who Christmas...,24
4,BmMuLuG1yW8,"G-Eazy On Stepping Away From H&M, Being A Craz...",24


To translate each element of a `Series` from one domain to another, use `Series.map()` with a `dict` (or a function) that defines the mapping.

In [7]:
category_to_names = {17 : 'Sports', 10 : 'Music', 23 : 'Comedy', 24 : 'Entertainment'}
df2['category_name'] = df2['category_id'].map(category_to_names)
display(df2)

Unnamed: 0,video_id,title,category_id,category_name
0,1qIj0m7-sHI,Top 10 NFL Rookies of the 2017 Season | NFL Hi...,17,Sports
1,k1xvol1SCx8,"Dua Lipa - IDGAF ft. Charli XCX, Zara Larsson,...",10,Music
2,8sg8lY-leE8,Vermilion Parish teacher gets arrested at Verm...,23,Comedy
3,Y6zucdAzNi4,Thank You Peter Capaldi | Doctor Who Christmas...,24,Entertainment
4,BmMuLuG1yW8,"G-Eazy On Stepping Away From H&M, Being A Craz...",24,Entertainment
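
One detail worth knowing is what happens to values that have no entry in the `dict`. The sketch below uses a made-up category id (`99`) to show that `.map()` yields `NaN` in that case, and that a function can be passed instead of a `dict`.

```python
import pandas as pd

category_to_names = {17: 'Sports', 10: 'Music'}
ids = pd.Series([17, 10, 99])   # 99 is a hypothetical id with no entry in the dict

names = ids.map(category_to_names)   # the unmapped id becomes NaN
filled = names.fillna('Unknown')     # one possible way to supply a fallback label

# A function is applied element by element
labels = ids.map(lambda i: f'cat-{i}')

print(names.isna().tolist())   # [False, False, True]
print(filled.tolist())         # ['Sports', 'Music', 'Unknown']
print(labels.tolist())         # ['cat-17', 'cat-10', 'cat-99']
```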


To pass each value of one column to a function, use `Series.apply()`. `DataFrame.apply()` is also available; it passes whole rows or columns to the function. For further reading, see how `apply()` forwards extra keyword arguments to the function.

In [8]:
def calculate_disbursed_amount(x):
    # Disbursed amount is 90% of the loan amount
    exact_amt = x*0.90
    # Round this value down to the nearest $100
    rounded_amt = int(exact_amt/100)*100.0
    return rounded_amt

In [9]:
df3 = pd.read_csv('dataset-A-loans.csv', index_col=0)
# The function returns a float, consistent with the dtype of the loan_amnt column
df3['loan_disbursed'] = df3['loan_amnt'].apply(calculate_disbursed_amount)
display(df3)

Unnamed: 0,loan_amnt,int_rate,term,grade,loan_disbursed
48304290,30000.0,8.18,36 months,B,27000.0
49904421,14225.0,13.33,60 months,C,12800.0
32038416,12000.0,20.2,60 months,E,10800.0
11456303,18000.0,8.39,36 months,A,16200.0
23613274,4000.0,12.49,36 months,B,3600.0
55949701,15000.0,16.99,60 months,D,13500.0
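
As a sketch of the keyword-argument case, extra keyword arguments given to `Series.apply()` are forwarded to the function. The function `disbursed`, its parameters `rate` and `step`, and the `fee` column are all hypothetical names introduced for this illustration.

```python
import pandas as pd

def disbursed(amount, rate=0.90, step=100):
    """Return amount * rate, rounded down to a multiple of step."""
    return float(int(amount * rate / step) * step)

# Hypothetical toy data; 'fee' is an invented column
loans = pd.DataFrame({'loan_amnt': [30000.0, 14225.0], 'fee': [100.0, 25.0]})

# Default parameters: same rule as calculate_disbursed_amount
default_rate = loans['loan_amnt'].apply(disbursed)

# Keyword arguments after the function are passed on to it
custom_rate = loans['loan_amnt'].apply(disbursed, rate=0.80, step=500)

# DataFrame.apply with axis=1 hands each row to the function
net = loans.apply(lambda row: row['loan_amnt'] - row['fee'], axis=1)

print(default_rate.tolist())   # [27000.0, 12800.0]
print(custom_rate.tolist())    # [24000.0, 11000.0]
print(net.tolist())            # [29900.0, 14200.0]
```

For simple arithmetic like the `net` column, a vectorised expression such as `loans['loan_amnt'] - loans['fee']` gives the same result and is usually faster than a row-wise `apply()`.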


<hr>
To replace values, use `Series.replace()`.

In [10]:
grades = df3['grade'].copy()
display(grades)
display(grades.replace('D', 'C'))

48304290    B
49904421    C
32038416    E
11456303    A
23613274    B
55949701    D
Name: grade, dtype: object

48304290    B
49904421    C
32038416    E
11456303    A
23613274    B
55949701    C
Name: grade, dtype: object

`Series.replace()` can be used for more complex substitution logic. To substitute multiple values with a single value, pass a `list` as the first argument. To declare multiple substitution rules, pass a `dict`.

In [11]:
display(grades)
display(grades.replace(['D', 'E'], 'C'))

48304290    B
49904421    C
32038416    E
11456303    A
23613274    B
55949701    D
Name: grade, dtype: object

48304290    B
49904421    C
32038416    C
11456303    A
23613274    B
55949701    C
Name: grade, dtype: object

In [12]:
display(grades)
display(grades.replace({'A' : '1', 'B' : '2'}))

48304290    B
49904421    C
32038416    E
11456303    A
23613274    B
55949701    D
Name: grade, dtype: object

48304290    2
49904421    C
32038416    E
11456303    1
23613274    2
55949701    D
Name: grade, dtype: object
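
The same `dict` rules also work on a whole `DataFrame` through `DataFrame.replace()`. The sketch below, on hypothetical data with an invented `sub_grade` column, contrasts a flat `dict` (applied to every column) with a nested `dict` that restricts the rules to one column.

```python
import pandas as pd

# Hypothetical toy data; 'sub_grade' is an invented column reusing the same codes
loans = pd.DataFrame({
    'grade': ['A', 'B', 'C'],
    'sub_grade': ['A', 'B', 'C'],
})

# Flat dict: the rules are applied in every column
everywhere = loans.replace({'A': '1', 'B': '2'})

# Nested dict {column: {old: new}}: the rules are applied in 'grade' only
only_grade = loans.replace({'grade': {'A': '1', 'B': '2'}})

print(everywhere['sub_grade'].tolist())   # ['1', '2', 'C']
print(only_grade['grade'].tolist())       # ['1', '2', 'C']
print(only_grade['sub_grade'].tolist())   # ['A', 'B', 'C']
```

Values without a rule, such as `'C'` here, are left as they are.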

<hr>
The index can be modified by assigning a new set of labels to `df.index`.

In [13]:
df = pd.read_csv('dataset-G-subsidies.csv', sep='|')
display(df)
df.index = df.index-2000
display(df)

Unnamed: 0,Year Total,HDB 1- & 2- Room Flats,HDB 3-Room Flats,HDB 4-Room Flats,HDB 5-Room & Executive Flats,Condominiums & Other Apartments,Landed Properties
2014,3526,9315.0,3385,3402.0,3554,2177.0,2488
2015,4096,,3853,3906.0,4233,2887.0,3093
2016,4248,10062.0,4251,,4231,,2907
2017,4503,10424.0,4500,,4425,,3339
2018,4494,10347.0,4535,4397.0,4424,2957.0,3152


Unnamed: 0,Year Total,HDB 1- & 2- Room Flats,HDB 3-Room Flats,HDB 4-Room Flats,HDB 5-Room & Executive Flats,Condominiums & Other Apartments,Landed Properties
14,3526,9315.0,3385,3402.0,3554,2177.0,2488
15,4096,,3853,3906.0,4233,2887.0,3093
16,4248,10062.0,4251,,4231,,2907
17,4503,10424.0,4500,,4425,,3339
18,4494,10347.0,4535,4397.0,4424,2957.0,3152
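
A small sketch of the same idea on hypothetical data: arithmetic on an integer index produces a new index, which is then assigned back. `Index.map()` is shown as an alternative when the new labels come from a function.

```python
import pandas as pd

# Hypothetical toy data indexed by year
toy = pd.DataFrame({'total': [3526, 4096]}, index=[2014, 2015])

# Arithmetic on the index returns a new Index of the same length
short_years = toy.copy()
short_years.index = short_years.index - 2000

# Index.map applies a function to every label
labelled = toy.copy()
labelled.index = labelled.index.map(lambda year: f'FY{year}')

print(short_years.index.tolist())   # [14, 15]
print(labelled.index.tolist())      # ['FY2014', 'FY2015']
```

The assigned index must have as many labels as the `DataFrame` has rows; otherwise pandas raises an error.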


To update both the index and columns in one step, use `df.rename()`, specifying the `index` and `columns` parameters.

In [14]:
df = pd.read_csv('dataset-G-subsidies.csv', sep='|')
display(df)
# Simultaneously rename the index and the columns
# inplace=True modifies df directly; rename() then returns None.
_ = df.rename(
    index={2014: 14, 2015 : 15, 2016 : 16, 2017 : 17, 2018 : 18},
    columns={'Condominiums & Other Apartments' : 'condo', 'Landed Properties' : 'landed'}, 
    inplace=True)
display(df)

Unnamed: 0,Year Total,HDB 1- & 2- Room Flats,HDB 3-Room Flats,HDB 4-Room Flats,HDB 5-Room & Executive Flats,Condominiums & Other Apartments,Landed Properties
2014,3526,9315.0,3385,3402.0,3554,2177.0,2488
2015,4096,,3853,3906.0,4233,2887.0,3093
2016,4248,10062.0,4251,,4231,,2907
2017,4503,10424.0,4500,,4425,,3339
2018,4494,10347.0,4535,4397.0,4424,2957.0,3152


Unnamed: 0,Year Total,HDB 1- & 2- Room Flats,HDB 3-Room Flats,HDB 4-Room Flats,HDB 5-Room & Executive Flats,condo,landed
14,3526,9315.0,3385,3402.0,3554,2177.0,2488
15,4096,,3853,3906.0,4233,2887.0,3093
16,4248,10062.0,4251,,4231,,2907
17,4503,10424.0,4500,,4425,,3339
18,4494,10347.0,4535,4397.0,4424,2957.0,3152
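
`rename()` also accepts a function in place of a `dict`, and without `inplace=True` it returns a new `DataFrame`. The following is a minimal sketch on hypothetical data, not the subsidies file.

```python
import pandas as pd

# Hypothetical toy data indexed by year
toy = pd.DataFrame({'Landed Properties': [2488, 3093]}, index=[2014, 2015])

# A function is applied to every index label; a dict renames selected columns
renamed = toy.rename(
    index=lambda year: year - 2000,
    columns={'Landed Properties': 'landed'},
)

# Keys that match no label are ignored by default
unchanged = toy.rename(columns={'no such column': 'x'})

print(renamed.index.tolist())      # [14, 15]
print(renamed.columns.tolist())    # ['landed']
print(toy.index.tolist())          # [2014, 2015]: the original is untouched
print(unchanged.columns.tolist())  # ['Landed Properties']
```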


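<div class="alert alert-warning"><b>Warning: </b>Note the difference between `rename()` and `replace()`: `rename()` changes index and column labels, whereas `replace()` changes the values stored in the data.</div>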
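
To make the distinction concrete, the sketch below applies the same `dict` to a hypothetical `Series` whose index labels and values both contain `'A'`.

```python
import pandas as pd

# Hypothetical toy data: 'A' appears both as an index label and as a value
grades = pd.Series(['A', 'C'], index=['A', 'B'])

relabelled = grades.rename({'A': 'Z'})   # changes the index label only
revalued = grades.replace({'A': 'Z'})    # changes the stored value only

print(relabelled.index.tolist(), relabelled.tolist())   # ['Z', 'B'] ['A', 'C']
print(revalued.index.tolist(), revalued.tolist())       # ['A', 'B'] ['Z', 'C']
```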

**References:**

Python for Data Analysis, 2nd Edition, McKinney (2017)