
#**Guide to Cleaning the British Library Book List Data**

The goal of this exercise is to clean the dataset that lists books held by the British Library.
- Dataset source: https://github.com/realpython/python-data-cleaning/tree/master/Datasets

##**1. Data Exploration**

In [2]:
import pandas as pd
import numpy as np
from functools import reduce

# Mount Google Drive so the dataset can be read from it
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Load the dataset from Google Drive
df = pd.read_csv('/content/drive/MyDrive/Machine Learning/dataset/BL-Flickr-Images-Book.csv')

In [4]:
df.shape

(8287, 15)

In [5]:
df.head()

,Identifier,Edition Statement,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Corporate Author,Corporate Contributors,Former owner,Engraver,Issuance type,Flickr URL,Shelfmarks
0,206,,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12641.b.30.
1,216,,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12626.cc.2.
2,218,,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12625.dd.1.
3,472,,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10369.bbb.15.
4,480,"A new edition, revised, etc.",London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.d.28.


Here we can see that the dataset contains **8,287** rows and **15** columns. The next step is to clean the data.
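Before deciding what to clean, it can help to check how much of each column is missing. The following is a minimal sketch on a hypothetical four-row frame (`sample` and `missing_share` are illustrative names; the values are copied from the `df.head()` output above), not a step from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset.
sample = pd.DataFrame({
    'Identifier': [206, 216, 218, 472],
    'Edition Statement': [np.nan, np.nan, np.nan, np.nan],
    'Place of Publication': ['London', 'London; Virtue & Yorston', 'London', 'London'],
    'Date of Publication': ['1879 [1878]', '1868', '1869', '1851'],
})

# Fraction of missing values per column, largest first.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns that are almost entirely empty are natural candidates for removal in the cleaning step.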

##**2. Data Cleaning**

In [6]:
df.columns

Index(['Identifier', 'Edition Statement', 'Place of Publication',
       'Date of Publication', 'Publisher', 'Title', 'Author', 'Contributors',
       'Corporate Author', 'Corporate Contributors', 'Former owner',
       'Engraver', 'Issuance type', 'Flickr URL', 'Shelfmarks'],
      dtype='object')

- remove unwanted columns

In [7]:
df.drop(['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks'],axis=1,inplace=True) # run this cell only once: a second run raises KeyError because the columns are already gone

In [8]:
df.head()

,Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
0,206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
1,216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
2,218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
3,472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
4,480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...


In [9]:
df.shape

(8287, 7)

After the unwanted columns are dropped, **7** columns remain, as `df.shape` shows (this count still includes 'Identifier').
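The drop cell above carries a note about not running it twice. One possible way to make such a cell safe to re-run is sketched below on a hypothetical two-row frame (`sample` and `to_drop` are illustrative names). It relies on the `errors='ignore'` option of `DataFrame.drop` and assigns the result instead of using `inplace=True`.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset.
sample = pd.DataFrame({
    'Identifier': [206, 216],
    'Edition Statement': [None, None],
    'Title': ['Walter Forbes', 'All for Greed'],
    'Engraver': [None, None],
})

to_drop = ['Edition Statement', 'Engraver']

# errors='ignore' skips labels that are already gone, so re-running is safe.
sample = sample.drop(columns=to_drop, errors='ignore')
sample = sample.drop(columns=to_drop, errors='ignore')  # second run: no KeyError

print(sample.columns.tolist())
```

The trade-off is that a misspelled column name is also ignored silently, so the resulting column list is worth checking afterwards.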

- setting the index of the dataset

In [10]:
df.set_index('Identifier', inplace = True)
df.head()

Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...


Setting the 'Identifier' column as the **index** makes it easier to look up the records we need.
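As an illustration of that lookup, the sketch below rebuilds three rows from the output above in a hypothetical frame named `sample` and fetches one record by its identifier with `.loc`.

```python
import pandas as pd

# Three rows copied from the head() output, indexed by 'Identifier'.
sample = pd.DataFrame({
    'Identifier': [206, 216, 218],
    'Place of Publication': ['London', 'London; Virtue & Yorston', 'London'],
    'Date of Publication': ['1879 [1878]', '1868', '1869'],
}).set_index('Identifier')

# Label-based lookup: 216 is an Identifier value, not a row position.
record = sample.loc[216]
print(record['Date of Publication'])

# A lookup returns a single row only if the identifiers are unique.
print(sample.index.is_unique)
```

If the index contained duplicate identifiers, `.loc` would return several rows instead of one, so checking `is_unique` first is a cheap safeguard.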

- cleaning unwanted characters in the 'Date of Publication' column

In [11]:
df['Date of Publication'].head(25)

Identifier
206            1879 [1878]
216                   1868
218                   1869
472                   1851
480                   1857
481                   1875
519                   1872
667                    NaN
874                   1676
1143                  1679
1280                  1802
1808                  1859
1905                  1888
1929           1839, 38-54
2836                  1897
2854                  1865
2956               1860-63
2957                  1873
3017                  1866
3131                  1899
4598                  1814
4884                  1820
4976                  1800
5382    1847, 48 [1846-48]
5385               [1897?]
Name: Date of Publication, dtype: object

In [12]:
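# Characters that mark the start of extra, non-year information
unwanted_characters = ['[', ',', '-']

def clean_dates(item):
    dop = str(item.loc['Date of Publication'])

    # Missing values and dates that start with a bracket become NaN
    if dop == 'nan' or dop[0] == '[':
        return np.nan  # np.NaN was removed in NumPy 2.0

    # Cut the string at the first occurrence of each unwanted character
    for character in unwanted_characters:
        if character in dop:
            character_index = dop.find(character)
            dop = dop[:character_index]

    return dop

df['Date of Publication'] = df.apply(clean_dates, axis = 1)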

In [13]:
df['Date of Publication'].head(25)

Identifier
206     1879 
216      1868
218      1869
472      1851
480      1857
481      1875
519      1872
667       NaN
874      1676
1143     1679
1280     1802
1808     1859
1905     1888
1929     1839
2836     1897
2854     1865
2956     1860
2957     1873
3017     1866
3131     1899
4598     1814
4884     1820
4976     1800
5382     1847
5385      NaN
Name: Date of Publication, dtype: object

As the output shows, each entry in the 'Date of Publication' column has been cut at the unwanted characters so that only the leading year remains, while missing values and entries that start with '[' become NaN.
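A vectorized alternative to the row-by-row function is sketched below. It is a different technique from the one used in the notebook: a regular expression with `Series.str.extract` keeps a year only when four digits start the string. The series `raw` is a hypothetical sample built from values shown in the output above.

```python
import numpy as np
import pandas as pd

# Sample values copied from the 'Date of Publication' output.
raw = pd.Series(
    ['1879 [1878]', '1868', np.nan, '1839, 38-54', '1860-63', '[1897?]'],
    name='Date of Publication',
)

# Capture four digits only when they start the string; no match gives a missing value.
years = raw.str.extract(r'^(\d{4})', expand=False)

# Convert the year strings to numbers; missing entries stay missing.
years = pd.to_numeric(years)
print(years)
```

Unlike the function above, this version never leaves a trailing space and yields a numeric column, but it also discards any entry that does not begin with four digits.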

- cleaning unwanted characters in the 'Author' column

In [14]:
df['Author'].head(25)

Identifier
206                                                 A. A.
216                                             A., A. A.
218                                             A., A. A.
472                                             A., E. S.
480                                             A., E. S.
481                                             A., E. S.
519                                             A., F. E.
667                                         A., J.|A., J.
874                                                Remaʿ.
1143                                               A., T.
1280                                                  NaN
1808                                         AALL, Jacob.
1905    AAR, Ermanno - pseud. [i.e. Luigi Giuseppe Oro...
1929                                                  NaN
2836                            ABATE, Giovanni Agostino.
2854                                    ABATI, Francesco.
2956                        ABBADIE, Antoine Thompson d'.
295

In [15]:
def clean_author_names(author):

    author = str(author)

    # Keep missing values as real NaN so that isna() still detects them
    if author == 'nan':
        return np.nan

    author = author.split(',')

    # No comma: keep only the letters of the single name
    if len(author) == 1:
        return ''.join(ch for ch in author[0] if ch.isalpha())

    last_name, first_name = author[0], author[1]

    # Drop everything from the first dash onwards, e.g. ' - pseud.'
    first_name = first_name[:first_name.find('-')] if '-' in first_name else first_name

    # Names ending in a period: cut at the second period if there is one,
    # otherwise drop the final character
    if first_name.endswith(('.', '.|')):
        parts = first_name.split('.')

        if len(parts) > 1:
            first_occurence = first_name.find('.')
            final_occurence = first_name.find('.', first_occurence + 1)
            first_name = first_name[:final_occurence]
        else:
            first_name = first_name[:first_name.find('.')]

    # capitalize() also lowercases everything after the first letter
    last_name = last_name.capitalize()

    return f'{first_name} {last_name}'


df['Author'] = df['Author'].apply(clean_author_names)

In [16]:
df['Author'].head(25)

Identifier
206                                      AA
216                                 A. A A.
218                                 A. A A.
472                                 E. S A.
480                                 E. S A.
481                                 E. S A.
519                                 F. E A.
667                                 J.|A A.
874                                   Remaʿ
1143                                   T A.
1280                                    NaN
1808                             Jacob Aall
1905                           Ermanno  Aar
1929                                    NaN
2836                Giovanni Agostino Abate
2854                        Francesco Abati
2956            Antoine Thompson d' Abbadie
2957            Antoine Thompson d' Abbadie
3017        Agustín Íñigo  Abbad y lasierra
3131                         William Abbatt
4598                   Thomas Eastoe Abbott
4884                                    NaN
4976                 

In [17]:
df.head()

Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
206,London,1879,S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,AA,http://www.flickr.com/photos/britishlibrary/ta...
216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,A. A A.,http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,A. A A.,http://www.flickr.com/photos/britishlibrary/ta...
472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...",E. S A.,http://www.flickr.com/photos/britishlibrary/ta...
480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...",E. S A.,http://www.flickr.com/photos/britishlibrary/ta...


As the output shows, most names in the 'Author' column are now in 'First Last' order with the trailing punctuation removed. Initials-only entries such as `A. A A.` and `J.|A A.` are still not fully cleaned.
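The raw value `A., J.|A., J.` suggests that `|` separates repeated or multiple author entries; this is an assumption based only on the output above. Under that assumption, one possible pre-processing step is to keep just the first entry before the name cleaning runs, as in this sketch (`authors` and `first_entry` are illustrative names):

```python
import numpy as np
import pandas as pd

# Sample values copied from the raw 'Author' output.
authors = pd.Series(['A., J.|A., J.', 'AALL, Jacob.', np.nan])

# Keep only the text before the first '|'; missing values stay missing.
first_entry = authors.str.split('|').str[0]
print(first_entry.tolist())
```

Whether dropping the later entries is acceptable depends on how the author column will be used, so this is a choice to confirm rather than a default.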

- cleaning unwanted characters in the 'Place of Publication' column

In [18]:
df['Place of Publication'].head()

Identifier
206                      London
216    London; Virtue & Yorston
218                      London
472                      London
480                      London
Name: Place of Publication, dtype: object

In [20]:
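pub = df['Place of Publication']
# Collapse every entry that mentions London or Oxford to the bare city name and
# hyphenate 'Newcastle upon Tyne'; na=False keeps missing values from matching
df['Place of Publication'] = np.where(pub.str.contains('London', na=False), 'London',
    np.where(pub.str.contains('Oxford', na=False), 'Oxford',
        np.where(pub.eq('Newcastle upon Tyne'),
            'Newcastle-upon-Tyne', pub)))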

In [21]:
df['Place of Publication'].head()

Identifier
206    London
216    London
218    London
472    London
480    London
Name: Place of Publication, dtype: object

As the output shows, entries in the 'Place of Publication' column that contained extra text, such as 'London; Virtue & Yorston', have been reduced to the city name.
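Nested `np.where` calls become hard to read as more cities are added. A flatter alternative, shown here only as a sketch, is `np.select`, which takes a list of conditions and a matching list of replacements. The series below is a hypothetical sample; 'Oxford, etc.' is an invented value used purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; 'Oxford, etc.' is invented for illustration.
pub = pd.Series(['London', 'London; Virtue & Yorston', 'Oxford, etc.',
                 'Newcastle upon Tyne', 'Derby'])

# Conditions are checked in order; the first one that is True wins.
conditions = [
    pub.str.contains('London', na=False),
    pub.str.contains('Oxford', na=False),
    pub.eq('Newcastle upon Tyne'),
]
choices = ['London', 'Oxford', 'Newcastle-upon-Tyne']

# Rows that match no condition keep their original value.
cleaned = pd.Series(np.select(conditions, choices, default=pub), index=pub.index)
print(cleaned.tolist())
```

Adding another city then means appending one condition and one choice rather than adding another level of nesting.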

In [22]:
df

Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
206,London,1879,S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,AA,http://www.flickr.com/photos/britishlibrary/ta...
216,London,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,A. A A.,http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,A. A A.,http://www.flickr.com/photos/britishlibrary/ta...
472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...",E. S A.,http://www.flickr.com/photos/britishlibrary/ta...
480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...",E. S A.,http://www.flickr.com/photos/britishlibrary/ta...
...,...,...,...,...,...,...
4158088,London,1838,,"The Parochial History of Cornwall, founded on,...",afterwards GILBERT Giddy,http://www.flickr.com/photos/britishlibrary/ta...
4158128,Derby,1831,M. Mozley & Son,The History and Gazetteer of the County of Der...,Stephen Glover,http://www.flickr.com/photos/britishlibrary/ta...
4159563,London,,T. Cadell and W. Davies,Magna Britannia; being a concise topographical...,Daniel Lysons,http://www.flickr.com/photos/britishlibrary/ta...
4159587,Newcastle-upon-Tyne,1834,Mackenzie & Dent,"An historical, topographical and descriptive v...",E. (Eneas) Mackenzie,http://www.flickr.com/photos/britishlibrary/ta...
