<img src="header.png" align="left"/>

# Worked example: Cleaning and transformation of data

The goal of this example is to explain several data cleaning and transformation tasks and to test their effect.


Specifically, we will go through the following points:

- Removing unneeded samples and features
- Filling invalid and empty values
- Removing duplicates
- Checking value ranges
- Cleaning text fields
- Converting date values
- Resampling and accumulation





The examples were taken from the following sources:

- [1] [https://www.import.io/post/what-is-data-cleansing-and-transformation-wrangling/](https://www.import.io/post/what-is-data-cleansing-and-transformation-wrangling/) 
- [2] [https://realpython.com/python-data-cleaning-numpy-pandas/](https://realpython.com/python-data-cleaning-numpy-pandas/)


# Motivation

This article gives a good overview of the topic: [https://cleverdata.io/clean-select-transform-data/](https://cleverdata.io/clean-select-transform-data/)

# Imports

In [3]:
#
# modules
#
import pandas as pd
import numpy as np

# Loading data

In [4]:
df = pd.read_csv('data/BL-Flickr-Images-Book.csv')

In [5]:
df.head()

Unnamed: 0,Identifier,Edition Statement,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Corporate Author,Corporate Contributors,Former owner,Engraver,Issuance type,Flickr URL,Shelfmarks
0,206,,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12641.b.30.
1,216,,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12626.cc.2.
2,218,,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12625.dd.1.
3,472,,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10369.bbb.15.
4,480,"A new edition, revised, etc.",London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.d.28.


# Removing unnecessary features

In [6]:
to_drop = ['Edition Statement','Corporate Author','Corporate Contributors','Former owner','Engraver','Contributors','Issuance type','Shelfmarks']

In [7]:
df.drop(to_drop, inplace=True, axis=1)
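The drop step can be tried out on a small frame first. The following is a minimal sketch, assuming a toy DataFrame `toy` whose values are copied from the first two rows shown above; `toy`, `cols_to_drop` and `slim` are illustrative names, not part of the original notebook. It uses the `columns=` keyword, which is equivalent to passing the labels together with `axis=1`, and returns a new frame instead of modifying in place.

```python
import pandas as pd

# Toy frame; values copied from the first two rows of the dataset preview.
toy = pd.DataFrame({
    "Identifier": [206, 216],
    "Edition Statement": [None, None],
    "Place of Publication": ["London", "London; Virtue & Yorston"],
    "Shelfmarks": [
        "British Library HMNTS 12641.b.30.",
        "British Library HMNTS 12626.cc.2.",
    ],
})

cols_to_drop = ["Edition Statement", "Shelfmarks"]

# columns= is equivalent to drop(labels, axis=1).
# Without inplace=True a new DataFrame is returned and toy stays unchanged.
slim = toy.drop(columns=cols_to_drop)

print(list(slim.columns))  # ['Identifier', 'Place of Publication']
```

Returning a new frame keeps the original available for comparison, which is convenient when testing the effect of a cleaning step.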

In [8]:
df.head()

Unnamed: 0,Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
0,206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
1,216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
2,218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
3,472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
4,480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...


# Setting an index for accessing the data

In [10]:
#
# Check whether Identifier is a suitable index
#
df['Identifier'].is_unique

True
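Here the check returns `True`, so `Identifier` can serve as the index. As a hedged sketch of what the failing case might look like, the toy frame below (hypothetical values, with a duplicated identifier introduced on purpose) shows how duplicates can be located with `duplicated` and how `set_index` can be asked to refuse a non-unique column via `verify_integrity=True`.

```python
import pandas as pd

# Hypothetical frame with a deliberately duplicated identifier (216).
toy = pd.DataFrame({
    "Identifier": [206, 216, 216],
    "Title": ["first", "second", "third"],
})

print(toy["Identifier"].is_unique)  # False

# keep=False marks every member of a duplicate group, not only the repeats.
dupes = toy[toy["Identifier"].duplicated(keep=False)]
print(dupes)

# verify_integrity=True makes set_index raise instead of silently
# creating an index with duplicate keys.
try:
    toy.set_index("Identifier", verify_integrity=True)
    raised = False
except ValueError:
    raised = True

print(raised)  # True
```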

In [11]:
df = df.set_index('Identifier')

In [12]:
df.head()

Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...


In [13]:
#
# Access using the index field
#
df.loc[480]

Place of Publication                                               London
Date of Publication                                                  1857
Publisher                                            Wertheim & Macintosh
Title                   [The World in which I live, and my place in it...
Author                                                          A., E. S.
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 480, dtype: object

In [15]:
df.loc[1905:, 'Date of Publication'].head(10)

Identifier
1905           1888
1929    1839, 38-54
2836           1897
2854           1865
2956        1860-63
2957           1873
3017           1866
3131           1899
4598           1814
4884           1820
Name: Date of Publication, dtype: object
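The expression `df.loc[1905:, 'Date of Publication']` slices by index label, not by position. A minimal sketch on a toy frame (values taken from the outputs above; `toy`, `from_1905` and `same` are illustrative names) compares the label-based slice with its positional equivalent. It assumes the index is sorted in ascending order, as it is in the toy frame.

```python
import pandas as pd

# Toy frame indexed by Identifier; values taken from the outputs above.
toy = pd.DataFrame(
    {"Date of Publication": ["1857", "1888", "1839, 38-54", "1897"]},
    index=pd.Index([480, 1905, 1929, 2836], name="Identifier"),
)

# Label-based: every row from the label 1905 onwards (inclusive).
from_1905 = toy.loc[1905:, "Date of Publication"]

# Positional equivalent: look up where the label sits, then slice by position.
pos = toy.index.get_loc(1905)
same = toy.iloc[pos:, 0]

print(from_1905.equals(same))  # True

# On a sorted index, a label slice also works when the start label
# itself is not present in the index.
print(list(toy.loc[1000:].index))  # [1905, 1929, 2836]
```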

# Cleaning text fields

In [16]:
#
# Clean the date values so that each entry holds a single year
# more on regex: https://docs.python.org/3.6/howto/regex.html
#
extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)

In [19]:
extr.head(10)

Identifier
206     1879
216     1868
218     1869
472     1851
480     1857
481     1875
519     1872
667      NaN
874     1676
1143    1679
Name: Date of Publication, dtype: object
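To see what the pattern `^(\d{4})` does to the irregular entries, the sketch below applies it to a handful of strings. The first three are taken from the outputs above; the last two (`"[1897?]"` and `"n.d."`) are hypothetical values added to illustrate the non-matching case.

```python
import pandas as pd

raw = pd.Series([
    "1879 [1878]",   # from the dataset preview
    "1839, 38-54",   # from the dataset preview
    "1860-63",       # from the dataset preview
    "[1897?]",       # hypothetical: does not start with a digit
    "n.d.",          # hypothetical: contains no year at all
])

# ^ anchors the match at the start of the string, (\d{4}) captures exactly
# four digits. expand=False returns a Series instead of a one-column DataFrame.
years = raw.str.extract(r"^(\d{4})", expand=False)

print(years)
```

Because of the `^` anchor, entries that do not begin with four digits yield a missing value rather than a year, which would explain entries such as the `NaN` for identifier 667 above.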

In [21]:
#
# Write the cleaned values back as numbers
#
df['Date of Publication'] = pd.to_numeric(extr)

In [22]:
df.loc[1905:, 'Date of Publication'].head(10)

Identifier
1905    1888.0
1929    1839.0
2836    1897.0
2854    1865.0
2956    1860.0
2957    1873.0
3017    1866.0
3131    1899.0
4598    1814.0
4884    1820.0
Name: Date of Publication, dtype: float64
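The years now appear as `1888.0` rather than `1888` because the column contains missing values, and `NaN` forces a `float64` dtype. If whole-number years are preferred, one option (not used in the original notebook) is the nullable integer dtype `Int64`. A minimal sketch, with `extracted`, `as_float` and `as_int` as illustrative names:

```python
import numpy as np
import pandas as pd

# Strings as produced by str.extract, including one missing entry.
extracted = pd.Series(["1888", "1839", np.nan], dtype=object)

# NaN cannot be stored in a plain integer column, so the result is float64.
as_float = pd.to_numeric(extracted)
print(as_float.dtype)  # float64

# The nullable integer dtype keeps whole numbers and marks gaps as <NA>.
as_int = as_float.astype("Int64")
print(as_int.dtype)  # Int64
print(as_int)
```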

In [23]:
df['Date of Publication'].isnull().sum() / len(df)

0.11717147339205986
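So roughly 11.7 % of the entries have no usable year. The same fraction can be computed more compactly, because the mean of a boolean Series is the share of `True` values. A short sketch on made-up numbers (`dates` is a hypothetical Series):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values out of five.
dates = pd.Series([1888.0, np.nan, 1897.0, np.nan, 1865.0])

frac_a = dates.isnull().sum() / len(dates)  # form used in the cell above
frac_b = dates.isnull().mean()              # True counts as 1, False as 0

print(frac_a, frac_b)
```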

In [24]:
df['Date of Publication'].isnull()

Identifier
206        False
216        False
218        False
472        False
480        False
           ...  
4158088    False
4158128    False
4159563     True
4159587    False
4160339    False
Name: Date of Publication, Length: 8287, dtype: bool
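The boolean Series returned by `isnull()` can be used directly as a mask to inspect or separate the affected rows. The following is a sketch on a toy frame; the identifiers come from the outputs above, and `toy`, `missing_mask`, `missing_rows` and `complete_rows` are illustrative names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"Date of Publication": [1888.0, 1897.0, np.nan]},
    index=pd.Index([1905, 2836, 4159563], name="Identifier"),
)

missing_mask = toy["Date of Publication"].isnull()

# Rows without a usable year, e.g. for manual inspection.
missing_rows = toy[missing_mask]

# Rows with a year; ~ inverts the mask.
complete_rows = toy[~missing_mask]

print(list(missing_rows.index))  # [4159563]
print(len(complete_rows))        # 2
```

Selecting with `~missing_mask` gives the same result as `toy.dropna(subset=["Date of Publication"])`.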