# Deleting erroneous or missing values
Once missing or erroneous values have been identified in a dataset, they can either be deleted or imputed (replaced).

In [1]:
import pandas as pd

In [2]:
# Strings that should be read as missing values (NaN)
missings = ['na', '?', 'n/a', '-']
penguins = pd.read_csv('short_penguins.csv', na_values=missings)
penguins

Unnamed: 0,id,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,1,Adelie,torgersen,39.1,18.7,181.0,3750.0,male
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,3,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
3,4,Adelie,Torgersen,,,,,
4,5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female
5,21,Adelie,Biscoe,37.8,18.3,174.0,3400.0,female
6,22,Adelie,Biscoe,37.7,18.7,180.0,3600.0,male
7,23,Adelie,Biscoe,35.9,19.2,189.0,3800.0,female
8,24,Adelie,Biscoe,38.2,18.1,185.0,3950.0,male
9,25,Adelie,Biscoe,38.8,17.2,180.0,3800.0,male
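
To make the effect of `na_values` visible, here is a minimal sketch on a small in-memory CSV. The CSV text and the name `df` are made up for illustration and are not taken from `short_penguins.csv`. Every marker listed in `missings` is read as NaN, so a numeric column stays numeric instead of turning into text.

```python
import io

import pandas as pd

# Hypothetical CSV content, made up for illustration (not the penguins file).
csv_text = """id,species,body_mass_g,sex
1,Adelie,3750,male
2,Adelie,?,female
3,Adelie,n/a,-
4,Adelie,3450,na
"""

missings = ['na', '?', 'n/a', '-']
df = pd.read_csv(io.StringIO(csv_text), na_values=missings)

# Number of missing values per column
print(df.isna().sum())

# body_mass_g is parsed as a float column because '?' and 'n/a' became NaN
print(df['body_mass_g'].dtype)
```

Counting the missing values per column with `isna().sum()` is a quick way to decide whether deleting or imputing is the more sensible choice.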


Rows that contain at least one missing value are dropped. `dropna()` returns a new DataFrame; `penguins` itself is not modified.

In [3]:
penguins.dropna()

Unnamed: 0,id,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,1,Adelie,torgersen,39.1,18.7,181.0,3750.0,male
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,3,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
4,5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female
5,21,Adelie,Biscoe,37.8,18.3,174.0,3400.0,female
6,22,Adelie,Biscoe,37.7,18.7,180.0,3600.0,male
7,23,Adelie,Biscoe,35.9,19.2,189.0,3800.0,female
8,24,Adelie,Biscoe,38.2,18.1,185.0,3950.0,male
9,25,Adelie,Biscoe,38.8,17.2,180.0,3800.0,male
10,31,Adelie,Dream,39.5,16.7,178.0,3250.0,female
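
The output above skips index 3: the remaining rows keep their original index labels. The following sketch illustrates this on hypothetical toy data (the DataFrame `df` and its values are made up), and shows one possible way to renumber the rows afterwards with `reset_index`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration.
df = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Adelie'],
    'bill_length_mm': [39.1, np.nan, 40.3],
    'sex': ['male', np.nan, 'female'],
})

# dropna() returns a new DataFrame; df itself is not modified
cleaned = df.dropna()

# Surviving rows keep their original index labels
print(list(cleaned.index))
print(len(df), len(cleaned))

# Optional: renumber the rows from 0 after dropping
renumbered = cleaned.reset_index(drop=True)
print(list(renumbered.index))
```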


Columns that contain at least one missing value are dropped.

In [4]:
penguins.dropna(axis='columns')

Unnamed: 0,id,species,island
0,1,Adelie,torgersen
1,2,Adelie,Torgersen
2,3,Adelie,Torgersen
3,4,Adelie,Torgersen
4,5,Adelie,Torgersen
5,21,Adelie,Biscoe
6,22,Adelie,Biscoe
7,23,Adelie,Biscoe
8,24,Adelie,Biscoe
9,25,Adelie,Biscoe
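
Dropping every column with a single gap can remove a lot of data, as the output above shows. As a sketch of a less aggressive variant, `how='all'` drops only columns in which every value is missing. The toy DataFrame and its column `notes` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'bill_length_mm': [39.1, np.nan, 40.3],   # one value missing
    'notes': [np.nan, np.nan, np.nan],         # entirely missing
})

# Default how='any': drop columns with at least one missing value
any_missing = df.dropna(axis='columns')

# how='all': drop only columns that consist solely of missing values
all_missing = df.dropna(axis='columns', how='all')

print(list(any_missing.columns))
print(list(all_missing.columns))
```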


The check can also be limited to individual columns. Here, only rows with a missing value in `bill_length_mm` are dropped.

In [5]:
penguins.dropna(subset=['bill_length_mm'])

Unnamed: 0,id,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,1,Adelie,torgersen,39.1,18.7,181.0,3750.0,male
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,3,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
4,5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female
5,21,Adelie,Biscoe,37.8,18.3,174.0,3400.0,female
6,22,Adelie,Biscoe,37.7,18.7,180.0,3600.0,male
7,23,Adelie,Biscoe,35.9,19.2,189.0,3800.0,female
8,24,Adelie,Biscoe,38.2,18.1,185.0,3950.0,male
9,25,Adelie,Biscoe,38.8,17.2,180.0,3800.0,male
10,31,Adelie,Dream,39.5,16.7,178.0,3250.0,female
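
A minimal sketch of how `subset` behaves with one and with several columns, again on hypothetical toy data: a row is dropped as soon as any of the listed columns is missing, while gaps in other columns are ignored.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'bill_length_mm': [39.1, np.nan, 40.3, 36.7],
    'sex': ['male', 'female', np.nan, np.nan],
})

# Only bill_length_mm is checked; missing values in sex are ignored
by_bill = df.dropna(subset=['bill_length_mm'])

# Both columns are checked; a gap in either one drops the row
by_both = df.dropna(subset=['bill_length_mm', 'sex'])

print(by_bill['id'].tolist())
print(by_both['id'].tolist())
```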


Rows and columns can also be dropped with the `thresh` parameter (set it to a number): only rows/columns with at least `thresh` non-missing values are kept.
Let's look at an example right away: keep all rows in which at least 5 features contain a value that is not NaN.

In [6]:
penguins.dropna(thresh=5)

Unnamed: 0,id,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,1,Adelie,torgersen,39.1,18.7,181.0,3750.0,male
1,2,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,3,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
4,5,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female
5,21,Adelie,Biscoe,37.8,18.3,174.0,3400.0,female
6,22,Adelie,Biscoe,37.7,18.7,180.0,3600.0,male
7,23,Adelie,Biscoe,35.9,19.2,189.0,3800.0,female
8,24,Adelie,Biscoe,38.2,18.1,185.0,3950.0,male
9,25,Adelie,Biscoe,38.8,17.2,180.0,3800.0,male
10,31,Adelie,Dream,39.5,16.7,178.0,3250.0,female
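
To see how `thresh` decides, it can help to count the non-missing values per row first. This sketch uses hypothetical toy data and a threshold of 3 instead of 5, purely to keep the example small.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'species': ['Adelie', 'Adelie', 'Adelie'],
    'bill_length_mm': [39.1, np.nan, np.nan],
    'body_mass_g': [3750.0, 3800.0, np.nan],
    'sex': ['male', np.nan, np.nan],
})

# Count the non-missing values in each row
non_missing_per_row = df.notna().sum(axis=1)
print(non_missing_per_row.tolist())

# Keep only rows with at least 3 non-missing values
kept = df.dropna(thresh=3)
print(kept['id'].tolist())
```

The result matches a manual filter on the counts, `df[non_missing_per_row >= 3]`.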


Keep all columns that have at least 15 non-missing values:

In [7]:
penguins.dropna(thresh=15, axis=1)

Unnamed: 0,id,species,island,bill_length_mm,bill_depth_mm,body_mass_g
0,1,Adelie,torgersen,39.1,18.7,3750.0
1,2,Adelie,Torgersen,39.5,17.4,3800.0
2,3,Adelie,Torgersen,40.3,18.0,3250.0
3,4,Adelie,Torgersen,,,
4,5,Adelie,Torgersen,36.7,19.3,3450.0
5,21,Adelie,Biscoe,37.8,18.3,3400.0
6,22,Adelie,Biscoe,37.7,18.7,3600.0
7,23,Adelie,Biscoe,35.9,19.2,3800.0
8,24,Adelie,Biscoe,38.2,18.1,3950.0
9,25,Adelie,Biscoe,38.8,17.2,3800.0
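
A fixed number such as 15 only makes sense relative to the number of rows. One possible variation, shown here as a sketch on hypothetical toy data, derives `thresh` from a required share of filled values. The 75% share and the name `min_share` are assumptions chosen for illustration.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'bill_length_mm': [39.1, np.nan, 40.3, 36.7],
    'sex': ['male', np.nan, np.nan, 'female'],
})

# Assumption: a column must be at least 75% filled to be kept
min_share = 0.75
thresh = math.ceil(min_share * len(df))   # thresh must be an integer

# Non-missing values per column, for comparison with thresh
print(df.notna().sum().to_dict())

kept = df.dropna(thresh=thresh, axis=1)
print(list(kept.columns))
```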


Entire columns can also be dropped by name, regardless of whether they contain missing values.

In [8]:
penguins.drop(columns=['island'])

Unnamed: 0,id,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,1,Adelie,39.1,18.7,181.0,3750.0,male
1,2,Adelie,39.5,17.4,186.0,3800.0,female
2,3,Adelie,40.3,18.0,195.0,3250.0,female
3,4,Adelie,,,,,
4,5,Adelie,36.7,19.3,193.0,3450.0,female
5,21,Adelie,37.8,18.3,174.0,3400.0,female
6,22,Adelie,37.7,18.7,180.0,3600.0,male
7,23,Adelie,35.9,19.2,189.0,3800.0,female
8,24,Adelie,38.2,18.1,185.0,3950.0,male
9,25,Adelie,38.8,17.2,180.0,3800.0,male
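
As with `dropna`, `drop` returns a new DataFrame and leaves the original untouched. The sketch below shows this on hypothetical toy data, together with the equivalent `axis=1` spelling and the `errors='ignore'` option for column names that may not exist. The column name `not_there` is made up.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration.
df = pd.DataFrame({
    'id': [1, 2],
    'species': ['Adelie', 'Adelie'],
    'island': ['Torgersen', 'Biscoe'],
})

# Drop a column by name; df itself keeps the column
without_island = df.drop(columns=['island'])

# Equivalent spelling using axis=1
same = df.drop('island', axis=1)

# A missing column name raises a KeyError unless errors='ignore' is set
unchanged = df.drop(columns=['not_there'], errors='ignore')

print(list(without_island.columns))
print(list(df.columns))
```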
