___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

<head>
    <center><title>~ Pandas DataFrames | Lesson-1 ~</title></center>
</head>
    

# DataFrames

``DataFrames`` are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a collection of Series objects that have been put together to share the same index. Let's use pandas to explore this topic!
https://daten_setcience.eu/de/programmierung/python-pandas-datenrahmen/
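
The idea of a DataFrame as Series objects sharing one index can be sketched as follows. This is a minimal illustration, not part of the original notebook; the names `age`, `height` and `people` and their values are made up for the example.

```python
import pandas as pd

# Two Series with the same index labels (hypothetical example data)
age = pd.Series([25, 32, 47], index=["a", "b", "c"], name="age")
height = pd.Series([1.70, 1.82, 1.65], index=["a", "b", "c"], name="height")

# Placing them side by side yields a DataFrame with one shared index
people = pd.concat([age, height], axis=1)

print(people)
print(type(people["age"]))    # each column is again a Series
print(people.index.tolist())  # the index is shared by all columns
```

Each Series name becomes a column label, and the common index labels become the row labels of the DataFrame.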

In [2]:
import pandas as pd
import numpy as np

## Creating a DataFrame

### Creating a DataFrame from a ``list`` of data and columns

In [7]:
daten_set = [1, 3, 5, 7, 9, 18]
columns = ['alter']
daten_set, columns

([1, 3, 5, 7, 9, 18], ['alter'])

In [8]:
pd.DataFrame(daten_set, columns=columns)

Unnamed: 0,alter
0,1
1,3
2,5
3,7
4,9
5,18


### Creating a DataFrame from a ``NumPy array``

In [9]:
daten_set = np.arange(1, 24, 2).reshape(3, 4)
daten_set

array([[ 1,  3,  5,  7],
       [ 9, 11, 13, 15],
       [17, 19, 21, 23]])

In [11]:
pd.DataFrame(daten_set, columns=['var1','var2','var3','var4'])

Unnamed: 0,var1,var2,var3,var4
0,1,3,5,7
1,9,11,13,15
2,17,19,21,23


In [13]:
df = pd.DataFrame(data=daten_set, columns=['var1','var2','var3','var4'])
df

Unnamed: 0,var1,var2,var3,var4
0,1,3,5,7
1,9,11,13,15
2,17,19,21,23


In [14]:
df.head(2)

Unnamed: 0,var1,var2,var3,var4
0,1,3,5,7
1,9,11,13,15


In [15]:
df.tail(2)

Unnamed: 0,var1,var2,var3,var4
1,9,11,13,15
2,17,19,21,23


In [16]:
df.sample(2)

Unnamed: 0,var1,var2,var3,var4
0,1,3,5,7
2,17,19,21,23


In [17]:
df.columns

Index(['var1', 'var2', 'var3', 'var4'], dtype='object')

In [20]:
[i for i in df.columns]

['var1', 'var2', 'var3', 'var4']

In [21]:
df.columns=['new1','new2','new3','new4']
df

Unnamed: 0,new1,new2,new3,new4
0,1,3,5,7
1,9,11,13,15
2,17,19,21,23


In [None]:
type(df)

pandas.core.frame.DataFrame

In [27]:
print("Shape (rows, columns):", df.shape, "Columns:", df.shape[1], "Dimensions:", df.ndim, "Size:", df.size, "len:", len(df))

Shape (rows, columns): (3, 4) Columns: 4 Dimensions: 2 Size: 12 len: 3


In [29]:
df

Unnamed: 0,new1,new2,new3,new4
0,1,3,5,7
1,9,11,13,15
2,17,19,21,23


In [28]:
df.values

array([[ 1,  3,  5,  7],
       [ 9, 11, 13, 15],
       [17, 19, 21, 23]])

In [40]:
df.index.values

array([0, 1, 2], dtype=int64)

In [39]:
print("Index:", df.index.values, "Index[1]:", df.index[1])

Index: [0 1 2] Index[1]: 1


### Creating a DataFrame from a ``dict``

In [41]:
s1 = np.random.randint(2, 10, size = 4)
s2 = np.random.randint(3, 10, size = 4)
s3 = np.random.randint(4, 15, size = 4)

In [42]:
s1, s2, s3

(array([3, 9, 6, 9]), array([6, 5, 9, 6]), array([ 6,  7, 10,  6]))

In [48]:
dict_= {'var1':s1,'var2':s2,'var3':s3}

In [49]:
df_ = pd.DataFrame(dict_)
df_

Unnamed: 0,var1,var2,var3
0,3,6,6
1,9,5,7
2,6,9,10
3,9,6,6


In [50]:
df_.index

RangeIndex(start=0, stop=4, step=1)

In [53]:
[i for i in df_.index]

[0, 1, 2, 3]

In [54]:
df_.index = ["a", "b", "c", "d"]

In [55]:
df_

Unnamed: 0,var1,var2,var3
a,3,6,6
b,9,5,7
c,6,9,10
d,9,6,6


In [61]:
# We can check whether a given column name belongs to the DataFrame
"var2" in df_, 'var5' in df_

(True, False)

## Indexing, Selection, and Slicing of DataFrames
Let's now revisit ``indexing``, ``selection`` and ``slicing``, along with various ``attributes``, using another DataFrame.

In [62]:
from numpy.random import randn
np.random.seed(101)

In [69]:
df = pd.DataFrame(randn(5, 4),
                    index='A B C D E'.split(),
                    columns='W X Y Z'.split())

In [70]:
df

Unnamed: 0,W,X,Y,Z
A,-1.467514,-0.494095,-0.162535,0.485809
B,0.392489,0.221491,-0.855196,1.54199
C,0.666319,-0.538235,-0.568581,1.407338
D,0.641806,-0.9051,-0.391157,1.028293
E,-1.972605,-0.866885,0.720788,-1.223082


In [71]:
# Creating a DataFrame with positional arguments
pd.DataFrame(randn(5, 4), 'a b c d e'.split(), 'w x y z'.split())

Unnamed: 0,w,x,y,z
a,1.60678,-1.11571,-1.385379,-1.32966
b,0.04146,-0.411055,-0.771329,0.110477
c,-0.804652,0.253548,0.649148,0.358941
d,-1.080471,0.902398,0.161781,0.833029
e,0.97572,-0.388239,0.783316,-0.708954


In [72]:
# Creating a DataFrame with keyword arguments
pd.DataFrame(randn(5, 4), columns='w x y z'.split(), index='a b c d e'.split())

Unnamed: 0,w,x,y,z
a,0.586847,-1.621348,0.677535,0.026105
b,-1.678284,0.333973,-0.532471,2.117727
c,0.197524,2.302987,0.729024,-0.863091
d,0.305632,0.243178,0.864165,-1.560931
e,-0.251897,-0.57812,0.236996,0.20078


### Selection and Indexing

Let's learn the various ways to retrieve data from a DataFrame.

In [73]:
df

Unnamed: 0,W,X,Y,Z
A,-1.467514,-0.494095,-0.162535,0.485809
B,0.392489,0.221491,-0.855196,1.54199
C,0.666319,-0.538235,-0.568581,1.407338
D,0.641806,-0.9051,-0.391157,1.028293
E,-1.972605,-0.866885,0.720788,-1.223082


In [74]:
df['Y']

A   -0.162535
B   -0.855196
C   -0.568581
D   -0.391157
E    0.720788
Name: Y, dtype: float64

In [76]:
# Attribute-style (dot) access, sometimes called SQL syntax (NOT RECOMMENDED!)
df.Y

A   -0.162535
B   -0.855196
C   -0.568581
D   -0.391157
E    0.720788
Name: Y, dtype: float64

DataFrame columns are just Series.

In [81]:
df['Y'], type(df['Y'])

(A   -0.162535
 B   -0.855196
 C   -0.568581
 D   -0.391157
 E    0.720788
 Name: Y, dtype: float64,
 pandas.core.series.Series)

In [80]:
df[['Y']], type(df[['Y']])

(          Y
 A -0.162535
 B -0.855196
 C -0.568581
 D -0.391157
 E  0.720788,
 pandas.core.frame.DataFrame)

In [84]:
# Pass a list of column names
# df['Z','X'] raises a KeyError
df[['Z','X']]

Unnamed: 0,Z,X
A,0.485809,-0.494095
B,1.54199,0.221491
C,1.407338,-0.538235
D,1.028293,-0.9051
E,-1.223082,-0.866885


In [103]:
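# Slicing with [] selects rows by label, not columns; no row label lies between 'X' and 'Z', so the result is empty
df["X":"Z"]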

Unnamed: 0,W,X,Y,Z


In [107]:
df['B':'C']

Unnamed: 0,W,X,Y,Z
B,0.392489,0.221491,-0.855196,1.54199
C,0.666319,-0.538235,-0.568581,1.407338


In [None]:
df["B":"C"][["Y", "Z"]]

Unnamed: 0,Y,Z
B,-0.855196,1.54199
C,-0.568581,1.407338


**Creating a new column:**

In [108]:
df

Unnamed: 0,W,X,Y,Z
A,-1.467514,-0.494095,-0.162535,0.485809
B,0.392489,0.221491,-0.855196,1.54199
C,0.666319,-0.538235,-0.568581,1.407338
D,0.641806,-0.9051,-0.391157,1.028293
E,-1.972605,-0.866885,0.720788,-1.223082


In [109]:
df['X*Y'] = df['X'] * df['Y']
df

Unnamed: 0,W,X,Y,Z,X*Y
A,-1.467514,-0.494095,-0.162535,0.485809,0.080308
B,0.392489,0.221491,-0.855196,1.54199,-0.189418
C,0.666319,-0.538235,-0.568581,1.407338,0.30603
D,0.641806,-0.9051,-0.391157,1.028293,0.354036
E,-1.972605,-0.866885,0.720788,-1.223082,-0.62484


In [110]:
df["T"] = [1, 2, 3, 4, 5]
df 

Unnamed: 0,W,X,Y,Z,X*Y,T
A,-1.467514,-0.494095,-0.162535,0.485809,0.080308,1
B,0.392489,0.221491,-0.855196,1.54199,-0.189418,2
C,0.666319,-0.538235,-0.568581,1.407338,0.30603,3
D,0.641806,-0.9051,-0.391157,1.028293,0.354036,4
E,-1.972605,-0.866885,0.720788,-1.223082,-0.62484,5


### Removing Columns & Rows

#### Removing columns

In [111]:
df.drop('X*Y', axis=1)

Unnamed: 0,W,X,Y,Z,T
A,-1.467514,-0.494095,-0.162535,0.485809,1
B,0.392489,0.221491,-0.855196,1.54199,2
C,0.666319,-0.538235,-0.568581,1.407338,3
D,0.641806,-0.9051,-0.391157,1.028293,4
E,-1.972605,-0.866885,0.720788,-1.223082,5


In [112]:
df

Unnamed: 0,W,X,Y,Z,X*Y,T
A,-1.467514,-0.494095,-0.162535,0.485809,0.080308,1
B,0.392489,0.221491,-0.855196,1.54199,-0.189418,2
C,0.666319,-0.538235,-0.568581,1.407338,0.30603,3
D,0.641806,-0.9051,-0.391157,1.028293,0.354036,4
E,-1.972605,-0.866885,0.720788,-1.223082,-0.62484,5


In [113]:
df.drop(["X*Y", "T"], axis=1)

Unnamed: 0,W,X,Y,Z
A,-1.467514,-0.494095,-0.162535,0.485809
B,0.392489,0.221491,-0.855196,1.54199
C,0.666319,-0.538235,-0.568581,1.407338
D,0.641806,-0.9051,-0.391157,1.028293
E,-1.972605,-0.866885,0.720788,-1.223082


In [114]:
df

Unnamed: 0,W,X,Y,Z,X*Y,T
A,-1.467514,-0.494095,-0.162535,0.485809,0.080308,1
B,0.392489,0.221491,-0.855196,1.54199,-0.189418,2
C,0.666319,-0.538235,-0.568581,1.407338,0.30603,3
D,0.641806,-0.9051,-0.391157,1.028293,0.354036,4
E,-1.972605,-0.866885,0.720788,-1.223082,-0.62484,5


In [115]:
# The change is not applied to df itself unless ``inplace=True`` is specified!
df.drop(["X*Y", "T"], axis=1, inplace=True)

In [116]:
df

Unnamed: 0,W,X,Y,Z
A,-1.467514,-0.494095,-0.162535,0.485809
B,0.392489,0.221491,-0.855196,1.54199
C,0.666319,-0.538235,-0.568581,1.407338
D,0.641806,-0.9051,-0.391157,1.028293
E,-1.972605,-0.866885,0.720788,-1.223082
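
The difference between a plain `drop()` call and one that actually changes the DataFrame can be sketched as below. This is an illustrative example with made-up data; `frame`, `without_t` and `still_has_t` are hypothetical names, not from the notebook.

```python
import pandas as pd

frame = pd.DataFrame({"W": [1, 2], "X": [3, 4], "T": [5, 6]})

# drop() returns a new DataFrame; `frame` itself is left untouched
without_t = frame.drop("T", axis=1)
still_has_t = "T" in frame
print(still_has_t)             # the original still contains column "T"

# Reassigning the result has the same effect as passing inplace=True
frame = frame.drop(columns=["T"])
print(list(frame.columns))
```

Reassigning the result keeps each step explicit, which some readers find easier to follow than `inplace=True`.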


#### Removing rows

In [117]:
df.drop('C', axis=0)

Unnamed: 0,W,X,Y,Z
A,-1.467514,-0.494095,-0.162535,0.485809
B,0.392489,0.221491,-0.855196,1.54199
D,0.641806,-0.9051,-0.391157,1.028293
E,-1.972605,-0.866885,0.720788,-1.223082


In [118]:
df

Unnamed: 0,W,X,Y,Z
A,-1.467514,-0.494095,-0.162535,0.485809
B,0.392489,0.221491,-0.855196,1.54199
C,0.666319,-0.538235,-0.568581,1.407338
D,0.641806,-0.9051,-0.391157,1.028293
E,-1.972605,-0.866885,0.720788,-1.223082


In [119]:
# the default value of axis is 0 (axis=0)
df = df.drop('C', axis=0)
df

Unnamed: 0,W,X,Y,Z
A,-1.467514,-0.494095,-0.162535,0.485809
B,0.392489,0.221491,-0.855196,1.54199
D,0.641806,-0.9051,-0.391157,1.028293
E,-1.972605,-0.866885,0.720788,-1.223082


In [120]:
df

Unnamed: 0,W,X,Y,Z
A,-1.467514,-0.494095,-0.162535,0.485809
B,0.392489,0.221491,-0.855196,1.54199
D,0.641806,-0.9051,-0.391157,1.028293
E,-1.972605,-0.866885,0.720788,-1.223082


In [125]:
df.drop(["D","E"], axis=0, inplace=True)

In [126]:
df

Unnamed: 0,W,X,Y,Z
A,-1.467514,-0.494095,-0.162535,0.485809
B,0.392489,0.221491,-0.855196,1.54199


### Selecting Rows

Let's first take a quick look at ``.loc[]`` and ``.iloc[]``.

#### ``.loc[]``
Lets us select data by the labels (names) of rows (index) and columns.

#### `.iloc[]`
Lets us select data by the **integer positions** of rows (index) and columns. It follows classic positional indexing logic.
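
A practical consequence of the two access styles is how slices treat their end point. The sketch below uses a small made-up DataFrame (`scores` and `numbers` are hypothetical names) to show that `.loc[]` slices include the end label while `.iloc[]` slices exclude the end position.

```python
import pandas as pd

scores = pd.DataFrame(
    {"var1": [10, 20, 30, 40], "var2": [1, 2, 3, 4]},
    index=["a", "b", "c", "d"],
)

by_label = scores.loc["b":"c"]   # label slice: the end label "c" is included
by_position = scores.iloc[1:3]   # positional slice: position 3 is excluded
print(by_label.index.tolist(), by_position.index.tolist())

# With a default integer index the same-looking slice gives different results
numbers = scores.reset_index(drop=True)
print(len(numbers.loc[1:3]))     # labels 1, 2 and 3
print(len(numbers.iloc[1:3]))    # positions 1 and 2
```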

In [127]:
daten_set = np.random.randint(1, 40, size=(8, 4))
df = pd.DataFrame(daten_set, columns = ["var1","var2","var3",'var4'])
df

Unnamed: 0,var1,var2,var3,var4
0,13,31,21,6
1,10,35,6,26
2,15,29,11,12
3,34,10,21,19
4,31,4,30,28
5,18,8,26,31
6,39,22,30,7
7,7,12,34,35


In [134]:
df.loc[4]

var1    31
var2     4
var3    30
var4    28
Name: 4, dtype: int32

In [135]:
df.loc[[4]]

Unnamed: 0,var1,var2,var3,var4
4,31,4,30,28


In [137]:
# Slicing returns the same data type, here a DataFrame; note that .loc includes the end label
df.loc[2:5]

Unnamed: 0,var1,var2,var3,var4
2,15,29,11,12
3,34,10,21,19
4,31,4,30,28
5,18,8,26,31


In [138]:
df.iloc[2:5]

Unnamed: 0,var1,var2,var3,var4
2,15,29,11,12
3,34,10,21,19
4,31,4,30,28


In [139]:
df

Unnamed: 0,var1,var2,var3,var4
0,13,31,21,6
1,10,35,6,26
2,15,29,11,12
3,34,10,21,19
4,31,4,30,28
5,18,8,26,31
6,39,22,30,7
7,7,12,34,35


In [140]:
df.index='a b c d e f g h'.split()
df

Unnamed: 0,var1,var2,var3,var4
a,13,31,21,6
b,10,35,6,26
c,15,29,11,12
d,34,10,21,19
e,31,4,30,28
f,18,8,26,31
g,39,22,30,7
h,7,12,34,35


In [141]:
df.iloc[1:4]

Unnamed: 0,var1,var2,var3,var4
b,10,35,6,26
c,15,29,11,12
d,34,10,21,19


In [None]:
# df.loc[1:4] raises an error because the index labels are now strings, not integers

In [142]:
df.loc['c':'g']

Unnamed: 0,var1,var2,var3,var4
c,15,29,11,12
d,34,10,21,19
e,31,4,30,28
f,18,8,26,31
g,39,22,30,7


In [143]:
df

Unnamed: 0,var1,var2,var3,var4
a,13,31,21,6
b,10,35,6,26
c,15,29,11,12
d,34,10,21,19
e,31,4,30,28
f,18,8,26,31
g,39,22,30,7
h,7,12,34,35


In [145]:
df.iloc[4, 1]

4

In [147]:
df.iloc[:, 1]

a    31
b    35
c    29
d    10
e     4
f     8
g    22
h    12
Name: var2, dtype: int32

In [149]:
df.loc['d':'g', 'var3']

d    21
e    30
f    26
g    30
Name: var3, dtype: int32

In [162]:
df.loc[:, 'var3']

a    21
b     6
c    11
d    21
e    30
f    26
g    30
h    34
Name: var3, dtype: int32

In [154]:
df.loc['d':'g'][['var3']]

Unnamed: 0,var3
d,21
e,30
f,26
g,30


In [156]:
# How can we select this data as a DataFrame rather than a Series?
df.loc['d':'g'][['var3']]

Unnamed: 0,var3
d,21
e,30
f,26
g,30


In [157]:
df.loc['d':'g', ["var3"]]

Unnamed: 0,var3
d,21
e,30
f,26
g,30


In [158]:
df.iloc[2:5, 2]

c    11
d    21
e    30
Name: var3, dtype: int32

In [160]:
df.iloc[2:5][['var2']]

Unnamed: 0,var2
c,29
d,10
e,4


Let's continue to examine ``.loc[]`` and ``.iloc[]``.

In [163]:
df = pd.DataFrame(randn(5, 4),
                    index='A B C D E'.split(),
                    columns='W X Y Z'.split())
df

Unnamed: 0,W,X,Y,Z
A,-0.005648,-0.715531,0.694564,-0.771256
B,-0.908903,0.776504,-1.887987,-1.045924
C,-0.4957,-0.050826,0.914642,0.300847
D,-0.239617,-0.394781,-1.022971,1.308103
E,0.348337,-0.503218,0.513932,-0.64482


In [164]:
df.loc['C']

W   -0.495700
X   -0.050826
Y    0.914642
Z    0.300847
Name: C, dtype: float64

Or select by position instead of by label:

In [172]:
df.iloc[2]

W   -0.495700
X   -0.050826
Y    0.914642
Z    0.300847
Name: C, dtype: float64

In [171]:
type(df.iloc[2])

pandas.core.series.Series

In [174]:
df.iloc[2].values

array([-0.49569977, -0.05082572,  0.91464243,  0.30084735])

In [170]:
df.iloc[[2]]

Unnamed: 0,W,X,Y,Z
C,-0.4957,-0.050826,0.914642,0.300847


In [173]:
type(df.iloc[[2]])

pandas.core.frame.DataFrame

In [175]:
df.iloc[[2]].values

array([[-0.49569977, -0.05082572,  0.91464243,  0.30084735]])

In [167]:
# returns a DataFrame
df.loc[['C']]

Unnamed: 0,W,X,Y,Z
C,-0.4957,-0.050826,0.914642,0.300847


In [169]:
# returns a DataFrame
df.iloc[[2]]

Unnamed: 0,W,X,Y,Z
C,-0.4957,-0.050826,0.914642,0.300847


In [177]:
# Now, how can we select the entire column 'Y' with '.iloc[]'?
df.iloc[:, 2]

A    0.694564
B   -1.887987
C    0.914642
D   -1.022971
E    0.513932
Name: Y, dtype: float64

In [179]:
df.iloc[:,[2]]

Unnamed: 0,Y
A,0.694564
B,-1.887987
C,0.914642
D,-1.022971
E,0.513932


In [181]:
df.columns

Index(['W', 'X', 'Y', 'Z'], dtype='object')

In [180]:
df[['Y','X']]

Unnamed: 0,Y,X
A,0.694564,-0.715531
B,-1.887987,0.776504
C,0.914642,-0.050826
D,-1.022971,-0.394781
E,0.513932,-0.503218


In [182]:
df[['X','Y']]

Unnamed: 0,X,Y
A,-0.715531,0.694564
B,0.776504,-1.887987
C,-0.050826,0.914642
D,-0.394781,-1.022971
E,-0.503218,0.513932


#### Selecting a subset of rows and columns

`.loc[[row labels|names], [column labels|names]]`

`.iloc[[row index numbers], [column index numbers]]`
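
Whether the selectors are scalars or lists decides what comes back. The following sketch uses a small made-up DataFrame (`grid` and the other variable names are illustrative) to show the three cases.

```python
import pandas as pd

grid = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["A", "B", "C"],
    columns=["W", "X", "Y"],
)

value = grid.loc["B", "X"]                # scalar, scalar -> a single value
row = grid.loc["B", ["W", "Y"]]           # scalar, list   -> a Series
block = grid.loc[["A", "C"], ["W", "Y"]]  # list, list     -> a DataFrame

# The same block selected by integer position
same_block = grid.iloc[[0, 2], [0, 2]]

print(value, type(row).__name__, type(block).__name__)
print(block.equals(same_block))
```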

In [183]:
df

Unnamed: 0,W,X,Y,Z
A,-0.005648,-0.715531,0.694564,-0.771256
B,-0.908903,0.776504,-1.887987,-1.045924
C,-0.4957,-0.050826,0.914642,0.300847
D,-0.239617,-0.394781,-1.022971,1.308103
E,0.348337,-0.503218,0.513932,-0.64482


In [190]:
df.loc['C','Z']

0.3008473458905787

In [186]:
# Let's select the same data as a DataFrame
df.loc[['C'],['Z']]

Unnamed: 0,Z
C,0.300847


In [188]:
df.loc[['C']][['Z']]

Unnamed: 0,Z
C,0.300847


In [193]:
df.loc[['A','C'],['W','Z']]

Unnamed: 0,W,Z
A,-0.005648,-0.771256
C,-0.4957,0.300847


In [194]:
df.loc[['A','C']][['W','Z']]

Unnamed: 0,W,Z
A,-0.005648,-0.771256
C,-0.4957,0.300847


In [196]:
df.iloc[[0,  2], [0, 3]]

Unnamed: 0,W,Z
A,-0.005648,-0.771256
C,-0.4957,0.300847


#### Conditional Selection
An important feature of pandas is conditional selection using bracket notation, which is very similar to NumPy:

In [197]:
df

Unnamed: 0,W,X,Y,Z
A,-0.005648,-0.715531,0.694564,-0.771256
B,-0.908903,0.776504,-1.887987,-1.045924
C,-0.4957,-0.050826,0.914642,0.300847
D,-0.239617,-0.394781,-1.022971,1.308103
E,0.348337,-0.503218,0.513932,-0.64482


In [199]:
# returns a DataFrame of boolean values
df > 0.5

Unnamed: 0,W,X,Y,Z
A,False,False,True,False
B,False,True,False,False
C,False,False,True,False
D,False,False,False,True
E,False,False,True,False


In [200]:
df[df > 0.5]

Unnamed: 0,W,X,Y,Z
A,,,0.694564,
B,,0.776504,,
C,,,0.914642,
D,,,,1.308103
E,,,0.513932,


In [201]:
# Filters by row: returns the rows where the condition is True.
df[df['Z'] > 0.5]

Unnamed: 0,W,X,Y,Z
D,-0.239617,-0.394781,-1.022971,1.308103


In [202]:
df[['Z']]

Unnamed: 0,Z
A,-0.771256
B,-1.045924
C,0.300847
D,1.308103
E,-0.64482


In [203]:
df

Unnamed: 0,W,X,Y,Z
A,-0.005648,-0.715531,0.694564,-0.771256
B,-0.908903,0.776504,-1.887987,-1.045924
C,-0.4957,-0.050826,0.914642,0.300847
D,-0.239617,-0.394781,-1.022971,1.308103
E,0.348337,-0.503218,0.513932,-0.64482


In [204]:
df[df['X'] < 1][['W']]

Unnamed: 0,W
A,-0.005648
B,-0.908903
C,-0.4957
D,-0.239617
E,0.348337


In [207]:
# How can we select the data as a DataFrame?

In [208]:
df[df['Y'] > 0][['Z', 'W', 'Y']]

Unnamed: 0,Z,W,Y
A,-0.771256,-0.005648,0.694564
C,0.300847,-0.4957,0.914642
E,-0.64482,0.348337,0.513932


Note: To combine two conditions, use **|** for `or` and **&** for `and`, and wrap each condition in parentheses.
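
The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `>`, and Python's own `and`/`or` keywords do not work element-wise on a Series. A minimal sketch with made-up data (`df_demo` and the result names are hypothetical) shows the pattern, plus `~` for negation:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"W": [0.5, -1.0, 2.0], "Y": [0.2, 0.3, 1.5]},
    index=["A", "B", "C"],
)

# Each condition is wrapped in parentheses before combining
both = df_demo[(df_demo["W"] > 0) & (df_demo["Y"] < 1)]    # element-wise "and"
either = df_demo[(df_demo["W"] > 0) | (df_demo["Y"] < 1)]  # element-wise "or"
not_positive = df_demo[~(df_demo["W"] > 0)]                # element-wise "not"

print(both.index.tolist())
print(either.index.tolist())
print(not_positive.index.tolist())
```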

In [209]:
df

Unnamed: 0,W,X,Y,Z
A,-0.005648,-0.715531,0.694564,-0.771256
B,-0.908903,0.776504,-1.887987,-1.045924
C,-0.4957,-0.050826,0.914642,0.300847
D,-0.239617,-0.394781,-1.022971,1.308103
E,0.348337,-0.503218,0.513932,-0.64482


In [210]:
df[(df['W'] > 0) & (df['Y'] < 1)]

Unnamed: 0,W,X,Y,Z
E,0.348337,-0.503218,0.513932,-0.64482


In [211]:
df[(df['W'] > 0) & (df['Y'] < 1)] = 0
df

Unnamed: 0,W,X,Y,Z
A,-0.005648,-0.715531,0.694564,-0.771256
B,-0.908903,0.776504,-1.887987,-1.045924
C,-0.4957,-0.050826,0.914642,0.300847
D,-0.239617,-0.394781,-1.022971,1.308103
E,0.0,0.0,0.0,0.0


#### Conditional Selection with ``.loc[]`` and ``.iloc[]``

In [212]:
df

Unnamed: 0,W,X,Y,Z
A,-0.005648,-0.715531,0.694564,-0.771256
B,-0.908903,0.776504,-1.887987,-1.045924
C,-0.4957,-0.050826,0.914642,0.300847
D,-0.239617,-0.394781,-1.022971,1.308103
E,0.0,0.0,0.0,0.0


In [213]:
df.loc[(df.X > 0), ['X','Z']]

Unnamed: 0,X,Z
B,0.776504,-1.045924


In [214]:
df.loc[(df.X > 0)][['X','Z']]

Unnamed: 0,X,Z
B,0.776504,-1.045924


In [215]:
df.loc[((df.W > 1) | (df.Y < 1)), ['Y','Z']]

Unnamed: 0,Y,Z
A,0.694564,-0.771256
B,-1.887987,-1.045924
C,0.914642,0.300847
D,-1.022971,1.308103
E,0.0,0.0


## More Index Details

Let's discuss some more indexing features, including resetting the index or setting it to something else. We'll also talk about index hierarchy!
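
As a compact preview, resetting and setting the index can be sketched as below. The data and the names `states`, `flat`, `dropped` and `by_state` are made up for illustration. Both methods return a new DataFrame and leave the original unchanged unless the result is reassigned or `inplace=True` is passed.

```python
import pandas as pd

states = pd.DataFrame({"W": [1.0, 2.0, 3.0]}, index=["A", "B", "C"])

flat = states.reset_index()              # old index becomes a column named "index"
dropped = states.reset_index(drop=True)  # old index is discarded

flat["state"] = ["CA", "NY", "WY"]
by_state = flat.set_index("state")       # the "state" column becomes the index

print(list(flat.columns))
print(dropped.index.tolist())
print(by_state.index.tolist())
print(states.index.tolist())             # the original is unchanged
```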

In [216]:
df

Unnamed: 0,W,X,Y,Z
A,-0.005648,-0.715531,0.694564,-0.771256
B,-0.908903,0.776504,-1.887987,-1.045924
C,-0.4957,-0.050826,0.914642,0.300847
D,-0.239617,-0.394781,-1.022971,1.308103
E,0.0,0.0,0.0,0.0


In [217]:
# Reset to the default 0, 1, ..., n index
df.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,-0.005648,-0.715531,0.694564,-0.771256
1,B,-0.908903,0.776504,-1.887987,-1.045924
2,C,-0.4957,-0.050826,0.914642,0.300847
3,D,-0.239617,-0.394781,-1.022971,1.308103
4,E,0.0,0.0,0.0,0.0


In [218]:
df

Unnamed: 0,W,X,Y,Z
A,-0.005648,-0.715531,0.694564,-0.771256
B,-0.908903,0.776504,-1.887987,-1.045924
C,-0.4957,-0.050826,0.914642,0.300847
D,-0.239617,-0.394781,-1.022971,1.308103
E,0.0,0.0,0.0,0.0


In [220]:
df.reset_index(drop=True)

Unnamed: 0,W,X,Y,Z
0,-0.005648,-0.715531,0.694564,-0.771256
1,-0.908903,0.776504,-1.887987,-1.045924
2,-0.4957,-0.050826,0.914642,0.300847
3,-0.239617,-0.394781,-1.022971,1.308103
4,0.0,0.0,0.0,0.0


In [221]:
neueindx = 'CA NY WY OR CO'.split()
neueindx

['CA', 'NY', 'WY', 'OR', 'CO']

In [222]:
df['neueidx'] = neueindx
df

Unnamed: 0,W,X,Y,Z,neueidx
A,-0.005648,-0.715531,0.694564,-0.771256,CA
B,-0.908903,0.776504,-1.887987,-1.045924,NY
C,-0.4957,-0.050826,0.914642,0.300847,WY
D,-0.239617,-0.394781,-1.022971,1.308103,OR
E,0.0,0.0,0.0,0.0,CO


In [223]:
df.set_index('neueidx')

Unnamed: 0_level_0,W,X,Y,Z
neueidx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CA,-0.005648,-0.715531,0.694564,-0.771256
NY,-0.908903,0.776504,-1.887987,-1.045924
WY,-0.4957,-0.050826,0.914642,0.300847
OR,-0.239617,-0.394781,-1.022971,1.308103
CO,0.0,0.0,0.0,0.0


In [224]:
df

Unnamed: 0,W,X,Y,Z,neueidx
A,-0.005648,-0.715531,0.694564,-0.771256,CA
B,-0.908903,0.776504,-1.887987,-1.045924,NY
C,-0.4957,-0.050826,0.914642,0.300847,WY
D,-0.239617,-0.394781,-1.022971,1.308103,OR
E,0.0,0.0,0.0,0.0,CO


In [225]:
df.set_index('neueidx',inplace=True)

In [226]:
df

Unnamed: 0_level_0,W,X,Y,Z
neueidx,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CA,-0.005648,-0.715531,0.694564,-0.771256
NY,-0.908903,0.776504,-1.887987,-1.045924
WY,-0.4957,-0.050826,0.914642,0.300847
OR,-0.239617,-0.394781,-1.022971,1.308103
CO,0.0,0.0,0.0,0.0


## Multi-Index and Index Hierarchy

Let's go over how to work with a multi-index. First, we'll create a quick example of what a multi-indexed DataFrame looks like:

In [232]:
# Index levels
stufe1 = [1, 2, 3, 1, 2, 3, 5, 6, 7]
stufe2 = ['M1', 'M1', 'M1', 'M2', 'M2', 'M2','M3', 'M3', 'M3']
multi_index = list(zip(stufe2 , stufe1))
multi_index

[('M1', 1),
 ('M1', 2),
 ('M1', 3),
 ('M2', 1),
 ('M2', 2),
 ('M2', 3),
 ('M3', 5),
 ('M3', 6),
 ('M3', 7)]

In [233]:
index_ = pd.MultiIndex.from_tuples(multi_index)

In [234]:
index_

MultiIndex([('M1', 1),
            ('M1', 2),
            ('M1', 3),
            ('M2', 1),
            ('M2', 2),
            ('M2', 3),
            ('M3', 5),
            ('M3', 6),
            ('M3', 7)],
           )

In [235]:
df = pd.DataFrame(np.random.randn(9, 4), 
                  index=index_, 
                  columns=['A','B','C','D'])
df

Unnamed: 0,Unnamed: 1,A,B,C,D
M1,1,-0.186946,-0.07285,0.360293,-0.253136
M1,2,1.424846,-1.148209,-1.745976,-0.851874
M1,3,-0.148627,0.478169,-2.079632,0.364785
M2,1,-0.389643,1.054263,0.193175,0.866667
M2,2,1.912587,1.212039,-0.828568,0.508801
M2,3,1.812898,0.438464,0.184212,0.088795
M3,5,-0.448151,2.25707,0.030853,-0.268911
M3,6,2.770488,-0.573197,0.014738,1.267547
M3,7,0.368468,1.02288,0.344081,-0.904709


Now let's show how to index this! For an index hierarchy we use ``df.loc[]``; if this were on the column axis, you would simply use the normal bracket notation ``df[]``. Calling one level of the index returns the sub-DataFrame:

In [236]:
df.loc['M1']

Unnamed: 0,A,B,C,D
1,-0.186946,-0.07285,0.360293,-0.253136
2,1.424846,-1.148209,-1.745976,-0.851874
3,-0.148627,0.478169,-2.079632,0.364785


In [237]:
df.loc['M1'].loc[2]

A    1.424846
B   -1.148209
C   -1.745976
D   -0.851874
Name: 2, dtype: float64

In [238]:
df.loc['M1'].loc[[2]]

Unnamed: 0,A,B,C,D
2,1.424846,-1.148209,-1.745976,-0.851874
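
The chained lookups above can also be written as a single lookup with a tuple key. The sketch below uses a small made-up frame (`nested` and the result names are hypothetical) and assumes a two-level index like the one in this lesson.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples([("M1", 1), ("M1", 2), ("M2", 1), ("M2", 2)])
nested = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["A", "B"])

chained = nested.loc["M1"].loc[2]   # two lookups, one per index level
direct = nested.loc[("M1", 2), :]   # one lookup with a tuple key
cell = nested.loc[("M1", 2), "B"]   # tuple key plus a column label

print(chained.tolist(), direct.tolist(), cell)
```

A single tuple lookup avoids the intermediate DataFrame that the chained form creates.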


In [239]:
df.index.names

FrozenList([None, None])

In [240]:
df.index.names = ['Group','Num']

In [241]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
M1,1,-0.186946,-0.07285,0.360293,-0.253136
M1,2,1.424846,-1.148209,-1.745976,-0.851874
M1,3,-0.148627,0.478169,-2.079632,0.364785
M2,1,-0.389643,1.054263,0.193175,0.866667
M2,2,1.912587,1.212039,-0.828568,0.508801
M2,3,1.812898,0.438464,0.184212,0.088795
M3,5,-0.448151,2.25707,0.030853,-0.268911
M3,6,2.770488,-0.573197,0.014738,1.267547
M3,7,0.368468,1.02288,0.344081,-0.904709


Let's take a quick look at the ``.xs()`` method.

In [245]:
# This method takes a `key` argument to select data at a particular level of a MultiIndex.
df.xs('M1')

Unnamed: 0_level_0,A,B,C,D
Num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,-0.186946,-0.07285,0.360293,-0.253136
2,1.424846,-1.148209,-1.745976,-0.851874
3,-0.148627,0.478169,-2.079632,0.364785


In [246]:
df.loc['M1']

Unnamed: 0_level_0,A,B,C,D
Num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,-0.186946,-0.07285,0.360293,-0.253136
2,1.424846,-1.148209,-1.745976,-0.851874
3,-0.148627,0.478169,-2.079632,0.364785


In [247]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
M1,1,-0.186946,-0.07285,0.360293,-0.253136
M1,2,1.424846,-1.148209,-1.745976,-0.851874
M1,3,-0.148627,0.478169,-2.079632,0.364785
M2,1,-0.389643,1.054263,0.193175,0.866667
M2,2,1.912587,1.212039,-0.828568,0.508801
M2,3,1.812898,0.438464,0.184212,0.088795
M3,5,-0.448151,2.25707,0.030853,-0.268911
M3,6,2.770488,-0.573197,0.014738,1.267547
M3,7,0.368468,1.02288,0.344081,-0.904709


In [251]:
# Pass the key as a tuple; passing a list as the key is deprecated
df.xs(('M1', 2))

A    1.424846
B   -1.148209
C   -1.745976
D   -0.851874
Name: (M1, 2), dtype: float64

In [252]:
df.xs(('M3',6))

A    2.770488
B   -0.573197
C    0.014738
D    1.267547
Name: (M3, 6), dtype: float64

In [254]:
df.xs(('M3',6), level=[0,1])

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
M3,6,2.770488,-0.573197,0.014738,1.267547


In [258]:
df.xs(5, level='Num')

Unnamed: 0_level_0,A,B,C,D
Group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
M3,-0.448151,2.25707,0.030853,-0.268911


In [259]:
df.xs(3, level=1)

Unnamed: 0_level_0,A,B,C,D
Group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
M1,-0.148627,0.478169,-2.079632,0.364785
M2,1.812898,0.438464,0.184212,0.088795


In [260]:
df.xs('C', axis=1)

Group  Num
M1     1      0.360293
       2     -1.745976
       3     -2.079632
M2     1      0.193175
       2     -0.828568
       3      0.184212
M3     5      0.030853
       6      0.014738
       7      0.344081
Name: C, dtype: float64
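
To summarise the `.xs()` calls above, here is a minimal sketch on a small made-up frame (`nested` and the result names are hypothetical). It also shows the `drop_level` parameter, which the cells above do not use and which keeps the selected level in the result.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("M1", 1), ("M1", 2), ("M2", 1), ("M2", 2)], names=["Group", "Num"]
)
nested = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["A", "B"])

outer = nested.xs("M1")                                   # select on the outer level
inner = nested.xs(2, level="Num")                         # select on the inner level
inner_kept = nested.xs(2, level="Num", drop_level=False)  # keep the "Num" level

print(outer.index.tolist())
print(inner.index.tolist())
print(inner_kept.index.nlevels)
```

Selecting on the inner level is where `.xs()` is most convenient, since a plain `df.loc['M1']` style lookup only addresses the outer level.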

## Let's learn new functions/attributes/methods using the "iris dataset"

In [268]:
from sklearn import datasets
import seaborn as sns

In [272]:
df = sns.load_dataset("iris")
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [273]:
df.shape

(150, 5)

In [274]:
df.ndim

2

In [275]:
df.size

750

In [276]:
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [277]:
df.sample(4)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
92,5.8,2.6,4.0,1.2,versicolor
78,6.0,2.9,4.5,1.5,versicolor
1,4.9,3.0,1.4,0.2,setosa
100,6.3,3.3,6.0,2.5,virginica


In [278]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [279]:
df.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [280]:
df.species.value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

In [281]:
# numeric_only=True skips the non-numeric 'species' column
df.mean(numeric_only=True)

sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64

In [282]:
df.sum(axis=0)

sepal_length                                                876.5
sepal_width                                                 458.6
petal_length                                                563.7
petal_width                                                 179.9
species         setosasetosasetosasetosasetosasetosasetosaseto...
dtype: object

In [283]:
# numeric_only=True skips the non-numeric 'species' column
df.sum(axis=1, numeric_only=True)

0      10.2
1       9.5
2       9.4
3       9.4
4      10.2
       ... 
145    17.2
146    15.7
147    16.7
148    17.3
149    15.8
Length: 150, dtype: float64

In [284]:
df.sepal_length.sum()

876.5

In [285]:
df.species.unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [286]:
df.isnull()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
...,...,...,...,...,...
145,False,False,False,False,False
146,False,False,False,False,False
147,False,False,False,False,False
148,False,False,False,False,False


In [287]:
df.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

In [288]:
len(df)

150

In [289]:
df.head(9)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa


In [291]:
df.iloc[0:6 ,0:]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa


In [292]:
df.loc[0:6, :]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa


In [293]:
df.drop('species', axis=1)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [294]:
df[(df.sepal_length > 5) & (df.sepal_width > 3)].head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
10,5.4,3.7,1.5,0.2,setosa
14,5.8,4.0,1.2,0.2,setosa
15,5.7,4.4,1.5,0.4,setosa


In [295]:
df[(df.sepal_length > 5) | (df.sepal_width > 3)].tail()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


In [296]:
df.sort_values(by='species', ascending=True)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
27,5.2,3.5,1.5,0.2,setosa
28,5.2,3.4,1.4,0.2,setosa
29,4.7,3.2,1.6,0.2,setosa
30,4.8,3.1,1.6,0.2,setosa
...,...,...,...,...,...
119,6.0,2.2,5.0,1.5,virginica
120,6.9,3.2,5.7,2.3,virginica
121,5.6,2.8,4.9,2.0,virginica
111,6.4,2.7,5.3,1.9,virginica


<head>
    <center><title>~ End of Pandas DataFrames | Lesson-1 ~</title></center>
</head>