In [1]:
import pandas as pd
import numpy as np

#### SERIES

A Series is a pandas object designed to represent a one-dimensional data structure.
A Series holds two arrays: one for the data and one for the index.

To create a Series, use the pd.Series() constructor.
As the output of the example shows, the Series has an index (the first column) and the data (the second column), and its data type is integer (int64).

In [2]:
a = pd.Series([9, 8, 7, 6, 5])
a

0    9
1    8
2    7
3    6
4    5
dtype: int64
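
As a small sketch of the two arrays mentioned above, the data and the index of a Series can be inspected separately through its values and index attributes. The variable names `serie`, `data` and `labels` are only illustrative.

```python
import pandas as pd

# Same data as the example above; `serie` is a hypothetical name.
serie = pd.Series([9, 8, 7, 6, 5])

data = serie.values          # NumPy array holding the data
labels = list(serie.index)   # default index: positions 0 to 4

print(data)
print(labels)
```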

Some useful attributes of a Series are:
- size, the number of elements in the Series.
- dtype, the data type of the Series.
- shape, the shape of the Series, that is, its dimensions and length.
- index, the index of the Series; for the default index, the result shows where it starts, where it stops, and the step (one by one, two by two, etc.).

It is through the index that we can read and modify the data in the Series.

In [3]:
a.size

5

In [4]:
a.dtype

dtype('int64')

In [5]:
a.shape

(5,)

In [6]:
a.index

RangeIndex(start=0, stop=5, step=1)

To get the value at a given index, write the name of the Series followed by the index label in square brackets.

In [7]:
a[3]

6

To modify a value, write the name of the Series, the index label to modify in square brackets, and assign the new value.

In [8]:
a[3] = 100
a

0      9
1      8
2      7
3    100
4      5
dtype: int64

To define our own index, pass a list of labels to the index parameter when creating the Series.
The labels are assigned to the values in order.

Likewise, to read or modify a value we use the index label, just as in the previous example.

We can also give the Series a name with the name parameter and set its data type with the dtype parameter; the data type can be int, float, str, bool, etc.

In [9]:
b = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'], name='Numeros', dtype=int)
b

a    1
b    2
c    3
d    4
e    5
Name: Numeros, dtype: int64
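
Reading and modifying a value by its label is described above but not shown, so here is a minimal sketch using the same Series. The name `third` is only illustrative.

```python
import pandas as pd

b = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'], name='Numeros', dtype=int)

third = b['c']   # read the value stored under label 'c'
b['c'] = 30      # overwrite it using the same label

print(third)
print(b['c'])
```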

### Creating Series

A Series can also be created from a dictionary, where the keys become the index and the values become the data of the Series.

Just create a dictionary (or use an existing one) and pass it to the pd.Series() constructor.

In [10]:
colores = {
    'rojo' : 1,
    'azul' : 2,
    'purpura' : 3,
    'blanco' : 4,
    'negro' : 5
}

y = pd.Series(colores)
y

rojo       1
azul       2
purpura    3
blanco     4
negro      5
dtype: int64

### Null values

In Python, null values are represented with None, but in pandas missing values are represented with NaN (Not a Number).
NumPy provides a constant called np.nan, which is why we imported numpy. It is only needed when we want to build a Series with null values by hand; it is not necessary when the values already come from an existing source.

In [11]:
r = pd.Series([1, np.nan, np.nan, 5, 8, 9, np.nan])
r

0    1.0
1    NaN
2    NaN
3    5.0
4    8.0
5    9.0
6    NaN
dtype: float64
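
As a small sketch of how None and NaN relate: when a numeric Series is built from a list containing None, pandas stores it as NaN, and because NaN is a floating-point value the integers are converted to float64, as in the output above. The names `with_none` and `with_nan` are only illustrative.

```python
import numpy as np
import pandas as pd

# None and np.nan are both treated as missing in a numeric Series.
with_none = pd.Series([1, None, 3])
with_nan = pd.Series([1, np.nan, 3])

print(with_none)
print(with_none.equals(with_nan))
```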

The isnull() method returns a new Series of booleans, where True means the value is null and False means it is not.

In [12]:
r.isnull()

0    False
1     True
2     True
3    False
4    False
5    False
6     True
dtype: bool

We can also use notnull(); unlike isnull(), it returns True if the value is not null and False if it is null.

In [13]:
r.notnull()

0     True
1    False
2    False
3     True
4     True
5     True
6    False
dtype: bool

To get only the entries with null values, index the original Series with the result of isnull() inside square brackets.

In [14]:
r[ r.isnull() ]

1   NaN
2   NaN
6   NaN
dtype: float64

Conversely, to get the entries with non-null values, use notnull().

In [15]:
r[ r.notnull() ]

0    1.0
3    5.0
4    8.0
5    9.0
dtype: float64
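
Since the result of isnull() and notnull() is a Series of booleans, it can also be summed to count how many values are null. A minimal sketch with the same data; the names `n_null` and `n_not_null` are only illustrative.

```python
import numpy as np
import pandas as pd

r = pd.Series([1, np.nan, np.nan, 5, 8, 9, np.nan])

n_null = int(r.isnull().sum())        # each True counts as 1
n_not_null = int(r.notnull().sum())

print(n_null, n_not_null)
```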

#### DataFrame

A DataFrame is a pandas data structure very similar to an Excel spreadsheet or an SQL table, since it is made up of rows and columns.

To create a DataFrame, just create a dictionary (or use an existing one).
The dictionary keys become the column names, and each value is a list with the data of that column, one element per row.

In [16]:
usuarios = {
    'username' : ['user1', 'user2', 'user3'],
    'email' : ['user1@example.com', 'user2@example.com', 'user3@example.com'],
    'age' : [25, 30, 45],
    'status' : [True, True, False]
}

l = pd.DataFrame(usuarios)
l

   username              email  age  status
0     user1  user1@example.com   25    True
1     user2  user2@example.com   30    True
2     user3  user3@example.com   45   False


Creating a DataFrame gives us a default index; in the previous example the index is 0, 1 and 2. We can define our own index by passing a list of labels to the index parameter.

In [17]:
l = pd.DataFrame(usuarios, index=['a', 'b', 'c'])
l

   username              email  age  status
a     user1  user1@example.com   25    True
b     user2  user2@example.com   30    True
c     user3  user3@example.com   45   False


To get the data of a column, write the name of the DataFrame followed by the column name in square brackets.

In [18]:
l['username']

a    user1
b    user2
c    user3
Name: username, dtype: object

Likewise, we can get the value of a column at a specific index by adding a second pair of square brackets with the index label.

In [19]:
l['username']['a']

'user1'

Another way to get the data of a column is through the attributes of the DataFrame: write the name of the DataFrame, a dot, and the column name.
The usual way, however, is the first one shown, that is, with square brackets.

In [20]:
l.age

a    25
b    30
c    45
Name: age, dtype: int64
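
One reason square brackets are the usual choice: dot access only works when the column name can be written as a Python identifier and does not clash with an existing DataFrame attribute or method. A small sketch with a hypothetical DataFrame `df`; the column name concave points_mean is borrowed from the CSV file used later in this notebook, and the count column is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 30, 45],
    'count': [1, 2, 3],
    'concave points_mean': [0.1471, 0.07017, 0.1279],
})

by_attribute = df.age                     # works: 'age' is a valid identifier
by_brackets = df['concave points_mean']   # a name with a space needs brackets
count_column = df['count']                # df.count would be the count() method

print(by_attribute.tolist())
print(by_brackets.tolist())
print(callable(df.count))
```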

To see the column names, use the columns attribute of the DataFrame.
Write the name of the DataFrame, a dot, and columns.

In [21]:
l.columns

Index(['username', 'email', 'age', 'status'], dtype='object')

To see the values that make up the DataFrame, use its values attribute: write the name of the DataFrame, a dot, and values.

In [22]:
l.values

array([['user1', 'user1@example.com', 25, True],
       ['user2', 'user2@example.com', 30, True],
       ['user3', 'user3@example.com', 45, False]], dtype=object)

### DataFrame columns


First we will add a new column to the DataFrame. Remember that columns are Series, so before adding the new column we create a Series.
In this case we create a Series of random integers from 5 to 9 (the upper bound, 10, is excluded by np.random.randint), and since our DataFrame has only 3 rows, the Series also has size 3.

In [23]:
calificaciones = pd.Series( np.random.randint(5, 10, 3))
calificaciones 

0    8
1    7
2    8
dtype: int64

To add the new column, write the name of the DataFrame, the name of the new column in square brackets, and assign the Series we just created.
As we can see, however, the indexes do not match: the Series has index 0 to 2 while the DataFrame has index a, b and c. That is why the values of the new column do not appear and null values (NaN) appear instead.

In [18]:
l['calificaciones'] = calificaciones
l

   username              email  age  status  calificaciones
a     user1  user1@example.com   25    True             NaN
b     user2  user2@example.com   30    True             NaN
c     user3  user3@example.com   45   False             NaN


To fix the null values, we give the Series the same index as the DataFrame when creating it. Then we follow the same procedure to add the new column.

In [28]:
calificaciones = pd.Series( np.random.randint(5, 10, 3), index=['a', 'b', 'c'])
calificaciones 

a    7
b    6
c    9
dtype: int64

In [29]:
l['calificaciones'] = calificaciones
l

   username              email  age  status  calificaciones
a     user1  user1@example.com   25    True               7
b     user2  user2@example.com   30    True               6
c     user3  user3@example.com   45   False               9
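
A minimal sketch of the same fix without retyping the labels: the Series can reuse the index of the DataFrame itself. The DataFrame `df`, the misaligned column and the fixed scores are hypothetical stand-ins for the example above.

```python
import pandas as pd

df = pd.DataFrame({'username': ['user1', 'user2', 'user3']}, index=['a', 'b', 'c'])

# Misaligned: the Series has labels 0, 1, 2, so nothing matches 'a', 'b', 'c'.
df['misaligned'] = pd.Series([7, 6, 9])

# Aligned: the Series reuses the index of the DataFrame.
df['calificaciones'] = pd.Series([7, 6, 9], index=df.index)

print(df)
```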


To rename a column, use the rename() method of the DataFrame: pass a dictionary to the columns parameter where each key is the current column name and each value is the new name. One or more columns can be renamed at once.

In [26]:
l.rename(
    columns={'calificaciones': 'score'}
)

   username              email  age  status  score
a     user1  user1@example.com   25    True      8
b     user2  user2@example.com   30    True      6
c     user3  user3@example.com   45   False      5


To delete a column, use the del keyword followed by the name of the DataFrame and, in square brackets, the name of the column to delete.

Here we could delete the column under the name calificaciones even after renaming it, because rename() does not modify the original DataFrame; it returns a new copy with the changes.

In [27]:
del l['calificaciones']
l

   username              email  age  status
a     user1  user1@example.com   25    True
b     user2  user2@example.com   30    True
c     user3  user3@example.com   45   False


To apply the renaming to the original DataFrame itself, pass inplace=True to rename(). The column must exist for this to have any effect, so if it was deleted as above it has to be added again first.

In [34]:
l.rename(columns={'calificaciones': 'score'}, inplace=True)
l

   username              email  age  status  score
a     user1  user1@example.com   25    True      7
b     user2  user2@example.com   30    True      6
c     user3  user3@example.com   45   False      9


After doing that, we do have to use the new column name to delete it.

In [35]:
del l['score']
l

   username              email  age  status
a     user1  user1@example.com   25    True
b     user2  user2@example.com   30    True
c     user3  user3@example.com   45   False
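
As an alternative sketch to inplace=True, the copy returned by rename() can simply be stored in a variable. The names `original` and `renamed` and the fixed scores are only illustrative.

```python
import pandas as pd

original = pd.DataFrame({'calificaciones': [7, 6, 9]}, index=['a', 'b', 'c'])

# rename() returns a new DataFrame; `original` keeps its column name.
renamed = original.rename(columns={'calificaciones': 'score'})

print(list(original.columns))
print(list(renamed.columns))
```

Assigning the result back to the same variable has the same visible effect as inplace=True.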


### Reading CSV files


To work with CSV files in pandas, we first load the file with the pd.read_csv() function, passing the path of the file, or just its name if it is in the same folder as our script.

The DataFrame created from the CSV file can be stored in a variable, in this case y, which lets us work with the data more easily.

We also pass index_col='id' to tell pandas to use the id column as the index of the DataFrame. Otherwise pandas creates the default numeric index, as we saw before. Any column can be used as the index, but the most common and recommended choice is the id.

Note: the output shows the DataFrame, but only its first 5 and last 5 rows.

In [2]:
y = pd.read_csv('Breast_cancer_dataset.csv', index_col='id')
y

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,
842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,
84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,
84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,
84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,
926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,
926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,
927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,
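
To see the effect of index_col in isolation, here is a minimal sketch that reads a tiny CSV from a string instead of a file. The text in `csv_text` is a hypothetical three-line stand-in for the real file, read through io.StringIO.

```python
import io
import pandas as pd

# Hypothetical three-line CSV standing in for a file on disk.
csv_text = "id,diagnosis,radius_mean\n842302,M,17.99\n842517,M,20.57\n"

default_index = pd.read_csv(io.StringIO(csv_text))
id_index = pd.read_csv(io.StringIO(csv_text), index_col='id')

print(default_index.index.tolist())   # default numeric index
print(id_index.index.tolist())        # values of the id column
```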


To see the first rows of the DataFrame, use the head() method, which with no arguments returns the first 5 rows.

In [10]:
y.head()

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


If we pass a number to head(), it returns that many rows, in this case 20.

In [11]:
y.head(20)

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,
843786,M,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,...,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,
844359,M,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,0.1794,...,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368,
84458202,M,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,0.2196,...,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151,
844981,M,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,0.235,...,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072,
84501001,M,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,...,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075,


Likewise, tail() returns the last 5 rows of the DataFrame.

In [12]:
y.tail()

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
926424,M,21.56,22.39,142.0,1479.0,0.111,0.1159,0.2439,0.1389,0.1726,...,26.4,166.1,2027.0,0.141,0.2113,0.4107,0.2216,0.206,0.07115,
926682,M,20.13,28.25,131.2,1261.0,0.0978,0.1034,0.144,0.09791,0.1752,...,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,
926954,M,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,0.159,...,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782,
927241,M,20.6,29.33,140.1,1265.0,0.1178,0.277,0.3514,0.152,0.2397,...,39.42,184.6,1821.0,0.165,0.8681,0.9387,0.265,0.4087,0.124,
92751,B,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,0.1587,...,30.37,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039,


And if we pass a number, it returns that many rows counted from the end, in this example the last 20 rows.

In [13]:
y.tail(20)

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
923465,B,10.82,24.21,68.89,361.6,0.08192,0.06602,0.01548,0.00816,0.1976,...,31.45,83.9,505.6,0.1204,0.1633,0.06194,0.03264,0.3059,0.07626,
923748,B,10.86,21.48,68.51,360.5,0.07431,0.04227,0.0,0.0,0.1661,...,24.77,74.08,412.3,0.1001,0.07348,0.0,0.0,0.2458,0.06592,
923780,B,11.13,22.44,71.49,378.4,0.09566,0.08194,0.04824,0.02257,0.203,...,28.26,77.8,436.6,0.1087,0.1782,0.1564,0.06413,0.3169,0.08032,
924084,B,12.77,29.43,81.35,507.9,0.08276,0.04234,0.01997,0.01499,0.1539,...,36.0,88.1,594.7,0.1234,0.1064,0.08653,0.06498,0.2407,0.06484,
924342,B,9.333,21.94,59.01,264.0,0.0924,0.05605,0.03996,0.01282,0.1692,...,25.05,62.86,295.8,0.1103,0.08298,0.07993,0.02564,0.2435,0.07393,
924632,B,12.88,28.92,82.5,514.3,0.08123,0.05824,0.06195,0.02343,0.1566,...,35.74,88.84,595.7,0.1227,0.162,0.2439,0.06493,0.2372,0.07242,
924934,B,10.29,27.61,65.67,321.4,0.0903,0.07658,0.05999,0.02738,0.1593,...,34.91,69.57,357.6,0.1384,0.171,0.2,0.09127,0.2226,0.08283,
924964,B,10.16,19.59,64.73,311.7,0.1003,0.07504,0.005025,0.01116,0.1791,...,22.88,67.88,347.3,0.1265,0.12,0.01005,0.02232,0.2262,0.06742,
925236,B,9.423,27.88,59.26,271.3,0.08123,0.04971,0.0,0.0,0.1742,...,34.24,66.5,330.6,0.1073,0.07158,0.0,0.0,0.2475,0.06969,
925277,B,14.59,22.68,96.39,657.1,0.08473,0.133,0.1029,0.03736,0.1454,...,27.27,105.9,733.5,0.1026,0.3171,0.3662,0.1105,0.2258,0.08004,


### Data cleaning

The dropna() method creates a new DataFrame that excludes every row with at least one missing value.

In [None]:
y.dropna()
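
Because dropna() drops a row as soon as one of its values is missing, a column that is entirely empty (as Unnamed: 32 appears to be in the output above) would cause every row to be dropped. A minimal sketch on a small hypothetical DataFrame `df`, where the axis=1 and how='all' arguments drop only the columns that are completely empty:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'radius_mean': [17.99, np.nan, 19.69],
    'texture_mean': [10.38, 17.77, 21.25],
    'Unnamed: 32': [np.nan, np.nan, np.nan],
})

rows_dropped = df.dropna()                    # every row has a NaN, so none survive
cols_dropped = df.dropna(axis=1, how='all')   # drop only the all-NaN column
clean = cols_dropped.dropna()                 # then drop rows with missing values

print(len(rows_dropped), list(cols_dropped.columns), len(clean))
```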

If we do not want to drop the rows with null values, we can use the fillna() method, which puts a given value in every cell that is null, as the Unnamed: 32 column shows in the output.

In [7]:
y.fillna('Nuevo valor')

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,Nuevo valor
842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,Nuevo valor
84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,Nuevo valor
84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,Nuevo valor
84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,Nuevo valor
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,Nuevo valor
926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,Nuevo valor
926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,Nuevo valor
927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,Nuevo valor


This works well, but it applies the same value to every null cell in every row.

To use a specific value for the null cells of each column, pass a dictionary in which each key is a column name and each value is what will be assigned to the cells of that column that are null.
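
A minimal sketch of this on a small hypothetical DataFrame `df`, using numeric fill values chosen only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'radius_mean': [17.99, np.nan],
    'texture_mean': [np.nan, 17.77],
})

# Each key is a column name; each value fills the NaN cells of that column only.
filled = df.fillna({'radius_mean': 0.0, 'texture_mean': -1.0})

print(filled)
```

Like rename(), fillna() returns a new DataFrame and leaves the original unchanged.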

In [24]:
y.fillna({'radius_mean': 'without radius', 'texture_mean': 'without texture'})

### The loc attribute

In [25]:
usuarios = {
    'username' : ['user1', 'user2', 'user3'],
    'email' : ['user1@example.com', 'user2@example.com', 'user3@example.com'],
    'age' : [25, 30, 45],
    'status' : [True, True, False]
}

s = pd.DataFrame(usuarios, index=['a', 'b', 'c'])
s

   username              email  age  status
a     user1  user1@example.com   25    True
b     user2  user2@example.com   30    True
c     user3  user3@example.com   45   False


With the loc attribute we can select rows by index label, here labels of type string.

In [26]:
s.loc['a']

username                user1
email       user1@example.com
age                        25
status                   True
Name: a, dtype: object

We can create a sub-DataFrame with only the rows we want by giving the index label where it starts and the index label where it ends; with loc, both ends of the slice are included.

In [6]:
s.loc['b':'c']

   username              email  age  status
b     user2  user2@example.com   30    True
c     user3  user3@example.com   45   False


We can also select specific columns for those rows by listing the column names after the index labels.

In [7]:
s.loc['b':'c', ['email', 'status']]

               email  status
b  user2@example.com    True
c  user3@example.com   False


### The iloc attribute

With the iloc attribute we can select rows by integer position (int).

In [14]:
y

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,
842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,
84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,
84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,
84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,
926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,
926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,
927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,


In [18]:
y.iloc[568]

diagnosis                         B
radius_mean                    7.76
texture_mean                  24.54
perimeter_mean                47.92
area_mean                     181.0
smoothness_mean             0.05263
compactness_mean            0.04362
concavity_mean                  0.0
concave points_mean             0.0
symmetry_mean                0.1587
fractal_dimension_mean      0.05884
radius_se                    0.3857
texture_se                    1.428
perimeter_se                  2.548
area_se                       19.15
smoothness_se              0.007189
compactness_se              0.00466
concavity_se                    0.0
concave points_se               0.0
symmetry_se                 0.02676
fractal_dimension_se       0.002783
radius_worst                  9.456
texture_worst                 30.37
perimeter_worst               59.16
area_worst                    268.6
smoothness_worst            0.08996
compactness_worst           0.06444
concavity_worst             

We can create a sub-DataFrame by extracting a range of rows: give the position of the row where it starts and the position where it ends. With iloc the end position is excluded, so 0:5 returns the rows at positions 0 to 4.

In [19]:
y.iloc[0:5]

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,
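
The slicing rules of loc and iloc differ at the end of the range: a label slice with loc includes the end label, while a position slice with iloc excludes the end position. A minimal sketch on a small hypothetical DataFrame `s`:

```python
import pandas as pd

s = pd.DataFrame({'age': [25, 30, 45]}, index=['a', 'b', 'c'])

by_label = s.loc['a':'b']     # label slice: includes the end label 'b'
by_position = s.iloc[0:2]     # position slice: excludes position 2

print(by_label.index.tolist())
print(by_position.index.tolist())
```

Both selections return the rows labelled a and b.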


We can create a sub-DataFrame with the rows and columns we want by giving the column positions after the row positions.

Note: with iloc, columns are selected by position rather than by name, where the first column has position 0, the second has position 1, the third has position 2, and so on.

In [21]:
y.iloc[0:5, [0, 3, 6, 10]]

id,diagnosis,perimeter_mean,compactness_mean,fractal_dimension_mean
842302,M,122.8,0.2776,0.07871
842517,M,132.9,0.07864,0.05667
84300903,M,130.0,0.1599,0.05999
84348301,M,77.58,0.2839,0.09744
84358402,M,135.1,0.1328,0.05883
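
Counting column positions by hand is error-prone in a wide DataFrame. As a sketch, the position of a column can be looked up from its name with the get_loc() method of the columns index and then passed to iloc. The DataFrame `df` is a small hypothetical stand-in for the CSV data.

```python
import pandas as pd

df = pd.DataFrame({
    'diagnosis': ['M', 'B'],
    'radius_mean': [17.99, 7.76],
    'perimeter_mean': [122.8, 47.92],
})

# Look up integer positions from column names, then use them with iloc.
positions = [df.columns.get_loc(name) for name in ['diagnosis', 'perimeter_mean']]
subset = df.iloc[0:2, positions]

print(positions)
print(subset)
```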


Another way to create a sub-DataFrame with specific rows and columns is the following.

Both approaches select the same kind of result, but this second form is the recommended one here because selecting the columns by name is more readable and therefore easier to understand.

In [22]:
y.iloc[0:5][ ['diagnosis', 'fractal_dimension_mean'] ]

id,diagnosis,fractal_dimension_mean
842302,M,0.07871
842517,M,0.05667
84300903,M,0.05999
84348301,M,0.09744
84358402,M,0.05883


### Conditionals

In [23]:
y

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,
842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,
84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,
84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,
84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,
926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,
926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,
927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,


In the following example we filter the rows with a condition: inside the DataFrame's square brackets we write a comparison between a column, y['diagnosis'], and the value we want its rows to have, in this case 'B'.
The result is every row, with all of its columns, whose diagnosis column contains B.

In [24]:
y[ y['diagnosis'] == 'B']

Unnamed: 0_level_0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8510426,B,13.540,14.36,87.46,566.3,0.09779,0.08129,0.06664,0.047810,0.1885,...,19.26,99.70,711.2,0.14400,0.17730,0.23900,0.12880,0.2977,0.07259,
8510653,B,13.080,15.71,85.63,520.0,0.10750,0.12700,0.04568,0.031100,0.1967,...,20.49,96.09,630.5,0.13120,0.27760,0.18900,0.07283,0.3184,0.08183,
8510824,B,9.504,12.44,60.34,273.9,0.10240,0.06492,0.02956,0.020760,0.1815,...,15.66,65.13,314.9,0.13240,0.11480,0.08867,0.06227,0.2450,0.07773,
854941,B,13.030,18.42,82.61,523.8,0.08983,0.03766,0.02562,0.029230,0.1467,...,22.81,84.46,545.9,0.09701,0.04619,0.04833,0.05013,0.1987,0.06169,
85713702,B,8.196,16.84,51.71,201.9,0.08600,0.05943,0.01588,0.005917,0.1769,...,21.96,57.26,242.2,0.12970,0.13570,0.06880,0.02564,0.3105,0.07409,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
925277,B,14.590,22.68,96.39,657.1,0.08473,0.13300,0.10290,0.037360,0.1454,...,27.27,105.90,733.5,0.10260,0.31710,0.36620,0.11050,0.2258,0.08004,
925291,B,11.510,23.93,74.52,403.5,0.09261,0.10210,0.11120,0.041050,0.1388,...,37.16,82.28,474.2,0.12980,0.25170,0.36300,0.09653,0.2112,0.08732,
925292,B,14.050,27.15,91.38,600.4,0.09929,0.11260,0.04462,0.043040,0.1537,...,33.17,100.20,706.7,0.12410,0.22640,0.13260,0.10480,0.2250,0.08321,
925311,B,11.200,29.37,70.67,386.0,0.07449,0.03558,0.00000,0.000000,0.1060,...,38.30,75.19,439.6,0.09267,0.05494,0.00000,0.00000,0.1566,0.05905,


The following example extends the previous one with a column selection: the column name in the final pair of brackets tells Pandas to return only that column for the rows that satisfy the condition.
In this case we get only the radius_mean values of the rows that contain B in the diagnosis column.

In [25]:
y[ y['diagnosis'] == 'B']['radius_mean']

id
8510426     13.540
8510653     13.080
8510824      9.504
854941      13.030
85713702     8.196
             ...  
925277      14.590
925291      11.510
925292      14.050
925311      11.200
92751        7.760
Name: radius_mean, Length: 357, dtype: float64
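
As a hedged alternative to chaining two pairs of brackets, the same selection can be written in a single step with loc, passing the boolean mask as the row indexer and the column name as the column indexer. The toy frame and the variable names below are assumptions for illustration; the values are copied from rows shown earlier.

```python
import pandas as pd

# Toy frame with one malignant and two benign rows from the outputs above.
toy = pd.DataFrame(
    {"diagnosis": ["M", "B", "B"], "radius_mean": [17.99, 13.54, 13.08]},
    index=pd.Index([842302, 8510426, 8510653], name="id"),
)

# Boolean mask: True for the rows whose diagnosis is 'B'.
is_benign = toy["diagnosis"] == "B"

# Chained form, as in the cell above: filter rows, then pick the column.
chained = toy[is_benign]["radius_mean"]

# Single-step form: rows and column in one loc call.
single_step = toy.loc[is_benign, "radius_mean"]

print(chained.equals(single_step))
```

Both forms return the same Series when reading. The loc form is generally the safer habit if the selection is later used to assign values.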

In the following example we get the diagnosis and radius_mean columns, but only for the rows whose fractal_dimension_worst value is greater than 0.1000.

In [29]:
y[ y['fractal_dimension_worst'] > 0.1000][ ['diagnosis', 'radius_mean'] ]

Unnamed: 0_level_0,diagnosis,radius_mean
id,Unnamed: 1_level_1,Unnamed: 2_level_1
842302,M,17.990
84348301,M,11.420
843786,M,12.450
84458202,M,13.710
844981,M,13.000
...,...,...
91805,B,8.571
921362,B,7.691
921386,B,14.470
925622,M,15.220


In the following example we get every row where diagnosis is B and fractal_dimension_worst is greater than 0.1000.

Note: to combine conditions with logical operators such as & (AND) and | (OR), each condition must be wrapped in parentheses, because in Python & and | bind more tightly than comparison operators such as == and >.

In [30]:
y[ (y['diagnosis'] == 'B') & (y['fractal_dimension_worst'] > 0.1000)]

Unnamed: 0_level_0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
859471,B,9.029,17.33,58.79,250.5,0.1066,0.1413,0.313,0.04375,0.2111,...,22.65,65.5,324.7,0.1482,0.4365,1.252,0.175,0.4228,0.1175,
859711,B,8.888,14.64,58.79,244.0,0.09783,0.1531,0.08606,0.02872,0.1902,...,15.67,62.56,284.4,0.1207,0.2436,0.1434,0.04786,0.2254,0.1084,
8611161,B,13.34,15.86,86.49,520.0,0.1078,0.1535,0.1169,0.06987,0.1942,...,23.19,96.66,614.9,0.1536,0.4791,0.4858,0.1708,0.3527,0.1016,
86409,B,14.26,19.65,97.83,629.9,0.07837,0.2233,0.3003,0.07798,0.1704,...,23.73,107.0,709.0,0.08949,0.4193,0.6783,0.1505,0.2398,0.1082,
864496,B,8.726,15.83,55.84,230.9,0.115,0.08201,0.04132,0.01924,0.1649,...,19.62,64.48,284.4,0.1724,0.2364,0.2456,0.105,0.2926,0.1017,
869476,B,11.9,14.65,78.11,432.8,0.1152,0.1296,0.0371,0.03003,0.1995,...,16.51,86.26,509.6,0.1424,0.2517,0.0942,0.06042,0.2727,0.1036,
871001502,B,8.219,20.7,53.27,203.9,0.09405,0.1305,0.1321,0.02168,0.2222,...,29.72,58.08,249.8,0.163,0.431,0.5381,0.07879,0.3322,0.1486,
8710441,B,9.731,15.34,63.78,300.2,0.1072,0.1599,0.4108,0.07857,0.2548,...,19.49,71.04,380.5,0.1292,0.2772,0.8216,0.1571,0.3108,0.1259,
872608,B,9.904,18.06,64.6,302.4,0.09699,0.1294,0.1307,0.03716,0.1669,...,24.39,73.07,390.2,0.1301,0.295,0.3486,0.0991,0.2614,0.1162,
8810158,B,13.11,22.54,87.02,529.4,0.1002,0.1483,0.08705,0.05102,0.185,...,29.16,99.48,639.3,0.1349,0.4402,0.3162,0.1126,0.4128,0.1076,
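
One way to keep combined conditions readable is to store each mask in a variable first and then combine the variables. The sketch below is a minimal illustration on a toy frame (the names toy, is_benign and high_fd are assumptions); it also shows ~ (NOT), which the notebook does not use but which follows the same rules as & and |.

```python
import pandas as pd

# Toy frame: values copied from rows shown in the outputs above.
toy = pd.DataFrame(
    {
        "diagnosis": ["M", "M", "B", "B"],
        "fractal_dimension_worst": [0.1189, 0.08902, 0.07259, 0.1175],
    },
    index=pd.Index([842302, 842517, 8510426, 859471], name="id"),
)

# Each condition becomes a named boolean Series.
is_benign = toy["diagnosis"] == "B"
high_fd = toy["fractal_dimension_worst"] > 0.1

both = toy[is_benign & high_fd]        # AND: benign and above 0.1
either = toy[is_benign | high_fd]      # OR: benign or above 0.1
neither = toy[~(is_benign | high_fd)]  # NOT: neither condition holds

print(both.index.tolist())
print(either.index.tolist())
print(neither.index.tolist())
```

Because each mask is already a complete expression, no extra parentheses are needed around the variables themselves.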


In the following example we append ['fractal_dimension_worst'].mean() to the previous code to get the mean of fractal_dimension_worst.
Remember that this mean is computed only over the rows that contain B in the diagnosis column and have a value greater than 0.1000 in the fractal_dimension_worst column.

In [3]:
y[ (y['diagnosis'] == 'B') & (y['fractal_dimension_worst'] > 0.1000)]['fractal_dimension_worst'].mean()

np.float64(0.11337407407407406)

### Sorting

In the following example we get the row with the lowest value in the texture_mean column, considering only the rows that contain B in the diagnosis column.
First comes the condition on the diagnosis column, which keeps only the rows with value B. Next we sort those rows with sort_values(), which by default orders the texture_mean values from lowest to highest. Finally we limit the result with head(1) so that only the first row is returned, which is the one with the lowest texture_mean value.

In [4]:
y[ y['diagnosis'] == 'B'].sort_values('texture_mean').head(1)

Unnamed: 0_level_0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
87127,B,10.8,9.71,68.77,357.6,0.09594,0.05736,0.02531,0.01698,0.1381,...,12.02,73.66,414.0,0.1436,0.1257,0.1047,0.04603,0.209,0.07699,


In the following example we get the 5 rows with the highest values in the texture_mean column, considering only the rows that contain M in the diagnosis column.
The code is very similar to the previous example. The difference is that we pass ascending=False to sort_values() so that the rows are sorted from highest to lowest, and we limit the result with head(5) so that only the first 5 rows are returned, which are the ones with the 5 highest texture_mean values.

In [6]:
y[ y['diagnosis'] == 'M'].sort_values('texture_mean', ascending=False).head(5)

Unnamed: 0_level_0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
88330202,M,17.46,39.28,113.4,920.6,0.09812,0.1298,0.1417,0.08811,0.1809,...,44.87,141.2,1408.0,0.1365,0.3735,0.3241,0.2066,0.2853,0.08496,
88725602,M,15.53,33.56,103.7,744.9,0.1063,0.1639,0.1751,0.08399,0.2091,...,49.54,126.3,1035.0,0.1883,0.5564,0.5703,0.2014,0.3512,0.1204,
88119002,M,19.53,32.47,128.0,1223.0,0.0842,0.113,0.1145,0.06637,0.1428,...,45.41,180.2,2477.0,0.1408,0.4097,0.3995,0.1625,0.2713,0.07568,
88995002,M,20.73,31.12,135.7,1419.0,0.09469,0.1143,0.1367,0.08646,0.1769,...,47.16,214.0,3432.0,0.1401,0.2644,0.3442,0.1659,0.2868,0.08218,
925622,M,15.22,30.62,103.4,716.9,0.1048,0.2087,0.255,0.09429,0.2128,...,42.79,128.7,915.0,0.1417,0.7917,1.17,0.2356,0.4089,0.1409,
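
For this "top n" pattern, pandas also offers nsmallest() and nlargest(), which can replace the sort_values().head() pair. The sketch below is a hedged illustration on a toy frame (names and row selection are assumptions; the values come from rows shown above) and checks that both approaches agree when there are no ties.

```python
import pandas as pd

# Toy frame: two malignant and two benign rows from the outputs above.
toy = pd.DataFrame(
    {
        "diagnosis": ["M", "M", "B", "B"],
        "texture_mean": [10.38, 17.77, 14.36, 15.71],
    },
    index=pd.Index([842302, 842517, 8510426, 8510653], name="id"),
)

benign = toy[toy["diagnosis"] == "B"]
malignant = toy[toy["diagnosis"] == "M"]

# Lowest texture_mean among benign rows, two equivalent ways.
lowest_sorted = benign.sort_values("texture_mean").head(1)
lowest_direct = benign.nsmallest(1, "texture_mean")

# Two highest texture_mean values among malignant rows.
highest_sorted = malignant.sort_values("texture_mean", ascending=False).head(2)
highest_direct = malignant.nlargest(2, "texture_mean")

print(lowest_sorted.equals(lowest_direct))
print(highest_sorted.equals(highest_direct))
```

Which form to use is mostly a matter of taste; nlargest() and nsmallest() state the intent directly, while sort_values() is more flexible when sorting by several columns.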


### Range search

In the following example we get every row whose texture_worst value is between 40.00 and 50.00. Remember that when using logical operators such as & (AND) and | (OR), each condition must be wrapped in parentheses.

In [8]:
y[ (y['texture_worst'] >= 40.00) & (y['texture_worst'] <= 50.00) ]

Unnamed: 0_level_0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
84501001,M,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,...,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075,
87880,M,13.81,23.75,91.56,597.8,0.1323,0.1768,0.1558,0.09176,0.2251,...,41.85,128.5,1153.0,0.2226,0.5209,0.4646,0.2013,0.4432,0.1086,
88119002,M,19.53,32.47,128.0,1223.0,0.0842,0.113,0.1145,0.06637,0.1428,...,45.41,180.2,2477.0,0.1408,0.4097,0.3995,0.1625,0.2713,0.07568,
88203002,B,11.22,33.81,70.79,386.8,0.0778,0.03574,0.004967,0.006434,0.1845,...,41.78,78.44,470.9,0.09994,0.06885,0.02318,0.03002,0.2911,0.07307,
883270,B,14.22,27.85,92.55,623.9,0.08223,0.1039,0.1103,0.04408,0.1342,...,40.54,102.5,764.0,0.1081,0.2426,0.3064,0.08219,0.189,0.07796,
88330202,M,17.46,39.28,113.4,920.6,0.09812,0.1298,0.1417,0.08811,0.1809,...,44.87,141.2,1408.0,0.1365,0.3735,0.3241,0.2066,0.2853,0.08496,
88725602,M,15.53,33.56,103.7,744.9,0.1063,0.1639,0.1751,0.08399,0.2091,...,49.54,126.3,1035.0,0.1883,0.5564,0.5703,0.2014,0.3512,0.1204,
88995002,M,20.73,31.12,135.7,1419.0,0.09469,0.1143,0.1367,0.08646,0.1769,...,47.16,214.0,3432.0,0.1401,0.2644,0.3442,0.1659,0.2868,0.08218,
9112085,B,13.38,30.72,86.34,557.2,0.09245,0.07426,0.02819,0.03264,0.1375,...,41.61,96.69,705.6,0.1172,0.1421,0.07003,0.07763,0.2196,0.07675,
925622,M,15.22,30.62,103.4,716.9,0.1048,0.2087,0.255,0.09429,0.2128,...,42.79,128.7,915.0,0.1417,0.7917,1.17,0.2356,0.4089,0.1409,


Another way to get the same result as the previous example is the between() method, used inside the brackets with the bounds of the range as arguments. By default both bounds are inclusive, which matches the >= and <= comparisons above.

Whenever possible, prefer between(), since it makes the code more readable and therefore easier to understand.

In [9]:
y[ y['texture_worst'].between(40.00, 50.00) ]

Unnamed: 0_level_0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
84501001,M,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,...,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075,
87880,M,13.81,23.75,91.56,597.8,0.1323,0.1768,0.1558,0.09176,0.2251,...,41.85,128.5,1153.0,0.2226,0.5209,0.4646,0.2013,0.4432,0.1086,
88119002,M,19.53,32.47,128.0,1223.0,0.0842,0.113,0.1145,0.06637,0.1428,...,45.41,180.2,2477.0,0.1408,0.4097,0.3995,0.1625,0.2713,0.07568,
88203002,B,11.22,33.81,70.79,386.8,0.0778,0.03574,0.004967,0.006434,0.1845,...,41.78,78.44,470.9,0.09994,0.06885,0.02318,0.03002,0.2911,0.07307,
883270,B,14.22,27.85,92.55,623.9,0.08223,0.1039,0.1103,0.04408,0.1342,...,40.54,102.5,764.0,0.1081,0.2426,0.3064,0.08219,0.189,0.07796,
88330202,M,17.46,39.28,113.4,920.6,0.09812,0.1298,0.1417,0.08811,0.1809,...,44.87,141.2,1408.0,0.1365,0.3735,0.3241,0.2066,0.2853,0.08496,
88725602,M,15.53,33.56,103.7,744.9,0.1063,0.1639,0.1751,0.08399,0.2091,...,49.54,126.3,1035.0,0.1883,0.5564,0.5703,0.2014,0.3512,0.1204,
88995002,M,20.73,31.12,135.7,1419.0,0.09469,0.1143,0.1367,0.08646,0.1769,...,47.16,214.0,3432.0,0.1401,0.2644,0.3442,0.1659,0.2868,0.08218,
9112085,B,13.38,30.72,86.34,557.2,0.09245,0.07426,0.02819,0.03264,0.1375,...,41.61,96.69,705.6,0.1172,0.1421,0.07003,0.07763,0.2196,0.07675,
925622,M,15.22,30.62,103.4,716.9,0.1048,0.2087,0.255,0.09429,0.2128,...,42.79,128.7,915.0,0.1417,0.7917,1.17,0.2356,0.4089,0.1409,
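
The behavior at the bounds can be checked with a small Series. In this sketch the boundary values 40.0 and 50.0 are illustrative additions, not rows from the dataset, and the string form of the inclusive parameter assumes pandas 1.3 or later.

```python
import pandas as pd

# 40.0 and 50.0 are made-up boundary values; the rest appear in the outputs above.
s = pd.Series([40.0, 40.68, 49.54, 50.0, 17.33], name="texture_worst")

# Default: both bounds are included.
closed = s.between(40.0, 50.0)

# Exclude both bounds (string values for inclusive assume pandas >= 1.3).
open_interval = s.between(40.0, 50.0, inclusive="neither")

# Equivalent explicit comparison for the default case.
explicit = (s >= 40.0) & (s <= 50.0)

print(closed.tolist())
print(open_interval.tolist())
print(closed.equals(explicit))
```

The other accepted values, "left" and "right", include only the lower or only the upper bound.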


### Searching among options

In the following example we get every row where perimeter_mean is greater than 100.00 and diagnosis equals M or B.

In [10]:
y[(y['perimeter_mean'] > 100.00) & ((y['diagnosis'] == 'M') | (y['diagnosis'] == 'B'))]

Unnamed: 0_level_0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
842302,M,17.99,10.38,122.8,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.11890,
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.1860,0.2750,0.08902,
84300903,M,19.69,21.25,130.0,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.2430,0.3613,0.08758,
84358402,M,20.29,14.34,135.1,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,16.67,152.2,1575.0,0.1374,0.2050,0.4000,0.1625,0.2364,0.07678,
844359,M,18.25,19.98,119.6,1040.0,0.09463,0.10900,0.11270,0.07400,0.1794,...,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
926125,M,20.92,25.09,143.0,1347.0,0.10990,0.22360,0.31740,0.14740,0.2149,...,29.41,179.1,1819.0,0.1407,0.4186,0.6599,0.2542,0.2929,0.09873,
926424,M,21.56,22.39,142.0,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,26.40,166.1,2027.0,0.1410,0.2113,0.4107,0.2216,0.2060,0.07115,
926682,M,20.13,28.25,131.2,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,
926954,M,16.60,28.08,108.3,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.07820,


Another way to get the same result as the previous example is the isin() method.
First we create a list with the options we want to check, in this case 'M' and 'B', and then we call isin() inside the brackets, passing that list as the argument.

This method is especially useful when there are many options to check, since it makes the code more readable and therefore easier to understand.

In [11]:
diagnosis = ['M', 'B']

y[(y['perimeter_mean'] > 100.00) & (y['diagnosis'].isin(diagnosis))]

Unnamed: 0_level_0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
842302,M,17.99,10.38,122.8,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.11890,
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.1860,0.2750,0.08902,
84300903,M,19.69,21.25,130.0,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.2430,0.3613,0.08758,
84358402,M,20.29,14.34,135.1,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,16.67,152.2,1575.0,0.1374,0.2050,0.4000,0.1625,0.2364,0.07678,
844359,M,18.25,19.98,119.6,1040.0,0.09463,0.10900,0.11270,0.07400,0.1794,...,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
926125,M,20.92,25.09,143.0,1347.0,0.10990,0.22360,0.31740,0.14740,0.2149,...,29.41,179.1,1819.0,0.1407,0.4186,0.6599,0.2542,0.2929,0.09873,
926424,M,21.56,22.39,142.0,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,26.40,166.1,2027.0,0.1410,0.2113,0.4107,0.2216,0.2060,0.07115,
926682,M,20.13,28.25,131.2,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,
926954,M,16.60,28.08,108.3,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.07820,
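
The equivalence between chained | comparisons and isin() can be verified on a toy frame. The sketch below is illustrative only (the names are assumptions; the values come from rows shown above), and it also shows how ~ negates an isin() mask to exclude options instead of including them.

```python
import pandas as pd

# Toy frame: two malignant and two benign rows from the outputs above.
toy = pd.DataFrame(
    {
        "diagnosis": ["M", "M", "B", "B"],
        "perimeter_mean": [122.8, 77.58, 87.46, 85.63],
    },
    index=pd.Index([842302, 84348301, 8510426, 8510653], name="id"),
)

# The same membership test written two ways.
with_or = (toy["diagnosis"] == "M") | (toy["diagnosis"] == "B")
with_isin = toy["diagnosis"].isin(["M", "B"])

# Combine the membership mask with a numeric condition.
large = toy[(toy["perimeter_mean"] > 100.0) & with_isin]

# Negate the mask to keep rows whose diagnosis is NOT in the list.
not_malignant = toy[~toy["diagnosis"].isin(["M"])]

print(with_or.equals(with_isin))
print(large.index.tolist())
print(not_malignant.index.tolist())
```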


### String methods

In Pandas, we can filter rows based on the text stored in a Series by using string methods.

In [16]:
usuarios = {
    'username' : ['user1', 'gabriel user2', 'user3'],
    'email' : ['auser1@example.com', 'buser2@example.com', 'auser3@example.com'],
    'age' : [25, 30, 45],
    'status' : [True, True, False]
}

d = pd.DataFrame(usuarios, index=['a', 'b', 'c'])
d

Unnamed: 0,username,email,age,status
a,user1,auser1@example.com,25,True
b,gabriel user2,buser2@example.com,30,True
c,user3,auser3@example.com,45,False


With the str.startswith() method we can get every row whose text begins with a specific value.
In this case, we get every row where the value in the email column starts with the letter 'a'.

In [18]:
d[ d['email'].str.startswith('a') ]

Unnamed: 0,username,email,age,status
a,user1,auser1@example.com,25,True
c,user3,auser3@example.com,45,False


With the str.endswith() method we can get every row whose text ends with a specific value.
In this case, we get every row where the value in the username column ends with '3'.

In [19]:
d[ d['username'].str.endswith('3') ]

Unnamed: 0,username,email,age,status
c,user3,auser3@example.com,45,False


With the str.contains() method we can get every row whose text contains a specific value anywhere in the string.
In this case, we get every row where the value in the username column contains the word 'gabriel'.

In [17]:
d[ d['username'].str.contains('gabriel') ]

Unnamed: 0,username,email,age,status
b,gabriel user2,buser2@example.com,30,True
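
str.contains() has a few optional parameters that are often useful in practice. The sketch below reuses the users data from the cell above and is only illustrative: case=False ignores letter case, regex=False treats the pattern as literal text rather than a regular expression, and na=False makes missing values count as "no match". The with_missing Series is a hypothetical addition, since the original data has no missing usernames.

```python
import pandas as pd

d = pd.DataFrame(
    {
        "username": ["user1", "gabriel user2", "user3"],
        "email": ["auser1@example.com", "buser2@example.com", "auser3@example.com"],
    },
    index=["a", "b", "c"],
)

# Case-insensitive match: 'GABRIEL' still finds 'gabriel user2'.
any_case = d[d["username"].str.contains("GABRIEL", case=False)]

# Literal match: the dot is a plain character, not a regex wildcard.
dot_com = d[d["email"].str.contains(".com", regex=False)]

# Hypothetical Series with a missing value; na=False treats it as no match.
with_missing = pd.Series(["gabriel user2", None])
mask = with_missing.str.contains("gabriel", na=False)

print(any_case.index.tolist())
print(len(dot_com))
print(mask.tolist())
```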


### Grouping

With the groupby() method we can group rows by the unique values of a column.
We write the DataFrame name followed by groupby() with the column name inside the parentheses. Then, in square brackets, we select the column to aggregate, here the same column we grouped by, and finally we count with count().
The result of our example is the count for each value: 'M' = 212 and 'B' = 357.

In [20]:
y.groupby('diagnosis')['diagnosis'].count()

diagnosis
B    357
M    212
Name: diagnosis, dtype: int64

In the following example we first filter on the perimeter_mean column so that only the rows with the value 130.0 remain, then group by diagnosis and select that same column, and finally count with count().
The result shows that there are 2 rows with diagnosis 'M' that contain the value 130.0 in the perimeter_mean column, and none with diagnosis 'B'.

In [23]:
y[ y['perimeter_mean'] == 130.0].groupby('diagnosis')['diagnosis'].count()

diagnosis
M    2
Name: diagnosis, dtype: int64
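
Counting rows per group can be written in several equivalent ways, and the same groupby() pattern extends to other aggregations. The sketch below is a minimal illustration on a toy frame (the names and the choice of rows are assumptions; the radius_mean values come from rows shown earlier), so the counts differ from the 357 and 212 of the full dataset.

```python
import pandas as pd

# Toy frame: two malignant and three benign rows from the outputs above.
toy = pd.DataFrame(
    {
        "diagnosis": ["M", "M", "B", "B", "B"],
        "radius_mean": [17.99, 20.57, 13.54, 13.08, 9.504],
    }
)

# Three ways to count rows per diagnosis.
counts = toy.groupby("diagnosis")["diagnosis"].count()
sizes = toy.groupby("diagnosis").size()
value_counts = toy["diagnosis"].value_counts()

# The same grouping with a different aggregation: mean radius per diagnosis.
mean_radius = toy.groupby("diagnosis")["radius_mean"].mean()

print(counts.to_dict())
print(sizes.to_dict())
print(value_counts.to_dict())
print(mean_radius)
```

One difference worth keeping in mind: count() skips missing values in the selected column, while size() counts every row in the group.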