
<img src="img/viu_logo.png" width="200">

## 01MAIR - Data Processing

![logo](img/python_logo.png)

*Ivan Fuertes*

# Summary
- Working with strings
- Combining datasets
- Data cleaning
 - map, filter, reduce
 - filling missing values
 - duplicate values
 - data categorization

In [1]:
import numpy as np
import pandas as pd

### Combining several datasets
- Based on a shared element (the index)
- MovieLens 'UserID'

In [2]:
import zipfile as zp # to extract zip archives
import urllib.request # to download from a URL
import os

# download the MovieLens dataset
url = 'http://files.grouplens.org/datasets/movielens/ml-1m.zip'
local_zip = os.path.join("res", "ml-1m.zip")
os.makedirs("res", exist_ok=True) # urlretrieve fails if the target directory does not exist
urllib.request.urlretrieve(url, local_zip)
# extract the zip archive
with zp.ZipFile(local_zip, 'r') as zipp:
    print('Extracting all files...')
    zipp.extractall(os.path.join("res")) # destination
    print('Done!')

Extracting all files...
Done!


In [3]:
root_path = os.path.join("res", "ml-1m" )

ratings_dataset = pd.read_csv(os.path.join(root_path, "ratings.dat"), sep='::',
                                index_col=0, engine='python',
                                names=['UserID','MovieID','Rating','Timestamp'])

users_dataset = pd.read_csv(os.path.join(root_path, "users.dat"),sep='::',
                              index_col=0, engine='python',
                              names=['UserID','Gender','Age','Occupation','Zip-code'])

In [4]:
users_dataset.sample(5)

UserID,Gender,Age,Occupation,Zip-code
1378,F,45,7,21044
3020,F,35,7,60302
4072,M,50,6,93111
1062,M,50,19,59457
442,M,25,1,55105


In [5]:
ratings_dataset.sample(5)

UserID,MovieID,Rating,Timestamp
5287,2001,4,961088221
1027,529,4,975003119
4275,1015,4,965298142
1666,647,3,974711194
4049,3274,4,965498261


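### Joining datasets with 'join' and 'merge'
- merge() and join() perform the same kind of database-style join
 - 'join' uses the indexes by default
 - 'merge' uses the columns shared by both tables by default
- The 'on' parameter selects the join key
 - If the key columns are named differently, use 'left_on' and 'right_on'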
 
 https://i.stack.imgur.com/hMKKt.jpg
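
As a minimal sketch of the points above, the following uses two small made-up tables (`users` and `ratings` are hypothetical, not the MovieLens files) whose key columns have different names. It is only meant to show how `left_on`/`right_on` and the index-based `join` relate, not to reproduce the notebook's results.

```python
import pandas as pd

# Hypothetical toy tables; the key is called 'UserID' on one side and 'User' on the other
users = pd.DataFrame({"UserID": [1, 2, 3], "Gender": ["F", "M", "M"]})
ratings = pd.DataFrame({"User": [1, 1, 3, 4], "Rating": [5, 3, 4, 2]})

# merge: name the key column of each side explicitly
merged = users.merge(ratings, left_on="UserID", right_on="User", how="inner")

# join: aligns on the index, so both tables are indexed by their key first
joined = users.set_index("UserID").join(ratings.set_index("User"), how="inner")

print(merged)
print(joined)
```

With `how='inner'` only users 1 and 3 survive, because user 2 has no rating and user 4 has no profile.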

In [6]:
# combining users and ratings: how?
combined_dataset = users_dataset.merge(ratings_dataset, on='UserID', how='inner') # the 'on' parameter names the join key (here the shared index name 'UserID')
combined_dataset.sample(5)

UserID,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp
434,F,45,3,98155,260,4,976247518
306,M,18,0,53051,2081,4,979186336
4094,M,25,17,49017,3510,4,965418832
2446,M,18,7,10038,2108,5,975349789
2158,M,25,12,60618,1218,4,974618223


In [7]:
movies_dataset = pd.read_csv(os.path.join(root_path, "movies.dat"),sep='::', engine='python',names=['MovieID','Title','Genre'])
movies_dataset.sample(5)

Unnamed: 0,MovieID,Title,Genre
2425,2494,"Last Days, The (1998)",Documentary
3026,3095,"Grapes of Wrath, The (1940)",Drama
3756,3826,Hollow Man (2000),Horror|Sci-Fi|Thriller
2213,2282,Pecker (1998),Comedy|Drama
471,475,In the Name of the Father (1993),Drama


In [8]:
# combining movies with the rest
all_dataset = combined_dataset.merge(movies_dataset,on='MovieID', how='inner')
all_dataset.sample(5)

Unnamed: 0,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genre
100089,M,25,1,92688,2490,1,958063020,Payback (1999),Action|Thriller
51988,M,45,5,32953,1207,4,965177668,To Kill a Mockingbird (1962),Drama
979795,M,25,0,20008,504,3,964234095,No Escape (1994),Action|Sci-Fi
985304,M,45,3,55410,444,2,975530965,Even Cowgirls Get the Blues (1993),Comedy|Romance
405617,F,25,1,1860,3178,5,965582757,"Hurricane, The (1999)",Drama


### Concatenate
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
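
A short sketch of `pd.concat`, assuming three small made-up frames (`a`, `b` and `c` are hypothetical). Unlike `merge`, `concat` does not match on a key: it stacks frames along an axis.

```python
import pandas as pd

# Hypothetical frames for illustration only
a = pd.DataFrame({"MovieID": [1, 2], "Rating": [4, 5]})
b = pd.DataFrame({"MovieID": [3], "Rating": [2]})
c = pd.DataFrame({"Title": ["x", "y"]})

# Stack rows; ignore_index=True renumbers the resulting index 0..n-1
rows = pd.concat([a, b], ignore_index=True)

# Place frames side by side; rows are aligned on the index
cols = pd.concat([a, c], axis=1)

print(rows)
print(cols)
```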

## Pivot
- Summarize the data as a function of several parameters, aggregating the values
```python
pivot_table(<values to aggregate>, index=<primary grouping key>, columns=<secondary grouping key>)
```
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
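
To make the signature concrete, here is a small sketch on a made-up frame (the data in `df` is hypothetical). It shows the default aggregation (the mean) and an explicit `aggfunc`.

```python
import pandas as pd

# Hypothetical ratings, for illustration only
df = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "M"],
    "Age":    [18, 25, 18, 18, 25],
    "Rating": [4, 5, 3, 5, 2],
})

# aggfunc defaults to the mean of 'Rating' for each (Gender, Age) cell
mean_table = df.pivot_table("Rating", index="Gender", columns="Age")

# count how many ratings fall in each cell instead
count_table = df.pivot_table("Rating", index="Gender", columns="Age", aggfunc="count")

print(mean_table)
print(count_table)
```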

In [9]:
all_dataset.pivot_table('Rating', index='Gender', columns='Age')
# all_dataset.pivot_table('Rating', index='Gender', columns='Age', aggfunc='count')
# all_dataset.pivot_table('Rating', index='Gender', columns='Age', aggfunc=['count', 'mean'])

Gender \ Age,1,18,25,35,45,50,56
F,3.616291,3.453145,3.6067,3.659653,3.663044,3.79711,3.915534
M,3.517461,3.525476,3.52678,3.604434,3.627942,3.687098,3.720327


## Aggregations
- agg -> statistical aggregation functions
- Series.unique() -> unique values
- Series.value_counts() -> number of occurrences of each value
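
A minimal sketch of the three calls above on a made-up frame. Pairing `agg` with `groupby` is an assumption about the intended usage; `agg` can also be called directly on a DataFrame or Series.

```python
import pandas as pd

# Hypothetical ratings, for illustration only
df = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "M"],
    "Age":    [18, 25, 18, 18, 25],
    "Rating": [4, 5, 3, 5, 2],
})

# several aggregation functions at once, per group
stats = df.groupby("Gender")["Rating"].agg(["mean", "max", "count"])

# distinct values, in order of first appearance
ages = df["Age"].unique()

# occurrences of each value, most frequent first
counts = df["Age"].value_counts()

print(stats)
print(ages)
print(counts)
```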

## String manipulation
```python
split(): split a string into chunks on a separator character
replace(): replace one substring with another
index(): find the position of a character
```
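
The three methods can be tried on plain Python strings before applying them to a whole column. The sample values below are copied from the MovieLens rows shown in this notebook.

```python
genre = "Animation|Children's|Comedy"
title = "Toy Story (1995)"

# split(): one list element per genre
parts = genre.split("|")

# index(): position of the first '(' in the title
pos = title.index("(")

# replace(): strip the parentheses around the year
year = title[pos:].replace("(", "").replace(")", "")

print(parts)
print(pos, year)
```

Note that `index()` raises `ValueError` when the character is missing, whereas `find()` returns -1.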

In [10]:
# Example with MovieLens: Genre
## 1: get every genre separately
## 2: build a dataset of genres
## 3: for each movie, flag each of its genres
## 4: join with the original dataset
movies_dataset.head(3)

Unnamed: 0,MovieID,Title,Genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [11]:
all_genres = movies_dataset['Genre'].apply(lambda x : x.split('|'))
print(all_genres)

genres = all_genres.explode().unique() # flatten the lists, then keep each genre once (recent pandas versions deprecate passing a plain list to pd.unique)
print(genres)

0          [Animation, Children's, Comedy]
1         [Adventure, Children's, Fantasy]
2                        [Comedy, Romance]
3                          [Comedy, Drama]
4                                 [Comedy]
5                [Action, Crime, Thriller]
6                        [Comedy, Romance]
7                  [Adventure, Children's]
8                                 [Action]
9            [Action, Adventure, Thriller]
10                [Comedy, Drama, Romance]
11                        [Comedy, Horror]
12                 [Animation, Children's]
13                                 [Drama]
14            [Action, Adventure, Romance]
15                       [Drama, Thriller]
16                        [Drama, Romance]
17                              [Thriller]
18                                [Comedy]
19                                [Action]
20                 [Action, Comedy, Drama]
21                [Crime, Drama, Thriller]
22                              [Thriller]
23         

In [12]:
# create a table with one column per genre
zeros = np.zeros( (len(movies_dataset), len(genres)) )
genres_frame = pd.DataFrame(zeros,columns=genres)
genres_frame.head(3)

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
columns_genres = genres_frame.columns # list of genres (columns)
# for each movie, flag its genres with 1
for i, genre in enumerate(movies_dataset['Genre']):
    inds = columns_genres.get_indexer(genre.split('|')) # returns the column positions of this movie's genres
    genres_frame.iloc[i,inds] = 1 # locate the matching genre columns and set them to 1

In [14]:
genres_frame.head(5)

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
# join with the original dataset
movies_split_genre = movies_dataset.join(genres_frame)

In [16]:
display(movies_split_genre.head(5))

Unnamed: 0,MovieID,Title,Genre,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,...,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,1,Toy Story (1995),Animation|Children's|Comedy,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),Comedy|Romance,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),Comedy|Drama,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),Comedy,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
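
As a more compact alternative to the loop above, pandas offers `Series.str.get_dummies`, which splits on a separator and builds the indicator columns in one call. The sketch below uses a two-row made-up frame (`movies` is hypothetical); note that the resulting columns come out sorted alphabetically, not in order of first appearance.

```python
import pandas as pd

# Two rows copied from the MovieLens sample shown above
movies = pd.DataFrame({
    "Title": ["Toy Story (1995)", "Jumanji (1995)"],
    "Genre": ["Animation|Children's|Comedy", "Adventure|Children's|Fantasy"],
})

# one indicator column per genre found in the 'Genre' strings
dummies = movies["Genre"].str.get_dummies(sep="|")

# attach the indicator columns to the original frame (aligned on the index)
result = movies.join(dummies)

print(result)
```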


#### Replace and index to extract the movie's year

In [17]:
movies_dataset.head(2)

Unnamed: 0,MovieID,Title,Genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy


In [18]:
# extract the year from the Title column
def split_year(title):
    index = title.rindex('(') # last '(': some titles carry an earlier parenthesis, e.g. 'Mirror, The (Zerkalo) (1975)'
    return title[index:].replace('(','').replace(')','')
    
# create the new Year column
movies_dataset['Year'] = movies_dataset['Title'].apply(split_year)
movies_dataset.sample(2)

Unnamed: 0,MovieID,Title,Genre,Year
1322,1343,Cape Fear (1991),Thriller,1991
312,315,"Specialist, The (1994)",Action,1994


In [19]:
# remove the year from the Title column
def remove_year(title):
    index = title.rindex('(') # last '(' so that only the trailing year is dropped
    return title[:index-1].strip()

movies_dataset['Title'] = movies_dataset['Title'].apply(remove_year)
movies_dataset.head(2)

Unnamed: 0,MovieID,Title,Genre,Year
0,1,Toy Story,Animation|Children's|Comedy,1995
1,2,Jumanji,Adventure|Children's|Fantasy,1995


#### Regular expressions
https://docs.python.org/3/library/re.html

- import re
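
Before using regular expressions through the pandas `.str` accessor, the same patterns can be tried with the standard `re` module. This is a small sketch; the sample values are taken from rows shown elsewhere in this notebook.

```python
import re

# exactly five digits, nothing before or after
zip_pattern = re.compile(r"^\d{5}$")
codes = ["21044", "98107-2117", "5849574"]
invalid = [c for c in codes if not zip_pattern.match(c)]

# four digits wrapped in literal parentheses; the inner group captures the year
year_pattern = re.compile(r"\((\d{4})\)")
match = year_pattern.search("Mirror, The (Zerkalo) (1975)")
year = match.group(1)

print(invalid)
print(year)
```

Raw strings (`r"..."`) keep backslashes such as `\d` from being interpreted as Python string escapes.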

In [20]:
# How can we detect that 'Zip-code' has an invalid format?
users_dataset.sample(5)

users_dataset[~users_dataset['Zip-code'].str.match(r'^\d{5}$')]

# ^\d{5}$
# ^ = start of the string
# \d = a decimal digit
# {5} = exactly 5 repetitions of the preceding digit
# $ = end of the string

UserID,Gender,Age,Occupation,Zip-code
161,M,45,16,98107-2117
233,F,45,20,37919-4204
293,M,56,1,55337-4056
458,M,50,16,55405-2546
506,M,25,16,55103-1006
541,F,18,4,5849574
567,M,35,20,52570-9634
868,M,50,17,01702-7224
913,M,25,0,20744-6223
939,F,25,20,20110-5616


In [21]:
# How can we extract the year with a regex in the proper format?
movies_dataset = pd.read_csv(os.path.join(root_path, "movies.dat"),sep='::', engine='python',names=['MovieID','Title','Genre'])

movies_dataset['Title'].str.extract(r'(\d{4})')

# (\d{4})
# ( = opens a capture group (it does not match a literal parenthesis)
# \d = a decimal digit
# {4} = exactly 4 repetitions of the preceding digit
# ) = closes the capture group
# to require the literal parentheses around the year, escape them: \((\d{4})\)

Unnamed: 0,0
0,1995
1,1995
2,1995
3,1995
4,1995
5,1995
6,1995
7,1995
8,1995
9,1995


## Operations on collections
```python
reduce: apply an operation cumulatively and return a single value
map: apply an operation to each element and return a sequence
filter: return a sequence with the elements that satisfy a condition
```
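
The three operations side by side on a plain list, as a quick sketch (the `ratings` list is made up for illustration):

```python
from functools import reduce

ratings = [4, 5, 3, 1, 5]

# reduce: fold the list into one value, here the running sum
total = reduce(lambda acc, r: acc + r, ratings)

# map: transform every element; list() materializes the lazy iterator
doubled = list(map(lambda r: r * 2, ratings))

# filter: keep only the elements for which the predicate is true
high = list(filter(lambda r: r >= 4, ratings))

print(total, doubled, high)
```

In Python 3, `map` and `filter` return lazy iterators, which is why the results are wrapped in `list()`.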


## Reduce
- Applies a two-argument operation cumulatively to the elements of a collection
- Unlike 'apply()', it collapses the collection into a single value
- Example: detecting genres in specific years

https://docs.python.org/3/library/functools.html

In [22]:
from functools import reduce # reduce lives in functools

lista = [1, 3, 5, 7, 9]
print(reduce(lambda x,y: x + y, lista))

25


In [23]:
movies_1975 = movies_split_genre[ movies_split_genre['Title'].str.contains('1975') ]
movies_1975.head(3)

Unnamed: 0,MovieID,Title,Genre,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,...,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
707,716,Switchblade Sisters (1975),Crime,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
994,1007,"Apple Dumpling Gang, The (1975)",Children's|Comedy|Western,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
996,1009,Escape to Witch Mountain (1975),Adventure|Children's|Fantasy,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
any_drama = reduce(lambda x,y : bool(x) | bool(y),movies_1975['Drama']) # is there any drama in 1975?
print(any_drama)

all_comedy = reduce(lambda x,y : bool(x) & bool(y),movies_1975['Comedy']) # are all the 1975 movies comedies?
print(all_comedy)

True
False


In [25]:
print(movies_1975['Drama'].any()) # True if at least one value is truthy
print(movies_1975['Comedy'].all()) # True only if every value is truthy

True
False


In [26]:
# Check the data types first to see whether these functions can be applied
print(movies_1975.dtypes)
print(movies_1975['Comedy'].unique())

MovieID          int64
Title           object
Genre           object
Animation      float64
Children's     float64
Comedy         float64
Adventure      float64
Fantasy        float64
Romance        float64
Drama          float64
Action         float64
Crime          float64
Thriller       float64
Horror         float64
Sci-Fi         float64
Documentary    float64
War            float64
Musical        float64
Mystery        float64
Film-Noir      float64
Western        float64
dtype: object
[0. 1.]


## Filter
- returns a sequence with the elements that satisfy a condition
- Example: get the 1975 movies whose title contains 'The'

In [27]:
filtro = filter(lambda x : 'The' in x, movies_1975['Title'])
list(filtro)
# Does this catch every title with "The"? The test is case-sensitive, so 'the' is not matched

['Apple Dumpling Gang, The (1975)',
 'Man Who Would Be King, The (1975)',
 'Stepford Wives, The (1975)',
 'Rocky Horror Picture Show, The (1975)',
 'McCullochs, The (1975)',
 'Mirror, The (Zerkalo) (1975)']

## Map
- applies an operation to each element and returns a sequence
- Example: convert the numeric values of the 'Comedy' column to bool

In [28]:
movies_1975 = movies_1975.copy() # work on an explicit copy of the filtered slice to avoid SettingWithCopyWarning
mapa = map(bool, movies_1975['Comedy'])
movies_1975['Comedy'] = list(mapa) # replace the float column with its bool version
movies_1975.head(4)


Unnamed: 0,MovieID,Title,Genre,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,...,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
707,716,Switchblade Sisters (1975),Crime,0.0,0.0,False,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
994,1007,"Apple Dumpling Gang, The (1975)",Children's|Comedy|Western,0.0,1.0,True,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
996,1009,Escape to Witch Mountain (1975),Adventure|Children's|Fantasy,0.0,1.0,False,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1176,1193,One Flew Over the Cuckoo's Nest (1975),Drama,0.0,0.0,False,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Variable transformation (data quality)
- Missing values
- Duplicate values
- Discretization (categorical values)

In [29]:
matrix = pd.DataFrame(np.random.randint(10,size=(5,10)))
matrix[matrix < 2] = np.nan
matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,2,9.0,4.0,6,4.0,3.0,4,7.0,6,6
1,5,9.0,7.0,4,2.0,7.0,7,2.0,3,5
2,9,4.0,,8,8.0,8.0,6,3.0,4,9
3,9,,8.0,9,,,2,,8,9
4,8,4.0,,9,5.0,7.0,4,6.0,2,2


In [30]:
# nulls per column
matrix.isnull().sum()
matrix.isna().sum() # isna is an alias of isnull; only this last expression is displayed

0    0
1    1
2    2
3    0
4    1
5    1
6    0
7    1
8    0
9    0
dtype: int64

In [31]:
# Total number of null values
matrix.isnull().sum().sum()

6

In [32]:
# number of non-null values per row
matrix.count(axis=1)

0    10
1    10
2     9
3     6
4     9
dtype: int64

In [33]:
# Number of nulls per row
matrix.shape[1] - matrix.count(axis=1)

0    0
1    0
2    1
3    4
4    1
dtype: int64

In [34]:
# Rows in which a given column is null
matrix[matrix[0].isnull()]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9


In [35]:
# Values to look for in the dataset
valores = [6, 5]
# Rows whose column 6 holds one of the listed values
matrix[matrix[6].isin(valores)]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
2,9,4.0,,8,8.0,8.0,6,3.0,4,9


In [36]:
## Handling null values
# drop every row that contains a null
matrix.dropna()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,2,9.0,4.0,6,4.0,3.0,4,7.0,6,6
1,5,9.0,7.0,4,2.0,7.0,7,2.0,3,5


In [37]:
# drop rows that have fewer than 'thresh' non-NaN values
matrix.dropna(thresh=9)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,2,9.0,4.0,6,4.0,3.0,4,7.0,6,6
1,5,9.0,7.0,4,2.0,7.0,7,2.0,3,5
2,9,4.0,,8,8.0,8.0,6,3.0,4,9
4,8,4.0,,9,5.0,7.0,4,6.0,2,2


In [38]:
# replace with a fixed value
matrix.fillna(-1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,2,9.0,4.0,6,4.0,3.0,4,7.0,6,6
1,5,9.0,7.0,4,2.0,7.0,7,2.0,3,5
2,9,4.0,-1.0,8,8.0,8.0,6,3.0,4,9
3,9,-1.0,8.0,9,-1.0,-1.0,2,-1.0,8,9
4,8,4.0,-1.0,9,5.0,7.0,4,6.0,2,2


In [39]:
# replace with a dynamic value (copy of a neighboring value)
matrix.bfill() # bfill copies the next valid value backward; ffill copies the previous one forward

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,2,9.0,4.0,6,4.0,3.0,4,7.0,6,6
1,5,9.0,7.0,4,2.0,7.0,7,2.0,3,5
2,9,4.0,8.0,8,8.0,8.0,6,3.0,4,9
3,9,4.0,8.0,9,5.0,7.0,2,6.0,8,9
4,8,4.0,,9,5.0,7.0,4,6.0,2,2


In [40]:
# replace with a dynamic value (interpolation)
matrix.interpolate()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,2,9.0,4.0,6,4.0,3.0,4,7.0,6,6
1,5,9.0,7.0,4,2.0,7.0,7,2.0,3,5
2,9,4.0,7.5,8,8.0,8.0,6,3.0,4,9
3,9,4.0,8.0,9,6.5,7.5,2,4.5,8,9
4,8,4.0,8.0,9,5.0,7.0,4,6.0,2,2
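
Another common option, not used in the cells above, is to fill each column with its own mean. The sketch below assumes a small made-up frame `m`; `fillna` accepts a Series and matches its index against the column names.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
m = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

# m.mean() skips NaN and returns one value per column; fillna applies it column by column
filled = m.fillna(m.mean())

print(filled)
```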


#### Handling duplicate values

In [41]:
serie = pd.Series(['a','b','c','a','c','a','g'])
serie.duplicated()

0    False
1    False
2    False
3     True
4     True
5     True
6    False
dtype: bool

In [42]:
df = all_dataset
# drop duplicates
# Drop the rows whose value in the given column repeats; keep='last' retains the last row for each value
df2 = df.drop_duplicates(subset="Gender", keep='last', inplace=False)
display(df2)

Unnamed: 0,Gender,Age,Occupation,Zip-code,MovieID,Rating,Timestamp,Title,Genre
1000207,F,18,20,55410,3607,5,957756608,One Little Indian (1973),Comedy|Drama|Western
1000208,M,25,1,35401,2909,4,957273353,"Five Wives, Three Secretaries and Me (1998)",Documentary


#### Discretization (categorical values)
- Besides Series and DataFrame, pandas has an object for categories: Categorical
```python
categorias = pd.cut(<values>, <bins>) 
```
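
A small sketch of two `pd.cut` options that the cells below do not use: `labels`, to name the bins, and `right=False`, to close the intervals on the left. The label names are made up for illustration.

```python
import pandas as pd

ages = [16, 25, 18, 71, 44]
bins = [0, 18, 35, 65, 99]
labels = ["minor", "young", "adult", "senior"]  # hypothetical names, one per bin

# intervals are right-closed by default, so 18 falls in (0, 18]
named = pd.cut(ages, bins, labels=labels)

# right=False makes them left-closed, so 18 falls in [18, 35)
left_closed = pd.cut(ages, bins, right=False)

print(list(named))
print(left_closed)
```

Values outside the outermost edges are assigned NaN, as happens with the age 100 in the next cell.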

In [43]:
# specify the bin edges
bins = [0,18,35,65,99]
edades = [16,25,18,71,44,100,12]
categorias = pd.cut(edades,bins)
print(categorias)

[(0.0, 18.0], (18.0, 35.0], (0.0, 18.0], (65.0, 99.0], (35.0, 65.0], NaN, (0.0, 18.0]]
Categories (4, interval[int64]): [(0, 18] < (18, 35] < (35, 65] < (65, 99]]


In [44]:
categorias.value_counts()

(0, 18]     3
(18, 35]    1
(35, 65]    1
(65, 99]    1
dtype: int64

In [45]:
# specify the number of bins
bins = 5
edades = [0,6,8,16,25,18,71,44,100]
categorias = pd.cut(edades,bins) # equal-width bins (every range spans the same distance)
print(categorias)
print(categorias.value_counts())

[(-0.1, 20.0], (-0.1, 20.0], (-0.1, 20.0], (-0.1, 20.0], (20.0, 40.0], (-0.1, 20.0], (60.0, 80.0], (40.0, 60.0], (80.0, 100.0]]
Categories (5, interval[float64]): [(-0.1, 20.0] < (20.0, 40.0] < (40.0, 60.0] < (60.0, 80.0] < (80.0, 100.0]]
(-0.1, 20.0]     5
(20.0, 40.0]     1
(40.0, 60.0]     1
(60.0, 80.0]     1
(80.0, 100.0]    1
dtype: int64


In [46]:
bins = 4
edades = [1,6,8,16,25,18,71,44,100]
categorias = pd.qcut(edades,bins) # equal-frequency bins (similar number of values per bin)
print(categorias)
print(categorias.value_counts())

[(0.999, 8.0], (0.999, 8.0], (0.999, 8.0], (8.0, 18.0], (18.0, 44.0], (8.0, 18.0], (44.0, 100.0], (18.0, 44.0], (44.0, 100.0]]
Categories (4, interval[float64]): [(0.999, 8.0] < (8.0, 18.0] < (18.0, 44.0] < (44.0, 100.0]]
(0.999, 8.0]     3
(8.0, 18.0]      2
(18.0, 44.0]     2
(44.0, 100.0]    2
dtype: int64

