# Data Cleaning and Preparation
The information contained in datasets is rarely 100% clean.

pandas lets us correct these datasets more easily.

# Load pandas

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# 1. Handling Missing Data
Databases often contain entries with no information (missing data values).

pandas ignores these NaN values when computing descriptive statistics.

For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data.

NaN acts as a sentinel value: a marker that is easy to detect. Missing values are not always encoded as NaN; raw data sometimes marks them with other symbols.
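As a minimal sketch of both points (the `-999` marker and the variable names are illustrative assumptions, not part of the original notebook), a custom sentinel can be converted to NaN so that descriptive statistics skip it:

```python
import numpy as np
import pandas as pd

# Hypothetical readings in which -999 marks "no measurement".
raw = pd.Series([1.0, -999.0, 3.0])

# Convert the custom sentinel to NaN so pandas treats it as missing.
clean = raw.replace(-999, np.nan)

mean_raw = raw.mean()                   # the sentinel distorts the result
mean_clean = clean.mean()               # NaN is skipped: (1 + 3) / 2
mean_strict = clean.mean(skipna=False)  # NaN propagates when skipna=False

print(mean_raw, mean_clean, mean_strict)
```

Passing `skipna=False` is useful when a missing value should invalidate the statistic rather than be silently ignored.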

## 1.1 Location
We can locate NaN values:

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

#### isnull()

In [3]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

#### notnull()

In [4]:
string_data.notnull()

0     True
1     True
2    False
3     True
dtype: bool

In [5]:
string_data[string_data.notnull()]

0     aardvark
1    artichoke
3      avocado
dtype: object

#### None
Python's built-in None is also treated as a missing value (NaN).

In [6]:
string_data[0] = None

string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
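Because `isnull()` returns booleans, summing the result counts the missing entries. The sketch below uses a small made-up DataFrame (the column names are assumptions) to count missing values per column and the rows affected:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing None (object column) and NaN (float column).
frame = pd.DataFrame({"name": ["ana", None, "luis"],
                      "score": [1.0, np.nan, np.nan]})

# True counts as 1 when summed, so this yields the missing count per column.
missing_per_column = frame.isnull().sum()

# Number of rows that contain at least one missing value.
rows_with_missing = frame.isnull().any(axis=1).sum()

print(missing_per_column)
print(rows_with_missing)
```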

In [7]:
from IPython.display import Image
Image(url= "https://i.gyazo.com/f12158a0cf6798945d0f03fe20aa1608.png", width = 600)

## 1.2 Filtering and Drop

### 1.2.1 Series

In [8]:
data = pd.Series([1, np.nan, 3.5, 7])

data

0    1.0
1    NaN
2    3.5
3    7.0
dtype: float64

#### dropna()
Removes NaN values and returns a new object (the original is left unchanged).

By default, it removes EVERY row that contains at least one NaN value (for a Series, every NaN entry).

In [9]:
data.dropna()

0    1.0
2    3.5
3    7.0
dtype: float64

In [10]:
data[data.notnull()]

0    1.0
2    3.5
3    7.0
dtype: float64

### 1.2.2 DataFrame

In [11]:
from numpy import nan as NA

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])

data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


#### drop rows

To drop every row that contains at least one NaN value:

In [12]:
cleaned = data.dropna()

cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


To drop only the rows in which ALL values are NaN:

In [13]:
data.dropna(how="all")

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


#### drop columns
To drop columns, do the same with axis=1:

In [14]:
data[4] = NA

data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [15]:
data.dropna(axis = 1, how = "all")

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


### 1.2.3 Threshold
We may want to keep only the rows that contain at least a certain number of non-NaN values.

In [16]:
df = pd.DataFrame(np.random.randn(7, 3))

df

Unnamed: 0,0,1,2
0,0.950069,1.486046,-0.975592
1,-0.248582,0.710829,-0.62853
2,0.955601,-0.596541,2.463748
3,-0.774308,0.155161,-0.420315
4,0.548841,0.59758,0.639756
5,-1.940572,1.657079,-1.386188
6,0.219607,-1.948398,-0.93976


In [17]:
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA

df

Unnamed: 0,0,1,2
0,0.950069,,
1,-0.248582,,
2,0.955601,,2.463748
3,-0.774308,,-0.420315
4,0.548841,0.59758,0.639756
5,-1.940572,1.657079,-1.386188
6,0.219607,-1.948398,-0.93976


#### thresh
Sets the minimum number of non-NaN values a row must contain to be kept. With thresh=2, rows 0 and 1 are dropped because each holds only one non-NaN value.

In [18]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.955601,,2.463748
3,-0.774308,,-0.420315
4,0.548841,0.59758,0.639756
5,-1.940572,1.657079,-1.386188
6,0.219607,-1.948398,-0.93976
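The following sketch, on a tiny made-up frame (names are illustrative), shows that `thresh` counts non-NaN values per row rather than selecting rows by position:

```python
import numpy as np
import pandas as pd

# Row 0 has 3 non-NaN values, row 1 has 1, row 2 has 0.
frame = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                      "b": [2.0, 5.0, np.nan],
                      "c": [3.0, np.nan, np.nan]})

at_least_two = frame.dropna(thresh=2)  # keeps only row 0
at_least_one = frame.dropna(thresh=1)  # keeps rows 0 and 1

print(at_least_two)
print(at_least_one)
```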


## 1.3 Filling in Missing Data
There are several ways to fill NaN values.

### 1.3.1 fillna Method
Fills NaN values with a given value.

It returns a new object, but passing inplace=True modifies the existing one instead.

In [19]:
df

Unnamed: 0,0,1,2
0,0.950069,,
1,-0.248582,,
2,0.955601,,2.463748
3,-0.774308,,-0.420315
4,0.548841,0.59758,0.639756
5,-1.940572,1.657079,-1.386188
6,0.219607,-1.948398,-0.93976


#### constant value

In [20]:
df.fillna(0)

Unnamed: 0,0,1,2
0,0.950069,0.0,0.0
1,-0.248582,0.0,0.0
2,0.955601,0.0,2.463748
3,-0.774308,0.0,-0.420315
4,0.548841,0.59758,0.639756
5,-1.940572,1.657079,-1.386188
6,0.219607,-1.948398,-0.93976


#### dict
We can use a different fill value for each column:

In [21]:
df.fillna({1:0.5, 2:0})

Unnamed: 0,0,1,2
0,0.950069,0.5,0.0
1,-0.248582,0.5,0.0
2,0.955601,0.5,2.463748
3,-0.774308,0.5,-0.420315
4,0.548841,0.59758,0.639756
5,-1.940572,1.657079,-1.386188
6,0.219607,-1.948398,-0.93976


In [22]:
df

Unnamed: 0,0,1,2
0,0.950069,,
1,-0.248582,,
2,0.955601,,2.463748
3,-0.774308,,-0.420315
4,0.548841,0.59758,0.639756
5,-1.940572,1.657079,-1.386188
6,0.219607,-1.948398,-0.93976


In [23]:
df.fillna({1:0.5, 2:0}, inplace = True)

df

Unnamed: 0,0,1,2
0,0.950069,0.5,0.0
1,-0.248582,0.5,0.0
2,0.955601,0.5,2.463748
3,-0.774308,0.5,-0.420315
4,0.548841,0.59758,0.639756
5,-1.940572,1.657079,-1.386188
6,0.219607,-1.948398,-0.93976


#### creativity
We can fill NaN values more thoughtfully, for example with the mean or the median.

In [24]:
data = pd.Series([1., NA, 3.5, NA, 7])

data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [25]:
data.mean()

3.8333333333333335

In [26]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
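The same idea extends to a DataFrame. As a sketch under the assumption that a per-column median is a sensible fill (the frame and its column names are made up), `fillna` accepts a Series whose index matches the column labels:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; column "x" contains an extreme value (100.0).
df = pd.DataFrame({"x": [1.0, np.nan, 3.0, 100.0],
                   "y": [np.nan, 2.0, 4.0, 6.0]})

# df.median() returns one median per column; fillna aligns it by column label.
filled = df.fillna(df.median())

print(filled)
```

The median is less sensitive to extreme values than the mean: here column `x` is filled with 3.0, whereas its mean would be roughly 34.7.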

In [27]:
Image(url= "https://i.gyazo.com/bda48a821269e9f33a46148759b26177.png", width = 600)

# 2. Data Transformation

## 2.1 Removing Duplicates
Datasets sometimes contain duplicate rows; we can locate and remove them easily.

In [28]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 'k2': [1, 1, 2, 3, 3, 4, 4]})

data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


#### duplicated() Method 
Returns a boolean Series: True marks a row that duplicates an earlier one.

In [29]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

#### drop_duplicates() Method
Returns a DataFrame with the duplicate rows removed.

In [30]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


By default all columns are compared, but we can restrict the check to specific columns.

In [31]:
data["v1"] = range(7)

data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [32]:
data.drop_duplicates(["k1"])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


In [33]:
data.drop_duplicates(["k1", "k2"])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5


When removing duplicates, the first occurrence is kept by default. We can change this with keep="last":

In [34]:
data.drop_duplicates(["k1"], keep = "last")

Unnamed: 0,k1,k2,v1
4,one,3,4
6,two,4,6
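Besides `"first"` and `"last"`, `keep` also accepts `False`, which flags every member of a duplicated group. The sketch below uses a small made-up frame to show this option, which the notebook above does not cover:

```python
import pandas as pd

data = pd.DataFrame({"k1": ["one", "two", "one", "two"],
                     "k2": [1, 1, 1, 2]})

# keep=False marks ALL rows that have a duplicate, not just the later ones.
mask = data.duplicated(subset=["k1", "k2"], keep=False)

# Rows that are unique with respect to (k1, k2).
unique_only = data[~mask]

print(mask.tolist())
print(unique_only)
```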


## 2.2 Transforming Data Using a Function or Mapping
We can transform columns using functions or map.

In [35]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                                       'Pastrami', 'corned beef', 'Bacon',
                                       'pastrami', 'honey ham', 'nova lox'],
                              'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
    
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


For example, suppose we want to add a column indicating the animal each food comes from. We create a dictionary for this:

In [36]:
meat_to_animal = {'bacon': 'pig','pulled pork': 'pig','pastrami': 'cow','corned beef': 'cow',
                  'honey ham': 'pig','nova lox': 'salmon'}

The values mix uppercase and lowercase, so we normalize everything to lowercase:

In [37]:
lowercased = data["food"].str.lower()

lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

Now we can create the new column:

In [38]:
data['animal'] = lowercased.map(meat_to_animal)

data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


We could have done the same more compactly with a function:

In [39]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
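The two approaches differ when a key is missing from the dictionary: mapping with the dictionary itself yields NaN, while a lambda that indexes the dictionary raises `KeyError`. A sketch with a made-up food list (`"tofu"` and the `"unknown"` default are assumptions for illustration):

```python
import pandas as pd

meat_to_animal = {"bacon": "pig", "pastrami": "cow"}
food = pd.Series(["Bacon", "pastrami", "tofu"])  # "tofu" is not in the dict

# Passing the dict to map(): unknown keys become NaN.
animal = food.str.lower().map(meat_to_animal)

# A function using dict.get() lets us choose an explicit default instead.
animal_default = food.map(lambda x: meat_to_animal.get(x.lower(), "unknown"))

print(animal.tolist())
print(animal_default.tolist())
```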

## 2.3 Replacing Values
fillna handles the special case of NaN values; we can do the same for values that are not NaN.

### 2.3.1 Replace Method
We can replace values with replace().

In [40]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])

data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

#### single value

In [41]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

#### multiple values

In [42]:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

#### different replacement for each value (list)

In [43]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

#### different replacement for each value (dict)

In [44]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

## 2.4 Renaming Axis Indexes
We can transform the axis labels to produce new ones. We can also pass inplace=True to modify the original instead of creating a new object.

In [45]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)), index=['Ohio', 'Colorado', 'New York'], 
                    columns=['one', 'two', 'three', 'four'])

data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


### 2.4.1 map

In [46]:
transform = lambda x: x[:4].upper()

data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [47]:
data.index = data.index.map(transform)

data.index

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [48]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


### 2.4.2 rename
To create a transformed version of the dataset without modifying the original:


In [49]:
data.rename(index=str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [50]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


#### dict
rename also accepts dictionaries, which lets us change only a subset of the labels:

In [51]:
data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [52]:
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [53]:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


## 2.5 Binning
It is sometimes useful to separate data into bins for analysis.

### 2.5.1 cut 

For example, suppose we have the ages of a group of people in a study:

In [54]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

We want to split them into age bins: 18 to 25, 26 to 35, 36 to 60, and 61 and over.

cut replaces each value with the age bin it falls into:

In [55]:
bins = [18, 25, 35, 60, 100]

In [56]:
cats = pd.cut(ages, bins)

cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

#### categories
We can see the full set of intervals:

In [57]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

#### value_counts
We can count how many values fall into each interval:

In [58]:
cats.value_counts()

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64

#### right = False
By default, each interval excludes its left edge, shown with a parenthesis, and includes its right edge, shown with a square bracket.

We can invert this by passing right=False:

In [59]:
pd.cut(ages, [18, 25, 35, 60, 100], right = False)

[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]
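The effect of `right` is easiest to see on values that sit exactly on a bin edge. A short sketch, using three illustrative ages chosen for that purpose:

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]
edge_ages = [18, 25, 35]  # values lying exactly on bin edges

right_closed = pd.cut(edge_ages, bins)                   # (18, 25], (25, 35], ...
left_closed = pd.cut(edge_ages, bins, right=False)       # [18, 25), [25, 35), ...
with_lowest = pd.cut(edge_ages, bins, include_lowest=True)

# With right-closed bins, 18 falls outside the first interval and becomes NaN.
print(right_closed)
# With left-closed bins, 18 belongs to [18, 25) and 25 moves to [25, 35).
print(left_closed)
# include_lowest=True keeps right-closed bins but also captures the lowest edge.
print(with_lowest)
```

Values that fall outside every bin are returned as NaN rather than raising an error, so it is worth checking for missing values after binning.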

#### labels
We can assign a label to each interval:

In [60]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

pd.cut(ages, bins, labels=group_names).value_counts()

Youth         5
YoungAdult    3
MiddleAged    3
Senior        1
dtype: int64

#### integer as a bin
If we pass an integer instead of a list of bin edges, cut computes that many equal-width bins. The minimum and maximum are taken from the data automatically.

In [61]:
data = np.random.rand(20)

data

array([0.30011097, 0.38776335, 0.84740225, 0.49039528, 0.25455925,
       0.98023622, 0.81358267, 0.41558847, 0.06575867, 0.48143922,
       0.12408323, 0.98812116, 0.59063383, 0.19348328, 0.72192959,
       0.17088885, 0.2696352 , 0.91553248, 0.62062096, 0.80892707])

For example, with 4, the range between the minimum and the maximum of the data is split into 4 intervals of equal width.

In [62]:
pd.cut(data, 4, precision=2)

[(0.3, 0.53], (0.3, 0.53], (0.76, 0.99], (0.3, 0.53], (0.065, 0.3], ..., (0.065, 0.3], (0.065, 0.3], (0.76, 0.99], (0.53, 0.76], (0.76, 0.99]]
Length: 20
Categories (4, interval[float64]): [(0.065, 0.3] < (0.3, 0.53] < (0.53, 0.76] < (0.76, 0.99]]

In [63]:
pd.cut(data, 4, precision=2).value_counts()

(0.065, 0.3]    6
(0.3, 0.53]     5
(0.53, 0.76]    3
(0.76, 0.99]    6
dtype: int64

With 2:

In [64]:
pd.cut(data, 2, precision=2)

[(0.065, 0.53], (0.065, 0.53], (0.53, 0.99], (0.065, 0.53], (0.065, 0.53], ..., (0.065, 0.53], (0.065, 0.53], (0.53, 0.99], (0.53, 0.99], (0.53, 0.99]]
Length: 20
Categories (2, interval[float64]): [(0.065, 0.53] < (0.53, 0.99]]

In [65]:
pd.cut(data, 2, precision=2).value_counts()

(0.065, 0.53]    11
(0.53, 0.99]      9
dtype: int64
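To make the equal-width behavior concrete, the sketch below (with a small made-up array) asks `cut` to return the computed edges via `retbins=True`. Because the bins have equal width and not equal size, some may end up empty:

```python
import numpy as np
import pandas as pd

data = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 8.0])

# retbins=True also returns the array of bin edges that cut computed.
cats, edges = pd.cut(data, 4, retbins=True)

# Counts per bin, in bin order; bins with no values are reported with 0.
counts = cats.value_counts()

print(edges)   # the lowest edge is nudged slightly below the minimum
print(counts)
```

The lowest edge is extended a little below the minimum so that the minimum itself falls inside the first right-closed interval.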

### 2.5.2 qcut
Like cut, but the bins are based on sample quantiles, so each bin holds roughly the same number of observations. With 4 bins, the result is quartiles.

In [66]:
data = np.random.randn(1000)

cats = pd.qcut(data, 4)

cats

[(-0.697, -0.0246], (-0.0246, 0.646], (-0.697, -0.0246], (0.646, 3.011], (-0.697, -0.0246], ..., (-0.0246, 0.646], (-0.0246, 0.646], (-3.2929999999999997, -0.697], (-0.697, -0.0246], (-0.0246, 0.646]]
Length: 1000
Categories (4, interval[float64]): [(-3.2929999999999997, -0.697] < (-0.697, -0.0246] < (-0.0246, 0.646] < (0.646, 3.011]]

In [67]:
pd.value_counts(cats)

(0.646, 3.011]                   250
(-0.0246, 0.646]                 250
(-0.697, -0.0246]                250
(-3.2929999999999997, -0.697]    250
dtype: int64

In [68]:
cats.value_counts()

(-3.2929999999999997, -0.697]    250
(-0.697, -0.0246]                250
(-0.0246, 0.646]                 250
(0.646, 3.011]                   250
dtype: int64
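`qcut` also accepts an explicit list of quantiles between 0 and 1 instead of a number of bins. A sketch on deterministic data (chosen here so the counts are easy to verify):

```python
import numpy as np
import pandas as pd

data = np.arange(100)

# Four equal-sized bins (quartiles).
quartiles = pd.qcut(data, 4)

# Custom quantiles: bottom 10%, next 40%, next 40%, top 10%.
custom = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0])

print(quartiles.value_counts().tolist())
print(custom.value_counts().tolist())
```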

## 2.6 Detecting and Filtering Outliers
Data often contains atypical values (outliers) that are out of line with the rest of the sample.

In [69]:
data = pd.DataFrame(np.random.randn(1000, 4))

data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.012571,0.046911,0.023578,-0.004271
std,1.002868,0.97075,0.986171,0.982385
min,-3.190608,-3.063634,-2.856846,-2.838001
25%,-0.658655,-0.616303,-0.638906,-0.657229
50%,0.015247,0.078457,0.026077,-0.007576
75%,0.67003,0.677955,0.704837,0.651432
max,3.071258,3.082804,3.279277,3.169814


Suppose we want to find the values in a particular column whose absolute value exceeds 3:

In [70]:
col = data[2]

col[np.abs(col) > 3]

405    3.279277
Name: 2, dtype: float64

To select every row that has a value exceeding 3 in absolute value in any column:

In [71]:
data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
405,0.309112,0.342764,3.279277,0.463989
532,-0.127896,-0.678637,-1.872745,3.169814
545,-3.190608,-0.57289,1.632265,0.82409
580,3.071258,0.981962,0.388372,1.739738
755,3.053966,-0.652445,-0.421037,-0.523364
896,-0.884859,-3.063634,0.366071,-1.170856
930,-3.138259,-1.608355,0.391851,-1.060825
981,1.339567,3.082804,1.708691,0.42704


### 2.6.1 Cap Data
We can cap outliers at a ceiling value.

#### np.sign()
Maps each value to 1 or -1 depending on whether it is positive or negative (and to 0 for zero).

In [72]:
data.head()

Unnamed: 0,0,1,2,3
0,-0.812216,0.190481,0.511372,-0.000277
1,1.305778,0.272073,0.709092,-1.828184
2,0.396725,1.331996,0.321806,0.581281
3,-0.664429,-2.194205,0.402,1.138729
4,-0.84254,1.741134,-0.174013,0.629016


In [73]:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,-1.0,1.0,1.0,-1.0
1,1.0,1.0,1.0,-1.0
2,1.0,1.0,1.0,1.0
3,-1.0,-1.0,1.0,1.0
4,-1.0,1.0,-1.0,1.0


In [74]:
data[np.abs(data) > 3] = np.sign(data) * 3

data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.012775,0.046892,0.023299,-0.004441
std,1.001473,0.970292,0.985287,0.98185
min,-3.0,-3.0,-2.856846,-2.838001
25%,-0.658655,-0.616303,-0.638906,-0.657229
50%,0.015247,0.078457,0.026077,-0.007576
75%,0.67003,0.677955,0.704837,0.651432
max,3.0,3.0,3.0,3.0
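As an alternative sketch (not the approach used in the notebook), `DataFrame.clip` gives the same capped result in a single call. The small frame below is made up to compare both techniques:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": [-4.0, 0.5, 3.5],
                     "b": [1.0, -3.2, 2.0]})

# Approach from the text: overwrite outliers with +/-3 using np.sign.
capped_sign = data.copy()
capped_sign[np.abs(capped_sign) > 3] = np.sign(capped_sign) * 3

# Alternative: clip limits every value to the interval [-3, 3].
capped_clip = data.clip(lower=-3, upper=3)

print(capped_clip)
```

`clip` returns a new object, so the original frame is left untouched unless the result is assigned back.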


## 2.7 Permutation and Random Sampling
Permuting means randomly reordering.

In [75]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


#### creating permutation

In [76]:
sampler = np.random.permutation(5)

sampler

array([3, 0, 4, 2, 1])

#### applying permutation 
We can apply the permutation with take():

In [77]:
df.take(sampler)

Unnamed: 0,0,1,2,3
3,12,13,14,15
0,0,1,2,3
4,16,17,18,19
2,8,9,10,11
1,4,5,6,7


#### sample() Method
Selects a random subset of rows.

Without replacement:

In [78]:
df.sample(n=3)

Unnamed: 0,0,1,2,3
4,16,17,18,19
3,12,13,14,15
0,0,1,2,3


With replacement:

In [79]:
df2 = df.sample(n=3, replace = True)

df2

Unnamed: 0,0,1,2,3
3,12,13,14,15
2,8,9,10,11
1,4,5,6,7
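Two further options of `sample()` are often handy: `frac` selects a fraction of the rows instead of a fixed count, and `random_state` makes the draw reproducible. A sketch (the seed value 0 is an arbitrary assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# The same random_state yields the same sample on every call.
first = df.sample(frac=0.4, random_state=0)
second = df.sample(frac=0.4, random_state=0)

# frac=1 returns all rows in random order, i.e. a full permutation.
shuffled = df.sample(frac=1, random_state=0)

print(first)
print(shuffled)
```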


## 2.8 Dummy Variables
Another type of transformation converts a categorical variable into a dummy (or indicator) matrix.

### 2.8.1 get_dummies()

In [80]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],'data1': range(6)})

df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [81]:
pd.get_dummies(df["key"])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


#### prefix
Use prefix to add a prefix to the column names.

In [82]:
dummies = pd.get_dummies(df["key"], prefix = "keyy")

dummies

Unnamed: 0,keyy_a,keyy_b,keyy_c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [83]:
df_with_dummy = df[["data1"]].join(dummies)

df_with_dummy

Unnamed: 0,data1,keyy_a,keyy_b,keyy_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0
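The encode-then-join steps can also be done in one call by passing the whole DataFrame together with `columns`. This is a sketch of that alternative; `dtype=int` is passed explicitly because recent pandas versions return boolean indicator columns by default, unlike the 0/1 output shown above:

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                   "data1": range(6)})

# Encodes "key" and keeps the remaining columns in the result.
encoded = pd.get_dummies(df, columns=["key"], prefix="key", dtype=int)

print(encoded)
```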


### 2.8.2 Multiple Categories [VERY COMPLEX]
When a row can belong to several categories, the procedure is more involved.

In [84]:
mnames = ['movie_id', 'title', 'genres']

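# A multi-character separator requires the Python parsing engine;
# a forward slash in the path avoids the invalid "\m" escape sequence.
movies = pd.read_table('movielens/movies.dat', sep='::', header=None, names=mnames, engine='python')

movies[:10]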


Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


First, we extract all the unique genres.

In [85]:
all_genres = []

for x in movies.genres:
    all_genres.extend(x.split('|'))
    
genres = pd.unique(all_genres)

genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

To start building the dummy dataset, we can begin with a DataFrame full of zeros whose columns are labeled with the genres.

In [112]:
zero_matrix = np.zeros((len(movies), len(genres)))

dummies = pd.DataFrame(zero_matrix, columns=genres)

dummies

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3878,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3879,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3880,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3881,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we can iterate over each movie and set the matching cells to 1. To do this, we use dummies.columns to compute the column indices for each genre:

In [121]:
gen = movies.genres[0]

gen.split('|')

['Animation', "Children's", 'Comedy']

In [122]:
dummies.columns.get_indexer(gen.split('|'))

array([0, 1, 2], dtype=int64)

Then we can use iloc to set values based on those indices:

In [123]:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

In [124]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))

movies_windic.head()

Unnamed: 0,movie_id,title,genres,Genre_Animation,Genre_Children's,Genre_Comedy,Genre_Adventure,Genre_Fantasy,Genre_Romance,Genre_Drama,...,Genre_Crime,Genre_Thriller,Genre_Horror,Genre_Sci-Fi,Genre_Documentary,Genre_War,Genre_Musical,Genre_Mystery,Genre_Film-Noir,Genre_Western
0,1,Toy Story (1995),Animation|Children's|Comedy,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),Comedy|Romance,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),Comedy|Drama,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),Comedy,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [91]:
movies_windic.iloc[0]

movie_id                                       1
title                           Toy Story (1995)
genres               Animation|Children's|Comedy
Genre_Animation                                1
Genre_Children's                               1
Genre_Comedy                                   1
Genre_Adventure                                0
Genre_Fantasy                                  0
Genre_Romance                                  0
Genre_Drama                                    0
Genre_Action                                   0
Genre_Crime                                    0
Genre_Thriller                                 0
Genre_Horror                                   0
Genre_Sci-Fi                                   0
Genre_Documentary                              0
Genre_War                                      0
Genre_Musical                                  0
Genre_Mystery                                  0
Genre_Film-Noir                                0
Genre_Western                                  0
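For delimited category strings like these, pandas also offers `Series.str.get_dummies`, which builds the indicator matrix in one step. The sketch below uses a three-row stand-in for the MovieLens data (the rows are copied from the table above; the smaller frame is an assumption for illustration):

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)", "Sudden Death (1995)"],
    "genres": ["Animation|Children's|Comedy", "Comedy|Romance", "Action"],
})

# Splits each string on "|" and creates one 0/1 column per distinct genre.
dummies = movies["genres"].str.get_dummies("|")

movies_windic = movies.join(dummies.add_prefix("Genre_"))

print(movies_windic)
```

Unlike the manual loop, the resulting columns are sorted alphabetically rather than in order of first appearance.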

#### Other example:

In [92]:
np.random.seed(12345)

values = np.random.rand(10)

values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

In [93]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0,0,0,0,1
1,0,1,0,0,0
2,1,0,0,0,0
3,0,1,0,0,0
4,0,0,1,0,0
5,0,0,1,0,0
6,0,0,0,0,1
7,0,0,0,1,0
8,0,0,0,1,0
9,0,0,0,1,0


# 3. String Manipulation
Many text operations are simple, but others require regular expressions.

## 3.1 String Object Methods
Built-in string methods let us perform simple operations.

### 3.1.1 split()
Splits a string on a separator.

In [94]:
val = 'a,b, guido'

val.split(',')

['a', 'b', ' guido']

### 3.1.2 strip()
Removes leading and trailing whitespace.

In [95]:
pieces = [x.strip() for x in val.split(',')]

pieces

['a', 'b', 'guido']

### 3.1.3 join()
Joins elements into a single string.

In [96]:
first, second, third = pieces

first + '::' + second + '::' + third

'a::b::guido'

In [97]:
'::'.join(pieces)

'a::b::guido'

### 3.1.4 in, index(), find()
These detect a substring.

In [98]:
'guido' in val

True

In [99]:
val.index(',')

1

In [100]:
val.find(':')

-1

The difference between find and index is that index raises an error if the substring is not found, while find returns -1.
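A minimal sketch of that difference, reusing the `val` string from the cells above:

```python
val = 'a,b, guido'

# find() signals "not found" with -1.
position = val.find(':')

# index() signals "not found" by raising ValueError.
try:
    val.index(':')
    index_raised = False
except ValueError:
    index_raised = True

print(position, index_raised)
```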

### 3.1.5 count()
Returns the number of occurrences of a given substring.

In [101]:
val.count(",")

2

### 3.1.6 replace()
Replaces every occurrence of one substring with another. It is often used to remove unwanted symbols.

In [102]:
val.replace(',', '::')

'a::b:: guido'

In [103]:
val.replace(',', '')

'ab guido'

In [104]:
Image(url= "https://i.gyazo.com/c934043c90ca0e1789ed29ef1e42e0b5.png", width = 600)

## 3.2 Regular Expressions [VERY COMPLICATED]
A more flexible way to find unwanted patterns in text in order to remove or replace them.
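As a brief sketch using Python's standard `re` module (the patterns and the sample sentence are illustrative assumptions, not taken from the original notebook), a single regular expression can replace the split-then-strip combination shown earlier:

```python
import re

val = 'a,b, guido'

# Split on a comma with optional surrounding whitespace.
pieces = re.split(r'\s*,\s*', val)

# Remove every character that is neither a word character nor whitespace.
no_symbols = re.sub(r'[^\w\s]', '', val)

# Find all substrings that look like simple email addresses.
text = 'mail dave@google.com or rob@gmail.com'
emails = re.findall(r'[\w.]+@[\w.]+', text)

print(pieces, no_symbols, emails)
```

The email pattern is deliberately simplistic and would not validate real addresses.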

## 3.3 Vectorized String Functions in pandas
All of these methods are accessed through .str.

Both regular expressions and built-in string methods fail when they encounter a NaN value. To deal with this, Series provides string operations that skip NaN values.

In [105]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', 'Rob': 'rob@gmail.com', 'Wes': np.nan}

data = pd.Series(data)

data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [106]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

For example, we can check whether each email contains the word "gmail" with str.contains:

In [107]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object

We can also slice strings this way:

In [108]:
data.str[:5]

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object
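Two more `.str` features are worth sketching here, on a shortened version of the email Series (the regular expression is an illustrative assumption): `na=False` turns the NaN result of `str.contains` into False so it can be used as a boolean mask, and `str.extract` pulls out a regex capture group:

```python
import numpy as np
import pandas as pd

data = pd.Series({"Dave": "dave@google.com",
                  "Steve": "steve@gmail.com",
                  "Wes": np.nan})

# na=False reports missing entries as False, giving a pure boolean mask.
is_gmail = data.str.contains("gmail", na=False)
gmail_users = data[is_gmail]

# Capture the domain after "@"; missing entries stay NaN.
domains = data.str.extract(r"@([\w.]+)", expand=False)

print(gmail_users)
print(domains)
```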

In [109]:
Image(url= "https://i.gyazo.com/b9eb52b45c796a4f6b92ecb1865f482b.png", width = 600)

In [110]:
Image(url= "https://i.gyazo.com/4fbf7428df59b9bfcffec2b3235ccc07.png", width = 600)