<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Handling-Missing-Data" data-toc-modified-id="Handling-Missing-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Handling Missing Data</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Handling-methods" data-toc-modified-id="Handling-methods-1.0.1"><span class="toc-item-num">1.0.1&nbsp;&nbsp;</span>Handling methods</a></span></li></ul></li><li><span><a href="#Filtering-Out-Missing-Data" data-toc-modified-id="Filtering-Out-Missing-Data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Filtering Out Missing Data</a></span></li><li><span><a href="#Filling-In-Missing-Data" data-toc-modified-id="Filling-In-Missing-Data-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Filling In Missing Data</a></span><ul class="toc-item"><li><span><a href="#Fillna-function-arguments" data-toc-modified-id="Fillna-function-arguments-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Fillna function arguments</a></span></li></ul></li></ul></li><li><span><a href="#Data-Transformation" data-toc-modified-id="Data-Transformation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Transformation</a></span><ul class="toc-item"><li><span><a href="#Removing-Duplicates" data-toc-modified-id="Removing-Duplicates-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Removing Duplicates</a></span></li><li><span><a href="#Transforming-Data-using-a-Function-or-Mapping" data-toc-modified-id="Transforming-Data-using-a-Function-or-Mapping-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Transforming Data using a Function or Mapping</a></span></li><li><span><a href="#Replacing-Values" data-toc-modified-id="Replacing-Values-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Replacing Values</a></span></li><li><span><a href="#Renaming-Axis-Indexes" data-toc-modified-id="Renaming-Axis-Indexes-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Renaming Axis 
Indexes</a></span></li></ul></li><li><span><a href="#String-Manipulation" data-toc-modified-id="String-Manipulation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>String Manipulation</a></span><ul class="toc-item"><li><span><a href="#String-object-methods" data-toc-modified-id="String-object-methods-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>String object methods</a></span></li><li><span><a href="#Regex" data-toc-modified-id="Regex-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Regex</a></span><ul class="toc-item"><li><span><a href="#Regular-expression-methods" data-toc-modified-id="Regular-expression-methods-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Regular expression methods</a></span></li></ul></li><li><span><a href="#Vectorized-String-Functions-in-pandas" data-toc-modified-id="Vectorized-String-Functions-in-pandas-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Vectorized String Functions in pandas</a></span></li></ul></li></ul></div>

In [1]:
from numpy import nan as NA
import pandas as pd
import numpy as np

# Handling Missing Data

### Handling methods
![image](img/image1.png)

## Filtering Out Missing Data

By default, dropna() drops every row that contains at least one NaN value.

In [12]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])

In [13]:
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing how='all' will only drop rows that are all NA

In [14]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


To drop columns in the same way, pass axis=1

In [15]:
data[4] = NA
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [16]:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0
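
The how and axis arguments combine independently, which is easy to lose track of. The following is a minimal sketch on a made-up frame (the names frame, rows_any, rows_all, cols_all and cols_any are illustrative, not part of the notebook above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely NaN and column "b" is entirely NaN
frame = pd.DataFrame(
    [[1.0, np.nan, 3.0],
     [np.nan, np.nan, np.nan],
     [4.0, np.nan, 6.0]],
    columns=["a", "b", "c"],
)

rows_any = frame.dropna()                   # every row has a NaN in "b", so none survive
rows_all = frame.dropna(how="all")          # only the all-NaN row 1 is dropped
cols_all = frame.dropna(axis=1, how="all")  # only the all-NaN column "b" is dropped
cols_any = frame.dropna(axis=1)             # every column has a NaN in row 1, so none survive

print(rows_any.shape, rows_all.shape, cols_all.shape, cols_any.shape)
```

The default how='any' is strict: a single NaN is enough to discard the whole row or column, so it is worth checking the resulting shape before continuing.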


A related way to filter out DataFrame rows tends to concern time series data. Suppose
you want to keep only rows containing at least a certain number of non-NA observations.
You can indicate this with the thresh argument:

In [17]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

Unnamed: 0,0,1,2
0,-1.040417,,
1,0.916565,,
2,2.315443,,0.057186
3,2.459704,,-0.336589
4,0.21182,-1.298994,-0.112836
5,0.778972,1.037135,0.520803
6,-0.249749,1.088431,-1.233773


In [18]:
df.dropna()

Unnamed: 0,0,1,2
4,0.21182,-1.298994,-0.112836
5,0.778972,1.037135,0.520803
6,-0.249749,1.088431,-1.233773


In [19]:
# keep the rows that have 2 or more non-NA values
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,2.315443,,0.057186
3,2.459704,,-0.336589
4,0.21182,-1.298994,-0.112836
5,0.778972,1.037135,0.520803
6,-0.249749,1.088431,-1.233773
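
Because the frame above is random, a deterministic sketch may make the thresh rule easier to verify. It assumes a small hand-written frame (obs is a hypothetical name) and compares dropna(thresh=2) against an explicit count of non-NA values per row:

```python
import numpy as np
import pandas as pd

# Hypothetical observations with a varying number of missing values per row
obs = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [np.nan, np.nan, 0.5, 0.7],
    "z": [np.nan, 0.1, np.nan, 0.9],
})

# Number of non-NA values in each row: 1, 2, 2, 3
non_na_per_row = obs.notna().sum(axis=1)

# thresh=2 keeps rows with at least 2 non-NA values, so row 0 is dropped
kept = obs.dropna(thresh=2)

# The same selection written as an explicit boolean mask
same = obs[non_na_per_row >= 2]
print(kept.equals(same))
```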


## Filling In Missing Data

To fill the missing data with some value, you can use the fillna() method.

In [20]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-1.040417,0.0,0.0
1,0.916565,0.0,0.0
2,2.315443,0.0,0.057186
3,2.459704,0.0,-0.336589
4,0.21182,-1.298994,-0.112836
5,0.778972,1.037135,0.520803
6,-0.249749,1.088431,-1.233773


Calling fillna with a dict, you can use a different fill value for each column:

In [21]:
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,-1.040417,0.5,0.0
1,0.916565,0.5,0.0
2,2.315443,0.5,0.057186
3,2.459704,0.5,-0.336589
4,0.21182,-1.298994,-0.112836
5,0.778972,1.037135,0.520803
6,-0.249749,1.088431,-1.233773
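
The fill values do not have to be constants. As a sketch under the assumption that a per-column mean is an acceptable substitute (scores is a hypothetical frame), a Series of column means can be passed the same way as the dict above. It also shows that fillna returns a new object and leaves the original untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
scores = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

# scores.mean() is a Series indexed by column label (a -> 2.0, b -> 6.0);
# fillna matches each fill value to its column by that label
filled = scores.fillna(scores.mean())

print(filled)
print(scores.isna().sum().sum())  # the original still has its 2 NaN values
```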


### Fillna function arguments
![image](img/image2.png)
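
As a hedged illustration of two commonly used filling options, forward filling and limit, here is a small sketch on a made-up Series (gaps is a hypothetical name). In recent pandas releases the method argument of fillna is deprecated, so the sketch uses the ffill() method instead:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a gap of two consecutive missing values
gaps = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: propagate the last valid value into the gap
forward = gaps.ffill()

# limit=1 fills at most one consecutive missing value per gap
limited = gaps.ffill(limit=1)

print(forward.tolist())
print(limited.isna().tolist())
```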

# Data Transformation

## Removing Duplicates

In [24]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [25]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [26]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


Both of these methods by default consider all of the columns; alternatively, you can
specify any subset of them to detect duplicates. Suppose we had an additional column
of values and wanted to filter duplicates only based on the 'k1' column:

In [31]:
data['v1'] = range(7)

In [32]:
# returns the DataFrame without repeated values in k1
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


duplicated and drop_duplicates by default keep the first observed value combination.  
Passing keep='last' will return the last one

In [34]:
data.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
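
The effect of keep is easier to see when the kept row labels are compared side by side. This is a minimal sketch on a hypothetical frame (pairs); keep=False, which discards every member of a duplicated group, is included for contrast:

```python
import pandas as pd

# Hypothetical frame in which each k1 value appears twice
pairs = pd.DataFrame({
    "k1": ["one", "two", "one", "two"],
    "v1": [0, 1, 2, 3],
})

mask = pairs.duplicated(subset=["k1"])                       # True for rows 2 and 3
first = pairs.drop_duplicates(subset=["k1"])                 # keeps rows 0 and 1
last = pairs.drop_duplicates(subset=["k1"], keep="last")     # keeps rows 2 and 3
neither = pairs.drop_duplicates(subset=["k1"], keep=False)   # keeps nothing

print(list(first.index), list(last.index), len(neither))
```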


## Transforming Data using a Function or Mapping

The process is to call map on the column (as a Series) and then assign the transformed Series back to the column.

In [38]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon','Pastrami', 'corned beef', 'Bacon','pastrami', 'honey ham', 'nova lox'],'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [39]:
meat_to_animal = {
    'bacon':'pig',
    'pulled pork':'pig',
    'pastrami':'cow',
    'corned beef':'cow',
    'honey ham':'pig',
    'nova lox':'salmon',
}

In [41]:
# extract the Series to work on, lowercased so it matches the dict keys
food = data["food"].str.lower()

In [43]:
# assign the mapped Series back to the food column
data["food"] = food.map(meat_to_animal)
data

Unnamed: 0,food,ounces
0,pig,4.0
1,pig,3.0
2,pig,12.0
3,cow,6.0
4,cow,7.5
5,pig,8.0
6,cow,3.0
7,pig,5.0
8,salmon,6.0
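
The section title also mentions functions: map accepts a callable as well as a dict. The sketch below does the lowercasing and the lookup in a single function. The value 'tofu' is a made-up entry added only to show what happens when a key is missing from the mapping:

```python
import pandas as pd

meat_to_animal = {"bacon": "pig", "pastrami": "cow", "nova lox": "salmon"}

# Hypothetical data; "tofu" deliberately has no entry in the mapping
foods = pd.Series(["bacon", "Pastrami", "nova lox", "tofu"])

# Dict mapping: values without a matching key become missing
by_dict = foods.str.lower().map(meat_to_animal)

# Function mapping: dict.get returns None for unknown keys instead of raising
by_func = foods.map(lambda name: meat_to_animal.get(name.lower()))

print(by_dict.isna().tolist())
print(by_func.isna().tolist())
```

Using meat_to_animal[name.lower()] inside the lambda would raise a KeyError for 'tofu', so dict.get is the safer choice when the mapping may be incomplete.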


## Replacing Values

To replace values in a Series, use the replace method in one of these forms:  
* data.replace("value to replace", "substitute value")  
* data.replace("list of values to replace", "list of substitutes")  
* data.replace("dict whose keys are the values to replace and whose values are the substitutes")
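
The three call forms listed above can be sketched on a numeric Series. The sentinel values -999 and -1000 are assumptions chosen for illustration, as is the name readings:

```python
import numpy as np
import pandas as pd

# Hypothetical readings where -999 and -1000 stand in for missing data
readings = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

# Form 1: a single value and its substitute
single = readings.replace(-999, np.nan)

# Form 2: a list of values and a list of substitutes, matched by position
pairwise = readings.replace([-999, -1000], [np.nan, 0])

# Form 3: a dict mapping each value to its substitute
by_dict = readings.replace({-999: np.nan, -1000: 0})

print(single.isna().sum())        # 2
print(pairwise.equals(by_dict))   # True
```

replace returns a new Series, so the result has to be assigned back if the change should persist.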

In [46]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon','pastrami', 'corned beef', 'bacon','pastrami', 'honey ham', 'nova lox'],'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,pastrami,6.0
4,corned beef,7.5
5,bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [47]:
# dict mapping each value to replace to its replacement
meat_to_animal = {
    'bacon':'pig',
    'pulled pork':'pig',
    'pastrami':'cow',
    'corned beef':'cow',
    'honey ham':'pig',
    'nova lox':'salmon',
}

In [51]:
# extract the food column as a Series
serie = data["food"]
serie

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [55]:
# perform the replacement on the extracted Series
nuevos_valores = serie.replace(meat_to_animal)
nuevos_valores

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

In [57]:
# write the new values back into the DataFrame
data["food"] = nuevos_valores
data

Unnamed: 0,food,ounces
0,pig,4.0
1,pig,3.0
2,pig,12.0
3,cow,6.0
4,cow,7.5
5,pig,8.0
6,cow,3.0
7,pig,5.0
8,salmon,6.0


## Renaming Axis Indexes
Like values in a Series, axis labels can be similarly transformed by a function or
mapping of some form to produce new, differently labeled objects. You can also modify
the axes in-place without creating a new data structure. Here’s a simple example:

In [58]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),index=['Ohio', 'Colorado', 'New York'],columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [59]:
transform = lambda x: x[:4].upper()
new_index = data.index.map(transform)
data.index = new_index
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


If you want to create a transformed version of a dataset without modifying the
original, a useful method is **rename**:

In [60]:
data.rename(index=str.title, columns=str.upper)


Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


Notably, rename can be used in conjunction with a dict-like object providing new
values for a subset of the axis labels. Passing inplace=True modifies the existing
DataFrame instead of returning a new one:

In [63]:
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'},
            inplace = True)
data

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11
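
The difference between the copying and in-place behavior of rename can be checked directly. This is a small sketch on a hypothetical frame (table), not the notebook's data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only to contrast the two modes of rename
table = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=["ohio", "utah"],
    columns=["one", "two", "three"],
)

# Default: rename returns a new DataFrame and leaves table unchanged
renamed = table.rename(index=str.upper, columns={"three": "third"})
labels_before = list(table.index)

# inplace=True: table itself is modified and the call returns None
result = table.rename(index={"ohio": "indiana"}, inplace=True)

print(list(renamed.index), list(renamed.columns))
print(labels_before, list(table.index), result)
```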


# String Manipulation

## String object methods
![imagen](img/image3.png)
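
A few of the built-in string methods are enough for many cleaning tasks. The sketch below uses only standard str methods on a made-up value (val is an illustrative name):

```python
# Hypothetical comma-separated value with uneven whitespace
val = "a,b,  guido"

# split breaks on the separator; strip removes surrounding whitespace
pieces = [piece.strip() for piece in val.split(",")]

# join concatenates the pieces with a new delimiter
joined = "::".join(pieces)

print(pieces)                    # ['a', 'b', 'guido']
print(joined)                    # a::b::guido
print("guido" in val)            # True
print(val.index(","))            # 1, raises ValueError if not found
print(val.find(":"))             # -1, because ":" does not occur
print(val.count(","))            # 2
print(val.replace(",", "::"))    # a::b::  guido
```

index and find differ only in how they report a missing substring: index raises an exception, find returns -1.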

## Regex

In [4]:
import re
text = "foo bar\t baz \tqux"

When the same regular expression is applied many times, a useful way to save execution time is to compile it once with **re.compile()**. Passing flags=re.IGNORECASE to it makes the pattern case-insensitive.

In [5]:
re.split(r"\s+", text)

['foo', 'bar', 'baz', 'qux']

In [6]:
expresion = re.compile(r"\s+")
expresion.split(text)

['foo', 'bar', 'baz', 'qux']
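
The effect of flags=re.IGNORECASE can be seen by running the same pattern with and without it. This is a minimal sketch; the email-like pattern and the sample text are assumptions made up for the example:

```python
import re

# Hypothetical text with mixed-case addresses
sample = "Dave dave@GOOGLE.com\nSteve STEVE@gmail.com"

# The pattern is written with lowercase character classes only
raw_pattern = r"[a-z]+@[a-z]+\.com"

strict = re.compile(raw_pattern)
relaxed = re.compile(raw_pattern, flags=re.IGNORECASE)

print(strict.findall(sample))    # []
print(relaxed.findall(sample))   # ['dave@GOOGLE.com', 'STEVE@gmail.com']
```

The compiled object can be reused across many strings, which is where compiling once pays off.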

### Regular expression methods
![imagen](img/image4.png)
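
As a sketch of how the main methods of a compiled pattern differ, the example below applies findall, search, match and sub to the same made-up text. The pattern and the addresses are illustrative assumptions:

```python
import re

# Three capture groups: user name, domain name, domain suffix
regex = re.compile(r"(\w+)@(\w+)\.(\w+)")
contacts = "dave@google.com, rob@gmail.com"

# findall returns every match; with groups, each match is a tuple
all_matches = regex.findall(contacts)

# search returns only the first match, anywhere in the string
first = regex.search(contacts)

# match succeeds only if the pattern matches at the start of the string
at_start = regex.match(contacts)
not_at_start = regex.match("contact: dave@google.com")

# sub replaces every match with the given string
redacted = regex.sub("REDACTED", contacts)

print(all_matches)
print(first.span(), first.groups())
print(at_start is not None, not_at_start is None)
print(redacted)
```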

## Vectorized String Functions in pandas
![imagen](img/imagen5.png)
![imagen](img/imagen6.png)
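
The main advantage of the vectorized methods under the str accessor is that they skip missing values instead of raising an error. The following is a minimal sketch on a hypothetical Series of addresses (emails is an illustrative name):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry
emails = pd.Series({
    "Dave": "dave@google.com",
    "Steve": "steve@gmail.com",
    "Wes": np.nan,
})

# na=False makes the result a plain boolean mask, treating missing as no match
has_gmail = emails.str.contains("gmail", na=False)

# Element-wise upper-casing; the missing entry stays missing
upper = emails.str.upper()

# Split each address on "@" and take the second piece with str.get
domains = emails.str.split("@").str.get(1)

print(has_gmail.tolist())
print(upper["Dave"], pd.isna(upper["Wes"]))
print(domains["Steve"], pd.isna(domains["Wes"]))
```

Calling the plain Python method through map, for example emails.map(str.upper), would fail on the missing entry because NaN is a float, not a string.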