## Data Cleaning
To summarize all the data-cleaning steps we need to perform when we receive a dataset, we will use a file containing information about the passengers of the Titanic.

This dataset contains the following information in its columns:

* **survival:** Survived (0 = No, 1 = Yes)
* **pclass:** Class on the ship (1 = 1st, 2 = 2nd, 3 = 3rd)
* **sex:** Sex
* **Age:** Age in years
* **sibsp:** Number of siblings or spouses aboard
* **parch:** Number of parents or children aboard
* **ticket:** Ticket number
* **fare:** Fare
* **cabin:** Cabin number
* **embarked:** Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

**IMPORT THE LIBRARIES AND LOAD THE FILE**

In [None]:
import pandas as pd
df = pd.read_csv('titanic.csv', sep=';', index_col=0)

**CHECK THE FIRST RECORDS TO VERIFY THAT THE FILE LOADED CORRECTLY**

In [None]:
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,birth,death
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1890/8/5,
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,712.833,C85,C,1874/11/18,1932/2/18
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1886/9/17,1938/4/23
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1877/3/10,1916/5/4
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1877/4/22,


**CHECK THE COLUMNS IN THE FILE**

In [None]:
df.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked', 'birth', 'death'],
      dtype='object')

**GET THE DIMENSIONS OF THE DATASET: NUMBER OF ROWS AND COLUMNS**

In [None]:
df.shape

(906, 13)

In [None]:
df.info()

Called with parentheses, `info()` prints the index range, the non-null count and dtype of each column, and the memory usage. Without them, `df.info` only returns the bound-method representation of the DataFrame.

**GET THE DATA TYPES TO UNDERSTAND THE FILE'S CONTENTS**

In [None]:
df.dtypes

Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare         object
Cabin        object
Embarked     object
birth        object
death        object
dtype: object

Some of the data types look inconsistent or, at first glance, incorrect. The age is stored as a float (pandas uses float for a numeric column that contains missing values), the fare is stored as text (`object`), and both the birth and death dates are stored as strings. Later we will decide whether to convert fields such as Sex or Embarked to categorical types.
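A minimal sketch of how such conversions might look, using a small toy frame instead of the real file. The values below are illustrative assumptions, and `errors='coerce'` is one possible policy for unparseable fares, not necessarily the right one for this dataset:

```python
import pandas as pd

# Toy rows standing in for the real file (illustrative values only)
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', None],
    'Fare': ['7.25', '53.1', 'unknown'],
})

# Low-cardinality text columns can be stored as categoricals
toy['Sex'] = toy['Sex'].astype('category')
toy['Embarked'] = toy['Embarked'].astype('category')

# errors='coerce' turns anything that cannot be parsed into NaN
toy['Fare'] = pd.to_numeric(toy['Fare'], errors='coerce')

print(toy.dtypes)
```

Coercing silently hides bad values, so it is worth counting the resulting NaN entries before relying on the converted column.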

**CHECK FOR DUPLICATES**

Logic for [**`duplicated`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html):

- **`keep='first'`** (default): marks duplicates as True except for the first occurrence
- **`keep='last'`**: marks duplicates as True except for the last occurrence
- **`keep=False`**: marks all duplicates as True
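The three options can be compared on a tiny example. This is only a sketch on a toy frame (the `name` and `ticket` columns are made up for illustration), where rows 0, 2 and 3 are identical:

```python
import pandas as pd

# Rows 0, 2 and 3 are exact copies of each other
toy = pd.DataFrame({'name': ['Ann', 'Bob', 'Ann', 'Ann'],
                    'ticket': [1, 2, 1, 1]})

first = toy.duplicated(keep='first').tolist()   # first copy is not flagged
last = toy.duplicated(keep='last').tolist()     # last copy is not flagged
every = toy.duplicated(keep=False).tolist()     # every copy is flagged

print(first)  # [False, False, True, True]
print(last)   # [True, False, True, False]
print(every)  # [True, False, True, True]
```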

In [None]:
df.duplicated()

PassengerId
1      False
2      False
3      False
4      False
5      False
       ...  
902    False
903    False
904    False
905    False
906     True
Length: 906, dtype: bool

In [None]:
df.duplicated().sum()

8

Apparently there are 8 duplicate rows in the file. Let's check the name column to see how many repeated names it finds.

In [None]:
df['Name'].duplicated().sum()

15

In total there are 15 rows whose name exactly matches an earlier row, which suggests that we may have up to 15 duplicates. Since the number is small, let's list them so we can check them directly in the dataset.

In [None]:
duplicados = df[df['Name'].duplicated(keep=False)]['Name'].tolist()
duplicados

['Mallet, Master. Andre',
 'Andersson, Mr. Anders Johan',
 'Andersson, Mr. Anders Johan',
 'Andersson, Mr. Anders Johan',
 'Spencer, Mrs. William Augustus (Marie Eugenie)',
 'Spencer, Mrs. William Augustus (Marie Eugenie)',
 'Holverson, Mrs. Alexander Oskar (Mary Aline Towner)',
 'Stewart, Mr. Albert A',
 'Moubarek, Master. Gerios',
 'Nye, Mrs. (Elizabeth Ramell)',
 'Crease, Mr. Ernest James',
 'Moutal, Mr. Rahamin Haim',
 'Moutal, Mr. Rahamin Haim',
 'Stewart, Mr. Albert A',
 'Moubarek, Master. Gerios',
 'Nye, Mrs. (Elizabeth Ramell)',
 'Crease, Mr. Ernest James',
 'Collyer, Miss. Marjorie "Lottie"',
 'Pengelly, Mr. Frederick William',
 'Hunt, Mr. George Henry',
 'Collyer, Miss. Marjorie "Lottie"',
 'Pengelly, Mr. Frederick William',
 'Hunt, Mr. George Henry',
 'Holverson, Mrs. Alexander Oskar (Mary Aline Towner)',
 'Mallet, Master. Andre',
 'Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)',
 'Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)',
 'Andersson, Mr. Anders Johan']
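A repeated name does not necessarily mean a repeated row. One way to separate the two cases is sketched here, using `duplicated` with its `subset` parameter. The toy frame copies the names and birth dates of the Potter and Hunt rows that appear in this notebook's outputs; the variable names are assumptions made for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)',
             'Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)',
             'Hunt, Mr. George Henry',
             'Hunt, Mr. George Henry'],
    'birth': ['1856/2/16', '1856/2/6', '1879/10/17', '1879/10/17'],
})

# Rows whose name appears more than once
same_name = toy.duplicated(subset=['Name'], keep=False)
# Rows that are identical in every column
same_row = toy.duplicated(keep=False)

# Same name but different data elsewhere: these need a manual look
name_only = toy[same_name & ~same_row]
print(name_only)
```

Here the two Potter rows are flagged for review, while the two Hunt rows are true duplicates.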

Let's examine the duplicated rows (ignoring the first occurrence):

In [None]:
df.loc[df.duplicated(keep='first'), :]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,birth,death
16,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S,1873/7/23,
17,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S,1873/7/23,
95,0,3,"Moutal, Mr. Rahamin Haim",male,,0,0,374746,8.05,,S,1912/5/7,
219,0,1,"Stewart, Mr. Albert A",male,,0,0,PC 17605,277.208,,C,1912/8/19,
222,0,3,"Crease, Mr. Ernest James",male,19.0,0,0,S.P. 3464,81.583,,S,1893/12/16,
276,0,2,"Pengelly, Mr. Frederick William",male,19.0,0,0,28665,10.5,,S,1893/7/11,
277,0,2,"Hunt, Mr. George Henry",male,33.0,0,0,SCO/W 1585,12.275,,S,1879/10/17,
906,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S,1873/7/23,


In [None]:
df.loc[df.duplicated(keep='last'), :]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,birth,death
15,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S,1873/7/23,
16,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S,1873/7/23,
17,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S,1873/7/23,
70,0,1,"Stewart, Mr. Albert A",male,,0,0,PC 17605,277.208,,C,1912/8/19,
73,0,3,"Crease, Mr. Ernest James",male,19.0,0,0,S.P. 3464,81.583,,S,1893/12/16,
83,0,3,"Moutal, Mr. Rahamin Haim",male,,0,0,374746,8.05,,S,1912/5/7,
249,0,2,"Pengelly, Mr. Frederick William",male,19.0,0,0,28665,10.5,,S,1893/7/11,
250,0,2,"Hunt, Mr. George Henry",male,33.0,0,0,SCO/W 1585,12.275,,S,1879/10/17,


In [None]:
df.loc[df.duplicated(keep=False), :]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,birth,death
15,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S,1873/7/23,
16,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S,1873/7/23,
17,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S,1873/7/23,
70,0,1,"Stewart, Mr. Albert A",male,,0,0,PC 17605,277.208,,C,1912/8/19,
73,0,3,"Crease, Mr. Ernest James",male,19.0,0,0,S.P. 3464,81.583,,S,1893/12/16,
83,0,3,"Moutal, Mr. Rahamin Haim",male,,0,0,374746,8.05,,S,1912/5/7,
95,0,3,"Moutal, Mr. Rahamin Haim",male,,0,0,374746,8.05,,S,1912/5/7,
219,0,1,"Stewart, Mr. Albert A",male,,0,0,PC 17605,277.208,,C,1912/8/19,
222,0,3,"Crease, Mr. Ernest James",male,19.0,0,0,S.P. 3464,81.583,,S,1893/12/16,
249,0,2,"Pengelly, Mr. Frederick William",male,19.0,0,0,28665,10.5,,S,1893/7/11,


To remove the duplicates we use [**`drop_duplicates`**](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html).
The default is keep='first'. Since we have verified that these records are exactly identical, keeping the first occurrence of each is enough. Note that `drop_duplicates` returns a new DataFrame, so the next cell only shows the resulting shape and leaves `df` unchanged.

In [None]:
df.drop_duplicates(keep='first').shape

(898, 13)
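To actually keep the deduplicated data, the result has to be assigned to a variable. A minimal sketch on a toy frame (the column values and the `deduped` name are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['A', 'A', 'B'], 'Age': [39, 39, 19]})

# drop_duplicates returns a new frame; the original is not modified
deduped = toy.drop_duplicates(keep='first')

print(toy.shape)      # (3, 2)
print(deduped.shape)  # (2, 2)
```

The surviving rows keep their original index labels, so `reset_index(drop=True)` can be chained if a contiguous index is needed.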

**HANDLE NULL OR UNKNOWN VALUES**

What does "NaN" mean?
- "NaN" is not a string; it is a special value.
- It stands for "Not a Number" and indicates a **missing value**.
- **`read_csv`** detects missing values (by default) when reading the file and replaces them with this special value.
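The default detection can be seen on a small in-memory CSV. This sketch assumes a made-up three-row file with the same `;` separator; both empty fields and the literal text `NA` are read as missing:

```python
import io
import pandas as pd

# Hypothetical file contents: empty fields and the text "NA" are missing values
raw = "PassengerId;Age;Cabin\n1;22;\n2;;C85\n3;NA;\n"

toy = pd.read_csv(io.StringIO(raw), sep=';', index_col=0)

print(toy)
print(toy.isnull().sum())
```

If a file uses another marker for missing data, it can be passed through the `na_values` parameter of `read_csv`.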

In [None]:
df.isnull()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,birth,death
1,False,False,False,False,False,False,False,False,False,True,False,False,True
2,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,True,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,True,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
902,False,False,False,False,False,False,False,False,False,False,False,False,False
903,False,False,False,False,True,False,False,False,False,True,False,False,True
904,False,False,False,False,False,False,False,False,False,False,False,False,False
905,False,False,False,False,False,False,False,False,False,True,False,False,True


In [None]:
df.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         181
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       699
Embarked      2
birth         0
death       557
dtype: int64

The **notnull()** method returns the opposite of isnull().
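As a quick illustration on a toy Series (the values are arbitrary), the two masks are exact complements:

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 35.0])

mask_missing = ages.isnull()
mask_present = ages.notnull()

print(mask_missing.tolist())  # [False, True, False]
print(mask_present.tolist())  # [True, False, True]
```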

In [None]:
df.Survived.value_counts()

0    557
1    349
Name: Survived, dtype: int64

In this case we can deduce that the "death" field was only filled in for the passengers who survived, not for those who died in the accident: its 557 missing values match the 557 passengers with Survived = 0.
The missing cabin numbers do not seem very important. As for the age, since we have the birth date, we can calculate it from the year 1912, when the accident took place.

Let's use the isnull() and head() methods to look at the records with missing values.

In [None]:
df[df.Age.isnull()].head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,birth,death
6,0,3,"Moran, Mr. James",male,,0,0,330877,84.583,,Q,1912/6/11,
21,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0,,S,1912/8/14,1935/12/1
23,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,,C,1912/12/11,1937/8/3
30,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.225,,C,1912/11/21,
32,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,78.792,,Q,1912/6/19,1937/1/25


We have several options for dealing with null records:
 * Drop a row entirely if any of its values is missing: **df.dropna(how='any')**
 * Drop a row if any value is missing, considering only the columns "XXXX" and "ZZZZ": **df.dropna(subset=['XXXX', 'ZZZZ'], how='any')**
 * Drop a row only if all values are missing, considering only the columns "XXXX" and "ZZZZ": **df.dropna(subset=['XXXX', 'ZZZZ'], how='all')**
 * Fill the missing values with a specified value: **df['XXXX'] = df['XXXX'].fillna('000')**
 * Fill in the data with calculated values
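The difference between these options is easier to see on a small example. The following sketch uses a toy frame with made-up values, where row 1 is missing both Age and Cabin and row 2 is missing only Age:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Age': [22.0, np.nan, np.nan],
    'Cabin': ['C85', np.nan, 'E46'],
    'Fare': [7.25, 8.05, 53.1],
})

# Drop rows with a missing value in any column: only row 0 survives
any_missing = toy.dropna(how='any')

# Look only at Age and Cabin; drop the row if either one is missing
subset_any = toy.dropna(subset=['Age', 'Cabin'], how='any')

# Look only at Age and Cabin; drop the row only if both are missing
subset_all = toy.dropna(subset=['Age', 'Cabin'], how='all')

# Fill instead of dropping, assigning the result back to the column
filled = toy.copy()
filled['Cabin'] = filled['Cabin'].fillna('unknown')

print(any_missing.index.tolist(), subset_any.index.tolist(), subset_all.index.tolist())
```

Assigning the result of `fillna` back to the column avoids relying on `inplace=True` on a selected column, which may not update the original frame in recent pandas versions.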
 
In our case we will calculate the age from the birth date, which also fills in the records where it is missing:
 

In [None]:
df['birth'] = pd.to_datetime(df['birth'])

In [None]:
df['death'] = pd.to_datetime(df['death'])

In [None]:
df.dtypes

Survived             int64
Pclass               int64
Name                object
Sex                 object
Age                float64
SibSp                int64
Parch                int64
Ticket              object
Fare                object
Cabin               object
Embarked            object
birth       datetime64[ns]
death       datetime64[ns]
dtype: object
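When the date layout is known, it can be stated explicitly. This is a sketch only: the `'%Y/%m/%d'` format is an assumption based on the values shown earlier in this notebook, and `errors='coerce'` is one possible way to handle unparseable entries:

```python
import pandas as pd

raw = pd.Series(['1890/8/5', '1874/11/18', 'not a date', None])

# An explicit format avoids guessing; invalid entries become NaT
parsed = pd.to_datetime(raw, format='%Y/%m/%d', errors='coerce')

print(parsed)
```

Counting the NaT values before and after the conversion is a simple way to detect dates that failed to parse.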

In [None]:
df.head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,birth,death
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1890-08-05,NaT
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,712.833,C85,C,1874-11-18,1932-02-18
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1886-09-17,1938-04-23
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1877-03-10,1916-05-04
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1877-04-22,NaT
6,0,3,"Moran, Mr. James",male,,0,0,330877,84.583,,Q,1912-06-11,NaT
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,518.625,E46,S,1858-04-09,NaT
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,1910-03-25,NaT
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,111.333,,S,1885-06-02,1918-07-26
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,300.708,,C,1898-04-28,1913-07-17


In [None]:
df['YearB'] = df['birth'].dt.year

In [None]:
df.head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,birth,death,YearB
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1890-08-05,NaT,1890
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,712.833,C85,C,1874-11-18,1932-02-18,1874
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1886-09-17,1938-04-23,1886
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1877-03-10,1916-05-04,1877
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1877-04-22,NaT,1877
6,0,3,"Moran, Mr. James",male,,0,0,330877,84.583,,Q,1912-06-11,NaT,1912
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,518.625,E46,S,1858-04-09,NaT,1858
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,1910-03-25,NaT,1910
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,111.333,,S,1885-06-02,1918-07-26,1885
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,300.708,,C,1898-04-28,1913-07-17,1898


In [None]:
df['Age']= 1912-df['YearB']

In [None]:
df.head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,birth,death,YearB
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S,1890-08-05,NaT,1890
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,712.833,C85,C,1874-11-18,1932-02-18,1874
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S,1886-09-17,1938-04-23,1886
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S,1877-03-10,1916-05-04,1877
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S,1877-04-22,NaT,1877
6,0,3,"Moran, Mr. James",male,0,0,0,330877,84.583,,Q,1912-06-11,NaT,1912
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,518.625,E46,S,1858-04-09,NaT,1858
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S,1910-03-25,NaT,1910
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,111.333,,S,1885-06-02,1918-07-26,1885
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14,1,0,237736,300.708,,C,1898-04-28,1913-07-17,1898


In [None]:
df[df.Age.isnull()].head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,birth,death,YearB
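The computation above overwrites every age, including the ones that were already present. A minimal sketch of an alternative that fills only the missing ages follows. The toy frame copies the Age and birth values of passengers 1, 5 and 6 from the outputs above, and the `Age_filled` column name is an assumption. Note that the passengers with a missing age shown earlier all have a birth date in 1912, so the estimate for them is 0, which may indicate placeholder birth dates rather than real ones.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Age': [22.0, 35.0, None],
     'birth': pd.to_datetime(['1890/8/5', '1877/4/22', '1912/6/11'],
                             format='%Y/%m/%d')},
    index=pd.Index([1, 5, 6], name='PassengerId'),
)

# Age estimated from the birth year, as in the cells above
estimated = 1912 - toy['birth'].dt.year

# Keep the recorded age where it exists, use the estimate only for gaps
toy['Age_filled'] = toy['Age'].fillna(estimated)

print(toy)
```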


**FILTER THE DATASET**

Once we have explored the series, the duplicate records and the null values, and cleaned the data, we move on to filtering, grouping and finally sorting the clean dataset.
To filter a dataset we use this pattern:
* df[df.column > value]

For example, let's keep only the first-class passengers:

In [None]:
df[df.Pclass==1]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,birth,death,YearB
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,712.833,C85,C,1874-11-18,1932-02-18,1874
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S,1877-03-10,1916-05-04,1877
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,518.625,E46,S,1858-04-09,NaT,1858
13,1,1,"Bonnell, Miss. Elizabeth",female,58,0,0,113783,26.55,C103,S,1854-12-18,1926-04-07,1854
27,1,1,"Sloper, Mr. William Thompson",male,28,0,0,113788,35.5,A6,S,1884-11-08,1933-11-07,1884
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,1,"Carlsson, Mr. Frans Olof",male,33,0,0,695,5,B51 B53 B55,S,1879-02-04,NaT,1879
893,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56,0,1,11767,831.583,C50,C,1856-02-16,1916-09-24,1856
894,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56,0,1,11767,831.583,C50,C,1856-02-06,1937-02-10,1856
902,1,1,"Graham, Miss. Margaret Edith",female,19,0,0,112053,30,B42,S,1893-02-23,1940-11-09,1893


We can also filter, for example, only the adult passengers and show their names:

In [None]:
df[df.Age>=18].Name

PassengerId
1                                Braund, Mr. Owen Harris
2      Cumings, Mrs. John Bradley (Florence Briggs Th...
3                                 Heikkinen, Miss. Laina
4           Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                               Allen, Mr. William Henry
                             ...                        
901                                Montvila, Rev. Juozas
902                         Graham, Miss. Margaret Edith
904                                Behr, Mr. Karl Howell
905                                  Dooley, Mr. Patrick
906                          Andersson, Mr. Anders Johan
Name: Name, Length: 610, dtype: object

We can also filter on multiple conditions. To do so we need to keep the **logical operators** in mind:

- **`and`**: True only if **both sides** of the operator are True
- **`or`**: True if **either side** of the operator is True

Rules for specifying **multiple filter criteria** in pandas:

- use **`&`** instead of **`and`**
- use **`|`** instead of **`or`**
- add **parentheses** around each condition to specify the evaluation order
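These rules can be tried on a toy frame first. The sketch below uses made-up values and also shows `~`, the element-wise negation, which is not mentioned in the list above:

```python
import pandas as pd

toy = pd.DataFrame({'Pclass': [1, 3, 1, 2], 'Age': [38, 22, 2, 14]})

# Parentheses are required: & binds more tightly than == and >=,
# so without them Python would evaluate `1 & toy.Age` first
first_class_adults = toy[(toy.Pclass == 1) & (toy.Age >= 18)]
first_or_minor = toy[(toy.Pclass == 1) | (toy.Age < 18)]
not_first = toy[~(toy.Pclass == 1)]

print(first_class_adults.index.tolist())  # [0]
print(first_or_minor.index.tolist())      # [0, 2, 3]
print(not_first.index.tolist())           # [1, 3]
```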

For example, let's show the first-class passengers who are adults:

In [None]:
df[(df.Pclass==1) & (df.Age>=18)]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,birth,death,YearB
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,712.833,C85,C,1874-11-18,1932-02-18,1874
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S,1877-03-10,1916-05-04,1877
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,518.625,E46,S,1858-04-09,NaT,1858
13,1,1,"Bonnell, Miss. Elizabeth",female,58,0,0,113783,26.55,C103,S,1854-12-18,1926-04-07,1854
27,1,1,"Sloper, Mr. William Thompson",male,28,0,0,113788,35.5,A6,S,1884-11-08,1933-11-07,1884
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,1,"Carlsson, Mr. Frans Olof",male,33,0,0,695,5,B51 B53 B55,S,1879-02-04,NaT,1879
893,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56,0,1,11767,831.583,C50,C,1856-02-16,1916-09-24,1856
894,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56,0,1,11767,831.583,C50,C,1856-02-06,1937-02-10,1856
902,1,1,"Graham, Miss. Margaret Edith",female,19,0,0,112053,30,B42,S,1893-02-23,1940-11-09,1893


**GROUPING DATA**

Grouping is our best ally for summarizing the information in a DataFrame. The .groupby() method builds groups based on one or more variables.

For example, we could group by port of embarkation, passenger class and sex:

In [None]:
df.groupby(['Embarked','Pclass', 'Sex']).count()

Embarked,Pclass,Sex,Survived,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,birth,death,YearB
C,1,female,45,45,45,45,45,45,45,37,45,44,45
C,1,male,43,43,43,43,43,43,43,31,43,17,43
C,2,female,7,7,7,7,7,7,7,1,7,7,7
C,2,male,11,11,11,11,11,11,11,1,11,3,11
C,3,female,23,23,23,23,23,23,23,1,23,15,23
C,3,male,44,44,44,44,44,44,44,0,44,11,44
Q,1,female,1,1,1,1,1,1,1,1,1,1,1
Q,1,male,1,1,1,1,1,1,1,1,1,0,1
Q,2,female,2,2,2,2,2,2,2,1,2,2,2
Q,2,male,1,1,1,1,1,1,1,0,1,0,1


With .agg() we can apply an aggregation function such as sum or mean:

In [None]:
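# Select the numeric and datetime columns explicitly: recent pandas versions raise a TypeError when averaging text columns
df.groupby(['Embarked','Pclass', 'Sex'])[['Survived', 'Age', 'SibSp', 'Parch', 'birth', 'death', 'YearB']].agg(['mean'])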

,,,Survived,Age,SibSp,Parch,birth,death,YearB
Embarked,Pclass,Sex,mean,mean,mean,mean,mean,mean,mean
C,1,female,0.977778,31.688889,0.511111,0.311111,1880-10-17 05:52:00.000000000,1927-06-23 18:00:00.000000000,1880.311111
C,1,male,0.395349,33.581395,0.232558,0.325581,1878-11-30 06:08:22.325581312,1927-09-09 21:10:35.294117632,1878.418605
C,2,female,1.0,19.142857,0.714286,0.571429,1893-03-30 13:42:51.428571648,1925-03-14 13:42:51.428571392,1892.857143
C,2,male,0.272727,18.909091,0.454545,0.636364,1893-08-08 21:49:05.454545408,1921-12-28 08:00:00.000000000,1893.090909
C,3,female,0.652174,9.782609,0.565217,0.608696,1902-10-27 20:52:10.434782720,1926-08-22 11:12:00.000000000,1902.217391
C,3,male,0.25,14.181818,0.25,0.090909,1898-04-15 13:05:27.272727296,1927-02-22 08:43:38.181818112,1897.818182
Q,1,female,1.0,33.0,1.0,0.0,1879-10-24 00:00:00.000000000,1913-01-17 00:00:00.000000000,1879.0
Q,1,male,0.0,44.0,2.0,0.0,1868-04-19 00:00:00.000000000,NaT,1868.0
Q,2,female,1.0,15.0,0.0,0.0,1897-05-08 12:00:00.000000000,1927-11-05 00:00:00.000000000,1897.0
Q,2,male,0.0,57.0,0.0,0.0,1855-01-20 00:00:00.000000000,NaT,1855.0
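When each column needs a different function, `.agg()` also accepts named aggregations. A minimal sketch on a toy frame follows; the values and the output column names `mean_age`, `survivors` and `passengers` are assumptions for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'Pclass': [1, 1, 3, 3],
    'Age': [38, 54, 26, 22],
    'Survived': [1, 0, 1, 0],
})

# Each keyword is an output column: (source column, function)
summary = toy.groupby('Pclass').agg(
    mean_age=('Age', 'mean'),
    survivors=('Survived', 'sum'),
    passengers=('Survived', 'count'),
)

print(summary)
```

This produces flat, descriptive column names instead of the two-level header that a list of functions generates.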


For example, to show the mean age of the women travelling in each class:

In [None]:
df.groupby(['Pclass','Sex'])['Age'].mean().unstack()['female']

Pclass
1    31.268041
2    27.717949
3    15.402778
Name: female, dtype: float64

**DATA BINNING**

Grouping and segmentation can go further with pandas by creating distinct groups within a single variable. This is done with the pd.cut() function, which makes these "cuts" very easy.

For example, we can generate bins for the birth year that we computed while handling the null values.
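The original binning cell is not included in this notebook, so the following is only a sketch of how `pd.cut` could be applied to birth years. The bin edges, the labels and the `period` name are assumptions chosen for illustration:

```python
import pandas as pd

years = pd.Series([1858, 1877, 1890, 1910, 1912], name='YearB')

# Intervals are right-inclusive by default: (1850, 1870], (1870, 1890], (1890, 1912]
bins = [1850, 1870, 1890, 1912]
labels = ['1851-1870', '1871-1890', '1891-1912']

period = pd.cut(years, bins=bins, labels=labels)

print(period.astype(str).tolist())
```

The result is a categorical Series, so it can be passed directly to `groupby` or `value_counts` to summarize each bin.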

**SORT THE CLEAN DATASET, READY FOR MODELING**

In [None]:
df.sort_values('Survived', ascending=True)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,birth,death,YearB,AgeN
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S,1890-08-05,NaT,,22
529,0,1,"Walker, Mr. William Anderson",male,47,0,0,36967,340.208,D46,S,1865-04-14,NaT,,47
531,0,3,"Ryan, Mr. Patrick",male,0,0,0,371110,24.15,,Q,1912-09-24,NaT,,0
533,0,3,"Pavlovic, Mr. Stefo",male,32,0,0,349242,78.958,,S,1880-08-06,NaT,,32
535,0,3,"Vovk, Mr. Janko",male,22,0,0,349252,78.958,,S,1890-09-24,NaT,,22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
554,1,1,"Crosby, Miss. Harriet R",female,36,0,2,WE/P 5735,71,B22,S,1876-01-01,1930-08-04,,36
553,1,1,"Frolicher, Miss. Hedwig Margaritha",female,22,0,2,13568,49.5,B39,C,1890-04-04,1937-09-27,,22
220,1,3,"Moubarek, Master. Gerios",male,0,1,0,2661,152.458,,C,1912-02-04,1933-08-21,,0
214,1,3,"Albimona, Mr. Nassef Cassem",male,26,0,0,2699,187.875,,C,1886-07-19,1928-09-19,,26
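Sorting can also use several columns, each with its own direction. A small sketch on a toy frame with made-up values, sorting by `Survived` ascending and then by `Age` descending:

```python
import pandas as pd

toy = pd.DataFrame({'Survived': [1, 0, 1, 0], 'Age': [26, 54, 38, 22]},
                   index=[3, 7, 2, 1])

# sort_values returns a new frame; one ascending flag per sort key
ordered = toy.sort_values(['Survived', 'Age'], ascending=[True, False])

print(ordered.index.tolist())  # [7, 1, 2, 3]
```

As with `drop_duplicates`, the result must be assigned to a variable for the new order to be kept.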
