## The Pandas Library

Pandas is one of the most widely used Python libraries for data manipulation. It is built around two columnar data structures: the one-dimensional Series and the two-dimensional DataFrame.

In [1]:
import pandas as pd

## Series

A Series is a column of data with an associated index.

In [2]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [3]:
animales = ['Tortuga', 'Zorro', 'Paloma', 'Elefante', "Zorro"]
tipo = ['reptil', 'mamífero', 'ave', 'mamífero', 'mamífero' ]
obj = pd.Series(tipo, index=animales)
obj

Tortuga       reptil
Zorro       mamífero
Paloma           ave
Elefante    mamífero
Zorro       mamífero
dtype: object

In [4]:
animales = ['Tortuga', 'Zorro', 'Paloma', 'Elefante', "Zorro"]
tipo = ['reptil', 'mamífero', 'ave', 'mamífero', 'mamífero' ]
obj = pd.Series(tipo)
obj.index = animales
obj

Tortuga       reptil
Zorro       mamífero
Paloma           ave
Elefante    mamífero
Zorro       mamífero
dtype: object

In [5]:
obj["Tortuga"]

'reptil'

In [6]:
# Positional access; plain obj[0] on a label-indexed Series is deprecated in recent pandas
obj.iloc[0]

'reptil'

In [7]:
# Position 0 after sorting by value, so a different element than before
obj.sort_values().iloc[0]

'ave'

In [8]:
obj.sort_values()["Tortuga"]

'reptil'
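
The cells above look up values both by label and by position. Sorting changes which element sits at position 0, but it never changes which value a label points to. A minimal sketch of that contrast with the explicit `.loc` and `.iloc` accessors follows; the shortened Series `s` is an illustrative assumption, not part of the original notebook.

```python
import pandas as pd

# Illustrative Series (hypothetical, shortened version of the one above)
s = pd.Series(["reptil", "mamífero", "ave"], index=["Tortuga", "Zorro", "Paloma"])

sorted_s = s.sort_values()  # sorts by value; each label travels with its value

first_by_position = sorted_s.iloc[0]   # first element after sorting
by_label = sorted_s.loc["Tortuga"]     # label lookup ignores the row order

print(first_by_position, by_label)
```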

## DataFrames

DataFrames are two-dimensional structures; they can be thought of as the horizontal concatenation of Series that share an index.

In [9]:
d = {'tipo_vivienda': ['casa', 'departamento'],
     'm2': [125, 59],
     'Barrio': ['San Martin', 'Florida'],
     'Precio (kUSD)': [200, 130]
    }
df = pd.DataFrame(data=d, index=["casa1", "casa2"])
df

|  | tipo_vivienda | m2 | Barrio | Precio (kUSD) |
|---|---|---|---|---|
| casa1 | casa | 125 | San Martin | 200 |
| casa2 | departamento | 59 | Florida | 130 |
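
The "horizontal concatenation of Series" idea can be made concrete with `pd.concat(..., axis=1)`. This is only a sketch of the analogy, not how the DataFrame above was built; the variable names `m2`, `barrio` and `df_concat` are assumptions introduced here.

```python
import pandas as pd

# Two Series sharing the same index (values taken from the example above)
m2 = pd.Series([125, 59], index=["casa1", "casa2"], name="m2")
barrio = pd.Series(["San Martin", "Florida"], index=["casa1", "casa2"], name="Barrio")

# axis=1 places the Series side by side; rows are aligned on the index labels
df_concat = pd.concat([m2, barrio], axis=1)
print(df_concat)
```

Each Series name becomes a column name, and the shared index becomes the row index.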


## Reading data files

Pandas supports reading a wide range of formats ([more info](http://pandas.pydata.org/pandas-docs/stable/io.html)):

- read_csv
- read_excel
- read_hdf
- read_sql
- read_json
- read_msgpack (experimental; removed in pandas 1.0)
- read_html
- read_gbq (experimental)
- read_stata
- read_sas
- read_clipboard
- read_pickle
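
As a small illustration of the most common reader, `read_csv` accepts a path, a URL, or any file-like object. The sketch below reads from an in-memory string so that it runs without files or network access; the CSV content and the name `df_small` are made up for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content, held in memory so no file or network access is needed
csv_text = """id,name,score
1,Ana,8.5
2,Luis,7.0
"""

# index_col picks the column that becomes the row index
df_small = pd.read_csv(io.StringIO(csv_text), index_col="id")
print(df_small)
```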

We will start by exploring a dataset published for a Kaggle competition: [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/data).

In [17]:
url = "https://raw.githubusercontent.com/andresdambrosio/DMA_LABO_Austral_2021_rosario/main/Data/titanic.csv"
data = pd.read_csv(url, index_col="PassengerId")

In [18]:
data.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [19]:
data.tail()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


In [20]:
data.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [21]:
data.dtypes

Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object

In [22]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [23]:
data.describe()

|  | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [24]:
data.shape

(891, 11)

In [25]:
data.size

9801
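
`shape` and `size` are related: `size` is the total number of cells, that is, rows times columns (891 × 11 = 9801 above). A quick check of that relationship on a hypothetical frame named `toy`:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape       # (rows, columns)
total_cells = toy.size           # rows * columns

print(n_rows, n_cols, total_cells)
```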

In [26]:
data.Pclass

PassengerId
1      3
2      1
3      3
4      1
5      3
      ..
887    2
888    1
889    3
890    1
891    3
Name: Pclass, Length: 891, dtype: int64

In [27]:
data.Pclass.unique()

array([3, 1, 2])

In [28]:
data.Pclass.value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [29]:
data.Pclass.nunique()

3

## Types of indexing

There are several ways to select a subset of the data:

- Like lists or arrays, by position.
- Like dictionaries, by key or label.
- Like arrays, by boolean (True/False) masks.
- You can index with a single number, a range (slice), or a list (array).
- All of these methods can be applied to rows as well as to columns.


## Basic rules

1. Square brackets (shorthand for the `__getitem__` method) select columns of a `DataFrame`

    ```python
    >>> df[['a', 'b', 'c']]
    ```

2. `.iloc` indexes by position (both rows and columns)

    ```python
    >>> df.iloc[[1, 3], [0, 2]]
    ```
    
3. `.loc` indexes by label (both rows and columns)

    ```python
    >>> df.loc[["elemento1", "elemento2", "elemento3"], ["columna1", "columna2"]]
    ```
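
The three rules can be tried side by side on a small frame. The DataFrame `df_demo` and its labels are hypothetical, chosen only so that the positional and label-based selections pick the same cells.

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
df_demo = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "z"],
)

cols = df_demo[["a", "c"]]                       # rule 1: columns via brackets
by_pos = df_demo.iloc[[0, 2], [0, 2]]            # rule 2: rows 0 and 2, columns 0 and 2
by_label = df_demo.loc[["x", "z"], ["a", "c"]]   # rule 3: the same cells, by label

print(by_pos.equals(by_label))
```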

In [None]:
data.__getitem__("Name") == data["Name"]

In [None]:
data["Name"]

In [None]:
data.loc[[1], ["Name", "Sex"]]

In [None]:
data.loc[[1, 2, 3]]

In [None]:
data.loc[[1, 2, 3], "Name"]

In [None]:
data.loc[1, "Survived"]

In [None]:
temp = data.loc[:, ["Name", "Sex"]]
temp.loc[1, "Name"] = "Rafa"
temp

In [None]:
data.loc[1, "Name"], temp.loc[1, "Name"]
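
The two cells above show that editing `temp` leaves `data` untouched: selecting with a list of labels returns a new object, not a window onto the original. A minimal sketch of the same behaviour on a hypothetical frame (`original`, `subset` and their values are made up for illustration):

```python
import pandas as pd

original = pd.DataFrame(
    {"Name": ["Ana", "Luis"], "Sex": ["female", "male"], "Age": [30, 41]},
    index=[1, 2],
)

# Selecting with a list of columns builds a new object...
subset = original.loc[:, ["Name", "Sex"]]
subset.loc[1, "Name"] = "Rafa"

# ...so the assignment does not propagate back to `original`
print(original.loc[1, "Name"], subset.loc[1, "Name"])
```

When the independence of the two objects matters, calling `.copy()` explicitly, as the next cell does, makes the intent unambiguous.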

In [None]:
temp = data.copy()
temp.index = ["pasajero_nro_" + str(i) for i in temp.index]
temp.index.name = data.index.name
temp


In [None]:
data.loc[1]

In [None]:
# Raises KeyError: the index labels are integers, so the string "1" is not found
data.loc["1"]
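
Label lookups are type-sensitive: the integer `1` and the string `"1"` are different labels. The sketch below reproduces the failure on a hypothetical frame and shows one way around it, converting the label to the type used by the index.

```python
import pandas as pd

frame = pd.DataFrame({"Name": ["Ana", "Luis"]}, index=[1, 2])  # integer labels

try:
    frame.loc["1"]
    found = True
except KeyError:
    found = False  # "1" (a string) is not among the integer labels

# Converting the label to the index's type makes the lookup succeed
row = frame.loc[int("1")]
print(found, row["Name"])
```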

In [None]:
temp.loc[["pasajero_nro_1", "pasajero_nro_2", "pasajero_nro_3"], ["Name", "Sex"]]

In [None]:
temp.iloc[[1, 2, 3], [2, 3]]

In [None]:
del temp

In [None]:
data.loc[:3, :"Sex"]

In [None]:
data.sort_values("Name").loc[:3]
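
Two details of `.loc` slicing are easy to miss: the end label is included, and the slice follows the current row order rather than the numeric order of the labels. A small sketch on a hypothetical frame named `people`:

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Carla", "Ana", "Bruno"]}, index=[1, 2, 3])

# Label slices include the end label: this returns labels 1 and 2
up_to_2 = people.loc[:2]

# After sorting by name the row order is 2, 3, 1, so the same slice
# runs from the first row up to wherever label 2 sits
sorted_up_to_2 = people.sort_values("Name").loc[:2]

print(list(up_to_2.index), list(sorted_up_to_2.index))
```

The same logic explains the cell above: after sorting by `Name`, `.loc[:3]` returns every row from the alphabetically first passenger down to the row labelled 3.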

## Computing new columns

In [None]:
temp = data[["Name"]].copy()
# Pitfall: attribute assignment sets a plain Python attribute; it does NOT create a column
temp.OtroNombre = ["OTRO_" + n for n in data.Name]
temp

In [None]:
temp.OtroNombre[:10]

In [None]:
# Bracket assignment is what actually creates the new column
temp["OtroNombre"] = ["OTRO_" + n for n in data.Name]
temp
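
The difference between the two assignments can be checked directly by looking at the column index. In this sketch the frame `names` and the column `Upper` are hypothetical; the warning that pandas emits for the attribute form is silenced only to keep the output clean.

```python
import warnings
import pandas as pd

names = pd.DataFrame({"Name": ["Ana", "Luis"]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about this pattern
    names.Upper = [n.upper() for n in names.Name]  # plain Python attribute

attr_is_column = "Upper" in names.columns  # no column was created

names["Upper"] = [n.upper() for n in names["Name"]]  # real column
item_is_column = "Upper" in names.columns

print(attr_is_column, item_is_column)
```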

In [None]:
del temp

## Filtering

In [None]:
data["SibSp"] > 0

In [None]:
data[data["Age"] > 18]
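
A comparison such as `data["Age"] > 18` produces a boolean Series, and indexing with it keeps the rows where it is `True`. Masks can be combined with `&` (and), `|` (or) and `~` (not). The frame `passengers` below is a hypothetical stand-in for the Titanic data.

```python
import pandas as pd

passengers = pd.DataFrame(
    {"Age": [22.0, 38.0, 4.0, None], "Sex": ["male", "female", "female", "male"]}
)

is_adult = passengers["Age"] > 18          # a missing age compares as False
is_female = passengers["Sex"] == "female"

# When comparisons are written inline, each one needs its own parentheses:
# passengers[(passengers["Age"] > 18) & (passengers["Sex"] == "female")]
adult_women = passengers[is_adult & is_female]
not_adult = passengers[~is_adult]

print(list(adult_women.index), list(not_adult.index))
```

Note that `~is_adult` also keeps the row with a missing age, because the original comparison was `False` for it.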

In [None]:
data.select_dtypes("float")

### Common functions

Pandas ships with many built-in functions, for example:

* sum
* mean
* std
* var
* cumsum
* value_counts
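
Before applying these to the Titanic data, it may help to see how they treat missing values and non-numeric columns. The frame `toy` is a made-up example; `numeric_only=True` is used so that the text column is left out of the aggregation.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, None], "label": ["a", "b", "a"]})

mean_x = toy["x"].mean()                  # missing values are skipped by default
col_means = toy.mean(numeric_only=True)   # restricts the result to numeric columns
running = toy["x"].cumsum()               # the missing entry stays NaN in the running total
shares = toy["label"].value_counts(normalize=True)  # relative frequencies

print(mean_x, running.tolist(), shares.to_dict())
```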

In [None]:
data.Age.mean()

In [None]:
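# numeric_only=True skips text columns, which recent pandas versions refuse to average
data.mean(numeric_only=True)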

In [None]:
# Restrict the sum to numeric columns as well
data.sum(numeric_only=True)

In [None]:
data.select_dtypes("float").sum(axis=1)

In [None]:
data.Age.cumsum()

In [None]:
data.isnull().sum()

In [None]:
data.Survived.value_counts()

In [None]:
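# normalize=True returns proportions instead of counts
data.Survived.value_counts(normalize=True)

In [None]:
# Equivalent but less readable: 1 is passed positionally as `normalize` and treated as True
data.Survived.value_counts(1)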

### Exercise

1. Show the first 16 rows of `data`.


2. What is the name of passenger 881?


3. Compute a column `numFam` holding the total number of family members on board.


4. Find the mean age of the survivors.