<img src="../static/logo.png" alt="datio" style="width: 200px" align="right"/>

# Pandas

First, import the Pandas library:

In [56]:
import pandas as pd

<div class="alert alert-info">By convention (in the developer community), `pd` is used as the alias for the Pandas library.</div>

# Data Structures

## Series

The first example defines a "Series" data structure: a one-dimensional, indexed array of data. A "Series" is defined as follows:  

**serie = pd.Series(data, index)**

In [57]:
pd.Series?

In [58]:
a = pd.Series([30,40,50,60])
a

0    30
1    40
2    50
3    60
dtype: int64

That is, the first parameter gives the array data and the second parameter gives the index (if the index is not given explicitly, it is generated automatically, starting from zero). Let's create a "Series" with the Real Madrid squad, using the players' names as 'data' and their shirt numbers as the index:

In [59]:
# From two lists (data and index)

names = ['Navas', 'Carvajal', 'Pepe', 'Ramos', 'Varane', 'Nacho', 'Marcelo', 'Coentrao', 'Danilo', 'Kroos', 'James',
     'Casemiro','Kovacic', 'Modric', 'Asensio', 'Isco','Ronaldo', 'Benzema', 'Bale', 'Lucas Vazquez', 'Mariano', 'Morata']
dorsales = [1, 2, 3, 4, 5, 6, 12, 15, 23, 8, 10, 14, 16, 19, 20, 22, 7, 9, 11, 17, 18, 21]

#>
realMadridPlayers = pd.Series(index = dorsales, data = names)

print("Plantilla del Real Madrid 2017: \n%s" %realMadridPlayers)

Plantilla del Real Madrid 2017: 
1             Navas
2          Carvajal
3              Pepe
4             Ramos
5            Varane
6             Nacho
12          Marcelo
15         Coentrao
23           Danilo
8             Kroos
10            James
14         Casemiro
16          Kovacic
19           Modric
20          Asensio
22             Isco
7           Ronaldo
9           Benzema
11             Bale
17    Lucas Vazquez
18          Mariano
21           Morata
dtype: object


A "Series" can also be created from a list or from a dictionary. Built from a list, it gets the default index; built from a dictionary, it uses the keys as the index. The next example creates a Series from a dictionary, and the one after it inserts a new element into that Series:

In [60]:
# From dict

columnsDict = {1: 'IMP_FACTUZM', 2: 'IMP_FACTUZT', 3: 'IMP_FACTUZA', 4: 'IMP_OPERAM', 5:'IMP_OPERAT', 6: 'IMP_OPERAA'}

#>
prefixSerie = pd.Series(columnsDict)
prefixSerie

1    IMP_FACTUZM
2    IMP_FACTUZT
3    IMP_FACTUZA
4     IMP_OPERAM
5     IMP_OPERAT
6     IMP_OPERAA
dtype: object

Series behave like dictionaries. Let's see how to access an element and how to add one:

In [61]:
# Insert a new element: set index 7 to the value 'XTI_PRELSTD'

#>
prefixSerie[7] = 'XTI_PRELSTD'
prefixSerie

1    IMP_FACTUZM
2    IMP_FACTUZT
3    IMP_FACTUZA
4     IMP_OPERAM
5     IMP_OPERAT
6     IMP_OPERAA
7    XTI_PRELSTD
dtype: object

In [62]:
# Access by index value: get the element with index 1

#>
prefixSerie[1]

'IMP_FACTUZM'

Note: The index can be of string type. For example:

In [63]:
codigoISO2 = pd.Series(['Afganistán','Åland, Islas','Albania','Alemania','Andorra','Angola','Anguila',
                        'Antártida','Antigua y Barbuda','Arabia Saudita','Argelia'],
                       index=['AF','AX','AL','DE','AD','AO','AI','AQ','AG','SA','DZ'])

In [64]:
# Access the element with index 'DE'

#>
codigoISO2['DE']

'Alemania'

Operations on Series:
Series align their values based on their labels.
An operation between Series that are not aligned produces the union of their indexes. If a label is missing from either Series, the result for that label is marked as NaN.


In [65]:
# Sum of two Series

a = pd.Series([1,2,3,4],[10,20,30,40])
b = pd.Series([4,6,5,2],[30,20,10,40])

#>
a + b

10    6
20    8
30    7
40    6
dtype: int64
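In the sum above every label exists in both Series, so no NaN appears. As a minimal sketch with made-up values (the Series contents and the use of `fill_value=0` are illustrative assumptions, not part of the original notebook), this is what happens when the labels only partly overlap:

```python
import pandas as pd

# Two Series whose indexes only partly overlap (illustrative values)
a = pd.Series([1, 2, 3], index=[10, 20, 30])
b = pd.Series([4, 5, 6], index=[20, 30, 40])

# The result index is the union of both indexes; labels present in
# only one Series (10 and 40) yield NaN
total = a + b

# Series.add with fill_value treats a missing label as 0 instead of NaN
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```

Whether to leave the NaN values or fill them depends on whether a missing label really means "zero" in your data.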

## Dataframe

A DataFrame is a data structure similar to a table in a relational database or an Excel sheet, and it supports many of the operations you would run as queries against database tables or in Excel.  
A DataFrame can be built in several ways, for example from a list, a dictionary, a Series, another DataFrame, or by reading an Excel or CSV file.  
Note: indexes can also be added to a DataFrame. An index is like a primary key on a SQL table, except that an index may contain duplicate values.
DataFrames are defined as follows:  

**df = pd.DataFrame(data, index, columns)**

In [66]:
pd.DataFrame?

In [67]:
# From dictionary:

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

#>
frame = pd.DataFrame(data)
frame

Unnamed: 0,pop,state,year
0,1.5,Ohio,2000
1,1.7,Ohio,2001
2,3.6,Ohio,2002
3,2.4,Nevada,2001
4,2.9,Nevada,2002


In [68]:
# From matrix and columns :

columnsDF = ['cod_ofictrn','imp_trans','cod_user','cod_npuesto','cod_persona']
list_listas = [[1557,5000,'U044181','02',5133579],[2585,13000,'E018157','03',23783944],[2626,19000,'UA16687','01',12024444]]

#>
DF_1 = pd.DataFrame(list_listas,
                  columns = columnsDF)
DF_1

Unnamed: 0,cod_ofictrn,imp_trans,cod_user,cod_npuesto,cod_persona
0,1557,5000,U044181,02,5133579
1,2585,13000,E018157,03,23783944
2,2626,19000,UA16687,01,12024444


In [69]:
# From matrix, columns and index:

indexDF = ['A','A','C']

#>
DF_2 = pd.DataFrame(list_listas,
                    index = indexDF,
                  columns = columnsDF)
DF_2

Unnamed: 0,cod_ofictrn,imp_trans,cod_user,cod_npuesto,cod_persona
A,1557,5000,U044181,02,5133579
A,2585,13000,E018157,03,23783944
C,2626,19000,UA16687,01,12024444


In [70]:
# Get a specific column; the result is a Series:


DF_2["cod_ofictrn"]
DF_2.cod_ofictrn # Attribute access can clash with DataFrame method or attribute names

A    1557
A    2585
C    2626
Name: cod_ofictrn, dtype: int64
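To illustrate why attribute-style access can be confusing, here is a small sketch with a hypothetical column named `count` (the DataFrame and its column names are assumptions chosen for the demonstration):

```python
import pandas as pd

# Hypothetical DataFrame with a column whose name matches a DataFrame method
df_demo = pd.DataFrame({"count": [1, 2, 3], "cod_user": ["a", "b", "c"]})

# Bracket access always returns the column as a Series
col = df_demo["count"]

# Attribute access resolves to the DataFrame.count method, not the column
attr = df_demo.count

print(type(col))
print(callable(attr))
```

For that reason, bracket access is the safer default whenever column names are not under your control.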

In [71]:
# Get a list of columns:

#>
DF_2[["cod_user","cod_ofictrn"]]

Unnamed: 0,cod_user,cod_ofictrn
A,U044181,1557
A,E018157,2585
C,UA16687,2626


In [72]:
# Create a new column from two: Concatenate column "cod_npuesto" and "cod_user"

#>
DF_2["new_column"] = DF_2["cod_npuesto"] + "-" + DF_2["cod_user"]
DF_2

Unnamed: 0,cod_ofictrn,imp_trans,cod_user,cod_npuesto,cod_persona,new_column
A,1557,5000,U044181,02,5133579,02-U044181
A,2585,13000,E018157,03,23783944,03-E018157
C,2626,19000,UA16687,01,12024444,01-UA16687


In [40]:
# Remove a column:

#>
DF_2.drop("cod_user", axis = 1, inplace = True)
DF_2

Unnamed: 0,cod_ofictrn,imp_trans,cod_npuesto,cod_persona,new_column
A,1557,5000,02,5133579,02-U044181
A,2585,13000,03,23783944,03-E018157
C,2626,19000,01,12024444,01-UA16687


Use the **loc** indexer to access rows by index label (older pandas versions also offered **ix**, which was deprecated in 0.20 and removed in 1.0):

In [221]:
DF_2.loc?

In [45]:
# Access the rows whose index label is 'A':

#>
DF_2.loc['A']

# The older DF_2.ix['A'] form gave the same result, but .ix was removed in pandas 1.0

Unnamed: 0,cod_ofictrn,imp_trans,cod_npuesto,cod_persona,new_column
A,1557,5000,02,5133579,02-U044181
A,2585,13000,03,23783944,03-E018157


To access the element at position i, use **iloc**. (The older **ix** indexer also worked by position, but only when the index was not integer-typed; it has since been removed from pandas.)

In [46]:
# Access the second element (position 1):

#>
DF_2.iloc[1]

# The older DF_2.ix[1] form gave the same result here, but .ix was removed in pandas 1.0


cod_ofictrn          2585
imp_trans           13000
cod_npuesto            03
cod_persona      23783944
new_column     03-E018157
Name: A, dtype: object

Note: For the differences between loc, iloc and ix, see "Indexing and Selecting data": http://pandas.pydata.org/pandas-docs/stable/indexing.html
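As a minimal sketch of the label/position distinction, assuming a small made-up DataFrame whose integer index does not start at zero:

```python
import pandas as pd

# Illustrative DataFrame: integer labels that differ from the row positions
df_demo = pd.DataFrame({"value": ["a", "b", "c"]}, index=[10, 20, 30])

# loc selects by label: the row labelled 20
by_label = df_demo.loc[20, "value"]

# iloc selects by position: the first row, first column
by_position = df_demo.iloc[0, 0]

# Label slices include the end label; position slices exclude the end position
label_slice = df_demo.loc[10:20]
position_slice = df_demo.iloc[0:1]

print(by_label, by_position, len(label_slice), len(position_slice))
```

The inclusive end of label slices is the detail most likely to surprise readers used to Python list slicing.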

A DataFrame can also be created from "merged" lists, so that each list becomes a column of the DataFrame. To merge the lists we use the **zip** function:

In [73]:
cod_ofictrnList = [1557,2585,2626]
imp_transList = [5000,13000,19000]
cod_userList = ['U044181','E018157','UA16687']

In [74]:
zip?

In [75]:
# Build the dataset by zipping cod_ofictrnList, imp_transList and cod_userList.

#>
dataset = list(zip(cod_ofictrnList,imp_transList,cod_userList))
dataset

[(1557, 5000, 'U044181'), (2585, 13000, 'E018157'), (2626, 19000, 'UA16687')]

In [76]:
# Create a DataFrame from the dataset above with columns 'cod_ofictrn', 'imp_trans', 'cod_user':

#>
DF_3 = pd.DataFrame(data = dataset, columns =  ['cod_ofictrn','imp_trans','cod_user'])
DF_3

Unnamed: 0,cod_ofictrn,imp_trans,cod_user
0,1557,5000,U044181
1,2585,13000,E018157
2,2626,19000,UA16687


To export the DataFrame to a CSV file, use **to_csv**

In [77]:
DF_1.to_csv?

In [78]:
# Export a DataFrame to the "data" folder

#>
filename = '../data/my_csv.csv'
DF_1.to_csv(filename,index=False,header=True)

Now delete the file we just created, using **remove**:

In [79]:
#!/usr/bin/python
import os

## delete only if file exists ##
if os.path.exists(filename):
    #>
    os.remove(filename)
else:
    print("Sorry, I can not remove %s file." % filename)
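The export and clean-up steps above can also be sketched as a round trip through a temporary directory, which avoids leaving files behind. The DataFrame contents and the use of `tempfile` are assumptions for illustration only:

```python
import os
import tempfile

import pandas as pd

# Small illustrative DataFrame (values borrowed from the examples above)
df_out = pd.DataFrame({"cod_ofictrn": [1557, 2585],
                       "cod_user": ["U044181", "E018157"]})

# The temporary directory and everything in it is removed on exit
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "my_csv.csv")
    # index=False keeps the row index out of the file
    df_out.to_csv(csv_path, index=False, header=True)
    df_back = pd.read_csv(csv_path)

print(df_back)
```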

## Data Input

The pandas I/O API is a set of top-level reader functions, such as pd.read_csv(), that generally return a pandas object.
>Data files:   read_csv   ,  read_table  
Structured data:  read_hdf  ,  read_json  
Excel:  read_excel


### read_excel

In [80]:
pd.read_excel?

In [82]:
# Read excel file: tablas_sinfo.xlsx located in data.

#>
pd.read_excel('../data/tablas_sinfo.xlsx', sheet_name='SINFO').head()

Unnamed: 0,PETICIONARIO,VISTA,DESCRIPCIÓN,ÁMBITO
0,SINFO,VSFMABJC,PLASTICO TARJETA CREDITO_M,SINFO
1,SINFO,VSFMABJD,PLASTICO TARJETA DEBITO_M,SINFO
2,SINFO,VSFMAHAR,CONT DE MVTOS CUENTAS PERSO_M,SINFO
3,SINFO,VSFMAHOM,FACTURACION EN COMERCIOS_M,SINFO
4,SINFO,VSFMAHSS,CONTRATO DE SEGUROS SOCIALE_M,SINFO


### read_csv

dataframe = pd.**read_csv**("filepath", sep=',', delimiter=None, header='infer'...)  
By default, the first row is taken as the header containing the column names of the dataset.
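As a minimal sketch of how `sep` and `header` change the result, the following reads a small in-memory CSV. The semicolon-separated text and the use of `io.StringIO` in place of a file path are assumptions for illustration:

```python
import io

import pandas as pd

# Illustrative semicolon-separated data held in memory instead of a file
raw = "COD_USUARIO;fondos_201507\nE000170;0.0\nE000839;811363.27\n"

# Default header='infer': the first row supplies the column names
with_header = pd.read_csv(io.StringIO(raw), sep=";")

# header=None: the first row is treated as data and columns are numbered
no_header = pd.read_csv(io.StringIO(raw), sep=";", header=None)

print(with_header.shape, no_header.shape)
```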

In [83]:
pd.read_csv?

In [84]:
# Read csv file: Base_evolucion_fondos.csv located in data.

#>
base_evolucion_path = "../data/Base_evolucion_fondos.csv"
df = pd.read_csv(base_evolucion_path, header= None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
1,E000170,,0,0,0,0,0,0
2,E000170,34,0,0,0,0,0,0
3,E000170,35,0,0,0,0,0,0
4,E000170,36,0,0,0,0,0,0


In [85]:
# Read csv file: Base_evolucion_fondos.csv located in data with header

#>
df = pd.read_csv(base_evolucion_path)
df.head()

Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
0,E000170,,0.0,0.0,0.0,0.0,0.0,0.0
1,E000170,34.0,0.0,0.0,0.0,0.0,0.0,0.0
2,E000170,35.0,0.0,0.0,0.0,0.0,0.0,0.0
3,E000170,36.0,0.0,0.0,0.0,0.0,0.0,0.0
4,E000170,37.0,11684.42,11673.35,11675.79,11687.85,11696.99,11687.41


In [86]:
# Type of the df object:

#>
type(df)

pandas.core.frame.DataFrame

In [87]:
# List column names:

#>
list(df)

#or 

#>
df.columns.values 

array(['COD_USUARIO', 'COD_SEGLOBAL', 'fondos_201507', 'fondos_201508',
       'fondos_201509', 'fondos_201510', 'fondos_201511', 'fondos_201512'], dtype=object)

In [88]:
# Show first few rows

#>
df.head()

Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
0,E000170,,0.0,0.0,0.0,0.0,0.0,0.0
1,E000170,34.0,0.0,0.0,0.0,0.0,0.0,0.0
2,E000170,35.0,0.0,0.0,0.0,0.0,0.0,0.0
3,E000170,36.0,0.0,0.0,0.0,0.0,0.0,0.0
4,E000170,37.0,11684.42,11673.35,11675.79,11687.85,11696.99,11687.41


In [89]:
# Show last few rows

#>
df.tail()

Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
39720,UZ44403,51.0,45291.79,42150.19,40250.91,43582.43,44290.3,42851.89
39721,UZ44403,52.0,9724.66,8899.26,8446.44,9407.03,9664.82,9243.44
39722,UZ44403,55.0,83105.26,79572.65,77208.25,119590.83,120249.18,149363.74
39723,UZ44403,56.0,64385.96,61047.44,58773.43,61692.29,61958.25,60707.55
39724,UZ44403,61.0,6733.52,6749.17,6749.56,6748.73,6747.17,6738.1


In [90]:
# Data type of each column

#>
df.dtypes

COD_USUARIO       object
COD_SEGLOBAL     float64
fondos_201507    float64
fondos_201508    float64
fondos_201509    float64
fondos_201510    float64
fondos_201511    float64
fondos_201512    float64
dtype: object

In [92]:
# Print index

#>
df.index

RangeIndex(start=0, stop=39725, step=1)

In [93]:
# Return the number of rows and columns of the dataframe

#>
df.shape

(39725, 8)

In [96]:
#  Number of rows

#>
len(df)

39725

In [97]:
# Number of columns

#>
len(df.columns)

8

In [98]:
# Basic statistics

#>
df.describe()

Unnamed: 0,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
count,36090.0,39725.0,39725.0,39725.0,39725.0,39725.0,39725.0
mean,47.497839,394371.5,384194.8,375273.3,386926.3,388452.2,381476.5
std,9.083982,1590689.0,1547671.0,1507742.0,1552754.0,1550784.0,1518714.0
min,31.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,37.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,51.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,55.0,62182.13,62107.63,61357.91,65858.6,68250.67,69010.23
max,75.0,50612990.0,50082030.0,49617650.0,49809970.0,48177540.0,45964950.0


In [99]:
#Getting information and basic calculations

#>
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39725 entries, 0 to 39724
Data columns (total 8 columns):
COD_USUARIO      39725 non-null object
COD_SEGLOBAL     36090 non-null float64
fondos_201507    39725 non-null float64
fondos_201508    39725 non-null float64
fondos_201509    39725 non-null float64
fondos_201510    39725 non-null float64
fondos_201511    39725 non-null float64
fondos_201512    39725 non-null float64
dtypes: float64(7), object(1)
memory usage: 2.4+ MB


In [100]:
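# Slice the rows via []: select the rows at positions 0 to 2

#>
df[:3]

   # or

#>
df[0:3]
    
    # or
    
#>    
df.iloc[:3]

    # label-based alternative (replaces the removed .ix indexer); the end label is included,
    # so this returns rows 0 to 3, as in the output below
    
#>    
df.loc[:3]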


Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
0,E000170,,0.0,0.0,0.0,0.0,0.0,0.0
1,E000170,34.0,0.0,0.0,0.0,0.0,0.0,0.0
2,E000170,35.0,0.0,0.0,0.0,0.0,0.0,0.0
3,E000170,36.0,0.0,0.0,0.0,0.0,0.0,0.0


In [102]:
# Label slicing, slice over rows and columns; the same as above and only COD_SEGLOBAL

#>
df.loc[0:3,['COD_SEGLOBAL']]


Unnamed: 0,COD_SEGLOBAL
0,
1,34.0
2,35.0
3,36.0


In [249]:
# Slice rows and columns: select rows [3:6] and columns [0, 3]

#>
df.iloc[3:6,[0,3]]

Unnamed: 0,COD_USUARIO,fondos_201508
3,E000170,0.0
4,E000170,11673.35
5,E000170,65051.89


In [250]:
# Get a value explicitly by position, get row:1, column 1

#>
df.iloc[1,1]

# or

#>
df.iat[1,1]

34.0

In [251]:
# Get a value explicitly by label, get labels 1, 'COD_SEGLOBAL'

#>
df.at[1,'COD_SEGLOBAL']

34.0

In [106]:
# Select one column, which yields a Series. Select COD_SEGLOBAL:

#>
df["COD_SEGLOBAL"].head()

# or

#>
df.COD_SEGLOBAL.head()

0     NaN
1    34.0
2    35.0
3    36.0
4    37.0
Name: COD_SEGLOBAL, dtype: float64

In [253]:
# Select two columns. Select: COD_USUARIO and COD_SEGLOBAL

#>
df[["COD_USUARIO","COD_SEGLOBAL"]].head()

Unnamed: 0,COD_USUARIO,COD_SEGLOBAL
0,E000170,
1,E000170,34.0
2,E000170,35.0
3,E000170,36.0
4,E000170,37.0


In [254]:
#Conditional selection

#>
boold_df = df > 34
df[boold_df]

Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
0,E000170,,,,,,,
1,E000170,,,,,,,
2,E000170,35.0,,,,,,
3,E000170,36.0,,,,,,
4,E000170,37.0,11684.42,11673.35,11675.79,11687.85,11696.99,11687.41
5,E000170,42.0,67014.43,65051.89,63686.01,65917.87,43472.36,23488.17
6,E000170,43.0,160260.96,154240.41,151735.51,157772.90,159883.25,155395.69
7,E000170,51.0,,,,,,
8,E000170,52.0,,,,,,
9,E000170,53.0,,,,,,


In [255]:
# Select with a condition, using a single column's values to select rows
# Filter rows for the condition COD_SEGLOBAL>34

#>
df[df['COD_SEGLOBAL'] > 34].head()

     # or
#>    
df.query("COD_SEGLOBAL > 34").head()

Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
2,E000170,35.0,0.0,0.0,0.0,0.0,0.0,0.0
3,E000170,36.0,0.0,0.0,0.0,0.0,0.0,0.0
4,E000170,37.0,11684.42,11673.35,11675.79,11687.85,11696.99,11687.41
5,E000170,42.0,67014.43,65051.89,63686.01,65917.87,43472.36,23488.17
6,E000170,43.0,160260.96,154240.41,151735.51,157772.9,159883.25,155395.69


In [107]:
# Filter with double condition: 
# Filter for the conditions: COD_SEGLOBAL=34 AND fondos_201507 >0
#>
df[(df['COD_SEGLOBAL']==34)  & (df["fondos_201507"]>0)].head()

# or

#>
df.query("COD_SEGLOBAL == 34 & fondos_201507 >0").head()


Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
66,E000839,34.0,811363.27,780156.92,759107.87,787793.77,791198.14,778252.87
149,E018300,34.0,629302.8,611500.83,552623.06,555489.94,559612.44,602340.8
185,E018550,34.0,30715.54,29126.15,28050.06,29438.58,29567.54,28969.27
191,E018722,34.0,51854.54,51821.42,0.0,0.0,0.0,0.0
250,E018759,34.0,99975.88,91711.71,35975.3,39824.39,40686.49,39330.41


In [108]:
# Filter with double condition: 
# Filter for the conditions: COD_SEGLOBAL=34 OR fondos_201507 =0

#>
df[(df['COD_SEGLOBAL']==34)  | (df["fondos_201507"]==0)].head()

    # or

#>
df.query("COD_SEGLOBAL == 34 | fondos_201507 ==0").head()

Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
0,E000170,,0.0,0.0,0.0,0.0,0.0,0.0
1,E000170,34.0,0.0,0.0,0.0,0.0,0.0,0.0
2,E000170,35.0,0.0,0.0,0.0,0.0,0.0,0.0
3,E000170,36.0,0.0,0.0,0.0,0.0,0.0,0.0
7,E000170,51.0,0.0,0.0,0.0,0.0,0.0,0.0


In [109]:
# Filter rows using the isin() method
# Select data if 'COD_SEGLOBAL' isin: 35.0, 39.0, 41.0 

#>
df[df['COD_SEGLOBAL'].isin([35.0, 39.0, 41.0])].head()


Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
2,E000170,35.0,0.0,0.0,0.0,0.0,0.0,0.0
67,E000839,35.0,1099182.08,1055838.46,844188.23,873946.38,878437.93,925242.25
126,E018089,35.0,0.0,0.0,0.0,0.0,0.0,0.0
138,E018296,41.0,0.0,0.0,0.0,0.0,0.0,0.0
150,E018300,35.0,1003572.54,979154.74,952984.13,966369.91,970343.66,958739.18


In [114]:
# Get the unique values of column COD_SEGLOBAL

#>
df["COD_SEGLOBAL"].unique()

array([ nan,  34.,  35.,  36.,  37.,  42.,  43.,  51.,  52.,  53.,  54.,
        55.,  56.,  61.,  62.,  41.,  33.,  63.,  31.,  71.,  32.,  73.,
        72.,  75.,  74.])

In [115]:
# Maximum of the "fondos_201507" column

#>
df.fondos_201507.max()

50612990.479999997

In [116]:
df.set_index?

In [262]:
# reset_index(): the index is copied into a column (named "index" when the index has no name) and the DataFrame index restarts at 0
# set_index("column_name"): creates the index from a column

# Note: both methods return a new DataFrame; pass inplace=True to modify the DataFrame in place
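The cell above only describes the two methods, so here is a minimal sketch of both on a small made-up DataFrame (the values are illustrative assumptions):

```python
import pandas as pd

# Illustrative DataFrame with the default 0..n-1 index
df_demo = pd.DataFrame({"COD_USUARIO": ["E000170", "E000501"],
                        "fondos_201507": [0.0, 158772.14]})

# set_index returns a new DataFrame indexed by the chosen column
indexed = df_demo.set_index("COD_USUARIO")

# reset_index moves the index back into a column and restores 0..n-1
restored = indexed.reset_index()

print(indexed)
print(restored)
```

Because neither call uses `inplace=True`, `df_demo` itself is left unchanged.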

### Group By

In [288]:
# Group by a column and apply an aggregation function over all remaining columns

#>
df.groupby("COD_USUARIO").mean()

Unnamed: 0_level_0,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512,COD_SEG_34
COD_USUARIO,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
E000170,46.846154,1.706856e+04,1.649755e+04,1.622124e+04,1.681276e+04,1.536090e+04,1.361223e+04,0.071429
E000501,51.428571,9.130090e+05,9.075834e+05,8.745818e+05,9.084203e+05,9.115475e+05,9.000623e+05,0.000000
E000531,51.571429,1.115276e+06,1.107899e+06,1.094539e+06,1.132965e+06,1.134202e+06,1.129280e+06,0.000000
E000545,50.300000,3.918370e+05,3.767131e+05,3.663443e+05,3.796563e+05,3.812436e+05,3.790332e+05,0.000000
E000558,51.428571,2.829866e+05,2.734438e+05,2.723402e+05,2.787285e+05,2.728685e+05,2.749118e+05,0.000000
E000575,52.888889,2.151412e+05,2.104763e+05,2.070488e+05,2.293190e+05,2.265792e+05,2.230355e+05,0.000000
E000766,49.400000,1.884467e+06,1.831894e+06,1.799597e+06,1.848561e+06,1.888595e+06,1.868090e+06,0.000000
E000839,46.923077,2.317509e+05,2.235484e+05,2.008691e+05,2.065937e+05,2.074642e+05,2.084072e+05,0.071429
E001223,47.625000,3.629414e+05,3.609369e+05,3.558529e+05,3.658500e+05,3.647546e+05,3.732195e+05,0.000000
E017652,51.428571,8.942887e+05,8.799818e+05,8.724050e+05,9.057021e+05,8.969409e+05,8.758385e+05,0.000000


In [280]:
# Maximum fondos_201507 by COD_USUARIO

#>
df.groupby("COD_USUARIO").agg({"fondos_201507":"max"}).reset_index()

    #or
#>
df.groupby("COD_USUARIO")["fondos_201507"].max().reset_index()

Unnamed: 0,COD_USUARIO,fondos_201507
0,E000170,160260.96
1,E000501,5903439.20
2,E000531,6401503.37
3,E000545,3255015.86
4,E000558,1908020.96
5,E000575,1453531.75
6,E000766,8327876.90
7,E000839,1099182.08
8,E001223,3056661.14
9,E017652,6915466.66


In [264]:
# Mean fondos_201507 by COD_USUARIO and COD_SEGLOBAL, renaming the result column to Mean_fondos_201507

#>
df.groupby(["COD_USUARIO","COD_SEGLOBAL"]).agg({"fondos_201507":"mean"}).rename(columns={"fondos_201507":"Mean_fondos_201507"})

Unnamed: 0_level_0,Unnamed: 1_level_0,Mean_fondos_201507
COD_USUARIO,COD_SEGLOBAL,Unnamed: 2_level_1
E000170,34.0,0.00
E000170,35.0,0.00
E000170,36.0,0.00
E000170,37.0,11684.42
E000170,42.0,67014.43
E000170,43.0,160260.96
E000170,51.0,0.00
E000170,52.0,0.00
E000170,53.0,0.00
E000170,54.0,0.00


In [291]:
# Describe information about group dataframe

#>
df.groupby(["COD_SEGLOBAL"]).describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
COD_SEGLOBAL,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
31.0,count,3.500000e+01,3.500000e+01,3.500000e+01,3.500000e+01,3.500000e+01,3.500000e+01
31.0,mean,5.537936e+05,5.423679e+05,6.145769e+05,6.404842e+05,6.211203e+05,5.846042e+05
31.0,std,2.657048e+06,2.607464e+06,2.582427e+06,2.708621e+06,2.586518e+06,2.381022e+06
31.0,min,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
31.0,25%,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
31.0,50%,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
31.0,75%,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
31.0,max,1.572679e+07,1.543511e+07,1.503807e+07,1.579872e+07,1.505233e+07,1.380077e+07
32.0,count,9.900000e+01,9.900000e+01,9.900000e+01,9.900000e+01,9.900000e+01,9.900000e+01
32.0,mean,8.684837e+04,8.631245e+04,8.374194e+04,8.736599e+04,8.785439e+04,8.630741e+04


In [139]:
# Remove a column, delete 'COD_SEGLOBAL'

df1 = df.copy()
#>

df1.drop('COD_SEGLOBAL', axis=1, inplace=True)

#or
df1 = df.copy()
#>
df1.drop(df1.columns[1], axis=1, inplace=True) 

# or
df1 = df.copy()
#>
del df1['COD_SEGLOBAL']
df1.columns

Index(['COD_USUARIO', 'fondos_201507', 'fondos_201508', 'fondos_201509',
       'fondos_201510', 'fondos_201511', 'fondos_201512'],
      dtype='object')

In [135]:
# Remove multiple columns, delete 'COD_SEGLOBAL' and 'fondos_201507'

df1 = df.copy()
#>
df1.drop(df1.columns[[1, 2]], axis=1, inplace=True)
df1
    # or
df1 = df.copy()
#>
df1.drop(['COD_SEGLOBAL','fondos_201507'], axis=1, inplace=True)
df1

Unnamed: 0,COD_USUARIO,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
0,E000170,0.00,0.00,0.00,0.00,0.00
1,E000170,0.00,0.00,0.00,0.00,0.00
2,E000170,0.00,0.00,0.00,0.00,0.00
3,E000170,0.00,0.00,0.00,0.00,0.00
4,E000170,11673.35,11675.79,11687.85,11696.99,11687.41
5,E000170,65051.89,63686.01,65917.87,43472.36,23488.17
6,E000170,154240.41,151735.51,157772.90,159883.25,155395.69
7,E000170,0.00,0.00,0.00,0.00,0.00
8,E000170,0.00,0.00,0.00,0.00,0.00
9,E000170,0.00,0.00,0.00,0.00,0.00


### Merge

In [267]:
df1.merge?

In [212]:
# Merge two dataframes. Merge df1 and df2 (outer join) on Ciudad

data1 = {'Ciudad':['Trujillo', 'Las Pedroñeras', 'Navaluenga'], 'Provincia':['Caceres', 'Cuenca', 'Avila'], 'Poblacion':[10000, 2000, 400]}

data2 = {'Ciudad':['Trujillo', 'Las Pedroñeras', 'Navaluenga'], 'Poblacion':[12000, 1800, 380]}

df1 = pd.DataFrame(data1)

df2 = pd.DataFrame(data2)

#>
diff = df1.merge(df2, how='outer', on='Ciudad', suffixes=['_2000', '_1970'])
diff

Unnamed: 0,Ciudad,Poblacion_2000,Provincia,Poblacion_1970
0,Trujillo,10000,Caceres,12000
1,Las Pedroñeras,2000,Cuenca,1800
2,Navaluenga,400,Avila,380
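In the merge above both DataFrames contain the same cities, so an outer join and an inner join give the same rows. The following sketch uses hypothetical data where the keys only partly overlap (the 'Cuenca' row and its value are made up) and adds `indicator=True` to show where each row came from:

```python
import pandas as pd

# Hypothetical inputs: only 'Trujillo' appears on both sides
left = pd.DataFrame({"Ciudad": ["Trujillo", "Navaluenga"],
                     "Poblacion": [10000, 400]})
right = pd.DataFrame({"Ciudad": ["Trujillo", "Cuenca"],
                      "Poblacion": [12000, 50000]})

# Outer join keeps every key; _merge records the origin of each row
outer = left.merge(right, how="outer", on="Ciudad",
                   suffixes=("_2000", "_1970"), indicator=True)

# Inner join keeps only keys present in both DataFrames
inner = left.merge(right, how="inner", on="Ciudad",
                   suffixes=("_2000", "_1970"))

print(outer)
print(inner)
```

Rows that exist on only one side get NaN in the columns coming from the other side.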


In [269]:
# Subtract. Create column 'Diferencia 1970-2000' as Poblacion_2000 - Poblacion_1970

#>
diff['Diferencia 1970-2000'] = diff['Poblacion_2000'] - diff['Poblacion_1970']

diff

Unnamed: 0,Ciudad,Poblacion_2000,Provincia,Poblacion_1970,Diferencia 1970-2000
0,Trujillo,10000,Caceres,12000,-2000
1,Las Pedroñeras,2000,Cuenca,1800,200
2,Navaluenga,400,Avila,380,20


In [270]:
# Group by a computed column. From df, count the rows where fondos_201507 == 0.0 (and where it is not) in a 'Count_201507_0' column

#>
df.assign(Eq0 = (df.fondos_201507 == 0.0)).groupby("Eq0").agg({"Eq0":"count"}).rename(columns ={"Eq0":"Count_201507_0"})

Unnamed: 0_level_0,Count_201507_0
Eq0,Unnamed: 1_level_1
False,15971
True,23754


In [148]:
# Sort in ascending order by fondos_201507 and show the first 6 rows

#>
df.sort_values(by='fondos_201507').head(6)

Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
0,E000170,,0.0,0.0,0.0,0.0,0.0,0.0
16417,U091554,51.0,0.0,0.0,0.0,0.0,0.0,0.0
30026,U232050,51.0,0.0,0.0,0.0,0.0,0.0,0.0
16415,U091554,42.0,0.0,0.0,0.0,0.0,0.0,0.0
16414,U091554,36.0,0.0,0.0,0.0,0.0,0.0,0.0
16413,U091554,,0.0,0.0,0.0,0.0,0.0,24606.9


In [150]:
# Sort in descending order by fondos_201507 and show the first 6 rows

#>
df.sort_values('fondos_201507', ascending = False).head(6)

Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
7001,U074939,41.0,50612990.48,50082028.12,49617645.7,49809970.95,48177541.14,45964949.26
21230,U098194,41.0,36135799.24,36481821.01,35557226.28,36119304.92,36083203.97,35601321.24
26599,U220043,42.0,32706296.65,31311523.98,31211364.45,33857520.17,35353067.41,34837797.38
27653,U224174,42.0,30323622.8,29298590.67,28316992.22,29763001.9,29958146.14,29411963.16
32983,U507626,42.0,28937953.28,28405796.79,27029486.9,27286799.47,27117331.04,26614780.84
13094,U084525,42.0,28094421.69,27088555.17,26551854.35,27408597.92,27600691.43,27157830.87


In [152]:
# Sort with a different direction per column: "fondos_201507" descending and "COD_SEGLOBAL" ascending

#>
df.sort_values(by=['fondos_201507','COD_SEGLOBAL'], ascending = [False,True]).head(6)

Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
7001,U074939,41.0,50612990.48,50082028.12,49617645.7,49809970.95,48177541.14,45964949.26
21230,U098194,41.0,36135799.24,36481821.01,35557226.28,36119304.92,36083203.97,35601321.24
26599,U220043,42.0,32706296.65,31311523.98,31211364.45,33857520.17,35353067.41,34837797.38
27653,U224174,42.0,30323622.8,29298590.67,28316992.22,29763001.9,29958146.14,29411963.16
32983,U507626,42.0,28937953.28,28405796.79,27029486.9,27286799.47,27117331.04,26614780.84
13094,U084525,42.0,28094421.69,27088555.17,26551854.35,27408597.92,27600691.43,27157830.87


In [276]:
# Rename columns, "COD_USUARIO":"user_code" and "COD_SEGLOBAL":"global_seg_code"

#>
df.rename(columns = {"COD_USUARIO":"user_code","COD_SEGLOBAL":"global_seg_code"}).tail()

Unnamed: 0,user_code,global_seg_code,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512
39720,UZ44403,51.0,45291.79,42150.19,40250.91,43582.43,44290.3,42851.89
39721,UZ44403,52.0,9724.66,8899.26,8446.44,9407.03,9664.82,9243.44
39722,UZ44403,55.0,83105.26,79572.65,77208.25,119590.83,120249.18,149363.74
39723,UZ44403,56.0,64385.96,61047.44,58773.43,61692.29,61958.25,60707.55
39724,UZ44403,61.0,6733.52,6749.17,6749.56,6748.73,6747.17,6738.1


In [277]:
# Add column 'new_column' with string 'Constant'

#>
df['new_column'] = 'Constant'
df

Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512,new_column
0,E000170,,0.00,0.00,0.00,0.00,0.00,0.00,Constant
1,E000170,34.0,0.00,0.00,0.00,0.00,0.00,0.00,Constant
2,E000170,35.0,0.00,0.00,0.00,0.00,0.00,0.00,Constant
3,E000170,36.0,0.00,0.00,0.00,0.00,0.00,0.00,Constant
4,E000170,37.0,11684.42,11673.35,11675.79,11687.85,11696.99,11687.41,Constant
5,E000170,42.0,67014.43,65051.89,63686.01,65917.87,43472.36,23488.17,Constant
6,E000170,43.0,160260.96,154240.41,151735.51,157772.90,159883.25,155395.69,Constant
7,E000170,51.0,0.00,0.00,0.00,0.00,0.00,0.00,Constant
8,E000170,52.0,0.00,0.00,0.00,0.00,0.00,0.00,Constant
9,E000170,53.0,0.00,0.00,0.00,0.00,0.00,0.00,Constant


In [157]:
# Add a column 'COD_SEG_34' with boolean values indicating whether COD_SEGLOBAL == 34.0

#>
df['COD_SEG_34'] = df.COD_SEGLOBAL == 34.0
df[['COD_SEG_34','COD_SEGLOBAL']]


Unnamed: 0,COD_SEG_34,COD_SEGLOBAL
0,False,
1,True,34.0
2,False,35.0
3,False,36.0
4,False,37.0
5,False,42.0
6,False,43.0
7,False,51.0
8,False,52.0
9,False,53.0


In [279]:
# Transposing data

#>
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,39715,39716,39717,39718,39719,39720,39721,39722,39723,39724
COD_USUARIO,E000170,E000170,E000170,E000170,E000170,E000170,E000170,E000170,E000170,E000170,...,UN30630,UN30630,UZ44403,UZ44403,UZ44403,UZ44403,UZ44403,UZ44403,UZ44403,UZ44403
COD_SEGLOBAL,,34,35,36,37,42,43,51,52,53,...,56,61,,42,43,51,52,55,56,61
fondos_201507,0,0,0,0,11684.4,67014.4,160261,0,0,0,...,25393.2,13677.8,89477.8,3.17249e+06,2.53899e+06,45291.8,9724.66,83105.3,64386,6733.52
fondos_201508,0,0,0,0,11673.4,65051.9,154240,0,0,0,...,24794.2,13336.7,86449.4,3.07234e+06,2.67879e+06,42150.2,8899.26,79572.6,61047.4,6749.17
fondos_201509,0,0,0,0,11675.8,63686,151736,0,0,0,...,24321.3,13065.8,138533,3.008e+06,2.61129e+06,40250.9,8446.44,77208.2,58773.4,6749.56
fondos_201510,0,0,0,0,11687.9,65917.9,157773,0,0,0,...,24771.2,13321.1,213179,3.12522e+06,2.62779e+06,43582.4,9407.03,119591,61692.3,6748.73
fondos_201511,0,0,0,0,11697,43472.4,159883,0,0,0,...,30120.8,13326.8,212652,3.15219e+06,2.74991e+06,44290.3,9664.82,120249,61958.2,6747.17
fondos_201512,0,0,0,0,11687.4,23488.2,155396,0,0,0,...,29835,13192.8,210165,3.30337e+06,2.7493e+06,42851.9,9243.44,149364,60707.6,6738.1
new_column,Constant,Constant,Constant,Constant,Constant,Constant,Constant,Constant,Constant,Constant,...,Constant,Constant,Constant,Constant,Constant,Constant,Constant,Constant,Constant,Constant
COD_SEG_34,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [158]:
# Create a new column Q3_2015 as fondos_201507 + fondos_201508

#>
df.assign(Q3_2015 = df.fondos_201507 + df.fondos_201508).head()

Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512,COD_SEG_34,Q3_2015
0,E000170,,0.0,0.0,0.0,0.0,0.0,0.0,False,0.0
1,E000170,34.0,0.0,0.0,0.0,0.0,0.0,0.0,True,0.0
2,E000170,35.0,0.0,0.0,0.0,0.0,0.0,0.0,False,0.0
3,E000170,36.0,0.0,0.0,0.0,0.0,0.0,0.0,False,0.0
4,E000170,37.0,11684.42,11673.35,11675.79,11687.85,11696.99,11687.41,False,23357.77


In [163]:
# Concatenate two DataFrame columns into a new, single column. Concatenate COD_USUARIO and COD_SEGLOBAL in column 'concatenated'

#>
df['concatenated'] = df['COD_USUARIO'].map(str) + ":" + df['COD_SEGLOBAL'].map(str)
df.head()

Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512,COD_SEG_34,concatenated
0,E000170,,0.0,0.0,0.0,0.0,0.0,0.0,False,E000170:nan
1,E000170,34.0,0.0,0.0,0.0,0.0,0.0,0.0,True,E000170:34.0
2,E000170,35.0,0.0,0.0,0.0,0.0,0.0,0.0,False,E000170:35.0
3,E000170,36.0,0.0,0.0,0.0,0.0,0.0,0.0,False,E000170:36.0
4,E000170,37.0,11684.42,11673.35,11675.79,11687.85,11696.99,11687.41,False,E000170:37.0


### Apply

dataframe.apply(function) or dataframe.apply(lambda function)
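As a minimal sketch of the two common ways to call `apply` on a DataFrame, column-wise (the default) and row-wise with `axis=1`, using a small made-up frame:

```python
import pandas as pd

# Illustrative frame with two numeric columns
funds = pd.DataFrame({"fondos_201507": [0.0, 11684.42],
                      "fondos_201508": [0.0, 11673.35]})

# Default axis=0: the function receives each column as a Series
column_max = funds.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series
row_total = funds.apply(
    lambda row: row["fondos_201507"] + row["fondos_201508"], axis=1)

print(column_max)
print(row_total)
```

For simple arithmetic like the row total, a vectorised expression such as `funds.sum(axis=1)` is usually faster; `apply` is most useful when the logic cannot be expressed that way.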

In [175]:
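# Split delimited values in a DataFrame column into two new columns. Split the "concatenated" column on the ":" delimiter and store the first item in 'a' and the second in 'b'.

#>
df['a'] = df['concatenated'].apply(lambda x: x.split(':', 1)[0])
df['b'] = df['concatenated'].apply(lambda x: x.split(':', 1)[1])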
df.head()

Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512,COD_SEG_34,concatenated,a,b
0,E000170,,0.0,0.0,0.0,0.0,0.0,0.0,False,E000170:nan,E000170,
1,E000170,34.0,0.0,0.0,0.0,0.0,0.0,0.0,True,E000170:34.0,E000170,34.0
2,E000170,35.0,0.0,0.0,0.0,0.0,0.0,0.0,False,E000170:35.0,E000170,35.0
3,E000170,36.0,0.0,0.0,0.0,0.0,0.0,0.0,False,E000170:36.0,E000170,36.0
4,E000170,37.0,11684.42,11673.35,11675.79,11687.85,11696.99,11687.41,False,E000170:37.0,E000170,37.0


In [176]:
# Unique rows by "COD_USUARIO"
# By default, duplicates are dropped except for the first occurrence.

#>
df.drop_duplicates('COD_USUARIO', inplace=True)
df

Unnamed: 0,COD_USUARIO,COD_SEGLOBAL,fondos_201507,fondos_201508,fondos_201509,fondos_201510,fondos_201511,fondos_201512,COD_SEG_34,concatenated,a,b
0,E000170,,0.00,0.00,0.00,0.00,0.00,0.00,False,E000170:nan,E000170,
14,E000501,,158772.14,154831.31,151801.95,154821.40,154964.26,153389.92,False,E000501:nan,E000501,
22,E000531,,60958.04,58109.30,55963.78,58196.52,58049.08,81840.54,False,E000531:nan,E000531,
30,E000545,,0.00,0.00,0.00,0.00,0.00,0.00,False,E000545:nan,E000545,
41,E000558,,0.00,0.00,0.00,0.00,0.00,0.00,False,E000558:nan,E000558,
49,E000575,,0.00,0.00,0.00,0.00,0.00,0.00,False,E000575:nan,E000575,
59,E000766,,0.00,0.00,0.00,0.00,0.00,0.00,False,E000766:nan,E000766,
65,E000839,,0.00,0.00,0.00,0.00,0.00,0.00,False,E000839:nan,E000839,
79,E001223,,0.00,0.00,0.00,0.00,0.00,0.00,False,E001223:nan,E001223,
88,E017652,,0.00,0.00,0.00,0.00,0.00,0.00,False,E017652:nan,E017652,


### Pivot 

In [190]:
d = pd.DataFrame({'A':[1,1,3,1],'B':[3,4,5,4],'C':[6,7,8,8]})
d

Unnamed: 0,A,B,C
0,1,3,6
1,1,4,7
2,3,5,8
3,1,4,8


In [191]:
d.pivot_table(values = 'B', index = ['A'], columns = ['C'])

C,6,7,8
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,3.0,4.0,4.0
3,,,5.0


### Missing Data
Pandas primarily uses the value **np.nan** to represent missing data. See the Missing Data section : http://pandas.pydata.org/pandas-docs/stable/missing_data.html#missing-data

In [193]:
import numpy as np

In [194]:
dic = {"A": [1,2,np.nan],"B": [3, np.nan, np.nan],"C": [1,2,3]}
d = pd.DataFrame(dic)
d

Unnamed: 0,A,B,C
0,1.0,3.0,1
1,2.0,,2
2,,,3


In [195]:
# Get the boolean mask where values are NaN

#>
pd.isnull(d).head()

Unnamed: 0,A,B,C
0,False,False,False
1,False,True,False
2,True,True,False


In [196]:
# Count the number of missing values

#>
d.isnull().sum().sum()

3

In [197]:
# Check if there is any null

#>
d.isnull().values.any()

True

In [198]:
# Drop any rows that have missing data

#>
d.dropna()

Unnamed: 0,A,B,C
0,1.0,3.0,1


In [199]:
# Filling missing data

#>
d.fillna(value = "missing")


Unnamed: 0,A,B,C
0,1,3,1
1,2,missing,2
2,missing,missing,3


In [200]:
# Filling missing data with the mean of each column

#>
d.fillna(value = d.mean())

Unnamed: 0,A,B,C
0,1.0,3.0,1
1,2.0,3.0,2
2,1.5,3.0,3
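The calls above act on the whole DataFrame at once. As a sketch of finer-grained handling on the same example data, the following counts missing values per column, drops rows based on a single column, and fills each column with its own value (the specific fill values are illustrative choices):

```python
import numpy as np
import pandas as pd

# Same example data as above
d = pd.DataFrame({"A": [1, 2, np.nan],
                  "B": [3, np.nan, np.nan],
                  "C": [1, 2, 3]})

# Missing values per column rather than one overall total
missing_per_column = d.isnull().sum()

# Drop only the rows where column A is missing
rows_with_a = d.dropna(subset=["A"])

# Fill each column with its own value by passing a dict to fillna
filled = d.fillna({"A": 0, "B": d["B"].mean()})

print(missing_per_column)
print(rows_with_a)
print(filled)
```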
