<a href="https://colab.research.google.com/github/fralfaro/MAT281_2023/blob/main/docs/lectures/data_manipulation/pd_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Pandas I

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/2560px-Pandas_logo.svg.png" width = "300" align="center"/>


## Introduction

[Pandas](https://pandas.pydata.org/) is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data easy and intuitive.

It aims to be the fundamental high-level building block for practical, real-world data analysis in Python. It also has the broader goal of becoming the most powerful and flexible open-source data analysis and manipulation tool available in any language, and it is already well on its way toward that goal.

**Series and DataFrames**

* A **Series** is a one-dimensional labeled array. It can be thought of as a generalization of a Python dictionary.

<img src="https://raw.githubusercontent.com/fralfaro/MAT281_2022/main/docs/lectures/data_manipulation/data_manipulation/images/series.png" width = "300" align="center"/>


* A **DataFrame** is a two-dimensional labeled array and a natural extension of the Series. It can be thought of as a generalization of a `numpy.array`.

<img src="https://raw.githubusercontent.com/fralfaro/MAT281_2022/main/docs/lectures/data_manipulation/data_manipulation/images/dataframe.png" width = "800" align="center"/>

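The relationship between the two structures can be sketched with a toy example. The labels and values below are invented for illustration and are not part of the course data.

```python
import pandas as pd

# A Series built from a dict: the keys become the index labels.
prices = pd.Series({"mon": 10.0, "tue": 11.5, "wed": 9.8})
print(prices["tue"])          # label-based access, much like a dict

# A DataFrame built from two Series that share the same labels;
# each Series becomes one column.
volumes = pd.Series({"mon": 100, "tue": 150, "wed": 90})
table = pd.DataFrame({"price": prices, "volume": volumes})
print(table.shape)
print(type(table["price"]))   # selecting one column gives back a Series
```
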
## Pandas Dataframes


As mentioned above, a DataFrame is a collection of Series, each of which can hold a different type (numeric, string, etc.). This part walks through an applied example of the main DataFrame features.

### Creating DataFrames

A DataFrame can be created in several ways, for example from lists, dictionaries, or NumPy arrays.


In [1]:
import pandas as pd
import numpy as np

In [2]:
# empty dataframe
df_empty = pd.DataFrame()
df_empty

In [3]:
# dataframe with list
df_list = pd.DataFrame(
    [
        ["nombre_01", "apellido_01", 60],
        ["nombre_02", "apellido_02", 14]
    ],
    
    columns = ["nombre", "apellido", "edad"]
)
df_list

,nombre,apellido,edad
0,nombre_01,apellido_01,60
1,nombre_02,apellido_02,14


In [4]:
# dataframe with dict
df_dct =  pd.DataFrame(
    {
        "nombre": ["nombre_01", "nombre_02",],
        "apellido": ["apellido_01", "apellido_02"],
        "edad": np.array([60,14]),
    }
)

df_dct

,nombre,apellido,edad
0,nombre_01,apellido_01,60
1,nombre_02,apellido_02,14


### Reading Data

In practice, the data you work with is stored somewhere in some file format, for example:
* `.txt`
* `.csv`
* `.xlsx`
* `.db`
* etc.
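As a rough sketch of how one of these formats is read, the snippet below parses a small CSV held in an in-memory string, so it runs without any file on disk. The column names and values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content, kept in memory so the example needs no file.
csv_text = "Date,Open,Close\n2020-01-02,10.0,10.5\n2020-01-03,10.5,10.2\n"

# pd.read_csv accepts a path or a file-like object;
# index_col picks the column to use as the row index.
sample = pd.read_csv(io.StringIO(csv_text), sep=",", index_col="Date")
print(sample)
```

Other formats have their own readers, such as `pd.read_excel` for `.xlsx` files and `pd.read_sql` for databases; these may require extra packages to be installed.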

📈 **Example**: The `ibm_data` dataset contains historical share prices for IBM (International Business Machines Corporation), obtained from `Yahoo Finance` for the period 1990-2017.

<img src="https://cdn-icons-png.flaticon.com/512/8070/8070604.png" width = "200" align="center"/>

📋 **Column descriptions**

| Column    | Description                                      |
|-----------|--------------------------------------------------|
| Date      | Date on which the share prices were recorded     |
| Open      | Opening price of IBM shares                      |
| High      | Highest price at which the shares traded         |
| Low       | Lowest price at which the shares traded          |
| Close     | Closing price of IBM shares                      |
| Volume    | Total number of IBM shares traded                |

>🔑 **Note**: This is the main dataset we will use throughout the course. It underpins most of the data manipulation cases and machine learning models we will explore.

In [5]:
# load the data
path = 'data/ibm_data_1990_2017_daily.csv'
df = pd.read_csv(path, sep="," ).set_index('Date')

In [6]:
# show the result
df

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume
1990-12-31,1990,12,31,,,,,,
1990-12-31,1990,12,31,,,,,,
1991-01-02,1991,1,2,26.977772,27.186901,26.798517,26.798517,12.031989,4341737.0
1991-01-03,1991,1,3,26.858271,27.216778,26.828394,26.888145,12.072229,5470162.0
1991-01-04,1991,1,4,26.947897,27.007648,26.738768,26.798517,12.031989,4540058.0
...,...,...,...,...,...,...,...,...,...
2017-12-22,2017,12,22,145.143402,146.271515,144.837479,145.793503,108.304832,3128168.0
2017-12-26,2017,12,26,145.803055,147.093689,145.793503,146.108994,108.539215,2593034.0
2017-12-27,2017,12,27,146.223709,146.443588,145.898666,146.395798,108.752281,2248168.0
2017-12-28,2017,12,28,146.462708,147.342255,146.462708,147.265778,109.398567,2811230.0


### Pandas Objects

At a very basic level, Pandas objects can be thought of as enhanced versions of `numpy` arrays in which rows and columns are identified by labels rather than plain integer indices.
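The difference between integer positions and labels can be seen in a minimal sketch. The array values and the labels `a`, `b`, `x`, `y` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([[1.0, 2.0], [3.0, 4.0]])
print(arr[1, 0])              # NumPy: cells are addressed by integer position

frame = pd.DataFrame(arr, index=["a", "b"], columns=["x", "y"])
print(frame.loc["b", "x"])    # pandas: the same cell, addressed by labels
print(frame.iloc[1, 0])       # positional access is still available
print(frame.to_numpy())       # the underlying values as a NumPy array
```
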

**DataFrame**

In [7]:
# values
df.values

array([[1.99000000e+03, 1.20000000e+01, 3.10000000e+01, ...,
                   nan,            nan,            nan],
       [1.99000000e+03, 1.20000000e+01, 3.10000000e+01, ...,
                   nan,            nan,            nan],
       [1.99100000e+03, 1.00000000e+00, 2.00000000e+00, ...,
        2.67985172e+01, 1.20319891e+01, 4.34173700e+06],
       ...,
       [2.01700000e+03, 1.20000000e+01, 2.70000000e+01, ...,
        1.46395798e+02, 1.08752281e+02, 2.24816800e+06],
       [2.01700000e+03, 1.20000000e+01, 2.80000000e+01, ...,
        1.47265778e+02, 1.09398567e+02, 2.81123000e+06],
       [2.01700000e+03, 1.20000000e+01, 2.90000000e+01, ...,
        1.46673035e+02, 1.08958214e+02, 3.48014700e+06]])

In [8]:
# index
df.index

Index(['1990-12-31', '1990-12-31', '1991-01-02', '1991-01-03', '1991-01-04',
       '1991-01-07', '1991-01-08', '1991-01-09', '1991-01-10', '1991-01-11',
       ...
       '2017-12-15', '2017-12-18', '2017-12-19', '2017-12-20', '2017-12-21',
       '2017-12-22', '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29'],
      dtype='object', name='Date', length=6805)

In [9]:
# columns
df.columns

Index(['Year', 'Month', 'Day', 'Open', 'High', 'Low', 'Close', 'Adj Close',
       'Volume'],
      dtype='object')

In [10]:
# type
type(df)

pandas.core.frame.DataFrame

**Series**

In [11]:
# select a column (a single label returns a Series; a list of labels returns a DataFrame)
pd_series = df['Open']

In [12]:
# values
pd_series.values

array([         nan,          nan,  26.97777176, ..., 146.22370911,
       146.46270752, 147.39006042])

In [13]:
# index
pd_series.index

Index(['1990-12-31', '1990-12-31', '1991-01-02', '1991-01-03', '1991-01-04',
       '1991-01-07', '1991-01-08', '1991-01-09', '1991-01-10', '1991-01-11',
       ...
       '2017-12-15', '2017-12-18', '2017-12-19', '2017-12-20', '2017-12-21',
       '2017-12-22', '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29'],
      dtype='object', name='Date', length=6805)

In [14]:
# type
type(pd_series)

pandas.core.series.Series

### Operations in Pandas

Operations in Pandas are the actions that can be performed on Pandas objects, such as Series and DataFrames, to manipulate, transform, and analyze data.

#### Basic Operations

In [15]:
# first rows
df.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume
1990-12-31,1990,12,31,,,,,,
1990-12-31,1990,12,31,,,,,,
1991-01-02,1991,1,2,26.977772,27.186901,26.798517,26.798517,12.031989,4341737.0
1991-01-03,1991,1,3,26.858271,27.216778,26.828394,26.888145,12.072229,5470162.0
1991-01-04,1991,1,4,26.947897,27.007648,26.738768,26.798517,12.031989,4540058.0


In [16]:
# last rows
df.tail()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume
2017-12-22,2017,12,22,145.143402,146.271515,144.837479,145.793503,108.304832,3128168.0
2017-12-26,2017,12,26,145.803055,147.093689,145.793503,146.108994,108.539215,2593034.0
2017-12-27,2017,12,27,146.223709,146.443588,145.898666,146.395798,108.752281,2248168.0
2017-12-28,2017,12,28,146.462708,147.342255,146.462708,147.265778,109.398567,2811230.0
2017-12-29,2017,12,29,147.39006,147.915863,146.673035,146.673035,108.958214,3480147.0


In [17]:
# summary information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6805 entries, 1990-12-31 to 2017-12-29
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Year       6805 non-null   int64  
 1   Month      6805 non-null   int64  
 2   Day        6805 non-null   int64  
 3   Open       6803 non-null   float64
 4   High       6803 non-null   float64
 5   Low        6803 non-null   float64
 6   Close      6803 non-null   float64
 7   Adj Close  6803 non-null   float64
 8   Volume     6803 non-null   float64
dtypes: float64(6), int64(3)
memory usage: 531.6+ KB


In [18]:
# data type of each column
df.dtypes

Year           int64
Month          int64
Day            int64
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume       float64
dtype: object

In [19]:
# number of rows and columns
df.shape

(6805, 9)

In [20]:
# basic statistics
df.describe()

,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume
count,6805.0,6805.0,6805.0,6803.0,6803.0,6803.0,6803.0,6803.0,6803.0
mean,2003.982807,6.542983,15.738575,107.526876,96.953768,102.104281,114.454251,57.129515,7819733.0
std,7.796566,3.43082,8.751971,656.382497,195.933357,600.001154,1352.448409,36.850064,4905685.0
min,1990.0,1.0,1.0,9.799235,9.978489,9.709608,9.799235,5.013444,682829.0
25%,1997.0,4.0,8.0,48.055092,48.816921,47.323135,47.860901,25.76819,4726246.0
50%,2004.0,7.0,16.0,90.793503,91.644356,89.67495,90.717018,51.459721,6637393.0
75%,2011.0,10.0,23.0,135.850861,136.615685,134.956978,135.936905,89.476288,9394230.0
max,2017.0,12.0,31.0,46875.0,15625.0,46875.0,109375.0,132.362503,72639160.0


In [21]:
# number of unique values in a specific column
df['Year'].nunique()

28

In [22]:
# unique values in a specific column
df['Year'].unique()

array([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
       2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015, 2016, 2017], dtype=int64)

In [23]:
# number of unique values in every column
df.nunique()

Year           28
Month          12
Day            31
Open         4478
High         4680
Low          4726
Close        4821
Adj Close    6203
Volume       6527
dtype: int64

In [24]:
# number of occurrences of each value in a column
df['Year'].value_counts()  # append .sort_index() to order by year instead of by count

1992    254
1996    254
1993    253
1997    253
2008    253
1991    253
2004    252
2003    252
2010    252
2009    252
2014    252
2015    252
2005    252
2016    252
2002    252
2013    252
2000    252
1999    252
1998    252
1995    252
1994    252
2011    252
2017    251
2007    251
2006    251
2012    250
2001    248
1990      2
Name: Year, dtype: int64

In [25]:
# sort rows by a specific column, ascending
df.sort_values('Year', ascending = True).head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume
1990-12-31,1990,12,31,,,,,,
1990-12-31,1990,12,31,,,,,,
1991-08-20,1991,8,20,23.004301,23.004301,22.73542,22.884798,10.623586,6822849.0
1991-08-21,1991,8,21,22.974426,23.034178,22.466539,22.854923,10.609709,11832770.0
1991-08-22,1991,8,22,22.854923,23.034178,22.496414,22.556166,10.471026,9060452.0


In [26]:
# sort rows by a specific column, descending
df.sort_values('Year', ascending = False).head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume
2017-12-29,2017,12,29,147.39006,147.915863,146.673035,146.673035,108.958214,3480147.0
2017-04-26,2017,4,26,153.470367,154.015289,152.858505,153.021027,110.2948,4526879.0
2017-05-15,2017,5,15,143.99617,145.152969,143.833649,144.847031,105.423012,4440793.0
2017-05-12,2017,5,12,143.690247,144.608032,143.403442,143.757172,104.629791,3886204.0
2017-05-11,2017,5,11,144.407272,144.502869,143.202682,144.024857,104.824615,5886783.0


#### Mathematical Operations

In [27]:
# arithmetic operations (element-wise, aligned on the index)
s1 = df['High']
s2 = df['Low']

suma = s1 + s2
resta = s1 - s2
multiplicacion = s1 * s2
division = s1 / s2

# sum
print(f"suma: \n{suma.head()}\n")

# difference
print(f"resta: \n{resta.head()}\n")

# product
print(f"multiplicacion: \n{multiplicacion.head()}\n")

# quotient
print(f"division: \n{division.head()}")

suma: 
Date
1990-12-31          NaN
1990-12-31          NaN
1991-01-02    53.985418
1991-01-03    54.045172
1991-01-04    53.746416
dtype: float64

resta: 
Date
1990-12-31         NaN
1990-12-31         NaN
1991-01-02    0.388384
1991-01-03    0.388384
1991-01-04    0.268881
dtype: float64

multiplicacion: 
Date
1990-12-31           NaN
1990-12-31           NaN
1991-01-02    728.568637
1991-01-03    730.182437
1991-01-04    722.151236
dtype: float64

division: 
Date
1990-12-31         NaN
1990-12-31         NaN
1991-01-02    1.014493
1991-01-03    1.014477
1991-01-04    1.010056
dtype: float64


In [28]:
# statistical operations
s1 = df['Open']

print(f"mean: {s1.mean()}") # mean
print(f"std:  {s1.std()}") # std
print(f"min:  {s1.min()}") # min
print(f"max:  {s1.max()}") # max

mean: 107.52687620888838
std:  656.3824971142529
min:  9.799235343933104
max:  46875.0


In [29]:
# linear (Pearson) correlations
cols = ['Open', 'High', 'Low', 'Close']
df[cols].corr()

,Open,High,Low,Close
Open,1.0,0.014387,0.094527,0.042194
High,0.014387,1.0,0.0187,0.009763
Low,0.094527,0.0187,1.0,0.002562
Close,0.042194,0.009763,0.002562,1.0


#### Advanced Operations

In [30]:
# create a constant column
df['Constant'] = 1
df.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Constant
1990-12-31,1990,12,31,,,,,,,1
1990-12-31,1990,12,31,,,,,,,1
1991-01-02,1991,1,2,26.977772,27.186901,26.798517,26.798517,12.031989,4341737.0,1
1991-01-03,1991,1,3,26.858271,27.216778,26.828394,26.888145,12.072229,5470162.0,1
1991-01-04,1991,1,4,26.947897,27.007648,26.738768,26.798517,12.031989,4540058.0,1


In [31]:
# drop a column
df = df.drop('Constant',axis=1)
df.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume
1990-12-31,1990,12,31,,,,,,
1990-12-31,1990,12,31,,,,,,
1991-01-02,1991,1,2,26.977772,27.186901,26.798517,26.798517,12.031989,4341737.0
1991-01-03,1991,1,3,26.858271,27.216778,26.828394,26.888145,12.072229,5470162.0
1991-01-04,1991,1,4,26.947897,27.007648,26.738768,26.798517,12.031989,4540058.0


In [32]:
# new column computed from two existing columns
df['Diff_high_low'] = df['High']-df['Low']
df.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low
1990-12-31,1990,12,31,,,,,,,
1990-12-31,1990,12,31,,,,,,,
1991-01-02,1991,1,2,26.977772,27.186901,26.798517,26.798517,12.031989,4341737.0,0.388384
1991-01-03,1991,1,3,26.858271,27.216778,26.828394,26.888145,12.072229,5470162.0,0.388384
1991-01-04,1991,1,4,26.947897,27.007648,26.738768,26.798517,12.031989,4540058.0,0.268881


In [33]:
# new column using 'apply' (NaN > 1 is False, so rows with missing values get 0)
df['greater_than_1'] = df['Diff_high_low'].apply(lambda x: 1 if x>1 else 0)
df.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1
1990-12-31,1990,12,31,,,,,,,,0
1990-12-31,1990,12,31,,,,,,,,0
1991-01-02,1991,1,2,26.977772,27.186901,26.798517,26.798517,12.031989,4341737.0,0.388384,0
1991-01-03,1991,1,3,26.858271,27.216778,26.828394,26.888145,12.072229,5470162.0,0.388384,0
1991-01-04,1991,1,4,26.947897,27.007648,26.738768,26.798517,12.031989,4540058.0,0.268881,0


In [34]:
df['greater_than_1'].value_counts()

1    4609
0    2196
Name: greater_than_1, dtype: int64

A DataFrame often requires operations between the current row and the previous one, which can be cumbersome without the right functions. Some of these functions are covered below:

1. **`shift()`**: shifts the values of a column or Series down or up by a given number of rows.
2. **`cumsum()`**: computes the cumulative sum of values along an axis of a DataFrame or Series.
3. **`pct_change()`**: computes the relative change between consecutive elements of a Series or DataFrame column. The result is a fraction; multiply by 100 to get a percentage.
4. **`rank()`**: assigns a rank to each element of a Series or DataFrame column.
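The four functions listed above can be tried on a small Series where the results are easy to check by hand. The values are arbitrary and chosen only so that one tie appears in the ranking.

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 9.0, 12.0])

result = pd.DataFrame({
    "value": s,
    "shift": s.shift(),            # previous row's value; the first row is NaN
    "cumsum": s.cumsum(),          # running total: 10, 22, 31, 43
    "pct_change": s.pct_change(),  # (current - previous) / previous, as a fraction
    "rank": s.rank(),              # 1 = smallest; tied values share the average rank
})
print(result)
```

With the default settings, the two tied values of 12.0 both receive rank 3.5, the average of positions 3 and 4.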

In [35]:
# apply the functions
df['shift'] = df['Open'].shift() # value from the previous row (the first row has no previous row, so it is NaN)
df['cumsum'] = df['Open'].cumsum()  # running total of the current row and all previous rows
df['pct_change'] = df['Open'].pct_change() # fractional change between the current row and the previous one
df['rank'] = df['Open'].rank() # rank of each value (1 is the smallest)
df.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1,shift,cumsum,pct_change,rank
1990-12-31,1990,12,31,,,,,,,,0,,,,
1990-12-31,1990,12,31,,,,,,,,0,,,,
1991-01-02,1991,1,2,26.977772,27.186901,26.798517,26.798517,12.031989,4341737.0,0.388384,0,,26.977772,,1331.5
1991-01-03,1991,1,3,26.858271,27.216778,26.828394,26.888145,12.072229,5470162.0,0.388384,0,26.977772,53.836042,-0.00443,1322.5
1991-01-04,1991,1,4,26.947897,27.007648,26.738768,26.798517,12.031989,4540058.0,0.268881,0,26.858271,80.783939,0.003337,1329.0


In [36]:
cols = ['shift', 'cumsum','pct_change', 'rank']
df = df.drop(cols,axis=1)
df.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1
1990-12-31,1990,12,31,,,,,,,,0
1990-12-31,1990,12,31,,,,,,,,0
1991-01-02,1991,1,2,26.977772,27.186901,26.798517,26.798517,12.031989,4341737.0,0.388384,0
1991-01-03,1991,1,3,26.858271,27.216778,26.828394,26.888145,12.072229,5470162.0,0.388384,0
1991-01-04,1991,1,4,26.947897,27.007648,26.738768,26.798517,12.031989,4540058.0,0.268881,0


### Filtering Data

To filter data in Pandas, use the `loc[]` or `iloc[]` indexers (both take square brackets, not parentheses). `loc[]` selects by index label or boolean condition, while `iloc[]` selects by integer position. For practical purposes, we will use only `loc[]`.
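A minimal sketch of the three selection styles, using a made-up three-row frame rather than the course data:

```python
import pandas as pd

frame = pd.DataFrame(
    {"Year": [1999, 2000, 2001], "Open": [50.0, 107.5, 85.0]},
    index=["1999-01-04", "2000-01-03", "2001-01-02"],
)

by_label = frame.loc["2000-01-03"]           # row selected by its index label
by_position = frame.iloc[1]                  # the same row, selected by position
by_mask = frame.loc[frame["Year"] >= 2000]   # rows where the boolean mask is True

print(by_mask)
```
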

In [37]:
# 'Year' greater than or equal to 2000
df_new = df.loc[df['Year'] >= 2000]
df_new.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1
2000-01-03,2000,1,3,107.492828,110.898659,106.95507,110.898659,60.474785,10823694.0,3.943588,1
2000-01-04,2000,1,4,108.986618,109.46463,105.999046,107.134323,58.422077,8606279.0,3.465584,1
2000-01-05,2000,1,5,107.97084,114.483749,107.194077,110.898659,60.474785,13318927.0,7.289673,1
2000-01-06,2000,1,6,112.810707,113.706978,108.508606,108.986618,59.432137,8338607.0,5.198372,1
2000-01-07,2000,1,7,112.093689,112.750954,105.76004,108.508606,59.171471,12402108.0,6.990913,1


In [38]:
# build the condition (a boolean Series)
valor_objetivo = 2000 
condicion = (df['Year'] >= valor_objetivo)
condicion.head()

Date
1990-12-31    False
1990-12-31    False
1991-01-02    False
1991-01-03    False
1991-01-04    False
Name: Year, dtype: bool

In [39]:
# apply the condition
df_new = df.loc[condicion]
df_new.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1
2000-01-03,2000,1,3,107.492828,110.898659,106.95507,110.898659,60.474785,10823694.0,3.943588,1
2000-01-04,2000,1,4,108.986618,109.46463,105.999046,107.134323,58.422077,8606279.0,3.465584,1
2000-01-05,2000,1,5,107.97084,114.483749,107.194077,110.898659,60.474785,13318927.0,7.289673,1
2000-01-06,2000,1,6,112.810707,113.706978,108.508606,108.986618,59.432137,8338607.0,5.198372,1
2000-01-07,2000,1,7,112.093689,112.750954,105.76004,108.508606,59.171471,12402108.0,6.990913,1


Let's look at different kinds of filters:

In [40]:
# greater than or equal to 2000
df_new = df.loc[df['Year'] >= 2000]
df_new.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1
2000-01-03,2000,1,3,107.492828,110.898659,106.95507,110.898659,60.474785,10823694.0,3.943588,1
2000-01-04,2000,1,4,108.986618,109.46463,105.999046,107.134323,58.422077,8606279.0,3.465584,1
2000-01-05,2000,1,5,107.97084,114.483749,107.194077,110.898659,60.474785,13318927.0,7.289673,1
2000-01-06,2000,1,6,112.810707,113.706978,108.508606,108.986618,59.432137,8338607.0,5.198372,1
2000-01-07,2000,1,7,112.093689,112.750954,105.76004,108.508606,59.171471,12402108.0,6.990913,1


In [41]:
# between 2005 and 2010 (both ends included by default)
df_new = df.loc[df['Year'].between(2005,2010)]
df_new.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1
2005-01-03,2005,1,3,94.617592,94.741875,92.973228,93.451241,52.62587,5538779.0,1.768646,1
2005-01-04,2005,1,4,93.441681,94.091782,92.275337,92.447418,52.060616,5973706.0,1.816444,1
2005-01-05,2005,1,5,92.351814,93.527725,92.160614,92.25621,51.952915,5906448.0,1.367111,1
2005-01-06,2005,1,6,92.294456,92.715103,91.826004,91.969406,51.791401,4771538.0,0.889099,0
2005-01-07,2005,1,7,92.25621,92.543022,91.271507,91.567879,51.565285,6485932.0,1.271515,1


In [42]:
# only 2000
df_new = df.loc[df['Year']==2000]
df_new.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1
2000-01-03,2000,1,3,107.492828,110.898659,106.95507,110.898659,60.474785,10823694.0,3.943588,1
2000-01-04,2000,1,4,108.986618,109.46463,105.999046,107.134323,58.422077,8606279.0,3.465584,1
2000-01-05,2000,1,5,107.97084,114.483749,107.194077,110.898659,60.474785,13318927.0,7.289673,1
2000-01-06,2000,1,6,112.810707,113.706978,108.508606,108.986618,59.432137,8338607.0,5.198372,1
2000-01-07,2000,1,7,112.093689,112.750954,105.76004,108.508606,59.171471,12402108.0,6.990913,1


In [43]:
# several conditions combined with & (each one wrapped in parentheses)
df_new = df.loc[(df['Year']==2000) & (df['Month']==5)]
df_new.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1
2000-05-01,2000,5,1,107.552582,108.209846,105.580788,107.07457,58.450939,4389121.0,2.629059,1
2000-05-02,2000,5,2,106.118546,108.628105,105.819794,106.477058,58.124744,4272910.0,2.808311,1
2000-05-03,2000,5,3,105.640533,106.238052,101.876198,103.36998,56.428619,5798815.0,4.361855,1
2000-05-04,2000,5,4,104.326004,104.385757,102.473709,102.891968,56.167702,3798340.0,1.912048,1
2000-05-05,2000,5,5,101.935944,104.804016,101.816444,103.130974,56.298161,4911075.0,2.987572,1


In [44]:
# filter by text pattern (keep only January rows)
df['Date'] =  df.index.copy()
df_new = df.loc[df['Date'].str.contains('-01-')]
df_new.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1,Date
1991-01-02,1991,1,2,26.977772,27.186901,26.798517,26.798517,12.031989,4341737.0,0.388384,0,1991-01-02
1991-01-03,1991,1,3,26.858271,27.216778,26.828394,26.888145,12.072229,5470162.0,0.388384,0,1991-01-03
1991-01-04,1991,1,4,26.947897,27.007648,26.738768,26.798517,12.031989,4540058.0,0.268881,0,1991-01-04
1991-01-07,1991,1,7,26.619265,26.738768,26.29063,26.350382,11.830789,4976450.0,0.448137,0,1991-01-07
1991-01-08,1991,1,8,26.380259,26.380259,25.991875,26.051624,11.696658,7177652.0,0.388384,0,1991-01-08


### Null and Duplicate Values

#### Null Values

A null (or missing) value represents the absence of a value in a specific cell of a DataFrame or Series. Null values can occur for several reasons, such as lost or unavailable data, measurement errors, or data entry problems.

Pandas represents missing values in numeric data with `NaN` (short for "Not a Number"), a special floating-point value defined by the IEEE 754 standard for undefined results. In non-numeric (object) columns, a missing value can also appear as `None`. Methods such as `isnull()` detect both.
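The behaviour of `NaN` and `None` can be checked with two tiny Series. This is only a sketch with invented values; the `text` Series is forced to `object` dtype so that `None` is stored as-is.

```python
import numpy as np
import pandas as pd

numeric = pd.Series([1.0, np.nan, 3.0])
text = pd.Series(["a", None, "c"], dtype=object)

print(np.nan == np.nan)          # False: NaN never compares equal to itself
print(numeric.isna().tolist())   # missing numeric value is flagged
print(text.isna().tolist())      # None is flagged as missing too
print(numeric.sum())             # NaN is skipped by default (skipna=True)
```

Because `NaN` is not equal to itself, a comparison such as `x == np.nan` cannot find missing values; `isna()` (or its alias `isnull()`) is the reliable check.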

In [45]:
# number of null values per column
df.isnull().sum()

Year              0
Month             0
Day               0
Open              2
High              2
Low               2
Close             2
Adj Close         2
Volume            2
Diff_high_low     2
greater_than_1    0
Date              0
dtype: int64

In [46]:
# drop the rows that contain null values
df_new = df.dropna()
df_new.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1,Date
1991-01-02,1991,1,2,26.977772,27.186901,26.798517,26.798517,12.031989,4341737.0,0.388384,0,1991-01-02
1991-01-03,1991,1,3,26.858271,27.216778,26.828394,26.888145,12.072229,5470162.0,0.388384,0,1991-01-03
1991-01-04,1991,1,4,26.947897,27.007648,26.738768,26.798517,12.031989,4540058.0,0.268881,0,1991-01-04
1991-01-07,1991,1,7,26.619265,26.738768,26.29063,26.350382,11.830789,4976450.0,0.448137,0,1991-01-07
1991-01-08,1991,1,8,26.380259,26.380259,25.991875,26.051624,11.696658,7177652.0,0.388384,0,1991-01-08


In [47]:
# fill null values with a specific value, for example zero
df_new = df.fillna(0)
df_new.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1,Date
1990-12-31,1990,12,31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1990-12-31
1990-12-31,1990,12,31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1990-12-31
1991-01-02,1991,1,2,26.977772,27.186901,26.798517,26.798517,12.031989,4341737.0,0.388384,0,1991-01-02
1991-01-03,1991,1,3,26.858271,27.216778,26.828394,26.888145,12.072229,5470162.0,0.388384,0,1991-01-03
1991-01-04,1991,1,4,26.947897,27.007648,26.738768,26.798517,12.031989,4540058.0,0.268881,0,1991-01-04


#### Duplicate Data

In Pandas, duplicate data can be handled with the `duplicated()` and `drop_duplicates()` methods.

* `duplicated()` returns a boolean Series that marks each row repeating an earlier row with the same values. By default, the first occurrence is not marked.
* `drop_duplicates()` removes the duplicated rows from a DataFrame, keeping the first occurrence by default.
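The two methods can be sketched on a toy frame that mimics the repeated all-null row in the IBM data. The values are invented; note that missing values in the same position are treated as equal when rows are compared.

```python
import pandas as pd

nan = float("nan")
frame = pd.DataFrame({"Year": [1990, 1990, 1991], "Open": [nan, nan, 26.9]})

mask = frame.duplicated()               # True for rows that repeat an earlier row
print(mask.tolist())

all_members = frame.duplicated(keep=False)   # mark every row of a duplicate group
print(all_members.tolist())

deduped = frame.drop_duplicates()       # keeps the first occurrence by default
print(len(deduped))
```
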


In [48]:
# count duplicated rows
df.duplicated().sum()

1

In [49]:
# drop duplicated rows
df_new = df.drop_duplicates()
df_new.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1,Date
1990-12-31,1990,12,31,,,,,,,,0,1990-12-31
1991-01-02,1991,1,2,26.977772,27.186901,26.798517,26.798517,12.031989,4341737.0,0.388384,0,1991-01-02
1991-01-03,1991,1,3,26.858271,27.216778,26.828394,26.888145,12.072229,5470162.0,0.388384,0,1991-01-03
1991-01-04,1991,1,4,26.947897,27.007648,26.738768,26.798517,12.031989,4540058.0,0.268881,0,1991-01-04
1991-01-07,1991,1,7,26.619265,26.738768,26.29063,26.350382,11.830789,4976450.0,0.448137,0,1991-01-07


In [50]:
# confirm that no duplicated rows remain
df_new.duplicated().sum()

0

### Date Manipulation

Pandas was developed in the context of financial modeling, so it includes several tools for working with dates, times, and time-indexed data.

We will start with Python's own tools for handling dates and times before moving on to the ones Pandas provides.

In [51]:
import datetime
now = datetime.datetime.now()
print(now)

2024-04-29 14:12:29.977893


You can access specific parts of a `datetime.datetime` object, such as the year, month, day, hour, minute, and second. The most common attributes are:

* `year`: the year.
* `month`: the month, as a number between 1 and 12.
* `day`: the day of the month, as a number between 1 and 31.
* `hour`: the hour of the day, as a number between 0 and 23.
* `minute`: the minutes of the hour, as a number between 0 and 59.
* `second`: the seconds of the minute, as a number between 0 and 59.
* `microsecond`: the microseconds of the second, as a number between 0 and 999999.
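The attributes listed above can be read directly from any `datetime.datetime` object. The short sketch below builds a fixed timestamp (the same value printed by the previous cell, hard-coded here so the result is reproducible) and collects each component in a dictionary; the variable names `ts` and `parts` are only illustrative.

```python
import datetime

# Fixed timestamp so the example is reproducible (instead of datetime.now()).
ts = datetime.datetime(2024, 4, 29, 14, 12, 29, 977893)

# Read each component through its attribute.
parts = {
    "year": ts.year,
    "month": ts.month,
    "day": ts.day,
    "hour": ts.hour,
    "minute": ts.minute,
    "second": ts.second,
    "microsecond": ts.microsecond,
}

for name, value in parts.items():
    print(f"{name}: {value}")
```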

Let's apply all of the above to the dataframe we have been working with:

In [52]:
# convert the Date column to datetime
df['Date'] = pd.to_datetime(df['Date'])
df.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1,Date
1990-12-31,1990,12,31,,,,,,,,0,1990-12-31
1990-12-31,1990,12,31,,,,,,,,0,1990-12-31
1991-01-02,1991,1,2,26.977772,27.186901,26.798517,26.798517,12.031989,4341737.0,0.388384,0,1991-01-02
1991-01-03,1991,1,3,26.858271,27.216778,26.828394,26.888145,12.072229,5470162.0,0.388384,0,1991-01-03
1991-01-04,1991,1,4,26.947897,27.007648,26.738768,26.798517,12.031989,4540058.0,0.268881,0,1991-01-04


In [53]:
# inspect the column dtypes
df.dtypes

Year                       int64
Month                      int64
Day                        int64
Open                     float64
High                     float64
Low                      float64
Close                    float64
Adj Close                float64
Volume                   float64
Diff_high_low            float64
greater_than_1             int64
Date              datetime64[ns]
dtype: object

In [54]:
# extract components such as year, month, day, ...
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['day'] = df['Date'].dt.day
df['hour'] = df['Date'].dt.hour
df['minute'] = df['Date'].dt.minute
df['second'] = df['Date'].dt.second
df['microsecond'] = df['Date'].dt.microsecond

df.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1,Date,year,month,day,hour,minute,second,microsecond
1990-12-31,1990,12,31,,,,,,,,0,1990-12-31,1990,12,31,0,0,0,0
1990-12-31,1990,12,31,,,,,,,,0,1990-12-31,1990,12,31,0,0,0,0
1991-01-02,1991,1,2,26.977772,27.186901,26.798517,26.798517,12.031989,4341737.0,0.388384,0,1991-01-02,1991,1,2,0,0,0,0
1991-01-03,1991,1,3,26.858271,27.216778,26.828394,26.888145,12.072229,5470162.0,0.388384,0,1991-01-03,1991,1,3,0,0,0,0
1991-01-04,1991,1,4,26.947897,27.007648,26.738768,26.798517,12.031989,4540058.0,0.268881,0,1991-01-04,1991,1,4,0,0,0,0
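The `hour`, `minute`, `second`, and `microsecond` columns are all zero because the `Date` values contain no time of day, so they are parsed as midnight. The sketch below illustrates this on a small hypothetical Series (`dates`, built from the first dates in the table) and shows two other things a datetime column allows: deriving the weekday name with `.dt.day_name()` and filtering by comparison with a `pd.Timestamp`.

```python
import pandas as pd

# Hypothetical sample built from the first dates shown in the table.
dates = pd.Series(pd.to_datetime(["1990-12-31", "1991-01-02", "1991-01-03"]))

# Date-only strings are parsed as midnight, so the hour component is zero.
hours = dates.dt.hour.tolist()

# The .dt accessor also exposes derived fields, such as the weekday name.
weekdays = dates.dt.day_name().tolist()

# A datetime Series can be compared against a Timestamp to filter rows.
in_1991 = dates[dates >= pd.Timestamp("1991-01-01")]

print(hours)
print(weekdays)
print(in_1991)
```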


In [55]:
# list of columns to drop
columnas_a_eliminar = ['year', 'month', 'day', 'hour', 'minute', 'second', 'microsecond']

# drop the columns
df = df.drop(columnas_a_eliminar, axis=1)

# show the dataframe
df.head()

Date,Year,Month,Day,Open,High,Low,Close,Adj Close,Volume,Diff_high_low,greater_than_1,Date
1990-12-31,1990,12,31,,,,,,,,0,1990-12-31
1990-12-31,1990,12,31,,,,,,,,0,1990-12-31
1991-01-02,1991,1,2,26.977772,27.186901,26.798517,26.798517,12.031989,4341737.0,0.388384,0,1991-01-02
1991-01-03,1991,1,3,26.858271,27.216778,26.828394,26.888145,12.072229,5470162.0,0.388384,0,1991-01-03
1991-01-04,1991,1,4,26.947897,27.007648,26.738768,26.798517,12.031989,4540058.0,0.268881,0,1991-01-04


## References

1. [Python Pandas Tutorial: A Complete Introduction for Beginners](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/)
2. [General functions](https://pandas.pydata.org/pandas-docs/stable/reference/general_functions.html)