# Week 6: Data Manipulation

## 📒 Data Preprocessing and Analysis

Data preprocessing is a crucial stage of any analysis: it ensures the data is clean and ready to use.

### ⚠️ Common data problems
If the input data is poorly structured or contains errors, the analysis and any predictions built on it will be wrong. Common problems include:
-  Missing data (NaNs)
-  Duplicates
-  Errors in column names
-  Outliers

In [1]:
import pandas as pd
df = pd.read_csv("Ejemplos/binary-star-planet.csv")
df

Unnamed: 0,System;Link;Star/Planet;Discovery;Spectral Type;Distance [pc];Mass Ratio;a_binary [au];e_sec;Number of Planets;Planet Motion Type;m1 [m_sun];m2 [m_sun];Comments;Minimum Mass [M_J];Semi-major axis [au];Orbital Period [d];Eccentricity;Argument of Perihelion [deg];Radius [R_J];Inclination [deg];Detection Method
0,HAT-P-4 b / TYC 2569-744-1;https://academic.ou...
1,HAT-P-4 b;https://academic.oup.com/mnras/artic...
2,CoRoT-7 Abc B;http://exoplanet.eu/catalog/coro...
3,CoRoT-7 b;http://exoplanet.eu/catalog/corot-7_...
4,CoRoT-7 c;http://exoplanet.eu/catalog/corot-7_...
...,...
366,Ross 458 c ;http://exoplanet.eu/catalog/ross_4...
367,HD 106906 AB b;http://exoplanet.eu/catalog/hd_...
368,HD 106906 AB b;http://exoplanet.eu/catalog/hd_...
369,ROXs 42AB b;http://exoplanet.eu/catalog/roxs_4...
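
The output above packs every field into a single column, which suggests the file is delimited by semicolons rather than commas. The following is a minimal sketch of how passing `sep=";"` to `pd.read_csv` splits the fields. It uses a small in-memory sample whose rows are invented for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited sample standing in for the real file
raw = "System;Distance [pc];Number of Planets\nA b;310.0;1\nB c;150.0;2\n"

# Default separator (comma): each line ends up in a single column
wrong = pd.read_csv(io.StringIO(raw))

# Explicit separator: each field gets its own column
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.shape)
print(right.shape)
print(right.columns.tolist())
```

Checking `df.shape` or `df.columns` right after loading is a quick way to catch a wrong delimiter.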


In [2]:
df = pd.read_csv("Ejemplos/Exo_planets_DATA_EU.csv")
df.head()

Unnamed: 0,name,planet_status,mass,mass_error_min,mass_error_max,mass_sini,mass_sini_error_min,mass_sini_error_max,radius,radius_error_min,...,star_sp_type,star_age,star_age_error_min,star_age_error_max,star_teff,star_teff_error_min,star_teff_error_max,star_detected_disc,star_magnetic_field,star_alternate_names
0,109 Psc b,Confirmed,5.743,0.289,0.289,6.383,0.078,0.078,1.152,,...,G5 IV,6.9,0.6,0.6,5600.0,80.0,80.0,,,HD 10697
1,112 Psc b,Confirmed,0.033,0.005,0.005,0.033,0.005,0.005,,,...,G0IV,,,,5986.0,105.437,105.437,,,HD 12235
2,112 Psc c,Confirmed,9.866,1.781,1.781,,,,,,...,G0IV,,,,5986.0,105.437,105.437,,,HD 12235
3,11 Com Ab,Confirmed,16.1284,1.53491,1.53491,16.1284,1.53491,1.53491,,,...,G8III,,,,4742.0,100.0,100.0,,,HD 107383
4,11 UMi b,Confirmed,11.0873,1.1,1.1,11.0873,1.1,1.1,,,...,K4III,1.56,0.54,0.54,4340.0,70.0,70.0,,,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7327 entries, 0 to 7326
Data columns (total 98 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   name                        7327 non-null   object 
 1   planet_status               7327 non-null   object 
 2   mass                        4369 non-null   float64
 3   mass_error_min              3662 non-null   float64
 4   mass_error_max              3662 non-null   float64
 5   mass_sini                   1387 non-null   float64
 6   mass_sini_error_min         1134 non-null   float64
 7   mass_sini_error_max         1134 non-null   float64
 8   radius                      4901 non-null   float64
 9   radius_error_min            4319 non-null   float64
 10  radius_error_max            4319 non-null   float64
 11  orbital_period              6064 non-null   float64
 12  orbital_period_error_min    5292 non-null   float64
 13  orbital_period_error_max    5292 

### 🏷️ Renaming columns
Column names are sometimes unclear or contain spaces that make them awkward to access. We can fix this as follows:

In [4]:
import pandas as pd

# the measurements are stored in a list of lists
measurements = [['Sun', 146, 152],
                ['Moon', 0.36, 0.41],
                ['Mercury', 82, 217],
                ['Venus', 38, 261],
                ['Mars', 56, 401],
                ['Jupiter', 588, 968],
                ['Saturn', 1195, 1660],
                ['Uranus', 2750, 3150],
                ['Neptune', 4300, 4700],
                ['Halley\'s comet', 6, 5400]]

# the column names are stored in the variable header
# (note the trailing space in the first name)
header = ['Celestial bodies ', 'MIN', 'MAX']

# store the DataFrame in the variable celestial
celestial = pd.DataFrame(data=measurements, columns=header)

celestial

Unnamed: 0,Celestial bodies,MIN,MAX
0,Sun,146.0,152.0
1,Moon,0.36,0.41
2,Mercury,82.0,217.0
3,Venus,38.0,261.0
4,Mars,56.0,401.0
5,Jupiter,588.0,968.0
6,Saturn,1195.0,1660.0
7,Uranus,2750.0,3150.0
8,Neptune,4300.0,4700.0
9,Halley's comet,6.0,5400.0


In [5]:
print(celestial.columns)

Index(['Celestial bodies ', 'MIN', 'MAX'], dtype='object')


In [6]:
celestial = celestial.rename(columns={celestial.columns[0]:"celestial_bodies", celestial.columns[1]:"min",celestial.columns[2]:"max"})

In [7]:
celestial

Unnamed: 0,celestial_bodies,min,max
0,Sun,146.0,152.0
1,Moon,0.36,0.41
2,Mercury,82.0,217.0
3,Venus,38.0,261.0
4,Mars,56.0,401.0
5,Jupiter,588.0,968.0
6,Saturn,1195.0,1660.0
7,Uranus,2750.0,3150.0
8,Neptune,4300.0,4700.0
9,Halley's comet,6.0,5400.0
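
When there are many columns, writing the mapping by hand becomes tedious. One possible alternative, sketched here on a shortened copy of the table above, is to normalize all names at once with the string methods of the column index. The cleaning rules chosen (strip, lowercase, underscores) are an assumption about what "clean" means here.

```python
import pandas as pd

celestial = pd.DataFrame(
    [["Sun", 146, 152], ["Moon", 0.36, 0.41]],
    columns=["Celestial bodies ", "MIN", "MAX"],
)

# Strip surrounding whitespace, lowercase, and replace inner spaces with underscores
celestial.columns = (
    celestial.columns.str.strip().str.lower().str.replace(" ", "_")
)

print(celestial.columns.tolist())
```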


### ⁉️ Handling missing values
Missing values can cause problems in the analysis. We can:
- Count them with `isna().sum()`
- Drop them with `dropna()`
- Fill them with `fillna()`

In [8]:
import pandas as pd
df = pd.read_csv("Ejemplos/data.csv")
df.head()

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0


In [9]:
print(df.isna().sum())

Duration    0
Pulse       0
Maxpulse    0
Calories    5
dtype: int64


In [10]:
# print the row labels where Calories is missing
for i in df.index[df['Calories'].isna()]:
    print(i)

17
27
91
118
141


In [11]:
df.head(20)

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
5,60,102,127,300.0
6,60,110,136,374.0
7,45,104,134,253.3
8,30,109,133,195.1
9,60,98,124,269.0


In [12]:
df.iloc[27]

Duration     60.0
Pulse       103.0
Maxpulse    132.0
Calories      NaN
Name: 27, dtype: float64

In [13]:
# using dropna: returns a new DataFrame without the rows that contain NaN
df2 = df.dropna()

In [14]:
df2.head(20)

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
5,60,102,127,300.0
6,60,110,136,374.0
7,45,104,134,253.3
8,30,109,133,195.1
9,60,98,124,269.0


In [15]:
# using fillna: replace every NaN with 0, modifying df in place
df.fillna(0, inplace=True)

In [16]:
df.head(20)

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
0,60,110,130,409.1
1,60,117,145,479.0
2,60,103,135,340.0
3,45,109,175,282.4
4,45,117,148,406.0
5,60,102,127,300.0
6,60,110,136,374.0
7,45,104,134,253.3
8,30,109,133,195.1
9,60,98,124,269.0


In [17]:
df.iloc[27]

Duration     60.0
Pulse       103.0
Maxpulse    132.0
Calories      0.0
Name: 27, dtype: float64
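
Filling with 0 is simple, but it can pull averages down. As a sketch of one common alternative (not what the cells above do), the example below fills the gaps with the column mean. It uses a tiny made-up table, and the `subset` argument restricts `dropna()` to a single column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Duration": [60, 45, 30, 60],
                   "Calories": [400.0, np.nan, 200.0, np.nan]})

n_missing = df["Calories"].isna().sum()      # number of NaNs in the column
dropped = df.dropna(subset=["Calories"])     # keep only rows with a Calories value
filled = df.fillna({"Calories": df["Calories"].mean()})  # fill NaNs with the column mean

print(n_missing)
print(len(dropped))
print(filled["Calories"].tolist())
```

Which strategy is appropriate depends on the data: dropping rows loses information, while filling invents it.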

### 🖇️ Handling duplicate values
Duplicate data can bias the analysis. We can:
- Count them with `duplicated().sum()`
- Drop them with `drop_duplicates()`
- Reset the index with `reset_index()`
- List the unique values of a column with `unique()`
- Count the unique values of a column with `nunique()`

In [18]:
import pandas as pd 
  
data = pd.read_csv("Ejemplos/employees.csv") 
data 

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.170,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.340,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,11/23/2014,6:09 AM,132483,16.655,False,Distribution
996,Phillip,Male,1/31/1984,6:30 AM,42392,19.675,False,Finance
997,Russell,Male,5/20/2013,12:39 PM,96914,1.421,False,Product
998,Larry,Male,4/20/2013,4:45 PM,60500,11.985,False,Business Development


In [19]:
data["Team"].duplicated().sum()

989

In [20]:
data_nuevo = data.drop_duplicates(subset = "Team")
data_nuevo

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
5,Dennis,Male,4/18/1987,1:35 AM,115163,10.125,False,Legal
6,Ruby,Female,8/17/1987,4:20 PM,65476,10.012,True,Product
8,Angela,Female,11/22/2005,6:29 AM,95570,18.523,True,Engineering
9,Frances,Female,8/8/2002,6:51 AM,139852,7.524,True,Business Development
12,Brandon,Male,12/1/1980,1:08 AM,112807,17.492,True,Human Resources
13,Gary,Male,1/27/2008,11:40 PM,109831,5.831,False,Sales


In [21]:
data_indice = data_nuevo.reset_index()
data_indice

Unnamed: 0,index,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
4,5,Dennis,Male,4/18/1987,1:35 AM,115163,10.125,False,Legal
5,6,Ruby,Female,8/17/1987,4:20 PM,65476,10.012,True,Product
6,8,Angela,Female,11/22/2005,6:29 AM,95570,18.523,True,Engineering
7,9,Frances,Female,8/8/2002,6:51 AM,139852,7.524,True,Business Development
8,12,Brandon,Male,12/1/1980,1:08 AM,112807,17.492,True,Human Resources
9,13,Gary,Male,1/27/2008,11:40 PM,109831,5.831,False,Sales


In [22]:
print(data["Team"].unique())
print(data["Team"].nunique())

['Marketing' nan 'Finance' 'Client Services' 'Legal' 'Product'
 'Engineering' 'Business Development' 'Human Resources' 'Sales'
 'Distribution']
10
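
Two details in the outputs above are worth noting: `reset_index()` keeps the old index as a new `index` column unless `drop=True` is passed, and `nunique()` skips `NaN` by default while `unique()` lists it. That is why eleven values are printed but the count is 10. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "name": ["Ana", "Luis", "Ana", "Marta", "Luis"],
    "team": ["Sales", "Legal", "Sales", np.nan, "Legal"],
})

n_dupes = data.duplicated().sum()   # rows that repeat an earlier row in full
deduped = data.drop_duplicates().reset_index(drop=True)  # drop=True discards the old index

print(n_dupes)
print(deduped.index.tolist())
print(len(data["team"].unique()))   # unique() keeps NaN as a value
print(data["team"].nunique())       # nunique() skips NaN by default
```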


### 📊 Grouping and Sorting Data
We will often work with large datasets that need to be reorganized and summarized by subgroup. For that we can use the `.groupby()` method together with an aggregation such as `.count()`.

In [23]:
data

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.170,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.340,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,11/23/2014,6:09 AM,132483,16.655,False,Distribution
996,Phillip,Male,1/31/1984,6:30 AM,42392,19.675,False,Finance
997,Russell,Male,5/20/2013,12:39 PM,96914,1.421,False,Product
998,Larry,Male,4/20/2013,4:45 PM,60500,11.985,False,Business Development


In [24]:
data.groupby(by='Team').count()

Team,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management
Business Development,99,90,101,101,101,101,99
Client Services,100,90,106,106,106,106,100
Distribution,77,72,90,90,90,90,77
Engineering,86,84,92,92,92,92,86
Finance,97,85,102,102,102,102,97
Human Resources,85,82,91,91,91,91,85
Legal,86,69,88,88,88,88,86
Marketing,91,81,98,98,98,98,91
Product,92,85,95,95,95,95,92
Sales,86,78,94,94,94,94,86


In [25]:
data.groupby(by='Team')["Salary"].count()

Team
Business Development    101
Client Services         106
Distribution             90
Engineering              92
Finance                 102
Human Resources          91
Legal                    88
Marketing                98
Product                  95
Sales                    94
Name: Salary, dtype: int64

In [26]:
data.groupby(by='Team')["Salary"].sum()

Team
Business Development    9278498
Client Services         9351789
Distribution            7965042
Engineering             8672766
Finance                 9406387
Human Resources         8275952
Legal                   7858718
Marketing               8862688
Product                 8423223
Sales                   8664303
Name: Salary, dtype: int64
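
Rather than running one `groupby` per statistic, several aggregations can be requested at once with `.agg()`. The sketch below uses a small invented table in place of `employees.csv`.

```python
import pandas as pd

data = pd.DataFrame({
    "Team": ["Sales", "Sales", "Legal", "Legal", "Legal"],
    "Salary": [100, 200, 50, 70, 90],
})

# One row per team, one column per aggregation
summary = data.groupby("Team")["Salary"].agg(["count", "sum", "mean"])
print(summary)
```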

Values can also be sorted with the `.sort_values()` method.

In [27]:
data.sort_values(by='Last Login Time')

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
647,Donald,Male,4/6/1988,10:00 AM,122920,5.320,False,
739,Carlos,Male,1/25/1981,10:00 AM,138598,14.737,False,Sales
72,Bobby,Male,5/7/2007,10:01 AM,54043,3.833,False,Product
850,Charles,Male,9/3/1997,10:04 AM,148291,6.002,False,
676,Annie,Female,6/6/1992,10:04 AM,138925,9.801,True,Marketing
...,...,...,...,...,...,...,...,...
538,Adam,Male,10/8/2010,9:53 PM,45181,3.491,False,Human Resources
395,Kathy,Female,11/25/2001,9:55 PM,93753,7.094,True,Sales
198,Maria,Female,12/27/1990,9:57 PM,36067,9.640,True,Product
658,Jessica,Female,10/19/1986,9:58 PM,90285,13.591,True,Client Services


In [28]:
data.sort_values(by='First Name')

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2/17/2012,10:20 AM,61602,11.849,True,Marketing
327,Aaron,Male,1/29/1994,6:48 PM,58755,5.097,True,Marketing
440,Aaron,Male,7/22/1990,2:53 PM,52119,11.343,True,Client Services
937,Aaron,,1/22/1986,7:39 PM,63126,18.424,False,Client Services
137,Adam,Male,5/21/2011,1:45 AM,95327,15.120,False,Distribution
...,...,...,...,...,...,...,...,...
902,,Male,5/23/2001,7:52 PM,103877,6.322,,Distribution
925,,Female,8/23/2000,4:19 PM,95866,19.388,,Sales
946,,Female,9/15/1985,1:50 AM,133472,16.941,,Distribution
947,,Male,7/30/2012,3:07 PM,107351,5.329,,Marketing
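
In the first sort above, `Last Login Time` is stored as text, so the rows are ordered alphabetically ("10:00 AM" comes first and "9:58 PM" last) rather than chronologically. One possible fix, sketched on three invented rows, is to parse the strings as times only for the comparison, using the `key` argument of `sort_values()`. The format string assumes 12-hour clock values like those shown.

```python
import pandas as pd

data = pd.DataFrame({
    "First Name": ["Donald", "Jessica", "Thomas"],
    "Last Login Time": ["10:00 AM", "9:58 PM", "6:53 AM"],
})

# Text sort: compares character by character, so "10:00 AM" sorts before "6:53 AM"
as_text = data.sort_values(by="Last Login Time")

# Chronological sort: parse the strings as 12-hour clock times just for the comparison
as_time = data.sort_values(
    by="Last Login Time",
    key=lambda s: pd.to_datetime(s, format="%I:%M %p"),
)

print(as_text["First Name"].tolist())
print(as_time["First Name"].tolist())
```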


## 🧲 Filtering Data
You will often encounter massive datasets with millions or even billions of rows. In most cases, however, you only need to work with a smaller, relevant portion of the full dataset.

### 🎯 DataFrame indexing

In [29]:
import pandas as pd      
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")
data

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...,...
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


| Selection | Syntax (`fila` = row label, `columna` = column label) |
| :---------------- | :------: |
| A single cell | `.loc[fila, columna]` |
| A single column | `.loc[:, columna]` |
| Several columns | `.loc[:, [columna_1, columna_2, columna_5]]` |
| Several consecutive columns | `.loc[:, columna_1:columna_5]` |
| A single row | `.loc[fila]` |
| All rows from a given row onward | `.loc[fila_2:]` |
| All rows up to and including a given row | `.loc[:fila_4]` |
| Several consecutive rows | `.loc[fila_2:fila_4]` |

In [30]:
data.loc[0:10, "Age":"College"]

Unnamed: 0,Age,Height,Weight,College
0,25.0,6-2,180.0,Texas
1,25.0,6-6,235.0,Marquette
2,27.0,6-5,205.0,Boston University
3,22.0,6-5,185.0,Georgia State
4,29.0,6-10,231.0,
5,29.0,6-9,240.0,
6,21.0,6-8,235.0,LSU
7,25.0,7-0,238.0,Gonzaga
8,22.0,6-2,190.0,Louisville
9,22.0,6-4,220.0,Oklahoma State


In [31]:
data.loc[0:10, ["Age","College"]]

Unnamed: 0,Age,College
0,25.0,Texas
1,25.0,Marquette
2,27.0,Boston University
3,22.0,Georgia State
4,29.0,
5,29.0,
6,21.0,LSU
7,25.0,Gonzaga
8,22.0,Louisville
9,22.0,Oklahoma State
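
Slices in `.loc` work on labels and include both endpoints, unlike ordinary Python slices. The position-based counterpart, `.iloc`, excludes the end position. A short sketch on an invented four-row table shows the difference:

```python
import pandas as pd

data = pd.DataFrame(
    {"Age": [25, 27, 22, 29], "Height": ["6-2", "6-5", "6-5", "6-10"],
     "College": ["Texas", "Boston University", "Georgia State", None]}
)

by_label = data.loc[0:2, "Age":"College"]   # labels 0, 1 and 2; columns Age through College
by_position = data.iloc[0:2, 0:2]           # positions 0 and 1 only; first two columns

print(by_label.shape)
print(by_position.shape)
```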


### 🔎 Filtering with `.query()`
We can select rows that meet specific conditions with the `.query()` method.

In [32]:
data.query("Age < 26")

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
...,...,...,...,...,...,...,...,...,...
446,Derrick Favors,Utah Jazz,15.0,PF,24.0,6-10,265.0,Georgia Tech,12000000.0
447,Rudy Gobert,Utah Jazz,27.0,C,23.0,7-1,245.0,,1175880.0
449,Rodney Hood,Utah Jazz,5.0,SG,23.0,6-8,206.0,Duke,1348440.0
452,Trey Lyles,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,2239800.0


In [33]:
UN = ["Texas", "Kentucky", "LSU"]

datos = data.query("~(College in @UN)")[['Name', 'Number','Age']]

datos

Unnamed: 0,Name,Number,Age
1,Jae Crowder,99.0,25.0
2,John Holland,30.0,27.0
3,R.J. Hunter,28.0,22.0
4,Jonas Jerebko,8.0,29.0
5,Amir Johnson,90.0,29.0
...,...,...,...
453,Shelvin Mack,8.0,26.0
454,Raul Neto,25.0,24.0
455,Tibor Pleiss,21.0,26.0
456,Jeff Withey,24.0,26.0


In [34]:
UN = {"universidad_1":"Texas", "universidad_2":"Kentucky", "universidad_3":"LSU"}

datos = data.query("~(College in @UN.values())")[['Name', 'Number','Age']]

datos

Unnamed: 0,Name,Number,Age
1,Jae Crowder,99.0,25.0
2,John Holland,30.0,27.0
3,R.J. Hunter,28.0,22.0
4,Jonas Jerebko,8.0,29.0
5,Amir Johnson,90.0,29.0
...,...,...,...
453,Shelvin Mack,8.0,26.0
454,Raul Neto,25.0,24.0
455,Tibor Pleiss,21.0,26.0
456,Jeff Withey,24.0,26.0


In [35]:
datos.query("Age < 25")

Unnamed: 0,Name,Number,Age
3,R.J. Hunter,28.0,22.0
8,Terry Rozier,12.0,22.0
9,Marcus Smart,36.0,22.0
10,Jared Sullinger,7.0,24.0
16,Markel Brown,22.0,24.0
...,...,...,...
445,Dante Exum,11.0,20.0
446,Derrick Favors,15.0,24.0
447,Rudy Gobert,27.0,23.0
449,Rodney Hood,5.0,23.0
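
The two steps above (exclude some colleges, then filter by age) can also be written as a single expression, since `.query()` accepts `and`, `or`, `not in`, and local variables prefixed with `@`. A minimal sketch on invented rows, where `max_age` and `excluded` are hypothetical names:

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22, 31, 24, 29],
    "College": ["LSU", "Texas", "Duke", "LSU"],
})

max_age = 25                     # local variable, referenced with @ inside the expression
excluded = ["Texas", "Kentucky"]

young = data.query("Age < @max_age and College not in @excluded")
print(young["Name"].tolist())
```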


### ➿ Filtering on Multiple Conditions
We can combine several conditions to build more specific filters:

In [36]:
datos_filtrados = data[(data["Age"] >= 28) & (data["College"] == "LSU")]

In [37]:
datos_filtrados

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
106,Brandon Bass,Los Angeles Lakers,2.0,PF,31.0,6-8,250.0,LSU,3000000.0
380,Garrett Temple,Washington Wizards,17.0,SG,30.0,6-6,195.0,LSU,1100602.0
381,Marcus Thornton,Washington Wizards,15.0,SF,29.0,6-4,205.0,LSU,200600.0


Keep the following syntax in mind when filtering on multiple conditions:

| Operator | Description | Example |
|----------|------------|---------|
| `&` (AND) | Combines **two or more conditions**. Returns `True` only if **all** of them are true. In pandas it takes the place of Python's `and` operator. | `df[(df["edad"] > 18) & (df["salario"] > 3000)]` → Keeps rows where `edad` is greater than 18 **and** `salario` is greater than 3000. |
| ``\|`` (OR)  | Combines **two or more conditions**. Returns `True` if **at least one** of them is true. In pandas it takes the place of Python's `or` operator. | `df[(df["salario"] > 5000) \| (df["edad"] < 20)]` → Keeps rows where `salario` is greater than 5000 **or** `edad` is less than 20. |
| `~` (NOT) | **Negates a condition**: returns `True` wherever the original condition is `False`. In pandas it takes the place of Python's `not` operator. | `df[~(df["ciudad"] == "Bogotá")]` → Keeps rows where `ciudad` is **not** Bogotá. |
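
The three examples in the table can be run on a small invented DataFrame as a quick check. The parentheses around each comparison are required: `&` and `|` bind more tightly than `>` or `==`, so leaving them out changes the meaning of the expression.

```python
import pandas as pd

df = pd.DataFrame({"edad": [17, 25, 40], "salario": [6000, 2000, 4000],
                   "ciudad": ["Bogotá", "Cali", "Bogotá"]})

adults_well_paid = df[(df["edad"] > 18) & (df["salario"] > 3000)]  # both must hold
high_or_minor = df[(df["salario"] > 5000) | (df["edad"] < 20)]     # at least one holds
not_bogota = df[~(df["ciudad"] == "Bogotá")]                       # negated condition

print(adults_well_paid.index.tolist())
print(high_or_minor.index.tolist())
print(not_bogota.index.tolist())
```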



### Replacing values with `.where()`
The `.where()` method keeps the values where a condition is `True` and replaces the rest. Here, ages greater than 25 are kept and all others are set to 78:

In [38]:
data["Age"] = data["Age"].where(data["Age"] > 25, 78)

In [39]:
data

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,78.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,78.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,78.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...,...
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,78.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


The `.where()` method also accepts the combined conditions used for filtering.

In [40]:
# `where()` in pandas replaces values in a DataFrame based on a condition.
# Where the condition is `True` the value is kept; where it is `False` it is replaced with `NaN` (or another value, if given).

import pandas as pd

data = {"nombre": ["Ana", "Carlos", "Luis", "Marta"],
        "edad": [22, 35, 18, 40],
        "salario": [4000, 2500, 1000, 5000]}

df = pd.DataFrame(data)

df

Unnamed: 0,nombre,edad,salario
0,Ana,22,4000
1,Carlos,35,2500
2,Luis,18,1000
3,Marta,40,5000


In [41]:
# Keep the salary only where edad is greater than 25 and salario is greater than 3000
df["salario_filtrado"] = df["salario"].where((df["edad"] > 25) & (df["salario"] > 3000))
df

Unnamed: 0,nombre,edad,salario,salario_filtrado
0,Ana,22,4000,
1,Carlos,35,2500,
2,Luis,18,1000,
3,Marta,40,5000,5000.0
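
Because `.where()` replaces the values where the condition is `False`, it is easy to get the logic backwards. pandas also provides `.mask()`, which does the opposite and replaces where the condition is `True`. A short sketch comparing the two on the same data, with 0 chosen arbitrarily as the replacement value:

```python
import pandas as pd

df = pd.DataFrame({"edad": [22, 35, 18, 40], "salario": [4000, 2500, 1000, 5000]})
cond = (df["edad"] > 25) & (df["salario"] > 3000)

kept = df["salario"].where(cond, 0)    # keep where cond is True, else 0
masked = df["salario"].mask(cond, 0)   # the opposite: replace where cond is True

print(kept.tolist())
print(masked.tolist())
```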
