# 🐼 Pandas

---
> **Pandas** is a **Python** library specialized in data manipulation and analysis. It provides data structures and operations for working with numeric tables and time series; think of it as the *Excel of Python*.
---

# 📅 2023-11-22

In [1]:
import pandas as pd
import numpy as np

- **Series**: One-dimensional labeled arrays of data (a single column).
- **DataFrame**: Two-dimensional labeled arrays of data (a table).
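
The difference between the two structures can be sketched with a few values. This is a minimal illustration; the sample names and values are assumptions chosen for the example, not part of the class dataset.

```python
import pandas as pd

# A Series is one-dimensional: a single column of values plus an index.
ages = pd.Series([25, 22, 21], index=["Juan", "María", "Pedro"])  # hypothetical sample data

# A DataFrame is two-dimensional: several columns that share one row index.
people = pd.DataFrame(
    {"age": [25, 22, 21], "city": ["MEX", "USA", "CAN"]},  # hypothetical sample data
    index=["Juan", "María", "Pedro"],
)

print(ages.ndim, ages.shape)      # 1 (3,)
print(people.ndim, people.shape)  # 2 (3, 2)
```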

### 📊 Series

In [1]:
# List of vehicle brands.
marcas = [ 'Toyota', 'Volkswagen', 'Hyundai', 'Ford', 'Nissan', 'Chevrolet', ]

# List of alphabetical index labels.
index = [ 'a', 'b', 'c', 'd', 'e', 'f', ]

In [10]:
# Create a Series.
s = pd.Series(["MEX", "USA", "CAN"])

In [11]:
# Inspect the Series just created: a column of data with an index.
s

0    MEX
1    USA
2    CAN
dtype: object

In [12]:
# Values in the Series can be accessed through the index.
s[0]

'MEX'

In [13]:
# Create a Series from the list of vehicle brands and the index labels.
serie = pd.Series(marcas, index)

In [14]:
serie

a        Toyota
b    Volkswagen
c       Hyundai
d          Ford
e        Nissan
f     Chevrolet
dtype: object

In [16]:
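# Values in the Series can be accessed by position (with .iloc) or by label.
# A bare serie[0] on a label-indexed Series relies on a positional fallback that recent pandas versions deprecate.
print(serie.iloc[0])
print(serie['a'])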

Toyota
Toyota
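
As a small sketch of the two access styles, the explicit accessors `.loc` (by label) and `.iloc` (by position) make the intent unambiguous. The shortened list of brands and the variable names are assumptions for the example.

```python
import pandas as pd

# Shortened version of the brands Series, with letter labels as the index.
brands = pd.Series(["Toyota", "Volkswagen", "Hyundai"], index=["a", "b", "c"])

first_by_label = brands.loc["a"]    # label-based lookup
first_by_position = brands.iloc[0]  # position-based lookup

print(first_by_label, first_by_position)  # Toyota Toyota
```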


In [18]:
# Series built from a dictionary of ages.
edades = { "Juan": 25, "María": 22, "Pedro": 21, "Luis": 19, "Ana": 20, "Pablo": 23, }
serie_edades = pd.Series(edades)
serie_edades

Juan     25
María    22
Pedro    21
Luis     19
Ana      20
Pablo    23
dtype: int64

In [19]:
# Sales for the first and second quarters of the year.
q1 = { "MEX": 100, "USA": 200, "CAN": 300, "CHI": 400 }
q2 = { "ARG": 500, "USA": 600, "CAN": 700, "CHI": 800 }

In [22]:
sales_q1 = pd.Series(q1)
sales_q2 = pd.Series(q2)

In [23]:
sales_q1

MEX    100
USA    200
CAN    300
CHI    400
dtype: int64

In [24]:
sales_q2

ARG    500
USA    600
CAN    700
CHI    800
dtype: int64

In [25]:
# Multiplying a Python list by an integer does not multiply its elements; it repeats the list. Here the list is duplicated.
[2, 4] * 2

[2, 4, 2, 4]

In [26]:
# Multiplying a NumPy array by a scalar multiplies each element of the array by that scalar.
np.array([2, 4]) * 2

array([4, 8])

In [27]:
# Multiplying a Series by a scalar multiplies each element of the Series by that scalar.
sales_q1 * 2

MEX    200
USA    400
CAN    600
CHI    800
dtype: int64

# 📅 2023-11-22

In [12]:
import pandas as pd
import numpy as np

### 📊 Series: Continued

In [13]:
# Sales for the first and second quarters of the year.
q1 = { "MEX": 100, "USA": 200, "CAN": 300, "CHI": 400 }
q2 = { "ARG": 500, "USA": 600, "CAN": 700, "CHI": 800 }

In [14]:
sales_q1 = pd.Series(q1)
sales_q2 = pd.Series(q2)

In [15]:
sales_q1

MEX    100
USA    200
CAN    300
CHI    400
dtype: int64

In [16]:
sales_q2

ARG    500
USA    600
CAN    700
CHI    800
dtype: int64

In [17]:
# fill_value = 0: if a label is missing from one of the two Series, its value is treated as 0 instead of producing NaN.
ventas = sales_q1.add(sales_q2, fill_value = 0)

In [18]:
ventas

ARG     500.0
CAN    1000.0
CHI    1200.0
MEX     100.0
USA     800.0
dtype: float64
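
To see why `fill_value` matters, the sketch below compares the plain `+` operator with `add(..., fill_value=0)` on the same two Series. Labels present in only one Series (`MEX` and `ARG`) become `NaN` with `+`. The names `plain_sum` and `filled_sum` are illustrative.

```python
import pandas as pd

sales_q1 = pd.Series({"MEX": 100, "USA": 200, "CAN": 300, "CHI": 400})
sales_q2 = pd.Series({"ARG": 500, "USA": 600, "CAN": 700, "CHI": 800})

# Plain addition aligns on labels; labels missing on either side give NaN.
plain_sum = sales_q1 + sales_q2

# add() with fill_value=0 treats a missing label as 0 before adding.
filled_sum = sales_q1.add(sales_q2, fill_value=0)

print(plain_sum.isna().sum())  # 2
# The result is float64 because alignment introduces missing values;
# once nothing is missing it can be converted back to integers.
print(filled_sum.astype("int64"))
```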

In [19]:
# Sort the sales from highest to lowest.
ventas.sort_values(ascending = False)

CHI    1200.0
CAN    1000.0
USA     800.0
ARG     500.0
MEX     100.0
dtype: float64

In [20]:
# Check the data type of the Series.
ventas.dtype

dtype('float64')

### 🧮 DataFrames

In [29]:
# Create a 4x3 array of random integers from 0 to 100.
np.random.seed(42)
my_data = np.random.randint(0, 101, (4, 3))
my_data

array([[51, 92, 14],
       [71, 60, 20],
       [82, 86, 74],
       [74, 87, 99]])

In [30]:
# Create the row labels (index).
my_index = ['CA', 'NY', 'AZ', 'TX']
# Create the column labels.
my_columns = ['Jan', 'Feb', 'Mar']

In [31]:
# Create a DataFrame from an array. If no index or columns are given, default integer labels are used for both.
df = pd.DataFrame(my_data)

In [32]:
df

,0,1,2
0,51,92,14
1,71,60,20
2,82,86,74
3,74,87,99


In [33]:
# Create a DataFrame from an array, with explicit index and column labels.
df = pd.DataFrame(data = my_data, index = my_index, columns = my_columns)

In [34]:
df

,Jan,Feb,Mar
CA,51,92,14
NY,71,60,20
AZ,82,86,74
TX,74,87,99


In [35]:
# Show summary information about the DataFrame.
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, CA to TX
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Jan     4 non-null      int32
 1   Feb     4 non-null      int32
 2   Mar     4 non-null      int32
dtypes: int32(3)
memory usage: 80.0+ bytes


# 📅 2023-11-27

In [2]:
import pandas as pd
import numpy as np

### 🗺 `iloc` and `loc`

In [3]:
# Create a 4x3 array of random integers from 0 to 100.
np.random.seed(42)
my_data = np.random.randint(0, 101, (4, 3))
my_data
# Create the row labels (index).
my_index = ['CA', 'NY', 'AZ', 'TX']
# Create the column labels.
my_columns = ['Jan', 'Feb', 'Mar']
# Create a DataFrame from an array. If no index or columns are given, default integer labels are used for both.
df = pd.DataFrame(my_data)

In [4]:
df

,0,1,2
0,51,92,14
1,71,60,20
2,82,86,74
3,74,87,99


In [6]:
# Show the first row of the DataFrame (position 0).
df.iloc[0]

0    51
1    92
2    14
Name: 0, dtype: int32

In [8]:
# Look up the value of the cell at row position 2, column position 2.
df.iloc[2, 2]

74

In [13]:
# Show only the column at position 2:
# from all rows, select just that column.
df.iloc[:, 2]

0    14
1    20
2    74
3    99
Name: 2, dtype: int32

In [15]:
# Show column 0 as a DataFrame (a slice of columns keeps the result two-dimensional).
df.iloc[:, 0:1]

,0
0,51
1,71
2,82
3,74


In [18]:
# Show row 3 as a DataFrame. The open-ended slice 3: returns only row 3 because it is the last row.
df.iloc[3:, :]

,0,1,2
3,74,87,99
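
The cells above return either a Series or a DataFrame depending on how the position is written. As a rough sketch (using a hypothetical 4x3 frame filled with `np.arange`, not the random data above): a single integer drops a dimension, while a slice or a list keeps the result two-dimensional.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))  # hypothetical 4x3 frame

row_as_series = df.iloc[3]      # single integer -> Series
row_as_frame = df.iloc[[3]]     # list with one position -> DataFrame with one row
col_as_series = df.iloc[:, 0]   # single integer for the column -> Series
col_as_frame = df.iloc[:, 0:1]  # slice of columns -> DataFrame with one column

print(type(row_as_series).__name__, row_as_frame.shape)  # Series (1, 3)
print(type(col_as_series).__name__, col_as_frame.shape)  # Series (4, 1)
```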


In [19]:
# Create a DataFrame from an array, with explicit index and column labels.
df = pd.DataFrame(data = my_data, index = my_index, columns = my_columns)

In [20]:
df

,Jan,Feb,Mar
CA,51,92,14
NY,71,60,20
AZ,82,86,74
TX,74,87,99


In [21]:
# Retrieve the value 60 (row position 1, column position 1).
df.iloc[1, 1]

60

In [22]:
# Retrieve the value 87 (row position 3, column position 1).
df.iloc[3, 1]

87

In [23]:
# Find the value 20 using the row and column labels.
df.loc['NY', 'Mar']

20

In [24]:
# Show the data for NY.
df.loc['NY']

Jan    71
Feb    60
Mar    20
Name: NY, dtype: int32

In [26]:
# Show the rows from NY through TX and the columns from Jan through Mar (label slices include both endpoints).
df.loc["NY":"TX", "Jan":"Mar"]

,Jan,Feb,Mar
NY,71,60,20
AZ,82,86,74
TX,74,87,99
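
One detail worth a quick check: a label slice with `loc` includes its end label, whereas a position slice with `iloc` excludes its end position. The sketch below rebuilds the same table from its printed values; the names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

# Same values as the DataFrame shown above, typed in directly.
df = pd.DataFrame(
    [[51, 92, 14], [71, 60, 20], [82, 86, 74], [74, 87, 99]],
    index=["CA", "NY", "AZ", "TX"],
    columns=["Jan", "Feb", "Mar"],
)

by_label = df.loc["NY":"AZ"]  # end label AZ is included -> 2 rows
by_position = df.iloc[1:2]    # end position 2 is excluded -> 1 row

print(len(by_label), len(by_position))  # 2 1
```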


### 🗂 Reading .CSV Files

In [31]:
# Show the working directory:
%pwd

'c:\\Users\\Admin\\Documents\\ITS\\Semestre 9\\Data_Science\\Pandas\\Notas'

In [29]:
tips_df = pd.read_csv('c:\\Users\\Admin\\Documents\\ITS\\Semestre 9\\Data_Science\\Pandas\\Datasets\\tips.csv')

In [30]:
tips_df

,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251
...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17


In [32]:
# Show summary information about the DataFrame.
tips_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   sex               244 non-null    object 
 3   smoker            244 non-null    object 
 4   day               244 non-null    object 
 5   time              244 non-null    object 
 6   size              244 non-null    int64  
 7   price_per_person  244 non-null    float64
 8   Payer Name        244 non-null    object 
 9   CC Number         244 non-null    int64  
 10  Payment ID        244 non-null    object 
dtypes: float64(3), int64(2), object(6)
memory usage: 21.1+ KB


In [33]:
# Locate the value 3.5 (row label 2, column tip).
tips_df.loc[2, "tip"]

3.5

In [34]:
# Show only the first 5 rows.
tips_df.head()

,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251


In [35]:
# Show the first 10 rows.
tips_df.head(10)

,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251
5,25.29,4.71,Male,No,Sun,Dinner,4,6.32,Erik Smith,213140353657882,Sun9679
6,8.77,2.0,Male,No,Sun,Dinner,2,4.38,Kristopher Johnson,2223727524230344,Sun5985
7,26.88,3.12,Male,No,Sun,Dinner,4,6.72,Robert Buck,3514785077705092,Sun8157
8,15.04,1.96,Male,No,Sun,Dinner,2,7.52,Joseph Mcdonald,3522866365840377,Sun6820
9,14.78,3.23,Male,No,Sun,Dinner,2,7.39,Jerome Abbott,3532124519049786,Sun3775


In [37]:
# List the column names. The result is an Index object; its dtype describes the labels, not the column data.
tips_df.columns

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size',
       'price_per_person', 'Payer Name', 'CC Number', 'Payment ID'],
      dtype='object')

In [38]:
# Check the data type of each column.
tips_df.dtypes

total_bill          float64
tip                 float64
sex                  object
smoker               object
day                  object
time                 object
size                  int64
price_per_person    float64
Payer Name           object
CC Number             int64
Payment ID           object
dtype: object

In [39]:
# Show the last 5 records.
tips_df.tail()

,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.0,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.0,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17
243,18.78,3.0,Female,No,Thur,Dinner,2,9.39,Michelle Hardin,3511451626698139,Thur672


In [40]:
# Show the last 10 records.
tips_df.tail(10)

,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
234,15.53,3.0,Male,Yes,Sat,Dinner,2,7.76,Tracy Douglas,4097938155941930,Sat7220
235,10.07,1.25,Male,No,Sat,Dinner,2,5.04,Sean Gonzalez,3534021246117605,Sat4615
236,12.6,1.0,Male,Yes,Sat,Dinner,2,6.3,Matthew Myers,3543676378973965,Sat5032
237,32.83,1.17,Male,Yes,Sat,Dinner,2,16.42,Thomas Brown,4284722681265508,Sat2929
238,35.83,4.67,Female,No,Sat,Dinner,3,11.94,Kimberly Crane,676184013727,Sat9777
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.0,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.0,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17
243,18.78,3.0,Female,No,Thur,Dinner,2,9.39,Michelle Hardin,3511451626698139,Thur672


In [42]:
# Get summary statistics for the numeric columns.
tips_df.describe()

,total_bill,tip,size,price_per_person,CC Number
count,244.0,244.0,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672,7.888197,2563496000000000.0
std,8.902412,1.383638,0.9511,2.914234,2369340000000000.0
min,3.07,1.0,1.0,2.88,60406790000.0
25%,13.3475,2.0,2.0,5.8,30407310000000.0
50%,17.795,2.9,2.0,7.255,3525318000000000.0
75%,24.1275,3.5625,3.0,9.39,4553675000000000.0
max,50.81,10.0,6.0,20.27,6596454000000000.0


# 📅 2023-11-28

### 🗂 DataFrames from .CSV Files: Review and Additional Operations

In [2]:
import pandas as pd
import numpy as np

In [3]:
tips_df = pd.read_csv('c:\\Users\\Admin\\Documents\\ITS\\Semestre 9\\Data_Science\\Pandas\\Datasets\\tips.csv')

In [4]:
# Show the first 5 records.
tips_df.head()

,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251


In [5]:
# Show the last 5 records.
tips_df.tail()

,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.0,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.0,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17
243,18.78,3.0,Female,No,Thur,Dinner,2,9.39,Michelle Hardin,3511451626698139,Thur672


In [6]:
# Show summary statistics for the numeric columns.
tips_df.describe()

,total_bill,tip,size,price_per_person,CC Number
count,244.0,244.0,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672,7.888197,2563496000000000.0
std,8.902412,1.383638,0.9511,2.914234,2369340000000000.0
min,3.07,1.0,1.0,2.88,60406790000.0
25%,13.3475,2.0,2.0,5.8,30407310000000.0
50%,17.795,2.9,2.0,7.255,3525318000000000.0
75%,24.1275,3.5625,3.0,9.39,4553675000000000.0
max,50.81,10.0,6.0,20.27,6596454000000000.0


In [7]:
# Select a specific column.
tips_df["total_bill"]

0      16.99
1      10.34
2      21.01
3      23.68
4      24.59
       ...  
239    29.03
240    27.18
241    22.67
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64

In [8]:
# Check the Python type of a column object.
type(tips_df["total_bill"])

pandas.core.series.Series

---

In Pandas, each column of a DataFrame is a **Series**.

---
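
A short sketch of what this means in practice: single brackets return the column as a Series, while double brackets (a list of names) return a DataFrame, even when the list has one name. The two-row frame below reuses values from the first rows of the dataset; the variable names are illustrative.

```python
import pandas as pd

# Two rows taken from the head of the tips dataset.
tips = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})

one_col = tips["total_bill"]      # single brackets -> Series
sub_frame = tips[["total_bill"]]  # list of names -> DataFrame with one column

print(type(one_col).__name__)    # Series
print(type(sub_frame).__name__)  # DataFrame
```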

In [9]:
# Select several columns.
cols = ["total_bill", "tip", "smoker"]
tips_df[cols]

,total_bill,tip,smoker
0,16.99,1.01,No
1,10.34,1.66,No
2,21.01,3.50,No
3,23.68,3.31,No
4,24.59,3.61,No
...,...,...,...
239,29.03,5.92,No
240,27.18,2.00,Yes
241,22.67,2.00,Yes
242,17.82,1.75,No


In [10]:
# Or pass the list of names directly.
tips_df[["total_bill", "tip", "smoker"]]

,total_bill,tip,smoker
0,16.99,1.01,No
1,10.34,1.66,No
2,21.01,3.50,No
3,23.68,3.31,No
4,24.59,3.61,No
...,...,...,...
239,29.03,5.92,No
240,27.18,2.00,Yes
241,22.67,2.00,Yes
242,17.82,1.75,No


In [11]:
# Element-wise sum of two columns: total_bill + tip.
tips_df["total_bill"] + tips_df["tip"]

0      18.00
1      12.00
2      24.51
3      26.99
4      28.20
       ...  
239    34.95
240    29.18
241    24.67
242    19.57
243    21.78
Length: 244, dtype: float64

In [12]:
# Create a new column with the tip as a percentage of the total bill.
tips_df["tip_pct"] = tips_df["tip"] / tips_df["total_bill"] * 100

In [13]:
tips_df

,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,tip_pct
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,5.944673
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,16.054159
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458,16.658734
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,13.978041
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,14.680765
...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657,20.392697
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766,7.358352
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880,8.822232
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17,9.820426


In [14]:
# To only view the computed values, without creating the column:
tips_df["tip"] / tips_df["total_bill"] * 100

0       5.944673
1      16.054159
2      16.658734
3      13.978041
4      14.680765
         ...    
239    20.392697
240     7.358352
241     8.822232
242     9.820426
243    15.974441
Length: 244, dtype: float64

In [15]:
# Overwrite the price_per_person column with total_bill / size, rounded to 2 decimals.
tips_df["price_per_person"] = np.round(tips_df["total_bill"] / tips_df["size"], 2)

In [16]:
tips_df

,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,tip_pct
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,5.944673
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,16.054159
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458,16.658734
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,13.978041
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,14.680765
...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657,20.392697
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766,7.358352
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880,8.822232
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17,9.820426


In [18]:
# Get the minimum total_bill.
tips_df["total_bill"].min()

3.07

In [19]:
# Get the maximum total_bill.
tips_df["total_bill"].max()

50.81

In [20]:
# Get the mean total_bill.
tips_df["total_bill"].mean()

19.78594262295082

In [21]:
# Sum of the values in the total_bill column.
tips_df["total_bill"].sum(axis = 0)

4827.77
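
The four aggregations above can also be requested in one call with `Series.agg`, which accepts a list of function names. This is a minimal sketch on only the first five `total_bill` values, so the numbers differ from the full-dataset results shown above.

```python
import pandas as pd

# First five total_bill values from the dataset (not the full column).
total_bill = pd.Series([16.99, 10.34, 21.01, 23.68, 24.59], name="total_bill")

# One call, several aggregations; the result is a Series indexed by function name.
summary = total_bill.agg(["min", "max", "mean", "sum"])
print(summary)
```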

# 📅 2023-11-29

In [1]:
import pandas as pd
import numpy as np
tips_df = pd.read_csv('c:\\Users\\Admin\\Documents\\ITS\\Semestre 9\\Data_Science\\Pandas\\Datasets\\tips.csv')

In [2]:
# Show the column names and the dtype of the Index that holds them.
tips_df.columns

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size',
       'price_per_person', 'Payer Name', 'CC Number', 'Payment ID'],
      dtype='object')

In [4]:
# Access the DataFrame's index.
tips_df.index

RangeIndex(start=0, stop=244, step=1)

In [5]:
# Show essential information about the DataFrame.
tips_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   sex               244 non-null    object 
 3   smoker            244 non-null    object 
 4   day               244 non-null    object 
 5   time              244 non-null    object 
 6   size              244 non-null    int64  
 7   price_per_person  244 non-null    float64
 8   Payer Name        244 non-null    object 
 9   CC Number         244 non-null    int64  
 10  Payment ID        244 non-null    object 
dtypes: float64(3), int64(2), object(6)
memory usage: 21.1+ KB


In [6]:
# Show summary statistics for the DataFrame.
tips_df.describe()

,total_bill,tip,size,price_per_person,CC Number
count,244.0,244.0,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672,7.888197,2563496000000000.0
std,8.902412,1.383638,0.9511,2.914234,2369340000000000.0
min,3.07,1.0,1.0,2.88,60406790000.0
25%,13.3475,2.0,2.0,5.8,30407310000000.0
50%,17.795,2.9,2.0,7.255,3525318000000000.0
75%,24.1275,3.5625,3.0,9.39,4553675000000000.0
max,50.81,10.0,6.0,20.27,6596454000000000.0


In [7]:
# Mean of the total_bill column.
# Note that it matches the mean reported by describe().
tips_df["total_bill"].mean()

19.78594262295082

In [8]:
# Select two columns using a list of names.
cols = ["total_bill", "tip"]
tips_df[cols]

,total_bill,tip
0,16.99,1.01
1,10.34,1.66
2,21.01,3.50
3,23.68,3.31
4,24.59,3.61
...,...,...
239,29.03,5.92
240,27.18,2.00
241,22.67,2.00
242,17.82,1.75


In [9]:
# Preview the tip percentage as a Series, without adding it to the DataFrame.
tips_df["tip"] / tips_df["total_bill"] * 100

0       5.944673
1      16.054159
2      16.658734
3      13.978041
4      14.680765
         ...    
239    20.392697
240     7.358352
241     8.822232
242     9.820426
243    15.974441
Length: 244, dtype: float64

In [10]:
# Add a new column with the tip percentage.
tips_df["tip_pct"] = tips_df["tip"] / tips_df["total_bill"] * 100

In [11]:
tips_df

,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,tip_pct
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,5.944673
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,16.054159
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458,16.658734
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,13.978041
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,14.680765
...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657,20.392697
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766,7.358352
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880,8.822232
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17,9.820426


## 🗑 Dropping Columns

In [12]:
# Drop a column. drop() returns a new DataFrame, so assign the result to another variable or overwrite the existing one.
tips_df_2 = tips_df.drop("tip_pct", axis = 1)

In [13]:
tips_df_2

,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251
...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17
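
The reason a second variable is needed can be checked on a tiny frame: `drop` leaves the original untouched and returns a copy without the column. The sketch uses two rows from the dataset and the `columns=` keyword, which is equivalent to passing the name with `axis = 1`; the name `without_pct` is illustrative.

```python
import pandas as pd

# Two rows taken from the head of the tips dataset.
tips = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})
tips["tip_pct"] = tips["tip"] / tips["total_bill"] * 100

# drop() does not modify tips; it returns a new DataFrame.
without_pct = tips.drop(columns="tip_pct")  # same effect as drop("tip_pct", axis=1)

print(list(tips.columns))         # ['total_bill', 'tip', 'tip_pct']
print(list(without_pct.columns))  # ['total_bill', 'tip']
```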


## 🧮 Checking the DataFrame's Shape

In [14]:
# Check how many rows and columns the DataFrame has.
tips_df.shape

(244, 12)

In [15]:
# Check how many rows there are.
tips_df.shape[0]

244

In [16]:
# Check how many columns there are.
tips_df.shape[1]

12

## 📍 Index

In [17]:
# Set Payment ID as the index.
tips_df_idx = tips_df.set_index("Payment ID")

In [18]:
tips_df_idx

Payment ID,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,tip_pct
Sun2959,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,5.944673
Sun4608,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,16.054159
Sun4458,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,16.658734
Sun5260,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,13.978041
Sun2251,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,14.680765
...,...,...,...,...,...,...,...,...,...,...,...
Sat2657,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,20.392697
Sat1766,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,7.358352
Sat3880,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,8.822232
Sat17,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,9.820426


In [21]:
# Reset the index. The result is not assigned, so tips_df_idx still has Payment ID as its index.
tips_df_idx.reset_index()

,Payment ID,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,tip_pct
0,Sun2959,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,5.944673
1,Sun4608,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,16.054159
2,Sun4458,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,16.658734
3,Sun5260,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,13.978041
4,Sun2251,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,14.680765
...,...,...,...,...,...,...,...,...,...,...,...,...
239,Sat2657,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,20.392697
240,Sat1766,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,7.358352
241,Sat3880,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,8.822232
242,Sat17,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,9.820426
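
As a sketch of the round trip, `set_index` moves a column into the index and `reset_index` moves it back, each returning a new DataFrame. The two-row frame and the names `indexed` and `restored` are assumptions for the example.

```python
import pandas as pd

# Two rows taken from the head of the tips dataset.
tips = pd.DataFrame({"Payment ID": ["Sun2959", "Sun4608"], "tip": [1.01, 1.66]})

indexed = tips.set_index("Payment ID")  # Payment ID becomes the row labels
restored = indexed.reset_index()        # Payment ID becomes a regular column again

print(indexed.index.name)      # Payment ID
print(list(restored.columns))  # ['Payment ID', 'tip']
# indexed itself is unchanged by reset_index(), since the result went to a new variable.
print(list(indexed.columns))   # ['tip']
```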


## 🔎 Locating Data

In [22]:
tips_df_idx

Payment ID,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,tip_pct
Sun2959,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,5.944673
Sun4608,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,16.054159
Sun4458,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,16.658734
Sun5260,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,13.978041
Sun2251,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,14.680765
...,...,...,...,...,...,...,...,...,...,...,...
Sat2657,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,20.392697
Sat1766,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,7.358352
Sat3880,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,8.822232
Sat17,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,9.820426


In [26]:
# Show the row at position 0.
tips_df_idx.iloc[0]

total_bill                       16.99
tip                               1.01
sex                             Female
smoker                              No
day                                Sun
time                            Dinner
size                                 2
price_per_person                  8.49
Payer Name          Christy Cunningham
CC Number             3560325168603410
tip_pct                       5.944673
Name: Sun2959, dtype: object

In [27]:
# Show the value at row position 2, column position 4.
tips_df_idx.iloc[2, 4]

'Sun'

In [28]:
# Show the first three rows and the first three columns.
tips_df_idx.iloc[:3, :3]

Payment ID,total_bill,tip,sex
Sun2959,16.99,1.01,Female
Sun4608,10.34,1.66,Male
Sun4458,21.01,3.5,Male


In [29]:
# Using index labels, show the value at row Sun4458 and column price_per_person.
tips_df_idx.loc["Sun4458", "price_per_person"]

7.0

In [45]:
# First two columns, rows at positions 2 through 4 (label slices include both endpoints).
tips_df_idx.loc["Sun4458":"Sun2251", "total_bill":"tip"]

In [46]:
# Show the second and fourth rows, selected by label.
labels = ["Sun4608", "Sun5260"]  # avoid naming the variable `list`, which would shadow the built-in
tips_df_idx.loc[labels]

Payment ID,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,tip_pct
Sun4608,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,16.054159
Sun5260,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,13.978041


In [47]:
# Drop a row (the result is not assigned, so tips_df_idx is unchanged).
tips_df_idx.drop("Sun2251", axis = 0)

Payment ID,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,tip_pct
Sun2959,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,5.944673
Sun4608,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,16.054159
Sun4458,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,16.658734
Sun5260,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,13.978041
Sun9679,25.29,4.71,Male,No,Sun,Dinner,4,6.32,Erik Smith,213140353657882,18.623962
...,...,...,...,...,...,...,...,...,...,...,...
Sat2657,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,20.392697
Sat1766,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,7.358352
Sat3880,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,8.822232
Sat17,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,9.820426


## ⚡ Homework
1. Move the `price_per_person` column to the end.
2. Convert values to integers, rounding up.
3. Move the `tip_pct` column (one option is a `pop` followed by an `insert`).
4. Create a new row and insert it at the third index.

Solved exercise [here](../Ejercicios/20231128_19051186_EjPandas_DataFrame_03.ipynb).
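
One possible approach to the four homework steps is sketched below on a three-row frame. It is not the linked solution. The column `bill_ceil`, the values of the new row, and the reading of "third index" as row position 2 are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Three rows taken from the head of the tips dataset.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "price_per_person": [8.49, 3.45, 7.00],
    "tip_pct": [5.944673, 16.054159, 16.658734],
})

# 1. Move price_per_person to the end: pop() removes and returns the column,
#    and assigning it back appends it as the last column.
tips["price_per_person"] = tips.pop("price_per_person")

# 2. Round up, then convert to integer (bill_ceil is a hypothetical column name).
tips["bill_ceil"] = np.ceil(tips["total_bill"]).astype(int)

# 3. Move tip_pct to the first position with pop() followed by insert().
tips.insert(0, "tip_pct", tips.pop("tip_pct"))

# 4. Insert a new row at position 2 (assumed meaning of "third index")
#    by concatenating the rows before it, the new row, and the rows after it.
new_row = pd.DataFrame([{
    "tip_pct": 15.0, "total_bill": 20.0, "price_per_person": 10.0, "bill_ceil": 20,
}])  # hypothetical values
tips = pd.concat([tips.iloc[:2], new_row, tips.iloc[2:]], ignore_index=True)

print(tips)
```

`ignore_index=True` renumbers the rows from 0 so that the inserted row gets label 2 and the following rows shift down by one.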

# 📅 2023-12-04

In [1]:
import pandas as pd
import numpy as np
tips_df = pd.read_csv('c:\\Users\\Admin\\Documents\\ITS\\Semestre 9\\Data_Science\\Pandas\\Datasets\\tips.csv')

In [2]:
# Show the data in a given column.
tips_df["total_bill"]

0      16.99
1      10.34
2      21.01
3      23.68
4      24.59
       ...  
239    29.03
240    27.18
241    22.67
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64

## 0️⃣ Logical Conditions

In [4]:
# Compare each total_bill value against 20 dollars; the result is a Boolean Series (a mask), not yet a filtered table.
tips_df["total_bill"] > 20

0      False
1      False
2       True
3       True
4       True
       ...  
239     True
240     True
241     True
242    False
243    False
Name: total_bill, Length: 244, dtype: bool

In [5]:
# Show the DataFrame filtered with that condition.
tips_df[tips_df["total_bill"] > 20]

,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251
5,25.29,4.71,Male,No,Sun,Dinner,4,6.32,Erik Smith,213140353657882,Sun9679
7,26.88,3.12,Male,No,Sun,Dinner,4,6.72,Robert Buck,3514785077705092,Sun8157
...,...,...,...,...,...,...,...,...,...,...,...
237,32.83,1.17,Male,Yes,Sat,Dinner,2,16.42,Thomas Brown,4284722681265508,Sat2929
238,35.83,4.67,Female,No,Sat,Dinner,3,11.94,Kimberly Crane,676184013727,Sat9777
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766

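The two cells above follow one pattern: a comparison on a column yields a boolean Series (a mask), and indexing the DataFrame with that mask keeps only the rows marked `True`. A minimal sketch on a small made-up table (the `demo` frame and its values are illustrative stand-ins, not the contents of `tips.csv`):

```python
import pandas as pd

# Hypothetical mini-table standing in for tips.csv.
demo = pd.DataFrame({
    "total_bill": [16.99, 21.01, 23.68, 10.34],
    "day": ["Sun", "Sun", "Sat", "Fri"],
})

# Step 1: the comparison returns one True/False value per row.
mask = demo["total_bill"] > 20

# Step 2: indexing with the mask keeps only the rows where it is True.
expensive = demo[mask]

print(mask.tolist())
print(expensive)
```

The filtered frame keeps the original index labels, which is why the outputs above skip numbers such as 0 and 1.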

In [6]:
# View the records of people who do NOT smoke.
tips_df[tips_df["smoker"] == "No"]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251
...,...,...,...,...,...,...,...,...,...,...,...
235,10.07,1.25,Male,No,Sat,Dinner,2,5.04,Sean Gonzalez,3534021246117605,Sat4615
238,35.83,4.67,Female,No,Sat,Dinner,3,11.94,Kimberly Crane,676184013727,Sat9777
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17


In [7]:
# View only the Friday records.
tips_df[tips_df["day"] == "Fri"]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
90,28.97,3.0,Male,Yes,Fri,Dinner,2,14.48,Daniel Mason,3597456900644078,Fri4175
91,22.49,3.5,Male,No,Fri,Dinner,2,11.24,Earl Horn,6011849326227398,Fri5700
92,5.75,1.0,Female,Yes,Fri,Dinner,2,2.88,Leah Ramirez,3508911676966392,Fri3780
93,16.32,4.3,Female,Yes,Fri,Dinner,2,8.16,Natalie Nguyen,5181236182893396,Fri6963
94,22.75,3.25,Female,No,Fri,Dinner,2,11.38,Jamie Garza,676318332068,Fri2318
95,40.17,4.73,Male,Yes,Fri,Dinner,4,10.04,Aaron Bentley,180026611638690,Fri9628
96,27.28,4.0,Male,Yes,Fri,Dinner,2,13.64,Eric Carter,4563054452787961,Fri3159
97,12.03,1.5,Male,Yes,Fri,Dinner,2,6.02,Eric Herrera,580116092652,Fri9268
98,21.01,3.0,Male,Yes,Fri,Dinner,2,10.5,Michael Li,4831801127457917,Fri144
99,12.46,1.5,Male,No,Fri,Dinner,2,6.23,Edward Carter,347435564751626,Fri5575


In [8]:
# Show the records where the total bill is greater than 25 dollars AND the day is Friday.
tips_df[(tips_df["total_bill"] > 25) & (tips_df["day"] == "Fri")]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
90,28.97,3.0,Male,Yes,Fri,Dinner,2,14.48,Daniel Mason,3597456900644078,Fri4175
95,40.17,4.73,Male,Yes,Fri,Dinner,4,10.04,Aaron Bentley,180026611638690,Fri9628
96,27.28,4.0,Male,Yes,Fri,Dinner,2,13.64,Eric Carter,4563054452787961,Fri3159


In [9]:
# Show the records of non-smokers where the day is Saturday.
tips_df[(tips_df["smoker"] == "No") & (tips_df["day"] == "Sat")]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
19,20.65,3.35,Male,No,Sat,Dinner,3,6.88,Timothy Oneal,6568069240986485,Sat9213
20,17.92,4.08,Male,No,Sat,Dinner,2,8.96,Thomas Rice,4403296224639756,Sat1709
21,20.29,2.75,Female,No,Sat,Dinner,2,10.14,Natalie Gardner,5448125351489749,Sat9618
22,15.77,2.23,Female,No,Sat,Dinner,2,7.88,Ashley Shelton,3524119516293213,Sat9786
23,39.42,7.58,Male,No,Sat,Dinner,4,9.86,Lance Peterson,3542584061609808,Sat239
24,19.82,3.18,Male,No,Sat,Dinner,2,9.91,Christopher Ross,36739148167928,Sat6236
25,17.81,2.34,Male,No,Sat,Dinner,4,4.45,Robert Perkins,30502930499388,Sat907
26,13.37,2.0,Male,No,Sat,Dinner,2,6.68,Kyle Avery,6531339539615499,Sat6651
27,12.69,2.0,Male,No,Sat,Dinner,2,6.34,Patrick Barber,30155551880343,Sat394
28,21.7,4.3,Male,No,Sat,Dinner,2,10.85,David Collier,5529694315416009,Sat3697


In [10]:
# View the records where the day is Saturday OR Sunday.
tips_df[(tips_df["day"] == "Sat") | (tips_df["day"] == "Sun")]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251
...,...,...,...,...,...,...,...,...,...,...,...
238,35.83,4.67,Female,No,Sat,Dinner,3,11.94,Kimberly Crane,676184013727,Sat9777
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880

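The combined filters above use `&` (and) and `|` (or), with each condition wrapped in its own parentheses. The parentheses matter because `&` and `|` bind more tightly than comparisons such as `>` and `==`, and the Python keywords `and`/`or` do not work element by element on a Series. A short sketch, again on a made-up `demo` table, that also shows `~` for negation:

```python
import pandas as pd

# Hypothetical mini-table; values are illustrative only.
demo = pd.DataFrame({
    "total_bill": [28.97, 22.49, 40.17, 30.00],
    "day": ["Fri", "Fri", "Fri", "Sat"],
    "smoker": ["Yes", "No", "Yes", "No"],
})

# AND: both conditions must hold for a row to be kept.
big_friday = demo[(demo["total_bill"] > 25) & (demo["day"] == "Fri")]

# OR: at least one of the conditions must hold.
friday_or_nonsmoker = demo[(demo["day"] == "Fri") | (demo["smoker"] == "No")]

# NOT: ~ inverts a mask, keeping the rows where the condition is False.
not_friday = demo[~(demo["day"] == "Fri")]

print(list(big_friday.index))
print(list(friday_or_nonsmoker.index))
print(list(not_friday.index))
```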

### ❓ `isin`

In [11]:
# The isin() method can also be used to filter.
# Here it keeps the records where the day is Saturday or Sunday.
tips_df[tips_df["day"].isin(["Sat", "Sun"])]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251
...,...,...,...,...,...,...,...,...,...,...,...
238,35.83,4.67,Female,No,Sat,Dinner,3,11.94,Kimberly Crane,676184013727,Sat9777
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880

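As the identical outputs suggest, `isin` is a shorter way of writing a chain of `==` comparisons joined with `|`. A minimal sketch on a made-up Series (the `days` values are illustrative), including the negated form for "every day except the weekend":

```python
import pandas as pd

# Hypothetical sample of the "day" column.
days = pd.Series(["Thur", "Fri", "Sat", "Sun", "Sat"], name="day")

# Two ways of building the same mask.
with_or = (days == "Sat") | (days == "Sun")
with_isin = days.isin(["Sat", "Sun"])
print(with_or.tolist() == with_isin.tolist())

# Negating the mask keeps everything that is NOT in the list.
weekdays = days[~days.isin(["Sat", "Sun"])]
print(weekdays.tolist())
```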

## 🔗 String Operations

In [12]:
# View the data types of the columns.
tips_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   sex               244 non-null    object 
 3   smoker            244 non-null    object 
 4   day               244 non-null    object 
 5   time              244 non-null    object 
 6   size              244 non-null    int64  
 7   price_per_person  244 non-null    float64
 8   Payer Name        244 non-null    object 
 9   CC Number         244 non-null    int64  
 10  Payment ID        244 non-null    object 
dtypes: float64(3), int64(2), object(6)
memory usage: 21.1+ KB


Note that `CC Number` is stored as an integer (`int64`).

> The goal is to show only the last 4 digits of `CC Number`.

In [14]:
# Example: convert a number to a string and slice its last 4 digits.
str(123456789)[-4:]

'6789'

In [15]:
def last_four(num):
    """
    Return the last 4 digits of a number, as a string.
    """
    return str(num)[-4:]

In [16]:
# Apply the function to the number 123456789.
last_four(123456789)

'6789'

### 📝 `apply`

The `apply` method calls a function on each element of a Series and returns a new Series with the results.

In [17]:
# Apply the function to the "CC Number" column.
tips_df["CC Number"].apply(last_four)

0      3410
1      9230
2      1322
3      5994
4      7221
       ... 
239    2842
240    5404
241    7196
242    0950
243    8139
Name: CC Number, Length: 244, dtype: object
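
Because `last_four` returns strings, the result has dtype `object` and a value such as `0950` keeps its leading zero. For this particular task there is an alternative that avoids `apply`: cast the column to strings and slice through the `.str` accessor. A sketch comparing both, using two card numbers from the output above plus one made-up short value:

```python
import pandas as pd

def last_four(num):
    """Return the last four characters of the string form of num."""
    return str(num)[-4:]

# Two values from the notebook output and one hypothetical short number.
cards = pd.Series([3560325168603410, 4375220550950, 123], name="CC Number")

# Option 1: call the Python function once per element.
with_apply = cards.apply(last_four)

# Option 2: convert to strings, then slice every string at once.
with_str = cards.astype(str).str[-4:]

print(with_apply.tolist())
print(with_str.tolist())
```

Both options give the same strings; a number with fewer than four digits is simply returned whole.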

In [18]:
# Add the result to the DataFrame as a new column.
tips_df["last_four"] = tips_df["CC Number"].apply(last_four)
tips_df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458,1322
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221
...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657,2842
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766,5404
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880,7196
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17,0950


In [19]:
def mult_by_2(num):
    """
    Return the given number multiplied by 2.
    """
    return num * 2

In [20]:
# Try the function on a single value.
mult_by_2(5)

10

In [21]:
# Multiply the "total_bill" column by 2.
tips_df["total_bill"].apply(mult_by_2)

0      33.98
1      20.68
2      42.02
3      47.36
4      49.18
       ...  
239    58.06
240    54.36
241    45.34
242    35.64
243    37.56
Name: total_bill, Length: 244, dtype: float64

## 🐑 Lambda Functions

In [22]:
# The named function can be replaced by an inline lambda.
tips_df["total_bill"].apply(lambda bill: bill * 2)

0      33.98
1      20.68
2      42.02
3      47.36
4      49.18
       ...  
239    58.06
240    54.36
241    45.34
242    35.64
243    37.56
Name: total_bill, Length: 244, dtype: float64
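
For plain arithmetic like this, neither a named function nor a lambda is strictly needed, because arithmetic operators on a Series already act element by element. A small sketch with the first three `total_bill` values shown above, comparing the three spellings:

```python
import pandas as pd

def mult_by_2(num):
    """Return the given number multiplied by 2."""
    return num * 2

# First three total_bill values from the output above.
bills = pd.Series([16.99, 10.34, 21.01], name="total_bill")

doubled_named = bills.apply(mult_by_2)          # named function
doubled_lambda = bills.apply(lambda b: b * 2)   # inline lambda
doubled_operator = bills * 2                    # operator on the whole Series

print(doubled_operator.tolist())
```

`apply` remains the tool to reach for when the per-element logic cannot be written as an operation on the whole column.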

# 📅 2023-12-05

In [1]:
import pandas as pd
import numpy as np
tips_df = pd.read_csv('c:\\Users\\Admin\\Documents\\ITS\\Semestre 9\\Data_Science\\Pandas\\Datasets\\tips.csv')

# Build a boolean mask: True where total_bill is greater than 20 dollars.
tips_df["total_bill"] > 20

0      False
1      False
2       True
3       True
4       True
       ...  
239     True
240     True
241     True
242    False
243    False
Name: total_bill, Length: 244, dtype: bool

## 🏹 `vectorize`

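`np.vectorize` is a NumPy function, not a pandas method. It wraps a scalar Python function so that it accepts whole arrays or Series and is called once per element, which is handy when the function takes values from more than one column. It is provided for convenience rather than speed, since it is essentially a loop under the hood.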

In [2]:
def propina(total_bill, tip):
    """
    Classify a tip as "Generosa" (more than 15% of the bill) or "Tacaña".
    """
    if (tip / total_bill) > 0.15:
        return "Generosa"
    else:
        return "Tacaña"

In [4]:
# Create a column using the propina function.
# When more than one column is involved, NumPy's vectorize() applies the function element by element.
tips_df["Propinas"] = np.vectorize(propina)(tips_df["total_bill"], tips_df["tip"])

In [5]:
tips_df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,Propinas
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,Tacaña
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,Generosa
2,21.01,3.50,Male,No,Sun,Dinner,3,7.00,Travis Walters,6011812112971322,Sun4458,Generosa
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,Tacaña
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,Tacaña
...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,9.68,Michael Avila,5296068606052842,Sat2657,Generosa
240,27.18,2.00,Female,Yes,Sat,Dinner,2,13.59,Monica Sanders,3506806155565404,Sat1766,Tacaña
241,22.67,2.00,Male,Yes,Sat,Dinner,2,11.34,Keith Wong,6011891618747196,Sat3880,Tacaña
242,17.82,1.75,Male,No,Sat,Dinner,2,8.91,Dennis Dixon,4375220550950,Sat17,Tacaña

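`np.vectorize` is one of several ways to compute a column from two others. The sketch below reproduces the first three rows of the table above with three approaches: `np.vectorize`, a row-wise `DataFrame.apply` with `axis=1`, and `np.where`, which expresses the same rule as an array operation with no Python-level function call per row. The `demo` frame is a hand-copied sample, and presenting `np.where` as an alternative is a suggestion of this sketch, not something the notebook itself uses.

```python
import numpy as np
import pandas as pd

def propina(total_bill, tip):
    """Label a tip "Generosa" if it exceeds 15% of the bill, else "Tacaña"."""
    if (tip / total_bill) > 0.15:
        return "Generosa"
    return "Tacaña"

# First three rows of total_bill and tip from the table above.
demo = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
})

# 1) Wrap the scalar function so it accepts two columns.
via_vectorize = np.vectorize(propina)(demo["total_bill"], demo["tip"])

# 2) Call the function once per row, reading both columns from the row.
via_apply = demo.apply(lambda row: propina(row["total_bill"], row["tip"]), axis=1)

# 3) Express the rule directly as an array operation.
via_where = np.where(demo["tip"] / demo["total_bill"] > 0.15, "Generosa", "Tacaña")

print(via_vectorize.tolist())
```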

## ⬇ Sorting Review

In [6]:
# Option 1: sort a single column; the result is a Series.
tips_df["total_bill"].sort_values(ascending=False)

170    50.81
212    48.33
59     48.27
156    48.17
182    45.35
       ...  
149     7.51
111     7.25
172     7.25
92      5.75
67      3.07
Name: total_bill, Length: 244, dtype: float64

In [7]:
# Option 2: sort the whole table by a column; the result is a DataFrame.
tips_df.sort_values("total_bill", ascending=False)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,Propinas
170,50.81,10.00,Male,Yes,Sat,Dinner,3,16.94,Gregory Clark,5473850968388236,Sat1954,Generosa
212,48.33,9.00,Male,No,Sat,Dinner,4,12.08,Alex Williamson,676218815212,Sat4590,Generosa
59,48.27,6.73,Male,No,Sat,Dinner,4,12.07,Brian Ortiz,6596453823950595,Sat8139,Tacaña
156,48.17,5.00,Male,No,Sun,Dinner,6,8.03,Ryan Gonzales,3523151482063321,Sun7518,Tacaña
182,45.35,3.50,Male,Yes,Sun,Dinner,3,15.12,Jose Parsons,4112207559459910,Sun2337,Tacaña
...,...,...,...,...,...,...,...,...,...,...,...,...
149,7.51,2.00,Male,No,Thur,Lunch,2,3.76,Daniel Robbins,4823139288341889,Thur6321,Generosa
111,7.25,1.00,Female,No,Sat,Dinner,1,7.25,Terri Jones,3559221007826887,Sat4801,Tacaña
172,7.25,5.15,Male,Yes,Sun,Dinner,2,3.62,Larry White,30432617123103,Sun9209,Generosa
92,5.75,1.00,Female,Yes,Fri,Dinner,2,2.88,Leah Ramirez,3508911676966392,Fri3780,Generosa


In [8]:
# Sort the DataFrame by two columns.
tips_df.sort_values(["total_bill", "tip"], ascending=False)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,Propinas
170,50.81,10.00,Male,Yes,Sat,Dinner,3,16.94,Gregory Clark,5473850968388236,Sat1954,Generosa
212,48.33,9.00,Male,No,Sat,Dinner,4,12.08,Alex Williamson,676218815212,Sat4590,Generosa
59,48.27,6.73,Male,No,Sat,Dinner,4,12.07,Brian Ortiz,6596453823950595,Sat8139,Tacaña
156,48.17,5.00,Male,No,Sun,Dinner,6,8.03,Ryan Gonzales,3523151482063321,Sun7518,Tacaña
182,45.35,3.50,Male,Yes,Sun,Dinner,3,15.12,Jose Parsons,4112207559459910,Sun2337,Tacaña
...,...,...,...,...,...,...,...,...,...,...,...,...
149,7.51,2.00,Male,No,Thur,Lunch,2,3.76,Daniel Robbins,4823139288341889,Thur6321,Generosa
172,7.25,5.15,Male,Yes,Sun,Dinner,2,3.62,Larry White,30432617123103,Sun9209,Generosa
111,7.25,1.00,Female,No,Sat,Dinner,1,7.25,Terri Jones,3559221007826887,Sat4801,Tacaña
92,5.75,1.00,Female,Yes,Fri,Dinner,2,2.88,Leah Ramirez,3508911676966392,Fri3780,Generosa


---

🔰 **Note**: When sorting by multiple columns, rows are ordered by the first column; the second column only breaks ties in the first, and so on.

---
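
The tie-breaking is visible in the two outputs above: rows 111 and 172 share a `total_bill` of 7.25 and swap places once `tip` joins the sort keys. The sketch below rebuilds just those bottom rows by hand and also shows that `ascending` accepts one flag per column, so each key can have its own direction:

```python
import pandas as pd

# The four lowest rows shown above, copied by hand with their index labels.
demo = pd.DataFrame(
    {"total_bill": [7.25, 7.51, 7.25, 5.75], "tip": [1.00, 2.00, 5.15, 1.00]},
    index=[111, 149, 172, 92],
)

# Both keys descending: the 7.25 tie is broken by the larger tip first.
both_desc = demo.sort_values(["total_bill", "tip"], ascending=False)
print(list(both_desc.index))

# total_bill descending, tip ascending: the tie is broken the other way.
mixed = demo.sort_values(["total_bill", "tip"], ascending=[False, True])
print(list(mixed.index))
```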

## 🔎 Assorted DataFrame Queries

### 🤓 The `describe` Method

In [9]:
# View summary statistics for the numeric columns.
tips_df.describe()

Unnamed: 0,total_bill,tip,size,price_per_person,CC Number
count,244.0,244.0,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672,7.888197,2563496000000000.0
std,8.902412,1.383638,0.9511,2.914234,2369340000000000.0
min,3.07,1.0,1.0,2.88,60406790000.0
25%,13.3475,2.0,2.0,5.8,30407310000000.0
50%,17.795,2.9,2.0,7.255,3525318000000000.0
75%,24.1275,3.5625,3.0,9.39,4553675000000000.0
max,50.81,10.0,6.0,20.27,6596454000000000.0


### ℹ The `info` Method

In [11]:
# The info() method shows essential information about the DataFrame: number of columns, number of rows, column data types, etc.
tips_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   sex               244 non-null    object 
 3   smoker            244 non-null    object 
 4   day               244 non-null    object 
 5   time              244 non-null    object 
 6   size              244 non-null    int64  
 7   price_per_person  244 non-null    float64
 8   Payer Name        244 non-null    object 
 9   CC Number         244 non-null    int64  
 10  Payment ID        244 non-null    object 
 11  Propinas          244 non-null    object 
dtypes: float64(3), int64(2), object(7)
memory usage: 23.0+ KB


### 💯 Maximum and Minimum (values and indexes)

In [12]:
# Get the maximum value of a column.
tips_df["total_bill"].max()

50.81

In [13]:
# Get the index label of the row that holds the column's maximum.
tips_df["total_bill"].idxmax()

170

In [14]:
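# Go to that row to see all of its information.
# idxmax() returns an index label, so .loc is the matching accessor
# (.iloc gives the same row here only because of the default RangeIndex).
tips_df.loc[170]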

total_bill                     50.81
tip                             10.0
sex                             Male
smoker                           Yes
day                              Sat
time                          Dinner
size                               3
price_per_person               16.94
Payer Name             Gregory Clark
CC Number           5473850968388236
Payment ID                   Sat1954
Propinas                    Generosa
Name: 170, dtype: object

In [15]:
# Get the minimum value of a column.
tips_df["total_bill"].min()

3.07

In [16]:
# Get the index label of the row that holds the column's minimum.
tips_df["total_bill"].idxmin()

67

In [17]:
# Go to that row to see all of its information (label-based lookup).
tips_df.loc[67]

total_bill                      3.07
tip                              1.0
sex                           Female
smoker                           Yes
day                              Sat
time                          Dinner
size                               1
price_per_person                3.07
Payer Name             Tiffany Brock
CC Number           4359488526995267
Payment ID                   Sat3455
Propinas                    Generosa
Name: 67, dtype: object
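
`idxmax()` and `idxmin()` return index labels, not positions. With the default `RangeIndex` of `tips_df` the two coincide, but after filtering or sorting they no longer do. The sketch below uses a hand-built frame whose labels are deliberately not 0, 1, 2 to show the difference between label lookup (`.loc`) and positional lookup (`.iloc`):

```python
import pandas as pd

# Three rows with non-consecutive labels, as left behind by a filter.
demo = pd.DataFrame({"total_bill": [3.07, 50.81, 17.82]}, index=[67, 170, 242])

# idxmax() gives the LABEL of the row holding the maximum.
label = demo["total_bill"].idxmax()
row_by_label = demo.loc[label]

# The POSITION of the same row comes from NumPy's argmax.
position = int(demo["total_bill"].to_numpy().argmax())
row_by_position = demo.iloc[position]

# Using the label as a position fails: there is no 171st row.
try:
    demo.iloc[label]
    iloc_failed = False
except IndexError:
    iloc_failed = True

print(label, position, iloc_failed)
```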

## 🧮 Counting Values: `value_counts()`, `unique()` and `nunique()`

The `value_counts()` method counts how many times each distinct value appears in a column.

In [19]:
tips_df["day"].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

The `unique()` method returns the distinct values of a column.

In [20]:
tips_df["day"].unique()

array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)

The `nunique()` method returns the number of distinct values in a column.

In [21]:
# Count the distinct days by taking the length of unique().
len(tips_df["day"].unique())

4

In [22]:
# Count the distinct days directly with nunique().
tips_df["day"].nunique()

4
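
The two counts agree here because the `day` column has no missing values. When a column does contain missing values, `unique()` lists the missing value as one of its entries, while `nunique()` and `value_counts()` skip it by default. A sketch on a made-up Series with one missing entry, which also shows `normalize=True` for proportions instead of raw counts:

```python
import pandas as pd

# Hypothetical sample of the "day" column with one missing value.
days = pd.Series(["Sat", "Sun", "Sat", None, "Fri", "Sat"], name="day")

counts = days.value_counts()                 # missing value is excluded
shares = days.value_counts(normalize=True)   # fractions of the non-missing rows

n_unique = days.nunique()          # missing value not counted
n_with_missing = len(days.unique())  # missing value counted as an entry

print(counts["Sat"], n_unique, n_with_missing)
```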

## 🔁 `replace`

In [23]:
# Show N instead of No in the smoker column (replace returns a new Series; tips_df is not modified).
tips_df["smoker"].replace("No", "N")

0        N
1        N
2        N
3        N
4        N
      ... 
239      N
240    Yes
241    Yes
242      N
243      N
Name: smoker, Length: 244, dtype: object

In [24]:
# Show Y instead of Yes in the smoker column.
tips_df["smoker"].replace("Yes", "Y")

0      No
1      No
2      No
3      No
4      No
       ..
239    No
240     Y
241     Y
242    No
243    No
Name: smoker, Length: 244, dtype: object

### 🗾 Mapping: `map()`

In [25]:
# Replace the values of the smoker column with N and Y in a single step.
my_map = {"No": "N", "Yes": "Y"}
tips_df["smoker"].map(my_map)

0      N
1      N
2      N
3      N
4      N
      ..
239    N
240    Y
241    Y
242    N
243    N
Name: smoker, Length: 244, dtype: object
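
`replace` and `map` look interchangeable on this column, but they treat values that are missing from the dictionary differently: `replace` leaves them untouched, while `map` turns them into `NaN`. Neither modifies the original Series, so the result has to be assigned back to keep it. A sketch on a made-up Series with an extra value, `"Maybe"`, that is not in the dictionary:

```python
import pandas as pd

# Hypothetical sample; "Maybe" is deliberately absent from the mapping.
smoker = pd.Series(["No", "Yes", "Maybe"], name="smoker")
codes = {"No": "N", "Yes": "Y"}

# replace: values without an entry in the dictionary stay as they are.
replaced = smoker.replace(codes)

# map: values without an entry in the dictionary become NaN.
mapped = smoker.map(codes)

print(replaced.tolist())
print(mapped.isna().tolist())
print(smoker.tolist())  # the original Series is unchanged
```

To keep the change in a DataFrame, assign the result back to the column, for example `tips_df["smoker"] = tips_df["smoker"].map(my_map)`.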