# Part 1 – Pandas DataFrames  

For this section, we will work with the **National Household Survey (ENAHO)** dataset extracted from [INEI](https://proyectos.inei.gob.pe/microdatos/).  
In this [shared drive folder](https://drive.google.com/drive/folders/1h00GwfCRyq0Grem3bR26yxc33ELYJwG8?usp=sharing), you will find:  
- A reference questionnaire (you can review it to identify questions of interest).  
- Three datasets corresponding to the following modules: **Housing (200)**, **Education (300)**, and **Labor (500)**.  
Download the files to your local machine.

---

1. Set your working directory and **import the dataset** `Enaho01A-2023-300.csv` using Pandas.  

> Note: Consider the file encoding (`UTF-8` or `ISO-8859-10`).  
> Example: `df = pd.read_csv("datos.csv", encoding="ISO-8859-10")`  

- Read and display the **first 5 rows**.  
- Convert the column names into a **list** and print it.  
- Check the **data types** of the DataFrame.  
- Select a subsample containing the variables `['CONGLOME', 'VIVIENDA', 'HOGAR', 'CODPERSO']` and between 3–5 additional variables of your interest.  
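
The encoding chosen in the note above affects how accented characters in the column names are decoded. A minimal sketch, using a hypothetical three-byte sample rather than the real file, of how the same bytes read under two different codecs:

```python
# Hypothetical sample: the byte 0xD1 between "A" and "O",
# as it might appear in the header of a Latin-encoded CSV.
raw = b"A\xd1O"

as_latin1 = raw.decode("iso-8859-1")    # Western European code page
as_latin6 = raw.decode("iso-8859-10")   # Nordic code page

print(as_latin1)
print(as_latin6)
```

If a header prints with `Ņ` where `Ñ` is expected, trying `encoding="ISO-8859-1"` in `pd.read_csv` may be worth a test; whether that is the right codec for these files is an assumption to verify against the data.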


In [5]:
import pandas as pd

In [2]:
# Raw string (r"...") so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r"C:\Diplomado_PUCP_25\Enaho01A-2023-300.csv", encoding="ISO-8859-10")

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Diplomado_PUCP_25\\Enaho01A-2023-300.csv'

In [3]:
# Show the first rows
print(df.head())

NameError: name 'df' is not defined

In [5]:
# Convert the column names to a list
print(list(df.columns))

['AŅO', 'MES', 'CONGLOME', 'VIVIENDA', 'HOGAR', 'CODPERSO', 'UBIGEO', 'DOMINIO', 'ESTRATO', 'CODINFOR', 'P300N', 'P300I', 'P300A', 'P301A', 'P301B', 'P301C', 'P301D', 'P301A0', 'P301A1', 'P301B0', 'P301B1', 'P301B3', 'P302', 'P302X', 'P302A', 'P302B', 'P303', 'P304A', 'P304B', 'P304C', 'P304D', 'P305', 'P306', 'P307', 'P307A1', 'P307A2', 'P307A3', 'P307A4', 'P307A4_5', 'P307A4_6', 'P307A4_7', 'P307B1', 'P307B2', 'P307B3', 'P307B4', 'P307B4_5', 'P307B4_6', 'P307B4_7', 'P307C', 'P308A', 'P308B', 'P308C', 'P308D', 'P308B1', 'P308B2', 'P308B3', 'P308B4', 'P308B5', 'P308C1', 'P308C2', 'P310', 'P310B1', 'P310C0', 'P310C1', 'P310D1', 'P310D2', 'P310E0', 'P310E1', 'P310E3', 'P311N$1', 'P311N$2', 'P311N$3', 'P311N$4', 'P311N$5', 'P311N$6', 'P311N$7', 'P311N$8', 'P311N$9', 'P311$1', 'P311$2', 'P311$3', 'P311$4', 'P311$5', 'P311$6', 'P311$7', 'P311$8', 'P311$9', 'P311A1$1', 'P311A1$2', 'P311A1$3', 'P311A1$4', 'P311A1$5', 'P311A1$6', 'P311A1$7', 'P311A1$8', 'P311A1$9', 'P311A2$1', 'P311A2$2', 'P31

In [6]:
# Check the data types
print(df.dtypes)

AŅO               int64
MES               int64
CONGLOME          int64
VIVIENDA          int64
HOGAR             int64
                 ...   
I315B            object
FACTOR07        float64
FACTORA07       float64
NCONGLOME         int64
SUB_CONGLOME      int64
Length: 511, dtype: object


In [7]:
# Select a subsample (the 4 required variables plus 3 more of interest)
sub_df = df[['CONGLOME', 'VIVIENDA', 'HOGAR', 'CODPERSO',
             'P208A', 'P209', 'P301A']]

print(sub_df.head())

   CONGLOME  VIVIENDA  HOGAR  CODPERSO  P208A P209  P301A
0      5030         2     11         1     43    1      8
1      5030         2     11         2     41    1     10
2      5030         2     11         3      9           3
3      5030         2     11         4      7           3
4      5030        11     11         1     60    2      4


---

2. **Data Manipulation (Data Cleaning):**  
- Explore the DataFrame using summary functions.  
- Identify if there are missing values.  
- If they exist, remove them.  


In [8]:
# Explore the DataFrame with summary functions
df.info()              # General information (rows, columns, data types, memory); prints directly and returns None
print(df.describe())   # Basic statistics for the numeric variables

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108354 entries, 0 to 108353
Columns: 511 entries, AŅO to SUB_CONGLOME
dtypes: float64(2), int64(21), object(488)
memory usage: 422.4+ MB
            AŅO            MES       CONGLOME       VIVIENDA          HOGAR  \
count  108354.0  108354.000000  108354.000000  108354.000000  108354.000000   
mean     2023.0       6.495127   16944.756797      77.820330      11.146271   
std         0.0       3.445244    3144.386733      68.547022       1.370084   
min      2023.0       1.000000    5007.000000       1.000000      11.000000   
25%      2023.0       3.000000   16028.000000      31.000000      11.000000   
50%      2023.0       7.000000   17500.000000      66.000000      11.000000   
75%      2023.0       9.000000   19014.000000     106.000000      11.000000   
max      2023.0      12.000000   21001.000000     991.000000      44.000000   

            CODPERSO         UBIGEO        DOMINIO        ESTRATO  \
count  108354.000000  10835

In [9]:
# Check for missing values
print(df.isnull().sum())   # Number of missing values in each column

AŅO             0
MES             0
CONGLOME        0
VIVIENDA        0
HOGAR           0
               ..
I315B           0
FACTOR07        0
FACTORA07       0
NCONGLOME       0
SUB_CONGLOME    0
Length: 511, dtype: int64


In [10]:
# Drop rows with missing values (if any)
df_limpio = df.dropna()

In [11]:
# Check again
print(df_limpio.isnull().sum())

AŅO             0
MES             0
CONGLOME        0
VIVIENDA        0
HOGAR           0
               ..
I315B           0
FACTOR07        0
FACTORA07       0
NCONGLOME       0
SUB_CONGLOME    0
Length: 511, dtype: int64
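
The zero counts above only cover values pandas already treats as `NaN`. Several columns print blank cells (for example `P209` in the subsample), which suggests that some missing answers may be stored as whitespace-only strings. A minimal sketch on toy data, with hypothetical values, of converting such blanks to `NaN` before counting or dropping:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) mimicking a column read as text,
# where missing answers are stored as blank strings.
toy = pd.DataFrame({"CODPERSO": [1, 2, 3], "P209": ["1", " ", "2"]})

print(toy.isnull().sum())  # blanks are not counted as missing

# Turn empty or whitespace-only strings into NaN, then count again.
toy_clean = toy.replace(r"^\s*$", np.nan, regex=True)
print(toy_clean.isnull().sum())

# Drop rows only where the columns of interest are missing.
toy_complete = toy_clean.dropna(subset=["P209"])
print(toy_complete)
```

Restricting `dropna` with `subset=` is a design choice: with several hundred columns, dropping every row that has any missing value could discard most of the survey.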


---

3. Import a second dataset (choose between `Enaho01A-2023-200.csv` or `Enaho01A-2023-500.csv`).  
- Display the **first 5 rows**.  
- Convert the column names into a **list** and print it.  
- Check the **data types**.  
- Select a subsample containing the variables `['CONGLOME', 'VIVIENDA', 'HOGAR', 'CODPERSO']` and between 3–5 additional variables of your interest.  
- Perform the following modifications:  
  - **A.** Change the data type of a variable (e.g., from text to numeric).  
  - **B.** Modify some values in a specific column.  

In [12]:
# Raw string (r"...") so the backslashes in the Windows path are not read as escape sequences
df2 = pd.read_csv(r"C:\Diplomado_PUCP_25\Enaho01-2023-200.csv", encoding="ISO-8859-10")

In [13]:
# Show the first 5 rows
print(df2.head())

    AŅO  MES  CONGLOME  VIVIENDA  HOGAR  CODPERSO  UBIGEO  DOMINIO  ESTRATO  \
0  2023    2      5007        22     11         1   10101        4        4   
1  2023    2      5007        22     11         2   10101        4        4   
2  2023    2      5007        22     11         3   10101        4        4   
3  2023    2      5007        31     11         1   10101        4        4   
4  2023    2      5007        31     11         2   10101        4        4   

               P201P  ...  OCUPAC_R3 OCUPAC_R4 RAMA_R3 RAMA_R4 CODTAREA  \
0  20190050070221101  ...                                                 
1  20190050070221102  ...                                                 
2  20190050070221104  ...                                                 
3  20190050070311102  ...                                                 
4  20230050070311102  ...                                                 

  CODTIEMPO TICUEST01   FACPOB07 NCONGLOME SUB_CONGLOME  
0               

In [14]:
# Column names as a list
print(list(df2.columns))

['AŅO', 'MES', 'CONGLOME', 'VIVIENDA', 'HOGAR', 'CODPERSO', 'UBIGEO', 'DOMINIO', 'ESTRATO', 'P201P', 'P203', 'P203A', 'P203B', 'P204', 'P205', 'P206', 'P207', 'P208A', 'P208B', 'P209', 'P210', 'P211A', 'P211D', 'P212', 'P213', 'P214', 'P215', 'P216', 'P217', 'T211', 'OCUPAC_R3', 'OCUPAC_R4', 'RAMA_R3', 'RAMA_R4', 'CODTAREA', 'CODTIEMPO', 'TICUEST01', 'FACPOB07', 'NCONGLOME', 'SUB_CONGLOME']


In [15]:
# Data types
print(df2.dtypes)

AŅO               int64
MES               int64
CONGLOME          int64
VIVIENDA          int64
HOGAR             int64
CODPERSO          int64
UBIGEO            int64
DOMINIO           int64
ESTRATO           int64
P201P             int64
P203              int64
P203A            object
P203B            object
P204             object
P205             object
P206             object
P207             object
P208A            object
P208B            object
P209             object
P210             object
P211A            object
P211D            object
P212             object
P213             object
P214             object
P215             object
P216             object
P217             object
T211             object
OCUPAC_R3        object
OCUPAC_R4        object
RAMA_R3          object
RAMA_R4          object
CODTAREA         object
CODTIEMPO        object
TICUEST01         int64
FACPOB07        float64
NCONGLOME         int64
SUB_CONGLOME      int64
dtype: object


In [16]:
# Subsample with the variables of interest
sub_df2 = df2[['CONGLOME', 'VIVIENDA', 'HOGAR', 'CODPERSO',
               'P208A', 'P209', 'P210']]  # 3 extra variables as an example
print(sub_df2.head())


   CONGLOME  VIVIENDA  HOGAR  CODPERSO P208A P209 P210
0      5007        22     11         1    64    2     
1      5007        22     11         2    63    2     
2      5007        22     11         3    31    6     
3      5007        31     11         1    79    3     
4      5007        31     11         2    50    1     


In [17]:
# Change the data type of a variable (example: P208A to numeric)
sub_df2['P208A'] = pd.to_numeric(sub_df2['P208A'], errors='coerce')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sub_df2['P208A'] = pd.to_numeric(sub_df2['P208A'], errors='coerce')
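
The warning above appears because the subsample was taken from another DataFrame and then modified. One common way to avoid it is to take an explicit copy when creating the subsample. A minimal sketch on toy data (the values are hypothetical stand-ins for `df2`):

```python
import pandas as pd

# Toy stand-in for df2 (hypothetical values).
toy = pd.DataFrame({"CODPERSO": [1, 2, 3], "P208A": ["64", " ", "31"]})

# .copy() makes the subsample an independent DataFrame,
# so assigning to its columns does not depend on `toy`.
sub = toy[["CODPERSO", "P208A"]].copy()

# errors="coerce" turns entries that cannot be parsed (here the blank) into NaN.
sub["P208A"] = pd.to_numeric(sub["P208A"], errors="coerce")

print(sub.dtypes)
print(sub)
```

The original frame keeps its text column, while the copy holds floats with `NaN` where the answer was blank.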


In [18]:
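# Modify some values in a column (example: relabel the codes in P209)
# Suppose P209 = 1 means "Hombre" and 2 means "Mujer"
# Note: P209 is stored as text (dtype object), so the integer keys below
# match nothing and the column prints unchanged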
sub_df2['P209'] = sub_df2['P209'].replace({1: 'Hombre', 2: 'Mujer'})

print(sub_df2.head())

   CONGLOME  VIVIENDA  HOGAR  CODPERSO  P208A P209 P210
0      5007        22     11         1   64.0    2     
1      5007        22     11         2   63.0    2     
2      5007        22     11         3   31.0    6     
3      5007        31     11         1   79.0    3     
4      5007        31     11         2   50.0    1     


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sub_df2['P209'] = sub_df2['P209'].replace({1: 'Hombre', 2: 'Mujer'})
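
In the output above `P209` still shows the raw codes. A likely reason is that the column holds text, so the integer keys `1` and `2` never match. A minimal sketch on a toy column, assuming the codes are stored as strings; the labels are placeholders, not the questionnaire's real categories:

```python
import pandas as pd

# Toy column stored as text (hypothetical values).
codes = pd.Series(["1", "2", " ", "6"], dtype=object, name="P209")

# Placeholder labels for illustration only; check the questionnaire
# for the real meaning of each code.
labels = {"1": "Label A", "2": "Label B"}

unchanged = codes.replace({1: "Label A", 2: "Label B"})  # integer keys: no match
recoded = codes.replace(labels)                          # string keys: match

print(unchanged.tolist())
print(recoded.tolist())
```

Codes without an entry in the mapping are left as they are, so a partial dictionary is safe to apply.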


---

4. **Merging Datasets:**  [2 pts]
- Identify the common columns between the two datasets (from questions 1 and 3).  
- Verify whether the values match in both datasets. If not, correct the mismatched values to ensure a proper merge.  

> Recommendation: Use the following as common columns:  
> `common_columns = ['CONGLOME', 'VIVIENDA', 'HOGAR', 'CODPERSO']`  
> in `pd.merge(..., on=common_columns, how=...)`.  

- Perform the **merge**.  
- Display the **first 5 rows** of the resulting DataFrame.  

In [19]:
# df (module 300) and df2 (module 200) are already imported
# Define the common columns
common_columns = ['CONGLOME', 'VIVIENDA', 'HOGAR', 'CODPERSO']

In [20]:
# Check whether the column names match in both datasets
print("Columnas en df:", set(df.columns))
print("Columnas en df2:", set(df2.columns))

Columnas en df: {'P311D7$9', 'FACTOR07', 'P311D5$8', 'P311D5$9', 'P3122A4', 'P316C4', 'D315B5', 'P311B$4', 'P311B$7', 'P311C$1', 'P209', 'D311D7$1', 'D311D4$6', 'I311D4$7', 'P311A1$8', 'P311D3$8', 'P311A6$1', 'P3122A2', 'P314B1_2', 'I311B$2', 'P314B1_6', 'P316C3', 'P311B$1', 'I311D3$4', 'I311D5$2', 'P311A3$4', 'I311D2$7', 'P308C1', 'P311D5$2', 'P301C', 'P311C$3', 'P311D6$2', 'P305', 'P3122A1', 'P205', 'P301A0', 'D311D5$1', 'P315B2', 'P311C$2', 'D311D5$5', 'I311D2$4', 'I311D$2', 'P314B$5', 'P311A2$7', 'P311N$8', 'P311A6$2', 'I315B6', 'I311D5$7', 'P311B$6', 'P311D7$1', 'P311$7', 'I311D6$2', 'D3121B', 'P314B$2', 'P311D2$6', 'I311D$7', 'P311D3$4', 'P311B$5', 'D311D6$1', 'P311A1$5', 'P311E$3', 'P316C10', 'I311D6$1', 'P307B3', 'P3121D', 'P311A7$5', 'P300N', 'D311D5$7', 'DOMINIO', 'D3121C2', 'D3121C3', 'P311A7$3', 'P311A3$7', 'P311A7$1', 'D311D2$1', 'P315B6', 'D3122C4', 'D311D$7', 'P307B4', 'I311D2$3', 'P311E$2', 'P3121B', 'P311$4', 'P316$11', 'D311D6$2', 'P311D7$6', 'P311D3$7', 'P314B1_7', '
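
Printing both full sets makes the shared names hard to spot among several hundred columns. A minimal sketch of computing the intersection directly, using shortened, hypothetical column lists in place of `df.columns` and `df2.columns`:

```python
# Shortened, hypothetical column lists standing in for df.columns and df2.columns.
cols_300 = ["CONGLOME", "VIVIENDA", "HOGAR", "CODPERSO", "P301A"]
cols_200 = ["CONGLOME", "VIVIENDA", "HOGAR", "CODPERSO", "P208A"]

# Test membership against a set while keeping the order of the first list.
lookup = set(cols_200)
shared = [c for c in cols_300 if c in lookup]

print(shared)
```

With the real DataFrames, `df.columns.intersection(df2.columns)` should give the same information in one call.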

In [21]:
# Compare the key values in both datasets
print(df[common_columns].head())
print(df2[common_columns].head())

   CONGLOME  VIVIENDA  HOGAR  CODPERSO
0      5030         2     11         1
1      5030         2     11         2
2      5030         2     11         3
3      5030         2     11         4
4      5030        11     11         1
   CONGLOME  VIVIENDA  HOGAR  CODPERSO
0      5007        22     11         1
1      5007        22     11         2
2      5007        22     11         3
3      5007        31     11         1
4      5007        31     11         2


In [22]:
# Normalize the key columns to stripped strings so their types and formatting match in both datasets
for col in common_columns:
    df[col] = df[col].astype(str).str.strip()
    df2[col] = df2[col].astype(str).str.strip()

In [23]:
# Merge the two datasets on the common columns
df_merged = pd.merge(df, df2, on=common_columns, how="inner")

In [24]:
# Show the first 5 rows of the merged DataFrame
print(df_merged.head())

   AŅO_x  MES_x CONGLOME VIVIENDA HOGAR CODPERSO  UBIGEO_x  DOMINIO_x  \
0   2023      1     5030        2    11        1     10201          7   
1   2023      1     5030        2    11        2     10201          7   
2   2023      1     5030        2    11        3     10201          7   
3   2023      1     5030        2    11        4     10201          7   
4   2023      1     5030       11    11        1     10201          7   

   ESTRATO_x  CODINFOR  ...  OCUPAC_R3  OCUPAC_R4  RAMA_R3  RAMA_R4 CODTAREA  \
0          4         1  ...                                                    
1          4         2  ...                                                    
2          4         2  ...                                                    
3          4         2  ...                                                    
4          4         1  ...                                                    

  CODTIEMPO TICUEST01    FACPOB07 NCONGLOME_y SUB_CONGLOME_y  
0                
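
Comparing only the first five rows of each key set does not show whether the keys actually match. One way to check is an outer merge with `indicator=True`, which labels each row as present on the left, the right, or both. A minimal sketch on toy frames with hypothetical values:

```python
import pandas as pd

keys = ["CONGLOME", "VIVIENDA", "HOGAR", "CODPERSO"]

# Toy frames (hypothetical values) standing in for the two modules.
left = pd.DataFrame({"CONGLOME": [5007, 5007, 5030], "VIVIENDA": [22, 22, 2],
                     "HOGAR": [11, 11, 11], "CODPERSO": [1, 2, 1],
                     "P301A": [8, 10, 3]})
right = pd.DataFrame({"CONGLOME": [5007, 5007, 5007], "VIVIENDA": [22, 22, 31],
                      "HOGAR": [11, 11, 11], "CODPERSO": [1, 2, 1],
                      "P208A": [64, 63, 79]})

# validate="one_to_one" raises an error if a key combination repeats on either side.
check = pd.merge(left, right, on=keys, how="outer",
                 indicator=True, validate="one_to_one")
print(check["_merge"].value_counts())

# Keep only the rows found in both frames (equivalent to an inner merge).
merged = check[check["_merge"] == "both"].drop(columns="_merge")
print(merged)
```

Rows marked `left_only` or `right_only` are the ones to inspect before deciding how to correct mismatched keys.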

---

5. In the resulting DataFrame:  
- Group the data by a variable of your choice using `groupby()`.  
- Calculate a relevant statistical indicator, for example: the **average income per category**.  


In [25]:
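# Group by P209_y and compute the mean of NCONGLOME_y for each category
# Note: NCONGLOME_y appears to be a cluster identifier rather than an income
# variable, so this only illustrates the groupby mechanics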
resultado = df_merged.groupby('P209_y')['NCONGLOME_y'].mean()

print(resultado)

P209_y
     23295.302717
1    23048.550097
2    24128.679762
3    23682.408167
4    25664.943299
5    23879.302499
6    23668.747556
Name: NCONGLOME_y, dtype: float64
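
For an indicator such as average income per category, the same pattern applies to a numeric income column. A minimal sketch on toy data; `group` and `income` are hypothetical column names and the values are made up:

```python
import pandas as pd

# Toy data: `group` and `income` are hypothetical names and values.
toy = pd.DataFrame({"group": ["1", "1", "2", "2", "2"],
                    "income": [1000.0, 1500.0, 800.0, 1200.0, 1000.0]})

# Mean and number of observations per category.
summary = toy.groupby("group")["income"].agg(["mean", "count"])
print(summary)
```

Reporting the count next to the mean helps judge how much weight each category's average deserves.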


---
# Part 2 – If conditions 

---
6. Basic If Condition  
Write a Python function that checks if a given number is positive.  
- If the number is greater than zero, return `"The number X is positive."`.  
- Otherwise, return `"The number X is not positive."`.  


In [1]:
def check_number(i):
    if i > 0:
        return f'The number {i} is positive.'
    else:
        # Covers zero as well as negative numbers
        return f'The number {i} is not positive.'

In [6]:
print(check_number(10))
print(check_number(-100))

The number 10 is positive.
The number -100 is not positive.


---

7. If Condition with Multiple Expressions  
Create a program that checks the temperature (in Celsius) and returns a message depending on the value:  
- If the temperature is **below 0**, return `"It is freezing."`.  
- If the temperature is between **0 and 20**, return `"It is cold."`.  
- If the temperature is between **21 and 30**, return `"It is warm."`.  
- If the temperature is greater than **30**, return `"It is hot."`.  


In [12]:
def temperature(i):
    # Chained upper bounds leave no gaps, so non-integer values
    # such as 20.5 are classified instead of falling through to "hot"
    if i < 0:
        return "It is freezing."
    elif i <= 20:
        return "It is cold."
    elif i <= 30:
        return "It is warm."
    else:
        return "It is hot."

In [15]:
print(temperature(-5))
print(temperature(10))
print(temperature(23))
print(temperature(39))

It is freezing.
It is cold.
It is warm.
It is hot.


---

8. Logical Operators   [2 pts]
Write a function that determines if a person is eligible for a scholarship based on these conditions:  
- The person must have a GPA greater than **3.5** **AND**  
- Either their extracurricular activities are `"Yes"` **OR** they have community service hours greater than **50**.  

The function should return:  
- `"Eligible for scholarship."` if conditions are met.  
- `"Not eligible for scholarship."` otherwise.  


In [19]:
def scholarship(gpa, activities, community):
    # GPA must exceed 3.5, and at least one of the other two conditions must hold
    if gpa > 3.5 and (activities.lower() == 'yes' or community > 50):
        return "Eligible for scholarship."
    else:
        return "Not eligible for scholarship."

In [26]:
print(scholarship(4,'YEs',40))
print(scholarship(3.2,'YES',55))
print(scholarship(5,'NO',49))

Eligible for scholarship.
Not eligible for scholarship.
Not eligible for scholarship.


---

9. Python Identity Operators  
Create two lists:  
```python
list1 = [1, 2, 3]
list2 = [1, 2, 3]
list3 = list1 
```

Compare them with the identity operator (`is`) and the equality operator (`==`):
```python
list1 is list2
list1 is list3
list1 == list2 
 ```

Explain the difference in the results for each comparison.


In [28]:
list1 = [1, 2, 3]
list2 = [1, 2, 3]
list3 = list1

print('list1 is list2:', list1 is list2)
print('list1 is list3:', list1 is list3)
print('list1 == list2:', list1 == list2)


list1 is list2: False
list1 is list3: True
list1 == list2: True


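- `list1 is list2` is `False`: the `is` operator compares identity, and the two lists are separate objects even though their contents are equal.
- `list1 is list3` is `True`: `list3 = list1` binds a second name to the same object.
- `list1 == list2` is `True`: the `==` operator compares contents, and both lists hold the same elements.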
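
The practical consequence of identity shows up when a list is modified through one of its names. A minimal sketch, using new names `a`, `b`, and `c` so the lists above stay untouched:

```python
a = [1, 2, 3]
b = [1, 2, 3]   # equal contents, separate object
c = a           # second name for the same object

c.append(4)     # mutating through c also changes a

print(a)               # [1, 2, 3, 4]
print(b)               # [1, 2, 3]
print(a is c, a == b)  # True False
```

After the append, `a` and `b` no longer compare equal, while `a` and `c` remain the same object.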

---

10. Nested If Statement  [2 pts] 

Write a function that takes a student's score and determines the grade:  

- If the score is greater than or equal to 90, return `"A"`.  
- If the score is between 80 and 89:  
  - Check if the score is exactly 85, then return `"B+"`.  
  - Otherwise, return `"B"`.  
- If the score is between 70 and 79, return `"C"`.  
- Otherwise, return `"Fail"`. 

In [29]:
def score(i):
    if i >= 90:
        return 'A'
    elif i >= 80:  # 80 up to (but not including) 90, so 89.5 is not a "Fail"
        if i == 85:
            return 'B+'
        else:
            return 'B'
    elif i >= 70:  # 70 up to (but not including) 80
        return "C"
    else:
        return "Fail"

In [31]:
print(score(20))
print(score(71))
print(score(83))
print(score(85))
print(score(92))

Fail
C
B
B+
A


--- 

# Part 3 – For loops

---

11. For Loop in NumPy  

Write a for loop using **NumPy** to iterate through an array of numbers `[10, 20, 30, 40, 50]` and print each value multiplied by 2.  

- Re-question: How would you modify the loop so that it stores the results in a new NumPy array instead of just printing them?  


In [6]:
# Part 1
import numpy as np
arr = np.array([10, 20, 30, 40, 50])
for valor in arr:
    print(valor * 2)

20
40
60
80
100


In [7]:
# Part 2
arr = np.array([10, 20, 30, 40, 50])
resultados = []   # empty list

for valor in arr:
    resultados.append(valor * 2)

# Convert the list to a NumPy array
resultados = np.array(resultados)
print(resultados)

[ 20  40  60  80 100]
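
Two alternatives to appending to a list are worth knowing. A minimal sketch: the first preallocates the output array and fills it by position, the second drops the loop and relies on NumPy's element-wise arithmetic. The names `doubled` and `doubled_vectorized` are illustrative.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Option 1: preallocate the result and fill it by position.
doubled = np.empty_like(arr)
for i, value in enumerate(arr):
    doubled[i] = value * 2

# Option 2: NumPy applies the multiplication element-wise, with no explicit loop.
doubled_vectorized = arr * 2

print(doubled)
print(doubled_vectorized)
```

For large arrays the vectorized form is usually the faster and more readable choice.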


---

12. For Loop in List  

Create a list of words: `["python", "loop", "list", "iteration"]`.  
Write a for loop to print the length of each word.  

- Re-question: How can you rewrite the same loop using a **list comprehension**?  


In [9]:
#Part 1
palabras = ["python", "loop", "list", "iteration"]

for palabra in palabras:
    print(len(palabra))

6
4
4
9


In [10]:
#Part 2
palabras = ["python", "loop", "list", "iteration"]
longitudes = [len(palabra) for palabra in palabras]
print(longitudes)

[6, 4, 4, 9]


---

13. For Loop in Dictionary  

Given a dictionary of student scores:  
`{"Alice": 85, "Bob": 92, "Charlie": 78, "Diana": 88}`  

Write a for loop to print each student's name along with their score.  

- Re-question: Modify the loop so that it only prints the names of students who scored above 80.  


In [None]:
# Part 1
scores = {"Alice": 85, "Bob": 92, "Charlie": 78, "Diana": 88}

for nombre, puntaje in scores.items():
    print(nombre, ":", puntaje)
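
For the re-question, one possible approach is to add a condition inside the same loop. A minimal sketch; the list `above_80` is an extra, added here only so the result can be reused:

```python
scores = {"Alice": 85, "Bob": 92, "Charlie": 78, "Diana": 88}

above_80 = []
for name, points in scores.items():
    if points > 80:        # strictly above 80
        print(name)
        above_80.append(name)
```

This prints Alice, Bob, and Diana; Charlie is skipped with a score of 78.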

---

14. For Loop using Range  

Write a for loop using `range()` to print all even numbers between 1 and 20.  

- Re-question: How would you change the loop to also calculate the **sum** of these even numbers while iterating?  
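
A minimal sketch of one way to answer both parts, assuming "between 1 and 20" includes 20. The step argument of `range()` generates only the even numbers, and a running total accumulates the sum:

```python
total = 0
evens = []

# range(2, 21, 2) starts at 2, stops before 21, and steps by 2.
for n in range(2, 21, 2):
    print(n)
    evens.append(n)
    total += n

print("Sum:", total)  # Sum: 110
```

An alternative is `range(1, 21)` with an `if n % 2 == 0` test inside the loop, which reads closer to the wording of the question but checks twice as many numbers.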


---

15. Iterations over Pandas (ENAHO dataset)  [2 pts] 

Suppose you are analyzing the **National Household Survey (ENAHO)** dataset, specifically the file **`ENAHO01A-2023-400`**.  
The question of interest is **`P41601`**: *“¿Cuánto fue el monto total por la compra o servicio?”*.  

Write a `for` loop that iterates over the column `P41601` and prints values greater than **5000**.  

- **Re-question:** How would you optimize this task using **pandas vectorized operations** (e.g., boolean indexing) instead of a `for` loop, to make the analysis faster and more efficient?  
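
A minimal sketch of both versions on toy data, since the module 400 file is not loaded here. The values are hypothetical, and the column is built as text on the assumption that it may load as `object`, as many columns in the other modules do:

```python
import pandas as pd

# Toy data standing in for the module 400 file (hypothetical values).
toy = pd.DataFrame({"P41601": ["120", "7500", " ", "5000", "12000"]})

# Convert to numeric once; blanks become NaN.
amounts = pd.to_numeric(toy["P41601"], errors="coerce")

# Loop version: check each value one by one.
from_loop = []
for value in amounts:
    if value > 5000:          # comparisons with NaN are False
        print(value)
        from_loop.append(value)

# Vectorized version: one boolean mask over the whole column.
from_mask = amounts[amounts > 5000]
print(from_mask)
```

The mask keeps the original row index, which makes it easy to look up the other variables for the same records with `toy.loc[from_mask.index]`.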

