In [1]:
import pandas as pd

1. We have data `MotherData.csv` excerpted from a recent Demographic and Health Survey. First convert the dataset from `wide` (each observation is a mother) to `long` (each observation is a birth, with the associated mother id). The id `caseid` uniquely identifies each mother. The following variables describe children: **['bidx', 'bord', 'b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7', 'b8', 'b9', 'b10', 'b11', 'b12', 'b13', 'b15', 'b16']**. Each of them has one column per child, and there are columns for up to 20 children, ordered from the most recent birth to the oldest. Use for loops to reshape this dataset from `wide` to `long` at the mother-child level. For more information about the columns, see [this pdf](http://www.dhsprogram.com/pubs/pdf/DHSG4/Recode6_DHS_22March2013_DHSG4.pdf).


# Reshaping Data from Wide to Long Format in Python

## Understanding the Task

1. **Dataset Structure**: The dataset in the wide format has each row representing a mother, with multiple columns for each child.
2. **Objective**: Convert this dataset to a long format where each row represents a child, including the mother's ID.

## Analyzing the Provided Code

- Import the dataset and prepare an array to iterate through children.
- Use a for loop to handle each child, selecting relevant columns and creating a DataFrame for each.
- Concatenate these DataFrames to create a long format dataset.
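
The three steps above can be sketched end to end on a toy dataset. This is a minimal illustration, not the exercise solution: the values, the two-variable `prefixes` list, and the zero-padded suffixes `01` and `02` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy wide data: two mothers, two child slots (hypothetical values).
df = pd.DataFrame({
    "caseid": ["A", "B"],
    "bidx_01": [1, 1],
    "bidx_02": [2, np.nan],  # mother B has only one child
    "b4_01": [1, 2],
    "b4_02": [2, np.nan],
})

prefixes = ["bidx", "b4"]    # assumed subset of the child variables
nchilds_list = ["01", "02"]  # assumed suffix format

append_df = []
for childx in nchilds_list:
    # Columns that belong to this child slot, e.g. bidx_01 and b4_01
    cols = ["caseid"] + list(map(lambda x: x + f"_{childx}", prefixes))
    df1 = df.loc[:, cols].copy()
    # Strip the suffix so every slice has identical column names
    df1.columns = ["caseid"] + prefixes
    append_df.append(df1)

# Stack the per-child slices: one row per mother and child slot
df_long = pd.concat(append_df, ignore_index=True)
print(df_long.shape)  # (4, 3)
```

The result still contains one empty row for mother B's unused second slot.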

## Suggestions and Instructions for Completion

### Understanding the Code

- Explain the purpose of each line, e.g., `nchilds_list` holds the formatted child indices used to build the column names.
- Discuss the use of `map` and string formatting in `lambda x: x + f"_{childx}"`.
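
To see what the `map` and `lambda` expression produces, it may help to run it on its own. The sketch below assumes zero-padded suffixes (`01` to `20`) and a shortened `prefixes` list; check the actual column names in `MotherData.csv` before relying on this format.

```python
# Assumed stems; the full list is given in the exercise statement.
prefixes = ["bidx", "bord", "b4"]

# Assumed suffix format: "01", "02", ..., "20" (one per child slot).
nchilds_list = [f"{i:02d}" for i in range(1, 21)]

childx = nchilds_list[0]

# map applies the lambda to every stem, appending "_<childx>" to each
cols = ["caseid"] + list(map(lambda x: x + f"_{childx}", prefixes))

# An equivalent list comprehension, often easier to read
cols_alt = ["caseid"] + [f"{p}_{childx}" for p in prefixes]

print(cols)  # ['caseid', 'bidx_01', 'bord_01', 'b4_01']
```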

### Modify Column Renaming

- Make sure the role of `df1.columns = ['caseid'] + prefixes` is clear: it strips the child suffix in each loop iteration so that every per-child DataFrame has the same column names and the pieces can be stacked row-wise.

### Avoiding Duplicates

- Filter out rows for child slots that do not exist. A mother with fewer than 20 children has only missing values in her unused slots, and keeping them adds empty rows to the long dataset.
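
One possible way to do this filtering is sketched below. It assumes that `bidx` is missing exactly when the child slot is unused, which should be verified against the real data; the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: mother B has no second child.
df_long = pd.DataFrame({
    "caseid": ["A", "B", "A", "B"],
    "child_number": ["01", "01", "02", "02"],
    "bidx": [1, 1, 2, np.nan],
    "b4": [1, 2, 2, np.nan],
})

# Keep only rows where the birth index is present (assumed marker of a real birth)
births = df_long.dropna(subset=["bidx"]).reset_index(drop=True)
print(len(births))  # 3
```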

### Adding a Child Identifier

- Add a column indicating the child's number in each loop iteration:

  ```python
  for childx in nchilds_list:
      cols = ['caseid'] + list(map(lambda x: x + f"_{childx}", prefixes))
      df1 = df.loc[:, cols].copy()
      df1.columns = ['caseid'] + prefixes
      df1['child_number'] = childx  # Adding child number
      append_df.append(df1)
  ```

### Handling Missing Data

- Mothers with fewer than 20 children have missing values (`NaN`) in the columns of their unused child slots. Decide explicitly whether those rows are dropped or kept.

### Code Testing and Validation

- Test with a small subset first and validate the long-format DataFrame's structure and sample rows.
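
A few structural checks can be bundled into a small helper. The function name `check_long` and the `child_number` column are illustrative choices, and the row-count check only holds before empty child slots are dropped.

```python
import pandas as pd

def check_long(df_wide, df_long, n_slots, id_col="caseid"):
    """Basic sanity checks for an unfiltered wide-to-long reshape."""
    # Before filtering, every mother contributes one row per child slot.
    assert len(df_long) == len(df_wide) * n_slots
    # No mother should be lost or invented.
    assert set(df_long[id_col]) == set(df_wide[id_col])
    # Each (mother, child slot) pair should appear exactly once.
    assert not df_long.duplicated([id_col, "child_number"]).any()
    return True

# Hypothetical toy inputs
wide_df = pd.DataFrame({"caseid": ["A", "B"]})
long_df = pd.DataFrame({
    "caseid": ["A", "B", "A", "B"],
    "child_number": ["01", "01", "02", "02"],
})
ok = check_long(wide_df, long_df, n_slots=2)
```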

### Encourage Exploration

- Try different approaches, like more advanced pandas functions or performing the task without a loop.
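
As one loop-free alternative, `pd.wide_to_long` can do the reshape in a single call. The sketch below uses hypothetical toy data and assumes the columns follow a `stem_suffix` pattern with numeric suffixes; note that pandas converts numeric suffixes to numbers, so a suffix such as `01` is expected to come back as `1`.

```python
import numpy as np
import pandas as pd

# Toy wide data (hypothetical values): mother B has only one child.
df = pd.DataFrame({
    "caseid": ["A", "B"],
    "bidx_01": [1, 1],
    "bidx_02": [2, np.nan],
    "b4_01": [1, 2],
    "b4_02": [2, np.nan],
})

df_long = pd.wide_to_long(
    df,
    stubnames=["bidx", "b4"],  # variable stems without the child suffix
    i="caseid",                # identifier of the wide rows
    j="child_number",          # new column that receives the suffix
    sep="_",
    suffix=r"\d+",
).reset_index()

print(df_long.shape)  # (4, 4)
```

Rows for unused child slots are kept with missing values, so the same filtering step as in the loop version is still needed.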

### Commenting and Documentation

- Stress the importance of commenting on the code for better understanding.

### Consulting Documentation and Resources

- Refer to the pandas documentation for unfamiliar functions.

## Final Steps

- Check the resulting DataFrame's head and tail.
- Perform necessary cleaning or filtering.


2. Import all the `RECH1.SAV` files from all the subfolders of this folder: `Diplomado_PUCP/_data/endes`

Step 1: Understanding the File Structure
Your task is to navigate through the directory Diplomado_PUCP/_data/endes and its subdirectories.
You need to find files named RECH1.SAV in these subdirectories.

Step 2: Importing Necessary Libraries
Before writing the script, you need to import some essential libraries. Here's how you can do it:

In [2]:
import pandas as pd

In [16]:
import numpy as np

In [17]:
np.arange(2015, 2020)

array([2015, 2016, 2017, 2018, 2019])

In [9]:
df2015 = pd.read_spss('/Users/ar8787/Documents/GitHub/Diplomado_PUCP/_data/endes/2015/RECH1.SAV')

df2015['year_sample'] = 2015

df2016 = pd.read_spss('/Users/ar8787/Documents/GitHub/Diplomado_PUCP/_data/endes/2016/RECH1.SAV')

df2016['year_sample'] = 2016

df2017 = pd.read_spss('/Users/ar8787/Documents/GitHub/Diplomado_PUCP/_data/endes/2017/RECH1.SAV')

df2017['year_sample'] = 2017

In [18]:
import numpy as np
import pandas as pd

years = np.arange(2015, 2020)
dfs = []

for year in years:
    file_path = f'/Users/ar8787/Documents/GitHub/Diplomado_PUCP/_data/endes/{year}/RECH1.SAV'
    df = pd.read_spss(file_path)
    df['year_sample'] = year
    dfs.append(df)

# Concatenate the DataFrames for different years into a single DataFrame
result_df = pd.concat(dfs, ignore_index=True)


In [19]:
result_df

Unnamed: 0,HHID,HVIDX,HV101,HV102,HV103,HV104,HV105,HV106,HV107,HV108,...,QH13A4,QH13A5,QH13A6,year_sample,QH25A,QH25B,QH25CM,QH25CA,QH21A,ID1
0,000104301,3.0,Son/daughter,Yes,Yes,Male,2.0,"No education, preschool",,0.0,...,No,No,No,2015,,,,,,
1,000207901,5.0,Son/daughter,Yes,Yes,Male,0.0,"No education, preschool",,0.0,...,No,No,No,2015,,,,,,
2,000211901,4.0,Son/daughter,Yes,Yes,Male,3.0,"No education, preschool",,0.0,...,No,No,No,2015,,,,,,
3,000213001,5.0,Other relative,Yes,Yes,Male,2.0,"No education, preschool",,0.0,...,No,No,No,2015,,,,,,
4,000218601,2.0,Son/daughter,Yes,Yes,Male,3.0,"No education, preschool",,0.0,...,No,No,No,2015,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
724981,325407201,3.0,Hijo/Hija,Sí,Sí,Hombre,9.0,Primario,3.0,3.0,...,,,,2019,PERUANA,,,,No,2019.0
724982,325407301,1.0,Jefe del Hogar,Sí,Sí,Hombre,35.0,Superior,2.0,13.0,...,,,,2019,PERUANA,,,,,2019.0
724983,325407401,1.0,Jefe del Hogar,Sí,Sí,Hombre,24.0,Secundario,5.0,11.0,...,,,,2019,PERUANA,,,,,2019.0
724984,325407401,2.0,Esposa o esposo,Sí,Sí,Mujer,23.0,Secundario,3.0,9.0,...,,,,2019,PERUANA,,,,,2019.0


In [13]:
# append data
df_app = pd.concat([df2015, df2016, df2017])

In [14]:
df_app

Unnamed: 0,HHID,HVIDX,HV101,HV102,HV103,HV104,HV105,HV106,HV107,HV108,...,HV138,HV139,HV140,QH13A1,QH13A2,QH13A3,QH13A4,QH13A5,QH13A6,year_sample
0,000104301,3.0,Son/daughter,Yes,Yes,Male,2.0,"No education, preschool",,0.0,...,,,,No,No,No,No,No,No,2015
1,000207901,5.0,Son/daughter,Yes,Yes,Male,0.0,"No education, preschool",,0.0,...,,,,No,No,No,No,No,No,2015
2,000211901,4.0,Son/daughter,Yes,Yes,Male,3.0,"No education, preschool",,0.0,...,,,,No,No,No,No,No,No,2015
3,000213001,5.0,Other relative,Yes,Yes,Male,2.0,"No education, preschool",,0.0,...,,,,No,No,No,No,No,No,2015
4,000218601,2.0,Son/daughter,Yes,Yes,Male,3.0,"No education, preschool",,0.0,...,,,,No,No,No,No,No,No,2015
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
140593,317510701,1.0,Head,Yes,Yes,Male,31.0,Primary,5.0,5.0,...,,,,No,No,No,No,No,No,2017
140594,317510701,2.0,Wife or husband,Yes,Yes,Female,29.0,"No education, preschool",,0.0,...,,,,No,No,No,No,No,No,2017
140595,317510701,3.0,Son/daughter,Yes,Yes,Male,5.0,"No education, preschool",,0.0,...,,,,No,No,No,No,No,No,2017
140596,317510701,4.0,Son/daughter,Yes,Yes,Female,4.0,"No education, preschool",,0.0,...,,,,No,No,No,No,No,No,2017


In [None]:
import os
import glob
# If you need to read .SAV files, you might need a library like pyreadstat
import pyreadstat


Step 3: Navigating Directories and Finding Files
You can use os and glob libraries to navigate through directories and find files. Here's a basic way to do it:

In [None]:
def find_sav_files(base_path):
    # This pattern will match any RECH1.SAV files in subdirectories of the base path
    pattern = os.path.join(base_path, '**', 'RECH1.SAV')
    
    # glob.glob will return a list of file paths matching the pattern
    # recursive=True allows searching in subdirectories
    return glob.glob(pattern, recursive=True)

base_path = 'Diplomado_PUCP/_data/endes'
sav_files = find_sav_files(base_path)
print("Found .SAV files:", sav_files)
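
Once the files are found, each one usually needs a year label, as in the `year_sample` column used earlier. A minimal sketch follows, assuming every `RECH1.SAV` sits directly inside a folder named after its survey year; `year_from_path` is a hypothetical helper, not part of any library.

```python
from pathlib import Path

def year_from_path(file_path):
    """Return the name of the folder that contains the file, e.g. '2015'."""
    return Path(file_path).parent.name

# Example path following the layout used earlier in this notebook
example = "Diplomado_PUCP/_data/endes/2015/RECH1.SAV"
print(year_from_path(example))  # 2015
```

The returned value is a string; convert it with `int(...)` if a numeric `year_sample` column is wanted.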


Step 4: Reading .SAV Files
If you need to read data from these .SAV files, you can use pyreadstat. Here's a simple way to do it:

In [None]:
def read_sav_file(file_path):
    df, meta = pyreadstat.read_sav(file_path)
    return df  # df is a DataFrame containing the data from the .SAV file

# Example of reading the first found .SAV file
if sav_files:
    first_file_data = read_sav_file(sav_files[0])
    print(first_file_data)


Step 5: Additional Suggestions
Handling Exceptions: It's a good practice to handle exceptions, like file not found or read errors.
Learning Resources: Encourage students to refer to Python documentation or tutorials for understanding libraries like os, glob, and pyreadstat.
Code Comments: Teach them to write comments to explain their code for better understanding.
Practice: Encourage them to modify the script, like reading specific columns or data processing, to get more practice.
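
The exception-handling suggestion could look like the sketch below. `load_files` is a hypothetical helper that records failed paths instead of stopping at the first error. The `reader` argument defaults to `pd.read_spss`, which needs `pyreadstat` installed when it is actually called.

```python
import pandas as pd

def load_files(paths, reader=pd.read_spss):
    """Read each path with `reader`, collecting failures instead of stopping."""
    frames, failed = [], []
    for path in paths:
        try:
            frames.append(reader(path))
        except Exception as err:  # deliberately broad for a teaching sketch
            failed.append((path, repr(err)))
    return frames, failed

# Demonstration with a stand-in reader, so no .SAV file is required
def fake_reader(path):
    if "missing" in path:
        raise FileNotFoundError(path)
    return pd.DataFrame({"HHID": ["000104301"]})

frames, failed = load_files(["2015/RECH1.SAV", "missing/RECH1.SAV"], reader=fake_reader)
print(len(frames), len(failed))  # 1 1
```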


Step 6: Encouragement and Patience
Remind the students that learning to code takes time and practice. Encourage them to experiment with the code and explore additional Python features.