# 6 Nested, Hierarchical, Multilevel, Longitudinal Data

```python
import pandas as pd
import requests

# Correct URL
url = 'https://github.com/jhustata/lab6.md/raw/main/stsrpos.csv'
response = requests.get(url)
response.raise_for_status()  # Raise an error for bad status codes

# Save the content to a temporary file
with open('temp_stsrpos.csv', 'wb') as temp_file:
    temp_file.write(response.content)

# Load the dataset from the temporary file
dataset = pd.read_csv('temp_stsrpos.csv')

# Display the first few rows and column names
dataset.head(), dataset.columns
```

```
(   usrds_id                        gnn    srvc_dt
 0         0                 FEBUXOSTAT  11oct2024
 1         1        CYCLOBENZAPRINE HCL  15jun2024
 2         7       LEVOTHYROXINE SODIUM  18feb2024
 3        12             INSULIN ASPART  17nov2024
 4        15  HYDROCODONE/ACETAMINOPHEN  14jan2024,
 Index(['usrds_id', 'gnn', 'srvc_dt'], dtype='object'))
```

# Lab

This lab is a homework assignment for graduate students. Its tasks are designed to test both understanding and application of nested, hierarchical, multilevel, and longitudinal data concepts.

---

**Assignment: Analyzing Hierarchical and Longitudinal Data**

**Objective:**  
The purpose of this assignment is to develop your understanding and skills in handling nested, hierarchical, multilevel, and longitudinal data structures. You will analyze a dataset that exhibits these characteristics using Python and relevant libraries such as `pandas` and `statsmodels`.

**Dataset:**  
You are provided with a dataset (`stsrpos.csv`) containing information about medication prescriptions across various patients and time periods. The columns include:
- `usrds_id`: The patient identifier.
- `gnn`: The generic name of the medication.
- `srvc_dt`: The service date of the prescription.

**Instructions:**

1. **Data Preparation:**
   - Load the dataset and inspect its structure.
   - Convert the `srvc_dt` column to a proper date format.
   - Create a new column representing the year and month of the service date.

2. **Descriptive Analysis:**
   - Calculate and visualize the distribution of prescriptions over time.
   - Identify the most frequently prescribed medications.

3. **Hierarchical Data Analysis:**
   - The data are inherently hierarchical: prescriptions (each with its own service date) are nested within patients.
   - Determine the number of unique patients and prescriptions in the dataset.
   - Explore the variability in the number of prescriptions per patient over time.

4. **Longitudinal Data Analysis:**
   - Select a subset of patients who have at least five prescriptions over the observation period.
   - Examine how the frequency of their prescriptions changes over time.
   - Plot these changes for at least three selected patients to visualize their prescription trends.

5. **Multilevel Modeling:**
   - Build a simple linear mixed-effects model that predicts the number of prescriptions per patient per month from time, with a random intercept for patients.
   - Interpret the results, focusing on the significance of the random effect.

6. **Discussion:**
   - Discuss the challenges and insights of working with hierarchical and longitudinal data.
   - Reflect on how multilevel modeling can help address issues of data dependency.

**Submission:**  
Submit a Jupyter notebook containing all code, outputs, visualizations, and a discussion. Make sure to clearly comment your code and provide insightful interpretations of your results.

---


# Solutions

The solutions and hints below are intended for teaching assistants (TAs). They cover the major points, but students may approach the problems differently, and more than one answer can be correct.

---

**1. Data Preparation:**

- **Load the dataset and inspect its structure:**

   ```python
   import pandas as pd

   dataset = pd.read_csv('stsrpos.csv')
   dataset.head()
   ```

- **Convert `srvc_dt` column to a proper date format:**

   ```python
   dataset['srvc_dt'] = pd.to_datetime(dataset['srvc_dt'], format='%d%b%Y')
   ```

- **Create a new column for the year and month:**

   ```python
   dataset['year_month'] = dataset['srvc_dt'].dt.to_period('M')
   ```
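
A quick validation step can be useful after parsing. The sketch below uses a small made-up frame (the rows are hypothetical, not taken from `stsrpos.csv`) and passes `errors='coerce'` so that malformed dates become `NaT` and can be counted rather than raising an exception.

```python
import pandas as pd

# Hypothetical rows mimicking the srvc_dt format; the last value is deliberately invalid
toy = pd.DataFrame({
    'usrds_id': [0, 1, 7],
    'srvc_dt': ['11oct2024', '15jun2024', '31feb2024'],
})

# errors='coerce' turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(toy['srvc_dt'], format='%d%b%Y', errors='coerce')
n_bad = parsed.isna().sum()
print(f'{n_bad} unparseable date(s)')

# Year-month period for the rows that parsed (NaT stays NaT)
year_month = parsed.dt.to_period('M')
print(year_month)
```

On the real data, a nonzero count would suggest inspecting the offending rows before building `year_month`.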

**2. Descriptive Analysis:**

- **Calculate and visualize the distribution of prescriptions over time:**

   ```python
   import matplotlib.pyplot as plt

   # Count prescriptions by month
   month_counts = dataset['year_month'].value_counts().sort_index()

   # Plotting
   month_counts.plot(kind='bar', title='Number of Prescriptions by Month')
   plt.ylabel('Number of Prescriptions')
   plt.xlabel('Year-Month')
   plt.show()
   ```

- **Identify the most frequently prescribed medications:**

   ```python
   top_meds = dataset['gnn'].value_counts().head(10)
   top_meds.plot(kind='bar', title='Top 10 Most Prescribed Medications')
   plt.ylabel('Number of Prescriptions')
   plt.xlabel('Medication')
   plt.show()
   ```

**3. Hierarchical Data Analysis:**

- **Determine the number of unique patients and prescriptions:**

   ```python
   num_patients = dataset['usrds_id'].nunique()
   num_prescriptions = len(dataset)
   ```

- **Explore the variability in the number of prescriptions per patient:**

   ```python
   prescriptions_per_patient = dataset['usrds_id'].value_counts()
   prescriptions_per_patient.describe()
   ```
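
The summary above describes totals per patient; the "over time" part of the question can be approached with a patient-by-month table. The following is a minimal sketch on hypothetical data (the patient IDs and dates are invented for illustration), showing one way to compare how much monthly counts fluctuate within each patient.

```python
import pandas as pd

# Hypothetical prescriptions: patient 1 fills steadily, patient 2 fills in a burst
toy = pd.DataFrame({
    'usrds_id': [1, 1, 1, 2, 2, 2, 2],
    'srvc_dt': pd.to_datetime(['2024-01-05', '2024-02-05', '2024-03-05',
                               '2024-01-10', '2024-01-20', '2024-01-25',
                               '2024-03-01']),
})
toy['year_month'] = toy['srvc_dt'].dt.to_period('M')

# Rows = patients, columns = months, cells = prescription counts (0 where none)
counts = (toy.groupby(['usrds_id', 'year_month']).size()
             .unstack(fill_value=0))

# Spread of monthly counts within each patient
within_sd = counts.std(axis=1)
print(counts)
print(within_sd)
```

Note that `unstack` only creates columns for months in which at least one patient has a prescription, so a month that is empty for everyone would still be missing from the table.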

**4. Longitudinal Data Analysis:**

- **Select a subset of patients with at least five prescriptions:**

   ```python
   patients_with_5_plus = prescriptions_per_patient[prescriptions_per_patient >= 5].index
   subset = dataset[dataset['usrds_id'].isin(patients_with_5_plus)]
   ```

- **Examine and plot the frequency of prescriptions over time for selected patients:**

   ```python
   selected_patients = patients_with_5_plus[:3]

   for patient_id in selected_patients:
       patient_data = subset[subset['usrds_id'] == patient_id]
       freq_over_time = patient_data['year_month'].value_counts().sort_index()
       freq_over_time.plot(kind='bar', title=f'Patient {patient_id}: Prescriptions Over Time')
       plt.ylabel('Number of Prescriptions')
       plt.xlabel('Year-Month')
       plt.show()
   ```
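
One caveat with `value_counts()` is that months without any prescription are simply absent, so a bar chart can hide gaps in a patient's history. The sketch below, on invented dates and with an assumed observation window of January to June 2024, reindexes the monthly counts so that empty months appear as zeros.

```python
import pandas as pd

# Hypothetical service dates for a single patient
dates = pd.Series(pd.to_datetime(['2024-01-05', '2024-01-20', '2024-04-02']))
monthly = dates.dt.to_period('M').value_counts().sort_index()

# Assumed observation window; reindexing makes months with no fills explicit zeros
all_months = pd.period_range('2024-01', '2024-06', freq='M')
monthly_filled = monthly.reindex(all_months, fill_value=0)
print(monthly_filled)
```

Calling `monthly_filled.plot(kind='bar')` would then show the empty months on the axis, which makes irregular refill patterns easier to see.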

**5. Multilevel Modeling:**

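- **Build a linear mixed-effects model:**

   ```python
   import statsmodels.formula.api as smf

   # Outcome: prescriptions per patient-month; time: months since the first observed month
   panel = subset.groupby(['usrds_id', 'year_month']).size().reset_index(name='n_rx')
   ym = panel['year_month'].dt.year * 12 + panel['year_month'].dt.month
   panel['month_index'] = ym - ym.min()
   # Random intercept for each patient
   model = smf.mixedlm("n_rx ~ month_index", panel, groups=panel["usrds_id"])
   result = model.fit()
   print(result.summary())
   ```

   This model is deliberately simple and mostly illustrative. The outcome is the number of prescriptions per patient-month, `month_index` is a numeric time covariate, and `groups` gives each patient a random intercept. The patient ID must not be used as the outcome; it only defines the grouping. Because the panel is built from observed prescriptions, months in which a patient had none are absent rather than counted as zero. TAs can expect variations, such as different covariates in the formula or more complex random-effects structures (for example, a random slope for time).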
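
When interpreting the random effect, one common summary is the intraclass correlation (ICC): the share of total variance attributable to differences between patients. The helper below is a hypothetical convenience function, not part of `statsmodels`, and the numbers passed to it are illustrative rather than estimates from `stsrpos.csv`.

```python
def intraclass_correlation(var_between: float, var_within: float) -> float:
    """Share of total variance attributable to differences between groups."""
    total = var_between + var_within
    if total <= 0:
        raise ValueError('variance components must sum to a positive number')
    return var_between / total

# Illustrative variance components only
icc = intraclass_correlation(var_between=0.5, var_within=1.5)
print(icc)  # 0.25

# With a fitted random-intercept MixedLM result, the components would come from:
# intraclass_correlation(result.cov_re.iloc[0, 0], result.scale)
```

An ICC near zero suggests little clustering by patient, whereas a larger value indicates that observations from the same patient are correlated and that ignoring the grouping would likely understate uncertainty.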

**6. Discussion:**

- This section is primarily qualitative, but TAs should encourage students to discuss:
  - The hierarchical nature of the data.
  - The use of multilevel modeling to account for nested data structures.
  - The challenges of longitudinal data, such as missing data and varying observation times.
  - The insights gained from the analysis and how it informed their understanding of the dataset.
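
To make the point about varying observation times concrete, students could look at the gaps between consecutive prescriptions for each patient. The following is a rough sketch on invented data; the column name `days_since_prev` is an assumption introduced here for illustration.

```python
import pandas as pd

# Hypothetical prescriptions for two patients with very different spacing
toy = pd.DataFrame({
    'usrds_id': [1, 1, 1, 2, 2],
    'srvc_dt': pd.to_datetime(['2024-01-01', '2024-01-31', '2024-03-01',
                               '2024-02-10', '2024-06-09']),
})

# Days between each prescription and the previous one for the same patient
toy = toy.sort_values(['usrds_id', 'srvc_dt'])
toy['days_since_prev'] = toy.groupby('usrds_id')['srvc_dt'].diff().dt.days

# Per-patient summary of the gaps (the first prescription has no gap)
gap_summary = toy.groupby('usrds_id')['days_since_prev'].agg(['count', 'median', 'max'])
print(gap_summary)
```

Large differences in the gap distribution across patients are one sign that observations are irregularly spaced, which is worth mentioning in the discussion.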
