# Task 1 – Data Cleaning & Formatting

**Objective:**  
To clean and format the client-provided survey dataset into an SPSS-ready `.sav` file for import into Q.

**Tools Used:**  
Python (pandas, numpy, pyreadstat)

**Overview of Steps**
1. Remove invalid or incomplete records  
2. Process multi-response questions  
3. Convert text labels to numeric codes  
4. Apply skip logic (Q1_99)  
5. Rename and reorder columns  
6. Create 'Wave' variable  
7. Add variable and value labels  
8. Export to `.sav`


## Step 0 – Import Packages and Load Data

Before cleaning the data, I import the required Python libraries and my custom functions.

### Imported Libraries
- **pandas** – used for reading and transforming the Excel dataset into a DataFrame.  
- **numpy** – provides numerical operations such as handling missing values (`NaN`) and creating new columns efficiently.  
- **pyreadstat** – enables exporting the cleaned dataset to SPSS (`.sav`) format with correct labels and value codes.

### Importing Custom Functions from `Task1.py`
The script `Task1.py` contains all the cleaning functions I previously built.  
By importing them here, I can execute each step interactively in Jupyter Notebook and show how the dataset changes after every operation.

Functions imported:
- `remove_invalid_cases()` – removes rows that contain logic errors or are incomplete  
- `process_multiresponse_questions()` – splits multi-response questions into binary columns  
- `convert_labels_to_codes()` – maps text answers to numeric codes  
- `apply_q1_99_skip_logic()` – applies skip rules for respondents selecting “None of these” (I only caught this logic error during a later check)  
- `rename_and_reorder_columns()` – renames variables and adjusts their order  
- `create_wave_variable()` – creates the weekly ‘Wave’ variable from `CompletedDate`  
- `create_labels()` – builds variable and value label dictionaries for SPSS  
- `save_to_spss()` – exports the cleaned dataset with labels into `.sav`
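
To illustrate what splitting a multi-response question into binary columns can look like, here is a minimal sketch. It is not the code in `Task1.py`: the helper name `split_multiresponse`, the `Q1_` prefix, and the `"; "` separator (inferred from the way answers appear in the raw data) are all assumptions made for illustration.

```python
import pandas as pd

def split_multiresponse(series: pd.Series, prefix: str, sep: str = "; ") -> pd.DataFrame:
    """Hypothetical helper: one 0/1 column per distinct answer option."""
    # str.get_dummies splits each cell on `sep` and flags which options it contains.
    dummies = series.str.get_dummies(sep=sep)
    return dummies.add_prefix(prefix)

# Toy answers in the same "option; option" style as the awareness question.
sample = pd.Series(["Synergy; AGL; Origin", "None of these", "AGL"])
binary = split_multiresponse(sample, prefix="Q1_")
print(binary)
```

Each respondent gets a 1 in every column whose option they selected and a 0 elsewhere. The real function presumably also maps options to fixed codes (such as `Q1_99` for "None of these"), which this sketch does not attempt.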

### Loading the Raw Dataset
The raw Excel file, **`EXAMPLE DATA FILE.xlsx`**, is then read using `pandas.read_excel()` to inspect its structure and confirm that all expected columns are present before cleaning begins.
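
One way to confirm that the expected columns are present is a simple set comparison. The sketch below is an assumption about how such a check might be written, not part of `Task1.py`: `EXPECTED_COLUMNS` is a hypothetical, deliberately short list, and `find_missing_columns` is an illustrative helper name. A small stand-in frame is used so the example runs without the Excel file.

```python
import pandas as pd

# Hypothetical subset of expected headers; the real file has many more columns.
EXPECTED_COLUMNS = ["ID", "What is your gender?", "What is your age?", "CompletedDate"]

def find_missing_columns(df: pd.DataFrame, expected: list) -> list:
    """Return the expected column names that are absent from df, in the given order."""
    present = set(df.columns)
    return [col for col in expected if col not in present]

# Stand-in for the loaded dataset, with one expected column left out on purpose.
demo = pd.DataFrame({
    "ID": [76],
    "What is your gender?": ["Female"],
    "CompletedDate": ["2025-08-04"],
})
missing = find_missing_columns(demo, EXPECTED_COLUMNS)
print(missing)
```

In the notebook, the same helper could be called on the loaded `df`; an empty list would mean every expected column was found.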


In [5]:
import pandas as pd
import numpy as np
import pyreadstat

from Task1 import (
    remove_invalid_cases,
    process_multiresponse_questions,
    convert_labels_to_codes,
    apply_q1_99_skip_logic,
    rename_and_reorder_columns,
    create_wave_variable,
    create_labels,
    save_to_spss
)

# Load the raw data
df = pd.read_excel("EXAMPLE DATA FILE.xlsx")
df.head()


Unnamed: 0,ID,What is your gender?,What is your age?,What is your postcode?,Which of the following brands of electricity providers are you aware of?,Which of the following brands of electricity providers are you aware of? (Other (please specify)),And which ONE of these brands is your main provider?,And which ONE of these brands is your main provider? (Other (please specify)),"Thinking about ‘Origin’, how favourable is your overall impression of them?",How likely are you to recommend ‘Origin’ to friends or family?,...,How would you rate ‘Origin’ on each of the following? (Innovation),"In the past 12 months, have you seen or heard any advertising for ‘Origin’?",Where did you see or hear advertising for ‘Origin’?,Where did you see or hear advertising for ‘Origin’? (Other (please specify)),Which of the following best describes your current work status?,Which of the following best describes your current work status? (Other (please specify)),Which of the following best describes your total annual household income?,Which of the following best describes your household structure?,Which of the following best describes your household structure? (Other (please specify)),CompletedDate
0,76,Female,65+,6128,Synergy; AGL; Origin; Red Energy,,Origin,,Very favourable,1,...,Fair,Don't know,,,Retired,,"$60,000–$89,999",Group household / share house,,2025-08-04
1,78,Female,65+,6002,None of these,,Origin,,Very unfavourable,7,...,Very poor,No,,,Student,,"$60,000–$89,999",Single parent with children at home,,2025-08-04
2,79,Female,35-44,6289,Synergy; Western Power; Origin; Horizon Power;...,ATCO,Red Energy,,Very favourable,6,...,Excellent,No,,,Unemployed and looking for work,,"$60,000–$89,999","Couple, no children",,2025-08-04
3,82,Female,55-64,6122,AGL; Origin; Horizon Power,,Origin,,Neutral,2,...,Fair,No,,,Retired,,"$90,000–$119,999","Couple, no children",,2025-08-04
4,85,Female,25-34,6162,None of these,,Origin,,Very favourable,6,...,Fair,Yes,TV; Online / Social media; Outdoor (billboards...,,Working full time,,"Less than $30,000","Single, no children",,2025-08-04
