<div style="display: inline-block;">
    <img src="images/nhsa_logo.png" alt="Image" style="text-align: left;">
</div>

# Parent Gauge Data Analysis Project
---
## Data Wrangling Script and Documentation

In this script, we will provide a step-by-step demonstration of how script is being cleaned. 

In [2]:
# Start with the necessary imports
import pandas as pd
import numpy as np

In [3]:
#Load the Data into the dataframe
df = pd.read_excel('../data/INTVDATA.xlsx', sheet_name ='Main', engine ='openpyxl')

#Copy existing dataframe to .csv file
df.to_csv('../data/intv_data.csv', index=False)

#read the new .csv file
df = pd.read_csv('../data/intv_data.csv')

  df = pd.read_csv('../data/intv_data.csv')


---------

## Drop Duplicate Rows

In [12]:
####DROP DUPLICATE ROWS####
# Count the number of rows before dropping duplicates
rows_before = len(df)

# Drop duplicate rows based on all columns except the first two
result = df.drop_duplicates(subset=df.columns[2:])

# Count the number of rows after dropping duplicates
rows_after = len(result)

# Calculate and print the number of rows dropped
rows_dropped = rows_before - rows_after
print(f"Number of rows dropped: {rows_dropped}")

Number of rows dropped: 203


## Sample Generation

In [17]:
#Because the main dataset is too large for data cleaning, 
#construct a small sample for faster processing. Once I am done coding, we will use the entire dataset.
df_sample = df.sample(frac=0.08)
df_sample.to_csv('../data/sample_data.csv', index=False)

print("Created a sample of 8% of the total dataset")

Created a sample of 8% of the total dataset


## Main Data Cleaning

This is a summary of all the data cleaning and reformatting steps that were conducted.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.

## Remove Unnecessary Columns

In [6]:
##REMOVE UNNECESSARY COLUMNS
# guardian_vendor_id, interview_id, interviewer_id, interviewer,
#interviewer_vendor_id, student_staff_vendor_id, student_vendor_id

# List of columns to be removed
columns_to_remove = ['guardian_vendor_id', 'interview_id', 'interviewer_id', 
                     'interviewer', 'interviewer_vendor_id', 'student_staff_vendor_id', 
                     'student_vendor_id']

# Removing the columns from the DataFrame
df_sample = df_sample.drop(columns=columns_to_remove)

#save updates to working csv
df_sample.to_csv('../data/sample_data.csv', index=False)

# Code to Clean

## Center

In [7]:
##code to clean
##PROGRAM
unique_programs = df_sample['program'].unique().tolist()

# Sort the list in place
unique_programs.sort()

#create a text file of the unique programs
with open('../data/unique_programs.txt', 'w') as f:
    for item in unique_programs:
        f.write("%s\n" % item)

## Created_at

## Date

In [8]:
##DATE##
# Convert 'date' column to datetime format
#errors=coercse converts problematic dates to NaN.
#df['date'] = pd.to_datetime(df['date'], format='mixed', errors='coerce')
#print(df['date'].isnull().sum())

# Make a copy of the 'date' column
#df_copy = df['date'].copy()

# Iterate over the entries in the 'date' column
for i, date in enumerate(df_sample['date']):
    try:
        # Try to convert the date to datetime format
        pd.to_datetime(date, format='mixed')
    except Exception:
        print(f"An error occurred at index {i} with the date: {date}")
        
df_sample['date'] = pd.to_datetime(df_sample['date'], errors='coerce')

#Create separate 'year', 'month', and 'day' columns
df_sample['year'] = df_sample['date'].dt.year
df_sample['month'] = df_sample['date'].dt.month
df_sample['day'] = df_sample['date'].dt.day

#save to csv
df_sample.to_csv('../data/sample_data.csv', index=False)


An error occurred at index 339 with the date: date
An error occurred at index 531 with the date: date
An error occurred at index 881 with the date: date
An error occurred at index 953 with the date: date
An error occurred at index 961 with the date: date
An error occurred at index 1372 with the date: date
An error occurred at index 4517 with the date: date
An error occurred at index 6454 with the date: date
An error occurred at index 8159 with the date: date
An error occurred at index 10415 with the date: date
An error occurred at index 10559 with the date: date
An error occurred at index 10634 with the date: date
An error occurred at index 10764 with the date: date
An error occurred at index 11479 with the date: date
An error occurred at index 11577 with the date: date
An error occurred at index 14098 with the date: date
An error occurred at index 14340 with the date: date
An error occurred at index 14387 with the date: date
An error occurred at index 14719 with the date: date
An erro

## Evaluation

Dummy variables have been created, breaking the three categorical variables into three columns.

In [9]:
##Evaluation##
#use one-hot encoding to create dummy variables in preparation for regression.
#NOTE: This, however eliminates the original 'evaluation' column
df_sample = pd.get_dummies(df_sample, columns=['evaluation'])

#save updates to csv
df_sample.to_csv('../data/sample_data.csv', index=False)


## Evaluation Year

The original format of the evaluation year was formatted as 2016-2017. For easier analysis, the start and end year have been split up into two columns, "evaluation_start_year" and "evaluation_end_year"

In [10]:
##evaluation_year##
# Split the 'evaluation_year' column into two separate columns 'start_year' and 'end_year'
df_sample[['evaluation_start_year', 'evaluation_end_year']] = df_sample['evaluation_year'].str.split('-', expand=True)

#save updates to csv
df_sample.to_csv('../data/sample_data.csv', index=False)

## Guardian Name

In [None]:
##need to scrub the name

## Guardian Employment

## Guardian Enrollment Date

## Guardian Highest Education

## Guardian, Hispanic?

In [19]:
##remove anything that is not "yes" or "no"
# Update the guardian_hispanic column
df_sample.loc[~df_sample["guardian_hispanic"].isin(["Yes", "No"]), "guardian_hispanic"] = ""

#check the column "guardian_native_language" "student_hispanic", 
#"student_native_language", "language of interview", 
# Define a function to fill missing values in guardian_hispanic column based on conditions
def fill_guardian_hispanic(row):
    if pd.isnull(row['guardian_hispanic']):
        if row['student_hispanic'] == 'Yes':
            return 'Yes'
        elif row['student_hispanic'] == 'No':
            return 'No'
        elif row['guardian_native_language'] == 'Spanish':
            return 'Yes'
        elif row['student_native_language'] == 'Spanish':
            return 'Yes'
        elif row['language'] == 'Spanish':
            return 'Yes'
        else:
            return 'No'
    else:
        return row['guardian_hispanic']

# Apply the function to fill missing values in guardian_hispanic column
df_sample['guardian_hispanic'] = df_sample.apply(fill_guardian_hispanic, axis=1)

##---------------------------##

# Convert "yes" and "no" to binary dummy variables
#df_sample = pd.get_dummies(df_sample, columns=["guardian_hispanic"], prefix="guardian_hispanic", drop_first=True)

# Write the updated data to a new CSV file
df_sample.to_csv('../data/sample_data.csv', index=False)

## Guardian Native Language

## Guardian Race

## Guardian Birth Date???

In [None]:
##guardian_birth_date##
#– use DOB and interview year to determine guardian’s age during time of interview. 

## Guardian Sex

In [24]:
# Count the number of missing rows in 'guardian_hispanic' column
missing_count = df_sample['guardian_sex'].isnull().sum()

# Print the number of missing rows
print("Number of missing rows in 'sex' column:", missing_count)

Number of missing rows in 'sex' column: 503


In [9]:
##guardian_sex##
# Using direct mapping to create dummy variable out of guardian_sex
df_sample['female'] = (df_sample['guardian_sex'] == 'Female').astype(int)

#save updates to csv
df_sample.to_csv('../data/sample_data.csv', index=False)

## Guardian Vendor ID

## Interview ID

## Interviewer Name

## Interviewer ID

## Interviewer Vendor ID

## Language of Interview

## Mode of Interview

## Program

## Student Name

## Student Enrollment Date

## Student Disability Status

In [21]:
# Count the number of missing rows in 'guardian_hispanic' column
missing_count = df_sample['student_hispanic'].isnull().sum()

# Print the number of missing rows
print("Number of missing rows in 'student_hispanic' column:", missing_count)

Number of missing rows in 'student_hispanic' column: 1129


## Student, Hispanic
##remove anything that is not "yes" or "no"

In [22]:
##remove anything that is not "yes" or "no"
# Update the guardian_hispanic column
df_sample.loc[~df_sample["guardian_hispanic"].isin(["Yes", "No"]), "guardian_hispanic"] = ""

#check the column "guardian_native_language" "guardian_hispanic", "student_native_language"
# Define a function to fill missing values in student_hispanic column based on conditions
def fill_student_hispanic(row):
    if pd.isnull(row['student_hispanic']):
        if row['guardian_hispanic'] == 'Yes':
            return 'Yes'
        elif row['guardian_hispanic'] == 'No':
            return 'No'
        elif row['guardian_native_language'] == 'Spanish':
            return 'Yes'
        elif row['student_native_language'] == 'Spanish':
            return 'Yes'
        elif pd.notnull(row['student_native_language']):
            return 'No'
        else:
            return ''
    else:
        return row['student_hispanic']

# Apply the function to fill missing values in student_hispanic column
df_sample['student_hispanic'] = df_sample.apply(fill_student_hispanic, axis=1)

##---------------------------##

# Convert "yes" and "no" to binary dummy variables
#df_sample = pd.get_dummies(df_sample, columns=["guardian_hispanic"], prefix="guardian_hispanic", drop_first=True)

# Write the updated data to a new CSV file
df_sample.to_csv('../data/sample_data.csv', index=False)



## Student, ID

## Student Birth Date

## Student in last year

## Student Native Language

## Student Program Type

## Student Race

## Student Service Type

## Student Sex

## Student Staff

## Student Staff ID

## Student Staff Vendor ID

## Student Vendor ID

## Student Was Early Headstart

## Student Was Head Start

# Next Section: Likert Scale Interview Questions

# Next Section: Open Interview Questions

perhaps we can use data analysis to see how sentiments change
https://www.surveypractice.org/article/25699-what-to-do-with-all-those-open-ended-responses-data-visualization-techniques-for-survey-researchers