<div style="display: inline-block;">
    <img src="images/nhsa_logo.png" alt="Image" style="text-align: left;">
</div>

# Parent Gauge Data Analysis Project
---
## Data Wrangling Script and Documentation

In this script, we will provide a step-by-step demonstration of how script is being cleaned. 

In [87]:
# Start with the necessary imports
import pandas as pd
import numpy as np
from tabulate import tabulate
from prettytable import PrettyTable
from rich.console import Console
from rich.table import Table

#uses old version of google trans: pip3 install googletrans==3.1.0a0
from googletrans import Translator

#used for filling in missing genders --DISCLAIMER: I recognize the sensitivities of this matter and understand that this may not be totally accurate
import gender_guesser.detector as gender

In [4]:
#Load the Data into the dataframe
df = pd.read_excel('../data/INTVDATA.xlsx', sheet_name ='Main', engine ='openpyxl')
######MAKE SURE TO SPECIFY DATATYPE LATER ON######

#Copy existing dataframe to .csv file
df.to_csv('../data/intv_data.csv', index=False)

#read the new .csv file
df = pd.read_csv('../data/intv_data.csv')

  df = pd.read_csv('../data/intv_data.csv')


---------

## Drop Duplicate Rows

In [40]:
####DROP DUPLICATE ROWS####
# Count the number of rows before dropping duplicates
rows_before = len(df)

# Drop duplicate rows based on all columns except the first two
result = df.drop_duplicates(subset=df.columns[2:])

# Count the number of rows after dropping duplicates
rows_after = len(result)

# Calculate and print the number of rows dropped
rows_dropped = rows_before - rows_after
print(f"Number of duplicate rows dropped: {rows_dropped}")

Number of rows dropped: 203


## Sample Generation

Note: You can run this code again if you would like to reset the sample dataset.

In [89]:
#Because the main dataset is too large for data cleaning, 
#construct a small sample for faster processing. Once I am done coding, we will use the entire dataset.
df_sample = df.sample(frac=0.1)
df_sample.to_csv('../data/sample_data.csv', index=False)

print("Created a sample of 10% of the total dataset")

Created a sample of 10% of the total dataset


## Main Data Cleaning

This is a summary of all the data cleaning and reformatting steps that were conducted.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.

# Code to Clean

## Center

In [5]:
##code to clean
##PROGRAM
unique_programs = df_sample['program'].unique().tolist()

# Sort the list in place
unique_programs.sort()

#create a text file of the unique programs
with open('../data/unique_programs.txt', 'w') as f:
    for item in unique_programs:
        f.write("%s\n" % item)

## Created_at
 drop this feature, as it is unnecessary

In [10]:
df_sample = df_sample.drop('created_at', axis=1)

## Date

In [97]:
##DATE##
# Iterate over the entries in the 'date' column
for i, date in enumerate(df_sample['date']):
    try:
        # Try to convert the date to datetime format
        pd.to_datetime(date, format='mixed')
    except Exception:
        print(f"An error occurred at index {i} with the date: {date}")
        
df_sample['date'] = pd.to_datetime(df_sample['date'], errors='coerce')

#Create separate 'year', 'month', and 'day' columns
df_sample['date_year'] = df_sample['date'].dt.year
df_sample['date_month'] = df_sample['date'].dt.month
df_sample['date_day'] = df_sample['date'].dt.day

#save to csv
df_sample.to_csv('../data/sample_data.csv', index=False)


## Evaluation

Dummy variables have been created, breaking the three categorical variables into three columns.

In [42]:
##Evaluation##
#use one-hot encoding to create dummy variables in preparation for regression.
#NOTE: This, however eliminates the original 'evaluation' column
df_sample = pd.get_dummies(df_sample, columns=['evaluation'])

#save updates to csv
df_sample.to_csv('../data/sample_data.csv', index=False)

#**pending: drop excess 'evaluation_evaluation' column**



## Evaluation Year

The original format of the evaluation year was formatted as 2016-2017. For easier analysis, the start and end year have been split up into two columns, "evaluation_start_year" and "evaluation_end_year"

In [43]:
##evaluation_year##
# Split the 'evaluation_year' column into two separate columns 'start_year' and 'end_year'
df_sample[['evaluation_start_year', 'evaluation_end_year']] = df_sample['evaluation_year'].str.split('-', expand=True)

#save updates to csv
df_sample.to_csv('../data/sample_data.csv', index=False)

## Guardian Name

the name should be scrubbed for privacy.

In [44]:
##need to scrub the name, drop the column
#df_sample = df_sample.drop('guardian', axis=1)

## Guardian Employment

upon further inspection, it seems that the guardian employment is tied to the guardian_id, and it is not possible to further fill in missing values. note that over 60% of guardian_employment is missing. 

## Guardian Enrollment Date

## Guardian Highest Education

same as guardian_employment, no referencing of the guardian_id will fill in missing blanks. Note that over 60% of guardian employment is misssing.

## Guardian, Hispanic?

In [51]:
##remove anything that is not "yes" or "no"
# Update the guardian_hispanic column
df_sample.loc[~df_sample["guardian_hispanic"].isin(["Yes", "No"]), "guardian_hispanic"] = ""

#check the column "guardian_native_language" "student_hispanic", 
#"student_native_language", "language of interview", 
# Define a function to fill missing values in guardian_hispanic column based on conditions
def fill_guardian_hispanic(row):
    if pd.isnull(row['guardian_hispanic']) or row['guardian_hispanic'] not in ['Yes', 'No']:
        if row['student_hispanic'] == 'Yes':
            return 'Yes'
        elif row['student_hispanic'] == 'No':
            return 'No'
        elif row['guardian_native_language'] == 'Spanish':
            return 'Yes'
        elif row['student_native_language'] == 'Spanish':
            return 'Yes'
        elif row['language'] == 'Spanish':
            return 'Yes'
        else:
            return 'No'
    else:
        return row['guardian_hispanic']

# Apply the function to fill missing values in guardian_hispanic column
df_sample['guardian_hispanic'] = df_sample.apply(fill_guardian_hispanic, axis=1)

##---------------------------##

# Convert "yes" and "no" to binary dummy variables
df_sample['guardian_hispanic'] = df_sample['guardian_hispanic'].map({'Yes': True, 'No': False})

# Write the updated data to a new CSV file
df_sample.to_csv('../data/sample_data.csv', index=False)

## Guardian Native Language

## Guardian Race

## Guardian Date of Birth

In [None]:
##guardian_birth_date##
#– use DOB and interview year to determine guardian’s age during time of interview. 

## Guardian Sex

In [64]:
#use gender_guesser library to fill in missing values
def guess_gender(name):
    d = gender.Detector()
    gender_guess = d.get_gender(name)
    return gender_guess

# Apply guess_gender function to fill missing values in 'guardian_sex' column
df_sample['guardian_sex'] = df_sample.apply(lambda row: guess_gender(row['guardian']) if pd.isnull(row['guardian_sex']) else row['guardian_sex'], axis=1)

In [66]:
###use this section to clean up inconsistencies#####

In [11]:
# Using direct mapping to create dummy variable out of guardian_sex
df_sample['female'] = (df_sample['guardian_sex'] == 'Female').astype(int)

#save updates to csv
df_sample.to_csv('../data/sample_data.csv', index=False)

## Guardian Vendor ID

consider dropping this feature

In [69]:
##need to scrub the name, drop the column
df_sample = df_sample.drop('guardian_vendor_id', axis=1)

KeyError: "['guardian_vendor_id'] not found in axis"

## Interview ID

In [68]:
##need to scrub the name, drop the column
df_sample = df_sample.drop('interview_id', axis=1)

## Interviewer Name

Drop for privacy

In [70]:
##need to scrub the name, drop the column
df_sample = df_sample.drop('interviewer', axis=1)

## Interviewer ID

In [71]:
##need to scrub the name, drop the column
df_sample = df_sample.drop('interviewer_id', axis=1)

## Interviewer Vendor ID

In [72]:
#drop the column
df_sample = df_sample.drop('interviewer_vendor_id', axis=1)

## Language of Interview

Here, (1) remove invalid languages i.e. "language". (2) next, use the guardian ID to fill in missing values. (3) Lastly consider removing or imputing rows that have missing values. 

In [73]:
#step 1: remove invalid languages i.e. language

#step 2: use guardian_id to fill in missing values

#step 3: remove or impute rows that have missing values
#do this by checking if any of the open interview responses were in a certain language, if they are, apply corresponding language

## Mode of Interview

convert to dummy variable

In [76]:
value_counts = df['mode'].value_counts()

print(value_counts)

#impute missing values proportionally

#convert to dummy variable

mode
In-person    120277
Phone         84356
mode            204
Name: count, dtype: int64


## Program

## Student Name

drop the name for privacy.

In [77]:
##need to scrub the name, drop the column
#df_sample = df_sample.drop('student', axis=1)

## Student Enrollment Date

## Student Disability Status

In [12]:
# Count the number of missing rows in 'guardian_hispanic' column
missing_count = df_sample['student_hispanic'].isnull().sum()

# Print the number of missing rows
print("Number of missing rows in 'student_hispanic' column:", missing_count)

Number of missing rows in 'student_hispanic' column: 1136


## Student, Hispanic
##remove anything that is not "yes" or "no"

In [13]:
##remove anything that is not "yes" or "no"
# Update the guardian_hispanic column
df_sample.loc[~df_sample["guardian_hispanic"].isin(["Yes", "No"]), "guardian_hispanic"] = ""

#check the column "guardian_native_language" "guardian_hispanic", "student_native_language"
# Define a function to fill missing values in student_hispanic column based on conditions
def fill_student_hispanic(row):
    if pd.isnull(row['student_hispanic']):
        if row['guardian_hispanic'] == 'Yes':
            return 'Yes'
        elif row['guardian_hispanic'] == 'No':
            return 'No'
        elif row['guardian_native_language'] == 'Spanish':
            return 'Yes'
        elif row['student_native_language'] == 'Spanish':
            return 'Yes'
        elif pd.notnull(row['student_native_language']):
            return 'No'
        else:
            return ''
    else:
        return row['student_hispanic']

# Apply the function to fill missing values in student_hispanic column
df_sample['student_hispanic'] = df_sample.apply(fill_student_hispanic, axis=1)

##---------------------------##

# Convert "yes" and "no" to binary dummy variables
#df_sample = pd.get_dummies(df_sample, columns=["guardian_hispanic"], prefix="guardian_hispanic", drop_first=True)

# Write the updated data to a new CSV file
df_sample.to_csv('../data/sample_data.csv', index=False)



## Student, ID

## Student Birth Date

## Student in last year

## Student Native Language

## Student Program Type

## Student Race

## Student Service Type

## Student Sex

use gender_guesser again

In [91]:
def guess_gender(name):
    d = gender.Detector()
    gender_guess = d.get_gender(name)
    return gender_guess

# Apply guess_gender function to fill missing values in 'student_sex' column
df_sample['student_sex'] = df_sample.apply(lambda row: guess_gender(row['student']) if pd.isnull(row['student_sex']) else row['student_sex'], axis=1)



## Student Staff

drop

In [None]:
##need to scrub the name, drop the column
#df_sample = df_sample.drop('student_staff', axis=1)

## Student Staff ID

drop this column

In [94]:
##need to scrub the name, drop the column
#df_sample = df_sample.drop('student_staff_id', axis=1)

## Student Staff Vendor ID

consider removing

In [93]:
##need to scrub the name, drop the column
#df_sample = df_sample.drop('student_staff_vendor_id', axis=1)

## Student Vendor ID

In [92]:
##need to scrub the name, drop the column
#df_sample = df_sample.drop('student_staff_vendor_id', axis=1)

## Student Was Early Headstart

## Student Was Head Start

In [90]:
## Once everything is done: drop unnecesary columns

# Next Section: Likert Scale Interview Questions

# Next Section: Open Interview Questions

perhaps we can use data analysis to see how sentiments change
https://www.surveypractice.org/article/25699-what-to-do-with-all-those-open-ended-responses-data-visualization-techniques-for-survey-researchers

In [18]:
from googletrans import Translator

# Assuming 'df' is your DataFrame
columns_to_translate = ['OQ1', 'OQ2', 'OQ3', 'OQ3a', 'OQ4', 'OQ5', 'OQ6', 'OQ7', 'OQ8', 'OQ9', 'OQ10']

# Create an instance of the Translator class
translator = Translator(service_urls=['translate.google.com'])

# Iterate over the columns to translate
for column in columns_to_translate:
    # Translate the non-null values in the column
    df_sample[column] = df_sample[column].apply(lambda x: translator.translate(x, dest='en').text if pd.notnull(x) else x)


Exception: Could not find TKK token for this request.
See https://github.com/ssut/py-googletrans/issues/234 for more details.

In [15]:
#first need to translate spanish-language to english
# Assuming 'df' is your DataFrame
columns_to_translate = ['OQ1', 'OQ2', 'OQ3', 'OQ3a', 'OQ4', 'OQ5', 'OQ6', 'OQ7', 'OQ8', 'OQ9', 'OQ10']

# Create a table to display translation information
table = Table(title="Translation Information")
table.add_column("Variable", justify="left", style="cyan")
table.add_column("Rows Translated", justify="right", style="green")
table.add_column("% Translated (Non-Missing)", justify="right", style="magenta")

# Iterate over the columns to translate
for column in columns_to_translate:
    translated_count = 0
    non_missing_count = df[column].notnull().sum()
    translator = Translator(service_urls=['translate.google.com'])  # Create an instance of the Translator class
    for i, value in enumerate(df[column]):
        if pd.notnull(value):
            # Detect if the value is in Spanish
            if translator.detect(value).lang == "es":
                # Translate the value from Spanish to English
                translation = translator.translate(value, src='es', dest='en')
                # Update the translated value in the DataFrame
                df.loc[i, column] = translation.text
                translated_count += 1
    # Calculate the percentage of translated non-missing rows
    percent_translated = translated_count / non_missing_count * 100 if non_missing_count > 0 else 0
    # Add a row to the table
    table.add_row(column, str(translated_count), f"{percent_translated:.2f}%")

# Display the table
console = Console()
console.print(table)



#scrub names

Exception: Could not find TKK token for this request.
See https://github.com/ssut/py-googletrans/issues/234 for more details.

## Remove Unnecessary Columns

In [None]:
##REMOVE UNNECESSARY COLUMNS
# guardian_vendor_id, interview_id, interviewer_id, interviewer,
#interviewer_vendor_id, student_staff_vendor_id, student_vendor_id

# List of columns to be removed
columns_to_remove = ['guardian_vendor_id', 'interview_id', 'interviewer_id', 
                     'interviewer', 'interviewer_vendor_id', 'student_staff_vendor_id', 
                     'student_vendor_id']

# Removing the columns from the DataFrame
df_sample = df_sample.drop(columns=columns_to_remove)

#save updates to working csv
df_sample.to_csv('../data/sample_data.csv', index=False)