<div style="display: inline-block;">
    <img src="images/nhsa_logo.png" alt="Image" style="text-align: left;">
</div>

# Parent Gauge Data Analysis Project
---
## Data Wrangling Script and Documentation

In this script, we will provide a step-by-step demonstration of how script is being cleaned. 

In [1]:
# Start with the necessary imports
import pandas as pd
import numpy as np
import re
import datetime
from tabulate import tabulate
from prettytable import PrettyTable
from rich.console import Console
from rich.table import Table

#uses old version of google trans: pip3 install googletrans==3.1.0a0
from googletrans import Translator

# Utility function for gender imputation - DISCLAIMER: Sensitivity and Accuracy Considerations
# This function is intended for filling in missing gender values for statistical purposes only.
# Please note that gender imputation methods may not accurately reflect an individual's gender identity.
# Use caution and sensitivity when interpreting or applying these imputed values.
import gender_guesser.detector as gender_guesser
from genderize import Genderize
import sexmachine.detector as sexmachine

#utilize a Named Entity Recognition (NER) library to detect and remove named entities like names
import spacy

#nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [9]:
#Load the Data into the dataframe
df = pd.read_excel('../data/INTVDATA.xlsx', sheet_name ='Main', engine ='openpyxl')

#Copy existing dataframe to .csv file
df.to_csv('../data/intv_data.csv', index=False)

In [2]:
#read the new .csv file
df = pd.read_csv('../data/intv_data.csv')

---------

## Drop Duplicate Rows

A majority of the duplicate rows in the parent gauge dataset involve rows that are duplicates of the header row. We will use a random variable such as 'date' and remove rows that equal the name of the variable. Afterwards, we will drop additional duplicates, if any. 

In [3]:
#To make sure all duplicate "header" rows are eliminated, we pick a random column,
#remove rows where 'date' column is equal to 'date', except the first row
# Initialize a counter for deleted rows
deleted_date_rows = 0

# Iterate over the rows of the DataFrame
for i, row in df.iterrows():
    if row['date'] == 'date':
        # Remove the row if the date matches
        df.drop(i, inplace=True)
        deleted_date_rows += 1

# Count the number of rows before dropping duplicates
rows_before = len(df)

# Drop duplicate rows based on all columns except the first two, since they are indexed
result = df.drop_duplicates(subset=df.columns[2:])
        
# Count the number of rows after dropping duplicates
rows_after = len(result)

# Calculate and print the number of rows dropped
rows_dropped = (rows_before - rows_after) + deleted_date_rows
print(f"Number of duplicate rows dropped: {rows_dropped}")

#Copy existing dataframe to .csv file
df.to_csv('../data/intv_data.csv', index=False)

Number of duplicate rows dropped: 0


## Sample Data Frame Generation

Note: You can run this code again if you would like to reset the sample dataset.

In [8]:
#Because the main dataset is too large for data cleaning, 
#construct a small sample for faster processing. Once I am done coding, we will use the entire dataset.
df_sample = df.sample(frac=0.1)
df_sample.to_csv('../data/sample_data.csv', index=False)

print("Created a sample of 10% of the total dataset")

Created a sample of 10% of the total dataset


## Main Data Cleaning

This is a summary of all the data cleaning and reformatting steps that were conducted.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.
- **Program** - I identified the corresponding state and county.

# Code to Clean

This section examines each variable and transforms and cleans them accordingly.

## Center

For the center, we want to correspond each center to their respective state and geographic location for future analyses.

In [36]:
unique_centers = df_sample['center'].unique().tolist()

# Sort the list in place
unique_centers.sort()

#create a text file of the unique programs
with open('../data/unique_centers.txt', 'w') as f:
    for item in unique_centers:
        f.write("%s\n" % item)

## Created_at
 drop this feature, as it is unnecessary

## Date

Here, we will break down the date into three columns, seperating the year, month, and day. This code performs data cleaning on a column called 'date' in the dataset. It first checks each entry in the column for any errors in the date format, printing an error message if any are found. Then, it converts the 'date' column to a standard datetime format and creates new columns for the year, month, and day extracted from the dates. Finally, the cleaned data is saved to a CSV file. This code ensures that the dates are properly formatted, allows for easy analysis based on different time periods, and provides a clean dataset for further use.

In [47]:
# Iterate over the entries in the 'date' column
for i, date in enumerate(df_sample['date']):
    try:
        # Try to convert the date to datetime format
        pd.to_datetime(date, format='mixed')
    except Exception:
        print(f"An error occurred at index {i} with the date: {date}")
        
df_sample['date'] = pd.to_datetime(df_sample['date'], errors='coerce')

#Create separate 'year', 'month', and 'day' columns
df_sample['date_year'] = df_sample['date'].dt.year
df_sample['date_month'] = df_sample['date'].dt.month
df_sample['date_day'] = df_sample['date'].dt.day

#save to csv
df_sample.to_csv('../data/sample_data.csv', index=False)


## Evaluation Year

The original format of the evaluation year was formatted as '2016-2017', for instance. For easier analysis, the start and end year have been split up into two columns, "evaluation_start_year" and "evaluation_end_year." Moreover, we convert the new variables into a faster data type. Lastly, try to fix any erroneous years, then remove the rest.

In [11]:
# Get the current year
current_year = datetime.datetime.now().year

# Split the 'evaluation_year' column into two separate columns 'start_year' and 'end_year'
df_sample[['evaluation_start_year', 'evaluation_end_year']] = df_sample['evaluation_year'].str.split('-', expand=True)

# Convert 'evaluation_start_year' and 'evaluation_end_year to int16
df_sample['evaluation_start_year'] = df_sample['evaluation_start_year'].astype('int16')
df_sample['evaluation_end_year'] = df_sample['evaluation_end_year'].astype('int16')

# Convert 'student_enrollment_date' column to datetime, in preparation for referencing
df_sample['student_enrollment_date'] = pd.to_datetime(df_sample['student_enrollment_date'])

# NOTE: Parent guage officially started in 2017, so years far from that range are erroneous.
# FIX errneous rows that were supposed to be a valid year
# Store the indices of rows with updated evaluation_start_year
updated_indices = df_sample[(df_sample['evaluation_start_year'] < 2016) | (df_sample['evaluation_start_year'] > current_year)].index

# Update evaluation_start_year based on evaluation and student_enrollment_date, which is less erroneous
df_sample.loc[df_sample['evaluation_start_year'] < 2016, 'evaluation_start_year'] = df_sample.loc[df_sample['evaluation_start_year'] < 2016].apply(
    lambda row: row['student_enrollment_date'].year if row['evaluation'] == 'Initial' else row['student_enrollment_date'].year + 1, axis=1
)

df_sample.loc[df_sample['evaluation_start_year'] > current_year, 'evaluation_start_year'] = df_sample.loc[df_sample['evaluation_start_year'] > current_year].apply(
    lambda row: row['student_enrollment_date'].year if row['evaluation'] == 'Initial' else row['student_enrollment_date'].year + 1, axis=1
)

# Update evaluation_end_year only for the updated rows
df_sample.loc[updated_indices, 'evaluation_end_year'] = df_sample.loc[updated_indices, 'evaluation_start_year'] + 1

# Clear erroneous values where 'evaluation_start_year' is below 2016 or above current_year
df_sample.loc[df_sample['evaluation_start_year'] < 2016, ['evaluation_year', 'evaluation_start_year', 'evaluation_end_year']] = None
df_sample.loc[df_sample['evaluation_start_year'] > current_year, ['evaluation_year', 'evaluation_start_year', 'evaluation_end_year']] = None

#save updates to csv
df_sample.to_csv('../data/sample_data.csv', index=False)

## Evaluation

Dummy variables have been created, breaking the three categorical variables into three columns.

In [48]:
#use one-hot encoding to create dummy variables in preparation for regression.
#WARNING: This, however eliminates the original 'evaluation' column
df_sample = pd.get_dummies(df_sample, columns=['evaluation'])

#drop excess 'evaluation_evaluation' column, if it exists.
if 'evaluation_evaluation' in df_sample.columns:
    df_sample.drop('evaluation_evaluation', axis=1, inplace=True)

#save updates to csv
df_sample.to_csv('../data/sample_data.csv', index=False)


## Guardian Name

After employment the name should be scrubbed for privacy. 

## Guardian Employment

upon further inspection, it seems that the guardian employment is tied to the guardian_id, and it is not possible to further fill in missing values. note that over 60% of guardian_employment is missing. We should consider dropping this variable and removing it from our analysis.


## Guardian Enrollment Date

Use this variable to determine how many years the parent has been in the parent gauge program.

## Guardian Highest Education

same as guardian_employment, no referencing of the guardian_id will fill in missing blanks. Note that over 60% of guardian employment is misssing. We should consider dropping this variable and removing it from our analysis.

## Guardian, Hispanic

There are missing values, but many other variables including the guardian's native language, student's native language, whether the student is hispanic, and what language the interview was used—to determine whether the guardian was hispanic.

In [61]:
##remove anything that is not "yes" or "no"
#reconcile "dont know" and "refused" values
####pending...

# Update the guardian_hispanic column
df_sample.loc[~df_sample["guardian_hispanic"].isin(["Yes", "No"]), "guardian_hispanic"] = ""

#check the column "guardian_native_language" "student_hispanic", 
#"student_native_language", "language of interview", 
# Define a function to fill missing values in guardian_hispanic column based on conditions
def fill_guardian_hispanic(row):
    if pd.isnull(row['guardian_hispanic']) or row['guardian_hispanic'] not in ['Yes', 'No']:
        if row['student_hispanic'] == 'Yes':
            return 'Yes'
        elif row['student_hispanic'] == 'No':
            return 'No'
        elif row['guardian_native_language'] == 'Spanish':
            return 'Yes'
        elif row['student_native_language'] == 'Spanish':
            return 'Yes'
        elif row['language'] == 'Spanish':
            return 'Yes'
        else:
            return 'No'
    else:
        return row['guardian_hispanic']

# Apply the function to fill missing values in guardian_hispanic column
df_sample['guardian_hispanic'] = df_sample.apply(fill_guardian_hispanic, axis=1)

# Convert "yes" and "no" to binary dummy variables
df_sample['guardian_hispanic'] = df_sample['guardian_hispanic'].map({'Yes': True, 'No': False})

# Write the updated data to a new CSV file
df_sample.to_csv('../data/sample_data.csv', index=False)

## Guardian Native Language

## Guardian Race

tasks: 

In [40]:
table_breakdown = df_sample['guardian_race'].value_counts(dropna=False).reset_index()
table_breakdown.columns = ['Value', 'Count']

print(table_breakdown)

                                                Value  Count
0                                               White  10156
1                           Black or African American   5171
2                                                 NaN   1615
3                                               Other   1484
4                                        Multi-Racial    712
5                                          Don't Know    383
6                                               Asian    374
7                    American Indian or Alaska Native    200
8                                             , White    190
9                                         Unspecified     37
10                Native Hawaiian or Pacific Islander     37
11                        , Black or African American     20
12                                     , Multi-Racial     13
13                                            Refused     10
14                                White, Multi-Racial      8
15                      

## Guardian Date of Birth

use guardian's DOB and interview year to determine guardian’s age during time of interview. We create a new variable named 'guardian_age'

In [66]:
# use guardian's DOB and interview year to determine guardian’s age during time of interview. 
# Convert 'guardian_birth_date' and 'date' columns to datetime
df_sample['guardian_birth_date'] = pd.to_datetime(df_sample['guardian_birth_date'], errors='coerce')
df_sample['date'] = pd.to_datetime(df_sample['date'], errors='coerce')

# Calculate the age at the time of the interview
df_sample['guardian_age'] = pd.NaT  # Initialize the column with missing values

for i, row in df_sample.iterrows():
    try:
        age = (row['date'] - row['guardian_birth_date']).days // 365
        df_sample.at[i, 'guardian_age'] = age
    except:
        pass  # Ignore any dates that are out of bounds

# Remove rows where 'guardian_age' values are less than 18 or above 99
df_sample = df_sample[(df_sample['guardian_age'] >= 18) & (df_sample['guardian_age'] <= 99)]

# Write the updated dataframe to a new CSV file
df_sample.to_csv('../data/sample_data.csv', index=False)

## Guardian Sex

Here, we use the gender guesser to fill in missing values. 

In [26]:
#clean some values for more proper analysis
df_sample['guardian_sex'] = df_sample['guardian_sex'].replace('F', 'Female')
df_sample['guardian_sex'] = df_sample['guardian_sex'].replace('M', 'Male')
df_sample['guardian_sex'] = df_sample['guardian_sex'].replace('Unknown', '')

#save updates to csv
df_sample.to_csv('../data/sample_data.csv', index=False)

In [27]:
# use both the "gender_guesser", "Genderize", and "sexmachine" library to fill in missing values
# warning issue: running too many libraries in the function creates bottlenecks. 
def guess_gender(name):
    # Using sexmachine library
    try:
        d_sexmachine = sexmachine.Detector(case_sensitive=False)
        genders_sexmachine = [d_sexmachine.get_gender(name) for name in names]

        for name, gender_sexmachine in zip(names, genders_sexmachine):
            if gender_sexmachine == 'male':
                results.append((name, 'Male'))
            elif gender_sexmachine == 'female':
                results.append((name, 'Female'))
    except:
        pass

    # Using genderize library
    try:
        genderize_results = Genderize().get(names)
        for name, genderize_result in zip(names, genderize_results):
            if 'gender' in genderize_result and genderize_result['gender'] is not None:
                results.append((name, genderize_result['gender'].capitalize()))
    except:
        pass
        
    # Using gender_guesser library
    d_guesser = gender_guesser.Detector()
    gender_guess = d_guesser.get_gender(name)

    if gender_guess == 'male' or gender_guess == 'mostly_male':
        return 'Male'
    elif gender_guess == 'female' or gender_guess == 'mostly_female':
        return 'Female'

    return None

df_sample['guardian_sex'] = df_sample.apply(lambda row: guess_gender(row['guardian']) if row['guardian_sex'] == '' else row['guardian_sex'], axis=1)
df_sample['guardian_sex'] = df_sample.apply(lambda row: guess_gender(row['guardian']) if pd.isnull(row['guardian_sex']) else row['guardian_sex'], axis=1)
df_sample['guardian_sex'] = df_sample.apply(lambda row: guess_gender(row['guardian']) if row['guardian_sex'] == 'None' else row['guardian_sex'], axis=1)

#save updates to csv
df_sample.to_csv('../data/sample_data.csv', index=False)

In [28]:
table_breakdown = df_sample['guardian_sex'].value_counts(dropna=False).reset_index()
table_breakdown.columns = ['Value', 'Count']

print(table_breakdown)


    Value  Count
0  Female  18031
1    Male   1585
2    None    435
3   Other      3


In [85]:
# Using direct mapping to create dummy variable out of guardian_sex
df_sample['female'] = np.where(df_sample['guardian_sex'] == 'Female', True, False)

#save updates to csv
df_sample.to_csv('../data/sample_data.csv', index=False)

## Guardian Vendor ID

consider dropping this feature

## Interview ID

drop this.

## Interviewer Name

Drop for privacy

## Interviewer ID

drop this.

## Interviewer Vendor ID

drop this.

## Language of Interview

There are a few missing or invalid variables. As most interviews were done in the english language, impute variable with "English."

In [45]:
#Assume interview was done in english. Fill empty values with "English"
df_sample['language'].fillna("English", inplace=True)

## Mode of Interview

convert to dummy variable

In [14]:
#convert to dummy variable
# Using direct mapping to create dummy variable out of mode
df_sample['phone_interview'] = np.where(df_sample['mode'] == 'Phone', True, False)

#save updates to csv
df_sample.to_csv('../data/sample_data.csv', index=False)

## Program

## Student Name

drop the name for privacy.

## Student Enrollment Date

Erroneous years range from 1915 to 2121. need to fix them, and reference other columns. 

In [27]:
# Convert 'student_enrollment_date' to datetime format
df['student_enrollment_date'] = pd.to_datetime(df['student_enrollment_date'], errors='coerce')

#fix erroneous and missing years

## Student Disability Status

*If entry is blank, change to 'None' if interview disregards questions intended for parents of students with disabilities. 

In [17]:
#convert disability questions from string to numeric data type
df_sample['QD1'] = pd.to_numeric(df_sample['QD1'], errors='coerce')
df_sample['QD1a'] = pd.to_numeric(df_sample['QD1a'], errors='coerce')
df_sample['QD2'] = pd.to_numeric(df_sample['QD2'], errors='coerce')
df_sample['QD2a'] = pd.to_numeric(df_sample['QD2a'], errors='coerce')

# Custom function to check if any of the disability question columns have value between 1 to 5
def has_disability(row):
    return 1 <= row['QD1'] <= 5 or 1 <= row['QD1a'] <= 5 or 1 <= row['QD2'] <= 5 or 1 <= row['QD2a'] <= 5

# Apply the custom function and fill in the missing values in 'student_has_disability'
df_sample['student_has_disability'] = df_sample.apply(lambda row: has_disability(row) if pd.isna(row['student_has_disability']) else row['student_has_disability'], axis=1)

#save updates to csv
df_sample.to_csv('../data/sample_data.csv', index=False)

## Student, Hispanic

remove anything that is not "yes" or "no". reconcile "dont know" and "refused" values.

In [13]:
##remove anything that is not "yes" or "no"
#reconcile "dont know" and "refused" values

# Update the guardian_hispanic column
df_sample.loc[~df_sample["guardian_hispanic"].isin(["Yes", "No"]), "guardian_hispanic"] = ""

#check the column "guardian_native_language" "guardian_hispanic", "student_native_language"
# Define a function to fill missing values in student_hispanic column based on conditions
def fill_student_hispanic(row):
    if pd.isnull(row['student_hispanic']):
        if row['guardian_hispanic'] == 'Yes':
            return 'Yes'
        elif row['guardian_hispanic'] == 'No':
            return 'No'
        elif row['guardian_native_language'] == 'Spanish':
            return 'Yes'
        elif row['student_native_language'] == 'Spanish':
            return 'Yes'
        elif pd.notnull(row['student_native_language']):
            return 'No'
        else:
            return ''
    else:
        return row['student_hispanic']

# Apply the function to fill missing values in student_hispanic column
df_sample['student_hispanic'] = df_sample.apply(fill_student_hispanic, axis=1)

# Convert "yes" and "no" to binary dummy variables
#df_sample = pd.get_dummies(df_sample, columns=["guardian_hispanic"], prefix="guardian_hispanic", drop_first=True)

# Write the updated data to a new CSV file
df_sample.to_csv('../data/sample_data.csv', index=False)


## Student, ID

## Student Birth Date

In [9]:
# use student's DOB and interview year to determine student’s age during time of interview. 
# Convert 'student_birth_date' and 'date' columns to datetime
df_sample['student_birth_date'] = pd.to_datetime(df_sample['student_birth_date'], errors='coerce')
df_sample['date'] = pd.to_datetime(df_sample['date'], errors='coerce')

# Calculate the age at the time of the guardian's interview
df_sample['student_age'] = pd.NaT  # Initialize the column with missing values

for i, row in df_sample.iterrows():
    try:
        age = (row['date'] - row['student_birth_date']).days // 365
        df_sample.at[i, 'student_age'] = age
    except:
        pass  # Ignore any dates that are out of bounds

# Remove rows where 'student_age' values are less than -1 or above 6
df_sample = df_sample[(df_sample['student_age'] >= -1) & (df_sample['student_age'] <= 6)]

# Write the updated dataframe to a new CSV file
df_sample.to_csv('../data/sample_data.csv', index=False)

## Student in last year

In [22]:
#how do you fill up all these missing values?

## Student Native Language

## Student Program Type

For missing values, we can check the center's program type. Then, need to deal with variables that state both "head start" and "early head start." Can possibly also base it off student age. 7% is missing.

## Student Race

#there are too many combinations, over 85 of them. need to simplify and create dummy variables.

## Student Service Type

Try to reference off missing variables like program or center. 

## Student Sex

use gender_guesser again

In [91]:
def guess_gender(name):
    d = gender.Detector()
    gender_guess = d.get_gender(name)
    return gender_guess

# Apply guess_gender function to fill missing values in 'student_sex' column
df_sample['student_sex'] = df_sample.apply(lambda row: guess_gender(row['student']) if pd.isnull(row['student_sex']) else row['student_sex'], axis=1)



## Student Staff

drop

## Student Staff ID

drop this column

## Student Staff Vendor ID

consider removing

## Student Vendor ID

remove

## Student Was Early Headstart

In [56]:
# Convert string representations of 'True' and 'False' to boolean
df_sample['student_was_early_head_start'] = df_sample['student_was_early_head_start'].replace({'False': False, 'True': True})

# Write the updated dataframe to a new CSV file
df_sample.to_csv('../data/sample_data.csv', index=False)

## Student Was Head Start

In [57]:
# Convert string representations of 'True' and 'False' to boolean
df_sample['student_was_head_start'] = df_sample['student_was_head_start'].replace({'False': False, 'True': True})

# Write the updated dataframe to a new CSV file
df_sample.to_csv('../data/sample_data.csv', index=False)

# Next Section: Likert Scale Interview Questions

### CHANGE ALL UNANSWERED VALUES TO BLANK

In [82]:
##CHANGE ALL UNANSWERED VALUES TO BLANK##
#determine all of the likert scale interview variables

variables = ['Q1', 'Q1a', 'Q1b', 'Q1c', 'Q1d', 'Q2', 'Q2a', 'Q3', 'Q3a', 'Q4', 'Q4a', 'Q5', 'Q5a', 'Q6', 'Q6a',
             'Q7', 'Q7a', 'Q8', 'Q8a', 'QD1', 'QD1a', 'QD2', 'QD2a', 'Q9', 'Q9a', 'Q10', 'Q10a', 'Q11', 'Q11a',
             'Q12', 'Q12a', 'Q13', 'Q13a', 'Q14', 'Q14a', 'Q15', 'Q15a', 'Q16', 'Q16a', 'Q17', 'Q17a', 'Q18',
             'Q18a', 'Q19', 'Q20', 'Q21', 'Q22', 'Q23', 'Q24', 'Q25']

#convert string values to numeric, change the datatype to int
df_sample[variables] = df_sample[variables].apply(pd.to_numeric, errors='coerce', downcast='integer')

#replace the row values with blanks if the value is below 0
df_sample[variables] = df_sample[variables].applymap(lambda x: '' if x < 0 else x)

# Write the updated dataframe to a new CSV file
df_sample.to_csv('../data/sample_data.csv', index=False)

### DROP ALL BLANK INTERVIEWS

In [83]:
# Select columns starting with 'Q'
q_columns = [col for col in df_sample.columns if col.startswith('Q')]

# Count the number of rows where all 'Q' columns are blank or null
count_all_blank = len(df_sample[(df_sample[q_columns].isnull() | (df_sample[q_columns] == "")).all(axis=1)])

# Calculate the percentage of rows where all 'Q' columns are blank
percentage_all_blank = (count_all_blank / len(df_sample)) * 100

# Delete the rows where all 'Q' columns are blank
df_sample = df_sample[~(df_sample[q_columns].isnull() | (df_sample[q_columns] == "")).all(axis=1)]

# Print the number and percentage of deleted rows
print("Number of rows deleted where all 'Q' variables are blank:", count_all_blank)
print("Percentage of rows deleted where all 'Q' variables are blank: {:.2f}%".format(percentage_all_blank))

# Write the updated dataframe to .csv file
df_sample.to_csv('../data/sample_data.csv', index=False)

Number of rows deleted where all 'Q' variables are blank: 2087
Percentage of rows deleted where all 'Q' variables are blank: 10.20%


In [85]:
## need to consider how questions are intended to not be answered based on evaluation period and circumstances (e.g. student disability), filter them


Count of missing values in 'Q8a': 529
Percentage of missing values in 'Q8a': 3.88%


# Next Section: Open Interview Questions

perhaps we can use data analysis to see how sentiments change
https://www.surveypractice.org/article/25699-what-to-do-with-all-those-open-ended-responses-data-visualization-techniques-for-survey-researchers

In [61]:
#first need to translate spanish-language to english
columns_to_translate = ['OQ1', 'OQ2', 'OQ3', 'OQ3a', 'OQ4', 'OQ5', 'OQ6', 'OQ7', 'OQ8', 'OQ9', 'OQ10']

In [95]:
#SCRUB ALL NAMES WITH GENERIC NAMES
# Load the spaCy multilingual language model
nlp = spacy.load('xx_ent_wiki_sm')

# Define the generic placeholder for student names
generic_name = "Student"

# Define a function to remove names from a text string
def remove_names(text):
    doc = nlp(text)
    cleaned_text = ' '.join([token.text if token.ent_type_ != 1 else generic_name for token in doc])
    return cleaned_text

# Convert float values to strings in the 'OQ1' column
df['OQ1'] = df['OQ1'].astype(str)

# Apply the function to the 'OQ1' column
df['OQ1_cleaned'] = df['OQ1'].apply(remove_names)

# Print the updated column
print(df['OQ1_cleaned'])

0         YES I DO THE STORY THAT I HAVE SINCE MY OLD CH...
1                                                       nan
2         I am more open with telling the staff any prob...
3         it has helped me and my children . I have lear...
4         I mean my relationship with Stacy , its a good...
                                ...                        
204670    This is our last summer with the program .   W...
204671    Mantuvo conversaciones , escucho acerca de nue...
204672    La maestra hizo preguntas acerca de mi familia...
204673    Se intereso por mi familia y nuestras necesida...
204674    Manteniendo conversaciones , haciendo pregunta...
Name: OQ1_cleaned, Length: 204675, dtype: object


## Text Normalization

In [89]:
def normalize_text(text):
    text = text.lower() #lowercases
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text) #removes special characters and punctuation
    text = text.strip() #normalizes text
    # Apply additional normalization steps as needed
    return text

for column in df.columns:
    if column.startswith('OQ'):
        df_sample[column] = df_sample[column].apply(normalize_text)
        

#TOKENIZATION
#REMOVING STOPWORDS
#CORRECTING MISSPELLED WORDS
#STEMMING AND LEMATIZATION
#HANDLE MISSING VALUES

# Write the updated dataframe to .csv file
df_sample.to_csv('../data/sample_data.csv', index=False)

# Remove Unnecessary Columns

After meticulously going through the data cleaning process, it is important to ensure that the dataset is streamlined and efficient for analysis. Removing excess columns that do not contribute to the analysis or could introduce noise is a critical step in data preparation. This step not only enhances the readability and manageability of the dataset but also optimizes memory usage and potentially speeds up processing time. It’s vital to scrutinize each column and ascertain its relevance in context to the analysis goals.

In [None]:
## REMOVE UNNECESSARY AND NO LONGER NEEDED COLUMNS
# List of columns to be removed
columns_to_remove = ['created_at', 'guardian', 'guardian_employment', 'guardian_highest_education', 'guardian_birth_date', 'evaluation_year'
                     'guardian_vendor_id', 'interview_id', 'interviewer_id', 'mode',
                     'interviewer', 'interviewer_vendor_id', 'student', 'student_staff_vendor_id', 'student_birth_date'
                     'student_vendor_id', 'student_staff', 'student_staff_id']

# Removing the columns from the DataFrame
df_sample = df_sample.drop(columns=columns_to_remove)

#save updates to working csv
df_sample.to_csv('../data/sample_data.csv', index=False)

## Data Type Reconfiguration

To make the analysis faster

In [3]:
## code

In [88]:
unique_values = df['program'].nunique()
print(f"Number of unique values in 'program': {unique_values}")

Number of unique values in 'program': 223
