# Initial Cleaning/Data Exploration (NHES)

#### HISTORY

* 10/25/20 Initial exploration of OECD dataset
* 11/13/20 Data analysis of OECD data from 2015
* 12/11/20 Data Analysis of Google Trends searches

---------

This notebook makes basic observations about the 2016 National Household Education Survey (NHES). The data come from the "Parent and Family Involvement" section of the survey, which asked about the child's school, academic career, and parent involvement in school affairs. The data set covers children who are homeschooled as well as children who attend public or private schools. Most of the notebook is cleaning: renaming columns and recoding numeric codes into categorical variables with `string` labels. The notebook also answers:
* How many parents participated in the survey?
* How many children in the data are enrolled in public or private school?
* How many children in the data are homeschooled?


In [1]:
#Importing necessary packages
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

In [2]:
nhes_df = pd.read_csv("../data/raw/nhes.csv")

In [3]:
nhes_df.shape

(14075, 69)

14,075 parents were surveyed for the Parent and Family Involvement survey. The survey asked far more than 69 questions; the 69 columns here are the variables I chose from the database.

In [4]:
nhes_df.head(10)

Unnamed: 0,BASMID,QTYPE,GRADE,SCPUBPRI,SSAMSC,SEENJOY,SEGRADES,SEADPLCXX,SEBEHAVX,SESCHWRK,...,HDHEALTH,CDOBYY,AGE2015,CPLCBRTH,CSEX,TTLHHINC,RACEETH2,CENREG,PARGRADEX,FAMILY16X
0,20161000013,2,6,4,1,1,1,-1,0,0,...,2,2007,8,1,1,10,4,4,4,1
1,20161000017,2,12,4,1,1,2,2,0,0,...,2,2001,14,1,1,6,3,3,3,1
2,20161000050,2,4,4,1,1,5,-1,0,0,...,1,2008,7,1,1,6,3,2,4,1
3,20161000057,1,-1,-1,-1,-1,-1,-1,-1,-1,...,2,2001,14,1,2,5,1,2,4,1
4,20161000058,2,12,4,1,2,1,1,0,0,...,2,2001,14,1,2,9,1,3,4,1
5,20161000064,2,13,4,1,2,2,2,0,0,...,1,2000,15,1,2,8,3,2,3,1
6,20161000065,2,13,4,1,2,1,2,2,0,...,1,1999,16,1,1,4,1,2,1,3
7,20161000085,1,-1,-1,-1,-1,-1,-1,-1,-1,...,1,1999,16,1,1,5,5,2,4,1
8,20161000096,2,13,4,1,2,3,2,0,0,...,2,2000,15,1,1,5,3,4,3,3
9,20161000120,2,12,4,2,2,1,2,0,0,...,2,2001,14,1,2,9,3,4,5,2


In [5]:
nhes_df.columns

Index(['BASMID', 'QTYPE', 'GRADE', 'SCPUBPRI', 'SSAMSC', 'SEENJOY', 'SEGRADES',
       'SEADPLCXX', 'SEBEHAVX', 'SESCHWRK', 'SEGBEHAV', 'SEGWORK', 'SEREPEAT',
       'SESUSOUT', 'SESUSPIN', 'SEEXPEL', 'SEFUTUREX', 'GRADEEQ', 'HSWHOX',
       'HSTUTOR', 'HSCOOP', 'HSDAYS', 'HSHOURS', 'HSSTYL', 'HSCLIBRX',
       'HSCHSPUBX', 'HSCEDPUBX', 'HSCORGX', 'HSCCHURX', 'HSCPUBLX', 'HSCPRIVX',
       'HSCRELX', 'HSCNETX', 'HSCOTH', 'HSCVTLCR', 'HSCOURS', 'HSINTNET',
       'HSINTPUB', 'HSINTST', 'HSINTCH', 'HSINTAPB', 'HSINTPRI', 'HSINTCOL',
       'HSINTOH', 'FSSPORTX', 'FSVOL', 'FSMTNG', 'FSPTMTNG', 'FSATCNFN',
       'FSFUNDRS', 'FSCOMMTE', 'FSCOUNSLR', 'FHHOME', 'FHWKHRS', 'FHCAMT',
       'FHCHECKX', 'FHHELP', 'FORESPON', 'FODINNERX', 'HDHEALTH', 'CDOBYY',
       'AGE2015', 'CPLCBRTH', 'CSEX', 'TTLHHINC', 'RACEETH2', 'CENREG',
       'PARGRADEX', 'FAMILY16X'],
      dtype='object')

It is hard to understand what some of these columns mean, so I will rename the variables for easier reference.

In [6]:
#Create a mapping dictionary
rename_search = {'BASMID': 'id', 
                 'QTYPE': 'enroll_hmsc',
                 'GRADE': 'grade_year', 
                 'SCPUBPRI': 'school_type',
                 'SSAMSC': 'same_school', 
                 'SEENJOY': 'enjoy_school', 
                 'SEGRADES': 'grades',
                 'SEADPLCXX': 'ap_enroll', 
                 'SEBEHAVX': 'behavior_problem', 
                 'SESCHWRK': 'work_problem', 
                 'SEGBEHAV': 'behavior_good', 
                 'SEGWORK': 'work_good', 
                 'SEREPEAT': 'repeat_grades',
                 'SESUSOUT': 'out_suspension', 
                 'SESUSPIN': 'in_suspension', 
                 'SEEXPEL': 'expelled', 
                 'SEFUTUREX': 'expectations', 
                 'GRADEEQ': 'hmsc_grade_year',
                 'HSWHOX': 'hmsc_who_teaches',
                 'HSTUTOR': 'hmsc_tutor', 
                 'HSCOOP': 'hmsc_coop',
                 'HSDAYS': 'days_hmsc', 
                 'HSHOURS': 'hrs_hmsc',
                 'HSSTYL': 'teach_style', 
                 'HSCLIBRX': 'Library', 
                 'HSCHSPUBX': 'Homeschool Catalog', 
                 'HSCEDPUBX': 'Educational Publisher', 
                 'HSCORGX': 'Homeschooling Organization', 
                 'HSCCHURX': 'Church',
                 'HSCPUBLX': 'Public School', 
                 'HSCPRIVX': 'Private School', 
                 'HSCRELX': 'Bookstore', 
                 'HSCNETX': 'Websites', 
                 'HSCOTH': 'Other Source', 
                 'HSCVTLCR': 'Virtual School/Curriculum',
                 'HSCOURS': 'hmsc_course',
                 'HSINTNET': 'internet_hmsc', 
                 'HSINTPUB': 'Local Public School', 
                 'HSINTST': 'State', 
                 'HSINTCH': 'Charter School', 
                 'HSINTAPB': 'Another Public School',
                 'HSINTPRI': 'Private School', #Same label as HSCPRIVX above, so this column name appears twice
                 'HSINTCOL': 'College', 
                 'HSINTOH': 'Someplace Else', 
                 'FSSPORTX': 'school_event', 
                 'FSVOL': 'volunteer', 
                 'FSMTNG': 'school_mtng',
                 'FSPTMTNG': 'pta_mtng', 
                 'FSATCNFN': 'parent_teacher', 
                 'FSFUNDRS': 'fundraiser', 
                 'FSCOMMTE': 'school_cmte', 
                 'FSCOUNSLR': 'counselor', 
                 'FHHOME': 'hw_time',
                 'FHWKHRS': 'hw_hours', 
                 'FHCAMT': 'child_hw', 
                 'FHCHECKX': 'hw_check', 
                 'FHHELP': 'hw_help', 
                 'FORESPON': 'time_mgmt', 
                 'FODINNERX': 'meal_together',
                 'HDHEALTH': 'child_health', 
                 'CDOBYY': 'year_born', 
                 'AGE2015': 'age', 
                 'CPLCBRTH': 'country_born', 
                 'CSEX': 'sex',
                 'TTLHHINC': 'ttl_income', 
                 'RACEETH2': 'race', 
                 'CENREG': 'region', 
                 'PARGRADEX': 'parent_educ', 
                 'FAMILY16X': 'family_type'}
#Rename the columns
nhes_df = nhes_df.rename(columns = rename_search)
nhes_df.columns

Index(['id', 'enroll_hmsc', 'grade_year', 'school_type', 'same_school',
       'enjoy_school', 'grades', 'ap_enroll', 'behavior_problem',
       'work_problem', 'behavior_good', 'work_good', 'repeat_grades',
       'out_suspension', 'in_suspension', 'expelled', 'expectations',
       'hmsc_grade_year', 'hmsc_who_teaches', 'hmsc_tutor', 'hmsc_coop',
       'days_hmsc', 'hrs_hmsc', 'teach_style', 'Library', 'Homeschool Catalog',
       'Educational Publisher', 'Homeschooling Organization', 'Church',
       'Public School', 'Private School', 'Bookstore', 'Websites',
       'Other Source', 'Virtual School/Curriculum', 'hmsc_course',
       'internet_hmsc', 'Local Public School', 'State', 'Charter School',
       'Another Public School', 'Private School', 'College', 'Someplace Else',
       'school_event', 'volunteer', 'school_mtng', 'pta_mtng',
       'parent_teacher', 'fundraiser', 'school_cmte', 'counselor', 'hw_time',
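       'hw_hours', 'child_hw', 'hw_check', 'hw_help', 'time_mgmt',
       'meal_together', 'child_health', 'year_born', 'age', 'country_born',
       'sex', 'ttl_income', 'race', 'region', 'parent_educ', 'family_type'],
      dtype='object')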
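The renamed index lists `'Private School'` twice, because both `HSCPRIVX` and `HSINTPRI` map to that label in `rename_search`. A minimal sketch of how such collisions could be caught before renaming, using a four-entry excerpt of the mapping (the helper name `duplicated_targets` is hypothetical and not part of the notebook):

```python
from collections import Counter

def duplicated_targets(mapping):
    """Return the new names that more than one original column maps to."""
    counts = Counter(mapping.values())
    return sorted(name for name, n in counts.items() if n > 1)

# Excerpt of the rename mapping used above
excerpt = {
    'HSCPUBLX': 'Public School',
    'HSCPRIVX': 'Private School',
    'HSINTPUB': 'Local Public School',
    'HSINTPRI': 'Private School',
}

dupes = duplicated_targets(excerpt)
print(dupes)  # ['Private School']
```

With duplicated labels, selecting `nhes_df['Private School']` returns both columns as a data frame rather than a single series, so distinct names may be easier to work with later.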

I will recode some of the categorical variables to match what is written in the codebook.

In [7]:
enroll_hmsc_dict = {1 : 'Homeschooled',
                   2 : 'Enrolled in school'}

school_type_dict = {1 : 'Private School',
                   2 : 'Private School',
                   3: 'Private School',
                   4: 'Public School',
                   -1: 'Homeschooled'}

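#Reverse-codes grades so that a higher number is a better grade
#(defined here but not applied in the replace() call below)
grades_dict = {1: 4, #Mostly A's
              2: 3, #Mostly B's
              3: 2, #Mostly C's
              4: 1} #Mostly D's or lower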

ap_enroll_dict = {1: 'Yes',
                 2: 'No'}

grade_year_dict = {2: 'Kindergarten',
                  3: 'Kindergarten',
                  4: '1st',
                  5: '2nd',
                  6: '3rd',
                  7: '4th',
                  8: '5th',
                  9: '6th',
                  10: '7th',
                  11: '8th',
                  12: '9th',
                  13: '10th',
                  14: '11th',
                  15: '12th'}

hmsc_grade_year_dict = {2: 'Kindergarten',
                  3: '1st',
                  4: '2nd',
                  5: '3rd',
                  6: '4th',
                  7: '5th',
                  8: '6th',
                  9: '7th',
                  10: '8th',
                  11: '9th',
                  12: '10th',
                  13: '11th',
                  14: '12th'}

hmsc_course_dict = {1 : 'No Course',
                   2: 'Online & In-Person',
                   3: 'Online',
                   4: 'In-Person'}

fs_dict = {1: 'Yes',
           2: 'No'}

hw_time_dict = {1: 'Less than once a week',
               2: '1-2 days a week',
               3:'3-4 days a week',
               4: '5+ days a week',
               5: 'Never',
               6: 'Child does not have homework'}

#How does this child feel about the amount of homework he or she is assigned?
child_hw_dict = {1: 'Amount is about right',
                 2: "It's too much",
                 3: "It's too little"}

#How often does any adult in your household check to see that this child's homework is done?
# hw_check_dict = {'1': 'Never',
#                '2': 'Rarely',
#                '3':'Sometimes',
#                '4': 'Always',
#                 '-1': np.nan}


#In the past week, discussed with the child how to manage time
time_mgmt_dict = {1: '1', #Yes
                  2: '0'} #No

race_dict = {1: 'White',
               2: 'Black',
               3:'Hispanic',
               4: 'Asian',
               5: 'Other/Mixed'}

region_dict = {1: 'Northeast',
               2: 'South',
               3: 'Midwest',
               4: 'West'}

family_type_dict = {1: 'Two parents and sibling(s)',
                   2: 'Two parents, no sibling',
                   3: 'One parent and sibling(s)',
                   4: 'One parent, no sibling',
                   5: 'Other'}

sex_dict = {1: 'Male',
           2: 'Female'}

teach_style_dict = {1: 'Strictly use formal curriculum',
                   2: 'Mostly use formal curriculum',
                   3: 'Mostly use informal learning',
                   4: 'Always use informal learning'}

#Defined here but not applied in the replace() call below
hmsc_who_dict = {1: 'Mother',
                2: 'Father',
                3: 'Grandparent',
                4: 'Sibling',
                5: 'Another Person',
                6: 'Teacher/Tutor'}

hw_help_dict= {1: 1,
               2: 2,
               3: 3,
               4: 4,
               5: 5,
              -1: np.nan}

nhes_df = nhes_df.replace({'enroll_hmsc': enroll_hmsc_dict, 
                'school_type': school_type_dict, 
                'ap_enroll': ap_enroll_dict,
                'hmsc_course': hmsc_course_dict, 
                'school_event': fs_dict, 
                'volunteer': fs_dict,
                'school_mtng': fs_dict, 
                'pta_mtng': fs_dict, 
                'parent_teacher': fs_dict, 
                'fundraiser': fs_dict,
                'school_cmte': fs_dict, 
                'counselor': fs_dict, 
                'hw_time': hw_time_dict, 
                'child_hw': child_hw_dict,
                'time_mgmt': time_mgmt_dict, 
                'race': race_dict, 
                'region': region_dict,
                'family_type': family_type_dict,
                'sex': sex_dict,
                'grade_year': grade_year_dict,
                'teach_style': teach_style_dict,
                'hmsc_grade_year': hmsc_grade_year_dict,
                'hw_help': hw_help_dict})
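
`DataFrame.replace` with a nested dictionary only changes the values listed for each column, so codes that are missing from a dictionary (such as the `-1` skip code in `ap_enroll`) stay as they are. A small sketch on made-up data, not the NHES file, showing that behavior and one possible way to turn unmapped codes into missing values instead:

```python
import pandas as pd

# Toy data (hypothetical values) using the same coding scheme as above
toy = pd.DataFrame({'ap_enroll': [1, 2, -1], 'sex': [1, 2, 2]})

# Nested dictionary: {column name: {old value: new value}}
recoded = toy.replace({'ap_enroll': {1: 'Yes', 2: 'No'},
                       'sex': {1: 'Male', 2: 'Female'}})
print(recoded)  # the -1 entry in ap_enroll is left unchanged

# Alternative: Series.map returns NaN for any code missing from the dictionary
mapped = toy['ap_enroll'].map({1: 'Yes', 2: 'No'})
print(mapped.isna().tolist())  # [False, False, True]
```

Which behavior is preferable depends on whether `-1` should remain visible as "question skipped" or be treated as missing in later analysis.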


In [8]:
#Saving the data frame as a .csv file in the `data` folder
nhes_df.to_csv('../data/cleaned/nhes_CLEAN.csv', index=False)

#### How many children in the 2016 NHES were enrolled in some kind of public/private school versus homeschooled?

In [9]:
nhes_df['enroll_hmsc'].value_counts()

Enrolled in school    13523
Homeschooled            552
Name: enroll_hmsc, dtype: int64

In [10]:
552/13523

0.040819344819936404

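About 4% of the children in the survey were homeschooled: 552 of 14,075, or roughly 3.9%. The cell above divides by the 13,523 enrolled children instead of the full sample, so it gives the ratio of homeschooled to enrolled children (about 4.1%). Either way, the large majority of children were enrolled in public or private school.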
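
The share of all surveyed children who were homeschooled uses the full sample as the denominator. A sketch that rebuilds a column with the same counts as the output above (rather than loading the NHES file) and lets `value_counts(normalize=True)` do the division:

```python
import pandas as pd

# Rebuild a column with the counts reported by value_counts() above
enroll = pd.Series(['Enrolled in school'] * 13523 + ['Homeschooled'] * 552,
                   name='enroll_hmsc')

# normalize=True divides each count by the total number of rows (14,075)
shares = enroll.value_counts(normalize=True)
print(shares.round(4))
```

On the real data, the equivalent call would be `nhes_df['enroll_hmsc'].value_counts(normalize=True)`.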

#### How many attended public schools vs. private schools?

In [11]:
nhes_df['school_type'].value_counts()
#Public school children are coded as 4

Public School     11991
Private School     1532
Homeschooled        552
Name: school_type, dtype: int64

In [12]:
sum_pub_priv = sum(nhes_df['school_type'] == 'Public School')+sum(nhes_df['school_type'] == 'Private School')
sum(nhes_df['school_type'] == 'Public School')/sum_pub_priv

0.8867115285069881

88.7% of children enrolled in school attended public school. Public school is the more common choice, which could be due to high tuition at private schools.

In [13]:
nhes_df.groupby('grade_year')['enroll_hmsc'].value_counts().unstack()

enroll_hmsc,Enrolled in school,Homeschooled
grade_year,Unnamed: 1_level_1,Unnamed: 2_level_1
-1,,552.0
10th,1290.0,
11th,1342.0,
12th,1421.0,
1st,779.0,
2nd,848.0,
3rd,915.0,
4th,896.0,
5th,955.0,
6th,946.0,


Here, all homeschooled children are coded as -1 (a separate question asks about their equivalent grade, which is explored in the data analysis notebook). Among the grades displayed, each of 10th through 12th grade has more students than any of 1st through 6th, so the enrolled sample appears to skew toward older students.
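
The grade labels in the table above sort alphabetically, so `10th` comes before `1st`. One possible way to get rows in grade order is an ordered categorical. The sketch below uses a few made-up rows rather than the NHES data, and the `grade_order` list is an assumption based on the labels in `grade_year_dict`:

```python
import pandas as pd

# Assumed ordering, taken from the labels used in grade_year_dict
grade_order = ['Kindergarten', '1st', '2nd', '3rd', '4th', '5th', '6th',
               '7th', '8th', '9th', '10th', '11th', '12th']

# Toy rows (hypothetical), deliberately out of order
toy = pd.DataFrame({'grade_year': ['10th', '1st', 'Kindergarten', '1st', '9th']})

# An ordered categorical sorts by position in grade_order, not alphabetically
toy['grade_year'] = pd.Categorical(toy['grade_year'],
                                   categories=grade_order, ordered=True)

# observed=True keeps only the grades that actually occur in the data
counts = toy.groupby('grade_year', observed=True).size()
print(counts)
```

Values that are not in `grade_order`, such as the `-1` used for homeschooled children, would become missing under this approach, so they would need to be handled separately.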

### Creating a new data frame for just homeschool data

In [14]:
hmsc_filter = nhes_df['enroll_hmsc']== 'Homeschooled'
hmsc_df = nhes_df[hmsc_filter]

In [15]:
hmsc_df.head(10)

Unnamed: 0,id,enroll_hmsc,grade_year,school_type,same_school,enjoy_school,grades,ap_enroll,behavior_problem,work_problem,...,child_health,year_born,age,country_born,sex,ttl_income,race,region,parent_educ,family_type
3,20161000057,Homeschooled,-1,Homeschooled,-1,-1,-1,-1,-1,-1,...,2,2001,14,1,Female,5,White,South,4,Two parents and sibling(s)
7,20161000085,Homeschooled,-1,Homeschooled,-1,-1,-1,-1,-1,-1,...,1,1999,16,1,Male,5,Other/Mixed,South,4,Two parents and sibling(s)
16,20161000188,Homeschooled,-1,Homeschooled,-1,-1,-1,-1,-1,-1,...,1,1998,17,1,Female,3,Hispanic,West,2,"Two parents, no sibling"
83,20161001259,Homeschooled,-1,Homeschooled,-1,-1,-1,-1,-1,-1,...,5,2001,14,1,Female,1,Hispanic,South,4,Two parents and sibling(s)
122,20161001905,Homeschooled,-1,Homeschooled,-1,-1,-1,-1,-1,-1,...,1,1998,17,1,Female,1,White,West,3,One parent and sibling(s)
138,20161002179,Homeschooled,-1,Homeschooled,-1,-1,-1,-1,-1,-1,...,1,1998,17,1,Female,7,Other/Mixed,West,3,Two parents and sibling(s)
175,20161002612,Homeschooled,-1,Homeschooled,-1,-1,-1,-1,-1,-1,...,1,1999,16,1,Male,3,Hispanic,West,1,Two parents and sibling(s)
189,20161002841,Homeschooled,-1,Homeschooled,-1,-1,-1,-1,-1,-1,...,2,1998,17,1,Female,2,Black,South,3,One parent and sibling(s)
203,20161003133,Homeschooled,-1,Homeschooled,-1,-1,-1,-1,-1,-1,...,1,2007,8,1,Female,8,Hispanic,Midwest,4,"Two parents, no sibling"
224,20161003438,Homeschooled,-1,Homeschooled,-1,-1,-1,-1,-1,-1,...,2,2008,7,1,Male,4,White,South,5,Two parents and sibling(s)


In [16]:
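#Flag columns where every value is the -1 skip code (questions not asked about homeschooled children)
#Comparing values directly avoids calling mean() on text columns, which can raise an error in recent pandas versions
drop_columns = (hmsc_df == -1).all()
cols_to_drop = drop_columns[drop_columns].index
cols_to_drop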

Index(['grade_year', 'same_school', 'enjoy_school', 'grades', 'ap_enroll',
       'behavior_problem', 'work_problem', 'behavior_good', 'work_good',
       'repeat_grades', 'out_suspension', 'in_suspension', 'expelled',
       'expectations', 'school_event', 'volunteer', 'school_mtng', 'pta_mtng',
       'parent_teacher', 'fundraiser', 'school_cmte', 'counselor', 'hw_time',
       'hw_hours', 'child_hw', 'hw_check'],
      dtype='object')

In [17]:
hmsc_df = hmsc_df.drop(columns = cols_to_drop)
hmsc_df.columns

Index(['id', 'enroll_hmsc', 'school_type', 'hmsc_grade_year',
       'hmsc_who_teaches', 'hmsc_tutor', 'hmsc_coop', 'days_hmsc', 'hrs_hmsc',
       'teach_style', 'Library', 'Homeschool Catalog', 'Educational Publisher',
       'Homeschooling Organization', 'Church', 'Public School',
       'Private School', 'Bookstore', 'Websites', 'Other Source',
       'Virtual School/Curriculum', 'hmsc_course', 'internet_hmsc',
       'Local Public School', 'State', 'Charter School',
       'Another Public School', 'Private School', 'College', 'Someplace Else',
       'hw_help', 'time_mgmt', 'meal_together', 'child_health', 'year_born',
       'age', 'country_born', 'sex', 'ttl_income', 'race', 'region',
       'parent_educ', 'family_type'],
      dtype='object')
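
The drop step above removes questions that were skipped for every homeschooled child. A small sketch of the same idea on made-up data, checking directly whether every value in a column equals the `-1` skip code (the helper name `drop_all_skipped` is hypothetical):

```python
import pandas as pd

def drop_all_skipped(df, skip_code=-1):
    """Drop columns in which every value equals the skip code."""
    all_skipped = (df == skip_code).all()
    return df.loc[:, ~all_skipped]

# Toy data (hypothetical values) mixing numeric codes and text labels
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'grades': [-1, -1, -1],      # skipped for every row, so it is dropped
    'teach_style': ['Mostly use formal curriculum', -1,
                    'Always use informal learning'],
    'days_hmsc': [5, 4, -1],     # skipped for one row only, so it is kept
})

trimmed = drop_all_skipped(toy)
print(list(trimmed.columns))  # ['id', 'teach_style', 'days_hmsc']
```

Because the check is a plain equality test, it works the same way on numeric columns and on columns that already hold text labels.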

In [18]:
#Saving the data frame as a .csv file in the `data` folder
hmsc_df.to_csv('../data/hmsc_df.csv', index=False)

### Creating a new data frame with data for enrolled students only

This makes it possible to compare students enrolled in physical schools with those who are homeschooled.

In [19]:
#Filtering for enrolled students
enroll_filter = nhes_df['enroll_hmsc']== 'Enrolled in school'
enroll_df = nhes_df[enroll_filter]

In [20]:
enroll_df.mean()

id                            2.016111e+10
same_school                   1.028248e+00
enjoy_school                  1.754862e+00
grades                        2.055387e+00
behavior_problem              5.120905e-01
work_problem                  6.119944e-01
behavior_good                 1.219996e+00
work_good                     1.202544e+00
repeat_grades                 1.922724e+00
out_suspension                1.933890e+00
in_suspension                 1.931006e+00
expelled                      1.986763e+00
expectations                  4.957628e+00
hmsc_grade_year              -1.000000e+00
hmsc_who_teaches             -1.000000e+00
hmsc_tutor                   -1.000000e+00
hmsc_coop                    -1.000000e+00
days_hmsc                    -1.000000e+00
hrs_hmsc                     -1.000000e+00
teach_style                  -1.000000e+00
Library                      -1.000000e+00
Homeschool Catalog           -1.000000e+00
Educational Publisher        -1.000000e+00
Homeschooling Organization   -1.000000e+00

In [21]:
#Flag columns where every value is the -1 skip code (homeschool-only questions), then drop them
drop_columns2 = (enroll_df == -1).all()
cols_to_drop2 = drop_columns2[drop_columns2].index
enroll_df = enroll_df.drop(columns = cols_to_drop2)
enroll_df.columns

Index(['id', 'enroll_hmsc', 'grade_year', 'school_type', 'same_school',
       'enjoy_school', 'grades', 'ap_enroll', 'behavior_problem',
       'work_problem', 'behavior_good', 'work_good', 'repeat_grades',
       'out_suspension', 'in_suspension', 'expelled', 'expectations',
       'school_event', 'volunteer', 'school_mtng', 'pta_mtng',
       'parent_teacher', 'fundraiser', 'school_cmte', 'counselor', 'hw_time',
       'hw_hours', 'child_hw', 'hw_check', 'hw_help', 'time_mgmt',
       'meal_together', 'child_health', 'year_born', 'age', 'country_born',
       'sex', 'ttl_income', 'race', 'region', 'parent_educ', 'family_type'],
      dtype='object')

In [22]:
enroll_df.head()

Unnamed: 0,id,enroll_hmsc,grade_year,school_type,same_school,enjoy_school,grades,ap_enroll,behavior_problem,work_problem,...,child_health,year_born,age,country_born,sex,ttl_income,race,region,parent_educ,family_type
0,20161000013,Enrolled in school,3rd,Public School,1,1,1,-1,0,0,...,2,2007,8,1,Male,10,Asian,West,4,Two parents and sibling(s)
1,20161000017,Enrolled in school,9th,Public School,1,1,2,No,0,0,...,2,2001,14,1,Male,6,Hispanic,Midwest,3,Two parents and sibling(s)
2,20161000050,Enrolled in school,1st,Public School,1,1,5,-1,0,0,...,1,2008,7,1,Male,6,Hispanic,South,4,Two parents and sibling(s)
4,20161000058,Enrolled in school,9th,Public School,1,2,1,Yes,0,0,...,2,2001,14,1,Female,9,White,Midwest,4,Two parents and sibling(s)
5,20161000064,Enrolled in school,10th,Public School,1,2,2,No,0,0,...,1,2000,15,1,Female,8,Hispanic,South,3,Two parents and sibling(s)


In [23]:
#Saving the data frame as a .csv file in the `data` folder
enroll_df.to_csv('../data/enroll_df.csv', index=False)