# School Attendance Data

## Importing the data

- The gdata package relies on Perl to read Excel files, so [ActivePerl](https://www.activestate.com/activeperl/downloads) must be installed first.
- Follow the [installation instructions](https://cran.r-project.org/web/packages/gdata/INSTALL) so that gdata can find Perl automatically. Otherwise, pass the path to the Perl executable through the `perl` argument, as in the calls below.
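
Before hard-coding a path such as `C:/Perl64/bin/perl.exe`, it can help to check whether Perl is already on the search path. The following is a minimal sketch using base R's `Sys.which()`; the variable name `perl_path` is only illustrative, and the commented-out import call assumes a local copy of `attendance.xls`.

```r
# Look up a Perl executable on the PATH; the result is "" if none is found
perl_path <- unname(Sys.which("perl"))

if (nzchar(perl_path)) {
  message("Perl found at: ", perl_path)
} else {
  message("Perl is not on the PATH; pass its full path via the perl argument")
}

# With Perl located, the import call would look like this (not run here):
# att <- gdata::read.xls("attendance.xls", perl = perl_path)
```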

In [7]:
# Load the gdata package
library(gdata)

# Import the spreadsheet: att
att <- read.xls("attendance.xls", perl = "C:/Perl64/bin/perl.exe")

In [8]:
att_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1294/datasets/attendance.xls"

att <- read.xls(att_url, perl = "C:/Perl64/bin/perl.exe")

## Examining the data

In [9]:
# Print the column names 
names(att)

# Print the first 6 rows
head(att)

# Print the last 6 rows
tail(att)

# Print the structure
str(att)

Table.43..Average.daily.attendance..ADA..as.a.percentage.of.total.enrollment..school.day.length..and.school.year.length.in.public.schools..by.school.level.and.state..2007.08,X,X.1,X.2,X.3,X.4,X.5,X.6,X.7,X.8,X.9,X.10,X.11,X.12,X.13,X.14,X.15
,"Total elementary, secondary, and combined elementary/secondary schools",,,,,,,,Elementary schools,,,,Secondary schools,,,
,ADA as percent of enrollment,,Average hours in school day,,Average days in school year,,Average hours in school year,,ADA as percent of enrollment,,Average hours in school day,,ADA as percent of enrollment,,Average hours in school day,
1,2,,3,,4,,5,,6,,7,,8,,9,
United States ........,93.1,(0.22),6.6,(0.02),180,(0.1),1193,(3.1),94.0,(0.27),6.7,(0.02),91.1,(0.43),6.6,(0.04)
Alabama .................,93.8,(1.24),7.0,(0.07),180,(0.8),1267,(12.3),93.8,(1.84),7.0,(0.08),94.6,(0.38),7.1,(0.17)
Alaska ..................,89.9,(1.22),6.5,(0.05),180,(3.4),1163,(22.9),91.3,(1.56),6.5,(0.05),93.2,(1.57),6.2,(0.15)


Unnamed: 0,Table.43..Average.daily.attendance..ADA..as.a.percentage.of.total.enrollment..school.day.length..and.school.year.length.in.public.schools..by.school.level.and.state..2007.08,X,X.1,X.2,X.3,X.4,X.5,X.6,X.7,X.8,X.9,X.10,X.11,X.12,X.13,X.14,X.15
54,Wisconsin ...............,95.0,(0.57),6.9,(0.04),180.0,(0.7),1246.0,(8.6),95.4,(0.41),6.9,(0.05),93.0,(1.91),7.0,(0.14)
55,Wyoming .................,92.4,(1.15),6.9,(0.05),175.0,(1.3),1201.0,(8.3),92.2,(1.65),6.9,(0.05),92.4,(0.75),7.0,(0.07)
56,†Not applicable.,,,,,,,,,,,,,,,,
57,‡Reporting standards not met (too few cases).,,,,,,,,,,,,,,,,
58,"NOTE: Averages reflect data reported by schools rather than state requirements. School-reported length of day may exceed state requirements, and there is a range of statistical error in reported estimates. Standard errors appear in parentheses.",,,,,,,,,,,,,,,,
59,"SOURCE: U.S. Department of Education, National Center for Education Statistics, Schools and Staffing Survey (SASS), \Public School Questionnaire",\ 2003-04 and 2007-08. (This table was prepared June 2011.),,,,,,,,,,,,,,,


'data.frame':	59 obs. of  17 variables:
 $ Table.43..Average.daily.attendance..ADA..as.a.percentage.of.total.enrollment..school.day.length..and.school.year.length.in.public.schools..by.school.level.and.state..2007.08: Factor w/ 58 levels "","   United States ........",..: 1 1 3 2 6 7 8 9 10 11 ...
 $ X                                                                                                                                                                            : Factor w/ 42 levels "","\\ 2003-04 and 2007-08. (This table was prepared June 2011.)",..: 42 41 3 22 28 8 6 14 23 29 ...
 $ X.1                                                                                                                                                                          : Factor w/ 45 levels "","(0.22)","(0.23)",..: 1 1 1 2 22 21 41 27 14 6 ...
 $ X.2                                                                                                                                                

## Removing unnecessary rows

In [10]:
# Create remove
remove <- c(3, 56:59)

# Create att2
att2 <- att[-remove, ]
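
Negative indices drop the listed positions and keep everything else. The sketch below shows the same pattern on a small made-up data frame (`toy` and its values are hypothetical, not part of the attendance data).

```r
# Toy data frame standing in for att (hypothetical values)
toy <- data.frame(id = 1:6, value = letters[1:6])

# Rows to drop, by position
remove <- c(2, 5:6)

# Negative indices drop the listed rows and keep the rest
toy2 <- toy[-remove, ]

nrow(toy2)   # 3 rows remain
toy2$id      # 1 3 4
```

One thing to watch for: if `remove` is ever empty, `toy[-remove, ]` returns zero rows rather than the full data frame, so it is worth checking `length(remove) > 0` when the vector is computed rather than typed by hand.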

## Removing useless columns

In [11]:
# Create remove
remove <- c(3, 5, 7, 9, 11, 13, 15, 17)

# Create att3
att3 <- att2[, -remove]
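
The columns removed here are the parenthesized standard errors, which sit in every second column from 3 to 17. As a sketch of an alternative to typing the positions by hand, `seq()` generates the same vector; `keep` is an illustrative name that does not appear in the notebook.

```r
# Every second column from 3 to 17 holds a standard error
remove <- seq(3, 17, by = 2)

# Same positions as the hand-typed vector
identical(remove, c(3, 5, 7, 9, 11, 13, 15, 17))   # TRUE

# Positions that survive the removal: the state column plus 8 value columns
keep <- setdiff(1:17, remove)
keep   # 1 2 4 6 8 10 12 14 16
```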

## Splitting the data

In [12]:
# att3 was created in the previous step

# Subset just elementary schools: att_elem
att_elem <- att3[, c(1, 6, 7)]

# Subset just secondary schools: att_sec
att_sec <- att3[, c(1, 8, 9)]

# Subset all schools: att4
att4 <- att3[, 1:5]

## Replacing the names

In [13]:
# att4 was created in the previous step

# Define cnames vector (don't change)
cnames <- c("state", "avg_attend_pct", "avg_hr_per_day", 
            "avg_day_per_yr", "avg_hr_per_yr")

# Assign column names of att4
colnames(att4) <- cnames

# Remove first two rows of att4: att5
att5 <- att4[-(1:2), ]

# View the names of att5
names(att5)

## Cleaning up extra characters

In [14]:
# att5 was created in the previous step; load stringr
library(stringr)

# Remove all periods in state column
att5$state <- str_replace_all(string = att5$state, pattern = "\\.", replacement = "")

# Remove white space around state names
att5$state <- str_trim(string = att5$state)

# View the head of att5
head(att5)

| | state | avg_attend_pct | avg_hr_per_day | avg_day_per_yr | avg_hr_per_yr |
|---|---|---|---|---|---|
| 4 | United States | 93.1 | 6.6 | 180 | 1193 |
| 5 | Alabama | 93.8 | 7.0 | 180 | 1267 |
| 6 | Alaska | 89.9 | 6.5 | 180 | 1163 |
| 7 | Arizona | 89.0 | 6.4 | 181 | 1159 |
| 8 | Arkansas | 91.8 | 6.9 | 179 | 1229 |
| 9 | California | 93.2 | 6.2 | 181 | 1129 |
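
The pattern `"\\."` is escaped because an unescaped `.` is a regular-expression wildcard that matches any character. The sketch below illustrates the difference with base R's `gsub()` and `trimws()`, used here as stand-ins for `str_replace_all()` and `str_trim()` so that it runs without stringr; the two sample strings are copied from the raw output shown earlier.

```r
# Sample values shaped like the raw state column
state <- c("   United States ........", "Alabama .................")

# An unescaped "." is a regex wildcard, so every character is removed
wiped <- gsub(".", "", state)
wiped

# Escaping the dot matches literal periods only
no_dots <- gsub("\\.", "", state)

# trimws() strips the leading and trailing spaces left behind
clean <- trimws(no_dots)
clean
```

In stringr, wrapping the pattern as `fixed(".")` is another way to match a literal period without escaping.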


## Some final type conversions

In [15]:
# Change columns to numeric using dplyr (don't change)
library(dplyr)
example <- mutate_each(att5, funs(as.numeric), -state)

# Define vector containing numerical columns: cols
cols <- 2:5

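# Use sapply to coerce cols to numeric
# Caution: these columns are factors (see the str() output above), so
# as.numeric() returns their integer level codes, not the printed values
att5[, cols] <- sapply(att5[, cols], as.numeric)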

"package 'dplyr' was built under R version 3.4.3"
Attaching package: 'dplyr'

The following objects are masked from 'package:gdata':

    combine, first, last

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

`mutate_each()` is deprecated.
Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
To map `funs` over a selection of variables, use `mutate_at()`


In [16]:
str(att5)

head(att5)

'data.frame':	52 obs. of  5 variables:
 $ state         : chr  "United States" "Alabama" "Alaska" "Arizona" ...
 $ avg_attend_pct: num  22 28 8 6 14 23 29 5 7 11 ...
 $ avg_hr_per_day: num  7 11 6 5 10 3 11 6 8 10 ...
 $ avg_day_per_yr: num  10 10 10 11 9 11 2 11 11 11 ...
 $ avg_hr_per_yr : num  26 45 15 13 36 5 28 19 31 42 ...


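| | state | avg_attend_pct | avg_hr_per_day | avg_day_per_yr | avg_hr_per_yr |
|---|---|---|---|---|---|
| 4 | United States | 22 | 7 | 10 | 26 |
| 5 | Alabama | 28 | 11 | 10 | 45 |
| 6 | Alaska | 8 | 6 | 10 | 15 |
| 7 | Arizona | 6 | 5 | 11 | 13 |
| 8 | Arkansas | 14 | 10 | 9 | 36 |
| 9 | California | 23 | 3 | 11 | 5 |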
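
The numbers printed here (22, 28, 8, ...) do not match the attendance percentages shown earlier (93.1, 93.8, 89.9, ...). The likely cause is that `read.xls()` imported these columns as factors, and `as.numeric()` on a factor returns its internal level codes. A minimal sketch of the usual remedy, converting to character first, is shown below on made-up toy objects (`pct` and `toy` are illustrative and not part of the notebook).

```r
# Toy factor column with values copied from the earlier head(att5) output
pct <- factor(c("93.1", "93.8", "89.9"))

# as.numeric() on a factor returns the internal level codes
codes <- as.numeric(pct)
codes    # 2 3 1

# Converting to character first recovers the printed values
values <- as.numeric(as.character(pct))
values   # 93.1 93.8 89.9

# The same idea applied to several columns of a data frame
toy <- data.frame(state = c("Alabama", "Alaska"),
                  avg_attend_pct = factor(c("93.8", "89.9")),
                  avg_hr_per_day = factor(c("7.0", "6.5")))
cols <- 2:3
toy[cols] <- lapply(toy[cols], function(x) as.numeric(as.character(x)))
str(toy)
```

Alternatively, importing with `stringsAsFactors = FALSE` (an argument that `read.xls()` should forward to `read.table()`) would keep the columns as character vectors, so that the original `sapply(att5[, cols], as.numeric)` call yields the expected values.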
