GitHub - data-science-coursera-notes/ProgrammingAssignment3: Getting and Cleaning Data

Programming Assignment for Getting and Cleaning Data Course
Data Science Specialisation by Johns Hopkins University
by Chan Chee-Foong on 24 Mar 2016

Assignment Summary

The goal is to prepare tidy data that can be used for later analysis. We are required to submit:

a tidy data set (the average of each variable for each activity and each subject) from the raw data set downloaded
a link to a GitHub repository with the script for performing the analysis
a code book called CodeBook.md that describes the variables, the data, and any transformations or work that was performed to clean up the data
a README.md in the repo with the scripts that explains how all of the scripts work and how they are connected

Data

For the data, please refer to CodeBook.md. The document explains where the raw data was downloaded from, the observations made on the raw data, and how the raw data was prepared into a tidy data set for further analysis.

R Programming Script

This assignment contains only one R script, run_analysis.R. Comments in the script explain the process and flow of cleaning up and preparing the tidy data set. The script also includes the steps that analyse the tidy data set and write the average of each variable for each activity and each subject to an output file (tidy.txt), as required by the assignment.
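The averaging step can be illustrated with a minimal sketch on synthetic data. This is not the code in run_analysis.R, which may use a different approach. The data frame `merged`, its two measurement columns and the temporary output path are hypothetical stand-ins for the real merged data set.

```r
# Synthetic stand-in for the merged data set (assumption, not the real data):
# 2 subjects x 2 activities, 3 rows per combination, 2 measurement columns.
set.seed(1)
merged <- data.frame(
  subject         = rep(1:2, each = 6),
  activity        = rep(rep(c("WALKING", "SITTING"), each = 3), times = 2),
  tBodyAcc_mean_X = rnorm(12),
  tBodyAcc_std_X  = rnorm(12)
)

# Average every measurement column for each subject/activity pair.
tidy <- aggregate(. ~ subject + activity, data = merged, FUN = mean)

# One row per subject/activity pair; the key columns plus the measurements.
dim(tidy)

# Write the result as a plain text table (temporary path used for the sketch).
out_file <- file.path(tempdir(), "tidy.txt")
write.table(tidy, out_file, row.names = FALSE)
```

With the real data, the same grouping would produce one row for each of the 30 subjects and 6 activities.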

Output

Refer to the text file named tidy.txt in this repository.

Special Notes

As part of the data cleaning and analysis process, sanity checks are run on the R console before and after cleaning to ensure that the raw data is correctly understood and that the final tidy data makes sense. Some of the sanity checks are:

Count and ensure the number of variables in features.txt is the same as the number of columns in the training and test data sets. Number should be 561.
Count and ensure the number of labels in activity_labels.txt tallies with information indicated in the README.txt inside the downloaded data set. Number should be 6.
Count and ensure the number of labels in train/y_train.txt and test/y_test.txt tallies with the number of records in the training and test data sets respectively.
Count and ensure the number of labels in train/subject_train.txt and test/subject_test.txt tallies with the number of records in the training and test data sets respectively.
Count the unique number of activity labels in train/y_train.txt and test/y_test.txt. Number should be 6.
Count the unique number of subject labels in train/subject_train.txt and test/subject_test.txt. Number should be 30.
Count the variables in the required variable list, that is, those whose names contain mean() or std(). Number should be 66.
Count and ensure the number of records in the final tidy data set for extraction into tidy.txt. Number should be 180 (30 subjects x 6 activities). Number of variables should be 66. Number of columns should be 68 (66 + subject and activity).
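Two of the checks above can be sketched in R as follows. This is an illustrative sketch rather than the commands actually typed on the console. The short `features` vector, the helper `check_tidy` and the zero-filled `tidy` data frame are hypothetical stand-ins for the real files.

```r
# Hypothetical feature names standing in for the 561 entries of features.txt.
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
              "tBodyAcc-mad()-X", "fBodyAcc-meanFreq()-X",
              "angle(tBodyAccMean,gravity)")

# Keep only names containing the literal text "mean()" or "std()".
# The parentheses are escaped so that "meanFreq()" is not matched.
wanted <- grep("(mean|std)\\(\\)", features, value = TRUE)
length(wanted)

# Hypothetical helper: stop with an error if the tidy data set does not have
# the expected shape (one row per subject/activity pair, two key columns).
check_tidy <- function(tidy, n_subjects, n_activities, n_measures) {
  stopifnot(
    nrow(tidy) == n_subjects * n_activities,
    ncol(tidy) == n_measures + 2,
    length(unique(tidy$subject)) == n_subjects,
    length(unique(tidy$activity)) == n_activities
  )
  invisible(TRUE)
}

# Zero-filled stand-in with the dimensions expected by the assignment.
tidy <- expand.grid(subject = 1:30, activity = 1:6)
tidy <- cbind(tidy, as.data.frame(matrix(0, nrow = nrow(tidy), ncol = 66)))
check_tidy(tidy, n_subjects = 30, n_activities = 6, n_measures = 66)
```

On the real data, `features` would be read from features.txt and `tidy` from the script's output, and the same counts (66 selected variables, 180 rows, 68 columns) would be expected.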
