# Code Section 1: Model Training and Testing
## This section contains two challenges

For time series data, we should convert the raw dataset into a feature dataset, where each data sample contains features extracted from a period of time. For example, we can treat every 1000 sensor data records (rows) as one segment (window) and extract statistical features from it. In this tutorial, we extract the min, max, and mean values of the first accelerometer on the wrist sensor (columns 0 to 2 in the code). The data visualizations from previous weeks showed that the sensor signal values fall in different ranges for different activities. We can therefore expect the range of the data to help distinguish activities, which means the minimum, maximum, and mean values may be useful features for recognizing activities.
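As a rough illustration of the windowing idea described above, the sketch below splits a one-dimensional signal into windows of 1000 samples and computes the min, max, and mean of each window. The function name `window_features` and the synthetic signal are assumptions made for this example; they are not part of the practical's code.

```python
import numpy as np

def window_features(signal_values, window_size=1000):
    """Return one (min, max, mean) row per full window; leftover rows are dropped."""
    signal_values = np.asarray(signal_values, dtype=float)
    n_windows = len(signal_values) // window_size
    # Keep only the rows that fill complete windows, then view them as a 2-D table.
    windows = signal_values[: n_windows * window_size].reshape(n_windows, window_size)
    return np.column_stack((windows.min(axis=1), windows.max(axis=1), windows.mean(axis=1)))

# Synthetic stand-in for one accelerometer axis (NOT real sensor data):
# 2500 samples give 2 full windows, the last 500 samples are ignored.
rng = np.random.default_rng(0)
fake_axis = np.concatenate((rng.normal(0.0, 0.1, 1000),    # e.g. a low-movement activity
                            rng.normal(2.0, 0.5, 1500)))   # e.g. a more energetic activity

features = window_features(fake_axis)
print(features.shape)  # (2, 3): two windows, three features each
print(features)
```

The two rows differ clearly in all three statistics, which is the property the classifier relies on.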

In [3]:
#Code Block 1.1

import numpy as np 
import pandas as pd 
from scipy import signal
import matplotlib.pyplot as plt 
import math
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import make_scorer, accuracy_score, confusion_matrix
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def create_training_data_from_files(list_of_filenames, output_filename):

    #create the empty training set where we are going to add our "features"
    training_set = np.empty(shape=(0, 10))
    
    for dataset_file in list_of_filenames:

        #import the file contents into a pandas data frame
        imported_data = pd.read_csv(dataset_file, sep=',', header=None)

        #generate "features" for each activity
        for activityNumber in range(1,14):
            
            #get all data relating to that activity and convert to a numpy ndarray
            activity_data = imported_data[imported_data[24] == activityNumber].values

            #smooth over the data for columns 0, 1, 2, ...23 (not column 24)
            b, a = signal.butter(4, 0.04, 'low', analog=False)
            for j in range(24):
                activity_data[:, j] = signal.lfilter(b, a, activity_data[:, j])
            
            #how many full rows of 1000 are there for this activity data?
            number_of_training_windows = int( len(activity_data)/1000 )
            print(  "File " + dataset_file +
                    " has " + str(number_of_training_windows) + " windows, each of which has 1000 rows "+
                    "for activity: " + str(activityNumber))
            
            #for each window of 1000 rows... scan the data and add the scan results to training_set
            for window_number in range(number_of_training_windows):
                #sample data (get the next 1000 rows and all the columns)
                window_data = activity_data[ 
                                1000 * window_number : 1000 * (window_number + 1) , 
                                :
                            ]
                #we are about to build up a feature_sample that will have 10 columns
                feature_window = []
                for i in range(3):
                    feature_window.append(np.min(window_data[:, i]))
                    feature_window.append(np.max(window_data[:, i]))
                    feature_window.append(np.mean(window_data[:, i]))
                # add the activity number (the last column of the window's first row)
                feature_window.append(int(window_data[0, -1])) 
                #make it into an ndarray so it can be added to the training data
                feature_window = np.array([feature_window]) 
                training_set = np.concatenate((training_set, feature_window), axis=0)
            
    #now save all this training data into a file to be used at a later date
    df_training = pd.DataFrame(training_set)
    df_training.to_csv(output_filename, index=None, header=None)
    print('attempted to create training feature set data file:'+ output_filename +'. Please check if the file was created successfully in your local folder!')
    print(str(len(training_set)) + " data rows should be in the output training feature set file")


def create_testing_data_from_files(list_of_filenames, output_filename):

    #create the empty testing set where we are going to add our "features"
    testing_set = np.empty(shape=(0, 10))
    
    for dataset_file in list_of_filenames:

        #import the file contents into a pandas data frame
        imported_data = pd.read_csv(dataset_file, sep=',', header=None)

        #generate "features" for each activity
        for activityNumber in range(1,14):
            
            #get all data relating to that activity and convert to a numpy ndarray
            activity_data = imported_data[imported_data[24] == activityNumber].values

            #smooth over the data for columns 0, 1, 2, ...23 (not column 24)
            b, a = signal.butter(4, 0.04, 'low', analog=False)
            for j in range(24):
                activity_data[:, j] = signal.lfilter(b, a, activity_data[:, j])
            
            #how many full rows of 1000 are there for this activity data?
            number_of_testing_windows = int( len(activity_data)/1000 )
            print(  "File " + dataset_file +
                    " has " + str(number_of_testing_windows) + " windows, each of which has 1000 rows "+
                    "for activity: " + str(activityNumber))
            
            #for each window of 1000 rows... scan the data and add the scan results to testing_set
            for window_number in range(number_of_testing_windows):
                #sample data (get the next 1000 rows and all the columns)
                window_data = activity_data[ 
                                1000 * window_number : 1000 * (window_number + 1) , 
                                :
                            ]
                #we are about to build up a feature_sample that will have 10 columns
                feature_window = []
                for i in range(3):
                    feature_window.append(np.min(window_data[:, i]))
                    feature_window.append(np.max(window_data[:, i]))
                    feature_window.append(np.mean(window_data[:, i]))
                # add the activity number (the last column of the window's first row)
                feature_window.append(int(window_data[0, -1])) 
                #make it into an ndarray so it can be added to the testing data
                feature_window = np.array([feature_window]) 
                testing_set = np.concatenate((testing_set, feature_window), axis=0)
            
    #now save all this testing data into a file to be used at a later date
    df_testing = pd.DataFrame(testing_set)
    df_testing.to_csv(output_filename, index=None, header=None)
    print('attempted to create testing feature set data file:'+ output_filename +'. Please check if the file was created successfully in your local folder!')
    print(str(len(testing_set)) + " data rows should be in the output testing set file")
    
training_fileName = []
training_fileName.append ('dataset_1.txt')
testing_fileName = []
testing_fileName.append ('dataset_3.txt')
create_training_data_from_files (training_fileName, 'week6_training_data_1Participant.csv')
create_testing_data_from_files (testing_fileName, 'week6_testing_data_1Participant.csv')

df_training = pd.read_csv('week6_training_data_1Participant.csv', header=None)
df_testing = pd.read_csv('week6_testing_data_1Participant.csv', header=None)

label_train = df_training[9].values
# Shift labels to start from 0 (activity n becomes class n - 1); sklearn does not strictly require this
label_train = label_train - 1
df_training = df_training.drop([9], axis=1)
data_train = df_training.values

label_test = df_testing[9].values
label_test = label_test - 1
df_testing = df_testing.drop([9], axis=1)
data_test = df_testing.values

# Feature normalization often improves the performance of machine learning models. In this example code,
# StandardScaler standardizes each feature to zero mean and unit variance. The scaler is fit on the training
# set only and then applied to both sets. You could try other normalization methods.
scaler = preprocessing.StandardScaler().fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)

# Build the KNN classifier (k = 3 in this example code)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(data_train, label_train)

# Evaluation. After training a machine learning model on the training set, we should evaluate its performance on the testing set.
# We can evaluate the model with different metrics. First, we calculate the classification accuracy.
label_pred = knn.predict(data_test)
print('Accuracy: ', accuracy_score(label_test, label_pred))
# We could use confusion matrix to view the classification for each activity.
print(confusion_matrix(label_test, label_pred))



File dataset_1.txt has 12 windows, each of which has 1000 rows for activity: 1
File dataset_1.txt has 12 windows, each of which has 1000 rows for activity: 2
File dataset_1.txt has 12 windows, each of which has 1000 rows for activity: 3
File dataset_1.txt has 24 windows, each of which has 1000 rows for activity: 4
File dataset_1.txt has 12 windows, each of which has 1000 rows for activity: 5
File dataset_1.txt has 18 windows, each of which has 1000 rows for activity: 6
File dataset_1.txt has 52 windows, each of which has 1000 rows for activity: 7
File dataset_1.txt has 6 windows, each of which has 1000 rows for activity: 8
File dataset_1.txt has 6 windows, each of which has 1000 rows for activity: 9
File dataset_1.txt has 25 windows, each of which has 1000 rows for activity: 10
File dataset_1.txt has 24 windows, each of which has 1000 rows for activity: 11
File dataset_1.txt has 24 windows, each of which has 1000 rows for activity: 12
File dataset_1.txt has 12 windows, each of which ha
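The smoothing step in Code Block 1.1 uses a 4th-order low-pass Butterworth filter with a normalised cut-off of 0.04 (a fraction of the Nyquist frequency). The sketch below applies the same filter design to a synthetic noisy sine wave to show its effect. The signal and the "roughness" measure are assumptions chosen for this illustration, not part of the practical.

```python
import numpy as np
from scipy import signal

# Same filter design as in Code Block 1.1.
b, a = signal.butter(4, 0.04, 'low', analog=False)

# Synthetic signal (NOT real sensor data): a slow sine wave plus random noise.
rng = np.random.default_rng(0)
t = np.arange(2000)
clean = np.sin(2 * np.pi * t / 500)
noisy = clean + rng.normal(0.0, 0.5, size=t.size)

# lfilter runs the filter forward in time, as the code block does for each column.
smoothed = signal.lfilter(b, a, noisy)

# Illustrative roughness measure: spread of the sample-to-sample differences.
roughness_before = np.std(np.diff(noisy))
roughness_after = np.std(np.diff(smoothed))
print("roughness before filtering:", roughness_before)
print("roughness after filtering: ", roughness_after)
```

Because the filter removes most of the fast fluctuations, the min and max of each window then reflect the underlying movement rather than isolated noise spikes.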

In [4]:
#Challenge 1


#Question 1
#
#After executing code block 1.1, do you notice there are two new files generated in your local folder?
#What is the difference between these two files? (Hint: think back to the lecture: what is the difference between
#model training and model inference?)
#
#A training file (week6_training_data_1Participant.csv) and a testing file (week6_testing_data_1Participant.csv) were created.
#The training data is the dataset used to fit the model, i.e. to teach it the patterns of each activity.
#The testing data is a separate dataset, not seen during training, used to evaluate the model's predictions.

#Question 2
#
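#After executing code block 1.1, you see an accuracy value (Accuracy: x) in the output. How was it calculated?
#
#The accuracy was 0.4225941422594142.
#It is the number of test windows whose predicted activity matches the true activity, divided by the total number of test windows.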



#Question 3
#
#After executing code block 1.1, there is a matrix output. Why does it take a while to output? What are the columns
#and what are the rows?
#
#It takes a while mainly because a large amount of data has to be read, filtered and windowed before the model is trained and evaluated.
#The columns are the predicted activities, so the first column counts predictions of activity 1, the second column predictions of activity 2, etc.
#The rows are the actual activities, so the first row counts windows of activity 1, the second row windows of activity 2, etc.
#Values in a column that aren't on the diagonal are false positives for that column's activity.
#Values in a row that aren't on the diagonal are false negatives for that row's activity.
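To make the accuracy calculation and the confusion matrix layout concrete, here is a minimal sketch with a handful of made-up labels for three classes. The label values are assumptions for illustration only and are unrelated to the notebook's actual results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels (made up for illustration): 8 test windows, 3 classes.
label_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
label_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(label_true, label_pred)
print(cm)
# [[2 1 0]
#  [0 2 0]
#  [1 0 2]]

# Accuracy = correct predictions / all predictions.
accuracy = accuracy_score(label_true, label_pred)
print(accuracy)  # 0.75

# The same number from the matrix: the diagonal holds the correct predictions.
manual = np.trace(cm) / cm.sum()
print(manual)  # 0.75
```

In row 0, the off-diagonal 1 is a window of class 0 that was predicted as class 1: a false negative for class 0 and a false positive for class 1.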


In [None]:
#Challenge 2


#Question 1
#
#What is the accuracy for the sitting activity (please refer to the dataset link or the final pages of 
#the week 6 lecture)?



#Question 2
#
#Which activities have poor accuracy (e.g., false positives + false negatives >= 5)?



#Question 3
#
#Are there any reasons why the above-mentioned activities have poor accuracy? What are your suggestions to improve
#the accuracy?
#
#A potential reason for the poor accuracy is that there isn't enough training data for some activities (e.g. activity 9, with only 6 training windows).
#Also, some activities are hard to distinguish from one another (like activity 11 and 12).
#Adding more training data gives the model more examples of each activity, which should make it more accurate.
#Adjusting the k value in the k-NN algorithm may also improve the accuracy.
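The per-activity questions above can be answered from the confusion matrix. The sketch below assumes one common reading of "accuracy for an activity", namely recall (correct windows of that activity divided by all true windows of that activity), and counts false positives and false negatives per class. The helper name `per_class_report`, the toy matrix, and the threshold of 2 are assumptions for this example; the question itself uses a threshold of 5.

```python
import numpy as np

def per_class_report(cm):
    """Return (recall, false_positives, false_negatives) per class.

    Assumes rows are true classes and columns are predicted classes,
    which is the layout produced by sklearn's confusion_matrix.
    """
    cm = np.asarray(cm)
    true_pos = np.diag(cm)
    support = cm.sum(axis=1)                # number of true windows per class
    false_neg = support - true_pos          # row total minus the diagonal
    false_pos = cm.sum(axis=0) - true_pos   # column total minus the diagonal
    # Recall is undefined (nan) for a class with no test windows.
    recall = np.divide(true_pos, support,
                       out=np.full(len(cm), np.nan), where=support > 0)
    return recall, false_pos, false_neg

# Toy matrix (made up for illustration, not the notebook's output).
toy_cm = [[2, 1, 0],
          [0, 2, 0],
          [1, 0, 2]]
recall, false_pos, false_neg = per_class_report(toy_cm)

threshold = 2  # illustrative; the challenge question uses 5
flagged = np.flatnonzero(false_pos + false_neg >= threshold)
for class_index in range(len(toy_cm)):
    # Labels were shifted by -1 in the notebook, so activity = class index + 1.
    print(f"activity {class_index + 1}: recall={recall[class_index]:.2f}, "
          f"FP={false_pos[class_index]}, FN={false_neg[class_index]}")
print("activities with FP + FN >= threshold:", (flagged + 1).tolist())  # [1]
```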


# Code Section 2: Testing with model parameter (T) and training data (E)
## This section contains one challenge

### In the code block below, we change the number of neighbours K in KNN to 4, instead of 3 as in the previous code section

In [2]:
#Code Block 2.1

training_fileName = []
training_fileName.append ('dataset_1.txt')
testing_fileName = []
testing_fileName.append ('dataset_3.txt')
create_training_data_from_files (training_fileName, 'week6_training_data_1Participant.csv')
create_testing_data_from_files (testing_fileName, 'week6_testing_data_1Participant.csv')

df_training = pd.read_csv('week6_training_data_1Participant.csv', header=None)
df_testing = pd.read_csv('week6_testing_data_1Participant.csv', header=None)

label_train = df_training[9].values
# Shift labels to start from 0 (activity n becomes class n - 1); sklearn does not strictly require this
label_train = label_train - 1
df_training = df_training.drop([9], axis=1)
data_train = df_training.values

label_test = df_testing[9].values
label_test = label_test - 1
df_testing = df_testing.drop([9], axis=1)
data_test = df_testing.values

# Feature normalization for improving the performance of machine learning models. In this example code, 
# StandardScaler is used to scale original feature to be centered around zero. You could try other normalization methods.
scaler = preprocessing.StandardScaler().fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)

# Build KNN classifier, in this example code
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(data_train, label_train)

# Evaluation. when we train a machine learning model on training set, we should evaluate its performance on testing set.
# We could evaluate the model by different metrics. Firstly, we could calculate the classification accuracy.
label_pred = knn.predict(data_test)
print('Accuracy: ', accuracy_score(label_test, label_pred))
# We could use confusion matrix to view the classification for each activity.
print(confusion_matrix(label_test, label_pred))



File dataset_1.txt has 12 windows, each of which has 1000 rows for activity: 1
File dataset_1.txt has 12 windows, each of which has 1000 rows for activity: 2
File dataset_1.txt has 12 windows, each of which has 1000 rows for activity: 3
File dataset_1.txt has 24 windows, each of which has 1000 rows for activity: 4
File dataset_1.txt has 12 windows, each of which has 1000 rows for activity: 5
File dataset_1.txt has 18 windows, each of which has 1000 rows for activity: 6
File dataset_1.txt has 52 windows, each of which has 1000 rows for activity: 7
File dataset_1.txt has 6 windows, each of which has 1000 rows for activity: 8
File dataset_1.txt has 6 windows, each of which has 1000 rows for activity: 9
File dataset_1.txt has 25 windows, each of which has 1000 rows for activity: 10
File dataset_1.txt has 24 windows, each of which has 1000 rows for activity: 11
File dataset_1.txt has 24 windows, each of which has 1000 rows for activity: 12
File dataset_1.txt has 12 windows, each of which ha

### In the code block below, we change the number of neighbours K in KNN to 6, instead of 4 as in the previous code block

In [None]:
#Code Block 2.2
df_training = pd.read_csv('week6_training_data_1Participant.csv', header=None)
df_testing = pd.read_csv('week6_testing_data_1Participant.csv', header=None)

label_train = df_training[9].values
# Shift labels to start from 0 (activity n becomes class n - 1); sklearn does not strictly require this
label_train = label_train - 1
df_training = df_training.drop([9], axis=1)
data_train = df_training.values

label_test = df_testing[9].values
label_test = label_test - 1
df_testing = df_testing.drop([9], axis=1)
data_test = df_testing.values

# Feature normalization for improving the performance of machine learning models. In this example code, 
# StandardScaler is used to scale original feature to be centered around zero. You could try other normalization methods.
scaler = preprocessing.StandardScaler().fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)

# Build KNN classifier, in this example code
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(data_train, label_train)

# Evaluation. when we train a machine learning model on training set, we should evaluate its performance on testing set.
# We could evaluate the model by different metrics. Firstly, we could calculate the classification accuracy.
label_pred = knn.predict(data_test)
print('Accuracy: ', accuracy_score(label_test, label_pred))
# We could use confusion matrix to view the classification for each activity.
print(confusion_matrix(label_test, label_pred))
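Code Blocks 1.1, 2.1 and 2.2 differ only in the value of `n_neighbors`. A minimal sketch of the same comparison in a single loop is shown below. It uses synthetic data from `make_blobs` as a stand-in for the feature CSV files, so the accuracies it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 4 overlapping classes with 9 features each.
X, y = make_blobs(n_samples=300, centers=4, n_features=9,
                  cluster_std=3.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Fit the scaler on the training data only, as in the notebook.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Train and evaluate one KNN model per k value.
accuracy_by_k = {}
for k in (3, 4, 6):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    accuracy_by_k[k] = accuracy_score(y_test, knn.predict(X_test_s))

for k, acc in accuracy_by_k.items():
    print(f"k={k}: accuracy={acc:.3f}")
```

In a real tuning workflow, k would usually be chosen on a validation split or by cross-validation (for example with `GridSearchCV`, which Code Block 1.1 imports but does not use) rather than on the test set.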



### In the following code, we use two participants' data for training and one participant's data for testing

In [None]:
#Code Block 2.3
training_fileName = []
training_fileName.append ('dataset_1.txt')
training_fileName.append('dataset_3.txt')
testing_fileName = []
testing_fileName.append ('dataset_4.txt')
create_training_data_from_files (training_fileName, 'week6_training_data_2Participant.csv')
create_testing_data_from_files (testing_fileName, 'week6_testing_data_1Participant.csv')


df_training = pd.read_csv('week6_training_data_2Participant.csv', header=None)
df_testing = pd.read_csv('week6_testing_data_1Participant.csv', header=None)

label_train = df_training[9].values
# Shift labels to start from 0 (activity n becomes class n - 1); sklearn does not strictly require this
label_train = label_train - 1
df_training = df_training.drop([9], axis=1)
data_train = df_training.values

label_test = df_testing[9].values
label_test = label_test - 1
df_testing = df_testing.drop([9], axis=1)
data_test = df_testing.values

# Feature normalization for improving the performance of machine learning models. In this example code, 
# StandardScaler is used to scale original feature to be centered around zero. You could try other normalization methods.
scaler = preprocessing.StandardScaler().fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)

# Build KNN classifier, in this example code
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(data_train, label_train)

# Evaluation. when we train a machine learning model on training set, we should evaluate its performance on testing set.
# We could evaluate the model by different metrics. Firstly, we could calculate the classification accuracy.
label_pred = knn.predict(data_test)
print('Accuracy: ', accuracy_score(label_test, label_pred))
# We could use confusion matrix to view the classification for each activity.
print(confusion_matrix(label_test, label_pred))


### In the following code, we use four participants' data for training and one participant's data for testing

In [None]:
#Code Block 2.4
training_fileName = []
training_fileName.append ('dataset_1.txt')
training_fileName.append('dataset_3.txt')
training_fileName.append('dataset_4.txt')
training_fileName.append('dataset_5.txt')
testing_fileName = []
testing_fileName.append ('dataset_6.txt')
create_training_data_from_files (training_fileName, 'week6_training_data_4Participant.csv')
create_testing_data_from_files (testing_fileName, 'week6_testing_data_1Participant.csv')


df_training = pd.read_csv('week6_training_data_4Participant.csv', header=None)
df_testing = pd.read_csv('week6_testing_data_1Participant.csv', header=None)

label_train = df_training[9].values
# Shift labels to start from 0 (activity n becomes class n - 1); sklearn does not strictly require this
label_train = label_train - 1
df_training = df_training.drop([9], axis=1)
data_train = df_training.values

label_test = df_testing[9].values
label_test = label_test - 1
df_testing = df_testing.drop([9], axis=1)
data_test = df_testing.values

# Feature normalization for improving the performance of machine learning models. In this example code, 
# StandardScaler is used to scale original feature to be centered around zero. You could try other normalization methods.
scaler = preprocessing.StandardScaler().fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)

# Build KNN classifier, in this example code
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(data_train, label_train)

# Evaluation. when we train a machine learning model on training set, we should evaluate its performance on testing set.
# We could evaluate the model by different metrics. Firstly, we could calculate the classification accuracy.
label_pred = knn.predict(data_test)
print('Accuracy: ', accuracy_score(label_test, label_pred))
# We could use confusion matrix to view the classification for each activity.
print(confusion_matrix(label_test, label_pred))
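The load, scale, fit and evaluate steps are repeated almost verbatim in every code block of this section. One possible way to factor them out is sketched below. The helper names `split_features_and_labels` and `evaluate_knn` are hypothetical, and the tables are synthetic; the sketch assumes, as in the notebook's CSV files, that the last column holds the activity number.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def split_features_and_labels(table):
    """Split a feature table into features and 0-based labels (last column)."""
    table = np.asarray(table, dtype=float)
    return table[:, :-1], table[:, -1].astype(int) - 1

def evaluate_knn(train_table, test_table, n_neighbors=4):
    """Fit scaler + KNN on the training table, score on the testing table."""
    data_train, label_train = split_features_and_labels(train_table)
    data_test, label_test = split_features_and_labels(test_table)
    # The pipeline fits StandardScaler on the training data only and reuses
    # the same scaling when predicting on the test data.
    model = make_pipeline(StandardScaler(),
                          KNeighborsClassifier(n_neighbors=n_neighbors))
    model.fit(data_train, label_train)
    label_pred = model.predict(data_test)
    return (accuracy_score(label_test, label_pred),
            confusion_matrix(label_test, label_pred))

# Synthetic tables (assumption): two well-separated "activities", 9 features.
rng = np.random.default_rng(0)

def fake_table(n_per_class):
    rows = []
    for activity, centre in ((1, 0.0), (2, 5.0)):
        feats = rng.normal(centre, 1.0, size=(n_per_class, 9))
        rows.append(np.column_stack((feats, np.full(n_per_class, activity))))
    return np.vstack(rows)

accuracy, cm = evaluate_knn(fake_table(40), fake_table(20), n_neighbors=4)
print("Accuracy:", accuracy)
print(cm)
```

With the notebook's files, the tables could come from `pd.read_csv('week6_training_data_4Participant.csv', header=None).values` and the matching testing file.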


In [None]:
#Challenge 3

#Question 1
#
#After running code block 2.1, where we change the K value of the KNN model from 3 to 4, compared to the results from running
#code block 1.1, which activities have improved
#accuracy and which activities have decreased accuracy? How about the overall accuracy (improved or dropped)?


#Question 2
#
#After running code block 2.2, where we further increase the K value from 4 to 6, compared to the results from running
#code block 2.1, does this change help improve the accuracy? If so, which activities have improved accuracy?
#How about the overall accuracy (improved or dropped)?


#Question 3
#
#After running code block 2.3, where we use 2 participants' data to train the model instead of 1, compared to the results from running
#code block 2.1, which activities have improved
#accuracy and which activities have decreased accuracy? How about the overall accuracy (improved or dropped)?



#Question 4
#
#After running code block 2.4, where we use 4 participants' data to train the model, compared to the results from running
#code block 2.3, which activities have improved
#accuracy and which activities have decreased accuracy? How about the overall accuracy (improved or dropped)?




# Code Section 3: Testing with features and training data (E)
## This section contains one challenge

### In the following code, we use one participant's data for training and one participant's data for testing, but we introduce more features (page 42 of the week 5 lecture notes). NOTE: the lecture uses two statistical features, Min and Max; the practical uses three, Min, Max, and Mean. Don't memorize; understand the underlying principle and apply it.

In [5]:
#Code Block 3.1

def create_training_data_from_filesWithMoreFeatures(list_of_filenames, output_filename):

    #create the empty training set where we are going to add our "features"
    training_set = np.empty(shape=(0, 73))
    
    for dataset_file in list_of_filenames:

        #import the file contents into a pandas data frame
        imported_data = pd.read_csv(dataset_file, sep=',', header=None)

        #generate "features" for each activity
        for activityNumber in range(1,14):
            
            #get all data relating to that activity and convert to a numpy ndarray
            activity_data = imported_data[imported_data[24] == activityNumber].values

            #smooth over the data for columns 0, 1, 2, ...23 (not column 24)
            b, a = signal.butter(4, 0.04, 'low', analog=False)
            for j in range(24):
                activity_data[:, j] = signal.lfilter(b, a, activity_data[:, j])
            
            #how many full rows of 1000 are there for this activity data?
            number_of_training_windows = int( len(activity_data)/1000 )
            print(  "File " + dataset_file +
                    " has " + str(number_of_training_windows) + " windows, each of which has 1000 rows "+
                    "for activity: " + str(activityNumber))
            
            #for each window of 1000 rows... scan the data and add the scan results to training_set
            for window_number in range(number_of_training_windows):
                #sample data (get the next 1000 rows and all the columns)
                window_data = activity_data[ 
                                1000 * window_number : 1000 * (window_number + 1) , 
                                :
                            ]
                #we are about to build up a feature_sample that will have 73 columns, why?
                feature_window = []
                for i in range(24):
                    feature_window.append(np.min(window_data[:, i]))
                    feature_window.append(np.max(window_data[:, i]))
                    feature_window.append(np.mean(window_data[:, i]))
                # add the activity number (the last column of the window's first row)
                feature_window.append(int(window_data[0, -1])) 
                #make it into an ndarray so it can be added to the training data
                feature_window = np.array([feature_window]) 
                training_set = np.concatenate((training_set, feature_window), axis=0)
            
    #now save all this training data into a file to be used at a later date
    df_training = pd.DataFrame(training_set)
    df_training.to_csv(output_filename, index=None, header=None)
    print('attempted to create training feature set data file:'+ output_filename +'. Please check if the file was created successfully in your local folder!')
    print(str(len(training_set)) + " data rows should be in the output training feature set file")


def create_testing_data_from_filesWithMoreFeatures(list_of_filenames, output_filename):

    #create the empty testing set where we are going to add our "features"
    testing_set = np.empty(shape=(0, 73))
    
    for dataset_file in list_of_filenames:

        #import the file contents into a pandas data frame
        imported_data = pd.read_csv(dataset_file, sep=',', header=None)

        #generate "features" for each activity
        for activityNumber in range(1,14):
            
            #get all data relating to that activity and convert to a numpy ndarray
            activity_data = imported_data[imported_data[24] == activityNumber].values

            #smooth over the data for columns 0, 1, 2, ...23 (not column 24)
            b, a = signal.butter(4, 0.04, 'low', analog=False)
            for j in range(24):
                activity_data[:, j] = signal.lfilter(b, a, activity_data[:, j])
            
            #how many full rows of 1000 are there for this activity data?
            number_of_testing_windows = int( len(activity_data)/1000 )
            print(  "File " + dataset_file +
                    " has " + str(number_of_testing_windows) + " windows, each of which has 1000 rows "+
                    "for activity: " + str(activityNumber))
            
            #for each window of 1000 rows... scan the data and add the scan results to testing_set
            for window_number in range(number_of_testing_windows):
                #sample data (get the next 1000 rows and all the columns)
                window_data = activity_data[ 
                                1000 * window_number : 1000 * (window_number + 1) , 
                                :
                            ]
                #we are about to build up a feature_sample that will have 73 columns
                feature_window = []
                for i in range(24):
                    feature_window.append(np.min(window_data[:, i]))
                    feature_window.append(np.max(window_data[:, i]))
                    feature_window.append(np.mean(window_data[:, i]))
                # add the activity number (the last column of the window's first row)
                feature_window.append(int(window_data[0, -1])) 
                #make it into an ndarray so it can be added to the testing data
                feature_window = np.array([feature_window]) 
                testing_set = np.concatenate((testing_set, feature_window), axis=0)
            
    #now save all this testing data into a file to be used at a later date
    df_testing = pd.DataFrame(testing_set)
    df_testing.to_csv(output_filename, index=None, header=None)
    print('attempted to create testing feature set data file:'+ output_filename +'. Please check if the file was created successfully in your local folder!')
    print(str(len(testing_set)) + " data rows should be in the output testing set file")




#using one participant as training
#and one participant as testing
training_fileName = []
training_fileName.append ('dataset_1.txt')
testing_fileName = []
testing_fileName.append ('dataset_3.txt')
create_training_data_from_filesWithMoreFeatures (training_fileName, 'week6_training_data_1ParticipantMorefeatures.csv')
create_testing_data_from_filesWithMoreFeatures (testing_fileName, 'week6_testing_data_1ParticipantMorefeatures.csv')


df_training = pd.read_csv('week6_training_data_1ParticipantMorefeatures.csv', header=None)
df_testing = pd.read_csv('week6_testing_data_1ParticipantMorefeatures.csv', header=None)

label_train = df_training[72].values
# Shift labels to start from 0 (activity n becomes class n - 1); sklearn does not strictly require this
label_train = label_train - 1
df_training = df_training.drop([72], axis=1)
data_train = df_training.values

label_test = df_testing[72].values
label_test = label_test - 1
df_testing = df_testing.drop([72], axis=1)
data_test = df_testing.values

# Feature normalization for improving the performance of machine learning models. In this example code, 
# StandardScaler is used to scale original feature to be centered around zero. You could try other normalization methods.
scaler = preprocessing.StandardScaler().fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)

# Build KNN classifier, in this example code
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(data_train, label_train)

# Evaluation. when we train a machine learning model on training set, we should evaluate its performance on testing set.
# We could evaluate the model by different metrics. Firstly, we could calculate the classification accuracy.
label_pred = knn.predict(data_test)
print('Accuracy: ', accuracy_score(label_test, label_pred))
# We could use confusion matrix to view the classification for each activity.
print(confusion_matrix(label_test, label_pred))


File dataset_1.txt has 12 windows, each of which has 1000 rows for activity: 1
File dataset_1.txt has 12 windows, each of which has 1000 rows for activity: 2
File dataset_1.txt has 12 windows, each of which has 1000 rows for activity: 3
File dataset_1.txt has 24 windows, each of which has 1000 rows for activity: 4
File dataset_1.txt has 12 windows, each of which has 1000 rows for activity: 5
File dataset_1.txt has 18 windows, each of which has 1000 rows for activity: 6
File dataset_1.txt has 52 windows, each of which has 1000 rows for activity: 7
File dataset_1.txt has 6 windows, each of which has 1000 rows for activity: 8
File dataset_1.txt has 6 windows, each of which has 1000 rows for activity: 9
File dataset_1.txt has 25 windows, each of which has 1000 rows for activity: 10
File dataset_1.txt has 24 windows, each of which has 1000 rows for activity: 11
File dataset_1.txt has 24 windows, each of which has 1000 rows for activity: 12
File dataset_1.txt has 12 windows, each of which ha
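The 73 columns in Code Block 3.1 come from 24 sensor columns times 3 statistics (min, max, mean), plus 1 label column. The sketch below shows one way to compute the same layout with array operations instead of an inner loop over columns. The function name `extract_window_features` and the random input are assumptions for illustration; the feature order (min, max, mean of column 0, then of column 1, and so on) follows the notebook's loop.

```python
import numpy as np

def extract_window_features(activity_data, window_size=1000, n_sensor_columns=24):
    """Return one row per full window: 3 statistics per sensor column + label."""
    activity_data = np.asarray(activity_data, dtype=float)
    n_windows = len(activity_data) // window_size
    rows = []
    for w in range(n_windows):
        window = activity_data[w * window_size:(w + 1) * window_size]
        sensors = window[:, :n_sensor_columns]
        # stats has shape (24, 3); ravel() yields min0, max0, mean0, min1, ...
        stats = np.stack((sensors.min(axis=0),
                          sensors.max(axis=0),
                          sensors.mean(axis=0)), axis=1)
        # The label is taken from the last column of the window's first row.
        rows.append(np.append(stats.ravel(), window[0, -1]))
    return np.array(rows).reshape(n_windows, 3 * n_sensor_columns + 1)

# Random stand-in (NOT real sensor data): 2300 rows, 24 sensor columns + label 7.
rng = np.random.default_rng(0)
fake_activity = rng.normal(size=(2300, 25))
fake_activity[:, -1] = 7

features = extract_window_features(fake_activity)
print(features.shape)  # (2, 73): 24 * 3 features + 1 label
```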

### In the following code, we use four participants' data for training and one participant's data for testing, with the same number of features as in the code block above

In [None]:
#Code Block 3.2

#using four participants as training
#and one participant as testing
training_fileName = []
training_fileName.append ('dataset_1.txt')
training_fileName.append('dataset_3.txt')
training_fileName.append('dataset_4.txt')
training_fileName.append('dataset_5.txt')
testing_fileName = []
testing_fileName.append ('dataset_6.txt')
#use separate file names so the 73-column files do not overwrite the 10-column files written by Code Block 2.4
create_training_data_from_filesWithMoreFeatures (training_fileName, 'week6_training_data_4ParticipantMorefeatures.csv')
create_testing_data_from_filesWithMoreFeatures (testing_fileName, 'week6_testing_data_1ParticipantMorefeatures.csv')


df_training = pd.read_csv('week6_training_data_4ParticipantMorefeatures.csv', header=None)
df_testing = pd.read_csv('week6_testing_data_1ParticipantMorefeatures.csv', header=None)

label_train = df_training[72].values
# Shift labels to start from 0 (activity n becomes class n - 1); sklearn does not strictly require this
label_train = label_train - 1
df_training = df_training.drop([72], axis=1)
data_train = df_training.values

label_test = df_testing[72].values
label_test = label_test - 1
df_testing = df_testing.drop([72], axis=1)
data_test = df_testing.values

# Feature normalization for improving the performance of machine learning models. In this example code, 
# StandardScaler is used to scale original feature to be centered around zero. You could try other normalization methods.
scaler = preprocessing.StandardScaler().fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)

# Build KNN classifier, in this example code
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(data_train, label_train)

# Evaluation. when we train a machine learning model on training set, we should evaluate its performance on testing set.
# We could evaluate the model by different metrics. Firstly, we could calculate the classification accuracy.
label_pred = knn.predict(data_test)
print('Accuracy: ', accuracy_score(label_test, label_pred))
# We could use confusion matrix to view the classification for each activity.
print(confusion_matrix(label_test, label_pred))


In [None]:
#Challenge 4


#Question 1
#
#After running code block 3.1, where we increase the number of features, compared to the results from running
#code block 1.1, which activities have improved
#accuracy and which activities have decreased accuracy? How about the overall accuracy (improved or dropped)? How many
#features have been added? What are these newly added features?


#Question 2
#
#After running code block 3.2, where we increase the amount of training data, compared to the results from running
#code block 2.4, which activities have improved
#accuracy and which activities have decreased accuracy? How about the overall accuracy (improved or dropped)?


#Question 3
#
#It seems the number of features and the amount of training data are important, but in the real world, how do you know which
#features are useful and how do you collect more training data? How much training data counts as sufficient?


#Question 4
#
#How would you do boundary value analysis for this activity recognition system? What about code coverage?
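As one possible starting point for the boundary value question, the sketch below tests the windowing rule used throughout this practical (only full windows of 1000 rows are kept) at and around its boundaries. The function `count_full_windows` is a hypothetical stand-in for the expression `int(len(activity_data)/1000)` in the code blocks, and the chosen cases are an assumption about which boundaries matter.

```python
def count_full_windows(n_rows, window_size=1000):
    """Number of complete windows in n_rows; incomplete trailing rows are dropped."""
    if n_rows < 0 or window_size <= 0:
        raise ValueError("n_rows must be >= 0 and window_size must be > 0")
    return n_rows // window_size

# Boundary values: just below, exactly at, and just above each window boundary.
boundary_cases = {
    0: 0,      # no data at all
    1: 0,      # smallest non-empty input
    999: 0,    # one row short of a window
    1000: 1,   # exactly one window
    1001: 1,   # one extra row is dropped
    1999: 1,   # one row short of two windows
    2000: 2,   # exactly two windows
}
for n_rows, expected in boundary_cases.items():
    assert count_full_windows(n_rows) == expected, (n_rows, expected)

# Invalid input is rejected instead of silently producing a result.
try:
    count_full_windows(-1)
    rejected = False
except ValueError:
    rejected = True

print("all boundary cases passed:", rejected)  # all boundary cases passed: True
```

For code coverage, a tool such as coverage.py could report which lines of the feature extraction functions such tests actually execute.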


