This notebook uses the dataset at the link below to predict the activity class of each record.
https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

The classes to predict are:
1 WALKING,
2 WALKING_UPSTAIRS,
3 WALKING_DOWNSTAIRS,
4 SITTING,
5 STANDING,
6 LAYING

Attribute Information:

Each record in the dataset provides:
- Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration.
- Triaxial angular velocity from the gyroscope.
- A 561-feature vector with time and frequency domain variables. 
- Its activity label. 
- An identifier of the subject who carried out the experiment.


Relevant Papers:

Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. International Workshop of Ambient Assisted Living (IWAAL 2012). Vitoria-Gasteiz, Spain. Dec 2012 


In [161]:
#Importing modules required
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint
from time import time
from pyspark.mllib.evaluation import MulticlassMetrics

# Loading the data

I prepared the data in a Unix shell before feeding it to Spark. Follow the steps below.

After downloading the data from the link above, combine the feature file and the label file for both the train and test sets, so that each line holds a label followed by its features. The model is trained on the train file, and its accuracy is checked on the test file.

Train Data:

paste -d' ' y_train.txt X_train.txt > train.txt

# Squeeze repeated spaces into a single space on each line
tr -s " " < train.txt > train_new.txt

Test Data:

paste -d' ' y_test.txt X_test.txt > X_test_new.txt

tr -s " " < X_test_new.txt > test_new.txt
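
For readers without a Unix shell, the same preparation can be sketched in Python. This is only an illustration of what the `paste` and `tr` steps achieve, not the method used above. The helper name `merge_label_and_features` is hypothetical, the sample rows are made up, and the sketch assumes both files have the same number of lines.

```python
def merge_label_and_features(label_lines, feature_lines):
    """Join each label with its feature row: label first, single-space separated."""
    merged = []
    for label, features in zip(label_lines, feature_lines):
        # str.split() with no argument drops leading/trailing whitespace and
        # collapses runs of spaces, which is the job `tr -s " "` does above.
        tokens = [label.strip()] + features.split()
        merged.append(" ".join(tokens))
    return merged


# Tiny made-up rows laid out like y_train.txt and X_train.txt
labels = ["5\n", "1\n"]
features = ["  2.88e-001 -2.02e-002\n", "  2.78e-001 -1.64e-002\n"]
rows = merge_label_and_features(labels, features)
print(rows[0])  # prints: 5 2.88e-001 -2.02e-002
```

Putting the label first on every line matters because the parsing step later reads the first value as the label.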




In [154]:
train = sc.textFile("/home/osboxes/Downloads/UCI HAR Dataset/train/train_new.txt")
test = sc.textFile("/home/osboxes/Downloads/UCI HAR Dataset/test/test_new.txt")

# Labeled Points

A labeled point is a local vector associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms, and the label is stored as a double. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification with K classes, labels should be class indices starting from zero: 0, 1, ..., K-1. This is important here: the dataset's labels run from 1 to 6, so they must be shifted to start from 0.

# Preparing the Training and Testing Data
We need to separate the label from the features on each line before passing them to the model. After subtracting 1 from each label, the classes are:
0 WALKING,
1 WALKING_UPSTAIRS,
2 WALKING_DOWNSTAIRS,
3 SITTING,
4 STANDING,
5 LAYING
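
The label shift can be sketched as a small lookup, which also makes it easier to read predictions back as activity names. The names `ACTIVITY_NAMES`, `to_zero_based`, and `activity_name` are hypothetical helpers for illustration, not part of the notebook or of MLlib.

```python
# Activity names indexed by the zero-based label used for training
ACTIVITY_NAMES = [
    "WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
    "SITTING", "STANDING", "LAYING",
]


def to_zero_based(raw_label):
    """Map a dataset label in 1..6 to the 0..5 range that MLlib expects."""
    index = int(raw_label) - 1
    if not 0 <= index < len(ACTIVITY_NAMES):
        raise ValueError("label out of range: {}".format(raw_label))
    return index


def activity_name(zero_based_label):
    """Translate a zero-based label (int or float) back to its activity name."""
    return ACTIVITY_NAMES[int(zero_based_label)]


print(activity_name(to_zero_based(5)))  # prints: STANDING
```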


In [155]:
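# Parse one line into a LabeledPoint: the first value is the label, the rest are features
def parsePoint(line):
    # split() without arguments tolerates leading or repeated spaces
    values = [float(x) for x in line.split()]
    # Shift labels from 1..6 to 0..5, since MLlib expects class indices starting at 0
    return LabeledPoint(values[0] - 1, values[1:])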

In [156]:
parsedData = train.map(parsePoint)   # training set
parsedData1 = test.map(parsePoint)   # test set
parsedData.take(1)

[LabeledPoint(4.0, [0.28858451,-0.020294171,-0.13290514,-0.9952786,-0.98311061,-0.91352645,-0.99511208,-0.98318457,-0.92352702,-0.93472378,-0.56737807,-0.74441253,0.85294738,0.68584458,0.81426278,-0.96552279,-0.99994465,-0.99986303,-0.99461218,-0.99423081,-0.98761392,-0.94321999,-0.40774707,-0.67933751,-0.60212187,0.92929351,-0.85301114,0.35990976,-0.058526382,0.25689154,-0.22484763,0.26410572,-0.09524563,0.27885143,-0.46508457,0.49193596,-0.19088356,0.37631389,0.43512919,0.66079033,0.96339614,-0.14083968,0.11537494,-0.98524969,-0.98170843,-0.87762497,-0.98500137,-0.98441622,-0.89467735,0.89205451,-0.16126549,0.12465977,0.97743631,-0.12321341,0.056482734,-0.37542596,0.89946864,-0.97090521,-0.97551037,-0.98432539,-0.98884915,-0.91774264,-1.0,-1.0,0.11380614,-0.590425,0.5911463,-0.59177346,0.59246928,-0.74544878,0.72086167,-0.71237239,0.71130003,-0.99511159,0.99567491,-0.99566759,0.99165268,0.57022164,0.43902735,0.98691312,0.077996345,0.0050008031,-0.067830808,-0.99351906,-0.98835999,-0.

For multiclass classification problems, the algorithm outputs a multinomial logistic regression model, which contains K−1 binary logistic regression models regressed against the first class. Given a new data point, the K−1 models are run, and the class with the largest probability is chosen as the predicted class.
MLlib implements two algorithms to solve logistic regression: mini-batch gradient descent and L-BFGS.
The MLlib documentation recommends L-BFGS over mini-batch gradient descent for faster convergence.
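
The prediction rule described above can be sketched in plain Python. This is a sketch of the idea only, not MLlib's actual implementation: the function name `predict_class` and the weights are made up for illustration, and the first class is treated as the reference class with its margin fixed at 0.

```python
import math


def predict_class(weights, intercepts, features):
    """Pick the most probable class from K-1 linear models against class 0.

    weights[i] and intercepts[i] belong to class i + 1; class 0 is the
    reference class, whose margin is fixed at 0.
    """
    margins = [0.0]  # reference class 0
    for w, b in zip(weights, intercepts):
        margins.append(sum(wi * xi for wi, xi in zip(w, features)) + b)
    # Softmax over the margins gives the class probabilities.
    peak = max(margins)  # subtract the maximum for numerical stability
    exps = [math.exp(m - peak) for m in margins]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=lambda k: probs[k])
    return best, probs


# Made-up weights for K = 3 classes and 2 features
weights = [[1.0, 0.0], [0.0, 1.0]]
intercepts = [0.0, 0.0]
label, probs = predict_class(weights, intercepts, [0.5, 2.0])
print(label)  # prints: 2
```

When every margin is negative, the reference class wins, which is why only K−1 sets of weights are needed for K classes.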

In [157]:
# Build the model for multiclass prediction
t0 = time()
model = LogisticRegressionWithLBFGS.train(parsedData, numClasses=6)
tt = time() - t0

print("Classifier trained in {} seconds".format(round(tt, 3)))


Classifier trained in 18.941 seconds


In [158]:
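# Evaluating the model on training data
# Each pair is (true label, predicted label)
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
# Training error = share of pairs where the label and the prediction differ
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())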
print("Training Error = " + str(trainErr))

Training Error = 0.0174102285092
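
The error computation above is simply the share of misclassified points. A minimal plain-Python sketch of the same arithmetic, using a hypothetical helper `error_rate` and made-up pairs rather than the real RDD:

```python
def error_rate(labels_and_preds):
    """Fraction of (label, prediction) pairs where the two values differ."""
    pairs = list(labels_and_preds)
    if not pairs:
        raise ValueError("no predictions to evaluate")
    wrong = sum(1 for label, pred in pairs if label != pred)
    return wrong / float(len(pairs))


# Made-up (label, prediction) pairs: one mistake out of four
sample = [(4.0, 4.0), (3.0, 4.0), (3.0, 3.0), (0.0, 0.0)]
print(error_rate(sample))  # prints: 0.25
```

Accuracy is the complement of this value, so an error of 0.25 corresponds to an accuracy of 0.75.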


# Evaluating the model on the new data (test data)

To measure accuracy on the test data, we map over the parsed test RDD (parsedData1) and use the model to predict the class of each test point.

In [159]:
labels_and_preds = parsedData1.map(lambda p: (p.label, model.predict(p.features)))
t0 = time()
# map is lazy, so the predictions are actually computed by count() inside the timed section
test_accuracy = labels_and_preds.filter(lambda vp: vp[0] == vp[1]).count() / float(parsedData1.count())
tt = time() - t0
print("Prediction made in {} seconds. Test accuracy is {}".format(round(tt, 3), round(test_accuracy, 4)))


Prediction made in 2.94 seconds. Test accuracy is 0.9308


In [163]:
# Pair each prediction with its true label on the test set
predictionAndLabels = parsedData1.map(lambda lp: (float(model.predict(lp.features)), lp.label))

# Instantiate metrics object
metrics = MulticlassMetrics(predictionAndLabels)

In [166]:
# Summary stats (called without a label, recall() and precision() give overall values)
print("Recall = %s" % metrics.recall())
print("Precision = %s" % metrics.precision())
print("Accuracy = %s" % metrics.accuracy)

Recall = 0.930777061418
Precision = 0.930777061418
Accuracy = 0.930777061418
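
The three values above are identical, which is expected when precision and recall are pooled over all classes in a single-label problem: every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class). The sketch below demonstrates this with made-up pairs. The function `micro_metrics` is hypothetical, and treating the overall values as pooled (micro-averaged) is an assumption consistent with the output above.

```python
def micro_metrics(prediction_and_labels):
    """Pooled precision, recall and accuracy for (prediction, label) pairs."""
    pairs = list(prediction_and_labels)
    classes = set(p for p, _ in pairs) | set(l for _, l in pairs)
    tp = fp = fn = 0
    for c in classes:
        tp += sum(1 for p, l in pairs if p == c and l == c)
        fp += sum(1 for p, l in pairs if p == c and l != c)
        fn += sum(1 for p, l in pairs if p != c and l == c)
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    accuracy = sum(1 for p, l in pairs if p == l) / float(len(pairs))
    return precision, recall, accuracy


# Made-up (prediction, label) pairs: three correct, two wrong
sample = [(4.0, 4.0), (4.0, 3.0), (3.0, 3.0), (0.0, 0.0), (1.0, 2.0)]
print(micro_metrics(sample))  # prints: (0.6, 0.6, 0.6)
```

Because the pooled numbers carry no more information than accuracy, the per-class values are the more useful ones to inspect.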


In [173]:
predictionAndLabels.count()
predictionAndLabels.take(50)

[(4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (4.0, 4.0),
 (3.0, 3.0),
 (3.0, 3.0),
 (3.0, 3.0),
 (3.0, 3.0),
 (3.0, 3.0),
 (3.0, 3.0),
 (3.0, 3.0),
 (4.0, 3.0),
 (3.0, 3.0),
 (3.0, 3.0),
 (3.0, 3.0),
 (4.0, 3.0),
 (4.0, 3.0),
 (3.0, 3.0),
 (3.0, 3.0),
 (3.0, 3.0),
 (3.0, 3.0),
 (3.0, 3.0),
 (3.0, 3.0)]

# Individual label stats

In [181]:

# distinct() does not guarantee any ordering, so sort the labels explicitly
labels = sorted(predictionAndLabels.map(lambda x: x[1]).distinct().collect())


In [184]:
labels

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

In [186]:
for label in labels:
    print("Class %s precision = %s" % (label, metrics.precision(label)))
    print("Class %s recall = %s" % (label, metrics.recall(label)))

Class 0.0 precision = 0.890510948905
Class 0.0 recall = 0.983870967742
Class 1.0 precision = 0.948545861298
Class 1.0 recall = 0.900212314225
Class 2.0 precision = 0.964285714286
Class 2.0 recall = 0.9
Class 3.0 precision = 0.930283224401
Class 3.0 recall = 0.869653767821
Class 4.0 precision = 0.87260034904
Class 4.0 recall = 0.93984962406
Class 5.0 precision = 0.996212121212
Class 5.0 recall = 0.979515828678
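
To make the per-class numbers concrete, the sketch below computes precision and recall for each class from (prediction, label) pairs in plain Python. The helper `per_class_metrics` is hypothetical and the pairs are made up. They mimic the pattern visible in the first 50 pairs printed earlier, where a few SITTING (3.0) points are predicted as STANDING (4.0), which lowers recall for class 3 and precision for class 4.

```python
from collections import Counter


def per_class_metrics(prediction_and_labels):
    """Return {class: (precision, recall)} for (prediction, label) pairs."""
    confusion = Counter(prediction_and_labels)  # (prediction, label) -> count
    classes = sorted(set(c for pair in confusion for c in pair))
    result = {}
    for c in classes:
        tp = confusion[(c, c)]
        predicted = sum(n for (p, _), n in confusion.items() if p == c)
        actual = sum(n for (_, l), n in confusion.items() if l == c)
        # Guard against classes that were never predicted or never occurred
        precision = tp / float(predicted) if predicted else 0.0
        recall = tp / float(actual) if actual else 0.0
        result[c] = (precision, recall)
    return result


# Made-up pairs: one SITTING (3.0) point is predicted as STANDING (4.0)
sample = [(4.0, 4.0), (4.0, 4.0), (4.0, 3.0), (3.0, 3.0)]
for label, (precision, recall) in sorted(per_class_metrics(sample).items()):
    print("Class %s precision = %s, recall = %s" % (label, precision, recall))
```

In this toy sample, class 3.0 has perfect precision but a recall of 0.5, while class 4.0 has perfect recall but lower precision.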
