# Quantified Self: Classification

We'll start by loading the data from our CSV files into data frames. We'll be using the "revised" versions of these files from the aggregation and hypothesis testing portion of the project. In the revised files, any instances with missing data have been removed, and each day has an additional attribute giving its schedule type (MWF/TR/SS).

In [1]:
import pandas as pd
health_df = pd.read_csv("revised_health.csv")
health_df = health_df.drop("Unnamed: 0", axis = 1)
screentime_df = pd.read_csv("revised_screen_time.csv")
screentime_df = screentime_df.drop("Unnamed: 0", axis = 1)
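
An `Unnamed: 0` column is typically what pandas produces when a frame was saved with its index and then read back without `index_col`. A minimal sketch, assuming that is how the revised files were written; the two-row CSV text below is a made-up stand-in, not the project's data:

```python
import io
import pandas as pd

# Hypothetical stand-in for a file written by DataFrame.to_csv() with its index.
csv_text = ",Date,Steps (count)\n0,10/11/2020,199\n1,10/12/2020,4220\n"

# Default read: the saved index comes back as a column named "Unnamed: 0".
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the first column as the index, so nothing needs dropping.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(clean.columns))
```

Either approach ends with the same columns; reading with `index_col=0` just avoids the separate `drop` call.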

Looking at the descriptions of the two data frames, we can begin to get an idea of what patterns exist within our data set. From here, we will merge the two data frames on the "Date" attribute, allowing us to view and work with the data set all at once.

In [2]:
# We have some duplicate attributes, so we'll drop those from one frame before merging.
health_df = health_df.drop(["Weekday", "Schedule Type"], axis = 1)
data_df = screentime_df.merge(health_df, on = "Date")
print(data_df)
data_df.describe()

          Date    Weekday  Total Time (Minutes)        Highest  Games  Social  \
0   10/11/2020     Sunday                 512.0         Social   88.0   226.0   
1   10/12/2020     Monday                 582.0         Social   38.0   261.0   
2   10/13/2020    Tuesday                 346.0         Social   35.0   163.0   
3   10/14/2020  Wednesday                 444.0         Social   30.0   139.0   
4   10/15/2020   Thursday                 411.0         Social   19.0   177.0   
..         ...        ...                   ...            ...    ...     ...   
61  12/11/2020     Friday                 639.0         Social    8.0   535.0   
62  12/12/2020   Saturday                 495.0         Social   14.0    90.0   
63  12/13/2020     Sunday                 469.0         Social   13.0   122.0   
64  12/14/2020     Monday                 578.0  Entertainment    0.0   113.0   
65  12/15/2020    Tuesday                 567.0  Entertainment    3.0    64.0   

    Entertainment  Utilitie

| | Total Time (Minutes) | Games | Social | Entertainment | Utilities | Productivity & Finance | Creativity | Information & Reading | Education | Shopping & Food | Other | Unclassified | Distance (mi) | Flights Climbed (count) | Steps (count) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 66.0 | 66.0 | 66.0 | 66.0 | 66.0 | 66.0 | 66.0 | 66.0 | 66.0 | 66.0 | 66.0 | 66.0 | 66.0 | 66.0 | 66.0 |
| mean | 416.530303 | 40.166667 | 145.984848 | 32.666667 | 9.909091 | 12.469697 | 8.5 | 6.348485 | 0.333333 | 1.30303 | 5.666667 | 153.227273 | 1.747074 | 4.393939 | 4030.515152 |
| std | 118.338526 | 41.73454 | 68.351916 | 44.566401 | 16.252552 | 11.481383 | 12.38268 | 13.670285 | 1.657926 | 4.318011 | 9.204235 | 102.002684 | 1.227376 | 4.220386 | 2806.700605 |
| min | 139.0 | 0.0 | 63.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 0.084637 | 0.0 | 199.0 |
| 25% | 318.5 | 11.0 | 107.25 | 6.0 | 0.0 | 5.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 74.25 | 0.486223 | 0.0 | 1199.75 |
| 50% | 447.5 | 23.0 | 138.5 | 14.0 | 2.0 | 9.0 | 3.0 | 2.0 | 0.0 | 0.0 | 1.0 | 131.5 | 1.854219 | 4.5 | 4220.5 |
| 75% | 500.5 | 61.75 | 171.75 | 42.5 | 11.5 | 16.75 | 10.0 | 6.0 | 0.0 | 0.0 | 7.0 | 214.75 | 2.622152 | 7.0 | 6010.25 |
| max | 639.0 | 169.0 | 535.0 | 210.0 | 74.0 | 70.0 | 54.0 | 82.0 | 13.0 | 26.0 | 35.0 | 482.0 | 4.114276 | 16.0 | 9600.0 |
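
`merge` performs an inner join by default, so a date present in only one frame is silently dropped. A small sketch of how one might check for that; the frames here are tiny stand-ins (the screen time totals are the first three rows printed above, the step counts are invented for illustration):

```python
import pandas as pd

screen = pd.DataFrame({
    "Date": ["10/11/2020", "10/12/2020", "10/13/2020"],
    "Total Time (Minutes)": [512.0, 582.0, 346.0],
})
# Hypothetical health rows; note that 10/13/2020 is deliberately missing.
health = pd.DataFrame({
    "Date": ["10/11/2020", "10/12/2020"],
    "Steps (count)": [199, 4220],
})

# validate="one_to_one" raises if either frame has duplicate dates.
merged = screen.merge(health, on="Date", validate="one_to_one")

# An outer join with indicator=True shows which dates failed to match.
check = screen.merge(health, on="Date", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]

print(len(merged))
print(unmatched["Date"].tolist())
```

If `unmatched` is empty on the real data, the inner join has not discarded any days.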


Now that the data has been merged, we can start with our classification. The goal here is to see whether we can predict the schedule type of a given day from its screen time and health data attributes. We'll exclude the "Weekday" and "Date" attributes from our "x" data frame, since those two attributes always give away the schedule type and are unrelated to screen time or physical activity.

Part of our preprocessing is encoding categorical values so they can be treated numerically. In this case, we will be encoding the "Highest" attribute, which gives the app category with the highest screen time on any given day.

In [3]:
from sklearn.preprocessing import LabelEncoder

highest_le = LabelEncoder()
highest_le.fit(data_df["Highest"])
print(highest_le.classes_)
test = ["Entertainment","Games", "Social"]
test = highest_le.transform(test)
print(test)

['Entertainment' 'Games' 'Social']
[0 1 2]


Using the label encoder, "Entertainment" becomes 0, "Games" becomes 1, and "Social" becomes 2.
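
Label encoding assigns integers in alphabetical order, which implies an ordering (Entertainment < Games < Social) that a distance-based classifier such as kNN will treat as meaningful. One alternative, not used in this notebook, is one-hot encoding. A minimal sketch with `pd.get_dummies` on a short made-up series:

```python
import pandas as pd

# Illustrative values only; the real column is data_df["Highest"].
highest = pd.Series(["Social", "Entertainment", "Games", "Social"], name="Highest")

# One 0/1 column per category, so no category is "between" two others.
one_hot = pd.get_dummies(highest, prefix="Highest", dtype=int)
print(one_hot)
```

Each row has exactly one 1, so every pair of distinct categories ends up the same distance apart.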

In [4]:
data_df["Highest"] = highest_le.transform(data_df["Highest"])
print(data_df["Highest"])

0     2
1     2
2     2
3     2
4     2
     ..
61    2
62    2
63    2
64    0
65    0
Name: Highest, Length: 66, dtype: int32


Now we can drop the date and weekday attributes.

In [5]:
data_df = data_df.drop("Date", axis = 1)
data_df = data_df.drop("Weekday", axis = 1)

With this done, we can move on to our classification.

## kNN Classification

We begin with the kNN classifier. Our y value is the attribute that we want to predict. In this case, that's the schedule type (MWF/TR/SS). The rest of the data makes up our x.

In [6]:
y = data_df["Schedule Type"]
x = data_df.drop("Schedule Type", axis=1)

Now that we have our data separated, we will apply min-max scaling so that every attribute lies in the same [0, 1] range and no attribute dominates the distance calculation simply because of its units. From there we will use `train_test_split` to separate our training and testing sets. `train_test_split` uses the holdout method, holding out 25% of the data set for testing by default. Passing `stratify=y` keeps the proportion of each schedule type roughly the same in both sets.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

scaler = MinMaxScaler()
x = scaler.fit_transform(x)
#train test split defaults to holding out 25%
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0, stratify=y)
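
In the cell above, the scaler is fit on every row before the split, so the minimum and maximum of the test rows influence how the training rows are scaled. One common alternative is to fit the scaler on the training rows only, which a scikit-learn pipeline does automatically. A minimal sketch on synthetic data; the random features and the `_demo` names are placeholders, not the project's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(66, 4))            # 66 rows, like the merged frame
y_demo = np.array(["MWF", "TR", "SS"] * 22)  # placeholder labels

x_train_demo, x_test_demo, y_train_demo, y_test_demo = train_test_split(
    x_demo, y_demo, random_state=0, stratify=y_demo)

# fit() scales using the training rows only, then fits kNN on the scaled rows.
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(x_train_demo, y_train_demo)

# score() applies the training-set scaling to the test rows before predicting.
demo_score = pipe.score(x_test_demo, y_test_demo)
print(demo_score)
```

With only 66 rows the difference may be small, but the pipeline keeps the test set untouched until scoring.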


Now we'll fit our kNN classifier to the data set.

In [8]:
clf = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
clf.fit(x_train, y_train)
y_predicted = clf.predict(x_test)
for i in range(len(y_predicted)):
    print(y_test.values[i], y_predicted[i])

accuracy = clf.score(x_test, y_test)
print(accuracy)

SS MWF
TR MWF
SS MWF
MWF MWF
MWF MWF
MWF MWF
TR TR
MWF MWF
TR SS
MWF MWF
SS MWF
TR SS
SS SS
MWF MWF
SS MWF
MWF MWF
TR MWF
0.5294117647058824


With only about 53% accuracy, our kNN classifier isn't doing very well with the data presented to it. All seven MWF days were correctly classified as such. However, of the ten TR and SS days, only two were classified correctly, and most of the rest were labeled MWF. We may want to revise what data is being used in our classification. However, there are other classifiers we can look at before narrowing down what data we are using in our x data frame.
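
A confusion matrix summarizes the same per-class picture more compactly than the printed pairs. The sketch below simply retypes the 17 actual/predicted pairs from the kNN output above into lists (in the notebook itself, `y_test` and `y_predicted` could be passed directly):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Actual and predicted labels copied from the kNN output above.
y_true = ["SS", "TR", "SS", "MWF", "MWF", "MWF", "TR", "MWF", "TR",
          "MWF", "SS", "TR", "SS", "MWF", "SS", "MWF", "TR"]
y_pred = ["MWF", "MWF", "MWF", "MWF", "MWF", "MWF", "TR", "MWF", "SS",
          "MWF", "MWF", "SS", "SS", "MWF", "MWF", "MWF", "MWF"]

labels = ["MWF", "SS", "TR"]
# Rows are the actual schedule type, columns are the predicted one.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
print(accuracy_score(y_true, y_pred))
```

The first column sums to 13, so the classifier labels 13 of the 17 test days as MWF, the most common class in the test set.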

Next, we'll look at the decision tree classifier.

In [9]:
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(random_state = 0)
tree_clf.fit(x_train, y_train)

y_predicted = tree_clf.predict(x_test)
for i in range(len(y_predicted)):
    print(y_test.values[i], y_predicted[i])
accuracy = tree_clf.score(x_test, y_test)
print(accuracy)

SS TR
TR SS
SS MWF
MWF MWF
MWF MWF
MWF TR
TR TR
MWF MWF
TR TR
MWF MWF
SS TR
TR TR
SS SS
MWF SS
SS MWF
MWF MWF
TR TR
0.5882352941176471


The decision tree classifier improves on the kNN classifier, with an accuracy of about 59%, but it's still not great. This time, the correct classifications were more spread out between the different schedule types, but 7 of the 17 test days (about 41%) were still misclassified. Time to take another look at our data set.

## Refining Classifiers

Based on the accuracy of these two classifiers, we can tell that some of the data given is confusing them. From our hypothesis testing (see AggregationAndHypoTesting), we know that the total minutes of screen time plays a strong role in distinguishing between a MWF day and a TR or SS day. However, while the visualizations seemed to indicate that physical activity was notably different depending on the day of the week, our hypothesis testing did not support the idea that distance, steps, and flights could be used to distinguish between schedule types.

Looking at our screen time data, there are a lot of categories that tend towards 0. MinMax scaling might not be the most effective when it comes to individual categories such as these. Categories that maybe _shouldn't_ have as much weight are given equal weight to some of the more significant categories, such as total minutes of screen time.
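
To make the concern concrete: min-max scaling maps each value to (value - min) / (max - min) within its own column. A small sketch using the min, median, and max of two columns from the summary statistics above (Total Time and Education); the three-row array is only an illustration, not the full data set:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: Total Time (Minutes), Education. Rows: min, median, max of each.
demo = np.array([[139.0, 0.0],
                 [447.5, 0.0],
                 [639.0, 13.0]])

scaled = MinMaxScaler().fit_transform(demo)

# The same calculation written out by hand, column by column.
manual = (demo - demo.min(axis=0)) / (demo.max(axis=0) - demo.min(axis=0))

print(scaled)
```

After scaling, a 13-minute difference in Education spans the full 0 to 1 range, exactly as a 500-minute difference in total screen time does, so both contribute equally to a Euclidean distance.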

Based on these ideas, I'm putting together a revised dataframe containing attributes that I think are most significant to our predictions.

In [10]:
revised_df = pd.DataFrame()
revised_df["Total Minutes"] = data_df["Total Time (Minutes)"]
revised_df["Highest"] = data_df["Highest"]
revised_df["Games"] = data_df["Games"]
revised_df["Social"] = data_df["Social"]
revised_df["Entertainment"] = data_df["Entertainment"]
revised_df["Utilities"] = data_df["Utilities"]
revised_df["Productivity & Finance"] = data_df["Productivity & Finance"]
revised_df["Other"] = data_df["Other"]
revised_df["Unclassified"] = data_df["Unclassified"]
revised_df["Schedule Type"] = data_df["Schedule Type"]

The revised data frame has removed any attributes that our hypothesis testing determined could not be used to clearly distinguish a certain schedule type. I've also removed certain screen time categories that did not contain much significant data. With this revised data set, we can redo our classifiers.
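
The same selection could also be written as a single column list plus a rename, which makes the kept attributes easier to scan. A sketch on a two-row stand-in frame (the first two rows printed earlier, with a hypothetical Education column added to show a column being left out):

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Total Time (Minutes)": [512.0, 582.0],
    "Games": [88.0, 38.0],
    "Social": [226.0, 261.0],
    "Education": [0.0, 0.0],          # hypothetical values, dropped below
    "Schedule Type": ["SS", "MWF"],
})

# List the columns to keep once, then rename the one whose name changes.
keep = ["Total Time (Minutes)", "Games", "Social", "Schedule Type"]
revised_demo = demo_df[keep].rename(
    columns={"Total Time (Minutes)": "Total Minutes"})

print(revised_demo.columns.tolist())
```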

In [11]:
y = revised_df["Schedule Type"]
x = revised_df.drop("Schedule Type", axis = 1)

x = scaler.fit_transform(x)
#train test split defaults to holding out 25%
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0, stratify=y)
clf = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
clf.fit(x_train, y_train)
y_predicted = clf.predict(x_test)
for i in range(len(y_predicted)):
    print(y_test.values[i], y_predicted[i])

accuracy = clf.score(x_test, y_test)
print(accuracy)

SS TR
TR SS
SS MWF
MWF SS
MWF MWF
MWF MWF
TR SS
MWF MWF
TR MWF
MWF MWF
SS SS
TR TR
SS SS
MWF MWF
SS MWF
MWF MWF
TR MWF
0.5294117647058824


The kNN classifier still has an accuracy of about 53%.

In [12]:
tree_clf = DecisionTreeClassifier(random_state = 0)
tree_clf.fit(x_train, y_train)

y_predicted = tree_clf.predict(x_test)
for i in range(len(y_predicted)):
    print(y_test.values[i], y_predicted[i])
accuracy = tree_clf.score(x_test, y_test)
print(accuracy)

SS TR
TR MWF
SS SS
MWF SS
MWF MWF
MWF TR
TR TR
MWF MWF
TR TR
MWF MWF
SS MWF
TR TR
SS TR
MWF MWF
SS SS
MWF MWF
TR TR
0.6470588235294118


The decision tree classifier, however, now has an accuracy of about 65%. Still not great, but an improvement!

Next, we will start adjusting the parameters of each classifier to try to improve accuracy.

In [13]:
clf = KNeighborsClassifier(n_neighbors=13, metric="euclidean")
clf.fit(x_train, y_train)
y_predicted = clf.predict(x_test)
for i in range(len(y_predicted)):
    print(y_test.values[i], y_predicted[i])

accuracy = clf.score(x_test, y_test)
print(accuracy)

SS SS
TR MWF
SS MWF
MWF MWF
MWF MWF
MWF MWF
TR SS
MWF MWF
TR TR
MWF MWF
SS MWF
TR TR
SS SS
MWF MWF
SS MWF
MWF MWF
TR MWF
0.6470588235294118


By increasing the number of neighbors the kNN classifier considers to 13, we are able to increase the accuracy to about 65%. Now, the kNN and decision tree classifiers are about the same in terms of accuracy.
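
Choosing the number of neighbors by checking accuracy on the test set risks tuning to those particular 17 days. A common alternative is to pick it with cross-validation on the training rows only. A minimal sketch with `GridSearchCV` on synthetic data; the random features, the label mix, and the candidate values of k are all assumptions made for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
x_cv = rng.normal(size=(49, 4))  # placeholder for the 49 training rows
y_cv = np.array((["MWF"] * 3 + ["TR"] * 2 + ["SS"] * 2) * 7)

pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(metric="euclidean"))

# Candidate neighbor counts; make_pipeline names the step "kneighborsclassifier".
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11, 13]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv)
search.fit(x_cv, y_cv)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(best_k, search.best_score_)
```

The test set is then scored once, with the chosen value, rather than being used to make the choice.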

Like kNN, we can adjust the parameters of the decision tree classifier to alter the outcome. In this case, we will be limiting the max depth.

In [14]:
tree_clf = DecisionTreeClassifier(max_depth = 5, random_state = 0)
# Note: this fits on the full data set (x, y), which includes the rows in x_test.
tree_clf.fit(x, y)
y_predicted = tree_clf.predict(x_test)
for i in range(len(y_predicted)):
    print(y_test.values[i], y_predicted[i])
accuracy = tree_clf.score(x_test, y_test)
print(accuracy)

SS SS
TR TR
SS SS
MWF MWF
MWF MWF
MWF MWF
TR TR
MWF MWF
TR TR
MWF MWF
SS SS
TR TR
SS SS
MWF MWF
SS SS
MWF MWF
TR TR
1.0


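With the max depth limited to 5, every test value is classified correctly, for a 100% accuracy rate. Note, however, that this tree was fit on the full data set (`x`, `y`) rather than on `x_train` and `y_train`, so the test rows were seen during training. The 100% figure therefore reflects how well the tree fits data it has already seen, not how well it predicts held-out days, and it is not directly comparable to the earlier scores. Of the scores measured on held-out rows, the decision tree and the kNN classifier with 13 neighbors both reach about 65%, and adjusting the kNN classifier's parameters improved its accuracy from about 53%.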
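
The effect of fitting on rows that are later used for scoring can be seen even when the labels carry no signal at all. A minimal sketch on synthetic data, assuming random features and random labels; unlike the cell above, the tree depth is left unrestricted here so the outcome does not depend on the particular random draw:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x_rand = rng.normal(size=(66, 9))
y_rand = rng.choice(["MWF", "TR", "SS"], size=66)  # labels unrelated to features

x_tr, x_te, y_tr, y_te = train_test_split(x_rand, y_rand, random_state=0)

# Fit on the training rows only: the test rows are genuinely unseen.
honest = DecisionTreeClassifier(random_state=0).fit(x_tr, y_tr)

# Fit on every row, test rows included: the tree can memorize them.
leaky = DecisionTreeClassifier(random_state=0).fit(x_rand, y_rand)

honest_score = honest.score(x_te, y_te)
leaky_score = leaky.score(x_te, y_te)
print(honest_score, leaky_score)
```

The tree fit on all rows scores perfectly on the test rows despite there being nothing to learn, which is why a score is only informative when the scored rows were held out of `fit`.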