## Wi-Fi Positioning

##### - Use signal strength from wireless access points (WAPs) to build a model that locates a user inside a building, where GPS suffers from signal loss. Each record contains the signal strength from 520 different WAPs, some identifying information about the user's position, a user ID, and a timestamp. The data was collected on a Spanish university campus.
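##### - A minimal sketch of how many access points a record actually hears. It assumes, based on the dataset's documentation, that the value 100 in a WAP column means "not detected"; the toy frame, its three WAP columns, and the name `NOT_DETECTED` are illustrative only.

```python
import pandas as pd

# Assumption: 100 is the placeholder for "this WAP was not detected".
NOT_DETECTED = 100

# Toy records with three hypothetical WAP columns (the real data has 520).
toy = pd.DataFrame({
    'WAP001': [100, -65, 100],
    'WAP002': [-80, -70, 100],
    'WAP003': [100, 100, 100],
})

# Count, per record, how many access points returned a real signal.
detected_per_record = (toy != NOT_DETECTED).sum(axis=1)
print(detected_per_record.tolist())
```

##### - On the real data the same expression could be applied to the WAP columns only, to spot records in which no access point was detected at all.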

####  Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
pd.options.mode.chained_assignment = None

#### Load Data, understand basic structure, and some data manipulation

In [3]:
os.chdir('./Python Projects/Wifi Positioning/data/UJIndoorLoc/')

##### - Two separate datasets are available: training and validation. The training dataset has ~20k records and contains 4 features describing the user's location: building ID, floor number, space ID, and relative position (inside or outside the room). The validation dataset has ~1k records and contains the building ID and floor number of each user, but not the space ID or relative position (those columns are present but appear to hold only the placeholder value 0).
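##### - A quick way to confirm that a label column is only a placeholder is to check whether every value equals 0. The sketch below uses a toy frame with made-up rows rather than the real validation file.

```python
import pandas as pd

# Toy stand-in for the validation data (values are illustrative).
toy_validation = pd.DataFrame({
    'BUILDINGID': [1, 2, 2],
    'FLOOR': [1, 4, 4],
    'SPACEID': [0, 0, 0],
    'RELATIVEPOSITION': [0, 0, 0],
})

# True for a column only if every value in it is the placeholder 0.
placeholder_cols = ['SPACEID', 'RELATIVEPOSITION']
is_placeholder = (toy_validation[placeholder_cols] == 0).all()
print(is_placeholder)
```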

In [4]:
df = pd.read_csv('trainingData.csv', encoding = 'utf-8')
target_df = pd.read_csv('validationData.csv', encoding = 'utf-8')

In [5]:
print('Training DF:\n' + str(df[['BUILDINGID', 'FLOOR', 'SPACEID', 'RELATIVEPOSITION']].head(2)) + '\n\n')
print('Validation DF:\n' + str(target_df[['BUILDINGID', 'FLOOR', 'SPACEID', 'RELATIVEPOSITION']].head(2)))

Training DF:
   BUILDINGID  FLOOR  SPACEID  RELATIVEPOSITION
0           1      2      106                 2
1           1      2      106                 2


Validation DF:
   BUILDINGID  FLOOR  SPACEID  RELATIVEPOSITION
0           1      1        0                 0
1           2      4        0                 0


In [6]:
print('The shape of \'df\' is: ' + str(df.shape))
print('The shape of \'target_df\' is: ' + str(target_df.shape))

The shape of 'df' is: (19937, 529)
The shape of 'target_df' is: (1111, 529)


##### - Both datasets have the same number of features

In [7]:
(df.columns != target_df.columns).sum()

0

##### - All columns match
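##### - The element-wise comparison above only makes sense when both frames have the same number of columns. As an alternative sketch, `Index.equals` returns a single boolean and also copes with column sets of different lengths; the toy frames here are hypothetical.

```python
import pandas as pd

# Three empty frames that differ only in their column labels.
left = pd.DataFrame(columns=['WAP001', 'WAP002', 'FLOOR'])
right_same = pd.DataFrame(columns=['WAP001', 'WAP002', 'FLOOR'])
right_short = pd.DataFrame(columns=['WAP001', 'FLOOR'])

# Index.equals checks labels and order in one call.
same = left.columns.equals(right_same.columns)
different = left.columns.equals(right_short.columns)
print(same, different)
```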

In [8]:
df.dtypes.value_counts()

int64      527
float64      2
dtype: int64

In [9]:
target_df.dtypes.value_counts()

int64      527
float64      2
dtype: int64

In [10]:
df.select_dtypes('float64')

Unnamed: 0,LONGITUDE,LATITUDE
0,-7541.2643,4.864921e+06
1,-7536.6212,4.864934e+06
2,-7519.1524,4.864950e+06
3,-7524.5704,4.864934e+06
4,-7632.1436,4.864982e+06
...,...,...
19932,-7485.4686,4.864875e+06
19933,-7390.6206,4.864836e+06
19934,-7516.8415,4.864889e+06
19935,-7537.3219,4.864896e+06


##### - All columns are ints except for latitude and longitude, which are floats. Several features should be treated as categorical (they are converted to strings below), and TIMESTAMP should be a datetime.
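##### - A small sketch of the type conversions on a toy frame. It uses pandas' `category` dtype, which is one option for categorical features and not necessarily what this notebook does; the values are illustrative and the timestamps are assumed to be Unix seconds.

```python
import pandas as pd

# Toy frame with integer-coded labels and Unix-second timestamps.
toy = pd.DataFrame({
    'FLOOR': [2, 2, 0],
    'BUILDINGID': [1, 1, 0],
    'TIMESTAMP': [1371713733, 1371713691, 1371714095],
})

# Convert the label columns to categoricals in one call.
toy = toy.astype({'FLOOR': 'category', 'BUILDINGID': 'category'})

# Interpret the integers as seconds since the Unix epoch.
toy['TIMESTAMP'] = pd.to_datetime(toy['TIMESTAMP'], unit='s')
print(toy.dtypes)
```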

In [11]:
strings = ['FLOOR', 'BUILDINGID', 'SPACEID', 'RELATIVEPOSITION', 'USERID', 'PHONEID']

In [12]:
for i in strings:
    df[i] = df[i].astype('str')
    target_df[i] = target_df[i].astype('str')

In [13]:
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'], unit = 's')
target_df['TIMESTAMP'] = pd.to_datetime(target_df['TIMESTAMP'], unit = 's')

##### - This can be tackled as a classification problem (by predicting building, floor, etc.). First, I will test a few different models on a subset of the data (one of the three buildings). Then I will apply the best-performing model to the entire dataset to determine whether it is reliable enough to use on the validation dataset, where it will predict the space ID and relative position.
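##### - The "try several models, keep the best" step can be sketched as a loop over candidate estimators scored with cross-validation. This is an illustration on synthetic data from `make_classification`, not the Wi-Fi data; the dictionary `candidates` and the hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem standing in for the real features and labels.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Hypothetical candidate models to compare.
candidates = {
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'KNN': KNeighborsClassifier(n_neighbors=3),
}

# Mean 5-fold cross-validated accuracy for each candidate.
mean_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in candidates.items()
}

# Pick the candidate with the highest mean score.
best_name = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_name)
```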

### Classification Analysis

In [14]:
from sklearn.model_selection import train_test_split

#### Prepare new df

In [15]:
df[['BUILDINGID', 'TIMESTAMP']].groupby('BUILDINGID').count()

BUILDINGID,TIMESTAMP
0,5249
1,5196
2,9492


In [16]:
df_c = df.copy()
df_c = df_c[(df_c['BUILDINGID'] == '2')]

In [17]:
len(df_c)

9492

##### - To find the best-performing model, I will set the target variable as space ID, since it has the widest range of possibilities (of the two unknown variables in the validation dataset).
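##### - One way to check which target has the wider range of possibilities is to count distinct values per column. The rows below are made up for illustration.

```python
import pandas as pd

# Toy labels; the real training data has many more rows and classes.
toy = pd.DataFrame({
    'SPACEID': ['106', '106', '103', '102', '122', '105'],
    'RELATIVEPOSITION': ['2', '2', '2', '2', '1', '2'],
})

# Number of distinct classes in each candidate target column.
n_classes = toy[['SPACEID', 'RELATIVEPOSITION']].nunique()
print(n_classes)
```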

In [18]:
columns_to_remove = ['LONGITUDE', 'LATITUDE', 'FLOOR', 'BUILDINGID', 'RELATIVEPOSITION', 'USERID', 'PHONEID', 'TIMESTAMP']

In [19]:
df_c.drop(columns_to_remove, axis = 1, inplace = True)

In [20]:
X = df_c.drop('SPACEID', axis = 1)
y = df_c['SPACEID']

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, random_state = 123)

#### Import models

In [22]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, precision_score, f1_score

#### Random Forest

In [23]:
rfc = RandomForestClassifier()

In [24]:
print(cross_val_score(rfc, X_train, y_train, cv = 5))

[0.78511236 0.82022472 0.7872191  0.79985955 0.79339424]


In [25]:
rfc.fit(X_train, y_train)

RandomForestClassifier()

In [26]:
rfc_pred = rfc.predict(X_test)

#### SVM

In [27]:
svc = SVC(gamma = 'scale')

In [28]:
print(cross_val_score(svc, X_train, y_train, cv = 5))

[0.63272472 0.62289326 0.5997191  0.60533708 0.58257203]


In [29]:
svc.fit(X_train, y_train)

SVC()

In [30]:
svc_pred = svc.predict(X_test)

#### K-Nearest Neighbor

In [31]:
knn = KNeighborsClassifier(n_neighbors = 3)

In [32]:
print(cross_val_score(knn, X_train, y_train, cv = 5))

[0.60814607 0.63061798 0.62640449 0.62078652 0.61419536]


In [33]:
knn.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=3)

In [34]:
knn_pred = knn.predict(X_test)

#### Review Predictions and Metrics

In [35]:
predictions_df = pd.DataFrame({'Actuals': y_test, 'Random Forest': rfc_pred, 'Support Vector Machines': svc_pred, 'K-Nearest Neighbor': knn_pred})

In [36]:
predictions_df.head(15)

Unnamed: 0,Actuals,Random Forest,Support Vector Machines,K-Nearest Neighbor
10587,137,137,137,137
511,212,214,214,212
14831,109,109,109,109
2296,233,234,224,234
13552,103,103,103,103
4448,110,110,110,110
2579,221,221,221,221
2191,117,117,117,110
4968,204,204,211,204
3877,143,143,143,140


##### - Judging by eye from these first rows, all three models appear to perform reasonably well, although SVM takes much longer to run.
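##### - Run time can be measured rather than judged by feel. The sketch below times `fit` for two estimators with `time.perf_counter` on synthetic data; the helper `time_fit` is hypothetical, and timings on this small example say nothing about timings on the real dataset.

```python
import time

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Small synthetic problem, only to demonstrate the timing pattern.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

def time_fit(model, X, y):
    """Return the wall-clock seconds spent in model.fit(X, y)."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

fit_seconds = {
    'SVC': time_fit(SVC(gamma='scale'), X_demo, y_demo),
    'KNN': time_fit(KNeighborsClassifier(n_neighbors=3), X_demo, y_demo),
}
print(fit_seconds)
```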

In [37]:
predictions_list = [rfc_pred, svc_pred, knn_pred]
predictions_names = ['Random Forest', 'SVC', 'KNN']

In [38]:
for name, pred in zip(predictions_names, predictions_list):
    print(name)
    print('accuracy: ' + str(accuracy_score(y_test, pred)))
    print('precision: ' + str(precision_score(y_test, pred, average = 'weighted', zero_division = 1)))
    print('f1 score: ' + str(f1_score(y_test, pred, average = 'weighted')) + '\n')

Random Forest
accuracy: 0.8048883270122208
precision: 0.8430119669041071
f1 score: 0.8118993787724988

SVC
accuracy: 0.629582806573957
precision: 0.6846042255934202
f1 score: 0.6318106971884611

KNN
accuracy: 0.6380109565950274
precision: 0.6839861888176492
f1 score: 0.6469219549362023



##### - Random Forest is the best-performing model on every metric (accuracy of about 0.80, versus about 0.63 for SVC and 0.64 for KNN).
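##### - The weighted precision reported above is the per-class precision averaged with each class's true count as its weight. With `zero_division = 1`, a class that is never predicted contributes a precision of 1, which can push the weighted figure above accuracy. The sketch below reproduces the calculation by hand on made-up labels; `manual_weighted_precision` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels: class 'c' occurs once and is never predicted.
y_true = np.array(['a', 'a', 'a', 'b', 'b', 'c'])
y_pred = np.array(['a', 'a', 'b', 'b', 'b', 'a'])

def manual_weighted_precision(y_true, y_pred, zero_division):
    """Support-weighted mean of per-class precision."""
    total = 0.0
    for label in np.unique(y_true):
        is_true = y_true == label
        is_pred = y_pred == label
        if is_pred.sum() == 0:
            # Class never predicted: precision is undefined, use the fallback.
            precision = zero_division
        else:
            precision = (is_true & is_pred).sum() / is_pred.sum()
        total += is_true.sum() * precision
    return total / len(y_true)

manual_1 = manual_weighted_precision(y_true, y_pred, zero_division=1)
manual_0 = manual_weighted_precision(y_true, y_pred, zero_division=0)
sk_1 = precision_score(y_true, y_pred, average='weighted', zero_division=1)
sk_0 = precision_score(y_true, y_pred, average='weighted', zero_division=0)
print(manual_1, sk_1, manual_0, sk_0)
```

##### - In this toy case the only difference between the two settings is the never-predicted class, so comparing them shows how much of the weighted precision comes from the fallback value.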

#### Run on full data set

##### - When using multiple buildings (as in the full dataset), I have to create a "key" value, because the same space ID may refer to different locations in different buildings or on different floors (the buildings appear to share a numbering system).
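##### - A minimal sketch of why the key is needed: in the toy rows below (made up for illustration), the same space ID appears in three buildings, and joining the four location columns turns it into three distinct labels.

```python
import pandas as pd

# Toy rows: one space ID shared by three different buildings.
toy = pd.DataFrame({
    'BUILDINGID': ['0', '1', '2'],
    'FLOOR': ['0', '0', '1'],
    'SPACEID': ['102', '102', '102'],
    'RELATIVEPOSITION': ['2', '2', '1'],
})

# Join the string columns row by row into a single composite key.
key_cols = ['BUILDINGID', 'FLOOR', 'SPACEID', 'RELATIVEPOSITION']
toy['Location Identifier'] = toy[key_cols].agg('-'.join, axis=1)

print(toy['SPACEID'].nunique(), toy['Location Identifier'].nunique())
print(toy['Location Identifier'].tolist())
```

##### - The join only works if every key column is already a string, which is why the columns are converted with `astype('str')` earlier in the notebook.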

In [39]:
df_c_full = df.copy()

In [40]:
df_c_full['Location Identifier'] = df_c_full[['BUILDINGID', 'FLOOR', 'SPACEID', 'RELATIVEPOSITION']].agg('-'.join, axis = 1)

In [41]:
df_c_full['Location Identifier'] = df_c_full['Location Identifier'].astype('category')

In [42]:
remove2 = ['LONGITUDE', 'LATITUDE', 'BUILDINGID', 'FLOOR', 'SPACEID', 'RELATIVEPOSITION', 'USERID', 'PHONEID', 'TIMESTAMP']

In [43]:
df_c_full.drop(remove2, axis = 1, inplace = True)

In [44]:
X = df_c_full.drop('Location Identifier', axis = 1)
y = df_c_full['Location Identifier']

In [45]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, random_state = 123)

In [46]:
rfc.fit(X_train, y_train)

RandomForestClassifier()

In [47]:
rfc_pred2 = rfc.predict(X_test)

In [48]:
accuracy_score(y_test, rfc_pred2)

0.8068204613841524

In [49]:
precision_score(y_test, rfc_pred2, average = 'weighted', zero_division = 1)

0.8663769739944652

In [50]:
f1_score(y_test, rfc_pred2, average = 'weighted')

0.8097545702441776

In [51]:
predictions_df2 = pd.DataFrame({'Actuals': y_test, 'Random Forest': rfc_pred2})

In [52]:
predictions_df2.sample(15)

Unnamed: 0,Actuals,Random Forest
11569,2-2-105-2,2-2-102-2
278,2-3-201-1,2-3-201-1
14710,2-1-102-2,2-1-102-2
711,2-3-243-1,2-3-243-1
6345,2-0-138-2,2-0-139-2
8904,0-0-117-2,0-0-117-2
18138,0-2-129-2,0-2-129-2
15673,1-0-217-1,1-0-217-1
12078,1-1-30-2,1-1-8-2
14802,2-1-115-2,2-1-115-2


### Apply model to holdout dataset

In [53]:
target_df.head()

Unnamed: 0,WAP001,WAP002,WAP003,WAP004,WAP005,WAP006,WAP007,WAP008,WAP009,WAP010,...,WAP520,LONGITUDE,LATITUDE,FLOOR,BUILDINGID,SPACEID,RELATIVEPOSITION,USERID,PHONEID,TIMESTAMP
0,100,100,100,100,100,100,100,100,100,100,...,100,-7515.916799,4864890.0,1,1,0,0,0,0,2013-10-04 07:45:03
1,100,100,100,100,100,100,100,100,100,100,...,100,-7383.867221,4864840.0,4,2,0,0,0,13,2013-10-07 14:10:54
2,100,100,100,100,100,100,100,100,100,100,...,100,-7374.30208,4864847.0,4,2,0,0,0,13,2013-10-07 14:11:35
3,100,100,100,100,100,100,100,100,100,100,...,100,-7365.824883,4864843.0,4,2,0,0,0,13,2013-10-07 14:12:18
4,100,100,100,100,100,100,100,100,100,100,...,100,-7641.499303,4864922.0,2,0,0,0,0,2,2013-10-04 09:09:34


In [54]:
target_df_copy = target_df.copy()

In [55]:
target_df_copy['Location Identifier'] = ''
target_df_copy.drop(remove2, axis = 1, inplace = True)

In [56]:
target_df_copy['Location Identifier'] = rfc.predict(target_df_copy.drop(['Location Identifier'], axis = 1))

In [57]:
target_df['Location Identifier'] = target_df_copy['Location Identifier']

In [58]:
splits = target_df['Location Identifier'].str.split('-', n = 4, expand = True)

In [59]:
new_columns = ['Building ID - Pred.', 'Floor - Pred.', 'Space ID - Pred.', 'Relative Position - Pred.']
for i, column in enumerate(new_columns):
    target_df[column] = splits[i]

In [60]:
comparison_table = target_df[['BUILDINGID', 'Building ID - Pred.', 'FLOOR', 'Floor - Pred.']].copy()

In [61]:
comparison_table['Building Check'] = comparison_table['BUILDINGID'] == comparison_table['Building ID - Pred.']
comparison_table['Floor Check'] = comparison_table['FLOOR'] == comparison_table['Floor - Pred.']

In [62]:
comparison_table.head()

Unnamed: 0,BUILDINGID,Building ID - Pred.,FLOOR,Floor - Pred.,Building Check,Floor Check
0,1,1,1,2,True,False
1,2,2,4,4,True,True
2,2,2,4,3,True,False
3,2,2,4,4,True,True
4,0,0,2,2,True,True


In [63]:
building_acc = sum(comparison_table['Building Check']) / len(comparison_table)

In [64]:
floor_acc = sum(comparison_table['Floor Check']) / len(comparison_table)

In [65]:
print('Building Accuracy: ' + str(building_acc))
print('Floor Accuracy: ' + str(floor_acc))

Building Accuracy: 0.9981998199819982
Floor Accuracy: 0.8586858685868587


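##### - The model correctly predicted the building ID almost 100% of the time (99.8%) and the floor number about 86% of the time. On the held-out 25% split of the training data, the full location identifier (which includes space ID) was predicted correctly ~80% of the time, so I expect roughly similar accuracy for space ID on this validation data. That expectation cannot be checked directly, because the validation data has no space ID labels.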
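##### - Where true labels are available (for example on a held-out split of the training data), the accuracy of each part of the key can be measured separately. The sketch below is one way to do that; `component_accuracy` is a hypothetical helper, and the four key pairs are illustrative values in the same format as the sampled predictions shown earlier.

```python
import pandas as pd

def component_accuracy(actual_keys, predicted_keys, names):
    """Split 'a-b-c-d' keys and return the match rate for each component."""
    actual = pd.Series(list(actual_keys)).str.split('-', expand=True)
    predicted = pd.Series(list(predicted_keys)).str.split('-', expand=True)
    actual.columns = names
    predicted.columns = names
    # Element-wise comparison, then the share of matches per column.
    return (actual == predicted).mean()

names = ['BUILDINGID', 'FLOOR', 'SPACEID', 'RELATIVEPOSITION']
actual_keys = ['2-2-105-2', '2-3-201-1', '1-1-30-2', '0-0-117-2']
predicted_keys = ['2-2-102-2', '2-3-201-1', '1-1-8-2', '0-0-117-2']

acc = component_accuracy(actual_keys, predicted_keys, names)
print(acc)
```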