# Predicting Cybersickness from Postural Data with Machine Learning
According to the postural instability theory, cybersickness occurs due to changes in the natural postural behavior of the human body.


<a id="0"></a> <br>
## Table of Contents
1. [The datasets](#1)
2. [Goal](#2)
3. [Cleaning the dataset](#3)
4. [Descriptive Analysis](#4)
5. [Data Preparation - Initial NaN processing](#6)
6. [Questions 1 and 2](#7)
    1. [Data preparation](#8)
        1. [Answering question #1](#9)
        2. [Answering question #2](#10)
7. [Question 3: What is the vibe of each neighborhood based on the neighborhood overview?](#11)
    1. [Data Preparation](#12)
    2. [Modeling](#13)
    3. [Evaluation - Answering the question](#14)
8. [Question 4: Can we predict the price of Boston Airbnbs?](#15)
    1. [Data Preparation](#16)
    2. [Modeling and Evaluation](#17)
    3. [Answering the question](#18)
9. [Conclusion](#19)

<a id="1"></a>
## The datasets

The data come from two different sources in different formats, so each source needs its own handling. I will first create a dataframe for each experiment and then combine them, which requires some data wrangling. The goal is a single dataframe that holds all the data we need in one consistent format.
There are three experiments in total. We will import their data next.
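
As a rough illustration of the combining step, the sketch below stacks two frames row-wise with `pd.concat`. It assumes both frames already share the same columns; the toy values and the `task` column are hypothetical and only serve to keep track of where each row came from.

```python
import pandas as pd

# Toy stand-ins for the two files; the real frames hold thousands of sample columns.
inspection = pd.DataFrame({
    "Experiment": [1, 1], "part_id": [11, 13], "sex": [1, 2],
    "Sickness": [1, 0], "5": [-0.607, 0.357], "6": [-0.629, 0.382],
})
search = pd.DataFrame({
    "Experiment": [1, 1], "part_id": [11, 13], "sex": [1, 2],
    "Sickness": [1, 0], "5": [-0.995, 0.621], "6": [-0.980, 0.643],
})

# Tag each frame with its origin (hypothetical "task" column), then stack row-wise.
combined = pd.concat(
    [inspection.assign(task="inspection"), search.assign(task="search")],
    ignore_index=True,
)
print(combined.shape)
```

Keeping an explicit origin column makes it possible to split the combined frame again later or to use the origin as a feature.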

<a id="2"></a>
## Goal

<a id="3"></a>
## Cleaning the dataset

In [44]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

In [51]:
#Importing datasets
inspection = pd.read_csv('data/inspection.csv')
search = pd.read_csv('data/search.csv')

#Deleting unnecessary columns
del inspection['Condition']
del search['Condition']

#Converting Experiment column to numeric
inspection['Experiment'] = inspection['Experiment'].apply(lambda x: 1 if x == 'exp_1' else (2 if x == 'exp_2' else 3))
search['Experiment'] = search['Experiment'].apply(lambda x: 1 if x == 'exp_1' else (2 if x == 'exp_2' else 3))

#Converting Sex column to numeric, 1 = female, 2 = male
inspection['sex'] = inspection['sex'].apply(lambda x: 1 if x == 'F' else 2)
search['sex'] = search['sex'].apply(lambda x: 1 if x == 'F' else 2)
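
The lambda-based encoding above sends every value that is not `'exp_1'` or `'exp_2'` to 3, and every value that is not `'F'` to 2, so a typo or a missing entry would be coded silently. A possible alternative, sketched here on a toy frame (the frame and the dictionary names are illustrative, not part of the original notebook), is `Series.map` with explicit dictionaries, which leaves unexpected values as NaN so they can be reviewed.

```python
import pandas as pd

# Toy frame using the raw string codes, plus one unexpected value per column.
df = pd.DataFrame({
    "Experiment": ["exp_1", "exp_2", "exp_3", "exp_4"],
    "sex": ["F", "M", None, "F"],
})

experiment_codes = {"exp_1": 1, "exp_2": 2, "exp_3": 3}
sex_codes = {"F": 1, "M": 2}  # same coding as above: 1 = female, 2 = male

# Series.map turns any value that is absent from the dictionary into NaN.
df["Experiment"] = df["Experiment"].map(experiment_codes)
df["sex"] = df["sex"].map(sex_codes)

# Rows that could not be encoded and need a manual look.
unmapped = df[df[["Experiment", "sex"]].isna().any(axis=1)]
print(unmapped)
```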

In [53]:
inspection.head()

Unnamed: 0,Experiment,part_id,sex,Sickness,5,6,7,8,9,10,...,5995,5996,5997,5998,5999,6000,6001,6002,6003,6004
0,1,11,1,1,-0.607,-0.629,-0.643,-0.666,-0.695,-0.711,...,-3.312,-3.329,-3.347,-3.364,-3.366,-3.37,-3.37,-3.364,-3.373,-3.376
1,1,11,2,0,-0.759,-0.744,-0.734,-0.758,-0.766,-0.759,...,-2.964,-2.977,-2.97,-2.97,-2.974,-2.996,-2.996,-3.028,-3.071,-3.117
2,1,13,1,1,0.357,0.382,0.391,0.375,0.383,0.383,...,-3.027,-3.014,-3.007,-3.021,-3.037,-3.037,-3.04,-3.053,-3.053,-3.043
3,1,13,2,1,0.357,0.382,0.391,0.375,0.383,0.383,...,-3.027,-3.014,-3.007,-3.021,-3.037,-3.037,-3.04,-3.053,-3.053,-3.043
4,1,15,1,0,-0.342,-0.332,-0.311,-0.291,-0.281,-0.271,...,-4.019,-4.019,-4.035,-4.009,-3.989,-3.979,-3.971,-3.981,-3.969,-3.989


In [41]:
search.head()

Unnamed: 0,Experiment,part_id,sex,Condition,Sickness,5,6,7,8,9,...,5995,5996,5997,5998,5999,6000,6001,6002,6003,6004
0,exp_1,11,F,Search_Task.xlsx,1,-0.995,-0.98,-0.987,-0.987,-0.978,...,-2.995,-3.004,-2.998,-2.99,-2.984,-2.981,-2.996,-2.981,-2.984,-2.987
1,exp_1,11,M,Search_Task.xlsx,0,0.283,0.274,0.257,0.257,0.25,...,-0.679,-0.664,-0.641,-0.64,-0.639,-0.64,-0.637,-0.645,-0.647,-0.617
2,exp_1,13,F,Search_Task.xlsx,1,0.621,0.643,0.664,0.675,0.675,...,-3.189,-3.181,-3.215,-3.224,-3.232,-3.245,-3.254,-3.254,-3.223,-3.206
3,exp_1,13,M,Search_Task.xlsx,1,-0.205,-0.205,-0.188,-0.18,-0.189,...,-4.081,-4.058,-4.058,-4.048,-4.048,-4.035,-4.015,-3.998,-3.988,-3.976
4,exp_1,15,F,Search_Task.xlsx,0,-0.382,-0.424,-0.435,-0.446,-0.467,...,-3.834,-3.821,-3.826,-3.785,-3.79,-3.795,-3.782,-3.757,-3.757,-3.76


In [64]:
# Deleting rows in which any column deviates from its column mean by more than 5 standard deviations

# Inspection dataframe
# Calculate z-scores for each column
z_scores = stats.zscore(inspection)

# Flag rows that contain at least one outlier
threshold = 5
outliers = (np.abs(z_scores) > threshold).any(axis=1)

# Keep the flagged rows in a separate dataframe so they can be reviewed
df_outliers = inspection[outliers]

# Delete the flagged rows
inspection = inspection.drop(df_outliers.index)


In [65]:
# Search dataframe (reuses the threshold defined in the previous cell)
# Calculate z-scores for each column
z_scores = stats.zscore(search)

# Flag rows that contain at least one outlier
outliers = (abs(z_scores) > threshold).any(axis=1)
df_outliers = search[outliers]

# Delete the flagged rows
search = search.drop(df_outliers.index)
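
The z-score filter above runs over every column, including identifiers and the `Sickness` label. With SciPy's default `nan_policy='propagate'`, a single NaN in a column also turns all of that column's z-scores into NaN, so nothing in that column can be flagged. The sketch below is one possible variant, not the notebook's method: it restricts the check to the sample columns and skips NaNs. The helper name `drop_outlier_rows`, the `meta_cols` argument, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

def drop_outlier_rows(df, meta_cols, threshold=5.0):
    """Drop rows whose sample columns contain any |z-score| above threshold."""
    sample_cols = [c for c in df.columns if c not in meta_cols]
    values = df[sample_cols].to_numpy(dtype=float)
    # nan_policy="omit" ignores NaNs when computing each column's mean and std.
    z = stats.zscore(values, axis=0, nan_policy="omit")
    # Comparisons with NaN evaluate to False, so missing samples never flag a row.
    is_outlier = (np.abs(z) > threshold).any(axis=1)
    return df.loc[~is_outlier]

# Toy data: 40 rows, 3 sample columns, one extreme value and one missing value.
values = np.sin(np.arange(120, dtype=float)).reshape(40, 3)
values[7, 0] = 50.0     # extreme sample that should be flagged
values[3, 1] = np.nan   # missing sample that should not be flagged
toy = pd.DataFrame(values, columns=["5", "6", "7"])
toy.insert(0, "Sickness", 0)
toy.insert(0, "part_id", np.arange(40))

cleaned = drop_outlier_rows(toy, meta_cols=["part_id", "Sickness"])
print(len(toy), len(cleaned))
```

Note that with a population standard deviation (`ddof=0`), a single extreme value among n rows can reach a z-score of at most sqrt(n - 1), so a threshold of 5 can only ever flag rows when a frame has at least 27 rows.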

<a id="4"></a>
## Descriptive Analysis

Unnamed: 0,Experiment,part_id,sex,Sickness,5,6,7,8,9,10,...,5995,5996,5997,5998,5999,6000,6001,6002,6003,6004
0,1,11,1,1,-0.607,-0.629,-0.643,-0.666,-0.695,-0.711,...,-3.312,-3.329,-3.347,-3.364,-3.366,-3.370,-3.370,-3.364,-3.373,-3.376
1,1,11,2,0,-0.759,-0.744,-0.734,-0.758,-0.766,-0.759,...,-2.964,-2.977,-2.970,-2.970,-2.974,-2.996,-2.996,-3.028,-3.071,-3.117
2,1,13,1,1,0.357,0.382,0.391,0.375,0.383,0.383,...,-3.027,-3.014,-3.007,-3.021,-3.037,-3.037,-3.040,-3.053,-3.053,-3.043
3,1,13,2,1,0.357,0.382,0.391,0.375,0.383,0.383,...,-3.027,-3.014,-3.007,-3.021,-3.037,-3.037,-3.040,-3.053,-3.053,-3.043
4,1,15,1,0,-0.342,-0.332,-0.311,-0.291,-0.281,-0.271,...,-4.019,-4.019,-4.035,-4.009,-3.989,-3.979,-3.971,-3.981,-3.969,-3.989
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
136,3,32,1,0,0.628,0.629,0.619,0.629,0.639,0.637,...,-4.658,-4.646,-4.658,-4.651,-4.666,-4.670,-4.651,-4.670,-4.687,-4.669
137,3,33,1,1,0.486,0.463,0.453,0.463,0.451,0.463,...,-4.321,-4.294,-4.278,-4.249,-4.225,-4.222,-4.176,-4.188,-4.186,-4.171
138,3,34,1,0,1.132,1.105,1.093,1.076,1.084,1.074,...,-4.469,-4.461,-4.443,-4.422,-4.421,-4.394,-4.380,-4.370,-4.357,-4.306
139,3,35,1,0,0.330,0.310,0.300,0.310,0.280,0.270,...,-2.477,-2.460,-2.457,-2.474,-2.424,-2.390,-2.372,-2.352,-2.361,-2.373
