In [13]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from scipy.stats import f_oneway

import warnings
warnings.filterwarnings('ignore')

### The ANOVA Test
An ANOVA (analysis of variance) test is a way to find out whether survey or experiment results are significant. In other words, it helps you decide whether to reject the null hypothesis in favor of the alternative hypothesis.

Basically, you’re testing groups to see if there’s a difference between them.
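
To make this concrete, here is a minimal sketch of how the F statistic behind a one-way ANOVA can be computed by hand and compared with `scipy.stats.f_oneway`. The three groups are made-up numbers and `f_statistic` is a hypothetical helper written only for this illustration.

```python
import numpy as np
from scipy.stats import f_oneway

# Made-up measurements for three groups (illustrative only).
groups = [
    np.array([4.1, 5.0, 4.6, 5.3, 4.8]),
    np.array([5.9, 6.4, 6.1, 5.7, 6.6]),
    np.array([4.9, 5.2, 5.5, 5.1, 4.7]),
]


def f_statistic(samples):
    """One-way ANOVA F statistic computed from sums of squares."""
    all_values = np.concatenate(samples)
    grand_mean = all_values.mean()
    k, n = len(samples), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
    # Variation of the observations around their own group mean.
    ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)
    # Ratio of the two mean squares (degrees of freedom: k - 1 and n - k).
    return (ss_between / (k - 1)) / (ss_within / (n - k))


f_manual = f_statistic(groups)
f_scipy, p_value = f_oneway(*groups)
print(f"F (manual) = {f_manual:.4f}, F (scipy) = {f_scipy:.4f}, p = {p_value:.4g}")
```

A large F means the group means differ by much more than the scatter within the groups would suggest, which is what pushes the p-value towards zero.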

### What Does “One-Way” or “Two-Way” Mean?
One-way or two-way refers to the number of independent variables (IVs) in your Analysis of Variance test.

- One-way has one independent variable (with two or more levels). For example: brand of cereal.
- Two-way has two independent variables (each can have multiple levels). For example: brand of cereal and calories.

### What are “Groups” or “Levels”?
Groups or levels are different groups within the same independent variable. In the above example, your levels for “brand of cereal” might be Lucky Charms, Raisin Bran, Cornflakes — a total of three levels. Your levels for “Calories” might be: sweetened, unsweetened — a total of two levels.

Let’s say you are studying if an alcoholic support group and individual counseling combined is the most effective treatment for lowering alcohol consumption. You might split the study participants into three groups or levels:

- Medication only
- Medication and counseling
- Counseling only

Your dependent variable would be the number of alcoholic beverages consumed per day.
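
As a rough sketch of how such a study could be analysed, we can split a long-format table by the levels of the independent variable and hand one sample per level to `f_oneway`. The data frame `study`, its column names and all numbers are invented for this example.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical study data: drinks per day for each participant (made-up numbers).
study = pd.DataFrame({
    "treatment": ["medication"] * 4 + ["medication+counseling"] * 4 + ["counseling"] * 4,
    "drinks_per_day": [5, 6, 4, 5, 2, 3, 2, 1, 4, 3, 5, 4],
})

# One sample per level of the independent variable.
samples = [g["drinks_per_day"].to_numpy() for _, g in study.groupby("treatment")]

f_stat, p_value = f_oneway(*samples)
print(f"levels: {study['treatment'].nunique()}, F = {f_stat:.3f}, p = {p_value:.4f}")
```

The same group-then-unpack pattern is what the helper function further down in this notebook uses for the recording sessions.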

If your groups or levels have a hierarchical structure (each level has unique subgroups), then use a nested ANOVA for the analysis.

### What Does “Replication” Mean?
It refers to whether you are replicating (i.e. duplicating) your test(s) with multiple groups. With a two-way ANOVA with replication, you have two groups, and individuals within each group are doing more than one thing (e.g. two groups of students from two colleges taking two tests). If you only have one group taking two tests, you would use a two-way ANOVA without replication.

### Types of Tests
There are two main types: one-way and two-way. Two-way tests can be with or without replication.

- One-way ANOVA between groups: used when you want to test two or more groups to see if there’s a difference between them.
- Two-way ANOVA without replication: used when you have one group and you’re double-testing that same group. For example, you’re testing one set of individuals before and after they take a medication to see if it works or not.
- Two-way ANOVA with replication: two groups, and the members of those groups are doing more than one thing. For example, two groups of patients from different hospitals trying two different therapies.
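
For the one-way case with exactly two groups, the ANOVA is equivalent to an independent two-sample t-test with pooled variance: the F statistic equals the squared t statistic and the p-values coincide. The following sketch checks this on synthetic data (the groups are random numbers, not part of our dataset).

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(0)
# Two synthetic groups (illustrative data only).
group_a = rng.normal(loc=0.0, scale=1.0, size=30)
group_b = rng.normal(loc=0.5, scale=1.0, size=30)

f_stat, p_anova = f_oneway(group_a, group_b)
# ttest_ind assumes equal variances by default, which matches the ANOVA setup.
t_stat, p_ttest = ttest_ind(group_a, group_b)

print(f"F = {f_stat:.4f}, t^2 = {t_stat ** 2:.4f}")
print(f"p (ANOVA) = {p_anova:.6f}, p (t-test) = {p_ttest:.6f}")
```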

In [14]:
X_train = pd.read_csv('data/X_train.csv')
X_test = pd.read_csv('data/X_test.csv')
y_train = pd.read_csv('data/y_train.csv')

### Some data exploration and ANOVA
Let's have a look at our features and columns, and then carry out the ANOVA across measurement groups (i.e. recording sessions).

In [15]:
X_train.head()

Unnamed: 0,row_id,series_id,measurement_number,orientation_X,orientation_Y,orientation_Z,orientation_W,angular_velocity_X,angular_velocity_Y,angular_velocity_Z,linear_acceleration_X,linear_acceleration_Y,linear_acceleration_Z
0,0_0,0,0,-0.75853,-0.63435,-0.10488,-0.10597,0.10765,0.017561,0.000767,-0.74857,2.103,-9.7532
1,0_1,0,1,-0.75853,-0.63434,-0.1049,-0.106,0.067851,0.029939,0.003385,0.33995,1.5064,-9.4128
2,0_2,0,2,-0.75853,-0.63435,-0.10492,-0.10597,0.007275,0.028934,-0.005978,-0.26429,1.5922,-8.7267
3,0_3,0,3,-0.75852,-0.63436,-0.10495,-0.10597,-0.013053,0.019448,-0.008974,0.42684,1.0993,-10.096
4,0_4,0,4,-0.75852,-0.63435,-0.10495,-0.10596,0.005135,0.007652,0.005245,-0.50969,1.4689,-10.441


In [16]:
# any null values?
X_train.isnull().sum()

row_id                   0
series_id                0
measurement_number       0
orientation_X            0
orientation_Y            0
orientation_Z            0
orientation_W            0
angular_velocity_X       0
angular_velocity_Y       0
angular_velocity_Z       0
linear_acceleration_X    0
linear_acceleration_Y    0
linear_acceleration_Z    0
dtype: int64

In [17]:
y_train.head()

Unnamed: 0,series_id,group_id,surface
0,0,13,fine_concrete
1,1,31,concrete
2,2,20,concrete
3,3,31,concrete
4,4,22,soft_tiles


In [18]:
import numpy as np

# what are our surface materials i.e. targets?
np.unique(y_train['surface'])

array(['carpet', 'concrete', 'fine_concrete', 'hard_tiles',
       'hard_tiles_large_space', 'soft_pvc', 'soft_tiles', 'tiled',
       'wood'], dtype=object)

Our dataset consists of series with a length of 128 measurements and the values from the 10 sensor channels. Every series of measurements has one target, which is a description of the surface material. Luckily, we don't have to deal with any missing values.
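
If you want to verify the series length yourself, a quick check is to count the rows per `series_id`. The sketch below runs on a tiny synthetic stand-in (`demo`, invented here) so that it works without the data files. On the real data you would group `X_train` instead.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for X_train: 3 series with 128 measurements each.
n_series, series_len = 3, 128
demo = pd.DataFrame({
    "series_id": np.repeat(np.arange(n_series), series_len),
    "measurement_number": np.tile(np.arange(series_len), n_series),
})

# Number of rows per series; on the real data, replace `demo` with X_train.
lengths = demo.groupby("series_id").size()
print(lengths.unique())
```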

In y_train we're furthermore given a group_id, which indicates the recording session in which this particular series was taken. Within each session, the robot was only driving on one surface. We will make use of that information and see if there is any variation in the feature means across different recording sessions (because of varying sensor calibrations, environment etc.).
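
The claim that each session covers only one surface can be checked by counting the distinct surfaces per `group_id`. This is a sketch on a small made-up frame (`demo_y`) that mimics the columns of y_train. The values are not taken from the real file.

```python
import pandas as pd

# Made-up stand-in for y_train with the same column names.
demo_y = pd.DataFrame({
    "series_id": [0, 1, 2, 3, 4, 5],
    "group_id": [13, 31, 20, 31, 22, 13],
    "surface": ["fine_concrete", "concrete", "concrete",
                "concrete", "soft_tiles", "fine_concrete"],
})

# Number of distinct surfaces seen in each recording session.
surfaces_per_group = demo_y.groupby("group_id")["surface"].nunique()
one_surface_per_session = bool((surfaces_per_group == 1).all())
print(one_surface_per_session)
```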

In [19]:
encoder = LabelEncoder()
surfaces = np.unique(y_train['surface'])
y_train['surface'] = encoder.fit_transform(y_train['surface'])

In [20]:
# do we have a strong variation in means across groups?
# let's find out with a one-way ANOVA across groups with the same surface
joined = X_train.set_index('series_id').join(
    y_train.set_index('series_id'))

In [21]:
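def anova_across_surface(surface, X):
    """Run a one-way ANOVA per column across the recording sessions of one surface.

    Returns a flat list [F, p, F, p, ...] in the column order of X.
    """
    records = X[X['surface'] == surface]
    group_nos = np.unique(records['group_id'])
    anovas = []
    for col in records.columns:
        # one sample per recording session (each session covers a single surface)
        samples = [records.loc[records['group_id'] == i, col].to_numpy()
                   for i in group_nos]
        if len(samples) < 2:
            # ANOVA needs at least two groups; record NaN for F and p
            anovas.extend([np.nan, np.nan])
            continue
        f_stat, p_value = f_oneway(*samples)
        anovas.extend([f_stat, p_value])
    return anovas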

In [22]:
# calculate all the anovas first
anovas = dict()
for i in range(0, 9):
    # for each surface level
    anovas[i] = anova_across_surface(i, joined)

In [23]:
# make nice tables
no_of_columns = 3
new_cols = ['row_id\t', 'measr_no', 'orien_X', 'orient_Y',
       'orient_Z', 'orient_W', 'velocity_X',
       'velocity_Y', 'velocity_Z', 'accel_X',
       'accel_Y', 'accel_Z', 'group_id',
       'surface']
joined.columns = new_cols
line_1 = '\n' + '\t\t|%s' * no_of_columns
line_2 = 'surface\t\t' + '|F\t      p-value\t' * no_of_columns
line_3 = '%12.12s' + '\t|%9.3e   %8.3f' * no_of_columns

In [24]:
for i in range(no_of_columns, len(joined.columns), no_of_columns):
    print(line_1 % tuple(joined.columns[i-no_of_columns:i]))
    print(line_2)
    print('=' * 22 * (no_of_columns + 1))
    for j, surface in enumerate(encoder.classes_):
        # F and p-value pairs for the features shown in this table
        row = anovas[j][2*(i-no_of_columns):2*i]
        print(line_3 % tuple([surface] + row))


		|row_id			|measr_no		|orien_X
surface		|F	      p-value	|F	      p-value	|F	      p-value	
      carpet	|9.233e+02      0.000	|0.000e+00      1.000	|2.623e+04      0.000
    concrete	|6.551e+02      0.000	|0.000e+00      1.000	|1.154e+05      0.000
fine_concret	|6.507e+02      0.000	|0.000e+00      1.000	|3.694e+04      0.000
  hard_tiles	|      nan        nan	|      nan        nan	|      nan        nan
hard_tiles_l	|8.358e+02      0.000	|0.000e+00      1.000	|1.055e+05      0.000
    soft_pvc	|8.777e+02      0.000	|0.000e+00      1.000	|8.220e+04      0.000
  soft_tiles	|1.390e+01      0.000	|0.000e+00      1.000	|1.477e+05      0.000
       tiled	|7.462e+02      0.000	|0.000e+00      1.000	|6.899e+04      0.000
        wood	|5.406e+02      0.000	|0.000e+00      1.000	|5.650e+03      0.000

		|orient_Y		|orient_Z		|orient_W
surface		|F	      p-value	|F	      p-value	|F	      p-value	
      carpet	|2.800e+04      0.000	|3.202e+04      0.000	|2.219e+04      0.000
    concrete	|2.135e

Let's see what we can get out of the above tables! For each combination of surface material and feature, we have the F-statistic and p-value of a one-way ANOVA that was calculated across samples taken during different recording sessions. In other words, for each surface material we asked whether the means of the features are the same across different recording sessions. A p-value close to zero corresponds to the answer 'no': we reject the hypothesis of equal means. A high p-value means we cannot reject it, which hints at similar means but does not prove that they are equal.

First, we notice that for the hard_tiles material, ANOVA didn't provide us with results. This is because we have only one sample (i.e. one recording session) for hard_tiles in our dataset, so there is nothing to compare it with. ANOVA is therefore not applicable in this case.
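
One way to spot such cases before running the tests is to count the recording sessions per surface. The frame below is a made-up miniature with the same column names as y_train, so the counts are illustrative only.

```python
import pandas as pd

# Made-up miniature of y_train (values are illustrative only).
demo_y = pd.DataFrame({
    "group_id": [1, 2, 3, 4, 5],
    "surface": ["carpet", "carpet", "concrete", "concrete", "hard_tiles"],
})

# Distinct recording sessions available for each surface.
sessions_per_surface = demo_y.groupby("surface")["group_id"].nunique()
# Surfaces with fewer than two sessions cannot be compared with an ANOVA.
too_few = sessions_per_surface[sessions_per_surface < 2].index.tolist()
print(too_few)
```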

Looking at the second column of the first table (measr_no), we can sanity-check our calculations. The p-value for equal means of the measurement numbers is 1. This makes sense, because within each series the measurement numbers are just the integers from 0 to 127, so the means are equal across all samples.
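
The same effect can be reproduced in isolation. In this simplified sketch each "session" consists of just one run of the numbers 0 to 127 (in the real data a session contains several series), so all group means are identical.

```python
import numpy as np
from scipy.stats import f_oneway

# Three "sessions" that all contain the measurement numbers 0..127.
samples = [np.arange(128, dtype=float) for _ in range(3)]

f_stat, p_value = f_oneway(*samples)
print(f"F = {f_stat:.3e}, p = {p_value:.3f}")
```

With no variation between the group means, the between-group sum of squares vanishes, which is why F drops to zero and the p-value goes to one.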

We can now have a look at the more interesting parts of the above tables. For many entries, we have very extreme values for F and thus p-values very close to 0 (so close that for many entries the non-zero decimals were cut off by the formatting). For these entries, we have to reject the hypothesis of equal means.
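
To illustrate the formatting effect: the tables print p-values with `%8.3f`, which rounds anything below 0.0005 to 0.000, while a scientific format like the one used for F keeps the magnitude visible. The p-value below is an arbitrary example, not one taken from the tables.

```python
p_value = 3.2e-7  # an illustrative small p-value, not taken from the tables

fixed = "%8.3f" % p_value       # the format used for p-values in the tables
scientific = "%9.3e" % p_value  # the format used for F in the tables

print(repr(fixed))
print(repr(scientific))
```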

But there are also a few tests that indicate similar means:

- Samples of velocity_X for carpet, concrete, hard_tiles_large and tiled materials (with p-values of 0.725, 0.953, 0.960 and 0.97 respectively).
- Samples of acceleration_Z for carpet, concrete, fine_concrete, hard_tiles_large, tiled and wood (with p-values of 0.519, 1, 0.951, 0.989, 0.999, 0.997 respectively).

Also interesting are some of the tests where the hypothesis of similar means has to be rejected even at very low significance levels. I leave it up to the reader to interpret the above results.