## Descriptive Analysis

A notebook to describe the data set with simple statistical tools.

In [None]:
import pandas as pd
import pathlib
import seaborn as sns
import matplotlib.pyplot as plt
import sys
import os
from scipy.stats import ttest_rel
import random


import ml_colon

### Setting up Data Directory

In [None]:
data_dir = ml_colon.HERE.parents[2] / "data" 
print(data_dir)

assert data_dir.exists()

data_files = list(data_dir.glob("*.csv"))
print([f.name for f in data_files])

assert data_files

### Loading Raw Data

In [None]:
_filepath = data_dir / "raw_data.csv"
df = pd.read_csv(_filepath)

# assert all rows have been loaded (the file has one header line)
with open(_filepath) as _fh:
    assert len(df) == sum(1 for _ in _fh) - 1

print(f"Raw data set has: {len(df)} rows")

In [None]:
descr_df = pd.read_csv(data_dir / "data_description.csv", index_col="column_name")

In [None]:
print(descr_df)

Let's take a quick look at the actual data.

In [None]:
df.head()

Let's take a quick look at the data types in the dataframe.

In [None]:
df.dtypes

Conclusion:
It looks like we are only dealing with numerical data (no characters, strings, datetimes, ...).

However, the int64 columns seem to be discrete and may need special care.
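To see how discrete the integer columns really are, one option is to select them by dtype and count their distinct values. The sketch below uses a small made-up frame (the values are not from the real data set) and is only meant to show the pattern.

```python
import pandas as pd

# Synthetic stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "quality": [22, 27, 32, 37],
        "bits": [0, 12, 40, 7],
        "mean": [101.5, 98.2, 110.0, 87.3],
    }
)

# Select every integer-typed column and count its distinct values.
int_cols = toy.select_dtypes(include="integer").columns
discrete_counts = toy[int_cols].nunique()
print(discrete_counts)
```

Columns with only a handful of distinct values are candidates for categorical treatment later on.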

### Missing Values?

Next, let's check whether the data set contains any nulls or NaNs and, if so, how many.

In [None]:
_null_df = df.isnull().sum()

print(_null_df[_null_df > 0])

The missing values for sub_mean_3 and cost_2 can perhaps be imputed or recovered (or the rows dropped, since at most 17 are affected).

For the target variable "relevant" this is not an option. It is probably best to drop these 2 rows, as we want to exclude them from training and testing the model anyway and they amount to only 2 rows in total.

In [None]:
df = df[~df.relevant.isnull()]

len(df)
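If we decide to impute the feature columns instead of dropping rows, a simple option is median imputation. The following is a minimal sketch on made-up data; the choice of the median is an assumption, not something the data description prescribes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "sub_mean_3": [1.0, np.nan, 3.0, 4.0, 5.0],
        "cost_2": [10.0, 20.0, np.nan, 40.0, 50.0],
        "relevant": [1.0, 0.0, 1.0, np.nan, 1.0],
    }
)

# Rows without a target cannot be used for supervised learning, so drop them.
toy = toy.dropna(subset=["relevant"])

# Assumption: median imputation is acceptable for the feature columns.
feature_cols = ["sub_mean_3", "cost_2"]
imputed = toy.copy()
imputed[feature_cols] = imputed[feature_cols].fillna(imputed[feature_cols].median())

print(imputed.isnull().sum())
```

For modelling, the medians should be computed on the training split only, so that no information leaks from the test rows.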

### Column Analysis

We want to go over each variable in the dataset and explore it with simple descriptive statistics.

A first overview can be seen here:

In [None]:
df.describe()

#### quality

In [None]:
column_name = "quality"
print(descr_df.loc[column_name, "description"])

sns.histplot(df[column_name].values)

In [None]:
df[column_name].value_counts()

This looks like a discrete uniform distribution, but maybe the data set was sampled that way...
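One way to back up the visual impression is a chi-square goodness-of-fit test against equal counts. This is only a sketch on synthetic draws; the four quality levels and the sample size are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chisquare

rng = np.random.default_rng(0)
# Synthetic stand-in: 1,000 draws from four hypothetical quality levels.
toy_quality = pd.Series(rng.integers(low=1, high=5, size=1000))

observed = toy_quality.value_counts().sort_index()
# With no expected frequencies given, chisquare tests against equal counts.
result = chisquare(observed)
print(observed.to_dict())
print(f"chi2 = {result.statistic:.2f}, p = {result.pvalue:.3f}")
```

A large p-value means the counts are compatible with a uniform distribution; it does not tell us whether that is a property of the videos or of the sampling.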


#### bits

In [None]:

column_name = "bits"
print(descr_df.loc[column_name, "description"])

In [None]:

fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")

sns.histplot(df[column_name].values, ax=ax)

The distribution is highly skewed, and it looks like most of the blocks are encoded using only a few bits.
This raises the question: are there blocks that are allegedly encoded with 0 bits in the video stream?

Note: In my opinion this should not be possible, as 0 bits would mean no information.

Let's identify the characteristics of the rows whose value of "bits" is 0.

In [None]:
df[df["bits"]==0].head()

In [None]:
df[df["bits"]==0].sum()

When the value of <i>bits</i> is 0, <i>intra_parts, inter_16x16_parts, inter_4x4_parts, inter_other_parts, non_zero_pixels, block_movement_h, block_movement_v, var_movement_h, var_movement_v</i> are zero as well. So, what do they represent?

In [None]:
pd.options.display.max_colwidth = 200
descr_df.loc[df[df["bits"]==0].sum()==0]
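A column sum of 0 only proves that all entries are 0 if the column cannot take negative values. Whether the movement columns are signed is an assumption here, but if they are, a stricter element-wise check is safer. A small sketch on made-up data:

```python
import pandas as pd

# Synthetic stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "bits": [0, 0, 5, 9],
        "intra_parts": [0, 0, 1, 0],
        "skip_parts": [4, 2, 0, 1],
        "block_movement_h": [0, 0, -3, 3],
        "block_movement_v": [-3, 3, 0, 0],  # sums to 0 without being all zero
    }
)

zero_bits = toy[toy["bits"] == 0]

# Sum-based check: also flags columns whose positive and negative values cancel.
sum_zero_columns = zero_bits.columns[zero_bits.sum() == 0].tolist()

# Element-wise check: a column qualifies only if every entry is exactly zero.
all_zero = (zero_bits == 0).all()
zero_columns = all_zero[all_zero].index.tolist()

print(sum_zero_columns)
print(zero_columns)
```

In this toy frame the sum-based check wrongly includes `block_movement_v`, while the element-wise check does not.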

In [None]:
print("Number of rows with 0 bits: ", len(df[df[column_name] ==0]))

In [None]:
pd.cut(df[column_name], bins=[0, 8, 16, 32, 64, 124], include_lowest=False).value_counts(sort=False)

There are some blocks that are encoded with a very small number of bits. 696 rows are encoded with at most 8 bits. It will be interesting to see how quality depends on the number of bits... For now we keep these rows, but we may have to deal with them separately.
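Note how `pd.cut` treats the left edge of the first bin: intervals are right-closed by default, so with `include_lowest=False` a value of exactly 0 falls outside every bin. The sketch below shows the difference on a few made-up values.

```python
import pandas as pd

# Made-up bit counts, including two zeros.
toy_bits = pd.Series([0, 0, 3, 8, 9, 16, 40, 124])
bins = [0, 8, 16, 32, 64, 124]

# Default: right-closed intervals (0, 8], (8, 16], ...; zeros fall outside every bin.
open_left = pd.cut(toy_bits, bins=bins, include_lowest=False)
# include_lowest=True makes the first interval cover 0 as well.
closed_left = pd.cut(toy_bits, bins=bins, include_lowest=True)

print(open_left.value_counts(sort=False))
print("Unbinned (NaN) rows:", open_left.isna().sum())
print(closed_left.value_counts(sort=False))
```

So the first bin counts rows with 1 to 8 bits unless `include_lowest=True` is set.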

#### intra_parts

In [None]:
column_name = "intra_parts"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")

sns.histplot(df[column_name].values, ax=ax)

In [None]:
df[column_name].value_counts(sort=False)

The vast majority of rows have 0 sub-blocks. I am not yet sure I understand what these sub-blocks are...

#### skip_parts

In [None]:
column_name = "skip_parts"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

In [None]:
df.columns

#### inter_16x16_parts

In [None]:
column_name = "inter_16x16_parts"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

This is difficult to interpret. The description is not clear. 

#### inter_4x4_parts

In [None]:
column_name = "inter_4x4_parts"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

The two plots above suggest that bigger sub-blocks have less information overlap than small sub-blocks.

#### inter_other_parts

In [None]:
column_name = "inter_other_parts"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### non_zero_pixels

In [None]:
column_name = "non_zero_pixels"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### frame_width

In [None]:
column_name = "frame_width"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### frame_height

In [None]:
column_name = "frame_height"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### movement_level

In [None]:
column_name = "movement_level"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### mean

In [None]:
column_name = "mean"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### sub_mean_1

In [None]:
column_name = "sub_mean_1"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### sub_mean_2

In [None]:
column_name = "sub_mean_2"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### sub_mean_3

In [None]:
column_name = "sub_mean_3"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### sub_mean_4

In [None]:
column_name = "sub_mean_4"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### var_sub_blocks

In [None]:
column_name = "var_sub_blocks"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### sobel_h

In [None]:
column_name = "sobel_h"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### sobel_v

In [None]:
column_name = "sobel_v"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### variance

In [None]:
column_name = "variance"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### block_movement_h

In [None]:
column_name = "block_movement_h"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### block_movement_v

In [None]:
column_name = "block_movement_v"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### var_movement_h

In [None]:
column_name = "var_movement_h"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### var_movement_v

In [None]:
column_name = "var_movement_v"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### cost_1

In [None]:
column_name = "cost_1"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### cost_2

In [None]:
column_name = "cost_2"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

### Target variable: relevant 

In [None]:
column_name = "relevant"
print(descr_df.loc[column_name, "description"])

In [None]:
print("Relevant == 1: ", len(df[df.relevant == 1]))
print("Relevant == 0: ", len(df[df.relevant == 0]))

print(f"Share of rows where relevant is 1: {len(df[df.relevant == 1]) / len(df):.1%}")

Conclusion for relevant:

The classes are strongly imbalanced: relevant blocks far outnumber irrelevant ones. For the machine learning step we may need to stratify the data so that the algorithm is not rewarded for predicting 1 all the time. If we do not take this into account, a trivial model that always predicts 1 already reaches 82% accuracy.

As the outcome variable is binary, we should look into classification methods, e.g. logistic regression, decision trees, or neural networks.
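Both points can be sketched with pandas alone: the majority-class baseline is simply the share of the most frequent class, and a stratified hold-out set can be drawn by sampling the same fraction from each class. The frame below is synthetic and only mimics the class balance mentioned above.

```python
import pandas as pd

# Synthetic stand-in with the class balance described above (82 ones, 18 zeros).
toy = pd.DataFrame({"feature": range(100), "relevant": [1] * 82 + [0] * 18})

# Accuracy of always predicting the most frequent class.
class_share = toy["relevant"].value_counts(normalize=True)
baseline_accuracy = class_share.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")

# Stratified hold-out: sample the same fraction from each class.
test = toy.groupby("relevant").sample(frac=0.5, random_state=0)
train = toy.drop(test.index)
print(test["relevant"].mean(), train["relevant"].mean())
```

Any model we train later should be judged against this baseline rather than against 50%.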

To decide which variables to treat as continuous and which as categorical, we explore the number of unique values of each of them.

In [None]:
df.nunique()

We split the variables into continuous and categorical ones.

In [None]:
cat = df.loc[:, df.nunique() < 29]
cont = df.loc[:, df.nunique() >= 29]
varlist = cont.columns.tolist()
varlist.append('relevant')
contRelv = df[varlist]

In [None]:
contRelv.groupby('relevant').mean()

In this table we can see the differences between the two relevant categories, which look significant.

In [None]:
from scipy.stats import ttest_ind


def equality_testing(df, variables, y, alpha=0.05):
    """Print, per variable, whether the means of the y == 0 and y == 1 groups differ."""
    for var in variables:
        # Drop NaNs so a single missing value does not turn the p-value into NaN
        group0 = df.loc[df[y] == 0, var].dropna()
        group1 = df.loc[df[y] == 1, var].dropna()
        print(var)
        # The groups are independent and of different sizes, so use Welch's
        # two-sample t-test rather than a paired test on resampled data
        pvalue = ttest_ind(group0, group1, equal_var=False).pvalue

        if pvalue >= alpha:
            print("No significant difference between the group means.")
        else:
            print("The groups are different.")

With this function we use Welch's t-test to check whether the group means are statistically different at the 5% level.

In [None]:
equality_testing(contRelv, contRelv.columns[:-1].tolist(), contRelv.columns[-1])
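For two independent groups of unequal size, a two-sample test such as Welch's t-test is a common choice. The sketch below runs it on synthetic data as a sanity check; the group sizes, means, and the 0.05 significance level are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Synthetic stand-in: two independent groups of unequal size.
group0 = rng.normal(loc=0.0, scale=1.0, size=180)
group1 = rng.normal(loc=1.0, scale=2.0, size=820)

# equal_var=False selects Welch's t-test, which does not assume equal variances
# and needs neither equal group sizes nor paired observations.
result = ttest_ind(group0, group1, equal_var=False)
alpha = 0.05  # conventional significance level, an assumption of this sketch
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}")
print("different means" if result.pvalue < alpha else "no significant difference")
```

Keep in mind that with many rows even tiny mean differences become significant, so the effect size matters as well.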

Now we look at how the categorical variables correlate with the target variable relevant.

In [None]:
plt.figure(figsize=(15, 10))
corr_mtx = cat.corr()
sns.heatmap(corr_mtx, annot=True)

Now we do the same with the continuous variables.

In [None]:
plt.figure(figsize=(20, 16))
corr_mtx = contRelv.corr()
sns.heatmap(corr_mtx, annot=True)

Next, we can study how to reduce the cross-correlation, for example by removing some variables.
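A possible first step is to list the variable pairs whose absolute correlation exceeds some cut-off and then drop one variable of each pair. The following sketch uses synthetic columns, and the threshold of 0.9 is a hypothetical choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
# Synthetic stand-in for the continuous variables.
toy = pd.DataFrame(
    {
        "a": base,
        "b": base + rng.normal(scale=0.05, size=500),  # nearly a copy of "a"
        "c": rng.normal(size=500),                      # independent of both
    }
)

threshold = 0.9  # hypothetical cut-off for "too correlated"
corr = toy.corr().abs()
# Keep only the upper triangle so each pair is listed once and the diagonal is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

Which variable of a pair to keep is a modelling decision; the one with the clearer description or the stronger relation to relevant is a reasonable default.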