## Descriptive Analysis

A notebook to describe the data set with simple statistical tools.

In [None]:
import numpy as np  # needed for the np.percentile calls further down
import pandas as pd
import pathlib
import seaborn as sns
import matplotlib.pyplot as plt
import sys
import os
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from scipy.stats import ttest_rel
from sklearn.feature_selection import SelectKBest, chi2
import random


import ml_colon
import ml_colon.data_preparation

### Checking Data Directory

In [None]:
print("Looking for data files under", ml_colon.DATA_DIR)

assert ml_colon.DATA_DIR.exists()

data_files = [f.name for f in ml_colon.DATA_DIR.glob("*.csv")]
print("Found files", data_files)


assert data_files
assert ml_colon.RAW_CSV_FILENAME in data_files, "Please provide the input dataset under data/raw_data.csv"

### Loading Raw Data

In [None]:
df = ml_colon.data_preparation.get_df_from_csv()

# assert all rows have been loaded (the file has one header line)
_filepath = ml_colon.DATA_DIR / ml_colon.RAW_CSV_FILENAME
with open(_filepath) as f:
    assert len(df) == sum(1 for _ in f) - 1

print(f"Raw data set has: {len(df)} rows")

In [None]:
descr_df = pd.read_csv(ml_colon.DATA_DIR / "data_description.csv" , index_col="column_name")

pd.options.display.max_colwidth = 200

# align text to the left
descr_df.style.set_properties(**{'text-align': 'left'}).set_table_styles([dict(selector='th', props=[('text-align', 'left')])])

Let's take a quick look at the actual data.

In [None]:
df.head()

Let's take a quick look at the data types in the dataframe.

In [None]:
df.dtypes

Conclusion:
It looks like we are only dealing with numerical data (no characters, strings, datetimes, ...).

However, the int64 columns seem to be discrete and may need special care.
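To get a feeling for how discrete the int64 columns really are, one option is to count the distinct values per integer column. This is only a sketch on a made-up toy frame (`toy_df` and its values are assumptions, only the column names come from the data set); on the real data the same two lines would be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy_df = pd.DataFrame({
    "quality": [22, 27, 32, 37, 22, 27],
    "intra_parts": [0, 0, 4, 0, 1, 0],
    "mean": [101.5, 98.2, 120.0, 87.3, 110.1, 95.4],
})

# Pick the int64 columns and count how many distinct values each one has.
int_columns = toy_df.select_dtypes(include="int64").columns
cardinality = toy_df[int_columns].nunique().sort_values()

print(cardinality)
```

Columns with only a handful of distinct values are candidates for being treated as categorical or ordinal rather than continuous.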

### Missing Values?

Next, let's check whether the data set contains any nulls or NaNs and, if so, how many.

In [None]:
_null_df = df.isnull().sum()

print(_null_df[_null_df > 0])

The missing values for sub_mean_3 and cost_2 can probably be imputed or recovered (or the rows dropped, since at most 17 are affected).

The missing values of the target variable "relevant" cannot. It is probably best to drop these 2 rows: we would have to exclude them from training and testing the model anyway, and it is only 2 rows in total.

In [None]:
df = df[~df.relevant.isnull()]

len(df)
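For the missing sub_mean_3 and cost_2 values, a simple median imputation is one possible way to go. The sketch below uses a made-up toy frame (`toy_df`, `feature_columns` and `imputed_df` are illustrative names, not part of `ml_colon`), and the median is just one assumption among several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two features and in the target; values are made up.
toy_df = pd.DataFrame({
    "sub_mean_3": [10.0, np.nan, 14.0, 12.0],
    "cost_2": [1.0, 2.0, np.nan, 3.0],
    "relevant": [1.0, 0.0, 1.0, np.nan],
})

# Rows without a target cannot be used for training or testing, so drop them.
toy_df = toy_df.dropna(subset=["relevant"])

# Fill the remaining feature gaps with the per-column median.
feature_columns = ["sub_mean_3", "cost_2"]
imputed_df = toy_df.copy()
imputed_df[feature_columns] = imputed_df[feature_columns].fillna(
    imputed_df[feature_columns].median()
)

print(imputed_df)
```

When this feeds into a model, the medians should be computed on the training split only, otherwise information from the test rows leaks into the features.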

## Column Analysis

We want to go over each variable in the dataset and explore it with simple descriptive statistics.

A first overview can be seen here:

In [None]:
df.describe()

#### Column: Quality

In [None]:
column_name = "quality"
print(descr_df.loc[column_name, "description"])

sns.histplot(df[column_name].values)

In [None]:
df[column_name].value_counts()

This looks like a discrete uniform distribution, but the data set may simply have been sampled that way.
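The "looks uniform" impression could be checked with a chi-square goodness-of-fit test against equal expected frequencies. The sketch below uses `scipy.stats.chisquare` on made-up quality levels and counts (`toy_quality` is an assumption, not the real column).

```python
import pandas as pd
from scipy.stats import chisquare

# Made-up quality levels with almost equal counts (26, 24, 25, 25).
toy_quality = pd.Series([22] * 26 + [27] * 24 + [32] * 25 + [37] * 25)

# Observed frequency of each level.
observed = toy_quality.value_counts().sort_index()

# Without f_exp, chisquare tests against equal expected frequencies.
result = chisquare(observed)

print("statistic:", result.statistic, "p-value:", result.pvalue)
```

A large p-value means the counts are compatible with a uniform distribution. It does not tell us whether that comes from the encoder settings or from the way the data set was sampled.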


#### Column: Bits

In [None]:

column_name = "bits"
print(descr_df.loc[column_name, "description"])

In [None]:

fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")

sns.histplot(df[column_name].values, ax=ax)

The distribution is highly skewed: it looks like most of the blocks are encoded using only a few bits.
This raises the question: are there blocks that are allegedly encoded with 0 bits in the video stream?

Note: In my opinion this should not be possible, as 0 bits would mean 0 information. We dive deeper into that question in the `data_exploration.ipynb` notebook.
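Both observations can be put into numbers before moving on: the skewness of the column and the number of rows with 0 bits. This is a sketch on a made-up series (`toy_bits` is an assumption); on the real data the same expressions would be applied to `df["bits"]`.

```python
import pandas as pd

# Made-up bit counts with a long right tail and a couple of zeros.
toy_bits = pd.Series([0, 0, 1, 2, 2, 3, 5, 8, 40, 250], name="bits")

# How many blocks claim to be encoded with 0 bits, and what share is that?
zero_bit_rows = int((toy_bits == 0).sum())
zero_bit_share = zero_bit_rows / len(toy_bits)

# Sample skewness; values well above 0 indicate a long right tail.
skewness = toy_bits.skew()

print(f"rows with 0 bits: {zero_bit_rows} ({zero_bit_share:.1%})")
print(f"skewness: {skewness:.2f}")
```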

#### Column: intra_parts

In [None]:
column_name = "intra_parts"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")

sns.histplot(df[column_name].values, ax=ax)

In [None]:
df[column_name].value_counts(sort=False)

The vast majority of rows have 0 sub-blocks. I am not yet sure I understand what these sub-blocks are.

#### Column: skip_parts

In [None]:
column_name = "skip_parts"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### Column: inter_16x16_parts

In [None]:
column_name = "inter_16x16_parts"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

This is difficult to interpret because the column description is not clear.

#### Column: inter_4x4_parts

In [None]:
column_name = "inter_4x4_parts"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

The two plots above suggest that bigger sub-blocks have less information overlap than small sub-blocks.

#### Column: inter_other_parts

In [None]:
column_name = "inter_other_parts"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### Column: non_zero_pixels

In [None]:
column_name = "non_zero_pixels"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### Column: frame_width

In [None]:
column_name = "frame_width"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

In [None]:
print(df.frame_width.value_counts())

#### Column: frame_height

In [None]:
column_name = "frame_height"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### Column: movement_level

In [None]:
column_name = "movement_level"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### Column: mean

In [None]:
column_name = "mean"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

There seem to be some outliers.

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name} within 1st and 99th percentile")
sns.histplot(df[
    (df[column_name] > np.percentile(df[column_name], 1)) 
  & (df[column_name] < np.percentile(df[column_name], 99))
][column_name])
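The percentile filter above is repeated for several columns, so it could be wrapped in a small helper. The sketch below is one possible version (`trim_to_percentiles` is a hypothetical name, not part of `ml_colon`). It uses `Series.quantile`, which skips NaN values, whereas `np.percentile` returns NaN as soon as the column contains a missing value.

```python
import numpy as np
import pandas as pd


def trim_to_percentiles(series, lower=1, upper=99):
    """Keep values strictly between the lower and upper percentile."""
    # Series.quantile ignores NaN, so columns with gaps work as well.
    low, high = series.quantile([lower / 100, upper / 100])
    # Comparisons with NaN are False, so missing values are dropped too.
    return series[(series > low) & (series < high)]


# Made-up values: one gap, one extreme low value, one extreme high value.
toy_values = pd.Series([np.nan, -500.0] + list(range(1, 99)) + [10_000.0])
trimmed = trim_to_percentiles(toy_values)

print(len(toy_values), "->", len(trimmed))
```

With such a helper, a histogram cell would reduce to `sns.histplot(trim_to_percentiles(df[column_name]), ax=ax)`.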

#### Column: sub_mean_1

In [None]:
column_name = "sub_mean_1"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### Column: sub_mean_2

In [None]:
column_name = "sub_mean_2"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### Column: sub_mean_3

In [None]:
column_name = "sub_mean_3"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### Column: sub_mean_4

In [None]:
column_name = "sub_mean_4"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### Column: var_sub_blocks

In [None]:
column_name = "var_sub_blocks"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

As with the column `mean`, there are most likely outliers.

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name} within 1st and 99th percentile")
sns.histplot(df[
    (df[column_name] > np.percentile(df[column_name], 1)) 
  & (df[column_name] < np.percentile(df[column_name], 99))
][column_name])

#### Column: sobel_h

In [None]:
column_name = "sobel_h"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### Column: sobel_v

In [None]:
column_name = "sobel_v"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### Column: variance

In [None]:
column_name = "variance"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### Column: block_movement_h

In [None]:
column_name = "block_movement_h"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### Column: block_movement_v

In [None]:
column_name = "block_movement_v"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### Column: var_movement_h

In [None]:
column_name = "var_movement_h"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name} within 2nd and 95th percentile")
sns.histplot(df[
    (df[column_name] > np.percentile(df[column_name], 2)) 
  & (df[column_name] < np.percentile(df[column_name], 95))
][column_name])

#### Column: var_movement_v

In [None]:
column_name = "var_movement_v"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name], ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name} within 2nd and 95th percentile")
sns.histplot(df[
    (df[column_name] > np.percentile(df[column_name], 2)) 
  & (df[column_name] < np.percentile(df[column_name], 95))
][column_name])

#### Column: cost_1

In [None]:
column_name = "cost_1"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

#### Column: cost_2

In [None]:
column_name = "cost_2"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name])

### Target variable: relevant 

In [None]:
column_name = "relevant"
print(descr_df.loc[column_name, "description"])

In [None]:
print("Relevant == 1: ", len(df[df.relevant == 1]))
print("Relevant == 0: ", len(df[df.relevant == 0]))

print("Fraction of rows where relevant is 1: ", len(df[df.relevant == 1]) / len(df))

Conclusion for relevant:

There is a large difference between the number of relevant and irrelevant blocks. For the machine learning part we may need to stratify the data so that the algorithm is not rewarded for predicting 1 all the time. If we do not take this into account, a trivial model that always predicts 1 already reaches 82% accuracy.

As the outcome variable is binary, we should look into classification methods, e.g. logistic regression, decision trees, or neural networks.
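To make the imbalance argument concrete, here is a sketch of a stratified split together with the "always predict 1" baseline. It uses scikit-learn's `train_test_split` and `DummyClassifier` on made-up labels with the same 82/18 ratio. The feature matrix `X` is only a placeholder, and balanced accuracy is one assumption about which additional metric to report.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Made-up labels with 82% ones, mimicking the class ratio found above.
y = np.array([1] * 82 + [0] * 18)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not real data

# stratify=y keeps the class ratio identical in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# Baseline that always predicts the most frequent class (here: 1).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
balanced = balanced_accuracy_score(y_test, y_pred)

print(f"accuracy: {accuracy:.2f}, balanced accuracy: {balanced:.2f}")
```

Any real model should be compared against this baseline. Plain accuracy hides the problem, while balanced accuracy drops to chance level for the trivial predictor.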