
## Data Exploration

This notebook investigates the relationships between the variables, with a focus on the target variable.

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import ml_colon.data_preparation

### Load raw Data

In [None]:
data_dir = ml_colon.HERE.parents[2] / "data" 
descr_df = pd.read_csv(data_dir / "data_description.csv", index_col="column_name")
df = ml_colon.data_preparation.get_df_from_csv()

## Exploring

#### Column: Bits

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
ax.set_title("Histogram of bits")
sns.histplot(df["bits"].values, ax=ax)

In [None]:
pd.cut(df["bits"], bins=[0, 8, 16, 32, 64, 124], include_lowest=False).value_counts(sort=False)

Some blocks are encoded with a very small number of bits: 696 rows are encoded with at most 8 bits.
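One detail of the binning is worth keeping in mind: with `include_lowest=False`, `pd.cut` builds right-closed intervals, so the first bin is `(0, 8]` and a value of exactly 0 is not assigned to any bin. The sketch below illustrates this on hypothetical toy values (not taken from the dataset); whether the count quoted above includes zero-bit rows depends on how it was obtained.

```python
import pandas as pd

# Hypothetical toy values, not taken from the dataset.
bits = pd.Series([0, 0, 3, 8, 9, 16, 40, 124], name="bits")
edges = [0, 8, 16, 32, 64, 124]

# Right-closed bins: the first interval is (0, 8], so 0 stays unbinned (NaN).
open_low = pd.cut(bits, bins=edges, include_lowest=False)
# include_lowest=True also closes the first interval on the left.
closed_low = pd.cut(bits, bins=edges, include_lowest=True)

unbinned_open = int(open_low.isna().sum())
unbinned_closed = int(closed_low.isna().sum())
print(f"unbinned with include_lowest=False: {unbinned_open}")
print(f"unbinned with include_lowest=True:  {unbinned_closed}")
```

If zero-bit rows should be counted in the first bin, `include_lowest=True` is one way to do it.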

Let's look at the characteristics of the rows where `bits` is 0.

In [None]:
df_bits_zero = df.loc[df.bits == 0]
print(f"There are {len(df_bits_zero)} rows with bits = 0")

In [None]:
_descr_df = df_bits_zero.describe()

other_cols_zero = list(_descr_df.columns[_descr_df.loc["mean"].eq(0)])

print(f"Other columns that are also 0: {other_cols_zero}")

The rows where `bits` is 0 also have 0 in `intra_parts`, `inter_16x16_parts`, `inter_4x4_parts`, `inter_other_parts`, `non_zero_pixels`, `block_movement_h`, `block_movement_v`, `var_movement_h` and `var_movement_v`.
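A mean of 0 only guarantees an all-zero column when the values cannot be negative. For columns that might take both signs (the movement columns could be such a case, which is an assumption here), a stricter check compares every value with 0. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror the notebook, the values do not.
toy = pd.DataFrame({
    "bits": [0, 0, 0],
    "intra_parts": [0, 0, 0],
    "block_movement_h": [-2, 0, 2],  # mean is 0, but not every value is 0
})

# Check used in the notebook: columns whose mean is 0.
mean_is_zero = list(toy.columns[toy.mean().eq(0)])
# Stricter check: columns in which every single value is 0.
all_values_zero = list(toy.columns[toy.eq(0).all()])

print("mean == 0:      ", mean_is_zero)
print("all values == 0:", all_values_zero)
```

Running the stricter check on `df_bits_zero` would confirm whether the two lists agree for the real data.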

#### Assumption: Redundant rows
When `bits = 0`, we suspect that the information was already encoded elsewhere, in other words that the block is already represented in the previous frame. To verify this assumption, we check whether the dataframe contains duplicated rows when the column `bits` is disregarded.

In [None]:
df.duplicated(subset=[c for c in df.columns if c != "bits"]).any()

The quick check shows that, when considering all columns except `bits`, there are no duplicated rows, so the data does not support our assumption. It also implies that the dataset contains no fully duplicated rows, which the next cell confirms.

In [None]:
df.duplicated().any()

Since the dataset contains several continuous variables, it is worth checking whether a combination of these columns uniquely identifies a row (just out of curiosity).

In [None]:
df.duplicated(subset=["cost_1", "cost_2", "movement_level"]).sum()

The combination of the values in the columns `cost_1`, `cost_2` and `movement_level` is already almost unique: only two rows share the exact same values for these three columns.

In [None]:
df.groupby(["cost_1", "cost_2", "movement_level"]).filter(lambda x: x["bits"].count() > 1)[["bits", "cost_1", "cost_2", "movement_level"]]
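Note that `duplicated(...).sum()` counts repeats only, so a result of 1 corresponds to two colliding rows. As an alternative to the `groupby(...).filter(...)` call, `duplicated(keep=False)` flags every member of a colliding group directly. A minimal sketch on a hypothetical toy frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "bits": [10, 12, 7, 9],
    "cost_1": [1.5, 1.5, 2.0, 3.0],
    "cost_2": [0.2, 0.2, 0.4, 0.4],
    "movement_level": [5.0, 5.0, 1.0, 1.0],
})
key_cols = ["cost_1", "cost_2", "movement_level"]

# Default keep="first": only the second and later occurrences are flagged.
n_repeats = int(toy.duplicated(subset=key_cols).sum())
# keep=False: every row that shares its key with another row is flagged.
colliding = toy[toy.duplicated(subset=key_cols, keep=False)]

print(f"repeats: {n_repeats}, colliding rows: {len(colliding)}")
print(colliding[["bits"] + key_cols])
```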

### Relationships with target variable
In this section we look for strong discriminators that indicate whether a block is relevant. First we produce boxplots for every continuous variable, grouped by `relevant`. These help to see whether the two groups differ in median or distribution for a given variable.

In [None]:
# Create an empty figure; the subplots are added one by one in the loop.
fig = plt.figure(figsize=(16,25))

for num, y in enumerate(df.columns[11:-1]):
    ax1 = fig.add_subplot(5,3,num+1)
    df.boxplot(y, by='relevant', ax=ax1)

plt.suptitle("Boxplots of different variables grouped by relevant", size=20)
plt.tight_layout(rect=[0, 0.03, 1, 0.98])


There seem to be many outliers across the variables and groups. Some variables look interesting because the medians of the two groups differ somewhat.
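The visual impression from the boxplots can be backed with numbers. The sketch below computes, per group, the median and the number of points outside the 1.5 × IQR whiskers (the default whisker rule in matplotlib boxplots). The toy data and the helper `iqr_outlier_count` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy data; cost_1 stands in for any continuous column.
toy = pd.DataFrame({
    "relevant": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "cost_1": [1.0, 2.0, 3.0, 4.0, 50.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})

def iqr_outlier_count(values: pd.Series) -> int:
    """Count values outside the 1.5 * IQR fences (hypothetical helper)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(outside.sum())

grouped = toy.groupby("relevant")["cost_1"]
medians = grouped.median()
outliers = grouped.apply(iqr_outlier_count)

print(pd.DataFrame({"median": medians, "n_outliers": outliers}))
```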

Now let's look at the relationship between the discrete variables and the target variable. For each discrete variable, we group by its values and compute the percentage of rows in each group with `relevant = 1`.

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(5,5))
col_num = [0, 3, 4, 9] # columns quality, skip_parts, inter_16x16_parts and frame_height

for ax, y in zip(axes.flatten(), df.columns[col_num]):
    # Select the target first so only 'relevant' is averaged per group.
    df.groupby(y)['relevant'].mean().plot(ax=ax, kind='bar')
    
plt.suptitle("")
plt.tight_layout(rect=[0, 0.03, 1, 0.98])
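Because `relevant` is a 0/1 column, its group mean equals the share of rows with `relevant = 1`. A share computed from very few rows is unreliable, so it can help to report the group size next to it. A minimal sketch on hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy values; 'quality' stands in for any discrete column.
toy = pd.DataFrame({
    "quality": [22, 22, 22, 27, 27, 32],
    "relevant": [1, 0, 1, 0, 0, 1],
})

# Mean of a 0/1 target = share of positives; size = number of rows per group.
summary = toy.groupby("quality")["relevant"].agg(
    share_relevant="mean",
    n_rows="size",
)
print(summary)
```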

Two discrete variables have a large number of unique values. We therefore treat them as continuous variables in this part of the analysis and use boxplots to compare their distributions across the two groups of `relevant`.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8,3))
col_name = ['bits', 'non_zero_pixels']

for ax, y in zip(axes.flatten(), col_name):
    df.boxplot(y, by='relevant', ax=ax)
plt.suptitle("")


The histogram of `intra_parts` shows four groups of values, each starting with a high peak followed by two values with smaller counts. We group these values together and plot, for each group, the percentage of rows where `relevant` equals 1.

In [None]:
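fig, ax = plt.subplots(figsize=(5,3))
proc_df = df[['intra_parts', 'relevant']].copy()
# Bucket intra_parts by tens, then remap buckets 3, 4 and 6 to 2, 3 and 4.
proc_df['group'] = proc_df['intra_parts'] // 10
proc_df = proc_df.replace({"group": {3: 2, 4: 3, 6: 4}})
# Share of rows with relevant = 1 per group, drawn on the axes created above.
proc_df.groupby('group')['relevant'].mean().plot(kind='bar', ax=ax)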
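The grouping step relies on two details: floor division by 10 puts values into buckets of ten, and `replace` with a dictionary applies all mappings at once rather than chaining them (a 4 becomes 3 and is not then turned into 2). A minimal sketch with hypothetical `intra_parts` values, chosen only to exercise each mapping:

```python
import pandas as pd

# Hypothetical intra_parts values, chosen only to exercise each mapping.
toy = pd.DataFrame({"intra_parts": [0, 1, 2, 30, 31, 40, 42, 60, 61]})

# Floor division by 10 yields the raw buckets 0, 3, 4 and 6.
toy["raw_bucket"] = toy["intra_parts"] // 10
# The mappings are applied simultaneously, so 4 -> 3 does not become 2.
toy["group"] = toy["raw_bucket"].replace({3: 2, 4: 3, 6: 4})

raw_buckets = toy["raw_bucket"].tolist()
groups = toy["group"].tolist()
print(toy)
```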