
## Data Exploration

This notebook investigates the relationships between the variables, with a focus on the target variable.

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import ml_colon.data_preparation

### Load raw Data

In [None]:
data_dir = ml_colon.HERE.parents[2] / "data" 
descr_df = pd.read_csv(data_dir / "data_description.csv", index_col="column_name")
df = ml_colon.data_preparation.get_df_from_csv()

### Exploring some variables

#### Column: Bits

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
ax.set_title("Histogram of bits")
sns.histplot(df["bits"].values, ax=ax)

Let's examine the rows where `bits` is 0.

In [None]:
df[df["bits"]==0].count()

In [None]:
# Number of non-zero entries in each row where bits is 0
np.count_nonzero(df[df["bits"]==0], axis=1)

In the rows where `bits` is 0, the following columns are also 0: `intra_parts`, `inter_16x16_parts`, `inter_4x4_parts`, `inter_other_parts`, `block_movement_h`, `block_movement_v`, `var_movement_h`, and `var_movement_v`.
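The observation above can also be checked programmatically. The following is a minimal sketch on made-up toy data, not the real dataset; the column `frame_width` and all values are hypothetical and serve only to show a column that is not zero.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration;
# "frame_width" is a hypothetical column that is never zero).
toy = pd.DataFrame({
    "bits": [0, 0, 12, 40],
    "intra_parts": [0, 0, 3, 0],
    "block_movement_h": [0, 0, -2, 5],
    "frame_width": [640, 640, 640, 640],
})

zero_bits = toy[toy["bits"] == 0]

# A column qualifies only if every value in the zero-bits rows is exactly 0.
# Unlike a column sum, this cannot be fooled by positive and negative values cancelling out.
all_zero_cols = zero_bits.columns[(zero_bits == 0).all()].tolist()
print(all_zero_cols)
```

On the real data, the same expression applied to `df[df["bits"]==0]` should list the columns named above, assuming they are numeric.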

A `bits` value of 0 suggests that nothing had to be encoded for that row relative to the last frame; in other words, its content is already represented in the previous frame. To verify this assumption, we check whether rows with identical values (apart from `bits`) exist elsewhere in the data.

In [None]:
# Rows with non-zero bits, without the "bits" column itself
df_wo_bits = df[df["bits"]!=0].drop(["bits"], axis=1)
df_wo_bits

In [None]:
df_zero_bits = df[df["bits"]==0]
df_zero_bits_wo_bits = df_zero_bits.drop(["bits"], axis=1)
df_zero_bits_wo_bits

If all rows of `df_zero_bits_wo_bits` (147 cases) also appear in `df_wo_bits` (15792 cases), the rows with 0 bits are all redundant.

In [None]:
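# DataFrame.isin(other_df) aligns on index/column labels; a left merge tests whole rows instead
df_zero_bits_wo_bits.merge(df_wo_bits.drop_duplicates(), how="left", indicator=True)["_merge"] == "both"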
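One caveat for this check: `DataFrame.isin` called with another DataFrame compares values at matching index and column labels, rather than testing whether a whole row occurs anywhere in the other frame. The sketch below illustrates the difference on toy data; the frames `zero_rows` and `other_rows` and the helper `rows_contained_in` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Toy data (hypothetical): two rows to look up and a reference table.
zero_rows = pd.DataFrame({"a": [0, 0], "b": [1, 2]}, index=[10, 11])
other_rows = pd.DataFrame({"a": [0, 5, 0], "b": [1, 1, 9]}, index=[0, 1, 2])

# isin with a DataFrame compares cell by cell at matching labels;
# the indexes are disjoint here, so every cell is False.
labelwise = zero_rows.isin(other_rows)


def rows_contained_in(candidates, reference):
    """Return one boolean per candidate row: does an identical row exist in reference?"""
    # Deduplicate the reference so the left merge keeps one output row per candidate.
    merged = candidates.merge(reference.drop_duplicates(), how="left", indicator=True)
    return (merged["_merge"] == "both").to_numpy()


rowwise = rows_contained_in(zero_rows, other_rows)
print(labelwise)
print(rowwise)
```

Here the first toy row `(0, 1)` occurs in `other_rows` and the second `(0, 2)` does not, which only the merge-based check detects. The comparison is exact, so floating-point columns must match bit for bit.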

In [None]:
for each in df[df["bits"]==0].values:
    print(each)

In [None]:
pd.options.display.max_colwidth = 200
descr_df.loc[df[df["bits"]==0].sum()==0]

In [None]:
print("Number of rows with 0 bits:", len(df[df["bits"] == 0]))

In [None]:
pd.cut(df["bits"], bins=[0, 8, 16, 32, 64, 124], include_lowest=False).value_counts(sort=False)

Some blocks are encoded with a very small number of bits: 696 rows are encoded with at most 8 bits. It will be interesting to see how quality depends on the number of bits. For now we keep these rows, but we may have to handle them separately.
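When reading these counts, keep in mind how `pd.cut` treats the bin edges: with `include_lowest=False` and the default right-closed intervals, the first bin is `(0, 8]`, so rows with `bits == 0` fall into no bin at all. A small sketch with made-up values shows the behaviour:

```python
import pandas as pd

# Made-up bit counts, including two zeros.
toy_bits = pd.Series([0, 0, 3, 8, 9, 16, 40, 100])

binned = pd.cut(toy_bits, bins=[0, 8, 16, 32, 64, 124], include_lowest=False)

# Values equal to the lowest edge (0) are outside (0, 8] and become NaN.
n_unbinned = int(binned.isna().sum())

# value_counts ignores NaN by default, so the zeros are not counted in any bin.
counts = binned.value_counts(sort=False)
print(counts)
print("rows outside all bins:", n_unbinned)
```

The two zeros are missing from the per-bin counts, so the bin totals add up to fewer rows than the series contains.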

### Relationships with target variable
In this section we look for strong discriminators that indicate whether a block is relevant. First we produce boxplots of every continuous variable, grouped by `relevant`. These can show whether the two groups differ in median or distribution for a given variable.
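As a numerical complement to the boxplots, the group medians can be compared directly. This is a minimal sketch on toy data; it assumes `relevant` is coded as 0/1, which the notebook does not state explicitly, and the values are made up.

```python
import pandas as pd

# Toy data (made up); "relevant" is assumed to be a 0/1 target.
toy = pd.DataFrame({
    "relevant": [0, 0, 0, 1, 1, 1],
    "bits": [4, 6, 8, 30, 40, 90],
    "intra_parts": [1, 2, 3, 1, 2, 3],
})
feature_cols = ["bits", "intra_parts"]

# Median of each feature within each target group.
group_medians = toy.groupby("relevant")[feature_cols].median()

# Absolute difference between the two group medians, largest first.
median_gap = (group_medians.loc[1] - group_medians.loc[0]).abs().sort_values(ascending=False)
print(group_medians)
print(median_gap)
```

Because the variables are on different scales, the raw gaps are not directly comparable across columns; dividing by a spread measure such as the interquartile range would be one way to put them on a common footing.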

In [None]:
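# Create an empty figure; plt.subplots would add an unused axes behind the grid
fig = plt.figure(figsize=(16,25))

# One boxplot per continuous variable (up to 15 on a 5x3 grid), split by the target
for num, y in enumerate(df.columns[11:-1]):
    ax1 = fig.add_subplot(5,3,num+1)
    df.boxplot(y, by='relevant', ax=ax1)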
    
plt.suptitle("Boxplots of different variables grouped by relevant", size=20)
plt.tight_layout(rect=[0, 0.03, 1, 0.98])


There seem to be many outliers across the different variables and groups. Some variables look interesting because the medians of the two groups differ somewhat.

TODO: What else does this plot tell us?

TODO: 
* Correlation Analysis
* Distribution plots of dependent variables with respect to target variable
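For the correlation analysis listed above, one possible starting point is the Pearson correlation of each numeric feature with the target, which for a 0/1 target is the point-biserial correlation. The sketch below uses toy data with made-up values and assumes `relevant` is coded as 0/1; it is an illustration, not the analysis itself.

```python
import pandas as pd

# Toy data (made up); "relevant" is assumed to be a 0/1 target.
toy = pd.DataFrame({
    "relevant": [0, 0, 1, 1],
    "bits": [1.0, 2.0, 3.0, 4.0],
    "intra_parts": [4.0, 3.0, 2.0, 1.0],
    "var_movement_h": [1.0, 2.0, 2.0, 1.0],
})

features = toy.drop(columns="relevant")

# Pearson correlation of every feature column with the target, sorted ascending.
target_corr = features.corrwith(toy["relevant"]).sort_values()
print(target_corr)
```

Pearson correlation only captures linear association and is sensitive to the outliers seen in the boxplots, so a rank-based measure such as `method="spearman"` may be worth comparing.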
