
## Data Exploration

A notebook to investigate the relationships between the different variables, especially their relationship with the target variable `relevant`.

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import ml_colon.data_preparation

### Load raw Data

In [None]:
data_dir = ml_colon.HERE.parents[2] / "data" 
descr_df = pd.read_csv(data_dir / "data_description.csv", index_col="column_name")
df = ml_colon.data_preparation.get_df_from_csv()

## Exploring

#### Column: Bits

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
ax.set_title("Histogram of bits")
sns.histplot(df["bits"].values, ax=ax)

In [None]:
pd.cut(df["bits"], bins=[0, 8, 16, 32, 64, 124], include_lowest=False).value_counts(sort=False)

There are some blocks that are encoded with a very small number of bits. 696 rows are encoded with at most 8 bits.
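One detail of the binning call is worth keeping in mind: with `include_lowest=False` the first interval is `(0, 8]`, so rows with `bits` equal to 0 fall outside every bin and are not counted. The sketch below illustrates this on made-up values (the toy series is an assumption, not the real data).

```python
import pandas as pd

# Toy values, not taken from the dataset
bits = pd.Series([0, 0, 3, 8, 9, 16, 40, 100])
edges = [0, 8, 16, 32, 64, 124]

open_left = pd.cut(bits, bins=edges, include_lowest=False)
closed_left = pd.cut(bits, bins=edges, include_lowest=True)

# Zeros are outside (0, 8] and become NaN when the lowest edge is excluded
n_unbinned_open = int(open_left.isna().sum())
n_unbinned_closed = int(closed_left.isna().sum())

# Size of the first bin in both variants
first_bin_open = int(open_left.value_counts(sort=False).iloc[0])
first_bin_closed = int(closed_left.value_counts(sort=False).iloc[0])

print(n_unbinned_open, n_unbinned_closed, first_bin_open, first_bin_closed)
```

If the zero rows should be part of the first bin, `include_lowest=True` is one way to include them.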

Let's identify the characteristics of the rows whose value of "bits" is 0.

In [None]:
df_bits_zero = df.loc[df.bits == 0]
print(f"There are {len(df_bits_zero)} rows with bits = 0")

In [None]:
_descr_df = df_bits_zero.describe()

other_cols_zero = list(_descr_df.columns[_descr_df.loc["mean"].eq(0)])

print(f"Other columns that are also 0: {other_cols_zero}")

The rows where `bits` is 0 also have 0 in `intra_parts`, `inter_16x16_parts`, `inter_4x4_parts`, `inter_other_parts`, `non_zero_pixels`, `block_movement_h`, `block_movement_v`, `var_movement_h` and `var_movement_v`.

#### Assumption: Redundant rows
When `bits = 0`, we suspect that the information was already encoded in another frame. In other words, the block is already represented in the previous frame. To verify this assumption, we check whether the dataframe contains duplicated rows when the column `bits` is disregarded.

In [None]:
df.duplicated(subset=[c for c in df.columns if c != "bits"]).any()

The quick check shows that, when considering all columns except `bits`, we do not find any duplicated rows, so our assumption was wrong. It also follows that there are no fully duplicated rows in the dataset, which the next cell confirms.

In [None]:
df.duplicated().any()
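As a minimal sketch of how the `subset` argument changes the result, the toy frame below (made-up values, with column names borrowed from the dataset) contains two rows that differ only in `bits`. They count as duplicates when `bits` is ignored, but not when all columns are compared.

```python
import pandas as pd

# Toy frame; values are invented for illustration
toy = pd.DataFrame({
    "bits": [0, 12, 7],
    "cost_1": [1.5, 1.5, 3.0],
    "cost_2": [2.0, 2.0, 4.0],
})

other_cols = [c for c in toy.columns if c != "bits"]

# Row 1 repeats row 0 once `bits` is left out of the comparison
dup_ignoring_bits = toy.duplicated(subset=other_cols)
# With every column compared, no row is a repeat
dup_all_cols = toy.duplicated()

print(dup_ignoring_bits.tolist(), dup_all_cols.tolist())
```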

Since there are multiple continuous variables in the dataset, it is worth checking whether a combination of these columns uniquely identifies a row (just out of curiosity).

In [None]:
df.duplicated(subset=["cost_1", "cost_2", "movement_level"]).sum()

We find that the combination of the values in the columns `cost_1`, `cost_2` and `movement_level` is already almost unique. Only 2 rows have the exact same values for these 3 columns.

In [None]:
df.groupby(["cost_1", "cost_2", "movement_level"]).filter(lambda x: x["bits"].count() > 1)[["bits", "cost_1", "cost_2", "movement_level"]]
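An alternative way to list all rows that share a key combination is `duplicated` with `keep=False`, which marks every member of a duplicated group. The sketch below compares both approaches on a toy frame (made-up values). One difference to be aware of: `groupby` drops rows with missing keys by default, while `duplicated` does not.

```python
import pandas as pd

# Toy frame; values are invented for illustration
toy = pd.DataFrame({
    "bits": [5, 9, 7, 3],
    "cost_1": [1.0, 1.0, 2.0, 3.0],
    "cost_2": [4.0, 4.0, 5.0, 6.0],
    "movement_level": [0.1, 0.1, 0.2, 0.3],
})
keys = ["cost_1", "cost_2", "movement_level"]

# Keep only groups with more than one row
via_filter = toy.groupby(keys).filter(lambda g: len(g) > 1)
# Mark every row whose key combination occurs more than once
via_mask = toy[toy.duplicated(subset=keys, keep=False)]

print(via_mask)
```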

### Relationships with target variable
In this section we look for strong discriminators that indicate whether a block is relevant. First we produce boxplots for every continuous variable, grouped by the variable `relevant`. This helps to see whether one group has a different median or distribution than the other.

In [None]:
# Create an empty figure; the grid of subplots is added one axis at a time below
fig = plt.figure(figsize=(16, 25))

for num, y in enumerate(df.columns[11:-1]):
    ax1 = fig.add_subplot(5, 3, num + 1)
    df.boxplot(y, by='relevant', ax=ax1)

plt.suptitle("Boxplots of different variables grouped by relevant", size=20)
plt.tight_layout(rect=[0, 0.03, 1, 0.98])


There appear to be many outliers across the variables and groups. Some variables look interesting because the medians of the two groups differ somewhat.

Now let's take a look at the relationship of the discrete variables with the target variable. For this part, we group by each discrete value and see what percentage of that group has relevant = 1.

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(5,5))
col_num = [0, 3, 4, 9] # columns quality, skip_parts, inter_16x16_parts and frame_height

for ax, y in zip(axes.flatten(), df.columns[col_num]):
    # Select the target first so the mean is only computed for `relevant`
    df.groupby(y)['relevant'].mean().plot(ax=ax, kind='bar')
    
plt.suptitle("")
plt.tight_layout(rect=[0, 0.03, 1, 0.98])
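The bar heights in this plot rely on a simple property: for a 0/1 column, the group mean equals the fraction of rows with value 1. A minimal sketch on made-up data (the values are assumptions, only the column names mirror the notebook):

```python
import pandas as pd

# Toy data; values are invented for illustration
toy = pd.DataFrame({
    "quality": [22, 22, 22, 37, 37],
    "relevant": [1, 0, 1, 0, 0],
})

# Mean of a 0/1 column per group = share of rows with relevant == 1
share_relevant = toy.groupby("quality")["relevant"].mean()
print(share_relevant)
```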

Two discrete variables have many unique values, so we treat them as continuous variables in this part of the analysis. We use boxplots to compare their distributions between the two groups of `relevant`.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8,3))
col_name = ['bits', 'non_zero_pixels']

for ax, y in zip(axes.flatten(), col_name):
    df.boxplot(y, by='relevant', ax=ax)
plt.suptitle("")


The histogram of `intra_parts` shows 4 different groups, each starting with a higher peak followed by two values with smaller counts. We group these values together and show, in the following plot, how the percentage of rows where relevant is equal to 1 differs between the groups.

In [None]:
fig, ax = plt.subplots(figsize=(5,3))
proc_df = df[['intra_parts', 'relevant']].copy()
proc_df['group'] = proc_df['intra_parts'] // 10 
proc_df = proc_df.replace({"group": {3: 2, 4:3, 6:4}})
proc_df.groupby('group')['relevant'].mean().plot(kind='bar', ax=ax)

## Correlation Analysis
In this section, the correlations between all variables are investigated. 

First, to decide which variables to treat as continuous and which as categorical, we look at the number of unique values of each of them.

In [None]:
df.nunique()

We split the columns into continuous and categorical variables, treating every column with fewer than 29 unique values as categorical.

In [None]:
cat = df.loc[:, df.nunique() < 29]
cont = df.loc[:, df.nunique() >= 29]
varlist = cont.columns.tolist()
varlist.append('relevant')
contRelv = df[varlist]

In [None]:
contRelv.groupby('relevant').mean()

This table shows the difference in group means between the two `relevant` categories, which looks substantial.

We save a copy of the main dataframe for further analysis.

In [None]:
# df.copy() keeps dftest independent of the columns added to df in place later on
dftest = df.copy()

In [None]:
from scipy.stats import ttest_ind


def equal_test(df, variables, y, alpha=0.05):
    """Print, per variable, whether the two groups of `y` differ in mean."""
    for var in variables:
        # Drop missing values so they do not turn the p-value into NaN
        group0 = df.loc[df[y] == 0, var].dropna()
        group1 = df.loc[df[y] == 1, var].dropna()
        print(var)
        # The groups are independent and differ in size, so use Welch's
        # two-sample t-test rather than a paired test on resampled data
        pvalue = ttest_ind(group0, group1, equal_var=False).pvalue

        if pvalue >= alpha:
            print("No significant difference between the group means.")
        else:
            print("The groups are different.")

This function uses a t-test to check, for each variable, whether the means of the two groups differ significantly.

In [None]:
equal_test(contRelv, contRelv.columns[:-1].tolist(), contRelv.columns[-1])
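To show how such a two-sample comparison behaves, here is a minimal sketch on synthetic data. It assumes SciPy's `ttest_ind` with `equal_var=False` (Welch's t-test) and a 0.05 significance level; the group sizes, means and seed are made up for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Two independent groups of different size with clearly different means
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=5.0, scale=1.0, size=150)

# Welch's t-test does not require equal group sizes or equal variances
p_different = ttest_ind(group_a, group_b, equal_var=False).pvalue

# Comparing a sample with an identical copy gives a t statistic of 0
p_same = ttest_ind(group_a, group_a.copy(), equal_var=False).pvalue

print(p_different < 0.05, p_same)
```

A small p-value indicates that the difference in means is unlikely under the hypothesis of equal means. A large p-value only means that no difference was detected, not that the means are proven equal.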

Now we can see how the categorical variables correlate with the target variable `relevant`.

In [None]:
plt.figure(figsize = (15, 10))
corr_mtx = cat.corr()
sns.heatmap(corr_mtx, annot = True, cmap = "YlOrRd")

Now we do the same with the continuous variables

In [None]:
plt.figure(figsize = (20, 16))
corr_mtx = contRelv.corr()
sns.heatmap(corr_mtx, annot = True, cmap = "YlOrRd")
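Reading strongly correlated pairs off a large heatmap can be tedious. As a sketch, the pairs above a chosen threshold can also be listed directly from the correlation matrix. The toy columns `a`, `b`, `c` and the 0.5 threshold below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.1, size=500),  # nearly collinear with a
    "c": rng.normal(size=500),                         # independent noise
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
high_pairs = pairs[pairs > 0.5].sort_values(ascending=False)

print(high_pairs)
```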

We reduce the number of variables in order to make the correlation analysis easier to interpret.
Some of the variables are highly correlated, and their descriptions help to understand how these reductions can be done. After creating new variables we delete the old ones.

In the first case we generate a new variable combining frame_height, frame_width and non_zero_pixels, which expresses the non-zero pixels as a fraction of the frame area.

In [None]:
df["pixel_frame"] = df['non_zero_pixels'] / (df['frame_height'] * df['frame_width'])

df = df.drop(['frame_height', 'frame_width', 'non_zero_pixels'], axis = 1)

The four sub_mean variables are averaged into a single variable in order to remove the correlation between them.

In [None]:
df['sub_mean'] = (df['sub_mean_1'] + df['sub_mean_2'] + df['sub_mean_3'] + df['sub_mean_4']) / 4

df = df.drop(['sub_mean_1', 'sub_mean_2', 'sub_mean_3', 'sub_mean_4'], axis = 1)

The sobel variables are the mean of the pixels of the encoded block after applying the Sobel operator in the vertical and horizontal directions. We can therefore combine them by taking the mean of both.

In [None]:
df['sobel_hv'] = (df['sobel_h'] + df['sobel_v']) / 2

df = df.drop(['sobel_h', 'sobel_v'], axis = 1)

These variables measure the movement of a block and its variance in the vertical and horizontal directions. As it is not relevant for this project to keep them separate, we combine them into a single variable that averages this information.

In [None]:
df['movement_var'] = ((df['block_movement_h'] / df['var_movement_h']) + (df['block_movement_v'] / df['var_movement_v'])) / 2

df = df.drop(['block_movement_h', 'block_movement_v', 'var_movement_h', 'var_movement_v'], axis = 1)
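One caveat for this ratio: the rows with `bits = 0` have 0 in both the block_movement and the var_movement columns, so the division yields 0/0, which pandas evaluates to NaN. A zero variance with non-zero movement would yield infinity. The sketch below shows this on toy values and one possible way to handle it. Mapping these cases to 0 is an assumption made here for illustration, not a choice taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy values; only the column names mirror the dataset
toy = pd.DataFrame({
    "block_movement_h": [0.0, 2.0, 3.0],
    "var_movement_h": [0.0, 4.0, 0.0],
})

# 0/0 gives NaN and 3/0 gives inf in pandas floating-point division
raw_ratio = toy["block_movement_h"] / toy["var_movement_h"]

# Assumption: treat undefined ratios as "no movement" and map them to 0
safe_ratio = raw_ratio.replace([np.inf, -np.inf], np.nan).fillna(0.0)

print(raw_ratio.tolist(), safe_ratio.tolist())
```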

We make the same combination with the cost variables so we can obtain a better correlation analysis.

In [None]:
df['cost'] = (df['cost_1'] + df['cost_2']) / 2

df = df.drop(['cost_1', 'cost_2'], axis = 1)

Now we can check whether the correlations have improved with the transformed variables.

In [None]:
plt.figure(figsize = (20, 16))
corr_mtx = df.corr()
sns.heatmap(corr_mtx, annot = True, cmap = "YlOrRd")

Checking the correlations lets us reduce the number of variables, but the main feature selection method used in this project relies on the SelectKBest function.

The analysis is not finished at this point, since some variables still show high correlation with each other (above 0.5). Because of that, we have decided to drop sobel_hv, sub_mean, variance, bits and inter_other_parts.

In [None]:
df = df.drop(["sobel_hv", "sub_mean", "variance", "bits", "inter_other_parts"], axis = 1)

Now we generate the correlation matrix showing the improvement of the results for the target variable.

In [None]:
plt.figure(figsize = (16, 12))
corr_mtx = df.corr()
sns.heatmap(corr_mtx, annot = True, cmap = "YlOrRd")

Now we can check if the continuous variables that are still in the analysis are statistically different or if they have the same mean.

In [None]:
cont = df.loc[:, df.nunique() >= 29]
varlist = cont.columns.tolist()
varlist.append('relevant')
contRelv = df[varlist]
equal_test(contRelv, ["movement_level", "mean", "var_sub_blocks", "pixel_frame", "cost", "movement_var"], contRelv.columns[-1])

We decided to use label encoding instead of one-hot encoding, as the order of the values could be important for further analysis.

In [None]:
cat.head()

In [None]:
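from sklearn.preprocessing import LabelEncoder

# Work on an explicit copy so the encoding does not write into a slice of df
cat = cat.copy()
l_code = LabelEncoder()
for var in cat.columns:
    # Refit per column: LabelEncoder maps the sorted unique values to 0..n-1
    cat[var] = l_code.fit_transform(cat[var])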

In [None]:
cat.head()
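As a small illustration of why label encoding keeps the order of numeric values, the sketch below encodes a made-up series. `LabelEncoder` sorts the unique values and assigns the codes 0 to n-1 in that order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy values, invented for illustration
toy = pd.Series([480, 1080, 720, 480])

encoder = LabelEncoder()
codes = encoder.fit_transform(toy)

# classes_ holds the sorted unique values; the code is the position in that list
print(encoder.classes_.tolist(), codes.tolist())
```

One-hot encoding would instead create one indicator column per value and discard this ordering.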

## Feature Selection
In this section we perform feature selection with the help of the SelectKBest function of scikit-learn.

In order to apply the SelectKBest method it is necessary to remove the rows with missing values.

In [None]:
dftest = dftest.dropna(subset=["sub_mean_3", "cost_2"])

In this step we create the feature matrix and the target variable for the classification problem.

In [None]:
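from sklearn.feature_selection import SelectKBest, chi2

X_clf = dftest.drop(columns='relevant')
# Pass the target as a 1-D Series; a one-column DataFrame triggers a shape warning
y_clf = dftest['relevant']

# chi2 only accepts non-negative feature values
X_clf_new = SelectKBest(score_func=chi2, k=10).fit_transform(X_clf, y_clf)

X_clf_new[:5]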

In [None]:
X_clf.head()

We can conclude that the 10 features with the highest chi-squared scores with respect to the target variable relevant are bits, non_zero_pixels, frame_width, frame_height, movement_level, variance, var_movement_h, var_movement_v, cost_1 and cost_2.
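`fit_transform` returns a plain array without column names, so the selected features have to be matched back to the columns. One way to do this is the selector's `get_support` mask. A minimal sketch on made-up data follows; the columns `signal` and `noise` are hypothetical and not part of the dataset.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy data: `signal` depends on the class, `noise` has the same values in both classes
y = pd.Series([0] * 50 + [1] * 50, name="relevant")
X = pd.DataFrame({
    "signal": [1] * 50 + [10] * 50,
    "noise": [1, 2] * 50,
})

selector = SelectKBest(score_func=chi2, k=1).fit(X, y)

# get_support() returns a boolean mask over the input columns
selected = X.columns[selector.get_support()].tolist()
# scores_ holds the chi-squared statistic per input column
scores = pd.Series(selector.scores_, index=X.columns)

print(selected)
print(scores)
```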