## Descriptive Analysis

A notebook to describe the data set with simple statistical tools.

In [None]:
import pandas as pd
import pathlib
import seaborn as sns
import matplotlib.pyplot as plt


import ml_colon

### Setting up Data Directory

In [None]:
data_dir = ml_colon.HERE.parents[2] / "data" 
print(data_dir)

assert data_dir.exists()

data_files = list(data_dir.glob("*.csv"))
print([f.name for f in data_files])

assert data_files

### Loading Raw Data

In [None]:
_filepath = data_dir / "raw_data.csv"
df = pd.read_csv(_filepath)

# assert all rows have been loaded
with open(_filepath) as f:
    n_lines = sum(1 for _ in f)
assert len(df) == n_lines - 1  # file has a header row

print(f"Raw data set has: {len(df)} rows")

In [None]:
descr_df = pd.read_csv(data_dir / "data_description.csv", index_col="column_name")

In [None]:
print(descr_df)

Let's take a quick look at the data types in the dataframe:

In [None]:
df.dtypes

Conclusion:
It looks like we are only dealing with numerical data (no strings, datetimes, ...).

### Missing Values?

Next, let's check whether the data set contains any missing values (nulls / NaNs) and, if so, how many.

In [None]:
_null_df = df.isnull().sum()

print(_null_df[_null_df > 0])

The missing values for sub_mean_3 and cost_2 can perhaps be imputed or recovered.

For the target variable "relevant" this is not an option. It is probably best to drop these 2 rows: we would have to exclude them from training and testing the model anyway, and it is only 2 rows in total.

In [None]:
df = df[~df.relevant.isnull()]

len(df)
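If the gaps in sub_mean_3 and cost_2 are to be filled rather than dropped, median imputation is one simple option. The following is only a minimal sketch on a toy frame: the values are invented, and the choice of the median is an assumption, not something this notebook has decided on.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "sub_mean_3": [1.0, np.nan, 3.0, 5.0],
        "cost_2": [10.0, 20.0, np.nan, 40.0],
        "relevant": [1.0, 0.0, 1.0, np.nan],
    }
)

# Drop rows without a target value, as done for the real data.
toy = toy[toy["relevant"].notna()]

# Fill feature gaps with the per-column median of the remaining rows.
feature_cols = ["sub_mean_3", "cost_2"]
imputed = toy.copy()
imputed[feature_cols] = imputed[feature_cols].fillna(imputed[feature_cols].median())

print(imputed)
```

When a model is trained later, the medians should be computed on the training split only and then applied to the test split, so that no information leaks from the test data.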

### Column Analysis

We want to go over each variable in the dataset and explore it with simple descriptive statistics.

A first overview can be seen here:

In [None]:
df.describe()

#### Column: Quality

In [None]:
column_name = "quality"
print(descr_df.loc[column_name, "description"])

sns.histplot(df[column_name].values)

In [None]:
df[column_name].value_counts()

Looks like a discrete uniform distribution, but maybe the data set was sampled that way...
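One way to put a number on "looks uniform" is a chi-square goodness-of-fit statistic against equal expected counts. This is a rough sketch on synthetic data: the quality values below are invented, and `uniform_chi2` is a hypothetical helper, not part of this notebook or of pandas.

```python
import pandas as pd


def uniform_chi2(values: pd.Series) -> float:
    """Chi-square statistic of observed category counts vs. equal counts."""
    values = values.dropna()
    counts = values.value_counts()
    expected = len(values) / len(counts)  # equal share per observed category
    return float(((counts - expected) ** 2 / expected).sum())


# Synthetic examples: one perfectly balanced, one clearly unbalanced.
balanced = pd.Series([22, 27, 32, 37] * 25)
unbalanced = pd.Series([22] * 70 + [27] * 10 + [32] * 10 + [37] * 10)

chi2_balanced = uniform_chi2(balanced)
chi2_unbalanced = uniform_chi2(unbalanced)
print(chi2_balanced, chi2_unbalanced)
```

A statistic near zero is consistent with uniform counts. Note that the helper only considers categories that actually occur in the data. If SciPy is available, `scipy.stats.chisquare` computes the same statistic together with a p-value.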


#### Column: Bits

In [None]:

column_name = "bits"
print(descr_df.loc[column_name, "description"])

In [None]:

fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")

sns.histplot(df[column_name].values, ax=ax)

The distribution is highly skewed, and it looks like most of the blocks are encoded using only a few bits.
This raises the question: are there blocks that are allegedly encoded with 0 bits in the video stream?

Note: In my opinion this should not be possible, as 0 bits would mean 0 information.
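The visual impression of skew can be backed by the sample skewness and by the share of small values. A small sketch, assuming invented bit counts in place of the real column:

```python
import pandas as pd

# Invented right-skewed bit counts: many small values, a few large ones.
toy_bits = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 40, 120])

skewness = toy_bits.skew()            # > 0 indicates a long right tail
share_small = (toy_bits <= 8).mean()  # fraction of values of at most 8 bits

print(f"skewness: {skewness:.2f}, share <= 8 bits: {share_small:.2%}")
```

On the real data the same two expressions could be applied to `df["bits"]`.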

In [None]:
print("Number of rows with 0 bits:", len(df[df[column_name] == 0]))

In [None]:
pd.cut(df[column_name], bins=[0, 8, 16, 32, 64, 124], include_lowest=False).value_counts(sort=False)

There are some blocks that are encoded with a very small number of bits: 696 rows are encoded with at most 8 bits. It will be interesting to see how quality depends on the number of bits. For now we keep these rows, but we may have to deal with them separately.
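A note on reading the binned counts: with `include_lowest=False`, `pd.cut` builds right-closed intervals, so the first bin is `(0, 8]` and a value of exactly 0 falls outside every bin. A small sketch with made-up bit counts shows the effect:

```python
import pandas as pd

# Made-up bit counts, including the edge cases 0, 8 and 124.
bits = pd.Series([0, 1, 8, 9, 16, 64, 124])

binned = pd.cut(bits, bins=[0, 8, 16, 32, 64, 124], include_lowest=False)

# Values outside every interval become NaN and are left out of value_counts().
n_unbinned = int(binned.isna().sum())
bin_counts = binned.value_counts(sort=False)

print(bin_counts)
print("Values outside all bins:", n_unbinned)
```

If the binned counts do not add up to the number of rows, the difference is the number of values that fell outside the bin edges.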

#### Column: intra_parts

In [None]:
column_name = "intra_parts"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")

sns.histplot(df[column_name].values, ax=ax)

In [None]:
df[column_name].value_counts(sort=False)

The large majority of rows have 0 sub-blocks. I am not yet sure I understand what these sub-blocks are...

#### Column: skip_parts

In [None]:
column_name = "skip_parts"
print(descr_df.loc[column_name, "description"])

In [None]:
fig, ax = plt.subplots(figsize=(16,8))

ax.set_title(f"Histogram of {column_name}")
sns.histplot(df[column_name], ax=ax)