# Video Quality Metric prediction


## Data 

The data is a single CSV file with four groups of columns: Source Video, Video Characteristic, Encoding Setting, and Target Value.

- **Video_data_set.csv** - Data collected from videos

- **Source Video**

  * `s_video_id` - Identifier for the original video (numerical)
  * `s_width` - Resolution (width) of the raw video (numerical)
  * `s_height` - Resolution (height) of the raw video (numerical)
  * `s_storage_size` - Total size of the source video without audio tracks (numerical)
  * `s_duration` - Length of the source video (numerical)
  * `s_scan_type` - "progressive" or "interlaced" (categorical)
  
**----------------------------------------------------------------------------------**  
  
- **Video Characteristic**

  * `c_content_category` - Label of the video category with the highest probability (categorical)
  * `c_scene_change_ffmpeg_ratio30` - Average number of scene changes per minute in the video for a given probability of 30% (numerical)
  * `c_scene_change_ffmpeg_ratio60` - Average number of scene changes per minute in the video for a given probability of 60% (numerical)
  * `c_scene_change_ffmpeg_ratio90` - Average number of scene changes per minute in the video for a given probability of 90% (numerical)
  * `c_scene_change_py_thresh30` - Number of scene changes throughout the entire clip with threshold 30 (numerical)
  * `c_scene_change_py_thresh50` - Number of scene changes throughout the entire clip with threshold 50 (numerical)
  * `c_si` - Spatial perceptual information (SI) based on the Sobel filter, averaged over the whole video (numerical)
  * `c_ti` - Temporal perceptual information (TI) based on the motion difference feature, averaged over the whole video (numerical)
  * `c_colorhistogram_mean_dark` - Color values in [0-63] are grouped in this block (dark): population mean of the RGB color values, normalised by pixel count and channel count, then averaged over all frames of a video (numerical)
  * `c_colorhistogram_mean_medium_dark` - Color values in [64-127] (numerical)
  * `c_colorhistogram_mean_medium_bright` - Color values in [128-195] (numerical)
  * `c_colorhistogram_mean_bright` - Color values in [196-255] (numerical)
  * `c_colorhistogram_std_dev_dark` - Standard deviation of `c_colorhistogram_mean_dark` within each frame, averaged over all frames of a video (numerical)
  * `c_colorhistogram_std_dev_medium_dark` - Standard deviation of `c_colorhistogram_mean_medium_dark` within each frame, averaged over all frames of a video (numerical)
  * `c_colorhistogram_std_dev_medium_bright` - Standard deviation of `c_colorhistogram_mean_medium_bright` within each frame, averaged over all frames of a video (numerical)
  * `c_colorhistogram_std_dev_bright` - Standard deviation of `c_colorhistogram_mean_bright` within each frame, averaged over all frames of a video (numerical)
  * `c_colorhistogram_temporal_mean_std_dev_dark` - Temporal standard deviation of the mean of `c_colorhistogram_mean_dark` (numerical)
  * `c_colorhistogram_temporal_mean_std_dev_medium_dark` - Temporal standard deviation of the mean of `c_colorhistogram_mean_medium_dark` (numerical)
  * `c_colorhistogram_temporal_mean_std_dev_medium_bright` - Temporal standard deviation of the mean of `c_colorhistogram_mean_medium_bright` (numerical)
  * `c_colorhistogram_temporal_mean_std_dev_bright` - Temporal standard deviation of the mean of `c_colorhistogram_mean_bright` (numerical)
- **Encoding Setting**
  * `e_crf` - Constant Rate Factor for this encoding (numerical)
  * `e_width` - Target resolution (width) of the encoded video (numerical)
  * `e_height` - Target resolution (height) of the encoded video (numerical)
  * `e_aspect_ratio` - Aspect ratio of the video (numerical)
  * `e_pixel_aspect_ratio` - Aspect ratio of the pixels, usually 1:1 = 1 (numerical)
  * `e_codec` - Video codec, e.g. H.264, H.265, VP9, AV1 (categorical)
  * `e_codec_profile` - Video codec profile, e.g. baseline, main, high. Depending on the profile, certain encoder features are disabled or enabled (categorical)
  * `e_codec_level` - Video codec level: a specified set of constraints that indicate a degree of required decoder performance for a profile (ordered categorical)
  * `e_framerate` - Frames per second (numerical)
  * `e_gop_size` - Number of frames between two I-frames (numerical)
  * `e_b_frame_int` - Number of B-frames per interval (numerical)
  * `e_ref_frame_count` - Number of reference frames, i.e. frames of a compressed video that are used to define future frames (numerical)
  * `e_scan_type` - "progressive" or "interlaced" (categorical)
  * `e_bit_depth` - Amount of information stored in each pixel of data, also known as "color depth" (numerical)
  * `e_pixel_fmt` - Color model such as YUV, RGB, YPbPr, etc. (categorical)

- **Target Value**
  * `t_average_bitrate` - Average bitrate as encoding setting (numerical)
  * `t_average_vmaf` - Quality metric (numerical)
  * `t_average_vmaf_mobile` - No Value 
  * `t_average_vmaf_4k` - No Value 
  * `t_average_psnr` - No Value 

# Import Packages

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.offline as py
py.init_notebook_mode(connected=True)

from math import sqrt
from pprint import pprint
from IPython.display import display
from scipy.spatial import ConvexHull
from matplotlib.pyplot import figure

from sklearn.metrics import r2_score
from sklearn import preprocessing, svm
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split,RandomizedSearchCV


import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

# Read the Dataset

In [None]:
data_set = pd.read_csv("../input/per-title-encoding/Video_data_set.csv",sep = ',')
print("**Video Dataset:**")
display(data_set.head())
print("**Video Dataset shape:**", data_set.shape)

# Data Preprocessing 

## Dropping unnecessary data

After examining the dataset, I found that some columns do not affect our target, which is **VMAF**.
**Reason:** 
- VMAF is a numerical target, and the values in these columns mostly describe the format and structure of the data. 

In [None]:
data_set = data_set.drop(["e_codec",
                          "e_codec_profile",
                          "e_scan_type",
                          "e_pixel_aspect_ratio",
                          "e_pixel_fmt",
                          "e_aspect_ratio",
                          "e_b_frame_int",
                          "e_ref_frame_count",
                          "t_average_vmaf_mobile",
                          "e_bit_depth",
                          "t_average_vmaf_4k",
                          "t_average_psnr",
                          "s_scan_type"], axis=1) # remove desire columns 
data_set.head()
print("**Video Dataset shape after removing columns:**", data_set.shape)

## Data Cleaning 

The **data_set** contains NaN values and some duplicate values in the columns listed below: 
- `c_content_category`          
- `c_colorhistogram_mean_dark`                             
- `c_colorhistogram_mean_medium_dark`                       
- `c_colorhistogram_mean_medium_bright`                     
- `c_colorhistogram_mean_bright`                            
- `c_colorhistogram_std_dev_dark`                           
- `c_colorhistogram_std_dev_medium_dark`                    
- `c_colorhistogram_std_dev_medium_bright`                  
- `c_colorhistogram_std_dev_bright`                         
- `c_colorhistogram_temporal_mean_std_dev_dark`             
- `c_colorhistogram_temporal_mean_std_dev_medium_dark`      
- `c_colorhistogram_temporal_mean_std_dev_medium_bright`    
- `c_colorhistogram_temporal_mean_std_dev_bright`
- `c_ti`
- `c_si`
- `e_codec_level`                             
- `e_framerate`                                             
- `e_gop_size`
- `t_average_vmaf`

In the next cell, I remove all duplicates and NaN values and sort the rows by video ID.  

In [None]:
data_set_c = data_set.copy()
data_set_c.drop_duplicates(inplace=True)    # Removing the duplicate rows
data_set_c.dropna(inplace=True)             # Removing the Nan value 
data_set_c.sort_values(by=['s_video_id'], inplace=True)  # sort dataset based on video ID  
data_set_c.reset_index(inplace = True, drop = True)  
print("**Dataset shape after data cleaning:**", data_set_c.shape)

## Data Preparation

- Since the dataset contains the encoded height and width (`e_height`, `e_width`), we can compute the resolution of each video as a new column named `Res_width_height`.

- Scaling the scene-change value by the duration in seconds gives an average scene change, which is stored as `scene_change_avg` in the dataset.

In [None]:
# Calculate resolution 
data_set_c['Res_width_height'] = data_set_c['e_width']*data_set_c['e_height']

In [None]:
# scene-change value (60% probability) multiplied by 60 and divided by the duration in seconds
data_set_c["scene_change_avg"]= data_set_c['c_scene_change_ffmpeg_ratio60']*60/data_set_c['s_duration']

## Detecting outliers and applying the ITU-T P.1401 method to remove them

- To remove the extreme outliers, I followed the technique from ITU-T Recommendation P.1401. It is intended for subjective ratings, but I found it suitable for this data as well. Unlike general box-and-whisker plots, in which data beyond 1.5 times the interquartile range (IQR) are considered outliers, this method uses 3 times the IQR to find and remove the extreme outliers.
- I applied this method to the features that matter most for our scenario and removed their outliers.

**Applied features:** 

- `t_average_bitrate`
- `scene_change_avg`
- `s_duration`
- `s_storage_size`
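
The 3 times IQR rule described above can be sketched as a small reusable helper. This is a minimal sketch on made-up data, not the exact code of this notebook: the function name `remove_extreme_outliers` and the toy values are assumptions for illustration, and only the upper threshold is applied, as in the cells that follow.

```python
import pandas as pd

def remove_extreme_outliers(df, column, k=3.0):
    """Keep rows whose `column` value is below Q3 + k * IQR (upper side only)."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    max_threshold = q3 + k * (q3 - q1)
    return df[df[column] < max_threshold].reset_index(drop=True)

# Toy data (hypothetical values): nine ordinary bitrates and one extreme one
toy = pd.DataFrame({"t_average_bitrate": [100, 110, 120, 130, 140, 150, 160, 170, 180, 5000]})
cleaned = remove_extreme_outliers(toy, "t_average_bitrate")
print(len(toy), "->", len(cleaned))
```

Selecting the column by name rather than by position keeps the filter valid even if the column order of the DataFrame changes.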

**Below, I use boxplots to check for outliers 
(points that lie outside the range of the data distribution)**

In [None]:
sns.boxplot(x= data_set_c['t_average_bitrate'])

**Remove outliers of `t_average_bitrate` by ITU-T method**

In [None]:
shape=data_set_c['t_average_bitrate'].shape[0] # get the number of row 
I_range=data_set_c['t_average_bitrate'].describe()[6]-data_set_c['t_average_bitrate'].describe()[4]
I_range2=I_range*3  
max_threshold=data_set_c['t_average_bitrate'].describe()[6]+I_range2
for i in range(shape):
    if(data_set_c.iloc[i, 31:32].values[0] >= max_threshold):
        data_set_c.iloc[i, 31:32]= np.nan


data_set_c.dropna(inplace=True)
data_set_c.reset_index(inplace= True, drop=True)
display(data_set_c.head())
print("**Dataset shape after removing outlier of Bitrate:**", data_set_c.shape)
print("-----------------------------------------------------")
print("Checking outlier of 'Bitrate'after removing outliers ")
sns.boxplot(x= data_set_c['t_average_bitrate'])


**Remove outliers of `scene_change_avg` by ITU-T method**

In [None]:
sns.boxplot(x= data_set_c['scene_change_avg'])

In [None]:
shape=data_set_c['scene_change_avg'].shape[0]
I_range=data_set_c['scene_change_avg'].describe()[6]-data_set_c['scene_change_avg'].describe()[4]
I_range2=I_range*3
max_threshold=data_set_c['scene_change_avg'].describe()[6]+I_range2
for i in range(shape):
    if(data_set_c.iloc[i, 34:35].values[0] >= max_threshold):
        data_set_c.iloc[i, 34:35]=np.nan

data_set_c.dropna(inplace=True)        
data_set_c.reset_index(inplace= True, drop=True)
display(data_set_c.head())
print("**Dataset shape after removing outlier of scene_change:**", data_set_c.shape)
print("----------------------------------------------------------")
print("Checking outlier of 'scene_change'after removing outliers ")
sns.boxplot(x= data_set_c['scene_change_avg'])

**Remove outliers of `s_duration` by ITU-T method**

In [None]:
sns.boxplot(x= data_set_c['s_duration'])

In [None]:
shape=data_set_c['s_duration'].shape[0]
I_range=data_set_c['s_duration'].describe()[6]-data_set_c['s_duration'].describe()[4]
I_range2=I_range*3
max_threshold=data_set_c['s_duration'].describe()[6]+I_range2
for i in range(shape):
    if(data_set_c.iloc[i, 4:5].values[0] >= max_threshold):
        data_set_c.iloc[i, 4:5]=np.nan

data_set_c.dropna(inplace=True)        
data_set_c.reset_index(inplace= True, drop=True)
display(data_set_c.head())
print("**Dataset shape after removing outlier of video duration:**", data_set_c.shape)
print("----------------------------------------------------------")
print("Checking outlier of 'duration'after removing outliers ")
sns.boxplot(x= data_set_c['s_duration'])

**Remove outliers of `s_storage_size` by ITU-T method**

In [None]:
sns.boxplot(x= data_set_c['s_storage_size'])

In [None]:
shape=data_set_c['s_storage_size'].shape[0]
I_range=data_set_c['s_storage_size'].describe()[6]-data_set_c['s_storage_size'].describe()[4]  # IQR = Q3 - Q1 of the same column
I_range2=I_range*3
max_threshold=data_set_c['s_storage_size'].describe()[6]+I_range2
for i in range(shape):
    if(data_set_c.iloc[i, 3:4].values[0] >= max_threshold):
        data_set_c.iloc[i, 3:4]=np.nan

data_set_c.dropna(inplace=True)        
data_set_c.reset_index(inplace= True, drop=True)
display(data_set_c.head())
print("**Dataset shape after removing outlier of storage_size:**", data_set_c.shape)
print("----------------------------------------------------------")
print("Checking outlier of 'storage size'after removing outliers ")
sns.boxplot(x= data_set_c['s_storage_size'])

## Data Description 

In [None]:
display(data_set_c.describe().transpose()) 

To better understand how the features relate to **VMAF**, 
I computed a correlation heatmap between them so that I can choose the most correlated ones. 

In [None]:
corrMat = data_set_c.select_dtypes(include='number').corr(method='pearson') # Pearson correlation of the numeric columns
fig, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(corrMat, annot=True, fmt='.2f', ax=ax) # annotated heatmap of the correlation matrix
plt.title("Correlation Matrix after preprocessing ")
plt.show()
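
To read the heatmap as a ranking, the correlations with the target can also be sorted directly. The following is a minimal sketch on synthetic data: the column names are borrowed from the dataset, but the values and the `noise` column are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-in data (assumption): VMAF falls as CRF rises, plus an unrelated column
demo = pd.DataFrame({"e_crf": rng.uniform(18, 40, 200)})
demo["t_average_vmaf"] = 100 - 2 * demo["e_crf"] + rng.normal(0, 1, 200)
demo["noise"] = rng.normal(0, 1, 200)

# Absolute Pearson correlation of every feature with the target, strongest first
ranking = (demo.corr(method="pearson")["t_average_vmaf"]
               .drop("t_average_vmaf")
               .abs()
               .sort_values(ascending=False))
print(ranking)
```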

# Preparation of Train / Test 

After checking the correlation between the features, and based on the papers shared
by Netflix, I fed the highly correlated features into the machine learning models: 
- `scene_change_avg`,`s_duration`,`e_framerate`
- `s_height`,`s_storage_size`,`Res_width_height`,`t_average_bitrate`
- `e_crf`,`s_width`,`t_average_vmaf`

The feature values are copied from the preprocessed `data_set_c` into `data_set_final`.  

In [None]:
data_set_final=pd.DataFrame({
                         'scene_change_avg':data_set_c["scene_change_avg"] ,
                         's_duration': data_set_c['s_duration'], 
                         's_video_id':data_set_c['s_video_id'],
                         'e_framerate':data_set_c['e_framerate'],
                         's_height':data_set_c['s_height'], 
                         's_storage_size': data_set_c['s_storage_size'], 
                         'Res_WidthHeight': data_set_c['Res_width_height'],
                         't_average_bitrate': data_set_c['t_average_bitrate'], 
                         'e_crf': data_set_c['e_crf'], 
                         's_width' : data_set_c['s_width'], 
                         't_average_vmaf': data_set_c['t_average_vmaf']})

data_set_final.dropna(inplace=True)
display(data_set_final.head())
print("**Dataset shape after feature selection:**", data_set_final.shape)

**Selecting the X as features and y as Target**  

In [None]:
data_set_final1 = data_set_final.drop(['s_video_id'], axis=1) # remove the video ID for training only

X = np.array(data_set_final1.drop(['t_average_vmaf'], axis=1)) # select all features except VMAF
X = preprocessing.scale(X) # standardize each feature to zero mean and unit variance
y = np.array(data_set_final1['t_average_vmaf']) # select VMAF as the target

print("The shape of X:",X.shape)
print("The shape of y:",y.shape)

Some of the encoded versions of a source video may land in the training data and the rest in the test set. That looks unfair, because the results are then probably biased toward the training set and the trend of each source video becomes easy to predict. I therefore applied 10-fold cross-validation. The dataset contains around 6680 data points and each video content has around 60 of them, so I used K=10 to divide the data into groups of video content of around 600 points each, with the remaining folds holding other video content with different features.

- **Note:** Applying k-fold cross-validation to a random forest with 1000 trees is only meant to show the highest performance of the random forest; I then tune the hyperparameters to find the optimum of the model.
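
One way to guarantee that all encodes of a source video stay in the same fold is grouped cross-validation. The sketch below uses scikit-learn's `GroupKFold` on synthetic data; the array names and sizes are assumptions for illustration, and in this notebook the groups would be the `s_video_id` column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_videos, encodes_per_video = 20, 6  # synthetic sizes (assumption)
groups = np.repeat(np.arange(n_videos), encodes_per_video)  # stands in for s_video_id
X_demo = rng.normal(size=(groups.size, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=groups.size)

cv = GroupKFold(n_splits=10)
# No video id may appear on both sides of any split
for train_idx, test_idx in cv.split(X_demo, y_demo, groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, groups=groups)
print(scores.shape)
```

With a plain `cv=10`, the folds are cut by row position, so keeping videos together depends on the row order of the data; passing explicit groups removes that dependency.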

In [None]:
def Cross_score():
    
    SVR_cross = svm.SVR()
    LR_cross = LinearRegression()
    RF_cross = RandomForestRegressor(n_estimators=1000)

    scores_SVR = cross_val_score(SVR_cross, X, y, cv=10)
    scores_LR = cross_val_score(LR_cross, X, y, cv=10) 
    scores_RF = cross_val_score(RF_cross, X, y, cv=10) 
    
    print('----------------------------------------')
    print('scores_SVR:',scores_SVR)
    print('----------------------------------------')
    print('scores_LR:',scores_LR)
    print('----------------------------------------')
    print('scores_RF:',scores_RF)
    
Cross_score()

The cross-validation results show that the random forest performs better than the other models. 

**Checking the normal `train_test_split` on the dataset and the confidence of each model**

Approximately 25% of the source videos
(together with all of their encoded versions) are chosen as
the test data, and the remaining 75% are used as the
training set. This ensures that the results are not biased
toward the training set. 

**Split into Train and Test set**


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Machine learning model 

In [None]:
def ML_model(X_train,y_train,X_test,y_test):
    
    SVR = svm.SVR()
    LR = LinearRegression()
    RF = RandomForestRegressor(n_estimators=1000)

    SVR.fit(X_train, y_train)
    LR.fit(X_train, y_train)
    RF.fit(X_train, y_train)
    
    predict_SVR =SVR.predict(X_test)
    predict_LR=LR.predict(X_test)
    predict_RF=RF.predict(X_test)
    
    Result_svr = pd.concat([pd.DataFrame({"y_test":y_test}) ,pd.DataFrame({"predict_SVR":predict_SVR})], axis = 1)
    Result_lr = pd.concat([pd.DataFrame({"y_test":y_test}) ,pd.DataFrame({"predict_LR":predict_LR})], axis = 1)
    Result_rf = pd.concat([pd.DataFrame({"y_test":y_test}) ,pd.DataFrame({"predict_RF":predict_RF})], axis = 1)
    

    Result_svr.to_csv('SVR.csv')
    Result_lr.to_csv('LinearRegression.csv')
    Result_rf.to_csv('RF.csv')



    confidence_SVR = SVR.score(X_test, y_test)
    confidence_LR = LR.score(X_test, y_test)
    confidence_RF = RF.score(X_test, y_test)
    
    lst1 =["SVR","LR","RF"]
    lst = [predict_SVR,predict_LR,predict_RF]
    for idx, val in enumerate(lst):
        Corr = np.corrcoef(y_test,val)[0][1]
        MSE1 = MSE(y_test, val)
        MAE1 = MAE(y_test, val)
        RMSE = sqrt(MSE1)
        R2 = r2_score(y_test,val)
        
        print("--------------------------")
        print('Corr_{}: {:.3f}' .format(lst1[idx],Corr)) 
        print('MAE_{}:  {:.3f}' .format(lst1[idx],MAE1))
        print('MSE_{}:  {:.3f}' .format(lst1[idx],MSE1))
        print('RMSE_{}: {:.3f}' .format(lst1[idx],RMSE))
        print('R2_{}:   {:.3f}' .format(lst1[idx],R2))
        print("--------------------------")
        
        figure = plt.subplots(figsize=(3,3))
        plt.scatter(y_test,val)
        plt.xlabel('Actual vmaf')
        plt.ylabel('Predicted vmaf')
        plt.title('%s' %lst1[idx])

        
    print("*************Confidence********")
    
    print("--------------------------")
    print('confidence_SVR:',confidence_SVR)
    print("--------------------------")
    print('confidence_LR:',confidence_LR)
    print("--------------------------")
    print('confidence_RF:',confidence_RF)
    print("--------------------------")
    


ML_model(X_train,y_train,X_test,y_test)

Since hyperparameters directly control the behaviour of the training algorithm and have a significant impact on the performance of the trained model, I tuned the hyperparameters of the best model, the random forest, to find its optimum parameters. Based on the 10-fold cross-validation score, the model fits the training set as well as the test set, so I decided to use a random search simply to see how the performance of the random forest model changes.     

## Hyperparameter tuning of the random forest using randomized search

In [None]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# Number of features to consider at every split
# (1.0 = all features; the old 'auto' option is no longer accepted by recent scikit-learn)
max_features = [1.0, 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

print("*****Random Grid parameter*****")
display(random_grid)


## Use the random grid to search for best hyperparameters

In [None]:

# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 10-fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_model = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100,cv = 10, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_model.fit(X_train, y_train)


* **Here are the best parameters for the random forest model**

In [None]:
best=rf_model.best_estimator_
print(best)

**Applying the best estimator for Random Forest model to see the difference** 

In [None]:
RF = RandomForestRegressor(n_estimators=400, max_features='sqrt',bootstrap=False)
RF.fit(X_train, y_train)
predict_RF=RF.predict(X_test)

#  Feature importance from Random Forest model

In [None]:
feature_names = list(data_set_final1.columns)[:-1]  # training features only (no video ID, no target)
importances = RF.feature_importances_
for i,v in enumerate(importances):
    print('%s,---------------------Score: %.5f' % (feature_names[i],v))

In [None]:
feature_names = list(data_set_final1.columns)[:-1]  # training features only (no video ID, no target)
fig = dict({
    "data": [{"type": "bar",
              "x": feature_names,
              "y": importances}],
    "layout": {"title": {"text": "Feature importances of the random forest"},
               'xaxis': {'categoryorder': 'total ascending'}}
})

fig = go.Figure(fig)
py.iplot(fig)

# ***Plot 6 resolution curves based on VMAF***

A closer look at the rate-distortion plots of the video
clips shows that, at each specific resolution, increasing
the bitrate enhances the quality up to a certain point.
Beyond that point the quality flattens out, so jumping to
a higher resolution is the better choice.

In [None]:
FIGURE, ax = plt.subplots(3, 2,figsize=(13,13))
uniq = data_set_c['c_content_category'].unique()# get the unique content   



###################################ax[0, 0]######################################
# Select the rows of the first content category in uniq
summerize = data_set_c[data_set_c['c_content_category'] == uniq[0]]
resoluion_lst = summerize['e_width'].unique()# Get the unique resolution from our dataset

for i in resoluion_lst:
        resoluion = summerize[summerize['e_width'] == i]
        resoluion = resoluion.sort_values(by=['t_average_bitrate'])
        x = resoluion['t_average_bitrate']
        y = resoluion['t_average_vmaf']
        ax[0, 0].plot(x, y, label=i)
        ax[0, 0].set_xlabel('Average_Bitrate')
        ax[0, 0].set_ylabel('Average_VMAF')
        ax[0, 0].title.set_text(uniq[0])

###################################ax[1, 0]#####################################        
summerize=data_set_c[data_set_c['c_content_category'] == uniq[1]]
resoluion_lst = summerize['e_width'].unique()
for i in resoluion_lst:
        resoluion = summerize[summerize['e_width'] == i]
        resoluion = resoluion.sort_values(by=['t_average_bitrate'])
        x = resoluion['t_average_bitrate']
        y = resoluion['t_average_vmaf']
        ax[1, 0].plot(x, y, label=i)
        ax[1, 0].set_xlabel('Average_Bitrate')
        ax[1, 0].set_ylabel('Average_VMAF')
        ax[1, 0].title.set_text(uniq[1])

###################################ax[2, 0]###################################
summerize=data_set_c[data_set_c['c_content_category'] == uniq[2]]
resoluion_lst = summerize['e_width'].unique()
for i in resoluion_lst:
        resoluion = summerize[summerize['e_width'] == i]
        resoluion = resoluion.sort_values(by=['t_average_bitrate'])
        x = resoluion['t_average_bitrate']
        y = resoluion['t_average_vmaf']
        ax[2, 0].plot(x, y, label=i)
        ax[2, 0].set_xlabel('Average_Bitrate')
        ax[2, 0].set_ylabel('Average_VMAF')
        ax[2, 0].title.set_text(uniq[2])

        
###################################ax[0, 1]###################################
summerize=data_set_c[data_set_c['c_content_category'] == uniq[3]]
resoluion_lst = summerize['e_width'].unique()
for i in resoluion_lst:
        resoluion = summerize[summerize['e_width'] == i]
        resoluion = resoluion.sort_values(by=['t_average_bitrate'])
        x = resoluion['t_average_bitrate']
        y = resoluion['t_average_vmaf']
        ax[0, 1].plot(x, y, label=i)
        ax[0, 1].set_xlabel('Average_Bitrate')
        ax[0, 1].set_ylabel('Average_VMAF')
        ax[0, 1].title.set_text(uniq[3])


###################################ax[1, 1]##################################
summerize = data_set_c[data_set_c['c_content_category'] == uniq[4]]
resoluion_lst = summerize['e_width'].unique()
for i in resoluion_lst:
        resoluion = summerize[summerize['e_width'] == i]
        resoluion = resoluion.sort_values(by=['t_average_bitrate'])
        x = resoluion['t_average_bitrate']
        y = resoluion['t_average_vmaf']
        ax[1, 1].plot(x, y, label=i)
        ax[1, 1].set_xlabel('Average_Bitrate')
        ax[1, 1].set_ylabel('Average_VMAF')
        ax[1, 1].title.set_text(uniq[4])
        
###################################ax[2, 1]##################################
summerize = data_set_c[data_set_c['c_content_category'] == uniq[5]]
resoluion_lst = summerize['e_width'].unique()
for i in resoluion_lst:
        resoluion = summerize[summerize['e_width'] == i]
        resoluion = resoluion.sort_values(by=['t_average_bitrate'])
        x = resoluion['t_average_bitrate']
        y = resoluion['t_average_vmaf']
        ax[2, 1].plot(x, y, label=i)
        ax[2, 1].set_xlabel('Average_Bitrate')
        ax[2, 1].set_ylabel('Average_VMAF')
        ax[2, 1].title.set_text(uniq[5])



plt.legend(title='Resolution')
plt.tight_layout()

The convex hull is a ubiquitous structure in
computational geometry. A set S is convex if x ∈ S and
y ∈ S implies that the segment xy ⊆ S; that is, given any
two points in the set, the line segment between them stays
inside the set. The convex hull, or convex envelope, of a
set X of points in the Euclidean plane or a Euclidean space
is the smallest convex set that contains X.
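
As a small illustration of the definition above, the sketch below computes the convex hull of a handful of made-up (bitrate, VMAF) points with `scipy.spatial.ConvexHull`. The point values are invented for the example; `hull.vertices` holds the indices of the points that lie on the hull.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Made-up (bitrate, VMAF) pairs, for illustration only
points = np.array([
    [100.0, 20.0],
    [500.0, 60.0],
    [1000.0, 80.0],
    [2000.0, 90.0],
    [4000.0, 95.0],
    [1500.0, 70.0],   # lies strictly inside the hull
    [3000.0, 50.0],
])

hull = ConvexHull(points)
on_hull = sorted(hull.vertices.tolist())  # indices of the points on the hull boundary
print(on_hull)
```

Building the `(n, 2)` array directly from the two columns avoids hard-coding the number of points.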

# ***Convex hull on the 6 resolution curves*** 

In [None]:

figure, ax = plt.subplots(3, 2,figsize=(13,13))

     

###################################ax[0, 0]###############################
summerize=data_set_c[data_set_c['c_content_category']==uniq[0]] 
resoluion_lst = summerize['e_width'].unique() 


for i in resoluion_lst:
        resoluion = summerize[summerize['e_width']==i]
        resoluion = resoluion.sort_values(by=['t_average_bitrate'])
        x = resoluion['t_average_bitrate']
        y = resoluion['t_average_vmaf']
        ax[0, 0].plot(x, y, label=i)
        ax[0, 0].set_xlabel('Average_Bitrate')
        ax[0, 0].set_ylabel('Average_VMAF')
        ax[0, 0].title.set_text(uniq[0])
                        
# finding x and y axis for our convex hull


x = summerize['t_average_bitrate']
y = summerize['t_average_vmaf']
x_lst = pd.Series(x).values
y_lst = pd.Series(y).values
x_lst_S = x_lst.reshape(-1,1)  # column vector; -1 infers the number of rows
y_lst_S = y_lst.reshape(-1,1)
xy = np.concatenate((x_lst_S, y_lst_S), axis=1)

# plot convex hull for axis

hull = ConvexHull(xy)
for sample in hull.simplices:
    ax[0, 0].plot(xy[sample, 0], xy[sample, 1], 'r', label=i, linewidth=2)

###################################ax[1, 0]###############################
summerize=data_set_c[data_set_c['c_content_category']==uniq[1]]
resoluion_lst = summerize['e_width'].unique()

for i in resoluion_lst:
        resoluion = summerize[summerize['e_width']==i]
        resoluion = resoluion.sort_values(by=['t_average_bitrate'])
        x = resoluion['t_average_bitrate']
        y = resoluion['t_average_vmaf']
        ax[1, 0].plot(x, y, label=i)
        ax[1, 0].set_xlabel('Average_Bitrate')
        ax[1, 0].set_ylabel('Average_VMAF')
        ax[1, 0].title.set_text(uniq[1])
                        
# finding x and y axis for our convex hull

x = summerize['t_average_bitrate']
y = summerize['t_average_vmaf']
x_lst = pd.Series(x).values
y_lst = pd.Series(y).values
x_lst_S = x_lst.reshape(-1,1)  # column vector; -1 infers the number of rows
y_lst_S = y_lst.reshape(-1,1)
xy = np.concatenate((x_lst_S, y_lst_S), axis=1)


# plot convex hull for axis

hull = ConvexHull(xy)
for sample in hull.simplices:
    ax[1, 0].plot(xy[sample, 0], xy[sample, 1], 'r', label=i, linewidth=2)


###################################ax[2, 0]################################
summerize=data_set_c[data_set_c['c_content_category']==uniq[2]]
resoluion_lst = summerize['e_width'].unique()


for i in resoluion_lst:
        resoluion = summerize[summerize['e_width']==i]
        resoluion = resoluion.sort_values(by=['t_average_bitrate'])
        x = resoluion['t_average_bitrate']
        y = resoluion['t_average_vmaf']
        ax[2, 0].plot(x, y, label=i)
        ax[2, 0].set_xlabel('Average_Bitrate')
        ax[2, 0].set_ylabel('Average_VMAF')
        ax[2, 0].title.set_text(uniq[2])
                        
# finding x and y axis for convex hull

x = summerize['t_average_bitrate']
y = summerize['t_average_vmaf']
x_lst = pd.Series(x).values
y_lst = pd.Series(y).values
x_lst_S = x_lst.reshape(-1,1)  # column vector; -1 infers the number of rows
y_lst_S = y_lst.reshape(-1,1)
xy = np.concatenate((x_lst_S, y_lst_S), axis=1)


## plot convex hull for axis

hull = ConvexHull(xy)
for sample in hull.simplices:
    ax[2, 0].plot(xy[sample, 0], xy[sample, 1], 'r', label=i, linewidth=2)

        

###################################ax[0, 1]###############################
summerize=data_set_c[data_set_c['c_content_category']==uniq[3]]
resoluion_lst = summerize['e_width'].unique()

for i in resoluion_lst:
        resoluion = summerize[summerize['e_width']==i]
        resoluion = resoluion.sort_values(by=['t_average_bitrate'])
        x = resoluion['t_average_bitrate']
        y = resoluion['t_average_vmaf']
        ax[0, 1].plot(x, y, label=i)
        ax[0, 1].set_xlabel('Average_Bitrate')
        ax[0, 1].set_ylabel('Average_VMAF')
        ax[0, 1].title.set_text(uniq[3])
                        
# finding x and y axis for our convex hull

x = summerize['t_average_bitrate']
y = summerize['t_average_vmaf']
x_lst = pd.Series(x).values
y_lst = pd.Series(y).values
x_lst_S = x_lst.reshape(-1,1)  # column vector; -1 infers the number of rows
y_lst_S = y_lst.reshape(-1,1)
xy = np.concatenate((x_lst_S, y_lst_S), axis=1)


## plot convex hull for axis

hull = ConvexHull(xy)
for sample in hull.simplices:
    ax[0, 1].plot(xy[sample, 0], xy[sample, 1], 'r', label=i, linewidth=2)


###################################ax[1, 1]###############################
summerize=data_set_c[data_set_c['c_content_category']==uniq[4]] 
resoluion_lst = summerize['e_width'].unique() 

for i in resoluion_lst:
        resoluion = summerize[summerize['e_width']==i]
        resoluion = resoluion.sort_values(by=['t_average_bitrate'])
        x = resoluion['t_average_bitrate']
        y = resoluion['t_average_vmaf']
        ax[1, 1].plot(x, y, label=i)
        ax[1, 1].set_xlabel('Average_Bitrate')
        ax[1, 1].set_ylabel('Average_VMAF')
        ax[1, 1].title.set_text(uniq[4])
                        
# finding x and y axis for our convex hull

x = summerize['t_average_bitrate']
y = summerize['t_average_vmaf']
x_lst = pd.Series(x).values
y_lst = pd.Series(y).values
x_lst_S = x_lst.reshape(-1,1)  # column vector; -1 infers the number of rows
y_lst_S = y_lst.reshape(-1,1)
xy = np.concatenate((x_lst_S, y_lst_S), axis=1)


## plot convex hull for axis

hull = ConvexHull(xy)
for sample in hull.simplices:
    ax[1, 1].plot(xy[sample, 0], xy[sample, 1], 'r', label=i, linewidth=2)

    

###################################ax[2, 1]###############################
summerize=data_set_c[data_set_c['c_content_category']==uniq[5]]
resoluion_lst = summerize['e_width'].unique()

for i in resoluion_lst:
        resoluion = summerize[summerize['e_width']==i]
        resoluion = resoluion.sort_values(by=['t_average_bitrate'])
        x = resoluion['t_average_bitrate']
        y = resoluion['t_average_vmaf']
        ax[2, 1].plot(x, y, label=i)
        ax[2, 1].set_xlabel('Average_Bitrate')
        ax[2, 1].set_ylabel('Average_VMAF')
        ax[2, 1].title.set_text(uniq[5])
                        
# finding x and y axis for our convex hull

x = summerize['t_average_bitrate']
y = summerize['t_average_vmaf']
x_lst = pd.Series(x).values
y_lst = pd.Series(y).values
x_lst_S = x_lst.reshape(-1,1)  # column vector; -1 infers the number of rows
y_lst_S = y_lst.reshape(-1,1)
xy = np.concatenate((x_lst_S, y_lst_S), axis=1)

plt.legend(title='Resolution') 
## plot convex hull for axis
hull = ConvexHull(xy)
for sample in hull.simplices:
    ax[2, 1].plot(xy[sample, 0], xy[sample, 1], 'r', label=i, linewidth=2)


plt.tight_layout()

# Conclusion

**I built a parametric model that can
predict the convex hull using encoding parameters as well
as content information such as scene change information.
In addition, I compared different types of regression
models to find the model that best predicts VMAF. To do
so, I first processed the raw data by removing outliers
and handling missing and duplicate values.**

**Choosing the right VMAF based on the convex hull remains to be applied on a real test-bed.** 