# Data Check!

This notebook checks our data for two required attributes before we go any further. Given our data examples in Ceph storage, the flake analysis tool in the [AI Library](https://gitlab.com/opendatahub/ai-library) requires each example to meet the following two conditions:


* `example["status"] == "failure"`
* `example["log"] != "[]"`
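
The two conditions can be combined into a single predicate. The following is a minimal sketch, not code from the AI Library: the helper name `is_usable` and the sample records are made up for illustration, and it assumes the log is stored as a string, so an empty log is the literal `"[]"`.

```python
def is_usable(example: dict) -> bool:
    """Return True when an example is a failure with a non-empty log."""
    # Assumption: the log is serialized as a string, so "[]" means "no log".
    is_failure = example.get("status") == "failure"
    has_log = example.get("log", "[]") != "[]"
    return is_failure and has_log


# Hypothetical records, invented for illustration only.
examples = [
    {"status": "failure", "log": '["timeout waiting for host"]'},
    {"status": "failure", "log": "[]"},
    {"status": "success", "log": '["all checks passed"]'},
]

usable = [ex for ex in examples if is_usable(ex)]
print(f"{len(usable)} usable example(s) out of {len(examples)}")
```

Only the first record satisfies both conditions; the second fails the log check and the third fails the status check.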

We can see in the flake analysis project [here](https://gitlab.com/opendatahub/ai-library/-/blob/master/flakes_train/bots/learn/data.py#L26-27) that non-failures are filtered out during data loading, before model training. We can also see [here](https://gitlab.com/opendatahub/ai-library/-/blob/master/flakes_train/bots/learn/extractor.py#L103-112) that the data extraction step expects a non-empty string, which it transforms using a count vectorizer. If all our strings are empty, there will be no data to encode or to train our model with. Therefore, we have to ensure that our dataset contains examples that meet both criteria, that is, failed tests with non-empty log messages.


Below, we iterate through all available data, count how often each condition holds, and determine how much usable data we have to train our flake analysis model.

**Modification for RHV dataset**: Further analysis of the RHV dataset has shown that it uses the string "FAILED" instead of "failure". We make that substitution in the code below.
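
This notebook handles the RHV spelling by comparing against `"FAILED"` directly. As an alternative, the status could be normalized once so that downstream code only ever sees `"failure"`. The sketch below illustrates that idea; `FAILURE_STATUSES` and `normalize_status` are hypothetical names, not part of the AI Library.

```python
# Assumption: these are the only two spellings of a failed status we have seen.
FAILURE_STATUSES = {"failure", "FAILED"}


def normalize_status(example: dict) -> dict:
    """Return a copy of the example with any failure spelling mapped to "failure"."""
    normalized = dict(example)  # shallow copy, so the input is left untouched
    if normalized.get("status") in FAILURE_STATUSES:
        normalized["status"] = "failure"
    return normalized


# Hypothetical RHV-style record, invented for illustration only.
rhv_example = {"status": "FAILED", "log": '["connection refused"]'}
print(normalize_status(rhv_example)["status"])
```

Any status outside `FAILURE_STATUSES` is passed through unchanged, so non-failures are still filtered out later.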

In [20]:
import os
import boto3
import json
import tempfile
import urllib3
urllib3.disable_warnings()


# SET PARAMETERS TO ACCESS S3 BACKEND
s3Path = 'ccit'
s3_endpoint_url = 'https://s3.upshift.redhat.com/'
s3_bucket_name = 'DH-PLAYPEN'


# CREATE CONNECTION TO S3 BACKEND
session = boto3.Session()
s3 = session.resource('s3', endpoint_url=s3_endpoint_url, verify=False)


# DOWNLOAD TRAINING DATA
objects = []
bucket = s3.Bucket(name=s3_bucket_name)

# get list of all available objects
for obj in bucket.objects.filter(Prefix=s3Path):
    objects.append(obj.key)

# Count the data points that have a non-empty log, that have "status" == "FAILED", and that have both
count_both = 0
count_failure = 0
count_log  = 0
for key in objects:
    obj = s3.Object(s3_bucket_name, key)
    contents = obj.get()['Body'].read().decode('utf-8')
    if contents:
        jcontents = json.loads(contents)
        if jcontents["log"] != "[]" and jcontents["status"] == "FAILED":
            count_both += 1
        
        if jcontents["log"] != "[]":
            count_log += 1
        
        if jcontents["status"] == "FAILED":
            count_failure += 1

print(f'{count_log} non-empty files out of {len(objects)}')
print(f'{count_failure} failures out of {len(objects)}')
print(f'{count_both} non-empty failures out of {len(objects)}')

4314 non-empty files out of 15292
1316 failures out of 15292
695 non-empty failures out of 15292


Great! It looks like we have ~700 usable examples (695 non-empty failures). We can now move forward with the testing.
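
To put the counts reported above in proportion, the short sketch below converts them to percentages. The numbers are copied from the notebook output; the helper name `share` is made up for illustration.

```python
# Counts copied from the notebook output above.
total_objects = 15292
non_empty_logs = 4314
failures = 1316
non_empty_failures = 695


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


print(f"{share(non_empty_logs, total_objects):.1f}% of objects have a non-empty log")
print(f"{share(failures, total_objects):.1f}% of objects are failures")
print(f"{share(non_empty_failures, total_objects):.1f}% of objects are usable")
print(f"{share(non_empty_failures, failures):.1f}% of failures have a non-empty log")
```

The last figure is the most relevant one for data collection: it shows how many failed runs are lost only because their log is empty.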