# Data Verification

While importing the transcript data from JSON into the database and Elasticsearch, we found that quite a few episodes have exactly the same transcript.
We suspect this is an error in the data, so we created this notebook to verify it.

To that end, we performed the following steps from scratch:
* We downloaded a fresh copy of the original dataset (i.e., `podcasts-no-audio-13GB.zip`).
* We decompressed the `zip` file.
* It holds three main `tar` files that contain the transcript JSON files.
* We created three separate folders, `part0to2`, `part3to5`, and `part6to7`, and decompressed each `tar` file into the corresponding folder. This avoids any potential overlap of data between those `tar` files.

The analysis in this notebook is based on the folder structure resulting from the above steps.
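
The extraction step described above could be reproduced along the following lines. This is only a minimal sketch: the helper name `extract_parts` is hypothetical and the archive paths passed to it are placeholders, since the actual `tar` file names are not listed in this notebook.

```python
import tarfile
from pathlib import Path
from typing import Dict, List


def extract_parts(archives: Dict[str, str], dest_root: str) -> List[Path]:
    """Extract each tar archive into its own sub-folder of dest_root.

    `archives` maps a target folder name (e.g. "part0to2") to the path of
    the tar file that should be unpacked there (paths are assumptions).
    """
    out_dirs = []
    for folder_name, tar_path in archives.items():
        out_dir = Path(dest_root) / folder_name
        out_dir.mkdir(parents=True, exist_ok=True)
        # Each archive gets a separate folder, so files from different
        # archives can never overwrite each other.
        with tarfile.open(tar_path) as tar:
            tar.extractall(out_dir)
        out_dirs.append(out_dir)
    return out_dirs
```

Keeping one folder per archive means a file that appears in two archives shows up twice in the listing instead of being silently overwritten.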

In [15]:
import os
import json
import typing
from glob import glob

ROOT_DIR = os.path.dirname(os.path.dirname(os.getcwd()))
print(f"ROOT_DIR: {ROOT_DIR}")
DATA_DIR = os.path.join(ROOT_DIR, 'data', 'podcasts-transcript')
print(f"DATA_DIR: {DATA_DIR}")
DATA_PREFIX = os.path.join("spotify-podcasts-2020", "podcasts-transcripts")

ROOT_DIR: /home/erik/Projects/KTH/dd2476-podcast-search
DATA_DIR: /home/erik/Projects/KTH/dd2476-podcast-search/data/podcasts-transcript


**List all files in the part folders**

In [16]:
part0to2_path = os.path.join(DATA_DIR, "part0to2", DATA_PREFIX)
part0to2 = [y for x in os.walk(part0to2_path) for y in glob(os.path.join(x[0], '*.json'))]

part3to5_path = os.path.join(DATA_DIR, "part3to5", DATA_PREFIX)
part3to5 = [y for x in os.walk(part3to5_path) for y in glob(os.path.join(x[0], '*.json'))]

part6to7_path = os.path.join(DATA_DIR, "part6to7", DATA_PREFIX)
part6to7 = [y for x in os.walk(part6to7_path) for y in glob(os.path.join(x[0], '*.json'))]

In [17]:
file_list = part0to2 + part3to5 + part6to7
print("Number of files in list ", len(file_list))

file_set = set(file_list)
print("Number of files in set ", len(file_set))

Number of files in list  105360
Number of files in set  105360


In [18]:
json_files = [os.path.basename(f) for f in file_list]
print(len(json_files), len(set(json_files)))

105360 105360


### As we can see, no two JSON files share the same name; each file contains the transcript of one episode
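
If the two counts had differed, it would be useful to know which names collide. A possible sketch using `collections.Counter` follows; the helper name `find_duplicate_names` is hypothetical and not part of the original notebook.

```python
import os
from collections import Counter
from typing import Dict, List


def find_duplicate_names(paths: List[str]) -> Dict[str, int]:
    """Return base file names that occur more than once, with their counts."""
    counts = Counter(os.path.basename(p) for p in paths)
    return {name: n for name, n in counts.items() if n > 1}
```

Calling `find_duplicate_names(file_list)` should return an empty dictionary when every file name is unique, which is what the counts above indicate.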


**Inspect show id=`4j5clif9VEUY2iFGzAaEDe`**

In [47]:
show_id = '4j5clif9VEUY2iFGzAaEDe'
ep_list = [f for f in file_list if show_id in f]
print(len(ep_list))

30


In [29]:
def load_episode_transcript(fpath: str) -> typing.Tuple[int, str]:
    """Return the number of transcript segments and their concatenated text."""
    with open(fpath, 'r') as f:
        data = json.load(f)

    transcripts = []
    for res in data['results']:
        # Skip results without alternatives before indexing into the list.
        alternatives = res.get('alternatives')
        if not alternatives:
            continue
        if len(alternatives) > 1:
            print("More than 1 alternative found")

        # Only the first alternative is used.
        alternative = alternatives[0]
        if 'transcript' in alternative:
            transcripts.append(alternative['transcript'])
    return len(transcripts), "".join(transcripts)

In [48]:
count, transcript = load_episode_transcript(ep_list[0])
print(count, transcript)

FileNotFoundError: [Errno 2] No such file or directory: '/home/erik/Projects/KTH/dd2476-podcast-search/data/podcasts-transcript/part3to5/spotify-podcasts-2020/podcasts-transcripts/4/J/show_4j5clif9VEUY2iFGzAaEDe/6ZpoMAnKIzs7BUlAevP22p.json'
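
The duplicate check that motivates this notebook could be sketched as follows: hash each transcript and group the files whose hashes match. This is an illustration under assumptions, not the method used in the project. The helper name `group_identical_transcripts` is hypothetical, and it expects a dictionary from file path to transcript text that would have to be built first with a loader such as `load_episode_transcript`.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def group_identical_transcripts(transcripts: Dict[str, str]) -> List[List[str]]:
    """Group file paths whose transcript text is exactly identical.

    `transcripts` maps a file path to its transcript text.
    Only groups with more than one member are returned.
    """
    groups = defaultdict(list)
    for path, text in transcripts.items():
        # Identical texts always produce the same SHA-256 digest.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Hashing keeps memory use low for a large collection, because only the digest and the path need to be kept per file rather than every full transcript.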

**Separate part**

In [42]:
data_path = os.path.join(DATA_DIR, DATA_PREFIX)
json_file_list = [y for x in os.walk(data_path) for y in glob(os.path.join(x[0], '*.json'))]
len(json_file_list)

105360

In [43]:
json_file_list = [os.path.relpath(f, data_path) for f in json_file_list]

In [44]:
json_file_list[0]

'4/J/show_4Jocfk9mf9D876514gZHet/03R2P2RnGOOZ57hGoXAT6z.json'
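
A relative path of this shape can be split into its identifiers, which allows matching a show exactly instead of by substring. The sketch below assumes that the folder is named `show_<show id>` and that the file name without its extension is the episode ID; the helper name `parse_ids` is hypothetical.

```python
import os
from typing import Tuple


def parse_ids(rel_path: str) -> Tuple[str, str]:
    """Split a relative transcript path into (show_id, episode_id).

    Assumes the layout '<c1>/<c2>/show_<show_id>/<episode_id>.json'.
    """
    show_dir, file_name = rel_path.split("/")[-2:]
    # Drop the "show_" prefix if present (assumed naming convention).
    show_id = show_dir[len("show_"):] if show_dir.startswith("show_") else show_dir
    episode_id = os.path.splitext(file_name)[0]
    return show_id, episode_id
```

For the path shown above, this yields the show ID `4Jocfk9mf9D876514gZHet` and the episode ID `03R2P2RnGOOZ57hGoXAT6z`.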

In [46]:
with open(os.path.join(DATA_DIR, "spotify-podcasts-2020", "json_file_list.txt"), "w") as f:
    for jf in json_file_list:
        f.write(jf + "\n")