# Preprocessing

We use this file once to preprocess the entire `quotebank` dataset. The idea is to shrink the data so that it only contains quotes that may interest us later. At the end, we check whether this step was actually useful.

## Preparation

In [1]:
import time
import pandas as pd
from os import listdir

We start by defining the environment (constants, directories) that the filtering step relies on.

In [2]:
CHUNK_SIZE: int    = 1 << 16      # size of the chunks we process instead of the entire file at once
DATA_DIR_PATH: str = 'quotebank'  # directory containing the initial compressed quotebank dataset
OUT_DIR_PATH: str  = 'out'        # output directory where the program will dump the filtered files

In [3]:
!mkdir -p out

In [4]:
data_files = listdir(DATA_DIR_PATH)
data_files.sort()

## Computation

Here's the bulk of the program. The main idea is as follows:

For each file of the quotebank dataset do:
- break the file into multiple chunks for easier processing (the entire file would probably not fit in memory)
- for each chunk do:
  - create a new dataframe `sports_quotes` that consists of the quotes with the substring "sport" in at least one of their urls
  - write the filtered quotes to a `.csv` file
- print status update
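
The filtering step on its own can be sketched on a toy dataframe. This is only an illustration: the helper name `keep_sport_quotes` and the `example.com` urls are made up for the example, and only the `quoteID` and `urls` columns of the real data are reproduced.

```python
import pandas as pd


def keep_sport_quotes(chunk: pd.DataFrame) -> pd.DataFrame:
    """Keep the rows whose `urls` list has at least one url containing 'sport'."""
    mask = chunk["urls"].apply(lambda urls: any("sport" in url for url in urls))
    return chunk[mask]


# Hypothetical chunk with three quotes (urls invented for the example)
toy_chunk = pd.DataFrame({
    "quoteID": ["a", "b", "c"],
    "urls": [
        ["https://example.com/sports/match.html"],
        ["https://example.com/politics/vote.html"],
        ["https://example.com/news/1", "https://example.com/transport/2"],
    ],
})

sport_rows = keep_sport_quotes(toy_chunk)
print(sport_rows["quoteID"].tolist())
```

Quote `c` is kept as well because "transport" contains the substring "sport": a plain substring test keeps some quotes that are not about sports, which is acceptable for a first coarse filter.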


In [23]:
start_time = time.time()

for filename in data_files:
    file_path: str = '{}/{}'.format(DATA_DIR_PATH, filename)
    comp_ext:  str = file_path.split('.')[-1]
    out_path:  str = '{}/sport-{}.csv'.format(OUT_DIR_PATH, filename.split('.')[0])
    
    with pd.read_json(file_path, lines=True, compression=comp_ext, chunksize=CHUNK_SIZE) as df_reader:
        
        i: int = 0 # keeps track of the chunk number
        total_lines: int = 0 # keeps track of output file length
        export_header: bool = True
        
        for chunk in df_reader:   
            if i % 16 == 0:  # print a status line every 16 chunks
                print(f"  - Processing chunk #{i} (size = {chunk.shape[0]}) for file '{file_path}'")
            
            # keep only lines containing the 'sport' substring in the url(s)
            sports_quotes = chunk[[any('sport' in url for url in url_list) for url_list in chunk.urls]]

            sports_quotes.to_csv(out_path, mode='a', header=export_header)
            export_header = False

            total_lines += sports_quotes.shape[0]
            i += 1

        # summary at the end of a file
        print(f">>> Processed a total of {i} chunks for file '{file_path}' => total of {total_lines} lines out of {i * CHUNK_SIZE}")


print("--- %s seconds ---" % (time.time() - start_time))

  - Processing chunk #0 (size = 65536) for file 'quotebank/quotes-2015.json.bz2'
  - Processing chunk #16 (size = 65536) for file 'quotebank/quotes-2015.json.bz2'
  - Processing chunk #32 (size = 65536) for file 'quotebank/quotes-2015.json.bz2'
  - Processing chunk #48 (size = 65536) for file 'quotebank/quotes-2015.json.bz2'
  - Processing chunk #64 (size = 65536) for file 'quotebank/quotes-2015.json.bz2'
  - Processing chunk #80 (size = 65536) for file 'quotebank/quotes-2015.json.bz2'
  - Processing chunk #96 (size = 65536) for file 'quotebank/quotes-2015.json.bz2'
  - Processing chunk #112 (size = 65536) for file 'quotebank/quotes-2015.json.bz2'
  - Processing chunk #128 (size = 65536) for file 'quotebank/quotes-2015.json.bz2'
  - Processing chunk #144 (size = 65536) for file 'quotebank/quotes-2015.json.bz2'
  - Processing chunk #160 (size = 65536) for file 'quotebank/quotes-2015.json.bz2'
  - Processing chunk #176 (size = 65536) for file 'quotebank/quotes-2015.json.bz2'
  - Processi

## Statistics

First of all, we observe that the whole computation took just over 2 hours (7471 seconds, converted to minutes below), which is reasonable for a dataset of this size (19 GB compressed). But was it useful?

In [5]:
7471 / 60

124.51666666666667

To answer the previous daunting question, we finish by compressing the produced `.csv` files into `.bz2` files for two reasons:
1. so that we are working with smaller files when we later use our filtered dataset
2. so that we can compare the filtered size with the original quotebank size

In [32]:
!find out -iname '*.csv' -exec bzip2 -kzv {} \;
!mkdir out_bz2
!find out -iname '*.csv.bz2' -exec mv {} out_bz2 \;

  out/sport-quotes-2015.csv:  4.201:1,  1.904 bits/byte, 76.20% saved, 2586161046 in, 615620062 out.
  out/sport-quotes-2016.csv:  4.384:1,  1.825 bits/byte, 77.19% saved, 2052387478 in, 468111951 out.
  out/sport-quotes-2017.csv:  4.561:1,  1.754 bits/byte, 78.07% saved, 5103764884 in, 1119072564 out.
  out/sport-quotes-2020.csv:  4.717:1,  1.696 bits/byte, 78.80% saved, 578289196 in, 122593618 out.
  out/sport-quotes-2019.csv:  4.671:1,  1.713 bits/byte, 78.59% saved, 2621718521 in, 561288174 out.
  out/sport-quotes-2018.csv:  4.769:1,  1.678 bits/byte, 79.03% saved, 4473331888 in, 938035681 out.


As a final step, we compare the size of the original compressed dataset (19GB) and the output compressed dataset (3.6GB).

In [6]:
!du -h quotebank    # initial '.json.bz2' files
!du -h out          # output '.csv' files
!du -h out_bz2      # compressed output ('.csv.bz2')

 19G	quotebank
 16G	out
3.6G	out_bz2
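
As a sanity check, the `du` figures for `out` and `out_bz2` can be recomputed from the byte counts that `bzip2 -v` printed earlier. The sketch below copies those counts by hand and assumes that `du -h` reports binary units (powers of 1024).

```python
# (csv bytes in, bz2 bytes out) per year, copied from the bzip2 -v report
bzip2_report = {
    2015: (2586161046, 615620062),
    2016: (2052387478, 468111951),
    2017: (5103764884, 1119072564),
    2018: (4473331888, 938035681),
    2019: (2621718521, 561288174),
    2020: (578289196, 122593618),
}

total_csv = sum(size_in for size_in, _ in bzip2_report.values())
total_bz2 = sum(size_out for _, size_out in bzip2_report.values())

GIB = 1 << 30  # assumption: du -h uses powers of 1024
print(f"out:     {total_csv / GIB:.1f} GiB")
print(f"out_bz2: {total_bz2 / GIB:.1f} GiB")
```

Both totals agree with the rounded values reported by `du`.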


In [7]:
3.6 / 19

0.18947368421052632

Naively, we deduce that our filtered dataset is about 5 times smaller than the original, i.e. that we removed 81% of the data. However, this is only an estimate. To see why, let's find a quote that appears in both files: the original `json` file (here the one from milestone 1) and the final `csv` file.

In [8]:
!head -n 16 quotes-2019-nytimes.json | grep sport

{"quoteID": "2019-01-13-028337", "quotation": "It's crazy. I can't even really explain it right now.", "speaker": "Todd Gurley II", "qids": ["Q7812406"], "date": "2019-01-13 15:55:44", "numOccurrences": 1, "probas": [["Todd Gurley II", "0.5824"], ["None", "0.413"], ["C.J. Anderson", "0.0046"]], "urls": ["https://www.nytimes.com/2019/01/13/sports/rams-nfl-playoffs-cj-anderson.html"], "phase": "E"}


We found a quote containing the word "sport" in its url among the first 16 quotes. Let's look for it in our final `csv` from 2019. For efficiency reasons, we only search the first $n = 2^{10}$ lines.

In [9]:
!head -n 1024 out/sport-quotes-2019.csv | grep "2019-01-13-028337"  # grep with the previous quoteID

835,2019-01-13-028337,It's crazy. I can't even really explain it right now.,Todd Gurley II,['Q7812406'],2019-01-13 15:55:44,1,"[['Todd Gurley II', '0.5824'], ['None', '0.413'], ['C.J. Anderson', '0.0046']]",['https://www.nytimes.com/2019/01/13/sports/rams-nfl-playoffs-cj-anderson.html'],E


By comparing the same quote written in `json` and `csv`, we realise that the first contains 400 characters and the second only 290. This is due to the fact that the `json` quotes are stored inside `json` objects with repetitive object fields (e.g. "quoteID", "speaker"). Thus even if the program did not filter any lines, the output `csv` would still be smaller.

:information_source: As a final note, `bz2` can probably compress `json` much more easily than `csv`, so comparing compressed file sizes is definitely not the way to go.

In [10]:
!head -n 16 quotes-2019-nytimes.json | grep sport | wc -m
!head -n 1024 out/sport-quotes-2019.csv | grep "2019-01-13-028337" | wc -m

     400
     290
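
The same effect can be reproduced in Python by serialising the quote once as `json` and once as a `csv` row. This is a rough sketch using only the standard library: it does not go through `pandas`, so the exact character counts may differ slightly from the `wc -m` output (no index column, different line ending).

```python
import csv
import io
import json

# The 2019 quote found in both files, as a Python dict
record = {
    "quoteID": "2019-01-13-028337",
    "quotation": "It's crazy. I can't even really explain it right now.",
    "speaker": "Todd Gurley II",
    "qids": ["Q7812406"],
    "date": "2019-01-13 15:55:44",
    "numOccurrences": 1,
    "probas": [["Todd Gurley II", "0.5824"], ["None", "0.413"], ["C.J. Anderson", "0.0046"]],
    "urls": ["https://www.nytimes.com/2019/01/13/sports/rams-nfl-playoffs-cj-anderson.html"],
    "phase": "E",
}

# json repeats every field name for every quote
json_line = json.dumps(record)

# csv stores the values only; field names appear once in the file header
buffer = io.StringIO()
csv.writer(buffer).writerow(record.values())
csv_line = buffer.getvalue()

print(len(json_line), len(csv_line))
```

The `csv` row is shorter even though it carries exactly the same values, which is why raw file sizes overstate the effect of the filter.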


Thus the only reasonable way to estimate the efficiency of our program is to look at the line counts it printed during the computation. If we take only the quotes from 2019, we observe the following:
```text
>>> Processed a total of 333 chunks for file 'quotebank/quotes-2019.json.bz2' => total of 2922983 lines out of 21823488
```

In [11]:
2922983 / 21823488

0.1339374805713917
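
One caveat: the denominator printed by the loop is `i * CHUNK_SIZE`, which counts the last chunk as if it were full, so it is an upper bound on the number of input lines. The sketch below bounds the kept fraction for 2019 under the two extreme cases (last chunk full, or last chunk holding a single line). The numbers are copied from the summary line, and the variable names are only illustrative.

```python
CHUNK_SIZE = 1 << 16   # same value as in the Preparation section
n_chunks = 333         # from the 2019 summary line
kept_lines = 2922983   # from the 2019 summary line

max_input = n_chunks * CHUNK_SIZE            # every chunk is full
min_input = (n_chunks - 1) * CHUNK_SIZE + 1  # last chunk holds a single line

lowest_kept = kept_lines / max_input
highest_kept = kept_lines / min_input
print(f"kept between {lowest_kept:.2%} and {highest_kept:.2%} of the 2019 quotes")
```

The two bounds differ by less than a tenth of a percentage point, so the approximation does not change the conclusion.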

A better estimate, at least for 2019, is therefore that we removed 87% of the quotes (instead of 81%). We think that we achieved both goals stated in the [Preprocessing](#Preprocessing) introduction, namely:
1. reduce the dataset to a reasonable size
2. only keep information that might interest us later