### Sample creation
Here we create a sample of roughly 200,000 quotes per year, which is useful for testing our programs without running them on the full corpus. Each year's file is read in chunks, and the same number of random rows is drawn from every chunk, so that rows are sampled approximately uniformly within each year and add up to the desired total.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# List of years available and used
years = [2015,2016,2017,2018,2019,2020]

# Size of each year's (compressed) file in GB
year_size = np.array([3.3, 2.3, 5.2, 4.8, 3.6, 0.9])

# Number of rows in the 2020 file (counted once "by hand", since this file is small enough)
rows2020 = 5244449

# Estimated number of rows in each file, assuming rows scale linearly with file size
# (the 2020 file, 0.9 GB, serves as the reference)
year_rows = year_size * rows2020 / 0.9

# Which corresponds to n chunks of one million rows each
chunks_number = np.rint(year_rows / 1e6).astype(int)

# So we take this many random rows from each chunk to get about 200,000 rows per year
rows_in_chunk = np.rint(2e5 / chunks_number).astype(int)

In [3]:
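# Since year 2017 (index 2) crashes with one-million-row chunks, even when run alone,
# we use different parameters for it: 300 chunks x 666 rows, about 200,000 rows
rows_in_chunk[2] = 666
chunks_number[2] = 300

# And we choose a smaller chunk size for this year (integers, as expected by `chunksize`)
chunk_sizes = [1_000_000, 1_000_000, 100_000, 1_000_000, 1_000_000, 1_000_000]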

In [4]:
rows_in_chunk, chunks_number

(array([10526, 15385,   666,  7143,  9524, 40000]),
 array([ 19,  13, 300,  28,  21,   5]))
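
As a quick sanity check on these numbers, the expected sample size per year is simply the product of the two arrays. The sketch below hard-codes the values printed above so that it runs on its own; it only illustrates the arithmetic and is not part of the original pipeline.

```python
import numpy as np

years = [2015, 2016, 2017, 2018, 2019, 2020]

# Values copied from the output above
rows_in_chunk = np.array([10526, 15385, 666, 7143, 9524, 40000])
chunks_number = np.array([19, 13, 300, 28, 21, 5])

# Expected number of sampled rows per year if the chunk estimates are exact
expected_rows = rows_in_chunk * chunks_number

for year, n in zip(years, expected_rows):
    print(f"{year}: about {int(n):,} sampled rows")
```

These totals hold only if the estimated number of chunks matches the real one. The 2020 file, for instance, has 5,244,449 rows, which means six chunks of at most one million rows rather than five, so a loop that samples from every chunk would collect somewhat more than 200,000 rows for that year.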

In [5]:
# Data files all have the following columns
Index = ['quoteID', 'quotation', 'speaker', 'qids', 'date', 'numOccurrences','probas', 'urls', 'phase']

In [6]:
# Useful function when testing loops
def process_chunk(chunk, year):
    print(f'Processing chunk with {len(chunk)} rows, in file from year {year}')
    # print(chunk.columns)
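
In the same spirit, the per-chunk sampling can be tried on small in-memory data before touching the real files. The following is a minimal sketch under stated assumptions: `sample_from_chunks` is a hypothetical helper (not part of the original notebook), the toy DataFrame stands in for one year's file, and the `seed` argument is an addition for reproducibility.

```python
import pandas as pd

def sample_from_chunks(chunks, rows_per_chunk, seed=0):
    """Draw up to `rows_per_chunk` random rows from each chunk, then concatenate once."""
    samples = []
    for i, chunk in enumerate(chunks):
        # The last chunk may hold fewer rows than requested
        n = min(rows_per_chunk, len(chunk))
        # One reproducible seed per chunk (assumption: reproducibility is wanted)
        samples.append(chunk.sample(n=n, random_state=seed + i))
    return pd.concat(samples, ignore_index=True)

# Toy data: 25 rows, split into chunks of 10, 10 and 5 rows
toy = pd.DataFrame({"quoteID": range(25), "quotation": [f"q{i}" for i in range(25)]})
toy_chunks = (toy.iloc[start:start + 10] for start in range(0, len(toy), 10))

sample = sample_from_chunks(toy_chunks, rows_per_chunk=7)
print(len(sample))  # 7 + 7 + 5 = 19
```

Capping the request at `len(chunk)` matters because `DataFrame.sample` raises an error when asked for more rows than are available without replacement.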

In [7]:
print("C'est parti, mon kiki !\n")  # "Here we go!"
# Collect the per-chunk samples in a list and concatenate once at the end,
# which avoids copying the growing DataFrame at every iteration
samples = []
for year, chunk_size, n_rows in zip(years, chunk_sizes, rows_in_chunk):
    with pd.read_json(f'./Quotebank/quotes-{year}.json.bz2', lines=True, compression='bz2', chunksize=int(chunk_size)) as df_reader:
        print(f"Start year {year} with chunks of size {int(chunk_size)}")
        for chunk in df_reader:
            # process_chunk(chunk, year)
            # The last chunk of a file may hold fewer rows than requested
            samples.append(chunk.sample(min(n_rows, len(chunk))))
        print(f"Done with year {year}")
df = pd.concat(samples, ignore_index=True)
df.to_json("./Quotebank/Sample.json.bz2", compression="bz2", lines=True, orient="records")
print("\nTout est bien qui finit bien")  # "All's well that ends well"

C'est parti, mon kiki !

Start year 2015 with chunks of size 1000000
Done with year 2015
Start year 2016 with chunks of size 1000000
Done with year 2016
Start year 2017 with chunks of size 100000
Done with year 2017
Start year 2018 with chunks of size 1000000
Done with year 2018
Start year 2019 with chunks of size 1000000
Done with year 2019
Start year 2020 with chunks of size 1000000
Done with year 2020

Tout est bien qui finit bien
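
Once the sample is written, it can be worth checking how many rows each year actually received. The sketch below is one possible check, not part of the original notebook: `rows_per_year` is a hypothetical helper, and it assumes the `date` column holds datetimes or date strings that `pd.to_datetime` can parse. It is shown on toy data so that it runs without the Quotebank files.

```python
import pandas as pd

def rows_per_year(df, date_column="date"):
    """Count rows per calendar year, based on a date-like column."""
    years = pd.to_datetime(df[date_column]).dt.year
    return years.value_counts().sort_index()

# Toy data standing in for the sample (the date format is an assumption)
toy = pd.DataFrame({
    "date": ["2015-03-01 10:00:00", "2015-07-12 08:30:00", "2020-01-05 00:00:00"],
})

counts = rows_per_year(toy)
print(counts)  # two rows for 2015, one for 2020
```

On the real sample, the same function could be applied to the DataFrame `df` built above, and the counts compared with the target of about 200,000 rows per year.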
