# Pre-processing quotes files

This notebook converts the raw quotes files into processed files (tokenized, annotated with speaker gender, etc.).

In [13]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.1.2-cp38-cp38-win_amd64.whl (24.0 MB)
Installing collected packages: gensim
Successfully installed gensim-4.1.2




In [1]:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
import timeit
import bz2
import datetime
import os
from src.prep_utilities import * 
from src.prep_pipeline import *

%matplotlib inline
%load_ext autoreload
%autoreload 2
!python ./src/load_models_data.py
data_folder = './data/'

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')




## Load speaker attributes (processed)

In [2]:
# Load parquet file
speaker_attributes = pd.read_parquet(data_folder + 'speaker_attributes_processed.parquet')

## Pre-process quotes by chunks

Settings: the year to process and the chunk size.

In [3]:
year = 2015 #change this for each year
data_file = 'quotes-'+ str(year)+'.json.bz2'
data_path = data_folder + data_file
chunk_size = 1000

Process the quotes chunk by chunk and write the result to a single `.parquet` file.

In [4]:
# Load by chunks
f = bz2.open(data_path, "rb")
data=pd.read_json(f, lines=True, chunksize=chunk_size)
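Passing `chunksize` makes `pd.read_json` return an iterator of DataFrames rather than one large DataFrame, so only one chunk is held in memory at a time. The sketch below illustrates this on a small toy file; the file name, the two fields and the chunk size of 2 are assumptions made for the illustration, not values taken from the real quotes files.

```python
import bz2
import json
import os
import tempfile

import pandas as pd

# Toy stand-in for a quotes file (hypothetical content):
# five JSON records, one per line, bz2-compressed.
records = [{"quoteID": f"q{i}", "quotation": f"toy quote number {i}"} for i in range(5)]
toy_path = os.path.join(tempfile.mkdtemp(), "quotes-toy.json.bz2")
with bz2.open(toy_path, "wt", encoding="utf-8") as out:
    for record in records:
        out.write(json.dumps(record) + "\n")

# Read the file back two lines at a time.
chunk_lengths = []
with bz2.open(toy_path, "rb") as toy_file:
    reader = pd.read_json(toy_file, lines=True, chunksize=2)
    for toy_chunk in reader:
        # Each chunk is an ordinary DataFrame with at most `chunksize` rows
        chunk_lengths.append(len(toy_chunk))

print(chunk_lengths)
```

Opening the file in a `with` block closes the handle once the last chunk has been consumed.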

In [None]:
# Pre-process the quotes chunk by chunk and append each chunk to one Parquet file

start = timeit.default_timer()
start_progress = timeit.default_timer()
progress_step = 100
pqwriter = None  # created on the first chunk, once the schema is known

try:
    for i_chunk, chunk in enumerate(data):

        # Print progress every `progress_step` chunks
        if i_chunk % progress_step == 0:
            stop_progress = timeit.default_timer()
            print(f'Time since last: {stop_progress-start_progress:.1f}s\n')
            print(f"Pre-processing chunks {i_chunk}-{i_chunk+progress_step-1}")
            start_progress = timeit.default_timer()

        # Pre-process chunk
        chunk_prep = prep_docs(chunk, speaker_attributes, print_progress=False, fix_contract=True, del_stop=False, lemmatize=True, min_size=5, min_true_size=5)

        # Convert the processed chunk to an Arrow table
        table = pa.Table.from_pandas(chunk_prep)

        # On the first chunk, create the Parquet writer with that chunk's schema
        if pqwriter is None:
            pqwriter = pq.ParquetWriter(data_folder + 'quotes-'+str(year)+'-prep.parquet', table.schema)

        pqwriter.write_table(table)
finally:
    # Close the writer and the input file even if a chunk fails.
    # Checking for None avoids a NameError when no chunk was read.
    if pqwriter is not None:
        pqwriter.close()
    f.close()

stop = timeit.default_timer()
print(f'Total time: {stop-start:.1f}s\n')

Time since last: 0.1s

Pre-processing chunks 0-99


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  copy_doc['date'] = copy_doc['date'].apply(lambda x: get_yyyy_mm(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  copy_doc['tokens'] = copy_doc['quotation'].apply(
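The two messages above are pandas' `SettingWithCopyWarning`, raised where columns of `copy_doc` are assigned. They do not stop the run, but they signal that pandas cannot tell whether `copy_doc` is an independent DataFrame or a slice of another one. A common remedy, assuming `copy_doc` is obtained by filtering another DataFrame (the source of `prep_docs` is not shown here, so this is a guess), is to take an explicit `.copy()` before assigning. The data and the body of `get_yyyy_mm` below are hypothetical stand-ins.

```python
import pandas as pd


def get_yyyy_mm(timestamp):
    # Hypothetical stand-in for the helper named in the warning:
    # reduce a timestamp string to "YYYY-MM".
    return pd.Timestamp(timestamp).strftime("%Y-%m")


docs = pd.DataFrame({
    "quotation": ["a short quote", "another, longer quote here"],
    "date": ["2015-01-03 10:00:00", "2015-02-14 08:30:00"],
})

# Filtering yields an object pandas may treat as a slice of `docs`;
# the explicit copy makes the assignment below unambiguous.
copy_doc = docs[docs["quotation"].str.len() > 5].copy()
copy_doc["date"] = copy_doc["date"].apply(get_yyyy_mm)

print(copy_doc["date"].tolist())
```

With the explicit copy, the assignment changes only `copy_doc` and leaves `docs` untouched.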


Time since last: 218.1s

Pre-processing chunks 100-199
Time since last: 201.5s

Pre-processing chunks 200-299
Time since last: 202.4s

Pre-processing chunks 300-399
Time since last: 202.6s

Pre-processing chunks 400-499
Time since last: 212.2s

Pre-processing chunks 500-599
Time since last: 199.0s

Pre-processing chunks 600-699
Time since last: 207.8s

Pre-processing chunks 700-799
Time since last: 199.2s

Pre-processing chunks 800-899
Time since last: 202.5s

Pre-processing chunks 900-999
Time since last: 200.2s

Pre-processing chunks 1000-1099
Time since last: 251.9s

Pre-processing chunks 1100-1199
Time since last: 206.9s

Pre-processing chunks 1200-1299
Time since last: 196.5s

Pre-processing chunks 1300-1399
Time since last: 195.3s

Pre-processing chunks 1400-1499
Time since last: 195.2s

Pre-processing chunks 1500-1599
Time since last: 205.5s

Pre-processing chunks 1600-1699
Time since last: 225.1s

Pre-processing chunks 1700-1799
Time since last: 243.2s

Pre-processing chunks 18