# Extraction and Preprocessing Pipeline #

## Import Python Modules ##
A module is a Python object with arbitrarily named attributes that you can bind and reference. Put simply, a module is a file of Python code that can define functions, classes, and variables, and can also include runnable code.

To use the functions in a module, you import it with an import statement, which consists of the `import` keyword followed by the name of the module. By convention, import statements go at the top of a Python file, below any shebang line or general comments.

In [1]:
import os
import json
import pandas as pd
from datetime import datetime
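
The import forms described above can be sketched as follows. This is only an illustration, not part of the pipeline: the alias `osp` and the sample record are assumptions made for the example.

```python
import json                               # bind the whole module to the name `json`
import os.path as osp                     # bind a submodule under a shorter alias (alias is an assumption)
from datetime import datetime, timezone   # bind individual attributes of a module

# Made-up record, used only to exercise the imported names.
record = json.loads('{"author": "alice", "created_utc": 0}')
created = datetime.fromtimestamp(record["created_utc"], tz=timezone.utc)
out_path = osp.join("dat_reddit", "comments.ndjson")

print(record["author"], created.year, osp.basename(out_path))
```

Each form only changes the name under which the module's attributes become available; the module itself is loaded the same way.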

## Create `DatasetLoader` class ##

Python supports object-oriented programming (OOP), a style in which code is organised around classes. Programmers use classes to keep related data and behaviour together.

A class is a code template for creating objects. Objects have member variables and have behaviour associated with them. In Python, a class is defined with the `class` keyword.
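
As a minimal sketch of these ideas, the hypothetical `CommentCounter` class below (not part of the pipeline) has one member variable and one method that acts on it.

```python
class CommentCounter:
    """Hypothetical example class: counts comments per author."""

    def __init__(self):
        # Member variable: each instance gets its own dictionary.
        self.counts = {}

    def add(self, author):
        # Behaviour: a method that updates the instance's state.
        self.counts[author] = self.counts.get(author, 0) + 1


counter = CommentCounter()
for name in ["alice", "bob", "alice"]:
    counter.add(name)

print(counter.counts)  # {'alice': 2, 'bob': 1}
```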

In [2]:
class DatasetLoader:
    """ Dataset loader for ndjson Reddit files
    """
    def __init__(self, preprocessors=None):
        # Fall back to an empty list so `load` can always iterate over it
        self.preprocessors = preprocessors
        if self.preprocessors is None:
            self.preprocessors = list()
            
    def load(self, filepath, datasort=False, save=False, verbose=-1):
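        """ Load an ndjson file into a pandas DataFrame
            - filepath: path to the ndjson file (one JSON object per line)
            - datasort: if True, sort the rows by their "created" timestamp
            - save: if True, also write the DataFrame to a .csv file next to the input
            - verbose: if > 0, print progress every `verbose` lines
        """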
        AUTHOR = list()
        BODY = list()
        CREATED = list()
        
        with open(filepath, "r") as fname:
            lignes = fname.readlines()
            for (i, ligne) in enumerate(lignes):
                dobj = json.loads(ligne)
                
                author = dobj["author"]
                body = dobj["body"]
                created = datetime.fromtimestamp(dobj["created_utc"])
                
                if self.preprocessors is not None:
                    for p in self.preprocessors:
                        body = p.preprocess(body)
                
                AUTHOR.append(author)
                BODY.append(body)
                CREATED.append(created)
                
                if verbose > 0 and i > 0 and (i + 1) % verbose == 0:
                    print("[INFO] processed {}/{}".format(i + 1, len(lignes)))
        
        df = pd.DataFrame()
        df["author"] = AUTHOR
        df["body"] = BODY
        df["created"] = CREATED
        
        if datasort:
            # sort_values returns a new DataFrame, so rebind the result
            df = df.sort_values(by=["created"])
        
        if save:
            # splitext strips only the final extension, even if the path contains other dots
            outname = os.path.splitext(filepath)[0] + ".csv"
            df.to_csv(outname, index=False)
            
            
        return df
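
The `load` method calls `p.preprocess(body)` on every object in `self.preprocessors`, so any object with a `preprocess` method that takes and returns a string can be plugged in. The notebook does not define any preprocessors; the two classes below are hypothetical examples, and `apply_preprocessors` is a helper introduced here only to mirror the loop inside `load`.

```python
import re


class UrlStripper:
    """Hypothetical preprocessor: removes http(s) URLs from a comment body."""

    URL_PATTERN = re.compile(r"https?://\S+")

    def preprocess(self, body):
        return self.URL_PATTERN.sub("", body).strip()


class LowercasePreprocessor:
    """Hypothetical preprocessor: lower-cases a comment body."""

    def preprocess(self, body):
        return body.lower()


def apply_preprocessors(body, preprocessors):
    # Same chaining as in `load`: each preprocessor receives the previous output.
    for p in preprocessors:
        body = p.preprocess(body)
    return body


cleaned = apply_preprocessors(
    "Read THIS https://example.com",
    [UrlStripper(), LowercasePreprocessor()],
)
print(cleaned)  # read this
```

Under these assumptions, the same list could be passed to the loader's constructor. The order of the list matters, because each preprocessor sees the output of the one before it.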

### Instantiate class ###
An object is created by calling the class, which runs its constructor (`__init__`). The resulting object is called an instance of the class. In Python, instances are created as follows:

In [3]:
def main():
    fpath = os.path.join("dat_reddit", "comments.ndjson")
    dl = DatasetLoader()
    df = dl.load(filepath=fpath, datasort=True, save=True, verbose=100)

In [4]:
main()

[INFO] processed 100/1200
[INFO] processed 200/1200
[INFO] processed 300/1200
[INFO] processed 400/1200
[INFO] processed 500/1200
[INFO] processed 600/1200
[INFO] processed 700/1200
[INFO] processed 800/1200
[INFO] processed 900/1200
[INFO] processed 1000/1200
[INFO] processed 1100/1200
[INFO] processed 1200/1200
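
To experiment with the input format without the Reddit data, a small ndjson file can be written and read back using only the standard library. This is a sketch under stated assumptions: the records are made up, the file goes to a temporary directory, and timestamps are converted with an explicit UTC timezone, whereas `load` calls `datetime.fromtimestamp` without one and so yields local time.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

# Made-up records carrying the three fields that `load` reads.
records = [
    {"author": "bob", "body": "second comment", "created_utc": 1500000100},
    {"author": "alice", "body": "first comment", "created_utc": 1500000000},
]

# ndjson stores one JSON object per line.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "comments.ndjson")
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read the file back line by line, parsing each line on its own.
rows = []
with open(path, "r", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)
        created = datetime.fromtimestamp(obj["created_utc"], tz=timezone.utc)
        rows.append((obj["author"], obj["body"], created))

# Sort by timestamp, the same ordering that `datasort=True` asks for.
rows.sort(key=lambda row: row[2])
print([author for author, _, _ in rows])  # ['alice', 'bob']
```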
