# Understanding the File Structure


 
- **sentence.txt** The file contains one sentence pair (English-French) per line. The delimiter is ' ||| '.
- **index.txt** The file has the same number of lines as `sentence.txt`. Each line contains a space-separated sequence of `ids` that identify the sources of the sentence pair on the corresponding line of `sentence.txt`. These `ids` are needed to restore the document structure and to generate the link to the original PDF document at <https://www.sedar.com>. Here is an example of an `id`:
    > `2141333_2141334_00002290_1_0 -> docIdEng_docIdFr_issuerNo_parIdx_sentIdx` 
- **score.txt** The file has the same number of lines as `sentence.txt`. Each line contains a confidence score for the quality of the pair (the minimum score is 0.6).
- **split.txt** The file has the same number of lines as `sentence.txt`. Each line contains an index that indicates whether the sentence pair belongs to the train, valid, or test portion as defined in the paper. Valid indexes are: {0:train, 1:valid, 2:test, -1:ignore}
- **similar.txt** Each line in this file contains space-separated line indexes of `sentence.txt` that group a set of near-duplicate sentences. For example, a line containing `26 56 101` means that the sentence pairs at lines 26, 56, and 101 in `sentence.txt` are near duplicates.

- **meta.pkl** A Python dictionary that contains issuer and document metadata.
- **sent_index.pkl** A Python dictionary where each key is a document and each value is a list of (par_idx, sent_idx, l_idx) entries. `l_idx` is the line index in `sentence.txt` of the pair at (par_idx, sent_idx) in the document.
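
The `id` format described for `index.txt` can be unpacked with a few lines of Python. The sketch below is only an illustration: `PairId`, `parse_pair_id`, and `parse_index_line` are hypothetical names that are not part of the released code, and the sketch assumes every `id` has exactly five underscore-separated fields.

```python
from collections import namedtuple

# Hypothetical helper type; the field order follows the id layout shown above.
PairId = namedtuple("PairId", ["doc_id_en", "doc_id_fr", "issuer_no", "par_idx", "sent_idx"])


def parse_pair_id(raw_id):
    """Split one id from index.txt into its five components."""
    doc_id_en, doc_id_fr, issuer_no, par_idx, sent_idx = raw_id.split("_")
    # Document ids and the issuer number stay strings so leading zeros survive.
    return PairId(doc_id_en, doc_id_fr, issuer_no, int(par_idx), int(sent_idx))


def parse_index_line(line):
    """Parse a whole line of index.txt (space-separated ids)."""
    return [parse_pair_id(raw_id) for raw_id in line.split()]


example = parse_pair_id("2141333_2141334_00002290_1_0")
print(example.issuer_no, example.par_idx, example.sent_idx)
```

Keeping the issuer number as a string matters because values such as `00002290` would lose their leading zeros if converted to integers.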

### Before we start, let's define the file paths:

In [8]:
import os

# change data dir to your local path
data_dir = "/path/to/data/"
sentence_file = os.path.join(data_dir, "sentence.txt")
index_file = os.path.join(data_dir, "index.txt")
score_file = os.path.join(data_dir, "score.txt")
split_file = os.path.join(data_dir, "split.txt")
similar_file = os.path.join(data_dir, "similar.txt")

meta_file = os.path.join(data_dir, "meta.pkl")
sent_index_file = os.path.join(data_dir, "sent_index.pkl")
delimiter = ' ||| '
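
With the paths defined, `similar.txt` can be used to drop near duplicates. The sketch below keeps the first line index of each group and marks the rest for removal. `load_duplicate_indexes` is a hypothetical helper, not part of the released code, and it assumes the indexes are zero-based (like `l_idx` in `sent_index.pkl`); check this against your copy of the data before relying on it.

```python
def load_duplicate_indexes(similar_path):
    """Return the sentence.txt line indexes that are not the first member of their group."""
    to_drop = set()
    with open(similar_path, encoding="utf-8") as fp:
        for line in fp:
            group = [int(tok) for tok in line.split()]
            # Keep the first index in each group and drop the others.
            to_drop.update(group[1:])
    return to_drop
```

Calling `load_duplicate_indexes(similar_file)` would then give a set that can be checked while iterating over `sentence.txt`.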

### A demo function to explore the file structure

In [13]:
def print_pairs():
    """
         A demonstration of the file formats: prints the first 11 sentence pairs.

    """
    portion_index = {"0": "train", "1": "valid", "2": "test", "-1": "None"}

    # open the four line-aligned files together so that they are closed on exit
    with open(sentence_file, encoding="utf-8") as f1, \
            open(index_file, encoding="utf-8") as f2, \
            open(score_file, encoding="utf-8") as f3, \
            open(split_file, encoding="utf-8") as f4:
        for idx, (sentence, index, score, portion) in enumerate(zip(f1, f2, f3, f4)):
            pair = sentence.strip().split(delimiter)
            print("**English Sentence** %s" % pair[0])
            print("**French Sentence** %s" % pair[1])
            print("- Noise filter confidence score **%s** " % score.strip())
            # one id per occurrence of the pair in the corpus
            print("- This sentence pair is repeated **%s times** in the entire corpus." % len(index.strip().split(" ")))
            print("- This sentence pair belongs to the **%s portion**." % portion_index[portion.strip()])
            print()
            # stop after the first 11 pairs (idx 0 to 10)
            if idx == 10:
                break


###  Here is a sample of the output

**English Sentence** Base Shelf Prospectus  
**French Sentence** Prospectus préalable de base  
- Noise filter confidence score **0.63**   
- This sentence pair is repeated **1170 times** in the entire corpus.  
- This sentence pair belongs to the **None portion**.  
  
**English Sentence** A copy of this preliminary short form prospectus has been filed with the securities regulatory authorities in each of the provinces of Canada but has not yet become final for the purpose of the sale of securities.  
**French Sentence** Un exemplaire du présent prospectus simplifié provisoire a été déposé auprès de l’autorité en valeurs mobilières de chacune des provinces du Canada; toutefois, ce document n’est pas encore dans sa forme définitive en vue du placement de titres.  
- Noise filter confidence score **0.73**   
- This sentence pair is repeated **417 times** in the entire corpus.  
- This sentence pair belongs to the **None portion**.  
  
**English Sentence** The securities may not be sold until a receipt for the short form base shelf prospectus is obtained from the securities regulatory authorities.  
**French Sentence** Les titres qu’il décrit ne peuvent être placés avant que l’autorité en valeurs mobilières n’ait visé le prospectus simplifié préalable de base.  
- Noise filter confidence score **0.67**   
- This sentence pair is repeated **14 times** in the entire corpus.  
- This sentence pair belongs to the **None portion**.  
  
**English Sentence** This short form base shelf prospectus has been filed under legislation in each of the provinces of Canada that permits certain information about these securities to be determined after this prospectus has become final and that permits the omission from this prospectus of that information.  
**French Sentence** Le présent prospectus simplifié préalable de base a été déposé auprès de toutes les provinces du Canada selon un régime permettant d’attendre après qu’il soit dans sa version définitive pour déterminer certains renseignements concernant les titres offerts et d’omettre ces renseignements dans le prospectus.  
- Noise filter confidence score **0.71**   
- This sentence pair is repeated **7 times** in the entire corpus.  
- This sentence pair belongs to the **None portion**.  
  
**English Sentence** The securities to be issued hereunder have not been, and will not be, registered under the United States Securities Act of 1933, as amended and, subject to certain exceptions, may not be offered, sold or delivered, directly or indirectly, in the United States of America or for the account or benefit of U.S. persons.  
**French Sentence** Les titres qui seront émis en vertu des présentes n’ont pas été ni ne seront inscrits en vertu de la Securities Act of 1933 des États-Unis, telle qu’elle a été modifiée et, sous réserve de certaines exceptions, ils ne peuvent être offerts, vendus ni livrés, directement ou indirectement, aux États-Unis d’Amérique ou pour le compte ou au profit de personnes des États-Unis.  
- Noise filter confidence score **0.84**   
- This sentence pair is repeated **38 times** in the entire corpus.  
- This sentence pair belongs to the **None portion**.  
  
**English Sentence** Medium Term Deposit Notes  
**French Sentence** Billets de dépôt à moyen terme  
- Noise filter confidence score **0.67**   
- This sentence pair is repeated **6 times** in the entire corpus.  
- This sentence pair belongs to the **train portion**.  
  
**English Sentence** The aggregate principal amount of Notes which may be issued under this Short Form Prospectus may not exceed \$5,000,000,000 in lawful money of Canada (or the equivalent in non-Canadian currency units calculated on the basis of the principal amount of Notes issued, in the case of interest-bearing Notes, or on the basis of the gross proceeds received by Caisse centrale, in the case of non-interest bearing Notes or Notes bearing interest at a rate that, at the time of issuance, is below market rates).  
**French Sentence** Le capital global des billets pouvant être émis en vertu du présent prospectus simplifié ne peut dépasser 5 000 000 000 $ en monnaie légale du Canada (ou l’équivalent en unités de monnaies non canadiennes calculé en fonction du capital des billets émis, dans le cas des billets portant intérêt, ou en fonction du produit brut reçu par la Caisse centrale, dans le cas des billets ne portant pas intérêt ou des billets portant intérêt à un taux inférieur aux taux du marché au moment de leur émission).  
- Noise filter confidence score **0.89**   
- This sentence pair is repeated **4 times** in the entire corpus.  
- This sentence pair belongs to the **train portion**.  
  
**English Sentence** Notes may be redeemed at the option of Caisse centrale, in whole or in part, prior to their maturity date in the manner set out in the applicable Pricing Supplement (defined  below).  
**French Sentence** Les billets pourront être remboursés avant leur date d’échéance, en totalité ou en partie, au gré de la Caisse centrale, de la manière indiquée dans le supplément de fixation du prix (défini ci-dessous)  applicable.  
- Noise filter confidence score **0.80**   
- This sentence pair is repeated **4 times** in the entire corpus.  
- This sentence pair belongs to the **train portion**.  
  
**English Sentence** The specific variable terms of any offering of Notes (including, where applicable and without limitation, the specific designation, the aggregate principal amount being offered, the currency, the issue and delivery date, the maturity date, the issue price (or the manner of determination thereof if offered on a non-fixed price basis), the interest rate (either fixed or floating, and, if floating, the manner of calculation thereof), the interest payment date(s), the redemption, exchange or conversion provisions (if any), the events of default (if any), the repayment terms, the name and compensation of the agents, underwriters or dealers acting as principals (if any), the method of distribution, the form and the actual net proceeds to Caisse centrale) will be set forth in one or more pricing supplements (each a “Pricing Supplement”) which will be delivered to prospective purchasers together with this Short Form Prospectus.  
**French Sentence** Les modalités variables particulières propres à tout placement de billets (y compris, le cas échéant et sans limitation, la désignation particulière, le capital global offert, la monnaie, la date d’émission et de livraison, la date d’échéance, le prix d’émission (ou la façon dont le prix sera déterminé si les billets sont offerts à un prix non déterminé), le taux d’intérêt (fixe ou variable, et, s’il s’agit d’un taux variable, la façon dont il sera calculé), les dates de versement des intérêts, les dispositions relatives au remboursement par anticipation, à l’échange ou à la conversion (s’il en est), les cas de défaut (s’il en est), les modalités de remboursement, le nom et la rémunération des placeurs pour compte, des preneurs fermes ou des courtiers agissant pour leur propre compte (s’il en est), le mode de placement, la forme et le montant réel du produit net revenant à la Caisse centrale) seront indiqués dans un ou plusieurs suppléments de fixation du prix (individuellement, « supplément de fixation du prix ») qui seront transmis aux acquéreurs éventuels avec le présent prospectus simplifié.  
- Noise filter confidence score **0.68**   
- This sentence pair is repeated **5 times** in the entire corpus.  
- This sentence pair belongs to the **train portion**.  
  
**English Sentence** Caisse centrale reserves the right to set forth in a Pricing Supplement specific variable terms of Notes that are not within the options and parameters set forth in this Short Form Prospectus.  
**French Sentence** La Caisse centrale se réserve le droit d’indiquer dans un supplément de fixation du prix les modalités variables particulières des billets qui ne sont pas comprises dans les options et les paramètres indiqués dans le présent prospectus  simplifié.  
- Noise filter confidence score **0.76**   
- This sentence pair is repeated **12 times** in the entire corpus.  
- This sentence pair belongs to the **train portion**.  
  
**English Sentence** The Notes are unsecured, are not issued under a trust indenture and rank pari passu and pro rata with all unsecured and unsubordinated deposits, borrowings and obligations of Caisse centrale.  
**French Sentence** Les billets ne sont pas garantis par une sûreté, ne sont pas émis en vertu d’une convention de fiducie et prennent rang également avec tous les dépôts, emprunts et obligations non garantis et non subordonnés de la Caisse centrale, proportionnellement à ceux-ci.  
- Noise filter confidence score **0.66**   
- This sentence pair is repeated **6 times** in the entire corpus.  
- This sentence pair belongs to the **train portion**.
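
The minimum score in `score.txt` is 0.6, and several of the pairs above sit close to that floor. If a cleaner subset is preferred, the pairs can be filtered by score. This is a minimal sketch, not part of the released code: `iter_confident_pairs` and the 0.8 default threshold are illustrative assumptions, and the sketch assumes every line of `sentence.txt` contains the delimiter.

```python
def iter_confident_pairs(sentence_path, score_path, min_score=0.8, delimiter=" ||| "):
    """Yield (english, french, score) for pairs whose score is at least min_score."""
    with open(sentence_path, encoding="utf-8") as f_sent, \
            open(score_path, encoding="utf-8") as f_score:
        # The two files are line-aligned, so zip pairs each sentence with its score.
        for sentence, score in zip(f_sent, f_score):
            value = float(score)
            if value < min_score:
                continue
            english, french = sentence.rstrip("\n").split(delimiter, 1)
            yield english, french, value
```

Because the function is a generator, the corpus is streamed line by line and never has to fit in memory.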
  

### Use this function to split the corpus into train, valid, and test sets

In [9]:
def split_corpus(output_dir):
    """ Split the corpus into 6 files: {train|valid|test}.{en|fr}
    Args:
            output_dir: The directory where the files are written

    """
    portion = ["train", "valid", "test"]
    # map line index -> portion name, skipping the pairs marked -1 (ignore)
    p_split = {}
    with open(split_file, encoding="utf-8") as fp:
        for idx, line in enumerate(fp):
            label = int(line.strip())
            if label != -1:
                p_split[idx] = portion[label]

    lang = {"en": 0, "fr": 1}
    fouts = {}

    for p in portion:
        for l in lang.keys():
            fouts[(p, l)] = open(os.path.join(output_dir, "%s.%s" % (p, l)), "w", encoding="utf-8")

    try:
        with open(sentence_file, encoding="utf-8") as fp:
            for idx, line in enumerate(fp):
                if idx in p_split:
                    vals = line.rstrip("\n").split(delimiter)
                    for l, l_idx in lang.items():
                        fouts[(p_split[idx], l)].write("%s\n" % vals[l_idx])
    finally:
        # close the output files so that all buffered lines are flushed to disk
        for fout in fouts.values():
            fout.close()
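
Before writing the six files, it can be useful to check how many pairs fall into each portion. The sketch below counts the labels in `split.txt`. `count_portions` and `PORTION_NAMES` are hypothetical names, not part of the released code; the label names follow the mapping given earlier ({0:train, 1:valid, 2:test, -1:ignore}).

```python
from collections import Counter

# Label mapping as documented for split.txt.
PORTION_NAMES = {"0": "train", "1": "valid", "2": "test", "-1": "ignore"}


def count_portions(split_path):
    """Count how many sentence pairs are assigned to each portion."""
    counts = Counter()
    with open(split_path, encoding="utf-8") as fp:
        for line in fp:
            counts[PORTION_NAMES[line.strip()]] += 1
    return counts
```

For example, `count_portions(split_file)` returns a `Counter` keyed by portion name, which can be compared with the line counts of the files written by `split_corpus`.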

### The following function demonstrates how to restore the document structure

In [14]:
import pickle


def genrate_document():
    """
         A demonstration of how to restore the document structure.

    """
    print("load meta data")
    with open(meta_file, "rb") as fp:
        meta = pickle.load(fp)

    print("load sentence index file")
    with open(sent_index_file, "rb") as fp:
        sent_index = pickle.load(fp)

    print("load sentence pairs")
    with open(sentence_file, encoding="utf-8") as fp:
        sent_pairs = fp.readlines()

    print("load sentence pairs scores")
    with open(score_file, encoding="utf-8") as fp:
        scores = fp.readlines()

    print("processing document")
    doc_done = set()
    for issuerNo, issuer in meta.items():
        issuer_link = "https://www.sedar.com/DisplayProfile.do?lang=EN&issuerType=%s&issuerNo=%s" % \
                      (issuer['issuerType'], issuer['issuerNo'])
        print("**Issuer Number** %s" % issuer['issuerNo'])
        print("**Issuer Name** %s" % issuer['issuerName'])
        print("**Industrial Group** %s" % issuer['ind_group'])
        print("**Issuer Profile link** %s " % issuer_link)
        print("\n")
        for (docId, issuerNo), doc in issuer["docs"].items():
            # if the en and fr docIds are the same, then it is a bilingual document
            print("**English Document link**\n %s" % get_doc_link(issuer, doc, "en"))
            print("**French Document link**\n %s" % get_doc_link(issuer, doc, "fr"))
            print("**Document Type** %s" % doc["docType"])
            print("**Document Description** %s" % doc["docDes"])
            print("**Submission Date** %s" % doc["doc_date"].strftime("%Y-%m-%d"))
            print()

            # skip a french doc if the english one has been treated (and vice versa)
            if doc["key"] in doc_done: continue
            doc_done.add(doc["key"])

            for par_idx, sent_idx, l_idx in sent_index[doc["key"]]:
                pair = sent_pairs[l_idx].rstrip().split(delimiter)
                score = scores[l_idx].strip()
                frmt = (par_idx, sent_idx, score)
                print("**Paragraph Num=** %s\t**Sentence Num=** %s\t**Confidence Score=** %s" % frmt)
                print("**English Sentence** %s " % pair[0])
                print("**French Sentence** %s " % pair[1])
                print()
            print("\n\n")
            break
        break


    
def get_doc_link(issuer, doc, lang):
    i = 0 if lang == "en" else 1
    docId = doc["key"][i]
    doc_link = "https://www.sedar.com/GetFile.do?lang=EN&docClass=%s&issuerNo=%s&issuerType=%s&projectNo=%s&docId=%s"
    doc_link = doc_link % (doc['docClass'], issuer['issuerNo'], issuer['issuerType'], doc['projectNo'], docId)
    return doc_link
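
The loop in `genrate_document` prints sentences one by one. To rebuild whole paragraphs instead, the (par_idx, sent_idx, l_idx) entries of `sent_index.pkl` can be grouped by `par_idx`. The sketch below is an illustration under stated assumptions: `group_by_paragraph` is a hypothetical helper, `entries` stands for one value of the `sent_index` dictionary, and `sent_pairs` is the list of lines read from `sentence.txt`.

```python
from collections import defaultdict


def group_by_paragraph(entries, sent_pairs, delimiter=" ||| "):
    """Group (par_idx, sent_idx, l_idx) entries into paragraphs of (english, french) pairs."""
    paragraphs = defaultdict(list)
    # Sorting orders the entries by paragraph, then by sentence position.
    for par_idx, sent_idx, l_idx in sorted(entries):
        english, french = sent_pairs[l_idx].rstrip("\n").split(delimiter, 1)
        paragraphs[par_idx].append((english, french))
    # Return the paragraphs in document order.
    return [paragraphs[par_idx] for par_idx in sorted(paragraphs)]
```

Only sentences that survived alignment and filtering appear in `sentence.txt`, so the rebuilt paragraphs may have gaps relative to the original PDF.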

###  Here is a sample of the output

**(Important Note)** The original sources of the data are commercially designed PDFs. Consequently, segmentation and parsing errors may occur, and they are sometimes hard to detect, especially when the sentence is very long (see the strikethrough text chunk below).

**Issuer Number** 00019641  
**Issuer Name** AAER Inc.  
**Industrial Group** other  
**Issuer Profile link** https://www.sedar.com/DisplayProfile.do?lang=EN&issuerType=03&issuerNo=00019641   
  
  
**English Document link**  
 https://www.sedar.com/GetFile.do?lang=EN&docClass=8&issuerNo=00019641&issuerType=03&projectNo=01595266&docId=2675068  
**French Document link**  
 https://www.sedar.com/GetFile.do?lang=EN&docClass=8&issuerNo=00019641&issuerType=03&projectNo=01595266&docId=2675034  
**Document Type** News Releases  
**Document Description** News release  
**Submission Date** 2010-06-10  
  
**Paragraph Num=** 2    **Sentence Num=** 0    **Confidence Score=** 1.00  
**English Sentence** FOR IMMEDIATE DISTRIBUTION PRESS RELEASE   
**French Sentence** POUR DISTRIBUTION IMMÉDIATE COMMUNIQUÉ DE PRESSE   
  
**Paragraph Num=** 10    **Sentence Num=** 0    **Confidence Score=** 0.64  
**English Sentence** Furthermore, the initial order granted under the Companies' Creditors Arrangement Act (Canada) ("CCAA") in favour of AAER on April 8, 2010, as subsequently extended from time to time by the court, has now been further extended until July 7, 2010 to allow AAER to analyze its restructuring alternatives further to the completion of the sales described above and Pioneer’s expression of interest to consider the possibility of sponsoring an eventual plan, and, as the case may be, to complete and file a joint plan of arrangement for consideration by its creditors and the Court.   
**French Sentence** ~~Pioneer»), et le 4 juin 2010, la vente d’une partie de ses dépôt en inventaire et de ses dépenses prépayées à Global Casting inc.~~ De plus, l’ordonnance initiale en vertu de la LACC octroyée en faveur d’AAER le 8 avril 2010, précédemment prolongé par la Cour a quelques reprises, a encore une fois été prolongée jusqu’au 7 juillet 2010, de façon a permettre a AAER d’analyser ses alternatives de restructuration de ses affaires suite aux ventes décrites ci-haut et suite à la démonstration d’un intérêt par Pioneer de considérer la possibilité de parrainer ce plan éventuel et, si requis, de développer et de déposer un plan commun de restructuration des affaires afin de le soumettre à ses créanciers et à la Cour.   
  
**Paragraph Num=** 10    **Sentence Num=** 1    **Confidence Score=** 0.75  
**English Sentence** The Court also authorized the postponement of AAER’s annual meeting of shareholders to September 30, 2010.   
**French Sentence** La Cour autorise de plus la remise de l’assemblée annuelle des actionnaires au 30 septembre 2010.   
  
**Paragraph Num=** 14    **Sentence Num=** 0    **Confidence Score=** 1.00  
**English Sentence** About AAER Inc.   
**French Sentence** À propos d’AAER Inc.   
  
**Paragraph Num=** 15    **Sentence Num=** 0    **Confidence Score=** 0.75  
**English Sentence** AAER is a wind turbine manufacturer located in Bromont, Quebec that manufactures and maintains high capacity 1 MW or more wind turbines principally for the North American market.   
**French Sentence** AAER est un fabricant d'éoliennes établi à Bromont, au Québec, qui fabrique et entretient des éoliennes à haute capacité, de 1 MW et plus, principalement pour le marché nord-américain.   
  
**Paragraph Num=** 15    **Sentence Num=** 1    **Confidence Score=** 0.82  
**English Sentence** Its strategy is to progressively build its products' components to provide a high level of reliability and competitive pricing to its customers.   
**French Sentence** Sa stratégie consiste à fabriquer progressivement les composantes de ses produits de manière à offrir à sa clientèle une grande fiabilité et des prix concurrentiels.   
  
**Paragraph Num=** 15    **Sentence Num=** 2    **Confidence Score=** 0.77  
**English Sentence** AAER uses a portfolio of proven European technologies to ensure the performance of its turbines in various wind conditions and terrains.   
**French Sentence** AAER utilise des technologies européennes éprouvées pour assurer le rendement de ses éoliennes dans diverses conditions de vents et de terrains.   
  
**Paragraph Num=** 15    **Sentence Num=** 3    **Confidence Score=** 0.86  
**English Sentence** Its stock is listed on the TSX Venture Exchange (TSX-V: AAE).   
**French Sentence** Ses actions se négocient à la Bourse de croissance TSX (TSX-V : AAE).   
  
**Paragraph Num=** 19    **Sentence Num=** 0    **Confidence Score=** 0.76  
**English Sentence** This news release contains certain forward-looking statements or forward looking-information.   
**French Sentence** Le présent communiqué de presse contient des énoncés ou des renseignements de nature prospective.   
  
**Paragraph Num=** 19    **Sentence Num=** 1    **Confidence Score=** 0.68  
**English Sentence** These forward looking statements are subject to a variety of risks and uncertainties  beyond  AAER’s  ability  to  control  or  predict  which  could  cause  actual  events  or results  to  differ  materially  from  those  anticipated  in  such  forward  looking  statements.   
**French Sentence**  Ces  énoncés  prospectifs  comportent  des  risques  et  des  incertitudes  qui   échappent  au contrôle  de  la  Société  et  qui  pourraient  faire  en  sorte  que  les  événements  ou  les  résultats  réels diffèrent considérablement de ceux anticipés dans ces énoncés prospectifs.   
  
**Paragraph Num=** 19    **Sentence Num=** 2    **Confidence Score=** 0.76  
**English Sentence** Such risks  and  uncertainties  are  disclosed  under  the  heading  “Risk  Factors”  in  AAER’s  amended and restated preliminary prospectus dated February 25, 2010 and the annual information form for the year ended December 31, 2008 and dated March 26, 2009.   
**French Sentence** Ces risques et incertitudes sont énumérés dans la section « Facteurs de risque » du prospectus provisoire amendé et reformulé daté du 25 février 2010 et de la notice annuelle de la Société pour l’exercice financier se terminant le 31 décembre 2008 et datée du 26 mars 2009.   
  
**Paragraph Num=** 23    **Sentence Num=** 0    **Confidence Score=** 0.84  
**English Sentence** Neither TSX Venture Exchange nor its Regulation Services Provider (as that term is defined in the policies of the TSX Venture Exchange) accepts responsibility for the adequacy or accuracy of this release.   
**French Sentence** La Bourse de croissance TSX et son fournisseur de services de réglementation (au sens attribué à ce terme dans les politiques de la Bourse de croissance TSX) n’assument aucune responsabilité quant à la pertinence ou à l’exactitude du présent communiqué.