# External Merge Sort Documentation

## Overview

The `ExternalMergeSort` class implements external merge sort for datasets too large to fit in main memory. It reads a semicolon-delimited input CSV file in chunks of at most `buffer_size` rows, sorts each chunk by a fixed key column, writes the chunks to temporary run files, and merges the runs into a single sorted, comma-delimited output CSV file.

## Class Attributes

- **`input_file`**: The path to the input CSV file that needs to be sorted.
- **`output_file`**: The path to the output CSV file where the sorted data will be stored.
- **`buffer_size`**: The maximum number of rows held in memory and sorted as one chunk. Default value is 1000.
- **`runs`**: A list of paths to the temporary run files generated during the sorting process.

## Methods

### `__init__(self, input_file, output_file, buffer_size=1000)`

Constructor method to initialize the `ExternalMergeSort` object.

- **Parameters:**
  - `input_file` (str): Path to the input CSV file.
  - `output_file` (str): Path to the output CSV file.
  - `buffer_size` (int, optional): Maximum number of rows per sorted chunk. Default is 1000.
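
As a rough guide to choosing `buffer_size`, the number of initial runs and of pairwise merge passes can be estimated from the row count. This is only a sketch: `estimate_runs_and_passes` is a hypothetical helper that is not part of the class, and it assumes that `buffer_size` counts rows and that runs are merged two at a time.

```python
import math


def estimate_runs_and_passes(n_rows, buffer_size=1000):
    """Estimate initial run count and pairwise merge passes (hypothetical helper)."""
    # Each run holds at most buffer_size rows.
    runs = math.ceil(n_rows / buffer_size)
    # Each pairwise pass roughly halves the number of runs.
    passes = math.ceil(math.log2(runs)) if runs > 1 else 0
    return runs, passes


print(estimate_runs_and_passes(1_000_000))  # (1000, 10)
```

A larger buffer means fewer runs and fewer passes over the data, at the cost of more memory per chunk.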

### `sort(self)`

Main method to perform the external merge sort on the input file.

- **Steps:**
  1. Reads chunks of data from the input file.
  2. Sorts each chunk individually.
  3. Writes sorted chunks to temporary run files.
  4. Merges runs until only one run remains.
  5. Renames the final run to the output file.
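
Steps 1 to 3 can be sketched as a standalone function. This is a minimal illustration, not the class's own code: `write_sorted_runs` and its parameters are hypothetical names, and it uses `itertools.islice` to read a bounded number of rows per chunk. Keys are compared as strings, because that is what `csv.reader` yields.

```python
import csv
import tempfile
from itertools import islice


def write_sorted_runs(input_path, buffer_size, key_index, delimiter=";"):
    """Split a CSV into sorted temporary run files and return their paths."""
    run_paths = []
    with open(input_path, "r", newline="") as infile:
        reader = csv.reader(infile, delimiter=delimiter)
        while True:
            # islice pulls at most buffer_size rows and leaves the rest unread.
            chunk = list(islice(reader, buffer_size))
            if not chunk:
                break
            # Sort the chunk in memory on the key column (string comparison).
            chunk.sort(key=lambda row: row[key_index])
            with tempfile.NamedTemporaryFile(
                mode="w", delete=False, newline="", suffix=".csv"
            ) as run:
                csv.writer(run).writerows(chunk)
                run_paths.append(run.name)
    return run_paths
```

Only one chunk is in memory at a time, which is what lets the approach scale past available RAM.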

### `merge_runs(self)`

Helper method to merge runs produced during the sorting process.

- **Steps:**
  1. Groups runs into pairs for merging.
  2. Calls the `_merge` method to merge each pair of runs.
  3. Stores the merged runs for the next iteration.
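
The pairwise passes can be illustrated with in-memory lists standing in for run files. In this sketch `merge_pass` is a hypothetical function and `heapq.merge` replaces the file-based `_merge`. The steps above do not say what happens to an unpaired last run; carrying it over unchanged to the next pass is an assumption made here.

```python
import heapq


def merge_pass(runs):
    """One pairwise pass over a list of sorted lists (stand-ins for run files)."""
    next_runs = []
    for i in range(0, len(runs), 2):
        if i + 1 < len(runs):
            # heapq.merge lazily merges already-sorted inputs.
            next_runs.append(list(heapq.merge(runs[i], runs[i + 1])))
        else:
            # Assumption: an unpaired last run moves to the next pass unchanged.
            next_runs.append(runs[i])
    return next_runs


runs = [[1, 4], [2, 9], [3, 5], [0, 7], [6, 8]]
while len(runs) > 1:
    runs = merge_pass(runs)
print(runs[0])  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Five runs shrink to three, then two, then one, so the loop finishes after three passes.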

### `_merge(self, run1, run2)`

Helper method to merge two runs into a single run.

- **Parameters:**
  - `run1` (str): Path to the first run file.
  - `run2` (str): Path to the second run file.

- **Returns:**
  - `merged_run` (str): Path to the merged run file.

- **Steps:**
  1. Opens the two run files for reading.
  2. Compares records from both runs and writes them to a new run in sorted order.
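
A two-way merge of run files along these lines might look as follows. This is a sketch under assumptions: `merge_two_runs` and `key_index` are hypothetical names, both inputs are comma-delimited and already sorted on the same column, and keys are compared as strings.

```python
import csv
import tempfile


def merge_two_runs(run1, run2, key_index):
    """Merge two sorted CSV run files into a new temporary file; return its path."""
    with open(run1, newline="") as f1, open(run2, newline="") as f2, \
            tempfile.NamedTemporaryFile(mode="w", delete=False, newline="") as out:
        reader1, reader2 = csv.reader(f1), csv.reader(f2)
        writer = csv.writer(out)
        rec1, rec2 = next(reader1, None), next(reader2, None)
        while rec1 is not None or rec2 is not None:
            # Take from run1 when run2 is exhausted or run1's key is not larger
            # (ties go to run1, which keeps the merge stable).
            if rec2 is None or (rec1 is not None and rec1[key_index] <= rec2[key_index]):
                writer.writerow(rec1)
                rec1 = next(reader1, None)
            else:
                writer.writerow(rec2)
                rec2 = next(reader2, None)
        return out.name
```

Only one record per run is held in memory at any moment, so the merge cost in memory is independent of run size.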

## Usage

1. **Initialization:**
   ```python
   input_file_path = 'mergesort1.csv'
   output_file_path = 'ordered_hashwork.csv'
   merge_sort = ExternalMergeSort(input_file_path, output_file_path)
   ```

In [None]:
import csv
import os
import heapq
from tempfile import NamedTemporaryFile
import pandas as pd

key = 4  # zero-based index of the column used as the sort key
n_colums = 13

class ExternalMergeSort:
    def __init__(self, input_file, output_file, buffer_size=1000):
        self.input_file = input_file
        self.output_file = output_file
        self.buffer_size = buffer_size
        self.runs = []

    def sort(self):
        with open(self.input_file, 'r', newline='') as infile:
            reader = csv.reader(infile,delimiter=";")
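            # Read at most buffer_size rows per chunk. heapq.nsmallest would consume
            # the whole reader on the first call and drop every row past the first chunk.
            # zip() asks range() first, so no extra row is pulled from the reader.
            for chunk in iter(lambda: [row for _, row in zip(range(self.buffer_size), reader)], []):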
                chunk.sort(key=lambda x: x[key])

                with NamedTemporaryFile(mode='w', delete=False, newline='') as temp_run:
                    writer = csv.writer(temp_run)
                    writer.writerows(chunk)
                    self.runs.append(temp_run.name)

        while len(self.runs) > 1:
            self.merge_runs()

        os.rename(self.runs[0], self.output_file)


    def merge_runs(self):
        next_runs = []
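        # Merge runs two at a time; stepping by anything other than 2 skips runs.
        for i in range(0, len(self.runs), 2):
            run1 = self.runs[i]
            if i + 1 < len(self.runs):
                next_runs.append(self._merge(run1, self.runs[i + 1]))
            else:
                # An unpaired last run is carried over to the next pass unchanged.
                next_runs.append(run1)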

        self.runs = next_runs

    def _merge(self, run1, run2):
        with open(run1, 'r', newline='') as file1, open(run2, 'r', newline='') as file2:
            reader1 = csv.reader(file1)
            reader2 = csv.reader(file2)

            with NamedTemporaryFile(mode='w', delete=False, newline='') as merged_run:
                writer = csv.writer(merged_run)
                record1 = next(reader1, None)
                record2 = next(reader2, None)

                while record1 or record2:
                    # Compare on the same column that sort() used to order each chunk.
                    if record1 and (not record2 or record1[key] <= record2[key]):
                        writer.writerow(record1)
                        record1 = next(reader1, None)
                    else:
                        writer.writerow(record2)
                        record2 = next(reader2, None)

        return merged_run.name

input_file_path = 'mergesort1.csv'
output_file_path = 'ordered_hashwork.csv'

merge_sort = ExternalMergeSort(input_file_path, output_file_path)
merge_sort.sort()

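# Rows are written without a dedicated header line; header=None stops pandas
# from consuming the first record as column names.
df = pd.read_csv(output_file_path, header=None)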
df

Unnamed: 0,1000253,PR/DF0012537,04/01/2018,POSTO 81 LTDA,00001974000190,"LOC CNN 2, SN",LOTE A,CEILANDIA,72220500,DF,BRASILIA,VIBRA,04/01/2018.1
0,1000254,PR/SP0014189,26/09/2001,POSTO JARDIM AMERICA DE BAURU LTDA,2953000199,"AVENIDA NOSSA SENHORA DE FATIMA, 6 105",,VILA AVIACAO,17053460,SP,BAURU,IPIRANGA,04/11/2009
1,158439,PR/SP0169573,15/07/2004,POSTO JARDIM AMERICA DE BAURU LTDA,2953000270,"RUA PADRE FRANCISCO VAN DER MAAS, 7-70",,JARDIM CONTORNO,17047020,SP,BAURU,VIBRA,25/08/2020
2,1000024,PR/SP0029503,11/12/2002,COMPETRO COMERCIO E DISTRIBUICAO DE DERIVADOS ...,3188000121,"RUA HUMBERTO DE CAMPOS, 306",,JARDIM ZULMIRA,18061000,SP,SOROCABA,BANDEIRA BRANCA,11/12/2002
3,1000022,PR/SP0021686,28/02/2002,COMPETRO COMERCIO E DISTRIBUICAO DE DERIVADOS ...,3188000474,"AVENIDA SANTOS DUMONT, 701",,AEROPORTO,18065290,SP,SOROCABA,BANDEIRA BRANCA,27/07/2018
4,1000025,PR/SP0029505,11/12/2002,COMPETRO COMERCIO E DISTRIBUICAO DE DERIVADOS ...,3188000555,"RUA APARECIDA, 506",,ALEM LINHA,18095000,SP,SOROCABA,BANDEIRA BRANCA,11/12/2002
...,...,...,...,...,...,...,...,...,...,...,...,...,...
994,1001925,PR/MA0007268,08/05/2001,ALCANTARA DERIVADOS DE PETROLEO E SERVIÇOS LTDA,987726000160,"RODOVIA BR 135, S/N",KM 23,TIBIRI,65095040,MA,SAO LUIS,BANDEIRA BRANCA,16/04/2021
995,1001931,PR/PA0010733,12/07/2001,I OECHSLER & CIA LTDA,991423000110,"RUA 24 DE MAIO, 174",,CENTRO,68640000,PA,OUREM,IPIRANGA,12/07/2001
996,1055005,PR/AM0211239,02/05/2007,A C DE SOUSA LUBRIFICANTES LTDA,992097000166,"AVENIDA SILVES, 780",,CACHOEIRINHA,69065080,AM,MANAUS,VIBRA,02/05/2007
997,1001933,PR/MS0006202,04/04/2001,AUTO POSTO TAQUARUSSU LTDA - EPP,992487000136,"AVENIDA FELINTO MULLER, 1031",,CENTRO,79765000,MS,TAQUARUSSU,TAURUS,16/09/2004
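
To check a result like the one above, the output file can be scanned once to confirm the key column never decreases. This is a sketch, and `is_sorted_by_column` is a hypothetical helper. Because `csv.reader` yields strings, the default comparison is lexicographic, which appears consistent with the listing above, where `992487000136` comes after `3188000555`. Passing `key=int` checks numeric order instead.

```python
import csv


def is_sorted_by_column(path, key_index, delimiter=",", key=str):
    """Return True if rows in the CSV are in non-decreasing order of one column."""
    with open(path, newline="") as f:
        previous = None
        for row in csv.reader(f, delimiter=delimiter):
            current = key(row[key_index])
            if previous is not None and current < previous:
                return False
            previous = current
    return True
```

For example, `is_sorted_by_column(output_file_path, key)` would test the file produced by `merge_sort.sort()` using the same column index as the sort.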
