# Load Orbis Data

Preprocess Orbis .rar files.

The Orbis dataset is split into multiple 'topics' resembling SQL tables.
Each of the 'topics' is divided into multiple compressed files.
To work with the Orbis data, we first need to preprocess these files.

In this notebook, we will focus on two parts of the Orbis dataset:
1. All Addresses
2. BvD ID and Name

The first part, _All Addresses_, is composed of 8 compressed files, and
the second part, _BvD ID and Name_, is composed of 4 compressed files.
Each of the files takes a few GB of storage space.

Therefore, the important part of this notebook is to process these files efficiently with regard to memory and disk space.
For that reason, we decided to omit filtering by the 'Country' column and joining the dataframes on the ID, as these operations are too expensive.
Both dataframes are filtered in the same way, based on their only common column, 'BvD ID number'.
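
One way to keep memory use bounded while filtering on 'BvD ID number' is to read an extracted .txt file in chunks and keep only the useful columns and the matching rows. The sketch below is not the implementation behind `unrar_addresses` or `unrar_names`. The function name `filter_in_chunks` is hypothetical, and it assumes a tab-separated file in which German records can be recognised by a `DE` prefix in the ID; both assumptions should be checked against the actual data.

```python
import pandas as pd


def filter_in_chunks(txt_path, useful_columns, index_column,
                     id_prefix="DE", sep="\t", chunksize=100_000):
    """Read a large delimited file piece by piece and keep matching rows.

    Assumptions (not confirmed by the source): the file is separated by
    ``sep`` and the wanted records have an ID starting with ``id_prefix``.
    """
    kept = []
    reader = pd.read_csv(txt_path, sep=sep, usecols=useful_columns,
                         dtype=str, chunksize=chunksize)
    for chunk in reader:
        # Rows with a missing ID are treated as non-matching (na=False).
        mask = chunk[index_column].str.startswith(id_prefix, na=False)
        kept.append(chunk[mask])
    if not kept:
        # No chunk was produced at all: return an empty, correctly shaped frame.
        return pd.DataFrame(columns=useful_columns).set_index(index_column)
    return pd.concat(kept).set_index(index_column)
```

Only one chunk plus the already filtered rows are held in memory at a time, so the peak footprint depends on `chunksize` and on how many rows match, not on the size of the raw file.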


In this notebook, we call two functions that:
- Extract Orbis data files.
- Filter out columns with unnecessary data.
- Save obtained dataframes to .csv files.

In [None]:
from linkage.dataset.unrar import unrar_addresses, unrar_names

In [None]:
TYPE = 'all'  # 'all' or 'part01'

In [None]:
USEFUL_COLS_ADDR = ['BvD ID number', 
                   'Postcode', 
                   'City',
                   'City (native)',
                   'Country',
                   'Country ISO code',
                   'Region in country']

INDEX_COL_ADDR = 'BvD ID number'

USEFUL_COLS_NAMES = ['BvD ID number', 
                   'NAME']

INDEX_COL_NAMES = 'BvD ID number'
    
# Data directories
RAW_DIR = "../data/raw/orbis"
INTERMEDIATE_DIR = "../data/intermediate/orbis/"

# Data (.rar) files

# Data (.txt) files
ADDR_TXT = 'All_addresses.txt'  # Name of the All_addresses.part0x.rar after un-raring
NAME_TXT = 'BvD_ID_and_Name.txt'  # Name of the BvD_ID_and_Name.part0x.rar after un-raring

# Data (.csv) files
NAME_DF_FILE = f"orbis_german_bvid_name_unprocessed_{TYPE}.csv"
ADDR_DF_FILE = f"orbis_german_all_addresses_unprocessed_{TYPE}.csv"
ID_DF_FILE = f"orbis_german_BvD_ID_number_{TYPE}.csv"

## 1. Load Address parts of Orbis dataset

The Orbis dataset is split and stored in multiple files on the path:
```python
../data/raw/orbis/
```

The data are read into a pandas **DataFrame**, which is then stored in the file defined by 'ADDR_DF_FILE' under the path:
```python
../data/intermediate/orbis/
```
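
Storing the resulting dataframe could look roughly like the following. This is a minimal sketch, assuming the dataframe is indexed by 'BvD ID number' and written with pandas' default CSV settings; `save_dataframe` is a hypothetical helper, not a function of the `linkage` package.

```python
from pathlib import Path

import pandas as pd


def save_dataframe(df, dest_dir, dest_file):
    """Write ``df`` to ``dest_dir/dest_file`` as CSV and return the path."""
    dest = Path(dest_dir)
    # Create the intermediate directory if it does not exist yet.
    dest.mkdir(parents=True, exist_ok=True)
    out_path = dest / dest_file
    # The index (assumed to be the ID column) is written as the first column.
    df.to_csv(out_path)
    return out_path
```

A file saved this way can be read back with `pd.read_csv(path, index_col='BvD ID number', dtype=str)`, which keeps the IDs as strings.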

### Get Addresses

The company **addresses** are stored in multiple _All_addresses.part0x.rar_ files.

The following function call extracts the .rar files containing company addresses and concatenates their contents into a single dataframe.

During execution, records are filtered using the German ID.

After each file is processed, the auxiliary .txt file is removed.
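
The extract, filter, clean-up cycle described above can be sketched as a loop over the archive parts. This is an illustration, not the code inside `unrar_addresses`: `process_parts`, `extract_part`, and `process_txt` are hypothetical names, the two callables are supplied by the caller (for example a wrapper around an external unrar tool and a filtering function), and the sketch assumes every extraction produces the same intermediate .txt file.

```python
import os


def process_parts(rar_paths, txt_path, extract_part, process_txt):
    """Extract each archive part, process the resulting .txt, then delete it.

    ``extract_part(rar_path, txt_path)`` must create ``txt_path``;
    ``process_txt(txt_path)`` returns whatever should be kept from it.
    """
    results = []
    for rar_path in rar_paths:
        extract_part(rar_path, txt_path)
        try:
            results.append(process_txt(txt_path))
        finally:
            # Remove the auxiliary file even if processing failed,
            # so at most one extracted file occupies disk space.
            if os.path.exists(txt_path):
                os.remove(txt_path)
    return results
```

Deleting the auxiliary file inside `finally` keeps disk usage at one extracted file at a time, which matters when each part expands to several GB.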



In [None]:
addr_df = unrar_addresses(type_unrar=TYPE,
                          source_dir=RAW_DIR,
                          dest_dir=INTERMEDIATE_DIR, 
                          source_file=ADDR_TXT, 
                          dest_file=ADDR_DF_FILE, 
                          index_column=INDEX_COL_ADDR, 
                          useful_columns=USEFUL_COLS_ADDR)
addr_df.head()

## 2. Load Company Name parts of Orbis dataset

The Orbis dataset is split and stored in multiple files on the path:
```python
../data/raw/orbis/
```

The data are read into a pandas **DataFrame**, which is then stored in the file defined by 'NAME_DF_FILE' under the path:
```python
../data/intermediate/orbis/
```

### Get Company Names

The company **names** are stored in multiple _BvD_ID_and_Name.part0x.rar_ files.

The following function call extracts the .rar files containing company names and concatenates their contents into a single dataframe.

During execution, records are filtered using the German ID.

After each file is processed, the auxiliary .txt file is removed.


In [None]:
name_df = unrar_names(type_unrar=TYPE,
                      source_dir=RAW_DIR,
                      dest_dir=INTERMEDIATE_DIR, 
                      source_file=NAME_TXT, 
                      dest_file=NAME_DF_FILE, 
                      index_column=INDEX_COL_NAMES, 
                      useful_columns=USEFUL_COLS_NAMES)
name_df.head()