# Clean Orbis Data

In this notebook, we clean a part of the Orbis dataset containing company names.

The notebook is organized in the following fashion:

0. Import libraries and define constants
1. Load parts of Orbis dataset
2. Check the data
3. Clean company names
4. Save processed data

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
import ftfy
import pyunpack
import rarfile
os.environ["MODIN_ENGINE"] = "dask"
import modin.pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)

from linkage.model.utils import save_dataframe, read_dataframe
from linkage.model.change_dataframe import replace_german_characters, repair_broken_unicode, replace_other_latin_characters
from linkage.model.clean_names import clean_names, clean_names_with_dictionary
from linkage.model.examine_dataframe import contains_all_nan, contains_any_nan, drop_all_nan, count_redundant_spaces
from linkage.model.examine_dataframe import column_contains_nan, drop_subset_nan, print_dataframe_length
from linkage.visualize.plot import plot_histogram
from linkage.visualize.visualize_dataframe import show_nan_counts

In [None]:
# Two types of data: 'all' or only the first part (part01.rar)
# part01 is used during implementation
# to check that everything works as it should
TYPE = 'all'  # 'all' or 'part01'

# 'std' for standardized, 'std_dict_40k' for dictionary cleaning with the 40k most common words
NOTE = 'std'

In [None]:
# Specify paths to data directories
INTERMEDIATE_DATA_DIR = "../data/intermediate/orbis"
PROCESSED_DATA_DIR = f"../data/processed/orbis/{TYPE}"

# Specify file names
#ORBIS_FILE = "orbis_german_bvid_name_unprocessed_part01.csv"  # TODO
ORBIS_FILE = f"orbis_german_bvid_name_unprocessed_{TYPE}.csv"
ORBIS_PROCESSED_FILE = f"orbis_german_bvid_name_processed_{TYPE}.csv"

# Dataframe's index
ORBIS_INDEX = 'BvD ID number'

# Column names
# Good to specify if the column names would change
COMPANY_NAME = 'NAME'

# Columns to take when reading the dataframe from a file
USEFUL_COLS = [ORBIS_INDEX, COMPANY_NAME]

# Additional columns
COMPANY_NAME_STANDARDIZED = 'company_standard'
COMPANY_NAME_DICT_CLEANED = 'company_dict_clean'

# Labels for plots
PLOT_LABELS = ['Comp. name']
PLOT_LABELS_WITH_DICT_CLEAN = ['Comp. name', 'Comp. name stand.', 'Comp. name dict. clean']

## 1. Load parts of Orbis dataset

The Orbis dataset is stored at the path:
```python
../data/intermediate/orbis/
```

The data are read into a pandas **DataFrame**.
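
As a rough illustration of reading only selected columns, the sketch below uses plain pandas on a small in-memory CSV. The sample rows, the extra `City` column, and the use of `pd.read_csv` with `usecols` are assumptions made for illustration; the internals of `read_dataframe` are not shown in this notebook and may differ.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the Orbis CSV file
sample_csv = io.StringIO(
    "BvD ID number,NAME,City\n"
    "DE0001,Alpha GmbH,Berlin\n"
    "DE0002,Beta AG,Hamburg\n"
)

# Keep only the columns needed for name cleaning; no index column is set
sample_df = pd.read_csv(sample_csv, usecols=["BvD ID number", "NAME"])
print(sample_df.shape)
```

Restricting the columns at read time keeps memory usage low, which matters when the full dataset is loaded.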


In [None]:
# Read previously obtained German companies
df = read_dataframe(INTERMEDIATE_DATA_DIR, ORBIS_FILE, None, USEFUL_COLS)
print_dataframe_length(df)
df.head()

## 2. Check the data

What should be checked:
- Columns' type
- Number of unique rows
- Index
- NaN values
- Broken Unicode 

### Check the dataframe info

First, we check the number of columns and rows.

We print the column names with their data types.

In [None]:
# Get column names, the number of the columns, the number of rows
df.info(verbose=True, show_counts=True)

### Check for uniqueness and index

Then, we look at the uniqueness of values in the individual columns.

Next, we inspect the index of the dataframe. Because the data was read without an index column, we expect the default integer index (line number) at this point.
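
The two checks can be tried out on a toy dataframe. This is a minimal sketch in plain pandas; the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with one repeated record
toy_df = pd.DataFrame({
    "BvD ID number": ["DE0001", "DE0002", "DE0002"],
    "NAME": ["Alpha GmbH", "Beta AG", "Beta AG"],
})

# Per-column uniqueness, as in the loop used in this notebook
uniqueness = {col: toy_df[col].is_unique for col in toy_df.columns}

# A dataframe created without an explicit index gets a default integer index
index_values = list(toy_df.index.values)
print(uniqueness, index_values)
```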

In [None]:
# Check if each column is unique
for i in df.columns:
    print(f'{i} is unique: {df[i].is_unique}')

# Inspect the index values
# A default integer index is expected at this point
df.index.values

### Check for NaN values

Here, we check the missing data.

In [None]:
show_nan_counts(df, PLOT_LABELS, ymin=0, ymax=len(df)+500)

#### All values are NaN

Let's check whether any rows consist entirely of NaN values.

In [None]:
contains_all_nan(df)

#### Deal with all NaN

In [None]:
drop_all_nan(df)

#### Some values are NaN

Let's check whether any rows contain at least one NaN value.
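
The difference between the "all NaN" and "any NaN" checks in this section can be sketched with plain pandas. The toy rows are made up, and the helpers `contains_all_nan`, `drop_all_nan`, and `contains_any_nan` may be implemented differently.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is entirely NaN, row 2 is partially NaN
toy_df = pd.DataFrame({
    "BvD ID number": ["DE0001", np.nan, "DE0003"],
    "NAME": ["Alpha GmbH", np.nan, np.nan],
})

# True where every value in the row is missing
all_nan_rows = toy_df.isna().all(axis=1)

# True where at least one value in the row is missing
any_nan_rows = toy_df.isna().any(axis=1)

# Drop only the rows in which all values are missing
without_all_nan = toy_df.dropna(how="all")
print(len(without_all_nan))
```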

In [None]:
contains_any_nan(df)

### Check broken Unicode

It can happen that text was encoded with one standard and decoded with a different one.

As a result, some of the characters may be "broken".

A related example is the ampersand (&), which may appear in the data as the HTML entity `&amp;`.
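
To make the mechanism concrete, the sketch below reproduces both kinds of damage using only the standard library. The sample names are made up. The notebook itself relies on ftfy for the repair; the manual round trip here only works because the wrong codec is known in advance.

```python
import html

original = "Müller & Söhne"

# Encode as UTF-8 but decode as Latin-1: each umlaut turns into two characters
broken = original.encode("utf-8").decode("latin-1")

# Reversing the wrong decoding restores the text when the codecs are known
repaired = broken.encode("latin-1").decode("utf-8")

# HTML entities are a separate kind of damage and can be unescaped directly
escaped = "M&amp;M Handel"
unescaped = html.unescape(escaped)

print(broken, repaired, unescaped, sep=" | ")
```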

In [None]:
# Check for broken Unicode
df[df[COMPANY_NAME].str.contains('&amp', regex=True, case=False, na=False)].head()

#### Repair broken Unicode

The ftfy library ("fixes text for you") repairs such text.

In [None]:
# Repair broken Unicode
repair_broken_unicode(df, COMPANY_NAME)

# Look again for the broken ampersand
df[df[COMPANY_NAME].str.contains('&amp;', regex=True, case=False) == True]

### Replace with basic Latin characters

Let's check if the dataframe contains any characters other than basic Latin ones and replace them.

In [None]:
# Show rows that contain German special characters or accented Latin characters
df[df[COMPANY_NAME].str.contains('[ÄÖÜßÁÉÓÚ]', regex=True) == True].head()

#### Replace German characters

Replace German characters with umlaut and ß with their basic Latin equivalents.
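
One possible way to perform such a replacement on a single string is a translation table. This is a minimal sketch; the mapping (for example Ä to AE) is an assumption, and `replace_german_characters` may use a different mapping or approach.

```python
# Assumed transliteration table; not taken from the linkage package
GERMAN_MAP = str.maketrans({
    "Ä": "AE", "Ö": "OE", "Ü": "UE",
    "ä": "ae", "ö": "oe", "ü": "ue",
    "ß": "ss",
})


def replace_german(text):
    """Replace umlauts and ß with basic Latin letter sequences."""
    return text.translate(GERMAN_MAP)


print(replace_german("Müller Straße"))
```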

In [None]:
# Replace characters with umlaut
replace_german_characters(df, COMPANY_NAME)

# Check that no German special characters remain
df[df[COMPANY_NAME].str.contains('[ÄÖÜß]', regex=True) == True].head()

#### Replace other Latin characters

Let's replace the remaining accented Latin characters (e.g. á with a) and check that none are left.
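
A common way to strip accents from a string is Unicode decomposition followed by removing the combining marks. The sketch below uses only the standard library; it is an illustration and not necessarily how `replace_other_latin_characters` works.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Café Ávila"))
```

Note that this approach would also turn ü into u, which is one reason to replace the German characters first.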

In [None]:
# Replace á to a etc.
replace_other_latin_characters(df, COMPANY_NAME)

# Check that no accented Latin characters remain
df[df[COMPANY_NAME].str.contains('[ÁÉÓÚ]', regex=True) == True].head()

### Check the range of the values

We look at the values of company names.

In [None]:
plot_histogram(df, column_name=COMPANY_NAME, title='Count by company name before standardization and deduplication.', 
               ylabel='Num. of appearances per company name', xlabel='Company name')

### Deduplication

We check duplicated records and drop them.
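
The behavior of `duplicated` with `keep=False` and of `drop_duplicates` can be seen on a toy dataframe. This is a sketch in plain pandas with made-up rows.

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are identical
toy_df = pd.DataFrame({
    "BvD ID number": ["DE0001", "DE0002", "DE0002", "DE0003"],
    "NAME": ["Alpha", "Beta", "Beta", "Beta"],
})
cols = ["BvD ID number", "NAME"]

# keep=False marks every member of a duplicate group, not only the repeats
duplicates = toy_df[toy_df.duplicated(subset=cols, keep=False)]

# drop_duplicates keeps the first occurrence of each group by default
deduplicated = toy_df.drop_duplicates(subset=cols)
print(len(duplicates), len(deduplicated))
```

Rows are compared on both columns, so the same name under a different ID (DE0003) is not treated as a duplicate.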

In [None]:
# Show the duplicated records
df[df.duplicated(subset=USEFUL_COLS, keep=False) == True].sort_values(ORBIS_INDEX).head()

In [None]:
# Drop duplicates
df.drop_duplicates(subset=USEFUL_COLS, inplace=True)

# Get the new length of the dataframe
print_dataframe_length(df)

# Check again
df[df.duplicated(subset=USEFUL_COLS, keep=False) == True].sort_values(ORBIS_INDEX).head()

### Check the range of the values

We look at the values of company names again, this time after deduplication.

In [None]:
plot_histogram(df, column_name=COMPANY_NAME, title='Count by company name deduplicated before standardization.', 
               ylabel='Num. of appearances per company name', xlabel='Company name')

## 3. Clean company names

Here, we will clean the company names.

In general, we will remove:
- Redundant characters
- Redundant standalone numbers
- Redundant whitespace characters

We will apply:
- PDP standardization routines
- Dictionary cleaning

### Remove redundant words

Here, we standardize the company names by applying the PDP standardization routines:

0. Remove non-alphanumerical characters.

1. Change things to shortcuts

2. Remove the shortcuts

3. Remove corporate names and non-corporate

4. Combine abbreviations and remove them
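
The general idea of such a standardization can be sketched on a single string. The function below is a strongly simplified stand-in and not the PDP routines or the `clean_names` implementation: the upper-casing, the regular expression, and the small `LEGAL_FORMS` set are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny list of legal forms to remove
LEGAL_FORMS = {"GMBH", "AG", "KG", "UG", "EV", "CO"}


def standardize_name(name):
    """Upper-case, drop non-alphanumerical characters and legal-form tokens."""
    upper = name.upper()
    # Replace everything except letters, digits, and spaces with a space
    alnum_only = re.sub(r"[^A-Z0-9 ]", " ", upper)
    # Splitting on whitespace also collapses redundant spaces
    tokens = [tok for tok in alnum_only.split() if tok not in LEGAL_FORMS]
    return " ".join(tokens)


print(standardize_name("Mueller & Soehne GmbH & Co. KG"))
```

A name that consists only of removable tokens, such as `"GmbH"`, becomes an empty string under this sketch.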

In [None]:
df[COMPANY_NAME_STANDARDIZED] = df[COMPANY_NAME]
df.head()

# Clean company names from redundant words
clean_names(df, column_name=COMPANY_NAME_STANDARDIZED)

df.head()

#### Check empty names after cleaning

Let's check if some of the values resulted in empty strings.

In [None]:
# Check which company names resulted in an empty string after cleaning
empty_name_df = df[df[COMPANY_NAME_STANDARDIZED] == ''].copy()

empty_name_df

#### Fill empty names

Fill the empty standardized names with the original company names, then clean them again without removing redundant words.
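
The fill step can be sketched with a boolean mask in plain pandas. The toy rows are made up, and `str.upper()` merely stands in for the lighter cleaning pass, which this sketch does not reproduce.

```python
import pandas as pd

# Hypothetical toy data: the second name was cleaned down to an empty string
toy_df = pd.DataFrame({
    "NAME": ["Alpha GmbH", "GmbH"],
    "company_standard": ["ALPHA", ""],
})

# Select the rows whose standardized name is empty
empty_mask = toy_df["company_standard"] == ""

# Refill them from the original name (upper-casing is a placeholder for cleaning)
toy_df.loc[empty_mask, "company_standard"] = toy_df.loc[empty_mask, "NAME"].str.upper()
print(toy_df)
```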

In [None]:
# Restore the original name where the standardized name is empty
empty_name_df[COMPANY_NAME_STANDARDIZED] = empty_name_df[COMPANY_NAME]

# Clean company names without removing redundant words
clean_names(empty_name_df, column_name=COMPANY_NAME_STANDARDIZED, remove_redundant=False)

# Write the refilled values back into the main dataframe
df.update(empty_name_df)

df[df.index.isin(empty_name_df.index)]

In [None]:
# Check all rows containing anything other than basic Latin letters, digits, or spaces
df[df[COMPANY_NAME_STANDARDIZED].str.contains('[^a-zA-Z0-9 ]', regex=True) == True]

#### Company names do not contain leading, trailing, or double spaces

The cleaning should have removed all redundant spaces created during the cleaning process.
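
What counts as a redundant space can be sketched as follows. The helper is hypothetical and `count_redundant_spaces` may report its result differently.

```python
def has_redundant_spaces(name):
    """True if the name differs from its whitespace-normalized form."""
    # split() drops leading/trailing whitespace and collapses repeated whitespace
    return name != " ".join(name.split())


names = ["ALPHA BETA", " ALPHA", "ALPHA  BETA", "BETA "]
redundant_count = sum(has_redundant_spaces(n) for n in names)
print(redundant_count)
```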

In [None]:
count_redundant_spaces(df, COMPANY_NAME_STANDARDIZED)

### Remove redundant words using dictionary

We use a dictionary containing the 40k most common German words.

We use this step to achieve results similar to those obtained when cleaning the JobPostings dataset.
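
The idea of dictionary cleaning can be sketched as token filtering against a word set. The four-word set below is a made-up placeholder for the 40k-word dictionary, and `clean_names_with_dictionary` may behave differently, for example in how it treats names made up entirely of common words.

```python
# Hypothetical stand-in for the dictionary of the most common German words
COMMON_WORDS = {"UND", "DER", "HANDEL", "SERVICE"}


def remove_common_words(name, common_words=COMMON_WORDS):
    """Drop every token that appears in the common-word dictionary."""
    kept = [tok for tok in name.split() if tok.upper() not in common_words]
    return " ".join(kept)


print(remove_common_words("MUELLER UND SOEHNE HANDEL"))
```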

In [None]:
df[COMPANY_NAME_DICT_CLEANED] = df[COMPANY_NAME_STANDARDIZED]
df.head()

In [None]:
# Clean names with a dictionary
#clean_names_with_dictionary(df, column_name=COMPANY_NAME_DICT_CLEANED) # TODO: uncomment
df.head()

### Check the results

Let's plot value counts.
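
As a reminder of what `value_counts` returns, here is a minimal sketch in plain pandas with made-up names.

```python
import pandas as pd

# Hypothetical standardized names
names = pd.Series(["ALPHA", "BETA", "ALPHA", "GAMMA"])

# value_counts returns a Series indexed by name, sorted by count in descending order
counts = names.value_counts()
print(counts)
```

A large count for a single standardized name indicates that many original names were collapsed into it.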

In [None]:
# Counts after standardization
company_counts_df = df[COMPANY_NAME_STANDARDIZED].value_counts().copy()
plot_histogram(company_counts_df, title='Count by company name after standardization.', 
               ylabel='Num. of appearances per company name', xlabel='Company name')

In [None]:
# Counts after standardization and dictionary cleaning
company_dict_clean_counts_df = df[COMPANY_NAME_DICT_CLEANED].value_counts().copy()
plot_histogram(company_dict_clean_counts_df, title='Count by company name after standardization and dictionary cleaning.', 
               ylabel='Num. of appearances per company name', xlabel='Company name')

#### Save company appearances to CSV

We save the counts to check for company types that may have been missed, e.g. GGMBH.

In [None]:
company_standard_counts_df_file = "orbis_company_standard_value_counts.csv"
company_dict_clean_counts_df_file = "orbis_company_copy_value_counts.csv"

# Save dataframe to a csv file
save_dataframe(company_counts_df, INTERMEDIATE_DATA_DIR, company_standard_counts_df_file)
save_dataframe(company_dict_clean_counts_df, INTERMEDIATE_DATA_DIR, company_dict_clean_counts_df_file)

### Set an index

In _1. Load parts of Orbis dataset_, we read the data without setting the _BvD ID_ as an index.

We set the index now so that the data saved to file does not contain an additional column with the previously used index (line number).
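
The effect on the saved file can be seen by comparing CSV headers. This sketch uses plain pandas `to_csv` on made-up rows, which is an assumption; `save_dataframe` may write the file differently.

```python
import pandas as pd

# Hypothetical toy data with the default integer index
toy_df = pd.DataFrame({
    "BvD ID number": ["DE0001", "DE0002"],
    "NAME": ["ALPHA", "BETA"],
})

# With the default index, the CSV gains an unnamed leading column
header_default = toy_df.to_csv().splitlines()[0]

# With the BvD ID as index, only the two expected columns are written
header_with_index = toy_df.set_index("BvD ID number").to_csv().splitlines()[0]

print(repr(header_default))
print(repr(header_with_index))
```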


In [None]:
# Set an index
df.set_index(ORBIS_INDEX, inplace=True)
df.head()

## 4. Save processed data

The processed data is stored in a CSV file at the path below, where `{TYPE}` is the value of `TYPE` set at the top of the notebook:
```python
../data/processed/orbis/{TYPE}/
```

In [None]:
save_dataframe(df, PROCESSED_DATA_DIR, ORBIS_PROCESSED_FILE)