# Clean JobPostings Data

In this notebook, we clean the JobPostings dataset.

The notebook is organized in the following fashion:

0. Import libraries and define constants
1. Load JobPostings dataset
2. Check the data
3. Create a new dataframe with significant columns
4. Clean company names
5. Addresses
6. Clean addresses
7. Translate English names in addresses
8. German ZIP codes
9. Fill missing company addresses
10. Fill missing job addresses
11. Join dataframes
12. Save processed data

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)

import linkage.model.fill_addresses as fa
import linkage.model.german_zip_codes as gzc

from linkage.model.utils import read_dataframe, save_dataframe
from linkage.model.change_dataframe import replace_german_characters, repair_broken_unicode
from linkage.model.change_dataframe import replace_other_latin_characters
from linkage.model.clean_names import clean_names, clean_names_with_dictionary
from linkage.model.clean_addresses import clean_addresses, replace_english_names
from linkage.model.examine_dataframe import contains_all_nan, contains_any_nan, drop_all_nan, count_redundant_spaces
from linkage.model.examine_dataframe import column_contains_nan, drop_subset_nan, print_dataframe_length
from linkage.visualize.plot import plot_histogram
from linkage.visualize.visualize_dataframe import show_nan_counts

In [None]:
# 'std' for standardized, 'std_dict_40k' for dictionary cleaning with the 40k most common words
NOTE = 'std'

In [None]:
# Specify paths to data directories
RAW_DATA_DIR = '../data/raw/jobpostings'
INTERMEDIATE_DATA_DIR = "../data/intermediate/jobpostings"
PROCESSED_DATA_DIR = "../data/processed/jobpostings"

# Specify file names
JP_PROCESSED_FILE = f"jobpostings_test_sample_{NOTE}.csv"

# List of files containing job postings
JP_FILES_LIST = ['jobpostings_test_sample.txt', 
                 'jobpostings_test_sample2.txt']

# Dataframe's index
JP_INDEX = 'jobposting_id'

# Column names
# Defined as constants so that only this cell needs editing if the column names change
COMPANY_NAME = 'company'
COMPANY_CITY, COMPANY_ZIP, COMPANY_STATE = 'company_city', 'company_zipcode', 'company_state'
JOB_CITY, JOB_ZIP, JOB_STATE = 'job_city', 'job_zipcode', 'job_state'

# Columns to take when reading the dataframe from a file
USEFUL_COLS = [JP_INDEX, COMPANY_NAME, 
               COMPANY_CITY, COMPANY_ZIP, COMPANY_STATE, 
               JOB_CITY, JOB_ZIP, JOB_STATE]

# Address columns
COMPANY_ADDR_COLS = [COMPANY_CITY, COMPANY_ZIP, COMPANY_STATE]
JOB_ADDR_COLS = [JOB_CITY, JOB_ZIP, JOB_STATE]

# Split address columns into name (text) columns and ZIP code columns
COMPANY_ADDR_COLS_NAMES = [COMPANY_CITY, COMPANY_STATE]
COMPANY_ADDR_COLS_ZIPCODES = [COMPANY_ZIP]

JOB_ADDR_COLS_NAMES = [JOB_CITY, JOB_STATE]
JOB_ADDR_COLS_ZIPCODES = [JOB_ZIP]

# Additional columns
COMPANY_NAME_STANDARDIZED = 'company_standard'
COMPANY_NAME_DICT_CLEANED = 'company_dict_clean'

# Labels for plots
PLOT_LABELS = ['Comp. name', 'Comp. city', 'Comp. ZIP code', 'Comp. state', 'Job. city', 'Job. ZIP code', 'Job. state']
PLOT_LABELS_WITH_DICT_CLEAN = ['Comp. name', 'Comp. city', 'Comp. ZIP code', 'Comp. state', 
                               'Job. city', 'Job. ZIP code', 'Job. state', 'Comp. name stand.', 'Comp. name dict. clean']

## 1. Load JobPostings dataset

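The JobPostings dataset is stored under:
```python
../data/raw/jobpostings/
```
The dataset is split across the files listed in `JP_FILES_LIST` (_jobpostings_test_sample.txt_ and _jobpostings_test_sample2.txt_).

Each file is read into a pandas **DataFrame**, and the parts are concatenated into one main dataframe.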


In [None]:
# Here we save the read files for easy concatenation into a single dataframe
df_list = []

# Iterate over job postings files and read them to dataframes
for jobpostings_file in JP_FILES_LIST:
    df_part = read_dataframe(RAW_DATA_DIR, jobpostings_file, JP_INDEX, USEFUL_COLS, dtype=str)
    df_list.append(df_part)

# Concatenate dataframes to one main dataframe
df = pd.concat(df_list)

print(f"Num. of records: {len(df)}")

df.head()

## 2. Check the data

What should be checked:
- Columns' type
- Number of unique rows
- Index
- NaN values
- Broken Unicode 

### Check the dataframe info

First, we check the number of columns and rows.

We print the column names with their data types.

In [None]:
# Get column names, the number of the columns, the number of rows
df.info(verbose=True, show_counts=True)

### Check for uniqueness and index

Then, we look at the uniqueness of values in the individual columns.

Next, we check if the data frame has an index. If there is no index, the execution ends with an exception.

In [None]:
# Check if the column is unique
for i in df.columns:
    print(f'{i} is unique: {df[i].is_unique}') # TODO: only print unique

# Check the index values
# Results in error if there is no index
df.index.values  # Remember, we set the index at the beginning

### Check NaN values

Here, we check the missing data. 

In [None]:
show_nan_counts(df, PLOT_LABELS, ymin=0, ymax=len(df)+500)

#### All values are NaN

Let's check if some rows contain only NaN values.

In [None]:
contains_all_nan(df)

#### Deal with all NaN rows

For now, we will drop the rows with only NaN values.

In [None]:
drop_all_nan(df)

#### Company name is NaN
Let's check if some of the company names are NaN. 

In [None]:
column_contains_nan(df, COMPANY_NAME)

#### Deal with NaN company name values

For now, we will drop the rows with NaN company names.
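
The two drop steps can be sketched with plain pandas on toy data. This is only an illustration: the project helpers `drop_all_nan` and `drop_subset_nan` are not shown in this notebook and appear to modify the dataframe in place, whereas the sketch below returns new dataframes. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy data; the column names mirror the notebook's constants
toy_df = pd.DataFrame(
    {
        "company": ["ACME GMBH", np.nan, np.nan],
        "company_city": ["BERLIN", "HAMBURG", np.nan],
    },
    index=pd.Index([101, 102, 103], name="jobposting_id"),
)

# Drop rows in which every column is NaN (row 103)
without_empty_rows = toy_df.dropna(how="all")

# Drop rows in which the company name is NaN (row 102)
with_company_name = without_empty_rows.dropna(subset=["company"])

print(len(toy_df), len(without_empty_rows), len(with_company_name))
```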

In [None]:
drop_subset_nan(df, COMPANY_NAME)

### Check broken Unicode

Text is sometimes encoded with one standard and decoded with a different one, or HTML-escaped and never unescaped.

As a result, some of the characters may be "broken".

A common example is the ampersand (&), which shows up as `&amp;`.
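
Both kinds of damage can be reproduced with the standard library alone. The sketch below is only meant to show what "broken" text looks like and how it can be reversed in the simplest cases; it is not how the project's `repair_broken_unicode` helper works, and the sample strings are made up.

```python
import html

original = "Müller & Söhne"

# Simulate the damage: UTF-8 bytes decoded with the wrong codec (Latin-1)
broken = original.encode("utf-8").decode("latin-1")
print(broken)  # MÃ¼ller & SÃ¶hne

# Reverse the wrong decoding step to recover the original text
repaired = broken.encode("latin-1").decode("utf-8")
print(repaired)  # Müller & Söhne

# HTML escaping is a separate kind of damage: "&" stored as "&amp;"
escaped = "M&amp;M GmbH"
unescaped = html.unescape(escaped)
print(unescaped)  # M&M GmbH
```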

In [None]:
# Look for the broken ampersand
df[df[COMPANY_NAME].str.contains('&amp;', regex=True, case=False) == True].head()

#### Repair broken Unicode

The ftfy library ("fixes text for you") repairs this kind of damage.

In [None]:
# Repair broken unicode
repair_broken_unicode(df, df.columns)

# Look again for the broken ampersand
df[df[COMPANY_NAME].str.contains('&amp;', regex=True, case=False) == True].head()

### Replace with basic Latin characters

Let's check if the dataframe contains any characters other than basic Latin ones and replace them.

In [None]:
# Check rows that contain German (ÄÖÜß) or other accented (ÁÉÓÚ) characters
df[df[COMPANY_NAME].str.contains('[ÄÖÜßÁÉÓÚ]', regex=True) == True].head()

#### Replace German characters

Replace German characters with umlaut and ß with their basic Latin equivalents.
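
A minimal sketch of such a replacement is shown below. The mapping is an assumption (the common convention of writing ä as ae and ß as ss); the mapping actually used by `replace_german_characters` is not shown in this notebook.

```python
# Assumed convention: umlauts become two letters, ß becomes "ss"
GERMAN_CHAR_MAP = str.maketrans({
    "Ä": "AE", "Ö": "OE", "Ü": "UE",
    "ä": "ae", "ö": "oe", "ü": "ue",
    "ß": "ss",
})


def transliterate_german(text: str) -> str:
    """Replace umlauts and ß with basic Latin letters."""
    return text.translate(GERMAN_CHAR_MAP)


print(transliterate_german("Müller Straße"))  # Mueller Strasse
```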

In [None]:
# Replace characters with umlaut
replace_german_characters(df, df.columns)

# Check rows that still contain other accented characters
df[df[COMPANY_NAME].str.contains('[ÁÉÓÚ]', regex=True) == True].head()

#### Replace other Latin characters

Let's replace the remaining accented characters, i.e. those that are not German, with their basic Latin equivalents.
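
One generic way to strip accents is Unicode decomposition from the standard library, sketched below. This is an illustration, not the implementation of `replace_other_latin_characters`. Note that it would also turn ü into u, so it only makes sense after the German characters have been handled.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("CAFÉ ÁGUILA"))  # CAFE AGUILA
```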

In [None]:
# Replace á to a etc.
replace_other_latin_characters(df, df.columns)

# Verify that no accented characters remain
df[df[COMPANY_NAME].str.contains('[ÁÉÓÚ]', regex=True) == True].head()

## 3. Create a new dataframe with significant columns

Let's create a new dataframe to simplify standardization and cleaning of company names and addresses.

Columns used for __company name__ standardization:
- jobposting_id (index)
- company

Columns used for __company address__ standardization:
- company_zipcode
- company_city
- company_state

Columns used for __job address__ standardization:
- job_zipcode
- job_city
- job_state


#### Dataframe for name cleaning

Create a new dataframe with significant columns.

In [None]:
# Create a new dataframe for name cleaning
# Take jobposting_id and company name
name_df = df[[COMPANY_NAME]].copy()
name_df.head()

#### Dataframe for addresses

Create new dataframes for company addresses and job addresses.

In [None]:
# Save address part of dataframe for later
company_addr_df = df[COMPANY_ADDR_COLS].copy()
job_addr_df = df[JOB_ADDR_COLS].copy()

print(f'Company address dataframe:\n{company_addr_df.head()}\n\n')

print(f'Job address dataframe:\n{job_addr_df.head()}')

### Check the range of the values

We look at the values of company names.

In [None]:
plot_histogram(name_df, column_name=COMPANY_NAME, title='Count by company name before standardization.', 
               ylabel='Num. of appearances per company name', xlabel='Company name')

## 4. Clean company names

Here, we will clean the company names.

In general, we will remove:
- Redundant characters
- Redundant standalone numbers
- Redundant whitespace

We will apply:
- PDP standardization routines
- Dictionary cleaning

In [None]:
# Check for corporation names, e.g. GmbH
name_df[name_df[COMPANY_NAME].str.contains(' GMBH ', regex=True, case=False)].head()

In [None]:
# Check for some special character in COMPANY_NAME column
name_df[name_df[COMPANY_NAME].str.contains('[&]', regex=True)].head()

### Remove redundant words

Here, we standardize company names by applying the PDP standardization routines:

0. Remove non-alphanumerical characters.

1. Change things to shortcuts

2. Remove the shortcuts

3. Remove corporate names and non-corporate

4. Combine abbreviations and remove them
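
The routines above can be sketched in a few lines. This is a heavily simplified illustration, not the `clean_names` implementation: the list of legal forms is a hypothetical sample, the abbreviation steps are reduced to joining single letters, and the input is assumed to be already transliterated to basic Latin characters.

```python
import re

# Hypothetical sample of legal forms; the real routines use a longer list
LEGAL_FORMS = {"GMBH", "GGMBH", "AG", "KG", "SE", "UG", "CO"}


def standardize_name(name: str) -> str:
    """Simplified name standardization (illustrative only)."""
    text = name.upper()
    # Replace non-alphanumerical characters with spaces
    text = re.sub(r"[^A-Z0-9 ]", " ", text)
    # Collapse runs of whitespace
    text = " ".join(text.split())
    # Join single letters standing next to each other, e.g. "B M W" -> "BMW"
    text = re.sub(r"\b(?:[A-Z] )+[A-Z]\b",
                  lambda match: match.group(0).replace(" ", ""), text)
    # Remove legal forms
    tokens = [token for token in text.split() if token not in LEGAL_FORMS]
    return " ".join(tokens)


print(standardize_name("Mueller & Soehne GmbH"))  # MUELLER SOEHNE
print(standardize_name("B M W AG"))               # BMW
```

A name that consists only of removed words, such as "GmbH & Co. KG", ends up as an empty string in this sketch, which is why empty results need a check after cleaning.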

In [None]:
name_df[COMPANY_NAME_STANDARDIZED] = name_df[COMPANY_NAME]
name_df.head()

# Clean company names from redundant words
clean_names(name_df, column_name=COMPANY_NAME_STANDARDIZED)

name_df.head()

#### Check empty names after cleaning

Let's check if some of the values resulted in empty strings.

In [None]:
# Check which company names resulted in an empty string after cleaning
empty_name_filter = name_df[name_df[COMPANY_NAME_STANDARDIZED] == '']

# Copy the rows so that they can be modified without a SettingWithCopyWarning
empty_name_df = name_df[name_df.index.isin(empty_name_filter.index)].copy()

empty_name_df

#### Fill empty names

Fill the empty company name values with their original version.
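
The idea can be sketched on toy data with a boolean mask. The values are made up, and upper-casing stands in for the lighter cleaning pass (`remove_redundant=False`) that the notebook applies.

```python
import pandas as pd

names = pd.DataFrame({
    "company": ["GmbH", "Acme GmbH"],
    "company_standard": ["", "ACME"],
})

# Rows whose standardized name became an empty string
is_empty = names["company_standard"] == ""

# Fall back to the original name for those rows only
names.loc[is_empty, "company_standard"] = names.loc[is_empty, "company"].str.upper()

print(names)
```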

In [None]:
# Fill the values where the empty company name is
empty_name_df[COMPANY_NAME_STANDARDIZED] = empty_name_df[COMPANY_NAME]

# Clean company names without removing redundant words
clean_names(empty_name_df, column_name=COMPANY_NAME_STANDARDIZED, remove_redundant=False)

# Update the values where the empty company name is
name_df.update(empty_name_df)

name_df[name_df.index.isin(empty_name_filter.index)]

#### Company names do not contain leading, trailing, or double spaces

Cleaning the names should have removed all the redundant spaces created during the cleaning process.
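
Such a check boils down to one regular expression. The sketch below is an assumption about what counts as a redundant space (leading, trailing, or repeated whitespace); `count_redundant_spaces` itself is not shown in this notebook.

```python
import re

# Leading whitespace, trailing whitespace, or two or more in a row
REDUNDANT_SPACE = re.compile(r"^\s|\s$|\s{2,}")


def count_redundant(names):
    """Count names that contain redundant whitespace."""
    return sum(1 for name in names if REDUNDANT_SPACE.search(name))


sample_names = ["ACME", " ACME", "ACME  CONSULT", "ACME "]
print(count_redundant(sample_names))  # 3
```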

In [None]:
count_redundant_spaces(name_df, COMPANY_NAME_STANDARDIZED)

In [None]:
# Check appearance of different types of companies
name_df[name_df[COMPANY_NAME_STANDARDIZED].str.contains('consult', regex=True, case=False) == True].head()

### Remove redundant words using dictionary

We use a dictionary containing the 40k most common German words.

We try to remove words which do not belong to the company names.

In [None]:
name_df[COMPANY_NAME_DICT_CLEANED] = name_df[COMPANY_NAME_STANDARDIZED]
name_df.head()

In [None]:
# Clean names with a dictionary
# clean_names_with_dictionary(name_df, column_name=COMPANY_NAME_DICT_CLEANED) # TODO: uncomment
name_df.head()

In [None]:
name_df[name_df[COMPANY_NAME_DICT_CLEANED] == 'DB']

### Check the result

Let's plot value counts.

In [None]:
# Counts after standardization
company_counts_df = name_df[COMPANY_NAME_STANDARDIZED].value_counts().copy()
plot_histogram(company_counts_df, title='Count by company name after standardization.', 
               ylabel='Num. of appearances per company name', xlabel='Company name')

In [None]:
# Counts after standardization and dictionary cleaning
company_dict_clean_counts_df = name_df[COMPANY_NAME_DICT_CLEANED].value_counts().copy()
plot_histogram(company_dict_clean_counts_df, title='Count by company name after standardization and dictionary cleaning.', 
               ylabel='Num. of appearances per company name', xlabel='Company name')

In [None]:
# Show the original company names that became empty after dictionary cleaning
name_df[name_df[COMPANY_NAME_DICT_CLEANED] == ''][COMPANY_NAME].unique()

#### Save company appearances to CSV

This helps to spot company types that the cleaning may have missed, e.g. GGMBH.

In [None]:
company_standard_counts_df_file = "jobpostings_company_standard_value_counts.csv"
company_dict_clean_counts_df_file = "jobpostings_company_copy_value_counts.csv"

# Save dataframe to a csv file
save_dataframe(company_counts_df, INTERMEDIATE_DATA_DIR, company_standard_counts_df_file)
save_dataframe(company_dict_clean_counts_df, INTERMEDIATE_DATA_DIR, company_dict_clean_counts_df_file)

### Update the main dataframe

Update the original dataframe by adding the standardized and dictionary-cleaned company names.

In [None]:
df[COMPANY_NAME_STANDARDIZED] = name_df[COMPANY_NAME_STANDARDIZED]
df[COMPANY_NAME_DICT_CLEANED] = name_df[COMPANY_NAME_DICT_CLEANED]

df.head()

## 5. Addresses

Here, we process the addresses for record linkage.

### Check NaN values

Some addresses may contain NaN values. We will try to fill them when possible. 

We will drop addresses which contain only NaN values. These records have a non-NaN company name; otherwise, they would have been dropped at the beginning.  

#### All values are NaN

Let's check if some rows contain only NaN values. First, check company addresses, then job addresses.

In [None]:
# Company address contains only NaN
contains_all_nan(company_addr_df)

In [None]:
# Job address contains only NaN
contains_all_nan(job_addr_df)

#### Deal with all NaN values

Drop the rows that contain only NaN values, as we do not use them during the cleaning process. The removed rows are preserved in the main dataframe (as the respective company name is not NaN).

In [None]:
drop_all_nan(company_addr_df)

In [None]:
drop_all_nan(job_addr_df)

#### Some values are NaN

Let's check rows with any NaN value.

In [None]:
# Company address contains any NaN
contains_any_nan(company_addr_df)

In [None]:
# Job address contains any NaN
contains_any_nan(job_addr_df)

### Check State values

Plot counts for each of the German states.

In [None]:
# Plot company states
df[COMPANY_STATE].value_counts().plot(kind='bar')

In [None]:
# Plot job states
df[JOB_STATE].value_counts().plot(kind='bar')

## 6. Clean addresses

Clean non-numerical parts of addresses from non-alphabetical characters, group single consecutive letters, and turn names upper case.

Standardization for addresses is similar to the one for company names.

### Clean company addresses

Here, we standardize company addresses.

In [None]:
# Clean company addresses
for column_name in COMPANY_ADDR_COLS_NAMES:

    clean_addresses(company_addr_df, column_name)
    
company_addr_df.head()

### Clean job addresses

Here, we standardize job addresses.

In [None]:
# Clean job addresses
for column_name in JOB_ADDR_COLS_NAMES:

    clean_addresses(job_addr_df, column_name)
    
job_addr_df.head()

### Categorical 

After the standardization, the state columns should have at most 17 distinct values (16 for the Bundesländer and 1 for NaN values).
Therefore, we can change the data type of the state columns to _categorical_ to save memory and computation time.
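
The memory effect is easy to see on synthetic data. The sketch below uses a made-up series of repeated state names; the exact byte counts depend on the pandas version, so only the comparison matters.

```python
import pandas as pd

# Synthetic column: a few distinct state names repeated many times
states = pd.Series(["BAYERN", "BERLIN", "HESSEN", "SACHSEN"] * 5000)
as_category = states.astype("category")

# deep=True includes the memory used by the strings themselves
string_bytes = states.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"strings:  {string_bytes} bytes")
print(f"category: {category_bytes} bytes")
print(f"distinct states: {len(as_category.cat.categories)}")
```

A categorical column stores each distinct string once and keeps only small integer codes per row, which is why the saving grows with the number of rows.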

In [None]:
# Convert a column type to categorical to save memory
company_addr_df[COMPANY_STATE] = company_addr_df[COMPANY_STATE].astype('category')

In [None]:
# Convert a column type to categorical to save memory
job_addr_df[JOB_STATE] = job_addr_df[JOB_STATE].astype('category')

## 7. Translate English names in addresses

Some of the cities may be named in English. 
Translate the English names to their German equivalents.
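
In its simplest form, such a translation is a dictionary lookup. The mapping below is a hypothetical sample with transliterated German spellings; the mapping used by `replace_english_names` is not shown in this notebook and may work differently, e.g. on substrings.

```python
# Hypothetical sample mapping from English to German city names
ENGLISH_TO_GERMAN = {
    "MUNICH": "MUENCHEN",
    "COLOGNE": "KOELN",
    "NUREMBERG": "NUERNBERG",
}


def translate_city(city: str) -> str:
    """Return the German name if the city is known, else the input."""
    return ENGLISH_TO_GERMAN.get(city, city)


print(translate_city("MUNICH"))  # MUENCHEN
print(translate_city("BERLIN"))  # BERLIN
```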

In [None]:
# Check for an English city name
company_addr_df[company_addr_df[COMPANY_CITY].str.contains('MUNICH') == True].head()

In [None]:
# Translate company city
replace_english_names(company_addr_df, COMPANY_CITY)

# Translate job city
replace_english_names(job_addr_df, JOB_CITY)

# Check that the English city name is gone
company_addr_df[company_addr_df[COMPANY_CITY].str.contains('MUNICH') == True].head()

In [None]:
company_addr_df[company_addr_df[COMPANY_CITY].str.contains('[^a-zA-Z0-9ÜÄÖß ]', regex=True) == True]

## 8. German ZIP codes

The file _German-Zip-Codes.csv_ with German ZIP codes is stored under:
```python
../data/external/german-zip-codes
```

We use German-Zip-Codes to fill the missing values.

In [None]:
# Initialize class for German-Zip-Codes
german_zipcodes = gzc.GermanZipCodes()

# Read the standardized dataframe of german zip codes
zip_df = german_zipcodes.zip_df
zip_df.head()

In [None]:
# Check the german-zip-codes dataframe info
zip_df.info(verbose=True, show_counts=True)

#### Replace the mean of ZIP codes

Because the mean of the ZIP codes was used in the previous step, we avoid showing a full ZIP code, which could be mistaken for a real one.

Instead, we replace the last 3 digits of the ZIP codes with 'xxx'.
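
The masking itself can be sketched as a small string operation. The helper name `mask_zip` is hypothetical; in the notebook the masked values come ready-made from `german_zipcodes.zip_mean_df`.

```python
def mask_zip(zipcode: str, keep: int = 2) -> str:
    """Keep the first `keep` digits and replace the rest with 'x'."""
    return zipcode[:keep] + "x" * (len(zipcode) - keep)


print(mask_zip("80331"))  # 80xxx
print(mask_zip("01067"))  # 01xxx
```

The second call also shows why ZIP codes are handled as strings: a numeric type would drop the leading zero.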

In [None]:
zip_mean_df = german_zipcodes.zip_mean_df
zip_mean_df.head()

## 9. Fill missing company addresses

Here, we try to fill missing parts of company addresses using other non-missing values of records.
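
The general idea can be sketched as a lookup against a reference table. This is not the `FillAddress` implementation, which is not shown in this notebook; the reference table, its column names, and all values below are made up for illustration.

```python
import pandas as pd

# Hypothetical reference table: one (masked) ZIP code per city
zip_reference = pd.DataFrame({
    "city": ["MUENCHEN", "BERLIN"],
    "zipcode": ["80xxx", "10xxx"],
})

addresses = pd.DataFrame({
    "company_city": ["MUENCHEN", "BERLIN", None],
    "company_zipcode": [None, "10115", None],
})

# Lookup series: city name -> ZIP code
city_to_zip = zip_reference.set_index("city")["zipcode"]

# Only rows with a missing ZIP code and a known city can be filled
fillable = addresses["company_zipcode"].isna() & addresses["company_city"].notna()
addresses.loc[fillable, "company_zipcode"] = (
    addresses.loc[fillable, "company_city"].map(city_to_zip)
)

print(addresses)
```

Rows with an existing ZIP code are left untouched, and rows without any usable value stay missing.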

In [None]:
# Initialize class for cleaning data
fill_address = fa.FillAddress(company_addr_df, zip_df, zip_mean_df, COMPANY_ZIP, COMPANY_CITY, COMPANY_STATE)

### ZIP codes

Fill missing company ZIP codes.

In [None]:
# Filter missing or invalid zipcodes and create a new dataframe
missing_zip_mask = ((company_addr_df[COMPANY_ZIP].isna() | company_addr_df[COMPANY_ZIP].str.contains('[a-zA-Z]', regex=True)) \
                   & company_addr_df[COMPANY_CITY].notna())

missing_zip_df = company_addr_df[missing_zip_mask].copy()

print_dataframe_length(missing_zip_df)
missing_zip_df.head()

In [None]:
# Fill missing zipcode
missing_zip_df = fill_address.fill_missing_zipcode(missing_zip_df)

column_contains_nan(missing_zip_df, COMPANY_ZIP, print_df=False)
missing_zip_df.head()

In [None]:
# Update company addresses
company_addr_df.update(missing_zip_df)

#### Check the data

In [None]:
company_addr_df[company_addr_df[COMPANY_ZIP].isna()].head()

### Company city

Fill missing company cities.

In [None]:
# Filter missing or invalid cities and create a new dataframe
missing_city_mask = (company_addr_df[COMPANY_CITY].isna() \
                     | (company_addr_df[COMPANY_CITY].str.contains('[^A-Z ]', regex=True) == True)) \
                     & company_addr_df[COMPANY_ZIP].notna()

missing_city_df = company_addr_df[missing_city_mask].copy()

print_dataframe_length(missing_city_df)
missing_city_df.head()

In [None]:
# Fill missing city
missing_city_df = fill_address.fill_missing_city(missing_city_df)

column_contains_nan(missing_city_df, COMPANY_CITY, print_df=False)
missing_city_df.head()

In [None]:
# Update company addresses
company_addr_df.update(missing_city_df)

#### Check the data

In [None]:
company_addr_df[company_addr_df[COMPANY_CITY].isna()].head()

### German state

Fill missing company states.

In [None]:
# Filter missing or invalid states and create a new dataframe
missing_state_mask = (company_addr_df[COMPANY_STATE].isna()
                     | company_addr_df[COMPANY_STATE].str.contains('[0-9]', regex=True)
                     | ~company_addr_df[COMPANY_STATE].isin(fill_address.bundesland_lst)) \
                     & (company_addr_df[COMPANY_ZIP].notna() 
                     | company_addr_df[COMPANY_CITY].notna())

missing_state_df = company_addr_df[missing_state_mask].copy()

print_dataframe_length(missing_state_df)
missing_state_df.head()

In [None]:
# Fill missing state
missing_state_df = fill_address.fill_missing_state(missing_state_df)

column_contains_nan(missing_state_df, COMPANY_STATE, print_df=False)
missing_state_df.head()

In [None]:
# Update company addresses
company_addr_df.update(missing_state_df)

#### Check the data

In [None]:
company_addr_df[company_addr_df[COMPANY_STATE].isna()].head()

Note: Schwadorf (ZIP code 2432) is situated near Vienna, i.e. in Austria, so it has no German state.

## 10. Fill missing job addresses

Here, we try to fill missing parts of job addresses using other non-missing values of records.

In [None]:
# Initialize class for cleaning data
fill_address = fa.FillAddress(job_addr_df, zip_df, zip_mean_df, JOB_ZIP, JOB_CITY, JOB_STATE)

### ZIP codes

Fill missing job ZIP codes.

In [None]:
# Filter missing or invalid zipcodes and create a new dataframe
missing_zip_mask = (job_addr_df[JOB_ZIP].isna() \
                   | job_addr_df[JOB_ZIP].str.contains('[a-zA-Z]', regex=True)) \
                   & job_addr_df[JOB_CITY].notna()

missing_zip_df = job_addr_df[missing_zip_mask].copy()

print_dataframe_length(missing_zip_df)
missing_zip_df.head()

In [None]:
# Fill missing zipcode
missing_zip_df = fill_address.fill_missing_zipcode(missing_zip_df)

column_contains_nan(missing_zip_df, JOB_ZIP, print_df=False)
missing_zip_df.head()

In [None]:
# Update job addresses
job_addr_df.update(missing_zip_df)

#### Check the data

In [None]:
job_addr_df[job_addr_df[JOB_ZIP].isna()].head()

### Job city

Fill missing job cities.

In [None]:
# Filter missing or invalid cities and create a new dataframe
missing_city_mask = (job_addr_df[JOB_CITY].isna() \
                     | (job_addr_df[JOB_CITY].str.contains('[0-9]', regex=True))) \
                     & (job_addr_df[JOB_ZIP].notna() 
                        | job_addr_df[JOB_STATE].notna())

missing_city_df = job_addr_df[missing_city_mask].copy()

print_dataframe_length(missing_city_df)
missing_city_df.head()

In [None]:
# Fill missing city
missing_city_df = fill_address.fill_missing_city(missing_city_df)

column_contains_nan(missing_city_df, JOB_CITY, print_df=False)
missing_city_df.head()

In [None]:
# Update job addresses
job_addr_df.update(missing_city_df, overwrite=True)

#### Check the data

In [None]:
job_addr_df[job_addr_df[JOB_CITY].isna()].head()

### German state

Fill missing job states.

In [None]:
# Filter missing or invalid states and create a new dataframe
missing_state_mask = (job_addr_df[JOB_STATE].isna()
                     | job_addr_df[JOB_STATE].str.contains('[0-9]', regex=True)
                     | ~job_addr_df[JOB_STATE].isin(fill_address.bundesland_lst)) \
                     & (job_addr_df[JOB_ZIP].notna() 
                     | job_addr_df[JOB_CITY].notna())

missing_state_df = job_addr_df[missing_state_mask].copy()

print_dataframe_length(missing_state_df)
missing_state_df.head()

In [None]:
# Fill missing state
missing_state_df = fill_address.fill_missing_state(missing_state_df)

column_contains_nan(missing_state_df, JOB_STATE, print_df=False)
missing_state_df.head()

In [None]:
# Update job addresses
job_addr_df.update(missing_state_df)

#### Check the data

In [None]:
job_addr_df[job_addr_df[JOB_STATE].isna()].head()

## 11. Join dataframes

Join dataframes of cleaned company and job addresses with the main dataframe.

The main dataframe is updated with only the non-NaN values from the address dataframes.
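
This is the documented behaviour of `DataFrame.update`: it aligns on the index, works in place, and skips NaN values in the other dataframe. A small sketch with made-up values:

```python
import pandas as pd

main = pd.DataFrame({"city": ["BERLIN", None, "KOELN"]}, index=[1, 2, 3])
cleaned = pd.DataFrame({"city": [None, "HAMBURG"]}, index=[1, 2])

# In place: row 1 keeps BERLIN (NaN is skipped), row 2 gets HAMBURG,
# row 3 is untouched because it is absent from `cleaned`
main.update(cleaned)

print(main)
```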

In [None]:
column_contains_nan(df, COMPANY_NAME)

In [None]:
# Plot num. of NaN values before filling missing values
show_nan_counts(df, PLOT_LABELS_WITH_DICT_CLEAN, ymin=13000, ymax=len(df)+500)

In [None]:
# Update dataframe with company addresses
df.update(company_addr_df)
df.head()

In [None]:
# Plot num. of NaN values after filling missing values of company addresses
show_nan_counts(df, PLOT_LABELS_WITH_DICT_CLEAN, ymin=13000, ymax=len(df)+500)

In [None]:
# Update dataframe with job addresses
df.update(job_addr_df)
df.head()

In [None]:
# Plot num. of NaN values after filling missing values of job addresses
show_nan_counts(df, PLOT_LABELS_WITH_DICT_CLEAN, ymin=13000, ymax=len(df)+500)

In [None]:
column_contains_nan(df, COMPANY_NAME)

### Check State values

We check counts for each German state.

In [None]:
# Plot company states after filling
df[COMPANY_STATE].value_counts().plot(kind='bar')

In [None]:
# Plot job states after filling
df[JOB_STATE].value_counts().plot(kind='bar')

## 12. Save processed data

The processed data is stored in a CSV file under:
```python
../data/processed/jobpostings
```

In [None]:
save_dataframe(df, PROCESSED_DATA_DIR, JP_PROCESSED_FILE)