# Cleaning wage-associated columns

This file contains all cleanup routines for columns with the prefix "wage".

This document is divided into two sections:
- Explanation
- Functions

In the first section, we explain some of the peculiarities of the highlighted column. This results in a function that can easily be reused when cleaning up the associated columns.

## Cleaning up wage_offer_from

The columns are:
- wage_offer_from_9089
- wage_offered_from_9089

In [1]:
import pandas as pd
import importlib
import modules
import numpy as np
import matplotlib.pyplot as plt

In [2]:

col_list = ["wage_offer_from_9089", "wage_offered_from_9089"]
visas_df = pd.read_csv("data/us_perm_visas.csv", usecols=col_list)
visas_df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'data/us_perm_visas.csv'

In [None]:
visas_df.dtypes

`wage_offered_from_9089` was successfully imported as float.
However, we'll have to take a closer look at `wage_offer_from_9089` before it can be converted to float.

Next, we'll take a closer look at how the values in these two columns are distributed.
We defined a new function in our modules library to deal with this task.

In [None]:
modules.print_count_of_values_relation(visas_df, True, True)

To explain the resulting graphs:
The x-axis shows the row index. The dataset contains over 350000 rows.
The y-axis shows whether a row holds an actual value: non-NaN values are displayed as 1, NaN values as 0.
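
The 0/1 encoding described above can be sketched on a small sample. This is not the implementation of `modules.print_count_of_values_relation`, which is not shown here; the DataFrame `toy_df` and its values are hypothetical and only illustrate how such an indicator could be derived with `notna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the two wage columns.
toy_df = pd.DataFrame({
    "wage_offer_from_9089": ["75,000.00", np.nan, np.nan, "62000"],
    "wage_offered_from_9089": [np.nan, 81000.0, 54000.0, np.nan],
})

# 1 where a row holds a value, 0 where it is NaN.
filled = toy_df.notna().astype(int)
print(filled)
```

Plotting each column of `filled` against the row index would give the kind of graph described above.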

It becomes apparent that the two columns complement each other: gaps in wage_offer_from_9089 can be filled with values from wage_offered_from_9089.
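
Besides reading it off the graphs, the same observation could be checked numerically. The following is a minimal sketch on hypothetical sample data (`toy_df` is not part of the original notebook): it counts rows where both columns hold a value and rows where neither does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last row is empty in both columns.
toy_df = pd.DataFrame({
    "wage_offer_from_9089": ["75,000.00", np.nan, np.nan, "62000", np.nan],
    "wage_offered_from_9089": [np.nan, 81000.0, 54000.0, np.nan, np.nan],
})

has_offer = toy_df["wage_offer_from_9089"].notna()
has_offered = toy_df["wage_offered_from_9089"].notna()

# Rows filled in both columns (an overlap) and rows filled in neither.
both_filled = int((has_offer & has_offered).sum())
both_missing = int((~has_offer & ~has_offered).sum())
print(both_filled, both_missing)
```

If `both_filled` is 0 on the real data, the two columns never overlap and can be merged without losing information.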

----
## Cleaning up `wage_offer_from_9089`

During our data analysis, it became apparent that `wage_offer_from_9089` has to be cleaned up before the contained data can be analyzed.

In [None]:
cleanup_df = visas_df.copy()
cleanup_df.dtypes

Originally, the values in `wage_offer_from_9089` were imported with the dtype "object". They should be converted to float values.

In [None]:
# cleanup_df["wage_offer_from_9089"].astype('float')

First, we tried to convert the data by applying the new type directly.
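
A direct cast of this kind typically fails as soon as a single value cannot be parsed, which is presumably why the cell above is commented out. The sketch below reproduces this on a hypothetical sample (`sample` and its values are assumptions, not taken from the dataset).

```python
import pandas as pd

# Hypothetical sample: a float, a plain numeric string,
# and a string with a thousands separator.
sample = pd.Series([65000.0, "72000", "85,000.00"], dtype=object)

try:
    sample.astype("float")
    conversion_failed = False
except ValueError as err:
    # float("85,000.00") is not valid, so the whole cast is rejected.
    conversion_failed = True
    print(f"astype failed: {err}")
```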

In [None]:
cleanup_df["wage_offer_from_9089"].apply(type)

We defined a new column, `wage_Type`, containing the type name of each value.

In [None]:
cleanup_df['wage_Type'] = cleanup_df["wage_offer_from_9089"].apply(lambda x: type(x).__name__)

In [None]:

cleanup_df.head()

In [None]:
import modules

In [None]:
importlib.reload(modules)
modules.print_full(cleanup_df.sample(100))

It became apparent that applying the new type was only partially successful: values containing delimiters or separators are still recognized as strings.
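
To see which values are affected without scanning a random sample, the string entries could be isolated directly. This is a small sketch on hypothetical data (`mixed` is an assumed stand-in for the column).

```python
import pandas as pd

# Hypothetical object column holding floats and strings side by side.
mixed = pd.Series([65000.0, "85,000.00", 72000.0, "#############"], dtype=object)

# Keep only the entries that are still Python strings.
is_str = mixed.apply(lambda x: isinstance(x, str))
strings_only = mixed[is_str]

# Of those, the ones containing a comma as thousands separator.
with_comma = strings_only[strings_only.str.contains(",", regex=False)]
print(strings_only.tolist())
print(with_comma.tolist())
```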

In [None]:

cleanup_df['wage_Type'].value_counts()

We defined a new function to remove the delimiters. Additionally, we replaced the '#############' values, which occurred twice, with NaN.

In [None]:
def clean_currency(x):
    """Strip thousands separators from string values.

    The placeholder '#############' is mapped to NaN. Non-string values
    are already numeric and are returned unchanged.
    """
    if isinstance(x, str):
        if x == '#############':
            return np.nan
        return x.replace(',', '')
    return x

In [None]:
cleanup_df["wage_offer_from_9089"] = cleanup_df["wage_offer_from_9089"].apply(clean_currency).astype('float')
cleanup_df['wage_Type'] = cleanup_df["wage_offer_from_9089"].apply(lambda x: type(x).__name__)
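
As an alternative to the row-wise `apply`, the same cleanup could be sketched with vectorized string methods and `pd.to_numeric`. This is not the approach used in this notebook, and `raw` is hypothetical sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the value kinds described above.
raw = pd.Series([65000.0, "85,000.00", "#############", np.nan, "72000"], dtype=object)

# Cast everything to text, drop the thousands separator, then parse.
# errors="coerce" turns anything unparseable into NaN.
cleaned = pd.to_numeric(
    raw.astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)
print(cleaned)
```

Note that `errors="coerce"` silently maps every unparseable value to NaN, not only the '#############' placeholder, so the explicit function above is the stricter choice.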

In [None]:

cleanup_df['wage_Type'].value_counts()

In [None]:

modules.print_full(cleanup_df.sample(100))

Perfect, all values are now converted to float. The column `wage_offer_from_9089` was successfully cleaned up.

In [None]:
cleanup_df['wage_offer_from_9089'].median()

In [None]:
visas_df['wage_offer_from_9089'] = cleanup_df['wage_offer_from_9089']
del cleanup_df

In [None]:
visas_df.sample(15)

## Merging both columns

All NaN values will be filled with 0 so that both columns can be summed.
We saw earlier that the two columns never overlap.

In [None]:
visas_df['wage_offer_merged'] = visas_df['wage_offer_from_9089'].fillna(0) + visas_df['wage_offered_from_9089'].fillna(0)

In [None]:
visas_df.sample(15)

In [None]:
visas_df['wage_offer_merged'].dtype

In [None]:
(visas_df['wage_offer_merged'] == 0).sum()

We filled NaN values with 0 in order to calculate the sums. Now we replace 0 with NaN again to keep the column clean.
Additionally, a wage of 0 would be unrealistic.

In [None]:
visas_df['wage_offer_merged'] = visas_df['wage_offer_merged'].replace(0, np.nan)
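
Because the two columns never overlap, the merge could also be sketched without the detour through 0, using `Series.combine_first`. This is an alternative to the approach above, shown on hypothetical sample data (`toy_df`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last row is missing in both columns.
toy_df = pd.DataFrame({
    "wage_offer_from_9089": [75000.0, np.nan, np.nan, 62000.0, np.nan],
    "wage_offered_from_9089": [np.nan, 81000.0, 54000.0, np.nan, np.nan],
})

# Take the first column and fill its gaps from the second one.
merged = toy_df["wage_offer_from_9089"].combine_first(
    toy_df["wage_offered_from_9089"]
)
print(merged)
```

Rows missing in both columns stay NaN. Unlike the sum, this variant would keep the first column's value if both were filled, and it would leave a genuine 0 in the data untouched.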

In [None]:
(visas_df['wage_offer_merged'] == 0).sum()

In [None]:
visas_df['wage_offer_merged'].median()
clean_df = pd.DataFrame()
clean_df['wage_offer_merged'] = visas_df['wage_offer_merged']

In [None]:
clean_df.head()