# Joblisting Exploratory Data Analysis (EDA): Cleaning

---

<div class="alert alert-block alert-info">
Table of Contents: <br>
    
<ul>
    <li>1. <a href="#1.-Setup">Setup</a></li>
    <li>2. <a href="#2.-Loading-the-Data">Loading the Data</a></li>
    <li>3. <a href="#3.-Framing-the-Problem">Framing the Problem</a>
        <ul>
            <li>3.1. <a href="#3.1.-Objective">Objective</a></li>
            <li>3.2. <a href="#3.2.-Procedure">Procedure</a></li>
            <li>3.3. <a href="#3.3.-Design">Design</a></li>
        </ul>
    </li>
    <li>4. <a href="#4.-Previewing-the-Data">Previewing the Data</a></li>
    <li>5. <a href="#5.-Wrangling-and-Cleaning-the-Data">Wrangling and Cleaning the Data</a>
        <ul>
            <li>5.1. <a href="#5.1.-company-and-rating">company and rating</a></li>
            <li>5.2. <a href="#5.2.-headquarters">headquarters</a></li>
            <li>5.3. <a href="#5.3-salary-estimate">salary estimate</a></li>
            <li>5.4. <a href="#5.4-job-type">job type</a></li>
            <li>5.5. <a href="#5.5.-size,-founded,-type,-industry,-sector,-revenue,-and-job-description">size, founded, type, industry, sector, revenue, and job description</a></li>
            <li>5.6. <a href="#5.6.-Building-the-Pipeline">Building the Pipeline</a></li>
        </ul>
    </li>
    <li>6. <a href="#6.-Saving">Saving</a></li>
</ul>
</div>

## 1. Setup

+ The assertions below record the versions I'm using for this project. If one fails, other reasonably recent versions should still work; just avoid badly outdated ones (e.g. Python 2.* instead of Python 3.*).

In [1]:
# General imports.

# Python ≥ 3.7.9 is used.
import sys
assert sys.version_info >= (3, 7, 9)

import os
import re
import time
from collections import Counter
from ordered_set import OrderedSet

# Specific imports.

# NumPy ≥ 1.19.5 is used.
import numpy as np
assert np.__version__ >= "1.19.5"

# Pandas ≥ 1.2.4 is used.
import pandas as pd
assert pd.__version__ >= "1.2.4"

# Scikit-learn ≥ 0.24.2 is used.
import sklearn
assert sklearn.__version__ >= "0.24.2"

from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin, BaseEstimator
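
One caveat about the version checks above: apart from the Python check, they compare version strings lexicographically, so a release such as "1.10.0" would be judged older than "1.2.4". Below is a minimal sketch of a numeric comparison. `version_tuple` is a hypothetical helper (not part of NumPy, Pandas, or Scikit-learn), and it only looks at the leading numeric components of the version string.

```python
import re

def version_tuple(version, parts=3):
    """Return the leading numeric components of a version string as ints.

    Hypothetical helper for illustration; pre-release suffixes are ignored.
    """
    return tuple(int(n) for n in re.findall(r"\d+", version)[:parts])

# String comparison gets multi-digit components wrong ...
string_check = "1.10.0" >= "1.2.4"
# ... while comparing integer tuples orders them numerically.
tuple_check = version_tuple("1.10.0") >= version_tuple("1.2.4")

print(string_check, tuple_check)
```

With such a helper, a check could read `assert version_tuple(pd.__version__) >= (1, 2, 4)`.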



In [2]:
# Utility Function(s).

# Borrowed from:
# https://github.com/ageron/handson-ml2/blob/master/04_training_linear_models.ipynb.

# Matplotlib is needed for plt.tight_layout() in save_fig below.
import matplotlib.pyplot as plt

PROJECT_ROOT_DIR = "."
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "img")
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(ax, fig_id, tight_layout=True, fig_extension="png", resolution=300):
    # `ax` must expose a savefig() method (e.g. a matplotlib Figure).
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    ax.savefig(path, format=fig_extension, dpi=resolution)

## 2. Loading the Data

In [3]:
PATH = "../input/joblisting.csv"
df = pd.read_csv(PATH, index_col=0)

## 3. Framing the Problem

### 3.1. Objective

+ Wrangle, clean, and perform Exploratory Data Analysis (EDA) on the Glassdoor.com joblisting data, then extract useful findings about the relationships between estimated salary and the other variables. I'll split this EDA into 2 parts: descriptive and exploratory analysis first, then question-driven exploration.

We will work on wrangling and cleaning in this notebook.

### 3.2. Procedure

1. Setup.
2. Loading the data.
3. Framing the problem.
4. Previewing the data.
5. Wrangling and cleaning the data.
6. Univariate non-graphical analysis.
7. Univariate graphical analysis.
8. Multivariate non-graphical analysis. 
9. Multivariate graphical analysis.
10. Organize findings.

Disclaimer: Steps aren't necessarily done sequentially (I may jump back and forth a bit if I come up with an idea about something).

### 3.3. Design

+ 8 Google Sheets will be continually updated throughout the project. 6 of them hold assumptions, conclusions, and observations, each in a univariate and a multivariate version (assumptions_univariate, conclusions_univariate, observations_univariate, assumptions_multivariate, conclusions_multivariate, observations_multivariate). The remaining 2 are Questions and miscellaneous_observations; the latter contains observations about things other than the features themselves. These are my **analysis log sheets**.

|              | assumptions              | conclusions              | observations              |
| :----------- | :----------------------- | :----------------------- | :------------------------ |
| univariate   | assumptions_univariate   | conclusions_univariate   | observations_univariate   |
| multivariate | assumptions_multivariate | conclusions_multivariate | observations_multivariate |

| Other                      |
| :------------------------- |
| miscellaneous_observations |
| Questions                  |

+ Assumptions will be updated mostly at the beginning of the project (can still be updated all throughout the project).
+ Conclusions will mostly be done during and at the end of the project (can still be updated all throughout the project). 
+ Observations will, unlike the previous 2, be updated all throughout the project.


+ For this project, I created a diagram to go along and keep track of all the edits I've made. I call this my **edit log**.
+ **Note**: entries in the 6 sheets shown in the markdown table above (assumptions, conclusions, and observations) are denoted by [row, column]; entries in the miscellaneous_observations and Questions sheets are denoted by [row,].
+ **Note**: Assumptions are kept exclusively on the spreadsheet, Questions are denoted with (❓), conclusions with (📑), and observations with (🔍).


+ I'll try to adhere to the Python conventions in PEP8 and the [Zen of Python](https://stackoverflow.com/questions/43577404/purpose-of-import-this) (though I can't promise this will turn out perfectly).

## 4. Previewing the Data

In [4]:
df.head()

Unnamed: 0,company,job title,headquarters,salary estimate,job type,size,founded,type,industry,sector,revenue,job description
0,Walmart\n3.4,Data Scientist,"Sunnyvale, CA",-1,Job Type : N/A,10000+ employees,1994,company - public,general merchandise & superstores,retail,$10+ billion (usd),Position Summary...\nWhat you'll do...\nAnalyt...
1,TikTok\n3.8,Data Scientist,"Mountain View, CA",-1,Job Type : Full-time,501 to 1000 employees,2016,company - private,internet,information technology,unknown / non-applicable,TikTok is the leading destination for short-fo...
2,Indeed\n4.3,Principal Data Scientist - Candidate Recommend...,"San Francisco, CA",Employer Provided Salary:$187K - $231K,Job Type : Full-time,10000+ employees,2004,company - private,internet,information technology,$2 to $5 billion (usd),Your Job\nThe Candidate Recommendations team b...
3,Indeed\n4.3,Senior Data Scientist - Moderation Engineering,"San Francisco, CA",Employer Provided Salary:$130K - $156K,Job Type : Full-time,10000+ employees,2004,company - private,internet,information technology,$2 to $5 billion (usd),Your Job\nThe Moderation Engineering team’s mi...
4,Thermo Fisher - America\n3.8,Data Scientist III,"San Francisco, CA",-1,Job Type : N/A,10000+ employees,1902,company - public,biotech & pharmaceuticals,biotech & pharmaceuticals,$10+ billion (usd),Thermo Fisher Scientific Inc. is the world lea...


In [5]:
df.shape

(2573, 12)

In [6]:
df.columns

Index(['company', 'job title', 'headquarters', 'salary estimate', 'job type',
       'size', 'founded', 'type', 'industry', 'sector', 'revenue',
       'job description'],
      dtype='object')

In [7]:
df.describe()

Unnamed: 0,founded
count,2573.0
mean,1485.608628
std,866.857576
min,-1.0
25%,-1.0
50%,1995.0
75%,2009.0
max,2019.0


> 🔍 observations_univariate: [0, "founded"].\
 \
 The summary statistics for the "founded" feature are biased\
 because a great deal of missing values are encoded as -1 (even the 25th percentile is -1),\
 which skews the mean, std, median, and quartiles.
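
To illustrate the effect, here is a minimal sketch on made-up values (not the real "founded" column). It assumes -1 is the only sentinel and replaces it with NaN, which pandas skips when computing statistics.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; -1 stands for a missing founding year.
founded = pd.Series([-1, -1, 1995, 2009, 2019])

raw_mean = founded.mean()              # dragged down by the -1 sentinels
cleaned = founded.replace(-1, np.nan)  # NaN is skipped by pandas statistics
clean_mean = cleaned.mean()

print(f"mean with -1s: {raw_mean:.1f}, mean with NaNs: {clean_mean:.1f}")
```

The same replacement applied before `describe()` would give statistics over the known years only.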

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2573 entries, 0 to 2572
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   company          2573 non-null   object
 1   job title        2573 non-null   object
 2   headquarters     2573 non-null   object
 3   salary estimate  2573 non-null   object
 4   job type         2573 non-null   object
 5   size             2573 non-null   object
 6   founded          2573 non-null   int64 
 7   type             2573 non-null   object
 8   industry         2573 non-null   object
 9   sector           2573 non-null   object
 10  revenue          2573 non-null   object
 11  job description  2573 non-null   object
dtypes: int64(1), object(11)
memory usage: 261.3+ KB


> 🔍 misc_obsvtn: [0,].\
 \
 No column reports null or NaN values, but that is because missing values are encoded as -1s.\
 Everything is an object except for the year founded, which is int64.
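
Since `info()` cannot see these placeholders, one way to count them per column is sketched below. The frame is a toy example with assumed values; casting to `str` is a shortcut I chose so that the integer -1 and the string "-1" are matched by the same comparison.

```python
import pandas as pd

# Toy frame for illustration; the real columns and values differ.
toy = pd.DataFrame({
    "salary estimate": ["-1", "$130K - $156K", "-1"],
    "founded": [1994, -1, 2004],
    "sector": ["retail", "-1", "government"],
})

# Cast to str so the integer -1 and the string "-1" are both matched.
sentinel_counts = toy.astype(str).eq("-1").sum()
print(sentinel_counts)
```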

## 5. Wrangling and Cleaning the Data

In [9]:
df_clean = df.copy()
df_clean.head()

Unnamed: 0,company,job title,headquarters,salary estimate,job type,size,founded,type,industry,sector,revenue,job description
0,Walmart\n3.4,Data Scientist,"Sunnyvale, CA",-1,Job Type : N/A,10000+ employees,1994,company - public,general merchandise & superstores,retail,$10+ billion (usd),Position Summary...\nWhat you'll do...\nAnalyt...
1,TikTok\n3.8,Data Scientist,"Mountain View, CA",-1,Job Type : Full-time,501 to 1000 employees,2016,company - private,internet,information technology,unknown / non-applicable,TikTok is the leading destination for short-fo...
2,Indeed\n4.3,Principal Data Scientist - Candidate Recommend...,"San Francisco, CA",Employer Provided Salary:$187K - $231K,Job Type : Full-time,10000+ employees,2004,company - private,internet,information technology,$2 to $5 billion (usd),Your Job\nThe Candidate Recommendations team b...
3,Indeed\n4.3,Senior Data Scientist - Moderation Engineering,"San Francisco, CA",Employer Provided Salary:$130K - $156K,Job Type : Full-time,10000+ employees,2004,company - private,internet,information technology,$2 to $5 billion (usd),Your Job\nThe Moderation Engineering team’s mi...
4,Thermo Fisher - America\n3.8,Data Scientist III,"San Francisco, CA",-1,Job Type : N/A,10000+ employees,1902,company - public,biotech & pharmaceuticals,biotech & pharmaceuticals,$10+ billion (usd),Thermo Fisher Scientific Inc. is the world lea...


Web scraping is super messy, especially when it comes to joblistings! I've implemented some general transformers for removing rows with too many missing values, dropping specified columns, and removing leftover index columns.

In [10]:
# 1. Remove rows if too many missing values.

class RemoveNaNRows(BaseEstimator, TransformerMixin):
    def __init__(self, cnt=5):
        self.cnt = cnt

    def fit(self, X, y=None):  
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        # Keep a row only if it has fewer than `cnt` string "-1" placeholders.
        filter_ = X_copy.apply(lambda row: Counter(row.values)["-1"] < self.cnt, axis=1)
        X_copy = X_copy[filter_]
        
        return X_copy.reset_index(drop=True)

# 2. Select certain columns.

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names=None, include=True):
        self.attribute_names = attribute_names
        self.include = include
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        if not self.attribute_names:
            return X
        
        if self.include:
            return X[self.attribute_names]
        
        # Keep the remaining columns in their original order.
        cols = OrderedSet(X.columns).difference(OrderedSet(self.attribute_names))
        return X[list(cols)]
        

# 3. Remove unnecessary indices at the beginning.

class IndexRemover(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # drop() returns a new frame, and an empty list drops nothing.
        unnamed_cols = [col for col in X.columns if "Unnamed" in str(col)]
        return X.drop(columns=unnamed_cols)

> ❓ Questions: 0.\
  \
  Which columns should we be removing?

We will remove the "job title" column because it won't be very useful (all jobs are data science or data science related). "job description" would require NLP, which we won't tackle here (though maybe we could later extract keywords or frequent words using an NLP method!). Upon inspection, "founded" has too many missing values, and "industry" is too similar to "sector" (which has fewer missing values), so we drop both "founded" and "industry".

In [11]:
clean_general_pipeline = Pipeline([
    ("nan_row_remover", RemoveNaNRows()),
    ("col_selector", DataFrameSelector(["job title", "founded", "job description", "industry"], include=False)),
    ("index_remover", IndexRemover())
])

In [12]:
df_clean = clean_general_pipeline.fit_transform(df_clean)
df_clean.head()

Unnamed: 0,company,headquarters,salary estimate,job type,size,type,sector,revenue
0,Walmart\n3.4,"Sunnyvale, CA",-1,Job Type : N/A,10000+ employees,company - public,retail,$10+ billion (usd)
1,TikTok\n3.8,"Mountain View, CA",-1,Job Type : Full-time,501 to 1000 employees,company - private,information technology,unknown / non-applicable
2,Indeed\n4.3,"San Francisco, CA",Employer Provided Salary:$187K - $231K,Job Type : Full-time,10000+ employees,company - private,information technology,$2 to $5 billion (usd)
3,Indeed\n4.3,"San Francisco, CA",Employer Provided Salary:$130K - $156K,Job Type : Full-time,10000+ employees,company - private,information technology,$2 to $5 billion (usd)
4,Thermo Fisher - America\n3.8,"San Francisco, CA",-1,Job Type : N/A,10000+ employees,company - public,biotech & pharmaceuticals,$10+ billion (usd)


### 5.1. company and rating

In [13]:
# Looking at all unique values and their counts for 
# the "company" attribute (value_counts sorts in descending order).
companies = df["company"].value_counts()
companies

Facebook\n4.3                          119
Ursus\n4.4                              51
Uber\n4.0                               43
Intuit - Data\n4.5                      33
Salesforce\n4.4                         30
                                      ... 
EPM Scientific\n2.3                      1
Matroid\n5.0                             1
International Consulting Group Inc.      1
Gotion, Inc.\n4.1                        1
Baidu USA\n3.9                           1
Name: company, Length: 770, dtype: int64

> ❓ Questions: 1.\
  \
  How many unique companies are there?

In [14]:
# My process:
# - split ratings out
# - apply set() to remove duplicates

unique_companies = set([company.split("\n")[0] for company in companies.index])
len(unique_companies)

695

In [15]:
unique_companies

{'10x Genomics',
 '23andMe',
 '3T Biosciences',
 '4Insite',
 '6sense',
 '8k Miles',
 'ABD Insurance and Financial Services',
 'ACL Digital',
 'ALLDATA',
 'ALSTEM',
 'AMPAC Fine Chemicals',
 'APTIM',
 'AT&T',
 'AVA ROAD',
 'AVANADE',
 'Abbott Laboratories',
 'Abl Schools',
 'Acara Solutions',
 'Accenture',
 'Accrete Hitech Solutions',
 'Acorn Analytics',
 'Acumen LLC',
 'Adecco',
 'Adobe',
 'Advanced Systems Group',
 'Adventist Health',
 'Affimedix Inc',
 'Afresh',
 'Afterpay Touch',
 'Agama Solutions',
 'Agilent Technologies, Inc.',
 'Aible',
 'Akoya Biosciences',
 'Akraya Inc.',
 'Alameda Health Consortium/Community Health Center Network',
 'Alexa Internet',
 'AlignTech',
 'Alkahest, Inc.',
 'Allness Inc',
 'Allscripts',
 'Alten',
 'Alto Neuroscience, Inc.',
 'Alvah Contractors, Inc.',
 'Amazon Dev Center U.S., Inc.',
 'Amazon Web Services, Inc.',
 'Amazon.com Services LLC',
 'Ambys Medicines',
 'Amino, Inc.',
 'Amobee',
 'AmpersandPeople',
 'Anaspec, Inc',
 'Anthem',
 'Anzu Global',


> 🔍 observations_univariate: [0, "company"].\
  \
  There are 695 unique companies, but some appear to overlap, like "Wells Fargo" and "Wells Fargo Bank",\
  and some entries aren't company names at all, like "information technology sector" or "financial services company".

> ❓ Questions: 2.\
  \
  What percentage of instances don't have ratings?

In [16]:
p = len([c for c in df.company if "\n" not in c])/len(df) * 100

print(f"About {p:.1f}% of the dataset doesn't have ratings.")

About 13.6% of the dataset doesn't have ratings.


> 🔍 observations_univariate: [1, "company"].\
  \
  13.6% is a good chunk of the data, but it is below my 15% threshold\
  for ignoring a possible feature, so I think it is worth including rating in the dataframe.

The "company" column still needs a lot more cleaning. Namely, we need to identify entries that refer to the same company and replace them with a single name. This will make it easier to group them later on.

We could use a simple NLP approach to measure how similar 2 strings are, but that runs the risk of mislabeling data. And even then, comparing each name with every other name takes $O(N^2)$ comparisons!

Since we only have 695 unique companies, I'll simply hardcode rules to cover the edge cases seen so far. For more complex tasks, such as a consistent stream of new data, we might want to shift to a more generalizable setup. On the client side, data entry options can be enforced (though not really in this case), and on our side we can either build a really robust parser or build a bot/AI that compares semantic meaning to detect company names that likely refer to the same company. This could leverage unsupervised learning methods! But for now, we will stick with a simple hardcoded parser.
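
To make the mislabeling risk concrete, here is a minimal sketch using `difflib.SequenceMatcher` from the standard library. This is my stand-in for "a simple NLP approach", not something used elsewhere in this notebook, and `name_similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] between two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two spellings of the same company.
same_company = name_similarity("Wells Fargo", "Wells Fargo Bank")
# Two different institutions that share a long prefix.
different_campuses = name_similarity(
    "University of California, Berkeley", "University of California, Davis"
)
# Two unrelated names.
unrelated = name_similarity("Uber", "Adobe")

print(f"{same_company:.2f} {different_campuses:.2f} {unrelated:.2f}")
```

The first two pairs score nearly the same, so any single threshold that merges the Wells Fargo variants is at risk of also merging the two campuses. That is the main reason for hardcoding the rules here.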

In [17]:
set(sorted(df_clean.company.values))

{'10x Genomics\n4.1',
 '23andMe\n4.1',
 '3T Biosciences',
 '4Insite\n4.0',
 '6sense\n4.9',
 '8k Miles\n3.9',
 'ABD Insurance and Financial Services\n4.7',
 'ACL Digital\n3.5',
 'ACL Digital\n3.6',
 'ALLDATA\n4.1',
 'ALSTEM\n3.7',
 'AMPAC Fine Chemicals\n3.1',
 'APTIM\n3.4',
 'AT&T\n3.7',
 'AVANADE\n4.1',
 'Abbott Laboratories\n3.9',
 'Abl Schools',
 'Acara Solutions\n3.5',
 'Accenture\n4.0',
 'Accrete Hitech Solutions',
 'Acorn Analytics',
 'Acumen LLC\n3.3',
 'Adecco\n3.7',
 'Adobe\n4.4',
 'Advanced Systems Group\n3.0',
 'Adventist Health\n3.6',
 'Afresh\n5.0',
 'Afterpay Touch\n3.7',
 'Agama Solutions\n3.7',
 'Agilent Technologies, Inc.\n4.3',
 'Aible\n5.0',
 'Akoya Biosciences\n4.0',
 'Akraya Inc.\n4.6',
 'Akraya Inc.\n4.7',
 'Alexa Internet\n3.8',
 'AlignTech\n4.4',
 'Allscripts\n3.7',
 'Alten\n3.2',
 'Alto Neuroscience, Inc.',
 'Amazon Dev Center U.S., Inc.\n3.8',
 'Amazon Web Services, Inc.\n3.8',
 'Amazon.com Services LLC\n3.8',
 'Amino, Inc.\n4.5',
 'Amobee\n3.9',
 'Anthem\n3.6

> ❓ Questions: 3.\
  \
  Which companies do we need to aggregate names for?

Aggregate: Amazon, Parker Institute for/of Cancer Immunotherapy, Sony, Stealth Mode Startup, Twitch, Wells Fargo, Varo\
Remove: financial services company, information technology sector

Let's build out the cleaner for splitting out the ratings and for handling these aggregations and removals.
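
As a quick illustration of the splitting step on its own, here is a sketch using pandas string methods on toy values in the scraped "name\nrating" format. It is an alternative formulation under the assumption that the rating always follows the first newline.

```python
import pandas as pd

# Toy values in the scraped "name\nrating" format; the second has no rating.
company = pd.Series(["Walmart\n3.4", "3T Biosciences"])

# n=1 splits on the first newline only; expand=True returns one column per part.
parts = company.str.split("\n", n=1, expand=True)
names = parts[0]
ratings = pd.to_numeric(parts[1])  # a missing rating becomes NaN

print(names.tolist(), ratings.tolist())
```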

In [18]:
replace_company_names = {"Amazon": "Amazon", 
                         "Parker Institute": "Parker Institute for Cancer Immunotherapy", 
                         "Sony": "Sony", 
                         "Stealth Mode": "Stealth Mode", 
                         "Twitch": "Twitch", 
                         "Wells Fargo": "Wells Fargo", 
                         "Varo": "Varo"}

remove_company_names = ["financial services company", "information technology sector"]

In [19]:
# 1. Split company and rating and change rating dtype.
# 2. Aggregate similar company names. 
# 3. Remove non-company names. 

class CompanyRatingSplit(BaseEstimator, TransformerMixin):
    def __init__(self, idx=1):
        self.idx = idx
        
    def _split_rating(self, X):
        X_copy = X.copy()
        company_rating = (X_copy["company"]
                          .apply(lambda company: pd.Series(company.split("\n")))
                          .set_axis(["company", "rating"], axis=1))
        X_copy["company"] = company_rating["company"]
        X_copy.insert(self.idx, "rating", company_rating["rating"])
        X_copy.rating = X_copy.rating.astype("float64")
        
        return X_copy
        
    def fit(self, X, y=None):
        return self
        
    def transform(self, X):
        X_copy = X.copy()
        X_copy = self._split_rating(X_copy)
        
        return X_copy
    
class AggregateCompanyNames(BaseEstimator, TransformerMixin):
    def __init__(self, company_names):
        self.company_names = company_names
    
    def fit(self, X, y=None):
        # Ideally, you would have a learnable algorithm to find similar names.
        # Or, you have some simple NLP method to match similar words together.
        # But this gets finicky when you encounter "University of California, Berkeley"
        # and "University of California, Davis".
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        for k, v in self.company_names.items():
            # Case-insensitive substring match on the company name.
            filter_ = X_copy["company"].apply(lambda company: k.lower() in company.lower())
            X_copy.loc[filter_, "company"] = v
            
        return X_copy.reset_index(drop=True)
    
class RemoveNonCompanyNames(BaseEstimator, TransformerMixin):
    def __init__(self, company_names):
        self.company_names = company_names
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        for c in self.company_names:
            X_copy = X_copy[X_copy.company != c]
        
        return X_copy.reset_index(drop=True)

We know from a previous observation that there are NaNs in the ratings. Let's first fill NaNs with the company's average rating (if the company has multiple joblistings and some of them have ratings). Then, we will fill the remaining NaNs with the global average rating.
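
The two-stage fill can be sketched compactly with `groupby(...).transform` on toy data. This is an alternative formulation for illustration, not the transformer code used in this notebook, and the company names and ratings are made up.

```python
import numpy as np
import pandas as pd

# Toy data: company A has one known rating, B has none, C has one.
toy = pd.DataFrame({
    "company": ["A", "A", "B", "C"],
    "rating": [4.0, np.nan, np.nan, 3.0],
})

# Stage 1: fill with the company's own average (NaN if the company has no rating).
company_avg = toy.groupby("company")["rating"].transform("mean").round(1)
filled = toy["rating"].fillna(company_avg)

# Stage 2: fill what is left with the global average of the stage-1 result.
global_avg = round(filled.mean(), 1)
filled = filled.fillna(global_avg)

print(filled.tolist())
```

Computing the global average after stage 1 mirrors running the two fills one after the other, so the second fill sees the ratings already filled by the first.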

In [20]:
# 1. Fill rating by average company rating.
# 2. Fill by average rating.
    
class FillRatingByCmpnyAvg(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def _fill_nans_with_cmpny_avg_rating(self, cmpny_set, df):
        for company in cmpny_set:
            ratings = df[df.company == company]["rating"]
            ratings_nans = ratings[ratings.isna().values]
            nan_idx = list(ratings_nans.index)
            ratings_avg = round(ratings.mean(), 1)
            for idx in nan_idx:
                df.at[idx, "rating"] = ratings_avg
    
    def fit(self, X, y=None):
        # Average rating per company; NaN when a company has no rating at all.
        # Selecting "rating" before mean() avoids averaging non-numeric columns.
        avg_cmpny_ratings = X.groupby("company")["rating"].mean()
        
        # Companies with at least one known rating can fill their own gaps.
        self.companies_with_ratings = set(avg_cmpny_ratings.dropna().index)
        
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        self._fill_nans_with_cmpny_avg_rating(self.companies_with_ratings, X_copy)
        
        return X_copy
    
class FillRatingByAvg(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        X_copy = X.copy()
        
        self.avg_rating = round(X_copy.rating.mean(), 1)
        
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        X_copy.rating = X_copy.rating.fillna(self.avg_rating)
        
        return X_copy

In [21]:
clean_company_pipeline = Pipeline([
    ("company_rating_split", CompanyRatingSplit()),
    ("company_aggregate", AggregateCompanyNames(replace_company_names)),
    ("company_remove", RemoveNonCompanyNames(remove_company_names))
])

clean_rating_pipeline = Pipeline([
    ("rating_fill_by_cmpny_avg", FillRatingByCmpnyAvg()),
    ("rating_fill_by_avg", FillRatingByAvg())
])

In [22]:
df_clean = clean_company_pipeline.fit_transform(df_clean)
df_clean = clean_rating_pipeline.fit_transform(df_clean)

In [23]:
df_clean.company.value_counts()

Facebook                       119
Ursus                           51
Uber                            43
Salesforce                      42
Amazon                          41
                              ... 
Sigmaways, Inc.                  1
Dimensional Control Systems      1
infolob                          1
ALLDATA                          1
PayJoy                           1
Name: company, Length: 601, dtype: int64

> 📑 conclusions_univariate: [0, "company"].\
  \
  Many companies seem to post just one joblisting,\
  which makes "company" difficult to use for machine learning.\
  I could try different methods of sampling, but I'll hold off on this for now.

In [24]:
for company in list(df_clean["company"].values):
    if "-1" in company: print("Missing company.")

> 📑 conclusions_univariate: [1, "company"].\
  \
  "company" has no missing values.\
  It also has no outliers.

In [25]:
df_clean.head()

Unnamed: 0,company,rating,headquarters,salary estimate,job type,size,type,sector,revenue
0,Walmart,3.4,"Sunnyvale, CA",-1,Job Type : N/A,10000+ employees,company - public,retail,$10+ billion (usd)
1,TikTok,3.8,"Mountain View, CA",-1,Job Type : Full-time,501 to 1000 employees,company - private,information technology,unknown / non-applicable
2,Indeed,4.3,"San Francisco, CA",Employer Provided Salary:$187K - $231K,Job Type : Full-time,10000+ employees,company - private,information technology,$2 to $5 billion (usd)
3,Indeed,4.3,"San Francisco, CA",Employer Provided Salary:$130K - $156K,Job Type : Full-time,10000+ employees,company - private,information technology,$2 to $5 billion (usd)
4,Thermo Fisher - America,3.8,"San Francisco, CA",-1,Job Type : N/A,10000+ employees,company - public,biotech & pharmaceuticals,$10+ billion (usd)


### 5.2. headquarters

In [26]:
df_clean.headquarters.isna().sum(), df_clean.headquarters.unique()

(0,
 array(['Sunnyvale, CA', 'Mountain View, CA', 'San Francisco, CA',
        'United States', 'San Jose, CA', 'Livermore, CA', 'Palo Alto, CA',
        'Santa Clara, CA', 'Oakland, CA', 'San Mateo, CA', 'Los Altos, CA',
        'Foster City, CA', 'Morgan Hill, CA', 'Newark, CA',
        'Menlo Park, CA', 'Senior Data Scientist', 'Staff Data Scientist',
        'Redwood City, CA', 'Cupertino, CA', 'Burlingame, CA',
        'Emeryville, CA', 'South San Francisco, CA', 'Pleasanton, CA',
        'Los Gatos, CA', 'Half Moon Bay, CA', 'Sausalito, CA',
        'San Carlos, CA', 'Brisbane, CA', 'Elk Grove, CA',
        'Business Data Analyst', 'Pittsburg, CA', 'Sacramento, CA',
        'Fremont, CA', 'Concord, CA', 'East Palo Alto, CA', 'Milpitas, CA',
        'Patterson, CA', 'Berkeley, CA', 'San Bruno, CA', 'Hercules, CA',
        'San Ramon, CA', 'Staff Business Data Analysis',
        'Economist - Machine Learning Engineer',
        'Manager 3 Data and Analytics', 'Principal Data Scienti

> 📑 conclusions_univariate: [0, "headquarters"].\
  \
  There are no NaNs and no -1s. The unique headquarters values are mostly fine,\
  but a number of entries look like they belong in\
  "job title" instead.

In [27]:
headquarter_value_counts = df_clean.headquarters.value_counts()
headquarter_value_counts

San Francisco, CA                           986
Menlo Park, CA                              173
Mountain View, CA                           139
San Jose, CA                                127
Palo Alto, CA                               116
                                           ... 
Staff Data Scientist - Product Analytics      1
Tracy, CA                                     1
Staff Business Data Analyst                   1
Pittsburg, CA                                 1
West Sacramento, CA                           1
Name: headquarters, Length: 74, dtype: int64

> 🔍 observations_univariate: [0, "headquarters"].\
  \
  We can drop the rows where a job title appears in place of the headquarters, since there aren't that many.

> ❓ Questions: 4.\
  \
  Can we extract all the non-headquarter unique entries?

In [28]:
# Since all actual headquarter locations are in California, 
# I will parse the indices and simply check which ones don't have CA.

no_ca = headquarter_value_counts.index[headquarter_value_counts.index.map(lambda hq: "CA" not in hq)]
no_ca

Index(['United States', 'Business Data Analyst',
       'Staff Machine Learning Engineer', 'Senior Technical Data Analyst',
       'Staff Business Data Analysis', 'Economist - Machine Learning Engineer',
       'Senior Data Scientist', 'Principal Data Scientist',
       'Staff Business Data Analyst, Ecosystem Fraud Prevention',
       'Manager 3 Data and Analytics', 'Staff Data Scientist',
       'Group Manager - Data Science', 'Senior Business Data Analyst',
       'Staff Data Scientist - Product Analytics',
       'Staff Business Data Analyst'],
      dtype='object')
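
One caveat with the substring test: any job title that happens to contain the letters "CA" would pass as a location. A stricter check is sketched below on toy values. It assumes every valid entry ends in ", CA", and the last value is a hypothetical job title I made up to show the difference.

```python
import pandas as pd

# Toy values; "Staff CAD Analyst" is a made-up title containing "CA".
hq = pd.Series([
    "San Jose, CA",
    "United States",
    "Senior Data Scientist",
    "Staff CAD Analyst",
])

loose = hq.map(lambda s: "CA" in s)   # substring test
strict = hq.str.endswith(", CA")      # requires the ", CA" suffix

print(loose.tolist(), strict.tolist())
```

On the values listed in the output above, both checks agree; the stricter form only matters if a new batch of data brings in such a title.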

> 🔍 observations_univariate: [1, "headquarters"].\
  \
  "United States" is a real location and one of the unique values in the "headquarters" column.\
  However, since CA is in the US, it adds little information unless\
  joblistings also come in internationally. We will remove "United States" alongside\
  the job titles.

In [29]:
df_clean[df_clean.headquarters == "United States"]

Unnamed: 0,company,rating,headquarters,salary estimate,job type,size,type,sector,revenue
9,Mitre Corporation,3.8,United States,-1,Job Type : Full-time,5001 to 10000 employees,nonprofit organization,government,$1 to $2 billion (usd)
43,Apple,4.3,United States,-1,Job Type : N/A,10000+ employees,company - public,information technology,$10+ billion (usd)
49,Apple,4.3,United States,-1,Job Type : N/A,10000+ employees,company - public,information technology,$10+ billion (usd)
55,Apple,4.3,United States,-1,Job Type : Full-time,10000+ employees,company - public,information technology,$10+ billion (usd)
64,Apple,4.3,United States,-1,Job Type : Full-time,10000+ employees,company - public,information technology,$10+ billion (usd)
70,Apple,4.3,United States,-1,Job Type : N/A,10000+ employees,company - public,information technology,$10+ billion (usd)
94,Apple,4.3,United States,-1,Job Type : Full-time,10000+ employees,company - public,information technology,$10+ billion (usd)
96,Apple,4.3,United States,-1,Job Type : Full-time,10000+ employees,company - public,information technology,$10+ billion (usd)
105,Apple,4.3,United States,-1,Job Type : N/A,10000+ employees,company - public,information technology,$10+ billion (usd)


> 🔍 observations_univariate: [2, "headquarters"].\
  \
  Most jobs have a correct headquarters location. For those that don't, we will simply remove those instances from the dataset.

In [30]:
is_extraneous = ["CA" not in hq for hq in df_clean.headquarters]
df_clean[is_extraneous]

Unnamed: 0,company,rating,headquarters,salary estimate,job type,size,type,sector,revenue
9,Mitre Corporation,3.8,United States,-1,Job Type : Full-time,5001 to 10000 employees,nonprofit organization,government,$1 to $2 billion (usd)
43,Apple,4.3,United States,-1,Job Type : N/A,10000+ employees,company - public,information technology,$10+ billion (usd)
49,Apple,4.3,United States,-1,Job Type : N/A,10000+ employees,company - public,information technology,$10+ billion (usd)
55,Apple,4.3,United States,-1,Job Type : Full-time,10000+ employees,company - public,information technology,$10+ billion (usd)
64,Apple,4.3,United States,-1,Job Type : Full-time,10000+ employees,company - public,information technology,$10+ billion (usd)
70,Apple,4.3,United States,-1,Job Type : N/A,10000+ employees,company - public,information technology,$10+ billion (usd)
94,Apple,4.3,United States,-1,Job Type : Full-time,10000+ employees,company - public,information technology,$10+ billion (usd)
96,Apple,4.3,United States,-1,Job Type : Full-time,10000+ employees,company - public,information technology,$10+ billion (usd)
105,Apple,4.3,United States,-1,Job Type : N/A,10000+ employees,company - public,information technology,$10+ billion (usd)
142,Intuit - Data,4.5,Senior Data Scientist,"Mountain View, CA",Job Type : Full-time,5001 to 10000 employees,company - public,information technology,$2 to $5 billion (usd)


> 🔍 misc_obsvtn: [1,].\
  \
  I originally planned to drop every row whose "headquarters" value is actually\
  a job title. On inspection, these rows do contain the headquarters;\
  it is just stored in the wrong column.\
  However, they are all missing salary estimates (and I cannot infer\
  the salary estimate from similar instances), so I will drop them anyway: they\
  contribute little to the dataset (originally only 9 instances).\
  Edit: with the introduction of a new batch of data the count is no longer 9,\
  but it is still negligible compared to the size of the data.

Let's build our cleaner pipeline for headquarters.

In [31]:
# 1. Remove extraneous HQs.

class RemoveExtraneousHQ(BaseEstimator, TransformerMixin):
    def __init__(self,):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        # Keep only rows whose headquarters string contains "CA".
        filter_ = ["CA" in hq for hq in X_copy.headquarters]
        
        return X_copy[filter_].reset_index(drop=True)
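
The substring test `"CA" in hq` also accepts any string that merely contains the letters "CA" (a hypothetical "CAMBRIDGE, MA", for instance). Below is a minimal sketch of a stricter check, assuming every valid headquarters follows the "City, CA" pattern seen in this dataset. `is_california_hq` is an illustrative helper, not part of the notebook.

```python
def is_california_hq(hq):
    """Return True only for strings shaped like 'City, CA'."""
    # Non-string values (e.g. NaN floats) are treated as invalid.
    if not isinstance(hq, str):
        return False
    # Split on the LAST ", " so multi-word cities stay intact.
    city, sep, state = hq.rpartition(", ")
    return bool(city) and sep == ", " and state == "CA"


samples = ["Sunnyvale, CA", "United States", "Senior Data Scientist", "CAMBRIDGE, MA"]
kept = [hq for hq in samples if is_california_hq(hq)]
print(kept)
```

The same predicate could be passed to `X_copy.headquarters.map(...)` inside `transform` if the stricter behaviour is wanted.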

In [32]:
clean_hq_pipeline = Pipeline([
    ("remove_ext_hq", RemoveExtraneousHQ())
])

In [33]:
df_clean = clean_hq_pipeline.fit_transform(df_clean)

In [34]:
df_clean.headquarters.unique()

array(['Sunnyvale, CA', 'Mountain View, CA', 'San Francisco, CA',
       'San Jose, CA', 'Livermore, CA', 'Palo Alto, CA',
       'Santa Clara, CA', 'Oakland, CA', 'San Mateo, CA', 'Los Altos, CA',
       'Foster City, CA', 'Morgan Hill, CA', 'Newark, CA',
       'Menlo Park, CA', 'Redwood City, CA', 'Cupertino, CA',
       'Burlingame, CA', 'Emeryville, CA', 'South San Francisco, CA',
       'Pleasanton, CA', 'Los Gatos, CA', 'Half Moon Bay, CA',
       'Sausalito, CA', 'San Carlos, CA', 'Brisbane, CA', 'Elk Grove, CA',
       'Pittsburg, CA', 'Sacramento, CA', 'Fremont, CA', 'Concord, CA',
       'East Palo Alto, CA', 'Milpitas, CA', 'Patterson, CA',
       'Berkeley, CA', 'San Bruno, CA', 'Hercules, CA', 'San Ramon, CA',
       'Alameda, CA', 'Scotts Valley, CA', 'Petaluma, CA',
       'Greenbrae, CA', 'Rancho Cordova, CA', 'Saint Helena, CA',
       'Oakdale, CA', 'Clearlake, CA', 'Walnut Creek, CA', 'Lodi, CA',
       'Stanford, CA', 'Tracy, CA', 'Hayward, CA', 'Stockton, CA',
   

### 5.3. salary estimate

In [35]:
df_clean["salary estimate"].unique()

array(['-1', 'Employer Provided Salary:$187K - $231K',
       'Employer Provided Salary:$130K - $156K',
       'Employer Provided Salary:$190K',
       'Employer Provided Salary:$120K - $160K',
       'Employer Provided Salary:$115K', '$99K - $170K (Glassdoor est.)',
       '$94K - $169K (Glassdoor est.)', '$108K - $197K (Glassdoor est.)',
       '$91K - $151K (Glassdoor est.)', '$63K - $133K (Glassdoor est.)',
       '$126K - $222K (Glassdoor est.)', '$75K - $162K (Glassdoor est.)',
       '$85K - $177K (Glassdoor est.)', '$101K - $179K (Glassdoor est.)',
       '$96K - $154K (Glassdoor est.)', '$107K - $186K (Glassdoor est.)',
       '$100K - $170K (Glassdoor est.)', '$43K - $79K (Glassdoor est.)',
       '$80K - $151K (Glassdoor est.)', '$94K - $130K (Glassdoor est.)',
       '$127K - $201K (Glassdoor est.)', '$83K - $159K (Glassdoor est.)',
       '$82K - $167K (Glassdoor est.)', '$76K - $156K (Glassdoor est.)',
       '$79K - $161K (Glassdoor est.)', '$90K - $174K (Glassdoor est.)

> 🔍 observations_univariate: [0, "salary estimate"].\
  \
  Every value is a string holding either a range or a single number, so the\
  salary column has an object dtype. Missing values are encoded as "-1".\
  Some salary estimates are hourly.\
  There are 3 different formats: one prefixed with "Employer Provided Salary",\
  one with a single number instead of a range, and one suffixed with\
  "(Glassdoor est.)". We will have to build a parser for these!

As with headquarters, let's start by simply removing the rows with a missing salary estimate. Salary is the value we want to predict, so we only keep instances that have one. We will probably shave off a giant chunk of our data! :( But this is necessary if we want to train a predictive model: we don't want to impute our *labels*.

In [36]:
# 1. Remove NaNs from salary.
# 2. Parse the salary and extract average.

class RemoveNaNSalary(BaseEstimator, TransformerMixin):
    def __init__(self,):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        return X_copy[X_copy["salary estimate"] != "-1"].reset_index(drop=True)

class SalaryParser(BaseEstimator, TransformerMixin):
    def __init__(self,):
        pass
    
    # Parse a salary string and return the average of the provided salary range.
    # The "K" suffix is stripped, so "$187K - $231K" yields 209.0.
    def _salary_parser(self, salary):
        if "Employer Provided Salary" in salary:
            # salary_range can sometimes be just 1 value.
            salary_range = re.sub(r"[a-zA-Z]+|\s|\$|\:", "", salary).split("-")
        elif "(Glassdoor est.)" in salary:
            salary_range = re.sub(r"[a-zA-Z]+|\s|\$|\.|[()]", "", salary).split("-")
        else:
            # Without this branch, salary_range would be unbound below.
            raise ValueError(f"Unrecognized salary format: {salary!r}")
        avg_salary = np.mean(np.array(salary_range, dtype=np.float64))
        return avg_salary
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        X_copy["salary estimate"] = X_copy["salary estimate"].map(self._salary_parser)
        
        return X_copy.reset_index(drop=True)
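
To sanity-check the parsing logic outside the pipeline, here is a minimal standalone sketch. It uses a single regular expression instead of the two substitutions above and assumes every figure is written as `$<number>K`; strings without such a figure (including the "-1" placeholder) are rejected rather than guessed at. `parse_salary_midpoint` is a hypothetical helper, not the notebook's `SalaryParser`.

```python
import re
from statistics import mean

# Matches amounts such as "$187K" or "$99.5K" and captures the number.
_AMOUNT = re.compile(r"\$(\d+(?:\.\d+)?)K")


def parse_salary_midpoint(salary):
    """Average all '$<number>K' figures found in a salary string."""
    amounts = [float(a) for a in _AMOUNT.findall(salary)]
    if not amounts:
        raise ValueError(f"no salary figure found in {salary!r}")
    return mean(amounts)


examples = [
    "Employer Provided Salary:$187K - $231K",
    "Employer Provided Salary:$190K",
    "$99K - $170K (Glassdoor est.)",
]
for text in examples:
    print(text, "->", parse_salary_midpoint(text))
```

The three sample strings are taken from the `unique()` output above, one per format.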

In [37]:
clean_salary_pipeline = Pipeline([
    ("nan_remover", RemoveNaNSalary()),
    ("parser", SalaryParser())
])

In [38]:
df_clean = clean_salary_pipeline.fit_transform(df_clean)

In [39]:
df_clean.head()

Unnamed: 0,company,rating,headquarters,salary estimate,job type,size,type,sector,revenue
0,Indeed,4.3,"San Francisco, CA",209.0,Job Type : Full-time,10000+ employees,company - private,information technology,$2 to $5 billion (usd)
1,Indeed,4.3,"San Francisco, CA",143.0,Job Type : Full-time,10000+ employees,company - private,information technology,$2 to $5 billion (usd)
2,Harnham US,3.8,"San Francisco, CA",190.0,Job Type : N/A,51 to 200 employees,company - private,business services,$25 to $50 million (usd)
3,Abl Schools,4.1,"San Francisco, CA",140.0,Job Type : Full-time,1 to 50 employees,company - private,business services,unknown / non-applicable
4,Amazon,3.8,"Palo Alto, CA",115.0,Job Type : Full-time,10000+ employees,company - public,information technology,$10+ billion (usd)


### 5.4. job type

In [40]:
df_clean["job type"].unique()

array(['Job Type : Full-time', 'Job Type : N/A', 'Job Type : Part-time',
       'Job Type : Contract', 'Job Type : Internship',
       'Job Type : Temporary'], dtype=object)

In [41]:
df_clean[df_clean["job type"] == "Job Type : N/A"]

Unnamed: 0,company,rating,headquarters,salary estimate,job type,size,type,sector,revenue
2,Harnham US,3.8,"San Francisco, CA",190.0,Job Type : N/A,51 to 200 employees,company - private,business services,$25 to $50 million (usd)


> 🔍 observations_univariate: [0, "job type"].\
  \
  I need to RegEx parse the column values to\
  remove the "Job Type :" part. Then,\
  I need to replace the N/A with NaNs and finally\
  remove or impute the NaNs.

In [42]:
df_clean["job type"].value_counts()

Job Type : Full-time     1913
Job Type : Part-time       32
Job Type : Contract         8
Job Type : Internship       4
Job Type : Temporary        2
Job Type : N/A              1
Name: job type, dtype: int64

In [43]:
df_clean["job type"]

0       Job Type : Full-time
1       Job Type : Full-time
2             Job Type : N/A
3       Job Type : Full-time
4       Job Type : Full-time
                ...         
1955    Job Type : Full-time
1956    Job Type : Full-time
1957    Job Type : Full-time
1958    Job Type : Full-time
1959    Job Type : Full-time
Name: job type, Length: 1960, dtype: object

Let's remove the rows with an N/A job type and parse the remaining values.

In [44]:
class RemoveNaNJobType(BaseEstimator, TransformerMixin):
    def __init__(self,):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        return X_copy[X_copy["job type"].apply(lambda x: x.split(" ")[-1] != "N/A")].reset_index(drop=True)
        
        
class JobTypeParser(BaseEstimator, TransformerMixin):
    def __init__(self,):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        X_copy["job type"] = X_copy["job type"].map(lambda x: x.split()[-1])
        
        return X_copy
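
`x.split()[-1]` works here because none of the observed job types contains a space. Below is a minimal sketch that splits on the colon instead, so a multi-word value (hypothetical, not seen in this data) would survive intact. `parse_job_type` is an illustrative name.

```python
def parse_job_type(raw):
    """Strip the 'Job Type :' prefix; return None for 'N/A'."""
    _, sep, value = raw.partition(":")
    # If there is no colon, fall back to the whole (trimmed) string.
    value = value.strip() if sep else raw.strip()
    return None if value == "N/A" else value


for raw in ["Job Type : Full-time", "Job Type : N/A", "Job Type : Part-time"]:
    print(repr(raw), "->", repr(parse_job_type(raw)))
```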

In [45]:
clean_jobtype_pipeline = Pipeline([
    ("nan_remover", RemoveNaNJobType()),
    ("parser", JobTypeParser())
])

In [46]:
df_clean = clean_jobtype_pipeline.fit_transform(df_clean)

In [47]:
df_clean["job type"].unique()

array(['Full-time', 'Part-time', 'Contract', 'Internship', 'Temporary'],
      dtype=object)

In [48]:
df_clean.shape

(1959, 9)

### 5.5. size, founded, type, industry, sector, revenue, and job description

In [49]:
df_clean["size"].value_counts()

10000+ employees           596
1 to 50 employees          347
51 to 200 employees        267
1001 to 5000 employees     196
201 to 500 employees       188
501 to 1000 employees      168
5001 to 10000 employees    126
unknown                     71
Name: size, dtype: int64

> 🔍 observations_univariate: [0, "size"].\
  \
  There are 71 unknown values.

In [50]:
df_clean["type"].value_counts()

company - public                  886
company - private                 856
subsidiary or business segment     56
nonprofit organization             52
college / university               37
unknown                            26
contract                           15
government                         10
self-employed                       9
hospital                            6
private practice / firm             5
school / school district            1
Name: type, dtype: int64

> 🔍 observations_univariate: [0, "type"].\
  \
  26 unknowns, not too many.

In [51]:
df_clean["sector"].value_counts()

information technology                863
-1                                    241
biotech & pharmaceuticals             187
business services                     139
finance                               105
retail                                 97
health care                            61
manufacturing                          58
education                              48
insurance                              43
oil, gas, energy & utilities           26
non-profit                             20
media                                  18
transportation & logistics             12
government                             11
real estate                             7
telecommunications                      6
accounting & legal                      6
restaurants, bars & food services       4
construction, repair & maintenance      3
arts, entertainment & recreation        2
travel & tourism                        1
agriculture & forestry                  1
Name: sector, dtype: int64

> 🔍 observations_univariate: [0, "sector"].\
  \
  Lots of missing values (241 records are "-1") and a few sectors that appear only a handful of times.

In [52]:
df_clean["revenue"].value_counts()

unknown / non-applicable            855
$10+ billion (usd)                  475
$100 to $500 million (usd)           94
$2 to $5 billion (usd)               94
$1 to $2 billion (usd)               86
less than $1 million (usd)           76
$5 to $10 billion (usd)              53
$25 to $50 million (usd)             50
$50 to $100 million (usd)            44
$10 to $25 million (usd)             44
$1 to $5 million (usd)               43
$500 million to $1 billion (usd)     28
$5 to $10 million (usd)              17
Name: revenue, dtype: int64

> 🔍 observations_univariate: [0, "revenue"].\
  \
  Nearly half of the records (855 of 1,959) don't have a recorded revenue. We can't afford to drop that many rows, so we will leave them in the data.
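
To put the four observations side by side, the share of placeholder values per column can be worked out from the counts printed above. This is plain arithmetic on those outputs, not a new measurement; the variable names are illustrative.

```python
# Counts copied from the value_counts() outputs above; df_clean has 1959 rows.
total_rows = 1959
placeholder_counts = {
    "size": 71,      # "unknown"
    "type": 26,      # "unknown"
    "sector": 241,   # "-1"
    "revenue": 855,  # "unknown / non-applicable"
}

shares = {col: n / total_rows for col, n in placeholder_counts.items()}
for col, share in shares.items():
    print(f"{col:<8} {share:.1%}")
```

Dropping the revenue placeholders would cost roughly 44% of the rows, against under 13% for any other column.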

In [53]:
class RemoveNaNSize(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        X_copy = X_copy.drop(X_copy.index[X_copy["size"] == "-1"])
        X_copy = X_copy.drop(X_copy.index[X_copy["size"] == "unknown"])
        
        return X_copy.reset_index(drop=True)
    
class RemoveNaNType(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        X_copy = X_copy.drop(X_copy.index[X_copy["type"] == "-1"])
        X_copy = X_copy.drop(X_copy.index[X_copy["type"] == "unknown"])
        
        return X_copy.reset_index(drop=True)
    
class RemoveNaNSector(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        X_copy = X_copy.drop(X_copy.index[X_copy["sector"] == "-1"])
        
        return X_copy.reset_index(drop=True)
    
class RemoveNaNRevenue(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        
        X_copy = X_copy.drop(X_copy.index[X_copy["revenue"] == "-1"])
        
        return X_copy.reset_index(drop=True)
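
The four transformers above differ only in the column name and the placeholder strings they drop. Below is a minimal sketch of the shared logic as a single function, assuming pandas is installed. `drop_placeholder_rows` and its parameters are hypothetical names, and the toy frame is made up for illustration.

```python
import pandas as pd


def drop_placeholder_rows(df, column, placeholders=("-1",)):
    """Return a new frame without rows whose `column` is a placeholder."""
    keep = ~df[column].isin(list(placeholders))
    return df[keep].reset_index(drop=True)


# Toy data, not taken from the job listings.
toy = pd.DataFrame({
    "size": ["10000+ employees", "unknown", "-1", "1 to 50 employees"],
    "sector": ["finance", "retail", "media", "-1"],
})

cleaned = drop_placeholder_rows(toy, "size", placeholders=("-1", "unknown"))
print(cleaned)
```

Wrapping this function in one parameterized transformer would replace `RemoveNaNSize`, `RemoveNaNType`, `RemoveNaNSector`, and `RemoveNaNRevenue` with four instances of the same class.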

In [54]:
clean_size_pipeline = Pipeline([
    ("nan_remover", RemoveNaNSize())
])

clean_type_pipeline = Pipeline([
    ("nan_remover", RemoveNaNType())
])

clean_sector_pipeline = Pipeline([
    ("nan_remover", RemoveNaNSector())
])

clean_revenue_pipeline = Pipeline([
    ("nan_remover", RemoveNaNRevenue())
])

In [55]:
df_clean = clean_size_pipeline.fit_transform(df_clean)
df_clean = clean_type_pipeline.fit_transform(df_clean)
df_clean = clean_sector_pipeline.fit_transform(df_clean)
df_clean = clean_revenue_pipeline.fit_transform(df_clean)

In [56]:
df_clean.shape

(1701, 9)

In [57]:
df_clean.head()

Unnamed: 0,company,rating,headquarters,salary estimate,job type,size,type,sector,revenue
0,Indeed,4.3,"San Francisco, CA",209.0,Full-time,10000+ employees,company - private,information technology,$2 to $5 billion (usd)
1,Indeed,4.3,"San Francisco, CA",143.0,Full-time,10000+ employees,company - private,information technology,$2 to $5 billion (usd)
2,Abl Schools,4.1,"San Francisco, CA",140.0,Full-time,1 to 50 employees,company - private,business services,unknown / non-applicable
3,Amazon,3.8,"Palo Alto, CA",115.0,Full-time,10000+ employees,company - public,information technology,$10+ billion (usd)
4,Thermo Fisher - America,3.8,"San Francisco, CA",134.5,Full-time,10000+ employees,company - public,biotech & pharmaceuticals,$10+ billion (usd)


### 5.6. Building the Pipeline

In [58]:
# Hyperparams.
remove_cols = ["job title", "founded", "job description", "industry"]

replace_company_names = {"Amazon": "Amazon", 
                         "Parker Institute": "Parker Institute for Cancer Immunotherapy", 
                         "Sony": "Sony", 
                         "Stealth Mode": "Stealth Mode", 
                         "Twitch": "Twitch", 
                         "Wells Fargo": "Wells Fargo", 
                         "Varo": "Varo"}

remove_company_names = ["financial services company", "information technology sector"]

# Pipelines.
clean_general_pipeline = Pipeline([
    ("nan_row_remover", RemoveNaNRows()),
    ("col_selector", DataFrameSelector(remove_cols, include=False)),
    ("index_remover", IndexRemover())
])

clean_company_pipeline = Pipeline([
    ("company_rating_split", CompanyRatingSplit()),
    ("company_aggregate", AggregateCompanyNames(replace_company_names)),
    ("company_remove", RemoveNonCompanyNames(remove_company_names))
])

clean_rating_pipeline = Pipeline([
    ("rating_fill_by_cmpny_avg", FillRatingByCmpnyAvg()),
    ("rating_fill_by_avg", FillRatingByAvg())
])

clean_hq_pipeline = Pipeline([
    ("remove_ext_hq", RemoveExtraneousHQ())
])

clean_salary_pipeline = Pipeline([
    ("nan_remover", RemoveNaNSalary()),
    ("parser", SalaryParser())
])

clean_jobtype_pipeline = Pipeline([
    ("nan_remover", RemoveNaNJobType()),
    ("parser", JobTypeParser())
])

clean_size_pipeline = Pipeline([
    ("nan_remover", RemoveNaNSize())
])

clean_type_pipeline = Pipeline([
    ("nan_remover", RemoveNaNType())
])

clean_sector_pipeline = Pipeline([
    ("nan_remover", RemoveNaNSector())
])

clean_revenue_pipeline = Pipeline([
    ("nan_remover", RemoveNaNRevenue())
])

cleaning_pipeline = Pipeline([
    ("general", clean_general_pipeline),
    ("company", clean_company_pipeline),
    ("rating", clean_rating_pipeline),
    ("headquarters", clean_hq_pipeline),
    ("salary", clean_salary_pipeline),
    ("jobtype", clean_jobtype_pipeline),
    ("size", clean_size_pipeline),
    ("type", clean_type_pipeline),
    ("sector", clean_sector_pipeline),
    ("revenue", clean_revenue_pipeline),
])
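
Because most steps drop rows, it can help to see how many rows survive each stage. Below is a minimal sketch, assuming each step exposes `fit_transform` and returns something with a length. scikit-learn's `Pipeline.steps` is a list of `(name, estimator)` pairs, so `cleaning_pipeline.steps` could be passed in. `audit_row_counts` and the toy `DropValue` step are illustrative only.

```python
def audit_row_counts(steps, X):
    """Apply each (name, step) in order and record len(X) after each one."""
    counts = [("input", len(X))]
    for name, step in steps:
        X = step.fit_transform(X)
        counts.append((name, len(X)))
    return X, counts


class DropValue:
    """Toy stand-in for a row-dropping transformer; works on plain lists."""

    def __init__(self, value):
        self.value = value

    def fit_transform(self, X, y=None):
        return [x for x in X if x != self.value]


rows = ["a", "-1", "b", "unknown", "c"]
toy_steps = [
    ("drop_minus_one", DropValue("-1")),
    ("drop_unknown", DropValue("unknown")),
]
result, counts = audit_row_counts(toy_steps, rows)
for name, n in counts:
    print(f"{name:<15} {n}")
```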

In [59]:
df_clean = cleaning_pipeline.fit_transform(df)

In [60]:
df_clean.shape

(1701, 9)

In [61]:
df_clean.head()

Unnamed: 0,company,rating,headquarters,salary estimate,job type,size,type,sector,revenue
0,Indeed,4.3,"San Francisco, CA",209.0,Full-time,10000+ employees,company - private,information technology,$2 to $5 billion (usd)
1,Indeed,4.3,"San Francisco, CA",143.0,Full-time,10000+ employees,company - private,information technology,$2 to $5 billion (usd)
2,Abl Schools,4.1,"San Francisco, CA",140.0,Full-time,1 to 50 employees,company - private,business services,unknown / non-applicable
3,Amazon,3.8,"Palo Alto, CA",115.0,Full-time,10000+ employees,company - public,information technology,$10+ billion (usd)
4,Thermo Fisher - America,3.8,"San Francisco, CA",134.5,Full-time,10000+ employees,company - public,biotech & pharmaceuticals,$10+ billion (usd)


In [62]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1701 entries, 0 to 1700
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   company          1701 non-null   object 
 1   rating           1701 non-null   float64
 2   headquarters     1701 non-null   object 
 3   salary estimate  1701 non-null   float64
 4   job type         1701 non-null   object 
 5   size             1701 non-null   object 
 6   type             1701 non-null   object 
 7   sector           1701 non-null   object 
 8   revenue          1701 non-null   object 
dtypes: float64(2), object(7)
memory usage: 119.7+ KB


## 6. Saving

In [81]:
df_clean.to_csv("../input/joblisting_cleaned.csv", index=False)
df_clean.to_csv("./input/joblisting_cleaned.csv", index=False)
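
`to_csv` fails if the target directory does not exist. Below is a minimal sketch that creates the directory first, assuming pandas is installed. `save_csv` is a hypothetical helper, and the temporary directory stands in for the `input` folders used above.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df, path):
    """Create the parent directory if needed, then write df without its index."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({"company": ["Indeed"], "rating": [4.3]})
    out = save_csv(toy, Path(tmp) / "input" / "joblisting_cleaned.csv")
    restored = pd.read_csv(out)
    print(restored)
```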