# Finance Job Market Analysis

This notebook cleans the job-listing data I scraped from LinkedIn using [linkedin-job-scraper](https://github.com/Daneski13/linkedin-job-scraper).

## Data Cleaning

### Import Data

In [9]:
import glob

import pandas as pd

# Collect every CSV file in the data directory
files = glob.glob("data/*.csv")

# Read each file and stack the results into a single DataFrame
dfs = [pd.read_csv(file) for file in files]
linkedin = pd.concat(dfs)

display(linkedin.head())
linkedin.describe()

| | date_scraped | title | full_url | company | company_url | location | description | seniority_level | employment_type | job_function | industries |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-10-04 | Quantitative analyst (finance) | https://www.linkedin.com/jobs/view/quantitativ... | Lucas Group, A Korn Ferry Company | https://www.linkedin.com/company/lucas-group?t... | Charlotte, NC | `\n <p>Lucas Group, a Korn Ferry company...` | Associate | Full-time | Finance | Banking |
| 1 | 2022-10-03 | Accounting and finance associates | https://www.linkedin.com/jobs/view/accounting-... | EVERESTX Talent Solutions | https://www.linkedin.com/company/everestxtalen... | Pennsylvania, United States | `\n <strong>Overview of the Role:</stron...` | Not Assigned | Not Assigned | Not Assigned | Not Assigned |
| 2 | 2022-09-28 | Associate/ Consulting Associate - Litigation, ... | https://www.linkedin.com/jobs/view/associate-c... | Charles River Associates | https://www.linkedin.com/company/charles-river... | Washington, DC | `\n <strong>About Charles River Associat...` | Entry level | Full-time | Legal | Business Consulting and Services |
| 3 | 2022-09-28 | Associate/ Consulting Associate - Litigation, ... | https://www.linkedin.com/jobs/view/associate-c... | Charles River Associates | https://www.linkedin.com/company/charles-river... | Chicago, IL | `\n <strong>About Charles River Associat...` | Entry level | Full-time | Legal | Business Consulting and Services |
| 4 | 2022-10-04 | Senior Financial Analyst (Remote) | https://www.linkedin.com/jobs/view/senior-fina... | Capital Search Group | https://www.linkedin.com/company/capital-searc... | McLean, VA | `\n Microsoft has become a corporate lea...` | Not Assigned | Not Assigned | Not Assigned | Not Assigned |


| | date_scraped | title | full_url | company | company_url | location | description | seniority_level | employment_type | job_function | industries |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 60017 | 60017 | 60017 | 60016 | 60017 | 60017 | 60016 | 51890 | 52288 | 52288 | 52288 |
| unique | 7 | 12294 | 29607 | 7205 | 7213 | 3071 | 14787 | 6 | 4 | 693 | 1009 |
| top | 2022-09-30 | Remote Tax Professional | https://www.linkedin.com/jobs/view/finance-and... | Aston Carter | https://www.linkedin.com/company/aston-carter?... | United States | `\n <strong><u>What You'll Do...<br><br>...` | Associate | Full-time | Accounting/Auditing and Finance | Not Assigned |
| freq | 11299 | 2443 | 7 | 3339 | 3339 | 3738 | 2474 | 23092 | 40980 | 10160 | 9708 |
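
The 60,017 rows counted above exceed the 31,383 scraped listings cited later in this notebook. One possible explanation, which is an assumption and not something the notebook confirms, is that `data/*.csv` also matched a `data/cleaned.csv` written by an earlier run, since 31,383 + 28,634 = 60,017. A minimal sketch of how the file list could skip previously exported output; the helper name `raw_csv_files` and its parameters are hypothetical:

```python
import glob
import os


def raw_csv_files(data_dir="data", exclude=("cleaned.csv",)):
    """Return the CSV paths in data_dir, skipping file names listed in exclude."""
    paths = glob.glob(os.path.join(data_dir, "*.csv"))
    # Compare on the bare file name so the check works for any directory prefix
    return sorted(p for p in paths if os.path.basename(p) not in exclude)
```

Calling `raw_csv_files()` in place of `glob.glob("data/*.csv")` would leave the rest of the import cell unchanged. Writing the cleaned file to a separate directory would avoid the overlap as well.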


### Drop Duplicates

Of the 31,383 job listings that were scraped, 29,607 have a unique URL.

In [10]:
linkedin.drop_duplicates(subset="full_url", inplace=True)
linkedin.describe()

| | date_scraped | title | full_url | company | company_url | location | description | seniority_level | employment_type | job_function | industries |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 29607 | 29607 | 29607 | 29607 | 29607 | 29607 | 29606 | 29400 | 29422 | 29422 | 29422 |
| unique | 7 | 12294 | 29607 | 7205 | 7213 | 3071 | 14647 | 6 | 4 | 691 | 1000 |
| top | 2022-09-30 | Remote Tax Professional | https://www.linkedin.com/jobs/view/quantitativ... | Aston Carter | https://www.linkedin.com/company/aston-carter?... | United States | `\n <strong><u>What You'll Do...<br><br>...` | Associate | Full-time | Not Assigned | Not Assigned |
| freq | 5669 | 1214 | 1 | 1749 | 1749 | 1833 | 1229 | 11325 | 20245 | 8380 | 8381 |
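
To illustrate what `drop_duplicates(subset="full_url")` does, here is a small sketch on made-up rows (the URLs and dates are illustrative, not taken from the scraped data). With the default `keep="first"`, the first row seen for each URL survives, so a listing scraped on several dates keeps only the copy that was read first.

```python
import pandas as pd

# Toy data: the first and third rows share a URL
toy = pd.DataFrame({
    "full_url": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",
    ],
    "date_scraped": ["2022-09-28", "2022-09-28", "2022-10-04"],
})

# Count rows whose URL already appeared earlier in the frame
n_dupes = toy.duplicated(subset="full_url").sum()

# Keep the first occurrence of each URL
deduped = toy.drop_duplicates(subset="full_url")
print(n_dupes, len(deduped))
```

Checking `duplicated(...).sum()` before dropping gives the number of rows that will be removed, which is a quick way to confirm the counts quoted in the text.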


### Handle Nulls

In [11]:
linkedin.isna().sum()

date_scraped         0
title                0
full_url             0
company              0
company_url          0
location             0
description          1
seniority_level    207
employment_type    185
job_function       185
industries         185
dtype: int64

First, let's drop the row with the missing description.

In [12]:
linkedin.dropna(subset="description", inplace=True)
linkedin.isna().sum()

date_scraped         0
title                0
full_url             0
company              0
company_url          0
location             0
description          0
seniority_level    207
employment_type    185
job_function       185
industries         185
dtype: int64

I will fill the remaining missing values with the string "Not Assigned", the placeholder the scraped data already uses.

In [13]:
linkedin.fillna("Not Assigned", inplace=True)

### Data Types

In [14]:
linkedin.info(show_counts=False)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29606 entries, 0 to 4625
Data columns (total 11 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   date_scraped     object
 1   title            object
 2   full_url         object
 3   company          object
 4   company_url      object
 5   location         object
 6   description      object
 7   seniority_level  object
 8   employment_type  object
 9   job_function     object
 10  industries       object
dtypes: object(11)
memory usage: 2.7+ MB
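
Every column is stored as `object`, including `date_scraped`. If later analysis needs date arithmetic, that column could be parsed with `pd.to_datetime`. A sketch on made-up values, assuming the `YYYY-MM-DD` format seen in the output earlier in this notebook:

```python
import pandas as pd

# Toy frame with dates stored as strings, as in the scraped data
toy = pd.DataFrame({"date_scraped": ["2022-09-28", "2022-10-04"]})

# Parse the strings into datetime64 values using an explicit format
toy["date_scraped"] = pd.to_datetime(toy["date_scraped"], format="%Y-%m-%d")

# Date arithmetic now works, e.g. the span covered by the scrape
span_days = (toy["date_scraped"].max() - toy["date_scraped"].min()).days
```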


### Remove Unnecessary Listings

If a listing does not mention finance in its title or its description, it can be discarded, since only finance-related job postings are relevant.

In [15]:
# Keep only rows that mention finance in the title or description.
# The stem "financ" is used so that words such as "financial" also match.
def mentions_finance(row):
    return "financ" in row["description"].lower() or "financ" in row["title"].lower()

linkedin = linkedin[linkedin.apply(mentions_finance, axis=1)]
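
The same filter can also be written without a row-wise `apply`, using pandas string methods. This is a sketch of an alternative, not the method this notebook uses, and the toy rows are made up. `case=False` makes the match case-insensitive, and `regex=False` treats `financ` as a plain substring.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Financial Analyst", "Software Engineer", "Staff Accountant"],
    "description": ["Build models.", "Write code.", "Support the Finance team."],
})

# True where either column contains the stem "financ", ignoring case
mask = (
    toy["title"].str.contains("financ", case=False, regex=False)
    | toy["description"].str.contains("financ", case=False, regex=False)
)
finance_only = toy[mask]
```

This form assumes neither column contains missing values, which holds here because nulls were dropped or filled in the earlier steps.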

### Export Cleaned Data

After cleaning, 28,634 listings remain out of the 31,383 originally scraped.


In [16]:
display(linkedin.shape[0])
linkedin.to_csv("data/cleaned.csv", index=False)

28634
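
Passing `index=False` is a sensible choice here because the concatenated frame keeps each file's own row labels (the `info()` output shows labels running from 0 to 4625 for 29,606 rows), so the index carries nothing worth saving. A sketch with made-up rows showing the round trip, using an in-memory buffer as a stand-in for `data/cleaned.csv`:

```python
import io

import pandas as pd

# Toy frame with non-contiguous row labels, as left behind by concat
toy = pd.DataFrame({"title": ["Analyst", "Controller"]}, index=[0, 4625])

# Write without the index, then read the CSV text back
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)
```

The reloaded frame has only the `title` column and a fresh 0-based index. Without `index=False`, the old labels would be written as an extra unnamed column.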