# Fixing company job counts

In the last notebook, we found that our file `companies_available_jobs.csv` was produced incorrectly: it lists the same 67 jobs for every company.

We will fix that here.

To do this, we will go back to our file `companies_jobs_counts.csv`, which lists how many times each job has been reviewed at each company. We will also need our file `all_jobs_ratings.csv` (read below from `all_jobs_ratings_transpose.csv`), which contains the 13,000 jobs we have information about.

In [1]:
import pandas as pd
pd.set_option('display.max_columns',200)
pd.set_option('display.max_rows', 90)

import numpy as np

## Import job counts and job titles

In [2]:
#number of times each job has been reviewed at each company
job_counts = pd.read_csv('companies_jobs_counts.csv', index_col='Unnamed: 0')

#import ratings for jobs I consider
job_ratings = pd.read_csv('all_jobs_ratings_transpose.csv', index_col='Unnamed: 0')

#all jobs at companies I will recommend
all_jobs = job_ratings.loc[:,'Job Title']

In [3]:
reviewed_companies = pd.read_csv('reviewed_companies.csv', index_col='Unnamed: 0')

In [4]:
print(reviewed_companies.shape)

reviewed_companies.head()

(5832, 7)


Unnamed: 0,Ticker Symbol,Ticker Sector,Ticker Industry,Company Id,Company URL,company_name,count
0,vtx:rog,Health Care,Pharmaceuticals & Biotechnology,274,https://www.glassdoor.com/Overview/Working-at-...,Genentech,609
1,bcs:falabella,,,10976,https://www.glassdoor.com/Overview/Working-at-...,Falabella,9
2,asx:wow,Consumer Services,Food & Drug Retailers,473193,https://www.glassdoor.com/Overview/Working-at-...,Big W,70
3,asx:wor,,,35193,https://www.glassdoor.com/Overview/Working-at-...,WorleyParsons,379
4,nyse:xom,Oil & Gas,Oil & Gas Producers,237,https://www.glassdoor.com/Overview/Working-at-...,ExxonMobil,845


In [6]:
job_counts.head()

Unnamed: 0,Company Id,company_name,Job Title,Employee Status,count
0,4,AAR,A&P,Current Employee,1
1,4,AAR,A&P Lead,Current Employee,1
2,4,AAR,A&P Lead Mechanic,Current Employee,1
3,4,AAR,A&P Mechanic,Current Employee,1
4,4,AAR,A&P Mechanic,Former Contractor,2


In [7]:
print('Total number of reviews in job_counts: {}'.format(job_counts.loc[:,'count'].sum()))

Total number of reviews in job_counts: 2615681


## Restricting job counts

We will only consider jobs that have been reviewed at least 3 times at some company. These are the job titles stored in `all_jobs`. This will cut the number of jobs roughly in half.

In [10]:
#keep only job titles found in all_jobs, i.e. jobs that have been reviewed at least 3x at some company
job_counts = job_counts[job_counts.loc[:,'Job Title'].isin(all_jobs)]

In [11]:
print('Current number of reviews: {}'.format(job_counts.loc[:,'count'].sum()))

Current number of reviews: 1224851


In [14]:
job_counts.shape

(372136, 5)
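
The filter above relies on `all_jobs`, which was built in an earlier notebook. As a rough sketch of how a "reviewed at least 3 times at some company" criterion could be derived directly from a table shaped like `job_counts`, the toy example below pools review counts across `Employee Status` within each company. That pooling is an assumption, not something this notebook confirms, and `toy_counts`, `frequent_jobs`, and `toy_filtered` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for job_counts (hypothetical data, same columns as the real table)
toy_counts = pd.DataFrame({
    'Company Id': [4, 4, 4, 7, 7],
    'Job Title': ['A&P Mechanic', 'A&P Mechanic', 'Accountant',
                  'Accountant', 'Analyst'],
    'Employee Status': ['Current Employee', 'Former Employee', 'Current Employee',
                        'Current Employee', 'Current Employee'],
    'count': [1, 2, 1, 1, 5],
})

# Assumption: reviews are pooled across Employee Status within a company
per_company = toy_counts.groupby(['Company Id', 'Job Title'])['count'].sum()

# Largest per-company total for each job title
best_per_job = per_company.groupby(level='Job Title').max()

# Keep titles that reach the threshold at one company or more
frequent_jobs = best_per_job[best_per_job >= 3].index

# Same isin-style filter as the cell above, applied to the toy table
toy_filtered = toy_counts[toy_counts['Job Title'].isin(frequent_jobs)]
```

Here 'Accountant' is dropped because it never reaches 3 reviews at any single company, while the two 'A&P Mechanic' rows are kept because together they reach 3 at company 4.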

In [17]:
job_counts.loc[:,'Job Title'].drop_duplicates()

3                                              A&P Mechanic
6                                            A&P Technician
8                                                Accountant
9                                            Administrative
10                                 Administrative Assistant
12                          Aircraft Maintenance Technician
14                                        Aircraft Mechanic
18                                                      Amt
19                                                  Analyst
21                                                Anonymous
27                                            Assembly Tech
28                                      Assembly Technician
29                          Aviation Maintenance Technician
32                                      Avionics Technician
37                                         Business Analyst
38                           Business Development Executive
40                                      

In [18]:
print(job_counts.shape)
job_counts.dropna(axis=0, subset=['Job Title']).head(90)

(372136, 5)


Unnamed: 0,Company Id,company_name,Job Title,Employee Status,count
3,4,AAR,A&P Mechanic,Current Employee,1
4,4,AAR,A&P Mechanic,Former Contractor,2
5,4,AAR,A&P Mechanic,Former Employee,5
6,4,AAR,A&P Technician,Current Employee,3
8,4,AAR,Accountant,Current Employee,1
9,4,AAR,Administrative,Current Employee,1
10,4,AAR,Administrative Assistant,Current Employee,1
12,4,AAR,Aircraft Maintenance Technician,Current Employee,1
13,4,AAR,Aircraft Maintenance Technician,Former Employee,1
14,4,AAR,Aircraft Mechanic,Current Contractor,1


In [19]:
job_counts.shape

(372136, 5)

## Extract jobs reviewed at each company

From the `job_counts` DataFrame, I will find the collection of jobs reviewed at each company. I will then convert this collection into one long string, with job titles separated by the fixed string ' AKLDJJ ', an arbitrary marker that should not appear inside any job title. When I import this information in future work, I can quickly split by ' AKLDJJ ' to recover the individual titles.

In [22]:
company_jobs_dict = {}

#counts companies that had a NaN job title removed
counter = 0

for comp_id in job_counts.loc[:,'Company Id'].drop_duplicates():
    job_counts_for_company = job_counts[job_counts['Company Id'] == comp_id]
    
    #find lists of jobs at company
    company_jobs_dict[comp_id] = list(job_counts_for_company.loc[:,'Job Title'].drop_duplicates())
    
    #remove job titles that were listed as NaN
    if np.nan in company_jobs_dict[comp_id]:
        counter += 1
        company_jobs_dict[comp_id].remove(np.nan)
        
    company_jobs_dict[comp_id] = " AKLDJJ ".join(company_jobs_dict[comp_id])
    
print(counter)

462
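
The loop above filters `job_counts` once per company, which can be slow on a table this size. A possible alternative, sketched here on toy data, is to let `groupby` build the same delimiter-joined strings in one pass. The names `toy_counts`, `SEP`, and `jobs_per_company` are hypothetical, and this is not the code that produced the output above.

```python
import numpy as np
import pandas as pd

SEP = ' AKLDJJ '  # same delimiter as in the loop above

# Toy stand-in for job_counts (hypothetical data)
toy_counts = pd.DataFrame({
    'Company Id': [4, 4, 4, 9, 20, 20],
    'Job Title': ['A&P Mechanic', 'A&P Mechanic', 'Accountant',
                  'Machine Operator', np.nan, 'Engineer'],
    'count': [1, 2, 1, 1, 1, 3],
})

jobs_per_company = (
    toy_counts
    .dropna(subset=['Job Title'])                   # drop rows with a missing title
    .drop_duplicates(['Company Id', 'Job Title'])   # one row per (company, title)
    .groupby('Company Id')['Job Title']
    .agg(lambda titles: SEP.join(titles))           # join titles into one string
)
```

One difference to keep in mind: a company whose titles are all NaN disappears from `jobs_per_company` entirely, whereas the loop would keep it with an empty string.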


In [23]:
#example string of jobs at company
company_jobs_dict[20]

'Composite Technician AKLDJJ Engineer AKLDJJ Finishing Technician AKLDJJ IT Manager AKLDJJ IT Support Analyst AKLDJJ Intern AKLDJJ Manufacturing AKLDJJ Manufacturing Supervisor AKLDJJ Ndt Technician AKLDJJ Process Engineer AKLDJJ Senior Buyer AKLDJJ Senior Manager'

In [24]:
company_jobs_df = pd.DataFrame.from_dict({'Company Id': [comp_id 
                                                         for comp_id in job_counts.loc[:,'Company Id'].drop_duplicates()],
                                          'Available Jobs': [company_jobs_dict[comp_id]
                                                            for comp_id in job_counts.loc[:,'Company Id'].drop_duplicates()]})

In [25]:
company_jobs_df.head()

Unnamed: 0,Company Id,Available Jobs
0,4,A&P Mechanic AKLDJJ A&P Technician AKLDJJ Acco...
1,7,Accounts Payable AKLDJJ Administrative Assista...
2,8,A&P Mechanic AKLDJJ Account Manager AKLDJJ Acc...
3,9,Machine Operator
4,12,AREA SALES MANAGER AKLDJJ Abteilungsleiter AKL...


In [26]:
#drop a few companies without jobs
company_jobs_df = company_jobs_df[~company_jobs_df['Company Id'].isin([1892,23043,660837])]
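
The company ids above are hardcoded. If those companies are the ones whose job string ended up empty (an assumption, since the notebook does not show how the ids were found), they could instead be detected from the data. A minimal sketch on a hypothetical `toy_jobs_df`:

```python
import pandas as pd

# Toy stand-in for company_jobs_df before the index is set (hypothetical data)
toy_jobs_df = pd.DataFrame({
    'Company Id': [4, 9, 1892],
    'Available Jobs': ['A&P Mechanic AKLDJJ Accountant', 'Machine Operator', ''],
})

# Keep only companies whose job string is non-empty
has_jobs = toy_jobs_df['Available Jobs'].str.len() > 0
toy_jobs_df = toy_jobs_df[has_jobs]
```

This avoids having to update the id list by hand if the input files change.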

In [27]:
company_jobs_df = company_jobs_df.set_index('Company Id', drop = True)

In [28]:
company_jobs_df.head()

Company Id,Available Jobs
4,A&P Mechanic AKLDJJ A&P Technician AKLDJJ Acco...
7,Accounts Payable AKLDJJ Administrative Assista...
8,A&P Mechanic AKLDJJ Account Manager AKLDJJ Acc...
9,Machine Operator
12,AREA SALES MANAGER AKLDJJ Abteilungsleiter AKL...


In [29]:
#save to CSV file
company_jobs_df.to_csv('companies_available_jobs.csv')
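
To check that the file no longer gives every company the same jobs, one option is to read it back, split each string on the delimiter, and count the distinct job strings. The sketch below uses an in-memory toy CSV in place of the real file, so `csv_text`, `loaded`, `job_lists`, and `n_distinct` are hypothetical names.

```python
import io

import pandas as pd

SEP = ' AKLDJJ '  # delimiter used when the strings were built

# Toy stand-in for the contents of companies_available_jobs.csv
csv_text = (
    'Company Id,Available Jobs\n'
    '4,A&P Mechanic AKLDJJ Accountant\n'
    '9,Machine Operator\n'
)

loaded = pd.read_csv(io.StringIO(csv_text), index_col='Company Id')

# Recover a list of job titles for each company
job_lists = loaded['Available Jobs'].apply(lambda s: s.split(SEP))

# If the earlier bug were still present, every row would hold the same string
n_distinct = loaded['Available Jobs'].nunique()
```

With the real file, `n_distinct` being 1 would indicate that every company still shares a single job list.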