# Predicting Salaries

We collected salary information on data science jobs in a variety of markets. Then, using the location, title, and summary of each job, we attempted to predict its salary. This would be valuable to job posting sites: most listings do not come with salary information, so being able to extrapolate or predict expected salaries from other listings can be useful.

Regression would be the usual choice for a task like this. However, because job salaries have a fair amount of natural variance, we approached it as a classification problem instead.
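
To make the classification framing concrete, here is a minimal sketch of turning numeric salaries into class labels. The text does not say how the classes are defined, so the median cutoff, the `high`/`low` label names, and the `label_by_median` helper are all assumptions made for illustration.

```python
from statistics import median


def label_by_median(salaries):
    """Label each salary 'high' if it is above the median of the list, else 'low'.

    Assumption: a two-class split at the median. The project may have used
    different bins.
    """
    cutoff = median(salaries)
    return ["high" if s > cutoff else "low" for s in salaries]


# Hypothetical yearly salaries, for illustration only.
example_salaries = [60000, 85000, 110000, 150000]
labels = label_by_median(example_salaries)
print(labels)
```

A split at the median keeps the two classes roughly balanced, which makes accuracy easier to interpret than it would be with a skewed cutoff.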

The first part of the project focused on scraping Indeed.com. The second part focused on building models that predict salaries from the job postings that include salary information.

## Scraping job listings from Indeed.com

We need to clean up the salary data:

1. Only a small number of the scraped results have salary information; only these will be used for modeling.
1. Some of the salaries are hourly or weekly rather than yearly; these will not be useful to us for now.
1. Some of the entries may be duplicated.
1. The salaries are given as text, usually as ranges.
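
The cleaning steps above can be sketched roughly as follows. This is only an illustration: the salary string format (for example `"$80,000 - $100,000 a year"`), the `parse_yearly_salary` helper, and the sample rows are assumptions, not taken from the scraped data.

```python
import re


def parse_yearly_salary(text):
    """Return the midpoint of a yearly salary string, or None if it is unusable."""
    if text is None:
        return None
    lowered = text.lower()
    # Keep only yearly figures; hourly, weekly, etc. are skipped for now.
    if "year" not in lowered:
        return None
    # Pull out every number, allowing thousands separators and decimals.
    matches = re.findall(r"\d[\d,]*(?:\.\d+)?", lowered)
    numbers = [float(m.replace(",", "")) for m in matches]
    if not numbers:
        return None
    # A range such as "$80,000 - $100,000" becomes its midpoint.
    return sum(numbers) / len(numbers)


# Hypothetical (title, salary text) rows, for illustration only.
rows = [
    ("Data Scientist", "$80,000 - $100,000 a year"),
    ("Data Analyst", "$25 an hour"),
    ("ML Engineer", "$120,000 a year"),
    ("ML Engineer", "$120,000 a year"),  # exact duplicate listing
    ("Statistician", None),              # no salary information
]

# dict.fromkeys drops exact duplicate rows while preserving order.
unique_rows = list(dict.fromkeys(rows))

# Keep only the rows whose salary could be parsed as a yearly figure.
parsed = [(title, parse_yearly_salary(text)) for title, text in unique_rows]
salaries = [salary for _, salary in parsed if salary is not None]
print(salaries)
```

With the listings in a pandas DataFrame, the same helper could be applied to the salary column with `Series.apply`, and duplicates removed with `DataFrame.drop_duplicates`.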

In [1]:
import pandas as pd
import numpy as np
import warnings

# Filter out all warnings
warnings.filterwarnings("ignore")

In [2]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

def evaluate_classification_model(y_true, y_pred, name):
    # Calculate accuracy
    accuracy = accuracy_score(y_true, y_pred)
    
    # Calculate confusion matrix
    confusion = confusion_matrix(y_true, y_pred)
    
    # Calculate micro-averaged precision
    # (pos_label only applies when average='binary', so it is omitted)
    precision = precision_score(y_true, y_pred, average='micro')
    
    # Calculate micro-averaged recall
    recall = recall_score(y_true, y_pred, average='micro')
    print("Model Name:", name)
    print("Accuracy:", accuracy)
    print("Confusion Matrix:")
    print(confusion)
    print("Precision:", precision)
    print("Recall:", recall)
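
One thing to keep in mind when reading this function's output: with `average='micro'` on a single-label problem, precision and recall are both equal to accuracy, because every misclassified sample counts as one false positive for the predicted class and one false negative for the true class. The pure-Python sketch below checks this on made-up labels; `micro_precision_recall` is a hypothetical helper written for illustration and is not part of scikit-learn.

```python
def micro_precision_recall(y_true, y_pred):
    """Compute micro-averaged precision and recall by pooling counts over classes."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1  # predicted this class and it was correct
            elif p == label:
                fp += 1  # predicted this class but the truth differs
            elif t == label:
                fn += 1  # truth is this class but something else was predicted
    return tp / (tp + fp), tp / (tp + fn)


# Made-up labels, for illustration only.
y_true = ["low", "low", "high", "high", "low"]
y_pred = ["low", "high", "high", "high", "low"]

precision, recall = micro_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, accuracy)
```

If per-class behavior matters, `average='macro'` or `average=None` in `precision_score` and `recall_score` gives numbers that differ from accuracy.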