# Preprocessing and model training


In this notebook, I prepare my dataframes, train several classification models on them, and compare the results.

Building on two previous notebooks (https://github.com/gisthuband/Capstone_2_DS_Job_Locator/blob/main/data_wrangle.ipynb and https://github.com/gisthuband/Capstone_2_DS_Job_Locator/blob/main/exploratory_data_analysis.ipynb), I have constructed and analyzed a dataframe containing information on data science jobs in 2024.

I will use this data to build a classification model that takes my desired salary range and self-perceived competitiveness in the job market and uses them to find the best locations and companies to apply to.

The data comes from a Kaggle dataset containing 500 job postings for the data science field in 2024 and a BLS report generated from 2023 data science field statistics.

Kaggle: https://www.kaggle.com/datasets/ritiksharma07/data-science-job-listings-from-glassdoor   

BLS: https://data.bls.gov/oes/#/occInd/One%20occupation%20for%20multiple%20industries 

The samples' (the individual job postings) features will be their upper salary post, lower salary post, company rating, total data scientists in company's state, ratio of job posts to total data scientists in company state, annual mean wage of state, annual median wage of state, ratio of job post to annual mean wage, and ratio of job post to annual median wage.

Each job will receive its label based on geographic region: west, midwest, south, east, or remote.

The models will train on the numerical features as X and the regions as y.


In the end, I will input my own desired salary range and my perceived competitiveness in the job market. The salary range corresponds to the upper and lower salary features. The perceived competitiveness becomes the ratio of posting to state mean, which in turn is used to calculate the ratio of posting to state median. Together with the input salary range, these ratios determine the state mean and median wage features. The company rating, employment in state, and ratio of posts to employment are automatically set to their median values so as not to overcomplicate the model.

From this input I get a region label, and from that label I can use the original dataframe to generate the top posting cities in that region, along with the companies and role titles accompanying those posts.
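The arithmetic behind that input can be sketched as follows. This is only an illustration under stated assumptions: the column names, the use of the salary midpoint as "the posting" in each ratio, and the fixed `mean_to_median` factor linking the two ratios are all hypothetical and are not taken from the dataset or the earlier notebooks.

```python
import pandas as pd

def build_query_row(lower_salary, upper_salary, competitiveness,
                    mean_to_median=1.0, defaults=None):
    """Turn a salary range and a competitiveness ratio into one feature row.

    Assumptions (hypothetical, for illustration only):
    - the posting salary used in the ratios is the midpoint of the range
    - ratio to median = ratio to mean * mean_to_median
    """
    midpoint = (lower_salary + upper_salary) / 2
    ratio_to_mean = competitiveness
    ratio_to_median = ratio_to_mean * mean_to_median
    row = {
        "lower_salary": lower_salary,
        "upper_salary": upper_salary,
        "ratio_to_mean": ratio_to_mean,
        "ratio_to_median": ratio_to_median,
        # Invert each ratio to recover the implied state wage feature.
        "state_mean_wage": midpoint / ratio_to_mean,
        "state_median_wage": midpoint / ratio_to_median,
    }
    # Remaining features (rating, employment, ...) would be dataset medians.
    row.update(defaults or {})
    return pd.DataFrame([row])

query = build_query_row(90_000, 130_000, competitiveness=1.1,
                        defaults={"company_rating": 4.0})
print(query.T)
```

A competitiveness above 1 means asking for more than the state mean, so the implied state mean wage falls below the requested midpoint.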

This notebook contains a preprocessing class that can:

1.) Produce a tidied-up dataframe

2.) Produce test and training versions of the data without one-hot encoding for the labels

3.) Produce test and training versions of the data with one-hot encoding for the labels

4.) Produce test and training versions of the data with one-hot encoding for the labels and standard scaling

5.) Produce test and training versions of the data with one-hot encoding for the labels and min-max scaling

6.) Produce mislabeled test and training labels without one-hot encoding

7.) Produce mislabeled test and training labels with one-hot encoding
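As a rough illustration of what items 3 and 4 produce, the sketch below one-hot encodes region labels and standardizes features on a small synthetic dataframe. The feature columns and values are made up for the example; only the `labels` column name and the five regions come from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
regions = ["west", "midwest", "south", "east", "remote"]

# Synthetic stand-in for the real job-posting dataframe.
toy = pd.DataFrame({
    "upper_salary": rng.normal(150_000, 20_000, size=100),
    "company_rating": rng.uniform(2.5, 5.0, size=100),
    "labels": np.repeat(regions, 20),
})

X = toy.drop(columns="labels")
y_onehot = pd.get_dummies(toy["labels"])  # one indicator column per region

X_train, X_test, y_train, y_test = train_test_split(
    X, y_onehot, test_size=0.2, random_state=1, stratify=toy["labels"]
)

# Learn the mean and standard deviation from the training split only,
# then apply the same transformation to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(y_train.head())
print(X_train_s.mean(axis=0).round(3))
```

Fitting the scaler on the training split alone keeps information about the test split out of the preprocessing step.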



## Creation of Preprocessing Class




In [10]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import numpy as np

In [2]:
class preprocess:
    def __init__(self, df_path):
        self.df_path = df_path
    
    
    @staticmethod
    def refine(df_path):
        """Return the dataframe with the index and text columns dropped."""
        
        df = pd.read_csv(df_path)

        df = df.drop(columns='Unnamed: 0')

        df = df.drop(columns=['state','city','Job Title','company_name'])
        
        return df
    
    @staticmethod
    def regular(df_path):
        """Return a stratified train/test split with string region labels."""
        
        df = pd.read_csv(df_path)

        df = df.drop(columns='Unnamed: 0')

        df = df.drop(columns=['state','city','Job Title','company_name'])
        
        ready_df = df
        
        features = list(ready_df.columns[ready_df.columns != 'labels'])

        X = ready_df[features]
        
        y = ready_df['labels']
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1, stratify=y)
        
        return X_train, X_test, y_train, y_test
    
    @staticmethod
    def dummied(df_path):
        """Return a stratified train/test split with one-hot encoded labels."""
        
        df = pd.read_csv(df_path)

        df = df.drop(columns='Unnamed: 0')

        df = df.drop(columns=['state','city','Job Title','company_name'])
        
        dum_df = pd.get_dummies(df['labels'])        
        
        dummed_df = pd.concat([df, dum_df],axis=1)

        dummed_df = dummed_df.drop(columns='labels')
        
        features = list(dummed_df.columns[dummed_df.columns != 'west'])
        
        features.remove('south')
        
        features.remove('east')
        
        features.remove('midwest')
        
        features.remove('remote')
        
        X = dummed_df[features]

        y = dummed_df[['west','east','south','midwest','remote']]

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, stratify=y)
        
        return X_train, X_test, y_train, y_test
    
    @staticmethod
    def standard_scaled(df_path):
        """Return a one-hot split with features standardized on the training set."""
        
        df = pd.read_csv(df_path)

        df = df.drop(columns='Unnamed: 0')

        df = df.drop(columns=['state','city','Job Title','company_name'])
        
        dum_df = pd.get_dummies(df['labels'])        
        
        dummed_df = pd.concat([df, dum_df],axis=1)

        dummed_df = dummed_df.drop(columns='labels')
        
        features = list(dummed_df.columns[dummed_df.columns != 'west'])
        
        features.remove('south')
        
        features.remove('east')
        
        features.remove('midwest')
        
        features.remove('remote')
        
        X = dummed_df[features]

        y = dummed_df[['west','east','south','midwest','remote']]

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, stratify=y)
        
        s_scaler = preprocessing.StandardScaler().fit(X_train)
        
        X_train=s_scaler.transform(X_train)
        
        X_test=s_scaler.transform(X_test)

        return X_train, X_test, y_train, y_test
    
    @staticmethod
    def maxmin_scaled(df_path):
        """Return a one-hot split with min-max scaled features."""
        
        df = pd.read_csv(df_path)

        df = df.drop(columns='Unnamed: 0')

        df = df.drop(columns=['state','city','Job Title','company_name'])
        
        dum_df = pd.get_dummies(df['labels'])        
        
        dummed_df = pd.concat([df, dum_df],axis=1)

        dummed_df = dummed_df.drop(columns='labels')
        
        features = list(dummed_df.columns[dummed_df.columns != 'west'])
        
        features.remove('south')
        
        features.remove('east')
        
        features.remove('midwest')
        
        features.remove('remote')
        
        X = dummed_df[features]

        y = dummed_df[['west','east','south','midwest','remote']]

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, stratify=y)
        
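        # Fit the scaler on the training split only, then reuse it on the test
        # split so both splits share the same min and max.
        mm_scaler = preprocessing.MinMaxScaler().fit(X_train)
        
        X_train=mm_scaler.transform(X_train)
        
        X_test=mm_scaler.transform(X_test)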

        return X_train, X_test, y_train, y_test
    
    @staticmethod
    def random_label_regular(df_path):
        """Return a split whose labels are replaced by uniformly random regions."""
        
        df = pd.read_csv(df_path)

        df = df.drop(columns='Unnamed: 0')

        df = df.drop(columns=['state','city','Job Title','company_name'])
        
        ready_df = df
        
        rng = np.random.default_rng()
        
        rand_labels = rng.integers(low=0, high=5, size=len(ready_df)) 
        
        region_labels = ['west','east','south','midwest','remote']
        
        random_label_fin = list(map(lambda x: region_labels[x], rand_labels))
        
        ready_df['random_labels'] = random_label_fin
        
        ready_df = ready_df.drop(columns='labels')
        
        features = list(ready_df.columns[ready_df.columns != 'random_labels'])

        X = ready_df[features]
        
        y = ready_df['random_labels']
        
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1, stratify=y)
        
        return X_train, X_test, y_train, y_test
    
    @staticmethod
    def random_label_dum(df_path):
        """Return a one-hot split whose labels are uniformly random regions."""
        
        df = pd.read_csv(df_path)

        df = df.drop(columns='Unnamed: 0')

        df = df.drop(columns=['state','city','Job Title','company_name'])
        
        rng = np.random.default_rng()
        
        rand_labels = rng.integers(low=0, high=5, size=len(df)) 
        
        region_labels = ['west','east','south','midwest','remote']
        
        random_label_fin = list(map(lambda x: region_labels[x], rand_labels))
        
        df['random_labels'] = random_label_fin
        
        dum_df = pd.get_dummies(df['random_labels'])        
        
        dummed_df = pd.concat([df, dum_df],axis=1)

        dummed_df = dummed_df.drop(columns=['labels','random_labels'])
        
        features = list(dummed_df.columns[dummed_df.columns != 'west'])
        
        features.remove('south')
        
        features.remove('east')
        
        features.remove('midwest')
        
        features.remove('remote')
        
        X = dummed_df[features]

        y = dummed_df[['west','east','south','midwest','remote']]

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, stratify=y)
        
        return X_train, X_test, y_train, y_test
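Every method above repeats the same loading and column-dropping steps. One possible way to factor that out is sketched below. The helper names `load_features` and `split`, the `upper_salary` column, and the in-memory toy CSV are hypothetical and exist only to make the example runnable; the dropped column names and region list are the ones used in the class.

```python
import io
import pandas as pd
from sklearn.model_selection import train_test_split

DROP_COLS = ["Unnamed: 0", "state", "city", "Job Title", "company_name"]
REGIONS = ["west", "east", "south", "midwest", "remote"]

def load_features(source):
    """Read the CSV and drop the index and free-text columns."""
    return pd.read_csv(source).drop(columns=DROP_COLS)

def split(df, label_col="labels", one_hot=False, seed=1):
    """Stratified 80/20 split, optionally one-hot encoding the labels."""
    X = df.drop(columns=label_col)
    y = df[label_col]
    if one_hot:
        target = pd.get_dummies(y).reindex(columns=REGIONS, fill_value=0)
    else:
        target = y
    return train_test_split(X, target, test_size=0.2,
                            random_state=seed, stratify=y)

# Toy data standing in for 'explored_data_v1.csv'.
toy = pd.DataFrame({
    "state": ["CA"] * 50, "city": ["X"] * 50,
    "Job Title": ["Data Scientist"] * 50, "company_name": ["Acme"] * 50,
    "upper_salary": range(50),
    "labels": REGIONS * 10,
})
csv_text = toy.to_csv()  # the default index column reads back as 'Unnamed: 0'

X_train, X_test, y_train, y_test = split(
    load_features(io.StringIO(csv_text)), one_hot=True
)
print(X_train.shape, y_test.shape)
```

Passing a fixed `seed` to every split keeps the train and test rows identical across the encoded and unencoded variants, which makes model comparisons easier to interpret.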
    
    
    
    

In [11]:
trial_regular = preprocess.regular('explored_data_v1.csv')

In [12]:
print (trial_regular[3])

85        west
147    midwest
251     remote
35        east
42       south
        ...   
114      south
130      south
134    midwest
186      south
75        west
Name: labels, Length: 69, dtype: object


In [13]:
trial_fake = preprocess.random_label_regular('explored_data_v1.csv')

In [14]:
print (trial_fake[3])

109     remote
331     remote
211     remote
151       west
342    midwest
        ...   
90      remote
111    midwest
246     remote
227    midwest
281       east
Name: random_labels, Length: 69, dtype: object
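The randomly labeled splits presumably act as a sanity check: with five equally likely regions, a model that has learned nothing should match a label about 20% of the time. The short simulation below, using synthetic labels rather than the notebook's data, illustrates that chance level.

```python
import numpy as np

rng = np.random.default_rng(0)
region_labels = np.array(["west", "east", "south", "midwest", "remote"])

# Synthetic "true" labels and an independent set of uniformly random labels.
true_labels = rng.choice(region_labels, size=10_000)
random_idx = rng.integers(low=0, high=5, size=true_labels.size)
random_labels = region_labels[random_idx]

# Fraction of positions where the random label happens to match.
chance_accuracy = (true_labels == random_labels).mean()
print(f"agreement with random labels: {chance_accuracy:.3f}")
```

A classifier trained on the real labels should score clearly above this level if the features carry regional signal.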
