##### Data Acquisition/Extraction

The raw data is a directory of HTML files, each of which is a job posting.

We used the 'BeautifulSoup' library (an HTML/XML parser) to extract the job title and the job description section from each posting. We then used regular expressions (the 're' library) to extract the following:

1. The salary values, including lower and upper bounds, from the job description section. Where a posting listed no salary, we filled the column with zero so that those postings can be filtered out when needed.

2. Whether the job is remote. We looked for the keyword 'remote' in both the title and the job description section of each posting.

3. Whether the job is an internship. We looked for the keywords 'intern', 'internship', and 'co-op' in both the title and the job description section of each posting.

4. The job type from the job description section: 'Full-time', 'Part-time', or 'Contract'. For missing values, we filled the column with 'Full-time (i)', where '(i)' indicates that the value was imputed. We made this decision because many job postings had no job type label even though they were referred to as 'Full-time' jobs. However, we ended up not using this column in our models because it did not prove significant in our prior analyses of the data.
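
The four extraction steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the patterns, the function name `extract_fields`, the column names, and the 'Y'/'N' encoding of the internship flag are all assumptions, and the title and description are assumed to have already been pulled out of the HTML as plain text.

```python
import re

# Hypothetical patterns; the exact expressions used in the project are not shown.
SALARY_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?[Kk]?")
REMOTE_RE = re.compile(r"\bremote\b", re.IGNORECASE)
INTERN_RE = re.compile(r"\b(?:intern|internship|co-op)\b", re.IGNORECASE)
JOB_TYPE_RE = re.compile(r"\b(Full-time|Part-time|Contract)\b", re.IGNORECASE)


def extract_fields(title, description):
    """Return the extracted columns for one job posting as a dict."""
    text = f"{title} {description}"

    # Assumption: the first salary found is the lower bound, the second the upper.
    salaries = SALARY_RE.findall(description)
    salary_low = salaries[0] if salaries else "0"  # zero marks a missing salary
    salary_high = salaries[1] if len(salaries) > 1 else salary_low

    job_type = JOB_TYPE_RE.search(description)
    return {
        "title": title,
        "salary_low": salary_low,
        "salary_high": salary_high,
        "remote": "Y" if REMOTE_RE.search(text) else "N",
        "intern": "Y" if INTERN_RE.search(text) else "N",
        # '(i)' flags an imputed value, as described above.
        "job_type": job_type.group(1) if job_type else "Full-time (i)",
    }


row = extract_fields(
    "Data Science Intern (Remote)",
    "Pay: $75,000 - $95.5K a year. Contract role.",
)
print(row)
```

The word boundaries (`\b`) keep 'intern' from matching inside words such as 'internal'. The salary strings are left raw here, since converting them to numbers is a separate cleaning step.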

Lastly, we saved a dataframe containing the title column and the five columns extracted from the title and job description section of each job posting to a CSV file.


##### Data Cleaning 

After acquiring the data, we saw at first glance that the salary values needed to be cleaned.
The salary values come in two variations:

1. The first variation looks like '\\$75,000'. For these values, we dropped the \\$-sign and converted the value to a number.

2. The second variation looks like '\\$75.3K'. For these values, we dropped the \\$-sign and everything after the decimal point, then multiplied the value by 1000, in effect rounding some salary values down to the nearest thousand (the code is not implemented exactly as described here). So the salary value \\$75.3K converts to 75000. We made this choice to save some coding time.
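
As a minimal sketch of the two rules as described (and, as noted, not necessarily how the project's code implements them), a single helper can handle both variations. The function name `clean_salary` is hypothetical, and removing the thousands separators in the first variation is an assumption needed to make the value numeric.

```python
def clean_salary(raw):
    """Convert a raw salary string such as '$75,000' or '$75.3K' to an int."""
    value = raw.strip().lstrip("$")
    if value.upper().endswith("K"):
        # '75.3K' -> keep only the whole thousands ('75') and scale by 1000.
        whole = value[:-1].split(".")[0]
        return int(whole) * 1000
    # '75,000' -> drop the thousands separators (and any decimal part).
    return int(value.replace(",", "").split(".")[0])


for raw in ["$75,000", "$75.3K", "0"]:
    print(raw, "->", clean_salary(raw))
```

The placeholder `0` used for missing salaries passes through unchanged, so those postings can still be filtered afterwards.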

With the data cleaning phase finished, it was time for the part of the project we were looking forward to: feature extraction.

##### Feature Extraction

Up to this point, we lacked the right features to adequately answer our questions. However, when we extracted the full title from each job posting and saved it as a column, we already intended to transform the titles into meaningful features for our model, because the raw titles varied too much to be used as they were.

After carefully observing the data, specifically the job titles, we came up with two additional features that we wanted to add to our data. The two new features are:

1.  Role; levels include 'Scientist', 'Analyst', 'ML Engineer', 'Research', and 'Unidentified'. To extract the values for this feature, we used regular expressions to search the job title for keywords similar to the level names. The search follows a hierarchy in which the most commonly appearing word, such as 'scientist', is checked at the very end.

2.  Seniority; levels include 'Senior', 'Junior', and 'None'. To extract the values for this feature, we again used regular expressions to search the job title for keywords similar to the level names.
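
A rough sketch of this ordered keyword search is shown below. The specific patterns, the order of the roles checked before 'Scientist', and the helper names are assumptions made for illustration. Only the level names and the idea of checking the most common keyword last come from the description above.

```python
import re

# Checked in order. The most common keyword ('scientist') comes last, so a
# more specific role wins when a title contains several keywords.
ROLE_PATTERNS = [
    ("ML Engineer", re.compile(r"\b(?:ml|machine learning)\b.*\bengineer\b", re.I)),
    ("Research", re.compile(r"\bresearch", re.I)),
    ("Analyst", re.compile(r"\banalyst\b", re.I)),
    ("Scientist", re.compile(r"\bscientist\b", re.I)),
]
SENIORITY_PATTERNS = [
    ("Senior", re.compile(r"\b(?:senior|sr)\b", re.I)),
    ("Junior", re.compile(r"\b(?:junior|jr)\b", re.I)),
]


def first_match(title, patterns, default):
    """Return the label of the first pattern found in the title."""
    for label, pattern in patterns:
        if pattern.search(title):
            return label
    return default


def extract_role(title):
    return first_match(title, ROLE_PATTERNS, "Unidentified")


def extract_seniority(title):
    return first_match(title, SENIORITY_PATTERNS, "None")


for title in ["Senior Data Scientist", "Research Scientist, NLP", "Jr. Data Analyst"]:
    print(title, "->", extract_role(title), "/", extract_seniority(title))
```

With this ordering, a title such as 'Research Scientist' is labelled 'Research' rather than 'Scientist', which is the effect of placing the most common keyword at the end of the hierarchy.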

At this point, we knew how we wanted to use the data for machine learning analyses, provided the statistical tests concluded that the data was significant enough for analysis. So we went further and one-hot encoded the two new features and the 'remote' column, which we had intentionally encoded as a 'Y' or 'N' value during acquisition. For the 'remote' column, we used the pandas.DataFrame.apply function to convert the 'Y' and 'N' values to 1 and 0, respectively. For the two new feature columns, we used pandas.DataFrame.apply to obtain, for each feature, a Python list of n lists, where n is the number of observations. Each inner list has one entry per level of the feature, with a 1 that indicates the category and 0's everywhere else. For example, a seniority observation of [1,0,0] means that the job posting is for a senior position, and a role observation of [0,1,0,0,0] means that the job posting is for an analyst position.

We then appended the two lists for seniority and role to two dataframes using pandas.Series.iteritems (again, we couldn't find a better way; using this function definitely saved us some time for doing analysis). This entire process could have been done using MultiLabelBinarizer from Scikit-Learn, but it was fun computing the one-hot encoded columns manually. Finally, we put all the distinct columns into one dataframe and saved it as a CSV file.
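
The manual encoding can be illustrated in plain Python, without pandas. This is a sketch under assumptions: the helper names are hypothetical, and the level order is taken from the order in which the levels are listed above, which matches the [1,0,0] and [0,1,0,0,0] examples.

```python
SENIORITY_LEVELS = ["Senior", "Junior", "None"]
ROLE_LEVELS = ["Scientist", "Analyst", "ML Engineer", "Research", "Unidentified"]


def one_hot(value, levels):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == level else 0 for level in levels]


def encode_remote(flag):
    """Map the 'Y'/'N' remote flag to 1/0."""
    return 1 if flag == "Y" else 0


# One inner list per observation, one entry per level.
seniority = ["Senior", "None", "Junior"]
encoded = [one_hot(value, SENIORITY_LEVELS) for value in seniority]
print(encoded)
print(one_hot("Analyst", ROLE_LEVELS))
print(encode_remote("Y"), encode_remote("N"))
```

In a pandas workflow, a function like `one_hot` could be passed to `apply` on the relevant column to build the same list of lists.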

The program called "feature_extraction_complete.py" does all this. It takes an input CSV file and an output CSV file as arguments and writes the results to the output file. It also produces 2 CSV files containing the one-hot encoded features generated from the role and seniority features.


##### ML Techniques:
For the machine learning techniques, we wanted to perform regression to predict the 'average salary' as well as classification to predict the 'role'. From our prior analysis, we knew which features to consider as predictors for the models. We chose our models based on two important observations about our data. The first is that the groups of 'role' have different sample sizes, which means the data is slightly imbalanced. The second is that most features have binary values. We therefore restricted ourselves to decision trees and their variants, to avoid overfitting or underfitting the data. Finally, we kept it simple and went with "Random Forest" and "Gradient Boosting", two tree-based models of the same variety, for both the regression and classification tasks.

##### Regression
For the regression models:

Response variable: Average_salary

Explanatory variables: Is_remote, Is_intern, Is_junior, Is_senior, Seniority_unknown, 
                        Is_scientist, Is_analyst, Is_ml_engineer, Is_research


~ Model 1: 

Random Forest Regressor with 100 estimators and max depth = 10

Model training score: 0.492 (approx.)

Model validation score: 0.417 (approx.)


~ Model 2: 

Gradient Boosting Regressor with 100 estimators and max depth = 3

Model training score: 0.493 (approx.)

Model validation score: 0.414 (approx.)
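
For reference, the two regressors with the hyperparameters listed above can be set up in Scikit-Learn roughly as follows. This is only a sketch: the data here is synthetic, the train/validation split and `random_state` values are assumptions, and it is assumed (not confirmed by the source) that the reported scores come from the estimator's `score` method, which returns R^2 for regressors.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 9 binary predictors and a salary
# that depends on two of them plus noise. Not the project's dataset.
X = rng.integers(0, 2, size=(500, 9))
y = 80_000 + 20_000 * X[:, 3] - 15_000 * X[:, 1] + rng.normal(0, 10_000, size=500)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

models = {
    "Random Forest": RandomForestRegressor(
        n_estimators=100, max_depth=10, random_state=0
    ),
    "Gradient Boosting": GradientBoostingRegressor(
        n_estimators=100, max_depth=3, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() is the coefficient of determination (R^2) for regressors.
    scores[name] = (model.score(X_train, y_train), model.score(X_valid, y_valid))
    print(f"{name}: train={scores[name][0]:.3f}, validation={scores[name][1]:.3f}")
```

The gap between the training and validation scores gives a quick indication of how much each model overfits.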
                   

##### Classification
For the classification models:

Response variable: Role

Explanatory variables: Salary_low, Salary_high, Is_remote, Is_intern, 
                        Is_junior, Is_senior, Seniority_unknown


~ Model 1: 

Random Forest Classifier with 100 estimators and max depth = 7

Model training score: 0.723 (approx.)

Model validation score: 0.54 (approx.)


~ Model 2: 

Gradient Boosting Classifier with 100 estimators

Model training score: 0.838 (approx.)

Model validation score: 0.513 (approx.)

The program called "ml_analysis.py" takes a CSV file that includes the one-hot encoded features as an input argument and prints each model being trained, along with the training and validation scores for each of the four models. It also writes 4 CSV files to the program directory, containing the predictions of the 4 trained models (two regressors and two classifiers).