# Lab Five: Wide and Deep Networks


Arely Alcantara, Emily Fashenpour

## 1. Preparation and Overview

### 1.1 Business Case

The dataset we selected is titled "Stack Overflow 2018 Developer Survey". It is a yearly survey run by Stack Overflow in which the developer community is asked questions on topics ranging from education, job satisfaction, views on AI, advertising, ethics, and Stack Overflow itself to general sleeping and eating habits. There were more than 100,000 responses to the survey, but only 67,441 responses were completed and did not contain personally identifying information.

There are two .csv files that belong to this dataset. The first file (survey_results_public.csv) contains all the responses to the questions asked in the survey, and the second file (survey_results_schema.csv) contains the question that goes with each column in the first file. For example, the column 'Hobby' contains all the responses to the question 'Do you code as a hobby?'.
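The relationship between the two files can be sketched as a simple lookup from column name to question text. This is only an illustration: the schema column names 'Column' and 'QuestionText' are an assumption about the file layout, the in-memory CSV is a stand-in for survey_results_schema.csv, and the second row is a placeholder rather than real survey text.

```python
import io
import pandas as pd

# Stand-in for survey_results_schema.csv. The header names 'Column' and
# 'QuestionText' are assumed, and the second row is a placeholder.
schema_csv = io.StringIO(
    "Column,QuestionText\n"
    "Hobby,Do you code as a hobby?\n"
    "OpenSource,(placeholder question text)\n"
)
schema = pd.read_csv(schema_csv)

# Build a lookup from column name to the question that produced it.
question_for = schema.set_index('Column')['QuestionText'].to_dict()

print(question_for['Hobby'])
```

With the real schema file, the same lookup would let us check what a column means before deciding whether to keep it.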

Several job search companies, such as Indeed and ZipRecruiter, could use data from a survey like this to help find more customers. They could take this data, try to determine which features correlate with a developer currently looking for a job, and better understand whom they should be contacting to say 'Hey, we think you might be looking for a job, and here are some jobs that you may be interested in!'.

We believe that for companies like Indeed to find our model useful, its predictions need to be correct more than half of the time. In other words, the model has to be right more often than it is wrong so that it is better than just guessing.
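One way to make "better than just guessing" concrete is to compare against a baseline that always predicts the most common class. The sketch below is our own illustration, not part of the lab's pipeline: the function name `majority_baseline_accuracy` is hypothetical and the labels are made up.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    most_common_label, most_common_count = counts.most_common(1)[0]
    return most_common_label, most_common_count / len(labels)

# Made-up job-search labels, for illustration only.
toy_labels = ['open', 'open', 'open', 'not interested', 'not interested',
              'active', 'open', 'not interested', 'open', 'active']

label, accuracy = majority_baseline_accuracy(toy_labels)
print(label, accuracy)
```

If one class dominates the real data, this baseline can be well above the one-in-three rate of uniform random guessing among three classes, so it may be the more demanding number to beat.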

* Dataset URL: https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey#survey_results_public.csv
* Classification task: Classify a developer's job search status as one of three classes: not interested in new job opportunities, not actively searching but open to hearing about job opportunities, or actively searching for new job opportunities.

### 1.2 Data Preparation

There are a total of 129 features in the dataset. To reduce the time and complexity of building models that classify whether a developer is currently searching for a job, we decided to drop many of the features for several reasons. (1) There were a couple of columns that we felt would have no effect on the model and the prediction task, including 'Respondent', which is just a unique number for each developer who completed the survey, and 'IDE', which contains the IDEs each developer uses. (2) There were many questions that asked the developer to rank certain values in order of importance, such as their preferred way to be contacted about a job, how they felt about a particular advertisement, or how they thought Stack Overflow could improve. We felt that these features would not have an effect on the model or the prediction task. (3) __HERE__

What is left after dropping these columns are the features we felt would have an impact on the prediction task, such as 'JobSatisfaction', 'Age', and 'Dependents'. For example, if developers are unhappy with their current job, are they more likely to be looking for a new one, and do age and having dependents also affect whether they are looking?

Since we are trying to predict whether a developer is currently searching for a job, we also drop every row whose 'JobSearchStatus' value is null (NaN).

In [46]:
import pandas as pd
import numpy as np


#read in the csv files
survey = pd.read_csv('stack-overflow-survey/survey_results_public.csv')
#survey_schema = pd.read_csv('stack-overflow-survey/survey_results_schema.csv')

#drop columns that are not needed
drop = ['Respondent', 'SurveyEasy', 'SurveyTooLong', 'AssessJob1', 'AssessJob2', 'AssessJob3',
        'AssessJob4', 'AssessJob5', 'AssessJob6', 'AssessJob7', 'AssessJob8', 'AssessJob9', 'AssessJob10',
        'AssessBenefits1', 'AssessBenefits2', 'AssessBenefits3', 'AssessBenefits4', 'AssessBenefits5',
        'AssessBenefits6', 'AssessBenefits7', 'AssessBenefits8', 'AssessBenefits9', 'AssessBenefits10',
        'AssessBenefits11', 'JobContactPriorities1', 'JobContactPriorities2', 'JobContactPriorities3', 
        'JobContactPriorities4', 'JobContactPriorities5', 'JobEmailPriorities1', 'JobEmailPriorities2',
        'JobEmailPriorities3', 'JobEmailPriorities4', 'JobEmailPriorities5', 'JobEmailPriorities6',
        'JobEmailPriorities7', 'Currency', 'Salary', 'CurrencySymbol', 'IDE', 'OperatingSystem', 'NumberMonitors',
        'AdBlocker', 'AdBlockerDisable', 'AdBlockerReasons', 'AdsAgreeDisagree1', 'AdsAgreeDisagree2', 
        'AdsAgreeDisagree3', 'AdsActions', 'AdsPriorities1', 'AdsPriorities2', 'AdsPriorities3', 
        'AdsPriorities4', 'AdsPriorities5', 'AdsPriorities6','AdsPriorities7', 'AIDangerous', 
        'AIInteresting', 'AIResponsible', 'AIFuture', 'EthicsChoice', 'EthicsReport','EthicsResponsible',
        'EthicalImplications', 'StackOverflowRecommend', 'StackOverflowVisit', 'StackOverflowHasAccount', 
        'StackOverflowParticipate', 'StackOverflowDevStory', 'StackOverflowJobsRecommend', 'StackOverflowConsiderMember',
        'HypotheticalTools1', 'HypotheticalTools2', 'HypotheticalTools3', 'HypotheticalTools4', 'HypotheticalTools5', 
        'ErgonomicDevices', 'LanguageWorkedWith' , 'LanguageDesireNextYear', 'DatabaseWorkedWith',
        'DatabaseDesireNextYear', 'PlatformWorkedWith', 'PlatformDesireNextYear', 'FrameworkWorkedWith', 
        'FrameworkDesireNextYear', 'Methodology', 'VersionControl', 'CommunicationTools', 'TimeFullyProductive',
        'SelfTaughtTypes', 'TimeAfterBootcamp']

#drop all of the listed columns in a single call
survey = survey.drop(columns=drop)

#drop all rows with a null value in the 'JobSearchStatus' column
survey = survey.dropna(subset=['JobSearchStatus'])

survey.head()

Unnamed: 0,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,YearsCoding,...,HoursOutside,SkipMeals,Exercise,Gender,SexualOrientation,EducationParents,RaceEthnicity,Age,Dependents,MilitaryUS
0,Yes,No,Kenya,No,Employed part-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,20 to 99 employees,Full-stack developer,3-5 years,...,1 - 2 hours,Never,3 - 4 times per week,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Black or of African descent,25 - 34 years old,Yes,
1,Yes,Yes,United Kingdom,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","A natural science (ex. biology, chemistry, phy...","10,000 or more employees",Database administrator;DevOps specialist;Full-...,30 or more years,...,30 - 59 minutes,Never,Daily or almost every day,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",White or of European descent,35 - 44 years old,Yes,
2,Yes,Yes,United States,No,Employed full-time,Associate degree,"Computer science, computer engineering, or sof...",20 to 99 employees,Engineering manager;Full-stack developer,24-26 years,...,,,,,,,,,,
3,No,No,United States,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",100 to 499 employees,Full-stack developer,18-20 years,...,Less than 30 minutes,3 - 4 times per week,I don't typically exercise,Male,Straight or heterosexual,Some college/university study without earning ...,White or of European descent,35 - 44 years old,No,No
4,Yes,No,South Africa,"Yes, part-time",Employed full-time,Some college/university study without earning ...,"Computer science, computer engineering, or sof...","10,000 or more employees",Data or business analyst;Desktop or enterprise...,6-8 years,...,1 - 2 hours,Never,3 - 4 times per week,Male,Straight or heterosexual,Some college/university study without earning ...,White or of European descent,18 - 24 years old,Yes,
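Many of the dropped columns come in numbered families (for example 'AssessJob1' through 'AssessJob10'), so the drop list could also be built from name prefixes. The following is a minimal sketch on a made-up frame; the names `drop_prefixes` and `drop_exact` are ours, and the cell values are invented.

```python
import pandas as pd

# Tiny made-up frame that reuses a few of the survey's column names.
toy = pd.DataFrame({
    'Respondent': [1, 2],
    'AssessJob1': [3, 1],
    'AssessJob2': [2, 5],
    'AdsPriorities1': [1, 4],
    'JobSatisfaction': ['value a', 'value b'],
})

# Prefixes of numbered ranking questions, plus individually named columns.
drop_prefixes = ('AssessJob', 'AdsPriorities')
drop_exact = ['Respondent']

# str.startswith accepts a tuple, so one test covers every prefix.
to_drop = [c for c in toy.columns if c.startswith(drop_prefixes)] + drop_exact
toy = toy.drop(columns=to_drop)

print(list(toy.columns))
```

Because prefix matching can catch more columns than intended, it is worth printing `to_drop` and reviewing it before applying it to the real data.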


In [47]:
print(survey.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 79488 entries, 0 to 89971
Data columns (total 38 columns):
Hobby                 79488 non-null object
OpenSource            79488 non-null object
Country               79488 non-null object
Student               78167 non-null object
Employment            78470 non-null object
FormalEducation       77907 non-null object
UndergradMajor        67324 non-null object
CompanySize           61172 non-null object
DevType               78760 non-null object
YearsCoding           79383 non-null object
YearsCodingProf       76139 non-null object
JobSatisfaction       69006 non-null object
CareerSatisfaction    76139 non-null object
HopeFiveYears         75375 non-null object
JobSearchStatus       79488 non-null object
LastNewJob            78889 non-null object
UpdateCV              65538 non-null object
SalaryType            50990 non-null object
ConvertedSalary       47701 non-null float64
EducationTypes        66045 non-null object
HackathonR

The majority of columns contain categorical data that needs to be mapped to integer values. For example, 'JobSearchStatus' has three unique values: 'I am not interested in new job opportunities', 'I’m not actively looking, but I am open to new opportunities', and 'I am actively looking for a job', which are mapped to the integer values 0, 1, and 2. The same process is repeated for the other columns.

In [48]:
#map categorical data to integer values
survey['JobSearchStatus'] = survey['JobSearchStatus'].map({
    'I am not interested in new job opportunities': 0,
    'I’m not actively looking, but I am open to new opportunities': 1,
    'I am actively looking for a job': 2
})

#confirm that only the three integer codes remain
for a in survey['JobSearchStatus'].unique():
    print('Unique:  ',a)

Unique:   1
Unique:   2
Unique:   0
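Repeating the mapping for the other columns could be done with a small loop that builds one dictionary per column. This is a sketch under assumptions, not the mapping we actually use: the frame is made up, and the codes are assigned in alphabetical order, which is only reasonable for columns with no natural ordering.

```python
import pandas as pd

# Made-up frame with two yes/no style columns and one missing answer.
toy = pd.DataFrame({
    'Hobby': ['Yes', 'No', 'Yes', None],
    'OpenSource': ['No', 'No', 'Yes', 'Yes'],
})

mappings = {}
for col in ['Hobby', 'OpenSource']:
    # Sort the observed values so the codes are reproducible between runs.
    categories = sorted(toy[col].dropna().unique())
    mappings[col] = {value: code for code, value in enumerate(categories)}
    # Values missing from the mapping (here, the missing answer) become NaN.
    toy[col] = toy[col].map(mappings[col])

print(mappings)
print(toy)
```

Keeping the dictionaries in `mappings` makes it possible to translate codes back to the original answers later. Ordered columns such as satisfaction ratings would still need a hand-written mapping, as was done for 'JobSearchStatus'.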


### 1.3 Cross Product Features

### 1.4 Evaluation Metrics

### 1.5 Dividing Data into Training and Testing

## 2. Modeling

### 2.1 Model 1

### 2.2 Model 2

### 2.3 Model 3

### 2.4 Comparing Our Best Model to a Standard MultiLayer Perceptron

#### ROC Graph

## 3. Additional Analysis

### 3.1 Dimensionality Reduction and Visualization

## References
https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey#survey_results_public.csv