# Machine Learning Project

## 1. HR Case - Employee retention
Objective:
Build a model that represents the *Job Satisfaction* of an IT professional based on company characteristics and other data collected by the survey. Use the data from the 2018 Stack Overflow survey.

## 2. Advertising Company Case - Salary prediction

Objective:
Build a model that represents the *Salary* of an IT professional based on company characteristics and other employee characteristics, using the data made available in the 2017 Stack Overflow survey.


### Steps:

0. Load the dataset
1. Feature selection - Feature analysis / Build the analytical base
  - remove rows with missing values
  - encode categorical variables as factors
  - etc.
2. Exploratory analysis of the dataset:
  - Histogram of salaries
  - Histogram of satisfaction - how many respondents have satisfaction greater than 0.7
  - Feature correlations
  - etc.
3. Translate the problem - look for the best business solution
4. Select and train the model
  - Select the model
  - Define X (features) and y (dependent variable)
  - Normalize the features (optional, but it can improve prediction results)
  - Split the data into training and test sets
  - Train the model
5. Return the MSE for the model and the actual distributions of y_teste and y_pred (predicted y)
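The last two steps (train a model, then report the MSE) can be sketched as below. This is only an illustration on synthetic data, assuming a plain linear regression: the feature matrix, the coefficients, the `test_size` and the `random_state` values are invented for the example and are not part of the assignment.

```python
import numpy as np
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the analytical base (assumption: 200 rows, 3 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Split first, then fit the scaler on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Train the model and predict on the held-out rows.
model = linear_model.LinearRegression().fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)

mse = metrics.mean_squared_error(y_test, y_pred)
print("MSE:", mse)
```

Here the scaler is fitted after the split, which differs from the order in the list; doing it this way keeps test-set statistics out of the normalization. The distributions of `y_test` and `y_pred` could then be compared with two histograms.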



# LOAD DATASET

In [2]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import train_test_split

from sklearn import metrics
pd.set_option('display.max_rows', 3500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

data = pd.read_csv("../survey_results_public.csv")

# DATA CLEANING

#### Drop rows that have nulls in the Salary, JobSatisfaction and CareerSatisfaction columns

In [3]:
data2 = data[~(data.Salary.isnull() |
        data.JobSatisfaction.isnull() |
        data.CareerSatisfaction.isnull())]
nrow = data2.shape[0]
print("# Rows :", nrow)
print("# Columns: ", data2.shape[1])
del data

# Rows : 12847
# Columns:  154
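The same row filter can also be written with `DataFrame.dropna(subset=...)`. The sketch below uses a small toy frame with made-up values (only the three column names come from the cell above) to show that both forms select the same rows.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) using the three column names from the cell above.
toy = pd.DataFrame({
    "Salary": [50000.0, np.nan, 72000.0, 64000.0],
    "JobSatisfaction": [8.0, 7.0, np.nan, 9.0],
    "CareerSatisfaction": [9.0, 6.0, 8.0, np.nan],
    "Country": ["Croatia", "Brazil", None, "India"],
})

required = ["Salary", "JobSatisfaction", "CareerSatisfaction"]

# Boolean-mask form, as in the notebook.
masked = toy[~(toy.Salary.isnull() |
               toy.JobSatisfaction.isnull() |
               toy.CareerSatisfaction.isnull())]

# dropna form: a row is removed if any of the listed columns is null.
dropped = toy.dropna(subset=required)

print(dropped)
```

Nulls in columns outside `required` (such as `Country` here) do not affect which rows are kept.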


#### Drop rows with HoursPerWeek >= 20 (EmploymentStatus full-time or part-time)
(Assuming these are errors made when filling out the survey)

In [4]:
data2 = data2[(data2.HoursPerWeek<20)]
print(len(data2))

5540
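A comparison filter such as `data2.HoursPerWeek < 20` also removes rows where the value is missing, because comparing NaN with a number yields False. The toy series below (made-up values) illustrates this, together with one possible way to keep the blanks if that were wanted.

```python
import numpy as np
import pandas as pd

# Made-up values; NaN marks a respondent who left the question blank.
hours = pd.Series([1.0, 19.0, 20.0, 40.0, np.nan], name="HoursPerWeek")

# NaN < 20 evaluates to False, so blank answers are dropped by this mask.
keep_strict = hours < 20

# Variant that keeps blank answers explicitly.
keep_with_blanks = (hours < 20) | hours.isnull()

print(keep_strict.tolist())
print(keep_with_blanks.tolist())
```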


#### Drop columns with more than 5,000 missing values (about 90% of the 5,540 remaining rows)

In [5]:
del_columns = data2.columns[data2.isnull().sum() >5000]
df = data2.drop(del_columns, axis = 1)
del_columns

Index(['YearsCodedJobPast', 'MobileDeveloperType', 'NonDeveloperType', 'ExCoderReturn', 'ExCoderNotForMe', 'ExCoderBalance', 'ExCoder10Years', 'ExCoderBelonged', 'ExCoderSkills', 'ExCoderWillNotCode', 'ExCoderActive', 'TimeAfterBootcamp', 'ExpectedSalary'], dtype='object')
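The cut-off can also be expressed as a fraction of rows instead of an absolute count, which stays meaningful if the number of rows changes. A minimal sketch, assuming a toy frame with invented column names and an illustrative threshold of 0.9:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up column names: 20 rows, one column almost entirely empty.
toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 19 + [1.0],
    "half_missing": [np.nan] * 10 + [1.0] * 10,
    "complete": range(20),
})

max_missing_fraction = 0.9  # illustrative threshold, not taken from the notebook

# isnull().mean() gives the share of null values in each column.
missing_fraction = toy.isnull().mean()
to_drop = toy.columns[missing_fraction > max_missing_fraction]
reduced = toy.drop(columns=to_drop)

print(list(to_drop))
print(reduced.shape)
```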

In [6]:
# PRINT THE NUMBER OF MISSING VALUES PER COLUMN:

#print("data_tratada shape: ", df.shape)
#print("total missing data")
#print()
#for col in df.columns:
#    print(col , df[col].isnull().sum())
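The per-column loop that is commented out above can be replaced by a single vectorized call. The sketch below uses a toy frame with made-up values and sorts the counts so the columns with the most missing values come first.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "Salary": [1000.0, np.nan, 3000.0],
    "Country": ["Croatia", None, None],
    "HoursPerWeek": [1.0, 2.0, 3.0],
})

# Count nulls per column and list the worst columns first.
missing_per_column = toy.isnull().sum().sort_values(ascending=False)

print("shape:", toy.shape)
print(missing_per_column)
```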

#### Drop some columns that are irrelevant, that still have a high number of missing values, or that contain only a single response value (e.g. Professional)

In [7]:
# Irrelevant columns
df = df.drop("Respondent", axis = 1)
df = df.drop('PronounceGIF', axis = 1)
df = df.drop('ClickyKeys', axis = 1)

# Columns with a high number of missing values
df = df.drop('HaveWorkedDatabase', axis = 1)
df = df.drop('WantWorkDatabase', axis = 1)
df = df.drop('HaveWorkedPlatform', axis = 1)
df = df.drop('WantWorkPlatform', axis = 1)

# Column with only a single response value
df = df.drop("Professional", axis = 1)
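Columns that hold a single response value, like `Professional` after the earlier filters, can be found programmatically rather than by inspection. A minimal sketch on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; "Professional" has a single distinct answer.
toy = pd.DataFrame({
    "Professional": ["Professional developer"] * 4,
    "Country": ["Croatia", "Brazil", "India", "Brazil"],
    "Salary": [10.0, 20.0, np.nan, 40.0],
})

# nunique() ignores NaN by default, so a column with one value plus blanks
# is also reported as constant.
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
```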

#### Convert the multiple-answer columns into dummies


HaveWorkedLanguage, WantWorkLanguage, IDE, DeveloperType, ImportantBenefits, Gender, Race, StackOverflowDevices, MetricAssess and EducationTypes


In [8]:
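def strip_lista(lista):
    # Remove surrounding whitespace from every answer in the list
    return list(map(str.strip, lista))

def get_dummy_cols(df, column_string, prefix = "uses"):
    # Split the multi-answer string on ';' (rows without an answer are dropped)
    df1 = df[column_string].str.split(';').dropna()
    df1 = df1.apply(lambda x: strip_lista(x))
    # One-hot encode each answer, then collapse back to one row per respondent.
    # groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0.
    df2 = pd.get_dummies(df1.apply(pd.Series).stack()).groupby(level=0).sum()
    df = df.drop(column_string, axis = 1)
    df2.columns = [prefix + "_" + x for x in df2.columns]
    # Inner merge on the index: respondents with no answer in this column are removed
    result = pd.merge(df, df2, left_index=True, right_index=True)
    return result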


df = get_dummy_cols(df,'HaveWorkedLanguage', prefix ='Worked_with')
df = get_dummy_cols(df,'WantWorkLanguage', prefix = 'Want_work')
df = get_dummy_cols(df,'IDE' ,prefix = 'Uses')
df = get_dummy_cols(df, "DeveloperType", prefix = "DevType")
df = get_dummy_cols(df,'ImportantBenefits' ,prefix = 'IsImportantBenefit')
df = get_dummy_cols(df,'Gender' ,prefix = 'Gender')
df = get_dummy_cols(df, 'Race', prefix = "Race")
df = get_dummy_cols(df, 'StackOverflowDevices', prefix = "Uses_StackOverflow_in")
df = get_dummy_cols(df, 'MetricAssess', prefix = "MetricAssess_")
df = get_dummy_cols(df, 'EducationTypes', prefix = "Education")

print(df.shape)
df.head(1)

(1928, 288)


Unnamed: 0,ProgramHobby,Country,University,EmploymentStatus,FormalEducation,MajorUndergrad,HomeRemote,CompanySize,CompanyType,YearsProgram,YearsCodedJob,WebDeveloperType,CareerSatisfaction,JobSatisfaction,ProblemSolving,BuildingThings,LearningNewTech,BoringDetails,JobSecurity,DiversityImportant,AnnoyingUI,FriendsDevelopers,RightWrongWay,UnderstandComputers,SeriousWork,InvestTimeTools,WorkPayCare,KinshipDevelopers,ChallengeMyself,CompetePeers,ChangeWorld,JobSeekingStatus,HoursPerWeek,LastNewJob,AssessJobIndustry,AssessJobRole,AssessJobExp,AssessJobDept,AssessJobTech,AssessJobProjects,AssessJobCompensation,AssessJobOffice,AssessJobCommute,AssessJobRemote,AssessJobLeaders,AssessJobProfDevel,AssessJobDiversity,AssessJobProduct,AssessJobFinances,JobProfile,ResumePrompted,LearnedHiring,ImportantHiringAlgorithms,ImportantHiringTechExp,ImportantHiringCommunication,ImportantHiringOpenSource,ImportantHiringPMExp,ImportantHiringCompanies,ImportantHiringTitles,ImportantHiringEducation,ImportantHiringRep,ImportantHiringGettingThingsDone,Currency,Overpaid,TabsSpaces,EducationImportant,SelfTaughtTypes,CousinEducation,WorkStart,HaveWorkedFramework,WantWorkFramework,AuditoryEnvironment,Methodology,VersionControl,CheckInCode,ShipIt,OtherPeoplesCode,ProjectManagement,EnjoyDebugging,InTheZone,DifficultCommunication,CollaborateRemote,EquipmentSatisfiedMonitors,EquipmentSatisfiedCPU,EquipmentSatisfiedRAM,EquipmentSatisfiedStorage,EquipmentSatisfiedRW,InfluenceInternet,InfluenceWorkstation,InfluenceHardware,InfluenceServers,InfluenceTechStack,InfluenceDeptTech,InfluenceVizTools,InfluenceDatabase,InfluenceCloud,InfluenceConsultants,InfluenceRecruitment,InfluenceCommunication,StackOverflowDescribes,StackOverflowSatisfaction,StackOverflowFoundAnswer,StackOverflowCopiedCode,StackOverflowJobListing,StackOverflowCompanyPage,StackOverflowJobSearch,StackOverflowNewQuestion,StackOverflowAnswer,StackOverflowMetaChat,StackOverflowAdsRelevant,StackOverflowAdsDistracting,StackOverflowModeration,StackOverflowCommunity,StackOverflowHelpful,StackOverflowBetter,StackOverflowWhatDo,StackOverflowMakeMoney,HighestEducationParents,SurveyLong,QuestionsInteresting,QuestionsConfusing,InterestedAnswers,Salary,Worked_with_Assembly,Worked_with_C,Worked_with_C#,Worked_with_C++,Worked_with_Clojure,Worked_with_CoffeeScript,Worked_with_Common Lisp,Worked_with_Dart,Worked_with_Elixir,Worked_with_Erlang,Worked_with_F#,Worked_with_Go,Worked_with_Groovy,Worked_with_Hack,Worked_with_Haskell,Worked_with_Java,Worked_with_JavaScript,Worked_with_Julia,Worked_with_Lua,Worked_with_Matlab,Worked_with_Objective-C,Worked_with_PHP,Worked_with_Perl,Worked_with_Python,Worked_with_R,Worked_with_Ruby,Worked_with_Rust,Worked_with_SQL,Worked_with_Scala,Worked_with_Smalltalk,Worked_with_Swift,Worked_with_TypeScript,Worked_with_VB.NET,Worked_with_VBA,Worked_with_Visual Basic 6,Want_work_Assembly,Want_work_C,Want_work_C#,Want_work_C++,Want_work_Clojure,Want_work_CoffeeScript,Want_work_Common Lisp,Want_work_Dart,Want_work_Elixir,Want_work_Erlang,Want_work_F#,Want_work_Go,Want_work_Groovy,Want_work_Hack,Want_work_Haskell,Want_work_Java,Want_work_JavaScript,Want_work_Julia,Want_work_Lua,Want_work_Matlab,Want_work_Objective-C,Want_work_PHP,Want_work_Perl,Want_work_Python,Want_work_R,Want_work_Ruby,Want_work_Rust,Want_work_SQL,Want_work_Scala,Want_work_Smalltalk,Want_work_Swift,Want_work_TypeScript,Want_work_VB.NET,Want_work_VBA,Want_work_Visual Basic 6,Uses_Android Studio,Uses_Atom,Uses_Coda,Uses_Eclipse,Uses_Emacs,Uses_IPython / Jupyter,Uses_IntelliJ,Uses_Komodo,Uses_Light Table,Uses_NetBeans,Uses_Notepad++,Uses_PHPStorm,Uses_PyCharm,Uses_RStudio,Uses_RubyMine,Uses_Sublime Text,Uses_TextMate,Uses_Vim,Uses_Visual Studio,Uses_Visual Studio Code,Uses_Xcode,Uses_Zend,DevType_Data scientist,DevType_Database administrator,DevType_Desktop applications developer,DevType_DevOps specialist,DevType_Developer with a statistics or mathematics background,DevType_Embedded applications/devices developer,DevType_Graphic designer,DevType_Graphics programming,DevType_Machine learning specialist,DevType_Mobile developer,DevType_Other,DevType_Quality assurance engineer,DevType_Systems administrator,DevType_Web developer,IsImportantBenefit_Annual bonus,IsImportantBenefit_Charitable match,IsImportantBenefit_Child/elder care,IsImportantBenefit_Education sponsorship,IsImportantBenefit_Equipment,IsImportantBenefit_Expected work hours,IsImportantBenefit_Health benefits,IsImportantBenefit_Long-term leave,IsImportantBenefit_Meals,IsImportantBenefit_None of these,IsImportantBenefit_Other,IsImportantBenefit_Private office,IsImportantBenefit_Professional development sponsorship,IsImportantBenefit_Remote options,IsImportantBenefit_Retirement,IsImportantBenefit_Stock options,IsImportantBenefit_Vacation/days off,Gender_Female,Gender_Gender non-conforming,Gender_Male,Gender_Other,Gender_Transgender,Race_Black or of African descent,Race_East Asian,Race_Hispanic or Latino/Latina,Race_I don’t know,Race_I prefer not to say,Race_Middle Eastern,"Race_Native American, Pacific Islander, or Indigenous Australian",Race_South Asian,Race_White or of European descent,Uses_StackOverflow_in_Android app,Uses_StackOverflow_in_Android browser,Uses_StackOverflow_in_Desktop,Uses_StackOverflow_in_Other phone browser,Uses_StackOverflow_in_iOS app,Uses_StackOverflow_in_iOS browser,MetricAssess__Benchmarked product performance,MetricAssess__Bugs found,MetricAssess__Commit frequency,MetricAssess__Customer satisfaction,MetricAssess__Hours worked,MetricAssess__Lines of code,MetricAssess__Manager's rating,MetricAssess__On time/in budget,MetricAssess__Other,MetricAssess__Peers' rating,MetricAssess__Release frequency,MetricAssess__Revenue performance,MetricAssess__Self-rating,Education_Bootcamp,Education_Coding competition,Education_Hackathon,Education_Industry certification,Education_On-the-job training,Education_Online course,Education_Open source contributions,Education_Part-time/evening course,Education_Self-taught
34,"Yes, I program as a hobby",Croatia,"Yes, full-time",Employed full-time,Bachelor's degree,Computer engineering or electrical/electronics...,A few days each month,10 to 19 employees,"Privately-held limited company, not in startup...",7 to 8 years,1 to 2 years,,10.0,8.0,,,,,,,,,,,,,,,,,,"I'm not actively looking, but I am open to new...",1.0,Between 1 and 2 years ago,Important,Important,Important,Important,Important,Important,Important,Very important,Important,Somewhat important,Important,Very important,Not at all important,Somewhat important,Important,LinkedIn,"I completed a major project, assignment, or co...",I was contacted directly by someone at the com...,Very important,Somewhat important,Important,Not very important,Somewhat important,Important,Important,Somewhat important,Not very important,Very important,Euros (€),Somewhat underpaid,Spaces,Not very important,Trade book; Textbook; Stack Overflow Q&A; Stac...,Get a job as a QA tester; Take online courses;...,7:00 AM,,,Keep the room absolutely quiet,Pair; Kanban,Team Foundation Server,Multiple times a day,Agree,Agree,Agree,Agree,Agree,Disagree,Strongly agree,,,,,,,,,,,,,,,,,,"I have a login for Stack Overflow, but haven't...",10.0,Several times,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Haven't done at all,Disagree,Strongly disagree,Strongly disagree,Somewhat agree,Agree,Strongly agree,Somewhat agree,Strongly disagree,A master's degree,Disagree,Agree,Strongly disagree,Agree,14838.709677,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,1,0,0,1,0,0,1,1,0,1,0,1,0,0,1,0,0,0,0,0,1,1
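An alternative way to expand a multi-answer column is `Series.str.get_dummies`. The sketch below is not the notebook's method: the helper name `multi_select_dummies` and the toy values are invented for illustration. Unlike `get_dummy_cols`, whose inner merge removes respondents with no answer in the column, this version keeps every row and encodes a missing answer as all zeros.

```python
import numpy as np
import pandas as pd

# Made-up answers in the survey's "a; b; c" multi-select format.
toy = pd.DataFrame({
    "IDE": ["Vim; Atom", "Atom", np.nan, "Vim ; Emacs"],
    "Salary": [10.0, 20.0, 30.0, 40.0],
})

def multi_select_dummies(frame, column, prefix):
    """Hypothetical helper: expand a ';'-separated column into 0/1 columns."""
    # Normalise separators so every answer is delimited by a bare ';'.
    cleaned = frame[column].str.strip().str.replace(r"\s*;\s*", ";", regex=True)
    dummies = cleaned.str.get_dummies(sep=";").add_prefix(prefix + "_")
    # join keeps every row; a missing answer becomes a row of zeros.
    return frame.drop(columns=column).join(dummies)

encoded = multi_select_dummies(toy, "IDE", "Uses")
print(encoded)
```

Whether to keep or drop respondents without an answer is a modelling choice; keeping them avoids shrinking the dataset each time another column is expanded.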


### END OF PREPROCESSING - Save the preprocessed dataset
