# Structured data analysis

In [1]:
import pandas as pd

![panda](https://media.giphy.com/media/HDR31jsQUPqQo/giphy.gif)

## Create a DataFrame from a dictionary

In [2]:
# Named brics_data so the built-in dict type is not shadowed
brics_data = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
              "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
              "area": [8.516, 17.10, 3.286, 9.597, 1.221],
              "population": [200.4, 143.5, 1252, 1357, 52.98]}

In [3]:
# Turn the dictionary into a DataFrame
brics = pd.DataFrame(brics_data)

In [4]:
# Look at the first records of this DataFrame
brics.head()

Unnamed: 0,area,capital,country,population
0,8.516,Brasilia,Brazil,200.4
1,17.1,Moscow,Russia,143.5
2,3.286,New Delhi,India,1252.0
3,9.597,Beijing,China,1357.0
4,1.221,Pretoria,South Africa,52.98
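
As a small variation on the cell above, the sketch below passes `columns=` to choose the column order and then uses the country name as the row label. The two-country subset and the names `data` and `brics_small` are illustrative only.

```python
import pandas as pd

# Illustrative two-country subset of the BRICS dictionary
data = {
    "country": ["Brazil", "China"],
    "capital": ["Brasilia", "Beijing"],
    "area": [8.516, 9.597],
}

# columns= selects the dictionary keys and fixes their order
brics_small = pd.DataFrame(data, columns=["country", "capital", "area"])

# Use the country name as the row label instead of 0, 1, ...
brics_small = brics_small.set_index("country")

# Look up a single value by row label and column name
print(brics_small.loc["China", "capital"])
```

With a meaningful index, `.loc` lookups read like the question being asked ("capital of China") rather than depending on row positions.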


## Import a CSV with pandas

We will use the data that Kaggle released in 2017 about data scientists and data science. The release contains 5 files:

 - **schema.csv**: a CSV file with survey schema. This schema includes the questions that correspond to each column name in both the multipleChoiceResponses.csv and freeformResponses.csv.
 - **multipleChoiceResponses.csv**: Respondents' answers to multiple choice and ranking questions. These are non-randomized and thus a single row does correspond to all of a single user's answers. 
 - **freeformResponses.csv**: Respondents' freeform answers to Kaggle's survey questions. These responses are randomized within a column, so that reading across a single row does not give a single user's answers.
 - **conversionRates.csv**: Currency conversion rates (to USD) as accessed from the R package "quantmod" on September 14, 2017
 - **RespondentTypeREADME.txt**: This is a schema for decoding the responses in the "Asked" column of the schema.csv file.

In [4]:
# Load the multipleChoiceResponses dataset with pandas
multiple_choice = pd.read_csv('kaggle-survey-2017/multipleChoiceResponses.csv')
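
Because the survey file is very wide, it can help to load only a few columns or rows while exploring. This is a minimal sketch that uses an in-memory CSV as a stand-in for the real file. The sample rows are invented for illustration, while `usecols` and `nrows` are standard `read_csv` parameters.

```python
import io
import pandas as pd

# Invented stand-in for a CSV file on disk
csv_text = """GenderSelect,Country,Age
Male,Canada,28
Female,United States,30
Male,Taiwan,38
"""

# usecols keeps only the named columns; nrows stops after the first 2 data rows
sample = pd.read_csv(io.StringIO(csv_text), usecols=["Country", "Age"], nrows=2)
print(sample)
```

The same two parameters can be passed when reading a path instead of a `StringIO` object.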

In [5]:
# Look at the first rows of the dataset
multiple_choice.head()

Unnamed: 0,GenderSelect,Country,Age,EmploymentStatus,StudentStatus,LearningDataScience,CodeWriter,CareerSwitcher,CurrentJobTitleSelect,TitleFit,...,JobFactorExperienceLevel,JobFactorDepartment,JobFactorTitle,JobFactorCompanyFunding,JobFactorImpact,JobFactorRemote,JobFactorIndustry,JobFactorLeaderReputation,JobFactorDiversity,JobFactorPublishingOpportunity
0,"Non-binary, genderqueer, or gender non-conforming",,,Employed full-time,,,Yes,,DBA/Database Engineer,Fine,...,,,,,,,,,,
1,Female,United States,30.0,"Not employed, but looking for work",,,,,,,...,,,,,,,,Somewhat important,,
2,Male,Canada,28.0,"Not employed, but looking for work",,,,,,,...,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important
3,Male,United States,56.0,"Independent contractor, freelancer, or self-em...",,,Yes,,Operations Research Practitioner,Poorly,...,,,,,,,,,,
4,Male,Taiwan,38.0,Employed full-time,,,Yes,,Computer Scientist,Fine,...,,,,,,,,,,


In [7]:
# Check the number of rows and columns in the dataset
multiple_choice.shape

(16716, 228)

There are 228 columns!
![panda](https://media.giphy.com/media/14aUO0Mf7dWDXW/giphy.gif)

Let's see what these columns are about. Because there are so MANY columns, we need to change pandas' default display settings for rows and columns.

In [8]:
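# Use the full option names; short patterns such as 'max_rows' can match more than one option
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 1000)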
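
`pd.set_option` changes the setting for the rest of the session. When the wider display is only needed for one output, `pd.option_context` applies it temporarily. The sketch below uses an invented 30-column frame called `wide` to show this.

```python
import pandas as pd

# Toy frame with many columns (invented for illustration)
wide = pd.DataFrame({f"col{i}": [i] for i in range(30)})

before = pd.get_option("display.max_columns")

# Inside the with-block there is no column limit (None means "show all")
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")
    print(wide)

# On exit, the previous setting is restored automatically
after = pd.get_option("display.max_columns")
print(before == after)
```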

In [9]:
multiple_choice.head()

Unnamed: 0,GenderSelect,Country,Age,EmploymentStatus,StudentStatus,LearningDataScience,CodeWriter,CareerSwitcher,CurrentJobTitleSelect,TitleFit,CurrentEmployerType,MLToolNextYearSelect,MLMethodNextYearSelect,LanguageRecommendationSelect,PublicDatasetsSelect,LearningPlatformSelect,LearningPlatformUsefulnessArxiv,LearningPlatformUsefulnessBlogs,LearningPlatformUsefulnessCollege,LearningPlatformUsefulnessCompany,LearningPlatformUsefulnessConferences,LearningPlatformUsefulnessFriends,LearningPlatformUsefulnessKaggle,LearningPlatformUsefulnessNewsletters,LearningPlatformUsefulnessCommunities,LearningPlatformUsefulnessDocumentation,LearningPlatformUsefulnessCourses,LearningPlatformUsefulnessProjects,LearningPlatformUsefulnessPodcasts,LearningPlatformUsefulnessSO,LearningPlatformUsefulnessTextbook,LearningPlatformUsefulnessTradeBook,LearningPlatformUsefulnessTutoring,LearningPlatformUsefulnessYouTube,BlogsPodcastsNewslettersSelect,LearningDataScienceTime,JobSkillImportanceBigData,JobSkillImportanceDegree,JobSkillImportanceStats,JobSkillImportanceEnterpriseTools,JobSkillImportancePython,JobSkillImportanceR,JobSkillImportanceSQL,JobSkillImportanceKaggleRanking,JobSkillImportanceMOOC,JobSkillImportanceVisualizations,JobSkillImportanceOtherSelect1,JobSkillImportanceOtherSelect2,JobSkillImportanceOtherSelect3,CoursePlatformSelect,HardwarePersonalProjectsSelect,TimeSpentStudying,ProveKnowledgeSelect,DataScienceIdentitySelect,FormalEducation,MajorSelect,Tenure,PastJobTitlesSelect,FirstTrainingSelect,LearningCategorySelftTaught,LearningCategoryOnlineCourses,LearningCategoryWork,LearningCategoryUniversity,LearningCategoryKaggle,LearningCategoryOther,MLSkillsSelect,MLTechniquesSelect,ParentsEducation,EmployerIndustry,EmployerSize,EmployerSizeChange,EmployerMLTime,EmployerSearchMethod,UniversityImportance,JobFunctionSelect,WorkHardwareSelect,WorkDataTypeSelect,WorkProductionFrequency,WorkDatasetSize,WorkAlgorithmsSelect,WorkToolsSelect,WorkToolsFrequencyAmazonML,WorkToolsFrequencyAWS,WorkToolsFrequencyAngoss,WorkToolsFrequencyC,WorkToolsFrequencyCloudera,WorkToolsFrequencyDataRobot,WorkToolsFrequencyFlume,WorkToolsFrequencyGCP,WorkToolsFrequencyHadoop,WorkToolsFrequencyIBMCognos,WorkToolsFrequencyIBMSPSSModeler,WorkToolsFrequencyIBMSPSSStatistics,WorkToolsFrequencyIBMWatson,WorkToolsFrequencyImpala,WorkToolsFrequencyJava,WorkToolsFrequencyJulia,WorkToolsFrequencyJupyter,WorkToolsFrequencyKNIMECommercial,WorkToolsFrequencyKNIMEFree,WorkToolsFrequencyMathematica,WorkToolsFrequencyMATLAB,WorkToolsFrequencyAzure,WorkToolsFrequencyExcel,WorkToolsFrequencyMicrosoftRServer,WorkToolsFrequencyMicrosoftSQL,WorkToolsFrequencyMinitab,WorkToolsFrequencyNoSQL,WorkToolsFrequencyOracle,WorkToolsFrequencyOrange,WorkToolsFrequencyPerl,WorkToolsFrequencyPython,WorkToolsFrequencyQlik,WorkToolsFrequencyR,WorkToolsFrequencyRapidMinerCommercial,WorkToolsFrequencyRapidMinerFree,WorkToolsFrequencySalfrod,WorkToolsFrequencySAPBusinessObjects,WorkToolsFrequencySASBase,WorkToolsFrequencySASEnterprise,WorkToolsFrequencySASJMP,WorkToolsFrequencySpark,WorkToolsFrequencySQL,WorkToolsFrequencyStan,WorkToolsFrequencyStatistica,WorkToolsFrequencyTableau,WorkToolsFrequencyTensorFlow,WorkToolsFrequencyTIBCO,WorkToolsFrequencyUnix,WorkToolsFrequencySelect1,WorkToolsFrequencySelect2,WorkFrequencySelect3,WorkMethodsSelect,WorkMethodsFrequencyA/B,WorkMethodsFrequencyAssociationRules,WorkMethodsFrequencyBayesian,WorkMethodsFrequencyCNNs,WorkMethodsFrequencyCollaborativeFiltering,WorkMethodsFrequencyCross-Validation,WorkMethodsFrequencyDataVisualization,WorkMethodsFrequencyDecisionTrees,WorkMethodsFrequencyEnsembleMethods,WorkMethodsFrequencyEvolutionaryApproaches,WorkMethodsFrequencyGANs,WorkMethodsFrequencyGBM,WorkMethodsFrequencyHMMs,WorkMethodsFrequencyKNN,WorkMethodsFrequencyLiftAnalysis,WorkMethodsFrequencyLogisticRegression,WorkMethodsFrequencyMLN,WorkMethodsFrequencyNaiveBayes,WorkMethodsFrequencyNLP,WorkMethodsFrequencyNeuralNetworks,WorkMethodsFrequencyPCA,WorkMethodsFrequencyPrescriptiveModeling,WorkMethodsFrequencyRandomForests,WorkMethodsFrequencyRecommenderSystems,WorkMethodsFrequencyRNNs,WorkMethodsFrequencySegmentation,WorkMethodsFrequencySimulation,WorkMethodsFrequencySVMs,WorkMethodsFrequencyTextAnalysis,WorkMethodsFrequencyTimeSeriesAnalysis,WorkMethodsFrequencySelect1,WorkMethodsFrequencySelect2,WorkMethodsFrequencySelect3,TimeGatheringData,TimeModelBuilding,TimeProduction,TimeVisualizing,TimeFindingInsights,TimeOtherSelect,AlgorithmUnderstandingLevel,WorkChallengesSelect,WorkChallengeFrequencyPolitics,WorkChallengeFrequencyUnusedResults,WorkChallengeFrequencyUnusefulInstrumenting,WorkChallengeFrequencyDeployment,WorkChallengeFrequencyDirtyData,WorkChallengeFrequencyExplaining,WorkChallengeFrequencyPass,WorkChallengeFrequencyIntegration,WorkChallengeFrequencyTalent,WorkChallengeFrequencyDataFunds,WorkChallengeFrequencyDomainExpertise,WorkChallengeFrequencyML,WorkChallengeFrequencyTools,WorkChallengeFrequencyExpectations,WorkChallengeFrequencyITCoordination,WorkChallengeFrequencyHiringFunds,WorkChallengeFrequencyPrivacy,WorkChallengeFrequencyScaling,WorkChallengeFrequencyEnvironments,WorkChallengeFrequencyClarity,WorkChallengeFrequencyDataAccess,WorkChallengeFrequencyOtherSelect,WorkDataVisualizations,WorkInternalVsExternalTools,WorkMLTeamSeatSelect,WorkDatasets,WorkDatasetsChallenge,WorkDataStorage,WorkDataSharing,WorkDataSourcing,WorkCodeSharing,RemoteWork,CompensationAmount,CompensationCurrency,SalaryChange,JobSatisfaction,JobSearchResource,JobHuntTime,JobFactorLearning,JobFactorSalary,JobFactorOffice,JobFactorLanguages,JobFactorCommute,JobFactorManagement,JobFactorExperienceLevel,JobFactorDepartment,JobFactorTitle,JobFactorCompanyFunding,JobFactorImpact,JobFactorRemote,JobFactorIndustry,JobFactorLeaderReputation,JobFactorDiversity,JobFactorPublishingOpportunity
0,"Non-binary, genderqueer, or gender non-conforming",,,Employed full-time,,,Yes,,DBA/Database Engineer,Fine,Employed by a company that doesn't perform adv...,SAS Base,Random Forests,F#,Dataset aggregator/platform (i.e. Socrata/Kagg...,"College/University,Conferences,Podcasts,Trade ...",,,,,Very useful,,,,,,,,Very useful,,,Somewhat useful,,,"Becoming a Data Scientist Podcast,Data Machina...",,,,,,,,,,,,,,,,,,,Yes,Bachelor's degree,Management information systems,More than 10 years,"Predictive Modeler,Programmer,Researcher",University courses,0.0,0.0,100.0,0.0,0.0,0.0,"Computer Vision,Natural Language Processing,Su...","Evolutionary Approaches,Neural Networks - GANs...",A doctoral degree,Internet-based,100 to 499 employees,Increased slightly,3-5 years,I visited the company's Web site and found a j...,Not very important,Build prototypes to explore applying machine l...,"Gaming Laptop (Laptop + CUDA capable GPU),Work...","Text data,Relational data",Rarely,10GB,"Neural Networks,Random Forests,RNNs","Amazon Web services,Oracle Data Mining/ Oracle...",,Rarely,,,,,,,,,,,,,,,,,,,,,,,,,,Sometimes,,Most of the time,,,,,,,,,,,,,,,,,,,,,,"Association Rules,Collaborative Filtering,Neur...",,Rarely,,,Often,,,,,,,,,,,,,,,Sometimes,Often,,Most of the time,,,,,,,,,,,0.0,100.0,0.0,0.0,0.0,0.0,Enough to explain the algorithm to someone non...,Company politics / Lack of management/financia...,Rarely,,,,,,,,,,,,,,,,Often,Most of the time,,,,,26-50% of projects,Do not know,Standalone Team,,,Document-oriented (e.g. MongoDB/Elasticsearch)...,"Company Developed Platform,I don't typically s...",,"Mercurial,Subversion,Other",Always,,,I am not currently employed,5,,,,,,,,,,,,,,,,,,
1,Female,United States,30.0,"Not employed, but looking for work",,,,,,,,Python,Random Forests,Python,Dataset aggregator/platform (i.e. Socrata/Kagg...,Kaggle,,,,,,,Somewhat useful,,,,,,,,,,,,"Becoming a Data Scientist Podcast,Siraj Raval ...",1-2 years,,Nice to have,Unnecessary,,Unnecessary,,Necessary,,,,,,,,,2 - 10 hours,Master's degree,Yes,Master's degree,Computer Science,Less than a year,Software Developer/Software Engineer,University courses,10.0,30.0,0.0,30.0,30.0,0.0,"Computer Vision,Supervised Machine Learning (T...","Bayesian Techniques,Decision Trees - Gradient ...",A bachelor's degree,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Somewhat important,,
2,Male,Canada,28.0,"Not employed, but looking for work",,,,,,,,Amazon Web services,Deep learning,R,Dataset aggregator/platform (i.e. Socrata/Kagg...,"Arxiv,College/University,Kaggle,Online courses...",Very useful,,Somewhat useful,,,,Somewhat useful,,,,Very useful,,,,,,,Very useful,"FastML Blog,No Free Hunch Blog,Talking Machine...",1-2 years,Necessary,,,,,Necessary,,,,,,,,"Coursera,edX",Basic laptop (Macbook),2 - 10 hours,Github Portfolio,Yes,Master's degree,Engineering (non-computer focused),3 to 5 years,"Data Scientist,Machine Learning Engineer",University courses,20.0,50.0,0.0,30.0,0.0,0.0,"Adversarial Learning,Computer Vision,Natural L...","Decision Trees - Random Forests,Ensemble Metho...",A bachelor's degree,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"Asking friends, family members, or former coll...",1-2,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important,Very Important
3,Male,United States,56.0,"Independent contractor, freelancer, or self-em...",,,Yes,,Operations Research Practitioner,Poorly,Self-employed,TensorFlow,Neural Nets,Python,I collect my own data (e.g. web-scraping),"Blogs,College/University,Conferences,Friends n...",,Very useful,Very useful,,Very useful,Very useful,,,,Very useful,Very useful,Very useful,,,,,,,KDnuggets Blog,,,,,,,,,,,,,,,,,,,Yes,Master's degree,Mathematics or statistics,More than 10 years,"Business Analyst,Operations Research Practitio...",University courses,30.0,0.0,40.0,30.0,0.0,0.0,"Recommendation Engines,Reinforcement learning,...","Bayesian Techniques,Decision Trees - Gradient ...",High school,Mix of fields,,,,,Very important,Analyze and understand data to influence produ...,"Laptop + Cloud service (AWS, Azure, GCE ...)",Relational data,Always,1GB,"Bayesian Techniques,Decision Trees,Random Fore...","Amazon Machine Learning,Amazon Web services,Cl...",Rarely,Often,,,Rarely,,,,Rarely,,,,,Rarely,Rarely,,,,,Rarely,Rarely,,Sometimes,,Rarely,,Rarely,,,,Rarely,,Rarely,,,,,Sometimes,,Rarely,,Often,,,Rarely,,,,,,,"A/B Testing,Bayesian Techniques,Data Visualiza...",Sometimes,,Sometimes,,,,Sometimes,Often,Sometimes,,,,,,,Sometimes,Often,Sometimes,,Sometimes,,,Sometimes,,,,Often,,,Often,,,,50.0,20.0,0.0,10.0,20.0,0.0,Enough to refine and innovate on the algorithm,Company politics / Lack of management/financia...,Often,Often,Often,Often,Often,Often,,Often,Often,Often,Most of the time,Often,Often,Often,,Often,Often,Often,Often,Often,Often,,100% of projects,Entirely internal,Standalone Team,Electricity data sets from government and states,"Everything is custom, there is never a tool th...","Column-oriented relational (e.g. KDB/MariaDB),...","Company Developed Platform,Email",,Generic cloud file sharing software (Dropbox/B...,,250000.0,USD,Has increased 20% or more,10 - Highly Satisfied,,,,,,,,,,,,,,,,,,
4,Male,Taiwan,38.0,Employed full-time,,,Yes,,Computer Scientist,Fine,Employed by a company that doesn't perform adv...,TensorFlow,Text Mining,Python,GitHub,"Arxiv,Conferences,Kaggle,Textbook",Very useful,,,,Somewhat useful,,Somewhat useful,,,,,,,,Somewhat useful,,,,"Data Machina Newsletter,Jack's Import AI Newsl...",,,,,,,,,,,,,,,,,,,No,Doctoral degree,Engineering (non-computer focused),More than 10 years,"Computer Scientist,Data Analyst,Data Miner,Dat...",University courses,60.0,5.0,5.0,30.0,0.0,0.0,"Computer Vision,Outlier detection (e.g. Fraud ...","Bayesian Techniques,Decision Trees - Gradient ...",Primary/elementary school,Technology,"5,000 to 9,999 employees",Stayed the same,Don't know,A tech-specific job board,Somewhat important,Build prototypes to explore applying machine l...,"Gaming Laptop (Laptop + CUDA capable GPU),GPU ...","Image data,Relational data",Most of the time,100GB,"Bayesian Techniques,CNNs,Ensemble Methods,Neur...","C/C++,Jupyter notebooks,MATLAB/Octave,Python,R...",,,,Most of the time,,,,,,,,,,,,,Sometimes,,,,Often,,,,,,,,,,Sometimes,,Sometimes,,,,,,,,,,,,,Sometimes,,,,,,"Association Rules,Bayesian Techniques,CNNs,Col...",,Sometimes,Often,Most of the time,Sometimes,,Most of the time,Sometimes,Often,Sometimes,,,,Most of the time,,Sometimes,,Sometimes,,Most of the time,Sometimes,,,,Sometimes,Often,,Most of the time,,Sometimes,,,,30.0,20.0,15.0,15.0,20.0,0.0,Enough to refine and innovate on the algorithm,Company politics / Lack of management/financia...,Often,Sometimes,,,,,,,Sometimes,Sometimes,Sometimes,,,,Sometimes,,Most of the time,,Sometimes,,,,10-25% of projects,Approximately half internal and half external,Business Department,,,Flat files not in a database or cache (e.g. CS...,Company Developed Platform,,Git,Rarely,,,I do not want to share information about my sa...,2,,,,,,,,,,,,,,,,,,


We can also look at just the column names using `columns`. To make them easier to read, instead of returning the raw Index, we can wrap the result in a Series.

In [10]:
# Call columns on the DataFrame and wrap it in a Series to make it easier to read
pd.Series(multiple_choice.columns)

0                                     GenderSelect
1                                          Country
2                                              Age
3                                 EmploymentStatus
4                                    StudentStatus
5                              LearningDataScience
6                                       CodeWriter
7                                   CareerSwitcher
8                            CurrentJobTitleSelect
9                                         TitleFit
10                             CurrentEmployerType
11                            MLToolNextYearSelect
12                          MLMethodNextYearSelect
13                    LanguageRecommendationSelect
14                            PublicDatasetsSelect
15                          LearningPlatformSelect
16                 LearningPlatformUsefulnessArxiv
17                 LearningPlatformUsefulnessBlogs
18               LearningPlatformUsefulnessCollege
19               LearningPlatfo
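
Once the names are in a Series, the string methods under `.str` can narrow them down to a group of related columns. The sketch below filters by prefix. The short list of names is a hand-picked sample of the survey's column names, and `job_factor_cols` is an illustrative variable name.

```python
import pandas as pd

# Hand-picked sample of column names from the survey
columns = pd.Series([
    "GenderSelect",
    "Country",
    "JobFactorSalary",
    "JobFactorLearning",
    "JobSatisfaction",
])

# Keep only the names that start with the given prefix
job_factor_cols = columns[columns.str.startswith("JobFactor")]
print(job_factor_cols.tolist())
```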

We can see more details about the dataset with `info()`.

In [11]:
multiple_choice.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16716 entries, 0 to 16715
Columns: 228 entries, GenderSelect to JobFactorPublishingOpportunity
dtypes: float64(13), object(215)
memory usage: 29.1+ MB


We can get summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column with a single command.

In [12]:
multiple_choice.describe()

Unnamed: 0,Age,LearningCategorySelftTaught,LearningCategoryOnlineCourses,LearningCategoryWork,LearningCategoryUniversity,LearningCategoryKaggle,LearningCategoryOther,TimeGatheringData,TimeModelBuilding,TimeProduction,TimeVisualizing,TimeFindingInsights,TimeOtherSelect
count,16385.0,13109.0,13126.0,13111.0,13122.0,13126.0,13094.0,7530.0,7528.0,7517.0,7529.0,7523.0,7513.0
mean,32.372841,33.366771,27.375514,15.217593,16.988607,5.531434,1.79594,36.144754,21.268066,10.806372,13.869372,13.094776,2.396247
std,10.473487,25.787181,26.86084,18.996778,23.676917,11.07268,9.357886,21.649591,16.165958,12.257932,11.722945,12.974846,12.157137
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,25.0,15.0,5.0,0.0,0.0,0.0,0.0,20.0,10.0,0.0,5.0,5.0,0.0
50%,30.0,30.0,20.0,10.0,5.0,0.0,0.0,35.0,20.0,10.0,10.0,10.0,0.0
75%,37.0,50.0,40.0,25.0,30.0,10.0,0.0,50.0,30.0,15.0,20.0,20.0,0.0
max,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,303.0,100.0
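
By default `describe()` only covers the numeric columns, and most columns in this dataset hold text. A sketch of how the non-numeric columns could be summarized instead, using an invented four-row frame called `toy`; excluding `np.number` leaves only the text column.

```python
import numpy as np
import pandas as pd

# Invented four-row sample with one numeric and one text column
toy = pd.DataFrame({
    "Age": [30, 28, 56, 38],
    "EmploymentStatus": [
        "Employed full-time",
        "Not employed, but looking for work",
        "Employed full-time",
        "Employed full-time",
    ],
})

# For non-numeric columns, describe() reports count, unique, top and freq
text_summary = toy.describe(exclude=[np.number])
print(text_summary)
```

Here `top` is the most frequent value and `freq` is how many times it appears.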


What if I want to see the number of nulls in every column of the dataset?

In [13]:
multiple_choice.isnull().sum()

GenderSelect                                      95
Country                                          121
Age                                              331
EmploymentStatus                                   0
StudentStatus                                  15436
LearningDataScience                            15432
CodeWriter                                      3530
CareerSwitcher                                 13704
CurrentJobTitleSelect                           4886
TitleFit                                        5212
CurrentEmployerType                             5115
MLToolNextYearSelect                            5718
MLMethodNextYearSelect                          5883
LanguageRecommendationSelect                    5718
PublicDatasetsSelect                            5920
LearningPlatformSelect                          5445
LearningPlatformUsefulnessArxiv                14325
LearningPlatformUsefulnessBlogs                11951
LearningPlatformUsefulnessCollege             

What if I want the proportion of nulls in each column? (Multiply the result by 100 to get a percentage.)

In [14]:
multiple_choice.isnull().sum() / len(multiple_choice)

GenderSelect                                   0.005683
Country                                        0.007239
Age                                            0.019801
EmploymentStatus                               0.000000
StudentStatus                                  0.923427
LearningDataScience                            0.923187
CodeWriter                                     0.211175
CareerSwitcher                                 0.819813
CurrentJobTitleSelect                          0.292295
TitleFit                                       0.311797
CurrentEmployerType                            0.305994
MLToolNextYearSelect                           0.342067
MLMethodNextYearSelect                         0.351938
LanguageRecommendationSelect                   0.342067
PublicDatasetsSelect                           0.354152
LearningPlatformSelect                         0.325736
LearningPlatformUsefulnessArxiv                0.856963
LearningPlatformUsefulnessBlogs                0

Wow, that's a lot of nulls!
![sad_panda](https://media.giphy.com/media/3e18NPUVzoxzO/giphy.gif)
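
With this many columns, it is easier to read the null share when it is expressed as a percentage and sorted. A minimal sketch on an invented frame called `toy`: the mean of a boolean column is the fraction of `True` values, so `isnull().mean()` gives the same result as dividing the sum by the number of rows.

```python
import numpy as np
import pandas as pd

# Invented frame with a different share of missing values per column
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", "z", None],
    "c": [1, 2, 3, 4],
})

# Fraction of nulls per column, as a percentage, worst columns first
null_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(null_pct)
```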

What should I do if I want to see only the `JobFactorSalary` column?

In [15]:
multiple_choice['JobFactorSalary']

0                       NaN
1                       NaN
2            Very Important
3                       NaN
4                       NaN
5                       NaN
6                       NaN
7            Very Important
8                       NaN
9                       NaN
10       Somewhat important
11                      NaN
12       Somewhat important
13                      NaN
14                      NaN
15                      NaN
16                      NaN
17                      NaN
18       Somewhat important
19            Not important
20       Somewhat important
21                      NaN
22                      NaN
23                      NaN
24                      NaN
25                      NaN
26                      NaN
27                      NaN
28                      NaN
29           Very Important
30                      NaN
31                      NaN
32                      NaN
33                      NaN
34                      NaN
35           Very Im

Let's use pandas to count the number of nulls in this column.

In [16]:
multiple_choice['JobFactorSalary'].isnull().sum()

13231

How do I see only the first 10 records?

In [17]:
multiple_choice['JobFactorSalary'][:10]

0               NaN
1               NaN
2    Very Important
3               NaN
4               NaN
5               NaN
6               NaN
7    Very Important
8               NaN
9               NaN
Name: JobFactorSalary, dtype: object
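
Slicing with `[:10]` is one of several equivalent ways to take the first rows. The sketch below compares it with `head()` and `iloc` on an invented five-element Series.

```python
import pandas as pd

# Invented values for illustration
s = pd.Series(["a", "b", "c", "d", "e"], name="JobFactorSalary")

first_slice = s[:3]       # plain positional slice
first_head = s.head(3)    # explicit "first n rows"
first_iloc = s.iloc[:3]   # position-based indexer

print(first_slice.equals(first_head) and first_head.equals(first_iloc))
```

`head(n)` and `iloc[:n]` make the intent explicit, which helps when the index is not the default 0, 1, 2, ...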

What if I want to see 2 columns at once (and only those 2 columns)?

In [18]:
multiple_choice[['JobFactorSalary', 'JobFactorLearning']][:10]

Unnamed: 0,JobFactorSalary,JobFactorLearning
0,,
1,,
2,Very Important,Very Important
3,,
4,,
5,,
6,,
7,Very Important,Very Important
8,,
9,,
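
Rows and columns can also be selected in a single step with `.loc`. A sketch on an invented frame called `toy`. Note that `.loc` slices by label and includes the end label, so `:2` returns three rows here.

```python
import pandas as pd

# Invented sample with the default 0..4 index
toy = pd.DataFrame({
    "JobFactorSalary": [None, "Very Important", None, "Somewhat important", None],
    "JobFactorLearning": [None, "Very Important", None, "Very Important", None],
    "Country": ["Canada", "Taiwan", "Israel", "Canada", "Brazil"],
})

# Rows with labels 0 through 2 (inclusive), only the two named columns
subset = toy.loc[:2, ["JobFactorSalary", "JobFactorLearning"]]
print(subset)
```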


How satisfied are the survey respondents with their jobs? Can we find out using only pandas?

In [19]:
multiple_choice['JobSatisfaction'].value_counts()

7                          1448
8                          1427
6                           765
9                           677
5                           627
10 - Highly Satisfied       589
3                           358
4                           354
1 - Highly Dissatisfied     167
I prefer not to share       148
2                           117
Name: JobSatisfaction, dtype: int64

This command shows that respondents are fairly satisfied with their jobs: 7 and 8 are by far the most common ratings.
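
Raw counts can be turned into shares with `normalize=True`, and because the answers are stored as text, the leading digits have to be extracted before an average can be computed. A sketch on an invented Series that mimics the answer formats shown in the output; the names `shares` and `scores` are illustrative.

```python
import pandas as pd

# Invented answers in the same formats as the survey column
satisfaction = pd.Series([
    "7", "8", "10 - Highly Satisfied", "1 - Highly Dissatisfied",
    "I prefer not to share", None, "7",
])

# Share of each answer, counting missing values as their own category
shares = satisfaction.value_counts(normalize=True, dropna=False)

# Pull the leading number out of each answer; non-numeric answers become NaN
scores = pd.to_numeric(satisfaction.str.extract(r"^(\d+)", expand=False))
mean_score = scores.mean()  # NaN values are skipped
print(mean_score)
```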

Now let's look only at the people who are Highly Satisfied with their job. How can I do that?

In [20]:
# Keep only the rows where JobSatisfaction is 10. Store the result in a variable because it is a lot of data
highly_satisfied = multiple_choice[multiple_choice['JobSatisfaction'] == '10 - Highly Satisfied']

In [21]:
# Check the size of the filtered dataset. Does it match the number of highly satisfied people?
highly_satisfied.shape

(589, 228)
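
The filter works because the comparison produces a boolean Series (a mask) with one `True` or `False` per row, and indexing with it keeps only the `True` rows. The sketch below, on an invented frame called `toy`, shows the mask on its own and how two conditions can be combined with `&`.

```python
import pandas as pd

# Invented sample for illustration
toy = pd.DataFrame({
    "Country": ["Canada", "Israel", "Canada", "Taiwan"],
    "JobSatisfaction": ["10 - Highly Satisfied", "10 - Highly Satisfied", "7", "8"],
})

# One True/False per row
mask = toy["JobSatisfaction"] == "10 - Highly Satisfied"
highly = toy[mask]

# Summing a boolean mask counts the True values, so it matches the filtered length
print(mask.sum() == len(highly))

# Combine conditions with & (and) or | (or); each one needs its own parentheses
highly_canada = toy[mask & (toy["Country"] == "Canada")]
print(len(highly_canada))
```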

In [22]:
# Look at the first 3 records (all columns) of the highly satisfied people
highly_satisfied[:3]

Unnamed: 0,GenderSelect,Country,Age,EmploymentStatus,StudentStatus,LearningDataScience,CodeWriter,CareerSwitcher,CurrentJobTitleSelect,TitleFit,CurrentEmployerType,MLToolNextYearSelect,MLMethodNextYearSelect,LanguageRecommendationSelect,PublicDatasetsSelect,LearningPlatformSelect,LearningPlatformUsefulnessArxiv,LearningPlatformUsefulnessBlogs,LearningPlatformUsefulnessCollege,LearningPlatformUsefulnessCompany,LearningPlatformUsefulnessConferences,LearningPlatformUsefulnessFriends,LearningPlatformUsefulnessKaggle,LearningPlatformUsefulnessNewsletters,LearningPlatformUsefulnessCommunities,LearningPlatformUsefulnessDocumentation,LearningPlatformUsefulnessCourses,LearningPlatformUsefulnessProjects,LearningPlatformUsefulnessPodcasts,LearningPlatformUsefulnessSO,LearningPlatformUsefulnessTextbook,LearningPlatformUsefulnessTradeBook,LearningPlatformUsefulnessTutoring,LearningPlatformUsefulnessYouTube,BlogsPodcastsNewslettersSelect,LearningDataScienceTime,JobSkillImportanceBigData,JobSkillImportanceDegree,JobSkillImportanceStats,JobSkillImportanceEnterpriseTools,JobSkillImportancePython,JobSkillImportanceR,JobSkillImportanceSQL,JobSkillImportanceKaggleRanking,JobSkillImportanceMOOC,JobSkillImportanceVisualizations,JobSkillImportanceOtherSelect1,JobSkillImportanceOtherSelect2,JobSkillImportanceOtherSelect3,CoursePlatformSelect,HardwarePersonalProjectsSelect,TimeSpentStudying,ProveKnowledgeSelect,DataScienceIdentitySelect,FormalEducation,MajorSelect,Tenure,PastJobTitlesSelect,FirstTrainingSelect,LearningCategorySelftTaught,LearningCategoryOnlineCourses,LearningCategoryWork,LearningCategoryUniversity,LearningCategoryKaggle,LearningCategoryOther,MLSkillsSelect,MLTechniquesSelect,ParentsEducation,EmployerIndustry,EmployerSize,EmployerSizeChange,EmployerMLTime,EmployerSearchMethod,UniversityImportance,JobFunctionSelect,WorkHardwareSelect,WorkDataTypeSelect,WorkProductionFrequency,WorkDatasetSize,WorkAlgorithmsSelect,WorkToolsSelect,WorkToolsFrequencyAmazonML,WorkToolsFrequencyAWS,WorkToolsFrequencyAngoss,WorkToolsFrequencyC,WorkToolsFrequencyCloudera,WorkToolsFrequencyDataRobot,WorkToolsFrequencyFlume,WorkToolsFrequencyGCP,WorkToolsFrequencyHadoop,WorkToolsFrequencyIBMCognos,WorkToolsFrequencyIBMSPSSModeler,WorkToolsFrequencyIBMSPSSStatistics,WorkToolsFrequencyIBMWatson,WorkToolsFrequencyImpala,WorkToolsFrequencyJava,WorkToolsFrequencyJulia,WorkToolsFrequencyJupyter,WorkToolsFrequencyKNIMECommercial,WorkToolsFrequencyKNIMEFree,WorkToolsFrequencyMathematica,WorkToolsFrequencyMATLAB,WorkToolsFrequencyAzure,WorkToolsFrequencyExcel,WorkToolsFrequencyMicrosoftRServer,WorkToolsFrequencyMicrosoftSQL,WorkToolsFrequencyMinitab,WorkToolsFrequencyNoSQL,WorkToolsFrequencyOracle,WorkToolsFrequencyOrange,WorkToolsFrequencyPerl,WorkToolsFrequencyPython,WorkToolsFrequencyQlik,WorkToolsFrequencyR,WorkToolsFrequencyRapidMinerCommercial,WorkToolsFrequencyRapidMinerFree,WorkToolsFrequencySalfrod,WorkToolsFrequencySAPBusinessObjects,WorkToolsFrequencySASBase,WorkToolsFrequencySASEnterprise,WorkToolsFrequencySASJMP,WorkToolsFrequencySpark,WorkToolsFrequencySQL,WorkToolsFrequencyStan,WorkToolsFrequencyStatistica,WorkToolsFrequencyTableau,WorkToolsFrequencyTensorFlow,WorkToolsFrequencyTIBCO,WorkToolsFrequencyUnix,WorkToolsFrequencySelect1,WorkToolsFrequencySelect2,WorkFrequencySelect3,WorkMethodsSelect,WorkMethodsFrequencyA/B,WorkMethodsFrequencyAssociationRules,WorkMethodsFrequencyBayesian,WorkMethodsFrequencyCNNs,WorkMethodsFrequencyCollaborativeFiltering,WorkMethodsFrequencyCross-Validation,WorkMethodsFrequencyDataVisualization,WorkMethodsFrequencyDecisionTrees,WorkMethodsFrequencyEnsembleMethods,WorkMethodsFrequencyEvolutionaryApproaches,WorkMethodsFrequencyGANs,WorkMethodsFrequencyGBM,WorkMethodsFrequencyHMMs,WorkMethodsFrequencyKNN,WorkMethodsFrequencyLiftAnalysis,WorkMethodsFrequencyLogisticRegression,WorkMethodsFrequencyMLN,WorkMethodsFrequencyNaiveBayes,WorkMethodsFrequencyNLP,WorkMethodsFrequencyNeuralNetworks,WorkMethodsFrequencyPCA,WorkMethodsFrequencyPrescriptiveModeling,WorkMethodsFrequencyRandomForests,WorkMethodsFrequencyRecommenderSystems,WorkMethodsFrequencyRNNs,WorkMethodsFrequencySegmentation,WorkMethodsFrequencySimulation,WorkMethodsFrequencySVMs,WorkMethodsFrequencyTextAnalysis,WorkMethodsFrequencyTimeSeriesAnalysis,WorkMethodsFrequencySelect1,WorkMethodsFrequencySelect2,WorkMethodsFrequencySelect3,TimeGatheringData,TimeModelBuilding,TimeProduction,TimeVisualizing,TimeFindingInsights,TimeOtherSelect,AlgorithmUnderstandingLevel,WorkChallengesSelect,WorkChallengeFrequencyPolitics,WorkChallengeFrequencyUnusedResults,WorkChallengeFrequencyUnusefulInstrumenting,WorkChallengeFrequencyDeployment,WorkChallengeFrequencyDirtyData,WorkChallengeFrequencyExplaining,WorkChallengeFrequencyPass,WorkChallengeFrequencyIntegration,WorkChallengeFrequencyTalent,WorkChallengeFrequencyDataFunds,WorkChallengeFrequencyDomainExpertise,WorkChallengeFrequencyML,WorkChallengeFrequencyTools,WorkChallengeFrequencyExpectations,WorkChallengeFrequencyITCoordination,WorkChallengeFrequencyHiringFunds,WorkChallengeFrequencyPrivacy,WorkChallengeFrequencyScaling,WorkChallengeFrequencyEnvironments,WorkChallengeFrequencyClarity,WorkChallengeFrequencyDataAccess,WorkChallengeFrequencyOtherSelect,WorkDataVisualizations,WorkInternalVsExternalTools,WorkMLTeamSeatSelect,WorkDatasets,WorkDatasetsChallenge,WorkDataStorage,WorkDataSharing,WorkDataSourcing,WorkCodeSharing,RemoteWork,CompensationAmount,CompensationCurrency,SalaryChange,JobSatisfaction,JobSearchResource,JobHuntTime,JobFactorLearning,JobFactorSalary,JobFactorOffice,JobFactorLanguages,JobFactorCommute,JobFactorManagement,JobFactorExperienceLevel,JobFactorDepartment,JobFactorTitle,JobFactorCompanyFunding,JobFactorImpact,JobFactorRemote,JobFactorIndustry,JobFactorLeaderReputation,JobFactorDiversity,JobFactorPublishingOpportunity
3,Male,United States,56.0,"Independent contractor, freelancer, or self-em...",,,Yes,,Operations Research Practitioner,Poorly,Self-employed,TensorFlow,Neural Nets,Python,I collect my own data (e.g. web-scraping),"Blogs,College/University,Conferences,Friends n...",,Very useful,Very useful,,Very useful,Very useful,,,,Very useful,Very useful,Very useful,,,,,,,KDnuggets Blog,,,,,,,,,,,,,,,,,,,Yes,Master's degree,Mathematics or statistics,More than 10 years,"Business Analyst,Operations Research Practitio...",University courses,30.0,0.0,40.0,30.0,0.0,0.0,"Recommendation Engines,Reinforcement learning,...","Bayesian Techniques,Decision Trees - Gradient ...",High school,Mix of fields,,,,,Very important,Analyze and understand data to influence produ...,"Laptop + Cloud service (AWS, Azure, GCE ...)",Relational data,Always,1GB,"Bayesian Techniques,Decision Trees,Random Fore...","Amazon Machine Learning,Amazon Web services,Cl...",Rarely,Often,,,Rarely,,,,Rarely,,,,,Rarely,Rarely,,,,,Rarely,Rarely,,Sometimes,,Rarely,,Rarely,,,,Rarely,,Rarely,,,,,Sometimes,,Rarely,,Often,,,Rarely,,,,,,,"A/B Testing,Bayesian Techniques,Data Visualiza...",Sometimes,,Sometimes,,,,Sometimes,Often,Sometimes,,,,,,,Sometimes,Often,Sometimes,,Sometimes,,,Sometimes,,,,Often,,,Often,,,,50.0,20.0,0.0,10.0,20.0,0.0,Enough to refine and innovate on the algorithm,Company politics / Lack of management/financia...,Often,Often,Often,Often,Often,Often,,Often,Often,Often,Most of the time,Often,Often,Often,,Often,Often,Often,Often,Often,Often,,100% of projects,Entirely internal,Standalone Team,Electricity data sets from government and states,"Everything is custom, there is never a tool th...","Column-oriented relational (e.g. KDB/MariaDB),...","Company Developed Platform,Email",,Generic cloud file sharing software (Dropbox/B...,,250000.0,USD,Has increased 20% or more,10 - Highly Satisfied,,,,,,,,,,,,,,,,,,
60,Male,Canada,34.0,"Independent contractor, freelancer, or self-em...",,,Yes,,Other,Fine,Self-employed,,,Python,,"Arxiv,Blogs,Kaggle,Online courses,Personal Pro...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Yes,Doctoral degree,Physics,More than 10 years,Researcher,University courses,0.0,35.0,0.0,65.0,0.0,0.0,"Time Series,Unsupervised Learning",Bayesian Techniques,A bachelor's degree,Internet-based,,,,,Very important,,Basic laptop (Macbook),Text data,,,"Bayesian Techniques,Regression/Logistic Regres...","C/C++,Jupyter notebooks,MATLAB/Octave,NoSQL,Py...",,,,Sometimes,,,,,,,,,,,,,Sometimes,,,,Most of the time,,,,,,Sometimes,,,,Sometimes,,,,,,,,,,,Sometimes,,,,,,,,,,"Data Visualization,kNN and Other Clustering,Ti...",,,,,,,Most of the time,,,,,,,Sometimes,,,,,,,,,,,,,,,,Often,,,,0.0,0.0,0.0,0.0,0.0,0.0,"Enough to code it again from scratch, albeit i...","Dirty data,Explaining data science to others,L...",,,,,Most of the time,Sometimes,,,,,,,,,,,,,,,,,76-99% of projects,,,,,,,,,,,,,10 - Highly Satisfied,,,,,,,,,,,,,,,,,,
77,Male,Israel,29.0,"Independent contractor, freelancer, or self-em...",,,Yes,,Data Scientist,Perfectly,Self-employed,,Deep learning,Python,Dataset aggregator/platform (i.e. Socrata/Kagg...,"Arxiv,Kaggle,Official documentation,Online cou...",Somewhat useful,,,,,,Very useful,,,Very useful,Very useful,,,Very useful,,,,Very useful,"FastML Blog,KDnuggets Blog,No Free Hunch Blog",,,,,,,,,,,,,,,,,,,,Master's degree,Computer Science,3 to 5 years,"Data Scientist,Software Developer/Software Eng...",University courses,80.0,0.0,0.0,0.0,20.0,0.0,"Recommendation Engines,Supervised Machine Lear...","Bayesian Techniques,Ensemble Methods",A master's degree,Financial,,,,,Very important,Analyze and understand data to influence produ...,Laptop or Workstation and private datacenters,Relational data,Always,1GB,"Decision Trees,Ensemble Methods,Random Forests","Java,Python,Tableau",,,,,,,,,,,,,,,Most of the time,,,,,,,,,,,,,,,,Most of the time,,,,,,,,,,,,,,Rarely,,,,,,,"A/B Testing,Bayesian Techniques,Cross-Validati...",Sometimes,,Often,,,Most of the time,Most of the time,Most of the time,Most of the time,,,,,,,Most of the time,,,,,,,Most of the time,,,,,,,,,,,30.0,30.0,10.0,20.0,10.0,0.0,"Enough to code it again from scratch, albeit i...","Dirty data,Explaining data science to others,L...",,,,,Sometimes,Often,,,,Most of the time,,,,,,,,,,,,,76-99% of projects,Entirely external,Standalone Team,,,Flat files not in a database or cache (e.g. CS...,I don't typically share data,,Bitbucket,Most of the time,,,I do not want to share information about my sa...,10 - Highly Satisfied,,,,,,,,,,,,,,,,,,


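What if I want to see the people who are highly satisfied and who are Pythonistas? Here we use the `LanguageRecommendationSelect` column, that is, the people who recommend Python.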

In [23]:
highly_satisfied = multiple_choice['JobSatisfaction'] == '10 - Highly Satisfied'
pythonist = multiple_choice['LanguageRecommendationSelect'] == 'Python'
highly_satisfied_and_pythonist = multiple_choice[highly_satisfied & pythonist]

In [24]:
highly_satisfied_and_pythonist.shape

(319, 228)

What if we tried it with age? Let's look only at those who are under 30.

In [25]:
highly_satisfied = multiple_choice['JobSatisfaction'] == '10 - Highly Satisfied'
age = multiple_choice['Age'] < 30.0
highly_satisfied_and_age = multiple_choice[highly_satisfied & age]

In [26]:
highly_satisfied_and_age.shape

(171, 228)
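
The two filters above follow the same pattern: build one boolean mask per condition and combine them with `&`. A minimal sketch on a toy dataframe (the values are made up for illustration and are not from the survey):

```python
import pandas as pd

# Toy data standing in for the survey (values are made up for illustration)
toy = pd.DataFrame({
    "JobSatisfaction": ["10 - Highly Satisfied", "5", "10 - Highly Satisfied", None],
    "Age": [25.0, 41.0, 35.0, 22.0],
})

is_satisfied = toy["JobSatisfaction"] == "10 - Highly Satisfied"
is_young = toy["Age"] < 30.0

# `&` combines the masks element-wise
both = toy[is_satisfied & is_young]

# Written inline, each comparison needs its own parentheses
same = toy[(toy["JobSatisfaction"] == "10 - Highly Satisfied") & (toy["Age"] < 30.0)]

print(both.shape)
```

Rows with a missing value simply compare as `False`, so they drop out of the selection.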

Which languages do the highly satisfied folks recommend?

In [27]:
multiple_choice[highly_satisfied]['LanguageRecommendationSelect'].value_counts()

Python      319
R           151
SQL          28
C/C++/C#     20
Matlab       11
Other         9
Scala         9
Java          7
Julia         5
SAS           4
Stata         3
Haskell       2
Name: LanguageRecommendationSelect, dtype: int64

In [28]:
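# Count of each recommended language among the highly satisfied respondents
highly_satisfied_languages = multiple_choice[highly_satisfied]['LanguageRecommendationSelect'].value_counts()
# Number of highly satisfied respondents (the denominator for the percentages below)
language_counts = multiple_choice['JobSatisfaction'][highly_satisfied].notnull().sum()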

In [29]:
language_counts

589

In [30]:
(highly_satisfied_languages / language_counts) * 100

Python      54.159593
R           25.636672
SQL          4.753820
C/C++/C#     3.395586
Matlab       1.867572
Other        1.528014
Scala        1.528014
Java         1.188455
Julia        0.848896
SAS          0.679117
Stata        0.509338
Haskell      0.339559
Name: LanguageRecommendationSelect, dtype: float64
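
Note that the percentages above are computed over all 589 highly satisfied respondents, while the language counts add up to 568, so the shares do not sum to 100. If the share among people who actually answered is what you want, `value_counts(normalize=True)` may be a convenient alternative. A small sketch with made-up answers:

```python
import pandas as pd

langs = pd.Series(["Python", "Python", "R", None])  # made-up answers

# Share over ALL rows, including people who skipped the question
share_all = langs.value_counts() / len(langs) * 100

# Share over people who answered (value_counts ignores NaN by default)
share_answered = langs.value_counts(normalize=True) * 100

print(share_all)
print(share_answered)
```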

What if I want to sort these values, from smallest to largest?

In [31]:
((highly_satisfied_languages / language_counts) * 100).sort_values()

Haskell      0.339559
Stata        0.509338
SAS          0.679117
Julia        0.848896
Java         1.188455
Other        1.528014
Scala        1.528014
Matlab       1.867572
C/C++/C#     3.395586
SQL          4.753820
R           25.636672
Python      54.159593
Name: LanguageRecommendationSelect, dtype: float64
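
`sort_values()` sorts in ascending order by default. A short sketch of the other common orderings, using rounded, illustrative numbers:

```python
import pandas as pd

shares = pd.Series({"Python": 54.2, "R": 25.6, "SQL": 4.8})  # rounded, illustrative

ascending = shares.sort_values()                  # smallest first (the default)
descending = shares.sort_values(ascending=False)  # largest first
top2 = shares.nlargest(2)                         # only the two largest values

print(descending)
```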

### Challenge 1

Take the column with the fewest filled-in values. Which country has the most respondents who answered it?

Hint: You will need to sort the columns by their number of nulls (or non-nulls) and then look at the country of those respondents.

![arrested_panda](https://media.giphy.com/media/N6funLtVsHW0g/giphy.gif)

In [32]:
multiple_choice.isnull().sum().sort_values(ascending=False)

WorkToolsFrequencyAngoss                       16694
WorkToolsFrequencySalfrod                      16684
WorkToolsFrequencyKNIMECommercial              16680
WorkMethodsFrequencySelect2                    16677
WorkToolsFrequencyStatistica                   16674
WorkToolsFrequencyDataRobot                    16653
WorkToolsFrequencyRapidMinerCommercial         16642
WorkFrequencySelect3                           16635
WorkToolsFrequencySAPBusinessObjects           16625
WorkMethodsFrequencySelect3                    16623
JobSkillImportanceOtherSelect3                 16603
WorkToolsFrequencySASJMP                       16601
WorkToolsFrequencyOrange                       16594
WorkToolsFrequencySelect2                      16581
WorkToolsFrequencyTIBCO                        16577
WorkToolsFrequencyFlume                        16575
WorkToolsFrequencyMinitab                      16572
WorkToolsFrequencyStan                         16564
WorkToolsFrequencyIBMCognos                   

In [33]:
multiple_choice[multiple_choice['WorkToolsFrequencyAngoss'].notnull()]['Country'].value_counts()

United States     8
India             3
Mexico            2
Italy             2
Canada            2
Singapore         1
Australia         1
Egypt             1
Other             1
United Kingdom    1
Name: Country, dtype: int64
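
In the solution above the column name was read off the sorted output by eye. One possible way to pick it programmatically is `idxmax()`, sketched here on a toy dataframe (the column names `ToolA` and `ToolB` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Brazil", "India", "Brazil", "Canada"],
    "ToolA": ["Often", None, "Rarely", None],
    "ToolB": [None, None, "Often", None],
})

null_counts = toy.drop(columns="Country").isnull().sum()
emptiest = null_counts.idxmax()          # label of the column with the most nulls
answered = toy[toy[emptiest].notnull()]  # rows where that column was filled in

print(emptiest)
print(answered["Country"].value_counts())
```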

## Selecting by index

What if I want to get the values of a specific row of the dataframe?

In [34]:
multiple_choice.iloc[10]

GenderSelect                                                                              Female
Country                                                                                   Russia
Age                                                                                           20
EmploymentStatus                                          Not employed, and not looking for work
StudentStatus                                                                                Yes
LearningDataScience                            Yes, I'm focused on learning mostly data scien...
CodeWriter                                                                                   NaN
CareerSwitcher                                                                               NaN
CurrentJobTitleSelect                                                                        NaN
TitleFit                                                                                     NaN
CurrentEmployerType           

I can also look at a single column without writing its name, using only its position

In [35]:
multiple_choice.iloc[:,0]

0        Non-binary, genderqueer, or gender non-conforming
1                                                   Female
2                                                     Male
3                                                     Male
4                                                     Male
5                                                     Male
6                                                     Male
7                                                   Female
8                                                   Female
9                                                     Male
10                                                  Female
11                                                    Male
12                                                    Male
13                                                    Male
14                                                    Male
15                                                    Male
16                                                    Ma

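More details about `loc`, `iloc` and `ix` can be found at this [link](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/). Note that `ix` was deprecated in pandas 0.20 and removed in pandas 1.0, so prefer `loc` and `iloc`.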
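
As a quick illustration of the difference, `iloc` selects by position and `loc` selects by label. A minimal sketch on a toy dataframe whose index labels deliberately differ from the row positions:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Country": ["Brazil", "Russia", "India"], "Age": [30, 20, 28]},
    index=[10, 20, 30],  # labels deliberately differ from positions
)

by_position = toy.iloc[1]          # second row, whatever its label
by_label = toy.loc[20]             # row whose index label is 20
first_col = toy.iloc[:, 0]         # first column by position
one_cell = toy.loc[30, "Country"]  # single value by row label and column name

print(by_position)
```

In `multiple_choice` the index is the default 0, 1, 2, ... range, so positions and labels happen to coincide there.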

### Challenge 2

Could it be that we have such a huge number of nulls in this dataset because people skipped the multiple-choice questions and answered in free form instead?

To test this hypothesis we need to load the other dataset, which contains the free-form answers, and combine (at least one of the variables from) the two datasets. I chose `DataScienceIdentity`.

![challenge_panda](https://media.giphy.com/media/K9z3im98oo9Ve/giphy.gif)

In [36]:
free_responses = pd.read_csv('kaggle-survey-2017/freeformResponses.csv')



In [37]:
free_responses.head()

Unnamed: 0,GenderFreeForm,KaggleMotivationFreeForm,CurrentJobTitleFreeForm,MLToolNextYearFreeForm,MLMethodNextYearFreeForm,LanguageRecommendationFreeForm,PublicDatasetsFreeForm,PersonalProjectsChallengeFreeForm,LearningPlatformCommunityFreeForm,LearningPlatformFreeForm1,LearningPlatformFreeForm2,LearningPlatformFreeForm3,LearningPlatformUsefulnessCommunitiesFreeForm,LearningPlatformUsefulnessFreeForm1Select,LearningPlatformUsefulnessFreeForm1SelectFreeForm,LearningPlatformUsefulnessFreeForm2Select,LearningPlatformUsefulnessFreeForm2SelectFreeForm,LearningPlatformUsefulnessFreeForm3Select,LearningPlatformUsefulnessFreeForm3SelectFreeForm,BlogsPodcastsNewslettersFreeForm,JobSkillImportanceOtherSelect1FreeForm,JobSkillImportanceOtherSelect2FreeForm,JobSkillImportanceOtherSelect3FreeForm,CoursePlatformFreeForm,HardwarePersonalProjectsFreeForm,ProveKnowledgeFreeForm,ImpactfulAlgorithmFreeForm,InterestingProblemFreeForm,DataScienceIdentityFreeForm,MajorFreeForm,PastJobTitlesFreeForm,FirstTrainingFreeForm,LearningCategoryOtherFreeForm,MLSkillsFreeForm,MLTechniquesFreeform,EmployerIndustryOtherFreeForm,EmployerSearchMethodOtherFreeForm,JobFunctionFreeForm,WorkHardwareFreeForm,WorkDataTypeFreeForm,WorkLibrariesFreeForm,WorkAlgorithmsFreeForm,WorkToolsFreeForm1,WorkToolsFreeForm2,WorkToolsFreeForm3,WorkToolsFrequencySelect1FreeForm,WorkFrequencySelect2FreeForm,WorkFrequencySelect3FreeForm,WorkMethodsFreeForm1,WorkMethodsFreeForm2,WorkMethodsFreeForm3,WorkMethodsFrequencySelect1FreeForm,WorkMethodsFrequencySelect2FreeForm,WorkMethodsFrequencySelect3FreeForm,TimeOtherSelectFreeForm,WorkChallengesFreeForm,WorkChallengeFrequencyOtherFreeForm,WorkMLTeamSeatFreeForm,WorkDataStorageFreeForm,WorkCodeSharingFreeForm,SalaryChangeFreeForm,JobSearchResourceFreeForm
0,,,,,,,,Data manipulation,,,,,,,,,,,,,,,,,,,"It's not deployed yet, but hopefully a computa...",,,,,,,,,,,,,,,"Clustering Methods, association rules",,,,,,,,,,,,,,,,,,,,
1,,,,,,,,I can't find time to practice consistently,,,,,,,,,,,,,,,,,,,Sentiment analysis of twitter data,,,,,,,,,,,,,,,,Stata,,,,,,,,,,,,,,,,,,,
2,,,teacher,,,,,,,Meetups,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,,,,,,,,Connectivity/data fusion,,,,,,,,,,,,,,,,Udemy,,,,,I use mid-level data science paired with high-...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,,,,,,,,,,,,svm,,,,,,,,,,,,,,python;scikit-learn; panda; numpy;,,,,,,,,,,,,,,,,,,,,,


In [38]:
free_responses.shape

(16716, 62)

In [39]:
free_responses.isnull().sum().sort_values()

WorkLibrariesFreeForm                                12221
InterestingProblemFreeForm                           12261
ImpactfulAlgorithmFreeForm                           12369
PersonalProjectsChallengeFreeForm                    13209
DataScienceIdentityFreeForm                          14299
PastJobTitlesFreeForm                                14622
CurrentJobTitleFreeForm                              15573
BlogsPodcastsNewslettersFreeForm                     15603
WorkMLTeamSeatFreeForm                               15788
EmployerIndustryOtherFreeForm                        15790
MajorFreeForm                                        15907
MLTechniquesFreeform                                 15963
KaggleMotivationFreeForm                             15970
MLSkillsFreeForm                                     16050
WorkToolsFreeForm1                                   16052
WorkDataTypeFreeForm                                 16055
EmployerSearchMethodOtherFreeForm                    160

In [40]:
free_index = free_responses['DataScienceIdentityFreeForm'].notnull()

In [41]:
identity_check = pd.DataFrame({
    'IdentityFree': free_responses[free_index]['DataScienceIdentityFreeForm'],
    'IdentitySelect': multiple_choice[free_index]['DataScienceIdentitySelect'],
})

In [42]:
identity_check.shape

(2417, 2)

In [43]:
identity_check.isnull().sum()

IdentityFree        0
IdentitySelect    583
dtype: int64

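Apparently that is not what happened... Keep in mind, though, that according to the dataset description the free-form responses are randomized within each column, so a row in `freeformResponses.csv` does not necessarily belong to the same respondent as the same row in `multipleChoiceResponses.csv`.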
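
The same check can also be sketched with `join`, which aligns two dataframes on their index. The toy data below is made up, and the sketch assumes that the same row in both frames belongs to the same respondent, which the dataset description says is not guaranteed for the free-form file:

```python
import pandas as pd

# Made-up answers; both frames use the default 0..n-1 index
select = pd.DataFrame({"IdentitySelect": ["Yes", None, "No", None]})
free = pd.DataFrame({"IdentityFree": [None, "sort of", "analyst", None]})

# join aligns the two frames row by row on the shared index
both = select.join(free)

# Rows where the multiple-choice answer is missing but a free-form answer exists
only_free = (both["IdentitySelect"].isnull() & both["IdentityFree"].notnull()).sum()

print(both)
print(only_free)
```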

## Modifying the original dataset

In [44]:
df = multiple_choice.copy()

In [45]:
df['LearningDataScience'].value_counts()

Yes, I'm focused on learning mostly data science skills                  800
Yes, but data science is a small part of what I'm focused on learning    429
No, I am not focused on learning data science skills                      55
Name: LearningDataScience, dtype: int64

In [46]:
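def replace_value(value):
    # Receives one cell value at a time (not a whole row) and maps it to a short label.
    # Anything that matches no branch, such as NaN, falls through and returns None.
    if value == "Yes, I'm focused on learning mostly data science skills":
        return "yes"
    elif value == "Yes, but data science is a small part of what I'm focused on learning":
        return "so so"
    elif value == "No, I am not focused on learning data science skills":
        return "no"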

In [47]:
df['LearningDataScienceSimple'] = df['LearningDataScience'].apply(replace_value)

Now we can see the new values of this field

In [48]:
df['LearningDataScienceSimple'].value_counts()

yes      800
so so    429
no        55
Name: LearningDataScienceSimple, dtype: int64
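
For a plain value-to-value replacement like this one, `Series.map` with a dictionary is a possible alternative to writing a function with `if`/`elif` branches. A sketch on a small made-up series (the dictionary name `simple_labels` is just illustrative):

```python
import pandas as pd

answers = pd.Series([
    "Yes, I'm focused on learning mostly data science skills",
    "No, I am not focused on learning data science skills",
    None,
])

simple_labels = {
    "Yes, I'm focused on learning mostly data science skills": "yes",
    "Yes, but data science is a small part of what I'm focused on learning": "so so",
    "No, I am not focused on learning data science skills": "no",
}

# Values without an entry in the dictionary (including missing ones) become NaN
simplified = answers.map(simple_labels)

print(simplified)
```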

In [49]:
df.shape

(16716, 229)

And now that the other, overly complex column is no longer needed, we can drop it

In [50]:
df.drop(['LearningDataScience'], axis=1, inplace=True)

In [51]:
df.shape

(16716, 228)

From the number of rows in the dataset we can see that `value_counts()` does not return null values, and we have A LOT of null values in this column. We can use a pandas method to replace the NAs with a category of our own.

In [52]:
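# Assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas
df['LearningDataScienceSimple'] = df['LearningDataScienceSimple'].fillna("did not answer the question")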

In [53]:
df['LearningDataScienceSimple'].value_counts()

did not answer the question    15432
yes                              800
so so                            429
no                                55
Name: LearningDataScienceSimple, dtype: int64
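
If the goal is only to see how many values are missing, without changing the column, `value_counts(dropna=False)` may be enough. A minimal sketch with made-up answers:

```python
import pandas as pd

answers = pd.Series(["yes", "no", None, None, "yes", None])

without_missing = answers.value_counts()           # NaN rows are left out
with_missing = answers.value_counts(dropna=False)  # NaN gets its own row

print(with_missing)
```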

Another way to modify the dataset is to apply a small anonymous (inline) function to a column; for that we will use `lambda`

For example: what if I want to update the participants' ages? The dataset was collected in 2017 and it is already 2018

In [54]:
df['NewAge'] = df['Age'].apply(lambda x: x + 1)

Did it work? Let's look at the first 5 records, with the 'Age' and 'NewAge' columns side by side

In [55]:
df[['Age','NewAge']][:5]

Unnamed: 0,Age,NewAge
0,,
1,30.0,31.0
2,28.0,29.0
3,56.0,57.0
4,38.0,39.0
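
For simple arithmetic like this, pandas can also operate on the whole column at once, without `apply`. A short sketch showing that both forms give the same result, missing values included:

```python
import pandas as pd

ages = pd.Series([None, 30.0, 28.0])  # made-up ages, one of them missing

via_apply = ages.apply(lambda x: x + 1)  # calls the lambda once per element
vectorized = ages + 1                    # same result, computed on the whole column

print(vectorized)
```

`apply` with a `lambda` remains useful when the transformation cannot be expressed as a column-wise operation.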


With the nulls in `LearningDataScienceSimple` filled in, the proportion of people who did not answer is now much clearer o/

### Challenge 3

Split the dataset according to who was asked each question. To do that you will need to load the `schema.csv` dataset.

Since some of the resulting datasets would be very small, I suggest you use your newly acquired knowledge and create 4 distinct datasets.

![panda_playground](https://media.giphy.com/media/ieaUdBJJC19uw/giphy.gif)

In [56]:
schema = pd.read_csv('kaggle-survey-2017/schema.csv')

In [57]:
schema.head()

Unnamed: 0,Column,Question,Asked
0,GenderSelect,Select your gender identity. - Selected Choice,All
1,GenderFreeForm,Select your gender identity. - A different ide...,All
2,Country,Select the country you currently live in.,All
3,Age,What's your age?,All
4,EmploymentStatus,What's your current employment status?,All


In [58]:
schema.Asked.value_counts()

CodingWorker       161
All                 70
Learners            41
Worker1              6
CodingWorker-NC      5
Worker               2
OnlineLearners       2
Non-worker           2
Non-switcher         1
Name: Asked, dtype: int64

In [59]:
coding_worker_column = schema[schema['Asked'] == 'CodingWorker']['Column']
all_column = schema[schema['Asked'] == 'All']['Column']
learners_column = schema[schema['Asked'] == 'Learners']['Column']
others_columns = schema[~schema['Asked'].isin(['CodingWorker', 'All', 'Learners'])]['Column']

In [60]:
all_multiple_selection_cols = [c for c in all_column.values if 'freeform' not in c.lower()]
coding_worker_multiple_selection_cols = [c for c in coding_worker_column.values if 'freeform' not in c.lower()]
learners_multiple_selection_cols = [c for c in learners_column.values if 'freeform' not in c.lower()]
others_multiple_selection_cols = [c for c in others_columns.values if 'freeform' not in c.lower()]

In [61]:
others_multiple_selection_cols

['StudentStatus',
 'LearningDataScience',
 'CodeWriter',
 'CareerSwitcher',
 'CurrentJobTitleSelect',
 'TitleFit',
 'CurrentEmployerType',
 'CoursePlatformSelect',
 'EmployerIndustry',
 'EmployerSize',
 'EmployerSizeChange',
 'EmployerMLTime',
 'EmployerSearchMethod']

In [63]:
all_multiple_selection = multiple_choice[all_multiple_selection_cols]
coding_worker_multiple_selection = multiple_choice[coding_worker_multiple_selection_cols]
learners_multiple_selection = multiple_choice[learners_multiple_selection_cols]
others_multiple_selection = multiple_choice[others_multiple_selection_cols]

In [64]:
print(all_multiple_selection.shape)
print(coding_worker_multiple_selection.shape)
print(learners_multiple_selection.shape)
print(others_multiple_selection.shape)

(16716, 43)
(16716, 137)
(16716, 35)
(16716, 13)


Remember that we threw away the 'free_form' columns, so there will always be fewer columns here than the counts per respondent type we saw earlier.
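
The four column lists were built one by one above. As a possible alternative, the same grouping can be sketched with `groupby`, producing a dictionary of column lists. The schema below is a toy with made-up rows in the same `Column`/`Asked` layout, and the label `"Others"` is just an illustrative name:

```python
import pandas as pd

# Toy schema with made-up rows in the same layout as schema.csv
schema = pd.DataFrame({
    "Column": ["Age", "GenderFreeForm", "WorkDatasets", "CodeWriter"],
    "Asked": ["All", "All", "CodingWorker", "Worker1"],
})

main_groups = ["CodingWorker", "All", "Learners"]

# Keep only the multiple-choice columns
not_freeform = ~schema["Column"].str.lower().str.contains("freeform")

# Anything outside the three big groups is pooled under "Others"
group = schema["Asked"].where(schema["Asked"].isin(main_groups), "Others")

cols_by_group = (
    schema[not_freeform]
    .groupby(group[not_freeform])["Column"]
    .apply(list)
    .to_dict()
)

print(cols_by_group)
```

Each list in `cols_by_group` could then be used to select columns from `multiple_choice`, just as the individual lists are used above.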

In [68]:
all_multiple_selection.to_csv('all_multiple_selection.csv')
coding_worker_multiple_selection.to_csv('coding_worker_multiple_selection.csv')
learners_multiple_selection.to_csv('learners_multiple_selection.csv')
others_multiple_selection.to_csv('others_multiple_selection.csv')
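
By default `to_csv` also writes the row index as a first, unnamed column, which shows up as `Unnamed: 0` when the file is read back. If the index carries no information, passing `index=False` avoids that. A small sketch that writes to a string instead of a file:

```python
import io
import pandas as pd

toy = pd.DataFrame({"Country": ["Brazil", "India"], "Age": [30.0, 28.0]})

with_index = toy.to_csv()                # default: row labels become a first, unnamed column
without_index = toy.to_csv(index=False)  # only the data columns

# Reading the default output back brings the old index in as a regular column
reloaded = pd.read_csv(io.StringIO(with_index))

print(reloaded.columns.tolist())
```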