# Keystone 0: Data Presentation - StackOverflow Dev Survey 

Initial data presentation for Thinkful DS program. SOF survey analysis of DS attitudes

### Data Sources
- file1 : Description of where this file came from

### Changes
- 08-06-2019 : Started project

In [2]:
import pandas as pd
from pathlib import Path
from datetime import datetime

import numpy as np
import matplotlib.pyplot as plt

## Description:

The data set is the full, cleaned results of the 2019 Stack Overflow Developer Survey are in the external subdirectory. Free response submissions and personally identifying information have been removed from the results to protect the privacy of respondents. There are three files:

1. survey_results_public.csv - CSV file with main survey results, one respondent per row and one column per answer
2. survey_results_schema.csv - CSV file with survey schema, i.e., the questions that correspond to each column name
3. so_survey_2019.pdf - PDF file of survey instrument

The survey was fielded from January 23 to February 14, 2019. The median time spent on the survey for qualified responses was 23.3 minutes.

Respondents were recruited primarily through channels owned by Stack Overflow. The top 5 sources of respondents were onsite messaging, blog posts, email lists, Meta posts, banner ads, and social media posts. Since respondents were recruited in this way, highly engaged users on Stack Overflow were more likely to notice the links for the survey and click to begin it.

A local copy of survey_results_public.csv renamed survey2019.csv for further use.


## File Locations

In [3]:
today = datetime.today()
in_file = Path.cwd() / "data" / "raw" / "survey2019.csv"
summary_file = Path.cwd() / "data" / "processed" / f"summary_{today:%b-%d-%Y}.pkl"

In [4]:
df = pd.read_csv(in_file)

## Prepare the data

### Column Cleanup
- Drop columns which we don't want to anlayze
- Remove all leading and trailing spaces (not nescessary)
- Rename the columns for consistency (not nescessary)

In [5]:
df.head()

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
3,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
4,5,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,30.0,Man,No,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy


In [5]:
# Print shape of dataset
print('The dataset contains', np.shape(df)[0], 'rows and', np.shape(df)[1], 'columns.')

The dataset contains 88883 rows and 85 columns.


In [9]:
# Print column name
df.columns

Index(['Respondent', 'MainBranch', 'Hobbyist', 'OpenSourcer', 'OpenSource',
       'Employment', 'Country', 'Student', 'EdLevel', 'UndergradMajor',
       'EduOther', 'OrgSize', 'DevType', 'YearsCode', 'Age1stCode',
       'YearsCodePro', 'CareerSat', 'JobSat', 'MgrIdiot', 'MgrMoney',
       'MgrWant', 'JobSeek', 'LastHireDate', 'LastInt', 'FizzBuzz',
       'JobFactors', 'ResumeUpdate', 'CurrencySymbol', 'CurrencyDesc',
       'CompTotal', 'CompFreq', 'ConvertedComp', 'WorkWeekHrs', 'WorkPlan',
       'WorkChallenge', 'WorkRemote', 'WorkLoc', 'ImpSyn', 'CodeRev',
       'CodeRevHrs', 'UnitTests', 'PurchaseHow', 'PurchaseWhat',
       'LanguageWorkedWith', 'LanguageDesireNextYear', 'DatabaseWorkedWith',
       'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'WebFrameWorkedWith',
       'WebFrameDesireNextYear', 'MiscTechWorkedWith',
       'MiscTechDesireNextYear', 'DevEnviron', 'OpSys', 'Containers',
       'BlockchainOrg', 'BlockchainIs', 'BetterLife'

In [6]:
# Drop unnecessary columns
survey = df[['Respondent',
             'MainBranch',
             'OpenSourcer',
             'OpenSource',
             'Employment',
             'Country',
             'Student',
             'EdLevel',
             'UndergradMajor',    #categorize on this..
             'EduOther',
             'OrgSize',   #small med large
             'DevType',    # THIS IS THE ANCHOR.  data scientist+analyst+researcher? vs Dev, Designer, admin
             'YearsCode',
             'Age1stCode',
             'YearsCodePro',
             'CareerSat',
             'JobSat',
             'MgrIdiot',   # compare with manager for interst
             'MgrMoney',
             'MgrWant',
             'JobSeek',
             'LastHireDate',
             'LastInt',
             'FizzBuzz',
             'JobFactors',      #THIS ONE
             'ResumeUpdate',    #why career seeking
             #'CurrencySymbol',
             #'CurrencyDesc',   
             #'CompTotal',
             #'CompFreq',
             'ConvertedComp',
             'WorkWeekHrs',  # work 
             'WorkPlan',
             'WorkChallenge',
             'WorkRemote',
             'WorkLoc',
             'ImpSyn',    # comeptence!!!
             'CodeRev',
            #'CodeRevHrs',
            #'UnitTests',
            #'PurchaseHow',
            #'PurchaseWhat',
             'LanguageWorkedWith',  # choose only Python/R, c++/c & assembly, other
             'LanguageDesireNextYear',
            #'DatabaseWorkedWith',        # WHATS THE STRATEGY FOR THIS YEAR VS. NEXT YEAR?
            # 'DatabaseDesireNextYear',
             'PlatformWorkedWith',
             'PlatformDesireNextYear',
             #'WebFrameWorkedWith',
             #'WebFrameDesireNextYear',
             #'MiscTechWorkedWith',
             #'MiscTechDesireNextYear',
             'DevEnviron',
             'OpSys',                     #YES
             #'Containers',                
             #'BlockchainOrg',
             'BlockchainIs',
             'BetterLife',   #optimist/pessimist
             'ITperson',   #these two are interstging
             'OffOn',
             #'SocialMedia',
             'Extraversion',  #extrovert / introvert
             'ScreenName',    # yes
            #'SOVisit1st',
            #'SOVisitFreq',
            #'SOVisitTo',
            #'SOFindAnswer',
            #'SOTimeSaved',
            #'SOHowMuchTime',
            #'SOAccount',
            #'SOPartFreq',
             'SOJobs',
             'EntTeams',
             'SOComm',
             'WelcomeChange',
             'SONewContent',
             'Age',
             'Gender',
             'Trans',
             'Sexuality',
             'Ethnicity',
             'Dependents' ]]
            #'SurveyLength',
            #'SurveyEase'
        
    

In [7]:
survey.describe()

Unnamed: 0,Respondent,ConvertedComp,WorkWeekHrs,Age
count,88883.0,55823.0,64503.0,79210.0
mean,44442.0,127110.7,42.127197,30.336699
std,25658.456325,284152.3,37.28761,9.17839
min,1.0,0.0,1.0,1.0
25%,22221.5,25777.5,40.0,24.0
50%,44442.0,57287.0,40.0,29.0
75%,66662.5,100000.0,44.75,35.0
max,88883.0,2000000.0,4850.0,99.0


In [8]:
# Print shape of dataset
print('The dataset now contains', np.shape(survey)[0], 'rows and', np.shape(survey)[1], 'columns.')

The dataset now contains 88883 rows and 57 columns.


### Create other simplified variables 

- Simplify (categorize)
    - Education: Advanced, College, Grade
    - Age->Gen: GenZ,Milennial,GenX,Boomer,Silent
    - (Languages: procedural vs. Object?  classic vs. new? high vs. low level?)
      

In [9]:
survey['EdLevel'].unique()

array(['Primary/elementary school',
       'Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)',
       'Bachelor’s degree (BA, BS, B.Eng., etc.)',
       'Some college/university study without earning a degree',
       'Master’s degree (MA, MS, M.Eng., MBA, etc.)',
       'Other doctoral degree (Ph.D, Ed.D., etc.)', nan,
       'Associate degree', 'Professional degree (JD, MD, etc.)',
       'I never completed any formal education'], dtype=object)

In [10]:
# Simplify FormalEducation
def simplify_ed(edlevel):
    """Add a new field, FormalEducation_New, to dataframe, containing simplified FormalEducation values.
    Args:
    edlevel: Series       
    Returns:
    simplelevel: Series. Modified version of the input to put into the dataframe containing a new field, EdLevelSimple.
       
    """
    conditions_ed =[ (edlevel == 'Other doctoral degree (Ph.D, Ed.D., etc.)')
                        | (edlevel == 'Master’s degree (MA, MS, M.Eng., MBA, etc.)')
                        | (edlevel == 'Professional degree (JD, MD, etc.)'),   
                     (edlevel == 'Associate degree')
                        | (edlevel== 'Bachelor’s degree (BA, BS, B.Eng., etc.)')
                        | (edlevel == 'Some college/university study without earning a degree') ,
                     (edlevel == 'Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)') 
                        | (edlevel == 'Primary/elementary school')
                        | (edlevel == 'I never completed any formal education')
                   ]
    choices_ed = ['Advanced', 'College', 'Grade']
    simplelevel = np.select(conditions_ed, choices_ed, default = np.NaN)
    
    return simplelevel

# Apply function to subsets
edlevel = survey.loc[:,'EdLevel']
survey['EdLevel_simple'] = edlevel

#survey.loc[:,'EdLevel_Simple'] = simplify_ed(edlevel)
#survey.loc[:,'EdLevel_Simple'] = simplify_ed(survey.loc[:,'EdLevel'])


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [11]:
survey['UndergradMajor'].unique()

array([nan, 'Web development or web design',
       'Computer science, computer engineering, or software engineering',
       'Mathematics or statistics',
       'Another engineering discipline (ex. civil, electrical, mechanical)',
       'Information systems, information technology, or system administration',
       'A business discipline (ex. accounting, finance, marketing)',
       'A natural science (ex. biology, chemistry, physics)',
       'A social science (ex. anthropology, psychology, political science)',
       'A humanities discipline (ex. literature, history, philosophy)',
       'Fine arts or performing arts (ex. graphic design, music, studio art)',
       'A health science (ex. nursing, pharmacy, radiology)',
       'I never declared a major'], dtype=object)

In [12]:
# Simplify Undergraduate Major
def simplify_major(major):
    """Add a new field, UndergradMajor_New, to dataframe, containing simplified UndergradMajor values.
    
    Args:
    df: major. Series containing the field UndergradMajor.
       
    Returns:
    df: major_simple. Modified version of the input dataframe containing a new field, UndergradMajor_New.
    """
    conditions_major = [(major == 'Computer science, computer engineering, or software engineering')
                            |(major == 'Web development or web design')
                            |(major == 'Information systems, information technology, or system administration'),
                        (major == 'Mathematics or statistics')
                            |(major == 'Another engineering discipline (ex. civil, electrical, mechanical)')
                            |(major == 'A natural science (ex. biology, chemistry, physics)'),
                        (major == 'A health science (ex. nursing, pharmacy, radiology)')
                            |(major == 'A business discipline (ex. accounting, finance, marketing)')
                            |(major == 'A humanities discipline (ex. literature, history, philosophy)')
                            |(major == 'A social science (ex. anthropology, psychology, political science)')
                            |(major == 'Fine arts or performing arts (ex. graphic design, music, studio art)')
                       ]

    choices_major = ['Computer/Tech','Math/Science','Other']

    major_simple = np.select(conditions_major, choices_major, default = np.NaN)
    
    return major_simple

# Apply function to subsets

umajor = survey.loc[:,'UndergradMajor']
survey['Major'] = simplify_major(umajor)
#survey.loc[:,'EdLevel_Simple'] = simplify_ed(survey.loc[:,'EdLevel'])


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [13]:
survey['Age'].unique()

array([14. , 19. , 28. , 22. , 30. , 42. , 24. , 23. ,  nan, 21. , 31. ,
       20. , 26. , 29. , 38. , 47. , 34. , 32. , 25. , 17. , 35. , 27. ,
       44. , 43. , 62. , 37. , 45. , 18. , 33. , 36. , 16. , 39. , 64. ,
       41. , 54. , 49. , 40. , 56. , 12. , 58. , 46. , 59. , 51. , 48. ,
       57. , 52. , 50. , 23.9, 55. , 15. , 67. , 13. ,  1. , 53. , 69. ,
       65. , 17.5, 63. , 61. , 68. , 73. , 70. , 60. , 16.5, 46.5, 11. ,
       71. ,  3. , 97. , 29.5, 77. , 74. , 26.5, 26.3, 24.5, 78. , 72. ,
       66. , 76. , 10. , 75. , 99. , 83. , 79. , 36.8, 14.1, 13.5, 19.5,
       98. , 43.5, 22.5, 31.5, 21.5, 28.5, 33.6,  2. , 38.5, 30.8, 24.8,
       90. , 61.3, 81. ,  4. , 17.3, 19.9, 80. , 85. , 88. , 23.5, 16.9,
       20.9, 91. , 98.9, 57.9,  9. , 94. , 95. , 37.5, 14.5,  5. , 82. ,
       84. , 37.3, 33.5, 53.8, 31.4, 87. ])

###  Categorize according to standard "generations"

- Gen Z
- Millenial
- Gen X
- Boomers
- Silent

https://www.pewresearch.org/wp-content/uploads/2019/01/FT_19.01.17_generations_2019.png


In [14]:
# Convert Age to generation
def find_gen(age):
    """
    """
    if age <= 22:
        gen = 'GenZ'
        return gen
    elif age <= 38:
        gen = 'Millenial'
    elif age <= 54:
        gen = 'GenX'
    elif age <= 73:
        gen = 'Boomer'
    else:
        gen = 'Silent'      
    return gen

# Apply function to subsets
survey.loc[:,'Gen'] = survey['Age'].apply(find_gen)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [15]:
def fix_years_coding(years_obj):
    if years_obj == 'Less than 1 year':
        return 0.0001
    elif years_obj == 'More than 50 years':
        return 55.
    else:
        return years_obj


In [16]:
survey['nYearsCode']=survey['YearsCode'].apply(fix_years_coding)
survey['nYearsCodePro']=survey['YearsCodePro'].apply(fix_years_coding)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [17]:
survey['nYearsCodePro'].unique()



array([nan, '1', 0.0001, '9', '3', '4', '10', '8', '2', '13', '18', '5',
       '14', '22', '23', '19', '35', '20', '25', '7', '15', '27', '6',
       '48', '12', '31', '11', '17', '16', '21', '29', '30', '26', '33',
       '28', '37', '40', '34', '24', '39', '38', '36', '32', '41', '45',
       '43', 55.0, '44', '42', '46', '49', '50', '47'], dtype=object)

### Respondent Cleanup

- keep professionals only.  Drop students and "other"
- Create our categories of respondents (aux columns)
    - DataScientist
    - Non-DataScientists
    - Other?


As we are interested in comparing data scientists to non-data scientists, we need to be able to differentiate between the two. This is done using the `DevType` field. As a result, we should drop any rows where this field is missing, since we can't determine which subset these rows fit into.


In [18]:
# Create data scientist and non-data scientist subsets.
data_scientists = survey['DevType'].str.contains('data', case=False, na=False, regex=True) #data scientists / analists / Data engineers

In [19]:
developers = survey['DevType'].str.contains('developer', case=False, na=False,regex=True) #data scientists / analists / Data engineers

    #survey_non_ds = survey[~data_scientists]
    survey_ds = survey[data_scientists]
    survey_devel = survey[developers]
    survey_non_ds = survey[~data_scientists]

In [20]:
#survey_non_ds = survey[~data_scientists]
survey_ds = survey.loc[data_scientists]
survey_devel = survey.loc[developers]
survey_non_ds = survey.loc[~data_scientists]

In [21]:
print('Now we have a group of n=', np.shape(survey_ds)[0], 'data scientists, to compare with n=', np.shape(survey_devel)[0], ' developers. (And a control grooup of all non-data scientists, n=c',np.shape(survey_non_ds)[0], ')')


Now we have a group of n= 19752 data scientists, to compare with n= 72491  developers. (And a control grooup of all non-data scientists, n=c 69131 )


## Split Multi-Selection Fields


For the fields where multiple selections were possible (i.e. `EducationTypes`, `SelfTaughtTypes` and `LanguageWorkedWith`), split the strings containing the multiple selections into a list of selections and then concatenate these lists into a single list (dropping any missing values in the process). In the case of `EducationTypes` and `SelfTaughtTypes`, also simplify these fields to reduce the length of category labels and to group similar categories into a single category.

### Create other simplified variables 

- Split lists for each group
    - Languages
    - Environments

## Split LanguageWorkedWith
languages = split_list(survey, 'LanguageWorkedWith')
languages_non_ds = split_list(survey_non_ds, 'LanguageWorkedWith')

In [55]:
del(languages)

In [22]:
languages_full = df['LanguageWorkedWith'].str.split(';', expand=True)
languages_full[:10]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18,19,20,21,22,23,24,25,26,27
0,HTML/CSS,Java,JavaScript,Python,,,,,,,...,,,,,,,,,,
1,C++,HTML/CSS,Python,,,,,,,,...,,,,,,,,,,
2,HTML/CSS,,,,,,,,,,...,,,,,,,,,,
3,C,C++,C#,Python,SQL,,,,,,...,,,,,,,,,,
4,C++,HTML/CSS,Java,JavaScript,Python,SQL,VBA,,,,...,,,,,,,,,,
5,Java,R,SQL,,,,,,,,...,,,,,,,,,,
6,HTML/CSS,JavaScript,,,,,,,,,...,,,,,,,,,,
7,Bash/Shell/PowerShell,C,C++,HTML/CSS,Java,JavaScript,Python,SQL,,,...,,,,,,,,,,
8,Bash/Shell/PowerShell,C#,HTML/CSS,JavaScript,Python,Ruby,Rust,SQL,TypeScript,WebAssembly,...,,,,,,,,,,
9,C#,Go,JavaScript,Python,R,SQL,,,,,...,,,,,,,,,,


In [24]:
languages_ds = survey_ds['LanguageWorkedWith'].str.split(';', expand=True)
languages_non_ds = survey_non_ds['LanguageWorkedWith'].str.split(';', expand=True)
languages_devel = survey_devel['LanguageWorkedWith'].str.split(';', expand=True)


In [25]:
# Create dataframe containing split string values by respondent number
def split_list(df, col):
    """Create a new dataframe that splits the values of multi-selection column col into individual selections and 
    places each selection value on a separate row. This new dataframe can be linked back to the original dataframe by 
    Respondent value.
    
    Args:
    df: dataframe. Dataframe containing the multi-selection field col.
       
    Returns:
    out_df: dataframe. New dataframe giving split values of col.
    """
    in_res = list(df['Respondent'])
    in_list = list(df[col])
    
    out_res = []
    out_list = []
    
    for i in range(len(in_list)):
        if pd.isnull(in_list[i]) == False:
            vals = in_list[i].split(';')
            res = [in_res[i]]*len(vals)
            
            out_list.append(vals)
            out_res.append(res)
    
    out_df = pd.DataFrame({'Respondent': list(np.concatenate(out_res)), col: list(np.concatenate(out_list))})
    
    return out_df

In [36]:
summary = languages_ds.apply(pd.Series.value_counts)
summary_ds = pd.DataFrame({'count': summary.sum(axis=1).groupby(lambda x: x.strip()).sum()})

summary = languages_non_ds.apply(pd.Series.value_counts)
summary_non_ds = pd.DataFrame({'count': summary.sum(axis=1).groupby(lambda x: x.strip()).sum()})

summary = languages_devel.apply(pd.Series.value_counts)
summary_devel = pd.DataFrame({'count': summary.sum(axis=1).groupby(lambda x: x.strip()).sum()})



In [37]:
 
summary['LangPct_DS'] = summary_ds / len(summary_ds)*100
summary['LangPct_nonDS']= summary_non_ds / len(summary_non_ds)*100
summary['LangPct_Devel'] = summary_devel / len(summary_devel)*100



In [38]:
summary.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,21,22,23,24,25,26,27,LangPct_DS,LangPct_nonDS,LangPct_Devel
Assembly,4109,,,,,,,,,,...,,,,,,,,4946.428571,15885.714286,14675.0
Bash/Shell/PowerShell,23613,2579.0,,,,,,,,,...,,,,,,,,30607.142857,83646.428571,93542.857143
C,5553,5845.0,2142.0,,,,,,,,...,,,,,,,,14932.142857,49414.285714,48357.142857
C#,11003,7248.0,3109.0,1735.0,818.0,,,,,,...,,,,,,,,22192.857143,74582.142857,85403.571429
C++,3732,6171.0,4089.0,1704.0,,,,,,,...,,,,,,,,17178.571429,56121.428571,56057.142857


In [27]:
summary_devel.sort_values('count', ascending=False)

Unnamed: 0,count
SQL,13746.0
JavaScript,12800.0
HTML/CSS,12396.0
Python,10881.0
Bash/Shell/PowerShell,8570.0
Java,7531.0
C#,6214.0
PHP,5961.0
C++,4810.0
C,4181.0


## Perform Data Analysis

In [None]:
#cols_to_rename = {'col1': 'New_Name'}
#df.rename(columns=cols_to_rename, inplace=True)

### Clean Up Data Types

In [None]:
df.dtypes

## Column Cleanup

- Remove all leading and trailing spaces (not nescessary)
- Rename the columns for consistency (not nescessary)
- Drop columns which we don't want to anlayze

## Data Manipulation

### Create categories of respondents t

- hobbyist vs non-hobbyist
- Data Scientist (analysts) vs. other prof Developer
    - student vs non student of each t



In [None]:
# Look at descriptive statistics for data (ignore Respondent since this is just an ID field)
survey.drop(['Respondent'], axis = 1).describe()

### Save output file into processed directory

Save a file in the processed directory that is cleaned properly. It will be read in and used later for further analysis.

Other options besides pickle include:
- feather
- msgpack
- parquet

In [None]:
#%%

import requests
import zipfile
import shutil
import os
import pandas as pd
import json

#%%
urls = {
    2019: 'https://drive.google.com/uc?export=download&id=1QOmVDpd8hcVYqqUXDXf68UMDWQZP0wQV',
    2018: 'https://drive.google.com/uc?export=download&id=1_9On2-nsBQIw3JiY43sWbrF8EjrqrR4U',
    2017: 'https://drive.google.com/uc?export=download&id=0B6ZlG_Eygdj-c1kzcmUxN05VUXM',
    2016: 'https://drive.google.com/uc?export=download&id=0B0DL28AqnGsrV0VldnVIT1hyb0E',
    2015: 'https://drive.google.com/uc?export=download&id=0B0DL28AqnGsra1psanV1MEdxZk0',
    2014: 'https://drive.google.com/uc?export=download&id=0B0DL28AqnGsrempjMktvWFNaQzA',
    2013: 'https://drive.google.com/uc?export=download&id=0B0DL28AqnGsrenpPNTc5UE1PYW8',
    2012: 'https://drive.google.com/uc?export=download&id=0B0DL28AqnGsrX3JaZWVwWEpHNWM',
    2011: 'https://drive.google.com/uc?export=download&id=0Bx0LyhBTBZQgUGVYaGx3SzdUQ1U',
}

survey_filenames = {
    2019: 'survey_results_public.csv', 
    2018: 'survey_results_public.csv', 
    2017: 'survey_results_public.csv', 
    2016: '2016 Stack Overflow Survey Results/2016 Stack Overflow Survey Responses.csv',
    2015: '2015 Stack Overflow Developer Survey Responses.csv',
    2014: '2014 Stack Overflow Survey Responses.csv',
    2013: '2013 Stack Overflow Survey Responses.csv',
    2012: '2012 Stack Overflow Survey Results.csv',
    2011: '2011 Stack Overflow Survey Results.csv'
}

questions_filenames = {
    2019: 'survey_results_schema.csv', 
    2018: 'survey_results_schema.csv', 
    2017: 'survey_results_schema.csv' 
}

def survey_csvname(year):
    return 'survey{}.csv'.format(year)

def survey_dirname(year):
    return 'data{}/'.format(year)

#%%
def download_survey(year):
    print(f"Downloading {year}")
    request = requests.get(urls[year])
    with open("survey.zip", "wb") as file:
        file.write(request.content) 

    with zipfile.ZipFile("survey.zip", "r") as file:
        file.extractall("data")

    shutil.copytree("data/", survey_dirname(year))
    shutil.copy("data/" + survey_filenames[year], survey_csvname(year))
    os.remove("survey.zip")


#%%
#for year in range(2011, 2019):2019   
# 
#
#  
# describe the averages of intersting questions..True
#
#
# identify what "combinations" are interesting... choice...True


year = 2017
year = 2018
year = 2019


if True :
    if not os.path.exists(survey_csvname(year)):
        download_survey(year)
    print(f"Processing {year}")
    data=pd.read_csv(survey_csvname(year), encoding='latin1')      
    dataIndex=pd.read_csv(survey_dirname(year)+questions_filenames[year], encoding='latin1')
# first lets iterate through the questions and dump them 


for index, row in dataIndex.iterrows():
    # access data using column names
    print(index, row['Column'],": ", row['QuestionText'])

    


 
 
#%%


with open('data.json', 'w') as file:
        file.write(json.dumps(totals, indent=4, separators=(',', ': ')))


In [None]:
df.to_pickle(summary_file)