In [54]:
# standard lib imports
import os.path
import zipfile
import glob
import re
import json
import functools

# 3rd party imports
import pandas as pd

# project imports

In [55]:
# get all paths of zip files with input data
f_paths = glob.glob('./data/input/stackoverflow/*.zip')

# map the year of each survey to the path of its zip file
# (raw string so that '\.' is not treated as an invalid string escape)
years = {int(re.match(r'.*/.*(?P<year>[0-9]{4})(?!/).*\.zip', pth).group('year')): pth for pth in f_paths}

# get overview of available data
for year in sorted(years.keys()):
    with zipfile.ZipFile(years[year], mode='r') as zfile:
        zfile.printdir()

File Name                                             Modified             Size
2011 Stack Overflow Survey Results.csv         2017-03-16 18:29:50       939640
__MACOSX/                                      2017-03-16 18:30:50            0
__MACOSX/._2011 Stack Overflow Survey Results.csv 2017-03-16 18:29:50          488
File Name                                             Modified             Size
2012 Stack Overflow Survey Results.csv         2017-03-16 20:01:30      2521440
__MACOSX/                                      2017-03-16 20:02:10            0
__MACOSX/._2012 Stack Overflow Survey Results.csv 2017-03-16 20:01:30          488
File Name                                             Modified             Size
2013 Stack Overflow Survey Responses.csv       2016-07-01 12:57:56      7931652
__MACOSX/                                      2016-09-06 14:19:28            0
__MACOSX/._2013 Stack Overflow Survey Responses.csv 2016-07-01 12:57:56          177
File Name                    

The main survey data within the zip files can be identified by the file type (.csv) and the size of the file. Unfortunately, the name of this main data file is not uniform throughout the years.

-> In the following cell, the paths and filenames of the survey data are quickly reorganized so that this information is grouped by the year of the survey.

In [56]:
# reorganize path and main csv-file dict
for year in years.keys():
    years[year] = dict(zpth=years[year], fname=None)
    
# determine main csv-file with survey data in zips (by filetype + maximum size)
for year in years.keys():
    with zipfile.ZipFile(years[year].get('zpth'), mode='r') as zfile:
        size = 0
        file = ''
        for zinfo in zfile.infolist():
            # not a csv -> next file
            if not zinfo.filename.endswith('.csv'):
                continue
            # maximum file-size found?
            if zinfo.file_size > size:
                size = zinfo.file_size
                file = zinfo.filename
        # insert survey-filename in dict
        years[year]['fname'] = file

# write dict to json
with open('./data/input/stackoverflow/survey_pathes.json', 'w') as json_out:
    json.dump(years, json_out)
    
years

{2013: {'zpth': './data/input/stackoverflow/2013 Stack Overflow Survey Responses.zip',
  'fname': '2013 Stack Overflow Survey Responses.csv'},
 2012: {'zpth': './data/input/stackoverflow/2012 Stack Overflow Survey Results.zip',
  'fname': '2012 Stack Overflow Survey Results.csv'},
 2016: {'zpth': './data/input/stackoverflow/2016 Stack Overflow Survey Results.zip',
  'fname': '2016 Stack Overflow Survey Results/2016 Stack Overflow Survey Responses.csv'},
 2011: {'zpth': './data/input/stackoverflow/2011 Stack Overflow Survey Results.zip',
  'fname': '2011 Stack Overflow Survey Results.csv'},
 2015: {'zpth': './data/input/stackoverflow/2015 Stack Overflow Developer Survey Responses.zip',
  'fname': '2015 Stack Overflow Developer Survey Responses.csv'},
 2018: {'zpth': './data/input/stackoverflow/developer_survey_2018.zip',
  'fname': 'survey_results_public.csv'},
 2014: {'zpth': './data/input/stackoverflow/2014 Stack Overflow Survey Responses.zip',
  'fname': '2014 Stack Overflow Survey R

Now we have all the information needed to read the survey data grouped by year.
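One caveat if the dict is later reloaded from `survey_pathes.json` instead of being reused in memory: JSON object keys are always strings, so the integer years come back as `'2011'`, `'2012'`, and so on. The following is a minimal sketch of the round trip; `years_demo` is a made-up stand-in with shortened, hypothetical paths, not the real `years` dict.

```python
import json

# Hypothetical stand-in for the `years` dict (paths shortened for illustration).
years_demo = {
    2011: {'zpth': 'a.zip', 'fname': 'a.csv'},
    2012: {'zpth': 'b.zip', 'fname': 'b.csv'},
}

# Round trip through JSON: the integer keys are serialized as strings.
restored = json.loads(json.dumps(years_demo))
print(list(restored))

# Convert the keys back to int so that lookups by year keep working.
restored = {int(year): entry for year, entry in restored.items()}
print(restored[2011]['fname'])
```

Sorting also behaves differently on string keys than on integer keys once the years no longer all have the same number of digits, which is another reason to convert them back right after loading.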

In [57]:
# read all data and check if they are consistent throughout the years (available parameters + dtypes)
datas = []
# iterate through years in ascending order
for year, ydict in sorted(years.items()):
    with zipfile.ZipFile(ydict.get('zpth'), mode='r') as zfile:
        with zfile.open(ydict.get('fname'), mode='r') as fin:
            # read csv-file to dataframe and append to datas-list
            datas.append(pd.read_csv(fin, encoding='latin-1', low_memory=False))

In [58]:
# create header-list with all available parameters over the years and normalize them
# (avoid differences caused by capitalization or leading/trailing spaces)
parameters = [set(map(lambda x: x.strip().lower(), data.columns)) for data in datas]

# parameters available in all years:
functools.reduce(lambda x,y: x.intersection(y), parameters)

set()

There are no parameters that are available in all surveys, or at least none that carry the same name in every survey.
How does it look when comparing only the parameters of the last two survey years (2019 and 2018)?

In [59]:
# compare parameters which are both in 2018 and 2019
parameters[-2].intersection(parameters[-1])

{'age',
 'country',
 'currencysymbol',
 'databasedesirenextyear',
 'databaseworkedwith',
 'dependents',
 'devtype',
 'employment',
 'gender',
 'languagedesirenextyear',
 'languageworkedwith',
 'opensource',
 'platformdesirenextyear',
 'platformworkedwith',
 'respondent',
 'student',
 'undergradmajor'}

In [60]:
# parameters which are included in only one of the two surveys (2018 or 2019), but not in both
parameters[-2].symmetric_difference(parameters[-1])

{'adblocker',
 'adblockerdisable',
 'adblockerreasons',
 'adsactions',
 'adsagreedisagree1',
 'adsagreedisagree2',
 'adsagreedisagree3',
 'adspriorities1',
 'adspriorities2',
 'adspriorities3',
 'adspriorities4',
 'adspriorities5',
 'adspriorities6',
 'adspriorities7',
 'age1stcode',
 'agreedisagree1',
 'agreedisagree2',
 'agreedisagree3',
 'aidangerous',
 'aifuture',
 'aiinteresting',
 'airesponsible',
 'assessbenefits1',
 'assessbenefits10',
 'assessbenefits11',
 'assessbenefits2',
 'assessbenefits3',
 'assessbenefits4',
 'assessbenefits5',
 'assessbenefits6',
 'assessbenefits7',
 'assessbenefits8',
 'assessbenefits9',
 'assessjob1',
 'assessjob10',
 'assessjob2',
 'assessjob3',
 'assessjob4',
 'assessjob5',
 'assessjob6',
 'assessjob7',
 'assessjob8',
 'assessjob9',
 'betterlife',
 'blockchainis',
 'blockchainorg',
 'careersat',
 'careersatisfaction',
 'checkincode',
 'coderev',
 'coderevhrs',
 'communicationtools',
 'companysize',
 'compfreq',
 'comptotal',
 'containers',
 'convert

The last two statements show the problem with comparing this data across years to look for trends and developments over time: the survey is not consistent throughout the years. Even considering only the surveys of the two latest years (2018 and 2019), there are only 17 common parameters.
Quite a lot of these differences are caused by different naming, as the symmetric difference of the two parameter sets shows. For example, the parameter 'jobsatisfaction' in one survey is called 'jobsat' in the other. The same holds for 'yearscode' and 'yearscoding', and there are many more examples.
An automated analysis of all these parameters over time would require heavy manual work and data preparation to ensure a correct mapping. Analyzing trends of single parameters over time should still be feasible, though.
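The manual mapping mentioned above could start from a small alias table that is applied after the same normalization (strip + lower) used earlier. This is only a sketch under assumptions: `ALIASES`, `canonical_names`, and the two header lists are hypothetical, and the table contains just the two name pairs mentioned in the text. Which spelling serves as the canonical one is an arbitrary choice here.

```python
# Hypothetical alias table: normalized column name -> canonical name.
# Only the two pairs named in the text are listed; every further entry
# would have to be checked by hand against the survey questions.
ALIASES = {
    'jobsatisfaction': 'jobsat',
    'yearscoding': 'yearscode',
}


def canonical_names(columns, aliases=ALIASES):
    """Return the set of normalized, alias-resolved column names."""
    normalized = (col.strip().lower() for col in columns)
    return {aliases.get(name, name) for name in normalized}


# Made-up header lists for illustration (not the real survey headers).
cols_a = ['Respondent', ' JobSatisfaction', 'YearsCoding', 'AdBlocker']
cols_b = ['Respondent', 'JobSat', 'YearsCode ', 'BetterLife']

# Parameters shared by both header lists after resolving the aliases.
common = canonical_names(cols_a) & canonical_names(cols_b)
print(sorted(common))
```

Without the alias table only 'respondent' would be shared by the two illustrative lists. A matching name still does not guarantee that the question or its answer options stayed the same between years, so each mapped pair would need a manual check before any trend analysis.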