Data analysis with Python doesn't always require fancy tools and complex methods. The standard library and the built-in language constructs are enough for many quick tasks. Here I've prepared a JSON file from the Stack Overflow Careers RSS feed. The code that generates this file is in `generate_jobs_json.py`.

Since this demo does not require anything outside of the standard library, I call it 'Anti-Social Python' because it doesn't rely on too many 'friends'.

In [1]:
import json

In [2]:
f = open('socjobs-prepared.json', 'r')

In [3]:
data = f.read()

In [4]:
f.close()

The result of parsing the JSON string will be a `list` of `dict` objects.

In [5]:
jobs = json.loads(data)
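As an aside, the open/read/close/parse steps above can be collapsed with a `with` block and `json.load`, which reads straight from the file object. This is only a sketch: the sample records and the temporary file are made up so the snippet can run without `socjobs-prepared.json`.

```python
import json
import os
import tempfile

# Hypothetical sample records standing in for the real feed data.
sample = [
    {"id": "1", "title": "Developer", "tags": ["python", "sql"]},
    {"id": "2", "title": "Sysadmin"},
]

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(sample, tmp)
    path = tmp.name

# The context manager closes the file even if parsing raises an error.
with open(path, "r") as f:
    loaded = json.load(f)

os.remove(path)

print(type(loaded).__name__, len(loaded))
```

With the real file, the same pattern would be `with open('socjobs-prepared.json') as f: jobs = json.load(f)`.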

I know that the RSS feed for Stack Overflow Careers has 1000 entries.

In [6]:
len(jobs)

1000

Here is what a job looks like.

In [7]:
jobs[0]

{'author': 'Stride',
 'id': '126790',
 'location': 'New York, NY',
 'tags': ['java', 'scala', 'ruby-on-rails', 'ruby', 'javascript'],
 'title': 'Full Stack Agile Developer at Stride (New York, NY)'}

I want to do some ad-hoc analysis on the tags for each job. However, not all jobs have tags. Looking up a missing key with `job['tags']` raises a `KeyError`, so I'll filter the tagless jobs out first.

In [8]:
tagged_jobs = [job for job in jobs if 'tags' in job]
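To make the failure mode concrete, here is a small sketch on made-up records (the ids and tags are hypothetical, not taken from the feed). It shows the `KeyError` that the filter avoids, and how a membership test differs from `dict.get` when a job has an empty tag list.

```python
# Hypothetical sample records; the second one has no 'tags' key.
sample_jobs = [
    {"id": "1", "tags": ["python", "sql"]},
    {"id": "2"},
    {"id": "3", "tags": []},
]

# Indexing a missing key raises KeyError.
try:
    sample_jobs[1]["tags"]
except KeyError as err:
    missing_key = err.args[0]

# A membership test keeps every job that has the key, even if its list is empty.
has_key = [job for job in sample_jobs if "tags" in job]

# dict.get returns None for a missing key, so this also drops empty tag lists.
non_empty = [job for job in sample_jobs if job.get("tags")]

print(missing_key, len(has_key), len(non_empty))
```

Which variant is appropriate depends on whether an empty tag list should count as "tagged"; the notebook's filter uses the membership test.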

And indeed about 5% of the jobs were removed.

In [9]:
len(tagged_jobs)

949

All jobs should have a unique identifier.  So I'll do a couple of sanity checks to ensure this is the case.  First I'll get a `list` of all the ids.

In [10]:
every_id = [job['id'] for job in tagged_jobs]

I'll make sure that there are 949 of them.

In [11]:
len(every_id)

949

However, at this point I still can't be sure that there are 949 **unique** ids.  I need to check for duplicates.  Converting a `list` to a `set` will automatically remove any duplicate values.  If the length of the `set` is still 949, all of the ids are unique.

In [12]:
len(set(every_id))

949
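Had the two lengths differed, the next question would be which ids are repeated. One possible way to find them with nothing but sets is sketched below; the id values are invented for illustration.

```python
# Hypothetical ids with one deliberate duplicate.
sample_ids = ["126790", "126791", "126790", "126800"]

seen = set()        # ids encountered so far
duplicates = set()  # ids encountered more than once

for job_id in sample_ids:
    if job_id in seen:
        duplicates.add(job_id)
    seen.add(job_id)

# The set is shorter than the list exactly when duplicates exist.
print(len(sample_ids), len(set(sample_ids)), duplicates)
```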

Now getting a `list` of all the jobs with the tag 'python' is easy.  And there are 170 of them.

In [13]:
python_jobs = [job for job in tagged_jobs if 'python' in job['tags']]

In [14]:
len(python_jobs)

170

`Counter` is a `dict` subclass that keeps frequency counts of hashable values. It can be found in the `collections` module in the standard library.

In [15]:
from collections import Counter

In [16]:
c = Counter()

I'll use the `Counter` to keep track of the count of each tag.

In [17]:
for job in tagged_jobs:
    for tag in job['tags']:
        c[tag] += 1
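The nested loop is easy to read, but `Counter` can also do the inner counting itself. The sketch below, on made-up jobs, compares the explicit loop with `Counter.update` and with passing a flattened iterable to the constructor; all three should produce the same counts.

```python
from collections import Counter
from itertools import chain

# Hypothetical tagged jobs for illustration.
sample_jobs = [
    {"id": "1", "tags": ["python", "sql"]},
    {"id": "2", "tags": ["python", "linux"]},
    {"id": "3", "tags": ["java"]},
]

# Explicit nested loop, as in the cell above.
loop_counts = Counter()
for job in sample_jobs:
    for tag in job["tags"]:
        loop_counts[tag] += 1

# Counter.update adds one count per element of the iterable it is given.
update_counts = Counter()
for job in sample_jobs:
    update_counts.update(job["tags"])

# Flatten all tag lists and count them in a single constructor call.
chain_counts = Counter(chain.from_iterable(job["tags"] for job in sample_jobs))

print(loop_counts == update_counts == chain_counts)
```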

The `most_common` method returns a `list` of `(value, count)` tuples for the most frequent values, highest count first.

In [20]:
c.most_common(25)

[('javascript', 290),
 ('java', 207),
 ('python', 170),
 ('c#', 157),
 ('sql', 104),
 ('angularjs', 102),
 ('c++', 95),
 ('linux', 87),
 ('amazon-web-services', 83),
 ('node.js', 80),
 ('css', 77),
 ('php', 72),
 ('.net', 72),
 ('reactjs', 70),
 ('html', 66),
 ('ruby-on-rails', 59),
 ('mysql', 58),
 ('ruby', 55),
 ('rest', 54),
 ('sysadmin', 49),
 ('html5', 45),
 ('c', 42),
 ('sql-server', 42),
 ('jquery', 39),
 ('postgresql', 37)]
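Note the ties in the output, such as 'php' and '.net' at 72. On Python 3.7 and later, `most_common` is documented to keep equal counts in the order they were first encountered. A small sketch with an invented tag sequence:

```python
from collections import Counter

# Hypothetical tag stream: 'php' is seen before '.net', and both end up with 2.
tags = ["php", ".net", "java", "php", ".net", "java", "java"]
counts = Counter(tags)

# Top two entries only.
top_two = counts.most_common(2)

# With no argument, every entry is returned, most frequent first.
everything = counts.most_common()

print(top_two)
print(everything)
```

Calling `most_common()` without an argument is a handy way to get the full ranking when the number of distinct tags is small.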

A `Counter` can also be indexed by key, just like a dictionary. Here I am checking that there are still 170 Python jobs.

In [19]:
c['python']

170
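One difference from a plain `dict` is worth knowing: looking up a key that was never counted returns 0 instead of raising a `KeyError`, and the lookup does not add the key. A minimal sketch with made-up tags:

```python
from collections import Counter

# Hypothetical counts for illustration.
counts = Counter(["python", "python", "java"])

# A missing key yields 0 rather than raising KeyError.
missing = counts["cobol"]

# The lookup above did not insert the key.
still_absent = "cobol" not in counts

print(counts["python"], missing, still_absent)
```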