### Fields
1. id - The unique identifier for the profile
2. careerjunction_za_primary_jobtitle - The most recent job title of the profile
3. careerjunction_za_recent_jobtitles - The next job titles after the most recent one (max 2)
4. careerjunction_za_historical_jobtitles - All other job titles after recent ones (from the 4th job title)
5. careerjunction_za_future_jobtitles - Job titles the seeker would like to have as their next job (ambitions)
6. careerjunction_za_employer_names - All employers worked for
7. careerjunction_za_skills - All the skills
8. careerjunction_za_courses - Titles for education/courses
 
What we want is:
- Any insight that can be drawn from the data
- Given a profile id, find similar profiles, i.e. profiles that share a combination of skills, courses and/or job titles
- Given a profile id, recommend what their next job title(s) could be


As the amount of data captured increases, it becomes less structured and harder to fit into fixed database tables.
The JSON file lists over 1000 separate records. Each record has several fields, and some of the fields are nested within other fields. 
This type of data is very hard to store in a regular SQL database, so it is often stored in a format called JavaScript Object Notation (JSON). 
JSON is a way to encode data structures like lists and dictionaries as strings that machines can read easily. Even though the name starts with the word JavaScript, 
it's actually just a format, and can be read by any language.

Python has great JSON support, with the json library. We can both convert lists and dictionaries to JSON, and convert strings to lists and dictionaries. 
JSON data looks much like a dictionary would in Python, with keys and values stored.

In this task, we explore the JSON file in a Jupyter notebook, then import it into Python and work with it using Pandas.


### The dataset

The data describes job seekers' career profiles. It can be used to match a candidate to a particular job post and then to suggest 
how likely that post is to suit them. There are quite a few questions we could answer using the dataset, including:

    [1.] What is the total number of profile IDs present?
    [2.] What are the most common skills, education, and courses people search for?
    [3.] What are the most common primary job titles and recent job titles?
    [4.] What are the most common employer names, future job titles, etc.?
    
Since we assume that we don't know the structure of the JSON file upfront, we do some exploration to figure it out. This task used Jupyter Notebook for the exploration.
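
Once the records are loaded as a list of dictionaries, a question such as [2.] comes down to counting how often each value appears across profiles. A minimal sketch in Python 3; the helper name `most_common_values` and the three sample profiles are made up for illustration and are not taken from the dataset:

```python
from collections import Counter

def most_common_values(profiles, field, n=3):
    """Count how often each entry of a list-valued field occurs across profiles."""
    counts = Counter()
    for profile in profiles:
        counts.update(profile.get(field, []))
    return counts.most_common(n)

# Hypothetical sample records, shaped like the fields described above.
sample_profiles = [
    {"id": 1, "careerjunction_za_skills": ["PHP", "MYSQL", "CSS3"]},
    {"id": 2, "careerjunction_za_skills": ["PHP", "HTML5"]},
    {"id": 3, "careerjunction_za_skills": ["PHP", "MYSQL"]},
]

top_skills = most_common_values(sample_profiles, "careerjunction_za_skills", n=2)
print(top_skills)  # [('PHP', 3), ('MYSQL', 2)]
```

The same helper could be pointed at `careerjunction_za_courses` or `careerjunction_za_employer_names`, since those fields are lists as well.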


## Exploring the JSON data

The first thing we do is take a look at the first few lines of the data set.

In [131]:
%%bash
# path to the data set
head ../NumPy/datasets/data_science_extract.json

[
  {
    "id": 1,
    "careerjunction_za_historical_jobtitles": [
      "Marketer & Technical Liaison",
      "Quality Assurance Manager Haccp Team Leader",
      "New Product Developer Technologist",
      "Food Technologist",
      "Quality Controller"
    ],


We can tell that the JSON data is a list of dictionaries, and that it is well formatted. 
We can also see that:
    
"id", "careerjunction_za_historical_jobtitles", "careerjunction_za_primary_jobtitle",
"careerjunction_za_employer_names", "careerjunction_za_skills", "careerjunction_za_courses",
"careerjunction_za_recent_jobtitles", "careerjunction_za_future_jobtitles"

are top level keys, and they are indented four spaces. We get all of the top level keys by using the grep command to print any lines that have four leading spaces:

We can see from the output that the top level keys are the field names of a profile. The outer list contains every record in the job profile dataset. 
Each inner dictionary is a record, and the first record appears in the output from the grep command.

We print out the full key structure of the JSON file by using grep to print out any lines with 2-6 leading spaces:

This shows us the full key structure associated with data_science_extract.json, and tells us which parts of the JSON file are relevant for us.
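
The grep commands themselves are not reproduced above. As a rough Python 3 stand-in, the same filtering by leading spaces can be sketched as follows, assuming the file is pretty-printed with two spaces per nesting level as in the head output. The function name `lines_with_indent` and the inline sample are illustrative only:

```python
import re

def lines_with_indent(lines, min_spaces, max_spaces):
    """Return lines whose leading indentation is between min_spaces and max_spaces."""
    pattern = re.compile(r"^ {%d,%d}\S" % (min_spaces, max_spaces))
    return [line.rstrip("\n") for line in lines if pattern.match(line)]

# Inline sample mirroring the head output; a real run would pass an open file instead.
sample = '''[
  {
    "id": 1,
    "careerjunction_za_historical_jobtitles": [
      "Food Technologist",
      "Quality Controller"
    ],
'''.splitlines()

# Lines indented exactly four spaces are the top level keys (plus closing brackets).
top_level = lines_with_indent(sample, 4, 4)
for line in top_level:
    print(line)
```

Calling `lines_with_indent(sample, 2, 6)` instead would keep the nested values too, which corresponds to the "2-6 leading spaces" variant.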

### Extracting information on the columns

Now that we know which keys hold the fields we need, we read that information in. 
If the JSON file could not fit in memory, we could not read it in directly with the json library and would have to parse it iteratively in a memory-efficient way. 
This extract holds only 2000 profiles, so the cell below loads it in one call with json.load.
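
For a file that really is too large to load at once, iterative parsing could look roughly like the sketch below. It is written for Python 3 with the standard library only, and it assumes the file is one top-level JSON array whose elements are all objects, as in this dataset. The generator name `iter_json_array` is hypothetical; a streaming library such as ijson is the more usual choice for this job.

```python
import io
import json

def iter_json_array(fp, chunk_size=65536):
    """Yield the objects of a top-level JSON array one at a time.

    Assumes every array element is a JSON object (dict).
    """
    decoder = json.JSONDecoder()
    buf = ""
    eof = False
    started = False
    while not eof:
        chunk = fp.read(chunk_size)
        eof = chunk == ""
        buf += chunk
        while True:
            buf = buf.lstrip(" \t\r\n,")      # skip whitespace and separators
            if not started:
                if not buf:
                    break
                if buf[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                buf, started = buf[1:], True
                continue
            if not buf or buf[0] == "]":
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except ValueError:
                if eof:
                    raise                      # truncated or invalid JSON
                break                          # object incomplete: read more
            yield obj
            buf = buf[end:]

# Small in-memory stand-in for the real file, read in tiny chunks on purpose.
text = '[{"id": 1, "careerjunction_za_skills": ["PHP", "MYSQL"]}, {"id": 2, "careerjunction_za_skills": []}]'
ids = [profile["id"] for profile in iter_json_array(io.StringIO(text), chunk_size=8)]
print(ids)  # [1, 2]
```

Only one record plus the current chunk is held in the buffer at a time, at the cost of re-parsing a record whenever it straddles a chunk boundary.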

In [580]:
import json  # standard-library JSON parser; json.load reads the whole file at once
import sys
import pandas as pd
import numpy as np
from pandas import DataFrame
from pandas.io.json import json_normalize #package for flattening json in pandas df
filename = "../NumPy/datasets/data_science_extract.json"

def js_data(filename):
    """Load the JSON file and return its records as a list of dictionaries."""
    with open(filename, 'r') as f_in:
        objects = json.load(f_in)
    return list(objects)

#https://medium.com/@gis10kwo/converting-nested-json-data-to-csv-using-python-pandas-dc6eddc69175    
if __name__ == "__main__":
    columns = js_data(filename)
print("The length of the column is equal to the total number of profile_id within the dataset {}".format(len(columns)))

# selected_row and all_rows are plain copies of the record list, used by the later cells
selected_row = list(columns)
column_headers = len(selected_row)  # number of records
all_rows = list(selected_row)



The length of the column is equal to the total number of profile_id within the dataset 2000


In [536]:
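# look for the given key in a (possibly nested) dict and print every field of the dict that holds it
def get_val(dct, key):
    if key in dct:
        for k, v in dct.items():
            print("{} {}".format(k, v))
    else:
        # only nested dictionaries can contain the key, so skip lists and strings
        for d in dct.values():
            if isinstance(d, dict):
                get_val(d, key)

key='id'
get_val(selected_row[2],key)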

careerjunction_za_historical_jobtitles []
careerjunction_za_primary_jobtitle Social Manager
careerjunction_za_skills [u'C# Developer', u'MYSQL', u'PHP', u'javascript', u'CSS3', u'HTML5', u'wordpress', u'AJAX', u'RDBMS', u'Magento', u'C++ Developer', u'JAVA Developer']
careerjunction_za_courses [u'Bsc in Computer Systems', u'Higher National Diploma in Information Technology', u'Senior Certificate']
careerjunction_za_employer_names [u'Bruce Records Studio', u'Crystal MAP']
careerjunction_za_recent_jobtitles [u'Junior Developer']
careerjunction_za_future_jobtitles [u'Web Developer', u'Application Developer', u'C# Developer']
id 3


In [537]:
msgs = pd.io.json.json_normalize(columns)
msgs.dtypes

careerjunction_za_courses                 object
careerjunction_za_employer_names          object
careerjunction_za_future_jobtitles        object
careerjunction_za_historical_jobtitles    object
careerjunction_za_primary_jobtitle        object
careerjunction_za_recent_jobtitles        object
careerjunction_za_skills                  object
id                                         int64
dtype: object

In [134]:
# Comparing list in the row of each column
'''
for i in selected_row[0]['careerjunction_za_recent_jobtitles']:
    if i in selected_row[1]['careerjunction_za_recent_jobtitles']:
        print True
    else:
        print False

'''

def compare_listcomp(row_x, row_y):
    return [i for i, j in zip(row_x, row_y) if i == j]


def compare_intersect(row_x, row_y):
    return frozenset(row_x).intersection(row_y)


compare_listcomp(selected_row[0]['careerjunction_za_recent_jobtitles'], selected_row[1]['careerjunction_za_recent_jobtitles'])
compare_intersect(selected_row[0]['careerjunction_za_recent_jobtitles'], selected_row[1]['careerjunction_za_recent_jobtitles'])

frozenset()

In [166]:
# Given profile ID

def profile_id(selected_row):
    column_headers = selected_row
    return column_headers

print (profile_id(selected_row[0]['careerjunction_za_recent_jobtitles']))
print (profile_id(selected_row[1]['careerjunction_za_recent_jobtitles']))

def look_up_similar(selected_row1, selected_row2):
    """Split the items of the first list into those found in the second list and those not."""
    get_similar = []
    get_not_similar = []
    for i in selected_row1:
        if i in selected_row2:
            get_similar.append(i)
        else:
            get_not_similar.append(i)
    return get_similar, get_not_similar

print(look_up_similar(selected_row[0]['careerjunction_za_recent_jobtitles'],selected_row[1]['careerjunction_za_recent_jobtitles']))



[u'Food Technologist', u'Product Specialist Microbiology']
[u'Senior Developer', u'Senior Developer']
([], [u'Food Technologist', u'Product Specialist Microbiology'])


In [198]:
#separate and group all dictionary to list
import ijson



SyntaxError: 'return' outside function (<ipython-input-198-aa165885c360>, line 19)

### Dictionary Comparison
Here I compare the dictionaries in the dataset with one another to find similar profiles. For a given profile id, I compare its recent job titles, skills and courses/education lists with those of the other profiles, 
and I then use the result to determine the job title similarities.

#### A simple approach:

- Make a new dict with the ids as keys; let's call each key the source
- The value for each source is also a dict, keyed by the other ids (this builds a matrix); let's call each of those the target
- Fill the entry for each source id and target id with a count of the matching comparisons
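
The three steps above can be sketched in Python 3 as follows. This is only one possible reading of the approach: it counts shared entries over the list-valued fields and ignores the primary job title. The names `LIST_FIELDS`, `overlap_matrix` and `most_similar`, and the sample profiles, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

LIST_FIELDS = ["careerjunction_za_skills", "careerjunction_za_courses",
               "careerjunction_za_recent_jobtitles"]

def overlap_matrix(profiles):
    """Map source id -> target id -> number of shared skills, courses and job titles."""
    # Store each entry as a (field, value) pair so that a skill and a course
    # with the same text are not counted as a match.
    items = {}
    for p in profiles:
        items[p["id"]] = {(f, v) for f in LIST_FIELDS for v in p.get(f, [])}
    matrix = defaultdict(dict)
    for a, b in combinations(items, 2):
        shared = len(items[a] & items[b])
        matrix[a][b] = shared
        matrix[b][a] = shared
    return matrix

def most_similar(matrix, profile_id, n=1):
    """Return the n target ids with the highest overlap count for profile_id."""
    targets = matrix[profile_id]
    return sorted(targets, key=lambda t: (-targets[t], t))[:n]

# Hypothetical sample records, not taken from the dataset.
sample_profiles = [
    {"id": 1, "careerjunction_za_skills": ["PHP", "MYSQL"],
     "careerjunction_za_courses": ["Senior Certificate"]},
    {"id": 2, "careerjunction_za_skills": ["PHP", "MYSQL", "CSS3"],
     "careerjunction_za_courses": ["Senior Certificate"]},
    {"id": 3, "careerjunction_za_skills": ["Programming"],
     "careerjunction_za_courses": ["Senior Certificate"]},
]
matrix = overlap_matrix(sample_profiles)
print(matrix[1][2], matrix[1][3])  # 3 1
print(most_similar(matrix, 1))     # [2]
```

Comparing every pair is quadratic in the number of profiles, which stays manageable for the 2000 profiles here but would need a different design for a much larger set.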

In [584]:
def new_source(dict_l):
    """Keep only the fields used for comparing profiles."""
    good_columns = ["id","careerjunction_za_courses", "careerjunction_za_skills", 
                "careerjunction_za_recent_jobtitles","careerjunction_za_primary_jobtitle"]
    return {k: v for k, v in dict_l.items() if k in good_columns}
new_source(all_rows[3])

{u'careerjunction_za_courses': [u'Advanced Diploma In Computer Science'],
 u'careerjunction_za_primary_jobtitle': u'Database Developer Application Programmer',
 u'careerjunction_za_recent_jobtitles': [u'Programmer Technician'],
 u'careerjunction_za_skills': [u'Programming', u'Technical Support'],
 u'id': 4}

#### Extracting all values from the all_rows list of dictionaries



In [596]:
# extract all the values of a profile in all_rows into one flat list
def extract_values(profile):
    """Flatten every field value of one profile dict into a single list."""
    values = []
    for value in profile.values():
        if isinstance(value, list):
            values.extend(value)   # list fields: skills, courses, job titles
        else:
            values.append(value)   # scalar fields: id, primary job title
    return values

value_lst_count0 = extract_values(all_rows[0])
value_lst_count1 = extract_values(all_rows[1])
value_lst_count2 = extract_values(all_rows[2])
value_lst_count3 = extract_values(all_rows[3])
value_lst_count5 = extract_values(all_rows[4])
value_lst_count6 = extract_values(all_rows[6])

print([str(v) for v in value_lst_count0])

def extract_dict_keys(dict_keys):
    return [str(k) for k in dict_keys.keys()]

dict_keys = extract_dict_keys(all_rows[3]) # remove the u' unicode

#print dict_keys

In [595]:
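def compare_dict(dict1,dict2):
    """Return the first field whose value differs between the two profiles (None if they match)."""
    for k in dict1:
        # .get avoids a KeyError when dict2 lacks the field
        if dict1[k] != dict2.get(k):
            return k, ':', dict1[k], '->', dict2.get(k)

compare_dict(new_source(all_rows[1]),new_source(all_rows[2]))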

(u'careerjunction_za_courses',
 ':',
 [u'B.Econ', u'Grade 12/Matric'],
 '->',
 [u'Bsc in Computer Systems',
  u'Higher National Diploma in Information Technology',
  u'Senior Certificate'])

{"careerjunction_za_courses": ["Bsc in Computer Systems", "Higher National Diploma in Information Technology", "Senior Certificate"], "careerjunction_za_skills": ["C# Developer", "MYSQL", "PHP", "javascript", "CSS3", "HTML5", "wordpress", "AJAX", "RDBMS", "Magento", "C++ Developer", "JAVA Developer"], "careerjunction_za_primary_jobtitle": "Social Manager", "id": 3, "careerjunction_za_recent_jobtitles": ["Junior Developer"]}


In [None]:
#https://codereview.stackexchange.com/questions/108443/extract-data-from-large-json-and-find-frequency-of-contiguous-sub-lists
Notes from the linked question, which describes code that:

- Takes a very large JSON file (15GB gzipped, ~10 million records)
- Extracts the relevant parts of the JSON into a list of lists
- Creates a list of all contiguous n-gram sub-lists found in the array
- Creates a counter to count the frequency of each n-gram
- Outputs the Counter showing the most common occurrences

Run on the full dataset, that code fails with out of memory errors. The asker wonders whether too many sub-list combinations are being generated, 
and considers chunking up the JSON, processing the chunks in parallel and then combining the counters at the end, but does not know how to implement parallel processing in IPython with Python 2.7.

In [552]:
# Finding most common contiguous sub-lists in an array of lists
# Objective: Given a set of sequences ( eg: step1->step2->step3, step1->step3->step5) ) 
# arranged in an array of lists, count the number of times every contiguous sub-lists occur
# https://codereview.stackexchange.com/questions/108052/finding-most-common-contiguous-sub-lists-in-an-array-of-lists
import random
import string
from collections import Counter
from timeit import timeit

def subsequence_counts_2(sequences):
    counts = Counter()
    for sequence in sequences:
        joined = "".join(sequence)
        for j in range(1,len(joined)+1):
            # count every contiguous substring of length j; update() adds to counts in place
            counts.update(joined[i:i+j] for i in range(len(joined)-(j-1)))
    return counts

def test_data(n, m, choices):
    """Return a list of n lists of m items chosen randomly from choices."""
    return [[random.choice(choices) for _ in range(m)] for _ in range(n)]

#subsequence_counts_2(data)
data = test_data(10, 10, string.ascii_uppercase)

timeit(lambda:subsequence_counts_2(data), number=1)

0.0015540122985839844

0.10197997093200684
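
Joining each sequence into a string only counts item-level sub-lists when every item is a single character, as in the random test data above. For sequences of job titles, a variant that keeps each contiguous sub-list as a tuple may be closer to what is needed. A sketch in Python 3; `sublist_counts`, its `max_len` parameter and the sample job histories are hypothetical:

```python
from collections import Counter

def sublist_counts(sequences, max_len=None):
    """Count every contiguous sub-list (stored as a tuple) across all sequences."""
    counts = Counter()
    for seq in sequences:
        longest = len(seq) if max_len is None else min(max_len, len(seq))
        for n in range(1, longest + 1):
            # all windows of length n over this sequence
            counts.update(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    return counts

# Hypothetical job title histories, oldest first.
histories = [
    ["Junior Developer", "Developer", "Senior Developer"],
    ["Junior Developer", "Developer", "Team Lead"],
]
counts = sublist_counts(histories, max_len=2)
print(counts[("Junior Developer", "Developer")])   # 2
print(counts[("Developer", "Senior Developer")])   # 1
```

Capping the window length with `max_len` limits how many sub-lists are generated, which is the main driver of memory use in this kind of counting.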

In [298]:
json_string = '{"favorited": false, "contributors": null}'
print(json_string)
value = json.loads(json_string)   # JSON string -> dict
print(value)
json_dump = json.dumps(value)     # dict -> JSON string
print(json_dump)


{"favorited": false, "contributors": null}
{u'favorited': False, u'contributors': None}
{"favorited": false, "contributors": null}


