### Fields
1. id - The unique identifier for the profile
2. careerjunction_za_primary_jobtitle - The most recent job title of the profile
3. careerjunction_za_recent_jobtitles - The next job titles after the most recent one (max 2)
4. careerjunction_za_historical_jobtitles - All other job titles after recent ones (from the 4th job title)
5. careerjunction_za_future_jobtitles - Job titles the seeker would like to have as their next job (ambitions)
6. careerjunction_za_employer_names - All employers worked for
7. careerjunction_za_skills - All the skills
8. careerjunction_za_courses - Titles for education/courses
 
What we want:
- Any insight that can be drawn from the data
- Given a profile id, find similar profiles, based on a combination of similar skills, courses and/or job titles.
- Given a profile id, recommend what their next job title(s) could be


As the amount of captured data grows, it tends to become less structured and no longer fits neatly into database tables.
The JSON file lists over 1000 separate events. Each event has different fields, and some of the fields are nested within other fields.
This type of data is very hard to store in a regular SQL database. Such unstructured data is often stored in a format called JavaScript Object Notation (JSON).
JSON is a way to encode data structures like lists and dictionaries as strings that machines can read easily. Even though JSON starts with the word JavaScript,
it's actually just a format, and can be read by any language.

Python has great JSON support, with the json library. We can both convert lists and dictionaries to JSON, and convert strings to lists and dictionaries. 
JSON data looks much like a dictionary would in Python, with keys and values stored.
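
As a minimal sketch of that round trip, the snippet below encodes a dictionary to a JSON string and decodes it back. The `profile` record is a hypothetical, shortened example whose values are borrowed from the sample output later in this document; it is not read from the real file.

```python
import json

# Hypothetical, shortened profile record (not loaded from the dataset).
profile = {
    "id": 1,
    "careerjunction_za_primary_jobtitle": "Senior Food Technologist",
    "careerjunction_za_skills": ["Microbiology", "Quality Control"],
}

# Dictionary -> JSON string
encoded = json.dumps(profile)

# JSON string -> dictionary
decoded = json.loads(encoded)

print(encoded)
print(decoded == profile)  # the round trip preserves keys and values
```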

In this task, we explored the JSON file in a Jupyter notebook, then imported it into Python and worked with it using pandas.


### The dataset

The data contains information about career posts and about how to match a candidate to a particular job post, and then make further suggestions 
about the likelihood of a fit. There are quite a few questions we could answer using the dataset, including:

    [1.] What is the total number of profile ids present?
    [2.] What are the most common skills, education, and courses people search for?
    [3.] What are the most common primary job titles and recent job titles?
    [4.] What are the most common employer names, future job titles, etc.?
    
Since we assume that we don't know the structure of the JSON file upfront, we do some exploration to figure it out. This task used Jupyter Notebook for the exploration.


## Exploring the JSON data

The first thing we do is take a look at the first few lines of the data set.

In [131]:
%%bash
# path to the data set
head ../NumPy/datasets/data_science_extract.json

[
  {
    "id": 1,
    "careerjunction_za_historical_jobtitles": [
      "Marketer & Technical Liaison",
      "Quality Assurance Manager Haccp Team Leader",
      "New Product Developer Technologist",
      "Food Technologist",
      "Quality Controller"
    ],


We can tell that the JSON data is a list of dictionaries, and it is well formatted. 
We can also see that:

#### "id", "careerjunction_za_historical_jobtitles", "careerjunction_za_primary_jobtitle",
#### "careerjunction_za_employer_names", "careerjunction_za_skills", "careerjunction_za_courses",
#### "careerjunction_za_recent_jobtitles", "careerjunction_za_future_jobtitles"

are top-level keys, and they are indented four spaces. We get all of the top-level keys by using the grep command to print any lines that have four leading spaces:

We can see from the data set that the top-level keys act as the column headers. A list of dictionaries makes up the data set, and it contains each record in the job profile dataset. 
Each dictionary is a record, and the first record appears in the output from the grep command.

We print out the full key structure of the JSON file by using grep to print out any lines with 2-6 leading spaces:

This shows us the full key structure associated with data_science_extract.json, and tells us which parts of the JSON file are relevant for us.
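
The same key structure can also be inspected from Python. The sketch below is one possible approach, not the method used in the notebook: it maps each top-level key to the value types seen for it. The `sample` string is a hypothetical stand-in for the first two records of the file, with values taken from the outputs in this document.

```python
import json
from collections import defaultdict

# Hypothetical stand-in for the start of data_science_extract.json.
sample = '''[
  {"id": 1,
   "careerjunction_za_primary_jobtitle": "Senior Food Technologist",
   "careerjunction_za_recent_jobtitles": ["Food Technologist", "Product Specialist Microbiology"]},
  {"id": 2,
   "careerjunction_za_primary_jobtitle": "Senior Developer",
   "careerjunction_za_recent_jobtitles": ["Senior Developer", "Senior Developer"]}
]'''

records = json.loads(sample)

# Map each top-level key to the set of value types observed for it.
key_types = defaultdict(set)
for record in records:
    for key, value in record.items():
        key_types[key].add(type(value).__name__)

for key in sorted(key_types):
    print(key, sorted(key_types[key]))
```

A key that maps to more than one type would be a hint that the records are not uniform and need cleaning before they go into a DataFrame.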

### Extracting information on the columns

Now that we know which keys contain information on the columns, we read that information in. 
We assumed that the JSON file might not fit in memory, in which case we could not read it in directly using the json library 
and would instead read it in iteratively, in a memory-efficient way. The cell below loads it with json.load, which reads the whole file at once.

In [199]:
import json  # standard-library JSON parser; json.load reads the whole file into memory at once
import sys
import pandas as pd
import numpy as np
from pandas import DataFrame
from pandas.io.json import json_normalize #package for flattening json in pandas df
filename = "../NumPy/datasets/data_science_extract.json"


def js_data(filename):
    """Load the JSON file and return its records as a list of dictionaries."""
    with open(filename, 'r') as f_in:
        objects = json.load(f_in)
    columns = list(objects)
    return columns

#https://medium.com/@gis10kwo/converting-nested-json-data-to-csv-using-python-pandas-dc6eddc69175    
if __name__ == "__main__":
    columns = js_data(filename)
    #print(columns)

# Copy every record into selected_row (it was never initialised before)
selected_row = []
for row in columns:
    selected_row.append(row)
column_headers = len(selected_row)  # number of records read

all_rows = []
for i in selected_row:
    all_rows.append(i)
print(all_rows[0])



{u'careerjunction_za_historical_jobtitles': [u'Marketer & Technical Liaison', u'Quality Assurance Manager Haccp Team Leader', u'New Product Developer Technologist', u'Food Technologist', u'Quality Controller'], u'careerjunction_za_primary_jobtitle': u'Senior Food Technologist', u'careerjunction_za_skills': [u'Microbiology', u'microsoft powerpoint', u'microsoft office', u'microsoft excel', u'microsoft project management', u'Microsoft word', u'Outlook', u'Internet explorer', u'Marketing/Sales', u'Quality Control', u'Quality Assurance', u'Research and development', u'Problem solving'], u'careerjunction_za_courses': [u'Btech: Food Technology', u'National Diploma: Food Technology', u'Senior Certificate'], u'careerjunction_za_employer_names': [u'Cape Herb & Spice', u'Greys Marine', u'Heinz Foods', u'Swift Silliker', u'Zemcor'], u'careerjunction_za_recent_jobtitles': [u'Food Technologist', u'Product Specialist Microbiology'], u'careerjunction_za_future_jobtitles': [u'Food technologist', u'New

[u'S']

In [133]:
msgs = pd.io.json.json_normalize(columns)
msgs.dtypes

careerjunction_za_courses                 object
careerjunction_za_employer_names          object
careerjunction_za_future_jobtitles        object
careerjunction_za_historical_jobtitles    object
careerjunction_za_primary_jobtitle        object
careerjunction_za_recent_jobtitles        object
careerjunction_za_skills                  object
id                                         int64
dtype: object
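
Because the list-valued columns are stored with dtype object, counting the most common skills, courses or job titles means flattening those lists first. The following is a minimal sketch using `collections.Counter`; the helper name `most_common_values` and the third record are hypothetical, and lowercasing the values is an assumption made here so that spellings such as "microsoft excel" and "Microsoft Excel" are counted together.

```python
from collections import Counter

# Hypothetical records shaped like the dataset; record 3 is made up for illustration.
records = [
    {"id": 1, "careerjunction_za_skills": ["Microbiology", "microsoft excel", "Quality Control"]},
    {"id": 2, "careerjunction_za_skills": ["C#", "JQuery", "SSRS"]},
    {"id": 3, "careerjunction_za_skills": ["Microsoft Excel", "C#"]},
]


def most_common_values(records, field, top_n=3):
    """Count the values of a list-valued field across all records, ignoring case."""
    counts = Counter()
    for record in records:
        # A record without the field contributes nothing.
        for value in record.get(field, []):
            counts[value.strip().lower()] += 1
    return counts.most_common(top_n)


top_skills = most_common_values(records, "careerjunction_za_skills")
print(top_skills)
```

The same helper could be pointed at `careerjunction_za_courses` or `careerjunction_za_employer_names` to answer the other questions listed earlier.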

In [134]:
# Comparing list in the row of each column
'''
for i in selected_row[0]['careerjunction_za_recent_jobtitles']:
    if i in selected_row[1]['careerjunction_za_recent_jobtitles']:
        print True
    else:
        print False

'''

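def compare_listcomp(row_x, row_y):
    # Position-wise comparison: keeps items that are equal at the same index in both lists
    return [i for i, j in zip(row_x, row_y) if i == j]


def compare_intersect(row_x, row_y):
    # Order-independent comparison: items that occur in both lists
    return frozenset(row_x).intersection(row_y)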


compare_listcomp(selected_row[0]['careerjunction_za_recent_jobtitles'], selected_row[1]['careerjunction_za_recent_jobtitles'])
compare_intersect(selected_row[0]['careerjunction_za_recent_jobtitles'], selected_row[1]['careerjunction_za_recent_jobtitles'])

frozenset()

In [166]:
# Given profile ID

def profile_id(selected_row):
    column_headers = selected_row
    return column_headers

print (profile_id(selected_row[0]['careerjunction_za_recent_jobtitles']))
print (profile_id(selected_row[1]['careerjunction_za_recent_jobtitles']))

def look_up_similar(selected_row1, selected_row2):
    """Split the items of the first list into those found in the second list and those not."""
    get_similar = []
    get_not_similar = []
    for i in selected_row1:
        if i in selected_row2:
            get_similar.append(i)
        else:
            get_not_similar.append(i)
    return get_similar, get_not_similar

print(look_up_similar(selected_row[0]['careerjunction_za_recent_jobtitles'], selected_row[1]['careerjunction_za_recent_jobtitles']))



[u'Food Technologist', u'Product Specialist Microbiology']
[u'Senior Developer', u'Senior Developer']
([], [u'Food Technologist', u'Product Specialist Microbiology'])
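
The overlap computed above can be turned into a score for ranking similar profiles. The sketch below uses Jaccard similarity (shared items divided by all distinct items) averaged over skills, courses and recent job titles. This is one possible scoring choice rather than the notebook's method; `jaccard`, `similar_profiles`, `FIELDS` and the third sample record are hypothetical, and the code assumes Python 3 division.

```python
FIELDS = [
    "careerjunction_za_skills",
    "careerjunction_za_courses",
    "careerjunction_za_recent_jobtitles",
]


def jaccard(a, b):
    """Jaccard similarity of two lists of strings, ignoring case (0.0 if both are empty)."""
    set_a = {x.lower() for x in a}
    set_b = {x.lower() for x in b}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def similar_profiles(profile_id, records, top_n=5):
    """Rank the other profiles by their mean Jaccard similarity over FIELDS."""
    by_id = {r["id"]: r for r in records}
    target = by_id[profile_id]
    scores = []
    for other in records:
        if other["id"] == profile_id:
            continue
        total = sum(jaccard(target.get(f, []), other.get(f, [])) for f in FIELDS)
        scores.append((other["id"], total / len(FIELDS)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Records 1 and 2 are shortened from the outputs above; record 3 is made up.
records = [
    {"id": 1, "careerjunction_za_skills": ["Microbiology", "Quality Control", "Quality Assurance"],
     "careerjunction_za_courses": ["Btech: Food Technology"],
     "careerjunction_za_recent_jobtitles": ["Food Technologist"]},
    {"id": 2, "careerjunction_za_skills": ["C#", "JQuery"],
     "careerjunction_za_courses": ["B.Econ"],
     "careerjunction_za_recent_jobtitles": ["Senior Developer"]},
    {"id": 3, "careerjunction_za_skills": ["Quality Control", "Quality Assurance"],
     "careerjunction_za_courses": ["National Diploma: Food Technology"],
     "careerjunction_za_recent_jobtitles": ["Food technologist"]},
]

ranked = similar_profiles(1, records)
print(ranked)
```

Averaging gives each field equal weight; weighting skills more heavily than courses would be an equally defensible choice.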


In [198]:
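# Stream the records one at a time and keep only the columns of interest
import ijson

good_columns = ["id", "careerjunction_za_courses", "careerjunction_za_skills",
                "careerjunction_za_recent_jobtitles", "careerjunction_za_primary_jobtitle"]

rows = []
with open(filename, 'rb') as f:
    # 'item' matches each element of the top-level JSON array
    for row in ijson.items(f, 'item'):
        # Each row is a dictionary, so look the columns up by key
        rows.append([row.get(item) for item in good_columns])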

In [229]:
def new_source(dict_l):
    """Return a copy of the record restricted to the columns of interest."""
    good_columns = ["id","careerjunction_za_courses", "careerjunction_za_skills", 
                "careerjunction_za_recent_jobtitles","careerjunction_za_primary_jobtitle"]
    return {k: v for k, v in dict_l.items() if k in good_columns}

new_source(all_rows[1])

{u'careerjunction_za_courses': [u'B.Econ', u'Grade 12/Matric'],
 u'careerjunction_za_primary_jobtitle': u'Senior Developer',
 u'careerjunction_za_recent_jobtitles': [u'Senior Developer',
  u'Senior Developer'],
 u'careerjunction_za_skills': [u'MVC5',
  u'JQuery',
  u'C#',
  u'BootStrap',
  u'REST Services',
  u'EntityFrameWork 6',
  u'SQL Databse Development',
  u'SSRS',
  u'SSIS'],
 u'id': 2}
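
With each profile reduced to a few fields, the third goal (recommending next job titles for a profile id) could be sketched as a simple vote: look at other profiles with the same primary job title and count the future job titles they list. This heuristic is an assumption made for illustration, not something the dataset or the notebook prescribes; `recommend_next_titles` and all four sample records are hypothetical.

```python
from collections import Counter

PRIMARY = "careerjunction_za_primary_jobtitle"
FUTURE = "careerjunction_za_future_jobtitles"


def recommend_next_titles(profile_id, records, top_n=3):
    """Suggest titles that profiles with the same primary job title aim for."""
    by_id = {r["id"]: r for r in records}
    current = (by_id[profile_id].get(PRIMARY) or "").lower()
    votes = Counter()
    for other in records:
        if other["id"] == profile_id:
            continue
        if (other.get(PRIMARY) or "").lower() != current:
            continue
        for title in other.get(FUTURE, []):
            title = title.lower()
            if title != current:  # staying in the same role is not a "next" title
                votes[title] += 1
    return [title for title, _ in votes.most_common(top_n)]


# Made-up records for illustration only.
records = [
    {"id": 1, PRIMARY: "Senior Developer", FUTURE: []},
    {"id": 2, PRIMARY: "Senior Developer", FUTURE: ["Technical Lead", "Software Architect"]},
    {"id": 3, PRIMARY: "senior developer", FUTURE: ["Technical Lead"]},
    {"id": 4, PRIMARY: "Food Technologist", FUTURE: ["Quality Manager"]},
]

suggestions = recommend_next_titles(1, records)
print(suggestions)
```

Exact title matching is brittle on free-text data; matching on the similar profiles found earlier instead of on identical primary titles would be a natural refinement.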

{'Date': '2013-05-01', 'Product': 'Toys', 'Price': '$10'}


In [211]:
# Assumes d is a dictionary and lis is a list of the keys to keep.
# On Python 2, d.keys() returns a list, so convert it to a set before intersecting.
n = {k: d[k] for k in set(d.keys()) & set(lis)}
print(n)