# Jobs Recommendation System Extrapolating

### Fields
1. id - The unique identifier for the profile
2. careerjunction_za_primary_jobtitle - The most recent job title of the profile
3. careerjunction_za_recent_jobtitles - The next job titles after the most recent one (max 2)
4. careerjunction_za_historical_jobtitles - All other job titles after recent ones (from the 4th job title)
5. careerjunction_za_future_jobtitles - Job titles the seeker would like to have as their next job (ambitions)
6. careerjunction_za_employer_names - All employers worked for
7. careerjunction_za_skills - All the skills
8. careerjunction_za_courses - Titles for education/courses
 
What we want:
- Any insight that can be extrapolated from the data
- Given a profile id, find similar profiles, based on a combination of similar skills, courses and/or job titles
- Given a profile id, recommend what their next job title(s) could be


As the amount of captured data grows, it tends to lose the rigid structure that a relational database expects.
The JSON file lists over 1000 separate events. Each event has different fields, and some of the fields are nested within other fields. 
This type of data is hard to store in a regular SQL database, so it is often stored in a format called JavaScript Object Notation (JSON). 
JSON encodes data structures such as lists and dictionaries as strings that machines can read easily. Even though the name starts with the word JavaScript, 
it is just a format, and it can be read from any language.

Python has great JSON support, with the json library. We can both convert lists and dictionaries to JSON, and convert strings to lists and dictionaries. 
JSON data looks much like a dictionary would in Python, with keys and values stored.
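
As a minimal sketch of that round trip (the record below is invented for illustration and is not taken from the dataset), `json.dumps` turns a dictionary into a JSON string and `json.loads` turns it back:

```python
import json

# A hypothetical profile record, invented for illustration only
profile = {
    "id": 1,
    "careerjunction_za_primary_jobtitle": "Junior Developer",
    "careerjunction_za_skills": ["Python", "SQL"],
}

# Dictionary -> JSON string
encoded = json.dumps(profile)

# JSON string -> dictionary
decoded = json.loads(encoded)

# The round trip preserves the keys, the string values and the nested list
print(decoded == profile)
```

Nested lists survive the conversion unchanged, which is why the list-valued fields of a profile can be handled as ordinary Python lists after loading.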

In this task, we explore the JSON file in a Jupyter notebook, then import it into Python and work with it using Pandas.


### The dataset

The data contains information about career posts and about how to match a candidate to a particular job post, and then make further suggestions 
about the likelihood of that post. There are quite a few questions we could answer using the dataset, including:

    [1.] How many profile IDs are present in the dataset?
    [2.] What are the most common skills, education, and courses people search for?
    [3.] What are the most common primary job titles and recent job titles?
    [4.] What are the most common employer names, future job titles, etc.?
    
Since we assume that we don't know the structure of the JSON file upfront, we do some exploration to figure it out. This task used Jupyter Notebook for the exploration.
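
Once the records are loaded as a list of dictionaries, questions like these reduce to counting. The sketch below is one possible way to do it with `collections.Counter`; the helper name `most_common_values` and the sample records are hypothetical and only mimic the field names of the dataset.

```python
from collections import Counter

# Hypothetical records shaped like the dataset fields (values are invented)
sample_profiles = [
    {"id": 1, "careerjunction_za_skills": ["Python", "SQL"]},
    {"id": 2, "careerjunction_za_skills": ["SQL", "Linux"]},
    {"id": 3, "careerjunction_za_skills": ["SQL", "Python"]},
]


def most_common_values(records, field, top_n=3):
    """Count how often each value of a list-valued field occurs."""
    counts = Counter()
    for record in records:
        # A record without the field contributes nothing
        counts.update(record.get(field, []))
    return counts.most_common(top_n)


# Question 1: number of profiles; question 2: most common skills
total_profiles = len(sample_profiles)
top_skills = most_common_values(sample_profiles, "careerjunction_za_skills")
print(total_profiles)
print(top_skills)
```

The same helper could be pointed at any other list-valued field, such as courses or future job titles.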


## Exploring the JSON data

The first thing we do is take a look at the first few lines of the data set.

In [None]:
%%bash
# path to the data set
head ../NumPy/datasets/data_science_extract.json

We can tell that the JSON data is a list of dictionaries, and that it is well formatted. 
We can also see that the following are top-level keys, indented three spaces:

- "id"
- "careerjunction_za_historical_jobtitles"
- "careerjunction_za_primary_jobtitle"
- "careerjunction_za_employer_names"
- "careerjunction_za_skills"
- "careerjunction_za_courses"
- "careerjunction_za_recent_jobtitles"
- "careerjunction_za_future_jobtitles"

We get all of the top-level keys by using the grep command to print any lines that have three leading spaces.

We can see from the data set that the top-level keys are in the header. A list of lists appears to be associated with the data set, and this likely contains each record in the job profile dataset. 
Each inner list is a record, and the first record appears in the output from the grep command.

We print out the full key structure of the JSON file by using grep to print out any lines with 2-6 leading spaces.

This shows us the full key structure associated with data_science_extract.json, and tells us which parts of the JSON file are relevant for us.
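
A rough Python counterpart to this exploration, assuming the records are already available as a list of dictionaries, is to count how many records carry each key and which value types appear. The function name `key_structure` and the sample records are hypothetical.

```python
from collections import Counter


def key_structure(records):
    """Return (records per key, value type names seen per key)."""
    key_counts = Counter()
    value_types = {}
    for record in records:
        for key, value in record.items():
            key_counts[key] += 1
            value_types.setdefault(key, set()).add(type(value).__name__)
    return key_counts, value_types


# Invented sample records; the second one has an extra field
sample = [
    {"id": 1, "careerjunction_za_skills": ["Python"]},
    {"id": 2, "careerjunction_za_skills": [], "careerjunction_za_courses": ["BSc"]},
]

key_counts, value_types = key_structure(sample)
for key in sorted(key_counts):
    print("%s: %d record(s), types %s"
          % (key, key_counts[key], sorted(value_types[key])))
```

Keys that appear in only some records, or with more than one value type, are the ones that need special handling later.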

### Extracting information on the columns

Now that we know which keys contain information on the columns, we read that information in. 
If the JSON file were too large to fit in memory, we could not read it directly with the json library and would have to parse it iteratively in a memory-efficient way. 
This extract is small enough to load in one call, so the code below reads it with json.load.

In [99]:
import json #  standard-library JSON parser; json.load reads the whole file at once
import sys
import pandas as pd
import numpy as np
from pandas import DataFrame
from IPython.display import Image
from pandas.io.json import json_normalize #package for flattening json in pandas df
filename = "../NumPy/datasets/data_science_extract.json"


def js_data(filename):
    # Open the JSON file and parse its whole contents into Python objects
    with open(filename, 'r') as f_in:
        columns = json.load(f_in)
    return columns


# Reformat columns to dictionary with profile id as key
def reformat_to_dict(columns):
    profiles = {}
    for c in columns:
        profiles[c['id']] = c 
    return profiles

# Given profile ID
def profile_id(columns):
    profile_ids = [c['id'] for c in columns]
    return profile_ids


#https://medium.com/@gis10kwo/converting-nested-json-data-to-csv-using-python-pandas-dc6eddc69175    
if __name__ == "__main__":
    columns = js_data(filename)
    profiles = reformat_to_dict(columns)
    #print profiles[1]

    # Copy the parsed records into lists for positional access
    selected_row = list(columns)
    column_headers = len(selected_row)  # number of records (profiles)
    #print column_headers
    all_rows = list(selected_row)
    #print all_rows[0:3]



# print profile_id(columns)
# print "The length of the column is equal to the total number of profile_id within the dataset {}".format(len(columns))
# msgs =json_normalize(columns)
# msgs.dtypes
# print selected_row[0]['careerjunction_za_skills']
# print selected_row[5]['careerjunction_za_skills']

In [None]:
def new_source(dict_l):
    # Keep only the fields used for the comparison
    good_columns = ["id","careerjunction_za_courses", "careerjunction_za_skills", 
                "careerjunction_za_recent_jobtitles","careerjunction_za_primary_jobtitle"]
    return {k: v for k, v in dict_l.items() if k in good_columns}
new_source(all_rows[3])

#### Extracting all values from the dictionaries in the all_rows list



In [None]:
# Extract the values of a record as a list
def extract_values(count_lst_id):
    # list() keeps the result indexable on both Python 2 and 3
    return list(count_lst_id.values())

(value_lst_count0, value_lst_count1, value_lst_count2,
 value_lst_count3, value_lst_count4, value_lst_count5) = [
    extract_values(row) for row in all_rows[:6]]

# Read each id by key; its position within values() is not guaranteed
(pro_id_0, pro_id_1, pro_id_2,
 pro_id_3, pro_id_4, pro_id_5) = [row['id'] for row in all_rows[:6]]




print "[%s]" % ", ".join(map(str, value_lst_count0)) 

def extract_dict_keys(dict_keys):
    return map(str,dict_keys.keys())

dict_keys = extract_dict_keys(profiles[2]) # remove the u' unicode
#print dict_keys
#print map(str, dict_keys)



from collections import Counter

def subsequence_counts_2(sequences):
    # Count every contiguous substring of each joined sequence
    counts = Counter()
    for sequence in sequences:
        text = "".join(sequence)
        for j in range(1, len(text) + 1):
            # j is the substring length; update() adds to the counts in place
            counts.update(text[i:i+j] for i in range(len(text) - (j - 1)))
    return counts



### Finding the similarity

Find the profiles that are similar to a given profile.

### Dictionary Comparison
Here I compare the dictionaries in the dataset to find profile similarity: for a given profile id, I compare the recent job titles, skills, and course/education lists within each dictionary, 
and I then use that to determine the job title similarities.

#### A simple approach:

1. Make a new dict with the ids as keys; let's call it source.
2. The value of each source entry is also a dict keyed by the other ids (you are building a matrix); let's call it target.
3. Fill the entry for the source id at the target id with a count of the matching comparisons.
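
The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's own implementation: `overlap_matrix` is a hypothetical helper, the sample profiles are invented, and the count stored in each cell is the number of values two profiles share for one list-valued field.

```python
def overlap_matrix(profiles, field):
    """Build {source_id: {target_id: shared_count}} for one list-valued field."""
    matrix = {}
    for source_id, source in profiles.items():
        source_values = set(source.get(field, []))
        matrix[source_id] = {}
        for target_id, target in profiles.items():
            if target_id == source_id:
                continue  # a profile is not compared with itself
            target_values = set(target.get(field, []))
            matrix[source_id][target_id] = len(source_values & target_values)
    return matrix


# Invented profiles keyed by id, in the shape produced by reformat_to_dict
sample_profiles = {
    1: {"id": 1, "careerjunction_za_skills": ["Python", "SQL"]},
    2: {"id": 2, "careerjunction_za_skills": ["SQL", "Linux"]},
    3: {"id": 3, "careerjunction_za_skills": ["Food Safety"]},
}

skill_matrix = overlap_matrix(sample_profiles, "careerjunction_za_skills")
print(skill_matrix[1])
```

Comparing every pair costs time quadratic in the number of profiles, which is acceptable for an extract of this size but would need rethinking for a much larger dataset.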

### Profile Similarity Measure
Similarity is measured in the range 0 to 1, [0,1]. Jaccard similarity suits this data because each profile field is a list of discrete values rather than a dense or continuous vector.
This project uses a metric to find the similarity between job seeker profiles, where the profiles are treated as points, vectors, or sets.
We consider the Jaccard similarity, in which each profile is a set. The diagram below shows the sets, their cardinality, intersection, and union.

![Jaccard similarity: sets, cardinality, intersection, and union](https://i1.wp.com/dataaspirant.com/wp-content/uploads/2015/04/jaccaard2.png)

The Jaccard similarity measures the similarity between finite sample sets and is defined as the cardinality of the intersection of the sets divided by the cardinality of the union of the sets. To find the Jaccard similarity between two sets A and B, take the ratio of the cardinality of A ∩ B to the cardinality of A ∪ B.

![Jaccard similarity formula](https://i2.wp.com/dataaspirant.com/wp-content/uploads/2015/04/jaccaard3.png)
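
As a worked instance of this definition, take the two lists that the notebook later passes to `jaccard_similarity`, namely [0,1,2,5,6] and [0,2,3,5,7,9]. Treating them as sets A and B:

```latex
A = \{0, 1, 2, 5, 6\}, \qquad B = \{0, 2, 3, 5, 7, 9\}

A \cap B = \{0, 2, 5\}, \qquad A \cup B = \{0, 1, 2, 3, 5, 6, 7, 9\}

J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{3}{8} = 0.375
```

This agrees with the value 0.375 printed in the notebook output.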

#### Two main considerations about our similarity:

* Similarity (full intersection) = 1 if X = Y         (where X and Y are the sets built from two profiles)
* Similarity (null intersection) = 0 if X ∩ Y = ∅    (the two profiles share nothing)
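
With a score bounded in this way, the profiles can be ranked against a given profile id, and the future job titles of the closest matches are one possible source of next-title suggestions. The sketch below is an assumption-laden illustration, not the notebook's method: the choice of compared fields, the unweighted mean, the helper names and the sample profiles are all hypothetical.

```python
from collections import Counter

# Which fields to combine is an assumption made for this sketch
COMPARED_FIELDS = (
    "careerjunction_za_skills",
    "careerjunction_za_courses",
    "careerjunction_za_recent_jobtitles",
)


def jaccard(a, b):
    """Jaccard similarity; 0.0 when both collections are empty (a convention)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / float(len(union))


def profile_similarity(p, q):
    """Unweighted mean of the per-field Jaccard scores."""
    scores = [jaccard(p.get(f, []), q.get(f, [])) for f in COMPARED_FIELDS]
    return sum(scores) / len(scores)


def most_similar(profile_id, profiles, top_n=3):
    """Return [(other_id, score)] sorted by decreasing similarity."""
    target = profiles[profile_id]
    scored = [(other_id, profile_similarity(target, other))
              for other_id, other in profiles.items()
              if other_id != profile_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def suggest_next_titles(profile_id, profiles, top_n=3):
    """Tally the future job titles of the nearest profiles with a non-zero score."""
    votes = Counter()
    for other_id, score in most_similar(profile_id, profiles, top_n):
        if score > 0:
            votes.update(
                profiles[other_id].get("careerjunction_za_future_jobtitles", []))
    return [title for title, _ in votes.most_common()]


# Invented profiles keyed by id
sample_profiles = {
    1: {"id": 1,
        "careerjunction_za_skills": ["Python", "SQL"],
        "careerjunction_za_courses": ["BSc Computer Science"],
        "careerjunction_za_recent_jobtitles": ["Junior Developer"],
        "careerjunction_za_future_jobtitles": ["Senior Developer"]},
    2: {"id": 2,
        "careerjunction_za_skills": ["Python", "SQL", "Linux"],
        "careerjunction_za_courses": ["BSc Computer Science"],
        "careerjunction_za_recent_jobtitles": ["Developer"],
        "careerjunction_za_future_jobtitles": ["Software Architect"]},
    3: {"id": 3,
        "careerjunction_za_skills": ["Food Safety"],
        "careerjunction_za_courses": ["National Diploma: Food Technology"],
        "careerjunction_za_recent_jobtitles": ["Food Technologist"],
        "careerjunction_za_future_jobtitles": ["Quality Manager"]},
}

print(most_similar(1, sample_profiles))
print(suggest_next_titles(1, sample_profiles))
```

Note that exact string matching treats "javascript" and "Javascript" as different skills, so normalising case before building the sets may be worth considering.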

In [112]:
def jaccard_similarity(x,y):
    # Compare the two collections as sets
    set_x, set_y = set(x), set(y)
    union_cardinality = len(set_x | set_y)
    if union_cardinality == 0:
        return 0.0  # both empty: treated here as no similarity (a convention)
    intersection_cardinality = len(set_x & set_y)
    return intersection_cardinality/float(union_cardinality)
 


# Class to look up a key that must exist in both profiles
class DictionaryIntersection(object):
    def __init__(self,dictA,dictB):
        self.dictA = dictA
        self.dictB = dictB

    def __getitem__(self,attr):
        # Return the pair of values, one from each dictionary
        if attr not in self.dictA or attr not in self.dictB:
            raise KeyError('Not in both dictionaries,key: %s' % attr)
        return self.dictA[attr],self.dictB[attr]

# Getting the profile ID and its elements, grouped by profile ID
# Given a profile ID, return it if it exists in the dataset (None otherwise)
def get_profile__id(id):
    if id in profile_id(columns):
        return id
    return None

def similar(data_a,data_b):
    # "similar" when any list-valued field shared by both profiles
    # has at least one entry in common, otherwise "not similar"
    for k, v in data_b.items():
        other = data_a.get(k)
        if isinstance(v, list) and isinstance(other, list):
            if set(v) & set(other):
                return "similar"
    return "not similar"

# finding next job title

def next_jobtitle(profile):
    # Wrapped in a list to keep the original return shape; empty if the key is absent
    key = 'careerjunction_za_future_jobtitles'
    return [profile[key]] if key in profile else []

def course(profile):
    # Same shape as next_jobtitle, for the course/education titles
    key = 'careerjunction_za_courses'
    return [profile[key]] if key in profile else []
print(course(profiles[1]))

# Compare first profile against second profile
# Returns True only if the two profile dictionaries are 100% identical
def compare_profile(profile_id,other_profiles):
    # Recursive helper that compares two values, descending into dictionaries
    def compare(data_a,data_b):
        # type: dictionary
        if (type(data_a) is dict):
            # is [data_b] a dictionary?
            if (type(data_b) != dict):
                return False
            # iterate over dictionary keys
            for dict_key,dict_value in data_a.items():
                # check if key exists in [data_b] dictionary, and same value?
                if ((dict_key not in data_b) or (not compare(dict_value,data_b[dict_key]))):
                    return False
            # dictionary identical
            return True
        # simple value - compare both value and type for equality
        return ((data_a == data_b) and (type(data_a) is type(data_b)))
    # compare a to b, then b to a
    return (compare(profile_id,other_profiles) and compare(other_profiles,profile_id))
    
def finding_similar(profile,profiles):
    # Collect the ids of all profiles identical to `profile`;
    # `profiles` is the id -> profile dict built by reformat_to_dict
    result = [other["id"] for other in profiles.values()
              if compare_profile(profile,other)]
    return result if result else "cannot be found"

#find similar skills in candidate profiles
def skill_similar(selected_row1, selected_row2):
    # Split the first profile's skills into shared and not shared with the second
    get_similar = [i for i in selected_row1 if i in selected_row2]
    get_not_similar = [i for i in selected_row1 if i not in selected_row2]
    return get_similar, get_not_similar

# Comparing list in the row of each column, position by position
def compare_listcomp(row_x, row_y):
    return [i for i, j in zip(row_x, row_y) if i == j]
#print compare_listcomp(selected_row[3]['careerjunction_za_recent_jobtitles'], selected_row[7]['careerjunction_za_recent_jobtitles'])


#find skill or course intersection
def compare_intersect(row_x, row_y):
    return frozenset(row_x).intersection(row_y)
#print compare_intersect(selected_row[3]['careerjunction_za_skills'], selected_row[7]['careerjunction_za_skills'])

# Report the fields where two profiles differ, as (d1 value, d2 value) pairs
def dict_diff(d1, d2, NO_KEY=''): 
    set_d1 = set(d1.keys()) 
    set_d2 = set(d2.keys())
    both = set_d1 & set_d2 
    diff = {k:(d1[k], d2[k])for k in both if d1[k] != d2[k]}
    diff.update({k:(d1[k], NO_KEY) for k in set_d1 - both}) 
    diff.update({k:(NO_KEY, d2[k]) for k in set_d2 - both}) 
    return diff



[[u'Btech: Food Technology', u'National Diploma: Food Technology', u'Senior Certificate']]


In [113]:
print jaccard_similarity([0,1,2,5,6],[0,2,3,5,7,9])
print"\n"
print('Job seekers profile is not 100% identical: {0}'.format(compare_profile(profiles[1],profiles[2])))
print '\n'
job_seeker_intersect = DictionaryIntersection(profiles[get_profile__id(3)],profiles[get_profile__id(7)])
jobseeker = job_seeker_intersect['careerjunction_za_recent_jobtitles']
print "Most recent:", similar(profiles[get_profile__id(3)],profiles[get_profile__id(7)]), "in most recent job that inculde:",jobseeker
print '\n'
print "Next job title include:",next_jobtitle(profiles[get_profile__id(7)])
print '\n'
print "Your skill is:", skill_similar(selected_row[3]['careerjunction_za_skills'],selected_row[7]['careerjunction_za_skills'])
print '\n'
print "Combine profile:", dict_diff(profiles[3], profiles[7], NO_KEY='Profile ID not found')

0.375


Job seekers profile is not 100% identical: False


Most recent: similar in most recent job that inculde: ([u'Junior Developer'], [u'Senior Python Developer', u'Systems Developer'])


Next job title include: [[u'Software Architect', u'Python Developer', u'Senior Python Developer']]


Your skill is: ([], [u'Programming', u'Technical Support'])


Combine profile: {u'careerjunction_za_historical_jobtitles': ([], [u'Senior Developer', u'Senior Developer', u'Developer', u'Developer', u'Web Development', u'System Administrator', u'Apprentice Toolmaker']), u'careerjunction_za_primary_jobtitle': (u'Social Manager', u'Senior Developer'), u'careerjunction_za_skills': ([u'C# Developer', u'MYSQL', u'PHP', u'javascript', u'CSS3', u'HTML5', u'wordpress', u'AJAX', u'RDBMS', u'Magento', u'C++ Developer', u'JAVA Developer'], [u'PHP', u'SQL', u'Javascript', u'Linux', u'Python', u'PostgreSQL', u'AWS', u'Embedded Linux', u'GIT', u'SOA', u'HA Proxy', u'Gerrit', u'Sentry', u'C', u'.Net', u'AWS RDS', 

In [9]:
def extract_recentjob(dict_1, dict_2):
    # Collect the recent job titles of both profiles
    # (None for a profile that lacks the field)
    key = 'careerjunction_za_recent_jobtitles'
    return [dict_1.get(key), dict_2.get(key)]
    
#extract_recentjob(profiles[3],profiles[7])    

[[u'Food technologist',
  u'New product development',
  u'auditor',
  u'inspections']]

In [None]:
def next_jobtile(profile):
    # Earlier draft: print the future job titles, if the profile has any
    future = profile.get('careerjunction_za_future_jobtitles')
    if future:
        print("Your next jobs are: %s" % future)
    else:
        print("No profile")
    return future
next_jobtile(profiles[1])

In [None]:
profiles[1]

In [None]:
def next_jobtitle(profile):
    # Look the field up directly instead of scanning every key
    next_job = profile.get('careerjunction_za_future_jobtitles')
    if not next_job:
        return "No profile"
    return "Your next jobs are:", next_job
next_jobtitle(profiles[1])




In [None]:
def intersect_dict(dict_a,dict_b): 
    keys_a = set(dict_a.keys())
    keys_b = set(dict_b.keys())
    intersection = keys_a & keys_b # '&' operator is used for set intersection
    return intersection