## Career-to-degree matcher
This notebook explores the best way to retrieve suitable degrees offered by ASU Online, starting from a list of potential careers.

### Setting up Elasticsearch
We run Elasticsearch locally via Docker:


```bash
docker-compose up -d elasticsearch
```

Make sure you run this before running this notebook.



Also reinstall the Python client, pinned to an 8.x release:

```bash
pip uninstall elasticsearch
pip install elasticsearch==8.13.0
```

The Docker container runs Elasticsearch 8.17.6, which requires a compatible 8.x client library.
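As a quick sanity check, the major versions of the client and the server can be compared before going further. The sketch below is a minimal example and not part of the original notebook: `same_major` and `check_client_matches_server` are hypothetical helper names, and the second function assumes the cluster is reachable at `http://localhost:9200`.

```python
def same_major(client_version: str, server_version: str) -> bool:
    """Return True when both version strings share the same major number."""
    return client_version.split(".")[0] == server_version.split(".")[0]


def check_client_matches_server(url: str = "http://localhost:9200") -> bool:
    """Compare the installed client version with the running server version."""
    # Imported here so the pure helper above works without the library installed.
    import elasticsearch
    from elasticsearch import Elasticsearch

    server_version = Elasticsearch(url).info()["version"]["number"]
    return same_major(elasticsearch.__versionstr__, server_version)


# The versions mentioned above: client 8.13.0 against server 8.17.6.
print(same_major("8.13.0", "8.17.6"))  # True
print(same_major("7.17.0", "8.17.6"))  # False
```

Calling `check_client_matches_server()` once the container is up should return `True` for the setup described here.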


### Other python libraries
Embeddings are computed with the `fastembed` library, which you can install via:

```bash
pip install fastembed
```

In [1]:
import json
from tqdm.auto import tqdm
from fastembed import TextEmbedding
from elasticsearch import Elasticsearch



In [2]:
# load the degree data from that json file
degree_data_filename = "../datasets/degree_data_all_wCareers.json"
with open(degree_data_filename, "r") as f:
    documents = json.load(f)

In [3]:
# set up embedding model
model_name = 'jinaai/jina-embeddings-v2-small-en'
model = TextEmbedding(model_name=model_name)

In [18]:
# embed the data inside the documents
operations = []
for doc in tqdm(documents, desc="Embedding degrees"):
    # create two strings, one with degree description and another one with the list of careers
    title_description = doc['degreeTitle'] + " " + doc['shortDescription'] + " " + doc['longDescription']
    careers = ", ".join(doc['careers'])

    # embed them into vectors
    doc['description_vector'] = list(model.embed(title_description))[0].tolist()
    doc['career_vector'] = list(model.embed(careers))[0].tolist()

    # append to a new list
    operations.append(doc)

Embedding degrees: 100%|██████████| 373/373 [00:14<00:00, 25.72it/s]


In [7]:
es_client = Elasticsearch(
    "http://localhost:9200",
)

In [9]:
# set up mapping
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "degreeTitle": {"type": "text"},
            "shortDescription": {"type": "text"},
            "longDescription": {"type": "text"},
            "careers": {
                "type": "keyword",
                # analyzed sub-field, needed because the search query targets "careers.text"
                "fields": {"text": {"type": "text"}}
            },
            "description_vector": {
                "type": "dense_vector",
                "dims": 512,           # Must match your model (e.g. jina-v2-small)
                "index": True,
                "similarity": "cosine"
            },
            "career_vector": {
                "type": "dense_vector",
                "dims": 512,           # Must match your model (e.g. jina-v2-small)
                "index": True,
                "similarity": "cosine"  # or "l2_norm"
            }
        }
    }
}


index_name = "degree_information"

if not es_client.indices.exists(index=index_name):
    es_client.indices.create(index=index_name, body=index_settings)
else:
    print(f"Index '{index_name}' already exists.")
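The `dims` value in the mapping has to equal the length of the vectors produced by the embedding model, otherwise Elasticsearch rejects the documents at index time. A minimal check, assuming each document already carries `description_vector` and `career_vector` as plain lists (`find_dim_mismatches` is a hypothetical helper, not part of the notebook):

```python
def find_dim_mismatches(docs, mapping_properties, vector_fields):
    """Return (doc_position, field, actual_length) for every vector whose
    length differs from the `dims` declared in the mapping."""
    mismatches = []
    for position, doc in enumerate(docs):
        for field in vector_fields:
            expected = mapping_properties[field]["dims"]
            actual = len(doc[field])
            if actual != expected:
                mismatches.append((position, field, actual))
    return mismatches


# Toy data with 4-dimensional vectors instead of the notebook's 512.
toy_properties = {
    "description_vector": {"type": "dense_vector", "dims": 4},
    "career_vector": {"type": "dense_vector", "dims": 4},
}
toy_docs = [
    {"description_vector": [0.1] * 4, "career_vector": [0.2] * 4},
    {"description_vector": [0.1] * 4, "career_vector": [0.2] * 3},  # too short
]
vector_fields = ["description_vector", "career_vector"]
print(find_dim_mismatches(toy_docs, toy_properties, vector_fields))
# [(1, 'career_vector', 3)]
```

With the notebook's own objects this would be called as `find_dim_mismatches(documents, index_settings["mappings"]["properties"], ["description_vector", "career_vector"])`, and an empty list means every vector fits the mapping.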

In [22]:
# populate Elasticsearch with the data
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

100%|██████████| 373/373 [00:01<00:00, 252.30it/s]
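Indexing one document per request is fine for 373 degrees, but the client also ships a bulk helper that sends documents in batches. A hedged sketch of that alternative follows; `build_actions` and `bulk_index` are hypothetical names, and `elasticsearch.helpers.bulk` is the only library call used.

```python
def build_actions(docs, index_name):
    """Yield one bulk action per document, in the dict shape helpers.bulk accepts."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def bulk_index(es_client, docs, index_name):
    """Send all documents through the bulk helper and return the success count."""
    # Imported lazily so build_actions can be tried without the client installed.
    from elasticsearch import helpers

    success_count, _errors = helpers.bulk(es_client, build_actions(docs, index_name))
    return success_count


# Inspect the generated actions on toy data (no cluster needed for this part).
toy_docs = [{"degreeTitle": "Degree A"}, {"degreeTitle": "Degree B"}]
actions = list(build_actions(toy_docs, "degree_information"))
print(len(actions), actions[0]["_index"])
```

In the notebook, `bulk_index(es_client, documents, index_name)` would stand in for the loop above.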


### Sample quiz response
Here you will find:
- the responses the user gave to the questions.
- the careers suggested by OpenAI.

In [None]:
answers = [
    {
        "id": 1,
        "question": "Where are you in your professional development?",
        "selections": [
            "early-career professional"
        ]
    },
    {
        "id": 2,
        "question": "What are your main interests or passions?",
        "selections": [
            "Technology",
            "Science and Research"
        ]
    },
    {
        "id": 3,
        "question": "What skills do you feel most confident in?",
        "selections": [
            "Technical Skills",
            "Analytical Skills"
        ]
    },
    {
        "id": 4,
        "question": "What type of work environment do you prefer?",
        "selections": [
            "Remote work"
        ]
    },
    {
        "id": 5,
        "question": "What are your career goals or motivations for pursuing a new degree?",
        "selections": [
            "Advancing in my current field"
        ]
    },
    {
        "id": 6,
        "question": "What industries are you most interested in working in?",
        "selections": [
            "Information Technology"
        ]
    },
    {
        "id": 7,
        "question": "What type of job roles are you most interested in pursuing?",
        "selections": [
            "Software Development",
            "Data Analysis"
        ]
    },
    {
        "id": 8,
        "question": "What aspects of a job do you find most rewarding?",
        "selections": [
            "Solving complex problems",
            "Working with cutting-edge technology"
        ]
    },
    {
        "id": 9,
        "question": "What type of side gigs or freelance work are you interested in exploring?",
        "selections": [
            "Freelance coding or development",
            "Consulting"
        ]
    },
    {
        "id": 10,
        "question": "What level of education are you aiming to achieve with your new degree?",
        "selections": [
            "Master's Degree"
        ]
    }
]

recommendations = [
    "Software Engineer",  
    "Data Scientist",
    "IT Consultant",
    "Systems Analyst",
    "Technical Project Manager",
    "DevOps Engineer",
    "Machine Learning Engineer",
    "Web Developer"
]

In [27]:
# set up user profile
user_profile = []
for answer in answers:
    user_profile.append(answer["question"] + " " + ", ".join(answer["selections"]))

user_profile = "; ".join(user_profile)
print(user_profile)

Where are you in your professional development? early-career professional; What are your main interests or passions? Technology, Science and Research; What skills do you feel most confident in? Technical Skills, Analytical Skills; What type of work environment do you prefer? Remote work; What are your career goals or motivations for pursuing a new degree? Advancing in my current field; What industries are you most interested in working in? Information Technology; What type of job roles are you most interested in pursuing? Software Development, Data Analysis; What aspects of a job do you find most rewarding? Solving complex problems, Working with cutting-edge technology; What type of side gigs or freelance work are you interested in exploring? Freelance coding or development, Consulting; What level of education are you aiming to achieve with your new degree? Master's Degree


In [28]:
# select career to search
query = recommendations[1]
print(query)

Data Scientist


In [63]:
def elastic_search(career, profile, model):
    # embed the career and the profile with the same model used for the documents
    career_vector = list(model.embed(career))[0].tolist()
    profile_vector = list(model.embed(profile))[0].tolist()

    # generate search query
    search_query = {
        "size": 5,
        "query": {
            "script_score": {
                "query": {
                    "bool": {
                        "should": [
                            {
                                "multi_match": {
                                    "query": career,
                                    "fields": ["careers.text^5", "degreeTitle^2", "longDescription"],
                                    "type": "best_fields",
                                    "fuzziness": "AUTO"
                                }
                            },
                            {
                                "multi_match": {
                                    "query": profile,
                                    "fields": ["longDescription", "shortDescription"],
                                    "type": "best_fields"
                                }
                            }
                        ],
                        "minimum_should_match": 1
                    }
                },
                "script": {
                    "source": """
                        0.25 * cosineSimilarity(params.query_vector_desc, 'description_vector') + 
                        0.75 * cosineSimilarity(params.query_vector_career, 'career_vector') + 1.0
                    """,
                    "params": {
                        "query_vector_desc": profile_vector,
                        "query_vector_career": career_vector
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    return response["hits"]["hits"]
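The `script_score` script never references `_score`, so the text matches in the `bool` query only decide which degrees are eligible; the final ranking comes from the weighted cosine similarities alone. The plain-Python sketch below mirrors that formula on toy vectors to show how the two weights interact (`cosine` and `combined_score` are illustrative names, not part of the notebook):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def combined_score(profile_vec, career_vec, doc_description_vec, doc_career_vec,
                   w_desc=0.25, w_career=0.75):
    """Same weighting as the script: 25% description, 75% careers, plus 1.0
    so the result stays non-negative (range 0.0 to 2.0)."""
    return (w_desc * cosine(profile_vec, doc_description_vec)
            + w_career * cosine(career_vec, doc_career_vec)
            + 1.0)


# Toy 2-dimensional vectors: identical career vectors, orthogonal descriptions.
score = combined_score(
    profile_vec=[1.0, 0.0], career_vec=[0.0, 1.0],
    doc_description_vec=[0.0, 1.0], doc_career_vec=[0.0, 1.0],
)
print(score)  # 1.75
```

A perfect career match with an unrelated description still scores 1.75 out of a possible 2.0, which reflects how strongly the 0.75 weight favors the career vector.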


In [64]:
search_results = elastic_search(query, user_profile, model=model)

In [65]:
for result in search_results:
    print(result['_source']['degreeTitle'])
    print(result['_source']['careers'])
    print("")

Online Master of Science in Program Evaluation and Data Analytics
['Data ambassador', 'Data analyst', 'Data fellow', 'Data programmer', 'Data science engineer', 'Database-driven website consultant', 'Program performance and evaluation manager', 'Programmer analyst', 'Senior analyst in performance assessment']

Online Master of Science in Information Technology (IT)
['Data analyst', 'Database architect', 'Data engineer', 'DevOps engineer', 'Full stack developer', 'Network/cloud engineer', 'Network forensics engineer', 'Security engineer', 'Software engineer', 'Solutions architect']

Online Master of Computer Science – Big Data Systems
['Cloud support associate', 'Data architect', 'Data engineer', 'Data scientist', 'Database administrator', 'Software development engineer']

Online Master of Science in Biological Data Science
['Biostatistician', 'Bioinformatics scientist', 'Clinical data manager', 'Computer and information research scientist', 'Database administrator', 'Environmental scie