# 1️⃣ Knowledge Base (KB) with Retrieval Application

**Vector Database (Vector DB) / Vectorstore**

- [Vectorstores](https://python.langchain.com/v0.2/docs/integrations/vectorstores/): A vector store that stores embedded data and performs similarity search.

**Resources**

- [How-to guides](https://python.langchain.com/v0.2/docs/how_to/#vector-stores): How to build a vector DB with LangChain

    1. [Elasticsearch](https://python.langchain.com/v0.2/docs/integrations/vectorstores/elasticsearch/)
    2. [Milvus](https://python.langchain.com/v0.2/docs/integrations/vectorstores/milvus/)
    3. [Chroma](https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/): [langchain-chroma](https://pypi.org/project/langchain-chroma/)

# 1.1. Environment Setup


In [1]:
from importlib.metadata import version
# !pip install langchain
# Pin langchain to 0.1.20
try:
    print('langchain package version', version('langchain'))
    assert version('langchain') == '0.1.20'
except Exception:  # package missing or a different version installed
    !pip install langchain==0.1.20

# !pip install --upgrade langchain
# print('langchain package version',version('langchain'))

langchain package version 0.1.20


In [2]:
#!pip install -qU langchain-huggingface
# Pin langchain-huggingface to 0.0.3
try:
    print('langchain-huggingface package version', version('langchain-huggingface'))
    assert version('langchain-huggingface') == '0.0.3'  # alternative: '0.2.11'
except Exception:  # package missing or a different version installed
    !pip install langchain-huggingface==0.0.3
    # alternative: 0.2.11 (if any)

# !pip install -qU langchain-huggingface
# print('langchain-huggingface package version',version('langchain-huggingface'))

langchain-huggingface package version 0.0.3


In [3]:
# Pin langchain-chroma to 0.1.3
try:
    print('langchain_chroma package version', version('langchain_chroma'))
    assert version('langchain_chroma') == '0.1.3'
except Exception:  # package missing or a different version installed
    !pip install langchain_chroma==0.1.3

# !pip install -qU langchain_chroma==0.1.3
# print('langchain_chroma package version',version('langchain_chroma'))

langchain_chroma package version 0.1.3


In [4]:
!pip install --upgrade numpy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
import re
import os
#check current directory
os.getcwd()



'/content'

In [5]:
# check folders / files in current directory
!dir

=0.21,	chroma	drive  postings.csv  sample_data


In [6]:
import langchain_chroma
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# 1.2. Import Data
**Data**
- Source: [LinkedIn Job Postings (2023 - 2024)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings)
- Full Downloaded Folder:
  - archive
    - **postings.csv** ✅ (selected for this project!)
    - mappings
      - skills.csv
      - industries.csv
    - jobs
      - salaries.csv
      - job_skills.csv
      - job_industries.csv
      - benefits.csv
    - companies
      - employee_counts.csv
      - company_specialities.csv
      - company_industries.csv
      - companies.csv



**Selected Features in postings.csv**
- **ID**:
  - job_id  

- **Main data**: embedding (encoded from texts)
  1. description
  1. skills_desc
  
- **Meta data**
  1. title
  1. location
  1. min_salary
  1. pay_period
  1. job_posting_url
  
  


## 1.2.1. Download Data Directly from Kaggle

Tutorial:
[How to Load Kaggle Datasets Directly Into Google Colab?](https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/#:~:text=By%20uploading%20API%20credentials%20and,(CLI)%20within%20Google%20Colab.)

In [7]:
import kagglehub

# # Download latest version
# path = kagglehub.dataset_download("arshkon/linkedin-job-postings")

# Download a single file
df_path = kagglehub.dataset_download('arshkon/linkedin-job-postings', path='postings.csv', force_download=True)

print("Path to dataset files:", df_path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/arshkon/linkedin-job-postings?dataset_version_number=13&file_name=postings.csv...


100%|██████████| 147M/147M [00:02<00:00, 70.6MB/s]

Extracting zip of postings.csv...





Path to dataset files: /root/.cache/kagglehub/datasets/arshkon/linkedin-job-postings/versions/13/postings.csv


In [8]:
# Unzip the file at df_path into extract_path (only needed if the download is still a zip archive)
import zipfile

def unzip_file(zip_filepath, extract_path):
    try:
        with zipfile.ZipFile(zip_filepath, 'r') as zip_ref:
            zip_ref.extractall(extract_path)
        print(f"Successfully unzipped {zip_filepath} to {extract_path}")
    except FileNotFoundError:
        print(f"Error: File not found at {zip_filepath}")
    except zipfile.BadZipFile:
        print(f"Error: Invalid zip file at {zip_filepath}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")


unzip_file(df_path, "/content")

Error: Invalid zip file at /root/.cache/kagglehub/datasets/arshkon/linkedin-job-postings/versions/13/postings.csv


In [9]:
df = pd.read_csv('/postings.csv', delimiter=',')
print(f"The data has {df.shape[0]} observations with {df.shape[1]} variables")
print(f"The variables (features) of the data:\n{df.columns}")
df.head(5)

The data has 5 observations with 5 variables
The variables (features) of the data:
Index(['job_id', 'title', 'company', 'location', 'description'], dtype='object')


Unnamed: 0,job_id,title,company,location,description
0,101,Data Scientist,TechCorp,New York,Work with large datasets and create predictive...
1,102,Machine Learning Engineer,InnoAI,San Francisco,Develop and optimize machine learning pipelines.
2,103,Data Analyst,DataWorks,Chicago,Analyze business data and create dashboards.
3,104,AI Researcher,NeuralNet,Boston,Conduct research on cutting-edge AI techniques.
4,105,Business Analyst,BizMetrics,Seattle,Interpret data to help guide strategic decisions.


In [10]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [11]:
df['job_post'] = df['description'].astype(str)
df[['description', 'job_post']].head()


Unnamed: 0,description,job_post
0,Work with large datasets and create predictive...,Work with large datasets and create predictive...
1,Develop and optimize machine learning pipelines.,Develop and optimize machine learning pipelines.
2,Analyze business data and create dashboards.,Analyze business data and create dashboards.
3,Conduct research on cutting-edge AI techniques.,Conduct research on cutting-edge AI techniques.
4,Interpret data to help guide strategic decisions.,Interpret data to help guide strategic decisions.


In [12]:
# df['job_post'] = np.where(df['skills_desc'].isna(), df['description'].astype('str'), df['skills_desc'].astype('str'))
# df[['description', 'skills_desc', 'job_post']].head()

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   job_id       5 non-null      int64 
 1   title        5 non-null      object
 2   company      5 non-null      object
 3   location     5 non-null      object
 4   description  5 non-null      object
 5   job_post     5 non-null      object
dtypes: int64(1), object(5)
memory usage: 372.0+ bytes


## 1.2.2. Select Data to Ingest into the Vector DB

**Purpose**:
- Insert data samples into the vector database (VectorDB), which serves as the knowledge base

**Note**:
- Because data ingestion is very time-consuming, I select only N*2 samples (default: N=200) for demonstration (approximately 15 min. for 400 samples).

- Half of the sample consists of posts whose job titles match keywords you choose (decide them yourself!), and the other half consists of randomly picked posts with other job titles.
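
The 50/50 split described above can be sketched with the standard library only. This is a minimal illustration, assuming each post is a dict with a `title` key; `balanced_sample` and its `seed` parameter are hypothetical names that do not appear in the notebook, which uses pandas for the same step.

```python
import random
import re

def balanced_sample(rows, keywords, n, seed=0):
    """Return up to n keyword-matching rows plus up to n non-matching rows."""
    # Case-insensitive match on any keyword, similar to str.contains(..., case=False)
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    relevant = [r for r in rows if pattern.search(r.get("title") or "")]
    others = [r for r in rows if not pattern.search(r.get("title") or "")]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    picked_relevant = rng.sample(relevant, min(n, len(relevant)))
    picked_others = rng.sample(others, min(n, len(others)))
    return picked_relevant + picked_others

rows = [
    {"title": "Data Scientist"},
    {"title": "Machine Learning Engineer"},
    {"title": "Data Analyst"},
    {"title": "AI Researcher"},
    {"title": "Business Analyst"},
]
sample = balanced_sample(rows, ["data science", "data scientist", "data analyst"], n=200)
print(len(sample))
```

Capping each half with `min(n, ...)` keeps the sketch from failing when fewer than N rows are available.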

### 💡 Customize Yourself!

In [14]:
# define the job-title keywords you are interested in
keywords = ['data science', 'data scientist', 'data analyst']

## Number of relevant / irrelevant samples
N = 200

In [15]:
# select the data with the job titles which contain the keywords you define
condition = df['title'].str.contains('|'.join(keywords), case=False, na=False)
df_ds = df[condition]
# 'case=False' makes the search case-insensitive
# 'na=False' ensures that NaN values are not considered in the search

N_ds = min(len(df_ds), N)
df_ds = df_ds.sample(n=N_ds)
print(f"There're {N_ds} samples searched according to keywords.")
df_ds.head(10)

There're 2 samples searched according to keywords.


Unnamed: 0,job_id,title,company,location,description,job_post
0,101,Data Scientist,TechCorp,New York,Work with large datasets and create predictive...,Work with large datasets and create predictive...
2,103,Data Analyst,DataWorks,Chicago,Analyze business data and create dashboards.,Analyze business data and create dashboards.


In [16]:
available = df[~condition]
n_samples = min(N, len(available))  # avoid sampling more rows than are available
df_others = available.sample(n=n_samples)
print(f"There're {len(df_others)} samples searched not contained in keywords.")
df_others.head(10)

There're 3 samples searched not contained in keywords.


Unnamed: 0,job_id,title,company,location,description,job_post
4,105,Business Analyst,BizMetrics,Seattle,Interpret data to help guide strategic decisions.,Interpret data to help guide strategic decisions.
1,102,Machine Learning Engineer,InnoAI,San Francisco,Develop and optimize machine learning pipelines.,Develop and optimize machine learning pipelines.
3,104,AI Researcher,NeuralNet,Boston,Conduct research on cutting-edge AI techniques.,Conduct research on cutting-edge AI techniques.


In [17]:
# combine the two selected data sets together
df_select = pd.concat([df_ds, df_others])

print(f"There're totally {len(df_select)} samples for inserting VectorDB.")

There're totally 5 samples for inserting VectorDB.


# 1.3 Build a VectorDB

## 1.3.1.Create a Container

**Container (aka. collection)**
- To create VectorDB, you need to create a container in it, which is a collection that stores and organizes similar types of vectors, allowing efficient, relevant searches.
- The created collection needs a specified encoder function, so it knows how to encode the data into embeddings
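
To make the relationship between a collection and its encoder concrete, here is a toy in-memory sketch. It is not how Chroma is implemented: `toy_encoder` and `ToyCollection` are hypothetical names, and the word-hashing "embedding" only stands in for a real model such as the Hugging Face encoder used in this notebook.

```python
import math

def toy_encoder(text, dim=8):
    """Hypothetical stand-in for an embedding model: hashes words into a unit vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # Deterministic bucket per word (sum of character codes modulo dim)
        vec[sum(ord(c) for c in word) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class ToyCollection:
    """Minimal collection: stores (text, embedding) per id and encodes on insert."""

    def __init__(self, name, embedding_function):
        self.name = name
        self.embedding_function = embedding_function
        self.items = {}

    def add(self, doc_id, text):
        # The collection, not the caller, turns text into a vector
        self.items[doc_id] = (text, self.embedding_function(text))

collection = ToyCollection("collection_postings", toy_encoder)
collection.add("0", "Work with large datasets and create predictive models.")
text, embedding = collection.items["0"]
print(len(embedding))
```

The point of the design is that callers hand over plain text and the collection applies the same encoder to every document and, later, to every query.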

In [18]:
# quote the version specifier so the shell does not treat < and > as redirections
!pip install -U "tokenizers>=0.21,<0.22"
# specify and download the encoder from the Hugging Face platform
!pip install sentence-transformers
encoder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
encoder



HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='sentence-transformers/all-mpnet-base-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)

In [19]:
import chromadb
collection_name = "collection_postings"

persistent_client = chromadb.PersistentClient()
# print("check all functions/attribute for Chroma:\n", dir(persistent_client))

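# Collect the names of all existing collections. Depending on the chromadb version,
# list_collections() returns Collection objects or plain names; getattr handles both.
existing_names = [getattr(c, "name", c) for c in persistent_client.list_collections()]
if collection_name not in existing_names:
    print(f"{collection_name} not in collection (DB) yet!")
else:
    print(f"{collection_name} is already in collection and would be deleted!")
    persistent_client.delete_collection(collection_name)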

print(f"Create collection: {collection_name}!")
vector_store = Chroma(
    client=persistent_client,
    collection_name=collection_name,
    embedding_function=encoder,
    persist_directory="./chroma_langchain_db",  # save data locally; remove if not necessary
)

#print("check all functions/attribute for Chroma:\n", dir(vector_store))

# if collection_name in vector_store.list_collections():
#     vector_store.delete_collection(collection_name)


collection_postings is already in collection and would be deleted!
Create collection: collection_postings!


## 1.3.2. Indexing: Insert Data into VectorDB


In [20]:
from langchain_core.documents import Document

# create list of documents (each document is a chunk) with specified ID
# they will be later ingested into the VectorDB
ids = []
documents = []
for index, row in df_select.iterrows():
    id_current = str(index)
    ids.append(id_current)

    # Check if 'min_salary' column exists before accessing it
    min_salary = row.get('min_salary', 0)  # Use .get() with a default value
    # if pd.isna(min_salary):
    #     min_salary = 0  # or any default value you prefer

    # Check if 'pay_period' column exists before accessing it
    pay_period = row.get('pay_period', None) # If the column doesn't exist, assign None

    # Check if 'job_posting_url' column exists before accessing it
    job_posting_url = row.get('job_posting_url', '')  # Use .get() with a default value

    document_current = Document(
        page_content=row['job_post'],  # Main data (the text that gets encoded)
        metadata={"title": row['title'],
                  "location": row['location'],
                  "min_salary": min_salary,
                  "pay_period": pay_period, # Use the retrieved or default value
                  "job_posting_url": job_posting_url, # Use retrieved or default value
                 },
        id=row['job_id'],
    )
    documents.append(document_current)
print(f"There are {len(documents)} documents (chunks).\n")
print(f'Example document content:\n{document_current}')

There are 5 documents (chunks).

Example document content:
page_content='Conduct research on cutting-edge AI techniques.' metadata={'title': 'AI Researcher', 'location': 'Boston', 'min_salary': 0, 'pay_period': None, 'job_posting_url': ''}


In [21]:
import time
start = time.time()
# Replace None values with empty string for pay_period column in documents.metadata
for doc in documents:
    if doc.metadata.get('pay_period') is None:
        doc.metadata['pay_period'] = ''

vector_store.add_documents(documents=documents, ids=ids)
end = time.time()
print('Time spent (min.) for data insertion: \t', (end-start)/60)

Time spent (min.) for data insertion: 	 0.009063307444254558


In [22]:
# Check the data in vector_store.

# Get all the documents in the vector store
documents = vector_store.get(include=["documents", "metadatas", "embeddings"])


doc = documents["documents"][0]
metadata = documents["metadatas"][0]
embedding = documents["embeddings"][0]
print("First Document Metadata:\n", metadata)
print("First Document:\n", doc)
print(f"First Document Embedding (with vector len={len(embedding)}):\n", embedding)


First Document Metadata:
 {'job_posting_url': '', 'location': 'New York', 'min_salary': 0, 'pay_period': '', 'title': 'Data Scientist'}
First Document:
 Work with large datasets and create predictive models.
First Document Embedding (with vector len=768):
 [ 8.69334955e-03  9.51962247e-02 -8.15764889e-02 -3.25093083e-02
  1.80899038e-03  2.93163341e-02  5.76672750e-03  1.04200211e-04
  3.49824242e-02  6.23719990e-02  6.91251531e-02 -1.90565176e-03
 -2.04721987e-02  1.07634440e-01  1.08599057e-02 -7.78906792e-02
  1.83498356e-02 -4.03462127e-02 -6.87788427e-02 -3.25908251e-02
 -1.57017522e-02 -1.82069894e-02  3.80549803e-02 -2.91667623e-03
 -5.30641116e-02 -2.50934884e-02 -5.09273261e-03 -1.13806427e-02
 -3.47216763e-02  1.17658675e-02  2.95170993e-02 -2.83920281e-02
  1.50104482e-02  6.83274567e-02  1.37187806e-06 -3.83911543e-02
 -3.95595655e-02  2.76454687e-02 -8.36615916e-03 -1.19127454e-02
  1.39339818e-02 -3.83707345e-03  3.84253263e-02  2.47215852e-02
 -1.18372180e-02 -1.50844790

# 1.4 Search Engine: VectorDB as Retriever

### 1.4.1. Vector Search (Similarity Search)

**Similarity Search**
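- Finds the top-k stored embeddings that are closest to the query embedding
- score: the distance between the query and each result, not a similarity. With Chroma's default setting (squared L2 distance, unless the collection is configured otherwise), a lower score means a closer match, which is why the results below are listed in ascending score order even though the printout labels the value `SIM`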
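
The following sketch shows how a distance score relates to cosine similarity. It assumes the collection uses squared L2 distance (Chroma's default, as far as I know) and that the embeddings are unit-normalized, which the `Normalize()` layer in the encoder printout suggests. The three vectors are made-up examples.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Made-up 3-dimensional "embeddings", scaled to unit length
query = normalize([1.0, 2.0, 0.5])
doc_close = normalize([1.0, 1.8, 0.7])
doc_far = normalize([-1.0, 0.2, 2.0])

for name, doc in [("close", doc_close), ("far", doc_far)]:
    cos = cosine_similarity(query, doc)
    dist = squared_l2(query, doc)
    # For unit vectors: squared L2 distance == 2 - 2 * cosine similarity
    print(name, round(cos, 4), round(dist, 4), round(2 - 2 * cos, 4))
```

Under these assumptions, the top score of 0.905995 in the output below would correspond to a cosine similarity of roughly 0.55.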


**Application**
- According to your query (autobiography), find k=10 most suitable job posts.

### 💡 Customize Yourself!

**Prepare a query**: self description
- Provide a brief autobiography for your consultant’s reference. Imagine you’re preparing your resume: what information should you include? (E.g., education, experience, abilities, personality, the job position you’re looking for, etc.)
- The words (texts only) need not be too long (< 500 words)
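
A quick length check can be sketched as below. `word_count` and `check_query_length` are hypothetical helpers, and splitting on whitespace only approximates how the encoder tokenizes text. The encoder printout earlier reports `max_seq_length` of 384, so a much longer query would likely be truncated before encoding.

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())

def check_query_length(text, max_words=500):
    """Return True when the query stays under the suggested word limit."""
    return word_count(text) < max_words

draft = "I recently graduated in Computer Science and I am seeking a data analyst role."
print(word_count(draft), check_query_length(draft))
```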

In [23]:
query = "I am a passionate job seeker with a strong desire to embark on a career in data science. Having recently graduated with a degree in Computer Science, I have honed my skills in Python programming and developed a deep interest in machine learning. During my studies, I immersed myself in various projects that allowed me to apply these skills, from building predictive models to analyzing large datasets. My journey into data science has been driven by a fascination with uncovering hidden patterns in data and using these insights to solve real-world problems. I am now seeking an role related to data analysis where I can leverage my Python expertise and enthusiasm for machine learning to contribute to a dynamic team, learn from experienced professionals, and continue to grow my skills in this exciting field. What jobs are most suitable for me?"
print(query)

I am a passionate job seeker with a strong desire to embark on a career in data science. Having recently graduated with a degree in Computer Science, I have honed my skills in Python programming and developed a deep interest in machine learning. During my studies, I immersed myself in various projects that allowed me to apply these skills, from building predictive models to analyzing large datasets. My journey into data science has been driven by a fascination with uncovering hidden patterns in data and using these insights to solve real-world problems. I am now seeking an role related to data analysis where I can leverage my Python expertise and enthusiasm for machine learning to contribute to a dynamic team, learn from experienced professionals, and continue to grow my skills in this exciting field. What jobs are most suitable for me?


In [24]:
results = vector_store.similarity_search_with_score(
    query , k=10,
)
i =1
for res, score in results:
    print(f"* [{i}][SIM={score:3f}] {res.metadata['title']}\n---------------------\n \
          {res.page_content} \n--------------------\n \
           [{res.metadata}]\n\n")
    i +=1



* [1][SIM=0.905995] Data Scientist
---------------------
           Work with large datasets and create predictive models. 
--------------------
            [{'job_posting_url': '', 'location': 'New York', 'min_salary': 0, 'pay_period': '', 'title': 'Data Scientist'}]


* [2][SIM=1.201063] Machine Learning Engineer
---------------------
           Develop and optimize machine learning pipelines. 
--------------------
            [{'job_posting_url': '', 'location': 'San Francisco', 'min_salary': 0, 'pay_period': '', 'title': 'Machine Learning Engineer'}]


* [3][SIM=1.291446] AI Researcher
---------------------
           Conduct research on cutting-edge AI techniques. 
--------------------
            [{'job_posting_url': '', 'location': 'Boston', 'min_salary': 0, 'pay_period': '', 'title': 'AI Researcher'}]


* [4][SIM=1.306716] Data Analyst
---------------------
           Analyze business data and create dashboards. 
--------------------
            [{'job_posting_url': '', 'locati

### 1.4.2. Vector Search with Filtering

**Filtering**  
- You can filter out some job posts based on the condition you set on the _Meta Data_.

**Application**
- According to the query, find k=5 best possible job posts which have minimum salary greater than (gt) 100000
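
The meaning of a filter such as `{"min_salary": {"$gt": 100000}}` can be illustrated with a small pure-Python matcher. This is a sketch of the semantics only, not Chroma's implementation; `matches`, `OPERATORS`, and the salaries in `posts` are made up for the example.

```python
import operator

# Comparison operators used in metadata filters, mapped to Python equivalents
OPERATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(metadata, where):
    """Return True when every {field: {op: operand}} condition holds for metadata."""
    for field, condition in where.items():
        if field not in metadata:
            return False
        for op, operand in condition.items():
            if not OPERATORS[op](metadata[field], operand):
                return False
    return True

# Hypothetical metadata records
posts = [
    {"title": "Data Scientist", "min_salary": 120000},
    {"title": "Data Analyst", "min_salary": 80000},
    {"title": "AI Researcher", "min_salary": 0},
]
well_paid = [p["title"] for p in posts if matches(p, {"min_salary": {"$gt": 100000}})]
print(well_paid)  # ['Data Scientist']
```

In this notebook's sample data every `min_salary` falls back to 0, so a `$gt` 100000 filter would be expected to return no results.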

In [25]:
results = vector_store.similarity_search_with_score(
    query , k=5, filter={"min_salary": {"$gt": 100000}}
)# operators: $gt, $gte, $lt, $lte, $ne, $eq, $in, $nin
i = 1
for res, score in results:
    print(f"* [{i}][SIM={score:3f}] {res.metadata['title']}\n---------------------\n \
          {res.page_content} \n--------------------\n \
           [{res.metadata}]\n\n")
    i +=1

# 2️⃣ Retrieval and Generation


# 2.1. Environment Setup

Sources  
- [langchain-chroma](https://pypi.org/project/langchain-chroma/)
- [Gemini API Python quickstart](https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/tutorials/quickstart_colab.ipynb#scrollTo=-QhPWE1lwZHH)

In [26]:
# install package for Google Gemini
!pip install -q -U google-generativeai

In [27]:
!pip install tiktoken



# 2.2. Connect to VectorDB & LLM Agent


## 2.2.1. Connect to VectorDB (Chroma)

Once the VectorDB has been built, you can connect to it by specifying the collection name.

In [28]:
import chromadb
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint

collection_name = "collection_postings"

#encoder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

persistent_client = chromadb.PersistentClient()
print(persistent_client.list_collections())

vector_store = Chroma(client=persistent_client,
                      collection_name=collection_name,
                      embedding_function=encoder)


[Collection(name=collection_postings)]


In [29]:
# # Check the data in vector_store.

# # Get all the documents in the vector store
# documents_with_embeddings = vector_store.get(include=["documents", "metadatas", "embeddings"])


# doc = documents_with_embeddings["documents"][0]
# metadata = documents_with_embeddings["metadatas"][0]
# embedding = documents_with_embeddings["embeddings"][0]
# print("First Document:\n", doc)
# print("First Document Metadata:\n", metadata)
# print(f"First Document Embedding (with vector len={len(embedding)}):\n", embedding)


## 2.2.2. Connect to Agent (Call Gemini API)

In [30]:
# Import the Python SDK
import google.generativeai as genai
# Used to securely store your API key
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GEMINI_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-pro')

# 2.3. Retrieval and Generation Application

## 2.3.1. Prepare Prompt

**Purpose**

Give instruction to the AI assistant
1. Role: career counselor
1. Tasks: how the AI assistant should respond to the query
1. Context: give the AI assistant the relevant information so that it can respond accordingly
  - query
  - specification: retrieved job posts

**Resources**
- [**What is Prompt Engineering?**](https://www.datacamp.com/blog/what-is-prompt-engineering-the-future-of-ai-communication)
 > **Prompt engineering** is the practice of designing and refining prompts—questions or instructions—to elicit specific responses from AI models.
- [Prompt Optimization Techniques: Prompt Engineering for Everyone](https://www.datacamp.com/blog/prompt-optimization-techniques)

### 💡 Customize Yourself!

In [31]:
extraction_prompt = ''' You are a career consultant who helps job seekers to find their dream jobs, you give professional advice tailored to the need of your client (i.e., job seeker) according to the following information:
    1. Query: Your client's question (enclosed in <query> tag below) that you need to answer
    2. Specification: The job post information (enclosed in <specification> tag below) that might best meet your client's requirements

Upon receiving your aforementioned information, you need to proceed with the following procedures:
Step 1. Analyze your client's abilities, including hard and soft skills.
Step 2. Analyze and summarize the skills needed for the best possible jobs in the job specification
Step 3. Summarize your client's strengths that are already sufficient for the job application.
Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
Step 5. Finally, give them advice how to get the jobs mentioned in job specification according to the reasoning above.

Question:
    <query>{query}</query>
Job Post Information:
    <specification>{specification}</specification>
Advice:
'''

## 2.3.2. Prepare Input Query


### 💡 Customize Yourself!

In [32]:
# a shorter variant of the query used in section 1.4.1
query = "I recently graduated with a Bachelor degree in Computer Science, I use Python and have good grades in machine learning and deep learning. I had various projects that allowed me to apply these skills, from building predictive models to analyzing large datasets. I am now seeking an entry-level data scientist or data analyst role."

## 2.3.3. Search Results based on Query

In [33]:
results = vector_store.similarity_search_with_score(
    query , k=5, #filter={"title": {"$in": keywords}}
)
i=0
specification = ""
for res, score in results:
    print(f"[{i}][SIM={score:3f}] {res.metadata['title']}\n---------------------\n \
          {res.page_content} \n--------------------\n \
           [{res.metadata}]\n\n")
    specification += ('\nTitle: ' + res.metadata['title'] +'\n ' + res.page_content)
    i+=1

[0][SIM=1.192519] Data Scientist
---------------------
           Work with large datasets and create predictive models. 
--------------------
            [{'job_posting_url': '', 'location': 'New York', 'min_salary': 0, 'pay_period': '', 'title': 'Data Scientist'}]


[1][SIM=1.339531] Machine Learning Engineer
---------------------
           Develop and optimize machine learning pipelines. 
--------------------
            [{'job_posting_url': '', 'location': 'San Francisco', 'min_salary': 0, 'pay_period': '', 'title': 'Machine Learning Engineer'}]


[2][SIM=1.461522] Data Analyst
---------------------
           Analyze business data and create dashboards. 
--------------------
            [{'job_posting_url': '', 'location': 'Chicago', 'min_salary': 0, 'pay_period': '', 'title': 'Data Analyst'}]


[3][SIM=1.495674] AI Researcher
---------------------
           Conduct research on cutting-edge AI techniques. 
--------------------
            [{'job_posting_url': '', 'location': 'Bo

In [34]:
print(specification)


Title: Data Scientist
 Work with large datasets and create predictive models.
Title: Machine Learning Engineer
 Develop and optimize machine learning pipelines.
Title: Data Analyst
 Analyze business data and create dashboards.
Title: AI Researcher
 Conduct research on cutting-edge AI techniques.
Title: Business Analyst
 Interpret data to help guide strategic decisions.


## 2.3.4. Get Advice from Career Consultant

In [35]:
#Give your Career Consultant your query (and the relevant job posts)
prompt_all = extraction_prompt.format(query=query, specification=specification)
print(prompt_all)

 You are a career consultant who helps job seekers to find their dream jobs, you give professional advice tailored to the need of your client (i.e., job seeker) according to the following information:
    1. Query: Your client's question (enclosed in <query> tag below) that you need to answer
    2. Specification: The job post information (enclosed in <specification> tag below) that might best meet your client's requirements

Upon receiving your aforementioned information, you need to proceed with the following procedures:
Step 1. Analyze your client's abilities, including hard and soft skills.
Step 2. Analyze and summarize the skills needed for the best possible jobs in the job specification
Step 3. Summarize your client's strengths that are already sufficient for the job application.
Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
Step 5. Finally, give them advice how to get the jobs mentioned in job specification according

In [46]:
import google.generativeai as genai
from google.colab import userdata

# Read the API key from Colab secrets; never hardcode it in the notebook
genai.configure(api_key=userdata.get('GEMINI_API_KEY'))

model = genai.GenerativeModel("models/gemini-1.5-pro-latest")
response = model.generate_content("List 5 entry-level AI jobs.")
print(response.text)

1. **Data Annotator/Labeler:** This role involves tagging and labeling data (images, text, audio, video) to train AI models.  It requires attention to detail and the ability to follow specific guidelines, but doesn't typically require a formal degree in computer science.

2. **A.I. Customer Support Specialist:**  Companies using AI-powered chatbots and virtual assistants often need humans to handle more complex inquiries or escalate issues that the AI can't resolve. This role involves customer service skills and a basic understanding of how AI systems work.

3. **Junior Data Scientist:**  While a full-fledged Data Scientist role often requires a Master's or PhD, some entry-level positions exist for those with a strong analytical background (e.g., bachelor's in math, statistics, or a related field) and some programming skills (Python, R). These roles often focus on data cleaning, preparation, and basic analysis.

4. **Machine Learning Quality Assurance Tester:** This role involves testi

In [45]:
import google.generativeai as genai
from google.colab import userdata

# Read the API key (created in AI Studio) from Colab secrets; never hardcode it
genai.configure(api_key=userdata.get('GEMINI_API_KEY'))

models = genai.list_models()

for m in models:
    print(f"Model name: {m.name}")
    print(f"Supported methods: {m.supported_generation_methods}\n")


Model name: models/chat-bison-001
Supported methods: ['generateMessage', 'countMessageTokens']

Model name: models/text-bison-001
Supported methods: ['generateText', 'countTextTokens', 'createTunedTextModel']

Model name: models/embedding-gecko-001
Supported methods: ['embedText', 'countTextTokens']

Model name: models/gemini-1.0-pro-vision-latest
Supported methods: ['generateContent', 'countTokens']

Model name: models/gemini-pro-vision
Supported methods: ['generateContent', 'countTokens']

Model name: models/gemini-1.5-pro-latest
Supported methods: ['generateContent', 'countTokens']

Model name: models/gemini-1.5-pro-001
Supported methods: ['generateContent', 'countTokens', 'createCachedContent']

Model name: models/gemini-1.5-pro-002
Supported methods: ['generateContent', 'countTokens', 'createCachedContent']

Model name: models/gemini-1.5-pro
Supported methods: ['generateContent', 'countTokens']

Model name: models/gemini-1.5-flash-latest
Supported methods: ['generateContent', 'countTokens']

Model name: models/gemini-1.5-flash-001
Supported methods: ['generateContent', 'countTokens', 'createCachedContent']

Model name: models/gemini-1.5-flash-001-tuning
Supported methods: ['generateContent', 'countTokens', 'createTu

# 2.4 What If: Generation without Retrieved Context

In [47]:
extraction_prompt = ''' You are a career counselor who helps job seekers to find their dream jobs, you give professional advice tailored to the need of your client (i.e., job seeker) according to the following information:
    1. Query: Your client's question (enclosed in <query> tag below) that you need to answer


Upon receiving your aforementioned information, you need to proceed with the following procedures:
Step 1. Analyze your client's abilities, including hard and soft skills.
Step 2. Analyze and summarize the skills needed for the best possible jobs
Step 3. Summarize your client's strengths that are already sufficient for the job application.
Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
Step 5. Finally, give them advice how to get the jobs.

Question:
    <query>{query}</query>

Advice:
'''

prompt_all = extraction_prompt.format(query=query)

In [48]:
response = model.generate_content(prompt_all)
print(response.text)

Step 1. Analyze your client's abilities, including hard and soft skills:

* **Hard Skills:**  Proficient in Python, knowledgeable in machine learning and deep learning, experience with predictive modeling and large dataset analysis.  Possesses a Bachelor's degree in Computer Science.
* **Soft Skills (Inferred):** Project experience suggests potential teamwork, problem-solving, and analytical skills.  Pursuing a job indicates initiative and ambition.  Further clarification on communication, presentation, and interpersonal skills would be beneficial.

Step 2. Analyze and summarize the skills needed for the best possible jobs (Entry-Level Data Scientist/Analyst):

* **Data Scientist:** Programming skills (Python, R), statistical modeling, machine learning algorithms, data visualization, data wrangling, communication & presentation skills, domain expertise (depending on the industry).
* **Data Analyst:** SQL, data manipulation and cleaning, data visualization tools (Tableau, Power BI), sta

# 3️⃣ Store Data to your Google Drive

Save your file from Google Colab to Google Drive

**Note**
- Your data will be lost once you close the Colab page, unless you save it somewhere else (e.g., Google Drive, GitHub, etc.)


**Save to Google Drive**
- To save this IPython (Jupyter) notebook, you can manually select:
  > File > Save a copy in Drive
- To save a specific file (e.g., a CSV file), you can use the following method.
- Or after mounting the Drive here, you can manually drag the file / folder (with cursor) (e.g., chroma folder) to the specified folder (MyDrive)

## Mount Google Drive

In [49]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Create a Folder in Google Drive


In [50]:
def create_folder_in_drive(folder_name):
    # Define the path for the new folder (under MyDrive folder)
    folder_path = f'/content/drive/MyDrive/{folder_name}'

    # Create the folder if it doesn't exist
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)
        print(f'Folder "{folder_name}" created in Google Drive.')
    else:
        print(f'Folder "{folder_name}" already exists.')

# Create the folder in your Google Drive
folder_name = 'Tutorial - LinkedIn Job Posting with GenAI'
create_folder_in_drive(folder_name)

subfolder_name = ''


Folder "Tutorial - LinkedIn Job Posting with GenAI" already exists.


## Save the Notebook to the Created Folder

In [51]:
import shutil

def save_file_to_drive(file_name, folder_name):
    source_path = f'/content/{file_name}'  # Current location of the file
    destination_path = f'/content/drive/MyDrive/{folder_name}/{file_name}'  # Destination path in Drive

    # Copy the file to the destination
    shutil.copy(source_path, destination_path)
    print(f'File "{file_name}" saved to "{folder_name}" in Google Drive.')

# Save a file from /content (here the postings CSV; replace it with your own file name)
file_name = 'postings.csv'
save_file_to_drive(file_name, folder_name)

File "postings.csv" saved to "Tutorial - LinkedIn Job Posting with GenAI" in Google Drive.
