<a href="https://colab.research.google.com/github/fengfrankgthb/BUS-41204/blob/main/StudentSearchingWorkflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workflow - Find My Students

In this notebook, we illustrate a workflow that uses an LLM and a web search tool to try to locate the LinkedIn profiles of my past students.

# Mount Google Drive

I'd like to access my Google Drive, where I have stored the API (Application Programming Interface) keys needed to call OpenAI and the search tool, along with a spreadsheet containing the names of my previous students. I'm also going to save the results back to my Google Drive when I'm done.

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)  # force_remount=True remounts even if the drive is already mounted

Mounted at /content/drive


# API keys

To call these services, we need keys that grant access to them. I want to keep my keys private and at least moderately safe by not displaying them directly in the notebook.

If someone other than me wanted to run this notebook, they would need to have their own API keys and then adapt the code block below to use them.

In [None]:
import json

# To execute this script yourself, you need your own API keys
# stored in your own Google Drive
# Load API keys
with open('/content/drive/My Drive/ColabSecrets/secrets.json', 'r') as f:
    secrets = json.load(f)

OPENAI_API_KEY = secrets["OPENAI_API_KEY"]
VALUE_SERP_API_KEY = secrets["VALUE_SERP_API_KEY"]
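
For anyone adapting the block above, here is a minimal sketch of a loader that fails early with a readable message when a key is missing. The helper name `load_secrets` and the throwaway demo file are my own illustrative assumptions and are not part of the original notebook; only the two key names come from the cell above.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("OPENAI_API_KEY", "VALUE_SERP_API_KEY")

def load_secrets(path, required=REQUIRED_KEYS):
    """Hypothetical helper: load a JSON secrets file and check required keys."""
    with open(path, "r") as f:
        secrets = json.load(f)
    # Treat absent or empty values as missing
    missing = [key for key in required if not secrets.get(key)]
    if missing:
        raise KeyError(f"Missing keys in {path}: {', '.join(missing)}")
    return secrets

# Demo with a temporary file holding placeholder values (not real keys)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "secrets.json")
    with open(demo_path, "w") as f:
        json.dump({"OPENAI_API_KEY": "sk-demo", "VALUE_SERP_API_KEY": "vs-demo"}, f)
    demo_secrets = load_secrets(demo_path)

print(sorted(demo_secrets))
```

In Colab, `path` would point at wherever you keep your own secrets file on the mounted drive.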

# Add libraries

We're going to make use of OpenAI as our LLM.

In [None]:
!pip install langchain_openai

In [None]:
import requests
import pandas as pd
from langchain_openai import ChatOpenAI

# Tools and LLM queries

In the following several code blocks, we're going to define a series of tools and functions to make LLM queries. These are going to be the main building blocks of our workflow.

`get_synonyms` will make use of the LLM to generate additional terms to use in our web searches.

In [None]:
def get_synonyms(term):
    """Uses an LLM to generate synonyms for a given search term."""
    llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, temperature=0)

    prompt = f"""
    Given the term: "{term}",
    generate a list of common synonyms or alternative names that should be included in a web search.
    Respond with a short, comma-separated list of synonyms.
    """

    response = llm.invoke(prompt)
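    # Split on commas and strip whitespace so stray spaces or newlines don't end up in the terms
    return [term.strip() for term in response.content.split(",") if term.strip()]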
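
Because the model's reply is free text, the list may arrive with stray spaces, quotes, line breaks, or repeated terms. Below is a minimal sketch of a more defensive parse. The helper `parse_comma_list` and the example reply are illustrative assumptions, not part of the original workflow.

```python
def parse_comma_list(text):
    """Hypothetical helper: split an LLM reply into clean, de-duplicated terms."""
    terms = []
    seen = set()
    # Treat line breaks like commas, then clean each piece
    for raw in text.replace("\n", ",").split(","):
        term = raw.strip().strip('"').strip()
        if term and term.lower() not in seen:
            seen.add(term.lower())
            terms.append(term)
    return terms

# Made-up reply, only to exercise the parser
example_reply = 'UChicago,  "U of C",\nChicago Booth, uchicago'
print(parse_comma_list(example_reply))
```

De-duplication here ignores case, which seemed reasonable for search terms but is a design choice rather than a requirement.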


`search_linkedin` uses a web search tool to conduct the initial search for LinkedIn profiles of individuals identified by name, a year they were in school, and the school they attended. Additional search terms may also be provided.

In [None]:
def search_linkedin(student_name, year_taught, school, additional_terms="", start_index=0):
    """Searches for LinkedIn profiles, expanding search terms with synonyms."""

    # Generate year range
    years = f"{year_taught} OR {int(year_taught) + 1} OR {int(year_taught) + 2} OR {int(year_taught) + 3}"

    # **Expand school name with synonyms**
    school_synonyms = get_synonyms(school)
    expanded_school_terms = " OR ".join([school] + school_synonyms)

    # Construct search query
    query = f'{student_name} linkedin ({expanded_school_terms}) ({years}) {additional_terms}'
    # print(f"Query: {query}")

    url = "https://api.valueserp.com/search"

    params = {
        "api_key": VALUE_SERP_API_KEY,
        "q": query,
        "location": "United States",
        "hl": "en",
        "gl": "us",
        "num": 10,
        "start": start_index,
        "no_truncate": "true"
    }

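    # A timeout keeps the notebook from hanging if the search service stalls
    response = requests.get(url, params=params, timeout=30)
    if response.status_code != 200:
        print(f"Error fetching search results (HTTP {response.status_code}).")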
        return []

    results = response.json().get("organic_results", [])

    # **Filter to include only LinkedIn URLs**
    linkedin_results = [
        {"title": r.get("title", ""),
         "link": r.get("link", ""),
         "snippet": r.get("snippet", "")}
        for r in results
        if "linkedin.com/in/" in r.get("link", "")
    ]

    return linkedin_results
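
To make the query construction easier to inspect without spending a search call, here is a small sketch that builds the same kind of query string as a pure function. The helper name `build_linkedin_query` and the placeholder student are my own assumptions. Quoting multi-word school names is also an assumption about how the search engine groups `OR` terms, not something the function above does.

```python
def build_linkedin_query(student_name, year_taught, school_terms, additional_terms="", span=4):
    """Hypothetical helper: assemble the search string for the LinkedIn lookup."""
    first_year = int(year_taught)
    # Year taught plus the following years, joined with OR
    years = " OR ".join(str(first_year + k) for k in range(span))
    # Quote multi-word terms so OR applies to whole phrases (assumption, see lead-in)
    schools = " OR ".join(f'"{t}"' if " " in t else t for t in school_terms)
    query = f"{student_name} linkedin ({schools}) ({years}) {additional_terms}"
    return query.strip()

# Placeholder inputs, not a real student
demo_query = build_linkedin_query("Jane Doe", 2010, ["University of Chicago", "UChicago"], "MBA")
print(demo_query)
```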

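`extract_relevant_info` loops over the search results and attaches a match likelihood to each one by calling `estimate_match_likelihood` (defined in the next cell).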

In [None]:
def extract_relevant_info(student, year, school, search_results, program=""):
    """Processes search results to estimate likelihood of match."""
    profiles = []
    for result in search_results:
        title = result.get("title", "")
        snippet = result.get("snippet", "")
        link = result.get("link", "")
        likelihood = estimate_match_likelihood(student, year, school, title, snippet, program)
        profiles.append((title, link, snippet, likelihood))
    return profiles


`estimate_match_likelihood` uses the LLM to assess how likely each returned search result is to correspond to the desired student. Because the only information passed to the LLM is the title and snippet of each search result, the evaluation is not especially accurate. However, I do not want to scrape (too much) data directly from websites without checking their policies.

In [None]:
def estimate_match_likelihood(student, year, school, title, snippet, program):
    """Uses LLM to estimate likelihood that snippet belongs to student."""

    llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, temperature=0)

    prompt = f"""
    Given the following search result title and snippet:

    - **Title:** "{title}"
    - **Snippet:** "{snippet}"

    Estimate the likelihood (0-1) that this is the correct LinkedIn profile of the student named "{student}".

    ### **Guidelines for Likelihood Estimation**

    The student name provided will generally be in the form of FIRST NAME LAST NAME
    or (FIRST NAME OR NICKNAME) LAST NAME.

    The first two words of Title will often be FIRST NAME LAST NAME, and Title
    will almost always include a proper name with first and last name.

    Before comparing names, try to extract FIRST NAME and LAST NAME from the title and store them in memory.
    Name comparisons should be made character by character.

    1. **First Name Matching:**
      - If the first name in the title exactly matches or is a common variation of the student's name, this increases confidence.
      - If the first name is different but phonetically or culturally similar, reduce the score slightly.
      - If the first name is completely different, penalize significantly.

    2. **Last Name Matching:**
      - If the last name in the title matches exactly, this is a strong signal.
      - If the last name is slightly different but could be a common variation, reduce confidence slightly.
      - If the last name is completely different, penalize significantly.

    3.  **First and Last Name Matching**
      - If both first and last name match exactly, this is a strong signal.

    4. **Education & School Match:**
      - If the snippet confirms the student attended "{school}", this increases confidence.
      - If school is missing or education is not provided, this should not increase or decrease confidence.

    5. **Program Match:**
      - If the snippet confirms the student was in the "{program}" program, this increases confidence.
      - If the program is clear but different, reduce confidence significantly.
      - If the program is missing, this should not increase or decrease confidence.

    6. **Employment & Experience:**
      - If the snippet contains employment information but no education details, rely mostly on name matching.
      - If the experience or employment information strongly aligns with what would be expected for a "{program}"
        student from "{school}", increase the score slightly. For example, if employment or title indicates
        the person is employed at a university and the person's program is PhD, this increases confidence
        in the match.

    ### **Final Output (Force Consistency)**
    1. First, output the extracted first and last names **in a structured format**.
    2. Then, compute the likelihood **based on the reasoning steps above** and return it.
    3. **ALWAYS use the same process for reasoning, even if not explicitly asked.**

    ### **Final Structured Output (Do Not Deviate)**
    Respond in the following exact format:

    Reasoning: First Name Match: [Exact/Close/No Match]
               Last Name Match: [Exact/Close/No Match]
               School Match: [Match/No Match]
               Program Match: [Match/Not Mentioned/Different]
               Employment Match: [Relevant/Not Relevant]

    Final Likelihood: [0-1]

    Make sure the likelihood value is **always on the last line, prefixed exactly with "Final Likelihood: " for easy extraction.**

    """

    response = llm.invoke(prompt)

    try:
        return float(response.content.split("Final Likelihood: ")[-1])
    except ValueError:
        return 0.5  # Default uncertainty
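
The `float(...)` call above falls back to 0.5 whenever anything trails the number, such as a period or a closing bracket. As a sketch of a more forgiving extraction, the hypothetical helper below (my own naming, not part of the original notebook) uses a regular expression to find the last `Final Likelihood:` label and clamps the value to the 0-1 range.

```python
import re

def parse_final_likelihood(text, default=0.5):
    """Hypothetical helper: read the number after the last 'Final Likelihood:' label."""
    # Allow an optional opening bracket, since the prompt shows the value as [0-1]
    matches = re.findall(r"Final Likelihood:\s*\[?\s*([0-9]*\.?[0-9]+)", text)
    if not matches:
        return default  # label missing: keep the same default as the original code
    value = float(matches[-1])
    return min(max(value, 0.0), 1.0)  # clamp to [0, 1]

# Made-up replies, only to exercise the parser
print(parse_final_likelihood("Reasoning: First Name Match: Exact\nFinal Likelihood: 0.85"))
print(parse_final_likelihood("Final Likelihood: [0.9]"))
print(parse_final_likelihood("I could not decide."))
```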

`suggest_refinement_terms` uses the LLM to process a result that is believed to match a student and generate additional search terms for a further search aimed at deciding whether the student is in the analytics space.

In [None]:
def suggest_refinement_terms(title, snippet):
    """Uses an LLM to generate the best search terms based on the title and snippet associated with a LinkedIn URL returned from a web search."""

    llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, temperature=0)

    prompt = f"""
    The following title and snippet are associated with a LinkedIn URL returned from a web search:

    title:

    "{title}"

    snippet:

    "{snippet}"

    Suggest the best set of search terms to use in a general web search to determine whether the person associated with the LinkedIn account is involved in AI/ML.
    Prioritize any company names, job roles, and technical keywords in the title or snippet.

    Respond with a comma-separated list of refined search terms.
    """

    response = llm.invoke(prompt)
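    # Split on commas and strip whitespace so stray spaces or newlines don't end up in the terms
    return [term.strip() for term in response.content.split(",") if term.strip()]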


`search_general_web` simply conducts a web search given a set of words. (We don't really need both `search_general_web` and `search_linkedin`, but my chain of thought just went this way.)

In [None]:
def search_general_web(person_name, additional_terms=""):
    """Conducts a general web search using the person's name and relevant terms."""

    query = f'{person_name} {additional_terms}'

    url = "https://api.valueserp.com/search"

    params = {
        "api_key": VALUE_SERP_API_KEY,
        "q": query,
        "location": "United States",
        "hl": "en",
        "gl": "us",
        "num": 10  # Retrieve 10 results at a time
    }

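    # A timeout keeps the notebook from hanging if the search service stalls
    response = requests.get(url, params=params, timeout=30)
    if response.status_code != 200:
        print(f"Error fetching search results (HTTP {response.status_code}).")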
        return []

    results = response.json().get("organic_results", [])

    # Return both the link and snippet for further AI/ML assessment
    return [{"link": r.get("link", ""), "snippet": r.get("snippet", "")}
            for r in results]
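
Since the two search functions differ mainly in their query string and a couple of parameters, one way to cut the duplication is a shared parameter builder. This is only a sketch under my own assumptions: the helper name `build_serp_params` is hypothetical, and the parameter names are simply the ones already used in the cells above.

```python
def build_serp_params(api_key, query, start=None, num=10, **extra):
    """Hypothetical helper: request parameters shared by both search functions."""
    params = {
        "api_key": api_key,
        "q": query,
        "location": "United States",
        "hl": "en",
        "gl": "us",
        "num": num,
    }
    if start is not None:
        params["start"] = start  # pagination offset, used by the LinkedIn search
    params.update(extra)  # e.g. no_truncate="true"
    return params

# Placeholder key and queries, only to show the difference between the two calls
linkedin_params = build_serp_params("vs-demo", "Jane Doe linkedin", start=0, no_truncate="true")
general_params = build_serp_params("vs-demo", "Jane Doe machine learning")
print(sorted(set(linkedin_params) - set(general_params)))
```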


`assess_ai_ml_likelihood` takes the full set of web results returned by a search that includes a person's name and uses the LLM to judge how likely it is that the person is in the analytics space.

In [None]:
def assess_ai_ml_likelihood(web_results):
    """Uses an LLM to determine the likelihood that the person works in AI/ML based on search results."""

    llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, temperature=0)

    # Combine snippets into a single context
    snippets_text = "\n".join([result["snippet"] for result in web_results])

    prompt = f"""
    Given the following search result snippets:

    "{snippets_text}"

    Estimate the likelihood (0-1) that the person mentioned in these results is involved in AI/ML or data analytics.
    Consider whether they work at an AI/ML/analytics related firm, are involved in AI/ML/analytics research, or engage in related activities.
    Examples of related activities might include research in AI, machine learning, deep learning, or data science; being
    in venture capital investing in AI/ML/analytics firms; or being on a board at an AI-related company.
    Other examples that should return a value around .6 might be doing research or working in econometrics or statistics
    without any explicit results referencing AI/ML.

    Respond with a number between 0 and 1, followed by a short explanation.
    """

    response = llm.invoke(prompt)
    return response
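
The prompt asks for a number followed by a short explanation, but it does not say whether the two arrive on separate lines, and later cells split the reply at the first line break. A minimal sketch of a parser that tolerates either layout is shown below. The helper `split_score_and_reason` is a hypothetical name of mine, and the sample replies are made up.

```python
import re

def split_score_and_reason(text):
    """Hypothetical helper: separate a leading 0-1 number from the explanation."""
    # A leading number such as 0.8, 1, or .6, then optional separators, then the rest
    match = re.match(r"\s*([01](?:\.\d+)?|\.\d+)\b[\s:.,\-]*(.*)", text, flags=re.DOTALL)
    if match is None:
        return None, text.strip()  # no leading number found
    return float(match.group(1)), match.group(2).strip()

print(split_score_and_reason("0.8\nWorks at an AI lab."))
print(split_score_and_reason("0.6 - econometrics research"))
print(split_score_and_reason("Likely involved"))
```

Returning `None` for the score leaves it to the caller to decide what to do when the model ignores the requested format.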


`assess_ai_ml_from_snippet` does the same thing but uses only the title and snippet of the URL identified in the initial search as likely to belong to a student.

In [None]:
def assess_ai_ml_from_snippet(title, snippet):
    """Uses an LLM to estimate AI/ML involvement based solely on the LinkedIn snippet."""

    llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, temperature=0)

    prompt = f"""
    Given the following LinkedIn snippet and title -

    title:

    "{title}"

    snippet:

    "{snippet}"

    Estimate the likelihood (0-1) that the person mentioned in these results is involved in AI/ML or data analytics.
    Consider whether they work at an AI/ML/analytics related firm, are involved in AI/ML/analytics research, or engage in related activities.
    Examples of related activities might include research in AI, machine learning, deep learning, or data science; being
    in venture capital investing in AI/ML/analytics firms; or being on a board at an AI-related company.
    Other examples that should return a value around .6 might be doing research or working in econometrics or statistics
    without any explicit results referencing AI/ML.

    Respond with a number between 0 and 1, followed by a short explanation.
    """

    response = llm.invoke(prompt)
    return response


`save_to_spreadsheet` does just what it says; we use it to save the results produced by running the workflow interactively.

In [None]:
def save_to_spreadsheet(data, filename="student_profiles.xlsx"):
    """Saves the search results to an Excel spreadsheet."""
    df = pd.DataFrame(data, columns=[
        "Student Name", "LinkedIn URL", "Match Likelihood", "AI/ML Likelihood", "Summary"
    ])
    df.to_excel(filename, index=False)
    print(f"Results saved to {filename}")


# Run the workflow

The following code ties everything together to allow us to interactively run through the workflow. This allows us to see how everything works and gives more flexibility over the inputs.

In [None]:
# Interactive Process
data = []

while True:
    # Step 1: Prompt for student info
    student_name = input("Enter student name: ")
    year_taught = input("Enter year taught: ")
    school = input("Enter school (default: University of Chicago): ") or "University of Chicago"
    program = input("Enter program (default: MBA): ") or "MBA"
    additional_terms = input("Enter additional search terms (or leave blank): ")

    start_index = 0  # Track pagination index

    while True:
        print(f"Searching for LinkedIn profiles (results {start_index + 1}-{start_index + 10})...")
        search_results = search_linkedin(student_name, year_taught, school, additional_terms, start_index)

        if not search_results:
            retry = input("No profiles found. Provide more search terms? (y/n): ")
            if retry.lower() == "y":
                new_terms = input("Enter additional search terms: ").strip()
                additional_terms += " " + new_terms if new_terms else ""
                start_index = 0  # Reset pagination
                continue  # Re-run search with new terms
            else:
                data.append([student_name, "Not found", "", "", ""])
                break

        found_match = False  # Track if a match is confirmed

        for title, link, snippet, likelihood in extract_relevant_info(student_name, year_taught, school, search_results, program):
            print(f"\nPossible match:\nTitle: {title}\nURL: {link}\nSnippet: {snippet}\nLikelihood: {likelihood:.2f}")
            confirm = input("Is this the correct profile? (y/n): ")

            if confirm.lower() == "y":
                found_match = True  # Stop further link processing

                # Step 3: Initial AI/ML Assessment Using LinkedIn Snippet
                print("\nAssessing AI/ML involvement based on LinkedIn snippet...")
                ai_ml_result = assess_ai_ml_from_snippet(title, snippet)
                # partition() does not raise if the reply contains no line break
                likelihood_ai_ml, _, summary = ai_ml_result.content.partition("\n")

                print(f"\nAI/ML Likelihood (Snippet Only): {likelihood_ai_ml}")
                print(f"Reasoning: {summary}")

                stop_here = input("Are you satisfied with this result? (y/n): ")

                if stop_here.lower() == "y":
                    # Save results and break out of search
                    data.append([student_name, link, likelihood, likelihood_ai_ml, summary])
                    break  # Move to next student immediately

                # Step 4: Conduct General Web Search for Further Verification
                print("\nConducting a general web search for additional AI/ML verification...")
                refined_search_terms = suggest_refinement_terms(title, snippet)

                print(f"\nSearch terms for additional AI/ML assessment: {', '.join(refined_search_terms)}")

                user_additional_terms = input("Enter any additional words to include (or press enter to skip): ").strip()
                if user_additional_terms:
                    refined_search_terms.append(user_additional_terms)

                print("\nConducting a general web search for additional AI/ML verification...")
                general_search_results = search_general_web(student_name, " ".join(refined_search_terms))

                # Step 5: Final AI/ML Assessment Based on Web Search
                if general_search_results:
                    ai_ml_result = assess_ai_ml_likelihood(general_search_results)
                    # partition() does not raise if the reply contains no line break
                    likelihood_ai_ml, _, summary = ai_ml_result.content.partition("\n")

                    print(f"\nFinal AI/ML Likelihood (Web Search): {likelihood_ai_ml}")
                    print(f"Reasoning: {summary}")

                # Save final results and **stop searching**
                data.append([student_name, link, likelihood, likelihood_ai_ml, summary])
                break  # Move to next student immediately

        if found_match:
            break  # **Ensure no more LinkedIn results are checked once a match is found**

        # If no match was found after checking all results, allow refinement
        more_results = input("No profiles matched. Search for more results with the same terms? (y/n): ")
        if more_results.lower() == "y":
            start_index += 10  # Fetch next set of results
            continue  # Re-run search without changing search terms

        retry = input("Provide new search terms? (y/n): ")
        if retry.lower() == "y":
            new_terms = input("Enter additional search terms: ").strip()
            additional_terms += " " + new_terms if new_terms else ""
            start_index = 0  # Reset pagination
            continue  # Re-run search with additional details
        else:
            data.append([student_name, "Not found", "", "", ""])
            break

    another = input("Search for another student? (y/n): ")
    if another.lower() != "y":
        break

In [None]:
# Save results only after all refinements
save_to_spreadsheet(data)

# Automate the workflow

Finally, we automate the workflow to take as input a spreadsheet containing names and year taught and output the spreadsheet updated with the results of executing the workflow student by student.

In [None]:
# Automate the process

def process_students(file_path, out_file_path="student_profiles.xlsx", row_indices=None):
    """Runs the workflow for each spreadsheet row and saves the results to out_file_path."""
    # Load the spreadsheet
    df = pd.read_excel(file_path)

    if row_indices is not None:
        df = df.iloc[row_indices]

    results = []

    for index, row in df.iterrows():
        last_name, first_name = row['Name'].split(', ')
        preferred_name = row['Preferred Name']
        student_name = f"({first_name} OR {preferred_name}) {last_name}"
        year_taught = row['Year']
        program = row['Program']
        school = 'University of Chicago'
        additional_terms = ''
        if year_taught > 2007:
            additional_terms += 'Booth '
            school += ' Booth School of Business'
        if 'PhD' in program:
            additional_terms += 'PhD '
        if 'MBA' in program:
            additional_terms += 'MBA '

        start_index = 0  # Track pagination index
        best_match = None  # Track the best LinkedIn match
        best_likelihood = 0  # Highest found likelihood

        while True:
            print(f"Searching for LinkedIn profile for {student_name} ...")
            search_results = search_linkedin(student_name, year_taught, school, additional_terms, start_index)
            if not search_results:
                break  # Stop searching if no results are found

            for title, link, snippet, likelihood in extract_relevant_info(student_name, year_taught, school, search_results, program):

                if likelihood > best_likelihood:
                    best_match = (title, link, snippet, likelihood)
                    best_likelihood = likelihood

                if likelihood > 0.7:
                    break  # Stop as soon as we find a strong match

            if best_likelihood > 0.7 or not search_results:
                break  # Stop searching if we found a strong match or no more results exist

            start_index += 10  # Fetch next set of results

            if start_index > 100:
                break # Stop searching if we go too long

        if not best_match:
            results.append([year_taught, student_name, "Not found", "", "", ""])
            continue  # Move to next student

        title, link, snippet, likelihood = best_match

        # Initial AI/ML Assessment Using LinkedIn Snippet
        ai_ml_result = assess_ai_ml_from_snippet(title, snippet)
        # partition() does not raise if the reply contains no line break
        likelihood_ai_ml, _, summary = ai_ml_result.content.partition("\n")
        likelihood_ai_ml = float(likelihood_ai_ml)

        if likelihood_ai_ml < 0.6:
            # Conduct General Web Search for Further Verification
            refined_search_terms = suggest_refinement_terms(title, snippet)
            general_search_results = search_general_web(student_name, " ".join(refined_search_terms))

            if general_search_results:
                ai_ml_result = assess_ai_ml_likelihood(general_search_results)
                # partition() does not raise if the reply contains no line break
                likelihood_ai_ml, _, summary = ai_ml_result.content.partition("\n")
                likelihood_ai_ml = float(likelihood_ai_ml)

        results.append([year_taught, student_name, link, likelihood, likelihood_ai_ml, summary])

    # Save results back to the specified spreadsheet
    # Create a DataFrame with the correct column names
    results_df = pd.DataFrame(results, columns=["Year", "Student Name", "LinkedIn URL", "Match Likelihood", "AI/ML Likelihood", "Reasoning"])

    results_df.to_excel(out_file_path, index=False)
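
The search loop inside `process_students` (page through results, keep the best score, stop early on a strong match) can be tested without any API calls if the search and scoring steps are passed in as functions. The sketch below is my own restatement of that loop; `find_best_match`, `fake_search`, and the toy data are hypothetical and not part of the original notebook.

```python
def find_best_match(search_page, score, threshold=0.7, page_size=10, max_start=100):
    """Hypothetical helper: page through results, keeping the highest-scoring one.

    search_page(start) returns a list of results for that offset;
    score(result) returns a likelihood between 0 and 1.
    """
    best, best_score = None, 0.0
    start = 0
    while start <= max_start:
        page = search_page(start)
        if not page:
            break  # no more results
        for result in page:
            s = score(result)
            if s > best_score:
                best, best_score = result, s
            if s > threshold:
                return best, best_score  # strong match: stop early
        start += page_size
    return best, best_score

# Toy data standing in for search pages and LLM scores
pages = {0: ["a", "b"], 10: ["c"]}
scores = {"a": 0.2, "b": 0.5, "c": 0.9}
calls = []

def fake_search(start):
    calls.append(start)
    return pages.get(start, [])

best, best_score = find_best_match(fake_search, scores.get)
print(best, best_score, calls)
```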


Here we're inputting a spreadsheet with the names of the people I'd like to search for.

Of course, this block won't work for anyone who does not have these files stored on their Google Drive.

In [None]:
# Load spreadsheet
#file = '/content/drive/MyDrive/Booth Students Taught short.xlsx'
file = '/content/drive/MyDrive/Booth PhD Students Short.xlsx'

#process_students(file, out_file_path='/content/drive/MyDrive/StudentSearchResults.xlsx')  # Process the whole file
process_students(file, out_file_path='/content/drive/MyDrive/StudentSearchResults.xlsx', row_indices=range(2))  # Process the rows of the file specified in row_indices
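
`process_students` expects each row's `Name` to be formatted as `Last, First` and combines it with `Preferred Name` into a search name. If a preferred name is blank, pandas reads the cell as NaN, which would end up in the query as the text `nan`. Here is a minimal sketch of a helper that handles that case. The name `build_search_name` and the placeholder inputs are my own assumptions.

```python
import math

def build_search_name(name, preferred_name=None):
    """Hypothetical helper: turn 'Last, First' plus an optional preferred name into a query name."""
    last_name, first_name = [part.strip() for part in name.split(",", 1)]
    # pandas represents an empty Excel cell as NaN (a float), so treat that as missing
    missing = (
        preferred_name is None
        or (isinstance(preferred_name, float) and math.isnan(preferred_name))
        or not str(preferred_name).strip()
    )
    if missing or str(preferred_name).strip().lower() == first_name.lower():
        return f"{first_name} {last_name}"
    return f"({first_name} OR {str(preferred_name).strip()}) {last_name}"

# Placeholder names, not real students
print(build_search_name("Doe, Jane", "Janie"))
print(build_search_name("Doe, Jane", float("nan")))
```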