<a href="https://colab.research.google.com/github/gbeyderman/gbeyderman/blob/gh-pages/Pharma_Marketers_AI_Investment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Measuring AI Investment in Pharmaceutical Marketing

## Description

This project aims to quantitatively assess the level of Artificial Intelligence (AI) investment among pharmaceutical marketers. By analyzing various data sources, we seek to understand the extent of AI adoption in pharma marketing and identify patterns, trends, and potential gaps influenced by perception bias in the industry.

## Objectives

- **Quantify AI Investment**: Determine the amount and nature of investment in AI by pharmaceutical marketing companies.
- **Identify Adoption Patterns**: Understand which areas within pharma marketing are most impacted by AI technologies.

## Data Sources

**Professional Networking Data**:
   - **LinkedIn Talent Insights**: A paid service providing aggregated data on industry hiring trends and in-demand skills.
   - **Industry Surveys and Reports**: Studies conducted by reputable firms analyzing skill demands in the pharma sector.

**Note on LinkedIn Data Access**: Direct access to individual LinkedIn profiles is restricted, and scraping is against LinkedIn's [User Agreement](https://www.linkedin.com/legal/user-agreement). However, aggregated data can be obtained legally through LinkedIn's Talent Insights or third-party industry reports.

## Methodology

### 1. Data Collection

- **Skill Demand Assessment**: Identify AI-related skills required in job postings.

### 2. Data Cleaning and Preprocessing

- Standardize data formats and terminologies.
- Remove duplicates and irrelevant entries.
- Ensure compliance with data privacy laws (e.g., GDPR).
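
As a rough illustration of the standardization and de-duplication steps, the sketch below maps skill-name variants to one canonical label and drops repeated entries. The alias table and the `normalize_skills` helper are hypothetical; a real mapping would have to be built from the collected data.

```python
from typing import Dict, Iterable, List

# Hypothetical alias table: lower-cased variant -> canonical skill name.
SKILL_ALIASES: Dict[str, str] = {
    "nlp": "Natural Language Processing",
    "natural language processing (nlp)": "Natural Language Processing",
    "ml": "Machine Learning",
    "llms": "Large Language Models",
}


def normalize_skills(raw_skills: Iterable[str]) -> List[str]:
    """Trim whitespace, map known variants to a canonical name, and drop
    duplicates while preserving first-seen order."""
    seen = set()
    cleaned = []
    for raw in raw_skills:
        name = " ".join(raw.split())  # collapse stray whitespace
        if not name:
            continue  # skip empty entries
        canonical = SKILL_ALIASES.get(name.lower(), name)
        key = canonical.lower()  # compare case-insensitively
        if key not in seen:
            seen.add(key)
            cleaned.append(canonical)
    return cleaned


print(normalize_skills(["NLP", " Python ", "python", "Natural Language Processing (NLP)", ""]))
```

Comparing on a lower-cased key is one simple design choice; it treats "Python" and "python" as the same skill while keeping the first spelling encountered.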

### 3. Data Analysis

- **Quantitative Analysis**: Use statistical methods to quantify AI investments.
- **Trend Analysis**: Identify patterns over time in AI adoption.
- **Comparative Analysis**: Compare AI investments across different companies and sectors within pharma.
- **Sentiment Analysis**: Assess industry perceptions versus actual investment levels.
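
One simple way to make the quantitative and comparative steps concrete is to treat the share of a company's employees who list at least one AI-related skill as a proxy for AI investment. That proxy is an assumption made for illustration, not an established measure, and the company names, skill sets, and `ai_skill_share` helper below are all made up.

```python
from typing import Dict, List, Set

# Illustrative subset of AI-related skills (an assumption, not a vetted taxonomy).
AI_SKILLS: Set[str] = {"Machine Learning", "Deep Learning", "Natural Language Processing"}


def ai_skill_share(employee_skills: List[Set[str]]) -> float:
    """Fraction of employees with at least one AI-related skill."""
    if not employee_skills:
        return 0.0
    with_ai = sum(1 for skills in employee_skills if skills & AI_SKILLS)
    return with_ai / len(employee_skills)


# Made-up data: each company maps to one skill set per employee.
companies: Dict[str, List[Set[str]]] = {
    "PharmaCo A": [{"Machine Learning", "SQL"}, {"Brand Strategy"}, {"Deep Learning"}, {"Copywriting"}],
    "PharmaCo B": [{"Market Research"}, {"Natural Language Processing"}, {"Media Planning"}, {"SEO"}],
}

for name, staff in companies.items():
    print(f"{name}: {ai_skill_share(staff):.0%} of employees list an AI skill")
```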

### 4. Visualization

- Develop dashboards and visualizations using tools like Tableau or Python libraries (`matplotlib`, `seaborn`) to present findings effectively.

### 5. Reporting

- Compile a comprehensive report detailing methodologies, analyses, findings, and recommendations.

## Expected Outcomes

- A quantified measure of AI investment levels in pharma marketing.
- Identification of gaps between perceived and actual AI adoption due to perception bias.
- Actionable recommendations for stakeholders to bridge perception gaps and enhance AI integration.


# Literature Review: Skillset Compositions in Specific Industries

## Introduction

Understanding the skillset compositions in specific industries is crucial for aligning educational programs, workforce development, and policy-making with industry needs. The following literature review summarizes scholarly research that examines skill requirements, skill gaps, and the impact of technological advancements across various industries.

---

## 1. Manufacturing Industry

**Title:** *Skill Demands and Mismatch in U.S. Manufacturing*  
**Authors:** Andrew Weaver, Paul Osterman  
**Journal:** *ILR Review*, Vol. 70, No. 2 (2017), pp. 275–307  
**Link:** [Skill Demands and Mismatch in U.S. Manufacturing](https://journals.sagepub.com/doi/10.1177/0019793916654925)

**Summary:**  
This study analyzes skill requirements in U.S. manufacturing using a nationally representative survey of manufacturing establishments. It challenges the notion of widespread skill shortages and provides insights into the specific skills demanded in modern manufacturing settings.

Although the study focuses on manufacturing, its methodology and findings bear on the broader question of skill gaps in technologically advanced industries, including pharma. By analyzing survey data on skill demands and hiring difficulties, it emphasizes the need for specific technical and digital competencies, which are often central to AI adoption. This makes it relevant to understanding skill demands in pharma marketing, particularly for AI.

The study reveals that challenges in hiring skilled workers are not purely a result of a lack of broad education or workforce training but can also be mediated by factors like firm strategy, technology adoption, and specific skill requirements. These findings imply that in the context of AI in pharma marketing, investments in targeted skills and competencies are essential, as is an understanding of how these skills align with business strategies and technological shifts. The insights point to the importance of focusing on firm-level strategies, tailored training solutions, and improved institutional relationships rather than general workforce inadequacies.

The research examines how different establishment characteristics (e.g., being part of a high-tech industry, membership in industry clusters) influence skill demands and hiring difficulties. This approach uncovers whether specific types of firms with higher skill demands face greater signs of mismatch, helping to understand the broader context of skill needs within the industry.

The study analyzes the relationship between skill demands and high-performance work system (HPW) variables, such as Total Quality Management (TQM) and self-managed work teams. By measuring the percentage involvement of core workers in these systems, the study assesses how workplace organization and management practices correlate with skill requirements and hiring challenges.

The study uses logistic regression (logit models) to estimate the relationship between various skill demands and the presence of long-term vacancies in core production roles. This approach allows for the calculation of the likelihood that certain skill demands predict hiring difficulties, providing insights into the factors contributing to skill gaps.
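
For reference, a logit model of this kind has the generic form below. The notation is ours rather than the authors': $V_i$ indicates whether establishment $i$ reports a long-term vacancy, and $x_i$ is its vector of skill-demand measures and controls.

$$
\Pr(V_i = 1 \mid x_i) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top x_i)}}
$$

A positive coefficient $\beta_k$ means that, holding the other variables fixed, a higher value of the $k$-th measure is associated with higher odds of a long-term vacancy, with $e^{\beta_k}$ giving the corresponding odds ratio.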

---

## 2. Data Science Across Industries

**Title:** *Development of a Job Advertisement Analysis for Assessing Data Science Competencies*  
**Authors:** Jan Vogt, Thilo Voigt, Annika Nowak, Jan M. Pawlowski  
**Journal:** *Data Science Journal*, 22: 33, pp. 1–16  
**Link:** [Analyzing the Skills Employers Seek for Data Scientists](https://datascience.codata.org/articles/1386/files/64f9ab2c38bb9.pdf)

**Summary:**  
Competencies can be extracted in several ways. In addition to quantitative surveys and qualitative expert interviews, there is job advertisement analysis, which proved very promising in this work. Job advertisement analysis assumes that the truth about the necessary competencies lies within the job adverts. Because a job portal holds several thousand advertisements for a single occupational field, processing them manually is very time-consuming, so machine-supported algorithms are used. As this work showed, they make it possible to analyze several thousand advertisements in a few hours. In this study, ontology-based information extraction was applied by means of named entity recognition (NER). The procedure can fall back on existing vocabulary to extract content from the texts. The resulting weakness, that only known terms can be found, can be remedied by extending the terms through a whitelist. The focus group evaluation shows that the approach serves as a basis for the development of competences.
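
The vocabulary-plus-whitelist idea can be sketched with plain string matching. This is a deliberately simplified stand-in for the ontology-based NER described in the paper, not a reproduction of it; the vocabulary, whitelist, sample advert, and `extract_skills` helper are hypothetical.

```python
import re
from typing import Iterable, List

# Hypothetical base vocabulary (standing in for terms taken from an ontology).
VOCABULARY = ["machine learning", "python", "statistics"]
# Hypothetical whitelist of extra terms the base vocabulary does not cover.
WHITELIST = ["large language models", "mlops"]


def extract_skills(text: str, terms: Iterable[str]) -> List[str]:
    """Return the known terms that occur in the text as whole words or phrases."""
    found = []
    for term in terms:
        # Lookarounds keep a term from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


advert = "We are hiring an analyst with Python and machine learning experience; MLOps is a plus."

# Without the whitelist, "MLOps" goes undetected because it is not a known term.
print(extract_skills(advert, VOCABULARY))
# Adding the whitelist extends the set of terms that can be found.
print(extract_skills(advert, VOCABULARY + WHITELIST))
```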

---

## 3. AI Skill Exposure

**Title:** *Generative AI’s Influence on Employment Patterns*  
**Authors:** Mar Carpanelli et al.  
**Journal:** *Economic Graph White Paper*  
**Link:** [Generative AI’s Influence on Employment Patterns](https://economicgraph.linkedin.com/content/dam/me/economicgraph/en-us/PDF/gai-influence-on-employment-patterns.pdf)

**Summary:**  
In this study, we investigate the differential impacts of Generative Artificial Intelligence (GAI) on workers in the US using LinkedIn data, both overall and by their educational attainment levels. We find a gradual shift away from GAI-disrupted occupations over the past six years towards potentially augmented or insulated roles, particularly pronounced among higher-educated workers. Through an analysis of occupational transitions, we make predictions under various scenarios of GAI impact, examining the proportions of workers within different GAI occupation classifications, transitions from employment to non-employment, and shifts in occupational categories. While workers with higher educational attainment tend to exhibit higher rates of occupational mobility, the differential effects of GAI on employment and occupational changes underscore the importance of considering educational disparities in the context of technological advancements.

In [None]:
from dataclasses import dataclass, field
from typing import List

@dataclass
class Skill:
    name: str
    endorsed: bool

@dataclass
class Employee:
    position: str
    skills: List[Skill] = field(default_factory=list)

@dataclass
class Company:
    name: str
    employees: List[Employee] = field(default_factory=list)

    def add_employee(self, position: str, skills: List[Skill]):
        employee = Employee(position=position, skills=skills)
        self.employees.append(employee)

    def get_employees_with_ai_skills(self):
        # Comprehensive list of AI-related skills
        ai_skills = {
            "Artificial Intelligence", "Machine Learning", "Deep Learning", "Natural Language Processing",
            "Computer Vision", "Neural Networks", "Reinforcement Learning", "Supervised Learning",
            "Unsupervised Learning", "Generative Adversarial Networks", "Data Science", "Data Engineering",
            "Big Data", "AI Ethics", "Model Optimization", "TensorFlow", "PyTorch", "Keras", "Scikit-Learn",
            "OpenAI GPT", "Hugging Face Transformers", "Speech Recognition", "AI Strategy", "MLOps",
            "Predictive Analytics", "Bayesian Networks", "Causal Inference", "Robotics", "Edge AI",
            "AI Research", "Algorithm Development", "AI Model Deployment", "Transfer Learning", "Graph Neural Networks",
            "Image Recognition", "Object Detection", "Speech Synthesis", "Chatbots", "Time Series Analysis",
            "Recommendation Systems", "Anomaly Detection", "Computer Vision APIs", "Semantic Analysis",
            "Pattern Recognition", "Knowledge Graphs", "AI Governance", "Model Explainability", "Data Labeling"
        }
        employees_with_ai_skills = []
        for employee in self.employees:
            if any(skill.name in ai_skills for skill in employee.skills):
                employees_with_ai_skills.append(employee)
        return employees_with_ai_skills

# Example usage
openai = Company(name="OpenAI")

# Add an employee with a position and skills
openai.add_employee(
    position="Research Scientist",
    skills=[
        Skill(name="Artificial Intelligence", endorsed=True),
        Skill(name="Python", endorsed=False),
        Skill(name="Deep Learning", endorsed=True),
        Skill(name="Graph Neural Networks", endorsed=True)
    ]
)

# Add another employee
openai.add_employee(
    position="AI Engineer",
    skills=[
        Skill(name="Machine Learning", endorsed=True),
        Skill(name="Natural Language Processing", endorsed=True),
        Skill(name="Predictive Analytics", endorsed=False),
        Skill(name="AI Ethics", endorsed=True)
    ]
)

# Add another employee
openai.add_employee(
    position="Data Scientist",
    skills=[
        Skill(name="Data Science", endorsed=True),
        Skill(name="Big Data", endorsed=True),
        Skill(name="TensorFlow", endorsed=False),
        Skill(name="Model Optimization", endorsed=True)
    ]
)

# Retrieve employees with AI-related skills
ai_employees = openai.get_employees_with_ai_skills()

# Print out employees with AI skills
for emp in ai_employees:
    print(f"Position: {emp.position}")
    for skill in emp.skills:
        print(f"  Skill: {skill.name}, Endorsed: {skill.endorsed}")


Position: Research Scientist
  Skill: Artificial Intelligence, Endorsed: True
  Skill: Python, Endorsed: False
  Skill: Deep Learning, Endorsed: True
  Skill: Graph Neural Networks, Endorsed: True
Position: AI Engineer
  Skill: Machine Learning, Endorsed: True
  Skill: Natural Language Processing, Endorsed: True
  Skill: Predictive Analytics, Endorsed: False
  Skill: AI Ethics, Endorsed: True
Position: Data Scientist
  Skill: Data Science, Endorsed: True
  Skill: Big Data, Endorsed: True
  Skill: TensorFlow, Endorsed: False
  Skill: Model Optimization, Endorsed: True


**Stratification by Skill**: `stratify_by_skill()` organizes employees into skill categories (strata) based on their skills. Each skill acts as a key, and the employees with that skill are grouped together. Because an employee with several skills appears in several groups, the strata overlap.

**Stratified Sampling**: `sample_from_strata()` performs the stratified sampling. It calculates the proportion of employees in each skill group (stratum) and samples employees proportionally to match the overall population distribution.

**Checking Skill Overlap**: `check_skill_overlap()` takes a list of skills (e.g., "Python" and "Machine Learning") and counts how many employees in the sampled group have all of the specified skills.

**Example Walkthrough**:

- **Stratification**: Employees are grouped by their skills, so that each skill category is appropriately represented in the sample.
- **Sampling**: You take a 10% sample of employees, with proportions preserved for each skill group.
- **Skill Overlap**: You check how many of the sampled employees have both Python and Machine Learning, giving you insight into the overlap between those skills in the sample.

This approach gives you a practical tool for exploring the skill graph through stratified sampling. You can adjust `sample_size` and `skills_to_check` as needed for your analysis.

In [None]:
import random
from dataclasses import dataclass, field
from typing import List, Dict
from collections import defaultdict

@dataclass
class Skill:
    name: str
    endorsed: bool

@dataclass
class Employee:
    position: str
    skills: List[Skill] = field(default_factory=list)

@dataclass
class Company:
    name: str
    employees: List[Employee] = field(default_factory=list)

    def add_employee(self, position: str, skills: List[Skill]):
        employee = Employee(position=position, skills=skills)
        self.employees.append(employee)

    def stratify_by_skill(self):
        # Stratify employees based on their skills
        strata = defaultdict(list)
        for employee in self.employees:
            for skill in employee.skills:
                strata[skill.name].append(employee)
        return strata

    def sample_from_strata(self, strata: Dict[str, List[Employee]], sample_size: int):
        # Sample from each skill stratum in proportion to its share of employees.
        # Strata overlap (an employee appears once per skill), so the result is
        # de-duplicated and its length can differ from sample_size.
        if sample_size <= 0:
            return []

        total_population = len(self.employees)
        sample = []
        seen_ids = set()

        for skill_name, employees in strata.items():
            proportion = len(employees) / total_population
            # Take at least one employee per stratum so small strata are not
            # truncated to zero
            skill_sample_size = max(1, int(proportion * sample_size))

            for employee in random.sample(employees, min(skill_sample_size, len(employees))):
                if id(employee) not in seen_ids:
                    seen_ids.add(id(employee))
                    sample.append(employee)

        return sample

    def check_skill_overlap(self, sampled_employees: List[Employee], skills_to_check: List[str]):
        overlap_count = 0

        for employee in sampled_employees:
            employee_skill_names = {skill.name for skill in employee.skills}
            if all(skill in employee_skill_names for skill in skills_to_check):
                overlap_count += 1

        return overlap_count

# Example usage
openai = Company(name="OpenAI")

# Add employees with skills
openai.add_employee(
    position="Research Scientist",
    skills=[
        Skill(name="Artificial Intelligence", endorsed=True),
        Skill(name="Python", endorsed=True),
        Skill(name="Deep Learning", endorsed=True)
    ]
)

openai.add_employee(
    position="AI Engineer",
    skills=[
        Skill(name="Machine Learning", endorsed=True),
        Skill(name="Python", endorsed=True),
        Skill(name="NLP", endorsed=False)
    ]
)

openai.add_employee(
    position="Data Scientist",
    skills=[
        Skill(name="Data Science", endorsed=True),
        Skill(name="Big Data", endorsed=True),
        Skill(name="TensorFlow", endorsed=False)
    ]
)

openai.add_employee(
    position="Research Engineer",
    skills=[
        Skill(name="LLMs", endorsed=True),
        Skill(name="Machine Learning", endorsed=False),
        Skill(name="Python", endorsed=True)
    ]
)

# Define a comprehensive list of AI-related skills
ai_skills = [
    "Artificial Intelligence", "Machine Learning", "Deep Learning", "Natural Language Processing",
    "Computer Vision", "Neural Networks", "Reinforcement Learning", "Supervised Learning",
    "Unsupervised Learning", "Generative Adversarial Networks", "Data Science", "Data Engineering",
    "Big Data", "AI Ethics", "Model Optimization", "TensorFlow", "PyTorch", "Keras", "Scikit-Learn",
    "LLMs", "OpenAI GPT", "Hugging Face Transformers", "Speech Recognition", "AI Strategy", "MLOps",
    "Predictive Analytics", "Bayesian Networks", "Causal Inference", "Robotics", "Edge AI",
    "AI Research", "Algorithm Development", "AI Model Deployment", "Transfer Learning", "Graph Neural Networks",
    "Image Recognition", "Object Detection", "Speech Synthesis", "Chatbots", "Time Series Analysis",
    "Recommendation Systems", "Anomaly Detection", "Computer Vision APIs", "Semantic Analysis",
    "Pattern Recognition", "Knowledge Graphs", "AI Governance", "Model Explainability", "Data Labeling"
]

# Stratify employees based on their skills
strata = openai.stratify_by_skill()

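# Sample employees (e.g., 10% of total population, adjust as necessary).
# With only four employees in this toy dataset, 10% truncates to zero, so
# request at least one employee.
sample_size = max(1, int(0.10 * len(openai.employees)))
sampled_employees = openai.sample_from_strata(strata, sample_size)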

# Check overlap for specific skills (e.g., Python and Machine Learning)
skills_to_check = ["Python", "Machine Learning"]
overlap_count = openai.check_skill_overlap(sampled_employees, skills_to_check)

# Print out the results
print(f"Sampled {len(sampled_employees)} employees.")
print(f"Number of employees with both {skills_to_check}: {overlap_count}")


Since late last year, LinkedIn research has seen a 142x increase globally in members adding AI aptitude skills to their LinkedIn profiles - and it’s not just workers in technical roles who are adding those skills!

To simulate skill overlap data for the AI-skilled employees while using a sample to reduce uncertainty, we can use Monte Carlo simulation. This approach will allow you to estimate skill overlap distributions by simulating skill combinations multiple times, incorporating data from actual samples to improve the accuracy of these simulated distributions.

Steps:

1. **Define Skill Probabilities**: Use observed probabilities from the sample for each skill overlap combination. For instance, if the sample shows that 30% of Python-skilled employees also have "Machine Learning," use this as the probability for that overlap in the simulation.
2. **Simulate Skill Overlaps**: Based on these probabilities, run a Monte Carlo simulation to generate synthetic skill overlap data for the entire population of OpenAI's 360 AI-skilled employees.
3. **Adjust with Sample Data**: Incorporate your actual sample data to reduce simulation uncertainty, either by blending sample data with the simulated data or by updating simulation probabilities based on sample findings.

In [1]:
import random
from collections import Counter
import numpy as np

# Define the population size for AI-skilled employees
population_size = 453  # Number of employees with AI skills, as per original context
sample_size = 100      # Sample to refine simulation

# Skill probabilities for AI-specific skills (adjusted based on a 10% sample of OpenAI staff that listed AI as a skill)
skill_probabilities = {
    "Artificial Intelligence": 1.0,            # All profiles included this skill
    "Machine Learning": 1.0,                   # All profiles included this skill
    "Deep Learning": 0.75,                     # Frequently observed across profiles
    "Natural Language Processing": 0.58,       # Consistent but slightly less frequent
    "Computer Vision": 0.67,                   # Moderate presence across profiles
    "Reinforcement Learning": 0.67,            # Moderate presence across profiles
    "Generative AI": 0.78,                     # Frequently seen in profiles
    "Large Language Models": 0.72,             # Consistently observed in profiles
    "Data Science": 0.75,                      # Common across profiles
    "Data Analysis": 0.81,                     # Most frequent among data-related skills
    "Python": 0.97,                            # Almost all profiles included Python
    "TensorFlow": 0.15,                        # Observed with some regularity in AI profiles
    "PyTorch": 0.15,                           # Common with deep learning skills
    "Natural Language Processing (NLP)": 0.1,  # Moderate frequency with AI and language applications
    "NumPy": 0.2,                              # Data analysis-related skill with notable presence
    "Pandas (Software)": 0.1,                  # Appears frequently in data-focused profiles
    "SQL": 0.3,                                # High presence in data-focused roles
    "Java": 0.25,                              # Consistently observed in technical profiles
    "C++": 0.3,                                # Frequently observed in technical roles
    "Statistics": 0.12,                        # Consistently noted in data and AI profiles
    "Research": 0.2,                           # Strongly present across AI-related research roles
    "R": 0.3,                                  # Frequently seen in data science roles
    "Git": 0.1,                                # Observed in technical profiles with software development
    "Algorithms": 0.4,                         # Notable skill with AI, ML, and data structures
    "Data Mining": 0.25,                       # Seen in profiles with a data-oriented focus
    "Product Management": 0.2,                 # Frequently observed in AI-related roles with project scope
    "Cloud Computing": 0.07,                   # Present in profiles dealing with infrastructure
    "Agile Methodologies": 0.15,               # Commonly observed with project-oriented roles
    "Leadership": 0.08,                        # Observed in profiles with management roles
    "Teamwork": 0.08,                          # Frequent among collaborative AI teams
    "Distributed Systems": 0.1,                # Observed in AI profiles related to scalability
    "Computer Science": 0.2                    # Observed frequently in core AI profiles
}


# Function to simulate skill overlaps based on probabilities
def simulate_skill_overlaps(population_size, skill_probabilities):
    simulated_data = []
    for _ in range(population_size):
        skills = set()
        for skill, prob in skill_probabilities.items():
            if random.random() < prob:
                skills.add(skill)
        simulated_data.append(frozenset(skills))  # Use frozenset for unique skill sets
    return simulated_data

# Run initial simulation for the AI-skilled employee population
simulated_data = simulate_skill_overlaps(population_size, skill_probabilities)

# Count occurrences of each skill combination in simulated data
simulated_counts = Counter(simulated_data)

# Draw a subset of the simulated profiles. NOTE: this subset is a stand-in for
# observed profiles; substitute real sampled profiles here to reduce uncertainty.
sampled_data = random.sample(simulated_data, sample_size)

# Estimate each skill's frequency within the sample
sample_skill_counts = Counter(skill for skill_set in sampled_data for skill in skill_set)
observed_probabilities = {
    skill: sample_skill_counts[skill] / sample_size for skill in skill_probabilities
}

# Blend prior and observed per-skill probabilities (70% prior, 30% sample).
# A separate dict is used so the original inputs are not overwritten.
blended_probabilities = {
    skill: 0.7 * prior + 0.3 * observed_probabilities[skill]
    for skill, prior in skill_probabilities.items()
}

# Optionally, re-run simulate_skill_overlaps() with blended_probabilities if needed for greater accuracy

# Print the skill overlap counts after simulation
#print("Simulated Skill Overlap Counts for AI-Skilled Employees:")
#for skill_set, count in simulated_counts.items():
#    print(f"{set(skill_set)}: {count} employees")
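
As a follow-up sketch, the share of profiles holding a given pair of skills can be estimated from simulated profiles, together with a Monte Carlo standard error. Because the simulation above draws each skill independently, the estimate should land near the product of the two input probabilities. The two probabilities are taken from the table above; the seed, the number of draws, and the `estimate_overlap` helper are illustrative assumptions rather than part of the analysis.

```python
import math
import random
from typing import Dict, List, Tuple


def estimate_overlap(
    probs: Dict[str, float], skills: List[str], n_draws: int, seed: int = 0
) -> Tuple[float, float]:
    """Estimate the share of simulated profiles that hold all of `skills`,
    returning (share, Monte Carlo standard error)."""
    rng = random.Random(seed)  # local generator so runs are reproducible
    hits = 0
    for _ in range(n_draws):
        # Draw each skill independently, as in the simulation above
        profile = {skill for skill, p in probs.items() if rng.random() < p}
        if all(skill in profile for skill in skills):
            hits += 1
    share = hits / n_draws
    # Standard error of a proportion estimated from n_draws independent draws
    std_err = math.sqrt(share * (1 - share) / n_draws)
    return share, std_err


probs = {"Python": 0.97, "Deep Learning": 0.75}
share, std_err = estimate_overlap(probs, ["Python", "Deep Learning"], n_draws=10_000)

print(f"Estimated overlap: {share:.3f} +/- {std_err:.3f}")
print(f"Product of input probabilities: {0.97 * 0.75:.3f}")
```

If real profiles show an overlap that differs markedly from this product, that is a sign the independence assumption does not hold and that joint probabilities would need to be modeled directly.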
