# Grounded Theory Analysis: Professional Identity & AI

## Research Question
**"How do professionals negotiate maintaining their identity and legitimacy in the face of a tool that threatens to replace them?"**

---

### Dataset
- **Source**: [Anthropic/AnthropicInterviewer](https://huggingface.co/datasets/Anthropic/AnthropicInterviewer)
- **Split**: workforce (1,000 interviews)
- **Method**: Gioia Methodology for Grounded Theory
- **Model**: Gemini 2.5 Flash

---

## 1. Installing Dependencies

In [None]:
!pip install -q google-generativeai
!pip install -q datasets
!pip install -q pandas
!pip install -q tqdm
!pip install -q numpy
!pip install -q scikit-learn

## 2. Configuration

In [None]:
import google.generativeai as genai
from datasets import load_dataset
import json
import re
from typing import List, Dict, Any, Optional
import pandas as pd
from tqdm.notebook import tqdm
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import time
import random

# ============================================
# CONFIGURATION - EDIT HERE IF NEEDED
# ============================================

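# Gemini API key: read from the environment instead of hardcoding a secret in the notebook
import os
GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]  # raises KeyError if the variable is not set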

# Gemini model
MODEL_NAME = "gemini-2.5-flash-preview-05-20"
EMBEDDING_MODEL = "models/embedding-001"

# Research question
RESEARCH_QUESTION = """How do professionals negotiate maintaining their identity and legitimacy 
in the face of a tool that threatens to replace them?"""

# Number of interviews to analyze (reduce for testing)
SAMPLE_SIZE = 50  # Set to None to analyze everything

# API configuration
genai.configure(api_key=GOOGLE_API_KEY)

print("✅ Configuration loaded")
print(f"   Model: {MODEL_NAME}")
print(f"   Sample size: {SAMPLE_SIZE or 'All'}")

## 3. Loading the Dataset

In [None]:
# Load the dataset from Hugging Face
print("📥 Loading the Anthropic/AnthropicInterviewer dataset...")
dataset = load_dataset("Anthropic/AnthropicInterviewer", split="workforce")

print(f"\n✅ Dataset loaded: {len(dataset)} interviews")
print(f"\n📋 Available columns: {dataset.column_names}")

# Preview
print("\n📄 Sample interview:")
print(dataset[0])

In [None]:
# Prepare the data for analysis
def prepare_interviews(dataset, sample_size=None):
    """Prepare the interviews for analysis."""
    interviews = []
    
    # Sample if requested
    indices = range(len(dataset))
    if sample_size and sample_size < len(dataset):
        indices = random.sample(list(indices), sample_size)
    
    for idx in indices:
        item = dataset[idx]
        
        # Adapt to the actual structure of the dataset
        # The dataset may have different structures
        interview = {
            'id': idx,
            'content': '',
            'metadata': {}
        }
        
        # Extract the text content
        if isinstance(item, dict):
            # If it is a dict, look for the text fields
            for key in ['text', 'content', 'transcript', 'conversation', 'messages']:
                if key in item:
                    val = item[key]
                    if isinstance(val, str):
                        interview['content'] = val
                    elif isinstance(val, list):
                        # If it is a list of messages
                        interview['content'] = "\n".join([str(m) for m in val])
                    break
            
            # If nothing was found, serialize the whole dict
            if not interview['content']:
                interview['content'] = json.dumps(item, ensure_ascii=False)
            
            interview['metadata'] = {k: v for k, v in item.items() 
                                    if k not in ['text', 'content', 'transcript', 'conversation', 'messages']}
        else:
            interview['content'] = str(item)
        
        interviews.append(interview)
    
    return interviews

# Prepare the interviews
interviews = prepare_interviews(dataset, SAMPLE_SIZE)
print(f"\n✅ {len(interviews)} interviews prepared for analysis")

# Show a preview
print("\n📄 Preview of the first interview (first 500 characters):")
print(interviews[0]['content'][:500] + "...")

## 4. Gemini Service

In [None]:
class GeminiService:
    """Service for Gemini API calls with error handling."""
    
    def __init__(self, model_name: str = MODEL_NAME):
        self.model = genai.GenerativeModel(model_name)
        self.request_count = 0
        self.last_request_time = 0
    
    def _rate_limit(self):
        """Basic rate limiting."""
        self.request_count += 1
        # Pause every 10 requests to stay under the rate limits
        if self.request_count % 10 == 0:
            time.sleep(2)
    
    def generate(self, prompt: str, temperature: float = 0.7, max_retries: int = 3) -> str:
        """Generate a response with automatic retries; returns "" if every attempt fails."""
        self._rate_limit()
        
        config = genai.types.GenerationConfig(
            temperature=temperature,
            top_p=0.95,
            max_output_tokens=8192,
        )
        
        for attempt in range(max_retries):
            try:
                response = self.model.generate_content(prompt, generation_config=config)
                return response.text
            except Exception as e:
                if attempt < max_retries - 1:
                    wait_time = (attempt + 1) * 5
                    print(f"⚠️ API error, retrying in {wait_time}s: {str(e)[:100]}")
                    time.sleep(wait_time)
                else:
                    print(f"❌ Failed after {max_retries} attempts: {e}")
                    return ""
    
    def generate_json(self, prompt: str, temperature: float = 0.3) -> Any:
        """Generate a structured JSON response."""
        json_prompt = f"""{prompt}

CRITICAL: Respond ONLY with valid JSON. No text before or after.
Do not use ```json or ``` markers.
"""
        response = self.generate(json_prompt, temperature=temperature)
        
        if not response:
            return None
        
        # Clean up the response
        cleaned = response.strip()
        if cleaned.startswith('```json'):
            cleaned = cleaned[7:]
        if cleaned.startswith('```'):
            cleaned = cleaned[3:]
        if cleaned.endswith('```'):
            cleaned = cleaned[:-3]
        
        try:
            return json.loads(cleaned.strip())
        except json.JSONDecodeError as e:
            print(f"⚠️ JSON parsing error: {e}")
            # Fall back to extracting the outermost JSON array or object
            match = re.search(r'[\[\{].*[\]\}]', cleaned, re.DOTALL)
            if match:
                try:
                    return json.loads(match.group())
                except json.JSONDecodeError:
                    pass
            return None

# Global instance
gemini = GeminiService()
print("✅ GeminiService initialized")
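
The JSON clean-up step in `generate_json` can be tried in isolation, without any API call. The sketch below is a standalone variant and not the notebook's own method: `parse_model_json` and `FENCE` are hypothetical names introduced here, and the fence handling drops the whole opening fence line instead of slicing fixed character counts.

```python
import json
import re
from typing import Any, Optional

# Three backticks, built at runtime so this example contains no literal fence.
FENCE = "`" * 3


def parse_model_json(raw: str) -> Optional[Any]:
    """Parse JSON from a model reply that may be wrapped in a markdown fence.

    Hypothetical helper for illustration; returns None when nothing parses.
    """
    cleaned = raw.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, including a language tag such as "json".
        newline = cleaned.find("\n")
        cleaned = cleaned[newline + 1:] if newline != -1 else cleaned[len(FENCE):]
    if cleaned.endswith(FENCE):
        cleaned = cleaned[:-len(FENCE)]
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the outermost array or object embedded in extra text.
        match = re.search(r"[\[{].*[\]}]", cleaned, re.DOTALL)
        if match is None:
            return None
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            return None


fenced_reply = FENCE + 'json\n["code 1", "code 2"]\n' + FENCE
print(parse_model_json(fenced_reply))
print(parse_model_json('Here you go: {"Theme": ["code 1"]} Hope this helps.'))
```

Returning `None` on failure mirrors the notebook's behavior, so callers can keep their `isinstance` checks.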

## 5. Gioia Coding - Adapted to the Research Question

In [None]:
class GioiaAnalysis:
    """Gioia analysis adapted to the study of professional identity in the face of AI."""
    
    def __init__(self, gemini_service: GeminiService, research_question: str):
        self.gemini = gemini_service
        self.research_question = research_question
    
    def initial_coding(self, interview: Dict) -> List[str]:
        """First-order coding focused on identity and legitimacy."""
        prompt = f"""You are an expert qualitative researcher using the Gioia methodology.

RESEARCH QUESTION:
{self.research_question}

INTERVIEW TRANSCRIPT:
{interview['content'][:12000]}

Analyze this interview and identify FIRST-ORDER CODES related to:
1. Professional identity (how they define themselves, their expertise, their role)
2. Perceived threats from AI (fears, concerns, anxieties about replacement)
3. Coping strategies (how they adapt, resist, or embrace AI)
4. Legitimacy claims (how they justify their continued value/relevance)
5. Identity negotiation (how they redefine or maintain their professional self)

First-order codes should be close to the informant's own words and expressions.
Identify 5-15 relevant codes from this interview.

Respond with a JSON array of strings:
["code 1", "code 2", "code 3", ...]
"""
        result = self.gemini.generate_json(prompt)
        return result if isinstance(result, list) else []
    
    def second_order_coding(self, first_order_codes: List[str]) -> Dict[str, List[str]]:
        """Second-order coding: emergent themes."""
        prompt = f"""You are an expert qualitative researcher using the Gioia methodology.

RESEARCH QUESTION:
{self.research_question}

FIRST-ORDER CODES from interviews with professionals about AI:
{json.dumps(first_order_codes, ensure_ascii=False, indent=2)}

Group these first-order codes into SECOND-ORDER THEMES.
Second-order themes are more abstract, researcher-driven concepts that capture:
- Identity maintenance strategies
- Threat perception patterns
- Legitimacy construction mechanisms
- Adaptation and resistance behaviors
- Professional boundary work

Create 5-10 meaningful second-order themes.

Respond with a JSON object:
{{
  "Theme Name": ["first-order code 1", "first-order code 2"],
  "Another Theme": ["code 3", "code 4", "code 5"]
}}
"""
        result = self.gemini.generate_json(prompt)
        return result if isinstance(result, dict) else {}
    
    def aggregate_dimensions(self, second_order_codes: Dict[str, List[str]]) -> Dict[str, List[str]]:
        """Aggregate dimensions: theoretical concepts."""
        prompt = f"""You are an expert qualitative researcher using the Gioia methodology.

RESEARCH QUESTION:
{self.research_question}

SECOND-ORDER THEMES:
{json.dumps(second_order_codes, ensure_ascii=False, indent=2)}

Aggregate these second-order themes into AGGREGATE DIMENSIONS.
Aggregate dimensions are high-level theoretical constructs that:
- Answer the research question
- Connect to existing theory (identity theory, legitimacy theory, technology acceptance, etc.)
- Provide a framework for understanding professional identity negotiation

Consider dimensions related to:
- Identity work (construction, maintenance, transformation)
- Legitimacy strategies (claiming, defending, redefining)
- Human-AI boundary negotiation
- Professional resilience mechanisms

Create 3-5 aggregate dimensions.

Respond with a JSON object:
{{
  "Aggregate Dimension 1": ["Theme A", "Theme B"],
  "Aggregate Dimension 2": ["Theme C", "Theme D"]
}}
"""
        result = self.gemini.generate_json(prompt)
        return result if isinstance(result, dict) else {}
    
    def run_full_analysis(self, interviews: List[Dict]) -> Dict:
        """Run the full Gioia analysis."""
        print(f"\n{'='*60}")
        print("🔬 GIOIA ANALYSIS - Professional Identity & AI")
        print(f"{'='*60}")
        print(f"\n📋 Research Question:\n{self.research_question}\n")
        
        # Step 1: initial coding
        print("\n📝 STEP 1: First-Order Coding...")
        all_first_order = []
        coded_interviews = []
        
        for i, interview in enumerate(tqdm(interviews, desc="Coding interviews")):
            codes = self.initial_coding(interview)
            interview['first_order_codes'] = codes
            coded_interviews.append(interview)
            all_first_order.extend(codes)
            
            # Progress update
            if (i + 1) % 10 == 0:
                print(f"   Processed {i+1}/{len(interviews)} interviews, {len(set(all_first_order))} unique codes")
        
        # Deduplicate
        unique_first_order = list(set(all_first_order))
        print(f"\n   ✅ {len(unique_first_order)} unique first-order codes identified")
        
        # Step 2: second-order coding
        print("\n📝 STEP 2: Second-Order Coding...")
        second_order = self.second_order_coding(unique_first_order)
        print(f"   ✅ {len(second_order)} second-order themes created")
        
        # Step 3: aggregate dimensions
        print("\n📝 STEP 3: Aggregate Dimensions...")
        dimensions = self.aggregate_dimensions(second_order)
        print(f"   ✅ {len(dimensions)} aggregate dimensions created")
        
        return {
            'interviews': coded_interviews,
            'first_order_codes': unique_first_order,
            'second_order_themes': second_order,
            'aggregate_dimensions': dimensions
        }

# Instance
gioia = GioiaAnalysis(gemini, RESEARCH_QUESTION)
print("✅ GioiaAnalysis initialized")

## 6. Running the Analysis

In [None]:
# Run the full Gioia analysis
results = gioia.run_full_analysis(interviews)

## 7. Visualizing the Results

In [None]:
from IPython.display import display, HTML, Markdown

def display_gioia_structure(results: Dict):
    """Display the full Gioia data structure."""
    dimensions = results.get('aggregate_dimensions', {})
    second_order = results.get('second_order_themes', {})
    
    html = "<style>"
    html += ".gioia-table { border-collapse: collapse; width: 100%; font-family: Arial, sans-serif; }"
    html += ".gioia-table th, .gioia-table td { border: 1px solid #ddd; padding: 10px; text-align: left; vertical-align: top; }"
    html += ".gioia-table th { background-color: #4a90d9; color: white; }"
    html += ".dim-cell { background-color: #e8f4e8; font-weight: bold; font-size: 14px; }"
    html += ".theme-cell { background-color: #fff8e8; }"
    html += ".code-cell { font-size: 12px; color: #555; }"
    html += "</style>"
    
    html += "<h2>📊 Gioia Data Structure</h2>"
    html += "<table class='gioia-table'>"
    html += "<tr><th>1st Order Codes</th><th>2nd Order Themes</th><th>Aggregate Dimensions</th></tr>"
    
    for dim_name, themes in dimensions.items():
        first_row_dim = True
        theme_count = len(themes)
        
        for theme in themes:
            first_codes = second_order.get(theme, [])
            codes_html = "<br>".join([f"• {c}" for c in first_codes[:6]])
            if len(first_codes) > 6:
                codes_html += f"<br><i>... +{len(first_codes)-6} more</i>"
            
            html += "<tr>"
            html += f"<td class='code-cell'>{codes_html}</td>"
            html += f"<td class='theme-cell'>{theme}</td>"
            
            if first_row_dim:
                html += f"<td class='dim-cell' rowspan='{theme_count}'>{dim_name}</td>"
                first_row_dim = False
            
            html += "</tr>"
    
    html += "</table>"
    display(HTML(html))

# Display the structure
display_gioia_structure(results)

In [None]:
# Display summary statistics
print("\n" + "="*60)
print("📊 ANALYSIS SUMMARY")
print("="*60)
print(f"\n📋 Interviews analyzed: {len(results['interviews'])}")
print(f"📝 First-order codes: {len(results['first_order_codes'])}")
print(f"🏷️ Second-order themes: {len(results['second_order_themes'])}")
print(f"📦 Aggregate dimensions: {len(results['aggregate_dimensions'])}")

print("\n\n📦 AGGREGATE DIMENSIONS:")
print("-"*40)
for dim, themes in results['aggregate_dimensions'].items():
    print(f"\n🔹 {dim}")
    for theme in themes:
        code_count = len(results['second_order_themes'].get(theme, []))
        print(f"   └─ {theme} ({code_count} codes)")

## 8. Building the Theoretical Model

In [None]:
def build_theoretical_model(results: Dict, research_question: str) -> Dict:
    """Build a theoretical model from the Gioia results."""
    
    prompt = f"""You are a qualitative research expert building grounded theory.

RESEARCH QUESTION:
{research_question}

GIOIA ANALYSIS RESULTS:

Aggregate Dimensions:
{json.dumps(results['aggregate_dimensions'], ensure_ascii=False, indent=2)}

Second-Order Themes:
{json.dumps(results['second_order_themes'], ensure_ascii=False, indent=2)}

Based on this analysis, construct a THEORETICAL MODEL that:

1. Answers the research question
2. Identifies key constructs and their relationships
3. Proposes testable propositions
4. Connects to existing literature on:
   - Professional identity theory
   - Legitimacy theory
   - Technology acceptance/resistance
   - Boundary work in professions

Respond with a JSON object:
{{
  "model_name": "Name of your theoretical model",
  "core_argument": "The central theoretical argument (2-3 sentences)",
  "constructs": [
    {{
      "name": "Construct name",
      "definition": "Definition",
      "type": "independent/dependent/mediator/moderator/process",
      "grounded_in": ["Aggregate dimension(s) it comes from"]
    }}
  ],
  "propositions": [
    "P1: First proposition...",
    "P2: Second proposition..."
  ],
  "theoretical_contributions": [
    "Contribution 1",
    "Contribution 2"
  ],
  "practical_implications": [
    "Implication 1",
    "Implication 2"
  ],
  "boundary_conditions": [
    "When/where this model applies"
  ],
  "future_research": [
    "Direction 1",
    "Direction 2"
  ]
}}
"""
    
    model = gemini.generate_json(prompt)
    return model if isinstance(model, dict) else {}

# Build the model
print("🏗️ Building theoretical model...")
theoretical_model = build_theoretical_model(results, RESEARCH_QUESTION)
print("✅ Model constructed")

In [None]:
# Display the theoretical model
def display_theoretical_model(model: Dict):
    md = f"""# 🧠 {model.get('model_name', 'Theoretical Model')}

## Core Argument
{model.get('core_argument', 'N/A')}

---

## Theoretical Constructs
"""
    for c in model.get('constructs', []):
        md += f"\n### {c.get('name')} *({c.get('type')})*\n"
        md += f"{c.get('definition')}\n"
        md += f"- *Grounded in: {', '.join(c.get('grounded_in', []))}*\n"
    
    md += "\n---\n\n## Propositions\n"
    for p in model.get('propositions', []):
        md += f"- **{p}**\n"
    
    md += "\n---\n\n## Theoretical Contributions\n"
    for c in model.get('theoretical_contributions', []):
        md += f"- {c}\n"
    
    md += "\n## Practical Implications\n"
    for i in model.get('practical_implications', []):
        md += f"- {i}\n"
    
    md += "\n## Boundary Conditions\n"
    for b in model.get('boundary_conditions', []):
        md += f"- {b}\n"
    
    md += "\n## Future Research Directions\n"
    for f in model.get('future_research', []):
        md += f"- {f}\n"
    
    display(Markdown(md))

display_theoretical_model(theoretical_model)

## 9. Visualizing the Model (Mermaid)

In [None]:
def generate_model_diagram(model: Dict) -> str:
    """Generate a Mermaid diagram of the model."""
    prompt = f"""Create a Mermaid flowchart diagram for this theoretical model:

{json.dumps(model, ensure_ascii=False, indent=2)}

Requirements:
- Use graph LR (left to right) or graph TD (top down)
- Show all constructs as nodes
- Show relationships based on propositions as labeled arrows
- Use different node shapes for different construct types:
  - Independent variables: rectangles
  - Dependent variables: rounded rectangles
  - Mediators: circles
  - Moderators: diamonds
- Add colors using style definitions

Return ONLY the Mermaid code, no markdown markers.
"""
    
    response = gemini.generate(prompt, temperature=0.3)
    
    # Clean up
    cleaned = response.strip()
    if cleaned.startswith('```mermaid'):
        cleaned = cleaned[10:]
    if cleaned.startswith('```'):
        cleaned = cleaned[3:]
    if cleaned.endswith('```'):
        cleaned = cleaned[:-3]
    
    return cleaned.strip()

# Generate the diagram
print("📊 Generating model diagram...")
mermaid_code = generate_model_diagram(theoretical_model)
print("\n📈 Mermaid Diagram Code:")
print("-"*40)
print(mermaid_code)

In [None]:
# Display the diagram (works in Colab/Jupyter)
from IPython.display import display, HTML

html = f"""
<script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
<script>mermaid.initialize({{startOnLoad:true, theme:'default'}});</script>
<div class="mermaid">
{mermaid_code}
</div>
"""
display(HTML(html))

## 10. Exporting the Results

In [None]:
from datetime import datetime

# Compile all results
final_results = {
    'metadata': {
        'research_question': RESEARCH_QUESTION,
        'dataset': 'Anthropic/AnthropicInterviewer',
        'split': 'workforce',
        'sample_size': len(interviews),
        'model': MODEL_NAME,
        'analysis_date': datetime.now().isoformat()
    },
    'gioia_analysis': {
        'first_order_codes': results['first_order_codes'],
        'second_order_themes': results['second_order_themes'],
        'aggregate_dimensions': results['aggregate_dimensions']
    },
    'theoretical_model': theoretical_model,
    'visualization': {
        'mermaid_diagram': mermaid_code
    },
    'coded_interviews': [
        {
            'id': i['id'],
            'first_order_codes': i.get('first_order_codes', [])
        } for i in results['interviews']
    ]
}

# Save
filename = f"grounded_theory_professional_identity_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(filename, 'w', encoding='utf-8') as f:
    json.dump(final_results, f, ensure_ascii=False, indent=2)

print(f"\n✅ Results saved to: {filename}")

# Download (Colab)
try:
    from google.colab import files
    files.download(filename)
    print("📥 Download started...")
except ImportError:
    print("(Manual download required outside Colab)")

In [None]:
# Export the codes to CSV for further analysis
codes_data = []
for theme, codes in results['second_order_themes'].items():
    # Find the aggregate dimension
    dimension = None
    for dim, themes in results['aggregate_dimensions'].items():
        if theme in themes:
            dimension = dim
            break
    
    for code in codes:
        codes_data.append({
            'first_order_code': code,
            'second_order_theme': theme,
            'aggregate_dimension': dimension
        })

df_codes = pd.DataFrame(codes_data)
csv_filename = f"gioia_codes_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df_codes.to_csv(csv_filename, index=False)
print(f"\n✅ Codes exported to: {csv_filename}")

# Show a preview
display(df_codes.head(20))

---
## 📖 Notes

### The Gioia Method
The Gioia method is a rigorous approach to developing grounded theory. It builds the data structure in three levels:
1. **First-order codes**: close to the participants' own words
2. **Second-order themes**: more abstract, researcher-driven concepts
3. **Aggregate dimensions**: high-level theoretical constructs
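
The three levels above map naturally onto two nested dictionaries, the same shape the notebook's `second_order_themes` and `aggregate_dimensions` results use. The following is a minimal sketch of that shape and of flattening it into one row per first-order code. The helper name `flatten_gioia` and all example codes, themes, and dimensions are invented for illustration and do not come from the dataset.

```python
from typing import Dict, List, Optional


def flatten_gioia(
    second_order: Dict[str, List[str]],
    dimensions: Dict[str, List[str]],
) -> List[Dict[str, Optional[str]]]:
    """Return one row per first-order code with its theme and dimension."""
    # Invert dimension -> themes into theme -> dimension for direct lookup.
    theme_to_dimension = {
        theme: dim for dim, themes in dimensions.items() for theme in themes
    }
    rows = []
    for theme, codes in second_order.items():
        for code in codes:
            rows.append({
                "first_order_code": code,
                "second_order_theme": theme,
                # None marks a theme the model left out of every dimension.
                "aggregate_dimension": theme_to_dimension.get(theme),
            })
    return rows


# Invented example data, for illustration only.
second_order = {
    "Reframing expertise": ["AI can't read the room", "I still sign off on it"],
    "Unassigned theme": ["I use it for drafts only"],
}
dimensions = {"Identity work": ["Reframing expertise"]}

for row in flatten_gioia(second_order, dimensions):
    print(row)
```

Rows with a `None` dimension are worth checking by hand, since they show where the aggregation step dropped or renamed a theme.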

### Customization
- Change `SAMPLE_SIZE` to analyze more or fewer interviews
- Adapt `RESEARCH_QUESTION` to shift the focus
- Change the model via `MODEL_NAME` in the configuration cell
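
When changing `SAMPLE_SIZE`, note that the notebook samples with the global `random` module and no seed, so each run codes a different subset of interviews. A minimal sketch of a reproducible alternative follows. The helper name `sample_indices` and the `seed` parameter are assumptions added here, not part of the notebook.

```python
import random
from typing import List, Optional


def sample_indices(n_total: int, sample_size: Optional[int], seed: int = 42) -> List[int]:
    """Pick a reproducible subset of interview indices.

    Hypothetical helper: a falsy sample_size (None or 0) or one that is at
    least n_total selects everything, matching the notebook's behavior.
    """
    if not sample_size or sample_size >= n_total:
        return list(range(n_total))
    # A local generator keeps the draw independent of global random state.
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), sample_size))


first_run = sample_indices(1000, 50)
second_run = sample_indices(1000, 50)
print(first_run == second_run)   # same seed, same subset
print(len(sample_indices(1000, None)))
```

Recording the seed next to the exported results makes it possible to re-code exactly the same interviews later.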

### References
- Gioia, D. A., Corley, K. G., & Hamilton, A. L. (2013). Seeking qualitative rigor in inductive research: Notes on the Gioia methodology. *Organizational Research Methods, 16*(1), 15-31.
- Dataset: [Anthropic/AnthropicInterviewer](https://huggingface.co/datasets/Anthropic/AnthropicInterviewer)