<a href="https://colab.research.google.com/github/aivydebnath/NLP-Interview-Analysis/blob/main/Code/Assignment_Webmobi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Interviewing Candidates with Data Science Experience from Diverse Backgrounds
*Text Analysis with Python*



## Installations

**Dependencies**


*   TextBlob: For sentiment analysis
*   spaCy: For key phrase extraction




In [1]:
%pip install pandas textblob spacy



In [2]:
import csv
import random
from textblob import TextBlob
import spacy
import re
import pandas as pd



*   Load the spaCy English model (`en_core_web_sm`). If it is not installed yet, download it first with `python -m spacy download en_core_web_sm`.



In [3]:
nlp = spacy.load('en_core_web_sm')

## Code


*Preparing the dataset: create a text file (**Interview.txt**) containing multiple paragraphs of interview responses.*
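
The processing code below splits the transcript on lines that begin with `Candidate N:`, so the file is assumed to follow roughly the layout sketched here. The sample text and the `split_candidates` helper are illustrative assumptions, not the actual contents of the interview file.

```python
import re

# Hypothetical transcript in the layout the regex expects:
# a "Candidate N:" header line, then the interview text.
sample_transcript = """Candidate 1: HR background
Interviewer: Tell us about a data project you led.
Candidate (HR): I built a logistic regression model to forecast turnover.
Candidate 2: Finance background
Interviewer: Which tools do you use?
Candidate (Finance): Mostly Python and SQL."""


def split_candidates(transcript):
    """Split a transcript into one block per 'Candidate N:' header."""
    # The lookahead keeps the header at the start of each block.
    blocks = re.split(r'\n(?=Candidate \d+:)', transcript)
    return [block.strip() for block in blocks if block.strip()]


for block in split_candidates(sample_transcript):
    header = block.split('\n')[0]
    print(header.split(':')[0])
```

Lines such as `Candidate (HR):` do not trigger a split, because the pattern requires a number right after `Candidate`. Note also that the processing code joins `lines[2:]`, so the line directly after the header is not part of the stored response.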




In [4]:
# Function to classify sentiment from TextBlob polarity (range -1.0 to 1.0)
def analyze_sentiment(text):
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    if polarity > 0:
        return "Positive", 1
    elif polarity < 0:
        return "Negative", -1
    else:
        return "Neutral", 0

# Function to extract key phrases using spaCy

def extract_key_phrases(text):
    doc = nlp(text)
    key_phrases = [chunk.text.lower() for chunk in doc.noun_chunks]
    return key_phrases

# Expected key phrases related to data science, ML, and DL.
# A candidate who mentions these phrases is assumed to have the corresponding
# skills and to be better aligned with the role.
relevant_key_phrases = {
    'data science', 'machine learning', 'deep learning', 'artificial intelligence', 'neural networks',
    'supervised learning', 'unsupervised learning', 'reinforcement learning', 'nlp', 'natural language processing',
    'computer vision', 'data mining', 'big data', 'data visualization', 'data analysis', 'predictive modeling',
    'statistical analysis', 'regression analysis', 'classification', 'clustering', 'feature engineering',
    'data preprocessing', 'data cleaning', 'eda', 'exploratory data analysis', 'time series analysis',
    'anomaly detection', 'model evaluation', 'cross-validation', 'python', 'r', 'sql', 'tensorflow', 'keras',
    'pytorch', 'scikit-learn', 'pandas', 'numpy', 'matplotlib', 'seaborn', 'data pipelines', 'etl', 'apache spark',
    'hadoop', 'data warehousing', 'aws', 'azure', 'google cloud', 'data governance', 'data ethics', 'bi', 'business intelligence',
    'dashboarding', 'power bi', 'tableau', 'data storytelling', 'decision trees', 'decision tree', 'random forests', 'gradient boosting',
    'xgboost', 'lightgbm', 'ensemble methods', 'svm', 'support vector machines', 'knn', 'k-nearest neighbors', 'dimensionality reduction',
    'pca', 'principal component analysis', 't-sne', 'dbscan', 'data wrangling', 'feature selection', 'hyperparameter tuning',
    'grid search', 'bayesian optimization', 'model deployment', 'mlops', 'a/b testing', 'causal inference', 'data strategy',
    'data-driven decision making', 'recommender systems', 'graph analytics', 'knowledge graphs', 'data integration', 'communication', 'team building', 'strategic thinking'
}

# Reading the text file containing interview responses
def read_interview_responses(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        transcript = file.read()
    return transcript

# Processing interview responses

def process_interview_responses(transcript):
    # Splitting the transcript into individual responses
    responses = re.split(r'\n(?=Candidate \d+:)', transcript)
    results = []

    # Processing each response from the text file
    for response in responses:
        lines = response.strip().split('\n')
        if len(lines) < 2:
            continue

        candidate_info = lines[0].strip()
        candidate_response = ' '.join(lines[2:]).strip()

        # Applying sentiment analysis
        sentiment, sentiment_score = analyze_sentiment(candidate_response)

        # Extracting key phrases using spaCy
        key_phrases = extract_key_phrases(candidate_response)

        # Assessing relevance of key phrases
        relevant_key_phrases_found = [phrase for phrase in relevant_key_phrases if phrase in key_phrases]
        relevance_score = len(relevant_key_phrases_found) / len(relevant_key_phrases) if len(relevant_key_phrases) > 0 else 0

        # Calculating overall quality score
        quality_score = sentiment_score + relevance_score

        # Determining overall quality assessment
        if quality_score >= 1.5:
            quality_assessment = "Excellent"
        elif quality_score >= 0.5:
            quality_assessment = "Good"
        elif quality_score >= -0.5:
            quality_assessment = "Fair"
        else:
            quality_assessment = "Poor"

        # Preparing the result in the format we want
        result = {
            'Candidate': candidate_info.split(':')[0].strip(),
            'Response': candidate_response,
            'Sentiment': sentiment,
            'Key Phrases': key_phrases,
            'Relevant Key Phrases': relevant_key_phrases_found,
            'Quality Score': quality_score,
            'Quality Assessment': quality_assessment
        }
        results.append(result)

    return results
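
Two properties of the scoring above are worth seeing in isolation. First, `phrase in key_phrases` requires a noun chunk to equal a vocabulary phrase exactly, so a chunk such as "a machine learning model" would not count for "machine learning". Second, the sentiment score is at most 1, so the "Excellent" band (score of 1.5 or more) is only reachable when at least half of the vocabulary is matched. The sketch below is one possible alternative, not the notebook's method: `find_relevant_phrases` and `assess_quality` are hypothetical helpers, and the small vocabulary is made up for the example.

```python
import re


def find_relevant_phrases(text, vocabulary):
    """Return vocabulary phrases that occur in text as whole words."""
    lowered = text.lower()
    found = []
    for phrase in sorted(vocabulary):
        # Lookarounds act as word boundaries that also work for
        # phrases containing punctuation, such as "a/b testing".
        pattern = r'(?<!\w)' + re.escape(phrase.lower()) + r'(?!\w)'
        if re.search(pattern, lowered):
            found.append(phrase)
    return found


def assess_quality(sentiment_score, found_count, vocabulary_size):
    """Apply the same score bands as process_interview_responses."""
    relevance = found_count / vocabulary_size if vocabulary_size else 0
    score = sentiment_score + relevance
    if score >= 1.5:
        return "Excellent"
    if score >= 0.5:
        return "Good"
    if score >= -0.5:
        return "Fair"
    return "Poor"


# Hypothetical mini-vocabulary and response, for illustration only.
vocabulary = {'machine learning', 'python', 'r', 'sql'}
text = "I trained a machine learning model in Python."
found = find_relevant_phrases(text, vocabulary)
print(found)
print(assess_quality(1, len(found), len(vocabulary)))
```

Matching against the raw text sidesteps the exact-chunk requirement, and the whole-word check keeps short entries such as `r` from matching inside other words. With a vocabulary as large as the one defined above, the relevance term stays small, so the assessment is driven mostly by the sentiment label.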


## Output

*Results are stored in a CSV file for easier inspection and analysis.*

In [5]:
def save_results_to_csv(results, output_file):
    df = pd.DataFrame(results)
    df.to_csv(output_file, index=False)
    print(f"Quality assessment results saved to {output_file}")

# Main function to execute the script
def main():
    # The file holds the interview transcripts for all candidates
    # (assumption: 30 candidates with data science experience were interviewed).
    file_path = "/content/sample_data/Interview.txt"
    transcript = read_interview_responses(file_path)
    results = process_interview_responses(transcript)
    output_file = "quality_assessment_results.csv"
    save_results_to_csv(results, output_file)

if __name__ == "__main__":
    main()

Quality assessment results saved to quality_assessment_results.csv
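
The `Key Phrases` and `Relevant Key Phrases` columns hold Python lists, which `DataFrame.to_csv` writes as their string representation (for example `['a project', 'a model']`). If the CSV is meant to be read by other tools, joining each list into a delimited string first may be easier to work with. The sketch below shows that idea with the standard `csv` module; `flatten_lists`, the `'; '` separator, and the sample row are assumptions made for illustration.

```python
import csv
import io


def flatten_lists(row, separator='; '):
    """Return a copy of row with list values joined into one string."""
    return {
        key: separator.join(value) if isinstance(value, list) else value
        for key, value in row.items()
    }


# Hypothetical result row with the same list-valued columns as above.
results = [{
    'Candidate': 'Candidate 1',
    'Key Phrases': ['a project', 'a logistic regression model'],
    'Relevant Key Phrases': [],
}]

buffer = io.StringIO()  # stands in for the output file
writer = csv.DictWriter(buffer, fieldnames=list(results[0].keys()))
writer.writeheader()
writer.writerows(flatten_lists(row) for row in results)

# Reading the column back only needs a split on the same separator.
buffer.seek(0)
for row in csv.DictReader(buffer):
    print(row['Key Phrases'].split('; '))
```

The same `flatten_lists` step could be applied to `results` before building the DataFrame, provided the separator never occurs inside a key phrase.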


*Results are displayed in the console*

In [6]:
def main():
    file_path = "/content/sample_data/Interview.txt"
    transcript = read_interview_responses(file_path)
    results = process_interview_responses(transcript)

    for result in results:
        print(result['Candidate'])
        print(f"Response: {result['Response']}")
        print(f"Sentiment: {result['Sentiment']}")
        print(f"Key Phrases: {', '.join(result['Key Phrases'])}")
        print(f"Relevant Key Phrases: {', '.join(result['Relevant Key Phrases'])}")
        print(f"Quality Assessment: {result['Quality Assessment']}")
        print()

if __name__ == "__main__":
    main()

Candidate 1
Response: Candidate (HR): Certainly. I led a project to develop a predictive analytics model to forecast employee turnover. Using historical HR data, such as employee tenure, performance reviews, and engagement survey results, we built a logistic regression model. The model achieved an accuracy of 85%, which helped the HR department proactively address retention risks. By implementing targeted retention strategies, we reduced turnover by 12% over the next year.  Interviewer: How did you handle the data privacy concerns associated with employee data?  Candidate (HR): Data privacy was a top priority. We anonymized the data to ensure individual employees couldn't be identified. Additionally, we implemented strict access controls, ensuring only authorized personnel could access sensitive information. We also worked closely with our legal and compliance teams to ensure our practices met all relevant data protection regulations.
Sentiment: Positive
Key Phrases: candidate (hr, i, 