# Employee Sentiment Analysis — End-to-End Deliverables

**Inputs**
- Raw emails CSV: `data/test(in).csv` (or your chosen input)
- Cached labels: `data/labeld_sentiments.csv` (auto-created if missing)

**Outputs produced by this notebook**
- Labeled messages: `data/labeld_sentiments.csv`
- EDA figures: `visualizations/*.png`
- Monthly sentiment scores: `data/employee_monthly_sentiment_scores.csv`
- Monthly rankings: `data/employee_monthly_rankings.csv`
- Flight-risk employees: `data/flight_risk_employees.csv`
- Trained linear model: `data/regression_model.joblib`
- Model coefficients: `data/regression_model_coefficients.csv`
- Model diagnostics: `visualizations/model/*.png`

> Spec coverage: sentiment labeling, EDA + visuals, monthly scoring, ranking, flight-risk, linear regression trend model.

**Project:** Springer Capital Employee Sentiment Analysis

**Objective:** Analyze an unlabeled dataset of employee messages to assess sentiment, identify trends, calculate employee scores, rank employees, flag flight risks, and build a predictive regression model.

This notebook follows the project specification, implementing each task sequentially.

---
## 0. Setup and Imports

First, we import all necessary libraries and the custom classes from the `src` directory. We also define key file paths.

In [None]:
import os
import pandas as pd

# Project modules
from src.load_data import LoadData
from src.labeling import SentimentLabeler
from src.plot_data import PlotData
from src.ranking import EmployeeScoring, EmployeeRanking
from src.regression import FeatureEngineer, TrainRegressionModel, PredictScore
from src.model_plotter import ModelPlotter

# I/O
DATA_DIR = "data"
VIZ_DIR = "visualizations"
MODEL_DIR = "data"

RAW_CSV = os.path.join(DATA_DIR, "test(in).csv")  # adjust if your input differs
LABELED_CSV = os.path.join(DATA_DIR, "labeld_sentiments.csv")
MONTHLY_SCORES_CSV = os.path.join(DATA_DIR, "employee_monthly_sentiment_scores.csv")
RANKINGS_CSV = os.path.join(DATA_DIR, "employee_monthly_rankings.csv")
FLIGHT_RISK_CSV = os.path.join(DATA_DIR, "flight_risk_employees.csv")
MODEL_PATH = os.path.join(MODEL_DIR, "regression_model.joblib")
COEFF_PATH = os.path.join(MODEL_DIR, "regression_model_coefficients.csv")

os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(VIZ_DIR, exist_ok=True)

## 1. Data Ingestion and Normalization

**Purpose:**  
This step loads the raw employee message dataset into a structured pandas DataFrame using the `LoadData` class. It ensures that text, date, and sender information are consistently formatted.

**Why it matters:**  
Clean, normalized data is critical for reliable downstream analysis. Inconsistent datetime formats, null sender IDs, or malformed text fields would otherwise break sentiment labeling and time-based scoring.

**Expected output:**  
- A DataFrame (`raw_df`) containing normalized fields:  
  `['employee_id', 'date', 'subject', 'body', 'text']`  
- Dates converted to `datetime` objects, messages sorted chronologically.

**Tie to project spec:**  
Satisfies **Task 0: Data Ingestion and Preprocessing**, preparing the dataset for sentiment labeling.
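
As a rough illustration of what this normalization involves, the sketch below rebuilds the five expected fields with plain pandas. It is not the `LoadData` implementation: the helper `normalize_messages` and the raw column names (`from`, `Subject`) are assumptions made for the example.

```python
import pandas as pd


def normalize_messages(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical normalizer; the raw column names are assumed, not taken from LoadData."""
    out = raw.rename(columns={"from": "employee_id", "Subject": "subject"})
    # Unparseable dates become NaT so they can be dropped together with null senders.
    out["date"] = pd.to_datetime(out["date"], errors="coerce")
    out = out.dropna(subset=["employee_id", "date"]).copy()
    # Empty strings instead of NaN keep the text concatenation well defined.
    for col in ("subject", "body"):
        out[col] = out[col].fillna("").astype(str).str.strip()
    out["text"] = (out["subject"] + " " + out["body"]).str.strip()
    out = out.sort_values("date").reset_index(drop=True)
    return out[["employee_id", "date", "subject", "body", "text"]]


raw = pd.DataFrame({
    "from": ["a@x.com", None, "b@x.com"],
    "date": ["2010-05-10", "2010-05-11", "2010-01-03"],
    "Subject": [" Update ", "Hi", None],
    "body": ["All good", "dropped", " See notes "],
})
clean = normalize_messages(raw)
print(clean)
```

The row with a null sender is dropped, and the remaining rows come back in chronological order with a combined `text` field.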


In [None]:
loader = LoadData(RAW_CSV)
raw_df = loader.load_pandas_dataframe(clean=True)

display(raw_df.head(3))
print("Shape:", raw_df.shape)
raw_df.info()


## 2. Sentiment Labeling and Caching

**Purpose:**  
Use the `SentimentLabeler` class to assign a sentiment category — **Positive**, **Negative**, or **Neutral** — to each message using a fine-tuned language model. The results are cached locally for reproducibility.

**Why it matters:**  
Sentiment classification forms the foundation for every subsequent task. Monthly scoring, employee ranking, and flight-risk detection all rely on these labels to quantify tone and emotional state.

**Expected output:**  
- A labeled dataset (`df`) with columns:  
  `['employee_id', 'date', 'body', 'sentiment', 'polarity']`  
- Saved cache file: `data/labeld_sentiments.csv` for faster reruns.

**Tie to project spec:**  
Fulfills **Task 1: Sentiment Labeling (Positive, Negative, Neutral)**, providing the core classification data for analysis.
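
The caching behavior can be sketched as a small load-or-compute helper. This is a minimal illustration, not the `SentimentLabeler` code: `load_or_label` and the keyword-based `toy_label` function are hypothetical stand-ins for the real model call.

```python
import os
import tempfile

import pandas as pd

calls = []  # records every text the (stand-in) labeler is asked to classify


def toy_label(text: str) -> str:
    """Keyword stand-in for the real model; for illustration only."""
    calls.append(text)
    lowered = text.lower()
    if "great" in lowered:
        return "Positive"
    if "bad" in lowered:
        return "Negative"
    return "Neutral"


def load_or_label(cache_path, messages, label_fn):
    """Return cached labels if the file exists; otherwise label, save, and return."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path, parse_dates=["date"])
    labeled = messages.copy()
    labeled["sentiment"] = labeled["body"].map(label_fn)
    labeled.to_csv(cache_path, index=False)
    return labeled


messages = pd.DataFrame({
    "employee_id": ["a", "b"],
    "date": pd.to_datetime(["2010-01-05", "2010-01-06"]),
    "body": ["Great work", "Meeting at 3"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "labels.csv")
    first = load_or_label(path, messages, toy_label)   # labels and writes the cache
    second = load_or_label(path, messages, toy_label)  # served from the cache

print(second)
print("Labeler calls:", len(calls))
```

The second call never touches the labeler, which is why reruns are faster and why stale caches must be deleted by hand when the labeling logic changes.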


In [None]:
if os.path.exists(LABELED_CSV):
    df = pd.read_csv(LABELED_CSV, parse_dates=["date"])
else:
    labeler = SentimentLabeler(raw_df)  # raw_df already carries the combined "text" column
    df = labeler.get_sentiments()
    df.to_csv(LABELED_CSV, index=False)

needed = {"employee_id", "date", "sentiment", "body"}
missing = needed - set(df.columns)
assert not missing, f"Missing columns: {missing}"

display(df.head(3))
print("Labeled rows:", len(df))


## 3. Exploratory Data Analysis (EDA) and Visualization

**Purpose:**  
Leverage the `PlotData` class to generate key visualizations that summarize message patterns, sentiment distributions, and time-based trends.

**Why it matters:**  
EDA reveals behavioral patterns, such as whether negative messages are longer or cluster in specific time periods, which helps validate label quality and data integrity before modeling.

**Expected output:**  
- A series of static PNGs saved in `visualizations/`, including:  
  - Sentiment distribution  
  - Message length vs sentiment  
  - Average sentiment over time  
  - Per-employee sentiment heatmaps  
- Inline previews within the notebook.

**Tie to project spec:**  
Addresses **Task 2: Exploratory Data Analysis (EDA) and Data Visualizations**, ensuring visual inspection of the dataset and labeled outputs.
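
The tables behind these plots are simple aggregations. The sketch below computes three of them on a toy frame; the numeric mapping of labels to +1, 0, and -1 is an assumption for the example, and `PlotData` may compute its inputs differently.

```python
import pandas as pd

labeled = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-05", "2010-01-20", "2010-02-02", "2010-02-15"]),
    "body": ["Great job", "This is a bad and frustrating delay", "Noted", "Thanks, great"],
    "sentiment": ["Positive", "Negative", "Neutral", "Positive"],
})

# Sentiment distribution: message count per label.
sentiment_counts = labeled["sentiment"].value_counts()

# Message length vs sentiment: mean character count per label.
labeled["length"] = labeled["body"].str.len()
mean_length = labeled.groupby("sentiment")["length"].mean()

# Average sentiment over time: mean of an assumed +1 / 0 / -1 encoding per month.
labeled["sentiment_num"] = labeled["sentiment"].map({"Positive": 1, "Neutral": 0, "Negative": -1})
monthly_avg = labeled.groupby(labeled["date"].dt.to_period("M"))["sentiment_num"].mean()

print(sentiment_counts, mean_length, monthly_avg, sep="\n\n")
```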


In [None]:
eda = PlotData(df)
eda.run_all_plots()  # saves all EDA PNGs under visualizations/

from IPython.display import Image, display

to_show = [
    "visualizations/sentiment_distribution.png",
    "visualizations/message_length_distribution.png",
    "visualizations/message_length_by_sentiment.png",
    "visualizations/avg_sentiment_over_time.png",
    "visualizations/sentiment_per_employee.png",
    "visualizations/top_employees.png",
    "visualizations/avg_polarity_over_time.png",
]
for fn in to_show:
    if os.path.exists(fn):
        display(Image(filename=fn))


## 4. Monthly Sentiment Scoring

**Purpose:**  
Aggregate labeled messages by employee and month using the `EmployeeScoring` class to compute each employee’s monthly sentiment score.

**Why it matters:**  
This step transforms message-level sentiment into a quantifiable trend metric, allowing comparisons over time and across employees.

**Expected output:**  
- A table (`monthly_scores`) containing:  
  `['employee_id', 'month', 'score']`  
- Saved output file: `data/employee_monthly_sentiment_scores.csv`.

**Tie to project spec:**  
Implements **Task 3: Monthly Sentiment Scoring**, producing the foundational metric for employee ranking and regression modeling.
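
One plausible way to compute such a score is to map each label to a number and sum per employee and month. The sketch below assumes the +1 / 0 / -1 encoding and a plain sum; `monthly_scores_sketch` is a hypothetical helper, and `EmployeeScoring.compute_scores()` may aggregate differently.

```python
import pandas as pd

# Assumed encoding of labels; upper-casing makes the lookup case-insensitive.
SCORE_MAP = {"POSITIVE": 1, "NEUTRAL": 0, "NEGATIVE": -1}


def monthly_scores_sketch(labeled: pd.DataFrame) -> pd.DataFrame:
    """Sum per-message scores for each (employee, calendar month) pair."""
    scored = labeled.assign(
        month=labeled["date"].dt.to_period("M").astype(str),
        score=labeled["sentiment"].str.upper().map(SCORE_MAP),
    )
    return scored.groupby(["employee_id", "month"], as_index=False)["score"].sum()


labeled = pd.DataFrame({
    "employee_id": ["a", "a", "a", "a", "b"],
    "date": pd.to_datetime(
        ["2010-01-03", "2010-01-10", "2010-01-25", "2010-02-02", "2010-01-15"]
    ),
    "sentiment": ["Positive", "Negative", "Positive", "Negative", "Neutral"],
})
result = monthly_scores_sketch(labeled)
print(result)
```

Because the grouping key includes the month, each employee's score restarts from zero at every month boundary.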


In [None]:
scorer = EmployeeScoring(df.copy())
monthly_scores = scorer.compute_scores()
monthly_scores.to_csv(MONTHLY_SCORES_CSV, index=False)

display(monthly_scores.head(10))
print("Monthly rows:", len(monthly_scores))


## 5. Employee Ranking and Performance Extremes

**Purpose:**  
Use the `EmployeeRanking` class to identify top and bottom performers per month based on their aggregated sentiment scores.

**Why it matters:**  
Ranking highlights standout employees — both positive and negative — enabling management to recognize strong engagement or flag potential morale concerns.

**Expected output:**  
- A ranked DataFrame (`rankings`) listing employees by monthly score.  
- Saved output file: `data/employee_monthly_rankings.csv`.

**Tie to project spec:**  
Fulfills **Task 4: Employee Ranking**, summarizing relative sentiment performance across the organization.
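
A ranking of this kind can be sketched with a sort followed by a per-month `head`. The helper `rank_month`, the `top_n` parameter, the alphabetical tie-break, and the `type` labels are all assumptions for illustration; the `type` column only mirrors what the `drop_type` argument used in this notebook suggests.

```python
import pandas as pd


def rank_month(scores: pd.DataFrame, top_n: int = 3) -> pd.DataFrame:
    """Pick the top_n highest and lowest scorers per month, ties broken by employee_id."""
    top = (
        scores.sort_values(["month", "score", "employee_id"], ascending=[True, False, True])
        .groupby("month")
        .head(top_n)
        .assign(type="top_positive")
    )
    bottom = (
        scores.sort_values(["month", "score", "employee_id"], ascending=[True, True, True])
        .groupby("month")
        .head(top_n)
        .assign(type="top_negative")
    )
    return pd.concat([top, bottom], ignore_index=True)


scores = pd.DataFrame({
    "employee_id": ["a", "b", "c", "d"],
    "month": ["2010-01"] * 4,
    "score": [3, -2, 3, 0],
})
ranked = rank_month(scores, top_n=2)
print(ranked)
```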


In [None]:
ranker = EmployeeRanking(monthly_scores, scores_available=True)
rankings = ranker.get_rankings(drop_type=False)
rankings.to_csv(RANKINGS_CSV, index=False)

display(rankings.head(12))
print("Ranking rows:", len(rankings))


## 6. Flight-Risk Detection (Rolling 30-Day Analysis)

**Purpose:**  
Apply the `flight_risk_analysis()` method of the `EmployeeScoring` instance (`scorer`) to flag employees who send at least four negative messages within a rolling 30-day window (`min_neg_in_30d=4`).

**Why it matters:**  
Consistent negativity often precedes disengagement or turnover. Automatically flagging these employees allows proactive HR follow-up.

**Expected output:**  
- A DataFrame of at-risk employees (`flight_risk_employees`).  
- Optional summary of when the risk condition was first met.  
- Saved CSV: `data/flight_risk_employees.csv`.

**Tie to project spec:**  
Meets **Task 5: Flight Risk Identification**, identifying employees at potential risk of leaving based on communication tone trends.
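
The rolling-window rule can be sketched with a time-based pandas window over each employee's negative messages. This is an illustration under assumptions, not the `flight_risk_analysis()` implementation: the helper `flag_flight_risk` and the `first_flagged` column are hypothetical, and the window here covers the 30 days ending at each negative message.

```python
import pandas as pd


def flag_flight_risk(labeled: pd.DataFrame, min_neg_in_30d: int = 4) -> pd.DataFrame:
    """Return employees whose negative-message count reaches the threshold in any 30-day window."""
    is_negative = labeled["sentiment"].str.upper() == "NEGATIVE"
    neg = labeled.loc[is_negative, ["employee_id", "date"]]
    # Dates must be sorted within each employee for a time-based rolling window.
    neg = neg.sort_values(["employee_id", "date"]).assign(one=1)
    rolling = neg.set_index("date").groupby("employee_id")["one"].rolling("30D").sum()
    flagged = rolling[rolling >= min_neg_in_30d].reset_index()
    return (
        flagged.groupby("employee_id", as_index=False)["date"]
        .min()
        .rename(columns={"date": "first_flagged"})
    )


labeled = pd.DataFrame({
    "employee_id": ["a"] * 5 + ["b"] * 4,
    "date": pd.to_datetime([
        "2010-03-01", "2010-03-05", "2010-03-10", "2010-03-12", "2010-03-20",
        "2010-01-01", "2010-02-15", "2010-03-30", "2010-05-01",
    ]),
    "sentiment": ["Negative", "Negative", "Negative", "Positive", "Negative"]
    + ["Negative"] * 4,
})
at_risk = flag_flight_risk(labeled)
print(at_risk)
```

Employee `a` reaches four negatives within 30 days on the fourth negative message, while employee `b` has the same number of negatives spread too far apart to qualify.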


In [None]:
# names only
risks_names = scorer.flight_risk_analysis(return_names_only=True, min_neg_in_30d=4)
# full details
risks_full = scorer.flight_risk_analysis(return_names_only=False, min_neg_in_30d=4)
risks_full.to_csv(FLIGHT_RISK_CSV, index=False)

display(risks_names.head())
display(risks_full.head(10))
print("At-risk employees:", len(risks_names))


### Flight-risk visuals
- Saved PNGs: `visualizations/flight_risk/*.png`.

In [None]:

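from src.plot_data import FlightRiskPlots

# Reuse an existing numeric sentiment column if present; otherwise derive it from the labels.
# Labels are upper-cased first so "Positive" and "POSITIVE" map identically.
if "sentiment_num" in df.columns:
    sentiment_num = df["sentiment_num"]
else:
    sentiment_num = df["sentiment"].str.upper().map({"POSITIVE": 1, "NEUTRAL": 0, "NEGATIVE": -1})

fr = FlightRiskPlots(df.assign(sentiment_num=sentiment_num)[["employee_id", "date", "sentiment_num"]])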
fr.run_all()

from IPython.display import Image, display
flight_pngs = [
    "visualizations/flight_risk/dist_neg30d.png",
    "visualizations/flight_risk/heatmap_neg_by_month_employee.png",
    "visualizations/flight_risk/cohort_days_to_first_risk.png",
]
for fn in flight_pngs:
    if os.path.exists(fn):
        display(Image(filename=fn))


## 7. Linear Regression Model for Sentiment Trends

**Purpose:**  
Train an L1-regularized linear regression model (LassoCV) using `TrainRegressionModel` to predict future monthly sentiment scores based on engineered linguistic and behavioral features.

**Why it matters:**  
Trend forecasting allows leadership to anticipate organizational morale shifts. It quantifies how sentiment is evolving and predicts future highs or lows.

**Expected output:**  
- Trained regression model saved to `data/regression_model.joblib`.  
- Coefficient file: `data/regression_model_coefficients.csv`.  
- Evaluation metrics (R², residuals).  
- Diagnostic plots in `visualizations/model/`.

**Tie to project spec:**  
Delivers **Task 6: Linear Regression for Sentiment Trends**, completing the predictive modeling component of the project.
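
For readers unfamiliar with LassoCV, the sketch below fits one on synthetic data with scikit-learn. It is a generic illustration, not the `TrainRegressionModel` pipeline: the three features are hypothetical placeholders, and the scaling step is an assumption.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200
# Hypothetical features, e.g. prior-month score, message count, and an irrelevant column.
X = rng.normal(size=(n, 3))
# The target depends on the first two features only; the third is pure noise.
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# LassoCV picks the regularization strength by cross-validation on the training split.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=42))
model.fit(X_train, y_train)

r2 = r2_score(y_test, model.predict(X_test))
coefs = model.named_steps["lassocv"].coef_
print("R^2 on test:", round(r2, 3))
print("Coefficients:", coefs)
```

The L1 penalty shrinks the coefficient of the irrelevant feature toward zero, which is what makes the saved coefficient file useful for reading off which features matter.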


In [None]:
# Ensure the training helper receives a 'month' column (some modules expect 'month' rather than 'date')
if 'month' not in monthly_scores.columns and 'date' in monthly_scores.columns:
    monthly_scores = monthly_scores.rename(columns={'date': 'month'})

trainer = TrainRegressionModel(sentiment_scores_df=monthly_scores, raw_df=df)
trainer.train(test_size=0.2, random_state=42)
r2 = trainer.evaluate()          # prints metrics and feature importances if implemented
trainer.save_model_artifacts(model_path=MODEL_PATH, coeff_path=COEFF_PATH)

print("R^2 on test:", r2)


### 7.1 Model Diagnostics
Residuals, prediction vs actual, top coefficients, and feature correlations.


In [None]:
plotter = ModelPlotter(
    raw_df=df,
    sentiment_scores_df=monthly_scores,
    test_size=0.2,
    random_state=42,
    pipeline_path=MODEL_PATH
)
plotter.run_all_plots()

from IPython.display import Image, display
diag_pngs = [
    "visualizations/model/residual_plot.png",
    "visualizations/model/prediction_vs_actual.png",
    "visualizations/model/top_coefficients.png",
    "visualizations/model/feature_correlation_heatmap.png",
]
for fn in diag_pngs:
    if os.path.exists(fn):
        display(Image(filename=fn))


## 8. Example Prediction — Next-Month Sentiment Forecast

**Purpose:**  
Demonstrate how to use the trained model (`PredictScore`) to forecast an employee’s upcoming sentiment score based on their communication data and previous month’s score.

**Why it matters:**  
Showcases the model’s real-world use case: predicting future sentiment dynamics for a specific employee using only their historical trend and text-derived features.

**Expected output:**  
- Printed predicted vs. actual monthly sentiment scores for a sample employee.  
- Validates the model’s end-to-end functionality.

**Tie to project spec:**  
Serves as the **final demonstration step**, verifying that all upstream components — from labeling through regression — integrate cohesively.
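
Using the previous month's score as a predictor amounts to a one-step lag within each employee's history. The sketch below shows that pairing with a grouped `shift`; the column names `score` and `prev_score` are assumptions for the example and may differ from what `FeatureEngineer` produces.

```python
import pandas as pd

monthly = pd.DataFrame({
    "employee_id": ["a", "a", "a", "b", "b"],
    "month": ["2010-01", "2010-02", "2010-03", "2010-01", "2010-02"],
    "score": [2, -1, 3, 0, 1],
})

# Sort so that shift(1) picks up the chronologically previous row per employee.
monthly = monthly.sort_values(["employee_id", "month"])
monthly["prev_score"] = monthly.groupby("employee_id")["score"].shift(1)

# Each employee's first month has no predecessor and cannot be used for training.
train_rows = monthly.dropna(subset=["prev_score"])
print(train_rows)
```

Note that `shift(1)` pairs each row with the previous available row, so a gap in an employee's months would pair non-adjacent months unless missing months are filled in first.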


In [None]:
# Assumes df (labeled messages), monthly_scores, and MODEL_PATH are defined above.
# Take the employee from the first row of monthly_scores and predict 2010-03 from 2010-02.
emp_id = monthly_scores.iloc[0]["employee_id"]
prior_month = "2010-02"
target_month = "2010-03"


def monthly_value(month):
    """Return this employee's `sentiment_num` for `month`; fail loudly if the row is absent."""
    rows = monthly_scores[
        (monthly_scores["employee_id"] == emp_id)
        & (monthly_scores["month"] == month)
    ]
    if rows.empty:
        raise ValueError(f"No monthly score for {emp_id} in {month}")
    return rows["sentiment_num"].iloc[0]


# Prior-month score, passed to the model as a feature
sentiment_value = monthly_value(prior_month)

# Rows for that employee; set the month the model should predict for
month_df = df.loc[df["employee_id"].eq(emp_id)].copy()
month_df["month"] = target_month  # let the model's feature engineer handle this

# Predict from the employee's rows plus the prior-month score
predictor = PredictScore(model_path=MODEL_PATH)
yhat = predictor.predict(month_df, sentiment_value)  # called as (raw_rows, prior_month_score)

# Actual score for the target month, for comparison
actual = monthly_value(target_month)

print(f"Employee: {emp_id}")
print(f"Prev month used: {prior_month}")
print(f"Target month:    {target_month}")
print(f"Predicted score: {float(yhat if not hasattr(yhat, '__len__') else yhat[0])}")
print(f"Actual score:    {actual}")


## 9. Deliverables Checklist

- ✅ `data/labeld_sentiments.csv`
- ✅ `visualizations/*.png` (EDA)
- ✅ `data/employee_monthly_sentiment_scores.csv`
- ✅ `data/employee_monthly_rankings.csv`
- ✅ `data/flight_risk_employees.csv`
- ✅ `data/regression_model.joblib` and `data/regression_model_coefficients.csv`
- ✅ `visualizations/model/*.png` (diagnostics)

> Re-run policy: labeling is cached; delete `data/labeld_sentiments.csv` only if you need to relabel.


## Appendix — run order and reproducibility
- Run top to bottom on a clean kernel.
- Keep `random_state=42` stable in Sections 7 and 7.1, the only places it is set.
- Do not overwrite CSVs unless you intend to regenerate downstream artifacts.
