**About the Dataset**

The dataset contains detailed compensation information for employees in the public protection sector, focusing specifically on the ADP Adult Probation department for the year 2013. It includes various attributes such as job codes, salary components (including salaries, overtime, and benefits), and total compensation for a range of job roles. The dataset encompasses diverse job families and positions, including technical roles, administrative roles, and personnel analysts.

**Suitability for Forecasting**

**Diversity of Job Roles:**

The dataset includes multiple job codes and job family codes, representing a wide range of positions within the organization. This diversity allows for the analysis of trends and patterns across different job functions and levels, providing valuable insights into how compensation structures may evolve.

**Temporal Consistency:**

Because every record comes from a single year (2013), the figures are directly comparable with one another. This gives a consistent baseline against which compensation in other years can be compared when forecasting salary adjustments.

**Trend Identification:**

The dataset can be utilized to identify trends over time, particularly if extended to other years. This enables forecasts that consider past changes and patterns in salaries and benefits, enhancing the predictive accuracy.

**Conclusion**

Overall, the dataset is highly suitable for forecasting purposes due to its comprehensive and detailed nature, diversity of job roles, and inclusion of various compensation components. Analyzing this data allows for the development of predictive models that take into account past trends and future projections, aiding in strategic planning and budget management within the organization.

In [None]:
# Step 1: Load and Prepare Data
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt

# Load the dataset from the specified path
file_path = r"C:\Users\22251091\Downloads\archive (1)\Employee_Salary_Compensation.csv"
df = pd.read_csv(file_path)  # Use read_csv for CSV files


FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\22251091\\Downloads\\archive (1)\\Employee_Salary_Compensation.csv'

In [None]:
# Check the first few rows of the DataFrame
print(df.head())

# Print the data types of all columns
print(df.dtypes)


NameError: name 'df' is not defined

In [None]:
# Convert the year to a datetime so it can serve as Prophet's time index
df['Year'] = pd.to_datetime(df['Year'].astype(str))

# Sum the retirement contributions for each year
df_grouped = df.groupby('Year')['Retirement'].sum().reset_index()

# Rename columns for Prophet: 'ds' for the date and 'y' for the value
df_grouped.columns = ['ds', 'y']


The **Year** column is converted into a datetime format to ensure it is in the correct format for time-series analysis.

We use **groupby()** to aggregate the retirement contributions by year, summing up the total contributions for each year.
The result is stored in a new DataFrame called df_grouped, which will be used as input for Prophet.

Prophet requires the columns to have specific names:

**ds (date):** This column should contain the date or timestamp data.

**y (value):** This column should contain the values you want to forecast (in this case, retirement contributions).
We rename the columns in df_grouped accordingly.
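
As a minimal sketch of this preparation step, the snippet below applies the same conversion, aggregation, and renaming to a small made-up table. The `toy` DataFrame and its values are hypothetical and only stand in for the real file; the column names `Year` and `Retirement` come from the cell above. The explicit `format='%Y'` assumes the column holds four-digit years.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (values are made up)
toy = pd.DataFrame({
    'Year': [2013, 2013, 2014, 2014, 2015],
    'Retirement': [1000.0, 1500.0, 1200.0, 1800.0, 2100.0],
})

# Parse the year explicitly so 2013 becomes 2013-01-01 (assumes four-digit years)
toy['Year'] = pd.to_datetime(toy['Year'].astype(str), format='%Y')

# Sum contributions per year, then rename to the column names Prophet expects
toy_grouped = (
    toy.groupby('Year')['Retirement'].sum()
       .reset_index()
       .rename(columns={'Year': 'ds', 'Retirement': 'y'})
)
print(toy_grouped)
```

Each year collapses to a single row, so the number of rows handed to Prophet equals the number of distinct years in the data.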

**Step 4: Create and Fit the Prophet Model**

In [None]:
model = Prophet()
model.fit(df_grouped)


We create an instance of the Prophet model.

The model is then fitted to our time-series data **(df_grouped)**. This is where the model learns from historical data to make predictions about future trends.

**Make Future Predictions**

In [None]:

future = model.make_future_dataframe(periods=12, freq='M')  # 12 months ahead
forecast = model.predict(future)

We generate future dates for which we want predictions using **make_future_dataframe()**.

Here, we set **periods=12** to forecast the next 12 months, with **freq='M'** indicating monthly intervals.
The **predict()** method is then called to generate the forecasted values.

**Plot the Forecasted Results**

In [None]:


# Plot the results
fig = model.plot(forecast)
plt.title('Forecast of Retirement Savings Growth in Locked Pot for Next Year')
plt.xlabel('Date')
plt.ylabel('Total Retirement Contributions')
plt.axvline(x=df_grouped['ds'].max(), color='r', linestyle='--', label='Last Data Point')
plt.legend()
plt.show()




We plot the forecasted results using model.plot(). This graph displays the historical data along with the forecasted values and the uncertainty intervals.


**SENTIMENT ANALYSIS (Employee Feedback)**



In [None]:
import pandas as pd
import random

# Define the attributes from the retirement dataset
num_employees = 100  # Number of employees in the dataset
departments = ['Sales', 'HR', 'IT', 'Marketing', 'Finance', 'Operations']
job_roles = ['Manager', 'Team Lead', 'Analyst', 'Developer', 'Sales Rep']
reasons_for_leaving = ['Career Change', 'Personal Reasons', 'Relocation', 'Retirement', 'Layoff']

# Function to introduce random errors in feedback text
def introduce_errors(text):
    # A single random draw selects the noise level. The thresholds are cumulative,
    # so 40% of texts stay unchanged and lower draws receive more kinds of noise.
    error_chance = random.random()
    if error_chance < 0.2:  # draws below 0.2 (20% of texts): append random punctuation
        text += "  " + random.choice(['!!', '???', '...', '##'])
    if error_chance < 0.4:  # draws below 0.4 (40% of texts): character substitutions
        text = text.replace('I', '1').replace('a', '@')
    if error_chance < 0.6:  # draws below 0.6 (60% of texts): widen every space
        text = text.replace(' ', '   ')
    return text

# Generate synthetic employee reviews
data = {
    'Employee ID': [f'EMP{i:03d}' for i in range(1, num_employees + 1)],
    'Age': [random.randint(25, 60) for _ in range(num_employees)],
    'Gender': random.choices(['Male', 'Female', 'Non-binary'], k=num_employees),
    'Marital Status': random.choices(['Single', 'Married', 'Divorced', 'Widowed'], k=num_employees),
    'Department': random.choices(departments, k=num_employees),
    'Job Role': random.choices(job_roles, k=num_employees),
    'Salary': [random.randint(40000, 120000) for _ in range(num_employees)],
    'Performance Rating': [random.randint(1, 5) for _ in range(num_employees)],
    'Years at Company': [random.randint(1, 30) for _ in range(num_employees)],
    'Promotion Count': [random.randint(0, 5) for _ in range(num_employees)],
    'Work-Life Balance Rating': [random.randint(1, 5) for _ in range(num_employees)],
    'Job Satisfaction': [random.randint(1, 5) for _ in range(num_employees)],
    'Training Hours': [random.randint(0, 40) for _ in range(num_employees)],
    'Commute Distance': [random.randint(1, 30) for _ in range(num_employees)],
    'Absenteeism Rate': [random.uniform(0.0, 100.0) for _ in range(num_employees)],
    'Exit Interview Feedback': [
        introduce_errors(random.choice(['I enjoyed my time here.', 'I found better opportunities.',
                                        'The culture did not fit me.', 'I needed to relocate.',
                                        'I felt unappreciated.']))
        for _ in range(num_employees)
    ],
    'Reason for Leaving': random.choices(reasons_for_leaving, k=num_employees),
}

# Create a DataFrame
employee_reviews_df = pd.DataFrame(data)

# Display the first few rows of the dataset with errors
print(employee_reviews_df.head())

# Save to a CSV file if needed
employee_reviews_df.to_csv('synthetic_employee_reviews_with_errors.csv', index=False)


  Employee ID  Age      Gender Marital Status  Department   Job Role  Salary  \
0      EMP001   51  Non-binary        Married  Operations    Manager   78016   
1      EMP002   41  Non-binary        Widowed   Marketing    Analyst   54882   
2      EMP003   33      Female        Married          IT  Sales Rep   84506   
3      EMP004   38        Male       Divorced  Operations    Manager   60500   
4      EMP005   31      Female       Divorced  Operations    Manager  118953   

   Performance Rating  Years at Company  Promotion Count  \
0                   1                17                1   
1                   1                14                1   
2                   2                11                5   
3                   3                10                3   
4                   2                 3                2   

   Work-Life Balance Rating  Job Satisfaction  Training Hours  \
0                         4                 4              22   
1                         2 



**1. Text Processing**

**a. Cleaning**

Cleaning is the initial step in NLP that prepares the text data for further analysis. This involves:

Removing Punctuation and Symbols: Punctuation marks (e.g., periods, commas, exclamation marks) and special characters (e.g., hashtags, ampersands) can interfere with text analysis. For instance, feedback such as “I found better opportunities!!!” would be cleaned to “I found better opportunities”.

Correcting Errors: This includes fixing typos and ensuring that words are spelled correctly, which can enhance the accuracy of subsequent analyses.
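
The cleaning step could be sketched as follows for the synthetic feedback generated earlier. The helper name `clean_feedback` is hypothetical, and undoing the `1` and `@` substitutions is only safe under the assumption that the feedback sentences contain no genuine digits or `@` signs, which holds for the five template sentences used above.

```python
import re

def clean_feedback(text):
    """Undo the synthetic noise and strip punctuation from one feedback string."""
    # Reverse the two character substitutions made by introduce_errors
    # (assumes the original sentences contain no real '1' or '@')
    text = text.replace('1', 'I').replace('@', 'a')
    # Drop everything except letters, digits, and whitespace
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r'\s+', ' ', text).strip()

print(clean_feedback('1   found   better   opportunities.      !!'))
# I found better opportunities
print(clean_feedback('1   felt   un@ppreci@ted.'))
# I felt unappreciated
```

General typo correction on real feedback would need a spell checker or dictionary; this sketch only reverses the specific noise that was injected.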

**b. Tokenization**

Tokenization breaks down the cleaned text into individual units, known as tokens. These tokens can be:

Words: For instance, the sentence "I enjoyed my time here." would be tokenized into ["I", "enjoyed", "my", "time", "here"].
Phrases: In some analyses, you may choose to tokenize based on phrases rather than individual words, especially for capturing context.
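
A minimal sketch of both kinds of token, using only the standard library. The function names `tokenize` and `bigrams` are illustrative; libraries such as NLTK or spaCy provide more robust tokenizers.

```python
import re

def tokenize(text):
    """Split text into word tokens made of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

def bigrams(tokens):
    """Pair each token with its successor to form two-word phrases."""
    return list(zip(tokens, tokens[1:]))

tokens = tokenize('I enjoyed my time here.')
print(tokens)
# ['I', 'enjoyed', 'my', 'time', 'here']
print(bigrams(tokens))
# [('I', 'enjoyed'), ('enjoyed', 'my'), ('my', 'time'), ('time', 'here')]
```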

**c. Lowercasing**

Lowercasing converts all text to lowercase to ensure uniformity and reduce redundancy. For example, “I” and “i” both become “i”, as does “I.” once the cleaning step has removed its trailing period. This prevents the same word from being treated as different tokens because of case differences.

**d. Removing Stopwords**

Stopwords are common words that do not add significant meaning to the text. Examples include “is”, “a”, “the”, “and”. Removing these words helps focus on the more meaningful terms within the feedback, such as “enjoyed” or “opportunity”.
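
Lowercasing and stopword removal can be sketched together. The `STOPWORDS` set below is a small hand-picked list chosen for illustration, not a standard one; it deliberately leaves out negations such as "not", since dropping them could flip the sentiment of a sentence.

```python
# Small illustrative stopword list (an assumption; real lists are much longer)
STOPWORDS = {'i', 'my', 'is', 'a', 'the', 'and', 'to'}

def normalize(tokens):
    """Lowercase every token, then drop the ones that are stopwords."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOPWORDS]

print(normalize(['I', 'enjoyed', 'my', 'time', 'here']))
# ['enjoyed', 'time', 'here']
print(normalize(['The', 'culture', 'did', 'not', 'fit', 'me']))
# ['culture', 'did', 'not', 'fit', 'me']
```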

**e. Stemming and Lemmatization**

Both stemming and lemmatization are techniques used to reduce words to their root form, which helps in standardizing the text data.

**Stemming:** This process strips suffixes from words using fixed rules to reach a base form. For example, “enjoying” becomes “enjoy”. Because the rules ignore meaning, the result is not always a real word.

**Lemmatization:** Unlike stemming, lemmatization uses a vocabulary and the word's part of speech to convert words to their dictionary form. For instance, “opportunities” would be reduced to “opportunity”.
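
The difference can be illustrated with two toy functions. Both are simplified stand-ins written for this example: `simple_stem` strips three suffixes only, and `LEMMAS` is a hypothetical lookup table, whereas real tools (for instance the Porter stemmer or a WordNet-based lemmatizer) use far larger rule sets and dictionaries.

```python
def simple_stem(word):
    """Strip one of a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ('ing', 'ed', 's'):
        # Only strip when at least three characters of stem remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical lookup table standing in for a full dictionary
LEMMAS = {'opportunities': 'opportunity', 'found': 'find', 'felt': 'feel'}

def simple_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

print(simple_stem('enjoying'))            # enjoy
print(simple_stem('opportunities'))       # opportunitie
print(simple_lemmatize('opportunities'))  # opportunity
```

The stemmer turns "opportunities" into the non-word "opportunitie", while the lookup returns the proper dictionary form.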

**2. Text Representation**

Once the text data is cleaned and pre-processed, it needs to be transformed into a numerical format suitable for machine learning algorithms.

**a. Bag-of-Words (BoW)**

The Bag-of-Words model creates a representation of text data by counting the occurrences of each word in the document.

For instance, if one employee review contains the words "I enjoyed my time here" and another contains "I found better opportunities", the BoW would count how many times each unique word appears across all reviews.
This model treats each word as an independent feature, ignoring the order and context, which can lead to loss of information.
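
For the two example reviews, a Bag-of-Words representation could be built by hand as sketched below. This is a standard-library illustration; in practice a vectorizer from a library such as scikit-learn would normally be used.

```python
from collections import Counter

reviews = ['I enjoyed my time here', 'I found better opportunities']

# Lowercase and split each review into tokens
tokenized = [review.lower().split() for review in reviews]

# The vocabulary is every unique word across all reviews, in sorted order
vocab = sorted({word for tokens in tokenized for word in tokens})

# Each review becomes a vector of word counts aligned with the vocabulary
counts = [Counter(tokens) for tokens in tokenized]
bow = [[count[word] for word in vocab] for count in counts]

print(vocab)
# ['better', 'enjoyed', 'found', 'here', 'i', 'my', 'opportunities', 'time']
print(bow[0])
# [0, 1, 0, 1, 1, 1, 0, 1]
print(bow[1])
# [1, 0, 1, 0, 1, 0, 1, 0]
```

Shuffling the words of a review would leave its vector unchanged, which is the loss of order information mentioned above.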

**b. TF-IDF (Term Frequency-Inverse Document Frequency)**

TF-IDF is a more sophisticated text representation that weighs the importance of words based on their frequency in a document relative to their occurrence across multiple documents.

Term Frequency (TF) measures how often a word appears in a document relative to the total number of words in that document.
Inverse Document Frequency (IDF) gauges how important a word is across the entire corpus. It reduces the weight of common words that appear frequently in many documents and increases the weight of rarer words.
The TF-IDF score is the product of TF and IDF, providing a balanced measure of word importance.
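
A minimal sketch of these definitions, using TF = count / document length and the unsmoothed IDF = log(N / number of documents containing the term). Several IDF variants exist, and libraries such as scikit-learn apply a smoothed form by default, so their numbers will differ from the ones computed here. The three tokenized documents are illustrative, and `idf` assumes the term occurs in at least one document.

```python
import math
from collections import Counter

docs = [
    ['i', 'enjoyed', 'my', 'time', 'here'],
    ['i', 'found', 'better', 'opportunities'],
    ['i', 'felt', 'unappreciated'],
]

def tf(term, doc):
    """Share of the document's tokens that are this term."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Log of (number of documents / documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# 'i' appears in every document, so its IDF (and TF-IDF) is zero
print(tf_idf('i', docs[0], docs))
# 'enjoyed' appears in one document only, so it receives a positive weight
print(round(tf_idf('enjoyed', docs[0], docs), 4))
```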

**c. Word Embeddings**

Word embeddings transform words into dense vector representations, capturing semantic meanings based on context.

Techniques like Word2Vec and GloVe create vectors for words such that similar words have similar representations in the vector space. For instance, “king” and “queen” would be closer together in the vector space than “king” and “apple”.
Word embeddings allow for capturing the nuances of meaning and relationships between words, facilitating better performance in NLP tasks.
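
Closeness in the vector space is commonly measured with cosine similarity. The sketch below uses three-dimensional vectors that were invented purely for illustration; real Word2Vec or GloVe vectors are learned from text and typically have tens to hundreds of dimensions.

```python
import math

# Hypothetical vectors, made up for this example only
vectors = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.85, 0.9, 0.15],
    'apple': [0.1, 0.2, 0.95],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

king_queen = cosine_similarity(vectors['king'], vectors['queen'])
king_apple = cosine_similarity(vectors['king'], vectors['apple'])
print(king_queen > king_apple)
# True
```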

**3. Model Training**

With the text data represented numerically, it’s ready for training machine learning models. Various techniques can be employed based on the task at hand:

**a. Supervised Learning**
For tasks like sentiment analysis or classification of reviews (positive, negative, neutral), supervised learning algorithms can be used:

**Algorithms:** Logistic regression, Support Vector Machines, Random Forests, and neural networks are commonly employed.

**Training:** The model is trained on a labeled dataset (where the outcomes are known) to learn the relationships between the text representations and the corresponding labels.
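
As a rough illustration of supervised training, the sketch below fits a logistic regression by stochastic gradient descent on binary bag-of-words features, using only the standard library. The four training sentences and their labels (1 = positive, 0 = negative) are assumptions made for this example, and the learning rate and epoch count are arbitrary choices; a real project would use a tested library implementation and far more data.

```python
import math

# Tiny hand-labeled training set (labels are assumed: 1 = positive, 0 = negative)
train = [
    ('i enjoyed my time here', 1),
    ('i enjoyed the culture', 1),
    ('i felt unappreciated', 0),
    ('the culture did not fit me', 0),
]

vocab = sorted({word for text, _ in train for word in text.split()})

def featurize(text):
    """Binary bag-of-words vector over the training vocabulary."""
    words = set(text.split())
    return [1.0 if word in words else 0.0 for word in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability that the feature vector x belongs to the positive class."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Stochastic gradient descent on the log loss
weights, bias = [0.0] * len(vocab), 0.0
learning_rate = 0.5
for _ in range(200):
    for text, label in train:
        x = featurize(text)
        error = predict_proba(weights, bias, x) - label
        weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
        bias -= learning_rate * error

for text, label in train:
    print(label, round(predict_proba(weights, bias, featurize(text)), 3), text)
```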

**b. Unsupervised Learning**
In scenarios where labels are not available, unsupervised learning techniques like clustering can help in discovering patterns within the data.
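
One very simple way to group unlabeled feedback is sketched below: a single-pass, threshold-based grouping (sometimes called leader clustering) on Jaccard word overlap. It was chosen here for brevity and is not the only option; k-means on TF-IDF vectors is a common alternative. The feedback sentences (two of them slight variations invented for the example) and the 0.3 threshold are assumptions.

```python
def jaccard(a, b):
    """Share of words that two token sets have in common."""
    return len(a & b) / len(a | b)

def leader_cluster(texts, threshold=0.3):
    clusters = []  # each cluster is a list of (text, token set) pairs
    for text in texts:
        tokens = set(text.lower().split())
        for cluster in clusters:
            # Compare with the first member of the cluster (its "leader")
            if jaccard(tokens, cluster[0][1]) >= threshold:
                cluster.append((text, tokens))
                break
        else:
            # No existing cluster is similar enough, so start a new one
            clusters.append([(text, tokens)])
    return [[text for text, _ in cluster] for cluster in clusters]

feedback = [
    'I enjoyed my time here',
    'I needed to relocate',
    'I really enjoyed my time',
    'I needed to relocate abroad',
]
groups = leader_cluster(feedback)
for group in groups:
    print(group)
# ['I enjoyed my time here', 'I really enjoyed my time']
# ['I needed to relocate', 'I needed to relocate abroad']
```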

**4. Prediction/Classification**

Once the model is trained, it can be used to predict outcomes for new data:

Application: For instance, given a new employee review, the trained model can classify it as positive, negative, or neutral based on learned patterns.
Output: The output can be probabilities for each class, allowing for nuanced interpretations of sentiments.
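
Many classifiers produce one raw score per class and convert the scores to probabilities with the softmax function. The sketch below shows that conversion; the `scores` values are hypothetical and stand in for the output of a trained model for a single review.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the maximum score for numerical stability
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

labels = ['positive', 'negative', 'neutral']
scores = [2.0, 0.5, 0.1]  # hypothetical raw model scores for one review

probs = softmax(scores)
predicted = labels[probs.index(max(probs))]

for label, prob in zip(labels, probs):
    print(f'{label}: {prob:.3f}')
print('Predicted class:', predicted)
```

Keeping the full probability list, rather than the top label alone, makes it possible to flag reviews where the model is undecided between two classes.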

**5. Post-Processing and Evaluation**

After predictions are made, the results should be evaluated to measure performance:

Metrics: Common evaluation metrics include accuracy, precision, recall, and F1 score. These metrics help gauge how well the model performs on unseen data.
Fine-tuning: Based on evaluation results, further adjustments can be made, such as hyperparameter tuning, feature selection, or using different algorithms to improve performance.
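
The four metrics can be computed directly from true and predicted labels, as in the sketch below for the binary case. The label lists are made up for illustration, and the function name `binary_metrics` is hypothetical; libraries such as scikit-learn provide equivalent, more general functions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

y_true = [1, 0, 1, 1, 0, 0]  # made-up true labels
y_pred = [1, 0, 0, 1, 1, 0]  # made-up model predictions

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```

With two true positives, one false positive, and one false negative, all four values come out at 2/3 in this example.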