# **Project Name**    - Glassdoor EDA: Unveiling Workplace Trends



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name**            - Ashwin Suryawanshi

# **Project Summary -**


This project conducts an in-depth exploratory data analysis (EDA) of Glassdoor company review data to uncover key workplace trends. The primary focus is on employee satisfaction, company ratings, and salary expectations across various industries and job roles. Using statistical analysis and data visualization techniques, the project aims to provide insights into factors shaping the modern workplace.

The dataset consists of employee reviews from diverse companies, job titles, and industry sectors. Key variables include company ratings, salary details (base pay, bonuses, stock options), work-life balance scores, company culture insights, and qualitative employee feedback. Additionally, where available, employee demographics are considered to analyze potential disparities.

* **Data Cleaning & Preprocessing:**

The analysis begins with data cleaning to handle missing values, inconsistencies, and outliers. Missing data points are imputed strategically, categorical variables are standardized, and inconsistent formats are addressed to ensure data integrity.

* **Company Ratings & Employee Satisfaction:**

A critical part of the study examines the relationship between company ratings and factors like salary, work-life balance, company culture, and career opportunities. Correlation analysis helps identify key drivers of employee satisfaction and highlight areas where companies can improve.

* **Salary Trends Analysis:**

Salary expectations are explored across different job roles and industries, analyzing salary distributions, disparities across demographics (where data permits), and the influence of company size, location, and experience level on pay structures. Visualizations like histograms and box plots illustrate these trends clearly.

* **Sentiment Analysis & Textual Insights:**

Beyond numerical data, the project delves into textual reviews to extract employee sentiments and commonly mentioned workplace themes. Using word clouds and sentiment analysis, the study highlights recurring concerns, praises, and trends in workplace culture and benefits.

By leveraging data visualization and statistical techniques, this EDA provides valuable insights into workplace dynamics, helping organizations understand employee experiences and improve their work environments.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



* How does salary vary by job position (e.g., Data Scientist vs. Software Engineer vs. DevOps Engineer)?
* What is the impact of company size on salary levels?
* How do salaries differ by location (e.g., San Francisco vs. Austin vs. New York)?
* Can we build a predictive model to estimate salaries based on job attributes?

By analyzing this dataset, we can predict salary ranges, uncover market trends, and provide insights to professionals and organizations.

#### **Define Your Business Objective?**

* **For Job Seekers:**  Helps professionals make informed career decisions by understanding expected salary ranges for different roles.

* **For Employers:**  Assists companies in setting competitive salaries to attract and retain top talent.
* **For Analysts & Researchers:**  Provides data-driven insights into salary trends based on industry, experience, and geography.

* **For Recruiters:**  Aids in benchmarking salaries and ensuring fair compensation practices.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset

from google.colab import drive
drive.mount('/content/drive')


In [None]:
df = pd.read_csv("/content/drive/MyDrive/Glassdoor EDA/Copy of glassdoor_jobs.csv")

### Dataset First View

In [None]:
# Dataset First Look

df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

rows = df.shape[0]
cols = df.shape[1]
print(f"Number of rows: {rows}")
print(f"Number of columns: {cols}")

### Dataset Information

In [None]:
# Dataset Info

df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

df.duplicated().value_counts()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

df.isnull().sum()

In [None]:
# Visualizing the missing values

plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='coolwarm')
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

* This dataset appears to be about job postings from Glassdoor.
* It contains information on various aspects of job listings, such as Company Name, Job Title, Salary, Location, Company Size, Revenue etc.
* There are 956 rows and 15 columns in the dataset.
* There are no duplicate rows in the dataset.
* There are no null values in the dataset, although some columns use `-1` as a placeholder for missing data (handled in the wrangling step).
* The data types of the columns are integer, float, and object.
* Further analysis will be required to understand the distribution of data in each column, the relationships between different columns, and the overall quality of the data.
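
The null check above only counts `NaN` cells, while the wrangling step later treats `-1` as a missing-value marker in columns such as `Salary Estimate` and `Competitors`. The following is a minimal sketch of how such placeholders could be counted per column. The helper name `count_placeholders` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def count_placeholders(frame, placeholders=(-1, "-1")):
    """Count, per column, the cells equal to one of the placeholder values."""
    # Element-wise membership test; works for numeric and string columns alike
    return frame.apply(lambda col: col.map(lambda v: v in placeholders).sum())

# Toy data standing in for the Glassdoor frame (hypothetical values)
toy = pd.DataFrame({
    "Salary Estimate": ["$50K-$80K", "-1", "$90K-$120K"],
    "Rating": [3.8, -1.0, 4.1],
    "Founded": [1999, -1, 2010],
})
placeholder_counts = count_placeholders(toy)
print(placeholder_counts)
```

Running `count_placeholders(df)` on the real frame would show which columns hide missing data behind `-1` even though `df.isnull().sum()` reports zero.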

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

df.columns

In [None]:
# Dataset Describe

df.describe()

### Variables Description

* **Job Title** – The title of the job position.
* **Company Name** – The name of the company offering the job.
* **Location** – The city and state where the job is located.
* **Headquarters** – The main office or corporate headquarters location of the company. This can be useful for understanding where the company is based versus where job positions are located.
* **Salary Estimate** – The estimated salary range for the job.
* **Job Description** – A textual summary of job responsibilities and qualifications.
* **Rating** – The company's overall rating on Glassdoor (scale of 1 to 5).
* **Company Size** – The approximate number of employees in the company (e.g., 51-200, 1000-5000).
* **Company Type** – Whether the company is public, private, government, or other organization.
* **Industry** – The business sector in which the company operates (e.g., Tech, Finance, Healthcare, Aerospace, Energy).
* **Sector** – A broader classification of the industry (e.g., IT, Banking, Consulting, Manufacturing).
* **Revenue** – The company's estimated revenue range.
* **Founded** – The year the company was established.
* **Competitors** – A list of competing companies within the same industry as the employer. This helps job seekers compare opportunities and company standing against rivals.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for i in df.columns.tolist():
  print("Number of unique values in ",i,"is",df[i].nunique())

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

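# Remove rows where 'Salary Estimate' is the '-1' placeholder (missing salary);
# .copy() avoids SettingWithCopyWarning when new columns are added below
df = df[df['Salary Estimate'] != '-1'].copy()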

# Extract salary range and convert to numeric
salary = df['Salary Estimate'].apply(lambda x: x.split('(')[0])
minus_Kd = salary.apply(lambda x: x.replace('K', '').replace('$', ''))
min_hr = minus_Kd.apply(lambda x: x.lower().replace('per hour', '').replace('employer provided salary:', ''))

df['min_salary'] = min_hr.apply(lambda x: int(x.split('-')[0]))
df['max_salary'] = min_hr.apply(lambda x: int(x.split('-')[1]) if '-' in x else int(x.split('-')[0]))
df['avg_salary'] = (df.min_salary + df.max_salary) / 2

# Company Name text only
df['company_txt'] = df.apply(lambda x: x['Company Name'] if x['Rating'] < 0 else x['Company Name'][:-3], axis=1)

# State field
df['job_state'] = df['Location'].apply(lambda x: x.split(',')[1] if ',' in x else x)
df['job_state'] = df['job_state'].apply(lambda x: x.strip() if x.strip().lower() != 'los angeles' else 'CA')

# 1 if the job location string matches the headquarters string exactly, else 0
df['same_state'] = df.apply(lambda x: 1 if x.Location == x.Headquarters else 0, axis=1)

# Age of Company (relative to 2023); values below 1, such as -1, mark a missing year and are kept as-is
df['age'] = df.Founded.apply(lambda x: x if x < 1 else 2023 - x)

# Parsing of job description (python, etc.)
df['python_yn'] = df['Job Description'].apply(lambda x: 1 if 'python' in x.lower() else 0)
df['R_yn'] = df['Job Description'].apply(lambda x: 1 if 'r studio' in x.lower() or 'r-studio' in x.lower() else 0)
df['spark'] = df['Job Description'].apply(lambda x: 1 if 'spark' in x.lower() else 0)
df['aws'] = df['Job Description'].apply(lambda x: 1 if 'aws' in x.lower() else 0)
df['excel'] = df['Job Description'].apply(lambda x: 1 if 'excel' in x.lower() else 0)

df['job_simp'] = df['Job Title'].apply(lambda x: x.split(',')[0] if ',' in x else x)

# Job description length
df['desc_len'] = df['Job Description'].apply(lambda x: len(x))

# Competitor count
df['num_comp'] = df['Competitors'].apply(lambda x: len(x.split(',')) if x != '-1' else 0)

print("Data wrangling complete.")


In [None]:
df.head(15)

### What all manipulations have you done and insights you found?

**1. Handling Missing Salary Data:**
- Removed rows with '-1' in 'Salary Estimate' as they represent missing salary information.
- Insight: Removing these rows ensures that the analysis focuses on jobs with available salary data, preventing skewed results.

**2. Salary Data Cleaning and Feature Engineering:**
- Extracted minimum and maximum salary values from the 'Salary Estimate' column.
- Created an 'avg_salary' column by averaging min and max salaries.
- Insight: This allows for numerical analysis of salary and comparisons across different job categories and companies.

**3. Company Name Cleaning:**
- Removed the rating from the company name to isolate the company's name for better analysis.
- Insight: Allows for cleaner grouping and analysis of companies, avoiding the influence of the rating on analysis.

**4. Location Feature Engineering:**
- Created a 'job_state' column from the 'Location' column to analyze job locations by state.
- Handled the special case of 'Los Angeles' to ensure correct state assignment.
- Created a binary feature 'same_state' to indicate if the job location is the same as the company's headquarters.
- Insight: State-level job location analysis can reveal geographical preferences and trends. The 'same_state' feature indicates whether jobs are located at headquarters or different locations, impacting commute preferences and company structure.

**5. Company Age Calculation:**
- Calculated the 'age' of the company based on the 'Founded' year.
- Insight: Company age can be an indicator of stability, experience, and possibly company culture.

**6. Job Description Keyword Extraction:**
- Created binary features for keywords like 'python', 'r studio', 'spark', 'aws', and 'excel' based on their presence in the 'Job Description'.
- Insight: These keywords highlight essential job skills and technologies.

**7. Job Title Simplification:**
- Created a 'job_simp' column by taking the first part of the 'Job Title' before a comma to simplify the job titles for easier analysis.
- Insight: Simplified job titles allow for easier categorization and analysis of job roles.

**8. Job Description Length:**
- Calculated the length of the job description ('desc_len') to potentially reflect the detail and complexity of the role.
- Insight: Description length may correlate with the seniority or complexity of the role.

**9. Competitor Count:**
- Calculated the number of competitors ('num_comp') for each company based on the 'Competitors' column.
- Insight: Competitor count might indicate the level of market competition and the company's position in the industry.


**Overall Insights:**

The performed manipulations created new features that could reveal relationships between job characteristics, company attributes, salary expectations, and required skills. Further exploratory data analysis, visualizations, and statistical tests can now uncover valuable insights from the refined dataset.
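
The salary parsing steps above can be gathered into one small function, which makes them easier to test on individual strings. This is a sketch that mirrors the same string replacements as the wrangling cell. The function name `parse_salary_range` and the sample strings are assumptions for illustration, and, as in the original code, hourly figures are not converted to annual salaries.

```python
def parse_salary_range(estimate):
    """Return (min, max, average) from a Glassdoor-style salary string."""
    # Drop the trailing "(Glassdoor est.)" / "(Employer est.)" part
    text = estimate.split('(')[0]
    # Remove the 'K' and '$' markers, then the descriptive phrases
    text = text.replace('K', '').replace('$', '').lower()
    text = text.replace('per hour', '').replace('employer provided salary:', '')
    # A single value (no dash) is used as both minimum and maximum
    parts = [int(p) for p in text.strip().split('-')]
    low, high = parts[0], parts[-1]
    return low, high, (low + high) / 2

print(parse_salary_range("$53K-$91K (Glassdoor est.)"))           # (53, 91, 72.0)
print(parse_salary_range("Employer Provided Salary:$80K-$100K"))  # (80, 100, 90.0)
print(parse_salary_range("$17-$24 Per Hour(Glassdoor est.)"))     # (17, 24, 20.5)
```

Applied to the frame, this could replace the chain of `apply` calls, for example by expanding the returned tuple into the `min_salary`, `max_salary`, and `avg_salary` columns.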


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

plt.figure(figsize=(10, 6))
sns.histplot(df['Rating'], kde=True, color='slateblue')
plt.title('Distribution of Company Ratings')
plt.xlabel('Company Rating')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a histogram with a kernel density estimate (KDE) to visualize the distribution of company ratings.  A histogram is suitable for showing the frequency distribution of a single numerical variable, in this case, the company ratings. The KDE provides a smooth curve that estimates the probability density function, giving a better visual representation of the distribution's shape compared to just the histogram bars. This allows for a quick understanding of the overall rating distribution, including its central tendency, spread, and skewness.

##### 2. What is/are the insight(s) found from the chart?

The distribution of company ratings appears to be slightly left-skewed (negatively skewed): most companies cluster at the higher ratings, with a thinner tail toward the lower ratings. A significant portion of companies have ratings between 3 and 4, suggesting that this range is common. The peak of the distribution seems to fall around 3.5, indicating that this is the most frequent rating. The left skew could suggest that the platform tends to attract companies with higher ratings or that it might be easier for good companies to maintain higher ratings on the platform. Any `-1` values in `Rating` appear to be placeholders for missing ratings (the wrangling code checks for `Rating < 0`) and should not be read as real scores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Yes, understanding the distribution of company ratings can help job seekers make more informed decisions about potential employers.  It can also help companies identify areas for improvement in their workplace culture and practices to enhance their ratings and attract better talent. For example, companies with lower-than-average ratings can focus on improving their employee satisfaction and recognition.


A very high concentration of ratings at the lower end could indicate a problem. For example, if a very large number of companies have ratings below 2, this might signal a systemic issue within the industry or the platform, potentially discouraging top talent from applying to companies listed there. A very low number of highly-rated companies may suggest a lack of competitive, sought-after companies on the platform.
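
The direction of skew can also be checked numerically instead of being read off the plot. A negative sample skewness means the longer tail is on the low side (left skew), and a positive value means it is on the high side (right skew). The sketch below assumes that negative ratings are missing-value placeholders and drops them first. The helper name `rating_skewness` and the toy ratings are hypothetical.

```python
import pandas as pd

def rating_skewness(ratings):
    """Sample skewness of the ratings, ignoring negative placeholder values."""
    valid = ratings[ratings >= 0]  # assumption: -1 marks a missing rating
    return valid.skew()

# Toy ratings: most values sit near 4 with one low value, so the tail is on the left
toy_ratings = pd.Series([-1.0, 2.0, 3.4, 3.6, 3.8, 3.9, 4.0, 4.1, 4.2])
skew_value = rating_skewness(toy_ratings)
print(f"Skewness: {skew_value:.2f}")  # negative for this toy sample
```

On the real data, `rating_skewness(df['Rating'])` gives a single number that can back up the visual reading of the histogram.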

#### Chart - 2

In [None]:
# Chart - 2 visualization code

plt.figure(figsize=(12, 6))
sns.countplot(x='job_state', data=df, order=df['job_state'].value_counts().index)
plt.xticks(rotation=90)
plt.title('Number of Jobs by State')
plt.xlabel('State')
plt.ylabel('Number of Jobs')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a countplot because it effectively displays the distribution of categorical data, in this case, the number of job postings across different states.  A countplot provides a clear visual comparison of the frequency of each category (state) in the dataset.  It's a good way to quickly identify states with the highest and lowest concentrations of job opportunities.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that California (CA) has the highest number of job postings, significantly more than other states.  This suggests a higher concentration of job opportunities in California within the dataset. Other states like Massachusetts (MA), New York (NY) and Virginia (VA) also have a noticeable number of job listings, while many other states have considerably fewer.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this information can be valuable for recruiters, job seekers, and businesses making location-related decisions.  For example, companies could use this data to target their recruitment efforts in states with a higher concentration of job seekers or consider expanding their operations to areas with a high demand for their services.  Job seekers can use this data to prioritize their job search in areas with more opportunities.

 For negative growth, if a company is heavily concentrated in a state with a declining job market, this could affect the company's performance and potentially lead to layoffs or reduced profits.  Conversely, if a company is not represented in a rapidly growing job market, it could miss out on talent acquisition and business expansion opportunities.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

plt.figure(figsize=(10,6))
sns.boxplot(x='avg_salary', y='job_state', data=df, palette= 'Greys', orient='h')
plt.title('Average Salary Distribution by State')
plt.xlabel('Average Salary')
plt.ylabel('State')
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

A box plot was chosen to visualize the distribution of average salaries across different states. Box plots are excellent for comparing the distribution of a numerical variable (average salary) across different categories (states). They show the median, quartiles, and outliers, providing a concise summary of the salary distribution for each state.  This allows for a quick comparison of salary ranges and potential variations between states.

##### 2. What is/are the insight(s) found from the chart?

The box plot reveals variations in average salaries across different states.  Some states like California and New York tend to have higher average salaries compared to others.  The presence of outliers in several states suggests that there are some high-paying and low-paying jobs within those states, even within the same general geographic region.  The box plot also allows quick visual comparison of salary distributions across states. For example, one can easily observe that the median salary in California is likely higher than in several other states.
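
Statements about which states have higher medians can be backed by a small summary table alongside the box plot. The following sketch groups by state and reports the count and median salary, with an optional minimum group size so that states with only a handful of postings do not dominate the ranking. The function name `salary_summary_by`, the `min_count` parameter, and the toy data are assumptions for illustration.

```python
import pandas as pd

def salary_summary_by(frame, group_col="job_state", value_col="avg_salary", min_count=1):
    """Count and median of value_col per group, sorted by median (descending)."""
    summary = frame.groupby(group_col)[value_col].agg(["count", "median"])
    summary = summary.sort_values("median", ascending=False)
    # Keep only groups with enough postings to make the median meaningful
    return summary[summary["count"] >= min_count]

# Toy data (hypothetical salaries in thousands of dollars)
toy = pd.DataFrame({
    "job_state": ["CA", "CA", "CA", "NY", "NY", "TX"],
    "avg_salary": [120, 140, 100, 110, 90, 80],
})
state_summary = salary_summary_by(toy, min_count=2)
print(state_summary)
```

The same helper works for other grouping columns, such as `job_simp`, by changing `group_col`.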

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights are beneficial for salary negotiations, workforce planning, and business expansion decisions. Job seekers can use the data to understand the typical salary ranges in various locations and negotiate more effectively. Businesses can use the information to offer competitive salaries and attract top talent. Setting up offices in regions with lower salaries can reduce operational costs.

 Negative growth implications could arise if a company offers salaries significantly below the average for a given location and job role. This can lead to difficulty attracting qualified candidates, decreased employee morale and retention, and ultimately hinder business growth. Similarly, if a business expands to an area with significantly higher average salaries than anticipated, operational costs may increase, affecting profitability.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

plt.figure(figsize=(10, 6))
sns.scatterplot(x='avg_salary', y='Rating', data=df, color='olive', edgecolor='black')
plt.title('Average Salary vs Company Rating')
plt.xlabel('Average Salary')
plt.ylabel('Company Rating')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot was chosen to visualize the relationship between average salary and company rating. Scatter plots are effective for showing the correlation between two numerical variables. In this case, it helps to determine if there's a relationship between how much a company pays its employees and its overall rating on Glassdoor.  Each point on the plot represents a job posting, with its average salary on the x-axis and company rating on the y-axis.  This allows for easy visual identification of any trends or patterns.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot of Average Salary vs Company Rating doesn't show a strong, clear linear correlation. There isn't a readily apparent trend where higher salaries always correspond to higher ratings or vice-versa. While there might be a slight tendency for some higher-rated companies to offer higher salaries, there are many exceptions. Many companies with a variety of ratings exist across the salary spectrum. More sophisticated statistical analysis would be needed to confirm any weak correlations.
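
One simple way to follow up on the visual impression is to compute correlation coefficients for the two columns. The sketch below reports both Pearson (linear) and Spearman (rank-based) coefficients using `DataFrame.corr`. It assumes negative ratings are placeholders and excludes them. The helper name `salary_rating_correlation` and the toy values are illustrative only and say nothing about the real dataset.

```python
import pandas as pd

def salary_rating_correlation(frame, x="avg_salary", y="Rating"):
    """Pearson and Spearman correlation between two columns of the frame."""
    pair = frame[[x, y]]
    pair = pair[pair[y] >= 0]  # assumption: -1 marks a missing rating
    return {m: pair.corr(method=m).loc[x, y] for m in ("pearson", "spearman")}

# Toy data in which rating rises steadily with salary (hypothetical values)
toy = pd.DataFrame({
    "avg_salary": [60, 75, 90, 110, 130, 85],
    "Rating": [3.1, 3.5, 3.6, 4.0, 4.4, -1.0],
})
correlations = salary_rating_correlation(toy)
print(correlations)
```

Coefficients close to zero on the real data would support the reading that salary alone explains little of the variation in ratings.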

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the relationship between salary and company rating can help businesses make informed decisions about compensation strategies.  A company might find that offering competitive salaries can positively impact its rating.  However, the lack of a strong correlation also suggests that other factors significantly influence company ratings, such as work-life balance, management quality, and opportunities for growth.

Negative growth could arise from a failure to recognize the importance of total compensation (not just base salary).  If a company focuses solely on minimizing salary costs without considering other aspects of employee well-being and job satisfaction, it could lead to lower ratings, increased employee turnover, and difficulties attracting top talent.  This could ultimately hinder business growth and profitability.  The absence of a strong positive correlation emphasizes that compensation is only one piece of the puzzle when it comes to attracting and retaining quality employees and maintaining a good company rating.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

plt.figure(figsize=(10, 6))
sns.regplot(x='avg_salary', y='desc_len', data=df, color='deeppink')
plt.title('Average Salary vs Job Description Length')
plt.xlabel('Average Salary')
plt.ylabel('Job Description Length')
plt.show()

##### 1. Why did you pick the specific chart?

A regression plot (regplot) was chosen to visualize the relationship between average salary and job description length.  Regression plots are useful for displaying the relationship between two numerical variables and also show a fitted linear regression line.  This allows one to see not only the general trend but also the strength and direction of the linear relationship between the variables. In this context, it helps to see if there's a correlation between how much a company pays and the amount of detail in the job description.

##### 2. What is/are the insight(s) found from the chart?

The regression plot shows a weak positive correlation between average salary and job description length.  This suggests that, on average, jobs with longer descriptions tend to have slightly higher average salaries. However, the correlation is not very strong, meaning there are many exceptions to this trend. Job description length alone isn't a great predictor of salary.  Many jobs with short descriptions may have high salaries, and vice-versa.  There's considerable scatter around the regression line.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Yes, understanding this relationship can help in job postings.  While job description length alone isn't a strong indicator,  it's still useful to consider the level of detail in the job description when determining salary.  More detailed descriptions may reflect a more complex role or higher responsibilities, which could correlate with higher salaries.  However, overemphasizing the length could lead to unnecessary length without providing any further value.

Negative growth could result if the company assumes a strong relationship between description length and salary. Setting salary expectations solely based on description length, without considering other relevant factors, could result in underpaying or overpaying candidates and lead to dissatisfaction and higher turnover.  Moreover, overly lengthy descriptions that don't add significant value could deter candidates, making it harder to attract top talent.


#### Chart - 6 Heatmap

In [None]:
# Chart - 6 visualization code

plt.figure(figsize=(12,6))
sns.heatmap(df[['avg_salary','Rating','desc_len','num_comp']].corr(),annot=True,cmap='viridis')
plt.title('Correlation Matrix of Key Features')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a heatmap to visualize the correlation matrix of key features. A heatmap is an excellent way to display the correlation coefficients between multiple numerical variables in a concise and visually appealing manner.  The color intensity represents the strength of the correlation, making it easy to identify strong positive, strong negative, and weak correlations at a glance.  In this context, it provides a quick overview of how various features (average salary, company rating, job description length, and competitor count) relate to one another.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows the correlation between 'avg_salary', 'Rating', 'desc_len', and 'num_comp'.  Key insights include:

* **Weak correlations:**  There are no strong correlations between any of the features.  The relationships are generally weak.
* **Slight positive correlation between avg_salary and desc_len:**  As observed in the previous regression plot, there's a slightly positive correlation between average salary and job description length.  However, it's not a strong relationship.
* **Other correlations are weak or negligible:** There's no significant correlation between average salary and company rating or between average salary and the number of competitors.  The correlation between the other pairs of features is also relatively weak.


This reinforces the idea that salary is influenced by multiple factors, and these four variables alone don't fully explain salary variations. More comprehensive analysis would be required to identify other key drivers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the correlation between these key features can help companies make strategic decisions. For example, knowing the weak correlation between salary and rating suggests that companies should not solely focus on salary to improve their ratings.  Instead, they should invest in other aspects like work-life balance, company culture and career development opportunities.

Negative growth could arise if a company misinterprets the weak correlations as non-existent.  For example, if a company assumes there's no relationship between salary and employee satisfaction (reflected in the rating) and fails to offer competitive compensation, it could lead to high employee turnover, difficulty attracting talent and ultimately hinder growth.  Ignoring the slight positive correlation between salary and job description length could also result in poorly constructed job postings, leading to lower quality applicants and difficulty filling open positions.  The insights should be interpreted with care: they highlight factors to consider rather than provide conclusive evidence.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

plt.figure(figsize=(12,6))
sns.boxenplot(x='job_simp', y='avg_salary', data=df, palette='Reds')
plt.title('Salary Distribution by Simplified Job Title')
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
plt.show()

##### 1. Why did you pick the specific chart?

I chose a boxenplot (letter-value plot) to visualize the distribution of average salaries across different simplified job titles.  Boxenplots are similar to box plots but provide a more detailed representation of the data distribution, particularly in the presence of a wide range of values or outliers.  They show the median, quartiles, and more quantiles, giving a better understanding of the data spread within each category.  This chart is suitable for comparing salary distributions across several job categories, especially when there's a wide range of salaries within each category.

##### 2. What is/are the insight(s) found from the chart?

The boxenplot reveals the distribution of average salaries across various job titles.  It allows for a comparison of the median salary, spread (interquartile range), and the presence of outliers for each job category.  Some job titles, like Data Scientist, Data Engineer, and Machine Learning Engineer tend to have higher median salaries and potentially wider ranges.  Other roles show lower medians and potentially less variation. The boxenplot's detail allows for a more nuanced comparison of salary ranges across roles compared to a regular boxplot.
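
Because `job_simp` only keeps the text before the first comma, it can still contain many distinct titles, which crowds the x-axis. One possible refinement, sketched here as an assumption rather than as the notebook's method, is to map titles to a few broad role categories by keyword. The function name `simplify_title`, the keyword rules, and the category labels are all hypothetical.

```python
def simplify_title(title):
    """Map a raw job title to a broad role category using keyword rules."""
    lowered = title.lower()
    # Rules are checked in order, so more specific keywords come first
    rules = [
        ("machine learning", "ml engineer"),
        ("data scientist", "data scientist"),
        ("data engineer", "data engineer"),
        ("analyst", "analyst"),
        ("manager", "manager"),
        ("director", "director"),
    ]
    for keyword, label in rules:
        if keyword in lowered:
            return label
    return "other"

print(simplify_title("Senior Data Scientist, NLP"))  # data scientist
print(simplify_title("Machine Learning Engineer"))   # ml engineer
print(simplify_title("Chemist"))                     # other
```

A column built with `df['Job Title'].apply(simplify_title)` would give the boxenplot a handful of categories to compare instead of one box per raw title.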


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this visualization helps in salary benchmarking and fair compensation. Understanding the salary distribution across job titles allows companies to ensure competitive compensation for their employees. It allows for informed decisions regarding hiring, promotions and salary adjustments within each role.

 Negative growth could result from offering salaries that are significantly lower than the market average for a specific role.  This can lead to difficulty attracting qualified candidates, higher employee turnover and decreased morale. Conversely, if a company overpays for a particular role consistently, it could strain the budget and reduce profit margins.  Understanding the distribution of salaries across different job categories is key to avoiding these potential issues.


#### Chart - 8

In [None]:
# Chart - 8 visualization code

plt.figure(figsize=(10,6))
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df['avg_salary'], kde=True, color='darkcyan')
plt.title('Distribution of Average Salaries')
plt.xlabel('Average Salary')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A distribution plot (a histogram with a KDE curve) was chosen to visualize the distribution of average salaries.  Such plots are useful for showing the overall shape of the distribution of a numerical variable, including its central tendency, spread, and skewness.  In this case, it provides a clear picture of how average salaries are distributed across the entire dataset. This helps identify whether salaries are normally distributed, skewed, or if there are any unusual patterns.

##### 2. What is/are the insight(s) found from the chart?

The distribution of average salaries appears somewhat right-skewed.  This means there's a longer tail on the higher end of the salary range.  A significant portion of the salaries are concentrated within a certain range, but there are some jobs with considerably higher average salaries. The peak of the distribution indicates the most common average salary range, and the tails of the distribution show the less common extreme salary values.
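
The visual impression of right skew can be checked numerically. The sketch below is only illustrative: it uses a synthetic log-normal sample as a stand-in for `df['avg_salary']` (an assumption, since the real column is not reproduced here) and reports the sample skewness together with the mean and median. A positive skewness and a mean above the median are both consistent with a longer right tail.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['avg_salary'] (hypothetical values, in thousands);
# a log-normal sample is right-skewed by construction.
rng = np.random.default_rng(0)
avg_salary = pd.Series(rng.lognormal(mean=4.5, sigma=0.4, size=1000), name="avg_salary")

skewness = avg_salary.skew()        # > 0 indicates a longer right tail
mean_salary = avg_salary.mean()
median_salary = avg_salary.median()

print(f"skewness: {skewness:.2f}")
print(f"mean: {mean_salary:.1f}, median: {median_salary:.1f}")
```

On the real data, replacing `avg_salary` with `df['avg_salary']` should be enough to run the same check.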


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of average salaries is important for setting salary ranges and evaluating compensation packages.  For example, companies can use this information to make sure their offered salaries are competitive and attract qualified candidates.

Negative growth could result from setting salaries too low or too high. If salaries are too low, it can discourage top talent from applying. If salaries are too high, it can strain the budget and affect profitability. A right skew means that most jobs cluster at the lower-to-middle part of the range while a comparatively small number sit well above it. The company should therefore review its compensation structure to understand which roles make up that high-end tail and how they affect expenses and profitability.


#### Chart - 9 Pair Plot

In [None]:
# Chart - 9 visualization code

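# sns.pairplot creates its own figure, so a preceding plt.figure() call is not
# needed (it would only leave an extra empty figure in the output)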
sns.pairplot(df[['avg_salary', 'Rating', 'desc_len', 'num_comp']])
plt.suptitle('Pair Plot of Key Features', y=1.02)   # Adjust title position
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot was chosen to visualize the relationships between multiple numerical variables simultaneously.  Pair plots create a matrix of scatter plots, where each variable is plotted against every other variable.  This allows for a quick overview of the pairwise relationships, including linear correlations, clusters, and other patterns.  The diagonal of the matrix typically shows the distribution of each individual variable (e.g., a histogram). In this case, it helps to visualize how 'avg_salary', 'Rating', 'desc_len', and 'num_comp' relate to each other in all possible combinations.



##### 2. What is/are the insight(s) found from the chart?

The pair plot visualizes the relationships between average salary, company rating, job description length, and the number of competitors.  It shows the distribution of each variable along the diagonal and scatter plots for all pairwise combinations.  This allows for the identification of correlations (or lack thereof) between these variables.  For example, it reinforces observations from previous visualizations regarding the weak or non-existent correlations between salary and other features like company rating or number of competitors.  The pair plot offers a comprehensive overview of these relationships in a single visualization.
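
The strength of these pairwise relationships can also be read off a correlation matrix. The following is a minimal sketch on randomly generated data that only borrows the column names used above; the values are hypothetical and not taken from the dataset, so the printed numbers say nothing about the real correlations.

```python
import numpy as np
import pandas as pd

# Hypothetical data that reuses the notebook's column names.
rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame({
    "avg_salary": rng.normal(100, 30, n),
    "Rating": rng.uniform(1, 5, n).round(1),
    "desc_len": rng.integers(500, 8000, n),
    "num_comp": rng.integers(0, 4, n),
})

# Pearson correlation between every pair of columns.
corr = demo.corr(method="pearson")

# Correlations of the other features with salary, strongest (by magnitude) first.
salary_corr = corr["avg_salary"].drop("avg_salary").sort_values(
    key=lambda s: s.abs(), ascending=False
)

print(corr.round(2))
print(salary_corr.round(2))
```

Running the same two steps on `df[['avg_salary', 'Rating', 'desc_len', 'num_comp']]` would give the figures behind the scatter panels.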

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the pair plot's insights can positively impact businesses by providing a holistic view of how key factors interrelate, informing strategic decisions about compensation and resource allocation.  For example, observing weak correlations between salary and other factors (like company rating or competitor count) emphasizes the need for a multifaceted approach to attracting and retaining talent, rather than relying solely on salary.  This could lead to more effective strategies focused on employee well-being, company culture, or professional development opportunities.

Negative growth could stem from misinterpreting the visualized relationships.  If a company concludes from the weak correlations that salary is inconsequential to employee satisfaction or company performance, they might undervalue compensation packages. This could lead to higher employee turnover, difficulty attracting top talent, and ultimately, hinder business growth.  The pair plot highlights interrelationships; it does not negate the importance of competitive salaries in a broader context of other contributing factors.  Ignoring salary's role, even with a weak direct correlation, as part of a comprehensive compensation and benefits package could be detrimental to the company's success.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

plt.figure(figsize=(18, 6))
sns.scatterplot(x='Size', y='Rating', data=df, color='purple', edgecolor='black')  # Assuming 'Size' column exists
plt.title('Company Size vs Company Rating')
plt.xlabel('Company Size')
plt.ylabel('Company Rating')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is used to visualize the relationship between company size and company rating. It is effective for spotting potential correlations or patterns between two variables. Each point represents one record in the dataset, with company size on the x-axis and rating on the y-axis, making it easy to observe trends.


##### 2. What is/are the insight(s) found from the chart?

The scatter plot helps to determine if larger companies tend to have higher or lower ratings compared to smaller companies.  It also reveals the overall distribution of company sizes across rating levels.
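
A per-group summary can make this comparison concrete. The sketch below assumes that `Size` holds text bands such as "1 to 50 employees"; that format, the band labels, and the ratings are all hypothetical, since the notebook's own comment is unsure about the column. It orders the bands and computes the count, median, and mean rating for each.

```python
import pandas as pd

# Hypothetical records; the band labels are an assumed format for the Size column.
demo = pd.DataFrame({
    "Size": ["1 to 50 employees", "1 to 50 employees", "51 to 200 employees",
             "51 to 200 employees", "10000+ employees", "10000+ employees"],
    "Rating": [3.2, 4.1, 3.8, 3.6, 4.0, 3.5],
})

# Give the bands an explicit order so the summary reads from small to large.
size_order = ["1 to 50 employees", "51 to 200 employees", "10000+ employees"]
demo["Size"] = pd.Categorical(demo["Size"], categories=size_order, ordered=True)

rating_by_size = demo.groupby("Size", observed=True)["Rating"].agg(["count", "median", "mean"])
print(rating_by_size)
```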


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the relationship between company size and rating can inform decisions about growth strategies, target markets, and competitive positioning. For instance, if the visualization reveals that larger companies tend to have higher ratings, it may suggest that scale is associated with a better perception of the company, although such a pattern alone does not show that scaling up operations would raise the rating.

Negative growth might arise from neglecting this relationship. If a company ignores the potential importance of size in relation to the company rating, it may miss opportunities to increase its rating through strategic scaling or expansion.  Conversely, it could misinterpret the data and pursue aggressive expansion that negatively impacts their rating, or they might shrink with a false assumption it helps the rating when it doesn't. The insights should be interpreted alongside other business factors and not used in isolation to guide strategy.


#### Chart - 11

In [None]:
# Chart - 11 visualization code

plt.figure(figsize=(14, 6))
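# sort_index() puts the bars in ascending rating order instead of descending count order
df['Rating'].value_counts().sort_index().plot(kind='bar', color='gold')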
plt.title('Distribution of Company Ratings')
plt.xlabel('Rating')
plt.ylabel('Number of Companies')
plt.xticks(rotation=0)  # Keep x-axis labels horizontal
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is suitable for visualizing the distribution of company ratings. It effectively shows the frequency or count of each rating value, allowing for easy comparison of the prevalence of different rating levels.

##### 2. What is/are the insight(s) found from the chart?

The bar chart illustrates the distribution of company ratings, highlighting which ratings are most frequent and which are less common within the dataset. This provides insight into the overall rating landscape of the companies in the dataset.
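
The most frequent and the average rating can be computed directly. This is a small sketch on made-up ratings (a hypothetical stand-in for `df['Rating']`): it tabulates the counts in rating order and extracts the mode and the mean.

```python
import pandas as pd

# Hypothetical ratings standing in for df['Rating'].
ratings = pd.Series([3.5, 4.0, 3.5, 4.2, 3.9, 3.5, 4.0, 2.8], name="Rating")

rating_counts = ratings.value_counts().sort_index()  # counts, in ascending rating order
most_common_rating = ratings.mode().iloc[0]          # most frequent rating value
mean_rating = ratings.mean()

print(rating_counts)
print(f"most common: {most_common_rating}, mean: {mean_rating:.2f}")
```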

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of company ratings can inform strategic decision-making and performance benchmarking. For example, if a company's rating falls below the average or most frequent rating in the dataset, that gap can point to areas for improvement in its workplace practices or management that would enhance its image as an employer.

Negative growth could result from complacency if a company is among the highest-rated companies but fails to maintain or improve its performance. Because Glassdoor ratings are submitted by employees, a decline in ratings, even when most companies in the dataset are rated highly, can damage the company's reputation as an employer and make it harder to attract and retain talent. Regular monitoring and analysis are crucial to maintain high ratings.


#### Chart - 12

In [None]:
# Chart - 12 visualization code

plt.figure(figsize=(6, 6))
df['num_comp'].value_counts().plot(kind='pie', autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Companies by Number of Competitors')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a pie chart to visualize the distribution of companies by the number of competitors.  Pie charts are effective for displaying the proportion of different categories within a whole.  In this case, it helps to quickly see the relative percentage of companies that have varying numbers of competitors.


##### 2. What is/are the insight(s) found from the chart?

The pie chart shows the distribution of companies based on their number of competitors.  It visually represents the proportion of companies with different competitor counts as percentages of the total.  This allows for a quick understanding of the competitive landscape within the dataset.
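
The percentages printed on the slices correspond to normalized value counts. As an illustration on made-up competitor counts (a hypothetical stand-in for `df['num_comp']`), they can be tabulated like this:

```python
import pandas as pd

# Hypothetical competitor counts standing in for df['num_comp'].
num_comp = pd.Series([0, 0, 3, 0, 2, 3, 0, 1, 3, 0], name="num_comp")

# Share of rows per competitor count, as a percentage of the total.
share = num_comp.value_counts(normalize=True).sort_index().mul(100).round(1)
print(share)
```

A table like this is easier to quote in a report than slice labels, and it makes small categories visible that a pie chart tends to hide.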

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of companies by the number of competitors can inform strategic decisions about market positioning, competitive analysis, and resource allocation.  For instance, if a significant portion of companies have a high number of competitors, it suggests a highly competitive market, requiring businesses to differentiate themselves effectively to gain market share.  Conversely, a smaller number of competitors might indicate opportunities for market expansion or less intense competition.

Negative growth could result from a failure to understand the competitive landscape.  If a company operating in a highly competitive market (many competitors) fails to adopt a robust competitive strategy, it could lose market share and profitability. Conversely, a company in a less competitive market (few competitors) might underestimate the importance of differentiation or innovation, potentially losing out on growth opportunities.  Misinterpreting the pie chart data without further investigation could lead to poor strategic decisions.


#### Chart - 13

In [None]:
# Chart - 13 visualization code

plt.figure(figsize=(12, 6))
sns.countplot(y='Sector', data=df, order=df['Sector'].value_counts().index)
plt.title('Distribution of Job Postings Across Sectors')
plt.xlabel('Number of Job Postings')
plt.ylabel('Sector')
plt.show()


##### 1. Why did you pick the specific chart?

A countplot (which is essentially a bar chart for categorical data) is suitable for visualizing the distribution of job postings across different sectors. It effectively displays the frequency of job postings within each sector, making it easy to compare the relative representation of each sector in the dataset.  The `order` parameter in the code ensures that the sectors are displayed in descending order of frequency, further enhancing readability.

##### 2. What is/are the insight(s) found from the chart?

The countplot shows the distribution of job postings across different sectors.  It indicates which sectors have the highest and lowest numbers of job postings in the dataset.  The sectors are ordered by frequency, so the most frequent sector is displayed at the top.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the sector distribution of job postings can positively impact businesses by informing strategic decisions about talent acquisition, market penetration, and resource allocation. For instance, if a sector shows a high concentration of job postings, it may indicate a growing demand for talent in that area, presenting opportunities for businesses to expand their operations or target their recruitment efforts.

Negative growth could arise from misinterpreting the sector distribution data. If a company overlooks a growing sector with a high demand for talent, they may lose out on opportunities for expansion and market share. Conversely, if they overemphasize a sector that is declining, they may invest resources inefficiently, leading to losses and reduced profitability.  The insights should be considered alongside other market trends and not used in isolation for strategic decisions.


#### Chart - 14

In [None]:
# Chart-14 visualization code

plt.figure(figsize=(16, 8))
sns.violinplot(x='Sector', y='avg_salary', data=df, palette='viridis', inner='quartile')
plt.xticks(rotation=45, ha='right')
plt.title('Salary Distribution by Sector')
plt.xlabel('Sector')
plt.ylabel('Average Salary')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a violin plot to visualize the distribution of average salaries across different sectors. Violin plots combine the benefits of box plots and kernel density plots, displaying the median, the quartiles, and the overall distribution of the data within each sector. This gives a richer understanding of the salary distribution than box plots or simple histograms, and provides a way to compare the shape of the distributions across sectors visually.


##### 2. What is/are the insight(s) found from the chart?

The violin plot reveals the distribution of average salaries across different sectors. It shows the median and quartiles (the inner lines), the spread of salaries (the vertical extent of each violin), and the overall shape of the salary distribution for each sector (the width of a violin at a given salary reflects how densely observations cluster near that value). Some sectors might exhibit a higher median salary and a taller violin, indicating greater variation in salaries within that sector. Other sectors might show a narrower range or a different distribution shape, suggesting more uniform salaries. The violin plot effectively visualizes the differences in both central tendency and distribution shape of salaries across different sectors.
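
The median and spread visible in each violin can be tabulated per sector. The sketch below uses a tiny hypothetical table with the same column names (the sector labels and salaries are invented for illustration) and computes the median, interquartile range, and row count for each sector.

```python
import pandas as pd

# Hypothetical rows; sector names and salaries are illustrative only.
demo = pd.DataFrame({
    "Sector": ["Information Technology"] * 4 + ["Finance"] * 4,
    "avg_salary": [80, 100, 120, 160, 70, 75, 80, 85],
})

def iqr(s: pd.Series) -> float:
    """Interquartile range: 75th percentile minus 25th percentile."""
    return s.quantile(0.75) - s.quantile(0.25)

# One row per sector with its median salary, IQR, and number of rows.
sector_summary = demo.groupby("Sector")["avg_salary"].agg(median="median", iqr=iqr, n="count")
print(sector_summary)
```

A sector with a high median and a large IQR is one where a single "typical" salary figure is least informative.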


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding salary distribution by sector helps businesses create competitive compensation packages to attract and retain talent.  It allows for data-driven decisions on salary ranges within specific sectors.  For example, if a sector consistently shows higher average salaries, the business can adjust their offerings accordingly to remain competitive.  Conversely, if a sector has lower salaries, they might explore cost-effective strategies without compromising talent acquisition.

Negative growth could arise from misinterpreting the salary distributions.  A company might mistakenly assume that a sector with a high median salary also has low variability.  If they offer a salary at the lower end of that sector's range, they could lose qualified candidates to competitors offering higher salaries within the same sector.  Similarly, offering excessively high salaries in sectors with lower average salaries might strain the budget without a corresponding increase in talent quality.  In essence, failing to consider the full salary distribution within each sector could result in poor hiring decisions, impacting the quality of the workforce and long-term business growth.


## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain Briefly.

Based on the provided analysis, here's a suggested approach for the client to achieve their business objective (assuming the objective is to optimize salary setting and talent acquisition):

**1.  Refine Salary Setting Strategy:**

* **Move beyond simplistic correlations:**  The analysis clearly shows weak correlations between salary and factors like job description length, company rating, and competitor count.  Salary determination should not rely solely on these factors.  Instead, incorporate a more holistic approach.
* **Market Research and Benchmarking:** Conduct thorough market research to understand salary ranges for specific job titles and experience levels within the relevant geographic area and industry.  Use industry-standard salary surveys and data sources.
* **Consider Job Complexity and Responsibilities:** Evaluate job descriptions meticulously to determine the complexity of the role and required skills. Focus on responsibilities and the level of decision-making authority rather than just description length.
* **Total Compensation Package:**  Look at the overall compensation package, not just base salary.  Benefits (health insurance, retirement plans, paid time off, etc.) contribute significantly to employee value proposition.
* **Internal Equity:**  Ensure internal pay equity. Regularly review salary ranges for existing employees to maintain fair compensation and avoid dissatisfaction.  

**2. Improve Talent Acquisition Strategies:**

* **Targeted Job Descriptions:** Create detailed and compelling job descriptions that clearly outline responsibilities, required skills, and company culture. Focus on value and clarity rather than length.
* **Highlight Company Culture and Benefits:** Emphasize positive aspects of working at the company—company culture, growth opportunities, work-life balance, and benefits—to attract top talent.
* **Competitive Compensation and Benefits:** Offer competitive salaries and benefits packages based on market research and benchmarking. This will attract and retain qualified candidates.
* **Optimize Employer Branding:** Focus on building a strong employer brand that appeals to the target audience. This will enhance the company’s appeal in the job market.
* **Leverage Data Analytics:** Continue using data analysis to monitor market trends, analyze the effectiveness of hiring strategies, and refine compensation packages over time.

**3. Address Potential Negative Growth Points:**

* **Avoid over-reliance on single indicators:** Recognize that salary is influenced by multiple factors, and using job description length or company rating in isolation for salary determination could lead to significant issues.
* **Monitor Employee Turnover:** Track employee turnover closely and investigate the reasons behind departures. Use exit interviews to gather valuable feedback.
* **Regularly Update Salary Data:** Compensation markets fluctuate. Regularly update salary data and benchmarks to ensure continued competitiveness and accuracy.


**Key Performance Indicators (KPIs):**

* **Time-to-hire:**  Measure how long it takes to fill open positions.
* **Employee satisfaction:** Use surveys or other methods to assess employee satisfaction with compensation and overall work experience.
* **Employee turnover rate:** Track the rate at which employees leave the company.
* **Quality of hire:** Evaluate the performance of newly hired employees.
* **Cost-per-hire:** Monitor the expenses associated with recruiting and hiring.
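
Several of these KPIs reduce to simple arithmetic. The sketch below uses commonly used definitions (average days from opening a requisition to an accepted offer, separations divided by average headcount, and recruiting spend divided by hires). These definitions and every figure in it are assumptions for illustration; none come from the dataset.

```python
from datetime import date

# Hypothetical figures for one quarter; none of these come from the dataset.
requisitions = [  # (date opened, date offer accepted)
    (date(2024, 1, 8), date(2024, 2, 12)),
    (date(2024, 1, 15), date(2024, 3, 1)),
    (date(2024, 2, 1), date(2024, 2, 29)),
]
separations = 6                          # employees who left during the period
headcount_start, headcount_end = 118, 122
recruiting_spend = 27_000.0              # total recruiting cost for the period

# Time-to-hire: average number of days needed to fill a position.
days_to_fill = [(accepted - opened).days for opened, accepted in requisitions]
avg_time_to_hire = sum(days_to_fill) / len(days_to_fill)

# Turnover rate: separations relative to the average headcount.
average_headcount = (headcount_start + headcount_end) / 2
turnover_rate = separations / average_headcount

# Cost-per-hire: recruiting spend divided by the number of hires.
cost_per_hire = recruiting_spend / len(requisitions)

print(f"time-to-hire: {avg_time_to_hire:.1f} days")
print(f"turnover rate: {turnover_rate:.1%}")
print(f"cost-per-hire: {cost_per_hire:,.0f}")
```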


By implementing these suggestions and consistently monitoring KPIs, the client can create a positive business impact by improving talent acquisition, employee retention, and overall financial performance.  The provided analysis serves as a valuable starting point but should be complemented with additional market research and ongoing data analysis.

# **Conclusion**

In conclusion, this EDA project surfaces prevailing trends in the modern workplace and the factors that influence employee satisfaction, salary expectations, and overall company performance. The findings can serve as a resource for job seekers, employers, and researchers navigating the current employment landscape. Further analysis could apply more advanced techniques, such as machine learning for predictive modeling of employee turnover or compensation, to extend the scope and depth of these insights.

