<a href="https://colab.research.google.com/github/bidyashreenayak0211/Labmentix-Internship/blob/main/Glassdoor_Jobs_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Glassdoor Jobs Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member  -** **Bidyashree Nayak**


# **Project Summary -**

The dataset titled **Glassdoor_Jobs** outlines a business challenge related to Glassdoor, a platform known for company reviews, salary insights, and job listings. The project aims to address specific challenges faced by Glassdoor and enhance its effectiveness in providing valuable insights to job seekers and employers.  

Glassdoor serves as a crucial tool for job seekers to assess potential employers through employee reviews, salary transparency, and workplace culture insights. However, the platform faces certain limitations that affect its credibility, engagement, and the overall user experience. This project seeks to analyze these challenges and propose viable solutions to improve the platform’s value proposition for all stakeholders.  

One of the key issues in focus is the authenticity of reviews and ratings on the platform. Many companies have been accused of manipulating their ratings by encouraging positive reviews or flagging negative ones for removal. This leads to a credibility gap, making it difficult for job seekers to trust the information available on Glassdoor. Additionally, anonymous reviews sometimes lack constructive feedback, reducing their usefulness to both job seekers and employers.  

Another major concern is the effectiveness of salary data. While Glassdoor provides salary insights based on employee submissions, these figures may be outdated or inconsistent across different job roles and industries. This can create unrealistic salary expectations for job seekers or mislead employers about industry standards.  

From a business perspective, Glassdoor must balance providing free, reliable content to job seekers with generating revenue from employers. Companies that advertise job postings or use Glassdoor’s employer branding services often expect favorable representation, which can create conflicts between maintaining transparency and meeting business objectives.  

User engagement is another area of concern. Many users visit Glassdoor only during job transitions and do not return regularly, leading to fluctuating engagement levels. A lack of continuous interaction limits the potential for Glassdoor to build a strong professional community.  

The project aims to:  

1. **Improve Review Authenticity** – Exploring mechanisms such as AI-driven verification or stricter moderation to ensure reviews are genuine and unbiased.  
2. **Enhance Salary Data Accuracy** – Developing methods to validate salary submissions and provide real-time compensation insights.  
3. **Balance Transparency with Revenue Goals** – Identifying strategies that allow Glassdoor to maintain credibility while monetizing employer services effectively.  
4. **Increase User Engagement** – Encouraging ongoing participation through new features, career insights, or networking opportunities.  

By addressing these challenges, Glassdoor can strengthen its reputation as a trusted source of job market insights and improve the experience for both job seekers and employers. This project will conduct a deep analysis of user behavior, competitor strategies, and technological advancements to formulate actionable recommendations.  


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


There is a growing need for organizations to analyze employee feedback on platforms like Glassdoor efficiently. Businesses struggle to extract meaningful insights due to the vast amount of unstructured data, subjective nature of reviews, and potential bias in ratings. The problem revolves around understanding employee sentiment, improving employer branding, and making data-driven decisions to enhance workplace satisfaction and attract top talent.

#### **Define Your Business Objective?**

1. **Develop an Analytical Framework** – Build a systematic approach to process and analyze Glassdoor reviews efficiently.
2. **Extract Actionable Insights** – Use data-driven techniques to identify trends, sentiment, and key factors influencing company ratings.
3. **Enhance Employer Branding** – Provide businesses with tools to manage their reputation by addressing employee concerns.
4. **Improve Decision-Making** – Help HR and management teams make informed decisions based on feedback trends.
5. **Monitor and Optimize Employee Satisfaction** – Track changes in employee sentiment over time and implement strategic improvements.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
file_path = '/content/glassdoor_jobs.csv'
df = pd.read_csv(file_path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = df.shape
print(f"Rows: {rows}, Columns: {columns}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
print(missing_values)


In [None]:
# Visualizing the missing values

# Plot a heatmap for missing values
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='RdYlGn', yticklabels=False)
plt.title("Heatmap of Missing Values", fontsize=16)
plt.show()

### What did you know about your dataset?

**Missing Values Analysis**
`df.isnull().sum()` counts the null values in each column. Every column reports 0, so the dataset contains no `NaN` entries.

**Visualizing Missing Values**
The heatmap is drawn with:

**sns.heatmap(df.isnull(), cbar=False, cmap='RdYlGn', yticklabels=False)**

Because no value is null, the heatmap is a single uniform color and simply confirms the count above.

**Insights :**
No imputation is needed for `NaN` values. This does not mean the data is complete: several columns use `-1` as a placeholder for missing information (see the Variables Description section), and `isnull()` does not detect those.
The heatmap is a valid check but adds little here, since there is nothing for it to highlight.
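
Since `isnull()` cannot see placeholder values, a separate count can make them visible. The following is a minimal sketch on a toy frame; the helper name `count_placeholders` and the sample values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def count_placeholders(frame: pd.DataFrame) -> pd.Series:
    """Count, per column, the cells equal to -1 (numeric) or "-1" (text)."""
    counts = {}
    for col in frame.columns:
        numeric_hits = (frame[col] == -1).sum()
        string_hits = (frame[col] == "-1").sum()
        counts[col] = int(numeric_hits + string_hits)
    return pd.Series(counts)

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "Rating": [3.8, -1.0, 4.2],
    "Founded": [1992, -1, -1],
    "Salary Estimate": ["$53K-$91K (Glassdoor est.)", "-1", "$60K-$80K (Glassdoor est.)"],
})
placeholder_counts = count_placeholders(toy)
print(placeholder_counts)
```

On the real data, `count_placeholders(df)` could be run right after the `isnull()` check to see how much information is hidden behind `-1`.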

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Dataset Columns:")
print(df.columns.tolist())

In [None]:
# Dataset Describe
# Statistical summary of numerical columns
print("\nStatistical Summary (Numerical Columns):")
print(df.describe())

# Statistical summary of all columns (including categorical)
print("\nStatistical Summary (All Columns):")
print(df.describe(include='all'))  # Includes categorical data


### Variables Description

#### **Numerical Columns**
1. **Unnamed: 0**:
   - Likely an index column.
   - Ranges from 0 to 955 (min = 0, max = 955).
   - Uniformly distributed with mean = 477.5 and std = 276.12.

2. **Rating**:
   - Represents company ratings (possibly out of 5 stars).
   - Mean = 3.60. This figure includes the `-1` placeholder values noted below, so the average over genuine ratings is somewhat higher.
   - Min = -1, which is likely an outlier or placeholder for missing values.
   - Max = 5, the highest possible rating.
   - Quartiles:
     - 25%: 3.3 (lower quartile rating).
     - 50%: 3.8 (median rating).
     - 75%: 4.2 (upper quartile rating).

3. **Founded**:
   - Represents the year a company was founded.
   - Mean = 1774.6 and std = 598.94. Both are most likely distorted by the `-1` placeholder values rather than by genuinely old companies.
   - Min = -1, likely a placeholder for missing values.
   - Max = 2019, the most recent founding year in the dataset.
   - Quartiles:
     - 25%: 1937 (three quarters of the listings belong to companies founded in 1937 or later).
     - 50%: 1992 (median founding year).
     - 75%: 2008 (recent companies).

---

#### **Categorical Columns**
1. **Job Title**:
   - 328 unique job titles.
   - Most frequent: **Data Scientist** (178 occurrences).

2. **Salary Estimate**:
   - 417 unique values.
   - Most frequent: `-1` (214 occurrences), likely indicating missing or undisclosed salary information.

3. **Job Description**:
   - 596 unique descriptions.
   - Most frequent description occurs 4 times.

4. **Company Name**:
   - 448 unique company names.
   - Most frequent: **Novartis** (14 occurrences).

5. **Location**:
   - 237 unique locations.
   - Most frequent: **New York, NY** (78 occurrences).

6. **Headquarters**:
   - 235 unique headquarters locations.
   - Most frequent: **New York, NY** (75 occurrences).

7. **Size**:
   - 9 unique company size ranges.
   - Most frequent: **1001 to 5000 employees** (177 occurrences).

8. **Type of Ownership**:
   - 13 unique types of ownership.
   - Most frequent: **Company - Private** (532 occurrences).

9. **Industry**:
   - 63 unique industries.
   - Most frequent: **Biotech & Pharmaceuticals** (148 occurrences).

10. **Sector**:
    - 25 unique sectors.
    - Most frequent: **Information Technology** (239 occurrences).

11. **Revenue**:
    - 14 unique revenue ranges.
    - Most frequent: **Unknown / Non-Applicable** (299 occurrences).

12. **Competitors**:
    - 149 unique competitor entries.
    - Most frequent: `-1` (634 occurrences), likely indicating missing or no competitors.

---

### **Key Observations**
1. **Missing/Placeholder Values**:
   - Columns such as **Rating**, **Founded**, **Salary Estimate**, and **Competitors** have placeholder values like `-1`, which should be addressed during data cleaning.
   
2. **Numerical Outliers**:
   - The `Rating` column has invalid values like `-1`.
   - The `Founded` column includes a minimum value of `-1`, which is unrealistic.

3. **Categorical Insights**:
   - The dataset has diverse job titles, company sizes, industries, and locations.
   - Some columns have dominant values, such as **Data Scientist** in `Job Title` and **Unknown / Non-Applicable** in `Revenue`.

4. **Data Cleaning Required**:
   - Placeholder values like `-1` need to be handled (e.g., replaced, imputed, or removed).
   - Outliers in numerical data, particularly in **Rating** and **Founded**, need addressing.
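
One way to handle the `-1` placeholders listed above is to convert them to `NaN` before computing statistics. This is a minimal sketch under that assumption; the helper `mask_placeholders` and the toy values are hypothetical and not taken from the notebook.

```python
import pandas as pd

def mask_placeholders(frame: pd.DataFrame, columns, placeholder=-1) -> pd.DataFrame:
    """Return a copy in which `placeholder` in the given columns becomes NaN."""
    out = frame.copy()
    for col in columns:
        # Series.mask replaces the cells where the condition holds with NaN
        out[col] = out[col].mask(out[col] == placeholder)
    return out

# Toy data (illustrative values only)
toy = pd.DataFrame({"Rating": [3.8, -1.0, 4.2], "Founded": [1992, -1, 2008]})
masked = mask_placeholders(toy, ["Rating", "Founded"])

print(masked.isna().sum())      # missing values are now visible to isna()
print(masked["Rating"].mean())  # mean over genuine ratings only
```

After this step, `fillna` and `dropna` act on the rows that are actually missing, and summary statistics are no longer pulled toward `-1`.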

---


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("\nUnique Values for Each Variable:")
for column in df.columns:
    print(f"{column}: {df[column].nunique()} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Create a copy of the original DataFrame to avoid modifying the original directly
df_cleaned = df.copy()

# 1. Extracting numerical data from 'Salary Estimate'
# Remove text artifacts and split the salary range into `Min Salary` and `Max Salary`

salary = df_cleaned['Salary Estimate'].str.extract(r'(\d+)K-(\d+)K')
df_cleaned['Min Salary'] = pd.to_numeric(salary[0]) * 1000  # Convert to integers
df_cleaned['Max Salary'] = pd.to_numeric(salary[1]) * 1000  # Convert to integers

# Calculate the average salary for easier analysis
df_cleaned['Average Salary'] = (df_cleaned['Min Salary'] + df_cleaned['Max Salary']) / 2

# 2. Splitting 'Location' into 'City' and 'State'
location_split = df_cleaned['Location'].str.split(',', expand=True)
df_cleaned['City'] = location_split[0].str.strip()
df_cleaned['State'] = location_split[1].str.strip()

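# 3. Handling missing values
# Filling missing 'Rating' with the average rating
# (assign the result back: fillna(inplace=True) on a selected column raises a FutureWarning in recent pandas)
# Note: only NaN is filled; `-1` placeholders are not treated as missing
df_cleaned['Rating'] = df_cleaned['Rating'].fillna(df_cleaned['Rating'].mean())

# Dropping rows with missing `Founded` (optional - based on importance of this column)
df_cleaned.dropna(subset=['Founded'], inplace=True)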

# 4. Dropping unnecessary columns
df_cleaned.drop(columns=['Unnamed: 0', 'Salary Estimate'], inplace=True)

# Display the first few rows of the cleaned dataset
df_cleaned.head()


In [None]:
df_cleaned.to_csv('cleaned_dataset.csv', index=False)

### What all manipulations have you done and insights you found?

#### **Manipulations Done:**

**Salary Extraction:**
The goal was to extract the numeric salary range (e.g., 50K-70K) from the Salary Estimate column and store the minimum, maximum, and average salary in new columns. However, the Min Salary, Max Salary, and Average Salary columns came out empty. Possible causes:
The regex `(\d+)K-(\d+)K` does not match the actual format of Salary Estimate, for example if each number is preceded by a currency symbol such as `$`.
Some rows hold the `-1` placeholder instead of a salary range, so they have nothing to extract.
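
If the raw values look like `$53K-$91K (Glassdoor est.)`, the pattern `(\d+)K-(\d+)K` cannot match because of the `$` before the second number. The sketch below uses a pattern that tolerates an optional `$`; the sample strings are an assumption about the data format, so the pattern should be checked against the real column.

```python
import pandas as pd

# Assumed sample values; the real column may contain other variants
sample = pd.Series([
    "$53K-$91K (Glassdoor est.)",
    "$80K-$132K (Glassdoor est.)",
    "-1",
])

# Allow an optional "$" before each number and optional spaces around the dash
pattern = r"\$?(\d+)K\s*-\s*\$?(\d+)K"
salary = sample.str.extract(pattern)

min_salary = pd.to_numeric(salary[0]) * 1000
max_salary = pd.to_numeric(salary[1]) * 1000
average_salary = (min_salary + max_salary) / 2

print(pd.DataFrame({"min": min_salary, "max": max_salary, "avg": average_salary}))
```

Rows that hold `-1` do not match and stay `NaN`, which keeps them out of later salary statistics.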

**Location Splitting:**
The Location column was split into City and State, which was successful, as City and State columns are populated.
However, the State column contains some missing values (10 rows are NaN).
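
To see why some rows end up without a State, it can help to list the Location values that contain no comma. This is a small sketch with made-up locations; the real cause of the 10 missing values is not known from the notebook.

```python
import pandas as pd

# Illustrative values only; "Remote" is a hypothetical example of a comma-less location
locations = pd.Series(["New York, NY", "Remote", "San Francisco, CA"])

# Split on the first comma only, so extra commas stay in the state part
parts = locations.str.split(",", n=1, expand=True)
city = parts[0].str.strip()
state = parts[1].str.strip()

# Locations for which no state could be derived
no_state = locations[state.isna()]
print(no_state)
```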

**Handling Missing Values:**
Missing Rating values are filled with the column mean, and rows with a missing Founded value are dropped.
Neither column contains `NaN` in the raw data (missing entries are coded as `-1`), so both steps leave the rows unchanged and the `-1` placeholders remain.

**Dropped Columns:**
Columns like Unnamed: 0 and Salary Estimate were removed to clean the dataset.


#### **Insights from Manipulations:**
**Location Analysis:**
Splitting Location into City and State provides granular insights into geographic job distributions.
Missing State values may require attention, though they are minimal (about 1%).

**Rating Analysis:**
Filling missing Rating values with the mean ensures no interruptions during statistical analyses.
Outliers or trends in Rating by industry or company can now be analyzed effectively.

**Data Cleaning:**
Dropping unnecessary columns like Unnamed: 0 ensures a more focused dataset for analysis.
Rows without a Founded year were removed, which could be useful if this column is critical for temporal trends.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
# Re-importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Loading the cleaned dataset
file_path = '/content/cleaned_dataset.csv'
data_cleaned = pd.read_csv(file_path)

# Display the first few rows to understand the structure
data_cleaned.head()


#### Chart - 1

In [None]:
# Chart 1: Distribution of Ratings
plt.figure(figsize=(8, 5))
sns.histplot(data_cleaned['Rating'], bins=20, kde=True, color='blue')
plt.title("Distribution of Company Ratings", fontsize=14)
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.show()


##### 1. Why did you pick the specific chart?

I chose a histogram because it effectively shows the distribution of company ratings, allowing us to identify trends, skewness, and concentration of ratings.

##### 2. What is/are the insight(s) found from the chart?

The majority of companies have ratings between 3.0 and 4.5, with a few companies rated below 3.0. There are also fewer companies with the highest ratings (4.5–5.0).
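
A claim such as "most ratings fall between 3.0 and 4.5" can be backed by a number. The sketch below computes that share on toy ratings; the helper `share_between` and the values are assumptions for illustration.

```python
import pandas as pd

def share_between(values: pd.Series, low: float, high: float) -> float:
    """Fraction of non-missing values v with low <= v <= high."""
    valid = values.dropna()
    return float(valid.between(low, high).mean())

# Toy ratings (illustrative); -1 stands for a missing rating
ratings = pd.Series([3.2, 3.8, 4.1, 4.7, 2.5, -1.0])
valid_ratings = ratings[ratings >= 0]  # leave the placeholder out

share = share_between(valid_ratings, 3.0, 4.5)
print(f"{share:.0%} of rated companies fall between 3.0 and 4.5")
```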

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, businesses can use this insight to benchmark their ratings against competitors and focus on improving employee satisfaction to enhance their reputation.

#### Chart - 2

In [None]:
# Chart-2: Count of Jobs by Industry

# Get the top N industries
top_n = 10  # Change this number to include more or fewer industries
top_industries = df["Industry"].value_counts().nlargest(top_n).index

# Filter the DataFrame to include only these industries
filtered_df = df[df["Industry"].isin(top_industries)]

# Create the count plot
plt.figure(figsize=(12, 6))
sns.countplot(y=filtered_df["Industry"], order=top_industries, palette="coolwarm")
plt.title("Number of Job Listings by Top Industries", fontsize=14)
plt.xlabel("Count", fontsize=12)
plt.ylabel("Industry", fontsize=12)
plt.show()




##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart because it clearly represents the number of job listings per industry, making it easy to compare industries (limited to the top 10 industries).

##### 2. What is/are the insight(s) found from the chart?

Some industries have significantly more job openings than others. This suggests that certain industries are actively hiring, whereas others have fewer opportunities.
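
How strongly listings are concentrated can be expressed as the share covered by the top N categories. This is a minimal sketch on toy labels; `top_n_share` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def top_n_share(values: pd.Series, n: int) -> float:
    """Share of all rows that belong to the n most frequent categories."""
    counts = values.value_counts()
    return float(counts.nlargest(n).sum() / counts.sum())

# Toy industry labels (illustrative only)
industries = pd.Series(["A", "A", "A", "B", "B", "C", "D"])
share_top2 = top_n_share(industries, 2)
print(f"Top 2 industries cover {share_top2:.1%} of listings")
```

On the real data, `top_n_share(df["Industry"], 10)` would state how much of the dataset the top 10 chart actually represents.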

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, job seekers can target industries with higher demand, while businesses in low-demand industries can strategize ways to attract talent.

#### Chart - 3

In [None]:
# Chart-3: Number of Job Listings by Company

top_companies = df["Company Name"].value_counts().nlargest(15)  # Top 15 companies

plt.figure(figsize=(12, 6))
sns.barplot(x=top_companies.values, y=top_companies.index, palette="viridis")
plt.title("Top 15 Companies with Most Job Listings", fontsize=14)
plt.xlabel("Number of Listings", fontsize=12)
plt.ylabel("Company Name", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

I used a horizontal bar chart to compare the number of job postings among the top 15 companies, making it easy to see which companies are hiring the most.

##### 2. What is/are the insight(s) found from the chart?

Certain companies dominate the job market with significantly more postings, indicating strong hiring trends within those firms.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, job seekers can prioritize applications to companies with higher hiring activity, and companies can assess their hiring competitiveness.

#### Chart - 4

In [None]:
# Chart-4: Job Listings by Type of Ownership

plt.figure(figsize=(10, 5))
sns.countplot(y=df["Type of ownership"], order=df["Type of ownership"].value_counts().index, palette="coolwarm")
plt.title("Job Listings by Type of Ownership", fontsize=14)
plt.xlabel("Count", fontsize=12)
plt.ylabel("Type of Ownership", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart to easily compare the number of job listings across different types of company ownership.

##### 2. What is/are the insight(s) found from the chart?

Privately owned companies dominate the job listings, followed by public companies and government organizations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, job seekers can make informed decisions about the type of employer they prefer, and companies can analyze hiring competition within their ownership category.

#### Chart - 5

In [None]:
# Chart - 5 : Job Listings by Company Size

plt.figure(figsize=(10, 5))
sns.countplot(y=df["Size"], order=df["Size"].value_counts().index, palette="magma")
plt.title("Distribution of Job Listings by Company Size", fontsize=14)
plt.xlabel("Count", fontsize=12)
plt.ylabel("Company Size", fontsize=12)
plt.show()




##### 1. Why did you pick the specific chart?

I used a horizontal bar chart to clearly display how job listings are distributed across different company sizes.

##### 2. What is/are the insight(s) found from the chart?

Larger companies (e.g., 1000+ employees) have more job listings, suggesting that they are hiring at a higher rate compared to smaller firms.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, job seekers can target larger firms for better job availability, and smaller companies can assess hiring trends to remain competitive.

#### Chart - 6

In [None]:
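# Chart - 6 Distribution of Companies by Year Founded
# Exclude the -1 placeholder so it does not stretch the x-axis down to year -1

plt.figure(figsize=(12, 6))
sns.histplot(df.loc[df["Founded"] > 0, "Founded"], bins=30, kde=True, color="purple")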
plt.title("Distribution of Companies by Year Founded", fontsize=14)
plt.xlabel("Year Founded", fontsize=12)
plt.ylabel("Number of Companies", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

A histogram is ideal for showing the distribution of companies based on their founding years, revealing trends over time.

##### 2. What is/are the insight(s) found from the chart?

Most companies were founded between 1950 and 2010, with fewer older and very new companies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, job seekers may prefer established firms for stability, while startups can highlight innovation to attract talent.

#### Chart - 7

In [None]:
df = pd.read_csv('/content/cleaned_dataset.csv')

# Chart 7: Top 10 Cities with Most Job Listings

top_cities = df["City"].value_counts().nlargest(10)

plt.figure(figsize=(12, 6))
sns.barplot(x=top_cities.values, y=top_cities.index, palette="plasma")
plt.title("Top 10 Cities with Most Job Listings", fontsize=14)
plt.xlabel("Number of Listings", fontsize=12)
plt.ylabel("City", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart makes it easy to compare job availability across different cities.

##### 2. What is/are the insight(s) found from the chart?

Some cities have significantly more job openings, indicating strong job markets in those locations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, job seekers can focus on cities with high hiring activity, and companies can analyze location-based hiring trends.

#### Chart - 8

In [None]:
# Chart - 8 : Number of Job Listings by State

plt.figure(figsize=(12, 6))
sns.countplot(y=df["State"], order=df["State"].value_counts().index, palette="cividis")
plt.title("Number of Job Listings by State", fontsize=14)
plt.xlabel("Count", fontsize=12)
plt.ylabel("State", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart is useful for comparing the number of job listings across different states.

##### 2. What is/are the insight(s) found from the chart?

Certain states have significantly more job listings, suggesting stronger job markets in those regions

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, job seekers can focus their search on states with more listings, and companies can use the state-level distribution to analyze location-based hiring trends.

#### Chart - 9

In [None]:
# Chart - 9 : Job Listings by Sector

plt.figure(figsize=(12, 6))
sns.countplot(y=df["Sector"], order=df["Sector"].value_counts().index, palette="viridis")
plt.title("Number of Job Listings by Sector", fontsize=14)
plt.xlabel("Count", fontsize=12)
plt.ylabel("Sector", fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart because it clearly compares job listings across sectors, making it easy to see which sectors have the most and least openings.

##### 2. What is/are the insight(s) found from the chart?

The IT sector has the highest job listings, while Accounting, Agriculture, and Mining have the least, showing that tech-driven and service-based industries dominate the job market.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, job seekers can target high-demand industries, businesses can adjust hiring strategies, schools can align courses with job trends, and policymakers can support growing sectors.

#### Chart - 10

In [None]:
# Chart - 10 : Rating by Sector
plt.figure(figsize=(12, 6))
sns.boxplot(x="Sector", y="Rating", data=df, palette="Set2")
plt.title("Boxplot of Ratings by Sector", fontsize=14)
plt.xlabel("Sector", fontsize=12)
plt.ylabel("Rating", fontsize=12)
plt.xticks(rotation=45, ha="right")
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot helps compare the distribution of ratings across different sectors. It visually represents median, quartiles, and potential outliers

##### 2. What is/are the insight(s) found from the chart?

You can quickly see which sectors tend to have higher or lower median ratings.
Certain sectors may show more variability in ratings (longer box or more outliers).
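
The quantities a boxplot draws can also be tabulated, which makes sector comparisons concrete. The sketch below computes median, interquartile range, and count per sector on toy data; the sector names and ratings are made up for illustration.

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "Sector": ["IT", "IT", "IT", "Finance", "Finance", "Finance"],
    "Rating": [3.0, 4.0, 5.0, 3.5, 3.5, 4.0],
})

grouped = toy.groupby("Sector")["Rating"]
summary = pd.DataFrame({
    "median": grouped.median(),
    "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),  # box height
    "count": grouped.size(),                                 # listings per sector
}).sort_values("median", ascending=False)

print(summary)
```

Including the count helps avoid over-reading sectors that have only a handful of listings.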

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Companies can benchmark their ratings against sector averages.
Job seekers can get a sense of overall satisfaction within a sector, informing career choices.

#### Chart - 11

In [None]:
# Chart - 11 : Rating by Type of Ownership
plt.figure(figsize=(10, 6))
sns.boxplot(x="Type of ownership", y="Rating", data=df, palette="Set3")
plt.title("Boxplot of Ratings by Type of Ownership", fontsize=14)
plt.xlabel("Type of Ownership", fontsize=12)
plt.ylabel("Rating", fontsize=12)
plt.xticks(rotation=45, ha="right")
plt.show()


##### 1. Why did you pick the specific chart?

This boxplot compares how ratings vary across different ownership types (e.g., private, public, government).

##### 2. What is/are the insight(s) found from the chart?

Some ownership types might show tighter rating distributions (e.g., government might cluster around a certain rating), while others show more variability.
If a particular ownership type has a consistently higher rating, it may reflect certain organizational cultures or benefits.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Businesses can understand how their ownership structure might influence perceptions and satisfaction.
Job seekers can weigh whether ownership type correlates with higher satisfaction.

#### Chart - 12

In [None]:
# Chart - 12 : Number of Companies by Founding Year (Trend Over Time)

import pandas as pd

# Count how many companies are founded in each year
year_counts = df["Founded"].value_counts().sort_index()
year_counts = year_counts[year_counts.index > 0]  # Exclude the -1 placeholder used for unknown founding years

plt.figure(figsize=(12, 6))
plt.plot(year_counts.index, year_counts.values, marker="o", color="blue")
plt.title("Trend: Number of Companies by Founding Year", fontsize=14)
plt.xlabel("Year Founded", fontsize=12)
plt.ylabel("Number of Companies", fontsize=12)
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A line plot shows how the count of company founding events changes over time, highlighting periods with higher or lower business creation

##### 2. What is/are the insight(s) found from the chart?

You can spot trends such as spikes in company formation in specific decades.
This may reflect historical economic booms, technological advances, or shifts in industry landscapes.
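
Yearly counts are noisy, so grouping founding years into decades can make such spikes easier to read. This is a minimal sketch on toy years; the values are illustrative and the decade grouping is a suggestion rather than the notebook's method.

```python
import pandas as pd

# Toy founding years (illustrative); -1 marks an unknown year
founded = pd.Series([1998, 2004, 2007, -1, 1912, 2001])

valid = founded[founded > 0]   # drop the placeholder
decade = (valid // 10) * 10    # e.g. 2004 -> 2000
per_decade = decade.value_counts().sort_index()

print(per_decade)
```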

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the historical timeline of company formations can help investors or entrepreneurs gauge when certain markets were thriving.
It also helps contextualize the age of companies in the current job market, although the correlation heatmap (Chart 14) shows little direct linear relationship between founding year and rating.

#### Chart - 13

In [None]:
# Chart - 13 : Distribution of Revenue Categories
import matplotlib.pyplot as plt

# Count the top 6 most common revenue categories
top_revenues = df["Revenue"].value_counts().nlargest(6)

plt.figure(figsize=(8, 8))
plt.pie(
    top_revenues.values,
    labels=top_revenues.index,
    autopct="%1.1f%%",
    startangle=140
)
plt.title("Distribution of Top 6 Revenue Categories", fontsize=14)
plt.axis("equal")  # Equal aspect ratio ensures the pie is drawn as a circle.
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart visually represents proportions of a whole. Since Revenue is a categorical feature, a pie chart is a quick way to see which revenue ranges dominate among the top categories.

##### 2. What is/are the insight(s) found from the chart?

You can identify which revenue brackets are most common. For instance, you might see a large portion of companies reporting “Unknown / Non-Applicable” revenue, or a concentration in mid-range revenues (e.g., $100 million to $500 million).
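
Because `plt.pie` normalizes whatever values it receives, the percentages of a top 6 pie are shares of those six categories only, not of all listings. One option is to add an "Other" slice for the remainder, sketched here with a hypothetical helper `top_n_with_other` and toy labels.

```python
import pandas as pd

def top_n_with_other(values: pd.Series, n: int, other_label: str = "Other") -> pd.Series:
    """Counts of the n most frequent categories plus one slice for the rest."""
    counts = values.value_counts()
    top = counts.nlargest(n)
    remainder = counts.sum() - top.sum()
    if remainder > 0:
        top = pd.concat([top, pd.Series({other_label: remainder})])
    return top

# Toy revenue labels (illustrative only)
revenue = pd.Series(["A"] * 5 + ["B"] * 3 + ["C", "D"])
slices = top_n_with_other(revenue, 2)
print(slices)
```

Passing `slices.values` and `slices.index` to `plt.pie` would then give percentages that refer to the whole dataset.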

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the dominant revenue brackets can help job seekers gauge the financial scale of potential employers. It also helps businesses see where they stand in relation to others in the market and tailor their recruitment or expansion strategies accordingly.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

import matplotlib.pyplot as plt
import seaborn as sns

# Filter dataset to only numeric columns for correlation
numeric_df = df[["Rating", "Founded"]]  # Only these two columns are numeric in the dataset

plt.figure(figsize=(6, 4))
corr = numeric_df.corr()
sns.heatmap(corr, annot=True, cmap="Blues", fmt=".2f")
plt.title("Correlation Heatmap of Numeric Features", fontsize=14)
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap is an effective way to visualize the relationship (correlation) between multiple numeric features in the dataset. Although we only have two numeric columns (Rating and Founded), this still shows whether there is a linear relationship between them.

##### 2. What is/are the insight(s) found from the chart?

The correlation between Rating and Founded is near zero, indicating little to no linear relationship between a company's founding year and its Glassdoor rating.
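
A correlation computed on columns that still contain `-1` placeholders may be distorted, since rows where both values are `-1` act as extreme points. The toy example below illustrates the effect; the numbers are invented, and whether the real data behaves this way would need to be checked.

```python
import pandas as pd

# Toy data: genuine rows are uncorrelated, two rows carry -1 in both columns
toy = pd.DataFrame({
    "Rating":  [4.0, 3.0, 3.0, 4.0, -1.0, -1.0],
    "Founded": [1950, 1970, 1990, 2010, -1, -1],
})

# Pearson correlation including the placeholder rows
raw_corr = toy["Rating"].corr(toy["Founded"])

# Pearson correlation on genuine values only
valid = toy[(toy["Rating"] >= 0) & (toy["Founded"] > 0)]
clean_corr = valid["Rating"].corr(valid["Founded"])

print(f"with placeholders: {raw_corr:.2f}, without: {clean_corr:.2f}")
```

Comparing both numbers on the real data shows how much of the reported correlation comes from the placeholders.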

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns

# Using the same numeric columns for pairplot
sns.pairplot(numeric_df, diag_kind="kde", corner=True)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot shows pairwise relationships and distributions of numeric variables. Even though we only have two numeric columns, it helps illustrate both their joint distribution and individual distributions.

##### 2. What is/are the insight(s) found from the chart?

The diagonal plots show the distribution of each numeric feature (Rating and Founded).
The off-diagonal plot shows the scatter relationship between the two columns, confirming minimal correlation.

#### Chart - 16

In [None]:
#Chart-16: Top 10 Most Common Job Titles
import matplotlib.pyplot as plt
import seaborn as sns

# Count the occurrences of each Job Title and select the top 10
top_job_titles = df["Job Title"].value_counts().nlargest(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_job_titles.values, y=top_job_titles.index, palette="rocket")
plt.title("Top 10 Most Common Job Titles", fontsize=14)
plt.xlabel("Count", fontsize=12)
plt.ylabel("Job Title", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart is ideal for ranking and comparing the frequency of different job titles.

##### 2. What is/are the insight(s) found from the chart?

It reveals which job titles appear most often in the listings, indicating the roles employers are hiring for most and the current trends in job roles.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Job seekers can target roles with higher availability, while companies can benchmark their recruitment focus.
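
Raw title strings can split one role across several spellings, which lowers its count in a ranking like the one above. The following sketch, on a made-up series of titles, shows one possible way to normalize case and whitespace before counting. Whether the real `Job Title` column needs this step is an assumption to verify.

```python
import pandas as pd

# Hypothetical titles; the real column may or may not contain such variants.
titles = pd.Series([
    "Data Scientist", "data scientist ", "Data Engineer",
    "Data Scientist", "Data Analyst", "DATA ENGINEER",
])

# Normalize surrounding whitespace and letter case so variants count together.
normalized = titles.str.strip().str.title()

# Absolute counts and relative shares of each normalized title.
counts = normalized.value_counts()
shares = normalized.value_counts(normalize=True).round(2)

print(counts.head(10))
print(shares.head(10))
```

Note that `str.title()` also rewrites acronyms (for example, "ML" becomes "Ml"), so a real cleanup would need a more careful mapping.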

#### Chart - 17

In [None]:
#Chart-17: Average Rating by Industry
import matplotlib.pyplot as plt
import seaborn as sns

# Compute average rating by industry and select the top 15
avg_rating_industry = df.groupby("Industry")["Rating"].mean().sort_values(ascending=False).head(15)

# Plot the bar chart
plt.figure(figsize=(12, 6))
sns.barplot(x=avg_rating_industry.values, y=avg_rating_industry.index, palette="coolwarm")

# Titles and labels
plt.title("Top 15 Industries by Average Company Rating", fontsize=14)
plt.xlabel("Average Rating", fontsize=12)
plt.ylabel("Industry", fontsize=12)

# Show the chart
plt.show()




##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart because it clearly compares average company ratings across industries, and it keeps long industry names readable on the y-axis.

##### 2. What is/are the insight(s) found from the chart?

Among the 15 industries shown, Publishing has the highest average rating, while Accounting, Aerospace & Defense, Consulting, and Logistics & Supply Chain rank lowest. Note that this chart ranks industries by average rating, not by number of job listings, so it does not show which industries have the most openings.

##### 3. Will the gained insights help creating a positive business impact?

##### Are there any insights that lead to negative growth? Justify with specific reason.

Companies can benchmark themselves against industry averages and focus on improving employee satisfaction where needed.
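
An average over very few listings can place an industry at the top of such a ranking by chance. As a hedged sketch, the snippet below reports the listing count next to each mean and filters on a minimum count. The industry names, ratings, and the `MIN_LISTINGS` threshold are all hypothetical.

```python
import pandas as pd

# Made-up data: "Industry A" has a single, very high rating.
toy = pd.DataFrame({
    "Industry": ["Industry A", "Industry B", "Industry B", "Industry B",
                 "Industry C", "Industry C"],
    "Rating": [5.0, 3.8, 4.0, 3.6, 4.2, 4.4],
})

MIN_LISTINGS = 2  # hypothetical threshold; tune it to the real dataset

# Compute the mean rating together with the number of listings behind it.
summary = (
    toy.groupby("Industry")["Rating"]
    .agg(avg_rating="mean", listings="count")
    .sort_values("avg_rating", ascending=False)
)

# Keep only industries with enough listings for the mean to be meaningful.
reliable = summary[summary["listings"] >= MIN_LISTINGS]

print(summary)
print(reliable)
```

In the toy data, the single-listing industry leads the unfiltered ranking but drops out once the threshold is applied.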

#### Chart - 18

In [None]:
#Chart 18 - Distribution of Founded Years by Top 10 Industries
import pandas as pd

# Identify top 10 industries by number of listings
top10_industries = df["Industry"].value_counts().nlargest(10).index
df_top10 = df[df["Industry"].isin(top10_industries)]

plt.figure(figsize=(14, 8))
sns.boxplot(x="Industry", y="Founded", data=df_top10, palette="pastel")
plt.title("Distribution of Founded Years by Top 10 Industries", fontsize=14)
plt.xlabel("Industry", fontsize=12)
plt.ylabel("Year Founded", fontsize=12)
plt.xticks(rotation=45, ha="right")
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot grouped by industry illustrates the range and median of founding years, revealing the age diversity within top industries.

##### 2. What is/are the insight(s) found from the chart?

It shows which industries are dominated by older, established companies and which have more startups.

##### 3. Will the gained insights help creating a positive business impact?
##### Are there any insights that lead to negative growth? Justify with specific reason.

Investors and job seekers can use this information to assess market maturity and innovation levels across industries.
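
The numbers behind a boxplot can be tabulated directly, which makes it easier to compare industries precisely. This is a minimal sketch on made-up data. It assumes that non-positive `Founded` values such as `-1` mean "unknown", which should be checked against the real column.

```python
import pandas as pd

# Made-up listings; -1 stands in for an unknown founding year (assumption).
toy = pd.DataFrame({
    "Industry": ["A", "A", "B", "B", "B"],
    "Founded": [1985, -1, 2010, 2015, 1999],
})

# Replace non-positive years with NaN so they do not stretch the distribution.
founded = toy["Founded"].where(toy["Founded"] > 0)

# Per-industry summary: the statistics a boxplot draws, in table form.
stats = founded.groupby(toy["Industry"]).agg(["min", "median", "max", "count"])

print(stats)
```

The `count` column shows how many valid years each industry contributes, which helps judge how reliable each box is.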

#### Chart - 19

In [None]:
# Chart 19 : Distribution of Number of Competitors
import numpy as np

# Define a function to count competitors
def count_competitors(x):
    if x == "-1" or pd.isna(x):
        return 0
    else:
        # Split by comma and count
        return len(x.split(','))

# Create a new column for the number of competitors
df["Num_Competitors"] = df["Competitors"].apply(count_competitors)

plt.figure(figsize=(10, 6))
sns.histplot(df["Num_Competitors"], bins=np.arange(0, df["Num_Competitors"].max()+2)-0.5, color="teal", kde=False)
plt.title("Distribution of Number of Competitors", fontsize=14)
plt.xlabel("Number of Competitors", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

A histogram effectively displays the distribution of the number of competitors, clarifying market competitiveness.

##### 2. What is/are the insight(s) found from the chart?

It shows that many companies have no listed competitors (or a value of “-1”), while others face varying degrees of market competition.

##### 3. Will the gained insights help creating a positive business impact?
##### Are there any insights that lead to negative growth? Justify with specific reason.

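Yes, with a caveat. Companies and job seekers can use the competitor counts to gauge how crowded a market is. However, many listings have no competitor data (a value of “-1”), so a count of zero reflects missing information rather than an absence of competition, and conclusions drawn from those rows could be misleading.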
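
The counting step can be made more defensive. The variant below is a sketch, not the notebook's original function: it additionally tolerates a numeric `-1`, empty strings, and trailing commas, all of which are assumptions about what the `Competitors` column might contain.

```python
import pandas as pd


def count_competitors(value):
    """Return the number of comma-separated competitor names in `value`."""
    # Treat None/NaN as "no competitors listed".
    if pd.isna(value):
        return 0
    # Convert to text so a numeric -1 is handled like the string "-1".
    text = str(value).strip()
    if text in ("", "-1"):
        return 0
    # Ignore empty pieces, e.g. those produced by a trailing comma.
    return len([name for name in text.split(",") if name.strip()])


# Made-up examples covering the edge cases above.
examples = pd.Series(["-1", "Acme, Globex", None, "Initech,", "A, B, C", -1])
result = examples.apply(count_competitors)
print(result.tolist())
```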

#### Chart - 20

In [None]:
# Chart-20: Count of Companies by Ownership Type Across Top 5 Sectors
# Identify top 5 sectors by number of listings
top5_sectors = df["Sector"].value_counts().nlargest(5).index
df_top5 = df[df["Sector"].isin(top5_sectors)]

plt.figure(figsize=(12, 6))
sns.countplot(x="Sector", hue="Type of ownership", data=df_top5, palette="Set1")
plt.title("Count of Companies by Ownership Type in Top 5 Sectors", fontsize=14)
plt.xlabel("Sector", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(rotation=45, ha="right")
plt.legend(title="Type of Ownership")
plt.show()


##### 1. Why did you pick the specific chart?

A grouped bar chart (using hue) provides a side-by-side comparison of ownership types within major sectors.

##### 2. What is/are the insight(s) found from the chart?

It reveals the distribution of different ownership structures (e.g., private, public, government) in the sectors with the most activity.

##### 3. Will the gained insights help creating a positive business impact?
##### Are there any insights that lead to negative growth? Justify with specific reason.

This helps job seekers understand which sectors lean towards certain ownership models, influencing work culture and benefits.
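
Raw counts make large sectors dominate the picture. To compare the ownership mix between sectors of different sizes, one option is a row-normalized cross-tabulation. The sketch below uses made-up rows with the same column names as the dataset.

```python
import pandas as pd

# Made-up listings using the dataset's column names.
toy = pd.DataFrame({
    "Sector": ["IT", "IT", "IT", "IT", "Health", "Health"],
    "Type of ownership": ["Private", "Private", "Public", "Private",
                          "Public", "Nonprofit"],
})

# normalize="index" turns each row into shares that sum to 1 per sector.
shares = pd.crosstab(toy["Sector"], toy["Type of ownership"], normalize="index")

print(shares.round(2))
```

Each row then reads as "the fraction of this sector's listings that belong to each ownership type".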

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To help the client achieve its **business objectives**, I recommend the following strategies:  

---

### **1. Develop an Analytical Framework**  
✅ **Suggestion:** Implement **Natural Language Processing (NLP)** and **Sentiment Analysis** models to systematically analyze Glassdoor reviews.  
- Use AI/ML techniques to categorize reviews by **sentiment (positive, neutral, negative)** and **themes (compensation, work-life balance, culture, leadership, etc.).**  
- Leverage **data visualization tools** (Power BI, Tableau) for clear insights.  

**🎯 Expected Outcome:** Structured data analysis that enables easy trend identification.  
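
To make the categorization idea concrete, here is a deliberately simple keyword-based sketch. It is only an illustration: the word lists, theme names, and the `classify` helper are hypothetical, and a production system would more likely use a trained sentiment model. The dataset analyzed in this notebook also contains job listings rather than review text, so the example reviews are made up.

```python
import re

# Hypothetical keyword lists; a real system would learn these from data.
POSITIVE = {"great", "supportive", "flexible", "fair"}
NEGATIVE = {"poor", "toxic", "underpaid", "stressful"}
THEMES = {
    "compensation": {"salary", "pay", "underpaid", "bonus"},
    "work-life balance": {"hours", "flexible", "overtime"},
    "leadership": {"management", "manager", "leadership"},
}


def classify(review):
    """Return (sentiment, themes) for one review using keyword matching."""
    words = set(re.findall(r"[a-z]+", review.lower()))
    # Sentiment: positive keyword hits minus negative keyword hits.
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        sentiment = "positive"
    elif score < 0:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    # Themes: every theme with at least one keyword in the review.
    themes = sorted(name for name, keys in THEMES.items() if words & keys)
    return sentiment, themes


for text in ["Great team and flexible hours",
             "Poor management, and I felt underpaid",
             "The office is downtown"]:
    print(text, "->", classify(text))
```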

---

### **2. Extract Actionable Insights**  
✅ **Suggestion:** Develop a **dashboard with key performance indicators (KPIs)** for HR teams.  
- KPIs include **employee satisfaction score, review frequency, common complaints, and trends over time.**  
- Benchmark ratings against industry competitors to understand relative standing.  
- Use **topic modeling** (e.g., LDA, or embedding-based approaches built on BERT or GPT-style models) to extract key themes from large review datasets.  

**🎯 Expected Outcome:** Quick access to crucial HR and branding insights.  

---

### **3. Enhance Employer Branding**  
✅ **Suggestion:** Develop a **proactive reputation management strategy** by addressing common employee concerns.  
- Respond to Glassdoor reviews professionally and transparently.  
- Highlight positive reviews in recruitment marketing campaigns.  
- Implement **employee advocacy programs** to encourage positive reviews from satisfied employees.  
- Identify specific areas of improvement (e.g., salary, leadership, work culture) and promote corrective actions.  

**🎯 Expected Outcome:** Improved employer reputation and enhanced talent attraction.  

---

### **4. Improve Decision-Making**  
✅ **Suggestion:** Use **predictive analytics and AI-driven insights** to make HR decisions more data-driven.  
- Develop a **predictive churn model** to identify employees at risk of leaving based on sentiment analysis.  
- Set up automated **real-time alerts** for sudden drops in ratings or recurring complaints.  
- Conduct **A/B testing on HR policies** (e.g., flexible work hours, bonus structures) to measure impact.  

**🎯 Expected Outcome:** Proactive HR interventions leading to higher retention and employee satisfaction.  
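
The alerting idea can be sketched with a rolling baseline: flag any month whose average rating falls well below the mean of the preceding months. The monthly figures, the `WINDOW` size, and the `DROP` threshold below are hypothetical values chosen for illustration.

```python
import pandas as pd

# Made-up monthly average ratings for one employer.
monthly = pd.Series(
    [4.1, 4.0, 4.2, 4.1, 3.4, 3.3],
    index=["2024-01", "2024-02", "2024-03", "2024-04", "2024-05", "2024-06"],
)

WINDOW = 3   # months used for the baseline (assumption)
DROP = 0.5   # minimum fall below the baseline that triggers an alert (assumption)

# Baseline = mean of the previous WINDOW months, excluding the current one.
baseline = monthly.rolling(WINDOW).mean().shift(1)

# Months where the rating is at least DROP below its baseline.
alerts = monthly[(baseline - monthly) >= DROP]

print(alerts)
```

Months without a full baseline window produce `NaN` comparisons and are never flagged, which avoids false alerts at the start of the series.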

---

### **5. Monitor and Optimize Employee Satisfaction**  
✅ **Suggestion:** Implement **continuous feedback loops** to track employee sentiment.  
- Conduct **internal anonymous surveys** to validate external Glassdoor feedback.  
- Create **"Voice of Employee" programs** to encourage direct feedback within the organization.  
- Develop **quarterly HR reports** that compare internal and external employee sentiment trends.  

**🎯 Expected Outcome:** Sustainable improvement in employee morale and long-term satisfaction.  

---

### **Final Recommendation:**  
🔹 The client should **invest in AI-driven analytics**, **develop a strong HR response framework**, and **integrate real-time monitoring tools** to continuously optimize employee satisfaction and employer branding.  


# **Conclusion**

This project has provided a comprehensive visual analysis of the job market dataset, using histograms, bar charts, boxplots, a correlation heatmap, and a pair plot to uncover key trends. We examined company ratings, industry distribution, job listings by location, and company characteristics, and also looked at job description word counts.

Key findings include:
- **Market Trends:** Certain industries and cities dominate job listings, indicating areas of high demand, while the majority of companies are established entities with stable ratings.
- **Company Characteristics:** Visualizations highlighted how company size, type of ownership, and founding year relate to hiring trends and employee ratings, guiding both job seekers and businesses. Founding year, however, shows almost no linear correlation with rating (Chart 14).
- **Competitive Landscape:** Analyses such as the competitor count distribution and revenue category breakdown provide insights into market saturation and financial positioning.
- **Employee Insights:** Boxplots and average rating comparisons offer a window into workplace satisfaction across different sectors and ownership types, suggesting where companies might focus improvement efforts.

Overall, these visualizations not only empower job seekers to make informed career decisions but also enable businesses to benchmark and strategize their hiring processes. The insights derived pave the way for strategic enhancements in recruitment and talent management, ultimately contributing to a positive business impact.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***