# AmbitionBox: Comprehensive EDA Master Notebook

This notebook consolidates all exploratory data analysis steps for the AmbitionBox 10k Companies dataset.

### Structure:
1. **Univariate Analysis**: Individual feature distributions.
2. **Bivariate Analysis**: Relationships and correlations.
3. **Deep Dive & Hidden Insights**: Non-obvious patterns and strategic findings.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

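# Silence only FutureWarning noise; other warnings (e.g. dtype or parsing issues) stay visible
warnings.filterwarnings('ignore', category=FutureWarning)
sns.set_theme(style="whitegrid")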

# Load Data
df = pd.read_csv('ambitionbox_cleaned.csv')
print(f"Dataset Loaded: {df.shape[0]} companies, {df.shape[1]} features.")

## Part 1: Univariate Analysis
Understanding the distribution of core metrics.

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(df['ratings'], bins=20, kde=True, color='royalblue')
plt.title('Distribution of Employee Ratings')
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
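# Sort ascending so the largest industry is drawn at the top of the horizontal bar chart
df['industry'].value_counts().head(10).sort_values().plot(kind='barh', color='seagreen')
plt.title('Top 10 Industries by Company Count')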
plt.show()

## Part 2: Bivariate Analysis
Exploring connections between metrics.

In [None]:
plt.figure(figsize=(10, 8))
numeric_df = df.select_dtypes(include=[np.number])
sns.heatmap(numeric_df.corr(), annot=True, cmap='mako', fmt='.2f')
plt.title('Correlation Matrix of Company Metrics')
plt.show()
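
A heatmap of many columns can be hard to scan for the strongest relationships. The following is a minimal sketch of one way to list the most correlated column pairs as a ranked table. The helper name `top_correlated_pairs` and the toy frame (including the `salaries` column name) are illustrative assumptions, not part of the dataset schema.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes(include=[np.number]).corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Rank by magnitude so strong negative correlations are not pushed to the bottom
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy frame for illustration only; 'salaries' is a hypothetical column name
demo = pd.DataFrame({
    "ratings": [3.5, 4.0, 4.2, 3.9, 4.6],
    "reviews": [100, 200, 300, 400, 500],
    "salaries": [110, 210, 290, 405, 520],
    "name": ["a", "b", "c", "d", "e"],
})
print(top_correlated_pairs(demo, n=2))
```

In the notebook this would be called as `top_correlated_pairs(df)`; non-numeric columns such as `name` are ignored by the dtype filter.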

In [None]:
top_industries = df['industry'].value_counts().head(15).index
plt.figure(figsize=(12, 6))
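# Bars show the mean rating per industry; seaborn adds a confidence-interval error bar by default.
# hue/legend=False avoids the palette-without-hue deprecation (assumes seaborn >= 0.13).
sns.barplot(data=df[df['industry'].isin(top_industries)], x='industry', y='ratings',
            hue='industry', palette='viridis', legend=False)
plt.xticks(rotation=45, ha='right')
plt.title('Mean Rating Across Top 15 Industries')
plt.ylim(3, 5)  # Truncated axis exaggerates gaps between bars; compare values, not bar heights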
plt.show()
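
Because the bar chart uses a truncated y-axis, a numeric table can help judge whether the differences between industries are meaningful. Below is a rough sketch that reports count, mean, and standard deviation of ratings for the most common industries. The helper name `industry_rating_table` and the toy values are assumptions made for illustration.

```python
import pandas as pd


def industry_rating_table(frame: pd.DataFrame, top_n: int = 15) -> pd.DataFrame:
    """Summarize ratings for the top_n most frequent industries."""
    top = frame["industry"].value_counts().head(top_n).index
    subset = frame[frame["industry"].isin(top)]
    table = subset.groupby("industry")["ratings"].agg(["count", "mean", "std"])
    # Highest mean rating first
    return table.sort_values("mean", ascending=False)


# Toy frame for illustration only; values are made up
demo = pd.DataFrame({
    "industry": ["IT Services", "IT Services", "IT Services", "Banking", "Banking", "Retail"],
    "ratings": [4.0, 4.2, 4.4, 3.5, 3.7, 4.9],
})
print(industry_rating_table(demo, top_n=2))
```

A small `count` or a large `std` next to a high mean is a hint that the industry's position in the chart may not be stable.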

## Part 3: Deep Dive & Hidden Insights
Strategic findings for decision support.

In [None]:
# Hidden Gems: highly rated companies (ratings >= 4.2) with a moderate review count (more than 50, fewer than 1000)
hidden_gems = df[(df['ratings'] >= 4.2) & (df['reviews'] < 1000) & (df['reviews'] > 50)]
print(f"Statistical Insight: Found {len(hidden_gems)} Hidden Gem companies.")
hidden_gems[['name', 'industry', 'ratings', 'reviews']].sort_values('ratings', ascending=False).head(10)
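
The thresholds in the filter above are judgment calls, so it can be useful to see how the result changes when they move. This is a small sketch that wraps the same filter in a function with adjustable cut-offs. The function name `find_hidden_gems` and its parameter names are hypothetical, and the toy frame is made up.

```python
import pandas as pd


def find_hidden_gems(frame: pd.DataFrame, min_rating: float = 4.2,
                     min_reviews: int = 50, max_reviews: int = 1000) -> pd.DataFrame:
    """Rows with ratings >= min_rating and min_reviews < reviews < max_reviews."""
    mask = (
        (frame["ratings"] >= min_rating)
        & (frame["reviews"] > min_reviews)
        & (frame["reviews"] < max_reviews)
    )
    # Best-rated first; ties broken by review count
    return frame.loc[mask].sort_values(["ratings", "reviews"], ascending=False)


# Toy frame for illustration only
demo = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "ratings": [4.5, 4.2, 4.8, 4.1],
    "reviews": [300, 50, 5000, 400],
})
print(find_hidden_gems(demo)["name"].tolist())
print(find_hidden_gems(demo, min_rating=4.0)["name"].tolist())
```

Calling it with a few different `min_rating` values gives a quick sense of how sensitive the Hidden Gems count is to the chosen cut-off.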

In [None]:
# Scaling Paradox: ratings by geographic footprint
# The top bin is open-ended (np.inf) so values above 10,000 are not silently turned into NaN
df['location_scale'] = pd.cut(df['more_locations'], bins=[-1, 0, 10, 50, 200, np.inf],
                              labels=['Single', 'Small (1-10)', 'Medium (11-50)', 'Large (51-200)', 'Enterprise (201+)'])

plt.figure(figsize=(12, 6))
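# hue/legend=False avoids the palette-without-hue deprecation (assumes seaborn >= 0.13)
sns.boxplot(data=df, x='location_scale', y='ratings', hue='location_scale', palette='cool', legend=False)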
plt.title('Employee Satisfaction vs. Geographic Footprint (Scaling Index)')
plt.show()
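
Median differences are easier to compare in a table than by eye on a box plot, and group sizes show how much weight each tier deserves. The sketch below uses the same five-tier binning idea with an open-ended top bin and reports count and median rating per tier. The helper name `summarize_by_scale` is an assumption, and the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

SCALE_BINS = [-1, 0, 10, 50, 200, np.inf]
SCALE_LABELS = ["Single", "Small (1-10)", "Medium (11-50)", "Large (51-200)", "Enterprise (201+)"]


def summarize_by_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and median rating per location tier, without modifying the input frame."""
    scale = pd.cut(frame["more_locations"], bins=SCALE_BINS, labels=SCALE_LABELS)
    # observed=False keeps empty tiers in the output so gaps are visible
    return frame.groupby(scale, observed=False)["ratings"].agg(["count", "median"])


# Toy frame for illustration only
demo = pd.DataFrame({
    "more_locations": [0, 0, 5, 20, 30, 100, 500, 20000],
    "ratings": [3.8, 4.0, 4.1, 4.3, 4.5, 3.9, 3.7, 3.6],
})
print(summarize_by_scale(demo))
```

Tiers with very small counts should be read with caution, since a median over a handful of companies can shift with a single row.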

### Master Insights Summary:
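1. **Mid-Market Resilience**: Companies with 11-50 locations often show higher median ratings than the largest multi-location companies, which may point to a 'sweet spot' for work culture.
2. **Industry Giants**: IT Services dominates by company count, but the niche product companies flagged as Hidden Gems (ratings of 4.2 or higher with a moderate number of reviews) lead on employee ratings.
3. **Metric Linkage**: A strong correlation (0.7+) between salary reports and reviews suggests that companies attracting more reviews also attract more salary disclosures, which may indicate maturing digital transparency within these firms. Correlation alone does not establish the cause.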