# **Data Analysis & Insights Assignment**

**Course:** Data Analysis & Insights

**Student Name:** Your Name Here

**Submission Date:** YYYY-MM-DD

---

## **Introduction**

This notebook presents a detailed analysis of a synthetic dataset containing numerical and categorical variables. The goal is to perform:

- **Data Cleaning** (handling missing values, duplicates, and outliers)
- **Exploratory Data Analysis (EDA)** (univariate, bivariate, and multivariate analysis)
- **Visualization and Interpretation**

---

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Set style
sns.set_style('whitegrid')

print('Libraries Imported Successfully!')

## **1. Dataset Creation**

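A synthetic dataset of 200 records is generated with numerical and categorical variables. Data-quality problems are then injected on purpose: 5% of rows lose their Age, Salary, Department, and Experience values, the first five rows are appended again as duplicates, and three salaries are overwritten with extreme values (200000, 220000, 250000).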

In [None]:
# Set random seed
np.random.seed(42)

# Generate dataset
num_records = 200

data = {
    'ID': np.arange(1, num_records + 1),
    'Age': np.random.randint(18, 65, num_records),
    'Salary': np.random.normal(50000, 15000, num_records).astype(int),
    'Department': np.random.choice(['HR', 'IT', 'Finance', 'Marketing'], num_records),
    'Experience': np.random.randint(0, 40, num_records),
    'Satisfaction_Score': np.random.uniform(1, 10, num_records).round(1),
}

df = pd.DataFrame(data)

# Introduce missing values & duplicates
df.loc[df.sample(frac=0.05).index, ['Age', 'Salary', 'Department', 'Experience']] = np.nan
df = pd.concat([df, df.iloc[:5]], ignore_index=True)

# Introduce salary outliers
df.loc[df.sample(3).index, 'Salary'] = [200000, 220000, 250000]

df.head()
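
Before cleaning, it can help to confirm that the injected problems are actually present. The following is a minimal sketch, not part of the original notebook: `audit_frame` is a hypothetical helper, and `toy` is a small stand-in for `df` so the block runs on its own.

```python
import numpy as np
import pandas as pd


def audit_frame(frame: pd.DataFrame) -> dict:
    """Summarise missing values and exact duplicate rows (hypothetical helper)."""
    return {
        "rows": len(frame),
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy frame standing in for the notebook's df (assumption: same kind of columns)
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Age": [25, np.nan, 40, 31],
    "Department": ["HR", "IT", None, "Finance"],
})

# Re-append the first two rows, mirroring how the notebook creates duplicates
toy = pd.concat([toy, toy.iloc[:2]], ignore_index=True)

report = audit_frame(toy)
print(report)
```

Calling `audit_frame(df)` on the notebook's frame before and after Section 2 would give a quick before/after comparison.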

## **2. Data Cleaning**

This section covers:

- Handling missing values
- Removing duplicate records
- Identifying and treating outliers

In [None]:
# Check missing values
print(df.isnull().sum())

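# Fill missing numerical values with the column median
num_cols = ['Age', 'Salary', 'Experience']
for col in num_cols:
    # Assign the result back: fillna(inplace=True) on a selected column
    # triggers a chained-assignment warning in recent pandas and may not update df
    df[col] = df[col].fillna(df[col].median())

# Fill missing categorical values with the mode (most frequent value)
df['Department'] = df['Department'].fillna(df['Department'].mode()[0])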

# Remove duplicate records
df = df.drop_duplicates()

print('Data Cleaning Completed!')
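
The imputation step above can be sketched as a small reusable function. This is an illustrative version only: `impute_median_mode` and the `toy` frame are hypothetical names introduced here, and the function works on a copy so the input is left untouched.

```python
import numpy as np
import pandas as pd


def impute_median_mode(frame, num_cols, cat_cols):
    """Return a copy with numeric NaNs set to the median and categorical NaNs to the mode."""
    out = frame.copy()
    for col in num_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in cat_cols:
        # mode() can return several values on ties; take the first one
        out[col] = out[col].fillna(out[col].mode()[0])
    return out


# Toy data with one missing salary, one missing department, and one extreme salary
toy = pd.DataFrame({
    "Salary": [40000.0, np.nan, 60000.0, 250000.0],
    "Department": ["IT", "IT", None, "HR"],
})

clean = impute_median_mode(toy, num_cols=["Salary"], cat_cols=["Department"])
print(clean)
```

The median is used rather than the mean because it is far less sensitive to the extreme salary in the toy data, which is the same reason it suits the notebook's `Salary` column.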

In [None]:
# Outlier detection using IQR
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Cap outliers at both IQR fences (lower_bound was previously computed but unused)
df['Salary'] = df['Salary'].clip(lower=lower_bound, upper=upper_bound)

print('Outliers Treated!')
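
As a worked illustration of the IQR rule, the sketch below computes the fences for a short salary list and caps values outside them. `iqr_bounds` and the sample salaries are assumptions made for this example, and the multiplier `k=1.5` matches the value used in the cell above.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical salaries with one extreme value at the end
salaries = pd.Series([30000, 45000, 50000, 55000, 60000, 250000])

lower, upper = iqr_bounds(salaries)

# clip() replaces values outside the fences with the fence itself
capped = salaries.clip(lower=lower, upper=upper)

print(f"lower fence: {lower}, upper fence: {upper}")
print(capped.tolist())
```

Capping keeps every record in the dataset, whereas dropping outliers would shrink it; which is preferable depends on whether the extreme values are errors or genuine observations.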

## **3. Exploratory Data Analysis (EDA)**

This section covers:

- **Univariate Analysis** (distribution of individual variables)
- **Bivariate Analysis** (relationships between two variables)
- **Multivariate Analysis** (combined effects of multiple variables)

---

In [None]:
# Summary statistics
print(df.describe())

# Histograms of the numeric columns (ID is an identifier, so it is excluded)
df.drop(columns='ID').hist(figsize=(10, 6), bins=20, edgecolor='black')
plt.show()
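
`describe()` summarises only the numeric columns by default, so the categorical `Department` column needs its own univariate summary. A minimal sketch using `value_counts`, with a hypothetical `toy` frame standing in for `df`:

```python
import pandas as pd

# Toy frame standing in for the notebook's df (assumption)
toy = pd.DataFrame({
    "Department": ["HR", "IT", "IT", "Finance", "IT", "HR"],
})

# Absolute frequency of each category
counts = toy["Department"].value_counts()

# Relative frequency (shares sum to 1)
shares = toy["Department"].value_counts(normalize=True)

print(counts)
print(shares)
```

On the notebook's data, `df['Department'].value_counts()` would show how evenly the four departments are represented after imputation.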

In [None]:
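# Correlation matrix (numeric columns only; Department is categorical
# and makes df.corr() raise an error in pandas 2.x without numeric_only=True)
plt.figure(figsize=(8,6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')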
plt.show()

# Scatter plot: Age vs. Salary
sns.scatterplot(x='Age', y='Salary', data=df, hue='Department')
plt.show()
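
To make the correlation step easier to interpret, the sketch below builds a toy frame in which one column is deliberately linked to `Experience` and another is drawn independently, then computes Pearson correlations on the numeric columns only. The column `Salary_linked`, the seed, and the noise level are assumptions chosen for illustration and do not come from the notebook's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

experience = rng.integers(0, 40, n)

toy = pd.DataFrame({
    "Experience": experience,
    # Drawn independently of Experience
    "Age": rng.integers(18, 65, n),
    # Built from Experience plus noise, so it should correlate strongly
    "Salary_linked": 30000 + 1000 * experience + rng.normal(0, 2000, n),
    # Categorical column, excluded from the correlation below
    "Department": rng.choice(["HR", "IT"], n),
})

# Keep numeric columns only, then compute pairwise Pearson correlations
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

print(corr.round(2))
```

A strong correlation appears only where a dependency was built into the data; columns generated independently, as in Section 1, should show values near zero.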

In [None]:
# Pair Plot
sns.pairplot(df, hue='Department')
plt.show()
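
A pair plot is a visual summary; a grouped table can back it up with numbers. The following sketch aggregates several variables per department at once. The `toy` values are invented for the example, and the output column names (`mean_salary`, `mean_satisfaction`, `n`) are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the notebook's df (values are made up)
toy = pd.DataFrame({
    "Department": ["HR", "HR", "IT", "IT", "Finance"],
    "Salary": [40000, 50000, 60000, 80000, 55000],
    "Satisfaction_Score": [6.0, 8.0, 5.0, 7.0, 9.0],
})

# Named aggregation: one output column per (input column, function) pair
summary = toy.groupby("Department").agg(
    mean_salary=("Salary", "mean"),
    mean_satisfaction=("Satisfaction_Score", "mean"),
    n=("Salary", "count"),
)

print(summary)
```

Running the same aggregation on `df` would show whether the departmental differences seen in the plots are large relative to the group sizes.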

## **4. Conclusion**

### **Key Findings:**

- Missing values were handled using median/mode imputation.
- Duplicate records were removed to ensure data integrity.
- Outliers in salary were detected using the IQR method and capped.
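- **EDA interpretation:**
  - Every variable was drawn independently in Section 1, so salary distributions are expected to be similar across departments; gaps visible in the plots reflect sampling noise rather than a real effect.
  - For the same reason, Experience and Salary should show a correlation close to zero in the heatmap, not a strong positive one.
  - Satisfaction scores are drawn uniformly between 1 and 10 regardless of department, so no department is expected to score systematically higher.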

---

**This notebook is ready for submission.**

---