# Data Analysis Workflow Example

This notebook demonstrates a typical data analysis workflow in Python using pandas, NumPy, Matplotlib, seaborn, and scikit-learn.

## Workflow Steps:
1. Import libraries
2. Load and explore data
3. Data cleaning and preprocessing
4. Exploratory Data Analysis (EDA)
5. Data visualization
6. Statistical analysis and simple modeling
7. Conclusions

## 1. Import Libraries

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling and evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Configure visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

## 2. Load and Explore Data

For this example, we'll create a synthetic dataset to demonstrate the workflow.

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Create synthetic data
n_samples = 1000

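# Generate features. randint excludes the upper bound, so age spans 18-69
# and experience spans 0-39. The two are drawn independently, so some rows
# pair a young age with more experience than is realistic.
age = np.random.randint(18, 70, n_samples)
experience = np.random.randint(0, 40, n_samples)
education_level = np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples)

# Generate the target (salary) as a linear function of age and experience
# plus Gaussian noise (sd 5,000). Education does not enter the formula.
base_salary = 30000
salary = (base_salary +
          age * 500 +
          experience * 1200 +
          np.random.normal(0, 5000, n_samples))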

# Create DataFrame
df = pd.DataFrame({
    'age': age,
    'experience': experience,
    'education': education_level,
    'salary': salary
})

print("Dataset shape:", df.shape)
df.head(10)

In [None]:
# Display basic information about the dataset
print("\nDataset Info:")
df.info()

In [None]:
# Display statistical summary
print("\nStatistical Summary:")
df.describe()

## 3. Data Cleaning and Preprocessing

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

# Check for duplicates
print(f"\nNumber of duplicate rows: {df.duplicated().sum()}")
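
The synthetic data contains no missing values or duplicates, so the checks above only confirm that. Real data usually needs a response to what the checks find. A minimal sketch of one common approach, assuming a small hypothetical frame `demo` (not the notebook's `df`) and assuming that dropping rows with a missing target and median-imputing a feature is appropriate for the problem at hand:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one duplicated row and two missing values
demo = pd.DataFrame({
    'age':    [25, 40, np.nan, 31, 31, 55],
    'salary': [52000.0, 81000.0, 60000.0, 58000.0, 58000.0, np.nan],
})

cleaned = (
    demo.drop_duplicates()            # drop the repeated (31, 58000) row
        .dropna(subset=['salary'])    # a row without the target cannot be used for training
        .copy()
)

# Fill the remaining gap in a feature with the column median
cleaned['age'] = cleaned['age'].fillna(cleaned['age'].median())

print(cleaned)
print("Remaining missing values:", int(cleaned.isnull().sum().sum()))
```

Median imputation is only one option; whether it is reasonable depends on why the values are missing.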

In [None]:
# Encode education as ordered integers (this treats the gap between
# adjacent levels as equal, which is an assumption)
education_mapping = {
    'High School': 1,
    'Bachelor': 2,
    'Master': 3,
    'PhD': 4
}

df['education_encoded'] = df['education'].map(education_mapping)

print("Education encoding:")
print(df[['education', 'education_encoded']].drop_duplicates().sort_values('education_encoded'))
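
The integer mapping above imposes an order and equal spacing on the education levels. When that assumption is not wanted, one-hot encoding is a common alternative. A minimal sketch using `pd.get_dummies` on a small hypothetical frame `demo` (the `edu` prefix and the choice to drop the first level are illustrative, not part of the original workflow):

```python
import pandas as pd

# Small illustrative frame (hypothetical values, not the notebook's data)
demo = pd.DataFrame({
    'education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor'],
})

# Fix the category order so the indicator columns come out in a predictable order
levels = ['High School', 'Bachelor', 'Master', 'PhD']
demo['education'] = pd.Categorical(demo['education'], categories=levels)

# One indicator column per level; drop_first removes the 'High School' column
# so the indicators are not perfectly collinear with a model intercept
dummies = pd.get_dummies(demo['education'], prefix='edu', drop_first=True, dtype=int)

print(dummies)
```

With this encoding, each coefficient in a linear model would be read as the difference from the dropped baseline level.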

## 4. Exploratory Data Analysis (EDA)

In [None]:
# Distribution of education levels
print("Education distribution:")
print(df['education'].value_counts())

# Correlation analysis
print("\nCorrelation with salary:")
numeric_cols = ['age', 'experience', 'education_encoded', 'salary']
print(df[numeric_cols].corr()['salary'].sort_values(ascending=False))
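
Because the data is synthetic, the correlations printed above can be checked against what the generating formula implies. The sketch below works this out by hand, assuming independent features and the variance of a discrete uniform integer, (n² - 1) / 12 for n equally likely values. The helper name `discrete_uniform_var` is introduced here for illustration. The result is roughly 0.84 for experience and 0.45 for age; education was sampled independently of salary, so its expected correlation is zero.

```python
import math

# Coefficients and noise level from the generating formula
coef_age, coef_exp, noise_sd = 500, 1200, 5000

def discrete_uniform_var(low, high):
    """Variance of an integer drawn uniformly from low..high-1 (randint's convention)."""
    n = high - low
    return (n**2 - 1) / 12

var_age = discrete_uniform_var(18, 70)
var_exp = discrete_uniform_var(0, 40)

# The three terms are independent, so their variances add
var_salary = coef_age**2 * var_age + coef_exp**2 * var_exp + noise_sd**2
sd_salary = math.sqrt(var_salary)

# corr(x, salary) = coefficient * sd(x) / sd(salary) for an independent term
rho_age = coef_age * math.sqrt(var_age) / sd_salary
rho_exp = coef_exp * math.sqrt(var_exp) / sd_salary

print(f"Expected corr(age, salary):        {rho_age:.2f}")
print(f"Expected corr(experience, salary): {rho_exp:.2f}")
```

Sample correlations from 1,000 rows will differ slightly from these values because of sampling noise.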

## 5. Data Visualization

In [None]:
# Create a figure with multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Salary distribution
axes[0, 0].hist(df['salary'], bins=30, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Salary')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Salary Distribution')
axes[0, 0].axvline(df['salary'].mean(), color='red', linestyle='--', label=f'Mean: ${df["salary"].mean():,.0f}')
axes[0, 0].legend()

# 2. Salary vs Experience
axes[0, 1].scatter(df['experience'], df['salary'], alpha=0.5)
axes[0, 1].set_xlabel('Years of Experience')
axes[0, 1].set_ylabel('Salary')
axes[0, 1].set_title('Salary vs Experience')

# 3. Salary vs Age
axes[1, 0].scatter(df['age'], df['salary'], alpha=0.5, color='green')
axes[1, 0].set_xlabel('Age')
axes[1, 0].set_ylabel('Salary')
axes[1, 0].set_title('Salary vs Age')

# 4. Salary by Education Level
df.boxplot(column='salary', by='education', ax=axes[1, 1])
axes[1, 1].set_xlabel('Education Level')
axes[1, 1].set_ylabel('Salary')
axes[1, 1].set_title('Salary Distribution by Education Level')
plt.suptitle('')  # Remove the automatic 'Boxplot grouped by education' suptitle that pandas adds

plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 8))
correlation_matrix = df[numeric_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap')
plt.show()

In [None]:
# Pairplot for quick visual analysis
sns.pairplot(df[numeric_cols], diag_kind='kde', corner=True)
plt.suptitle('Pairplot of Numeric Features', y=1.02)
plt.show()

## 6. Statistical Analysis and Simple Modeling

In [None]:
# Prepare data for modeling
X = df[['age', 'experience', 'education_encoded']]
y = df['salary']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")

In [None]:
# Train a simple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Model Performance:")
print(f"Root Mean Squared Error: ${rmse:,.2f}")
print(f"R² Score: {r2:.4f}")

# Display feature coefficients
print("\nFeature Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: ${coef:,.2f}")
print(f"Intercept: ${model.intercept_:,.2f}")
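
With synthetic data the true coefficients are known, so the fitted values can be compared against them as a sanity check. A minimal, self-contained sketch under some assumptions: it regenerates data with the same formula but uses NumPy's `default_rng` rather than the notebook's `np.random.seed`, fits on all rows without a train/test split, and leaves education out because it does not enter the formula. The names `rng`, `fit`, and `residual_sd` are choices made for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regenerate data with the notebook's formula (Generator API used for this sketch)
rng = np.random.default_rng(42)
n = 1000
age = rng.integers(18, 70, n)
experience = rng.integers(0, 40, n)
salary = 30000 + 500 * age + 1200 * experience + rng.normal(0, 5000, n)

# Fit salary on the two features that actually drive it
X = np.column_stack([age, experience])
fit = LinearRegression().fit(X, salary)

# Residual standard deviation, adjusting for the 3 estimated parameters
residual_sd = np.std(salary - fit.predict(X), ddof=X.shape[1] + 1)

print(f"age coefficient:        {fit.coef_[0]:,.0f}  (true value 500)")
print(f"experience coefficient: {fit.coef_[1]:,.0f}  (true value 1,200)")
print(f"intercept:              {fit.intercept_:,.0f}  (true value 30,000)")
print(f"residual sd:            {residual_sd:,.0f}  (true noise sd 5,000)")
```

The RMSE reported above should likewise land near the noise standard deviation of 5,000, since that is the part of salary no model can explain.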

In [None]:
# Visualize predictions vs actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Salary')
plt.ylabel('Predicted Salary')
plt.title('Actual vs Predicted Salary')
plt.tight_layout()
plt.show()

In [None]:
# Residual plot
residuals = y_test - y_pred

plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Salary')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.tight_layout()
plt.show()

## 7. Conclusions

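This notebook walked through a complete data analysis workflow on a synthetic dataset, so the findings below reflect how the data was generated rather than any real labor market.

### Key Findings:
- Experience shows a strong positive correlation with salary, consistent with its weight of 1,200 per year in the generating formula
- Age also correlates positively with salary, but more weakly (500 per year)
- Education level has no real effect on salary in this dataset: it was sampled independently of salary, so its correlation and model coefficient should be close to zero, and any differences between the boxplots are sampling noise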

### Workflow Summary:
1. **Data Loading**: Created a synthetic dataset for demonstration
2. **Data Exploration**: Examined data structure, types, and basic statistics
3. **Data Cleaning**: Checked for missing values and duplicates
4. **Feature Engineering**: Encoded categorical variables
5. **Visualization**: Created multiple plots to understand relationships
6. **Modeling**: Built a simple linear regression model
7. **Evaluation**: Assessed model performance using RMSE and R² metrics

### Next Steps:
- Try more advanced models (Random Forest, Gradient Boosting)
- Feature selection and engineering
- Cross-validation for better model evaluation
- Hyperparameter tuning
- Deploy the model for predictions
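
As a starting point for the cross-validation item above, here is a minimal sketch using scikit-learn's `KFold` and `cross_val_score`. It is self-contained and regenerates data with the notebook's formula; the seed, the 5-fold setup, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Regenerate synthetic data with the notebook's formula (seed chosen for this sketch)
rng = np.random.default_rng(0)
n = 1000
age = rng.integers(18, 70, n)
experience = rng.integers(0, 40, n)
y = 30000 + 500 * age + 1200 * experience + rng.normal(0, 5000, n)
X = np.column_stack([age, experience])

# Shuffle before splitting so each fold is a random sample of the rows
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring='r2')

print("R² per fold:", np.round(scores, 3))
print(f"Mean R²: {scores.mean():.3f} ± {scores.std():.3f}")
```

Compared with a single 80/20 split, the spread across folds gives a sense of how much the score depends on which rows end up in the test set.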