In [11]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency, ttest_ind

%matplotlib inline

In [40]:
df = pd.read_csv("HR_comma_sep.csv")
df

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Department,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.80,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low
...,...,...,...,...,...,...,...,...,...,...
14994,0.40,0.57,2,151,3,0,1,0,support,low
14995,0.37,0.48,2,160,3,0,1,0,support,low
14996,0.37,0.53,2,143,3,0,1,0,support,low
14997,0.11,0.96,6,280,4,0,1,0,support,low


If the dataset contains several text (categorical) columns, each one must be encoded as numbers before modeling. Here's a general approach to preprocessing such a dataset:

1. **Label (Ordinal) Encoding**: For categorical variables with a natural order (ordinal data); each category maps to an integer that follows that order.
2. **One-Hot Encoding**: For categorical variables without a natural order (nominal data).

### Example Dataset
Let's consider an extended example dataset:

```plaintext
satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,Department,salary,education,gender
0.38,0.53,2,157,3,0,1,0,sales,low,bachelor,male
0.8,0.86,5,262,6,0,1,0,sales,medium,master,female
0.11,0.88,7,272,4,0,1,0,sales,medium,phd,male
0.72,0.87,5,223,5,0,1,0,technical,low,bachelor,female
0.37,0.52,2,159,3,0,1,0,IT,low,master,male
```

### Preprocessing Steps
1. **Identify Categorical and Numerical Columns**:
    - Categorical: `Department`, `salary`, `education`, `gender`
    - Numerical: `satisfaction_level`, `last_evaluation`, `number_project`, `average_montly_hours`, `time_spend_company`
    - Binary flags (already 0/1, left unscaled): `Work_accident`, `left`, `promotion_last_5years`

2. **Encode Categorical Variables**:
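    - Encode ordinal data like `salary` as integers that follow its natural order (`low` < `medium` < `high`). `LabelEncoder` assigns codes alphabetically (`high`=0, `low`=1, `medium`=2), so it matches that order only by coincidence; an explicit mapping is safer.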
    - Use `pd.get_dummies` for nominal data like `Department`, `education`, `gender`.

3. **Standardize Numerical Features**:
    - Apply `StandardScaler` to numerical columns.
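
Step 1 can also be done programmatically instead of by reading the header. A minimal sketch using `select_dtypes`; the three-row frame and the variable names are hypothetical stand-ins, not part of the original dataset code:

```python
import pandas as pd

# Hypothetical three-row stand-in for the dataset above
sample = pd.DataFrame({
    'satisfaction_level': [0.38, 0.80, 0.11],
    'number_project': [2, 5, 7],
    'Department': ['sales', 'sales', 'technical'],
    'salary': ['low', 'medium', 'medium'],
})

# Split columns into non-numeric (text/categorical) and numeric
categorical_cols = sample.select_dtypes(exclude='number').columns.tolist()
numerical_cols = sample.select_dtypes(include='number').columns.tolist()

print(categorical_cols)
print(numerical_cols)
```

Integer-coded flags such as `Work_accident` would land in the numeric list, so the split still needs a manual check.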

### Preprocessing Code

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Define the extended data
data = {
    'satisfaction_level': [0.38, 0.8, 0.11, 0.72, 0.37],
    'last_evaluation': [0.53, 0.86, 0.88, 0.87, 0.52],
    'number_project': [2, 5, 7, 5, 2],
    'average_montly_hours': [157, 262, 272, 223, 159],
    'time_spend_company': [3, 6, 4, 5, 3],
    'Work_accident': [0, 0, 0, 0, 0],
    'left': [1, 1, 1, 1, 1],
    'promotion_last_5years': [0, 0, 0, 0, 0],
    'Department': ['sales', 'sales', 'sales', 'technical', 'IT'],
    'salary': ['low', 'medium', 'medium', 'low', 'low'],
    'education': ['bachelor', 'master', 'phd', 'bachelor', 'master'],
    'gender': ['male', 'female', 'male', 'female', 'male']
}

# Create the DataFrame
df = pd.DataFrame(data)

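# Encode 'salary' with an explicit order; LabelEncoder would sort the labels
# alphabetically (high=0, low=1, medium=2) instead of by rank
salary_order = {'low': 0, 'medium': 1, 'high': 2}
df['salary'] = df['salary'].map(salary_order)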

# One-hot encode nominal categorical columns
df = pd.get_dummies(df, columns=['Department', 'education', 'gender'], drop_first=True)

# Standardize numerical features
scaler = StandardScaler()
numerical_features = ['satisfaction_level', 'last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company']
df[numerical_features] = scaler.fit_transform(df[numerical_features])

# Display the preprocessed DataFrame
df
```

### Explanation:

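1. **Ordinal Encoding for `salary`**:
   - `low`, `medium`, `high` have a natural order, so they should become integers that preserve it (0, 1, 2). `LabelEncoder` assigns codes alphabetically (`high`=0, `low`=1, `medium`=2), which agrees with the ranking only when `high` is absent, as in this five-row sample. An explicit mapping, or `OrdinalEncoder` with a stated category order, avoids the problem.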

2. **One-Hot Encoding for Nominal Categorical Columns**:
   - `pd.get_dummies` is used for `Department`, `education`, and `gender` to avoid any implicit ordinal relationship.
   - `drop_first=True` drops the first category of each variable, which is implied when all of its other indicators are 0; this avoids perfect multicollinearity in linear models.

3. **Standardization**:
   - `StandardScaler` is applied to numerical columns to scale them to have a mean of 0 and a standard deviation of 1.
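
The difference between alphabetical and rank-preserving codes in point 1 can be checked directly. This is a small sketch, assuming scikit-learn is installed; the four-value `salary` frame is made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical column containing all three salary levels
salary = pd.DataFrame({'salary': ['low', 'medium', 'high', 'low']})

# LabelEncoder sorts classes alphabetically: high=0, low=1, medium=2
alphabetical_codes = LabelEncoder().fit_transform(salary['salary'])

# OrdinalEncoder with an explicit category list keeps the intended ranking
ordinal = OrdinalEncoder(categories=[['low', 'medium', 'high']])
ranked_codes = ordinal.fit_transform(salary[['salary']]).ravel()

print(alphabetical_codes)  # [1 2 0 1]
print(ranked_codes)        # [0. 1. 2. 0.]
```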

This approach ensures that all categorical data is appropriately encoded and numerical data is standardized, making the dataset ready for machine learning models.
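
If the data is later split into training and test sets, the same steps are often wrapped in a `ColumnTransformer` so that the encoders and scaler are fitted on the training rows only. A rough sketch under that assumption, not the notebook's own code; the six-row frame, the split, and the variable names are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical rows using a subset of the columns above
frame = pd.DataFrame({
    'satisfaction_level': [0.38, 0.80, 0.11, 0.72, 0.37, 0.41],
    'average_montly_hours': [157, 262, 272, 223, 159, 153],
    'Department': ['sales', 'sales', 'technical', 'IT', 'sales', 'IT'],
    'salary': ['low', 'medium', 'high', 'low', 'low', 'medium'],
})
train, test = frame.iloc[:4], frame.iloc[4:]

preprocess = ColumnTransformer(
    [
        ('scale', StandardScaler(), ['satisfaction_level', 'average_montly_hours']),
        ('ordinal', OrdinalEncoder(categories=[['low', 'medium', 'high']]), ['salary']),
        ('onehot', OneHotEncoder(drop='first'), ['Department']),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training rows, then reuse the fitted statistics on the test rows
X_train = preprocess.fit_transform(train)
X_test = preprocess.transform(test)

print(X_train.shape, X_test.shape)
```

Fitting on the training rows alone keeps test-set statistics out of the scaler's mean and standard deviation.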

In [13]:
plt.figure(figsize=(12, 8))
correlation_matrix = data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()


ValueError: could not convert string to float: 'sales'

<Figure size 1200x800 with 0 Axes>
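
The `ValueError` above arises because `corr()` is asked to correlate text columns such as `Department`. One way around it, sketched here on the assumption that `data` in that cell is a DataFrame with the same columns as `df`, is to restrict the matrix to numeric columns with `numeric_only=True`. The small frame below is a hypothetical stand-in:

```python
import pandas as pd

# Hypothetical stand-in with one text column, mirroring the dataset's layout
sample = pd.DataFrame({
    'satisfaction_level': [0.38, 0.80, 0.11, 0.72],
    'average_montly_hours': [157, 262, 272, 223],
    'Department': ['sales', 'sales', 'technical', 'IT'],
})

# numeric_only=True skips non-numeric columns instead of raising
correlation_matrix = sample.corr(numeric_only=True)
print(correlation_matrix.columns.tolist())
```

The resulting matrix can be passed to `sns.heatmap` exactly as in the cell above. Encoding the text columns first is the alternative if they should appear in the heatmap too.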