# Test 1. Exploratory Data Analysis

---
**By Leonardo H. Talero-Sarmiento - [ltalero@unab.edu.co](mailto:ltalero@unab.edu.co)**


# Information
**Craft.Co** is a manufacturing plant that operates an automated production line overseen by skilled engineers. Materials are selected based on the product requirements and then undergo several processing stages, including shaping, assembly, and quality control. Each product's details, such as its weight, production cost, sale price, and other relevant metrics, are captured in real time and logged into our system.

We use different machines for different tasks, and they are operated in shifts to ensure 24/7 production. Every manufactured item is assigned a unique product code and batch number, and undergoes strict quality control to determine its defect rate.

Below is an in-depth explanation of each column in our DataFrame:

### Columns:

1. **Engineer**: 
    - Description: Represents the engineer overseeing the production.
    - Type: Categorical
    - Values: Engineer_A, Engineer_B, Engineer_C

2. **Production_Cost**:
    - Description: The total cost incurred to manufacture the product.
    - Type: Continuous (Float)
    - Range: 10 to 200 currency units

3. **Sale_Price**:
    - Description: The price at which the product will be sold in the market.
    - Type: Continuous (Float)

4. **Hours_Spent**:
    - Description: Time taken in hours to produce the item.
    - Type: Continuous (Float)
    - Average: Around 5 hours

5. **Weight_kg**:
    - Description: Weight of the product in kilograms.
    - Type: Continuous (Float)
    - Average: Around 15 kg

6. **Material_Used**:
    - Description: The primary material used for manufacturing the product.
    - Type: Categorical
    - Values: Steel, Plastic, Aluminium, Copper, Rubber

7. **Batch_Number**:
    - Description: Represents the production batch number.
    - Type: Integer
    - Range: Sequentially increasing

8. **Product_Code**:
    - Description: A unique code assigned to each product.
    - Type: String
    - Format: "P" followed by a unique number (e.g., P1001)

9. **Defect_Rate(%)**:
    - Description: The percentage of defective items in a batch.
    - Type: Continuous (Float)
    - Range: 0% to 1%

10. **Manufacture_Date**:
    - Description: The date when the product was manufactured.
    - Type: Date

11. **Expiration_Date**:
    - Description: The date when the product is deemed unfit for use/sale.
    - Type: Date

12. **Machine_ID**:
    - Description: ID of the machine that was used to produce the item.
    - Type: Integer
    - Range: 1 to 20

13. **Production_Shift**:
    - Description: The shift during which the product was manufactured.
    - Type: Categorical
    - Values: Morning, Afternoon, Night

# Test


## Exploratory Data Analysis (EDA) Plan for the Manufacturing DataFrame

### 1. Preliminary Data Examination
#### Grade: Easy
- **Expected Outputs**:
  - Display the first few rows of the DataFrame.
  - Display the shape of the DataFrame (number of rows and columns).
  - Display column data types and any missing values.
- **Hints**:
  - Use `.head()`, `.shape`, and `.info()` methods.
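
The hints above could be combined roughly as follows. This is a minimal sketch: the four sample rows are hypothetical stand-ins for the real Craft.Co data, which would normally be loaded from a file.

```python
import pandas as pd

# Hypothetical sample rows (assumption); replace with the real dataset.
df = pd.DataFrame({
    "Engineer": ["Engineer_A", "Engineer_B", "Engineer_C", "Engineer_A"],
    "Production_Cost": [52.3, 120.0, None, 87.5],
    "Material_Used": ["Steel", "Plastic", "Copper", "Rubber"],
    "Machine_ID": [1, 7, 12, 20],
})

print(df.head())   # first rows (up to five)
print(df.shape)    # (number of rows, number of columns)
df.info()          # column dtypes and non-null counts

# Explicit count of missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

`.info()` reports non-null counts, so missing values show up as a count lower than the number of rows; `.isna().sum()` makes the same information explicit.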

### 2. Descriptive Statistics
#### Grade: Easy
- **Expected Outputs**:
  - Summary statistics (like mean, median, standard deviation, min, max) for numeric columns.
- **Hints**:
  - Use `.describe()` method.
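
A minimal sketch of the summary step, using a small hypothetical sample instead of the real data:

```python
import pandas as pd

# Hypothetical sample rows (assumption); replace with the real dataset.
df = pd.DataFrame({
    "Engineer": ["Engineer_A", "Engineer_B", "Engineer_C", "Engineer_A"],
    "Production_Cost": [50.0, 100.0, 150.0, 200.0],
    "Hours_Spent": [4.0, 5.0, 5.0, 6.0],
})

# By default, describe() summarises only the numeric columns:
# count, mean, std, min, quartiles (the 50% row is the median), and max.
summary = df.describe()
print(summary)

# The median can also be requested directly for selected columns.
medians = df[["Production_Cost", "Hours_Spent"]].median()
print(medians)
```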

### 3. Unique Values Examination
#### Grade: Easy to Medium
- **Expected Outputs**:
  - Number of unique values in each column.
  - Display the unique values for categorical columns.
- **Hints**:
  - Use `.nunique()` for number of unique values.
  - Use `.unique()` to display unique values for specific columns.
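
One possible way to apply both hints is sketched here. The sample rows are hypothetical, and the list of categorical columns is taken from the column descriptions above.

```python
import pandas as pd

# Hypothetical sample rows (assumption); replace with the real dataset.
df = pd.DataFrame({
    "Engineer": ["Engineer_A", "Engineer_B", "Engineer_C", "Engineer_A"],
    "Material_Used": ["Steel", "Plastic", "Steel", "Rubber"],
    "Production_Shift": ["Morning", "Night", "Morning", "Afternoon"],
    "Machine_ID": [1, 7, 7, 20],
})

# Number of distinct values in every column.
n_unique = df.nunique()
print(n_unique)

# Distinct values for the categorical columns only.
categorical_cols = ["Engineer", "Material_Used", "Production_Shift"]
unique_values = {col: df[col].unique().tolist() for col in categorical_cols}
for col, values in unique_values.items():
    print(col, values)
```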

### 4. Visualizing Data Distributions
#### Grade: Medium
- **Expected Outputs**:
  - Histograms for numeric columns to observe data distribution.
  - Boxplots to identify any outliers.
- **Hints**:
  - Libraries: `matplotlib` and `seaborn`.
  - Use `sns.histplot()` for histograms and `sns.boxplot()` for boxplots.

### 5. Correlation Analysis
#### Grade: Medium
- **Expected Outputs**:
  - Correlation matrix heatmap to understand relationships between numerical columns.
- **Hints**:
  - Use `.corr()` to compute the correlation matrix.
  - Use `sns.heatmap()` to visualize the correlation matrix.
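
A minimal sketch of the correlation step. The data are synthetic, and the link between `Sale_Price` and `Production_Cost` is built in on purpose so that the heatmap has something to show; it is not a claim about the real dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (assumption): Sale_Price is constructed from Production_Cost.
rng = np.random.default_rng(1)
cost = rng.uniform(10, 200, 100)
df = pd.DataFrame({
    "Production_Cost": cost,
    "Sale_Price": cost * 1.5 + rng.normal(0, 5, 100),
    "Hours_Spent": rng.normal(5, 1, 100),
})

# Select the numeric columns explicitly so text columns never reach .corr().
numeric_cols = ["Production_Cost", "Sale_Price", "Hours_Spent"]
corr = df[numeric_cols].corr()  # Pearson correlation by default

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
fig.tight_layout()
```

Fixing `vmin` and `vmax` at -1 and 1 keeps the colour scale comparable across different datasets.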

### 6. Categorical Data Analysis
#### Grade: Medium
- **Expected Outputs**:
  - Bar plots for categorical data (like Engineer names, Materials used, Production Shift).
  - Count of each category in categorical columns.
- **Hints**:
  - Use `sns.countplot()` for visualizations.
  - Use `.value_counts()` to get count for each category.
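
The two hints can be sketched for a single column as follows; the same pattern would apply to `Engineer` and `Production_Shift`. The sample values are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample rows (assumption); replace with the real dataset.
df = pd.DataFrame({
    "Material_Used": ["Steel", "Plastic", "Steel", "Copper", "Steel", "Plastic"],
})

# Count of each category, sorted from most to least frequent.
counts = df["Material_Used"].value_counts()
print(counts)

# Bar plot of the same counts.
fig, ax = plt.subplots(figsize=(5, 3))
sns.countplot(data=df, x="Material_Used", ax=ax)
fig.tight_layout()
```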

### 7. Time Series Analysis (if applicable)
#### Grade: Medium to Hard
- **Expected Outputs**:
  - Line plots showing trends over time for relevant columns (like Production_Cost or Defect_Rate over Manufacture_Date).
- **Hints**:
  - Ensure the DataFrame is sorted by the date column.
  - Use `sns.lineplot()` for visualizations.

### 8. Bivariate Analysis
#### Grade: Medium
- **Expected Outputs**:
  - Scatter plots between two continuous variables to understand their relationships.
  - Box plots between categorical and continuous variables to understand distribution across categories.
- **Hints**:
  - Use `sns.scatterplot()` for scatter plots and `sns.boxplot()` for box plots.
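
A minimal sketch showing one plot of each kind. The column pairs are only an example choice, and the data are synthetic.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (assumption); replace with the real dataset.
rng = np.random.default_rng(2)
cost = rng.uniform(10, 200, 90)
df = pd.DataFrame({
    "Production_Cost": cost,
    "Sale_Price": cost * 1.5 + rng.normal(0, 10, 90),
    "Hours_Spent": rng.normal(5, 1, 90),
    "Production_Shift": ["Morning", "Afternoon", "Night"] * 30,
})

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Continuous vs. continuous.
sns.scatterplot(data=df, x="Production_Cost", y="Sale_Price", ax=ax_scatter)

# Categorical vs. continuous.
sns.boxplot(data=df, x="Production_Shift", y="Hours_Spent", ax=ax_box)

fig.tight_layout()
```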

### 9. Handling Missing Values (if any)
#### Grade: Medium
- **Expected Outputs**:
  - Strategies to replace or remove missing values.
  - A clean dataset without any missing values.
- **Hints**:
  - Use `.isna()` to identify missing values.
  - Use `.fillna()` or `.dropna()` to handle them.
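
One possible strategy is sketched here: fill numeric gaps with the column median and drop rows whose category is unknown. Both choices are assumptions made for illustration; the right strategy depends on why the values are missing. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows (assumption) with one gap in each column.
df = pd.DataFrame({
    "Production_Cost": [50.0, None, 70.0, 90.0],
    "Material_Used": ["Steel", "Plastic", None, "Copper"],
})

# Identify how many values are missing in each column.
missing = df.isna().sum()
print(missing)

clean = df.copy()

# Numeric column: replace missing values with the median of the observed ones.
median_cost = clean["Production_Cost"].median()
clean["Production_Cost"] = clean["Production_Cost"].fillna(median_cost)

# Categorical column: drop rows where the material is unknown.
clean = clean.dropna(subset=["Material_Used"])

print(clean)
```

Working on a copy keeps the original DataFrame available for comparing results before and after cleaning.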

### 10. Outlier Detection and Treatment
#### Grade: Hard
- **Expected Outputs**:
  - Identify columns with outliers.
  - Strategies to treat them (cap, floor, or remove).
- **Hints**:
  - Use IQR (Interquartile Range) method or Z-scores to identify outliers.
  - Use boxplots to visualize outliers.
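
A minimal sketch of the IQR method with the conventional 1.5 multiplier, followed by two treatment options (capping and removal). The weights are hypothetical, with one deliberately extreme value.

```python
import pandas as pd

# Hypothetical sample values (assumption); 60.0 is an intentional outlier.
df = pd.DataFrame({
    "Weight_kg": [14.0, 15.0, 15.5, 16.0, 14.5, 15.0, 60.0],
})

s = df["Weight_kg"]
q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

# Values outside [lower, upper] are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (s < lower) | (s > upper)
print(df[is_outlier])

# Option 1: cap and floor the values at the IQR bounds.
df["Weight_capped"] = s.clip(lower, upper)

# Option 2: remove the flagged rows.
df_removed = df[~is_outlier]
```

Z-scores are an alternative for roughly normal data: standardise with `(s - s.mean()) / s.std()` and flag absolute values above a chosen threshold, commonly 3.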

