# Objective:

Exploratory Data Analysis (EDA) is a vital step in data science that involves understanding a dataset's structure, characteristics, and insights before applying advanced models. However, EDA can be complex and repetitive, involving tasks like data cleaning, preprocessing, statistical analysis, and visualization. In collaborative settings, managing these tasks efficiently while incorporating feedback and producing high-quality reports adds further challenges.

This project simplifies the EDA process using a multi-agent system built with AutoGen. Each agent is designed with a specific role to ensure task specialization and modularity.


### Install AutoGen

In [1]:
!pip install -q autogen-agentchat==0.2.38

### Relevant Imports

In [2]:
import os
import shutil
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files
from google.colab import userdata
import autogen
from autogen import UserProxyAgent, AssistantAgent, GroupChat, GroupChatManager

### Import Sample Dataset

Using the following dataset from Kaggle: [House Price Dataset](https://www.kaggle.com/datasets/neurocipher/house-price-for-linear-regression?resource=download)

### About the Dataset

This housing dataset provides detailed information on residential properties along with their market selling prices, making it perfect for ML regression projects, EDA, and price prediction models.

More information is available at the link above.

In [43]:
# Only if you need to clear workspace

import shutil
import os

workspace_path = "/content/workspace"

if os.path.exists(workspace_path):
    shutil.rmtree(workspace_path)

os.makedirs(workspace_path, exist_ok=True)
print("Workspace cleared.")

Workspace cleared.


In [44]:
uploaded = files.upload()

uploaded_filename = next(iter(uploaded))

os.makedirs("/content/workspace", exist_ok=True)

shutil.copy(
    f"/content/{uploaded_filename}",
    "/content/workspace/dataset.csv"
)

print("CSV available at:", "/content/workspace/dataset.csv")

Saving Housing.csv to Housing.csv
CSV available at: /content/workspace/dataset.csv


In [45]:
df = pd.read_csv("/content/workspace/dataset.csv")
df.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


### Set Up API Key and LLM Config

In [12]:
OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")

config_list_gpt4o = [{"model": "gpt-4o", "api_key": OPENAI_API_KEY}]

gpt4o_config = {
    "cache_seed": 42,
    "temperature": 0.0,
    "config_list": config_list_gpt4o,
    "timeout": 120,
    "max_tokens": 900,
}
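Outside Colab, `google.colab.userdata` is not available. As a minimal sketch, assuming the key is exported as an `OPENAI_API_KEY` environment variable, the same config could be built as shown here. `build_llm_config` is a hypothetical helper (not part of AutoGen), and the placeholder key exists only to keep the example runnable.

```python
import os


def build_llm_config(api_key, model="gpt-4o"):
    """Hypothetical helper: return a dict shaped like gpt4o_config above."""
    if not api_key:
        raise ValueError("No API key provided; set OPENAI_API_KEY first.")
    return {
        "cache_seed": 42,      # reuse cached replies for identical requests
        "temperature": 0.0,    # keep replies as deterministic as possible
        "config_list": [{"model": model, "api_key": api_key}],
        "timeout": 120,
        "max_tokens": 900,     # caps the length of every single reply
    }


# Assumption: the key lives in an environment variable when not running in Colab.
# "sk-placeholder" is NOT a real key; it only lets the sketch run end to end.
api_key = os.environ.get("OPENAI_API_KEY") or "sk-placeholder"
config = build_llm_config(api_key)
print(config["config_list"][0]["model"])
```

Note that `max_tokens` limits each individual reply, so a low value may cut long code replies short.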

## Set Up the Agents That Will Be Used

### Admin Agent

Oversees the workflow, ensures coordination, and maintains alignment with project goals.

In [46]:
user_proxy = UserProxyAgent(
    name="Admin",
    system_message=
    """
    A human admin. Code execution needs to be approved by this admin.
    """,
    human_input_mode="NEVER",
    max_consecutive_auto_reply=5,
    code_execution_config=False,
)

### Data Preparation Agent

Handles data cleaning and preprocessing.

In [47]:
data_processer = AssistantAgent(
    name="DataProcesser",
    llm_config=gpt4o_config,
    system_message=
    """
    Data processer. Find the dataset at /content/workspace/dataset.csv. Go
    through the dataset and clean it to prepare it for exploratory data analysis
    using python code. Save the new dataset as /content/workspace/cleaned_data.csv.
    This dataset should be prepared before the data analyst performs EDA.

    Only output python code for the executor to execute. The
    user can't modify your code so do not suggest incomplete code which requires
    others to modify. If the code results in an error, fix the error and output
    the code again.

    If the result indicates there is an error, fix the error and output the code again.
    Suggest the full code instead of partial code or code changes. If the error
    can't be fixed or if the task is not solved even after the code is executed
    successfully, analyze the problem, revisit your assumption, collect
    additional info you need, and think of a different approach to try.
    """,
)
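For orientation, the kind of cleaning script this prompt asks for might resemble the sketch below. It is not the code the agent actually produced: `clean_dataframe`, the 50% missing-value threshold, and the tiny in-memory frame are all illustrative assumptions.

```python
import numpy as np
import pandas as pd


def clean_dataframe(df, max_missing_frac=0.5):
    """Hypothetical cleaning routine; the 50% threshold is illustrative."""
    out = df.drop_duplicates()
    # Keep only columns whose share of missing values is within the threshold
    keep = out.columns[out.isna().mean() <= max_missing_frac]
    out = out.loc[:, keep].copy()
    # Numeric columns: fill gaps with the column median
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    # Remaining columns are assumed to hold text: normalize, then fill with the mode
    for col in out.columns.difference(num_cols):
        out[col] = out[col].str.strip().str.lower()
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out.reset_index(drop=True)


# Made-up rows for illustration only; this is not the Kaggle dataset.
raw = pd.DataFrame({
    "price": [100.0, 200.0, np.nan, 200.0, 300.0],
    "area": [50, 80, 60, 80, 90],
    "mainroad": ["yes", " No", None, " No", "YES"],
    "notes": ["x", None, None, None, None],
})
cleaned = clean_dataframe(raw)
print(cleaned)
```

In the notebook's setting, the result would then be written with `cleaned.to_csv("/content/workspace/cleaned_data.csv", index=False)`.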

### EDA Agent

Performs statistical summarization and generates insights and visualizations.

In [48]:
data_analyst = AssistantAgent(
    name="DataAnalyst",
    llm_config=gpt4o_config,
    system_message=
    """
    Data Analyst. Use the dataset at /content/workspace/cleaned_data.csv that
    has been prepared by the data processer. Using python coding, perform
    statistical summarization and generate insights and visualizations.

    Only output python code for the executor to execute. The
    user can't modify your code so do not suggest incomplete code which requires
    others to modify. If the code results in an error, fix the error and output
    the code again.

    If the result indicates there is an error, fix the error and output the code again.
    Suggest the full code instead of partial code or code changes. If the error
    can't be fixed or if the task is not solved even after the code is executed
    successfully, analyze the problem, revisit your assumption, collect
    additional info you need, and think of a different approach to try.
    """
)
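A minimal sketch of the statistical summarization this agent is asked to perform, assuming pandas and a numeric target column. `summarize` is a hypothetical helper and the five rows are made up for illustration; plots are left out to keep the sketch short.

```python
import pandas as pd


def summarize(df, target):
    """Hypothetical helper: basic numeric summaries relative to a target column."""
    numeric = df.select_dtypes(include="number")
    return {
        "describe": numeric.describe(),   # count, mean, std, quartiles
        "skew": numeric.skew(),           # > 0 suggests a right-skewed column
        "corr_with_target": (
            numeric.corr()[target].drop(target).sort_values(ascending=False)
        ),
    }


# Made-up rows for illustration only; this is not the Kaggle dataset.
sample = pd.DataFrame({
    "price": [100, 120, 130, 150, 400],
    "area": [50, 60, 65, 70, 150],
    "furnishingstatus": [
        "unfurnished", "unfurnished", "semi-furnished", "furnished", "furnished",
    ],
})

summary = summarize(sample, target="price")
median_by_status = sample.groupby("furnishingstatus")["price"].median()
print(summary["corr_with_target"])
print(median_by_status)
```

Comparing group medians rather than means is one way to keep a few very expensive houses from dominating a per-category comparison.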

### Report Generator Agent

Creates a well-structured EDA report with clear insights and findings.

In [49]:
report_writer = AssistantAgent(
    name="ReportWriter",
    llm_config=gpt4o_config,
    system_message=
    """
    Report Writer. After the data analyst has written and gotten their code
    approved, the executor will execute the code. Use the outputs to write a
    report based on the findings located at /content/workspace/.

    The report should include the following
    sections: Introduction, Dataset Overview, Variable Descriptions, Data
    Processing steps, Univariate Analysis, Bivariate Analysis, Key Insights and
    Patterns, Limitations, Next Steps. The report should be written clearly and
    concisely. The report should ONLY use information provided by the other
    agents. Do not make up information.
    """
)
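The section list in this prompt lends itself to a simple automated check. The sketch below, in plain Python, reports which required sections are missing from a Markdown draft. `missing_sections` and the sample draft are hypothetical; the section names are taken from the prompt above.

```python
REQUIRED_SECTIONS = [
    "Introduction",
    "Dataset Overview",
    "Variable Descriptions",
    "Data Processing Steps",
    "Univariate Analysis",
    "Bivariate Analysis",
    "Key Insights and Patterns",
    "Limitations",
    "Next Steps",
]


def missing_sections(report_md):
    """Return the required sections that have no matching Markdown heading."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in report_md.splitlines()
        if line.startswith("#")
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]


# Made-up draft for illustration only.
draft = "\n".join([
    "# Exploratory Data Analysis Report",
    "## Introduction",
    "Some text.",
    "## Limitations",
    "Some text.",
])
print(missing_sections(draft))
```

A check like this only confirms that the headings exist; judging whether each section is accurate remains the Critic's job.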

### Critic Agent

Reviews outputs and provides feedback to improve clarity, accuracy, and actionability.

In [50]:
critic = AssistantAgent(
    name="Critic",
    llm_config=gpt4o_config,
    system_message=
    """
    Critic. Double check code and reports generated from other agents and provide
    feedback.
    """
)

### Executor Agent

Validates code and ensures the accuracy of the results.

In [51]:
executor = UserProxyAgent(
    name="Executor",
    system_message=
    """
    Executor. Execute the code written by the data processer and data analyst
    and report the results and the visualizations generated.
    """,
    human_input_mode="NEVER",
    code_execution_config={
        "last_n_messages": 3,
        "work_dir": "/content/workspace",
        "use_docker": False,
    },
)
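The executor acts on code blocks found in recent messages (`last_n_messages` is set to 3 above). The following is an illustrative sketch of that idea in plain Python, not AutoGen's actual extraction logic; `extract_code_blocks` and both sample messages are hypothetical.

```python
import re

# The fence marker is built programmatically so the pattern contains no literal fence.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"(\w*)\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(message):
    """Return (language, code) pairs for every complete fenced block in message."""
    return [
        (lang or "unknown", code.strip())
        for lang, code in CODE_BLOCK.findall(message)
    ]


# A reply with a complete block, and one cut off before its closing fence.
complete = "\n".join(["Cleaning script:", FENCE + "python", "print('cleaned')", FENCE])
truncated = "\n".join(["Cleaning script:", FENCE + "python", "print('clea"])

print(extract_code_blocks(complete))
print(extract_code_blocks(truncated))
```

In this sketch, a reply that is cut off before its closing fence yields no blocks at all, which is one reason to keep `max_tokens` large enough for complete scripts.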

### Combine the Agents Using a Group Chat

In [52]:
groupchat = GroupChat(
    agents=[user_proxy, data_processer, data_analyst, report_writer, critic, executor],
    messages=[],
    max_round=20,
)

manager = GroupChatManager(groupchat=groupchat, llm_config=gpt4o_config)
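With the default settings, the manager's LLM chooses the next speaker each round. One way to reason about the intended workflow is as a hand-off graph keyed by agent name. The sketch below is plain Python with an assumed ordering (clean, execute, analyze, execute, write, critique). `ALLOWED_HANDOFFS` and `first_invalid_handoff` are hypothetical names, and the graph is one reading of the agent prompts rather than something the notebook enforces.

```python
# Assumed hand-off graph: each speaker maps to the speakers allowed to follow it.
ALLOWED_HANDOFFS = {
    "Admin": ["DataProcesser"],
    "DataProcesser": ["Executor", "Critic"],
    "Executor": ["DataProcesser", "DataAnalyst", "ReportWriter"],
    "DataAnalyst": ["Executor", "Critic"],
    "Critic": ["DataProcesser", "DataAnalyst", "ReportWriter"],
    "ReportWriter": ["Critic", "Admin"],
}


def first_invalid_handoff(speakers, allowed):
    """Return the index of the first speaker that breaks the graph, or None."""
    for i in range(1, len(speakers)):
        previous, current = speakers[i - 1], speakers[i]
        if current not in allowed.get(previous, []):
            return i
    return None


expected_order = ["Admin", "DataProcesser", "Executor", "DataAnalyst", "Executor", "ReportWriter"]
skipped_steps = ["Admin", "ReportWriter"]

print(first_invalid_handoff(expected_order, ALLOWED_HANDOFFS))
print(first_invalid_handoff(skipped_steps, ALLOWED_HANDOFFS))
```

AutoGen 0.2's `GroupChat` appears to accept a comparable mapping of agent objects through its `allowed_or_disallowed_speaker_transitions` and `speaker_transitions_type` parameters; check the documentation for the installed version before relying on them.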

### Perform EDA on Sample Dataset

In [53]:
output_report = user_proxy.initiate_chat(
    manager,
    message=
    """
    Run an exploratory data analysis on the dataset saved at /content/workspace.
    After the EDA is performed, create a report with findings and visuals.
    """
)
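Once the call returns, the final report can be pulled out of the conversation and saved. The sketch below assumes the returned object exposes the conversation as `chat_history`, a list of dicts with `name` and `content` keys; verify this against the installed AutoGen version. `last_message_from`, the stand-in history, and the output filename are hypothetical.

```python
import tempfile
from pathlib import Path


def last_message_from(chat_history, speaker):
    """Return the most recent non-empty message content from the given speaker."""
    for message in reversed(chat_history):
        if message.get("name") == speaker and message.get("content"):
            return message["content"]
    return None


# Stand-in for output_report.chat_history; the real structure is an assumption.
sample_history = [
    {"name": "Admin", "content": "Run an exploratory data analysis."},
    {"name": "ReportWriter", "content": "# Exploratory Data Analysis Report (draft)"},
    {"name": "Critic", "content": "Please add a Limitations section."},
    {"name": "ReportWriter", "content": "# Exploratory Data Analysis Report"},
]

report = last_message_from(sample_history, "ReportWriter")

# Written to a temp directory here; in Colab this could be /content/workspace.
out_path = Path(tempfile.gettempdir()) / "eda_report.md"
out_path.write_text(report, encoding="utf-8")
print("Report saved to:", out_path)
```

Searching from the end of the history means a revised report written after Critic feedback takes precedence over earlier drafts.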

Admin (to chat_manager):


    Run an exploratory data analysis on the dataset saved at /content/workspace.
    After the EDA is performed, create a report with findings and visuals.
    

--------------------------------------------------------------------------------

Next speaker: DataProcesser

DataProcesser (to chat_manager):

To prepare the dataset for exploratory data analysis, we need to clean it first. Below is the Python code to clean the dataset and save it as `cleaned_data.csv`.

```python
import pandas as pd

# Load the dataset
file_path = '/content/workspace/dataset.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the dataset to understand its structure
print(data.head())

# Check for missing values
print(data.isnull().sum())

# Drop columns with a high percentage of missing values (e.g., more than 50%)
threshold = 0.5
data = data.dropna(thresh=int(threshold * len(data)), axis=1)

# Fill missing values for numerical columns with the median
numerical_col
```

# Outputted Results

Report Findings:
1. The dataset contains information about house prices and various features such as area, bedrooms, bathrooms, etc.
2. The correlation matrix shows that 'area' has a strong positive correlation with 'price'.
3. The distribution of house prices is right-skewed, indicating that most houses are priced lower, with a few high-priced outliers.
4. The boxplot shows that furnished houses tend to have higher prices compared to semi-furnished and unfurnished houses.
5. The pairplot provides insights into the relationships between different numerical variables.
6. Count plots for categorical variables show the distribution of each category within the dataset.


--------------------------------------------------------------------------------

Next speaker: ReportWriter

ReportWriter (to chat_manager):

# Exploratory Data Analysis Report

## Introduction
This report presents the findings from an exploratory data analysis (EDA) conducted on a dataset containing information about house prices and various features related to the properties. The analysis aims to uncover patterns, relationships, and insights that can inform future decision-making processes.

## Dataset Overview
The dataset consists of 545 entries, each representing a house with various attributes. The data includes both numerical and categorical variables, providing a comprehensive view of the factors influencing house prices.

## Variable Descriptions
The dataset includes the following variables:
- **price**: The price of the house.
- **area**: The area of the house in square feet.
- **bedrooms**: The number of bedrooms in the house.
- **bathrooms**: The number of bathrooms in the house.
- **stories**: The number of stories in the house.
- **mainroad**: Whether the house is located on the main road (yes/no).
- **guestroom**: Whether the house has a guestroom (yes/no).
- **basement**: Whether the house has a basement (yes/no).
- **hotwaterheating**: Whether the house has hot water heating (yes/no).
- **airconditioning**: Whether the house has air conditioning (yes/no).
- **parking**: The number of parking spaces available.
- **prefarea**: Whether the house is located in a preferred area (yes/no).
- **furnishingstatus**: The furnishing status of the house (furnished/semi-furnished/unfurnished).

## Data Processing Steps
1. **Loading the Data**: The dataset was loaded from a CSV file.
2. **Handling Missing Values**: There were no missing values in the dataset.
3. **Encoding Categorical Variables**: Categorical variables were encoded to numerical values for correlation analysis.
4. **Data Cleaning**: Duplicate rows were removed, and data types were appropriately set.

## Univariate Analysis
- **Price Distribution**: The distribution of house prices is right-skewed, indicating that most houses are priced lower, with a few high-priced outliers.
- **Categorical Variables**: Count plots were generated for categorical variables to show the distribution of each category within the dataset.

## Bivariate Analysis
- **Correlation Matrix**: A correlation matrix was generated to identify relationships between numerical variables. The analysis revealed a strong positive correlation between 'area' and 'price'.
- **Boxplot Analysis**: A boxplot of price by furnishing status showed that furnished houses tend to have higher prices compared to semi-furnished and unfurnished houses.

## Key Insights and Patterns
1. **Area and Price**: There is a strong positive correlation between the area of the house and its price, suggesting that larger houses tend to be more expensive.
2. **Furnishing Status**: Furnished houses generally command higher prices, indicating a premium for fully furnished properties.
3. **Price Distribution**: The right-skewed distribution of prices suggests a market with a majority of lower-priced houses and a few high-value properties.

## Limitations
- The dataset does not include information on other potential factors influencing house prices, such as location specifics, age of the property, or market conditions.
- The analysis is limited to the variables provided and may not capture all nuances of the housing market.

## Next Steps
- **Further Analysis**: Conduct deeper analysis on the impact of other potential factors on house prices.
- **Predictive Modeling**: Develop predictive models to estimate house prices based on the available features.
- **Data Enrichment**: Consider enriching the dataset with additional variables to improve the robustness of the analysis.

This report provides a foundational understanding of the dataset and highlights key areas for further exploration and analysis.