# 🌟 Project Description

  Data Analyst Assistant simplifies data exploration by combining the power of machine learning and large language models. It reads your dataset and allows you to ask questions such as “What is the average salary?” or “Which job title has the highest salary?” and returns meaningful insights using AI.

  The assistant sends the dataset's shape, column data types, and summary statistics of the numerical columns to the model, which uses that context to answer user questions.

In [1]:
!pip install requests




In [2]:
import os

# Set your token (the Azure inference endpoint used below authenticates with a GitHub personal access token)
os.environ["GITHUB_TOKEN"] = "your_personal_access_token"


In [5]:
import os
import pandas as pd
import requests

# Load GitHub token
token = os.getenv("GITHUB_TOKEN")  # Or set manually for testing

# Azure model endpoint
endpoint = "https://models.inference.ai.azure.com/chat/completions"
model_name = "Llama-3.3-70B-Instruct"

# Load user-provided dataset
df = pd.read_csv('/content/ds_salaries.csv')

# Function to query model with processed insights
def ask_question_about_data(user_question):
    # Preprocess or summarize relevant parts of the dataset
    numerical_description = df.describe().to_string()
    column_info = df.dtypes.to_string()
    shape_info = f"The dataset contains {df.shape[0]} rows and {df.shape[1]} columns."

    # Prompt construction
    prompt = f""" You are a professional data analyst. Here is the full context about a dataset:

{shape_info}

Column data types:
{column_info}

Summary statistics of numerical columns:
{numerical_description}

Now, using the information above, answer the following question:
{user_question}
"""

    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    }

    payload = {
        "model": model_name,
        "messages": [
            {"role": "system", "content": "You are a helpful data analyst."},
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.7,
        "top_p": 0.95,
        "max_tokens": 800
    }

    response = requests.post(endpoint, headers=headers, json=payload)

    if response.status_code == 200:
        return response.json()["choices"][0]["message"]["content"]
    else:
        return f"Error: {response.status_code}, {response.text}"
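
The request in the function above has no timeout, and network failures would surface as uncaught exceptions. A minimal sketch of a more defensive call, assuming the same endpoint, headers, and payload shape; the helper name `post_chat` and the 30-second default are hypothetical choices, not part of the original notebook:

```python
import requests

def post_chat(endpoint, headers, payload, timeout=30):
    """Send a chat-completions request and return the reply text or an error string."""
    try:
        response = requests.post(endpoint, headers=headers, json=payload, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx statuses into exceptions
        return response.json()["choices"][0]["message"]["content"]
    except requests.RequestException as exc:
        # Covers malformed URLs, connection errors, timeouts, and HTTP error statuses
        return f"Error: {exc}"
    except (KeyError, IndexError, ValueError) as exc:
        # The body was not the expected chat-completions JSON structure
        return f"Error: unexpected response format ({exc!r})"
```

Inside `ask_question_about_data`, the last lines could then be replaced by `return post_chat(endpoint, headers, payload)`.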


In [6]:
# Ask a single question at a time
user_question = "What are the employment_type values?"  # You can change this line
response = ask_question_about_data(user_question)
print("\nResponse:\n", response)


Response:
 The data type for 'employment_type' is object, which typically means it's a categorical or string column. However, the specific categories or values within the 'employment_type' column are not provided in the summary statistics. 

To determine the specific employment types, you would need to examine the unique values within the 'employment_type' column. This could be done using the `unique()` function in pandas, for example: `df['employment_type'].unique()`. 

Common employment types might include full-time, part-time, contract, freelance, intern, etc., but without further information, we can't know the exact categories used in this dataset.
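
The model cannot list the categories because the prompt only contains `df.describe()`, which by default covers numerical columns. One possible way to close that gap is to add a short summary of the non-numeric columns to the prompt. The sketch below is an assumption about how that might look: `summarize_categoricals` and `max_values` are hypothetical names, and the toy frame is illustrative data, not values from `ds_salaries.csv`.

```python
import pandas as pd

def summarize_categoricals(df: pd.DataFrame, max_values: int = 10) -> str:
    """Return one line per non-numeric column listing its most frequent values."""
    lines = []
    for col in df.select_dtypes(exclude="number").columns:
        # Keep only the top values so the prompt stays short
        counts = df[col].value_counts(dropna=False).head(max_values)
        values = ", ".join(f"{value} ({count})" for value, count in counts.items())
        lines.append(f"{col}: {values}")
    return "\n".join(lines)

# Toy frame for illustration only; not taken from ds_salaries.csv
toy = pd.DataFrame({
    "employment_type": ["FT", "FT", "PT", "CT"],
    "salary_in_usd": [100000, 120000, 40000, 60000],
})
categorical_summary = summarize_categoricals(toy)
print(categorical_summary)
```

The returned string could be interpolated into the prompt under a heading such as "Most frequent values of categorical columns", next to the existing summary statistics.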


In [7]:
# Ask a single question at a time
user_question = "What is the mean of each numerical column?"  # You can change this line
response = ask_question_about_data(user_question)
print("\nResponse:\n", response)


Response:
 To find the mean of each numerical column, we can look at the 'mean' row in the summary statistics provided. The numerical columns are 'work_year', 'salary', 'salary_in_usd', and 'remote_ratio'. 

The mean values are as follows:
- work_year: 2022.373635
- salary: 1.906956e+05 (or 190695.6)
- salary_in_usd: 137570.389880
- remote_ratio: 46.271638

So, the mean of each numerical column is approximately 2022 for 'work_year', 190,696 for 'salary', 137,570 for 'salary_in_usd', and 46 for 'remote_ratio'.


In [8]:
# Ask a single question at a time
user_question = "How many rows and columns are present in the dataset?"  # You can change this line
response = ask_question_about_data(user_question)
print("\nResponse:\n", response)


Response:
 The dataset contains 3755 rows and 11 columns.


In [9]:
# Ask a single question at a time
user_question = "How many null values are present in the dataset?"  # You can change this line
response = ask_question_about_data(user_question)
print("\nResponse:\n", response)


Response:
 Based on the provided summary statistics, there are no null values present in the numerical columns of the dataset. 

The 'count' row in the summary statistics table shows the number of non-null values in each column. Since the count is 3755 for all numerical columns, which is the total number of rows in the dataset, it indicates that there are no null values in these columns.

However, without additional information about the categorical columns, we cannot determine if there are any null values in those columns.
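
The model can only infer null counts for numerical columns from the `count` row. Null counts for every column can be computed directly with pandas and, if desired, added to the prompt. A minimal sketch, assuming a hypothetical helper `null_summary` and a toy frame whose values are illustrative rather than taken from `ds_salaries.csv`:

```python
import pandas as pd

def null_summary(df: pd.DataFrame) -> str:
    """Return one 'column: count' line per column with its number of missing values."""
    counts = df.isnull().sum()
    return "\n".join(f"{col}: {int(n)}" for col, n in counts.items())

# Toy frame for illustration only; these values are not from ds_salaries.csv
toy = pd.DataFrame({
    "job_title": ["Data Analyst", None, "ML Engineer"],
    "salary_in_usd": [90000, 120000, 150000],
})
null_info = null_summary(toy)
print(null_info)
```

Calling `null_summary(df)` on the loaded dataset would cover the categorical columns as well as the numerical ones.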


In [10]:
# Ask a single question at a time
user_question = "Give a summary of the dataset."  # You can change this line
response = ask_question_about_data(user_question)
print("\nResponse:\n", response)


Response:
 Based on the provided information, here is a summary of the dataset:

**Dataset Overview**

* The dataset contains 3755 rows and 11 columns.
* The columns include a mix of numerical and categorical data types.

**Numerical Columns Summary**

* The numerical columns are: `work_year`, `salary`, `salary_in_usd`, and `remote_ratio`.
* The mean values for these columns are:
	+ `work_year`: 2022.37
	+ `salary`: $190,695.60
	+ `salary_in_usd`: $137,570.39
	+ `remote_ratio`: 46.27%
* The standard deviations for these columns are:
	+ `work_year`: 0.69
	+ `salary`: $671,676.57
	+ `salary_in_usd`: $63,055.63
	+ `remote_ratio`: 48.59%
* The minimum and maximum values for these columns are:
	+ `work_year`: 2020-2023
	+ `salary`: $6,000 - $30,400,000
	+ `salary_in_usd`: $5,132 - $450,000
	+ `remote_ratio`: 0% - 100%

**Categorical Columns**

* The categorical columns are: `experience_level`, `employment_type`, `job_title`, `salary_currency`, `employee_residence`, `company_location`, and 