# After collecting the data, we need to clean it to improve the performance of solar_panel_chatbot.

## Feature engineering steps used here:

1. Missing-value (NaN) imputation for categorical and numerical columns
2. Splitting categorical and numerical data
3. Plotting graphs for visualization (data analysis)

## Imported Libraries and Modules



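* **pandas (pandas as pd)**: Provides the `DataFrame` structure used in this notebook to load the CSV file and to inspect and clean its columns.
* **NumPy (numpy)**: Provides fundamental data structures (arrays) for efficient numerical computations and vectorized operations.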
* **Matplotlib (matplotlib.pyplot as plt)**: Creates various static, animated, and interactive visualizations for exploring and understanding data.
* **Seaborn (seaborn as sns)**: Builds on Matplotlib to offer a high-level interface for creating statistical graphics.
* **scikit-learn (from sklearn.impute import SimpleImputer)**: Implements a collection of machine learning algorithms, including `SimpleImputer` for handling missing data (mean, median, etc.).
* **scikit-learn (from sklearn.compose import ColumnTransformer)**: Provides tools for building machine learning pipelines, including `ColumnTransformer` for applying different transformations to different columns in a dataset.

These libraries work together to facilitate various data analysis tasks:

* **Data Loading and Manipulation (pandas & NumPy):** Load the CSV data into a DataFrame and manipulate it using efficient arrays.
* **Visualization (Matplotlib & Seaborn):** Create informative plots and charts to understand data trends and relationships.
* **Missing Data Handling (scikit-learn.impute.SimpleImputer):** Impute missing values in the data using strategies like mean, median, or most frequent value.
* **Feature Engineering (scikit-learn.compose.ColumnTransformer):** Apply different transformations (e.g., scaling, encoding) to specific columns in a dataset, enabling custom preprocessing pipelines.

**Remember to install these libraries with `pip install pandas numpy matplotlib seaborn scikit-learn` before running the Python code.**
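
As a rough illustration of how `SimpleImputer` and `ColumnTransformer` fit together, the sketch below imputes a categorical column with its most frequent value and a numerical column with its mean. The toy DataFrame and its values are invented for the example; only the column names echo the ones used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not taken from a.csv.
toy = pd.DataFrame({
    "Panel Type": ["Mono PERC", np.nan, "Polycrystalline", "Polycrystalline"],
    "Rating": [4.6, 4.0, np.nan, 4.5],
})

# Apply a different imputation strategy to each group of columns.
preprocess = ColumnTransformer(
    transformers=[
        ("cat", SimpleImputer(strategy="most_frequent"), ["Panel Type"]),
        ("num", SimpleImputer(strategy="mean"), ["Rating"]),
    ]
)

# The transformer returns a plain array, so the column names are restored by hand.
imputed = pd.DataFrame(
    preprocess.fit_transform(toy),
    columns=["Panel Type", "Rating"],
)
print(imputed)
```

The output columns follow the order of the `transformers` list, which is why the names are passed in that same order.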


In [1]:
!pip3 install pandas
!pip3 install numpy



In [2]:
import pandas as pd 
import numpy as np

In [3]:
df = pd.read_csv('a.csv') 

In [4]:
df.head()

Unnamed: 0,Sr.No.,name,purchase_price,discount,real_price,rating,Product Description
0,0,Luminous 550W 24V Mono PERC Halfcut Solar Panel,"₹16,739",60% OFF,"₹42,000",4.6,The Luminous Solar Panel is a high-quality sol...
1,1,Waaree Bi-55-540 540W 144 Cells Framed Dual Gl...,"₹12,799",55% OFF,"₹28,869",4.0,The Waaree Bi-55-540 540W 144 Cells Framed Dua...
2,2,Solar Universe 335W 24V Polycrystalline Solar ...,"₹10,899",31% OFF,"₹16,000",4.5,Looking for a top-of-the-line solar panel to p...
3,3,Waaree 540W 144 Cells Monocrystalline PERC Sol...,"₹12,699",53% OFF,"₹27,279",4.5,The Waaree 540W 144 Cells Monocrystalline PERC...
4,4,Waaree 535Wp 144 Cells Framed Dual Glass Mono ...,"₹12,899",42% OFF,"₹22,399",4.6,Introducing the Waaree 535Wp 144 Cells Framed ...


In [6]:
df.isnull().mean() * 100

Sr.No.                 0.0
name                   0.0
purchase_price         0.0
discount               0.0
real_price             0.0
rating                 0.0
Product Description    0.0
dtype: float64
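
The expression `df.isnull().mean() * 100` gives a percentage because `isnull()` returns a boolean mask and the mean of a boolean column is the fraction of `True` entries. A minimal sketch on an invented frame (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing panel types out of four rows.
toy = pd.DataFrame({
    "Panel Type": ["Mono PERC", np.nan, "Polycrystalline", np.nan],
    "Rating": [4.6, 4.0, 4.5, 4.6],
})

missing_mask = toy.isnull()             # True where a value is missing
missing_pct = missing_mask.mean() * 100  # fraction of True per column, as a percentage
print(missing_pct)
```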

In [7]:
import pandas as pd
import re

# Load the CSV file
df = pd.read_csv('a.csv')

# Define a function to extract the brand name, rated power, rated voltage, and panel type
def extract_components(name):
    # Assuming the brand name is the first word
    brand_name = name.split()[0]
    
    # Extracting rated power
    rated_power_match = re.search(r'(\d+)\s*W', name, re.IGNORECASE)
    rated_power = rated_power_match.group(1) + " Watt" if rated_power_match else None
    
    # Extracting rated voltage
    rated_voltage_match = re.search(r'(\d+)\s*V', name, re.IGNORECASE)
    rated_voltage = rated_voltage_match.group(1) + " Volt" if rated_voltage_match else None
    
    # Extracting panel type (the first list entry found in the name wins, so order matters)
    panel_types = [
        "Bifacial",
        "Half Cut Mono Bifacial",
        "Mono Bifacial",
        "Mono PERC",
        "Mono PERC Bifacial",
        "Monocrystalline",
        "Polycrystalline"
    ]
    panel_type = next((ptype for ptype in panel_types if ptype.lower() in name.lower()), None)
    
    return brand_name, panel_type, rated_power, rated_voltage

# Apply the function to the 'name' column and create new columns
df[['Brand Name', 'Panel Type', 'Rated Power', 'Rated Voltage']] = df['name'].apply(lambda x: pd.Series(extract_components(x)))

# Remove the ₹ symbol and the thousands separators from 'real_price' and 'purchase_price'
df['real_price'] = df['real_price'].replace({'[₹,]': ''}, regex=True)
df['purchase_price'] = df['purchase_price'].replace({'[₹,]': ''}, regex=True)

# Rename columns
df.rename(columns={
    'real_price': 'Actual Price',
    'purchase_price': 'Purchase Price',
    'discount': 'Discount',
    'rating': 'Rating'
}, inplace=True)

# Drop the 'name' column
df.drop(columns=['name'], inplace=True)

# Adjust the index to start from 1 and rename it to 'Sr.No.'
df.index = df.index + 1
df.index.name = 'Sr.No.'

# Rearrange the columns in the specified order
desired_order = [
    'Brand Name', 'Panel Type', 'Rated Power', 
    'Rated Voltage', 'Actual Price', 'Purchase Price', 'Discount', 
    'Rating', 'Product Description'
]

# Reorder columns and save to CSV
df = df[desired_order]  # Just select the desired columns directly
df.to_csv('modified_a1.csv')
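
The panel-type lookup in the cell above returns the first list entry that appears in the name. Because `"Bifacial"` is listed first, a name containing `"Mono PERC Bifacial"` would be labelled simply `"Bifacial"`. One possible variation, shown here only as a sketch and not as the method used in this notebook, is to try the longest labels first. The function name `match_panel_type` and the sample product names are hypothetical.

```python
PANEL_TYPES = [
    "Bifacial",
    "Half Cut Mono Bifacial",
    "Mono Bifacial",
    "Mono PERC",
    "Mono PERC Bifacial",
    "Monocrystalline",
    "Polycrystalline",
]

def match_panel_type(name, panel_types=PANEL_TYPES):
    """Return the longest panel-type label contained in `name`, or None."""
    lowered = name.lower()
    # Longest labels first, so specific types win over their shorter substrings.
    for ptype in sorted(panel_types, key=len, reverse=True):
        if ptype.lower() in lowered:
            return ptype
    return None

print(match_panel_type("Acme 540W Mono PERC Bifacial Solar Panel"))
print(match_panel_type("Acme 335W 24V Polycrystalline Solar Panel"))
```

Switching to this ordering would change some of the labels shown in the outputs below, so it is a trade-off rather than a drop-in replacement.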


In [9]:
df = pd.read_csv('modified_a1.csv')

In [10]:
df.head()

Unnamed: 0,Sr.No.,Brand Name,Panel Type,Rated Power,Rated Voltage,Actual Price,Purchase Price,Discount,Rating,Product Description
0,1,Luminous,Mono PERC,550 Watt,24 Volt,42000,16739,60% OFF,4.6,The Luminous Solar Panel is a high-quality sol...
1,2,Waaree,Bifacial,540 Watt,,28869,12799,55% OFF,4.0,The Waaree Bi-55-540 540W 144 Cells Framed Dua...
2,3,Solar,Polycrystalline,335 Watt,24 Volt,16000,10899,31% OFF,4.5,Looking for a top-of-the-line solar panel to p...
3,4,Waaree,Monocrystalline,540 Watt,,27279,12699,53% OFF,4.5,The Waaree 540W 144 Cells Monocrystalline PERC...
4,5,Waaree,Bifacial,535 Watt,,22399,12899,42% OFF,4.6,Introducing the Waaree 535Wp 144 Cells Framed ...


In [11]:
df.isnull().mean() * 100

Sr.No.                  0.000000
Brand Name              0.000000
Panel Type             30.000000
Rated Power             0.000000
Rated Voltage          40.427928
Actual Price            0.000000
Purchase Price          0.000000
Discount                0.000000
Rating                  0.000000
Product Description     0.000000
dtype: float64

### Find the most frequent value (the mode) in the 'Rated Voltage' and 'Panel Type' columns with the pandas `mode()` function

In [12]:
mode_rated_voltage = df['Rated Voltage'].mode()[0]
mode_panel_type = df['Panel Type'].mode()[0]

print(f"The most frequent value in 'Rated Voltage' column is: {mode_rated_voltage}")
print(f"The most frequent value in 'Panel Type' column is: {mode_panel_type}")

The most frequent value in 'Rated Voltage' column is: 24 Volt
The most frequent value in 'Panel Type' column is: Polycrystalline


### Fill the missing categorical values with the most frequent value (mode)

In [16]:
fill_values = {
    'Rated Voltage': '24 Volt',
    'Panel Type': 'Polycrystalline',   
}

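# Fill missing values in each column with that column's most frequent value
df.fillna(value=fill_values, inplace=True)
# index=False avoids writing the row index as an extra unnamed column
df.to_csv('modified_a2.csv', index=False)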
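
The fill values above are copied by hand from the `mode()` output. A minimal sketch of computing them directly, so they stay in sync if the data changes, could look like the following. The toy frame is invented for illustration, and `columns_to_fill` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both categorical columns.
toy = pd.DataFrame({
    "Rated Voltage": ["24 Volt", np.nan, "12 Volt", "24 Volt"],
    "Panel Type": [np.nan, "Polycrystalline", "Polycrystalline", "Mono PERC"],
})

columns_to_fill = ["Rated Voltage", "Panel Type"]

# mode() ignores NaN by default; [0] picks the first mode if there is a tie.
fill_values = {col: toy[col].mode()[0] for col in columns_to_fill}
filled = toy.fillna(value=fill_values)

print(fill_values)
print(filled)
```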

In [21]:
df = pd.read_csv('modified_a2.csv')

In [22]:
df.isnull().mean() * 100

Sr.No.                 0.0
Brand Name             0.0
Panel Type             0.0
Rated Power            0.0
Rated Voltage          0.0
Actual Price           0.0
Purchase Price         0.0
Discount               0.0
Rating                 0.0
Product Description    0.0
dtype: float64

In [23]:
df.head(20)

Unnamed: 0,Sr.No.,Brand Name,Panel Type,Rated Power,Rated Voltage,Actual Price,Purchase Price,Discount,Rating,Product Description
0,1,Luminous,Mono PERC,550 Watt,24 Volt,42000,16739,60% OFF,4.6,The Luminous Solar Panel is a high-quality sol...
1,2,Waaree,Bifacial,540 Watt,24 Volt,28869,12799,55% OFF,4.0,The Waaree Bi-55-540 540W 144 Cells Framed Dua...
2,3,Solar,Polycrystalline,335 Watt,24 Volt,16000,10899,31% OFF,4.5,Looking for a top-of-the-line solar panel to p...
3,4,Waaree,Monocrystalline,540 Watt,24 Volt,27279,12699,53% OFF,4.5,The Waaree 540W 144 Cells Monocrystalline PERC...
4,5,Waaree,Bifacial,535 Watt,24 Volt,22399,12899,42% OFF,4.6,Introducing the Waaree 535Wp 144 Cells Framed ...
5,6,Luminous,Polycrystalline,170 Watt,24 Volt,15000,5799,61% OFF,4.6,The Luminous LUM 12170 Solar Panel is a highly...
6,7,Luminous,Polycrystalline,170 Watt,24 Volt,30000,12639,57% OFF,4.6,Elevate Your Solar Energy Game with the Lumino...
7,8,Solar,Monocrystalline,180 Watt,12 Volt,9750,8699,10% OFF,4.4,The Solar Universe India 180W 12V Monocrystall...
8,9,UTL,Polycrystalline,165 Watt,12 Volt,9570,6099,36% OFF,4.2,This UTL 165W 12V Polycrystalline Solar PV Pan...
9,10,Luminous,Polycrystalline,165 Watt,12 Volt,16736,14709,12% OFF,4.6,The Luminous 165W 12V Polycrystalline Solar PV...
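
The 'Rated Power', 'Rated Voltage', and 'Discount' columns above still hold strings such as `550 Watt` or `60% OFF`. If numeric values are needed for the plotting step listed at the top, one possible approach (not part of the original notebook) is to extract the leading digits. The toy rows and the snake_case column names below are hypothetical.

```python
import pandas as pd

# Hypothetical rows in the same string format as the cleaned data.
toy = pd.DataFrame({
    "Rated Power": ["550 Watt", "335 Watt"],
    "Rated Voltage": ["24 Volt", "12 Volt"],
    "Discount": ["60% OFF", "31% OFF"],
})

def leading_int(series):
    """Extract the first run of digits from each string and convert it to int."""
    return series.str.extract(r"(\d+)", expand=False).astype(int)

numeric = pd.DataFrame({
    "rated_power_w": leading_int(toy["Rated Power"]),
    "rated_voltage_v": leading_int(toy["Rated Voltage"]),
    "discount_pct": leading_int(toy["Discount"]),
})
print(numeric.dtypes)
print(numeric)
```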
