# Python Phase - Product

This notebook is the skeleton of our Construct Week project. It contains the main steps of the analysis and then connects to a MySQL database, where the cleaned files are imported so that the final dashboard can be built on top of them.

For this phase we have a total of 7 datasets, all of which are unclean: the values are not aligned and the results are clustered together. To address this, each dataset has first been rearranged so that the data sits in its proper columns.

We now begin the basic EDA (Exploratory Data Analysis) and make sure each dataset is cleaned and ready to be loaded into the database and then used in the dashboard.

In [52]:
# Importing all the essential libraries for the analysis to be done.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Importing libraries to connect and inject all the values from the dataset into the database server.
from sqlalchemy import create_engine
import mysql.connector

# Importing the webcolors library to translate hex color codes into color names.
import webcolors

In [53]:
# Creating a connector so that the server can be connected here.
db_connector = mysql.connector.connect(
    host = "127.0.0.1",       
    username = "root",
    password = "MySQL12345",
    database = "patternseekers"
)

# A custom message that displays if the operation has been successful.
print("You have successfully connected to your database.")

You have successfully connected to your database.


In [54]:
# Creating a SQLAlchemy engine so that the dataframes prepared here can be written to the database.
engine = create_engine("mysql+mysqlconnector://root:MySQL12345@127.0.0.1/patternseekers")
print("The connection to the MySQL Engine is now functional.")

The connection to the MySQL Engine is now functional.
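
The two cells above keep the credentials directly in the notebook. As a minimal sketch, assuming the values are supplied through environment variables (the names `DB_USER`, `DB_PASSWORD`, `DB_HOST` and `DB_NAME` are hypothetical, as is the helper `build_mysql_url`), the same connection URL could be assembled like this:

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database):
    """Assemble a SQLAlchemy-style URL for the mysql-connector driver."""
    # quote_plus escapes characters such as '@' or '/' that would otherwise break the URL.
    return f"mysql+mysqlconnector://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# Hypothetical environment variable names; the fallbacks are placeholders, not real credentials.
db_url = build_mysql_url(
    user=os.environ.get("DB_USER", "root"),
    password=os.environ.get("DB_PASSWORD", "change-me"),
    host=os.environ.get("DB_HOST", "127.0.0.1"),
    database=os.environ.get("DB_NAME", "patternseekers"),
)
```

The resulting string can be passed to `create_engine(db_url)` in place of a hard-coded URL.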


In [3]:
# Locating the dataset path and assigning it to a new dataframe.
file_path = "Product [FIXED].csv"
product_df = pd.read_csv(file_path)

# Displaying the dataframe to check out the table. 
product_df

Unnamed: 0,ProductKey,Product,Standard Cost,Color,Subcategory,Category,Background Color Format,Font Color Format
0,210,"HL Road Frame - Black, 58",$868.63,Black,Road Frames,Components,#000000,#FFFFFF
1,215,"Sport-100 Helmet, Black",$12.03,Black,Helmets,Accessories,#000000,#FFFFFF
2,216,"Sport-100 Helmet, Black",$13.88,Black,Helmets,Accessories,#000000,#FFFFFF
3,217,"Sport-100 Helmet, Black",$13.09,Black,Helmets,Accessories,#000000,#FFFFFF
4,253,"LL Road Frame - Black, 58",$176.2,Black,Road Frames,Components,#000000,#FFFFFF
...,...,...,...,...,...,...,...,...
392,594,"Mountain-500 Silver, 48",$308.22,Silver,Mountain Bikes,Bikes,#C0C0C0,#000000
393,595,"Mountain-500 Silver, 52",$308.22,Silver,Mountain Bikes,Bikes,#C0C0C0,#000000
394,601,LL Bottom Bracket,$23.97,,Bottom Brackets,Components,#DCDCDC,#000000
395,602,ML Bottom Bracket,$44.95,,Bottom Brackets,Components,#DCDCDC,#000000


In [46]:
# Displaying the basic information of the dataset. 
product_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ProductKey               397 non-null    int64  
 1   Product                  397 non-null    object 
 2   Standard Cost            397 non-null    float64
 3   Color                    397 non-null    object 
 4   Subcategory              397 non-null    object 
 5   Category                 397 non-null    object 
 6   Background Color Format  397 non-null    object 
 7   Font Color Format        397 non-null    object 
 8   Background Color Name    397 non-null    object 
 9   Font Color Name          397 non-null    object 
dtypes: float64(1), int64(1), object(8)
memory usage: 31.1+ KB


In [50]:
# Removing the '$' sign and converting the column to float so that numeric operations work during the EDA.
product_df['Standard Cost'] = product_df['Standard Cost'].replace(r'[\$]', '', regex=True).astype(float)

# Displaying the first 5 values from the table to verify if the values are all in the correct format.
product_df.head()

Unnamed: 0,ProductKey,Product,Standard Cost,Color,Subcategory,Category,Background Color Format,Font Color Format,Background Color Name,Font Color Name
0,210,"HL Road Frame - Black, 58",868.63,Black,Road Frames,Components,#000000,#FFFFFF,Black,White
1,215,"Sport-100 Helmet, Black",12.03,Black,Helmets,Accessories,#000000,#FFFFFF,Black,White
2,216,"Sport-100 Helmet, Black",13.88,Black,Helmets,Accessories,#000000,#FFFFFF,Black,White
3,217,"Sport-100 Helmet, Black",13.09,Black,Helmets,Accessories,#000000,#FFFFFF,Black,White
4,253,"LL Road Frame - Black, 58",176.2,Black,Road Frames,Components,#000000,#FFFFFF,Black,White
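
The conversion above assumes every value is a plain number with a leading `$`. A slightly more defensive sketch, assuming some values might also carry thousands separators, stray spaces or non-numeric text (the helper name `clean_cost` and the sample values are illustrative, not taken from the project data):

```python
import pandas as pd


def clean_cost(series):
    """Convert currency strings such as '$1,234.50' to floats."""
    # Drop the currency symbol and any thousands separators, then trim whitespace.
    stripped = series.astype(str).str.replace(r"[$,]", "", regex=True).str.strip()
    # errors="coerce" turns unparseable entries into NaN so they can be inspected later.
    return pd.to_numeric(stripped, errors="coerce")


# Illustrative values only.
sample = pd.Series(["$868.63", "$1,898.09", " $12.03 ", "n/a"])
cleaned = clean_cost(sample)
```

Counting `cleaned.isna().sum()` afterwards shows how many entries could not be parsed, instead of the whole cell failing on the first bad value.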


In [7]:
# Calling out the total number of NULL values present in the table and displaying how many are there.
product_df.isnull().sum()

ProductKey                  0
Product                     0
Standard Cost               0
Color                      56
Subcategory                 0
Category                    0
Background Color Format     0
Font Color Format           0
dtype: int64

In [45]:
# Identifying the data types to see what we will be dealing with.
product_df.dtypes

ProductKey                   int64
Product                     object
Standard Cost              float64
Color                       object
Subcategory                 object
Category                    object
Background Color Format     object
Font Color Format           object
Background Color Name       object
Font Color Name             object
dtype: object

In [10]:
# Replacing the NULL values with 'Unknown' in the 'Color' column so that they can be assigned to a particular category.
product_df['Color'] = product_df['Color'].fillna('Unknown')

# Calling out the total number of NULL values again to check if the NULL values have been replaced.
product_df.isnull().sum()

ProductKey                 0
Product                    0
Standard Cost              0
Color                      0
Subcategory                0
Category                   0
Background Color Format    0
Font Color Format          0
dtype: int64

In [12]:
# Removing any leading or trailing whitespace from the column names.
product_df.columns = product_df.columns.str.strip()

In [None]:
# Displaying a custom message to mention the number of duplicate values available (if they exist).
print(f'Duplicate values found in the following dataset are: {product_df.duplicated().sum()}')

Duplicate values found in the following dataset are: 0
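
`duplicated()` without arguments only flags rows that are identical in every column. As a small sketch on made-up rows, the `subset` parameter can be used to check a single column, such as the key, on its own:

```python
import pandas as pd

# Illustrative rows only: the product name repeats, but each key is distinct.
sample = pd.DataFrame({
    "ProductKey": [215, 216, 217],
    "Product": ["Sport-100 Helmet, Black"] * 3,
    "Standard Cost": [12.03, 13.88, 13.09],
})

# Rows identical in every column.
full_row_dupes = int(sample.duplicated().sum())
# Rows whose key has already appeared earlier in the frame.
key_dupes = int(sample.duplicated(subset=["ProductKey"]).sum())
# Rows whose product name has already appeared earlier in the frame.
name_dupes = int(sample.duplicated(subset=["Product"]).sum())
```

Here the name repeats while the key does not, so a full-row check alone would not reveal the repeated names.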


In [18]:
# Removing the duplicate values from the dataset (if they exist).
product_df = product_df.drop_duplicates()
product_df

Unnamed: 0,ProductKey,Product,Standard Cost,Color,Subcategory,Category,Background Color Format,Font Color Format
0,210,"HL Road Frame - Black, 58",868.63,Black,Road Frames,Components,#000000,#FFFFFF
1,215,"Sport-100 Helmet, Black",12.03,Black,Helmets,Accessories,#000000,#FFFFFF
2,216,"Sport-100 Helmet, Black",13.88,Black,Helmets,Accessories,#000000,#FFFFFF
3,217,"Sport-100 Helmet, Black",13.09,Black,Helmets,Accessories,#000000,#FFFFFF
4,253,"LL Road Frame - Black, 58",176.20,Black,Road Frames,Components,#000000,#FFFFFF
...,...,...,...,...,...,...,...,...
392,594,"Mountain-500 Silver, 48",308.22,Silver,Mountain Bikes,Bikes,#C0C0C0,#000000
393,595,"Mountain-500 Silver, 52",308.22,Silver,Mountain Bikes,Bikes,#C0C0C0,#000000
394,601,LL Bottom Bracket,23.97,Unknown,Bottom Brackets,Components,#DCDCDC,#000000
395,602,ML Bottom Bracket,44.95,Unknown,Bottom Brackets,Components,#DCDCDC,#000000


In [24]:
# Checking out the summary of the following dataset in terms of statistics.
product_df.describe(include='all')

Unnamed: 0,ProductKey,Product,Standard Cost,Color,Subcategory,Category,Background Color Format,Font Color Format
count,397.0,397,397.0,397,397,397,397,397
unique,,295,,10,37,4,10,2
top,,"LL Road Frame - Black, 44",,Black,Road Frames,Components,#000000,#FFFFFF
freq,,3,,129,70,189,129,228
mean,408.0,,436.823073,,,,,
std,114.748275,,497.343079,,,,,
min,210.0,,0.86,,,,,
25%,309.0,,37.12,,,,,
50%,408.0,,204.63,,,,,
75%,507.0,,660.91,,,,,


In [42]:
# Mapping each hex color code to a color name with the webcolors library so it can be stored in a new column.
def get_color_name(hex_code):
    try:
        return webcolors.hex_to_name(hex_code).capitalize()  # Exact color name
    except ValueError:
        # No exact match: pick the CSS3 color with the smallest squared RGB distance.
        # CSS3_HEX_TO_NAMES maps hex -> name and may be missing in newer webcolors releases.
        r1, g1, b1 = webcolors.hex_to_rgb(hex_code)
        closest_name = None
        min_distance = float('inf')

        for hex_value, name in webcolors.CSS3_HEX_TO_NAMES.items():
            r2, g2, b2 = webcolors.hex_to_rgb(hex_value)
            distance = ((r1 - r2) ** 2) + ((g1 - g2) ** 2) + ((b1 - b2) ** 2)

            if distance < min_distance:
                min_distance = distance
                closest_name = name

        return closest_name.capitalize()
    
# Apply function to detect color names
product_df['Background Color Name'] = product_df['Background Color Format'].apply(get_color_name)
product_df['Font Color Name'] = product_df['Font Color Format'].apply(get_color_name)

# Displaying the dataset to check if the colors have been labelled for the color formats.
product_df

Unnamed: 0,ProductKey,Product,Standard Cost,Color,Subcategory,Category,Background Color Format,Font Color Format,Background Color Name,Font Color Name
0,210,"HL Road Frame - Black, 58",868.63,Black,Road Frames,Components,#000000,#FFFFFF,Black,White
1,215,"Sport-100 Helmet, Black",12.03,Black,Helmets,Accessories,#000000,#FFFFFF,Black,White
2,216,"Sport-100 Helmet, Black",13.88,Black,Helmets,Accessories,#000000,#FFFFFF,Black,White
3,217,"Sport-100 Helmet, Black",13.09,Black,Helmets,Accessories,#000000,#FFFFFF,Black,White
4,253,"LL Road Frame - Black, 58",176.20,Black,Road Frames,Components,#000000,#FFFFFF,Black,White
...,...,...,...,...,...,...,...,...,...,...
392,594,"Mountain-500 Silver, 48",308.22,Silver,Mountain Bikes,Bikes,#C0C0C0,#000000,Silver,Black
393,595,"Mountain-500 Silver, 52",308.22,Silver,Mountain Bikes,Bikes,#C0C0C0,#000000,Silver,Black
394,601,LL Bottom Bracket,23.97,Unknown,Bottom Brackets,Components,#DCDCDC,#000000,Gainsboro,Black
395,602,ML Bottom Bracket,44.95,Unknown,Bottom Brackets,Components,#DCDCDC,#000000,Gainsboro,Black
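
Every hex code in this dataset has an exact CSS3 name, so the nearest-color fallback is never exercised in the output above. The idea behind it can be sketched without the webcolors dependency, assuming a small hand-written palette (`PALETTE`, `hex_to_rgb` and `nearest_color_name` are illustrative names, not part of any library):

```python
def hex_to_rgb(hex_code):
    """Convert '#RRGGBB' to an (r, g, b) tuple of integers."""
    value = hex_code.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def nearest_color_name(hex_code, palette):
    """Return the palette name with the smallest squared RGB distance."""
    target = hex_to_rgb(hex_code)

    def squared_distance(name):
        candidate = hex_to_rgb(palette[name])
        return sum((a - b) ** 2 for a, b in zip(target, candidate))

    return min(palette, key=squared_distance)


# Hypothetical four-color palette standing in for the full CSS3 table.
PALETTE = {
    "Black": "#000000",
    "White": "#FFFFFF",
    "Silver": "#C0C0C0",
    "Gainsboro": "#DCDCDC",
}
```

For example, `nearest_color_name("#DCDCDD", PALETTE)` resolves to `"Gainsboro"` because that entry differs by a single unit in the blue channel. Squared distance is enough for ranking, since taking the square root would not change which color is closest.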


In [51]:
# Finding out if there are any outliers in the dataset.

# Calculating the Quartiles and InterQuartile Range (IQR).
Q1 = product_df['Standard Cost'].quantile(0.25)
Q3 = product_df['Standard Cost'].quantile(0.75)
IQR = Q3 - Q1

# Identifying the outliers.
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = product_df[(product_df['Standard Cost'] < lower_bound) | (product_df['Standard Cost'] > upper_bound)]

# Displaying a custom message with the number of outliers found, followed by the rows that contain them.
print(f'Number of outliers that can be found from this dataset are: {len(outliers)}')
outliers

Number of outliers that can be found from this dataset are: 13


Unnamed: 0,ProductKey,Product,Standard Cost,Color,Subcategory,Category,Background Color Format,Font Color Format,Background Color Name,Font Color Name
45,348,"Mountain-100 Black, 38",1898.09,Black,Mountain Bikes,Bikes,#000000,#FFFFFF,Black,White
46,349,"Mountain-100 Black, 42",1898.09,Black,Mountain Bikes,Bikes,#000000,#FFFFFF,Black,White
47,350,"Mountain-100 Black, 44",1898.09,Black,Mountain Bikes,Bikes,#000000,#FFFFFF,Black,White
48,351,"Mountain-100 Black, 48",1898.09,Black,Mountain Bikes,Bikes,#000000,#FFFFFF,Black,White
196,310,"Road-150 Red, 62",2171.29,Red,Road Bikes,Bikes,#FF0000,#FFFFFF,Red,White
197,311,"Road-150 Red, 44",2171.29,Red,Road Bikes,Bikes,#FF0000,#FFFFFF,Red,White
198,312,"Road-150 Red, 48",2171.29,Red,Road Bikes,Bikes,#FF0000,#FFFFFF,Red,White
199,313,"Road-150 Red, 52",2171.29,Red,Road Bikes,Bikes,#FF0000,#FFFFFF,Red,White
200,314,"Road-150 Red, 56",2171.29,Red,Road Bikes,Bikes,#FF0000,#FFFFFF,Red,White
218,344,"Mountain-100 Silver, 38",1912.15,Silver,Mountain Bikes,Bikes,#C0C0C0,#000000,Silver,Black
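
To make the IQR rule concrete, here is a small worked sketch on made-up numbers (the helper name `iqr_outliers` and the sample values are illustrative only). It applies the same 1.5 × IQR fences as the cell above:

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] together with the fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)], (lower, upper)


# Illustrative numbers only: Q1 = 2, Q3 = 4, IQR = 2, so the fences are -1 and 7.
sample = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
flagged, (lower, upper) = iqr_outliers(sample)
```

Only the value 100 falls outside the fences here. A flagged row is not necessarily an error; in a product table it may simply be a premium item.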


In [57]:
# Replacing spaces in the column names with underscores before pushing to the database, to avoid errors while querying in MySQL.
product_df.columns = product_df.columns.str.replace(' ', '_')

# Pushing all the data into the MySQL database.
product_df.to_sql(
    name = 'Products',
    con=engine,
    index = False,
    if_exists = 'append'
)

# Custom message to ensure the operation has been completed successfully.
print("Table 'Products' has been created and data has been inserted successfully.")

Table 'Products' has been created and data has been inserted successfully.
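
Because the cell above uses `if_exists = 'append'`, running it more than once adds the same rows again. The sketch below illustrates the difference between `'append'` and `'replace'`, using an in-memory SQLite database purely as a stand-in for the MySQL server (the sample rows and the helper `load_and_count` are illustrative):

```python
import sqlite3

import pandas as pd

# In-memory SQLite database used purely as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")

# Illustrative rows only.
sample = pd.DataFrame({
    "ProductKey": [210, 215],
    "Standard_Cost": [868.63, 12.03],
})


def load_and_count(frame, mode):
    """Write the frame with the given if_exists mode and return the table's row count."""
    frame.to_sql(name="Products", con=conn, index=False, if_exists=mode)
    counts = pd.read_sql_query("SELECT COUNT(*) AS n FROM Products", conn)
    return int(counts["n"].iloc[0])


first = load_and_count(sample, "replace")     # table created with the sample rows
appended = load_and_count(sample, "append")   # a second run with append adds them again
replaced = load_and_count(sample, "replace")  # replace rebuilds the table from scratch
```

Whether `'replace'` is appropriate depends on whether other tables or constraints rely on the existing table, so it is a design choice rather than a default.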


