# Python Phase - Target

This notebook is the skeleton of our Construct Week project. It holds the main analysis, then connects to a MySQL database and loads the cleaned data so that a final dashboard can be built on top of it.

This phase covers 7 datasets, all of them unclean: their values were misaligned and clustered together. As a first step, each dataset was rearranged so that every value sits in its proper column.

We now begin the basic EDA (Exploratory Data Analysis) and clean each dataset so that it is ready to be loaded into the database and then used for the dashboard.

In [23]:
# Importing all the essential libraries for the analysis to be done.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Importing libraries to connect and inject all the values from the dataset into the database server.
from sqlalchemy import create_engine
import mysql.connector

In [24]:
# Creating a connector so that the server can be connected here.
db_connector = mysql.connector.connect(
    host = "127.0.0.1",       
    username = "root",
    password = "MySQL12345",
    database = "patternseekers"
)

# Reporting success only if the connection is actually open.
if db_connector.is_connected():
    print("You have successfully connected to your database.")

You have successfully connected to your database.


In [25]:
# Creating a SQLAlchemy engine so that pandas can write the records into the database.
# A plain string is used here: nesting double quotes inside a double-quoted f-string is a syntax error before Python 3.12.
engine = create_engine("mysql+mysqlconnector://root:MySQL12345@127.0.0.1/patternseekers")
print("The connection to the MySQL Engine is now functional.")

The connection to the MySQL Engine is now functional.
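
Hard-coding credentials in a notebook makes them easy to leak. Below is a minimal sketch of one alternative, assuming the credentials are stored in environment variables. The variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_NAME`) and the helper `build_mysql_url` are hypothetical and not part of the original project.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(env=os.environ):
    """Build a SQLAlchemy-style MySQL URL from environment variables.

    The variable names are illustrative assumptions.
    """
    user = env.get("DB_USER", "root")
    # quote_plus escapes characters such as '@' or '/' that would break the URL.
    password = quote_plus(env.get("DB_PASSWORD", ""))
    host = env.get("DB_HOST", "127.0.0.1")
    database = env.get("DB_NAME", "patternseekers")
    return f"mysql+mysqlconnector://{user}:{password}@{host}/{database}"


# Example with an explicit mapping instead of the real environment.
example_url = build_mysql_url({"DB_USER": "root", "DB_PASSWORD": "p@ss/word"})
print(example_url)
```

The resulting string could be passed to `create_engine` in place of a literal URL.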


In [5]:
# Locating the dataset path and assigning it to a new dataframe.
file_path = "Target [FIXED].csv"
target_df = pd.read_csv(file_path)

# Displaying the dataframe to check out the table. 
target_df

Unnamed: 0,EmployeeID,Target,TargetDate
0,90836195,"$500,000","Friday, December 1, 2017"
1,112432117,"$500,000","Saturday, July 1, 2017"
2,139397894,"$500,000","Friday, December 1, 2017"
3,191644724,"$500,000","Friday, September 1, 2017"
4,502097814,"$500,000","Saturday, July 1, 2017"
...,...,...,...
804,954276278,"$75,000","Monday, February 1, 2021"
805,954276278,"$100,000","Monday, March 1, 2021"
806,954276278,"$125,000","Thursday, April 1, 2021"
807,954276278,"$150,000","Saturday, May 1, 2021"


In [6]:
# Displaying the basic information of the dataset. 
target_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 809 entries, 0 to 808
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   EmployeeID  809 non-null    int64 
 1   Target      809 non-null    object
 2   TargetDate  809 non-null    object
dtypes: int64(1), object(2)
memory usage: 19.1+ KB


In [8]:
# Removing the '$' and the thousands separators, then converting to float, so that the column can be analysed numerically during the EDA.
target_df['Target'] = target_df['Target'].replace(r'[\$,]', '', regex=True).astype(float)

# Displaying the first 5 values from the table to verify if the values are all in the correct format.
target_df.head()

Unnamed: 0,EmployeeID,Target,TargetDate
0,90836195,500000.0,"Friday, December 1, 2017"
1,112432117,500000.0,"Saturday, July 1, 2017"
2,139397894,500000.0,"Friday, December 1, 2017"
3,191644724,500000.0,"Friday, September 1, 2017"
4,502097814,500000.0,"Saturday, July 1, 2017"
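
The `astype(float)` conversion raises an error on the first value it cannot parse. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors="coerce"` so that malformed entries become `NaN` and can be inspected. The sample values, including the malformed `"N/A"`, are made up for illustration.

```python
import pandas as pd

# Toy data; the last entry is deliberately malformed.
raw = pd.Series(["$500,000", "$75,000", "N/A"])

# Strip '$' and ',' then convert; errors="coerce" turns unparseable values into NaN.
cleaned = pd.to_numeric(raw.str.replace(r"[\$,]", "", regex=True), errors="coerce")

# Rows that failed to parse can be reviewed before deciding how to handle them.
bad_rows = raw[cleaned.isna()]
print(cleaned.tolist())
print(bad_rows.tolist())
```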


In [9]:
# Counting the NULL values in each column.
target_df.isnull().sum()

EmployeeID    0
Target        0
TargetDate    0
dtype: int64

In [10]:
# Identifying the data types to see what we will be dealing with.
target_df.dtypes

EmployeeID      int64
Target        float64
TargetDate     object
dtype: object

In [11]:
# Counting fully duplicated rows in the dataset (if any exist).
target_df.duplicated().sum()

np.int64(0)
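
`duplicated()` with no arguments only flags rows that match on every column. If the intent is one target per employee per date (an assumption; the source does not state the table's key), a key-based check may catch more. A small sketch with made-up values:

```python
import pandas as pd

# Toy frame mirroring the notebook's column names; the values are made up.
toy_df = pd.DataFrame({
    "EmployeeID": [1, 1, 2],
    "TargetDate": ["2021-01-01", "2021-01-01", "2021-01-01"],
    "Target": [100.0, 250.0, 100.0],
})

# No row is an exact copy of another, so the full-row check reports 0.
full_row_dupes = int(toy_df.duplicated().sum())

# Assumed business key: one target per employee per date.
key_dupes = int(toy_df.duplicated(subset=["EmployeeID", "TargetDate"]).sum())

print(full_row_dupes, key_dupes)
```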

In [16]:
# Converting 'TargetDate' to datetime. '%A' matches the full weekday name (e.g. 'Friday'); '%a' only matches abbreviations such as 'Fri'.
target_df['TargetDate'] = pd.to_datetime(target_df['TargetDate'], format='%A, %B %d, %Y')
target_df['TargetDay'] = target_df['TargetDate'].dt.day_name()

# Displaying the first 5 rows to verify if the changes have been made.
target_df.head()

Unnamed: 0,EmployeeID,Target,TargetDate,TargetDay
0,90836195,500000.0,2017-12-01,Friday
1,112432117,500000.0,2017-07-01,Saturday
2,139397894,500000.0,2017-12-01,Friday
3,191644724,500000.0,2017-09-01,Friday
4,502097814,500000.0,2017-07-01,Saturday
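
Because the raw strings carry their own weekday, they can be cross-checked against the weekday derived from the parsed date. The sketch below assumes every value follows the "Weekday, Month D, YYYY" layout; the second sample string is deliberately wrong to show a mismatch.

```python
import pandas as pd

# The second entry states the wrong weekday on purpose.
raw_dates = pd.Series(["Friday, December 1, 2017", "Monday, July 1, 2017"])

# Split "Weekday, Month D, YYYY" at the first comma only.
parts = raw_dates.str.split(", ", n=1, expand=True)
stated_day = parts[0]
parsed = pd.to_datetime(parts[1], format="%B %d, %Y")

# True where the weekday written in the text disagrees with the calendar.
mismatch = stated_day != parsed.dt.day_name()
print(mismatch.tolist())
```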


In [17]:
# Identifying the data types to see what we will be dealing with.
target_df.dtypes

EmployeeID             int64
Target               float64
TargetDate    datetime64[ns]
TargetDay             object
dtype: object

In [18]:
# Removing leading and trailing whitespace from the text columns.
target_df = target_df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

In [19]:
# Keeping only the rows with non-negative targets; .copy() avoids a SettingWithCopyWarning on later assignments.
target_df = target_df[target_df['Target'] >= 0].copy()

In [21]:
# Rounding the numerical values in the target column to 2 decimal places.
target_df['Target'] = target_df['Target'].round(2)

# Displaying the dataset to see if the changes have been made. 
target_df

Unnamed: 0,EmployeeID,Target,TargetDate,TargetDay
0,90836195,500000.0,2017-12-01,Friday
1,112432117,500000.0,2017-07-01,Saturday
2,139397894,500000.0,2017-12-01,Friday
3,191644724,500000.0,2017-09-01,Friday
4,502097814,500000.0,2017-07-01,Saturday
...,...,...,...,...
804,954276278,75000.0,2021-02-01,Monday
805,954276278,100000.0,2021-03-01,Monday
806,954276278,125000.0,2021-04-01,Thursday
807,954276278,150000.0,2021-05-01,Saturday


In [22]:
# Finding out if there are any outliers in the dataset.

# Calculating the Quartiles and InterQuartile Range (IQR).
Q1 = target_df['Target'].quantile(0.25)
Q3 = target_df['Target'].quantile(0.75)
IQR = Q3 - Q1

# Identifying the outliers.
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = target_df[(target_df['Target'] < lower_bound) | (target_df['Target'] > upper_bound)]

# Displaying the number of outliers found, along with the rows that contain them.
print(f'Number of outliers that can be found from this dataset are: {len(outliers)}')
outliers

Number of outliers that can be found from this dataset are: 87


Unnamed: 0,EmployeeID,Target,TargetDate,TargetDay
389,112432117,2200000.0,2017-11-01,Wednesday
390,112432117,1750000.0,2017-12-01,Friday
410,502097814,2200000.0,2017-11-01,Wednesday
411,502097814,1750000.0,2017-12-01,Friday
430,61161660,1750000.0,2018-09-01,Saturday
...,...,...,...,...
792,502097814,3000000.0,2021-08-01,Sunday
793,502097814,3500000.0,2021-09-01,Wednesday
794,502097814,2500000.0,2021-10-01,Friday
795,502097814,3000000.0,2021-11-01,Monday
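
The IQR rule above can be followed by hand on a small example. The numbers below are made up and the quartiles rely on pandas' default linear interpolation.

```python
import pandas as pd

# Made-up targets: five ordinary values and one extreme value.
targets = pd.Series([100, 200, 300, 400, 500, 5000], dtype=float)

q1 = targets.quantile(0.25)   # 225.0 with the default linear interpolation
q3 = targets.quantile(0.75)   # 475.0
iqr = q3 - q1                 # 250.0

lower_bound = q1 - 1.5 * iqr  # -150.0
upper_bound = q3 + 1.5 * iqr  # 850.0

flagged = targets[(targets < lower_bound) | (targets > upper_bound)]
print(flagged.tolist())       # [5000.0]
```

A flagged value is not necessarily an error. It only lies far from the middle half of the data, so large but legitimate targets can be flagged as well.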


In [28]:
# Pushing all the data into the MySQL database. With if_exists='append', re-running this cell inserts the same rows again.
target_df.to_sql(
    name = 'Targets',
    con=engine,
    index = False,
    if_exists = 'append'
)

# Custom message to ensure the operation has been completed successfully.
print("Table 'Targets' has been created and data has been inserted successfully.")

Table 'Targets' has been created and data has been inserted successfully.
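
The effect of `if_exists` on repeated runs can be tried without touching MySQL. The sketch below uses an in-memory SQLite database purely as a stand-in, with a made-up two-row frame; the helper `row_count` is hypothetical.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for MySQL here purely for illustration.
conn = sqlite3.connect(":memory:")
toy_df = pd.DataFrame({"EmployeeID": [1, 2], "Target": [100.0, 200.0]})


def row_count(connection):
    """Return the number of rows currently in the Targets table."""
    return connection.execute("SELECT COUNT(*) FROM Targets").fetchone()[0]


# Running an 'append' write twice duplicates every row.
toy_df.to_sql("Targets", conn, index=False, if_exists="append")
toy_df.to_sql("Targets", conn, index=False, if_exists="append")
after_append = row_count(conn)

# 'replace' drops and recreates the table, so the write is repeatable.
toy_df.to_sql("Targets", conn, index=False, if_exists="replace")
after_replace = row_count(conn)

print(after_append, after_replace)
conn.close()
```

Whether `'replace'` is appropriate depends on whether the table should keep rows from earlier loads, which the source does not specify.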


