# A POWER BI DASHBOARD PROJECT

## Step 1: Business Understanding

#### Team: Team Namibia

#### Problem Statement:
Team Namibia has been assigned to design and deliver an end-to-end business intelligence solution. Our client has collected transactional data for the year 2019 but hasn’t been able to put it to good use. The client hopes we can analyze the data and put together a report to help them find opportunities to drive more sales and work more efficiently. 

#### Objective
In this analysis, we aim to:
- Identify key trends and patterns in the 2019 transactional data.
- Analyze sales performance and uncover opportunities to drive more sales.
- Evaluate product performance and categorize products based on price levels.
- Identify cities with the highest product deliveries.
- Identify ways for the client to work more efficiently.

#### Analytical Questions
1. How much money did we make this year? 
   
2. Can we identify any seasonality in the sales? 

3. What are our best and worst-selling products?

4. How do sales compare to previous months or weeks?

5. Which cities are our products delivered to most?

6. How do product categories compare in revenue generated and quantities ordered?

7. What additional insights can we draw from the data?

- NB: Products with unit prices above $99.99 should be labeled as high-level products; all other products should be labeled as basic-level products.
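The labeling rule in the NB above can be sketched as a small function. This is a minimal illustration rather than the project's implementation: the function name `price_level`, the label strings, and the sample prices are assumptions made for the example; only the $99.99 threshold comes from the brief.

```python
# Threshold taken from the NB above; everything else is illustrative.
HIGH_LEVEL_THRESHOLD = 99.99


def price_level(unit_price):
    """Return the price-level label for a single unit price."""
    # Strictly above the threshold counts as high-level; 99.99 itself is basic.
    if unit_price > HIGH_LEVEL_THRESHOLD:
        return "High-Level"
    return "Basic Level"


# Made-up prices, chosen to show both sides of the boundary
for price in (11.95, 99.99, 150.0):
    print(price, price_level(price))
```

On a pandas DataFrame the same rule could be applied to the unit-price column, for example with `Series.apply(price_level)`.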

#### Hypothesis
- Null Hypothesis (H0): There is no significant difference in Amount among the groups (columns) of the factors being tested.
- Alternative Hypothesis (H1): There is a significant difference in Amount among the groups (columns) of the factors being tested.
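The hypotheses above do not name a statistical test. One common choice for comparing a numeric amount across several groups is a one-way ANOVA, so the sketch below computes its F statistic by hand. Treat it as an assumption about how the test could be run, not as the method used in this project; the function name `one_way_anova_f` and the sample groups are made up for illustration.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(group) for group in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")

    grand_mean = mean(x for group in groups for x in group)

    # Variation of the group means around the grand mean
    ss_between = sum(len(group) * (mean(group) - grand_mean) ** 2 for group in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(group)) ** 2 for group in groups for x in group)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Made-up amounts for three hypothetical groups
sample_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_statistic = one_way_anova_f(sample_groups)
print(f_statistic)
```

A large F statistic relative to the F distribution with (k - 1, N - k) degrees of freedom would count as evidence against H0. The sketch does not compute a p-value, and it fails with a division by zero when every group has no internal variation.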

# Step 2: Data Understanding

Sales data was collected for every month of 2019.
The data for the first half of the year (January to June) was collected in Excel and saved as CSV files. Management then decided to store data in a database, which holds the data for the second half of the year (July to December).

## Load Data

### Import packages

In [1]:
# pyodbc handles ODBC database connections
import pyodbc
# dotenv_values loads environment variables from a .env file
from dotenv import dotenv_values

# Suppress warning messages to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# Data handling and visualization
import pandas as pd
import numpy as np
import seaborn as sns

# File discovery
import glob
import os

### Load CSV files

In [22]:
# Define the folder path
folder_path = r'C:\Users\Pc\Desktop\Data analysis\Azubi Africa\Career Accelerator\Capstone\Power-BI-Dashboard-Project\Data'

# Get a list of all CSV files
csv_files = glob.glob(os.path.join(folder_path, '*.csv'))

# Load all CSV files and concatenate them into a single DataFrame
first_half = pd.concat([pd.read_csv(file) for file in csv_files], ignore_index=True)

first_half.shape

(85625, 6)
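Concatenating the monthly files only yields six clean columns if every CSV has the same header row. A quick check for that, using only the standard library, might look like the sketch below. The helper names `csv_headers` and `inconsistent_files`, the UTF-8 encoding, and the demo files with their column names are assumptions for illustration; they are not taken from the project data.

```python
import csv
import glob
import os
import tempfile


def csv_headers(folder_path):
    """Map each CSV file name in folder_path to its header row."""
    headers = {}
    for path in sorted(glob.glob(os.path.join(folder_path, "*.csv"))):
        # Assumes UTF-8 files; adjust the encoding if the exports differ
        with open(path, newline="", encoding="utf-8") as handle:
            headers[os.path.basename(path)] = next(csv.reader(handle), [])
    return headers


def inconsistent_files(headers):
    """Return the file names whose header differs from the first file's header."""
    if not headers:
        return []
    reference = next(iter(headers.values()))
    return [name for name, columns in headers.items() if columns != reference]


# Demo on throwaway files with made-up content
with tempfile.TemporaryDirectory() as demo_folder:
    samples = {
        "a.csv": "Order ID,Product\n1,Demo\n",
        "b.csv": "Order ID,Product\n2,Demo\n",
        "c.csv": "Order_ID,Product\n3,Demo\n",
    }
    for name, text in samples.items():
        with open(os.path.join(demo_folder, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    mismatched = inconsistent_files(csv_headers(demo_folder))

print(mismatched)
```

An empty result means all headers match, so the concatenated DataFrame will not gain extra, mostly empty columns.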

### Load Database

#### Establishing a connection to the SQL database

In [3]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')

# Get the values for the credentials you set in the .env file
database = environment_variables.get("DATABASE")
server = environment_variables.get("SERVER")
username = environment_variables.get("UID")
password = environment_variables.get("PWD")

# Create the connection string using the retrieved credentials
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password};MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"
# The connection string contains the password, so it is not echoed in the notebook output
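Because the connection string embeds the password, it is safer to validate the `.env` values up front and to mask the password before displaying the string. The sketch below shows one way to do both. The helper names `build_connection_string` and `mask_password` and the demo credentials are hypothetical; the regular expression assumes the password contains no semicolon.

```python
import re

REQUIRED_KEYS = ("SERVER", "DATABASE", "UID", "PWD")


def build_connection_string(env):
    """Build the ODBC connection string, failing early on missing settings."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"missing .env entries: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['SERVER']};DATABASE={env['DATABASE']};"
        f"UID={env['UID']};PWD={env['PWD']};"
        "MARS_Connection=yes;MinProtocolVersion=TLSv1.2;"
    )


def mask_password(connection_string):
    """Replace the PWD value with asterisks so the string is safe to display."""
    return re.sub(r"(PWD=)[^;]*", r"\1****", connection_string, flags=re.IGNORECASE)


# Made-up settings standing in for the dictionary returned by dotenv_values('.env')
demo_env = {
    "SERVER": "example-server",
    "DATABASE": "example_db",
    "UID": "example_user",
    "PWD": "not-a-real-password",
}
masked = mask_password(build_connection_string(demo_env))
print(masked)
```

The unmasked string is still what gets passed to `pyodbc.connect`; only the displayed copy is masked.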

#### Load data from database

In [4]:
# Establish a connection to the database using the connection string
connection = pyodbc.connect(connection_string) 

# Define the SQL query to list all tables (for SQL Server)
query = """
SELECT TABLE_NAME 
FROM INFORMATION_SCHEMA.TABLES 
WHERE TABLE_TYPE = 'BASE TABLE';
"""

# Execute the SQL query and fetch the result into a pandas DataFrame using the established database connection
tables = pd.read_sql(query, connection)

tables

|   | TABLE_NAME |
|---|---|
| 0 | Sales_July_2019 |
| 1 | Sales_August_2019 |
| 2 | Sales_September_2019 |
| 3 | Sales_October_2019 |
| 4 | Sales_November_2019 |
| 5 | Sales_December_2019 |


In [15]:
# Define one query per monthly table
query1 = "SELECT * FROM Sales_July_2019"
query2 = "SELECT * FROM Sales_August_2019"
query3 = "SELECT * FROM Sales_September_2019"
query4 = "SELECT * FROM Sales_October_2019"
query5 = "SELECT * FROM Sales_November_2019"
query6 = "SELECT * FROM Sales_December_2019"

# Execute each query and load the result into its own DataFrame
july_df = pd.read_sql(query1, connection)
august_df = pd.read_sql(query2, connection)
september_df = pd.read_sql(query3, connection)
october_df = pd.read_sql(query4, connection)
november_df = pd.read_sql(query5, connection)
december_df = pd.read_sql(query6, connection)

# Define the dictionary with DataFrames
half_2 = {
    'july': july_df,
    'august': august_df,
    'september': september_df,
    'october': october_df,
    'november': november_df,
    'december': december_df
}

# Iterate over the dictionary
for month, df in half_2.items():
    print(f'{month}')
    print(df.columns)
    print(df.shape)
    print('=' * 50)

july
Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date',
       'Purchase_Address'],
      dtype='object')
(14371, 6)
august
Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date',
       'Purchase_Address'],
      dtype='object')
(12011, 6)
september
Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date',
       'Purchase_Address'],
      dtype='object')
(11686, 6)
october
Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date',
       'Purchase_Address'],
      dtype='object')
(20379, 6)
november
Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date',
       'Purchase_Address'],
      dtype='object')
(17661, 6)
december
Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date',
       'Purchase_Address'],
      dtype='object')
(25117, 6)
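The six near-identical queries above could also be generated from the list of table names. The sketch below shows the idea with a plain DB-API cursor, which both `pyodbc` and `sqlite3` provide. The function name `load_tables` is hypothetical, and the in-memory SQLite database with its single demo row is only a stand-in so the example runs without access to the SQL Server instance.

```python
import sqlite3


def load_tables(connection, table_names):
    """Fetch every row of each table as a list of column-name dictionaries."""
    loaded = {}
    cursor = connection.cursor()
    for name in table_names:
        # Table names cannot be bound as query parameters, so accept only
        # plain identifiers before interpolating them into the SQL text.
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unexpected table name: {name!r}")
        cursor.execute(f"SELECT * FROM {name}")
        columns = [column[0] for column in cursor.description]
        loaded[name] = [dict(zip(columns, row)) for row in cursor.fetchall()]
    return loaded


# In-memory SQLite stand-in with made-up content
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Sales_July_2019 (Order_ID INTEGER, Product TEXT)")
demo.execute("INSERT INTO Sales_July_2019 VALUES (1, 'Demo product')")
demo.execute("CREATE TABLE Sales_August_2019 (Order_ID INTEGER, Product TEXT)")

result = load_tables(demo, ["Sales_July_2019", "Sales_August_2019"])
row_counts = {name: len(rows) for name, rows in result.items()}
print(row_counts)
```

In the notebook, the same loop could call `pd.read_sql` for each name in the `tables` DataFrame and store the results in a dictionary like `half_2`.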


In [21]:
# Merge all DataFrames into one
second_half = pd.concat(half_2, ignore_index=True)
print('Second half of the year merged data:', second_half.shape)

Second half of the year merged data: (101225, 6)


### Merged Data

In [24]:
# Merge both halves of the year into Merged_df
Merged_df = pd.concat([first_half, second_half], axis=0, ignore_index=True)

print('1st half year data:', first_half.shape[0])
print('2nd half year data:', second_half.shape[0])
print('Total year 2019 data:', Merged_df.shape[0])

1st half year data: 85625
2nd half year data: 101225
Total year 2019 data: 186850
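As a sanity check, the row counts reported above should add up: the six monthly tables should sum to the second-half total, and both halves should sum to the full-year total. The short sketch below redoes that arithmetic with the numbers printed in this notebook; the variable names are chosen for illustration.

```python
# Row counts copied from the shapes printed for each monthly table
monthly_rows = {
    "july": 14371,
    "august": 12011,
    "september": 11686,
    "october": 20379,
    "november": 17661,
    "december": 25117,
}
first_half_rows = 85625  # rows loaded from the CSV files

second_half_rows = sum(monthly_rows.values())
total_rows = first_half_rows + second_half_rows

# Concatenation should neither drop nor duplicate rows
assert second_half_rows == 101225
assert total_rows == 186850
print(second_half_rows, total_rows)
```

Inside the notebook, the equivalent check would compare `len(Merged_df)` with `len(first_half) + len(second_half)`.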
