# Practical Data Analysis Using Jupyter Notebook

## Ch. 5: Gathering and Loading Data in Python 
---

## Introduction

What is SQL, and why does it matter for data analysis? This chapter answers both questions by connecting to a SQLite database and loading its tables into pandas DataFrames.

## Topics

* Introduction to SQL and relational databases
* From SQL to pandas DataFrames
* Data about your data explained

## Imports

In [2]:
import pandas as pd
import sqlite3

---

## Introduction to SQL and relational databases

Relational databases store data across multiple tables and keep those tables consistent with one another through primary and foreign keys:

* A primary key is a unique value (typically an integer) that identifies a single record, or tuple, in a table.
* A foreign key is a field in one table that references the primary key of another.
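
To make the two key types concrete, here is a minimal sketch using an in-memory SQLite database. The `customers` and `sales` tables and their columns are hypothetical stand-ins, not the schema of the chapter's database. Note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued on the connection.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
demo_conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
demo_conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical tables: customer_id is the primary key of customers...
demo_conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        first_name  TEXT NOT NULL
    )
""")
# ...and sales.customer_id is a foreign key that references it.
demo_conn.execute("""
    CREATE TABLE sales (
        sale_id      INTEGER PRIMARY KEY,
        customer_id  INTEGER NOT NULL REFERENCES customers (customer_id),
        sales_amount REAL
    )
""")

demo_conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
demo_conn.execute("INSERT INTO sales VALUES (1, 1, 30.0)")  # customer 1 exists

# A sale that points at a customer who does not exist is rejected.
try:
    demo_conn.execute("INSERT INTO sales VALUES (2, 99, 10.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# The foreign key lets us join each sale back to its customer.
rows = demo_conn.execute("""
    SELECT s.sale_id, c.first_name, s.sales_amount
    FROM sales AS s
    JOIN customers AS c ON c.customer_id = s.customer_id
""").fetchall()
print(orphan_rejected, rows)
demo_conn.close()
```

The rejected insert is the consistency guarantee in action: a row in `sales` cannot refer to a customer that is missing from `customers`.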

## From SQL to pandas DataFrames

Test data: `customer_sales.db`

This database has three tables:

1.  `tbl_sales`
2.  `tbl_customers`
3.  `tbl_products`

![Entity Relationship Diagram](../data/erd.JPG)

Let's go ahead and connect to the DB:

In [2]:
conn = sqlite3.connect('../data/customer_sales.db')


...and create a DataFrame by using a SQL query.

In [5]:
df_sales = pd.read_sql_query("SELECT * FROM tbl_sales", conn)
df_sales.head()

| | Sale_ID | Sale_Date | Description | Customer_ID | Product_ID | Sales_Amount | Sales_Quantity |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 12/31/2014 | Purchased from Store | 2 | 2 | 20 | 1 |
| 1 | 2 | 1/15/2015 | Phone Purchase | 1 | 1 | 30 | 2 |
| 2 | 3 | 6/14/2015 | Internet Purchase | 3 | 3 | 5 | 1 |
| 3 | 4 | 11/11/2015 | Sales Convention Purchase | 3 | 3 | 500 | 100 |
| 4 | 5 | 4/18/2016 | Internet Purchase | 4 | 1 | 20 | 2 |
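
`pd.read_sql_query` accepts any `SELECT` statement, not just `SELECT *`. The sketch below joins two tables and passes a filter value through the `params` argument. It runs against a small in-memory stand-in, so the rows and the names `demo_conn` and `df_joined` are assumptions for illustration; only the table and column names are borrowed from the chapter's database.

```python
import sqlite3
import pandas as pd

# Small in-memory stand-in for customer_sales.db (rows are made up).
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE tbl_customers (Customer_ID INTEGER PRIMARY KEY, First_Name TEXT);
    CREATE TABLE tbl_sales (
        Sale_ID      INTEGER PRIMARY KEY,
        Customer_ID  INTEGER,
        Sales_Amount INTEGER
    );
    INSERT INTO tbl_customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO tbl_sales VALUES (1, 2, 20), (2, 1, 30), (3, 2, 500);
""")

# The ? placeholder is filled from params, which avoids building SQL by
# string concatenation.
query = """
    SELECT s.Sale_ID, c.First_Name, s.Sales_Amount
    FROM tbl_sales AS s
    JOIN tbl_customers AS c ON c.Customer_ID = s.Customer_ID
    WHERE s.Sales_Amount >= ?
    ORDER BY s.Sale_ID
"""
df_joined = pd.read_sql_query(query, demo_conn, params=(25,))
print(df_joined)
demo_conn.close()
```

Doing the join and the filter in SQL means only the rows you need are loaded into the DataFrame.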


In [8]:
# Sale_Date is stored as text, so this sorts alphabetically, not chronologically
df_sales.sort_values(by='Sale_Date')

| | Sale_ID | Sale_Date | Description | Customer_ID | Product_ID | Sales_Amount | Sales_Quantity |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 1/15/2015 | Phone Purchase | 1 | 1 | 30 | 2 |
| 5 | 6 | 10/15/2016 | Purchased from Store | 5 | 1 | 20 | 1 |
| 3 | 4 | 11/11/2015 | Sales Convention Purchase | 3 | 3 | 500 | 100 |
| 0 | 1 | 12/31/2014 | Purchased from Store | 2 | 2 | 20 | 1 |
| 6 | 7 | 3/17/2017 | Internet Purchase | 4 | 1 | 20 | 1 |
| 4 | 5 | 4/18/2016 | Internet Purchase | 4 | 1 | 20 | 2 |
| 8 | 9 | 5/25/2019 | Internet Purchase | 1 | 3 | 10 | 2 |
| 2 | 3 | 6/14/2015 | Internet Purchase | 3 | 3 | 5 | 1 |
| 7 | 8 | 6/15/2018 | Purchased from Store | 3 | 3 | 5 | 1 |
| 9 | 10 | 6/9/2019 | Internet Purchase | 2 | 3 | 10 | 2 |
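
The order above is alphabetical rather than chronological (10/15/2016 lands before 12/31/2014) because `Sale_Date` holds text. One way to get a true date sort is to convert the column with `pd.to_datetime` first. This sketch uses four dates copied from the table in a small stand-in DataFrame, and assumes every value follows the month/day/year pattern shown.

```python
import pandas as pd

# Stand-in DataFrame with four of the dates shown above, stored as text.
demo_df = pd.DataFrame({
    "Sale_ID": [1, 2, 3, 4],
    "Sale_Date": ["12/31/2014", "1/15/2015", "6/14/2015", "11/11/2015"],
})

# Sorting text compares character by character.
text_order = demo_df.sort_values(by="Sale_Date")["Sale_ID"].tolist()

# Convert to real datetimes (assumes month/day/year), then sort again.
demo_df["Sale_Date"] = pd.to_datetime(demo_df["Sale_Date"], format="%m/%d/%Y")
date_order = demo_df.sort_values(by="Sale_Date")["Sale_ID"].tolist()

print(text_order)  # [2, 4, 1, 3]
print(date_order)  # [1, 2, 3, 4]
```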


To limit the data displayed, we can use the DataFrame's `loc` indexer to select specific rows or columns by label. To retrieve the first row, we pass its index label, which here is 0 because the default index starts at 0:

In [9]:
df_sales.loc[0]

Sale_ID                              1
Sale_Date                   12/31/2014
Description       Purchased from Store
Customer_ID                          2
Product_ID                           2
Sales_Amount                        20
Sales_Quantity                       1
Name: 0, dtype: object
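
Because `loc` looks rows up by label, `loc[0]` returns the first row only while the index still carries its default labels. The sketch below uses a hypothetical DataFrame with a custom index to contrast `loc` with the position-based `iloc`.

```python
import pandas as pd

# Hypothetical data; the index labels are deliberately not 0, 1, 2.
demo_df = pd.DataFrame(
    {"Sale_ID": [1, 2, 3], "Sales_Amount": [20, 30, 5]},
    index=[10, 20, 30],
)

by_label = demo_df.loc[10]     # the row whose index label is 10
by_position = demo_df.iloc[0]  # the first row, whatever its label

# loc also accepts lists of row labels and column labels together.
subset = demo_df.loc[[10, 30], ["Sales_Amount"]]

print(by_label.equals(by_position))  # True: both are the first row
print(0 in demo_df.index)            # False: demo_df.loc[0] would raise KeyError
```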

To restrict the data displayed, we can place a Boolean condition inside the square brackets to isolate the rows that satisfy it. A business task you could address using this data would be to *identify customers with high sales so we can thank them personally*. To do this, we filter the sales by a threshold and display only the rows that exceed it. For this example, we set "high" to an arbitrary number, so any row with a `Sales_Amount` over 100 is displayed by this command:

In [16]:
df_sales['Sales_Amount'] > 100  # Boolean Series: True where Sales_Amount exceeds 100
df_sales[df_sales['Sales_Amount'] > 100]  # keeps only the rows where that Series is True

| | Sale_ID | Sale_Date | Description | Customer_ID | Product_ID | Sales_Amount | Sales_Quantity |
|---|---|---|---|---|---|---|---|
| 3 | 4 | 11/11/2015 | Sales Convention Purchase | 3 | 3 | 500 | 100 |
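
The same filtering pattern extends to more than one condition. The sketch below rebuilds four of the sales rows in a stand-in DataFrame, stores the mask in a variable, and combines two conditions with `&`. The second condition and the `query` alternative are illustrative additions, not part of the original analysis.

```python
import pandas as pd

# Stand-in for df_sales, using four rows from the output above.
demo_df = pd.DataFrame({
    "Sale_ID": [1, 2, 3, 4],
    "Description": [
        "Purchased from Store",
        "Phone Purchase",
        "Internet Purchase",
        "Sales Convention Purchase",
    ],
    "Sales_Amount": [20, 30, 5, 500],
})

# Step 1: the comparison yields one True/False value per row.
mask = demo_df["Sales_Amount"] > 100
# Step 2: indexing with the mask keeps only the True rows.
high_sales = demo_df[mask]

# Combine conditions with & (and) or | (or); each needs its own parentheses.
combined = demo_df[
    (demo_df["Sales_Amount"] >= 20) & (demo_df["Description"] != "Phone Purchase")
]

# query() expresses the same kind of filter as a string.
via_query = demo_df.query("Sales_Amount > 100")
print(high_sales)
```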


## Data about your data explained

### Fundamental statistics

Descriptive analytics looks at what has already happened: it examines the digital footprint left in the data to gain insights, analyze trends, and identify patterns.

Using SQL to read data from one or more tables supports this effort, which should include basic statistics and arithmetic.

![fundamental stats](../data/fundamental_stats.JPG)
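
As a rough sketch of how such statistics can be gathered, the example below computes a count, sum, average, minimum, and maximum in SQL, then gets a similar summary from pandas with `describe()`. Which statistics the figure lists is an assumption here; the five `Sales_Amount` values come from the first rows of `tbl_sales`, loaded into an in-memory stand-in.

```python
import sqlite3
import pandas as pd

# In-memory stand-in holding the first five Sales_Amount values.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE tbl_sales (Sale_ID INTEGER PRIMARY KEY, Sales_Amount INTEGER);
    INSERT INTO tbl_sales VALUES (1, 20), (2, 30), (3, 5), (4, 500), (5, 20);
""")

# Option 1: let the database do the arithmetic with aggregate functions.
df_stats = pd.read_sql_query("""
    SELECT COUNT(*)          AS n,
           SUM(Sales_Amount) AS total,
           AVG(Sales_Amount) AS mean,
           MIN(Sales_Amount) AS smallest,
           MAX(Sales_Amount) AS largest
    FROM tbl_sales
""", demo_conn)

# Option 2: load the rows and summarize them in pandas.
demo_df = pd.read_sql_query("SELECT * FROM tbl_sales", demo_conn)
summary = demo_df["Sales_Amount"].describe()

print(df_stats)
print(summary)
demo_conn.close()
```

Aggregating in SQL keeps the transfer small for large tables, while `describe()` is convenient once the data is already in a DataFrame.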

### Metadata explained

Metadata is descriptive information about a data source. One key thing metadata analysis reveals is that nulls, or missing values, exist in databases.

In Python and other languages such as Java, you may see the value `NaN` returned. It stands for Not a Number and signals that statistical calculations or functions may not work as expected on those values.

`NaN` and null values have dedicated functions and operators for handling them:

* NumPy: `nansum()`
* pandas: `isnull()`
* SQL: `IS [NOT] NULL`
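
The three tools listed above can be tried side by side in a short sketch. The array values and the table `t` are made up for illustration.

```python
import sqlite3
import numpy as np
import pandas as pd

values = np.array([10.0, np.nan, 5.0])

# NumPy: a plain sum is poisoned by NaN, while nansum skips it.
plain_sum = np.sum(values)
safe_sum = np.nansum(values)

# pandas: isnull() flags the missing entries.
null_mask = pd.Series(values).isnull()

# SQL: NULL is matched with IS NULL, never with "= NULL".
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE t (id INTEGER, phone TEXT);
    INSERT INTO t VALUES (1, '555-0100'), (2, NULL);
""")
missing_ids = [
    row[0] for row in demo_conn.execute("SELECT id FROM t WHERE phone IS NULL")
]
demo_conn.close()

print(plain_sum, safe_sum)  # nan 15.0
print(null_mask.tolist())   # [False, True, False]
print(missing_ids)          # [2]
```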

Okay, enough of that. Let's get into some examples.

In [3]:
# connect to the DB and make a dataframe
conn = sqlite3.connect('../data/customer_sales.db')
df_customer_sales = pd.read_sql_query("""
  SELECT * FROM tbl_customers
""", conn)

In [4]:
# identify any NaN fields
pd.isnull(df_customer_sales)

| | Customer_ID | First_Name | Last_Name | Address_Line_1 | Address_Line_2 | City | State | ZipCode | Phone | Email |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | True | False | False | False | False | False |
| 1 | False | False | False | False | True | False | False | False | True | False |
| 2 | False | False | False | False | True | False | False | False | True | False |
| 3 | False | False | False | False | False | False | False | False | True | False |
| 4 | False | False | False | False | True | False | False | False | False | False |
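
A grid of True and False values gets hard to scan as tables grow, so a common follow-up is to summarize it. The sketch below counts nulls per column, selects the rows that contain any null, and fills one column with a default. The DataFrame is a made-up stand-in that borrows three column names from `tbl_customers`.

```python
import pandas as pd

# Made-up stand-in for the customer table, with some missing values.
demo_df = pd.DataFrame({
    "Customer_ID": [1, 2, 3],
    "Address_Line_2": [None, None, "Suite 4"],
    "Phone": ["555-0100", None, "555-0101"],
})

# True counts as 1, so summing the Boolean grid gives nulls per column.
null_counts = demo_df.isnull().sum()

# Rows that have at least one missing value.
rows_with_nulls = demo_df[demo_df.isnull().any(axis=1)]

# Replace missing second address lines with an empty string.
filled = demo_df.fillna({"Address_Line_2": ""})

print(null_counts)
print(rows_with_nulls)
```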
