# Introduction to Importing Data in Python

These are my notes for DataCamp's course [_Introduction to Importing Data in Python_](https://www.datacamp.com/courses/introduction-to-importing-data-in-python).

This course is presented by Hugo Bowne-Anderson, formerly Data Scientist at DataCamp. The collaborator is Francisco Castro.

Prerequisite:

- [_Intermediate Python_](../Intermediate%20Python/Intermediate%20Python.ipynb)

This course is part of these tracks:

- Data Engineer
- Data Scientist with Python
- Data Scientist Professional with Python
- Importing & Cleaning Data with Python

## Data Sets

| Name | File |
|:---|:---|
| Chinook (SQLite) | Chinook.sqlite |
| LIGO (HDF5) | L-L1_LOSC_4_V1-1126259446-32.hdf5 |
| Battledeath (XLSX) | battledeath.xlsx |
| Extent of Infectious Diseases | disarea.dta |
| Gene expressions (Matlab) | ja_data2.mat |
| MNIST | mnist_kaggle_some_rows.csv |
| Sales (SAS7BDAT) | sales.sas7bdat |
| Seaslugs | seaslug.txt |
| Titanic | titanic_sub.csv |

## Imports

For convenience and clarity, all imports are gathered here.

In [None]:
import os
import pickle
import sqlite3
import sys

import h5py
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sas7bdat
import scipy.io
import sqlalchemy

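# Request that warnings be displayed as errors. Setting PYTHONWARNINGS at
# run time only affects Python subprocesses started later; to affect this
# session, call warnings.simplefilter("error") instead.
os.environ["PYTHONWARNINGS"] = "error"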

# Set a default plotting style.
plt.style.use("dark_background")

## Introduction and Flat Files

### Welcome to the Course

#### Reading a Text File (Demonstration)

In [None]:
# Read a text file.
filename = 'seaslug.txt'
file1 = open(filename, mode="r")
seaslug_text = file1.read()
print(seaslug_text)

#### Checking if a File is Closed (Extra)

This won't matter when we move on to using `with`.

In [None]:
# It is possible to determine whether a file is closed.
print("File is closed:", file1.closed)
print("Closing file...")
file1.close()
print("  Done.")
print("File is closed:", file1.closed)

#### Writing to a File (Demonstration)

In [None]:
# Writing to a file. I have enhanced this demonstration.
# Create the "moby_dick.txt" file, which is used below in an exercise.
moby_dick_text = """CHAPTER 1. Loomings.

Call me Ishmael. Some years ago--never mind how long precisely--having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world. It is a way I have of driving off the spleen and regulating
the circulation. Whenever I find myself growing grim about the mouth;
whenever it is a damp, drizzly November in my soul; whenever I find
myself involuntarily pausing before coffin warehouses, and bringing up
the rear of every funeral I meet; and especially whenever my hypos get
such an upper hand of me, that it requires a strong moral principle to
prevent me from deliberately stepping into the street, and methodically
knocking people's hats off--then, I account it high time to get to sea
as soon as I can. This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I quietly
take to the ship. There is nothing surprising in this. If they but knew
it, almost all men in their degree, some time or other, cherish very
nearly the same feelings towards the ocean with me.
"""
filename2 = 'moby_dick.txt'
file2 = open(filename2, mode="w")
file2.write(moby_dick_text)
file2.close()

In [None]:
# When using the file context manager, it is not necessary to call file.close().
# Since this is not a function, file3 and text2 have global scope.
with open(filename, mode="r") as file3:
    text2 = file3.read()
    print("File is closed:", file3.closed)
print(file3)
print("File is closed:", file3.closed)
print(text2)

#### Explore the Working Directory (Exercise)

In [None]:
# List the current working directory in IPython.
# This can't be done in the regular Python interpreter.
!ls

#### Read an Entire Text File (Exercise)

In [None]:
# Read and print the entire file "moby_dick.txt".
file4 = open("moby_dick.txt", mode="r")
print(file4.read())
print("File is closed:", file4.closed)
print("Closing file...")
file4.close()
print("  Done.")
print("File is closed:", file4.closed)

#### Importing Text Files Line By Line (Exercise)

In [None]:
# Use file.readline() to read a file line by line.
with open("moby_dick.txt", mode="r") as file5:
    print(file5.readline())
    print(file5.readline())
    print(file5.readline())

### The Importance of Flat Files in Data Science

Flat files may contain a header followed by rows of data, where each row contains the attributes for a single object. [In bioinformatics, flat files often have other internal organization.] Hugo contrasts flat files with relational database tables; a flat file contains no relational information.

Hugo is using the titanic.csv file as an example of a CSV flat file.

MNIST.txt is a tab-delimited text file.
> The data consists of the famous MNIST digit recognition images, where each row contains the pixel values of a given image. Note that all fields in the MNIST data are numeric.

Typically, a data scientist uses NumPy or pandas to work with flat files.

#### Characteristics of Flat Files (Exercise)
- Flat files consist of rows, and each row is called a record.
- A record in a flat file is composed of _fields_ or _attributes_, each one of which contains at most one item of information.
- Flat files are pervasive in data science.
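
As a minimal sketch of these terms, the standard-library `csv` module can split a small flat file into records and fields. The sample text and its column names are made up for illustration; they are not taken from the course data.

```python
import csv
import io

# A made-up flat file: one header row, then one record per line.
flat_file = io.StringIO(
    "name,age,fare\n"
    "Alice,29,7.25\n"
    "Bob,41,71.28\n"
)

reader = csv.reader(flat_file)
header = next(reader)   # the field names
records = list(reader)  # each record is a list of field values

print("fields:", header)
for record in records:
    # Pair each field name with the single value it holds in this record.
    print(dict(zip(header, record)))
```

Note that `csv.reader` returns every field as a string; converting fields to numbers is left to the caller, which is one reason NumPy and pandas are preferred later in these notes.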

#### Why We Like Flat Files and the Zen of Python (Exercise)

The fifth aphorism of _The Zen of Python_ is: "Flat is better than nested."

In [None]:
# Obtain The Zen of Python.
import this
# See https://stackoverflow.com/questions/5855758/what-is-the-source-code-of-the-this-module-doing.

### Importing Flat Files with NumPy

If all the data in a flat file are numerical, read the file using NumPy. NumPy arrays are the standard for storing numerical data. NumPy arrays are often used by other data science packages such as scikit-learn. NumPy itself has a number of built-in functions that make it far easier and more efficient for us to import data as arrays.
- `numpy.loadtxt()`
- `numpy.genfromtxt()`
- `numpy.recfromcsv()`

To modify how NumPy imports the data:
- Use the `skiprows=1` argument to skip the header row.
- Use the `usecols` argument to select the columns to keep (e.g., `usecols=[0, 2]`).
- Use the `dtype` argument to specify the data type (e.g., `dtype=str`).

NumPy does not handle mixed data types well, such as the data that appear in the titanic_sub.csv file. To import such data, use `pandas.read_csv()` to create a pandas DataFrame.
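
The import options listed above can be tried without any data file by reading from an in-memory buffer. This is only a sketch: the tab-delimited text and its column names are invented, and `io.StringIO` stands in for a file on disk.

```python
import io

import numpy as np

# A made-up tab-delimited file with a header row and three numeric columns.
text = io.StringIO(
    "label\tpixel0\tpixel1\n"
    "1\t0\t255\n"
    "7\t12\t0\n"
    "3\t90\t45\n"
)

# Skip the header, keep columns 0 and 2, and store the values as integers.
arr = np.loadtxt(text, delimiter="\t", skiprows=1, usecols=[0, 2], dtype=int)
print(arr)
print(arr.shape)
```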

#### Read the MNIST Digit Data (Exercise)

See http://yann.lecun.com/exdb/mnist/ for the data.

In [None]:
# Read the MNIST data.
filename = 'mnist_kaggle_some_rows.csv'
digits = np.loadtxt(filename, delimiter=',')
print("Type of object digits:", type(digits))

# Select and reshape a row. The first value is the label (the digit the image
# represents). The remaining 28 * 28 values represent a 28 x 28 image.
# Reshape and present the data from row 21.
digit = int(digits[21, 0])
print("digit:", digit)
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))

# Plot reshaped data.
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()

#### Customizing Your NumPy Import (Exercise)

In [None]:
# I used the !cat digits_header.txt command to print the contents of the file
# to the DataCamp console. I copied the data and saved it as
# digits_header.txt.
# The file has a header in the first row; we can't include the header in the
# NumPy array. For the exercise, keep only columns 0 and 2, where column 0
# is the encoded digit and column 2 is a pixel value. Since both are integers,
# set the dtype of the data to int.
file = 'digits_header.txt'
data = np.loadtxt(file, delimiter="\t", skiprows=1, usecols=[0, 2], dtype=int)
print(data[:5])

#### Importing Different Datatypes (Exercise)

These data consist of the percentage of sea slug larvae that had metamorphosed in a given time period. See http://www.stat.ucla.edu/projects/datasets/seaslug-explanation.html for the sea slug data.

In [None]:
# This file has string headers and numeric fields. This can't be loaded using np.loadtxt() because
# of inconsistent data types, unless we set dtype=str to convert all data to strings, or we
# set skiprows=1 to skip the header row.
file = 'seaslug.txt'
data = np.loadtxt(file, delimiter='\t', dtype=str)
print(data[:2])

# Import data as floats and skip the first row.
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)
print(data_float[9])

# Plot a scatterplot of the data.
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('Time (min)')
plt.ylabel('Percentage of larvae')
plt.show()

#### Working with Mixed Datatypes (Structured Arrays) (Exercise)

`numpy.loadtxt` can't load data having different datatypes; use `numpy.genfromtxt` for this.

`numpy.genfromtxt` creates a structured array object, https://numpy.org/doc/stable/user/basics.rec.html. Each row in the structured array has type `numpy.void`; see https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.void.
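
To see what a structured array looks like without the Titanic file, here is a small sketch that reads a made-up CSV from memory. The names and values are invented, and `encoding="utf-8"` is chosen here so that text fields come back as ordinary strings.

```python
from io import StringIO

import numpy as np

# A made-up CSV with one text column, one integer column, and one float column.
csv_text = "Name,Age,Fare\nAlice,29,7.25\nBob,41,71.2833\n"

data = np.genfromtxt(
    StringIO(csv_text),
    delimiter=",",
    names=True,        # take the field names from the header row
    dtype=None,        # let NumPy infer a type for each column
    encoding="utf-8",
)

print(data.shape)        # one element per row of the file
print(data.dtype.names)  # the field names from the header
print(data["Age"])       # a field is selected by name, like a column
print(type(data[0]))     # a single row is a numpy.void record
```

The array is one-dimensional: each element is a whole row, and a column is obtained by indexing with its field name.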

In [None]:
# Working with mixed datatypes (1)
# Use np.genfromtxt() instead of np.loadtxt() because it can process mixed
# datatypes. np.genfromtxt returns a structured numpy.ndarray.
# Because numpy arrays have to contain elements that are all the
# same type, the structured array solves this by being a 1-D array, where each
# element in the array is a row of the flat file imported.

# If the encoding argument is not specified, numpy creates a warning:
#   Reading unicode strings without specifying the encoding argument is
#   deprecated. Set the encoding, use None for the system default.
data = np.genfromtxt('titanic_sub.csv', delimiter=',', names=True, dtype=None, encoding=None)
print("type(data):", type(data))
print("data.shape:", data.shape)
# numpy.ndarray.dtype has attributes names and fields.
print("data.dtype:", data.dtype)
print("data.dtype.names:", data.dtype.names)
print("data.dtype.fields:", data.dtype.fields)
print("sys.getsizeof(data):", sys.getsizeof(data))
print("row 0:", data[0])
# The type of data[0] is numpy.void, the scalar type NumPy uses for one
# record of a structured array.
print("type(data[0]):", type(data[0]))
print("sys.getsizeof(data[0]):", sys.getsizeof(data[0]))
print('"Fare" column, first five items:')
print(data["Fare"][:5])

#### Working with Mixed Datatypes (Record Arrays) (Exercise)

The `numpy.recfromcsv` function returns `numpy.recarray` objects; see https://numpy.org/doc/stable/user/basics.rec.html#record-arrays.

In [None]:
# Use np.recfromcsv for reading CSV files.
# Print the first three rows of data.
file = 'titanic_sub.csv'
d = np.recfromcsv(file, encoding=None)
print("type(d):", type(d))
print("d.dtype:", d.dtype)
print("d.shape:", d.shape)
print(d[:3])

### Importing Flat Files Using Pandas

NumPy is not adequate for dealing with tables of data. It is standard practice
and best practice to use pandas. See https://pandas.pydata.org.

Quoted from the course:

> Although arrays are incredibly powerful and serve a number of essential purposes, they cannot fulfill one of the most basic needs of a Data Scientist: to have "[two]-dimensional labeled data structure[s] with columns of potentially different types" that you can easily perform a plethora of Data Sciencey type things on: manipulate, slice, reshaped, groupby, join, merge, perform statistics in a missing-value-friendly manner, deal with times series. The need for such a data structure, among other issues, prompted Wes McKinney to develop the pandas library for Python. Nothing speaks to the project of pandas more than the documentation itself: "Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R." The data structure most relevant to the data manipulation and analysis workflow that pandas offers is the dataframe and it is the Pythonic analogue of R’s dataframe.

> Manipulating dataframes in pandas can be useful in all steps of the data scientific method, from exploratory data analysis to data wrangling, preprocessing, building models and visualization. Here we will see its great utility in importing flat files, even merely in the way that it deals with missing data, comments along with the many other issues that plague working data scientists. For all of these reasons, it is now standard and best practice in Data Science to use pandas to import flat files as dataframes. Later in this course, we’ll see how many other types of data, whether they’re stored in relational databases, hdf5, MATLAB or excel files, can easily be imported as dataframes.
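
The handling of comments and missing data mentioned in the quotation can be sketched with an in-memory file. The text below is invented, and the marker `Nothing` for missing values is simply an assumption of this example.

```python
from io import StringIO

import pandas as pd

# A made-up tab-delimited file with a comment line and a missing value.
raw = (
    "# passenger sample (made up)\n"
    "Name\tAge\tFare\n"
    "Alice\t29\t7.25\n"
    "Bob\tNothing\t71.28\n"
)

# Skip comment lines and treat the string "Nothing" as a missing value.
df = pd.read_csv(StringIO(raw), sep="\t", comment="#", na_values="Nothing")
print(df)
print(df["Age"].isna().sum(), "missing age value(s)")
```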

#### Using pandas to Import Flat Files as DataFrames (1) (Exercise)

In [None]:
# Load the Titanic data using pandas.read_csv() to create a
# pandas.DataFrame object.
filename = 'titanic_sub.csv'
titanic = pd.read_csv(filename)
print(titanic.head())
print("type(titanic):", type(titanic))

# Extract a NumPy array from the DataFrame.
titanic_values = titanic.values
print("type(titanic_values):", type(titanic_values))
print("titanic_values.dtype:", titanic_values.dtype)
print(titanic_values[:5])
print("type(titanic_values[0]):", type(titanic_values[0]))

#### Using pandas to Import Flat Files as DataFrames (2) (Exercise)

In [None]:
# Read the first five rows of the file into a DataFrame, where the file does not
# contain a header. Extract the data as a numpy.ndarray object.
file = 'mnist_kaggle_some_rows.csv'
mnist = pd.read_csv(file, nrows=5, header=None)
mnist_array = mnist.values
print(type(mnist_array))

In [None]:
# I used the command !cat titanic_corrupt.txt in the DataCamp console, copied
# the data from the console, and pasted it into a new text file,
# titanic_corrupt.txt, in the project folder.
# I wrote the code when I didn't have a copy of titanic_corrupt.txt, hence
# the use of try..except.
# The file is tab-delimited, contains comments starting with '#', and indicates
# missing data with the string 'Nothing'.
# This code shows how to use pandas to plot a histogram.
file = 'titanic_corrupt.txt'
try:
    data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
    print(data.head())
    pd.DataFrame.hist(data[['Age']])
    plt.xlabel('Age (years)')
    plt.ylabel('count')
    plt.show()
except FileNotFoundError:
    print("I don't have the 'titanic_corrupt.txt' file.")

## Importing Other File Types

### Introduction to Other File Types

Hugo Bowne-Anderson discussed different file formats, including Python's pickle format. HDF5 files are increasingly used for storing large datasets.

#### List Directories in IPython (Exercise)

In [None]:
# List the current working directory.
cwd = os.getcwd()
print(os.listdir(cwd))

#### Dump and Load a Pickled File (Exercise)

In [None]:
# Create a pickled file. This replicates the file used in the exercise.
d = {'Mar': '84.4', 'June': '69.4', 'Aug': '85', 'Airline': '8'}
filename = 'data.pkl'
with open(filename, 'wb') as file:
    pickle.dump(d, file)

# Read the pickled file.
with open(filename, 'rb') as file:
    d2 = pickle.load(file)
print(d2)
print(type(d2))

#### List Worksheets in an Excel File (Exercise)

In [None]:
# This requires installing openpyxl for Excel support.
file = 'battledeath.xlsx'
# The parse method creates a pandas.core.frame.DataFrame from a sheet.
xl = pd.ExcelFile(file)
print("type(xl):", type(xl))
print(xl.sheet_names)

#### Load Data from Excel Worksheets (Exercise)

For documentation, see https://pandas.pydata.org/docs/reference/api/pandas.ExcelFile.parse.html and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html.

In [None]:
# Load the data from the individual worksheets into DataFrames.
for sheet_name in xl.sheet_names:
    worksheet = xl.parse(sheet_name)
    print(worksheet.head())

#### Customize Excel Import (Exercise)

In [None]:
# The xl.parse method (pd.ExcelFile.parse) takes arguments such as skiprows,
# names, and usecols (formerly parse_cols), but its documentation is inadequate.
# Documentation for the arguments is available from pd.read_excel.
file = 'battledeath.xlsx'
xl = pd.ExcelFile(file)
print(type(xl))
print()

# Parse the first sheet, skip the header row, and rename the columns.
df1 = xl.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])
print(df1.head())
print()

# Parse the second sheet, keeping only the first column, skipping the header
# row, and setting the name of the single column.
df2 = xl.parse(1, usecols=[0], skiprows=[0], names=['Country'])
print(df2.head())
print()

# This is from a review exercise.
# Here the skiprows argument does not include the header line.
df3 = xl.parse(1, skiprows=1, names=["Countries", "Death"])
print(df3.head())
print()

# This is from a review exercise.
# Here the skiprows argument includes the header line.
# The result is not something I would want.
df4 = xl.parse("2002", skiprows=1)
print(df4.head())

### Importing SAS or Stata Files Using pandas

SAS stands for "Statistical Analysis System".

#### Importing SAS Files (Exercise)

Use `sas7bdat.SAS7BDAT()` or `pandas.read_sas()` to import data from SAS files.

In [None]:
# Import data from a SAS file.
with sas7bdat.SAS7BDAT('sales.sas7bdat') as file:
    # file is a sas7bdat.SAS7BDAT object.
    print(type(file))
    df_sas = file.to_data_frame()
print(df_sas.head())

In [None]:
# pandas has a read_sas function.
df_sas2 = pd.read_sas('sales.sas7bdat')
print(df_sas2.head())

# Plot a histogram of DataFrame features.
pd.DataFrame.hist(df_sas[['P']])
plt.ylabel('Count')
plt.show()

#### Importing Stata Files Using pandas (Exercise)

The name Stata is a blend of "statistics" and "data". Use `pandas.read_stata()` to import data from a Stata file.

In [None]:
# Import the data and create a histogram from one column.
df = pd.read_stata('disarea.dta')
print(df.head())
pd.DataFrame.hist(df[['disa10']])
plt.xlabel('Extent of disease')
plt.ylabel('Number of countries')
plt.show()

### Importing HDF5 Files

Hierarchical Data Format version 5 (HDF5) is rapidly becoming the standard for storing large quantities of numerical data. (See the O'Reilly book _Python and HDF5_.) HDF5 can handle datasets of hundreds of gigabytes or terabytes, even exabytes. HDF5 is managed by the HDF Group in Champaign, IL.

The example data come from the LIGO (Laser Interferometer Gravitational-Wave Observatory) project. See https://losc.ligo.org/events/GW150914/.
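
Because HDF5 files are hierarchical, it can help to build a tiny one before opening the LIGO file. The following is a sketch only: the group and dataset names imitate the LIGO layout, but the values and the temporary file are made up.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a tiny HDF5 file with one group holding one dataset (made-up values).
path = os.path.join(tempfile.mkdtemp(), "example.hdf5")
with h5py.File(path, "w") as f:
    group = f.create_group("strain")
    group.create_dataset("Strain", data=np.arange(5, dtype=float))

# Read it back: groups behave like dictionaries, datasets like arrays.
with h5py.File(path, "r") as f:
    top_level = list(f.keys())
    values = f["strain"]["Strain"][:]  # copy the dataset into a NumPy array

print(top_level)
print(values)
```

Opening the file in a `with` block closes it automatically, which the exercises below do not do.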

#### Load HDF5 Data (Exercise)

In [None]:
# Load HDF5 data.
filename = 'L-L1_LOSC_4_V1-1126259446-32.hdf5'
data = h5py.File(filename, 'r')
# data has type h5py._hl.files.File. The data are hierarchical, not tabular.
print(type(data))

# Iterate through the structure: meta, quality, strain.
# See http://h5py.org/, http://docs.h5py.org/.
# We see:
#   h5py._hl.group.Group
#   h5py._hl.dataset.Dataset
for key1 in data.keys():
    print(key1, type(data[key1]))
    for key2 in data[key1].keys():
        print("  ", key2, type(data[key1][key2]))

print()
print(data['meta']['Description'], data['meta']['Detector'])
print(np.array(data["meta"]["Description"]))
print(np.array(data["meta"]["Detector"]))

#### Extract Data from HDF5 File (Exercise)

In [None]:
# data['strain'] is a Group.
# data['strain']['Strain'] is a Dataset.
# strain is a numpy.ndarray.
strain = np.array(data['strain']['Strain'])
print("type(strain):", type(strain))
print(strain.shape)

# Plot the first 10,000 readings.
# The file holds 131,072 values covering 32 s (the duration given at the end
# of the file name), so the sampling rate is 4096 Hz rather than 10,000 Hz.
sample_rate = len(strain) / 32
num_samples = 10000
time = np.arange(num_samples) / sample_rate
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS time (s)')
plt.ylabel('Strain')
plt.show()

# Plot all the data.
time = np.arange(len(strain)) / sample_rate
plt.plot(time, strain)
plt.xlabel('GPS time (s)')
plt.ylabel('Strain')
plt.show()

### Importing MATLAB Files

MATLAB is short for Matrix Laboratory. Use scipy to read and write MATLAB files:
```Python
scipy.io.loadmat()
scipy.io.savemat()
```

The file contains gene expression data. See https://www.mcb.ucdavis.edu/faculty-labs/albeck/workshop.htm.
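
Before opening the course file, a round trip through memory can show what `scipy.io.loadmat()` returns. This is a sketch with a made-up variable name and made-up values; `io.BytesIO` stands in for a `.mat` file on disk.

```python
import io

import numpy as np
import scipy.io

# Save a made-up variable to an in-memory MATLAB file, then load it back.
buffer = io.BytesIO()
scipy.io.savemat(buffer, {"ratio": np.array([[1.0, 2.0], [3.0, 4.0]])})
buffer.seek(0)
mat = scipy.io.loadmat(buffer)

# Besides the saved variable, loadmat adds metadata keys such as "__header__".
variable_names = [key for key in mat if not key.startswith("__")]
print(variable_names)
print(mat["ratio"].shape)
```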

In [None]:
# In the dictionary, the keys are the names of the MATLAB variables
# and the values are the values of the MATLAB variables.
# Examine the shapes of the various datasets.
filename = 'ja_data2.mat'
mat = scipy.io.loadmat(filename)
print(type(mat))

# Look at the keys and values.
for key, value in mat.items():
    if isinstance(value, np.ndarray):
        print(key, ":", type(value), ":", value.shape)
    else:
        print(key, ":", type(value), ":", value)

In [None]:
# Subset the array and plot it.
print(type(mat['CYratioCyt']))
print(mat['CYratioCyt'].shape)
# Alternative, for the shape:
print(np.shape(mat['CYratioCyt']))

# Subset and plot the data for a single row.
data = mat['CYratioCyt'][25, 5:]
print(data)
# fig = plt.figure()
plt.plot(data)
plt.xlabel('Time (min.)')
plt.ylabel('Normalized fluorescence (measure of expression)')
plt.show()

## Working with Relational Databases in Python

### Introduction to Relational Databases

For Codd's twelve rules, see https://en.wikipedia.org/wiki/Codd%27s_12_rules.

In these exercises, we will use SQLAlchemy.

The Northwind database, SQLite version, is available at https://github.com/jpwhite3/northwind-SQLite3. I copied the Northwind_small.sqlite file into the working directory. 

I found a comment on the web saying that the Chinook database, which is provided by the course, stores different data and is an improved alternative to Northwind for testing. The Chinook database is available at https://github.com/lerocha/chinook-database.

#### Explore SQLite Databases Using the CLI (Extra)
```
$ sqlite3 Chinook.sqlite
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.
sqlite> .tables
Album          Employee       InvoiceLine    PlaylistTrack
Artist         Genre          MediaType      Track
Customer       Invoice        Playlist
sqlite> .quit

$ sqlite3 Northwind_small.sqlite
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.
sqlite> .tables
Category              EmployeeTerritory     Region
Customer              Order                 Shipper
CustomerCustomerDemo  OrderDetail           Supplier
CustomerDemographic   Product               Territory
Employee              ProductDetails_V
sqlite> .mode html
sqlite> select * from sqlite_schema;
```

I had to modify the HTML to present it the way I wanted. For some reason, HTML mode did not include the column names for the query.

<table>
    <tr>
        <th>type</th>
        <th>name</th>
        <th>tbl_name</th>
        <th>rootpage</th>
        <th>sql</th>
    </tr>
<TR><TD>table</TD>
<TD>Employee</TD>
<TD>Employee</TD>
<TD>2</TD>
<TD>CREATE TABLE &quot;Employee&quot;<br>
(<br>
  &quot;Id&quot; INTEGER PRIMARY KEY,<br>
  &quot;LastName&quot; VARCHAR(8000) NULL,<br>
  &quot;FirstName&quot; VARCHAR(8000) NULL,<br>
  &quot;Title&quot; VARCHAR(8000) NULL,<br>
  &quot;TitleOfCourtesy&quot; VARCHAR(8000) NULL,<br>
  &quot;BirthDate&quot; VARCHAR(8000) NULL,<br>
  &quot;HireDate&quot; VARCHAR(8000) NULL,<br>
  &quot;Address&quot; VARCHAR(8000) NULL,<br>
  &quot;City&quot; VARCHAR(8000) NULL,<br>
  &quot;Region&quot; VARCHAR(8000) NULL,<br>
  &quot;PostalCode&quot; VARCHAR(8000) NULL,<br>
  &quot;Country&quot; VARCHAR(8000) NULL,<br>
  &quot;HomePhone&quot; VARCHAR(8000) NULL,<br>
  &quot;Extension&quot; VARCHAR(8000) NULL,<br>
  &quot;Photo&quot; BLOB NULL,<br>
  &quot;Notes&quot; VARCHAR(8000) NULL,<br>
  &quot;ReportsTo&quot; INTEGER NULL,<br>
  &quot;PhotoPath&quot; VARCHAR(8000) NULL<br>
)</TD>
</TR>
<TR><TD>table</TD>
<TD>Category</TD>
<TD>Category</TD>
<TD>3</TD>
<TD>CREATE TABLE &quot;Category&quot;<br>
(<br>
  &quot;Id&quot; INTEGER PRIMARY KEY,<br>
  &quot;CategoryName&quot; VARCHAR(8000) NULL,<br>
  &quot;Description&quot; VARCHAR(8000) NULL<br>
)</TD>
</TR>
<TR><TD>table</TD>
<TD>Customer</TD>
<TD>Customer</TD>
<TD>4</TD>
<TD>CREATE TABLE &quot;Customer&quot;<br>
(<br>
  &quot;Id&quot; VARCHAR(8000) PRIMARY KEY,<br>
  &quot;CompanyName&quot; VARCHAR(8000) NULL,<br>
  &quot;ContactName&quot; VARCHAR(8000) NULL,<br>
  &quot;ContactTitle&quot; VARCHAR(8000) NULL,<br>
  &quot;Address&quot; VARCHAR(8000) NULL,<br>
  &quot;City&quot; VARCHAR(8000) NULL,<br>
  &quot;Region&quot; VARCHAR(8000) NULL,<br>
  &quot;PostalCode&quot; VARCHAR(8000) NULL,<br>
  &quot;Country&quot; VARCHAR(8000) NULL,<br>
  &quot;Phone&quot; VARCHAR(8000) NULL,<br>
  &quot;Fax&quot; VARCHAR(8000) NULL<br>
)</TD>
</TR>
<TR><TD>index</TD>
<TD>sqlite_autoindex_Customer_1</TD>
<TD>Customer</TD>
<TD>5</TD>
<TD></TD>
</TR>
<TR><TD>table</TD>
<TD>Shipper</TD>
<TD>Shipper</TD>
<TD>8</TD>
<TD>CREATE TABLE &quot;Shipper&quot;<br>
(<br>
  &quot;Id&quot; INTEGER PRIMARY KEY,<br>
  &quot;CompanyName&quot; VARCHAR(8000) NULL,<br>
  &quot;Phone&quot; VARCHAR(8000) NULL<br>
)</TD>
</TR>
<TR><TD>table</TD>
<TD>Supplier</TD>
<TD>Supplier</TD>
<TD>9</TD>
<TD>CREATE TABLE &quot;Supplier&quot;<br>
(<br>
  &quot;Id&quot; INTEGER PRIMARY KEY,<br>
  &quot;CompanyName&quot; VARCHAR(8000) NULL,<br>
  &quot;ContactName&quot; VARCHAR(8000) NULL,<br>
  &quot;ContactTitle&quot; VARCHAR(8000) NULL,<br>
  &quot;Address&quot; VARCHAR(8000) NULL,<br>
  &quot;City&quot; VARCHAR(8000) NULL,<br>
  &quot;Region&quot; VARCHAR(8000) NULL,<br>
  &quot;PostalCode&quot; VARCHAR(8000) NULL,<br>
  &quot;Country&quot; VARCHAR(8000) NULL,<br>
  &quot;Phone&quot; VARCHAR(8000) NULL,<br>
  &quot;Fax&quot; VARCHAR(8000) NULL,<br>
  &quot;HomePage&quot; VARCHAR(8000) NULL<br>
)</TD>
</TR>
<TR><TD>table</TD>
<TD>Order</TD>
<TD>Order</TD>
<TD>11</TD>
<TD>CREATE TABLE &quot;Order&quot;<br>
(<br>
  &quot;Id&quot; INTEGER PRIMARY KEY,<br>
  &quot;CustomerId&quot; VARCHAR(8000) NULL,<br>
  &quot;EmployeeId&quot; INTEGER NOT NULL,<br>
  &quot;OrderDate&quot; VARCHAR(8000) NULL,<br>
  &quot;RequiredDate&quot; VARCHAR(8000) NULL,<br>
  &quot;ShippedDate&quot; VARCHAR(8000) NULL,<br>
  &quot;ShipVia&quot; INTEGER NULL,<br>
  &quot;Freight&quot; DECIMAL NOT NULL,<br>
  &quot;ShipName&quot; VARCHAR(8000) NULL,<br>
  &quot;ShipAddress&quot; VARCHAR(8000) NULL,<br>
  &quot;ShipCity&quot; VARCHAR(8000) NULL,<br>
  &quot;ShipRegion&quot; VARCHAR(8000) NULL,<br>
  &quot;ShipPostalCode&quot; VARCHAR(8000) NULL,<br>
  &quot;ShipCountry&quot; VARCHAR(8000) NULL<br>
)</TD>
</TR>
<TR><TD>table</TD>
<TD>Product</TD>
<TD>Product</TD>
<TD>12</TD>
<TD>CREATE TABLE &quot;Product&quot;<br>
(<br>
  &quot;Id&quot; INTEGER PRIMARY KEY,<br>
  &quot;ProductName&quot; VARCHAR(8000) NULL,<br>
  &quot;SupplierId&quot; INTEGER NOT NULL,<br>
  &quot;CategoryId&quot; INTEGER NOT NULL,<br>
  &quot;QuantityPerUnit&quot; VARCHAR(8000) NULL,<br>
  &quot;UnitPrice&quot; DECIMAL NOT NULL,<br>
  &quot;UnitsInStock&quot; INTEGER NOT NULL,<br>
  &quot;UnitsOnOrder&quot; INTEGER NOT NULL,<br>
  &quot;ReorderLevel&quot; INTEGER NOT NULL,<br>
  &quot;Discontinued&quot; INTEGER NOT NULL<br>
)</TD>
</TR>
<TR><TD>table</TD>
<TD>OrderDetail</TD>
<TD>OrderDetail</TD>
<TD>14</TD>
<TD>CREATE TABLE &quot;OrderDetail&quot;<br>
(<br>
  &quot;Id&quot; VARCHAR(8000) PRIMARY KEY,<br>
  &quot;OrderId&quot; INTEGER NOT NULL,<br>
  &quot;ProductId&quot; INTEGER NOT NULL,<br>
  &quot;UnitPrice&quot; DECIMAL NOT NULL,<br>
  &quot;Quantity&quot; INTEGER NOT NULL,<br>
  &quot;Discount&quot; DOUBLE NOT NULL<br>
)</TD>
</TR>
<TR><TD>index</TD>
<TD>sqlite_autoindex_OrderDetail_1</TD>
<TD>OrderDetail</TD>
<TD>15</TD>
<TD></TD>
</TR>
<TR><TD>table</TD>
<TD>CustomerCustomerDemo</TD>
<TD>CustomerCustomerDemo</TD>
<TD>16</TD>
<TD>CREATE TABLE &quot;CustomerCustomerDemo&quot;<br>
(<br>
  &quot;Id&quot; VARCHAR(8000) PRIMARY KEY,<br>
  &quot;CustomerTypeId&quot; VARCHAR(8000) NULL<br>
)</TD>
</TR>
<TR><TD>index</TD>
<TD>sqlite_autoindex_CustomerCustomerDemo_1</TD>
<TD>CustomerCustomerDemo</TD>
<TD>17</TD>
<TD></TD>
</TR>
<TR><TD>table</TD>
<TD>CustomerDemographic</TD>
<TD>CustomerDemographic</TD>
<TD>18</TD>
<TD>CREATE TABLE &quot;CustomerDemographic&quot;<br>
(<br>
  &quot;Id&quot; VARCHAR(8000) PRIMARY KEY,<br>
  &quot;CustomerDesc&quot; VARCHAR(8000) NULL<br>
)</TD>
</TR>
<TR><TD>index</TD>
<TD>sqlite_autoindex_CustomerDemographic_1</TD>
<TD>CustomerDemographic</TD>
<TD>19</TD>
<TD></TD>
</TR>
<TR><TD>table</TD>
<TD>Region</TD>
<TD>Region</TD>
<TD>21</TD>
<TD>CREATE TABLE &quot;Region&quot;<br>
(<br>
  &quot;Id&quot; INTEGER PRIMARY KEY,<br>
  &quot;RegionDescription&quot; VARCHAR(8000) NULL<br>
)</TD>
</TR>
<TR><TD>table</TD>
<TD>Territory</TD>
<TD>Territory</TD>
<TD>22</TD>
<TD>CREATE TABLE &quot;Territory&quot;<br>
(<br>
  &quot;Id&quot; VARCHAR(8000) PRIMARY KEY,<br>
  &quot;TerritoryDescription&quot; VARCHAR(8000) NULL,<br>
  &quot;RegionId&quot; INTEGER NOT NULL<br>
)</TD>
</TR>
<TR><TD>index</TD>
<TD>sqlite_autoindex_Territory_1</TD>
<TD>Territory</TD>
<TD>23</TD>
<TD></TD>
</TR>
<TR><TD>table</TD>
<TD>EmployeeTerritory</TD>
<TD>EmployeeTerritory</TD>
<TD>24</TD>
<TD>CREATE TABLE &quot;EmployeeTerritory&quot;<br>
(<br>
  &quot;Id&quot; VARCHAR(8000) PRIMARY KEY,<br>
  &quot;EmployeeId&quot; INTEGER NOT NULL,<br>
  &quot;TerritoryId&quot; VARCHAR(8000) NULL<br>
)</TD>
</TR>
<TR><TD>index</TD>
<TD>sqlite_autoindex_EmployeeTerritory_1</TD>
<TD>EmployeeTerritory</TD>
<TD>25</TD>
<TD></TD>
</TR>
<TR><TD>view</TD>
<TD>ProductDetails_V</TD>
<TD>ProductDetails_V</TD>
<TD>0</TD>
<TD>CREATE VIEW [ProductDetails_V] as<br>
select<br>
p.*,<br>
c.CategoryName, c.Description as [CategoryDescription],<br>
s.CompanyName as [SupplierName], s.Region as [SupplierRegion]<br>
from [Product] p<br>
join [Category] c on p.CategoryId = c.id<br>
join [Supplier] s on s.id = p.SupplierId</TD>
</TR>
</table>

```
sqlite>
sqlite> .quit
$
```

#### Explore SQLite Databases Using the sqlite3 Module (Extra)

For documentation of the sqlite3 package, see https://docs.python.org/3/library/sqlite3.html.

From https://www.sqlite.org/cli.html#querying_the_database_schema, this is a query that obtains the names of the tables:

```SQL
SELECT name FROM sqlite_schema 
WHERE type IN ('table','view') AND name NOT LIKE 'sqlite_%'
ORDER BY name
```

In [None]:
# Select the table names from the database.
conn = sqlite3.connect("Northwind_small.sqlite")
cur = conn.cursor()
sql1 = """
    SELECT
        name
    FROM
        sqlite_schema 
    WHERE
        type IN ('table', 'view')
        AND name NOT LIKE 'sqlite_%'
    ORDER BY
        name
"""
cur.execute(sql1)
rows = cur.fetchall() # a list of tuples
print(rows)
conn.close()

### Creating a Database Engine in Python

The course uses SQLAlchemy because it can be used to connect to many different relational database management systems.

#### Get the Table Names Using sqlalchemy (Demonstration)

The SQLAlchemy connection string is `sqlite:///Northwind_small.sqlite`. All three `/` characters are required for a relative path.
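
The slash count can be illustrated with plain string handling. The helper `sqlite_url` below is hypothetical (it is not part of SQLAlchemy), and the `/data/...` path is an invented example of an absolute POSIX path.

```python
from pathlib import PurePosixPath


def sqlite_url(path):
    """Build a SQLAlchemy-style SQLite URL from a POSIX path (hypothetical helper)."""
    # The URL is "sqlite:///" followed by the path, so an absolute path, which
    # already starts with "/", ends up with four slashes in a row.
    return "sqlite:///" + str(PurePosixPath(path))


print(sqlite_url("Northwind_small.sqlite"))
print(sqlite_url("/data/Northwind_small.sqlite"))
```

The first call yields the three-slash form used in these notes; the second shows the four-slash form for an absolute path.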

In [None]:
# Use sqlalchemy to connect to the SQLite database.
# This is from the course's demonstration.
engine = sqlalchemy.create_engine('sqlite:///Northwind_small.sqlite', future=True)
print("type(engine):", type(engine))
# Calling engine.table_names() is deprecated.
# table_names = engine.table_names()
# Create an inspector object.
inspector = sqlalchemy.inspect(engine)
# Use an inspector to get the table names.
table_names = inspector.get_table_names()
print(table_names)
# Use the inspector to get the view names.
view_names = inspector.get_view_names()
print(view_names)

In [None]:
# Get the table names using a query. The query also returns view names.
# We reuse the sql1 query string from above.
# The documentation says to use sqlalchemy.text(sql1) as the argument to
# conn.execute(), which returns an iterator of type
# sqlalchemy.engine.cursor.LegacyCursorResult.
# With the context manager, the connection is automatically closed.
with engine.connect() as conn:
    result = conn.execute(sqlalchemy.text(sql1))
    print("result:", result)
    for row in result:
        print(row)

#### Create a Database Engine and Get the Table Names (Exercises)

In [None]:
# This is the recommended code.
engine = sqlalchemy.create_engine('sqlite:///Chinook.sqlite', future=True)
inspector = sqlalchemy.inspect(engine)
table_names = inspector.get_table_names()
print(table_names)

### Querying Relational Databases in Python

The steps are:

1) import packages and functions
2) create the database engine
3) connect to the engine
4) query the database
5) save the result set to a DataFrame
6) close the connection

#### Execute a Basic Query (Demonstration)

In [None]:
# This code uses a context manager to manage the connection, as shown in
# the demonstration.
engine = sqlalchemy.create_engine("sqlite:///Northwind_small.sqlite", future=True)
with engine.connect() as conn:
    # Since "Order" is a SQL keyword, we must quote it here.
    result_set = conn.execute(sqlalchemy.text('SELECT * FROM "Order"'))
    orders = pd.DataFrame(result_set.fetchall())
    # Set the column names of the DataFrame.
    orders.columns = result_set.keys()
print(orders.head())
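
To see why the quotes around `"Order"` matter, here is a minimal sketch using the standard library's `sqlite3` module instead of SQLAlchemy, so that it runs without the Northwind file. The table contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The table name must be quoted because ORDER is a reserved word.
conn.execute('CREATE TABLE "Order" (Id INTEGER PRIMARY KEY, ShipName TEXT)')
conn.execute('INSERT INTO "Order" (ShipName) VALUES (?)', ("Example Ship",))

try:
    # Without quotes, the parser reads ORDER as the start of ORDER BY.
    conn.execute("SELECT * FROM Order")
    unquoted_failed = False
except sqlite3.OperationalError as err:
    unquoted_failed = True
    print("Unquoted name failed:", err)

# With double quotes, the name is treated as an identifier.
quoted_rows = conn.execute('SELECT * FROM "Order"').fetchall()
print(quoted_rows)
conn.close()
```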

#### Select Data from Specified Table Columns (Demonstration)

In [None]:
# This code reuses engine from above.
# Select data from specified columns.
with engine.connect() as conn:
    result_set = conn.execute(sqlalchemy.text('SELECT Id, OrderDate, ShipName FROM "Order"'))
    orders2 = pd.DataFrame(result_set.fetchmany(size=5))
    # Fetch one row as follows (wrap it in a list so it becomes one row):
    # orders2 = pd.DataFrame([result_set.fetchone()])
    orders2.columns = result_set.keys()
print(orders2.head())
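
The fetch methods consume the result set as they go: each call continues where the previous one stopped. The sketch below illustrates this with a plain `sqlite3` cursor, which is a stand-in for the SQLAlchemy result object. The `Numbers` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Numbers (n INTEGER)")
conn.executemany("INSERT INTO Numbers VALUES (?)", [(i,) for i in range(1, 7)])

cur = conn.execute("SELECT n FROM Numbers ORDER BY n")
# Column names come from the cursor, much like result_set.keys() above.
column_names = [d[0] for d in cur.description]

first_two = cur.fetchmany(size=2)  # rows 1 and 2
next_one = cur.fetchone()          # row 3, as a single tuple
the_rest = cur.fetchall()          # rows 4 to 6
after_end = cur.fetchone()         # nothing left, so None

print(column_names, first_two, next_one, the_rest, after_end)
conn.close()
```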

#### Execute a Simple Query (Exercise)

In [None]:
# Select all data from table Album.
engine = sqlalchemy.create_engine('sqlite:///Chinook.sqlite', future=True)
with engine.connect() as conn:
    rs1 = conn.execute(sqlalchemy.text('SELECT * FROM Album'))
    df1 = pd.DataFrame(rs1.fetchall())
    df1.columns = rs1.keys()
print(df1.head())

#### Select Data from Specific Columns and Return Three Rows (Exercise)

In [None]:
# This code reuses engine from above.
# Select data from specific columns and return 3 rows.
with engine.connect() as conn:
    rs2 = conn.execute(sqlalchemy.text('SELECT LastName, Title FROM Employee'))
    df2 = pd.DataFrame(rs2.fetchmany(size=3))
    df2.columns = rs2.keys()
print(df2.head())

#### Select Data Using a WHERE Filter (Exercise)

In [None]:
# This code reuses engine from above.
# Filter the data using WHERE.
with engine.connect() as conn:
    rs3 = conn.execute(sqlalchemy.text('SELECT * FROM Employee WHERE EmployeeId >= 6'))
    df3 = pd.DataFrame(rs3.fetchall())
    df3.columns = rs3.keys()
print(df3.head())

#### Select Data and Sort It (Exercise)

In [None]:
# This code reuses engine from above.
# Select rows and sort them.
with engine.connect() as conn:
    rs4 = conn.execute(sqlalchemy.text('SELECT * FROM Employee ORDER BY BirthDate'))
    df4 = pd.DataFrame(rs4.fetchall())
    df4.columns = rs4.keys()
print(df4.head())

### Querying Relational Databases Directly with pandas

#### Use pandas Directly to Query a Database (Demonstration)

In [None]:
# This code reuses engine from above.
# Use pandas directly to query a database.
with engine.connect() as conn:
    df5 = pd.read_sql_query(sqlalchemy.text('SELECT * FROM Employee ORDER BY BirthDate'), conn)
    print(df5.head())

#### Use pandas to Obtain Database Data (Exercise)

In [None]:
# This code reuses engine from above.
# Select the data with pd.read_sql_query() (the new approach).
# Select the same data with conn.execute() and fetchall() (the old approach).
# Check that the DataFrames contain the same data.
with engine.connect() as conn:
    df6 = pd.read_sql_query(sqlalchemy.text('SELECT * FROM Album'), conn)
    print(df6.head())

with engine.connect() as conn:
    rs7 = conn.execute(sqlalchemy.text('SELECT * FROM Album'))
    df7 = pd.DataFrame(rs7.fetchall())
    df7.columns = rs7.keys()
print(df7.head())
print()
print(df6.equals(df7))

#### Use pandas for a More Complex Query (Exercise)

In [None]:
# This code reuses engine from above.
# Execute a slightly more complex query using pandas.
sql8 = """
    SELECT * 
    FROM Employee
    WHERE EmployeeId >= 6
    ORDER BY BirthDate
"""
with engine.connect() as conn:
    df8 = pd.read_sql_query(sqlalchemy.text(sql8), conn)
    print(df8.head())

### Advanced Querying: Exploiting Table Relationships

#### Execute a Query with an Inner Join (Demonstration)

In [None]:
# This query is modified to work correctly with the Northwind_small.sqlite
# database.
engine = sqlalchemy.create_engine('sqlite:///Northwind_small.sqlite', future=True)
sql9 = '''
    SELECT
        "Order".ID AS OrderId,
        Customer.CompanyName
    FROM
        "Order"
        INNER JOIN Customer
            ON "Order".CustomerId = Customer.ID
'''
with engine.connect() as conn:
    df9 = pd.read_sql_query(sqlalchemy.text(sql9), conn)
    print(df9.head())
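
An inner join keeps only the rows that have a match in both tables. The sketch below shows this on two tiny in-memory tables built with `sqlite3`. The rows are invented and only mimic the shape of the Northwind tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (Id TEXT PRIMARY KEY, CompanyName TEXT)")
conn.execute('CREATE TABLE "Order" (Id INTEGER PRIMARY KEY, CustomerId TEXT)')
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?)",
    [("A", "Alpha Co"), ("B", "Beta Co")],
)
# Order 3 points at customer "Z", which does not exist.
conn.executemany(
    'INSERT INTO "Order" VALUES (?, ?)',
    [(1, "A"), (2, "A"), (3, "Z")],
)

sql = '''
    SELECT "Order".Id AS OrderId, Customer.CompanyName
    FROM "Order"
    INNER JOIN Customer ON "Order".CustomerId = Customer.Id
    ORDER BY "Order".Id
'''
joined = conn.execute(sql).fetchall()
print(joined)
conn.close()
```

Order 3 has no matching customer and "Beta Co" has no orders, so neither appears in the joined rows.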

#### Execute a Query with an INNER JOIN (Exercise)

In [None]:
# Obtain data from an inner join query and store it in a DataFrame.
engine = sqlalchemy.create_engine('sqlite:///Chinook.sqlite', future=True)
sql10 = """
    SELECT
        Album.Title,
        Artist.Name
    FROM
        Album
        INNER JOIN Artist
            ON Album.ArtistId = Artist.ArtistId
"""
with engine.connect() as conn:
    rs10 = conn.execute(sqlalchemy.text(sql10))
    df10 = pd.DataFrame(rs10.fetchall())
    df10.columns = rs10.keys()
print(df10.head())

#### Filter an Inner Join (Exercise)

In [None]:
# Execute a query with an inner join and a where clause.
sql11 = '''
    SELECT
        *
    FROM
        PlaylistTrack
        INNER JOIN Track
            ON PlaylistTrack.TrackId = Track.TrackId
    WHERE
        Track.Milliseconds < 250000
'''
with engine.connect() as conn:
    df11 = pd.read_sql_query(sqlalchemy.text(sql11), conn)
    print(df11.head())