### Intro to Relational DBs

#### ADD a COLUMN with ALTER TABLE 
ALTER TABLE table_name <br>
ADD COLUMN column_name data_type;

#### Delete table
DROP TABLE table_name;

#### Rename columns
ALTER TABLE table_name<br>
RENAME COLUMN old_name TO new_name;

#### Delete columns
ALTER TABLE table_name<br>
DROP COLUMN column_name;

#### Insert distinct rows from another table
INSERT INTO table_name_1<br>
SELECT DISTINCT col_1, col_2, col_3<br>
FROM table_name_2;

In [None]:
# -- Rename the organisation column
ALTER TABLE affiliations
RENAME COLUMN organisation TO organization;

# -- Delete the university_shortname column
ALTER TABLE affiliations
DROP COLUMN university_shortname;

# -- Insert unique professors into the new table
INSERT INTO professors 
SELECT DISTINCT firstname, lastname, university_shortname 
FROM university_professors;

### Integrity constraints

1. Attribute const = data types on columns
2. Key const = primary keys
3. Referential integrity const = foreign keys


Constraints give the data structure and consistency and improve data quality, but enforcing them can be difficult.

To run calculations on a column that is stored with an unsuitable type, cast it to a suitable data type on the fly:<br>

SELECT CAST(some_column AS integer)<br>
FROM table;
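
A minimal sketch of the cast above, run from Python with the standard-library sqlite3 module instead of PostgreSQL (an assumption made so the example is self-contained). The table name transactions and the column fee are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the fee is stored as text, so it cannot be summed reliably as-is
conn.execute("CREATE TABLE transactions (fee TEXT)")
conn.executemany("INSERT INTO transactions VALUES (?)", [("10",), ("25",), ("7",)])

# CAST converts each value for this query only; the stored type does not change
total = conn.execute("SELECT SUM(CAST(fee AS integer)) FROM transactions").fetchone()[0]
stored_type = conn.execute("SELECT typeof(fee) FROM transactions LIMIT 1").fetchone()[0]

print(total)        # 42
print(stored_type)  # text
```

The cast only affects the query result. To change the stored type permanently, the column itself has to be altered.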

Data types:
 - Are enforced on columns
 - Define the "domain" of a column
 - Define operations that are possible
 - Enforce consistent storage of values
 
Most common data types:
 - text: character strings of any length
 - varchar(n): a max of n characters
 - char(n): a fixed-length string of n characters
 - boolean: three-state value of TRUE/FALSE/NULL
 - numeric: arbitrary precision numbers
 - integer: whole numbers in the range of -2147483648 to +2147483647
 - date, time, timestamp: various formats of date and time
 
To change the data type of a column:

ALTER TABLE table_name<br>
ALTER COLUMN column_name<br>
TYPE varchar(10)

In [None]:
# -- Specify the correct fixed-length character type
ALTER TABLE professors
ALTER COLUMN university_shortname
TYPE char(3);

# -- Change the type of firstname
ALTER TABLE professors
ALTER COLUMN firstname
TYPE varchar(64);

To truncate the values before type conversion:

ALTER TABLE table_name<br>
ALTER COLUMN column_name<br>
TYPE varchar(x)<br>
USING SUBSTRING(column_name FROM 1 FOR x)

In [None]:
# -- Convert the values in firstname to a max. of 16 characters
ALTER TABLE professors 
ALTER COLUMN firstname 
TYPE varchar(16)
USING SUBSTRING(firstname FROM 1 FOR 16);

The NOT NULL constraint (when a column does not allow for missing values)

You can add a not-null constraint after the table has been created by:

ALTER TABLE table_name<br>
ALTER COLUMN column_name<br>
SET NOT NULL

And to remove a not-null constraint:

ALTER TABLE table_name<br>
ALTER COLUMN column_name<br>
DROP NOT NULL

The UNIQUE constraint (when a column does not include duplicates)

CREATE TABLE table_name(<br>
    column_name data_type UNIQUE<br>
);

ALTER TABLE table_name<br>
ADD CONSTRAINT some_name UNIQUE(column_name);

In [None]:
# -- Disallow NULL values in lastname
ALTER TABLE professors
ALTER COLUMN lastname
SET NOT NULL;

# -- Make organizations.organization unique
ALTER TABLE organizations
ADD CONSTRAINT organization_unq UNIQUE(organization);
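
To see what these two constraints do at insert time, here is a small sketch using Python's sqlite3 module. It is an assumption-laden stand-in for the PostgreSQL setup: SQLite cannot add constraints with ALTER TABLE, so they are declared in CREATE TABLE instead, and the value 'ETH' is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Both constraints declared up front (SQLite has no ALTER TABLE ... ADD CONSTRAINT)
conn.execute("CREATE TABLE organizations (organization TEXT UNIQUE NOT NULL)")
conn.execute("INSERT INTO organizations VALUES ('ETH')")

rejected = []
for value in ("ETH", None):  # a duplicate, then a missing value
    try:
        conn.execute("INSERT INTO organizations VALUES (?)", (value,))
    except sqlite3.IntegrityError as exc:
        rejected.append((value, str(exc)))

for value, message in rejected:
    print(value, "->", message)

row_count = conn.execute("SELECT COUNT(*) FROM organizations").fetchone()[0]
print(row_count)  # only the first insert went through
```

Both offending inserts are refused by the database itself, so the application does not need to check for duplicates or missing values.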

#### What is a key?

A key is an attribute (or a combination of attributes) whose values are unique across the table, so it identifies each record uniquely.<br>
A superkey is a set of attributes that uniquely identifies each tuple of a relation https://en.wikipedia.org/wiki/Superkey<br>
If no attribute can be removed from a superkey without losing the uniqueness property, the superkey is "minimal" and is called a (candidate) key.


Identify keys with SELECT COUNT DISTINCT

There's a very basic way of finding out what qualifies for a key in an existing, populated table:

 - Count the distinct records for all possible combinations of columns. If the resulting number x equals the number of all rows in the table for a combination, you have discovered a superkey.

 - Then remove one column after another until you can no longer remove columns without seeing the number x decrease. If that is the case, you have discovered a (candidate) key.

Counting distinct combinations of columns:

SELECT COUNT(DISTINCT(column_a, column_b, ...))<br>
FROM table;

In [None]:
# -- Count the distinct (university, university_city) combinations in universities
SELECT COUNT(DISTINCT(university, university_city)) 
FROM universities;
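
The two-step procedure described above (find superkeys, then strip columns until uniqueness breaks) can be sketched in plain Python. The function name find_candidate_keys and the sample rows are hypothetical, and the search checks small column combinations first instead of shrinking large ones, which yields the same minimal keys.

```python
from itertools import combinations

def find_candidate_keys(rows, columns):
    """Return the minimal column combinations whose values are unique across rows."""
    keys = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            # A superset of a key already found is a superkey, but not a minimal one
            if any(set(key) <= set(combo) for key in keys):
                continue
            distinct = {tuple(row[col] for col in combo) for row in rows}
            # Same test as COUNT(DISTINCT(...)) = number of rows
            if len(distinct) == len(rows):
                keys.append(combo)
    return keys

# Made-up sample data
rows = [
    {"make": "Alpha", "model": "S", "country": "US"},
    {"make": "Alpha", "model": "X", "country": "US"},
    {"make": "Beta", "model": "S", "country": "DE"},
    {"make": "Beta", "model": "X", "country": "DE"},
]
candidate_keys = find_candidate_keys(rows, ["make", "model", "country"])
print(candidate_keys)  # [('make', 'model'), ('model', 'country')]
```

Keep in mind that this only tells you what is unique in the current data. Whether a combination stays unique for future rows is a modelling decision.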

#### Primary keys

 - Single primary key per table
 - Built from as few columns as possible
 - Purpose is to uniquely identify records in a table
 - Defined on columns that don't accept duplicate or NULL values
 - Time invariant i.e. never change over time, so must make sure that future data won't break the key

CREATE TABLE table_name(<br>
    col_1 integer UNIQUE NOT NULL,<br>
    col_2 text,<br>
    col_3 numeric<br>
);

CREATE TABLE table_name(<br>
    col_1 integer PRIMARY KEY,<br>
    col_2 text,<br>
    col_3 numeric<br>
);

CREATE TABLE table_name(<br>
    col_1 integer,<br>
    col_2 text,<br>
    col_3 numeric,<br>
    PRIMARY KEY (col_1, col_2)<br>
);

ALTER TABLE table_name<br>
ADD CONSTRAINT some_name PRIMARY KEY (column_name)

In [None]:
# -- Rename the organization column to id
ALTER TABLE organizations
RENAME COLUMN organization TO id;

# -- Make id a primary key
ALTER TABLE organizations
ADD CONSTRAINT organization_pk PRIMARY KEY (id);



#### Surrogate keys

 - Method 1: adding a surrogate key with serial data type

ALTER TABLE table_name<br>
ADD COLUMN id SERIAL PRIMARY KEY;<br>

INSERT INTO table_name<br>
VALUES ('col1_value', 'col2_value');

The id value is left out of the INSERT: the serial column fills it in automatically.


 - Method 2: combine two or more existing columns into a new one

ALTER TABLE table_name<br>
ADD COLUMN column_c varchar(256);<br>

UPDATE table_name<br>
SET column_c = CONCAT(column_a, column_b);

ALTER TABLE table_name<br>
ADD CONSTRAINT pk PRIMARY KEY (column_c);

In [None]:
# -- Add the new column to the table
ALTER TABLE professors 
ADD COLUMN id serial;

# -- Make id a primary key
ALTER TABLE professors 
ADD CONSTRAINT professors_pkey PRIMARY KEY (id);

# -- Have a look at the first 10 rows of professors
SELECT * FROM professors LIMIT 10;


In [None]:
# -- Count the number of distinct rows with columns make, model
SELECT COUNT(DISTINCT(make, model)) 
FROM cars;

# -- Add the id column
ALTER TABLE cars
ADD COLUMN id varchar(128);

# -- Update id with make + model
UPDATE cars
SET id = CONCAT(make, model);

# -- Make id a primary key
ALTER TABLE cars
ADD CONSTRAINT id_pk PRIMARY KEY(id);

# -- Have a look at the table
SELECT * FROM cars;
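
A small sketch of the first method (an auto-numbered surrogate key), using Python's sqlite3 module. This is an approximation: SQLite has no serial type, so INTEGER PRIMARY KEY is used as the closest equivalent, and the sample names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY plays the role of PostgreSQL's serial primary key here
conn.execute(
    "CREATE TABLE professors (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT)"
)

# The id column is not mentioned in the INSERT; the database numbers the rows itself
conn.executemany(
    "INSERT INTO professors (firstname, lastname) VALUES (?, ?)",
    [("Ada", "Lovelace"), ("Alan", "Turing")],
)

ids = [row[0] for row in conn.execute("SELECT id FROM professors ORDER BY id")]
print(ids)  # [1, 2]
```

The key carries no meaning of its own, which is exactly why it never has to change when the real-world attributes of a record change.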

#### Model 1:N relationships with foreign keys

 - Foreign keys are designated columns that point to the primary key of another table
 - Domain of FK must be equal to domain of PK
 - FK values must exist in PK of the other table (FK const of "referential integrity")
 - FKs are not actual keys as duplicate values are allowed
 
The syntax to add a foreign key to an existing table is:

ALTER TABLE a <br>
ADD CONSTRAINT a_fkey FOREIGN KEY (b_id) REFERENCES b (id);

Table a should now refer to table b, via b_id, which points to id. a_fkey is, as usual, a constraint name you can choose on your own.

Pay attention to the naming convention employed here: Usually, a foreign key referencing another primary key with name id is named x_id, where x is the name of the referenced table in the singular form.

In [None]:
# -- Rename the university_shortname column
ALTER TABLE professors
RENAME COLUMN university_shortname TO university_id;

# -- Add a foreign key on professors referencing universities
ALTER TABLE professors
ADD CONSTRAINT professors_fkey FOREIGN KEY (university_id) REFERENCES universities (id);

Foreign key constraints help you to keep order in your database mini-world. In your database, for instance, only professors belonging to Swiss universities should be allowed, as only Swiss universities are part of the universities table.

The foreign key on professors referencing universities you just created thus makes sure that only existing universities can be specified when inserting new data, for example:

INSERT INTO professors (firstname, lastname, university_id)<br>
VALUES ('Albert', 'Einstein', 'UZH');
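
The following sketch shows the foreign key accepting a known university and rejecting an unknown one. It uses Python's sqlite3 module as a stand-in for PostgreSQL; the helper try_insert and the id 'xyz' are made up, and the PRAGMA line is SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on
conn.execute("CREATE TABLE universities (id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE professors ("
    " firstname TEXT, lastname TEXT,"
    " university_id TEXT REFERENCES universities (id))"
)
conn.execute("INSERT INTO universities VALUES ('UZH')")

def try_insert(university_id):
    """Insert a professor and report whether the database accepted the row."""
    try:
        conn.execute(
            "INSERT INTO professors VALUES ('Albert', 'Einstein', ?)", (university_id,)
        )
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

outcomes = {uid: try_insert(uid) for uid in ("UZH", "xyz")}
print(outcomes)  # {'UZH': 'ok', 'xyz': 'rejected'}
```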

In [2]:
# -- Select all professors working for universities in the city of Zurich
SELECT professors.lastname, universities.id, universities.university_city
FROM professors
JOIN universities
ON professors.university_id = universities.id
WHERE universities.university_city = 'Zurich';

#### Model M:N relationships with foreign keys

 - Create table
 - Add foreign keys for every connected table
 - Add additional attributes
 - No primary keys used
 
CREATE TABLE table_name(<br>
    col_1 INT REFERENCES table_other_1 (id),<br>
    col_2 varchar(256) REFERENCES table_other_2 (id),<br>
    function varchar(256)<br>
);

In [None]:
# -- Add a professor_id column
ALTER TABLE affiliations
ADD COLUMN professor_id integer REFERENCES professors (id);

# -- Rename the organization column to organization_id
ALTER TABLE affiliations
RENAME organization TO organization_id;

# -- Add a foreign key on organization_id
ALTER TABLE affiliations
ADD CONSTRAINT affiliations_organization_fkey FOREIGN KEY (organization_id) REFERENCES organizations (id);

Here's a way to update columns of a table based on values in another table:

UPDATE table_a<br>
SET column_to_update = table_b.column_to_update_from<br>
FROM table_b<br>
WHERE condition1 AND condition2 AND ...;

This query does the following:

For each row in table_a, find the corresponding row in table_b where condition1, condition2, etc., are met.

Set the value of column_to_update to the value of column_to_update_from (from that corresponding row).

The conditions usually compare other columns of both tables, e.g. table_a.some_column = table_b.some_column. Of course, this query only makes sense if there is only one matching row in table_b.

In [None]:
# -- Set professor_id to professors.id where firstname, lastname correspond to rows in professors
UPDATE affiliations
SET professor_id = professors.id
FROM professors
WHERE affiliations.firstname = professors.firstname AND affiliations.lastname = professors.lastname;
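
Here is a runnable sketch of the same idea with Python's sqlite3 module. As an assumption for portability, it uses a correlated subquery instead of UPDATE ... FROM (older SQLite versions do not support the FROM form); the effect on the data is the same. The names in the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE professors (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT);
    CREATE TABLE affiliations (firstname TEXT, lastname TEXT, professor_id INTEGER);
    INSERT INTO professors (firstname, lastname) VALUES ('Ada', 'Lovelace'), ('Alan', 'Turing');
    INSERT INTO affiliations (firstname, lastname) VALUES
        ('Alan', 'Turing'), ('Ada', 'Lovelace'), ('Alan', 'Turing');
""")

# For each affiliation, look up the one professor with the same first and last name
conn.execute("""
    UPDATE affiliations
    SET professor_id = (
        SELECT professors.id FROM professors
        WHERE professors.firstname = affiliations.firstname
          AND professors.lastname = affiliations.lastname
    )
""")

professor_ids = [
    row[0] for row in conn.execute("SELECT professor_id FROM affiliations ORDER BY rowid")
]
print(professor_ids)  # [2, 1, 2]
```

If two professors shared the same first and last name, the lookup would be ambiguous, which is why the match should be on columns that identify exactly one row.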



#### Referential integrity

 - A record referencing another table must refer to an existing record in that table
 - Always specified between two tables and enforced with foreign keys
 
Violations of referential integrity can be dealt with in several ways (the ON DELETE behavior):
 - NO ACTION: throws an error
 - CASCADE: deletes all referencing records
 - RESTRICT: throws an error
 - SET NULL: sets the referencing column to NULL
 - SET DEFAULT: sets the referencing column to its default value

#### Change the referential integrity behavior of a key

So far, you implemented three foreign key constraints:

professors.university_id to universities.id

affiliations.organization_id to organizations.id

affiliations.professor_id to professors.id

These foreign keys currently have the behavior ON DELETE NO ACTION. Here, you're going to change that behavior for the column referencing organizations from affiliations. If an organization is deleted, all its affiliations (by any professor) should also be deleted.

Altering a key constraint doesn't work with ALTER COLUMN. Instead, you have to DROP the key constraint and then ADD a new one with a different ON DELETE behavior.

For deleting constraints, though, you need to know their name. This information is also stored in information_schema.

In [None]:
# -- Identify the correct constraint name
SELECT constraint_name, table_name, constraint_type
FROM information_schema.table_constraints
WHERE constraint_type = 'FOREIGN KEY';

# -- Drop the right foreign key constraint (use the name returned by the query above)
ALTER TABLE affiliations
DROP CONSTRAINT affiliations_organization_id_fkey;

# -- Add a new foreign key constraint from affiliations to organizations which cascades deletion
ALTER TABLE affiliations
ADD CONSTRAINT affiliations_organization_id_fkey FOREIGN KEY (organization_id) REFERENCES organizations (id) ON DELETE CASCADE;

# -- Delete an organization 
DELETE FROM organizations 
WHERE id = 'CUREM';

# -- Check that no more affiliations with this organization exist
SELECT * FROM affiliations
WHERE organization_id = 'CUREM';
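
A self-contained sketch of the cascading delete, again with Python's sqlite3 module standing in for PostgreSQL. The column role and the organization 'OtherOrg' are made up, and the constraint is declared in CREATE TABLE because SQLite cannot drop and re-add constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: turn on foreign key enforcement
conn.executescript("""
    CREATE TABLE organizations (id TEXT PRIMARY KEY);
    CREATE TABLE affiliations (
        organization_id TEXT REFERENCES organizations (id) ON DELETE CASCADE,
        role TEXT
    );
    INSERT INTO organizations VALUES ('CUREM'), ('OtherOrg');
    INSERT INTO affiliations VALUES
        ('CUREM', 'board'), ('CUREM', 'advisor'), ('OtherOrg', 'board');
""")

# Deleting the organization also deletes every affiliation that references it
conn.execute("DELETE FROM organizations WHERE id = 'CUREM'")

remaining = conn.execute("SELECT organization_id, role FROM affiliations").fetchall()
print(remaining)  # [('OtherOrg', 'board')]
```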

![image](assets/intro_to_dbs/table_schema.png)


#### Count affiliations per university

Now that your data is ready for analysis, let's run some example SQL queries on the database. You'll now use already known concepts such as grouping by columns and joining tables.

In this exercise, you will find out which university has the most affiliations (through its professors). For that, you need both affiliations and professors tables, as the latter also holds the university_id.

As a quick refresher, remember that joins have the following structure:

SELECT table_a.column1, table_a.column2, table_b.column1, ... <br>
FROM table_a <br>
JOIN table_b <br>
ON table_a.column = table_b.column

This results in a combination of table_a and table_b, but only with rows where table_a.column is equal to table_b.column.

In [None]:
# -- Count the total number of affiliations per university
SELECT COUNT(*), professors.university_id 
FROM affiliations
JOIN professors
ON affiliations.professor_id = professors.id

# -- Group by the university ids of professors
GROUP BY professors.university_id
ORDER BY count DESC;
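
To check the logic of this query on data small enough to verify by hand, here is a sketch with Python's sqlite3 module. The rows and the alias affiliation_count are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE professors (id INTEGER PRIMARY KEY, university_id TEXT);
    CREATE TABLE affiliations (professor_id INTEGER, organization_id TEXT);
    INSERT INTO professors VALUES (1, 'UZH'), (2, 'UZH'), (3, 'EPF');
    INSERT INTO affiliations VALUES (1, 'A'), (1, 'B'), (2, 'A'), (3, 'C');
""")

# Each affiliation is matched to its professor, then counted per university
counts = conn.execute("""
    SELECT COUNT(*) AS affiliation_count, professors.university_id
    FROM affiliations
    JOIN professors ON affiliations.professor_id = professors.id
    GROUP BY professors.university_id
    ORDER BY affiliation_count DESC
""").fetchall()
print(counts)  # [(3, 'UZH'), (1, 'EPF')]
```

Professors 1 and 2 both belong to UZH and hold three affiliations between them, so UZH ends up with a count of 3.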

In [None]:
# Join all tables in the database (starting with affiliations, professors, organizations,
# and universities) and look at the result.
# -- Join all tables
SELECT *
FROM affiliations
JOIN professors
ON affiliations.professor_id = professors.id
JOIN organizations
ON affiliations.organization_id = organizations.id
JOIN universities
ON professors.university_id = universities.id;

# Now group the result by organization sector, professor, and university city.
# Count the resulting number of rows.
# -- Group the table by organization sector, professor ID and university city
SELECT COUNT(*), organizations.organization_sector, 
professors.id, universities.university_city
FROM affiliations
JOIN professors
ON affiliations.professor_id = professors.id
JOIN organizations
ON affiliations.organization_id = organizations.id
JOIN universities
ON professors.university_id = universities.id
GROUP BY organizations.organization_sector, 
professors.id, universities.university_city;

# Only retain rows with "Media & communication" as organization sector,
# and sort the table by count, in descending order.
# -- Filter the table and sort it
SELECT COUNT(*), organizations.organization_sector, 
professors.id, universities.university_city
FROM affiliations
JOIN professors
ON affiliations.professor_id = professors.id
JOIN organizations
ON affiliations.organization_id = organizations.id
JOIN universities
ON professors.university_id = universities.id
WHERE organizations.organization_sector = 'Media & communication'
GROUP BY organizations.organization_sector, 
professors.id, universities.university_city
ORDER BY COUNT DESC;

### Intro to DBs using Python

#### Connecting to a database with SQL Alchemy

from sqlalchemy import create_engine<br>
engine = create_engine('sqlite:///census_nyc.sqlite')<br>
connection = engine.connect()

 - Engine: a common interface to the database from SQLAlchemy
 - Connection string: All the details required to find the database including login credentials
 
Use print(engine.table_names()) to check the available tables in the database (deprecated since SQLAlchemy 1.4, where inspect(engine).get_table_names() is the replacement).

To work with a table you have to use "Reflection" that reads the database and builds SQLAlchemy Table objects:

from sqlalchemy import MetaData, Table<br>
metadata = MetaData()<br>
census = Table('census', metadata, autoload=True, autoload_with=engine)<br>
print(repr(census))


An engine is just a common interface to a database, and the information it requires to connect to one is contained in a connection string, for example sqlite:///example.sqlite. Here, sqlite in sqlite:/// is the database dialect (SQLAlchemy picks a default driver for it), while example.sqlite is a SQLite file contained in the local directory. https://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls


In [None]:
# Import create_engine
from sqlalchemy import create_engine

# Create an engine that connects to the census.sqlite file: engine
engine = create_engine('sqlite:///census.sqlite')

# Print table names
print(engine.table_names())


#### Autoloading Tables from a database

SQLAlchemy can be used to automatically load tables from a database using something called reflection. Reflection is the process of reading the database and building the metadata based on that information. It's the opposite of creating a Table by hand and is very useful for working with existing databases.

To perform reflection, you will first need to import and initialize a MetaData object. MetaData objects contain information about tables stored in a database. During reflection, the MetaData object will be populated with information about the reflected table automatically, so we only need to initialize it before reflecting by calling MetaData().

You will also need to import the Table object from the SQLAlchemy package. Then, you use this Table object to read your table from the engine, autoload the columns, and populate the metadata. This can be done with a single call to Table(): using the Table object in this manner is a lot like passing arguments to a function. For example, to autoload the columns with the engine, you have to specify the keyword arguments autoload=True and autoload_with=engine to Table().

In [None]:
# Import create_engine, MetaData, and Table
from sqlalchemy import create_engine, MetaData, Table

# Create engine: engine
engine = create_engine('sqlite:///census.sqlite')

# Create a metadata object: metadata
metadata = MetaData()

# Reflect census table from the engine: census
census = Table('census', metadata, autoload=True, autoload_with=engine)

# Print census table metadata
print(repr(census))

#### Viewing Table details

It is important to get an understanding of your database by examining the column names. This can be done by using the .columns attribute and accessing the .keys() method. For example, census.columns.keys() would return a list of column names of the census table.

Following this, we can use the metadata container to find out more details about the reflected table such as the columns and their types. For example, information about the table objects is stored in the metadata.tables dictionary, so you can get the metadata of your census table with metadata.tables['census']. This is similar to your use of the repr() function on the census table from the previous exercise.

In [None]:
from sqlalchemy import create_engine, MetaData, Table

engine = create_engine('sqlite:///census.sqlite')

metadata = MetaData()

# Reflect the census table from the engine: census
census = Table('census', metadata, autoload=True, autoload_with=engine)

# Print the column names
print(census.columns.keys())

# Print full metadata of census
print(repr(census))
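
Reflection boils down to asking the database for its own catalog. The sketch below shows that idea with Python's sqlite3 module rather than SQLAlchemy (an assumption, to keep it dependency-free): it reads table names and column definitions from SQLite's built-in catalog. The column types chosen for census are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE census "
    "(state TEXT, sex TEXT, age INTEGER, pop2000 INTEGER, pop2008 INTEGER)"
)

# Table names live in the sqlite_master catalog table
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# PRAGMA table_info yields one row per column: (cid, name, type, notnull, dflt_value, pk)
columns = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(census)")}

print(table_names)    # ['census']
print(list(columns))  # ['state', 'sex', 'age', 'pop2000', 'pop2008']
```

SQLAlchemy's Table objects hold this kind of information (plus keys and constraints) in a database-independent form.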


#### Query table
First, establish a connection to the database by calling the .connect() method on the engine.

The create_engine() function returns an instance of an engine, but it does not actually open a connection until an action is called that would require a connection, such as a query.

The object returned by the .execute() method is a ResultProxy. On this ResultProxy, we can then use the .fetchall() method to get our results - that is, the ResultSet.

Notice that when you execute a query using raw SQL, you will query the table in the database directly. In particular, no reflection step is needed.

In [None]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///census.sqlite')

# Create a connection on engine
connection = engine.connect()

# Build select statement for census table: stmt
stmt = "SELECT * FROM census"

# Execute the statement and fetch the results: results
results = connection.execute(stmt).fetchall()

# Print results
print(results)

#### The pythonic way to query a table

In [None]:
# Import select
from sqlalchemy import select

# Reflect census table via engine: census
census = Table('census', metadata, autoload=True, autoload_with=engine)

# Build select statement for census table: stmt
stmt = select([census])

# Print the emitted statement to see the SQL string
print(stmt)

# Execute the statement on connection and fetch 10 records: results
results = connection.execute(stmt).fetchmany(size=10)

# Print the results
print(results)

#### Result Proxy vs Set

Recall the differences between a ResultProxy and a ResultSet:

 - ResultProxy: The object returned by the .execute() method. It can be used in a variety of ways to get the data returned by the query.
 - ResultSet: The actual data asked for in the query when using a fetch method such as .fetchall() on a ResultProxy.
 
This separation between the ResultSet and ResultProxy allows us to fetch as much or as little data as we desire.

Once we have a ResultSet, we can use Python to access all the data within it by column name and by list style indexes.

In [None]:
# Get the first row of the results by using an index: first_row
first_row = results[0]

# Print the first row of the results
print(first_row)

# Print the first column of the first row by accessing it by its index
print(results[0][0])

# Print the 'state' column of the first row by using its name
print(results[0]['state'])
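
The same split between "handle on the query" and "rows actually fetched" exists in Python's built-in sqlite3 module, which makes for a dependency-free sketch. This is an analogy and not SQLAlchemy's API: the cursor plays the role of the ResultProxy, and the fetched lists play the role of the ResultSet. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows then support both index and column-name access
conn.execute("CREATE TABLE census (state TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?)",
    [("Illinois", 0), ("Illinois", 1), ("Texas", 0)],
)

# The cursor is only a handle on the query; no rows have been pulled yet
cursor = conn.execute("SELECT state, age FROM census ORDER BY rowid")
first_two = cursor.fetchmany(2)  # pull just two rows
rest = cursor.fetchall()         # pull whatever is left

print(first_two[0][0], first_two[0]["state"])  # Illinois Illinois
print(len(first_two), len(rest))               # 2 1
```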

To connect to a database you need a connection string. In general, connection strings have the form "dialect+driver://username:password@host:port/database"

There are four components to the connection string in this exercise:
 - the dialect and driver ('postgresql+psycopg2://')
 - followed by the username and password ('student:datacamp')
 - followed by the host and port ('@postgresql.csrrinzqubik.us-east-1.rds.amazonaws.com:5432/')
 - finally, the database name ('census')
 
You will have to pass this string as an argument to create_engine() in order to connect to the database.
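
To make the anatomy of the connection string concrete, the sketch below takes it apart with Python's standard urllib.parse module. This is only for illustration: SQLAlchemy does its own parsing, and the dictionary keys are names chosen here.

```python
from urllib.parse import urlsplit

url = (
    "postgresql+psycopg2://student:datacamp"
    "@postgresql.csrrinzqubik.us-east-1.rds.amazonaws.com:5432/census"
)

parts = urlsplit(url)
# The scheme holds "dialect+driver"; split it at the plus sign
dialect, _, driver = parts.scheme.partition("+")

components = {
    "dialect": dialect,
    "driver": driver,
    "username": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "database": parts.path.lstrip("/"),
}
for name, value in components.items():
    print(f"{name}: {value}")
```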

In [25]:
# Import create_engine function
from sqlalchemy import create_engine

# Create an engine to the census database
engine = create_engine( "postgresql+psycopg2://student:datacamp@postgresql.csrrinzqubik.us-east-1.rds.amazonaws.com:5432/census")

# Create a connection on engine
connection = engine.connect()

# Use the .table_names() method on the engine to print the table names
print(engine.table_names())

['census', 'new_data', 'census1', 'data', 'data1', 'employees', 'employees3', 'employees_2', 'nyc_jobs', 'final_orders', 'state_fact', 'orders', 'users', 'vrska']


#### Filtering queries

Note: some of the SQLAlchemy methods used in these notes seem to be deprecated in newer versions.

A where() clause is used to filter the data that a statement returns. For example, to select all the records from the census table where the sex is Female (or 'F') we would do the following:

select([census]).where(census.columns.sex == 'F')

In addition to == we can use basically any Python comparison operator (such as <=, !=, etc.) in the where() clause.

In [30]:
from sqlalchemy import select, Table, MetaData

metadata = MetaData()

# Reflect census table via engine: census
census = Table('census', metadata, autoload=True, autoload_with=engine)

# Create a select query: stmt
stmt = select([census])

# Add a where clause to filter the results to only those for New York: stmt
stmt = stmt.where(census.columns.state == "New York")

# Execute the query to retrieve all the data returned: results
results = connection.execute(stmt).fetchall()

# Loop over the results and print the age, sex, and pop2000
for result in results:
    print(result.age, result.sex, result.pop2000)

ProgrammingError: (psycopg2.errors.InsufficientPrivilege) permission denied for table census

[SQL: SELECT census.state, census.sex, census.age, census.pop2000, census.pop2008 
FROM census 
WHERE census.state = %(state_1)s]
[parameters: {'state_1': 'New York'}]
(Background on this error at: http://sqlalche.me/e/14/f405)

#### Filter data selected from a Table - Expressions

In addition to standard Python comparators, we can also use methods such as in_() to create more powerful where() clauses. http://docs.sqlalchemy.org/en/latest/core/sqlelement.html#module-sqlalchemy.sql.expression

Method in_(), when used on a column, allows us to include records where the value of a column is among a list of possible values. For example, where(census.columns.age.in_([20, 30, 40])) will return only records for people who are exactly 20, 30, or 40 years old.
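
Under the hood, an in_() filter becomes an SQL IN clause with one bound parameter per list element. A minimal sketch of that expansion with Python's sqlite3 module (not SQLAlchemy), using made-up population numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (state TEXT, pop2000 INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?)",
    [("New York", 100), ("Texas", 200), ("Ohio", 300), ("California", 400)],
)

states = ["New York", "California", "Texas"]
# One "?" placeholder per value; the values themselves are passed separately
placeholders = ", ".join("?" for _ in states)
query = f"SELECT state, pop2000 FROM census WHERE state IN ({placeholders}) ORDER BY state"

matches = conn.execute(query, states).fetchall()
print(matches)  # [('California', 400), ('New York', 100), ('Texas', 200)]
```

Only the placeholders are formatted into the SQL string. The values travel as parameters, which is also what SQLAlchemy does for you.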

In [31]:
# Define a list of states for which we want results
states = ['New York', 'California', 'Texas']

# Create a query for the census table: stmt
stmt = select([census])

# Append a where clause to match all the states in_ the list states
stmt = stmt.where(census.columns.state.in_(states))

# Loop over the ResultProxy and print the state and its population in 2000
for result in connection.execute(stmt):
    print(result.state, result.pop2000)

ProgrammingError: (psycopg2.errors.InsufficientPrivilege) permission denied for table census

[SQL: SELECT census.state, census.sex, census.age, census.pop2000, census.pop2008 
FROM census 
WHERE census.state IN (%(state_1_1)s, %(state_1_2)s, %(state_1_3)s)]
[parameters: {'state_1_1': 'New York', 'state_1_2': 'California', 'state_1_3': 'Texas'}]
(Background on this error at: http://sqlalche.me/e/14/f405)

#### Filter data selected from a Table

SQLAlchemy also allows users to use conjunctions such as and_(), or_(), and not_() to build more complex filtering.

For example, we can get a set of records for people in New York who are 21 or 37 years old with the following code:

select([census]).where(<br>
  and_(census.columns.state == 'New York',<br>
       or_(census.columns.age == 21,<br>
          census.columns.age == 37<br>
         )<br>
      )<br>
)
  
An equivalent SQL statement would be, for example:

SELECT * FROM census WHERE state = 'New York' AND (age = 21 OR age = 37)
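
The parentheses in this statement matter because AND binds more tightly than OR, which is also why or_() is nested inside and_() above. A small sketch with Python's sqlite3 module and made-up rows shows the difference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (state TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?)",
    [("New York", 21), ("New York", 37), ("New York", 50), ("Texas", 21), ("Texas", 37)],
)
params = ("New York", 21, 37)

# With parentheses: New York rows aged 21 or 37
grouped = conn.execute(
    "SELECT state, age FROM census "
    "WHERE state = ? AND (age = ? OR age = ?) ORDER BY rowid",
    params,
).fetchall()

# Without parentheses: (New York AND 21) OR (anyone aged 37)
ungrouped = conn.execute(
    "SELECT state, age FROM census "
    "WHERE state = ? AND age = ? OR age = ? ORDER BY rowid",
    params,
).fetchall()

print(grouped)    # [('New York', 21), ('New York', 37)]
print(ungrouped)  # [('New York', 21), ('New York', 37), ('Texas', 37)]
```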

In [None]:
# Import and_
from sqlalchemy import and_

# Build a query for the census table: stmt
stmt = select([census])

# Append a where clause to select only non-male records from California using and_
stmt = stmt.where(
    # The state of California with a non-male sex
    and_(census.columns.state == 'California',
         census.columns.sex != 'M'
         )
)

# Loop over the ResultProxy printing the age and sex
for result in connection.execute(stmt):
    print(result.age, result.sex)


#### Ordering by a single column

To sort the result output by a field, we use the .order_by() method. By default, the .order_by() method sorts from lowest to highest on the supplied column. You just have to pass in the name of the column you want sorted to .order_by().

In [None]:
# Build a query to select the state column: stmt
stmt = select([census.columns.state])

# Order stmt by the state column
stmt = stmt.order_by(census.columns.state)

# Execute the query and store the results: results
results = connection.execute(stmt).fetchall()

# Print the first 10 results
print(results[:10])

#### Ordering in descending order by a single column

You can also use .order_by() to sort from highest to lowest by wrapping a column in the desc() function. 

Pass desc() (for "descending") inside an .order_by() with the name of the column you want to sort by. For instance, stmt.order_by(desc(table.columns.column_name)) sorts column_name in descending order.

In [None]:
# Import desc
from sqlalchemy import desc

# Build a query to select the state column: stmt
stmt = select([census.columns.state])

# Order stmt by state in descending order: rev_stmt
rev_stmt = stmt.order_by(desc(census.columns.state))

# Execute the query and store the results: rev_results
rev_results = connection.execute(rev_stmt).fetchall()

# Print the first 10 rev_results
print(rev_results[:10])

#### Ordering by multiple columns

We can pass multiple arguments to the .order_by() method to order by multiple columns. In fact, we can also sort in ascending or descending order for each individual column.

Each column in the .order_by() method is fully sorted from left to right.

This means that the first column is completely sorted, and then within each matching group of values in the first column, it's sorted by the next column in the .order_by() method.

This process is repeated until all the columns in the .order_by() are sorted.

In [None]:
# Build a query to select state and age: stmt
stmt = select([census.columns.state, census.columns.age])

# Append order by to ascend by state and descend by age
stmt = stmt.order_by(census.columns.state, desc(census.columns.age))

# Execute the statement and store all the records: results
results = connection.execute(stmt).fetchall()

# Print the first 20 results
print(results[:20])
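
The left-to-right sorting rule can be reproduced in plain Python, which may help when reasoning about the expected output. The rows below are made up; the sort key mirrors order_by(state, desc(age)).

```python
rows = [
    ("Texas", 30),
    ("Alabama", 25),
    ("Texas", 85),
    ("Alabama", 85),
    ("Alaska", 40),
]

# Sort by state ascending, then by age descending within each state
# (negating a number reverses its sort order)
ordered = sorted(rows, key=lambda row: (row[0], -row[1]))

for state, age in ordered:
    print(state, age)
# Alabama 85
# Alabama 25
# Alaska 40
# Texas 85
# Texas 30
```

The second column only decides the order among rows that tie on the first column.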

SQLAlchemy's func module provides access to built-in SQL functions that can make operations like counting and summing faster and more efficient.

To get a sum of the pop2008 column of census, build the query as shown below:

 - select([func.sum(census.columns.pop2008)])

To count the number of values in pop2008, you could use func.count() like this:

 - select([func.count(census.columns.pop2008)])

To count the distinct values of pop2008, you can use the .distinct() method:

 - select([func.count(census.columns.pop2008.distinct())])

In [None]:
# Import func
from sqlalchemy import func

# Build a query to count the distinct states values: stmt
stmt = select([func.count(census.columns.state.distinct())])

# Execute the query and store the scalar result: distinct_state_count
distinct_state_count = connection.execute(stmt).scalar()

# Print the distinct_state_count
print(distinct_state_count)

#### Count of records by column

Often, we want to get a count for each record with a particular value in another column.

The .group_by() method helps answer this type of query. You can pass a column to the .group_by() method and use it with an aggregate function like sum() or count().

Much like the .order_by() method, .group_by() can take multiple columns as arguments.

In [None]:
# Import func
from sqlalchemy import func

# Build a query to select the state and count of ages by state: stmt
stmt = select([census.columns.state, func.count(census.columns.age)])

# Group stmt by state
stmt = stmt.group_by(census.columns.state)

# Execute the statement and store all the records: results
results = connection.execute(stmt).fetchall()

# Print results
print(results)

# Print the keys/column names of the results returned
print(results[0].keys())

#### Determining the population sum by state

To avoid confusion with query result column names like count_1, we can use the .label() method to provide a name for the resulting column. This gets appended to the function method we are using, and its argument is the name we want to use.

We can pair func.sum() with .group_by() to get a sum of the population by State and use the label() method to name the output.

We can also create the func.sum() expression before using it in the select statement. We do it the same way we would inside the select statement and store it in a variable. Then we use that variable in the select statement where the func.sum() would normally be.

In [None]:
# Import func
from sqlalchemy import func

# Build an expression to calculate the sum of pop2008 labeled as population
pop2008_sum = func.sum(census.columns.pop2008).label('population')

# Build a query to select the state and sum of pop2008: stmt
stmt = select([census.columns.state, pop2008_sum])

# Group stmt by state
stmt = stmt.group_by(census.columns.state)

# Execute the statement and store all the records: results
results = connection.execute(stmt).fetchall()

# Print results
print(results)

# Print the keys/column names of the results returned
print(results[0].keys())
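
In SQL terms, .label('population') corresponds to a column alias (AS population). A minimal sketch with Python's sqlite3 module (not SQLAlchemy) and made-up numbers shows the alias turning up as the column name of the result:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (state TEXT, pop2008 INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?)",
    [("Texas", 10), ("Texas", 15), ("Ohio", 7)],
)

cursor = conn.execute(
    "SELECT state, SUM(pop2008) AS population "
    "FROM census GROUP BY state ORDER BY state"
)
# cursor.description holds one entry per result column; the name comes first
column_names = [col[0] for col in cursor.description]
totals = cursor.fetchall()

print(column_names)  # ['state', 'population']
print(totals)        # [('Ohio', 7), ('Texas', 25)]
```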