Udacity Full Stack Developer Nanodegree program
Brendon Smith (br3ndonland)
- Setup
- Starting Python in `logs.py`
- Starting the virtual machine and exploring the data
- A tale of three queries
- I read through the Udacity documentation and rubric (see `logs-udacity.md`).
- I kept vagrant and the database in a separate directory because of the large size of the database file.
- Shebang: when reading through the "Functionality" section of the rubric, I saw that it recommended "a correct shebang line to indicate the Python version." I hadn't written a shebang line before, but I looked it up on Stack Overflow and drafted one (it appears at the top of the outline below).
- I created an outline in the Python file `logs.py` with the steps I would be working on. Here's the initial outline:

```python
#!/usr/bin/env python3
# Udacity database logs analysis project
# 1. Most popular three articles
# 2. Most popular authors
# 3. Days on which >1% of HTTP requests led to errors
```
- To function or not to function: Next, I decided to write each of the three queries as a Python function. I began building the functions based on the resources from Lesson 3, Python DB-API:
  - `forumdb.py`
  - 3.3. Writing Code with DB API
  - 3.16. Reference — Python DB-API
```python
#!/usr/bin/env python3
# Udacity database logs analysis project
# Import the psycopg2 module to work with PostgreSQL
import psycopg2

# Store the database name as an object for easy reference in functions
DBNAME = "news"


# 1. Most popular three articles
def popular_articles():
    """Returns a sorted list of the three most highly accessed articles,
    with the top article first.
    """
    # Connect to database
    db = psycopg2.connect(database=DBNAME)
    # Create a cursor object to run queries and scan through results
    c = db.cursor()
    # Execute the SQL query using the cursor (query to be filled in)
    c.execute()
    # Fetch all results from the cursor object
    articles = c.fetchall()
    print(articles)
    # Close connection
    db.close()


# 2. Most popular authors
def popular_authors():
    """Returns a sorted list of the most popular article authors,
    with the most popular author at the top.
    """
    # Connect to database
    db = psycopg2.connect(database=DBNAME)
    # Create a cursor object to run queries and scan through results
    c = db.cursor()
    # Execute the SQL query using the cursor (query to be filled in)
    c.execute()
    # Fetch all results from the cursor object
    authors = c.fetchall()
    print(authors)
    # Close connection
    db.close()


# 3. Days on which >1% of HTTP requests led to errors
def errors():
    """Returns a list of days on which >1% of HTTP requests resulted in
    HTTP error codes.
    """
    # Connect to database
    db = psycopg2.connect(database=DBNAME)
    # Create a cursor object to run queries and scan through results
    c = db.cursor()
    # Execute the SQL query using the cursor (query to be filled in)
    c.execute()
    # Fetch all results from the cursor object
    errors = c.fetchall()
    print(errors)
    # Close connection
    db.close()
```
Git commit at this point: "Initialize files and code outline"
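Looking at that outline, the three functions share the same connect/execute/close boilerplate. A small helper like the one below could factor that out; this is a hypothetical refactor sketch, not the structure the project actually uses:

```python
#!/usr/bin/env python3
# Hypothetical helper (sketch only, not part of the project code):
# run one query against the news database and return all rows.
import psycopg2

DBNAME = "news"


def run_query(query):
    """Connect to the database, execute a single query, and return rows."""
    db = psycopg2.connect(database=DBNAME)
    c = db.cursor()
    c.execute(query)
    results = c.fetchall()
    # Close the connection before returning so it is never left open
    db.close()
    return results
```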
- I already had vagrant installed from the instructions in Lesson 2.17. Installing the Virtual Machine.
- I unzipped `newsdata.sql` and moved it into the `vagrant` directory.
- I changed into the vagrant directory and started up vagrant (only necessary after restarting the computer):

```sh
$ vagrant up
```
- I then logged in to Ubuntu as before:

```sh
$ vagrant ssh
```
- I connected to the database and loaded the data with PostgreSQL:

```sh
$ cd /vagrant
/vagrant$ psql -d news -f newsdata.sql
```
As explained in the Udacity documentation for the project (see `logs-udacity.md`):

> Here's what this command does:
>
> - `psql` — the PostgreSQL command line program
> - `-d news` — connect to the database named news which has been set up for you
> - `-f newsdata.sql` — run the SQL statements in the file newsdata.sql
>
> Running this command will connect to your installed database server and execute the SQL commands in the downloaded file, creating tables and populating them with data.
This only needs to be done once. When reconnecting, after `vagrant up` and `vagrant ssh`, simply use

```sh
$ cd /vagrant
/vagrant$ psql -d news
```

To log out (the opposite of `vagrant ssh`), just type `ctrl+d` or

```sh
$ logout
```
- I then began exploring the data by running commands in the vagrant Linux terminal. This helped me understand how to build the SQL queries.
- I started by viewing the tables:

```
/vagrant$ psql -d news
psql (9.5.10)
Type "help" for help.
news=> \dt
         List of relations
 Schema |   Name   | Type  |  Owner
--------+----------+-------+---------
 public | articles | table | vagrant
 public | authors  | table | vagrant
 public | log      | table | vagrant
(3 rows)
```
- Next, I viewed the columns of each table:

```
\d articles
\d authors
\d log
```
- I then broke it down further and started looking at the content of the columns.
  - `select author from articles limit 10;` — the `author` column in the `articles` table is a foreign key, "articles_author_fkey", that references the author `id` in the `authors` table.
  - `select slug from articles;` shows there are only 8 articles.
  - `select path from log limit 10;` — `path` looks like `slug` from the `articles` table, but it is repeated every time the article is accessed. We need to group and count the paths to find out how many times each article was accessed.
- I tried out some of the SQL queries from the Udacity project instructions, like:

```sql
select title, name from articles, authors where articles.author = authors.id;
```
Helpful reference info when building the SQL queries:
- 2.18. Reference — Elements of SQL
- 3.16. Reference — Python DB-API
- 4.15. Reference — Deeper into SQL
I broke each query down, as recommended in the Udacity instructions, and repeatedly iterated until I got it.
### What are the most popular three articles of all time?

I started by figuring out how to aggregate `path` and count hits in the `log` table. I based my first successful query on 1.11. Quiz: Count All the Species.
```sql
news=> select path, count(*) as num from log group by path order by num desc;
```

This shows a list of 212 rows.

```
                path                 |  num
-------------------------------------+--------
 /                                   | 479121
 /article/candidate-is-jerk          | 338647
 /article/bears-love-berries         | 253801
 /article/bad-things-gone            | 170098
 /article/goats-eat-googles          |  84906
 /article/trouble-for-troubled       |  84810
 /article/balloon-goons-doomed       |  84557
 /article/so-many-bears              |  84504
 /article/media-obsessed-with-bears  |  84383
 /spam-spam-spam-humbug              |    301
 /%20%20%20                          |    290
 /+++ATH0                            |    288
 /article/candidate-is-jerkx         |    161
 /article/candidate-is-jerkq         |    155
 :
```

(The `:` at the end is the psql pager prompt; the remaining rows are truncated.)
This was helpful. The top hit was the home page, which makes sense. The next top hits were the eight articles on the site.
I could see that there were a number of mistyped URLs. I decided not to group those with the correct URLs above, because users aren't technically accessing the page with a mistyped URL. The same users who mistyped the URL are probably correcting the URL and going to the page anyway, and thus still contributing to the total number of hits on the correct URL.
I limited the results to three, and offset by one (to take out the homepage "/" path) with `limit 3 offset 1;`, giving me only the articles of interest:

```sql
news=> select path, count(*) as num from log group by path order by num desc limit 3 offset 1;
```

The `offset 1` isn't totally ideal, because it's possible for one of the articles to be viewed more times than the homepage. I revised the query to exclude the homepage:

```
news=> select path, count(*) as num from log where path !='/' group by path order by num desc limit 3;
            path             |  num
-----------------------------+--------
 /article/candidate-is-jerk  | 338647
 /article/bears-love-berries | 253801
 /article/bad-things-gone    | 170098
(3 rows)
```
Let's review the two tables we're trying to join:

```
news=> select id, slug, title from articles;
 id |           slug            |               title
----+---------------------------+------------------------------------
 23 | bad-things-gone           | Bad things gone, say good people
 24 | balloon-goons-doomed      | Balloon goons doomed
 25 | bears-love-berries        | Bears love berries, alleges bear
 26 | candidate-is-jerk         | Candidate is jerk, alleges rival
 27 | goats-eat-googles         | Goats eat Google's lawn
 28 | media-obsessed-with-bears | Media obsessed with bears
 30 | trouble-for-troubled      | Trouble for troubled troublemakers
 29 | so-many-bears             | There are a lot of bears
(8 rows)

news=> select path, count(*) as num from log where path !='/' group by path order by num desc limit 3;
            path             |  num
-----------------------------+--------
 /article/candidate-is-jerk  | 338647
 /article/bears-love-berries | 253801
 /article/bad-things-gone    | 170098
(3 rows)
```
Let's start by just trying a simple join between the two tables.

```
news=> select path, slug from log, articles limit 10;
            path            |           slug
----------------------------+---------------------------
 /                          | bad-things-gone
 /                          | balloon-goons-doomed
 /                          | bears-love-berries
 /                          | candidate-is-jerk
 /                          | goats-eat-googles
 /                          | media-obsessed-with-bears
 /                          | trouble-for-troubled
 /                          | so-many-bears
 /article/candidate-is-jerk | bad-things-gone
 /article/candidate-is-jerk | balloon-goons-doomed
(10 rows)
```
Alright, so that's totally mismatched.
There is no foreign key here, but the `path` column is similar to the `slug` column in the articles table. I will need to slice the contents of each row in the `path` column to remove `/article/` and match it with the `slug`, like `path[9:]` if it were Python.
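A quick sanity check of that slicing logic in Python (a throwaway snippet, not part of `logs.py`): `/article/` is nine characters, so Python's 0-indexed `path[9:]` and SQL's 1-indexed `substr(path, 10)` should both yield the slug.

```python
# "/article/" occupies indices 0-8, so the slug starts at index 9
# (character 10 in SQL's 1-based indexing).
path = "/article/candidate-is-jerk"
assert path[9:] == "candidate-is-jerk"
print(path[9:])
```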
This was a sticking point for me. I was stuck at this step for about a day, and had to browse my notes and documentation. I was thinking about using the `LIKE` operator to match based on certain conditions, but that doesn't really slice out the contents of each row in the `path` column.
I was looking through "A Gentle Introduction to SQL Using SQLite Part II", which I had been referred to during computefest2018-pandas. I got to the `SUBSTRING` section, tried reformatting my query to select a substring in the `path` column of the `log` table, and... that was it!
```
news=> select substr(path, 10), count(*) as num from log where path !='/' group by path order by num desc limit 3;
       substr       |  num
--------------------+--------
 candidate-is-jerk  | 338647
 bears-love-berries | 253801
 bad-things-gone    | 170098
(3 rows)
```
Yes!
Now time to write the full query! I started off just drafting it in plain English:

> I want to join the `log` table with the `articles` table, where the `substr` from `log` matches the `slug` from `articles`, and show the `title` column from `articles` with the `num` count column created by the aggregation in `log`.
Wow, alright. Let's break that down:

- where the `substr` from `log` matches the `slug` from `articles`: `where log.substr = articles.slug`
- show the `title` column from `articles` with the `num` count column created by the aggregation in `log`: `select title, num`
Now let's put that together. My first query didn't work:

```sql
select title, num from articles,
(select substr(path, 10), count(*) as num from log where path !='/' group by path order by num desc) as hits
where log.substr = articles.slug limit 3;
```
I had to break it down further. I started by working the successful query into a subquery, to select just the `num` column from the aggregation. This worked:

```
select num from (select substr(path, 10), count(*) as num from log where path !='/' group by path order by num desc limit 3) as hits;
  num
--------
 338647
 253801
 170098
(3 rows)
```
Next, I continued iterating to get my first join of `log` and `articles`:

```
news=> select title, num from (select substr(path, 10), count(*) as num from log where path !='/' group by path order by num desc limit 3) as hits, articles limit 3;
              title               |  num
----------------------------------+--------
 Bad things gone, say good people | 338647
 Bad things gone, say good people | 253801
 Bad things gone, say good people | 170098
(3 rows)
```
Now I needed to establish the join condition. After a few more iterations, I got it! I also changed "num" to "views" for easier interpretation here:

```
news=> select title, views from (select substr(path, 10), count(*) as views from log where path !='/' group by path) as hits, articles where substr = slug order by views desc limit 3;
              title               | views
----------------------------------+--------
 Candidate is jerk, alleges rival | 338647
 Bears love berries, alleges bear | 253801
 Bad things gone, say good people | 170098
(3 rows)
```
This also showed me that the order of columns in the `SELECT` statement doesn't need to match the order in which the tables are mentioned in the `FROM` clause.
I plugged this into the Python code in `logs.py` to test it out. I reformatted the SQL query for Python with help from the psycopg2 docs:
```python
# 1. Most popular three articles
def popular_articles():
    """Returns a sorted list of the three most highly accessed articles,
    with the top article first.
    """
    # Connect to database
    db = psycopg2.connect(database=DBNAME)
    # Create a cursor object to run queries and scan through results
    c = db.cursor()
    # Execute the SQL query using the cursor
    c.execute("""
        select title, views from
        (select substr(path, 10), count(*) as views from log
        where path !='/' group by path)
        as hits, articles where substr = slug order by views desc limit 3;
        """)
    # Fetch all results from the cursor object
    results = c.fetchall()
    # Close connection before returning the results
    db.close()
    return results
```
In order to run the Python code within Vagrant, I created a vagrant/logs directory and copied in the `logs.py` file.
I then formatted the Linux command line argument to call the function, with a little help from Stack Overflow via a DuckDuckGo search for "run python functions from command line":
```
vagrant@vagrant:/vagrant/logs$ python -c 'import logs; print(logs.popular_articles())'
[('Candidate is jerk, alleges rival', 338647L), ('Bears love berries, alleges bear', 253801L), ('Bad things gone, say good people', 170098L)]
```
I will need to reformat the output into a plain-text table like PostgreSQL's. I'm surprised that psycopg2 doesn't return formatted tables as output by default. Isn't that what you would expect?

I tried a few different things. There is a `PrettyTable` module, but the Python distribution in the Vagrant machine doesn't have it. I'll get back to this after finishing the second and third queries.
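As a placeholder idea, plain `str.format` column widths can fake a text table without any extra modules; a minimal sketch with made-up sample data, not the formatting the final script settled on:

```python
# Left-align the title in a column sized to the longest entry and
# right-align the view counts, loosely mimicking psql's table layout.
rows = [('Candidate is jerk, alleges rival', 338647),
        ('Bears love berries, alleges bear', 253801),
        ('Bad things gone, say good people', 170098)]
width = max(len(title) for title, _ in rows)
for title, views in rows:
    print('{0:<{w}} | {1:>6}'.format(title, views, w=width))
```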
### Who are the most popular article authors of all time?
The second query is like an extension of the first, with an additional join to the authors table, and an aggregation to group the articles by author.
I started off viewing the `name` and `id` columns from the `authors` table, so I knew what to look for:
```
news=> select name, id from authors;
          name          | id
------------------------+----
 Ursula La Multa        |  1
 Rudolf von Treppenwitz |  2
 Anonymous Contributor  |  3
 Markoff Chaney         |  4
(4 rows)
```
Next, I merged the `log` table and `articles` table as before, eliminating the `limit 3`, and including the author `id`, to display all eight articles:
```
select title, author, views from (select substr(path, 10), count(*) as views from log where path !='/' group by path) as hits, articles where substr = slug order by views desc;
               title                | author | views
------------------------------------+--------+--------
 Candidate is jerk, alleges rival   |      2 | 338647
 Bears love berries, alleges bear   |      1 | 253801
 Bad things gone, say good people   |      3 | 170098
 Goats eat Google's lawn            |      1 |  84906
 Trouble for troubled troublemakers |      2 |  84810
 Balloon goons doomed               |      4 |  84557
 There are a lot of bears           |      1 |  84504
 Media obsessed with bears          |      1 |  84383
(8 rows)
```
Now to join the three tables, displaying information from all three to verify:

```
select name, author, title, views from (select substr(path, 10), count(*) as views from log where path !='/' group by path) as hits, articles, authors where substr = slug and author = authors.id order by views desc;
          name          | author |               title                | views
------------------------+--------+------------------------------------+--------
 Rudolf von Treppenwitz |      2 | Candidate is jerk, alleges rival   | 338647
 Ursula La Multa        |      1 | Bears love berries, alleges bear   | 253801
 Anonymous Contributor  |      3 | Bad things gone, say good people   | 170098
 Ursula La Multa        |      1 | Goats eat Google's lawn            |  84906
 Rudolf von Treppenwitz |      2 | Trouble for troubled troublemakers |  84810
 Markoff Chaney         |      4 | Balloon goons doomed               |  84557
 Ursula La Multa        |      1 | There are a lot of bears           |  84504
 Ursula La Multa        |      1 | Media obsessed with bears          |  84383
(8 rows)
```
The second query, up to this point, only took me a few iterations over maybe an hour. The next part took me several more hours.
Aggregation aggravation: This step was more difficult. I tried creating views, but wasn't able to create a view and select from that view in the same SQL query (remember, each of the three queries has to be self-contained).
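For reference, the standard way to name a subquery inside a single self-contained statement is a `with` clause (CTE). Here's a sketch of how query one might look in that style; this is hypothetical and not the form the project submits:

```python
# Hypothetical CTE rewrite of query one (sketch only): the WITH clause
# plays the role a view would, without creating a database object.
POPULAR_ARTICLES_CTE = """
    with hits as (
        select substr(path, 10) as slug, count(*) as views
        from log where path != '/' group by path
    )
    select title, views from hits, articles
    where hits.slug = articles.slug
    order by views desc limit 3;
"""
```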
I started moving in the right direction by nesting the entire subquery from the three-table join above inside another query:

```
select name, views from
(select name, author, title, views from (select substr(path, 10), count(*) as views from log where path !='/' group by path) as hits, articles, authors where substr = slug and author = authors.id order by views desc) as threetables;
          name          | views
------------------------+--------
 Rudolf von Treppenwitz | 338647
 Ursula La Multa        | 253801
 Anonymous Contributor  | 170098
 Ursula La Multa        |  84906
 Rudolf von Treppenwitz |  84810
 Markoff Chaney         |  84557
 Ursula La Multa        |  84504
 Ursula La Multa        |  84383
(8 rows)
```
This narrows down the table to the columns I want, but I still need the aggregation.
I was able to successfully complete the aggregation by adding two parts:

- `sum(views) as total_views` at the beginning, before the long nested subquery
- `group by name order by total_views desc` after the long nested subquery, to tell PostgreSQL how to compute the sum
```
select name, sum(views) as total_views from
(select name, author, title, views from
(select substr(path, 10), count(*) as views from log
where path !='/' group by path)
as hits, articles, authors
where substr = slug and author = authors.id
order by views desc)
as threetables group by name order by total_views desc;
          name          | total_views
------------------------+-------------
 Ursula La Multa        |      507594
 Rudolf von Treppenwitz |      423457
 Anonymous Contributor  |      170098
 Markoff Chaney         |       84557
(4 rows)
```
Success!
```python
# 2. Most popular authors
def popular_authors():
    """Returns a sorted list of the most popular article authors,
    with the most popular author at the top.
    """
    # Connect to database
    db = psycopg2.connect(database=DBNAME)
    # Create a cursor object to run queries and scan through results
    c = db.cursor()
    # Execute the SQL query using the cursor
    c.execute("""
        select name, sum(views) as total_views from
        (select name, author, title, views from
        (select substr(path, 10), count(*) as views from log
        where path !='/' group by path)
        as hits, articles, authors
        where substr = slug and author = authors.id
        order by views desc)
        as threetables group by name order by total_views desc;
        """)
    # Fetch all results from the cursor object
    results = c.fetchall()
    # Close connection before returning the results
    db.close()
    return results
```
I copied the updated code into the vagrant directory as before, then ran the code from the Linux command line:
```
vagrant@vagrant:/vagrant/logs$ python -c 'import logs; print(logs.popular_articles(), logs.popular_authors())'
([('Candidate is jerk, alleges rival', 338647L), ('Bears love berries, alleges bear', 253801L), ('Bad things gone, say good people', 170098L)], [('Ursula La Multa', Decimal('507594')), ('Rudolf von Treppenwitz', Decimal('423457')), ('Anonymous Contributor', Decimal('170098')), ('Markoff Chaney', Decimal('84557'))])
```
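A side note on the mix of types in that output: `count(*)` yields a PostgreSQL bigint, which psycopg2 maps to a Python integer (hence the `L` suffix under Python 2), while `sum` over bigint yields numeric, which maps to `decimal.Decimal`. Converting for display is straightforward:

```python
from decimal import Decimal

# sum(...) results arrive as Decimal; int() recovers a plain integer
# for display purposes.
total_views = Decimal('507594')
print(int(total_views))  # 507594
```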
Git commit at this point: "Complete SQL queries one and two"
### On which days did more than one percent of requests lead to errors?
I need to count the total number of HTTP requests for each day, and divide the number of HTTP `404` errors by that total.
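To make the target concrete, here's the arithmetic with the July 17 numbers that turn up later in this log (1265 errors out of 55907 requests), using Python 3 division:

```python
# Error rate for a day = 404 count / total requests * 100;
# anything over 1% qualifies.
print(1265 / 55907 * 100)  # ~2.26, which exceeds the 1% threshold
```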
I started by reviewing the contents of the `log` table:

```
news=> \d log
                           Table "public.log"
 Column |           Type           |                    Modifiers
--------+--------------------------+---------------------------------------------------
 path   | text                     |
 ip     | inet                     |
 method | text                     |
 status | text                     |
 time   | timestamp with time zone | default now()
 id     | integer                  | not null default nextval('log_id_seq'::regclass)
Indexes:
    "log_pkey" PRIMARY KEY, btree (id)
```
A quick view of the `status` column in the `log` table shows two status codes, `200 OK` and `404 NOT FOUND`:

```
news=> select status from log group by status;
    status
---------------
 404 NOT FOUND
 200 OK
(2 rows)
```
Each HTTP request returns a `status` and `time`:

```
news=> select status, time from log limit 5;
 status |          time
--------+------------------------
 200 OK | 2016-07-01 07:00:00+00
 200 OK | 2016-07-01 07:00:47+00
 200 OK | 2016-07-01 07:00:34+00
 200 OK | 2016-07-01 07:00:52+00
 200 OK | 2016-07-01 07:00:23+00
(5 rows)
```
The `time` column contains more than just the date. I will have to slice out the time.

I took a look at the PostgreSQL documentation, and quickly found the entry for `EXTRACT, date_part`. After reading through and trying it out, I actually found that `date_trunc` was more appropriate. It sets the time of day to zero, while keeping the date the same:

```
select date_trunc('day', time) as date, status from log limit 5;
          date          | status
------------------------+--------
 2016-07-01 00:00:00+00 | 200 OK
 2016-07-01 00:00:00+00 | 200 OK
 2016-07-01 00:00:00+00 | 200 OK
 2016-07-01 00:00:00+00 | 200 OK
 2016-07-01 00:00:00+00 | 200 OK
(5 rows)
```
I tried slicing out the time completely with a substring command:

```sql
news=> select substr(date_trunc('day', time), 1, 10) as date from log limit 5;
```

I couldn't get it to work initially; `substr` operates on text, and `date_trunc` returns a timestamp, so the value would need a cast first. I will return to this later if needed.
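A hypothetical fix along those lines (not pursued in the project) would cast the timestamp to text before slicing:

```python
# Sketch only: the ::text cast turns the truncated timestamp into a
# string that substr can slice down to 'YYYY-MM-DD'.
DATE_ONLY_SKETCH = """
    select substr(date_trunc('day', time)::text, 1, 10) as date
    from log limit 5;
"""
```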
I was able to group the entries by date, revealing that the table was logging HTTP requests in July 2016:

```
select date_trunc('day', time) as date from log group by date;
          date
------------------------
 2016-07-01 00:00:00+00
 2016-07-02 00:00:00+00
 2016-07-03 00:00:00+00
 2016-07-04 00:00:00+00
 2016-07-05 00:00:00+00
 2016-07-06 00:00:00+00
 2016-07-07 00:00:00+00
 2016-07-08 00:00:00+00
 2016-07-09 00:00:00+00
 2016-07-10 00:00:00+00
 2016-07-11 00:00:00+00
 2016-07-12 00:00:00+00
 2016-07-13 00:00:00+00
 2016-07-14 00:00:00+00
 2016-07-15 00:00:00+00
 2016-07-16 00:00:00+00
 2016-07-17 00:00:00+00
 2016-07-18 00:00:00+00
 2016-07-19 00:00:00+00
 2016-07-20 00:00:00+00
 2016-07-21 00:00:00+00
 2016-07-22 00:00:00+00
 2016-07-23 00:00:00+00
 2016-07-24 00:00:00+00
 2016-07-25 00:00:00+00
 2016-07-26 00:00:00+00
```

(Output truncated here.)
One of my query attempts actually hung the virtual machine. I think it was the extra `from log` at the top level of the query. I logged in with a separate terminal window and used the `top` command to troubleshoot.
```
/vagrant$ psql -d news
psql (9.5.10)
Type "help" for help.
news=> select date, status, http_requests from log,
(select date_trunc('day', time) as date from log) as datemod,
(select count(*) as http_requests from log) as requests
order by date limit 10;
```
In a separate terminal window:

```
$ vagrant ssh
vagrant@vagrant:~$ top
top - 13:12:30 up 15:05,  2 users,  load average: 0.97, 0.50, 0.20
Tasks: 110 total,   2 running, 108 sleeping,   0 stopped,   0 zombie
%Cpu(s): 98.7 us,  1.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1024012 total,   585268 free,    56496 used,   382248 buff/cache
KiB Swap:  1048572 total,  1048572 free,        0 used.   798024 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1508 postgres  20   0  202820  18888  12676 R  99.9  1.8   3:18.07 postgres
  852 message+  20   0    6044   3696   3320 S   0.3  0.4   0:02.86 dbus-daemon
 3947 vagrant   20   0   13172   4676   3976 S   0.3  0.5   0:00.04 sshd
    1 root      20   0    6744   5168   3820 S   0.0  0.5   0:01.87 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:00.11 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    7 root      20   0       0      0      0 S   0.0  0.0   0:01.04 rcu_sched
    8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
    9 root      rt   0       0      0      0 S   0.0  0.0   0:00.00 migration/0
   10 root      rt   0       0      0      0 S   0.0  0.0   0:00.62 watchdog/0
   11 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kdevtmpfs
   12 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 netns
   13 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 perf
   14 root      20   0       0      0      0 S   0.0  0.0   0:00.01 khungtaskd
   15 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 writeback
   16 root      25   5       0      0      0 S   0.0  0.0   0:00.00 ksmd
```
I actually couldn't kill the process:

```
vagrant@vagrant:~$ kill -9 1508
-sh: kill: (1508) - Operation not permitted
```
Typing `logout` doesn't do anything, because I don't have a command prompt. Quitting my terminal program (iTerm2) reveals a process still running in VirtualBox. I sent an ACPI shutdown signal through the GUI (this can also be done on the command line with `VBoxManage controlvm Ubuntu acpipowerbutton`).
Okay, time to try again. I started building the query, piece by piece.

This at least runs and establishes the count syntax:

```sql
select count(*) as http_requests from log group by time limit 5;
```
Now to build in the date truncation:

```
select date_trunc('day', time) as date, count(*) as http_requests from log group by date order by date desc;
          date          | http_requests
------------------------+---------------
 2016-07-31 00:00:00+00 |         45845
 2016-07-30 00:00:00+00 |         55073
 2016-07-29 00:00:00+00 |         54951
 2016-07-28 00:00:00+00 |         54797
 2016-07-27 00:00:00+00 |         54489
 2016-07-26 00:00:00+00 |         54378
 2016-07-25 00:00:00+00 |         54613
 2016-07-24 00:00:00+00 |         55100
 2016-07-23 00:00:00+00 |         54894
 2016-07-22 00:00:00+00 |         55206
 2016-07-21 00:00:00+00 |         55241
 2016-07-20 00:00:00+00 |         54557
 2016-07-19 00:00:00+00 |         55341
 2016-07-18 00:00:00+00 |         55589
 2016-07-17 00:00:00+00 |         55907
 2016-07-16 00:00:00+00 |         54498
 2016-07-15 00:00:00+00 |         54962
 2016-07-14 00:00:00+00 |         55196
 2016-07-13 00:00:00+00 |         55180
 2016-07-12 00:00:00+00 |         54839
 2016-07-11 00:00:00+00 |         54497
 2016-07-10 00:00:00+00 |         54489
 2016-07-09 00:00:00+00 |         55236
 2016-07-08 00:00:00+00 |         55084
 2016-07-07 00:00:00+00 |         54740
 2016-07-06 00:00:00+00 |         54774
 2016-07-05 00:00:00+00 |         54585
 2016-07-04 00:00:00+00 |         54903
 2016-07-03 00:00:00+00 |         54866
 2016-07-02 00:00:00+00 |         55200
 2016-07-01 00:00:00+00 |         38705
(31 rows)
```
Alright!
I modified the total HTTP requests query to return the HTTP 404 errors:

```
select date_trunc('day', time) as date, count(*) as http_404 from log where status = '404 NOT FOUND' group by date order by date desc;
          date          | http_404
------------------------+----------
 2016-07-31 00:00:00+00 |      329
 2016-07-30 00:00:00+00 |      397
 2016-07-29 00:00:00+00 |      382
 2016-07-28 00:00:00+00 |      393
 2016-07-27 00:00:00+00 |      367
 2016-07-26 00:00:00+00 |      396
 2016-07-25 00:00:00+00 |      391
 2016-07-24 00:00:00+00 |      431
 2016-07-23 00:00:00+00 |      373
 2016-07-22 00:00:00+00 |      406
 2016-07-21 00:00:00+00 |      418
 2016-07-20 00:00:00+00 |      383
 2016-07-19 00:00:00+00 |      433
 2016-07-18 00:00:00+00 |      374
 2016-07-17 00:00:00+00 |     1265
 2016-07-16 00:00:00+00 |      374
 2016-07-15 00:00:00+00 |      408
 2016-07-14 00:00:00+00 |      383
 2016-07-13 00:00:00+00 |      383
 2016-07-12 00:00:00+00 |      373
```

(Output truncated here.)
Excellent!
Note the single quotes around `status = '404 NOT FOUND'`. I tried double quotes first, but double quotes make PostgreSQL treat `"404 NOT FOUND"` as a column name, and it returns an error.
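Relatedly, once the query moves into Python, psycopg2's own parameter substitution handles the quoting. A small sketch (the project code inlines the literal string instead):

```python
import psycopg2

# %s here is psycopg2 parameter substitution, not Python %-formatting;
# the driver adds the single quotes around the value itself.
db = psycopg2.connect(database="news")
c = db.cursor()
c.execute("select count(*) from log where status = %s", ('404 NOT FOUND',))
print(c.fetchone())
db.close()
```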
The two steps above for counting total HTTP requests and errors separately took me ~2-3 hours over Sunday afternoon (2018-01-28) and Monday morning (2018-01-29).
Now I need to combine the two queries from above:

```sql
select date_trunc('day', time) as date, count(*) as http_requests from log group by date order by date desc;
select date_trunc('day', time) as date, count(*) as http_404 from log where status = '404 NOT FOUND' group by date order by date desc;
```
I need to join the two queries on date. This is a similar task to the first query, where I joined the `log` and `articles` tables based on the `slug` in the URL:

```
news=> select title, views from (select substr(path, 10), count(*) as views from log where path !='/' group by path) as hits, articles where substr = slug order by views desc limit 3;
              title               | views
----------------------------------+--------
 Candidate is jerk, alleges rival | 338647
 Bears love berries, alleges bear | 253801
 Bad things gone, say good people | 170098
(3 rows)
```
I continued by drafting the task in plain English:

- I want to see a table with three columns: the day, the total number of HTTP requests, and the total number of error requests: `select day, http_requests, http_404`
- I want to match the HTTP requests and error requests columns based on the date: `where http_requests.date = http_404.date`
It's not possible to combine both queries into a single `select` statement, because `where` can be used only once. For example, the query below returns the three columns I want, but both count columns display the error request count, because the entire statement is filtered by `where status = '404 NOT FOUND'`:

```
select date_trunc('day', time) as date, count(*) as http_requests, count(*) as http_404 from log where status = '404 NOT FOUND' group by date order by date desc limit 5;
          date          | http_requests | http_404
------------------------+---------------+----------
 2016-07-31 00:00:00+00 |           329 |      329
 2016-07-30 00:00:00+00 |           397 |      397
 2016-07-29 00:00:00+00 |           382 |      382
 2016-07-28 00:00:00+00 |           393 |      393
 2016-07-27 00:00:00+00 |           367 |      367
(5 rows)
```
This task will require subqueries.
I had to iterate repeatedly at this step for about 3 hours. It wasn't too difficult to split the task out into subqueries, but I kept getting errors regarding the aggregation, like `ERROR: column "log.time" must appear in the GROUP BY clause or be used in an aggregate function`.
The two keys were:

- Establishing the proper join condition: `where requests.date = errors.date`
- Making sure I specified the correct names (like `errors.date` instead of `http_404.date`)
Successful query:

```
select requests.date, http_requests, http_404 from
(select date_trunc('day', time) as date, count(*) as http_requests from log group by date) as requests,
(select date_trunc('day', time) as date, count(*) as http_404 from log where status = '404 NOT FOUND' group by date) as errors
where requests.date = errors.date
order by requests.date desc;
          date          | http_requests | http_404
------------------------+---------------+----------
 2016-07-31 00:00:00+00 |         45845 |      329
 2016-07-30 00:00:00+00 |         55073 |      397
 2016-07-29 00:00:00+00 |         54951 |      382
 2016-07-28 00:00:00+00 |         54797 |      393
 2016-07-27 00:00:00+00 |         54489 |      367
 2016-07-26 00:00:00+00 |         54378 |      396
 2016-07-25 00:00:00+00 |         54613 |      391
 2016-07-24 00:00:00+00 |         55100 |      431
 2016-07-23 00:00:00+00 |         54894 |      373
 2016-07-22 00:00:00+00 |         55206 |      406
 2016-07-21 00:00:00+00 |         55241 |      418
 2016-07-20 00:00:00+00 |         54557 |      383
 2016-07-19 00:00:00+00 |         55341 |      433
 2016-07-18 00:00:00+00 |         55589 |      374
 2016-07-17 00:00:00+00 |         55907 |     1265
 2016-07-16 00:00:00+00 |         54498 |      374
 2016-07-15 00:00:00+00 |         54962 |      408
 2016-07-14 00:00:00+00 |         55196 |      383
 2016-07-13 00:00:00+00 |         55180 |      383
 2016-07-12 00:00:00+00 |         54839 |      373
 2016-07-11 00:00:00+00 |         54497 |      403
 2016-07-10 00:00:00+00 |         54489 |      371
 2016-07-09 00:00:00+00 |         55236 |      410
 2016-07-08 00:00:00+00 |         55084 |      418
 2016-07-07 00:00:00+00 |         54740 |      360
 2016-07-06 00:00:00+00 |         54774 |      420
 2016-07-05 00:00:00+00 |         54585 |      423
 2016-07-04 00:00:00+00 |         54903 |      380
 2016-07-03 00:00:00+00 |         54866 |      401
 2016-07-02 00:00:00+00 |         55200 |      389
 2016-07-01 00:00:00+00 |         38705 |      274
(31 rows)
```
Awesome!
Now I basically need to add in another calculation, probably something like `having http_404 > 0.01 * http_requests`. I tried `having`, but it was easier to just add `and` to the `where` restriction instead.
```
select requests.date, http_requests, http_404 from
(select date_trunc('day', time) as date, count(*) as http_requests from log group by date) as requests,
(select date_trunc('day', time) as date, count(*) as http_404 from log where status = '404 NOT FOUND' group by date) as errors
where requests.date = errors.date
and errors.http_404 > 0.01 * requests.http_requests
order by requests.date desc;
          date          | http_requests | http_404
------------------------+---------------+----------
 2016-07-17 00:00:00+00 |         55907 |     1265
(1 row)
```
YES! The query shows one day, July 17, on which more than 1% of requests led to errors.
I tried going a bit further, to show the error percentage as a column. I figured it would be something like:

```sql
select errors.http_404 / requests.http_requests * 100 as error_percentage
```
It was difficult, because the `http_requests` and `http_404` columns are being created in this query. Maybe I need another subquery.
This runs and displays the `error_percentage` column, but doesn't correctly calculate the percentage. (In hindsight, the likely culprit is integer division: with two integer operands, PostgreSQL truncates `http_404 / http_requests` to zero before the multiplication, so a cast to `numeric` or `float` would be needed.)

```
select requests_and_errors.date, http_requests, http_404, (http_404 / http_requests * 100) as error_percentage from
(select requests.date, http_requests, http_404 from
(select date_trunc('day', time) as date, count(*) as http_requests from log group by date) as requests,
(select date_trunc('day', time) as date, count(*) as http_404 from log where status = '404 NOT FOUND' group by date) as errors
where requests.date = errors.date
order by requests.date desc) as requests_and_errors;
```
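The same truncation pitfall exists in Python 2, which is worth noting given the `python` vs `python3` discovery further down; a tiny demonstration:

```python
# Python 2's / on two ints floors the result, mirroring PostgreSQL's
# integer division; the ratio truncates to 0 before the * 100.
print(1265 / 55907)               # 0 in Python 2; ~0.0226 in Python 3
print(float(1265) / 55907 * 100)  # ~2.26 either way, via a float cast
```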
I tried nesting the query even further, unsuccessfully:

```
select requests_and_errors.date, http_requests, http_404, error_percentage from
(select http_404 / http_requests * 100 as error_percentage from requests_and_errors) as percentage
(select requests.date, http_requests, http_404 from
(select date_trunc('day', time) as date, count(*) as http_requests from log group by date) as requests,
(select date_trunc('day', time) as date, count(*) as http_404 from log where status = '404 NOT FOUND' group by date) as errors
where requests.date = errors.date
order by requests.date desc)
as requests_and_errors,
group by requests_and_errors.date desc;
```
It gets difficult to keep track of this many subqueries.
Git commit at this point: "Complete SQL query three up to sticking point"
Instead of subqueries, I tried to calculate the error percentage from information in the original `log` table:

```sql
select date_trunc('day', time) as date, count(status = '404 NOT FOUND') from log group by date;
```

The `count` returned is just the total number of requests, though: `count` counts every non-null value, and the boolean expression is false (not null) for the `200 OK` rows, so every row still gets counted.
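For reference, a conditional count can be done in one pass with a `case` expression inside `sum`; a sketch of that alternative, which isn't the route the project takes:

```python
# Hypothetical single-pass alternative (sketch only): only rows where
# the CASE expression yields 1 contribute to the error count.
ERRORS_CASE_SKETCH = """
    select date_trunc('day', time) as date,
           count(*) as http_requests,
           sum(case when status = '404 NOT FOUND' then 1 else 0 end) as http_404
    from log group by date order by date desc;
"""
```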
Eventually, I just decided to calculate the percentage in Python. I already need to format the output anyway, because psycopg2 does not output plain-text tables like `psql`. I found that strange: the whole point of psycopg2 is to work with PostgreSQL from Python, so why doesn't it return the same output?
I worked on cleaning up the output:

```python
    # Fetch all results from the cursor object
    print("Query 1")
    for table in c.fetchall():
        print(table)
    # Close connection
    db.close()
```
```
vagrant@vagrant:/vagrant/logs$ python -c 'import logs; logs.popular_articles(), logs.popular_authors(), logs.errors()'
Query 1
('Candidate is jerk, alleges rival', 338647L)
('Bears love berries, alleges bear', 253801L)
('Bad things gone, say good people', 170098L)
Query 2
('Ursula La Multa', Decimal('507594'))
('Rudolf von Treppenwitz', Decimal('423457'))
('Anonymous Contributor', Decimal('170098'))
('Markoff Chaney', Decimal('84557'))
Query 3
(datetime.datetime(2016, 7, 17, 0, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=0, name=None)), 55907L, 1265L)
```
Still not great, especially for query 3.
I was finally able to remove the timestamp from the date by converting it to a string and slicing the first ten characters, with `print("Date:", str(table[0])[:10])`. I also improved the output formatting by including a header line.
```python
    # Fetch all results from the cursor object
    print('\n', 'Query 3: Days on which >1% HTTP requests returned 404 errors')
    for table in c.fetchall():
        # Convert datetime to string and slice to retain only the date
        print('Date:', str(table[0])[:10])
        # Calculate error rate
        print('Percent errors:', float(table[2]) / float(table[1]) * 100)
```
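Since psycopg2 hands back `datetime` objects, `strftime` would be an alternative to string slicing here; a sketch, not what the script actually does:

```python
from datetime import datetime, timezone

# Format the date portion directly instead of slicing str(datetime).
d = datetime(2016, 7, 17, tzinfo=timezone.utc)
print('Date:', d.strftime('%Y-%m-%d'))  # Date: 2016-07-17
```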
Next, I simply changed the Linux command (just by guessing) from `python` to `python3`, to specify the Python version. The output was then formatted correctly. I initially wondered whether my shebang line wasn't working, but the shebang only takes effect when the script is executed directly (e.g. `./logs.py` after `chmod +x logs.py`); `python -c` uses whichever interpreter `python` resolves to, which is Python 2 here (hence the `L` suffixes in the earlier output).
Here's the final Linux command:

```
vagrant@vagrant:/vagrant/logs$ python3 -c 'import logs; logs.popular_articles(), logs.popular_authors(), logs.errors()'

Query 1: Most popular three articles
Candidate is jerk, alleges rival  338647
Bears love berries, alleges bear  253801
Bad things gone, say good people  170098

Query 2: Most popular authors
Ursula La Multa         507594
Rudolf von Treppenwitz  423457
Anonymous Contributor   170098
Markoff Chaney           84557

Query 3: Days on which >1% HTTP requests returned 404 errors
Date: 2016-07-17
Percent errors: 2.2626862468027262
```
Done!