This is a python reporting tool for the Udacity logs analysis project.
The purpose of the project is to use psycopg2
to query a mock PostgreSQL database for a fictional news website.
The tool answers three questions:
- What are the most popular three articles of all time?
- Who are the most popular article authors of all time?
- On which days did more than 1% of requests lead to errors?
The database has three tables:
authors
includes information about the authors of articlesarticles
includes the articles themselveslog
includes one entry for each time a user has accessed the site
The program is designed to be run via the Udacity linux virtual machine command line.
You can install VirtualBox via this link and vagrant, the command line utility to manage the virtual machine here.
- Create a folder to store the files for this project and then open the folder with the terminal.
- newsdata.zip contains the newsdata.sql database structure, save this, create_views.sql and tool.py in your project folder.
- Save Vagrantfile in your vagrant folder.
- Type
vagrant init ubuntu/trusty64
to tell Vagrant what kind of Linux VM you want to run - Run the virtual machine by running
vagrant up
andvagrant ssh
to log in. - To load the data
cd
into the project folder and use the commandpsql -d news -f newsdata.sql
- To load the views required to run the program run the command
psql -d news -f create_views.sql
If you get an error message saying:
psql: FATAL: database "news" does not exist
psql: could not connect to server: Connection refused
This means the database server isn't running or isn't set up correctly, you may need to download the virual machine again into a fresh directory.
The views required to run the program can be imported from create_views.sql for ease of use.
The views are as follows:
CREATE VIEW top_three AS
SELECT path, count(*) AS num
FROM log
WHERE path LIKE '%article%'
GROUP BY path
ORDER BY num DESC
LIMIT 3;
CREATE VIEW author_titles AS
SELECT authors.name, articles.title, articles.slug
FROM articles
JOIN authors
ON authors.id = articles.author
ORDER BY name;
CREATE VIEW article_count AS
SELECT path, count(*) AS num
FROM log
WHERE path LIKE '%article%'
GROUP BY path;
CREATE VIEW requests_made AS
SELECT date_trunc('day', time), count(*) AS num_requests
FROM log
GROUP BY date_trunc('day', time);
CREATE VIEW request_errors AS
SELECT date_trunc('day', time), status, count(*) AS num_errors
FROM log
WHERE status != '200 OK'
GROUP BY date_trunc('day', time), status
ORDER BY date_trunc('day', time);
CREATE VIEW error_comparisons AS
SELECT to_char(requests_made.date_trunc, 'FMMonth DD, YYYY'),
round(cast(num_errors AS numeric)/num_requests*100,2) AS percentage
FROM request_errors
JOIN requests_made
ON request_errors.date_trunc = requests_made.date_trunc;
To run the program run python tool.py
in the command line
A plain text example of what is returned by the tool program can be seen in results.txt