# Building Blocks of Data Analytics

Python has made extracting valuable information from various data sources very easy, as the dozens of open source data-manipulation libraries available today attest.

At the core of these libraries are creative combinations of loops, conditional statements, strings, lists, dictionaries, arithmetic, and so on, which together deliver that convenience.

In this talk, we will step back and return to the basics, exploring ways of putting these building blocks together to extract information from a dataset.

## Raw data

```
88.191.254.20 - - [22/Mar/2009:07:00:32 +0100] "GET / HTTP/1.0" 200 8674 "-" "-" "-"
66.249.66.231 - - [22/Mar/2009:07:06:20 +0100] "GET /popup.php?choix=-89 HTTP/1.1" 200 1870 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
66.249.66.231 - - [22/Mar/2009:07:11:20 +0100] "GET /specialiste.php HTTP/1.1" 200 10743 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
```

## Extracting information

### Inspecting the data

> What data is available to me?

- Identify the headers (e.g. Excel, CSV)
- Research the format (e.g. log files)
- Research what each field is for


#### A standard Nginx log file

**Format:**

- Remote IP address (remote_addr)
- Remote user (remote_user)
- Local time (time_local)
- Requested page (http_request)
- Status code (status)
- Response size in bytes (request_bytes; nginx itself calls this field `body_bytes_sent`)
- Referer: the page the request came from (http_referer)
- User agent (http_user_agent)
- Originating IP address if coming from a proxy/load balancer (http_x_forwarded_for)

**Mapping:**

```
remote_addr - remote_user [time_local] "http_request" status request_bytes "http_referer" "http_user_agent" "http_x_forwarded_for"
```

### Questions

#### Easy

- How many visits did our site get in total?
- How many visitors do we have?
- What devices were used to access our site?
- Which pages were accessed on our site?


#### Difficult

- How many unique visitors per month?
- Top visitors per month? How many times did they visit?
- Top devices used per month? How many times were they used?
- Other questions?

## Exploration

Let's figure out how we can access each individual value within a log line.

In [None]:
log_line = '66.249.66.231 - - [22/Mar/2009:07:06:20 +0100] "GET /popup.php?choix=-89 HTTP/1.1" 200 1870 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"'

In [None]:
# Source: https://docs.python.org/3/library/re.html
import re

LINE_PAT = re.compile(
    r'(?P<remote_addr>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) '
    r'\- '
    r'(?P<remote_user>.+) '
    r'\[(?P<time_local>.+)\] '
    r'"(?P<http_request>.+)" '
    r'(?P<status>\d{3}) '
    r'(?P<request_bytes>[\d\-]{1,}) '
    r'"(?P<http_referer>.+)" '
    r'"(?P<http_user_agent>.+)" '
    r'"(?P<http_x_forwarded_for>.+)"'
)

In [None]:
matched = LINE_PAT.match(log_line)

In [None]:
matched.groupdict()

In [None]:
matched.groupdict().get('remote_addr')

## Normalizing data

Let's now convert each value into its proper data type.

In [None]:
data = matched.groupdict()

In [None]:
from datetime import datetime

for key, val in data.items():
    if (key == 'status' or key == 'request_bytes') and val.isdigit():
        data[key] = int(val)
    elif key == 'time_local':
        data[key] = datetime.strptime(val, '%d/%b/%Y:%H:%M:%S %z')
    if val == '-':
        data[key] = None

In [None]:
data
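
Since the same conversion is needed for every log line, one option is to wrap it in a function. The sketch below is a minimal version under a few assumptions: `normalize_entry` is a hypothetical name that does not appear elsewhere in this notebook, the `sample` dictionary is hand-written to look like the output of `groupdict()`, and the function returns a new dictionary instead of changing the one passed in.

```python
from datetime import datetime

def normalize_entry(entry):
    """Return a copy of a parsed log entry with values converted to proper types.

    `normalize_entry` is a hypothetical helper name used for illustration.
    """
    result = {}
    for key, val in entry.items():
        if val == '-':
            # A lone dash means "no value" in the log.
            result[key] = None
        elif key in ('status', 'request_bytes') and val.isdigit():
            result[key] = int(val)
        elif key == 'time_local':
            result[key] = datetime.strptime(val, '%d/%b/%Y:%H:%M:%S %z')
        else:
            result[key] = val
    return result

# Hand-written sample shaped like the output of matched.groupdict().
sample = {
    'remote_addr': '66.249.66.231',
    'remote_user': '-',
    'time_local': '22/Mar/2009:07:06:20 +0100',
    'http_request': 'GET /popup.php?choix=-89 HTTP/1.1',
    'status': '200',
    'request_bytes': '1870',
    'http_referer': '-',
}
normalized = normalize_entry(sample)
print(normalized['status'], normalized['time_local'].month, normalized['remote_user'])
```

Building a new dictionary keeps the original strings available for comparison, which can help when a conversion does not behave as expected.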

## Creating a strategy

What's the best way to group the data to answer the questions we have?

Do we need to create different groupings of our data?

### Base data structure

Regardless of our further groupings, we will always start with the base form:

```
[
    {
        'header': 'value',
        ...
    },
    {
        'header': 'value',
        ...
    },
    ...
]
```

In [None]:
raw_logs = []

with open('access.log') as fh:
    for line in fh:
        line = line.strip()
        m = LINE_PAT.match(line)
        log_data = m.groupdict() if m else None
        if log_data:
            for key, val in log_data.items():
                if (key == 'status' or key == 'request_bytes') and val.isdigit():
                    log_data[key] = int(val)
                elif key == 'time_local':
                    log_data[key] = datetime.strptime(val, '%d/%b/%Y:%H:%M:%S %z')
                if val == '-':
                    log_data[key] = None
            raw_logs.append(log_data)

In [None]:
raw_logs[2:4]

## Getting answers

Let's attempt to traverse our data structure to answer our questions.

### How many visits did our site get in total?
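
One way to read this question is as a count of requests: every entry in `raw_logs` is one request, so the length of the list is the total. The sketch below uses a small hypothetical `sample_logs` list in place of the real `raw_logs`, and it assumes that bot traffic counts as a visit too.

```python
# Hypothetical sample standing in for `raw_logs`; only the keys used here are included.
sample_logs = [
    {'remote_addr': '88.191.254.20', 'http_request': 'GET / HTTP/1.0'},
    {'remote_addr': '66.249.66.231', 'http_request': 'GET /popup.php?choix=-89 HTTP/1.1'},
    {'remote_addr': '66.249.66.231', 'http_request': 'GET /specialiste.php HTTP/1.1'},
]

# Each entry is one request, so the total is the number of entries.
total_visits = len(sample_logs)
print(total_visits)
```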

### How many visitors do we have?
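
If we assume that one IP address is one visitor (a simplification, since several people can share an address), a set gives us the distinct values. The sketch below uses a hypothetical `sample_logs` list in place of `raw_logs`.

```python
# Hypothetical sample standing in for `raw_logs`.
sample_logs = [
    {'remote_addr': '88.191.254.20'},
    {'remote_addr': '66.249.66.231'},
    {'remote_addr': '66.249.66.231'},
]

# A set keeps each IP address only once.
visitors = set()
for log in sample_logs:
    visitors.add(log.get('remote_addr'))

print(len(visitors))
```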

### What devices were used to access our site?
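
The log does not record devices directly, so a rough stand-in is the user agent string. The sketch below counts how often each user agent appears using a plain dictionary. `sample_logs` is a hypothetical stand-in for `raw_logs`, and the `'unknown'` label for missing values is an assumption made here.

```python
# Hypothetical sample standing in for `raw_logs`.
googlebot = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
sample_logs = [
    {'http_user_agent': None},
    {'http_user_agent': googlebot},
    {'http_user_agent': googlebot},
]

agent_counts = {}
for log in sample_logs:
    # Missing user agents (None after normalizing) are grouped under one label.
    agent = log.get('http_user_agent') or 'unknown'
    agent_counts[agent] = agent_counts.get(agent, 0) + 1

for agent, count in agent_counts.items():
    print(count, agent)
```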

### Which pages were accessed on our site?
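
The requested page is the middle part of `http_request`, between the method and the protocol. A possible approach is to split the string on whitespace and take the second piece. The sketch below also drops the query string, which is a choice made here and not something the question requires. `sample_logs` is a hypothetical stand-in for `raw_logs`.

```python
# Hypothetical sample standing in for `raw_logs`.
sample_logs = [
    {'http_request': 'GET / HTTP/1.0'},
    {'http_request': 'GET /popup.php?choix=-89 HTTP/1.1'},
    {'http_request': 'GET /specialiste.php HTTP/1.1'},
    {'http_request': None},
]

pages = set()
for log in sample_logs:
    request = log.get('http_request')
    if not request:
        continue
    parts = request.split()
    if len(parts) < 2:
        # Not in the expected "METHOD path PROTOCOL" shape; skip it.
        continue
    # Drop the query string so '/popup.php?choix=-89' counts as '/popup.php'.
    path = parts[1].split('?', 1)[0]
    pages.add(path)

print(sorted(pages))
```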

### How many unique visitors per month?

- Group by month
- Count all visitors for the month
- Count the unique visitors only

```
{
    'month': {
        'ip': [
            occurrence,
            occurrence,
            ...
        ],
        'ip': [
            occurrence,
            occurrence,
            ...
        ],
        ...
    },
    'month': {
        'ip': [
            occurrence,
            occurrence,
            ...
        ],
        'ip': [
            occurrence,
            occurrence,
            ...
        ],
        ...
    },
    ...
}
```

In [None]:
ds_log = {}
for log in raw_logs:
    date = log.get('time_local')
    ip = log.get('remote_addr')
    if date.month not in ds_log:
        ds_log[date.month] = {}

    # Start with an empty list so the first entry is not stored twice.
    if ip not in ds_log[date.month]:
        ds_log[date.month][ip] = []

    ds_log[date.month][ip].append(log)

In [None]:
ds_log.keys()

In [None]:
len(ds_log[3].keys())
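
The same idea can be applied to every month at once. The sketch below walks a structure shaped like `ds_log` and collects both the number of unique visitors and the total number of visits per month. `sample_ds_log` and `summary` are hypothetical names, and plain strings stand in for the full log dictionaries.

```python
# Hypothetical structure in the same shape as `ds_log`: month -> ip -> list of entries.
# Strings stand in for the full log dictionaries.
sample_ds_log = {
    3: {
        '88.191.254.20': ['entry'],
        '66.249.66.231': ['entry', 'entry'],
    },
    4: {
        '66.249.66.231': ['entry'],
    },
}

summary = {}
for month in sorted(sample_ds_log):
    visitors = sample_ds_log[month]
    # One key per IP address, so the number of keys is the unique visitor count.
    unique_count = len(visitors)
    total_count = 0
    for entries in visitors.values():
        total_count += len(entries)
    summary[month] = (unique_count, total_count)
    print(month, unique_count, total_count)
```

Note that grouping by month number alone would merge the same month from different years; keying on a `(year, month)` pair is one way around that if the log spans more than a year.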

### Top visitors per month? How many times did they visit?

- Group by month
- List all visitors for the month
- Group by unique visitor and keep the number of times each one appears
- Sort the visitors by number of visits, highest first
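
The steps above can be sketched as follows, starting from a structure shaped like `ds_log`. `sample_ds_log` and `top_visitors` are hypothetical names, and plain strings stand in for the full log dictionaries.

```python
# Hypothetical structure in the same shape as `ds_log`: month -> ip -> list of entries.
sample_ds_log = {
    3: {
        '88.191.254.20': ['entry'],
        '66.249.66.231': ['entry', 'entry'],
    },
    4: {
        '66.249.66.231': ['entry'],
    },
}

top_visitors = {}
for month, visitors in sample_ds_log.items():
    # Build (ip, number of visits) pairs for the month.
    counts = []
    for ip, entries in visitors.items():
        counts.append((ip, len(entries)))
    # Sort by the number of visits, highest first.
    counts.sort(key=lambda pair: pair[1], reverse=True)
    top_visitors[month] = counts

print(top_visitors[3][0])
```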

## Over to you

1. Using the data structure we have created, how can you navigate it so that you can answer the other questions?

2. Using the ideas here, can you improve on the data structure so you can navigate it better?

3. What other useful information can you extract from the data?

4. What actions can you take given this information?

5. Put things together to create a proper program and write the results to a file (for example, a CSV file).
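
For the last item, Python's standard `csv` module is one possible starting point. The sketch below is only an illustration: `write_visit_counts`, the column names, and the sample rows are all hypothetical, and an in-memory buffer is used in place of a real file.

```python
import csv
import io

def write_visit_counts(fh, rows):
    """Write (month, ip, visits) rows to an open text file as CSV.

    `write_visit_counts` is a hypothetical helper name used for illustration.
    """
    writer = csv.writer(fh)
    writer.writerow(['month', 'ip', 'visits'])
    for row in rows:
        writer.writerow(row)

# Hypothetical results, e.g. taken from a per-month count of visits per IP.
rows = [(3, '66.249.66.231', 2), (3, '88.191.254.20', 1)]

# io.StringIO behaves like an open text file but keeps the content in memory.
buffer = io.StringIO()
write_visit_counts(buffer, rows)
print(buffer.getvalue())
```

To write to disk instead, the buffer can be swapped for a file opened with `open('results.csv', 'w', newline='')`, where `results.csv` is just a placeholder name.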

## More practice!

Can you use what you've learned here on other data sets?

Find other data sets and use what you've learned here to extract information from them.

https://data.gov.ph/