# Weblog Exercise

We want to be able to perform analyses on the logs of a web server ("UofS_access_log.small"). To do this, you need to get the relevant data into a dataframe. This should be an automated process so that other log files can also be loaded.

The following tasks need to be done. The original dataframe should be reworked so that only these columns remain:

- domain: contains the addresses of the clients that sent a request
- timestamp: a datetime field (POSIXct in R; the pandas counterpart is datetime64) that shows the time of the request
- resource: shows the resource that was requested
- response_code: gives the HTTP response code returned by the server
- response_length: indicates the length of the HTTP response


Import all necessary libraries here:

In [2]:
#SOLUTION_START
import pandas as pd
#SOLUTION_END

## Reading the data
Read the file into a dataframe. Work out for yourself what the separator is. Malformed rows can be skipped. There is no header row. The file uses the "latin" encoding (consult the docs to learn how to set the encoding).

In [3]:
#SOLUTION_START
log = pd.read_csv("UofS_access_log.small", sep=" ", encoding="latin", header=None, on_bad_lines='skip')
log.head()
#SOLUTION_END

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | cad49.cadvision.com | - | - | [01/Jun/1995:00:53:19 | -0600] | GET /~lowey/webville/icons/blank_32.gif | 200 | 167 |
| 1 | 130.89.250.24 | - | - | [01/Jun/1995:02:45:12 | -0600] | GET /~lowey/webville/icons/south_32.gif | 200 | 210 |
| 2 | 130.54.25.198 | - | - | [01/Jun/1995:03:29:56 | -0600] | GET /~macphed/finite/fe_resources/node92.html | 200 | 1668 |
| 3 | 148.81.17.41 | - | - | [01/Jun/1995:04:02:17 | -0600] | GET /~friesend/tolkien/rootpage.html | 200 | 461 |
| 4 | anumsun6.univ-st-etienne.fr | - | - | [01/Jun/1995:04:40:30 | -0600] | GET /~macphed/finite/fe_resources/node58.html | 200 | 1707 |
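To see what skipping incorrect rows means in practice, here is a minimal sketch on a made-up three-line sample. The host names and paths are hypothetical, and the sample assumes that the request is wrapped in double quotes, as the single-column request in the output above suggests. The second line carries one field too many, so the parser treats it as a bad line.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as the log file.
# The second line has an extra field and therefore counts as a bad line.
sample = (
    'host1.example.com - - [01/Jun/1995:00:53:19 -0600] "GET /a.html HTTP/1.0" 200 167\n'
    'host2.example.com - - [01/Jun/1995:00:54:00 -0600] "GET /b.html" extra 200 210\n'
    '10.0.0.1 - - [01/Jun/1995:00:55:10 -0600] "GET /c.html HTTP/1.0" 304 -\n'
)

sample_log = pd.read_csv(
    io.StringIO(sample),   # any file-like object works in place of a path
    sep=" ",
    header=None,           # the file has no header row
    on_bad_lines="skip",   # drop lines with too many fields instead of raising
)
print(sample_log.shape)
```

Only lines with too many fields are skipped this way; lines with too few fields are kept and padded with NaN.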


How many lines are in this data frame?

In [4]:
#SOLUTION_START
len(log)
#SOLUTION_END

48171

Copy all values from the first column into a variable "domain".
Copy all values from the seventh column into a variable "response_code".
Copy all values from the eighth column into a variable "response_length".


In [5]:
#SOLUTION_START
domain = log[0]
response_code = log[6]
response_length = log[7]
#SOLUTION_END

Check whether "response_length" has the correct type. We expect it to contain numbers. Convert the variable if necessary. (Look for an appropriate pandas function starting with 'to_'.) Invalid values should be converted to NaN (tip: see the `errors` parameter).

In [9]:
#SOLUTION_START

response_length.info()
response_length = pd.to_numeric(response_length, errors='coerce')
#SOLUTION_END

<class 'pandas.core.series.Series'>
RangeIndex: 48171 entries, 0 to 48170
Series name: 7
Non-Null Count  Dtype 
--------------  ----- 
48171 non-null  object
dtypes: object(1)
memory usage: 376.5+ KB
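The object dtype reported above means the column holds strings, not numbers. A small sketch of what `errors='coerce'` does, using made-up values and assuming that a placeholder such as "-" is what prevents numeric parsing:

```python
import pandas as pd

# Hypothetical values: two valid lengths and one placeholder.
raw = pd.Series(["167", "-", "1668"])

# Values that cannot be parsed become NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.dtype)
print(numeric.isna().sum())
```

Because NaN is a floating-point value, the converted series ends up with a float dtype even though the valid entries are whole numbers.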


How many NaN values are in response_length?

In [10]:
#SOLUTION_START
response_length.isna().sum()
#SOLUTION_END

897

What percentage is that of all rows?

In [None]:
#SOLUTION_START
response_length.isna().sum() / len(response_length) * 100
#SOLUTION_END
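With the counts shown earlier (897 NaN values out of 48171 rows) this works out to roughly 1.86 percent. As a sketch on made-up data, the same percentage can also be obtained with `mean()`, because the mean of a boolean series is the fraction of True values:

```python
import pandas as pd

# Hypothetical series with one missing value out of four.
lengths = pd.Series([167.0, None, 1668.0, 461.0])

pct_by_count = lengths.isna().sum() / len(lengths) * 100
pct_by_mean = lengths.isna().mean() * 100  # fraction of True values, as a percentage
print(pct_by_count, pct_by_mean)
```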

The timestamp is spread across the columns labeled 3 (date and time) and 4 (timezone), that is, the fourth and fifth columns. Combine these into one string. Place the result in a variable "timestamp".

In [None]:
#SOLUTION_START
timestamp = log[3] + log[4]
timestamp.head()
#SOLUTION_END
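The combined value is still a string. If a real datetime column is wanted, as the column description at the top asks, one possible approach is sketched below. It is an assumption, not part of the exercise solution: it strips the square brackets, parses with an explicit format, and converts everything to UTC. The format string relies on English month abbreviations.

```python
import pandas as pd

# Two values in the same shape as the combined "timestamp" strings.
combined = pd.Series(["[01/Jun/1995:00:53:19-0600]", "[01/Jun/1995:02:45:12-0600]"])

parsed = pd.to_datetime(
    combined.str.strip("[]"),        # remove the surrounding brackets
    format="%d/%b/%Y:%H:%M:%S%z",    # day/month/year:time followed by the UTC offset
    utc=True,                        # normalise all offsets to UTC
)
print(parsed)
```

Passing `errors="coerce"` as well would turn unparseable timestamps into NaT instead of raising an error.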


Create a variable "resource" that contains all resources (the sixth column, labeled 5). Remove the 'GET' at the beginning and the 'HTTP/1.0' that sometimes appears at the end.

In [None]:
#SOLUTION_START
# Remove the method and protocol, then strip the space left behind by "HTTP/1.0"
resource = log[5].str.replace("GET ", "", regex=False).str.replace("HTTP/1.0", "", regex=False).str.strip()
resource
#SOLUTION_END
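A plain substring replacement also removes "GET " or "HTTP/1.0" if they happen to occur in the middle of a path. As an alternative sketch on made-up requests, anchored regular expressions remove them only at the start and end of the string:

```python
import pandas as pd

# Hypothetical requests, with and without the protocol suffix.
requests = pd.Series(["GET /a.html HTTP/1.0", "GET /b.gif"])

cleaned = (
    requests
    .str.replace(r"^GET\s+", "", regex=True)        # only a leading "GET "
    .str.replace(r"\s+HTTP/1\.0$", "", regex=True)  # only a trailing " HTTP/1.0"
)
print(cleaned.tolist())
```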

Now create a dataframe named "log" with the columns "domain", "timestamp", "resource", "response_code", and "response_length". You can get the values from the created variables.

In [None]:
#SOLUTION_START
log = pd.DataFrame({'domain':domain, 'timestamp':timestamp, 'resource':resource, 'response_code':response_code, 'response_length':response_length})
log.info()
#SOLUTION_END

Remove all rows from your dataframe where a missing value occurs.

In [None]:
#SOLUTION_START
log.dropna(inplace=True)
log.head()
#SOLUTION_END
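It can be useful to check how many rows were removed. A minimal sketch on a made-up frame, comparing the length before and after `dropna`:

```python
import pandas as pd

# Hypothetical frame with one missing response_length.
frame = pd.DataFrame({
    "resource": ["/a.html", "/b.gif", "/c.html"],
    "response_length": [167.0, None, 461.0],
})

rows_before = len(frame)
cleaned_frame = frame.dropna()          # returns a new frame, the original is unchanged
rows_dropped = rows_before - len(cleaned_frame)
print(rows_dropped)
print(cleaned_frame.index.tolist())
```

The remaining rows keep their original index labels, so the index has gaps afterwards; `reset_index(drop=True)` renumbers it if needed.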

Find the row(s) with the largest response_length.

In [None]:
#SOLUTION_START
rows = log.response_length == log.response_length.max()
log.loc[rows]
#SOLUTION_END
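The boolean mask returns every row that shares the maximum, which matters when there are ties. As a sketch on made-up data, `nlargest` with `keep="all"` is an alternative that also keeps tied rows:

```python
import pandas as pd

# Hypothetical frame in which two rows share the largest response_length.
frame = pd.DataFrame({
    "resource": ["/a", "/b", "/c"],
    "response_length": [10.0, 50.0, 50.0],
})

# Approach used in the solution: compare every value with the maximum.
by_mask = frame.loc[frame["response_length"] == frame["response_length"].max()]

# Alternative: ask for the single largest value, keeping all ties.
by_nlargest = frame.nlargest(1, "response_length", keep="all")
print(by_mask)
print(by_nlargest)
```

By contrast, `idxmax()` returns only the first index label at which the maximum occurs.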

Save the result in a CSV file "log_result.csv". Use ',' as the separator and "." for decimal numbers.

In [None]:
#SOLUTION_START
log.to_csv("log_result.csv", sep=",", decimal=".", index=False)
#SOLUTION_END
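A quick way to check the export is to write to an in-memory buffer and read it back. The sketch below uses made-up rows, one of which has a comma in the resource, to show that pandas quotes such values so that the ',' separator stays unambiguous:

```python
import io
import pandas as pd

# Hypothetical rows; the second resource contains a comma.
frame = pd.DataFrame({
    "domain": ["host1.example.com", "10.0.0.1"],
    "resource": ["/a.html", "/search?q=x,y"],
    "response_length": [167.0, 1668.0],
})

buffer = io.StringIO()
frame.to_csv(buffer, sep=",", decimal=".", index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back should reproduce the same frame.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(frame))
```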

Try to import the file into a spreadsheet.
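
Since the goal is an automated process that also works for other log files, the steps above can be gathered into one function. This is a sketch under assumptions: the name `load_weblog` and its parameters are hypothetical, the timestamp stays a string as in the steps above, and the usage example runs on a made-up two-line sample, not the real file.

```python
import io
import pandas as pd


def load_weblog(source, encoding="latin"):
    """Hypothetical helper: load an access log into the five target columns."""
    raw = pd.read_csv(
        source, sep=" ", encoding=encoding, header=None, on_bad_lines="skip"
    )
    resource = (
        raw[5]
        .str.replace("GET ", "", regex=False)
        .str.replace("HTTP/1.0", "", regex=False)
        .str.strip()
    )
    result = pd.DataFrame({
        "domain": raw[0],
        "timestamp": raw[3] + raw[4],  # date/time and timezone joined as a string
        "resource": resource,
        "response_code": raw[6],
        "response_length": pd.to_numeric(raw[7], errors="coerce"),
    })
    return result.dropna()


# Usage on a made-up sample; a file path such as "UofS_access_log.small" works too.
sample = (
    'host1.example.com - - [01/Jun/1995:00:53:19 -0600] "GET /a.html HTTP/1.0" 200 167\n'
    '10.0.0.1 - - [01/Jun/1995:00:55:10 -0600] "GET /c.html HTTP/1.0" 304 -\n'
)
weblog = load_weblog(io.BytesIO(sample.encode("latin-1")))
print(weblog)
```

The second sample line has "-" as its response length, so it becomes NaN and is dropped by the final `dropna()`.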