***File to load and analyse the collected Log Data***

In [2]:
import pandas as pd
import os
from glob import glob
import re
import pathlib

BASE_DIR = pathlib.Path().resolve()
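# Build the pattern without a hard-coded Windows separator and sort the result,
# so that file_paths[0] is always the earliest log file
file_paths = sorted(glob(os.path.join(BASE_DIR, 'logs', '*.tsv.gz')))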

***Log Data Loading*** <br>
Loads all files in the specified location into one dataframe.

In [89]:
all_data = pd.DataFrame()

# Read the first file to get the column names
# (the files have no header row, so its first line is used as the column names)
first_file_path = file_paths[0]
first_data = pd.read_csv(first_file_path, sep='\t', compression='gzip')

# Iterate through the list of files and concatenate them
for file_path in file_paths:
    # header=None keeps the first line of every file as data;
    # the column names are taken from the first file
    current_data = pd.read_csv(file_path, sep='\t', compression='gzip', header=None, names=first_data.columns)

    # Concatenate the current data to the overall DataFrame
    all_data = pd.concat([all_data, current_data], ignore_index=True)
    print(f'Finished concatenating file: {file_path}')

Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-10-12.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-10-13.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-10-14.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-10-15.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-10-16.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-10-17.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-10-18.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-10-19

Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-17-02.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-17-03.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-17-06.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-17-10.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-17-13.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-17-17.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-17-20.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-10-18-01

Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-11-01-01.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-11-02-01.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-11-02-09.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-11-02-11.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-11-02-13.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-11-02-21.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-11-03-01.tsv.gz
Finished concatenating file: C:\Users\Shadow\ShadowDrive\book_recommendation_goodreads\log_analysis\logs\2023-11-03-21
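
The loading step above grows the DataFrame inside the loop, which copies all previously loaded rows on every iteration. A minimal sketch of an alternative, assuming the same headerless gzipped TSV layout, collects the frames in a list and concatenates once. The helper name `load_logs` and its parameters are hypothetical and not part of the original notebook.

```python
import pandas as pd


def load_logs(paths, column_names):
    """Read headerless gzipped TSV files and stack them into one DataFrame."""
    frames = [
        pd.read_csv(path, sep='\t', compression='gzip', header=None, names=column_names)
        for path in sorted(paths)  # sorted so the row order follows the file names
    ]
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead
        return pd.DataFrame(columns=column_names)
    return pd.concat(frames, ignore_index=True)
```

With the variables defined above, this could be called as `all_data = load_logs(file_paths, first_data.columns)`.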

***Log Data Preprocessing*** <br>
Creates a structured dataframe from the last column of the raw frame: a regular expression matches the structure of each log message and splits it into multiple columns. <br>
Certain session IDs are then excluded (these were my own session IDs, used while testing certain functionalities), and all rows containing a NaN value are dropped (this was the case when a log message did not match the regular expression and was therefore irrelevant for the analysis).

In [111]:
df = all_data['WARNING 2023-10-10 12:02:21,427 Not Found: /favicon.ico ']
df = pd.DataFrame(df)

In [112]:
# Define the regular expression pattern
pattern = r'(?P<Warning_Level>\w+) (?P<Timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) User Interaction: Session ID: (?P<Session_ID>\w+) - (?P<User_Interaction>.+)'

# Apply the regular expression to extract information into columns
log_df = df['WARNING 2023-10-10 12:02:21,427 Not Found: /favicon.ico '].str.extract(pattern)

In [121]:
log_df = log_df.dropna()
exclude_session_ids = ["bg621uij", "iof2s0te"]
log_df = log_df[~log_df['Session_ID'].isin(exclude_session_ids)]
log_df.to_csv('Cleaned_log.csv', index=False)
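
To illustrate what the preprocessing does, here is a small sketch that applies the same pattern to three sample messages. The first sample is the favicon message from the data; the other two are made-up lines whose format is only inferred from the pattern, and the session ID `abc123xy` is hypothetical.

```python
import pandas as pd

# Same pattern as in the preprocessing cell, split over several lines
pattern = (
    r'(?P<Warning_Level>\w+) '
    r'(?P<Timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) '
    r'User Interaction: Session ID: (?P<Session_ID>\w+) - (?P<User_Interaction>.+)'
)

sample = pd.Series([
    # No "User Interaction" part: every extracted column becomes NaN
    'WARNING 2023-10-10 12:02:21,427 Not Found: /favicon.ico ',
    # Hypothetical participant session
    'WARNING 2023-10-10 12:05:03,112 User Interaction: Session ID: abc123xy - Button clicked: Back to Survey Thank You ',
    # Hypothetical line from one of the excluded test sessions
    'WARNING 2023-10-10 12:06:40,950 User Interaction: Session ID: bg621uij - Button clicked: Back to Survey Thank You ',
])

extracted = sample.str.extract(pattern)   # one column per named group
cleaned = extracted.dropna()              # drops the non-matching favicon line
cleaned = cleaned[~cleaned['Session_ID'].isin(['bg621uij', 'iof2s0te'])]
print(cleaned)
```

Only the `abc123xy` row remains. The trailing space of the message stays in `User_Interaction`, which is why the later filters compare against strings that end in a space.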

***Log Data Analysis***

***Participation***

In [3]:
log_df = pd.read_csv('Cleaned_log.csv')

In [4]:
total_unique_users = log_df['Session_ID'].unique().size
print(f'There were {total_unique_users} distinct users who attempted the survey.')
print('103 of these completed the study, resulting in 112 responses in total.')
print('After filtering, 97 responses remained.')
# The last two figures were not extracted from the log but from the response data

There were 150 distinct users who attempted the survey.
103 of these completed the study, resulting in 112 responses in total.
After filtering, 97 responses remained.


***Button Interaction***

***Book Description***

In [8]:
aggregated_description = log_df.loc[log_df['User_Interaction'].str.endswith('Description ')].groupby('User_Interaction').size().reset_index(name='amount_clicked')
print('In total, all buttons that displayed book descriptions were clicked 909 times. This results in approximately 8 description views per participation.')
print("The most clicked button was the one for 'The Doors of Time' with 57 clicks")
print("The least clicked button was the one for 'Harry Potter and the Prisoner of Azkaban' with 10 clicks")

In total, all buttons that displayed book descriptions were clicked 909 times. This results in approximately 8 description views per participation.
The most clicked button was the one for 'The Doors of Time' with 57 clicks
The least clicked button was the one for 'Harry Potter and the Prisoner of Azkaban' with 10 clicks
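
The figures in the cell above are written into the print statements by hand. As a rough sketch, the total and the most and least clicked entries could also be derived from the filtered interactions directly. The helper `summarise_clicks` and the toy interaction strings are hypothetical; the real labels in the log may differ.

```python
import pandas as pd


def summarise_clicks(interactions: pd.Series) -> dict:
    """Return total clicks plus the most and least clicked interaction."""
    counts = interactions.value_counts()
    return {
        'total': int(counts.sum()),
        'most_clicked': (counts.idxmax(), int(counts.max())),
        # On ties only the first label is reported
        'least_clicked': (counts.idxmin(), int(counts.min())),
    }


# Toy data standing in for log_df
toy_log = pd.DataFrame({'User_Interaction': [
    'Button clicked: Book A Description ',
    'Button clicked: Book A Description ',
    'Button clicked: Book A Description ',
    'Button clicked: Book B Description ',
    'Image clicked: Book A ',
]})

is_description = toy_log['User_Interaction'].str.endswith('Description ')
summary = summarise_clicks(toy_log.loc[is_description, 'User_Interaction'])
print(summary)
```

The same helper would work for the cover and link filters, since only the boolean mask changes.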


***Book Cover***

In [9]:
aggregated_cover = log_df.loc[log_df['User_Interaction'].str.contains('Image')].groupby('User_Interaction').size().reset_index(name='amount_clicked')
print('In total, all book covers were clicked 143 times.')
print("The most clicked cover was the one for 'So Long, and Thanks for All the Fish' with 15 clicks.")
print("The least clicked covers were those for 'Harry Potter and the Prisoner of Azkaban' (again) and 'Self-Publishing Steps To Successful Sales', with 2 clicks each.")

In total, all book covers were clicked 143 times.
The most clicked cover was the one for 'So Long, and Thanks for All the Fish' with 15 clicks.
The least clicked covers were those for 'Harry Potter and the Prisoner of Azkaban' (again) and 'Self-Publishing Steps To Successful Sales', with 2 clicks each.


***Back to Survey***

In [199]:
total_clicks_back_to_survey = log_df.loc[log_df['User_Interaction']=='Button clicked: Back to Survey Thank You ']['User_Interaction'].size
print(f'There were {total_clicks_back_to_survey} clicks in total on the back to survey button. So nearly 40 percent of the participants took a second look at the study.')
print('9 users from these 43 clicks took part in the study again. So circa 40 percent of the users who went back participated again. Also, every eighth user participated at least twice in the study.')

There were 43 clicks in total on the back to survey button. So nearly 40 percent of the participants took a second look at the study.
9 users from these 43 clicks took part in the study again. So circa 40 percent of the users who went back participated again. Also, every eighth user participated at least twice in the study.
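
The count above is a number of clicks, not of users, since one session can click the button several times. A minimal sketch of how the two could be separated, assuming that a session ID identifies one user across all of their clicks (which the log format does not confirm); the toy data and the link label are hypothetical.

```python
import pandas as pd

BACK_LABEL = 'Button clicked: Back to Survey Thank You '

# Toy data standing in for log_df: s1 clicks the button twice, s2 once
toy_log = pd.DataFrame({
    'Session_ID': ['s1', 's1', 's2', 's3', 's4'],
    'User_Interaction': [
        BACK_LABEL, BACK_LABEL, BACK_LABEL,
        'Link clicked: Book X ', 'Link clicked: Book X ',
    ],
})

back_clicks = toy_log[toy_log['User_Interaction'] == BACK_LABEL]
total_clicks = len(back_clicks)                           # every click counts
distinct_sessions = back_clicks['Session_ID'].nunique()   # each session counts once
share = distinct_sessions / toy_log['Session_ID'].nunique()

print(f'{total_clicks} clicks from {distinct_sessions} sessions ({share:.0%} of all sessions)')
```

Applied to `log_df`, the distinct-session count would be the more suitable basis for statements about the share of users who went back.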


In [203]:
aggregated_links = log_df.loc[log_df['User_Interaction'].str.contains('Link clicked')].groupby('User_Interaction').size().reset_index(name='amount_clicked')
print('The links listed in the scenario description as books that had already been read were clicked a total of 80 times.')

The links listed in the scenario description as books that had already been read were clicked a total of 80 times.
