# DS Python Test (Task3)
This notebook addresses Task 3.

For further information, see the 
[task description](../resources/DS_Python_Test.pdf), which describes all the tasks in full. 
In summary, the tasks are:

- Task1: Understanding User Journeys (Mandatory)
- Task2: Finding the Longest Way to TripAdvisor (Optional)
- **Task3: User Engagement and Retention Analysis (Optional)**

This notebook is a walkthrough of the process used to obtain the results.


-----------

## Context
In this case, we are talking about a TripAdvisor session, which is defined as the span
from the moment a user enters a TripAdvisor link until the last consecutive TripAdvisor link is clicked.

This definition assigns each link of a session one of a few cases, or `states`:

- If there is only one TripAdvisor link, it is an `init-end` session, because it started and ended on the same link.
- If there are exactly two links, they correspond to an `init` and an `end` state, respectively.
- If there are more than two consecutive TripAdvisor links, they follow the general form
`init`, `during`, `during`, ..., `during`, `end`, with one state per link.
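
The labelling rules above can be sketched as a small function. This is only an illustration: `label_session_states` is a hypothetical helper, not the code that produced the `session_ta` column in the processed data.

```python
def label_session_states(n_links):
    """Return one state per link for a run of n consecutive TripAdvisor links."""
    if n_links < 1:
        raise ValueError("a session needs at least one link")
    if n_links == 1:
        # A single link both opens and closes the session.
        return ['init-end']
    # First link opens, last link closes, everything in between is 'during'.
    return ['init'] + ['during'] * (n_links - 2) + ['end']


for n in (1, 2, 4):
    print(n, label_session_states(n))
```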

## Import Libraries

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from urllib.parse import urlparse

## Set Configuration and constants

In [2]:
# Constants
URL_TRIPADVISOR = 'https://www.tripadvisor.com'
URL_EMPTY_STRING = 'https://www.this_is_an_empty_string.com'

# Configuration
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', None)

## Initial data read

In [3]:
# Processed data:
df = pd.read_parquet('../data/processed/data.parquet')

## Helper functions: 

In [4]:
# Additional data process:
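def define_session_number_per_user(data):
    """Assign a running session number to every row of one user's data.

    A new session starts at each row whose `session_ta` state begins
    with 'init' (that is, 'init' or 'init-end'). Rows are assumed to be
    in chronological order.

    Parameters
    ----------
    data: pd.DataFrame
        Processed data for a single user, including a `session_ta` column.

    Returns
    -------
    df: pd.DataFrame
        Copy of `data` with an additional `session_nb` column holding
        the session number of each row.
    """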
    new_col = 'session_nb'
    df = data.copy()
    df.loc[df.session_ta.str.startswith('init'), new_col] = 1
    df.loc[~df.session_ta.str.startswith('init'), new_col] = 0
    df[new_col] = df[new_col].cumsum().astype('int')
    return df
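
To make the numbering logic concrete, here is a minimal sketch on made-up data. It reproduces the same idea (flag the rows that open a session, then take a cumulative sum) in compact form. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy session states for a single user, in chronological order (made-up data).
toy = pd.DataFrame({
    'session_ta': ['init', 'during', 'end', 'init-end', 'init', 'end'],
})

# Flag the rows that open a session, then accumulate the flags:
# every 'init' or 'init-end' row bumps the counter by one.
starts = toy['session_ta'].str.startswith('init')
toy['session_nb'] = starts.cumsum().astype('int')

print(toy)
```

Because the notebook applies the function inside `groupby('userid')`, the numbering restarts at 1 for each user.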

## Task 3: User Engagement and Retention Analysis (Optional)

### Objective:
Analyze user engagement and retention with respect to TripAdvisor links. 
Are there particular features or pages that lead to higher engagement? 
Identify and visualize drop-off points or areas where users might abandon
the journey to TripAdvisor.

In [5]:
# Let's extend the original dataframe with per-user session numbers and the joined links
data = df.groupby('userid').apply(define_session_number_per_user).reset_index(drop=True)
data['links'] = (data['referrerurl'] + ',' + data['targeturl']).str.split(',')

In [6]:
# Now we can make a session-wise analysis:
# For each session, let's obtain the 'session_time', 
# the minimum date, and the links (only TripAdvisor links)
data_sessionwise = data.groupby(['userid', 'session_nb']).agg(
    session_time = ('eventtimestamp', lambda x: x.max()-x.min()),
    session_min_date = ('eventtimestamp', 'min'),
    links_per_session = ('links', lambda x: [v for v in x.sum() if URL_TRIPADVISOR in v]),
)

# With this information we can now compute the session's maximum date.
# We can also collect the unique links per session and count them.
data_sessionwise['session_max_date'] = data_sessionwise['session_min_date'] + data_sessionwise['session_time']
data_sessionwise['unique_links_per_session'] = data_sessionwise['links_per_session'].apply(lambda x: list(set(x)))
data_sessionwise['nunique_links_per_session'] = data_sessionwise['links_per_session'].apply(lambda x: len(set(x)))

# And we can get which was the first and last link.
data_sessionwise['first_link'] = data_sessionwise['links_per_session'].apply(lambda x: x[0])
data_sessionwise['last_link'] = data_sessionwise['links_per_session'].apply(lambda x: x[-1])

# With this information we can define a high-engagement condition:
# here, a session is high engagement when its session time is greater than or equal to the 99th percentile of all session times. 
high_engagement_condition = data_sessionwise.session_time >= data_sessionwise.session_time.quantile(0.99)
#low_engagement_condition = data_sessionwise.session_time <= data_sessionwise.session_time.quantile(0.01)
data_sessionwise.loc[high_engagement_condition,'high_engagement'] = 1
#data_sessionwise.loc[low_engagement_condition,'high_engagement'] = 0

We can now extract the links that belong to high-engagement sessions. Recall that a session counts as 
high engagement when its session time is at or above the 99th percentile, that is, in the top 1% of session times.
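
The thresholding step can be illustrated on made-up numbers. This sketch makes two assumptions that differ from the notebook: durations are plain seconds in a hypothetical `session_seconds` column rather than timedeltas, and the flag is stored as a boolean instead of `1`/`NaN`.

```python
import pandas as pd

# Made-up session durations in seconds: 1, 2, ..., 100.
toy_sessions = pd.DataFrame({'session_seconds': range(1, 101)})

# 99th percentile of the durations (pandas interpolates linearly by default).
threshold = toy_sessions['session_seconds'].quantile(0.99)

# Boolean flag: True for sessions at or above the threshold.
toy_sessions['high_engagement'] = toy_sessions['session_seconds'] >= threshold

print(threshold)
print(toy_sessions['high_engagement'].sum())
```

With 100 distinct durations only the single longest session clears the threshold, which shows how strict a 99th-percentile cut-off is.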

In [7]:
# High engagement links
high_engagement_links = pd.Series(
    data_sessionwise[
        data_sessionwise.high_engagement == 1
    ].links_per_session.sum()
)

# Let's parse each URL, take its path and split it on '-',
# then keep the first (most representative) token of the path.
# regex=False makes '.html' a literal match; as a regex, '.' would match any character.
high_engagement_links_path_res = high_engagement_links\
.apply(lambda x: urlparse(x).path.split('-')[0])\
.str.lower()\
.str.replace('.html', '', regex=False)\
.value_counts(normalize=True) * 100
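
As a sanity check of the normalisation, the sketch below applies the same steps to individual URLs. The helper name `leading_path_token` and the example URLs are hypothetical and only mimic the general shape of TripAdvisor paths.

```python
from urllib.parse import urlparse


def leading_path_token(url):
    """Lower-cased first '-'-separated token of the URL path, without '.html'."""
    token = urlparse(url).path.split('-')[0]
    return token.lower().replace('.html', '')


# Hypothetical example URLs, not taken from the data set.
examples = [
    'https://www.tripadvisor.com/Hotel_Review-g1-d2-Reviews-Example_Hotel.html',
    'https://www.tripadvisor.com/Hotels.html',
    'https://www.tripadvisor.com',
]
for url in examples:
    print(repr(leading_path_token(url)))
```

A URL with no path at all yields an empty string, which is presumably where a blank label in the resulting counts comes from.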

In [8]:
# Data Viz
high_engagement_links_path_res

/hotel_review                       12.584515
/attractions                        12.322792
/attraction_review                  10.250818
/restaurant_review                   8.091603
/attractionproductreview             7.917121
/restaurants                         6.673937
/attraction_products                 4.427481
/hotels                              4.143948
/showtopic                           3.009815
/                                    2.660851
/search                              2.529989
/showuserreviews                     1.941112
/tourism                             1.875682
/smartdeals                          1.723010
/showforum                           1.570338
                                     1.199564
/restaurantsnear                     1.155943
/shoppingcartcheckout                1.046892
/hotelslist                          1.025082
/locationphotodirectlink             0.916031
/registrationcontroller              0.872410
/attractionbookingdetails         

In [9]:
# Drop off points

# We define the drop-off points from the last link of each session: its path is split on '-'
# and every resulting segment is counted (case-sensitively), so one link contributes several rows.
drop_off_links_path_res = data_sessionwise\
.last_link\
.apply(lambda x: urlparse(x).path.split('-'))\
.explode()\
.value_counts(normalize=True) * 100
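
Counting every segment differs from counting one leading segment per link, as was done for the high-engagement links. The sketch below contrasts the two on made-up URLs. It is an illustrative alternative, not the computation that produced the notebook's output.

```python
import pandas as pd
from urllib.parse import urlparse

# Made-up last links of three sessions.
last_links = pd.Series([
    'https://www.tripadvisor.com/Hotel_Review-g1-d2-Reviews-Example_Hotel.html',
    'https://www.tripadvisor.com/hotel_review-g3-d4-Reviews-Other_Hotel.html',
    'https://www.tripadvisor.com/Attractions-g5-Activities-Example_City.html',
])

# Variant 1: every '-'-separated segment is counted, case-sensitively.
all_segments = last_links.apply(lambda x: urlparse(x).path.split('-')).explode()

# Variant 2: one lower-cased leading segment per link.
first_segment = last_links.apply(lambda x: urlparse(x).path.split('-')[0]).str.lower()

print(all_segments.value_counts())
print(first_segment.value_counts())
```

In the first variant, generic tokens such as `Reviews` accumulate across page types and identifiers like `g1` appear as their own rows. The second variant keeps exactly one row per session and merges case variants of the same page type.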

In [10]:
# Data viz:
drop_off_links_path_res.head(20)

last_link
Reviews                     6.936653
/Hotel_Review               2.851290
/Attraction_Review          2.036153
/Restaurant_Review          1.960065
/                           1.774796
reviews                     1.392103
/Attractions                1.287650
Activities                  1.284949
/ShowTopic                  1.253208
/Restaurants                0.916888
/ShowUserReviews            0.725316
/Hotels                     0.549277
Hotels.html                 0.538472
/attraction_review          0.530368
/AttractionProductReview    0.529918
/hotel_review               0.447976
/Commerce                   0.447751
/attractions                0.419162
/Attraction_Products        0.417586
activities                  0.416911
Name: proportion, dtype: float64