# ADA CAPI Notebook for Project Milestone 2

## Table of Contents:

1. [Loading and Preparing the Data](#load)
    1. [Load Tabular Data](#tabu)
    2. [Clean Tabular Data](#clean)
2. [Data Extraction](#extract)
    1. [Extracting metrics from textual articles](#gen)
        1. [Defining Article Metrics](#define_arti_metrics)
        2. [Extracting Article Metrics](#extracting_arti_metrics)
3. [Data Analysis](#analysis)
    1. [Exploring Path Lengths](#paths)
    2. [Exploring Categories in the Paths](#cats)
    3. [Exploring Subject Strength in Articles](#sub)
        1. [Exploring Subject Strength in Connected Articles](#graph_cat)
        2. [Exploring Subject Strength in Finished Path Articles](#graph_cat_fi)
        3. [Exploring Subject Strength in Unfinished Path Articles](#graph_cat_unfi)
    4. [Analysing Article Metrics](#artmet)
        1. [Analysing Article Metrics by Category](#artmet_cat)
        2. [Analysing Article Metrics in Finished vs Unfinished paths](#artmetfu_path)
    5. [Analysing the In-Degree of Targets in Finished vs Unfinished Paths](#ltt)
    6. [Analysing Possible Shortest Path Distances in Finished vs Unfinished Paths](#shortest)
4. [Putting Everything Together](#everything)
    1. [Exploration per Actual Link](#actlink)
5. [Initial Regression](#regression)

    

In [None]:
import pandas as pd
import networkx as nx
import numpy as np
import os
from scipy import stats 

# Helper functions from utils folder
from utils.analysis import t_test_article_metrics, visualize_article_connections_per_category
from utils.preprocessing import get_all_links, merge_articles_categories, create_category_dictionaries

# Formatting libraries
import urllib
import datetime

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# Imports to perform article analysis
import textstat
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

nltk.download('punkt') # Punkt tokenizer
nltk.download('stopwords') # Common stopwords

# Load config and extract variables
import config
DATA_PATH = config.PATH_TO_DATA
PATH_GRAPGH_FOLDER = "wikispeedia_paths-and-graph"
ARTICLE_FOLDER = "plaintext_articles"
GENERATED_METRICS = "generated_data"

<a id="load"></a>
## 1 - Loading and Preparing the Data

Note that you can load the data from [here](#checkpoint1).

<a id="tabu"></a>
#### 1.1 - Load Tabular Data

In [None]:
# load in all data (except wikipedia articles)
finished_paths = pd.read_csv(os.path.join(DATA_PATH, PATH_GRAPGH_FOLDER, "paths_finished.tsv"), sep='\t', skiprows=15, 
                             names=["hashedIpAddress", "timestamp", "durationInSec", "path", "rating"])
unfinished_paths = pd.read_csv(os.path.join(DATA_PATH, PATH_GRAPGH_FOLDER, "paths_unfinished.tsv"), sep='\t', skiprows=16, 
                               names=["hashedIpAddress", "timestamp", "durationInSec", "path", "target", "type"])
edges = pd.read_csv(os.path.join(DATA_PATH, PATH_GRAPGH_FOLDER, "links.tsv"), sep='\t', skiprows=15, names=["start", "end"], encoding="utf-8")
articles = pd.read_csv(os.path.join(DATA_PATH, PATH_GRAPGH_FOLDER, "articles.tsv"), sep='\t', skiprows=12, names=["article"], encoding="utf-8")
categories = pd.read_csv(os.path.join(DATA_PATH, PATH_GRAPGH_FOLDER, "categories.tsv"), sep='\t', skiprows=13, 
                         names=["article", "category"], encoding="utf-8")
shortest_paths = np.genfromtxt(os.path.join(DATA_PATH, PATH_GRAPGH_FOLDER, "shortest-path-distance-matrix.txt"), delimiter=1, dtype=np.uint8)

<a id="clean"></a>
#### 1.2 - Clean Tabular Data
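The article names in the raw files are percent-encoded, and each path is stored as a single `;`-separated string. The cells in this section undo both. As a minimal illustration on a made-up path string (the toy value is an assumption, not a row copied from the dataset):

```python
import urllib.parse

# Hypothetical raw path string in the same shape as the `path` column.
raw_path = "%C3%81ed%C3%A1n_mac_Gabr%C3%A1in;Ireland;C%C3%B4te_d%27Ivoire"

# Split into article names, then decode the percent-encoding of each name.
decoded_path = [urllib.parse.unquote(name) for name in raw_path.split(";")]

# For a finished path, the first and last entries are the start and the target.
start, target = decoded_path[0], decoded_path[-1]
print(decoded_path)
```

Decoding before any merge matters because the article, category, and edge tables are joined on the article name.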

In [None]:
# Clean up url encoding in edge list
display(edges.head())
edges["start"] = edges.start.apply(urllib.parse.unquote)
edges["end"] = edges.end.apply(urllib.parse.unquote)
display(edges.head())

In [None]:
# Format datetime as datetime object
finished_paths["datetime"] = finished_paths.timestamp.apply(datetime.datetime.fromtimestamp)
unfinished_paths["datetime"] = unfinished_paths.timestamp.apply(datetime.datetime.fromtimestamp)
display(unfinished_paths.head())

In [None]:
# Clean up url encoding for articles
display(articles.head())
articles["article"] = articles.article.apply(urllib.parse.unquote)
display(articles.head())

In [None]:
# Clean up url encoding for categories
display(categories.head())
categories["article"] = categories.article.apply(urllib.parse.unquote)
display(categories.head())

In [None]:
# Identify broad categories of articles
display(categories.head())
categories["broad_category"] = categories["category"].apply(lambda x: x.split(".")[1]) # first entry after subject.
display(categories.head())

In [None]:
# merge articles and categories
articles_categories = pd.merge(articles, categories, how="left", on="article")
display(articles_categories.head())

# 6 articles without category! # TODO: discuss: what do we do with these?
print("Merge introduced {} NAs in category columns:".format(articles_categories.category.isna().sum()))
articles_categories[articles_categories.category.isna()]

In [None]:
# Convert paths to a readable format (lists) and remove url encoding
finished_paths["path"] = finished_paths["path"].apply(lambda x: x.split(";"))
finished_paths["path"] = finished_paths["path"].apply(lambda x: [urllib.parse.unquote(y) for y in x])

unfinished_paths["path"] = unfinished_paths["path"].apply(lambda x: x.split(";"))
unfinished_paths["path"] = unfinished_paths["path"].apply(lambda x: [urllib.parse.unquote(y) for y in x])

In [None]:
# Add start and target articles of path
finished_paths["start"] = [path[0] for path in finished_paths["path"]]
finished_paths["target"] = [path[-1] for path in finished_paths["path"]]

unfinished_paths["start"] = [path[0] for path in unfinished_paths["path"]]
unfinished_paths["target"] = unfinished_paths["target"].apply(urllib.parse.unquote)

<a id="extract"></a>

## 2 - Data Extraction

<a id="gen"></a>

#### 2.1 - Extracting metrics from textual articles

<a id="define_arti_metrics"></a>
##### 2.1.1 - Defining Article Metrics

The following metrics are extracted by applying text pre-processing techniques to the Wikipedia articles:
* Total word count: To understand the length of the article.
* Non stopword frequency: To identify words that contribute to the content's meaning.
* Stopword frequency: To identify common words that may not contribute to the content's meaning.
* Average word length: To assess the complexity of the language used.
* Average sentence length: Longer or more complex sentences (based on characters) may contribute to frustration.
* Number of paragraphs: To see if the article's structure plays a role in people giving up.
* Keyword frequency: To identify the most common keywords to understand the article's focus.
* Readability: To see if the ease of reading the article has an impact (metric: Flesch Reading Ease Score) Link: https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests
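
For reference, the Flesch Reading Ease score combines the average sentence length with the average number of syllables per word. The sketch below applies the standard formula with a deliberately naive tokenizer and syllable counter (both are assumptions made for illustration). `textstat` uses its own, more careful counting, so its scores will differ from these.

```python
import re

def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """Standard Flesch Reading Ease formula on naively tokenized text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

easy = "The cat sat on the mat. The dog ran to the park."
hard = "Institutional heterogeneity complicates international comparative evaluation considerably."

# Short words and short sentences score higher (easier) than long ones.
print(flesch_reading_ease(easy), flesch_reading_ease(hard))
```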

In [None]:
def preprocess_article(article_text):
    # Lowercase the text and join wrapped lines into continuous sentences
    preprocessed_text = article_text
    preprocessed_text = preprocessed_text.lower()
    preprocessed_text = preprocessed_text.replace("\n   ", " ") # As the articles are not continuous sentences
    return preprocessed_text

def calculate_article_metrics(article_text):
    preprocessed_text = preprocess_article(article_text)

    words = word_tokenize(preprocessed_text)
    sentences = sent_tokenize(preprocessed_text)

    # Calculate total word count
    total_word_count = len(words)

    # Calculate stopword frequency
    stop_words = set(stopwords.words("english"))
    stopwords_count = 0
    unique_words = []
    for word in words:
        if word.isalpha() and word.lower() in stop_words:
            stopwords_count +=1
        if word.isalpha() and word.lower() not in stop_words:
            unique_words.append(word.lower())

    # Calculate average word length
    average_word_length = sum(len(word) for word in words) / total_word_count

    # Calculate average sentence length
    average_sentence_length = sum(len(sentence) for sentence in sentences) / len(sentences)

    # Calculate number of paragraphs (assume every new line \n is paragraph)
    paragraphs_count = preprocessed_text.count('\n') + 1 # Count last paragraph

    # Calculate keyword frequency
    word_freq = nltk.FreqDist(unique_words)
    most_common_words = word_freq.most_common(10)  # Parameter to adjust

    # Calculate readability (Flesch Reading Ease Score) - 100: Easy to read, 0: Very confusing
    readability = textstat.flesch_reading_ease(preprocessed_text)

    return {
        "word_count": total_word_count,
        "non_stopword_count": total_word_count - stopwords_count,
        "non_stopword_percentage": (total_word_count - stopwords_count) / total_word_count,
        "stopword_count": stopwords_count,
        "stopword_percentage": stopwords_count / total_word_count,
        "avg_word_length": average_word_length,
        "avg_sent_length": average_sentence_length,
        "paragraph_count": paragraphs_count,
        "common_words": most_common_words,
        "readability_score": readability,
    }

<a id="extracting_arti_metrics"></a>
##### 2.1.2 - Extracting Article Metrics

The code below accesses the `plaintext_articles` folder and reads every article in it, building a dataframe with all the metric information (see table below). Because this step is slow, the article metrics are computed once and written to a CSV file, which the later cells read back.


Article Metrics DataFrame Description:
| Column Name                   | Metric                   | Purpose                                                            | Description                                                                                                  |
|--------------------------|--------------------------|--------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| `word_count`         | Total Word Count         | To understand the overall length of the article.                   | Represents the total number of words in the article.                                                         |
| `non_stopword_count`         | Non-Stopword Frequency   | To identify words that contribute to the content's meaning.         | Measures the frequency of non-stopwords, highlighting contextually relevant terms.                           |
| `stopword_count`         | Stopword Frequency       | To identify common words that may not contribute significantly.   | Measures the frequency of stopwords, aiding in identifying less informative words.                           |
| `avg_word_length`         | Average Word Length      | To assess the complexity of the language used.                     | Calculates the average length of words in the article.                                                        |
| `avg_sent_length`         | Average Sentence Length  | To evaluate sentence complexity based on characters.              | Computes the average number of characters per sentence, providing insights into structure and readability.   |
| `paragraph_count`        | Number of Paragraphs     | To assess the role of article structure in user engagement.        | Indicates the total number of paragraphs in the article.                                                      |
| `common_words`         | Keyword Frequency        | To identify common keywords and understand the article's focus.    | Reveals the frequency of keywords, aiding in discerning prevalent themes within the content.                   |
| `readability_score`         | Readability (Flesch Score)| To see if the ease of reading the article has an impact.                        | Utilizes the Flesch Reading Ease Score for readability assessment. [Learn more](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests) |
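
The compute-once-then-reload pattern described above can be sketched as follows. This is only an illustration using the standard `csv` module; `load_or_compute` and `fake_metrics` are hypothetical names, and the notebook itself uses `pandas` for this step.

```python
import csv
import os
import tempfile

def load_or_compute(csv_path, compute_rows, fieldnames):
    """Read rows from csv_path, computing and writing them first if the file is missing."""
    if not os.path.exists(csv_path):
        rows = compute_rows()  # the expensive step, run only on a cache miss
        os.makedirs(os.path.dirname(csv_path) or ".", exist_ok=True)
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    # Always return what is on disk, so both code paths yield the same representation.
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

calls = []

def fake_metrics():
    """Stand-in for the slow metric extraction; records how often it runs."""
    calls.append(1)
    return [{"article": "Ireland", "word_count": 1200}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "generated_data", "article_metrics.csv")
    first = load_or_compute(path, fake_metrics, ["article", "word_count"])
    second = load_or_compute(path, fake_metrics, ["article", "word_count"])

print(len(calls))  # the expensive function ran only once
```

Note that values read back from a CSV come back as strings in this sketch, which is one reason to inspect column types after reloading.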

In [None]:
folder_path = os.path.join(DATA_PATH, ARTICLE_FOLDER)
if os.path.exists(folder_path) and os.path.isdir(folder_path):

  article_metrics = pd.DataFrame(columns=["article", "word_count", "non_stopword_count", "non_stopword_percentage", "stopword_count", "stopword_percentage", "avg_word_length", "avg_sent_length", "paragraph_count", "common_words", "readability_score"])

  for file_name in os.listdir(folder_path):
    file_path = os.path.join(folder_path, file_name)
    
    if os.path.isfile(file_path):
      root, extension = os.path.splitext(file_name)
      readable_file_name = urllib.parse.unquote(root)
      
      with open(file_path, "r", encoding="utf-8") as article:
        metrics = calculate_article_metrics(article.read())

        metrics["article"] = readable_file_name
        article_metrics.loc[len(article_metrics)] = metrics
else:
  raise FileNotFoundError("The specified folder path does not exist or is not a directory.")

os.makedirs(GENERATED_METRICS, exist_ok=True) # Make sure the output folder exists
article_metrics.to_csv(os.path.join(GENERATED_METRICS, "article_metrics.csv"), index=False)

Loading the article data

In [None]:
article_metrics = pd.read_csv(os.path.join(GENERATED_METRICS, "article_metrics.csv"))

In [None]:
display(article_metrics.info())
display(article_metrics.head())

<a id="analysis"></a>

## 3 - Data Analysis

<a id="paths"></a>

#### 3.1 - Exploring Path Lengths

Compare the path lengths of finished and unfinished paths to detect potential outliers or trends that might influence the analysis.

In [None]:
# calculate path lengths for finished paths and show summary statistics
finished_paths["path_length"] = finished_paths.path.apply(lambda el: len(el))
finished_paths["path_length"].describe()

In [None]:
# calculate path lengths for unfinished paths and show summary statistics
unfinished_paths["path_length"] = unfinished_paths.path.apply(lambda el: len(el))
unfinished_paths["path_length"].describe()

In [None]:
# compare distributions of finished and unfinished paths
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 6))
threshold = 40 # for now we remove some outliers to make the plots meaningful

ax.set_title("Distribution of Path Lengths: Finished vs. Unfinished Paths")
sns.histplot(x=finished_paths.path_length[finished_paths.path_length < threshold], ax=ax, discrete=True, alpha=0.4, label="Finished paths")
sns.histplot(x=unfinished_paths.path_length[unfinished_paths.path_length < threshold], ax=ax, discrete=True, alpha=0.4, label="Unfinished paths")
ax.legend() # Distinguish the two overlaid histograms

In [None]:
# Compare distributions of path lengths across finished, restarted paths and unfinished paths that timed out
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(14, 4), sharey=True)

sns.histplot(x=finished_paths.path_length[finished_paths.path_length < threshold], ax=axes[0], discrete=True)
axes[0].set_title("Finished Paths")

unfinished_clean = unfinished_paths[(unfinished_paths.path_length < threshold) & (unfinished_paths.type == "restart")]
sns.histplot(data=unfinished_clean, x="path_length", ax=axes[1], discrete=True,)
axes[1].set_title("Unfinished Paths - Restart")

unfinished_clean = unfinished_paths[(unfinished_paths.path_length < threshold) & (unfinished_paths.type == "timeout")]
sns.histplot(data=unfinished_clean, x="path_length", ax=axes[2], discrete=True,)
axes[2].set_title("Unfinished Paths - Timeout")

We see there are quite a few outliers and extreme values. For the analysis we have to think about removing them:
- Do we remove games with just one click?
- Do we remove games with a very high path length (e.g., above 30), since the player might just have clicked randomly?

We plan to address these questions in the analysis, e.g., through a sensitivity analysis: We run our analysis first on the regular data, before checking if we get similar results while removing certain outlier games.
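
One way such a sensitivity check could look is sketched below. The bounds `min_len=2` and `max_len=30` and the toy path lengths are assumptions chosen for illustration, not decisions we have made for the analysis.

```python
from statistics import mean, median

def summarize(lengths):
    """Summary statistics for a list of path lengths."""
    return {"n": len(lengths), "mean": mean(lengths), "median": median(lengths)}

def sensitivity(lengths, min_len=2, max_len=30):
    """Compare statistics on all games with those on games within [min_len, max_len]."""
    trimmed = [n for n in lengths if min_len <= n <= max_len]
    return {"all": summarize(lengths), "trimmed": summarize(trimmed)}

# Toy data with one very short and one very long game.
toy_lengths = [1, 3, 4, 4, 5, 6, 7, 8, 120]
result = sensitivity(toy_lengths)
print(result["all"])
print(result["trimmed"])
```

If a conclusion holds on both the full and the trimmed data, it is unlikely to be driven by the extreme games alone. In this toy example the median is unchanged while the mean drops sharply once the outliers are removed.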

<a id="cats"></a>

#### 3.2 - Exploring Categories in the Paths

We want to look at the occurrences of categories in the paths to understand whether certain categories lead to games that are, on average, easier for people.

In [None]:
# Seeing which categories are most represented in articles
count_articles = categories.groupby("broad_category").size()

print("Number of articles in each broad category:")
display(count_articles)

In [None]:
# Create dictionaries for easy discovery of what categories an article belongs to
article_to_category, article_to_broad_category = create_category_dictionaries(categories)

In [None]:
# Count how many times each category has occurred as a target in the finished and unfinished paths.
# Note that some articles are represented by multiple categories, which are thus counted extra.

# TODO: discuss: maybe wrap this in a function?

all_target_broad_categories_f = [
  article_to_broad_category[target] for target in finished_paths["target"] if target in article_to_broad_category
]
all_target_broad_categories_f = [item for sublist in all_target_broad_categories_f for item in sublist]
count_cats_finished_target = Counter(all_target_broad_categories_f)
keys_finished = list(count_cats_finished_target.keys())
keys_finished.sort()
sorted_cats_f = {i: count_cats_finished_target[i] for i in keys_finished}

all_target_broad_categories_u = [
  article_to_broad_category[target] for target in unfinished_paths["target"] if target in article_to_broad_category
]
all_target_broad_categories_u = [item for sublist in all_target_broad_categories_u for item in sublist]
count_cats_unfinished_target = Counter(all_target_broad_categories_u)
keys_unfinished = list(count_cats_unfinished_target.keys())
keys_unfinished.sort()
sorted_cats_u = {i: count_cats_unfinished_target[i] for i in keys_unfinished}

# Plotting the results.
ax = plt.barh(list(sorted_cats_f.keys()), sorted_cats_f.values(), label="Finished paths")
ax2 = plt.barh(list(sorted_cats_u.keys()), sorted_cats_u.values(), label="Unfinished paths")
plt.xlabel("Count")
plt.title("Occurrences of categories as targets")
plt.gca().invert_yaxis()
plt.legend()
plt.show()

The plot clearly shows that some categories occur as targets relatively more often in finished paths, while for others the opposite holds. For example, "Countries" occurs as a target in finished paths several times as often as in unfinished paths, whereas "Everyday_life" occurs as a target in finished paths only slightly more often than in unfinished paths. This suggests that certain categories may make for easier games. Nonetheless, to establish a proper relationship, we may in the future need to control for other variables. For example, it could be that articles in the "Countries" category are simply better connected than those in the "Everyday_life" category.

The plot also shows the imbalance in the categories. We have a lot of paths ending in "Geography" and "Science", but very few ending in "Mathematics" and "Art".
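
Because the categories are so imbalanced, raw counts are hard to compare. A possible complement is the share of games per target category that were finished. The sketch below uses made-up counts, and `completion_share` is a hypothetical helper, not part of our `utils` folder.

```python
from collections import Counter

def completion_share(finished_counts, unfinished_counts):
    """Fraction of games per target category that were finished."""
    shares = {}
    for category in sorted(set(finished_counts) | set(unfinished_counts)):
        finished = finished_counts.get(category, 0)
        total = finished + unfinished_counts.get(category, 0)
        if total > 0:  # skip categories without any games
            shares[category] = finished / total
    return shares

# Made-up counts for illustration only.
toy_finished = Counter({"Countries": 80, "Everyday_life": 30})
toy_unfinished = Counter({"Countries": 20, "Everyday_life": 30})

shares = completion_share(toy_finished, toy_unfinished)
print(shares)
```

In the notebook, the two counters would correspond to `count_cats_finished_target` and `count_cats_unfinished_target`.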

<a id="sub"></a>

#### 3.3 - Exploring Subject Strength in Articles

<a id="graph_cat"></a>

##### 3.3.1 - Exploring Subject Strength in Connected Articles
Visualizing the strength of the categories for connected articles (those which are connected by an edge).

In [None]:
# Visualizing article connections per category for all linked articles (edge list)
edge_category = merge_articles_categories(edges, ["start", "end"], articles_categories)
visualize_article_connections_per_category(edge_category, "Article Connections Based on Category (Normalized and Scaled Edges)")

# TODO: Should add some comments about these?

<a id="graph_cat_fi"></a>

##### 3.3.2 - Exploring Subject Strength in Finished Path Articles
Visualizing the strength of the categories for both start and target articles in the finished paths using a graph.

In [None]:
# Visualizing FINISHED PATHS article connections per category
finished_paths_categories = merge_articles_categories(finished_paths, ["start", "target"], articles_categories)
visualize_article_connections_per_category(finished_paths_categories, "Start & Target Article Connections in Finished Path Based on Category (Normalized and Scaled Edges)")

<a id="graph_cat_unfi"></a>

##### 3.3.3 - Exploring Subject Strength in Unfinished Path Articles
Visualizing the strength of the categories for both start and target articles in the unfinished paths using a graph.

In [None]:
# Visualizing UNFINISHED PATHS article connections per category
unfinished_paths_categories = merge_articles_categories(unfinished_paths, ["start", "target"], articles_categories)
visualize_article_connections_per_category(unfinished_paths_categories, "Start & Target Article Connections in Unfinished Path Based on Category (Normalized and Scaled Edges)")

<a id="artmet"></a>

#### 3.4 - Analysing Article Metrics

<a id="artmet_cat"></a>

##### 3.4.1 - Analysing Article Metrics by Category

In [None]:
# Merge articles with their corresponding categories
article_metrics_with_categories = article_metrics.merge(categories, how="left", on=["article"])
display(article_metrics_with_categories.head())

In [None]:
metrics_to_plot = ['word_count', 'stopword_count', 'stopword_percentage', 'non_stopword_count', 'non_stopword_percentage','avg_word_length', 'avg_sent_length', 'paragraph_count','readability_score']
fig, axes = plt.subplots(nrows=len(metrics_to_plot), ncols=2, figsize=(15, 6 * len(metrics_to_plot)))

for idx, metric in enumerate(metrics_to_plot):
  # Bar plot
  ax_bar = axes[idx, 0]
  sns.barplot(x=article_metrics_with_categories["broad_category"], y=article_metrics_with_categories[metric], errorbar=("ci", 95), ax=ax_bar)
  ax_bar.set_xlabel("Category")
  ax_bar.set_ylabel(metric)
  ax_bar.set_title("Mean and CI of {} per Category".format(metric))
  ax_bar.set_xticklabels(ax_bar.get_xticklabels(), rotation=90)

  # Violin plot
  ax_violin = axes[idx, 1]
  sns.violinplot(x=article_metrics_with_categories["broad_category"], y=article_metrics_with_categories[metric], ax=ax_violin)
  ax_violin.set_xlabel("Category")
  ax_violin.set_ylabel(metric)
  ax_violin.set_title("Distribution of {} per Category".format(metric))
  ax_violin.set_xticklabels(ax_violin.get_xticklabels(), rotation=90)

plt.tight_layout()
plt.show()

<a id="artmetfu_path"></a>

##### 3.4.2 - Analysing Article Metrics in Finished vs Unfinished paths

In [None]:
# Show the article metrics for finished and unfinished paths (both for start and target articles)
start_finished_article_metrics = finished_paths.merge(article_metrics_with_categories, how="left", left_on="start", right_on="article")
end_finished_article_metrics = finished_paths.merge(article_metrics_with_categories, how="left", left_on="target", right_on="article")
start_unfinished_article_metrics = unfinished_paths.merge(article_metrics_with_categories, how="left", left_on="start", right_on="article")
end_unfinished_article_metrics = unfinished_paths.merge(article_metrics_with_categories, how="left", left_on="target", right_on="article")

In [None]:
metrics_to_plot = ["word_count", "stopword_count", "stopword_percentage", "non_stopword_count", "non_stopword_percentage","avg_word_length", "avg_sent_length", "paragraph_count", "readability_score"]
dataframes = [start_finished_article_metrics, start_unfinished_article_metrics, end_finished_article_metrics, end_unfinished_article_metrics]
dataframe_labels = ["Start Finished", "Start Unfinished", "Target Finished", "Target Unfinished"]

# TODO: is this right? Should the plots not show the metrics for start finished, start unfinished, target finished, target unfinished?

fig, axes = plt.subplots(nrows=len(metrics_to_plot), ncols=2, figsize=(15, 6 * len(metrics_to_plot)))

for idx, metric in enumerate(metrics_to_plot):
  data = [df[metric] for df in dataframes]
  
  # Bar plot
  ax_bar = axes[idx, 0]
  sns.barplot(data=data, errorbar=("ci", 95), ax=ax_bar)
  ax_bar.set_xlabel("Type of article")
  ax_bar.set_ylabel(metric)
  ax_bar.set_title("Mean and CI of {} per Article Type".format(metric))
  ax_bar.set_xticklabels(dataframe_labels)

  # Violin plot
  ax_violin = axes[idx, 1]
  sns.violinplot(data=data, ax=ax_violin)
  ax_violin.set_xlabel("Type of article") # was set on ax_bar by mistake
  ax_violin.set_ylabel(metric)
  ax_violin.set_title("Distribution of {} per Article Type".format(metric))
  ax_violin.set_xticklabels(dataframe_labels)

plt.tight_layout()
plt.show()

In [None]:
print("Start Articles (comparing finished vs unfinished):")
t_test_article_metrics(metrics_to_plot, start_finished_article_metrics, start_unfinished_article_metrics)

print("\nTarget Articles (comparing finished vs unfinished):")
t_test_article_metrics(metrics_to_plot, end_finished_article_metrics, end_unfinished_article_metrics)


# TODO: discuss: why are we comparing start and end article of the same game? what are we trying to do - add a comment?!
print("\nFinished Articles (comparing start vs target):")
t_test_article_metrics(metrics_to_plot, start_finished_article_metrics, end_finished_article_metrics)

print("\nUnfinished Articles (comparing start vs target):")
t_test_article_metrics(metrics_to_plot, start_unfinished_article_metrics, end_unfinished_article_metrics)

| Metric                     | Start Articles (Finished vs Unfinished) | Target Articles (Finished vs Unfinished) | Finished Articles (Start vs Target) | Unfinished Articles (Start vs Target) |
|----------------------------|-----------------------------------------|------------------------------------------|--------------------------------------|----------------------------------------|
| word_count                  | t-statistic: 1.873, p-value: 0.061       | t-statistic: 36.838, p-value: 0.000      | t-statistic: -30.824, p-value: 0.000 | t-statistic: 8.949, p-value: 0.000     |
| stopword_count              | t-statistic: 1.353, p-value: 0.176       | t-statistic: 34.692, p-value: 0.000      | t-statistic: -30.009, p-value: 0.000 | t-statistic: 8.076, p-value: 0.000     |
| stopword_percentage         | t-statistic: -3.366, p-value: 0.001      | t-statistic: 4.045, p-value: 0.000       | t-statistic: -0.362, p-value: 0.717 | t-statistic: 6.045, p-value: 0.000     |
| non_stopword_count          | t-statistic: 2.131, p-value: 0.033       | t-statistic: 37.599, p-value: 0.000      | t-statistic: -30.943, p-value: 0.000 | t-statistic: 9.317, p-value: 0.000     |
| non_stopword_percentage     | t-statistic: 3.366, p-value: 0.001       | t-statistic: -4.045, p-value: 0.000      | t-statistic: 0.362, p-value: 0.717  | t-statistic: -6.045, p-value: 0.000    |
| avg_word_length             | t-statistic: -3.090, p-value: 0.002      | t-statistic: 10.974, p-value: 0.000     | t-statistic: 2.987, p-value: 0.003  | t-statistic: 14.113, p-value: 0.000   |
| avg_sent_length             | t-statistic: 4.863, p-value: 0.000       | t-statistic: -0.260, p-value: 0.795     | t-statistic: -3.443, p-value: 0.001 | t-statistic: -6.964, p-value: 0.000    |
| paragraph_count             | t-statistic: 0.038, p-value: 0.970       | t-statistic: 37.247, p-value: 0.000     | t-statistic: -29.294, p-value: 0.000| t-statistic: 11.952, p-value: 0.000   |
| readability_score           | t-statistic: 4.652, p-value: 0.000       | t-statistic: -21.100, p-value: 0.000   | t-statistic: -8.953, p-value: 0.000 | t-statistic: -27.779, p-value: 0.000 |

1. **Finished vs Unfinished Start Articles:**
   - The stopword_percentage is significantly lower (and non_stopword_percentage higher) for start articles of finished paths than for those of unfinished paths, suggesting a potential emphasis on more meaningful content.
   - Start articles of finished paths also tend to have a higher avg_sent_length and readability_score and a lower avg_word_length, indicating well-structured and reader-friendly content.

2. **Finished vs Unfinished Target Articles:**
   - Target articles of finished paths exhibit significantly higher values across various metrics, including word_count, stopword_percentage (with lower non_stopword_percentage), avg_word_length, and paragraph_count. They also have a significantly lower readability_score, suggesting that the targets of finished paths could be more challenging to comprehend.

3. **Finished Start vs. Target Articles:**
   - TODO

4. **Unfinished Start vs. Target Articles:**
   - TODO




<a id="ltt"></a>

#### 3.5 - Analysing the In-Degree of Targets in Finished vs Unfinished Paths

It is possible that certain paths are easier objectively because their targets have a larger "in-degree", i.e. the number of edges in the graph pointing to it. This would be intuitive: if there are more ways to get to the target, it should be easier to do so. This section explores whether this idea is reflected in the distributions of the in-degrees of the targets in finished and unfinished paths.
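
The in-degree described above can be sketched on a toy edge list as follows. The edges and targets are made up for illustration.

```python
from collections import Counter

# Toy edge list of (start, end) pairs, in the same shape as the `edges` table.
toy_edges = [("A", "C"), ("B", "C"), ("C", "D"), ("A", "D"), ("B", "A")]

# Count once how many edges point to each article.
in_degree = Counter(end for _, end in toy_edges)

# Look up the in-degree of each target; articles without incoming links get 0.
targets = ["C", "D", "B"]
links_to_target = [in_degree[t] for t in targets]
print(links_to_target)
```

Counting once and then looking up each target avoids filtering the whole edge list for every path. With `pandas`, a similar effect could be achieved with `value_counts()` on the `end` column followed by `map()` on the targets.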

In [None]:
# Counting how many links point to targets in finished and unfinished paths, known as the "in-degree".
finished_paths["links_to_target"] = finished_paths["path"].apply(lambda x: len(edges.loc[edges["end"] == x[-1]]))
unfinished_paths["links_to_target"] = unfinished_paths["target"].apply(lambda x: len(edges.loc[edges["end"] == x]))

We suspect that the in-degree may follow a power law. We check this below with a cumulative distribution on log-log axes.
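
As a small illustration of what such a check plots, the sketch below computes an empirical complementary cumulative distribution (CCDF), i.e. the fraction of observations greater than or equal to each value. The input values are made up, and `empirical_ccdf` is a hypothetical helper.

```python
def empirical_ccdf(values):
    """Return the sorted unique values and the fraction of observations >= each of them."""
    n = len(values)
    ordered = sorted(values)
    xs, ps = [], []
    for i, v in enumerate(ordered):
        # At the first occurrence of v, exactly n - i observations are >= v.
        if i == 0 or v != ordered[i - 1]:
            xs.append(v)
            ps.append((n - i) / n)
    return xs, ps

xs, ps = empirical_ccdf([1, 1, 2, 3, 3, 3, 10])
print(list(zip(xs, ps)))
```

Under a power law, the CCDF looks roughly like a straight line on log-log axes. A visual check of this kind is suggestive rather than a formal test.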

In [None]:
# Building the arrays for the cumulative distributions of in-degrees:
finished_indegree_cumulative=plt.hist(finished_paths.links_to_target,bins=100,log=True,cumulative=-1,histtype='step')
unfinished_indegree_cumulative=plt.hist(unfinished_paths.links_to_target,bins=100,log=True,cumulative=-1,histtype='step')
plt.close()

# Plotting the CCDF plots of the in-degrees for finished and unfinished paths:
plt.loglog(finished_indegree_cumulative[1][1:],finished_indegree_cumulative[0], label="Finished paths")
plt.loglog(unfinished_indegree_cumulative[1][1:],unfinished_indegree_cumulative[0], label="Unfinished paths")
plt.title('Histogram of In-degree (cumulative)')
plt.ylabel('# of targets (in log scale)')
plt.xlabel('In-degree (in log scale)')
plt.legend()
plt.show()


In [None]:
# Printing mean in-degree of the targets in the finished and unfinished paths.
print("The targets that were reached had an in-degree of {:.3f} on average.".format(finished_paths['links_to_target'].mean()))
print("The targets that were not reached had an in-degree of {:.3f} on average.".format(unfinished_paths['links_to_target'].mean()))

In [None]:
# Printing median in-degree of the targets in the finished and unfinished paths.
print("The targets that were reached had a median in-degree of {:.3f}.".format(finished_paths['links_to_target'].median()))
print("The targets that were not reached had a median in-degree of {:.3f}.".format(unfinished_paths['links_to_target'].median()))

In [None]:
# Conducting a t-test
t_test_article_metrics(["links_to_target"], finished_paths, unfinished_paths)

The t-test comparing the number of links pointing to the targets of finished and unfinished paths yields a p-value that rounds to 0.000. We therefore reject, at the 5% level of significance, the null hypothesis that the mean number of links pointing to the target is the same in both groups. This indicates that the in-degree of the target is significantly associated with whether a game is finished or not.
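
To make the test statistic concrete, the sketch below computes Welch's t statistic (two independent samples, unequal variances allowed) by hand on made-up numbers. This is an assumption for illustration only: the exact test variant used inside our helper `t_test_article_metrics` is not shown in this notebook.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic: difference in means divided by its standard error."""
    standard_error = sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Made-up in-degrees of targets, for illustration only.
toy_finished = [40, 55, 60, 35, 50]
toy_unfinished = [20, 25, 30, 15, 10]

t_stat = welch_t(toy_finished, toy_unfinished)
print(round(t_stat, 3))
```

A positive statistic means the first sample has the higher mean. With samples as large as ours, even small differences in means can produce very small p-values, so the size of the difference is worth reporting alongside the p-value.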

In [None]:
# Creating a boxplot of the trends

finished_links =  pd.DataFrame()
finished_links["links_to_target"] = finished_paths["links_to_target"]
finished_links["path_type"] = "Finished paths"

unfinished_links =  pd.DataFrame()
unfinished_links["links_to_target"] = unfinished_paths["links_to_target"]
unfinished_links["path_type"] = "Unfinished paths"

df_links = pd.concat([finished_links,unfinished_links])

ax = sns.boxplot(x="path_type", y="links_to_target", data=df_links)
plt.xlabel(" ")
plt.ylim([-5,155])
plt.ylabel("Number of links to target")

The boxplots above highlight these conclusions. The in-degree of targets in the finished paths is noticeably higher than in the unfinished paths.

<a id="shortest"></a>

#### 3.6 - Analysing Possible Shortest Path Distances in Finished vs Unfinished Paths

Another potential factor that may determine whether a game will be completed or not, in a more objective manner, is the shortest path length possible between the source and the target. This factor is also intuitive. If a shorter path exists in theory, the path length should also be shorter on average in practice, leading to simpler games. This section explores whether this idea is reflected in the distributions of the length of the shortest possible paths in finished and unfinished games.
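
For intuition about what a shortest path length is, the sketch below runs a breadth-first search on a toy link graph. The graph is made up, and in the notebook the distances are read from the precomputed distance matrix rather than recomputed like this.

```python
from collections import deque

def shortest_path_length(adjacency, source, target):
    """Number of clicks on a shortest path from source to target, or None if unreachable."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour == target:
                return dist + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

# Toy directed link graph: article -> articles it links to.
toy_links = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}

print(shortest_path_length(toy_links, "A", "E"))
print(shortest_path_length(toy_links, "E", "A"))
```

Because links are directed, the distance from one article to another can differ from the distance in the opposite direction, and some pairs may not be connected at all.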

In [None]:
# Retrieving the shortest possible paths for the finished games

finished_paths["shortest_path_length"] = finished_paths["path"].apply(
    lambda x: shortest_paths[articles.loc[articles['article'] == x[0]].index[0]][articles.loc[articles['article'] == x[-1]].index[0]]
    )


Important note: some target names in the unfinished paths contain typos.

E.g., at index 141 in the unfinished paths, the target is written as "Long_peper" when it should be "Long_pepper".

Overall, an issue of this kind arises 28 times in the unfinished paths, but it does not seem to occur in the finished paths. These data points are ignored for now.
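
Instead of ignoring these rows, the misspelled targets could be matched to their closest known article name. The snippet below is a sketch of that idea with `difflib` from the standard library. The helper name `suggest_article`, the cutoff of 0.85 and the toy article list are assumptions, and any suggested correction should be checked by hand before it is applied.

```python
from difflib import get_close_matches


def suggest_article(name, known_articles, cutoff=0.85):
    """Return the closest known article name, or None if nothing is similar enough."""
    matches = get_close_matches(name, known_articles, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Toy article list; in the notebook this could be articles["article"].tolist()
known = ["Long_pepper", "London", "Black_pepper"]
suggestion = suggest_article("Long_peper", known)
```

A high cutoff keeps the matching conservative, so targets that do not resemble any article are left unmatched rather than mapped to a wrong one.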

In [None]:
# Retrieving the shortest possible paths for the unfinished games

# TODO: wrap in a function since we are using it twice (see below)?

shortest_unfinished = []
not_found = 0
for i in range(len(unfinished_paths)):
    source = articles.loc[articles['article'] == unfinished_paths.iloc[i]["path"][0]]
    target = articles.loc[articles['article'] == unfinished_paths.iloc[i]["target"]]
    if len(source) != 0 and len(target) != 0:
        index_source = source.index[0]
        index_target = target.index[0]
        shortest_unfinished.append(int(shortest_paths[index_source][index_target]))
    else:
        shortest_unfinished.append(None)
        not_found+=1

unfinished_paths["shortest_path_length"] = shortest_unfinished
print(f"{not_found} shortest paths not found")

In [None]:
# Testing to confirm that there are no issues in the finished paths.

shortest_finished = []
not_found2 = 0
for i in range(len(finished_paths)):
    source = articles.loc[articles['article'] == finished_paths.iloc[i]["path"][0]]
    target = articles.loc[articles['article'] == finished_paths.iloc[i]["path"][-1]]
    if len(source) != 0 and len(target) != 0:
        index_source = source.index[0]
        index_target = target.index[0]
        shortest_finished.append(int(shortest_paths[index_source][index_target]))
    else:
        shortest_finished.append(None)
        not_found2+=1

print(f"{not_found2} shortest paths not found")
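
The two loops above differ only in where the target name comes from, so they could share one helper, as the TODO suggests. The following is a sketch under assumptions: the name `shortest_length_lookup` is hypothetical, article names are assumed to be unique, and `shortest_paths` is assumed to be indexable as `shortest_paths[i][j]`, as in the cells above.

```python
import pandas as pd


def shortest_length_lookup(sources, targets, articles, shortest_paths):
    """Return shortest-path lengths for (source, target) name pairs.

    Pairs whose source or target is not a known article yield None.
    """
    # Build the name -> row index mapping once instead of filtering per row
    name_to_index = dict(zip(articles["article"], articles.index))
    lengths = []
    for source, target in zip(sources, targets):
        i = name_to_index.get(source)
        j = name_to_index.get(target)
        if i is None or j is None:
            lengths.append(None)
        else:
            lengths.append(int(shortest_paths[i][j]))
    return lengths


# Toy data, invented for illustration only
toy_articles = pd.DataFrame({"article": ["A", "B", "Long_pepper"]})
toy_matrix = [[0, 1, 2], [1, 0, 1], [255, 1, 0]]
toy_lengths = shortest_length_lookup(
    ["A", "A"], ["Long_pepper", "Long_peper"], toy_articles, toy_matrix
)
```

For the unfinished games, the sources would be the first element of each `path` and the targets the `target` column. For the finished games, the targets would be the last element of each `path`.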

In [None]:
# Counting number of "impossible" paths

# TODO: this fails since shortest_path length has not been calculated yet - move back down?
print(f"There are {len(finished_paths[finished_paths['shortest_path_length'] == 255])} impossible finished paths.")
print(f"There are {len(unfinished_paths[unfinished_paths['shortest_path_length'] == 255])} impossible unfinished paths.")

# These will be ignored in the following analyses.

# TODO: let's add some more description here of what we are doing, since it is somewhat counterintuitive that there are impossible paths
# TODO: should we then actually exclude them by filtering the dataframe?
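
One way to exclude the impossible paths without repeating the `!= 255` filter in every cell is to flag them once and replace the sentinel with a missing value. This is a sketch on a toy dataframe. The column `impossible` and the constant `UNREACHABLE` are assumed names, and 255 is taken from the cell above as the value marking an impossible path.

```python
import pandas as pd

UNREACHABLE = 255  # value used in this notebook to mark an impossible path

# Toy data, invented for illustration only
toy = pd.DataFrame({"shortest_path_length": [2, 3, 255, 4]})

# Keep an explicit flag so impossible paths stay distinguishable from
# rows whose article names could not be matched (which are also missing)
toy["impossible"] = toy["shortest_path_length"] == UNREACHABLE
toy["shortest_path_length"] = toy["shortest_path_length"].mask(toy["impossible"])

n_impossible = int(toy["impossible"].sum())
mean_reachable = toy["shortest_path_length"].mean()  # missing values are skipped
```

With the sentinel replaced, `mean()` and `median()` skip these rows by default, so they no longer need to be filtered out by hand.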

We suspect that the shortest path length may follow a power law. We check this below:
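
Because shortest path lengths are integers, the complementary cumulative distribution (CCDF) can also be computed exactly for each distinct value rather than through histogram bins. A power law would show up as a roughly straight line when these points are drawn on log-log axes. The helper `empirical_ccdf` below is a hypothetical sketch on toy data.

```python
from collections import Counter


def empirical_ccdf(values):
    """Map each distinct value x to P(X >= x)."""
    counts = Counter(values)
    total = len(values)
    ccdf = {}
    remaining = total  # number of observations >= current x
    for x in sorted(counts):
        ccdf[x] = remaining / total
        remaining -= counts[x]
    return ccdf


# Toy shortest path lengths, invented for illustration only
toy_lengths = [1, 2, 2, 3]
ccdf = empirical_ccdf(toy_lengths)
```

The keys and values of the returned dictionary can be passed directly to `plt.loglog` as x and y coordinates.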

In [None]:
# Building the arrays for the cumulative distributions of the shortest path lengths:
finished_spl_cumulative=plt.hist(finished_paths[finished_paths['shortest_path_length'] != 255]['shortest_path_length'],bins=5,log=True,cumulative=-1,histtype='step')
unfinished_spl_cumulative=plt.hist(unfinished_paths[unfinished_paths['shortest_path_length'] != 255]['shortest_path_length'],bins=5,log=True,cumulative=-1,histtype='step')
plt.close()

# Plotting the CCDFs of the shortest path lengths for finished and unfinished paths:
plt.loglog(finished_spl_cumulative[1][1:],finished_spl_cumulative[0], label="Finished paths")
plt.loglog(unfinished_spl_cumulative[1][1:],unfinished_spl_cumulative[0], label="Unfinished paths")
plt.title('Histogram of shortest path length (cumulative)')
plt.ylabel('# of games (in log scale)')
plt.xlabel('Shortest path length (in log scale)')
plt.legend()
plt.show()


In [None]:
# Printing mean shortest possible paths in the finished and unfinished paths.

print("The shortest possible paths were {:.3f} long on average in the finished paths.".format(
    finished_paths[finished_paths['shortest_path_length'] != 255]['shortest_path_length'].mean()
    ))
print("The shortest possible paths were {:.3f} long on average in the unfinished paths.".format(
    unfinished_paths[unfinished_paths['shortest_path_length'] != 255]['shortest_path_length'].mean()
    ))


In [None]:
# Printing median shortest possible paths in the finished and unfinished paths.

print("The shortest possible paths had a median length of {:.3f} in the finished paths.".format(
    finished_paths[finished_paths['shortest_path_length'] != 255]['shortest_path_length'].median()
    ))
print("The shortest possible paths had a median length of {:.3f} in the unfinished paths.".format(
    unfinished_paths[unfinished_paths['shortest_path_length'] != 255]['shortest_path_length'].median()
    ))


In [None]:
# Doing a t test on the shortest path lengths
t_test_article_metrics(["shortest_path_length"], finished_paths, unfinished_paths)

The t-test comparing the shortest possible path lengths of finished and unfinished games reports a p-value of 0.0, i.e. a value too small to be distinguished from zero at the displayed precision. We therefore reject, at the 5% significance level, the null hypothesis that the mean shortest possible path length is the same across the two groups. The length of the shortest possible path is thus significantly associated with whether a game is completed.

In [None]:
# Creating a boxplot of the trends

finished_shortest =  pd.DataFrame()
finished_shortest["shortest_path_length"] = finished_paths[finished_paths['shortest_path_length'] != 255]["shortest_path_length"]
finished_shortest["path_type"] = "Finished paths"

unfinished_shortest =  pd.DataFrame()
unfinished_shortest["shortest_path_length"] = unfinished_paths[unfinished_paths['shortest_path_length'] != 255]["shortest_path_length"]
unfinished_shortest["path_type"] = "Unfinished paths"

df_shortest = pd.concat([finished_shortest,unfinished_shortest]).reset_index(drop=True) # reset index to avoid duplicated index error from seaborn boxplot

ax = sns.boxplot(x="path_type", y="shortest_path_length", data=df_shortest)
plt.xlabel(" ")
plt.ylabel("Shortest path possible from source to target")

# TODO: Better alternative to boxplot? Currently cannot discern the 25/50/75 percentiles...

The boxplot of these results provides additional context: unfinished paths tend to have longer shortest possible paths. Intuitively, it makes sense that this would be the case.

This is an interesting situation. The past two analyses show that the targets in the unfinished paths are more difficult to reach, due to their lower in-degree and the longer shortest possible paths leading to them.

A challenge for us will be to determine whether the difference between finished and unfinished paths can be fully explained by objective factors like these, or whether there is also a human component that we can isolate once we control for them. E.g., are some categories actually more difficult to reach, or do the differences in the target category distributions between finished and unfinished paths arise because some categories tend to have longer shortest possible paths leading to them or fewer links pointing at them?

<a id="checkpoint1"></a>

## Checkpoint for dataframe

In [None]:
# The below code saves the version of the dataframe above:
finished_paths.to_pickle(os.path.join(GENERATED_METRICS, "finished_paths_initial_stats.pkl"))
unfinished_paths.to_pickle(os.path.join(GENERATED_METRICS, "unfinished_paths_initial_stats.pkl"))

In [None]:
# The below code reads that version of the dataframe from the file:
finished_paths = pd.read_pickle(os.path.join(GENERATED_METRICS, "finished_paths_initial_stats.pkl"))
unfinished_paths = pd.read_pickle(os.path.join(GENERATED_METRICS, "unfinished_paths_initial_stats.pkl"))

<a id="everything"></a>

## 4 - Putting Everything Together
### Building a Logistic Regression to determine influencing factors on the propensity of a player to give up a game (restart or timeout)
We combine all the factors explored so far into a regression model that predicts whether a player gives up, and then interpret its coefficients.

In [None]:
### we first merge the article metrics (categories, word count etc.) to the finished and unfinished paths to create a dataset

# merge finished and unfinished paths while adding a flag
finished_paths["give_up"] = 0
unfinished_paths["give_up"] = 1
games = pd.concat((finished_paths, unfinished_paths), axis=0)

# drop duplicates in the article column, since one article might belong to more than one main category, some rows are duplicated. For now, we just drop these
article_metrics_with_categories = article_metrics_with_categories.drop_duplicates(subset="article")

# define columns that may be relevant from article metrics:
# some columns are excluded since they are contained in others, or because they are complements (stopword vs non-stopword percentage)
keep = ['article',
        'broad_category',
        'paragraph_count',
        'readability_score',
        "stopword_percentage",
        'avg_word_length',
        'avg_sent_length',
        ]

# merge on start
start_metrics = article_metrics_with_categories[keep].add_prefix("start_")
games = pd.merge(games, start_metrics, how="left", left_on="start", right_on="start_article") # add prefix
print(games.shape) # check results of the merge

# merge on target
target_metrics = article_metrics_with_categories[keep].add_prefix("target_")
games = pd.merge(games, target_metrics, how="left", left_on="target", right_on="target_article") # add prefix
print(games.shape) # check results of the merge

# subset games to only include those with reasonable path lengths - TBD, no filtering for now
# games = games[(games.path_length > 1) & (games.path_length < 30)]


# remove unnecessary columns
# TODO: add backclicks to the removal columns once they are in (since we do not know if there are going to be any backclicks before the game starts?)
to_drop = ['hashedIpAddress', 'timestamp', 'durationInSec', 'path', 'rating',
       'datetime', 'start', 'target', 'path_length', "target_article", "start_article", "type",]
games = games.drop(to_drop, axis=1)
print(games.shape) # check results of the subsetting


In [None]:
percent_missing = games.isnull().sum() * 100 / len(games)
percent_missing

In [None]:
# drop all NAs - for the actual analysis we need to investigate further where these are coming from.
# for this proof of concept we just remove them
games = games.dropna(axis=0, how="any")
games

In [None]:
# create formula for logistic regression
target = "give_up"
predictors = [col for col in games.columns if col != target]
formula = target + " ~ " + " + ".join(predictors)
formula

In [None]:
import statsmodels.formula.api as smf

# logistic regression model with the full formula (i.e. all relevant predictors)
mod = smf.logit(formula=formula, data=games)
res = mod.fit(maxiter=30)
print(res.summary())

In [None]:
import statsmodels.formula.api as smf

# regression model with limited predictors: only those visible to a player at the start of the game
formula = 'give_up ~ links_to_target + shortest_path_length + start_broad_category + start_paragraph_count + start_readability_score + start_stopword_percentage + start_avg_sent_length + C(target_broad_category)'
mod = smf.logit(formula=formula, data=games)
res = mod.fit(maxiter=30)
print(res.summary())


The two regression models are largely congruent and offer some interesting initial findings (non-exhaustive):
- Some categories have a large and statistically significant influence on the probability of a player giving up. A few examples:
    - Paths starting from *Language and Literature* or from more niche topics like *Design and Technology* increase the propensity to give up.
    - A target article in the categories *Geography* or *Countries* strongly decreases the probability of giving up. This is consistent with the hypothesis that these are rather "easy" categories, as many links point to them.
- Similarly, many of the article-related metrics are statistically significant; for instance:
    - The shortest_path_length has a large positive coefficient, indicating that objectively longer paths do lead to more failures.
    - More detailed article metrics are statistically significant, but their effect sizes are quite small (e.g., in-degree of target, readability score etc.).
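
Logit coefficients are expressed in log-odds, which makes their size hard to read directly. Exponentiating them gives odds ratios: a value above 1 increases the odds of giving up, and a value below 1 decreases them. The sketch below uses hypothetical coefficients that are not taken from the fitted models, and the helper name `odds_ratios` is an assumption.

```python
from math import exp


def odds_ratios(coefficients):
    """Convert logit coefficients (log-odds) into odds ratios."""
    return {name: exp(beta) for name, beta in coefficients.items()}


# Hypothetical coefficients, NOT taken from the fitted models above
toy_coefficients = {
    "shortest_path_length": 0.7,
    "links_to_target": -0.01,
    "no_effect": 0.0,
}
toy_odds = odds_ratios(toy_coefficients)
```

For the fitted statsmodels result, the same conversion can be obtained with `np.exp(res.params)`.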