# Analysis of user collection

This notebook analyzes the results retrieved by the user collection methods.

In [None]:
import glob
import time
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Configure modules like pandas and plotting libraries.

In [None]:
plt.rcParams['figure.figsize'] = [12, 8]
pd.set_option('display.min_rows', 24)

Folders and constants

In [None]:
fp_figs = Path("figs")
fp_figs.mkdir(parents=True, exist_ok=True)  # plt.savefig fails if the folder is missing

annotated_users = Path("results", "users_enriched_summer2021.xlsx")

## Dataset
In this section, output data of the various methods is collected and combined. 

In [None]:
data_files = glob.glob("methods/*/results/*.csv")

for i, d in enumerate(data_files):
    print(i+1, d)

All files are loaded and combined into a single dataset. The dataset contains the following variables:
- `source` The method used to find the user, e.g. paperswithcode, github_search_users, github_search_topics. The name is derived from the file name of the results file.
- `service` The service on which the user profile was found, e.g. github.com, github.warwick.ac.uk, gitlab.
- `date` The date of collection. This can be useful when updating results.
- `user_id` The user handle of the found user.
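
The way the `source` column comes about can be sketched on in-memory data. The file paths, method names and user handles below are hypothetical and only serve as an illustration; the real notebook reads the frames from CSV files instead.

```python
from pathlib import Path

import pandas as pd

# Hypothetical result files and contents, for illustration only.
toy_results = {
    "methods/method_a/results/method_a.csv": pd.DataFrame(
        {"service": ["github.com"], "date": ["2021-06-01"], "user_id": ["alice"]}
    ),
    "methods/method_b/results/method_b.csv": pd.DataFrame(
        {
            "service": ["github.com", "github.com"],
            "date": ["2021-06-02", "2021-06-02"],
            "user_id": ["alice", "bob"],
        }
    ),
}

# keys= labels every row with the file it came from; names= names the index levels.
toy_users = (
    pd.concat(
        list(toy_results.values()),
        axis=0,
        keys=list(toy_results.keys()),
        names=["source", "row"],
    )
    .reset_index("source")   # move the file path into a regular column
    .reset_index(drop=True)  # discard the per-file row counter
)

# Reduce the file path to the file name without its extension.
toy_users["source"] = toy_users["source"].apply(lambda x: Path(x).stem)
print(toy_users)
```

Each row keeps the label of the file it was read from, so a handle found by two methods appears twice with different `source` values.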

In [None]:
df_user_names = pd.concat(
    [pd.read_csv(fp) for fp in data_files], 
    axis=0, 
    keys=data_files, 
    names=["source", "row"]
) \
    .reset_index("source") \
    .reset_index(drop=True)

df_user_names["source"] = df_user_names["source"].apply(lambda x: Path(x).stem)
df_user_names

## Findability

The following results show how many times each user handle is found by each retrieval method. A user who is found by multiple methods can be considered more findable.

### Cross table user and method

Create a cross table with service and user on one axis and the extraction method on the other axis.

In [None]:
df_name_method_crosstab = df_user_names \
    .drop("date", axis=1) \
    .groupby(["service", "user_id", "source"]) \
    .size() \
    .unstack("source") \
    .fillna(0) \
    .astype(int)

df_name_method_crosstab
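
As a cross-check, a table of the same shape can be built with `pd.crosstab`. This is a minimal sketch on toy data with hypothetical handles and method names, not the notebook's own pipeline.

```python
import pandas as pd

# Toy data with hypothetical handles and method names.
toy_users = pd.DataFrame(
    {
        "service": ["github.com", "github.com", "github.com", "gitlab"],
        "user_id": ["alice", "alice", "bob", "carol"],
        "source": ["method_a", "method_b", "method_b", "method_a"],
    }
)

# Rows: (service, user_id); columns: source; values: number of hits.
toy_crosstab = pd.crosstab(
    index=[toy_users["service"], toy_users["user_id"]],
    columns=toy_users["source"],
)
print(toy_crosstab)
```

`pd.crosstab` fills missing combinations with zero and returns integer counts, so the `fillna` and integer conversion steps are not needed in this variant.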

The following example can be uncommented to look up the findability of a specific user.

In [None]:
# df_name_method_crosstab.loc[('github.com', 'asreview')]

### Findability per user

The following table shows which users are found by the most collection strategies. The count is the number of methods that found the user, and the relative score is that count divided by the total number of collection strategies.

In [None]:
# compute the count
df_name_findability = (df_name_method_crosstab > 0) \
    .astype(int) \
    .sum(axis=1) \
    .sort_values(ascending=False) \
    .to_frame(name="count")

# compute relative score
df_name_findability["relative"] = df_name_findability["count"] / len(data_files)

df_name_findability
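
The count and relative score can be verified by hand on a small example. The sketch below uses a toy cross table with hypothetical users and methods, and it assumes that the denominator equals the number of method columns.

```python
import pandas as pd

# Toy cross table: rows are users, columns are hypothetical methods.
toy_crosstab = pd.DataFrame(
    {"method_a": [1, 0, 2], "method_b": [1, 1, 0], "method_c": [1, 0, 0]},
    index=["alice", "bob", "carol"],
)

# A user counts as found by a method if it occurs at least once,
# so two hits from the same method still count as one.
toy_findability = (
    (toy_crosstab > 0)
    .sum(axis=1)
    .sort_values(ascending=False)
    .to_frame(name="count")
)

# Assumption: the denominator is the number of method columns.
toy_findability["relative"] = toy_findability["count"] / toy_crosstab.shape[1]
print(toy_findability)
```

The notebook divides by `len(data_files)` instead, which gives the same value as long as every results file ends up as its own column in the cross table.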


In [None]:
retrieval_count = df_name_findability["count"].value_counts().to_frame(name="Retrieval count")

sns.barplot(
    x=retrieval_count.index.astype(str), 
    y=retrieval_count["Retrieval count"]
)
plt.title("Retrieval count for each user in dataset")
plt.savefig(Path(fp_figs, 'user_collection_user_findability.png'))

### Findability per method

This table indicates how successful each method is at collecting users.

In [None]:
df_method_findability = (df_name_method_crosstab > 0) \
    .astype(int) \
    .sum(axis=0) \
    .sort_values(ascending=False) \
    .to_frame(name="count")

df_method_findability

In [None]:
sns.barplot(
    x=df_method_findability.index, 
    y=df_method_findability["count"]
)
plt.title("Retrieval count for each method in dataset")
plt.savefig(Path(fp_figs, 'user_collection_method_findability.png'))

## Filtering of users

Not all users collected in the previous steps are relevant to the analysis. Some are not (or no longer) part of the organisation, and others are excluded for other reasons (for example, students).

In [None]:
df_annotated_users = pd.read_excel(annotated_users)

df_annotated_users[
    ["user_id", 
     "is_student", 
     "is_employee", 
     "is_currently_employed", 
     "is_research_group", 
     "final_decision"
    ]
]

In [None]:
df_annotated_users_included = df_annotated_users[df_annotated_users["final_decision"] == 1]

print("The number of included users is", len(df_annotated_users_included))
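
The comparison `final_decision == 1` also drops rows where the decision is missing. The following sketch illustrates this on toy annotations; the handles and values are hypothetical, and whether the real spreadsheet contains undecided rows is an assumption.

```python
import numpy as np
import pandas as pd

# Toy annotations; handles and values are hypothetical.
toy_annotated = pd.DataFrame(
    {
        "user_id": ["alice", "bob", "carol", "dave"],
        "is_student": [0, 1, 0, 0],
        "final_decision": [1, 0, 1, np.nan],
    }
)

# Comparing with 1 is False for 0 and for missing values alike.
toy_included = toy_annotated[toy_annotated["final_decision"] == 1]

# Rows without a decision, which may be worth checking by hand.
toy_undecided = toy_annotated[toy_annotated["final_decision"].isna()]

print(len(toy_included), "included,", len(toy_undecided), "undecided")
```

Listing the undecided rows separately makes it easier to tell an explicit exclusion from an annotation that was never filled in.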

## Properties of users

In [None]:
# the frame is already restricted to final_decision == 1
df_annotated_users_included["is_research_group"] \
    .fillna(0) \
    .value_counts()

In [None]:
# create a boxplot with swarm
ax = sns.boxplot(x='public_repos', data=df_annotated_users_included)
ax = sns.swarmplot(x='public_repos', data=df_annotated_users_included, color=".25")

# output users with most public repos
df_annotated_users_included[['user_id', 'public_repos']].sort_values('public_repos', ascending=False).head(10)

In [None]:
sns.boxplot(x='public_gists', data=df_annotated_users_included)
df_annotated_users_included[['user_id', 'public_gists']].sort_values('public_gists', ascending=False).head(10)

In [None]:
ax = sns.boxplot(x='followers', data=df_annotated_users_included, showfliers=False)

df_annotated_users_included[['user_id', 'followers']].sort_values('followers', ascending=False).head(10)

In [None]:
sns.boxplot(x='following', data=df_annotated_users_included, showfliers=False)
df_annotated_users_included[['user_id', 'following']].sort_values('following', ascending=False).head(10)
