# Keyword Analysis of CHI Best Papers 2016-2020

In this analysis, we will collect and chart the author keywords used for the Best Paper winners at the Conference on Human Factors in Computing Systems (CHI) from 2016 to 2020.
The aim of this research is to identify commonalities among the top 1% of articles honoured by the Best Paper Committee and to explore significant trends among them.

## Reading in the data

First, we import the necessary packages and read in the keyword data from `best_papers.csv`, a semicolon-delimited file whose `keywords` column holds a comma-separated list per paper.

In [38]:
# Imports
import pandas as pd

# Utility Functions
def string_to_list(list_as_string: str, delimiter=", ") -> list[str]:
    """A simple function that splits a list input as a string by the specified delimiter and returns the list."""
    return list_as_string.split(delimiter)

def print_dataframe_summary(dataframe: pd.DataFrame, include_description=False, head_nrows=10) -> None:
    """Uses pandas' info, head, and---depending on the description flag---describe methods to output information about a DataFrame."""
    print("Information about dataframe:")
    print(dataframe.info())
    print()
    print(f"Top {head_nrows} rows of dataframe:")
    print(dataframe.head(n=head_nrows))
    if include_description:
        print()
        print("Description of dataframe:")
        print(dataframe.describe())

In [39]:
best_papers = pd.read_csv("best_papers.csv", delimiter=";", converters={"keywords": string_to_list})

print_dataframe_summary(best_papers)

Information about dataframe:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   year      85 non-null     int64 
 1   keywords  85 non-null     object
 2   doi       85 non-null     object
dtypes: int64(1), object(2)
memory usage: 2.1+ KB
None

Top 10 rows of dataframe:
   year                                           keywords  \
0  2020  [deception, influence, behavior change, cybers...   
1  2020  [Visualization, Design Studies, Service-Learni...   
2  2020  [VR, trigeminal, smell, thermal, illusion, hap...   
3  2020  [Visualization, Responsive Design, News, Mobil...   
4  2020  [Hardware device realization, low volume elect...   
5  2020  [Game-based learning, game design, computation...   
6  2020  [AR/VR authoring, augmented reality, virtual r...   
7  2020  [Mobile Video Calls, Distributed Families, Fac...   
8  2020  [Transgender, non-binary, part
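
To illustrate how the `converters` argument turns the keyword string into a list while reading, here is a minimal sketch on an in-memory sample. The two rows and their DOIs are hypothetical placeholders in the same layout as `best_papers.csv`; they are not taken from the real data.

```python
import io

import pandas as pd


def string_to_list(list_as_string: str, delimiter: str = ", ") -> list[str]:
    """Split a delimited string into a list of strings."""
    return list_as_string.split(delimiter)


# Hypothetical sample; the DOIs are placeholders, not real papers
sample_csv = io.StringIO(
    "year;keywords;doi\n"
    "2020;deception, influence, behavior change;https://doi.org/10.0000/example.1\n"
    "2019;VR, haptics;https://doi.org/10.0000/example.2\n"
)

# The converter is applied to every raw cell of the "keywords" column
sample = pd.read_csv(sample_csv, delimiter=";", converters={"keywords": string_to_list})
print(sample["keywords"].tolist())
```

Each cell of `keywords` is then a Python list, which is why the column is reported with dtype `object`.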

## Normalisation

Next, since we are interested in individual keywords and their co-occurrences, we normalise the DataFrame to have one keyword per row.

In [47]:
# Explode (have one row for every keyword)
keywords = best_papers.explode("keywords").rename(columns={"keywords": "original_keyword"})

print_dataframe_summary(keywords)

Information about dataframe:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 434 entries, 0 to 84
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   year              434 non-null    int64 
 1   original_keyword  434 non-null    object
 2   doi               434 non-null    object
dtypes: int64(1), object(2)
memory usage: 13.6+ KB
None

Top 10 rows of dataframe:
   year    original_keyword                                      doi
0  2020           deception  https://doi.org/10.1145/3313831.3376832
0  2020           influence  https://doi.org/10.1145/3313831.3376832
0  2020     behavior change  https://doi.org/10.1145/3313831.3376832
0  2020       cybersecurity  https://doi.org/10.1145/3313831.3376832
0  2020         interaction  https://doi.org/10.1145/3313831.3376832
1  2020       Visualization  https://doi.org/10.1145/3313831.3376829
1  2020      Design Studies  https://doi.org/10.1145/3313831.3376829
1  
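
The effect of `explode` on the index can be seen in a small sketch. The frame below is a hypothetical toy example (the names `toy_papers` and `toy_keywords` and the `doi-a`/`doi-b` values are made up for illustration): every list element gets its own row, and the original row label is repeated until the index is reset.

```python
import pandas as pd

# Hypothetical toy frame with one list of keywords per paper
toy_papers = pd.DataFrame(
    {
        "year": [2020, 2019],
        "keywords": [["deception", "influence"], ["VR", "haptics", "smell"]],
        "doi": ["doi-a", "doi-b"],
    }
)

# One row per keyword; the index still refers to the originating paper
exploded = toy_papers.explode("keywords")
print(exploded.index.tolist())

# Rename the column and give every keyword row a unique index
toy_keywords = exploded.rename(columns={"keywords": "original_keyword"}).reset_index(drop=True)
print(toy_keywords)
```

The repeated index explains why the summary above reports 434 entries with labels running from 0 to 84.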

### Pre-processing

There are several possible pitfalls with the raw author keywords that have to be addressed.
Therefore, we will pre-process the keywords using the following transformations:

- Removing any possible leading and trailing whitespace
- Transforming all keywords to their lowercase form
- Replacing abbreviations which might be used in either short- or long-form (e.g. AI) with their long-form (e.g. artificial intelligence)
- Replacing all dialect spellings (e.g. British English) with their American English forms (CHI allows any consistent dialect but American English is most common)

In [75]:
# Remove possible leading and trailing whitespace
keywords["original_keyword"] = keywords["original_keyword"].str.strip()
keywords = keywords.reset_index(drop=True)

# Copy original keyword into keyword column
keywords["keyword"] = keywords["original_keyword"]

# Transform to lowercase
keywords["keyword"] = keywords["keyword"].str.lower()

# Replace abbreviations
abbreviations = {
    "nhst": "null hypothesis significance testing",
    "em": "electromagnetic",
    "ictd": "information communication technologies for development",
    "hci4d": "human-computer interaction for development",
    "cci": "child-computer interaction",
    "gis": "geographic information systems",
    "esm": "experience sampling method",
    "ema": "ecological momentary assessment",
    "vru": "vulnerable road user",
    "ehmi": "external human-machine-interface",
    "ai": "artificial intelligence",
    "ml": "machine learning",
    "aac": "alternative and augmentative communication",
    "nicu": "neonatal intensive care unit",
    "hci": "human-computer interaction",
    "ar": "augmented reality",
    "vr": "virtual reality"
}
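# Match an abbreviation only at the start of the keyword, after a slash, or after whitespace ...
regex_leading = r"(^|(?<=\/)|(?<=\s))"
# ... and only if it is followed by the end of the keyword, a slash, a comma, or whitespace
regex_trailing = r"(?=$|\/|,|\s)"
regex_abbreviations = {f"{regex_leading}{key}{regex_trailing}": value.lower() for key, value in abbreviations.items()}
keywords["keyword"] = keywords["keyword"].replace(to_replace=regex_abbreviations, regex=True)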

# Replace dialect spellings
dialect_spellings = {
    "customisation": "customization",
}
keywords["keyword"] = keywords["keyword"].replace(to_replace=dialect_spellings)

print_dataframe_summary(keywords)

Information about dataframe:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 434 entries, 0 to 433
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   year              434 non-null    int64 
 1   original_keyword  434 non-null    object
 2   doi               434 non-null    object
 3   keyword           434 non-null    object
dtypes: int64(1), object(3)
memory usage: 13.7+ KB
None

Top 10 rows of dataframe:
   year    original_keyword                                      doi  \
0  2020           deception  https://doi.org/10.1145/3313831.3376832   
1  2020           influence  https://doi.org/10.1145/3313831.3376832   
2  2020     behavior change  https://doi.org/10.1145/3313831.3376832   
3  2020       cybersecurity  https://doi.org/10.1145/3313831.3376832   
4  2020         interaction  https://doi.org/10.1145/3313831.3376832   
5  2020       Visualization  https://doi.org/10.1145/3313831.3376829   
6  2
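
As a sanity check on the pre-processing rules, the sketch below applies them to a handful of hypothetical toy keywords (none are taken from `best_papers.csv`). It assumes that an abbreviation should only be expanded when it stands alone, that is, at the start of the keyword or after a slash or whitespace, and followed by the end of the keyword, a slash, a comma, or whitespace. It uses a loop over `str.replace` instead of the dictionary form of `replace`; the patterns are the same idea, not a copy of the cell above.

```python
import pandas as pd

# Hypothetical toy keywords, chosen to exercise the boundary cases
toy = pd.Series(
    ["  AR/VR authoring ", "Art", "AI", "explainable ai, ethics", "Customisation"],
    dtype=object,
)

abbreviations = {
    "ai": "artificial intelligence",
    "ar": "augmented reality",
    "vr": "virtual reality",
}

# Abbreviation must start the keyword or follow "/" or whitespace ...
leading = r"(?:^|(?<=/)|(?<=\s))"
# ... and must be followed by the end of the keyword, "/", "," or whitespace
trailing = r"(?=$|/|,|\s)"

# Strip whitespace and lowercase before expanding abbreviations
cleaned = toy.str.strip().str.lower()
for short_form, long_form in abbreviations.items():
    cleaned = cleaned.str.replace(f"{leading}{short_form}{trailing}", long_form, regex=True)

# Dialect spellings are replaced only when the whole keyword matches
cleaned = cleaned.replace({"customisation": "customization"})
print(cleaned.tolist())
```

With these boundaries, "art" is left untouched even though it starts with "ar", while both halves of "ar/vr authoring" are expanded.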

## To Do

- splitting combined expressions (e.g. AR/VR)?
- check correlation with CCS concepts?
- number of author keywords in best papers?
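
Two of the open items above could be approached roughly as follows. This is only a sketch on hypothetical toy data (`toy`, `doi-a`, and `doi-b` are made up), and it assumes that a slash always separates two independent keywords, which will not hold for every author keyword.

```python
import pandas as pd

# Hypothetical toy data: one row per (pre-processed) author keyword
toy = pd.DataFrame(
    {
        "doi": ["doi-a", "doi-a", "doi-b"],
        "keyword": ["ar/vr", "haptics", "deception"],
    }
)

# Number of author keywords per paper, counted before any splitting
keywords_per_paper = toy.groupby("doi")["keyword"].count()
print(keywords_per_paper)

# Naive split of combined expressions on "/", then one row per part
split_keywords = (
    toy.assign(keyword=toy["keyword"].map(lambda kw: kw.split("/")))
    .explode("keyword")
    .reset_index(drop=True)
)
print(split_keywords)
```

A plain split like this would turn "ar/vr authoring" into "ar" and "vr authoring", so combined expressions with a shared head word would need extra handling.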