# Analyzing ICWSM Ethics Statements
### Author: Campbell Lund
### 6/22/2023
This notebook contains the preprocessing and NLP analysis of the ethics statements from papers published at ICWSM in 2022 and 2023.

### Table of contents:
1. [Preprocessing](#sec1)
2. [Word Frequency](#sec2)
3. [Sentence Analysis](#sec3)

## 1. Preprocessing <a name="sec1"></a>

In [680]:
import pandas as pd

In [681]:
# load data
df22 = pd.read_csv('data/ICWSM2022EthicsStatements.csv')
df23 = pd.read_csv('data/ICWSM2023EthicsStatements.csv')

In [682]:
len(df23)

108

In [683]:
df22_length = len(df22)
df23_length = len(df23)

# drop rows that contain NaN values (dropna returns a new DataFrame by default)
df22 = df22.dropna()
df23 = df23.dropna()

# calculate the number of papers without ethics statements
num_without_statements22 = df22_length - len(df22)
num_without_statements23 = df23_length - len(df23)

print('Number of papers without ethics statements for 2022: ',num_without_statements22)
print('Number of papers without ethics statements for 2023: ',num_without_statements23)

# TODO: keep the papers without statements in a separate df; dropna only returns the rows that remain, so the dropped rows have to be selected explicitly

Number of papers without ethics statements for 2022:  80
Number of papers without ethics statements for 2023:  11
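
The note at the end of the cell above asks how to keep the papers that lack a statement. A minimal sketch, assuming a missing statement shows up as NaN in the `Ethics` column: select those rows with a boolean mask instead of discarding them. The toy rows below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the real CSV; titles and text are invented for illustration.
toy = pd.DataFrame({
    'Paper Title': ['Paper A', 'Paper B', 'Paper C'],
    'Ethics': ['We only use public data.', None, 'IRB approval was obtained.'],
})

# Boolean mask marking the rows whose ethics statement is missing
missing = toy['Ethics'].isna()

without_statements = toy[missing].copy()   # papers with no statement
with_statements = toy[~missing].copy()     # papers that have one

print(len(without_statements), len(with_statements))
```

Note that this mask looks only at `Ethics`, whereas the `dropna` call in the cell above drops a row if any column is missing, so the two counts could differ if other columns contain NaN values.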


In [684]:
len(df23)

97

In [685]:
# add a column to the df for ethics statement word count and assign a unique paper ID
df22['Word Count'] = 0
df23['Word Count'] = 0
df22['ID'] = 0
df23['ID'] = 0
counter = 0

for i, row in df22.iterrows():
    text = df22.loc[i,'Ethics']
    words = text.split()
    df22.loc[i,'Word Count'] = len(words)
    df22.loc[i,'ID'] = counter
    counter += 1

counter = 0

for i, row in df23.iterrows():
    text = df23.loc[i,'Ethics']
    words = text.split()
    df23.loc[i,'Word Count'] = len(words)
    df23.loc[i,'ID'] = counter
    counter += 1
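
The two loops above could probably be replaced by vectorized pandas calls. The following is a sketch of that alternative on a made-up two-row frame, not the code the notebook ran: `str.split()` with no arguments splits on whitespace like `text.split()`, and `range(len(df))` gives the same 0-based running ID as the counter.

```python
import pandas as pd

# Toy frame with a non-contiguous index, like the one dropna leaves behind.
toy = pd.DataFrame(
    {'Ethics': ['We only use public data.',
                'No personal data was collected or stored.']},
    index=[3, 7],
)

# Word count: split each statement on whitespace and count the pieces
toy['Word Count'] = toy['Ethics'].str.split().str.len()

# Sequential paper ID that ignores the gaps in the index
toy['ID'] = range(len(toy))

print(toy[['Word Count', 'ID']])
```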


In [686]:
df23

Unnamed: 0,Paper Title,Authors,Section Title,Ethics,Word Count,ID
0,How Do US Congress Members Advertise Climate Change: An Analysis of Ads Run on Meta’s Platforms,"Laurenz Aisenpreis, Gustav Gyrst, Vedran Sekara",Ethical Statement,"The data in this paper is derived from the Meta Ad Library. It contains publicly accessible ads run on Meta platforms by US politicians. Working with social media data carries risks of privacy issues and the right to be forgotten. However, our data analysis is limited to aggregated data presentations and only concerns ads published by public figures.",58,0
1,The Pursuit of Peer Support for Opioid Use Recovery on Reddit,"Duilio Balsamo, Paolo Bajardi, Gianmarco De Francisci Morales, Corrado Monti, Rossano Schifanella",Ethical Statement,"This work follows the guidelines and the ethical considerations by Eysenbach and Till (2001); Moreno et al. (2013); Ramırez-Cifuentes et al. (2020). All the results provide aggregated estimates and do not include any information on individuals. The users in our study were fully aware of the public nature and free accessibility of the content they posted since the subreddits are of public domain, are not password-protected, and have thousands of active subscribers. Reddit’s pseudonymous accounts make the retrieval of the true identity of users unlikely. Nevertheless, as a further privacy measure, the authors’ names were anonymized before using the data for analysis. Therefore, our research did not require informed consent.",110,1
2,Exposure to Marginally Abusive Content on Twitter,"Jack Bandy, Tomo Lazovich",Ethics Statement,"This study intersects with a number of topics related to the ethics of algorithmic platforms. For example, our analysis requires collecting data about which Tweets users view while using Twitter, and also requires human annotators to review and rate potentially abusive content. Overall, we agree with researchers who view this type of work as necessary for understanding and protecting democratic discourse (Fiske 2022), especially in terms of standard risk-benefit frameworks. Still, it is important to note different measures taken to address potential risks. While the holdback experiment is necessary, it is not ideal for many users to be excluded from algorithmic timeline features. Twitter has thus worked to provide more users access to algorithmic timelines while maintaining statistical robustness in the holdback experiment. As of 2020, the experiment included over 2 million active accounts in the reverse-chronological timeline group (Husz ́ar et al. 2022), but when this analysis was conducted in 2021, that number had been reduced to 630k. This study was not subject to an academic IRB process, however, it went through standard legal and privacy review processes at Twitter. Finally, the data used in this paper was fully anonymized for publication, following standard ethical procedures. We do not include any results that might disclose the identity of any account in the datasets.",214,2
3,Finding Qs: Profiling QAnon Supporters on Parler,"Dominik Bär, Nicolas Pröllochs, Stefan Feuerriegel",Ethics Statement,"This research did not involve interventions with human subjects, and, thus, no approval from the Institutional Review Board was required by the author institutions. All analyses are based on publicly available data and we do not make any attempt to track users across different platforms. We neither de-anonymize nor de-identify their accounts. Furthermore all analyses conform with national laws. To respect privacy, we explicitly do not publish usernames in our paper (except for celebrity profiles) and only report aggregate results.",80,3
4,Predicting Future Location Categories of Users in a Large Social Platform,"Raiyan Abdul Baten, Yozen Liu, Heinrich Peters, Francesco Barbieri, Neil Shah, Leonardo Neves, Maarten W. Bos",Ethics Statement,"Any experiment dealing with data as sensitive as ours (e.g.,location) needs to operate ethically and securely. Our approach actively aims to minimize risks of misuse and intrusion by avoiding user-identifiable data, such as demographic identities and spatial coordinates. Thus, our model may be preferable in highly sensitive settings. The datasets were anonymized before analysis. All experiments were conducted in Snapchat’s internal secure storage systems, and no data was stored outside Snapchat’s ecosystem. Thus, we do not foresee strong ethical concerns induced by our work.",84,4
...,...,...,...,...,...,...
101,Auditing Elon Musk’s Impact on Hate Speech and Bots,"Daniel Hickey, Matheus Schmitz, Daniel Fessler, Paul E. Smaldino, Goran Muric, Keith Burghardt\r",Ethics Statement,"All data were collected from the public Twitter API; identifiable information was removed prior to analysis, minimizing risks to Twitter users. Our work provides several potential benefits for society, including an audit of the steps ostensibly being taken to combat harm on Twitter, and a new way to detect hate speech at scale using commercial APIs as well as a curated list of hate words. Perspective API, which we use to classify hate, is run by Alphabet, a competitor to Twitter, but we believe this does not affect our results.",90,92
102,The Amplification Paradox in Recommender Systems,"Manoel Horta Ribeiro, Veniamin Veselovsky, Robert West",Ethical Considerations,"We do not foresee a negative societal impact coming from this research, which, on the contrary, may help improve algorithmic audits of recommender systems like YouTube.",26,93
103,Host-Centric Social Connectedness of Migrants in Europe on Facebook,"Aparup Khatua, Emilio Zagheni, Ingmar Weber",Ethical Considerations,"Anonymous and aggregate data were obtained through Facebook’s Marketing API. Given the minimum group size of 1000, any individual re-identification risk is minimal. However, there is a risk of group-level harm by mapping vulnerable populations, such as those of a particular faith. To mitigate this risk, Facebook removed targeting attributes related to religion and other sensitive attributes, including the one used in this study5. Note, however, that our study does not target Muslim migrants themselves but natives’ non-Muslimin the respective countries, limiting the potential group harm. Still, the removal of the targeting attribute of“friends of people who have engaged with Ramadan” limits the reproducibility. Given the sensitivity of the topic, we commit to sharing our data with other researchers upon request",121,94
105,Different Affordances on Facebook and SMS Text Messaging Do Not Impede Generalization of Language-Based Predictive Models,"Tingting Liu, Salvatore Giorgi, Xiangyu Tao, Sharath Chandra Guntuku, Douglas Bellew, Brenda Curtis, Lyle Ungar",Broader Impact,"Our findings have important implications. Firstly, our research highlights the variations in psycho-linguistic features between Facebook and SMS, thus warranting further investigation of downstream applications. Secondly, future researchers can build predictive models on large-scale social media language and apply them to SMS, which may offer a new approach to address the cost-accuracy trade-off in the context of just-in-time interventions on mobile devices. This study involves human subjects and was approved by the Institutional Review Board (IRB). The data used in this study raise ethical concerns such as handling sensitive personal information (PII) and thus, we have taken measures to securely store, clean, and analyze the data, further data sharing is not possible (3). We use social media, SMS data, and machine learning methods to estimate sensitive attributes like depression. Such estimates can have both positive and negative implications, ranging from providing support to causing discrimination. We must use them with caution",151,95


In [687]:
pd.set_option('display.max_colwidth', None)
df23.sort_values(by=['Word Count'],ascending=False)

Unnamed: 0,Paper Title,Authors,Section Title,Ethics,Word Count,ID
42,Popular Support for Balancing Equity and Efficiency in Resource Allocation: A Case Study in Online Advertising to Increase Welfare Program Awareness,"Allison Koenecke, Eric Giannella, Robb Willer, Sharad Goel",Ethical Statement,"While our research aims to generate positive societal impact via increasing equity in SNAP enrollment, there remains a primary ethical consideration: in optimizing for SNAP enrollment among minority demographics, we inherently reduce the SNAP enrollment among majority demographics. Below, we discuss this trade-off and its potential societal impact, as well as concerns arising from data collection. For each source of potential negative societal impact, we describe the principles used to mitigate our concerns. The crux of our work involves setting a fixed budget for an advertising bidding algorithm, and optimizing for potential SNAP enrollees who are Spanish speakers. For each additional Spanish speaking individual presented with a Get-CalFresh ad, we will necessarily decrease the number of English speaking individuals presented with the same ad — potentially by more than one. The long term consequences of our experiment include that certain individuals belonging to majority groups will not be shown the GetCalFresh ad, and will thus have a lower likelihood of filling out Get-CalFresh’s SNAP application when searching for the same Google keywords that would otherwise trigger the GetCal-Fresh ad to be shown. Across California, we hope to see an increase in SNAP applications from individuals who would not otherwise have easily found the online resources to complete the forms, in keeping with GetCalFresh’s goal of assisting the neediest individuals. If applied broadly, our framework can be used to substantiate decision-makers’ choices across a range of algorithm-based applications. 
Depending on the individuals and groups whose preferences are surveyed, this could either yield policy suggestions that propose more equity-based allocations, or ones that propose more efficiency-based allocations as is the norm. The key distinction will stem from whose preferences are elicited, and whether their fairness preferences are biased outside the scope of the efficiency-equity trade-off. One way to ameliorate this concern is to ensure representation of underrepresented groups among individuals whose preferences are being elicited (Kasy and Abebe 2021; Whit-taker 2020). As to the ethical challenges of data collection, this research uses data on three fronts: first, from Google ad target-ing towards a large swath of Google users in California; second, from Code for America’s compilation of GetCalFreshapplications; and third, from our Prolific survey. In Google ad targeting, we do not have access to any individual-level data; rather, we can only see audience-level statistics (e.g., how many total impressions or clicks were received on an ad). In the GetCalFresh applications, Code for America continuously tracks all applications that come through its system, but takes particular care to ensure data privacy. For example, even though one could argue the benefits of collecting race-based data from users to optimize for racial equity, the GetCalFresh application does not collect race data at all because the application does not require race information. Further, our team only obtained access to anonymized household-level data pertaining to the research at hand. 
In the Prolific survey, we pre-registered our experiment viaAsPredicted(#84866), indicating what individual-level data we aimed to collect; we made the decision to not publicly re-lease said data for Prolific user privacy reasons—our dataset includes sensitive information such as political affiliationand income levels, which were imperative to collect to understand the socioeconomic drivers of fairness preferences. Across these three data sources, we have minimized the potential data privacy harm to the extent possible while still allowing for this research to be conducted. The survey question text, along with code reproducing data analysis, are posted on GitHub for reproducibility. Across both the inherent demographic trade-off and data privacy considerations, we have made the choices we feel best lead to equitable outcomes for the neediest potential GetCalFresh users, with minimal harm to the broader set of potential users. While our research focuses on SNAP within California, there is room for future work both across America, and via analogous food stamp programs globally. However, our design choices would need to be revisited since the concept of neediest recipients may vary considerably in different contexts.",652,38
5,"Followback Clusters, Satellite Audiences, and Bridge Nodes: Coengagement Networks for the 2020 US Election","Andrew Beers, Joseph S. Schafer, Ian Kennedy, Morgan Wack, Emma S. Spiro, Kate Starbird","Ethical Considerations, Limitations, and SoftwareSharing","As network visualizations continue to be central in social media analysis, we believe it is necessary to briefly examine the ethical considerations on whether network visualizations such as these should be used in every circumstance. We have chosen here to visualize users participating in a high-prominence topic, consisting of mostly public-facing accounts such as politicians and media outlets. The same methods applied to communities with a higher expectation of privacy, or who face higher risks from exposure, may be unethical surveillance if researchers have not derived consent from members of these communities. There are also ethical implications to naming accounts visualized as nodes in networks. Some users, due to gender, race, or other factors, are at higher risk of harassment if identified as influential in a given community, while other users who explicitly seek attention in online communities may use their identification in networks as a propaganda tool in hateful campaigns. In our reporting of this work, we have declined to name some accounts for both reasons. We also stress here how the data collection procedure, and subsequent description of that procedure, affects which communities appear to be participating in a phenomenon. Several politically-active Twitter communities in the US that have been previously described in research, such as Black Twitter (Clark 2019) and non-English language communities (Fang 2021; Soto-Vasquez et al. 2020), are not explicitly visible in our analysis, likely due to their different posting volumes and the choice of terms and topics on which we chose to center our data collection. 
Researchers working with such visualizations whose research bears on policy and public perception must explain such limitations in the communication of theirwork. We have described the basic form of an approach for visualizing engagements in social network data, and there are many ways in which this method can be modified to more saliently capture engagement dynamics. For example, the current formulation of this method places emphasis on users who frequently share content, which is not necessarily undesirable given the external impact of this behavior. However, different formulations of the network projection scheme, such as those that weight users’ engagements relative to their average level of engagement, may be a better reflection of real discourse communities that exist at lower sharing volumes. We also observe that many of these coengagement networks create densely-connected subgraphs in which most nodes are connected to most other nodes, making internode relationships difficult to visually identify. Accordingly, these networks may be complementary with other techniques to improve visualizations of social networks, such as the edge sparsification procedures for densely-connected networks proposed by Nocaj et al. (2015). In the hopes that others may replicate our methods of both visualization and analysis on new datasets, we make the code available for generating these graphs either from structured data received from the Twitter API, or in general JSON and CSV-based formats. This code uses the visualization capabilities of the open-source network visualization library Gephi, and its implementation of the ForceAtlas algorithm for network visualization (Bastian, Heymann, and Jacomy2009; Jacomy et al. 2014). We have packaged this code in publicly-available Docker containers, a relatively portable and stable code format which can be run on many machines with relatively few installation requirements. 
Additionally, we have made available node and link data for all visualizations displayed in this paper, as well as a list of Twitter ID numbers for tweets and users corresponding to data used to generate these graphs. We hope by making the code for generating these graphs open-source, other researchers both qualitative and quantitative will both explore the potential and limitations of this method, as well as contribute modifications to this scheme as appropriate.",607,5
18,Misleading Repurposing on Twitter,"Tuğrulcan Elmas, Rebekah Overdorf, Karl Aberer",Ethical Impact,"Data Collection and Management: This study only uses public data provided by Twitter and the Internet Archive, both of which have been analyzed extensively by previous work. To comply with the Twitter Terms of Service and protect the privacy of Twitter users, we do not share the data of repurposed accounts from the popular dataset. However, we share the ids of the repurposed accounts from the integrity dataset, since these accounts have already been made public by Twitter and, as such, there is no risk of further harms in their release. Threats to User Anonymity and PrivacyWe additionally mitigate any privacy loss to normal Twitter users by limiting our study to only two types of accounts: 1) accounts in the civic integrity dataset which have been designated by Twitter as harmful to public dialogue and released by Twitter, and 2) popular accounts which can influence the public. For an account to be considered “popular”, we follow Twitter’s lead in choosing a threshold of 5,000 followers, the threshold Twitter uses in the civic integrity dataset to determine if a user’s profile be made public. This group of accounts does include legitimate users who do not intend to mislead others or participate in malicious activity, and in the course of our study, we uncovered their former account names/old profiles via parsing publicly available data. This may include accidental deanonymization of a currently pseudonymized account if the user self-stated their identity in an old version of their profile and posted enough tweets from the old version of their account to appear in the 1% sample. We mitigated this risk to the best of our availability by not releasing the data publicly, performing the annotation ourselves to not expose the data to crowd workers, and not reading their tweets. 
Further Potential Impacts of WorkWe must also consider the impact of publishing such a study and making this type of platform manipulation known to the general public and academic community. First, we hope that this work raises awareness among Twitter users that accounts that they follow may be repurposed for malicious purposes so that they can notice such accounts when they see them, and possibly even report them as malicious. We also hope that pointing out and studying this phenomenon urges academics and Twitter alike to put more resources into mitigation methods that do not have negative impacts on normal users, especially those from already marginalized groups. Awareness goes both ways, though, and this paper could also lead to malicious users learning about repurposing. This could lead to some who did not know that repurposing was possible to maliciously repurpose more accounts. However, we know from the widespread use of malicious repurposing that this phenomenon is already known by many who wish to use it maliciously. By bringing this problem to light, we hope to mitigate this risk by promoting user and platform awareness, thus discouraging its use. Although the goal of this paper is to uncover malicious re-purposing, parts of our methodology could be repurposed to deanonymize users who want to remain anonymous, as long as at one point in the past their account had an identifiable attribute. Users should be made aware that if they wish to remain anonymous, a new account should be created from scratch rather than repurposing a non-anonymous account. Finally, this work further illustrates that deletion privacy is important for users, but that it also can prevent malicious activity from being discovered. While users need to be able to delete and hide their prior activities and accounts, this study underlines how such mechanisms can be misused to mislead and deceive users.",602,15
44,Large-Scale Demographic Inference of Social Media Users in a Low-Resource Scenario,"Karim Lasri, Manuel Tonneau, Haaya Naushan, Niyati Malhotra, Ibrahim Farouq, Víctor Orozco-Olvera, Samuel Fraiberger",Ethical Statement,"Addressing the previously unknown, we describe NigerianTwitter and highlight that interpretable and fair algorithms can provide comparably high performance to more advanced but less transparent and potentially more biased methods. Broadly speaking, beyond describing Nigerian Twitter, we expect that the utility of this approach will be evident in future work that relies on demographic inference to evaluate policy impact. Yet, we acknowledge that our approach, as well as broader research aimed at inferring demographic attributes of users, may raise several ethical concerns. For instance, the tools developed can be used for profiling purposes in pursuit of malicious objectives. Due to the accessibility of available tools and the high risk of re-identification (Rocher, Hendrickx, and De Montjoye 2019), the data used for development and evaluation may be sensitive and require confidentiality. Moreover, we are aware that name-based demographic inference may disproportionately miscategorize minority groups and individuals which can have serious empirical and ethical consequences, as extensively discussed in Lock-hart, King, and Munsch (2023). In this work, we only provide a modeling pipeline aimed at accurately inferring demographic traits of social media users and thus remain agnostic on the usage of such tool. In doing so, we leave it to practitioners and researchers who wish to incorporate such information in their analysis to estimate whether their workis ethically desirable. It is our responsibility, however, to draw attention to a number of critical points and limitations, which should be taken into account when evaluating whether future work building on our methods is ethically justified. 
In the following, we build on the suggestions formulated by Lockhart, King, and Munsch (2023). First, our method produces attributes based on external ascriptions, and therefore should be used in case studies where one is interested in external ascription, e.g. how social media users perceive each other, instead of focusing on a user’s true sense of self-identity. Additionally, our method builds on labels produced and revised by local domain experts who have a strong knowledge of the extent to which names or profiles signal certain traits, as described in §3.3. Labels assigned without such precaution might not be as accurate and produce erroneous inferences. We try to limit subjective judgments and individual biases in the annotation processby duplicating the labeling task among our four experts and by making sure disagreements are arbitrated. This however does not impede unfair biases being shared by all of our an-notators, despite their level of expertise. Further, for each demographic attribute, we limit ourselves to traits which can be retrieved from a user’s name with high accuracy in a given population, as demonstrated by our name matching results. In doing so, we fail to capture a variety of subgroups, including non-binary individuals, ethnic minorities, and traditional faiths, as resources are lacking for these groups which could result in developing poorly performing models. This draws a limitation of our approach: while providing accurate predictions at the aggregated level, it disregards minority sub-groups, which limits inclusivity. Any downstream usage of methods or data similar to ours should take this limitation into account. Finally, as our pipeline is accurate at an aggregated level, downstream applications relying on similar data should preferably make use of demographic predictions at the group level, as individual predictions might contain erroneous associations which could add confounds to the modeling pipeline.",548,40
16,We Are in This Together: Quantifying Community Subjective Wellbeing and Resilience,"MeiXing Dong, Ruixuan Sun, Laura Biester, Rada Mihalcea",Broader Impact,"Our work, in conjunction with existing work (Ashokkumarand Pennebaker 2021; Biester et al. 2021), shows that the pandemic had different effects on different communities. Further, we show that signals from social media can be predictive of how a community copes with adversity. We found that cities more affected by the pandemic tended to have less connected members and had previously placed more importance on life aspects that were most impacted by social distancing during the pandemic, such as seeing friends and participating in group activities. Our features were predictive of whether a city’s wellbeing was affected by the pandemic. However, predicting the subsequent recovery trajectory of affected cities proved to be more difficult, implying that there are other factors involved and further work is needed to understand community resilience over time. Our findings indicate that differential policies should be put in place for communities, based on the pandemic’s local impact. Cities more impacted by social distancing measures may need to place higher priority on re-establishing social activities, such as local events and cultural festivals. Furthermore, policymakers could use automatically derived signals from social media as a real-time source of feedback for their policies, especially during times that require quick decision-making like during the pandemic. However, it’s important that such factors are considered holistically. A limitation of our findings is that they do not necessarily reflect the general public. Our work focuses on a single social media site (Reddit) where the users tend to be young (7) and male (8). 
Further, Reddit activity may potentially overrepresent the number of people affected by the pandemic (e.g., more people joining and posting due to physical lockdowns) or exclude those who largely ignored the pandemic. Our analyses look at all posts and do not focus only on posts that relate to COVID-19, which may lessen this bias. Additionally, there are many other facets of how communities acted during COVID, such as their general compliance with COVID-19 prevention measures or vaccination rates. However, further studies using other social media, surveys, and more facets of how communities acted during COVID (e.g., mask adherence, general compliance with COVID-19 prevention measures) can help support and extend our insights. Because our study is based solely on observational data, we cannot establish causal links between the community characteristics we have identified and the wellbeing recovery outcomes. To address this, future work could involve col-lecting ground truth data about city recovery and resilience, such as through large surveys of individuals in each city regarding their wellbeing during the pandemic. Finally, our study should not be construed to be a comprehensive study of wellbeing. Subjective wellbeing does not consist solely of the presence of positive affect. It is more complex and multi-faceted, involving other aspects such as life satisfaction which are impacted in different ways (Ket-tlewell et al. 2020). Future work could study how community factors relate to these additional aspects of wellbeing. Furthermore, the relation of our wellbeing metric to metrics such as self-reported life satisfaction has not been studied; a misalignment between stance expressed in social media and public opinion surveys has been noted in prior work (Josephet al. 2021), and we leave it to future work to study how wellbeing as expressed in social media posts relates to self-reported wellbeing.",541,14
...,...,...,...,...,...,...
57,The Chance of Winning Election Impacts on Social Media Strategy,"Taichi Murayama, Akira Matsui, Kunihiro Miyazaki, Yasuko Matsubara, Yasushi Sakurai",Ethical Considerations,"The data in this paper is derived from publicly accessible user-generated content. We pay the utmost attention to the privacy of individuals in this study. When sharing our twitter data, we will publish only a list of tweet IDs.",39,52
36,Online Emotions during the Storming of the U.S. Capitol: Evidence from the Social Media Network Parler,"Johannes Jakubik, Michael Vossing, Nicolas Prollochs, Dominik Bar, Stefan Feuerriegel",Ethics Statement,We respect the privacy and agency of all people potentially impacted by this work and take specific steps to protect their privacy (see main text). The analysis was conducted in accordance with the Institutional Review Board at ETHZurich.,38,32
70,Social Influence-Maximizing Group Recommendation,"Yangke Sun, Bogdan Cautis, Silviu Maniu",Ethics and Competing Interests,"The positive outcome of our research is more effective recommendations, leading to more awareness and adoption of items. In our view, there are no negative outcomes and no ethical implications pertaining to the data collection process.",36,65
84,A Multi-Platform Collection of Social Media Posts about the 2022 U.S. Midterm Elections,"Rachith Aiyappa, Matthew R. DeVerna, Manita Pote, Bao Tran Truong, Wanying Zhao, David Axelrod, Aria Pessianzadeh, Zoher Kachwala, Munjung Kim, Ozgur Can Seckin, Minsuk Kim, Sunny Gandhi, Amrutha Manikonda, Francesco Pierri, Filippo Menczer, Kai-Cheng Yang",Ethical Statement,This study has been granted exemption from Institutional Review Board review (Indiana University protocol 17036). The collection and release of the dataset are in compliance with the platforms’ terms of service.,31,76


In [688]:
import re

In [689]:
def remove_citations(df,patterns):
    citation_df = pd.DataFrame({'ID':[],'Citations':[]})
    counter = 0
    for i, row in df.iterrows():
        text = df.loc[i,'Ethics']
    
        for j, p in enumerate(patterns):
            citations = re.findall(p,text)
    
            # if citations are found, remove them from the text and add to a separate df
            if (len(citations) != 0):
                text = re.sub(p,'',text)
                for c in citations:
                    ID = str(i)+str(counter)
                    new_row = {'ID': ID, 'Citations': c}
                    citation_df.loc[len(citation_df)] = new_row
                    counter += 1
                
                # replace previous text with cleaned version
                df.loc[i,'Ethics'] = text
        
        # reset the counter before each text
        counter = 0
    return (citation_df, df)

In [690]:
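# strip the following citation-like patterns from each statement and store the matches in a separate df
# with an ID per match (ID = row index + citation index, concatenated as strings)
# note: r'\b\d+\b' also removes every standalone number, not only numeric citations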
patterns = [r'\(\d{4}\)', r'\(\d+\)', r'\([A-Za-z]+\s\d{4}\)', r'\([A-Za-z]+\d{4}\)', r'\b\d+\b', r'\set\sal\.']

citation_data22 = remove_citations(df22,patterns)
citation_df22 = citation_data22[0]
df22 = citation_data22[1]

citation_data23 = remove_citations(df23,patterns)
citation_df23 = citation_data23[0]
df23 = citation_data23[1]
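To see what the pattern list does to a single statement, the following is a minimal sketch that applies the same six patterns in the same order. The helper `strip_patterns` and the sample sentence are hypothetical and are not part of the notebook or the dataset.

```python
import re

# same patterns as in the cell above
patterns = [r'\(\d{4}\)', r'\(\d+\)', r'\([A-Za-z]+\s\d{4}\)',
            r'\([A-Za-z]+\d{4}\)', r'\b\d+\b', r'\set\sal\.']

def strip_patterns(text, patterns):
    """Apply each pattern in order; return the cleaned text and all matches."""
    matches = []
    for p in patterns:
        matches.extend(re.findall(p, text))
        text = re.sub(p, '', text)
    return text, matches

# hypothetical sentence, not taken from the ICWSM data
sample = "Prior work (Smith 2020) surveyed 500 users in 2021."
cleaned, found = strip_patterns(sample, patterns)

print(repr(cleaned))  # the counts and the year are gone along with the citation
print(found)
```

In this sketch the parenthetical citation is removed as intended, but `500` and `2021` are removed as well, which would explain gaps such as "As of , the experiment included over million" in the cleaned statements shown further down.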

In [691]:
citation_df23

Unnamed: 0,ID,Citations
0,10,(2001)
1,11,(2013)
2,12,(2020)
3,13,et al.
4,14,et al.
...,...,...
196,971,22
197,972,129
198,980,2019
199,1030,1000


In [692]:
df23

Unnamed: 0,Paper Title,Authors,Section Title,Ethics,Word Count,ID
0,How Do US Congress Members Advertise Climate Change: An Analysis of Ads Run on Meta’s Platforms,"Laurenz Aisenpreis, Gustav Gyrst, Vedran Sekara",Ethical Statement,"The data in this paper is derived from the Meta Ad Library. It contains publicly accessible ads run on Meta platforms by US politicians. Working with social media data carries risks of privacy issues and the right to be forgotten. However, our data analysis is limited to aggregated data presentations and only concerns ads published by public figures.",58,0
1,The Pursuit of Peer Support for Opioid Use Recovery on Reddit,"Duilio Balsamo, Paolo Bajardi, Gianmarco De Francisci Morales, Corrado Monti, Rossano Schifanella",Ethical Statement,"This work follows the guidelines and the ethical considerations by Eysenbach and Till ; Moreno ; Ramırez-Cifuentes . All the results provide aggregated estimates and do not include any information on individuals. The users in our study were fully aware of the public nature and free accessibility of the content they posted since the subreddits are of public domain, are not password-protected, and have thousands of active subscribers. Reddit’s pseudonymous accounts make the retrieval of the true identity of users unlikely. Nevertheless, as a further privacy measure, the authors’ names were anonymized before using the data for analysis. Therefore, our research did not require informed consent.",110,1
2,Exposure to Marginally Abusive Content on Twitter,"Jack Bandy, Tomo Lazovich",Ethics Statement,"This study intersects with a number of topics related to the ethics of algorithmic platforms. For example, our analysis requires collecting data about which Tweets users view while using Twitter, and also requires human annotators to review and rate potentially abusive content. Overall, we agree with researchers who view this type of work as necessary for understanding and protecting democratic discourse , especially in terms of standard risk-benefit frameworks. Still, it is important to note different measures taken to address potential risks. While the holdback experiment is necessary, it is not ideal for many users to be excluded from algorithmic timeline features. Twitter has thus worked to provide more users access to algorithmic timelines while maintaining statistical robustness in the holdback experiment. As of , the experiment included over million active accounts in the reverse-chronological timeline group (Husz ́ar ), but when this analysis was conducted in , that number had been reduced to 630k. This study was not subject to an academic IRB process, however, it went through standard legal and privacy review processes at Twitter. Finally, the data used in this paper was fully anonymized for publication, following standard ethical procedures. We do not include any results that might disclose the identity of any account in the datasets.",214,2
3,Finding Qs: Profiling QAnon Supporters on Parler,"Dominik Bär, Nicolas Pröllochs, Stefan Feuerriegel",Ethics Statement,"This research did not involve interventions with human subjects, and, thus, no approval from the Institutional Review Board was required by the author institutions. All analyses are based on publicly available data and we do not make any attempt to track users across different platforms. We neither de-anonymize nor de-identify their accounts. Furthermore all analyses conform with national laws. To respect privacy, we explicitly do not publish usernames in our paper (except for celebrity profiles) and only report aggregate results.",80,3
4,Predicting Future Location Categories of Users in a Large Social Platform,"Raiyan Abdul Baten, Yozen Liu, Heinrich Peters, Francesco Barbieri, Neil Shah, Leonardo Neves, Maarten W. Bos",Ethics Statement,"Any experiment dealing with data as sensitive as ours (e.g.,location) needs to operate ethically and securely. Our approach actively aims to minimize risks of misuse and intrusion by avoiding user-identifiable data, such as demographic identities and spatial coordinates. Thus, our model may be preferable in highly sensitive settings. The datasets were anonymized before analysis. All experiments were conducted in Snapchat’s internal secure storage systems, and no data was stored outside Snapchat’s ecosystem. Thus, we do not foresee strong ethical concerns induced by our work.",84,4
...,...,...,...,...,...,...
101,Auditing Elon Musk’s Impact on Hate Speech and Bots,"Daniel Hickey, Matheus Schmitz, Daniel Fessler, Paul E. Smaldino, Goran Muric, Keith Burghardt\r",Ethics Statement,"All data were collected from the public Twitter API; identifiable information was removed prior to analysis, minimizing risks to Twitter users. Our work provides several potential benefits for society, including an audit of the steps ostensibly being taken to combat harm on Twitter, and a new way to detect hate speech at scale using commercial APIs as well as a curated list of hate words. Perspective API, which we use to classify hate, is run by Alphabet, a competitor to Twitter, but we believe this does not affect our results.",90,92
102,The Amplification Paradox in Recommender Systems,"Manoel Horta Ribeiro, Veniamin Veselovsky, Robert West",Ethical Considerations,"We do not foresee a negative societal impact coming from this research, which, on the contrary, may help improve algorithmic audits of recommender systems like YouTube.",26,93
103,Host-Centric Social Connectedness of Migrants in Europe on Facebook,"Aparup Khatua, Emilio Zagheni, Ingmar Weber",Ethical Considerations,"Anonymous and aggregate data were obtained through Facebook’s Marketing API. Given the minimum group size of , any individual re-identification risk is minimal. However, there is a risk of group-level harm by mapping vulnerable populations, such as those of a particular faith. To mitigate this risk, Facebook removed targeting attributes related to religion and other sensitive attributes, including the one used in this study5. Note, however, that our study does not target Muslim migrants themselves but natives’ non-Muslimin the respective countries, limiting the potential group harm. Still, the removal of the targeting attribute of“friends of people who have engaged with Ramadan” limits the reproducibility. Given the sensitivity of the topic, we commit to sharing our data with other researchers upon request",121,94
105,Different Affordances on Facebook and SMS Text Messaging Do Not Impede Generalization of Language-Based Predictive Models,"Tingting Liu, Salvatore Giorgi, Xiangyu Tao, Sharath Chandra Guntuku, Douglas Bellew, Brenda Curtis, Lyle Ungar",Broader Impact,"Our findings have important implications. Firstly, our research highlights the variations in psycho-linguistic features between Facebook and SMS, thus warranting further investigation of downstream applications. Secondly, future researchers can build predictive models on large-scale social media language and apply them to SMS, which may offer a new approach to address the cost-accuracy trade-off in the context of just-in-time interventions on mobile devices. This study involves human subjects and was approved by the Institutional Review Board (IRB). The data used in this study raise ethical concerns such as handling sensitive personal information (PII) and thus, we have taken measures to securely store, clean, and analyze the data, further data sharing is not possible . We use social media, SMS data, and machine learning methods to estimate sensitive attributes like depression. Such estimates can have both positive and negative implications, ranging from providing support to causing discrimination. We must use them with caution",151,95


In [693]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import string

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\campb\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\campb\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [694]:
# keep the stopword list under its own name so the nltk `stopwords` module is not shadowed
# (re-running this cell would otherwise raise an AttributeError)
stop_words = set(stopwords.words('english'))
tokenizer = word_tokenize

In [695]:
def clean_text_words(text):
    tokens = word_tokenize(text)
    
    # make text lowercase
    lowercase_tokens = [w.lower() for w in tokens]
    
    # remove punctuation
    remove_punct_tokens = [w for w in lowercase_tokens if w not in string.punctuation]
    
    # remove stopwords
    clean_tokens = [w for w in remove_punct_tokens if w not in stop_words]
    return clean_tokens

In [696]:
def clean_text_sents(text):
    tokens = sent_tokenize(text)
    
    # make text lowercase
    lowercase_tokens = [w.lower() for w in tokens]
    
    return lowercase_tokens

In [697]:
tokenized_words22 = df22.copy()
tokenized_words23 = df23.copy()

tokenized_sents22 = df22.copy()
tokenized_sents23 = df23.copy()

In [698]:
# clean and tokenize each word in every text column (all columns except 'Word Count' and 'ID')
for col in tokenized_words22.columns[:-2]:
    tokenized_words22[col] = tokenized_words22[col].apply(clean_text_words)

for col in tokenized_words23.columns[:-2]:
    tokenized_words23[col] = tokenized_words23[col].apply(clean_text_words)

In [699]:
# clean and tokenize each sentence in every text column (all columns except 'Word Count' and 'ID')
for col in tokenized_sents22.columns[:-2]:
    tokenized_sents22[col] = tokenized_sents22[col].apply(clean_text_sents)

for col in tokenized_sents23.columns[:-2]:
    tokenized_sents23[col] = tokenized_sents23[col].apply(clean_text_sents)

## 2. Word Frequency <a name="sec2"></a>

In [700]:
from nltk import FreqDist

In [701]:
freq_df22 = pd.DataFrame({'ID':[],'Frequent Words':[]})
freq_df23 = pd.DataFrame({'ID':[],'Frequent Words':[]})

In [702]:
# populate freq_df with the 5 most frequently used words for each paper
for i, row in tokenized_words22.iterrows():
    freq_dist = FreqDist(tokenized_words22.loc[i,'Ethics'])
    most_common_words = freq_dist.most_common(5)
    
    freq_df22.loc[i,'ID'] = tokenized_words22.loc[i,'ID']
    freq_df22.loc[i,'Frequent Words'] = [most_common_words]
    
for i, row in tokenized_words23.iterrows():
    freq_dist = FreqDist(tokenized_words23.loc[i,'Ethics'])
    most_common_words = freq_dist.most_common(5)
    
    freq_df23.loc[i,'ID'] = tokenized_words23.loc[i,'ID']
    freq_df23.loc[i,'Frequent Words'] = [most_common_words]
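NLTK's `FreqDist` is a subclass of `collections.Counter`, so the top-5 ranking above can be illustrated with the standard library alone. The sketch below uses a hypothetical token list (not drawn from the dataset) to show how `most_common` orders tied words.

```python
from collections import Counter

# hypothetical cleaned tokens for one statement
tokens = ['data', 'paper', 'derived', 'data', 'meta']

top_words = Counter(tokens).most_common(3)
print(top_words)  # [('data', 2), ('paper', 1), ('derived', 1)]
```

Words with equal counts come back in the order they were first seen, so for short statements the tail of each top-5 list largely reflects word order in the text rather than relative importance.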

In [703]:
freq_df23

Unnamed: 0,ID,Frequent Words
0,0.0,"[(data, 4), (meta, 2), (ads, 2), (paper, 1), (derived, 1)]"
1,1.0,"[[(users, 2), (public, 2), (’, 2), (work, 1), (follows, 1)]]"
2,2.0,"[[(algorithmic, 3), (users, 3), (twitter, 3), (standard, 3), (experiment, 3)]]"
3,3.0,"[[(analyses, 2), (research, 1), (involve, 1), (interventions, 1), (human, 1)]]"
4,4.0,"[[(data, 3), (sensitive, 2), (thus, 2), (snapchat, 2), (’, 2)]]"
...,...,...
101,92.0,"[[(twitter, 4), (hate, 3), (api, 2), (data, 1), (collected, 1)]]"
102,93.0,"[[(foresee, 1), (negative, 1), (societal, 1), (impact, 1), (coming, 1)]]"
103,94.0,"[[(risk, 3), (data, 2), (facebook, 2), (’, 2), (given, 2)]]"
105,95.0,"[[(data, 4), (sms, 3), (implications, 2), (thus, 2), (social, 2)]]"
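The `(’, 2)` entries in the output above are consistent with the punctuation filter in `clean_text_words`, which checks tokens against `string.punctuation`; that constant holds ASCII characters only, so a curly apostrophe passes through. A minimal sketch of a Unicode-aware check follows. The helper `is_punctuation` and the token list are hypothetical and are not part of the notebook.

```python
import string
import unicodedata

def is_punctuation(token):
    """True if every character in the token has a Unicode punctuation category (P*)."""
    return len(token) > 0 and all(
        unicodedata.category(ch).startswith('P') for ch in token
    )

# hypothetical token list, not produced by word_tokenize
tokens = ['snapchat', '’', 's', ',', 'data']

ascii_filtered = [t for t in tokens if t not in string.punctuation]
unicode_filtered = [t for t in tokens if not is_punctuation(t)]

print(ascii_filtered)    # the curly apostrophe survives
print(unicode_filtered)  # the curly apostrophe is dropped
```

Swapping a check like this into the cleaning step would change the frequency tables, so the outputs shown in this notebook reflect the ASCII-only filter.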


## 3. Sentence Analysis <a name="sec3"></a>

In [704]:
tokenized_sents22

Unnamed: 0,Paper Title,Authors,Section Title,Ethics,Word Count,ID
0,"[leaders or followers?, a temporal analysis of tweets from ira trolls]","[siva k. balasubramanian, mustafa bilgic, aron culotta, libby hemphill, anita nikolich, matthew a. shapiro]",[ethical statement],"[the data in this paper is derived from publicly accessible user-generated content online., while our focus is on aggregate trending keywords and not individual user characteristics, such data carry risks for issues of privacy and “right-to-be-forgotten.” to mitigate these issues and comply with terms of service, we will release only tweet ids for the data used in this study.]",59,0
1,[community under surveillance: impacts of marginalization on an online labor forum],"[hanna barakat, elissa m. redmiles]",[ethics statement],"[our work was approved by our institution’s ethics review board., despite this approval, tension exists within the literature regarding the ethics of analyzing ‘public’ forum data and why, when, and how users may perceive public forum discussions as private (e.g., proferes et al., ; vi-tak, shilton, and ashktorab ; eysenbach and till ; razi, badillo-urquiola, and wisniewski )., in this work, we follow guidance from cook, ayers, and horsch and dym and fiesler on preserving the platform’s anonymity and draw on the established framework from eysenbach and till to assess the ethics of our work., eysenbach and till suggest three central aspects to evaluating the ‘private’ or ‘public’ nature of online sources., first is an assent of access to the forum: is registration necessary to view content or post?, this specific forum is an ‘open’ forum and the social media platform does not require registration., second, forum size; this forum had approximately , members at the time of data collection, which is between a medium- and large-sized forum for this platform., third, eysenbach and till recommend assessing how members perceive the forum., it is difficult to gauge whether members perceive the forum as public or private., however, posts and comments often allude to the public nature of the content posted., for example, members warned each other not to post specific platform names in case law enforcement was monitoring the page and frequently suggested moving conversations out of public view to direct messages., while our assessment along these three criteria suggests that forum members are aware of the public nature of the platform, to protect participants we omit usernames user names (cook, ayers, and horsch ; dym and fiesler2020), anonymize the platform and forum name, alter or paraphrase all quotes so they cannot be 
reverse-searched, and the names of social media sites, messaging applications,and other tools are intentionally removed from this paper to avoid any harmful repercussions to sex workers using those tools (costanza-chock ).]",327,1
3,"[linguistic characterization of divisive topics online: case studies on contentiousness in abortion, climate change, and gun control]","[jacob beel, tong xiang, sandeep soni, diyi yang]",[ethics statement],"[this research study has been approved by the institutional review board (irb) at the researchers’ institution., in this work, we leverage no information that was not publicly available at the time of data collection, leveraging the public post histories of the users who participated in the conversations which we studied., as such, user private information is not disclosed, and none of the posts which were used to compute the gender and location features have been saved or need to be saved during the course of this computation.]",87,2
5,"[to recommend or not?, a model-based comparison of item-matching processes]","[serina chang, johan ugander]",[broader impact],"[our work is primarily motivated by broader impacts: our goal is to develop a principled framework through which we can meaningfully assess the impacts of recommender systems on society., recommender systems are ubiquitous on the web and social media platforms, and have the potential for large-scale negative consequences such as pulling individuals into filter bubbles, increasing population-level polarization, or exacerbating social inequalities., however, in order to distill the role of recommender systems in contributing to these social phenomena, we need to compare user outcomes under recommender systems to a credible assessment of user outcomes without recommender systems., thus, we develop two contrasting models that capture each of these worlds and systematically compare them, so that we can analyze the consequences of recommender systems relative to a counterfactual world without then., as a model-based approach, we establish key general insights about recommender systems, but we do not make any claims about specific platforms nor do we offer instruction on how real-world recommender systems should be designed., we use real movie ratings data from movielens to demonstrate that our theoretical results translate to real-world settings, not to demonstrate superior performance on the movie recommendation task., the test users in our simulation experiments are synthetic, generated from a distribution learned from the real users; the real users are also anonymized in the data., practitioners should be aware of the limitations and theoretical nature of our work., they should not directly apply our findings to their domains, but we hope that they appreciate our main takeaways: that recommender systems fundamentally alter how humans interact with content, and that seemingly minor algorithmic decisions (such as regularization) can have 
major effects on user outcomes.]",278,3
9,[echoes through time: evolution of the italian covid-19 vaccination debate],"[giuseppe crupi, yelena mejova, michele tizzani, daniela paolotti, andre panisson]",[broader impact & ethical considerations],"[major efforts are ongoing to better understand the reasons for vaccine hesitancy in europe and around the world., the european centre for disease prevention and control (ecdc) conducts regular surveys to measure covid- behaviors, and supports the inclusion of additional data sources to understand the beliefs and expectations around vaccination (european centre for disease prevention and control )., the surveillance tools and opinion community detection methods presented in this paper may assist in better understanding the popular sentiments, associations, questions, and misunderstandings that circulate on one of the most popular social media in italy., although this dataset captures a small portion of italian speakers, most social media users do not engage in posting, and use platforms for entertainment and as a source of information (van mierlo ), thus the potential audience of the content captured here may be a norder of magnitude larger than the users we analyze., still, we should remember that there are other groups which are not captured in this data, including people not having access to internet and the platform, and those with disabilities not allowing them to interact with it., while the dataset collected for this study contains only publicly posted tweets, it is possible that it contains postsfrom vulnerable groups, including those with serious or chronic health conditions relevant to covid- and the vaccination campaigns, those emotionally or psychologically vulnerable, and family members and friends who are concerned for the wellbeing of their loved ones, among others., in order to preserve the privacy of the individual users, we reveal the twitter handles only of public figures or anonymous accounts in this paper, and otherwise do not use 
iden-tifiable information in the analysis of the data., furthermore, we will abide by twitter’s terms of service by making the ids of the collected available upon request, such that those which have been deleted by their authors will not be available when the metadata is re-collected (unfortunately, limiting reproducibility somewhat)., however, to support the transparency of this work, we make the code available to the research community, especially that pertaining community detection, random walk controversy score computation, and community-aware topic modeling9., finally, as opinion surveillance methods used in this work may be applied to identify other communities and individuals therein, we urge the research community to strictly abide by the code of ethics for the research and application of these tools (such as one by aaai10) in order to minimize harm to the subjects of research.]",410,4
10,[social media reveals urban-rural differences in stress across china],"[jesse cui, tingdan zhang, kokil jaidka, dandan pang, garrick sherman, vinit jakhetiya, lyle h. ungar, sharath chandra guntuku]","[limitations, ethics, and future work]","[this study, like several social media based studies, has many limitations., first, our findings may not cover the full picture of stress because of the internet censorship in china (vuoriand paltemaa ) as we only use publicly available posts., we assume a few stress-related words as reported in previous research such as psychosomatic symptoms, substance use, or suicidal ideation may not be present in our dataset because they could harm the establishment of a ‘healthy and harmonious internet environment’ (paltemaa )., china’s bureau of statistics recommends mapping neighborhood committees to urban-rural regions and we did not have access to such granular location for weibo users in our dataset., that said, the current county tier classification system that maps counties into urban and rural regions based on a tier system serves as a reasonable proxy as has been seen in prior works ., further, we replicated the findings on language data at the province level., discussions around the use of social media based health indices should include public health experts, computer scientists, lawyers, ethicists, clinicians, policy makers, and individuals from different socioeconomic and cultural backgrounds (benton, coppersmith, and dredze )., there are several potential ways to broaden our findings., first, we analyzed posts that mention several variants of psychological stress which is potentially a subset of all posts that truly indicate a stressed mental state., while language-based estimates of stress have been validated in english using traditional survey instruments (guntuku et al., a), future work could examine its correlation with other well-being facets in mandarin., second, specific sub-populations(e.g., immigrants) who have 
unique stressors due to diverse reasons (e.g., ‘hukou’) that separate rural and urban residents into disparate social, economic, and political spheres could be examined., despite living in cities, migrant peasant workers still maintain their rural hukou type and are treated as rural residents, with little to no access to urban social security, which can trigger different degrees and aspects of stress., further, users are likely to express themselves differently across social media platforms., in this study, we only used data from one major social media platform - weibo., it would be interesting to examine if the results would differ in other platforms, as has been found in the us (jaidka, guntuku, and ungar ).in summary, our findings suggest that a nuanced analysis of regional social media usage is necessary, in order to situate it in an understanding of the digital divide in internet and social media usage., differences in social contexts are related to a differential use of social media to cope with daily stressors and act as a buffer for stress and subjective well-being., despite societal growth and technological advancements, the characteristics of urban and rural life appear to be replicated and reinforced in the online sphere – in context after context, and culture after culture.]",472,5
18,[improving wikidata with student-generated concept maps],"[hayden freedman, andre van der hoek, bill tomlinson]",[ethics statement],"[the authors have no conflicts of interests to declare, and believe that the work described in this paper meets the standards of the aaai code of professional ethics and conduct., we believe that the enrichment of wikidata with sustainability knowledge is both socially responsible and broadly accessible., we recognize that crowdsourcing efforts may lead to the introduction of bias in datasets (ghai ); in future work, it will be relevant to determine in what ways bias introduced by students is greater than, less than, or different from bias introduced by other crowds., and, whileai systems involve many ethical challenges , this work seeks to produce a more robust knowledge corpus that may enable ai systems to contribute more effectively to the transition to sustainability., by doing so, it may contribute to the well-being of marginalized groups that are most likely to be affected by planetary issues such as climate change.]",153,6
19,[mining points-of-interest data to predict urban inequality: evidence from germany and france],"[manuel ganter, malte toetzke, stefan feuerriegel]",[ethical statement],"[this research did neither involve interventions with human subjects nor individualized human data., thus, no approval from the institutional review board was required by the author institutions.]",27,7
21,[safer: social capital-based friend recommendation to defend against phishing attacks],"[zhen guo, jin-hee cho, ing-ray chen, srijan sengupta, michin hong, tanushree mitra]",[ethical statement],"[we use publicly available datasets collected from twitterapi in the existing research (yang, harkreader, and gu2011; yang ; cresci ) to evaluate our proposed approach., the datasets were all anonymized by hiding identity information by their publishers., broader impact., this work can introduce the potential broader impact to build a safe, trustworthy cyberspace by defending online social networks against phishing attacks through intelligent user interactions based on the proposed-friending recommendation framework., funding and competing interests., this work is funded by the virginia tech and has no competing interests with financial activities outside this paper.]",98,8
25,[on the infrastructure providers that support misinformation websites],"[catherine han, deepak kumar, zakir durumeric]",[ethics of deplatforming],"[our paper focuses on understanding the providers that directly or indirectly support misinformation websites and whether deplatforming helps curb the spread of misinformation., it remains an open question whether companies should deplatform all kinds of misinformation sites, and if they do, how they should choose which sites to deplatform., while a few providers have policies that prohibit misinformation, many do not, which may inadvertently enable misinformation websites to thrive on their platforms., we encourage providers to actively consider writing concrete policies around abusive content and misinformation., we also note that several of the largest ad providers have publicly announced their intent to fight online abusive content and misinformation; however, according to our data, they have failed to take meaningful action against known problematic sites., for instance, over % of all misinformation sites rely on google for ads., these mainstream ad providers are not only supporting misinformation sites by providing them ad revenue, but also profiting from maintaining relationships with these publishers., we encourage providers to reconsider how they are enforcing their policies.]",172,9


In [705]:
sentence_df22 = pd.DataFrame({'ID':[],'Sentence':[]})
sentence_df23 = pd.DataFrame({'ID':[],'Sentence':[]})

In [706]:
# make a new df of sentences with an ID per sentence (ID = row index + sentence index, concatenated as strings)
for i, row in tokenized_sents22.iterrows():
    sents = tokenized_sents22.loc[i,'Ethics']
    for j, s in enumerate(sents):
        ID = str(i)+str(j)
        new_row = {'ID': ID, 'Sentence': s}
        sentence_df22.loc[len(sentence_df22)] = new_row
        
for i, row in tokenized_sents23.iterrows():
    sents = tokenized_sents23.loc[i,'Ethics']
    for j, s in enumerate(sents):
        ID = str(i)+str(j)
        new_row = {'ID': ID, 'Sentence': s}
        sentence_df23.loc[len(sentence_df23)] = new_row
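Concatenating the row index and the sentence index as plain strings can produce the same ID for two different sentences. The sketch below shows one such collision and a delimited alternative; `concat_id` and `delimited_id` are hypothetical helpers, and the underscore separator is an assumption rather than the notebook's scheme.

```python
def concat_id(i, j):
    # scheme used in the cell above: plain string concatenation
    return str(i) + str(j)

def delimited_id(i, j):
    # hypothetical alternative: a separator keeps the two parts distinguishable
    return f"{i}_{j}"

# row 1, sentence 10 versus row 11, sentence 0
print(concat_id(1, 10), concat_id(11, 0))        # both are '110'
print(delimited_id(1, 10), delimited_id(11, 0))  # '1_10' and '11_0'
```

Whether a collision occurs in practice depends on which rows have ten or more sentences, so the IDs are worth checking for duplicates before they are used as join keys.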

In [707]:
sentence_df23

Unnamed: 0,ID,Sentence
0,00,the data in this paper is derived from the meta ad library.
1,01,it contains publicly accessible ads run on meta platforms by us politicians.
2,02,working with social media data carries risks of privacy issues and the right to be forgotten.
3,03,"however, our data analysis is limited to aggregated data presentations and only concerns ads published by public figures."
4,10,this work follows the guidelines and the ethical considerations by eysenbach and till ; moreno ; ramırez-cifuentes .
...,...,...
826,1072,"our only variables extracted from the twitterdata were tweet ids, timestamps of when the tweets were created, and the impression count, which is part of the public metric variable."
827,1073,"no tweet texts, account profile information, or other information that could identify individuals or groups (pii) were analyzed."
828,1074,reproducibility: all data from the analyses of this article are available online (www.pfeffer.at/data/halflife).
829,1075,"the data includes all tweet ids, tweet creation time, and for each collection iteration for every tweet, its collection time, and the number of views."
