The Reddit forum called r/changemyview, also called CMV, is a forum whose point is to discuss the opinions of individuals and to try to change those original opinions.
CMV has plenty of rules to encourage users to post in a civil manner, as well as rules to ensure replies are also civil.
Here we will use machine learning classifiers to attempt to learn what matters the most in changing the opinions of others in CMV.
You can also checkout our app CMV Check.
The data was originally obtained from Changed My View and can be downloaded on Raw Data. (You can also download the processed data)
The raw data contained four separate files:
- Comments_TextData.csv
- Comments_MetaData.csv
- Submissions_TextData.csv
- Submissions_MetaData.csv
The Comments_TextData.csv is the data file that contains the body or text of the comments. It has following columns:
subreddit | created_utc | author | id | parent_id | body | body_html |
---|
The Comments_MetaData.csv is the data file that contains the information about the comments. It has following columns:
subreddit_id | can_gild | score | saved | gilded | replies | mod_reports | can_mod_post | name |
---|---|---|---|---|---|---|---|---|
archived | parent_id | approved_at_utc | created | author | author_flair_text | author_cakeday | retrieved_on | removal_reason |
distinguished | approved_by | link_id | user_reports | report_reasons | controversiality | likes | downs | banned_by |
created_utc | ups | edited | author_flair_css_class | score_hidden | num_reports | subreddit | id | stickied |
The Submissions_TextData.csv is the data file that contains the body or text of the submissions or posts. It has following columns:
subreddit | created_utc | author | id | title | selftext | selftext_html |
---|
The Submissions_MetaData.csv is the data file that contains the information about the submissions or posts. It has following columns:
subreddit_id | can_gild | score | saved | gilded | replies | mod_reports | can_mod_post | name |
---|---|---|---|---|---|---|---|---|
archived | parent_id | approved_at_utc | created | author | author_flair_text | author_cakeday | retrieved_on | removal_reason |
distinguished | approved_by | link_id | user_reports | report_reasons | controversiality | likes | downs | banned_by |
created_utc | ups | edited | author_flair_css_class | score_hidden | num_reports | subreddit | id | stickied |
author_id | created | original_link | stickied | score | report_reasons | created_utc | author_flair_css_class | subreddit |
secure_media_embed | link_flair_css_class | media | archived | saved | secure_media | ups | preview | adserver_imp_pixel |
promoted_by | url | is_self | from_kind | from | retrieved_on | media_embed | num_comments | third_party_tracking_2 |
thumbnail | hide_score | href_url | user_reports | thumbnail_width | thumbnail_height | mobile_ad_url | mod_reports | likes |
locked | imp_pixel | third_party_tracking | contest_mode | distinguished | author_flair_text | gilded | from_id | domain |
over_18 | promoted | hidden | promoted_url | id | can_gild | is_video | post_hint | name |
disable_comments | num_reports | edited | suggested_sort | third_party_trackers | link_flair_text | downs | approved_by | clicked |
spoiler | view_count | permalink | promoted_display_name | subreddit_id | author_cakeday | adserver_click_url | banned_by | author |
brand_safe | quarantine |
In order to start, you will need to install the libraries Sklearn, imblearning, Numpy, and matlotlib. Then, you must do is make sure all the inputs and outputs are set correctly.
Based on the names of your input/output files, you may need to create the folders for the output files location before running the code below.
Once this is done, the first step would be to create the data with needed feature, for that you will have to run createData()
in main.py or donwload directly the processed data)
Inside the main.py file is where the data gets trimmed to only include the information that you need. This speeds up the loading process for the data. You will find the following code in the if __name__ == '__main__':
:
inputs = ["/home/shared/CMV/RawData/Comments_MetaData.csv", "/home/shared/CMV/RawData/Comments_TextData.csv",
"/home/shared/CMV/RawData/Submissions_MetaData.csv", "/home/shared/CMV/RawData/Submissions_TextData.csv"]
outputs = ['/home/shared/CMV/SlimmedData/Slimmed_Comments_MetaData.csv',
'/home/shared/CMV/SlimmedData/Slimmed_Comments_TextData.csv',
'/home/shared/CMV/SlimmedData/Slimmed_Submissions_MetaData.csv',
'/home/shared/CMV/SlimmedData/Slimmed_Submissions_TextData.csv']
The inputs need to be changed to the location of your saved raw data files, and the outputs need to be changed to the locations you would like to save your slimmed data files.
Inside the main.py is where the comments that receive deltas will be separated from the rest of the comments. You will find the following code in the run()
function:
input = '/home/shared/CMV/SlimmedData/Slimmed_Comments_TextData.csv'
deltas_file = '/home/shared/CMV/SortedData/delta_winning_ids.txt'
deltas_data_file = '/home/shared/CMV/SortedData/delta_comments_data.csv'
nodeltas_data_file = '/home/shared/CMV/SortedData/nodelta_comments_data.csv'
The input needs to be changed to be the Slimmed version of the Comments TextData. The last three are output files what will store the file with the ID's of all delta-winning comments, and the separated Slimmed Data Files for delta-winning comment and nondelta-winning comments.
Inside the main.py is where the comments data will be sampled. It randomly selects a number of comments to train and test on the prediction algorithms. You will find the following code in the if __name__ == '__main__':
:
delta_file = "/home/shared/CMV/SortedData/delta_comments_data.csv"
nodelta_file = "/home/shared/CMV/SortedData/nodelta_comments_data.csv"
delta_sample_file = "/home/shared/CMV/SampledData/delta_sample_data.csv"
nodelta_sample_file = "/home/shared/CMV/SampledData/nodelta_sample_data.csv"
The delta_file is the input file with all the delta-comments data, and the nodelta_file is the location of the input file with all the nodelta-comments data. You can then change the two input sample files to specifiy the location of the output files.
Inside the main.py is where the comments data will turn into usable numeric features for our data. You will find the following code in the if __name__ == '__main__':
:
delta_input = "/home/shared/CMV/SampledData/delta_sample_data.csv"
nodelta_input = "/home/shared/CMV/SampledData/nodelta_sample_data.csv"
word_list_input = "../data/word_list.csv"
output_delta = "/home/shared/CMV/FeatureData/delta_sample_feature_data.csv"
output_nodelta = "/home/shared/CMV/FeatureData/nodelta_sample_feature_data.csv"
For the inputs, you change their locations to their associated name input. The word_list_input variable gets set to the location of a csv containing more categories/list of words to turn into features. The output files saved are then the locations of the data files with each column as a calculated numeric feature for both delta-winning comments and nondelta-winning comments.
Once this been completed you can then run the main.py file.
model = models.RandomForet()
creates for us our prediction model.
X_train, X_test, y_train, y_test = engineer.run(model = model)
separated our data into train and test for accuracy.
Next is to train the model and create our predictions with our model as shown below.
model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Finally, our model gets scored for us to see the accuracy.
score = accuracy.score(y_pred, y_test)
print(score)
The accaracy of each models is
The confusion matrix for Random Forest
The features importance for Random Forest
The original CSV data from the subreddit came as raw data with many columns that were not useful to use thus far in our predictions. equivalent. The document is also massive so we do cut unnecessary columns from the CSV data.
We have two functions that we use to slim the documents inside of the slimmer.py file.
# Takes a single string as inputs for input and output.
# Takes a list of columns you wish to to keep as input for columns.
# Write to the output file name the modified or slimmed data.
slim(input, output, columns)
# Takes a list of string as inputs for input and output.
# Takes a 2d list of columns you wish to to keep as input for columns.
# Write to the output file names the modified or slimmed data.
slim_all(inputs, outputs, columns_lst)
In order to run Prediction Algorithms on the Delta-Winning comments, the Delta-Winning Comments must first be found. The code searches all comments for either !delta or Δ, and adds the parent of the delta comment to a list of Delta-Winning comments. Some comments that are Delta-Winning may not be in the Meta Data, in which case we add it to a bad_deltas category, otherwise the comments go into a good_deltas category. Inside the labeler.py file there is a function that returns the list of delta comments.
# This takes in the slimmed data as input and sales to the output
# file a list of delta-winning comment ID's.
# Also returns the list of delta-winning comments ID's.
get_deltas(input, output)
Once that the deltas have been found, the comments data can not split comments into delta-winning comments or non-delta winning comments. The command goes through all of the comment data and if they are in our list of deltas, add them to a list of delta comments, otherwise add them in a list of non-delta comments. In our labeler.py file we have a function that does this.
# The input file is location of our data. The output files are the locations you would like to save the split data into. The delta_files is the location of the file that contain the list of all deltas.
create_labels(input, output_delta, output_nodelta, deltas_file)
Then the features must be generated in order to classify our data. The features.py file contains all the necessary functions to generate our features file. In order to generate the features and to save them you must run the function:
# the inputs parameter is a list of all files you would like to generate features from.
# The outputs parameter is the list of files you would like to then write to.
# The word_list_input will contain the data for a category of persuasive linguistic features.
# The wanted_features parameter is a string that consists of between 1-4 letters for which features you would like to generate
generateFeatures(inputs, ouputs, word_list_input, wanted_features)
If wanted_features contains the letter 'o'
, these features will be generated:
- certainty_count
- extremity_count
If wanted_features contains the letter 'c'
, these features will be generated:
- lexical_diversity_rounded
- char_count_rounded
- link_count
- quote_count
- questions_count
- bold_count
- avgSentences_count
- enumeration_count
- excla_count
If wanted_features contains the letter 'n'
, these features will be generated:
- high arousal
- low arousal
- medium arousal
- medium dominance
- low dominance
- high dominance
- high valence
- low valence
- medium valence
- examples
- hedges
- self references
If wanted_features contains the letter 'v'
, then using a Count Vectorizer, we generate a list of vectors containing the count of words in a comment. We then return this vector count as features.
Inside of the models.py there are certain preset models that you can use just by setting model = ModelName()
.
The models you can use are:
- RandomForest()
- AdaBoost()
- GradientBoosting()
- LogisticRegression()
- DecisionTree()
- GaussianNB()
- BernoulliNB()
- SVM()
- MLP()
- SGD()
Testing the accuracy is simple:
score = accuracy_score(y_pred, y_test)
In the folder textgenration you can find a GPT2 model to generate CMV comments fomr a submission. You can run train_model.py and then generate_text.py.
For example with the following submission:
Seeing a person dressed as a woman that is clearly a man will never not be weird. I know gender dysphoria is real and transgender people exist. I know that the world would be better off accepting that fact. But in the push to accept transgender people in society, there seems to be this awkward game of pretend that is expected to be played. It is one thing to simply adjust the pronouns you use to make them comfortable, but it is another thing entirely to be expected to convince yourself that they are the gender in which they identify. If you meet a trans woman that is clearly trans (most notably ones that used to be men with clear male facial features, endomorph or even receding hairlines) you are expected to actually believe she is a woman on a biological level. There are some very lucky biological men that can transition and be incredibly convincing, but most aren't, and if you aren't one of those lucky people, you will never be seen as an actual woman no matter how much pretend we are expected to play. We can all act like they are women, but the brain already sorts them out as "men dressed as women" way before any conscious decision is made, and it will always be weird and awkward seeing these individuals no matter how nice and accepting we are about it.
Example 1 of generated comment:
I believe you are missing the point of gender. The notion of gender is a social construct. It is not something you can change. It is a construct that can and should be changed. In my opinion, gender is not a construct you can actually change. It is not something that people are forced to "change". It is a set of behaviors that people can and should mimic.
The issue with this is that you are saying that being forced to be a woman is like being forced to be a man. I am not talking about "being forced to be a man..." but "being forced to be a woman is like being forced to be a man." I would like to think you are aware of this, but I am not. I am a man, and I am talking about being forced to be a man in a non-threatening manner. That does not make it any less "manly." It just makes it a little more "manly." This is a universal experience, and one that many trans people are forced to face. I am not saying you should not be forced to be a woman, but I am saying you should be forced to be a man. I am not talking here about "being forced to be a man is wrong" or "being forced to be a man is wrong." I am not saying it is okay to be forced to be a man in public, but I am saying it is ok to be forced to be a woman in private. Just think about how uncomfortable it is for you to be forced to be a man in public, and how uncomfortable it will be for you to be forced to be a woman in the same manner. I am willing to bet you that many trans people will not be forced to be a woman, but many will. If you are truly bothered by this, you need to consider what the purpose of the feminine would be in relation to the masculine
Example 2 of generated comment:
I would go so far as to say that the transgender community has historically, and purposefully, been underfunded, underfunded, and underfunded. It's actually been a fairly decent start to my post, but at a really low level. I believe it was around $50M for the whole thing. At that point, I was basically looking for a buzzword, and a typo. I think it was around $20M for the first two movies. They made a deal with the studios to make a second Terminator movie, and they were basically giving us a fourth movie. When that came around, the budgets were basically the same, except for a few minor bits. I think that's why I believe that the transgender community has had such a negative influence on the rest of Hollywood, and the rest of the entertainment industry. I think that it's been an underfunded community, and I think that it's been underfunded for a long time, but I don't think it was the first time that the LGBT community has been underfunded, either. I think you're missing the point. I think the money was spent on the very first two movies, and that the people that made them, the people that made them financially, the people that made them financially, weren't really paying their bills. That's something that are even more relevant now, because of the influence the LGBT community has had on the rest of the entertainment industry, because you didn't have a chance to really have a say in it. I think that it's been years of intense negotiation, both within the LGBT community and within the industry itself, and it was basically this:
- How much money should we make to help and defend the community?
- What do you think of the current high standards of the industry? I think that your question is a little fuzzy, and you kind of need to narrow it down a bit. You're really only asking this in the sense that you're asking it in the sense that you're asking it in the sense that you're asking it in the sense that you're asking it in the sense that you're asking it.
I think that if you're serious about fighting this, you should talk to some smart people. You need to be willing to negotiate, but you need to be willing to talk to people who are willing to pay for it, and they need to be willing to pay for it. If you're not willing to pay, you're not going to get a decent representation. If you're willing to pay, you're not going to get a good one. Ultimately, you need to make sure that you're making a good enough case to convince people. If you're not in good company, you should be ashamed of yourself. It's really not a matter of whether or not you're in good company, but who is in good company? EDIT: Sorry for the lengthy post, but I couldn't find a good post with all of the information I needed. I'll continue to post as I get more and more errors.
Example 3 of generated comment:
I'm aware that this may be a little off topic, but I'm not going to lie about my attitude towards this. I'm not surprised by it. I've been taught that trans people are basically nothing more than caricatures of men, and thus women. I'm not even sure that trans people can be expected to be feminine in any way, but I assume that's the reason behind my not taking gender dysphoria personally. If I were a cis man, I'd be pissed. I'm not saying that I'm against trans people being able to transition, I'm not arguing. But I'm not saying that men dressed as women, I'm simply saying that my view is completely wrong. All that said, I do think that there is room for a bit of debate on this. I think that trans people should be treated equally, though I do think that trans people can be expected to wear the gender-shifting clothing and hairstyles that they choose, regardless of gender. And I think that trans people can be treated fairly, though I am open to that as well. But this isn't how it works. I've met a lot of trans people, and I've been treated pretty badly. I'm not saying that I'm feeling bad about it, just that I don't think that it's entirely fair. I think that cis people who complain about being treated like shit are just being transphobic, and that's totally unfair.