# Scraping data with the PRAW library

## Introduction

In this Python notebook, I set up everything needed to scrape data from Reddit. Reddit has its own API for accessing its data. I chose PRAW because it respects Reddit's terms of use, is open source, and has the support of a large community and of Reddit itself. For scraping data from Reddit, I recommend PRAW over other scraping libraries.

This is also the moment to decide which data model best suits your needs. In this project, I separate comments and posts into two different .csv files. Both files share a column called post_id, which is the unique ID of each post and lets me trace every comment back to the post it came from. I chose this model because it makes it easier for future researchers to work with the text of the comments on its own.

Another option is a hierarchical format, such as a JSON file, with the comments nested inside each post. While this is also a valid way to organize the data, it would require more processing power when analyzing only the comments, since any later analysis would need to read every post just to reach the nested comments.
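To make this trade-off concrete, here is a minimal sketch of how two flat tables can be analyzed separately and linked back together through post_id. The column names follow the ones used later in this notebook, but the rows are made up for illustration and are not real Reddit data.

```python
import pandas as pd

# Illustrative rows only; not real Reddit data.
posts_df = pd.DataFrame(
    [["p1", "First post"], ["p2", "Second post"]],
    columns=["post_id", "title"],
)
comments_df = pd.DataFrame(
    [
        ["p1", "c1", "A comment"],
        ["p1", "c2", "Another comment"],
        ["p2", "c3", "A reply"],
    ],
    columns=["post_id", "comment_id", "comment"],
)

# Comments can be analyzed on their own, without reading the posts table.
comments_per_post = comments_df.groupby("post_id").size()

# When the post context is needed, a join on post_id recovers it.
merged = comments_df.merge(posts_df, on="post_id", how="left")
print(merged[["comment_id", "title"]])
```

With a nested JSON file, the same comment-only count would first require walking through every post object.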

### Importing libraries

In this first part of the setup, I import the libraries the program needs: PRAW to access the Reddit API and scrape data, pandas to handle the tabular data, csv to work with CSV files, and os to use operating system functions.


In [5]:
import praw
import pandas as pd
import csv
import os


### Setting up the credentials to use the PRAW API

To use the Reddit API through PRAW, you need a Reddit account and a registered app. You can create the app here: https://www.reddit.com/prefs/apps/

This information is private and you should never share your keys with untrusted parties. 
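One way to keep the keys out of the notebook itself is to read them from environment variables. The sketch below is only an illustration: the helper `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` are hypothetical choices, not something PRAW or this notebook requires.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Read the three values passed to praw.Reddit from environment variables.

    The variable names below are hypothetical; any names would work.
    """
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Fail early with a clear message instead of sending empty credentials.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in names.items()}


# Usage, once the variables are set in your shell:
# reddit = praw.Reddit(**load_reddit_credentials())
```

This way the notebook can be shared or committed without exposing the keys.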

In [6]:
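reddit = praw.Reddit(client_id='xxx',       # shown at the top left of the app entry, under its name
                     client_secret='yyy',   # the app's "secret" field
                     user_agent='zzz')      # a short string describing your script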


So far, everything is working! Next, I test whether the connection is okay. The message below confirms whether the API is responding.

In [7]:
try:
    subreddit = reddit.subreddit("test")
    for submission in subreddit.new(limit=1):
        print(submission.title)
except Exception as e:
    print(f"Connection failed: {e}")
else:
    print("Connection successful")

test3
Connection successful


### Hands on! Performing the actual data scraping using PRAW and saving it in a CSV database

In this code, I use PRAW to search the CrusaderKings subreddit for the keywords "rape" and "raping". I created a list called posts and a set called processed_post_ids.

The set keeps track of the post IDs I have already processed, which prevents duplicate posts when a similar keyword returns the same result, as is the case with "rape" and "raping."
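The deduplication idea can be sketched in isolation. The example below uses plain dictionaries and neutral placeholder keywords as stand-ins for PRAW submissions, and the helper name `collect_unique` is hypothetical; it only illustrates how a set of seen IDs keeps each post once across several searches.

```python
def collect_unique(results_by_keyword):
    """Merge search results from several keywords, keeping each post ID once."""
    seen_ids = set()
    unique_posts = []
    for keyword, results in results_by_keyword.items():
        for post in results:
            if post["id"] in seen_ids:
                continue  # already collected under an earlier keyword
            seen_ids.add(post["id"])
            unique_posts.append((post["id"], post["title"], keyword))
    return unique_posts


# Stand-in data: plain dicts instead of PRAW submissions.
fake_results = {
    "alpha": [{"id": "a1", "title": "Post one"}, {"id": "a2", "title": "Post two"}],
    "alphas": [{"id": "a2", "title": "Post two"}, {"id": "a3", "title": "Post three"}],
}
print(collect_unique(fake_results))
```

A set is used rather than a list because membership checks stay fast as the number of collected IDs grows.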

The first part of the scraping is a search loop that retrieves every post with the keyword "rape" in its title or body. I then repeat the process with a similar keyword (i.e., "raping") to broaden the range of posts that discuss the same subject.

I store the post IDs, the titles, the body, the general score, the upvotes, and the downvotes from each Reddit post that was retrieved from this keyword search.

In some circumstances, a comment might not have a general score, an upvote count, or a downvote count. To prevent errors, I test whether these fields are empty and, if so, store a null value for them.
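An alternative way to guard against missing fields, different from the check used in this notebook, is to read each attribute with `getattr` and a default of `None`. The sketch below is only an illustration: the helper `comment_row` is hypothetical, and `SimpleNamespace` objects stand in for PRAW comments.

```python
from types import SimpleNamespace


def comment_row(post_id, comment):
    """Build one CSV row, substituting None for any missing attribute."""
    fields = ("id", "body", "score", "ups", "downs")
    return [post_id] + [getattr(comment, name, None) for name in fields]


# Stand-ins for PRAW comment objects (illustrative values only).
full = SimpleNamespace(id="c1", body="hello", score=3, ups=3, downs=0)
partial = SimpleNamespace(id="c2", body="no votes yet")

print(comment_row("p1", full))
print(comment_row("p1", partial))
```

The row layout matches the comment columns used in this notebook: post_id, comment_id, comment, score, ups, downs.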

Once the loops finish, I save everything I scraped from Reddit in two different files, one for the posts and another for the comments. Finally, I print a message saying that the posts and comments were saved successfully in their respective CSV files.

To help you follow each post as it is scraped, I also print the title of each post, followed by the keyword that retrieved it.

In [8]:
ck_subreddit = reddit.subreddit('CrusaderKings')

# Collect the rows for the posts CSV file
posts = []
processed_post_ids = set()  # IDs already collected, to avoid duplicates across keywords

for keyword in ('rape', 'raping'):
    for post in ck_subreddit.search(keyword, limit=1000):
        # Keep only posts whose title or body contains the keyword (case-sensitive)
        if keyword in post.title or keyword in post.selftext:
            if post.id not in processed_post_ids:
                posts.append([post.id, post.title, post.selftext, post.score, post.ups, post.downs])
                processed_post_ids.add(post.id)
                print(post.title, keyword)

posts_df = pd.DataFrame(posts, columns=['post_id', 'title', 'body', 'score', 'ups', 'downs'])
posts_df.to_csv('posts.csv', index=False)

# Collect the rows for the comments CSV file
comments = []
for post in posts:
    post_id = post[0]
    submission = reddit.submission(id=post_id)
    # Drop "load more comments" placeholders, which have no body or score
    submission.comments.replace_more(limit=0)
    for comment in submission.comments:  # top-level comments only
        if comment.score is None:
            # No score available: store null values instead
            comments.append([post_id, comment.id, comment.body, None, None, None])
        else:
            comments.append([post_id, comment.id, comment.body, comment.score, comment.ups, comment.downs])
            print('comment added')

comments_df = pd.DataFrame(comments, columns=['post_id', 'comment_id', 'comment', 'score', 'ups', 'downs'])
comments_df.to_csv('comments.csv', index=False)

print("CSV files successfully saved.")


Feudal vs Tribal rape
It pains me how unpolished this game is rape
Holy hell guys the crusades in this game actually happened we shouldn't joke about this stuff it was horrifying rape
good lord do the devs know the limits of compassion rape
Got my first immortal character (650 hours) during an Aztec rape of my empire! rape
I really don’t want to rape my husband. rape
What's the most depraved mod? I want rape, sexual slavery, sadism and everything else you have to offer! rape
All I did was raze their cities and rape their women. I don't get why they are so mad at me rape
Can you still start a war, capture the other king's wife, rape her, impregnate her, cut off her hands, tongue, eyes, then ransom her back to give birth to your spawn? rape
How can I ensure my heir doesn’t get raped by war each time I die? rape
The King of Scotland's only daughter claimed I raped her rape
Could I be the victim of Satanic rape? rape
A conversation with my marshal (2) rape
When the king of Denmark pisses y

### Last comments

As you can see, this code outputs many lines of text: the title of each post, and a message each time a comment is saved. In my test, it retrieved 138 posts and hundreds of comments from these posts.

The next step is preparing the data to use it!