# Data Collection From Reddit

The data was collected from Reddit, a social media platform. The following subreddits were considered:

- r/worldnews
- r/news
- r/UpliftingNews
- r/InternationalNews
- r/GlobalClimateChange
- r/politics

The columns considered for analysis are:

- Title: The post title
- Score: Net score of the post, equal to upvotes minus downvotes
- Time: Creation time of the post, stored as a Unix timestamp in seconds
- Comments: Number of comments
- UpvoteRatio: Ratio of upvotes to total votes
- PostId: Unique identifier for the post
- Subreddit: Name of the subreddit
- SelfText: The text body of the post
- IsSelfText: Boolean indicating whether the post has self text (in some cases the post is a link or a picture instead)
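
The definitions of Score and UpvoteRatio together allow a rough estimate of the individual vote counts. With $u$ upvotes and $d$ downvotes, Score is $u - d$ and UpvoteRatio is $u / (u + d)$, which gives $u = \text{Score} \cdot \text{UpvoteRatio} / (2 \cdot \text{UpvoteRatio} - 1)$. The sketch below is only an illustration: `estimate_votes` is a hypothetical helper that is not part of the notebook, and because the reported ratio is rounded, the result should be treated as an approximation.

```python
from typing import Tuple


def estimate_votes(score: int, upvote_ratio: float) -> Tuple[int, int]:
    """Estimate (upvotes, downvotes) from the net score and the upvote ratio.

    Hypothetical helper for illustration; it assumes
    score = upvotes - downvotes and upvote_ratio = upvotes / (upvotes + downvotes).
    """
    if upvote_ratio == 0.5:
        # Equal upvotes and downvotes: the score is 0 and the totals are not recoverable
        raise ValueError("votes cannot be recovered when the ratio is exactly 0.5")
    upvotes = round(score * upvote_ratio / (2 * upvote_ratio - 1))
    downvotes = upvotes - score
    return upvotes, downvotes


# Example with made-up numbers: 80 upvotes and 20 downvotes
ups, downs = estimate_votes(score=60, upvote_ratio=0.8)
print(ups, downs)
```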

## Libraries Installation and Imports

In [6]:
from IPython.display import clear_output

!pip install praw

clear_output()

In [7]:
import warnings
import pandas as pd
import csv
import praw
import os
warnings.filterwarnings('ignore')

## Reddit API client configuration

In [2]:
client_id = os.environ['CLIENT_ID']
client_secret = os.environ['CLIENT_SECRET']
user_agent = os.environ['USER_AGENT']

In [9]:
# Create a Reddit API client

reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent=user_agent
)
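
Reading the credentials with `os.environ[...]` raises a bare `KeyError` when a variable is not set. A minimal sketch of a friendlier check is shown here, assuming the same three variable names as above; `load_reddit_credentials` is a hypothetical helper, not part of PRAW or of the original notebook.

```python
import os

# Variable names taken from the configuration cell above
REQUIRED_VARS = ("CLIENT_ID", "CLIENT_SECRET", "USER_AGENT")


def load_reddit_credentials(env=None):
    """Return the credentials as keyword arguments, or fail with a clear message.

    Hypothetical helper; `env` defaults to os.environ and can be replaced
    by a plain dict for testing.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["CLIENT_ID"],
        "client_secret": env["CLIENT_SECRET"],
        "user_agent": env["USER_AGENT"],
    }


# Possible usage (not executed here): reddit = praw.Reddit(**load_reddit_credentials())
```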

## Get data from API and store it in CSV file

In [8]:
# Functions used to pull data from Reddit

def create_dict_posts(search, kind):
    """Transform the posts of a subreddit into a list of dictionaries."""
    # Fetch the newest posts when kind == 'new', otherwise the top posts
    posts = search.new(limit=None) if kind == 'new' else search.top(limit=None)
    posts_list = []
    for post in posts:
        # One dictionary per post, keyed by the CSV column names
        posts_list.append({
            "Title": post.title,
            "Score": post.score,
            "Time": post.created,
            "Comments": post.num_comments,
            "UpvoteRatio": post.upvote_ratio,
            "PostId": post.id,
            "Subreddit": post.subreddit.display_name,
            "SelfText": post.selftext,
            "IsSelfText": post.is_self,
        })
    return posts_list


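def store_in_CSV(file_name, posts):
    """Append the posts to a CSV file, writing the header if the file is empty."""
    fields = ['Title', 'Score', 'Time', 'Comments', 'UpvoteRatio', 'PostId', 'Subreddit', 'SelfText', 'IsSelfText']
    # Explicit UTF-8 so that non-ASCII titles are written consistently across platforms
    with open(file_name, 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fields)
        # In append mode, position 0 means the file is new or empty
        if csvfile.tell() == 0:
            writer.writeheader()
        writer.writerows(posts)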

def scrap_dataset(subreddits, file_path, reddit, kind='new'):
    """Create a CSV file from Reddit data and return its contents as a DataFrame."""
    for sub in subreddits:
        search_results = reddit.subreddit(sub)
        posts = create_dict_posts(search_results, kind)
        store_in_CSV(file_path, posts)
    df = pd.read_csv(file_path)
    print(df.shape)
    return df
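
Because `store_in_CSV` opens the file in append mode, running the collection more than once can write the same post several times. One possible way to handle this is sketched below, assuming that `PostId` uniquely identifies a post; `drop_duplicate_posts` is a hypothetical helper that is not part of the original notebook.

```python
import pandas as pd


def drop_duplicate_posts(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per PostId, preferring the most recently appended row.

    Hypothetical helper: rows appended later in the CSV are assumed to
    carry the most up-to-date score and comment count.
    """
    return df.drop_duplicates(subset="PostId", keep="last").reset_index(drop=True)


# Small made-up example: post "abc" was collected twice
example = pd.DataFrame({
    "PostId": ["abc", "def", "abc"],
    "Score": [10, 5, 12],
})
deduplicated = drop_duplicate_posts(example)
print(deduplicated)
```

Keeping the last occurrence is a design choice; `keep="first"` would preserve the values from the earliest collection run instead.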

In [16]:
# Define the list of news-related subreddits

subreddits = ['worldnews', 'news', 'UpliftingNews', 'InternationalNews', 'GlobalClimateChange', 'politics']
df = scrap_dataset(subreddits, './data/news.csv', reddit, kind='new')
clear_output()

In [17]:
df.head()

| | Title | Score | Time | Comments | UpvoteRatio | PostId | Subreddit | SelfText | IsSelfText |
|---|---|---|---|---|---|---|---|---|---|
| 0 | An anti-gay Hungarian politician has resigned ... | 204544 | 1606847000.0 | 8392 | 0.93 | k4qide | worldnews | | False |
| 1 | Trump Impeached for Abuse of Power | 202899 | 1576719000.0 | 20000 | 0.88 | eclwg9 | worldnews | | False |
| 2 | Vladimir Putin's black belt revoked by interna... | 200152 | 1646081000.0 | 6904 | 0.89 | t3pgaz | worldnews | | False |
| 3 | Two weeks before his inauguration, Donald J. T... | 189351 | 1531966000.0 | 18004 | 0.84 | 901p5f | worldnews | | False |
| 4 | Queen Elizabeth II has died, Buckingham Palace... | 189029 | 1662658000.0 | 16452 | 0.79 | x96k3v | worldnews | | False |


In [18]:
df.shape

(10363, 9)
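
The `Time` column holds raw numbers such as `1606847000.0`, which are hard to read. A minimal sketch of converting them to timestamps with pandas follows, assuming the values are Unix epoch seconds; the two sample values are copied from the `df.head()` output above, and `sample` is a stand-in for the full DataFrame.

```python
import pandas as pd

# Stand-in for the collected DataFrame, using two Time values from the output above
sample = pd.DataFrame({"Time": [1606847000.0, 1662658000.0]})

# Interpret the numbers as seconds since the Unix epoch, in UTC (assumption)
sample["Time"] = pd.to_datetime(sample["Time"], unit="s", utc=True)
print(sample)
```

The same call could be applied to `df["Time"]` before any time-based analysis.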