# A Democratic World Wide Web
## An Introduction and Implementation of the Plebeian Algorithm to Freely Combat Misinformation
## By: Benjamin D. Fedoruk

### About the Speaker
Benjamin D. Fedoruk is a data scientist currently studying mathematics and computer science at Ontario Tech University. He has completed various data science projects, including research into carbon pricing efficacy and public transit systems in northern and rural communities and, most notably, research on reducing the spread of misinformation on social media. He is interested in NLP and data science at large.

### Workshop Overview
The Internet is filled with a vast array of information: some of it is true, some of it is false. But is there a programmatic method to effectively determine the verity of information without the need for human intervention? This is where the Plebeian Algorithm, which you will learn about in this workshop, shines. You will also learn how to implement a Plebeian-esque algorithm in your personal projects.

### Intended Audience
This workshop is for you if:

- you have an introductory level of data science skills in Python.
- you are interested in natural language processing.
- you want to learn applications of data science skills to real-world problems.
- you want to find solutions to the spread of misinformation. 
- you are intrigued by problem solving approaches using data science. 

### Topics Covered
Topics covered in this workshop include:

- Plebeian algorithms
- democratic moderation of content
- natural language processing, applied to a real-world issue
- expansion on the basics of Python data science. 

### Workshop Takeaways
By the end of this workshop, you will have worked with:

- Plebeian algorithms
- democratic moderation of content
- natural language processing, applied to a real-world issue
- expansion on the basics of Python data science. 

## Pre-Requisites

I am going to be using Jupyter Notebook for this course, but you can feel free to follow along in any other Python setup. 

If you haven't yet installed the core Python packages, here's what you'll need to run with `pip` at a Bash prompt. The same can be done in a similar way with Anaconda. I'm assuming that if you're attending this workshop, you know how to install packages for your specific Python setup!

In [111]:
%%capture
# Below is the command you should run in a prompt:
!pip install numpy matplotlib nltk pandas demoji seaborn urllib3 basc_py4chan bs4



## Imports
Below are the imports for all of the work we'll be doing on this file. 

And here are some of the basic declarations we'll need.

## Analyzing YouTube Comments from a Dataset
The first exploration we will do with the Plebeian Algorithm will be an analysis of YouTube comments from a dataset. The dataset should be contained in the same directory as this .ipynb file, saved as a CSV (comma-separated values) file. The file is named `youtube_comments_usa.csv`. 

Although it may be advisable to add a popularity check to the comments being analyzed, this approach will take all comments into account, as most comments have fewer than 5 likes or replies.

**Warning**: The comments in this dataset are not curated, and are gathered directly from YouTube. As such, there is a strong chance that some comments contain choice language or potentially triggering material. If this is a concern for you, I recommend that you avoid printing out the `comment_text` column of the dataset.

### Import the Dataset
Below, we shall import the dataset using pandas. Some lines in the CSV cannot be parsed and are skipped -- although this is unfortunate, this workshop will not focus on cleaning the retrieved data. Instead, we will simply skip those entries: we have enough data without them. The remaining rows are stored in a pandas DataFrame, which can essentially be thought of as a table, with rows and columns. The columns have headers, which we will print in the subsequent section.
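
As a rough sketch of this import step, assuming pandas 1.3 or newer (where `read_csv` accepts `on_bad_lines`), malformed rows can be dropped at read time. The helper name `load_comments` and the sample rows are hypothetical; only the file name `youtube_comments_usa.csv` and the `comment_text` column come from this workshop.

```python
import io

import pandas as pd


def load_comments(source):
    """Read a comments CSV into a DataFrame, skipping rows pandas cannot parse.

    `source` may be a file path or any file-like object. Rows with more
    fields than the header are dropped instead of raising a parser error.
    """
    return pd.read_csv(source, on_bad_lines="skip")


# Tiny in-memory stand-in for youtube_comments_usa.csv (made-up rows).
sample_csv = io.StringIO(
    "video_id,comment_text,likes,replies\n"
    "a1,great video,3,0\n"
    "b2,this,row,has,too,many,fields\n"
    "c3,not convinced,0,1\n"
)
comments = load_comments(sample_csv)

print(comments.head())         # first few rows
print(list(comments.columns))  # column headers
print(len(comments))           # number of rows that survived parsing
```

For the real dataset, the equivalent call would presumably be `load_comments('youtube_comments_usa.csv')`.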

### Preliminary Analysis of the Dataset
Here's some quick information about the dataset we're working with. I'll print off the first few comments, the columns, and the number of rows/comments. 

In [122]:
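def remove_links_translate_emoji(text: str) -> str:
    # Replace emoji with text descriptions, lowercase, then strip URLs (the leading \w+ consumes the whole scheme).
    return re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', demoji.replace_with_desc(text, ' ').lower())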
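
To see what the link-removal half of this cleaning step does, here is a standard-library-only sketch that leaves out the `demoji` call. The helper name `strip_links` is hypothetical, and the pattern starts with `\w+` so that the whole scheme (such as `https`) is consumed.

```python
import re

# Scheme, "://", host labels separated by dots, then any number of path segments.
URL_PATTERN = re.compile(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:\/[^\s/]*)*')


def strip_links(text: str) -> str:
    """Lowercase the text and delete anything that looks like a URL."""
    return URL_PATTERN.sub('', text.lower())


print(strip_links("Source: https://example.com/a/b LOL"))
```

Note that the spaces on either side of a removed link are kept, so the cleaned text can contain double spaces.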

And here's some of the basics:

### Data Cleaning and Sentiment Analysis
Let's do some quick cleaning and also generate a sentiment score for each comment. We are using VADER (Valence Aware Dictionary and sEntiment Reasoner), as described above. VADER is well suited to informal text with slang (see *Hands-On Python Natural Language Processing* by Kedia and Rasu), and YouTube comments certainly contain this kind of informal language.
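
The scoring step might look something like the sketch below. It is not the Plebeian Algorithm itself, only the per-comment VADER scoring that feeds it. The helper names (`build_vader`, `compound_scores`, `label`, `summarize`) are hypothetical, and the ±0.05 cut-offs are the thresholds conventionally used with VADER's compound score, not values taken from this notebook.

```python
from collections import Counter


def build_vader():
    """Create NLTK's VADER analyzer, downloading its lexicon on first use."""
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    return SentimentIntensityAnalyzer()


def compound_scores(texts, analyzer):
    """Return the compound score (between -1 and 1) for each text."""
    return [analyzer.polarity_scores(text)["compound"] for text in texts]


def label(compound, threshold=0.05):
    """Bucket a compound score using the conventional +/-0.05 cut-offs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def summarize(scores):
    """Count how many scores fall into each sentiment bucket."""
    return Counter(label(score) for score in scores)
```

With the cleaned comments in a list called, say, `cleaned_comments`, the call would be `summarize(compound_scores(cleaned_comments, build_vader()))`. Passing the analyzer in as an argument keeps the scoring helpers usable with any object that exposes `polarity_scores`.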

### Visualize The Results
Great work so far! We will now use seaborn to visualize the results. 

## Analyzing Reddit using API
Now we'll do another case study, using Reddit. We will be using `urllib` and the Pushshift API to grab posts on Reddit. These can include original posts and comments; however, for this exercise (for the sake of time) I will only be gathering the headlines. This is easier to code, but it should be noted that headlines alone may not yield meaningful results.

**Warning**: The text entries in this exercise will be gathered live during the workshop. I cannot guarantee that the language used will be free of profanity, and it may be triggering to some individuals. If this is a concern for you, I recommend that you avoid printing out the text entries gathered.

### Pre-Requisites for URLLib
Below are some quick setup procedures we need to follow for `urllib`. To save time, I have provided these in the pre-notes. 

In [123]:
import datetime
import json
import urllib.error
import urllib.request

def load_results(lower_bound_timestamp, upper_bound_timestamp, target_result_size, target_subreddit, score_threshold):
    """Return the set of submission titles Pushshift reports for the given subreddit and time range."""
    reddit_data_url = f'https://api.pushshift.io/reddit/submission/search/?after={lower_bound_timestamp}&before={upper_bound_timestamp}&sort_type=score&sort=desc&subreddit={target_subreddit}&size={target_result_size}&score={score_threshold}'

    try:
        with urllib.request.urlopen(reddit_data_url) as url:
            data = json.loads(url.read().decode())
        # A set drops duplicate headlines automatically.
        return {submission['title'] for submission in data['data']}
    except urllib.error.URLError:
        # HTTPError is a subclass of URLError, so this covers both; a failed request yields no results.
        return set()
    
headlines = set()
time_now = datetime.datetime.now()
limit_delta = 365
limit_lower_delta = 360
subreddit = "politics"
result_size = 1000
score_limit = ">1"

### Scrape from Reddit
We'll now scrape from Reddit using the method we created above. This step may take some time to execute.
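
One way the scraping loop could be organized is sketched below. It assumes that `limit_delta` and `limit_lower_delta` are offsets in days before `time_now`, and that the `after` and `before` parameters take epoch seconds; neither is confirmed by the notebook. The helper name `daily_windows` is hypothetical. Computing the time windows separately keeps that part checkable without touching the network.

```python
import datetime


def daily_windows(now, newest_days_ago, oldest_days_ago):
    """Return (after, before) epoch-second pairs, one per day, oldest first."""
    windows = []
    for days_ago in range(oldest_days_ago, newest_days_ago, -1):
        start = now - datetime.timedelta(days=days_ago)
        end = start + datetime.timedelta(days=1)
        windows.append((int(start.timestamp()), int(end.timestamp())))
    return windows


# Fixed, timezone-aware reference time so the output is reproducible.
reference = datetime.datetime(2021, 1, 1, tzinfo=datetime.timezone.utc)
for after, before in daily_windows(reference, 360, 365):
    print(after, before)
    # In the notebook, each window would be passed to the function defined above:
    # headlines |= load_results(after, before, result_size, subreddit, score_limit)
```

Requesting one day at a time keeps each response small, at the cost of one HTTP request per day in the range.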

### Get Sentiment Analysis Results
You know the drill -- time to do some sentiment analysis using nltk and VADER.

### Visualize the Results
Well, that was a little tougher than last time, but hopefully it feels somewhat familiar. At this point, you should be beginning to understand the idea we're going for.

## Analyzing 4chan using API
Let's do one final analysis using 4chan. We will be using `basc_py4chan` to grab posts on 4chan. The 4chan platform offers users visibility of only the top posts -- once a post is no longer in the top set of posts, it is removed. Thus, 4chan offers users ephemerality, and also anonymity. 

**Warning**: The text entries in this exercise will be gathered live during the workshop. I cannot guarantee that the language used will be free of profanity, and it may be triggering to some individuals. If this is a concern for you, I recommend that you avoid printing out the text entries gathered. I know I've been saying this for the past few sections, but this is the one time I'm going to nearly guarantee that you'll see some potentially disturbing material. As mentioned, this is due to the ephemerality and anonymity of 4chan. I'll avoid printing them out.

### Pre-Requisites
Below are some pre-requisites for this analysis. I haven't included these herein; I think these will be easy enough to type out. 

### Scrape /b/ from 4chan
We are going to be scraping the /b/ board from 4chan. 
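
A minimal sketch of the gathering step follows. It assumes the board object behaves like `basc_py4chan.Board`, with `get_all_threads()`, and with threads exposing `all_posts` whose items carry a `text_comment` attribute. Those names reflect my understanding of the library and are worth checking against its documentation. The helper `collect_post_texts` and its `max_threads` parameter are hypothetical.

```python
def collect_post_texts(board, max_threads=10):
    """Gather the non-empty plain-text comments from a board's threads."""
    texts = []
    for thread in board.get_all_threads()[:max_threads]:
        for post in thread.all_posts:
            comment = post.text_comment
            if comment:  # skip posts whose comment text is empty
                texts.append(comment)
    return texts
```

In the workshop setting, this would be called as `collect_post_texts(basc_py4chan.Board('b'))` after `import basc_py4chan`. The resulting list is kept in a variable, not printed, in line with the warning above.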

### Perform Sentiment Analysis on Results
You guessed it! Time to do the same process again! Let's use nltk and VADER to analyze the sentiment of these posts. 

### Visualize the Results
Time to see what this means, again! Let's plot these results! 