# Gammar/Campbell NLP assignment

## Aims

Our aim is to analyse Reddit posts to discover trends in healthy eating on food subreddits. We will attempt the following:

- Download a series of posts from Reddit using its API
- Characterise these posts using exploratory analysis
- Predict whether a subreddit is a healthy eating subreddit based on its post content


## Why Reddit?

Reddit is a social media site where users post either links from around the web or text content that they've written themselves. These posts are subject to an upvote/downvote system, which causes popular submissions to rise higher on users' feeds. Upvotes on older posts are weighted less heavily, resulting in a constant feed of new, highly regarded content.

Reddit's style is casual, but serious. Posts are generally typed out in full sentences with emojis and reaction images being comparatively rare. This makes it an ideal candidate for natural language processing.

## Read in data

The code in this section is rendered as markdown rather than as an executable code block because it takes approximately 40 minutes to run. We've provided CSVs to save you the trouble. These are to be placed in a `data/` subdirectory.

Multireddits are collections of subreddits grouped by a common theme. We did a Google search for a "food multireddit" and [came across a general one](https://www.reddit.com/r/Cooking/comments/cg7lha/misc_heres_all_of_the_food_related_subreddits_i/) made by Reddit user [Nomeii](https://www.reddit.com/user/Nomeii/).

We used the [Python Reddit API Wrapper (praw)](https://praw.readthedocs.io/en/stable/index.html) to loop through each of these subreddits, picking up to 1000 of the most upvoted posts from the past year. We saved the result as a CSV, which forms the base of our subsequent analysis.

```python
import praw
import pandas as pd
import datetime as time
from colorama import Fore, Style
from pathlib import Path

# Make data directory for newly written data if it doesn't already exist
DATA_DIRECTORY = "data/"
Path(DATA_DIRECTORY).mkdir(parents=True, exist_ok=True)

POST_TIME_PERIOD = "year"

# Number of posts per subreddit to pull
N_TITLES = 1000

# Credentials are read from the "uls-healthyeating" section of the
# praw.ini file in the working directory
reddit = praw.Reddit("uls-healthyeating", check_for_async=False)

# Chose this multireddit as it has many food subreddits
food_multireddit = reddit.multireddit(name="food", redditor = "nomeii")

top_dict = {"subreddit" : [],
            "title" : [],
            "is_self" : [],
            "selftext" : [],
            "author" : [],
            "url" : [],
            "score" : [],
            "upvote_ratio" : [],
            "n_gilded" : [],
            "num_comments" : [],
            "permalink" : [],
            "created_utc" : []
            }

# Cycle over each of the subreddits, grab its top posts and append
# each post's fields to the shared dictionary

for index,subreddit in enumerate(food_multireddit.subreddits):
    subreddit_name = subreddit.display_name
    
    #Subreddits to appear in red
    print("\n[", time.datetime.now(), "]", f"{Fore.RED}****{subreddit_name}****{Style.RESET_ALL}")
    print(f"Subreddit number: {index}")
    
    subreddit_data = subreddit.top(limit=N_TITLES, time_filter=POST_TIME_PERIOD)
    
    for post in subreddit_data:
        
        top_dict["subreddit"].append(subreddit_name)
        top_dict["title"].append(post.title)
        top_dict["is_self"].append(post.is_self)
        top_dict["selftext"].append(post.selftext)
        top_dict["author"].append(None if post.author is None else post.author.name)
        top_dict["url"].append(post.url)
        top_dict["score"].append(post.score)
        top_dict["upvote_ratio"].append(post.upvote_ratio)
        top_dict["n_gilded"].append(post.gilded)
        top_dict["num_comments"].append(post.num_comments)
        top_dict["permalink"].append(post.permalink)
        top_dict["created_utc"].append(post.created_utc)
    
# Combine dictionary into one large dataframe    
top_df = pd.DataFrame(top_dict)

top_df.to_csv(Path(DATA_DIRECTORY) / "reddit_data.csv", index=False)
```
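
With the CSV in place, the posts can be read back without calling the API again. The following is a minimal sketch, assuming the file sits at `data/reddit_data.csv` as written by the script above; `load_reddit_data` is a hypothetical helper name and not part of the original analysis.

```python
from pathlib import Path

import pandas as pd


def load_reddit_data(csv_path="data/reddit_data.csv"):
    """Read the saved posts CSV into a DataFrame (hypothetical helper)."""
    csv_path = Path(csv_path)
    if not csv_path.exists():
        raise FileNotFoundError(f"Expected the provided CSV at {csv_path}")

    posts = pd.read_csv(csv_path)

    # Empty post bodies are written to CSV as empty fields and read back
    # as missing values, so restore them to empty strings.
    posts["selftext"] = posts["selftext"].fillna("")
    return posts
```

Calling `load_reddit_data()` with no arguments reads from the default location; pass a different path if the CSVs live elsewhere.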

## Data description

The Reddit data is a 13-column dataset with ~70K rows. Among other fields, it contains post titles, author names, subreddit names, upvote scores, post times and, where available, post text. The most recent version of the data was retrieved from the Reddit API on 2023-03-08.
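
As a first look at the data, the sketch below converts the `created_utc` column (Unix time in seconds, as returned by praw) into timezone-aware timestamps and counts posts per subreddit. The helper names and the `created_at` column name are assumptions made for illustration, not names used in the original analysis.

```python
import pandas as pd


def add_post_datetime(posts):
    """Return a copy of `posts` with a UTC `created_at` column (assumed name)."""
    out = posts.copy()
    # created_utc holds seconds since the Unix epoch
    out["created_at"] = pd.to_datetime(out["created_utc"], unit="s", utc=True)
    return out


def posts_per_subreddit(posts):
    """Count how many posts were collected from each subreddit."""
    return posts["subreddit"].value_counts()
```

Because each subreddit contributes at most `N_TITLES` posts, the counts also show which subreddits had fewer top posts available over the year.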

We've also manually created a mapping from each subreddit to our expert opinion on whether it relates to healthy eating. Subreddits explicitly about healthy eating or plant-based diets were considered healthy, as were home-cooking subreddits. We made this mapping to provide labels for supervised learning in our models.
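
One way to attach such a mapping to the posts is sketched below. It assumes the mapping is available as a dictionary from subreddit name to a boolean; the function name `label_posts`, the `is_healthy` column and the subreddit names in the usage example are all hypothetical and do not reflect our actual labels.

```python
import pandas as pd


def label_posts(posts, healthy_map):
    """Add a boolean `is_healthy` column from a subreddit -> bool mapping."""
    out = posts.copy()
    out["is_healthy"] = out["subreddit"].map(healthy_map)

    # Fail loudly if a subreddit has no label, rather than silently
    # passing missing values on to a model.
    missing = out.loc[out["is_healthy"].isna(), "subreddit"].unique()
    if len(missing) > 0:
        raise KeyError(f"No label for subreddits: {sorted(missing)}")

    out["is_healthy"] = out["is_healthy"].astype(bool)
    return out


# Usage with made-up subreddit names
example_posts = pd.DataFrame(
    {"subreddit": ["ExampleHealthySub", "ExampleDessertSub", "ExampleHealthySub"]}
)
example_map = {"ExampleHealthySub": True, "ExampleDessertSub": False}
labelled = label_posts(example_posts, example_map)
```

Raising on unlabelled subreddits is a design choice for this sketch: it keeps the label column strictly boolean, which most classifiers expect.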

## Modelling healthy-eating subreddits

## Future avenues of research