# Data Science Project
#### Christopher Healy D00270638
#### Using Data Science to predict the popularity of Reddit posts.

# Step 1: Business Understanding
Reddit is a highly popular social media platform used by a wide range of individuals, entrepreneurs, and businesses. Its popularity offers posters the potential of a global reach across an extremely diverse range of topics.

### Audience:
Many people use Reddit as a means to promote themselves or their product. This includes but is not limited to:
- Entrepreneurs advertising products to niche markets.
- Musicians/performers trying to build a platform.
- "Influencers" trying to gain recognition.

The success/popularity of a Reddit post can be measured with "Karma", a numerical value which corresponds to the total sum of ratings on a post, i.e. upvotes minus downvotes.


### Aims of the project
Understanding which factors influence a Reddit post's Karma can help these groups maximise their reach. By analysing the information associated with Reddit posts and comparing it to their popularity, I hope to answer:
1. How is post length connected to post success?
2. Is the date/time of posting relevant to a post's success?
3. How does the "mood" found in a post's body influence its success?
4. How do attached photos impact post performance?
5. Are there any other factors that affect how a Reddit post will be received?

A post's "mood" can be determined from its content, and this will need to be done before any machine learning models can be trained. Similarly, if post photos are included in the data, they will need to be classified in some way so that their influence on post Karma can be measured quantitatively.
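As a rough illustration of what extracting a "mood" could look like, the sketch below scores text against two small word lists. The word lists, the function names `mood_score` and `mood_label`, and the scoring rule are all hypothetical and chosen only for this example; they are not the method used in this project, which would more likely rely on a full sentiment lexicon, a trained model, or an external API.

```python
import re

# Hypothetical, deliberately tiny word lists used only for illustration.
POSITIVE_WORDS = {"best", "love", "great", "funny", "good"}
NEGATIVE_WORDS = {"fear", "hate", "bad", "worst", "sad"}


def mood_score(text: str) -> float:
    """Return a score in [-1, 1]: (positive hits - negative hits) / total hits."""
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    hits = positive + negative
    if hits == 0:
        return 0.0  # no lexicon words found, so treat the text as neutral
    return (positive - negative) / hits


def mood_label(text: str) -> str:
    """Map the numeric score to a categorical label usable as a model feature."""
    score = mood_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for title in ["Indeed a Best job", "The fear of flying sandal is universal", "Train employee"]:
    print(title, "->", mood_label(title))
```

A categorical label like this can be encoded as a feature alongside the numerical variables, which is why the mood step has to happen before model training.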


# Step 2: Data Mining
The data is gathered from the Reddit API using PRAW, the Python Reddit API Wrapper.
Posts are collected from several popular subreddits to cover a range of themes and audiences.
The data fetched describes posts submitted to the following subreddits:
- r/funny
- r/todayilearned
- r/technology
- r/aww
- r/worldnews
- r/food
- r/gaming

See the separate Jupyter notebook "DataMining.ipynb".
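The mining notebook is not reproduced here. As a minimal sketch of the idea, the helper below flattens one submission object into a row with the columns used in this project. The name `submission_to_row` is hypothetical, and the example uses a stand-in object rather than a live PRAW submission so that it runs without credentials; the listing type (`hot`) and the limit in the comment are assumptions, not necessarily what the notebook uses.

```python
from types import SimpleNamespace

# Attributes copied directly from the submission object.
FIELDS = ["id", "title", "created_utc", "ups", "downs", "is_video", "selftext", "is_self"]


def submission_to_row(submission) -> dict:
    """Flatten a submission-like object into a plain dict (one CSV row)."""
    row = {field: getattr(submission, field, None) for field in FIELDS}
    # The subreddit attribute is an object in PRAW; store its name as text.
    row["subreddit"] = str(submission.subreddit)
    return row


# With real credentials (not shown), rows could be collected roughly as:
#   reddit = praw.Reddit(client_id=..., client_secret=..., user_agent=...)
#   rows = [submission_to_row(s) for s in reddit.subreddit("funny").hot(limit=100)]

# Stand-in object with the same attribute names, for demonstration only.
example = SimpleNamespace(
    id="1ou97dx", title="Indeed a Best job", subreddit="funny",
    created_utc=1762867000.0, ups=363, downs=0,
    is_video=False, selftext="", is_self=False,
)
row = submission_to_row(example)
print(row)
```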

# Step 3: Data Cleaning
### Datapoints used:
| Variable      | Type                 | Notes                                                         |
|---------------|----------------------|---------------------------------------------------------------|
| id            | Nominal Categorical  | Not important                                                 |
| title         | Nominal Categorical  | Theme can be extracted from content. Length may be relevant   |
| subreddit     | Nominal Categorical  | Important                                                     |
| created_utc   | Discrete Numerical   | May be useful for determining if time of day impacts success. |
| ups           | Discrete Numerical   | Important                                                     |
| downs         | Discrete Numerical   | Important                                                     |
| is_video      | Nominal Categorical  | Important                                                     |
| selftext      | Nominal Categorical  | Theme can be extracted, length may be important               |
| is_self       | Nominal Categorical  | Possibly relevant                                             |
| title_length  | Discrete Numerical   | (Calculated Manually) Possibly relevant                       |
| post_length   | Discrete Numerical   | (Calculated Manually) Possibly relevant                       |
| up_down_ratio | Continuous Numerical | (Calculated Manually) Important                               |
| post_theme    | Nominal Categorical  | (Calculated through API) Possibly important                   |
| score         | Discrete Numerical   | (Calculated Manually) Used to determine post success          |
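The manually calculated columns could be derived along the following lines. This is a sketch under assumptions: lengths are taken as character counts, `score` follows the Karma definition of upvotes minus downvotes, and `up_down_ratio` is assumed to be `ups / downs`, left undefined (NaN) when a post has no downvotes. The small `posts` frame reuses two rows from the dataset purely for demonstration.

```python
import numpy as np
import pandas as pd

# Two rows from the dataset, used only to demonstrate the calculations.
posts = pd.DataFrame({
    "title": ["Indeed a Best job", "The irony of this ad in this article"],
    "selftext": [None, "Ironic humor brought to you by an algorithm..."],
    "ups": [363, 126],
    "downs": [0, 0],
})

# Assumption: "length" means number of characters (word counts would also work).
posts["title_length"] = posts["title"].str.len()
# Link and media posts have no body text, so treat missing text as empty.
posts["post_length"] = posts["selftext"].fillna("").str.len()

# Karma as defined earlier: upvotes minus downvotes.
posts["score"] = posts["ups"] - posts["downs"]

# Assumption: ratio = ups / downs; zero downvotes gives NaN instead of infinity.
posts["up_down_ratio"] = posts["ups"] / posts["downs"].replace(0, np.nan)

print(posts[["title_length", "post_length", "score", "up_down_ratio"]])
```

In the rows used here `downs` is 0, so the ratio comes out undefined for both posts; that is worth checking across the full dataset before relying on this column.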

In [11]:
#imports
import pandas as pd
import numpy as np




In [12]:
#load data
reddit_data = pd.read_csv("reddit_data.csv", quotechar='"', escapechar='\\')
reddit_data.head()



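| Unnamed: 0 | id      | title                                   | subreddit | created_utc  | ups  | downs | is_video | selftext                                       | is_self |
|------------|---------|-----------------------------------------|-----------|--------------|------|-------|----------|------------------------------------------------|---------|
| 0          | 1ou9lw7 | Train employee                          | funny     | 1762868000.0 | 0    | 0     | True     |                                                | False   |
| 1          | 1ou97dx | Indeed a Best job                       | funny     | 1762867000.0 | 363  | 0     | False    |                                                | False   |
| 2          | 1ou7n0f | The fear of flying sandal is universal  | funny     | 1762862000.0 | 2139 | 0     | True     |                                                | False   |
| 3          | 1ou4p4t | Your lawn guy knows something you don't | funny     | 1762852000.0 | 2610 | 0     | True     |                                                | False   |
| 4          | 1ou3mas | The irony of this ad in this article    | funny     | 1762848000.0 | 126  | 0     | False    | Ironic humor brought to you by an algorithm... | False   |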
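Two cleaning steps suggest themselves from this output: the `Unnamed: 0` column looks like a row index written out with the CSV, and `created_utc` is a Unix timestamp in seconds. The sketch below shows one way both could be handled, using two of the rows above as inline sample data. The column names `created_at` and `hour_utc` are hypothetical additions for illustration, not part of the original dataset.

```python
import io

import pandas as pd

# Two rows from the output above, inlined so the example runs without the CSV file.
csv_text = """,id,title,subreddit,created_utc,ups,downs,is_video,selftext,is_self
0,1ou9lw7,Train employee,funny,1762868000.0,0,0,True,,False
1,1ou97dx,Indeed a Best job,funny,1762867000.0,363,0,False,,False
"""

# index_col=0 reads the first column as the index instead of "Unnamed: 0".
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Convert Unix seconds to timezone-aware timestamps, then extract the hour (UTC).
sample["created_at"] = pd.to_datetime(sample["created_utc"], unit="s", utc=True)
sample["hour_utc"] = sample["created_at"].dt.hour

print(sample[["title", "created_at", "hour_utc"]])
```

An hour-of-day column like this gives a direct way to test whether posting time relates to `score`, although the hour is in UTC and not the local time of the poster or the audience.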
