## Imports

In [1]:
import pandas as pd

Read the subreddit_stats CSV file from the data/raw folder into a pandas DataFrame

In [2]:
df = pd.read_csv("../data/raw/subreddit_stats.csv")
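
Before filtering, it can help to confirm that the loaded frame has the columns the later cells rely on. The sketch below is only an illustration: `REQUIRED_COLUMNS`, `missing_columns`, and the `sample` frame are hypothetical names and made-up values, not part of the original notebook. The column names are copied from the cells in this notebook, including the spelling "Avgerage Comments".

```python
import pandas as pd

# Columns used by later cells; "Avgerage Comments" is spelled as in the notebook.
REQUIRED_COLUMNS = ["Subreddit", "Avgerage Comments", "Posts Scanned", "Total Comments"]


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [name for name in required if name not in frame.columns]


# Hypothetical stand-in for the raw CSV, with made-up values.
sample = pd.DataFrame({
    "Subreddit": ["programming", "gaming"],
    "Avgerage Comments": [10.0, 20.0],
    "Posts Scanned": [100, 100],
    "Total Comments": [1000, 2000],
})

# An empty list means every expected column is present.
print(missing_columns(sample))
```

On the real data, the same call would be `missing_columns(df)`.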

## Technology

The goal is to extract the subreddit and comments-per-post columns for the technology niche *only* and save the result as a
processed CSV file

In [3]:
# Technology subreddits
technology = [
    "programming",
    "machinelearning",
    "technology",
]

# Get only the 2 columns
df_technology = df[["Subreddit", "Avgerage Comments"]]

# Only get data from technology niche
df_technology = df_technology[df_technology['Subreddit'].isin(technology)]

This will be repeated for the remaining 2 niches:

* Gaming
* Fitness & Health
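
Since the same two steps (select columns, then filter rows) are repeated for every niche, they could be wrapped in a small helper. This is a minimal sketch, assuming the column names used in this notebook; `select_niche` and the `sample` frame are hypothetical and only serve as an illustration.

```python
import pandas as pd


def select_niche(frame, subreddits, columns=("Subreddit", "Avgerage Comments")):
    """Keep only `columns`, then only the rows whose Subreddit is in `subreddits`."""
    subset = frame[list(columns)]
    return subset[subset["Subreddit"].isin(subreddits)]


# Hypothetical stand-in for the raw data, with made-up values.
sample = pd.DataFrame({
    "Subreddit": ["programming", "gaming", "fitness", "technology"],
    "Avgerage Comments": [10.0, 20.0, 30.0, 40.0],
    "Posts Scanned": [100, 100, 100, 100],
})

sample_technology = select_niche(sample, ["programming", "machinelearning", "technology"])
print(sample_technology)
```

With a helper like this, each niche cell reduces to one call, for example `select_niche(df, gaming)`.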

## Gaming

In [4]:
# Gaming subreddits
gaming = [
    "gaming",
    "Minecraft",
    "leagueoflegends",
]

# Get only the 2 columns
df_gaming = df[["Subreddit", "Avgerage Comments"]]

# Only get data from the gaming niche
df_gaming = df_gaming[df_gaming['Subreddit'].isin(gaming)]

## Fitness & Health

In [5]:
# Fitness subreddits
fitness = [
    "fitness",
    "running",
    "nutrition",
]

# Get only the 2 columns
df_fitness = df[["Subreddit", "Avgerage Comments"]]

# Only get data from the fitness niche
df_fitness = df_fitness[df_fitness['Subreddit'].isin(fitness)]

Convert the dataframes into CSV files

In [6]:
df_technology.to_csv("../data/processed/subreddit_stats_tech.csv", index=False) # Technology
df_gaming.to_csv("../data/processed/subreddit_stats_gaming.csv", index=False) # Gaming
df_fitness.to_csv("../data/processed/subreddit_stats_fitness.csv", index=False) # Fitness

## Niche Comparison

Now we'll compare the niches. The data we want to build will look like this:

| Niche | Average Comments per Post |
|----------|----------|
| technology | ... |
| gaming | ... |
| fitness | ... |

For this we'll take the mean of the comments-per-post column within each niche.

In [7]:
mean_tech = df_technology["Avgerage Comments"].mean() # Technology
mean_gaming = df_gaming["Avgerage Comments"].mean() # Gaming
mean_fitness = df_fitness["Avgerage Comments"].mean() # Fitness
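
As an alternative to three separate means, the same numbers could be computed in one pass by labelling each subreddit with its niche and grouping. This is a sketch, not the notebook's method: `NICHES`, `subreddit_to_niche`, and the `sample` values are hypothetical. Like the cell above, it takes an unweighted mean over subreddits.

```python
import pandas as pd

# Niche membership, copied from the lists defined earlier in the notebook.
NICHES = {
    "Technology": ["programming", "machinelearning", "technology"],
    "Gaming": ["gaming", "Minecraft", "leagueoflegends"],
    "Fitness": ["fitness", "running", "nutrition"],
}

# Invert the mapping so each subreddit points to its niche label.
subreddit_to_niche = {sub: label for label, subs in NICHES.items() for sub in subs}

# Hypothetical stand-in for the raw data, with made-up values.
sample = pd.DataFrame({
    "Subreddit": ["programming", "technology", "gaming", "fitness", "running"],
    "Avgerage Comments": [10.0, 30.0, 50.0, 20.0, 40.0],
})

# Subreddits outside every niche map to NaN and are dropped by groupby.
labelled = sample.assign(Niche=sample["Subreddit"].map(subreddit_to_niche))
niche_means = labelled.groupby("Niche")["Avgerage Comments"].mean()
print(niche_means)
```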

Now we create a new dataframe with our custom columns

In [8]:
niche = pd.DataFrame(columns=["Niche", "Average Comments"])

Now we add one row for each of our niches

In [9]:
# Technology
technology_niche = pd.DataFrame({
    "Niche": "Technology",
    "Average Comments": mean_tech
}, index=[0])

# Gaming
gaming_niche = pd.DataFrame({
    "Niche": "Gaming",
    "Average Comments": mean_gaming
}, index=[0])

# Fitness
fitness_niche = pd.DataFrame({
    "Niche": "Fitness",
    "Average Comments": mean_fitness
}, index=[0])

# Concatenate all
niche = pd.concat([niche, technology_niche, gaming_niche, fitness_niche])
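
Concatenating onto an empty DataFrame can trigger a FutureWarning in some recent pandas versions. One way to sidestep that, sketched here under the assumption that the three means are already computed, is to build the table in a single constructor call. The numeric values below are placeholders, and `niche_direct` is a hypothetical name.

```python
import pandas as pd

# Placeholder values standing in for the means computed earlier.
mean_tech, mean_gaming, mean_fitness = 20.0, 50.0, 30.0

# Build all three rows at once instead of concatenating one-row frames.
niche_direct = pd.DataFrame({
    "Niche": ["Technology", "Gaming", "Fitness"],
    "Average Comments": [mean_tech, mean_gaming, mean_fitness],
})
print(niche_direct)
```

The rows here get a running index (0, 1, 2) rather than repeated zeros, which makes no difference to the CSV because it is written with `index=False`.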

Convert dataframe to CSV

In [10]:
niche.to_csv("../data/processed/niche_comparisons.csv", index=False)

## Overall Summary

This is the final processing step: the posts scanned and the total comments are summed across all subreddits. This is how we want it to look:

| Posts Scanned | Total Comments |
|----------|----------|
| ... | ... |


Initializing the dataframe

In [11]:
summary = pd.DataFrame()

Getting total of the posts scanned and total of the comments

In [12]:
total_posts = df["Posts Scanned"].sum() # Total posts
total_comments = df["Total Comments"].sum() # Total comments

total_summary = pd.DataFrame(
    {
        "Posts Scanned": total_posts,
        "Total Comments": total_comments,
    }, index=[0]
)

Now we finally concatenate everything and convert it to a CSV

In [13]:
# Concatenate Everything
summary = pd.concat([summary, total_summary])

# Convert to CSV
summary.to_csv("../data/processed/overall_summary.csv", index=False)
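
The one-row summary could also be produced without the empty starting frame, by summing both columns and transposing the result. This is a sketch under the assumption that both columns are numeric; `sample` and `summary_direct` are hypothetical names with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for the raw data, with made-up values.
sample = pd.DataFrame({
    "Posts Scanned": [100, 100, 50],
    "Total Comments": [1000, 2000, 500],
})

# sum() returns a Series indexed by column name; to_frame().T turns it into one row.
summary_direct = sample[["Posts Scanned", "Total Comments"]].sum().to_frame().T
print(summary_direct)
```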