# Differentiating Between Two Opposing Subreddits

By: Christopher Kuzemka ([GitHub Repository](https://git.generalassemb.ly/chriskuz/project_3))

## Problem Statement

My girlfriend loves to use Reddit. One of her favorite subreddits is "r/aww", a community forum dedicated largely to cute animals and cute moments captured on video and on camera. However, there is another subreddit, known as "r/natureismetal", whose content is the complete opposite, and it was merged with "r/aww" to create a "super-subreddit" known as "r/dangerouslycute." The official reason for the merger is unknown, but the top rumor is that the moderators of both subreddits felt they had enough of a mutual following to justify consolidating all posts. This made my girlfriend very upset, as she was never a fan of the content from "r/natureismetal" and now her feed is tainted by it. We can imagine that many others are displeased by the merger as well.

As a good data scientist who wishes to make his significant other happy, I have decided to help my girlfriend make an app that will run JavaScript in the background on Reddit and ultimately separate the consolidated subreddit content. My girlfriend will write all the other code needed for the app, while we explore the jumbled data of "r/dangerouslycute" and create the model that will separate the two subreddits within this super-subreddit.

Using data collected from the two subreddits before the merge, we are going to use natural language processing classification models to separate the subreddit content. Our supervised learning models will be judged by their accuracy. We will do an in-depth analysis of a successful model and explore the various quirks that influence its predictions.
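To make the approach concrete, here is a minimal sketch of a text classifier scored by accuracy. The choice of `CountVectorizer` with `LogisticRegression` is an assumption made for illustration, not necessarily the model selected later, and most of the titles are invented stand-ins for the real posts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy titles standing in for the real posts (mostly invented for illustration)
aww_titles = [
    "My puppy fell asleep in my lap", "Kitten meets baby bunny",
    "Cute dog smiling at the camera", "Sleepy kitten yawns",
    "Puppy gets a belly rub", "Baby duck follows cute kitten",
    "My cat loves cuddles", "Happy puppy meets new friend",
]
metal_titles = [
    "Seal eats an octopus", "Huge Grizzly Bear",
    "Eagle snatches fish from lake", "Lion takes down a buffalo",
    "Snake swallows a frog whole", "Crocodile ambushes a zebra",
    "Hawk eats a snake", "Bear catches salmon mid air",
]
titles = aww_titles + metal_titles
labels = ["aww"] * len(aww_titles) + ["natureismetal"] * len(metal_titles)

# Hold out a stratified test set so both subreddits appear in it
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.25, stratify=labels, random_state=42
)

# Bag-of-words features feeding a linear classifier (an assumed model choice)
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Held-out accuracy: {accuracy:.2f}")
```

With so few toy titles the score itself means little. The point is the workflow: split, fit on the training titles, and score on titles the model has not seen.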

## Executive Summary

 

## Table of Contents
[1.00 Data Loading](#1.00-Data-Loading)

[2.00 Data Cleaning and Analysis](#2.00-Data-Cleaning-and-Analysis)

- [2.01 Quick Check](#2.01-Quick-Check)

- [2.02 Data Documentation Exploration](#2.02-Data-Documentation-Exploration)

- [2.03 Cleaning](#2.03-Cleaning)

- [2.04 Feature Engineering](#2.04-Feature-Engineering)

- [2.05 Dummifying Columns](#2.05-Dummifying-Columns)

- [2.06 Exploratory Data Analysis and Visualization](#2.06-Exploratory-Data-Analysis-and-Visualization)

[3.00 Machine Learning Modeling and Visualization](#3.00-Machine-Learning-Modeling-and-Visualization)

- [3.01 Model Preparation](#3.01-Model-Preparation)

- [3.02 Modeling](#3.02-Modeling)

- [3.03 Model Selection](#3.03-Model-Selection)

- [3.04 Model Evaluation](#3.04-Model-Evaluation)

[4.00 Conclusions](#4.00-Conclusions)

[5.00 Sources and References](#5.00-Sources-and-References)



# 1.00 Data Loading

### Package Import

In [17]:
import pandas as pd #imports pandas library
import numpy as np #imports numpy library
import matplotlib.pyplot as plt #imports matplotlib.pyplot library
import seaborn as sns #imports seaborn library

from sklearn.model_selection import train_test_split, cross_val_score #imports data splitting and cross-validation scoring for modeling

import copy as cp #imports the standard library's shallow and deep copy utilities

### Loading The Data

In [8]:
dangerouslycute_data = pd.read_csv('../data/dangerouslycute_data.csv') #reads in the initial dataframe 

# 2.00 Data Cleaning and Analysis 

## 2.01 Quick Check

In [9]:
dangerouslycute_data.head() #displays the head of the data

Unnamed: 0.1,Unnamed: 0,index,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,over_18,author_flair_text,total_awards_received,timestamp
0,0,0,Huge Grizzly Bear,,natureismetal,1584925587,cobrakiller2000,198,1,False,False,,0,2020-03-22
1,1,1,In my kitchen houseplant..,,natureismetal,1584929238,Bronco7771,2,1,False,False,,0,2020-03-22
2,2,2,In my kitchen houseplant..,,natureismetal,1584929255,Bronco7771,2,1,False,False,,0,2020-03-22
3,3,3,Deathlock,,natureismetal,1584931304,Hamstah_Huey,1,1,False,False,,0,2020-03-22
4,4,4,Seal eats an octopus,,natureismetal,1584940215,huntergill123,231,1,False,False,,0,2020-03-23
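Rows 1 and 2 above share the same title and author and were posted seconds apart, which suggests a repost. The sketch below shows one way to flag such rows. The decision to match on `title` and `author` is an assumption, and whether reposts should be dropped at all is a judgment call.

```python
import pandas as pd

# A few rows copied from the head of the data shown above
sample = pd.DataFrame({
    "title": ["Huge Grizzly Bear", "In my kitchen houseplant..",
              "In my kitchen houseplant..", "Deathlock"],
    "author": ["cobrakiller2000", "Bronco7771", "Bronco7771", "Hamstah_Huey"],
    "created_utc": [1584925587, 1584929238, 1584929255, 1584931304],
})

# Flag every repeat of a (title, author) pair after its first appearance
is_repost = sample.duplicated(subset=["title", "author"], keep="first")
print(sample[is_repost])

# Keep only the first copy of each post
deduplicated = sample.drop_duplicates(subset=["title", "author"], keep="first")
print(deduplicated.shape)
```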


In [11]:
dangerouslycute_data.tail()

Unnamed: 0.1,Unnamed: 0,index,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,over_18,author_flair_text,total_awards_received,timestamp
4995,2495,495,This picture cracks me up every time.,,aww,1574575629,bjax_7,68,1,False,False,,0,2019-11-24
4996,2496,496,Please take me,,aww,1574575720,side_effect7,1,1,False,False,,0,2019-11-24
4997,2497,497,Made me tear up,,aww,1574575767,ancientflowers,0,1,False,False,,0,2019-11-24
4998,2498,498,"Husband: ""We're not getting a cat!"" Husband an...",,aww,1574575820,sleepingdragon80,10,1,False,False,,0,2019-11-24
4999,2499,499,Micio got angry because I poked him in his ponch.,,aww,1574575826,CartonOfPutters,3,1,False,False,,0,2019-11-24
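The `Unnamed: 0` and `index` columns in the outputs above look like row indices left over from earlier saves of the data, though that reading is an assumption. If so, they carry no information about the posts and could be dropped, as in this sketch on a small inline CSV that mimics the real file.

```python
import io

import pandas as pd

# Small inline CSV that mimics the leftover index columns (illustrative only)
csv_text = (
    "Unnamed: 0,index,title,subreddit\n"
    "0,0,Huge Grizzly Bear,natureismetal\n"
    "2495,495,Please take me,aww\n"
)
raw = pd.read_csv(io.StringIO(csv_text))

# Collect columns that look like saved row indices rather than post features
leftover = [col for col in raw.columns if col.startswith("Unnamed") or col == "index"]
posts = raw.drop(columns=leftover)
print(posts.columns.tolist())
```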


In [12]:
dangerouslycute_data.shape

(5000, 14)
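Because the models will be judged by accuracy, it helps to know how the 5000 rows are split between the two subreddits: the share of the larger class is the accuracy a model would get by always guessing that class. The sketch below uses made-up counts and does not reflect the real split.

```python
import pandas as pd

# Made-up labels; the real column would be dangerouslycute_data["subreddit"]
sample = pd.DataFrame({"subreddit": ["natureismetal"] * 3 + ["aww"] * 2})

# Fraction of posts that belong to each subreddit
class_share = sample["subreddit"].value_counts(normalize=True)
print(class_share)

# Accuracy of always predicting the most common subreddit
baseline_accuracy = class_share.max()
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```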

In [16]:
dangerouslycute_data.isnull().mean().sort_values(ascending = False)

author_flair_text        0.9982
selftext                 0.9894
timestamp                0.0000
total_awards_received    0.0000
over_18                  0.0000
is_self                  0.0000
score                    0.0000
num_comments             0.0000
author                   0.0000
created_utc              0.0000
subreddit                0.0000
title                    0.0000
index                    0.0000
Unnamed: 0               0.0000
dtype: float64

In [18]:
dangerouslycute_data.isnull().sum().sort_values(ascending = False)

author_flair_text        4991
selftext                 4947
timestamp                   0
total_awards_received       0
over_18                     0
is_self                     0
score                       0
num_comments                0
author                      0
created_utc                 0
subreddit                   0
title                       0
index                       0
Unnamed: 0                  0
dtype: int64
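With `author_flair_text` and `selftext` missing in roughly 99% of rows, one option is to drop columns whose null share exceeds a cutoff. The sketch below uses a hypothetical threshold of 0.5 on invented rows. The right cutoff, and whether to drop rather than fill, remains a judgment call.

```python
import numpy as np
import pandas as pd

# Invented rows that mimic the sparsity pattern seen above
sample = pd.DataFrame({
    "title": ["Huge Grizzly Bear", "Deathlock", "Seal eats an octopus", "Please take me"],
    "selftext": [np.nan, np.nan, np.nan, "some text"],
    "author_flair_text": [np.nan, np.nan, np.nan, np.nan],
    "score": [1, 1, 1, 1],
})

# Share of missing values per column
null_share = sample.isnull().mean()

threshold = 0.5  # hypothetical cutoff, not taken from the original analysis
sparse_columns = null_share[null_share > threshold].index.tolist()

# Drop the columns that are mostly empty
cleaned = sample.drop(columns=sparse_columns)
print(cleaned.columns.tolist())
```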

In [19]:
dangerouslycute_data.dtypes

Unnamed: 0                int64
index                     int64
title                    object
selftext                 object
subreddit                object
created_utc               int64
author                   object
num_comments              int64
score                     int64
is_self                    bool
over_18                    bool
author_flair_text        object
total_awards_received     int64
timestamp                object
dtype: object
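The `timestamp` column is stored as `object` (plain strings) and `created_utc` as integer epoch seconds, so neither supports date arithmetic yet. A possible conversion is sketched below. The `created_at` column name is hypothetical. Note that the first row's epoch value falls on 2020-03-23 in UTC while its `timestamp` reads 2020-03-22, which suggests the `timestamp` column was derived in a local time zone (an assumption).

```python
import pandas as pd

# Two rows copied from the head of the data
sample = pd.DataFrame({
    "created_utc": [1584925587, 1584940215],
    "timestamp": ["2020-03-22", "2020-03-23"],
})

# Epoch seconds -> timezone-aware datetimes ("created_at" is a hypothetical name)
sample["created_at"] = pd.to_datetime(sample["created_utc"], unit="s", utc=True)

# Date strings -> datetime values
sample["timestamp"] = pd.to_datetime(sample["timestamp"])

print(sample.dtypes)
```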

## 3.00 Machine Learning Modeling and Visualization

## 4.00 Conclusions

## 5.00 Sources and References