![Marvel Logo](https://www.regmovies.com/static/dam/jcr:5914096c-fc7c-4fda-ace5-f46c1744faa7/MARVEL-title_Small.jpg)

# Project 3: Web APIs & Classification

## Problem Statement

Marvel superhero movies have been connecting with audiences for decades, and we want to continue to facilitate that connection. Marvel Studios is planning to create new characters to appear in a new series between 2021 and 2025 (a 5-year plan).

To create new characters with the qualities, values and characteristics that our audience can relate and connect to the most, our research team has carried out the following project to find the best predictive classification model for predicting the category of a given statement about super powers.

#### Data source 

Reddit is a network of communities based on people's interests. We believe that audience members with an interest in superheroes or superpowers have shared many of their thoughts on this popular platform. Hence, we have collected our datasets from there.

The two subreddits that we have selected are:

1. shittysuperpowers
2. godtiersuperpowers

Both subreddits share certain root problems that the described superpowers aim to tackle.

1. shittysuperpowers: 

    C-Tier super powers, less serious, something sort of useful, but not really useful; shittysuperpowers tackle themes or problems that are perceived as generally futile or too big to solve.
    
    E.g. You can throw a banana and it will return to you like a boomerang. 


2. godtiersuperpowers: 
    
    God-Tier super powers, positive, superior and intellectual powers; godtiersuperpowers tackle themes/problems that are perceived more hopefully with an optimism for an eventual positive outcome.
    
    E.g. You can undo any car accident with a push of a button.

#### Types of model

Three models are developed and evaluated:

1. Logistic Regression
2. K-Nearest Neighbors
3. Naive Bayes

#### Model evaluation

After building a predictive classification model, we need to evaluate its performance, that is, how well it predicts the outcome for new observations: test data that have not been used to train the model.

The metrics and methods used to assess the performance of predictive classification models include:

1. Classification accuracy, the proportion of correctly classified observations, expressed as a percentage.


2. Confusion matrix, a 2x2 table showing four counts: the numbers of true positives, true negatives, false negatives and false positives.


3. Sensitivity, specificity and precision, three major performance metrics describing a predictive classification model, each expressed as a percentage.
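
The metrics listed above all derive from the four confusion-matrix counts. The following is a minimal sketch in plain Python, not the evaluation code used in the project: the helper names `confusion_counts` and `classification_metrics` and the label vectors are hypothetical, and we assume that godtiersuperpowers is treated as the positive class (1).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


def safe_ratio(numerator, denominator):
    """Avoid a ZeroDivisionError when a class is never seen or never predicted."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true, y_pred, positive=1):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": safe_ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": safe_ratio(tp, tp + fn),  # true positive rate (recall)
        "specificity": safe_ratio(tn, tn + fp),  # true negative rate
        "precision": safe_ratio(tp, tp + fp),    # share of positive predictions that are right
    }


# Hypothetical labels: 1 = godtiersuperpowers, 0 = shittysuperpowers
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(y_true, y_pred)
print(counts)
print(metrics)
```

Multiplying each ratio by 100 gives the percentage form described above.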

#### Goal

With this model, our team will be able to predict which category the superpowers mentioned in subreddit posts belong to. More precisely, we want to find out which kinds of super powers our potential audience would consider positive or amazing, and which they would consider shitty or lame.

Based on this information, we will be able to identify the positive and negative abilities, desires, emotions, values, etc., and apply them to the design of the characters in our new movie series.

Since the new movie series is a 5-year plan, a reliable and accurate predictive classification model is necessary. We will present our findings to our producers, directors and even actors to support their decisions on storylines and character development. Ultimately, Marvel Studios will be able to come up with more creative, innovative and inspiring superheroes with appealing super powers that audiences like. Whether for our primary stakeholders, such as filmmakers, or for our secondary stakeholders, our audience, this classification model will generate a lot of insights for our team.

## Executive Summary

In the past few decades, Marvel Studios has been producing a series of superhero films based on characters that appear in Marvel Comics publications. The mission statement for Marvel Studios is: “A vision as far-reaching as our stories. Our mission to expand enables our legends like Thor and The X-Men to come to life in unexpected ways. And our mission is also to resonate with people today.”

However, the Marvel Cinematic Universe (MCU) has had no shortage of critics. Certainly some have been disappointed in quality or storylines, character development or entertainment value, all of which are to be expected. Viewers and members of Hollywood alike debate the human authenticity of superhero films. It has been said by some that they’re “despicable”, silly and lacking conviction, others that they’re devoid of human emotion and experiences. 

We want to produce superhero movies that have values. They can bring hope and strength to those facing adversity, and a break from reality to everyone who sets foot in a theater or streams a movie from the comfort of their home.

Simply searching for the kinds of super powers people are interested in isn't enough; we want to further understand and predict which category the audience would place a given super power in. To classify post content and its words into one of the two classes (shittysuperpowers, godtiersuperpowers), a classification model is needed.

We expect this model to help us create a few new characters with the superpowers that audiences consider god-tier, intellectual, positive and superior. Hence, Marvel Studios will regain its reputation through good movie quality and significance.

### Contents:
1. [Data Collection](#Data-Collection)
2. [Data Cleaning & EDA](#Data-Cleaning-&-EDA)
3. [Preprocessing & Modeling](#Preprocessing-&-Modeling)
4. [Conclusion & Recommendation](#Conclusion-&-Recommendation)

In [1]:
# library imports

import requests
import pandas as pd
import time
import random
import json

from random import randint
from time import sleep

## 1. Data Collection

#### 1) Fetch the content by URL.

In [12]:
# Target web page:
# https://www.reddit.com/r/shittysuperpowers/new/
# https://www.reddit.com/r/godtiersuperpowers/new/

url1 = 'https://www.reddit.com/r/shittysuperpowers/new/.json'
url2 = 'https://www.reddit.com/r/godtiersuperpowers/new/.json'

headers = {'User-agent': 'Ruby Fung'}

# Establishing the connection to the web page:
res1 = requests.get(url1, headers=headers)
res2 = requests.get(url2, headers=headers)

# Check the status codes to see how the server responded to each request.
# A notebook cell displays only its last expression, so the output below is res2's status.
res1.status_code
res2.status_code


200
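
A status of 200 means the request succeeded; any other status means the JSON body should not be trusted. The following is a minimal sketch of wrapping that check in a helper, not the notebook's own code: `fetch_listing` is a hypothetical name, and the `get` argument is assumed to be any callable shaped like `requests.get`. A stand-in is used here so the example runs without network access.

```python
def fetch_listing(url, get, headers=None, timeout=10):
    """Fetch a listing and return the parsed JSON, failing loudly on a non-200 status.

    `get` is any callable with the shape of requests.get (an assumption of this sketch).
    """
    response = get(url, headers=headers, timeout=timeout)
    if response.status_code != 200:
        raise RuntimeError(f"GET {url} returned status {response.status_code}")
    return response.json()


class FakeResponse:
    """Hypothetical stand-in for a response object, used only for this demo."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


def fake_get(url, headers=None, timeout=None):
    # Always answers with an empty listing and status 200
    return FakeResponse(200, {"data": {"children": [], "after": None}})


listing = fetch_listing(
    "https://www.reddit.com/r/shittysuperpowers/new/.json", fake_get
)
print(listing["data"]["after"])
```

In the notebook itself, the call would look like `fetch_listing(url1, requests.get, headers=headers)`.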

In [3]:
reddit_dict1 = res1.json()
reddit_dict2 = res2.json()

In [4]:
posts1 = [p['data'] for p in reddit_dict1['data']['children']]

In [13]:
reddit_dict1['data']['after']
# name of the last post

't3_j3awo0'

In [6]:
pd.DataFrame(posts1)['name']

0     t3_j3ide5
1     t3_j3iakc
2     t3_j3i7iw
3     t3_j3hx96
4     t3_j3hkvh
5     t3_j3hdga
6     t3_j3h6sc
7     t3_j3gbk3
8     t3_j3ftit
9     t3_j3faer
10    t3_j3em6v
11    t3_j3ekqr
12    t3_j3efb2
13    t3_j3dnhp
14    t3_j3dmad
15    t3_j3dlhi
16    t3_j3dbxa
17    t3_j3ctae
18    t3_j3cs39
19    t3_j3cjk6
20    t3_j3bux0
21    t3_j3bpbj
22    t3_j3bk5u
23    t3_j3bdy7
24    t3_j3awo0
Name: name, dtype: object

In [7]:
url1 + '?after=' + reddit_dict1['data']['after']

'https://www.reddit.com/r/shittysuperpowers/new/.json?after=t3_j3awo0'
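
The `after` value acts as a cursor: it is the `name` of the last post on the current page, and appending it to the URL requests the page that follows. The following is a minimal sketch of that step, assuming a hypothetical helper `build_page_url` that uses the standard library's `urllib.parse.urlencode` instead of string concatenation. The miniature `listing` dictionary is a trimmed-down illustration that reuses two post names from the output above. The optional `limit` argument is an assumption about Reddit's listing parameters and is not used in this notebook.

```python
from urllib.parse import urlencode


def build_page_url(base_url, after=None, limit=None):
    """Return the URL of the page that follows the post named `after`."""
    params = {}
    if after is not None:
        params["after"] = after
    if limit is not None:
        params["limit"] = limit  # assumed listing parameter, not used in the notebook
    return f"{base_url}?{urlencode(params)}" if params else base_url


# Trimmed-down illustration of a listing: real responses carry many more fields
listing = {
    "data": {
        "children": [
            {"data": {"name": "t3_j3bdy7"}},
            {"data": {"name": "t3_j3awo0"}},
        ],
        "after": "t3_j3awo0",
    }
}

names = [child["data"]["name"] for child in listing["data"]["children"]]
cursor = listing["data"]["after"]

base = "https://www.reddit.com/r/shittysuperpowers/new/.json"
first_page = build_page_url(base)
next_page = build_page_url(base, after=cursor)
print(names[-1] == cursor)
print(next_page)
```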

In [8]:
posts2 = [p['data'] for p in reddit_dict2['data']['children']]

In [14]:
reddit_dict2['data']['after']
# name of the last post

't3_j3bofm'

In [10]:
pd.DataFrame(posts2)['name']

0     t3_j3ifab
1     t3_j3i20g
2     t3_j3ho4u
3     t3_j3hi6s
4     t3_j3hev3
5     t3_j3h6l4
6     t3_j3h5ow
7     t3_j3gwcy
8     t3_j3gr6v
9     t3_j3gq5m
10    t3_j3gfnh
11    t3_j3g56g
12    t3_j3fgfi
13    t3_j3f29f
14    t3_j3f24f
15    t3_j3eyp4
16    t3_j3e27a
17    t3_j3dkna
18    t3_j3ddxm
19    t3_j3d3u3
20    t3_j3d1gq
21    t3_j3bysf
22    t3_j3byh8
23    t3_j3btqr
24    t3_j3bofm
Name: name, dtype: object

In [11]:
url2 + '?after=' + reddit_dict2['data']['after']

'https://www.reddit.com/r/godtiersuperpowers/new/.json?after=t3_j3bofm'

#### Looping through the posts, 25 posts at a time

In [None]:
# Loop through the shittysuperpowers subreddit, one page (25 posts) at a time

posts1 = []
after1 = None

for i in range(40):
    print(i)
    # The first request has no cursor; later requests continue after the last post seen
    if after1 is None:
        current_url1 = url1
    else:
        current_url1 = url1 + '?after=' + after1
    print(current_url1)

    res1 = requests.get(current_url1, headers=headers)

    if res1.status_code != 200:
        print('Status error', res1.status_code)
        break

    current_dict1 = res1.json()
    current_posts1 = [p['data'] for p in current_dict1['data']['children']]
    posts1.extend(current_posts1)
    after1 = current_dict1['data']['after']

    # Stop at the end of the listing; a None cursor would restart from the first page
    if after1 is None:
        break

    # generate a random sleep duration (2 to 10 seconds) to look more 'natural'
    sleep_duration = random.randint(2, 10)
    print(sleep_duration)
    time.sleep(sleep_duration)

In [None]:
len(posts1)

In [None]:
pd.DataFrame(posts1).to_csv('../data/shittysuperpowers.csv', index = False)

In [None]:
# Loop through the godtiersuperpowers subreddit, one page (25 posts) at a time

posts2 = []
after2 = None

for i in range(40):
    print(i)
    # The first request has no cursor; later requests continue after the last post seen
    if after2 is None:
        current_url2 = url2
    else:
        current_url2 = url2 + '?after=' + after2
    print(current_url2)

    res2 = requests.get(current_url2, headers=headers)

    if res2.status_code != 200:
        print('Status error', res2.status_code)
        break

    current_dict2 = res2.json()
    current_posts2 = [p['data'] for p in current_dict2['data']['children']]
    posts2.extend(current_posts2)
    after2 = current_dict2['data']['after']

    # Stop at the end of the listing; a None cursor would restart from the first page
    if after2 is None:
        break

    # generate a random sleep duration (2 to 20 seconds) to look more 'natural'
    sleep_duration = random.randint(2, 20)
    print(sleep_duration)
    time.sleep(sleep_duration)

In [None]:
len(posts2)

In [None]:
pd.DataFrame(posts2).to_csv('../data/godtiersuperpowers.csv', index = False)
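
The two collection loops above share one shape: request a page, keep its posts, follow the `after` cursor, pause, repeat. The following is a minimal sketch of that shape as a single reusable function, not the code that produced the CSV files. `collect_posts` and its `fetch_page` argument are hypothetical names; `fetch_page` is assumed to take a URL and return the parsed listing dictionary. The sketch also skips posts whose `name` has already been seen, which is an addition not present in the loops above. The demo uses made-up pages and a no-op sleep so it runs without network access.

```python
import random
import time


def collect_posts(fetch_page, base_url, max_pages=40, pause_range=(2, 10),
                  sleep=time.sleep):
    """Collect posts page by page by following the listing's `after` cursor."""
    posts = []
    seen_names = set()
    after = None

    for _ in range(max_pages):
        url = base_url if after is None else f"{base_url}?after={after}"
        listing = fetch_page(url)

        for child in listing["data"]["children"]:
            post = child["data"]
            # Skip posts that were already collected on an earlier page
            if post["name"] not in seen_names:
                seen_names.add(post["name"])
                posts.append(post)

        after = listing["data"]["after"]
        if after is None:
            break  # no further pages

        # Random pause between requests, as in the loops above
        sleep(random.randint(*pause_range))

    return posts


# Made-up pages keyed by URL, standing in for real responses
fake_pages = {
    "demo.json": {
        "data": {
            "children": [{"data": {"name": "t3_a"}}, {"data": {"name": "t3_b"}}],
            "after": "t3_b",
        }
    },
    "demo.json?after=t3_b": {
        "data": {
            "children": [{"data": {"name": "t3_b"}}, {"data": {"name": "t3_c"}}],
            "after": None,
        }
    },
}

demo_posts = collect_posts(fake_pages.__getitem__, "demo.json",
                           sleep=lambda seconds: None)
print([post["name"] for post in demo_posts])
```

Passing the page fetcher in as an argument keeps the pagination logic independent of the HTTP library, so the same function can serve both subreddits.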

Data cleaning and EDA will be carried out in the next notebook.