# Project Wattpad

## Getting Wattpad Data
This Jupyter notebook uses the Wattpad API to download data from Wattpad. The main content we analyze is Wattpad stories. Each story has categories and a language associated with it, and the category and language lists are also available through the API.

Our focus here is to fetch the raw data from the API, clean it up, and save it to CSV files that we will use for analysis later.

In [1]:
# Import Dependencies
import requests
import json
import numpy as np
import csv
import yaml
import os
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

### Set up for API calls
First we set up the details needed to make the API calls and define placeholders for our data files and other variables.

In [2]:
# Load the config.yaml file to get the api keys and other parameters
with open("./config.yaml") as y:
    cfg = yaml.safe_load(y)  # safe_load parses plain YAML without constructing arbitrary Python objects

header = {
    "Authorization": "Basic {}".format(cfg["keys"]["API_KEY"]),
    "Content-Type": "application/json",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
}

# Files to save our data
categories_file_name = "data/categories.csv"
languages_file_name = "data/languages.csv"

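The cell above assumes `config.yaml` parses to a nested mapping with an `API_KEY` entry under `keys`. As a minimal sketch of that assumption, the hypothetical helper `build_headers` below (not part of the original notebook) checks the parsed config before building the request headers, so a missing key fails with a clear message rather than a bare `KeyError`. The sample key is a placeholder, not a real credential.

```python
def build_headers(cfg):
    """Build request headers from a parsed config mapping.

    Assumes the shape used in this notebook: cfg["keys"]["API_KEY"].
    """
    try:
        api_key = cfg["keys"]["API_KEY"]
    except (KeyError, TypeError):
        raise ValueError("config.yaml must define keys -> API_KEY") from None
    if not api_key:
        raise ValueError("API_KEY in config.yaml is empty")
    return {
        "Authorization": "Basic {}".format(api_key),
        "Content-Type": "application/json",
    }


# Placeholder config standing in for the parsed config.yaml
sample_cfg = {"keys": {"API_KEY": "not-a-real-key"}}
sample_header = build_headers(sample_cfg)
print(sample_header["Authorization"])  # prints "Basic not-a-real-key"
```
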
### Getting the Categories from the Wattpad api
The Wattpad API provides a call that returns the list of categories used to classify stories.
We will fetch this list and store it as a CSV file for use later.

In [3]:
################################################################################
# This function makes a Wattpad api call to get a list of all the categories
# It writes all the categories data into a csv file to be used later
################################################################################
def get_categories():
    category_url = "https://www.wattpad.com/v4/categories"
    
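    # Make the api call
    req = requests.get(category_url, headers=header)
    req.raise_for_status()  # Fail early on an HTTP error
    category_response = req.json()
    
    # Write to the csv file (newline='' is what the csv module expects)
    with open(categories_file_name, 'w', newline='') as csvfile:
        write = csv.writer(csvfile, delimiter=',')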
        
        # Write the header row
        write.writerow(["ID","NAME"])
        
        # Loop through the data and write
        for category in category_response["categories"]:
            write.writerow([category["id"],category["name"]])
            

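`get_categories` both calls the API and writes the file, so it can only be checked against the live service. One possible refactoring, sketched here with a hypothetical `write_categories` helper, separates the CSV-writing step so it can be exercised offline. The sample payload is hand-made and only mirrors the `categories`, `id` and `name` fields the function above reads; it is not real API output.

```python
import csv
import io


def write_categories(category_response, csvfile):
    """Write the ID and NAME of each category to an open text file."""
    writer = csv.writer(csvfile, delimiter=",")
    writer.writerow(["ID", "NAME"])
    for category in category_response["categories"]:
        writer.writerow([category["id"], category["name"]])


# Hand-made payload with the same shape get_categories expects
sample_response = {
    "categories": [
        {"id": 4, "name": "Romance"},
        {"id": 8, "name": "Mystery / Thriller"},
    ]
}

buffer = io.StringIO()  # In-memory stand-in for the CSV file
write_categories(sample_response, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing the file object in, rather than opening it inside the function, is a design choice that lets the same code write to a real file or to an in-memory buffer.
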
In [4]:
# Call the function to get all the categories from Wattpad and then view the data 
# from the csv file that is created to make sure we have usable data
get_categories()

# Open the csv file and read its contents to see if we got all the data right
with open(categories_file_name) as csvfile:
    reader = csv.reader(csvfile, delimiter=",")
    for row in reader:
        print(row)

['ID', 'NAME']
['4', 'Romance']
['5', 'Science Fiction']
['3', 'Fantasy']
['7', 'Humor']
['12', 'Paranormal']
['8', 'Mystery / Thriller']
['9', 'Horror']
['11', 'Adventure']
['23', 'Historical Fiction']
['1', 'Teen Fiction']
['6', 'Fanfiction']
['2', 'Poetry']
['17', 'Short Story']
['21', 'General Fiction']
['24', 'ChickLit']
['14', 'Action']
['18', 'Vampire']
['22', 'Werewolf']
['13', 'Spiritual']
['16', 'Non-Fiction']
['10', 'Classics']
['19', 'Random']

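Later analysis will likely need to map the category IDs stored with each story back to names. A minimal sketch of that lookup follows, assuming the two-column `ID`, `NAME` layout printed above. The helper name `load_category_names` is hypothetical, and the sample rows are copied from that output.

```python
import csv
import io


def load_category_names(csvfile):
    """Return a dict mapping integer category ID to category name."""
    reader = csv.DictReader(csvfile)
    return {int(row["ID"]): row["NAME"] for row in reader}


# A few rows copied from the output above, standing in for data/categories.csv
sample_csv = io.StringIO("ID,NAME\n4,Romance\n5,Science Fiction\n19,Random\n")
category_names = load_category_names(sample_csv)
print(category_names[19])  # prints "Random"
```

With the real file, the same helper can be given the handle returned by `open(categories_file_name, newline='')`.
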

### Getting Languages from Wattpad
The Wattpad API provides a call that returns the list of languages used across all stories.
We will fetch this list and store it as a CSV file for use later.

In [5]:
################################################################################
# This function makes a Wattpad api call to get a list of all the languages
# It writes all the language code data into a csv file to be used later
################################################################################
def get_languages():
    language_url = "https://www.wattpad.com/v4/languages"
    
    # Make the api call
    req = requests.get(language_url, headers=header)
    req.raise_for_status()  # Fail early on an HTTP error
    language_response = req.json()
    
    # Write to the csv file (newline='' is what the csv module expects)
    with open(languages_file_name, 'w', newline='') as csvfile:
        write = csv.writer(csvfile, delimiter=',')
        
        # Write the header row
        write.writerow(["LANGUAGE_CODE"])
        
        # Loop through the data and write 
        for language in language_response["languages"]:
            write.writerow([language["code"]])
            

In [6]:
# Make the call to get the languages and then view the data from the csv file that
# is created to make sure we have usable data
get_languages()

# Open the csv file and read its contents to see if we got all the data right
with open(languages_file_name) as csvfile:
    reader = csv.reader(csvfile, delimiter=",")
    for row in reader:
        print(row)

['LANGUAGE_CODE']
['en']
['fr']
['it']
['de']
['es']
['pt-PT']
['pt-BR']
['ru']
['zh-TW']
['ja']
['ko']
['zh-CN']
['nl']
['pl']
['ro']
['ar']
['he']
['tl']
['vi']
['id']
['hi']
['ms']
['tr']
['cs']
['ml']
['sv']
['nn']
['hu']
['da']
['el']
['fa']
['th']
['is']
['fi']
['et']
['lv']
['lt']
['ca']
['bs']
['sr']
['hr']
['sl']
['bg']
['sk']
['be']
['uk']
['bn']
['ur']
['ta']
['sw']
['af']
['gu']
['or']
['pa']
['as']
['mr']


### Getting Stories from Wattpad
The main content we work with is Wattpad stories. The API returns a list of stories written by users and readable by everyone on the platform. We download them in pages of 100 and use this content for our analysis.

In [7]:
def get_stories(x):
    # Request one page of up to 100 stories, starting at offset x
    url = "https://www.wattpad.com/v4/stories?limit=100&offset=" + str(x)

    req = requests.get(url, headers=header)
    req.raise_for_status()  # Fail early on an HTTP error
    json_response = req.json()
    return json_response

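Paging works by requesting a fixed number of stories per call and advancing `offset` by that amount. As a sketch, the query string can be built with `urllib.parse.urlencode` instead of string concatenation, which keeps the `&` separators and escaping correct. The helper name `stories_url` is hypothetical, and the `limit` and `offset` parameter names are taken from the URL used in `get_stories`.

```python
from urllib.parse import urlencode

STORIES_ENDPOINT = "https://www.wattpad.com/v4/stories"


def stories_url(offset, limit=100):
    """Return the stories URL for one page starting at the given offset."""
    query = urlencode({"limit": limit, "offset": offset})
    return "{}?{}".format(STORIES_ENDPOINT, query)


first_page = stories_url(0)
third_page = stories_url(200)
print(third_page)
# prints https://www.wattpad.com/v4/stories?limit=100&offset=200
```

`requests.get` also accepts a `params` dictionary and performs the same encoding, so either approach avoids hand-assembled query strings.
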
In [8]:
# Number of stories to request, fetched in pages of 100
N = 10000
json_list = []
for x in np.arange(0,N,100):
    json_list.append(get_stories(x))

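The loop above always issues `N / 100` requests, even if the API runs out of stories earlier. A hedged sketch of a loop that stops at the first empty page follows. The function `fetch_all_pages` and the fake fetcher are hypothetical; the fetcher is passed in as a parameter so the logic can be tried without network access, and it assumes each response carries a `stories` list, as the responses in this notebook do.

```python
def fetch_all_pages(fetch_page, total, page_size=100):
    """Collect page responses until `total` is covered or a page is empty."""
    pages = []
    for offset in range(0, total, page_size):
        response = fetch_page(offset)
        if not response.get("stories"):
            break  # No more stories available
        pages.append(response)
    return pages


def fake_fetch(offset):
    """Stand-in for get_stories: pretends only 250 stories exist."""
    remaining = max(0, 250 - offset)
    count = min(100, remaining)
    return {"stories": [{"id": str(offset + i)} for i in range(count)]}


pages = fetch_all_pages(fake_fetch, total=1000)
print(len(pages))  # prints 3
print([len(p["stories"]) for p in pages])  # prints [100, 100, 50]
```

In the notebook this would be called as `fetch_all_pages(get_stories, N)`.
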
In [9]:
len(json_list)

100

In [10]:
pages_of_stories = [x['stories'] for x in json_list]
print(len(pages_of_stories))
pages_of_stories[10]

100


[{'categories': [19, 0],
  'commentCount': 22,
  'completed': False,
  'copyright': 0,
  'cover': 'https://a.wattpad.com/cover/44159049-256-k298489.jpg',
  'cover_timestamp': '2015-07-08T02:48:09Z',
  'createDate': '2015-07-08T02:47:49Z',
  'deleted': False,
  'description': '"Well, that\'s just a coincidence, Honey. Don\'t think too much about it"\n\nMy mom always said that every time I tell her about something that just miraculously happen. Well, I don\'t believe that... For me "There\'s no such thing as coincidence, there\'s always a reason behind it" \n\nSet about a few years later, this story tells about a young girl named Jacqueline who meets a mysterious man who changes her life forever.\n\n\nThis is the Spin-off story of CHANGED by @sfdlovato\n#CHANGEDWritingContest',
  'firstPartId': 146005924,
  'firstPublishedPart': {'createDate': '2015-07-08T02:52:45Z',
   'id': 146005924},
  'id': '44159049',
  'language': {'id': 1, 'name': 'English'},
  'lastPublishedPart': {'createDate':

In [11]:
################################################################################
# Flattens the pages into a single list of stories, parses each JSON field
# into its own column, then replaces each value in the categories column with
# its first element so the column holds a single integer instead of a list.
################################################################################

flat_list = [x for y in pages_of_stories for x in y]

stories_df = json_normalize(flat_list)

# Keep only the first category ID of each story
stories_df['categories'] = stories_df['categories'].apply(lambda cats: cats[0])

# Save next to the other CSV files in the data directory
stories_df.to_csv(os.path.join('data', 'stories.csv'))
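
To make the transformation concrete, here is a small pure-Python sketch of the same two steps (flatten the pages, then keep the first category) on hand-made data. The story dicts are illustrative stand-ins with only the `id` and `categories` fields, not real API output, and the helper name `first_category` is hypothetical. Unlike the cell above, it returns `None` for a story with an empty category list instead of raising an error.

```python
from itertools import chain


def first_category(story):
    """Return the first category ID of a story, or None if it has none."""
    categories = story.get("categories") or []
    return categories[0] if categories else None


# Two hand-made pages in the same shape as pages_of_stories
sample_pages = [
    [{"id": "1", "categories": [19, 0]}, {"id": "2", "categories": [4, 11]}],
    [{"id": "3", "categories": []}],
]

# Flatten the list of pages into one list of stories
flat = list(chain.from_iterable(sample_pages))
rows = [(story["id"], first_category(story)) for story in flat]
print(rows)  # prints [('1', 19), ('2', 4), ('3', None)]
```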