# Project 3: Web APIs & Classification

### Contents

- [Problem Statement](#Problem-Statement)
- [Functions](#Functions)
- [Importing](#Importing)


- [Inspect Data](#Inspect-Data)
- [Clean Data](#Clean-Data)
- [Output Clean Data](#Output-Clean-Data)


- [EDA](#EDA)


- [LR Model Exploratory](#LR-Model-Exploratory)
- [Create Feature Matrix and Target](#Create-Feature-Matrix-and-Target)
- [LR Model](#LR-Model)
- [Ridge Model](#Ridge-Model)
- [Lasso Model](#Lasso-Model)


- [Output Model Predictions](#Output-Model-Predictions)


- [Descriptive and Inferential Statistics](#Descriptive-and-Inferential-Statistics)
- [Outside Research](#Outside-Research)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

### Problem Statement

- Blank

### Functions

In [1]:
# user configuration

url = "http://www.reddit.com/r/boardgames.json"
headers = {'User-agent':'Bleep blorp bot 0.1'}

num_requests = 40

In [2]:
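def output_json(data):
    # save one page of results to a timestamped json file in json_path

    timestamp = dt.datetime.now()
    timestamp = timestamp.strftime("%Y_%m_%d %H_%M_%S")
    
    filename = 'boardgames ' + timestamp + '.json'
    
    # the context manager closes the file once the dump is written
    with open(json_path + filename, "w") as file:
        json.dump(data, file)
    
    print('created',filename)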
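
The saved pages can be read back later without querying reddit again. The sketch below is one possible way to do that, assuming the files keep the `boardgames <timestamp>.json` naming used by `output_json`; `load_saved_posts` is a hypothetical helper, and the demo writes two made-up pages to a temporary folder rather than to `json_path`.

```python
import glob
import json
import os
import tempfile

def load_saved_posts(folder):
    """Collect the posts from every saved 'boardgames *.json' page in folder."""
    posts = []
    # timestamps in the filenames sort chronologically as plain strings
    for path in sorted(glob.glob(os.path.join(folder, 'boardgames *.json'))):
        with open(path) as f:
            page = json.load(f)
        posts.extend(page['data']['children'])
    return posts

# demo with made-up pages in a temporary folder (not real reddit data)
with tempfile.TemporaryDirectory() as tmp:
    samples = [('2019_07_15 21_02_41', 't3_aaa'), ('2019_07_15 21_02_44', 't3_bbb')]
    for stamp, name in samples:
        page = {'kind': 'Listing',
                'data': {'children': [{'data': {'name': name}}], 'after': None}}
        with open(os.path.join(tmp, 'boardgames ' + stamp + '.json'), 'w') as f:
            json.dump(page, f)
    loaded = load_saved_posts(tmp)

loaded_names = [p['data']['name'] for p in loaded]
print(loaded_names)
```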

### Importing

In [3]:
# import libraries

# maths
import scipy.stats as stats
import numpy as np
import pandas as pd
#from pandas.api.types import is_numeric_dtype

# visual
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LinearRegression,Ridge,RidgeCV,Lasso,LassoCV,ElasticNet
from sklearn import linear_model
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
from sklearn.metrics import r2_score,mean_squared_error

# web
import requests
import json
from IPython.display import Image
from IPython.core.display import HTML

# others
import time
import datetime as dt
#import re
#import os

In [4]:
# file paths

input_path = '../data/input/'
mid_path = '../data/mid/'
output_path = '../data/output/'

json_path = '../data/json/'

image_path = '../images/'

In [5]:
# get json from reddit boardgames

response = requests.get(url,headers=headers)
print(response.status_code)

json_data = response.json()
sorted(json_data.keys())

200


['data', 'kind']

In [6]:
# print number of posts in json_data
print(len(json_data['data']['children']))

# print info in kind
print(json_data['kind'])
print('')

# print data keys
print(sorted(json_data['data'].keys()))
print('')

# print 1st post
print(json_data['data']['children'][0]['data'])
print('')

# print id of last post
print(json_data['data']['after'])

26
Listing
['after', 'before', 'children', 'dist', 'modhash']
{'approved_at_utc': None, 'subreddit': 'boardgames', 'selftext': '**Welcome to /r/boardgames Daily Discussion and Game Recommendations**\n\nThis is meant to be a place where you can ask any and all questions relating to the board gaming world: general or specific game recommendations, rule clarifications, definitions of terms/acronyms, and other quick questions that might not warrant their own post. \n\nIf you are seeking game recommendations you will get better responses if you give us enough background to help you. You can use [this template](https://www.reddit.com/r/boardgames/wiki/personalized-game-recommendation-template-no-explainer) to do so. [Here](https://www.reddit.com/r/boardgames/wiki/personalized-game-recommendation-template) is a version with explanations of what we\'re looking for.  \n\nIf you reply to any comment that has a game name in **bold** with "**/u/r2d8 getparentinfo**", one of our robots will tell yo

In [7]:
# print ids of all posts

[post['data']['name'] for post in json_data['data']['children']]

['t3_cdd70w',
 't3_cd96px',
 't3_cdgvqg',
 't3_cdfyjv',
 't3_cd52gi',
 't3_cd9zij',
 't3_cd4prk',
 't3_cdd9do',
 't3_cdedxz',
 't3_cdco7u',
 't3_cddjmf',
 't3_cd955j',
 't3_cdbbm3',
 't3_cdh4dv',
 't3_cd57ta',
 't3_cdf41m',
 't3_cdgwtq',
 't3_cdgwer',
 't3_cd69wt',
 't3_cdgoxh',
 't3_cd9cj8',
 't3_ccvd2t',
 't3_cdfwst',
 't3_cdfucd',
 't3_cd22ih',
 't3_cd4tpz']

In [8]:
# get json_data with multiple requests

posts = []
after = None

for i in range(1,num_requests+1):
    
    if after is None:
        params = {}
    else:
        params = {'after':after}
        
    response = requests.get(url,params=params,headers=headers)
    
    if response.status_code == 200:       
        
        print('process request',i)
        
        json_data = response.json()
        posts.extend(json_data['data']['children'])
        after = json_data['data']['after']
        
        output_json(json_data)        
        
    else:
        print(response.status_code)
        break
        
    time.sleep(3)

process request 1
created boardgames 2019_07_15 21_02_41.json
process request 2
created boardgames 2019_07_15 21_02_44.json
process request 3
created boardgames 2019_07_15 21_02_48.json
process request 4
created boardgames 2019_07_15 21_02_52.json
process request 5
created boardgames 2019_07_15 21_02_56.json
process request 6
created boardgames 2019_07_15 21_03_00.json
process request 7
created boardgames 2019_07_15 21_03_04.json
process request 8
created boardgames 2019_07_15 21_03_08.json
process request 9
created boardgames 2019_07_15 21_03_12.json
process request 10
created boardgames 2019_07_15 21_03_15.json
process request 11
created boardgames 2019_07_15 21_03_19.json
process request 12
created boardgames 2019_07_15 21_03_23.json
process request 13
created boardgames 2019_07_15 21_03_27.json
process request 14
created boardgames 2019_07_15 21_03_31.json
process request 15
created boardgames 2019_07_15 21_03_35.json
process request 16
created boardgames 2019_07_15 21_03_39.json
p
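
The loop above keeps requesting until `num_requests` is reached, even if a page comes back with `after` set to `None`; in that case the next request carries no `after` parameter and starts again from the first page. A minimal sketch of a variant that stops early is shown below. `collect_posts` and `fetch_page` are hypothetical names, and the two fake pages stand in for real responses so the example runs without network access.

```python
def collect_posts(fetch_page, max_requests):
    """Page through a listing, stopping when no 'after' token is returned.

    fetch_page is a hypothetical callable: it takes the current 'after'
    token (or None for the first page) and returns the parsed JSON dict.
    """
    posts = []
    after = None
    for _ in range(max_requests):
        page = fetch_page(after)
        posts.extend(page['data']['children'])
        after = page['data']['after']
        if after is None:
            # no further pages, so stop instead of restarting from the top
            break
    return posts

# fake two-page listing (made-up ids, same shape as the reddit response)
fake_pages = {
    None: {'data': {'children': [{'data': {'name': 't3_aaa'}}], 'after': 't3_aaa'}},
    't3_aaa': {'data': {'children': [{'data': {'name': 't3_bbb'}}], 'after': None}},
}

collected = collect_posts(lambda after: fake_pages[after], max_requests=5)
collected_names = [p['data']['name'] for p in collected]
print(collected_names)
```

In the notebook, `fetch_page` could wrap the existing `requests.get(url,params=params,headers=headers)` call together with the status check and the `time.sleep(3)` pause.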

In [9]:
len(posts)

989

In [10]:
# check for duplicates

len(set(p['data']['name'] for p in posts))

963
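
With 989 collected posts but only 963 unique ids, some posts were returned more than once. One way to drop the repeats, keeping the first occurrence of each id, is sketched below. `dedupe_posts` is a hypothetical helper and the sample list is made up; it only mirrors the shape of `json_data['data']['children']`.

```python
def dedupe_posts(posts):
    """Return posts with repeated 'name' ids removed, keeping first occurrences."""
    seen = set()
    unique = []
    for post in posts:
        name = post['data']['name']
        if name not in seen:
            seen.add(name)
            unique.append(post)
    return unique

# made-up sample in the same shape as the reddit listing children
sample_posts = [
    {'data': {'name': 't3_aaa', 'title': 'first'}},
    {'data': {'name': 't3_bbb', 'title': 'second'}},
    {'data': {'name': 't3_aaa', 'title': 'first'}},
]

unique_sample = dedupe_posts(sample_posts)
unique_names = [p['data']['name'] for p in unique_sample]
print(len(sample_posts), len(unique_sample))
```

Applied to the notebook's data, `posts = dedupe_posts(posts)` should leave `len(posts)` equal to the unique count reported above.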

### Inspect Data

In [11]:
# list all columns in df_train

#print(df_train.columns)

In [12]:
# output 1st 5 records in df_train

#df_train.head()

In [13]:
#df_train_info = df_train.describe()
#df_train_info

In [14]:
# Check for nulls in columns

#null_cols = df_train.isnull().sum()
#mask_null = null_cols > 0
#null_cols[mask_null].sort_values(ascending=False)

In [15]:
# Check for nulls in rows

#null_rows = df_train.isnull().sum(axis=1)
#mask_null = null_rows > 0
#null_rows[mask_null].sort_values(ascending=False)

In [16]:
# show column summary

#df_train_info = df_train.info()

In [17]:
# find columns with the most empty cells

#df_train.count(axis=0).sort_values()

### Clean Data

In [18]:
# rename columns

In [19]:
# convert all strings in cells to lowercase -> prevent duplicates when creating dummies

#df_train = df_train.applymap(lambda s:s.lower() if type(s) == str else s)
#df_test = df_test.applymap(lambda s:s.lower() if type(s) == str else s)

In [20]:
# fill nan/empty cells with na

#cols = ['pool_qual','bsmt_qual','fireplace_score','garage_qual','garage_cond','bsmt_type1_score','bsmt_type2_score']

#for col in cols:
    #df_train[col] = df_train[col].fillna(value='na')
    #df_test[col] = df_test[col].fillna(value='na')

In [21]:
# # fill nan/empty cells with 0

# cols = ['bath_half_bsmt_num','bath_full_bsmt_num','garage_area','garage_car_num','bsmt_total_area',
#         'bsmt_unfinish_area','bsmt_type2_area','bsmt_type1_area']

# for col in cols:
#     df_train[col] = df_train[col].fillna(value='0')
#     df_test[col] = df_test[col].fillna(value='0')

### Output Clean Data

In [22]:
# output to csv

#df_train.to_csv(mid_path + 'df_train_clean.csv')

### EDA

In [23]:
# df_train heatmap (staircase)

#corr = df_train.corr()
#mask = np.zeros_like(corr)
#mask[np.triu_indices_from(mask)] = True

#fig, ax = plt.subplots(figsize=(20, 10))
#sns.heatmap(corr, mask=mask, vmax=.3, square=True,cmap="coolwarm_r");

In [24]:
#create_scatterplot('Sale Price vs House Quality',df_train,x ='house_qual',y ='sale_price',hue ='lot_subclass')

#print("sale price tends to increase as house quality increases.")

In [25]:
# create histograms for all numeric columns
#df_train.hist(figsize=(15, 15));

#print('plot histograms for all numeric columns to check for zeros and abnormalities.')

In [26]:
#create_boxplot(df_train,x='lot_subclass',y='sale_price',title='Sale Price vs Lot Subclass')

#print("Most subclasses have sale prices between 100K and 200K.")
#print("4 subclasses have sale prices above 200K and they have more outliers.")
#print("I will convert lot_subclass to dummy variables for model predictions.")

### Create Feature Matrix and Target

### LR Model

### Ridge Model

### Lasso Model

### Output Model Predictions

In [27]:
# output to csv

#timestamp = dt.datetime.now()
#timestamp = timestamp.strftime(" %Y_%m_%d %H_%M_%S ")
    
# contains selected columns for feature matrix
#df_cols.to_csv(output_path + 'columns' + timestamp + '.csv')

### Descriptive and Inferential Statistics

### Outside Research

### Conclusions and Recommendations