# <font color='blue'> Scraping NBA - Basketball data - JSON</font> 


## JSON context and examples

**JSON format (JavaScript Object Notation)** is a text format independent of the programming language used. We can have JSON files just like any other file format. The format is a collection of name/value pairs (similar to the concept of Python dictionaries, {key: value}).

In other languages this kind of structure is also known as a record, struct, hash table, keyed list, or associative array.

Example of this structure:


In [1]:

{
"player":
		{ "firstName":"Michael", 
          "lastName":"Jordan" 
        },
"teams_of_this_player":
		{ "team1":"Chicago Bulls", 
          "team2":"Washington Wizards" 
        },
"DOB": 
 		 "February 17, 1963", 
"extra_note": 
 		{"text": "American retired professional basketball player, and owner of the Charlotte Hornets.",
         "source": 'Wikipedia'
        }
}

{'DOB': 'February 17, 1963',
 'extra_note': {'source': 'Wikipedia',
  'text': 'American retired professional basketball player, and owner of the Charlotte Hornets.'},
 'player': {'firstName': 'Michael', 'lastName': 'Jordan'},
 'teams_of_this_player': {'team1': 'Chicago Bulls',
  'team2': 'Washington Wizards'}}
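
As a minimal sketch of the same idea, the standard `json` module can convert between JSON text and Python dictionaries. The variable names `json_text`, `player_info` and `round_trip` are chosen here for illustration only.

```python
import json

# A JSON document as text (JSON itself requires double quotes around strings)
json_text = '''
{
  "player": {"firstName": "Michael", "lastName": "Jordan"},
  "teams_of_this_player": {"team1": "Chicago Bulls", "team2": "Washington Wizards"},
  "DOB": "February 17, 1963"
}
'''

# JSON text -> Python dictionary
player_info = json.loads(json_text)
print(player_info["player"]["lastName"])

# Python dictionary -> JSON text (sorted keys and indentation for readability)
round_trip = json.dumps(player_info, sort_keys=True, indent=2)
print(round_trip)
```

Nested objects become nested dictionaries, so values are reached by chaining keys.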

**Note** that this is similar to another file format that you might be interested in: **"XML"** (eXtensible Markup Language). This language was created as an annotation system for documents, in order to help humans and machines read and identify **< marks > and 'contents'**. Those marks have to be created according to your needs. (XML is similar to JSON, except that XML needs an end tag for every mark and JSON is faster to read.) It is widely used alongside HTML because XML is self-descriptive and was created to describe information (as opposed to HTML, which displays it).

Take a look at this example:



'''

	<player>
		<firstName>Michael</firstName>
		<lastName>Jordan</lastName>
	</player>
	<teams_of_this_player>
		<team1>Chicago Bulls</team1>
		<team2>Washington Wizards</team2>
	</teams_of_this_player>
	<DOB>February 17, 1963</DOB>
	<extra_note>
		<text>American retired professional basketball player, and owner of the Charlotte Hornets.</text>
		<source>Wikipedia</source>
	</extra_note>
    
'''

**Python libraries** that help to deal with XML structures:  [xmltodict](https://github.com/martinblech/xmltodict) and [untangle](https://github.com/stchris/untangle)
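
If you prefer to stay within the standard library, a rough sketch with `xml.etree.ElementTree` could look like the following. The `<record>` root element and the helper `element_to_dict` are assumptions added here (the fragment above has several top-level elements, so it needs one enclosing root to be well-formed XML).

```python
import xml.etree.ElementTree as ET

# The example fragment wrapped in a single <record> root (name chosen here)
xml_text = """
<record>
    <player>
        <firstName>Michael</firstName>
        <lastName>Jordan</lastName>
    </player>
    <teams_of_this_player>
        <team1>Chicago Bulls</team1>
        <team2>Washington Wizards</team2>
    </teams_of_this_player>
    <DOB>February 17, 1963</DOB>
</record>
"""

def element_to_dict(element):
    """Recursively turn an element into nested dicts, with leaf text as values.

    Simplification: repeated sibling tags would overwrite each other.
    """
    children = list(element)
    if not children:
        return element.text
    return {child.tag: element_to_dict(child) for child in children}

root = ET.fromstring(xml_text.strip())
record = element_to_dict(root)
print(record["player"]["firstName"])
```

The resulting dictionary has the same shape as the JSON example, which is why the two formats are often interchangeable for this kind of data.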

## JavaScript and JSON to Python dictionaries (.js)

Now let's deal with the .js files of the NBA stats. [Website here](http://stats.nba.com)

In [2]:

# Python library 
import json


In [3]:

### Reading the JavaScript file (previously downloaded from the URL below)
### http://stats.nba.com/js/data/ptsd/stats_ptsd.js

out_pathname = "scrap_docs/stats_ptsd.js"
with open(out_pathname,'r') as jf:
    data = jf.read()
    
# JavaScript Variable

target_variable = "stats_ptsd = "
index_variable = data.find(target_variable)
print ( "\nThe variable required is located at the character-index number {0} \n".format(index_variable) )

# We should notice here that the data is structured in the following way: 
# Double quotes for the principal dictionaries: generated, seasons_count, teams_count, players_count and data

string_for_json = data[ index_variable+len(target_variable): -1]    # -1 to get rid of the semicolon  ;
data_json = json.loads(string_for_json)

print ("Principal keys: {}\n".format(data_json.keys()))
print ("Secondary data-keys: {}\n".format(data_json['data'].keys()))



The variable required is located at the character-index number 4 

Principal keys: [u'generated', u'players_count', u'data', u'teams_count', u'seasons_count']

Secondary data-keys: [u'seasons', u'players', u'teams']
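
The slicing step above can be wrapped in a small helper. This is only a sketch: the function name `js_variable_to_dict` and the `sample_js` string are made up for illustration, and it assumes the file holds a single `name = {...};` assignment whose right-hand side is valid JSON.

```python
import json

def js_variable_to_dict(js_source, variable_name):
    """Parse the JSON literal assigned to `variable_name` in a JS source string."""
    marker = variable_name + " = "
    start = js_source.find(marker)
    if start == -1:
        raise ValueError("variable not found: " + variable_name)
    # Keep everything after the marker, then drop trailing whitespace and ';'
    literal = js_source[start + len(marker):].strip().rstrip(";")
    return json.loads(literal)

# Tiny stand-in for the real file content (same top-level layout, fake values)
sample_js = 'var stats_ptsd = {"generated": "sample", "data": {"teams": [], "players": []}};\n'

parsed = js_variable_to_dict(sample_js, "stats_ptsd")
print(sorted(parsed["data"].keys()))
```

Stripping the semicolon with `rstrip(";")` instead of a fixed `-1` slice keeps the helper working whether or not the file ends with a newline.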



## JSON to Python dictionaries (.json)

Now let's work with the JSON format available in the NBA stats website.

<table class="image">
<caption text-align="center">This is how the data looks today!! </caption>
<tr><td><img src="scrap_docs/Capture.PNG" style="max-width:100%; width: 70%; max-width: none; text: 'hello'" ></td></tr>
</table>


In [4]:

### Loading the JSON file (previously downloaded)
### http://stats.nba.com/schedule/summerleague/#!?PD=N
### http://data.nba.com/data/10s/v2015/json/mobile_teams/utah/2017/scores/14_todays_scores.json

out_pathname = "scrap_docs/14_full_schedule.json"
with open(out_pathname,'r') as jf:  
    data = json.load(jf)
    
print data['lscd'][0]['mscd']['mon'] + "\n"
print "Options:", data['lscd'][0]['mscd']['g'][1].keys(), "\n"

print "Example:\n"
print data['lscd'][0]['mscd']['g'][0]['v']  # Visiting team
print data['lscd'][0]['mscd']['g'][0]['h']  # Host (home) team
print "\n"


July

Options: [u'bd', u'h', u'ac', u'ptsls', u'seq', u'gcode', u'is', u'v', u'vtm', u'an', u'stt', u'as', u'gid', u'gdte', u'st', u'ppdst', u'gdtutc', u'etm', u'htm', u'utctm', u'seri'] 

Example:

{u'tn': u'Hornets', u're': u'1-0', u's': u'74', u'tid': 1610612766, u'tc': u'Charlotte', u'ta': u'CHA'}
{u'tn': u'Heat', u're': u'0-1', u's': u'67', u'tid': 1610612748, u'tc': u'Miami', u'ta': u'MIA'}
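
To show how this nested structure can be traversed, here is a small sketch that summarizes each game. The `schedule` dictionary is a trimmed copy of the values printed above, and `summarize_game` is a hypothetical helper; it assumes every game already has a final score (an unplayed game may not carry a numeric score).

```python
# Trimmed sample with the same nesting as the schedule file
schedule = {
    "lscd": [
        {"mscd": {"mon": "July", "g": [
            {"gdte": "2017-07-01",
             "v": {"ta": "CHA", "tn": "Hornets", "s": "74"},
             "h": {"ta": "MIA", "tn": "Heat", "s": "67"}},
        ]}}
    ]
}

def summarize_game(game):
    """Return a one-line summary: date, visitor score @ host score, winner."""
    visitor, host = game["v"], game["h"]
    # Scores are stored as strings, so convert them before comparing
    visitor_pts, host_pts = int(visitor["s"]), int(host["s"])
    winner = visitor["ta"] if visitor_pts > host_pts else host["ta"]
    return "{0}: {1} {2} @ {3} {4} (winner: {5})".format(
        game["gdte"], visitor["ta"], visitor_pts, host["ta"], host_pts, winner)

# 'lscd' is a list of months and each month holds its games under 'mscd' -> 'g'
summaries = [summarize_game(game)
             for month in schedule["lscd"]
             for game in month["mscd"]["g"]]

for line in summaries:
    print(line)
```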




# <font color='blue'>  Elastic Search engine for JSON format </font> 

This is an example of how to work with Elasticsearch in Python. You should first install the Elasticsearch Python package and set up your Elasticsearch server. In a local environment, this means that while running the following Python commands you need a terminal/shell open (inside the elasticsearch/bin folder) with the 'ElasticSearch' executable running [(details here).](https://www.elastic.co/downloads/elasticsearch)

This **ElasticSearch engine** is a distributed and multitenant full-text search engine for free-structured (schema-free) JSON documents. It has an HTTP web interface (we need the server/host) and it is based on an INDEX framework. It is developed in Java on top of Apache Lucene, was released as open source, and is used by Facebook, Mozilla, and Soundcloud, among others.

### Things to know about ElasticSearch:

- To create a better context for Elasticsearch you might want to consider data structures based on:
    * How data flows in your current system
    * Streaming/real-time vs static
- An index should have these properties:
    * atomic, consistent, isolated and durable
- Relations can still be modeled (as in a SQL context)
    * For example: you can have an index type for users, {'user': {1: {'age': 15, 'name': 'John T'}, 2: {'age': 18, 'name': 'Sarah V'}}}, and then match it with another index type, {'websites': {'_id': 122456, 'info': {'user': 2, 'url': 'sarahcom'}}}. Those types can definitely be joined. 
- During index creation we can consider "denormalizing"
    * This means that we can keep redundant copies in order to run fewer queries (each document contains all the information needed to determine matches). We could put the user info inside the website info too, as in {'websites': {'_id': 122456, 'info': {'user': 2, 'url': 'sarahcom'}, 'user': {'age': 18, 'name': 'Sarah V'}}}, and then build queries on "user.age"
- You should try to avoid language problems with the following ideas:
    * Remove diacritics
    * Reduce words and keys to the 'roots' of words (stemming and lemmatization)
    * Remove [stopwords](https://en.wikipedia.org/wiki/Stop_words)
    * Include synonyms (this also helps with misspellings)
    * Misspellings can be addressed with "Levenshtein distance" and "fuzzy matching" (4 main edit operations: substituting a char, inserting a new char, deleting a char, and transposing two chars)
- Naming conventions for field names
    * Use lower case
    * Use underscore "_" for combined words
    * Go from the general index to the particular (system->core->user->data)
    * If you have 2 fields that carry the same information in 2 units/measures, remove the less granular one (keep the details)
    * Use singular and plural in your favor to make keys self-descriptive (if you have many requests, write "requests_num" as your key instead of "request_num")
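
Two of the language ideas above, removing diacritics and measuring misspellings, can be sketched in plain Python. This is an illustration only and not how Elasticsearch implements them internally; the function names are made up here, and the distance shown is the "optimal string alignment" variant, which counts the four edit operations listed above.

```python
import unicodedata

def remove_diacritics(text):
    """Strip accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))

def edit_distance(a, b):
    """Minimum number of substitutions, insertions, deletions and
    adjacent transpositions needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i            # delete all i leading chars of a
    for j in range(cols):
        dist[0][j] = j            # insert all j leading chars of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # transposition of two adjacent chars
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[rows - 1][cols - 1]

print(remove_diacritics(u"Schr\u00f6der"))
print(edit_distance("Williams", "Wiliams"))
print(edit_distance("Lou", "Luo"))
```

A small distance between the query term and an indexed term is what allows a fuzzy match to tolerate typos.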
    
### Now, let's work with our JSON in an ElasticSearch context

In [5]:

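import json
from elasticsearch import Elasticsearch, helpers

# Connect to the local Elasticsearch server (default host and port)
es = Elasticsearch()

def load_json(filename):
    """Yield the content of a .json file as one single document."""
    if filename.endswith('.json'):
        with open(filename,'r') as open_file:
            yield json.load(open_file)

# Index the whole schedule file as one document
helpers.bulk(es, load_json("scrap_docs/14_full_schedule.json"), index='index_basket', doc_type='games_scores')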


(1, [])
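
The call above stores the whole file as one document, which is why the result reports a single indexed item. As a hedged alternative sketch, a generator could emit one bulk action per game so that each game becomes searchable on its own. The function `game_actions` and the sample data are hypothetical; the action keys `_index`, `_id` and `_source` follow the format accepted by the bulk helper, and the block itself does not contact a server.

```python
def game_actions(schedule, index_name="index_basket"):
    """Yield one bulk action per game instead of one document per file."""
    for month in schedule["lscd"]:
        for game in month["mscd"]["g"]:
            yield {
                "_index": index_name,
                "_id": game["gid"],   # game id reused as the document id
                "_source": game,
            }

# Trimmed sample with the same nesting as the schedule file (second id is fake)
sample_schedule = {
    "lscd": [
        {"mscd": {"mon": "July", "g": [
            {"gid": "1421700001", "gcode": "20170701/CHAMIA"},
            {"gid": "0000000002", "gcode": "sample/game"},
        ]}}
    ]
}

actions = list(game_actions(sample_schedule))
print(len(actions))

# With a running server, the generator could be passed to the bulk helper,
# keeping doc_type as in the cell above if your client version requires it:
# helpers.bulk(es, game_actions(data), doc_type='games_scores')
```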

In [7]:

## Get results based on ID.

## I got the ID from the command "GET index_basket/games_scores/_search" in the Chrome SENSE console
## They can change once you re-run this
## https://chrome.google.com/webstore/detail/sense-beta/lhjgkmllcaadmopgmanpapmpjgmfcfig?hl=en

result= es.get(index='index_basket', doc_type='games_scores', id= "AV12BTAghrFY_Cr3ITXW")

result['_source']['lscd'][0]['mscd']['g'][0]


{u'ac': u'Orlando',
 u'an': u'Amway Center',
 u'as': u'FL',
 u'bd': {u'b': [{u'disp': u'NBATV',
    u'lan': u'English',
    u'scope': u'natl',
    u'seq': 1,
    u'type': u'tv'}]},
 u'etm': u'2017-07-01T11:00:00',
 u'gcode': u'20170701/CHAMIA',
 u'gdte': u'2017-07-01',
 u'gdtutc': u'2017-07-01',
 u'gid': u'1421700001',
 u'h': {u're': u'0-1',
  u's': u'67',
  u'ta': u'MIA',
  u'tc': u'Miami',
  u'tid': 1610612748,
  u'tn': u'Heat'},
 u'htm': u'2017-07-01T11:00:00',
 u'is': 1,
 u'ppdst': u'I',
 u'ptsls': {u'pl': [{u'fn': u'Okaro',
    u'ln': u'White',
    u'pid': u'1627855',
    u'ta': u'MIA',
    u'tc': u'Miami',
    u'tid': 1610612748,
    u'tn': u'Heat',
    u'val': u'20'}]},
 u'seq': 1,
 u'seri': u'',
 u'st': u'3',
 u'stt': u'Final',
 u'utctm': u'15:00',
 u'v': {u're': u'1-0',
  u's': u'74',
  u'ta': u'CHA',
  u'tc': u'Charlotte',
  u'tid': 1610612766,
  u'tn': u'Hornets'},
 u'vtm': u'2017-07-01T11:00:00'}

In [8]:

es.indices.delete(index='index_basket', ignore=[400, 404])


{u'acknowledged': True}

### JSON from the web  (json response from the web)

In [9]:

### Using an HTTP request to download a JSON

import requests
import pandas as pd

my_url= 'http://stats.nba.com/stats/commonteamroster?LeagueID=00&Season=2013-06&TeamID=1610612737'

# We can get this from the browser: Inspect -> Network -> Headers -> Request headers -> User Agent
# It differs depending on the operating system and browser you use (for example iOS or Windows)

headers_nba = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
response = requests.get(my_url, headers = headers_nba)

# Parse the response body once and reuse the first result set
result_set = response.json()['resultSets'][0]
headers = result_set['headers']
players = result_set['rowSet']
players_df = pd.DataFrame(players, columns=headers)

players_df.head(10)


Unnamed: 0,TeamID,SEASON,LeagueID,PLAYER,NUM,POSITION,HEIGHT,WEIGHT,BIRTH_DATE,AGE,EXP,SCHOOL,PLAYER_ID
0,1610612737,2013,0,Jeff Teague,0,G,6-2,181,"JUN 10, 1988",26.0,4,Wake Forest,201952
1,1610612737,2013,0,Lou Williams,3,G,6-1,175,"OCT 27, 1986",27.0,8,South Gwinnett HS (GA),101150
2,1610612737,2013,0,Paul Millsap,4,F,6-8,253,"FEB 10, 1985",29.0,7,Louisiana Tech,200794
3,1610612737,2013,0,DeMarre Carroll,5,F,6-8,212,"JUL 27, 1986",27.0,4,Missouri,201960
4,1610612737,2013,0,Pero Antic,6,C,6-11,260,"JUL 29, 1982",31.0,R,Macedonia,203544
5,1610612737,2013,0,Shelvin Mack,8,G,6-3,207,"APR 22, 1990",24.0,2,Butler,202714
6,1610612737,2013,0,John Jenkins,12,G,6-4,215,"MAR 06, 1991",23.0,1,Vanderbilt,203098
7,1610612737,2013,0,Gustavo Ayon,14,F-C,6-10,250,"APR 01, 1985",29.0,2,"Tepic, Mexico",202970
8,1610612737,2013,0,Al Horford,15,C-F,6-10,250,"JUN 03, 1986",28.0,6,Florida,201143
9,1610612737,2013,0,Dennis Schroder,17,G,6-1,168,"SEP 15, 1993",20.0,R,Germany,203471
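
The response separates column names (`headers`) from the rows (`rowSet`). As a minimal sketch without pandas, the two can be paired with `zip` to get one dictionary per player. The `payload` below is a hypothetical, trimmed-down stand-in with the same shape as the real response.

```python
# Stand-in payload: same layout as the response, only three columns and two rows
payload = {
    "resultSets": [{
        "headers": ["PLAYER", "NUM", "POSITION"],
        "rowSet": [
            ["Jeff Teague", "0", "G"],
            ["Lou Williams", "3", "G"],
        ],
    }]
}

result_set = payload["resultSets"][0]

# Pair every row with the column names to build one dictionary per player
records = [dict(zip(result_set["headers"], row))
           for row in result_set["rowSet"]]

print(records[1]["PLAYER"])
```

Each of these dictionaries is already a JSON-like document, which is the shape Elasticsearch expects when indexing.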


### Let's build our ElasticSearch index

In [10]:

players_dicts = players_df.to_dict(orient='index')

es = Elasticsearch()

for d in players_dicts:
    es.index(index='index_basket', doc_type='team-roaster',id=d,body=players_dicts[d])


In [12]:

## Option1 for queries
myquery = {
    "query": {
        "match" : {"PLAYER": "Lou Williams"} 
    }
}
result = es.search(index='index_basket', doc_type='team-roaster',body=myquery ,explain=True)


## Option2 for queries
result = es.search(index='index_basket', doc_type='team-roaster', q='Lou' ,explain=True)
result['hits']['hits'][0][u'_source']


{u'AGE': 27.0,
 u'BIRTH_DATE': u'OCT 27, 1986',
 u'EXP': u'8',
 u'HEIGHT': u'6-1',
 u'LeagueID': u'00',
 u'NUM': u'3',
 u'PLAYER': u'Lou Williams',
 u'PLAYER_ID': 101150,
 u'POSITION': u'G',
 u'SCHOOL': u'South Gwinnett HS (GA)',
 u'SEASON': u'2013',
 u'TeamID': 1610612737,
 u'WEIGHT': u'175'}
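
Since a query body is just a nested dictionary, it can be built with a small helper. The sketch below is an assumption-laden illustration: `build_match_query` is a made-up name, and it uses the `fuzziness` option of the `match` query so that a misspelled name can still match. It only builds the dictionary and does not contact a server.

```python
def build_match_query(field, text, fuzziness=None):
    """Build the body of a match query, optionally tolerant to typos."""
    match_options = {"query": text}
    if fuzziness is not None:
        # For example "AUTO", or a maximum number of edits such as 1
        match_options["fuzziness"] = fuzziness
    return {"query": {"match": {field: match_options}}}

exact_query = build_match_query("PLAYER", "Lou Williams")
fuzzy_query = build_match_query("PLAYER", "Lou Wiliams", fuzziness="AUTO")
print(fuzzy_query)

# With a running server and the index built above, it could be used as:
# result = es.search(index='index_basket', body=fuzzy_query)
```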

In [13]:

# Note: this counts the fields of the first hit, not the number of hits;
# use len(result['hits']['hits']) to count the hits returned
print ("Results", len(result['hits']['hits'][0]) )
print ("Score = ", result['hits']['hits'][0]['_score'] )

print ("\nThe score comes from the Boolean model, which finds the matching documents, and a formula that computes a relevance metric\n")

result['hits']['hits'][0][u'_source']


('Results', 8)
('Score = ', 0.9351026)

The score comes from the Boolean model, which finds the matching documents, and a formula that computes a relevance metric



{u'AGE': 27.0,
 u'BIRTH_DATE': u'OCT 27, 1986',
 u'EXP': u'8',
 u'HEIGHT': u'6-1',
 u'LeagueID': u'00',
 u'NUM': u'3',
 u'PLAYER': u'Lou Williams',
 u'PLAYER_ID': 101150,
 u'POSITION': u'G',
 u'SCHOOL': u'South Gwinnett HS (GA)',
 u'SEASON': u'2013',
 u'TeamID': 1610612737,
 u'WEIGHT': u'175'}

In [14]:

es.indices.delete(index='index_basket', ignore=[400, 404])


{u'acknowledged': True}

## Tool created from this data:  [Basket Strategies in Shiny](https://mariazm.shinyapps.io/basketstrategies/)

In [None]:

### Video Demo
#<p>
#<video controls src="Attachments/App_Demo.mp4" />
#</p>
