<div id="toc"> </div>

# Setting up the notebook

In [None]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import numpy.random as rnd
import os

# to make this notebook's output stable across runs
rnd.seed(42)

# To plot figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

In [20]:
# Where to save the figures
PROJECT_ROOT_DIR = "."

def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, "images", fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)


# Demonstrations

## Demo 1

In [None]:
import requests

We'll scrape a website that contains a table and save the result to a `.csv` file. As a quick example, we'll use a [US Government public database](https://www.cia.gov/library/publications/the-world-factbook/fields/print_2085.html).

In [None]:
stats = requests.get("https://www.cia.gov/library/publications/the-world-factbook/fields/print_2085.html")

In [None]:
# Check whether the request succeeded (a status code of 200 means OK)
stats.status_code

In [None]:
#Printing content
stats.content

## Demo 3

**BS4 example:**

In [None]:
! pip install beautifulsoup4

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(stats.content, 'html.parser')

In [None]:
# Pretty-print the parsed HTML so the tag structure is easier to read
soup.prettify()

To separate out all the relevant tags, we need to taste the soup!

The data we want is in the tag `<table id="fieldListing">`.

In [None]:
table = soup.find("table", attrs={'id':'fieldListing'})

In [None]:
rows = []
for row in table.find_all('tr'):
    # Keep each cell as text (str); encoding to bytes would write b'...' literals to the CSV under Python 3
    rows.append([val.text for val in row.find_all('td')])

In [None]:
rows[1]

## Demo 4

**Saving scraped data into a CSV file**

* We'll now save the raw data to a CSV file

In [None]:
# Import csv package
import csv

In [None]:
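# newline='' lets the csv module control line endings (avoids blank lines on some platforms)
with open('data/facts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # Skip empty rows, such as header rows that contain no <td> cells
    writer.writerows(row for row in rows if row)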

Below are the first 5 rows in the saved csv file.

In [None]:
! head -5 data/facts.csv
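
As a quick check of the write step, the sketch below writes a few rows with `csv.writer` and reads them back with `csv.reader`. The `sample_rows` values, the helper names `write_rows` and `read_rows`, and the temporary file are made up for illustration and are not taken from the scraped page.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the scraped table (not real data).
sample_rows = [
    ["Country A", "value 1"],
    [],                                  # an empty row, e.g. a <tr> with no <td> cells
    ["Country B", "value, with comma"],
]

def write_rows(path, rows):
    """Write the non-empty rows to `path` and return how many were written."""
    kept = [row for row in rows if row]
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(kept)
    return len(kept)

def read_rows(path):
    """Read every row of the CSV file at `path` back into a list of lists."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))

tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "facts_demo.csv")
n_written = write_rows(csv_path, sample_rows)
round_trip = read_rows(csv_path)
print(n_written, round_trip)
```

The field containing a comma comes back as a single value because the writer quotes it, which is the main reason to prefer the `csv` module over joining strings by hand.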

There are many more complex operations you can do with Beautiful Soup, but it would be better to move to the `scrapy` Python package for more productive, hassle-free web scraping.

## Demo 5
    
**Query 1: _We want to get the general information of the user_**

In [None]:
import requests

In [None]:
url = "https://api.github.com/users/shwedosh"
request = requests.get(url)

In [None]:
# If 200 then we are doing it right!
request.status_code

In [None]:
request.json()

**Query 2: _We want to get the number of public repositories the user has_**

In [None]:
url = "https://api.github.com/users/karpathy/repos"
request = requests.get(url)

In [None]:
request.status_code

In [None]:
repositories = request.json()

In [None]:
repositories
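
Query 2 asks for a count, while the cells above only fetch the list. Here is a minimal sketch of the counting step, using a small made-up payload in place of `request.json()` so that it runs offline. The repository names are hypothetical. Keep in mind that GitHub paginates list endpoints (30 items per page by default), so the length of one page can undercount; the user object from Query 1 carries a `public_repos` field with the total.

```python
import json

# Made-up stand-in for the body of a /users/<name>/repos response (one page).
sample_response = """
[
  {"name": "repo-one", "fork": false},
  {"name": "repo-two", "fork": true},
  {"name": "repo-three", "fork": false}
]
"""

repositories_demo = json.loads(sample_response)   # same shape as request.json()

# Number of repositories on this page of results.
repo_count = len(repositories_demo)

# Names of the repositories that are not forks.
own_repos = [repo["name"] for repo in repositories_demo if not repo["fork"]]

print(repo_count, own_repos)
```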

## Demo 6

**Querying the Twitter REST API**
- We'll use the `tweepy` package to navigate the streaming API.
- We'll need the credentials for the Twitter App that we created in the Pre-Reading section.

**Query 1: _We need to get all the tweets for a specific topic_**

In [None]:
! pip install tweepy

In [None]:
# Import the necessary classes from the tweepy library
# (StreamListener belongs to tweepy 3.x; tweepy 4.0 merged it into Stream)
import tweepy
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

At this point we need to get all the required tokens that the Twitter App generated for us when we created it.

In [None]:
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"

In [None]:
#This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):

    def on_data(self, data):
        print ("The data collected is \n {}".format(data))
        return True

    def on_error(self, status):
        print ("The status is {}".format(status))

In [None]:
# This handles Twitter authentication and the connection to the Twitter Streaming API
l = StdOutListener()
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
stream = Stream(auth, l)

# This line filters the Twitter stream to capture tweets matching the keyword 'football'
stream.filter(track=['football'])

(_Note: When running in a notebook, interrupt the kernel to stop the stream; otherwise it keeps running._)

**Query 2: _We need to get the tweets from a specific user._**

In [None]:
# We have already authorized the Twitter App
import tweepy
api = tweepy.API(auth)

# Placeholder: the account whose tweets we want
screen_name = "ENTER A SCREEN NAME"

# Initialize a list to hold all the tweepy tweets
alltweets = []

# Make the initial request for the most recent tweets (200 is the maximum allowed count)
new_tweets = api.user_timeline(screen_name=screen_name, count=200)

# Save the most recent tweets (extend keeps alltweets a flat list of tweets)
alltweets.extend(new_tweets)

# A single request returns at most 200 tweets; fetching the rest takes repeated, paginated requests.
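
One common way to go past the first 200 tweets is to keep asking for tweets older than the oldest one already collected. The sketch below shows that loop against a fake in-memory timeline so that it runs without credentials; `fake_user_timeline`, `fetch_all`, and `FAKE_IDS` are hypothetical stand-ins. With tweepy, the same idea would pass `max_id` to `api.user_timeline`, and each element would be a status object with an `.id` attribute rather than a bare integer.

```python
# Hypothetical timeline: tweet ids from newest (450) to oldest (1).
FAKE_IDS = list(range(450, 0, -1))

def fake_user_timeline(count=200, max_id=None):
    """Stand-in for an API call: return up to `count` ids that are <= max_id."""
    eligible = [i for i in FAKE_IDS if max_id is None or i <= max_id]
    return eligible[:count]

def fetch_all(fetch_page, count=200):
    """Collect every item by repeatedly requesting items older than the last one seen."""
    collected = []
    page = fetch_page(count=count)
    while page:
        collected.extend(page)
        oldest = page[-1]
        # Ask only for items strictly older than the oldest one we already hold.
        page = fetch_page(count=count, max_id=oldest - 1)
    return collected

all_ids = fetch_all(fake_user_timeline)
print(len(all_ids), all_ids[0], all_ids[-1])
```

The loop stops when a request comes back empty, so it needs no advance knowledge of how many tweets exist.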

## Demo 7

**Parsing CSV responses**
- If the API response is in CSV format, we can use the `csv` package to navigate through it.

In [None]:
# We'll consider the Quandl API for our purposes
import requests
url = "https://www.quandl.com/api/v3/datasets/WIKI/AAPL.csv"
request = requests.get(url)

In [None]:
request.status_code

In [None]:
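# request.text is the decoded body; str(request.content) would give the "b'...'" repr under Python 3
datacsv = request.text

In [None]:
# One list entry per line of the CSV response
rows = datacsv.splitlines()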

In [None]:
# Sample row printed from the response received
rows[1]
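
The `csv` module mentioned above can do the splitting itself, including quoted fields that a plain `split` on commas would break. Here is a minimal sketch on a made-up response body; the column names and values are illustrative only and are not real figures from the API. In the notebook, `request.text` would take the place of `sample_text`.

```python
import csv
import io

# Made-up CSV body standing in for request.text (values are not real data).
sample_text = (
    "Date,Label,Close\n"
    '2000-01-03,"first, quoted",10.5\n'
    "2000-01-04,second,11.0\n"
)

# csv.reader handles the header line and commas inside quoted fields.
reader = csv.reader(io.StringIO(sample_text))
header = next(reader)
records = [dict(zip(header, row)) for row in reader]

print(header)
print(records[0])
```

Every value is read as a string, so numeric columns such as `Close` still need converting (for example with `float`) before any arithmetic.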

## Demo 8

**Parsing JSON responses**
    
* We'll use the GitHub API to illustrate parsing a JSON response

In [None]:
url = "https://api.github.com/users/karpathy"
request = requests.get(url)

In [None]:
# Storing the parsed JSON response (a dict) in a variable
userDetails = request.json()

In [None]:
# Printing out all the user details
userDetails

## Demo 9

In [None]:
# Getting the Name and Location of the user
print ("The name of the repository owner is {}.".format(userDetails.get('name')))
print ("The location of the repository owner is {}.".format(userDetails.get('location')))
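
`dict.get` is used above because it returns `None` rather than raising `KeyError` when a key is absent. Here is a short sketch with a made-up payload (the values are hypothetical, not a real API response) that also shows how to supply a fallback.

```python
import json

# Made-up payload shaped like a user object; "location" is null and "company" is missing.
user_demo = json.loads('{"login": "example-user", "name": "Example User", "location": null}')

name = user_demo.get("name")                         # key present
location = user_demo.get("location") or "unknown"    # key present but null, so fall back
company = user_demo.get("company", "not given")      # key absent, so the default is used

print("The name of the repository owner is {}.".format(name))
print("Location: {}, company: {}".format(location, company))
```

Note the difference between the last two cases: the second argument of `get` only applies when the key is missing, so a key that is present with a null value needs the `or` fallback instead.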

## Demo 10

**Worked example: reading YAML files into Python**

In [1]:
! pip install pyyaml



In [2]:
import yaml

In [21]:
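# For demonstration purposes we load a sample YAML file
with open("./data/sample.yml", 'r') as stream:
    try:
        # safe_load only builds plain Python objects; bare yaml.load() needs an explicit Loader in recent PyYAML
        sample = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        print(exc)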

In [22]:
sample

{'A': 'a', 'B': {'C': 'c', 'D': 'd', 'E': 'e'}}

## Demo 11

In [4]:
class FlatDict:        
    def flatDict(self, dictObj=None):
        '''Flatten a given dict
        '''
        #print('Arg received: ', dictObj)
        for key, value in dictObj.items():
            #print('Now iterating through: ', {key:value})
            if isinstance(value, dict):
                #print('Value: ', value, ', Is value a dictionary? ', isinstance(value, dict))
                for key2, value2 in value.items():
                    self.flatDict({'_'.join([key, key2]) : value2})
            elif isinstance(value, list) and value and isinstance(value[0], str):  # non-empty list of strings
                value = ', '.join(value)
                #print('The pair to be updated: ', {key:value})
                self.flatteneddict.update({key:value})
            else:
                #print('The pair to be updated: ', {key:value})
                self.flatteneddict.update({key:value})
        
    
    def __init__(self, dictObj=None):
        self.flatteneddict = {}
        if not isinstance(dictObj, dict):
            raise ValueError('Expected a dictionary object as input!')
        self.flatDict(dictObj)
    
    
    def __repr__(self):
        return(str(self.flatteneddict))

In [24]:
FlatDict(sample)

{'A': 'a', 'B_C': 'c', 'B_D': 'd', 'B_E': 'e'}
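
The same flattening can be sketched as a plain recursive function, which may be easier to test than the class. This is an alternative formulation, not the notebook's `FlatDict`; the name `flatten` and the `sep` parameter are assumptions made for the example, and unlike the class it checks every element of a list rather than only the first.

```python
def flatten(d, parent_key="", sep="_"):
    """Return a flat copy of `d`, joining nested keys with `sep`."""
    flat = {}
    for key, value in d.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the key path along.
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list) and value and all(isinstance(v, str) for v in value):
            # Non-empty lists of strings become one comma-separated string.
            flat[full_key] = ", ".join(value)
        else:
            flat[full_key] = value
    return flat

sample_demo = {'A': 'a', 'B': {'C': 'c', 'D': 'd', 'E': 'e'}}
flat_demo = flatten(sample_demo)
print(flat_demo)
```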

In [37]:
import pandas as pd
pd.set_option("display.max_columns", 101)

class CricDF(FlatDict):
    
    def __init__(self,  dictObj=None):
        super().__init__(dictObj)
        self.info = dictObj["info"]
#         print(self.flatteneddict)
        self.ballsDF = pd.DataFrame()

    def get_ballsDF(self):  
        for idx, inningsObj in enumerate(self.flatteneddict["innings"]):  # idx = 0, 1; inningsObj = {'1st innings': dict}
            inningsDict = list(inningsObj.values())[0]                  # inningsDict = {'team': val, 'deliveries': dict}
            for ball in inningsDict['deliveries']:                      # a dict
                self.flatteneddict = {}                                 # clear out details of last delivery
                self.flatteneddict.update({'innings': idx + 1})
                self.flatteneddict.update({'batting_team': inningsDict['team']})
                self.flatDict(self.info)
#                 print(self.flatteneddict)
                
#                 print('Ball: ', ball)
                
                for ball_no, ball_details in ball.items():
#                     print('ball_no: ', ball_no, 'ball_details: ', ball_details)
                    self.flatDict(ball_details)
                    idx_df = int(1000*(idx+1) + 10*ball_no)
                    newDF = pd.DataFrame(self.flatteneddict, index=[idx_df])
                    self.ballsDF = pd.concat([self.ballsDF, newDF])
                    
        cols = ['competition', 'gender', 'match_type', 'dates','city', 'umpires', 'venue', 'teams',
                'toss_winner', 'toss_decision', 'outcome_by_runs', 'outcome_winner', 'player_of_match', 
                'innings', 'batting_team', 'batsman', 'non_striker', 'bowler', 'overs', 
                'runs_batsman', 'runs_extras', 'extras_byes', 'extras_legbyes', 
                'extras_wides', 'runs_total', 'wicket_fielders', 'wicket_kind', 'wicket_player_out']
        
        self.ballsDF = self.ballsDF[cols]

- A class to read `Cricsheet` YAMLs and produce CSVs

## Demo 12

In [38]:
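# `file` is assumed to hold the dict loaded from a Cricsheet YAML file (CricDF raises ValueError for non-dict input)
z = CricDF(file)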
z.get_ballsDF()
z.ballsDF

Unnamed: 0,competition,gender,match_type,dates,city,umpires,venue,teams,toss_winner,toss_decision,outcome_by_runs,outcome_winner,player_of_match,innings,batting_team,batsman,non_striker,bowler,overs,runs_batsman,runs_extras,extras_byes,extras_legbyes,extras_wides,runs_total,wicket_fielders,wicket_kind,wicket_player_out
1001,IPL,male,T20,2008-04-19,Chandigarh,"MR Benson, SL Shastri","Punjab Cricket Association Stadium, Mohali","Kings XI Punjab, Chennai Super Kings",Chennai Super Kings,bat,33,Chennai Super Kings,MEK Hussey,1,Chennai Super Kings,PA Patel,ML Hayden,B Lee,20,0,0,,,,0,,,
1002,IPL,male,T20,2008-04-19,Chandigarh,"MR Benson, SL Shastri","Punjab Cricket Association Stadium, Mohali","Kings XI Punjab, Chennai Super Kings",Chennai Super Kings,bat,33,Chennai Super Kings,MEK Hussey,1,Chennai Super Kings,PA Patel,ML Hayden,B Lee,20,0,0,,,,0,,,
1003,IPL,male,T20,2008-04-19,Chandigarh,"MR Benson, SL Shastri","Punjab Cricket Association Stadium, Mohali","Kings XI Punjab, Chennai Super Kings",Chennai Super Kings,bat,33,Chennai Super Kings,MEK Hussey,1,Chennai Super Kings,PA Patel,ML Hayden,B Lee,20,1,0,,,,1,,,
1004,IPL,male,T20,2008-04-19,Chandigarh,"MR Benson, SL Shastri","Punjab Cricket Association Stadium, Mohali","Kings XI Punjab, Chennai Super Kings",Chennai Super Kings,bat,33,Chennai Super Kings,MEK Hussey,1,Chennai Super Kings,ML Hayden,PA Patel,B Lee,20,0,0,,,,0,,,
1005,IPL,male,T20,2008-04-19,Chandigarh,"MR Benson, SL Shastri","Punjab Cricket Association Stadium, Mohali","Kings XI Punjab, Chennai Super Kings",Chennai Super Kings,bat,33,Chennai Super Kings,MEK Hussey,1,Chennai Super Kings,ML Hayden,PA Patel,B Lee,20,4,0,,,,4,,,
1006,IPL,male,T20,2008-04-19,Chandigarh,"MR Benson, SL Shastri","Punjab Cricket Association Stadium, Mohali","Kings XI Punjab, Chennai Super Kings",Chennai Super Kings,bat,33,Chennai Super Kings,MEK Hussey,1,Chennai Super Kings,ML Hayden,PA Patel,B Lee,20,0,0,,,,0,,,
1011,IPL,male,T20,2008-04-19,Chandigarh,"MR Benson, SL Shastri","Punjab Cricket Association Stadium, Mohali","Kings XI Punjab, Chennai Super Kings",Chennai Super Kings,bat,33,Chennai Super Kings,MEK Hussey,1,Chennai Super Kings,PA Patel,ML Hayden,S Sreesanth,20,4,0,,,,4,,,
1012,IPL,male,T20,2008-04-19,Chandigarh,"MR Benson, SL Shastri","Punjab Cricket Association Stadium, Mohali","Kings XI Punjab, Chennai Super Kings",Chennai Super Kings,bat,33,Chennai Super Kings,MEK Hussey,1,Chennai Super Kings,PA Patel,ML Hayden,S Sreesanth,20,4,0,,,,4,,,
1013,IPL,male,T20,2008-04-19,Chandigarh,"MR Benson, SL Shastri","Punjab Cricket Association Stadium, Mohali","Kings XI Punjab, Chennai Super Kings",Chennai Super Kings,bat,33,Chennai Super Kings,MEK Hussey,1,Chennai Super Kings,PA Patel,ML Hayden,S Sreesanth,20,4,0,,,,4,,,
1014,IPL,male,T20,2008-04-19,Chandigarh,"MR Benson, SL Shastri","Punjab Cricket Association Stadium, Mohali","Kings XI Punjab, Chennai Super Kings",Chennai Super Kings,bat,33,Chennai Super Kings,MEK Hussey,1,Chennai Super Kings,PA Patel,ML Hayden,S Sreesanth,20,2,0,,,,2,,,
