
In [1]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import numpy.random as rnd
import os

# to make this notebook's output stable across runs
rnd.seed(42)

# To plot figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

In [2]:
# Where to save the figures
PROJECT_ROOT_DIR = "."

def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, "images", fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)


### Demo 1

In [48]:
import requests

We'll scrape a web page that contains a table and save the result to a `.csv` file. For a quick example, we'll use a page from a [US Government public database](https://www.cia.gov/library/publications/the-world-factbook/fields/print_2085.html).

In [49]:
stats = requests.get("https://www.cia.gov/library/publications/the-world-factbook/fields/print_2085.html")

In [50]:
# Check whether the request succeeded (a status code of 200 means OK)
stats.status_code

200

In [None]:
#Printing content
stats.content

### Demo 3

**BS4 example:**

In [8]:
! pip install beautifulsoup4



In [9]:
from bs4 import BeautifulSoup

In [10]:
soup = BeautifulSoup(stats.content, 'html.parser')

In [None]:
# Pretty-print the parsed HTML so that we can inspect its structure
soup.prettify()

To separate out the relevant tags, we need to taste the soup!

The data we want is in the tag `<table id="fieldListing">`.

In [12]:
table = soup.find("table", attrs={'id':'fieldListing'})

In [13]:
rows = []
for row in table.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])

In [14]:
rows[1]

[b'Afghanistan',
 b'\ntotal: 42,150 km\npaved: 12,350 km\nunpaved: 29,800 km (2006)\n']

### Demo 4

**Saving scraped data into a CSV file**

* We'll now save the raw data to a csv file

In [15]:
# Import csv package
import csv

In [16]:
with open('data/facts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows((row for row in rows if row))

Below are the first 5 rows in the saved csv file.

In [17]:
! cat data/facts.csv | head -5

b'Afghanistan',"b'\ntotal: 42,150 km\npaved: 12,350 km\nunpaved: 29,800 km (2006)\n'"
b'Albania',"b'\ntotal: 18,000 km\npaved: 7,020 km\nunpaved: 10,980 km (2002)\n'"
b'Algeria',"b'\ntotal: 113,655 km\npaved: 87,605 km (includes 645 km of expressways)\nunpaved: 26,050 km (2010)\n'"
b'American Samoa',b'\ntotal: 241 km (2008)\n'
b'Andorra',b'\ntotal: 320 km (2015)\n'
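
The `b'...'` prefixes in the file above appear because the cells were encoded to `bytes` before being written, and `csv.writer` stores their `str()` representation. Here is a minimal sketch of one way to avoid this, assuming rows shaped like `rows` above. The helper name `clean_cell` and the `"; "` separator are illustrative choices, not part of the original notebook, and an in-memory buffer stands in for the real file.

```python
import csv
import io


def clean_cell(cell):
    """Decode bytes to text and collapse a multi-line cell (hypothetical helper)."""
    text = cell.decode("utf8") if isinstance(cell, bytes) else str(cell)
    # Join the non-empty lines of the cell with "; "
    return "; ".join(line.strip() for line in text.splitlines() if line.strip())


# Sample rows copied from the scraped output above
sample_rows = [
    [b"Afghanistan", b"\ntotal: 42,150 km\npaved: 12,350 km\nunpaved: 29,800 km (2006)\n"],
    [],  # rows without <td> cells come back empty and are skipped
    [b"Andorra", b"\ntotal: 320 km (2015)\n"],
]

buffer = io.StringIO()  # stands in for open('data/facts.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerows([clean_cell(c) for c in row] for row in sample_rows if row)
csv_text = buffer.getvalue()
print(csv_text)
```

Each country now takes exactly one line in the file, which makes it easier to read back with `csv.reader`.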


There are many more complex operations you can do with Beautiful Soup, but for larger jobs it is usually more productive and less of a hassle to move to the `scrapy` Python package.

### Demo 5
    
**Query 1: _We want to get the general information of the user_**

In [52]:
import requests

In [53]:
url = "https://api.github.com/users/shwedosh"
request = requests.get(url)

In [54]:
# If 200 then we are doing it right!
request.status_code

200

In [55]:
request.json()

{'login': 'shwedosh', 'id': 22659624, 'avatar_url': 'https://avatars3.githubusercontent.com/u/22659624?v=3', 'gravatar_id': '', 'url': 'https://api.github.com/users/shwedosh', 'html_url': 'https://github.com/shwedosh', 'followers_url': 'https://api.github.com/users/shwedosh/followers', 'following_url': 'https://api.github.com/users/shwedosh/following{/other_user}', 'gists_url': 'https://api.github.com/users/shwedosh/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/shwedosh/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/shwedosh/subscriptions', 'organizations_url': 'https://api.github.com/users/shwedosh/orgs', 'repos_url': 'https://api.github.com/users/shwedosh/repos', 'events_url': 'https://api.github.com/users/shwedosh/events{/privacy}', 'received_events_url': 'https://api.github.com/users/shwedosh/received_events', 'type': 'User', 'site_admin': False, 'name': 'Shweta Doshi', 'company': None, 'blog': '', 'location': None, 'email': None, 'hire

**Query 2: _We want to get the number of public repositories the user has_**

In [56]:
url = "https://api.github.com/users/karpathy/repos"
request = requests.get(url)

In [57]:
request.status_code

200

In [58]:
repositories = request.json()

In [59]:
repositories

[{'id': 47010479, 'name': 'arxiv-sanity-preserver', 'full_name': 'karpathy/arxiv-sanity-preserver', 'owner': {'login': 'karpathy', 'id': 241138, 'avatar_url': 'https://avatars0.githubusercontent.com/u/241138?v=3', 'gravatar_id': '', 'url': 'https://api.github.com/users/karpathy', 'html_url': 'https://github.com/karpathy', 'followers_url': 'https://api.github.com/users/karpathy/followers', 'following_url': 'https://api.github.com/users/karpathy/following{/other_user}', 'gists_url': 'https://api.github.com/users/karpathy/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/karpathy/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/karpathy/subscriptions', 'organizations_url': 'https://api.github.com/users/karpathy/orgs', 'repos_url': 'https://api.github.com/users/karpathy/repos', 'events_url': 'https://api.github.com/users/karpathy/events{/privacy}', 'received_events_url': 'https://api.github.com/users/karpathy/received_events', 'type': 'User', 'site_a
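
Query 2 asks for a count, but the cell above only displays the list. The sketch below shows two ways to get a number. It runs on abbreviated stand-in payloads rather than live responses: the second repository name is made up for illustration. As far as we know, the `/repos` endpoint returns one page at a time (30 items by default), so `len(repositories)` may undercount, while the `public_repos` field of the `/users/<login>` response carries the total.

```python
import json

# Abbreviated stand-ins for request.json() from the two endpoints (illustrative data)
user_payload = json.loads('{"login": "karpathy", "public_repos": 29}')
repos_payload = json.loads(
    '[{"name": "arxiv-sanity-preserver", "fork": false},'
    ' {"name": "made-up-repo", "fork": true}]'
)

# Total reported by the user endpoint
total_public = user_payload.get("public_repos")

# Number of repositories contained in this one page of the /repos response
page_count = len(repos_payload)

# Names of the repositories on this page
names = [repo["name"] for repo in repos_payload]

print(total_public, page_count, names)
```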

### Demo 6

**Querying the Twitter REST API**
- We'll use the `tweepy` package to work with the Twitter APIs: the streaming API for Query 1 and the REST API for Query 2.
- We'll need the credentials for the Twitter App that we created in the Pre-Reading section.

**Query 1: _We need to get all the tweets for a specific topic_**

In [43]:
! pip install tweepy



In [44]:
# Import the necessary classes from the tweepy library
# (StreamListener exists in tweepy 3.x; it was removed in tweepy 4)
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

At this point we need to get all the required tokens that the Twitter App generated for us when we created it.

In [36]:
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"

In [37]:
#This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):

    def on_data(self, data):
        print ("The data collected is \n {}".format(data))
        return True

    def on_error(self, status):
        print ("The status is {}".format(status))

In [39]:
# This handles Twitter authentication and the connection to the Twitter Streaming API
l = StdOutListener()
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
# Stream was imported directly above, so it is used without the `tweepy.` prefix
stream = Stream(auth, l)

# This line filters the Twitter stream to capture tweets matching the keyword 'football'
stream.filter(track=['football'])

(_Note: If you run this in a notebook, you need to interrupt the kernel to stop it; otherwise the stream keeps running._)

**Query 2: _We need to get the tweets from a specific user._**

In [46]:
# We have already authorized the Twitter App above
import tweepy
api = tweepy.API(auth)

# Placeholder: the account whose tweets we want
screen_name = "ENTER A SCREEN NAME"

# initialize a list to hold all the tweepy Tweets
alltweets = []

# make initial request for most recent tweets (200 is the maximum allowed count)
new_tweets = api.user_timeline(screen_name=screen_name, count=200)

# save most recent tweets (extend keeps alltweets a flat list of tweets)
alltweets.extend(new_tweets)

# A single request returns at most 200 tweets; fetching older ones takes repeated, paginated requests.
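
One common way to go past the 200-tweet page is to keep requesting older pages with a `max_id` cursor. The sketch below shows only the shape of that loop: `fetch_all` and `fake_user_timeline` are hypothetical names, and the fake timeline stands in for the real API so that the cell runs offline. With tweepy 3.x one would presumably pass something like `lambda max_id: api.user_timeline(screen_name=screen_name, count=200, max_id=max_id)` instead, assuming valid credentials.

```python
from types import SimpleNamespace


def fetch_all(fetch_page):
    """Collect items page by page, walking backwards through ids.

    fetch_page(max_id) must return a list of objects with an `id` attribute,
    newest first, containing only ids <= max_id (or the newest page if None).
    """
    collected = []
    max_id = None
    while True:
        page = fetch_page(max_id)
        if not page:
            break  # an empty page means there is nothing older left
        collected.extend(page)
        # Next, ask for items strictly older than the oldest one seen so far
        max_id = page[-1].id - 1
    return collected


# Fake timeline of 5 "tweets", served 2 per page (stand-in for the Twitter API)
_timeline = [SimpleNamespace(id=i) for i in (50, 40, 30, 20, 10)]


def fake_user_timeline(max_id=None, count=2):
    eligible = [t for t in _timeline if max_id is None or t.id <= max_id]
    return eligible[:count]


all_items = fetch_all(fake_user_timeline)
print([t.id for t in all_items])
```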

### Demo 7

**Parsing CSV responses**
- If the API response is in CSV format, we can use the `csv` package to navigate through it.

In [1]:
# We'll consider the Quandl API for our purposes
import requests
url = "https://www.quandl.com/api/v3/datasets/WIKI/AAPL.csv"
request = requests.get(url)

In [2]:
request.status_code

200

In [3]:
# request.text is the decoded body; str(request.content) would give the "b'...'" repr of the bytes
datacsv = request.text

In [4]:
rows = datacsv.splitlines()

In [5]:
# Sample row printed from the response received
rows[1]

'2017-04-13,141.91,142.38,141.05,141.05,17439547.0,0.0,1.0,141.91,142.38,141.05,141.05,17439547.0'
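
Splitting on newlines gives raw strings; the `csv` package mentioned above can turn each line into named fields. This is a minimal sketch on a stand-in for the response text: the data line is the sample row shown above, while the header line is an assumption about the dataset's column names and should be checked against the real response.

```python
import csv
import io

# Stand-in for request.text. The header is assumed for illustration;
# the data line is the sample row printed above.
csv_text = (
    "Date,Open,High,Low,Close,Volume,Ex-Dividend,Split Ratio,"
    "Adj. Open,Adj. High,Adj. Low,Adj. Close,Adj. Volume\n"
    "2017-04-13,141.91,142.38,141.05,141.05,17439547.0,0.0,1.0,"
    "141.91,142.38,141.05,141.05,17439547.0\n"
)

# DictReader uses the first line as field names and yields one dict per row
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

first = records[0]
close_price = float(first["Close"])  # csv returns strings, so convert explicitly
print(first["Date"], close_price)
```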

### Demo 8

**Parsing JSON responses**
    
* We'll use the GitHub API to illustrate how to parse a JSON response

In [6]:
url = "https://api.github.com/users/karpathy"
request = requests.get(url)

In [7]:
# Dumping the JSON Response in a variable
userDetails = request.json()

In [8]:
# Printing out all the user details
userDetails

{'avatar_url': 'https://avatars0.githubusercontent.com/u/241138?v=3',
 'bio': None,
 'blog': 'twitter.com/karpathy',
 'company': None,
 'created_at': '2010-04-10T17:55:32Z',
 'email': 'andrej.karpathy@gmail.com',
 'events_url': 'https://api.github.com/users/karpathy/events{/privacy}',
 'followers': 9438,
 'followers_url': 'https://api.github.com/users/karpathy/followers',
 'following': 5,
 'following_url': 'https://api.github.com/users/karpathy/following{/other_user}',
 'gists_url': 'https://api.github.com/users/karpathy/gists{/gist_id}',
 'gravatar_id': '',
 'hireable': None,
 'html_url': 'https://github.com/karpathy',
 'id': 241138,
 'location': 'Stanford',
 'login': 'karpathy',
 'name': 'Andrej',
 'organizations_url': 'https://api.github.com/users/karpathy/orgs',
 'public_gists': 7,
 'public_repos': 29,
 'received_events_url': 'https://api.github.com/users/karpathy/received_events',
 'repos_url': 'https://api.github.com/users/karpathy/repos',
 'site_admin': False,
 'starred_url': 'h

### Demo 9

In [9]:
# Getting the Name and Location of the user
print ("The name of the repository owner is {}.".format(userDetails.get('name')))
print ("The location of the repository owner is {}.".format(userDetails.get('location')))

The name of the repository owner is Andrej.
The location of the repository owner is Stanford.


### Demo 10

**Worked example: reading YAML files into Python**

In [61]:
! pip install pyyaml



In [62]:
import yaml

In [73]:
# For demonstration purposes we load a sample YAML file
with open("./data/sample.yml", 'r') as stream:
    try:
        # safe_load builds only plain Python objects; recent PyYAML requires a Loader for yaml.load
        print(yaml.safe_load(stream))
    except yaml.YAMLError as exc:
        print(exc)

{'A': 'a', 'B': {'C': 'c', 'D': 'd', 'E': 'e'}}


### Demo 11

In [47]:
class FlatDict:        
    def flatDict(self, dictObj=None):
        '''Flatten a given dict
        '''
        #print('Arg received: ', dictObj)
        for key, value in dictObj.items():
            #print('Now iterating through: ', {key:value})
            if isinstance(value, dict):
                #print('Value: ', value, ', Is value a dictionary? ', isinstance(value, dict))
                for key2, value2 in value.items():
                    self.flatDict({'_'.join([key, key2]) : value2})
            elif isinstance(value, list) and value and isinstance(value[0], str):
                value = ', '.join(value)
                #print('The pair to be updated: ', {key:value})
                self.flatteneddict.update({key:value})
            else:
                #print('The pair to be updated: ', {key:value})
                self.flatteneddict.update({key:value})
        
    def __init__(self, dictObj=None):
        self.flatteneddict = {}
        if not isinstance(dictObj, dict):
            raise ValueError('Expected a dictionary object as input!')
        self.flatDict(dictObj)
    
    def __repr__(self):
        return(str(self.flatteneddict))

In [64]:
xxx = {'x': 1, 'y':2, 'z': {'a':1, 'b':2}}
FlatDict(xxx)

{'x': 1, 'y': 2, 'z_a': 1, 'z_b': 2}
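
For comparison, the same flattening can be written as a plain recursive function that returns a new dict instead of updating instance state. This is a sketch, not the notebook's class: the name `flatten` and its `parent_key` and `sep` parameters are illustrative.

```python
def flatten(mapping, parent_key="", sep="_"):
    """Return a new flat dict; nested keys are joined with `sep`."""
    flat = {}
    for key, value in mapping.items():
        full_key = sep.join([parent_key, str(key)]) if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse, carrying the joined key down as the new prefix
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list) and value and all(isinstance(v, str) for v in value):
            # Non-empty lists of strings become one comma-separated string
            flat[full_key] = ", ".join(value)
        else:
            flat[full_key] = value
    return flat


nested = {'x': 1, 'y': 2, 'z': {'a': 1, 'b': 2}}
flat = flatten(nested)
print(flat)
```

Because the prefix is passed down explicitly, this version handles any nesting depth and leaves the input untouched.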

In [82]:
import pandas as pd

class CricSheet(FlatDict):
    def __init__(self, innings=None, info=None):
        self.ballsDF = pd.DataFrame()
        self.get_ballsDF(innings, info)
    
    # x.innings (list) 
    #  > '1st innings', '2nd innings' (dicts) 
    #    > 'team':value, 'deliveries':[dicts of balls {ball_no:value}]
    # Params - x.innings[0] and x.innings[1], if they exist
    def get_ballsDF(self, innings=None, info=None):     # dictObj = x.innings
        for idx, inningsObj in enumerate(innings):      # idx = 0, 1; inningsObj = {'1st innings': dict}
            inningsDict = list(inningsObj.values())[0]  # inningsDict = {'team': val, 'deliveries': dict}
            for ball in inningsDict['deliveries']:      # a dict
                #clear out details of last delivery
                self.flatteneddict = {} 
                self.flatteneddict.update({'innings': idx + 1})
                self.flatteneddict.update({'batting_team': inningsDict['team']})
                self.flatDict(info)
                
                #print('Ball: ', ball)
                
                for ball_no, ball_details in ball.items():
                    #print('ball_no: ', ball_no, 'ball_details: ', ball_details)
                    self.flatDict(ball_details)
                    
                    idx_df = int(100*(idx+1) + 10*ball_no)
                    
                    newDF = pd.DataFrame(self.flatteneddict, index=[idx_df])
                    #print('newDF: \n', type(newDF), '\n', newDF)
                    
                    self.ballsDF = pd.concat([self.ballsDF, newDF])


- A class to read `Cricsheet` YAMLs and produce CSVs
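
To make the YAML-to-CSV step concrete, here is a minimal sketch that avoids pandas: it flattens each delivery into one row and writes the rows with `csv.DictWriter`. The two deliveries are copied from the innings output shown later in this notebook. The function name `delivery_rows` and the column names `innings`, `batting_team` and `ball` are illustrative assumptions, and a buffer stands in for the output file.

```python
import csv
import io

# Two deliveries copied from the innings output in this notebook
innings = [{'1st innings': {'team': 'Chennai Super Kings', 'deliveries': [
    {0.1: {'batsman': 'PA Patel', 'bowler': 'B Lee', 'non_striker': 'ML Hayden',
           'runs': {'batsman': 0, 'extras': 0, 'total': 0}}},
    {0.3: {'batsman': 'PA Patel', 'bowler': 'B Lee', 'non_striker': 'ML Hayden',
           'runs': {'batsman': 1, 'extras': 0, 'total': 1}}},
]}}]


def delivery_rows(innings):
    """Yield one flat dict per ball (one level of nesting is flattened)."""
    for number, innings_obj in enumerate(innings, start=1):
        details = list(innings_obj.values())[0]
        for delivery in details['deliveries']:
            for ball_no, ball in delivery.items():
                row = {'innings': number, 'batting_team': details['team'], 'ball': ball_no}
                for key, value in ball.items():
                    if isinstance(value, dict):
                        for sub_key, sub_value in value.items():
                            row[key + '_' + sub_key] = sub_value
                    else:
                        row[key] = value
                yield row


rows = list(delivery_rows(innings))
buffer = io.StringIO()  # stands in for a file opened with newline=''
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Real match files can carry extra keys on some deliveries, so for full data the field names would need to be the union of all row keys rather than the keys of the first row.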

### Demo 12

In [75]:
import os
import glob
from tqdm import tqdm

class Cricsheet(object):
    """Extracts the content of Cricsheet YAML files so that CSVs can be produced from it.
    
    Attributes:
        ymlpath: Path of the folder where the YAML files are kept.
        csvpath: Path where the output files should be stored.
    """
    
    def __init__(self, ymlpath, csvpath):
        """Returns a Cricsheet object that has *ymlpath* and *csvpath* as the 
           folder paths required.
        """
        self.ymlpath = ymlpath
        self.csvpath = csvpath
        
    def file_list(self):
        """Returns the list of files in the YAML folder supplied.
        """
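        # glob has no "either extension" syntax, so match each extension separately
        filelist = sorted(glob.glob(os.path.join(self.ymlpath, "*.yaml")) +
                          glob.glob(os.path.join(self.ymlpath, "*.yml")))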
        return filelist
    
    def read_files(self):
        """Returns the parsed contents of the first readable YAML file
           (the `return` inside the loop ends the iteration early).
        """
        
        for file in tqdm(self.file_list()):
            with open(file, 'r') as stream:
                try:
                    return yaml.safe_load(stream)
                except yaml.YAMLError as exc:
                    print(exc)
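
`read_files` above returns from inside its loop, so only the first file is ever parsed. A generator is one way to process every file in turn. The sketch below is an assumption-laden variant, not the notebook's method: `iter_parsed_files` is a hypothetical name, and the `parse` callable is pluggable so that the example runs without PyYAML (in the notebook one would pass `yaml.safe_load`).

```python
import glob
import os
import tempfile


def iter_parsed_files(folder, parse, patterns=("*.yaml", "*.yml")):
    """Yield (path, parsed_content) for every matching file in `folder`."""
    paths = []
    for pattern in patterns:
        paths.extend(glob.glob(os.path.join(folder, pattern)))
    for path in sorted(paths):
        with open(path, 'r') as stream:
            yield path, parse(stream)


# Demonstration on a throwaway folder; `parse` here simply reads the text
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.yaml", "b.yml", "notes.txt"):
        with open(os.path.join(folder, name), 'w') as handle:
            handle.write("key: " + name)
    results = [(os.path.basename(path), content)
               for path, content in iter_parsed_files(folder, lambda s: s.read())]

print(results)
```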

In [76]:
_list = Cricsheet("./data/test/", "./data/test/")

In [77]:
file = _list.read_files()

  0%|          | 0/2 [00:00<?, ?it/s]


In [78]:
type(file)

<class 'dict'>

In [79]:
file.get('innings')[0]

{'1st innings': {'team': 'Chennai Super Kings', 'deliveries': [{0.1: {'batsman': 'PA Patel', 'bowler': 'B Lee', 'non_striker': 'ML Hayden', 'runs': {'batsman': 0, 'extras': 0, 'total': 0}}}, {0.2: {'batsman': 'PA Patel', 'bowler': 'B Lee', 'non_striker': 'ML Hayden', 'runs': {'batsman': 0, 'extras': 0, 'total': 0}}}, {0.3: {'batsman': 'PA Patel', 'bowler': 'B Lee', 'non_striker': 'ML Hayden', 'runs': {'batsman': 1, 'extras': 0, 'total': 1}}}, {0.4: {'batsman': 'ML Hayden', 'bowler': 'B Lee', 'non_striker': 'PA Patel', 'runs': {'batsman': 0, 'extras': 0, 'total': 0}}}, {0.5: {'batsman': 'ML Hayden', 'bowler': 'B Lee', 'non_striker': 'PA Patel', 'runs': {'batsman': 4, 'extras': 0, 'total': 4}}}, {0.6: {'batsman': 'ML Hayden', 'bowler': 'B Lee', 'non_striker': 'PA Patel', 'runs': {'batsman': 0, 'extras': 0, 'total': 0}}}, {1.1: {'batsman': 'PA Patel', 'bowler': 'S Sreesanth', 'non_striker': 'ML Hayden', 'runs': {'batsman': 4, 'extras': 0, 'total': 4}}}, {1.2: {'batsman': 'PA Patel', 'bowl

In [83]:
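# CricSheet expects the innings list and the info dict as separate arguments.
# Passing the whole match dict (CricSheet(file)) iterates over its string keys and raises
# AttributeError: 'str' object has no attribute 'values'.
# This call assumes the YAML also has a top-level 'info' mapping.
sheet = CricSheet(file.get('innings'), file.get('info'))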