# Python for Data Science
#### See 'Python Library.docx' for additional info

In [1]:
# import math module as m
import math as m

# import copy module
import copy

# import collections module or just the Counter type
# from collections import Counter
import collections

# import datetime data type from datetime module
from datetime import datetime

# import timezone object from pytz
from pytz import timezone

# import timedelta from datetime
from datetime import timedelta

# import pendulum
import pendulum

# import numpy as np
import numpy as np

# import pandas as pd
import pandas as pd

# import random module
import random

# import sql alchemy db engine
from sqlalchemy import create_engine

# import the necessary functions from the urllib.request package
from urllib.request import urlretrieve, urlopen, Request

# import requests for url requests
import requests

# import json
import json

# import glob
import glob

# import re
import re

## Functions/Methods
- `int(data)` converts data to 'int' (truncates floats)
- `float(data)` converts data to 'float'
- `str(data)` converts data to string
- `max(data)` returns the maximum value
    - data could be values separated by commas or a list (see numpy for arrays)
- `round(num[, digits])` rounds number to `digits` decimal places, or to the nearest 'int' by default
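
A quick sketch of the built-in conversions listed above. The variable names and values are made up for illustration.

```python
# built-in conversions and helpers (illustrative values)
as_int = int(3.99)                  # truncates toward zero
as_float = float("2.5")             # parses a numeric string
as_str = str(42)                    # number to string
largest = max(4, 9, 2)              # separate values...
largest_in_list = max([4, 9, 2])    # ...or a single list

rounded_places = round(3.14159, 2)  # keep 2 decimal places
rounded_int = round(2.7)            # no digits given, so an int comes back

print(as_int, as_float, as_str, largest, largest_in_list, rounded_places, rounded_int)
```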

#### Math Module
- `import math as m`
- `m.pow(num, pow)` raises num to pow (num ** pow)
- `m.sqrt(num)` returns the square root of num
- `m.ceil(num)` rounds num up
- `m.floor(num)` rounds num down
- `m.pi` returns pi to 15 decimal places

#### String Functions
- `str(x)` converts x to a string
- `len(str)` returns the length of the supplied string

#### String Methods
- `.isalpha()` returns 'True' if every char of the string is a letter
- `.islower()` returns 'True' if all chars are lower case
- `.isupper()` returns 'True' if all chars are upper case
- `.isdigit()` returns 'True' if all chars are digits 0-9
- `.startswith(str)` returns 'True' if the object string starts with the supplied string
- `.endswith(str)` returns 'True' if the object string ends with the supplied string
- `.lower()` and `.upper()` convert all chars to lower case and upper case respectively
- `.title()` will capitalize the first letter of ALL words
- `.lstrip()` and `.rstrip()` will strip whitespace from the left and right sides respectively
- `.strip()` will strip whitespace from both sides of a string
- `.ljust(width)` left justifies the string with added whitespace to fill the supplied width
- `.rjust(width)` like ljust, but right justifies
- `.center(width)` same as the justified methods, but centers the string
- `.find(str[, start][, end])` searches the object string for the supplied string and returns the index of the first match, or -1 if not found
    - can specify optional `start` and `end` index values
- `.replace(old, new[, num])` replaces occurrences of old string with new string in the object string
    - can specify optional `num` number of occurrences to replace
- `.split([delimiter][, num])` splits the string into substrings at each delimiter
    - default `delimiter` is any whitespace
    - can supply optional `num` number of occurrences to split
- `.join(sequence)` uses the object string as a delimiter to concat the list supplied as the `sequence` into a string
    - works on a list, or will delimit chars of a string if a string is supplied as the `sequence`
    - `', '.join(names)` would join a list of names delimited by ', '
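
A short sketch chaining several of the string methods above. The sample string and variable names are assumptions chosen for illustration.

```python
raw = "  ada lovelace,grace hopper,alan turing  "

cleaned = raw.strip()                        # drop surrounding whitespace
names = cleaned.split(",")                   # split on commas into a list
titled = [name.title() for name in names]    # capitalize every word
joined = ", ".join(titled)                   # glue back together with ', '

position = joined.find("Grace")              # index of the first match
missing = joined.find("Linus")               # -1 because there is no match
replaced = joined.replace("Alan", "A.", 1)   # replace at most one occurrence
padded = "qty".rjust(8)                      # right-justify in 8 characters

print(joined)
print(position, missing)
print(replaced)
print(repr(padded))
```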

## Simultaneous State Updates
- Use when doing recursive type functions or functions that update variables
    - this way you can update all variables simultaneously and not mix updated vs. old variables
    - syntax is tuple unpacking: the whole right-hand side is evaluated first, then the results are unpacked into the variables
    ```python
    x, y, dx, dy = (x + dx * t,
                    y + dy * t,
                    influence(m, x, y, dx, dy, partial='x'),
                    influence(m, x, y, dx, dy, partial='y'))
    ```
    - this code will run all equations prior to updating the variable, then update all at once
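
A smaller, self-contained sketch of the same idea. The `fibonacci` function is a hypothetical example, not part of the original notes.

```python
def fibonacci(n):
    """Return the first n Fibonacci numbers using a simultaneous update."""
    a, b = 0, 1
    result = []
    for _ in range(n):
        result.append(a)
        # the right-hand side is evaluated with the OLD a and b,
        # then both names are rebound at once
        a, b = b, a + b
    return result

print(fibonacci(8))
```

Writing `a = b` followed by `b = a + b` on separate lines would mix the updated `a` into the second statement, which is the bug the tuple form avoids.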

## Lists
#### List Comprehensions
- Process lists without writing explicit for loops; usually more concise and somewhat faster than building a list with `.append()`
- Works with any iterable
- `new_list = [var_statement for var in old_list]`
    - `var_statement` is your calculation or code to execute for each list item
    - `var_statement` should reference `var` which is assigned by you and represents each element
    - `old_list` is the original list
- `new_list = [statement_with_var1_var2 for var1 in list1 for var2 in list2]`
    - nested syntax
- `statement` is something to return
    - () would return a tuple
    - [] would return a list
    - can also do arithmetic or other math-like-stuff

#### List Operations to Remember
- Slicing
    - `list[start:end:step]` just like with numpy arrays
    - if start or end is blank, the slice starts at the beginning or goes to the end respectively
    - default step is 1
- Creating a Deep Copy
    - `copy.deepcopy(list)` outputs a copy of a list, not just another reference to it
- `len(list)`, `min(list)`, and `max(list)` all perform as expected
- Sorting
    - `sorted(list[, key=function])`
        - **returns a new list** sorted, with optional function to run before sorting
            - optional function like `key=str.lower` does not affect values, just sort behavior
            - see .sort() method for rearranging a list
    - `list.sort([key=function])`
        - **modifies the current list** and sorts, based on optional function
        - default sort order in all cases is 0-9, A-Z, a-z
- Other list functions using `random` module
    - `random.choice(list)` returns a randomly selected item from the list
    - `random.shuffle(list)` shuffles the items randomly
- List methods
    - `list.append(item)` adds item to the end of a list
    - `list.remove(item)` removes item from the list
    - `list.index(item)` returns the index for the item
    - `list.pop([index])` removes and returns item at specified index (or last item by default)
    - `list.count(item)` returns the number of occurrences of item in list or 0 if none
    - `list.reverse()` reverses the order of items in the list
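
A brief sketch contrasting `sorted()` with `.sort()` and a shallow copy with `copy.deepcopy()`. The lists are made-up sample data.

```python
import copy

words = ["pear", "Banana", "apple"]

by_default = sorted(words)                    # new list; uppercase sorts first
ignoring_case = sorted(words, key=str.lower)  # new list; case-insensitive order
unchanged = list(words)                       # 'words' itself is untouched so far

nested = [[1, 2], [3, 4]]
shallow = list(nested)            # new outer list, but the same inner lists
deep = copy.deepcopy(nested)      # fully independent copy
nested[0].append(99)              # visible through 'shallow', not through 'deep'

words.sort(key=len)               # sorts in place and returns None
last = words.pop()                # removes and returns the last item

print(by_default, ignoring_case)
print(shallow[0], deep[0])
print(words, last)
```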

## Dictionaries
- Checking if a key exists
    - `if 'key' in dictionary:`
    - if the key is a number (such as an index value), do not put it in quotes
- Creating a dictionary from two lists
    - `dictionary = dict(zip(key_list, value_list))`
    - in Python 3, `zip` is already lazy; `izip` (from `itertools`) only exists in Python 2
- Deleting items
    - `del dictionary[key]`
        - throws an error if 'key' does not exist
    - `dictionary.pop(key[, default_value])`
        - returns the popped (deleted) value, or 'default_value' if the key does not exist
    - `dictionary.clear()` deletes all dictionary items
- `dictionary.get(key[, default_value])`
    - returns the value associated with key, or default_value if specified and key doesn't exist
    - works like `dictionary[key]` except you can specify default value
    - returns 'None' if the key does not exist and no default value is provided
- `dictionary.keys()`
    - returns a view object of all the keys (**default iterator for a dictionary**)
- `dictionary.items()`
    - returns a view object containing a tuple of each key/value pair
    - use `.iteritems()` if you want to iterate over the key/value pairs
        - only in python 2, gone from python 3
- `dictionary.values()`
    - returns a view object containing all of the values
- `dictionary.update({key: list_or_tuple})`
    - will add a list or tuple under the specified key (the key should be new, or it will change/update that key)
- `print(list(dictionary.items())[:num])`
    - lets you slice the dictionary to only print `num` number of items
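
A compact sketch of the dictionary operations above, using the built-in `zip`. The keys and values are placeholders.

```python
keys = ["a", "b", "c"]
values = [1, 2, 3]

d = dict(zip(keys, values))       # build a dictionary from two lists

has_b = "b" in d                  # membership test checks the keys
z_value = d.get("z")              # key missing and no default given
z_default = d.get("z", 0)         # key missing, default returned instead
popped = d.pop("a")               # returns the value and removes 'a'
safe_pop = d.pop("a", "gone")     # default avoids a KeyError
d.update({"c": 30, "e": 5})       # overwrite 'c' and add 'e'
first_two = list(d.items())[:2]   # slice a list of the (key, value) pairs

print(d)
print(first_two)
```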

#### Looping Through Dictionaries
- More information below in looping section
- Need an iterable, obtained using the `.keys()`, `.items()`, or `.values()` methods
    - in Python 3 these return lightweight view objects, so no extra list is built in memory
        - a `for` loop calls `iter()` on them automatically
- `for key in dictionary.keys():`
    - `statements with dictionary[key]`
- `for key, value in dictionary.items():`
    - `statements using key and value`
- `for value in dictionary.values():`
    - `statements using value`
    
#### Creating a Dictionary with Histogram Counts of List Values
- Problem: how many times does a value appear in a list?
    - Solution: turn it into a dictionary with values as keys and their counts as values
        ```python
        # assumes a list called values
    
        d = {}
        for value in values:
            d[value] = d.get(value, 0) + 1
            
        # resulting dictionary holds each key once and records its number of occurrences in 'values'
        ```
    - Solution 2: use defaultdict from the collections module, see below
    
#### Grouping a List Using Dictionaries
- Solution 1:
    - I want to take a list and group values by some measure into a dictionary
        ```python
        """ Take a list of names and put them in a dictionary.
        
        This dictionary will have name length as keys.
        All names of that length will be assigned to that key.
        """

        d = {}
        for name in names:
            key = len(name)
            if key not in d:
                d[key] = []
            d[key].append(name)
        ```
    - Change the 'key' line to group by something different<br><br>
- Solution 2 (better):
    ```python
    d = {}
    for name in names:
        key = len(name)
        d.setdefault(key, []).append(name)
    ```
- Solution 3 (best, use defaultdict from the collections module):
    ```python
    from collections import defaultdict
    
    d = defaultdict(list)
    for name in names:
        key = len(name)
        d[key].append(name)
    ```

## Sets
- Sets
    - Are similar to lists
    - Contain only distinct values
    - Are unordered
    - Are mutable
    - Are usually created from lists
- Creating a set
    - `set(list_name)` creates a set from a list
        - the set will only contain the unique values from the list
    - `set()` creates an empty set
- Methods of sets
    - `.add(value)` only adds the value if it doesn't exist in the set, if it does, nothing happens
    - `.update(list_or_set)` accepts a list/set and merges the two together
    - `.discard(value)` removes the item specified by *value* from the set, nothing happens if value is not found
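    - `.remove(value)` removes the item specified by *value*, throws a KeyError if the value is not found
    - `.pop()` takes no arguments; removes and returns an arbitrary item, throws a KeyError if the set is empty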
    - `.union(set2)` accepts a second set as an arg, and returns all of the unique values from the two sets
    - `.intersection(set2)` accepts a second set as an arg, and only returns values found in both sets
    - `.difference(set2)` returns the values from set1 that aren't in set2

## Collections Module
#### Basics
- Need to `import collections` or import individual classes with `from collections import ClassName`
    - Classes
        - `Counter`
            - `counter_object = collections.Counter()`
                - creates an empty counter object
            - `counter_object = collections.Counter(iterable)`
                - returns a counter object with each unique value of the iterable (e.g. a list) as a key, and the number of occurrences as the value
            - counter object is similar to a dictionary
            - basically returns a dictionary histogram of your data
            - `counter_object[key]` will return the hist value for the specified key
            - `counter_object.most_common(num)` will return the `num` most common records as (key, count) tuples in descending order
        - `defaultdict`
            - `dfdict_object = collections.defaultdict(list)` 
                - pass the function a default type such as `list` or `int`; a missing key is created with that type's empty value (`[]`, `0`) the first time it is accessed
                    - useful for looping
                - use to count the number of occurrences in a list
                    ```python
                       store_counts = collections.defaultdict(int)
                       for item in my_list:
                           store_counts[item] += 1
                    ```
                - use to count the number of key occurrences in a list of dictionaries
                    - `store_counts = collections.defaultdict(int)` create a place to store the counts
                    - `for item in list_of_dicts:`
                        - `if item.get('key1'):`
                            - `store_counts['key1'] += 1`
                        - `if item.get('key2'):`
                            - `store_counts['key2'] += 1`
                    - it might work better to store the keys in a list first and use the previous method
                        ```python
                        store_counts = collections.defaultdict(int)

                        # gather every key from every dictionary, then count them
                        d_keys = [key for item in list_of_dicts for key in item]
                        for key in d_keys:
                            store_counts[key] += 1
                        ```
                    - when finished looping, `store_counts` will store these counts
                    - `print(store_counts)` will show each 'key' and its counts
                - can also always convert the parts of a dictionary to a list, and use the list method
                    - `key_list = list(dictionary.keys())` then use the list method
                    - `value_list = list(dictionary.values())` then use the list method
                - group list objects together by a 'key' function you specify
                    - group each 'name' from a list 'names' by the length, with the length as the key
                        ```python
                        from collections import defaultdict

                        d = defaultdict(list)
                        for name in names:
                            key = len(name)
                            d[key].append(name)
                        ```
        - `OrderedDict`
            - `ord_dict = collections.OrderedDict()`
            - creates a dictionary that remembers insertion order (a standard dict also keeps insertion order in Python 3.7 and higher, and in CPython 3.6 as an implementation detail, but not before then)
            - use `.popitem()` on an ordered dictionary to remove the last item and return it
            - add the `last=False` argument to remove the first item
        - `namedtuple`
            - Alternative to a dictionary or pandas df
            - See examples below for syntax using a constructor and without a constructor
            - Using a constructor:
                - `MyTuple = collections.namedtuple('MyTuple', field_list)`
                - where `field_list` is a list of strings that name your fields
                - works like defining and creating an object
                    - `MyTuple()` is now a constructor
            - in an instance of a named tuple
                - each 'field' of the named tuple is available as an attribute
                    - `result.field1` to access the value
                - slicing an instance works like a regular tuple, `result[:num]`, no list conversion needed
        - `deque(list)`
            - creates a double-ended queue from a list, allowing faster deletes, pops, and appends at either end
            - use with `del dq[0]`, `dq.popleft()`, `dq.appendleft(item)` where `dq` is the deque
                - these left-end operations are much faster on a `deque` than on a list
        - See 'ChainMap' for joining dictionaries
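
A short sketch of `Counter`, `deque`, and `OrderedDict` in use. The sample data and variable names are assumptions for illustration.

```python
from collections import Counter, OrderedDict, deque

votes = ["red", "blue", "red", "green", "red", "blue"]
counts = Counter(votes)                 # histogram of the list values
top_two = counts.most_common(2)         # the two most frequent (value, count) pairs
unseen = counts["purple"]               # a missing key counts as 0

queue = deque([1, 2, 3])
queue.appendleft(0)                     # cheap insert at the left end
first = queue.popleft()                 # cheap removal from the left end

od = OrderedDict([("x", 1), ("y", 2), ("z", 3)])
last_item = od.popitem()                # removes and returns the last pair
first_item = od.popitem(last=False)     # removes and returns the first pair

print(top_two, unseen)
print(queue, first)
print(last_item, first_item, od)
```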

In [2]:
# named tuple example converting from a dictionary directly without a constructor
my_dict = {'bigkey1': {'field1': 'value1', 'field2': 'value3', 'field3': 3}, 
         'bigkey2': {'field1': 'value4', 'field2': 'value5', 'field3': 6}}
my_tuple = collections.namedtuple('ATuple', my_dict.keys())(*my_dict.values())
print('Named Tuple Using No Constructor:')
print(my_tuple)
print()
print('bigkey2 value from field2: ', my_tuple.bigkey2['field2'])
print()

# using a constructor
MyTuple = collections.namedtuple('MyTuple', ' '.join(sorted(my_dict.keys())))
my_tuple2 = MyTuple(**my_dict)
print('Named Tuple Using a Constructor:')
print(my_tuple2)
print()
print('bigkey1 value from field3: ', my_tuple2.bigkey1['field3'])

Named Tuple Using No Constructor:
ATuple(bigkey1={'field1': 'value1', 'field2': 'value3', 'field3': 3}, bigkey2={'field1': 'value4', 'field2': 'value5', 'field3': 6})

bigkey2 value from field2:  value5

Named Tuple Using a Constructor:
MyTuple(bigkey1={'field1': 'value1', 'field2': 'value3', 'field3': 3}, bigkey2={'field1': 'value4', 'field2': 'value5', 'field3': 6})

bigkey1 value from field3:  3


## Dates and Times
#### datetime Module
- Need to `import datetime` or datetime type `from datetime import datetime`
- Create a datetime object for 'now' based on system time on your machine
    - `now = datetime.now()`
    - `now_utc = datetime.utcnow()` returns the current UTC time as a naive datetime (no timezone attached; deprecated since Python 3.12)
- Timezones, naive, and aware datetimes
    - Use the 'pytz' module and 'timezone' object
        - `from pytz import timezone`
    - Naive datetimes have no info on timezone
    - Aware datetimes have timezones attached to them
    - Examples:
        - create or get a datetime object `dto` for this example
            - this begins as a naive datetime
        - set timezone to an object `ny_tz = timezone('US/Eastern')`
        - use the timezone's `.localize()` method to attach the timezone
            - `ny_date = ny_tz.localize(dto)`
            - `ny_date` is now an aware datetime
            - avoid `dto.replace(tzinfo=ny_tz)` with pytz timezones; it can attach an incorrect historical (LMT) offset
        - can convert to other timezones on aware datetimes using `.astimezone(tz)` method
            - `la_tz = timezone('US/Pacific')`
            - `la_date = ny_date.astimezone(la_tz)`
- Convert a string to a datetime
    - `dto = datetime.strptime(string, format_string)`
    - you supply the 'format_string' as a string using format codes below to format the datetime object
- Format Codes
    - use the codes plus any delimiters like '/' or '-' to create a format string
    - `%d` two digit day
    - `%m` two digit month
    - `%y` two digit year
    - `%Y` four digit year
    - `%H` hour as 24 hour
    - `%I` hour in 12 hr format, only use to display, not create
        - `%p` am/pm specifier
    - `%f` microsecond
    - `%M` minute
    - `%S` second
    - `%a` weekday abbr 'Sat'
    - `%A` weekday name 'Saturday'
    - `%b` month abbr 'Oct'
    - `%B` month name 'October'
    - `%c` date/time formatted for locale (if set)
    - `%x` date formatted for locale
    - `%X` time formatted for locale
- Convert a datetime to a string with given format
    - `string = dto.strftime(format_code)` lets you specify the format
    - `string = dto.isoformat()` spits out the string as ISO format 'YYYY-MM-DDTHH:MM:SS'
- Adding/Subtracting datetimes
    - Use the 'timedelta' object
        - `from datetime import timedelta`
    - Add/subtract a length of time
        - Set a change in time to a timedelta object
            - `dt_diff = timedelta(unit=value)` sets the named unit to the value supplied
                - Units
                    - weeks, days, hours, minutes, seconds, milliseconds, microseconds (months and years are not supported)
        - Perform the operation
            - `dt_span = old_dt +/- dt_diff` add/subtract or whatever using the 'timedelta' object created
    - Find the difference between two datetimes
        - `time_diff = datetime1 - datetime2` this returns a timedelta object
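
A minimal sketch tying the parsing, formatting, and timedelta steps together using only the standard library. The date string and variable names are made up for illustration.

```python
from datetime import datetime, timedelta

# parse a string into a naive datetime using format codes
start = datetime.strptime("2021-03-14 09:30", "%Y-%m-%d %H:%M")

# add a span of time, then subtract two datetimes to get a timedelta back
later = start + timedelta(weeks=1, hours=2)
elapsed = later - start

# convert back to strings
label = later.strftime("%d/%m/%Y %H:%M")
iso = start.isoformat()

print(label)
print(iso)
print(elapsed.days, elapsed.total_seconds())
```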
        
#### Pendulum Module
- Need to import
    - `import pendulum`
- Create a pendulum datetime object for 'now'
    - `now_dt = pendulum.now(timezone_string)`
- Creating a pendulum datetime object from a string
    - `parsed_dt = pendulum.parse(string, tz=timezone_string)`
        - where you provide a string containing datetime info and a string with timezone info
        - timezone strings are the same as for the datetime module above
        - to use datetime, you would have to create a dt object, then alter the timezone in different steps
    - if date is not in iso8601 format, add arg `strict=False` to parse the date
- Converting a pendulum datetime to another timezone
    - `new_dt = old_dt.in_timezone(timezone_string)`
- Converting a pendulum datetime to iso string
    - `iso_dt = old_dt.to_iso8601_string()`
- Differences between datetimes
    - `diff = pendulum_time1 - pendulum_time2`
        - the `.in_XX()` method will display this difference in a meaningful way
            - `in_words()`
            - `in_days()`
            - `in_hours()`
        - Example:
            - `print(diff.in_words())` will display this difference in words that makes sense
                - 'x days y hours z minutes'

## Formatting Numbers as Strings
- Used for display or converting numbers to strings
- `"{:format_specification}".format(data)` formats supplied `data` to `:format_specification`
- It is possible to pass multiple format specifications and multiple data entries/types
    - `"{:form_spec1}{:form_spec2}{:form_spec3}".format("str", float, int)`
    - can print a grid by repeating this with a consistent `field_width` in each position
- format_specification is comprised of:
    - `[field_width][,][.decimal_places][type_code]`
    - `field_width` is specified in characters as 'int'
        - by default, strings are justified left and numbers justified right
        - add `>` or `<` just before the width to specify 'right' and 'left' justified
    - `,` specifies whether large numbers use commas
    - `.decimal_places` dot with number of decimal places to include
    - `type_code` specifies data type
        - `d` integer (decimals can't be specified)
        - `f` float (will round to specified decimal places)
        - `%` percent (supply decimal, it multiplies by 100 and adds '%')
        - `e` converted to scientific notation
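
A few one-line sketches of the comma, percent, scientific, and alignment options described above. The numbers are arbitrary sample values.

```python
population = 1234567.891
share = 0.4567

with_commas = "{:,.2f}".format(population)   # thousands separators, 2 decimals
as_percent = "{:.1%}".format(share)          # multiplies by 100 and adds '%'
scientific = "{:.3e}".format(population)     # scientific notation, 3 decimals
left_aligned = "{:<10}|".format("abc")       # pad on the right to 10 characters
right_aligned = "{:>10}|".format("abc")      # pad on the left to 10 characters

print(with_commas, as_percent, scientific)
print(left_aligned)
print(right_aligned)
```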

In [3]:
# format grid example
heads = ["item", "price", "qty"]
x = ["hammer", 9.99, 11]
y = ["nails", 1.99, 24]
print("{:15}{:>8}{:>8}".format(heads[0], heads[1], heads[2]))
print("{:15}{:8.2f}{:8d}".format(x[0], x[1], x[2]))
print("{:15}{:8.2f}{:8d}".format(y[0], y[1], y[2]))

item              price     qty
hammer             9.99      11
nails              1.99      24


## Program Structure
#### Shebang
- `#!/usr/bin/env python3`
    - first line of a .py file
    - ignored by Windows but used by Unix-like systems to use correct interpreter

#### Three-Tier Approach
- Good practice when writing a program to use three tier approach
- Main function
    - starts an application, but is defined last
    - `def main():` with code to execute the app
    - `if __name__ == '__main__':`
        - `main()`
- UI Tier
    - User interface
    - Contains the `main()` function
    - Console app is usually procedural code vs GUI which is object-oriented
- Business or Object Tier
    - Processing tier
    - OOP to work with classes/objects and data from a database
- Database Tier
    - Provide database access

## Iterables and Looping
#### Iterators and Iterables
- An iterable is any object with an `__iter__()` method
    - lists, strings, dictionaries, etc.
    - calling `iter()` on an iterable returns an iterator
    - Use `next()` to step through the iterator one value at a time
    - Access all (remaining) values at once using star unpacking, e.g. `print(*var)`

In [4]:
# create the iterable object
# use next to iterate
# throws StopIteration Error when out of values
var = iter([1, 2, 3, 4, 5, 6])
print(next(var))
print(next(var))
print([next(var), next(var), next(var), next(var)])
# print(next(var)) # throws exception because out of values!

1
2
[3, 4, 5, 6]


#### Zip and iZip Objects
- `izip` (from `itertools`) is Python 2 only, where it saves memory over `zip`
    - in Python 3, `zip` is already a lazy iterator, so just use `zip`
- `zip(iterables)` iterables separated by commas
    - zipping will take two or more iterables (such as lists) and stitch them together
    - it matches values at the same index and combines into a tuple
        - you end up with a zip object containing tuples
        - the tuple at index 0 contains all of the index 0 values
    - convert to a list using `list(zip_obj)`
    - convert to a dictionary using `dict(zip_obj)`
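
A short sketch of zipping lists in Python 3. The lists are sample data, and `qtys` is deliberately shorter to show where zipping stops.

```python
names = ["hammer", "nails", "saw"]
prices = [9.99, 1.99, 14.50]
qtys = [11, 24]                            # deliberately shorter

pairs = list(zip(names, prices))           # list of (name, price) tuples
lookup = dict(zip(names, prices))          # name -> price
triples = list(zip(names, prices, qtys))   # stops at the shortest input

lazy = zip(names, prices)                  # a zip object is a one-shot iterator
first = next(lazy)                         # consume the first tuple
rest = list(lazy)                          # only the remaining tuples are left

print(pairs)
print(triples)
print(first, rest)
```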

#### Iterating Through a File
- `with open('file.txt') as file:`
    - `it = iter(file)`
    - `print(next(it))` print the next line of the file
- Pandas Jupyter Notebook file has more on working with files
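
A self-contained sketch of stepping through a file with `next()`. It writes a throwaway temporary file first, so the file name and contents are assumptions made only for this example.

```python
import os
import tempfile

# write a small throwaway file so the example can run anywhere
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

with open(path) as file:
    it = iter(file)                         # a file object is its own iterator
    first = next(it).rstrip("\n")           # read exactly one line
    remaining = [line.rstrip("\n") for line in it]   # continues after 'first'

os.remove(path)                             # clean up the temporary file

print(first)
print(remaining)
```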

#### While Loops
- Will run as long as the condition is True, or until a `break` or `return` is executed
- See examples

In [5]:
# will run until conditional is no longer true
mylist = []
i = 0
while len(mylist) < 11:
    mylist.append(i)
    i += 1
print(mylist)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


In [6]:
# will run until '4' is entered
num = 4
inp = 4 # comment this line out and uncomment input line to run
while True:
    # inp = input("Guess a number between 1 and 5: ")
    if int(inp) == num:
        print("You win!")
        print("Bye!")
        break
    elif int(inp) != num:
        print("Guess again!")

You win!
Bye!


#### For Loops 
- **Note:**
    - A `for` loop calls `iter()` on the iterable automatically, so wrapping it in `iter()` yourself is not required
- `for i in range(data):`
    - followed by statements
    - can use `range(data)` to specify a number of times
    - use `xrange(data)` for very large sets in Python 2
        - `xrange` no longer exists in Python 3
- Looping backwards
    - `for i in reversed(range(data)):`
    - `for i in sorted(data, reverse=True):`
        - when you need to sort the data backwards
- Lists    
    - General Syntax    
        - `for i in list:` to loop through each list item
    - **Enumerating a list**
        - lets you access the index of each list value
        - `for index, var in enumerate(list):`
            - can also use `en_obj = enumerate(list)`
                - this object will contain tuples of (index, value) pairs
                - can set `start=num` to change 0 indexing
    - Sorting a list to loop
        - `for i in sorted(list):`
            - add optional `reverse=True` arg to sort backwards
    - Key functions for custom sorting
        - `for i in sorted(list, key=fxn):`
            - where 'fxn' is your sort function
                - i.e. `len`
    - Combining Lists for Looping
        - use `zip` to combine lists for looping
        - in Python 3, `zip` is an iterator that won't build a full list in memory (Python 2 needed `izip` for this)
- Dictionaries
    - `for key, value in dictionary.items():` to work with keys and values
        - in Python 3 this is a view, so no big list is built (Python 2 made a long list to go through)
    - `for key in dictionary:` to work with just the keys
        - or can use `for key in dictionary.keys():`
        - loop over `list(dictionary)` instead if you want to add or delete keys inside the loop
    - `for value in dictionary.values():` to just work with the values
- Call a function until a sentinel value
    - a sentinel value is one that signifies the end of a loop
    - `for i in iter(partial(f.read, 32), ''):`
        - assumes `from functools import partial` and a text file stored in 'f', reading the file 32 characters at a time
- Can add an `else:` statement at the end that runs only when the loop finishes without hitting `break`
    - would add this before a final return/yield statement
    - Example to return the index of a value if in a list, otherwise return -1:
        ```python
        def find(seq, trg):
            for i, value in enumerate(seq):
                if value == trg:
                    break
            else:
                # the loop ended without a break, so trg was not found
                return -1
            return i
        ```
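
A short sketch of several loop helpers from this section: `enumerate` with `start`, `sorted` with `key` and `reverse`, `reversed`, and the sentinel form of `iter()`. The list contents are sample data, and `io.StringIO` stands in for an open text file.

```python
from functools import partial
from io import StringIO

tools = ["saw", "hammer", "awl"]

numbered = [(i, tool) for i, tool in enumerate(tools, start=1)]
longest_first = sorted(tools, key=len, reverse=True)
backwards = list(reversed(range(3)))

# sentinel form of iter(): call f.read(4) repeatedly until it returns ''
f = StringIO("abcdefghij")            # stands in for an open text file
chunks = [chunk for chunk in iter(partial(f.read, 4), "")]

print(numbered)
print(longest_first, backwards)
print(chunks)
```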

#### List Comprehensions
- List processing without using a "for loop"
    - Use a **generator** (see below) if possible for improved performance
- Works with **any iterable**
- Syntax
    - `new_list = [output_expression for iterator_var in iterable]`
        - `output_expression` itself can be a list comprehension

In [7]:
# simple list comprehensions
nums = [num for num in range(10)] # populate a list
print(nums)
nums_squared = [num ** 2 for num in nums] # perform arithmetic on a list
print(nums_squared)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


In [8]:
# nested list comprehension to create a 5x5 matrix
# unformatted, so right now is list of lists
# can convert to numpy array
matrix = [[col for col in range(0,5)] for row in range(0,5)]
print(matrix)
# convert to numpy array
np_matrix = np.array(matrix)
print(np_matrix)

[[0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]
[[0 1 2 3 4]
 [0 1 2 3 4]
 [0 1 2 3 4]
 [0 1 2 3 4]
 [0 1 2 3 4]]


In [9]:
# nested list comprehension for combining every possibility
poss_tuples = [(num1, num2) for num1 in range(0, 3) for num2 in range(4, 7)]
print(poss_tuples)

[(0, 4), (0, 5), (0, 6), (1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6)]


- Conditionals with list comprehensions
    - can be placed on the output expression or the iterable

In [10]:
# list comprehension with conditional on the iterable
# square only the even numbers in this range
even_squares = [num ** 2 for num in range(11) if num % 2 == 0]
print(even_squares)

[0, 4, 16, 36, 64, 100]


In [11]:
# list comprehension with conditional on the output expression
# code doesn't run without the 'else' statement
# output is a '0' for every odd num
result = [num * 3 if num % 2 == 0 else 0 for num in range(11)]
print(result)

[0, 0, 6, 0, 12, 0, 18, 0, 24, 0, 30]


- Dictionary Comprehensions
    - use `{}` instead of `[]`
    - use `.items()` on the dictionary
    - separate key/value pairs with colon `key: value`

In [12]:
# dictionary comprehension
# use .items()
my_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
doubles = {key:value*2 for (key, value) in my_dict.items()}
print(doubles)

{'a': 2, 'b': 4, 'c': 6, 'd': 8}


#### Generators
- "Lazy Evaluation"
    - data aren't stored in memory and only called on when needed
    - saves memory when using very large amounts of data
- Syntax
    - like a list comprehension but use `()` instead of `[]`
- Creates a generator object, which is an iterable
    - these objects can only be iterated through once; calling `next()` after they are exhausted raises `StopIteration`

In [13]:
# create a huge generator object of only even numbers
# the numbers are not stored in memory, and are only generated when called
gen_obj = (num for num in range(1, 10 ** 1000000) if num % 2 == 0)
print(next(gen_obj))
print(next(gen_obj))

2
4


- Generator functions
    - use `yield` rather than `return`
    - when called and stored in a var, creates a generator object

In [14]:
# generator function to create an object of specified size
def gen(n):
    """Generate values from 0 to n"""
    i = 0
    while i < n:
        yield i
        i += 1
big_obj = gen(100000000000000)
print(next(big_obj))
print(next(big_obj))
print(next(big_obj))

0
1
2


In [15]:
# generator comprehension within a function
print(sum(i**2 for i in range(10)))

285


## Functions
#### Special Info for args and kwargs
- `*args`
    - use the `*` modifier before the parameter name (often 'args') to collect extra positional arguments into a tuple
    - loop over or unpack this tuple in the function

In [16]:
# *args example
def args_func(*args):
    """returns your list of args"""
    list_of_args = []
    for arg in args:
        list_of_args.append(arg)
    return list_of_args
example = args_func(5, 6, 'seven', 4.5)
print(example)

[5, 6, 'seven', 4.5]


- `**kwargs`
    - use the `**` modifier before the parameter name (often 'kwargs') to collect named arguments into a dictionary
    - use `kwargs.items()` in your code to access keys and values
    - allows the use of named args

In [17]:
# **kwargs example
def kwargs_func(**kwargs):
    """returns your dictionary of kwargs"""
    dict_of_kwargs = {}
    for key, value in kwargs.items():
        dict_of_kwargs[key] = value
    return dict_of_kwargs
example2 = kwargs_func(name='me', height=1.8, size='XL') # key names don't need quotes
print(example2)

{'name': 'me', 'height': 1.8, 'size': 'XL'}


## Lambda Functions
- Quick and easy way to write a function
    - syntax often not entirely clear
- Use
    - anonymous functions, where function is an arg of another function
- Syntax
    - `func_name = lambda args: expression`
        - the body is a single expression whose value is returned, not a block of statements
    - can supply 0 or more args

In [18]:
# example using a lambda function for the arg in a 'map' function
my_list = ['wow', 'whoa', 'holy cow']
result = map(lambda item: item + '!!!', my_list) # item is an arg here, and represents each list item
list(result) # print(result) indicates this is a map object, must convert back to a list

['wow!!!', 'whoa!!!', 'holy cow!!!']

In [19]:
# example using 'filter' function with lambda function as the conditional
result2 = filter(lambda item: len(item) > 3, my_list) # filters only items with more than 3 chars
list(result2)

['whoa', 'holy cow']

## Database Interfaces
#### See Pandas Library
- `df = pd.read_sql_query("SELECT * FROM table_name", engine)`
    - `engine` is the engine to connect to
    - see below for 'Creating an engine'

#### SQL Alchemy
- Need to import the appropriate package
    - `from sqlalchemy import create_engine`
- Creating an engine
    - `engine = create_engine('sqlite:///db_name.sqlite')`
        - the syntax above is `'db_type:///db_name.extension'`
- Store table names as a list
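    - `table_names = engine.table_names()`
        - use the `table_names()` method of an engine object
        - this method was removed in SQLAlchemy 2.0; there, use `inspect(engine).get_table_names()` after `from sqlalchemy import inspect`
    - can `print(table_names)` to view the list contents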
- Create a database connection object
    - necessary to execute queries
    - `con = engine.connect()`
        - don't forget to close it!
    - `with engine.connect() as con:`
        - nest all statements requiring a connection inside this with statement to avoid having to close the connection
- Create a SQL Alchemy result object
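    - `rs = con.execute("SQL query")` where "SQL query" is any single SQL query
        - SQLAlchemy 2.0 no longer accepts a plain string here; wrap it as `con.execute(text("SQL query"))` after `from sqlalchemy import text`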
- Create a Pandas dataframe from a result object
    - `df = pd.DataFrame(rs.fetchall())` to return all rows from the result
    - see also
        - `rs.fetchmany(size=num)` where you supply the 'num' of rows to fetch
- Set the column names
    - `df.columns = rs.keys()`
- Close the database connection
    - `con.close()`
    - not necessary when using the `with engine.connect() as con:` strategy
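
The workflow above can be sketched end to end as follows. This is a minimal example, assuming an in-memory SQLite database and a hypothetical `albums` table invented for the demo; the queries are wrapped in `text()` so the same code should run on both older and 2.x releases of SQLAlchemy.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# in-memory SQLite database, so nothing is written to disk
engine = create_engine('sqlite:///:memory:')

# the with statement closes the connection automatically
with engine.connect() as con:
    # hypothetical table and rows, created only for this demo
    con.execute(text("CREATE TABLE albums (id INTEGER, title TEXT)"))
    con.execute(text("INSERT INTO albums VALUES (1, 'Blue'), (2, 'Kind of Blue')"))

    # run a query and collect the column names and rows from the result object
    rs = con.execute(text("SELECT * FROM albums ORDER BY id"))
    column_names = list(rs.keys())
    rows = [tuple(row) for row in rs.fetchall()]

# build the dataframe from the fetched rows
df_albums = pd.DataFrame(rows, columns=column_names)
print(df_albums)
```

Passing `columns=` when building the dataframe does the same job as assigning `df.columns = rs.keys()` afterwards.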

## Scraping Data from the Web
#### urllib Package
- Interface for fetching data from the web
- Need to import the appropriate functions
    - `from urllib.request import urlretrieve, urlopen, Request`
- Store URL in a variable
    - `url = 'http://..../filename.csv'`
- Use the contents of the url to write to a file
    - `urlretrieve(url, 'filename.csv')` 
    - this saves the file to the local system
- Import the data directly to a dataframe without saving to the local system using pandas
    - `df = pd.read_csv(url, sep=';')` using the appropriate separator (delimiter)
    - `df = pd.read_excel(url, sheet_name=None)` imports all sheets with `None`, or specify a sheet name to import one
        - with `sheet_name=None` the result is a dictionary of dataframes, with sheet names as keys
- Open URLs
    - `urlopen(url)` works like `open()` but accepts urls
- Workflow for obtaining the HTML from a website as a string ('Requests' package below is better for this)
    - `url = 'http:....'`
    - `request = Request(url)`
    - `response = urlopen(request)`
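    - `html = response.read()` gets the html from a website as bytes; call `.decode('utf-8')` on it to get a string (assuming the page is UTF-8 encoded)
    - `response.close()` close the response
        - or use `with urlopen(request) as response:` to close it automatically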
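
A minimal sketch of the `Request`/`urlopen` workflow using a `with` block. To keep it runnable without network access, it assumes a `data:` URL as a stand-in for a real web address; the page content is made up for the demo.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# stand-in for a real web address: a data: URL carries its own content
page = '<html><title>Demo</title></html>'
url = 'data:text/html;charset=utf-8,' + quote(page)

request = Request(url)
with urlopen(request) as response:   # closed automatically on exit
    raw = response.read()            # bytes, not str

html = raw.decode('utf-8')           # decode the bytes to get a string
print(html)
```

With a real `http://` URL the steps are the same; only the `url` assignment changes.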

#### GET requests using 'Requests' package (great API for making requests)
- Great package!
- Need to import
    - `import requests`
- Workflow to GET the html from a website as a string
    - `url = 'http:....'`
    - `r = requests.get(url)`
    - `html = r.text` returns the HTML as a string
- See API section below for more info
- To decode a JSON response into a Python object
    - `json_data = r.json()` returns a dictionary (or a list) when 'url' returns JSON

#### Beautiful Soup Package to Work with HTML Data
- View website for more info https://www.crummy.com/software/BeautifulSoup/
- Import it
    - `from bs4 import BeautifulSoup`
- Scrape data from the web using the 'Requests' package above to get the html as a string
- Create a beautiful soup object
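    - `soup = BeautifulSoup(html, 'html.parser')` using the 'html' string object returned from the 'text' attribute above
        - naming the parser explicitly avoids the warning raised when it is omitted
- 'Prettify' the html
    - `soup.prettify()` returns the html as a string with one tag per line, indented by nesting level
    - view it `print(soup.prettify())`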
- Methods/Attributes of 'Beautiful Soup' Objects
    - `soup.title` access the `<title>` tag
    - `soup.get_text()` accesses the text from the page
    - `.get('attribute')` will access the value of the specified attribute of a tag
        - see use below with `.find_all()` method
    - `soup.find_all('tag')` will return all of the specified tags (don't wrap the tag name in angle brackets, example `a`)
        - `for link in soup.find_all('a'): print(link.get('href'))`
        - code above will print all of the 'href' values for every 'a' tag
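
A short sketch of these methods, assuming a small hand-written HTML string in place of a downloaded page (the URLs in it are placeholders):

```python
from bs4 import BeautifulSoup

# hypothetical page content; normally this would come from r.text
html = """
<html><head><title>Demo Page</title></head>
<body>
<p>Two links:</p>
<a href="https://example.com/a">first</a>
<a href="https://example.com/b">second</a>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# text inside the <title> tag
page_title = soup.title.string

# value of the href attribute for every <a> tag
links = [link.get('href') for link in soup.find_all('a')]

print(page_title)
print(links)
```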

## JSON Files
#### Loading JSON Files from Local Storage
- `import json` to import the package
- `with open('filename.json', 'r') as json_file: json_data = json.load(json_file)`
    - this creates the python object `json_data`, typically a dictionary (a list if the top level of the file is a JSON array)
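
A minimal round trip, assuming a made-up record written to a temporary file so the example does not depend on any existing `filename.json`:

```python
import json
import os
import tempfile

# hypothetical sample data for the demo
record = {'title': 'Hackers', 'year': 1995, 'genres': ['Crime', 'Drama']}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'movie.json')

    # write the record to disk as JSON
    with open(path, 'w') as json_file:
        json.dump(record, json_file)

    # load it back into a python dictionary
    with open(path, 'r') as json_file:
        json_data = json.load(json_file)

for key, value in json_data.items():
    print(key + ':', value)
```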

## APIs
#### Getting Data Using 'Requests' Package
- Import package `import requests`
- Assign url `url = 'http://www.....com'`

#### Using APIs
- View documentation on the API homepage (when available)
    - info on how to send requests for different queries
        - url path for requests
        - formatting query string to pass variables requesting specific info
        - 'apikey' or other required args for retrieving data


In [20]:
# import requests
# already done

# assign url (hackers movie from open movie database API)
# see open movie database api documentation http://www.omdbapi.com/
url = 'http://www.omdbapi.com/?i=tt3896198&apikey=b90e9eaa&t=hackers'

# package and send 'get request'
r = requests.get(url)

# use .json() decoder method of a response object
# saves as a dictionary
json_data = r.json()

# print out the dictionary
for k in json_data.keys():
    # print(k + ': ' + str(json_data[k]))
        # datatype issue with list vs string
        # format by using 'json_data[k]' rather than key, value in for loop
        # convert 'values' to strings
        # or solve by not trying to concatenate the values to the keys
    print(k + ': ', json_data[k])

Title:  Hackers
Year:  1995
Rated:  PG-13
Released:  15 Sep 1995
Runtime:  107 min
Genre:  Comedy, Crime, Drama, Thriller
Director:  Iain Softley
Writer:  Rafael Moreu
Actors:  Jonny Lee Miller, Angelina Jolie, Jesse Bradford, Matthew Lillard
Plot:  Hackers are blamed for making a virus that will capsize five oil tankers.
Language:  English, Italian, Japanese, Russian
Country:  USA
Awards:  N/A
Poster:  https://m.media-amazon.com/images/M/MV5BNmExMTkyYjItZTg0YS00NWYzLTkwMjItZWJiOWQ2M2ZkYjE4XkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_SX300.jpg
Ratings:  [{'Source': 'Internet Movie Database', 'Value': '6.2/10'}, {'Source': 'Rotten Tomatoes', 'Value': '33%'}, {'Source': 'Metacritic', 'Value': '46/100'}]
Metascore:  46
imdbRating:  6.2
imdbVotes:  62,125
imdbID:  tt0113243
Type:  movie
DVD:  24 Apr 2001
BoxOffice:  N/A
Production:  MGM
Website:  N/A
Response:  True


#### Twitter API
- Setting up twitter API
    - Must have twitter acct to use
    - Login to twitter apps and create a new app
    - Info needed to access API in 'Keys and Access Tokens' tab
- REST API
    - For reading/writing twitter data
- Streaming API
    - Monitor/process tweets in real time
- 'Tweepy' is a package for working with Twitter data
    - `import tweepy`
    - Have to set several things obtained from twitter API 'Keys and Access Tokens'
        - `access_token = '...'`
        - `access_token_secret = '...'`
        - `consumer_key = '...'`
        - `consumer_secret = '...'`
        - `auth = tweepy.OAuthHandler(consumer_key, consumer_secret)`
        - `auth.set_access_token(access_token, access_token_secret)`
        - must define a 'twitter stream listener class'
    - Create an instance from your class, and authenticate it
        - `l = ClassName()`
        - `stream = tweepy.Stream(auth, l)`
    - Can filter streams to catch data by keywords
        - `stream.filter(track=['apples', 'oranges'])`

## Cleaning and Combining Data
#### Using Pandas & Matplotlib
- Using `df` as the dataframe name for examples
- View summary info about the dataframe
    - View the `.head()` and `.tail()` of the dataframe as a quick check
    - View the `.columns` attribute to check column headers
    - View the `.shape` attribute to see the size of the dataframe
    - Call `.info()` to see more general info on the dataframe
        - Lets you quickly see the number of non-null values in each column
        - Lets you quickly see the datatype for each column
- Check the frequency of each value in a column with the `.value_counts()` method
    - `df.column_name.value_counts(dropna=False)`
        - could also use `['column_name']` notation instead of dot notation to select the column_name
        - this is necessary if the column name has special chars, spaces, or is a python keyword
    - `dropna=False` will directly report the number of NaN or NULL values
    - can chain the `.head()` method to this call to only return the top 5 of the results
- Check data types
    - `df.dtypes` using dtypes attribute
    - lists the datatypes for each column
- Run summary stats using the `.describe()` method
    - `df.describe()` will report the following stats for numeric cols only:
        - count  -- non missing values
        - mean
        - std
        - min
        - 25th percentile
        - 50th percentile (median)
        - 75th percentile
        - max
- Visualize the data to check for outliers then use slicing to check for those values
    - Histogram
        - `df.column_name.plot(kind='hist')`
        - `plt.show()` (requires `import matplotlib.pyplot as plt`)
    - Slice to find the issue
        - `df[df.column_name > weird_value]` # replacing condition and *weird value* with the outlier you're looking for
    - Boxplots
        - `df.boxplot(column='y_col_name', by='x_col_name')`  # x_col_name is what you are grouping by
        - `plt.show()`
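
The non-plotting checks above can be sketched on a tiny dataframe. The column names and values are invented for the demo, and the `> 100` threshold is an assumption standing in for whatever cutoff a histogram or boxplot would suggest.

```python
import pandas as pd

# hypothetical data with one missing city and one planted outlier (400.0)
df = pd.DataFrame({
    'city': ['Oslo', 'Lima', 'Oslo', None, 'Lima', 'Oslo'],
    'temp_c': [4.0, 19.0, 5.0, 18.0, 400.0, 3.0],
})

# frequency of each value, including missing ones
counts = df['city'].value_counts(dropna=False)

# summary stats for a numeric column
stats = df['temp_c'].describe()

# slice out the rows that look wrong
outliers = df[df['temp_c'] > 100]

print(counts)
print(stats)
print(outliers)
```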

#### Convert Column to a Different Data Type
- Check the data types for each column using the `.dtypes` attribute of a dataframe
- Assign the data to a column using the `.astype(type)` method, and specifying a data type
    - Data types:
        - `str` # converts to string
        - `float` # converts to a float, or use `pd.to_numeric()`
        - `int` # converts to an int, or use `pd.to_numeric()`
        - `'category'` # converts to a categorical variable
            - advantages of using 'category' as dtype:
                - reduce memory storage if there are only a few different categories that are repeated
                - adds functionality/efficiency for analysis with certain packages
    - Example:
        - `df['column1'] = df['column1'].astype(str)` to convert/overwrite the column to string data type
- Convert to numeric values using the `pd.to_numeric()` function
    - `df['column1'] = pd.to_numeric(df['column1'], errors='coerce')`
        - the `errors='coerce'` option will convert any invalid entries to NaN/missing
        - will throw an exception if trying to convert non-numeric values without `errors='coerce'`
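
A small sketch of both conversions, assuming a hypothetical dataframe whose `price` column holds numbers stored as strings plus one invalid entry:

```python
import pandas as pd

# hypothetical data: sizes repeat, and one price is not a number
df = pd.DataFrame({
    'shirt_size': ['S', 'M', 'S', 'L'],
    'price': ['9.99', '12.50', 'n/a', '7'],
})

# few repeated values, so 'category' is a reasonable dtype
df['shirt_size'] = df['shirt_size'].astype('category')

# invalid entries such as 'n/a' become NaN instead of raising an exception
df['price'] = pd.to_numeric(df['price'], errors='coerce')

print(df.dtypes)
print(df)
```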

#### Concatenating Dataframes Using Pandas
- Useful when combining data sources
- `df_concat = pd.concat([df1, df2], axis=0, ignore_index=True)`
    - works well when your dataframes have the same columns in the same order
        - adds rows, keeping your columns
    - `axis` is optional
        - `axis=0` is default, and adds rows to your columns
        - `axis=1` will add new columns on the right of the dataframe, matching on row index
            - your row indexes must be in the correct order, if not, then merge the sets (see below)
    - `ignore_index` is optional
        - default is `ignore_index=False` and keeps original index values (produces duplicate row indexes)
        - `ignore_index=True` will reindex the new dataframe
- Use `glob` to find files based on a pattern
    - need to `import glob`
    - useful when trying to process thousands of files for concatenation
    - uses **wildcards** to help matching
        - `*` matches zero or more of any char
        - `?` matches any single char in that position
        - `[ ]` matches chars specified within
            - `[0-9]` matches number 0-9
            - `[09]` matches 0 or 9
    - creates a list of file names that match your pattern
    - Example:
        - `csv_files = glob.glob('*.csv')` will store a list of all csv files
- Example: to combine these skills to create a large dataframe from many files
    - `list_data = []`
    - `for filename in csv_files:`
        - `data = pd.read_csv(filename)`
        - `list_data.append(data)`
            - this results in a list of dataframes, which can be loaded into `pd.concat`
    - `df = pd.concat(list_data)`
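
The same steps as a runnable sketch. It assumes three small CSV files written to a temporary directory purely for the demo; the file names and columns are hypothetical.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # write three small hypothetical CSV files to combine
    for i in range(3):
        part = pd.DataFrame({'file_id': [i, i], 'value': [i * 10, i * 10 + 1]})
        part.to_csv(os.path.join(tmp_dir, f'part_{i}.csv'), index=False)

    # [0-9] matches a single digit; sorted() makes the order predictable
    csv_files = sorted(glob.glob(os.path.join(tmp_dir, 'part_[0-9].csv')))

    # read every matching file into a list of dataframes
    list_data = [pd.read_csv(filename) for filename in csv_files]

# stack the rows and renumber the index from 0
df_all = pd.concat(list_data, ignore_index=True)
print(df_all)
```

`glob.glob` makes no promise about the order of the returned names, which is why the list is sorted before reading.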

#### Merge Data into a Dataframe
- Use when row indexes from multiple dataframes do not refer to the same instance
    - i.e. row 2 is not equivalent to row 2 in the other dataframe
    - similar to a SQL JOIN
- `pd.merge(left=df_left, right=df_right, on=None, left_on='left_col_name', right_on='right_col_name')`
    - this will combine all cols from both dataframes
    - it does this by using 'left_col_name' and 'right_col_name' as keys
        - both 'left_col_name' and 'right_col_name' will be included in the merged df
        - can use `on='name'` if the column names to match are spelled the same, otherwise use `left_on`/`right_on`
        - use either `on=` or `left_on` and `right_on`, don't need both methods at once
- Three types of merges
    - one:one
    - one:many/many:one
    - many:many
        - when a key appears a different number of times in each dataframe, rows are repeated so that every matching left/right pair appears in the result
    - no difference in syntax to accomplish this
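
A minimal sketch of a one:many merge, assuming two hypothetical dataframes (`sites` and `visits`) whose key columns have different names:

```python
import pandas as pd

# each site appears once here ...
sites = pd.DataFrame({'site_name': ['A', 'B'], 'lat': [10.0, 20.0]})

# ... and possibly several times here
visits = pd.DataFrame({'site': ['A', 'A', 'B'], 'visitors': [5, 7, 3]})

# the key columns are named differently, so use left_on/right_on
merged = pd.merge(left=sites, right=visits,
                  left_on='site_name', right_on='site')
print(merged)
```

The `lat` value for site A is repeated for each of its visits, and both key columns are kept in the result.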

#### Tidy Data
- Should incorporate the following principles
    - Columns represent separate variables
    - Rows represent individual observations
    - Observational units form tables
- Organize for analysis, not necessarily for reporting
    - Use `pd.melt()` when cols contain values, rather than variables
        - Ex. cols -> treatmenta, treatmentb -> should be one col as treatment
- Melt 
    - `pd.melt(frame=df, id_vars=...[, value_vars=...][, var_name=...][, value_name=...])` function
    - Turns columns into rows
    - Provide single *string*, or list of strings as args to `id_vars` and `value_vars`
        - `id_vars` are cols you don't want to alter
        - `value_vars` are cols made up of values rather than variables
            - `pd.melt()` will use all cols not specified in `id_vars` if nothing is specified for `value_vars`
    - Default name for the variable col is 'variable' and for the value col is 'value'
        - add additional args of `var_name='', value_name=''` to the melt() function to change these
    - Melting does not always achieve tidy data; use it only when it is needed to meet the three principles above
- Pivot 
    - `df.pivot(index='', columns='', values='')` method
    - Only works when not pivoting duplicate values (use pivot table method for that case)
    - Turns unique row values into columns
    - Use when multiple variables are stored in the same column, or individual observations are restricted to a single row
    - Arguments
        - index = column or columns to fix during the pivot, the col with duplicated observations
        - columns = column to pivot into new columns
        - values = the values to fill in the new columns created by the pivot
- Pivot Table
    - Allows you to specify how to deal with duplicate values
    - `df.pivot_table(index='', columns='', values='', aggfunc=)`
        - works similar to pivot
        - aggfunc: the mean is the default aggfunc if none specified
            - pass a function such as `np.sum`, or a string name such as `'sum'`
- Melt and Pivot are the reverse operations of each other
    - If you call `.pivot()` or `.pivot_table()` you may need to reset the index
        - `df.reset_index()` # this fixes the col headers
- Melting/Parsing when needing to reshape data
    - Ex. two cols contain values, each col header is a single sex and age range
        - what we want is a sex col and a separate col for each age range
    - Strategy
        - Melt the two cols together `df_new = pd.melt(frame=df, id_vars='cols_to_keep_as_list_of_strings')`
        - Parse (slice) out the char that indicates the sex into a new column `df_new['sex'] = df_new.variable.str[0]`
            - use `str` and index of 0 to grab the first char from column 'variable', which was created when melting
        - Parse (slice) out the rest of the chars into the age column `df_new['age_group'] = df_new.variable.str[1:]`
            - `.str[1:]` starts slicing at 1, and continues to the end of the string
    - Additional Parsing Tips
        - Split a string at a certain char `_` for this example...
        - store it in a col as a list, then parse out the parts of that list 
        - Example: in code block below

In [21]:
# parsing out one col with two values separated by an '_'

df = pd.DataFrame(data={'parse_this': ['two_values']})
df['str_split'] = df.parse_this.str.split('_')
df['single0'] = df.str_split.str.get(0)
df['single1'] = df.str_split.str.get(1)
print(df.head())

   parse_this      str_split single0 single1
0  two_values  [two, values]     two  values
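
The melt-then-slice strategy described above can be sketched as follows, assuming hypothetical column headers such as `m014` and `f014`, where the first char is the sex and the rest is the age range:

```python
import pandas as pd

# hypothetical data: each value column encodes sex + age range in its header
df = pd.DataFrame({'country': ['AD', 'AE'],
                   'm014': [0, 2],
                   'f014': [1, 3]})

# melt every column except 'country' into variable/value pairs
df_new = pd.melt(frame=df, id_vars=['country'])

# first char of the old header is the sex, the rest is the age group
df_new['sex'] = df_new['variable'].str[0]
df_new['age_group'] = df_new['variable'].str[1:]

print(df_new)
```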


In [7]:
# melting a dataframe

names = ['ed', 'frank', 'linda', 'jon']
treatmenta = [23, 44, 54, 95]
treatmentb = [32, 53, 22, 43]
df = pd.DataFrame()
df['names'] = names
df['treatmenta'] = treatmenta
df['treatmentb'] = treatmentb

print('Original DataFrame')
print(df.head())

# melt it
df_melt = pd.melt(frame=df, id_vars='names', value_vars=['treatmenta', 'treatmentb'], 
                  var_name='treatment', value_name='value')

print()
print('Melted Dataframe')
print(df_melt.head())

# pivot it back to the original
df_pivot = df_melt.pivot_table(index='names', columns='treatment', values='value')

print()
print('Pivoted DataFrame')
print(df_pivot.head())

# reset the index to flatten the pivoted df
df_orig = df_pivot.reset_index()

print()
print('Pivoted DF with Reset Index')
print(df_orig.head())

Original DataFrame
   names  treatmenta  treatmentb
0     ed          23          32
1  frank          44          53
2  linda          54          22
3    jon          95          43

Melted Dataframe
   names   treatment  value
0     ed  treatmenta     23
1  frank  treatmenta     44
2  linda  treatmenta     54
3    jon  treatmenta     95
4     ed  treatmentb     32

Pivoted DataFrame
treatment  treatmenta  treatmentb
names                            
ed                 23          32
frank              44          53
jon                95          43
linda              54          22

Pivoted DF with Reset Index
treatment  names  treatmenta  treatmentb
0             ed          23          32
1          frank          44          53
2            jon          95          43
3          linda          54          22


#### Cleaning String Data
- **Regular Expressions** Using the `re` Library
    - `import re`
- Specifying Patterns
    - Special Chars for Patterns
        - `^` at the beginning of a pattern anchors the match to the start of the string
        - `$` at the end of a pattern anchors the match to the end of the string
        - `\d` any digit
        - `\w` any alphanumeric char or underscore
        - `\.` a period/decimal point
        - `[ ]` will find any chars in the range supplied
            - `[A-Z]` any cap letters, `[13579]` any odd digit, etc.
        - use `\` before a dollar sign to escape it and match a literal dollar sign
    - Specifying the Number of Chars to Match
        - `*` 0 or more times
        - `+` 1 or more times
        - `{num}` matches the element just before it exactly `num` times
    - Examples:
        - `\d*` 0 or more digits
        - `\d+` 1 or more digits
        - `\d*\.\d{2}` specifies any number of digits, a decimal, then two more digits
- Store a pattern in a variable
    - `pattern = re.compile(r'')` enter the pattern between the quotation marks; a raw string (`r''`) keeps backslashes literal
- Matching
    - `match_object = pattern.match('string')` checks the supplied 'string' from its beginning and returns a match object, or `None` if there is no match
    - `bool(match_object)` converts the result to true/false
- Return all instances found
    - `re.findall(pattern, 'string')` will return every instance where the pattern is found in the supplied string

#### Using Functions with Regular Expressions on Pandas Dataframes
- Import the modules
    - NaN from numpy (or just use it through numpy)
        - `np.nan`
    - re
- Store a compiled pattern to match
- Write a function `function_name` in this example
    - arguments should be `(row, pattern)`
    - access the column you want using `local_row_var = row['column_name']` and store in a local variable in the function
    - use `if bool(pattern.match(local_row_var)):` to check for data that match the pattern
        - use the standard 'and' if checking multiple columns for certain patterns
        - within the if statement, you can:
            - `local_row_var = local_row_var.replace("this value", "with this value")` replace part of the cell contents
                - `.replace()` returns a new string, so assign the result back
                - to remove char(s), supply empty string for the second arg
            - `local_row_var = float(local_row_var)` convert data types
            - `return local_row_var1 - local_row_var2` return a calculation involving more than one column, if you defined those vars
        - can `else: return np.nan` when the if condition is not met
- If function is relatively simple, use a **lambda function** inside `.apply()`
    - `(lambda x: statements with x)` x represents the value of each row in this column
- Use `.apply(function)` method to apply the function to each row
    - The entire row is passed each time
    - `df['new_column'] = df.apply(function_name, axis=1, pattern=pattern)`
        - `axis=1` to check each row (`axis=0` is default and will work on each column)
            - doing this because using more than one column of data
        - pass the `pattern` to `df.apply` since it was defined outside the function
            - can pass any additional parameters that are needed for a function this way
- Use `.apply(function)` for just one column
    - `df['new_column_name'] = df.column_name.apply(function_name)`
    - do not set `axis=1` for this method, because only using one column
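
Both uses of `.apply()` can be sketched as follows. The dataframe, its column names, and the function name `diff_money` are hypothetical and exist only for this example.

```python
import re

import numpy as np
import pandas as pd

# hypothetical data: dollar amounts stored as strings, one invalid entry
df = pd.DataFrame({'total': ['$25.00', '$9.50', 'missing'],
                   'tip': ['$5.00', '$0.50', '$1.00']})

pattern = re.compile(r'^\$\d*\.\d{2}$')


def diff_money(row, pattern):
    """Return total minus tip when both cells look like dollar amounts, else NaN."""
    total = row['total']
    tip = row['tip']
    if bool(pattern.match(total)) and bool(pattern.match(tip)):
        # strip the dollar sign, then convert to float
        total = float(total.replace('$', ''))
        tip = float(tip.replace('$', ''))
        return total - tip
    else:
        return np.nan


# row-wise: the whole row is passed, and pattern is forwarded as a keyword arg
df['diff'] = df.apply(diff_money, axis=1, pattern=pattern)

# single column: a lambda receives one cell value at a time
df['tip_value'] = df['tip'].apply(lambda x: float(x.replace('$', '')))

print(df)
```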

#### Duplicate and NaN Values
- Remove duplicates with the `df.drop_duplicates()` method
    - rows are compared on their column values only; the row index is ignored
    - rows with different indexes but identical values count as duplicates, and only the first is kept by default
- NaN Values
    - Three options
        - can leave them
        - can drop them
        - can replace them
    - Count NaN's
        - `df.info()` will provide non-null counts for each column
        - `df.isnull().sum()` will count the NaN values in each column directly
    - Dropping NaN's
        - `df.dropna()` will return a new dataframe with only complete rows
            - drops entire row if any NaN values detected
    - Replace NaN's
        - `df.fillna()` will replace all NaN values with an option below
            - fill with user provided value
            - fill with summary stat
        - typically used one column at a time
            - `df['column1'] = df['column1'].fillna('missing')`
        - can provide a list of columns to fill several at once
            - `df[['column1', 'column2']] = df[['column1', 'column2']].fillna(0)`
        - **Using a Stat** to replace values
            - Be careful here, mean ok if no outliers, median better if outliers
            - `mean_value = df['column_name'].mean()`
            - `df['column_name'] = df['column_name'].fillna(mean_value)`
- Assert Statements to Verify No Missing Data
    - does nothing if the condition evaluates to true, raises an `AssertionError` if it evaluates to false
    - Check one column    
        - `assert df.column_name.notnull().all()`
        - `.notnull()` returns true for each value that is not null
        - `.all()` requires all values be not null to return true
    - Check entire dataframe
        - `assert df.notnull().all().all()`
        - first `.all()` checks each column and returns one true/false per column
        - second `.all()` requires every column to be true
- Assert Statements to Check Values
    - Check for non-negative values
        - `assert (df.column1 >= 0).all()` check a column 
        - `assert (df >= 0).all().all()` check an entire dataframe
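
A compact sketch of this cleanup sequence on a small made-up dataframe. The median is used as the fill value here only as an example choice.

```python
import numpy as np
import pandas as pd

# hypothetical data: one fully duplicated row and one missing score
df = pd.DataFrame({'name': ['ed', 'ed', 'linda', 'jon'],
                   'score': [1.0, 1.0, np.nan, 7.0]})

# drop the duplicated row and renumber the index
df = df.drop_duplicates().reset_index(drop=True)

# count missing values before filling them
n_missing = df['score'].isnull().sum()

# fill NaN with a summary stat
median_value = df['score'].median()
df['score'] = df['score'].fillna(median_value)

# these raise AssertionError if the cleanup missed anything
assert df.notnull().all().all()
assert (df['score'] >= 0).all()

print(df)
```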

#### Write the Cleaned Data to a File
- `df.to_csv('cleaned_filename.csv')`