
## Whitespace Formatting

Many languages use curly braces to delimit blocks of code; Python uses indentation. Programmers often argue over whether to use tabs or spaces for indentation. In many languages it doesn’t matter much, but Python treats tabs and spaces as different indentation and will refuse to run code that mixes the two inconsistently. When writing Python, always use spaces, never tabs.

In [1]:
for i in [1, 2, 3, 4, 5]:
  print(i)
  for j in [6, 7, 8, 9, 0]:
    print(j)
    print(i + j)
  print(i)
print('done looping')

1
6
7
7
8
8
9
9
10
0
1
1
2
6
8
7
9
8
10
9
11
0
2
2
3
6
9
7
10
8
11
9
12
0
3
3
4
6
10
7
11
8
12
9
13
0
4
4
5
6
11
7
12
8
13
9
14
0
5
5
done looping


In [2]:
# Whitespace is ignored inside parentheses and brackets
long_winded_computation = (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 +
                           13 + 14 + 15 + 16 + 17 + 18 + 19 + 20)
long_winded_computation

210

In [None]:
# to make code easier to read, break long literals across lines:
list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

easier_to_read_list_of_lists = [[1, 2, 3],
                                [4, 5, 6],
                                [7, 8, 9]]

In [3]:
# use a backslash to indicate that a statement continues onto the next line
two_plus_three = 2 + \
                 3

In [None]:
for i in [1, 2, 3, 4, 5]:

    # notice the blank line
    print(i)
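The tab-versus-space rule from the start of this section can be checked without breaking a real file. The following is a minimal sketch: it compiles a small, hypothetical source string in which one line is indented with eight spaces and the next with a single tab, and it assumes a Python 3 interpreter.

```python
# Hypothetical two-line block: the first body line is indented with eight
# spaces, the second with a single tab character.
source = "if True:\n        x = 1\n\ty = 2\n"

try:
    compile(source, "<mixed-indentation>", "exec")
    mixing_rejected = False
except TabError as err:
    # TabError is a subclass of IndentationError (itself a SyntaxError)
    mixing_rejected = True
    print("TabError:", err)

print(mixing_rejected)
```

Because the error is raised at compile time, none of the code in such a file runs, not even the lines before the inconsistent indentation.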

## Modules

Certain features of Python are not loaded by default. These include both features that are included as part of the language as well as third-party features that you download yourself. In order to use these features, you’ll need to *import* the modules that contain them.

**re** is the module containing functions and constants for working with regular expressions. After a plain **import re**, you must prefix those functions with **re.** in order to access them.

In [None]:
import re
my_regex = re.compile("[0-9]+", re.I)

If you already had a different **re** in your code, you could use an *alias*:

In [None]:
import re as regex
my_regex = regex.compile("[0-9]+", regex.I)

In [None]:
# import specific names from a module
from collections import defaultdict, Counter
lookup = defaultdict(int)
my_counter = Counter()

In [None]:
# never import everything using "*"
match = 10
from re import *    # uh oh, re has a match function
print(match)        # "<function match at 0x10281e6a8>"
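For contrast, here is a small sketch of the same situation with a plain import. The input string is made up for illustration; the point is only that names inside **re** stay behind the **re.** prefix and cannot overwrite your own variables.

```python
import re

match = 10                          # our own variable named match
digits = re.compile("[0-9]+")       # module functions stay behind the re. prefix
found = digits.search("room 42")    # hypothetical input string

print(match)           # 10, untouched by the import
print(found.group())   # 42
```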

## Functions

A function is a rule for taking zero or more inputs and returning a corresponding output. In Python, we typically define functions using **def**:

In [4]:
def double(x):
    """
    This is where you put an optional docstring that explains what the
    function does. For example, this function multiplies its input by 2.
    """
    return x * 2

Python functions are *first-class*, which means that we can assign them to variables and pass them into functions just like any other arguments.

In [5]:
def apply_to_one(f):
    """Calls the function f with 1 as its argument"""
    return f(1)

my_double = double             # refers to the previously defined function
x = apply_to_one(my_double)    # equals 2

In [6]:
# short anonymous functions - lambdas
y = apply_to_one(lambda x: x + 4)      # equals 5

In [7]:
another_double = lambda x: 2 * x       # don't do this

def another_double(x):
    """Do this instead"""
    return 2 * x
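Because functions are first-class, they can also be stored in data structures and handed to built-ins. The sketch below is illustrative only: `square`, `operations`, and the word list are hypothetical names, not part of the notebook.

```python
def double(x):
    return x * 2

def square(x):
    return x * x

# Functions stored as dictionary values, keyed by a label
operations = {"double": double, "square": square}

results = {}
for name, f in operations.items():
    results[name] = f(3)            # call whichever function is stored

print(results)     # {'double': 6, 'square': 9}

# A function passed as the key argument of the built-in sorted
words = ["pear", "fig", "banana"]
by_length = sorted(words, key=len)
print(by_length)   # ['fig', 'pear', 'banana']
```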

In [8]:
# function parameters can also be given default arguments
def my_print(message = "my default message"):
    print(message)

my_print("hello")   # prints 'hello'
my_print()          # prints 'my default message'

hello
my default message


In [9]:
# specify parameters by name
def full_name(first = "What's-his-name", last = "Something"):
    return first + " " + last

full_name("Joel", "Grus")     # "Joel Grus"
full_name("Joel")             # "Joel Something"
full_name(last="Grus")        # "What's-his-name Grus"

"What's-his-name Grus"

## Strings

Strings can be delimited by single or double quotation marks.

In [10]:
single_quoted_string = 'data science'
double_quoted_string = "data science"

In [11]:
# use backslashes to encode special characters
tab_string = "\t"       # represents the tab character
len(tab_string)         # is 1

1

In [12]:
# use raw strings, written r"...", when backslashes should be kept as literal backslashes
not_tab_string = r"\t"  # represents the characters '\' and 't'
len(not_tab_string)     # is 2

2

In [13]:
# multiline strings
multi_line_string = """This is the first line.
                    and this is the second line
                    and this is the third line"""
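One detail of triple-quoted strings is easy to miss: any whitespace at the start of a continuation line becomes part of the string. The sketch below uses its own short strings to show this, and then one possible way to strip the shared indentation with the standard library's `textwrap.dedent`.

```python
import textwrap

indented = """first line
    second line"""

lines = indented.split("\n")
print(lines)       # ['first line', '    second line']

# When every content line shares the same indentation, dedent removes it
block = """
    first line
    second line
"""
cleaned = textwrap.dedent(block).strip()
print(cleaned.split("\n"))   # ['first line', 'second line']
```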

In [14]:
# f-strings (available since Python 3.6)
first_name = "Joel"
last_name = "Grus"

full_name1 = first_name + " " + last_name             # string addition
full_name2 = "{0} {1}".format(first_name, last_name)  # str.format
full_name3 = f"{first_name} {last_name}"              # f-string (new way)
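An f-string can hold any expression inside the braces, optionally followed by a format specification after a colon. A brief sketch with made-up values:

```python
price = 3.14159
count = 42
ratio = 0.256

two_decimals = f"{price:.2f}"      # fixed-point, two decimal places
right_aligned = f"{count:>5}"      # padded on the left to width 5
as_percent = f"{ratio:.1%}"        # multiplied by 100, one decimal place
computed = f"{count + 1} items"    # expressions are evaluated in place

print(two_decimals)    # 3.14
print(right_aligned)   #    42
print(as_percent)      # 25.6%
print(computed)        # 43 items
```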

## Lists (or Arrays)

Probably the most fundamental data structure in Python is the list, which is simply an ordered collection.

In [16]:
integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]  # lists may mix types (NumPy arrays are normally homogeneous)
list_of_lists = [integer_list, heterogeneous_list, []]

list_length = len(integer_list)     # equals 3
list_sum    = sum(integer_list)     # equals 6

In [17]:
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# basic
zero = x[0]          # equals 0, lists are 0-indexed
one = x[1]           # equals 1
nine = x[-1]         # equals 9, 'Pythonic' for last element
eight = x[-2]        # equals 8, 'Pythonic' for next-to-last element
x[0] = -1            # now x is [-1, 1, 2, 3, ..., 9]

# more
first_three = x[:3]                 # [-1, 1, 2]
three_to_end = x[3:]                # [3, 4, ..., 9]
one_to_four = x[1:5]                # [1, 2, 3, 4]
last_three = x[-3:]                 # [7, 8, 9]
without_first_and_last = x[1:-1]    # [1, 2, ..., 8]
copy_of_x = x[:]                    # [-1, 1, 2, ..., 9]

In [18]:
# a slice can take a third argument, the step (stride): [start:stop:step]
every_third = x[::3]                 # [-1, 3, 6, 9]
five_to_three = x[5:2:-1]            # [5, 4, 3]

# check
x[5:2:1]                             # []

[]
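Two properties of slices are worth checking by hand: a slice always builds a new list, and slice bounds that run past the end are silently clipped, whereas a single out-of-range index raises an error. A minimal sketch with a fresh list:

```python
x = [0, 1, 2, 3, 4]

copy_of_x = x[:]          # a new list with the same elements
copy_of_x[0] = 99         # changing the copy...
print(x)                  # [0, 1, 2, 3, 4]  ...leaves the original alone

clipped = x[2:100]        # stop beyond the end is clipped, no error
print(clipped)            # [2, 3, 4]

reversed_x = x[::-1]      # a negative step walks backwards
print(reversed_x)         # [4, 3, 2, 1, 0]

try:
    x[100]                # a plain index is not clipped
    index_failed = False
except IndexError:
    index_failed = True
print(index_failed)       # True
```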

In [20]:
# test list membership with "in"
# this checks the elements one at a time, so use it only when the list is small
1 in [1, 2, 3]  # True
0 in [1, 2, 3]  # False

False

In [21]:
# concatenate lists
x = [1, 2, 3]
x.extend([4, 5, 6])     # x is now [1, 2, 3, 4, 5, 6]

In [23]:
# list addition
x = [1, 2, 3]
y = x + [4, 5, 6]       # y is [1, 2, 3, 4, 5, 6]; x is unchanged

In [22]:
# append to a list
x = [1, 2, 3]
x.append(0)      # x is now [1, 2, 3, 0]
y = x[-1]        # equals 0
z = len(x)       # equals 4
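The difference between modifying a list in place (`extend`, `append`) and building a new one (`+`) becomes visible when a second name refers to the same list. In this sketch, `alias` is a hypothetical second name used only for illustration.

```python
x = [1, 2, 3]
alias = x                 # a second name for the same list object

x.extend([4])             # modifies the list in place
print(alias)              # [1, 2, 3, 4]  (the alias sees the change)

y = x + [5]               # builds a brand-new list
print(x)                  # [1, 2, 3, 4]  (x is unchanged)
print(y)                  # [1, 2, 3, 4, 5]
print(y is x)             # False
```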

In [24]:
# unpack a list (if you know how many elements it contains)
x, y = [1, 2]    # now x is 1, y is 2

In [27]:
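# raises ValueError if the number of names on the left does not match
# the number of elements on the right
x, y, z = [1, 2]    # ValueError: three names, but only two elements

ValueError: not enough values to unpack (expected 3, got 2)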

In [28]:
# use "_" to ignore an item
_, y = [1, 2]    # now y == 2, didn't care about the first element
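Unpacking fails in both directions: too few elements and too many elements each raise `ValueError`. The helper below, `try_unpack_pair`, is a hypothetical function written for this sketch; it returns the error text instead of crashing. The exact wording of the messages can differ between Python versions, so none is shown here.

```python
def try_unpack_pair(values):
    """Try to unpack values into exactly two names; report any ValueError."""
    try:
        first, second = values
    except ValueError as err:
        return str(err)
    return (first, second)

ok = try_unpack_pair([1, 2])           # exactly two elements: works
too_few = try_unpack_pair([1])         # one element: error message string
too_many = try_unpack_pair([1, 2, 3])  # three elements: error message string

print(ok)
print(too_few)
print(too_many)
```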

## Tuples

Tuples are lists’ immutable cousins. Pretty much anything you can do to a list that doesn’t involve modifying it, you can do to a tuple.

In [29]:
my_list = [1, 2]
my_tuple = (1, 2)
other_tuple = 3, 4
my_list[1] = 3      # my_list is now [1, 3]

try:
    my_tuple[1] = 3
except TypeError:
    print("cannot modify a tuple")

cannot modify a tuple


In [30]:
# tuples are a convenient way to return multiple values from a function
def sum_and_product(x, y):
    return (x + y), (x * y)

sp = sum_and_product(2, 3)     # sp is (5, 6)
s, p = sum_and_product(5, 10)  # s is 15, p is 50

In [31]:
# tuples and lists can be used for multiple assignment
x, y = 1, 2     # now x is 1, y is 2
x, y = y, x     # swap variables; now x is 2, y is 1 (Pythonic)
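To make the "anything that doesn't involve modifying it" claim concrete, here is a short sketch of list-style read operations applied to a tuple. The values are arbitrary.

```python
my_tuple = (10, 20, 30, 40)

first = my_tuple[0]             # indexing works as for lists
last = my_tuple[-1]
middle = my_tuple[1:3]          # slicing returns a new tuple
length = len(my_tuple)
has_twenty = 20 in my_tuple     # membership test
longer = my_tuple + (50,)       # concatenation builds a new tuple

print(first, last)    # 10 40
print(middle)         # (20, 30)
print(length)         # 4
print(has_twenty)     # True
print(longer)         # (10, 20, 30, 40, 50)
print(my_tuple)       # (10, 20, 30, 40)  (the original is unchanged)
```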

## Dictionaries

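A dictionary stores key-value pairs: it associates values with keys and lets you quickly retrieve the value corresponding to a given key.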

In [32]:
# create a dictionary
empty_dict = {}                     # Pythonic
empty_dict2 = dict()                # less Pythonic
grades = {"Joel": 80, "Tim": 95}    # dictionary literal

In [33]:
# value lookup using key
joels_grade = grades["Joel"]        # equals 80

In [34]:
# not found
try:
    kates_grade = grades["Kate"]
except KeyError:
    print("no grade for Kate!")

no grade for Kate!


In [35]:
# existence of a key
# unlike a list, this membership check is fast even for large dictionaries
joel_has_grade = "Joel" in grades     # True
kate_has_grade = "Kate" in grades     # False

In [36]:
# "get"
joels_grade = grades.get("Joel", 0)   # equals 80
kates_grade = grades.get("Kate", 0)   # equals 0
no_ones_grade = grades.get("No One")  # default is None

In [37]:
# assignment
grades["Tim"] = 99                    # replaces the old value
grades["Kate"] = 100                  # adds a third entry
num_students = len(grades)            # equals 3
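The lookup styles above behave differently for a missing key, which the following sketch puts side by side. It starts from a fresh copy of the `grades` dictionary so it can be run on its own.

```python
grades = {"Joel": 80, "Tim": 95}

# get never raises and does not add the key
kate_default = grades.get("Kate", 0)
kate_present = "Kate" in grades
print(kate_default, kate_present)    # 0 False

# square brackets raise KeyError for a missing key
try:
    grades["Kate"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)                 # True

grades["Kate"] = 100                 # assignment adds the key
grades["Kate"] = grades["Kate"] - 5  # and later updates it
print(grades)                        # {'Joel': 80, 'Tim': 95, 'Kate': 95}
```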

In [38]:
# use dictionary to represent structured data
tweet = {
    "user" : "joelgrus",
    "text" : "Data Science is Awesome",
    "retweet_count" : 100,
    "hashtags" : ["#data", "#science", "#datascience", "#awesome", "#yolo"]
}

tweet_keys   = tweet.keys()     # iterable for the keys
tweet_values = tweet.values()   # iterable for the values
tweet_items  = tweet.items()    # iterable for the (key, value) tuples

"user" in tweet_keys            # True, but not Pythonic
"user" in tweet                 # Pythonic way of checking for keys
"joelgrus" in tweet_values      # True (slow but the only way to check)

### defaultdict

Imagine that you’re trying to count the words in a document. An obvious approach is to create a dictionary in which the keys are words and the values are counts. As you check each word, you can increment its count if it’s already in the dictionary and add it to the dictionary if it’s not.

In [None]:
# Option 1: use "in" keyword
document = ["data", "science", "is", "data"]   # hypothetical sample; any list of words works
word_counts = {}
for word in document:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

In [None]:
# Option 2: Just run and catch exception
word_counts = {}
for word in document:
    try:
        word_counts[word] += 1
    except KeyError:
        word_counts[word] = 1

In [None]:
# Option 3: Use "get"
word_counts = {}
for word in document:
    previous_count = word_counts.get(word, 0)
    word_counts[word] = previous_count + 1

In [None]:
# Option 4: defaultdict
from collections import defaultdict

word_counts = defaultdict(int)          # int() produces 0
for word in document:
    word_counts[word] += 1

# defaultdict is also useful with list, dict, or your own factory function
dd_list = defaultdict(list)             # list() produces an empty list
dd_list[2].append(1)                    # now dd_list contains {2: [1]}

dd_dict = defaultdict(dict)             # dict() produces an empty dict
dd_dict["Joel"]["City"] = "Seattle"     # {"Joel": {"City": "Seattle"}}

dd_pair = defaultdict(lambda: [0, 0])
dd_pair[2][1] = 1                       # now dd_pair contains {2: [0, 1]}
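Here is a self-contained run of the defaultdict approach on a small, made-up `document`, followed by one behavior to keep in mind: merely looking up a missing key with square brackets calls the factory and stores the result.

```python
from collections import defaultdict

document = ["the", "cat", "saw", "the", "other", "cat"]   # hypothetical sample

word_counts = defaultdict(int)
for word in document:
    word_counts[word] += 1

counts_snapshot = dict(word_counts)
print(counts_snapshot)   # {'the': 2, 'cat': 2, 'saw': 1, 'other': 1}

had_dog_before = "dog" in word_counts   # False
_ = word_counts["dog"]                  # a plain lookup calls int() and stores 0
had_dog_after = "dog" in word_counts    # True
print(had_dog_before, had_dog_after)    # False True
```

If that side effect is unwanted, `word_counts.get("dog", 0)` reads the value without inserting the key.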

## Counters

A Counter turns a sequence of values into a defaultdict(int)-like object mapping keys to counts.

In [None]:
from collections import Counter
c = Counter([0, 1, 2, 0])          # c is (basically) {0: 2, 1: 1, 2: 1}

# counting the values:
# 0 has 2 occurrences
# 1 has 1 occurrence
# 2 has 1 occurrence

In [None]:
# a very simple way to solve our word_counts problem
# recall, document is a list of words
word_counts = Counter(document)

In [None]:
# A Counter instance has a most_common method that is frequently useful
# print the 10 most common words and their counts
for word, count in word_counts.most_common(10):
    print(word, count)
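The Counter cells above can be tried end to end with a small, made-up `document`. This sketch also shows that, unlike a defaultdict, a Counter returns 0 for a missing key without storing it.

```python
from collections import Counter

document = ["the", "cat", "saw", "the", "other", "cat", "the"]   # hypothetical sample

word_counts = Counter(document)

top_two = word_counts.most_common(2)
print(top_two)                     # [('the', 3), ('cat', 2)]

dog_count = word_counts["dog"]     # missing keys count as 0
dog_stored = "dog" in word_counts  # and are not added by the lookup
print(dog_count, dog_stored)       # 0 False
```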

## Sets

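A set is a collection of distinct elements. Sets are useful for two main reasons: membership tests with **in** are very fast, and converting a collection to a set is an easy way to find its distinct items.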

In [None]:
primes_below_10 = {2, 3, 5, 7}

In [None]:
s = set()
s.add(1)       # s is now {1}
s.add(2)       # s is now {1, 2}
s.add(2)       # s is still {1, 2}
x = len(s)     # equals 2
y = 2 in s     # equals True
z = 3 in s     # equals False

In [None]:
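# membership tests on a set are very fast
hundreds_of_other_words = []   # placeholder so this cell runs; imagine a long list here
stopwords_list = ["a", "an", "at"] + hundreds_of_other_words + ["yet", "you"]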

"zip" in stopwords_list     # False, but have to check every element

stopwords_set = set(stopwords_list)
"zip" in stopwords_set      # very fast to check

In [None]:
# find distinct items in a collection
item_list = [1, 2, 3, 1, 2, 3]

num_items = len(item_list)                # 6
item_set = set(item_list)                 # {1, 2, 3}
num_distinct_items = len(item_set)        # 3
distinct_item_list = list(item_set)       # [1, 2, 3]
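Both uses of sets can be combined in a small text-cleaning sketch. The stopword set and word list are invented for illustration. Since a set has no guaranteed order, `sorted` is used wherever a predictable order matters.

```python
stopwords_set = {"a", "an", "at", "the", "yet", "you"}   # hypothetical stopwords
words = ["the", "data", "at", "the", "lab"]              # hypothetical input

# Use 1: fast membership tests while filtering
content_words = []
for word in words:
    if word not in stopwords_set:
        content_words.append(word)
print(content_words)      # ['data', 'lab']

# Use 2: distinct items, sorted to get a deterministic order
distinct_words = sorted(set(words))
print(distinct_words)     # ['at', 'data', 'lab', 'the']
```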