---

_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-data-analysis/resources/0dhYG) course resource._

---

Using Python through an interactive interpreter turns out to be very useful for tasks that require a lot of investigation, as opposed to those that require a lot of design. Shell scripting is one example of this, and data cleaning is another.

A common surprise for some programmers coming from a Java or C background is that Python is a dynamically typed language, similar to languages like JavaScript. This means that when you declare a variable, you can assign it to be an integer on one line, and a string on the next line.

Since there is no compilation step that checks types, nothing helps you manage types ahead of time. You need to either check for the presence of functionality when you go to use it, or try to use the functionality and catch any errors that occur.
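As a minimal sketch of the two approaches just described, the cell below first rebinds one variable to values of different types, then shows "check first" and "try and catch" side by side. The helper names `safe_upper` and `try_upper` are hypothetical and only serve the illustration.

```python
# A variable can be rebound to values of different types.
value = 1
first_type = type(value).__name__    # 'int'
value = 'one'
second_type = type(value).__name__   # 'str'


def safe_upper(obj):
    """Check for the functionality before using it."""
    if hasattr(obj, 'upper'):
        return obj.upper()
    return None


def try_upper(obj):
    """Just try the operation and catch the error if it fails."""
    try:
        return obj.upper()
    except AttributeError:
        return None


print(first_type, second_type)            # int str
print(safe_upper('abc'), safe_upper(42))  # ABC None
print(try_upper('abc'), try_upper(42))    # ABC None
```

Both helpers return `None` for an integer because `int` has no `upper` method; which style reads better depends on how often you expect the failure case.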

What's happening underneath is that the browser is sending your Python code across to a machine in the cloud, which executes the code in a Python 3 interpreter and sends the results back.

The Restart & Run All function is particularly useful, as it wipes the interpreter state and reruns all of the cells in the current notebook.

# The Python Programming Language: Functions

Functions are great, but they're a bit different from what you might find in other languages. Here are some of the subtleties involved:
1. First, since there's no static typing, you don't have to declare a return type.
2. Second, you don't have to use a return statement at all. In that case a special value called `None` is returned. `None` is similar to null in Java and represents the absence of a value.
3. Third, in Python, you can have default values for parameters.

Moreover, all of the optional parameters, the ones that have default values, need to come at the end of the parameter list in the function declaration. It also means that you can pass optional parameters as labeled values (keyword arguments).
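The points above can be sketched in a few lines. The function `greet` and its parameters are made up for illustration; it has no return statement, so calling it evaluates to `None`, and its optional parameters can be passed as labeled values in any order.

```python
def greet(name, greeting='Hello', punctuation='!'):
    # Parameters with default values come after the required one.
    print(greeting + ', ' + name + punctuation)
    # No return statement, so the call evaluates to None.


result = greet('Chris')                          # Hello, Chris!
greet('Chris', punctuation='?')                  # Hello, Chris?
greet('Chris', punctuation='.', greeting='Hi')   # Hi, Chris.
print(result is None)                            # True
```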

We're going to dive into strings in more detail, but print will take an item, try and convert it to a string and print the output.

In [2]:
x = 1
y = 2
x + y

3

In [3]:
y

2

<br>
`add_numbers` is a function that takes two numbers and adds them together.

In [4]:
def add_numbers(x, y):
    return x + y

add_numbers(1, 2)

3

<br>
`add_numbers` updated to take an optional 3rd parameter. Using `print` allows printing of multiple expressions within a single cell.

In [5]:
def add_numbers(x, y, z=None):
    if z is None:  # compare to None with `is`, not `==`
        return x + y
    else:
        return x + y + z

print(add_numbers(1, 2))
print(add_numbers(1, 2, 3))

3
6


Okay, a final word on the basics of functions in Python. In Python, you can assign a function to a variable. By assigning a function to a variable, you can pass that variable into other functions, allowing some basic functional programming.

<br>
`add_numbers` updated to take an optional flag parameter.

In [6]:
def add_numbers(x, y, z=None, flag=False):
    if flag:
        print('Flag is true!')
    if z is None:  # compare to None with `is`, not `==`
        return x + y
    else:
        return x + y + z
    
print(add_numbers(1, 2, flag=True))

Flag is true!
3




<br>
Assign function `add_numbers` to variable `a`.

In [7]:
def add_numbers(x,y):
    return x+y

a = add_numbers
a(1,2)

3
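Because a function is just a value, it can also be handed to another function. The sketch below assumes two hypothetical helpers, `multiply_numbers` and `apply_operation`, that are not part of the original notebook.

```python
def add_numbers(x, y):
    return x + y


def multiply_numbers(x, y):
    return x * y


def apply_operation(operation, x, y):
    # `operation` is whatever function the caller handed in.
    return operation(x, y)


print(apply_operation(add_numbers, 3, 4))       # 7
print(apply_operation(multiply_numbers, 3, 4))  # 12
```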

# The Python Programming Language: Types and Sequences

<br>
Use `type` to return the object's type.

In [8]:
type('This is a string')

str

In [9]:
type(None)

NoneType

In [10]:
type(1)

int

In [11]:
type(1.0)

float

In [12]:
type(add_numbers)

function

<br>
Tuples are an immutable data structure (they cannot be altered). A tuple is a sequence of values which is itself immutable: it has items in an ordering, but it cannot be changed once created. We write tuples using parentheses, and we can mix types for the contents of the tuple.

In [13]:
x = (1, 'a', 2, 'b')
type(x)

tuple
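To see what immutability means in practice, here is a small sketch (the variable names `point` and `longer` are illustrative only). Assigning to an element raises a `TypeError`, and "changing" a tuple really means building a new one.

```python
point = (1, 'a', 2, 'b')

try:
    point[0] = 99          # tuples do not support item assignment
except TypeError as err:
    message = str(err)
    print('TypeError:', err)

# "Changing" a tuple means building a new one.
longer = point + (3.3,)
print(point)    # (1, 'a', 2, 'b')
print(longer)   # (1, 'a', 2, 'b', 3.3)
```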

<br>
Lists are a mutable data structure. Lists are very similar to tuples, but they are mutable, so you can change their length (the number of elements) and the element values.

In [14]:
x = [1, 'a', 2, 'b']
type(x)

list

<br>
Use `append` to append an object to a list.

In [15]:
x.append(3.3)
print(x)

[1, 'a', 2, 'b', 3.3]


<br>
This is an example of how to loop through each item in the list. Both lists and tuples are iterable types, so you can write loops to go through every value they hold. The norm, if you want to look at each item in the list, is to use a `for` statement. This is similar to the for-each loop in languages like Java and C#, but note that there's no typing required.

In [16]:
for item in x:
    print(item)

1
a
2
b
3.3


<br>
Or using the indexing operator:

In [17]:
i = 0
while i < len(x):
    print(x[i])
    i = i + 1

1
a
2
b
3.3


<br>
Use `+` to concatenate lists.

In [18]:
[1,2] + [3,4]

[1, 2, 3, 4]

<br>
Use `*` to repeat lists.

In [19]:
[1]*3

[1, 1, 1]

<br>
Use the `in` operator to check if something is inside a list.

In [20]:
1 in [1, 2, 3]

True

<br>
Now let's look at strings. Use bracket notation to slice a string.

In [21]:
x = 'This is a string'
print(x[0]) #first character
print(x[0:1]) #first character, but we have explicitly set the end character
print(x[0:2]) #first two characters


T
T
Th


<br>
This will return the last element of the string.

In [22]:
x[-1]

'g'

<br>
This will return the slice starting from the 4th element from the end and stopping before the 2nd element from the end.

In [23]:
x[-4:-2]

'ri'

<br>
This is a slice from the beginning of the string and stopping before the 3rd element.

In [24]:
x[:3]

'Thi'

<br>
And this is a slice starting from the 4th element of the string and going all the way to the end.

In [25]:
x[3:]

's is a string'

Slicing is core to the Python language and is a big part of scientific computing with Python as well, especially if you start manipulating matrices. We're going to talk more about slicing in the next module.

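As we saw, strings behave much like lists of characters. Most operations that read from or combine lists, such as indexing, slicing, concatenation, repetition and the `in` operator, also work on strings. The difference is that strings are immutable, so operations that modify a list in place, like `append`, are not available.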

In [26]:
firstname = 'Christopher'
lastname = 'Brooks'

print(firstname + ' ' + lastname)
print(firstname*3)
print('Chris' in firstname)


Christopher Brooks
ChristopherChristopherChristopher
True
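One difference from lists is worth a quick sketch: a string cannot be modified in place. The names `nickname`, `renamed` and `letters` below are illustrative; the pattern is to build a new string, or to convert to a list when you need a mutable copy.

```python
name = 'Christopher'

try:
    name[0] = 'K'              # strings do not support item assignment
except TypeError as err:
    print('TypeError:', err)

# Build a new string instead of modifying the old one.
nickname = name[:5]            # 'Chris'
renamed = 'K' + name[1:]       # 'Khristopher'
print(nickname, renamed)

# Converting to a list gives a mutable copy of the characters.
letters = list(nickname)
letters.append('!')
exclaimed = ''.join(letters)
print(exclaimed)               # Chris!
```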


<br>
`split` returns a list of all the words in a string, or a list split on a specific character.

In [27]:
firstname = 'Christopher Arthur Hansen Brooks'.split(' ')[0] # [0] selects the first element of the list
lastname = 'Christopher Arthur Hansen Brooks'.split(' ')[-1] # [-1] selects the last element of the list
print(firstname)
print(lastname)

Christopher
Brooks


In Python 3, strings are Unicode based. In early computing, each character of a string was limited to one of 256 different values (a single byte). This was enough to represent all of the upper and lower case Latin characters, as well as the digits. The most common such encoding, ASCII, uses only 128 of those values and is fairly compact. But the world doesn't just run on Latin characters, and there's a need to support non-English languages as well as characters which are not commonly used in words but are commonly used elsewhere, like mathematical operators.

Unicode, along with the Unicode Transformation Formats (UTF) used to encode it, is an attempt to solve this. It can be used to represent over a million different characters. This includes not only human languages, as you might expect, but symbols like emojis too.

Python 3 uses Unicode by default so there is no problem in dealing with international character sets.
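A short sketch of what "Unicode by default" looks like in practice. The sample word is arbitrary and is written with escape sequences so the code points are unambiguous. The length of a `str` counts characters, while the UTF-8 encoding of the same text may need more than one byte per character.

```python
word = 'na\u00efve caf\u00e9'     # "naïve café"
print(len(word))                  # 10 characters

encoded = word.encode('utf-8')    # a bytes object
print(len(encoded))               # 12 bytes: the two accented letters take two bytes each
print(encoded.decode('utf-8') == word)   # True

print(ord('A'), chr(960))         # 65 and the character at code point 960 (Greek small letter pi)
```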

<br>
Make sure you convert objects to strings before concatenating.

In [28]:
'Chris' + 2

TypeError: can only concatenate str (not "int") to str

In [35]:
'Chris' + str(2)

'Chris2'

<br>
Dictionaries associate keys with values.

In [36]:
x = {'Christopher Brooks': 'brooksch@umich.edu', 'Bill Gates': 'billg@microsoft.com'}
x['Christopher Brooks'] # Retrieve a value by using the indexing operator


'brooksch@umich.edu'

In [37]:
x['Kevyn Collins-Thompson'] = None
x['Kevyn Collins-Thompson']

<br>
Iterate over all of the keys:

In [38]:
for name in x:
    print(x[name])

brooksch@umich.edu
billg@microsoft.com
None


<br>
Iterate over all of the values:

In [39]:
for email in x.values():
    print(email)

brooksch@umich.edu
billg@microsoft.com
None


<br>
Iterate over all of the items in the dictionary:

In [40]:
for name, email in x.items():
    print(name)
    print(email)

Christopher Brooks
brooksch@umich.edu
Bill Gates
billg@microsoft.com
Kevyn Collins-Thompson
None


This last example is a little bit different, and it's an example of something called unpacking. In Python you can have a sequence, that is, a list or a tuple of values, and you can unpack those items into different variables through assignment in one statement.

<br>
You can unpack a sequence into different variables:

In [41]:
x = ('Christopher', 'Brooks', 'brooksch@umich.edu')
fname, lname, email = x

In [42]:
fname

'Christopher'

In [43]:
lname

'Brooks'

<br>
Make sure the number of values you are unpacking matches the number of variables being assigned.

In [44]:
x = ('Christopher', 'Brooks', 'brooksch@umich.edu', 'Ann Arbor')
fname, lname, email = x

ValueError: too many values to unpack (expected 3)

In [47]:
x = ('Christopher', 'Brooks', 'brooksch@umich.edu', 'Ann Arbor')
fname, lname, email, name = x

# The Python Programming Language: More on Strings

In [48]:
print('Chris' + 2)

TypeError: can only concatenate str (not "int") to str

In [49]:
print('Chris' + str(2))

Chris2


<br>
Python has a built-in method for convenient string formatting.

In [50]:
sales_record = {
    'price': 3.24,
    'num_items': 4,
    'person': 'Chris'}

sales_statement = '{} bought {} item(s) at a price of {} each for a total of {}'

print(sales_statement.format(sales_record['person'],
                             sales_record['num_items'],
                             sales_record['price'],
                             sales_record['num_items']*sales_record['price']))


Chris bought 4 item(s) at a price of 3.24 each for a total of 12.96
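As a variation on the cell above, `str.format` also accepts named fields and format specifications. The sketch below reuses the same `sales_record`; the `template` and `total` names are introduced here only for illustration. The `:.2f` specification fixes the number of decimal places, which is handy when a product of floats would otherwise print with a long tail of digits.

```python
sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris'}

# Named fields are filled from keyword arguments; ** unpacks the dictionary into them.
template = ('{person} bought {num_items} item(s) at a price of {price:.2f} '
            'each for a total of {total:.2f}')
total = sales_record['num_items'] * sales_record['price']
statement = template.format(total=total, **sales_record)
print(statement)
```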


String manipulation is a big part of data cleaning.

# Reading and Writing CSV files

<br>
Let's import our datafile mpg.csv, which contains fuel economy data for 234 cars.

* mpg : miles per gallon
* class : car classification
* cty : city mpg
* cyl : # of cylinders
* displ : engine displacement in liters
* drv : f = front-wheel drive, r = rear wheel drive, 4 = 4wd
* fl : fuel (e = ethanol E85, d = diesel, r = regular, p = premium, c = CNG)
* hwy : highway mpg
* manufacturer : automobile manufacturer
* model : model of car
* trans : type of transmission
* year : model year

In [58]:
import csv

%precision 2

with open('../datasets/mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile))
    
mpg[:3] # The first three dictionaries in our list.

[OrderedDict([('', '1'),
              ('manufacturer', 'audi'),
              ('model', 'a4'),
              ('displ', '1.8'),
              ('year', '1999'),
              ('cyl', '4'),
              ('trans', 'auto(l5)'),
              ('drv', 'f'),
              ('cty', '18'),
              ('hwy', '29'),
              ('fl', 'p'),
              ('class', 'compact')]),
 OrderedDict([('', '2'),
              ('manufacturer', 'audi'),
              ('model', 'a4'),
              ('displ', '1.8'),
              ('year', '1999'),
              ('cyl', '4'),
              ('trans', 'manual(m5)'),
              ('drv', 'f'),
              ('cty', '21'),
              ('hwy', '29'),
              ('fl', 'p'),
              ('class', 'compact')]),
 OrderedDict([('', '3'),
              ('manufacturer', 'audi'),
              ('model', 'a4'),
              ('displ', '2'),
              ('year', '2008'),
              ('cyl', '4'),
              ('trans', 'manual(m6)'),
              ('drv',

<br>
`csv.DictReader` has read in each row of our csv file as a dictionary. `len` shows that our list is made up of 234 dictionaries.

In [62]:
len(mpg)

234

<br>
`keys` gives us the column names of our csv.

In [63]:
mpg[0].keys()

odict_keys(['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class'])

<br>
This is how to find the average cty fuel economy across all cars. All values in the dictionaries are strings, so we need to convert to float.

In [64]:
sum(float(d['cty']) for d in mpg) / len(mpg)

16.86

<br>
Similarly this is how to find the average hwy fuel economy across all cars.

In [65]:
sum(float(d['hwy']) for d in mpg) / len(mpg)

23.44

<br>
Use `set` to return the unique values for the number of cylinders the cars in our dataset have.

In [66]:
cylinders = set(d['cyl'] for d in mpg)
cylinders

{'4', '5', '6', '8'}

<br>
Here's a more complex example where we are grouping the cars by number of cylinders, and finding the average cty mpg for each group.

In [67]:
CtyMpgByCyl = []

for c in cylinders: # iterate over all the cylinder levels
    summpg = 0
    cyltypecount = 0
    for d in mpg: # iterate over all dictionaries
        if d['cyl'] == c: # if the cylinder level type matches,
            summpg += float(d['cty']) # add the cty mpg
            cyltypecount += 1 # increment the count
    CtyMpgByCyl.append((c, summpg / cyltypecount)) # append the tuple ('cylinder', 'avg mpg')

CtyMpgByCyl.sort(key=lambda x: x[0])
CtyMpgByCyl

[('4', 21.01), ('5', 20.50), ('6', 16.22), ('8', 12.57)]

<br>
Use `set` to return the unique values for the class types in our dataset.

In [68]:
vehicleclass = set(d['class'] for d in mpg) # what are the class types
vehicleclass

{'2seater', 'compact', 'midsize', 'minivan', 'pickup', 'subcompact', 'suv'}

<br>
And here's an example of how to find the average hwy mpg for each class of vehicle in our dataset.

In [69]:
HwyMpgByClass = []

for t in vehicleclass: # iterate over all the vehicle classes
    summpg = 0
    vclasscount = 0
    for d in mpg: # iterate over all dictionaries
        if d['class'] == t: # if the vehicle class matches,
            summpg += float(d['hwy']) # add the hwy mpg
            vclasscount += 1 # increment the count
    HwyMpgByClass.append((t, summpg / vclasscount)) # append the tuple ('class', 'avg mpg')

HwyMpgByClass.sort(key=lambda x: x[1])
HwyMpgByClass

[('pickup', 16.88),
 ('suv', 18.13),
 ('minivan', 22.36),
 ('2seater', 24.80),
 ('midsize', 27.29),
 ('subcompact', 28.14),
 ('compact', 28.30)]
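The nested loops above scan the whole dataset once per group. One alternative, sketched here under the assumption that a single pass is preferable, accumulates sums and counts per group with `collections.defaultdict`. The `rows` list is a small hypothetical sample in the same shape that `csv.DictReader` produces, not the real mpg data.

```python
from collections import defaultdict

# Hypothetical rows shaped like csv.DictReader output (all values are strings).
rows = [
    {'class': 'compact', 'hwy': '29'},
    {'class': 'compact', 'hwy': '31'},
    {'class': 'suv', 'hwy': '18'},
    {'class': 'suv', 'hwy': '20'},
    {'class': 'pickup', 'hwy': '17'},
]

totals = defaultdict(float)   # class -> sum of hwy mpg
counts = defaultdict(int)     # class -> number of cars

for row in rows:              # one pass over the data
    totals[row['class']] += float(row['hwy'])
    counts[row['class']] += 1

# Build (class, average) tuples and sort them by the average.
averages = sorted(((c, totals[c] / counts[c]) for c in totals),
                  key=lambda pair: pair[1])
print(averages)   # [('pickup', 17.0), ('suv', 19.0), ('compact', 30.0)]
```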

# The Python Programming Language: Dates and Times

First, you should be aware that dates and times can be stored in many different ways. One of the most common legacy methods for storing the date and time in online transaction systems is based on the offset from the epoch, which is January 1, 1970. There's a lot of historical cruft around this, but it's not uncommon to see systems storing the date of a transaction in seconds or milliseconds since this date. So if you see large numbers where you expect to see a date and time, you'll need to convert them to make much sense out of the data.

In [70]:
import datetime as dt
import time as tm

<br>
`time` returns the current time in seconds since the epoch (January 1st, 1970).

In [71]:
tm.time()

1605814073.25

<br>
Convert the timestamp to datetime.

In [72]:
dtnow = dt.datetime.fromtimestamp(tm.time())
dtnow

datetime.datetime(2020, 11, 19, 19, 27, 54, 411064)
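When the large number comes from stored data rather than from `tm.time()`, the conversion looks the same. The sketch below uses a fixed timestamp close to the one shown above, in both seconds and milliseconds (the variable names are illustrative). Passing `tz=dt.timezone.utc` is a choice made here so that the result does not depend on the local time zone of the machine running the cell.

```python
import datetime as dt

seconds = 1605814073            # seconds since the epoch
millis = 1605814073250          # the same instant, stored in milliseconds

from_seconds = dt.datetime.fromtimestamp(seconds, tz=dt.timezone.utc)
from_millis = dt.datetime.fromtimestamp(millis / 1000, tz=dt.timezone.utc)
epoch = dt.datetime.fromtimestamp(0, tz=dt.timezone.utc)

print(from_seconds.isoformat())   # 2020-11-19T19:27:53+00:00
print(from_millis.isoformat())    # 2020-11-19T19:27:53.250000+00:00
print(epoch.isoformat())          # 1970-01-01T00:00:00+00:00
```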

<br>
Handy datetime attributes:

In [73]:
dtnow.year, dtnow.month, dtnow.day, dtnow.hour, dtnow.minute, dtnow.second # get year, month, day, etc. from a datetime

(2020, 11, 19, 19, 27, 54)

<br>
`timedelta` is a duration expressing the difference between two dates.

In [74]:
delta = dt.timedelta(days = 100) # create a timedelta of 100 days
delta

datetime.timedelta(days=100)

<br>
`date.today` returns the current local date.

In [75]:
today = dt.date.today()

In [76]:
today - delta # the date 100 days ago

datetime.date(2020, 8, 11)

In [77]:
today > today-delta # compare dates

True
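The arithmetic also works in the other direction: subtracting one date from another yields a `timedelta`. A small sketch with fixed dates (chosen to match the output above, so the result is reproducible):

```python
import datetime as dt

start = dt.date(2020, 8, 11)
end = dt.date(2020, 11, 19)

elapsed = end - start                       # subtracting two dates gives a timedelta
two_weeks_later = start + dt.timedelta(weeks=2)

print(elapsed.days)       # 100
print(two_weeks_later)    # 2020-08-25
```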

# The Python Programming Language: Objects and map()

While you'll use objects a lot in Python, you're less likely to be creating new classes when you use the interactive environment, because it's a bit verbose. But I think it's important to go over a few details of objects in Python, just so that you aren't surprised when you see them.


Classes in Python are generally named using camel case, which means the first character of each word is capitalized. You don't declare variables within the object, you just start using them. Class variables can also be declared. These are just variables which are shared across all instances.

To define a method, you just write it as you would a function. The one change is that, to have access to the instance on which a method is being invoked, you must include `self` as the first parameter in the method signature. Similarly, if you want to refer to instance variables set on the object, you prefix them with the word `self` and a full stop, as in `self.name`.

<br>
An example of a class in Python:

In [78]:
class Person:
    department = 'School of Information' #a class variable

    def set_name(self, new_name): #a method
        self.name = new_name
    def set_location(self, new_location):
        self.location = new_location

We can instantiate this class by calling the class name with empty parentheses after it. Then we can call methods and print out attributes of the object using the dot notation common in most languages.

There are a couple of implications of object-oriented programming in Python, that you should take away from this very brief example. First, objects in Python do not have private or protected members. If you instantiate an object, you have full access to any of the methods or attributes of that object. Second, there's no need for an explicit constructor when creating objects in Python. You can add a constructor if you want to by declaring the **`__init__`** method.

**If you're interested in more detail, I'd recommend checking out the Python documentation, in particular the Python tutorial. It's a fairly comprehensive overview of the object features of the language, and there will be a reference in the class resources.**

In [79]:
person = Person()
person.set_name('Christopher Brooks')
person.set_location('Ann Arbor, MI, USA')
print('{} lives in {} and works in the department {}'.format(person.name, person.location, person.department))

Christopher Brooks lives in Ann Arbor, MI, USA and works in the department School of Information
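For comparison, here is a sketch of the same idea with an explicit constructor. The class name `PersonWithInit` is hypothetical, chosen so it does not replace the `Person` class above. It also illustrates the two points made earlier: class variables are shared across instances, and attributes are fully accessible from outside the object.

```python
class PersonWithInit:
    department = 'School of Information'   # class variable, shared by every instance

    def __init__(self, name, location):    # constructor, runs when PersonWithInit(...) is called
        self.name = name
        self.location = location


first = PersonWithInit('Christopher Brooks', 'Ann Arbor, MI, USA')
second = PersonWithInit('Kevyn Collins-Thompson', 'Ann Arbor, MI, USA')
print(first.name, '|', first.department)

# Rebinding the class variable is visible through all instances.
PersonWithInit.department = 'SI'
print(first.department, second.department)   # SI SI

# Nothing is private: attributes can be read or replaced from outside.
first.location = 'Remote'
print(first.location)                         # Remote
```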


The map function is one of the bases for functional programming in Python. Functional programming is a programming paradigm in which you explicitly declare all parameters which could change through execution of a given function. Thus functional programming is referred to as being side-effect free, because there is a software contract that describes what can actually change by calling a function. Now, Python isn't a functional programming language in the pure sense, since functions can have many side effects, and you certainly don't have to pass in as parameters everything that you're interested in changing.

But functional programming causes one to think more carefully about chaining operations together, and this really is an underlying theme in much of data science, and data cleaning in particular. So functional programming methods are often used in Python, and it's not uncommon to see a parameter for a function be a function itself.

The `map` built-in function is one example of a functional programming feature of Python that I think ties together a number of aspects of the language. The map function signature looks like this: `map(function, iterable, ...)`. The first parameter is the function that you want executed, and the second parameter, and every following parameter, is something which can be iterated upon.

All the iterable arguments are unpacked together, and passed into the given function. That's a little cryptic, so let's take a look at an example.

<br>
Here's an example of mapping the `min` function between two lists.

In [80]:
store1 = [10.00, 11.00, 12.34, 2.34]
store2 = [9.00, 11.10, 12.34, 2.01]
cheapest = map(min, store1, store2)
cheapest

<map at 0x7fdd34345748>

But when we go to print out the map, we see that we get an odd reference value instead of a list of items that we're expecting. This is called **lazy evaluation**. In Python, the map function returns to you a map object. It doesn't actually try and run the function min on two items, until you look inside for a value. This is an interesting design pattern of the language, and it's commonly used when dealing with big data. This allows us to have very efficient memory management, even though something might be computationally complex.

Maps are iterable, just like lists and tuples, so we can use a for loop to look at all of the values in the map.

This passing around of functions and data structures which they should be applied to, is a hallmark of functional programming. It's very common in data analysis and cleaning.

<br>
Now let's iterate through the map object to see the values.

In [81]:
for item in cheapest:
    print(item)

9.0
11.0
12.34
2.01
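One consequence of lazy evaluation is easy to miss: a map object can only be consumed once. The sketch below rebuilds the same map and wraps it in `list` twice (the names `first_pass` and `second_pass` are illustrative). If you need the values more than once, store the list.

```python
store1 = [10.00, 11.00, 12.34, 2.34]
store2 = [9.00, 11.10, 12.34, 2.01]

cheapest = map(min, store1, store2)

first_pass = list(cheapest)    # forces evaluation of every pair
second_pass = list(cheapest)   # the map object is now exhausted

print(first_pass)    # [9.0, 11.0, 12.34, 2.01]
print(second_pass)   # []
```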


# The Python Programming Language: Lambda and List Comprehensions

You may have seen the keyword `lambda` appear in this week's content, and you'll certainly see it appear more as you spend more and more time with Python and data science. Lambdas are Python's way of creating anonymous functions. These are the same as other functions, but they have no name. The intent is that they're simple or short-lived, and it's easier just to write out the function in one line instead of going to the trouble of creating a named function.

<br>
Here's an example of a lambda that takes in three parameters and adds the first two.

In [82]:
my_function = lambda a, b, c : a + b

There's only one expression to be evaluated in a lambda. The expression value is returned on execution of the lambda. The return of a lambda is a function reference. 

Note that you can't have complex logic inside of the lambda itself, because you're limited to a single expression. Lambda parameters can, however, have default values, just like the parameters of a regular function.

In [83]:
my_function(1, 2, 3)

3

In [91]:
people = ['Dr. Christopher Brooks', 'Dr. Kevyn Collins-Thompson', 'Dr. VG Vinod Vydiswaran', 'Dr. Daniel Romero']

def split_title_and_name(person):
    return person.split()[0] + ' ' + person.split()[-1]

lambda_function = lambda person: person.split()[0] + ' ' + person.split()[-1]

#option 1
for person in people:
    print(split_title_and_name(person) == lambda_function(person))

#option 2
#list(map(split_title_and_name, people)) == list(map(???))

True
True
True
True
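Lambdas most often show up as arguments to other functions, as with the `key=lambda x: x[0]` used for sorting earlier. As a small sketch using the same `people` list (the name `by_last_name` is introduced here), a lambda can supply the sort key:

```python
people = ['Dr. Christopher Brooks', 'Dr. Kevyn Collins-Thompson',
          'Dr. VG Vinod Vydiswaran', 'Dr. Daniel Romero']

# Sort by last name; the lambda is called once per element to compute the sort key.
by_last_name = sorted(people, key=lambda person: person.split()[-1])
print(by_last_name)
```

The original list is left unchanged because `sorted` returns a new list.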


So lambdas are really much more limited than full function definitions. But I think they're very useful for simple little data cleaning tasks. And you'll see lots of examples with them on the web. So you should be able to read and write lambdas.
<br>
Let's iterate from 0 to 999 and return the even numbers.

In [92]:
my_list = []
for number in range(0, 1000):
    if number % 2 == 0:
        my_list.append(number)
my_list

[0,
 2,
 4,
 6,
 8,
 10,
 12,
 14,
 16,
 18,
 20,
 22,
 24,
 26,
 28,
 30,
 32,
 34,
 36,
 38,
 40,
 42,
 44,
 46,
 48,
 50,
 52,
 54,
 56,
 58,
 60,
 62,
 64,
 66,
 68,
 70,
 72,
 74,
 76,
 78,
 80,
 82,
 84,
 86,
 88,
 90,
 92,
 94,
 96,
 98,
 100,
 102,
 104,
 106,
 108,
 110,
 112,
 114,
 116,
 118,
 120,
 122,
 124,
 126,
 128,
 130,
 132,
 134,
 136,
 138,
 140,
 142,
 144,
 146,
 148,
 150,
 152,
 154,
 156,
 158,
 160,
 162,
 164,
 166,
 168,
 170,
 172,
 174,
 176,
 178,
 180,
 182,
 184,
 186,
 188,
 190,
 192,
 194,
 196,
 198,
 200,
 202,
 204,
 206,
 208,
 210,
 212,
 214,
 216,
 218,
 220,
 222,
 224,
 226,
 228,
 230,
 232,
 234,
 236,
 238,
 240,
 242,
 244,
 246,
 248,
 250,
 252,
 254,
 256,
 258,
 260,
 262,
 264,
 266,
 268,
 270,
 272,
 274,
 276,
 278,
 280,
 282,
 284,
 286,
 288,
 290,
 292,
 294,
 296,
 298,
 300,
 302,
 304,
 306,
 308,
 310,
 312,
 314,
 316,
 318,
 320,
 322,
 324,
 326,
 328,
 330,
 332,
 334,
 336,
 338,
 340,
 342,
 344,
 346,
 348,
 350,

We've learned a lot about sequences and other collections in Python: tuples, lists, dictionaries and so forth. These are structures that we can iterate over, and often we create them through loops or by reading in data from a file.

Python has built-in support for creating these collections using a more abbreviated syntax called list comprehensions.
<br>
Now the same thing but with list comprehension.

In [93]:
my_list = [number for number in range(0,1000) if number % 2 == 0]
my_list

[0,
 2,
 4,
 6,
 8,
 10,
 12,
 14,
 16,
 18,
 20,
 22,
 24,
 26,
 28,
 30,
 32,
 34,
 36,
 38,
 40,
 42,
 44,
 46,
 48,
 50,
 52,
 54,
 56,
 58,
 60,
 62,
 64,
 66,
 68,
 70,
 72,
 74,
 76,
 78,
 80,
 82,
 84,
 86,
 88,
 90,
 92,
 94,
 96,
 98,
 100,
 102,
 104,
 106,
 108,
 110,
 112,
 114,
 116,
 118,
 120,
 122,
 124,
 126,
 128,
 130,
 132,
 134,
 136,
 138,
 140,
 142,
 144,
 146,
 148,
 150,
 152,
 154,
 156,
 158,
 160,
 162,
 164,
 166,
 168,
 170,
 172,
 174,
 176,
 178,
 180,
 182,
 184,
 186,
 188,
 190,
 192,
 194,
 196,
 198,
 200,
 202,
 204,
 206,
 208,
 210,
 212,
 214,
 216,
 218,
 220,
 222,
 224,
 226,
 228,
 230,
 232,
 234,
 236,
 238,
 240,
 242,
 244,
 246,
 248,
 250,
 252,
 254,
 256,
 258,
 260,
 262,
 264,
 266,
 268,
 270,
 272,
 274,
 276,
 278,
 280,
 282,
 284,
 286,
 288,
 290,
 292,
 294,
 296,
 298,
 300,
 302,
 304,
 306,
 308,
 310,
 312,
 314,
 316,
 318,
 320,
 322,
 324,
 326,
 328,
 330,
 332,
 334,
 336,
 338,
 340,
 342,
 344,
 346,
 348,
 350,

Just like lambdas, list comprehensions are a condensed format which may offer readability and performance benefits, and you'll often find them being used in data science tutorials or on Stack Overflow.
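To connect the two ideas, here is a brief sketch of a comprehension's three parts (an expression, a `for` clause, and an optional condition) next to an equivalent built from `map` and `filter` with lambdas. The variable names are illustrative, and `filter` is a built-in not otherwise used in this notebook.

```python
people = ['Dr. Christopher Brooks', 'Dr. Kevyn Collins-Thompson']

# expression / for clause / optional condition
last_names = [person.split()[-1] for person in people]
squares_of_evens = [n * n for n in range(10) if n % 2 == 0]

# The same results built with map and filter plus lambdas.
last_names_map = list(map(lambda person: person.split()[-1], people))
squares_map = list(map(lambda n: n * n, filter(lambda n: n % 2 == 0, range(10))))

print(last_names)          # ['Brooks', 'Collins-Thompson']
print(squares_of_evens)    # [0, 4, 16, 36, 64]
print(last_names == last_names_map, squares_of_evens == squares_map)   # True True
```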