# Introduction to Python

This is a modified version of the original course notebook, which used Python 2.7. I'm also thankful for the input of my friend and colleague, Anthony Debarros, and for his generosity in allowing me to incorporate some of his training examples in this tutorial.

## Basic maneuvering: Import statements

To use a Python library in a script, we have to import it. You do that using what are known as import statements. Below we'll import the csv module, which lets us read and write CSV files.

In [1]:
import csv

Now we'll import the BeautifulSoup class from the bs4 package.

In [2]:
from bs4 import BeautifulSoup

And we'll import the requests module.

In [3]:
import requests

We'll use these later when we're writing our web scraper. For now, let's see which version of Python we're running.

In [4]:
import sys
print(sys.version)

3.6.8 (v3.6.8:3c6b436a57, Dec 24 2018, 02:04:31) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]


# Notes

Notes, usually called comments, are written sections in your code that are not executed by the Python interpreter. The most common type of note begins with a `#`. This "notes out" the text to the right of the symbol. You can also use ```'''``` to write multiline notes or what are called [docstrings](https://docs.python.org/3/tutorial/controlflow.html#documentation-strings). Strictly speaking, text inside ```'''``` is a string rather than a comment, but a string that isn't assigned to anything has no effect. While we'd like our code to be clean enough for people to read without too many notes, that's not always the case. Noting your code helps the future you, and others who work with your code, remember what you did.

In [None]:
# This is a note

'''
This is a docstring.
'''
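
As a small sketch of the difference, the hypothetical function `greet` below (not part of the original notebook) carries both kinds of note. The `#` note is skipped entirely, while the triple-quoted string placed first inside the function becomes its docstring, which Python stores and can show back to you.

```python
def greet(name):
    '''Return a short greeting for the given name.'''
    # This is a note: the interpreter skips it.
    return 'Hello, ' + name + '!'

print(greet('world'))
print(greet.__doc__)   # the docstring is kept and can be read back
```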

# Printing

The `print()` function outputs text to wherever your Python code is running, whether the terminal or here in the notebook.

In [None]:
print('Hello, world!')

# Data types

Just like in SQL, Python works with several different types of data, such as integers, strings (text), dates, times and decimals (also known as floats). Let's take a look at these using Python's built-in function `type()`, which tells us what kind of data we're working with.

In [5]:
type(10)

int

In [6]:
type('Hello, world!') ## Notice the single quotes.

str

In [7]:
type(3.14)

float

In [8]:
type('2020-01-20') ## Notice something off here?

str

To work with dates and times, we need to use Python's built-in datetime module.

In [14]:
import datetime

d = datetime.date(2019, 7, 11)
type(d)

datetime.date
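
The string `'2020-01-20'` from the earlier cell can also be converted into a real date. This is a minimal sketch, assuming the text is always in year-month-day order; the variable names are only for illustration.

```python
import datetime

date_text = '2020-01-20'  # still a str at this point

# strptime reads the text using the pattern: 4-digit year, month, day
parsed = datetime.datetime.strptime(date_text, '%Y-%m-%d').date()

print(type(parsed))
print(parsed.year, parsed.month, parsed.day)
```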

# Variables
Variables hold data of several types, including strings, numbers and objects such as lists. To store a value in a variable, we use the `=` operator. Python is a "dynamically typed" language, meaning that you don't need to declare a variable's type before assigning a value to it. 

Just like we went over in SQL, variables stand in for values and help us as a shorthand but also as a way to build functions and code that isn't reliant on specific values but rather categories or types of values. 

For example, we can use Python to calculate a sum directly, as below.

In [None]:
10 + 6

Or we can use variables, which give us more flexibility.

In [16]:
x = 10

y = 6

In [None]:
print(x + y) # Addition of variables
print(y + 12) # Addition of other integers
print(x / y) # Division
print(y ** 2) # Exponentiation (y squared); use * for multiplication

Important: Notice that the above prints a result but it doesn't change the underlying variables. For example:

In [None]:
print(x) ## Still equals 10 despite the addition and other math done above

Why is this? Because to change a variable we need to assign it a new value with `=`, like we did when we first defined `x` and `y`.

In [None]:
x = x + y
print(x) ## Now x = 16

We can also assign the result to a new variable, so we maintain our old ones. 

In [20]:
x = 10 ## Sets x back to our original value.
z = x + y 

print(z) ## z now equals 16

16


Notice that if I add two integers, I get an integer.

In [21]:
type(z)

int

If I add an integer to a float, I get a float.

In [None]:
t = 5.6

g = x + t

print(g)
type(g)
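
Division is a related case worth seeing. In Python 3, `/` returns a float even when both numbers are integers, while `//` divides and rounds down to a whole number. A quick sketch using the same `x` and `y` values as above:

```python
x = 10
y = 6

total = x + y        # int + int gives an int
quotient = x / y     # / gives a float in Python 3
floored = x // y     # // rounds down to a whole number
mixed = x + 5.6      # int + float gives a float

print(type(total), type(quotient), type(floored), type(mixed))
print(floored)
```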

# Working with strings (text)

Much like Excel and OpenRefine, Python is great for wrangling text. And it's much more powerful than the tools we've used so far. Like we saw above, you use quote marks, either single or double quotes, to denote string values.

In [28]:
a = 'Jaqueline P. Gillum'
print(a)

Jaqueline P. Gillum


In [27]:
# Strings have methods you can use to transform text.

print(a.upper())
print(a.lower())
print(a.replace('P', 'T'))
print(a.replace('P', 'T').lower())


# List of string methods at https://www.w3schools.com/python/python_ref_string.asp

JAQUELINE P. GILLUM
jaqueline p. gillum
Jaqueline T. Gillum
jaqueline t. gillum


In [24]:
# The methods do not alter the original string.
print(a)

Jaqueline P. Gillum
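
To keep the result of a method, assign it to a variable. The sketch below uses a made-up messy version of the same name (an assumption for illustration) and chains `strip()` and `title()`, two more built-in string methods, to clean it up.

```python
messy = '  jaqueline p. gillum  '

# strip() removes leading and trailing whitespace; title() capitalizes each word
clean = messy.strip().title()

print(clean)
print(repr(messy))  # the original string is unchanged
```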


In [25]:
# Concatenate strings.

first_name = 'Phillip'
middle_name = 'Seymour'
last_name = 'Hoffman'

print(first_name + ' ' + last_name)

print(first_name + ' ' + middle_name + ' ' + last_name)

Phillip Hoffman
Phillip Seymour Hoffman


Note: If you want to concatenate a number with a string, you must first convert the number to a string, like below.

In [None]:
a = 5

print('The number is ' + str(a))

Or even better, use string interpolation. Python's f-strings (strings prefixed with `f`) let you insert variables into text using curly braces.

In [None]:
# Python also lets you insert variables into strings via 'string interpolation'.

print(f'I have {a} cats.')
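
f-strings can also format a value as they insert it. Here is a small sketch using Montgomery County's population figure from the county data later in this tutorial; the variable names are illustrative.

```python
county = 'Montgomery County'
population = 971777

# :, adds thousands separators; expressions can also go inside the braces
message = f'{county} has {population:,} residents.'
half = f'Half of that is {population / 2:.1f}.'

print(message)
print(half)
```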

# Boolean operators: True, False, None, and, or
    
In Python, you can assign special values to variables to indicate a state of true or false as well as the absence of a value. When evaluating variables, you can test more than one condition at a time by using the operators `and` and `or`.

Just like in SQL, the concept of a null value is represented in Python using `None`. This is just like it sounds. It stands for no value. It does not represent `0`.

In [36]:
animal = None

print(animal)

None


In [None]:
## We use one equal sign to assign variables

animal = 'cheetah'

name = 'Chaz'

## We use two equal signs to test their equality.

animal == 'Elephant'



In [None]:
animal == 'cheetah'

Also, like in SQL, we can use `and` and `or` to test multiple conditions at once. To test whether all of the conditions are True, use the `and` operator.


In [None]:
animal == 'cheetah' and name == 'Chaz'

What will the below return?

In [None]:
animal == 'cheetah' or animal == 'lion'

We can also assign a value of True or False to a variable.

In [None]:
is_open = False
print(is_open)
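
Here is a short sketch pulling these pieces together. It shows that `None` is not the same as `0` or an empty string, and how `and` and `or` combine two tests. The usual way to test for `None` is with `is` rather than `==`. The name `'Bob'` is made up for the example.

```python
animal = None

print(animal is None)   # True: no value has been assigned
print(animal == 0)      # False: None is not zero
print(animal == '')     # False: None is not an empty string

animal = 'cheetah'
name = 'Chaz'

both = animal == 'cheetah' and name == 'Bob'    # False: the second test fails
either = animal == 'cheetah' or name == 'Bob'   # True: the first test passes
print(both, either)
```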

# First data structure: Lists

Lists are what are known in Python as a collection. They are ordered and can be changed. I like to think of them as containers. We will be using a list later on to store the individual records we scrape from the FBI's FOIA Reading Room website.

In Python, text is stored in a data type called a string. Strings are sequences of characters, but for our purposes just think of them as ways to store words and letters.
Numbers are stored in three data types: int, float and complex. 

We generally use integers (whole numbers) and floats (decimals). 

Later we will go over two other types of data structures: a dictionary and a tuple. But first, let's explore lists.

Here we have a list with two string, or text, objects:

In [5]:
list_of_strings = ['hello', 'world']

Here we have a list of five Python integers.

In [32]:
list_of_nums =  [1, 3, 5, 7, 11]

Python can tell you what type you're working with. The cell below tells us this is a list.

In [7]:
type(list_of_strings)

list

In [8]:
type(list_of_nums)

list

Or how many objects are in our list.

In [9]:
len(list_of_strings)

2

In [10]:
len(list_of_nums)

5

Or tell you the type of object located in a certain position of your list. For instance, this tells us the type of the first item in our list of strings, which is 'hello'.

In [11]:
type(list_of_strings[0])

str

This tells us the type of the second item in our list of numbers, which is 3.

In [12]:
type(list_of_nums[1])

int

Let's print the items in our list of strings.

In [13]:
print(list_of_strings)

['hello', 'world']


Now let's print the objects in our list of numbers.

In [14]:
print(list_of_nums)

[1, 3, 5, 7, 11]


And let's print just an item from each. In this case, the second object in our list of strings and the fifth item in our list of numbers. Below we're using what's called a list index to locate the specific item we want out of the list. Note: Python counts from zero, so the first item has an index of `[0]`.

In [16]:
print(list_of_strings[1])

world


In [17]:
print(list_of_nums[4])

11


We can use indexes to slice out portions of our list, like below.

In [None]:
print(list_of_nums[0:2])

Notice that the slice above works from left to right. We can also get the last value by counting from the right with a negative index.

In [None]:
print(list_of_nums[-1])
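
Here are a few more slices, as a sketch. The number before the colon is where the slice starts and the number after it is where it stops; the stop position itself is not included. Leaving a side blank means "from the start" or "to the end".

```python
list_of_nums = [1, 3, 5, 7, 11]

first_two = list_of_nums[0:2]   # [1, 3]
from_third = list_of_nums[2:]   # [5, 7, 11]
last_two = list_of_nums[-2:]    # [7, 11]
last_one = list_of_nums[-1]     # 11

print(first_two, from_third, last_two, last_one)
```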

We can sort our list.

In [33]:
# Sort the list.

list_of_nums.sort() ## ascending

print(list_of_nums)

list_of_nums.sort(reverse = True)

print(list_of_nums)


[1, 3, 5, 7, 11]
[11, 7, 5, 3, 1]


Add and remove values.

In [None]:
list_of_nums.append(13)
list_of_strings.remove('hello')
print(list_of_strings)
print(list_of_nums)
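
Note that `sort()`, `append()` and `remove()` all change the list in place. If you want to keep the original order, the built-in `sorted()` function returns a new list instead. A sketch with a fresh, made-up list of numbers:

```python
nums = [7, 1, 11, 3, 5]

ordered = sorted(nums)   # new sorted list; nums keeps its order
nums.append(13)          # adds 13 to the end of nums
nums.remove(7)           # removes the first 7 it finds

print(ordered)  # [1, 3, 5, 7, 11]
print(nums)     # [1, 11, 3, 5, 13]
```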

Here's another great method for cleaning text data: we can use `split()` to break up a string into a list.

In [35]:
b = 'How many ice cream scoops can I get for a dollar?'
print(b.split())

['How', 'many', 'ice', 'cream', 'scoops', 'can', 'I', 'get', 'for', 'a', 'dollar?']
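
By default `split()` breaks on whitespace, but you can pass it a different separator. This sketch pulls apart a date string on its hyphens and converts each piece to an integer with `int()`:

```python
date_text = '2019-07-11'

parts = date_text.split('-')   # ['2019', '07', '11'], still strings
year = int(parts[0])
month = int(parts[1])
day = int(parts[2])

print(parts)
print(year, month, day)        # 2019 7 11
```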


## Loops (a type of control flow)

In Python and most programming languages, we use loops. Loops are a programming construct that allows us to repeat a series of commands and apply them to every object in a grouping, or in this case, a list.

In [18]:
for i in list_of_strings: # Read: For each item in my list do the following
    print(i) # Print, or display, the item

hello
world


Now try the same for our list of numbers.

In [19]:
for n in list_of_nums: 
    print(n)

1
3
5
7
11


For loops are very powerful and can be as simple or as complex as you make them.

In [50]:

list_of_nums = [1, 3, 5, 7, 11]


for n in list_of_nums:
    thing = n + 10
    print(thing)

print(list_of_nums)

11
13
15
17
21
[1, 3, 5, 7, 11]
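
The loop above prints each new value but does not keep it. To store the results, one common pattern (sketched here with an illustrative list name, `bigger_nums`) is to start with an empty list and append to it inside the loop.

```python
list_of_nums = [1, 3, 5, 7, 11]
bigger_nums = []   # empty list to collect the results

for n in list_of_nums:
    bigger_nums.append(n + 10)

print(bigger_nums)    # [11, 13, 15, 17, 21]
print(list_of_nums)   # [1, 3, 5, 7, 11], unchanged
```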


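## Printing in specific ways

The string method `.join()` lets us print in specific ways by combining the items of a list into a single string, with a separator of our choice between them. For example, let's print each item of our list of strings on a new line.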

In [21]:
print('\n'.join(list_of_strings))

hello
world
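
`.join()` only works on strings, so a list of numbers has to be converted first. A sketch that applies `str()` to each item before joining:

```python
list_of_nums = [1, 3, 5, 7, 11]

# Convert each number to a string, then join them with a comma and a space
as_text = ', '.join(str(n) for n in list_of_nums)

print(as_text)   # 1, 3, 5, 7, 11
```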


Now we're ready to start working with csvs.

## Reading and writing csvs

Remember our import statements. Let's import the built-in csv module.

In [28]:
import csv

When working with csvs in Python, we first need to read in the data from the csv. You do this by using the csv module's `reader`. For this exercise, we'll be using the county_pops.csv we used last week, which includes population data for Maryland counties from the U.S. Census Bureau.

First, for ease of use in this exercise, we'll import a module called ```os``` and create a variable called work_dir that holds our current working directory. It will be easier to have the working directory path in a variable for now rather than having to type it in every time.

In [29]:
import os

In [30]:
work_dir = os.getcwd()

In [None]:
print(work_dir)

Now let's use the work_dir as we read in our county pops csv like below.

In [34]:
csvreader = csv.reader(open(work_dir+'/county_pops.csv'), delimiter=',', quotechar='"')

Now we have a Python object named csvreader that can read the rows of our county_pops.csv one at a time. Next, let's take a look at what's inside. We'll use a loop like we discussed before to do this.

In [36]:
for row in csvreader:
    print(', '.join(row))

GEO.id, GEO.id2, GEO.display-label, D001
0500000US24001, 24001, Allegany County, Maryland, 75087
0500000US24003, 24003, Anne Arundel County, Maryland, 537656
0500000US24005, 24005, Baltimore County, Maryland, 805029
0500000US24009, 24009, Calvert County, Maryland, 88737
0500000US24011, 24011, Caroline County, Maryland, 33066
0500000US24013, 24013, Carroll County, Maryland, 167134
0500000US24015, 24015, Cecil County, Maryland, 101108
0500000US24017, 24017, Charles County, Maryland, 146551
0500000US24019, 24019, Dorchester County, Maryland, 32618
0500000US24021, 24021, Frederick County, Maryland, 233385
0500000US24023, 24023, Garrett County, Maryland, 30097
0500000US24025, 24025, Harford County, Maryland, 244826
0500000US24027, 24027, Howard County, Maryland, 287085
0500000US24029, 24029, Kent County, Maryland, 20197
0500000US24031, 24031, Montgomery County, Maryland, 971777
0500000US24033, 24033, Prince George's County, Maryland, 863420
0500000US24035, 24035, Queen Anne's County, Maryla

Now let's make that data into a list. A csv reader can only be looped through once, so we create it again first.

In [42]:
csvreader = csv.reader(open(work_dir+'/county_pops.csv'), delimiter=',', quotechar='"')
things = list(csvreader)

print(things)

[['GEO.id', 'GEO.id2', 'GEO.display-label', 'D001'], ['0500000US24001', '24001', 'Allegany County, Maryland', '75087'], ['0500000US24003', '24003', 'Anne Arundel County, Maryland', '537656'], ['0500000US24005', '24005', 'Baltimore County, Maryland', '805029'], ['0500000US24009', '24009', 'Calvert County, Maryland', '88737'], ['0500000US24011', '24011', 'Caroline County, Maryland', '33066'], ['0500000US24013', '24013', 'Carroll County, Maryland', '167134'], ['0500000US24015', '24015', 'Cecil County, Maryland', '101108'], ['0500000US24017', '24017', 'Charles County, Maryland', '146551'], ['0500000US24019', '24019', 'Dorchester County, Maryland', '32618'], ['0500000US24021', '24021', 'Frederick County, Maryland', '233385'], ['0500000US24023', '24023', 'Garrett County, Maryland', '30097'], ['0500000US24025', '24025', 'Harford County, Maryland', '244826'], ['0500000US24027', '24027', 'Howard County, Maryland', '287085'], ['0500000US24029', '24029', 'Kent County, Maryland', '20197'], ['05000
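
The cells above leave the file open after reading. A common alternative, sketched below, is to open the file in a `with` block, which closes it automatically, and to pull the header off with `next()`. So that the sketch runs on its own, it first writes a two-line stand-in file to a temporary folder. The file name `sample_pops.csv` and the quoting of the county label are assumptions, while the values are copied from the output above.

```python
import csv
import os
import tempfile

# Stand-in data: the header and first county row from the output above
sample_text = (
    'GEO.id,GEO.id2,GEO.display-label,D001\n'
    '0500000US24001,24001,"Allegany County, Maryland",75087\n'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_pops.csv')  # hypothetical file name
    with open(path, 'w', newline='') as outfile:
        outfile.write(sample_text)

    # The file is closed automatically when the with block ends
    with open(path, newline='') as infile:
        reader = csv.reader(infile)
        header = next(reader)   # first row: the column names
        rows = list(reader)     # everything that is left

print(header)
print(rows)
```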

And let's slice out the first five rows.

In [44]:
csv_things=things[0:5]

print(csv_things)

[['GEO.id', 'GEO.id2', 'GEO.display-label', 'D001'], ['0500000US24001', '24001', 'Allegany County, Maryland', '75087'], ['0500000US24003', '24003', 'Anne Arundel County, Maryland', '537656'], ['0500000US24005', '24005', 'Baltimore County, Maryland', '805029'], ['0500000US24009', '24009', 'Calvert County, Maryland', '88737']]


Let's make a state column by splitting apart one of the values in each row. How would we go about that?

In [44]:
test_list = [['GEO.id', 'GEO.id2', 'GEO.display-label', 'D001'], ['0500000US24001', '24001', 'Allegany County, Maryland', '75087'], ['0500000US24003', '24003', 'Anne Arundel County, Maryland', '537656'], ['0500000US24005', '24005', 'Baltimore County, Maryland', '805029'], ['0500000US24009', '24009', 'Calvert County, Maryland', '88737']]

In [45]:
for t in test_list[1:]:
    city_state = t[2]
    # print(city_state.split(',')[1])
    state = city_state.split(',')[1]
    t.append(state)
    
print(test_list)

[['GEO.id', 'GEO.id2', 'GEO.display-label', 'D001'], ['0500000US24001', '24001', 'Allegany County, Maryland', '75087', ' Maryland'], ['0500000US24003', '24003', 'Anne Arundel County, Maryland', '537656', ' Maryland'], ['0500000US24005', '24005', 'Baltimore County, Maryland', '805029', ' Maryland'], ['0500000US24009', '24009', 'Calvert County, Maryland', '88737', ' Maryland']]


What did we forget?

In [46]:
test_list[0].append('state')

In [47]:
print(test_list)

[['GEO.id', 'GEO.id2', 'GEO.display-label', 'D001', 'state'], ['0500000US24001', '24001', 'Allegany County, Maryland', '75087', ' Maryland'], ['0500000US24003', '24003', 'Anne Arundel County, Maryland', '537656', ' Maryland'], ['0500000US24005', '24005', 'Baltimore County, Maryland', '805029', ' Maryland'], ['0500000US24009', '24009', 'Calvert County, Maryland', '88737', ' Maryland']]


Now, let's try it with the whole list of lists from our csv.
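
One way to sketch that is to wrap the steps in a function that works on any list of rows shaped like ours. The function name `add_state_column` is hypothetical, and the sketch adds `strip()` to drop the leading space we saw in `' Maryland'`. It assumes every label has the form 'County, State'. Here it runs on two rows copied from above; passing it `things` instead would process the whole file.

```python
def add_state_column(rows):
    '''Return a new list of rows with a 'state' column added at the end.'''
    header = rows[0]
    new_rows = [header + ['state']]
    for row in rows[1:]:
        county_state = row[2]                        # e.g. 'Allegany County, Maryland'
        state = county_state.split(',')[1].strip()   # 'Maryland', without the space
        new_rows.append(row + [state])
    return new_rows

sample = [
    ['GEO.id', 'GEO.id2', 'GEO.display-label', 'D001'],
    ['0500000US24001', '24001', 'Allegany County, Maryland', '75087'],
]

for row in add_state_column(sample):
    print(row)
```

Because the function builds new rows with `+` rather than calling `append()` on the old ones, the original list is left untouched.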

## Writing to a csv

Now let's take the smaller subset of data we sliced off and named csv_things and write it to another csv.

In [51]:
with open(work_dir+"/new_pops.csv", "w", newline="") as outfile: # newline="" avoids blank lines between rows on Windows
    writer = csv.writer(outfile, quotechar='"')
    for csv_row in csv_things:
        writer.writerow(csv_row)
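
To check that a file was written the way we expect, we can read it straight back. The sketch below writes to a temporary folder rather than `work_dir` so that it runs on its own, and it uses `writerows()`, which writes a whole list of rows in one call. The file name `check_pops.csv` is made up for the example.

```python
import csv
import os
import tempfile

csv_things = [
    ['GEO.id', 'GEO.id2', 'GEO.display-label', 'D001'],
    ['0500000US24001', '24001', 'Allegany County, Maryland', '75087'],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'check_pops.csv')  # hypothetical file name

    with open(path, 'w', newline='') as outfile:
        writer = csv.writer(outfile, quotechar='"')
        writer.writerows(csv_things)   # same result as looping with writerow()

    with open(path, newline='') as infile:
        read_back = list(csv.reader(infile))

print(read_back == csv_things)
```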

Alright, now you've mastered reading and writing csv files. 