# Introduction to Python and Basic Web Scraping

## Basic maneuvering: Import statements

To use a Python library in a script, we first have to import it. You do that with what are known as import statements. Below we'll import the csv module, which lets us read and write csvs.

In [2]:
import csv

Now we'll import the BeautifulSoup class from the bs4 package, which parses HTML.

In [3]:
from bs4 import BeautifulSoup

And we'll import the requests module

In [4]:
import requests
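
The cells above use two forms of import statement. As a small illustration using only the standard library (the alias `reader_by_name` is made up for this sketch), `import module` keeps names behind the module prefix, while `from module import name` brings a single name in directly:

```python
import csv                                # access things as csv.<name>
from csv import reader as reader_by_name  # pull one name in directly

# Both names point at the same function object.
print(csv.reader is reader_by_name)
```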

## Introduction to lists

Lists are one of Python's built-in collection types. They are ordered and can be changed. I like to think of them as containers. We will be using a list later on to store the individual records we scrape from the FBI's FOIA Reading Room website.

In Python, text is stored in a data type called a string. In Python 3, a string is a sequence of Unicode characters, but for our purposes just think of strings as a way to store words and letters.
Numbers are stored in three data types: int, float and complex.

We generally use integers (whole numbers) and floats (decimals).
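
As a quick sketch of the data types just described (the variable names here are invented for illustration), `type()` reports which one a value belongs to:

```python
# One example value for each data type mentioned above.
word = 'hello'        # str: text
whole = 7             # int: whole number
decimal = 7.5         # float: number with a decimal point
imaginary = 2 + 3j    # complex: rarely needed for scraping work

for value in (word, whole, decimal, imaginary):
    print(value, type(value).__name__)
```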

Here we have a list with two string, or text, objects:

In [5]:
list_of_strings = ['hello', 'world']

Here we have a list of five Python integers.

In [6]:
list_of_nums = [1, 3, 5, 7, 11]

Python can tell you what type of object you're working with. The call below tells us this is a list.

In [7]:
type(list_of_strings)

list

In [8]:
type(list_of_nums)

list

It can also tell you how many objects are in a list.

In [9]:
len(list_of_strings)

2

In [10]:
len(list_of_nums)

5

It can also tell you the type of the object at a certain position in your list. Positions are counted from zero, so this tells us the type of the first item in our list of strings, which is 'hello'.

In [11]:
type(list_of_strings[0])

str

This tells us the type of the second item in our list of numbers, which is 3.

In [12]:
type(list_of_nums[1])

int
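
To make "ordered and can be changed" concrete, here is a minimal sketch. The list name `records` and its contents are placeholders, not data from the scraper:

```python
# A hypothetical list standing in for scraped records.
records = ['first', 'second']

records.append('third')   # add an item to the end
records[0] = 'FIRST'      # replace the item at position 0

print(records)       # ['FIRST', 'second', 'third']
print(len(records))  # 3
```

The items stay in the order they were added, which is why a list works well for collecting rows one at a time.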

## Printing

Sometimes we want Python to display things in the console, such as the contents of our lists or a specific item in a list. We use print() to do this. Note: In Python 2.7, print is a statement rather than a function, so the parentheses aren't required there.

Let's print the items in our list of strings.

In [13]:
print(list_of_strings)

['hello', 'world']


Now let's print the objects in our list of numbers.

In [14]:
print(list_of_nums)

[1, 3, 5, 7, 11]


And let's print just an item from each. In this case, the second object in our list of strings and the fifth item in our list of numbers.

In [16]:
print(list_of_strings[1])

world


In [17]:
print(list_of_nums[4])

11
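
Because positions start at zero, the fifth item lives at index 4. As a small aside, Python also accepts negative indexes that count from the end of the list, which saves you from knowing the length:

```python
list_of_strings = ['hello', 'world']
list_of_nums = [1, 3, 5, 7, 11]

print(list_of_nums[-1])                     # last item: 11
print(list_of_nums[4] == list_of_nums[-1])  # True
print(list_of_strings[-2])                  # second from the end: hello
```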


## Looping and printing

In Python, as in most programming languages, we use loops. A loop is a programming construct that allows us to repeat a series of commands and apply them to every object in a grouping, or in this case, a list.

In [18]:
for i in list_of_strings: # Read: For each item in my list do the following
    print(i) # Print, or display, the item

hello
world


Now try the same for our list of numbers.

In [19]:
for n in list_of_nums: 
    print(n)

1
3
5
7
11
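
If you also want to know where each item sits in the list while looping, one option is the built-in `enumerate()`. This is an optional sketch; the variable name `position` is just a choice made here:

```python
list_of_nums = [1, 3, 5, 7, 11]

# enumerate() hands back (position, item) pairs, counting from zero.
for position, n in enumerate(list_of_nums):
    print(position, n)
```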


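## Printing in specific ways

The string method .join() allows us to print in specific ways. It glues the items of a list of strings together, placing the string it is called on between each pair of items. For example, let's print each item of our list of strings on a new line.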

In [21]:
print('\n'.join(list_of_strings))

hello
world


Now we're ready to start working with csvs.

## Reading and writing csvs

Remember our import statements. Let's import the built-in csv module.

In [28]:
import csv

When working with csvs in Python, we first need to read in the data from the csv. You do this by using csv.reader. For this exercise, we'll be using the county_pops.csv we used last week, which includes population data for Maryland counties from the U.S. Census Bureau.

First, for ease of use in this exercise, we'll import a module called ```os``` and create a variable called work_dir that holds our current working directory. Having the working directory path in a variable is easier for now than typing it in every time.

In [29]:
import os

In [30]:
work_dir = os.getcwd()

In [None]:
print(work_dir)
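
As an optional alternative to gluing the path together with `+` and a slash, `os.path.join()` builds the path with the right separator for your operating system. The variable name `csv_path` is an assumption made for this sketch:

```python
import os

work_dir = os.getcwd()

# Let os.path.join pick the correct separator for this operating system.
csv_path = os.path.join(work_dir, 'county_pops.csv')
print(csv_path)
```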

Now let's use work_dir to read in our county_pops.csv, as shown below.

In [34]:
csvreader = csv.reader(open(work_dir+'/county_pops.csv'), delimiter=',', quotechar='"')

Now we have a Python object named csvreader that will hand us the rows of our county_pops.csv one at a time. Next, let's take a look at what's inside. We'll use a loop like we discussed before to do this.

In [36]:
for row in csvreader:
    print(', '.join(row))

GEO.id, GEO.id2, GEO.display-label, D001
0500000US24001, 24001, Allegany County, Maryland, 75087
0500000US24003, 24003, Anne Arundel County, Maryland, 537656
0500000US24005, 24005, Baltimore County, Maryland, 805029
0500000US24009, 24009, Calvert County, Maryland, 88737
0500000US24011, 24011, Caroline County, Maryland, 33066
0500000US24013, 24013, Carroll County, Maryland, 167134
0500000US24015, 24015, Cecil County, Maryland, 101108
0500000US24017, 24017, Charles County, Maryland, 146551
0500000US24019, 24019, Dorchester County, Maryland, 32618
0500000US24021, 24021, Frederick County, Maryland, 233385
0500000US24023, 24023, Garrett County, Maryland, 30097
0500000US24025, 24025, Harford County, Maryland, 244826
0500000US24027, 24027, Howard County, Maryland, 287085
0500000US24029, 24029, Kent County, Maryland, 20197
0500000US24031, 24031, Montgomery County, Maryland, 971777
0500000US24033, 24033, Prince George's County, Maryland, 863420
0500000US24035, 24035, Queen Anne's County, Maryla
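
Two details about the reader are worth seeing in a small, self-contained sketch. It writes a tiny stand-in file (the name `sample_pops.csv` and the temp directory are assumptions, so it runs without the real county_pops.csv), opens it in a `with` block so the file is closed automatically, and then loops over the reader twice:

```python
import csv
import os
import tempfile

# Write a tiny stand-in file so this sketch runs on its own.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_pops.csv')
with open(sample_path, 'w', newline='') as f:
    f.write('GEO.id2,GEO.display-label,D001\n')
    f.write('24001,"Allegany County, Maryland",75087\n')

# The with block closes the file for us when we're done.
with open(sample_path, newline='') as infile:
    reader = csv.reader(infile, delimiter=',', quotechar='"')
    first_pass = [row for row in reader]
    second_pass = [row for row in reader]  # the reader is already used up

print(len(first_pass))   # 2
print(len(second_pass))  # 0
print(first_pass[1])     # ['24001', 'Allegany County, Maryland', '75087']
```

Note that the quoted county name stays together as a single field even though it contains a comma.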

Now let's make that data into a list. A csv reader can only be looped over once, so we create a fresh one before converting it.

In [42]:
csvreader = csv.reader(open(work_dir+'/county_pops.csv'), delimiter=',', quotechar='"')
things = list(csvreader)

print(things)

[['GEO.id', 'GEO.id2', 'GEO.display-label', 'D001'], ['0500000US24001', '24001', 'Allegany County, Maryland', '75087'], ['0500000US24003', '24003', 'Anne Arundel County, Maryland', '537656'], ['0500000US24005', '24005', 'Baltimore County, Maryland', '805029'], ['0500000US24009', '24009', 'Calvert County, Maryland', '88737'], ['0500000US24011', '24011', 'Caroline County, Maryland', '33066'], ['0500000US24013', '24013', 'Carroll County, Maryland', '167134'], ['0500000US24015', '24015', 'Cecil County, Maryland', '101108'], ['0500000US24017', '24017', 'Charles County, Maryland', '146551'], ['0500000US24019', '24019', 'Dorchester County, Maryland', '32618'], ['0500000US24021', '24021', 'Frederick County, Maryland', '233385'], ['0500000US24023', '24023', 'Garrett County, Maryland', '30097'], ['0500000US24025', '24025', 'Harford County, Maryland', '244826'], ['0500000US24027', '24027', 'Howard County, Maryland', '287085'], ['0500000US24029', '24029', 'Kent County, Maryland', '20197'], ['05000

And let's slice off just the first five rows: the header and the first four counties.

In [44]:
csv_things = things[0:5]

print(csv_things)

[['GEO.id', 'GEO.id2', 'GEO.display-label', 'D001'], ['0500000US24001', '24001', 'Allegany County, Maryland', '75087'], ['0500000US24003', '24003', 'Anne Arundel County, Maryland', '537656'], ['0500000US24005', '24005', 'Baltimore County, Maryland', '805029'], ['0500000US24009', '24009', 'Calvert County, Maryland', '88737']]
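
The same slicing idea can separate the header row from the data rows. This is only a sketch using the first few rows shown above; the names `header`, `rows` and `populations` are made up here. Notice that csv.reader gives back every value as a string, so the population column needs `int()` before you can do math with it:

```python
# The first three rows from the output above, typed in by hand.
things = [
    ['GEO.id', 'GEO.id2', 'GEO.display-label', 'D001'],
    ['0500000US24001', '24001', 'Allegany County, Maryland', '75087'],
    ['0500000US24003', '24003', 'Anne Arundel County, Maryland', '537656'],
]

header = things[0]   # the column names
rows = things[1:]    # everything after the header

# Column 3 (D001) holds the population as text, so convert it to int.
populations = [int(row[3]) for row in rows]

print(header[3])         # D001
print(populations)       # [75087, 537656]
print(sum(populations))  # 612743
```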


## Writing to a csv

Now let's take the smaller subset of data we've sliced off and named csv_things and write that to another csv.

In [51]:
# newline="" prevents blank lines between rows on Windows; in Python2.7 use mode "wb" and leave out newline
with open(work_dir+"/new_pops.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile, quotechar='"')
    for csv_row in csv_things:
        writer.writerow(csv_row)
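
A simple way to check that a write worked is to read the file straight back in. The sketch below does that with two rows from above and a temporary directory (an assumption, so it doesn't touch your working directory). It also uses `writer.writerows()`, which writes a whole list of rows in one call:

```python
import csv
import os
import tempfile

csv_things = [
    ['GEO.id', 'GEO.id2', 'GEO.display-label', 'D001'],
    ['0500000US24001', '24001', 'Allegany County, Maryland', '75087'],
]

# Write to a throwaway location for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), 'new_pops.csv')
with open(out_path, 'w', newline='') as outfile:
    writer = csv.writer(outfile, quotechar='"')
    writer.writerows(csv_things)  # same effect as looping with writerow()

# Read the file back and compare it with what we wrote.
with open(out_path, newline='') as infile:
    round_trip = list(csv.reader(infile))

print(round_trip == csv_things)  # True
```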

Alright, now that you've mastered reading and writing csv files, let's turn to our scraper.