# Intro to Python

## Importing Modules

In [1]:
import csv
import re
import requests
from lxml import html

When you import modules, you generally won't see any messages unless Python fails to import the module, e.g.:

In [2]:
import certifi

ImportError: No module named 'certifi'

I get an error message because I never installed the `certifi` module. To rectify that, I would type `pip install certifi` in a terminal.
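If a script should keep running when an optional module is missing, the import can be wrapped in `try`/`except`. This is only a minimal sketch: the module name `not_a_real_module_xyz` is a made-up placeholder, chosen on the assumption that nothing by that name is installed, so the import fails.

```python
try:
    # Hypothetical module name, assumed not to exist on this machine.
    import not_a_real_module_xyz
    module_available = True
except ImportError as err:
    # ImportError (or its subclass ModuleNotFoundError) is raised
    # when Python cannot find the module.
    module_available = False
    print('Import failed:', err)

print(module_available)
```

With this pattern the rest of the script can check `module_available` and skip any step that needs the missing module.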

## Assigning Variables

Decide on a name for your variable, e.g. `myInt`, and assign it a value (in this case 2). Note that variables can be of many types, including integers, strings, lists, etc. You can name them almost whatever you want, but try to pick an informative name that does not conflict with Python's built-in names. For example, it would be unwise to call your variable `int`, since that name
1. is not informative
2. is the same as Python's built-in `int()` function
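To see why reusing a built-in name causes trouble, here is a small sketch. The function name `shadowing_demo` is just an illustrative choice; the reassignment is kept inside a function so that the built-in `int()` stays usable everywhere else.

```python
def shadowing_demo():
    int = 2                  # this local name now hides the built-in int()
    try:
        return int('5')      # fails: here the name int refers to the number 2
    except TypeError as err:
        return 'TypeError: ' + str(err)

result = shadowing_demo()
print(result)

# Outside the function the built-in is untouched and still converts strings.
converted = int('5')
print(converted)
```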

In [3]:
myInt = 2
myString = 'Hello! How are you?'
myList = [myInt, myString, [10-3, 12*4]]

print(myInt)
print(myString)
print(myList)

2
Hello! How are you?
[2, 'Hello! How are you?', [7, 48]]


Note that you can use pre-existing variables to define a new variable (see `myList`). You can also have the outcome of a calculation (like $12\times4$) be assigned to a variable. Finally, elements of lists can be other lists.

In [4]:
print(myList[0])
print(myList[2])
print(myList[2][1])

2
[7, 48]
48


Elements of lists can be accessed by their index number. For example, to get the first element of `myList`, I call `myList[0]`. Note that unlike R, Python begins its indices at 0 rather than 1. You can "nest" these indices. For example, `myList[2][1]` gets the 2nd element of the 3rd element of `myList`.
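As a quick check of the zero-based indexing described above, the sketch below rebuilds `myList` with literal values and also shows what happens when an index is one past the end. The helper names `first`, `nested`, `length` and `out_of_range` are only for illustration.

```python
myList = [2, 'Hello! How are you?', [7, 48]]

first = myList[0]        # index 0 is the first element
nested = myList[2][1]    # second element of the third element
length = len(myList)     # 3 elements, so the valid indices are 0, 1 and 2

try:
    myList[3]            # one past the last valid index
    out_of_range = False
except IndexError as err:
    out_of_range = True
    print('IndexError:', err)

print(first, nested, length, out_of_range)
```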

## Getting Help for Functions

If you don't know how to use a function or what exactly it's doing, you can try calling the `help()` function:

In [5]:
help(print)

Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.
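The keyword arguments listed in that help text can be tried out directly. The sketch below uses `sep`, `end` and `file` together; writing to an `io.StringIO` buffer instead of the screen is my own choice here, made only so the result can be inspected as a string.

```python
import io

buffer = io.StringIO()   # an in-memory, file-like object
# sep goes between the values, end is appended after the last one,
# and file redirects the output away from sys.stdout.
print('a', 'b', 'c', sep='-', end='!\n', file=buffer)

captured = buffer.getvalue()
print(repr(captured))
```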



## String Concatenation

You can combine strings in Python simply by adding them together with `+`.

In [3]:
a = 'This '
b = 'is the end of the sentence'
print(a+b)

This is the end of the sentence


In [6]:
picDir = '/publish/thunews/9659/20161125111440926399642/20161125111954532883008.jpg'
preURL = 'http://news.tsinghua.edu.cn'
picFullURL = preURL + picDir
print(picFullURL)

http://news.tsinghua.edu.cn/publish/thunews/9659/20161125111440926399642/20161125111954532883008.jpg


Keep track of the types of your variables; otherwise, the same operator can produce unintended results. Here, `+` adds integers but concatenates strings.

In [7]:
print(2+3)
print('2' + '3')

5
23
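Mixing the two types in one `+` expression raises an error rather than guessing what you meant, so one of the values has to be converted explicitly. A minimal sketch, with the variable name `countText` made up for illustration:

```python
countText = '3'

try:
    total = 2 + countText          # int + str is not defined
except TypeError as err:
    total = None
    print('TypeError:', err)

asNumber = 2 + int(countText)      # convert to int first, then add
asText = str(2) + countText        # convert to str first, then concatenate
print(asNumber)
print(asText)
```

Which conversion is right depends on whether you want arithmetic or text.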


## For Loops and List Comprehensions

If you want to iterate over something and create a new list, you can use a for loop or a list comprehension.

In [6]:
urlPaths = ['/placeholder1', '/placeholder2']
urls = []                       # create empty list
for whatever in urlPaths:
    print('Whatever is equal to', whatever)
    urls.append('http://news.tsinghua.edu.cn'+whatever) # string concatenation
    print(urls)
print(urls)

Whatever is equal to /placeholder1
['http://news.tsinghua.edu.cn/placeholder1']
Whatever is equal to /placeholder2
['http://news.tsinghua.edu.cn/placeholder1', 'http://news.tsinghua.edu.cn/placeholder2']
['http://news.tsinghua.edu.cn/placeholder1', 'http://news.tsinghua.edu.cn/placeholder2']


In [8]:
urls2 = ['http://news.tsinghua.edu.cn'+url for url in urlPaths]
print(urls2)

['http://news.tsinghua.edu.cn/placeholder1', 'http://news.tsinghua.edu.cn/placeholder2']


In [9]:
urls==urls2

True

As can be seen, the two methods produce the same results.

## Regular Expressions

Regular expressions (regex or regexp) let you search strings for text that matches a pattern, and can be used in Python via the `re` module.

In [4]:
testString = 'He said "hello" and his friend replied "how are you?" afterwards'

Suppose you wanted to get everything within quotes, which would be "hello" and "how are you?" in this case. Let's try the following regex:

In [5]:
re.search('"(.*)"', testString)

<_sre.SRE_Match object; span=(8, 53), match='"hello" and his friend replied "how are you?"'>

As can be seen here, the regex matched the entire substring "hello" and his friend replied "how are you?". The span gives the index where the match starts (8, i.e. the 9th character, since indices start at 0) and the index where it ends (53). The end index is exclusive, so the last matched character is at index 52. First, let's break down the regex, which in this case is `"(.*)"`.
- The first quotation mark " says to find a quotation mark. The first quotation mark is the one before the word 'hello' and is found by the regex.
- The parentheses ( ) form a capturing group around the part of the match we want to extract; they do not match any characters themselves.
- The dot . says to match any non-line break character.
- The asterisk \* says to *greedily* match the preceding token (the dot in this case) 0 or more times.
- The second quotation mark " says to end with a quotation mark.

What's happening is that the regex is behaving *greedily*: it tries to match as much as possible. Since the dot matches nearly every possible character (and certainly every character in `testString`), `.*` first runs all the way to the end of `testString` (the 's' in 'afterwards') and then backtracks until the closing quotation mark can match, which happens at the quotation mark after 'how are you?'. It thus returns *everything* between the very first quotation mark in `testString` and the very last one.
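The match object returned by `re.search` can be inspected to confirm this. The sketch below repeats the greedy search and pulls out the span, the full match and the text captured by the parentheses; the names `match`, `whole` and `inner` are just illustrative.

```python
import re

testString = 'He said "hello" and his friend replied "how are you?" afterwards'
match = re.search('"(.*)"', testString)

start, end = match.span()    # start index and exclusive end index
whole = match.group(0)       # the full matched text, quotation marks included
inner = match.group(1)       # only what the parentheses captured

print(start, end)
print(whole)
print(inner)
print(testString[start:end] == whole)   # slicing with the span recovers the match
```

Slicing uses the same convention as the span: the start index is included and the end index is not.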

What if we did lazy matching instead?

In [6]:
re.search('"(.*?)"', testString)

<_sre.SRE_Match object; span=(8, 15), match='"hello"'>

Now, the \*? says to match the preceding token (the dot, which matches any non-line break character) *lazily*. Instead of trying to match as many characters as possible, it will now try to match as few as possible. Therefore, it again finds the first quotation mark before 'hello', but now it ends on the quotation mark after 'hello', since this minimally satisfies the expression.

If all we wanted was just the first set of quotation marks, this result may be acceptable. However, we wanted both "hello" and "how are you?". 

In [7]:
re.findall('"(.*?)"', testString)

['hello', 'how are you?']

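In this case, the solution is simple. Rather than using `re.search`, which finds only the first match, use `re.findall`, which finds all non-overlapping matches in the given string. Because the pattern contains one capturing group, `re.findall` returns only the text captured by the parentheses, which is why the quotation marks are missing from the output.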
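A few variations may help make the behaviour of `re.findall` concrete. This is a sketch rather than the only way to do it: the pattern without parentheses, the negated character class `[^"]*` and the use of `re.finditer` are my additions, not something the examples above rely on.

```python
import re

testString = 'He said "hello" and his friend replied "how are you?" afterwards'

# With a capturing group, findall returns only the captured text.
withGroup = re.findall('"(.*?)"', testString)

# Without a group, findall returns the whole match, quotation marks included.
withoutGroup = re.findall('".*?"', testString)

# An alternative to lazy matching: match anything except a quotation mark.
negatedClass = re.findall('"([^"]*)"', testString)

# finditer yields match objects, so the span of every match is available.
spans = [m.span() for m in re.finditer('"(.*?)"', testString)]

print(withGroup)
print(withoutGroup)
print(negatedClass)
print(spans)
```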

This is only a very, very rudimentary introduction to regular expressions, and there's much more to them than what's presented here. They are very flexible, and you are encouraged to read more about them, especially if you work with a lot of text data.