# Class 2 Python Advanced Topics

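# 2.1 List/Dictionary Comprehension

## 2.1.1 List comprehension

A list comprehension builds a new list from an existing list (or, more generally, from any iterable). During the conversion, each element can be transformed as needed, and an optional condition decides which elements are added to the new list.

It can be thought of as a condensed form of a for loop.

Basic format:

[expression0 for var in iterable if expression2]

Every list comprehension can also be written as a standard for/if loop. The main benefit of the comprehension is shorter code that hides the loop bookkeeping.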
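As a minimal sketch of that equivalence (the names `numbers`, `loop_result`, and `comp_result` are illustrative and not part of the course material), the same result can be built both ways:

```python
numbers = [1, 2, 3, 4, 5, 6]

# Explicit for/if version
loop_result = []
for x in numbers:
    if x % 2 == 0:                  # the condition (expression2)
        loop_result.append(x * 10)  # the transformation (expression0)

# Equivalent list comprehension
comp_result = [x * 10 for x in numbers if x % 2 == 0]

print(loop_result)
print(comp_result)
```

Both lists hold the even numbers multiplied by ten; the comprehension simply folds the loop, the test, and the append into one expression.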

### Example 1. Add "Hello, " before each string
Input: ['Python2.7', 'Cpython', 'Python3.4', 'Perl5.0', 'Lua', 'Python3.6', 'Powershell']

#### Plain for loop

In [12]:
inputstrings = ['Python2.7', 'Cpython', 'Python3.4', 'Perl5.0', 'Lua', 'Python3.6', 'Powershell']
outputstrings = []
for item in inputstrings:
    moditem = "Hello, " + item
    outputstrings.append(moditem)
outputstrings

['Hello, Python2.7',
 'Hello, Cpython',
 'Hello, Python3.4',
 'Hello, Perl5.0',
 'Hello, Lua',
 'Hello, Python3.6',
 'Hello, Powershell']

#### List comprehension

In [13]:
inputstrings = ['Python2.7', 'Cpython', 'Python3.4', 'Perl5.0', 'Lua', 'Python3.6', 'Powershell']
outputstrings = ["Hello, " + item for item in inputstrings]
outputstrings

['Hello, Python2.7',
 'Hello, Cpython',
 'Hello, Python3.4',
 'Hello, Perl5.0',
 'Hello, Lua',
 'Hello, Python3.6',
 'Hello, Powershell']

#### Adding an if to the list comprehension
Keep only the Python-related entries and add "Hello, " before each string

In [14]:
inputstrings = ['Python2.7', 'Cpython', 'Python3.4', 'Perl5.0', 'Lua', 'Python3.6', 'Powershell']
outputstrings = ["Hello, " + item for item in inputstrings if 'python' in item.lower()]
outputstrings

['Hello, Python2.7', 'Hello, Cpython', 'Hello, Python3.4', 'Hello, Python3.6']

### Example 2. Find all numbers from 0 to 99 that are divisible by 7 and compute the sum of their squares

#### for and if

In [2]:
n = 100
sqs = []
for i in range(n):
    if i % 7 == 0:
        sqs.append(i ** 2)
result = sum(sqs)
result

49735

#### List comprehension

In [6]:
n = 100
result = sum([i ** 2 for i in range(n) if i % 7 == 0])  # for and if inline; the logic stays reasonably clear
result

49735

In [16]:
def nseven1(n):
    sqs = []
    for i in range(n):
        if i % 7 == 0:
            sqs.append(i ** 2)
    result = sum(sqs)
    return result

def nseven2(n):
    result = sum([i ** 2 for i in range(n) if i % 7 == 0])  # for and if inline; the logic stays reasonably clear
    return result

#### Speed comparison
The list comprehension is somewhat faster, but the difference is not a qualitative one.

In [21]:
%timeit nseven1(10000)

1000 loops, best of 3: 654 µs per loop


In [22]:
%timeit nseven2(10000)

1000 loops, best of 3: 578 µs per loop
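`%timeit` is an IPython magic, so it is not available in a plain script. A rough equivalent can be sketched with the standard-library `timeit` module; the function names `nseven_loop` and `nseven_comp` and the repetition count of 200 are choices made for this illustration, and the measured times will differ from machine to machine:

```python
import timeit

def nseven_loop(n):
    sqs = []
    for i in range(n):
        if i % 7 == 0:
            sqs.append(i ** 2)
    return sum(sqs)

def nseven_comp(n):
    return sum([i ** 2 for i in range(n) if i % 7 == 0])

# Both versions must agree before timing them is meaningful.
assert nseven_loop(10000) == nseven_comp(10000)

# timeit.timeit calls the function `number` times and returns the total seconds.
loop_seconds = timeit.timeit(lambda: nseven_loop(10000), number=200)
comp_seconds = timeit.timeit(lambda: nseven_comp(10000), number=200)
print('loop: %.4f s, comprehension: %.4f s' % (loop_seconds, comp_seconds))
```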


### Example 3. Flatten a nested list
Input: [[1,2], [3,4], [6,7], [10,101]]

Output: [1,2,3,4,6,7,10,101]

In [8]:
nestedlist =  [[1,2], [3,4], [6,7], [10,101]]
unnestedlist = [item for sublist in nestedlist for item in sublist]
unnestedlist

[1, 2, 3, 4, 6, 7, 10, 101]
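The order of the two `for` clauses is easy to get wrong. A minimal sketch of the equivalent nested loops (the names `unnested_loop` and `unnested_comp` are illustrative) shows that the clauses read left to right in the same order as the loops nest:

```python
nestedlist = [[1, 2], [3, 4], [6, 7], [10, 101]]

# The outer loop comes first, the inner loop second,
# exactly as in the comprehension below.
unnested_loop = []
for sublist in nestedlist:
    for item in sublist:
        unnested_loop.append(item)

unnested_comp = [item for sublist in nestedlist for item in sublist]
print(unnested_loop == unnested_comp)
```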

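## 2.1.2 Dictionary comprehension

### A dictionary comprehension works much like a list comprehension, but the expression must specify both the key and the value

{expression0: expression1 for var in iterable if expression2}

### Example 1. Convert all keys of a dictionary to uppercase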
continentdict = {'China': 'AS', 'Korea': 'AS', 'Canada': 'NA', 'France': 'EU', 'BRAZIL': 'SA', 'Russia': 'EU'}

In [23]:
continentdict = {'China': 'AS', 'Korea': 'AS', 'Canada': 'NA', 'France': 'EU', 'BRAZIL': 'SA', 'Russia': 'EU'}

In [24]:
continentdictallcap = {key.upper():continentdict[key] for key in continentdict}
continentdictallcap

{'BRAZIL': 'SA',
 'CANADA': 'NA',
 'CHINA': 'AS',
 'FRANCE': 'EU',
 'KOREA': 'AS',
 'RUSSIA': 'EU'}
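The example above does not use the optional `if` part. As a small sketch under the same data (the name `asia_allcap` and the choice to filter on 'AS' are illustrative), a condition can be added, and iterating over `.items()` avoids the second lookup `continentdict[key]`:

```python
continentdict = {'China': 'AS', 'Korea': 'AS', 'Canada': 'NA',
                 'France': 'EU', 'BRAZIL': 'SA', 'Russia': 'EU'}

# Keep only the Asian countries and upper-case their keys.
# .items() yields (key, value) pairs.
asia_allcap = {country.upper(): continent
               for country, continent in continentdict.items()
               if continent == 'AS'}
print(asia_allcap)
```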

# 2.2. File IO

Files are stored on disk, and Python provides functions for reading and writing them.
#### Read-only (the default)
f = open(filepath, 'r')
f = open(filepath)
#### write only
f = open(filepath, 'w')
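A short sketch of how these modes behave, run in a temporary directory so no real file is touched. The file name `modes_demo.txt` is hypothetical, and the append mode `'a'` is an addition not listed above:

```python
import os
import tempfile

# Work in a throwaway directory.
tmpdir = tempfile.mkdtemp()
filepath = os.path.join(tmpdir, 'modes_demo.txt')  # hypothetical file name

f = open(filepath, 'w')   # 'w' creates the file, or truncates an existing one
f.write('first line\n')
f.close()

f = open(filepath, 'a')   # 'a' appends to the end instead of truncating
f.write('second line\n')
f.close()

f = open(filepath)        # no mode given: same as 'r'
content = f.read()
f.close()
print(content)
```

Opening the same path with `'w'` a second time would have discarded the first line, which is why `'a'` is used for the second write.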

## 2.2.1 Reading files

In [48]:
file1 = open('fileio1.txt', 'r') # this file is in the current directory; otherwise, use the full path
# file1 = open(r'C:\scriptwb\python course\2\fileio1.txt')
print 'Object file1: ', file1, '\n' # file handler object

print file1.read()
file1.close()

Object file1:  <open file 'fileio1.txt', mode 'r' at 0x0000000003F251E0> 

You're getting ready to start a new company. What language should you choose to build it?
Or to phrase the same question a different way: You are looking for a job, which language should you learn?
You might guess from the title of this post that I think the right answer is Python. But why?
The answer is that Python is powerful. But what does that mean, exactly? What makes for power in a programming language?


In [55]:
file1 = open('fileio1.txt', 'r') # this file is in the current directory; otherwise, use the full path
for (linenumber, linecontent) in enumerate(file1):
    print '[%d] %s' %(linenumber, linecontent.strip())
file1.close()

[0] You're getting ready to start a new company. What language should you choose to build it?
[1] Or to phrase the same question a different way: You are looking for a job, which language should you learn?
[2] You might guess from the title of this post that I think the right answer is Python. But why?
[3] The answer is that Python is powerful. But what does that mean, exactly? What makes for power in a programming language?


#### Always close the file once you have finished reading or writing
#### Prefer the with statement, which closes the file automatically

In [54]:
with open('fileio1.txt', 'r') as file1:
    for (linenumber, linecontent) in enumerate(file1):
        print '[%d] %s' %(linenumber, linecontent.strip())

[0] You're getting ready to start a new company. What language should you choose to build it?
[1] Or to phrase the same question a different way: You are looking for a job, which language should you learn?
[2] You might guess from the title of this post that I think the right answer is Python. But why?
[3] The answer is that Python is powerful. But what does that mean, exactly? What makes for power in a programming language?
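To see what the `with` statement does with the file, the sketch below checks the file object's `closed` attribute after the block ends, including the case where an exception interrupts the block. The file name `with_demo.txt` and the simulated `ValueError` are assumptions made for the illustration:

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()
filepath = os.path.join(tmpdir, 'with_demo.txt')  # hypothetical file name

with open(filepath, 'w') as g:
    g.write('hello\n')
closed_after_write = g.closed   # True: leaving the with block closed the file

try:
    with open(filepath, 'r') as f:
        first_line = f.readline()
        raise ValueError('simulated failure while reading')
except ValueError:
    pass
closed_after_error = f.closed   # True as well, despite the exception

print(closed_after_write)
print(closed_after_error)
```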


## 2.2.2 Writing files

In [56]:
continentdict = {'China': 'AS', 'Korea': 'AS', 'Canada': 'NA', 'France': 'EU', 'BRAZIL': 'SA', 'Russia': 'EU'}

Suppose we need to write this dictionary to a file:

filehandler.write(string)

In [62]:
with open('continentdict.txt', 'w') as g:
    g.write(str(continentdict))
# read the file back and print it
with open('continentdict.txt', 'r') as f:
    print f.read()

{'Canada': 'NA', 'BRAZIL': 'SA', 'Korea': 'AS', 'France': 'EU', 'China': 'AS', 'Russia': 'EU'}


Suppose we want to write it as a CSV table with a header row:

Country,Continent

Canada,NA

Brazil,SA

...,...

In [63]:
with open('continentdict.csv', 'w') as g:
    g.write('Country,Continent\n')
    for country in continentdict:
        g.write('%s,%s\n'%(country, continentdict[country]))
# read the file back and print it
with open('continentdict.csv', 'r') as f:
    print f.read()        

Country,Continent
Canada,NA
BRAZIL,SA
Korea,AS
France,EU
China,AS
Russia,EU



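#### Doesn't look like a table? CSV is a general-purpose tabular storage format that virtually any data analysis software can read. It is just a plain text file, with commas separating columns and line breaks separating rows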
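Formatting each row by hand works for simple values, but it breaks as soon as a value itself contains a comma. The standard-library `csv` module handles that quoting. The following is a sketch written for Python 3 that uses an in-memory `io.StringIO` buffer in place of a real file and a shortened three-country dictionary, both of which are choices made for this illustration:

```python
import csv
import io

continentdict = {'China': 'AS', 'Korea': 'AS', 'Canada': 'NA'}

# Write to an in-memory text buffer instead of a real file (Python 3).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['Country', 'Continent'])
for country, continent in sorted(continentdict.items()):
    writer.writerow([country, continent])

csv_text = buffer.getvalue()
print(csv_text)

# Read it back: DictReader uses the header row as the keys of each record.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]['Country'])
print(rows[0]['Continent'])
```

When writing to a real file on Python 3, the `csv` documentation recommends opening it with `newline=''` so that line endings are not translated twice.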

We can try opening it in Excel.

We can also open it with the Python pandas package (one of the basic extension packages for data analysis).

In [66]:
import pandas
df = pandas.read_csv('continentdict.csv', keep_default_na=False)
df

  Country Continent
0  Canada        NA
1  BRAZIL        SA
2   Korea        AS
3  France        EU
4   China        AS
5  Russia        EU


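# 2.3. Interacting with the operating system (OS)

## 2.3.1 Files and folders

Only common needs are illustrated here; for further study, see
https://docs.python.org/2/library/os.html

In [67]:
# basic OS operations live in the os module, which is part of the standard library and needs no extra installation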
import os

### Directory operations

#### Current directory

In [96]:
cwd1 = os.getcwd()
print cwd1
# we store the current working directory in cwd1; later we change the working directory and use cwd1 to change back

C:\scriptwb\python course\2


In [97]:
os.listdir(cwd1)

['.ipynb_checkpoints',
 'Class2.ipynb',
 'continentdict.csv',
 'continentdict.txt',
 'fileio1.txt']

#### Get the parent directory

In [98]:
parentwd = os.path.dirname(cwd1)
parentwd

'C:\\scriptwb\\python course'

#### Go up one level, then come back

In [99]:
os.chdir(parentwd) # changed to upper level

In [100]:
os.getcwd()

'C:\\scriptwb\\python course'

In [101]:
os.chdir(cwd1) # changed back
os.getcwd()

'C:\\scriptwb\\python course\\2'
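If the code that runs between the two `os.chdir` calls fails, the process stays in the wrong directory. One common way to guard against that is a `try`/`finally` block. This is a sketch only: it visits a temporary directory created with `tempfile.mkdtemp()` rather than the parent directory, and the names `start_dir`, `other_dir`, and `visited` are illustrative:

```python
import os
import tempfile

start_dir = os.getcwd()
other_dir = tempfile.mkdtemp()   # stand-in for the directory to visit

os.chdir(other_dir)
try:
    visited = os.getcwd()        # work done while in the other directory
finally:
    os.chdir(start_dir)          # always restore, even if the work above fails

print(visited)
print(os.getcwd())
```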

### Querying and operating on files

#### File information

In [69]:
os.path.isfile('continentdict.csv')

True

In [70]:
os.path.isfile('continentdict.tsv')

False

In [72]:
filename = 'continentdict.csv'
print '%s is created at %s, last modified at %s, full path is %s' %(
    filename, os.path.getctime(filename), os.path.getmtime(filename), os.path.abspath(filename))

continentdict.csv is created at 1509947218.29, last modified at 1509947270.29, full path is C:\scriptwb\python course\2\continentdict.csv


The times are raw timestamps that a person cannot read directly. How do we handle this?

Let us google "os.path.getctime convert"

The first result is from Stack Overflow

https://stackoverflow.com/questions/19501711/how-can-i-convert-os-path-getctime

In [73]:
from datetime import datetime
datetime.fromtimestamp(1382189138.4196026).strftime('%Y-%m-%d %H:%M:%S')

'2013-10-19 06:25:38'

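Based on this answer, we can write a conversion function that turns the floating-point seconds a machine understands into a date and time a person can read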

In [78]:
def converttimestamp(timestamp):
    return datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S')

Now the times display properly

In [79]:
filename = 'continentdict.csv'
print '%s is created at %s, last modified at %s, full path is %s' %(
    filename, converttimestamp(os.path.getctime(filename)), 
    converttimestamp(os.path.getmtime(filename)), os.path.abspath(filename))

continentdict.csv is created at 2017-11-05 21:46:58, last modified at 2017-11-05 21:47:50, full path is C:\scriptwb\python course\2\continentdict.csv


#### Copying and deleting files

Google python copy file

We can find this page
https://docs.python.org/2/library/shutil.html

google python delete file

We can find this page
https://stackoverflow.com/questions/6996603/how-to-delete-a-file-or-folder 

In [109]:
import shutil
shutil.copyfile('continentdict.csv', 'continentdict_copy.csv')

In [111]:
file2remove = 'continentdict_copy.csv'
print os.path.isfile(file2remove)
os.remove(file2remove)
print os.path.isfile(file2remove)

True
False


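## 2.3.2 OS command line

### Passing arguments from the command line to run Python code

Command line basics

python code1.py arg1 arg2 arg3 ...

python - the operating system command

code1.py - argument 0 of the OS command: the script to run

arg1 - the first command-line argument passed to the script

arg2 - the second command-line argument passed to the script

...

#### Basic module: sys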

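sys.argv is a list obtained from the operating system: [arg0, arg1, arg2...]. arg0 is usually the name of the Python script, and arg1, arg2, ... are generally the arguments to pass to the script.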
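A small sketch of that layout follows. The helper `describe_argv` and the list `example_argv` are hypothetical; they mimic what `sys.argv` would hold for `python code1.py arg1 arg2` so the example can run anywhere:

```python
import sys

def describe_argv(argv):
    """Split an argv-style list into the script name and its arguments."""
    script_name = argv[0]
    script_args = argv[1:]
    return script_name, script_args

# What sys.argv would hold for: python code1.py arg1 arg2
example_argv = ['code1.py', 'arg1', 'arg2']
name, args = describe_argv(example_argv)
print(name)
print(args)

# The real list for the current process:
print(sys.argv)
```

Every element of `sys.argv` is a string, so numeric arguments have to be converted explicitly, for example with `float(...)`.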

#### Advanced module: argparse
https://docs.python.org/2.7/library/argparse.html
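As a rough sketch of what `argparse` adds over raw `sys.argv`, the parser below declares two positional numbers and converts them to `float` itself. The function name `build_parser` and the argument names `x` and `y` are choices made for this illustration:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description='Add two numbers.')
    parser.add_argument('x', type=float, help='first number')
    parser.add_argument('y', type=float, help='second number')
    return parser

# parse_args accepts an explicit list, which makes the parser easy to try
# without a real command line; called with no argument it reads sys.argv[1:].
args = build_parser().parse_args(['1', '2'])
total = args.x + args.y
print(total)  # 3.0
```

In a script, `argparse` also generates a `-h`/`--help` message and reports missing or malformed arguments, which a hand-written `sys.argv` parser would have to do itself.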

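#### sumxy from command line
Command line demo

Create a file named sumxy.py

Enter the content shown below (it is printed by `type sumxy.py`)

Open a command line and run: python sumxy.py 1 2

In Jupyter, a line starting with ! is run as a command line operation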

In [116]:
! ls
! type sumxy.py
! python sumxy.py 1 2

Class2.ipynb
continentdict.csv
continentdict.txt
fileio1.txt
sumxy.py
#sumxy.py 
import sys

def sumxy(x, y):
    return x + y

if __name__ == '__main__':
    args = sys.argv[1:]
    x = float(args[0])
    y = float(args[1])
    print sumxy(x, y)
3.0


### Running the command line from Python
Basic

os.system(cmd)

os.popen(cmd)

Advanced

subprocess module

https://docs.python.org/2/library/subprocess.html

In [122]:
import os
cmdoutput = os.popen('python sumxy.py 1 2').read()
print cmdoutput

3.0
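`os.popen` is convenient for quick one-liners, while the `subprocess` module gives more control over the child process. A minimal sketch follows; it runs `sys.executable -c "print(1 + 2)"` as a stand-in command so that it does not depend on sumxy.py being present:

```python
import subprocess
import sys

# sys.executable is the path of the running interpreter, so the child
# process uses the same Python; passing a list avoids shell quoting issues.
raw_output = subprocess.check_output([sys.executable, '-c', 'print(1 + 2)'])

# check_output returns bytes on Python 3, so decode before using it as text.
cmdoutput = raw_output.decode().strip()
print(cmdoutput)  # 3
```

`check_output` raises `subprocess.CalledProcessError` when the command exits with a non-zero status, so failures are not silently ignored.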



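#### Why run the command line from Python
Different parts of a project may be written by many people in different languages and need to be integrated. Python can serve as a glue language: the project's high-level management code is written in Python and calls code written in Python and other languages to carry out tasks.

<font size="5" color="red">The content below takes a fairly long time of study and practice to master. Even Python developers often do not need to be fully proficient in these advanced features.
Goal: understand the concepts and what they do, then be able to roughly read the most basic code.
</font>
<font size="5" color="Blue">
There are many features whose details you do not understand and whose code you cannot remember, but that does not stop you from using them</font>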

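# 2.4 Advanced strings

## 2.4.1 Regular expressions

#### Regular expressions are a general set of rules for searching and extracting text, and programming languages of all kinds have a regular expression module. Reaching a working level takes roughly 2-3 hours of study; mastery takes much longer practice.

From Chinese Wikipedia:

A regular expression (Chinese: 正则表达式, with several regional variants of the name; commonly abbreviated in code as regex, regexp, or RE) is a concept in computer science. A regular expression uses a single string to describe and match a series of strings that conform to a certain syntactic rule. In many text editors, regular expressions are typically used to search for and replace text that matches a pattern.

Cheat sheet: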
https://www.debuggex.com/cheatsheet/regex/python

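#### Header transformation
metrics holds the header of a data table, which we have put into a list here. Each column name appears to represent one group of measurement data: ID0, ID1, and so on give the group number, and the characters that remain once the group number is removed give the name of the measurement. We want to transform this header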

In [145]:
metrics = ['LENGTH_ID0_FINAL', 'LENGTH_ID1_FINAL', 'WIDTH_ID0', 'WIDTH_ID1', 'ID2_WEIGHT', 'ID3_WEIGHT']

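into two lists that serve as a two-row header

List 1: measurement names

['LENGTH_FINAL', 'LENGTH_FINAL', 'WIDTH', 'WIDTH', 'WEIGHT', 'WEIGHT']

List 2: group numbers

['ID0', 'ID1', 'ID0', 'ID1', 'ID2', 'ID3']

Because different measurements use slightly different formats, with the group number on the left in some, in the middle in others, and at the far right in others, a simple .split('_') does not work well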

In [125]:
import re

In [137]:
pattern = r'(.*)(ID\d)(.*)'
matchresult = re.match(pattern, 'WIDTH_ID1')

In [128]:
matchresult.groups()

('WIDTH_', 'ID1', '')

In [147]:
results = [re.match(pattern, metric).groups() for metric in metrics]
IDs = [result[1] for result in results]
measurements = [(result[0].strip('_') + '_' + result[2].strip('_')).strip('_') for result in results]
print 'measurement:',measurements
print 'ID:',IDs

measurement: ['LENGTH_FINAL', 'LENGTH_FINAL', 'WIDTH', 'WIDTH', 'WEIGHT', 'WEIGHT']
ID: ['ID0', 'ID1', 'ID0', 'ID1', 'ID2', 'ID3']
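`re.match` returns `None` when a string does not match, so the list comprehension above would raise an `AttributeError` on a column name that has no ID tag. One possible way to guard against that is sketched below; the helper `split_metric` and the test value `'HEIGHT'` are hypothetical additions for illustration:

```python
import re

pattern = re.compile(r'(.*)(ID\d)(.*)')

def split_metric(metric):
    """Return (measurement, group_id), or None if no ID tag is present."""
    match = pattern.match(metric)
    if match is None:
        return None
    left, group_id, right = match.groups()
    # Join the non-empty parts around the ID with a single underscore.
    parts = [part.strip('_') for part in (left, right)]
    measurement = '_'.join(part for part in parts if part)
    return measurement, group_id

print(split_metric('LENGTH_ID0_FINAL'))  # ('LENGTH_FINAL', 'ID0')
print(split_metric('ID2_WEIGHT'))        # ('WEIGHT', 'ID2')
print(split_metric('HEIGHT'))            # None
```

Because the first `(.*)` is greedy, a name containing two ID tags would be split at the last one; that case does not occur in the `metrics` list used here.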


#### How to decide whether an email address is valid

google search

regular expression match an email address

I found a page http://emailregex.com/

In [150]:
emailpattern = r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"
emails = ['john@hotmail.com', 'tom3@gmail.com', '32111198@qq.com', 't%wan@163.com', 'slayer@99cn']
validemails = [email for email in emails if re.match(emailpattern, email)]
validemails

['john@hotmail.com', 'tom3@gmail.com', '32111198@qq.com']
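To see why the last two addresses are rejected, the same pattern can be wrapped in a small predicate and applied to each case separately. This is a sketch; the function name `is_valid_email` is illustrative, and the pattern is a practical approximation rather than a complete check of what mail systems accept:

```python
import re

emailpattern = re.compile(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")

def is_valid_email(email):
    """True if the whole string matches the pattern."""
    return emailpattern.match(email) is not None

# Local part, '@', domain name, '.', and the rest of the domain all match.
print(is_valid_email('john@hotmail.com'))
# '%' is not in the character class allowed before the '@'.
print(is_valid_email('t%wan@163.com'))
# There is no '.' after the domain name, so the '\.' part cannot match.
print(is_valid_email('slayer@99cn'))
```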

#### How to validate a phone number
https://stackoverflow.com/questions/16699007/regular-expression-to-match-standard-10-digit-phone-number 
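A deliberately simple sketch in the same style as the email example follows. The pattern is an assumption made for this illustration, not the answer from the linked page: it accepts a North American style 10-digit number with optional parentheses around the area code and `-`, `.`, or a space as separators:

```python
import re

# Assumed format: 3-digit area code, 3 digits, 4 digits,
# with optional parentheses and optional '-', '.', or space separators.
phonepattern = re.compile(r'^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$')

phones = ['123-456-7890', '(123) 456-7890', '1234567890', '123-45-6789', 'phone']
validphones = [phone for phone in phones if phonepattern.match(phone)]
print(validphones)
```

The pattern does not check that parentheses are balanced, so a string such as `(123-456-7890` would also pass; a stricter rule needs a more elaborate pattern.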