# Load the required packages

In [1]:
import csv
import re
import time
import requests
from lxml import html

# Getting the Page into Python

Get the page into Python by first finding the page URL.

![Getting the URL](./images/findURLCropped.png)

In [3]:
thURL = 'http://news.tsinghua.edu.cn/publish/thunews/9648/index.html'
thPage = requests.get(thURL)

In [4]:
dir(thPage)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

The `requests` response object `thPage` has many attributes, but the one we're interested in is `content`, which holds the raw HTML source of the page we requested (as bytes). We use the `fromstring` function from `lxml`'s `html` module to parse the HTML.

In [5]:
thHTML = html.fromstring(thPage.content)
print(thHTML)

<Element html at 0x7f32ba083d18>


# Getting Article Titles

The page is now ready for traversal. Now we can use XPath expressions to find the content we're looking for. Suppose we're interested in getting the titles of all the articles on the page. To determine what XPath expressions to use, let's inspect the HTML source. 

![Inspecting HTML source](./images/inspectElement.png)
![Find element path](./images/getXPATH.png)

We notice that all the article titles have the `<a>` tag with a "class" attribute equal to "jiequ". Let's try getting all the `<a>` tags with "class" equal to "jiequ". We use the XPath expression `//a[@class="jiequ"]`.
- The `//` says to look for all nodes that satisfy the following conditions
    - The `a` says to look for nodes that have the `<a>` tag
        - The brackets `[]` provide the attribute conditions that the `<a>` tag will need to satisfy to be "caught" by the XPath expression
        - The `@` says what attribute to look for.
            - `@class` says to look for the "class" attribute
            - `@class="jiequ"` says to look for nodes where the "class" attribute is equal to "jiequ"
            
Altogether, the XPath expression `//a[@class="jiequ"]` says to find all `<a>` nodes whose "class" attribute is equal to "jiequ". Any node that does not satisfy this condition will not be returned. To use this XPath expression with `lxml`, we take our parsed HTML object `thHTML` and pass the expression to its `.xpath()` method as a string.
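
To see the expression at work without touching the network, here is a minimal sketch on a tiny hand-written page. The HTML and the `href` values are made up for illustration; only the XPath expression itself comes from the text above.

```python
from lxml import html

# A tiny stand-in page (hypothetical content, not the real site).
sample = html.fromstring("""
<html><body>
  <a class="jiequ" href="/a.html">First title</a>
  <a class="other" href="/b.html">Not a title</a>
  <div><a class="jiequ" href="/c.html">Second title</a></div>
</body></html>
""")

# '//' searches the whole document; the predicate in brackets keeps
# only the <a> nodes whose class attribute equals "jiequ".
matches = sample.xpath('//a[@class="jiequ"]')
hrefs = [node.get('href') for node in matches]
print(len(matches))
print(hrefs)
```

Note that the second match is nested inside a `<div>`; `//` finds it regardless of depth.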

In [9]:
testXpath = '//a[@src]'
print(len(thHTML.xpath(testXpath)))
print(thHTML.xpath(testXpath))

0
[]


In [10]:
aList = [2, 3, 4]
print(len(aList))

3


In [12]:
aString = 'blah blah blah'
print(len(aString))
print(len(1))

14


TypeError: object of type 'int' has no len()

In [13]:
titleNodes = thHTML.xpath('//a[@class="jiequ"]')
print(titleNodes)
print('There are', len(titleNodes), 'nodes in titleNodes')

[<Element a at 0x7f32b97cd728>, <Element a at 0x7f32b97cd778>, <Element a at 0x7f32b97cd7c8>, <Element a at 0x7f32b97cd818>, <Element a at 0x7f32b97cd868>, <Element a at 0x7f32b97cd8b8>, <Element a at 0x7f32b97cd908>, <Element a at 0x7f32b97cd958>, <Element a at 0x7f32b97cd9a8>, <Element a at 0x7f32b97cd9f8>, <Element a at 0x7f32b97cda48>, <Element a at 0x7f32b97cda98>, <Element a at 0x7f32b97cdae8>, <Element a at 0x7f32b97cdb38>, <Element a at 0x7f32b97cdb88>, <Element a at 0x7f32b97cdbd8>, <Element a at 0x7f32b97cdc28>, <Element a at 0x7f32b97cdc78>, <Element a at 0x7f32b97cdcc8>, <Element a at 0x7f32b97cdd18>]
There are 20 nodes in titleNodes


Looks like we found 20 nodes that satisfied the conditions we wanted. Checking the webpage, we do find that there are indeed 20 news articles. To get the title text itself rather than the nodes, we can extend the XPath expression so that it asks for each node's text.

In [28]:
titles = thHTML.xpath('//li[@class="clearfix"]//a[@class="jiequ"]/text()')
print('\n'.join(titles))

清华航院李群仰课题组等揭示超薄二维材料摩擦演化之谜
清华大学举办全校干部学习班 深入学习贯彻学习党的十八届六中全会精神
清华大学提供总体技术的国家安全指挥控制系统为厄瓜多尔提供重要安全保障
“可扩展大气动力模拟”联合成果获高性能计算应用领域“戈登·贝尔”奖
清华张奇伟课题组等为超分辨显微技术引入偏振新维度
【专题】学习宣传贯彻党的十八届六中全会精神
邱勇率团访问印度尼西亚 围绕“一带一路”推动教育与文化交流
邱勇率团访问马来西亚  推动教育与文化交流
清华大学与深圳市合作共建清华大学深圳国际校区
清华谭旭课题组发文阐明遗传性大疱性表皮松解症的发病机制
清华大学美术学院纪念建院60周年
清华大学召开第二次人才工作会议
清华大学医学院举办成立十五周年系列活动
清华大学建筑学院喜迎建院（系）70周年
朱镕基会见清华经管学院顾问委员会委员
邱勇、陈旭河北省调研访问 签署科研创新基地合作备忘录
施一公获何梁何利科学与技术成就奖 张希郑纬民获科学与技术进步奖
清华大学罗永章团队发现全新广谱肿瘤标志物并获准用于临床
清华大学工程物理系喜迎建系60周年
国务委员常万全调研清华大学国防教育


By adding `/text()` to the end of our original XPath expression, we tell it both to
1. Look for all nodes with `<a>` tags and "class" attributes equal to "jiequ" *that contain text*
2. Return the text of the selected nodes

Take special note of #1. In order for a node to contribute to the result, it *must* contain text. The XPath for `titles` returned the same number of items as the XPath for `titleNodes`, so it's not an issue in this case. However, keep an eye out for situations where this does become a problem, because a node without text is silently skipped and the results no longer line up with the nodes.
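
The following sketch illustrates the pitfall on a made-up fragment in which one link is empty. The HTML is hypothetical; `text_content()` is the `lxml.html` method that returns all text inside an element.

```python
from lxml import html

# Hypothetical list in which the second link has no text.
sample = html.fromstring("""
<html><body><ul>
  <li><a class="jiequ" href="/a.html">First title</a></li>
  <li><a class="jiequ" href="/b.html"></a></li>
  <li><a class="jiequ" href="/c.html">Third title</a></li>
</ul></body></html>
""")

nodes = sample.xpath('//a[@class="jiequ"]')
texts = sample.xpath('//a[@class="jiequ"]/text()')
print(len(nodes), len(texts))  # the empty link is missing from texts

# Asking each node for its text keeps one entry per node, even when empty.
per_node = [node.text_content().strip() for node in nodes]
print(per_node)
```

Working node by node costs a little more code, but the result always has one entry per article.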

# Your Turn: Get the URLs for Each Article

In [20]:
urls = thHTML.xpath('//a[@class="jiequ"]/@href')
print(urls)
newUrls = ['http://news.tsinghua.edu.cn' + url for url in urls]  # each path already starts with '/'
print(len(newUrls))
print('\n'.join(newUrls))

['/publish/thunews/9648/2016/20161125111440926399642/20161125111440926399642_.html', '/publish/thunews/9648/2016/20161122202918632829820/20161122202918632829820_.html', '/publish/thunews/9648/2016/20161120214925214249640/20161120214925214249640_.html', '/publish/thunews/9648/2016/20161118211238375604661/20161118211238375604661_.html', '/publish/thunews/9648/2016/20161117090911161188849/20161117090911161188849_.html', '/publish/thunews/10512/index.html', '/publish/thunews/9648/2016/20161112201433942549469/20161112201433942549469_.html', '/publish/thunews/9648/2016/20161110085518421655801/20161110085518421655801_.html', '/publish/thunews/9648/2016/20161104121542200694631/20161104121542200694631_.html', '/publish/thunews/9648/2016/20161103114818345858471/20161103114818345858471_.html', '/publish/thunews/9648/2016/20161102164748801103885/20161102164748801103885_.html', '/publish/thunews/9648/2016/20161101134648201938705/20161101134648201938705_.html', '/publish/thunews/9648/2016/2016103118
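
String concatenation works here because every path begins with `/`. As an alternative sketch, the standard library's `urljoin` resolves a relative path against the page URL and takes care of the slashes for us. Both URLs below appear earlier in this notebook.

```python
from urllib.parse import urljoin

base = 'http://news.tsinghua.edu.cn/publish/thunews/9648/index.html'
relative = '/publish/thunews/10512/index.html'

# urljoin keeps the scheme and host of `base` and replaces its path.
full = urljoin(base, relative)
print(full)
```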

# Getting the URLs for the Article Thumbnails 

After inspecting the thumbnail elements, we find that all of them look roughly like
`<img src="/publish/thunews/9658/20161024083051776846828/20161024083302809898275.jpg">`. In that case, let's try the XPath expression `//img[@src]`, which says to find all image nodes that have a "src" attribute.

In [21]:
picNodes = thHTML.xpath('//img[@src]')
print('picNodes has', len(picNodes), 'elements')

picNodes has 23 elements


Uh oh, this is not correct. There are only 20 articles, but the XPath is picking up on 23 "thumbnails." What's going on?

![Extra pictures...](./images/otherPics.png)

What's happening is that the XPath is also getting nodes that correspond to the Tsinghua school logo and other images, which also have `<img>` tags with "src" attributes. In reality this is not surprising. Essentially any image on a webpage would satisfy this XPath. How do we fix this?

## Method 1: Finding a More Specific Pattern Within the Element Itself

Perhaps there's a specific pattern that will only match the article thumbnails? Let's see. The following corresponds to an article thumbnail:

![Article thumbnail](./images/artThumb.png)

And this is the Tsinghua logo thumbnail:

![Tsinghua logo](./images/logoThumb.png)

As you can see, the Tsinghua logo is found in a subdirectory called "images", while the article thumbnails are in a subdirectory made up of integers. A quick but not very robust way to tell them apart is to notice that all the article thumbnail paths contain '/publish/thunews/9\*\*\*', where the asterisks represent other integers. Then we can use `contains()` to get all nodes with `<img>` tags that have a "src" attribute that contains '/publish/thunews/9':

In [22]:
picPathsSpec = thHTML.xpath('//img[contains(@src, "/publish/thunews/9")]/@src')
picsSpec = ['http://news.tsinghua.edu.cn'+pic for pic in picPathsSpec]
print('\n'.join(picsSpec))
print('There are', len(picsSpec), 'URLs in picsSpec')

http://news.tsinghua.edu.cn/publish/thunews/9659/20161125111440926399642/20161125111954532883008.jpg
http://news.tsinghua.edu.cn/publish/thunews/9658/20161122202918632829820/20161122203845082708760.jpg
http://news.tsinghua.edu.cn/publish/thunews/9659/20161120214925214249640/20161120215436392657745.jpg
http://news.tsinghua.edu.cn/publish/thunews/9659/20161118211238375604661/20161118212549066134025.jpg
http://news.tsinghua.edu.cn/publish/thunews/9659/20161117090911161188849/20161117091926006846722.jpg
http://news.tsinghua.edu.cn/publish/thunews/9648/20161114135843940539734/20161114140009879961789.jpg
http://news.tsinghua.edu.cn/publish/thunews/9662/20161112201433942549469/20161112201538262841013.jpg
http://news.tsinghua.edu.cn/publish/thunews/9662/20161110085518421655801/20161110091737579381267.jpg
http://news.tsinghua.edu.cn/publish/thunews/9658/20161104121542200694631/20161104122141235297838.jpg
http://news.tsinghua.edu.cn/publish/thunews/9659/20161103114818345858471/201611031152520896
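
If the leading `9` ever changes, the `contains()` test breaks. One possible, slightly stricter alternative is to fetch every `src` and filter in Python with a regular expression that describes the whole path shape. This is only a sketch: the two thumbnail paths are copied from the output above, while the logo path is a made-up stand-in.

```python
import re

srcs = [
    '/publish/thunews/9659/20161125111440926399642/20161125111954532883008.jpg',
    '/publish/thunews/images/logo.png',  # hypothetical logo path
    '/publish/thunews/9648/20161114135843940539734/20161114140009879961789.jpg',
]

# Three all-digit path components followed by ".jpg".
thumb_pattern = re.compile(r'/publish/thunews/\d+/\d+/\d+\.jpg')
thumbs = [src for src in srcs if thumb_pattern.fullmatch(src)]
print(len(thumbs))
```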

## Method 2: Enforcing a Constraint on the Ancestors/Descendants

Another way to get only the correct thumbnail URLs is to leverage the hierarchy in the HTML document. While looking through the HTML source, you may have noticed that the news article nodes basically fell into nodes that looked like this:

![HTML hierarchy](./images/artElems.png)

Each `<li>` node corresponds to a single article.

![Each article](./images/liExpand.png)

In [23]:
articleNodes = thHTML.xpath('//li[@class="clearfix"]')
print('There are', len(articleNodes), 'nodes in articleNodes')

There are 20 nodes in articleNodes


We can make it so that we are only finding images that are the descendants of each `<li>` node with a "class" attribute equal to "clearfix".

In [10]:
picPathsHier = thHTML.xpath('//li[@class="clearfix"]//img/@src')
picsHier = ['http://news.tsinghua.edu.cn'+pic for pic in picPathsHier]
print('\n'.join(picsHier))
print('Does the result of method 1 equal method 2?', picsHier==picsSpec)

http://news.tsinghua.edu.cn/publish/thunews/9659/20161125111440926399642/20161125111954532883008.jpg
http://news.tsinghua.edu.cn/publish/thunews/9658/20161122202918632829820/20161122203845082708760.jpg
http://news.tsinghua.edu.cn/publish/thunews/9659/20161120214925214249640/20161120215436392657745.jpg
http://news.tsinghua.edu.cn/publish/thunews/9659/20161118211238375604661/20161118212549066134025.jpg
http://news.tsinghua.edu.cn/publish/thunews/9659/20161117090911161188849/20161117091926006846722.jpg
http://news.tsinghua.edu.cn/publish/thunews/9648/20161114135843940539734/20161114140009879961789.jpg
http://news.tsinghua.edu.cn/publish/thunews/9662/20161112201433942549469/20161112201538262841013.jpg
http://news.tsinghua.edu.cn/publish/thunews/9662/20161110085518421655801/20161110091737579381267.jpg
http://news.tsinghua.edu.cn/publish/thunews/9658/20161104121542200694631/20161104122141235297838.jpg
http://news.tsinghua.edu.cn/publish/thunews/9659/20161103114818345858471/201611031152520896

The XPath is now saying to find all `<li>` nodes that have a "class" attribute equal to "clearfix" and then return the "src" of every `<img>` node at *any* descendant level beneath them. That means the `<img>` node could be a direct child of the `<li>` node, or it could be nested 30 levels below it. In this case, there was only one image nested under each `<li>` node, so this worked out fine. However, if there were multiple images under one `<li>`, this XPath expression would return more than one URL for that article, and the list would no longer line up with the articles.

The results we got using Method 2 are the same as those we got using Method 1. 

## Your Turn: Can You Think of a Different Way to Get the Same Results?

It can be similar to Methods 1 and 2.

# Getting Article Summaries

Now we would like to get the summary for each article. 

![Summary elements](./images/summElem.png)

Following the same general procedures as before, we create the following XPath expression. 

In [14]:
summariesPre = thHTML.xpath('//div[@class="contentwraper"]/p/text()')
print(summariesPre)

[]


What happened? The XPath expression should definitely have returned something. There are `<p>` nodes that clearly contain text.

The problem is that `requests` is not viewing the web page the same way that your browser is viewing it. Much of the content on this page is being generated with JavaScript, which your browser runs but `requests` does not. We will deal with this general problem later, but for this specific instance, we can still work within our general framework of `requests` + `lxml`. This page is a bit odd in that the things the JavaScript functions act on are precisely the information we want. Consider this XPath:

In [16]:
summariesPre = thHTML.xpath('//div[@class="contentwraper"]/p//text()')
print('\n'.join(summariesPre[:3]))

cutSummary("清华大学航天航空学院李群仰课题组与合作者于11月24日在《自然》在线发表题为“石墨烯摩擦接触界面的状态演化”（The evolving quality of frictional contact with graphene）表明，界面摩擦对于二维材料存在独特的机理：二维材料由于其超薄的几何特性和超大的柔性，能够通过改变自身构型来影响接触界面的钉扎状态，进而可从界面的“质”而不仅是“量”上来调控其摩擦性能。",180);
cutSummary("11月18日至19日，清华大学举办2016年全校干部学习班，深入学习党的十八届六中全会精神和《胡锦涛文选》。党委书记陈旭在总结会上讲话，校长邱勇主持学习总结会。党委常务副书记、副校长姜胜耀作开班动员并主持学习辅导报告会，党委副书记邓卫主持学习班交流会和有关学习辅导报告会。",180);
cutSummary("11月16日，习近平主席在对厄瓜多尔进行国事访问前在厄瓜多尔《电讯报》发表题为《搭建中厄友好合作的新桥》的署名文章，特别提到：“我高兴地得知，在这次抗震救灾中，中国提供设备技术并负责建设的厄瓜多尔公共安全服务系统发挥了重要作用。作为指挥中枢，厄瓜多尔公共安全服务系统高效处理了大量信息，及时发出一道道指令，挽救了许多生命，降低了灾害带来的损失。”文中所说的公共安全服务系统，正是由清华大学公共安全研究院提供总体技术支持，清华控股旗下北京辰安科技股份有限公司研发的国家安全指挥控制系统（ECU911）。",180);


As can be seen from the first 3 elements of `summariesPre`, the webpage still contains the full article summaries, which are apparently being fed to a function called `cutSummary()` that presumably shortens the strings it is given. This text actually belongs to the `<script>` node that is a child of the `<p>` node we were looking at. This is why the single slash `/` didn't find it, but the double slash `//` did. The single `/` looks only at direct children, while the double `//` looks through all descendant levels.
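
The difference between `/` and `//` can be reproduced on a small hand-written fragment that mimics the structure described above. The summary text is a placeholder, not real page content.

```python
from lxml import html

# Hypothetical fragment: the text lives in a <script> inside the <p>.
sample = html.fromstring("""
<html><body>
<div class="contentwraper">
  <p><script>cutSummary("Full summary text", 180);</script></p>
</div>
</body></html>
""")

# '/text()' only returns text that is a direct child of <p>.
direct = sample.xpath('//div[@class="contentwraper"]/p/text()')
# '//text()' returns text at any depth below <p>, including the script body.
nested = sample.xpath('//div[@class="contentwraper"]/p//text()')
print(direct)
print(nested)
```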

We can use a regular expression to clean it up and get the article summaries we want.

In [18]:
summaries = [re.search('"(.*)"', summary).group(1) for summary in summariesPre]
print('There are', len(summaries), 'summaries. The last three are \n', '\n'.join(summaries[-3:]))

There are 20 summaries. The last three are 
 2013年，清华大学罗永章团队通过肺癌临床试验在世界上首次证明肿瘤标志物血浆热休克蛋白90α（Hsp90α）可用于肝癌患者的检测；经过在医疗机构的推广使用，现在这项成果已被国家食品药品监督管理总局批准在临床中使用，这标志着首个由我国科学家定义、并获准用于临床的广谱肿瘤标志物诞生，对提高癌症诊疗水平具有深远意义。
10月16日，清华大学工程物理系在大礼堂举行纪念建系60周年活动。工物系1964届系友、最高人民检察院原检察长贾春旺，中国核试验基地司令员吴应强，中国核工业集团公司总经理钱智民，中国工程物理研究院副院长赖新春等嘉宾与工物系校友、师生们欢聚一堂，回首峥嵘岁月，共绘美好蓝图。清华大学党委书记陈旭、常务副校长程建平、党委副书记史宗恺及学校老领导王大中、滕藤、梁尤能、康克军等出席活动。
10月13日下午，国务委员、中央军委委员、国防部部长常万全到清华大学调研指导国防教育工作。清华大学党委书记陈旭、校长邱勇、党委副书记史宗恺等汇报了有关情况。
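
The one-line regular expression above assumes every string matches; if one does not, `re.search` returns `None` and `.group(1)` raises an `AttributeError`. A slightly more defensive sketch follows. The helper name `extract_summary` and the sample strings are invented for illustration.

```python
import re

# Match the whole cutSummary("...", N); call and capture the quoted part.
pattern = re.compile(r'cutSummary\("(.*)",\s*\d+\);', re.DOTALL)

def extract_summary(text):
    """Return the summary inside a cutSummary(...) call, or None if absent."""
    match = pattern.search(text)
    return match.group(1) if match else None

raw = ['cutSummary("He said "hello" today.",180);', 'no call here']
cleaned = [extract_summary(item) for item in raw]
print(cleaned)
```

Because `.*` is greedy, quotation marks inside the summary are kept rather than cutting the match short.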


# Your Turn: Get the Dates for Each Article

# Getting the Number of Views for Each Article

Following the same general procedure, we generate the following XPath:

In [30]:
views = thHTML.xpath('//font[contains(@id, "itemlist_total_")]//text()')
print(views)

[]


We encounter the same problem as with the summaries, except this time the information we need doesn't seem to be lurking in the webpage somewhere. 

## Method 1: Checking Browser Traffic

Let's check the requests that the browser is making.

![Browser requests](./images/network.png)

We notice that, while loading the page, our browser requests a number of other URLs of the form http://news.tsinghua.edu.cn/application/visitor/article_list_visitors_2.jsp?articleID=20161125111440926399642&0.3851099563224041

In [31]:
testViewURL = requests.get('http://news.tsinghua.edu.cn/application/visitor/article_list_visitors_2.jsp?articleID=20161125111440926399642&0.3851099563224041')
print(testViewURL.text)

3862


Following the URL shows a web page with a single number...that happens to correspond with the number of views for that article! How can we use this URL to get the views for every article?

The first part of the URL (http://news.tsinghua.edu.cn/application/visitor/article_list_visitors_2.jsp) seems to be the base URL, and the number between 'articleID=' and the ampersand '&' seems to be an article ID number. 

In [12]:
print(picPathsHier[0])

/publish/thunews/9659/20161125111440926399642/20161125111954532883008.jpg


Indeed, it corresponds to the directory where the thumbnails (and articles) are kept. 

What about the number that comes after the '&'? First, let's see how important it is. Try the URL without the '&' and everything after it: http://news.tsinghua.edu.cn/application/visitor/article_list_visitors_2.jsp?articleID=20161125111440926399642. 

In [32]:
testViewURL = requests.get('http://news.tsinghua.edu.cn/application/visitor/article_list_visitors_2.jsp?articleID=20161125111440926399642')
print(testViewURL.text)

3864


Looks like it still works!

In [17]:
thHTML.xpath('/html/head/script[contains(text(), "getResData")]/text()')

['\nfunction getResData(fwl){\n jQuery.get("/application/visitor/article_list_visitors_2.jsp?articleID="+fwl+"&"+Math.random(), function(data){\n   $(".itemlist_total_"+fwl).empty().text(data);\n });\n};\n getResData(\'20161125111440926399642\');\n getResData(\'20161124153021482822403\');\n getResData(\'20161122145022352519997\');\n getResData(\'20161123171411075671175\');\n getResData(\'20161122160953066330557\');\n getResData(\'20161122202918632829820\');\n getResData(\'20161122121028124838955\');\n getResData(\'20161121095048294737054\');\n getResData(\'20161120214925214249640\');\n getResData(\'20161118211238375604661\');\n getResData(\'20161117110018590202437\');\n getResData(\'20161117090911161188849\');\n getResData(\'20161116165236288738250\');\n getResData(\'20161114144139271904234\');\n getResData(\'20161114142313203693192\');\n getResData(\'20161114172418011284718\');\n getResData(\'20161112201433942549469\');\n getResData(\'20161111162022644779365\');\n getResData(\'2016111

The site actually places its JavaScript functions in the head of the HTML document, and we can use an XPath to get them. We can see that the function to get the views (perhaps among other things) is called `getResData`, which is mainly a wrapper around the `jQuery.get()` function -- probably similar to the `requests.get()` function we use -- to generate the URL that we found by observing our browser traffic. The number that comes after '&' is just a random number generated by the `Math.random()` function. The page does not say why it's there; a common reason for such a parameter is to keep the browser from reusing a cached response, which would fit with the request working fine without it.
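
Since the script lists every ID it requests, the IDs can also be pulled straight out of the script text. This is a sketch on a shortened copy of that script (function body omitted, only the first two calls kept).

```python
import re

script = (
    'function getResData(fwl){ /* body omitted */ };\n'
    " getResData('20161125111440926399642');\n"
    " getResData('20161124153021482822403');\n"
)

# Only the calls with a quoted, all-digit argument match, so the
# function definition itself is skipped.
ids = re.findall(r"getResData\('(\d+)'\)", script)
print(ids)
```

Judging from the output above, the script also requests IDs that do not appear among the 20 listed articles, so the result may need to be matched against the article list.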

So now all the components are in place. Now we just have to put them together.
1. Get all the article IDs
2. Use the article IDs to form the URLs that the JavaScript would call
3. Get the content from the URLs

In [32]:
articleIDs = [re.search(r'/\d+?/(\d+)', picPath).group(1) for picPath in picPathsHier]  # raw string so \d is not treated as a string escape
viewURLs = ['http://news.tsinghua.edu.cn/application/visitor/article_list_visitors_2.jsp?articleID='+articleID for articleID in articleIDs]
views = []
for viewURL in viewURLs:
    viewPage = requests.get(viewURL)
    views.append(viewPage.text)
    print('There were', viewPage.text, 'views for', viewURL)
    time.sleep(3)
    
print(views)
print('There are', len(views), 'elements in views')

There were 3371 views for http://news.tsinghua.edu.cn/application/visitor/article_list_visitors_2.jsp?articleID=20161125111440926399642
There were 2824 views for http://news.tsinghua.edu.cn/application/visitor/article_list_visitors_2.jsp?articleID=20161122202918632829820
There were 3679 views for http://news.tsinghua.edu.cn/application/visitor/article_list_visitors_2.jsp?articleID=20161120214925214249640
There were 5959 views for http://news.tsinghua.edu.cn/application/visitor/article_list_visitors_2.jsp?articleID=20161118211238375604661
There were 4770 views for http://news.tsinghua.edu.cn/application/visitor/article_list_visitors_2.jsp?articleID=20161117090911161188849
There were 88 views for http://news.tsinghua.edu.cn/application/visitor/article_list_visitors_2.jsp?articleID=20161114135843940539734
There were 3545 views for http://news.tsinghua.edu.cn/application/visitor/article_list_visitors_2.jsp?articleID=20161112201433942549469
There were 4080 views for http://news.tsinghua.edu

There is actually a chance the above will yield an error at http://news.tsinghua.edu.cn/application/visitor/article_list_visitors_2.jsp?articleID=20161114135843940539734, the one with 88 views. This one seems to be different from the others. Its general URL structure is different and it's a [special-topic page (專題)](http://news.tsinghua.edu.cn/publish/thunews/10512/index.html).
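
One way to keep a single odd response from breaking the rest of the run is to convert each response to a number defensively. The helper name `parse_views` is made up for this sketch, and the second input is an invented example of an unexpected response.

```python
def parse_views(text):
    """Return the view count as an int, or None if the text is not a number."""
    try:
        return int(text.strip())
    except ValueError:
        return None

good = parse_views('3862\n')
bad = parse_views('<html>error</html>')
print(good, bad)
```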

## Method 2: Using Selenium

Another way of dealing with JavaScript-generated content is to use Selenium to read a page.

In [34]:
from selenium import webdriver

Everything is essentially the same as before with `requests`, except instead of just calling `requests.get()`, we first start a browser driver (Firefox in the cells that follow) and then call its `get()` method.

In [36]:
driver.quit()

'NoneType' object has no attribute 'path'


In [37]:
driver = webdriver.Firefox()


In [41]:
driver.get(thURL)

Let's get the page through `lxml` and try our original XPath expression again.

In [42]:
selHTML = html.fromstring(driver.page_source)

viewsSel = selHTML.xpath('//font[contains(@id, "itemlist_total_")]//text()')
print(viewsSel)
print('There are', len(viewsSel), 'elements in viewsSel')

['3879', '2993', '3807', '6005', '4830', '3550', '4090', '8304', '3799', '3458', '3082', '2421', '5933', '11503', '6258', '7836', '5744', '10902', '9242']
There are 19 elements in viewsSel


A half-success? We can now get the number of views using our original XPath expression, but we appear to be missing an article. The culprit is again the special-topic page (專題).

![No views](./images/noViews.png)

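There are no views shown here for the special-topic page. How should we deal with this? We cannot leave it as is, since we would not know which views correspond to which articles. One option is to query each article node separately, so that an article without a view count yields an empty result instead of disappearing.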

In [7]:
articleNodes = selHTML.xpath('//li[@class="clearfix"]')
viewsSel = [article.xpath('.//font[contains(@id, "itemlist_total_")]//text()') for article in articleNodes]
print(viewsSel)
print('There are', len(viewsSel), 'elements in viewsSel')

[['3389'], ['2828'], ['3682'], ['5961'], ['4775'], [], ['3545'], ['4080'], ['8299'], ['3793'], ['3455'], ['3071'], ['2415'], ['5932'], ['11495'], ['6255'], ['7831'], ['5739'], ['10897'], ['9237']]
There are 20 elements in viewsSel


Now they are correct. However, it's currently a list of lists where each sublist has a single string. Let's clean it up.

In [8]:
viewsSel = [int(view[0]) if view else 'unknown' for view in viewsSel]
print(viewsSel)

[3389, 2828, 3682, 5961, 4775, 'unknown', 3545, 4080, 8299, 3793, 3455, 3071, 2415, 5932, 11495, 6255, 7831, 5739, 10897, 9237]


# Saving the Data

We've gotten a decent amount of content. Now we can save it for future analysis. Since the number of views changes over time, it might be good to note when we acquired (or at least saved) the data.

In [None]:
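currentTime = [time.strftime("%Y-%m-%d %H:%M:%S")]*len(titles)
# `dates` comes from the "Your Turn" exercise above; `picsHier` holds the thumbnail URLs
resultWithViews = zip(titles, urls, summaries, dates, picsHier, viewsSel, currentTime)
# newline='' is what the csv module expects; utf-8 keeps the Chinese text intact
with open('thNewsWithViews.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['title', 'url', 'summary', 'date', 'pic', 'views', 'writetime'])
    writer.writerows(resultWithViews)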
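
To check that a file written this way can be read back, here is a small round-trip sketch. It writes two sample rows to a temporary file (the file name and the second row are invented) and loads them again with `csv.DictReader`.

```python
import csv
import os
import tempfile

rows = [('清华大学召开第二次人才工作会议', 3071), ('Special topic', 'unknown')]
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# Write with an explicit encoding so non-ASCII titles survive.
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'views'])
    writer.writerows(rows)

# Read back: every value comes back as a string.
with open(path, newline='', encoding='utf-8') as f:
    loaded = list(csv.DictReader(f))

print(loaded[0]['title'], loaded[0]['views'])
```

Keep in mind that numbers come back as strings, so the view counts need converting again before analysis.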

# Exercises

## Getting the Rest of the Pages

We only got the first page of articles. What about the others? Remember to stagger your requests!

## Getting the Actual Articles

We only have the summaries so far. What if we want the entire article? Remember to stagger your requests!
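
As a starting point for both exercises, staggering can be wrapped in a small helper. This is only a sketch: `fetch_staggered` is a made-up name, and the `fetch` argument stands for whatever function retrieves a page, for example `lambda url: requests.get(url).text`. The demo uses a stand-in so that nothing is actually requested.

```python
import time

def fetch_staggered(urls, fetch, delay=3.0):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # wait before every request except the first
        results.append(fetch(url))
    return results

# Demo with a stand-in fetch function and no delay.
demo = fetch_staggered(['page-1', 'page-2'], fetch=str.upper, delay=0)
print(demo)
```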