# Acquire the Data

## Sources of Data

We want to understand the important current trends in machine learning, so we need a list of machine learning articles that people are talking about. Many sources could provide this; we picked three.

1. [Reddit.com - Machine Learning](https://www.reddit.com/r/MachineLearning/) - Reddit is a user-generated discussion forum where the community discusses recent articles and topics on machine learning.

2. [Data Tau](http://www.datatau.com/) - Data Tau is the Hacker News for machine learning. Users post articles about the latest trends in data science and machine learning and can discuss them.

3. [Twitter #machinelearning](https://twitter.com/search?q=%23machinelearning&src=typd) - We can also search Twitter for the #machinelearning tag to find the latest articles and posts about machine learning that are being discussed on social media.


## Working with Data Tau

Let us start with the Data Tau site and scrape it to acquire the data.

![](img/datatau.png)

We want to scrape the title and date of each article on this page.

In [1]:
import requests
from bs4 import BeautifulSoup 
import re
import pandas as pd

In [2]:
base_url = 'http://www.datatau.com'

## Understand the HTML Structure

In [3]:
# Let us use requests to fetch the URL
dataTau = requests.get(base_url)

In [4]:
# Check that the page was fetched - we should see Response 200
dataTau

<Response [200]>
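
Inspecting the response by eye works in a notebook; in a script the same check can be done in code. The following is a minimal sketch, not part of the original notebook: the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests

def fetch_html(url, timeout=10):
    """Fetch a page and return its body as bytes.

    fetch_html is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the status code is 4xx or 5xx
    response.raise_for_status()
    return response.content

# Usage (needs network access):
# html = fetch_html('http://www.datatau.com')
```

Failing loudly on a bad status avoids silently parsing an error page as if it were the article listing.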

In [5]:
# Load a saved copy of the page as bytes (assumes dataTau.html is in the working directory)
dataTau = open('dataTau.html', 'rb').read()

In [6]:
# Let us see the text content of the page
dataTau

b'<html><head><link rel="stylesheet" type="text/css" href="news.css">\n<link rel="shortcut icon" href="http://www.iconj.com/ico/d/x/dxo02ap56v.ico">\n<script>\nfunction byId(id) {\n  return document.getElementById(id);\n}\n\nfunction vote(node) {\n  var v = node.id.split(/_/);   // {\'up\', \'123\'}\n  var item = v[1]; \n\n  // adjust score\n  var score = byId(\'score_\' + item);\n  var newscore = parseInt(score.innerHTML) + (v[0] == \'up\' ? 1 : -1);\n  score.innerHTML = newscore + (newscore == 1 ? \' point\' : \' points\');\n\n  // hide arrows\n  byId(\'up_\'   + item).style.visibility = \'hidden\';\n  byId(\'down_\' + item).style.visibility = \'hidden\';\n\n  // ping server\n  var ping = new Image();\n  ping.src = node.href;\n\n  return false; // cancel browser nav\n} </script><script>\n\n  (function(i,s,o,g,r,a,m){i[\'GoogleAnalyticsObject\']=r;i[r]=i[r]||function(){\n  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),\n  m=s.getElementsByTagName(o)[0];

In [7]:
# Start the beautifulsoup library and create a soup!
soup = BeautifulSoup(dataTau,'html.parser')

In [8]:
# See the prettified HTML - not so pretty though!
print(soup.prettify())

<html>
 <head>
  <link href="news.css" rel="stylesheet" type="text/css">
   <link href="http://www.iconj.com/ico/d/x/dxo02ap56v.ico" rel="shortcut icon">
    <script>
     function byId(id) {
  return document.getElementById(id);
}

function vote(node) {
  var v = node.id.split(/_/);   // {'up', '123'}
  var item = v[1]; 

  // adjust score
  var score = byId('score_' + item);
  var newscore = parseInt(score.innerHTML) + (v[0] == 'up' ? 1 : -1);
  score.innerHTML = newscore + (newscore == 1 ? ' point' : ' points');

  // hide arrows
  byId('up_'   + item).style.visibility = 'hidden';
  byId('down_' + item).style.visibility = 'hidden';

  // ping server
  var ping = new Image();
  ping.src = node.href;

  return false; // cancel browser nav
}
    </script>
    <script>
     (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.paren

### Get the title in each page

We have 30 articles on each page. Let us find the HTML tag and attribute that hold each title.

Inspecting the page shows that the selector we need is '`td .title`'.

![](img/title.png)

In [9]:
title_class = soup.select('td .title')

In [10]:
len(title_class)

61

We are getting about double the number we expected. Let us see why by examining the first two elements in the list, and then the last one.

In [11]:
title_class[0:2]

[<td align="right" class="title" valign="top">1.</td>,
 <td class="title"><a href="https://www.springboard.com/blog/eat-rate-love-an-exploration-of-r-yelp-and-the-search-for-good-indian-food/" rel="nofollow">An Exploration of R, Yelp, and the Search for Good Indian Food</a><span class="comhead"> (springboard.com) </span></td>]

In [12]:
title_class[-1]

<td class="title"><a href="/x?fnid=CSS821ucAs" rel="nofollow">More</a></td>

Aha - we are getting both the rank number and the title for each article (30 + 30), plus one extra "More" cell, which gives 61. We need to be more specific and pick only the `<a>` links.

In [13]:
title_class = soup.select('td .title a')

In [14]:
len(title_class)

31

Why do we get 31 and not 30 articles? Let's check.

In [15]:
title_class[0]

<a href="https://www.springboard.com/blog/eat-rate-love-an-exploration-of-r-yelp-and-the-search-for-good-indian-food/" rel="nofollow">An Exploration of R, Yelp, and the Search for Good Indian Food</a>

In [16]:
title_class[0].get_text()

'An Exploration of R, Yelp, and the Search for Good Indian Food'

In [17]:
title_class[-1]

<a href="/x?fnid=CSS821ucAs" rel="nofollow">More</a>

OK, so the last link is "More", which points to the next page. That is good: we can use it to get the next URL to scrape.
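
The counts seen so far can be reproduced on a small hand-written snippet. This is only a sketch: the HTML below is an assumption that imitates the nested-table layout with two articles, not the real page, and `soup_demo` is a name introduced here.

```python
from bs4 import BeautifulSoup

# Hand-written HTML imitating the listing layout (an assumption, not the real page):
# an outer table cell wrapping an inner table of rank, title, and subtext cells.
html = """
<table><tr><td>
<table>
  <tr><td align="right" class="title">1.</td>
      <td class="title"><a href="https://example.com/a">First article</a>
          <span class="comhead"> (example.com) </span></td></tr>
  <tr><td class="subtext">5 points by <a href="user?id=alice">alice</a> 4 hours ago</td></tr>
  <tr><td align="right" class="title">2.</td>
      <td class="title"><a href="https://example.com/b">Second article</a></td></tr>
  <tr><td class="subtext">3 points by <a href="user?id=bob">bob</a> 9 hours ago</td></tr>
  <tr><td class="title"><a href="/x?fnid=abc" rel="nofollow">More</a></td></tr>
</table>
</td></tr></table>
"""

soup_demo = BeautifulSoup(html, 'html.parser')

cells = soup_demo.select('td .title')    # rank cells, title cells, and the "More" cell
links = soup_demo.select('td .title a')  # only the links inside those cells

print(len(cells))   # 2 ranks + 2 titles + 1 "More" = 5
print(len(links))   # 2 titles + 1 "More" = 3
print([a.get_text() for a in links])
```

With 30 articles the same arithmetic gives 30 + 30 + 1 = 61 cells and 30 + 1 = 31 links. The user links inside the `subtext` cells are not picked up because those cells do not carry the `title` class.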

**NOTE: Taking care of the edge cases**

When we run this on multiple pages, we find that sometimes there is more than one `<a>` link in a title. To take care of this, we rewrite the selector to pick only the first `<a>` link in each title.

In [18]:
title_class = soup.select('td .title > a:nth-of-type(1)')

In [19]:
title_class[0].get_text()

'An Exploration of R, Yelp, and the Search for Good Indian Food'
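
To see what the stricter selector changes, here is a small sketch on a made-up title cell that contains two links. The HTML, the "[pdf]" link, and the name `soup_demo` are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical title cell with two links, nested in an outer cell like the listing page
html = """
<table><tr><td><table>
  <tr><td class="title"><a href="https://example.com/a">Main title</a>
      <a href="https://example.com/a.pdf">[pdf]</a></td></tr>
  <tr><td class="title"><a href="/x?fnid=abc">More</a></td></tr>
</table></td></tr></table>
"""
soup_demo = BeautifulSoup(html, 'html.parser')

all_links = soup_demo.select('td .title a')
first_links = soup_demo.select('td .title > a:nth-of-type(1)')

print([a.get_text() for a in all_links])    # includes the extra "[pdf]" link
print([a.get_text() for a in first_links])  # one link per title cell
```

The `>` restricts the match to direct children of the title cell, and `:nth-of-type(1)` keeps only the first `<a>` among them, so each title cell contributes exactly one link.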

### Get the date for each title

To get the date for each title, we need the `td` tag with class `subtext`; the selector `.subtext` is enough to match it.

![](img/date.png)

In [20]:
date_class = soup.select('.subtext')

In [21]:
len(date_class)

30

In [22]:
date_class[0]

<td class="subtext"><span id="score_11989">5 points</span> by <a href="user?id=Rogerh91">Rogerh91</a> 4 hours ago  | <a href="item?id=11989">discuss</a></td>

In [23]:
date_class[0].get_text()

'5 points by Rogerh91 4 hours ago  | discuss'
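
The date string mixes the score, the user name, and a relative age. One possible way to split it is a regular expression; this is a sketch under assumptions, since this notebook only stores the raw string. The pattern, the helper name `parse_subtext`, and the reference time `now` are all hypothetical and based only on the single example string above.

```python
import re
from datetime import datetime, timedelta

# Assumed pattern, based only on the example string shown above
SUBTEXT_RE = re.compile(r'(\d+) points? by (\S+) (\d+) (minute|hour|day)s? ago')

def parse_subtext(text, now):
    """Return (points, user, approximate posting time), or None if no match."""
    match = SUBTEXT_RE.search(text)
    if match is None:
        return None
    points, user, amount, unit = match.groups()
    # Turn e.g. ("4", "hour") into timedelta(hours=4)
    age = timedelta(**{unit + 's': int(amount)})
    return int(points), user, now - age

now = datetime(2016, 1, 1, 12, 0)  # arbitrary reference time for the demo
print(parse_subtext('5 points by Rogerh91 4 hours ago  | discuss', now))
```

Because the age is relative, the computed time is only as accurate as the moment of scraping, so it is worth recording that moment alongside the data.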

## Automate the Scraping Process

We now write a function that fetches a page, extracts every title and date string, and appends them to a dataframe. It returns the "More" link so that we can move on to the next page.

In [24]:
# Let us create an empty dataframe to store the data
df = pd.DataFrame(columns=['title','date'])
df.count()

title    0
date     0
dtype: int64

In [25]:
def get_data_from_tau(url):
    """Scrape one page into the global df and return the <a> tag of the "More" link."""
    print(url)
    dataTau = requests.get(url)
    soup = BeautifulSoup(dataTau.content, 'html.parser')
    title_class = soup.select('td .title > a:nth-of-type(1)')
    date_class = soup.select('.subtext')
    print(len(title_class), len(date_class))
    # The last title link is "More", so pair only the article links with their subtext
    for title, date in zip(title_class[:-1], date_class):
        df.loc[df.shape[0]] = [title.get_text(), date.get_text()]
    print('updated df with data')
    return title_class[-1]

In [26]:
# Scrape six pages, following the "More" link from each page to the next
url = base_url
for i in range(6):
    more_url = get_data_from_tau(url)
    url = base_url + more_url['href']

http://www.datatau.com
31 30
updated df with data
http://www.datatau.com/x?fnid=aFffLhBQyN
31 30
updated df with data
http://www.datatau.com/x?fnid=0urLeo7gjV
31 30
updated df with data
http://www.datatau.com/x?fnid=uMJcXgJIJs
31 30
updated df with data
http://www.datatau.com/x?fnid=qyftRfLQ6D
31 30
updated df with data
http://www.datatau.com/x?fnid=VcML5GXJiJ
31 30
updated df with data
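
The loop above assumes every page ends with a "More" link. As a hedged alternative sketch, the same pagination can be written so that it stops when no "More" link is found and collects rows in a plain list. The names `parse_page`, `scrape`, and `FAKE_PAGES`, the injected `fetch` callable, and the `example.test` URLs are all assumptions introduced here so that the example runs offline.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def parse_page(html):
    """Return (rows, more_href) for one listing page; more_href is None on the last page."""
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.select('td .title > a:nth-of-type(1)')
    subtexts = soup.select('.subtext')
    more_href = None
    # Treat a trailing link whose text is "More" as the pagination link
    if links and links[-1].get_text() == 'More':
        more_href = links[-1]['href']
        links = links[:-1]
    rows = [(a.get_text(), s.get_text()) for a, s in zip(links, subtexts)]
    return rows, more_href

def scrape(start_url, fetch, max_pages=6):
    """Follow "More" links for up to max_pages pages; fetch(url) must return HTML."""
    rows, url = [], start_url
    for _ in range(max_pages):
        page_rows, more_href = parse_page(fetch(url))
        rows.extend(page_rows)
        if more_href is None:
            break
        url = urljoin(url, more_href)  # resolve the relative "More" link
    return rows

def _wrap(inner):
    # Nest the rows inside an outer cell, imitating the listing layout
    return '<table><tr><td><table>' + inner + '</table></td></tr></table>'

# Two fake pages (hypothetical content) so the demo needs no network access
FAKE_PAGES = {
    'http://example.test': _wrap(
        '<tr><td class="title">1.</td>'
        '<td class="title"><a href="https://example.com/a">Article A</a></td></tr>'
        '<tr><td class="subtext">1 point by ann 1 hour ago | discuss</td></tr>'
        '<tr><td class="title"><a href="/x?fnid=p2">More</a></td></tr>'),
    'http://example.test/x?fnid=p2': _wrap(
        '<tr><td class="title">2.</td>'
        '<td class="title"><a href="https://example.com/b">Article B</a></td></tr>'
        '<tr><td class="subtext">2 points by bob 3 hours ago | discuss</td></tr>'),
}

rows = scrape('http://example.test', FAKE_PAGES.__getitem__)
print(rows)
```

For a live run, `fetch` could be something like `lambda u: requests.get(u).content`, and `pd.DataFrame(rows, columns=['title', 'date'])` would build the dataframe in one step instead of appending row by row. One caveat of this sketch: an article whose title is literally "More" in the last position would be mistaken for the pagination link.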


In [27]:
df.shape

(180, 2)

In [28]:
df.head()

| | title | date |
|---|---|---|
| 0 | An Exploration of R, Yelp, and the Search for ... | 5 points by Rogerh91 6 hours ago \| discuss |
| 1 | Deep Advances in Generative Modeling | 7 points by gwulfs 15 hours ago \| 1 comment |
| 2 | Spark Pipelines: Elegant Yet Powerful | 3 points by aouyang1 9 hours ago \| discuss |
| 3 | Shit VCs Say | 3 points by Argentum01 10 hours ago \| discuss |
| 4 | Python, Machine Learning, and Language Wars | 4 points by pmigdal 17 hours ago \| discuss |


In [29]:
df.to_csv('data_tau.csv', encoding = "utf8", index = False)