In [1]:
import json
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
import time

In [2]:
%system GoogleScraper -m http -p 10 -n 50 -q "should site:answers.yahoo.com" --output-filename Results/google.json

['2016-06-28 13:52:39,980 - GoogleScraper.caching - INFO - 10 cache files found in .scrapecache/',
 '2016-06-28 13:52:39,980 - GoogleScraper.caching - INFO - 10/10 objects have been read from the cache. 0 remain to get scraped.']

# Open the file
We saved the above results into the file **Results/google.json**. We open it and use the json library to load it into a Python list.

In [3]:
with open("Results/google.json", 'r') as f:
    data = json.load(f)

In [4]:
'''
Each of the 10 pages in data has a 'results' key holding a list of length 50
Each item in that list is a Python dictionary containing the metadata for one search result
Let's put all of this data into one list: a list of Python dictionaries
'''

all_results = []
for d in data:
    all_results.extend(d['results'])

In [5]:
'''
We should have 500 pieces of metadata
'''
len(all_results)

500

In [6]:
'''
Success! We have 500 pieces of metadata, each containing a different search result
For example, take a look at the first one
'''
sample_result = all_results[0]
sample_result

{'domain': 'answers.yahoo.com',
 'id': '401',
 'link': 'https://answers.yahoo.com/question/index?qid=20100802084437AAEixwf',
 'link_type': 'results',
 'rank': '1',
 'serp_id': '9',
 'snippet': "Aug 2, 2010 - Should the letter be justified. I am looking online but I can't find a sample formated cover letter. ... I feel that letters typed with just the left margin justified look better and are easier to read.",
 'title': 'Should I justify my cover letter? | Yahoo Answers',
 'visible_link': 'https://answers.yahoo.com/question/index?qid...'}

In [7]:
'''
Looks like the URL is in the 'link' key of the metadata. Let's look at it by itself
'''
link = sample_result['link']
link

'https://answers.yahoo.com/question/index?qid=20100802084437AAEixwf'

# Inspecting html
Now that we know how to access each link from GoogleScraper, we will manually inspect the page to find the question and some other metadata associated with that page. We can use the requests library to get the HTML.

In [8]:
'''
This command issues a GET request and returns a 'response' object
'''
response = requests.get(link)

In [9]:
'''
The 'text' attribute of the response object contains the HTML of the page as text,
the same markup you see when you inspect the page with your browser's developer tools

BeautifulSoup is a python library that parses the gigantic mess of html code underneath
'''
soup = BeautifulSoup(response.text, 'html.parser')

In [14]:
'''
After manually inspecting the page, it looks like the question is always
in an h1 tag with itemprop="name". This is of course subject to change but generally should be stable

The follow count is stored in the "data-ya-fc" attribute of a span with class "follow-text",
and the answer count is the text of a span with class "D-n"
'''
question = soup.find("h1", {"itemprop": "name"})
# attrs must be a dict; the original set literal {"class", "follow-text"} only matched by accident
follows = soup.find("span", {"class": "follow-text"})
answers = soup.find("span", {"class": "D-n"})
question.text

'\n            Should I justify my cover letter?\n        '

In [15]:
follows["data-ya-fc"]

'1'

In [16]:
answers.text

'2'

In [40]:
'''
Let's take a look at the text with the surrounding whitespace stripped.
Variable question is still a BeautifulSoup object
'''
question.text.strip()

'Should I justify my cover letter?'

# Requests library is not enough :(
After some more work not shown here, it was discovered that the requests library does not use your Google profile information to log in. Logging in to Google is very important, as this gives access to the number of followers of a question.

# Use selenium to manipulate page
Selenium is a third-party Python library that lets you manipulate a webpage. A Chrome driver had to be downloaded first (so this part won't work in your local notebook without downloading it yourself and changing the path). A new username and password dedicated just to scraping Quora was created by Phu.

The code below might not work on your machine, because it did ask me (Teddy in Houston) where the user normally logs in from. But after I entered that info, the code below should work every time.

It sleeps for 3 seconds between entering username and password to make sure it has time to get to the password page.
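
A fixed 3-second sleep can be too short on a slow connection and wasteful on a fast one. As a minimal sketch of the alternative, the helper below polls a condition until it holds or a timeout expires. The name `wait_until` and its parameters are hypothetical, introduced here for illustration only; they are not part of the original notebook. (Selenium also ships its own `WebDriverWait` class for this purpose.)

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll `condition` (a zero-argument callable) until it returns a truthy
    value or `timeout` seconds have passed.

    Returns the truthy value, or None if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

With the Selenium version used in this notebook, the password step could then look something like `wait_until(lambda: driver.find_elements_by_id("Passwd"))` in place of `time.sleep(3)`; treat that call as an untested assumption about how the login page behaves.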

In [11]:
'''
Selenium manually logs into google. Beforehand I created a quora account with the gmail email below.

A new window should pop up and you can actually see the magic happen before your eyes.

Eventually chrome will navigate to the "link" variable above (look up several cells)
'''
driver_location = "./chromedriver"
driver = webdriver.Chrome(executable_path=driver_location)
driver.get("http://www.google.com")
driver.find_element_by_class_name('gb_Me').click()
inputElement = driver.find_element_by_id("Email")
inputElement.send_keys('bsmithfun2016@gmail.com')
inputElement.send_keys(Keys.ENTER)

time.sleep(3)

inputElement = driver.find_element_by_id("Passwd")
inputElement.send_keys('qwerty123!!!')
inputElement.send_keys(Keys.ENTER)
driver.get(link)

In [17]:
'''
Now let's get that follows number
'''
soup = BeautifulSoup(driver.page_source, 'html.parser')
question = soup.find("h1", {"itemprop": "name"})
# attrs must be a dict, not a set literal
follows = soup.find("span", {"class": "follow-text"})
answers = soup.find("span", {"class": "D-n"})
question.text
answers.text

AttributeError: 'NoneType' object has no attribute 'text'

# Success! Now let's automate this!
The code below automatically visits each of the 500 result links to grab 3 items:
1. The question
2. The number of followers
3. The number of answers

It keeps everything as text; follower counts that come back as strings like '5.4k' will be handled later.
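
One possible way to handle abbreviated counts such as '5.4k' is sketched below. The helper name `parse_count` and the set of suffixes it accepts are assumptions made for illustration; the notebook itself does not define how these strings should be converted.

```python
def parse_count(text):
    """Convert a count string such as '37', '1,204' or '5.4k' to an int.

    Returns None when the text is missing or is not a recognisable number.
    """
    if text is None:
        return None
    cleaned = str(text).strip().lower().replace(',', '')
    multipliers = {'k': 1000, 'm': 1000000}  # assumed suffixes
    factor = 1
    if cleaned and cleaned[-1] in multipliers:
        factor = multipliers[cleaned[-1]]
        cleaned = cleaned[:-1]
    try:
        return int(round(float(cleaned) * factor))
    except (ValueError, OverflowError):
        return None
```

Returning None for anything unparseable means a stray block of text scraped into a count field does not stop the whole run.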

**We might want to sleep for a couple of seconds between iterations so the site doesn't boot us**
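
A minimal sketch of that idea follows. The function name `fetch_politely`, its parameters, and the default delay range are hypothetical choices for illustration, not something the notebook prescribes. The fetching function is passed in, so the same loop works with `requests.get` or with a Selenium-based fetcher.

```python
import random
import time


def fetch_politely(links, fetch, min_delay=1.0, max_delay=3.0):
    """Call `fetch(link)` for each link, sleeping a random interval between
    consecutive requests. Returns the results in the same order as `links`."""
    results = []
    for i, link in enumerate(links):
        if i > 0:
            # Pause before every request except the first one.
            time.sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(link))
    return results
```

For example, `fetch_politely([r['link'] for r in all_results], requests.get)` would return one response object per link. A randomized delay is used here only because a perfectly regular request rhythm is easy to spot; a fixed `time.sleep(2)` would also do.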

In [42]:
'''
This takes about 1 - 2 seconds per iteration

A small bug shows up here when a URL is not formatted correctly.
It happened when one of the URLs had some junk in front of it ('/url?url=')
I'll fix this later
'''
questions = []
for result in all_results:
    link = result['link']
    #driver.get(link)
    #soup = BeautifulSoup(driver.page_source, 'html.parser')
    #question_area = soup.find("div", { "class" : "QuestionArea" })
    #question_span = question_area.find("span", {"class": "rendered_qtext"})
    #answers = soup.find("div", {"class", "QuestionPageAnswerHeader"})
    #answers_num = answers.text.split()[0]
    #follows_num = soup.find('span', {'class', 'count'}).text
    #questions.append([question_span.text, follows_num, answers_num])
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'html.parser')
    question_f = soup.find("h1", {"itemprop": "name"})
    # attrs must be a dict, not a set literal
    follows_f = soup.find("span", {"class": "follow-text"})
    answers_f = soup.find("span", {"class": "D-n"})
    
    if question_f is not None:
        question = question_f.text
        follows = -1
        answers = -1
        if follows_f is not None:
            follows = follows_f["data-ya-fc"]
        if answers_f is not None:
            answers = answers_f.text
        
        questions.append([link.strip(), question.strip(), follows, answers])
    else:
        questions.append([link.strip(), None, None, None])
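
The malformed-URL bug mentioned in the comment above could be handled by unwrapping the link before requesting it. The sketch below assumes the bad links look like '/url?url=' followed by the real address, which is all the comment tells us; the helper name `clean_link` is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def clean_link(link):
    """Return the target URL when `link` is a redirect wrapper of the form
    '/url?url=<target>'; otherwise return the link stripped of whitespace."""
    link = link.strip()
    parsed = urlparse(link)
    if parsed.path == '/url':
        params = parse_qs(parsed.query)
        if 'url' in params:
            return params['url'][0]
    return link
```

In the loop, `link = clean_link(result['link'])` would replace the plain lookup. One caveat: if the wrapped address contains an unescaped `&` in its own query string, everything after it would be cut off, so the result is worth spot-checking.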

# pandas - easy data manipulation
pandas is an awesome 3rd party library that makes data manipulation a breeze

In [44]:
'''
Use the main pandas object, the DataFrame, to hold all the data and inspect the top 20 rows
'''
df_questions = pd.DataFrame(questions, columns=['URL', 'Question', 'Follows', 'Answers'])
#df_questions['Question'] = df_questions['Question'].map(lambda x: x.strip())
#df_questions['Follows'] = df_questions['Follows'].map(lambda x: int(x))
#df_questions['Answers'] = df_questions['Answers'].map(lambda x: int(x))
df_questions.head(20)

Unnamed: 0,URL,Question,Follows,Answers
0,https://answers.yahoo.com/question/index?qid=20100802084437AAEixwf,Should I justify my cover letter?,1,2
1,https://au.answers.yahoo.com/question/index?qid=20100723074700AAJGCyK,How do you know if you should be admitted to a mental hospital (inpatient or outpatient)?,0,"How do you know if you should be admitted to a mental hospital (inpatient or outpatient)? \r\nPlease don't say 'see a psychiatrist', because I kn..."
2,https://au.answers.yahoo.com/question/index?qid=20070719171425AAsXirQ,SIX FLAGS- What exactly should i take n wear?,1,8
3,https://nz.answers.yahoo.com/question/index?qid=20090314175642AA76T8c,How much water should my 10 month old be drinking?,1,"My daughter is starting to lose interest in her breast feed in the middle of the day, so I still offer and sometimes she drinks for 5 minutes and..."
4,https://ca.answers.yahoo.com/question/index?qid=20110521190937AAE9pFP,Should i leave my gambling addcited husband?,0,I have discovered that my husband has had a gambling addiction the entire time i have known him. I was a single mother of a two year old and had...
5,https://answers.yahoo.com/question/index?qid=20060811152848AAqjaqh,How long should i walk and how many days a week in order to lose weight?,0,13
6,https://answers.yahoo.com/question/index?qid=20090111160824AAJUuro,Should I eat every 2 or 3 hours?,5,37
7,https://answers.yahoo.com/question/?qid=20081220040505AAMI4Xe,How fast should my internet connection speed be to play PS3 online?,0,"I have Sky Broadband Connect which offers me up to 8mb download speeds (but I can tell its not). I went to ""www.speedtest.com"" and it says my dow..."
8,https://answers.yahoo.com/question/index?qid=20090131135802AAfe1Vy,Should this be capitalized?,0,4
9,https://answers.yahoo.com/question/?qid=20100422200801AAsWXIB,Should i read The Iliad or The Odyssey first?,0,"im 13 and LOVE greek mythology. i can read complex stories, and at the age of 8 i was able to read at a 12th grade level. i know that the iliad w..."


In [39]:
'''
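Sort by most follows

The Follows column mixes ints (the -1 placeholder) and strings (the scraped values),
which cannot be compared with each other, so convert the column to numbers first.
Values that cannot be parsed become NaN.
'''
pd.options.display.max_colwidth = 150
df_questions['Follows'] = pd.to_numeric(df_questions['Follows'], errors='coerce')
df_questions.sort_values('Follows', ascending=False, inplace=True)
df_questions.head(20)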

In [46]:
'''
Save the results to a CSV file
'''
df_questions.to_csv('Results/Yahoo! Answers - Srape Results.csv', index=False)