In [21]:
import json
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
import time

# Make sure you download Anaconda
[Download Anaconda](https://www.continuum.io/downloads)
1. Anaconda is a popular Python distribution for data science. It ships with hundreds of third-party libraries, including many for machine learning
2. You get the excellent Jupyter notebook, which is where this material is being produced. Start it from the command prompt with **jupyter notebook**
3. Install new packages with **conda install packagename**
4. There is a lot more to Jupyter notebook: http://jupyter.org/, http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html

# Scraping Google
Usually, a simple GET request from the Python **requests** library is all it takes to retrieve the underlying HTML from a website. For reasons not yet understood, this does not work at all for Google search requests.

Because of this, the third-party tool [GoogleScraper](https://github.com/NikolaiT/GoogleScraper) is run from the command prompt instead. It returns detailed metadata for each scraped Google result and has many options for shaping the output. Read the docs for more info.

### What does the GoogleScraper command below return?
1. JSON data for the first 10 pages of Google search results, with 50 results per page, for a total of 500 items
2. It specifically searches Quora for any appearance of the word "should"
3. The only metadata we are really after is the URL of each result, but several other items are returned (see below)

In [22]:
'''
************** This is a python string but can be thought of as a comment ***********

Run the command inside of ipython notebook! Very cool feature
'''
%system GoogleScraper -m http -p 10 -n 50 -q "should site:quora.com" --output-filename quora_june.json

['2016-06-28 01:07:25,825 - GoogleScraper.caching - INFO - 10 cache files found in .scrapecache/',
 '2016-06-28 01:07:25,825 - GoogleScraper.caching - INFO - 10/10 objects have been read from the cache. 0 remain to get scraped.']

# Open the file
We saved the above results into the file **quora_june.json**. We open it and use the json library to load it into a Python list.

In [23]:
with open("quora_june.json", 'r') as f:
    data  = json.load(f)

In [24]:
'''
What type is our object?
'''
type(data)

list

In [25]:
'''
How many items in the list?
'''
len(data)

10

In [26]:
'''
Let's inspect the first item and see what type it is
'''
first = data[0]
type(first)

dict

In [27]:
'''
Since we have a dictionary, let's see its keys
'''
first.keys()

dict_keys(['num_results', 'page_number', 'results', 'num_results_for_query', 'effective_query', 'status', 'requested_by', 'search_engine_name', 'requested_at', 'no_results', 'id', 'query', 'scrape_method'])

In [28]:
'''
Let's look at the values of some of these keys
'''
first['query'], first['id'], first['num_results_for_query']

('should site:quora.com', '1', 'About 1,800,000 results (0.52 seconds)\xa0')
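
The 'num_results_for_query' value above is a display string, not a number. If the count is ever needed as an integer, a small regular expression can pull it out. This is only a sketch: `parse_result_count` is a made-up helper name, and it assumes the string keeps the "About N results" shape seen above.

```python
import re

def parse_result_count(text):
    """Return the integer in a string like 'About 1,800,000 results (0.52 seconds)'.

    Returns None when no such number is found.
    """
    # Grab a run of digits and commas that sits right before 'result(s)'
    match = re.search(r'(\d[\d,]*)\s+results?', text)
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))

total = parse_result_count('About 1,800,000 results (0.52 seconds)\xa0')
print(total)
```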

In [29]:
'''
Some further investigation shows that the actual results are in the 'results' key
'''
first_results = first['results']
type(first_results)

list

In [30]:
len(first_results)

50

In [31]:
type(first_results[0])

dict

In [32]:
'''
An example result
'''
first_results[0]

{'domain': 'www.quora.com',
 'id': '1',
 'link': 'https://www.quora.com/Why-should-I-vote-for-Bernie-Sanders',
 'link_type': 'results',
 'rank': '1',
 'serp_id': '1',
 'snippet': 'Why should the upper class vote for Bernie Sanders?  What are some reasons to not vote for Bernie Sanders?  Why would, or should, a Republican vote for Bernie Sanders?',
 'title': 'Why should I vote for Bernie Sanders? - Quora',
 'visible_link': 'https://www.quora.com/Why-should-I-vote-for-Bernie-Sanders'}

In [33]:
'''
Looks like the 'results' key holds a list of length 50
Each item in the list is a Python dictionary with all the important metadata
Let's put all this data into one list: a list of Python dictionaries
'''

all_results = []
for d in data:
    all_results.extend(d['results'])

In [34]:
'''
We should have 500 pieces of metadata
'''
len(all_results)

501
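
The total is one more than the 500 we expected, so at least one page did not return exactly 50 results. A quick way to see which one is to count the results page by page. The sketch below uses a tiny made-up `sample_pages` list in place of `data` so it runs on its own; `results_per_page` is a hypothetical helper, and it assumes only the 'page_number' and 'results' keys shown earlier.

```python
def results_per_page(pages):
    """Map each page's 'page_number' to the number of results it holds."""
    return {page['page_number']: len(page['results']) for page in pages}

# Made-up stand-in for `data`: two pages with 2 and 3 results
sample_pages = [
    {'page_number': 1, 'results': [{'link': 'a'}, {'link': 'b'}]},
    {'page_number': 2, 'results': [{'link': 'c'}, {'link': 'd'}, {'link': 'e'}]},
]

counts = results_per_page(sample_pages)
print(counts)
print(sum(counts.values()))
```

On the real data, `results_per_page(data)` should show which page holds the extra item.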

In [35]:
'''
Success! We have 501 pieces of metadata, one more than the 500 expected, each describing a search result
For example, take a look at the 101st one
'''
sample_result = all_results[100]
sample_result

{'domain': 'www.quora.com',
 'id': '251',
 'link': 'https://www.quora.com/Laundry-How-should-you-wash-clothes',
 'link_type': 'results',
 'rank': '1',
 'serp_id': '6',
 'snippet': 'Keep in mind that warmer water increases the possibility of dye bleeding, so whites and light colors should always be washed separately from bright or dark\xa0...',
 'title': 'Laundry: How should you wash clothes? - Quora',
 'visible_link': 'https://www.quora.com/Laundry-How-should-you-wash-clothes'}

In [36]:
'''
Looks like the URL is in the 'link' key of the metadata. Let's look at it by itself
'''
link = sample_result['link']
link

'https://www.quora.com/Laundry-How-should-you-wash-clothes'

# Inspecting html
Now that we know how to access each link from GoogleScraper, we will manually inspect the page to find the question and some other metadata associated with that page. For a regular page like this one, the requests library can fetch the HTML.

In [37]:
'''
This command issues a GET request and returns a 'response' object
'''
response = requests.get(link)

In [38]:
'''
The 'text' attribute of the response object contains the HTML source of the page as text,
similar to what you see when you inspect the page with your browser's developer tools

BeautifulSoup is a Python library that parses the gigantic mess of HTML underneath
'''
soup = BeautifulSoup(response.text, 'html.parser')

In [39]:
'''
After manually inspecting the page, it looks like the question is always
in a div with class "QuestionArea". This is of course subject to change but should generally be stable

The actual text of the question is in a span with class "rendered_qtext"
'''
question_area = soup.find("div", { "class" : "QuestionArea" })
question_span = question_area.find("span", {"class": "rendered_qtext"})

In [40]:
'''
Let's take a look at the text. The variable question_span is still a BeautifulSoup object
'''
question_span.text

# Requests library is not enough :(
After some more work not shown here, it turned out that the requests library alone is not enough: it fetches the page without being logged in to Google. Being logged in is very important, because it gives access to the number of followers of a question.

# Use selenium to manipulate page
selenium is a third-party Python library that lets you drive a real browser and manipulate a webpage. A Chrome driver had to be downloaded first (so this part won't work in your local notebook unless you download it yourself and change the path). Phu created a new username and password dedicated to scraping Quora.

The code below might not work on your machine on the first try, because Google asked me (Teddy in Houston) where the user normally logs in from. After I entered that info, the code below should work every time.

It sleeps for 3 seconds between entering the username and the password to make sure the browser has time to reach the password page.

In [42]:
'''
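Selenium logs into Google through the browser, step by step. Beforehand I created a Quora account with the Gmail address below.

A new window should pop up and you can actually see the magic happen before your eyes.

Eventually Chrome will navigate to the "link" variable above (look up several cells)

Note: this is written against the Selenium API of 2016. Selenium 4 removed the find_element_by_*
helpers and the executable_path argument, so newer versions need find_element(By.ID, ...) and a Service object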
'''
driver_location = "./chromedriver"
driver = webdriver.Chrome(executable_path=driver_location)
driver.get("http://www.google.com")
driver.find_element_by_class_name('gb_Me').click()
inputElement = driver.find_element_by_id("Email")
inputElement.send_keys('bsmithfun2016@gmail.com')
inputElement.send_keys(Keys.ENTER)

time.sleep(3)

inputElement = driver.find_element_by_id("Passwd")
inputElement.send_keys('qwerty123!!!')
inputElement.send_keys(Keys.ENTER)
driver.get(link)

In [43]:
'''
Now let's get the number of follows
'''
soup = BeautifulSoup(driver.page_source, 'html.parser')
follows_num = soup.find('span', {'class': 'count'}).text
follows_num

'40'

# Success! Now let's automate this!
The code below automatically navigates to each of the Quora links in all_results and grabs 3 items
1. The question
2. The number of followers
3. The number of answers

It keeps everything as text. We will handle follower counts that come back as '5.4k' later.

**We might want to sleep for a couple of seconds between iterations so Quora doesn't block us**
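
One way to add that pause without cluttering the scraping loop is a small generator that sleeps between items. This is a sketch only: `polite_iter` and its default delays are made up for illustration, and whether a 2 to 4 second pause is enough for Quora is a guess.

```python
import random
import time

def polite_iter(items, min_delay=2.0, max_delay=4.0):
    """Yield items one by one, sleeping a random interval between consecutive items."""
    for i, item in enumerate(items):
        if i > 0:
            # No need to wait before the very first item
            time.sleep(random.uniform(min_delay, max_delay))
        yield item

# Zero delay here just to show the behavior without waiting
for item in polite_iter(['a', 'b', 'c'], min_delay=0, max_delay=0):
    print(item)
```

In the loop over the search results, this would be used as `for result in polite_iter(all_results):`.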

In [None]:
'''
This takes about 1 - 2 seconds per iteration

A small bug shows up here if the URL is not formatted correctly.
It happened when one of the URLs had some junk in front of it ('/url?url=')
I'll fix this later
'''
questions = []
for result in all_results:
    link = result['link']
    driver.get(link)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    question_area = soup.find("div", { "class" : "QuestionArea" })
    question_span = question_area.find("span", {"class": "rendered_qtext"})
    answers = soup.find("div", {"class": "QuestionPageAnswerHeader"})
    answers_num = answers.text.split()[0]
    follows_num = soup.find('span', {'class': 'count'}).text
    questions.append([question_span.text, follows_num, answers_num])
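
The malformed-URL problem mentioned in the comment above (a link with '/url?url=' in front of it) could be handled by cleaning each link before calling driver.get. The sketch below is one possible fix, not the one used here: `clean_link` is a hypothetical helper, and it assumes the junk always has the shape '/url?url=<real link>'.

```python
from urllib.parse import urlparse, parse_qs

def clean_link(link):
    """Return the real target when a link looks like '/url?url=<target>', else the link unchanged."""
    parsed = urlparse(link)
    if parsed.path == '/url':
        # parse_qs returns a dict of lists, e.g. {'url': ['https://...']}
        target = parse_qs(parsed.query).get('url')
        if target:
            return target[0]
    return link

print(clean_link('/url?url=https://www.quora.com/Laundry-How-should-you-wash-clothes'))
print(clean_link('https://www.quora.com/Why-should-I-vote-for-Bernie-Sanders'))
```

Inside the loop this would become `link = clean_link(result['link'])`. A target that itself contains an unescaped '&' would be cut short by parse_qs, so this is only a first pass.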

# pandas - easy data manipulation
pandas is an awesome 3rd party library that makes data manipulation a breeze

In [None]:
'''
Use pandas' main object, the DataFrame, to hold all the data and inspect the top 20 rows
'''
df_questions = pd.DataFrame(questions, columns=['Question', 'Follows', 'Answers'])
df_questions.head(20)

# Convert strings to ints
This could have been done in the loop above, but I wanted to inspect the data first. Let's take a look at the last character of the Follows column

In [None]:
'''
Look at the frequency of occurrence of the last character in the Follows column
'''
df_questions['Follows'].str.get(-1).value_counts()

In [None]:
'''
They are all numbers except for 'k'

This line might be hard to follow but basically...
it multiplies the number by 1000 if it ends in k else just makes it an int
and reassigns it to the same column
'''
df_questions['Follows'] = df_questions['Follows'].map(lambda x: int(float(x[:-1]) * 1000) if x[-1] == 'k' else int(x))
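
If the one-line lambda is hard to follow, the same conversion can be written as a named function and passed to map. This is a sketch, with `parse_follows` as a made-up name. It rounds instead of truncating to guard against floating-point error, and stripping commas is an extra assumption about formats that may never appear in this data.

```python
def parse_follows(text):
    """Convert a follower string such as '40' or '5.4k' to an int."""
    text = text.strip().lower().replace(',', '')
    if text.endswith('k'):
        # '5.4k' -> 5.4 * 1000, rounded to the nearest whole number
        return int(round(float(text[:-1]) * 1000))
    return int(text)

for raw in ['40', '5.4k', '12k']:
    print(raw, parse_follows(raw))
```

With the DataFrame above, this would be applied as `df_questions['Follows'].map(parse_follows)`.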

In [None]:
'''
Let's do the same thing for the Answers column
'''
df_questions['Answers'].str.get(-1).value_counts()

In [None]:
'''
The + sign is the only non-numeric character

Let's take a look at those rows
'''
df_questions[df_questions['Answers'].map(lambda x: '+' in x)]

In [None]:
'''
Looks like the number of answers stops at 100

Make conversion
'''
df_questions['Answers'] = df_questions['Answers'].map(lambda x: 100 if '+' in x else int(x))
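
Mapping every value with a '+' to 100 throws away the fact that 100 is only a lower bound for those questions. If that matters later, the flag can be kept next to the number. A sketch, with `parse_answers` as a hypothetical helper that assumes the only non-numeric character is a trailing '+':

```python
def parse_answers(text):
    """Return (count, capped): capped is True when the string ended in '+'."""
    text = text.strip()
    capped = text.endswith('+')
    return int(text.rstrip('+')), capped

for raw in ['7', '100+']:
    print(raw, parse_answers(raw))
```

The second element could be stored in its own column so that capped questions can be filtered or treated separately.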

In [None]:
'''
Inspect data after transformation
'''
df_questions.head(20)

In [None]:
'''
Sort by most follows
'''
pd.options.display.max_colwidth = 150
df_questions.sort_values('Follows', ascending=False, inplace=True)
df_questions.head(20)

In [None]:
'''
Convert to csv
'''
df_questions.to_csv('questions.csv', index=False)