# Emma's February 2018 Research Notebook

* Download webpages locally using Selenium
* Use Beautiful Soup to identify Knowledge Panels and Top Stories
* Compare results with the January 2018 results

### This notebook scrapes data from Google's SERP (search engine results page). I am particularly interested in the Knowledge Panel. The notebook (1) reads a spreadsheet of 7269 local news sites, (2) sends a Selenium driver instance to download the SERP for each query, (3) uses Beautiful Soup to analyze the SERP for key elements, (4) stores the results in a CSV file, and (5) compares the February results with the January results.

In [11]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "sample kp.png")

#### Libraries

To install Selenium on your computer, follow these instructions: http://selenium-python.readthedocs.io/installation.html

In [2]:
from selenium import webdriver
import pandas as pd
from selenium.common.exceptions import InvalidElementStateException
import time


## Download Sites Locally

Read the spreadsheet of local news sites into a pandas DataFrame.

In [2]:
usnpl_df = pd.read_excel('news_sites_usnpl.xlsx')

In [3]:
usnpl_df.head()

Unnamed: 0,sites,urls
0,Sand Mountain Reporter,http://www.sandmountainreporter.com/
1,Alexander City Outlook,http://www.alexcityoutlook.com/
2,Andalusia Star-News,http://www.andalusiastarnews.com/
3,Anniston Star,http://www.annistonstar.com/
4,Arab Tribune,http://www.thearabtribune.com/


Create a list of the site names.

In [4]:
sites = list(usnpl_df['sites'])

### Selenium Download

* On my computer, it takes about 3 minutes to download the results for 100 sites.
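
At that rate, a rough estimate of the total download time follows from simple arithmetic. A minimal sketch, assuming the rate stays constant over the whole run (it may not, for example when Google shows a captcha); the function name `estimated_minutes` is only for illustration.

```python
from __future__ import division, print_function

SITES_TOTAL = 7269       # number of sites, from the introduction
MINUTES_PER_100 = 3      # observed rate on my computer (approximate)

def estimated_minutes(n_sites, minutes_per_100=MINUTES_PER_100):
    """Estimated download time in minutes for n_sites queries."""
    return n_sites * minutes_per_100 / 100

total = estimated_minutes(SITES_TOTAL)
print("all sites: about %.0f minutes (%.1f hours)" % (total, total / 60))

# Time left when resuming from index 3280, as in the cell below
remaining = estimated_minutes(SITES_TOTAL - 3280)
print("remaining from 3280: about %.0f minutes" % remaining)
```

So a full run over all 7269 sites should take a bit over three and a half hours under this assumption.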


In [5]:
done = False
i = 3280

In [8]:
driver = webdriver.Chrome() #make sure you either specify a path to chromedriver or have it on your PATH
driver.get('http://www.google.com/xhtml')

while not done: 
    try: 
        time.sleep(0.1) #brief pause; should be replaced with an implicit wait
        
        search_box = driver.find_element_by_name('q')
        search_box.clear()
        search_box.send_keys(sites[i]) #search for query
        search_box.submit() 
        
        with open('feb_results/'+ str(i) + '_' + sites[i].replace('/', ' ') + '.html', 'w') as f: #replace is used b/c 1+ local news site has a / in its name
            f.write(driver.page_source.encode('ascii', 'ignore').decode('ascii'))
        i+=1
        
        if i % 100 == 0: #tracker for your own sanity of how much progress you have made
            print i,
            
        if i == len(sites): #is len(sites) b/c increment before this line of code
            done = True
        
 
        
    except InvalidElementStateException:
        #restart driver instance only when Google requires captcha
        driver = webdriver.Chrome()
        driver.get('http://www.google.com/xhtml')
        print i,

 5900 6000 6100 6200 6300 6400 6500 6600 6700 6800 6900 7000 7100 7200
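
The filename scheme used in the loop, and the way the analysis section later recovers the site name from it, can be tried out without a browser. This is only an illustrative sketch: the helper names `to_ascii`, `result_filename`, and `site_name_from_filename` are hypothetical, and the sample site names are made up.

```python
from __future__ import print_function

def to_ascii(text):
    """Drop every non-ASCII character, as done before writing the page source."""
    return text.encode('ascii', 'ignore').decode('ascii')

def result_filename(i, site_name):
    """Build '<index>_<site name>.html'; '/' is replaced so the name is a valid filename."""
    return str(i) + '_' + site_name.replace('/', ' ') + '.html'

def site_name_from_filename(filename):
    """Recover the site name the same way the analysis loop does."""
    return ' '.join(filename.replace('.html', '').split('_')[1:])

name = result_filename(12, "Foo/Bar News")   # made-up site name
print(name)
print(site_name_from_filename(name))
print(to_ascii(u"caf\u00e9 news"))
```

Two caveats show up here: the replaced `/` is not restored when the name is read back, and a site name that itself contains an underscore comes back with a space in its place.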


### Analyze locally stored results

In [190]:
from bs4 import BeautifulSoup
import os
from collections import Counter
import difflib 
import re

In [203]:
local_results = os.listdir('feb_usnpl_results')

In [213]:
results = []
i = 0
for site in local_results:
    soup = BeautifulSoup(open('feb_usnpl_results/' + site, "r"), 'html.parser')
    kp = soup.find_all("div",  class_="knowledge-panel")
    for script in soup(["script", "style"]):
        script.extract()    # rip it out
    
    site_name = ' '.join(site.replace('.html', '').split('_')[1:])
        #site_name = ''.join(filter(None,re.split('(\d+)', site.replace('.html','')))[:-1])
    if len(kp) != 0:
        #print len(kp)
        heading = kp[0].find("div", class_="_Q1n")
            #in Feb 2018, searching for 'Olympia' returns an odd result about the Olympics without a heading
        if heading and difflib.get_close_matches(site_name, [heading.find("span").text],
                                                              cutoff = 0.7):
            kp_bool = True
            kp = kp[0]
            writes_about = False

            tab_list = kp.find("div", class_="_knw")
            if tab_list: 
                tab_list = tab_list.find_all("span") #list of spans including writes about, reviewed claims, awards
            else: 
                tab_list = []
            other_topic_write_about_format = kp.find_all("div", class_= '_Obi')
            alternate_description = False
            if  any("Writes about" in x for x in tab_list) or any("write about" in x.text for x in other_topic_write_about_format):
                if  any("write about" in x.text for x in other_topic_write_about_format): 
                    alternate_description = True
                writes_about = True
                #topics = [topic.text for topic in kp.find_all("div", class_= "_Uut")]

            awards = False
            if  any("Awards" in x for x in tab_list):
                awards = True

            reviewed_claims = False
            if  any("Reviewed claims" in x for x in tab_list):
                reviewed_claims = True
            wikipedia = 'Wikipedia' in kp.text
        
        else: 
            kp_bool = False
            writes_about = False
            awards = False
            wikipedia = False
            reviewed_claims = False
            alternate_description = False
        
    else:
        kp_bool = False
        writes_about = False
        awards = False
        wikipedia = False
        reviewed_claims = False
        alternate_description = False
        
    top_stories_list = soup.find_all("h3", class_='_MRj') #the location of top stories if it exists
    top_stories = bool(len(top_stories_list))
    
    results.append([kp_bool, site_name, writes_about, awards, reviewed_claims, wikipedia, top_stories])
        
    i+=1
    if i %100 == 0:
        print i,

100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100 3200 3300 3400 3500 3600 3700 3800 3900 4000 4100 4200 4300 4400 4500 4600 4700 4800 4900 5000 5100 5200 5300 5400 5500 5600 5700 5800 5900 6000 6100 6200 6300 6400 6500 6600 6700 6800 6900 7000 7100 7200
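
The heading check in the loop above relies on `difflib.get_close_matches` with `cutoff = 0.7`. Here is a small sketch of how that cutoff behaves. `heading_matches` is a hypothetical helper, and the heading strings are invented for illustration rather than copied from real SERPs.

```python
from __future__ import print_function
import difflib

def heading_matches(site_name, heading_text, cutoff=0.7):
    """True when the panel heading is a close enough match for the site name."""
    matches = difflib.get_close_matches(site_name, [heading_text], cutoff=cutoff)
    return bool(matches)

# A heading that only adds "The " still counts as a match
print(heading_matches("Anniston Star", "The Anniston Star"))

# A panel about a different topic is rejected
print(heading_matches("Olympia", "2018 Winter Olympics"))
```

The first call prints `True` and the second prints `False`, which is the behavior that keeps an unrelated panel from being counted for a site with a similar name.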


In [214]:
df = pd.DataFrame(results, columns= ['Knowledge Panel', 'Site name', 'WritesAbout',
                           'Awards', 'ReviewedClaims', 'Wikipedia', 'Top Stories'])

In [216]:
df.head(10)

Unnamed: 0,Knowledge Panel,Site name,WritesAbout,Awards,ReviewedClaims,Wikipedia,Top Stories
0,False,Sand Mountain Reporter,False,False,False,False,True
1,False,Montrose Daily Press,False,False,False,False,True
2,False,Tri Lakes Tribune,False,False,False,False,False
3,True,Nederland Mountain-Ear,False,False,False,False,False
4,False,Left Hand Valley Courier,False,False,False,False,False
5,False,Norwood Post,False,False,False,False,False
6,False,Sunshine Express,False,False,False,False,False
7,True,Plaindealer,False,False,False,True,False
8,True,Pagosa Springs Daily Post,False,False,False,False,False
9,False,Pagosa Springs Sun,False,False,False,False,True


In [215]:
df.to_csv('feb_usnpl_results.csv')

In [None]:
Counter(df['WritesAbout'])

In [None]:
### Comparing January and Feb Results

In [None]:
jan_df = pd.read_csv('jan_usnpl_results.csv')
feb_df = pd.read_csv('feb_usnpl_results.csv')

In [135]:
jan_df.head()

Unnamed: 0.1,Unnamed: 0,Site name,Has Top Stories?,Has Knowledge Panel?,Label,Has Wikipedia?,Has Writes About?,Has Awards?,Has Reviewed Claims?
0,0,27east.com,False,False,,False,False,False,False
1,1,27east.com,False,False,,False,False,False,False
2,2,27east.com,False,False,,False,False,False,False
3,3,2HaveFunIn Minnesota,False,False,,False,False,False,False
4,4,417 Magazine,False,False,,False,False,False,False


In [136]:
feb_df.head()

Unnamed: 0.1,Unnamed: 0,Site name,Has Top Stories?,Has Knowledge Panel?,Label,Has Wikipedia?,Has Writes About?,Has Awards?,Has Reviewed Claims?
0,0,Sand Mountain Reporter,False,False,,False,False,False,False
1,1,Montrose Daily Press,False,False,,False,False,False,False
2,2,Tri Lakes Tribune,False,False,,False,False,False,False
3,3,Nederland Mountain-Ear,True,True,,False,False,False,False
4,4,Left Hand Valley Courier,False,False,,False,False,False,False
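
Since both files have a `Site name` column, the months can also be compared site by site. Below is a minimal sketch with plain dictionaries and toy rows. The site names and values are hypothetical, the function names are my own, and keeping the first row for a repeated site name (as with the repeated rows in the January output above) is an assumption about how duplicates should be handled.

```python
from __future__ import print_function

def panel_flags(rows):
    """Map site name -> has-panel flag, keeping the first row seen for each site."""
    flags = {}
    for name, has_panel in rows:
        flags.setdefault(name, has_panel)
    return flags

def gained_panel(jan_rows, feb_rows):
    """Sites with a panel in February but not in January (absent in January counts as no panel)."""
    jan = panel_flags(jan_rows)
    feb = panel_flags(feb_rows)
    return sorted(name for name in feb if feb[name] and not jan.get(name, False))

# Toy rows for illustration only, not taken from the real CSV files
jan_rows = [("Site A", False), ("Site A", False), ("Site B", True), ("Site C", False)]
feb_rows = [("Site A", True), ("Site B", True), ("Site C", False)]
print(gained_panel(jan_rows, feb_rows))
```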


In [139]:
Counter(jan_df['Has Knowledge Panel?'])

Counter({False: 4513, True: 2756})

In [140]:
Counter(feb_df['Has Knowledge Panel?'])

Counter({False: 4485, True: 2784})

In [142]:
Counter(feb_df['Has Writes About?'])

Counter({False: 7114, True: 155})

In [143]:
Counter(jan_df['Has Writes About?'])

Counter({False: 7200, True: 69})
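
The four counts above can be put side by side to see the month-over-month change. A short sketch that copies the `Counter` outputs by hand instead of re-reading the CSV files; `share_true` is a hypothetical helper name.

```python
from __future__ import division, print_function
from collections import Counter

# Counts copied from the Counter outputs above
jan = {
    "Has Knowledge Panel?": Counter({False: 4513, True: 2756}),
    "Has Writes About?": Counter({False: 7200, True: 69}),
}
feb = {
    "Has Knowledge Panel?": Counter({False: 4485, True: 2784}),
    "Has Writes About?": Counter({False: 7114, True: 155}),
}

def share_true(counts):
    """Fraction of sites for which the flag is True."""
    return counts[True] / sum(counts.values())

for column in sorted(jan):
    change = feb[column][True] - jan[column][True]
    print("%s Jan %.1f%% -> Feb %.1f%% (change in sites: %+d)" % (
        column, 100 * share_true(jan[column]), 100 * share_true(feb[column]), change))
```

From January to February, 28 more sites have a Knowledge Panel, and the number of panels with a "Writes about" section more than doubles, from 69 to 155.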