# Yahoo Groups message scraper

This notebook provides code that downloads messages from *private* Yahoo Groups. Another notebook, `YahooScra.jpynb`, uses [Selenium](http://selenium-python.readthedocs.io/); this one uses [requests](http://docs.python-requests.org/en/latest/user/install/).

Another GitHub.com repository ([YahooGroups-Archiver](https://github.com/andrewferguson/YahooGroups-Archiver)) provided guidance related to accessing *private* Yahoo Groups. After logging into the private group you need to provide two pieces of information from the cookies:

> Cookie information can be found through the use of a plug-in for
> your web browser. (I use 'Cookie Manager' on FireFox, although
> there are many other options for FireFox and other browsers). The
> two cookies you are looking for are called Y and T, and they are 
> linked to the domain yahoo.com. Extract the data from these 
> cookies, and paste it into the appropriate variables... a cookie
> will expire after a certain amount of time, which varies between 
> computers. This means that you may have to re-fetch the Y and T 
> cookie data every few days, or you will not be able to archive 
> private groups. ([YahooGroups-Archiver](https://github.com/andrewferguson/YahooGroups-Archiver))

(Last tested April 21, 2018)
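
As a minimal sketch of how the two cookie values feed into a request, the helpers below build the per-message URL and the cookie dictionary in the same shape the scraping cell uses. The names `API_ROOT`, `message_url` and `yahoo_cookies` are hypothetical and introduced only for illustration; the URL pattern is copied from the scraping cell.

```python
API_ROOT = 'https://groups.yahoo.com/api/v1/groups'


def message_url(group_name, msg_number):
    """Return the API URL for one message of a group."""
    return '{}/{}/messages/{}/'.format(API_ROOT, group_name, msg_number)


def yahoo_cookies(cookie_y, cookie_t):
    """Return the cookie dict for requests, refusing empty values."""
    if not cookie_y or not cookie_t:
        raise ValueError('Both the Y and T cookie values are required.')
    return {'Y': cookie_y, 'T': cookie_t}
```

With these in place, a single fetch would read `requests.get(message_url(grp_name, i), cookies=yahoo_cookies(cookie_Y, cookie_T))`. Failing early on empty cookies avoids saving a long run of "not found" files for a private group.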

In [1]:
import pandas as pd
from pandas import Series, DataFrame
from bs4 import BeautifulSoup
import requests
import json
import datetime
import time
import os

In [2]:
# Get the group name from the user.
grp_name = input('What is the group name you seek to scrape?  ')
# Fall back to a default group if nothing was entered.
if grp_name == '':
    grp_name = 'concatenative'

What is the group name you seek to scrape?  studentjudicial


In [None]:
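# These variables take the Y and T cookie values (see 'Cookie Manager' above) as strings.
cookie_T = ''
cookie_Y = ''

# Make sure the output folder exists before any files are written to it.
os.makedirs('msgs', exist_ok=True)

# Define a list to log each fetch attempt.
fetch_log = []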

# Iterate through the private group messages. The most recent group message number 
# can be found at: https://groups.yahoo.com/api/v1/groups/<groupname>/messages?count=10
# Parse the results to discern the most recent message number.
for i in range(1737,0,-1):
    fetch_log.append(
        ('{} Working on message {}.'.format(
            str(datetime.datetime.now()), str(i))))
    # For demonstration purposes, use a generic group such as 'concatenative'.
    
    # Alternate code approach available.
    # s = requests.Session()
    # post = s.get('https://groups.yahoo.com/api/v1/groups/' + grp_name + '/messages/' + str(i) + '/', cookies={'T': cookie_T, 'Y': cookie_Y})

    post = requests.get(
        'https://groups.yahoo.com/api/v1/groups/' + grp_name + '/messages/' + str(i) + '/', 
        cookies={'T': cookie_T, 'Y': cookie_Y})

    # Yahoo Groups api returns JSON.
    post_parsed = json.loads(post.text)
    
    # The api result uses html in the messageBody; remove html tags from messageBody.
    try:
        soup = BeautifulSoup(post_parsed['ygData']['messageBody'], 'html.parser')
        soupstring = soup.get_text()
    # If no message found for that index; populate with message not found.
    except KeyError:
        soupstring = 'Post number {} was not found.'.format(str(i))
    
    # Optionally remove comments from html
    # pullstring = soupstring[soupstring.find('<!--'):soupstring.find('-->')+3]
    # cleanstring = soupstring.replace(pullstring,'')
    # soupstring = cleanstring
    
    # Save the message body as a .txt file.
    post_path = os.path.join('msgs', grp_name + '_' + str(i) + '_post.txt')
    with open(post_path, 'w', encoding='utf-8') as post_file:
        post_file.write(soupstring)
    
    # Save the api result as a .json file.
    json_path = os.path.join('msgs', grp_name + '_' + str(i) + '_json.json')
    with open(json_path, 'w', encoding='utf-8') as json_file:
        json_file.write(post.text)
    
    # Optionally pause to assist in avoiding CAPTCHA and other anti-robot features.
    time.sleep(.1)
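
The commented-out lines in the loop remove only the first `<!-- ... -->` block from a message. A minimal sketch that removes every such block with a regular expression follows. `strip_html_comments` is a hypothetical helper name, and the pattern assumes comments are not nested.

```python
import re

# Non-greedy match so each comment is removed separately;
# DOTALL lets a comment span several lines.
_COMMENT_RE = re.compile(r'<!--.*?-->', flags=re.DOTALL)


def strip_html_comments(text):
    """Return text with every <!-- ... --> block removed."""
    return _COMMENT_RE.sub('', text)
```

It could be applied to `soupstring` just before the message body is written to disk.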

In [None]:
# Save fetch_log for later reference.
with open(grp_name + '_fetch_log_' + 
        str(datetime.datetime.now())[2:16].replace(" ", "-").replace(":","") + 
        '.log', mode='w') as logfile:
            print('This is the log of fetched messages file from {}'.format(
                str(datetime.datetime.now())), file=logfile)
            print('Yahoo Group name {}.'.format(grp_name), file=logfile)
            for fetch_line in fetch_log:
                print(fetch_line, file = logfile)
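
The slice-and-replace expression that builds the log file name is equivalent to a single `strftime` call. A sketch for comparison, using the hypothetical helper names `log_stamp` and `log_stamp_original`:

```python
import datetime


def log_stamp(moment):
    """Format a datetime as YY-MM-DD-HHMM using strftime."""
    return moment.strftime('%y-%m-%d-%H%M')


def log_stamp_original(moment):
    """The expression used in the notebook, kept for comparison."""
    return str(moment)[2:16].replace(' ', '-').replace(':', '')
```

Both return the same string for a given `datetime`, so `log_stamp(datetime.datetime.now())` can stand in for the longer expression.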

In [3]:
# Get location of messages to compile.
folder_location = input('What is the location of messages (no input will default to msgs)?  ')
# Give folder_location a default if no input.
if folder_location == '':
    folder_location = 'msgs'

What is the location of messages (no input will default to msgs)?  
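
The compile step walks a fixed range of message numbers. As an alternative sketch, the numbers can be read back from the file names the scraping loop wrote (`<group>_<n>_json.json`). `saved_message_numbers` is a hypothetical helper, not part of the original notebook.

```python
import os
import re


def saved_message_numbers(folder, group_name):
    """Return the message numbers with a saved JSON file, newest first."""
    pattern = re.compile(
        r'^' + re.escape(group_name) + r'_(\d+)_json\.json$')
    numbers = []
    for file_name in os.listdir(folder):
        match = pattern.match(file_name)
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers, reverse=True)
```

Iterating over `saved_message_numbers(folder_location, grp_name)` instead of a hard-coded `range` avoids `FileNotFoundError` entries for numbers that were never downloaded.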


In [17]:
# Build a dataset from the files created above.

# Define a list to log errors.
error_log  = []

# Define a list to hold structured data.
grandlist = []

# Iterate through the message files.
for i in range(33102,0,-1):
    try:
        json_path = os.path.join(
            folder_location, grp_name + '_' + str(i) + '_json.json')
        # The with block closes the file even if a KeyError is raised below.
        with open(json_path, 'r', encoding='utf-8') as work_file:
            work_parse = json.loads(work_file.read())
        
        # Build the current record from the JSON fields.
        mylist = [
            work_parse['ygData']['userId'],
            work_parse['ygData']['authorName'],
            work_parse['ygData']['subject'],
            work_parse['ygData']['postDate'],
            str(datetime.datetime.fromtimestamp(
                int(work_parse['ygData']['postDate'])).strftime('%Y-%m-%d %H:%M:%S')),
            work_parse['ygData']['msgId'],
            work_parse['ygData']['prevInTopic'],
            work_parse['ygData']['nextInTopic'],
            work_parse['ygData']['prevInTime'],
            work_parse['ygData']['nextInTime'],
            work_parse['ygData']['topicId'],
            work_parse['ygData']['numMessagesInTopic'],
            work_parse['ygData']['messageBody']]
        
        # Append the plain-text body saved alongside the JSON file.
        txt_path = os.path.join(
            folder_location, grp_name + '_' + str(i) + '_post.txt')
        with open(txt_path, 'r', encoding='utf-8') as work_file:
            mylist.append(work_file.read())

        # Add the current observation to the structured data set.
        grandlist.append(mylist)
    except FileNotFoundError:
        # If file not found, provide output and log error.
        print('Message number ' + str(i) + ' - Not found.')
        error_log.append('FileNotFoundError. Message number {}.'.format(str(i)))
    except KeyError:
        # If any of the JSON keys (variables) not found, provide output and log error.
        print('Message number ' + str(i) + ' - KeyError.')
        error_log.append('KeyError. Message number {}.'.format(str(i)))
    except OSError:
        # If OSError, provide output and log error.
        print('Message number ' + str(i) + ' - OSError.')
        error_log.append('OSError. Message number {}.'.format(str(i)))

Message number 33086 - KeyError.
Message number 32171 - KeyError.
Message number 31976 - KeyError.
Message number 31891 - KeyError.
Message number 31373 - KeyError.
Message number 31002 - KeyError.
Message number 30761 - KeyError.
Message number 30710 - KeyError.
Message number 30575 - KeyError.
Message number 29483 - KeyError.
Message number 29460 - KeyError.
Message number 29015 - KeyError.
Message number 28586 - KeyError.
Message number 27679 - KeyError.
Message number 27097 - KeyError.
Message number 27014 - KeyError.
Message number 27007 - KeyError.
Message number 26995 - KeyError.
Message number 26625 - KeyError.
Message number 26624 - KeyError.
Message number 25492 - KeyError.
Message number 25227 - KeyError.
Message number 24592 - KeyError.
Message number 23643 - KeyError.
Message number 22883 - KeyError.
Message number 22632 - KeyError.
Message number 22438 - KeyError.
Message number 22437 - KeyError.
Message number 22283 - KeyError.
Message number 22257 - KeyError.
Message nu

Message number 8888 - KeyError.
Message number 8872 - KeyError.
Message number 8871 - KeyError.
Message number 8864 - KeyError.
Message number 8863 - KeyError.
Message number 8862 - KeyError.
Message number 8858 - KeyError.
Message number 8851 - KeyError.
Message number 8845 - KeyError.
Message number 8844 - KeyError.
Message number 8839 - KeyError.
Message number 8835 - KeyError.
Message number 8830 - KeyError.
Message number 8829 - KeyError.
Message number 8810 - KeyError.
Message number 8809 - KeyError.
Message number 8800 - KeyError.
Message number 8797 - KeyError.
Message number 8791 - KeyError.
Message number 8788 - KeyError.
Message number 8721 - KeyError.
Message number 8718 - KeyError.
Message number 8716 - KeyError.
Message number 8715 - KeyError.
Message number 8666 - KeyError.
Message number 8662 - KeyError.
Message number 8625 - KeyError.
Message number 8614 - KeyError.
Message number 8613 - KeyError.
Message number 8612 - KeyError.
Message number 8611 - KeyError.
Message 

Message number 5291 - KeyError.
Message number 5288 - KeyError.
Message number 5269 - KeyError.
Message number 5241 - KeyError.
Message number 5195 - KeyError.
Message number 5181 - KeyError.
Message number 5169 - KeyError.
Message number 5158 - KeyError.
Message number 5154 - KeyError.
Message number 5152 - KeyError.
Message number 5131 - KeyError.
Message number 5103 - KeyError.
Message number 5046 - KeyError.
Message number 5042 - KeyError.
Message number 5035 - KeyError.
Message number 5002 - KeyError.
Message number 4998 - KeyError.
Message number 4976 - KeyError.
Message number 4964 - KeyError.
Message number 4927 - KeyError.
Message number 4907 - KeyError.
Message number 4878 - KeyError.
Message number 4876 - KeyError.
Message number 4875 - KeyError.
Message number 4845 - KeyError.
Message number 4844 - KeyError.
Message number 4802 - KeyError.
Message number 4788 - KeyError.
Message number 4781 - KeyError.
Message number 4780 - KeyError.
Message number 4775 - KeyError.
Message 

Message number 1783 - KeyError.
Message number 1732 - KeyError.
Message number 1658 - KeyError.
Message number 1650 - KeyError.
Message number 1638 - KeyError.
Message number 1556 - KeyError.
Message number 1498 - KeyError.
Message number 1463 - KeyError.
Message number 1462 - KeyError.
Message number 1386 - KeyError.
Message number 1377 - KeyError.
Message number 1365 - KeyError.
Message number 1338 - KeyError.
Message number 1308 - KeyError.
Message number 1299 - KeyError.
Message number 1285 - KeyError.
Message number 1284 - KeyError.
Message number 1245 - KeyError.
Message number 1244 - KeyError.
Message number 1125 - KeyError.
Message number 1096 - KeyError.
Message number 1086 - KeyError.
Message number 1085 - KeyError.
Message number 1083 - KeyError.
Message number 1078 - KeyError.
Message number 1077 - KeyError.
Message number 1075 - KeyError.
Message number 1074 - KeyError.
Message number 1073 - KeyError.
Message number 1061 - KeyError.
Message number 1052 - KeyError.
Message 
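
`datetime.datetime.fromtimestamp` called without a time zone converts `postDate` to the local time of the machine running the notebook, so the `Date` column depends on where the code runs. A sketch of a UTC conversion, using the hypothetical helper name `post_date_utc`:

```python
import datetime


def post_date_utc(post_date):
    """Convert a Unix timestamp (int or numeric string) to a UTC string."""
    moment = datetime.datetime.fromtimestamp(
        int(post_date), tz=datetime.timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')
```

For the `postDate` value 1541724012 that appears in the notebook's output, this gives `2018-11-09 00:40:12`, five hours ahead of the local value `2018-11-08 19:40:12` recorded there.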

In [None]:
# Save error_log for later reference.
with open(grp_name + '_err_log_' + 
        str(datetime.datetime.now())[2:16].replace(" ", "-").replace(":","") +
        '.log', mode='w') as logfile:
            print('This is the error log file from {}'.format(
                str(datetime.datetime.now())), file = logfile)
            print('Yahoo Group name {}.'.format(grp_name), file = logfile)
            for error_line in error_log:
                print(error_line, file = logfile)

In [25]:
# Define column list (With message body)
col_list = ['userId','authName','subject','Unix','Date',
            'msgId','preInTpc','nxtInTpc','preInTime',
            'nxtInTime','topicId','MsgsInTopic','msgRaw','msgTxt']

# Put structured data into a Pandas dataframe.
grand_df = DataFrame(grandlist, columns=col_list)

In [26]:
# Check results.
grand_df.head()

Unnamed: 0,userId,authName,subject,Unix,Date,msgId,preInTpc,nxtInTpc,preInTime,nxtInTime,topicId,MsgsInTopic,msgRaw,msgTxt
0,581375493,"Savage, Shannon",Filming and prop weapons policies,1541724012,2018-11-08 19:40:12,33102,0,0,33101,0,33102,1,"<div id=""ygrps-yiv-1920241537"">Hello all,<br/>...","Hello all,\n\nOur campus has recently had two ..."
1,558034509,Sara Ash,CBD and Policies,1541456954,2018-11-05 17:29:14,33101,0,0,33100,33102,33101,1,"<div id=""ygrps-yiv-79522852"">Hello everyone,<b...","Hello everyone,\n\nI was wondering if you had ..."
2,190928695,Ray Tuttle (rtuttle),Holding students accountable for not respectin...,1541442992,2018-11-05 13:36:32,33100,0,0,33099,33101,33100,1,"<div id=""ygrps-yiv-1523644933""><html>\n<head>\...",\n\n<!--\n#ygrps-yiv-1523644933 \n _filtered ...
3,569518549,Anthony Leger,RE: Student now not wanting a lawyer,1541438971,2018-11-05 12:29:31,33099,33098,0,33098,33100,33098,2,"<div id=""ygrps-yiv-1607611912"">Dave,<br/>\n<br...","Dave,\n\nI think it would depend on your polic..."
4,578133796,"Steward, David K",Student now not wanting a lawyer,1541436488,2018-11-05 11:48:08,33098,0,33099,33097,33099,33098,2,"<div id=""ygrps-yiv-1417733977"">Have an interes...",Have an interesting twist to a case.\n\nStuden...


In [None]:
# Save to CSV
grand_df.to_csv(grp_name + '_messages.csv')

In [29]:
import re

def remove_unicode(uni_txt):
    # uni_txt.replace(u'\xa0', ' ').encode('utf-8')
    # uni_txt.replace(u'\u2019', "'").encode('utf-8')
    # uni_txt.replace(u'\u2022', '-').encode('utf-8')
    # try:
    #     print(uni_txt[1153])
    # except:
    #     print(uni_txt[:1])
    # uni_txt.replace(u'\u2022', '-').replace(u'\u2019', "'").replace(u'\u201c', '"').replace(u'\u201d', '"').replace(u'\u2014', '-').replace(u'\u2018', "'").replace(u'\u200b', ' ').replace(u'\u2013', '-').replace(u'\ufddf', ' ').replace(u'\ufffd', ' ').replace(u'\u2026', '...').replace(u'\u2028', '... ...').replace(u'\u263a', ' smile ').replace(u'\u1427', '-').replace(u'\u04cf', '|').replace(u'\uf0d8', '|?|').replace(u'\u200e', ' ').replace(u'\U0001f603', ' ').replace(u'\u2039', '<').replace(u'\u201a', "'").replace(u'\u2011', '-').replace(u'\u2002', ' ').replace(u'\u0335', '-').replace(u'\u2502', '|').replace(u'\u0160', 'S').replace(u'\u2015', '-').replace(u'\u2212', '-').replace(u'\u2010', ' ').replace(u'\ufffc', ' ').encode('latin-1')
    # pln_txt = uni_txt.encode('ascii', 'ignore')
    # pln_txt uni_txt.encode('latin-1', 'ignore')
    
    
    
    uni_txt = re.sub('[\u2022]', '-', uni_txt)
    uni_txt = re.sub('[\u201c]', '"', uni_txt)
    uni_txt = re.sub('[\u201d]', '"', uni_txt)
    uni_txt = re.sub('[\u2014]', '-', uni_txt)
    uni_txt = re.sub('[\u2018]', "'", uni_txt)
    uni_txt = re.sub('[\u200b]', ' ', uni_txt)
    uni_txt = re.sub('[\u2013]', '-', uni_txt)
    uni_txt = re.sub('[\ufddf]', ' ', uni_txt)
    uni_txt = re.sub('[\ufffd]', ' ', uni_txt)
    uni_txt = re.sub('[\u2026]', '...', uni_txt)
    uni_txt = re.sub('[\u2028]', '... ...', uni_txt)
    uni_txt = re.sub('[\u263a]', ' smile ', uni_txt)
    uni_txt = re.sub('[\u1427]', '-', uni_txt)
    uni_txt = re.sub('[\u04cf]', '|', uni_txt)
    uni_txt = re.sub('[\uf0d8]', '|?|', uni_txt)
    uni_txt = re.sub('[\U0001f603]', ' ', uni_txt)
    uni_txt = re.sub('[\u2039]', '<', uni_txt)
    uni_txt = re.sub('[\u201a]', "'", uni_txt)
    uni_txt = re.sub('[\u2011]', '-', uni_txt)
    uni_txt = re.sub('[\u2002]', ' ', uni_txt)
    uni_txt = re.sub('[\u0335]', '-', uni_txt)
    uni_txt = re.sub('[\u2502]', '|', uni_txt)
    uni_txt = re.sub('[\u0160]', 'S', uni_txt)
    uni_txt = re.sub('[\u2015]', '-', uni_txt)
    uni_txt = re.sub('[\u2212]', '-', uni_txt)
    uni_txt = re.sub('[\u2010]', ' ', uni_txt)
    uni_txt = re.sub('[\ufffc]', ' ', uni_txt)
    uni_txt = re.sub('[\u200e]', ' ', uni_txt)
    uni_txt = re.sub('[\uf050]', ' ', uni_txt)
    uni_txt = re.sub('[\u2027]', '-', uni_txt)
    uni_txt = re.sub('[\u20AC]', ' ', uni_txt)
    uni_txt = re.sub('[\u2122]', ' ', uni_txt)
    uni_txt = re.sub('[\u02dc]', '~', uni_txt)
    uni_txt = re.sub('[\u0153]', 'oe', uni_txt)
    uni_txt = re.sub('[\ufb01]', 'fi', uni_txt)
    uni_txt = re.sub('[\u201e]', '"', uni_txt)
    uni_txt = re.sub('[\u0192]', 'f', uni_txt)
    
    uni_txt = re.sub('[\u25a1]', ' ', uni_txt)
    uni_txt = re.sub('[\u25cf]', '-', uni_txt)
    uni_txt = re.sub('[\uf097]', ' ', uni_txt)
    uni_txt = re.sub('[\u25aa]', '-', uni_txt)
    uni_txt = re.sub('[\u2030]', '%', uni_txt)
    uni_txt = re.sub('[\u25ba]', '>', uni_txt)
    
    uni_txt = re.sub('[\uF1FF]', '', uni_txt)
    uni_txt = re.sub('[\u3050]', '', uni_txt)
    uni_txt = re.sub('[\u98B5]', '', uni_txt)
    uni_txt = re.sub('[\u11CF]', '', uni_txt)
    uni_txt = re.sub('[\u82BB]', '', uni_txt)
    uni_txt = re.sub('[\uAA00]', '', uni_txt)
    uni_txt = re.sub('[\uBD00]', '', uni_txt)
    
    uni_txt = re.sub('[\u0bce]', '', uni_txt)
    uni_txt = re.sub('[\uf022]', '', uni_txt)
    
    uni_txt = re.sub('[\u221d]', '', uni_txt)
    uni_txt = re.sub('[\u2329]', '<', uni_txt)
    uni_txt = re.sub('[\u232A]', '>', uni_txt)
    uni_txt = re.sub('[\u002D]', '-', uni_txt)
    uni_txt = re.sub('[\u03c0]', 'pi', uni_txt)
    uni_txt = re.sub('[\uf0a7]', '', uni_txt)
    
    uni_txt = re.sub('[\U0010003e]', '', uni_txt)
    uni_txt = re.sub('[\u2282]', '', uni_txt)
    
    # Replace the right single quotation mark with a plain apostrophe.
    uni_txt = re.sub('[\u2019]', "'", uni_txt)
    
    # uni_txt = re.sub('[t]', 'THIS IS DUMB', uni_txt)
    
    # uni_txt = re.sub('[’]', 'LOOK HERE REGEXP 222', uni_txt)
    # uni_txt.replace(u'’', 'LOOK HERE REPLACE 222').encode('latin-1')
    
    # pln_txt = uni_txt
    # return(uni_txt)
    # return('bye bye bye')
    return(uni_txt)

grand_df['msgRaw'] = grand_df['msgRaw'].apply(remove_unicode)
grand_df['msgTxt'] = grand_df['msgTxt'].apply(remove_unicode)
grand_df['subject'] = grand_df['subject'].apply(remove_unicode)

def print_position(qtext):
    if len(qtext) > 11 + 1:
        if qtext[11] == u'\u201c':
            print(qtext[11:21])

def add_lots(numtoadd):
    highernum = numtoadd + 500
    return(highernum)

# grand_df['MsgsInTopic'] = grand_df['MsgsInTopic'].apply(add_lots)

# print(add_lots(50))
            
# print(remove_unicode('hello hello hello'))
            
# grand_df['msgRaw'].apply(print_position)

# Save to Stata
# Problems on this. See: https://github.com/pandas-dev/pandas/issues/16450
grand_df.to_stata(grp_name + '_messages.dta', version=117, convert_strl=['subject','msgRaw','msgTxt'])
df = grand_df
df[['subject']].to_stata('Stata_file_works.dta', version=117, convert_strl=['subject'])
# grand_df.head()
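
The long run of `re.sub` calls can also be written as one lookup table applied with `str.translate`. The sketch below covers only a handful of the characters handled in `remove_unicode` and is not a drop-in replacement; `REPLACEMENTS` and `to_plain` are hypothetical names.

```python
# Map each single character to its plain-text replacement.
# Values may be longer than one character (see the ellipsis).
REPLACEMENTS = str.maketrans({
    '\u2018': "'",    # left single quotation mark
    '\u2019': "'",    # right single quotation mark
    '\u201c': '"',    # left double quotation mark
    '\u201d': '"',    # right double quotation mark
    '\u2013': '-',    # en dash
    '\u2014': '-',    # em dash
    '\u2022': '-',    # bullet
    '\u2026': '...',  # horizontal ellipsis
    '\u00a0': ' ',    # no-break space
})


def to_plain(text):
    """Apply every replacement in a single pass over the text."""
    return text.translate(REPLACEMENTS)
```

A table keeps each character and its replacement on one line, which makes duplicate or inconsistent entries easier to spot than in a sequence of separate substitutions.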

In [None]:
# Save to Excel
with pd.ExcelWriter(grp_name + '_messages.xlsx', engine='xlsxwriter') as writer:
    grand_df.to_excel(writer, sheet_name='Sheet1')

In [None]:
import re

# re.sub('[es]', '', 'test')

re.sub('[\u2019]', '', u'hello \u2022 \u2022 \u2022world \u2019 \u2019 \u2019')

In [None]:
'hello'.encode()

In [None]:
u = u'hello \u2022 \u2022 \u2022world ```'

u

In [None]:
u.replace(u'\u2022', '-').replace(u'\u2019', "'").encode('latin-1')
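
The `.encode('latin-1')` call above fails on the first character outside Latin-1, so finding every offending character takes one run per character. A sketch that lists them all at once, using the hypothetical helper name `non_latin1_chars`:

```python
def non_latin1_chars(text):
    """Return the distinct characters in text that Latin-1 cannot encode."""
    found = set()
    for ch in text:
        # Latin-1 covers code points 0 through 255 only.
        if ord(ch) > 0xFF:
            found.add(ch)
    return sorted(found)
```

Running it over a column, for example with `grand_df['msgTxt'].apply(non_latin1_chars)`, would show which characters still need an entry in `remove_unicode`.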

In [None]:
print('hello world'[100])

In [None]:
len('hello')

In [None]:
re.sub('[’]', 'LOOK HERE REGEXP 222', '___ ’ ___')

In [50]:
exfile = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
exfile.head(n=1)

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
0,AMC Concord,4099,22,Average,2.5,11,2930,186,40,121,3.58,Domestic


In [51]:
def make_longer(shorter):
    longer = shorter * 100000
    return(longer)
    
exfile['make'] = exfile['make'].apply(make_longer)



In [52]:
exfile.head(n=1)

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
0,AMC ConcordAMC ConcordAMC ConcordAMC ConcordAM...,4099,22,Average,2.5,11,2930,186,40,121,3.58,Domestic


In [53]:
exfile.to_stata('Stata_file_auto.dta', version=117, convert_strl=['make'])