links:
+ http://naelshiab.com/tutorial-send-email-python/
+ https://docs.python.org/3/library/smtplib.html

** scrape wg-gesucht **

goal is to be alerted quickly when a new listing comes online

+ need to persist the listings somehow?
 - or keep track of the time of the latest listing seen
 - a posting time on each listing would be useful for this

requirements:
+ collect links of appropriate properties with the following attributes
    - 1 - 2 rooms
    - available by 01/07 at the latest
    - long-term rental
    - only listings from the last hour? - sent from 8am to midnight CET?
    - size in $m^2$
    - region
    - date and time of listing
    - not a swap (TAUSCH/SWAP)
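
The swap check in the last requirement could be sketched as a keyword test on the listing title. The scraper further down does not extract titles, so the `title` argument, the helper name `is_swap` and the sample titles are assumptions made for illustration.

```python
import re

# Keywords that mark a flat-swap offer (taken from the requirement above).
SWAP_PATTERN = re.compile(r'tausch|swap', re.IGNORECASE)


def is_swap(title):
    """Return True if the listing title mentions a swap (TAUSCH/SWAP)."""
    return bool(SWAP_PATTERN.search(title or ''))


# Made-up titles, only to show the filter in use.
titles = [
    'Helle 2-Zimmer-Wohnung',
    'WOHNUNGSTAUSCH Mitte gegen Kreuzberg',
    'Flat swap wanted',
]
keep = [t for t in titles if not is_swap(t)]
print(keep)
```

A plain substring test is crude: it would also drop a title that merely contains "tausch" inside another word, which may or may not be acceptable.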


+ want to run this from some server
+ will need to have logging and so on - somehow only new listings should be emailed?
+ want to email the results to myself - especially for certain areas?

** where the information on each listing is (td class names) **

* ang_spalte_datum 
* ang_spalte_miete 
* ang_spalte_groesse 
* ang_spalte_stadt 
* ang_spalte_freiab
* ang_spalte_freibis

** server information **

source ~/virtualenv/bin/activate    ### to activate the virtualenv required for pip and so on

In [None]:
## TODO:
## don't know if the link could change for the same listing? - but will find out

# implement logging? - if something fails, it would be good to record what failed
# also perhaps suspend the script after a failure?
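
A minimal sketch of the logging idea, using the standard `logging` module. The log file name `scraper.log`, the logger name and the helpers `configure_logging` and `run_safely` are assumptions. The wrapper records the traceback and returns `None`, so the caller can decide whether to suspend further runs.

```python
import logging

logger = logging.getLogger('wg_scraper')  # hypothetical logger name


def configure_logging(path='scraper.log'):
    """Send log records to a file; the file name is a placeholder."""
    logging.basicConfig(
        filename=path,
        level=logging.INFO,
        format='%(asctime)s %(levelname)s %(message)s',
    )


def run_safely(job, *args):
    """Run job(*args); log the traceback and return None if it raises."""
    name = getattr(job, '__name__', repr(job))
    try:
        result = job(*args)
    except Exception:
        # logger.exception records the message plus the full traceback
        logger.exception('job %s failed', name)
        return None
    logger.info('job %s finished', name)
    return result
```

Calling `configure_logging()` once at start-up and wrapping each scrape as `run_safely(get_listings_from_url, url, data_file)` would leave a record of which URL failed and why.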

logic is:
+ for each url:
    + get listings
        + process etc
    + check if storage file exists
        + if it does, read it
        + append ALL new listings to csv
        + apply url filter to the new listings retrieved (to remove duplicates)
    + else
        + write new listings to csv
    + apply listings filter
    + return filtered listings
+ concat the results of the loop into one df
+ send df by email

+ then just need to schedule this to run every x minutes
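
One way to do the scheduling, sketched with the standard library only. The helper names and the injectable `sleep` and `clock` arguments are assumptions; the 8am-to-midnight window comes from the requirements above, and the sketch assumes the server clock is set to CET. A cron entry on the server would do the same job without a long-running process.

```python
import time
from datetime import datetime


def in_send_window(now, start_hour=8, end_hour=24):
    """True if now.hour lies in [start_hour, end_hour)."""
    return start_hour <= now.hour < end_hour


def run_every(job, interval_minutes, max_runs=None,
              sleep=time.sleep, clock=datetime.now):
    """Call job() every interval_minutes, skipping runs outside the window.

    max_runs limits the number of loop iterations (None means run forever);
    sleep and clock can be swapped out to try the loop without waiting.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        if in_send_window(clock()):
            job()
        runs += 1
        sleep(interval_minutes * 60)
```

For example, `run_every(check_and_email, 10)` would run a hypothetical `check_and_email` function every ten minutes during the day.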
    
    

In [263]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import os.path
import numpy as np

In [231]:
####
#### scrape details of listings from one url


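def get_search_results(url):
    """Download a search-results page and return it as parsed HTML."""
    r = requests.get(url, timeout=30)
    # raise on any HTTP error status; unlike assert, this still runs under python -O
    r.raise_for_status()
    return BeautifulSoup(r.text, 'html.parser')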


def get_listing_details(listing):
    """Extract the link, rent, size, district and dates from one result row."""
    # take the first non-empty href in the row
    # (assumes every link in a row points to the same listing)
    hrefs = [tag.get('href') for tag in listing.find_all('a') if tag.get('href')]
    link = 'http://www.wg-gesucht.de/en/' + hrefs[0]
    cost = (listing.find('td', class_='ang_spalte_miete').
                find('span').contents[1].contents[0].
                replace(' ','').replace('\n', '').replace('€',''))
    size = (listing.find('td', class_='ang_spalte_groesse').
                find('span').contents[0].
                replace(' ','').replace('\n', '').replace('m²', ''))
    stadt = (listing.find('td', class_='ang_spalte_stadt').
                find('span').contents[0].
                replace(' ','').replace('\n', ''))
    free_from = (listing.find('td', class_='ang_spalte_freiab').
                    find('span').contents[0])
    # the end-date cell has no span when the listing gives no end date
    free_to = listing.find('td', class_='ang_spalte_freibis').find('span')
    free_to = free_to.contents[0] if free_to else None

    return {'link': link, 'cost': cost, 'size': size, 'stadt': stadt,
            'free_from': free_from, 'free_to': free_to}

#http://www.wg-gesucht.de/en/wohnungen-in-Berlin.8.2.0.0.html
#'http://www.wg-gesucht.de/en/1-zimmer-wohnungen-in-Berlin.8.1.0.0.html?filter=fb0bdd36e1f2253e7bb343bba784e123c46b8103025e2e0a3a'
def get_latest_listing_details(url):
    soup = get_search_results(url)
    
    ## depending on link we are searching in the 1zimmer or flat part of wg-gesucht
    if '8.2.0.0' in url:
        flat_type = 'flat'
    elif '8.1.0.0' in url:
        flat_type = 'studio'
    else:
        flat_type = ''
    
    search_results = pd.DataFrame(get_listing_details(prop)
                      for prop in soup.find_all('tr', class_=re.compile('listenansicht[01]')))
    
    search_results = (search_results.
                         assign(cost = search_results['cost'].astype(int)).
                         assign(size = search_results['size'].astype(int)).
                         assign(free_from = pd.to_datetime(search_results.free_from, dayfirst=True)).
                         assign(free_to = pd.to_datetime(search_results.free_to, dayfirst=True)).
                         assign(length = lambda df: (df['free_to'] - df['free_from'])/pd.Timedelta(days=1)).
                         assign(scrape_time = pd.Timestamp('now').replace(second=0, microsecond=0)).
                         assign(flat_type = flat_type)
                     )
    
    return search_results
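
The `dayfirst=True` arguments matter because a date such as `01.10.2016` is ambiguous. The sketch below uses only `datetime` to show the two readings. The `dd.mm.yyyy` format is an assumption about how the site prints dates, not something confirmed in these notes.

```python
from datetime import datetime

raw = '01.10.2016'  # hypothetical cell text from the free_from column

day_first = datetime.strptime(raw, '%d.%m.%Y')    # read as 1 October 2016
month_first = datetime.strptime(raw, '%m.%d.%Y')  # read as 10 January 2016

print(day_first.date(), month_first.date())
# prints: 2016-10-01 2016-01-10
```

If `free_from` and `free_to` end up parsed with different readings, the computed `length` can come out negative.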

In [268]:
def get_previous_listings(file):
    previous_scrapings = pd.read_csv(file)

    def convert_dtypes(data):
        return (data.
                    assign(free_from = lambda df: pd.to_datetime(df['free_from'])).
                    assign(free_to = lambda df: pd.to_datetime(df['free_to'])).
                    assign(scrape_time = lambda df: pd.to_datetime(df['scrape_time']))
                )

    return convert_dtypes(previous_scrapings)

def filter_old_out(new, old):
    unseen_links = np.setdiff1d(new.link.values, old.link.values)
    unseen_listings = new[new.link.isin(unseen_links)]
    return unseen_listings


def filter_requirements(listings):
    ## filters
    min_free_from = pd.Timestamp('2016-09-28')
    max_free_from = pd.Timestamp('2016-11-01')
    max_cost = 700
    min_cost = 300
    min_length = 90
    
    return (listings.
                pipe(lambda df: df[df['free_from'] >= min_free_from]).
                pipe(lambda df: df[df['free_from'] <= max_free_from]).
                pipe(lambda df: df[df['cost'] <= max_cost]).
                pipe(lambda df: df[df['cost'] >= min_cost]).
                pipe(lambda df: df[((df['length'] >= min_length)|df['length'].isnull())])
    )



def get_listings_from_url(url_, data_file):
    current_listings = get_latest_listing_details(url_)
    if os.path.isfile(data_file):
        ## read previous listings
        previous_listings = get_previous_listings(data_file)
        
        ## append ALL listings to data file
        current_listings.to_csv(data_file, index=False, mode='a', header=False)
        
        ## apply url filter
        current_listings = filter_old_out(current_listings, previous_listings)
    else:
        ## else is if the file doesn't exist - should only happen on first run
        ## write first instance to csv
        current_listings.to_csv(data_file, index=False)
    
    ## return listings with desired features
    return filter_requirements(current_listings)
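
`filter_old_out` matches listings by their full link. If the link of a listing can change (see the TODO above), a key built from the number before `.html` might be more stable. This is a sketch under the assumption that this number identifies the listing; `listing_id` is a hypothetical helper.

```python
import re

# Assumes listing links end in ".<digits>.html", as in
# http://www.wg-gesucht.de/en/wohnungen-in-Berlin-Mitte.4210017.html
ID_PATTERN = re.compile(r'\.(\d+)\.html$')


def listing_id(link):
    """Return the trailing number of a listing link, or None if absent."""
    match = ID_PATTERN.search(link)
    return match.group(1) if match else None


print(listing_id('http://www.wg-gesucht.de/en/wohnungen-in-Berlin-Mitte.4210017.html'))
# prints: 4210017
```

The de-duplication step could then compare this key instead of the `link` column.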
        
        
        

In [283]:
## will want a scheduler to run this intermittently

urls = ['http://www.wg-gesucht.de/en/wohnungen-in-Berlin.8.1.0.0.html',
            'http://www.wg-gesucht.de/en/wohnungen-in-Berlin.8.2.0.0.html']
data_file = 'storage.csv'
new_listings = pd.concat(get_listings_from_url(url, data_file) for url in urls)
if new_listings.shape[0] > 0:
    print('send listings')
    ## actually send listings --> remember to call to_html on the df
else:
    print('no listings to send')
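
The sending step could be sketched with `smtplib` and `email.mime`, following the links at the top of these notes. The addresses, SMTP host, port and credentials are placeholders, and `build_message` and `send_message` are hypothetical helpers. The HTML body would come from `new_listings.to_html(index=False)`.

```python
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(html_table, sender, recipient,
                  subject='New wg-gesucht listings'):
    """Wrap an HTML table in a multipart email message."""
    msg = MIMEMultipart('alternative')
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = subject
    msg.attach(MIMEText(html_table, 'html', 'utf-8'))
    return msg


def send_message(msg, host, port, user, password):
    """Send msg over SMTP with STARTTLS; all connection details are placeholders."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(user, password)
        server.sendmail(msg['From'], [msg['To']], msg.as_string())
```

Keeping message construction separate from sending makes it possible to inspect the message without touching a mail server.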

In [None]:
pd.options.display.max_colwidth = 100

In [286]:
pd.options.display.max_colwidth = 150

In [288]:
previous_listings.to_html(index=False)

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th>cost</th>\n      <th>free_from</th>\n      <th>free_to</th>\n      <th>link</th>\n      <th>size</th>\n      <th>stadt</th>\n      <th>length</th>\n      <th>scrape_time</th>\n      <th>flat_type</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <td>750</td>\n      <td>2016-10-01</td>\n      <td>2018-03-31</td>\n      <td>http://www.wg-gesucht.de/en/wohnungen-in-Berlin-Prenzlauer-Berg.4735750.html</td>\n      <td>45</td>\n      <td>PrenzlauerBerg</td>\n      <td>546.0</td>\n      <td>2016-08-28 20:43:00</td>\n      <td>flat</td>\n    </tr>\n    <tr>\n      <td>1299</td>\n      <td>2016-10-01</td>\n      <td>NaT</td>\n      <td>http://www.wg-gesucht.de/en/wohnungen-in-Berlin-Mitte.4210017.html</td>\n      <td>75</td>\n      <td>Mitte</td>\n      <td>NaN</td>\n      <td>2016-08-28 20:43:00</td>\n      <td>flat</td>\n    </tr>\n    <tr>\n      <td>890</td>\n      <td>2016-10-01</td>\n   


|  | cost | free_from | free_to | link | size | stadt | length | scrape_time |
|---|---|---|---|---|---|---|---|---|
| 0 | 1250 | 2016-06-09 | 2016-01-11 | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 105m² | PrenzlauerBerg | -150.0 | 2016-08-28 19:24:00 |
| 1 | 1250 | 2016-05-09 | NaT | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 110m² | PrenzlauerBerg | NaN | 2016-08-28 19:24:00 |
| 2 | 519 | 2016-01-10 | NaT | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 61m² | Charlottenburg | NaN | 2016-08-28 19:24:00 |
| 3 | 780 | 2016-09-09 | 2016-08-10 | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 52m² | Wilmersdorf | -30.0 | 2016-08-28 19:24:00 |
| 4 | 640 | 2016-10-15 | NaT | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 54m² | Mitte | NaN | 2016-08-28 19:24:00 |
| 5 | 500 | 2016-01-09 | 2016-08-10 | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 50m² | Tegel | 214.0 | 2016-08-28 19:24:00 |
| 6 | 650 | 2016-12-09 | 2016-02-10 | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 56m² | PrenzlauerBerg | -303.0 | 2016-08-28 19:24:00 |
| 7 | 750 | 2016-01-10 | 2018-03-31 | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 45m² | PrenzlauerBerg | 811.0 | 2016-08-28 19:24:00 |
| 8 | 600 | 2016-03-09 | 2016-09-10 | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 46m² | Mitte | 185.0 | 2016-08-28 19:24:00 |
| 9 | 1467 | 2016-01-09 | 2017-01-12 | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 65m² | Friedrichshain | 369.0 | 2016-08-28 19:24:00 |
| 10 | 812 | 2016-01-09 | NaT | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 45m² | Mitte | NaN | 2016-08-28 19:24:00 |
| 11 | 750 | 2016-09-22 | 2016-10-31 | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 55m² | Kreuzberg | 39.0 | 2016-08-28 19:24:00 |
| 12 | 450 | 2016-01-09 | 2016-09-17 | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 55m² | Charlottenburg | 252.0 | 2016-08-28 19:24:00 |
| 13 | 700 | 2016-01-09 | NaT | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 63m² | Neukölln | NaN | 2016-08-28 19:24:00 |
| 14 | 957 | 2016-09-23 | 2017-05-23 | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 50m² | Mitte | 242.0 | 2016-08-28 19:24:00 |
| 15 | 2100 | 2016-08-28 | NaT | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 130m² | Mitte | NaN | 2016-08-28 19:24:00 |
| 16 | 1150 | 2016-08-28 | NaT | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 50m² | Wilmersdorf | NaN | 2016-08-28 19:24:00 |
| 17 | 550 | 2016-12-09 | 2016-02-10 | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 71m² | Schöneberg | -303.0 | 2016-08-28 19:24:00 |
| 18 | 1400 | 2016-10-10 | 2016-12-26 | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 76m² | Friedrichshain | 77.0 | 2016-08-28 19:24:00 |
| 19 | 65 | 2016-06-09 | 2016-09-20 | http://www.wg-gesucht.de/en/wohnungen-in-Berli... | 65m² | Mitte | 103.0 | 2016-08-28 19:24:00 |