### House price scraping 

In this notebook I'm going to scrape data from the site [imali](http://www.imali.biz/cat---22-24-this---.html), which lists house prices in Rwanda. I will be using [this tutorial](https://realpython.com/python-web-scraping-practical-introduction/#setting-up-your-python-web-scraper) from Real Python and any other documentation I find on the internet.


In [1]:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

#### Utility functions

This function makes a simple GET request, with error handling and a check that the response is HTML.

In [2]:
def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

In [3]:
def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # Default to '' so a missing Content-Type header does not raise AttributeError
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and 'html' in content_type)
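A response can arrive without a `Content-Type` header, in which case `headers.get('Content-Type')` returns `None`. The sketch below shows one way to write a check that tolerates this. It is only an illustration: `looks_like_html` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins for real `requests` responses so that the example runs without network access.

```python
from types import SimpleNamespace


def looks_like_html(resp):
    """Return True if `resp` has status 200 and an HTML content type."""
    # Fall back to an empty string when the header is missing.
    content_type = (resp.headers.get('Content-Type') or '').lower()
    return resp.status_code == 200 and 'html' in content_type


# Hypothetical stand-ins for responses, with only the attributes the check uses.
html_resp = SimpleNamespace(status_code=200,
                            headers={'Content-Type': 'text/html; charset=UTF-8'})
no_header_resp = SimpleNamespace(status_code=200, headers={})
not_found_resp = SimpleNamespace(status_code=404,
                                 headers={'Content-Type': 'text/html'})

print(looks_like_html(html_resp))       # True
print(looks_like_html(no_header_resp))  # False
print(looks_like_html(not_found_resp))  # False
```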

In [4]:
def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)

Let us try to get something from a web page.

In [5]:
raw_html = simple_get('http://www.imali.biz/cat---22-24-this---.html')

Basically, the website has a main div containing a list of houses for rent or for sale. My first task will be to open the div and get the content from each li.

Let's try this with Beautiful Soup.

In [6]:
html = BeautifulSoup(raw_html, 'html.parser')

Let's get the div by its id and look for the ul inside that div.

In [7]:
all_houses_div = html.find(id="annBySectionResponseDiv")

Let's get the li elements and see if I can read the info inside.

In [8]:
def get_houses():
    """
    Downloads the page where all the houses are listed
    and returns a list of all the houses on that page
    """
    url = 'http://www.imali.biz/cat---22-24-this---.html'
    response = simple_get(url)

    if response:
        html = BeautifulSoup(response, 'html.parser')
        all_houses_div = html.find(id="annBySectionResponseDiv")
        # find() returns None if the div is missing, so check before using it
        if all_houses_div is not None:
            return all_houses_div.find_all('li')
    # Raise an exception if we failed to get any data from the url
    raise Exception('Error retrieving contents at {}'.format(url))

In [9]:
houses = get_houses()

Here is what a single li element with the house information looks like...


In [10]:
a_house = houses[0]

From this li element I can get the title of the advert, whether the house is for sale or for rent, the price, and the place where the house is located.

In [11]:
a_house

<li><a href="announce-24-78853.html" title="nice house for sale at kicukiro"><img alt="" src="Im_img/Im_Announces_img/Im_mini_main_img/pIqLCF6.jpeg"/></a><span class="Ann_Section"><a href="announce-24-78853.html" title="nice house for sale at kicukiro">Houses for sale</a></span><span class="Ann_Price">70 000 000 Rwf</span><span class="Ann_City">kicukiro</span><span class="Ann_Date">27-Jan-2019</span></li>

In [12]:
link = a_house.find('a')

In [13]:
link.get('href')

'announce-24-78853.html'
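The `href` is a relative link, so it has to be combined with the address of the listing page before it can be requested. A minimal sketch using `urljoin` from the standard library is shown below. The variable names are illustrative, and it assumes the site resolves announce links relative to the listing page, which I have not verified.

```python
from urllib.parse import urljoin

# Page the listing was scraped from (the URL used earlier in this notebook).
listing_url = 'http://www.imali.biz/cat---22-24-this---.html'
# Relative link exactly as it appears in the href attribute.
relative_link = 'announce-24-78853.html'

# urljoin replaces the last path segment of the base URL with the relative link.
absolute_link = urljoin(listing_url, relative_link)
print(absolute_link)  # http://www.imali.biz/announce-24-78853.html
```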

In [14]:
link.get('title')

'nice house for sale at kicukiro'

In [15]:
price = a_house.find_all("span", {"class" :"Ann_Price"})[0].text
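The price comes back as text such as `'70 000 000 Rwf'`, which is not convenient for calculations. Below is a rough sketch of turning it into an integer. `parse_price` is a hypothetical helper, and it assumes prices are whole numbers of Rwf with spaces as thousands separators, so a price with a decimal point would be read incorrectly.

```python
import re


def parse_price(price_text):
    """Return the amount in `price_text` as an int, or None if it has no digits."""
    # Drop everything that is not a digit (spaces, currency label, ...).
    digits = re.sub(r'\D', '', price_text)
    return int(digits) if digits else None


print(parse_price('70 000 000 Rwf'))  # 70000000
print(parse_price('Negotiable'))      # None
```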

In [16]:
city = a_house.find_all("span", {"class" :"Ann_City"})[0].text

In [17]:
date_posted = a_house.find_all("span", {"class" :"Ann_Date"})[0].text

In [18]:
date_posted

'27-Jan-2019'
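The date is also plain text. If it is needed for sorting or filtering, it can be converted with `datetime.strptime`, as in the sketch below. `parse_date_posted` is a hypothetical helper. It assumes every advert uses the `DD-Mon-YYYY` format seen here and that month abbreviations are in English, since `%b` depends on the current locale.

```python
from datetime import datetime


def parse_date_posted(date_text):
    """Convert a string like '27-Jan-2019' into a datetime.date."""
    # %d = day of month, %b = abbreviated month name, %Y = four-digit year
    return datetime.strptime(date_text, '%d-%b-%Y').date()


posted = parse_date_posted('27-Jan-2019')
print(posted.isoformat())  # 2019-01-27
```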

Let's put it all together and save the information in a list of dictionaries with details...

In [19]:
def save_house_details(all_houses):
    """
    Extracts information about each house from the list of houses passed in.
    The information collected is:
    the price, the date the house was posted, the place where the house is located, the title of the advert,
    and the link where more information about the house can be found
    ----
    args : list of li elements
    return : list of dicts, where each dict contains one house's information
    
    """
    houses_details = list()
    for a_house in all_houses:
        link = a_house.find('a')
        details = {'url': link.get('href'),
                  'title': link.get('title'),
                  'price': a_house.find_all("span", {"class" :"Ann_Price"})[0].text, 
                  'city': a_house.find_all("span", {"class" :"Ann_City"})[0].text,
                  'date_posted': a_house.find_all("span", {"class" :"Ann_Date"})[0].text}
        houses_details.append(details)
    return houses_details

In [20]:
houses_data = save_house_details(houses)

In [22]:
len(houses_data)

26
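A list of dictionaries like this one can be written to CSV with `csv.DictWriter` from the standard library. The sketch below is one possible way to do it, not part of the original scraper. `houses_to_csv` is a hypothetical helper, and the single sample record is copied from the first advert shown above so that the example runs without network access.

```python
import csv
import io

# One record in the same shape as the dictionaries built by save_house_details.
sample_houses = [
    {'url': 'announce-24-78853.html',
     'title': 'nice house for sale at kicukiro',
     'price': '70 000 000 Rwf',
     'city': 'kicukiro',
     'date_posted': '27-Jan-2019'},
]


def houses_to_csv(houses, out_file):
    """Write the list of house dictionaries to an open text file as CSV."""
    fieldnames = ['url', 'title', 'price', 'city', 'date_posted']
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(houses)


# An in-memory buffer keeps the example free of side effects.
buffer = io.StringIO()
houses_to_csv(sample_houses, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open('houses.csv', 'w', newline='', encoding='utf-8')`. The filename here is only an example.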