##### Web scraping has many applications in business. A few examples are market sentiment analysis, customer sentiment analysis, and competitive pricing.

##### Step 1: Before we scrape a website, we need to look at its robots.txt.

    This file tells us whether the website allows scraping. To find the robots.txt, take the base URL and append “/robots.txt”. For example, if we want to crawl apartments.com, type https://www.apartments.com/robots.txt into the browser's address bar.

    If the robots.txt allows full access, it contains the following:
        User-agent: *
        Disallow:

    If the robots.txt blocks all access, it contains the following:
        User-agent: *
        Disallow: /

    And if the robots.txt gives partial access, it contains the following, where section stands for the sections that are not to be crawled:

    User-agent: *
    Disallow: /section/
    
    In the case of apartments.com, the robots.txt contains the following as of this writing:
    # Ensure UTF-8 WITHOUT SIGNATURE- no BOM 
    User-agent: *
    Disallow: /services/

    Sitemap: https://www.apartments.com/sitemapindex.xml.gz

    This means we can crawl every section of the site except URLs whose path starts with /services/.
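
The same check can be done in code. The following is a minimal sketch using the standard library's `urllib.robotparser`; it parses the rules quoted above from a string rather than downloading them, so the `ROBOTS_TXT` constant and the `build_parser` helper are illustrative assumptions, and the live file may have changed.

```python
from urllib.robotparser import RobotFileParser

# Rules as quoted in Step 1. In practice they would be downloaded from
# https://www.apartments.com/robots.txt and may differ from this copy.
ROBOTS_TXT = """\
User-agent: *
Disallow: /services/
"""


def build_parser(robots_txt):
    """Build a parser from the text of a robots.txt file (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "*" asks about the rules that apply to any user agent.
listing_allowed = parser.can_fetch("*", "https://www.apartments.com/cincinnati-oh/")
services_allowed = parser.can_fetch("*", "https://www.apartments.com/services/")

print(listing_allowed, services_allowed)  # True False
```

To check the live file instead, `RobotFileParser` also offers `set_url()` followed by `read()`.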

In [48]:
# Step 2: Import the necessary libraries

import requests                  # Requests is used in this example to get the html content
from bs4 import BeautifulSoup    # BeautifulSoup to parse the html
import pandas                    # pandas to make a dataframe and write to a csv

In [49]:
# Step 3: Store the url you want to scrape to a variable

base_url = 'https://www.apartments.com/cincinnati-oh/'

In [None]:
# Step 4: Get the HTML contents of the page using the requests library

r = requests.get(base_url)
c = r.content  # raw bytes of the response body
c
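
A plain `requests.get` call waits indefinitely if the server never answers and does not complain about error responses. A slightly more defensive version of this step might look like the sketch below. The `fetch_html` helper, the 10-second timeout, and the User-Agent string are assumptions for illustration, not part of the original notebook.

```python
import requests

# Hypothetical identifier for the scraper; replace it with something that
# describes your own project.
USER_AGENT = "example-scraper/0.1"


def fetch_html(url, session=None, timeout=10):
    """Return the response body as bytes, raising on HTTP error statuses."""
    if session is None:
        session = requests.Session()
    response = session.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    # Turn 4xx/5xx responses into an exception instead of parsing an error page.
    response.raise_for_status()
    return response.content
```

With this helper, the step above could be written as `c = fetch_html(base_url)`.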

In [None]:
# Step 5: Parse the html. This is done with BeautifulSoup

soup = BeautifulSoup(c,"html.parser")
soup

In [53]:
# Step 6: Extract the first and last page numbers from the paging links

paging = soup.find("div",{"id":"placardContainer"}).find("div",{"id":"paging"}).find_all("a")
start_page = paging[1].text
last_page = paging[-2].text
# start_page # 1
# last_page  # 28
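
To see why the indices `1` and `-2` are used, it helps to run the same lookup on a small piece of markup. The HTML below is a made-up stand-in that assumes the paging block starts with a "previous" link and ends with a "next" link; the real page's markup is not reproduced here and may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical paging markup, assumed for illustration only.
SAMPLE_HTML = """
<div id="placardContainer">
  <div id="paging">
    <a class="prev">Previous</a>
    <a>1</a><a>2</a><a>3</a>
    <a class="next">Next</a>
  </div>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = (
    sample_soup.find("div", {"id": "placardContainer"})
    .find("div", {"id": "paging"})
    .find_all("a")
)

# Index 1 skips the leading "Previous" link; index -2 skips the trailing "Next" link.
first_page = int(links[1].text)
final_page = int(links[-2].text)

print(first_page, final_page)  # 1 3
```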

In [54]:
# Step 7: Make an empty list to append all the content that we get later on

web_content_list = []

In [88]:
# Step 8: Build the page links from the page numbers, crawl through the pages, and extract the contents from the corresponding tags.

for page_number in range(int(start_page),int(last_page)+1):
    url = base_url + str(page_number) + "/"
    r = requests.get(url)
    c = r.content
    soup = BeautifulSoup(c,"html.parser")

    # Step 9: Extract the header elements that hold the title and the location.
    # To find the class name, right-click the title in the browser and choose Inspect.
    placard_header = soup.find_all("header",{"class":"placardHeader"})

    # Step 10: Extract the section elements that hold the rent, number of beds and phone number
    placard_content = soup.find_all("section",{"class":"placardContent"})
    
    # Step 11: Loop over the properties one by one and extract the Title,
    # Address, Price, Beds and Phone values from each header and section.

    for item_header,item_content in zip(placard_header,placard_content):
        # Store the information for one property in a dictionary
        web_content_dict = {}
        web_content_dict["Title"]=item_header.find("a",{"class":"placardTitle js-placardTitle "}).text.replace("\r","").replace("\n","")
        web_content_dict["Address"] = item_header.find("div",{"class":"location"}).text
        web_content_dict["Price"] = item_content.find("span",{"class":"altRentDisplay"}).text
        web_content_dict["Beds"] = item_content.find("span",{"class":"unitLabel"}).text
        web_content_dict["Phone"] = item_content.find("div",{"class":"phone"}).find("span").text

        # Append the dictionary to the list
        web_content_list.append(web_content_dict)
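
In the loop above, `find` returns `None` when a listing lacks one of the tags, and calling `.text` on `None` raises an `AttributeError` that stops the whole crawl. One way to tolerate missing fields is a small helper like the sketch below. The `text_or_none` function and the sample markup are assumptions for illustration, and the helper only covers single-level lookups (the nested phone lookup would need two calls).

```python
from bs4 import BeautifulSoup


def text_or_none(parent, name, attrs):
    """Return the stripped text of the first matching tag, or None if it is absent."""
    tag = parent.find(name, attrs)
    return tag.get_text(strip=True) if tag is not None else None


# Hypothetical listing markup with no phone element.
SAMPLE_LISTING = """
<section class="placardContent">
  <span class="altRentDisplay"> $699 - 1,115 </span>
  <span class="unitLabel">1-2 Bed</span>
</section>
"""

item = BeautifulSoup(SAMPLE_LISTING, "html.parser").find(
    "section", {"class": "placardContent"}
)
record = {
    "Price": text_or_none(item, "span", {"class": "altRentDisplay"}),
    "Beds": text_or_none(item, "span", {"class": "unitLabel"}),
    "Phone": text_or_none(item, "div", {"class": "phone"}),  # missing, so None
}

print(record)
```

Missing values then appear as empty cells in the final CSV instead of aborting the run.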

In [89]:
# Step 12: Make a dataframe from the list and write it to a CSV file

# To make a dataframe with the list
df = pandas.DataFrame(web_content_list)

# To write the dataframe to a csv file
df.to_csv(r"C:\Users\athiq.ahmed\Desktop\Other\Python code\ML\Web Scraping\Datasets\Output.csv")
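
By default `to_csv` also writes the row index as an unnamed first column, which appears as `Unnamed: 0` when the file is read back with `read_csv`. If that column is not wanted, passing `index=False` leaves it out. The sketch below uses a single sample row and a temporary directory, both of which are assumptions made so the example does not depend on a machine-specific path.

```python
import os
import tempfile

import pandas

# One sample record in the same shape as the scraped dictionaries.
rows = [
    {"Title": "Timber Trails", "Beds": "1-2 Bed", "Price": "$699 - 1,115"},
]
df_sample = pandas.DataFrame(rows)

# Write to a temporary location instead of a hard-coded user directory.
out_path = os.path.join(tempfile.mkdtemp(), "Output.csv")
df_sample.to_csv(out_path, index=False)  # index=False omits the row index

columns_read_back = list(pandas.read_csv(out_path).columns)
print(columns_read_back)
```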

In [91]:
df.head()

| Unnamed: 0 | Address | Beds | Phone | Price | Title |
|---|---|---|---|---|---|
| 0 | 11513 Village Brook Dr, Cincinnati, OH 45249 | 1-3 Bed | 844-289-7404 | $1,055 - 2,300 | Glenbridge Manors Apartment Homes |
| 1 | 200 W Galbraith Rd, Cincinnati, OH 45215 | Studio - 3 Bed | 844-874-3632 | $655 - 1,860 | Williamsburg Of Cincinnati |
| 2 | 100 Southern Trace D Dr, Cincinnati, OH 45255 | 1-2 Bed | 844-812-3358 | $699 - 1,115 | Timber Trails |
| 3 | 4209 Erie Ave, Cincinnati, OH 45227 | Studio - 2 Bed | 513-373-4357 | $1,135 - 1,620 | Centennial Station |
| 4 | 3225 Oakley Station Blvd, Cincinnati, OH 45209 | Studio - 2 Bed | 513-448-0898 | $1,200 - 2,230 | The Boulevard at Oakley Station |


https://towardsdatascience.com/an-introduction-to-web-scraping-with-python-bc9563fe8860