### Text Mining from Website

Hi! In this IPython notebook, we look at extracting data from a webpage using commonly available libraries and then cleaning that data.

**Start_Date:** 20 - June

**End_Date:**  25 - June

**Python_Version:** 2.7

In [1]:
# Importing the required libraries
# csv, string, re <- standard-library helpers
# nltk <- for common stopwords
# urllib, requests <- for downloading webpages
# Beautiful Soup <- HTML parser
# pandas <- for tabular data
import csv
import nltk
import string
from urllib import urlopen
import requests
from bs4 import BeautifulSoup as bs
import re
import pandas as pd

In [2]:
# Setting the URL
url = "https://en.wikipedia.org/wiki/Chennai"

# Reading the webpage
web = urlopen(url)

##### Tags have commonly used names that depend on their position in relation to other tags:

**child** – a child is a tag inside another tag. In a page whose body holds two p tags, both p tags are children of the body tag.

**parent** – a parent is the tag another tag is inside. The html tag is the parent of the body tag.

**sibling** – a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they’re both inside html. Two p tags inside the same body are siblings as well.
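These relationships can be checked on a tiny page. The sketch below uses a made-up HTML snippet (an assumption for illustration, not the Chennai page) with a head tag and a body holding two p tags, and walks it with Beautiful Soup's `parent`, `children` and `next_sibling` attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page, written without whitespace between tags
# so that no newline text nodes appear among the children.
html = ("<html><head><title>Demo</title></head>"
        "<body><p>First</p><p>Second</p></body></html>")
demo = BeautifulSoup(html, "html.parser")

body = demo.body
first_p, second_p = body.find_all("p")

# child: both p tags sit directly inside body
child_names = [child.name for child in body.children]

# parent: html is the tag that body is inside
parent_name = body.parent.name

# sibling: head and body share the parent html; the p tags share body
head_sibling = demo.head.next_sibling.name
p_sibling = first_p.next_sibling

print(child_names)
print(parent_name)
print(head_sibling)
print(p_sibling is second_p)
```

On a real page, the whitespace between tags shows up as extra text nodes, so sibling navigation often lands on a newline string before reaching the next tag.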

In [3]:
# Downloading the webpage
webpage = requests.get(url)

print "Webpage Download: ", webpage

# A response code of 200 means that the webpage was downloaded successfully.
# Printing the first 101 bytes of the downloaded webpage
print "\nContent Print: ",webpage.content[:101]

Webpage Download:  <Response [200]>

Content Print:  <!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>
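Rather than reading the status off the printed Response, it can be checked in code before parsing. A minimal sketch, assuming the requests library: the Response objects here are built by hand with chosen status codes (an illustration device so the snippet runs offline), whereas in the notebook the object would come from `requests.get(url)`. The helper name `is_downloaded` is hypothetical.

```python
import requests

def is_downloaded(response):
    """Return True when the response does not carry a 4xx/5xx status code."""
    try:
        # raise_for_status() raises HTTPError for 4xx and 5xx codes
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        return False
    return True

# Hand-built responses standing in for requests.get(url)
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

print(is_downloaded(ok_response))
print(is_downloaded(missing_response))
```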


In [4]:
# Using Beautiful Soup to parse the Webpage.
soup = bs(webpage.content, 'html.parser')

# Printing the HTML Content using Prettify method
print (soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Chennai - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Chennai","wgTitle":"Chennai","wgCurRevisionId":787406268,"wgRevisionId":787406268,"wgArticleId":45139,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: Uses editors parameter","CS1 maint: Multiple names: authors list","Pages with citations lacking titles","Pages using citations with accessdate and no URL","All articles with dead external links","Articles with dead external links from June 2016","Wikipedia indefinitely move-protected pages","Use Indian Engli

In [5]:
# Beautiful Soup parses the page into a tree of nested tags.
# Listing the top-level nodes of that tree <- Children
list(soup.children)

[u'html',
 u'\n',
 u'\n']

##### The above tells us that there are two main items at the top level of the page – the initial <!DOCTYPE html> declaration and the html tag. The list contains newline characters (\n) as well.

In [6]:
# Type of each element
[type(item) for item in list(soup.children)] 

[bs4.element.Doctype,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString]

##### As you can see, all of the items are BeautifulSoup objects. The first is a Doctype object, which contains information about the type of the document. The second is a NavigableString, which represents text found in the HTML document. The third is a Tag object, which contains other nested tags, and the last is another NavigableString. The most important object type, and the one we’ll deal with most often, is the Tag object.
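To work only with Tag objects, the other node types can be filtered out with `isinstance`. A small sketch on a made-up document (an assumption for illustration) laid out like the page above, with a doctype, an html tag and newlines around it:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document: doctype, newline, html tag, newline
html = "<!DOCTYPE html>\n<html><body><p>Hello</p></body></html>\n"
demo = BeautifulSoup(html, "html.parser")

top_level = list(demo.children)
type_names = [type(node).__name__ for node in top_level]

# Keep only the nodes that are real tags
tags_only = [node for node in top_level if isinstance(node, Tag)]
tag_names = [tag.name for tag in tags_only]

print(type_names)
print(tag_names)
```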

In [7]:
# Finding all the <p> tags in the webpage
soup.find_all('p')

[<p><a href="/wiki/Chennai_district" title="Chennai district">Chennai</a>, <a class="mw-redirect" href="/wiki/Kanchipuram_District" title="Kanchipuram District">Kanchipuram</a> <a class="mw-redirect" href="/wiki/Tiruvallur_District" title="Tiruvallur District">Tiruvallur</a> And <a class="mw-redirect" href="/wiki/Vellore_District" title="Vellore District">Vellore</a></p>,
 <p><b>Chennai</b> (<span class="nowrap"><span class="noexcerpt"><a href="//upload.wikimedia.org/wikipedia/commons/d/d2/Chennai_MW.ogg" title="Listen"><img alt="Listen" data-file-height="11" data-file-width="11" height="11" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Speakerlink-new.svg/11px-Speakerlink-new.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Speakerlink-new.svg/17px-Speakerlink-new.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Speakerlink-new.svg/22px-Speakerlink-new.svg.png 2x" width="11"/></a><sup><span class="IPA" style="color:#00e;font:bold 80% san

In [8]:
# Extracting the text of the eleventh paragraph (index 10)
soup.find_all('p')[10].get_text()

u'The region around Chennai has served as an important administrative, military, and economic centre for many centuries. During the 1st century CE, a poet and weaver named Thiruvalluvar lived in the town of Mylapore (a neighbourhood of present Chennai).[44] From the 1st\u201312th century the region of present Tamil Nadu and parts of South India was ruled by the Cholas.[45]'

#### Searching for tags by class and id:

Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. We can also use them when scraping to target the specific elements we want.

In [9]:
# Using BS4, we can also search specifically using class
soup.find_all(class_='reference')[10]

<sup class="reference" id="cite_ref-pricewater_10-0"><a href="#cite_note-pricewater-10">[9]</a></sup>

In [10]:
# Using id
soup.find_all(id="cite_ref-4")

[<sup class="reference" id="cite_ref-4"><a href="#cite_note-4">[3]</a></sup>]
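The same lookups can be tried on a small stand-alone snippet. The HTML below is made up for illustration (its class and id values only mimic the Wikipedia reference markers), so the results are easy to verify by eye.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two reference markers
html = """
<p class="intro">Chennai is a city.<sup class="reference" id="cite_ref-1">[1]</sup></p>
<p>It lies on the coast.<sup class="reference" id="cite_ref-2">[2]</sup></p>
"""
demo = BeautifulSoup(html, "html.parser")

# class_ takes a trailing underscore because class is a Python keyword
references = demo.find_all(class_="reference")
reference_texts = [ref.get_text() for ref in references]

# An id should be unique within a page, so find() returns the single match
second_ref = demo.find(id="cite_ref-2")

# A tag name and a class can be combined in one call
intro_paragraphs = demo.find_all("p", class_="intro")

print(reference_texts)
print(second_ref["id"])
print(len(intro_paragraphs))
```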

#### Extracting Data from a Webpage - Weather Data to Pandas Dataframe

In this example, we mine weather data from a web portal, convert it into a pandas DataFrame, and finally run some simple analysis on it.

In [11]:
# Extracting Weather Data <- Online Web portal <- Chennai
page = requests.get("https://www.theweathernetwork.com/in/weather/tamil-nadu/chennai")
soup = bs(page.content, 'html.parser')
# Locating the forecast block by id and its items by class_, as worked out earlier.
seven_day = soup.find(id="seven-days")
forecast_items = seven_day.find_all(class_="seven-days-only")
daily = forecast_items[0]
print(daily.prettify())

<div class="seven-days-only" id="seven-days-only" style="display:block;">
 <div class="seven-day-lbl" id="seven-day-label">
  <div>
   <span class="lbl-feel" id="feels-like">
    Feels like:
   </span>
   <div class="lowhighlbl" id="day-low">
    <span class="day-low">
     Night:
    </span>
   </div>
   <div class="lowhighlbl hidden" id="day-high">
    <span class="day-high">
     Day:
    </span>
   </div>
   <div>
    <span class="" id="lbl-pop">
     POP:
    </span>
    <span class="" id="lbl-rain">
     24 Hr Rain:
    </span>
    <span class="" id="lbl-wind">
     Wind:
    </span>
    <span class="" id="lbl-sun">
     Hrs of Sun:
    </span>
   </div>
  </div>
 </div>
 <div class="day_1">
  <div class="seven-day-column">
   <div class="day_name">
    <span class="day_title">
     Mon
    </span>
    <span>
     Jun 26
    </span>
   </div>
   <div class="day_outlook">
    Mainly cloudy
   </div>
   <div class="day_icon">
    <img alt="" src="//s2.twnmm.com/images/en_in/icons/w

#### Extracting information from the page

Now, let's consider extracting information from the page:

* Forecast Day - class <- "day_name" <- Mon
* Day Outlook - class <- "day_outlook" <- Mainly cloudy
* High Temperature - class <- "chart-daily-temp seven_days_metric seven_days_metric_c" <- 33°C

In [12]:
# Checking for our known Attributes!!
# Day, Outlook and Temperature
day = daily.find(class_="day_name").get_text()
day_outlook = daily.find(class_="day_outlook").get_text()
temp = daily.find(class_="chart-daily-temp seven_days_metric seven_days_metric_c").get_text()

print "Forecast Day: ", day
print "Day's Outlook: ", day_outlook
print "Highest Temp: ", temp

Forecast Day:   MonJun 26 
Day's Outlook:  	Mainly cloudy	
Highest Temp:  	33°C 


#### So far we have obtained data for only one day; we now extract it for all seven days.

Also, note that the Outlook text has \t characters attached; hence, we source the description from the title attribute of the img tag instead.

In [13]:
# Extracting for Days
days_tags = seven_day.select(".seven-days-only .day_name")
days = [pt.get_text() for pt in days_tags]
days

[u' MonJun 26 ',
 u' TueJun 27 ',
 u' WedJun 28 ',
 u' ThuJun 29 ',
 u' FriJun 30 ',
 u' SatJul 1 ',
 u' SunJul 2 ']
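The day and the date run together ('MonJun 26') because `get_text()` joins the text of the two nested span tags with no separator. One possible fix, sketched on a fragment modelled on the forecast markup printed earlier, is to pass a separator and `strip=True` to `get_text()`:

```python
from bs4 import BeautifulSoup

# Fragment modelled on the day_name block of the forecast page
html = """
<div class="day_name">
  <span class="day_title"> Mon </span>
  <span> Jun 26 </span>
</div>
"""
demo = BeautifulSoup(html, "html.parser")
day_tag = demo.find(class_="day_name")

# Default: the text pieces are concatenated, whitespace included
raw_day = day_tag.get_text()

# separator joins the pieces; strip=True trims each piece and drops empty ones
clean_day = day_tag.get_text(separator=" ", strip=True)

print(repr(raw_day))
print(clean_day)
```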

In [14]:
# Extracting for Temperature in Celsius
temp_tags = seven_day.select(".seven-days-only .chart-daily-temp.seven_days_metric.seven_days_metric_c")
temp = [t.get_text() for t in temp_tags]
temp

[u'\t33\xb0C ',
 u'\t33\xb0C ',
 u'\t32\xb0C ',
 u'\t34\xb0C ',
 u'\t34\xb0C ',
 u'\t32\xb0C ',
 u'\t31\xb0C ']
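The two lookups used for the temperature behave differently: `find(class_="chart-daily-temp seven_days_metric seven_days_metric_c")` compares against the full class attribute string, while the dotted CSS selector passed to `select()` matches any tag carrying all three classes, in any order. A small sketch on made-up markup (the reversed class order in the second div is an assumption to show the effect):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same three classes, listed in different orders
html = """
<div class="chart-daily-temp seven_days_metric seven_days_metric_c">33</div>
<div class="seven_days_metric_c seven_days_metric chart-daily-temp">32</div>
"""
demo = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: the order matters
exact = demo.find_all(class_="chart-daily-temp seven_days_metric seven_days_metric_c")

# CSS selector: every listed class must be present, in any order
by_selector = demo.select(".chart-daily-temp.seven_days_metric.seven_days_metric_c")

print([tag.get_text() for tag in exact])
print([tag.get_text() for tag in by_selector])
```

The selector form is usually the safer choice when a site may reorder or add classes.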

In [15]:
# Removing the \t and trailing spaces in temperature
# strip() does not depend on the number of digits, unlike a fixed slice
out = [x.strip() for x in temp]
out

[u'33\xb0C',
 u'33\xb0C',
 u'32\xb0C',
 u'34\xb0C',
 u'34\xb0C',
 u'32\xb0C',
 u'31\xb0C']

In [16]:
# Obtaining the Description/Outlook from the title attribute of the image tag
img = daily.find("img")
desc = img['title']

print desc

Mainly cloudy


In [17]:
# Obtaining the Outlook for all the Seven Days
outlook = [t["title"] for t in seven_day.select(".seven-days-only img")]

print outlook

[u'Mainly cloudy', u'Mainly cloudy', u'Cloudy with showers', u'Cloudy with sunny breaks', u'Cloudy', u'Cloudy', u'Cloudy with showers']


In [18]:
# Combining into Pandas Dataframe.
import pandas as pd
weather = pd.DataFrame({
        "Days": days, 
        "Temperature": out,
        "Outlook": outlook
    })
weather

|   | Days | Outlook | Temperature |
|---|------|---------|-------------|
| 0 | MonJun 26 | Mainly cloudy | 33°C |
| 1 | TueJun 27 | Mainly cloudy | 33°C |
| 2 | WedJun 28 | Cloudy with showers | 32°C |
| 3 | ThuJun 29 | Cloudy with sunny breaks | 34°C |
| 4 | FriJun 30 | Cloudy | 34°C |
| 5 | SatJul 1 | Cloudy | 32°C |
| 6 | SunJul 2 | Cloudy with showers | 31°C |


In [19]:
# Now, checking the datatypes of the above dataframe
weather.dtypes

Days           object
Outlook        object
Temperature    object
dtype: object

In [20]:
# Converting the Temperature to an integer type
temp_nums = weather["Temperature"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
# temp_nums itself still holds strings (dtype object); the integers are in weather["temp_num"]
temp_nums

0    33
1    33
2    32
3    34
4    34
5    32
6    31
Name: temp_num, dtype: object

In [21]:
# Finding the Mean
weather["temp_num"].mean()

32.714285714285715
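The conversion and the mean can be reproduced on a stand-alone frame. A minimal sketch, assuming pandas and using the seven temperature strings scraped above; the frame name `weather_demo` is hypothetical, while the column name `temp_num` follows the notebook.

```python
import pandas as pd

# Temperature strings as scraped above (degree sign written as an escape)
weather_demo = pd.DataFrame({
    "Temperature": [u"33\u00b0C", u"33\u00b0C", u"32\u00b0C", u"34\u00b0C",
                    u"34\u00b0C", u"32\u00b0C", u"31\u00b0C"],
})

# Pull out the first run of digits; expand=False returns a Series for one group
digits = weather_demo["Temperature"].str.extract(r"(\d+)", expand=False)

# The extracted values are still strings, so convert before doing arithmetic
weather_demo["temp_num"] = digits.astype(int)

mean_temp = weather_demo["temp_num"].mean()
print(weather_demo["temp_num"].dtype)
print(mean_temp)
```

Checking the dtype of the new column, rather than of the intermediate Series, confirms that the conversion took effect.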