## To practice high-quality science with data, you need to make sure it is properly sourced, cleaned, formatted, and pre-processed. 

## A common mantra of the modern age is "data is the new oil", meaning that data is now a resource considered more valuable than oil. But just as crude oil does not come out of the rig as gasoline and must be refined into gasoline and other products, data must be curated, cleaned, and refined before it can be used in data science. This process is known as data wrangling, and most data scientists spend the majority of their time on it. 

## Data wrangling is generally done at the very first stage of a data science/analytics pipeline. After the data scientists have identified any useful data sources for solving the business problem at hand (for instance, in-house database storage, the internet, or streaming sensor data such as an underwater seismic sensor), they then proceed to extract, clean, and format the necessary data from those sources.

## In an extremely rare situation, data wrangling may not be needed. For example, if the data that's necessary for a machine learning task is already stored in an acceptable format in an in-house database, then a simple SQL query may be enough to extract the data into a table, ready to be passed on to the modeling stage.

## Process of Data Wrangling
![Screenshot%202024-03-09%20at%206.38.57%E2%80%AFAM.png](attachment:Screenshot%202024-03-09%20at%206.38.57%E2%80%AFAM.png)

## So, the first step towards Data Wrangling is Data Extraction, which we will study now!

## Data is extracted from data sources such as 
### - https://data.gov.in/
### - https://data.gov/
### - https://data.worldbank.org/
### - https://www.kaggle.com/datasets
### - https://archive.ics.uci.edu/datasets

## We will learn to read CSV, Excel, JSON, PDF, and HTML datasets into pandas DataFrames, and to use web scraping to extract structured and textual information from web portals.

## The pandas library provides a function called read_csv to read tabular data from a comma-separated text file (.csv). This is particularly useful because .csv is a lightweight and widely supported data exchange format, especially in domains that involve machine-generated data.


## Generally, a .csv file has two sections. The first line of a .csv file is usually treated as a header line. So, each column in the first line should indicate the name of the column. After the first line, we have data rows where each line represents one data point and each column represents values of those data points.
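
## As a minimal sketch of this structure, the following reads a small in-memory CSV. The csv_text string is a made-up sample (not one of the dataset files), and io.StringIO is used only so that the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical sample: one header line followed by two data rows
csv_text = """Bedroom,Sq. foot,Locality,Price ($)
2,1500,Good,300000
3,1300,Fair,240000
"""

# io.StringIO lets read_csv treat the string like an open file
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.columns.tolist())  # column names are taken from the first line
print(df_demo.shape)             # (number of data rows, number of columns)
```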

In [41]:
import numpy as np
import pandas as pd
# change the path in your system
df1 = pd.read_csv("./datasets/CSV_EX_1.csv")
df1

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000


In [3]:
df2 = pd.read_csv("./datasets/CSV_EX_2.csv")
df2

Unnamed: 0,2,1500,Good,300000
0,3,1300,Fair,240000
1,3,1900,Very good,450000
2,3,1850,Bad,280000
3,2,1640,Good,310000


In [4]:
# The top data row has been mistakenly read as the column header. You can specify header=None to avoid this.
df2 = pd.read_csv("./datasets/CSV_EX_2.csv", header=None)
df2

Unnamed: 0,0,1,2,3
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000


In [5]:
# Add the names argument to get the correct headers:
df2 = pd.read_csv("./datasets/CSV_EX_2.csv", header=None, names = ['Bedroom','Sq. foot', 'Locality', 'Price ($)'])
df2

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000


In [6]:
# So far we were reading from files where a comma acts as a delimiter. Let's look when the values are not separated by commas.
df3 = pd.read_csv("./datasets/CSV_EX_3.csv")
df3

Unnamed: 0,Bedroom; Sq. foot; Locality; Price ($)
0,2; 1500; Good; 300000
1,3; 1300; Fair; 240000
2,3; 1900; Very good; 450000
3,3; 1850; Bad; 280000
4,2; 1640; Good; 310000


In [7]:
# A simple workaround is to specify the separator/delimiter explicitly in the read function.
df3 = pd.read_csv("./datasets/CSV_EX_3.csv", sep = ';')
df3

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000


In [8]:
# How to bypass the headers of a CSV file and put in your own.
# To do that, you have to set header=0 explicitly along with the names list.
# If you pass only names, the original header line is read in as the first data row, as shown here.
df4 = pd.read_csv("./datasets/CSV_EX_1.csv",names=['A','B','C','D'])
df4

Unnamed: 0,A,B,C,D
0,Bedroom,Sq. foot,Locality,Price ($)
1,2,1500,Good,300000
2,3,1300,Fair,240000
3,3,1900,Very good,450000
4,3,1850,Bad,280000
5,2,1640,Good,310000


In [9]:
# To avoid this, set header to zero and provide a names list
df4 = pd.read_csv("./datasets/CSV_EX_1.csv",header=0,names=['A','B','C','D'])
df4

Unnamed: 0,A,B,C,D
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000


In [10]:
# Skipping initial Rows and footers when reading a CSV File
# We will skip the first few rows because, most of the time, 
# the first few rows of a CSV data file are metadata about the data source or similar information, 
# which is not read into the table. 
# Also, we will go ahead and remove the footer of the file, which might sometimes contain information that's not very useful.

df5 = pd.read_csv("./datasets/CSV_EX_skiprows.csv")
df5


Unnamed: 0,Filetype: CSV,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,,Info about some houses,,
1,Bedroom,Sq. foot,Locality,Price ($)
2,2,1500,Good,300000
3,3,1300,Fair,240000
4,3,1900,Very good,450000
5,3,1850,Bad,280000
6,2,1640,Good,310000


In [12]:
# Skip the first two rows and read the file:
df5 = pd.read_csv("./datasets/CSV_EX_skiprows.csv",skiprows=2)
df5

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000


In [15]:
# Similar to skipping the initial rows, it may be necessary to skip the footer of a file.
# Reading with only skiprows=2 leaves the footer line in the table as a mostly empty last row:
df6 = pd.read_csv("./datasets/CSV_EX_skipfooter.csv",skiprows=2)
df6

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2.0,1500,Good,300000.0
1,3.0,1300,Fair,240000.0
2,3.0,1900,Very good,450000.0
3,3.0,1850,Bad,280000.0
4,2.0,1640,Good,310000.0
5,,This is the end of file,,


In [None]:
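# skipfooter=1 drops the last line of the file; skipfooter is only supported by the Python parsing engine
df6 = pd.read_csv("./datasets/CSV_EX_skipfooter.csv",skiprows=2,skipfooter=1,engine='python')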
df6

In [16]:
# In many situations, we may not want to read a whole data file but only the first few
# rows. This is particularly useful for extremely large data files, where we may just want
# to read the first couple of hundred rows to check an initial pattern and then decide to
# read the whole of the data afterward. 

df7 = pd.read_csv("./datasets/CSV_EX_1.csv",nrows=2)
df7

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2,1500,Good,300000
1,3,1300,Fair,240000


In [18]:
# We can also read a very large data file piece by piece.
# To do that, we can combine skiprows and nrows to read the file in small chunks of a predetermined size.

list_of_dataframe = []
rows_in_a_chunk = 10
num_chunks = 5
# create dummy df to get column names
df_dummy = pd.read_csv("./datasets/villageElectrified_April2015.csv",nrows=2)
colnames = df_dummy.columns

In [20]:
for i in range(0,num_chunks*rows_in_a_chunk,rows_in_a_chunk):
    df = pd.read_csv("./datasets/villageElectrified_April2015.csv", header=0,skiprows=i,nrows=rows_in_a_chunk,names=colnames)
    list_of_dataframe.append(df)

In [24]:
list_of_dataframe[0]

Unnamed: 0,States/UTs,Total inhabited villages as per 2011 census,Villages electrified as on 30-03-2015 (Provisional)(#)-Numbers,Villages electrified as on 30-03-2015 (Provisional)(#)-%age,Cummulative achievement as on 30-04-2015,%age of villages electrified as on 30-04-2015,Unelectrified villages as on 30-04-2015
0,Andhra Pradesh,26286,26286,100.0,26286,100.0,0
1,Arunachal Pradesh,5258,3694,70.3,3694,70.3,1564
2,Assam,25372,24569,96.8,24569,96.8,803
3,Bihar,39073,37316,95.5,37316,95.5,1757
4,Chattisgarh,19567,19124,97.7,19125,97.7,442
5,Goa,320,320,100.0,320,100.0,0
6,Gujarat,17843,17843,100.0,17843,100.0,0
7,Haryana,6642,6642,100.0,6642,100.0,0
8,Himachal Pradesh,17882,17828,99.7,17828,99.7,54
9,Jammu&Kashmir,6337,6224,98.2,6224,98.2,113
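
## As an alternative sketch to combining skiprows and nrows, read_csv also accepts a chunksize argument, which returns an iterator of DataFrames. The data below is a made-up in-memory stand-in for a large file, not the village electrification dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file: one header line plus 25 data rows
csv_text = "id,value\n" + "\n".join(f"{i},{i * i}" for i in range(25))

chunks = []
# With chunksize set, read_csv yields DataFrames of at most 10 rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    chunks.append(chunk)

print([len(c) for c in chunks])

# Stitch the chunks back together if the full table is needed
df_all = pd.concat(chunks, ignore_index=True)
print(df_all.shape)
```

## Each chunk keeps the column names from the header line, so no dummy read is needed to recover them.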


In [25]:
# By default, read_csv skips blank lines, so completely empty lines in the raw file
# do not appear in the DataFrame.
# However, in some situations, you may want to read them in as rows of NaN so that you can count how many
# blank lines were present in the raw data file.
df9 = pd.read_csv("./datasets/CSV_EX_blankline.csv",skip_blank_lines=False)
df9

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2.0,1500.0,Good,300000.0
1,3.0,1300.0,Fair,240000.0
2,,,,
3,3.0,1900.0,Very good,450000.0
4,3.0,1850.0,Bad,280000.0
5,,,,
6,2.0,1640.0,Good,310000.0
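
## Once the blank lines are read in as NaN rows, counting and dropping them takes one line each. This is a minimal sketch on a made-up in-memory sample; the names csv_text, df_blank, n_blank, and df_clean are illustrative only.

```python
import io
import pandas as pd

# Hypothetical sample with one blank line between the two data rows
csv_text = "Bedroom,Price\n2,300000\n\n3,240000\n"

df_blank = pd.read_csv(io.StringIO(csv_text), skip_blank_lines=False)

# A blank line shows up as a row in which every column is NaN
n_blank = int(df_blank.isna().all(axis=1).sum())
print(n_blank)

# Drop only the rows that are entirely empty
df_clean = df_blank.dropna(how="all")
print(len(df_clean))
```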


In [35]:
# This is an awesome feature of pandas, and it allows you to read directly from a
# compressed file, such as .zip, .gz, .bz2, or .xz. 
# The only requirement is that the intended data file (CSV) should be the only file inside the compressed file. 
# For example, we might need to compress a large csv file, and in that case, it will be the only file inside the .zip folder.

# USE WINDOWS TO CREATE THE ZIP FILE
df10 = pd.read_csv('./datasets/CSV_EX_1.csv.zip')
df10
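
## To try this without preparing an archive by hand, the sketch below writes a single-file .zip with the standard zipfile module and reads it back. The file names and the small DataFrame are made up for illustration.

```python
import os
import tempfile
import zipfile
import pandas as pd

# Hypothetical data to round-trip through a compressed file
df_src = pd.DataFrame({"Bedroom": [2, 3], "Price ($)": [300000, 240000]})

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "demo.csv.zip")
    # Write exactly one CSV member into the archive
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("demo.csv", df_src.to_csv(index=False))
    # The compression type is inferred from the .zip extension
    df_zip = pd.read_csv(zip_path)

print(df_zip)
```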

In [39]:
!pip install tabula-py xlrd lxml openpyxl



In [42]:
# We will focus on the differences between the methods of reading from an Excel file. 
# An Excel file can consist of multiple worksheets, and we can read a specific sheet by passing in a particular argument, that is, sheet_name.
import openpyxl

df11_1 = pd.read_excel("./datasets/Housing_data.xlsx",sheet_name='Data_Tab_1')

df11_2 = pd.read_excel("./datasets/Housing_data.xlsx",sheet_name='Data_Tab_2')

df11_3 = pd.read_excel("./datasets/Housing_data.xlsx",sheet_name='Data_Tab_3')

df11_1

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,PRICE
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
5,0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
6,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
7,0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
8,0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5


In [45]:
# You can also read the Excel data as a dictionary by putting sheet_name=None
df12 = pd.read_excel("./datasets/Housing_data.xlsx",sheet_name=None)
print (df12.keys())
df12['Data_Tab_1']

dict_keys(['Data_Tab_1', 'Data_Tab_2', 'Data_Tab_3'])


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,PRICE
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
5,0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
6,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
7,0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
8,0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5
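
## When every sheet shares the same columns, the dictionary can be combined into one DataFrame. This is a sketch under that assumption; the sheets dictionary below is built by hand as a stand-in for what read_excel returns with sheet_name=None, and its values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the dict returned by read_excel(..., sheet_name=None)
sheets = {
    "Data_Tab_1": pd.DataFrame({"CRIM": [0.00632, 0.02731], "ZN": [18.0, 0.0]}),
    "Data_Tab_2": pd.DataFrame({"CRIM": [0.02729], "ZN": [0.0]}),
}

# Inspect each sheet by name
for name, frame in sheets.items():
    print(name, frame.shape)

# Passing a dict to concat uses its keys as the outer index level
combined = pd.concat(sheets, names=["sheet", "row"])
print(combined)
```

## The outer index level keeps track of which sheet each row came from.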


In [46]:
# Reading a General Delimited Text File
df13 = pd.read_table("./datasets/Table_EX_1.txt")
df13

Unnamed: 0,"Bedroom, Sq. foot, Locality, Price ($)"
0,"2, 1500, Good, 300000"
1,"3, 1300, Fair, 240000"
2,"3, 1900, Very good, 450000"
3,"3, 1850, Bad, 280000"
4,"2, 1640, Good, 310000"


In [47]:
# use sep =',' in the above code
df13 = pd.read_table("./datasets/Table_EX_1.txt",sep=',')
df13

Unnamed: 0,Bedroom,Sq. foot,Locality,Price ($)
0,2,1500,Good,300000
1,3,1300,Fair,240000
2,3,1900,Very good,450000
3,3,1850,Bad,280000
4,2,1640,Good,310000


In [None]:
# The pandas library allows us to read HTML tables directly from a URL. 
# This means that the library already has some kind of built-in HTML parser that processes the
# HTML content of a given page and tries to extract various tables from the page.

In [50]:
!pip install html5lib



In [62]:
list_of_df = pd.read_html("https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table",header=0)
print (len(list_of_df))
for t in list_of_df:
    print(t.shape)
# We are looking for the big table in the webpage
df14=list_of_df[3]
df14.head()

8
(6, 3)
(0, 3)
(3, 1)
(87, 6)
(14, 9)
(6, 2)
(2, 2)
(1, 2)


Unnamed: 0,Rank,NOC,Gold,Silver,Bronze,Total
0,1,United States,46,37,38,121
1,2,Great Britain,27,23,17,67
2,3,China,26,18,26,70
3,4,Russia,19,17,20,56
4,5,Germany,17,10,15,42
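
## A hard-coded position such as index 3 can break when the page layout changes. One possible heuristic, sketched here, is to pick the table with the most rows. The tables list is a hand-built stand-in for the list returned by read_html, using three of the shapes printed above.

```python
import pandas as pd

# Hypothetical stand-ins for the DataFrames returned by read_html
tables = [
    pd.DataFrame(index=range(6), columns=range(3)),
    pd.DataFrame(index=range(87), columns=range(6)),
    pd.DataFrame(index=range(14), columns=range(9)),
]

# Choose the table with the largest number of rows
largest = max(tables, key=lambda t: t.shape[0])
print(largest.shape)
```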


In [64]:
# JavaScript Object Notation (JSON) has become ubiquitous for data exchange on the web.
# Today, it is the format of choice for almost every publicly available web API, and it is frequently used for private web APIs as well. 
# It is a schema-less, text-based representation of structured data that is based on key-value pairs and ordered lists.
# The pandas library provides excellent support for reading data from a JSON file directly into a DataFrame.
df15 = pd.read_json("./datasets/movies.json")
df15.head()

Unnamed: 0,title,year,cast,genres
0,After Dark in Central Park,1900,[],[]
1,Boarding School Girls' Pajama Parade,1900,[],[]
2,Buffalo Bill's Wild West Parad,1900,[],[]
3,Caught,1900,[],[]
4,Clowns Spinning Hats,1900,[],[]


In [68]:
cast_of_bumblebee = df15[(df15['title']=="Bumblebee") & (df15['year']==2018)]['cast']
print(list(cast_of_bumblebee))

[['Hailee Steinfeld', 'John Cena', 'Jorge Lendeborg Jr.', 'Jason Drucker', 'Rachel Crow', 'Pamela Adlon']]
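
## Because the cast column holds Python lists, it can be expanded to one row per cast member with explode. This is a minimal sketch on made-up records, not on movies.json.

```python
import io
import pandas as pd

# Hypothetical records in the same shape as the movies data
json_text = """[
  {"title": "Movie A", "year": 2018, "cast": ["Actor One", "Actor Two"]},
  {"title": "Movie B", "year": 1900, "cast": []}
]"""

movies = pd.read_json(io.StringIO(json_text))

# explode repeats the other columns once per list element;
# an empty list becomes a single row with NaN in the cast column
per_actor = movies.explode("cast")
print(per_actor)
```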


In [77]:
# Among the various types of data sources, the PDF format is probably the most difficult to parse in general. 
# While there are several popular Python packages for working with PDF files and general page formatting, 
# tabula-py is a widely used choice for extracting tables from PDF files.

# From the GitHub page of this package, tabula-py is a simple Python wrapper of tabula-java, 
# which can read a table from a PDF. You can read tables from PDFs and convert them into pandas DataFrames. 
# The tabula-py library also enables you to convert a PDF file into a CSV/TSV/JSON file.

from tabula import read_pdf

df16_1 = read_pdf('./datasets/Housing_data.pdf', pages=[1], pandas_options={'header':None})
df16_1

# If it gives an error, try installing the latest version of the Java JDK (tabula-py relies on tabula-java)

[         0     1     2  3      4      5     6       7  8    9
 0  0.17004  12.5  7.87  0  0.524  6.004  85.9  6.5921  5  311
 1  0.22489  12.5  7.87  0  0.524  6.377  94.3  6.3467  5  311
 2  0.11747  12.5  7.87  0  0.524  6.009  82.9  6.2267  5  311
 3  0.09378  12.5  7.87  0  0.524  5.889  39.0  5.4509  5  311]

In [89]:
df16_2 = read_pdf('./datasets/Housing_data.pdf', pages=[2], pandas_options={'header':None})
df16_2

[      0       1      2     3
 0  15.2  386.71  17.10  18.9
 1  15.2  392.52  20.45  15.0
 2  15.2  396.90  13.27  18.9
 3  15.2  390.50  15.71  21.7]

In [87]:
df17_1 = pd.DataFrame(df16_1[0])
df17_2 = pd.DataFrame(df16_2[0])
df17 = pd.concat([df17_1,df17_2],axis=1)  # axis=1 represents columns
df17

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,0.1,1.1,2.1,3.1
0,0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9
1,0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15.0
2,0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.9,13.27,18.9
3,0.09378,12.5,7.87,0,0.524,5.889,39.0,5.4509,5,311,15.2,390.5,15.71,21.7


In [90]:
# how to set headers while reading data from PDF
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','PRICE']

df16_1 = read_pdf('./datasets/Housing_data.pdf', pages=[1], pandas_options={'header':None,'names':names[:10]})

df16_2 = read_pdf('./datasets/Housing_data.pdf', pages=[2], pandas_options={'header':None,'names':names[10:]})

df17_1 = pd.DataFrame(df16_1[0])

df17_2 = pd.DataFrame(df16_2[0])

df17 = pd.concat([df17_1,df17_2],axis=1)  # axis=1 represents columns

df17

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,PRICE
0,0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9
1,0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15.0
2,0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.9,13.27,18.9
3,0.09378,12.5,7.87,0,0.524,5.889,39.0,5.4509,5,311,15.2,390.5,15.71,21.7


# Introduction to Beautiful Soup 4 and Web Page Parsing

## The ability to read and understand web pages is of paramount interest to a person collecting and formatting data. For example, consider the task of gathering data about movies and then formatting it for a downstream system. 

## Python has a very mature and stable library called BeautifulSoup for getting data from HTML or XML documents, and it gives you a nice, normalized, idiomatic way of navigating and querying a document.

## Hyper Text Markup Language is a structured way of telling web browsers about the organization of a web page, meaning which kinds of elements (text, image, video, and so on) come from where, where inside the page they should appear, what they look like, what they contain, and how they will behave with user input. HTML5 is the latest version of HTML.

![Screenshot%202024-03-10%20at%209.12.48%E2%80%AFPM.png](attachment:Screenshot%202024-03-10%20at%209.12.48%E2%80%AFPM.png)

![Screenshot%202024-03-10%20at%209.16.42%E2%80%AFPM.png](attachment:Screenshot%202024-03-10%20at%209.16.42%E2%80%AFPM.png)

![Screenshot%202024-03-10%20at%209.19.20%E2%80%AFPM.png](attachment:Screenshot%202024-03-10%20at%209.19.20%E2%80%AFPM.png)

In [11]:
from bs4 import BeautifulSoup

# https://chat.openai.com/share/1571834e-b567-4381-b6de-d6fe9e183671 
with open("./datasets/test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    print (type(soup))

<class 'bs4.BeautifulSoup'>


In [7]:
#print (soup)
print (soup.prettify())

<html>
 <body>
  <h1>
   Lorem ipsum dolor sit amet consectetuer adipiscing 
elit
  </h1>
  <p>
   Lorem ipsum dolor sit amet, consectetuer adipiscing 
elit. Aenean commodo ligula eget dolor. Aenean massa
   <strong>
    strong
   </strong>
   . Cum sociis natoque penatibus 
et magnis dis parturient montes, nascetur ridiculus 
mus. Donec quam felis, ultricies nec, pellentesque 
eu, pretium quis, sem. Nulla consequat massa quis 
enim. Donec pede justo, fringilla vel, aliquet nec, 
vulputate eget, arcu. In enim justo, rhoncus ut, 
imperdiet a, venenatis vitae, justo. Nullam dictum 
felis eu pede
   <a class="external ext" href="#">
    link
   </a>
   mollis pretium. Integer tincidunt. Cras dapibus. 
Vivamus elementum semper nisi. Aenean vulputate 
eleifend tellus. Aenean leo ligula, porttitor eu, 
consequat vitae, eleifend ac, enim. Aliquam lorem ante, 
dapibus in, viverra quis, feugiat a, tellus. Phasellus 
viverra nulla ut metus varius laoreet. Quisque rutrum. 
Aenean imperdiet. Etiam

![Screenshot%202024-03-12%20at%2010.51.31%E2%80%AFAM.png](attachment:Screenshot%202024-03-12%20at%2010.51.31%E2%80%AFAM.png)

In [10]:
with open("./datasets/test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    print (soup.p)

<p>Lorem ipsum dolor sit amet, consectetuer adipiscing 
elit. Aenean commodo ligula eget dolor. Aenean massa 
<strong>strong</strong>. Cum sociis natoque penatibus 
et magnis dis parturient montes, nascetur ridiculus 
mus. Donec quam felis, ultricies nec, pellentesque 
eu, pretium quis, sem. Nulla consequat massa quis 
enim. Donec pede justo, fringilla vel, aliquet nec, 
vulputate eget, arcu. In enim justo, rhoncus ut, 
imperdiet a, venenatis vitae, justo. Nullam dictum 
felis eu pede <a class="external ext" href="#">link</a> 
mollis pretium. Integer tincidunt. Cras dapibus. 
Vivamus elementum semper nisi. Aenean vulputate 
eleifend tellus. Aenean leo ligula, porttitor eu, 
consequat vitae, eleifend ac, enim. Aliquam lorem ante, 
dapibus in, viverra quis, feugiat a, tellus. Phasellus 
viverra nulla ut metus varius laoreet. Quisque rutrum. 
Aenean imperdiet. Etiam ultricies nisi vel augue. 
Curabitur ullamcorper ultricies nisi.</p>


In [13]:
# To access all the <p> tags
with open("./datasets/test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    all_ps = soup.find_all('p')
    print ("Total number of <p> --- {}".format(len(all_ps)))

Total number of <p> --- 6


In [20]:
print (all_ps[5])

<p>Lorem ipsum dolor sit amet, consectetuer adipiscing 
elit. Aenean commodo ligula eget dolor. Aenean massa. 
Cum sociis natoque penatibus et magnis dis parturient 
montes, nascetur ridiculus mus. Donec quam felis, 
ultricies nec, pellentesque eu, pretium quis, sem.</p>


In [23]:
with open("./datasets/test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    all_tables = soup.find_all('table')
    print ("Total number of <table> --- {}".format(len(all_tables)))

Total number of <table> --- 1


In [24]:
print (all_tables[0])

<table class="data">
<tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>
<tr>
<td>Entry First Line 1</td>
<td>Entry First Line 2</td>
<td>Entry First Line 3</td>
<td>Entry First Line 4</td>
</tr>
<tr>
<td>Entry Line 1</td>
<td>Entry Line 2</td>
<td>Entry Line 3</td>
<td>Entry Line 4</td>
</tr>
<tr>
<td>Entry Last Line 1</td>
<td>Entry Last Line 2</td>
<td>Entry Last Line 3</td>
<td>Entry Last Line 4</td>
</tr>
</table>


In [22]:
# We will learn now how to get contents for a particular HTML tag
with open("./datasets/test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    print (type(table))
    print ('_____________________')
    print (table.contents)

<class 'bs4.element.Tag'>
_____________________
['\n', <tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>, '\n', <tr>
<td>Entry First Line 1</td>
<td>Entry First Line 2</td>
<td>Entry First Line 3</td>
<td>Entry First Line 4</td>
</tr>, '\n', <tr>
<td>Entry Line 1</td>
<td>Entry Line 2</td>
<td>Entry Line 3</td>
<td>Entry Line 4</td>
</tr>, '\n', <tr>
<td>Entry Last Line 1</td>
<td>Entry Last Line 2</td>
<td>Entry Last Line 3</td>
<td>Entry Last Line 4</td>
</tr>, '\n']


In [30]:
# to access the children of tags
with open("./datasets/test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    for child in table.children:
        print (child)
        print ("*****")
    print (len(list(table.children)))



*****
<tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>
*****


*****
<tr>
<td>Entry First Line 1</td>
<td>Entry First Line 2</td>
<td>Entry First Line 3</td>
<td>Entry First Line 4</td>
</tr>
*****


*****
<tr>
<td>Entry Line 1</td>
<td>Entry Line 2</td>
<td>Entry Line 3</td>
<td>Entry Line 4</td>
</tr>
*****


*****
<tr>
<td>Entry Last Line 1</td>
<td>Entry Last Line 2</td>
<td>Entry Last Line 3</td>
<td>Entry Last Line 4</td>
</tr>
*****


*****
9
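
## The count of 9 includes the newline strings that sit between the rows, not just the four <tr> tags. As a minimal sketch on a made-up HTML string (not test.html), the following keeps only the direct <tr> children. The parser is named explicitly as 'html.parser' here, which is a choice made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table with newlines between the rows
html = """<table>
<tr><th>H1</th><th>H2</th></tr>
<tr><td>a</td><td>b</td></tr>
</table>"""

soup_demo = BeautifulSoup(html, "html.parser")
table_demo = soup_demo.table

# .contents mixes Tag objects with whitespace-only strings
print(len(table_demo.contents))

# recursive=False restricts the search to direct children of the table
rows = table_demo.find_all("tr", recursive=False)
print(len(rows))

# Text of the cells in the second row
cells = [td.get_text() for td in rows[1].find_all("td")]
print(cells)
```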


In [32]:
# to access the children and descendants of tags
with open("./datasets/test.html", "r") as fd:
    soup = BeautifulSoup(fd)
    table = soup.table
    for descendant in table.descendants:
        print (descendant)
        print ("*****")
    print (len(list(table.children)))
    print (len(list(table.descendants)))



*****
<tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>
*****


*****
<th>Entry Header 1</th>
*****
Entry Header 1
*****


*****
<th>Entry Header 2</th>
*****
Entry Header 2
*****


*****
<th>Entry Header 3</th>
*****
Entry Header 3
*****


*****
<th>Entry Header 4</th>
*****
Entry Header 4
*****


*****


*****
<tr>
<td>Entry First Line 1</td>
<td>Entry First Line 2</td>
<td>Entry First Line 3</td>
<td>Entry First Line 4</td>
</tr>
*****


*****
<td>Entry First Line 1</td>
*****
Entry First Line 1
*****


*****
<td>Entry First Line 2</td>
*****
Entry First Line 2
*****


*****
<td>Entry First Line 3</td>
*****
Entry First Line 3
*****


*****
<td>Entry First Line 4</td>
*****
Entry First Line 4
*****


*****


*****
<tr>
<td>Entry Line 1</td>
<td>Entry Line 2</td>
<td>Entry Line 3</td>
<td>Entry Line 4</td>
</tr>
*****


*****
<td>Entry Line 1</td>
*****
Entry Line 1
*****


*****
<td>Entry Line 2</td>
*****
Entry Line 2
*****

In [45]:
# We will now learn how to create a DataFrame with the extracted data from HTML using the BeautifulSoup library.
# Open the file in a with block so that it is closed after parsing
with open("./datasets/test.html", "r") as fd:
    soup = BeautifulSoup(fd)
data = soup.find_all('tr')
print("Data is a {} and {} items long".format(type(data),len(data)))

Data is a <class 'bs4.element.ResultSet'> and 4 items long


In [35]:
# You will see that the first row is the column heading and all of the following rows are the data from the HTML source. 
# We'll assign two different variables for the two sections, as follows:
data_without_header = data[1:]
headers = data[0]
headers

<tr>
<th>Entry Header 1</th>
<th>Entry Header 2</th>
<th>Entry Header 3</th>
<th>Entry Header 4</th>
</tr>

![Screenshot%202024-03-12%20at%205.06.30%E2%80%AFPM.png](attachment:Screenshot%202024-03-12%20at%205.06.30%E2%80%AFPM.png)

In [38]:
# Once we have separated the two sections, we need two list comprehensions to make them ready to go in a DataFrame.
col_headers = [th.getText() for th in headers.findAll('th')]
col_headers

['Entry Header 1', 'Entry Header 2', 'Entry Header 3', 'Entry Header 4']

In [39]:
df_data = [[td.getText() for td in tr.findAll('td')] for tr in data_without_header]
df_data

[['Entry First Line 1',
  'Entry First Line 2',
  'Entry First Line 3',
  'Entry First Line 4'],
 ['Entry Line 1', 'Entry Line 2', 'Entry Line 3', 'Entry Line 4'],
 ['Entry Last Line 1',
  'Entry Last Line 2',
  'Entry Last Line 3',
  'Entry Last Line 4']]

In [42]:
df = pd.DataFrame(df_data, columns=col_headers)
df.head()

Unnamed: 0,Entry Header 1,Entry Header 2,Entry Header 3,Entry Header 4
0,Entry First Line 1,Entry First Line 2,Entry First Line 3,Entry First Line 4
1,Entry Line 1,Entry Line 2,Entry Line 3,Entry Line 4
2,Entry Last Line 1,Entry Last Line 2,Entry Last Line 3,Entry Last Line 4


In [55]:
# We will collect the URLs from the list items of the test.html web page, one after the other
with open("./datasets/test.html", "r") as fd:
    soup = BeautifulSoup(fd)
lis = soup.find('ul').find_all('li')
stack = []
for li in lis:
    #a = li.find('a', href=True)
    a = li.a
    print (a)
    stack.append(a['href'])
print ('*************************')
print (stack)

<a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">Information Entropy</a>
<a href="http://www.gutenberg.org/browse/scores/top">Top books in Gutenberg</a>
<a href="https://www.imdb.com/chart/top">Top 250 movies in IMDB</a>
*************************
['https://en.wikipedia.org/wiki/Entropy_(information_theory)', 'http://www.gutenberg.org/browse/scores/top', 'https://www.imdb.com/chart/top']


## We will now learn advanced web scraping and data gathering

### We will learn how to gather data from web pages, XML files, and APIs, and how to extract and read data from web pages and databases hosted on the web.

![Screenshot%202024-03-13%20at%202.15.21%E2%80%AFPM.png](attachment:Screenshot%202024-03-13%20at%202.15.21%E2%80%AFPM.png)

![Screenshot%202024-03-13%20at%202.23.01%E2%80%AFPM.png](attachment:Screenshot%202024-03-13%20at%202.23.01%E2%80%AFPM.png)

In [1]:
import requests
# assign the name of webpage to a variable
wiki_homepage = "https://en.wikipedia.org/wiki/Main_Page"
response = requests.get(wiki_homepage)

In [2]:
type(response)

requests.models.Response

## A web page request generally comes back with standard HTTP response codes. The following table shows the common codes:

![Screenshot%202024-03-13%20at%202.28.13%E2%80%AFPM.png](attachment:Screenshot%202024-03-13%20at%202.28.13%E2%80%AFPM.png)

In [5]:
# Let's understand how to check the status of a web request
def status_check(r):
    if r.status_code==200:
        print("Success!")
        return 1
    else:
        print("Failed!")
        return -1

wiki_homepage = "https://en.wikipedia.org/wiki/Main_Page"
response = requests.get(wiki_homepage)
status_check(response)

Success!


1
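
## The status_check function only distinguishes 200 from everything else. As a sketch of a finer-grained check, the hypothetical helper below (describe_status is not part of requests) groups a code by its first digit and looks up the standard phrase with the http module from the standard library.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code."""
    # The first digit of the code identifies its class
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    category = classes.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered status codes
        phrase = "Unrecognized"
    return f"{code} {phrase} ({category})"

print(describe_status(200))
print(describe_status(404))
print(describe_status(503))
```

## With a requests response, this could be called as describe_status(response.status_code).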

## Next, we will learn about the encoding on a web page. 

## Some of the most popular encodings are ASCII and UTF-8, which encodes the Unicode character set. ASCII is the simplest, but it cannot capture the complex symbols used in various spoken and written languages all over the world, so UTF-8 has become the almost universal standard in web development these days. It employs a variable-length encoding of 1 to 4 bytes per character, which lets it represent every Unicode character.
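
## The variable-length behaviour is easy to see by encoding a few characters. This is a small illustrative sketch; the sample characters are chosen here only because they need 1, 2, 3, and 4 bytes respectively.

```python
# One character from each UTF-8 length class: A, e-acute, euro sign, an emoji
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(repr(ch), len(encoded), encoded)

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

# ASCII cannot represent characters outside its 128-symbol range
try:
    "\u00e9".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```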

In [6]:
def encoding_check(r):
    return (r.encoding)

response = requests.get("https://en.wikipedia.org/wiki/Main_Page")
encoding_check(response)

'UTF-8'

In [7]:
# Next, we will learn to decode the contents of a response
def decode_content(r,encoding):
    return (r.content.decode(encoding))

response = requests.get("https://en.wikipedia.org/wiki/Main_Page")

contents = decode_content(response,encoding_check(response))

print (contents)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-feature-night-mode-disabled skin-night-mode-clientpref-0 vector-toc-not-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Wikipedia, the free encyclopedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disab

In [9]:
print (type(contents))
print (len(contents))
print (contents[:1000])

<class 'str'>
100861
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-feature-night-mode-disabled skin-night-mode-clientpref-0 vector-toc-not-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Wikipedia, the free encyclopedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-m

In [15]:
# Since the HTML is now read as a string, we can apply BeautifulSoup on it. 
soup = BeautifulSoup(contents, 'html.parser')
txt_dump = soup.text
print (len(txt_dump))
# The text dump is shorter than contents because 
# BeautifulSoup has parsed the HTML and kept only the human-readable text, dropping the tags.
print(txt_dump)

9047




Wikipedia, the free encyclopedia





































Jump to content







Main menu





Main menu
move to sidebar
hide



		Navigation
	


Main pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate





		Contribute
	


HelpLearn to editCommunity portalRecent changesUpload file



















Search











Search





























Create account

Log in








Personal tools





 Create account Log in





		Pages for logged out editors learn more



ContributionsTalk



























Main Page










Main PageTalk





English

















ReadView sourceView history







Tools





Tools
move to sidebar
hide



		Actions
	


ReadView sourceView history





		General
	


What links hereRelated changesUpload fileSpecial pagesPermanent linkPage informationCite this pageGet shortened URLDownload QR codeWikidata item





		Print/export
	


Download as PDFPrintable version





		In other projects
	




In [17]:
# Next, we will see how to extract the text of one section of a web page.

# Let's extract the text under the "From today's featured article" heading on the Wikipedia home page.

# First, we find two indices in the text dump: the position of a marker string that opens the section
# and the position of a marker string that follows it. The text we want lies between them.

idx1 = txt_dump.find("From today's featured article")
print (idx1)
idx2 = txt_dump.find("Recently featured") 
print (idx2)

print(txt_dump[idx1+len("From today's featured article"):idx2])

# idx1 is the index where the "From today's featured article" string starts. Adding its length gives the
# starting index of the paragraph that we wish to access.

1366
2462



Newspaper advertisement for game tickets

On Sunday, July 10, 1932, an 18-inning baseball game was played at League Park in Cleveland, Ohio, U.S. The Philadelphia Athletics defeated the Cleveland Indians, 18–17, in a game that saw a number of records set. Johnny Burnett of Cleveland set Major League Baseball (MLB) records that still stand with seven singles and nine total hits. Cleveland's 33 hits and the 58 total hits in the game are also MLB records; the 35 runs scored set a record for an extra-inning MLB game that stood until 1979. Eddie Rommel secured the win over Cleveland's Wes Ferrell. The Athletics had taken only two pitchers on the one-game road trip, required since Sunday baseball was illegal in Pennsylvania. Philadelphia's Lew Krausse gave up three runs in the first inning. Rommel then pitched an American League–record 17 innings in relief, allowing 14 runs, the most ever by a winning MLB pitcher, and 29 hits, a one-game MLB pitching record. This was Rommel's 17
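
## The marker-based slicing above can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the name text_between and the sample string are assumptions made for illustration. It also handles the case where a marker is missing, in which case str.find returns -1.

```python
def text_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either marker is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    # Search for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end].strip()


# Made-up sample text standing in for the real text dump
sample = "Header From today's featured article  Some article text.  Recently featured: A, B"

found = text_between(sample, "From today's featured article", "Recently featured")
missing = text_between(sample, "No such heading", "Recently featured")
print(found)
print(missing)
```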

## What if the end marker is not fixed?
## The 'On this day' heading is always present, but the text that follows the section further down the page is not.
![Screenshot%202024-03-14%20at%207.49.46%E2%80%AFPM.png](attachment:Screenshot%202024-03-14%20at%207.49.46%E2%80%AFPM.png)

In [18]:
idx3=txt_dump.find("On this day")

print(txt_dump[idx3+len("On this day"):idx3+len("On this day")+1000])



March 14: New Year's Day (Sikhism); White Day in parts of East Asia; Pi Day



Gioachino Rossini

1309 – On Eid al-Fitr, the citizens of Granada stormed palaces in the city, deposing Sultan Muhammad III and placing his half-brother Nasr on the throne.
1864 – The Petite messe solennelle was first performed in Paris, 34 years after Gioachino Rossini (pictured) retired as a composer.
1931 – Alam Ara, the first Indian sound film, premiered at the Majestic Cinema in Bombay.
1988 – China defeated Vietnam in a naval altercation while attempting to establish oceanographic observation posts on the Spratly Islands.
2021 – The Burmese military and police forces killed at least 65 civilians during the Hlaingthaya massacre in Yangon, including those protesting a recent coup d'état.
Albert Einstein  (b. 1879)Zita of Bourbon-Parma (d. 1989)Piri (b. 1999)Ieng Sary  (d. 2013)

More anniversaries: 
March 13
March 14
March 15


Archive
By email
List of days of the year




Today's featured picture






## As we can see, some unwanted data comes along with the information that we are really interested in reading. To address this issue, we need to stop slicing the plain-text dump and instead use the structure of the HTML, which BeautifulSoup also exposes.

## Open the Wikipedia page in a browser and use its Inspect functionality, which shows the HTML tags behind each part of the page.

![Screenshot%202024-03-15%20at%201.17.16%E2%80%AFPM.png](attachment:Screenshot%202024-03-15%20at%201.17.16%E2%80%AFPM.png)

## You can see that our text of interest sits in ul tags inside a div tag with the id 'mp-otd'.



In [19]:
# Use the find_all method from BeautifulSoup, which scans all the tags of the HTML page (and their sub-elements),
# to find the <div> element whose id is 'mp-otd'.
# Create an empty list and append the text of every <ul> inside that <div> as we traverse the page:
text_list=[] #Empty list
for d in soup.find_all('div'):
    if (d.get('id')=='mp-otd'):
        for i in d.find_all('ul'):
            text_list.append(i.text)
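
## The same lookup can be written more directly by asking BeautifulSoup for the div with a given id instead of looping over every div. This is a minimal sketch on a tiny made-up HTML snippet; the snippet and the names soup_demo, otd, and lists are assumptions for illustration, not the real page.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page; its structure is an assumption for illustration
html = """
<div id="mp-tfa"><ul><li>unrelated</li></ul></div>
<div id="mp-otd">
  <ul><li>1309 - first event</li><li>1864 - second event</li></ul>
  <ul><li>More anniversaries</li></ul>
</div>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# find returns the first matching tag, or None if nothing matches
otd = soup_demo.find("div", id="mp-otd")

# Collect the text of each <ul> inside the matched <div>
lists = [ul.get_text() for ul in otd.find_all("ul")] if otd is not None else []

print(len(lists))
print(lists[0])
```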

In [22]:
print (len(text_list))
for i in text_list:
    print(i)
    print('-'*100)
# We are interested in the text contained in the first element of the list

4
1309 – On Eid al-Fitr, the citizens of Granada stormed palaces in the city, deposing Sultan Muhammad III and placing his half-brother Nasr on the throne.
1864 – The Petite messe solennelle was first performed in Paris, 34 years after Gioachino Rossini (pictured) retired as a composer.
1931 – Alam Ara, the first Indian sound film, premiered at the Majestic Cinema in Bombay.
1988 – China defeated Vietnam in a naval altercation while attempting to establish oceanographic observation posts on the Spratly Islands.
2021 – The Burmese military and police forces killed at least 65 civilians during the Hlaingthaya massacre in Yangon, including those protesting a recent coup d'état.
----------------------------------------------------------------------------------------------------
Albert Einstein  (b. 1879)Zita of Bourbon-Parma (d. 1989)Piri (b. 1999)Ieng Sary  (d. 2013)
----------------------------------------------------------------------------------------------------
March 13
March 14
Marc

In [31]:
# Let's bring everything we have learnt so far together to get the text from the 'On this day' section of the Wikipedia page

def wiki_on_this_day(url="https://en.wikipedia.org/wiki/Main_Page"):
    import requests
    from bs4 import BeautifulSoup
    wiki_home = str(url)
    response = requests.get(wiki_home)
    
    def status_check(r):
        if r.status_code==200:
            print("Success!")
            return (1)
        else:
            print("Failed!")
            return (-1)

    def encoding_check(r):
        # r.encoding can be None when the server declares no charset; fall back to UTF-8
        return (r.encoding or 'utf-8')

    def decode_content(r,encoding):
        return (r.content.decode(encoding))

    status = status_check(response)
    
    if status == 1: 
        contents = decode_content(response,encoding_check(response))
    else:
        print("Sorry could not reach the web page!")
        return (-1)
    
    soup = BeautifulSoup(contents, 'html.parser')
    text_list=[] #Empty list
    for d in soup.find_all('div'):
        if (d.get('id')=='mp-otd'):
            for i in d.find_all('ul'):
                text_list.append(i.text)
                
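    # Return the first list (the day's events), or -1 if the section was not found
    return (text_list[0] if text_list else -1)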

In [32]:
wiki_on_this_day()

Success!


'44\xa0BC – Julius Caesar (bust pictured), the dictator of the Roman Republic, was stabbed to death by a group of senators led by Marcus Junius Brutus.\n1823 – Sailor Benjamin Morrell erroneously reported the existence of the island of New South Greenland near Antarctica.\n1916 – Six days after Pancho Villa and his cross-border raiders attacked Columbus, New Mexico, U.S. General John J. Pershing led a punitive expedition into Mexico to pursue Villa.\n1917 – Russian Revolution: Tsar Nicholas II was forced to abdicate in the February Revolution, ending three centuries of Romanov rule.\n1943 – The deportation of 50,000 Jews from the Greek city of Thessaloniki began.\n1951 – The Iranian oil industry was nationalized in a movement led by Mohammad Mosaddegh.'
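
## The returned string still contains non-breaking spaces (\xa0) and newline separators. A minimal sketch of turning it into structured records follows; the variable names are assumptions, and raw holds just the first two entries of the output above.

```python
# First two entries of the output above
raw = (
    "44\xa0BC – Julius Caesar (bust pictured), the dictator of the Roman Republic, "
    "was stabbed to death by a group of senators led by Marcus Junius Brutus.\n"
    "1823 – Sailor Benjamin Morrell erroneously reported the existence of the island "
    "of New South Greenland near Antarctica."
)

events = []
# Replace non-breaking spaces with ordinary spaces, then handle one line at a time
for line in raw.replace("\xa0", " ").splitlines():
    year, sep, description = line.partition(" – ")
    if sep:  # skip lines that do not follow the "year – event" pattern
        events.append({"year": year.strip(), "event": description.strip()})

for entry in events:
    print(entry["year"], "->", entry["event"][:40])
```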

## Reading data from XML. XML, or Extensible Markup Language, is a web markup language that's similar to HTML but with significant flexibility (on the part of the user) built in, such as the ability to define your own tags.

## XML is also heavily used in regular data exchanges over the web, and as a data wrangling professional, you should have enough familiarity with its basic features to tap into the data flow pipeline whenever you need to extract data for your project.

In [33]:
data = '''
<person>
  <name>Dave</name>
  <surname>Piccardo</surname>
  <phone type="intl">
     +1 742 101 4456
   </phone>
   <email hide="yes">
   dave.p@gmail.com</email>
</person>'''

import xml.etree.ElementTree as ET
tree = ET.fromstring(data)
type (tree)

xml.etree.ElementTree.Element

In [41]:
# We will use the find method to search for various pieces of useful data within an XML Element object
# and print them using the text attribute. We will also use the get method to extract the specific attribute we want.

data = '''
<person>
  <name>Dave</name>
  <surname>Piccardo</surname>
  <phone type="intl">
     +1 742 101 4456
   </phone>
   <email hide="yes">
   dave.p@gmail.com</email>
</person>'''

# we have to read it as an Element object using the Python XML parser engine
import xml.etree.ElementTree as ET
tree = ET.fromstring(data)

print('Name:', tree.find('name').text)
print('Surname:', tree.find('surname').text)
print('Phone:', tree.find('phone').text.strip())  # strip removes the leading and trailing whitespace around the number
print('Email hidden:', tree.find('email').get('hide'))  # get reads the value of the 'hide' attribute
print('Email:', tree.find('email').text.strip())

# In this exercise, we saw how to use the find method to read the relevant information from an XML document.
# XML is a very flexible format for expressing data: apart from a few ground rules,
# everything else in an XML document is customizable, including the element names used here.

Name: Dave
Surname: Piccardo
Phone: +1 742 101 4456
Email hidden: yes
Email: dave.p@gmail.com
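
## One caveat with find is that it returns None when an element is absent, so chaining .text onto it raises an AttributeError. A minimal sketch of a more defensive lookup uses findtext with a default value. The record below is a shortened copy of the example above with the email element left out; the fax element and the fallback strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the <person> record, deliberately without an <email> element
record = ET.fromstring(
    "<person><name>Dave</name><phone type='intl'> +1 742 101 4456 </phone></person>"
)

# findtext returns the element's text, or the default when the element is missing
name = record.findtext("name", default="").strip()
fax = record.findtext("fax", default="not available")

# For attributes, check that the element exists before calling get
email = record.find("email")
email_hidden = email.get("hide", "no") if email is not None else "unknown"
phone_type = record.find("phone").get("type")

print(name, fax, email_hidden, phone_type)
```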


In [42]:
tree2 = ET.parse('./datasets/xml1.xml')

type(tree2)

# This is slightly different from the fromstring method used in the previous exercise,
# where we read directly from a string object.
# Parsing a file produces an ElementTree object instead of a simple Element.

xml.etree.ElementTree.ElementTree
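
## The difference between the two entry points can be checked without a file on disk, because ET.parse also accepts a file-like object. This is a minimal sketch; the XML string, including the data root tag, is an assumed stand-in for the contents of xml1.xml.

```python
import io
import xml.etree.ElementTree as ET

# Assumed stand-in for the file contents
xml_text = "<data><country name='Liechtenstein'/><country name='Singapore'/></data>"

element = ET.fromstring(xml_text)            # returns the root Element directly
tree_demo = ET.parse(io.StringIO(xml_text))  # returns an ElementTree wrapping the root
root_demo = tree_demo.getroot()              # getroot gives the same kind of Element

print(type(element).__name__, type(tree_demo).__name__)
print(element.tag == root_demo.tag)
```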

In [44]:
# Traversing the Tree, Finding the Root, and Exploring All the Child Nodes and Their Tags and Attributes

root=tree2.getroot()

for child in root:
    print("Child:",child.tag, "| Child attribute:",child.attrib)

Child: country | Child attribute: {'name': 'Liechtenstein'}
Child: country | Child attribute: {'name': 'Singapore'}
Child: country | Child attribute: {'name': 'Panama'}


![Screenshot%202024-03-15%20at%204.20.37%E2%80%AFPM.png](attachment:Screenshot%202024-03-15%20at%204.20.37%E2%80%AFPM.png)

In [45]:
# We will use the text attribute of ElementTree elements to extract
# different types of data from a particular node of the XML document tree.

root[0][2]

<Element 'gdppc' at 0x112c85580>

In [46]:
root[0][2].text

'141100'

In [47]:
root[0][2].tag

'gdppc'

In [48]:
root[0].tag

'country'

In [53]:
root[0].attrib

{'name': 'Liechtenstein'}

##  So, root[0] is again an element, but it has a different set of tags and attributes than root[0][2]. This is expected because they are all part of the tree as nodes, but each is associated with a different level of data.

In [51]:
# We can extract and print the GDP per capita of each country using a loop
for c in root:
    country_name=c.attrib['name']
    gdppc = int(c[2].text)
    print("{}: {}".format(country_name,gdppc))

Liechtenstein: 141100
Singapore: 59900
Panama: 13600
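
## The loop above relies on gdppc being the third child of every country (c[2]). A minimal sketch of a lookup by tag name, which does not depend on child order, is shown here. The XML layout is an assumption modelled on the output above, and Panama is deliberately left without a gdppc element to show the missing-value case.

```python
import xml.etree.ElementTree as ET

# Assumed layout; the child order differs between countries on purpose
countries_xml = """
<data>
  <country name="Liechtenstein"><rank>1</rank><gdppc>141100</gdppc></country>
  <country name="Singapore"><gdppc>59900</gdppc><rank>4</rank></country>
  <country name="Panama"><rank>68</rank></country>
</data>"""

root_demo = ET.fromstring(countries_xml)

gdp_by_country = {}
for country in root_demo.findall("country"):
    gdp_text = country.findtext("gdppc")  # None when the tag is absent
    gdp_by_country[country.get("name")] = int(gdp_text) if gdp_text else None

print(gdp_by_country)
```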


In [57]:
# Finding neighbouring countries for each country using the findall method
for c in root:
    # Find all the neighbors
    ne=c.findall('neighbor')
    print("Neighbors of {}:\n {}".format(c.attrib['name'],'**************'))
    # Iterate over the neighbors and print their 'name' attribute
    for i in ne:
        print(i.attrib['name'])
    print('\n')

Neighbors of Liechtenstein:
 **************
Austria
Switzerland


Neighbors of Singapore:
 **************
Malaysia


Neighbors of Panama:
 **************
Costa Rica
Colombia
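
## The nested loops above can also be expressed with the limited XPath syntax that ElementTree supports in findall, or with iter to walk every matching descendant. This is a minimal sketch on an assumed XML snippet that mirrors part of the output above.

```python
import xml.etree.ElementTree as ET

# Assumed snippet mirroring part of the output above
doc = ET.fromstring("""
<data>
  <country name="Liechtenstein">
    <neighbor name="Austria"/>
    <neighbor name="Switzerland"/>
  </country>
  <country name="Singapore">
    <neighbor name="Malaysia"/>
  </country>
</data>""")

# Map each country to the names of its neighbors
neighbors = {
    country.get("name"): [n.get("name") for n in country.findall("neighbor")]
    for country in doc.findall("country")
}

# An attribute predicate selects the neighbors of a single country
singapore_only = [
    n.get("name") for n in doc.findall("./country[@name='Singapore']/neighbor")
]

# iter visits every <neighbor> element at any depth, in document order
all_neighbors = [n.get("name") for n in doc.iter("neighbor")]

print(neighbors)
print(singapore_only)
print(all_neighbors)
```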




# Reading data from an API

## Fundamentally, an API or Application Programming Interface is an interface to a computing resource (for example, an operating system or database table), which has a set of exposed methods (function calls) that allow a programmer to access particular data or internal features of that resource.

## A web API is, as the name suggests, an API over the web. Note that it is not a specific technology or programming framework, but an architectural concept. Think of an API like a fast-food restaurant's customer service desk. Internally, there are many food items, raw materials, cooking resources, and recipe management systems, but all you see are fixed menu items on the board and you can only interact through those items. A web API is like a port that can be accessed over the HTTP protocol and that's able to deliver data and services if used properly.

## Therefore, it is very important for a data wrangling professional to understand the basics of data extraction from a web API, as you are extremely likely to find yourself in a situation where large quantities of data must be read through an API for processing and wrangling. These days, most APIs return data in JSON format. In this section, we will use a free API to read some information about post offices in India in JSON format and process it.

## First, we need to set the base URL. When we are dealing with API microservices, this is often called the API endpoint. API-based microservices are extremely dynamic in terms of what they offer and how they offer it, and they can change at any time. For many APIs, you also need your own API key, which you get by registering with the service. Basic usage (up to a fixed number of requests or a data flow limit) is often free, but after that, you will be charged. To register for an API key, you may need to enter credit card information.
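
## As a minimal sketch of how a request URL is typically assembled from an endpoint, a resource path, and query parameters, consider the snippet below. The endpoint https://api.example.com/v1/, the apikey and format parameters, and the key value are all hypothetical and do not belong to any real service. No request is sent.

```python
from urllib.parse import quote, urlencode

base_url = "https://api.example.com/v1/"  # hypothetical endpoint

# Percent-encode the path segment so that spaces and special characters are safe
resource = "postoffice/" + quote("New Delhi")

# Hypothetical query parameters, including a placeholder API key
query = urlencode({"apikey": "YOUR_API_KEY", "format": "json"})

url = f"{base_url}{resource}?{query}"
print(url)
```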

## We will use the postal pin code API

http://www.postalpincode.in/Api-Details 

## The Postal PIN Code API allows developers to get details of post offices in India by searching for a postal PIN code or a post office branch name.

In [62]:
import urllib.request, urllib.parse
from urllib.error import HTTPError,URLError
import json
import pandas as pd

serviceurl = 'https://api.postalpincode.in/pincode/'

pin_code = 247667
url = serviceurl + str(pin_code)
uh = urllib.request.urlopen(url)

print (uh)

<http.client.HTTPResponse object at 0x11898dfc0>


In [101]:
# Let's write a function to retrieve data.
# The argument can be an integer PIN code or a string; a string is treated as a post office branch name.

def get_pincode_info(pincode):
    """
    Function to get data about Pin Codes from http://www.postalpincode.in/Api-Details
    """
    
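    if isinstance(pincode,str):
        serviceurl = 'https://api.postalpincode.in/postoffice/'
        # Percent-encode the branch name so that names containing spaces form a valid URL
        url = serviceurl + urllib.parse.quote(pincode)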
    else: 
        serviceurl = 'https://api.postalpincode.in/pincode/'
        url = serviceurl + str(pincode)
    
    try: 
        uh = urllib.request.urlopen(url)
    except HTTPError as e:
        print("Sorry! Could not retrieve anything on pincode: {}".format(pincode))
        return None
    
    except URLError as e:
        print('Failed to reach a server.')
        print('Reason: ', e.reason)
        return None
    
    else:
        data = uh.read().decode()
        print("Retrieved data on pincode {}. Total {} characters read.".format(pincode,len(data)))
        return data 

In [102]:
#pincode = 247667
pincode = 'Roorkee'
data = get_pincode_info(pincode)
print (data)

Retrieved data on pincode Roorkee. Total 1839 characters read.
[{"Message":"Number of Post office(s) found:7","Status":"Success","PostOffice":[{"Name":"I.I.T Roorkee","Description":null,"BranchType":"Sub Post Office","DeliveryStatus":"Non-Delivery","Circle":"Uttarakhand","District":"Haridwar","Division":"Dehradun","Region":"Dehradun","State":"Uttarakhand","Country":"India","Pincode":"247667"},{"Name":"Roorkee","Description":null,"BranchType":"Head Post Office","DeliveryStatus":"Delivery","Circle":"Uttarakhand","District":"Haridwar","Division":"Dehradun","Region":"Dehradun","State":"Uttarakhand","Country":"India","Pincode":"247667"},{"Name":"Roorkee Cantt","Description":null,"BranchType":"Sub Post Office","DeliveryStatus":"Non-Delivery","Circle":"Uttarakhand","District":"Haridwar","Division":"Dehradun","Region":"Dehradun","State":"Uttarakhand","Country":"India","Pincode":"247667"},{"Name":"Roorkee City","Description":null,"BranchType":"Sub Post Office","DeliveryStatus":"Non-Delivery","C

In [103]:
# Now, we will use JSON functionality of Python to read through the data we have received
import json

data_json=json.loads(data)
print (len(data_json))

# Load the only element
data_json = data_json[0]

type(data)

1


str

In [94]:
print (data_json.keys())

dict_keys(['Message', 'Status', 'PostOffice'])


In [95]:
for k,v in data_json.items():
    print("{}: {}".format(k,v))
    print ('****************************************')

Message: Number of Post office(s) found:7
****************************************
Status: Success
****************************************
PostOffice: [{'Name': 'I.I.T Roorkee', 'Description': None, 'BranchType': 'Sub Post Office', 'DeliveryStatus': 'Non-Delivery', 'Circle': 'Uttarakhand', 'District': 'Haridwar', 'Division': 'Dehradun', 'Region': 'Dehradun', 'State': 'Uttarakhand', 'Country': 'India', 'Pincode': '247667'}, {'Name': 'Roorkee', 'Description': None, 'BranchType': 'Head Post Office', 'DeliveryStatus': 'Delivery', 'Circle': 'Uttarakhand', 'District': 'Haridwar', 'Division': 'Dehradun', 'Region': 'Dehradun', 'State': 'Uttarakhand', 'Country': 'India', 'Pincode': '247667'}, {'Name': 'Roorkee Cantt', 'Description': None, 'BranchType': 'Sub Post Office', 'DeliveryStatus': 'Non-Delivery', 'Circle': 'Uttarakhand', 'District': 'Haridwar', 'Division': 'Dehradun', 'Region': 'Dehradun', 'State': 'Uttarakhand', 'Country': 'India', 'Pincode': '247667'}, {'Name': 'Roorkee City', 'Descr

## It is clear, therefore, that there is no universal method or processing function for the JSON data format, and you have to write custom loops and functions to extract data from such a dictionary object based on your particular needs.
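
## A minimal sketch of such a custom extraction function is shown here. The payload is a shortened copy of the response printed above. The shape of error_payload (Status set to Error and PostOffice set to null) is an assumption about how the service reports an unsuccessful lookup, and post_office_names is a hypothetical helper name.

```python
import json

# Shortened copy of the response shown above
payload = '''[{"Message":"Number of Post office(s) found:2","Status":"Success",
"PostOffice":[{"Name":"I.I.T Roorkee","BranchType":"Sub Post Office","Pincode":"247667"},
{"Name":"Roorkee","BranchType":"Head Post Office","Pincode":"247667"}]}]'''

# Assumed shape of an unsuccessful response
error_payload = '[{"Message":"No records found","Status":"Error","PostOffice":null}]'


def post_office_names(raw_json):
    """Return the post office names in a response, or [] when there are none."""
    result = json.loads(raw_json)[0]
    if result.get("Status") != "Success":
        return []
    offices = result.get("PostOffice") or []  # guards against a null or missing list
    return [office.get("Name") for office in offices]


names = post_office_names(payload)
no_names = post_office_names(error_payload)
print(names)
print(no_names)
```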

In [98]:
# Most of the data is in the key: PostOffice. 
# So, let's extract the names of post offices

for post_office_entry in data_json['PostOffice']:
    print(post_office_entry['Name'])
    
print ('*******************************')

data_json['PostOffice'][0]

I.I.T Roorkee
Roorkee
Roorkee Cantt
Roorkee City
Roorkee Kalan
Roorkee Kutchery
Roorkee Pukhta
*******************************


{'Name': 'I.I.T Roorkee',
 'Description': None,
 'BranchType': 'Sub Post Office',
 'DeliveryStatus': 'Non-Delivery',
 'Circle': 'Uttarakhand',
 'District': 'Haridwar',
 'Division': 'Dehradun',
 'Region': 'Dehradun',
 'State': 'Uttarakhand',
 'Country': 'India',
 'Pincode': '247667'}

## The function below is the kind of wrapper function you are generally expected to write in real-life data wrangling tasks, that is, a utility function that can take a user argument and output a useful data structure (or a mini database-type object) with key information extracted over the internet about the item the user is interested in.

In [112]:
import pandas as pd
import json
def pincode_database(list_pincodes):
    """
    Takes a list of pincodes.
    Output a DataFrame with key information about the pincodes.
    """
    # Define an empty dictionary with keys
    pincode_dict={'Name':[],'BranchType':[],'District':[], 'Division':[],'Region':[],'State':[], 'Pincode':[]}
    
    for pin in list_pincodes:
        data = get_pincode_info(pin)
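        if data is not None:
            data_json = json.loads(data)
            data_json = data_json[0]
            # 'PostOffice' may be missing or null when nothing is found, so fall back to an empty list
            for post_office_entry in (data_json.get('PostOffice') or []):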
                pincode_dict['Name'].append(post_office_entry['Name'])
                pincode_dict['BranchType'].append(post_office_entry['BranchType'])
                pincode_dict['District'].append(post_office_entry['District'])
                pincode_dict['Division'].append(post_office_entry['Division'])
                pincode_dict['Region'].append(post_office_entry['Region'])
                pincode_dict['State'].append(post_office_entry['State'])
                pincode_dict['Pincode'].append(post_office_entry['Pincode'])
            
    # Return as a Pandas DataFrame
    return pd.DataFrame(pincode_dict)
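
## Because each entry under PostOffice is already a dictionary, the per-key appends can also be replaced by handing the list of dictionaries to pandas and naming the columns to keep. This is a minimal sketch using two shortened records from the response above; the variable names are assumptions.

```python
import pandas as pd

# Two shortened records taken from the response shown earlier
records = [
    {"Name": "I.I.T Roorkee", "Description": None,
     "BranchType": "Sub Post Office", "Pincode": "247667"},
    {"Name": "Roorkee", "Description": None,
     "BranchType": "Head Post Office", "Pincode": "247667"},
]

# Only the listed keys become columns, in this order; other keys are dropped
columns = ["Name", "BranchType", "Pincode"]
df = pd.DataFrame(records, columns=columns)

print(df.shape)
print(df)
```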

In [113]:
pincode_database([247667])

Retrieved data on pincode 247667. Total 7096 characters read.


Unnamed: 0,Name,BranchType,District,Division,Region,State,Pincode
0,Belra,Branch Office directly a/w Head Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
1,Bharapur Bhori,Branch Office directly a/w Head Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
2,Dandera Khwaspur,Branch Office directly a/w Head Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
3,Dhanauri,Branch Office directly a/w Head Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
4,Ganesh Vatika,Sub Post Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
5,Ganeshpur,Sub Post Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
6,I.I.T Roorkee,Sub Post Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
7,Imalikhera,Branch Office directly a/w Head Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
8,Kota Muradnagar,Branch Office directly a/w Head Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
9,Manu Bas,Branch Office directly a/w Head Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667


In [115]:
list_of_pincodes = [247667,787032]
pincode_database(list_of_pincodes)

Retrieved data on pincode 247667. Total 7096 characters read.
Retrieved data on pincode 787032. Total 5345 characters read.


Unnamed: 0,Name,BranchType,District,Division,Region,State,Pincode
0,Belra,Branch Office directly a/w Head Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
1,Bharapur Bhori,Branch Office directly a/w Head Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
2,Dandera Khwaspur,Branch Office directly a/w Head Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
3,Dhanauri,Branch Office directly a/w Head Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
4,Ganesh Vatika,Sub Post Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
5,Ganeshpur,Sub Post Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
6,I.I.T Roorkee,Sub Post Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
7,Imalikhera,Branch Office directly a/w Head Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
8,Kota Muradnagar,Branch Office directly a/w Head Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
9,Manu Bas,Branch Office directly a/w Head Office,Haridwar,Dehradun,Dehradun,Uttarakhand,247667
