Part I: Extracting and Analyzing HTML and JSON Data

We begin by extracting data from different structured formats on the web, using Python libraries to pull the data and convert it into a form suitable for analysis.

HTML Data Extraction:
We start by pulling book data from an HTML file hosted on GitHub. This process is streamlined by Python's pandas library, which directly reads the HTML and converts it into a structured DataFrame. The initial data extracted includes titles, authors, genres, and publication years, presenting an organized snapshot of literary works spanning different eras.

Part II: Scraping and Analyzing Data from the Katz School’s “Staff” Web Page

Further into our exploration, we implement web scraping techniques to extract staff data directly from a web page. This method proves essential for data that is frequently updated or when direct downloads are not feasible.

Part III: Leveraging Web APIs for Data Insights

Our final task involves interacting with the New York Times Most Popular API. This rich data source provides real-time insights into the most viewed and shared articles over a specified period, offering a dynamic look at public engagement with news content.

Data Analysis:

Using the data obtained from the New York Times API, we analyze trends in article views and shares, focusing on the content areas that attract the most audience interaction. By visualizing this data through bar charts, we gain a visual understanding of which sections (like U.S. news, World events, etc.) dominate reader interest.

Conclusion:

This project demonstrates both the technical skills needed to handle various web data formats and the analytical techniques used to extract meaningful insights from that data. The tools and methods shown here apply to static files on GitHub as well as to dynamic content accessed via APIs.

### 1. Extract the names of each individual from the unformatted text string shown above and store them in a vector of some sort. When complete, your vector should contain the following entries:

In [23]:
import re  # regular-expression support from the standard library
# Raw input: names interleaved with phone numbers, with no separators between them
text ="555-1239Khan, Ghengis(636) 555-0113Fitzgerald, F. Scott555 -6542Rev. Adam Clayton Powell555 8904Loretta Lynn636-555-3226Case,Justin5553642Dr. Julius Erving555-401-2232Constance Prudence Boringsworth"
title = r"(?:[A-Z][a-z]*\.\s*)?"  # optional title such as "Rev." or "Dr."
first_name = r"[A-Z][a-z]+,?\s?"  # first word of the name, optionally followed by a comma
middle_name = r"(?:[A-Z][a-z]*\.?\s*)?"  # optional middle name or initial
last_name = r"[A-Z][a-z]+"  # final word of the name
v1 = re.findall(title + first_name + middle_name + last_name, text)  # all names, in order of appearance
print(v1)

['Khan, Ghengis', 'Fitzgerald, F. Scott', 'Rev. Adam Clayton Powell', 'Loretta Lynn', 'Case,Justin', 'Dr. Julius Erving', 'Constance Prudence Boringsworth']
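The four pattern pieces above can also be written as one commented pattern. The following is a minimal sketch using `re.VERBOSE`, which ignores whitespace and `#` comments inside the pattern; the names `NAME_PATTERN` and `names` are illustrative and not part of the original solution.

```python
import re

text = "555-1239Khan, Ghengis(636) 555-0113Fitzgerald, F. Scott555 -6542Rev. Adam Clayton Powell555 8904Loretta Lynn636-555-3226Case,Justin5553642Dr. Julius Erving555-401-2232Constance Prudence Boringsworth"

# Same pattern as title + first_name + middle_name + last_name,
# laid out one piece per line (NAME_PATTERN is a hypothetical name).
NAME_PATTERN = re.compile(r"""
    (?:[A-Z][a-z]*\.\s*)?     # optional title such as "Rev." or "Dr."
    [A-Z][a-z]+,?\s?          # first word, optionally followed by a comma
    (?:[A-Z][a-z]*\.?\s*)?    # optional middle name or initial
    [A-Z][a-z]+               # final word of the name
""", re.VERBOSE)

names = NAME_PATTERN.findall(text)
print(names)
```

Because the pieces are unchanged, this should return the same seven names as the concatenated version.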


### 2a. Use your regex skills to rearrange the vector so that all elements conform to the standard “firstname lastname”, preserving any titles (e.g., “Rev.”, “Dr.”, etc) or middle/second names.

In [24]:
v2 = v1  # start from the names extracted in question 1
l1 = []  # each name split at the comma, if it has one
for i in v2:
    l1.append(i.split(','))
l2 = []  # names rearranged as "firstname lastname"
for parts in l1:
    if len(parts) > 1:
        # "lastname, firstname" form: swap the halves and trim stray spaces
        l2.append(parts[1].strip() + " " + parts[0].strip())
    else:
        # already in "firstname lastname" order
        l2.append(parts[0])
print(l1)
l2  # the rearranged names

[['Khan', ' Ghengis'], ['Fitzgerald', ' F. Scott'], ['Rev. Adam Clayton Powell'], ['Loretta Lynn'], ['Case', 'Justin'], ['Dr. Julius Erving'], ['Constance Prudence Boringsworth']]


['Ghengis Khan',
 'F. Scott Fitzgerald',
 'Rev. Adam Clayton Powell',
 'Loretta Lynn',
 'Justin Case',
 'Dr. Julius Erving',
 'Constance Prudence Boringsworth']
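Since the question asks for a regex, the same rearrangement can also be expressed as a single substitution with capture groups. This is a sketch rather than the original solution; the function name `to_first_last` is hypothetical, and `v1` is copied from the output of question 1.

```python
import re

v1 = ['Khan, Ghengis', 'Fitzgerald, F. Scott', 'Rev. Adam Clayton Powell',
      'Loretta Lynn', 'Case,Justin', 'Dr. Julius Erving',
      'Constance Prudence Boringsworth']

def to_first_last(name):
    """Turn 'Last, First' into 'First Last'; leave other names unchanged."""
    # Group 1: everything before the comma (the last name).
    # Group 2: everything after the comma; \s* absorbs the optional space.
    return re.sub(r"^([^,]+),\s*(.+)$", r"\2 \1", name)

rearranged = [to_first_last(name) for name in v1]
print(rearranged)
```

Names without a comma do not match the pattern, so `re.sub` returns them untouched and titles and middle names are preserved.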

In [25]:
l1[0][0], l1[0][1] = l1[0][1], l1[0][0]  # swap the two halves of the first entry in place
l1

[[' Ghengis', 'Khan'],
 ['Fitzgerald', ' F. Scott'],
 ['Rev. Adam Clayton Powell'],
 ['Loretta Lynn'],
 ['Case', 'Justin'],
 ['Dr. Julius Erving'],
 ['Constance Prudence Boringsworth']]

In [26]:
# Exploratory attempt: match only the part that follows a comma and a space
title1 = r"(?:[A-Z][a-z]*\.\s*)?"  # optional title
first_name1 = r", [A-Z][a-z]+[^,]"  # comma, space, a capitalized word, then one non-comma character
v2 = re.findall(title1 + first_name1, text)  # only "Khan, Ghengis" contains this shape
print(v2)

[', Ghengis(']


In [27]:
# Strip titles and the initial "F.", then normalize the whitespace in each name

l3 = []  # cleaned names
for item in v1:
    itemTemp = re.sub(r"Dr\.|Rev\.|F\.|,\s+", " ", item).strip()  # drop "Dr.", "Rev.", "F." and any comma followed by whitespace
    itemTemp = re.sub(r"\s+", " ", itemTemp)  # collapse runs of whitespace into one space
    l3.append(itemTemp)
l3  # 'Case,Justin' is unchanged because its comma is not followed by whitespace

['Khan Ghengis',
 'Fitzgerald Scott',
 'Adam Clayton Powell',
 'Loretta Lynn',
 'Case,Justin',
 'Julius Erving',
 'Constance Prudence Boringsworth']

In [28]:
['Khan, Ghengis', 'Fitzgerald, F. Scott', 'Rev. Adam Clayton Powell', 'Loretta Lynn', 'Case,Justin', 'Dr. Julius Erving', 'Constance Prudence Boringsworth']

['Khan, Ghengis',
 'Fitzgerald, F. Scott',
 'Rev. Adam Clayton Powell',
 'Loretta Lynn',
 'Case,Justin',
 'Dr. Julius Erving',
 'Constance Prudence Boringsworth']

### 2b. Using your regex skills, construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

In [29]:
[bool(re.search(r"^(?:Dr|Rev)\.", i)) for i in v1]  # True where the name starts with "Dr." or "Rev."

[False, False, True, False, False, True, False]
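The anchoring and escaping in a title pattern matter more than the seven sample names reveal. The sketch below compares a loose pattern, `(^Dr.|Rev.)`, with a stricter one on a few names; "Paul Revere" and "Drake Bell" are hypothetical inputs added only to expose the difference, and `has_title` is an illustrative helper name.

```python
import re

# Hypothetical test names: the last two have no title but trip a loose pattern.
samples = ["Dr. Julius Erving", "Rev. Adam Clayton Powell", "Paul Revere", "Drake Bell"]

LOOSE = r"(^Dr.|Rev.)"        # "." matches any character; "^" applies only to "Dr."
STRICT = r"^(?:Dr|Rev)\."     # anchored for both titles, with a literal period

def has_title(name, pattern=STRICT):
    """Return True if the name matches the given title pattern."""
    return bool(re.search(pattern, name))

loose_result = [has_title(name, LOOSE) for name in samples]
strict_result = [has_title(name) for name in samples]
print(loose_result)
print(strict_result)
```

With the loose pattern, "Reve" in "Revere" and "Dra" in "Drake" both count as titles; the strict pattern rejects them.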

### 2c. Using your regex skills, construct a logical vector indicating whether a character has a middle/second name.

In [30]:
l4 = []  # True where the name has a middle/second name
for i in v1:
    if re.search(r"^(?:Dr|Rev)\.", i):
        l4.append(len(i.split(" ")) > 3)  # a title adds one word, so a middle name needs more than 3
    else:
        l4.append(len(i.split(" ")) > 2)  # without a title, a middle name needs more than 2
l4

[False, True, True, False, False, False, True]
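The word-count approach above can also be written with regexes only. This is a sketch under the assumption that titles are limited to "Dr." and "Rev."; the helper name `has_middle_name` is hypothetical, and `v1` is copied from question 1.

```python
import re

v1 = ['Khan, Ghengis', 'Fitzgerald, F. Scott', 'Rev. Adam Clayton Powell',
      'Loretta Lynn', 'Case,Justin', 'Dr. Julius Erving',
      'Constance Prudence Boringsworth']

def has_middle_name(name):
    """Return True if the name has more than two name parts after the title."""
    # Remove a leading "Dr." or "Rev." so it is not counted as a name part.
    without_title = re.sub(r"^(?:Dr|Rev)\.\s*", "", name)
    # Each capitalized word or initial (e.g. "F.") counts as one name part.
    parts = re.findall(r"[A-Z][a-z]*\.?", without_title)
    return len(parts) > 2

middle_flags = [has_middle_name(name) for name in v1]
print(middle_flags)
```

Counting capitalized words instead of splitting on spaces means "Case,Justin" is counted as two parts even though it contains no space.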

### 3. Consider the HTML string “<title>+++BREAKING NEWS+++<title>”. We would like to extract the first HTML tag (i.e., “<title>”). To do so we write the regular expression “<.+>”. Explain why this fails and correct the expression.

In [31]:
text = "<title>+++BREAKING NEWS+++<title>"
pattern = r"<.+>"
print(re.findall(pattern,text))

['<title>+++BREAKING NEWS+++<title>']


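This regex fails because ".+" is greedy: it matches as many characters as possible, so the match runs from the first "<" to the last ">" and returns the whole string instead of the first tag. Restricting the characters allowed between the brackets to lowercase letters stops the match at the first ">".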

In [32]:
text = "<title>+++BREAKING NEWS+++<title>"
pattern = r"<[a-z]+>"
print(re.findall(pattern,text)[0])

<title>
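Two other common ways to stop the match at the first closing bracket are a lazy quantifier and a negated character class. The sketch below compares them with the greedy original; the variable names are illustrative.

```python
import re

text = "<title>+++BREAKING NEWS+++<title>"

greedy = re.findall(r"<.+>", text)      # ".+" grabs as much as possible
lazy = re.findall(r"<.+?>", text)       # ".+?" grabs as little as possible
negated = re.findall(r"<[^>]+>", text)  # anything except ">" between the brackets

first_tag = re.search(r"<.+?>", text).group()  # first match only

print(greedy)
print(lazy)
print(negated)
print(first_tag)
```

Unlike `[a-z]+`, both alternatives would also match tags containing digits, uppercase letters, or attributes, which may or may not be wanted.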


### 4. Consider the string “(5-3)^2=5^2-2*5*3+3^2”, which conforms to the binomial theorem. We would like to extract the equation in its entirety from the string. To do so we write the regular expression “[^0-9=+*()]+”. Explain why this fails and correct the expression.

In [33]:
text = "(5-3)^2=5^2-2*5*3+3^2"
pattern = r"[^0-9=+*()]+"
print(re.findall(pattern,text))

['-', '^', '^', '-', '^']


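This regex fails because the "^" at the start of the square brackets negates the character class, so the pattern matches only characters that are not digits or one of "=+*()". Only the "-" and "^" symbols are returned. Replacing the class with ".+", which matches any run of characters, returns the equation in its entirety.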

In [18]:
text = "(5-3)^2=5^2-2*5*3+3^2"
pattern = r".+"
print(re.findall(pattern,text))

['(5-3)^2=5^2-2*5*3+3^2']
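A narrower correction keeps the character class but removes the negation and adds the two missing symbols. This is a sketch of that alternative: inside square brackets, "^" is literal when it is not the first character, and "-" is literal when it is the last.

```python
import re

text = "(5-3)^2=5^2-2*5*3+3^2"

# Original pattern: the leading "^" negates the class.
negated = re.findall(r"[^0-9=+*()]+", text)

# Corrected class: digits plus every operator used in the equation.
explicit = re.findall(r"[0-9=+*()^-]+", text)

# fullmatch confirms that the whole string consists of allowed characters.
is_equation = re.fullmatch(r"[0-9=+*()^-]+", text) is not None

print(negated)
print(explicit)
print(is_equation)
```

Unlike ".+", the explicit class would stop at any character that is not part of an equation, such as surrounding words.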


SUBMITTED BY:
VIJAYASURIYA SURESH