### **Hello!**
In this notebook, we will use the BeautifulSoup package to scrape the text and metadata of a talk by Krishnamurti and store them in a dataframe.

The website we will be scraping from is: https://jkrishnamurti.org/jksearch 


**Getting the Speech**

In [1]:
# Importing dependencies
import requests
from bs4 import BeautifulSoup

In [2]:
# Making a 'get' request
page = requests.get("https://jkrishnamurti.org/content/action-without-conflict")

soup = BeautifulSoup(page.content, 'html.parser')

In [3]:
%%capture
# Looking at the html object
soup

In [4]:
# Getting <div> with all text of speech
text = [paragraph.get_text() for paragraph in soup.find('div', property="content:encoded").find_all('p')]

In [5]:
# Creating speech string
speech = ' '.join(text)
print(speech)

This is the last talk of this year. I think the more one observes the world's condition, the more it becomes clear that there must be a totally different kind of action. One sees in the world - including in India - the confusion, the great sorrow, the misery, the starvation, the general decline. One is aware of this, one knows it from newspapers, reading magazines, books, but it remains on the intellectual level because we don't seem to be able to do anything about it. Human beings are in despair, there is great sorrow in themselves, frustration, and there is chaos about one. The more you observe and go into it, not intellectually, not verbally, but actually discuss, observe, act, enquire, examine, the more you see how confused human beings are. They are lost. And those who think they are not lost because they belong to a particular group, a circle, and feel the more you practise, the more you do certain things, the more you do social work, or this or that, the more they are sure that 
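The paragraph extraction above assumes that the page contains a `<div>` with `property="content:encoded"`; if it does not, `soup.find` returns `None` and the chained `.find_all` call raises an `AttributeError`. Below is a minimal sketch of a guarded version. The `SAMPLE_HTML` string and the `extract_speech` helper are hypothetical stand-ins, not part of the site or of this notebook, and the sketch also strips each paragraph, which the cell above does not do.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a talk page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
  <div property="content:encoded">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

def extract_speech(html):
    """Return the speech text, or None if the content <div> is missing."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", property="content:encoded")
    if container is None:
        # The page has no speech container, so there is nothing to join
        return None
    paragraphs = [p.get_text().strip() for p in container.find_all("p")]
    return " ".join(paragraphs)

speech_sample = extract_speech(SAMPLE_HTML)
missing = extract_speech("<html><body><p>No content div here.</p></body></html>")
```

Returning `None` instead of raising lets a scraping loop skip pages whose layout differs and carry on with the rest.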

**Getting Metadata**

In [6]:
# Getting the <div> elements that hold the metadata fields
fields = soup.select(".custom-footer .field-wrapper")

# Getting the label and value of the first field
f_label = fields[0].select(".field-label")[0].get_text().strip()
f_value = fields[0].find('div').get_text().strip()

print(f_label)
print(f_value)

Text source
AV


In [7]:
# Collecting the label and value of every metadata field
# (each value is wrapped in a list so that it becomes a one-row column later)
labels = [field.select(".field-label")[0].get_text().strip() for field in fields]
values = [field.find('div').get_text().strip() for field in fields]
values = [[value] for value in values]

print(labels)
print(values)

['Text source', 'Talk Type', 'Participants Category', 'Decade', 'Date Code', 'Country', 'City']
[['AV'], ['Talk'], ['Public'], ['60s'], ['660302'], ['India'], ['Bombay (Mumbai)']]
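The 'Date Code' values look like two-digit year, month and day (for example `710112` belongs to the talk whose URL ends in `12-january-1971`). Here is a small sketch that turns such a code into a `date`. The helper name `parse_date_code` is hypothetical, and it rests on two assumptions: every code starts with six `YYMMDD` digits, and every year falls in the 1900s. A suffix such as `-1`, whose meaning is not explained on the page, is simply ignored.

```python
from datetime import datetime

def parse_date_code(code):
    """Parse a 'Date Code' such as '660302' or '700808-1' into a date.

    Assumption: the first six characters are YYMMDD and the year is 19YY.
    """
    digits = code.strip()[:6]
    # Prefix the century explicitly rather than relying on %y's pivot year
    parsed = datetime.strptime("19" + digits, "%Y%m%d")
    return parsed.date()

first_talk = parse_date_code("660302")
with_suffix = parse_date_code("700808-1")
```

A malformed code raises a `ValueError` from `strptime`, which is probably what we want while exploring the data.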


In [8]:
# Creating a dictionary containing text and metadata
d = dict(zip(labels, values))
d['text'] = speech
print(d)


{'Text source': ['AV'], 'Talk Type': ['Talk'], 'Participants Category': ['Public'], 'Decade': ['60s'], 'Date Code': ['660302'], 'Country': ['India'], 'City': ['Bombay (Mumbai)'], 'text': "This is the last talk of this year. I think the more one observes the world's condition, the more it becomes clear that there must be a totally different kind of action. One sees in the world - including in India - the confusion, the great sorrow, the misery, the starvation, the general decline. One is aware of this, one knows it from newspapers, reading magazines, books, but it remains on the intellectual level because we don't seem to be able to do anything about it. Human beings are in despair, there is great sorrow in themselves, frustration, and there is chaos about one. The more you observe and go into it, not intellectually, not verbally, but actually discuss, observe, act, enquire, examine, the more you see how confused human beings are. They are lost. And those who think they are not lost bec

In [9]:
import pandas as pd

speech_df = pd.DataFrame.from_dict(d)
speech_df

Unnamed: 0,Text source,Talk Type,Participants Category,Decade,Date Code,Country,City,text
0,AV,Talk,Public,60s,660302,India,Bombay (Mumbai),This is the last talk of this year. I think th...


In [15]:
# Let us add more rows to this dataframe, one per talk
talk_urls = [
  'https://jkrishnamurti.org//content/brain-completely-attentive/',
  'https://jkrishnamurti.org//content/seeking-security/',
  'https://jkrishnamurti.org//content/if-one-conforming-there-no-freedom/',
  'https://jkrishnamurti.org//content/maturity-freedom-conditioning/',
  'https://jkrishnamurti.org//content/public-discussion-1-madras-chennai-india-12-january-1971/',
  'https://jkrishnamurti.org//content/public-discussion-2-madras-chennai-india-15-january-1971/',
]

for url in talk_urls:
  page = requests.get(url)
  soup = BeautifulSoup(page.content, 'html.parser')

  text = [paragraph.get_text() for paragraph in soup.find('div', property="content:encoded").find_all('p')]
  speech = ' '.join(text)

  # Getting the <div> elements that hold the metadata fields
  fields = soup.select(".custom-footer .field-wrapper")

  # Creating a dictionary with speech and metadata
  labels = [field.select(".field-label")[0].get_text().strip() for field in fields]
  values = [field.find('div').get_text().strip() for field in fields]
  d = dict(zip(labels, values))
  d['text'] = speech

  # DataFrame.append was removed in pandas 2.0, so we add the row with pd.concat
  speech_df = pd.concat([speech_df, pd.DataFrame([d])], ignore_index=True)


In [16]:
speech_df

Unnamed: 0,Text source,Talk Type,Participants Category,Decade,Date Code,Country,City,text
0,AV,Talk,Public,60s,660302,India,Bombay (Mumbai),This is the last talk of this year. I think th...
1,[AV],[Discussion],[Public],[70s],[700808-1],[Switzerland],[Saanen],"We'll go on, I think, don't you, where we left..."
2,[AV],[Discussion],[Public],[70s],[700809],[Switzerland],[Saanen],"This is the last discussion and if we may, sha..."
3,[AV],[Discussion],[Public],[70s],[700908],[England],[Brockwood Park],Krishnamurti: This is supposed to be a discuss...
4,[AV],[Discussion],[Public],[70s],[700910],[England],[Brockwood Park],Krishnamurti: What shall we talk over together...
5,[AV],[Discussion],[Public],[70s],[710112],[India],[Madras (Chennai)],Krishnamurti: What shall we talk over together...
6,[AV],[Discussion],[Public],[70s],[710115],[India],[Madras (Chennai)],Krishnamurti: What shall we talk over together...
7,AV,Discussion,Public,70s,700808-1,Switzerland,Saanen,"We'll go on, I think, don't you, where we left..."
8,AV,Discussion,Public,70s,700809,Switzerland,Saanen,"This is the last discussion and if we may, sha..."
9,AV,Discussion,Public,70s,700908,England,Brockwood Park,Krishnamurti: This is supposed to be a discuss...
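The table above mixes bracketed and plain values, which appears to come from appending rows whose values were lists in one run and plain strings in another. One way to keep rows uniform is to collect one dictionary per talk in a list and build the dataframe once at the end. A minimal sketch follows; the `records` below are hypothetical placeholders shaped like the dictionaries built in the loop, not scraped data.

```python
import pandas as pd

# Hypothetical records, shaped like the per-talk dictionaries built above
records = [
    {"Talk Type": "Talk", "Country": "India", "text": "first talk"},
    {"Talk Type": "Discussion", "Country": "England", "text": "second talk"},
]

# A list of dictionaries with scalar values becomes one row per dictionary
frame = pd.DataFrame(records)
```

Building the dataframe in one step also avoids growing it row by row, which copies the data on every iteration.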


In [18]:
speech_df.to_csv('raw_data.csv', index=False)

In [20]:
df = pd.read_csv('raw_data.csv')
df

Unnamed: 0,Text source,Talk Type,Participants Category,Decade,Date Code,Country,City,text
0,AV,Talk,Public,60s,660302,India,Bombay (Mumbai),This is the last talk of this year. I think th...
1,['AV'],['Discussion'],['Public'],['70s'],['700808-1'],['Switzerland'],['Saanen'],"We'll go on, I think, don't you, where we left..."
2,['AV'],['Discussion'],['Public'],['70s'],['700809'],['Switzerland'],['Saanen'],"This is the last discussion and if we may, sha..."
3,['AV'],['Discussion'],['Public'],['70s'],['700908'],['England'],['Brockwood Park'],Krishnamurti: This is supposed to be a discuss...
4,['AV'],['Discussion'],['Public'],['70s'],['700910'],['England'],['Brockwood Park'],Krishnamurti: What shall we talk over together...
5,['AV'],['Discussion'],['Public'],['70s'],['710112'],['India'],['Madras (Chennai)'],Krishnamurti: What shall we talk over together...
6,['AV'],['Discussion'],['Public'],['70s'],['710115'],['India'],['Madras (Chennai)'],Krishnamurti: What shall we talk over together...
7,AV,Discussion,Public,70s,700808-1,Switzerland,Saanen,"We'll go on, I think, don't you, where we left..."
8,AV,Discussion,Public,70s,700809,Switzerland,Saanen,"This is the last discussion and if we may, sha..."
9,AV,Discussion,Public,70s,700908,England,Brockwood Park,Krishnamurti: This is supposed to be a discuss...
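After the round trip through CSV, the list-valued cells come back as strings such as `"['AV']"`. If we want plain values again, one option is to parse those strings with `ast.literal_eval`. The `unwrap` helper below is a hypothetical name, and the sketch assumes that any bracketed cell holds a single-element list.

```python
import ast

def unwrap(cell):
    """Turn a string like "['AV']" into 'AV'; leave other values untouched."""
    if isinstance(cell, str) and cell.startswith("[") and cell.endswith("]"):
        try:
            parsed = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            # Not a valid Python literal, so keep the original text
            return cell
        if isinstance(parsed, list) and len(parsed) == 1:
            return parsed[0]
    return cell

cleaned = [unwrap(value) for value in ["['AV']", "AV", "['Madras (Chennai)']"]]
```

On the dataframe itself this could be applied column by column, for instance with `df[column].map(unwrap)` on each metadata column.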


# Great!
In this next section, we will gather a list of html links to the text documents in the Krishnamurti Archive.

Plan of attack: iterate over the result pages of this search page, https://jkrishnamurti.org/jksearch?keyword=&type=16616&media_type=16616, collecting all of the hyperlinks.


In [None]:
urls = []

# There are 37 search pages we will want to parse and retrieve hyperlinks from

base_url = "https://jkrishnamurti.org/jksearch?keyword=&page={}&talk_type=41&media_type=16616"
search_pages = [base_url.format(str(i)) for i in range(1,38)]
search_pages

['https://jkrishnamurti.org/jksearch?keyword=&page=1&talk_type=41&media_type=16616', 'https://jkrishnamurti.org/jksearch?keyword=&page=2&talk_type=41&media_type=16616', 'https://jkrishnamurti.org/jksearch?keyword=&page=3&talk_type=41&media_type=16616', 'https://jkrishnamurti.org/jksearch?keyword=&page=4&talk_type=41&media_type=16616', 'https://jkrishnamurti.org/jksearch?keyword=&page=5&talk_type=41&media_type=16616']


In [None]:
%%capture
search_page = search_pages[0]

# Making a 'get' request
page = requests.get(search_page)
soup = BeautifulSoup(page.content, 'html.parser')
soup

In [None]:
# Getting all anchor tags within a title class
from pprint import pprint

# Notice that each hyperlink is a path relative to the site root, not a full URL
anchor_tags = soup.select(".title > a")
paths = [anchor_tag['href'] for anchor_tag in anchor_tags]

pprint(paths)

['/content/students-discussion-4-paris-france-22-april-1969/',
 '/content/students-discussion-1-saanen-switzerland-24-july-1970/',
 '/content/students-discussion-2-saanen-switzerland-31-july-1970/',
 '/content/how-shall-i-know-myself/',
 '/content/attachment-escape/',
 '/content/mind-breeds-habits/',
 '/content/mind-crowded-known/',
 '/content/danger-conditioning/',
 '/content/brain-demands-order/',
 '/content/brain-completely-attentive/']
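Because these paths start with a slash, gluing them onto a base URL that ends with a slash by plain string concatenation produces a double slash (`https://jkrishnamurti.org//content/...`). A small sketch using the standard library's `urljoin` avoids that; `BASE_URL` and `full_urls` are illustrative names, and the two paths are taken from the output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://jkrishnamurti.org/"

paths = [
    "/content/attachment-escape/",
    "/content/mind-breeds-habits/",
]

# urljoin resolves each root-relative path against the base URL
full_urls = [urljoin(BASE_URL, path) for path in paths]
```

`urljoin` gives the same result whether or not the base URL ends with a slash, so the pieces do not need to be trimmed by hand.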


In [None]:
# Getting a list of all speech paths
import random
import time

speech_paths = []

# For each search page we want to add all relevant hyperlinks to url list

for search_page in search_pages:
  # Pausing 1-10 seconds between requests to be gentle on the server
  time.sleep(random.randint(1,10))
  page = requests.get(search_page)
  soup = BeautifulSoup(page.content, 'html.parser')
  anchor_tags = soup.select(".title > a")
  paths = [anchor_tag['href'] for anchor_tag in anchor_tags]
  speech_paths.append(paths)

pprint(speech_paths)
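Note that `append` adds each page's list as a single element, so `speech_paths` ends up as a list of lists, one inner list per search page. The sketch below contrasts `append` with `extend` and shows `itertools.chain.from_iterable` as one way to flatten afterwards. The `pages` data is a hypothetical placeholder, not scraped output.

```python
from itertools import chain

# Hypothetical per-page results, standing in for the scraped lists
pages = [
    ["/content/a/", "/content/b/"],
    ["/content/c/"],
]

nested = []
flat = []
for page_paths in pages:
    nested.append(page_paths)  # one inner list per search page
    flat.extend(page_paths)    # one entry per hyperlink

# Flattening an already nested list after the fact
flattened = list(chain.from_iterable(nested))
```

Either form works for building the final URL list; the nested form keeps track of which search page each path came from.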
  

In [None]:
speech_paths
x = []

for group in speech_paths:
  for path in group:
    x.append("https://jkrishnamurti.org/" + path)

pprint(x)

['https://jkrishnamurti.org//content/students-discussion-4-paris-france-22-april-1969/',
 'https://jkrishnamurti.org//content/students-discussion-1-saanen-switzerland-24-july-1970/',
 'https://jkrishnamurti.org//content/students-discussion-2-saanen-switzerland-31-july-1970/',
 'https://jkrishnamurti.org//content/how-shall-i-know-myself/',
 'https://jkrishnamurti.org//content/attachment-escape/',
 'https://jkrishnamurti.org//content/mind-breeds-habits/',
 'https://jkrishnamurti.org//content/mind-crowded-known/',
 'https://jkrishnamurti.org//content/danger-conditioning/',
 'https://jkrishnamurti.org//content/brain-demands-order/',
 'https://jkrishnamurti.org//content/brain-completely-attentive/',
 'https://jkrishnamurti.org//content/seeking-security/',
 'https://jkrishnamurti.org//content/if-one-conforming-there-no-freedom/',
 'https://jkrishnamurti.org//content/maturity-freedom-conditioning/',
 'https://jkrishnamurti.org//content/public-discussion-1-madras-chennai-india-12-january-1971/