### Building Finance Keywords Dictionary using Python 

Irrespective of the business domain you work in, text data is inevitable. You could be on an audit team working with audit documents, or in the entertainment business with fans tweeting about the movie or series you are producing. Text data is present in almost every business domain.

For any analyst, a large amount of data is nothing less than an extra scoop of ice cream. Unfortunately, a huge amount of text data comes with a huge problem: it is not structured. Put simply, unlike a numeric table, text does not fit into a tabular format for which there are plenty of out-of-the-box analytics solutions to build upon.

Thus, analysing text data becomes a unique challenge for an analyst, and in trying to make sense of text data, one handy utility is a "Keywords Dictionary".

### What is a Keywords Dictionary?

A Keywords Dictionary is a set of words put together based on a common theme. Suppose you are a bank and you want to improve your customer service. To make sense of what your customers complain or talk about, you need a list of keywords related to your business, and that is where a Keywords Dictionary comes in.
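To make the idea concrete, here is a minimal sketch of how such a dictionary might be used to tag a customer complaint. The four keywords, the sample complaint, and the helper name `find_keywords` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini-dictionary; a real one would hold thousands of terms.
finance_keywords = ["credit card", "loan", "interest rate", "overdraft"]


def find_keywords(text, keywords):
    """Return the keywords that occur in text as whole words, ignoring case."""
    found = []
    for keyword in keywords:
        # \b anchors the match to word boundaries so "loan" does not match "loaned".
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(keyword)
    return found


complaint = "My Credit Card was charged a higher interest rate than promised."
matches = find_keywords(complaint, finance_keywords)
print(matches)  # ['credit card', 'interest rate']
```

Whole-word, case-insensitive matching is only one possible design choice; stemming or fuzzy matching may suit messier text better.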

Since Keywords Dictionaries are specific to a business purpose, one that fits your need might not be readily available unless someone has built it and open-sourced it. Hence, it is often better to build your own custom Keywords Dictionary.

#### How to build your own Keywords Dictionary? 

* First of all, we need to decide on a Data Source - which is usually a website with the list of Keywords (that we are interested in). 

* Extract Website content from the given url

* Scrape the desired content (Keywords) from the website content 

* Clean the scraped data if required and store locally for future use


Now, you could be wondering about a new piece of jargon in one of the points - "scrape". It comes from the process called "Web Scraping", which "is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis", as defined by Wikipedia. 

#### Getting started with Web Scraping:

Python has two wonderful packages for web scraping: BeautifulSoup and Scrapy. We will use BeautifulSoup in this post to scrape data from the web. While BeautifulSoup can do the job of parsing the HTML and making sense of the web content, we need to "get" the website in the first place, and we will use the "requests" package for that.

#### How to install requests & BeautifulSoup:

Requests can be installed using pip. 

`pip install requests` - if you are using Python 2.x
`pip3 install requests` - if you are using Python 3.x


BeautifulSoup can also be installed using pip, or pip3 if you are using Python 3.x. 

`pip install beautifulsoup4` 
`pip3 install beautifulsoup4`

Loading both the libraries:

In [24]:
from bs4 import BeautifulSoup
import requests

Data Source:

We will use Moneycontrol.com's Glossary page to build our Finance Keywords Dictionary. Note that this post is for educational purposes only; make sure you don't violate the Terms of Service of the websites you scrape. 

In [25]:
url = "http://www.moneycontrol.com/glossary/"

Now that we have defined the url, let us fetch its content. 

In [27]:
content = requests.get(url) #sends a GET http request to collect the content 

We can check if the request was successful by checking the response status. 

In [29]:
content.status_code #200 is successful

200
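Checking `status_code` by hand works well in a notebook. In a script, one hedged alternative is to let requests raise an exception when the request fails. The sketch below wraps this in a hypothetical helper, `fetch_html`; the helper name and the 10-second timeout are assumptions and not part of the original walkthrough.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text, raising on HTTP errors."""
    # A timeout keeps the call from hanging forever on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of failing silently.
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# content_text = fetch_html("http://www.moneycontrol.com/glossary/")
```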

In [30]:
content_text = content.text #extracting the response content as text

Now that the content is ready as text, we can use BeautifulSoup to make a "soup" - essentially, parsing the HTML.

In [33]:
soup = BeautifulSoup(content_text)



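Because no parser was named in the call, BeautifulSoup emits a warning and falls back to the best parser available on the system (here "lxml"). To get rid of the warning, change `BeautifulSoup(YOUR_MARKUP)` to `BeautifulSoup(YOUR_MARKUP, "lxml")`.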
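As a small illustration of naming the parser explicitly, the sketch below parses a tiny hand-written page rather than the real glossary. The sample HTML is made up, and "html.parser" is used here only because it ships with Python; "lxml" works the same way when it is installed.

```python
from bs4 import BeautifulSoup

# Tiny hand-written page standing in for the downloaded HTML (hypothetical content).
sample_html = """
<html>
  <head><title>Glossary</title></head>
  <body>
    <a href="/glossary/ebit">EBIT</a>
    <a href="/glossary/ebitda">EBITDA</a>
  </body>
</html>
"""

# Passing the parser name as the second argument avoids the warning.
sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.string
link_count = len(sample_soup.find_all("a"))
print(page_title, link_count)  # Glossary 2
```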


As seen in the screenshot above, what we are interested in within the extracted content is the HTML tag "a". But the website has many links, including junk such as social media links and other irrelevant links. A closer look at the screenshot also reveals that our desired urls share a common pattern, "/glossary/". Hence we will extract the content with two conditions:

* only "a" tag
* "a" tag with "href" containing the string "glossary" in it 

To extract all the "a" tag links, we will use the function "find_all()", and to find the string "glossary" in "href", we will use "regex" pattern matching through Python's built-in "re" module. 

In [42]:
import re
all_links = soup.find_all("a", href = re.compile("glossary"))
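To see what this filter keeps and drops, here is a minimal sketch run on a hand-written snippet instead of the live page. The sample links and their paths are hypothetical; only the `find_all` call mirrors the one used above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet mixing glossary links with the kind of junk we want to skip.
sample_html = """
<a href="/glossary/e/ebit">EBIT</a>
<a href="https://twitter.com/example">Follow us</a>
<a href="/glossary/e/ebitda">EBITDA</a>
<a>No href at all</a>
<p>Not a link</p>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep only "a" tags whose href contains the string "glossary".
glossary_links = sample_soup.find_all("a", href=re.compile("glossary"))
link_texts = [link.text for link in glossary_links]
print(link_texts)  # ['EBIT', 'EBITDA']
```

The social media link, the "a" tag without an "href", and the paragraph are all left out.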

Now, we are ready to extract the Keywords, which are nothing but the text values in each of those links that we extracted and stored in "all_links". We will use a "for" loop to iterate through each element of "all_links" and extract "text" value of it and store it in a list. 

In [64]:
keywords = [] #empty list to store the keywords

for link in all_links:
    keywords.append(link.text)
    
len(keywords)

keywords[1000:1020]

['Early (premature) Withdrawal',
 'Early Entry',
 'Early Retirement Penalty',
 'Earned Income',
 'Earned Income Rule',
 'Earned premium',
 'Earnings before taxes',
 'Earnings Estimates',
 'Earnings form',
 'Earnings multiple approach',
 'Earnings per Share (EPS)',
 'Earnings stripping',
 'Earthquake insurance',
 'EBIT',
 'EBITDA',
 'ECN',
 'Econometrics',
 'Economic double taxation',
 'Economic loss',
 'Education IRA']

That's it! We have successfully built a Finance Keywords Dictionary of length 3165. Please note that some of the keywords might need a little cleaning and some business domain knowledge to refine them further before they are used in your Machine Learning model. This post can be replicated for your own needs with a simple change of the source url and a few other tweaks. 
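The last step in our list, cleaning the scraped data and storing it locally, could look roughly like the sketch below. The helper names, the case-insensitive de-duplication, and the JSON file format are all assumptions made for illustration; the right cleaning rules depend on your own data.

```python
import json
from pathlib import Path


def clean_keywords(raw_keywords):
    """Collapse whitespace, drop empty strings, and remove duplicates, keeping order."""
    cleaned = []
    seen = set()
    for keyword in raw_keywords:
        keyword = " ".join(keyword.split())  # trims ends and collapses inner whitespace
        key = keyword.lower()  # assumption: "EBIT" and "ebit" count as the same keyword
        if keyword and key not in seen:
            seen.add(key)
            cleaned.append(keyword)
    return cleaned


def save_keywords(keywords, path):
    """Write the keyword list to a JSON file."""
    Path(path).write_text(json.dumps(keywords, indent=2), encoding="utf-8")


def load_keywords(path):
    """Read a keyword list back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Small hand-made sample standing in for the scraped list.
raw = ["  EBIT ", "EBITDA", "", "ebit", "Earnings per\n Share (EPS)"]
cleaned = clean_keywords(raw)
print(cleaned)  # ['EBIT', 'EBITDA', 'Earnings per Share (EPS)']

# Example usage with a hypothetical file name:
# save_keywords(clean_keywords(keywords), "finance_keywords.json")
```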