# Project Objective

Crawl 20,000+ news articles from bloomberg.com, then run NLP (Natural Language Processing) sentiment analysis on them.

### Steps for this project:<br>
1. Go to https://www.bloomberg.com/robots.txt to see which kinds of news may be crawled
2. Develop a Python program to crawl 20,000+ news articles from https://www.bloomberg.com<br>
1). Save the news into a CSV file with the columns: Title, url, topic, published_time, abstract, context<br>
2). Find a way to avoid being blocked by the website<br>
![image.png](attachment:image.png)
3. Use NLP (Natural Language Processing) to analyze the sentiment of every news article (positive or negative)

### Reference Links:<br>
https://www.crummy.com/software/BeautifulSoup/bs4/doc/<br>
http://www.nltk.org/howto/sentiment.html<br>
http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html

# Keywords

news sentiment analysis, python crawler, BeautifulSoup, requests.session, time.sleep

# Use Cases

- News organizations can use sentiment analysis to predict possible outcomes and to keep their audience informed and entertained.
- Political parties can also use sentiment analysis to monitor the impact of the political moves they make. For example, the central government’s recent currency demonetization orders generated reactions all across the country.
- In a machine learning analysis of currency demonetization, we analyzed the overall sentiment of the people on the issue.
- As with sports trading, insight into what is happening at a local level can be very valuable to a financial trader. Domain-specific sentiment analysis/classification can add real value here.

# Datasets

Bloomberg news

![image.png](attachment:image.png)

# Methodology

## 1. News that can be crawled

![image.png](attachment:image.png)
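
The robots.txt check can also be done programmatically. The following is a minimal sketch using the standard-library `urllib.robotparser`; the rules in `SAMPLE_ROBOTS_TXT` are hypothetical and only illustrate the mechanics, so they should not be read as Bloomberg's actual policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, used only for illustration.
# The real file at https://www.bloomberg.com/robots.txt may differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /news/
"""


def build_parser(robots_txt):
    """Build a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a generic crawler ("*") may fetch each (made-up) URL.
news_allowed = parser.can_fetch("*", "https://www.bloomberg.com/news/articles/example")
search_allowed = parser.can_fetch("*", "https://www.bloomberg.com/search?query=x")
print(news_allowed, search_allowed)
```

To check the live file instead of a sample string, `parser.set_url("https://www.bloomberg.com/robots.txt")` followed by `parser.read()` downloads and parses it.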

## 2. News crawler with Python
BloombergNewsCrawler.py<br>

#### - Avoid being blocked by using different "User-Agent" headers and "time.sleep"

Multiple User-Agent strings<br>
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
![image.png](attachment:image.png)

requests package with different headers
![image.png](attachment:image.png)

Random sleep time
![image.png](attachment:image.png)
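
The three ideas above (several User-Agent strings, a different header per request, and a random pause) might be combined roughly as follows. This is a sketch under assumptions, not the code in BloombergNewsCrawler.py: the function names, the 1 to 3 second range, and the example User-Agent strings are all illustrative. `session` is expected to be something like `requests.Session()`, but any object with a compatible `get` method works.

```python
import random
import time

# Illustrative User-Agent strings; any realistic browser strings would do.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/11.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
]


def random_headers():
    """Pick a User-Agent at random so consecutive requests look different."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def random_delay(sleep_factor):
    """Return a pause in seconds: a random value between 1 and 3, scaled by sleep_factor."""
    return sleep_factor * random.uniform(1.0, 3.0)


def polite_get(session, url, sleep_factor=1.0, sleeper=time.sleep):
    """Pause for a random time, then fetch url with a randomly chosen User-Agent.

    session: e.g. requests.Session(); only its get() method is used.
    sleeper: injectable for testing; defaults to time.sleep.
    """
    sleeper(random_delay(sleep_factor))
    return session.get(url, headers=random_headers(), timeout=10)
```

With the requests package installed, a call would look like `polite_get(requests.Session(), url, sleep_factor=2.0)`.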

#### - Structure of a news website

1) Open a news article in Google Chrome: https://www.bloomberg.com/news/articles/2018-04-17/london-s-fight-to-remain-a-financial-hub-after-brexit-quicktake

2) Right-click the title, then select Inspect:
![image.png](attachment:image.png)

3) The tag of the news title is now visible:
![image.png](attachment:image.png)

4) Use this tag with BeautifulSoup:
![image.png](attachment:image.png)
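
As a rough text version of step 4, the sketch below looks up a title tag with BeautifulSoup. The HTML snippet, the `h1` tag, and the `headline` class are placeholders; the real tag and class are whatever the Inspect panel shows for the page being crawled.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; tag and class names are placeholders.
SAMPLE_HTML = """
<html><body>
  <h1 class="headline">Example headline</h1>
  <div class="abstract">A short summary.</div>
</body></html>
"""


def extract_title(html, tag_name="h1", css_class="headline"):
    """Return the text of the first matching title tag, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(tag_name, class_=css_class)
    return tag.get_text(strip=True) if tag else None


print(extract_title(SAMPLE_HTML))
```

Returning `None` for a missing tag lets the caller skip pages whose layout differs instead of raising an error.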

#### - Crawl news with the BeautifulSoup package

*Crawl function*
![image.png](attachment:image.png)

![image.png](attachment:image.png)

1) Topic_url: the url of a topic
![image.png](attachment:image.png)

2) start_url_idx: the crawler may be blocked at any time; restart it from the index at which it was blocked

3) start_mon_idx: the start month index of a topic; normally the top 3 links do not contain any news
![image.png](attachment:image.png)

4) sleep: a random number that the random sleep time is multiplied by

5) name_news: the name of a topic, used as the name of the news CSV file
![image.png](attachment:image.png)

*Decompose*

Decompose (remove) the unneeded parts:
![image.png](attachment:image.png)
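
For illustration, removing unwanted elements with BeautifulSoup's `decompose()` could look like the sketch below. Which tags count as unneeded depends on the page; the `script`, `style`, and `aside` choices and the sample HTML are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical article body containing markup that should not end up in the text.
SAMPLE_HTML = """
<div class="body">
  <p>First paragraph of the story.</p>
  <script>trackPageView();</script>
  <aside class="related">Related links</aside>
  <p>Second paragraph.</p>
</div>
"""


def clean_body_text(html, unwanted=("script", "style", "aside")):
    """Strip unwanted tags from html and return the remaining visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so removing tags while looping over it is safe.
    for tag in soup.find_all(list(unwanted)):
        tag.decompose()  # removes the tag and everything inside it
    return soup.get_text(" ", strip=True)


print(clean_body_text(SAMPLE_HTML))
```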

*Save to CSV file*

1) Create a dictionary to store the news, using Title as the key and url, topic, abstract, published_time, context as the value.
![image.png](attachment:image.png)

2) Define a function to save the news to a CSV file
![image.png](attachment:image.png)

news_dict: the dictionary that stores the news of a topic<br>
name: the name of a topic<br>
total_num: the total number of news articles crawled for the topic<br>
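
A save function with these three parameters might be sketched as follows, using the standard-library `csv` module. The `out_dir` parameter, the `<name>_<total_num>.csv` naming pattern, and the assumption that each dictionary value is itself a dict of fields are illustrative choices, not necessarily what the original script does.

```python
import csv
import os

COLUMNS = ["Title", "url", "topic", "published_time", "abstract", "context"]


def save_news_to_csv(news_dict, name, total_num, out_dir="."):
    """Write news_dict to '<name>_<total_num>.csv' in out_dir and return the path.

    news_dict maps each Title to a dict holding the other columns.
    The file-naming pattern and out_dir are assumptions made for this sketch.
    """
    path = os.path.join(out_dir, f"{name}_{total_num}.csv")
    # newline="" lets the csv module control line endings on every platform.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        for title, fields in news_dict.items():
            writer.writerow({"Title": title, **fields})
    return path
```

Using `csv.DictWriter` rather than joining strings by hand keeps commas and quotes inside article text from breaking the columns.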

3) If crawling a news URL fails 10 times, or all the news of a month of a topic has been crawled, save the crawled news into a CSV file
![image.png](attachment:image.png)
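
The stop-and-save rule, together with the start_url_idx restart parameter described earlier, could be sketched as the loop below. It assumes that failures are counted in total (not consecutively) and that a failed URL is skipped rather than retried; `crawl_urls` and `fetch` are hypothetical names, with `fetch` standing in for whatever downloads and parses a single article.

```python
def crawl_urls(urls, fetch, start_url_idx=0, max_failures=10):
    """Crawl urls from start_url_idx onward and return (news_dict, resume_idx).

    fetch(url) must return (title, fields) or raise an exception on failure.
    resume_idx is the index to pass as start_url_idx on the next run; it
    equals len(urls) when every URL has been attempted.
    """
    news_dict = {}
    failures = 0
    for idx in range(start_url_idx, len(urls)):
        try:
            title, fields = fetch(urls[idx])
        except Exception:
            failures += 1
            if failures >= max_failures:
                # Probably blocked: stop so the caller can save and resume here.
                return news_dict, idx
            continue
        news_dict[title] = fields
    return news_dict, len(urls)
```

In either exit case the caller would write `news_dict` to a CSV file and, if `resume_idx` is smaller than `len(urls)`, start the next run from that index.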

## 3. Combine all news csv files together
combine_csv.py<br>

Use the os package to open all CSV files in a folder, read each CSV file into a dataframe, then concatenate all the dataframes with pandas. Save the combined dataframe to a single CSV file:
![image.png](attachment:image.png)
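
A minimal sketch of this combining step is shown below. It assumes every CSV file shares the same columns and uses `glob` alongside `os` to list the files; the function name and parameters are illustrative and may not match combine_csv.py.

```python
import glob
import os

import pandas as pd


def combine_csv_files(folder, output_path):
    """Concatenate every .csv file in folder and write the result to output_path."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    # pd.concat raises ValueError on an empty list, i.e. when no CSV files exist.
    frames = [pd.read_csv(path) for path in paths]
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(output_path, index=False)
    return combined
```

Writing the output outside `folder` avoids picking up the combined file as an input if the function is run a second time.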

## 4. News Sentiment Analysis
news_sentiment_analysis.py

#### - Define a function to read in the news csv file and do sentiment analysis
![image.png](attachment:image.png)

![image.png](attachment:image.png)

*read in csv file*

![image.png](attachment:image.png)

#### - Add new positive, negative, neutral and compound columns to the dataframe

![image.png](attachment:image.png)

#### - Combine the contents of Title, topic, abstract and context
![image.png](attachment:image.png)

#### - Use SentimentIntensityAnalyzer from the vaderSentiment or nltk package to do sentiment analysis

![image.png](attachment:image.png)

#### - Save the new dataframe, which contains the sentiment scores, to a CSV file

![image.png](attachment:image.png)
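
The sentiment steps above can be summarized in one hedged sketch. It takes the analyzer as a parameter, so it works with any object exposing `polarity_scores(text)` that returns a dict with `neg`, `neu`, `pos`, and `compound` keys, which is the shape VADER's `SentimentIntensityAnalyzer` returns. The function name and the way missing values are handled are assumptions, not a copy of news_sentiment_analysis.py.

```python
import pandas as pd

TEXT_COLUMNS = ["Title", "topic", "abstract", "context"]
# Maps the analyzer's score keys to the new dataframe column names.
SCORE_COLUMNS = {"neg": "negative", "neu": "neutral", "pos": "positive", "compound": "compound"}


def add_sentiment_scores(df, analyzer):
    """Return a copy of df with negative, neutral, positive and compound columns.

    analyzer: any object with polarity_scores(text) -> dict, e.g. VADER's
    SentimentIntensityAnalyzer.
    """
    out = df.copy()
    # Join the text columns into one string per row; missing values become "".
    text = out[TEXT_COLUMNS].fillna("").astype(str).agg(" ".join, axis=1)
    scores = [analyzer.polarity_scores(t) for t in text]
    for key, column in SCORE_COLUMNS.items():
        out[column] = [score[key] for score in scores]
    return out
```

With vaderSentiment installed, the analyzer can be created via `from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer`; with nltk, via `from nltk.sentiment.vader import SentimentIntensityAnalyzer` after downloading the `vader_lexicon` resource. A full run would then be `add_sentiment_scores(pd.read_csv(path), SentimentIntensityAnalyzer()).to_csv(out_path, index=False)`, where `path` and `out_path` are placeholders. The VADER documentation suggests treating a compound score of at least 0.05 as positive and at most -0.05 as negative.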