## Reuters-dataset
This notebook is to explore the reuters dataset and filter out all the merger and acquisition articles from the dataset.

In [None]:
# imports
import pandas as pd
from bs4 import BeautifulSoup
import os
from collections import defaultdict

In [2]:
DATA_DIR = "../data/reuters21578/"

In [3]:
# data_dir_files = os.listdir(DATA_DIR)
# sgm_files = [f for f in data_dir_files if ".sgm" in f]

## Task 1: Filtering M&A articles
The first task is to filter out M&A articles from the dataset.

To do that we find all articles that have 'acq' specified as one of the topics

In [4]:
## Funtion to filter the articles from a sgm file contents
def filter_content(content, f_name):
    acq_reuters = []
    soup = BeautifulSoup(content, 'html.parser')
    all_reuters = soup.find_all('reuters')
    for r in all_reuters:
        reuter_topics = r.find_all('topics')
        for topic_tags in reuter_topics:
            all_elements = topic_tags.find_all('d')
            for ele in all_elements:
                all_topics = ele.contents
                for topic in all_topics:
                    if topic == 'acq':
                        acq_reuters.append(r)
    return acq_reuters

In [3]:
def convert_to_df(all_acq_reuters):
    for k,v in all_acq_reuters.items():
        all_rows = []
        file_name = k
        for reuter in v:
    #         print(reuter)
            reuter_id = reuter.attrs['newid']
            try:
                reuter_body = reuter.find_all('body')[0].contents[0]
            except:
                reuter_body = ""
            try:
                reuter_title = reuter.find_all('title')[0].contents[0]
            except:
                reuter_title = ""
    #         print(reuter.find_all('body')[0].contents[0])
            row = [file_name, reuter_id, reuter_body, reuter_title]
            all_rows.append(row)
    data_df = pd.DataFrame(all_rows)
    data_df.columns = ['file_name', 'reuter_id', 'reuter_title', 'reuter_body']
    return data_df

In [6]:
## Funtion to filter the all the acq articles from the reuters dataset.
def filter_reuters():
    all_acq_reuters = defaultdict(lambda: None)
    
    data_dir_files = os.listdir(DATA_DIR)
    sgm_files = [f for f in data_dir_files if ".sgm" in f]
    
    for f_name in sgm_files:
        try:
            with open(DATA_DIR+f_name, 'r') as f:
                content = f.read()
            file_acq_reuters = filter_content(content, f_name)
            all_acq_reuters[f_name] = file_acq_reuters
        except:
            print("The file ", f_name, " could not be opened because of UnicodeDecode Error.")
    reuters_df = convert_to_df(all_acq_reuters)
    return reuters_df

In [7]:
reuters_df = filter_reuters()

The file  reut2-017.sgm  could not be opened because of UnicodeDecode Error.


In [8]:
reuters_df

Unnamed: 0,file_name,reuter_id,reuter_title,reuter_body
0,reut2-021.sgm,21003,,CCR VIDEO SAYST RECEIVED OFFER TO NEGOTIATE A ...
1,reut2-021.sgm,21007,"Brown Disc Products Co Inc, a unit fo Genevar ...",BROWN DISC TO BUY RHONE-POULENC <RHON.PA> UNIT
2,reut2-021.sgm,21019,The British Treasury confirmed that the sale o...,U.K. TREASURY CONFIRMS BP SALE TO GO AHEAD
3,reut2-021.sgm,21030,CalMat Co said it filed suit in Los Angeles Su...,CALMAT <CZM> SUES INDUSTRIAL EQUITY
4,reut2-021.sgm,21039,Durakon Industries Inc said it has entered int...,DURAKON <DRKN.O> TO MAKE ACQUISITION
5,reut2-021.sgm,21040,"Atlantis Group Inc said it bought 100,000 shar...",ATLANTIS <AGH> MAY BID FOR CHARTER-CRELLIN<CRT...
6,reut2-021.sgm,21041,Allwaste Inc said it has agreed in principle t...,ALLWASTE <ALWS.O> TO MAKE ACQUISITION
7,reut2-021.sgm,21046,Henley Group Inc said it ended talks with Sant...,HENLEY <HENG.O> ENDS TALKS WITH SANTE FE
8,reut2-021.sgm,21050,Texas American Bancshares Inc said it agreed t...,TEXAS AMERICAN BANCSHARES <TXA> TO SELL UNIT
9,reut2-021.sgm,21051,Supermarkets General Corp said it agreed to se...,SUPERMARKETS GENERAL <SGL> SELLS 11 DRUG STORES


## Task 2, 3: Identify company acquired and company making the acquisition
The task at hand is to find the company being acquired and the company which is making the acquisition along with the amount involved in the acquisition.

### Ideas for Task
To help us identify that we have news articles as well as their titles which can help us.

#### 1. Using Word2Vec
We can use create a Word2vec model and then retrain it over all the news articles that we have in the dataset. This might help us learn embeddings like:
 - king is to queen like father is to mother.
 - acquirer is to acquired ; like ; acquired is to acquirer
 - CompanyA is to CompanyB ; like ; CompanyC is to CompanyD

Some of the techniques such as cbow and skipgrams can help us to understand context.
Then we can probably try to use these vector embeddings to find the company which was acquired and the company which acquired using these. Cosine similarity distance can be extracted using Python gensim library.

#### 2. Entity Tagging
Using a pre-trained machine learning/ deep learning model to identify company names in the text. This can help us find out the company acquired and the company making the acquisition.

To distinguish between the companies we can try to develop some understanding about the words contained in the title and if they can possibly help us identify that.

Few pretrained models using NER include
- https://github.com/deepmipt/ner
- https://medium.com/intro-to-artificial-intelligence/entity-extraction-using-deep-learning-8014acac6bb8

#### 3. Sequential Learning
This particular problem would be difficult to solve with a rule-based method because we need to find the company being acquired and the company which is making the acquisition, in any order. Hence we need to capture the sequence.

Steps
- Extract all the topic bodies
- We would need to label the data values so that we can apply customize NER such as CompanyA and CompanyB
- Make a tuple for each word (word-tokenize or POS tagging)
- Keep adding different tags such as (Brown Disc Products Co Inc, Noun, O) - Company represented as "O"
- Model using following methods once labelling is complete
    - CRF
    - RNN with LSTM

There are possibly other techniques that can be explored so as to come up with a better approach. But could not get time to explore further due to time constraint.