# News Feature Generation

### Dependencies

Below are the packages used in this notebook:

- **matplotlib**: plotting
- **numpy**: numerical computing
- **pandas**: CSV I/O and DataFrames

In [1]:
import warnings
import matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters

register_matplotlib_converters()
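# `_rebuild` is a private helper that newer Matplotlib releases no longer provide,
# so only call it when it exists.
if hasattr(matplotlib.font_manager, '_rebuild'):
    matplotlib.font_manager._rebuild()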
plt.rcParams['lines.linewidth'] = 0.8
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = 'Palatino'
plt.rcParams['axes.facecolor'] = '#000000'
pd.set_option('display.max_rows', 10)
warnings.filterwarnings('ignore')

### Data Crawling

We crawled all EU-ETS posts from [Carbon Pulse](http://www.carbon-pulse.com). The crawling script is shown below.

```py
import random
import requests
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime as dt
from randomheaders import LoadHeader as random_header


random.seed(123)
COLUMNS = ['time', 'title']
INFO = pd.DataFrame(columns=COLUMNS)
IP_LIST = []


def get_proxy():
    return requests.get('http://localhost:5555/get/').text

def delete_proxy(proxy):
    requests.get(f'http://localhost:5555/delete/?proxy={proxy}')

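def get_soup(url, parser):
    # Retry with a fresh proxy until the request succeeds.
    while True:
        # Fetch the proxy outside the try block so that `proxy` is always
        # defined when the except branch runs.
        proxy = get_proxy()
        try:
            html = requests.get(
                url, timeout=(3, 7), headers=random_header(),
                proxies={'http': proxy, 'https': proxy}).text
            return BeautifulSoup(html, parser)
        except requests.RequestException:
            # Drop the failing proxy from the pool and try another one.
            delete_proxy(proxy)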

def parse_time(timestr):
    raw_time = timestr[10:].split('/')[0].replace('on', '')
    time = dt.strptime(raw_time, '%H:%M  %B %d, %Y  ')
    return time

def get_posts(soup):
    if 'Error 404 - Not Found.' in soup.text: return None
    attrs = {'class': lambda x: x.startswith('post-') if x else False}
    posts = soup.select('div[class="listing"]')[0].find_all('div', attrs=attrs)
    titles, times, abstracts = [], [], []
    for p in posts:
        titles.append(p.select('h2[class="posttitle"]')[0].text)
        times.append(parse_time(p.select('p')[0].text))
        abstracts.append(p.select('p')[-1].text)
    posts = pd.DataFrame({'title': titles, 'abstract': abstracts}, index=times)
    return posts


all_posts = pd.DataFrame(columns=['title', 'abstract'])

i = 0
while True:
    try:
        i += 1
        print(f'Crawling page {i}: ', end='')
        url = f'http://carbon-pulse.com/category/eu-ets/page/{i}'
        soup = get_soup(url, 'lxml')
        posts = get_posts(soup)
        if posts is None: raise EOFError
        all_posts = pd.concat([all_posts, posts])
        print(f'finished ({posts.shape[0]} posts)')
    except IndexError:
        print('failed, retrying')
        i -= 1
    except (EOFError, KeyboardInterrupt):
        # Stop at the first 404 page or on manual interruption, then save.
        print('terminated')
        print(f'Total number of posts crawled: {all_posts.shape[0]}')
        all_posts.to_csv('posts.csv')
        break
```
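
The `parse_time` helper above relies on fixed character offsets, so it breaks if the byline text shifts by even one character. Below is a minimal sketch of a regex-based alternative. The sample byline string and the name `parse_time_regex` are assumptions made for illustration; the real byline format on the site may differ and should be checked against a crawled page.

```python
import re
from datetime import datetime

# Assumed byline shape: "<prefix> HH:MM on <Month> <day>, <year> ..."
# This pattern is inferred from the slicing logic in parse_time, not from the site.
_TIME_RE = re.compile(r'(\d{1,2}:\d{2})\s+on\s+([A-Za-z]+ \d{1,2}, \d{4})')


def parse_time_regex(timestr):
    """Return the first 'HH:MM on Month D, YYYY' timestamp found in timestr."""
    match = _TIME_RE.search(timestr)
    if match is None:
        raise ValueError(f'unrecognised time string: {timestr!r}')
    clock, date = match.groups()
    # %B expects an English month name under the default C/English locale.
    return datetime.strptime(f'{clock} {date}', '%H:%M %B %d, %Y')


# Hypothetical byline text, used only to exercise the function.
sample = 'Published 14:05 on May 10, 2019  /  Last updated at 15:00 on May 10, 2019'
parsed = parse_time_regex(sample)
print(parsed)
```

Searching for the pattern rather than slicing at a fixed index means a changed prefix or extra whitespace does not silently produce a wrong timestamp; an unexpected format raises `ValueError` instead.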

Now we start generating new features from these posts. First, let's take a look at the data.

In [6]:
df = pd.read_csv('/Data/news.csv', index_col=0)
df.index = pd.to_datetime(df.index)
df

The news data span 2015-02-04 to 2019-05-10 (the latest post at crawl time). Each entry has a `title` and an `abstract`. Full post contents are left out of this first-stage analysis; they may be useful later, but for now we only need titles and abstracts.
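
The date range and columns described above can be checked with a few lines of pandas. The sketch below runs on two placeholder rows rather than the crawled file, so the titles and abstracts are stand-ins, not real posts; `summarize_posts` is a hypothetical helper name.

```python
import pandas as pd

# Placeholder rows standing in for the crawled posts (not real data).
df_demo = pd.DataFrame(
    {'title': ['placeholder title A', 'placeholder title B'],
     'abstract': ['placeholder abstract A', None]},
    index=pd.to_datetime(['2019-05-10 14:05', '2015-02-04 09:30']),
)


def summarize_posts(df):
    """Report the covered period, row count, columns and missing values."""
    df = df.sort_index()  # crawled pages arrive newest-first
    return {
        'first': df.index.min(),
        'last': df.index.max(),
        'n_posts': len(df),
        'columns': list(df.columns),
        'missing': df.isna().sum().to_dict(),
    }


summary = summarize_posts(df_demo)
print(summary)
```

Calling `summarize_posts(df)` on the loaded frame would show whether the index covers the expected period and whether any abstracts are missing.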