# Recommender Feature Extraction

Now that we have built the features for the prediction system, we turn to the stocks themselves and look at how to measure the similarity between them.

In [1]:
import pandas as pd
import numpy as np
import os
import sys
sys.path.insert(1, '..')
import recommender as rcmd
%matplotlib inline

In [6]:
# retrieve a company profile
p = rcmd.contrib.fmp_api.profile.get_profile('AAPL')
p

| | profile | symbol |
|---|---|---|
| beta | 1.139593 | AAPL |
| ceo | Timothy D. Cook | AAPL |
| changes | -0.37 | AAPL |
| changesPercentage | (-0.16%) | AAPL |
| companyName | Apple Inc. | AAPL |
| description | Apple Inc is designs, manufactures and markets... | AAPL |
| exchange | Nasdaq Global Select | AAPL |
| image | https://financialmodelingprep.com/images-New-j... | AAPL |
| industry | Computer Hardware | AAPL |
| lastDiv | 2.92 | AAPL |


Using the company profile, we should be able to create a one-hot encoding of sector and industry for each company, which lets us compare companies with each other. The company description might also help us filter companies against the natural-language queries posed by the user (e.g. via entity recognition or a simple comparison of embeddings).
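As a minimal sketch of the one-hot idea, the snippet below encodes sector and industry with `pd.get_dummies` and compares companies by cosine similarity. The toy frame `toy` is an assumption for illustration; its values are copied from the profile table in this notebook, and the names `onehot` and `similarity` are hypothetical rather than part of the `recommender` package.

```python
import numpy as np
import pandas as pd

# Toy profiles (illustrative); values taken from the profile table in this notebook.
toy = pd.DataFrame(
    {
        "sector": ["Consumer Cyclical", "Energy", "Technology", "Technology"],
        "industry": ["Entertainment", "Oil & Gas - Midstream",
                     "Semiconductors", "Semiconductors"],
    },
    index=["CMCSA", "KMI", "INTC", "MU"],
)

# One column per distinct sector/industry value, 1.0 where the company has that value.
onehot = pd.get_dummies(toy[["sector", "industry"]], dtype=float)

# Normalise each row to unit length; rows without any category (e.g. a missing
# sector and industry) have norm 0, so guard against division by zero.
vecs = onehot.to_numpy()
norms = np.linalg.norm(vecs, axis=1, keepdims=True)
norms[norms == 0] = 1.0
unit = vecs / norms

# Pairwise cosine similarity between all companies.
similarity = pd.DataFrame(unit @ unit.T, index=toy.index, columns=toy.index)
print(similarity.round(2))
```

With only two categorical fields the similarity is coarse: two companies either share both values, one value, or none. It is probably most useful as a filter or as one component of a larger feature vector.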

Let's create a dataset from all company profiles that we know of:

In [2]:
from recommender.contrib import fmp_api as fmp

stocks = fmp.profile.list_symbols()

def list_profiles(stocks=stocks):
    """Fetch the profile of every symbol in `stocks`, one row per symbol."""
    rows = []
    for symbol in stocks['symbol'].values:
        try:
            profile = fmp.profile.get_profile(symbol)
        except Exception:
            # skip symbols whose profile cannot be retrieved
            continue

        # turn the 'profile' column into a single row indexed by the symbol
        rows.append(profile[['profile']].transpose().assign(symbol=symbol).set_index('symbol'))

    # stack all rows into the final dataframe
    return pd.concat(rows, axis=0)

#df_profile = list_profiles()
#df_profile.to_csv('../data/profiles.csv')

df_profile = pd.read_csv('../data/profiles.csv')

print(df_profile.shape)
df_profile.head()

(7936, 17)


| | symbol | beta | ceo | changes | changesPercentage | companyName | description | exchange | image | industry | lastDiv | mktCap | price | range | sector | volAvg | website |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SPY | 0.999041 | | -0.48 | (-0.16%) | SPDR S&P 500 | The investment seeks to provide investment res... | NYSE Arca | https://financialmodelingprep.com/images-New-j... | | 5.066805 | 277445800000.0 | 301.02 | 233.76-293.94 | | 115036190 | http://www.spdrs.com |
| 1 | CMCSA | 1.108054 | Brian L. Roberts | -0.03 | (-0.06%) | Comcast Corporation Class A Common Stock | Comcast Corp is a media and technology company... | Nasdaq Global Select | https://financialmodelingprep.com/images-New-j... | Entertainment | 0.84 | 212325400000.0 | 46.67 | 30.43-41.33 | Consumer Cyclical | 28875593 | https://corporate.comcast.com |
| 2 | KMI | 0.962428 | Steven J. Kean | -0.02 | (-0.10%) | Kinder Morgan Inc. | Kinder Morgan Inc is an energy infrastructure ... | New York Stock Exchange | https://financialmodelingprep.com/images-New-j... | Oil & Gas - Midstream | 0.8 | 45826880000.0 | 20.68 | 14.6201-20.44 | Energy | 16636870 | http://www.kindermorgan.com |
| 3 | INTC | 0.795098 | Brian M. Krzanich | -0.11 | (-0.20%) | Intel Corporation | Intel Corp is the world's largest chipmaker. I... | Nasdaq Global Select | https://financialmodelingprep.com/images-New-j... | Semiconductors | 1.26 | 242407100000.0 | 51.57 | 42.36-57.5995 | Technology | 30437989 | http://www.intel.com |
| 4 | MU | 1.850851 | | -0.01 | (-0.01%) | Micron Technology Inc. | Micron Technology Inc along with its subsidiar... | Nasdaq Global Select | https://financialmodelingprep.com/images-New-j... | Semiconductors | 0.0 | 61259500000.0 | 49.85 | 28.39-64.66 | Technology | 60797914 | http://www.micron.com |


We have an additional list of stocks from the Kaggle dataset. Let's retrieve all relevant symbols from there as well and check for profiles we might have missed:

In [None]:
# use cache to read relevant profiles
cache = rcmd.stocks.Cache()
d1 = cache.list_data(type='stock')
d2 = cache.list_data(type='etf')
symbols = list(d1.keys()) + list(d2.keys())

# perform diff with previous stock list
symbols = np.setdiff1d(symbols, stocks['symbol'].values)

print("Remaining: {}".format(len(symbols)))

# retrieve the profiles for the remaining symbols
df_profile2 = fmp.profile.list_profiles(symbols)

# merge dataframes + update the disk file
df_profile = pd.concat([df_profile, df_profile2], axis=0)
#df_profile.to_csv('../data/profiles.csv')

df_profile.shape

Remaining: 8539


Now let's use the company description to extract additional features:

In [None]:
# TODO: NP extraction
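
The NP extraction itself is left open above. As a simpler stand-in (explicitly not NP extraction), the sketch below turns each description into a bag-of-words count vector and ranks companies by cosine similarity to a user query, using only the standard library. The symbols, description strings, stop-word list, and the helpers `tokenize` and `cosine` are all made up for illustration; the real descriptions would come from the `description` column of `df_profile`.

```python
import math
import re
from collections import Counter

# Tiny, hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "that", "in", "for"}

def tokenize(text):
    """Lower-case the text and return alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def cosine(a, b):
    """Cosine similarity between two token-count Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up symbols and descriptions, for illustration only.
descriptions = {
    "CHIP_A": "designs and manufactures memory chips",
    "CHIP_B": "designs and sells memory chips and processors",
    "PIPE_C": "operates natural gas pipelines",
}

# One bag-of-words vector per company, plus one for the user query.
counts = {sym: Counter(tokenize(text)) for sym, text in descriptions.items()}
query = Counter(tokenize("memory chips"))

# Rank companies from most to least similar to the query.
ranked = sorted(counts, key=lambda s: cosine(query, counts[s]), reverse=True)
print(ranked)
```

Raw counts favour short descriptions and treat every word as equally informative; TF-IDF weighting or learned embeddings would be the natural next step once the descriptions are cleaned.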