# Econ 191 Paper: Analyzing FOMC Transcripts

## Data Scraping, Cleaning and Processing


Here we begin the process outlined in III.II (Data). We go through the process of scraping the data, converting the PDFs into text, using the Porter algorithm to stem words, removing stop words, counting and weighting relevant words by tf-idf (per document), and grouping each vector to the corresponding outcome variable.

### Importing Outcome Variables
Here we import our outcome variables from the "outcome_data.csv" file located in the same directory as this file. We skip the outcome data pre-processing part for ease of analysis.

Our outcome data for the federal funds rate was taken directly from the Federal Reserve Bank of St. Louis’ FRED portal. Our gross domestic product data came from the US Bureau of Economic Analysis. The unemployment rate and the inflation rate (calculated using the consumer price index) were taken from the US Bureau of Labor Statistics.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# import csv of data to pandas dataframe 1980 - 2011.
outcome_vars = pd.read_csv("./outcome_data.csv")
# trim data to relevant years
df_rel_years = outcome_vars[outcome_vars["Year"] >= 1982]
# gets rid of post-2008 years which might complicate data (as there is no exact target)
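# .copy() makes this an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
df_outcome_vars = df_rel_years[df_rel_years["Federal Funds Target Rate"].notnull()].copy()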

We now calculate our outcome variable, Change in Federal Funds Rate, and append it to the table!

In [3]:
# add change in Fed funds rate column
changes = np.diff(np.array(df_outcome_vars["Federal Funds Target Rate"]))
df_outcome_vars["Change in Fed Funds Target Rate"] = np.concatenate([[None], changes])



### Scraping FOMC Transcripts and Generating Paths
Here we scrape the transcripts, organizing them in the feddata/ directory by year. Each year's folder contains all the conference calls and meeting transcripts for that year. In total, it amounts to 660 MB of data, or 18,741 pages.

This scraper does not check whether the files already exist, and it restarts the scraping process from the beginning if re-run. **Do not run the scraper portion of this if you already have the data.** It took so long I had to scrape this overnight.

In [4]:
from bs4 import BeautifulSoup
import requests
import re
import urllib.request
import os

Change the appropriate boolean to True and the others to false to generate the appropriate dictionary for use later on. Excuse the poorly written code - it works but I haven't had the time to refactor.

If you need to download the data, set "link_to_file_on_website" to True. Then run the scraping portion to download all the pdfs.

If you have the text data, then set "path_to_local_txt" to True and skip the scraping process. *** If you're coming from the github repo, you have the .txt data ***. Just continue and ignore the sections that I say to ignore.

In [5]:
# generates a dictionary of appropriate transcript paths
# if you already have the text data, set path_to_local_txt to True. 
link_to_file_on_website = False
path_to_local_pdf = False
path_to_local_txt = True

Run this after selecting the appropriate boolean. We'll need this later on.

In [6]:
if link_to_file_on_website:
    base_url = "https://www.federalreserve.gov/monetarypolicy/"
if path_to_local_pdf or path_to_local_txt:
    base_directory = "./feddata/"
    
transcript_links = {}
for year in range(1982, 2009): # from 1982 - 2008
    
    if link_to_file_on_website:
        path = "fomchistorical" + str(year) + ".htm"
        html_doc = requests.get(base_url + path)
        soup = BeautifulSoup(html_doc.content, 'html.parser')
        links = soup.find_all("a", string=re.compile('Transcript .*'))
        link_base_url = "https://www.federalreserve.gov"
        transcript_links[str(year)] = [link_base_url + link["href"] for link in links]
        
    elif path_to_local_pdf or path_to_local_txt:
        files = []
        path_to_folder = base_directory + str(year)
        new_files = os.walk(path_to_folder)
        for file in new_files:
            for f in file[2]:
                if path_to_local_pdf:
                    if f[-11:] == "meeting.pdf":  # the suffix is 11 characters long, so a 3-character slice never matches
                        files.append(str(file[0]) + "/" + f)
                elif path_to_local_txt:
                    if f[-11:] == "meeting.txt":
                        files.append(str(file[0]) + "/" + f)
        transcript_links[str(year)] = files
    print("Year Complete: ", year)


Year Complete:  1982
Year Complete:  1983
Year Complete:  1984
Year Complete:  1985
Year Complete:  1986
Year Complete:  1987
Year Complete:  1988
Year Complete:  1989
Year Complete:  1990
Year Complete:  1991
Year Complete:  1992
Year Complete:  1993
Year Complete:  1994
Year Complete:  1995
Year Complete:  1996
Year Complete:  1997
Year Complete:  1998
Year Complete:  1999
Year Complete:  2000
Year Complete:  2001
Year Complete:  2002
Year Complete:  2003
Year Complete:  2004
Year Complete:  2005
Year Complete:  2006
Year Complete:  2007
Year Complete:  2008


Below is the ***scraping process*** that downloads the linked PDFs. It only works if you set the "link_to_file_on_website" boolean to True. ***Do NOT run it if you have the .txt files.***

In [7]:
# for year in transcript_links.keys():
#     if not os.path.exists("./feddata/" + year):
#         os.makedirs("./feddata/" + year)
#     for link in transcript_links[year]:
#         response = urllib.request.urlopen(str(link))
#         name = re.search("[^/]*$", str(link))
#         print(link)
#         with open("./feddata/" + year + "/" + name.group(), 'wb') as f:
#             f.write(response.read())
#         print("file uploaded")
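The scraper above downloads every file again on each run. A minimal sketch of a guard that skips files already on disk is shown below. `download_if_missing` and its `fetch` parameter are hypothetical names introduced for illustration; `fetch` stands in for whatever call actually retrieves the bytes (for example `urllib.request.urlopen(url).read()`), and the URL is made up.

```python
import os
import re
import tempfile

def download_if_missing(url, dest_dir, fetch):
    """Save url into dest_dir unless a file with that name already exists.

    fetch is a hypothetical callable that takes a URL and returns bytes.
    Returns True if the file was written, False if it was skipped.
    """
    os.makedirs(dest_dir, exist_ok=True)
    # same file-name rule as the scraper: everything after the last "/"
    name = re.search("[^/]*$", url).group()
    target = os.path.join(dest_dir, name)
    if os.path.exists(target):
        return False
    with open(target, "wb") as f:
        f.write(fetch(url))
    return True

# usage with a fake fetcher, so nothing is requested from the network
calls = []
def fake_fetch(url):
    calls.append(url)
    return b"placeholder bytes"

with tempfile.TemporaryDirectory() as tmp:
    link = "https://example.org/files/FOMC19820202meeting.pdf"  # made-up URL
    first = download_if_missing(link, os.path.join(tmp, "1982"), fake_fetch)
    second = download_if_missing(link, os.path.join(tmp, "1982"), fake_fetch)
```

The second call finds the file from the first call and returns without fetching, so an interrupted overnight run could be resumed rather than restarted.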

If you just downloaded the data, you need to convert it to text. The most accurate translation is from this website:

pdftotext.com

Download the text data and put it in the corresponding year folder. It should appear next to the pdf. Then re-start the process from "Scraping FOMC Transcripts and Generating Paths" section. Now that you have the text files, you should set "path_to_local_txt" to True.

#### Create sorted list of transcripts
The ordering in this list will be important, as we will generate a corresponding list with weighted word counts.

In [8]:
# create list of all paths and sort in increasing order
sorted_transcripts = []
for linkset in transcript_links.values():
    sorted_transcripts += linkset
sorted_transcripts = sorted(sorted_transcripts)
print("Number of Documents", len(sorted_transcripts))

Number of Documents 216


#### Remove stop words from txt files
We do this separately from our tfidf vectorizer, as we want to run a stemmer. This generates new text files with the original name, but with "Stop.txt" appended. For example:

FOMC19820202meeting.txt -> 
FOMC19820202meetingStop.txt

*** if you already have the Stop.txt files then do not run this. It will take a long time***

In [9]:
# from nltk.corpus import stopwords
# i = 0
# for f in sorted_transcripts:
#     infile = open(f, 'r')
#     text = infile.readlines()
#     newfile = open(f[:-4] + 'Stop.txt','w')
#     new_text = []
#     for line in text:
#         mod_line = line[:-1].split(" ")
#         new_line = [word for word in mod_line if word.lower() not in stopwords.words('english')]
#         new_string = ""
#         for word in new_line:
#             new_string += " " + word
#         new_string += "\n"
#         new_text += new_string
#     newfile.writelines(new_text)
#     newfile.close()
#     infile.close()
#     i += 1
#     print("File " + str(i) + " of " + str(len(sorted_transcripts)) + " Completed")
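The cell above calls `stopwords.words('english')` once per word, which appears to rebuild the word list on every call and is probably a large part of why it runs so slowly. A minimal sketch of the same filtering with the stop words held in a set that is built once is shown below. `STOP_WORDS` is a tiny hand-written stand-in for the NLTK list and `remove_stop_words` is a hypothetical helper, both assumptions for this sketch.

```python
# Tiny stand-in for the NLTK English stop-word list (an assumption for this sketch)
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is", "that"})

def remove_stop_words(line, stop_words=STOP_WORDS):
    """Drop stop words from one line, matching case-insensitively as the cell above does."""
    kept = [word for word in line.split(" ") if word.lower() not in stop_words]
    return " ".join(kept)

cleaned = remove_stop_words("The outlook for inflation is that of a gradual decline")
print(cleaned)
```

With the real list, the only change would be to build the set once, before the loop over files, from `stopwords.words('english')`.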

We will now re-adjust our sorted_transcripts list

In [10]:
# run all the adjusting of variable sorted_transcripts in order
mod_transcripts = []
for link in sorted_transcripts:
    mod_transcripts.append(str(link)[:-4] + "Stop.txt")
sorted_transcripts = mod_transcripts
print("Number of Documents", len(sorted_transcripts))

Number of Documents 216


#### Apply Porter Stemmer to Stem Words
Every word in each text document is stemmed. This also takes a long time, so only run it if you do not already have files with the appropriate ending. For example:

FOMC19820202meetingStop.txt -> 
FOMC19820202meetingStopstemmed.txt

In [11]:
# i = 0
# for doc in sorted_transcripts:
#     !python porterstemmer.py {doc}
#     i += 1
#     print("File " + str(i) + " of " + str(len(sorted_transcripts)) + " Completed")
#     print(str(int(i * 100 / len(sorted_transcripts)))  + "%")

We will now re-re-adjust our sorted_transcripts list

In [12]:
# run all the adjusting of variable sorted_transcripts in order
mod_transcripts = []
for link in sorted_transcripts:
    mod_transcripts.append(str(link)[:-4] + "stemmed.txt")
sorted_transcripts = mod_transcripts
print("Number of Documents", len(sorted_transcripts))

Number of Documents 216


#### Create lists of Weighted Word Counts
The following code creates a list of tfidf weighted word counts, in the order appearing in sorted_transcripts. Stop words were already removed from the files it reads, so the vectorizer is run with stop_words=None. It also does the same thing with raw word stem counts.

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

input_vectorizer = TfidfVectorizer(input="filename", stop_words=None)
m = input_vectorizer.fit_transform(sorted_transcripts)

In [14]:
print("Number of Word Stem Vectors:", m.shape[1])
print("Shape of Vector Matrix:", m.shape)

Number of Word Stem Vectors: 17008
Shape of Vector Matrix: (216, 17008)


In [15]:
# same thing but with no global weighting
from sklearn.feature_extraction.text import CountVectorizer

input_vectorizer_counts = CountVectorizer(input="filename", stop_words=None)
c = input_vectorizer_counts.fit_transform(sorted_transcripts)

In [16]:
print("Number of Word Stem Vectors:", c.shape[1])
print("Shape of Vector Matrix:", c.shape)
# should be same as above

Number of Word Stem Vectors: 17008
Shape of Vector Matrix: (216, 17008)


Without stop words removed or stemming applied, we have 28,917 word vectors, with each row representing a different data point and each value representing the number of occurrences of the word.

Removing stop words reduces the number by about 30. Stemming reduces the number to 17,008.
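To make the weighting concrete, here is a sketch that computes tf-idf by hand on a three-document toy corpus. It uses the smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is what scikit-learn documents as the default for `TfidfVectorizer`. The corpus and the function name `tfidf_rows` are assumptions, and tokens are split on whitespace rather than with the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows), where each row is an L2-normalized tf-idf vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        length = math.sqrt(sum(v * v for v in raw))
        rows.append([v / length for v in raw])
    return vocab, rows

# toy corpus of already-stemmed tokens (made up)
docs = ["rate inflat rate", "rate growth", "rate bank bank"]
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

Because "rate" occurs in every toy document, its idf is the minimum value of 1, so within the second document the rarer stem "growth" ends up with the larger weight even though both appear once.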

#### Append transcript paths and Weighted Word Counts to the table at the corresponding entry
The following code adds the path of the fed transcript text data and the weighted word counts to the corresponding row in the df_outcome_vars table. We place a transcript in a row if it took place on or before the meeting date for that row, but after the previous row's meeting date.

In [17]:
transcript_path_col = []
weighted_word_count_col = []
word_count_col = []
curr_date_row_counter = 0
subgroup_transcript = []
subgroup_wordcount = []
subgroup_wordcount_c = []
i = 0
while i < len(sorted_transcripts):
    f_month = str(int(df_outcome_vars.iloc[curr_date_row_counter, 1]))
    f_day = str(int(df_outcome_vars.iloc[curr_date_row_counter, 2]))
    f_year = str(int(df_outcome_vars.iloc[curr_date_row_counter, 0]))
    fed_date = pd.to_datetime(f_month + "/" + f_day + "/" + f_year)
    
    link = sorted_transcripts[i]
    date = link.rsplit('/', 1)[-1]
    if date[0] == "F":
        month = date[8:10]
        day = date[10:12]
        year = date[4:8]
    else:
        month = date[4:6]
        day = date[6:8]
        year = date[0:4]
    text_date = pd.to_datetime(month + "/" + day + "/" + year)
    if text_date <= fed_date:
        subgroup_wordcount.append(m[i])
        subgroup_transcript.append(link)
        subgroup_wordcount_c.append(c[i])
        i += 1
    else:
        transcript_path_col.append(subgroup_transcript)
        weighted_word_count_col.append(subgroup_wordcount)
        word_count_col.append(subgroup_wordcount_c)
        subgroup_transcript = []
        subgroup_wordcount = []
        subgroup_wordcount_c = []
        curr_date_row_counter += 1
        
transcript_path_col.append(subgroup_transcript)
weighted_word_count_col.append(subgroup_wordcount)
word_count_col.append(subgroup_wordcount_c)

# append two empty lists representing columns
while len(transcript_path_col) != df_outcome_vars.shape[0] and len(weighted_word_count_col) != df_outcome_vars.shape[0] and len(word_count_col) != df_outcome_vars.shape[0]:
    print("appended column at end")
    transcript_path_col.append([])
    weighted_word_count_col.append([]) 
    word_count_col.append([])
print("Should say 'appended column at end' only twice!")

df_outcome_vars["Transcripts"] = transcript_path_col
df_outcome_vars["WeightedWordCount"] = weighted_word_count_col
df_outcome_vars["WordCount"] = word_count_col

appended column at end
appended column at end
Should say 'appended column at end' only twice!
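The grouping rule used above (each transcript goes to the first outcome row whose date falls on or after the transcript's own date) can also be written as a lookup in a sorted list. The sketch below pulls the eight-digit date out of the file name with a regular expression and uses `bisect_left` to find the row. The meeting dates and file names are made up, and `date_from_path` and `group_by_meeting` are hypothetical helpers, not part of the notebook.

```python
import re
from bisect import bisect_left
from datetime import date

def date_from_path(path):
    """Extract the YYYYMMDD date embedded in a transcript file name."""
    y, m, d = re.search(r"(\d{4})(\d{2})(\d{2})", path.rsplit("/", 1)[-1]).groups()
    return date(int(y), int(m), int(d))

def group_by_meeting(paths, meeting_dates):
    """Assign each path to the first meeting date on or after its own date.

    meeting_dates must be sorted in increasing order.
    """
    groups = [[] for _ in meeting_dates]
    unmatched = []  # transcripts dated after the last meeting date
    for path in paths:
        idx = bisect_left(meeting_dates, date_from_path(path))
        if idx < len(meeting_dates):
            groups[idx].append(path)
        else:
            unmatched.append(path)
    return groups, unmatched

# made-up meeting dates and file names, for illustration only
meeting_dates = [date(1982, 2, 2), date(1982, 3, 30), date(1982, 5, 18)]
paths = [
    "./feddata/1982/FOMC19820202meetingStopstemmed.txt",
    "./feddata/1982/FOMC19820330meetingStopstemmed.txt",
    "./feddata/1982/19820420meetingStopstemmed.txt",
]
groups, unmatched = group_by_meeting(paths, meeting_dates)
```

The regular expression handles file names with and without the "FOMC" prefix, which the loop above treats as two separate cases.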




#### Append Relevant Words Count To Table
We now take the list of vectors in each row, average them, keep only the entries for the relevant words, and normalize the result to unit length (the code then scales it by 1000).
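The averaging and normalization step can be sketched on plain lists. The numbers are made up and both helper names are hypothetical. The guard for an all-zero vector is an addition of this sketch: dividing by a zero norm, as the notebook's `norm` function would in that case, yields NaN values.

```python
import math

def average_vectors(vectors):
    """Element-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit_normalize(v, scale=1.0):
    """Rescale v to Euclidean length `scale`; an all-zero vector is returned unchanged."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0:
        return list(v)
    return [scale * x / length for x in v]

# two toy tf-idf rows restricted to three "relevant" words (made-up numbers)
rows = [[0.3, 0.0, 0.4], [0.3, 0.0, 0.4]]
avg = average_vectors(rows)
vec = unit_normalize(avg, scale=1000)
print(vec)
```

After this step only the relative weights of the relevant words within a row matter, since every non-empty row has the same length.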

In [18]:
# take a general look at data to decide which words seem to vary the most - would involve some table transformations
relevant_words = ['recoveri', 'save', 'continu', 'expect', 'stock', 'profit', 'gain', 'fund', 'resili', 'household', 'indic', 'bear', 'distribut', 'custom', 'incom', 'bull', 'particip', 'oil', 'suppli', 'employ', 'confid', 'bank', 'forecast', 'price', 'foreign', 'tax', 'stagnat', 'headwind', 'debt', 'wage', 'growth', 'unemploy', 'workforc', 'weak', 'geopolit', 'dramat', 'demand', 'labor', 'consum', 'job', 'produc', 'risk', 'polici', 'strong', 'rate', 'global', 'energi', 'corpor', 'deficit', 'supplier', 'exchang', 'commod', 'wealth', 'inflat', 'condit', 'capit', 'market', 'inflationari', 'economi', 'abroad', 'mortgag', 'percent', 'lack', 'crisi', 'lend']

In [19]:
import functools

# first get the indices of the relevant words in our tokenCount arrays (for use below)
feature_names = input_vectorizer.get_feature_names()
rel_word_indices = [feature_names.index(word) for word in relevant_words]

# create normalization function
def norm(v):
    norm = np.linalg.norm(v)
    return v / norm

# now begin process
rel_word_count = []
for lst in weighted_word_count_col:
    if len(lst) == 0:
        rel_word_count.append([])
    else:
        added_list = functools.reduce(np.add, lst)
        averaged_list = added_list / len(lst)
        rel_word_list = np.array([averaged_list.toarray()[0][i] for i in rel_word_indices])
        normalized_rel_word_list = norm(rel_word_list) * 1000
        rel_word_count.append(normalized_rel_word_list)

df_outcome_vars["RelevantWordVector"] = rel_word_count



Lastly, we choose to reduce our table so that we only consider *** changes *** in the federal funds rate.

We will train our model on a weighted average of the word vectors from the associated meeting and the meeting prior.

We exclude data points that have no transcripts associated with them (specifically, that meeting and the one before the rate change), as well as single successive changes that do not have an associated transcript.

In [436]:
# get relevant data (transcripts from year of change and change before) from df for train and test
# fed_rates = np.array(df_outcome_vars["Federal Funds Target Rate"])
change_rates = np.array(df_outcome_vars["Change in Fed Funds Target Rate"])
transcripts = np.array(df_outcome_vars["Transcripts"])
changes = []
i = 0
# while i < len(fed_rates) - 1:
#     if fed_rates[i + 1] != fed_rates[i]:
#         changes.append(True)
#         changes.append(True)
#         i += 2
#     else:
#         changes.append(False)
#         i += 1

# changes.append(False)

changes.append(False)
i += 1
while i < len(change_rates) - 1:
    if change_rates[i] != 0 and len(transcripts[i]) != 0:
        changes.append(True)
    else:
        changes.append(False)
    i += 1

changes.append(False)

relevant_data_df = df_outcome_vars[changes]

In [437]:
# pd.set_option('display.max_rows', len(relevant_data_df))
# relevant_data_df.loc[:, ["Year", "Month", "Day", "Transcripts", "Change in Fed Funds Target Rate", "RelevantWordVector"]]
# pd.reset_option('display.max_rows')

In [438]:
reduction_df = relevant_data_df[relevant_data_df["Change in Fed Funds Target Rate"] < 0]

In [439]:
increase_df = relevant_data_df[relevant_data_df["Change in Fed Funds Target Rate"] > 0]

# Training to Fit Model For Basis Point Change
Here we will split our data into training and test sets. We will use our training data to construct a Lasso linear model using iterative fitting along the regularization path. 

We will do this separately for reductions and for increases in the basis point change.

In [440]:
# split into test/train data.
sample_num1 = reduction_df.shape[0]

def random_bool(shape, p=0.5):
    n = np.prod(shape)
    # np.frombuffer replaces the deprecated binary mode of np.fromstring
    x = np.frombuffer(np.random.bytes(n), dtype=np.uint8, count=n)
    return (x < 255 * p).reshape(shape)

sample_gen1 = random_bool(sample_num1, 0.75)

# see how many samples we're taking / what percentage is train data
c = 0
for b in sample_gen1:
    if b:
        c += 1
        
train_df_reduc = reduction_df[sample_gen1]
test_df_reduc = reduction_df[np.invert(sample_gen1)]

print("REDUCTION")
print("Number of Samples", sample_num1)
print("number of train samples", c)
print("proportion of train samples", c / sample_num1)
print("number of test samples", sample_num1 - c)

REDUCTION
Number of Samples 42
number of train samples 28
proportion of train samples 0.6666666666666666
number of test samples 14


In [489]:
sample_num2 = increase_df.shape[0]
sample_gen2 = random_bool(sample_num2, 0.75)
c = 0
for b in sample_gen2:
    if b:
        c += 1

train_df_increas = increase_df[sample_gen2]
test_df_increas = increase_df[np.invert(sample_gen2)]

print("INCREASE")
print("Number of Samples", sample_num2)
print("number of train samples", c)
print("proportion of train samples", c / sample_num2)
print("number of test samples", sample_num2 - c)

INCREASE
Number of Samples 47
number of train samples 37
proportion of train samples 0.7872340425531915
number of test samples 10
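Because `random_bool` draws each row independently, the realized training share varies from run to run (about 0.67 and 0.79 in the outputs above, against a target of 0.75). If a fixed share and a reproducible split are wanted, one option is to shuffle the row indices with a seeded generator. This is a sketch under that assumption, not what the notebook ran; `split_indices` and the seed value are hypothetical.

```python
import random

def split_indices(n_samples, train_fraction=0.75, seed=0):
    """Return (train_idx, test_idx) for a fixed-size, reproducible split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)  # rounds down
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_idx, test_idx = split_indices(42)
print(len(train_idx), len(test_idx))
```

The index lists could then be passed to `DataFrame.iloc` to select the training and test rows.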


# Analysis of Reduction and Increase in Federal Funds Basis Point Change
Now we find the coefficients to our matrix and output relevant graphs

In [443]:
Y_train_reduc = np.array(train_df_reduc["Change in Fed Funds Target Rate"])
X_train_reduc = [list(x) for x in np.array(train_df_reduc["RelevantWordVector"])]

In [444]:
Y_test_reduc = np.array(test_df_reduc["Change in Fed Funds Target Rate"])
X_test_reduc = [list(x) for x in np.array(test_df_reduc["RelevantWordVector"])]

In [490]:
Y_train_increas = np.array(train_df_increas["Change in Fed Funds Target Rate"])
X_train_increas = [list(x) for x in np.array(train_df_increas["RelevantWordVector"])]

In [491]:
Y_test_increas = np.array(test_df_increas["Change in Fed Funds Target Rate"])
X_test_increas = [list(x) for x in np.array(test_df_increas["RelevantWordVector"])]

# Reduction Analysis

In [447]:
import sklearn.linear_model as lm

linear_clf = lm.LassoCV(tol=.001, cv=3)

# Fit your classifier
linear_clf.fit(X_train_reduc, Y_train_reduc)

# Output Coefficients
print(linear_clf.coef_)

[  0.00000000e+00   0.00000000e+00  -0.00000000e+00  -0.00000000e+00
   0.00000000e+00   0.00000000e+00  -0.00000000e+00   0.00000000e+00
  -0.00000000e+00   0.00000000e+00   0.00000000e+00  -0.00000000e+00
   0.00000000e+00  -0.00000000e+00   0.00000000e+00  -0.00000000e+00
  -0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
  -0.00000000e+00  -4.62863797e-04   0.00000000e+00   4.83390119e-04
  -0.00000000e+00  -0.00000000e+00  -0.00000000e+00  -0.00000000e+00
  -0.00000000e+00   0.00000000e+00  -0.00000000e+00  -0.00000000e+00
  -0.00000000e+00  -0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00  -0.00000000e+00  -0.00000000e+00   0.00000000e+00
   0.00000000e+00  -0.00000000e+00   0.00000000e+00  -0.00000000e+00
  -5.82520796e-05  -0.00000000e+00  -0.00000000e+00   0.00000000e+00
  -0.00000000e+00  -0.00000000e+00  -0.00000000e+00   0.00000000e+00
  -0.00000000e+00   0.00000000e+00  -0.00000000e+00   0.00000000e+00
  -0.00000000e+00   0.00000000e+00

In [None]:
linear_clf.coef_
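Most of the coefficients above are exactly zero, which is the expected behaviour of an L1 penalty. The sketch below shows why, using coordinate descent with soft-thresholding on the objective (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1, the form scikit-learn documents for its Lasso estimator. It is only an illustration: there is no intercept, the data are made up, alpha is fixed by hand (`LassoCV` chooses it by cross-validation), and the function names are hypothetical rather than scikit-learn's internals.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become exactly 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_sweeps=50):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 without an intercept."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # mean product of feature j with the residual that ignores feature j
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return w

# toy data (made up): y depends only on the first feature, with true coefficient 2
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [2, 2, -2, -2]
w_mild = lasso_coordinate_descent(X, y, alpha=0.5)
w_strong = lasso_coordinate_descent(X, y, alpha=2.5)
print(w_mild, w_strong)
```

With the milder penalty the irrelevant second coefficient is set exactly to zero and the first is shrunk from 2 to 1.5; with the stronger penalty both vanish. That resembles the pattern in the output above, where only a few words keep non-zero coefficients.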

In [458]:
i = 0
while i < len(linear_clf.coef_):
    print((linear_clf.coef_[i], relevant_words[i]))
    i += 1

(0.0, 'recoveri')
(0.0, 'save')
(-0.0, 'continu')
(-0.0, 'expect')
(0.0, 'stock')
(0.0, 'profit')
(-0.0, 'gain')
(0.0, 'fund')
(-0.0, 'resili')
(0.0, 'household')
(0.0, 'indic')
(-0.0, 'bear')
(0.0, 'distribut')
(-0.0, 'custom')
(0.0, 'incom')
(-0.0, 'bull')
(-0.0, 'particip')
(0.0, 'oil')
(0.0, 'suppli')
(0.0, 'employ')
(-0.0, 'confid')
(-0.00046286379737506603, 'bank')
(0.0, 'forecast')
(0.00048339011872315819, 'price')
(-0.0, 'foreign')
(-0.0, 'tax')
(-0.0, 'stagnat')
(-0.0, 'headwind')
(-0.0, 'debt')
(0.0, 'wage')
(-0.0, 'growth')
(-0.0, 'unemploy')
(-0.0, 'workforc')
(-0.0, 'weak')
(0.0, 'geopolit')
(0.0, 'dramat')
(0.0, 'demand')
(-0.0, 'labor')
(-0.0, 'consum')
(0.0, 'job')
(0.0, 'produc')
(-0.0, 'risk')
(0.0, 'polici')
(-0.0, 'strong')
(-5.8252079552950643e-05, 'rate')
(-0.0, 'global')
(-0.0, 'energi')
(0.0, 'corpor')
(-0.0, 'deficit')
(-0.0, 'supplier')
(-0.0, 'exchang')
(0.0, 'commod')
(-0.0, 'wealth')
(0.0, 'inflat')
(-0.0, 'condit')
(0.0, 'capit')
(-0.0, 'market')
(0.0, 'in

### RMSE - Deviation between Actual and Predicted Values

In [459]:
# Output RMSE on test set
def rmse(predicted_y, actual_y):
    return np.sqrt(np.mean((predicted_y - actual_y) ** 2))

print("RMSE", rmse(linear_clf.predict(X_test_reduc), Y_test_reduc))

RMSE 0.172953064708


In [449]:
linear_clf.predict(X_test_reduc)

array([-0.40129176, -0.36133812, -0.34368307, -0.30968473, -0.28815103,
       -0.28232152, -0.31684374, -0.39301967, -0.28095028, -0.39420921,
       -0.37633708, -0.37473606, -0.38259315, -0.24900028])

In [450]:
Y_test_reduc

array([-0.5, -0.5, -0.25, -0.5, -0.25, -0.3125, -0.25, -0.25, -0.25, -0.25,
       -0.5, -0.5, -0.5, -0.75], dtype=object)
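As a check, the RMSE can be recomputed in plain Python from the two arrays printed above. The values are copied from those outputs; the variable names are introduced here for illustration.

```python
import math

# predictions and actual changes, copied from the two outputs above
predicted = [-0.40129176, -0.36133812, -0.34368307, -0.30968473, -0.28815103,
             -0.28232152, -0.31684374, -0.39301967, -0.28095028, -0.39420921,
             -0.37633708, -0.37473606, -0.38259315, -0.24900028]
actual = [-0.5, -0.5, -0.25, -0.5, -0.25, -0.3125, -0.25, -0.25, -0.25, -0.25,
          -0.5, -0.5, -0.5, -0.75]

squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
rmse_value = math.sqrt(sum(squared_errors) / len(squared_errors))
# share of the total squared error that comes from the single worst prediction
largest_share = max(squared_errors) / sum(squared_errors)
print(rmse_value, largest_share)
```

This reproduces the reported RMSE of about 0.173. Roughly 60% of the total squared error comes from the last observation, the -0.75 change, which the model predicts as about -0.25.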

### T-stats + SD to measure significance of each coefficient

In [460]:
t_stats = []
std = []
coefficient_matrix = linear_clf.coef_
i = 0
while i < len(coefficient_matrix):
    vals = []
    for x_sample in X_train_reduc:
        vals.append(x_sample[i])
    standard_deviation = np.std(vals)
    std.append(standard_deviation)
    n = len(X_train_reduc)
    standard_error = standard_deviation / np.sqrt(n)
    t_stats.append((coefficient_matrix[i] - 0) / standard_error)
    i += 1
std = np.array(std)
std= np.round(std, decimals=2)

In [472]:
print(t_stats)
print("\n")
print(std)
print(np.mean(std))

[0.0, 0.0, -0.0, -0.0, 0.0, 0.0, -0.0, 0.0, -0.0, 0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, -0.0, 0.0, 0.0, 0.0, -0.0, -2.7468906470938074e-05, 0.0, 3.0845568656507008e-05, -0.0, -0.0, -0.0, -0.0, -0.0, 0.0, -0.0, -0.0, -0.0, -0.0, 0.0, 0.0, 0.0, -0.0, -0.0, 0.0, 0.0, -0.0, 0.0, -0.0, -2.6408435419382953e-06, -0.0, -0.0, 0.0, -0.0, -0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, -0.0, -0.0, 4.4906557709323541e-06, 0.0, -0.0, 0.0]


[  34.73   18.45   37.16   51.14   39.48   17.6    13.02   70.25    6.73
   21.14   29.43    5.34   10.11    6.55   23.73    1.85   13.62   54.87
   21.81   26.21   36.17   89.16   62.78   82.92   25.86   25.83    6.79
   12.75   27.87   24.36   84.67   23.42    4.13   47.62    1.47    8.55
   25.81   29.1    32.86   17.94   14.82   97.46   69.13   23.7   116.72
   18.56   24.49   12.56   14.22    5.37   33.94   17.94   16.01  123.29
   26.83   52.24  123.31   15.09   66.61   10.54   31.    162.35    5.25
   15.37   21.85]
35.2604615385


In [453]:
import plotly 
plotly.tools.set_credentials_file(username='ali-wetrill', api_key='x8ZULc5B5lFJLxnFCFD8')

### For each basis point change - variation in number of words

In [37]:
basis_changes = sorted(list(set(list(reduction_df["Change in Fed Funds Target Rate"]))))
basis_changes.reverse()

In [38]:
vals = [[np.sum(x[0]) for x in list(reduction_df[reduction_df["Change in Fed Funds Target Rate"] == j]["WordCount"])] for j in basis_changes]

In [39]:
len(basis_changes)

6

In [40]:
import plotly.plotly as py
import plotly.graph_objs as go

colors = ['rgba(93, 164, 214, 0.5)', 'rgba(255, 144, 14, 0.5)', 'rgba(44, 160, 101, 0.5)', 'rgba(255, 65, 54, 0.5)', 'rgba(207, 114, 255, 0.5)', 'rgba(127, 96, 0, 0.5)']

# build one box trace per basis-point change, each with its own color
data = [
    go.Box(
        y = vals[k],
        name = str(basis_changes[k]),
        jitter = 0.3,
        pointpos = -1.8,
        boxpoints = 'all',
        marker = dict(
            color = colors[k]),
        line = dict(
            color = colors[k])
    )
    for k in range(len(colors))
]

layout = go.Layout(
    title = "Variation in Words within each Transcript",
    xaxis=dict(
        type="category"
    )
)

fig = go.Figure(data=data,layout=layout)
py.iplot(fig, filename = "Box Plot Styling Outliers")

### For each basis point change - top words

In [41]:
word_count_per_basis_change = []
for rate in basis_changes:
    z = list(reduction_df[reduction_df["Change in Fed Funds Target Rate"] == rate]["WordCount"])
    word_count_per_basis_change.append([sum([x[0].toarray()[0][i] for x in z]) for i in range(0, len(z[0][0].toarray()[0]))])

In [42]:
total_word_count = [sum(word_count_per_basis_change[i]) for i in range(0, len(word_count_per_basis_change))]

In [43]:
percent_word_counts = [np.array(word_count_per_basis_change[i]) / total_word_count[i] for i in range(0, len(total_word_count))]

In [44]:
percent_word_counts_with_words = []
for arr in percent_word_counts:
    i = 0
    new_arr = []
    names = input_vectorizer_counts.get_feature_names()
    while i < len(arr):
        new_arr.append((arr[i], names[i]))
        i += 1
    percent_word_counts_with_words.append(new_arr)

In [45]:
percent_word_counts_with_words = [sorted(list(x)) for x in percent_word_counts_with_words]
dummy = [x.reverse() for x in percent_word_counts_with_words]

In [46]:
# manual inspection. adjust i for starting basis_changes. j for within a basis_change.
# i = 2
# while i < len(percent_word_counts_with_words):
#     print(basis_changes[i])
#     j = 43
#     while j < 50:
#         print(percent_word_counts_with_words[i][j])
#         j += 1
#     print("\n")
#     i += 1

In [47]:
#manual inspection on which to plot
ind_2_25 = [5, 12, 13, 14, 15, 16, 26, 34, 43, 44]
ind_4_05 = [3, 5, 13, 14, 15, 18, 22, 25, 26, 28]

In [48]:
i = 2
print(basis_changes[i])
for j in ind_2_25:
    val = percent_word_counts_with_words[i][j]
    print(str(np.around(val[0] * 100, decimals=3)) + "%" + " " + val[1])
i = 4
print("\n")
print(basis_changes[i])
for j in ind_4_05:
    val = percent_word_counts_with_words[i][j]
    print(str(np.around(val[0] * 100, decimals=3)) + "%" + " " + val[1])

-0.25
0.715% market
0.527% growth
0.518% percent
0.491% inflat
0.479% price
0.428% polici
0.383% risk
0.313% fund
0.27% bank
0.265% continu


-0.5
1.037% rate
0.761% market
0.469% percent
0.404% risk
0.402% growth
0.391% economi
0.379% polici
0.365% expect
0.359% forecast
0.35% bank
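Each percentage printed above is a word's count divided by the total word count over all transcripts that share one basis-point change. The sketch below shows the same calculation with `collections.Counter` on made-up token lists; `relative_frequencies` is a hypothetical helper.

```python
from collections import Counter

def relative_frequencies(token_lists):
    """Pool tokens from several documents and return (word, percent) pairs, largest first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(counts.values())
    return [(word, 100 * n / total) for word, n in counts.most_common()]

# made-up stemmed tokens for two transcripts that share one rate change
transcripts = [
    ["rate", "market", "rate", "risk"],
    ["rate", "growth", "market", "rate"],
]
freqs = relative_frequencies(transcripts)
for word, pct in freqs:
    print(f"{pct:.1f}% {word}")
```

Pooling before dividing means longer transcripts carry more weight in the percentages than shorter ones, as they do in the cells above.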


### A Closer Look at Relevant Words
Percentages for the -0.5 and -0.25 changes, and how the most relevant words change over time.

In [49]:
i = 2
j = 0
ind_relwords_25 = []
while j < len(percent_word_counts_with_words[i]):
    if percent_word_counts_with_words[i][j][1] in relevant_words:
        ind_relwords_25.append(j)
    j += 1

In [50]:
i = 4
j = 0
ind_relwords_05 = []
while j < len(percent_word_counts_with_words[i]):
    if percent_word_counts_with_words[i][j][1] in relevant_words:
        ind_relwords_05.append(j)
    j += 1

In [51]:
i = 2
print(basis_changes[i])
for j in ind_relwords_25:
    val = percent_word_counts_with_words[i][j]
    print(str(np.around(val[0] * 100, decimals=3)) + "%" + " " + val[1])
i = 4
print("\n")
print(basis_changes[i])
for j in ind_relwords_05:
    val = percent_word_counts_with_words[i][j]
    print(str(np.around(val[0] * 100, decimals=3)) + "%" + " " + val[1])

-0.25
1.054% rate
0.715% market
0.527% growth
0.518% percent
0.491% inflat
0.479% price
0.428% polici
0.419% economi
0.411% expect
0.405% forecast
0.383% risk
0.313% fund
0.27% bank
0.265% continu
0.154% weak
0.149% indic
0.148% capit
0.142% demand
0.126% consum
0.126% condit
0.101% strong
0.092% confid
0.09% employ
0.084% incom
0.084% labor
0.083% stock
0.077% unemploy
0.076% oil
0.071% foreign
0.067% suppli
0.067% exchang
0.064% tax
0.06% debt
0.058% energi
0.056% job
0.054% produc
0.051% household
0.049% mortgag
0.047% wage
0.047% recoveri
0.042% profit
0.042% inflationari
0.041% particip
0.038% lend
0.038% commod
0.037% gain
0.036% corpor
0.028% save
0.026% dramat
0.025% distribut
0.021% deficit
0.019% global
0.018% wealth
0.018% abroad
0.016% lack
0.01% custom
0.01% crisi
0.008% resili
0.008% bear
0.004% headwind
0.004% supplier
0.002% workforc
0.001% geopolit
0.0% stagnat
0.0% bull


-0.5
1.037% rate
0.761% market
0.469% percent
0.404% risk
0.402% growth
0.391% economi
0.379% pol

In [52]:
# plot how they change over time

In [202]:
# top word counts in every date
# sum of all words in every date for neg changes
# list of lists, with each inner list holding a particular word frequency change over time

In [112]:
word_counts_with_words = []
z = 1
for arr in list(reduction_df["WordCount"]):
    lst = arr[0].toarray()[0]
    s = sum(lst)
    i = 0
    new_arr = []
    names = input_vectorizer_counts.get_feature_names()
    while i < len(lst):
        new_arr.append((lst[i] * 100 / s, names[i]))
        i += 1
    word_counts_with_words.append(new_arr)

In [113]:
word_counts_with_words = [sorted(list(x)) for x in word_counts_with_words]
dummy = [x.reverse() for x in word_counts_with_words]

In [114]:
# lst_ind_relwords = []
# i = 0
# while i < len(word_counts_with_words):
#     j = 0
#     ind_relwords = []
#     while j < len(word_counts_with_words[i]):
#         if word_counts_with_words[i][j][1] in relevant_words:
#             ind_relwords.append(j)
#         j += 1
#     lst_ind_relwords.append(ind_relwords)
#     i += 1

In [115]:
# keep only the (percentage, word) pairs whose word is in relevant_words
rel_word_counts_with_words = [
    [pair for pair in lst if pair[1] in relevant_words]
    for lst in word_counts_with_words
]

In [127]:
def findItem(theList, word):
    return [theList.index(i) for i in theList if i[1] == word][0]
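
The position lookup above scans each date's list once per word. As a minimal sketch of an alternative, assuming the same (percentage, word) pair layout, a dictionary per date gives each word's series directly. The names `toy_counts`, `toy_words`, and `series_by_word` are hypothetical and the values are made up for illustration.

```python
# Toy stand-in for rel_word_counts_with_words: one list of
# (percentage, word) pairs per meeting date (values are invented)
toy_counts = [
    [(0.9, "rate"), (0.5, "inflat"), (0.1, "oil")],
    [(0.7, "inflat"), (0.6, "rate"), (0.2, "oil")],
]
toy_words = ["oil", "rate", "inflat"]

def series_by_word(counts_per_date, words):
    """Return {word: [percentage on date 0, percentage on date 1, ...]}."""
    # one word -> percentage dictionary per date
    lookups = [{word: pct for pct, word in pairs} for pairs in counts_per_date]
    return {word: [lookup[word] for lookup in lookups] for word in words}

toy_series = series_by_word(toy_counts, toy_words)
print(toy_series["rate"])
```

Each dictionary is built once per date, so the cost no longer grows with the number of words times the length of each list.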

In [134]:
rel_word_g2 = [[findItem(x, word) for x in rel_word_counts_with_words] for word in relevant_words]

In [137]:
# Relevant lists
# rel_word_counts_with_words - one list per date of (percentage, word) pairs, restricted to relevant words.
# rel_word_g2 - for each relevant word, the position of that word in each date's list above.
# relevant_words - the words themselves, in the same order as rel_word_g2.

'recoveri'

In [145]:
rel_word_counts_with_words[0][16]

(0.076523640338890406, 'recoveri')

In [149]:
# for each relevant word, collect its (percentage, word) pair on every date
y_vals = [
    [date_lst[index] for index, date_lst in zip(word_indices, rel_word_counts_with_words)]
    for word_indices in rel_word_g2
]

labels = relevant_words

yr = list(reduction_df["Year"])
mnth = list(reduction_df["Month"])
day = list(reduction_df["Day"])
x_vals = []
for i in zip(mnth, day, yr):
    date = pd.to_datetime(str(i[0]) + "/" + str(i[1]) + "/" + str(i[2]))
    x_vals.append(date)
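
The dates above are assembled from strings in month/day/year order. As a hedged sketch of an alternative, `pd.to_datetime` can also assemble dates from a DataFrame whose columns are named `year`, `month`, and `day`, which avoids any ambiguity in string parsing. The frame `toy_df` is a hypothetical stand-in for the `Year`, `Month`, and `Day` columns used above, with invented dates.

```python
import pandas as pd

# Hypothetical stand-in for the Year/Month/Day columns of the transcript table
toy_df = pd.DataFrame({"Year": [1994, 2004], "Month": [2, 6], "Day": [4, 30]})

# to_datetime assembles one Timestamp per row from year/month/day columns;
# the columns are renamed to lowercase to match the documented names
toy_dates = pd.to_datetime(toy_df[["Year", "Month", "Day"]].rename(columns=str.lower))
x_vals_sketch = list(toy_dates)
print(x_vals_sketch)
```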

In [253]:
# plotly.plotly was moved to the separate chart_studio package in plotly 4;
# there the equivalent import is `import chart_studio.plotly as py`
import plotly.plotly as py
import plotly.graph_objs as go

# One line trace per relevant word: x is the meeting date, y is the word's
# share (in percent) of that meeting's transcript
traces = [
    go.Scatter(
        x = x_vals,
        y = [y[0] for y in word_vals],
        mode = 'lines',
        name = word
    )
    for word_vals, word in zip(y_vals, relevant_words)
]

# indices (into relevant_words) of the words chosen for the plot
plot_ind = [2, 3, 7, 21, 22, 23, 30, 41, 42, 53]
data = [traces[i] for i in plot_ind]

# data = traces  # uncomment to plot all relevant words

py.iplot(data, filename='line-mode')

In [252]:
# relevant_words.index("economi")

58

In [245]:
# ind = [2, 3, 7, 21, 22, 23, 30, 41, 42, 44, 53, 56, 58, 61]
# avg = []
# for i in ind:
#     avg.append(np.mean([y[0] for y in y_vals[i]]))

# Increase Analysis

In [492]:
import sklearn.linear_model as lm

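# Lasso regression; the L1 penalty strength is chosen by 3-fold cross-validation
linear_clf2 = lm.LassoCV(tol=.001, cv=3)

# Fit the model on the training set of rate increases
linear_clf2.fit(X_train_increas, Y_train_increas)

# Output the fitted coefficients
print(linear_clf2.coef_)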

[ -0.00000000e+00  -0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00  -0.00000000e+00
  -0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00  -0.00000000e+00  -0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00  -0.00000000e+00   0.00000000e+00
   0.00000000e+00  -0.00000000e+00   0.00000000e+00   0.00000000e+00
  -0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00  -0.00000000e+00   0.00000000e+00  -0.00000000e+00
  -0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
  -0.00000000e+00   0.00000000e+00   7.76648264e-05   0.00000000e+00
  -0.00000000e+00   0.00000000e+00  -0.00000000e+00  -0.00000000e+00
  -0.00000000e+00   0.00000000e+00   0.00000000e+00  -0.00000000e+00
   0.00000000e+00   2.62593907e-06   0.00000000e+00  -0.00000000e+00
   0.00000000e+00  -0.00000000e+00

In [493]:
# pair each coefficient with its word
for coef, word in zip(linear_clf2.coef_, relevant_words):
    print((coef, word))

(-0.0, 'recoveri')
(-0.0, 'save')
(0.0, 'continu')
(0.0, 'expect')
(0.0, 'stock')
(0.0, 'profit')
(0.0, 'gain')
(-0.0, 'fund')
(-0.0, 'resili')
(0.0, 'household')
(0.0, 'indic')
(0.0, 'bear')
(0.0, 'distribut')
(0.0, 'custom')
(0.0, 'incom')
(0.0, 'bull')
(0.0, 'particip')
(-0.0, 'oil')
(-0.0, 'suppli')
(0.0, 'employ')
(0.0, 'confid')
(0.0, 'bank')
(-0.0, 'forecast')
(0.0, 'price')
(0.0, 'foreign')
(-0.0, 'tax')
(0.0, 'stagnat')
(0.0, 'headwind')
(-0.0, 'debt')
(0.0, 'wage')
(0.0, 'growth')
(0.0, 'unemploy')
(0.0, 'workforc')
(-0.0, 'weak')
(0.0, 'geopolit')
(-0.0, 'dramat')
(-0.0, 'demand')
(0.0, 'labor')
(0.0, 'consum')
(0.0, 'job')
(-0.0, 'produc')
(0.0, 'risk')
(7.7664826431034697e-05, 'polici')
(0.0, 'strong')
(-0.0, 'rate')
(0.0, 'global')
(-0.0, 'energi')
(-0.0, 'corpor')
(-0.0, 'deficit')
(0.0, 'supplier')
(0.0, 'exchang')
(-0.0, 'commod')
(0.0, 'wealth')
(2.6259390725761357e-06, 'inflat')
(0.0, 'condit')
(-0.0, 'capit')
(0.0, 'market')
(-0.0, 'inflationari')
(0.0, 'economi')
(

In [494]:
# Output RMSE on test set
def rmse(predicted_y, actual_y):
    return np.sqrt(np.mean((predicted_y - actual_y) ** 2))

print("RMSE", rmse(linear_clf2.predict(X_test_increas), Y_test_increas))

RMSE 0.316532429258
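
As a small worked example of the RMSE formula above, the sketch below evaluates it on made-up numbers. It also casts both inputs to float first, on the assumption that the targets may be stored with `dtype=object`. The names `rmse_sketch`, `toy_pred`, and `toy_actual` are hypothetical.

```python
import numpy as np

def rmse_sketch(predicted_y, actual_y):
    """Root mean squared error after casting both inputs to float arrays."""
    predicted = np.asarray(predicted_y, dtype=float)
    actual = np.asarray(actual_y, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Invented values: a constant prediction of 0.25 against four rate changes
toy_pred = [0.25, 0.25, 0.25, 0.25]
toy_actual = np.array([0.125, 0.25, 0.5, 0.25], dtype=object)

# errors are 0.125, 0, -0.25, 0, so the mean squared error is 0.01953125
toy_rmse = rmse_sketch(toy_pred, toy_actual)
print(toy_rmse)
```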


In [495]:
linear_clf2.predict(X_test_increas)

array([ 0.21774323,  0.20283646,  0.23914756,  0.23025074,  0.23742124,
        0.23821036,  0.2415549 ,  0.24697556,  0.24669141,  0.24860456])

In [480]:
Y_test_increas

array([0.125, 0.25, 0.125, 0.0625, 0.25, 0.25, 0.25, 0.25, 0.25, 0.5, 0.25,
       0.25, 0.25, 0.25, 0.25, 0.25, 0.25], dtype=object)

In [496]:
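# Rough per-word "t-statistics": each coefficient is divided by the standard
# error of the mean of its feature column (std / sqrt(n)). This is not the
# sampling standard error of a Lasso coefficient, so the values are a
# heuristic scale check rather than a formal test.
t_stats = []
std = []
coefficient_matrix = linear_clf2.coef_
n = len(X_train_increas)
for i in range(len(coefficient_matrix)):
    # all training values of feature i
    vals = [x_sample[i] for x_sample in X_train_increas]
    standard_deviation = np.std(vals)
    std.append(standard_deviation)
    standard_error = standard_deviation / np.sqrt(n)
    t_stats.append((coefficient_matrix[i] - 0) / standard_error)
std = np.round(np.array(std), decimals=2)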

In [512]:
print(t_stats)
print("\n")
print(std)
print(np.mean(std))

[-0.0, -0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.0, -0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.0, -0.0, 0.0, 0.0, 0.0, -0.0, 0.0, 0.0, -0.0, 0.0, 0.0, -0.0, 0.0, 0.0, 0.0, 0.0, -0.0, 0.0, -0.0, -0.0, 0.0, 0.0, 0.0, -0.0, 0.0, 6.2124091116642235e-06, 0.0, -0.0, 0.0, -0.0, -0.0, -0.0, 0.0, 0.0, -0.0, 0.0, 1.2167800130819637e-07, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, -0.0, -2.1649225161280406e-06, 0.0, 0.0, 0.0]


[  45.32   12.91   48.07   58.73   24.87   12.09   14.45   73.04    5.59
   17.32   24.51    4.24    9.14    6.54   31.36    4.02   24.     52.39
   30.62   38.44   16.49   47.94   64.64  136.28   30.57   15.2     1.68
    6.71   29.03   32.31   63.31   19.76    5.01   19.41    6.34    7.39
   37.78   38.62   25.35   19.02   16.09   48.04   76.04   26.44  135.04
   22.83   62.19   10.58   23.32    6.78   34.93   17.72   14.77  131.27
   14.19   18.52   85.13   22.63   46.25    8.96   23.28  155.76    5.14
    7.27    9.18]
33.5821538462
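
The scaling used for `t_stats` above divides each coefficient by its feature column's standard deviation over the square root of the sample size. A minimal vectorized sketch of that same arithmetic, on invented data, is shown here. The names `toy_X`, `toy_coef`, and `toy_t` are hypothetical, and the result is the same heuristic ratio, not a formal test statistic for a Lasso coefficient.

```python
import numpy as np

# Invented design matrix: 4 samples (rows) by 2 features (columns)
toy_X = np.array([[1.0, 10.0],
                  [3.0, 10.0],
                  [5.0, 40.0],
                  [7.0, 40.0]])
toy_coef = np.array([0.5, 0.0])

# population standard deviation of each column, as np.std computes by default
col_std = toy_X.std(axis=0)
# standard error of each column mean: std / sqrt(n)
col_se = col_std / np.sqrt(toy_X.shape[0])
# coefficient divided by that standard error
toy_t = toy_coef / col_se
print(toy_t)
```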


In [358]:
# distinct sizes of rate increase, largest first
basis_changes = sorted(set(increase_df["Change in Fed Funds Target Rate"]), reverse=True)

In [359]:
basis_changes

[1.125, 0.75, 0.5, 0.3125, 0.25, 0.125, 0.0625]

In [360]:
vals = [[np.sum(x[0]) for x in list(increase_df[increase_df["Change in Fed Funds Target Rate"] == j]["WordCount"])] for j in basis_changes]

In [361]:
import plotly.plotly as py
import plotly.graph_objs as go

colors = ['rgba(93, 164, 214, 0.5)', 'rgba(255, 144, 14, 0.5)', 'rgba(44, 160, 101, 0.5)', 'rgba(255, 65, 54, 0.5)', 'rgba(207, 114, 255, 0.5)', 'rgba(127, 96, 0, 0.5)']

# One box per size of rate change. zip stops at the shortest input, so only
# the first six entries of basis_changes (one per color) are drawn.
data = [
    go.Box(
        y = y,
        name = str(change),
        jitter = 0.3,
        pointpos = -1.8,
        boxpoints = 'all',
        marker = dict(
            color = color),
        line = dict(
            color = color)
    )
    for y, change, color in zip(vals, basis_changes, colors)
]

layout = go.Layout(
    title = "Variation in Words within each Transcript",
    xaxis=dict(
        type="category"
    )
)

fig = go.Figure(data=data,layout=layout)
py.iplot(fig, filename = "Box Plot Styling Outliers")

In [362]:
word_count_per_basis_change = []
for rate in basis_changes:
    z = list(increase_df[increase_df["Change in Fed Funds Target Rate"] == rate]["WordCount"])
    word_count_per_basis_change.append([sum([x[0].toarray()[0][i] for x in z]) for i in range(0, len(z[0][0].toarray()[0]))])

In [363]:
total_word_count = [sum(word_count_per_basis_change[i]) for i in range(0, len(word_count_per_basis_change))]

In [364]:
percent_word_counts = [np.array(word_count_per_basis_change[i]) / total_word_count[i] for i in range(0, len(total_word_count))]
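
The normalization above divides each group's word counts by that group's total. As a hedged illustration on made-up counts, the same step can be written with NumPy broadcasting. The names `toy_counts_by_change` and `toy_shares` are hypothetical.

```python
import numpy as np

# Invented counts: rows are rate-change groups, columns are words
toy_counts_by_change = np.array([[30, 10, 10],
                                 [5, 15, 0]])

# divide every row by its own total; keepdims keeps the totals as a column
# so that broadcasting lines them up with the rows
toy_shares = toy_counts_by_change / toy_counts_by_change.sum(axis=1, keepdims=True)
print(toy_shares)
```

Each row of `toy_shares` sums to 1, which is what makes shares comparable across groups with different numbers of transcripts.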

In [365]:
# pair each word's share with the word itself, one list per size of rate change
names = input_vectorizer_counts.get_feature_names()
percent_word_counts_with_words = [
    [(share, name) for share, name in zip(arr, names)]
    for arr in percent_word_counts
]

In [366]:
# sort each group's (share, word) pairs from most to least frequent
percent_word_counts_with_words = [sorted(x, reverse=True) for x in percent_word_counts_with_words]

In [391]:
# manual inspection. adjust i for starting basis_changes. j for within a basis_change.
# i = 5
# while i < len(percent_word_counts_with_words):
#     print(basis_changes[i])
#     j = 33
#     while j < 50:
#         print(percent_word_counts_with_words[i][j])
#         j += 1
#     print("\n")
#     break
#     i += 1

In [392]:
#manual inspection on which to plot
ind_2_05 = [4, 6, 10, 14, 17, 18, 19, 21, 27, 29]
ind_4_25 = [2, 5, 6, 7, 8, 9, 11, 16, 17, 18]
ind_5_125 = [6, 17, 21, 24, 27, 28, 30, 32, 33]

In [393]:
i = 2
print(basis_changes[i])
for j in ind_2_05:
    val = percent_word_counts_with_words[i][j]
    print(str(np.around(val[0] * 100, decimals=3)) + "%" + " " + val[1])
i = 4
print("\n")
print(basis_changes[i])
for j in ind_4_25:
    val = percent_word_counts_with_words[i][j]
    print(str(np.around(val[0] * 100, decimals=3)) + "%" + " " + val[1])
i = 5
print("\n")
print(basis_changes[i])
for j in ind_5_125:
    val = percent_word_counts_with_words[i][j]
    print(str(np.around(val[0] * 100, decimals=3)) + "%" + " " + val[1])

0.5
0.865% rate
0.661% inflat
0.612% market
0.501% polici
0.428% increas
0.423% growth
0.387% time
0.371% price
0.337% economi
0.321% committe


0.25
0.961% rate
0.838% price
0.778% inflat
0.713% market
0.708% year
0.599% percent
0.568% growth
0.48% expect
0.452% polici
0.445% increas


0.125
0.996% rate
0.5% growth
0.413% market
0.385% time
0.375% forecast
0.371% increas
0.36% rang
0.351% interest
0.348% price


In [None]:
# a closer look at relevant words

In [394]:
i = 2
j = 0
ind_relwords_05 = []
while j < len(percent_word_counts_with_words[i]):
    if percent_word_counts_with_words[i][j][1] in relevant_words:
        ind_relwords_05.append(j)
    j += 1

In [395]:
i = 4
j = 0
ind_relwords_25 = []
while j < len(percent_word_counts_with_words[i]):
    if percent_word_counts_with_words[i][j][1] in relevant_words:
        ind_relwords_25.append(j)
    j += 1

In [396]:
i = 5
j = 0
ind_relwords_125 = []
while j < len(percent_word_counts_with_words[i]):
    if percent_word_counts_with_words[i][j][1] in relevant_words:
        ind_relwords_125.append(j)
    j += 1

In [397]:
i = 2
print(basis_changes[i])
for j in ind_relwords_05:
    val = percent_word_counts_with_words[i][j]
    print(str(np.around(val[0] * 100, decimals=3)) + "%" + " " + val[1])
i = 4
print("\n")
print(basis_changes[i])
for j in ind_relwords_25:
    val = percent_word_counts_with_words[i][j]
    print(str(np.around(val[0] * 100, decimals=3)) + "%" + " " + val[1])
i = 5
print("\n")
print(basis_changes[i])
for j in ind_relwords_125:
    val = percent_word_counts_with_words[i][j]
    print(str(np.around(val[0] * 100, decimals=3)) + "%" + " " + val[1])

0.5
0.865% rate
0.661% inflat
0.612% market
0.55% percent
0.501% polici
0.423% growth
0.371% price
0.337% economi
0.301% expect
0.293% forecast
0.293% continu
0.266% bank
0.258% risk
0.213% fund
0.179% strong
0.155% demand
0.132% foreign
0.128% indic
0.127% employ
0.123% labor
0.108% exchang
0.103% consum
0.094% wage
0.094% unemploy
0.072% capit
0.071% incom
0.069% suppli
0.068% condit
0.065% stock
0.065% job
0.06% gain
0.057% weak
0.055% produc
0.053% debt
0.047% oil
0.047% confid
0.04% particip
0.037% household
0.033% inflationari
0.029% tax
0.026% deficit
0.025% wealth
0.021% recoveri
0.021% dramat
0.02% mortgag
0.019% distribut
0.015% profit
0.015% lend
0.015% corpor
0.014% energi
0.014% crisi
0.013% lack
0.013% custom
0.013% abroad
0.011% save
0.01% supplier
0.005% global
0.005% commod
0.003% resili
0.003% bear
0.001% bull
0.0% workforc
0.0% stagnat
0.0% headwind
0.0% geopolit


0.25
0.961% rate
0.838% price
0.778% inflat
0.713% market
0.599% percent
0.568% growth
0.48% expect
0.4

In [None]:
# plot how they change over time

In [398]:
word_counts_with_words = []
# feature names are the same for every transcript, so look them up once
names = input_vectorizer_counts.get_feature_names()
for arr in list(increase_df["WordCount"]):
    lst = arr[0].toarray()[0]
    s = sum(lst)
    # share of each word in this transcript (in percent), paired with the word
    word_counts_with_words.append([(count * 100 / s, name) for count, name in zip(lst, names)])

In [399]:
# sort each date's (percentage, word) pairs from most to least frequent
word_counts_with_words = [sorted(x, reverse=True) for x in word_counts_with_words]

In [400]:
# keep only the (percentage, word) pairs whose word is in relevant_words
rel_word_counts_with_words = [
    [pair for pair in lst if pair[1] in relevant_words]
    for lst in word_counts_with_words
]

In [401]:
def findItem(theList, word):
    return [theList.index(i) for i in theList if i[1] == word][0]

In [402]:
rel_word_g2 = [[findItem(x, word) for x in rel_word_counts_with_words] for word in relevant_words]

In [403]:
# for each relevant word, collect its (percentage, word) pair on every date
y_vals = [
    [date_lst[index] for index, date_lst in zip(word_indices, rel_word_counts_with_words)]
    for word_indices in rel_word_g2
]

labels = relevant_words

yr = list(increase_df["Year"])
mnth = list(increase_df["Month"])
day = list(increase_df["Day"])
x_vals = []
for i in zip(mnth, day, yr):
    date = pd.to_datetime(str(i[0]) + "/" + str(i[1]) + "/" + str(i[2]))
    x_vals.append(date)

In [430]:
import plotly.plotly as py
import plotly.graph_objs as go

# Create random data with numpy
import numpy as np

# Create traces
trace0 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[0]],
    mode = 'lines',
    name = relevant_words[0]
)

trace1 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[1]],
    mode = 'lines',
    name = relevant_words[1]
)

trace2 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[2]],
    mode = 'lines',
    name = relevant_words[2]
)

trace3 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[3]],
    mode = 'lines',
    name = relevant_words[3]
)

trace4 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[4]],
    mode = 'lines',
    name = relevant_words[4]
)

trace5 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[5]],
    mode = 'lines',
    name = relevant_words[5]
)

trace6 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[6]],
    mode = 'lines',
    name = relevant_words[6]
)

trace7 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[7]],
    mode = 'lines',
    name = relevant_words[7]
)

trace8 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[8]],
    mode = 'lines',
    name = relevant_words[8]
)

trace9 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[9]],
    mode = 'lines',
    name = relevant_words[9]
)

trace10 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[10]],
    mode = 'lines',
    name = relevant_words[10]
)

trace11 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[11]],
    mode = 'lines',
    name = relevant_words[11]
)

trace12 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[12]],
    mode = 'lines',
    name = relevant_words[12]
)

trace13 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[13]],
    mode = 'lines',
    name = relevant_words[13]
)

trace14 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[14]],
    mode = 'lines',
    name = relevant_words[14]
)

trace15 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[15]],
    mode = 'lines',
    name = relevant_words[15]
)

trace16 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[16]],
    mode = 'lines',
    name = relevant_words[16]
)

trace17 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[17]],
    mode = 'lines',
    name = relevant_words[17]
)

trace18 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[18]],
    mode = 'lines',
    name = relevant_words[18]
)

trace19 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[19]],
    mode = 'lines',
    name = relevant_words[19]
)

trace20 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[20]],
    mode = 'lines',
    name = relevant_words[20]
)

trace21 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[21]],
    mode = 'lines',
    name = relevant_words[21]
)

trace22 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[22]],
    mode = 'lines',
    name = relevant_words[22]
)

trace23 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[23]],
    mode = 'lines',
    name = relevant_words[23]
)

trace24 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[24]],
    mode = 'lines',
    name = relevant_words[24]
)

trace25 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[25]],
    mode = 'lines',
    name = relevant_words[25]
)

trace26 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[26]],
    mode = 'lines',
    name = relevant_words[26]
)

trace27 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[27]],
    mode = 'lines',
    name = relevant_words[27]
)

trace28 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[28]],
    mode = 'lines',
    name = relevant_words[28]
)

trace29 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[29]],
    mode = 'lines',
    name = relevant_words[29]
)

trace30 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[30]],
    mode = 'lines',
    name = relevant_words[30]
)

trace31 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[31]],
    mode = 'lines',
    name = relevant_words[31]
)

trace32 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[32]],
    mode = 'lines',
    name = relevant_words[32]
)

trace33 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[33]],
    mode = 'lines',
    name = relevant_words[33]
)

trace34 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[34]],
    mode = 'lines',
    name = relevant_words[34]
)

trace35 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[35]],
    mode = 'lines',
    name = relevant_words[35]
)

trace36 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[36]],
    mode = 'lines',
    name = relevant_words[36]
)

trace37 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[37]],
    mode = 'lines',
    name = relevant_words[37]
)

trace38 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[38]],
    mode = 'lines',
    name = relevant_words[38]
)

trace39 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[39]],
    mode = 'lines',
    name = relevant_words[39]
)

trace40 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[40]],
    mode = 'lines',
    name = relevant_words[40]
)

trace41 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[41]],
    mode = 'lines',
    name = relevant_words[41]
)

trace42 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[42]],
    mode = 'lines',
    name = relevant_words[42]
)

trace43 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[43]],
    mode = 'lines',
    name = relevant_words[43]
)

trace44 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[44]],
    mode = 'lines',
    name = relevant_words[44]
)

trace45 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[45]],
    mode = 'lines',
    name = relevant_words[45]
)

trace46 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[46]],
    mode = 'lines',
    name = relevant_words[46]
)

trace47 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[47]],
    mode = 'lines',
    name = relevant_words[47]
)

trace48 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[48]],
    mode = 'lines',
    name = relevant_words[48]
)

trace49 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[49]],
    mode = 'lines',
    name = relevant_words[49]
)

trace50 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[50]],
    mode = 'lines',
    name = relevant_words[50]
)

trace51 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[51]],
    mode = 'lines',
    name = relevant_words[51]
)

trace52 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[52]],
    mode = 'lines',
    name = relevant_words[52]
)

trace53 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[53]],
    mode = 'lines',
    name = relevant_words[53]
)

trace54 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[54]],
    mode = 'lines',
    name = relevant_words[54]
)

trace55 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[55]],
    mode = 'lines',
    name = relevant_words[55]
)

trace56 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[56]],
    mode = 'lines',
    name = relevant_words[56]
)

trace57 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[57]],
    mode = 'lines',
    name = relevant_words[57]
)

trace58 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[58]],
    mode = 'lines',
    name = relevant_words[58]
)

trace59 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[59]],
    mode = 'lines',
    name = relevant_words[59]
)

trace60 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[60]],
    mode = 'lines',
    name = relevant_words[60]
)

trace61 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[61]],
    mode = 'lines',
    name = relevant_words[61]
)

trace62 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[62]],
    mode = 'lines',
    name = relevant_words[62]
)

trace63 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[63]],
    mode = 'lines',
    name = relevant_words[63]
)

trace64 = go.Scatter(
    x = x_vals,
    y = [y[0] for y in y_vals[64]],
    mode = 'lines',
    name = relevant_words[64]
)

data = [trace2, trace3, trace7,
        trace21, trace22, trace23,
        trace30, trace36,
        trace41, trace42, trace44,
        trace53, trace56, trace58,
        trace61]

py.iplot(data, filename='line-mode')
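
The trace definitions above differ only in their index, so they could be generated in a loop. The following is a minimal sketch, not the notebook's original code: `build_traces` is a hypothetical helper, the toy `x_vals`, `y_vals`, and `relevant_words` stand in for the notebook's variables, and plain dicts with a `type` key are used in place of `go.Scatter` objects (Plotly accepts trace dicts) so the snippet runs without Plotly installed.

```python
def build_traces(x_vals, y_vals, names, indices):
    """Return one line-trace spec per index in `indices`."""
    traces = []
    for i in indices:
        traces.append({
            "type": "scatter",
            "x": x_vals,
            # each entry of y_vals[i] is a one-element sequence, as in the cells above
            "y": [y[0] for y in y_vals[i]],
            "mode": "lines",
            "name": names[i],
        })
    return traces


# Toy stand-ins for the notebook's variables (illustrative values only)
x_vals = [1982, 1983, 1984]
y_vals = [
    [[0.1], [0.2], [0.3]],
    [[0.5], [0.4], [0.3]],
    [[0.0], [0.1], [0.0]],
]
relevant_words = ["inflat", "growth", "rate"]

selected = [0, 2]  # indices of the words to plot
data = build_traces(x_vals, y_vals, relevant_words, selected)
```

In the notebook, `selected` would be the index list used above (`[2, 3, 7, 21, 22, 23, 30, 36, 41, 42, 44, 53, 56, 58, 61]`), and the resulting `data` could be passed to `py.iplot` as before.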

In [432]:
# relevant_words.index("continu")

In [433]:
# ind = [2, 3, 7, 10, 14,17,19,21, 22, 23, 24, 30, 36, 37, 38, 41, 42, 43, 44, 46, 53, 54, 56, 58,61]
# avg = []
# for i in ind:
#     avg.append((np.mean([y[0] for y in y_vals[i]]), i))

In [434]:
# sorted(avg)

## Most Frequent Words Overall - No Global Weighting

In [351]:
# To restrict the analysis to a date range, replace the definition of `table` in the next cell
# with one of the commented-out filters and adjust the years.

In [None]:
table = relevant_data_df["WordCount"]
# relevant_data_df[relevant_data_df["Year"] <= 1990]["WordCount"]
# relevant_data_df[np.logical_and(relevant_data_df["Year"] <= 2000, relevant_data_df["Year"] > 1990)]["WordCount"]

In [298]:
word_count_all = [x[0].toarray()[0] for x in list(table)]

In [299]:
total_words = sum([sum(x) for x in word_count_all])

In [300]:
# total count of each vocabulary term, summed across all documents
new_word_count_all = [sum(arr[i] for arr in word_count_all)
                      for i in range(m.shape[1])]

In [306]:
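# occurrences of each term per 1,000 words
word_freq = np.array(new_word_count_all) * 1000 / total_words
# note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
words = input_vectorizer_counts.get_feature_names()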

In [312]:
# pair each frequency with its term and sort from most to least frequent
words_and_word_freq = sorted(zip(word_freq, words), reverse=True)
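
The cells above sum each vocabulary column across documents, rescale the totals to occurrences per 1,000 words, and rank the terms. A compact sketch of the same steps in plain Python follows; the function name `per_thousand_ranking` and the toy counts are illustrative assumptions, not taken from the notebook.

```python
def per_thousand_ranking(count_rows, vocabulary):
    """Rank terms by occurrences per 1,000 words across all documents.

    count_rows: one list of per-term counts per document, aligned with vocabulary.
    """
    # column-wise totals: how often each term occurs over all documents
    column_totals = [sum(col) for col in zip(*count_rows)]
    total_words = sum(column_totals)
    freqs = [1000 * c / total_words for c in column_totals]
    # highest frequency first; equal frequencies fall back to reverse alphabetical order
    return sorted(zip(freqs, vocabulary), reverse=True)


# Toy data: two documents over a three-term vocabulary (illustrative values only)
counts = [
    [3, 1, 0],
    [2, 0, 4],
]
vocab = ["rate", "market", "price"]
ranking = per_thousand_ranking(counts, vocab)
```

With the notebook's NumPy arrays, `np.sum(word_count_all, axis=0)` would give the same column totals in a single call.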

In [318]:
words_and_word_freq[0:50]

[(14.449867372847605, 'mr'),
 (11.59437368450341, 'think'),
 (11.126823418758095, 'would'),
 (10.172082502708736, 'chairman'),
 (10.088610868334134, 'rate'),
 (6.8774079929818361, 'market'),
 (6.7819339013768998, 'that'),
 (6.6520891367941877, 'go'),
 (6.1774465099582203, 'year'),
 (5.9821337968464086, 'it'),
 (5.8648370557317735, 'percent'),
 (5.4725749307949227, 'on'),
 (5.3765552729522446, 'price'),
 (5.3596427195822276, 'inflat'),
 (5.1926994508330253, 'point'),
 (4.9968411714834708, 'growth'),
 (4.4807355105790752, 'don'),
 (4.2799671350898381, 'us'),
 (4.2259560775533318, 'we'),
 (4.0491926165247651, 'polici'),
 (4.0082751486940777, 'expect'),
 (3.9122554908513996, 'time'),
 (3.8527887709374684, 'sai'),
 (3.7698627028006095, 'forecast'),
 (3.7387654272492878, 'greenspan'),
 (3.7354920298228329, 'like'),
 (3.6831176709995539, 'well'),
 (3.6312888784140172, 'economi'),
 (3.5243578958164892, 'get'),
 (3.4867138254122572, 'see'),
 (3.4207003106454157, 'mai'),
 (3.4179724794567035, 't

In [None]:
# Repeat the analysis above using the weighted word counts

In [None]:
wtable = relevant_data_df["WeightedWordCount"]
# relevant_data_df[relevant_data_df["Year"] <= 1990]["WeightedWordCount"]
# relevant_data_df[np.logical_and(relevant_data_df["Year"] <= 2000, relevant_data_df["Year"] > 1990)]["WeightedWordCount"]

In [333]:
wword_count_all = [x[0].toarray()[0] for x in list(wtable)]

In [334]:
wtotal_words = sum([sum(x) for x in wword_count_all])

In [335]:
# total weighted count of each vocabulary term, summed across all documents
new_wword_count_all = [sum(arr[i] for arr in wword_count_all)
                       for i in range(m.shape[1])]

In [336]:
wword_freq = np.array(new_wword_count_all) * 1000 / wtotal_words
wwords = input_vectorizer.get_feature_names()

In [337]:
# pair each weighted frequency with its term and sort from most to least frequent
wwords_and_word_freq = sorted(zip(wword_freq, wwords), reverse=True)

In [338]:
wwords_and_word_freq[0:50]

[(12.253269574676425, 'mr'),
 (9.8194201993837247, 'think'),
 (9.2829404027906701, 'would'),
 (8.6167581117407099, 'chairman'),
 (8.3669111669019713, 'rate'),
 (5.6804654745908199, 'market'),
 (5.6504770440546519, 'that'),
 (5.542395486094108, 'go'),
 (5.1600821619356747, 'year'),
 (5.0082402021673644, 'it'),
 (4.764257136774015, 'percent'),
 (4.597651345796014, 'on'),
 (4.5076507046425851, 'greenspan'),
 (4.3548128695870707, 'price'),
 (4.3199890259324212, 'point'),
 (4.2401801549993694, 'inflat'),
 (4.2022541062999768, 'growth'),
 (3.7799251696407881, 'don'),
 (3.5211156578610212, 'we'),
 (3.5120730147563255, 'volcker'),
 (3.4713360257119428, 'us'),
 (3.3525761971130565, 'polici'),
 (3.2831792833644116, 'time'),
 (3.2796427947393689, 'expect'),
 (3.1861662566032045, 'sai'),
 (3.1854699340908406, 'forecast'),
 (3.136959824773986, 'economi'),
 (3.0967254598755023, 'like'),
 (3.0883304125344493, 'well'),
 (2.9872928645056023, 'get'),
 (2.9583164598110296, 'see'),
 (2.8844888924580063, '
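
One way to compare the two rankings is to look at how far each term moves between them. The sketch below is an illustrative assumption rather than part of the original analysis: `rank_shifts` is a hypothetical helper, and the inputs are short excerpts (terms 11 onward, in the order they appear in the two outputs above), so positions are relative to the excerpts and not to the full lists.

```python
def rank_shifts(ranking_a, ranking_b):
    """Map each term found in both rankings to (position in a) - (position in b).

    A negative value means the term sits lower in ranking_b than in ranking_a.
    """
    pos_b = {term: i for i, term in enumerate(ranking_b)}
    return {term: i - pos_b[term]
            for i, term in enumerate(ranking_a)
            if term in pos_b}


# Excerpts copied from the unweighted and weighted outputs above
unweighted = ["percent", "on", "price", "inflat", "point", "growth"]
weighted = ["percent", "on", "greenspan", "price", "point", "inflat", "growth"]

shifts = rank_shifts(unweighted, weighted)
```

Within these excerpts, `price`, `inflat`, and `growth` get negative shifts because `greenspan` enters the weighted list ahead of them.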

In [None]:
# TODO: graphs for reduction, increase, and total words; t-statistics and standard deviations under increase and reduction

In [None]:
# For each basis point change top words under both

In [None]:
# box and whisker plots

In [None]:
# tables for stemmer mappings, word counts during preprocessing

In [None]:
# most frequent terms across dates

In [435]:
# Doc filtering example appendix

In [None]:
# # for each document in train_df, run CountVectorizer to remove stop words.
# from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.feature_extraction.text import TfidfTransformer
# import chardet
# import PyPDF2

# def input_documents(filenames):
#     text = ""
#     for filename in filenames:
#         with open(filename, 'rb') as input_file:
#             pdfReader = PyPDF2.PdfFileReader(input_file)
#             num_pages = pdfReader.numPages #later refactor to words within a doc
#             for i in range(0, num_pages):
#                 pageObj = pdfReader.getPage(i)
#                 text += " " + pageObj.extractText()
#     return [text]

# i = 0
# new_col_feature_names = []
# new_col_tokens = []
# print("Number Rows", relevant_data_df.shape[0])
# while i < relevant_data_df.shape[0]:
#     try:
#         document_paths = []
#         document_paths += np.array(relevant_data_df["Transcripts"])[i]
# #         tfidf_transformer = TfidfTransformer()
#         input_vectorizer = CountVectorizer(input="content", stop_words="english")
#         docs = input_documents(document_paths)
# #         tfidf_docs = tfidf_transformer.fit_transform(docs)
# #         print(tfidf_docs)
#         x = input_vectorizer.fit_transform(docs)
#         new_col_feature_names.append(list(zip(range(0,len(input_vectorizer.get_feature_names())),input_vectorizer.get_feature_names())))
#         new_col_tokens.append(x) 
#         print("Row " + str(i + 1) +  " Complete")
#         i += 1
#     except:
#         new_col_feature_names.append([None])
#         new_col_tokens.append(None)
#         print("Row " + str(i + 1) +  " Complete")
#         i += 1

# relevant_data_df["TokenCount"] = new_col_tokens
# relevant_data_df["FeatureNames"] = new_col_feature_names