### Making TFIDF of speeches and CRS reports

- Transform the Congressional speeches into a form that machine learning can be applied to.
- The CRS reports are included in the same TF-IDF matrix so that the average similarity between each speech and the CRS reports can be calculated. This is one of the measures of a speech being 'evidence-based': I expect CRS reports to use reasonably neutral language, consider multiple perspectives, and cite facts and statistics.
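
As a rough illustration of why the speeches and the CRS reports share one matrix, here is a minimal pure-Python sketch. The toy documents, the plain `log(N / df)` weighting and the helper names `tfidf_row` and `cosine` are all assumptions made for illustration; the notebook's own `make_tfidf` (from `text_processing.py`) may weight and normalise differently.

```python
import math
from collections import Counter

# Toy, already-lemmatised documents (invented for illustration).
speeches = ["we must cut tax now", "the report cite cost estimate"]
reports = ["the estimate cite cost and revenue data"]

# Stack speeches first and reports last, so both share one vocabulary
# and the reports can be sliced off the end by position.
corpus = [doc.split() for doc in speeches + reports]
vocab = sorted({word for doc in corpus for word in doc})
n_docs = len(corpus)

# Document frequency: number of documents containing each word.
doc_freq = Counter(word for doc in corpus for word in set(doc))

def tfidf_row(doc):
    """Return one L2-normalised TF-IDF vector over the shared vocabulary."""
    counts = Counter(doc)
    row = [counts[w] * math.log(n_docs / doc_freq[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in row)) or 1.0
    return [x / norm for x in row]

tfidf_rows = [tfidf_row(doc) for doc in corpus]

def cosine(u, v):
    # Rows are unit length, so the dot product is the cosine similarity.
    return sum(a * b for a, b in zip(u, v))

# Similarity of each speech to the single toy report.
report_row = tfidf_rows[len(speeches)]
sims = [cosine(row, report_row) for row in tfidf_rows[:len(speeches)]]
print(sims)
```

The first toy speech shares no words with the report, so its similarity is zero, while the second shares several and scores higher. Fitting the vocabulary on the combined corpus is what makes the two kinds of document directly comparable.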

In [32]:
import pickle

# Context managers close the files once loading has finished
with open('speeches_cleaned.pkl', 'rb') as pkl_file:
    speeches_cleaned = pickle.load(pkl_file)

with open('crs_reports.pkl', 'rb') as pkl_file:
    crs_reports = pickle.load(pkl_file)

In [33]:
speeches_cleaned.shape 

(409395, 13)

In [3]:
%run 'text_processing.py'

In [4]:
# Get the parts of speech of each word. This aids the lemmatisation. 
# Lemmatisation gets the root of the word so that derivations of the word are recognised as the same

# Speeches
list_sentences_pos_speeches = get_pos(list(speeches_cleaned['text']))
lemmatized_corpus_speeches = lemmatize(list_sentences_pos_speeches)
print(len(lemmatized_corpus_speeches))

output = open('lemmatized_corpus_speeches.pkl', 'wb')
pickle.dump(lemmatized_corpus_speeches, output)

output.close()


409395


In [5]:
# CRS reports
list_sentences_pos_reports = get_pos(crs_reports[:2500])
lemmatized_corpus_reports = lemmatize(list_sentences_pos_reports)
print(len(lemmatized_corpus_reports))

output = open('lemmatized_corpus_reports.pkl', 'wb')
pickle.dump(lemmatized_corpus_reports, output)

output.close()

2500


In [6]:
# Combine the corpora into one list: speeches first, then CRS reports
import copy
lemmatized_corpus_all = copy.deepcopy(lemmatized_corpus_speeches)
lemmatized_corpus_all.extend(lemmatized_corpus_reports)
tfidf_vecs_all, tfidf_df_all = make_tfidf(lemmatized_corpus_all)

In [7]:
tfidf_df_all.shape

(411895, 1955)

In [11]:
output = open('tfidf.pkl', 'wb')
pickle.dump(tfidf_df_all, output, protocol = 4)

output.close()

output = open('tfidf_vecs.pkl', 'wb')
pickle.dump(tfidf_vecs_all, output, protocol = 4)

output.close()

In [12]:
import pickle
pkl_file = open('tfidf.pkl', 'rb')
tfidf = pickle.load(pkl_file)

In [13]:
pkl_file = open('tfidf_vecs.pkl', 'rb')
tfidf_vec = pickle.load(pkl_file)

### LSI with gensim

The TF-IDF matrix is high-dimensional (1,955 terms), and similarity comparisons are likely to be more informative after dimensionality reduction. I reduce the number of dimensions to 300 using LSI (latent semantic indexing). Since I'm not interested in having interpretable topics per se, I chose LSI instead of NMF as it's faster.
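
To make the reduction step concrete, here is a small sketch of the idea behind LSI: a truncated SVD of the document-term matrix. It uses NumPy rather than gensim, and the toy matrix, `k = 2` and the `cosine` helper are assumptions for illustration only. gensim's `LsiModel` computes its decomposition differently internally, so this shows the concept and not the library's implementation.

```python
import numpy as np

# Toy document-term matrix: 4 documents x 5 terms (values invented).
X = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 0.8],
    [0.0, 0.1, 0.9, 1.0, 1.0],
])

k = 2  # number of latent dimensions kept (the notebook uses 300)

# Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the top-k components; each row is a document in the reduced space.
docs_reduced = U[:, :k] * s[:k]

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(docs_reduced.shape)  # (4, 2)
# Documents 0 and 1 use the same terms, so this should be high.
print(cosine(docs_reduced[0], docs_reduced[1]))
# Documents 0 and 2 use different terms, so this should be near zero.
print(cosine(docs_reduced[0], docs_reduced[2]))
```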

In [15]:
from gensim import corpora, models, similarities, matutils
tfidf_corpus = matutils.Sparse2Corpus(tfidf_vec.transpose())

id2word = corpora.Dictionary.from_corpus(tfidf_corpus)

In [16]:
# Pickle the corpus
output = open('tfidf_corpus.pkl', 'wb')
pickle.dump(tfidf_corpus, output)

output.close()

# Pickle the id2word
output = open('id2word.pkl', 'wb')
pickle.dump(id2word, output)

output.close()

In [17]:
pkl_file = open('tfidf_corpus.pkl', 'rb')
tfidf_corpus = pickle.load(pkl_file)

pkl_file = open('id2word.pkl', 'rb')
id2word = pickle.load(pkl_file)

In [18]:
from gensim.models import LsiModel

lsi = LsiModel(tfidf_corpus, id2word=id2word, num_topics=300)

output = open('lsi.pkl', 'wb')
pickle.dump(lsi, output)

output.close()

In [19]:
import pickle

pkl_file = open('lsi.pkl', 'rb')
lsi = pickle.load(pkl_file)


# Load the TF-IDF corpus (speeches followed by CRS reports).
# It is named tfidf_corpus because the next cell uses that name.
pkl_file = open('tfidf_corpus.pkl', 'rb')
tfidf_corpus = pickle.load(pkl_file)

In [20]:
lsi_corpus = lsi[tfidf_corpus]

# List of document vectors
#doc_vecs = [doc for doc in lsi_corpus_speeches]

In [21]:
# Pickle the lsi_corpus
output = open('lsi_corpus.pkl', 'wb')
pickle.dump(lsi_corpus, output)

output.close()

### Calculating similarity

I create a similarity index so that each speech can be scored against every other document. I am actually only interested in the mean similarity of each speech with all of the CRS reports as a whole. This mean similarity score for each speech becomes its score for 'scientificness' or 'evidence-basedness'.

The similarity with the list of science words is a simpler version of modelling 'evidence-basedness' in the same way.
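
The mean-similarity score described above can be sketched with plain NumPy. This is a hedged illustration and not the notebook's gensim-based code: the function name `mean_similarity_to_reference`, its arguments and the toy vectors are hypothetical, and it assumes the reference documents sit in the rows after the speeches.

```python
import numpy as np

def mean_similarity_to_reference(doc_vecs, n_speeches):
    """Mean cosine similarity of each speech to every reference document.

    doc_vecs: 2-D array with speeches in the first n_speeches rows and
    reference documents in the remaining rows.
    """
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    unit = doc_vecs / np.where(norms == 0, 1.0, norms)  # avoid dividing by zero
    speeches, reference = unit[:n_speeches], unit[n_speeches:]
    # (n_speeches x n_reference) cosine matrix, averaged over reference docs.
    return (speeches @ reference.T).mean(axis=1)

vecs = np.array([
    [1.0, 0.0],   # speech 0
    [0.0, 1.0],   # speech 1
    [1.0, 0.0],   # reference 0
    [1.0, 1.0],   # reference 1
])
scores = mean_similarity_to_reference(vecs, n_speeches=2)
print(scores)
```

Computing the whole speech-by-reference block in one matrix product avoids storing one similarity column per reference document, at the cost of holding that block in memory; for a large corpus it could be processed in batches of speeches.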

In [22]:
import pickle
# Load the model

pkl_file = open('lsi.pkl', 'rb')
lsi= pickle.load(pkl_file)

# Load the vectors
pkl_file = open('lsi_corpus.pkl', 'rb')
lsi_corpus= pickle.load(pkl_file)

# Load original dataframe if not loaded already
pkl_file = open('tfidf.pkl', 'rb')
tfidf = pickle.load(pkl_file)

In [23]:
from gensim import corpora, models, similarities, matutils
index = similarities.MatrixSimilarity(lsi_corpus, 
                                      num_features=300)


In [24]:
# Calculate average similarity to CRS documents
import pickle
pkl_file = open('speeches_cleaned.pkl', 'rb')
speeches_cleaned = pickle.load(pkl_file)

no_docs_speeches = speeches_cleaned.shape[0]
print(no_docs_speeches)

409395


In [25]:
tfidf.shape

(411895, 1955)

In [26]:
tfidf.iloc[no_docs_speeches:,:].shape

(2500, 1955)

In [27]:
crs_docs = tfidf.iloc[no_docs_speeches:,:] 

In [28]:
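print(crs_docs.shape)
# One column per CRS report: similarity of every document to that report.
# NOTE: idx - 1 assumes the tfidf index is 1-based; with a 0-based index
# it would select the last speech and skip the last CRS report.
for idx in crs_docs.index:
    tfidf['crs_sim_{0}'.format((idx-1))] = index[lsi_corpus[(idx-1)]]
print(tfidf.shape)
# Average the per-report similarities into a single score per document
crs_cols = [col for col in tfidf.columns if 'crs_sim_' in col]
tfidf['crs_sim_avg'] = tfidf[crs_cols].mean(axis=1)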

(2500, 1955)
(411895, 4455)


In [29]:
speeches_similarity = tfidf['crs_sim_avg']
speeches_similarity = speeches_similarity[:no_docs_speeches]

In [30]:
# Pickle the average CRS similarity score of each speech
output = open('speeches_similarity.pkl', 'wb')
pickle.dump(speeches_similarity, output)

output.close()