This notebook converts each speech into a TF-IDF vector, reduces the dimensionality, and calculates the similarity of each speech to the Royal Institution Christmas Lectures, which I identified as good examples of evidence-based speeches.

More information about the RI Christmas Lectures is available here: http://www.bbc.co.uk/programmes/b00pmbqq

They are described as a "Series of lectures on a single topic, presenting scientific subjects to a general audience in an informative and entertaining manner".

I load the data I pickled from 'project_fletcher_cleaning'.

In [1]:
import pickle
pkl_file = open('df_all_science_docs.pkl', 'rb')
df_all_science_docs = pickle.load(pkl_file)

### Stemming

I stem all of the words in each of the speeches.

In [2]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
stemmed =[' '.join([stemmer.stem(word) for word in str(text).split(' ')])
          for text in df_all_science_docs['speech']]

In [4]:
output = open('stemmed.pkl', 'wb')
pickle.dump(stemmed, output)

output.close()
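
The open / dump / close pattern above recurs throughout this notebook. As a minimal sketch, it could be wrapped in two small helpers that use context managers, so the file handle is closed even if pickling fails. The names `save_pickle` and `load_pickle` are hypothetical and are not used elsewhere in the notebook.

```python
import pickle


def save_pickle(obj, path):
    """Serialise obj to path; the file is closed even if dump raises."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load and return the object pickled at path."""
    with open(path, 'rb') as f:
        return pickle.load(f)
```

With these, a cell such as the one above would reduce to `save_pickle(stemmed, 'stemmed.pkl')`, and the matching load to `stemmed = load_pickle('stemmed.pkl')`.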

### Lemmatising

I then lemmatise the stemmed words. Doing both is unnecessary: since I do not need to generate text, I can probably use the stemmed words alone, and I will try this.

In [5]:
pkl_file = open('stemmed.pkl', 'rb')
stemmed = pickle.load(pkl_file)

In [6]:
'''
import nltk
nltk.download('wordnet')
'''

[nltk_data] Downloading package wordnet to /home/ubuntu/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [7]:
from nltk.stem import WordNetLemmatizer
lemmer=WordNetLemmatizer()
lemmatised=[' '.join([lemmer.lemmatize(word) for word in str(text).split(' ')])
          for text in stemmed]

In [8]:
output = open('lemmatised.pkl', 'wb')
pickle.dump(lemmatised, output)

output.close()

### TF-IDF

I convert each lemmatised speech into a TF-IDF vector. I tuned the n-gram range, the minimum and maximum document frequency, and the maximum number of features, and I would like to tune these further.

In [1]:
import pickle
pkl_file = open('lemmatised.pkl', 'rb')
lemmatised = pickle.load(pkl_file)

In [3]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
tfidf = TfidfVectorizer(lowercase=True,
                        stop_words="english",
                        ngram_range=(1, 2),
                        token_pattern="\\b[a-zA-Z][a-zA-Z]+\\b",  # words with >= 2 alpha chars
                        min_df=0.0075,
                        max_df=0.8,
                        max_features=5000)
tfidf_vecs = tfidf.fit_transform(lemmatised)
df_tfidf = pd.DataFrame(tfidf_vecs.todense(),
                        columns=tfidf.get_feature_names())
print(df_tfidf.shape)
df_tfidf.head()

(823989, 1569)


Unnamed: 0,abil,abl,abov,absolut,absolut right,abus,accept,access,accommod,accord,...,year,year ago,year old,years,yes,yesterday,young,young peopl,young people,youth
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.125927,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
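
For intuition about the weights in the table above, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. It deliberately leaves out stop words, bigrams and the `min_df` / `max_df` / `max_features` filtering, and the function name `tfidf_rows` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocab, rows) with one L2-normalised TF-IDF row per document."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # Scale each row to unit length; guard against empty documents
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Toy, already-stemmed documents (hypothetical)
toy_docs = ["evid data evid", "data polici", "polici debat polici"]
vocab, rows = tfidf_rows(toy_docs)
```

In the first toy document, `evid` appears twice and in no other document, so it receives a larger weight than `data`, which appears once and is shared with the second document.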


In [4]:
# Pickle model
from sklearn.externals import joblib
joblib.dump(tfidf,'tfidf.pkl')

# Pickle the vectors
output = open('tfidf_vecs.pkl', 'wb')
pickle.dump(tfidf_vecs, output)

output.close()

# Pickle dataframe

### LSI with gensim

The data is high-dimensional, and similarity comparisons are likely to be more fruitful after dimensionality reduction. I reduce the number of dimensions using LSI. Since I'm not interested in having interpretable topics per se, I chose LSI over NMF because it's faster.
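
Conceptually, LSI is a truncated singular value decomposition of the document-term matrix. The sketch below shows that idea on a tiny dense matrix with NumPy. It is an illustration only: gensim works on a sparse streamed corpus and uses its own decomposition routine, and the function name `lsi_project` and the toy matrix are assumptions made for this example.

```python
import numpy as np


def lsi_project(doc_term, num_topics):
    """Project documents (rows) onto the top num_topics singular directions."""
    # Thin SVD: doc_term = U @ diag(s) @ Vt, singular values in descending order
    U, s, Vt = np.linalg.svd(doc_term, full_matrices=False)
    # Document coordinates in topic space: U_k * s_k,
    # which equals doc_term @ Vt[:num_topics].T
    return U[:, :num_topics] * s[:num_topics]


# Four toy documents over four terms: two near-duplicate pairs
toy = np.array([[1.0, 1.0, 0.0, 0.0],
                [1.0, 0.9, 0.0, 0.1],
                [0.0, 0.0, 1.0, 1.0],
                [0.0, 0.1, 1.0, 0.9]])
reduced = lsi_project(toy, num_topics=2)
print(reduced.shape)
```

Each document is now described by two numbers instead of four, and documents that used similar terms stay close together in the reduced space.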

In [1]:
from sklearn.externals import joblib
import pickle
# Load the model
tfidf = joblib.load('tfidf.pkl')

# Load the vectors
pkl_file = open('tfidf_vecs.pkl', 'rb')
tfidf_vecs= pickle.load(pkl_file)

In [4]:
#!pip install gensim

In [3]:
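from gensim import corpora, models, similarities, matutils

# gensim expects documents as columns, so transpose the (docs x terms) matrix
tfidf_corpus = matutils.Sparse2Corpus(tfidf_vecs.transpose())

# Map column index -> term by inverting the fitted vectoriser's vocabulary
id2word = dict((v, k) for k, v in tfidf.vocabulary_.items())
id2word = corpora.Dictionary.from_corpus(tfidf_corpus,
                                         id2word=id2word)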

In [4]:
# Pickle the corpus
output = open('tfidf_corpus.pkl', 'wb')
pickle.dump(tfidf_corpus, output)

output.close()

# Pickle the id2word
output = open('id2word.pkl', 'wb')
pickle.dump(id2word, output)

output.close()

In [3]:
from sklearn.externals import joblib

lsi = models.LsiModel(tfidf_corpus, id2word=id2word, num_topics=300)

joblib.dump(lsi,'lsi.pkl')

['lsi.pkl']

In [1]:
from sklearn.externals import joblib
import pickle
lsi = joblib.load('lsi.pkl')
# Load the vectors
pkl_file = open('tfidf_corpus.pkl', 'rb')
tfidf_corpus= pickle.load(pkl_file)

In [2]:
lsi_corpus = lsi[tfidf_corpus]

# List of document vectors
#doc_vecs = [doc for doc in lsi_corpus]

In [None]:
doc_vecs[-1:]

In [3]:
# Pickle the lsi_corpus
output = open('lsi_corpus.pkl', 'wb')
pickle.dump(lsi_corpus, output)

output.close()

### Calculating similarity

I create a similarity index so that I get a similarity score for each speech with every other speech. I am actually only interested in the mean similarity of each speech with the Royal Institution Christmas Lectures as a whole. This mean similarity becomes the speech's score for 'scientificness' or 'evidence-basedness'.

The similarity with the list of science words is a simpler version of modelling 'evidence-basedness' in the same way.
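
The scoring idea can be sketched with NumPy on a few toy vectors: compute the cosine similarity of every document with each reference document, then average across the references. `MatrixSimilarity` also returns cosine similarities, but the function name `mean_similarity_to_references` and the toy data below are hypothetical.

```python
import numpy as np


def mean_similarity_to_references(doc_vecs, reference_rows):
    """Mean cosine similarity of every document to a set of reference documents."""
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                 # keep all-zero rows as zero vectors
    unit = doc_vecs / norms
    sims = unit @ unit[reference_rows].T    # shape: (n_docs, n_references)
    return sims.mean(axis=1)


toy_vecs = np.array([[1.0, 0.0],    # reference lecture
                     [2.0, 0.0],    # same direction as the reference
                     [0.0, 3.0],    # orthogonal to the reference
                     [1.0, 1.0]])   # in between
scores = mean_similarity_to_references(toy_vecs, reference_rows=[0])
print(scores)
```

Cosine similarity ignores vector length, so the second toy document scores the same as the reference itself. A reference document always has similarity 1 with itself, which raises its own average slightly.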

In [4]:
# Load the model
lsi = joblib.load('lsi.pkl')

# Load the vectors
pkl_file = open('lsi_corpus.pkl', 'rb')
lsi_corpus= pickle.load(pkl_file)

# Load original dataframe if not loaded already
pkl_file = open('df_all_science_docs.pkl', 'rb')
df_all_science_docs = pickle.load(pkl_file)

In [6]:
from gensim import corpora, models, similarities, matutils
index = similarities.MatrixSimilarity(lsi_corpus, 
                                      num_features=300)

sci_docs = df_all_science_docs.index[df_all_science_docs['MP'] == 'Dr Science'].tolist()

for doc in sci_docs:
    df_all_science_docs['scientificness_{0}'.format(doc)] = index[lsi_corpus[doc]]
    
scientific_cols = [col for col in df_all_science_docs.columns if 'scientificness' in col]
df_all_science_docs['scientificness_avg'] = df_all_science_docs[scientific_cols].mean(axis=1)

In [8]:
# Similarity to the last 'Dr Science' document, assumed here to be the list of science words
df_all_science_docs['science_words'] = index[lsi_corpus[sci_docs[-1]]]

I calculate the number of days since the speech as I'm interested in seeing whether speeches have become more evidence-based over time.

In [11]:
import pandas as pd
date_now = pd.to_datetime('2018-03-08')
df_all_science_docs['days_ago'] = [(date_now - pd.to_datetime(date)).days for date in df_all_science_docs['date_1']]

In [13]:
df_all_science_docs[['days_ago','date_1']].head()

Unnamed: 0,days_ago,date_1
0,6315.0,2000-11-22 00:00:00
1,6315.0,2000-11-22 00:00:00
2,6315.0,2000-11-22 00:00:00
3,6315.0,2000-11-22 00:00:00
4,6315.0,2000-11-22 00:00:00
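
As a minimal sketch, the same calculation can be done without a Python-level loop by subtracting the whole column at once. The toy frame below is made up for illustration; only the column names `date_1` and `days_ago` come from the notebook.

```python
import pandas as pd

# Toy frame with one missing date (hypothetical data)
df = pd.DataFrame({"date_1": ["2000-11-22", "2018-03-07", None]})

date_now = pd.Timestamp("2018-03-08")
# Vectorised subtraction yields a timedelta column; .dt.days extracts whole days
df["days_ago"] = (date_now - pd.to_datetime(df["date_1"])).dt.days
print(df)
```

A missing date becomes `NaT` and then `NaN`, which forces the column to a float dtype. That is one possible reason `days_ago` is shown as `6315.0` rather than `6315` in the output above.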


In [14]:
# Pickle sci_docs
output = open('sci_docs.pkl', 'wb')
pickle.dump(sci_docs, output)

output.close()

# Pickle the dataframe
df_all_science_docs.to_pickle('df_similarity.pkl')