# <b> What is Universal Sentence Encoder (USE)? 

### The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.

### The Universal Sentence Encoder (USE) is a sentence embedding model developed by Google and published on TensorFlow Hub. The white paper reports that transfer learning with its sentence embeddings tends to outperform word-level transfer. USE can be used to measure the semantic similarity between two sentences, which this notebook reports as a match percentage.

### Link to white paper: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46808.pdf

# <b> How can it be used on AI and Analytics Projects?

## <b> 1) Features for building Chatbots

### Companies often build chatbots that receive questions within a narrow scope (e.g. an FAQ bot). Given a predefined set of questions and answers, each incoming question can be matched against the predefined questions using USE, and the answer to the question with the highest match percentage can be returned as the result.

### <b> High Level steps to build the chatbot
### 1) Create a predefined set of questions (Q1) and answers (A1). For example: What are the pros and cons of having a 401k account?
### 2) Create semantically similar test questions (Q2) to stand in for the questions coming in to the chatbot. For example: What are the advantages and disadvantages of having a 401k account?
### 3) Use USE to match each test question (Q2) against the predefined questions (Q1)
### 4) Find the Q1 with the highest match percentage
### 5) Return the corresponding A1 to the chatbot user
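
### A minimal sketch of the steps above, under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for USE (it only counts shared words and does not capture meaning), and the FAQ entries and answers are placeholders. In the notebook itself, the embedding step is done by the USE-based `get_features` helper defined further down.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for USE: a bag-of-words count vector (NOT a semantic embedding)."""
    return Counter(text.lower().replace("?", "").split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical FAQ entries (Q1 -> A1); the answers are placeholders.
faq = {
    "What are the pros and cons of having a 401k account?": "Answer about 401k trade-offs.",
    "How do I refinance an auto loan?": "Answer about auto loan refinancing.",
}

def answer(incoming_question, faq, embed=toy_embed):
    """Return (best matching Q1, its A1, similarity score) for an incoming question."""
    query_vec = embed(incoming_question)
    best_q = max(faq, key=lambda q: cosine(query_vec, embed(q)))
    return best_q, faq[best_q], cosine(query_vec, embed(best_q))

best_q, best_a, score = answer(
    "What are the advantages and disadvantages of having a 401k account?", faq
)
print(best_q, "->", best_a, round(score, 2))
```

### With a real sentence encoder the `embed` argument would return dense vectors, and the word-overlap limitation of this toy version would not apply.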

## <b> 2) Similarity analysis on Unstructured Data

### Companies trying to find semantically similar sentences in a massive pool of unstructured text data can leverage USE. A threshold on the match percentage can be set, and all sentences whose match percentage exceeds this threshold can be returned as results.
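
### As a rough illustration of thresholding, the sketch below keeps every sentence pair whose cosine similarity is at or above a cutoff. The 3-dimensional vectors are made up to stand in for USE embeddings, and the 0.8 threshold is an arbitrary assumption that would need tuning on real data.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similar_pairs(sentences, vectors, threshold=0.8):
    """Return (score, sentence_a, sentence_b) for every pair scoring at or above threshold."""
    hits = []
    for i, j in combinations(range(len(sentences)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            hits.append((score, sentences[i], sentences[j]))
    return sorted(hits, reverse=True)  # highest score first

# Made-up vectors standing in for real sentence embeddings.
sentences = ["Buying a used car", "Advice on purchasing a second-hand car", "Filing taxes late"]
vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]

for score, a, b in similar_pairs(sentences, vectors, threshold=0.8):
    print(f"{score:.2f}  {a}  <->  {b}")
```

### Comparing all pairs grows quadratically with the number of sentences, so for a truly massive pool an approximate nearest-neighbour index would likely be needed instead.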

# <b> Lab Objective

### The objective of this lab is to find semantically similar questions in a pool of hot and popular questions from the Reddit personal finance forum. I use an API to retrieve these questions from Reddit and then match test questions against the Reddit questions using USE. The final output is the question with the highest match percentage. In this way, I am able to find semantically similar questions.

### Load and Import Modules

In [10]:
# Install and import the necessary packages
!pip install praw
!pip install nltk
!pip install tensorflow
!pip install tensorflow_hub
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import nltk
import re
from nltk.corpus import stopwords
from string import punctuation
nltk.download('stopwords')
nltk.download('wordnet')
from sklearn.metrics.pairwise import cosine_similarity
import praw




[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### URL to the Universal Sentence Encoder model

In [22]:
# Tensorflow hub module for Universal sentence Encoder
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3" 

In [12]:
# hub.Module is the TensorFlow 1.x Hub API; this notebook was run with TensorFlow 1.x
embed = hub.Module(module_url)


W0318 15:48:57.055262 140021791913792 deprecation.py:323] From /opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py:3632: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.


In [13]:
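def get_features(texts):
    """Embed a string or a list of strings with USE; returns one 512-dimensional vector per text."""
    if isinstance(texts, str):
        texts = [texts]
    # TensorFlow 1.x style: a new session is created and initialized on every call
    with tf.Session() as sess:
        sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
        return sess.run(embed(texts))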

In [14]:
# Helper functions. Stop-word removal sometimes improves the results. It is disabled here
# (empty stop_words) and can be enabled by uncommenting the line below.

stop_words = ''
#stop_words = stopwords.words('english') + list(punctuation)

def remove_stopwords(stop_words, tokens):
    """Return the tokens that are not in stop_words."""
    return [token for token in tokens if token not in stop_words]

def process_text(text):
    #text = text.encode('ascii', errors='ignore').decode()
    text = text.lower()
    # Strip URL fragments; the dots are escaped so that they match a literal '.' only
    text = re.sub(r'http:', ' ', text)
    text = re.sub(r'https:', ' ', text)
    text = re.sub(r'www\.', ' ', text)
    text = re.sub(r'\.com', ' ', text)
    text = re.sub(r'[^A-Za-z0-9]+', ' ', text)  # Eliminates special characters
    # NOTE: apostrophes are already removed by the line above, so the contraction
    # rules below never match (hence "didn t" in the processed output further down)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"won't", "will not ", text)
    text = re.sub(r"isn't", "is not ", text)
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r'\s+', ' ', text)  # Collapses whitespace: space, tab, newline, carriage return, vertical tab
    text = text.strip()
    return text


def process_all(text):
    text = process_text(text)
    return ' '.join(remove_stopwords(stop_words, text.split()))


### Test the preprocessing steps

In [138]:
#Test the process text Function
process_text("https://www.rrrr.com ### \n @@@")

'rrrr'
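
### One thing to watch in `process_text`: the special-character removal runs before the contraction rules, so the apostrophes are already gone by the time patterns such as `n't` are tried. A possible variant, sketched here with the URL handling omitted for brevity, expands contractions first. `expand_then_clean` is a hypothetical name and is not part of the notebook.

```python
import re

# Contraction rules copied from process_text, in the same order.
CONTRACTIONS = [
    (r"\'ve", " have "), (r"won't", "will not "), (r"isn't", "is not "),
    (r"can't", "can not "), (r"n't", " not "), (r"i'm", "i am "),
    (r"\'re", " are "), (r"\'d", " would "), (r"\'ll", " will "),
]

def expand_then_clean(text):
    """Variant of process_text: expand contractions BEFORE stripping special characters."""
    text = text.lower()
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    text = re.sub(r"[^a-z0-9]+", " ", text)   # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(expand_then_clean("I didn't owe them, and they won't release my transcripts."))
```

### Whether expanded contractions actually help the embeddings is an open question for this dataset; the sketch only shows the ordering.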

## The dataset in the cell below is a sample obtained from Reddit. You can substitute your own dataset of predefined questions.

In [29]:
data = [
    "What are ETFs and how do they work?",
    "Can't Afford to Sustain Grandmother in California, How do I go about refinancing an auto loan",
    "what are your goals?",
    "420k in debt. Want to buy a house.How do I choose between making traditional vs Roth contributions to 401K",
    'New to /r/personalfinance? Have questions? Read this first!',
     'Weekday Help and Victory Thread for the week of March 18, 2019',
     "My old college said that I didn't owe them any money before I left. Now that I'm applying to other places they've sent me to collections and won't release my transcripts.",
     'Hospital balance from Sexual Assault has gone into collections? Supposed to be paid by the state.',
     'Question about vehicle trade in for dealership',
     'Hospital sent me a bill for service I did not recieve. Both the hospital and collections refuses to send an itemized statement or investigate.',
     "Haven't filed my taxes since 2016. Been paralyzed as to what to do or the repercussions",
     '$28k windfall of cash as "gift"',
     'Lots of Savings but About to be Laid off',
     '[medical][insurance] Got billed as an ER visit at an urgent care. Spent 11 months trying to misbill us, we finally got the bill and its outrageous.',
     'About to close on a house and my loan officer asks for permission to make changes on my insurance policy?',
     "If your bank calls you, even if the phone number is legit, don't verify ANYTHING, call them back first",
     'What are the impacts of adding an authorized user to one of my cards?',
     'Freetaxusa.com is amazing.',
     '(California) Unmarried now and have $400k in equity in my house. If I sell after I get married, will that equity be split 50/50 with my future wife?',
     'Any tools/apps to help me with my uncontrollable spending habits? + extras in the USA',
     'Mistaken Check from large Furniture Credit Company',
     'Husband Lost Job While in Escrow',
     'If your IRA is predominantly ETFs, what do you do with the leftover $30 in cash?',
     'How to handle loaning a large amount of money to family?',
     'Kids money',
     'Alternatives to EveryDollar Budgeting App?',
     "Realistically, what's the value of pursuing a job offer v pursuing education?",
     'Federal Loans Payoff Advice',
     'Buying a Home while in Debt',
     'Turbo Tax fraud',
     'I am 28 and disabled. I have received 200K net from a personal injury settlement. Advice is needed.',
     'Pay off Debt or Invest in House',
     'What are ETFs and how do they work?',
     'How to Stop Parents From Hacking Bank Accounts',
     "Haven't done my taxes in at least 3 years? What should I do?",
    ]


## Please read this blog post for the steps to connect to the Reddit API. You need a client_id, client_secret, user agent, username and password. Alternatively, you can substitute your own data.
### http://www.storybench.org/how-to-scrape-reddit-with-python/

In [115]:
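## Setting up the credentials
## Read them from environment variables rather than hard-coding secrets in the notebook
## (the variable names below are placeholders chosen for this example)
import os
my_client_id = os.environ['REDDIT_CLIENT_ID']
my_client_secret = os.environ['REDDIT_CLIENT_SECRET']
my_user_agent = 'Testing'

reddit = praw.Reddit(client_id=my_client_id, client_secret=my_client_secret, user_agent=my_user_agent,
                     username=os.environ['REDDIT_USERNAME'], password=os.environ['REDDIT_PASSWORD'])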

In [116]:
## Displaying 10 hot posts from the Reddit Personal Finance forum
hot_posts = reddit.subreddit('personalfinance').hot(limit=10)
for post in hot_posts:
    print(post.title)

New to /r/personalfinance? Have questions? Read this first!
Weekday Help and Victory Thread for the week of March 18, 2019
Hospital balance from Sexual Assault has gone into collections? Supposed to be paid by the state.
My old college said that I didn't owe them any money before I left. Now that I'm applying to other places they've sent me to collections and won't release my transcripts.
Question about vehicle trade in for dealership
Haven't filed my taxes since 2016. Been paralyzed as to what to do or the repercussions
Hospital sent me a bill for service I did not recieve. Both the hospital and collections refuses to send an itemized statement or investigate.
I’m in a terrible financial position right now, I need advice desperately.
[medical][insurance] Got billed as an ER visit at an urgent care. Spent 11 months trying to misbill us, we finally got the bill and its outrageous.
$28k windfall of cash as "gift"


In [35]:
import pandas as pd
posts = []
personalfinance_subreddit = reddit.subreddit('personalfinance')
for post in personalfinance_subreddit.hot(limit=1000):
    posts.append([post.title])

In [27]:
flattened_posts = [x for p in posts for x in p]

In [36]:
flattened_posts[0:10]

['New to /r/personalfinance? Have questions? Read this first!',
 'Weekday Help and Victory Thread for the week of March 18, 2019',
 "My old college said that I didn't owe them any money before I left. Now that I'm applying to other places they've sent me to collections and won't release my transcripts.",
 'Hospital balance from Sexual Assault has gone into collections? Supposed to be paid by the state.',
 'Question about vehicle trade in for dealership',
 'Hospital sent me a bill for service I did not recieve. Both the hospital and collections refuses to send an itemized statement or investigate.',
 "Haven't filed my taxes since 2016. Been paralyzed as to what to do or the repercussions",
 '$28k windfall of cash as "gift"',
 'Lots of Savings but About to be Laid off',
 '[medical][insurance] Got billed as an ER visit at an urgent care. Spent 11 months trying to misbill us, we finally got the bill and its outrageous.']

In [40]:
data_processed = list(map(process_all, flattened_posts))

## Text Preprocessing comparison with original text

In [44]:
posts_original_modified=list(zip(flattened_posts,data_processed))

In [125]:
posts_original_modified[1:10]

[('Weekday Help and Victory Thread for the week of March 18, 2019',
  'weekday help and victory thread for the week of march 18 2019',
  [0.03783087059855461]),
 ("My old college said that I didn't owe them any money before I left. Now that I'm applying to other places they've sent me to collections and won't release my transcripts.",
  'my old college said that i didn t owe them any money before i left now that i m applying to other places they ve sent me to collections and won t release my transcripts',
  [0.043355993926525116]),
 ('Hospital balance from Sexual Assault has gone into collections? Supposed to be paid by the state.',
  'hospital balance from sexual assault has gone into collections supposed to be paid by the state',
  [0.011397574096918106]),
 ('Question about vehicle trade in for dealership',
  'question about vehicle trade in for dealership',
  [0.636070728302002]),
 ("Haven't filed my taxes since 2016. Been paralyzed as to what to do or the repercussions",
  'hospita

In [48]:
flattened_posts = [x for p in posts for x in p]

In [49]:
vectorized_form = get_features(flattened_posts)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


I0318 18:56:35.053654 140021791913792 saver.py:1483] Saver not created because there are no variables in the graph to restore


In [50]:
posts_original_modified=list(zip(flattened_posts,data_processed,vectorized_form))

In [126]:
posts_original_modified[0:5]

[('New to /r/personalfinance? Have questions? Read this first!',
  'new to r personalfinance have questions read this first',
  [0.3784327507019043]),
 ('Weekday Help and Victory Thread for the week of March 18, 2019',
  'weekday help and victory thread for the week of march 18 2019',
  [0.03783087059855461]),
 ("My old college said that I didn't owe them any money before I left. Now that I'm applying to other places they've sent me to collections and won't release my transcripts.",
  'my old college said that i didn t owe them any money before i left now that i m applying to other places they ve sent me to collections and won t release my transcripts',
  [0.043355993926525116]),
 ('Hospital balance from Sexual Assault has gone into collections? Supposed to be paid by the state.',
  'hospital balance from sexual assault has gone into collections supposed to be paid by the state',
  [0.011397574096918106]),
 ('Question about vehicle trade in for dealership',
  'question about vehicle tr

In [53]:
vectorized_form.shape

(972, 512)

In [73]:
def test_similarity(text1, text2):
    query_vector = get_features(text1)[0]
    vectors = get_features(text2)[0]
    scores = cosine_similarity(query_vector.reshape(1,-1), vectors.reshape(1,-1))
    return scores

In [74]:
test_similarity('dog', 'boy')

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


I0318 19:13:55.929396 140021791913792 saver.py:1483] Saver not created because there are no variables in the graph to restore


INFO:tensorflow:Saver not created because there are no variables in the graph to restore


I0318 19:14:07.661585 140021791913792 saver.py:1483] Saver not created because there are no variables in the graph to restore


array([[ 0.55801135]], dtype=float32)
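
### For reference, the score returned by `cosine_similarity` is the cosine of the angle between the two 512-dimensional embedding vectors $u$ and $v$:

$$
\mathrm{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_{i=1}^{512} u_i v_i}{\sqrt{\sum_{i=1}^{512} u_i^2}\,\sqrt{\sum_{i=1}^{512} v_i^2}}
$$

### The value lies between -1 and 1. In the run above, 'dog' and 'boy' score about 0.56 even though they are different concepts, which suggests the scores are more useful for ranking candidates than as literal percentages.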

In [131]:
def semantic_search(query, data, vectors):
    """Rank the texts in data by cosine similarity between the query and their precomputed vectors."""
    query = process_text(query)
    print("Extracting features...")
    query_vec = get_features(query)[0].reshape(1,-1)
    res = []
    for i, d in enumerate(data):
        qvec = vectors[i].reshape(1,-1)
        sim = cosine_similarity(query_vec, qvec)[0][0]  # scalar score instead of a 1x1 array
        res.append((sim, d[:500], i))
    # Highest similarity first
    return sorted(res, key=lambda x : x[0], reverse=True)

## <b> Testing the semantic search function

## Test Question 1 = "I want to buy a used car. please suggest some ideas"
### Top 3 matches from the Reddit dataset:
### 1) Used car purchase - advice
### 2) Haven't bought a car in about 10 years need some help/any advice
### 3) Need help deciding on what to do with my car!
### Demonstration below

In [132]:
list_of_similar_questions = semantic_search("I want to buy a used car. please suggest some ideas", flattened_posts, vectorized_form)

Extracting features...
INFO:tensorflow:Saver not created because there are no variables in the graph to restore


I0318 22:14:22.029773 140021791913792 saver.py:1483] Saver not created because there are no variables in the graph to restore


In [133]:
for i in list_of_similar_questions[0:10]:
    print(i[1])

Used car purchase - advice
Haven't bought a car in about 10 years need some help/any advice
Need help deciding on what to do with my car!
Car advice, repair or junk car/how much to spend on a new vehicle?
20 year looking to buy a car with no credit but 15k savings
About to buy a new (to me) car - financing questions
First time buying a new car. How do I get the best interest rate?
Car is starting to take a dump on me. How would I determine whether a new car is worth it at this point?
Should I buy a car?
I keep running into expensive repairs with my vehicle and I want to take a loan to buy a new one.


## Test Question 2 = "I want to start investing in stocks. Where do I start?"
### Top 3 matches from the Reddit dataset:
### 1) Guys i want to start investing in stocks at 15 years old, what do I do?
### 2) Getting into investing, any advice?
### 3) I’m 18, no experience in the stock market, and would like to experiment with low-dollar short-term investments, preferably as an app on my phone. What would you guys recommend?
### Demonstration below

In [135]:
list_of_similar_questions = semantic_search("I want to start investing in stocks. Where do I start? ", flattened_posts, vectorized_form)

Extracting features...
INFO:tensorflow:Saver not created because there are no variables in the graph to restore


I0318 22:15:07.246835 140021791913792 saver.py:1483] Saver not created because there are no variables in the graph to restore


In [136]:
for i in list_of_similar_questions[0:10]:
    print(i[1])

Guys i want to start investing in stocks at 15 years old, what do I do?
Getting into investing, any advice?
I’m 18, no experience in the stock market, and would like to experiment with low-dollar short-term investments, preferably as an app on my phone. What would you guys recommend?
What's the best way to invest ~$30k?
Looking for tips to save and invest!
Smartest way to diversify after company stock options sale? (novice investor, risk-tolerant)
$90k cash sitting around in 401k, Roth accounts. Oops. How would you invest for 15-20 years?
22 Years old, landed a job that gives me roughly 60k a year, how do I invest?
Advice for my first personal investment into index products
Looking for dividend options


## <B> Next Steps

### 1) To improve performance, extract the corpus of questions from Reddit, vectorize it once with USE, and store the vectors as a pickled file. Run the similarity matching function against the pickled vectors.
### 2) The answers associated with each question can be obtained from Reddit as well. The answer with the most votes can be treated as the right answer.
### 3) After finding the questions with the highest similarity percentage, the corresponding top-voted answers can be returned to the user.
### 4) Trying to leverage the cloud? Check out this article, which explains how to implement semantic similarity using the Universal Sentence Encoder and GCP components.
###   https://cloud.google.com/solutions/machine-learning/analyzing-text-semantic-similarity-using-tensorflow-and-cloud-dataflow
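
### The caching idea in step 1 could look roughly like the sketch below. `load_or_compute` and `dummy_embed` are hypothetical names used for illustration; in the notebook, `embed_fn` would be `get_features` and the texts would be `flattened_posts`. Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, texts, embed_fn):
    """Load cached vectors from cache_path, or compute them with embed_fn and cache them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            cached_texts, vectors = pickle.load(f)
        if cached_texts == texts:  # only reuse the cache for the same corpus
            return vectors
    vectors = embed_fn(texts)
    with open(cache_path, "wb") as f:
        pickle.dump((texts, vectors), f)
    return vectors

# Demo with a dummy embedder that records its calls, standing in for the slow USE call.
calls = []
def dummy_embed(texts):
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]

cache_path = os.path.join(tempfile.mkdtemp(), "use_vectors.pkl")
corpus = ["Kids money", "Turbo Tax fraud"]
first = load_or_compute(cache_path, corpus, dummy_embed)
second = load_or_compute(cache_path, corpus, dummy_embed)  # served from the pickle file
print(first == second, len(calls))
```

### Storing the texts next to the vectors lets the helper detect a changed corpus and recompute instead of silently returning stale vectors.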