# DISCLAIMER: Please read the file before you run it because the tokenizer takes about 6 hours to run

# Also FYI, the Search Query Input is at the end of the Notebook (before Conclusion)

### Note:

#### You can find the tokenized data at: CSVData/Tokenized_Allformatdata.csv
#### Vocabulary at: CSVData/Vocab_token.csv
#### TF-IDF of the blog at: Pickle/blog_vector_tfidf.pickle

# Now let's dive into the code itself

In [50]:
import glob
import pandas as pd
import numpy as np
import xml.etree.ElementTree as et
import os
from datetime import datetime
import re
from tqdm import tqdm
from scipy import spatial
import spacy
import string
from sklearn.metrics.pairwise import cosine_similarity


In [2]:
#Reading our processed data from all csv files
df = pd.read_csv('CSVData/Allformatdata.csv')

In [3]:
df.shape

(678194, 7)

In [4]:
df.head()

Unnamed: 0,Id,Date,Post,Gender,Age,Industry,Astrosign
0,1000331,"31,May,2004","\n\n\t \n Well, everyone got up and going...",female,37,indUnk,Leo
1,1000331,"29,May,2004",\n\n\t \n My four-year old never stops ta...,female,37,indUnk,Leo
2,1000331,"28,May,2004","\n\n\t \n Actually it's not raining yet, ...",female,37,indUnk,Leo
3,1000331,"28,May,2004",\n\n\t \n Ha! Just set up my RSS feed - t...,female,37,indUnk,Leo
4,1000331,"28,May,2004","\n\n\t \n Oh, which just reminded me, we ...",female,37,indUnk,Leo


In [5]:
#Checking if all the files have been included in our df
df['Id'].nunique()

19320

In [6]:
#Trying to fetch a file and randomly checking out parameters
print(df[df['Id']==813360]['Post'])

645275    \n\n\t \n      Just to start, a little about m...
645276    \n\n\t \n      Today especially my butt seems ...
645277    \n\n\t \n      Last night Disney held its annu...
645278    \n\n\t \n      It has been a while since I las...
645279    \n\n\t \n      Yesterday Hannah posted a quote...
                                ...                        
645348    \n\n\t \n      As most of you know, my biggest...
645349    \n\n\t \n      Now for the weekend update:   O...
645350    \n\n\t \n      Friday night was uneventful--wh...
645351    \n\n\t \n      It's been a good Sunday; Enligh...
645352    \n\n\t \n      What a week�.what a weekend! Ba...
Name: Post, Length: 78, dtype: object


In [7]:
print(df[df['Id']==813360]['Post'][645276])



	 
      Today especially my butt seems to have formed a perfect mold of my chair. I worked through lunch due to a high level mouse requesting ridiculous data. Why the bottom must suffer for the top I'll  never get, shouldn't it be the other way around. I mean, if I made 6 figures I would have no problem parking my 745 BMW myself--actually I'd prefer to park it myself. But here at Disney, the big shots not only get a car allowance every month to help pay for their beauties but they get their cars valet'd everyday. Yes that's right. The rest of us, not in the preferred mouse club, must circle around the parking structure finding the widest compact parking space so our "un-luxury" cars don't get more dings.  Really, the more you get paid the fewer the benefits you should receive. It is those of us that don't get paid 6 figures who should get the benefits to make up for the limited funds. (Those of you who don't make bank and are far from it will completely agree I'm sure. And when I ma

## ---------------------------------------------------------------------------------------------------------

## Tokenizer (tokenize, lemmatize and remove stop words)

In [8]:
from spacy.lang.en.stop_words import STOP_WORDS

spacy_nlp = spacy.load('en_core_web_lg')

#create list of punctuation marks and stopwords
punctuations = string.punctuation
stop_words = STOP_WORDS

#function for data cleaning and processing
#This can be further enhanced by adding / removing reg-exps as desired.

def spacy_tokenizer(sentence):
 
    #remove distracting single quotes
    sentence = re.sub('\'','',sentence)

    #remove digits and words containing digits (raw string avoids invalid-escape warnings)
    sentence = re.sub(r'\w*\d\w*','',sentence)

    #replace extra spaces with single space
    sentence = re.sub(' +',' ',sentence)

    #remove unwanted lines starting with special characters
    sentence = re.sub(r'\n: \'\'.*','',sentence)
    sentence = re.sub(r'\n!.*','',sentence)
    sentence = re.sub(r'^:\'\'.*','',sentence)
    
    #replace newline characters with spaces
    sentence = re.sub(r'\n',' ',sentence)
    
    #replace punctuation with spaces
    sentence = re.sub(r'[^\w\s]',' ',sentence)
    
    #creating token object
    tokens = spacy_nlp(sentence)
    
    #lowercase, strip and lemmatize (pronouns keep their lowercased surface form)
    tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokens]
    
    #remove stopwords and punctuation, and exclude tokens of 2 characters or fewer
    tokens = [word for word in tokens if word not in stop_words and word not in punctuations and len(word) > 2]
    
    #return tokens
    return tokens
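
To see what the regex clean-up does before spaCy ever sees the text, here is a minimal sketch that runs only the main substitutions on one sentence from the sample post. The helper name `pre_clean` is made up for this illustration, and the three "unwanted lines" patterns are left out to keep it short.

```python
import re

def pre_clean(sentence):
    """Apply the main regex clean-up steps; no spaCy model is needed here."""
    sentence = re.sub(r"'", "", sentence)          # drop single quotes
    sentence = re.sub(r"\w*\d\w*", "", sentence)   # drop digits and words containing digits
    sentence = re.sub(r" +", " ", sentence)        # collapse runs of spaces
    sentence = re.sub(r"\n", " ", sentence)        # newlines become spaces
    sentence = re.sub(r"[^\w\s]", " ", sentence)   # punctuation becomes spaces
    return sentence

sample = "I'd park my 745 BMW myself!"
cleaned = pre_clean(sample)
print(repr(cleaned))  # 'Id park my BMW myself '
```

Note that removing the quote first turns "I'd" into "Id", so contractions reach the lemmatizer as single unknown words.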

In [9]:
df

Unnamed: 0,Id,Date,Post,Gender,Age,Industry,Astrosign
0,1000331,"31,May,2004","\n\n\t \n Well, everyone got up and going...",female,37,indUnk,Leo
1,1000331,"29,May,2004",\n\n\t \n My four-year old never stops ta...,female,37,indUnk,Leo
2,1000331,"28,May,2004","\n\n\t \n Actually it's not raining yet, ...",female,37,indUnk,Leo
3,1000331,"28,May,2004",\n\n\t \n Ha! Just set up my RSS feed - t...,female,37,indUnk,Leo
4,1000331,"28,May,2004","\n\n\t \n Oh, which just reminded me, we ...",female,37,indUnk,Leo
...,...,...,...,...,...,...,...
678189,963380,"04,July,2004",\n\n\t \n I was pretty close to right on ...,male,24,Student,Cancer
678190,963380,"04,July,2004",\n\n\t \n On the way home tonight I happe...,male,24,Student,Cancer
678191,963380,"03,July,2004",\n\n\t \n Some things I didn't touch on. ...,male,24,Student,Cancer
678192,963380,"03,July,2004",\n\n\t \n It's been an interesting week. ...,male,24,Student,Cancer


In [10]:
#Using tqdm to show a progress bar while pandas processes the dataframe
tqdm.pandas()



In [11]:
#Tokenize our ENTIRE Dataset
df['Post_tokenized'] = df['Post'].progress_map(lambda x: spacy_tokenizer(x))

100%|██████████| 678194/678194 [6:05:44<00:00, 30.90it/s]    


### Uff, that took 6 hours!!

## ---------------------------------------------------------------------------------------------------------

## Section 1: TF-IDF Vectorizer

In [179]:
def identity_tokenizer(text):
    return text

In [180]:
from sklearn.feature_extraction.text import TfidfVectorizer
import operator

## Create Vocabulary
vocabulary = set()

for doc in df.Post_tokenized:
        vocabulary.update(doc)
        
vocabulary = list(vocabulary)

# Initializing the TF-IDF model
tfidf = TfidfVectorizer(tokenizer=identity_tokenizer,vocabulary=vocabulary,lowercase=False)

# Fit the TfIdf model
blog_vector = tfidf.fit_transform(df.Post_tokenized)


In [181]:
blog_vector

<678194x644452 sparse matrix of type '<class 'numpy.float64'>'
	with 40958261 stored elements in Compressed Sparse Row format>

In [182]:
len(vocabulary)

644452

In [183]:
blog_vector.shape

(678194, 644452)
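
To make the weighting less of a black box, here is a small pure-Python sketch of what I understand `TfidfVectorizer` to compute with its default settings (`smooth_idf=True`, `norm='l2'`): raw term counts multiplied by a smoothed IDF, then scaled to unit length. The three toy documents and the helper names are invented for the illustration; the real model of course works on the sparse matrix above.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized posts (hypothetical data).
docs = [["rain", "rain", "love"], ["sunshine", "return"], ["rain"]]

n_docs = len(docs)
# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(tokens):
    """Return {term: weight} using raw counts, smoothed IDF and L2 normalisation."""
    counts = Counter(t for t in tokens if t in doc_freq)  # out-of-vocabulary terms are ignored
    weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

query = tfidf_vector(["rain"])
scores = [cosine(query, tfidf_vector(doc)) for doc in docs]
print([round(s, 3) for s in scores])
```

A post whose only token is the query term gets a similarity of 1.0, no matter how short it is, which is why very short posts tend to rank at the top.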

## ---------------------------------------------------------------------------------------------------------

## Section 2: Getting similar words for our query from the vocabulary of spaCy's large English model

In [196]:
nlp = spacy.load('en_core_web_lg')

In [197]:
def createKeywordsVectors(keyword, nlp):
    doc = nlp(keyword)  # convert to document object

    return doc.vector



In [198]:
# method to find cosine similarity using Spacy and Spatial distance
def cosineSimilarity(vect1, vect2):
    # return cosine similarity (1 - cosine distance)
    return 1 - spatial.distance.cosine(vect1, vect2)


In [199]:
# method to find similar words
def getSimilarWords(keyword, nlp):
    similarity_list = []

    keyword_vector = createKeywordsVectors(keyword, nlp)

    for tokens in nlp.vocab:
        if (tokens.has_vector):
            if (tokens.is_lower):
                if (tokens.is_alpha):
                    similarity_list.append((tokens, cosineSimilarity(keyword_vector, tokens.vector)))

    similarity_list = sorted(similarity_list, key=lambda item: -item[1])
    similarity_list = similarity_list[:30]

    top_similar_words = [item[0].text for item in similarity_list]

    top_similar_words = top_similar_words[:10]
    top_similar_words.append(keyword)

    for token in nlp(keyword):
        top_similar_words.insert(0, token.lemma_)

    #add a crude singular form of every word ending in "s"
    #(loop over a copy so the list is not modified while it is being iterated)
    for words in list(top_similar_words):
        if words.endswith("s"):
            top_similar_words.append(words[:-1])

    #remove duplicates
    top_similar_words = list(set(top_similar_words))

    stop_words = spacy.lang.en.stop_words.STOP_WORDS

    #remove stopwords and punctuation, and exclude words of 2 characters or fewer
    top_similar_words = [word for word in top_similar_words if word not in stop_words and word not in punctuations and len(word) > 2]


    return ", ".join(top_similar_words)
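
Only the top few neighbours are kept, so sorting every vocabulary entry is more work than needed. Below is a hedged sketch of the same idea with `heapq.nlargest`, using made-up 3-dimensional vectors in place of the real spaCy vectors; `toy_vectors`, `cosine_similarity_py` and `top_similar` are names invented for this example, and unlike the function above it skips the keyword itself.

```python
import heapq
import math

def cosine_similarity_py(a, b):
    """Plain-Python cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical word vectors; the real ones come from the spaCy model.
toy_vectors = {
    "rain": [0.9, 0.1, 0.0],
    "drizzle": [0.8, 0.2, 0.1],
    "storm": [0.7, 0.0, 0.3],
    "keyboard": [0.0, 0.1, 0.9],
}

def top_similar(keyword, vectors, k):
    """Return the k words closest to `keyword` without sorting the full list."""
    target = vectors[keyword]
    candidates = ((cosine_similarity_py(target, vec), word)
                  for word, vec in vectors.items() if word != keyword)
    return [word for _, word in heapq.nlargest(k, candidates)]

nearest = top_similar("rain", toy_vectors, 2)
print(nearest)  # ['drizzle', 'storm']
```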

In [200]:
#Taking in our tokenized query entry and finding similar matches
def Similarkey(query_token):
    
    #Finding all similar words
    similarkey = []
    for i in query_token:
        #Getting similar words
        result = getSimilarWords(i,nlp)
        result = result.split(', ')
        for el in result:
            similarkey.append(el)
            
    #Joining our results with our query_token        
    joinedkey = similarkey + query_token
    
    #Removing duplicates in our joinedkey
    j_key = list(set(joinedkey))
    
    #Returning similar words along with our query tokens as a list
    return(j_key)
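
A set removes duplicates but also scrambles the order of the terms. If the order matters (for example, to keep the original query tokens recognisable), one possible alternative is `dict.fromkeys`, sketched here with a hypothetical helper name and made-up tokens.

```python
def dedupe_keep_order(tokens):
    """Drop repeated tokens while keeping the first occurrence of each."""
    return list(dict.fromkeys(tokens))

similar = ["rain", "rains", "rainy", "rain"]
query_tokens = ["rain"]
unique_terms = dedupe_keep_order(similar + query_tokens)
print(unique_terms)  # ['rain', 'rains', 'rainy']
```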

## ---------------------------------------------------------------------------------------------------------

## Our SemanticSearch combining similarity of words and feeding it into a TF-IDF vectorizer

## Basically Section 2 -- > Section 1 = Our Semantic Search

In [201]:
def SemanticSearch(word_input,x):
    #Creating a dataframe to store the tokens of the word_input as a list
    query_df = pd.DataFrame(columns=['token'])    

    query_token = spacy_tokenizer(word_input)
    #Expanded query terms from Section 2. Note: they are computed here, but the
    #vector below is built from query_token only; store similar_token in
    #query_df instead to search with the expanded terms.
    similar_token = Similarkey(query_token)
    query_df.loc[0,'token'] = query_token

    #Transforming our input query into a TF-IDF vector
    input_vector = tfidf.transform(query_df['token'])   

    #Getting cosine similarities
    cosines = cosine_similarity(input_vector, blog_vector).flatten()

    #Indices of the x highest similarities, in descending order
    out = cosines.argsort()[-x:][::-1]

    #Seeing the results of our query in our original dataframe
    #(copy so that adding a column neither touches df nor raises SettingWithCopyWarning)
    result_df = df.loc[out].copy()
    result_df['Relevance'] = cosines[out]
    return(result_df)
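
The `argsort()[-x:][::-1]` idiom picks the positions of the `x` largest scores, best first. Here is the same selection written out in plain Python as a sketch; the function name `top_k_indices` and the scores are invented. One difference worth knowing: with `x = 0` the NumPy slice `[-0:]` returns every row, while this version returns an empty list. Ties may also come out in a different order.

```python
def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

cosines_demo = [0.10, 0.56, 0.00, 0.53, 0.47]
best = top_k_indices(cosines_demo, 3)
print(best)  # [1, 3, 4]
```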
    
    

## ---------------------------------------------------------------------------------------------------------

## Saving the Tokenized Data and Vocabulary as CSV, and storing our blog_vector (from TF-IDF) and the Vocabulary (again) as pickles

In [202]:
df.to_csv('CSVData/Tokenized_Allformatdata.csv',index=False)
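
One thing to watch when loading this CSV again: the `Post_tokenized` lists are written as text, so they come back as strings such as `"['weather', 'nice']"` rather than lists. The sketch below shows the round trip with the standard `csv` module as a stand-in for pandas, and uses `ast.literal_eval` to turn the text back into a list; the single row is made up.

```python
import ast
import csv
import io

# Write one row the way a CSV export would: the token list is stored via str().
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Id", "Post_tokenized"])
writer.writerow([1000331, str(["weather", "nice"])])

# Read it back: the column now holds text, not a list.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["Post_tokenized"]

# ast.literal_eval safely parses the text back into a Python list.
tokens = ast.literal_eval(raw)
print(type(raw).__name__, type(tokens).__name__, tokens)
```

With pandas, the same conversion can be requested at load time through the `converters` argument of `pd.read_csv`, for example `converters={'Post_tokenized': ast.literal_eval}`.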

In [203]:
vocab_df = pd.DataFrame(vocabulary)

In [205]:
vocab_df.to_csv('CSVData/Vocab_token.csv',index=False)


In [206]:
import pickle

In [207]:
#the with-block makes sure the file is closed after writing
with open("Pickle/blog_vector_tfidf.pickle", "wb") as f:
    pickle.dump(blog_vector, f)

In [208]:
#the with-block makes sure the file is closed after writing
with open("Pickle/vocabulary.pickle", "wb") as f:
    pickle.dump(vocabulary, f)
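
Loading the saved objects in a later session is the mirror image of saving them. This is a minimal round-trip sketch that writes to a temporary directory, so it does not touch the real `Pickle/` folder; `vocabulary_demo` is a stand-in for the real vocabulary.

```python
import os
import pickle
import tempfile

vocabulary_demo = ["rain", "sunshine", "weather"]  # stand-in for the real vocabulary

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocabulary.pickle")

    # Save: the with-block closes the file even if dump() raises.
    with open(path, "wb") as handle:
        pickle.dump(vocabulary_demo, handle)

    # Load: read the object back from the same file.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == vocabulary_demo)  # True
```

As usual with pickle, only load files you created yourself or otherwise trust.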

## ---------------------------------------------------------------------------------------------------------

# FINALLY HERE IS OUR SEARCH ENGINE!

In [237]:
print("Welcome to the Semantic Search Engine!\n\nKindly enter the keyword/sentence/query you are looking for:\t")
word = input()
print("\n How many relevant results would you like me to show?:\t")
r = input()
r = int(r)

result_df = SemanticSearch(word,r)
display(result_df)


Welcome to the Semantic Search Engine!

Kindly enter the keyword/sentence/query you are looking for:	
Sunshine and blossoms

 How many relevant results would you like me to show?:	
10


Unnamed: 0,Id,Date,Post,Gender,Age,Industry,Astrosign,Post_tokenized,Relevance
346474,317581,"27,May,2004","\n\n\t \n You are my sunshine, my only su...",male,26,Technology,Scorpio,"[sunshine, sunshine, happy, sky, grey, know, d...",0.561988
324115,2967696,"04,June,2004",\n\n\n \n urlLink birthday blos...,female,27,Communications-Media,Aries,"[urllink, birthday, blossom, nbsp, urllink]",0.555443
489275,605396,"15,March,2003","\n\n \n Ah, the sunshine has returned...",male,35,indUnk,Scorpio,"[sunshine, return]",0.538309
675887,4156027,"11,August,2004","\n\n\t \n urlLink on my way home, sun...",male,26,Student,Sagittarius,"[urllink, way, home, sunshine]",0.533313
126369,3877921,"13,August,2004",\n\n\n \n By Sixpence None the Ric...,female,16,Student,Libra,"[sixpence, rich, time, question, world, leave,...",0.533122
146411,4056181,"02,August,2004",\n\n \n urlLink Good Morning Sunshin...,female,25,Non-Profit,Virgo,"[urllink, good, morning, sunshine]",0.529344
33847,2800347,"14,July,2004",\n\n \n \n urlLink Blossom a...,female,41,HumanResources,Gemini,"[urllink, blossom, piss, company, nbsp, urllink]",0.501577
432359,3691135,"08,August,2004","\n\n\t \n But hey look, there's sunshine ...",female,41,Arts,Gemini,"[hey, look, sunshine, today, urllink]",0.486142
652925,887044,"13,september,2003",\n\n\n \n Somehow my little sunshin...,female,23,indUnk,Pisces,"[little, sunshine, person, change, overnight, ...",0.484833
137512,3971038,"11,July,2004","\n\n\t \n Made a wish, I can dream I can ...",male,23,Student,Gemini,"[wish, dream, want, afraid, live, life, fulfil...",0.47024
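
The search cell calls `int()` directly on whatever is typed, so an answer like "ten" stops it with a `ValueError`. A possible guard is sketched below; the function name, the default of 10 and the cap of 100 are all arbitrary choices for the example.

```python
def parse_result_count(raw, default=10, maximum=100):
    """Turn the user's answer into a usable result count."""
    try:
        count = int(raw.strip())
    except ValueError:
        return default
    # Clamp to a sensible range; zero or negative counts make no sense here.
    return max(1, min(count, maximum))

print(parse_result_count("10"), parse_result_count("ten"), parse_result_count("-3"))  # 10 10 1
```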


## Another run

In [238]:
print("Welcome to the Semantic Search Engine!\n\nKindly enter the keyword/sentence/query you are looking for:\t")
word = input()
print("\n How many relevant results would you like me to show?:\t")
r = input()
r = int(r)

result_df = SemanticSearch(word,r)
display(result_df)

Welcome to the Semantic Search Engine!

Kindly enter the keyword/sentence/query you are looking for:	
Rains 

 How many relevant results would you like me to show?:	
10


Unnamed: 0,Id,Date,Post,Gender,Age,Industry,Astrosign,Post_tokenized,Relevance
494113,727002,"22,February,2003","\n\n\n \n rain, rain, rain, rain, r...",male,23,Internet,Leo,"[rain, rain, rain, rain, rain, rain, rain, rai...",1.0
538870,1155029,"29,May,2004",\n\n \n No rain.\n \n \n,female,17,Student,Leo,[rain],1.0
570124,1596894,"08,April,2004",\n\n\t\t\t\t\t\t\n\t\t\t\t\t\t aloneaalloonn...,male,26,Arts,Libra,[aloneaalloonneeaaalllooonnneeeaaaalllloooonnn...,0.997887
605477,477665,"03,November,2003",\n\n\n \n rain. it rained. I fell. ...,female,25,indUnk,Leo,"[rain, rain, fall, rain, love, rain, lose, rai...",0.977105
438484,3734683,"01,July,2004",\n\n \n urlLink The rain\n \n...,female,27,indUnk,Gemini,"[urllink, rain]",0.904114
123662,3859699,"14,July,2004",\n\n\t \n urlLink rain&nbsp; urlLink ...,female,23,Internet,Cancer,"[urllink, rain, nbsp, urllink, rain, rain, eye]",0.901894
89776,3575606,"09,June,2004",\n\n \n \n Driving in the rain.R...,female,24,indUnk,Taurus,"[drive, rain, rain, head, rain, sock, wet, rai...",0.892373
289847,2597250,"05,August,2004",\n\n\t \n wow...i love rain so much. I w...,female,16,indUnk,Gemini,"[wow, love, rain, nice, long, run, rain, man, ...",0.850163
344319,316316,"10,February,2004",\n\n\n \n By no means a complete li...,female,24,Education,Virgo,"[mean, complete, listing, think, rain, time, s...",0.843418
409178,3554150,"09,July,2004","\n\n \n \n Today, my day off. I ...",male,23,indUnk,Libra,"[today, day, wake, stay, cuz, pour, rain, rain...",0.83106


## One more final run

In [239]:
print("Welcome to the Semantic Search Engine!\n\nKindly enter the keyword/sentence/query you are looking for:\t")
word = input()
print("\n How many relevant results would you like me to show?:\t")
r = input()
r = int(r)

result_df = SemanticSearch(word,r)
display(result_df)

Welcome to the Semantic Search Engine!

Kindly enter the keyword/sentence/query you are looking for:	
The weather is really nice dont you think?

 How many relevant results would you like me to show?:	
20


Unnamed: 0,Id,Date,Post,Gender,Age,Industry,Astrosign,Post_tokenized,Relevance
57148,3348936,"17,May,2004",\n\n \n As always the weather has let...,male,25,Student,Gemini,"[weather, let]",0.667551
602779,449628,"27,February,2003",\n\n\n \n urlLink Weather from We...,male,34,indUnk,Aries,"[urllink, weather, weatherbug, include, weathe...",0.588184
592824,320317,"06,January,2002",\n\n \n Little under the weather to...,male,36,Technology,Pisces,"[little, weather, today, feel]",0.574015
29972,2627097,"06,February,2004",\n\n \n \n Ah HA! I have real we...,female,47,indUnk,Taurus,"[real, weather, need]",0.573347
31157,2661577,"18,July,2004",\n\n \n 11:00am 1-19-2004 urlLink My...,male,23,indUnk,Libra,"[urllink, weather, suck]",0.573257
462227,3932756,"04,August,2004",\n\n \n Great weather we are havin...,female,25,Marketing,Scorpio,"[great, weather, far]",0.570216
635702,75671,"03,August,2004",\n\n \n urlLink If i am talking a...,male,36,Engineering,Aries,"[urllink, talk, weather, mean]",0.566359
154433,4147395,"11,August,2004",\n\n\t \n I fucking hate The Weather Chan...,male,17,Military,Taurus,"[fucking, hate, weather, channel, seriously, c...",0.561749
288270,2576645,"15,March,2004","\n\n\t \n As you may have noticed, I have...",female,16,Student,Cancer,"[notice, change, template, blog, parade, templ...",0.557421
504350,1107146,"17,February,2003",\n\n \n \n Nice.\n \n,female,16,Student,Libra,[nice],0.52976


# CONCLUSIONS


## 1. Making use of POS and NER

### The last output of our semantic search is of particular interest. When we ask for a larger number of results (20 in this case), the word "nice" sometimes gets a higher weight than "weather", even though "weather" was the subject of our sentence.

### We could use Part-of-Speech tagging in this case and give more weight to "weather" than to "nice". In other cases, we could make use of Named Entity Recognition and give more weight to the recognized entities.

### Given one week to accomplish the task, and to get the most output for the effort, I decided to omit those two parts from my pipeline.
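
As a rough idea of how the Part-of-Speech weighting could look, the sketch below gives each query token a weight based on its coarse tag. Everything here is an assumption for illustration: the weight values are invented, and the `(lemma, tag)` pairs are typed by hand in the shape spaCy exposes through `token.lemma_` and `token.pos_`.

```python
from collections import Counter

# Hypothetical weights per coarse part-of-speech tag; the values are illustrative only.
POS_WEIGHTS = {"NOUN": 2.0, "PROPN": 2.0, "VERB": 1.0, "ADJ": 0.5}

def weighted_query(tagged_tokens, pos_weights=POS_WEIGHTS, default=1.0):
    """Sum a weight per token so that nouns count for more than adjectives."""
    weights = Counter()
    for lemma, pos in tagged_tokens:
        weights[lemma] += pos_weights.get(pos, default)
    return dict(weights)

# Hand-written (lemma, tag) pairs for "The weather is really nice dont you think?"
tagged = [("weather", "NOUN"), ("nice", "ADJ"), ("think", "VERB")]
query_weights = weighted_query(tagged)
print(query_weights)  # {'weather': 2.0, 'nice': 0.5, 'think': 1.0}
```

These weights could then scale the matching entries of the query's TF-IDF vector before the cosine similarities are computed.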

..

## 2. Utilizing the Date column to sort our search by recent/older posts

### Another option is to convert the Date column to datetime objects and sort the results by most recent or oldest. The hurdle I encountered while trying to do this was that the month names in the Date column appear in different languages (e.g., Romanian, Italian and French, to name a few).

### For more details, check out Scrapdrafts/DetectLanguagesinDates.ipynb, which validates the above statement.
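
One way around the mixed-language month names would be a hand-made lookup table, sketched below. The mapping is deliberately partial and only meant as an example; a real one would need every month in every language that the notebook above detects. Dates that cannot be parsed come back as `None`, so they can be sorted last or inspected separately.

```python
from datetime import date

# Partial, hand-written mapping (an assumption, not the full list found in the data).
MONTHS = {
    "january": 1, "may": 5, "september": 9,  # English
    "janvier": 1,                            # French
    "gennaio": 1,                            # Italian
    "ianuarie": 1,                           # Romanian
}

def parse_blog_date(raw):
    """Parse 'day,month,year' strings such as '31,May,2004'; return None if unknown."""
    try:
        day, month_name, year = raw.split(",")
        month = MONTHS[month_name.strip().lower()]
        return date(int(year), month, int(day))
    except (ValueError, KeyError):
        return None

print(parse_blog_date("31,May,2004"))        # 2004-05-31
print(parse_blog_date("13,september,2003"))  # 2003-09-13
print(parse_blog_date("01,xyz,2004"))        # None
```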

..

## 3. Topic Modelling

### Finally, we could potentially topic model our data by Gender, Industry and Age (Astrosign could also be included, though that would be a stretch :D) to capture the jargon that a particular group of people uses and give it more weight.

# Hope you enjoyed going through the code. Any inputs on improving the comments, readability or the code itself are most welcome :)