# Auto Complete

For this project, we build an auto-completer as an application of Information Retrieval. We chose auto-completion because it is a simple yet very useful tool for the general public.

For the Information Retrieval part, we use TF-IDF and cosine similarity.
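
As a minimal sketch of the idea (the three toy sentences and the variable names are made up for illustration and are not taken from the dataset), TF-IDF turns each text into a weighted term vector, and cosine similarity then scores how closely each vector matches a query:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (hypothetical, for illustration only)
corpus = [
    "your service is great",
    "enjoy your new phone",
    "has service been restored",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)    # one row per sentence
query_vector = vectorizer.transform(["service"])  # reuses the corpus vocabulary

# One similarity score per sentence; 0 means no shared terms
similarities = cosine_similarity(doc_vectors, query_vector).ravel()
for sentence, score in zip(corpus, similarities):
    print(f"{score:.3f}  {sentence}")
```

The sentence that does not contain the query term scores 0, while the two sentences that mention "service" get positive scores.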

In this part, we import all of the packages and libraries that are used later on. We use a JSON file as our training dataset; the JSON itself contains fields such as the messages.

In [None]:
import json
import os
import numpy as np
import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import re

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

Here, we read "sample_conversations.json" as our training dataset. After the data is loaded, we preprocess it to extract the messages from the JSON. To do that, we build a dataframe and rename its columns for easier use.

In [None]:
df = pd.read_json('/content/sample_conversations.json')

for column in ['Issues']:
  column_as_df = json_normalize(df[column])
  column_as_df.columns = [str(column+"_"+subcolumn) for subcolumn in column_as_df.columns]
  df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)

df = pd.DataFrame([dict(y, index=i) for i, x in enumerate(df['Issues_Messages'].values.tolist()) for y in x])
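
The flattening step can be illustrated on a tiny example. This is only a sketch: it assumes each issue holds a list of message dictionaries, which is what the code above suggests, and the records and the `IssueId` field are invented for illustration.

```python
import pandas as pd

# Hypothetical records that only mimic the assumed nested shape
issues = [
    {"IssueId": 1, "Messages": [
        {"Text": "My internet is down", "IsFromCustomer": True},
        {"Text": "Sorry to hear that. Let me check", "IsFromCustomer": False},
    ]},
    {"IssueId": 2, "Messages": [
        {"Text": "Thanks for contacting us", "IsFromCustomer": False},
    ]},
]

# One output row per message; IssueId is carried along as metadata
messages = pd.json_normalize(issues, record_path="Messages", meta=["IssueId"])
print(messages)
```

Each nested message becomes its own row, which is the shape the later processing steps expect.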



splitDataFrameList is a function that splits the message data. The text in the target column is split on the separator passed to the function, and each resulting piece is placed in its own row. Later on, the messages are split on punctuation (. , ? ! ;). This step is needed because we want to work with the messages phrase by phrase, so that the program can return results that make sense for our needs.

In [None]:
def splitDataFrameList(df,target_column,separator):
  # Split target_column on separator and emit one output row per piece,
  # copying the other columns of the original row unchanged.
  def splitListToRows(row,row_accumulator,target_column,separator):
      split_row = row[target_column].split(separator)
      for s in split_row:
          new_row = row.to_dict()
          new_row[target_column] = s
          row_accumulator.append(new_row)
  new_rows = []
  df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
  new_df = pd.DataFrame(new_rows)
  return new_df
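
For comparison, a roughly equivalent result can be obtained with pandas' built-in `explode`. This is an alternative sketch on made-up toy data, not the function used in this notebook:

```python
import pandas as pd

# Toy data (hypothetical), shaped like the message dataframe
toy = pd.DataFrame({
    "Text": ["Hello. How can I help", "Thanks"],
    "IsFromCustomer": [False, False],
})

# Turn each text into a list of pieces, then give every piece its own row
pieces = toy.assign(Text=toy["Text"].apply(lambda s: s.split(". ")))
split_rows = pieces.explode("Text").reset_index(drop=True)
print(split_rows["Text"].tolist())
```

The other columns are repeated for every piece, just as splitListToRows copies the whole row for each piece.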

process_data is a function for processing the data. First, we keep only the messages that are not from the customer and split them using the previously created function splitDataFrameList. After that, we use string operations and a regular expression to normalize the sentence structure. The normalization steps are:

1.   Collapsing repeated whitespace into a single space
2.   Stripping "." from the start and end of each message
3.   Collapsing whitespace again, as in the first step
4.   Replacing "i" with "I" for every standalone "i" that represents a subject
5.   Replacing " ?" with "?"
6.   Replacing " !" with "!"
7.   Replacing " ." with "."
8.   Replacing "OK" with "Ok"
9.   Transforming the first character of a sentence into uppercase
10.  Appending "?" to every sentence that starts with "Wh" or "How" and does not already end with "?"

After that, we remove all messages that contain 2 words or fewer, because messages that short rarely carry enough meaning to be useful suggestions.

After that, we count how often each message occurs and drop the duplicate messages.



In [None]:
def process_data(new_df):
  # Keep only the messages that were not written by the customer
  new_df = new_df[new_df.IsFromCustomer==False]
  
  # Split every message into phrases on sentence punctuation
  for sep in ['. ',', ','? ', '! ', '; ']:
      new_df = splitDataFrameList(new_df, 'Text', sep)
      
  # Normalize whitespace, punctuation spacing and capitalization
  new_df['Text']=new_df['Text'].apply(lambda x: " ".join(x.split()))
  new_df['Text']=new_df['Text'].apply(lambda x: x.strip("."))
  new_df['Text']=new_df['Text'].apply(lambda x: " ".join(x.split()))
  new_df['Text']=new_df['Text'].apply(lambda x: x.replace(' i ',' I '))
  new_df['Text']=new_df['Text'].apply(lambda x: x.replace(' ?','?'))
  new_df['Text']=new_df['Text'].apply(lambda x: x.replace(' !','!'))
  new_df['Text']=new_df['Text'].apply(lambda x: x.replace(' .','.'))
  new_df['Text']=new_df['Text'].apply(lambda x: x.replace('OK','Ok'))
  # x[:1] instead of x[0] so that empty strings do not raise an IndexError
  new_df['Text']=new_df['Text'].apply(lambda x: x[:1].upper()+x[1:])
  new_df['Text']=new_df['Text'].apply(lambda x: x+"?" if re.search(r'^(Wh|How).+([^?])$',x) else x)
  
  # Drop messages with 2 words or fewer; copy to avoid SettingWithCopyWarning
  new_df['nb_words'] = new_df['Text'].apply(lambda x: len(str(x).split(' ')))
  new_df = new_df[new_df['nb_words']>2].copy()
  
  # Count how often each message occurs, then keep one row per message
  new_df['Counts'] = new_df.groupby(['Text'])['Text'].transform('count')
  
  new_df = new_df.drop_duplicates(subset=['Text'], keep='last')
  
  new_df = new_df.reset_index(drop=True)
  print(new_df.shape)  
  
  return new_df
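
To see the effect of the normalization steps on a single message, the same operations can be applied to a plain string. The helper name `normalize_message` and the sample inputs are hypothetical and only serve as an illustration:

```python
import re

def normalize_message(text):
    """Apply the notebook's string clean-up steps to one message."""
    text = " ".join(text.split())        # collapse repeated whitespace
    text = text.strip(".")               # drop leading/trailing periods
    text = " ".join(text.split())        # collapse whitespace again
    text = text.replace(" i ", " I ")    # standalone "i" -> "I"
    text = text.replace(" ?", "?").replace(" !", "!").replace(" .", ".")
    text = text.replace("OK", "Ok")
    text = text[:1].upper() + text[1:]   # capitalize the first character
    # Questions starting with "Wh"/"How" get a trailing "?" if it is missing
    if re.search(r"^(Wh|How).+([^?])$", text):
        text += "?"
    return text

print(normalize_message("how   can i help you today."))
print(normalize_message("that is OK !"))
```

The first call prints `How can I help you today?` and the second prints `That is Ok!`.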

calc_matrice is a function that fits a TF-IDF model and also builds the TF-IDF matrix.

In [None]:
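def calc_matrice(df):
  # Word n-grams from 1 to 5 words. min_df=1 keeps every term, like the
  # original min_df=0, which recent scikit-learn versions reject for integers.
  model_tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 5), min_df=1)
  # Rows are messages, columns are n-grams
  tfidf_matrice = model_tf.fit_transform(df['Text'])
  print("tfidf_matrice ", tfidf_matrice.shape)
  return model_tf, tfidf_matrice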
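
The effect of `ngram_range` is easiest to see on one short sentence. The sketch below uses a smaller range, (1, 2), and a made-up sentence to keep the output short:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams and bigrams only, to keep the vocabulary small
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
vectorizer.fit(["your service is great"])

# vocabulary_ maps each n-gram to its column index
terms = sorted(vectorizer.vocabulary_)
print(terms)
```

Four words already produce seven terms here. With `ngram_range=(1, 5)` the vocabulary also includes 3-, 4- and 5-word sequences, so the matrix ends up with far more columns than there are messages.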

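generate_completions is a function that creates the suggestions. First, we define a weight for each message in our dataframe as 1 + log(1 + count), where count is the number of times the message occurs and log is the natural logarithm (np.log1p). After that, we compute the cosine similarity between the prefix and every message. Because TfidfVectorizer L2-normalizes its vectors by default, linear_kernel, which is a plain dot product, gives the cosine similarity. We keep the 10 most similar messages, multiply their scores by the weights, sort again, and return the top 3 messages from our dataframe.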

In [None]:
def generate_completions(prefix_string, data, model_tf, tfidf_matrice):
        
  prefix_string = str(prefix_string)
  new_df = data.reset_index(drop=True)
  # Frequency weight per message: 1 + ln(1 + count)
  weights = 1 + np.log1p(new_df['Counts'].values)

  # Vectorize the prefix with the already fitted model
  tfidf_matrice_spelling = model_tf.transform([prefix_string])

  # TF-IDF vectors are L2-normalized, so the dot product is the cosine similarity
  cosine_similarite = linear_kernel(tfidf_matrice, tfidf_matrice_spelling).ravel()
  
  # Keep the 10 most similar messages (stable sort keeps input order on ties)
  similarity_indices = np.argsort(-cosine_similarite, kind='stable')[:10]

  # Re-rank those candidates by similarity * frequency weight
  weighted_scores = cosine_similarite[similarity_indices] * weights[similarity_indices]
  best = np.argsort(-weighted_scores, kind='stable')[:3]
  similarity_indices_w = similarity_indices[best]
  
  return new_df.loc[similarity_indices_w, 'Text'].tolist()
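
The re-weighting step can change the order of the candidates. A small sketch with invented scores and counts (none of these numbers come from the dataset) shows a frequent message overtaking a slightly more similar but rare one:

```python
import numpy as np

# Hypothetical similarity scores and occurrence counts for three candidates
scores = np.array([0.50, 0.45, 0.10])
counts = np.array([1, 20, 1])

weights = 1 + np.log1p(counts)      # 1 + ln(1 + count)
weighted_scores = scores * weights

# Candidate indices from best to worst after re-weighting
ranking = np.argsort(-weighted_scores).tolist()
print(ranking)
```

Candidate 1 has a lower raw similarity than candidate 0, but its higher count gives it the larger weighted score, so it is ranked first.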

# Main Process

In [None]:
new_df = process_data(df)
new_df.shape, new_df.columns

(8560, 5)


((8560, 5),
 Index(['IsFromCustomer', 'Text', 'index', 'nb_words', 'Counts'], dtype='object'))

In [None]:
model_tf, tfidf_matrice = calc_matrice(new_df)

tfidf_matrice  (8560, 99397)


In [None]:
prefix = 'Service'

print(prefix,"    \n ")

generate_completions(prefix, new_df, model_tf,tfidf_matrice)

Service     
 


['Your service is great', 'Enjoy your new service!', 'Has service restored?']