# About this notebook

#### The novel **COVID-19** has changed how we, as humans in this era of civilization, view diseases. Everything escalated quickly: the number of confirmed cases grew exponentially, with an R number (the average number of people to whom one infected person passes the virus) between 2 and 2.5 at the beginning. What made it harder is that we did not understand the disease, and more and more lives were lost. We are in a race against time to save as many lives as possible. We want to learn more about the disease in order to flatten the curve, i.e. decrease the R number, and knowing the risk factors for COVID-19 will help us do so.

#### *This is what our model aims to do: direct healthcare providers to the most relevant papers, where they might find what they are looking for. If our solution saved only one life, we would be very proud that applying some science and spending our time achieved this!*

## Importing Important Libraries

In [51]:
import os
import json
import math
import numpy as np 
import pandas as pd 
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.util import bigrams, trigrams, ngrams
from sklearn.feature_extraction.text import TfidfVectorizer


## Preprocessing Data

### Tokenizing the body text of a paper
Only lowercase alphabetic tokens are kept, which removes stop words, punctuation marks, currency symbols and numbers

In [127]:
def textPreprocessing(text):
    """Lowercase and tokenize the text, dropping stop words and non-alphabetic tokens."""
    # Assumes the NLTK stop word list and tokenizer models have already been downloaded
    stop_words = set(stopwords.words("english"))
    stop_words.update(['one', 'av', 'however', 'moreover', 'yet'])
    new_words = []
    for word in word_tokenize(text):
        word = word.lower()
        if word not in stop_words and word.isalpha():
            new_words.append(word)
    return new_words  # list of words in a text

### Retrieving json files of document from the directory

In [161]:
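def textReading(file_dir, x=0, y=10):
    """Read the JSON papers filenames[x:y] from file_dir and tokenize their body text."""
    filenames = os.listdir(file_dir)
    docs_bagOfWords = {}
    for filename in filenames[x:y]:
        # Use the file_dir argument here, not the global json_dir
        with open(os.path.join(file_dir, filename), 'rb') as f:
            file = json.load(f)
        text = ''
        for i in file['body_text']:
            text += i['text']
        # filename[:-5] strips the '.json' extension to get the paper id
        docs_bagOfWords[(filename[:-5], file['metadata']['title'])] = textPreprocessing(text)
    return docs_bagOfWords  # dictionary {(paper_id, title): [words]}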

In [171]:
json_dir = '/kaggle/input/CORD-19-research-challenge/document_parses/pdf_json'
docs_bagOfWords = textReading(json_dir,100,200)

# Implementing Raw TF-IDF

## First, TF (Term Frequency Calculation)

![](https://miro.medium.com/proxy/1*HM0Vcdrx2RApOyjp_ZeW_Q.png)

## Computing IDF

![](https://miro.medium.com/proxy/1*A5YGwFpcTd0YTCdgoiHFUw.png)

## Computing TF-IDF

![](https://miro.medium.com/proxy/1*nSqHXwOIJ2fa_EFLTh5KYw.png)
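
The three steps above can be sketched in plain Python. This is a minimal sketch, assuming the common definitions tf = (count of the term in the document) / (number of words in the document) and idf = log(N / df); the formulas in the images may use a slightly different variant. The function names and the toy documents are illustrative and are not part of the notebook's pipeline.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """TF: count of each term divided by the number of tokens in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(docs):
    """IDF: log(N / df), where df is the number of documents containing the term."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def raw_tfidf(docs):
    """TF-IDF: product of TF and IDF for every term of every document."""
    idf = inverse_document_frequency(docs)
    return {
        doc_id: {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
        for doc_id, tokens in docs.items()
    }


# Toy input in the same shape as docs_bagOfWords: {doc_id: [words]}
toy_docs = {
    "doc_a": ["virus", "risk", "virus"],
    "doc_b": ["risk", "factor"],
}
toy_scores = raw_tfidf(toy_docs)
```

A word that appears in every document (here "risk") gets a weight of zero under this definition. The scikit-learn `TfidfVectorizer` used in the next section applies a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this raw version.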

# Implementing TF-IDF Using sklearn

In [1]:
def totalTFIDF(docs_bagOfWords):
    """
    Calculating TF-IDF for all documents
    Args:
        docs_bagOfWords: dict mapping (paper_id, title) to the list of words of each paper
    Returns:
        documentsText: list of strings (the preprocessed body text of each document)
        feature_names: list of strings (the unique words of the total vocabulary)
        tfidf_dict: dict with (paper_id, title) as keys and the list of TF-IDF weights of each paper
    """
    vectorizer = TfidfVectorizer()
    # Rebuild one space-separated string per paper for the vectorizer
    documentsText = [' '.join(words) for words in docs_bagOfWords.values()]
    vectors = vectorizer.fit_transform(documentsText)
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    feature_names = list(vectorizer.get_feature_names_out())
    denselist = vectors.todense().tolist()
    tfidf_dict = dict(zip(docs_bagOfWords.keys(), denselist))
    return documentsText, feature_names, tfidf_dict


In [142]:
def calculateTFIDF(vec_query):
    """
    Calculating TF-IDF for the query alone
    Args:
        vec_query: list of preprocessed words of the query
    Returns:
        denselist: list containing one list of TF-IDF weights for the query
        feature_names: list of strings (the unique words of the query)
    """
    vectorizer = TfidfVectorizer()
    documentsText = [' '.join(vec_query)]
    vectors = vectorizer.fit_transform(documentsText)
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    feature_names = list(vectorizer.get_feature_names_out())
    denselist = vectors.todense().tolist()
    return denselist, feature_names

In [143]:
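def getTotalVocab(docs):
    """
    Returns the words of the given documents as a single list
    Args:
        docs: list of strings, one space-separated string per document
    """
    total_vocab = []
    for i in docs:  # iterate over the argument, not the global d
        total_vocab += i.split(' ')
    return total_vocab  # list of all words (duplicates included)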

In [163]:
d,features,tfidf_dict = totalTFIDF(docs_bagOfWords)
total_vocab = getTotalVocab(d)

### Calculating Cosine Similarity

![image.png](http://sites.temple.edu/tudsc/files/2017/03/cosine-equation.png)


#### Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction.
![](http://miro.medium.com/max/650/1*OGD_U_lnYFDdlQRXuOZ9vQ.png)
A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Thus, each document is an object represented by what is called a term-frequency vector.
This is what is implemented in this model: it measures how similar the input query is to each available document, which helps answer questions about the risk factors of the newly emerged COVID-19 virus.
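
As a small worked example of the formula, the sketch below computes the cosine similarity of two short vectors in plain Python. The function name and the example vectors are illustrative assumptions, and returning 0.0 for a zero-length vector is a convention chosen here, not something the notebook specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used in this sketch for all-zero vectors
    return dot / (norm_u * norm_v)


# Two vectors pointing in the same direction, and two orthogonal ones
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])
orthogonal = cosine_similarity([1, 0], [0, 3])
```

Vectors pointing in the same direction score close to 1 regardless of their length, while vectors with no shared non-zero component score 0. This is why the measure suits documents of very different sizes.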

In [145]:
def getCosineDistance(q_vec,doc_dict):
    """
    Calculates the cosine similarity between a query and each document
    Args:
        q_vec: A vector representing the query
        doc_dict: Dictionary with - key as a (paper_id, title) tuple
                                  - value as the vector representation of this document
    Returns:
        q_norm: The norm of the input query
        cosDistances: Dictionary of cosine similarities between the query and each document,
                      sorted from most to least similar
    """
    cosDistances = {}
    q_vec = np.asarray(q_vec, dtype=float)
    q_norm = np.linalg.norm(q_vec)
    for k, v2 in doc_dict.items():
        # Zero-pad the query up to the document vector length (a no-op once lengths match).
        # This only equalizes lengths: the query was vectorized on its own vocabulary,
        # so its positions are not guaranteed to refer to the same words as in v2.
        z = np.zeros(((len(v2)-len(q_vec)),))
        q_vec = np.concatenate((q_vec,z), axis=0)
        dotProduct = np.dot(q_vec,v2)
        cosDistances[k] = dotProduct/(q_norm*np.linalg.norm(v2))
    cosDistances = dict(sorted(cosDistances.items(), key=lambda item: item[1], reverse=True))
    return q_norm, cosDistances
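
One caveat about the function above: the query vector comes from a vectorizer fitted on the query alone, so position i in the query and position i in a document vector can refer to different words. Zero-padding makes the lengths match but does not align the positions. The sketch below shows one possible way to place query weights at the positions used by the document vocabulary. The helper name `align_to_vocabulary` and the toy values are hypothetical and are not part of the notebook.

```python
def align_to_vocabulary(query_weights, feature_names):
    """Place each query weight at the index its word has in the document vocabulary.

    Words that do not occur in the document vocabulary are dropped.
    """
    index = {word: i for i, word in enumerate(feature_names)}
    vector = [0.0] * len(feature_names)
    for word, weight in query_weights.items():
        if word in index:
            vector[index[word]] = weight
    return vector


# Hypothetical document vocabulary (in vectorizer order) and query weights
doc_vocabulary = ["agglutination", "latex", "risk", "virus"]
query_weights = {"latex": 0.8, "virus": 0.6, "swab": 0.1}
aligned = align_to_vocabulary(query_weights, doc_vocabulary)
```

With scikit-learn, a similar effect is usually obtained by calling `transform` on the vectorizer that was already fitted on the documents, which would require `totalTFIDF` to return that vectorizer as well. That change is a suggestion and is not made in this notebook.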
        

The input question is placed in the corpus variable.

In [164]:
corpus = "other isolates had been stored at the -70C for up to 12 months.The reagents were prepared by Ani Biotech Oy. Helsinki, Finland, and supplied by Mercia Diagnostics, Guildford.UK. The test latex was coated with immunoglobulins to the AV hexon antigen and the control latex with normal immunoglobulin; the test was therefore group specific in principle. The method recommended by the supplier was followed for examining faeces. Dilution buffer was prepared by dissolving one of the buffer tablets supplied (constituents not stated) in 100 ml distilled H20. An approximate 10% suspension of each faeces was prepared by thoroughly mixing about 0.1 g sample with 1 ml buffer in clean plastic 75 X 12 mm tubes. The suspensions were clarified by centrifugation at 2000 rpm for 5 min. One drop (approximately 50 ~1) of supernatant was placed in each of two circles on a glass agglutination plate: to one a drop of test latex was added and a drop of control latex to the other. These were mixed with the tip of a wooden swab stick and the plate gently rocked. The time taken for any agglutination to occur was noted. A sample was deemed positive if definite agglutination of the test latex, but not control latex. occurred with 3 min rocking. The strength of reaction was graded from + to +++: a + reaction in which there was slight granularity of the latex, but not unequivocal agglutination, was considered to be negative. If both test and control latexes clumped the reaction was classified as non-specific. Cell culture isolates were tested as described above except that undiluted culture fluid was used in place of the faecal extract."
text_filtered = textPreprocessing(corpus)
tfidfList,words = calculateTFIDF(text_filtered)


In [165]:
q_norm,c = getCosineDistance(tfidfList[0],tfidf_dict)

The output results are:

In [170]:
c

{('62f5dc6a3bf7eaffcf4a955b6358e27dd13d0fa5',
  'Overexpression of PTEN suppresses lipopolysaccharide-induced lung fibroblast proliferation, differentiation and collagen secretion through inhibition of the PI3-K-Akt-GSK3beta pathway'): 0.04039606918010143,
 ('fefdbba94acf90ff54825e10d843284f26346303',
  'COVID-19: Easing the coronavirus lockdowns with caution'): 0.021354119077884755,
 ('7b29829a47fbada5035c500e88a3fb6945267116',
  'Peering through the portal: COVID-19 and the future of agriculture'): 0.017051198290966473,
 ('4eb0645a5ab563b06ec101914438267b6552a95e',
  'Synthesis and Anticancer Activities of Glycyrrhetinic Acid Derivatives'): 0.015573987595086864,
 ('3fec1d8c1c53fcf800e5efb362ad214e901196fd',
  'Viruses as Quasispecies: Biological Implications'): 0.01471427778404008,
 ('1c0ae4beee003ecb59ff15f81ed8f22cd9451353', ''): 0.013244661079051333,
 ('7a0da4f2fa4aa0d32a6590cc1fbe33fcb1bf2b86',
  'ENGINEERING CROWDSOURCED STREAM PROCESSING SYSTEMS'): 0.012450062722856325,
 ('03bb