### Load 2 articles

In [1]:
with open('docs/kite_text.txt', 'r') as f:
    kite_text = f.read().lower()
with open('docs/kite_history.txt', 'r') as f:
    kite_history = f.read().lower()

### kite_text
a kite is traditionally a tethered heavier-than-air craft with wing surfaces that react against the air to create lift and drag. a kite consists of wings, tethers, and anchors. kites often have a bridle to guide the face of the kite at the correct angle so the wind can lift it. a kite's wing also may be so designed so a bridle is not needed; when kiting a sailplane for launch, the tether meets the wing at a single point. a kite may have fixed or moving anchors. untraditionally in technical kiting, a kite consists of tether-set-coupled wing sets; even in technical kiting, though, a wing in the system is still often called the kite.\n\nthe lift that sustains the kite in flight is generated when air flows around the kite's surface, producing low pressure above and high pressure below the wings. the interaction with the wind also generates horizontal drag along the direction of the wind. the resultant force vector from the lift and drag force components is opposed by the tension of one or more of the lines or tethers to which the kite is attached. the anchor point of the kite line may be static or moving (e.g., the towing of a kite by a running person, boat, free-falling anchors as in paragliders and fugitive parakites or vehicle).\n\nthe same principles of fluid flow apply in liquids and kites are also used under water.\n\na hybrid tethered craft comprising both a lighter-than-air balloon as well as a kite lifting surface is called a kytoon.\n\nkites have a long and varied history and many different types are flown individually and at festivals worldwide. kites may be flown for recreation, art or other practical uses. sport kites can be flown in aerial ballet, sometimes as part of a competition. power kites are multi-line steerable kites designed to generate large forces which can be used to power activities such as kite surfing, kite landboarding, kite fishing, kite buggying and a new trend snow kiting. even man-lifting kites have been made.\n

### kite_history
kites were invented in china, where materials ideal for kite building were readily available: silk fabric for sail material; fine, high-tensile-strength silk for flying line; and resilient bamboo for a strong, lightweight framework.\n\nthe kite has been claimed as the invention of the 5th-century bc chinese philosophers mozi (also mo di) and lu ban (also gongshu ban). by 549 ad paper kites were certainly being flown, as it was recorded that in that year a paper kite was used as a message for a rescue mission. ancient and medieval chinese sources describe kites being used for measuring distances, testing the wind, lifting men, signaling, and communication for military operations. the earliest known chinese kites were flat (not bowed) and often rectangular. later, tailless kites incorporated a stabilizing bowline. kites were decorated with mythological motifs and legendary figures; some were fitted with strings and whistles to make musical sounds while flying. from china, kites were introduced to cambodia, thailand, india, japan, korea and the western world.\n\nafter its introduction into india, the kite further evolved into the fighter kite, known as the patang in india, where thousands are flown every year on festivals such as makar sankranti.\n\nkites were known throughout polynesia, as far as new zealand, with the assumption being that the knowledge diffused from china along with the people. anthropomorphic kites made from cloth and wood were used in religious ceremonies to send prayers to the gods. polynesian kite traditions are used by anthropologists get an idea of early "primitive" asian traditions that are believed to have at one time existed in asia.\n

### Stopwords and punctuation list

In [2]:
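import nltk
import string
import numpy as np

# Requires the NLTK stopword corpus: nltk.download('stopwords')
myStopwords = nltk.corpus.stopwords.words('english')
# Add each punctuation character as its own stopword entry
myStopwords += list(string.punctuation)
print(myStopwords)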

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

### Get TF for each doc

In [3]:
from collections import Counter
# Drop punctuation tokens only; stopwords are still counted at this stage
punct = set(string.punctuation)
intro_tok = [w for w in nltk.word_tokenize(kite_text) if w not in punct]
hist_tok = [w for w in nltk.word_tokenize(kite_history) if w not in punct]

In [4]:
# Most common TF in kite_text
Counter(intro_tok).most_common(10)

[('the', 26),
 ('a', 20),
 ('kite', 17),
 ('and', 10),
 ('of', 10),
 ('kites', 8),
 ('is', 7),
 ('in', 7),
 ('or', 6),
 ('wing', 5)]

In [5]:
# Most common TF in kite_history
Counter(hist_tok).most_common(10)

[('the', 13),
 ('kites', 9),
 ('were', 9),
 ('and', 9),
 ('for', 7),
 ('as', 7),
 ('kite', 6),
 ('in', 5),
 ('a', 5),
 ('to', 5)]
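
Both top-10 lists are dominated by function words such as 'the' and 'a', because the tokens are filtered against punctuation only. A minimal sketch of the same count with stopwords removed follows. The tiny `STOPWORDS` set, the `term_freq` helper, and the sample sentence are illustrative stand-ins, not the NLTK list or the full article text, and whitespace splitting is a simplification of `nltk.word_tokenize`.

```python
from collections import Counter
import string

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's English list.
STOPWORDS = {"a", "the", "is", "of", "and", "or", "in", "to", "that"}

def term_freq(text, stopwords=STOPWORDS):
    """Count lowercase whitespace tokens, skipping stopwords and bare punctuation."""
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return Counter(tok for tok in tokens if tok and tok not in stopwords)

sample = "A kite is a tethered craft. The kite consists of wings, tethers, and anchors."
tf = term_freq(sample)
print(tf.most_common(3))
```

With the stopwords gone, content words such as "kite" rise to the top of the list.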

### Bag-of-words method

In [6]:
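from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
docs=[kite_text, 
      kite_history]

# Fit one shared vocabulary over both documents
cv=CountVectorizer()
cv.fit(docs)

def get_bow_1doc(doc: str, cv: CountVectorizer):
    """Return (term, count) pairs for one document over the fitted vocabulary."""
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    features = cv.get_feature_names_out()
    counts = cv.transform([doc]).toarray()[0]
    return [(term, int(cnt)) for term, cnt in zip(features, counts)]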

In [7]:
# BOW of kite_text
get_bow_1doc(docs[0], cv)[:10]

[('549', 0),
 ('5th', 0),
 ('above', 1),
 ('activities', 1),
 ('ad', 0),
 ('aerial', 1),
 ('after', 0),
 ('against', 1),
 ('air', 4),
 ('along', 1)]

In [8]:
# BOW of kite_history
get_bow_1doc(docs[1], cv)[:10]

[('549', 1),
 ('5th', 1),
 ('above', 0),
 ('activities', 0),
 ('ad', 1),
 ('aerial', 0),
 ('after', 1),
 ('against', 0),
 ('air', 0),
 ('along', 1)]
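
The two outputs share the same term order because the vectorizer was fitted on both documents: every term in the joint vocabulary gets a slot, and a document that lacks the term gets a count of 0. A minimal pure-Python sketch of that idea is shown below. The `bag_of_words` helper and the toy documents are hypothetical, and whitespace splitting is a simplification of `CountVectorizer`'s default tokenization.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) over a shared, sorted vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])  # missing terms count as 0
    return vocab, vectors

# Toy documents, not the kite articles.
toy_docs = ["kites fly in air", "kites were invented in china"]
vocab, vectors = bag_of_words(toy_docs)
for term, count_a, count_b in zip(vocab, *vectors):
    print(term, count_a, count_b)
```

Terms that occur in only one toy document, such as "air" or "china", show a zero in the other document's column, just as '549' does for kite_text above.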

### Get TF-IDF for each doc

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
docs=[kite_text, 
      kite_history]

tfidf=TfidfVectorizer(use_idf=True, 
                      stop_words=myStopwords)
                      #stop_words=set(string.punctuation))
tfidf.fit(docs)

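def get_TfIdf_1doc(doc: str, tfidf: TfidfVectorizer):
    """Return a one-column DataFrame of TF-IDF weights indexed by vocabulary term."""
    tfidf_vect=tfidf.transform([doc])
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    return pd.DataFrame(tfidf_vect.T.toarray(),
                        index=tfidf.get_feature_names_out(),
                        columns=["tfidf"])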

In [10]:
# kite_text tfidf
kite_text_tfidf=get_TfIdf_1doc(docs[0], tfidf)
kite_text_tfidf

term,tfidf
549,0.000000
5th,0.000000
activities,0.045955
ad,0.000000
aerial,0.045955
...,...
wood,0.000000
world,0.000000
worldwide,0.045955
year,0.000000


In [11]:
# kite_history tfidf
kite_history_tfidf=get_TfIdf_1doc(docs[1], tfidf)
kite_history_tfidf

term,tfidf
549,0.06383
5th,0.06383
activities,0.00000
ad,0.06383
aerial,0.00000
...,...
wood,0.06383
world,0.06383
worldwide,0.00000
year,0.12766


### Cosine similarity

In [12]:
def cosine_sim(vec1, vec2):
    dot = np.dot(vec1, vec2)
    norma = np.linalg.norm(vec1)
    normb = np.linalg.norm(vec2)
    return dot / (norma * normb)

In [13]:
query='Had china invented the kite?'
print('Encode query to tfidf')
query_tfidf=get_TfIdf_1doc(query,tfidf)
query_tfidf.sort_values('tfidf', ascending=False)

Encode query to tfidf


term,tfidf
invented,0.631667
china,0.631667
kite,0.449436
patang,0.000000
polynesia,0.000000
...,...
fugitive,0.000000
generate,0.000000
generated,0.000000
generates,0.000000


In [14]:
query_tfidf=query_tfidf['tfidf'].values
cosine_sim(query_tfidf, kite_text_tfidf['tfidf'].values)

0.24982304626247498

In [15]:
cosine_sim(query_tfidf, kite_history_tfidf['tfidf'].values)

0.28374571997162956
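
Both scores are well defined because each vector has at least one non-zero weight. A query that shares no terms with the fitted vocabulary would encode to an all-zero vector, and `cosine_sim()` would then divide zero by zero, which in NumPy typically yields `nan` with a warning. A guarded variant might look like the sketch below, written with the standard library only. The name `safe_cosine_sim` and the choice to return 0.0 for zero vectors are assumptions made for illustration.

```python
import math

def safe_cosine_sim(vec1, vec2):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: no overlap with the vocabulary means no similarity
    return dot / (norm1 * norm2)

print(safe_cosine_sim([1.0, 0.0], [1.0, 0.0]))  # parallel vectors
print(safe_cosine_sim([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(safe_cosine_sim([0.0, 0.0], [1.0, 2.0]))  # zero vector
```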

## Conclusion from this exercise
I was not able to rely on cosine_sim() alone to find the document most relevant to my query. However, by manipulating the query terms ("china", "invented"), I could force cosine_sim() to give the kite_history doc the higher score (0.284 versus 0.250 for kite_text in the run above). My hypothesis is that adding more text preprocessing (stemming, lemmatization, POS tagging) before TF-IDF would improve search relevance.
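
As one illustration of the preprocessing idea, the sketch below merges singular and plural forms before counting, so that "kite" and "kites" contribute to the same term. The `crude_stem` helper is a hypothetical, deliberately naive stand-in for a real stemmer or lemmatizer (for example NLTK's `PorterStemmer`), and whether such normalization would change the ranking in this exercise is untested.

```python
from collections import Counter

def crude_stem(token):
    """Naive plural stripping; a stand-in for a real stemmer, not a replacement."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

tokens = "kites were invented in china and the kite has a long history".split()
raw_counts = Counter(tokens)
stemmed_counts = Counter(crude_stem(tok) for tok in tokens)

print(raw_counts["kite"], raw_counts["kites"])
print(stemmed_counts["kite"])
```

After normalization the two surface forms collapse into a single "kite" entry, which is the effect a stemmer would have on the vocabulary before TF-IDF is computed.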