#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2019


# Homework 1:  Modeling Text + Link Analysis + SEO

### 100 points [6% of your final grade]

### Due: Monday, February 8, 2019 by 11:59pm

*Goals of this homework:* Understand how the vector space model (TF-IDF, cosine) and BM25 work in search. Explore real-world challenges of building a graph (in this case, from Epinions), then implement and test the classic HITS algorithm over this graph. Experiment with real-world search engine optimization techniques.

*Submission instructions (eCampus):* To submit your homework, rename this notebook as `UIN_hw1.ipynb`. For example, my homework submission would be something like `555001234_hw1.ipynb`. Submit this notebook via eCampus (look for the homework 1 assignment there). Your notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so). So please run all the cells for us, and then submit.

*Late submission policy:* For this homework, you may use as many late days as you like (up to the 5 total allotted to you).

*Collaboration policy:* You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by **filling out the Collaboration Declarations at the bottom of this notebook**. 

*Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.*

The basic rule is that no student should explicitly share a solution with another student (and thereby circumvent the basic learning process), but it is okay to share general approaches, directions, and so on. If you feel like you have an issue that needs clarification, feel free to contact either me or the TA.

# Part 1: Modeling Text (50 points)

### TF-IDF


First, you will need to download the review.json file from the Resources tab on Piazza, a collection of about 7,000 Yelp reviews we sampled from the [Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge). You'll see that each line corresponds to a review on a particular business. Each review has a unique "ID" and the text content is in the "review" field. You need to load the json file first. We already have done some basic preprocessing on the reviews, so you can just tokenize each review using whitespace.

Here you can treat each review as a document. Given a query, you need to calculate its TF-IDF score in each review.  For this homework, we will define the TF-IDF as follows:

`TF = number of times word occurs in a document`

`IDF = log(total number of documents / number of documents containing the word)`
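As a small illustration of these two definitions, the sketch below computes TF, IDF and the summed score over a made-up three-document corpus. The documents and the helper names (`tf`, `idf`, `tfidf_sum`) are assumptions for illustration only and are not part of the assignment data. It uses the natural log; a different base only rescales the scores and does not change the ranking.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each string plays the role of one review.
docs = {
    "d1": "best bbq in town best ribs",
    "d2": "fun place for a kid",
    "d3": "bbq was fine",
}

# TF: raw count of each whitespace-separated token per document.
term_counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

def tf(word, doc_id):
    return term_counts[doc_id][word]

def idf(word):
    # Document frequency: how many documents contain the word at least once.
    df = sum(1 for counts in term_counts.values() if word in counts)
    if df == 0:
        return 0.0  # assumption: a word seen in no document contributes nothing
    return math.log(len(docs) / df)

def tfidf_sum(query, doc_id):
    # Simple sum of TF * IDF over the query terms.
    return sum(tf(w, doc_id) * idf(w) for w in query.split())

for doc_id in docs:
    print(doc_id, round(tfidf_sum("best bbq", doc_id), 4))
```

Here `d1` scores highest because it contains the rarer term "best" twice, while `d2` contains neither query term and scores zero.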

### A) Ranking with simple sums of TF-IDF scores

To start out with, for a multi-word query, we will rank documents by a simple sum of the TF-IDF scores for the query terms in the document. 

Please calculate this TF-IDF sum score for queries `"best bbq"` and `"kid fun and food"`. You need to report the Top-10 reviews with highest TF-IDF scores for each query. Your output should look like this:

Query "best bbq"

Rank Review_ID score

1 dhskfhjskfhs 0.55555

...



Query "kid fun and food"

Rank Review_ID score

1 dhskfhjskfhs 0.55555

...

In [2]:
import json
import math
import operator

reviews = []
for line in open('review.json', 'r'):
    reviews.append(json.loads(line))

def countTokens(review):
    x = []
    wordcount = {}
    x = review.split()
    for word in x:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
    return wordcount

#Term Frequency per Review
new = []
words = []
for r in reviews:
    temp = {'docID': r['id'], 'wordCount': countTokens(r['review'])}
    words.extend(r['review'].split())
    new.append(temp)

dictionary = set(words)

#Function to calculate IDF Scores
def IDFScore(query):
    idf = {}
    for word in query.split():
        #Document frequency: number of reviews that contain the word
        docCount = sum(1 for n in new if word in n['wordCount'])
        #A word that appears in no review gets an IDF of 0 instead of raising a KeyError
        idf[word] = math.log10(len(new)/docCount) if docCount else 0.0
    return idf

#Function to calculate TF Scores
def TFScore(query):
    x = []
    tf = []
    x = query.split()
    for n in new:
        temp = {}
        temp['docID'] = n['docID']
        for word in x:           
            count = 0
            if word in n['wordCount']:
                count = n['wordCount'][word]
            temp[word] = count
        tf.append(temp)
    return tf

#Function to calculate TFxIDF Scores
def TF_IDF(query):
    tf = TFScore(query)
    idf = IDFScore(query)
    x = []
    x = query.split()
    tf_idf = {}
    i=0
    for i in range(0,len(new)):  
        score = 0 
        for word in x:
            score+=idf[word] * tf[i][word]
        tf_idf[tf[i]['docID']] = score
    return tf_idf 


In [3]:
# Show us the result for "best bbq"
import math
from tabulate import tabulate

score = TF_IDF('best bbq')    
sorted_x = sorted(score.items(), key=operator.itemgetter(1), reverse = True)
data = (sorted_x)[:10]

i=1
print("Rank\t", "Review ID\t", "\t\t\tScore")
for key,value in data:
    print(i,"\t",key,"\t", value)
    i+=1
    

Rank	 Review ID	 			Score
1 	 YbQvHNrjZJ38mnh5rLuq7w 	 11.430532931561633
2 	 P31kXP4oan6ZQm69TN6tIA 	 9.525444109634693
3 	 x5esEK6J9XkA_vbvVbG8Gg 	 8.471542878759161
4 	 mWs26TrBM7ogwCM9UfVJFg 	 7.6203552877077545
5 	 NCfX4AxDvQ3QRyXKtmhVwQ 	 7.6203552877077545
6 	 e5INq6DAZn2zMHicKQl07Q 	 6.566454056832223
7 	 4WTG1-9mw8YHEyaTu8dQww 	 6.566454056832223
8 	 x3n_l3GhBx78y6jWX4fStg 	 5.958313137359845
9 	 Wp8jYXL1DQrgrnZIFmufFg 	 5.715266465780816
10 	 jrEx93eYKIjCW2nrkwjZpQ 	 5.715266465780816


In [4]:
# Show us the result for "kid fun and food"
score = TF_IDF('kid fun and food')    
sorted_x = sorted(score.items(), key=operator.itemgetter(1), reverse = True)

data = (sorted_x)[:10]
i=1
print("Rank\t", "Review ID\t", "\t\t\tScore")
for key,value in data:
    print(i,"\t",key,"\t", value)
    i+=1

Rank	 Review ID	 			Score
1 	 7o_hciiXEMNQkXfVl0F0XQ 	 9.641872501183393
2 	 JKLUXUovJCU6kbcdin74NQ 	 8.692007941255747
3 	 IA8TOfGKI-Il-70BsB6HgA 	 8.132369151667877
4 	 Kytq1NbFIDDCXUculSqT8g 	 7.2912649956941475
5 	 MF6rPRx9jz-g8S5P_ZIdyg 	 7.080045177182006
6 	 bjoedmJ4_DZP5JnfXVaC-w 	 6.825484846867428
7 	 I00B-QG5uTKvwCK7x9ejeA 	 6.8021590858855445
8 	 BVGRJgDJGEhSfgIPCan7vQ 	 6.721602275668411
9 	 wMB3cI3-xhxM_BpmppY9RQ 	 6.425196423168579
10 	 vTGDEQGp6EPlwdMJUnTb7A 	 6.0414680741761755


### B) Ranking with TF-IDF + Cosine

Instead of using the sum of TF-IDF scores, let's try the classic cosine approach for ranking. You should still use the TF-IDF scores to weigh each term, but now use the cosine between the query vector and the document vector to assign a similarity score. You can try the same two queries as before and report the results. (Top-10 reviews which are similar to the query)
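To make the cosine step concrete, here is a minimal sketch that compares sparse TF-IDF vectors stored as dictionaries. The IDF values and the helper names (`cosine`, `tfidf_vector`) are made up for illustration and do not come from the Yelp data.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical IDF values, chosen only for illustration.
idf = {"best": 1.0, "bbq": 2.0, "ribs": 3.0}

def tfidf_vector(text):
    counts = Counter(text.split())
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

query = tfidf_vector("best bbq")
short_doc = tfidf_vector("best bbq")
long_doc = tfidf_vector("best bbq best bbq best bbq")
off_topic = tfidf_vector("ribs ribs")

print(cosine(query, short_doc))
print(cosine(query, long_doc))
print(cosine(query, off_topic))
```

The short and the long document point in the same direction, so both get a cosine of about 1 with the query, whereas the summed TF-IDF score would favor the long one. The off-topic document shares no terms with the query and gets 0.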

In [5]:
import math 

#stores the idf scores for all words in dictionary 
idf_score = {}

for word in dictionary:
    idf_temp = IDFScore(word) 
    idf_score[word] = idf_temp[word] 

def query_magnitude(query):
    #Iterate over the distinct query terms so a repeated term is not counted twice
    y = countTokens(query)
    mag_qi = 0
    for a in y:
        mag_qi += pow(y[a]*idf_score[a],2)
    return math.sqrt(mag_qi)

def document_magnitude():
    doc_mag ={}
    index=0
    for n in new:
        magnitude =0
        mag_di =0
        for word in n['wordCount']:  
            count = n['wordCount'][word] * idf_score[word]
            mag_di += pow(count,2)
        magnitude = math.sqrt(mag_di)
        doc_mag[n['docID']] = magnitude
        index+=1
    return doc_mag

doc_magnitude = document_magnitude()
#print('Doc magnitude', doc_magnitude)

#using tf-idf scores of the query and not just tf score for the query
def cosine_tfidf(query):
    x = countTokens(query)
    tf = TFScore(query)
    query_mag = query_magnitude(query)  #constant for the query, so compute it once
    scores = {}
    for i in range(0,len(tf)):
        qi_di=0
        for word in x:
            qi = x[word] * idf_score[word]
            di = tf[i][word] * idf_score[word] 
            qi_di+= qi*di
        #assign the score once per document, after the dot product is complete
        scores[tf[i]['docID']] = qi_di /(query_mag * doc_magnitude[tf[i]['docID']])
    return scores

In [6]:
# Show us the result for "best bbq"
#uses tf-idf scores of the query in the function
score = cosine_tfidf('best bbq')    
sorted_x = sorted(score.items(), key=operator.itemgetter(1), reverse = True)

data = (sorted_x)[:10]
i=1
print("Rank\t", "Review ID\t", "\t\t\tScore")
for key,value in data:
    print(i,"\t",key,"\t", value)
    i+=1

Rank	 Review ID	 			Score
1 	 x5esEK6J9XkA_vbvVbG8Gg 	 0.5317240946819429
2 	 P31kXP4oan6ZQm69TN6tIA 	 0.46188402591085276
3 	 8p-KEtrrTmLv-o1mKpUy1A 	 0.43897268486712104
4 	 _fNfowXaxXcYChKukMrYeg 	 0.3979856739671412
5 	 NCfX4AxDvQ3QRyXKtmhVwQ 	 0.3665421991949398
6 	 4iCl2qJaz9GPaU4v5bRW2A 	 0.36308346241131384
7 	 HzNxErSCQ2FYfPCbyfHrSQ 	 0.36237683074178917
8 	 e5INq6DAZn2zMHicKQl07Q 	 0.33485418089677293
9 	 Wp8jYXL1DQrgrnZIFmufFg 	 0.3118555623309593
10 	 1tJ_iJX_KZ3zM_9_GRaGTg 	 0.31082470147489855


In [7]:
# Show us the result for "kid fun and food"
#uses tf-idf scores of the query in the function
score = cosine_tfidf('kid fun and food')    
sorted_x = sorted(score.items(), key=operator.itemgetter(1), reverse = True)

data = (sorted_x)[:10]
i=1
print("Rank\t", "Review ID\t", "\t\t\tScore")
for key,value in data:
    print(i,"\t",key,"\t", value)
    i+=1

Rank	 Review ID	 			Score
1 	 IUME6cWFSwH1mSh_1_U81g 	 0.4104441171922874
2 	 6xdziQ46TZWKBpKQPNCSGw 	 0.2897208925770315
3 	 OExraycGW4VxL0Xth1xZ4w 	 0.2511025783811886
4 	 RRGemWMJskG2VQiDzjAOhw 	 0.23813741108195638
5 	 37RfMeDMo8QEVAF8yT31Ww 	 0.21661731670833084
6 	 JKLUXUovJCU6kbcdin74NQ 	 0.21259362982108448
7 	 rM_V3OfrwWA7vHsXsUmq2w 	 0.21146292901726388
8 	 k7HxGMgabFxDUi2XWZ_hOg 	 0.20781006062485471
9 	 5oLxygfaHo2dMf9dbRxc4w 	 0.19992966478169988
10 	 XTSD0-Wi1r_k2EQOCpv8hA 	 0.19296521321517499


### C) Ranking with BM25

Finally, let's try the BM25 approach for ranking. Refer to [https://en.wikipedia.org/wiki/Okapi_BM25](https://en.wikipedia.org/wiki/Okapi_BM25) for the specific formula. You should choose k_1 = 1.2 and b = 0.75. You need to report the Top-10 reviews with highest BM25 scores for each query.
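For reference, the following is a rough sketch of the scoring function as it is written on the Wikipedia page, including that page's IDF variant. The function names and the collection statistics in the usage lines are assumptions made up for illustration. Note that this IDF differs from the `log(N / n)` definition used for TF-IDF earlier in this homework, so absolute scores are not comparable across the two.

```python
import math

K1 = 1.2
B = 0.75

def bm25_idf(n_docs, doc_freq):
    # IDF variant from the Wikipedia article: ln((N - n + 0.5) / (n + 0.5) + 1)
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)

def bm25_term(tf, doc_len, avg_doc_len, idf, k1=K1, b=B):
    # Length normalization: 1 for an average-length document, larger for longer ones.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)

def bm25_score(query, doc_counts, doc_len, avg_doc_len, n_docs, doc_freqs):
    score = 0.0
    for word in query.split():
        tf = doc_counts.get(word, 0)
        if tf == 0:
            continue  # a term absent from the document contributes nothing
        idf = bm25_idf(n_docs, doc_freqs[word])
        score += bm25_term(tf, doc_len, avg_doc_len, idf)
    return score

# Hypothetical collection statistics, made up for illustration.
n_docs = 1000
doc_freqs = {"best": 300, "bbq": 20}
avg_doc_len = 100.0
review_counts = {"best": 1, "bbq": 2}

# Same term counts, different document lengths.
print(bm25_score("best bbq", review_counts, 50, avg_doc_len, n_docs, doc_freqs))
print(bm25_score("best bbq", review_counts, 400, avg_doc_len, n_docs, doc_freqs))
```

With identical term counts, the 50-token document scores higher than the 400-token one, which is the length normalization controlled by `b` at work.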


In [8]:
#Your code here
k_1 = 1.2
b = 0.75

def bm25_helper():
    doc_tf = {}
    for n in new:
        count =0
        for key,value in n['wordCount'].items():
            count+=value
        doc_tf[n['docID']] = count
    return doc_tf

def DocLength():
    count = 0
    for n in new:
        for key,value in n['wordCount'].items():
            count+=value
    return count

avg_docLength = DocLength()/len(new)
doc_tf = bm25_helper()

def bm25(query):
    x = {}
    x = countTokens(query)
    tf = TFScore(query)
    scores = {}
    for i in range(0,len(tf)):
        num=0
        temp =0
        for word in x:
            if word in dictionary:
                num = tf[i][word] * idf_score[word] *(k_1+1)
                den = tf[i][word] + (k_1 * ((1-b) + (b* doc_tf[tf[i]['docID']]/avg_docLength)))
                temp += num/den
        scores[tf[i]['docID']] = temp
    return scores
                             

In [9]:
# Show us the result for "best bbq"
score = bm25('best bbq')    
sorted_x = sorted(score.items(), key=operator.itemgetter(1), reverse = True)

data = (sorted_x)[:10]
i=1
print("Rank\t", "Review ID\t", "\t\t\tScore")
for key,value in data:
    print(i,"\t",key,"\t", value)
    i+=1

Rank	 Review ID	 			Score
1 	 x5esEK6J9XkA_vbvVbG8Gg 	 4.222154734433089
2 	 xpm6TgDiHaQdEDlErFsqvQ 	 4.088933945230742
3 	 4WTG1-9mw8YHEyaTu8dQww 	 3.891027458585551
4 	 e5INq6DAZn2zMHicKQl07Q 	 3.731673460329797
5 	 GASAd_gPBY_eWIL9XJwuNA 	 3.464216011656184
6 	 P31kXP4oan6ZQm69TN6tIA 	 3.4222718800768748
7 	 8p-KEtrrTmLv-o1mKpUy1A 	 3.3091199999631034
8 	 HzNxErSCQ2FYfPCbyfHrSQ 	 3.229482350635871
9 	 -RApX_RMzJLnpommDpQfKQ 	 3.204095838695621
10 	 1tJ_iJX_KZ3zM_9_GRaGTg 	 3.1239630631638624


In [10]:
# Show us the result for "kid fun and food"
score = bm25('kid fun and food')    
sorted_x = sorted(score.items(), key=operator.itemgetter(1), reverse = True)

data = (sorted_x)[:10]
i=1
print("Rank\t", "Review ID\t", "\t\t\tScore")
for key,value in data:
    print(i,"\t",key,"\t", value)
    i+=1


Rank	 Review ID	 			Score
1 	 kDwMMrSiB_AlV0erhVigFg 	 3.440335841068963
2 	 6xdziQ46TZWKBpKQPNCSGw 	 3.054596625538573
3 	 UMqvuRtTxJFuWbgT6qO9cg 	 3.0236961472468282
4 	 TVq6HhhJizKM1mReF9hvJQ 	 3.0133396564389154
5 	 OExraycGW4VxL0Xth1xZ4w 	 3.012355781988829
6 	 nuKIKXuQ51eRywuCcoX3fQ 	 2.9815155466598267
7 	 k7HxGMgabFxDUi2XWZ_hOg 	 2.9796259517085786
8 	 JKLUXUovJCU6kbcdin74NQ 	 2.96139640420127
9 	 EDQzFQ7yYbRVUWCNA4rTOQ 	 2.9553169750109785
10 	 BLQYsPFFAezpbbF-1dzD4Q 	 2.945529446180057


Briefly discuss the differences you see between the three methods. Is there one you prefer? 

TF-IDF is the simplest ranking function. It measures the relative concentration of a term in a given piece of text: if a token is common in a particular review but relatively rare elsewhere, the TF-IDF score of that review will be high. Because TF here is a raw count with no length normalization, document length also affects the score: a token that occurs 5 times in a 500-word document outscores a token that occurs twice in a 20-word document, even though the shorter document is far more concentrated on the term.

Cosine document similarity is invariant to the actual number of times each term is used in a document; only the relative frequencies matter. This way a long document with many words can be similar to a short document with fewer words but similar frequencies. So, this is an improvement over using summed TF-IDF scores alone to judge the relevance of documents.

Okapi BM25 uses a probabilistic model that is sensitive to term frequency and document length. We use the tuning parameters k_1 and b to calibrate the term frequency scaling and the document length scaling, respectively. 

   * k_1: A larger term frequency means more relevance, but the term-frequency component saturates: it asymptotically approaches k_1 + 1. Changing k_1 moves this asymptote. A higher k_1 delays the point of saturation, which widens the relevance difference between documents with higher and lower term frequencies.
    
   * b: The constant b will allow us to finely tune how much influence our L (document length relative to the average document length) value has on scoring.
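The effect of the two parameters can be checked numerically. The sketch below isolates the term-frequency factor of BM25 (with the IDF left out) and the length-normalization factor; the function names are illustrative, not taken from any library. In the Wikipedia form of the formula, the term-frequency factor is bounded above by `k1 + 1`.

```python
def tf_component(tf, k1, length_norm=1.0):
    # Term-frequency factor of BM25 for a document with the given length normalization.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)

def length_norm(len_ratio, b):
    # len_ratio is document length divided by the average document length.
    return 1.0 - b + b * len_ratio

# Saturation: the factor grows with tf but never exceeds k1 + 1.
for k1 in (1.2, 5.0):
    values = [round(tf_component(tf, k1), 3) for tf in (1, 2, 5, 20, 1000)]
    print(f"k1={k1}: {values} (upper bound {k1 + 1.0})")

# Length scaling: b = 0 ignores length, b = 1 applies it fully.
for b in (0.0, 0.75, 1.0):
    print(f"b={b}: length_norm for a document twice the average = {length_norm(2.0, b)}")
```

With the larger `k1`, the values keep increasing over a wider range of term frequencies before flattening out, which is the "stretching" described above.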
    
I prefer BM25 over the other methods as it ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document.

# Part 2: Link Analysis (40 points)

## A Trust Graph


In this part, we're going to adapt the classic HITS approach to allow us to find not the most authoritative web pages, but rather to find the most trustworthy users. [Epinions.com](https://snap.stanford.edu/data/soc-Epinions1.html) is a general consumer review site with a who-trust-whom online social network. Members of the site can decide whether to ''trust'' each other. All the trust relationships interact and form the Web of Trust which is then combined with review ratings to determine which reviews are shown to the user. (Refer to: Richardson, Matthew, Rakesh Agrawal, and Pedro Domingos. "Trust management for the semantic web." International semantic Web conference. Springer, Berlin, Heidelberg, 2003.)

So, instead of viewing the world as web pages with hyperlinks (where pages = nodes, hyperlinks = edges), we're going to construct a graph of Epinions users and their "trust" in other users (so user = node, trust behavior = edge). Over this Epinions-user graph, we can apply the HITS approach to order the users by their hub-ness and their authority-ness.

You need to download the *Epinions_trust.txt* file from the Resources tab on Piazza. Each line represents the trust relationship between two users. Here is a toy example. Suppose you are given the following four lines:

* diane trust bob
* charlie trust bob 
* charlie trust alice
* bob trust charlie

The "trust" between each user pair denotes a directed edge between two nodes. E.g., the "diane" node has a directed edge to the "bob" node (as indicated by the first line). 

You should build a graph by parsing the data in the file we provide called *Epinions_trust.txt*. (Note: The edges are binary and directed.)

**Notes:**

* The edges are binary and directed.
* A user can't trust himself/herself, so self-loops should be ignored.
* Later you will need to implement the HITS algorithm on the graph you build here.
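As a quick sanity check on the parsing step, the toy example above can be turned into a graph as follows. This is only one possible sketch: the `build_graph` helper and its return layout are assumptions for illustration. Sets are used so that a repeated line cannot create a duplicate edge, which matches the "binary" note above.

```python
from collections import defaultdict

toy_lines = [
    "diane trust bob",
    "charlie trust bob",
    "charlie trust alice",
    "bob trust charlie",
]

def build_graph(lines):
    out_edges = defaultdict(set)  # user -> users they trust
    in_edges = defaultdict(set)   # user -> users who trust them
    nodes = set()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        src, _, dst = parts
        nodes.update((src, dst))
        if src == dst:
            continue  # a user cannot trust themselves
        out_edges[src].add(dst)
        in_edges[dst].add(src)
    return nodes, out_edges, in_edges

nodes, out_edges, in_edges = build_graph(toy_lines)
n_edges = sum(len(targets) for targets in out_edges.values())
print(len(nodes), "nodes,", n_edges, "edges")  # prints: 4 nodes, 4 edges
```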

In [11]:
# Here define your function for building the graph 
# by parsing the input file 
# Insert as many cells as you want
  
lines = [line.rstrip('\n') for line in open('Epinions_trust.txt')]
#lines = [line.rstrip('\n') for line in open('test.txt')]
nodes = []
edges = 0
in_relations = {}
out_relations = {}

for line in lines:
    x = line.split()
    nodes.append(x[0])
    nodes.append(x[2])
    if x[0] != x[2]:
        edges+=1
        #set up outgoing links list
        if x[0] not in out_relations:
            out_relations[x[0]] = [x[2]]
        else:
            out_relations[x[0]].append(x[2])
        #set up incoming links list
        if x[2] not in in_relations:
            in_relations[x[2]] = [x[0]]
        else:
            in_relations[x[2]].append(x[0])       
    

nodes = set(nodes)

hubScores = {}
authScores = {}
updated_hubScores = {}
updated_authScores = {}

#Initialize the hub scores and authority scores to 1
for node in nodes:
    authScores[node] = 1
    hubScores[node] = 1
    updated_hubScores[node] = 0
    updated_authScores[node] = 0

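#Run a fixed number of passes. Scores are not normalized between passes, so they
#grow very large; a constant scale factor does not change the ranking.
for i in range(0,10):
    for node in nodes:
        updated_authScores[node] = 0
        if node in in_relations:
            for a in in_relations[node]:
                updated_authScores[node]+= hubScores[a]
    for node in nodes:
        updated_hubScores[node] = 0
        if node in out_relations:
            for b in out_relations[node]:
                updated_hubScores[node]+= authScores[b]  
    #After the first pass these names refer to the same dicts as the updated_* names,
    #so from the second pass on the hub update reads the authorities computed in that pass.
    hubScores = updated_hubScores
    authScores = updated_authScores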


sorted_hubScores = sorted(hubScores.items(), key=operator.itemgetter(1), reverse = True)
data = sorted_hubScores[:10]

print("Hub Scores")
print()
print("User\t", "\t\t\tScore")
for key,value in data:
    print(".",key,"\t-\t", value)
    
print()
print()

sorted_authScores = sorted(authScores.items(), key=operator.itemgetter(1), reverse = True)
data = sorted_authScores[:10]

print("Authority Scores")
print()
print("User\t", "\t\t\tScore")
for key,value in data:
    print(".",key,"\t-\t", value)

Hub Scores

User	 			Score
. charles 	-	 31978527469285107503398956990
. teanna3 	-	 31224728902536158575166969992
. JediKermit 	-	 28720041621586065936064334422
. KCFemme 	-	 26588618022440755229063834103
. melissasrn 	-	 25447413756249117962808882672
. missi31 	-	 25243341230410873004625185451
. jeanniekerns 	-	 25136558945674315113779792452
. jag2112 	-	 25001351954518220971683352037
. mrssmoopy 	-	 24999440581661990374327214788
. briandalsmom 	-	 24519539977401360484295344585


Authority Scores

User	 			Score
. melissasrn 	-	 1747346702470809014635839810
. shantel575 	-	 1390705444188771313883977137
. surferdude7 	-	 1374966583962396038673801650
. sblaydes 	-	 1114147290678058774647546495
. tiffer0220 	-	 1099995654533266852507478649
. opinionated3 	-	 1080550684984421298269933206
. patch3boys 	-	 928417487030403490098822102
. merlot 	-	 914777368411385714398320830
. pogomom 	-	 893869996236257935358818071
. chrisceb 	-	 848446702115310752962326066


Please show us the size of the graph, i.e., the number of nodes and edges


In [12]:
# Call your function to print out the size of the graph
# How you maintain the graph is totally up to you
print('Number of Nodes in graph:', len(nodes))
print('Number of Edges in graph:', edges)

Number of Nodes in graph: 658
Number of Edges in graph: 6378


## HITS Implementation

Your program will return the top 10 users with highest hub and authority scores. The **output** should be like:

Hub Scores

* user1 - score1
* user2 - score2
* ...
* user10 - score10

Authority Scores

* user1 - score1
* user2 - score2
* ...
* user10 - score10

You should follow these **rules**:

* Assume all nodes start out with equal scores.
* It is up to you to decide when to terminate the HITS calculation.
* There are HITS implementations out there on the web. However, remember, your code should be **your own**.


**Hints**:
* If you're using the matrix style approach, you should use [numpy.matrix](https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html).
* Scipy is built on top of Numpy and has support for sparse matrices. You most likely will not need to use Scipy unless you'd like to try out their sparse matrices.
* If you choose to use Numpy (and Scipy), please make sure your Anaconda environment includes their latest versions.
* Test your parsing and HITS calculations using a handful of trust relationships, before moving on to the entire file we provide.
* We will evaluate the user ranks you provide as well as the quality of your code. So make sure that your code is clear and readable.
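One way to follow the last two hints is to run HITS on the four-line toy graph first. The sketch below is a dictionary-based variant, not the required implementation: it normalizes both score vectors to unit length after every pass and stops when the scores stop changing. The function names, the tolerance and the iteration cap are assumptions chosen for illustration.

```python
import math

def normalize(scores):
    # Scale the vector to unit Euclidean length so scores stay bounded.
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {n: v / norm for n, v in scores.items()}

def hits(out_edges, max_iter=100, tol=1e-9):
    nodes = set(out_edges)
    for targets in out_edges.values():
        nodes.update(targets)
    in_edges = {n: set() for n in nodes}
    for src, targets in out_edges.items():
        for dst in targets:
            in_edges[dst].add(src)

    hub = {n: 1.0 for n in nodes}   # all nodes start with equal scores
    auth = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # Authority: sum of the hub scores of the users who trust this user.
        new_auth = normalize({n: sum(hub[s] for s in in_edges[n]) for n in nodes})
        # Hub: sum of the (freshly updated) authority scores of the trusted users.
        new_hub = normalize({n: sum(new_auth[d] for d in out_edges.get(n, ()))
                             for n in nodes})
        delta = max(abs(new_hub[n] - hub[n]) + abs(new_auth[n] - auth[n])
                    for n in nodes)
        hub, auth = new_hub, new_auth
        if delta < tol:
            break
    return hub, auth

toy_graph = {
    "diane": {"bob"},
    "charlie": {"bob", "alice"},
    "bob": {"charlie"},
}
hub, auth = hits(toy_graph)
print("top hub:", max(hub, key=hub.get))
print("top authority:", max(auth, key=auth.get))
```

On this toy graph, charlie ends up as the strongest hub (he trusts the two best authorities) and bob as the strongest authority (he is trusted by two users), which is easy to verify by hand before moving to the full file.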

# Part 3: Search Engine Optimization (10 + 5 points)

For this part, your goal is to put on your "[search engine optimization](https://en.wikipedia.org/wiki/Search_engine_optimization)" hat. Your job is to create a webpage that scores highest for the query: **sajfd hfafbjhd** --- two terms, lower case, no quotes. As of today (Jan 24, 2019), there are no hits for this query on either Google or Bing. Based on our discussions of search engine ranking algorithms, you know that several factors may impact a page's rank. Your goal is to use this knowledge to promote your own page to the top of the list.

What we're doing here is a form of [SEO contest](https://en.wikipedia.org/wiki/SEO_contest). While you have great latitude in how you approach this problem, you are not allowed to engage in any unethical or illegal behavior. Please read the discussion of "white hat" versus "black hat" SEO over at [Wikipedia](https://en.wikipedia.org/wiki/Search_engine_optimization).


**Rules of the game:**

* Somewhere in the page (possibly in the non-viewable source html) you must include your name or some other way for us to identify you.
* Your target page may only be a TAMU student page, a page on your own webserver, a page on a standard blog platform (e.g., wordpress), or some other primarily user-controlled page
* Your target page CAN NOT be a twitter account, a facebook page, a Yahoo Answers or similar page
* No wikipedia vandalism
* No yahoo/wiki answers questions
* No comment spamming of blogs
* If you have concerns/questions/clarifications, please post on Piazza and we will discuss

For your homework turnin for this part, you should provide us the URL of your target page and a brief discussion (2-4 paragraphs) of the strategies you are using. We will issue the query and check the rankings at some undetermined time in the next couple of weeks. You might guess that major search engines take some time to discover and integrate new pages: if I were you, I'd get a target page up immediately.

**Grading:**

* 5 points for providing a valid URL
* 5 points for a well-reasoned discussion of your strategy

**Bonus:**

* 1 point for your page appearing in the top-20 on Google or Bing
* 1 more point for your page appearing in the top-10 on Google or Bing
* 1 more point for your page appearing in the top-5 on Google or Bing
* 2 more points for your page being ranked first by Google or Bing. And, a vigorous announcement in class, and a high-five for having the top result!

What's the URL of your page?

http://people.tamu.edu/~aninditamishra/

What's your strategy? (2-4 paragraphs)

I kept the following in mind while trying to improve the ranking of my webpage:

* The website had to be mobile friendly: This meant my web page needed to load quickly, have content that is easy to scan, adapt to the screen size, and have clickable links spaced properly.


* Keyword frequency: This helps search engines judge the relevance of a webpage to a query. If my keyword frequency were too low, I would have trouble ranking for that keyword. However, if it were too high, it would send a negative signal to the search engines. "Keyword stuffing" (a black hat technique) is when the keyword frequency is unnaturally high and distracting to users.


* I added meta tags for the description and keywords, both set to the given phrase, because search engines often display the meta description in search results, which can strongly influence user click-through rates.


* Picked a short title for the page that contains the phrase: The title tag and the meta description together give a brief idea of what the content is about, but the title tag is displayed in the SERP and therefore influences user clicks.


* Include an image or video on the page and annotate it with alt text set to the phrase.


* Cross-linking: Link popularity is one of the main factors search engines use to determine the value, importance and relevance of sites on a given topic, and it is reflected in the site's search ranking. Cross-linking lets users reach sites with content similar to what they are already viewing, which may be of further interest to them. So, I set up another webpage using Google Sites, created a GitHub repository called "sajfd hfafbjhd", uploaded a YouTube video with the same name and then created links between them while keeping the following in mind:

    1. Use keyword-rich anchor text around the cross links.

    2. Avoid cross-linking on every page, because it dilutes the quality of the link and will appear questionable to the search engines.

    3. Avoid creating a "link ring". A link ring is when several sites cross-link to each other, with no other sites linking into the ring they create. Search engines will penalize this kind of cross-linking immediately.
    
        
* I also added google-site-verification meta tag to verify ownership of the webpage. 


* Share the page with people: The ranking of a webpage is influenced by user engagement. Search engines monitor and remember user interactions. If a page consistently keeps people on it for longer than average, the algorithm will adjust the search results to favor that site.


* Regularly update the webpage: "Freshness" of a webpage is also something that search engines consider while ranking. So, I made sure I added new content or modified existing content to keep my page fresh.
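The on-page items in this list (title, meta description, image alt text) are easy to verify mechanically. The following is a minimal sketch using Python's standard `html.parser`; the `OnPageSignals` class, the `check_page` helper and the sample HTML are all made up for illustration and are not the actual source of my page.

```python
from html.parser import HTMLParser

class OnPageSignals(HTMLParser):
    """Collects the title text, the meta description and all image alt texts."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta_description = ""
        self.image_alts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = attrs.get("content") or ""
        elif tag == "img":
            self.image_alts.append(attrs.get("alt") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def check_page(html, phrase):
    parser = OnPageSignals()
    parser.feed(html)
    phrase = phrase.lower()
    return {
        "title": phrase in parser.title.lower(),
        "meta_description": phrase in parser.meta_description.lower(),
        "image_alt": any(phrase in alt.lower() for alt in parser.image_alts),
    }

# Hypothetical page source, for illustration only.
sample_page = """<html><head>
<title>sajfd hfafbjhd demo</title>
<meta name="description" content="A page about sajfd hfafbjhd.">
</head><body>
<img src="photo.png" alt="sajfd hfafbjhd">
</body></html>"""

print(check_page(sample_page, "sajfd hfafbjhd"))
```

A check like this only confirms that the phrase is present in those places; it says nothing about how a search engine will weigh them.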


## Collaboration declarations

*If you collaborated with anyone (see Collaboration policy at the top of this homework), you can put your collaboration declarations here.*