# Programming Assignment 05
Student: John Wu

In [2]:
import sys, os, nltk, string
import numpy as np, tokenHelper as tkn
from collections import Counter

In [3]:
fName = os.path.join(os.getcwd(), '19991220-Excite-QueryLog.utf8.tsv')

header = ['timestamp', 'userID', 'firstRank', 'query'] # variable names
raw = np.genfromtxt(fName, delimiter='\t', dtype=None, names=header, 
              comments=None, encoding='utf-8')
qry = np.char.lower(raw['query'])

## Simple Analysis

### Q1. What is the average number of queries per user id?

In [9]:
uid,uidInv = np.unique(raw['userID'], return_inverse=True)
print('On average, %.2f queries per user.'%(len(raw)/len(uid)))

On average, 4.61 queries per user.


### Q2. Report the mean and median query length in both words and characters.
#### In Characters
Query length in characters is simply the length of each query string; the mean and median are computed over those lengths.

In [35]:
tmp = np.char.count(qry, '') - 1
print("Mean query length is %.2f characters."%np.mean(tmp))
print("Median query length is %.2f characters."%np.median(tmp))

Mean query length is 20.51 characters.
Median query length is 17.00 characters.
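The `np.char.count(qry, '') - 1` expression relies on the empty string matching once before every character and once at the end of a string, so the count is the string length plus one. A minimal sketch on made-up queries (not taken from the log) that compares it against `np.char.str_len`:

```python
import numpy as np

# Made-up sample queries, not taken from the log.
sample = np.array(['yahoo', 'britney spears', 'mp3'])

# The empty string matches at len(s) + 1 positions, hence the "- 1".
by_count = np.char.count(sample, '') - 1

# np.char.str_len computes the element-wise lengths directly.
by_len = np.char.str_len(sample)

print(by_count.tolist())
print(by_len.tolist())
```

Both approaches should agree; `np.char.str_len` states the intent more directly.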


#### In Words
Query length in words is slightly more complicated, because it depends on the tokenization process. We take a simple approach:
1. Split the query on any run of whitespace
1. Strip leading and trailing punctuation from each token
1. Count the token if what remains is alphanumeric

The effect of this is:
1. Tokens with internal punctuation, such as *"Chinese-American"* or `8.8.8.8`, are not counted, because `str.isalnum` returns `False` when any non-alphanumeric character remains.
1. Tokens like *"--"* are disregarded, e.g. *"Leave -- no, come back"* is only 4 words.
1. A mix of letters and numbers, e.g. `python3`, counts as one word.

In [26]:
# Count the tokens that are alphanumeric after stripping surrounding punctuation
f = lambda x: sum([s.strip(string.punctuation).isalnum() for s in x.split()])
tmp = [f(x) for x in qry]
print("Mean query length is %.2f words."%np.mean(tmp))
print("Median query length is %.2f words."%np.median(tmp))

Mean query length is 3.12 words.
Median query length is 2.00 words.
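The counting rule can be checked on a few hand-written strings. The sketch below wraps the same expression in a named function (`count_words` is a hypothetical name, and the sample strings are made up). Because `str.strip` only removes punctuation at the ends of a token and `str.isalnum` requires every remaining character to be a letter or digit, tokens with internal punctuation are not counted.

```python
import string


def count_words(query):
    """Count whitespace-separated tokens that are alphanumeric after
    stripping leading and trailing punctuation."""
    return sum(tok.strip(string.punctuation).isalnum() for tok in query.split())


# Made-up examples, not taken from the log.
examples = ['Leave -- no, come back', 'python3', 'Chinese-American', '8.8.8.8']
for q in examples:
    print(q, '->', count_words(q))
```

If hyphenated words or dotted addresses should count as words, the `isalnum` test would need to be relaxed, which would also change the reported mean and median.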


## Q3. What percentage of queries are mixed case? All upper case? All lower case?
### All upper case
NumPy's `np.char.isupper` tests each query element-wise. Note: it returns `False` for strings that contain no alphabetic characters.

In [38]:
qUpper = np.char.isupper(raw['query'])
print( '{:.2%}'.format(np.mean(qUpper)) )

4.38%


### All lower case
Likewise, `np.char.islower` tests each query element-wise. Note: it returns `False` for strings that contain no alphabetic characters.

In [41]:
qLower = np.char.islower(raw['query'])
print( '{:.2%}'.format(np.mean(qLower)) )

66.94%


### Mixed case
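If a string is neither all lower case nor all upper case, it is treated as mixed case. Note: because `isupper` and `islower` both return `False` for entirely non-alphabetic strings, those queries are also counted as mixed case here.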

In [44]:
print( '{:.2%}'.format( np.mean(np.logical_and(~qLower,~qUpper)) ))

28.68%
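The three masks partition the queries, which is why the percentages sum to 100%. A minimal sketch of the same classification with plain `str` methods (`case_class` is a hypothetical name, and the sample strings are made up) shows where strings without any cased characters end up:

```python
def case_class(query):
    """Classify a query the same way as the three boolean masks."""
    if query.isupper():
        return 'upper'
    if query.islower():
        return 'lower'
    # Neither test passed: truly mixed case, or no cased characters at all.
    return 'mixed'


# Made-up examples, not taken from the log.
for q in ['YAHOO', 'yahoo', 'Britney Spears', 'MP3', '90210']:
    print(q, '->', case_class(q))
```

To report digit-only or symbol-only queries separately, one could test `any(c.isalpha() for c in query)` first; that would lower the mixed-case figure.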


## Q4. What percent of the time does a user request only the top 10 results?

In [49]:
# firstRank == 0 is taken to mean the request starts at the first page of results
print('{:.2%}'.format(np.mean(raw['firstRank']==0)))

77.54%


## Q5. What percent of unique queries are in the form of an explicit question? What is the most common type of question?

## Q6. What are the 20-most common queries issued?
We use the [`Counter`](https://docs.python.org/2/library/collections.html#collections.Counter) class to count and retrieve the 20 most common queries. All queries are converted to lower case prior to counting.

In [32]:
qryCount = Counter(qry)
qs,n = zip(*qryCount.most_common(20))
print('\n'.join(['%02u: %s'%(n+1,x) for n,x in enumerate(qs)]))

01: sex
02: yahoo
03: internal site admin check from kho
04: chat
05: pokemon
06: porn
07: horoscopes
08: britney spears
09: mp3
10: games
11: weather
12: hotmail
13: maps
14: sitescope test
15: christmas
16: www.yahoo.com
17: yahoo.com
18: ebay
19: recipes
20: horoscope
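`Counter.most_common(k)` returns a list of `(item, count)` pairs sorted by descending count, and `zip(*pairs)` splits that list into one tuple of items and one tuple of counts. A small sketch on made-up queries that also prints the counts alongside the ranks:

```python
from collections import Counter

# Made-up queries, not taken from the log.
queries = ['yahoo', 'chat', 'yahoo', 'mp3', 'chat', 'yahoo']

counts = Counter(queries)
top = counts.most_common(2)      # list of (query, count) pairs
labels, freqs = zip(*top)        # transpose into two tuples

for rank, (label, freq) in enumerate(top, start=1):
    print('%02u: %s (%d)' % (rank, label, freq))
```

Printing the counts next to each query makes it easier to see how steeply the frequencies fall off.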


## Q7. What are the 20 most common non-stopwords appearing in queries?

In [5]:
wds = Counter()
for n,txt in enumerate(qry):
    wds.update(tkn.tokenizeNoPunctStopword(txt))
wd,n = zip(*wds.most_common(20))
print('\n'.join(['%02u: %s'%(n+1,x) for n,x in enumerate(wd)]))

01: find
02: free
03: pictures
04: sex
05: information
06: christmas
07: nude
08: new
09: pics
10: buy
11: online
12: get
13: web
14: music
15: women
16: games
17: porn
18: cards
19: stories
20: site


## Q8. What percent of queries contain stopwords like ‘and’, ‘the’, ‘of’, ‘in’, ‘at’?

In [28]:
stopWords = tkn.engStopWords
hasStop = lambda txt: any(x in stopWords for \
                      x in tkn.tokenizeNoPunct(txt))
print('{:.2%}'.format(np.mean([hasStop(q) for q in qry])))

26.99%
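The helper module `tokenHelper` is not shown in this notebook, so the sketch below uses hypothetical stand-ins: `STOP_WORDS` is a tiny made-up list in place of `tkn.engStopWords`, and `tokenize_no_punct` is an assumed approximation of `tkn.tokenizeNoPunct`. It illustrates the same "any token is a stopword" test on made-up queries.

```python
import string

# Hypothetical stand-in for tkn.engStopWords (the real list is likely longer).
STOP_WORDS = {'and', 'the', 'of', 'in', 'at'}


def tokenize_no_punct(text):
    """Assumed approximation of tkn.tokenizeNoPunct: lower-case, split on
    whitespace, strip surrounding punctuation, drop empty tokens."""
    stripped = (tok.strip(string.punctuation) for tok in text.lower().split())
    return [tok for tok in stripped if tok]


def has_stop(text):
    """True if at least one token of the query is a stopword."""
    return any(tok in STOP_WORDS for tok in tokenize_no_punct(text))


# Made-up queries, not taken from the log.
sample = ['lord of the rings', 'pokemon', 'weather in boston', 'mp3']
share = sum(has_stop(q) for q in sample) / len(sample)
print('{:.2%}'.format(share))
```

Testing whole tokens rather than substrings matters here: a substring test for `in` would also match queries such as `windows`.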


## Q9. What are the 10 most common non-stopwords appearing in queries that contain the word download?

In [33]:
wds = Counter()
for n,txt in enumerate(qry):
    bags = tkn.tokenizeNoPunctStopword(txt)
    if 'download' in bags:
        wds.update(bags)
wds.pop('download', None) # needs to remove 'download' from words
wd,n = zip(*wds.most_common(10))
print('\n'.join(['%02u: %s'%(n+1,x) for n,x in enumerate(wd)]))

01: free
02: games
03: mp3
04: music
05: find
06: game
07: software
08: full
09: windows
10: songs
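The membership test `'download' in bags` matches the exact token only, so queries containing `downloads` or `downloading` are skipped. A sketch of the same co-occurrence count on made-up queries, with `tokens` as an assumed stand-in for `tkn.tokenizeNoPunctStopword` and a tiny made-up stopword list:

```python
import string
from collections import Counter

# Hypothetical stopword list; the one in tokenHelper is not shown.
STOP_WORDS = {'and', 'the', 'of', 'in', 'at', 'for', 'a'}


def tokens(text):
    """Assumed stand-in for tkn.tokenizeNoPunctStopword."""
    stripped = (tok.strip(string.punctuation) for tok in text.lower().split())
    return [tok for tok in stripped if tok and tok not in STOP_WORDS]


# Made-up queries, not taken from the log.
sample = ['free mp3 download', 'download free games',
          'games for windows', 'mp3 downloads', 'download the game']

co_words = Counter()
for q in sample:
    bag = tokens(q)
    if 'download' in bag:          # exact token match only
        co_words.update(bag)
co_words.pop('download', None)     # the target word itself is not of interest

print(co_words.most_common(3))
```

Whether inflected forms should be included is a judgment call; stemming the tokens first would be one way to fold them together.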


## Q10. What percentage of queries were asked by only one user?

In [34]:
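# Approximation: a unique query counts as single-user if it occurs exactly once
# in the log, so a query repeated by the same user is not counted.
oneUser = [1 if n==1 else 0 for q,n in qryCount.items()]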
print('{:.2%}'.format(np.mean(oneUser)))

72.64%
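Counting occurrences treats a query as single-user only when it appears exactly once in the log. A stricter reading of the question groups the distinct user IDs per query instead. The sketch below contrasts the two on made-up `(userID, query)` records; the record layout and variable names are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Made-up (userID, query) records, not taken from the log.
records = [('u1', 'yahoo'), ('u2', 'yahoo'),
           ('u1', 'mp3'), ('u1', 'mp3'),
           ('u3', 'weather')]

# Collect the set of distinct users that issued each (lower-cased) query.
users_per_query = defaultdict(set)
for uid, query in records:
    users_per_query[query.lower()].add(uid)

by_user = [len(users) == 1 for users in users_per_query.values()]
share_by_user = sum(by_user) / len(by_user)

# Occurrence-based approximation: the query appears exactly once.
occurrences = Counter(query.lower() for _, query in records)
by_count = [n == 1 for n in occurrences.values()]
share_by_count = sum(by_count) / len(by_count)

print('{:.2%} by distinct users'.format(share_by_user))
print('{:.2%} by occurrence count'.format(share_by_count))
```

In this toy data the query `mp3` is issued twice by the same user, so the two measures disagree; on the real log the size of the gap depends on how often users repeat their own queries.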


## Q11. Find 10 examples of misspelled words (but not 10 examples of the same misspelled word)

## Q12. Which occurs in queries more often "Al Gore" or "Johns Hopkins"? "Johns Hopkins" or "John Hopkins"?
### Al Gore vs. Johns Hopkins

In [35]:
alGore = np.char.find(qry, 'al gore') >= 0
JsHs = np.char.find(qry, 'johns hopkins') >= 0
if np.sum(alGore) > np.sum(JsHs):
    print( 'Al Gore' )
else:
    print( 'Johns Hopkins' )

Al Gore


### Johns Hopkins vs. John Hopkins

In [36]:
JHs = np.char.find(qry, 'john hopkins') >= 0
if np.sum(JHs) > np.sum(JsHs):
    print( 'John Hopkins' )
else:
    print( 'Johns Hopkins' )

Johns Hopkins
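`np.char.find` performs a plain substring search, so `'al gore'` would also match inside a query like `digital gore`. Note that `'john hopkins'` is not a substring of `'johns hopkins'`, so the two Hopkins counts do not overlap. A sketch of a stricter word-boundary match using the standard `re` module, on made-up queries (`contains_phrase` is a hypothetical helper):

```python
import re


def contains_phrase(query, phrase):
    """True if `phrase` occurs in `query` on word boundaries (case-insensitive)."""
    pattern = r'\b' + re.escape(phrase.lower()) + r'\b'
    return re.search(pattern, query.lower()) is not None


# Made-up queries, not taken from the log.
sample = ['al gore campaign', 'digital gore movies',
          'johns hopkins university', 'john hopkins']

for q in sample:
    print(q,
          '| substring:', 'al gore' in q,
          '| word match:', contains_phrase(q, 'al gore'))
```

Whether the stricter match changes the comparison on the real log is not known from this notebook; it only removes one source of false positives.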


## Q13. How often do URLs appear in queries?


````
Simple Analysis (70 points)
Please answer any ten of the following questions (Q1 to Q13).
Q1. What is the average number of queries per user id?
Q2. Report the mean and median query length in both words and characters.
Q3. What percentage of queries are mixed case? All upper case? All lower case?
Q4. What percent of the time does a user request only the top 10 results?
Q5. What percent of unique queries are in the form of an explicit question (i.e., look for patterns such as starting with Wh-words, or ending with a '?' symbol). What is the most common type of question?
Q6. What are the 20-most common queries issued?
Q7. What are the 20 most common non-stopwords appearing in queries?
Q8. What percent of queries contain stopwords like ‘and’, ‘the’, ‘of’, ‘in’, ‘at’?
Q9. What are the 10 most common non-stopwords appearing in queries that contain the word download?
Q10. What percentage of queries were asked by only one user?
Q11. Find 10 examples of misspelled words (but not 10 examples of the same misspelled word)
Q12. Which occurs in queries more often "Al Gore" or "Johns Hopkins"? "Johns Hopkins" or "John Hopkins"?
Q13. How often do URLs appear in queries?
Other Analysis (30 points)
Answer any three of the following questions (Q14 to Q20).
Q14. Estimate the percentage of queries that contain a person's name? (Or alternatively, a company name.)
Q15. Can you find addresses, phone numbers, and other identifiers in the log file? Is it likely that this web query log puts anyone's privacy at risk? Justify your response.
Q16. How often is search engine “query” syntax used, like phrases in quotes, Boolean operators, or ‘+’ or ‘-’ signs?
Q17. How often is a consecutive query a reformulation of the previous one? (Not the same query to greater depth.)
Q18. How does query volume change throughout the day?
Q19. What are the most popular websites mentioned in the queries?
Q20. Estimate the percentage of queries that are about sports?
````