<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 7

## NLP and Machine Learning on [travel.stackexchange.com](http://travel.stackexchange.com/) data

---

In Project 7 you'll be doing NLP and machine learning on post data from Stack Exchange's travel subdomain.

This project is set up like a mini Kaggle competition. You are given the training data, and when projects are submitted your model will be tested on the held-out test data. There will be prizes for the people whose models perform best on the held-out test set!

---

## Notes on the data

The data is again compressed into the `.7z` file format to save space. There are 6 .csv files and one readme file that contains some information on the fields.

    posts_train.csv
    comments_train.csv
    users.csv
    badges.csv
    votes_train.csv
    tags.csv
    readme.txt
    
The data is located in your datasets folder:

    DSI-SF-2/datasets/stack_exchange_travel.7z
    
If you're interested in where this data came from and where to get more data from other stackexchange subdomains, see here:

https://ia800500.us.archive.org/22/items/stackexchange/readme.txt


### Recommended Utilities for .7z

- For OSX [Keka](http://www.kekaosx.com/en/) or [The Unarchiver](http://wakaba.c3.cx/s/apps/unarchiver.html). 
- For Windows [7-zip](http://www.7-zip.org/) is the standard. 
- For Linux try the `p7zip` utility.  `sudo apt-get install p7zip`.



In [1]:
import pandas as pd
import numpy as np

In [2]:
# Folder holding the extracted .csv files
data_dir = '/Users/edwardlee/Desktop/DSI/DSI-SF-2/datasets/stack_exchange_travel/'

posts = pd.read_csv(data_dir + 'posts_train.csv', encoding='UTF-8')
comments = pd.read_csv(data_dir + 'comments_train.csv', encoding='UTF-8')
votes = pd.read_csv(data_dir + 'votes_train.csv', encoding='UTF-8')
badges = pd.read_csv(data_dir + 'badges.csv', encoding='UTF-8')
tags = pd.read_csv(data_dir + 'tags.csv', encoding='UTF-8')
users = pd.read_csv(data_dir + 'users.csv', encoding='UTF-8')

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 1. Use LDA to find what topics are discussed on travel.stackexchange.com.

---

Text can be found in the posts and the comments datasets. The `ParentId` column in the posts dataset indicates which "question" post a given answer belongs to. Comment text can be merged onto the post it belongs to with the `PostId` field.
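
As a rough illustration of that merge, the sketch below collapses the comments for each post into one string and joins them onto the posts. The two small frames and the `CommentText` and `FullText` column names are made up for the example; only `Id`, `Body`, `PostId` and `Text` come from the real data.

```python
import pandas as pd

# Tiny stand-in frames that reuse the column names of the real data.
posts_demo = pd.DataFrame({
    "Id": [1, 4, 5],
    "Body": ["caribbean cruise ideas",
             "reward seats on singapore airlines",
             "transport in romania"],
})
comments_demo = pd.DataFrame({
    "PostId": [1, 1, 5],
    "Text": ["where are you flying from?",
             "toronto, ontario.",
             "trains are slow but cheap."],
})

# Collapse all comments on the same post into a single string.
comment_text = (
    comments_demo.groupby("PostId")["Text"]
    .apply(" ".join)
    .rename("CommentText")
    .reset_index()
)

# A left join keeps posts that have no comments at all.
merged = posts_demo.merge(comment_text, how="left", left_on="Id", right_on="PostId")
merged["CommentText"] = merged["CommentText"].fillna("")
merged["FullText"] = (merged["Body"] + " " + merged["CommentText"]).str.strip()
print(merged[["Id", "FullText"]])
```

Whether comments should be folded into their post or treated as separate documents is a modelling choice; folding them in gives LDA longer documents to work with.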

The text may have some HTML tags. BeautifulSoup has convenient ways to get rid of markup or extract text if you need to. You can also parse the strings yourself if you like.
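
If you would rather parse the strings yourself, one possible approach (a sketch for Python 3, using only the standard library instead of BeautifulSoup) is to collect the text nodes with `html.parser.HTMLParser`. The helper names `strip_html` and `_TextExtractor` are invented for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the visible text of an HTML fragment; non-strings pass through unchanged."""
    if not isinstance(fragment, str):
        return fragment  # e.g. a missing Body
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    # Join on spaces so words on either side of a tag are not fused,
    # then normalise runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


example = "<p>My partner &amp; I want a <b>Caribbean</b> cruise.</p><p>Any tips?</p>"
print(strip_html(example))  # My partner & I want a Caribbean cruise. Any tips?
```

Joining on spaces avoids gluing the last word of one paragraph to the first word of the next, at the cost of splitting a word that has a tag in the middle of it.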

The tags dataset has the "tags" that the users have officially given the post.

**1.1 Implement LDA against the text features of the dataset(s).**

- This can be posts or a combination of posts and comments if you want more power.
- Find optimal **K/num_topics**.
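
One common way to look for a reasonable **K** is to fit a model for several candidate values and compare them on documents held out from training. The sketch below does this with held-out perplexity. To stay self-contained it swaps in scikit-learn's `LatentDirichletAllocation` rather than gensim, and the documents and candidate values are made up, so treat it as an outline of the loop rather than a recipe for this dataset.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents standing in for the cleaned post text.
docs = [
    "visa passport schengen embassy application",
    "schengen visa stay days passport",
    "flight airline ticket luggage airport",
    "airport transit flight layover ticket",
    "train bus station ticket city",
    "bus train station timetable city",
    "visa application embassy passport uk",
    "airline luggage checked bag flight",
]
train_docs, heldout_docs = docs[:6], docs[6:]

vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
X_heldout = vectorizer.transform(heldout_docs)

# Fit one model per candidate K and score it on documents it has not seen.
perplexity_by_k = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, random_state=0)
    model.fit(X_train)
    perplexity_by_k[k] = model.perplexity(X_heldout)

# Lower perplexity is better.
best_k = min(perplexity_by_k, key=perplexity_by_k.get)
print(perplexity_by_k, best_k)
```

Perplexity is only one criterion. Reading the top words of each topic and checking that they are interpretable is usually just as informative.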

**1.2 Compare your topics to the tags. Do the LDA topics make sense? How do they compare to the tags?**
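
For the comparison, the `Tags` column of a question stores its tags as one string such as `<caribbean><cruising><vacations>`. A minimal sketch of splitting those strings and intersecting the tag names with topic keywords follows; `parse_tags` is a helper invented here and the keyword set is made up.

```python
import re


def parse_tags(tag_string):
    """Split a tag string such as '<usa><taxis>' into a list of tag names."""
    if not isinstance(tag_string, str):
        return []  # posts without tags
    return re.findall(r"<([^<>]+)>", tag_string)


tags_column = ["<caribbean><cruising><vacations>", "<romania><transportation>", None]
tag_names = {tag for row in tags_column for tag in parse_tags(row)}

# Hypothetical topic keywords, in the spirit of the LDA output.
topic_keywords = {"cruising", "visa", "transportation", "passport"}
shared = sorted(topic_keywords & tag_names)
print(shared)  # ['cruising', 'transportation']
```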


In [3]:
tags.head(10)

Unnamed: 0,Count,ExcerptPostId,Id,TagName,WikiPostId
0,75,2138.0,1,cruising,2137.0
1,39,357.0,2,caribbean,356.0
2,31,319.0,4,vacations,318.0
3,6,14548.0,6,amazon-river,14547.0
4,74,1792.0,8,romania,1791.0
5,314,1183.0,9,transportation,1182.0
6,36,243.0,10,extreme-tourism,242.0
7,10,164.0,11,antarctica,163.0
8,326,1112.0,12,planning,1111.0
9,1962,158.0,14,usa,157.0


In [96]:
votes.head()

Unnamed: 0,BountyAmount,CreationDate,Id,PostId,UserId,VoteTypeId
0,,2011-06-21T00:00:00.000,1,1,,2
1,,2011-06-21T00:00:00.000,2,1,,2
2,,2011-06-21T00:00:00.000,5,5,13.0,5
3,,2011-06-21T00:00:00.000,7,7,,2
4,,2011-06-21T00:00:00.000,9,1,,2


In [5]:
posts.shape

(41289, 21)

In [6]:
comments.isnull().sum()

CreationDate           0
Id                     0
PostId                 0
Score                  0
Text                   0
UserDisplayName    79705
UserId              1081
dtype: int64

In [7]:
comments.shape

(81506, 7)

In [8]:
comments.head()

Unnamed: 0,CreationDate,Id,PostId,Score,Text,UserDisplayName,UserId
0,2011-06-21T20:25:14.257,1,1,0,To help with the cruise line question: Where a...,,12.0
1,2011-06-21T20:27:35.300,2,1,0,"Toronto, Ontario. We can fly out of anywhere t...",,9.0
2,2011-06-21T20:32:23.687,3,1,3,"""Best"" for what? Please read [this page](http...",,20.0
3,2011-06-21T20:42:08.330,9,25,0,"Are you in the UK? If so, would be helpful to ...",,30.0
4,2011-06-21T20:44:09.990,12,26,3,"Where are you starting from, and what sort of ...",,26.0


In [9]:
posts.head()

Unnamed: 0,AcceptedAnswerId,AnswerCount,Body,ClosedDate,CommentCount,CommunityOwnedDate,CreationDate,FavoriteCount,Id,LastActivityDate,...,LastEditorDisplayName,LastEditorUserId,OwnerDisplayName,OwnerUserId,ParentId,PostTypeId,Score,Tags,Title,ViewCount
0,393.0,4.0,<p>My fiancée and I are looking for a good Car...,2013-02-25T23:52:47.953,4,,2011-06-21T20:19:34.730,,1,2012-05-24T14:52:14.760,...,,101.0,,9.0,,1,8,<caribbean><cruising><vacations>,What are some Caribbean cruises for October?,361.0
1,,1.0,<p>Singapore Airlines has an all-business clas...,,1,,2011-06-21T20:24:57.160,,4,2013-01-09T09:55:22.743,...,,693.0,,24.0,,1,8,<loyalty-programs><routes><ewr><singapore-airl...,Does Singapore Airlines offer any reward seats...,219.0
2,770.0,5.0,<p>Another definition question that interested...,,0,,2011-06-21T20:25:56.787,2.0,5,2012-10-12T20:49:08.110,...,,101.0,,13.0,,1,11,<romania><transportation>,What is the easiest transportation to use thro...,340.0
3,62.0,3.0,<p>Can anyone suggest the best way to get from...,,2,,2011-06-21T20:30:38.687,1.0,8,2016-03-28T03:41:28.130,...,user141,,,26.0,,1,11,<usa><airport-transfer><taxis><seattle>,Best way to get from SeaTac airport to Redmond?,9219.0
4,178.0,4.0,<p>We are considering visiting Argentina for u...,2016-01-02T10:26:48.277,1,,2011-06-21T20:31:21.800,8.0,9,2016-01-01T21:58:02.303,...,,101.0,,23.0,,1,12,<sightseeing><public-transport><transportation...,What are must-visit destinations for the first...,1503.0


In [10]:
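import re

# Compile once: non-greedy match for anything that looks like an HTML tag
HTML_TAG = re.compile('<.*?>')

def remove_html(x):
    """Strip HTML tags; non-string values such as NaN are returned unchanged."""
    try:
        return HTML_TAG.sub('', x)
    except TypeError:
        return x

def remove_tabs(x):
    """Remove newline characters and lower-case the text; NaN is returned unchanged."""
    try:
        return x.replace('\n', '').lower()
    except AttributeError:
        return x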

In [11]:
# Remove HTML code
posts['Body'] = posts['Body'].apply(remove_html)
# Remove \n and lower case
posts['Body'] = posts['Body'].apply(remove_tabs)
# Lower case
posts['Title'] = posts['Title'].apply(remove_tabs)

In [12]:
# Lower case
comments['Text'] = comments['Text'].apply(remove_tabs)

In [13]:
# Drop rows whose 'Body' is NaN
posts_body = posts.dropna(subset=['Body'])

In [14]:
posts_title = posts.dropna(subset=['Title'])
posts_title

Unnamed: 0,AcceptedAnswerId,AnswerCount,Body,ClosedDate,CommentCount,CommunityOwnedDate,CreationDate,FavoriteCount,Id,LastActivityDate,...,LastEditorDisplayName,LastEditorUserId,OwnerDisplayName,OwnerUserId,ParentId,PostTypeId,Score,Tags,Title,ViewCount
0,393.0,4.0,my fiancée and i are looking for a good caribb...,2013-02-25T23:52:47.953,4,,2011-06-21T20:19:34.730,,1,2012-05-24T14:52:14.760,...,,101.0,,9.0,,1,8,<caribbean><cruising><vacations>,what are some caribbean cruises for october?,361.0
1,,1.0,singapore airlines has an all-business class f...,,1,,2011-06-21T20:24:57.160,,4,2013-01-09T09:55:22.743,...,,693.0,,24.0,,1,8,<loyalty-programs><routes><ewr><singapore-airl...,does singapore airlines offer any reward seats...,219.0
2,770.0,5.0,another definition question that interested me...,,0,,2011-06-21T20:25:56.787,2.0,5,2012-10-12T20:49:08.110,...,,101.0,,13.0,,1,11,<romania><transportation>,what is the easiest transportation to use thro...,340.0
3,62.0,3.0,can anyone suggest the best way to get from se...,,2,,2011-06-21T20:30:38.687,1.0,8,2016-03-28T03:41:28.130,...,user141,,,26.0,,1,11,<usa><airport-transfer><taxis><seattle>,best way to get from seatac airport to redmond?,9219.0
4,178.0,4.0,we are considering visiting argentina for up t...,2016-01-02T10:26:48.277,1,,2011-06-21T20:31:21.800,8.0,9,2016-01-01T21:58:02.303,...,,101.0,,23.0,,1,12,<sightseeing><public-transport><transportation...,what are must-visit destinations for the first...,1503.0
5,198.0,3.0,i'm planning on taking the trans-siberian / tr...,,1,,2011-06-21T20:32:25.950,4.0,11,2015-05-13T08:12:22.683,...,,101.0,,30.0,,1,24,<russia><visas><china><mongolia><trans-siberian>,what is the best way to obtain visas for the t...,1604.0
6,1608.0,2.0,"i need to travel from cusco, peru to la paz, b...",,4,,2011-06-21T20:34:22.563,,13,2015-04-23T22:35:31.370,...,,693.0,,22.0,,1,11,<online-resources><transportation><peru><south...,where can i find up-to-date information about ...,449.0
7,,3.0,i am aware of travel agencies catering to us c...,,4,,2011-06-21T20:34:23.453,,14,2011-10-07T21:11:29.223,...,,140.0,,23.0,,1,7,<us-citizens><travel-agents><cuba>,is it advisable for us citizen to attempt a vi...,329.0
8,20.0,12.0,my wife and i have decided to move across euro...,,2,,2011-06-21T20:36:19.323,33.0,16,2015-04-06T21:24:08.973,...,user141,,,19.0,,1,57,<europe><online-resources><planning><guides><t...,is there a good website to plan a trip via tra...,4561.0
12,114.0,8.0,i'm looking for data plans i can use while tou...,,3,,2011-06-21T20:41:15.210,16.0,25,2016-04-26T21:27:37.633,...,,101.0,,34.0,,1,41,<budget><cellphones><data-plans><communication...,what are the best ways to avoid data roaming f...,2448.0


In [15]:
# Collect each text source as a plain list of documents
post_body = list(posts_body['Body'])
post_title = list(posts_title['Title'])
comment_text = list(comments['Text'])

# Every document from all three sources: bodies, then titles, then comments
all_text = post_body + post_title + comment_text

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
from gensim import corpora, models, matutils
from collections import defaultdict

# Fit the documents into a count vectorizer
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(comment_text)

# Dense view of the document-term matrix, for inspection only (memory-heavy)
docs = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
docs.head()

Unnamed: 0,00,000,0000,00000,00001km,00002,00003,0001,00012,0002,...,ｉｔ,ｒｅｔｕｒｎｅｄ,ｓｏ,ｔｈｅ,ｔｈｏｕｇｈ,ｔｏ,ｖｅ,ｖｅｒｙ,ｗａｓ,ｙｅｔ
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
vocab = {v: k for k, v in vectorizer.vocabulary_.iteritems()}
vocab

{0: u'00',
 1: u'000',
 2: u'0000',
 3: u'00000',
 4: u'00001km',
 5: u'00002',
 6: u'00003',
 7: u'0001',
 8: u'00012',
 9: u'0002',
 10: u'0004',
 11: u'000465f03d84d91df360b',
 12: u'0004662013f19e9a20a84',
 13: u'00049a6eb177a0e46e1b6',
 14: u'0006',
 15: u'0007',
 16: u'000815',
 17: u'000_dollar_banknote',
 18: u'000ers',
 19: u'000eu',
 20: u'000ft',
 21: u'000ish',
 22: u'000km',
 23: u'000m',
 24: u'000rmb',
 25: u'000th',
 26: u'000us',
 27: u'000v',
 28: u'000x',
 29: u'001',
 30: u'001017733',
 31: u'0011',
 32: u'00110',
 33: u'001106',
 34: u'00117',
 35: u'0012',
 36: u'001607important',
 37: u'0018',
 38: u'001844',
 39: u'001880',
 40: u'0019172222222',
 41: u'0019bb2963f4',
 42: u'002',
 43: u'00208',
 44: u'00211',
 45: u'002591',
 46: u'003',
 47: u'003144',
 48: u'0033',
 49: u'0035206',
 50: u'00380',
 51: u'004001',
 52: u'004539',
 53: u'0045z',
 54: u'0046',
 55: u'0048773',
 56: u'005278',
 57: u'006',
 58: u'006641',
 59: u'006659',
 60: u'00680',
 61: u'0068

In [20]:
lda = models.LdaModel(
    matutils.Sparse2Corpus(X, documents_columns=False),
    num_topics  =  20,
    passes      =  3,
    id2word     =  vocab
)

In [26]:
lda.print_topics(num_topics=20, num_words=5)

[(0, u'0.044*schengen + 0.035*days + 0.032*card + 0.023*stay + 0.017*area'),
 (1, u'0.015*don + 0.012*know + 0.011*like + 0.010*just + 0.009*relaxed'),
 (2,
  u'0.058*question + 0.057*answer + 0.027*op + 0.016*comment + 0.015*think'),
 (3, u'0.018*make + 0.015*answer + 0.014*clear + 0.013*don + 0.012*time'),
 (4, u'0.089*visa + 0.033*uk + 0.025*need + 0.020*passport + 0.018*entry'),
 (5, u'0.075*www + 0.064*http + 0.051*com + 0.051*https + 0.019*html'),
 (6,
  u'0.013*traffic + 0.012*seat + 0.011*cmaster + 0.011*resident + 0.010*service'),
 (7,
  u'0.038*passport + 0.029*eu + 0.028*country + 0.022*countries + 0.018*law'),
 (8,
  u'0.039*flight + 0.025*airport + 0.021*hours + 0.019*flights + 0.015*years'),
 (9,
  u'0.041*ticket + 0.030*airline + 0.023*airlines + 0.021*flight + 0.019*tickets'),
 (10, u'0.027*car + 0.022*train + 0.019*pay + 0.012*cost + 0.011*london'),
 (11, u'0.024*just + 0.018*time + 0.012*don + 0.012*ll + 0.010*know'),
 (12,
  u'0.023*luggage + 0.019*id + 0.017*insuran

In [16]:
comment_keywords = ['schengen', 'days', 'card', 'stay', 'area', 'question', 'answer', 'op', 'comment', 'think', 'visa', 'uk', 'need', 'passport', 'entry', 'eu', 'country', 'countries', 'law', 'flight', 'airport', 'hours', 'flights', 'years', 'ticket', 'airline', 'airlines', 'tickets', 'car', 'train', 'pay', 'cost', 'london', 'luggage', 'id', 'insurance', 'checked', 'bag', 'thank', 'people', 'seen', 'comments', 'thanks', 'like', 'idea', 'good', 'looks', 'esta', 'personal', 'control', 'new', 'bank', 'number', 'account', 'travel', 'application', 'english', 'france', 'nationality', 'italy']

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
from gensim import corpora, models, matutils
from collections import defaultdict

# Fit the documents into a count vectorizer
vectorizer = CountVectorizer(stop_words='english')
X1 = vectorizer.fit_transform(post_body)

# Dense view of the document-term matrix, for inspection only (memory-heavy)
docs_post_body = pd.DataFrame(X1.toarray(), columns=vectorizer.get_feature_names())
docs_post_body.head()

Unnamed: 0,00,000,0000,00000,000000000000001,0000000000046022254,00000111,000005,00001,0000267,...,아이스크림,옥인동,종로구,주세요,찜질방,청사초롱,평양,하나미,항공,４月２０日
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
vocab_post_body = {v: k for k, v in vectorizer.vocabulary_.iteritems()}
vocab_post_body

{0: u'00',
 1: u'000',
 2: u'0000',
 3: u'00000',
 4: u'000000000000001',
 5: u'0000000000046022254',
 6: u'00000111',
 7: u'000005',
 8: u'00001',
 9: u'0000267',
 10: u'000071usd',
 11: u'0001',
 12: u'000102',
 13: u'0003',
 14: u'0005',
 15: u'0007hz',
 16: u'000b',
 17: u'000customers',
 18: u'000er',
 19: u'000ers',
 20: u'000eur',
 21: u'000ft',
 22: u'000i',
 23: u'000if',
 24: u'000inr',
 25: u'000km',
 26: u'000krw',
 27: u'000m',
 28: u'000mi',
 29: u'000rp',
 30: u'000rs',
 31: u'000s',
 32: u'000taxissupermarketsbars',
 33: u'000usd',
 34: u'000vnd',
 35: u'001',
 36: u'0010',
 37: u'0011',
 38: u'00131313',
 39: u'0014',
 40: u'001498',
 41: u'0015',
 42: u'001598',
 43: u'0016',
 44: u'001607',
 45: u'0017',
 46: u'002',
 47: u'002384',
 48: u'00268',
 49: u'003',
 50: u'0033145900087',
 51: u'00347',
 52: u'0034email',
 53: u'0039',
 54: u'0039a',
 55: u'003if',
 56: u'0041',
 57: u'00441628580333',
 58: u'0047',
 59: u'004\u0437',
 60: u'005',
 61: u'0050',
 62: u'0053

In [24]:
lda1 = models.LdaModel(
    matutils.Sparse2Corpus(X1, documents_columns=False),
    num_topics  =  20,
    passes      =  3,
    id2word     =  vocab_post_body
)

In [27]:
lda1.print_topics(num_topics=20, num_words=5)

[(0,
  u'0.041*water + 0.013*vwp + 0.009*products + 0.006*women + 0.005*drinking'),
 (1,
  u'0.012*travel + 0.012*booking + 0.012*price + 0.008*fare + 0.008*tickets'),
 (2, u'0.029*card + 0.023*use + 0.019*phone + 0.016*data + 0.011*free'),
 (3,
  u'0.011*country + 0.008*countries + 0.007*border + 0.006*answer + 0.006*question'),
 (4, u'0.064*visa + 0.024*passport + 0.017*uk + 0.016*schengen + 0.016*need'),
 (5, u'0.037*canada + 0.017*esta + 0.016*china + 0.015*applied + 0.014*new'),
 (6, u'0.010*eur + 0.010*cruise + 0.010*maps + 0.010*page + 0.009*ireland'),
 (7, u'0.015*seat + 0.015*people + 0.010*seats + 0.006*class + 0.005*space'),
 (8,
  u'0.076*airport + 0.056*transit + 0.022*international + 0.021*terminal + 0.018*london'),
 (9,
  u'0.021*south + 0.018*expired + 0.016*mexico + 0.014*united + 0.014*states'),
 (10, u'0.015*use + 0.011*english + 0.011*like + 0.010*just + 0.008*don'),
 (11, u'0.023*japan + 0.015*tokyo + 0.013*japanese + 0.013*emsp + 0.010*day'),
 (12, u'0.025*train +

In [17]:
post_body_keywords = ['water', 'vwp', 'products', 'women', 'drinking', 'travel', 'booking', 'price', 'fare', 'tickets', 'card', 'use', 'phone', 'data', 'free', 'country', 'countries', 'border', 'answer', 'question', 'visa', 'passport', 'uk', 'schengen', 'need', 'canada', 'esta', 'china', 'applied', 'new', 'eur', 'cruise', 'maps', 'page', 'ireland', 'seat', 'people', 'seats', 'class', 'space', 'airport', 'transit', 'international', 'terminal', 'london', 'south', 'expired', 'mexico', 'united', 'states', 'use', 'english', 'like', 'japan', 'tokyo', 'japanese', 'emsp', 'day', 'train', 'bus', 'station', 'time', 'ticket', 'city', 'area', 'park', 'visit', 'flight', 'check', 'ticket', 'airline', 'luggage', 'car', 'rental', 'rent', 'drive', 'cars', 'road', 'speed', 'driving', 'traffic', 'km', 'hotel', 'food', 'paragraph', 'items', 'bring', 'declare']

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
from gensim import corpora, models, matutils
from collections import defaultdict

# Fit the documents into a count vectorizer
vectorizer = CountVectorizer(stop_words='english')
X2 = vectorizer.fit_transform(post_title)

# Dense view of the document-term matrix, for inspection only (memory-heavy)
docs_post_title = pd.DataFrame(X2.toarray(), columns=vectorizer.get_feature_names())
docs_post_title.head()

Unnamed: 0,00,000,000b,000lb,010,050,0830,10,100,1000,...,zurich,zzyzx,zürich,çanakkale,île,þingvellir,łódź,佐賀線偲橋,松島城,청사초롱
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
vocab_post_title = {v: k for k, v in vectorizer.vocabulary_.iteritems()}
vocab_post_title

{0: u'00',
 1: u'000',
 2: u'000b',
 3: u'000lb',
 4: u'010',
 5: u'050',
 6: u'0830',
 7: u'10',
 8: u'100',
 9: u'1000',
 10: u'101',
 11: u'10k',
 12: u'11',
 13: u'110',
 14: u'110v',
 15: u'11217',
 16: u'11h55',
 17: u'11pm',
 18: u'11th',
 19: u'12',
 20: u'120',
 21: u'1208',
 22: u'12284',
 23: u'125cc',
 24: u'12am',
 25: u'12th',
 26: u'13',
 27: u'137',
 28: u'13h',
 29: u'13h40',
 30: u'13th',
 31: u'14',
 32: u'14000',
 33: u'15',
 34: u'150',
 35: u'1500',
 36: u'1520',
 37: u'15kg',
 38: u'15ml',
 39: u'16',
 40: u'160',
 41: u'1650',
 42: u'169th',
 43: u'17',
 44: u'17th',
 45: u'18',
 46: u'180',
 47: u'180th',
 48: u'18h',
 49: u'19',
 50: u'1900',
 51: u'192',
 52: u'1920s',
 53: u'1951',
 54: u'1963',
 55: u'1976',
 56: u'1991',
 57: u'19th',
 58: u'1am',
 59: u'1d',
 60: u'1h',
 61: u'1h10m',
 62: u'1h30',
 63: u'1h30m',
 64: u'1h30min',
 65: u'1hour',
 66: u'1hr',
 67: u'1pc',
 68: u'1st',
 69: u'1\xbd',
 70: u'20',
 71: u'20000mah',
 72: u'2001',
 73: u'2003',


In [22]:
lda2 = models.LdaModel(
    matutils.Sparse2Corpus(X2, documents_columns=False),
    num_topics  =  10,
    passes      =  3,
    id2word     =  vocab_post_title
)

In [24]:
lda2.print_topics(num_topics=10, num_words=5)

[(0, u'0.057*passport + 0.024*enter + 0.021*valid + 0.016*use + 0.016*months'),
 (1, u'0.025*travel + 0.019*bus + 0.013*mexico + 0.013*united + 0.012*trains'),
 (2, u'0.168*visa + 0.053*uk + 0.048*schengen + 0.027*transit + 0.024*need'),
 (3,
  u'0.019*way + 0.017*border + 0.013*possible + 0.013*cheapest + 0.010*europe'),
 (4, u'0.021*permit + 0.020*public + 0.020*work + 0.019*dubai + 0.019*paris'),
 (5,
  u'0.031*airport + 0.027*luggage + 0.026*check + 0.013*free + 0.012*night'),
 (6,
  u'0.026*layover + 0.025*card + 0.024*airport + 0.018*transit + 0.014*london'),
 (7,
  u'0.038*car + 0.015*travel + 0.014*insurance + 0.012*leaving + 0.012*rental'),
 (8,
  u'0.062*flight + 0.031*passport + 0.023*new + 0.015*refused + 0.013*heathrow'),
 (9,
  u'0.027*airport + 0.017*transfer + 0.017*south + 0.011*hong + 0.011*kong')]

In [18]:
post_title_keywords = ['passport', 'enter', 'valid', 'use', 'months', 'travel', 'bus', 'mexico', 'united', 'trains', 'visa', 'uk', 'schengen', 'transit', 'need', 'way', 'border', 'possible', 'cheapest', 'europe', 'permit', 'public', 'work', 'dubai', 'paris', 'airport', 'luggage', 'check', 'free', 'night', 'layover', 'card', 'transit', 'london', 'car', 'travel', 'insurance', 'leaving', 'rental', 'flight', 'passport', 'new', 'refused', 'heathrow', 'transfer', 'south', 'hong', 'kong']

In [31]:
from sklearn.feature_extraction.text import CountVectorizer
from gensim import corpora, models, matutils
from collections import defaultdict

# Fit the documents into a count vectorizer
vectorizer = CountVectorizer(stop_words='english')
X_all = vectorizer.fit_transform(all_text)

# Dense view of the document-term matrix, for inspection only (memory-heavy)
docs_all_text = pd.DataFrame(X_all.toarray(), columns=vectorizer.get_feature_names())
docs_all_text.head()

Unnamed: 0,00,000,0000,00000,000000000000001,0000000000046022254,00000111,000005,00001,00001km,...,ｉｔ,ｒｅｔｕｒｎｅｄ,ｓｏ,ｔｈｅ,ｔｈｏｕｇｈ,ｔｏ,ｖｅ,ｖｅｒｙ,ｗａｓ,ｙｅｔ
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
vocab_all_text = {v: k for k, v in vectorizer.vocabulary_.iteritems()}
vocab_all_text

{0: u'00',
 1: u'000',
 2: u'0000',
 3: u'00000',
 4: u'000000000000001',
 5: u'0000000000046022254',
 6: u'00000111',
 7: u'000005',
 8: u'00001',
 9: u'00001km',
 10: u'00002',
 11: u'0000267',
 12: u'00003',
 13: u'000071usd',
 14: u'0001',
 15: u'000102',
 16: u'00012',
 17: u'0002',
 18: u'0003',
 19: u'0004',
 20: u'000465f03d84d91df360b',
 21: u'0004662013f19e9a20a84',
 22: u'00049a6eb177a0e46e1b6',
 23: u'0005',
 24: u'0006',
 25: u'0007',
 26: u'0007hz',
 27: u'000815',
 28: u'000_dollar_banknote',
 29: u'000b',
 30: u'000customers',
 31: u'000er',
 32: u'000ers',
 33: u'000eu',
 34: u'000eur',
 35: u'000ft',
 36: u'000i',
 37: u'000if',
 38: u'000inr',
 39: u'000ish',
 40: u'000km',
 41: u'000krw',
 42: u'000lb',
 43: u'000m',
 44: u'000mi',
 45: u'000rmb',
 46: u'000rp',
 47: u'000rs',
 48: u'000s',
 49: u'000taxissupermarketsbars',
 50: u'000th',
 51: u'000us',
 52: u'000usd',
 53: u'000v',
 54: u'000vnd',
 55: u'000x',
 56: u'001',
 57: u'0010',
 58: u'001017733',
 59: u'0

In [36]:
lda_all = models.LdaModel(
    matutils.Sparse2Corpus(X_all, documents_columns=False),
    num_topics  =  15,
    passes      =  3,
    id2word     =  vocab_all_text
)

In [37]:
lda_all.print_topics(num_topics=15, num_words=5)

[(0,
  u'0.061*airport + 0.022*flight + 0.022*transit + 0.019*international + 0.017*airports'),
 (1,
  u'0.046*answer + 0.038*question + 0.022*op + 0.022*information + 0.017*link'),
 (2, u'0.012*food + 0.011*water + 0.010*carry + 0.008*bag + 0.007*bring'),
 (3, u'0.018*google + 0.014*phoog + 0.013*maps + 0.010*map + 0.009*2016'),
 (4, u'0.034*hotel + 0.015*hotels + 0.014*room + 0.013*price + 0.013*voting'),
 (5, u'0.013*countries + 0.013*country + 0.013*law + 0.009*id + 0.008*state'),
 (6,
  u'0.018*yeah + 0.018*south + 0.014*jonathanreez + 0.012*america + 0.011*north'),
 (7,
  u'0.053*visa + 0.024*passport + 0.016*uk + 0.016*schengen + 0.015*country'),
 (8, u'0.089*http + 0.081*com + 0.060*travel + 0.048*www + 0.048*questions'),
 (9, u'0.017*card + 0.011*use + 0.011*money + 0.010*pay + 0.008*bank'),
 (10,
  u'0.025*ticket + 0.025*flight + 0.019*airline + 0.015*check + 0.013*airlines'),
 (11,
  u'0.034*car + 0.021*insurance + 0.014*rental + 0.010*company + 0.008*chx'),
 (12, u'0.021*ti

In [19]:
all_keywords = ['airport', 'flight', 'transit', 'international', 'airports', 'answer', 'question', 'op', 'information', 'link', 'food', 'water', 'carry', 'bag', 'bring', 'maps', 'map', 'hotel', 'hotels', 'room', 'price', 'voting', 'countries', 'country', 'law', 'id', 'state', 'yeah', 'south', 'america', 'north', 'visa', 'passport', 'uk', 'schengen', 'country', 'questions', 'card', 'use', 'money', 'pay', 'bank', 'ticket', 'airline', 'check', 'airlines', 'car', 'insurance', 'rental', 'company', 'time', 'day', 'bus', 'train', 'trip', 'like', 'know', 'english', 'london', 'station']

In [20]:
all_keywords_df = pd.DataFrame(all_keywords)

In [21]:
post_title_df = pd.DataFrame(post_title_keywords)

In [22]:
post_body_df = pd.DataFrame(post_body_keywords)

In [23]:
comment_keywords_df = pd.DataFrame(comment_keywords)

In [24]:
all_keywords_df = all_keywords_df.sort_values(0).reset_index(drop=True)
post_title_df = post_title_df.sort_values(0).reset_index(drop=True)
post_body_df = post_body_df.sort_values(0).reset_index(drop=True)
comment_keywords_df = comment_keywords_df.sort_values(0).reset_index(drop=True)

In [25]:
word_comparison = pd.concat([all_keywords_df, post_title_df, post_body_df, comment_keywords_df], axis=1)

In [26]:
word_comparison.columns = ['All', 'Title', 'Body', 'Comment']

In [27]:
frames = [all_keywords_df, post_title_df, post_body_df, comment_keywords_df]

In [28]:
word_comparison.head(3)

Unnamed: 0,All,Title,Body,Comment
0,airline,airport,airline,account
1,airlines,border,airport,airline
2,airport,bus,answer,airlines


In [29]:
concat_keywords = pd.concat(frames)

In [30]:
concat_keywords = concat_keywords.sort_values(0).reset_index(drop=True)

In [31]:
concat_keywords.head(10)

Unnamed: 0,0
0,account
1,airline
2,airline
3,airline
4,airlines
5,airlines
6,airport
7,airport
8,airport
9,airport


In [32]:
counts = concat_keywords.groupby(0)[0].count()

In [33]:
counts_df = pd.DataFrame(counts)
counts_df = counts_df[counts_df[0] >= 3]

In [34]:
counts_more = pd.DataFrame(counts)
counts_more = counts_more[counts_more[0] >= 2]

In [35]:
counts_more.rename(columns={0:'counts'}, inplace=True)
counts_more.reset_index(inplace=True)
counts_more.rename(columns={0:'words'}, inplace=True)
counts_more = counts_more.sort_values('counts', ascending=False).reset_index(drop=True)

In [36]:
counts_df.rename(columns={0:'counts'}, inplace=True)

In [37]:
counts_df.reset_index(inplace=True)

In [38]:
counts_df.rename(columns={0:'words'}, inplace=True)

In [39]:
counts_df = counts_df.sort_values('counts', ascending=False).reset_index(drop=True)

In [40]:
counts_v1 = counts_df['words'].values

In [41]:
counts_v2 = counts_more['words'].values

In [42]:
print counts_v1, counts_v2

['passport' 'country' 'schengen' 'airport' 'london' 'ticket' 'flight'
 'visa' 'transit' 'travel' 'card' 'car' 'uk' 'use' 'question' 'south'
 'rental' 'train' 'airline' 'new' 'need' 'like' 'insurance' 'english'
 'countries' 'check' 'bus' 'answer' 'luggage'] ['passport' 'london' 'ticket' 'flight' 'country' 'schengen' 'car' 'card'
 'transit' 'travel' 'uk' 'use' 'visa' 'airport' 'train' 'insurance' 'south'
 'rental' 'question' 'new' 'need' 'luggage' 'like' 'airline' 'countries'
 'answer' 'bus' 'check' 'english' 'international' 'area' 'united' 'bag'
 'bank' 'border' 'bring' 'time' 'tickets' 'station' 'day' 'law' 'price'
 'people' 'pay' 'esta' 'op' 'food' 'free' 'mexico' 'maps' 'hotel'
 'airlines' 'id' 'water']
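
The keyword tally in the cells above (concatenate the four keyword frames, group, count, filter) can also be written with `collections.Counter`. The sketch below is an alternative formulation rather than the code that produced the output above, and the keyword lists are shortened, made-up stand-ins.

```python
from collections import Counter

# Shortened, made-up keyword lists; the real ones hold the top words of every topic.
keyword_lists = {
    "all": ["airport", "flight", "visa", "passport"],
    "title": ["passport", "visa", "airport", "passport"],
    "body": ["visa", "passport", "water"],
    "comment": ["visa", "flight", "schengen"],
}

# Count every occurrence across all lists, as a groupby on the concatenated frame does.
counts = Counter(word for words in keyword_lists.values() for word in words)

at_least_3 = [word for word, n in counts.most_common() if n >= 3]
at_least_2 = [word for word, n in counts.most_common() if n >= 2]
print(at_least_3)
print(at_least_2)
```

Note that a word appearing in two topics of the same model is counted twice, so a count of 3 does not necessarily mean the word showed up in three different text sources. Wrapping each list in `set(...)` before counting would give that stricter reading.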


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 2. What makes an answer likely to be "accepted"?

---

**2.1 Build a model to predict whether a post will be marked as the answer.**

- This is a classification problem.
- You're free to use any of the machine learning algorithms or techniques we have learned in class to build the best model you can.
- NLP will be very useful here for pulling out useful and relevant features from the data. 
- Though not required, using bagging and boosting models like Random Forests and Gradient Boosted Trees will _probably_ get you the highest performance on the test data (but who knows!).
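
As a starting point, a text-only baseline might look like the sketch below: TF-IDF features fed into a random forest through a scikit-learn pipeline. The answer texts and labels are invented for illustration, and the hyperparameters are arbitrary, so this shows the wiring rather than a tuned model.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up answer texts and labels (1 = accepted); real features would come from the posts.
answer_texts = [
    "you need a schengen visa, apply at the embassy with these documents",
    "no idea, sorry",
    "take the train from the airport, it runs every ten minutes and costs 5 eur",
    "maybe try google",
    "the airline allows one checked bag of 23 kg, see their baggage policy page",
    "i think so",
    "book the bus ticket online in advance, the station office closes at 6 pm",
    "not sure",
]
accepted = [1, 0, 1, 0, 1, 0, 1, 0]

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(answer_texts, accepted)

# Predicted probability of the "accepted" class for unseen answers.
new_answers = ["you can buy the train ticket at the station", "dunno"]
probabilities = model.predict_proba(new_answers)[:, 1]
print(probabilities)
```

Non-text features such as `Score`, `CommentCount` or properties of the parent question could be added alongside the text features.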


**2.2 Evaluate the performance of your classifier with a confusion matrix and accuracy. Explain how your model is performing.**
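
A minimal sketch of that evaluation with `sklearn.metrics`, on made-up labels and predictions. Accuracy is compared against the majority-class baseline, because when most answers are not accepted a model that always predicts "not accepted" already scores well.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions (1 = accepted).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)

# Accuracy of always predicting the most common class.
positive_rate = sum(y_true) / len(y_true)
baseline = max(positive_rate, 1 - positive_rate)

print(cm)
print(accuracy, baseline)  # 0.7 0.6
```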

**2.3 Plot either a ROC curve or precision-recall curve (or both!) and explain what they tell you about your model.**
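
Both curves are built from predicted probabilities rather than hard class predictions. The sketch below computes the points of each curve on made-up scores; the resulting arrays can then be passed to whatever plotting library you prefer.

```python
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# ROC: false positive rate against true positive rate at each threshold.
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# Precision-recall: often more informative when the positive class is rare.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)

print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of accepted answers above non-accepted ones.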

NOTE: You should only be predicting this for `PostTypeId == 2` posts, which are the "answer" posts. This doesn't mean, however, that you can't or shouldn't use the parent questions as predictors!
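
One way to build that target, assuming `AcceptedAnswerId` on a question holds the `Id` of its accepted answer: gather every accepted id, keep only the answer posts, and flag the answers whose `Id` is in that set. The frame below is a made-up stand-in for `posts`, and `accepted` is a column name chosen for the example.

```python
import pandas as pd

# Made-up stand-in for the posts data: one question (Id 100) and three answers.
posts_demo = pd.DataFrame({
    "Id": [100, 101, 102, 103],
    "PostTypeId": [1, 2, 2, 2],
    "ParentId": [None, 100, 100, 999],
    "AcceptedAnswerId": [102, None, None, None],
})

# Ids that some question points to as its accepted answer.
accepted_ids = set(posts_demo["AcceptedAnswerId"].dropna().astype(int))

# Keep only answer posts and label each one by membership in that set.
answers = posts_demo[posts_demo["PostTypeId"] == 2].copy()
answers["accepted"] = answers["Id"].isin(accepted_ids).astype(int)
print(answers[["Id", "ParentId", "accepted"]])
```

The `ParentId` column can then be used to merge question-level features onto each answer row.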


In [43]:
posts.head(1)

Unnamed: 0,AcceptedAnswerId,AnswerCount,Body,ClosedDate,CommentCount,CommunityOwnedDate,CreationDate,FavoriteCount,Id,LastActivityDate,...,LastEditorDisplayName,LastEditorUserId,OwnerDisplayName,OwnerUserId,ParentId,PostTypeId,Score,Tags,Title,ViewCount
0,393.0,4.0,my fiancée and i are looking for a good caribb...,2013-02-25T23:52:47.953,4,,2011-06-21T20:19:34.730,,1,2012-05-24T14:52:14.760,...,,101.0,,9.0,,1,8,<caribbean><cruising><vacations>,what are some caribbean cruises for october?,361.0


In [331]:
posts_classification = posts.copy()

In [332]:
posts_classification.fillna('', inplace=True)

In [333]:
def vocab_counter(body, vocab=counts_v1):
    return [int(word in body) for word in vocab]

vocab_counts = posts_classification['Body'].apply(vocab_counter)

for word in counts_v1:
    posts_classification[word] = 0
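
One thing to be aware of: `word in body` on a string is a substring test, so a keyword like `car` also matches `card` or `scare`. If whole-word matches are wanted, a token-based check is one option. The sketch below contrasts the two; both helper names and the simple tokenising regex are assumptions made for this example.

```python
import re


def has_word_substring(body, word):
    """Substring test, the same check as `word in body` on the raw text."""
    return int(word in body)


def has_word_token(body, word):
    """Whole-word test against the set of lower-cased alphanumeric tokens."""
    tokens = set(re.findall(r"[a-z0-9]+", body.lower()))
    return int(word in tokens)


body = "can i pay by credit card at the rental desk?"
print(has_word_substring(body, "car"))  # 1, because "card" contains "car"
print(has_word_token(body, "car"))      # 0
print(has_word_token(body, "card"))     # 1
```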

In [334]:
# Stack the per-post indicator lists into an (n_posts, n_keywords) array
a = np.array(vocab_counts.tolist(), dtype=float)

In [335]:
posts_classification.reset_index(inplace=True)

In [336]:
posts_classification[counts_v1] = a

In [337]:
posts_classification.drop('index', axis=1, inplace=True)

In [338]:
posts_classification

Unnamed: 0,AcceptedAnswerId,AnswerCount,Body,ClosedDate,CommentCount,CommunityOwnedDate,CreationDate,FavoriteCount,Id,LastActivityDate,...,new,need,like,insurance,english,countries,check,bus,answer,luggage
0,393,4,my fiancée and i are looking for a good caribb...,2013-02-25T23:52:47.953,4,,2011-06-21T20:19:34.730,,1,2012-05-24T14:52:14.760,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,,1,singapore airlines has an all-business class f...,,1,,2011-06-21T20:24:57.160,,4,2013-01-09T09:55:22.743,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,770,5,another definition question that interested me...,,0,,2011-06-21T20:25:56.787,2,5,2012-10-12T20:49:08.110,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,62,3,can anyone suggest the best way to get from se...,,2,,2011-06-21T20:30:38.687,1,8,2016-03-28T03:41:28.130,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,178,4,we are considering visiting argentina for up t...,2016-01-02T10:26:48.277,1,,2011-06-21T20:31:21.800,8,9,2016-01-01T21:58:02.303,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,198,3,i'm planning on taking the trans-siberian / tr...,,1,,2011-06-21T20:32:25.950,4,11,2015-05-13T08:12:22.683,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,1608,2,"i need to travel from cusco, peru to la paz, b...",,4,,2011-06-21T20:34:22.563,,13,2015-04-23T22:35:31.370,...,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
7,,3,i am aware of travel agencies catering to us c...,,4,,2011-06-21T20:34:23.453,,14,2011-10-07T21:11:29.223,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,20,12,my wife and i have decided to move across euro...,,2,,2011-06-21T20:36:19.323,33,16,2015-04-06T21:24:08.973,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
9,,,eurail should be a good place to plan the trip...,,3,,2011-06-21T20:38:27.483,,19,2011-10-14T13:44:53.930,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [339]:
posts_classification.PostTypeId.value_counts()

2    23967
1    13988
5     1656
4     1656
6       18
7        4
Name: PostTypeId, dtype: int64

In [340]:
posts_classification.shape

(41289, 51)

In [341]:
posts_classification.columns

Index([     u'AcceptedAnswerId',           u'AnswerCount',
                        u'Body',            u'ClosedDate',
                u'CommentCount',    u'CommunityOwnedDate',
                u'CreationDate',         u'FavoriteCount',
                          u'Id',      u'LastActivityDate',
                u'LastEditDate', u'LastEditorDisplayName',
            u'LastEditorUserId',      u'OwnerDisplayName',
                 u'OwnerUserId',              u'ParentId',
                  u'PostTypeId',                 u'Score',
                        u'Tags',                 u'Title',
                   u'ViewCount',           u'Skill Count',
                    u'passport',               u'country',
                    u'schengen',               u'airport',
                      u'london',                u'ticket',
                      u'flight',                  u'visa',
                     u'transit',                u'travel',
                        u'card',                   u'car

In [342]:
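def check_null(x):
    """Return 1 if x can be read as an integer answer id, else 0."""
    try:
        int(x)
        return 1
    except (ValueError, TypeError):
        # Missing ids (empty strings, NaN, None) cannot be converted to int.
        return 0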

posts_classification['accepted_binary'] = posts_classification.AcceptedAnswerId.apply(check_null)

In [343]:
posts_classification.accepted_binary.value_counts()

0    34773
1     6516
Name: accepted_binary, dtype: int64

In [344]:
answerid = []

In [345]:
# Collect every accepted-answer id (missing values included) into a plain list.
answerid = posts_classification['AcceptedAnswerId'].tolist()

In [346]:
answerid

[393.0,
 '',
 770.0,
 62.0,
 178.0,
 198.0,
 1608.0,
 '',
 20.0,
 '',
 '',
 '',
 114.0,
 32.0,
 48.0,
 71.0,
 '',
 '',
 '',
 '',
 76.0,
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 214.0,
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 111.0,
 '',
 '',
 199.0,
 '',
 '',
 '',
 91.0,
 '',
 '',
 96.0,
 '',
 '',
 103.0,
 '',
 '',
 '',
 '',
 '',
 '',
 137.0,
 '',
 140.0,
 '',
 109.0,
 '',
 '',
 '',
 147.0,
 '',
 1885.0,
 131.0,
 '',
 126.0,
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 141.0,
 '',
 '',
 '',
 '',
 182.0,
 '',
 2168.0,
 '',
 '',
 179.0,
 '',
 '',
 '',
 '',
 169.0,
 958.0,
 216.0,
 '',
 3406.0,
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 188.0,
 '',
 212.0,
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 207.0,
 '',
 '',
 '',
 '',
 '',
 225.0,
 '',
 '',
 '',
 '',
 '',
 224.0,
 '',
 '',
 2160.0,
 '',
 '',
 268.0,
 '',
 '',
 '',
 245.0,
 '',
 '',
 '',
 '',
 456.0,
 251.0,
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 304.0,
 271.0

In [347]:
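# Build the lookup set once; testing membership in a list rescans it for every post.
accepted_ids = set(answerid)

def check_post(x):
    """Return 1 if post id x is the accepted answer to some question, else 0."""
    return 1 if x in accepted_ids else 0

posts_classification['accepted_post'] = posts_classification['Id'].apply(check_post)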

In [348]:
posts_classification.accepted_post.value_counts()

0    34773
1     6516
Name: accepted_post, dtype: int64

In [327]:
# def checker(x):
#     try:
#         int(x)
#         return 1
#     except:
#         return 0

# posts_classification['answer_binary'] = posts_classification['PostTypeId'].apply(checker)

In [328]:
# Keep only answer posts
# posts_classification = posts_classification[posts_classification['accepted_post'] != 0]

In [349]:
posts_classification.accepted_binary.value_counts()

0    34773
1     6516
Name: accepted_binary, dtype: int64

In [351]:
# posts_classification.answer_binary.value_counts()

In [352]:
columns_wanted = [u'passport',u'country',u'schengen',
                  u'airport',                u'london',
                      u'ticket',                u'flight',
                        u'visa',               u'transit',
                      u'travel',                  u'card',
                         u'car',                    u'uk',
                         u'use',              u'question',
                       u'south',                u'rental',
                       u'train',               u'airline',
                         u'new',                  u'need',
                        u'like',             u'insurance',
                     u'english',             u'countries',
                       u'check',                   u'bus',
                      u'answer',               u'luggage']

In [353]:
countv1_df = posts_classification[columns_wanted]

In [354]:
posts_classification.shape

(41289, 53)

In [355]:
posts_classification.accepted_binary.value_counts()

0    34773
1     6516
Name: accepted_binary, dtype: int64

In [356]:
X = countv1_df
y = posts_classification['accepted_post']

In [357]:
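# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides train_test_split.
from sklearn.model_selection import train_test_split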

trainX, testX, trainY, testY = train_test_split(X, y, train_size=0.7, stratify=y)

In [358]:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(criterion='gini', max_depth=None)

classifier.fit(trainX, trainY)

Y_pred = classifier.predict(testX)

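print(classifier.score(testX, testY))  # accuracy on the held-out split
print(np.mean(y))  # share of class 1; always predicting 0 would score 1 minus this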

0.794865584887
0.157814429993


In [359]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()

classifier.fit(trainX, trainY)

Y_pred = classifier.predict(testX)

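print(classifier.score(testX, testY))  # accuracy on the held-out split
print(np.mean(y))  # share of class 1; always predicting 0 would score 1 minus this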

0.841769597158
0.157814429993
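
Since `np.mean(y)` is about 0.158, a model that always predicts class 0 would already score about 0.842, which is essentially what the logistic regression reaches. One way to make that baseline explicit is scikit-learn's `DummyClassifier`. The sketch below uses synthetic data (`X_toy`, `y_toy` are made up), so the numbers are not the notebook's.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic, imbalanced labels (roughly 16% positives); not the Stack Exchange posts.
rng = np.random.RandomState(0)
X_toy = rng.randint(0, 2, size=(1000, 5))
y_toy = (rng.rand(1000) < 0.16).astype(int)

# A "model" that ignores the features and always predicts the most frequent class.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)

# The baseline accuracy equals the share of the majority class.
majority_share = max(np.mean(y_toy), 1 - np.mean(y_toy))
print(baseline_accuracy)
print(majority_share)
```

Any classifier worth keeping should beat this number by a clear margin.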


In [360]:
# Classification Report and Confusion Matrix
from sklearn.metrics import classification_report

def print_cm_cr(y, Y_pred):
    """Print the confusion matrix and the classification report."""
    confusion = pd.crosstab(y, Y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
    print(confusion)
    print('')
    print(classification_report(y, Y_pred))

print_cm_cr(testY, Y_pred)

Predicted      0   1    All
Actual                     
0          10413  19  10432
1           1941  14   1955
All        12354  33  12387

             precision    recall  f1-score   support

          0       0.84      1.00      0.91     10432
          1       0.42      0.01      0.01      1955

avg / total       0.78      0.84      0.77     12387
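
The report shows the model finding only 14 of the 1955 accepted posts, so its accuracy comes almost entirely from predicting class 0. One common response to this kind of imbalance is to reweight the classes. The sketch below is an assumption about what might help, not something the notebook does, and it runs on synthetic data (`X_toy`, `y_toy` are made up).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (roughly 15% positives); not the Stack Exchange posts.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(2000, 1))
y_toy = (X_toy[:, 0] + rng.normal(size=2000) > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, stratify=y_toy, random_state=0)

# Same model twice: once unweighted, once with errors on the rare class weighted up.
plain = LogisticRegression().fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight='balanced').fit(X_tr, y_tr)

# Recall on class 1 is the quantity that collapsed in the report above.
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(recall_plain)
print(recall_weighted)
```

Reweighting usually trades some precision and overall accuracy for recall on the rare class, so the classification report is a better guide here than accuracy alone.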



In [361]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')

model = knn.fit(trainX, trainY)

predictions = model.predict(testX)

print(model.score(testX, testY))  # accuracy on the held-out split

0.821506418019


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 3. What is the score of a post?

---

**3.1 Build a model that predicts the score of a post.**

- This is a regression problem now. 
- You can and should be predicting score for both "question" and "answer" posts, so keep them both in your dataset.
- Again, use any techniques that you think will get you the best model.

**3.2 Evaluate the performance of your model with cross-validation and report the results.**

**3.3 What is important for determining the score of a post, if anything?**


In [362]:
def check_skills(element):
    counter = 0
    for word in counts_v1:
        if word in element:
            counter = counter + 1
    return counter

In [363]:
posts_classification['Skill Count'] = posts_classification['Body'].map(check_skills)

In [364]:
columns_need = ['accepted_post', 'Score', 'Skill Count']

In [365]:
reg_df = posts_classification[columns_need]

In [377]:
y = reg_df['Score']
X = reg_df[['accepted_post', 'Skill Count']]

In [378]:
trainX, testX, trainY, testY = train_test_split(X, y, train_size=0.75)

In [379]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
model = lr.fit(trainX, trainY)
predictions = lr.predict(testX)  # keep the predictions instead of discarding them
print(lr.score(testX, testY))  # R^2 on the held-out split

0.0343373168024
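
The score above comes from a single train/test split, while task 3.2 asks for cross-validation. A minimal sketch of how that could look with `cross_val_score` follows; the arrays are synthetic stand-ins for the `accepted_post` and `Skill Count` features and the `Score` target, so the printed values say nothing about the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-ins: a binary feature, a small count feature, and a noisy target.
rng = np.random.RandomState(0)
X_toy = np.column_stack([rng.randint(0, 2, size=500), rng.randint(0, 10, size=500)])
y_toy = 3.0 * X_toy[:, 0] + 0.2 * X_toy[:, 1] + rng.normal(scale=5.0, size=500)

# Shuffle before splitting, since posts in the real file are ordered by creation date.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=folds, scoring='r2')

# Report the mean and spread of R^2 across the five folds.
print(cv_scores.mean())
print(cv_scores.std())
```

Reporting the spread alongside the mean shows whether a score as low as the one above is stable or just noise from one particular split.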


In [380]:
feature_importance = pd.DataFrame({ 'features':X.columns, 
                                   'coefficients':model.coef_
                                  })

feature_importance.sort_values('coefficients', ascending=False, inplace=True)
feature_importance

Unnamed: 0,coefficients,features
0,3.330857,accepted_post
1,0.185643,Skill Count


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 4. How many views does a post have?

---

**4.1 Build a model that predicts the number of views a post has.**

- This is another regression problem. 
- Predict the views for all posts, not just the "answer" posts.

**4.2 Evaluate the performance of your model with cross-validation and report the results.**

**4.3 What is important for the number of views a post has, if anything?**

In [388]:
reg_4 = posts_classification.copy()

In [393]:
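# Convert ViewCount to numbers. Non-numeric entries (such as the empty strings
# used for missing values) become 0 so the next cell can filter those rows out.
reg_4.ViewCount = pd.to_numeric(reg_4.ViewCount, errors='coerce').fillna(0)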

In [396]:
reg_4 = reg_4[reg_4.ViewCount != 0]

In [402]:
X = reg_4[['Skill Count']]

In [403]:
y = reg_4['ViewCount']

In [404]:
trainX, testX, trainY, testY = train_test_split(X, y, train_size=0.75)

In [405]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
model = lr.fit(trainX, trainY)
predictions = lr.predict(testX)  # keep the predictions instead of discarding them
print(lr.score(testX, testY))  # R^2 on the held-out split

-0.0011673639125
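
A negative R² means the model does slightly worse on the test split than simply predicting the mean view count for every post. The small worked example below, with made-up numbers, shows how `r2_score` behaves in those two cases.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets, purely to illustrate the definition R^2 = 1 - SS_res / SS_tot.
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Predicting the mean for every row gives SS_res == SS_tot, so R^2 is 0.
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)

# Predictions that are worse than the mean push R^2 below zero.
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])
r2_bad = r2_score(y_true, bad_pred)

print(r2_mean)
print(r2_bad)
```

Here `SS_tot` is 5 and the reversed predictions give `SS_res` of 20, so the second score is 1 - 20/5 = -3.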


In [406]:
feature_importance = pd.DataFrame({ 'features':X.columns, 
                                   'coefficients':model.coef_
                                  })

feature_importance.sort_values('coefficients', ascending=False, inplace=True)
feature_importance

Unnamed: 0,coefficients,features
0,18.290261,Skill Count


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 5. Build a pipeline or other code to automate evaluation of your models on the test data.

---

Now that you've constructed your three predictive models, build a pipeline or code that can easily load up the raw testing data and evaluate your models on it.

The testing data that is held out is in the same raw format as the training data you have. _Any cleaning and preprocessing that you did on the training data will need to be done on the testing data as well!_

This is a good opportunity to practice building pipelines, but you're not required to. Custom functions and classes are fine as long as they are able to process and test the new data.
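
As a rough sketch of what such automation might look like, the code below chains a cleaning function, a `CountVectorizer`, and a classifier so that the same steps run on any raw posts frame. The helper names (`clean_posts`, `build_text_pipeline`), the toy rows, and the choice of model are all assumptions made for illustration; the real pipeline would load the held-out CSV and reuse whatever preprocessing was applied to the training data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def clean_posts(raw_df):
    """Apply the same cleaning to any raw posts frame (train or test)."""
    cleaned = raw_df.copy()
    cleaned['Body'] = cleaned['Body'].fillna('').str.lower()
    return cleaned


def build_text_pipeline():
    """Chain vectorizer and classifier so both are fit on training data only."""
    return Pipeline([
        ('vectorize', CountVectorizer(stop_words='english', max_features=30)),
        ('classify', LogisticRegression()),
    ])


# Made-up rows standing in for posts_train.csv and the held-out test file.
toy_train = pd.DataFrame({
    'Body': ['Do I need a VISA for Schengen transit?',
             'Best bus from the airport to London',
             'Visa on arrival at the airport',
             'Cheap train ticket and car rental'],
    'accepted_post': [1, 0, 1, 0],
})
toy_test = pd.DataFrame({
    'Body': ['Transit visa question', None],
    'accepted_post': [1, 0],
})

# Fit once on the cleaned training data.
pipeline = build_text_pipeline()
train_clean = clean_posts(toy_train)
pipeline.fit(train_clean['Body'], train_clean['accepted_post'])

# Evaluate on new raw data by repeating the cleaning and calling the fitted pipeline.
test_clean = clean_posts(toy_test)
test_predictions = pipeline.predict(test_clean['Body'])
test_accuracy = pipeline.score(test_clean['Body'], test_clean['accepted_post'])
print(test_accuracy)
```

Because the vocabulary is learned inside the pipeline, the test data is transformed with the training vocabulary rather than refit, which is the main thing this structure is meant to guarantee.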
