<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 5


## NLP and Machine Learning on [travel.stackexchange.com](http://travel.stackexchange.com/) data

---

In this project you'll be doing NLP and machine learning on post data from Stack Exchange's travel subdomain.

This project is set up like a mini Kaggle competition. You are given the training data, and when projects are submitted your model will be tested on the held-out testing data. There will be prizes for the people who build models that perform best on the held-out test set!

---

## Notes on the data

The data is again compressed into the `.7z` file format to save space. There are 6 .csv files and one readme file that contains some information on the fields.

    posts_train.csv
    comments_train.csv
    users.csv
    badges.csv
    votes_train.csv
    tags.csv
    readme.txt
    
The data is located in your datasets folder:

    DSI-SF-2/datasets/stack_exchange_travel.7z
    
If you're interested in where this data came from and where to get more data from other stackexchange subdomains, see here:

https://ia800500.us.archive.org/22/items/stackexchange/readme.txt


### Recommended Utilities for .7z

- For OSX [Keka](http://www.kekaosx.com/en/) or [The Unarchiver](http://wakaba.c3.cx/s/apps/unarchiver.html). 
- For Windows [7-zip](http://www.7-zip.org/) is the standard. 
- For Linux try the `p7zip` utility.  `sudo apt-get install p7zip`.



**Importing Python packages and datasets**

In [3]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import re

# sklearn
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer, StandardScaler
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report

# NLP
from collections import defaultdict
from gensim import corpora, models, matutils

%config InlineBackend.figure_format = 'retina'

In [4]:
posts_train = '~/Desktop/DSI-SF-5/datasets/stack_exchange_travel/posts_train.csv'
comments_train = '~/Desktop/DSI-SF-5/datasets/stack_exchange_travel/comments_train.csv'
users = '~/Desktop/DSI-SF-5/datasets/stack_exchange_travel/users.csv'
badges = '~/Desktop/DSI-SF-5/datasets/stack_exchange_travel/badges.csv'
votes_train = '~/Desktop/DSI-SF-5/datasets/stack_exchange_travel/votes_train.csv'
tags = '~/Desktop/DSI-SF-5/datasets/stack_exchange_travel/tags.csv' 

posts_train = pd.read_csv(posts_train)
comments_train = pd.read_csv(comments_train)
users = pd.read_csv(users)
badges = pd.read_csv(badges)
votes_train = pd.read_csv(votes_train)
tags = pd.read_csv(tags)

**Exploring the datasets**

In [5]:
print 'posts_train:', posts_train.shape
print 'comments_train:', comments_train.shape
print 'users:', users.shape
print 'badges:', badges.shape
print 'votes_train:', votes_train.shape
print 'tags:', tags.shape

posts_train: (41289, 21)
comments_train: (81506, 7)
users: (29283, 14)
badges: (71246, 6)
votes_train: (289553, 6)
tags: (1606, 5)


In [6]:
posts_train.head(3).T

Unnamed: 0,0,1,2
AcceptedAnswerId,393,,770
AnswerCount,4,1,5
Body,<p>My fiancée and I are looking for a good Car...,<p>Singapore Airlines has an all-business clas...,<p>Another definition question that interested...
ClosedDate,2013-02-25T23:52:47.953,,
CommentCount,4,1,0
CommunityOwnedDate,,,
CreationDate,2011-06-21T20:19:34.730,2011-06-21T20:24:57.160,2011-06-21T20:25:56.787
FavoriteCount,,,2
Id,1,4,5
LastActivityDate,2012-05-24T14:52:14.760,2013-01-09T09:55:22.743,2012-10-12T20:49:08.110


In [7]:
posts_train.PostTypeId.value_counts()

2    23967
1    13988
5     1656
4     1656
6       18
7        4
Name: PostTypeId, dtype: int64

In [8]:
# POSTTYPE {1: QUESTION, 2: ANSWER}
# USE PARENTID (Questions don't have this) & ID to remerge
posts_questions = posts_train.loc[posts_train.PostTypeId == 1, :].reset_index(drop=True)
posts_answer = posts_train.loc[posts_train.PostTypeId == 2, :].reset_index(drop=True)

In [9]:
print posts_questions.isnull().sum()
print
print 'post questions:', posts_questions.shape

AcceptedAnswerId          7472
AnswerCount                  0
Body                         0
ClosedDate               11361
CommentCount                 0
CommunityOwnedDate       13975
CreationDate                 0
FavoriteCount            10466
Id                           0
LastActivityDate             0
LastEditDate              2911
LastEditorDisplayName    13621
LastEditorUserId          3231
OwnerDisplayName         13613
OwnerUserId                307
ParentId                 13988
PostTypeId                   0
Score                        0
Tags                         0
Title                        0
ViewCount                    0
dtype: int64

post questions: (13988, 21)


In [10]:
posts_questions.Title[0]

'What are some Caribbean cruises for October?'

In [11]:
posts_questions.Body[0]

"<p>My fianc\xc3\xa9e and I are looking for a good Caribbean cruise in October and were wondering which islands are best to see and which Cruise line to take?</p>\n\n<p>It seems like a lot of the cruises don't run in this month due to Hurricane season so I'm looking for other good options.</p>\n\n<p><strong>EDIT</strong> We'll be travelling in 2012.</p>\n"

In [12]:
print posts_answer.isnull().sum()
print
print 'post answers:', posts_answer.shape

AcceptedAnswerId         23967
AnswerCount              23967
Body                         0
ClosedDate               23967
CommentCount                 0
CommunityOwnedDate       23818
CreationDate                 0
FavoriteCount            23967
Id                           0
LastActivityDate             0
LastEditDate             15015
LastEditorDisplayName    23650
LastEditorUserId         15089
OwnerDisplayName         23262
OwnerUserId                403
ParentId                     0
PostTypeId                   0
Score                        0
Tags                     23967
Title                    23967
ViewCount                23967
dtype: int64

post answers: (23967, 21)


In [13]:
posts_answer.Title[0]

nan

In [14]:
posts_answer.Body[0]

'<p><a href="http://www.eurail.com/home" rel="nofollow"><strong>EURail</strong></a> should be a good place to plan the trip.</p>\n\n<p>They do go as far east as Poland and Bulgaria, but no further than that.</p>\n\n<p><a href="http://www.eurostar.com/" rel="nofollow"><strong>EuroStar</strong></a> is another network that may be useful, but it stops short of EURail on the eastern side.</p>\n'

In [15]:
comments_train.head()

Unnamed: 0,CreationDate,Id,PostId,Score,Text,UserDisplayName,UserId
0,2011-06-21T20:25:14.257,1,1,0,To help with the cruise line question: Where a...,,12.0
1,2011-06-21T20:27:35.300,2,1,0,"Toronto, Ontario. We can fly out of anywhere t...",,9.0
2,2011-06-21T20:32:23.687,3,1,3,"""Best"" for what? Please read [this page](http...",,20.0
3,2011-06-21T20:42:08.330,9,25,0,"Are you in the UK? If so, would be helpful to ...",,30.0
4,2011-06-21T20:44:09.990,12,26,3,"Where are you starting from, and what sort of ...",,26.0


In [16]:
comments_train.isnull().sum()

CreationDate           0
Id                     0
PostId                 0
Score                  0
Text                   0
UserDisplayName    79705
UserId              1081
dtype: int64

In [17]:
comments_train.Text[0]

"To help with the cruise line question: Where are you located? My wife and I live in New Orleans, so we sail out of the port here. It limits us mainly to Carnival (though we are getting some more cruise lines in here), but saves us money on travel expenses getting *to* the port. If you're closer to a specific port, and like the cruises offered out of it, then it would make more sense to choose a cruise line from there."

In [18]:
def clean_text(text):
    """Clean the raw text of a post, comment, or user bio."""
    text = text.replace('\n'," ") #remove line break
    text = re.sub(r"[\w]+@[\.\w]+", "", text) #remove email addresses
    #removes web addresses
    text = re.sub(r"[a-zA-Z]*\:[\//\]*[A-Za-z0-9\-_]+\.[A-Za-z0-9\-_]+\.[A-Za-z0-9\.\/%&=\?\-_]+", "", text)
    text = re.sub(r"(http\w+ )", "", text)
    
    text = re.sub(r"-", " ", text) #replace hyphens with space
    text = re.sub(r"\d+/\d+/\d+", "", text) #remove date
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text) #remove times
    text = re.sub(r'<\w\>' , '', text) #remove single-letter opening tags such as <p>; longer tags (<ul>, <li>, <a href=...>) are not matched
    text = re.sub(r'</\w\>' , '', text) #remove single-letter closing tags such as </p>
    clndoc = ''
    for letter in text:
        if letter.isalpha() or letter==' ':
            clndoc+=letter
    text = ' '.join(word for word in clndoc.split() if len(word)>1)
    return text    

In [19]:
# CLEANING COMMENT TEXT
comments_train.Text = comments_train.Text.apply(clean_text)

# CLEANING POST QUESTIONS TITLE & BODY
posts_questions.Title = posts_questions.Title.apply(clean_text)
posts_questions.Body = posts_questions.Body.apply(clean_text)

# CLEANING POST ANSWERS BODY
posts_answer.Body = posts_answer.Body.apply(clean_text)

In [20]:
# GROUPING ALL COMMENTS BY POSTID...
# WILL SHOW UP AS ONE LONG COMMENT FOR EACH POST
comments_group = comments_train.groupby('PostId')['Text'].agg(lambda x: ' '.join(x)).reset_index()

In [21]:
comments_group.head()

Unnamed: 0,PostId,Text
0,1,To help with the cruise line question Where ar...
1,4,This route as well as LAX SIN is being cancele...
2,8,To those voting down please explain why with c...
3,9,agree with user you need to specify what kind ...
4,11,If at all possible save yourself the hassle of...


In [22]:
comments_group.shape

(23132, 2)

---------
**Tags**

In [23]:
tags.head()

Unnamed: 0,Count,ExcerptPostId,Id,TagName,WikiPostId
0,75,2138.0,1,cruising,2137.0
1,39,357.0,2,caribbean,356.0
2,31,319.0,4,vacations,318.0
3,6,14548.0,6,amazon-river,14547.0
4,74,1792.0,8,romania,1791.0


In [24]:
tags.isnull().sum()

Count              0
ExcerptPostId    216
Id                 0
TagName            0
WikiPostId       216
dtype: int64

--------
**Users**

In [25]:
users.head(2).T

Unnamed: 0,0,1
AboutMe,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",<p>Developer on the Stack Overflow team. Find...
AccountId,-1,2
Age,,39
CreationDate,2011-06-21T15:16:44.253,2011-06-21T20:10:03.720
DisplayName,Community,Geoff Dalgas
DownVotes,8492,0
Id,-1,2
LastAccessDate,2011-06-21T15:16:44.253,2016-05-29T01:18:20.767
Location,on the server farm,"Corvallis, OR"
ProfileImageUrl,,


In [26]:
users.isnull().sum()

AboutMe            19068
AccountId              0
Age                20869
CreationDate           0
DisplayName            1
DownVotes              0
Id                     0
LastAccessDate         0
Location           17808
ProfileImageUrl    12430
Reputation             0
UpVotes                0
Views                  0
WebsiteUrl         21383
dtype: int64

In [27]:
users.AboutMe[0]

'<p>Hi, I\'m not really a person.</p>\n\n<p>I\'m a background process that helps keep this site clean!</p>\n\n<p>I do things like</p>\n\n<ul>\n<li>Randomly poke old unanswered questions every hour so they get some attention</li>\n<li>Own community questions and answers so nobody gets unnecessary reputation from them</li>\n<li>Own downvotes on spam/evil posts that get permanently deleted</li>\n<li>Own suggested edits from anonymous users</li>\n<li><a href="http://meta.stackexchange.com/a/92006">Remove abandoned questions</a></li>\n</ul>\n'

In [28]:
# Imputing AboutMe column NaN values with stop word 'i'
users['AboutMe'] = users['AboutMe'].fillna('i')

# CLEAN ABOUTME COLUMN
users.AboutMe = users.AboutMe.apply(clean_text)

In [29]:
users.AboutMe[0]

'Hi Im not really person Im background process that helps keep this site clean do things like ul liRandomly poke old unanswered questions every hour so they get some attentionli liOwn community questions and answers so nobody gets unnecessary reputation from themli liOwn downvotes on spamevil posts that get permanently deletedli liOwn suggested edits from anonymous usersli lia hrefRemove abandoned questionsli ul'

-------------

**_CREATING FINAL POSTS DF WHICH COMBINES ALL TEXT ASSOCIATED WITH A GIVEN POST_**

In [30]:
# JOINING ANSWERS FOR A GIVEN POST TOGETHER 
# AS ONE BLOB OF TEXT

cols = ['ParentId', 'Body']

posts_answer_grouped = posts_answer.groupby('ParentId')['Body'].agg(lambda x: ' '.join(x)).reset_index()

posts_answer_grouped.ParentId = posts_answer_grouped.ParentId.astype(int)
posts_answer_grouped.columns = ['ParentId', 'Body_Answer']

In [31]:
posts_answer_grouped.head()

Unnamed: 0,ParentId,Body_Answer
0,1,This is less than an answer but more than comm...
1,4,SQ often blocks partner awards for very exclus...
2,5,We have been in Romania in and we were using p...
3,8,usually plan my trips with Google Maps Put bot...
4,9,You dont mention what kind of trips you like b...


In [32]:
cols = ['Id', 'Title', 'Body']
posts_questions[cols].head()

# MERGING IN POST ANSWERS
posts = pd.merge(left=posts_questions[cols], right=posts_answer_grouped, 
         how='inner', left_on='Id', right_on='ParentId')

# MERGING POST COMMENTS
posts = pd.merge(left=posts, right=comments_group, how='inner',
         left_on='Id', right_on='PostId')

# MERGING IN TAGS
posts = pd.merge(left=posts, right=posts_questions[['Id', 'Tags']], 
         how='inner', on='Id')

In [33]:
posts.shape

(9401, 8)
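
The inner joins above keep only questions that have at least one answer and at least one comment on the question itself, which is why 13,988 questions shrink to 9,401 rows. Comments attached to answers are not pulled in, because the merge is keyed on the question `Id`. A minimal sketch of how the drop could be audited, using pandas' `indicator` option on small made-up frames (the toy values and the `coverage` dictionary are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-ins for posts_questions, posts_answer_grouped and comments_group.
questions = pd.DataFrame({"Id": [1, 4, 5, 8], "Title": ["a", "b", "c", "d"]})
answers = pd.DataFrame({"ParentId": [1, 4, 8], "Body_Answer": ["x", "y", "z"]})
comments = pd.DataFrame({"PostId": [1, 8, 9], "Text": ["c1", "c8", "c9"]})

# Left joins keep every question; the indicator columns record whether a match existed.
with_answers = questions.merge(
    answers, how="left", left_on="Id", right_on="ParentId", indicator="answers_merge"
)
with_both = with_answers.merge(
    comments, how="left", left_on="Id", right_on="PostId", indicator="comments_merge"
)

has_answers = with_both["answers_merge"] == "both"
has_comments = with_both["comments_merge"] == "both"

coverage = {
    "questions": len(with_both),
    "without_answers": int((~has_answers).sum()),
    "without_comments": int((~has_comments).sum()),
    "kept_by_inner_joins": int((has_answers & has_comments).sum()),
}
```

If the dropped questions matter for a later model, filling the missing text with an empty string after a left join is one alternative to discarding them.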

In [34]:
posts.head()

Unnamed: 0,Id,Title,Body,ParentId,Body_Answer,PostId,Text,Tags
0,1,What are some Caribbean cruises for October,My fiance and are looking for good Caribbean c...,1,This is less than an answer but more than comm...,1,To help with the cruise line question Where ar...,<caribbean><cruising><vacations>
1,4,Does Singapore Airlines offer any reward seats...,Singapore Airlines has an all business class f...,4,SQ often blocks partner awards for very exclus...,4,This route as well as LAX SIN is being cancele...,<loyalty-programs><routes><ewr><singapore-airl...
2,8,Best way to get from SeaTac airport to Redmond,Can anyone suggest the best way to get from Se...,8,usually plan my trips with Google Maps Put bot...,8,To those voting down please explain why with c...,<usa><airport-transfer><taxis><seattle>
3,9,What are must visit destinations for the first...,We are considering visiting Argentina for up t...,9,You dont mention what kind of trips you like b...,9,agree with user you need to specify what kind ...,<sightseeing><public-transport><transportation...
4,11,What is the best way to obtain visas for the T...,Im planning on taking the trans Siberian trans...,11,As know in Russia visas can be achived for two...,11,If at all possible save yourself the hassle of...,<russia><visas><china><mongolia><trans-siberian>


In [35]:
# JOINING ALL TEXT TOGETHER

text_tuple = zip(posts.Title, posts.Body, posts.Body_Answer, posts.Text)
txt = []
for tup in text_tuple:
    txt.append(' '.join(tup))

In [36]:
# ADDING IN THE COMBINED TEXT FOR
# EACH POST AS A COLUMN
posts['Full_Text'] = txt

In [37]:
# COLUMNS OF INTEREST IN FINAL DF
cols = ['Id', 'Full_Text', 'Tags']

posts_corpus = posts[cols]

In [38]:
# FINAL POST CORPUS & TAGS
posts_corpus.head()

Unnamed: 0,Id,Full_Text,Tags
0,1,What are some Caribbean cruises for October My...,<caribbean><cruising><vacations>
1,4,Does Singapore Airlines offer any reward seats...,<loyalty-programs><routes><ewr><singapore-airl...
2,8,Best way to get from SeaTac airport to Redmond...,<usa><airport-transfer><taxis><seattle>
3,9,What are must visit destinations for the first...,<sightseeing><public-transport><transportation...
4,11,What is the best way to obtain visas for the T...,<russia><visas><china><mongolia><trans-siberian>


In [39]:
posts_corpus.shape

(9401, 3)

------------------

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 1. Use LDA to find what topics are discussed on travel.stackexchange.com.

---

Text can be found in the posts and the comments datasets. The `ParentId` column in the posts dataset identifies the "question" post that a given answer belongs to. Comments can be merged onto the post they belong to using the `PostId` field.

The text may have some HTML tags. BeautifulSoup has convenient ways to get rid of markup or extract text if you need to. You can also parse the strings yourself if you like.
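
As one way to "parse the strings yourself", here is a minimal sketch that strips markup with the standard library's `html.parser` (assuming Python 3). The names `_TextExtractor` and `strip_html` are hypothetical helpers introduced for illustration; BeautifulSoup's `get_text()` would be an alternative.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the visible text of `markup` with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join chunks with spaces so that text from adjacent block tags does not fuse.
    return " ".join(" ".join(parser.chunks).split())


sample = "<p>Hi, I'm <strong>not</strong> a person.</p>\n<ul>\n<li>poke questions</li>\n</ul>\n"
cleaned = strip_html(sample)
```

Joining with spaces can split a word that is only partly wrapped in an inline tag, which is usually an acceptable trade-off for bag-of-words features.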

The tags dataset has the "tags" that the users have officially given the post.

**1.1 Implement LDA against the text features of the dataset(s).**

- This can be posts or a combination of posts and comments if you want more power.
- Find optimal **K/num_topics**.

**1.2 Compare your topics to the tags. Do the LDA topics make sense? How do they compare to the tags?**


-------------------

_There are over 9000 posts in the DataFrame that can be analyzed. In this step I will use a sample of the posts for "prototyping" the LDA model._

In [40]:
posts_proto = posts_corpus.sample(frac=0.33)

posts_proto.head(10)

Unnamed: 0,Id,Full_Text,Tags
1273,9608,Is there another destination like Thailand Las...,<thailand><where-on-earth><southeast-asia>
109,811,How should spend my time in Calgary in August ...,<sightseeing><canada><toronto>
9332,70742,What is the quickest way to get from Portugal ...,<driving><portugal><finland>
8050,60798,Why does it take much lesser Alaskan miles to ...,<emirates><alaska-airlines>
5819,45700,Cheapest and fastest land ways to travel from ...,<trains><international-travel><prague><cologne>
104,792,Elephant Trekking in northern Thailand would l...,<thailand><trekking><animal-riding>
3759,30188,Are there any restaurants serving real chinese...,<food-and-drink><netherlands><belgium><asia><l...
6285,48544,Applying for US tourist visa from UK am an Ind...,<visas><usa><uk><indian-citizens><tourist-visas>
4000,31549,Am allowed to use my valid ESTA again for shor...,<usa><esta><b1-b2-visas>
3303,25388,How can make wire transfer from Japan Im here ...,<japan><money>


In [41]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

custom_stop_words = list(ENGLISH_STOP_WORDS)
custom_stop_words.append('the')
custom_stop_words.append('youre')
custom_stop_words.append('like')

In [42]:
# number of topics
k  =  20

# Vectorize
vectorizer  =  CountVectorizer(stop_words='english', ngram_range=(1,1), max_df=0.95, min_df=1) # min_df=1 is the default; max_df=0.95 is stricter than the default of 1.0
X           =  vectorizer.fit_transform(posts_proto.Full_Text)

In [43]:
docs = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
# docs.sum()

In [44]:
bow = []

for document in X.toarray():
    
    single_document = []
    
    for token_id, token_count in enumerate(document):

        if token_count > 0:
            single_document.append((token_id, token_count))

    bow.append(single_document)
    
# bow

In [45]:
# remove words that appear only once, plus English stop words (case-sensitive match)
frequency = defaultdict(int)

for text in posts_proto.Full_Text:
    for token in text.split():
        frequency[token] += 1

texts = [[token for token in text.split() if frequency[token] > 1 and token not in ENGLISH_STOP_WORDS]
          for text in posts_proto.Full_Text]

# Create gensim dictionary object
dictionary = corpora.Dictionary(texts)

# Create corpus matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# corpus

In [48]:
# LDA MODEL

lda = models.LdaModel(
    corpus=corpus,
    num_topics  =  k,
    passes      =  10, 
    id2word     =  dictionary
)

In [49]:
# Each entry is (topic id, weighted top words);
# num_words sets how many of the most probable words are shown per topic
lda.print_topics(num_topics=k, num_words=3)

[(0, u'0.032*"tour" + 0.013*"tours" + 0.012*"tax"'),
 (1, u'0.025*"train" + 0.019*"href" + 0.016*"bus"'),
 (2, u'0.008*"The" + 0.007*"question" + 0.007*"href"'),
 (3, u'0.014*"people" + 0.010*"like" + 0.009*"food"'),
 (4, u'0.046*"visa" + 0.015*"US" + 0.014*"Schengen"'),
 (5, u'0.011*"href" + 0.008*"The" + 0.007*"like"'),
 (6, u'0.045*"passport" + 0.009*"ID" + 0.009*"US"'),
 (7, u'0.016*"weather" + 0.014*"camping" + 0.010*"Berlin"'),
 (8, u'0.016*"UK" + 0.015*"application" + 0.009*"letter"'),
 (9, u'0.014*"Israel" + 0.012*"maps" + 0.012*"border"'),
 (10, u'0.012*"country" + 0.008*"countries" + 0.007*"US"'),
 (11, u'0.023*"insurance" + 0.015*"tip" + 0.011*"travel"'),
 (12, u'0.030*"flight" + 0.015*"flights" + 0.014*"airline"'),
 (13, u'0.015*"train" + 0.009*"The" + 0.007*"Georgia"'),
 (14, u'0.023*"car" + 0.015*"driving" + 0.014*"drive"'),
 (15, u'0.017*"card" + 0.012*"href" + 0.008*"The"'),
 (16, u'0.017*"laptop" + 0.015*"duty" + 0.013*"battery"'),
 (17, u'0.013*"time" + 0.009*"airport
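
Several topics above are led by tokens such as "The" and "href". A likely cause is that tokens keep their original case, so "The" never matches the lowercase stop-word list, while "href" is residue from tags that `clean_text` leaves behind. The sketch below lowercases before filtering and adds a small markup stop list. `STOP_WORDS` is a tiny stand-in for `ENGLISH_STOP_WORDS`, and `MARKUP_RESIDUE`, `tokenize` and `build_texts` are hypothetical names.

```python
from collections import Counter

# Tiny stand-in for sklearn's ENGLISH_STOP_WORDS (assumption, for illustration only).
STOP_WORDS = frozenset(["the", "a", "is", "to", "and", "of", "in"])
# Attribute names that survive a partial HTML clean-up.
MARKUP_RESIDUE = frozenset(["href", "rel", "nofollow"])


def tokenize(text, stop_words=STOP_WORDS | MARKUP_RESIDUE):
    """Lowercase first, then drop stop words, so 'The' and 'the' are treated alike."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


def build_texts(documents, min_count=2):
    """Tokenize every document and drop tokens seen fewer than `min_count` times overall."""
    tokenized = [tokenize(doc) for doc in documents]
    frequency = Counter(tok for doc in tokenized for tok in doc)
    return [[tok for tok in doc if frequency[tok] >= min_count] for doc in tokenized]


docs = ["The train to Berlin tomorrow", "the bus and the train href", "Berlin bus"]
texts = build_texts(docs)
```

The resulting `texts` has the same list-of-token-lists shape that `corpora.Dictionary` receives in the cell above.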

In [50]:
# Comparing to Tags
posts_proto.Tags[:8].values

array(['<thailand><where-on-earth><southeast-asia>',
       '<sightseeing><canada><toronto>', '<driving><portugal><finland>',
       '<emirates><alaska-airlines>',
       '<trains><international-travel><prague><cologne>',
       '<thailand><trekking><animal-riding>',
       '<food-and-drink><netherlands><belgium><asia><luxembourg>',
       '<visas><usa><uk><indian-citizens><tourist-visas>'], dtype=object)
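
To make the comparison with tags more concrete, the tag strings can be split into individual tags and counted. This is a minimal sketch using four of the tag strings shown above and the top words of topic 4. `parse_tags` is a hypothetical helper, and the prefix match is a deliberately crude heuristic.

```python
import re
from collections import Counter


def parse_tags(tag_string):
    """Split '<a><b-c>' into ['a', 'b-c']."""
    return re.findall(r"<([^<>]+)>", tag_string)


tag_strings = [
    "<thailand><where-on-earth><southeast-asia>",
    "<driving><portugal><finland>",
    "<thailand><trekking><animal-riding>",
    "<visas><usa><uk><indian-citizens><tourist-visas>",
]
tag_counts = Counter(tag for s in tag_strings for tag in parse_tags(s))

# Top words of one LDA topic, copied from the printed output.
topic_words = ["visa", "US", "Schengen"]

# A topic word "matches" when some tag starts with it (case-insensitive).
overlap = {
    word for word in topic_words
    if any(tag.startswith(word.lower()) for tag in tag_counts)
}
```

Run over the full `Tags` column, the same counts would show whether the most frequent tags also surface as topic words.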

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 2. What makes an answer likely to be "accepted"?

---

**2.1 Build a model to predict whether a post will be marked as the answer.**

- This is a classification problem.
- You're free to use any of the machine learning algorithms or techniques we have learned in class to build the best model you can.
- NLP will be very useful here for pulling out useful and relevant features from the data. 
- Though not required, using bagging and boosting models like Random Forests and Gradient Boosted Trees will _probably_ get you the highest performance on the test data (but who knows!).


**2.2 Evaluate the performance of your classifier with a confusion matrix and accuracy. Explain how your model is performing.**

**2.3 Plot either a ROC curve or precision-recall curve (or both!) and explain what they tell you about your model.**

NOTE: You should only be predicting this for `PostTypeId == 2` posts, which are the "answer" posts. This doesn't mean, however, that you can't or shouldn't use the parent questions as predictors!
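
For item 2.3, the curve coordinates can be computed with scikit-learn's `roc_curve` and `precision_recall_curve`. The sketch below uses four made-up labels and scores purely for illustration. With an `SVC`, continuous scores would presumably come from `decision_function` on the test split rather than from `predict`.

```python
from sklearn.metrics import auc, precision_recall_curve, roc_curve

# Made-up ground truth and classifier scores (higher score = more likely accepted).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# ROC: false-positive rate against true-positive rate over all thresholds.
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Precision-recall: often more informative when the positive class is the minority.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
```

The arrays can be passed straight to `plt.plot(fpr, tpr)` and `plt.plot(recall, precision)`. Since accepted answers are the minority class here (6,516 of 23,967), the precision-recall view is worth including alongside the ROC curve.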


In [51]:
# CREATE DF TO EVALUATE ACCEPTED ANSWERS

cols_1 = ['Id', 'Body', 'CommentCount', 'Score']
cols_2 = ['ViewCount', 'AnswerCount', 'AcceptedAnswerId']

accepted_answer = pd.merge(left=posts_answer[cols_1], right=posts_questions[cols_2], 
                         how='left', left_on='Id', right_on='AcceptedAnswerId')

accepted_answer.head()

Unnamed: 0,Id,Body,CommentCount,Score,ViewCount,AnswerCount,AcceptedAnswerId
0,19,href relnofollowstrongEURailstrong should be g...,3,10,,,
1,20,Seat is the absolute definitive guide for inte...,3,51,4561.0,12.0,20.0
2,22,The site that stands out by mile is hrefseatco...,1,10,,,
3,29,guess you might be in the UK based on the netw...,4,10,,,
4,32,Round the world fares do exist Most airline al...,5,26,2362.0,4.0,32.0


In [52]:
accepted_answer.isnull().sum()

Id                      0
Body                    0
CommentCount            0
Score                   0
ViewCount           17451
AnswerCount         17451
AcceptedAnswerId    17451
dtype: int64

_Before using the text column, I will first try a classification model with the other features to see what results they yield_

In [53]:
# 1 IF ACCEPTED ELSE 0
accepted_answer.AcceptedAnswerId = accepted_answer.AcceptedAnswerId.isnull().map({True: 0, False: 1})

# IMPUTE NULL VALUES WITH 0
accepted_answer.AnswerCount.fillna(value=0, inplace=True)
accepted_answer.ViewCount.fillna(value=0, inplace=True)

# MODIFY COLUMN NAMES
accepted_answer.columns = [u'Id', u'Body', u'CommentCount', u'Score', u'ViewCount', u'AnswerCount',
       u'Accepted']

In [54]:
accepted_answer.head()

Unnamed: 0,Id,Body,CommentCount,Score,ViewCount,AnswerCount,Accepted
0,19,href relnofollowstrongEURailstrong should be g...,3,10,0.0,0.0,0
1,20,Seat is the absolute definitive guide for inte...,3,51,4561.0,12.0,1
2,22,The site that stands out by mile is hrefseatco...,1,10,0.0,0.0,0
3,29,guess you might be in the UK based on the netw...,4,10,0.0,0.0,0
4,32,Round the world fares do exist Most airline al...,5,26,2362.0,4.0,1


In [55]:
accepted_answer.Accepted.value_counts()

0    17451
1     6516
Name: Accepted, dtype: int64

In [56]:
# SVM

In [57]:
# DESIGN & TARGET
X = accepted_answer.iloc[:, 2:-1]
y = accepted_answer.iloc[:, -1]

ss = StandardScaler()
Xn = ss.fit_transform(X)

In [58]:
# FUNCTION TO PRINT CLASSIFICATION METRICS
def print_cm_cr(y_true, y_pred):
    """prints the confusion matrix and the classification report"""
    confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
    print confusion
    print
    print classification_report(y_true, y_pred)

In [59]:
# TRAIN TEST SPLIT
lin_model = SVC(kernel='linear')

X_train, X_test, y_train, y_test = train_test_split(Xn, y, stratify=y, test_size=0.33)
lin_model.fit(X_train, y_train)

y_pred = lin_model.predict(X_test)
print_cm_cr(y_test, y_pred)

Predicted     0     1   All
Actual                     
0          5759     0  5759
1             0  2151  2151
All        5759  2151  7910

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      5759
          1       1.00      1.00      1.00      2151

avg / total       1.00      1.00      1.00      7910



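_The scores above are unrealistic and are a result of how the DataFrame was constructed. `ViewCount` and `AnswerCount` were merged on `AcceptedAnswerId`, so they are populated only for accepted answers (every other row was imputed with 0) and reveal the target directly. This is an extreme case of data leakage._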
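
One way to avoid this, sketched on made-up frames, is to attach the question's fields to every answer through `ParentId` and derive the label separately by comparing the answer `Id` with the question's `AcceptedAnswerId`. The toy values and the `QuestionId` column name are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for posts_answer and posts_questions.
answers = pd.DataFrame({
    "Id": [101, 102, 103],
    "ParentId": [1, 1, 2],
    "CommentCount": [3, 0, 1],
    "Score": [10, 51, 4],
})
questions = pd.DataFrame({
    "Id": [1, 2],
    "ViewCount": [4000, 250],
    "AnswerCount": [2, 1],
    "AcceptedAnswerId": [102.0, None],
})

# Join on the parent question so every answer receives the question's features.
question_cols = questions.rename(columns={"Id": "QuestionId"})
merged = answers.merge(question_cols, how="left", left_on="ParentId", right_on="QuestionId")

# The label comes from comparing ids; NaN never equals an id, so it maps to 0.
merged["Accepted"] = (merged["Id"] == merged["AcceptedAnswerId"]).astype(int)

features = merged[["CommentCount", "Score", "ViewCount", "AnswerCount"]]
target = merged["Accepted"]
```

With this construction `ViewCount` and `AnswerCount` are present for accepted and non-accepted answers alike, so they no longer encode the target.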

In [60]:
vectorizer = CountVectorizer(min_df = 1, stop_words = custom_stop_words)
dtm = vectorizer.fit_transform(accepted_answer.Body)

# FIT LSA MODEL
lsa = TruncatedSVD(n_components=2, algorithm = 'arpack') # algorithm='randomized'
dtm_lsa = lsa.fit_transform(dtm)
dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 3. What is the score of a post?

---

**3.1 Build a model that predicts the score of a post.**

- This is a regression problem now. 
- You can and should be predicting score for both "question" and "answer" posts, so keep them both in your dataset.
- Again, use any techniques that you think will get you the best model.

**3.2 Evaluate the performance of your model with cross-validation and report the results.**

**3.3 What is important for determining the score of a post, if anything?**
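
For 3.1 to 3.3, a minimal sketch, assuming TF-IDF text features and a ridge regression (neither is prescribed by the project). The texts and scores are invented toy values. The point is the shape of the workflow: cross-validated error first, then the fitted coefficients as a rough view of which terms push the score up or down.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Invented toy posts and scores.
texts = [
    "visa application for the uk", "schengen visa rules", "train from prague",
    "bus or train in romania", "cheap flights to thailand", "airline baggage rules",
    "driving in portugal", "passport renewal abroad", "travel insurance tips",
]
scores = [5, 7, 2, 1, 9, 3, 2, 6, 4]

model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))

# Mean absolute error per fold; sklearn reports it negated, so flip the sign.
cv = KFold(n_splits=3, shuffle=True, random_state=0)
cv_mae = -cross_val_score(model, texts, scores, cv=cv, scoring="neg_mean_absolute_error")

# Fit on everything and pair each vocabulary term with its coefficient.
model.fit(texts, scores)
vocab = model.named_steps["tfidfvectorizer"].vocabulary_
coefs = model.named_steps["ridge"].coef_
term_weights = sorted(((coefs[i], term) for term, i in vocab.items()), reverse=True)
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, which keeps vocabulary statistics from the validation fold out of the model.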


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 4. How many views does a post have?

---

**4.1 Build a model that predicts the number of views a post has.**

- This is another regression problem. 
- Predict the views for all posts, not just the "answer" posts.

**4.2 Evaluate the performance of your model with cross-validation and report the results.**

**4.3 What is important for the number of views a post has, if anything?**

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 5. Build a pipeline or other code to automate evaluation of your models on the test data.

---

Now that you've constructed your three predictive models, build a pipeline or code that can easily load up the raw testing data and evaluate your models on it.

The testing data that is held out is in the same raw format as the training data you have. _Any cleaning and preprocessing that you did on the training data will need to be done on the testing data as well!_

This is a good opportunity to practice building pipelines, but you're not required to. Custom functions and classes are fine as long as they are able to process and test the new data.
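
A minimal sketch of what such a pipeline could look like, assuming a text-only classifier: the cleaning step lives inside the vectorizer, so raw test text goes through exactly the same preprocessing as the training text. `preprocess`, `build_pipeline` and `evaluate` are hypothetical names, the toy data is invented, and logistic regression is just a placeholder model.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def preprocess(text):
    """Strip HTML tags and lowercase; runs inside the vectorizer on every document."""
    return re.sub(r"<[^>]+>", " ", text).lower()


def build_pipeline():
    return Pipeline([
        ("tfidf", TfidfVectorizer(preprocessor=preprocess)),
        ("clf", LogisticRegression()),
    ])


def evaluate(pipeline, texts, labels):
    """Return predictions and accuracy for raw, uncleaned texts."""
    predictions = pipeline.predict(texts)
    accuracy = pipeline.score(texts, labels)
    return predictions, accuracy


# Invented toy data in the same raw HTML form as the Body column.
train_texts = [
    "<p>Visa application for the UK</p>", "<p>Schengen visa rules</p>",
    "<p>Train from Prague to Cologne</p>", "<p>Bus or train in Romania</p>",
]
train_labels = [1, 1, 0, 0]
test_texts = ["<p>UK visa letter</p>", "<p>Night train to Berlin</p>"]
test_labels = [1, 0]

pipeline = build_pipeline().fit(train_texts, train_labels)
predictions, accuracy = evaluate(pipeline, test_texts, test_labels)
```

Merging and feature construction from the raw `.csv` files would still need to be wrapped in a function that is called identically on the training and testing files.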


<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 6. Let's Model - Tournament for stock market predictions

>Start this section of the project by downloading the train and test datasets from the following site: https://numer.ai/rules

> - The data set is clean; your goal is to develop one or more classification models.
> - Report all the results, including log loss and any other coefficients you consider interesting.
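
For reference, binary log loss is the negative mean log-likelihood of the true labels under the predicted probabilities. A small standard-library sketch (the clipping constant `eps` is an assumption that keeps `log(0)` from occurring):

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss: -mean(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so the logarithm stays finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


uninformative = log_loss([1, 0], [0.5, 0.5])    # always predicting 0.5
confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])
```

Predicting 0.5 everywhere gives ln 2 (about 0.693), so a model is only adding information when its log loss falls below that baseline. Confident wrong predictions are penalized much more heavily than confident correct ones are rewarded.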