<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 7

## NLP and Machine Learning on [travel.stackexchange.com](http://travel.stackexchange.com/) data

---

In Project 7 you'll be doing NLP and machine learning on post data from Stack Exchange's travel subdomain.

This project is set up like a mini Kaggle competition. You are given the training data, and when projects are submitted your model will be tested on the held-out test data. There will be prizes for the people who build the models that perform best on the held-out test set!

---

## Notes on the data

The data is again compressed into the `.7z` file format to save space. There are six `.csv` files and one readme file that describes the fields.

    posts_train.csv
    comments_train.csv
    users.csv
    badges.csv
    votes_train.csv
    tags.csv
    readme.txt
    
The data is located in your datasets folder:

    DSI-SF-2/datasets/stack_exchange_travel.7z
    
If you're interested in where this data came from and where to get more data from other stackexchange subdomains, see here:

https://ia800500.us.archive.org/22/items/stackexchange/readme.txt


### Recommended Utilities for .7z

- For OSX, try [Keka](http://www.kekaosx.com/en/) or [The Unarchiver](http://wakaba.c3.cx/s/apps/unarchiver.html).
- For Windows, [7-zip](http://www.7-zip.org/) is the standard.
- For Linux, try the `p7zip` utility: `sudo apt-get install p7zip`.



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [685]:
badges = pd.read_csv('../../datasets/stack_exchange_travel/badges.csv')
posts = pd.read_csv('../../datasets/stack_exchange_travel/posts_train.csv')
tags = pd.read_csv('../../datasets/stack_exchange_travel/tags.csv')
votes = pd.read_csv('../../datasets/stack_exchange_travel/votes_train.csv')
comments = pd.read_csv('../../datasets/stack_exchange_travel/comments_train.csv')
users = pd.read_csv('../../datasets/stack_exchange_travel/users.csv')

print badges.shape
print posts.shape
print tags.shape
print votes.shape
print comments.shape
print users.shape

(71246, 6)
(41289, 21)
(1606, 5)
(289553, 6)
(81506, 7)
(29283, 14)


In [724]:
# print badges.info()
# badges.tail(5).T

In [32]:
# with pd.option_context('display.max_rows', 999, 'display.max_columns', 3):
#     print badges['Name'].value_counts()

In [1161]:
# print posts.info()
# posts.head(3).T

In [689]:
# print tags.info()
# tags.head()

In [590]:
# print votes.info()
# votes.head(3).T

In [723]:
# print comments.info()
# comments.head(3).T

In [722]:
# print users.info()
# users.head(3).T

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 1. Use LDA to find what topics are discussed on travel.stackexchange.com.

---

Text can be found in the posts and the comments datasets. For a given post, the `ParentId` column in the posts dataset identifies the "question" post it belongs to. Comment text can be merged onto the post it is part of with the `PostId` field.
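As a rough sketch of the merge described above, the comments on each post can be concatenated and joined onto the posts table by matching `PostId` to the post's `Id`. The tiny frames and the `Text` column name are assumptions for illustration; check `readme.txt` for the actual field names.

```python
import pandas as pd

# Toy stand-ins for posts_train.csv and comments_train.csv (hypothetical rows).
posts_demo = pd.DataFrame({
    'Id': [1, 2, 3],
    'ParentId': [None, 1, 1],      # posts 2 and 3 answer question 1
    'Body': ['Which rail pass?', 'Try Eurail.', 'See seat61.'],
})
comments_demo = pd.DataFrame({
    'PostId': [1, 1, 3],
    'Text': ['Which countries?', 'For how long?', 'Great site.'],  # column name assumed
})

# Collapse all comments on a post into one string, then left-join onto posts.
comment_text = (comments_demo.groupby('PostId')['Text']
                .apply(' '.join)
                .rename('CommentText')
                .reset_index())
merged_demo = posts_demo.merge(comment_text, how='left',
                               left_on='Id', right_on='PostId')

# Posts without comments get an empty string instead of NaN.
merged_demo['CommentText'] = merged_demo['CommentText'].fillna('')
merged_demo['AllText'] = (merged_demo['Body'] + ' ' + merged_demo['CommentText']).str.strip()
```

A left join keeps posts that have no comments, which matters if the combined text is later fed to LDA alongside the post bodies.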

The text may have some HTML tags. BeautifulSoup has convenient ways to get rid of markup or extract text if you need to. You can also parse the strings yourself if you like.
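If you prefer to parse the strings yourself, one possible approach is sketched below. It uses the standard library's `HTMLParser` rather than BeautifulSoup, is written in Python 3 syntax, and the helper names `TextExtractor` and `html_to_text` are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(''.join(parser.chunks).split())


sample = "<p>My fianc&eacute;e &amp; I want a <strong>cruise</strong>.</p>\n\n<p>Any tips?</p>"
clean_text = html_to_text(sample)
```

Unlike a plain regular expression over `<...>`, a parser also decodes character entities, so `&amp;` does not end up as a token in the corpus.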

The tags dataset has the "tags" that the users have officially given the post.

**1.1 Implement LDA against the text features of the dataset(s).**

- This can be posts or a combination of posts and comments if you want more power.
- Find optimal **K/num_topics**.
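One way to look for a reasonable number of topics is to fit a model for each candidate K and compare them on documents held out from fitting. The sketch below swaps in scikit-learn's `LatentDirichletAllocation` and its `perplexity` method on a toy corpus so that it is self-contained; the corpus and candidate values are invented, and the same idea should carry over to a gensim model.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus (hypothetical); in the project this would be the cleaned post text.
docs = [
    'flight ticket airline airport luggage',
    'airline flight delay airport ticket',
    'visa passport embassy schengen border',
    'passport visa customs border embassy',
    'train bus station taxi ticket',
    'bus train station seat taxi',
] * 10

X = CountVectorizer().fit_transform(docs)
X_fit, X_held = train_test_split(X, test_size=0.25, random_state=0)

# Fit one model per candidate K and score it on documents it has not seen.
candidate_ks = [2, 3, 5, 8]
perplexities = {}
for k in candidate_ks:
    model = LatentDirichletAllocation(n_components=k, random_state=0)
    model.fit(X_fit)
    perplexities[k] = model.perplexity(X_held)

# Lower held-out perplexity is better; this is one heuristic among several.
best_k = min(perplexities, key=perplexities.get)
```

Perplexity does not always agree with how interpretable the topics look, so it is worth reading the top words for the best few values of K before settling on one.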

**1.2 Compare your topics to the tags. Do the LDA topics make sense? How do they compare to the tags?**
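A simple way to approach this comparison is to assign each question its most probable topic and then count which tags occur most often under each topic. The sketch below uses invented (topic, tags) pairs; only the `<tag1><tag2>` format of the `Tags` column is taken from the data.

```python
from collections import Counter, defaultdict


def parse_tags(tag_string):
    """Split a raw Tags value such as '<visas><schengen>' into a list."""
    return tag_string.strip('<>').split('><')


# Hypothetical (dominant topic label, raw Tags string) pairs, one per question.
labelled_posts = [
    ('Air Travel', '<air-travel><tickets><price>'),
    ('Air Travel', '<tickets><price>'),
    ('Immigration & Customs', '<visas><schengen><italy>'),
    ('Immigration & Customs', '<visas><uae>'),
    ('Trip Planning', '<safety><rental>'),
]

# Count how often each tag appears under each topic.
tags_by_topic = defaultdict(Counter)
for topic, raw_tags in labelled_posts:
    tags_by_topic[topic].update(parse_tags(raw_tags))

# The most frequent tags per topic hint at whether the topic label is sensible.
top_tags = {topic: [tag for tag, _ in counts.most_common(2)]
            for topic, counts in tags_by_topic.items()}
```

If a topic's top tags are spread over unrelated subjects, that topic is probably a catch-all, and a different K may be worth trying.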


In [692]:
from gensim import corpora, models
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import deaccent, lemmatize
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict

In [463]:
# Keep only posts that have a title (in practice, the question posts) to lower the processing overhead
posts_clean = posts[~posts['Title'].isnull()]
print posts.shape
print posts_clean.shape

(41289, 21)
(13988, 21)


In [684]:
# Flatten every post's '<tag1><tag2>' string into one list of tag names
tags_list = [tag
             for ptags in posts['Tags'].dropna()
             for tag in ptags.strip('<>').split('><')]
print "Number of unique tags:", np.unique(tags_list).shape[0]

Number of unique tags: 1550


In [464]:
posts_clean = posts_clean['Title'] + posts_clean['Body']
posts_clean.head(1)

0    What are some Caribbean cruises for October?<p>My fiancée and I are looking for a good Caribbean cruise in October and were wondering which islands are best to see and which Cruise line to take?</p>\n\n<p>It seems like a lot of the cruises don't run in this month due to Hurricane season so I'm looking for other good options.</p>\n\n<p><strong>EDIT</strong> We'll be travelling in 2012.</p>\n
dtype: object

In [465]:
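import re

# Non-greedy match of anything between '<' and '>'; compiled once, reused per post
HTML_TAG_RE = re.compile('<.*?>')

def remove_html_tags(text):
    """Strip HTML tags from text (entities such as &amp; are left untouched)."""
    return HTML_TAG_RE.sub('', text)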

In [466]:
posts_clean = [remove_html_tags(post) for post in posts_clean]

In [467]:
posts_clean = [deaccent(post) for post in posts_clean]

In [468]:
posts_clean[0]

u"What are some Caribbean cruises for October?My fiancee and I are looking for a good Caribbean cruise in October and were wondering which islands are best to see and which Cruise line to take?\n\nIt seems like a lot of the cruises don't run in this month due to Hurricane season so I'm looking for other good options.\n\nEDIT We'll be travelling in 2012.\n"

In [469]:
posts_processed = [lemmatize(post, stopwords=STOPWORDS, min_length=2, max_length=15)
                   for post in posts_clean]

In [473]:
print "Number of posts:", len(posts_processed)
print "Total words:", np.sum([len(post) for post in posts_processed])
posts_processed[0][:10]

Number of posts: 13988
Total words: 617521


['caribbean/NN',
 'cruise/NN',
 'october/JJ',
 'fiancee/NN',
 'look/VB',
 'good/JJ',
 'caribbean/NN',
 'cruise/NN',
 'october/JJ',
 'wonder/VB']

In [319]:
# Create gensim dictionary object
dictionary = corpora.Dictionary(posts_processed)

# Create corpus matrix
corpus = [dictionary.doc2bow(post) for post in posts_processed]

In [551]:
lda = models.LdaModel(
    corpus,
    num_topics  =  5,
    passes      =  3,
    id2word     =  dictionary
#     alpha       =  'auto',
#     eta         =  'auto'
)

In [553]:
lda.print_topics(num_topics=5, num_words=20)

[(0,
  u'0.030*flight/NN + 0.022*ticket/NN + 0.012*airline/NN + 0.012*airport/NN + 0.010*card/NN + 0.009*luggage/NN + 0.008*time/NN + 0.007*buy/VB + 0.007*train/NN + 0.007*check/VB + 0.006*fly/VB + 0.006*way/NN + 0.005*bag/NN + 0.005*travel/NN + 0.005*book/VB + 0.005*check/NN + 0.005*question/NN + 0.005*pay/VB + 0.005*know/VB + 0.004*hour/NN'),
 (1,
  u'0.016*airport/NN + 0.015*bus/NN + 0.013*train/NN + 0.012*london/NN + 0.010*seat/NN + 0.009*station/NN + 0.007*travel/VB + 0.007*taxi/NN + 0.006*india/NN + 0.006*time/NN + 0.005*hour/NN + 0.005*know/VB + 0.005*child/NN + 0.005*way/NN + 0.005*option/NN + 0.005*phone/NN + 0.004*want/VB + 0.004*possible/JJ + 0.004*travele/VB + 0.004*terminal/NN'),
 (2,
  u'0.008*know/VB + 0.007*place/NN + 0.007*look/VB + 0.007*car/NN + 0.007*person/NN + 0.007*country/NN + 0.005*question/NN + 0.005*time/NN + 0.005*trip/NN + 0.004*want/VB + 0.004*city/NN + 0.004*good/JJ + 0.004*way/NN + 0.004*japan/NN + 0.004*travel/NN + 0.004*map/NN + 0.003*ve/NN + 0.003*sit

In [578]:
topics_labels = {
    0: "Air Travel",
    1: "Transportation",
    2: "Miscellaneous",
    3: "Trip Planning",
    4: "Immigration & Customs"
}

In [579]:
doc_topics = [lda.get_document_topics(doc) for doc in corpus]

# One row per (document, topic) pair with the topic's probability
topic_data = []

for document_id, topics in enumerate(doc_topics):
    for topic, probability in topics:
        topic_data.append({
            'document_id':  document_id,
            'topic_id':     topic,
            'topic':        topics_labels[topic],
            'probability':  probability
        })

topics_df = pd.DataFrame(topic_data)
topics_df = pd.DataFrame(topics_df.pivot_table(values="probability", index=["document_id", "topic"]).T)

In [580]:
# topics_df.head(30)

In [615]:
posts_merge = posts[~posts['Title'].isnull()].reset_index().drop('index', axis=1).copy()
# posts_merge.head(3)

In [618]:
from orderedset import OrderedSet

posts_topics = topics_df.reset_index().merge(
    posts_merge[['Title', 'Body', 'Tags']], 
    how='left', 
    left_on='document_id', 
    right_index=True
    ).groupby('document_id').agg(lambda group: OrderedSet(group))

In [619]:
# posts_topics.head()

In [622]:
# pd.describe_option('display')
pd.options.display.max_colwidth = 1000
print posts_topics.shape
posts_topics.tail()

(13988, 5)


document_id,topic,probability,Title,Body,Tags
13983,"(Immigration & Customs, Transportation)","(0.94235576946, 0.0368893276714)",(Schengen visa application),(<p>I am a Srilankan residing in UAE for past 10 years. My family was in UAE for past 9/10 years. Now they are back in UAE on residence visa while I have continued my stay in UAE.</p>\n\n<p>1.If the UAE residence visa has been stamped only 15 days ago for my family members can I still apply for Schengen visas for them? </p>\n),(<visas><schengen><italy><visa-refusal><uae>)
13984,"(Immigration & Customs, Miscellaneous)","(0.671394501209, 0.317076101721)",(Getting a full time work permit in Canada after a language school?),"(<p>What are chances to get a full-time work permit (read work visa) after a completing a year-long English language courses program at VGC in Vancouver.</p>\n\n<p>If to be more specific, I'm planning a trip, a long one, and Canada is my first stop. So, my main priority is to improve my English skills and move on, but in a case I want to stay a little longer and continue my education there, I would need money to support myself. Incidentally, I already have the Specialist Degree (which is close to Master Degree) in Applied Computer Science which I got in Saint Petersburg, Russia.</p>\n)",(<canada><work><english-language>)
13985,"(Air Travel, Miscellaneous, Transportation)","(0.953939969767, 0.0206117037876, 0.0219526761871)",(How can the same flight cost less one leg earlier?),"(<p>I am booking a round-trip flight to the United States and found that I can save up to 700 euro on a 1200+ euro ticket if I board the same plane one leg earlier. On the flight company's website a ticket from Amsterdam, Netherlands to Boston, United States costs me 1.227,75 euro. However, if I travel to Dusseldorf, Germany and board the exact same plane, the ticket will cost me 568,66 dollar (500+ euro). The latter flight will make a stop in Amsterdam and then continue to Boston.</p>\n\n<p>The price from AMS to BOS is broken down as follows (in euro),</p>\n\n<pre><code>ticket price 880,00\ncarrier-imposed international surcharge 259,00\nairport passenger service charge 13,00\nnoise surcharge 0,50\nsecurity charge 10,53\nus customs user fee ...",(<tickets><price>)
13986,(Air Travel),(0.973588142855),(The plane is half empty yet the ticket price has risen),"(<p>Consider this as an example:\nThe flight below which is two days from now is half empty. However, the price has risen considerably during last week. Why did this happen? If the plane is half empty why the price should rise? \n<a href=""http://i.stack.imgur.com/Tp4r1.png"" rel=""nofollow""><img src=""http://i.stack.imgur.com/Tp4r1.png"" alt=""enter image description here""></a>\n<a href=""http://i.stack.imgur.com/kQuRk.png"" rel=""nofollow""><img src=""http://i.stack.imgur.com/kQuRk.png"" alt=""enter image description here""></a></p>\n\n<p><strong>Edit</strong></p>\n\n<p>I should mention that in the example above the seat selection is included in the ticket price, that is, even if someone does not want to select a seat then they still pay for it.</p>\n)",(<air-travel><tickets><price>)
13987,"(Air Travel, Immigration & Customs, Miscellaneous, Transportation, Trip Planning)","(0.0125287542681, 0.0125641726711, 0.308825377885, 0.0125453616159, 0.65353633356)",(Hire car in south amercia),"(<p>We are thinking of hiring a car from Bueons Aires to drive to Santiago. Has anyone done this and are the roads, towns encounter safe etc??</p>\n)",(<safety><rental>)


In [627]:
pd.options.display.max_colwidth = 100

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 2. What makes an answer likely to be "accepted"?

---

**2.1 Build a model to predict whether a post will be marked as the answer.**

- This is a classification problem.
- You're free to use any of the machine learning algorithms or techniques we have learned in class to build the best model you can.
- NLP will be very useful here for pulling out useful and relevant features from the data. 
- Though not required, using bagging and boosting models like Random Forests and Gradient Boosted Trees will _probably_ get you the highest performance on the test data (but who knows!).
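As one possible way to bring NLP features into the classifier, the answer text can be converted to TF-IDF weights and stacked next to numeric columns. The sketch below uses invented answers and labels, and `LogisticRegression` is only a placeholder for whichever model you choose.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical answers: cleaned body text, a numeric feature, and the label.
bodies = [
    'seat61 is the definitive guide for rail travel',
    'no idea sorry',
    'check the embassy website for visa requirements and fees',
    'maybe',
    'round the world fares exist and most alliances sell them',
    'i do not know',
]
comment_counts = np.array([[3], [0], [2], [0], [5], [1]])
is_accepted = np.array([1, 0, 1, 0, 1, 0])

# Turn the text into a sparse TF-IDF matrix (one column per term).
tfidf = TfidfVectorizer(stop_words='english')
X_text = tfidf.fit_transform(bodies)

# Append the numeric column so the classifier sees both kinds of feature.
X_all = hstack([X_text, csr_matrix(comment_counts)]).tocsr()

clf = LogisticRegression()
clf.fit(X_all, is_accepted)
predicted = clf.predict(X_all)
```

Keeping the matrix sparse avoids materialising tens of thousands of mostly-zero term columns. On real data the vectorizer should be fit on the training split only and then applied to the test split with `transform`.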


**2.2 Evaluate the performance of your classifier with a confusion matrix and accuracy. Explain how your model is performing.**

**2.3 Plot either a ROC curve or precision-recall curve (or both!) and explain what they tell you about your model.**

NOTE: You should only be predicting this for posts with `PostTypeId == 2`, which are the "answer" posts. This doesn't mean, however, that you can't or shouldn't use the parent questions as predictors!
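To illustrate how parent questions could serve as predictors, the sketch below joins each answer to its question through `ParentId`, derives the accepted flag from the question's `AcceptedAnswerId`, and builds one example feature. The rows are invented, and `MinutesToAnswer` is a hypothetical feature name.

```python
import pandas as pd

# Toy posts table (hypothetical rows) using column names seen in this notebook.
posts_demo = pd.DataFrame({
    'Id': [16, 19, 20, 25, 29],
    'PostTypeId': [1, 2, 2, 1, 2],
    'ParentId': [None, 16, 16, None, 25],
    'AcceptedAnswerId': [20, None, None, None, None],
    'CreationDate': ['2011-06-21T20:30:00', '2011-06-21T20:38:27',
                     '2011-06-21T20:38:39', '2011-06-21T20:40:00',
                     '2011-06-21T20:46:12'],
})
posts_demo['CreationDate'] = pd.to_datetime(posts_demo['CreationDate'])

questions = posts_demo[posts_demo['PostTypeId'] == 1]
answers = posts_demo[posts_demo['PostTypeId'] == 2].copy()
answers['ParentId'] = answers['ParentId'].astype(int)

# Attach each answer's parent question; suffixes keep the two sets of columns apart.
paired = answers.merge(questions, how='left', left_on='ParentId',
                       right_on='Id', suffixes=('', '_question'))

# Target: the answer is accepted when the question points back at it.
paired['IsAccepted'] = (paired['Id'] == paired['AcceptedAnswerId_question']).astype(int)

# Example predictor: minutes between the question and the answer.
delay = paired['CreationDate'] - paired['CreationDate_question']
paired['MinutesToAnswer'] = delay.dt.total_seconds() / 60.0
```

The same join makes other question-level columns, such as the question's text length or tags, available on every answer row.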


In [1175]:
# Get posts that are answers
answer_posts = posts[posts['PostTypeId'] == 2]

In [1173]:
pd.options.mode.chained_assignment = None

# Keep only the "accept" votes (VoteTypeId 1), which are cast by the question's author
accept_votes = votes[votes['VoteTypeId'] == 1]
accept_votes.dropna(axis=1, inplace=True)

In [1178]:
# print answer_posts.shape
# answer_posts.head(3)

In [1302]:
# posts[posts['PostTypeId'] == 5]

In [None]:
# XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

In [1241]:
answer_posts = posts.copy()
AAIds = [int(x) for x in answer_posts['AcceptedAnswerId'] if x == x]
np.sort(AAIds)[:10]

array([ 20,  32,  48,  62,  71,  76,  91,  96, 103, 109])

In [1242]:
# 1 for accepted answers, NaN otherwise; isin avoids scanning the ID list once per row
answer_posts['IsAccepted'] = np.where(answer_posts['Id'].isin(AAIds), 1, np.nan)

In [1243]:
answer_posts = answer_posts[answer_posts['PostTypeId'] == 2]

In [1247]:
answer_posts.dropna(subset=['OwnerUserId'], inplace=True)

In [1248]:
answer_posts.head(2)

Unnamed: 0,Body,CommentCount,CreationDate,Id,LastActivityDate,LastEditDate,LastEditorDisplayName,OwnerUserId,ParentId,Score,IsAccepted
9,"<p><a href=""http://www.eurail.com/home"" rel=""nofollow""><strong>EURail</strong></a> should be a g...",3,2011-06-21T20:38:27.483,19,2011-10-14T13:44:53.930,2011-10-14T13:44:53.930,,33.0,16.0,10,
10,<p>Seat 61 is the absolute definitive guide for international rail travel. It has all the inform...,3,2011-06-21T20:38:39.520,20,2011-06-21T20:38:39.520,,,30.0,16.0,51,1.0


In [1245]:
# answer_posts.info()

In [None]:
# XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

In [1246]:
# Drop useless columns
answer_posts.drop([
        'AcceptedAnswerId', 
        'AnswerCount', 
        'ClosedDate', 
        'CommunityOwnedDate', 
        'FavoriteCount',  
        'OwnerDisplayName', 
        'LastEditorUserId',
        'PostTypeId',
        'Tags', 
        'Title', 
        'ViewCount'
    ], axis=1, inplace=True)

In [1238]:
accept_votes.drop(['Id', 'VoteTypeId'], axis=1, inplace=True)

pd.options.mode.chained_assignment = 'warn'

In [1077]:
# Merge answer posts with accepting votes
answer_posts = answer_posts.merge(accept_votes, how='left', left_on='Id', right_on='PostId').drop(['Id'], axis=1)

# Drop accepted answers without an owner because we want to have user info
answer_posts.dropna(subset=['OwnerUserId'], inplace=True)

In [1239]:
answer_posts = answer_posts.rename(columns={'CreationDate_y': 'IsAccepted'})

In [1249]:
print "Number of accepted answers:", int(answer_posts['IsAccepted'].notnull().sum())
print "Total number of answers:", answer_posts.shape[0]

Number of accepted answers: 6414
Total number of answers: 23564


In [1267]:
# Merge users
# so_df = answer_posts.merge(users, how='left', left_on='OwnerUserId', right_on='Id').drop('Id', axis=1)
so_df = answer_posts.merge(users, how='left', left_on='OwnerUserId', right_on='Id').drop('Id_y', axis=1)

In [1268]:
# Pivot badges and merge them
badges_pivot = badges[badges['TagBased'] == False]
badges_pivot = badges_pivot.pivot_table(values='Class', index='UserId', columns='Name', aggfunc=(lambda x: 1))
so_df = so_df.merge(badges_pivot, how='left', left_on='OwnerUserId', right_index=True)

In [1269]:
print badges_pivot.shape
print so_df.shape

(21547, 82)
(23564, 106)


In [1270]:
so_df.head()

Unnamed: 0,Body,CommentCount,CreationDate_x,Id_x,LastActivityDate,LastEditDate,LastEditorDisplayName,OwnerUserId,ParentId,Score,...,Suffrage,Supporter,Synonymizer,Tag Editor,Talkative,Taxonomist,Teacher,Tumbleweed,Vox Populi,Yearling
0,"<p><a href=""http://www.eurail.com/home"" rel=""nofollow""><strong>EURail</strong></a> should be a g...",3,2011-06-21T20:38:27.483,19,2011-10-14T13:44:53.930,2011-10-14T13:44:53.930,,33.0,16.0,10,...,,1.0,,,,1.0,1.0,,,1.0
1,<p>Seat 61 is the absolute definitive guide for international rail travel. It has all the inform...,3,2011-06-21T20:38:39.520,20,2011-06-21T20:38:39.520,,,30.0,16.0,51,...,1.0,1.0,,1.0,,1.0,1.0,,,1.0
2,"<p>The site that stands out by a mile is <a href=""http://www.seat61.com/"">seat61.com</a>. Really...",1,2011-06-21T20:39:27.310,22,2011-06-21T20:39:27.310,,,26.0,16.0,10,...,1.0,1.0,,1.0,1.0,1.0,1.0,,1.0,1.0
3,"<p>I guess you might be in the UK based on the networks you listed?</p>\n\n<p>For within Europe,...",4,2011-06-21T20:46:12.313,29,2011-06-21T20:46:12.313,,,26.0,25.0,10,...,1.0,1.0,,1.0,1.0,1.0,1.0,,1.0,1.0
4,<p>Round-the-world fares do exist.</p>\n\n<p>Most airline alliances and occasionally single airl...,5,2011-06-21T20:49:43.270,32,2012-09-01T22:31:40.213,2012-09-01T22:31:40.213,user82,30.0,26.0,26,...,1.0,1.0,,1.0,,1.0,1.0,,,1.0


In [1271]:
so_df.columns[:25]

Index([u'Body', u'CommentCount', u'CreationDate_x', u'Id_x',
       u'LastActivityDate', u'LastEditDate', u'LastEditorDisplayName',
       u'OwnerUserId', u'ParentId', u'Score', u'IsAccepted', u'AboutMe',
       u'AccountId', u'Age', u'CreationDate_y', u'DisplayName', u'DownVotes',
       u'LastAccessDate', u'Location', u'ProfileImageUrl', u'Reputation',
       u'UpVotes', u'Views', u'WebsiteUrl', u'Altruist'],
      dtype='object')

In [1272]:
# Drop more useless columns
so_df.drop(['DisplayName', 
            'AccountId', 
#             'PostId',
            'Id_x',
            'OwnerUserId',
            'ParentId' # Could be used
           ], axis=1, inplace=True)

In [1273]:
so_df.iloc[:, :20].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23564 entries, 0 to 23563
Data columns (total 20 columns):
Body                     23564 non-null object
CommentCount             23564 non-null int64
CreationDate_x           23564 non-null object
LastActivityDate         23564 non-null object
LastEditDate             8750 non-null object
LastEditorDisplayName    222 non-null object
Score                    23564 non-null int64
IsAccepted               6414 non-null float64
AboutMe                  15188 non-null object
Age                      10039 non-null float64
CreationDate_y           23564 non-null object
DownVotes                23564 non-null int64
LastAccessDate           23564 non-null object
Location                 15499 non-null object
ProfileImageUrl          9450 non-null object
Reputation               23564 non-null int64
UpVotes                  23564 non-null int64
Views                    23564 non-null int64
WebsiteUrl               12551 non-null object
Altruis

In [1274]:
text_cols = [
    'AboutMe',
    'Body'
]
                 
time_cols = [
    'CreationDate_x', 
#     'CreationDate', 
    'LastActivityDate', 
    'LastAccessDate'
]

num_cols = [
    'CommentCount',
    'Score',
    'Age',
    'DownVotes',
    'Reputation',
    'UpVotes',
    'Views'
]

In [1275]:
# Change text features into the length of their text
def get_text_length(text):
    # NaN is the only value not equal to itself, so missing text maps to 0
    return 0 if text != text else len(text)

so_df['Body'] = so_df['Body'].map(get_text_length)
so_df['AboutMe'] = so_df['AboutMe'].map(get_text_length)

In [1276]:
# Change timestamps into number of days before a fixed reference date (2016-09-01)
def time_to_day_count(time_series):
    reference_date = pd.to_datetime('2016-09')
    timestamp = pd.to_datetime(time_series)
    return (reference_date - timestamp).astype('timedelta64[D]')

for col in time_cols:
    so_df[col] = time_to_day_count(so_df[col])

In [1277]:
# Booleanize all other features
def booleanize(x):
    return 0 if x != x else 1

non_bool_cols = text_cols + time_cols + num_cols
bool_cols = so_df.columns.difference(non_bool_cols)
so_df[bool_cols] = so_df[bool_cols].applymap(booleanize)

In [1278]:
so_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23564 entries, 0 to 23563
Columns: 101 entries, Body to Yearling
dtypes: float64(4), int64(97)
memory usage: 18.3 MB


In [1279]:
so_df.fillna(0, inplace=True)

In [1280]:
print so_df.dtypes.unique()
print so_df.shape

[dtype('int64') dtype('float64')]
(23564, 101)


In [1281]:
so_df.head(10).T.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
Body,386.0,424.0,234.0,575.0,1626.0,618.0,166.0,471.0,310.0,523.0
CommentCount,3.0,3.0,1.0,4.0,5.0,3.0,0.0,0.0,2.0,1.0
CreationDate_x,1898.0,1898.0,1898.0,1898.0,1898.0,1898.0,1898.0,1898.0,1898.0,1898.0


In [1282]:
so_df.head(10).T.tail(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
Tumbleweed,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Vox Populi,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
Yearling,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [1283]:
# so_df.iloc[:, :20].info()

In [1285]:
so_df.shape

(23564, 101)

In [1286]:
so_df['IsAccepted'].value_counts()

0    17150
1     6414
Name: IsAccepted, dtype: int64

In [1296]:
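# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split

def run_train_test_split(df, target, test_size=0.2):
    # Separate the target column from the predictors, then split at random
    y = df[target].values
    X = df.drop(target, axis=1)
    print "Train/test split executed, test size =", test_size
    return train_test_split(X, y, test_size=test_size)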

X_train, X_test, y_train, y_test = run_train_test_split(so_df, 'IsAccepted')

Train/test split executed, test size = 0.2


In [1297]:
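from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score

rfc = RandomForestClassifier(max_depth=2, n_estimators=100, n_jobs=-1)
rfc_scores = cross_val_score(rfc, X_train, y_train, cv=5)

# Baseline accuracy: share of the majority class (answers that were not accepted)
print so_df['IsAccepted'].value_counts()[0] / float(so_df.shape[0])
print rfc_scores, np.mean(rfc_scores)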

0.727805126464
[ 0.72898435  0.72898435  0.72891247  0.72891247  0.72910586] 0.728979901175
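The cross-validated accuracy above (about 0.729) is almost identical to the share of non-accepted answers (about 0.728), so accuracy alone does not show whether the forest separates the two classes at all. A confusion matrix and ROC AUC make that visible. The sketch below runs on synthetic data with a similar class balance, not on the project data, and its settings are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the answers data (about 73% negatives).
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     weights=[0.73, 0.27], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

# Accuracy of always predicting the majority class, for comparison.
baseline_accuracy = max(np.mean(y_te), 1 - np.mean(y_te))
model_accuracy = forest.score(X_te, y_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, forest.predict(X_te))

# ROC AUC uses the predicted probability of the positive class.
auc = roc_auc_score(y_te, forest.predict_proba(X_te)[:, 1])
```

If the second column of the confusion matrix is close to all zeros, the model is effectively predicting "not accepted" for every answer, and deeper trees, class weights, or a lower decision threshold may be worth trying.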


In [1157]:
# from sklearn.grid_search import GridSearchCV

# rfc = RandomForestClassifier()

# rf_params = {
#     'max_features':[None,'log2','sqrt', 2,3,4,5],
#     'max_depth':[1,3],
#     'min_samples_leaf':np.linspace(1,101,20),
#     'n_estimators':[10]
# }

# ## gridsearch parameters, and cv =5
# rf_gs = GridSearchCV(rfc, rf_params, cv=5, verbose=1, n_jobs=-1)

In [1156]:
# rf_gs.fit(X_train, y_train)

In [1155]:
# ## Print best estimator, best parameters, and best score
# rfc_best = rf_gs.best_estimator_
# print "best estimator", rfc_best
# print "\n==========\n"
# print "best parameters",  rf_gs.best_params_
# print "\n==========\n"
# print "best score", rf_gs.best_score_

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 3. What is the score of a post?

---

**3.1 Build a model that predicts the score of a post.**

- This is a regression problem now. 
- You can and should be predicting score for both "question" and "answer" posts, so keep them both in your dataset.
- Again, use any techniques that you think will get you the best model.

**3.2 Evaluate the performance of your model with cross-validation and report the results.**

**3.3 What is important for determining the score of a post, if anything?**
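A minimal sketch of steps 3.2 and 3.3 follows: cross-validate a regressor, then inspect which features it relied on. The data and the relationship between features and score are invented for the demo, and the column names merely echo ones used earlier in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for post features; the relationship below is invented.
rng = np.random.RandomState(0)
features = pd.DataFrame({
    'BodyLength': rng.randint(50, 2000, size=500),
    'CommentCount': rng.randint(0, 10, size=500),
    'Reputation': rng.randint(1, 50000, size=500),
})
score = 2 * features['CommentCount'] + rng.normal(0, 1, 500)

reg = RandomForestRegressor(n_estimators=50, random_state=0)

# 5-fold cross-validated R^2 (the default scorer for regressors).
r2_scores = cross_val_score(reg, features, score, cv=5)

# Fit once on all rows to inspect which features the forest relied on.
reg.fit(features, score)
importances = pd.Series(reg.feature_importances_, index=features.columns)
importances = importances.sort_values(ascending=False)
```

Impurity-based importances show what the model used, not what causes a high score, so they are best read as a starting point for 3.3 rather than a final answer.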


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 4. How many views does a post have?

---

**4.1 Build a model that predicts the number of views a post has.**

- This is another regression problem. 
- Predict the views for all posts, not just the "answer" posts.

**4.2 Evaluate the performance of your model with cross-validation and report the results.**

**4.3 What is important for the number of views a post has, if anything?**
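Count targets such as views are often strongly right-skewed (an assumption worth checking with a histogram of the real column). One common option is to model `log(1 + views)` and transform predictions back. The sketch below does this on invented data with a `Ridge` regressor as a placeholder.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

# Synthetic right-skewed "view counts" (invented data, for illustration only).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(400, 3))
views = np.round(np.exp(5 + X_demo[:, 0] + rng.normal(0, 0.5, 400))).astype(int)

# Train on log(1 + views) so a few very popular posts do not dominate the fit.
log_views = np.log1p(views)
log_pred = cross_val_predict(Ridge(alpha=1.0), X_demo, log_views, cv=5)

# Map predictions back to the original scale with the inverse transform.
views_pred = np.expm1(log_pred)

# Error on the log scale, i.e. root mean squared log error.
rmsle = np.sqrt(np.mean((log_pred - log_views) ** 2))
```

Reporting the error on the log scale keeps the metric from being dominated by a handful of viral posts. Whether that is the right metric depends on how the held-out test set will be scored.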

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 5. Build a pipeline or other code to automate evaluation of your models on the test data.

---

Now that you've constructed your three predictive models, build a pipeline or code that can easily load up the raw testing data and evaluate your models on it.

The testing data that is held out is in the same raw format as the training data you have. _Any cleaning and preprocessing that you did on the training data will need to be done on the testing data as well!_

This is a good opportunity to practice building pipelines, but you're not required to. Custom functions and classes are fine as long as they are able to process and test the new data.
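One way to keep training and testing preprocessing identical is to place it inside the model object itself. The sketch below wraps a hypothetical `preprocess` function in a scikit-learn `Pipeline`; the feature set, the tiny frames, and the choice of classifier are all placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def preprocess(raw):
    """Turn raw post rows into numeric features (hypothetical feature set)."""
    out = pd.DataFrame(index=raw.index)
    out['BodyLength'] = raw['Body'].fillna('').str.len()
    out['CommentCount'] = raw['CommentCount'].fillna(0)
    return out


# The same preprocessing function runs inside the pipeline for train and test.
model = Pipeline([
    ('features', FunctionTransformer(preprocess, validate=False)),
    ('classifier', RandomForestClassifier(n_estimators=20, random_state=0)),
])

# Tiny invented frames standing in for the raw training and held-out files.
train_raw = pd.DataFrame({
    'Body': ['<p>Long detailed answer with links</p>', '<p>No</p>',
             '<p>Another thorough explanation here</p>', None],
    'CommentCount': [4, 0, 3, 0],
    'IsAccepted': [1, 0, 1, 0],
})
test_raw = pd.DataFrame({
    'Body': ['<p>A careful, complete reply</p>', '<p>?</p>'],
    'CommentCount': [2, np.nan],
    'IsAccepted': [1, 0],
})

model.fit(train_raw, train_raw['IsAccepted'])
test_accuracy = model.score(test_raw, test_raw['IsAccepted'])
```

Because the raw frame goes in and a score comes out, evaluating on the held-out files reduces to loading them and calling `model.score`, with no risk of forgetting a cleaning step.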
