# Data analysis project

_In this project I am going to write a program that predicts the number of likes a post is going to get on VK._

__Author__: Булгаков Дмитрий (ИАД16)

__Deadline__: 23:59 10.04.16

# 1. Loading Data

Quiet installation of the VK library and its import into the notebook.

In [303]:
# !pip install vk -q # -q makes it quiet
import vk

Starting a new VK session in order to fetch the data.

In [304]:
vk_session = vk.Session() # starting new session
vk_api = vk.API(vk_session)

Getting number of posts in selected vk group.

In [305]:
selected_group = 'hse_overheard' # no other ideas :c
posts_number = vk_api.wall.get(domain=selected_group)[0] # number of posts is stored in first element
print('Number of posts in selected group: ', posts_number - 1)

Number of posts in selected group:  13964


Writing a function to load more than 100 posts from the group.

In [306]:
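def load_all_posts(page, n_posts, api):
    all_posts = api.wall.get(domain=page, count=n_posts) # first element is the total number of posts
    n_loaded = len(all_posts)
    while n_loaded < n_posts: # loop to load more than 100 posts
        # all_posts[0] is the count element, so only n_loaded - 1 posts are fetched so far
        s = api.wall.get(domain=page, offset=n_loaded - 1, count=(n_posts - n_loaded))
        if len(s) <= 1: # nothing left on the wall: stop instead of looping forever
            break
        all_posts += s[1:] # no need for first element
        n_loaded += len(s) - 1 # update n_loaded
    return all_posts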
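
The offset-based paging can be checked without network access. The sketch below is only an illustration: `fake_wall_get` is a hypothetical stand-in that mimics the response shape used above (total count first, then the posts) and assumes a cap of 100 posts per request. It is not the real `vk` API.

```python
PAGE_LIMIT = 100  # assumed per-request cap

def fake_wall_get(offset=0, count=100, total=250):
    """Hypothetical stand-in for wall.get: returns [total, post, post, ...]."""
    count = min(count, PAGE_LIMIT)
    posts = [{'id': i} for i in range(offset, min(offset + count, total))]
    return [total] + posts

def load_posts(n_posts, get_page=fake_wall_get):
    """Collect up to n_posts posts by advancing the offset page by page."""
    posts = []
    while len(posts) < n_posts:
        # the offset equals the number of posts fetched so far
        page = get_page(offset=len(posts), count=n_posts - len(posts))[1:]
        if not page:  # the wall has no more posts, so stop
            break
        posts += page
    return posts

loaded = load_posts(230)
print(len(loaded))
```

Keeping the count element out of the accumulated list makes the offset equal to `len(posts)`, which avoids off-by-one mistakes when paging.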

Loading all posts from group for future analysis

In [308]:
try:
    loaded_posts = load_all_posts(page=selected_group, n_posts=2501, api=vk_api)[1:] # no need for posts number element
    # only 2500 posts this time, because I have a small amount of RAM available :c
    print('Number of loaded posts: ', len(loaded_posts))
except Exception: # timeout errors occur often
    print('Error occurred! Try again.')

Number of loaded posts:  2500


# 2. Data preprocessing

Loading required libs to preprocess data.

In [309]:
# !pip install pymorphy2 -q # silent install again
# !pip install stop_words -q # needed to remove stop words
from stop_words import get_stop_words
import pymorphy2 # need this one to convert words to normal form
import datetime # needed to convert response date 
import string # needed to work with strings
from nltk.tokenize import TweetTokenizer # needed to split text
import pandas as pd # required to work with dataframes
from ipywidgets import IntProgress # progressbar
from IPython.display import display # progressbar

Writing functions to process text data. Converting words to normal form and removing punctuation here.

In [310]:
tokenizer = TweetTokenizer() # created once and reused for every post
morph = pymorphy2.MorphAnalyzer() # loading the analyzer is slow, so it is done only once
russian_stop_words = set(get_stop_words('russian')) # a set gives fast membership checks

def split_text(text):
    return tokenizer.tokenize(text) # splitting text into words

def convert_to_normal_form(words_list):
    normal_forms_list = []
    for word in words_list:
        if word not in string.punctuation and word[0] != "<": # skipping punctuation and tokens starting with "<"
            norm_form = morph.parse(word)[0].normal_form # getting normal form of a word
            normal_forms_list.append(norm_form) # adding it to list
    return normal_forms_list

def convert_text(text):
    words_list = split_text(text) # splitting text into words
    norm_words_list = convert_to_normal_form(words_list) # words into normal form
    filtered_words = [w for w in norm_words_list if w not in russian_stop_words] # removing stop words
    return " ".join(filtered_words) # joining words to a sentence again
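
The filtering part of this pipeline can be tried without the third-party libraries. The sketch below is a simplification under stated assumptions: `STOP_WORDS` is a hypothetical miniature list (the notebook uses `get_stop_words('russian')`), and lowercasing stands in for the lemmatization done by `pymorphy2`.

```python
import string

# Hypothetical miniature stop-word list, for illustration only
STOP_WORDS = {'и', 'в', 'на', 'не'}

def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Drop punctuation, tokens starting with '<' and stop words; lowercase the rest."""
    kept = []
    for token in tokens:
        if token in string.punctuation or token.startswith('<'):
            continue
        token = token.lower()  # stand-in for lemmatization
        if token not in stop_words:
            kept.append(token)
    return kept

tokens = ['Студенты', 'и', 'преподаватели', ',', '<br>', 'не', 'спят', '!']
cleaned = ' '.join(clean_tokens(tokens))
print(cleaned)
```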

Writing a function to convert the received list of posts into a list of feature dictionaries.

In [311]:
def convert_posts(posts_list):
    progress = IntProgress()
    progress.max = len(posts_list) # initializing progressbar
    progress.description = 'Processing data conversion'
    display(progress)

    updated_posts = [] # list of converted posts
    for post in posts_list:
        post_date = datetime.datetime.fromtimestamp(post['date']) # response date as local time
        tmp_dict = {} # creating empty dictionary for each post
        tmp_dict['likes_number'] = int(post['likes']['count']) # getting likes count
        tmp_dict['text'] = convert_text(post['text']) # converting text into normal form
        tmp_dict['text_length'] = len(post['text']) # calculating text length
        tmp_dict['post_hour'] = post_date.hour # keeping only post hour
        tmp_dict['post_month'] = post_date.month # and post month
        tmp_dict['signed'] = int(post['from_id'] != -57354358) # checking whether post is signed or not
        # checking if any attachment exists
        tmp_dict['with_attachment'] = 1 if 'attachment' in post else 0
        # tmp_dict['pinned'] = 1 if 'is_pinned' in post.keys() else 0 # checking if post is pinned
        tmp_dict['repost'] = 1 if post['post_type'] == 'copy' else 0 # checking if repost
        updated_posts.append(tmp_dict)
        progress.value += 1 # increasing progressbar value
    progress.description = 'Conversion done!'
    return updated_posts
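
The `post_hour` and `post_month` features are derived from a Unix timestamp. A minimal sketch of that step follows. Note one assumption: the notebook calls `fromtimestamp` without a time zone, so it uses the machine's local time, while the sketch pins UTC to make the result reproducible. `time_features` is a hypothetical helper name.

```python
import datetime

def time_features(timestamp):
    """Return (hour, month) of a Unix timestamp, interpreted in UTC."""
    moment = datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc)
    return moment.hour, moment.month

hour, month = time_features(1460332740)  # 2016-04-10 23:59:00 UTC
print(hour, month)
```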

Converting the list of posts into a new, more convenient one.

In [312]:
converted_posts = convert_posts(loaded_posts)

# 3. Creating object-feature matrix

Loading pandas

In [313]:
import pandas as pd

Creating dataframe from parsed data

In [314]:
posts_frame = pd.DataFrame(converted_posts)
posts_frame.head()

| | likes_number | post_hour | post_month | repost | signed | text | text_length | with_attachment |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 18 | 6 | 0 | 0 | выпуск заниматься студент прикладной математик... | 67 | 0 |
| 1 | 2 | 23 | 6 | 0 | 0 | поступать юрфак набигай регион общага родитель... | 426 | 0 |
| 2 | 4 | 14 | 6 | 0 | 0 | рассказать чувствовать узнать поступить хороши... | 110 | 0 |
| 3 | 30 | 9 | 6 | 0 | 0 | удивительно 30-40 оставаться чайлдфри убедить ... | 407 | 0 |
| 4 | 8 | 22 | 6 | 0 | 0 | научный интерес 30 существовать феминистка арб... | 412 | 0 |


Describing the posts data

In [315]:
posts_frame.describe()

| | likes_number | post_hour | post_month | repost | signed | text_length | with_attachment |
|---|---|---|---|---|---|---|---|
| count | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 |
| mean | 32.3148 | 15.2672 | 5.778 | 0.0284 | 0.0 | 229.6692 | 0.2788 |
| std | 48.836078 | 5.971304 | 3.350523 | 0.166146 | 0.0 | 446.024696 | 0.448499 |
| min | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4.0 | 12.0 | 3.0 | 0.0 | 0.0 | 58.0 | 0.0 |
| 50% | 14.0 | 16.0 | 5.0 | 0.0 | 0.0 | 121.0 | 0.0 |
| 75% | 42.0 | 20.0 | 9.0 | 0.0 | 0.0 | 237.0 | 1.0 |
| max | 584.0 | 23.0 | 12.0 | 1.0 | 0.0 | 8014.0 | 1.0 |


Creating object-feature matrix

In [316]:
from sklearn.feature_extraction.text import TfidfVectorizer # loading TF-IDF vectorizer

cv = TfidfVectorizer(norm='l1', max_features = 1000, analyzer = 'word', strip_accents='unicode', binary=True)
train_features = cv.fit_transform(posts_frame['text']).toarray() # vectorizing texts
train_frame = posts_frame.join(pd.DataFrame(train_features, columns=cv.get_feature_names())) # transferring it to pandas

train_frame.drop(['likes_number','text'],inplace=True,axis=1,errors='ignore') # removing unnecessary columns
value_frame = posts_frame['likes_number']
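
To see what `norm='l1'` and `binary=True` do, here is a small sketch on a made-up corpus (the three documents are invented for illustration, not taken from the dataset). With `binary=True` a repeated word counts once per document, and with `norm='l1'` the weights of each non-empty document sum to 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus, invented for this example
docs = ['студент сдать экзамен', 'студент студент общага', 'экзамен сессия']

vectorizer = TfidfVectorizer(norm='l1', binary=True)
matrix = vectorizer.fit_transform(docs).toarray()

# every row is L1-normalized, so its weights add up to 1
row_sums = matrix.sum(axis=1)
print(sorted(vectorizer.vocabulary_))
print(row_sums)
```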

In [317]:
train_frame.describe()

| | post_hour | post_month | repost | signed | text_length | with_attachment | 000 | 10 | 100 | 11 | ... | эконом | экономика | экономист | экономическии | электричка | этаж | юрист | являться | язык | январь |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 | ... | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 | 2500.0 |
| mean | 15.2672 | 5.778 | 0.0284 | 0.0 | 229.6692 | 0.2788 | 0.000197 | 0.002155 | 0.000466 | 0.001261 | ... | 0.002563 | 0.002629 | 0.001408 | 0.001182 | 0.00059 | 0.000334 | 0.000606 | 0.00096 | 0.001278 | 0.001499 |
| std | 5.971304 | 3.350523 | 0.166146 | 0.0 | 446.024696 | 0.448499 | 0.003885 | 0.021128 | 0.005509 | 0.02434 | ... | 0.029784 | 0.018263 | 0.014873 | 0.016503 | 0.00965 | 0.008497 | 0.01228 | 0.021585 | 0.013584 | 0.031371 |
| min | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 12.0 | 3.0 | 0.0 | 0.0 | 58.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 16.0 | 5.0 | 0.0 | 0.0 | 121.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 20.0 | 9.0 | 0.0 | 0.0 | 237.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 23.0 | 12.0 | 1.0 | 0.0 | 8014.0 | 1.0 | 0.113207 | 0.458522 | 0.103001 | 1.0 | ... | 1.0 | 0.314055 | 0.323805 | 0.508548 | 0.234299 | 0.362543 | 0.507972 | 1.0 | 0.324296 | 1.0 |


Saving train frame to file

In [318]:
# train_frame.to_csv('traindata.csv')

# 4. Comparing different methods

Splitting into train and test samples

In [319]:
from sklearn.cross_validation import train_test_split

# Splitting it into test and train samples
X_train, X_test, y_train, y_test = train_test_split(train_frame, value_frame, test_size=0.3, random_state=42)
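
In scikit-learn 0.18 and later the same function lives in `sklearn.model_selection`, and the `sklearn.cross_validation` module was removed in 0.20. A sketch of the equivalent call on toy data (the arrays and the names `X_toy`, `y_toy` are invented for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 toy objects with 2 features
y_toy = np.arange(10)                 # toy targets

# 30% of the objects go to the test sample; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
print(X_tr.shape, X_te.shape)
```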

Importing libs

In [320]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
import numpy as np


In [321]:
def compare(est, param, est_name):
    # grid search with cross-validation on the train sample (uses global X_train, y_train)
    cv = GridSearchCV(est, param, n_jobs = -1)
    cv.fit(X_train, y_train)
    print('CV best score for', est_name, ': ', cv.best_score_)

    predicted = cv.predict(X_test) # predictions of the best estimator on the test sample
    mse = mean_squared_error(y_test, predicted)
    print('MSE for', est_name, ':' , mse)
    r2 = r2_score(y_test, predicted)
    print('R^2 for', est_name, ':' , r2)
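
When no `scoring` argument is given, `GridSearchCV` ranks parameters with the estimator's own `score` method: R^2 for regressors and accuracy for classifiers. The "CV best score" values are therefore not on the same scale for every model. The sketch below uses synthetic data and the modern `sklearn.model_selection` import path (both are assumptions of the example, not part of the notebook) to show the default next to an explicit MSE-based scorer.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))  # synthetic features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

grid = {'alpha': [0.1, 1.0, 10.0]}

# default scoring: Ridge.score, i.e. R^2
search_r2 = GridSearchCV(Ridge(), grid, cv=3)
search_r2.fit(X, y)

# explicit scoring: negated MSE, so that "higher is better" still holds
search_mse = GridSearchCV(Ridge(), grid, cv=3, scoring='neg_mean_squared_error')
search_mse.fit(X, y)

print(search_r2.best_params_, search_r2.best_score_)
print(search_mse.best_params_, search_mse.best_score_)
```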

## 4.1 Linear regression

### 4.1.1 Simple linear regression

In [322]:
parameters = {'fit_intercept':[True, False],'normalize':[True, False]}
compare(LinearRegression(), parameters, 'simple linear regression')

CV best score for simple linear regression :  -66.1699538669
MSE for simple linear regression : 5351.40831024
R^2 for simple linear regression : -1.3192274986
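
A negative R^2 means the squared error of the model on the test sample is larger than that of a constant prediction equal to the test mean. A toy illustration of the formula R^2 = 1 - SS_res / SS_tot (the numbers are made up, not taken from the dataset):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]  # toy targets with mean 2.0
r2_mean = r2(y_true, [2.0, 2.0, 2.0])  # always predicting the mean
r2_bad = r2(y_true, [3.0, 2.0, 1.0])   # predictions worse than the mean
print(r2_mean, r2_bad)
```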


### 4.1.2 Linear regression with L1 regularization

In [323]:
parameters = {'alpha':np.arange(1, 100, 5), 'positive':[True, False],'normalize':[True, False], 
              'selection':['cyclic', 'random']}
compare(Lasso(), parameters, 'linear regression with L1 regularization')

CV best score for linear regression with L1 regularization :  0.138809429466
MSE for linear regression with L1 regularization : 2037.43458482
R^2 for linear regression with L1 regularization : 0.117003591996


### 4.1.3 Linear regression with L2 regularization

In [324]:
parameters = {'alpha':np.logspace(1.0, 10.0, 101), 'fit_intercept':[True, False],'normalize':[True, False],
              'solver':['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'auto']}
compare(Ridge(), parameters, 'linear regression with L2 regularization')

CV best score for linear regression with L2 regularization :  0.154214322966
MSE for linear regression with L2 regularization : 2006.0790626
R^2 for linear regression with L2 regularization : 0.130592648398


## 4.2 Decision trees and random forests

### 4.2.1 DecisionTreeClassifier

In [325]:
parameters = {'presort':[True, False],'max_depth': np.arange(1, 20), 'class_weight':['balanced', None], 'splitter':['random', 'best'],
             'max_features':['auto', 'sqrt', 'log2', None]}
compare(tree.DecisionTreeClassifier(), parameters, 'decision tree classifier')



CV best score for decision tree classifier :  0.111428571429
MSE for decision tree classifier : 3102.484
R^2 for decision tree classifier : -0.344574323174


### 4.2.2 RandomForestClassifier

In [329]:
parameters = {'n_estimators':[10, 20, 30],
             'max_features':['auto', 'sqrt', 'log2', None]}
compare(RandomForestClassifier(), parameters, 'random forest classifier')



CV best score for random forest classifier :  0.0954285714286
MSE for random forest classifier : 2869.32
R^2 for random forest classifier : -0.24352422026


## 4.3 kNN

In [327]:
parameters = {'leaf_size':np.arange(30, 100, 10),'n_neighbors': np.arange(5, 20), 
              'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute']}
compare(KNeighborsClassifier(), parameters, 'kNN')



CV best score for kNN :  0.0885714285714
MSE for kNN : 2814.09333333
R^2 for kNN : -0.219589734876
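
The classifiers above treat every distinct like count as a separate class. scikit-learn also provides regression counterparts (`KNeighborsRegressor`, `DecisionTreeRegressor`, `RandomForestRegressor`) that predict a number directly. The sketch below is not part of the original experiment: it swaps in `KNeighborsRegressor` on synthetic data purely to show the interface, and says nothing about how it would perform on the VK posts.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))                # synthetic single feature
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)  # synthetic numeric target

model = KNeighborsRegressor(n_neighbors=5)
model.fit(X[:150], y[:150])
predicted = model.predict(X[150:])  # mean target of the 5 nearest neighbours

mse = mean_squared_error(y[150:], predicted)
print(mse)
```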
