# Data analysis project

_In this project I write a program that predicts the number of likes a post will get on VK._

__Author__: Dmitry Bulgakov (ИАД16)

__Deadline__: 23:59 19.06.16

# 1. Loading Data

Installing the `vk` library quietly and importing it into the notebook.

In [1]:
# !pip install vk -q # -q keeps the install quiet
import vk

Starting a new VK session to fetch the data.

In [2]:
vk_session = vk.Session() # starting new session
vk_api = vk.API(vk_session)

Getting the number of posts in the selected VK group.

In [3]:
selected_group = 'hse_overheard' # no other ideas :c
posts_number = vk_api.wall.get(domain=selected_group)[0] # number of posts is stored in first element
print('Number of posts in selected group: ', posts_number - 1)

Number of posts in selected group:  13977


Writing a function to load more than 100 posts from the group.

In [4]:
def load_all_posts(page, n_posts, api):
    all_posts = api.wall.get(domain=page, count=n_posts)
    n_loaded = len(all_posts)
    while n_loaded < n_posts: # loop to load more than 100 posts
        s = api.wall.get(domain=page, offset=n_loaded, count=(n_posts - n_loaded)) # update offset
        all_posts += s[1:] # no need for first element
        n_loaded += len(s) - 1 # update n_loaded
    return all_posts
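
The paging idea behind `load_all_posts` can be tried without network access. Below is a minimal sketch, assuming a response shaped like `[total, post, post, ...]` and a cap of 100 posts per request, as the comments above suggest. `fake_wall_get`, `load_posts` and `PAGE_LIMIT` are hypothetical names that are not part of the `vk` library. Unlike the loop above, the sketch also stops when a page comes back empty, so it cannot spin forever on a short wall.

```python
PAGE_LIMIT = 100  # assumed per-request cap, taken from the notebook's comments


def fake_wall_get(posts, offset=0, count=PAGE_LIMIT):
    """Hypothetical stand-in for api.wall.get: returns [total, post, post, ...]."""
    page = posts[offset:offset + min(count, PAGE_LIMIT)]
    return [len(posts)] + page


def load_posts(posts, n_posts):
    """Collect up to n_posts items by requesting successive offsets."""
    loaded = []
    while len(loaded) < n_posts:
        response = fake_wall_get(posts, offset=len(loaded), count=n_posts - len(loaded))
        page = response[1:]  # drop the leading total-count element
        if not page:  # nothing left on the wall: stop instead of looping forever
            break
        loaded.extend(page)
    return loaded


wall = [{'id': i} for i in range(250)]  # made-up posts for the demonstration
print(len(load_posts(wall, 230)))   # 230
print(len(load_posts(wall, 1000)))  # 250, the wall has no more posts
```

Here the offset is the number of posts already collected, so no post is skipped between pages.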

Loading all posts from group for future analysis

In [5]:
try:
    loaded_posts = load_all_posts(page=selected_group, n_posts=2001, api=vk_api)[1:] # drop the leading post-count element
    # only 2000 posts this time, because little RAM is available :c
    print('Number of loaded posts: ', len(loaded_posts))
except Exception: # timeout errors occur often
    print('Error occurred! Try again.')

Number of loaded posts:  2000


# 2. Data preprocessing

Loading required libs to preprocess data.

In [6]:
# !pip install pymorphy2 -q # silent install again
# !pip install stop_words -q # needed to remove stop words
from stop_words import get_stop_words
import pymorphy2 # need this one to convert words to normal form
import datetime # needed to convert response date
import string # needed to work with strings
from nltk.tokenize import TweetTokenizer # needed to split text
import pandas as pd # required to work with dataframes
from ipywidgets import IntProgress # progressbar
from IPython.display import display # progressbar

Writing functions to process text data. Converting words to normal form and removing punctuation here.

In [7]:
def split_text(text):
    tokenizer = TweetTokenizer()
    return tokenizer.tokenize(text) # splitting text into words

def convert_to_normal_form(words_list):
    morph = pymorphy2.MorphAnalyzer()
    normal_forms_list = []
    for word in words_list:
        if word not in string.punctuation and word[0] != "<": # skip punctuation marks and tokens starting with "<"
            norm_form = morph.parse(word)[0].normal_form # getting normal form of a word
            normal_forms_list.append(norm_form) # adding it to list
    return normal_forms_list

def convert_text(text):
    words_list = split_text(text) # splitting text into words
    norm_words_list = convert_to_normal_form(words_list) # words into normal form
    stop_words = set(get_stop_words('russian')) # load the stop-word list once per text, not once per word
    filtered_words = [w for w in norm_words_list if w not in stop_words] # removing stop words
    return " ".join(filtered_words) # joining words to a sentence again

Writing functions that convert the received list of posts into a list of feature dictionaries.

In [8]:
def check_posttime(post, cur_date):
    post_time = datetime.datetime.fromtimestamp(post['date'])
    elapsed_time = (cur_date - post_time).days
    
    if (elapsed_time < 1):
        return 0
    if (elapsed_time < 5):
        return 1
    if (elapsed_time < 10):
        return 2
    if (elapsed_time < 30):
        return 3
    if (elapsed_time < 90):
        return 4
    if (elapsed_time < 180):
        return 5
    if (elapsed_time < 365):
        return 6
    if (elapsed_time < 730):
        return 7
    return 8
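
The chain of `if` statements above maps the post age in days to one of nine buckets. The same mapping can be expressed as a lookup in a sorted list of limits. This is only a sketch of an alternative, and `DAY_LIMITS` and `age_bucket` are names introduced here for illustration.

```python
from bisect import bisect_right

# Upper bounds (in days) of the age buckets used by check_posttime.
DAY_LIMITS = [1, 5, 10, 30, 90, 180, 365, 730]


def age_bucket(elapsed_days):
    """Return the index of the first limit that elapsed_days is below (8 if none)."""
    return bisect_right(DAY_LIMITS, elapsed_days)


print(age_bucket(0))     # 0: less than one day old
print(age_bucket(45))    # 4: between 30 and 90 days
print(age_bucket(1000))  # 8: two years or older
```

Keeping the limits in one list makes it easier to change the bucket boundaries later without touching the logic.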

In [9]:
def check_attachment(post):
    tmp_dict = {'photo_attachment': 0, 'poll_attachment': 0, 'link_attachment': 0}
    
    if 'attachments' not in post.keys():
        return tmp_dict
    else:
        for attachment in post['attachments']:
            if attachment['type'] == 'photo':
                tmp_dict['photo_attachment'] = 1
            if attachment['type'] == 'poll':
                tmp_dict['poll_attachment'] = 1
            if attachment['type'] == 'link':
                tmp_dict['link_attachment'] = 1
        return tmp_dict
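
A more compact way to build the same three flags is sketched below. It assumes the post is a dictionary whose optional `attachments` entry is a list of dictionaries with a `type` key, which is what the function above relies on. `TRACKED_TYPES`, `attachment_flags` and the sample post are made up for illustration.

```python
TRACKED_TYPES = ('photo', 'poll', 'link')


def attachment_flags(post):
    """Return a 0/1 flag per tracked attachment type for a post dictionary."""
    present = {a['type'] for a in post.get('attachments', [])}
    return {t + '_attachment': int(t in present) for t in TRACKED_TYPES}


sample_post = {'attachments': [{'type': 'photo'}, {'type': 'video'}]}  # hypothetical post
print(attachment_flags(sample_post))
# {'photo_attachment': 1, 'poll_attachment': 0, 'link_attachment': 0}
print(attachment_flags({}))
# {'photo_attachment': 0, 'poll_attachment': 0, 'link_attachment': 0}
```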

In [10]:
def convert_posts(posts_list):
    progress = IntProgress() 
    progress.max = len(posts_list) # initializing progressbar
    progress.description = 'Processing data conversion'
    display(progress)
    current_date = datetime.datetime.today() # standard-library equivalent of the deprecated pd.datetime
    
    updated_posts = [] # list of new posts' list structure
    for i, post in enumerate(posts_list): 
        tmp_dict = {} # creating empty dictionary for each post
        tmp_dict['likes_number'] = int(post['likes']['count']) # getting likes count
        tmp_dict['text'] = convert_text(post['text']) # converting text into normal form
        tmp_dict['long_text'] = 1 if len(post['text']) > 400 else 0 # flagging texts longer than 400 characters
        post_date = datetime.datetime.fromtimestamp(post['date'])
        tmp_dict['post_hour'] = post_date.hour # only the post hour
        tmp_dict['post_month'] = post_date.month # and the post month
        # tmp_dict['signed'] = int(post['from_id'] != -57354358) # checking whether post is signed or not
        tmp_dict['time_elapsed'] = check_posttime(post, current_date) # age bucket of the post
        # checking if any attachment exists
        tmp_dict.update(check_attachment(post))
        # tmp_dict['pinned'] = 1 if 'is_pinned' in post.keys() else 0 # checking if post is pinned
        updated_posts.append(tmp_dict)
        progress.value += 1 # increasing progressbar value
    progress.description = 'Conversion done!'
    return updated_posts
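
One detail worth keeping in mind: `datetime.datetime.fromtimestamp` without a timezone uses the local timezone of the machine, so `post_hour` depends on where the notebook runs. The sketch below pins the timezone explicitly. Using Moscow time (UTC+3) is my assumption about what is appropriate for this group, and `MSK` and `time_features` are names introduced only for this example.

```python
import datetime

MSK = datetime.timezone(datetime.timedelta(hours=3))  # assumption: Moscow time, UTC+3


def time_features(timestamp, tz=MSK):
    """Extract the hour and month of a Unix timestamp in a fixed timezone."""
    moment = datetime.datetime.fromtimestamp(timestamp, tz=tz)
    return {'post_hour': moment.hour, 'post_month': moment.month}


# 1466369940 is 20:59 UTC on 19 June 2016
print(time_features(1466369940))                            # {'post_hour': 23, 'post_month': 6}
print(time_features(1466369940, tz=datetime.timezone.utc))  # {'post_hour': 20, 'post_month': 6}
```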

Converting the list of posts into a new, more convenient one.

In [11]:
converted_posts = convert_posts(loaded_posts)

# 3. Creating object-feature matrix

Loading pandas

In [12]:
import pandas as pd

Creating dataframe from parsed data

In [13]:
posts_frame = pd.DataFrame(converted_posts)
posts_frame.head()

| | likes_number | link_attachment | long_text | photo_attachment | poll_attachment | post_hour | post_month | text | time_elapsed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 12 | 6 | заслуживать мнение россия отстранение олимпиад... | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 11 | 6 | китаец получить большинство право приход дэн с... | 0 |
| 2 | 0 | 1 | 1 | 1 | 0 | 15 | 6 | конкурс « подслушать » вместе билайн продолжат... | 1 |
| 3 | 2 | 0 | 0 | 0 | 0 | 10 | 6 | чей лошадка победить мисс ниу вшэ | 1 |
| 4 | 9 | 0 | 1 | 0 | 1 | 9 | 6 | писать сюда придумать способ найти эконом гей ... | 1 |


Summary statistics of the posts data

In [14]:
posts_frame.describe()

| | likes_number | link_attachment | long_text | photo_attachment | poll_attachment | post_hour | post_month | time_elapsed |
|---|---|---|---|---|---|---|---|---|
| count | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 |
| mean | 28.4815 | 0.018 | 0.159 | 0.2035 | 0.028 | 15.2635 | 5.201 | 4.4605 |
| std | 45.798334 | 0.132984 | 0.365768 | 0.402702 | 0.165014 | 5.97946 | 3.477592 | 1.057355 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12.0 | 2.0 | 4.0 |
| 50% | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 | 5.0 | 5.0 |
| 75% | 37.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 | 6.0 | 5.0 |
| max | 584.0 | 1.0 | 1.0 | 1.0 | 1.0 | 23.0 | 12.0 | 6.0 |


Creating object-feature matrix

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer # loading TF-IDF vectorizer

cv = TfidfVectorizer(norm='l1', max_features=200, analyzer='word', strip_accents='unicode', binary=True)
train_features = cv.fit_transform(posts_frame['text']).toarray() # vectorizing texts
train_frame = posts_frame.join(pd.DataFrame(train_features, columns=cv.get_feature_names())) # transferring it to pandas

train_frame.drop(['likes_number','text'],inplace=True,axis=1,errors='ignore') # removing unnecessary columns
value_frame = posts_frame['likes_number']
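
To make the text features less of a black box, here is a pure-Python sketch of what I understand `TfidfVectorizer(binary=True, norm='l1')` to compute with its default smoothed IDF: each word present in a text gets the weight `ln((1 + n) / (1 + df)) + 1`, and each row is then divided by its sum. The sketch ignores `max_features` and `strip_accents`, works on already tokenized toy texts, and `binary_tfidf_l1` is a name made up for this example, so treat it as an approximation rather than a copy of the library code.

```python
import math


def binary_tfidf_l1(tokenized_docs):
    """Binary TF-IDF with smoothed IDF and L1 row normalization."""
    n_docs = len(tokenized_docs)
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    # document frequency: number of documents that contain each term
    df = {w: sum(w in set(doc) for doc in tokenized_docs) for w in vocab}
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized_docs:
        present = set(doc)
        weights = [idf[w] if w in present else 0.0 for w in vocab]
        total = sum(weights)
        rows.append([x / total if total else 0.0 for x in weights])
    return vocab, rows


docs = [['exam', 'school'], ['exam'], []]  # toy texts, not from the dataset
vocab, rows = binary_tfidf_l1(docs)
print(vocab)    # ['exam', 'school']
print(rows[1])  # [1.0, 0.0]
print(rows[2])  # [0.0, 0.0]
```

With L1 normalization every non-empty row sums to 1, so a text that contains a single vocabulary word gets the weight 1.0 for it.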

In [16]:
train_frame.describe()

| | link_attachment | long_text | photo_attachment | poll_attachment | post_hour | post_month | time_elapsed | 10 | 20 | 2016 | ... | часть | читать | что | школа | экзамен | эконом | экономика | экономист | экономическии | язык |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | ... | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 |
| mean | 0.018 | 0.159 | 0.2035 | 0.028 | 15.2635 | 5.201 | 4.4605 | 0.005239 | 0.002876 | 0.00435 | ... | 0.001679 | 0.003339 | 0.004312 | 0.003495 | 0.008049 | 0.0036 | 0.005843 | 0.002556 | 0.002152 | 0.003129 |
| std | 0.132984 | 0.365768 | 0.402702 | 0.165014 | 5.97946 | 3.477592 | 1.057355 | 0.054622 | 0.032628 | 0.048049 | ... | 0.027516 | 0.033758 | 0.030559 | 0.03222 | 0.069418 | 0.03423 | 0.041509 | 0.027687 | 0.023545 | 0.034632 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 12.0 | 2.0 | 4.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 | 5.0 | 5.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 | 6.0 | 5.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 23.0 | 12.0 | 6.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 0.573358 | 0.563424 | 1.0 | 1.0 | 0.561522 | 1.0 | 0.567457 | 0.49828 | 1.0 |


Saving train frame to file

In [17]:
# train_frame.to_csv('traindata.csv')

# 4. Comparing different methods

Splitting into train and test samples

In [18]:
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20

# Splitting it into test and train samples
X_train, X_test, y_train, y_test = train_test_split(train_frame, value_frame, test_size=0.3, random_state=42)

Importing libs

In [30]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.neighbors import KNeighborsRegressor
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
import numpy as np


In [31]:
def compare(est, param, est_name):
    cv = GridSearchCV(est, param, n_jobs = -1, scoring = 'r2') # grid search with cross-validation, scored by R^2
    cv.fit(X_train, y_train)
    print('CV best score for', est_name, ': ', cv.best_score_,'. (R^2)')
    
    # predicted = cv.predict(X_test)
    # mse = mean_squared_error(y_test, predicted)
    # print('MSE for', est_name, ':' , mse)
    # r2 = r2_score(y_test, predicted)
    # print('R^2 for', est_name, ':' , r2)    
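
Since every model below is compared by R^2, it helps to recall what the number means. R^2 is one minus the ratio of the squared error of the predictions to the squared error of always predicting the mean of the targets. A minimal sketch with made-up numbers follows (the function name `r2` is mine, and `sklearn.metrics.r2_score` is the real tool to use in practice).

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y = [1, 2, 3, 4]  # toy targets, not from the dataset
print(r2(y, [1, 2, 3, 4]))          # 1.0: perfect predictions
print(r2(y, [2.5, 2.5, 2.5, 2.5]))  # 0.0: no better than predicting the mean
print(r2(y, [4, 3, 2, 1]))          # -3.0: worse than predicting the mean
```

On this scale, the cross-validated scores reported in this section (roughly 0.15 to 0.31) are closer to the mean-predicting baseline than to a perfect fit.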

## 4.1 Linear regression

### 4.1.1 Simple linear regression

In [32]:
parameters = {'fit_intercept':[True, False],'normalize':[True, False]}
compare(LinearRegression(), parameters, 'simple linear regression')

CV best score for simple linear regression :  0.148064118749 . (R^2)


### 4.1.2 Linear regression with L1 regularization

In [33]:
parameters = {'alpha':np.arange(1, 100, 5), 'positive':[True, False],'normalize':[True, False], 
              'selection':['cyclic', 'random']}
compare(Lasso(), parameters, 'linear regression with L1 regularization')

CV best score for linear regression with L1 regularization :  0.27466794201 . (R^2)


### 4.1.3 Linear regression with L2 regularization

In [34]:
parameters = {'alpha':np.logspace(1.0, 10.0, 101), 'fit_intercept':[True, False],'normalize':[True, False],
              'solver':['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'auto']}
compare(Ridge(), parameters, 'linear regression with L2 regularization')

CV best score for linear regression with L2 regularization :  0.310601003643 . (R^2)
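
For reference, the two regularized models in 4.1.2 and 4.1.3 differ in the penalty term and in how the loss is scaled. As far as I recall from the scikit-learn documentation (worth checking against the installed version), the objectives are the following, where $X$ is the feature matrix, $y$ the vector of like counts, $w$ the coefficients, $n$ the number of training samples and $\alpha$ the regularization strength searched over in the grids; these symbols are notation introduced here.

$$
\text{Lasso:}\quad \min_{w}\ \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1
\qquad\qquad
\text{Ridge:}\quad \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2
$$

Because the Lasso loss is divided by $2n$ and the Ridge loss is not, the same value of $\alpha$ means a very different amount of regularization in the two models, which is one reason the grids use such different ranges.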


## 4.2 Decision trees and random forests

### 4.2.1 DecisionTreeRegressor

In [37]:
parameters = {'presort':[True, False],'max_depth': np.arange(1, 20), 'splitter':['random', 'best'],
             'max_features':['auto', 'sqrt', 'log2', None]}
compare(tree.DecisionTreeRegressor(), parameters, 'decision tree regressor')

CV best score for decision tree regressor :  0.271468116809 . (R^2)


### 4.2.2 RandomForestRegressor

In [39]:
parameters = {'n_estimators':[10, 20, 30],
             'max_features':['auto', 'sqrt', 'log2', None]}
compare(RandomForestRegressor(), parameters, 'random forest regressor')

CV best score for random forest regressor :  0.215930430116 . (R^2)


## 4.3 kNN

In [40]:
parameters = {'leaf_size':np.arange(30, 100, 10),'n_neighbors': np.arange(5, 20), 
              'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute']}
compare(KNeighborsRegressor(), parameters, 'kNN')

CV best score for kNN :  0.229326905835 . (R^2)
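
As a reminder of what the model does, here is a minimal pure-Python sketch of k-nearest-neighbours regression with uniform weights and Euclidean distance, which I believe matches the defaults of `KNeighborsRegressor`. The data and the name `knn_predict` are made up for illustration, and `math.dist` needs Python 3.8 or newer.

```python
import math


def knn_predict(train_X, train_y, query, k):
    """Average the targets of the k training rows closest to query (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


train_X = [[0, 0], [1, 0], [0, 1], [10, 10]]  # toy feature rows
train_y = [2, 4, 6, 100]                      # toy like counts
print(knn_predict(train_X, train_y, [0.2, 0.1], k=3))  # 4.0
print(knn_predict(train_X, train_y, [0.2, 0.1], k=1))  # 2.0
```

In the grid above, `n_neighbors` is the parameter that changes the model itself. `algorithm` and `leaf_size` control how the neighbours are searched, so I would expect them to affect speed and memory rather than the predictions (apart from how ties are broken).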
