# Data analysis project

_In this project I am going to write a program to predict the number likes post is going to get in vk._

__Выполнил__: Булгаков Дмитрий (ИАД16)

__Дедлайн__: 23:59 10.04.16

# 1. Loading Data

VK library quiet installation and import into the notebook.

In [1]:
# !pip install vk # makes it quiet
import vk

Starting new vk session in order to parse data

In [2]:
vk_session = vk.Session() # starting new session
vk_api = vk.API(vk_session)

Getting number of posts in selected vk group.

In [3]:
selected_group = 'hse_overheard' # no other ideas :c
posts_number = vk_api.wall.get(domain=selected_group)[0] # number of posts is stored in first element
print('Number of posts in selected group: ', posts_number - 1)

Number of posts in selected group:  13977


Writing a function to parse more, than 100 posts from group.

In [4]:
def load_all_posts(page, n_posts, api):
    all_posts = api.wall.get(domain=page, count=n_posts)
    n_loaded = len(all_posts)
    while n_loaded < n_posts: # loop to load more, than 100 posts
        s = api.wall.get(domain=page, offset=n_loaded, count=(n_posts - n_loaded)) # update offset
        all_posts += s[1:] # no need for first element
        n_loaded += len(s) - 1 # update n_loaded
    return all_posts

Loading all posts from group for future analysis

In [5]:
try:
    loaded_posts = load_all_posts(page=selected_group, n_posts=2501, api=vk_api)[1:] # no need for posts number element
    # 1500 for this time, because I have small amount of ram avaliable :c
    print('Number of loaded posts: ', len(loaded_posts))
except: # timout errors are often to occur
    print('Error occured! Try again.')

Number of loaded posts:  2500


# 2. Data preprocessing

Loading required libs to preprocess data.

In [6]:
# !pip install pymorphy2 -q # silent install again
# !pip install stop_words -q # needed to remove stop words
from stop_words import get_stop_words
import pymorphy2 # need this one to convert words to normal time
import datetime # needed to convert response date 
import string # needed to work with strings
from nltk.tokenize import TweetTokenizer # needed to split text
import pandas as pd # required to work with dataframes
from ipywidgets import IntProgress # progressbar
from IPython.display import display # progressbar

Writing functions to process text data. Converting words to normal form and removing punctuation here.

In [7]:
def split_text(text):
    tokenizer = TweetTokenizer()
    return tokenizer.tokenize(text) # spliting text into words

def convert_to_normal_form(words_list):
    morph = pymorphy2.MorphAnalyzer()
    normal_forms_list = []
    for word in words_list:
        if word not in string.punctuation and word[0] != "<":
            norm_form = morph.parse(word)[0].normal_form #getting normal form of a word
            normal_forms_list.append(norm_form) #adding it to list
    return normal_forms_list

def convert_text(text):
    words_list = split_text(text) # spliting text into words
    norm_words_list = convert_to_normal_form(words_list) # words into normal form
    filtered_words = [w for w in norm_words_list if not w in get_stop_words('russian')] # removing stop words
    return " ".join(filtered_words) # joining words to a sentence again

Writing a function to convert received list into another with another data.

In [21]:
def check_attachment(post):
    if 'attachments' not in post.keys():
        return 0
    else:
        if len(post['attachments']) > 1:
            return 1
        else:
            if post['attachments'][0] == 'photo':
                return 2
            if post['attachments'][0] == 'link':
                return 3
            if post['attachments'][0] == 'poll':
                return 4
            return 5
            

In [90]:
def convert_posts(posts_list):
    progress = IntProgress() 
    progress.max = len(posts_list) # initializing progressbar
    progress.description = 'Processing data convertion'
    display(progress)
    current_date = pd.datetime.today()
    
    updated_posts = [] # list of new posts' list structure
    for i, post in enumerate(posts_list): 
        tmp_dict = {} # creating empty dictionary for each post
        tmp_dict['likes_number'] = int(post['likes']['count']) # getting likes count
        tmp_dict['text'] = convert_text(post['text']) # converting text into normal form
        tmp_dict['long_text'] = 1 if len(post['text']) > 400 else 0 # calculating text length
        post_date = datetime.datetime.fromtimestamp(post['date'])
        tmp_dict['post_hour'] = int(post_date.strftime('%H')) # parsing only post hour
        tmp_dict['post_month'] = int(post_date.strftime('%m')) # and post month
        tmp_dict['signed'] = int(post['from_id'] != -57354358) # checking whether post is signed or not
        # checking if any attacment exists
        tmp_dict['old_post'] = 1 if (current_date - post_date).days > 7 else 0
        tmp_dict['with_attachment'] = 1 if 'attachment' in post.keys() else 0
        tmp_dict['attachment_type'] = check_attachment(post)
        tmp_dict['pinned'] = 1 if 'is_pinned' in post.keys() else 0 # cheking if post is pinned
        tmp_dict['repost'] = 1 if post['post_type'] == 'copy' else 0 # cheking if repost
        updated_posts.append(tmp_dict)
        progress.value += 1 # increasing progressbar value
    progress.description = 'Done convertion!'
    return updated_posts

Converting list of posts into new more convenient one.

In [93]:
converted_posts = convert_posts(loaded_posts)

# 3. Creating object-feature matrix

Loading pandas

In [96]:
import pandas as pd

Creating dataframe from parsed data

In [97]:
posts_frame = pd.DataFrame(converted_posts)
posts_frame.head()

Unnamed: 0,attachment_type,likes_number,long_text,old_post,pinned,post_hour,post_month,repost,signed,text,with_attachment
0,5,0,0,0,0,12,6,0,0,заслуживать мнение россия отстранение олимпиад...,1
1,0,0,0,0,0,11,6,0,0,китаец получить большинство право приход дэн с...,0
2,1,0,1,0,0,15,6,0,0,конкурс « подслушать » вместе билайн продолжат...,1
3,0,1,0,0,0,10,6,0,0,чей лошадка победить мисс ниу вшэ,0
4,5,9,1,0,0,9,6,0,0,писать сюда придумать способ найти эконом гей ...,1


And describing posts data

In [98]:
posts_frame.describe()

Unnamed: 0,attachment_type,likes_number,long_text,old_post,pinned,post_hour,post_month,repost,signed,with_attachment
count,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0
mean,1.2408,32.2836,0.1412,0.9936,0.0,15.266,5.778,0.0284,0.0,0.2792
std,2.123916,48.877745,0.348297,0.07976,0.0,5.941495,3.350523,0.166146,0.0,0.448696
min,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
25%,0.0,4.0,0.0,1.0,0.0,12.0,3.0,0.0,0.0,0.0
50%,0.0,14.0,0.0,1.0,0.0,16.0,5.0,0.0,0.0,0.0
75%,1.0,42.0,0.0,1.0,0.0,20.0,9.0,0.0,0.0,1.0
max,5.0,584.0,1.0,1.0,0.0,23.0,12.0,1.0,0.0,1.0


In [107]:
posts_frame

Unnamed: 0,attachment_type,likes_number,long_text,old_post,pinned,post_hour,post_month,repost,signed,text,with_attachment
0,5,0,0,0,0,12,6,0,0,заслуживать мнение россия отстранение олимпиад...,1
1,0,0,0,0,0,11,6,0,0,китаец получить большинство право приход дэн с...,0
2,1,0,1,0,0,15,6,0,0,конкурс « подслушать » вместе билайн продолжат...,1
3,0,1,0,0,0,10,6,0,0,чей лошадка победить мисс ниу вшэ,0
4,5,9,1,0,0,9,6,0,0,писать сюда придумать способ найти эконом гей ...,1
5,0,4,0,0,0,18,6,0,0,политика тошнить вброса вуз конкретно надоесть...,0
6,1,1,1,0,0,16,6,0,0,конкурс « подслушать » вместе билайн продолжат...,1
7,0,50,0,0,0,19,6,0,0,валерий ларченко самый сочный поп ___,0
8,0,1,0,0,0,22,6,0,0,студент кирпичный справляться программа репетитор,0
9,5,0,0,0,0,19,6,0,0,вопрос волновать миллион skyrim mass effect,1


Creating object-feature matrix

In [108]:
from sklearn.feature_extraction.text import TfidfVectorizer # loading count vectorizer

cv = TfidfVectorizer(norm='l1', max_features = 200, analyzer = 'word', strip_accents='unicode', binary=True)
train_features = cv.fit_transform(posts_frame['text']).toarray() # vectorizing texts
train_frame = posts_frame.join(pd.DataFrame(train_features, columns=cv.get_feature_names())) # transfering it to pandas

train_frame.drop(['likes_number','text'],inplace=True,axis=1,errors='ignore') # removing unnecessary columns
value_frame = posts_frame['likes_number']

In [109]:
train_frame.describe()

Unnamed: 0,attachment_type,long_text,old_post,pinned,post_hour,post_month,repost,signed,with_attachment,10,...,читать,что,чувство,школа,экзамен,эконом,экономика,экономист,экономическии,язык
count,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0,...,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0,2500.0
mean,1.2408,0.1412,0.9936,0.0,15.266,5.778,0.0284,0.0,0.2792,0.005379,...,0.003294,0.003716,0.003599,0.003476,0.006618,0.004686,0.005682,0.002732,0.001914,0.002888
std,2.123916,0.348297,0.07976,0.0,5.941495,3.350523,0.166146,0.0,0.448696,0.054891,...,0.036761,0.028349,0.043884,0.03231,0.060976,0.048383,0.041034,0.027555,0.02232,0.032767
min,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,0.0,12.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,1.0,0.0,16.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,0.0,1.0,0.0,20.0,9.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,5.0,1.0,1.0,0.0,23.0,12.0,1.0,0.0,1.0,1.0,...,1.0,0.565135,1.0,1.0,1.0,1.0,1.0,0.530192,0.507208,1.0


Saving train frame to file

In [110]:
# train_frame.to_csv('traindata.csv')

# 4. Comparing different methods

Splitting into train and test samples

In [111]:
from sklearn.cross_validation import train_test_split

# Splitting it into test and train samples
X_train, X_test, y_train, y_test = train_test_split(train_frame, value_frame, test_size=0.3, random_state=42)

Importing libs

In [112]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
import numpy as np


In [113]:
def compare(est, param, est_name):
    cv = GridSearchCV(est, param, n_jobs = -1)
    cv.fit(X_train, y_train);
    print('CV best score for', est_name, ': ', cv.best_score_)
    
    predicted = cv.predict(X_test)
    mse = mean_squared_error(y_test, predicted)
    print('MSE for', est_name, ':' , mse)
    r2 = r2_score(y_test, predicted)
    print('R^2 for', est_name, ':' , r2)    

## 4.1 Linear regression

### 4.1.1 Simple linear regression

In [114]:
parameters = {'fit_intercept':[True, False],'normalize':[True, False]}
compare(LinearRegression(), parameters, 'simple linear regression')

CV best score for simple linear regression :  0.00350830846095
MSE for simple linear regression : 2588.30613169
R^2 for simple linear regression : 0.0970898289022


### 4.1.2 Linear regression with L1 regularization

In [115]:
parameters = {'alpha':np.arange(1, 100, 5), 'positive':[True, False],'normalize':[True, False], 
              'selection':['cyclic', 'random']}
compare(Lasso(), parameters, 'linear regression with L1 regularization')

CV best score for linear regression with L1 regularization :  0.14050366586
MSE for linear regression with L1 regularization : 2434.19940939
R^2 for linear regression with L1 regularization : 0.150848743005


### 4.1.3 Linear regression with L2 regularization

In [None]:
parameters = {'alpha':np.logspace(1.0, 10.0, 101.00), 'fit_intercept':[True, False],'normalize':[True, False],
              'solver':['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'auto']}
compare(Ridge(), parameters, 'linear regression with L2 regularization')

## 4.2 Decision trees and random forests

### 4.2.1 DecisionTreeClassifier

In [None]:
parameters = {'presort':[True, False],'max_depth': np.arange(1, 20), 'class_weight':['balanced', None], 'splitter':['random', 'best'],
             'max_features':['auto', 'sqrt', 'log2', None]}
compare(tree.DecisionTreeClassifier(), parameters, 'decision tree classifier')

### 4.2.2 RandomForestClassifier

In [None]:
parameters = {'n_estimators':[10, 20, 30],
             'max_features':['auto', 'sqrt', 'log2', None]}
compare(RandomForestClassifier(), parameters, 'random forest classifier')

### 4.3 kNN

In [None]:
parameters = {'leaf_size':np.arange(30, 100, 10),'n_neighbors': np.arange(5, 20), 
              'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute']}
compare(KNeighborsClassifier(), parameters, 'kNN')