# Data analysis project

__Author__: Dmitry Bulgakov (ИАД16)

__Deadline__: 23:59 10.04.16

## 1. Loading Data

Installing the `vk` library quietly and importing it into the notebook.

In [1]:
# !pip install vk -q # the -q flag makes the installation quiet
import vk

Starting a new VK session in order to fetch data.

In [2]:
vk_session = vk.Session() # starting new session
vk_api = vk.API(vk_session)

Getting the number of posts in the selected VK group.

In [3]:
selected_group = 'overhearhse' # no other ideas :c
posts_number = vk_api.wall.get(domain=selected_group)[0] # number of posts is stored in first element
print('Number of posts in selected group: ', posts_number - 1)

Number of posts in selected group:  13125


Writing a function to load more than 100 posts from the group, since a single `wall.get` call returns at most 100 posts.

In [4]:
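def load_all_posts(page, n_posts, api):
    all_posts = api.wall.get(domain=page, count=n_posts) # first element is the total number of posts
    n_loaded = len(all_posts) - 1 # do not count the leading counter element as a post
    while n_loaded < n_posts: # loop to load more than 100 posts
        s = api.wall.get(domain=page, offset=n_loaded, count=(n_posts - n_loaded)) # continue from the current offset
        if len(s) <= 1: # nothing new was returned, stop to avoid an infinite loop
            break
        all_posts += s[1:] # no need for first element
        n_loaded += len(s) - 1 # update n_loaded
    return all_posts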
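
The paging logic can be checked offline, without calling VK. The sketch below uses a hypothetical `FakeWall` stub that mimics the response shape used in this notebook (total count first, then the posts). The stub, the helper `load_posts_paged`, and the 100-post cap per call are assumptions made for illustration, not part of the `vk` library. Unlike `load_all_posts`, the helper returns the posts only, without the leading counter.

```python
class FakeWall:
    """Hypothetical stand-in for api.wall: returns [total, post, post, ...]."""

    PAGE_LIMIT = 100  # assumed per-call cap

    def __init__(self, total):
        self.posts = [{'id': i} for i in range(total)]

    def get(self, domain, offset=0, count=100):
        count = min(count, self.PAGE_LIMIT)
        return [len(self.posts)] + self.posts[offset:offset + count]


class FakeApi:
    """Hypothetical API object exposing only the wall stub."""

    def __init__(self, total):
        self.wall = FakeWall(total)


def load_posts_paged(page, n_posts, api):
    """Collect up to n_posts posts page by page, counting the offset in posts only."""
    posts = []
    while len(posts) < n_posts:
        response = api.wall.get(domain=page, offset=len(posts),
                                count=n_posts - len(posts))
        batch = response[1:]  # drop the leading total-count element
        if not batch:  # wall exhausted, stop instead of looping forever
            break
        posts.extend(batch)
    return posts


demo_posts = load_posts_paged('any_group', 250, FakeApi(total=250))
print(len(demo_posts))  # 250
print(len({p['id'] for p in demo_posts}))  # 250, no post skipped or repeated
```

Keeping the offset equal to the number of posts already collected is what prevents a post from being skipped at each page boundary.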

Loading all posts from the group for further analysis.

In [5]:
try:
    loaded_posts = load_all_posts(page=selected_group, n_posts=posts_number, api=vk_api)[1:] # no need for posts number element
    print('Number of loaded posts: ', len(loaded_posts))
except Exception: # timeout errors occur often
    print('Error occurred! Try again.')

Number of loaded posts:  5


## 2. Data preprocessing

Loading required libs to preprocess data.

In [6]:
# !pip install pymorphy2 -q # silent install again
# !pip install stop_words -q # needed to remove stop words
from stop_words import get_stop_words
import pymorphy2 # need this one to convert words to normal form
import datetime # needed to convert response date
import string # needed to work with strings
from nltk.tokenize import TweetTokenizer # needed to split text
import pandas as pd # required to work with dataframes
from ipywidgets import IntProgress # progressbar
from IPython.display import display # progressbar

Writing functions to process text data: splitting text into words, converting words to their normal form, and removing punctuation and stop words.

In [19]:
tokenizer = TweetTokenizer() # created once and reused for every post
morph = pymorphy2.MorphAnalyzer() # created once, building the analyzer is expensive
russian_stop_words = set(get_stop_words('russian')) # loaded once instead of once per word

def split_text(text):
    return tokenizer.tokenize(text) # splitting text into words

def convert_to_normal_form(words_list):
    normal_forms_list = []
    for word in words_list:
        # skipping punctuation marks and tokens that start with "<" (HTML tags such as <br>)
        if word not in string.punctuation and word[0] != "<":
            norm_form = morph.parse(word)[0].normal_form # getting normal form of a word
            normal_forms_list.append(norm_form) # adding it to list
    return normal_forms_list

def convert_text(text):
    words_list = split_text(text) # splitting text into words
    norm_words_list = convert_to_normal_form(words_list) # words into normal form
    filtered_words = [w for w in norm_words_list if w not in russian_stop_words] # removing stop words
    return " ".join(filtered_words) # joining words to a sentence again
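
One detail of the filter above: `word not in string.punctuation` is a substring test against the punctuation string, so multi-character tokens such as `...` or `!!` pass through it. A minimal sketch of a stricter check follows. The helper `is_punctuation` and the sample tokens are hypothetical and not part of the original pipeline.

```python
import string


def is_punctuation(token):
    """True if every character of the token is an ASCII punctuation mark."""
    return bool(token) and all(ch in string.punctuation for ch in token)


tokens = ['привет', '...', '!!', ',', 'мир']  # hypothetical tokenizer output

substring_filter = [t for t in tokens if t not in string.punctuation]
strict_filter = [t for t in tokens if not is_punctuation(t)]

print(substring_filter)  # ['привет', '...', '!!', 'мир']
print(strict_filter)     # ['привет', 'мир']
```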

Writing a function to convert the received list of posts into a list of dictionaries that keep only the features needed for analysis.

In [54]:
def convert_posts(posts_list):
    progress = IntProgress()
    progress.max = len(posts_list) # initializing progressbar
    progress.description = 'Processing data conversion'
    display(progress)

    updated_posts = [] # list of posts in the new structure
    for post in posts_list:
        post_time = datetime.datetime.fromtimestamp(post['date']) # parsing post date once (local time zone)
        tmp_dict = {} # creating empty dictionary for each post
        tmp_dict['likes_number'] = int(post['likes']['count']) # getting likes count
        tmp_dict['text'] = convert_text(post['text']) # converting text into normal form
        tmp_dict['text_length'] = len(post['text']) # calculating length of the raw text
        tmp_dict['post_hour'] = post_time.hour # keeping only post hour
        tmp_dict['post_month'] = post_time.month # and post month
        tmp_dict['signed'] = int(post['from_id'] != -57354358) # checking whether post is signed or not
        # checking if any attachment exists
        tmp_dict['with_attachment'] = 1 if 'attachment' in post else 0
        tmp_dict['pinned'] = 1 if 'is_pinned' in post else 0 # checking if post is pinned
        tmp_dict['repost'] = 1 if post['post_type'] == 'copy' else 0 # checking if repost
        updated_posts.append(tmp_dict)
        progress.value += 1 # increasing progressbar value
    progress.description = 'Conversion done!'
    return updated_posts
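
`datetime.datetime.fromtimestamp` without a `tz` argument uses the local time zone of the machine running the notebook, so `post_hour` can differ between machines. Below is a minimal sketch of pinning the zone explicitly, assuming Moscow time (UTC+3) is the zone of interest. The `MOSCOW` constant, the helper `hour_and_month`, and the sample timestamp are illustrative assumptions.

```python
import datetime

MOSCOW = datetime.timezone(datetime.timedelta(hours=3))  # assumed zone, fixed UTC+3


def hour_and_month(unix_ts, tz=MOSCOW):
    """Return (hour, month) of a Unix timestamp in the given time zone."""
    moment = datetime.datetime.fromtimestamp(unix_ts, tz=tz)
    return moment.hour, moment.month


# Sample timestamp built from a known UTC moment: 2016-04-10 20:59 UTC
sample_ts = datetime.datetime(2016, 4, 10, 20, 59,
                              tzinfo=datetime.timezone.utc).timestamp()

print(hour_and_month(sample_ts))  # (23, 4)
print(hour_and_month(sample_ts, tz=datetime.timezone.utc))  # (20, 4)
```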

Converting the list of posts into the new, more convenient structure.

In [55]:
converted_posts = convert_posts(loaded_posts)

## 3. Creating object-feature matrix

Loading pandas

In [56]:
import pandas as pd

Creating dataframe from parsed data

In [57]:
posts_frame = pd.DataFrame(converted_posts)
posts_frame.head()

|   | likes_number | pinned | post_hour | post_month | repost | signed | text | text_length | with_attachment |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 84 | 1 | 13 | 4 | 0 | 0 | | 0 | 1 |
| 1 | 0 | 0 | 14 | 4 | 0 | 0 | 6 собраться валить юрфак спбгу мол учиться хор... | 208 | 0 |
| 2 | 87 | 0 | 9 | 4 | 0 | 0 | | 0 | 1 |
| 3 | 5 | 0 | 9 | 4 | 0 | 0 | мгу приходить крутой актёр телеведущий собират... | 148 | 0 |
| 4 | 7 | 0 | 20 | 4 | 1 | 0 | восхищать удивительный недееспособность больши... | 410 | 0 |


Computing summary statistics for the posts data.

In [58]:
posts_frame.describe()

|   | likes_number | pinned | post_hour | post_month | repost | signed | text_length | with_attachment |
|---|---|---|---|---|---|---|---|---|
| count | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
| mean | 36.6 | 0.2 | 13.0 | 4.0 | 0.2 | 0.0 | 153.2 | 0.4 |
| std | 44.724714 | 0.447214 | 4.527693 | 0.0 | 0.447214 | 0.0 | 170.232782 | 0.547723 |
| min | 0.0 | 0.0 | 9.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 5.0 | 0.0 | 9.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 7.0 | 0.0 | 13.0 | 4.0 | 0.0 | 0.0 | 148.0 | 0.0 |
| 75% | 84.0 | 0.0 | 14.0 | 4.0 | 0.0 | 0.0 | 208.0 | 1.0 |
| max | 87.0 | 1.0 | 20.0 | 4.0 | 1.0 | 0.0 | 410.0 | 1.0 |


Creating the object-feature matrix by joining word counts from the post texts to the numeric features.

In [78]:
from sklearn.feature_extraction.text import CountVectorizer # loading count vectorizer
cv = CountVectorizer()
train_features = cv.fit_transform(posts_frame['text']).toarray() # vectorizing texts
# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use cv.get_feature_names()
train_frame = posts_frame.join(pd.DataFrame(train_features, columns=cv.get_feature_names_out())) # transferring it to pandas
train_frame.drop(['likes_number', 'text'], inplace=True, axis=1, errors='ignore') # removing unnecessary columns
train_frame

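|   | pinned | post_hour | post_month | repost | signed | text_length | with_attachment | автор | актёр | аудитория | ... | страна | телеведущий | телефон | удивительный | умалчиваться | учиться | факультет | хороший | элементарный | юрфак |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 13 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 14 | 4 | 0 | 0 | 208 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 2 |
| 2 | 0 | 9 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 9 | 4 | 0 | 0 | 148 | 0 | 0 | 1 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 20 | 4 | 1 | 0 | 410 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |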
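
Each word column in the matrix holds the number of times that word occurs in a post. The pure-Python sketch below reproduces the counting idea on two made-up texts. It splits on whitespace only and is an illustration, not scikit-learn's `CountVectorizer` implementation, which by default also lowercases the text and drops one-character tokens. `count_matrix` is a hypothetical helper.

```python
from collections import Counter


def count_matrix(texts):
    """Return (sorted vocabulary, one row of word counts per text)."""
    counts = [Counter(text.split()) for text in texts]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


texts = ['юрфак учиться юрфак', 'актёр учиться']  # made-up sample texts
vocabulary, rows = count_matrix(texts)

print(vocabulary)  # ['актёр', 'учиться', 'юрфак']
print(rows)        # [[0, 1, 2], [1, 1, 0]]
```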


Saving the train frame to a file.

In [81]:
train_frame.to_csv('traindata.csv')

To be continued...