# BAG-OF-WORDS DATASET

### SPECIFICS

This script generates the BAG-OF-WORDS DATASET and creates two directories: DATASETS, which contains all datasets of this project, and RESULTS, which contains all model results (i.e., classification and prediction).

### USER OCCUPATIONAL CLASS THROUGH TWITTER CONTENT

The BAG-OF-WORDS DATASET is generated from the Twitter Occupation Dataset: `D. Preotiuc-Pietro, V. Lampos, N. Aletras, 2015. An analysis of the user occupational class through Twitter content, 10.3115/v1/P15-1169`.

Twitter Occupation Dataset consists of the following files:
* `jobs-tweetids`, each line represents a tweet
* `jobs-unigrams`, Bag-of-words unigram feature representation, one user/line
* `dictionary`, mapping between word ids and words
* `jobs-users`, resolved 3-digit SOC code (UK) for each user
* `keywords`, 3-digit SOC code (UK), its corresponding class description and the keyphrases for jobs in this category used for identifying users

### CONFIGURATION

#### Libraries and packages

In [1]:
import csv
import os
import shutil
import pandas as pd # if pandas is not installed, run from the command line: pip3 install pandas
import time


#### Project Directory

In [None]:
%cd "/home/eleonora/Desktop/TESI Code"
!ls 

### TWITTER OCCUPATION DATASET

* I am interested in identifying the profile of a Twitter user. A suitable choice for this purpose is each user's bag-of-words (BOW). BOWs are stored in `jobs-unigrams`, one user per line, in the format: <center>``` user_id[SPACE]wordid_1:frequency_1[SPACE]...wordid_n:frequency_n ``` </center>



In [None]:
with open("./TWITTER OCCUPATION DATASET/jobs/jobs-unigrams", "r") as f: # first line of 'jobs-unigrams'
  first = f.readline().rstrip('\n') # read only the first line; the file is closed when the 'with' block ends

print(first)


> *Example: in their Twitter posts, user 206749819 used wordid 5 with frequency 2 (5:2), wordid 54 with frequency 1 (54:1), wordid 55 with frequency 2 (55:2), and so on.*
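
The line format can be illustrated with a small parser. This is only a sketch: `parse_unigram_line` is a hypothetical helper that is not part of the notebook, and the input string is a shortened line assembled from the example above, not the real first line of `jobs-unigrams`.

```python
def parse_unigram_line(line):
    """Split one 'jobs-unigrams' line into (user_id, {word_id: frequency})."""
    items = line.split()          # items[0] is the user_id, the rest are wordid:frequency pairs
    user_id = items[0]
    bow = {}
    for item in items[1:]:
        word_id, freq = item.split(":")
        bow[int(word_id)] = int(freq)
    return user_id, bow


# Hypothetical shortened line built from the example above
user_id, bow = parse_unigram_line("206749819 5:2 54:1 55:2")
print(user_id, bow)
```

A line that contains only a `user_id` yields an empty dictionary, which matches the case of users without a BOW.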



* The mapping between word ids and words is stored in `dictionary`, one entry per line: <center>``` wordid[SPACE]word ``` </center>







In [None]:
with open("./TWITTER OCCUPATION DATASET/jobs/dictionary", "r") as f:  # first five lines of 'dictionary'
  for i, line in enumerate(f):
      if i == 5: break
      print(line.rstrip('\n')) # the file is closed automatically when the 'with' block ends



### BAG-OF-WORDS DATASET
The data fed to the **[87] and [2], [4]** models is built as follows:
* each word_id is replaced by the corresponding word in `dictionary`
* for each user in `jobs-unigrams`, each word_id must be expressed as defined in `GENDER Model` (i.e., $x_{word} = \large{\frac{freq(word, user)}{freq(\ast, user)}}$)
* The **BOWs DATASET** consists of the fields: key; user_id; ["word" : x_word : freq] $\forall$ word_id of the user in `jobs-unigrams`
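
As a quick check of the formula, the sketch below computes $x_{word}$ for a toy BOW. `relative_frequencies` is a hypothetical helper that is not part of the notebook, and the counts are illustrative values rather than a complete user record.

```python
def relative_frequencies(bow):
    """Return {word_id: freq(word, user) / freq(*, user)} for one user's BOW."""
    total = sum(bow.values())     # freq(*, user)
    if total == 0:                # user without words: nothing to normalise
        return {}
    return {word_id: freq / total for word_id, freq in bow.items()}


# Toy counts (assumed for illustration): total frequency is 2 + 1 + 2 = 5
x = relative_frequencies({5: 2, 54: 1, 55: 2})
print(x)
```

By construction the values of one user sum to 1, so users with very different tweet volumes become comparable.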

#### FUNCTIONS

In [None]:
# method responsible for creating a directory (used for DATASETS and RESULTS)

def create_dir(path_dir):
    try:
        if os.path.isdir(path_dir): # the directory already exists: remove it with its contents
            shutil.rmtree(path_dir)
        # (re)create the directory, whether or not it existed before
        os.makedirs(path_dir)
        print("Directory '%s' created successfully" % path_dir)

    except OSError as error:
        print("Directory '%s' can not be created: %s" % (path_dir, error))

In [None]:
# method that returns the content of 'dictionary' as a list of lines

def extract_dictionary():
    with open("./TWITTER OCCUPATION DATASET/jobs/dictionary", "r") as dictio:
        dictionaryLines = dictio.readlines()

    return dictionaryLines

In [None]:
# method that returns the content of 'jobs-unigrams' as a list of lines

def extract_unigrams():
    with open("./TWITTER OCCUPATION DATASET/jobs/jobs-unigrams", "r") as unigrams:
        unigramsLines = unigrams.readlines()

    return unigramsLines

In [None]:
# method that maps an id_word to its word (using the lines of 'dictionary')

def fromWord_idToWord(id_word, dictionaryLines):
    line = dictionaryLines[id_word-1] # id_word - 1 because word ids in 'dictionary' start at 1, while dictionaryLines is indexed from 0
    dic = line.split(" ")
    return dic[1].strip('\n')
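
The positional lookup above assumes that the line for word id *n* is the *n*-th line of `dictionary`. If that assumption should not be relied on, one possible alternative is a dictionary keyed by the id actually written on each line. The sketch below is not the notebook's method: `build_word_lookup` is a hypothetical helper, and the sample lines are invented for illustration.

```python
def build_word_lookup(dictionary_lines):
    """Build {word_id: word} from lines shaped like 'wordid word'."""
    lookup = {}
    for line in dictionary_lines:
        parts = line.split()
        if len(parts) < 2:        # skip blank or malformed lines
            continue
        lookup[int(parts[0])] = parts[1]
    return lookup


# Invented sample lines in the 'wordid[SPACE]word' format
sample_lines = ["1 hello\n", "2 world\n", "\n"]
lookup = build_word_lookup(sample_lines)
print(lookup[2])
```

The lookup is built once and then queried per word, so it stays correct even if ids are not contiguous or the file is reordered.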

In [None]:
# method that calculates freq(*, user), the total word frequency of a user

def freqTotUser(items, countAllWords):
    freq_tot = 0
    for it in range(1, countAllWords+1): # items[1],..,items[countAllWords]
        id_wordAndfreq = items[it].split(':')
        freq_tot += int(id_wordAndfreq[1])

    return freq_tot

In [None]:
# main method that builds the BAG-OF-WORDS dataframe, one row per user

def createBAGOFWORDS_DATASET():

  ## extract the dictionary to map each id_word to its word
  dictionaryLines  = extract_dictionary() 

  ## read jobs-unigrams of all users
  unigramsLines = extract_unigrams()

  df_BOWs = pd.DataFrame() # create dataframe
  ## for each line (one user per line), select user_id and compute freq(*, user):
    
  key = 0
  for N_user, line in enumerate(unigramsLines):
      start = time.perf_counter()
        
      items = line.split() # split on whitespace: items[0] = user_id, items[1:] = wordid:frequency pairs
      user_id = items[0]
      countAllWords = len(items) - 1 # skip items[0], which holds user_id; count only the words in the BOW
      freq_tot = freqTotUser(items, countAllWords) # freq(*, user)

      ## compute word and x_word per user_id:
      data_user = ""
      user_exist = False
      for it in range(1, countAllWords+1): # items[1],..,items[countAllWords]; if the user has no words this loop is skipped
          id_wordAndfreq = items[it].split(':')
          id_word = int(id_wordAndfreq[0])
          freq = int(id_wordAndfreq[1]) # freq(word, user)
          word =  str(fromWord_idToWord(id_word, dictionaryLines))
          x_word = float(freq / freq_tot) # freq(word, user)/freq(*, user)

          word_string = '"' + word + '"' # "word"   
          data_user += word_string + ":" + str(x_word) + ":" + str(freq) + " "
          user_exist = True

      ## fill user_id, "word", x_word, freq in df_BOWs dataframe
      if user_exist: # skip any user_id that has no BOW (e.g., 101637827)
          data_user = data_user[:-1] # remove last character (" ")
          print("yes", N_user)
          df = pd.DataFrame([[user_id, data_user]], columns = ['user_id', 'data'], index = [key])
          df_BOWs = pd.concat([df_BOWs, df])
          key += 1
        
      end = time.perf_counter()
      print(user_id)
      print(end - start)

  return df_BOWs
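
Calling `pd.concat` once per user copies the growing dataframe on every iteration, which can become slow with many users. A common alternative is to collect the rows in a plain list and build the dataframe once at the end. The sketch below shows only that pattern: `rows_to_dataframe` is a hypothetical helper, and the two rows are invented examples in the `"word":x_word:freq` layout.

```python
import pandas as pd


def rows_to_dataframe(rows):
    """Build the BOW dataframe in one step from (user_id, data) pairs."""
    # The default integer index 0..n-1 plays the role of `key` in the loop above
    return pd.DataFrame(rows, columns=["user_id", "data"])


rows = []  # appended to inside the per-user loop instead of calling pd.concat
rows.append(("111", '"hello":0.5:1 "world":0.5:1'))  # invented example row
rows.append(("222", '"hello":1.0:3'))                # invented example row
df_example = rows_to_dataframe(rows)
print(df_example)
```

Appending to a list is cheap, so the cost of building the dataframe is paid only once.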

In [None]:
# method that writes the dataframe to 'BAG-OF-WORDS DATASET.csv'

def outDATASET(path_dir, df_BOWs):
    df_BOWs.to_csv(path_dir + "BAG-OF-WORDS DATASET.csv")
    print("Successful creation of BAG-OF-WORDS DATASET.csv")

### MAIN

BOW DATASET generation

In [None]:
start = time.perf_counter()
path_dir_datasets = "./DATASETS/" # directory that contains all project datasets
path_dir_results = "./RESULTS/" # directory that contains all project models results
create_dir(path_dir_datasets)
create_dir(path_dir_results)
df_BOWs = createBAGOFWORDS_DATASET() # function that creates the dataset

startcsv = time.perf_counter()
outDATASET(path_dir_datasets, df_BOWs)       
endcsv = time.perf_counter()
end = time.perf_counter()
print("CSV time: ", endcsv - startcsv)
print("Execution time: ", end - start) 


### cost estimate: opening a file and each elementary computation count as O(1); |user_id| is the number of users; N_words is the BOW size of each user
### 1 for extract_dictionary (open) + 1 for extract_unigrams (open) + |user_id| x (N_words for freq_tot + N_words x (1 for id_word + 1 for freq + 1 for word + 1 for x_word) + 1 for dataframe) + 1 for outDATASET =
### = 2 + (5 x N_words + 1) x |user_id| + 1 = O(N_words x |user_id|)